the details

hi

this site visually explores the relationship between distributions

let's start with a function

the gamma function to be exact.

which, for positive whole numbers, looks like this equation.

Γ (n) = (n-1)!
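we can check this shifted-factorial identity numerically with python's built-in math module (a quick sketch, nothing site-specific):

```python
import math

# for positive whole numbers n, the gamma function is a shifted factorial:
# Γ(n) = (n - 1)!
vals = {n: math.gamma(n) for n in range(1, 8)}
```

for example, Γ(5) = 4! = 24.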

that function, along with euler's number, fractions, and some exponents, makes a distribution with a shape and a scale. we call the shape and scale the parameters of the distribution, and we call the distribution the gamma distribution.

when the shape=1 and scale=2
we get this graph.

the distribution changes as the parameters change

add one to the shape and scale and we get a very different graph.

the distribution flattens

as the shape and scale increase.

here the shape=3 and scale=2

the gamma distribution is a building block for others - let's take a look at the exponential distribution.

the exponential distribution

brings us back to the first graph.

an exponential distribution is a type of gamma distribution where the shape parameter is set to one.

the exponential distribution moves as the scale changes

however, when we think of the exponential distribution moving we talk about the rate changing.

this is because thinking of rate as the parameter of the exponential distribution matches the problems it solves - events happening at a certain rate over time.

the rate is equal to the inverse of the scale in the gamma distribution.

here is an exponential distribution with rate=1
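we can verify both facts - that an exponential is a gamma with shape=1, and that rate is the inverse of scale - by writing out the two densities by hand (a minimal sketch using only python's math module):

```python
import math

def gamma_pdf(x, shape, scale):
    # gamma density: x^(shape-1) * e^(-x/scale) / (Γ(shape) * scale^shape)
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def exponential_pdf(x, rate):
    # exponential density: rate * e^(-rate * x)
    return rate * math.exp(-rate * x)

# a gamma with shape=1 and scale=2 is an exponential with rate = 1/2
pairs = [(gamma_pdf(x, 1, 2), exponential_pdf(x, 0.5)) for x in (0.1, 1.0, 3.0, 7.0)]
```

the two densities agree at every point we check.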

getting back to gamma

while the exponential distribution lets us solve some problems, we can solve new ones if we add a bunch of exponential distributions together.

this graph is made by adding two exponential distributions together, both with rate=.5

since the shape of an exponential distribution is always one, the shape parameter of the gamma distribution that comes from adding exponentials will always be a positive whole number.

when that happens we call it an erlang distribution, which you can see with a shape=2 and scale=2
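we can simulate this with numpy: adding two rate=.5 exponentials sample-by-sample should give an erlang (gamma) with shape=2 and scale=2, whose mean is shape*scale = 4 and variance is shape*scale² = 8 (seed and sample size are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
scale = 2.0  # rate = 1/scale = 0.5

# adding two exponentials gives a gamma (erlang) with shape=2, same scale
erlang = rng.exponential(scale, n) + rng.exponential(scale, n)

# gamma(shape=2, scale=2) has mean shape*scale = 4 and variance shape*scale^2 = 8
sample_mean = erlang.mean()
sample_var = erlang.var()
```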

one more gamma

we can make one more distribution by picking certain values for our gamma distribution's parameters, the chi-squared distribution.

this one is seen in stats classes when doing the test for independence, which checks whether two categorical variables are related. it has one parameter, degrees of freedom. here the degrees of freedom=6

to make the chi-squared distribution from the gamma distribution, we set the scale to two and set the shape to half the degrees of freedom.
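a quick sampling sketch with numpy: chi-squared draws with df=6 and gamma draws with shape=6/2=3, scale=2 should both have mean df = 6 (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
df = 6

chi2 = rng.chisquare(df, n)
# the same distribution as a gamma with shape = df/2 and scale = 2
gam = rng.gamma(df / 2, 2.0, n)

# both should have mean df = 6 (and variance 2*df = 12)
means = (chi2.mean(), gam.mean())
```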

tying it together

the three distributions we made - the exponential, erlang, and chi-squared - are all specific parameterizations of the gamma distribution.

exponential predicts the time until something happens, erlang is used to predict how long you'll wait in a queue, and chi-squared looks at independence between groups.

next, we'll build distributions with multiple gamma functions.

the second distribution

the beta distribution is made using two independent gamma distributions.

beta(a, b) = gamma(a, θ) / ( gamma(a,θ) + gamma(b, θ) )

where a and b are the two parameters of the beta distribution called alpha and beta. here we set alpha=4, beta=6.

the beta distribution looks larger on the graph because it has a different support (x values). it is defined on [0,1], while the gamma distribution is defined on [0,∞), so we zoomed in a little.
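the gamma-to-beta construction above is easy to simulate: draw from two gammas with a shared scale (it cancels out), form x/(x+y), and check that the result lives on [0,1] with mean alpha/(alpha+beta) = 4/10 (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
alpha, beta, theta = 4.0, 6.0, 1.0  # the shared scale θ cancels out

x = rng.gamma(alpha, theta, n)
y = rng.gamma(beta, theta, n)
b = x / (x + y)  # beta(alpha, beta) samples, supported on [0, 1]

# beta(4, 6) has mean alpha / (alpha + beta) = 0.4
sample_mean = b.mean()
```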

functionally speaking

the beta distribution is named after the beta function. the beta function has two parameters and can be written in terms of the gamma function.

beta(x, y) = ( Γ (x) * Γ (y) ) / Γ (x + y)
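we can check that identity by computing the beta function two ways: from the gamma-function formula above, and from its integral definition ∫₀¹ t^(x-1) (1-t)^(y-1) dt (a plain-python sketch; the midpoint rule and step count are arbitrary choices):

```python
import math

def beta_via_gamma(x, y):
    # B(x, y) = Γ(x) Γ(y) / Γ(x + y)
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def beta_via_integral(x, y, steps=100_000):
    # B(x, y) = ∫₀¹ t^(x-1) (1-t)^(y-1) dt, midpoint rule
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (x - 1) * (1 - (i + 0.5) * h) ** (y - 1)
                   for i in range(steps))
```

for example, B(3, 2) = Γ(3)Γ(2)/Γ(5) = 2/24 = 1/12 either way.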

in this graph we changed the parameters of the distribution to alpha=3, beta=2.

starting from scratch

if we set the two parameters of the beta distribution to alpha=1, beta=1 we get a horizontal line, which is called the uniform distribution.

we can interpret a uniform distribution as every outcome being equally likely. because of this, we use the uniform parametrization of the beta distribution as a common starting point or conjugate prior in bayesian statistics.

another starting point

the arcsin distribution is a common prior distribution in bayesian statistics as well. it is just another specific parametrization of the beta distribution with alpha=.5, beta=.5

almost beta

the kumaraswamy distribution is very similar to the beta distribution: it also has two shape parameters, but unlike the beta distribution, its probability density function isn't built from the beta function.

also, another way to get the uniform distribution is as a kumaraswamy distribution with alpha=1, beta=1. here we set alpha=5, beta=1

going all over laplace

if we divide two standard uniform distributions, both defined on (0,1), and then take the log, we get the laplace distribution.

laplace(0,1) = log( uniform(0,1) / uniform(0,1) )

the laplace distribution has two parameters, here they are location=10, scale=1. if the location=0 then we would only see the right side of the peak.
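the log-of-a-ratio construction is simple to simulate; log(u1/u2) for independent standard uniforms gives a standard laplace(0,1), which has mean 0 and variance 2 (this sketch checks that case, not the location=10 graph; seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

u1 = rng.uniform(size=n)
u2 = rng.uniform(size=n)
lap = np.log(u1 / u2)  # standard laplace(0, 1)

# laplace(0, 1) has mean 0 and variance 2
sample_mean = lap.mean()
sample_var = lap.var()
```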

testing it out

using two laplace distributions, we can build another distribution by dividing them. this makes a common distribution in statistical testing - the f distribution.

the f distribution is known as a ratio distribution - another common way to make it is by taking the ratio of two chi-squared distributions (which are really just gamma distributions). it has two parameters, which are both degrees of freedom; here both are degrees of freedom=2
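here is the chi-squared ratio route sketched in numpy. note an assumption: we use d1=2, d2=10 instead of the graph's 2 and 2, because an f distribution only has a finite mean when the second degrees of freedom is above 2, and the mean (d2/(d2-2)) is what we check:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
d1, d2 = 2, 10  # d2 > 2 so the mean exists (unlike the df=2, df=2 graph)

c1 = rng.chisquare(d1, n)
c2 = rng.chisquare(d2, n)
f = (c1 / d1) / (c2 / d2)  # f(d1, d2) samples

# f(d1, d2) has mean d2 / (d2 - 2) = 1.25 when d2 = 10
sample_mean = f.mean()
```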

while the chi-squared distribution is used in the test for independence, with the f distribution we can do analysis of variance (called ANOVA for short) and the f-test for finding the best model in a linear regression.

test number 2

taking the log of the f distribution and dividing it by 2 makes the fisher's z distribution. the fisher's z distribution also has two parameters like the f distribution, degrees of freedom, known as degree 1 and degree 2.

fisher's z (degree 1, degree 2) = log ( f (degree 1, degree 2) ) / 2

here degree 1=5, degree 2=2. an important note - this distribution is not used in a z-test, for those we use a normal distribution, which is just around the corner.

raylly?

the rayleigh distribution has one parameter, the scale; here the scale=1. when the scale=1 it can be thought of as a case of the chi distribution where the degrees of freedom=2

when looking at wind speed, the magnitude of the speed tends to follow a rayleigh distribution, and generally this distribution is helpful when looking at the magnitude of a vector's components.

the laplace, f, fisher's z, and rayleigh distributions can all be made by subtracting, dividing, or taking a square root of a specific parameterization of the gamma distribution.

generally speaking

the laplace distribution we made earlier is a case of the generalized normal distribution, which has three parameters: location, scale, and shape

here the location=2.5, scale=1, and shape=4

the laplace distribution we made earlier is a parameterization of the generalized normal distribution with shape=1. the normal distribution is another parameterization, with shape=2, but before we get there let's talk transformations and sampling.

uniform transformed

if we take a fourier transform of the laplace distribution we get a cauchy(0,1). we can also make the cauchy distribution by transforming a uniform distribution.

cauchy(0,1) = tan( π (uniform - .5) )

the parameters of the cauchy distribution are location and scale; here the location=0, scale=2. the tails of the cauchy distribution are heavier than the normal distribution's.
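the tangent transform above is easy to simulate. because the cauchy distribution has no mean, the check below uses quantiles instead: a standard cauchy has median 0 and upper quartile +1 (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

u = rng.uniform(size=n)
c = np.tan(np.pi * (u - 0.5))  # standard cauchy(0, 1)

# no mean exists, so check the median (0) and the upper quartile (+1)
med = np.median(c)
q75 = np.quantile(c, 0.75)
```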

taking a sample

the cauchy(0,1) gets us to another popular distribution in statistical testing, the t-distribution, with one parameter, degrees of freedom - a t-distribution with degrees of freedom=1 is exactly the cauchy(0,1).

the t-distribution comes from sampling a normal distribution, so as the degrees of freedom increase the curve becomes taller, the tails become smaller, and we approach the normal distribution.

finally normal

the t-distribution is sampled from the normal distribution, which is a parameterization of the generalized normal distribution with shape=2, so we just have two parameters, location and scale.

by changing the location parameter (moving the peak of the graph left or right) we can make more parametrizations of the normal distribution, with one special one to remember - the standard normal distribution, which is when location=0 and scale=1

only positive

the chi distribution is the absolute value of the standard normal distribution; it's also the square root of the chi-squared distribution. like the chi-squared distribution, the chi distribution has one parameter, degrees of freedom; here degrees of freedom=2

while the chi-squared distribution is a parametrization of the gamma distribution, it is also the sum of squared normal distributions. the chi distribution is the square root of that sum.
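the whole chain - squared normals summing to chi-squared, square root giving chi - can be simulated directly. a chi with df=2 is also a rayleigh(1), so its mean should be sqrt(π/2) ≈ 1.2533 (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

z = rng.normal(size=(n, 2))
chi2 = (z ** 2).sum(axis=1)   # chi-squared, df=2 (sum of squared normals)
chi = np.sqrt(chi2)           # chi, df=2 (its square root)

# chi with df=2 is also rayleigh(1); its mean is sqrt(pi/2) ≈ 1.2533
sample_mean = chi.mean()
```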

chi-scaled

the nakagami distribution is a scaled version of a specific parameterization of the chi distribution. it has two parameters, shape and spread; here shape=1, spread=3

the nakagami distribution is also the square root of a gamma distribution where the scale of the gamma equals the spread divided by the shape of the nakagami distribution.

particles-of-chi

an application of the chi distribution is the maxwell-boltzmann distribution, which models the distribution of the speed of particles in a gas at a given temperature.

the maxwell-boltzmann distribution is a transformation of a chi distribution with degrees of freedom=3. it has one parameter scale, which can be applied in modeling the temperature of a gas. as the scale increases the gas is hotter and the curve flattens, here the scale=2, so not that hot.

inverted-chi

to get another distribution in the chi-squared family, we are just going to invert it. this makes the inverse-chi-squared distribution; like the chi-squared it has one parameter, degrees of freedom. right now the degrees of freedom=4.

this distribution is common in bayesian inference, when we don't know the variance of a normal distribution.

back to gamma

just as the chi-squared distribution is a parameterization of the gamma, the inverse-chi-squared is a parameterization of the inverse-gamma distribution.

here we have an inverse-gamma distribution with shape=2 and the scale=1.

another gamma

the lévy distribution has two parameters location and scale. if the location=0 it is also a parameterization of the inverse-gamma distribution with the shape=.5, this graph is with parameters location=1, scale=1.

the lévy distribution has been shown to closely model the frequency of geomagnetic reversals - when magnetic north and south swap places.

exponentially important

the 80-20 rule comes from the pareto distribution, which we get by raising e to an exponential distribution.

pareto(scale, shape) = scale * e^( exponential(shape) )

the pareto distribution has two parameters - scale and shape. the scale is also known as the minimum x value; here the scale=1, so the x-value on the left side of the graph is 1. the shape is known as the pareto index; here shape=3, which is the starting value on the y-axis.

the pareto distribution is also a continuous version of the zeta distribution used in zipf's law, which is worth a google.

not as important...

the lomax distribution is a shifted version of the pareto distribution and not as famous. like the pareto distribution it is used in economics as well as modeling internet traffic. it has two parameters - scale=2 and shape=1 here.

the lomax distribution is also equal to the f distribution in a specific case.

part of the family

the burr distribution is another distribution used in economics, and it generalizes the lomax distribution with two parameters, c and k.

when c=1 the burr distribution is the lomax distribution. here c=1 and k=2, so this graph is very similar to the one we made above.

logistically

the pareto family of distributions is one whose log is an exponential distribution; taking the log of the ratio of two exponentials instead creates the logistic distribution.

the logistic distribution has two parameters, location and scale; we set location=2 and scale=1 for this graph.
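the log-of-a-ratio-of-exponentials construction can be simulated: log(e1/e2) for two independent rate=1 exponentials gives a standard logistic(0,1), which has mean 0 and variance π²/3 ≈ 3.29 (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

e1 = rng.exponential(size=n)
e2 = rng.exponential(size=n)
lg = np.log(e1 / e2)  # standard logistic(0, 1)

# logistic(0, 1) has mean 0 and variance pi^2 / 3 ≈ 3.29
sample_mean = lg.mean()
sample_var = lg.var()
```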

the logistic function that is used to make this distribution shows up everywhere in machine learning. here is what the standard logistic function looks like

logistic(x) = 1 / ( 1 + e^(-x) )
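in code the standard logistic (sigmoid) function is a one-liner; it maps any real number into (0,1), with logistic(0) = 0.5 and the symmetry logistic(x) + logistic(-x) = 1:

```python
import math

def logistic(x):
    # the standard logistic (sigmoid) function: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))
```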

logit again

if we apply the logistic function we made above to the normal distribution we can make the logit-normal distribution, with parameters location=1 and scale=1.

the logit-normal distribution can be used in place of the dirichlet distribution, because it's a little simpler to make.

prime ratios

we made the logistic distribution by taking the ratio of exponential distributions. if we take the ratio of two gamma distributions with the scale=1, we get the beta prime distribution.

the two shape parameters from the gamma distributions are equal to the two parameters of the beta prime distribution, here they are shape=2 and shape=3.

special primes

a special case of the beta prime distribution is the dagum distribution, which has three parameters, two for shape and one for scale.

here shape=1, shape=3, and scale=1. the dagum distribution is used in economics and is also related to the gini index.

logs of logs

by setting the first shape parameter of the dagum distribution to 1, we can make the log-logistic distribution, with two parameters, scale and shape

this is the first of three log distributions we will talk about. we can make these by taking a distribution we know and applying the exponential function. so to make the log-logistic distribution we exponentiate the logistic distribution.

log-logistic(scale, shape) = e^( logistic(location, scale) )

pretty normal

the log-normal distribution has two parameters, location and scale. it's used everywhere in engineering and economics.

to get back to the normal distribution from the log-normal distribution we take the natural log of the log-normal distribution, basically undoing the logs.

normal(location, scale) = ln( log-normal(location, scale) )
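the round trip is easy to simulate with numpy: draw log-normal samples, take the natural log, and the result should be normal with the original location and scale (parameter values, seed, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000

loc, scale = 1.0, 0.5
ln = rng.lognormal(mean=loc, sigma=scale, size=n)
back = np.log(ln)  # undoing the log gets us back to normal(loc, scale)

sample_mean = back.mean()
sample_std = back.std()
```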

logs of them

taking the exponential function of the cauchy distribution gets us to the last distribution we will talk about with a log in the name, the log-cauchy distribution. like the cauchy distribution it has two parameters - location and scale.

the log-cauchy distribution is a heavy-tailed distribution, mainly because its tails decay logarithmically.

getting extreme

the next three distributions we are going to make by taking the negative natural log of a distribution we have already seen. if we take the negative log of the exponential distribution with the rate=1 we get the gumbel distribution.

gumbel(location, 1) = - log( exponential(1) )

the gumbel distribution has two parameters, the location and scale, here the location=1 and scale=2.
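the negative-log construction above is a one-liner in numpy; -log of a rate=1 exponential gives a standard gumbel(0,1), whose mean is the euler-mascheroni constant ≈ 0.5772 (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000

g = -np.log(rng.exponential(size=n))  # standard gumbel(0, 1)

# the standard gumbel mean is the euler-mascheroni constant ≈ 0.5772
sample_mean = g.mean()
```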

from the uniform

the fréchet distribution is built from taking the negative log of the uniform distribution and has three parameters: shape, scale, and location of minimum.

fréchet(shape, scale, location) = location + scale * ( -log( uniform(0,1) ) )^(-1/shape)

the formula looks more complicated than the gumbel's, but we normally set the scale=1 and location of minimum=0, so the formula ends up being...

fréchet(shape, 1, 0) = ( -log( uniform(0,1) ) )^(-1/shape)

or just the negative log of the uniform distribution raised to the negative inverse of the shape parameter; here shape=2, scale=1, and location of minimum=0

the last log

the last distribution we will build uses the negative natural log and is called the weibull distribution. it is very similar to the fréchet distribution, but only has two parameters, scale and shape.

weibull(scale, shape) = scale * ( -ln( uniform(0,1) ) )^(1/shape)

here the scale=1 and shape=1. the gumbel, fréchet, and weibull distributions are all extreme value distributions used in predicting when extreme weather events (like floods) will occur.
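the inverse-transform formula above is simple to simulate. one assumption in this sketch: we use shape=2 rather than the graph's shape=1 (which would just reproduce an exponential), so we can check the weibull mean scale*Γ(1 + 1/shape) ≈ 0.8862:

```python
import math
import numpy as np

rng = np.random.default_rng(10)
n = 200_000
scale, shape = 1.0, 2.0  # shape=2 so the check isn't just an exponential

w = scale * (-np.log(rng.uniform(size=n))) ** (1.0 / shape)

# weibull(scale=1, shape=k) has mean Γ(1 + 1/k); for k=2 that's ≈ 0.8862
sample_mean = w.mean()
```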

...

more to come