Exponential family

Summary

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience (its algebraic properties allow expectations and covariances to be calculated by differentiation) as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family",[1] as is the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinctive because its members all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

The concept of exponential families is credited to[2] E. J. G. Pitman,[3] G. Darmois,[4] and B. O. Koopman[5] in 1935–1936. Exponential families of distributions provide a general framework for selecting a possible alternative parameterisation of a parametric family of distributions, in terms of natural parameters, and for defining useful sample statistics, called the natural sufficient statistics of the family.

Nomenclature difficulty

The terms "distribution" and "family" are often used loosely: Specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter;[a] however, a parametric family of distributions is often referred to as "a distribution" (like "the normal distribution", meaning "the family of normal distributions"), and the set of all exponential families is sometimes loosely referred to as "the" exponential family.

Definition

Most of the commonly used distributions form an exponential family or subset of an exponential family, listed in the subsection below. The subsections following it are a sequence of increasingly more general mathematical definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.

Examples of exponential family distributions

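Exponential families include many of the most common distributions. Among many others, exponential families include the following:[6]

  • normal
  • exponential
  • log-normal
  • gamma
  • chi-squared
  • beta
  • Dirichlet
  • Bernoulli
  • categorical
  • Poisson
  • geometric
  • inverse Gaussian
  • von Mises
  • von Mises-Fisher

A number of common distributions are exponential families, but only when certain parameters are fixed and known. For example:

  • binomial (with fixed number of trials)
  • multinomial (with fixed number of trials)
  • negative binomial (with fixed number of failures)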

Note that in each case, the parameters which must be fixed are those that set a limit on the range of values that can possibly be observed.

Examples of common distributions that are not exponential families are Student's t, most mixture distributions, and even the family of uniform distributions when the bounds are not fixed. See the section below on examples for more discussion.

Scalar parameter

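The value of $\theta$ is called the parameter of the family.

A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form

$$f_X(x \mid \theta) = h(x)\,\exp\bigl[\eta(\theta)\cdot T(x) - A(\theta)\bigr]$$

where $T(x)$, $h(x)$, $\eta(\theta)$, and $A(\theta)$ are known functions. The function $h(x)$ must be non-negative.

An alternative, equivalent form often given is

$$f_X(x \mid \theta) = h(x)\,g(\theta)\,\exp\bigl[\eta(\theta)\cdot T(x)\bigr]$$

or equivalently

$$f_X(x \mid \theta) = \exp\bigl[\eta(\theta)\cdot T(x) - A(\theta) + B(x)\bigr]$$

In terms of log probability,

$$\log f_X(x \mid \theta) = \eta(\theta)\cdot T(x) - A(\theta) + B(x)$$

Note that $g(\theta) = e^{-A(\theta)}$ and $h(x) = e^{B(x)}$.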

Support must be independent of θ

Importantly, the support of $f_X(x \mid \theta)$ (all the possible $x$ values for which $f_X(x \mid \theta)$ is greater than $0$) is required to not depend on $\theta$.[7] This requirement can be used to exclude a parametric family of distributions from being an exponential family.

For example: The Pareto distribution has a pdf which is defined for $x \geq x_m$ (the minimum value, $x_m$, being the scale parameter) and its support, therefore, has a lower limit of $x_m$. Since the support of $f_{\alpha, x_m}(x)$ is dependent on the value of the parameter, the family of Pareto distributions does not form an exponential family of distributions (at least when $x_m$ is unknown).

Another example: Bernoulli-type distributions – binomial, negative binomial, geometric distribution, and similar – can only be included in the exponential class if the number of Bernoulli trials, $n$, is treated as a fixed constant – excluded from the free parameter(s) $\theta$ – since the allowed number of trials sets the limits for the number of "successes" or "failures" that can be observed in a set of trials.

Vector valued x and θ

Often $x$ is a vector of measurements, in which case $T(x)$ may be a function from the space of possible values of $x$ to the real numbers.

More generally, $\eta(\theta)$ and $T(x)$ can each be vector-valued such that $\eta(\theta)\cdot T(x)$ is real-valued. However, see the discussion below on vector parameters, regarding the curved exponential family.

Canonical formulation

If $\eta(\theta) = \theta$, then the exponential family is said to be in canonical form. By defining a transformed parameter $\eta = \eta(\theta)$, it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since $\eta(\theta)$ can be multiplied by any nonzero constant, provided that $T(x)$ is multiplied by that constant's reciprocal, or a constant $c$ can be added to $\eta(\theta)$ and $h(x)$ multiplied by $\exp[-c\cdot T(x)]$ to offset it. In the special case that $\eta(\theta) = \theta$ and $T(x) = x$, the family is called a natural exponential family.

Even when $x$ is a scalar, and there is only a single parameter, the functions $\eta(\theta)$ and $T(x)$ can still be vectors, as described below.

The function $A(\theta)$, or equivalently $g(\theta)$, is automatically determined once the other functions have been chosen, since it must assume a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of $\eta$, even when $\eta(\theta)$ is not a one-to-one function, i.e. two or more different values of $\theta$ map to the same value of $\eta(\theta)$, and hence $\eta(\theta)$ cannot be inverted. In such a case, all values of $\theta$ mapping to the same $\eta(\theta)$ will also have the same value for $A(\theta)$ and $g(\theta)$.

Factorization of the variables involved

What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:

$$f(x),\quad g(\theta),\quad c^{f(x)},\quad c^{g(\theta)},\quad [f(x)]^{c},\quad [g(\theta)]^{c},\quad [f(x)]^{g(\theta)},\quad [g(\theta)]^{f(x)},\quad [f(x)]^{h(x)g(\theta)},\quad\text{or}\quad [g(\theta)]^{h(x)j(\theta)},$$

where $f$ and $h$ are arbitrary functions of $x$, the observed statistical variable; $g$ and $j$ are arbitrary functions of $\theta$, the fixed parameters defining the shape of the distribution; and $c$ is any arbitrary constant expression (i.e. a number or an expression that does not change with either $x$ or $\theta$).

There are further restrictions on how many such factors can occur. For example, the two expressions:

$$[f(x)\,g(\theta)]^{h(x)j(\theta)} \qquad\text{and}\qquad [f(x)]^{h(x)j(\theta)}\,[g(\theta)]^{h(x)j(\theta)}$$

are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,

$$[f(x)\,g(\theta)]^{h(x)j(\theta)} = [f(x)]^{h(x)j(\theta)}\,[g(\theta)]^{h(x)j(\theta)} = \exp\bigl\{[h(x)\log f(x)]\,j(\theta) + h(x)\,[j(\theta)\log g(\theta)]\bigr\},$$

it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.[citation needed])

To see why an expression of the form

$$[f(x)]^{g(\theta)}$$

qualifies,

$$[f(x)]^{g(\theta)} = e^{g(\theta)\log f(x)}$$

and hence factorizes inside of the exponent. Similarly,

$$[f(x)]^{h(x)g(\theta)} = e^{h(x)\,g(\theta)\log f(x)} = e^{[h(x)\log f(x)]\,g(\theta)}$$

and again factorizes inside of the exponent.

A factor consisting of a sum where both types of variables are involved (e.g. a factor of the form $1 + f(x)g(\theta)$) cannot be factorized in this fashion (except in some cases where occurring directly in an exponent); this is why, for example, the Cauchy distribution and Student's t distribution are not exponential families.

Vector parameter

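The definition in terms of one real-number parameter can be extended to one real-vector parameter

$$\boldsymbol\theta \equiv [\theta_1, \theta_2, \ldots, \theta_d]^{\mathsf T}.$$

A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as

$$f_X(x \mid \boldsymbol\theta) = h(x)\,\exp\left(\sum_{i=1}^{s} \eta_i(\boldsymbol\theta)\,T_i(x) - A(\boldsymbol\theta)\right),$$

or in a more compact form,

$$f_X(x \mid \boldsymbol\theta) = h(x)\,\exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(x) - A(\boldsymbol\theta)\bigr).$$

This form writes the sum as a dot product of vector-valued functions $\boldsymbol\eta(\boldsymbol\theta)$ and $\mathbf T(x)$.

An alternative, equivalent form often seen is

$$f_X(x \mid \boldsymbol\theta) = h(x)\,g(\boldsymbol\theta)\,\exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(x)\bigr).$$

As in the scalar valued case, the exponential family is said to be in canonical form if

$$\eta_i(\boldsymbol\theta) = \theta_i \quad\text{for all } i.$$

A vector exponential family is said to be curved if the dimension of

$$\boldsymbol\theta \equiv [\theta_1, \theta_2, \ldots, \theta_d]^{\mathsf T}$$

is less than the dimension of the vector

$$\boldsymbol\eta(\boldsymbol\theta) \equiv [\eta_1(\boldsymbol\theta), \eta_2(\boldsymbol\theta), \ldots, \eta_s(\boldsymbol\theta)]^{\mathsf T}.$$

That is, if the dimension, d, of the parameter vector is less than the number of functions, s, of the parameter vector in the above representation of the probability density function. Most common distributions in the exponential family are not curved, and many algorithms designed to work with any exponential family implicitly or explicitly assume that the distribution is not curved.

Just as in the case of a scalar-valued parameter, the function $A(\boldsymbol\theta)$, or equivalently $g(\boldsymbol\theta)$, is automatically determined by the normalization constraint, once the other functions have been chosen. Even if $\boldsymbol\eta(\boldsymbol\theta)$ is not one-to-one, functions $A(\boldsymbol\eta)$ and $g(\boldsymbol\eta)$ can be defined by requiring that the distribution is normalized for each value of the natural parameter $\boldsymbol\eta$. This yields the canonical form

$$f_X(x \mid \boldsymbol\eta) = h(x)\,\exp\bigl(\boldsymbol\eta\cdot\mathbf T(x) - A(\boldsymbol\eta)\bigr),$$

or equivalently

$$f_X(x \mid \boldsymbol\eta) = h(x)\,g(\boldsymbol\eta)\,\exp\bigl(\boldsymbol\eta\cdot\mathbf T(x)\bigr).$$

The above forms may sometimes be seen with $\boldsymbol\eta^{\mathsf T}\mathbf T(x)$ in place of $\boldsymbol\eta\cdot\mathbf T(x)$. These are exactly equivalent formulations, merely using different notation for the dot product.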

Vector parameter, vector variable

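The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar x replaced by the vector

$$\mathbf x = (x_1, x_2, \ldots, x_k)^{\mathsf T}.$$

The dimension k of the random variable need not match the dimension d of the parameter vector, nor (in the case of a curved exponential family) the dimension s of the natural parameter $\boldsymbol\eta$ and sufficient statistic $\mathbf T(\mathbf x)$.

The distribution in this case is written as

$$f_X(\mathbf x \mid \boldsymbol\theta) = h(\mathbf x)\,\exp\left(\sum_{i=1}^{s} \eta_i(\boldsymbol\theta)\,T_i(\mathbf x) - A(\boldsymbol\theta)\right)$$

Or more compactly as

$$f_X(\mathbf x \mid \boldsymbol\theta) = h(\mathbf x)\,\exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(\mathbf x) - A(\boldsymbol\theta)\bigr)$$

Or alternatively as

$$f_X(\mathbf x \mid \boldsymbol\theta) = g(\boldsymbol\theta)\,h(\mathbf x)\,\exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(\mathbf x)\bigr)$$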

Measure-theoretic formulation

We use cumulative distribution functions (CDF) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to $\mathrm dH(\mathbf x)$ are integrals with respect to the reference measure of the exponential family generated by H.

Any member of that exponential family has cumulative distribution function

$$\mathrm dF(\mathbf x \mid \boldsymbol\theta) = \exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(\mathbf x) - A(\boldsymbol\theta)\bigr)\,\mathrm dH(\mathbf x).$$

H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density $f(x)$ with respect to a reference measure $\mathrm dx$ (typically Lebesgue measure), one can write $\mathrm dF(x) = f(x)\,\mathrm dx$. In this case, H is also absolutely continuous and can be written $\mathrm dH(x) = h(x)\,\mathrm dx$, so the formulas reduce to those of the previous paragraphs. If F is discrete, then H is a step function (with steps on the support of F).

Alternatively, we can write the probability measure directly as

$$P(\mathrm d\mathbf x \mid \boldsymbol\theta) = \exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(\mathbf x) - A(\boldsymbol\theta)\bigr)\,\mu(\mathrm d\mathbf x)$$

for some reference measure $\mu$.

Interpretation

In the definitions above, the functions T(x), η(θ), and A(η) were arbitrary. However, these functions have important interpretations in the resulting probability distribution.

  • T(x) is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that holds all information the data x provides with regard to the unknown parameter values. This means that, for any data sets $x$ and $y$, the likelihood ratio is the same, that is $\frac{f(x;\theta_1)}{f(x;\theta_2)} = \frac{f(y;\theta_1)}{f(y;\theta_2)}$, if T(x) = T(y). This is true even if x and y are not equal to each other. The dimension of T(x) equals the number of parameters of θ and encompasses all of the information regarding the data related to the parameter θ. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters). (This important property is discussed further below.)
  • η is called the natural parameter. The set of values of η for which the function $f_X(x;\eta)$ is integrable is called the natural parameter space. It can be shown that the natural parameter space is always convex.
  • A(η) is called the log-partition function[b] because it is the logarithm of a normalization factor, without which $f_X(x;\theta)$ would not be a probability distribution:

$$A(\eta) = \log\left(\int_X h(x)\,\exp\bigl(\eta(\theta)\cdot T(x)\bigr)\,\mathrm dx\right)$$

The function A is important in its own right, because the mean, variance and other moments of the sufficient statistic T(x) can be derived simply by differentiating A(η). For example, because log(x) is one of the components of the sufficient statistic of the gamma distribution, $\operatorname E[\log x]$ can be easily determined for this distribution using A(η). Technically, this is true because

$$K(u \mid \eta) = A(\eta + u) - A(\eta)$$

is the cumulant generating function of the sufficient statistic.
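
The sufficiency property in the first bullet can be checked numerically. The following is a minimal sketch, assuming a Poisson model (for which T(x) = x, so an i.i.d. sample is summarized by its sum); the function names and the two toy data sets are illustrative choices, not part of the definitions above.

```python
import math

def poisson_loglik(data, lam):
    """Log-likelihood of i.i.d. Poisson(lam) observations."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

def log_lik_ratio(data, lam1, lam2):
    """Log of the likelihood ratio f(data; lam1) / f(data; lam2)."""
    return poisson_loglik(data, lam1) - poisson_loglik(data, lam2)

# Two different samples of the same size with the same sufficient statistic (sum = 7).
x = [1, 4, 2]
y = [0, 0, 7]

ratio_x = log_lik_ratio(x, 2.0, 3.5)
ratio_y = log_lik_ratio(y, 2.0, 3.5)
print(ratio_x, ratio_y)  # the two log-ratios agree up to rounding
```

The individual likelihoods of the two samples differ (through the base-measure term 1/x!), but that term cancels in the ratio, leaving only a function of the sum.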

Properties

Exponential families have a large number of properties that make them extremely useful for statistical analysis. In many cases, it can be shown that only exponential families have these properties. Examples:

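Given an exponential family defined by $f_X(x \mid \theta) = h(x)\,\exp\bigl[\theta\cdot T(x) - A(\theta)\bigr]$, where $\Theta$ is the parameter space, such that $\theta \in \Theta \subset \mathbb R^k$. Then

  • If $\Theta$ has nonempty interior in $\mathbb R^k$, then given any IID samples $X_1, \ldots, X_n \sim f_X$, the statistic $T(X_1, \ldots, X_n) := \sum_{i=1}^{n} T(X_i)$ is a complete statistic for $\theta$.[9][10]
  • $T$ is a minimal statistic for $\theta$ iff for all $\theta_1, \theta_2 \in \Theta$, and $x_1, x_2$ in the support of $X$, if $(\theta_1 - \theta_2)\cdot\bigl(T(x_1) - T(x_2)\bigr) = 0$, then $\theta_1 = \theta_2$ or $x_1 = x_2$.[11]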

Examples

It is critical, when considering the examples in this section, to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary is critical in determining whether a "distribution" is or is not an exponential family.

The normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, ALAAM, von Mises, and von Mises-Fisher distributions are all exponential families.

Some distributions are exponential families only if some of their parameters are held fixed. The family of Pareto distributions with a fixed minimum bound $x_m$ forms an exponential family. The families of binomial and multinomial distributions with fixed number of trials n but unknown probability parameter(s) are exponential families. The family of negative binomial distributions with fixed number of failures (a.k.a. stopping-time parameter) r is an exponential family. However, when any of the above-mentioned fixed parameters is allowed to vary, the resulting family is not an exponential family.

As mentioned above, as a general rule, the support of an exponential family must remain the same across all parameter settings in the family. This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families: in all of these cases, the parameter in question affects the support (in particular, by changing the minimum or maximum possible value). For similar reasons, neither the discrete uniform distribution nor the continuous uniform distribution is an exponential family when one or both bounds vary.

The Weibull distribution with fixed shape parameter k is an exponential family. Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's probability density function (k appears in the exponent of an exponent).

In general, distributions that result from a finite or infinite mixture of other distributions, e.g. mixture model densities and compound probability distributions, are not exponential families. Examples are typical Gaussian mixture models as well as many heavy-tailed distributions that result from compounding (i.e. infinitely mixing) a distribution with a prior distribution over one of its parameters, e.g. the Student's t-distribution (compounding a normal distribution over a gamma-distributed precision prior), and the beta-binomial and Dirichlet-multinomial distributions. Other examples of distributions that are not exponential families are the F-distribution, Cauchy distribution, hypergeometric distribution and logistic distribution.

Following are some detailed examples of the representation of some useful distribution as exponential families.

Normal distribution: unknown mean, known variance

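As a first example, consider a random variable distributed normally with unknown mean μ and known variance σ². The probability density function is then

$$f_\sigma(x;\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/(2\sigma^2)}.$$

This is a single-parameter exponential family, as can be seen by setting

$$h_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/(2\sigma^2)},\qquad T_\sigma(x) = \frac{x}{\sigma},\qquad A_\sigma(\mu) = \frac{\mu^2}{2\sigma^2},\qquad \eta_\sigma(\mu) = \frac{\mu}{\sigma}.$$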

If σ = 1 this is in canonical form, as then η(μ) = μ.
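
As a quick sanity check, the decomposition can be compared numerically against the usual normal density. This is only a sketch, assuming the choices h(x) = exp(−x²/(2σ²))/√(2πσ²), T(x) = x/σ, η(μ) = μ/σ and A(μ) = μ²/(2σ²); the function names and test points are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Standard textbook form of the normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_pdf_expfam(x, mu, sigma):
    """Same density written as h(x) * exp(eta * T(x) - A)."""
    h = math.exp(-x ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    T = x / sigma                     # sufficient statistic
    eta = mu / sigma                  # natural parameter
    A = mu ** 2 / (2 * sigma ** 2)    # log-partition function
    return h * math.exp(eta * T - A)

points = [-2.0, -0.5, 0.0, 1.3, 4.0]
max_gap = max(abs(normal_pdf(x, 1.3, 0.7) - normal_pdf_expfam(x, 1.3, 0.7)) for x in points)
print(max_gap)  # difference is at the level of floating-point rounding
```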

Normal distribution: unknown mean and unknown variance

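Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then

$$f(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(y-\mu)^2/(2\sigma^2)}.$$

This is an exponential family which can be written in canonical form by defining

$$\boldsymbol\eta = \left[\frac{\mu}{\sigma^2},\; -\frac{1}{2\sigma^2}\right],\qquad h(y) = \frac{1}{\sqrt{2\pi}},\qquad \mathbf T(y) = (y,\, y^2)^{\mathsf T},$$

$$A(\boldsymbol\eta) = \frac{\mu^2}{2\sigma^2} + \log\lvert\sigma\rvert = -\frac{\eta_1^2}{4\eta_2} + \frac12\log\left\lvert\frac{1}{2\eta_2}\right\rvert.$$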

Binomial distribution

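As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is

$$f(x) = \binom{n}{x}\,p^x\,(1-p)^{n-x},\qquad x \in \{0, 1, 2, \ldots, n\}.$$

This can equivalently be written as

$$f(x) = \binom{n}{x}\,\exp\left(x\log\left(\frac{p}{1-p}\right) + n\log(1-p)\right),$$

which shows that the binomial distribution is an exponential family, whose natural parameter is

$$\eta = \log\frac{p}{1-p}.$$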

This function of p is known as logit.
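
The equivalence of the two ways of writing the binomial mass function can be verified directly. The sketch below assumes base measure h(x) = C(n, x), sufficient statistic T(x) = x, natural parameter η = logit(p) and log-partition A(η) = n·log(1 + e^η); the function names are hypothetical.

```python
import math

def binom_pmf(x, n, p):
    """Textbook binomial probability mass function."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_pmf_expfam(x, n, p):
    """Same pmf written as h(x) * exp(eta * x - A(eta))."""
    eta = math.log(p / (1 - p))              # natural parameter (logit of p)
    A = n * math.log1p(math.exp(eta))        # equals -n * log(1 - p)
    return math.comb(n, x) * math.exp(eta * x - A)

n, p = 10, 0.3
gaps = [abs(binom_pmf(x, n, p) - binom_pmf_expfam(x, n, p)) for x in range(n + 1)]
total = sum(binom_pmf_expfam(x, n, p) for x in range(n + 1))
print(max(gaps), total)
```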

Table of distributions

The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards[12] for main exponential families.

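For a scalar variable and scalar parameter, the form is as follows:

$$f_X(x \mid \theta) = h(x)\,\exp\bigl(\eta(\theta)\,T(x) - A(\eta)\bigr)$$

For a scalar variable and vector parameter:

$$f_X(x \mid \boldsymbol\theta) = h(x)\,\exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(x) - A(\boldsymbol\eta)\bigr)$$
$$f_X(x \mid \boldsymbol\theta) = h(x)\,g(\boldsymbol\theta)\,\exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(x)\bigr)$$

For a vector variable and vector parameter:

$$f_X(\mathbf x \mid \boldsymbol\theta) = h(\mathbf x)\,\exp\bigl(\boldsymbol\eta(\boldsymbol\theta)\cdot\mathbf T(\mathbf x) - A(\boldsymbol\eta)\bigr)$$

The above formulas choose the functional form of the exponential family with a log-partition function $A(\boldsymbol\eta)$. The reason for this is so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter $\boldsymbol\theta$ instead of the natural parameter, and/or using a factor $g(\boldsymbol\eta)$ outside of the exponential. The relation between the latter and the former is:

$$A(\boldsymbol\eta) = -\log g(\boldsymbol\eta)$$
$$g(\boldsymbol\eta) = e^{-A(\boldsymbol\eta)}$$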

To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other.

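| Distribution | Parameter(s) $\boldsymbol\theta$ | Natural parameter(s) $\boldsymbol\eta$ | Inverse parameter mapping | Base measure $h(x)$ | Sufficient statistic $T(x)$ | Log-partition $A(\boldsymbol\eta)$ | Log-partition $A(\boldsymbol\theta)$ |
|---|---|---|---|---|---|---|---|
| Bernoulli distribution | $p$ | $\log\frac{p}{1-p}$ (logit function) | $\frac{1}{1+e^{-\eta}}$ (logistic function) | $1$ | $x$ | $\log(1+e^{\eta})$ | $-\log(1-p)$ |
| binomial distribution with known number of trials $n$ | $p$ | $\log\frac{p}{1-p}$ | $\frac{1}{1+e^{-\eta}}$ | $\binom{n}{x}$ | $x$ | $n\log(1+e^{\eta})$ | $-n\log(1-p)$ |
| Poisson distribution | $\lambda$ | $\log\lambda$ | $e^{\eta}$ | $\frac{1}{x!}$ | $x$ | $e^{\eta}$ | $\lambda$ |
| negative binomial distribution with known number of failures $r$ | $p$ | $\log p$ | $e^{\eta}$ | $\binom{x+r-1}{x}$ | $x$ | $-r\log(1-e^{\eta})$ | $-r\log(1-p)$ |
| exponential distribution | $\lambda$ | $-\lambda$ | $-\eta$ | $1$ | $x$ | $-\log(-\eta)$ | $-\log\lambda$ |
| Pareto distribution with known minimum value $x_m$ | $\alpha$ | $-\alpha-1$ | $-1-\eta$ | $1$ | $\log x$ | $-\log(-1-\eta)+(1+\eta)\log x_m$ | $-\log\alpha-\alpha\log x_m$ |
| Weibull distribution with known shape $k$ | $\lambda$ | $-\frac{1}{\lambda^{k}}$ | $(-\eta)^{-1/k}$ | $x^{k-1}$ | $x^{k}$ | $-\log(-\eta)-\log k$ | $k\log\lambda-\log k$ |
| Laplace distribution with known mean $\mu$ | $b$ | $-\frac{1}{b}$ | $-\frac{1}{\eta}$ | $1$ | $\lvert x-\mu\rvert$ | $\log\left(-\frac{2}{\eta}\right)$ | $\log 2b$ |
| chi-squared distribution | $\nu$ | $\frac{\nu}{2}-1$ | $2(\eta+1)$ | $e^{-x/2}$ | $\log x$ | $\log\Gamma(\eta+1)+(\eta+1)\log 2$ | $\log\Gamma\left(\frac{\nu}{2}\right)+\frac{\nu}{2}\log 2$ |
| normal distribution, known variance | $\mu$ | $\frac{\mu}{\sigma}$ | $\sigma\eta$ | $\frac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi}\,\sigma}$ | $\frac{x}{\sigma}$ | $\frac{\eta^2}{2}$ | $\frac{\mu^2}{2\sigma^2}$ |
| continuous Bernoulli distribution | $\lambda$ | $\log\frac{\lambda}{1-\lambda}$ | $\frac{e^{\eta}}{1+e^{\eta}}$ | $1$ | $x$ | $\log\frac{e^{\eta}-1}{\eta}$ | $\log\left(\frac{1-2\lambda}{(1-\lambda)\log\frac{1-\lambda}{\lambda}}\right)$ |
| normal distribution | $\mu,\ \sigma^2$ | $\left[\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right]$ | $\left[-\frac{\eta_1}{2\eta_2},\ -\frac{1}{2\eta_2}\right]$ | $\frac{1}{\sqrt{2\pi}}$ | $[x,\ x^2]$ | $-\frac{\eta_1^2}{4\eta_2}-\frac12\log(-2\eta_2)$ | $\frac{\mu^2}{2\sigma^2}+\log\sigma$ |
| log-normal distribution | $\mu,\ \sigma^2$ | $\left[\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right]$ | $\left[-\frac{\eta_1}{2\eta_2},\ -\frac{1}{2\eta_2}\right]$ | $\frac{1}{\sqrt{2\pi}\,x}$ | $[\log x,\ (\log x)^2]$ | $-\frac{\eta_1^2}{4\eta_2}-\frac12\log(-2\eta_2)$ | $\frac{\mu^2}{2\sigma^2}+\log\sigma$ |
| inverse Gaussian distribution | $\mu,\ \lambda$ | $\left[-\frac{\lambda}{2\mu^2},\ -\frac{\lambda}{2}\right]$ | $\left[\sqrt{\frac{\eta_2}{\eta_1}},\ -2\eta_2\right]$ | $\frac{1}{\sqrt{2\pi}\,x^{3/2}}$ | $\left[x,\ \frac{1}{x}\right]$ | $-2\sqrt{\eta_1\eta_2}-\frac12\log(-2\eta_2)$ | $-\frac{\lambda}{\mu}-\frac12\log\lambda$ |
| gamma distribution (shape $\alpha$, rate $\beta$) | $\alpha,\ \beta$ | $[\alpha-1,\ -\beta]$ | $[\eta_1+1,\ -\eta_2]$ | $1$ | $[\log x,\ x]$ | $\log\Gamma(\eta_1+1)-(\eta_1+1)\log(-\eta_2)$ | $\log\Gamma(\alpha)-\alpha\log\beta$ |
| gamma distribution (shape $k$, scale $\theta$) | $k,\ \theta$ | $\left[k-1,\ -\frac{1}{\theta}\right]$ | $\left[\eta_1+1,\ -\frac{1}{\eta_2}\right]$ | $1$ | $[\log x,\ x]$ | $\log\Gamma(\eta_1+1)-(\eta_1+1)\log(-\eta_2)$ | $\log\Gamma(k)+k\log\theta$ |
| inverse gamma distribution | $\alpha,\ \beta$ | $[-\alpha-1,\ -\beta]$ | $[-\eta_1-1,\ -\eta_2]$ | $1$ | $\left[\log x,\ \frac{1}{x}\right]$ | $\log\Gamma(-\eta_1-1)-(-\eta_1-1)\log(-\eta_2)$ | $\log\Gamma(\alpha)-\alpha\log\beta$ |
| generalized inverse Gaussian distribution | $p,\ a,\ b$ | $\left[p-1,\ -\frac{a}{2},\ -\frac{b}{2}\right]$ | $[\eta_1+1,\ -2\eta_2,\ -2\eta_3]$ | $1$ | $\left[\log x,\ x,\ \frac{1}{x}\right]$ | $\log 2K_{\eta_1+1}\left(\sqrt{4\eta_2\eta_3}\right)-\frac{\eta_1+1}{2}\log\frac{\eta_2}{\eta_3}$ | $\log 2K_p\left(\sqrt{ab}\right)-\frac{p}{2}\log\frac{a}{b}$ |
| scaled inverse chi-squared distribution | $\nu,\ \sigma^2$ | $\left[-\frac{\nu}{2}-1,\ -\frac{\nu\sigma^2}{2}\right]$ | $\left[-2(\eta_1+1),\ \frac{\eta_2}{\eta_1+1}\right]$ | $1$ | $\left[\log x,\ \frac{1}{x}\right]$ | $\log\Gamma(-\eta_1-1)-(-\eta_1-1)\log(-\eta_2)$ | $\log\Gamma\left(\frac{\nu}{2}\right)-\frac{\nu}{2}\log\frac{\nu\sigma^2}{2}$ |
| beta distribution (variant 1) | $\alpha,\ \beta$ | $[\alpha,\ \beta]$ | $[\eta_1,\ \eta_2]$ | $\frac{1}{x(1-x)}$ | $[\log x,\ \log(1-x)]$ | $\log\Gamma(\eta_1)+\log\Gamma(\eta_2)-\log\Gamma(\eta_1+\eta_2)$ | $\log\Gamma(\alpha)+\log\Gamma(\beta)-\log\Gamma(\alpha+\beta)$ |
| beta distribution (variant 2) | $\alpha,\ \beta$ | $[\alpha-1,\ \beta-1]$ | $[\eta_1+1,\ \eta_2+1]$ | $1$ | $[\log x,\ \log(1-x)]$ | $\log\Gamma(\eta_1+1)+\log\Gamma(\eta_2+1)-\log\Gamma(\eta_1+\eta_2+2)$ | $\log\Gamma(\alpha)+\log\Gamma(\beta)-\log\Gamma(\alpha+\beta)$ |
| multivariate normal distribution | $\boldsymbol\mu,\ \boldsymbol\Sigma$ | $\left[\boldsymbol\Sigma^{-1}\boldsymbol\mu,\ -\frac12\boldsymbol\Sigma^{-1}\right]$ | $\left[-\frac12\boldsymbol\eta_2^{-1}\boldsymbol\eta_1,\ -\frac12\boldsymbol\eta_2^{-1}\right]$ | $(2\pi)^{-k/2}$ | $[\mathbf x,\ \mathbf x\mathbf x^{\mathsf T}]$ | $-\frac14\boldsymbol\eta_1^{\mathsf T}\boldsymbol\eta_2^{-1}\boldsymbol\eta_1-\frac12\log\lvert -2\boldsymbol\eta_2\rvert$ | $\frac12\boldsymbol\mu^{\mathsf T}\boldsymbol\Sigma^{-1}\boldsymbol\mu+\frac12\log\lvert\boldsymbol\Sigma\rvert$ |
| categorical distribution (variant 1) | $p_1,\ldots,p_k$ where $\sum_{i=1}^{k}p_i=1$ | $[\log p_1,\ldots,\log p_k]$ | $[e^{\eta_1},\ldots,e^{\eta_k}]$ where $\sum_{i=1}^{k}e^{\eta_i}=1$ | $1$ | $\bigl[[x=1],\ldots,[x=k]\bigr]$ * | $0$ | $0$ |
| categorical distribution (variant 2) | $p_1,\ldots,p_k$ where $\sum_{i=1}^{k}p_i=1$ | $[\log p_1+C,\ldots,\log p_k+C]$ | $\left[\frac{e^{\eta_1}}{\sum_{j=1}^{k}e^{\eta_j}},\ldots,\frac{e^{\eta_k}}{\sum_{j=1}^{k}e^{\eta_j}}\right]$ | $1$ | $\bigl[[x=1],\ldots,[x=k]\bigr]$ * | $0$ | $0$ |
| categorical distribution (variant 3) | $p_1,\ldots,p_k$ where $p_k=1-\sum_{i=1}^{k-1}p_i$ | $\left[\log\frac{p_1}{p_k},\ldots,\log\frac{p_{k-1}}{p_k},\ 0\right]$ (inverse softmax) | $\left[\frac{e^{\eta_1}}{1+\sum_{j=1}^{k-1}e^{\eta_j}},\ldots,\frac{e^{\eta_{k-1}}}{1+\sum_{j=1}^{k-1}e^{\eta_j}},\ \frac{1}{1+\sum_{j=1}^{k-1}e^{\eta_j}}\right]$ (softmax) | $1$ | $\bigl[[x=1],\ldots,[x=k]\bigr]$ * | $\log\left(\sum_{i=1}^{k}e^{\eta_i}\right)=\log\left(1+\sum_{i=1}^{k-1}e^{\eta_i}\right)$ | $-\log p_k=-\log\left(1-\sum_{i=1}^{k-1}p_i\right)$ |
| multinomial distribution (variant 1) with known number of trials $n$ | $p_1,\ldots,p_k$ where $\sum_{i=1}^{k}p_i=1$ | $[\log p_1,\ldots,\log p_k]$ | $[e^{\eta_1},\ldots,e^{\eta_k}]$ where $\sum_{i=1}^{k}e^{\eta_i}=1$ | $\frac{n!}{\prod_{i=1}^{k}x_i!}$ | $[x_1,\ldots,x_k]$ | $0$ | $0$ |
| multinomial distribution (variant 2) with known number of trials $n$ | $p_1,\ldots,p_k$ where $\sum_{i=1}^{k}p_i=1$ | $[\log p_1+C,\ldots,\log p_k+C]$ | $\left[\frac{e^{\eta_1}}{\sum_{j=1}^{k}e^{\eta_j}},\ldots,\frac{e^{\eta_k}}{\sum_{j=1}^{k}e^{\eta_j}}\right]$ | $\frac{n!}{\prod_{i=1}^{k}x_i!}$ | $[x_1,\ldots,x_k]$ | $0$ | $0$ |
| multinomial distribution (variant 3) with known number of trials $n$ | $p_1,\ldots,p_k$ where $p_k=1-\sum_{i=1}^{k-1}p_i$ | $\left[\log\frac{p_1}{p_k},\ldots,\log\frac{p_{k-1}}{p_k},\ 0\right]$ | $\left[\frac{e^{\eta_1}}{1+\sum_{j=1}^{k-1}e^{\eta_j}},\ldots,\frac{e^{\eta_{k-1}}}{1+\sum_{j=1}^{k-1}e^{\eta_j}},\ \frac{1}{1+\sum_{j=1}^{k-1}e^{\eta_j}}\right]$ | $\frac{n!}{\prod_{i=1}^{k}x_i!}$ | $[x_1,\ldots,x_k]$ | $n\log\left(\sum_{i=1}^{k}e^{\eta_i}\right)=n\log\left(1+\sum_{i=1}^{k-1}e^{\eta_i}\right)$ | $-n\log p_k=-n\log\left(1-\sum_{i=1}^{k-1}p_i\right)$ |
| Dirichlet distribution (variant 1) | $\alpha_1,\ldots,\alpha_k$ | $[\alpha_1,\ldots,\alpha_k]$ | $[\eta_1,\ldots,\eta_k]$ | $\frac{1}{\prod_{i=1}^{k}x_i}$ | $[\log x_1,\ldots,\log x_k]$ | $\sum_{i=1}^{k}\log\Gamma(\eta_i)-\log\Gamma\left(\sum_{i=1}^{k}\eta_i\right)$ | $\sum_{i=1}^{k}\log\Gamma(\alpha_i)-\log\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)$ |
| Dirichlet distribution (variant 2) | $\alpha_1,\ldots,\alpha_k$ | $[\alpha_1-1,\ldots,\alpha_k-1]$ | $[\eta_1+1,\ldots,\eta_k+1]$ | $1$ | $[\log x_1,\ldots,\log x_k]$ | $\sum_{i=1}^{k}\log\Gamma(\eta_i+1)-\log\Gamma\left(\sum_{i=1}^{k}(\eta_i+1)\right)$ | $\sum_{i=1}^{k}\log\Gamma(\alpha_i)-\log\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)$ |
| Wishart distribution | $\mathbf V,\ n$ | $\left[-\frac12\mathbf V^{-1},\ \frac{n-p-1}{2}\right]$ | $\left[-\frac12\boldsymbol\eta_1^{-1},\ 2\eta_2+p+1\right]$ | $1$ | $[\mathbf X,\ \log\lvert\mathbf X\rvert]$ | $-\left(\eta_2+\frac{p+1}{2}\right)\log\lvert-\boldsymbol\eta_1\rvert+\log\Gamma_p\left(\eta_2+\frac{p+1}{2}\right)$ | $\frac{n}{2}\left(p\log 2+\log\lvert\mathbf V\rvert\right)+\log\Gamma_p\left(\frac{n}{2}\right)$ |
| inverse Wishart distribution | $\boldsymbol\Psi,\ m$ | $\left[-\frac12\boldsymbol\Psi,\ -\frac{m+p+1}{2}\right]$ | $\left[-2\boldsymbol\eta_1,\ -(2\eta_2+p+1)\right]$ | $1$ | $[\mathbf X^{-1},\ \log\lvert\mathbf X\rvert]$ | $\left(\eta_2+\frac{p+1}{2}\right)\log\lvert-\boldsymbol\eta_1\rvert+\log\Gamma_p\left(-\left(\eta_2+\frac{p+1}{2}\right)\right)$ | $\frac{m}{2}\left(p\log 2-\log\lvert\boldsymbol\Psi\rvert\right)+\log\Gamma_p\left(\frac{m}{2}\right)$ |
| normal-gamma distribution | $\alpha,\ \beta,\ \mu,\ \lambda$ | $\left[\alpha-\frac12,\ -\beta-\frac{\lambda\mu^2}{2},\ \lambda\mu,\ -\frac{\lambda}{2}\right]$ | $\left[\eta_1+\frac12,\ -\eta_2+\frac{\eta_3^2}{4\eta_4},\ -\frac{\eta_3}{2\eta_4},\ -2\eta_4\right]$ | $\frac{1}{\sqrt{2\pi}}$ | $[\log\tau,\ \tau,\ \tau x,\ \tau x^2]$ | $\log\Gamma\left(\eta_1+\frac12\right)-\frac12\log(-2\eta_4)-\left(\eta_1+\frac12\right)\log\left(-\eta_2+\frac{\eta_3^2}{4\eta_4}\right)$ | $\log\Gamma(\alpha)-\alpha\log\beta-\frac12\log\lambda$ |

Notes on the Wishart log-partition function $A(\boldsymbol\eta)$:

  • Three variants with different parameterizations are given, to facilitate computing moments of the sufficient statistics. The form in the table is equivalent to $-\frac{n}{2}\log\lvert-\boldsymbol\eta_1\rvert+\log\Gamma_p\left(\frac{n}{2}\right)$ and to $\left(\eta_2+\frac{p+1}{2}\right)\left(p\log 2+\log\lvert\mathbf V\rvert\right)+\log\Gamma_p\left(\eta_2+\frac{p+1}{2}\right)$.

Note: Uses the fact that $\operatorname{tr}(\mathbf A^{\mathsf T}\mathbf B)=\operatorname{vec}(\mathbf A)\cdot\operatorname{vec}(\mathbf B)$, i.e. the trace of a matrix product is much like a dot product. The matrix parameters are assumed to be vectorized (laid out in a vector) when inserted into the exponential form. Also, $\mathbf V$ and $\mathbf X$ are symmetric, so e.g. $\mathbf V^{\mathsf T}=\mathbf V$.

* The Iverson bracket is a generalization of the discrete delta-function: if the bracketed expression is true, the bracket has value 1; if the enclosed statement is false, the Iverson bracket is zero. There are many variant notations, e.g. wavy brackets: ⧙a=b⧘ is equivalent to the [a=b] notation used above.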

The three variants of the categorical distribution and multinomial distribution are due to the fact that the parameters $p_i$ are constrained, such that

$$\sum_{i=1}^{k} p_i = 1.$$

Thus, there are only $k-1$ independent parameters.

  • Variant 1 uses $k$ natural parameters with a simple relation between the standard and natural parameters; however, only $k-1$ of the natural parameters are independent, and the set of $k$ natural parameters is nonidentifiable. The constraint on the usual parameters translates to a similar constraint on the natural parameters.
  • Variant 2 demonstrates the fact that the entire set of natural parameters is nonidentifiable: Adding any constant value to the natural parameters has no effect on the resulting distribution. However, by using the constraint on the natural parameters, the formula for the normal parameters in terms of the natural parameters can be written in a way that is independent of the constant that is added.
  • Variant 3 shows how to make the parameters identifiable in a convenient way by setting $C = -\log p_k$. This effectively "pivots" around $p_k$ and causes the last natural parameter to have the constant value of 0. All the remaining formulas are written in a way that does not access $p_k$, so that effectively the model has only $k-1$ parameters, both of the usual and natural kind.

Variants 1 and 2 are not actually standard exponential families at all. Rather they are curved exponential families, i.e. there are $k-1$ independent parameters embedded in a $k$-dimensional parameter space.[13] Many of the standard results for exponential families do not apply to curved exponential families. An example is the log-partition function $A(\boldsymbol\eta)$, which has the value of 0 in the curved cases. In standard exponential families, the derivatives of this function correspond to the moments (more technically, the cumulants) of the sufficient statistics, e.g. the mean and variance. However, a value of 0 suggests that the mean and variance of all the sufficient statistics are uniformly 0, whereas in fact the mean of the $i$th sufficient statistic should be $p_i$. (This does emerge correctly when using the form of $A(\boldsymbol\eta)$ shown in variant 3.)
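
The nonidentifiability discussed for the categorical variants can be illustrated with a short sketch. It assumes the usual softmax map from natural parameters back to probabilities, and that variant 3 uses η_i = log(p_i / p_k); the helper name `softmax` and the example probabilities are illustrative.

```python
import math

def softmax(eta):
    """Map natural parameters to probabilities: p_i = e^{eta_i} / sum_j e^{eta_j}."""
    m = max(eta)                          # subtract the max for numerical stability
    weights = [math.exp(e - m) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]

p = [0.2, 0.5, 0.3]

eta_v1 = [math.log(pi) for pi in p]             # variant 1: eta_i = log p_i
eta_shifted = [e + 4.2 for e in eta_v1]         # variant 2: add an arbitrary constant C
eta_v3 = [math.log(pi / p[-1]) for pi in p]     # variant 3: pivot on p_k, last entry is 0

# Log-partition function of variant 3, which should equal -log p_k.
A_v3 = math.log(sum(math.exp(e) for e in eta_v3))

print(softmax(eta_v1), softmax(eta_shifted), softmax(eta_v3))
print(A_v3, -math.log(p[-1]))
```

All three parameter vectors map back to the same probabilities, while only variant 3 has a nonzero log-partition function whose derivatives can recover the means p_i.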

Moments and cumulants of the sufficient statistic

Normalization of the distribution

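We start with the normalization of the probability distribution. In general, any non-negative function f(x) that serves as the kernel of a probability distribution (the part encoding all dependence on x) can be made into a proper distribution by normalizing: i.e.

$$p(x) = \frac{1}{Z}\,f(x)$$

where

$$Z = \int_x f(x)\,\mathrm dx.$$

The factor Z is sometimes termed the normalizer or partition function, based on an analogy to statistical physics.

In the case of an exponential family where

$$p(x;\boldsymbol\eta) = g(\boldsymbol\eta)\,h(x)\,e^{\boldsymbol\eta\cdot\mathbf T(x)},$$

the kernel is

$$K(x) = h(x)\,e^{\boldsymbol\eta\cdot\mathbf T(x)}$$

and the partition function is

$$Z = \int_x h(x)\,e^{\boldsymbol\eta\cdot\mathbf T(x)}\,\mathrm dx.$$

Since the distribution must be normalized, we have

$$1 = \int_x g(\boldsymbol\eta)\,h(x)\,e^{\boldsymbol\eta\cdot\mathbf T(x)}\,\mathrm dx = g(\boldsymbol\eta)\int_x h(x)\,e^{\boldsymbol\eta\cdot\mathbf T(x)}\,\mathrm dx = g(\boldsymbol\eta)\,Z.$$

In other words,

$$g(\boldsymbol\eta) = \frac{1}{Z}$$

or equivalently

$$A(\boldsymbol\eta) = -\log g(\boldsymbol\eta) = \log Z.$$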

This justifies calling A the log-normalizer or log-partition function.
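
To make the relation A(η) = log Z concrete, the partition function can be evaluated numerically for a discrete family. The sketch below assumes the Poisson family, with base measure h(x) = 1/x!, sufficient statistic T(x) = x and known closed form A(η) = e^η; the truncation at 200 terms is an arbitrary illustrative choice.

```python
import math

def log_partition_poisson(eta, terms=200):
    """log Z = log sum_x h(x) e^{eta x}, with h(x) = 1/x!, via a stable log-sum-exp."""
    logs = [eta * x - math.lgamma(x + 1) for x in range(terms)]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

lam = 3.0
eta = math.log(lam)
numeric_A = log_partition_poisson(eta)
closed_form_A = math.exp(eta)      # A(eta) = e^eta = lambda for the Poisson family
print(numeric_A, closed_form_A)
```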

Moment-generating function of the sufficient statistic

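Now, the moment-generating function of T(x) is

$$M_T(\mathbf u) \equiv \operatorname E\bigl[\exp\bigl(\mathbf u^{\mathsf T}\mathbf T(x)\bigr) \mid \boldsymbol\eta\bigr] = \int_x h(x)\,\exp\bigl[(\boldsymbol\eta+\mathbf u)^{\mathsf T}\mathbf T(x) - A(\boldsymbol\eta)\bigr]\,\mathrm dx = e^{A(\boldsymbol\eta+\mathbf u) - A(\boldsymbol\eta)},$$

proving the earlier statement that

$$K(\mathbf u \mid \boldsymbol\eta) = A(\boldsymbol\eta+\mathbf u) - A(\boldsymbol\eta)$$

is the cumulant generating function for T.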
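
This identity can be checked numerically. The following sketch assumes the Poisson family (T(x) = x, A(η) = e^η) and compares a direct evaluation of the moment-generating function, E[e^{uT}], with exp(A(η + u) − A(η)); the function names and the truncation of the sum are illustrative.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability mass function, computed in log space."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def mgf_direct(u, lam, terms=200):
    """E[exp(u * X)] by direct (truncated) summation over the support."""
    return sum(math.exp(u * x) * poisson_pmf(x, lam) for x in range(terms))

def mgf_from_log_partition(u, eta):
    """exp(A(eta + u) - A(eta)) with the Poisson log-partition A(eta) = e^eta."""
    return math.exp(math.exp(eta + u) - math.exp(eta))

lam, u = 2.0, 0.3
direct = mgf_direct(u, lam)
via_A = mgf_from_log_partition(u, math.log(lam))
print(direct, via_A)
```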

An important subclass of exponential families are the natural exponential families, which have a similar form for the moment-generating function for the distribution of x.

Differential identities for cumulants

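In particular, using the properties of the cumulant generating function,

$$\operatorname E(T_j) = \frac{\partial A(\boldsymbol\eta)}{\partial\eta_j}$$

and

$$\operatorname{cov}(T_i, T_j) = \frac{\partial^2 A(\boldsymbol\eta)}{\partial\eta_i\,\partial\eta_j}.$$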

The first two raw moments and all mixed second moments can be recovered from these two identities. Higher-order moments and cumulants are obtained by higher derivatives. This technique is often useful when T is a complicated function of the data, whose moments are difficult to calculate by integration.
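
A small numerical sketch of these identities follows. It assumes the binomial family with known n, whose log-partition function is A(η) = n·log(1 + e^η), and approximates the derivatives by central finite differences rather than symbolically; the step size and helper names are illustrative choices.

```python
import math

def A_binomial(eta, n):
    """Log-partition function of the binomial family with n trials."""
    return n * math.log1p(math.exp(eta))

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

n, p = 12, 0.25
eta = math.log(p / (1 - p))

mean = derivative(lambda e: A_binomial(e, n), eta)        # should be close to n * p
var = second_derivative(lambda e: A_binomial(e, n), eta)  # should be close to n * p * (1 - p)
print(mean, var)
```

The results agree with the familiar binomial mean np and variance np(1 − p) up to finite-difference error.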

Another way to see this that does not rely on the theory of cumulants is to begin from the fact that the distribution of an exponential family must be normalized, and differentiate. We illustrate using the simple case of a one-dimensional parameter, but an analogous derivation holds more generally.

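In the one-dimensional case, we have

$$p(x) = g(\eta)\,h(x)\,e^{\eta T(x)}.$$

This must be normalized, so

$$1 = \int_x p(x)\,\mathrm dx = \int_x g(\eta)\,h(x)\,e^{\eta T(x)}\,\mathrm dx = g(\eta)\int_x h(x)\,e^{\eta T(x)}\,\mathrm dx.$$

Take the derivative of both sides with respect to η:

$$\begin{aligned}
0 &= g(\eta)\,\frac{\mathrm d}{\mathrm d\eta}\int_x h(x)\,e^{\eta T(x)}\,\mathrm dx + g'(\eta)\int_x h(x)\,e^{\eta T(x)}\,\mathrm dx \\
&= g(\eta)\int_x h(x)\,T(x)\,e^{\eta T(x)}\,\mathrm dx + g'(\eta)\int_x h(x)\,e^{\eta T(x)}\,\mathrm dx \\
&= \int_x T(x)\,p(x)\,\mathrm dx + \frac{g'(\eta)}{g(\eta)}\int_x p(x)\,\mathrm dx \\
&= \operatorname E[T(x)] + \frac{g'(\eta)}{g(\eta)} \\
&= \operatorname E[T(x)] + \frac{\mathrm d}{\mathrm d\eta}\log g(\eta).
\end{aligned}$$

Therefore,

$$\operatorname E[T(x)] = -\frac{\mathrm d}{\mathrm d\eta}\log g(\eta) = \frac{\mathrm d}{\mathrm d\eta}A(\eta).$$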

Example 1

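As an introductory example, consider the gamma distribution, whose density is

$$p(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1}\,e^{-\beta x}.$$

Referring to the above table, we can see that the natural parameter is given by

$$\eta_1 = \alpha - 1,$$
$$\eta_2 = -\beta,$$

the reverse substitutions are

$$\alpha = \eta_1 + 1,$$
$$\beta = -\eta_2,$$

the sufficient statistics are $(\log x,\ x)$, and the log-partition function is

$$A(\eta_1, \eta_2) = \log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2).$$

We can find the mean of the sufficient statistics as follows. First, for η1:

$$\operatorname E[\log x] = \frac{\partial A(\eta_1,\eta_2)}{\partial\eta_1} = \frac{\partial}{\partial\eta_1}\bigl(\log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2)\bigr) = \psi(\eta_1+1) - \log(-\eta_2) = \psi(\alpha) - \log\beta,$$

where $\psi(x)$ is the digamma function (derivative of log gamma), and we used the reverse substitutions in the last step.

Now, for η2:

$$\operatorname E[x] = \frac{\partial A(\eta_1,\eta_2)}{\partial\eta_2} = \frac{\partial}{\partial\eta_2}\bigl(\log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2)\bigr) = \frac{\eta_1+1}{-\eta_2} = \frac{\alpha}{\beta},$$

again making the reverse substitution in the last step.

To compute the variance of x, we just differentiate again:

$$\operatorname{Var}(x) = \frac{\partial^2 A(\eta_1,\eta_2)}{\partial\eta_2^2} = \frac{\partial}{\partial\eta_2}\,\frac{\eta_1+1}{-\eta_2} = \frac{\eta_1+1}{\eta_2^2} = \frac{\alpha}{\beta^2}.$$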

All of these calculations can be done using integration, making use of various properties of the gamma function, but this requires significantly more work.
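
The gamma results can be reproduced numerically. The sketch below assumes the log-partition function A(η1, η2) = log Γ(η1 + 1) − (η1 + 1) log(−η2), differentiates it by central finite differences, and compares against ψ(α) − log β, α/β and α/β². The digamma value is computed from the harmonic-number formula, which holds only for integer α; all names and parameter values are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def A(eta1, eta2):
    """Gamma log-partition function in natural parameters."""
    return math.lgamma(eta1 + 1) - (eta1 + 1) * math.log(-eta2)

def d1(f, x, h=1e-5):
    """Central finite-difference first derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    """Central finite-difference second derivative."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

alpha, beta = 3.0, 2.0
eta1, eta2 = alpha - 1, -beta

mean_log_x = d1(lambda e: A(e, eta2), eta1)   # E[log x]
mean_x = d1(lambda e: A(eta1, e), eta2)       # E[x]
var_x = d2(lambda e: A(eta1, e), eta2)        # Var(x)

# Digamma at an integer argument: psi(n) = -gamma + sum_{k=1}^{n-1} 1/k.
digamma_alpha = -EULER_GAMMA + sum(1 / k for k in range(1, int(alpha)))
print(mean_log_x, digamma_alpha - math.log(beta))
print(mean_x, alpha / beta)
print(var_x, alpha / beta ** 2)
```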

Example 2

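As another example consider a real valued random variable X with density

$$p_\theta(x) = \frac{\theta\,e^{-x}}{\left(1+e^{-x}\right)^{\theta+1}}$$

indexed by shape parameter $\theta \in (0, \infty)$ (this is called the skew-logistic distribution). The density can be rewritten as

$$\frac{e^{-x}}{1+e^{-x}}\,\exp\bigl[-\theta\log\left(1+e^{-x}\right) + \log\theta\bigr].$$

Notice this is an exponential family with natural parameter

$$\eta = -\theta,$$

sufficient statistic

$$T = \log\left(1+e^{-x}\right),$$

and log-partition function

$$A(\eta) = -\log\theta = -\log(-\eta).$$

So using the first identity,

$$\operatorname E\bigl[\log\left(1+e^{-X}\right)\bigr] = \operatorname E(T) = \frac{\partial A(\eta)}{\partial\eta} = \frac{\partial}{\partial\eta}\bigl[-\log(-\eta)\bigr] = \frac{1}{-\eta} = \frac{1}{\theta},$$

and using the second identity

$$\operatorname{var}\bigl[\log\left(1+e^{-X}\right)\bigr] = \frac{\partial^2 A(\eta)}{\partial\eta^2} = \frac{\partial}{\partial\eta}\left[\frac{1}{-\eta}\right] = \frac{1}{(-\eta)^2} = \frac{1}{\theta^2}.$$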

This example illustrates a case where using this method is very simple, but the direct calculation would be nearly impossible.
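
A simulation offers an independent check of these two values. The sketch below assumes the skew-logistic density θe^{−x}/(1 + e^{−x})^{θ+1}, whose CDF is (1 + e^{−x})^{−θ}, and samples from it by inverting that CDF; the seed, sample size and function names are arbitrary illustrative choices.

```python
import math
import random

def sample_skew_logistic(theta, rng):
    """Inverse-CDF sampling: solve (1 + e^{-x})^{-theta} = u for x."""
    u = rng.random()
    while u == 0.0:                      # avoid log(0); random() never returns 1.0
        u = rng.random()
    # 1 + e^{-x} = u^{-1/theta}, so e^{-x} = expm1(-log(u) / theta)
    return -math.log(math.expm1(-math.log(u) / theta))

def T(x):
    """Sufficient statistic log(1 + e^{-x})."""
    return math.log1p(math.exp(-x))

theta = 2.5
rng = random.Random(0)
t_values = [T(sample_skew_logistic(theta, rng)) for _ in range(100_000)]

mean_T = sum(t_values) / len(t_values)
var_T = sum((t - mean_T) ** 2 for t in t_values) / (len(t_values) - 1)
print(mean_T, 1 / theta)        # sample mean vs. 1/theta
print(var_T, 1 / theta ** 2)    # sample variance vs. 1/theta^2
```

The sample mean and variance of T should land close to 1/θ and 1/θ², up to Monte Carlo error.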

Example 3

The final example is one where integration would be extremely difficult. This is the case of the Wishart distribution, which is defined over matrices. Even taking derivatives is a bit tricky, as it involves matrix calculus, but the respective identities are listed in that article.

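From the above table, we can see that the natural parameter is given by

$$\boldsymbol\eta_1 = -\tfrac12\mathbf V^{-1},$$
$$\eta_2 = \frac{n-p-1}{2},$$

the reverse substitutions are

$$\mathbf V = -\tfrac12\boldsymbol\eta_1^{-1},$$
$$n = 2\eta_2 + p + 1,$$

and the sufficient statistics are $(\mathbf X,\ \log\lvert\mathbf X\rvert)$.

The log-partition function is written in various forms in the table, to facilitate differentiation and back-substitution. We use the following forms:

$$A(\boldsymbol\eta_1, n) = -\frac{n}{2}\log\lvert-\boldsymbol\eta_1\rvert + \log\Gamma_p\left(\frac{n}{2}\right),$$
$$A(\mathbf V, \eta_2) = \left(\eta_2+\frac{p+1}{2}\right)\left(p\log 2 + \log\lvert\mathbf V\rvert\right) + \log\Gamma_p\left(\eta_2+\frac{p+1}{2}\right).$$

Expectation of X (associated with η1)

To differentiate with respect to η1, we need the following matrix calculus identity:

$$\frac{\partial\log\lvert a\mathbf X\rvert}{\partial\mathbf X} = \left(\mathbf X^{-1}\right)^{\mathsf T}$$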

Then: