In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906,^{[1]} building on an earlier proof of the same inequality for doubly-differentiable functions by Otto Hölder in 1889.^{[2]} Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. In its simplest form the inequality states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation; it is a simple corollary that the opposite is true of concave transformations.^{[3]}
Jensen's inequality generalizes the statement that the secant line of a convex function lies above the graph of the function, which is Jensen's inequality for two points: the secant line consists of weighted means of the convex function (for t ∈ [0,1]),
$tf(x_{1})+(1-t)f(x_{2}),$
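while the graph of the function is the convex function of the weighted means,
$f(tx_{1}+(1-t)x_{2}).$
Thus, Jensen's inequality in this case is
$f(tx_{1}+(1-t)x_{2})\leq tf(x_{1})+(1-t)f(x_{2}).$
In the context of probability theory, it is generally stated in the following form: if X is a random variable and φ is a convex function, then
$\varphi \left(\operatorname {E} [X]\right)\leq \operatorname {E} \left[\varphi (X)\right].$
The difference between the two sides of the inequality, $\operatorname {E} \left[\varphi (X)\right]-\varphi \left(\operatorname {E} [X]\right)$, is called the Jensen gap.^{[4]}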
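As a small numerical illustration of the Jensen gap, the following Python sketch evaluates $\operatorname {E} [\varphi (X)]-\varphi (\operatorname {E} [X])$ for a finite discrete distribution. The helper name `jensen_gap` and the example distribution (X uniform on {1, 2, 3}) are assumptions chosen for illustration only.

```python
import math


def jensen_gap(phi, values, probs):
    """Return E[phi(X)] - phi(E[X]) for a finite discrete distribution."""
    mean = sum(p * x for x, p in zip(values, probs))
    mean_of_phi = sum(p * phi(x) for x, p in zip(values, probs))
    return mean_of_phi - phi(mean)


# Hypothetical example: X uniform on {1, 2, 3}.
values = [1.0, 2.0, 3.0]
probs = [1 / 3, 1 / 3, 1 / 3]

gap_square = jensen_gap(lambda x: x * x, values, probs)  # convex: gap >= 0
gap_log = jensen_gap(math.log, values, probs)            # concave: gap <= 0
print(gap_square, gap_log)
```

For φ(x) = x² the gap equals the variance of X (here 2/3), while for the concave logarithm the gap is negative.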
Statements
The classical form of Jensen's inequality involves several numbers and weights. The inequality can be stated quite generally using either the language of measure theory or (equivalently) probability. In the probabilistic setting, the inequality can be further generalized to its full strength.
Finite form
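For a real convex function $\varphi$, numbers $x_{1},x_{2},\ldots ,x_{n}$ in its domain, and positive weights $a_{i}$, Jensen's inequality can be stated as:
$\varphi \left({\frac {\sum a_{i}x_{i}}{\sum a_{i}}}\right)\leq {\frac {\sum a_{i}\varphi (x_{i})}{\sum a_{i}}},$
and the inequality is reversed if $\varphi$ is concave.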
A common application has $x$ as a function of another variable (or set of variables) $t$, that is, $x_{i}=g(t_{i})$. All of this carries directly over to the general continuous case: the weights $a_{i}$ are replaced by a non-negative integrable function $f(x)$, such as a probability distribution, and the summations are replaced by integrals.
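The finite form can be checked numerically. The sketch below computes both sides, $\varphi$ of the weighted mean and the weighted mean of $\varphi$, for unnormalized positive weights; the function name `weighted_jensen_sides` and the data are hypothetical.

```python
def weighted_jensen_sides(phi, xs, weights):
    """Return (phi(weighted mean of xs), weighted mean of phi(xs))."""
    total = sum(weights)
    weighted_mean = sum(a * x for a, x in zip(weights, xs)) / total
    mean_of_phi = sum(a * phi(x) for a, x in zip(weights, xs)) / total
    return phi(weighted_mean), mean_of_phi


def square(x):
    return x * x  # a convex function


# Hypothetical data: three points with unnormalized positive weights.
xs = [1.0, 2.0, 4.0]
weights = [1.0, 2.0, 1.0]

lhs, rhs = weighted_jensen_sides(square, xs, weights)
print(lhs, rhs)  # 5.0625 6.25
```

Because both sides are divided by the total weight, rescaling all weights by a common positive factor leaves the two values unchanged.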
Measure-theoretic form
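Let $(\Omega ,A,\mu )$ be a probability space. Let $f:\Omega \to \mathbb {R}$ be a $\mu$-measurable function and $\varphi :\mathbb {R} \to \mathbb {R}$ be convex. Then:^{[5]}
$\varphi \left(\int _{\Omega }f\,d\mu \right)\leq \int _{\Omega }\varphi \circ f\,d\mu .$
In real analysis, we may require an estimate on
$\varphi \left(\int _{a}^{b}f(x)\,dx\right),$
where $a,b\in \mathbb {R}$, and $f\colon [a,b]\to \mathbb {R}$ is a non-negative Lebesgue-integrable function. In this case, the Lebesgue measure of $[a,b]$ need not be unity. However, by integration by substitution, the interval can be rescaled so that it has measure unity. Then Jensen's inequality can be applied to get^{[6]}
$\varphi \left({\frac {1}{b-a}}\int _{a}^{b}f(x)\,dx\right)\leq {\frac {1}{b-a}}\int _{a}^{b}\varphi (f(x))\,dx.$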
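The same result can be stated in a probability theory setting: for an integrable real-valued random variable $X$ and a convex function $\varphi$,
$\varphi \left(\operatorname {E} [X]\right)\leq \operatorname {E} \left[\varphi (X)\right].$
In this probability setting, the measure $\mu$ is intended as a probability $\operatorname {P}$, the integral with respect to $\mu$ as an expected value $\operatorname {E}$, and the function $f$ as a random variable $X$.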
Note that the equality holds if and only if φ is a linear function on some convex set $A$ such that $\mathrm {P} (X\in A)=1$ (which follows by inspecting the measure-theoretical proof below).
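The interval version of the measure-theoretic form, which compares $\varphi$ of the average of $f$ over $[a,b]$ with the average of $\varphi \circ f$, can be approximated numerically. The following is a minimal sketch using the midpoint rule; the function name `interval_jensen_sides`, the choice $\varphi =\exp$, and $f(x)=x$ on $[0,2]$ are assumptions for illustration.

```python
import math


def interval_jensen_sides(phi, f, a, b, n=10_000):
    """Approximate phi(average of f) and average of phi(f) on [a, b] (midpoint rule)."""
    h = (b - a) / n
    midpoints = [a + (k + 0.5) * h for k in range(n)]
    mean_f = sum(f(x) for x in midpoints) * h / (b - a)
    mean_phi_f = sum(phi(f(x)) for x in midpoints) * h / (b - a)
    return phi(mean_f), mean_phi_f


# Hypothetical example: phi = exp (convex), f(x) = x on [0, 2].
lhs, rhs = interval_jensen_sides(math.exp, lambda x: x, 0.0, 2.0)
print(lhs <= rhs)
```

Here the left side approximates $e^{1}$ and the right side approximates $(e^{2}-1)/2$, so the inequality holds with a visible margin.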
General inequality in a probabilistic setting
More generally, let T be a real topological vector space, and X a T-valued integrable random variable on a probability space $(\Omega ,{\mathfrak {F}},\operatorname {P} )$. In this general setting, integrable means that there exists an element $\operatorname {E} [X]$ in T, such that for any element z in the dual space of T: $\operatorname {E} |\langle z,X\rangle |<\infty$, and $\langle z,\operatorname {E} [X]\rangle =\operatorname {E} [\langle z,X\rangle ]$. Then, for any measurable convex function φ and any sub-σ-algebra ${\mathfrak {G}}$ of ${\mathfrak {F}}$:
$\varphi \left(\operatorname {E} \left[X\mid {\mathfrak {G}}\right]\right)\leq \operatorname {E} \left[\varphi (X)\mid {\mathfrak {G}}\right].$
Here $\operatorname {E} [\cdot \mid {\mathfrak {G}}]$ stands for the expectation conditioned on the σ-algebra ${\mathfrak {G}}$. This general statement reduces to the previous ones when the topological vector space T is the real axis, and ${\mathfrak {G}}$ is the trivial σ-algebra {∅, Ω} (where ∅ is the empty set, and Ω is the sample space).^{[8]}
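For a real-valued random variable on a finite sample space, the conditional statement can be verified directly, because conditioning on a σ-algebra generated by a partition amounts to averaging within each block. The sketch below is illustrative only: the helper `conditional_expectation`, the six-point sample space, and the partition are assumptions, and every block is assumed to have positive probability.

```python
def conditional_expectation(values, probs, partition):
    """E[values | G] as a list indexed by outcome; G is generated by `partition`."""
    result = [0.0] * len(values)
    for block in partition:
        mass = sum(probs[w] for w in block)  # assumed positive
        average = sum(probs[w] * values[w] for w in block) / mass
        for w in block:
            result[w] = average
    return result


def phi(t):
    return t * t  # a convex function


# Hypothetical setup: six equally likely outcomes, X listed per outcome,
# and G generated by the partition {0, 1, 2} | {3, 4, 5}.
probs = [1 / 6] * 6
x = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
partition = [[0, 1, 2], [3, 4, 5]]

lhs = [phi(m) for m in conditional_expectation(x, probs, partition)]
rhs = conditional_expectation([phi(v) for v in x], probs, partition)
holds = all(left <= right for left, right in zip(lhs, rhs))
print(holds)
```

Using the single-block partition `[[0, 1, 2, 3, 4, 5]]` corresponds to the trivial σ-algebra and recovers the unconditional inequality.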
A sharpened and generalized form
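Let X be a one-dimensional random variable with mean $\mu$ and variance $\sigma ^{2}\geq 0$. Let $\varphi (x)$ be a twice differentiable function, and define the function
$h(x):={\frac {\varphi (x)-\varphi (\mu )}{(x-\mu )^{2}}}-{\frac {\varphi '(\mu )}{x-\mu }}.$
Since $\operatorname {E} \left[\varphi (X)\right]-\varphi (\mu )=\operatorname {E} \left[h(X)(X-\mu )^{2}\right]$ and, by Taylor's theorem, $h(x)=\varphi ''(\xi )/2$ for some $\xi$ between $x$ and $\mu$, it follows that
$\sigma ^{2}\inf {\frac {\varphi ''(x)}{2}}\leq \operatorname {E} \left[\varphi (X)\right]-\varphi \left(\operatorname {E} [X]\right)\leq \sigma ^{2}\sup {\frac {\varphi ''(x)}{2}}.$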
In particular, when $\varphi (x)$ is convex, then $\varphi ''(x)\geq 0$, and the standard form of Jensen's inequality immediately follows for the case where $\varphi (x)$ is additionally assumed to be twice differentiable.
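A second-derivative bound of the form $\sigma ^{2}\inf \varphi ''/2\leq \operatorname {E} [\varphi (X)]-\varphi (\operatorname {E} [X])\leq \sigma ^{2}\sup \varphi ''/2$ can be checked on a small example. In the sketch below the infimum and supremum are taken over the range of X, which is an assumption about how the bound is read here; the distribution (X uniform on {0, 1, 2}) and the choice $\varphi =\exp$ are likewise hypothetical.

```python
import math

# Hypothetical example: X uniform on {0, 1, 2} and phi = exp, so phi'' = exp.
values = [0.0, 1.0, 2.0]
probs = [1 / 3, 1 / 3, 1 / 3]

mean = sum(p * x for x, p in zip(values, probs))
variance = sum(p * (x - mean) ** 2 for x, p in zip(values, probs))
gap = sum(p * math.exp(x) for x, p in zip(values, probs)) - math.exp(mean)

# exp is increasing, so over the range [0, 2] of X its infimum and supremum
# are attained at the endpoints (a shortcut specific to this example).
lower = variance * math.exp(min(values)) / 2
upper = variance * math.exp(max(values)) / 2
print(lower <= gap <= upper)  # True
```

In this example the gap equals $(e-1)^{2}/3$, which lies between the lower bound $1/3$ and the upper bound $e^{2}/3$.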
Proofs
Intuitive graphical proof
Jensen's inequality can be proved in several ways, and three different proofs corresponding to the different statements above will be offered. Before embarking on these mathematical derivations, however, it is worth analyzing an intuitive graphical argument based on the probabilistic case where X is a real-valued random variable (see figure). Assuming a hypothetical distribution of X values, one can immediately identify the position of $\operatorname {E} [X]$ and its image $\varphi (\operatorname {E} [X])$ in the graph. For a convex mapping $Y=\varphi (x)$ on an increasing portion of $\varphi$, the corresponding distribution of Y values is increasingly "stretched up" for increasing values of X, so the distribution of Y is broader in the interval corresponding to $X>X_{0}$ and narrower in $X<X_{0}$ for any $X_{0}$; in particular, this is also true for $X_{0}=\operatorname {E} [X]$. Consequently, in this picture the expectation of Y will always shift upwards with respect to the position of $\varphi (\operatorname {E} [X])$. A similar reasoning holds if the distribution of X covers a decreasing portion of the convex function, or both a decreasing and an increasing portion of it. This "proves" the inequality, i.e.
$\operatorname {E} \left[\varphi (X)\right]\geq \varphi \left(\operatorname {E} [X]\right),$
with equality, for example, when $\varphi$ is a straight line or when X is constant (a degenerate distribution).
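Proof 1 (finite form)
If $\lambda _{1}$ and $\lambda _{2}$ are two nonnegative real numbers such that $\lambda _{1}+\lambda _{2}=1$, then convexity of $\varphi$ means that, for all $x_{1},x_{2}$,
$\varphi (\lambda _{1}x_{1}+\lambda _{2}x_{2})\leq \lambda _{1}\varphi (x_{1})+\lambda _{2}\varphi (x_{2}).$
The finite form of Jensen's inequality, written with normalized weights $\lambda _{i}=a_{i}/\sum _{j}a_{j}$, can be proved by induction: by the convexity hypothesis, the statement is true for n = 2. Suppose the statement is true for some n, so
$\varphi \left(\sum _{i=1}^{n}\lambda _{i}x_{i}\right)\leq \sum _{i=1}^{n}\lambda _{i}\varphi (x_{i})$
for any nonnegative $\lambda _{1},\ldots ,\lambda _{n}$ such that $\lambda _{1}+\cdots +\lambda _{n}=1$. To prove it for n + 1, note that at least one of the $\lambda _{i}$ is strictly smaller than 1, say $\lambda _{n+1}$; therefore, by convexity,
$\varphi \left(\sum _{i=1}^{n+1}\lambda _{i}x_{i}\right)=\varphi \left((1-\lambda _{n+1})\sum _{i=1}^{n}{\frac {\lambda _{i}}{1-\lambda _{n+1}}}x_{i}+\lambda _{n+1}x_{n+1}\right)\leq (1-\lambda _{n+1})\varphi \left(\sum _{i=1}^{n}{\frac {\lambda _{i}}{1-\lambda _{n+1}}}x_{i}\right)+\lambda _{n+1}\varphi (x_{n+1}).$
Since the numbers $\lambda _{i}/(1-\lambda _{n+1})$, $i=1,\ldots ,n$, are nonnegative and sum to 1, applying the induction hypothesis gives
$\varphi \left(\sum _{i=1}^{n}{\frac {\lambda _{i}}{1-\lambda _{n+1}}}x_{i}\right)\leq \sum _{i=1}^{n}{\frac {\lambda _{i}}{1-\lambda _{n+1}}}\varphi (x_{i}),$
and therefore
$\varphi \left(\sum _{i=1}^{n+1}\lambda _{i}x_{i}\right)\leq \sum _{i=1}^{n+1}\lambda _{i}\varphi (x_{i}).$
By induction, the result holds for every integer n ≥ 2.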
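To obtain the general inequality from this finite form, one uses a density argument. The finite form can be rewritten as
$\varphi \left(\int x\,d\mu _{n}(x)\right)\leq \int \varphi (x)\,d\mu _{n}(x),$
where $\mu _{n}=\sum _{i=1}^{n}\lambda _{i}\delta _{x_{i}}$ is a measure given by an arbitrary convex combination of Dirac deltas. Since convex functions are continuous, and since convex combinations of Dirac deltas are weakly dense in the set of probability measures (as can be verified), the general statement is obtained by a limiting procedure.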
Proof 2 (measure-theoretic form)
Let $g$ be a real-valued $\mu$-integrable function on a probability space $\Omega$, and let $\varphi$ be a convex function on the real numbers. Since $\varphi$ is convex, at each real number $x$ we have a nonempty set of subderivatives, which may be thought of as lines touching the graph of $\varphi$ at $x$, but which are below the graph of $\varphi$ at all points (support lines of the graph).
Now, if we define
$x_{0}:=\int _{\Omega }g\,d\mu ,$
because of the existence of subderivatives for convex functions, we may choose $a$ and $b$ such that
$ax+b\leq \varphi (x),$
for all real $x$ and
$ax_{0}+b=\varphi (x_{0}).$
But then we have that
$\varphi \circ g(\omega )\geq ag(\omega )+b$
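for almost all $\omega \in \Omega$. Since we have a probability measure, the integral is monotone with $\mu (\Omega )=1$ so that
$\int _{\Omega }\varphi \circ g\,d\mu \geq \int _{\Omega }(ag+b)\,d\mu =a\int _{\Omega }g\,d\mu +b\int _{\Omega }d\mu =ax_{0}+b=\varphi (x_{0})=\varphi \left(\int _{\Omega }g\,d\mu \right),$
as desired.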
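The supporting-line argument can be traced numerically on a finite probability space. The sketch below assumes $\varphi (x)=x^{2}$, for which the tangent at $x_{0}$ is a supporting line with slope $a=2x_{0}$; the values of g and their probabilities are hypothetical.

```python
# Hypothetical example: g takes the listed values with the given probabilities,
# and phi(x) = x**2, whose tangent at x0 serves as the supporting line.
g_values = [-1.0, 0.5, 3.0]
probs = [0.2, 0.5, 0.3]


def phi(x):
    return x * x


x0 = sum(p * g for g, p in zip(g_values, probs))  # integral of g with respect to mu
a = 2 * x0                                        # slope: phi'(x0)
b = phi(x0) - a * x0                              # chosen so that a*x0 + b = phi(x0)

# The supporting line lies below phi at every value taken by g ...
line_below = all(a * g + b <= phi(g) for g in g_values)
# ... so integrating both sides gives integral(phi o g) >= a*x0 + b = phi(x0).
integral_phi_g = sum(p * phi(g) for g, p in zip(g_values, probs))
integral_line = sum(p * (a * g + b) for g, p in zip(g_values, probs))
print(line_below, integral_phi_g >= integral_line)
```

The integral of the line equals $\varphi (x_{0})$ because the probabilities sum to 1, which is the step where $\mu (\Omega )=1$ is used.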
Proof 3 (general inequality in a probabilistic setting)
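Let X be an integrable random variable that takes values in a real topological vector space T. Since $\varphi :T\to \mathbb {R}$ is convex, for any $x,y\in T$, the quantity
${\frac {\varphi (x+\theta \,y)-\varphi (x)}{\theta }}$
is decreasing as θ approaches 0^{+}. In particular, the subdifferential of $\varphi$ evaluated at x in the direction y is well-defined by
$(D\varphi )(x)\cdot y:=\lim _{\theta \downarrow 0}{\frac {\varphi (x+\theta \,y)-\varphi (x)}{\theta }}=\inf _{\theta >0}{\frac {\varphi (x+\theta \,y)-\varphi (x)}{\theta }}.$
The argument uses that the subdifferential is linear in y; this is not true in general as stated, and making the step rigorous requires the Hahn–Banach theorem. Since the infimum on the right-hand side of the previous formula is no larger than the value of the same term for θ = 1, one gets
$\varphi (x)\leq \varphi (x+y)-(D\varphi )(x)\cdot y.$
In particular, for an arbitrary sub-σ-algebra ${\mathfrak {G}}$ we can evaluate the last inequality when $x=\operatorname {E} [X\mid {\mathfrak {G}}],\,y=X-\operatorname {E} [X\mid {\mathfrak {G}}]$ to obtain
$\varphi (\operatorname {E} [X\mid {\mathfrak {G}}])\leq \varphi (X)-(D\varphi )(\operatorname {E} [X\mid {\mathfrak {G}}])\cdot (X-\operatorname {E} [X\mid {\mathfrak {G}}]).$
Taking the expectation conditioned on ${\mathfrak {G}}$ on both sides of this expression yields the result, since
$\operatorname {E} \left[(D\varphi )(\operatorname {E} [X\mid {\mathfrak {G}}])\cdot (X-\operatorname {E} [X\mid {\mathfrak {G}}])\mid {\mathfrak {G}}\right]=(D\varphi )(\operatorname {E} [X\mid {\mathfrak {G}}])\cdot \operatorname {E} \left[X-\operatorname {E} [X\mid {\mathfrak {G}}]\mid {\mathfrak {G}}\right]=0,$
by the linearity of the subdifferential in the y variable and the property $\operatorname {E} \left[\operatorname {E} [X\mid {\mathfrak {G}}]\mid {\mathfrak {G}}\right]=\operatorname {E} [X\mid {\mathfrak {G}}]$ of the conditional expectation.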
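For example, $g(x)=x^{2n}$ is convex for every positive integer n, so for a random variable X,
$\left(\operatorname {E} [X]\right)^{2n}\leq \operatorname {E} \left[X^{2n}\right].$
In particular, if some even moment 2n of X is finite, X has a finite mean. An extension of this argument shows X has finite moments of every order $l\in \mathbb {N}$ dividing n.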
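The even-moment inequality $(\operatorname {E} [X])^{2n}\leq \operatorname {E} [X^{2n}]$ can be checked exactly with rational arithmetic. The sketch below uses Python's `fractions` module; the helper `moment` and the distribution (X uniform on {−1, 0, 3}) are hypothetical.

```python
from fractions import Fraction


def moment(values, probs, order):
    """Return E[X**order] for a finite discrete distribution."""
    return sum(p * x ** order for x, p in zip(values, probs))


# Hypothetical example: X uniform on {-1, 0, 3}, checked for 2n = 2, 4, 6.
values = [Fraction(-1), Fraction(0), Fraction(3)]
probs = [Fraction(1, 3)] * 3

mean = moment(values, probs, 1)  # equals 2/3
checks = [mean ** (2 * n) <= moment(values, probs, 2 * n) for n in (1, 2, 3)]
print(checks)
```

Exact fractions avoid rounding error, so each comparison is a genuine check of the inequality for this distribution rather than an approximate one.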
Alternative finite form
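Let $\Omega =\{x_{1},\ldots ,x_{n}\}$, and take $\mu$ to be the measure on $\Omega$ that assigns mass $\lambda _{i}$ to $x_{i}$ (a weighted counting measure); then the general form reduces to a statement about sums:
$\varphi \left(\sum _{i=1}^{n}g(x_{i})\lambda _{i}\right)\leq \sum _{i=1}^{n}\varphi (g(x_{i}))\lambda _{i},$
provided that $\lambda _{i}\geq 0$ and $\lambda _{1}+\cdots +\lambda _{n}=1$.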
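When the convex function is an exponential, Jensen's inequality gives
$e^{\operatorname {E} [X]}\leq \operatorname {E} \left[e^{X}\right],$
where the expected values are taken with respect to some probability distribution of the random variable X.
Proof: Let $\varphi (x)=e^{x}$ in $\varphi \left(\operatorname {E} [X]\right)\leq \operatorname {E} \left[\varphi (X)\right].$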
Information theory
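If p(x) is the true probability density for X, and q(x) is another density, then applying Jensen's inequality for the random variable Y(X) = q(X)/p(X) and the convex function φ(y) = −log(y) gives
$\operatorname {E} [\varphi (Y)]\geq \varphi (\operatorname {E} [Y]),$
with expectations taken with respect to p. Therefore:
$-D(p(x)\|q(x))=\int p(x)\log \left({\frac {q(x)}{p(x)}}\right)dx\leq \log \left(\int p(x){\frac {q(x)}{p(x)}}\,dx\right)=\log \left(\int q(x)\,dx\right)=0,$
a result called Gibbs' inequality.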
It shows that the average message length is minimised when codes are assigned on the basis of the true probabilities p rather than any other distribution q. The quantity that is non-negative is called the Kullback–Leibler divergence of q from p, where $D(p(x)\|q(x))=\int p(x)\log \left({\frac {p(x)}{q(x)}}\right)dx$.
Since −log(x) is a strictly convex function for x > 0, it follows that equality holds if and only if p(x) equals q(x) almost everywhere.
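The non-negativity of the Kullback–Leibler divergence can be illustrated with a discrete analogue, in which the integral is replaced by a sum. This is a sketch under that assumption; the function name `kl_divergence` and the two distributions are hypothetical, and q is assumed to be positive wherever p is.

```python
import math


def kl_divergence(p, q):
    """Discrete D(p || q) = sum of p_i * log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical discrete distributions on three symbols.
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

# Average code length under the true distribution versus a mismatched one.
entropy = -sum(pi * math.log(pi) for pi in p)
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
print(kl_divergence(p, q), cross_entropy - entropy)
```

The difference between the cross-entropy and the entropy equals the divergence, which is the sense in which coding with q instead of p costs extra message length.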
Rao–Blackwell theorem
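If L is a convex function and ${\mathfrak {G}}$ a sub-sigma-algebra, then, from the conditional version of Jensen's inequality, we get
$L(\operatorname {E} [\delta (X)\mid {\mathfrak {G}}])\leq \operatorname {E} [L(\delta (X))\mid {\mathfrak {G}}]\quad \Longrightarrow \quad \operatorname {E} [L(\operatorname {E} [\delta (X)\mid {\mathfrak {G}}])]\leq \operatorname {E} [L(\delta (X))].$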
So if δ(X) is some estimator of an unobserved parameter θ given a vector of observables X, and if T(X) is a sufficient statistic for θ, then an improved estimator, in the sense of having an expected loss L that is no larger, can be obtained by calculating
$\delta _{1}(X)=\operatorname {E} _{\theta }[\delta (X')\mid T(X')=T(X)],$
the expected value of δ with respect to θ, taken over all possible vectors of observations X compatible with the same value of T(X) as that observed. Further, because T is a sufficient statistic, $\delta _{1}(X)$ does not depend on θ and is therefore a statistic.
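A concrete instance can be computed exactly by enumeration. The sketch below assumes n independent Bernoulli(θ) observations, the crude estimator δ(X) = X_1, the sufficient statistic T(X) = X_1 + ... + X_n, and squared-error loss; all of these choices, and the function name `rao_blackwell_demo`, are illustrative assumptions. It requires 0 < θ < 1 so that every value of T has positive probability.

```python
from itertools import product


def rao_blackwell_demo(n=4, theta=0.3):
    """Compare delta(X) = X_1 with delta_1(X) = E[X_1 | sum(X)] for Bernoulli data."""
    outcomes = list(product([0, 1], repeat=n))
    prob = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x)) for x in outcomes}

    # Conditional expectation of X_1 given T = sum(X), one value per t.
    delta1 = {}
    for t in range(n + 1):
        block = [x for x in outcomes if sum(x) == t]
        mass = sum(prob[x] for x in block)
        delta1[t] = sum(prob[x] * x[0] for x in block) / mass

    # Expected squared-error loss of each estimator of theta.
    risk_delta = sum(prob[x] * (x[0] - theta) ** 2 for x in outcomes)
    risk_delta1 = sum(prob[x] * (delta1[sum(x)] - theta) ** 2 for x in outcomes)
    return delta1, risk_delta, risk_delta1


delta1, risk_delta, risk_delta1 = rao_blackwell_demo()
print(delta1, risk_delta, risk_delta1)
```

In this setting the improved estimator is the sample mean T/n, its values do not change when θ is varied, and its expected loss is θ(1 − θ)/n rather than θ(1 − θ).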
1. Jensen, J. L. W. V. (1906). "Sur les fonctions convexes et les inégalités entre les valeurs moyennes". Acta Mathematica. 30 (1): 175–193. doi:10.1007/BF02418571.
2. Guessab, A.; Schmeisser, G. (2013). "Necessary and sufficient conditions for the validity of Jensen's inequality". Archiv der Mathematik. 100 (6): 561–570. doi:10.1007/s00013-013-0522-3. MR 3069109. S2CID 56372266.
3. Dekking, F. M.; Kraaikamp, C.; Lopuhaa, H. P.; Meester, L. E. (2005). A Modern Introduction to Probability and Statistics: Understanding Why and How. Springer Texts in Statistics. London: Springer. doi:10.1007/1-84628-168-7. ISBN 978-1-85233-896-1.
4. Gao, Xiang; Sitharam, Meera; Roitberg, Adrian (2019). "Bounds on the Jensen Gap, and Implications for Mean-Concentrated Distributions" (PDF). The Australian Journal of Mathematical Analysis and Applications. 16 (2). arXiv:1712.05267.
5. Durrett, Rick (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press. p. 25. ISBN 978-1108473682.
6. Niculescu, Constantin P. "Integral inequalities", p. 12.
7. Durrett, Rick (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press. p. 29. ISBN 978-1108473682.
8. Attention: In this generality additional assumptions on the convex function and/or the topological vector space are needed; see Example (1.3) on p. 53 in Perlman, Michael D. (1974). "Jensen's Inequality for a Convex Vector-Valued Function on an Infinite-Dimensional Space". Journal of Multivariate Analysis. 4 (1): 52–65. doi:10.1016/0047-259X(74)90005-0. hdl:11299/199167.
9. Liao, J.; Berg, A. (2018). "Sharpening Jensen's Inequality". American Statistician. 73 (3): 278–281. arXiv:1707.08644. doi:10.1080/00031305.2017.1419145. S2CID 88515366.
10. Bradley, C. J. (2006). Introduction to Inequalities. Leeds, United Kingdom: United Kingdom Mathematics Trust. p. 97. ISBN 978-1-906001-11-7.
References
David Chandler (1987). Introduction to Modern Statistical Mechanics. Oxford. ISBN 0-19-504277-8.