A Relationship between the Ordinary Maximum Entropy Method and the Method of Maximum Entropy in the Mean

Centro de Finanzas, IESA, Ave. Iesa, San Bernardino, Caracas 1010, Venezuela
Institute 2, CESA, Calle 35 #6-16, Bogotá, Colombia
Author to whom correspondence should be addressed.
Entropy 2014, 16(2), 1123-1133
Received: 16 December 2013 / Revised: 11 February 2014 / Accepted: 13 February 2014 / Published: 24 February 2014
(This article belongs to the Special Issue Maximum Entropy and Its Application)


There are two entropy-based methods to deal with linear inverse problems, which we shall call the ordinary method of maximum entropy (OME) and the method of maximum entropy in the mean (MEM). Not only does MEM use OME as a stepping stone, it also allows for greater generality: first, because it allows convex constraints to be included in a natural way, and second, because it allows (additive) measurement errors to be incorporated and estimated from the data. Here we shall see both methods in action in a specific example. We shall solve the discretized version of the problem by two variants of MEM and directly with OME. We shall see that OME is actually a particular instance of MEM, when the reference measure is a Poisson measure.
Keywords: maximum entropy; maximum entropy in the mean; constrained linear inverse problems

1. Introduction

During the last quarter of the nineteenth century, Boltzmann proposed a way to study convergence to equilibrium in a system of interacting particles through a quantity that was a Lyapunov functional for the dynamics of the system and increased as the system tended to equilibrium. A related idea was used at the beginning of the twentieth century by Gibbs to propose a theory of equilibrium statistical mechanics. The difference between the approaches was in the nature of the microscopic description. In the late 1950s, Jaynes [1] turned the idea into a variational method to determine a probability distribution given the expected value of a few random variables (observables, to use the physical terminology). This procedure is called the method of maximum entropy. This methodology has proven useful in a variety of problems well removed from the standard statistical physics setup. See Kapur (1989) [2] for example, or the Kluwer Academic Press collection Maximum Entropy and Bayesian Methods, or the volume by Jaynes (2003) [3].

As it turns out, similar procedures had come up in the actuarial and statistical literature, see for example the works by Esscher (1932) [4] and by Kullback (1957) [5]. Jaynes’s procedure was further extended in Decarreau et al. (1992) [6], and Dacunha-Castelle and Gamboa (1990) [7]. Such extension has proven a powerful tool to deal with linear inverse problems with convex constraints. See Gzyl and Velásquez (2011) [8] for example. This method uses the standard variational technique as a stepping stone in a peculiar way.

Besides providing a more general type of solutions to the OME problem, we shall verify in two different ways that the standard solution of the OME is actually a particular case of the more general MEM approach to solve linear inverse problems with convex constraints.

The paper is organized as follows. In the remainder of this section we shall state the two problems whose solutions we want to relate. These consist in obtaining a positive, continuous function satisfying some integral constraints. In Section 2 we shall recall the basics of MEM. In Section 3 we continue with the same theme and examine specific choices of setup to implement the method. Section 4 is devoted to the issue of obtaining the solution by the OME from the solution by the MEM.

In section five we implement both approaches numerically to compare their performance in one simple example. The idea of using two different choices of prior is to emphasize the flexibility of the MEM.

1.1. Statement of the First Problem

Even though the problem considered is not in its most general form, it is enough for our purposes and can be readily extended. We want to find a continuous positive function x(t) : [0, 1] → [0, ∞) such that

$$\int_0^1 k_i(t)\,x(t)\,dt = m_i; \quad \text{for } i = 1, \dots, M. \tag{1}$$
Typically $\{k_i(t) : i = 1, \dots, M\}$ is a collection of measurable functions defined on [0, 1] describing some sort of observations made on a random variable whose density we want to estimate. These could be ordinary moments $t^{n_i}$ for some collection $\{n_1, \dots, n_M\}$ of integers. Or they could be fractional powers $t^{a_i}$ for some collection $\{a_1, \dots, a_M\}$ of reals. This problem appears when our information consists of the values of a Laplace transform at points $\{a_1, \dots, a_M\}$ and we map our problem onto [0, 1] by means of the change of variables $s \mapsto t = e^{-s}$. Or the $k_i(t)$ could be trigonometric polynomials. In such cases we refer to Equation (1) as a generalized moment problem. When x(t) is required to be a probability density, we shall take $k_1(t) \equiv 1$ on [0, 1], so that $m_1 = 1$. It is also apparent that the convex constraint is the positivity constraint on x(t).

1.2. Statement of the Second Problem

Clearly Equation (1) is a particular case of the following more general problem: Let k(s, t) : [a, b] × [0, 1] → ℝ. Let 𝒦 ⊂ 𝒞([0, 1]) be a cone contained in the class of continuous functions, and let m(s) : [a, b] → ℝ be some continuous function. We want to find x(t) ∈ 𝒦 satisfying the integral constraints

$$\int_0^1 k(s,t)\,x(t)\,dt = m(s), \quad s \in [a, b]. \tag{2}$$

We remark that when x(t) is a density and k(s, t) ≡ 1, then m(s) ≡ 1. Clearly the integral constraints could be incorporated into 𝒦, but it is convenient to keep both separated. For what comes below, and to relate to the first problem, we shall restrict ourselves to the convex set of continuous density functions. Problems of this type were considered for example in [9], in much greater generality in [10], and more recently in [11], where applications and further references to related work are collected. As mentioned above, the setup can be relaxed considerably at the expense of technicalities. For example, one can consider the kernel k to be defined on the product $S_1 \times S_2$ of two locally compact, separable metric spaces, and dt could be replaced by some σ-finite measure ν(dt) on $(S_2, \mathcal{B}(S_2))$. But let us keep it as simple as possible.

2. The Maximum Entropy in the Mean Approach

The basic intuition behind the MEM goes as follows. We search for a stochastic process with independent increments {X(t) | t ∈ [0, 1]} defined on some auxiliary probability space (Ω, ℱ, Q) such that

$$dX(t) = X(t + \delta) - X(t) \in \mathcal{K}, \quad \text{for } t \in [0, 1] \text{ and } \delta > 0,$$
$$\frac{dE_P[X(t)]}{dt} = x(t); \quad \text{for some } P \ll Q,$$
$$E_P\left[\int_0^1 k(s,t)\,dX(t)\right] = \int_0^1 k(s,t)\,x(t)\,dt = m(s).$$
Here the measure P on (Ω, ℱ) is yet to be determined. If it exists, notice that x(t) ∈ 𝒦 automatically. The integral with respect to dX(t) is to be understood in the Itô sense.

As we want to implement the scheme numerically, it is more convenient to discretize Equation (2) first and then to bring in the MEM. It is at this point that the regularity properties of k(s, t) and x(t) come in to make life easier. Consider a partition of [a, b] into M equal adjacent intervals and a partition of [0, 1] into N equal adjacent intervals. Let $\{s_i \mid i = 1, \dots, M\}$ and $\{t_j \mid j = 1, \dots, N\}$, respectively, be the center points of those intervals. Let us set $A_{ij} = A(i, j) = k(s_i, t_j)$. Also, set $x_j \equiv x(j) = x(t_j)/N$ and finally $m_i \equiv m(i) = m(s_i)$.

Comment. We chose $x_j = x(t_j)/N$ because, when x(t) is a density, we want its discretized version to satisfy $\sum_j x_j = 1$.
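The discretization just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `midpoints` and `discretize`, the test density x(t) = 2t and the kernel k(s, t) = t^s are hypothetical choices made here (they are not taken from the paper), picked because the exact constraint values m(s) = 2/(s + 2) are known in closed form.

```python
def midpoints(lo, hi, n):
    """Center points of n equal adjacent intervals partitioning [lo, hi]."""
    h = (hi - lo) / n
    return [lo + (j + 0.5) * h for j in range(n)]


def discretize(kernel, density, s_points, n):
    """Return (A, x) with A[i][j] = k(s_i, t_j) and x[j] = x(t_j) / N."""
    t = midpoints(0.0, 1.0, n)
    A = [[kernel(s, tj) for tj in t] for s in s_points]
    x = [density(tj) / n for tj in t]
    return A, x


# Hypothetical test case (an assumption, not from the paper):
# x(t) = 2t and k(s, t) = t**s, so that m(s) = 2 / (s + 2) exactly.
kernel = lambda s, t: t ** s
density = lambda t: 2.0 * t
s_points = midpoints(0.0, 2.0, 4)        # [a, b] = [0, 2], M = 4
A, x = discretize(kernel, density, s_points, n=100)

# Discretized constraints sum_j A_ij x_j compared with the exact m(s_i).
m_discrete = [sum(a * xj for a, xj in zip(row, x)) for row in A]
m_exact = [2.0 / (s + 2.0) for s in s_points]
total_mass = sum(x)                      # should be close to 1
max_error = max(abs(d - e) for d, e in zip(m_discrete, m_exact))
print(total_mass, max_error)
```

The scaling by 1/N is what makes `total_mass` close to one, and the gap between `m_discrete` and `m_exact` is just the midpoint-rule quadrature error.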

With these changes, the discretized version of the second problem becomes: Given M real numbers mi : i = 1, ..., M, determine positive numbers xj : j = 1, ..., N such that

$$\sum_{j=1}^N A_{ij}\,x_j = m_i, \quad \text{for } i = 1, \dots, M. \tag{6}$$
To assemble the model, consider $\Omega = [0, \infty)^N$ with ℱ = ℬ(Ω) the usual Borel sets. To make things really simple, let $q_j(d\xi_j)$ be N copies of a measure $q(d\xi)$ on ([0, ∞), ℬ([0, ∞))) and let
$$Q(d\xi) = \prod_{j=1}^N q_j(d\xi_j)$$
be the reference measure. Note that with respect to Q the coordinate maps $X_j \equiv X(j) : \Omega \to [0, \infty)$ defined by $X_j(\xi) = \xi_j$ satisfy the positivity constraints and are independent. With this notation, the original discretized problem (6) is transformed into
$$\text{Determine a probability measure } P \ll Q \text{ such that } E_P[AX] = m. \tag{7}$$
Note that if such a measure P is found, then $x_j = E_P[X_j]$ satisfies Equation (6). It is in the determination of P that the OME comes in as a stepping stone.

3. Solution of Equation (7) by MEM

The notation will be as at the end of the previous section. For the purpose of comparison, we shall solve Equation (7) using two different measures. First we shall consider a product of exponential distributions on Ω and then we shall consider a product of Poisson distributions. Let us first develop the generic procedure, and then particularize for each choice of reference measure.

At this point we mention that the only requirement on the reference measure $Q(d\xi) = \prod_{j=1}^N q_j(d\xi_j)$ is the following:

Assumption We shall require the closure of the convex hull generated by the support of Q to be exactly Ω.

Let us consider the convex set 𝒫(Q) = {P measure on (Ω, ℱ), P << Q} on which we define the following concave functional

$$S_Q(P) = -\int_\Omega \ln\left(\frac{dP}{dQ}\right) dP \tag{8}$$
whenever ln(dP/dQ) is P-integrable, and equal to −∞ otherwise. This is the negative of the Kullback-Leibler divergence between P and Q. It is a standard result that $S_Q(P)$ is concave in P and we have

Lemma 3.1. Suppose that P, Q and R are probability measures on (Ω, ℱ) such that P << Q, P << R and R << Q, then $S_R(P) \le 0$, and

$$S_Q(P) = S_R(P) - \int dP\,\ln\left(\frac{dR}{dQ}\right). \tag{9}$$
Proof. The first assertion follows easily from Jensen's inequality. The second follows readily from the fact that $\frac{dP}{dQ} = \frac{dP}{dR}\,\frac{dR}{dQ}$.
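For readers who want the Jensen step of the proof spelled out, one way to write it is the following sketch, where the shorthand $\varphi = dP/dR$ is introduced here only for convenience and the concavity of the logarithm is what is being used:

```latex
S_R(P) = -\int \ln \varphi \, dP
       = \int \ln\frac{1}{\varphi} \, dP
       \le \ln \int \frac{1}{\varphi} \, dP
       = \ln R(\{\varphi > 0\})
       \le \ln 1 = 0,
\qquad \varphi = \frac{dP}{dR}.
```

The middle equality uses $\int (1/\varphi)\, dP = \int_{\{\varphi > 0\}} (1/\varphi)\,\varphi\, dR$.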

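We want to consider the following consequence of this lemma. Let us define $R_\lambda$ by $dR_\lambda(\xi) = \rho_\lambda(\xi)\,dQ(\xi)$ where

$$\rho_\lambda(\xi) = \frac{e^{-\langle \lambda, A\xi \rangle}}{Z(\lambda)} \tag{10}$$
where Z(λ) is the obvious normalization factor, which is given by
$$Z(\lambda) = \int_\Omega e^{-\langle \lambda, A\xi \rangle}\,dQ(\xi).$$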

The idea behind the maximum entropy method comes from the realization that for such $R_\lambda$, when P satisfies the constraints, substituting Equation (10) in the integral term in Equation (9) and using $S_{R_\lambda}(P) \le 0$, we obtain that

$$\Sigma(\lambda) \equiv \ln Z(\lambda) + \langle \lambda, m \rangle \ge S_Q(P) \quad \text{for any } \lambda \in \mathbb{R}^M.$$
Thus, whenever the set $\{\lambda \in \mathbb{R}^M \mid Z(\lambda) < \infty\}$ has a non-empty interior we expect the problem to have a solution, and whenever the class of P's satisfying the constraints in Equation (7) is non-empty, a minimizer of the convex function Σ(λ) is expected to exist. Csiszar [12,13] proved the existence of such P's. A different (and perhaps more physicist-oriented) proof was provided more recently in [14]. It is a simple exercise to verify the following

Proposition 3.1. Suppose that a measure P satisfying (7) exists and that the minimum of Σ(λ) is reached at λ* in the interior of $\{\lambda \in \mathbb{R}^M \mid Z(\lambda) < \infty\}$, then the probability P* that maximizes $S_Q(P)$ and satisfies Equation (7) is given by

$$dP^*(\xi) = \frac{e^{-\langle A^t \lambda^*, \xi \rangle}}{Z(\lambda^*)}\,dQ(\xi).$$
We are using $A^t$ to denote the transpose of A. With that result, the next step consists of computing
$$x^*(j) = \int_\Omega \xi_j\,dP^*(\xi).$$
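The reason the minimizer λ* enforces the constraints can be seen from a short calculation. The following is a sketch, assuming that differentiation under the integral sign is allowed in the interior of the set where Z(λ) is finite; $R_\lambda$ denotes the measure with density $e^{-\langle \lambda, A\xi \rangle}/Z(\lambda)$ with respect to Q:

```latex
\begin{aligned}
\frac{\partial}{\partial \lambda_i} \ln Z(\lambda)
  &= \frac{1}{Z(\lambda)} \int_\Omega \bigl(-(A\xi)_i\bigr)\,
     e^{-\langle \lambda, A\xi \rangle}\, dQ(\xi)
   = -E_{R_\lambda}\bigl[(AX)_i\bigr], \\
\frac{\partial \Sigma}{\partial \lambda_i}(\lambda)
  &= m_i - E_{R_\lambda}\bigl[(AX)_i\bigr].
\end{aligned}
```

Setting the gradient to zero at λ* gives $E_{P^*}[AX] = m$ with $P^* = R_{\lambda^*}$. Since $S_{P^*}(P^*) = 0$, the lemma then yields $S_Q(P^*) = \Sigma(\lambda^*)$, so the upper bound on the entropy is attained.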

Let us now examine two possible choices for Q.

3.1. Exponential Reference Measure

Since we want positive $x_j$, we shall first take each factor to be $q(d\xi) = \mu e^{-\mu \xi}\,d\xi$, which has [0, ∞) as support. A simple computation yields that

$$Z(\lambda) = \prod_{j=1}^N \frac{\mu}{\mu + (A^t \lambda)_j}.$$
For the purpose of modeling one has to choose μ large enough that $\mu + (A^t \lambda)_j$ is positive. Another simple computation (as in the verification of the preceding proposition) yields
$$x_j^* = \frac{1}{\mu + (A^t \lambda^*)_j}.$$
Observe that the method has just shifted the mean of each exponential to a new value. We leave it to the reader to write down P* explicitly to verify this assertion.
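For readers who want to check the assertion about the shifted means, the computation can be sketched as follows, combining the density of P* from Proposition 3.1 with the product of exponential densities; it is valid as long as $\mu + (A^t\lambda^*)_j > 0$ for every j:

```latex
\begin{aligned}
dP^*(\xi)
  &= \frac{e^{-\langle A^t \lambda^*, \xi \rangle}}{Z(\lambda^*)}
     \prod_{j=1}^N \mu\, e^{-\mu \xi_j}\, d\xi_j
   = \prod_{j=1}^N \bigl(\mu + (A^t\lambda^*)_j\bigr)\,
     e^{-(\mu + (A^t\lambda^*)_j)\,\xi_j}\, d\xi_j, \\
E_{P^*}[X_j]
  &= \int_0^\infty \xi_j\, \bigl(\mu + (A^t\lambda^*)_j\bigr)\,
     e^{-(\mu + (A^t\lambda^*)_j)\,\xi_j}\, d\xi_j
   = \frac{1}{\mu + (A^t\lambda^*)_j}.
\end{aligned}
```

Under P* the coordinates remain independent exponentials; only the rate of the j-th one changes, from μ to $\mu + (A^t\lambda^*)_j$.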

3.2. Poisson Reference Measure

This time, instead of a product of exponentials, we shall consider a product of Poisson measures, i.e., we take

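$$q(d\xi) = e^{-\mu} \sum_{k \ge 0} \frac{\mu^k}{k!}\,\epsilon_{\{k\}}(d\xi).$$
Here we use $\epsilon_{\{a\}}(d\xi)$ to denote the unit point mass (Dirac delta) at a. Certainly the convex hull of the non-negative integers is [0, ∞). Notice that now
$$Z(\lambda) = \prod_{j=1}^N \exp\left(-\mu\left(1 - e^{-(A^t \lambda)_j}\right)\right)$$
from which we obtain
$$\Sigma(\lambda) = -\mu \sum_{j=1}^N \left(1 - e^{-(A^t \lambda)_j}\right) + \langle \lambda, m \rangle.$$
Setting the gradient of Σ(λ) to zero gives $\sum_j A_{ij}\,\mu\,e^{-(A^t \lambda)_j} = m_i$, so if λ* minimizes that expression, the estimated solution to Equation (7) is
$$x_j^* = \mu\,e^{-(A^t \lambda^*)_j}. \tag{15}$$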

3.3. The MEM Approach to the Original Problem

Consider Equation (1) again. This time we shall consider a Poisson point process on ([0, 1], ℬ([0, 1])) with intensity dt. By this we mean a base probability space (Ω, ℱ, Q) on which a collection of random measures {N(A) : A ∈ ℬ([0, 1])} (the point process) is given, which has the following properties:


N(A) is a Poisson random variable with intensity (mean) |A|,

Q-almost everywhere, A → N(A) is an integer-valued measure,

For any disjoint $A_1, \dots, A_k$, the random variables $N(A_1), \dots, N(A_k)$ are independent.

From these, it is clear that for any $\lambda \in \mathbb{R}^M$ the random variable $\int_0^1 \langle \lambda, k(t) \rangle\,N(dt)$ satisfies

$$E_Q\left[e^{-\int_0^1 \langle \lambda, k(t) \rangle\,N(dt)}\right] = \exp\left(\int_0^1 dt\,\bigl(\exp(-\langle \lambda, k(t) \rangle) - 1\bigr)\right)$$
where k(t) is the vector of functions $k_i(t)$ defining the generalized moments in Equation (1). Here we again denote the previous quantity by $Z_Q(\lambda)$, and again $\Sigma(\lambda) = \ln Z_Q(\lambda) + \langle \lambda, m \rangle$. This function is convex on $\{\lambda \in \mathbb{R}^M \mid Z_Q(\lambda) < \infty\}$. When a minimizer λ* exists in the interior of that domain, then P* with density
$$\frac{dP^*}{dQ} = \frac{e^{-\int_0^1 \langle \lambda^*, k(t) \rangle\,N(dt)}}{Z_Q(\lambda^*)}$$
is such that
$$x^*(t) \equiv \frac{dE_{P^*}[N[0, t]]}{dt} = e^{-\langle \lambda^*, k(t) \rangle} \tag{16}$$
solves Equation (1).
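That the exponential form above satisfies the constraints can be checked directly from the first-order condition for λ*. The following is a sketch, assuming the derivative and the integral can be interchanged:

```latex
\begin{aligned}
\Sigma(\lambda)
  &= \int_0^1 \bigl(e^{-\langle \lambda, k(t) \rangle} - 1\bigr)\, dt
     + \langle \lambda, m \rangle, \\
\frac{\partial \Sigma}{\partial \lambda_i}(\lambda)
  &= -\int_0^1 k_i(t)\, e^{-\langle \lambda, k(t) \rangle}\, dt + m_i .
\end{aligned}
```

At the minimizer the gradient vanishes, which gives $\int_0^1 k_i(t)\,e^{-\langle \lambda^*, k(t) \rangle}\,dt = m_i$ for i = 1, ..., M, that is, Equation (1) with $x(t) = e^{-\langle \lambda^*, k(t) \rangle}$.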

4. OME from MEM

4.1. Discrete Case

We shall now relate the last result to the standard (ordinary) method of maximum entropy. Suppose that the unknown quantities $x_j$ in Equation (6) are indeed probabilities, and that $m_1 = 1$ and $A_{1j} = 1$ for all j = 1, ..., N. It is easy to verify, using the first equation of the set in Equation (6), namely $\sum_j x_j^* = 1$, that $\mu \exp(-\lambda_1^*) = 1/\zeta(\lambda_r^*)$, where $\lambda_r = (\lambda_2, \dots, \lambda_M)$ and

$$\zeta(\lambda_r) = \sum_{j=1}^N e^{-\sum_{i=2}^M \lambda_i A_{i,j}}.$$
From this, Equation (15) becomes
$$x_j^* = \frac{e^{-\sum_{i=2}^M \lambda_i^* A_{i,j}}}{\zeta(\lambda_r^*)}$$
which is the solution to Equation (6) by the OME method.
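The coincidence of the two solutions in the discrete case can be checked numerically. The sketch below is only an illustration under several assumptions: it uses a damped Newton iteration written in plain Python (not the Barzilai-Borwein code in R used later in the paper), a toy problem with the normalization row plus a single constraint row $A_{2j} = t_j$, and hypothetical target data proportional to exp(−2t). The function names `solve_linear`, `mem_poisson` and `ome_one_constraint` are introduced here for illustration only.

```python
import math


def solve_linear(H, b):
    """Solve H d = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(H)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    d = [0.0] * n
    for r in range(n - 1, -1, -1):
        d[r] = (M[r][n] - sum(M[r][k] * d[k] for k in range(r + 1, n))) / M[r][r]
    return d


def mem_poisson(A, m, mu, max_iter=100, tol=1e-10):
    """Minimize Sigma(lam) = -mu * sum_j (1 - exp(-(A^t lam)_j)) + <lam, m>."""
    rows, N = len(A), len(A[0])

    def state(lam):
        x = [mu * math.exp(-sum(lam[i] * A[i][j] for i in range(rows)))
             for j in range(N)]
        sigma = sum(x) - mu * N + sum(l * mi for l, mi in zip(lam, m))
        return x, sigma

    lam = [0.0] * rows
    x, sigma = state(lam)
    for _ in range(max_iter):
        grad = [m[i] - sum(A[i][j] * x[j] for j in range(N)) for i in range(rows)]
        if max(abs(g) for g in grad) < tol:
            break
        hess = [[sum(A[i][j] * A[k][j] * x[j] for j in range(N))
                 for k in range(rows)] for i in range(rows)]
        step = solve_linear(hess, [-g for g in grad])
        slope = sum(g * d for g, d in zip(grad, step))
        s = 1.0
        for _ in range(40):              # backtracking (Armijo) line search
            trial = [l + s * d for l, d in zip(lam, step)]
            x_new, sigma_new = state(trial)
            if sigma_new <= sigma + 1e-4 * s * slope:
                break
            s *= 0.5
        lam, x, sigma = trial, x_new, sigma_new
    return lam, x


def ome_one_constraint(t, m2, lo=-50.0, hi=50.0):
    """OME with normalization and one constraint: bisection on the tilted mean."""
    def tilted_mean(l2):
        w = [math.exp(-l2 * tj) for tj in t]
        return sum(tj * wj for tj, wj in zip(t, w)) / sum(w)

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) > m2:        # mean too large: increase lambda_2
            lo = mid
        else:
            hi = mid
    l2 = 0.5 * (lo + hi)
    w = [math.exp(-l2 * tj) for tj in t]
    z = sum(w)
    return l2, [wj / z for wj in w]


N = 50
t = [(j + 0.5) / N for j in range(N)]
A = [[1.0] * N, t]                       # A_1j = 1 (normalization), A_2j = t_j
w = [math.exp(-2.0 * tj) for tj in t]    # hypothetical target, not from the paper
total = sum(w)
x_true = [wj / total for wj in w]
m = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]

lam_a, x_a = mem_poisson(A, m, mu=1.0)
lam_b, x_b = mem_poisson(A, m, mu=5.0)
lam2, x_ome = ome_one_constraint(t, m[1])
zeta = sum(math.exp(-lam_b[1] * tj) for tj in t)
print(max(abs(p - q) for p, q in zip(x_a, x_ome)))
print(max(abs(p - q) for p, q in zip(x_b, x_ome)))
```

With a normalization row present, changing μ only shifts $\lambda_1^*$ by ln μ, so the two MEM runs and the OME run are expected to agree up to solver tolerance, and `5.0 * math.exp(-lam_b[0])` should match `1 / zeta`.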

4.2. Continuous Case as Limit of the Discrete Case

This is the second place in which our discretization procedure enters. First rewrite $x_j^*$ in Equation (15) as $x^*(t_j)/N$, from which Equation (15) becomes

$$x^*(t_j) = \frac{e^{-\sum_{i=2}^M \lambda_i^* A_{i,j}}}{\frac{1}{N}\,\zeta(\lambda_r^*)}.$$
One may want to argue as follows: given any t ∈ [0, 1], as N → ∞ there is a sequence $t_{j(N)}$ converging to t. In addition, it is clear that
$$\frac{1}{N}\,\zeta(\lambda_r^*) \to \int_0^1 e^{-\sum_{i=2}^M \lambda_i^* k_i(t)}\,dt$$
and therefore
$$x^*(t) = \frac{e^{-\sum_{i=2}^M \lambda_i^* k_i(t)}}{\int_0^1 e^{-\sum_{i=2}^M \lambda_i^* k_i(s)}\,ds}$$
which we would like to identify as the solution to Equation (1) provided by the OME method. The problem with this procedure is that λ* depends on N and changes along the way. Let us indicate a possible way to overcome this issue.

For each N denote by $A_j(N)$ the blocks of the partition of [0, 1], and suppose that the partitions refine each other as N increases (consider dyadic partitions for example). For each N denote the maxentropic solution described in Equation (17) by $x_N^*(t_j)$ and define the piecewise constant (continuous) density

$$\tilde{x}_N(t) = \sum_j x_N^*(t_j)\,I_{A_j(N)}(t).$$
Clearly, $\tilde{x}_N$ satisfies Equation (1), but it is not the density that maximizes the entropy. Actually, one can rapidly verify that
$$S(\tilde{x}_N) \le S(\tilde{x}_{N+1}) \le S(\tilde{x}).$$
We shall relate $\tilde{x}$ to the $x^*$ displayed in Equation (16) below. The remaining part of the argument is to verify that $\lambda_N \to \lambda$ (in an obvious notation) as N → ∞. This is simple to say, but hard to prove. A way around the issue of the convergence of the $\lambda_N$ is provided by [12].

4.3. The Full Continuous Case

Here we show how to obtain the OME solution to Equation (1) from the MEM solution in Equation (16) without the labor described in the previous section. The argument is similar to the one mentioned above. As $k_1(t) = 1$, we can isolate $\lambda_1^*$ and rewrite $x^*(t)$ as

$$x^*(t) = \frac{e^{-\sum_{i=2}^M \lambda_i^* k_i(t)}}{\int_0^1 e^{-\sum_{i=2}^M \lambda_i^* k_i(s)}\,ds}.$$
That this solves Equation (1) is due to the fact that the equations that determine λ* in the full MEM and in the OME cases coincide. This happens because of the special form of $Z_Q(\lambda)$ when the underlying auxiliary process is the Poisson point process.

5. Numerical Examples

To compare the output of the three methods, we consider a simple example in which the data consist of a few values of the Laplace transform of a Γ(a, b) density. Observe that if S denotes the original random variable, then $T = e^{-S}$ denotes the corresponding random variable with range mapped onto [0, 1]. The values of the Laplace transform of S are the fractional (not necessarily integer) moments of T. The maxentropic methods yield the density x(t) of T, from which the density of S is obtained by the change of variable $f_S(s) = e^{-s}\,x(e^{-s})$.

If we let $\{\alpha_1 = 0, \alpha_2, \dots, \alpha_M\}$ and $k_i(t) = t^{\alpha_i}$ be M given powers of T, the corresponding moments to be used in Equation (1) are $m_i = (b/(\alpha_i + b))^a$, with $m_1 = 1$. To be specific, let us consider a = b = 1, and $\alpha_2 = 1/5$, $\alpha_3 = 1/4$, $\alpha_4 = 1/3$, $\alpha_5 = 1/2$, $\alpha_6 = 5$, $\alpha_7 = 10$, $\alpha_8 = 15$, $\alpha_9 = 20$, from which we readily obtain the values of the 9 generalized moments $m_i$. To finish, we take N = 100 partition points of [0, 1].
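The data of this example are easy to reproduce. The sketch below computes the nine moments, builds the matrix $A_{ij} = t_j^{\alpha_i}$ on the midpoint grid, and implements the change of variable back to the density of S. The function names are hypothetical. The sanity check at the end rests on an observation that follows from the moment formula but is not stated in the text: for a = b = 1, $m_i = 1/(\alpha_i + 1) = \int_0^1 t^{\alpha_i}\,dt$, so the density of T is the uniform density on [0, 1].

```python
import math


def gamma_laplace_moments(alphas, a, b):
    """m_i = (b / (alpha_i + b))**a: Laplace transform of Gamma(a, b) at alpha_i."""
    return [(b / (alpha + b)) ** a for alpha in alphas]


def density_of_S(x, s):
    """Map a density x(t) of T = exp(-S) on [0, 1] to f_S(s) = exp(-s) x(exp(-s))."""
    return math.exp(-s) * x(math.exp(-s))


alphas = [0.0, 1 / 5, 1 / 4, 1 / 3, 1 / 2, 5.0, 10.0, 15.0, 20.0]
m = gamma_laplace_moments(alphas, a=1.0, b=1.0)
for alpha, mi in zip(alphas, m):
    print(f"alpha = {alpha:7.4f}   m = {mi:.6f}")

# Midpoint discretization with N = 100, as in the text.
N = 100
t = [(j + 0.5) / N for j in range(N)]
A = [[tj ** alpha for tj in t] for alpha in alphas]

# Sanity check for a = b = 1: the uniform density x(t) = 1, discretized as
# x_j = 1/N, should reproduce the moments up to quadrature error.
x_uniform = [1.0 / N] * N
discretized = [sum(a_ij * xj for a_ij, xj in zip(row, x_uniform)) for row in A]
max_err = max(abs(d - mi) for d, mi in zip(discretized, m))
print(max_err)
```

The residual `max_err` measures how much of the mismatch in the discretized constraints is due to the midpoint rule alone, before any maxentropic reconstruction is attempted.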

5.1. Exponential Reference Measure

We shall set μ = 10, a number high enough that the positivity conditions mentioned in Section 3.1 hold. The function to be minimized to determine the λ's is

$$\Sigma(\lambda) = N \ln \mu - \sum_{j=1}^N \ln\left(\mu + (A^t \lambda)_j\right) + \langle \lambda, m \rangle.$$

To find the minimizer, we use the Barzilai-Borwein code available for R, see [15]. Once the optimal λ is obtained it is inserted in Equation (14). This yields the density of T on [0, 1]. To plot the density on [0, ∞) we perform the change of variables mentioned above, and the result is plotted in Figure 1.

We point out that the L1 norm of the difference between the reconstructed and the original densities is 0.0283 rounding at the fourth decimal place.
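The text does not say how the L1 norm was evaluated. One simple possibility, given here only as a sketch, is a midpoint-rule approximation of the integral of |f − g| on a truncated interval. The function `l1_distance`, the truncation point and the stand-in density below are assumptions for illustration and are not the reconstruction obtained in the paper.

```python
import math


def l1_distance(f, g, upper, n):
    """Midpoint-rule approximation of the integral of |f - g| over [0, upper]."""
    h = upper / n
    return h * sum(abs(f((k + 0.5) * h) - g((k + 0.5) * h)) for k in range(n))


true_density = lambda s: math.exp(-s)                # Gamma(1, 1) density
other_density = lambda s: 2.0 * math.exp(-2.0 * s)   # stand-in, not a reconstruction
d = l1_distance(true_density, other_density, upper=40.0, n=40000)
print(d)
```

For these two densities the exact L1 distance is 1/2 (they cross at s = ln 2), which gives a quick check of the quadrature before it is applied to an actual reconstruction.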

5.2. The Poisson Reference Measure

In reference to the setup of Section 3.2 we set μ = 5 this time. The function to be minimized this time is

$$\Sigma(\lambda) = -\mu \sum_{j=1}^N \left(1 - e^{-(A^t \lambda)_j}\right) + \langle \lambda, m \rangle.$$

Once the minimizing λ* has been found, the routine is as above: the density on [0, 1] is mapped onto a density on [0, ∞) by means of a change of variables. The result obtained is displayed in Figure 2.

The L1 norm of the difference between the reconstructed and the original densities is 0.00524.

5.3. The OME Method

In this case, to determine λ* we have to minimize

$$\Sigma(\lambda) = \int_0^1 e^{-\langle \lambda, k(t) \rangle}\,dt + \langle \lambda, m \rangle$$
which clearly is the same thing as minimizing
$$\Sigma(\lambda) = -\int_0^1 \left(1 - e^{-\langle \lambda, k(t) \rangle}\right) dt + \langle \lambda, m \rangle$$
as mentioned at the end of Section 4.3. The result, after the change of variables, is displayed in Figure 3.

The L1 norm of the difference between the reconstructed and the original densities is 0.03479.


Acknowledgments

We greatly appreciate the comments by the referees. They certainly contributed to improving our presentation.

Conflicts of Interest

The authors declare no conflicts of interest.

Bibliographical Comment

The literature on the OME method is very large; consider this journal, for example. Even though the literature on the MEM method is not that large, we cited only a few foundational papers and a very small sample of recent papers that apply the method to interesting problems.


References

  1. Jaynes, E. Information theory and statistical physics. Phys. Rev. 1957, 106, 171–197.
  2. Kapur, J. Maximum Entropy Models in Science and Engineering; Wiley Eastern Ltd.: New Delhi, India, 1989.
  3. Jaynes, E. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2003.
  4. Esscher, F. On the probability function in the collective theory of risk. Scan. Aktuarietidskr. 1932, 15, 175–195.
  5. Kullback, S. Information Theory and Statistics; Dover Pubs: New York, NY, USA, 1968.
  6. Decarreau, A.; Hilhorst, D.; Demarechal, C.; Navaza, J. Dual Methods in Entropy Maximization. Application to Some Problems in Crystallography. SIAM J. Optim. 1992, 2, 173–197.
  7. Dacunha-Castelle, D.; Gamboa, F. Maximum d’entropie et problème des moments. Ann. Inst. Henri Poincaré 1990, 26, 567–596.
  8. Gzyl, H.; Velásquez, Y. Linear Inverse Problems: The Maximum entropy Connection; World Scientific Pubs: Singapore, Singapore, 2011.
  9. Gzyl, H. Maxentropic reconstruction of Fourier and Laplace transforms under non-linear constraints. Appl. Math. Comput. 1995, 25, 117–126.
  10. Gamboa, F.; Gzyl, H. Maxentropic solutions of linear Fredholm equations. Math. Comput. Model. 1997, 25, 23–32.
  11. Gallon, S.; Gamboa, F.; Loubes, M. Functional Calibration estimation by the maximum entropy in the mean principle. 2013. arXiv:1302.1158 [math.ST].
  12. Csiszar, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 148–158.
  13. Csiszar, I. Generalized I-projection and a conditional limit theorem. Ann. Probab. 1984, 12, 768–793.
  14. Cherny, A.; Maslov, V. On minimization and maximization of entropy functionals in various disciplines. Theory Probab. Appl. 2003, 17, 447–464.
  15. Varadhan, R.; Gilbert, P. An R package for solving large system of nonlinear equations and for optimizing a high dimensional nonlinear objective function. J. Stat. Softw. 2009, 32, 1–26.
Figure 1. Reconstruction by MEM with exponential reference measure.
Figure 2. Reconstruction by MEM with Poisson reference measure.
Figure 3. Reconstruction with OME.