Beyond Moments: Extending the Maximum Entropy Principle to Feature Distribution Constraints

The maximum entropy principle introduced by Jaynes proposes that a data distribution should maximize the entropy subject to constraints imposed by the available knowledge. Jaynes provided a solution for the case when constraints were imposed on the expected value of a set of scalar functions of the data. These expected values are typically moments of the distribution. This paper describes how the method of maximum entropy PDF projection can be used to generalize the maximum entropy principle to constraints on the joint distribution of this set of functions.


Jaynes' Maximum Entropy Principle
The estimation of probability density functions (PDFs) is the cornerstone of classical decision theory as applied to real-world problems. The maximum entropy principle of Jaynes [1] proposes that the PDF should have maximum entropy subject to constraints imposed by the knowledge one has about the density. Let x be a set of N random variables, x = [x_1, x_2, ..., x_N]. The entropy of the distribution p(x) is given by

H{p} = −∫ p(x) log p(x) dx. (1)

Jaynes worked out the case when the knowledge about p(x) consists of the expected value of a set of K measurements. More precisely, he considered the K scalar functions φ_1(x), φ_2(x), ..., φ_K(x) and constrained the expected values:

∫ p(x) φ_k(x) dx = μ_k, k = 1, 2, ..., K. (2)

If φ_k(x) = Σ_{i=1}^N x_i^k, then (2) are moment constraints. The distribution maximizing (1) subject to (2) is

p(x) = exp( Σ_{k=1}^K λ_k φ_k(x) − λ_0 ),

where λ_0 is the log of the partition function:

λ_0 = log ∫ exp( Σ_{k=1}^K λ_k φ_k(x) ) dx.

The constants λ_k are determined by solving (2). This paper considers the more general case in which the constraint is imposed on the joint distribution of the measurements z = Φ(x) = [φ_1(x), φ_2(x), ..., φ_K(x)]. A candidate distribution p(x) induces a feature distribution

p_z(z) = ∫_{x ∈ M(z)} p(x) dx, (3)

where the integral is carried out on the level set or manifold given by

M(z) = {x : Φ(x) = z}. (4)

The constraint problem can then be re-stated as follows:

Problem 1. Given a known distribution p_z(z), maximize the entropy of p(x) subject to

∫_{x ∈ M(z)} p(x) dx = p_z(z), ∀z. (5)
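Jaynes' construction can be illustrated numerically. The sketch below is an assumption-laden toy in the spirit of the classic Brandeis dice problem: it solves for the single Lagrange multiplier by bisection so that the MaxEnt distribution over the six faces of a die has a prescribed mean of 4.5 (the die and target mean are illustrative choices, not from this paper):

```python
import math

# Jaynes' MaxEnt form for a single mean constraint on a die:
# p(i) proportional to exp(lam * i), i = 1..6, with lam chosen so that
# the expected face value equals a target mean.
def maxent_die(target_mean, lo=-10.0, hi=10.0, iters=200):
    faces = range(1, 7)

    def mean_for(lam):
        w = [math.exp(lam * i) for i in faces]
        Z = sum(w)  # partition function
        return sum(i * wi for i, wi in zip(faces, w)) / Z

    # E[i] is monotone increasing in lam, so bisection converges.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(lam * i) for i in faces]
    Z = sum(w)
    return [wi / Z for wi in w]

p = maxent_die(4.5)
print([round(pi, 4) for pi in p])
```

A target mean above 3.5 yields a positive multiplier, so the probabilities increase with the face value, as the exponential-family form requires.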

Significance
The main significance of maximum entropy PDF projection is the de facto creation of a statistical model through the extraction of features. Once a feature extraction z = Φ(x) has been identified, and it meets some mild requirements given below, a statistical model has been determined. This has a number of advantages, not the least of which is that the "art" of extracting features, i.e., signal processing, is well established, and many good methods exist to extract meaningful information from data. For example, the extraction of MFCC features for processing speech signals has been developed to approximate human hearing [6] and, therefore, with maximum entropy PDF projection, should lead to statistical data models that share some qualities with human perception. Before maximum entropy PDF projection, comparing feature extraction methods had to be done based on secondary factors such as classification results. Maximum entropy PDF projection allows a feature extraction method to be evaluated based on its corresponding statistical model.
The use of the maximum entropy principle assures the fairest means of comparing two statistical models derived from competing feature extraction methods. In most real-world applications, we cannot know p(x) and must be satisfied with estimating it from some training data. Suppose that we have a set of K training samples x_1, x_2, ..., x_K, and a number of proposed PDFs computed using (6) for various feature transformations z_i = Φ_i(x). Let these projected PDFs be denoted by p_i(x). We would like to determine which projected PDF (i.e., which feature vector) provides a "better" fit to the data. One approach would be to compare the PDFs based on the average log-likelihood L_i = (1/K) Σ_{n=1}^K log p_i(x_n), choosing the feature transformation that results in the largest value. However, likelihood comparison by itself is misleading, so one must also consider the entropy of the distribution, denoted by Q_i, which is the negative of the theoretical value of L_i. Distributions that spread the probability mass over a wider area have higher entropy since the average value of log p(x) is lower. The two concepts of Q and L are compared in Figure 1, in which we show three competing distributions: p_1(x), p_2(x), and p_3(x). The vertical lines represent the locations of the K training samples. If L_i is the average value of log p_i(x) at the training sample locations, then clearly L_2 > L_3 > L_1. However, choosing p_2(x) is very risky because it is over-adapted to the training samples. Clearly, p_2(x) has lower entropy since most of its probability mass is at places with higher likelihood. Therefore, it has achieved higher L at the cost of lower Q, a suspicious situation. On the other hand, Q_1 = Q_3, but L_3 > L_1. Therefore, p_3(x) has achieved higher L than p_1(x) without suffering lower Q, so choosing p_3(x) over p_1(x) is not risky.
If we always choose among models that have maximum possible entropy for the given choice of features, we are likely to obtain better features and better generative models.
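The L-versus-Q trade-off can be made concrete with a small numerical sketch. The training samples and the three Gaussian models below are hypothetical, chosen only to reproduce the qualitative situation of Figure 1: p2 is narrow and over-adapted (high L, low Q), while p1 and p3 share the same entropy but p3 is better centered on the data:

```python
import math

# Hypothetical training samples and three candidate Gaussian models.
samples = [0.8, 0.9, 1.0, 1.1, 1.2]
models = {"p1": (0.0, 4.0), "p2": (1.0, 0.04), "p3": (1.0, 4.0)}  # (mu, var)

def avg_loglik(mu, var):
    # Average log-likelihood L over the training samples.
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu)**2 / (2 * var)
               for x in samples) / len(samples)

def entropy(var):
    # Differential entropy Q of a Gaussian depends only on its variance.
    return 0.5 * math.log(2 * math.pi * math.e * var)

for name, (mu, var) in models.items():
    print(f"{name}: L = {avg_loglik(mu, var):+.3f}, Q = {entropy(var):+.3f}")
```

The narrow model p2 wins on L but pays with much lower Q; p3 beats p1 on L while keeping the same Q, so it is the safe choice.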

MaxEnt PDF Projection
The solution to Problem 1 is based on PDF projection [10]. In PDF projection, one is given a feature distribution p_z(z) and constructs a PDF as follows:

p_p(x; p_0, Φ, p_z) = [ p_0(x) / p_0,z(Φ(x)) ] p_z(Φ(x)), (6)

where p_0(x) is a reference distribution meeting some mild constraints [10], and p_0,z(z) is the corresponding distribution imposed by p_0(x) on the measurements z, i.e., p_0,z(z) is Equation (3) applied to p_0(x). It can be shown that:

1. Equation (6) is a valid PDF, i.e., it integrates to 1 over x;
2. Equation (6) satisfies constraint (5), i.e., it generates the feature distribution p_z(z) through Φ;
3. All distributions meeting (5) can be written in the form (6) for some p_0(x).
The last item in the list indicates that, to solve Problem 1, it is only necessary to select the reference distribution p 0 (x) for maximum entropy (MaxEnt).
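As a quick numerical sanity check of the projected PDF (6), consider x in R^2 with reference p_0 = N(0, I) and feature z = x_1 + x_2, so that p_0,z = N(0, 2); the desired feature distribution p_z = N(3, 0.25) is an arbitrary assumption for illustration. The constructed PDF should integrate to 1:

```python
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Projected PDF of Equation (6) for z = x1 + x2.
def p_projected(x1, x2):
    z = x1 + x2
    p0 = gauss(x1, 0, 1) * gauss(x2, 0, 1)   # reference p0(x), standard normal
    p0z = gauss(z, 0, 2)                     # feature PDF imposed by p0
    pz = gauss(z, 3, 0.25)                   # assumed desired feature PDF
    return p0 / p0z * pz

# Numerically verify that the projected PDF integrates to 1 over R^2.
h = 0.05
steps = int(12 / h)
total = 0.0
for i in range(steps):
    for j in range(steps):
        x1 = -3 + i * h
        x2 = -3 + j * h
        total += p_projected(x1, x2) * h * h
print(round(total, 3))
```

The mass concentrates near the line x_1 + x_2 = 3 even though the reference is centered at the origin, illustrating how (6) re-weights the reference along the level sets of the feature.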
To understand the solution to this problem, it is useful to consider the sampling procedure for (6). To sample from distribution (6), one draws a sample z* from PDF p_z(z); then, x is drawn from the set M(z*), defined in (4). Note, however, that to conform to (6), it is necessary to draw the sample x from M(z*) with probability proportional to the value of p_0(x). The distribution of x on the manifold M(z*) may be thought of as the conditional distribution p(x|z*), and it is proportional to p_0(x). It is in fact

p(x|z*) = p_0(x) / p_0,z(z*), x ∈ M(z*). (7)

It can be verified that (7) integrates to 1 on the manifold M(z*). The entropy of (6) can be decomposed as the entropy of p_z(z) plus the expected value of the entropy of p(x|z) (see Equation (8) in [8]):

H{p_p(x; p_0, Φ, p_z)} = H{p_z} + E_z{ H{p(x|z)} }.

Maximizing this quantity seems daunting, but there is one condition under which H{p(x|z)} has the maximum entropy for all z, and that is when p(x|z) is the uniform distribution for all z. This, in turn, is achieved when p_0(x) has a constant value on any manifold M(z). This process of selecting p_0(x) for maximum entropy is called maximum entropy PDF projection [8,9]. The maximizing reference distribution is written

p*_0 = arg max_{p_0} H{p_p(x; p_0, Φ, p_z)},

and the MaxEnt distribution is written

p*_p(x; Φ, p_z) = p_p(x; p*_0, Φ, p_z), (8)

which is the unique distribution that solves Problem 1. In order that it is possible to select p_0(x) for MaxEnt, the feature transformation Φ(x) must be such that the uniform distribution can be defined on M(z) for any z. Thus, M(z) must be bounded and integrable. This condition is easily met if the feature z contains information about the size of x, so that when z is fixed to a finite value, x has a fixed norm. To state this formally, there must exist a function f(z) such that f(Φ(x)) = ‖x‖ for some valid norm ‖x‖ on the range of x.
Once this condition is met, p*_0(x) is any distribution that is constant on any level set M(z). This happens if there exists a function c such that

p*_0(x) = c(‖x‖).

Interestingly, any p*_0(x) meeting these constraints results in the same distribution (6) [8]. This means that, although p*_0(x) is not unique, p*_p(x; Φ, p_z) is unique; it must be unique if it is the maximum entropy PDF. The above conditions can be easily met by inserting an energy statistic into the feature set Φ(x) and defining a reference distribution that depends on x only through this energy statistic. The energy statistic is a scalar statistic from which it is possible to compute a valid norm on the range of x, denoted by X. In summary, the simplest way to solve for the MaxEnt projected PDF involves these three steps:

1. Identify a norm ‖x‖ valid in X. A norm must meet the properties of scalability, ‖ax‖ = |a| ‖x‖; the triangle inequality, ‖x + y‖ ≤ ‖x‖ + ‖y‖; and ‖x‖ = 0 if and only if x = 0.

2. Identify a scalar statistic (energy statistic) t(x) from which it is possible to compute ‖x‖.

3. Use a reference hypothesis depending only on t(x).
The above will be demonstrated for three cases of X in Sections 3.1–3.3. The data generation process for MaxEnt PDF projection, corresponding to distribution (8), does not depend on X and is the following:

1. From the known distribution p_z(z), draw a sample denoted by z*.

2. Identify the set of all samples x mapping to z*, denoted by M(z*).

3. Draw a sample x from this set uniformly, so that no member of M(z*) is more likely to be chosen than another.
The maximum entropy nature of the solution can be recognized in the uniform sampling on the level set M(z*). The last step above is called uniform manifold sampling (UMS) [9]. The data generation process for the three cases of X is provided in Sections 3.1–3.3.

Examples
The implementation of MaxEnt PDF projection depends strongly on the range of the input data x, denoted by X . In this section, examples are provided for three important cases of X .

3.1. Unbounded Data, X = R^N
The 2-norm ‖x‖_2 is valid in R^N and can be computed from the total energy

t_2(x) = Σ_{n=1}^N x_n² = ‖x‖_2².

The Gaussian reference hypothesis can be written in terms of t_2(x):

p*_0(x) = (2π)^{−N/2} exp(−t_2(x)/2), (9)

so naturally p_0(x) will have a constant value on any manifold M(z). Note that it is not necessary to include t_2(x) explicitly in the feature set; it is only necessary that the 2-norm can be computed from z. The distribution p*_0,z(z) can be determined in closed form for some feature transformations [11,12]. For others, the moment generating function can be written in closed form, which allows the saddle point approximation to be used to compute p*_0,z(z) [11]. More on this will be presented in Section 4.1. An important case where a closed-form solution exists is a linear transformation combined with the total energy:

z = [ A′x, t_2(x) ].

This case is covered in detail in ([8], Section IV.C, p. 2821) and in ([9], Section III.B, p. 2459).
The following simple example demonstrates the main points of this case. Assume an input data dimension of N = 3 and a feature transformation consisting of the sample mean and sample variance:

μ̂ = (1/N) Σ_{i=1}^N x_i,  v̂ = (1/(N−1)) Σ_{i=1}^N (x_i − μ̂)².

Note that t_2(x) can be computed from (μ̂, v̂), since t_2(x) = (N−1)v̂ + Nμ̂², which satisfies the requirement that the 2-norm of x can be computed from z.
Under the assumption that x is distributed according to the standard Normal distribution (9), μ̂ will have mean 0 and variance 1/N, and v̂ will have the chi-squared distribution with N − 1 degrees of freedom and scaling 1/(N−1), which is given by

p_0(v̂) = k (k v̂)^{k/2−1} e^{−k v̂/2} / [ 2^{k/2} Γ(k/2) ],

where k = N − 1. Furthermore, μ̂ and v̂ are statistically independent. Therefore, p*_0,z(z) = p_0(μ̂) · p_0(v̂). For the given feature distribution p_z(z), we assume that the components of z are independent and Gaussian. The resulting MaxEnt projected PDF p*_p(x; Φ, p_z) is shown in Figure 2 for a slice of (x_2, x_3) at x_1 = 0.0. The density values shown in the figure, summed over all three axes and properly scaled, added to a value of 0.9999999998, which validates with numerical integration that p*_p(x; Φ, p_z) is a density. Notice that the probability is concentrated on a circular region. This can be understood in terms of the sampling procedure given below. To sample from p*_p(x; Φ, p_z), we first draw a sample of z from p_z(z), denoted by z*, which provides values for the sample mean μ* and sample variance v*. Then, x must be drawn uniformly from the manifold {x : μ̂ = μ*, v̂ = v*}, which imposes conditions on the sample mean and variance. This is easily accomplished if we note that the sample mean condition is met for any x of the form

x = μ* 1 + B u, (10)

where 1 is the N-vector of ones, u is an arbitrary vector in R^{N−1}, and B is an N × (N − 1) orthonormal matrix spanning the space orthogonal to the vector [1, 1, ..., 1]′.
To meet the second (variance) condition, it is necessary that

‖u‖² = (N − 1) v*.

This condition defines a hypersphere in (N − 1) dimensions, which explains the circular region in Figure 2. This hypersphere is sampled uniformly by drawing N − 1 independent Gaussian random variables, collected in the vector u, and then scaling u so that ‖u‖² = (N − 1) v*. Then, x is constructed using (10). Samples drawn in this manner are shown on the right side of Figure 2. To agree with the left side of the figure, only samples with |x_1| < 0.01 are plotted. Please see the above-cited references for the treatment of general linear transformations.
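The mean/variance manifold sampling above can be sketched in a few lines. For N = 3, an orthonormal basis B for the subspace orthogonal to [1, 1, 1]′ can be written by hand; the target values μ* = 2 and v* = 0.5 are arbitrary assumptions for illustration:

```python
import math
import random

random.seed(1)
N = 3

# Orthonormal N x (N-1) basis B for the subspace orthogonal to [1,1,1]',
# chosen by hand for N = 3.
B = [[1 / math.sqrt(2), 1 / math.sqrt(6)],
     [-1 / math.sqrt(2), 1 / math.sqrt(6)],
     [0.0, -2 / math.sqrt(6)]]

def sample_on_manifold(mu_star, v_star):
    # Draw u uniformly on the hypersphere ||u||^2 = (N-1) v*.
    u = [random.gauss(0, 1) for _ in range(N - 1)]
    norm = math.sqrt(sum(ui**2 for ui in u))
    scale = math.sqrt((N - 1) * v_star) / norm
    u = [ui * scale for ui in u]
    # x = mu* 1 + B u, Equation (10)
    return [mu_star + sum(B[i][k] * u[k] for k in range(N - 1))
            for i in range(N)]

x = sample_on_manifold(2.0, 0.5)
mean = sum(x) / N
var = sum((xi - mean)**2 for xi in x) / (N - 1)
print(round(mean, 6), round(var, 6))
```

Because the columns of B are orthonormal and orthogonal to the ones vector, every sample reproduces the requested sample mean and variance exactly.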

3.2. Positive Data, X = P^N

Let x have positive-valued elements, so x ranges in the positive quadrant of R^N, denoted by P^N. This holds whenever spectral or intensity data are processed. The appropriate norm in this space is the 1-norm

‖x‖_1 = Σ_{n=1}^N x_n.

To satisfy the conditions for maximum entropy, it must be possible to compute the statistic t_1(x) = Σ_{n=1}^N x_n from the features. The exponential reference hypothesis can be written in terms of t_1(x):

p*_0(x) = exp(−t_1(x)), (11)

so naturally p_0(x) will have a constant value on any manifold M(z), and it is the appropriate reference hypothesis for maximum entropy. Including t_1(x) explicitly in the feature set is only one way to ensure that M(z) is compact; it is only necessary that the 1-norm can be computed from z. An important feature extraction is the linear transformation

z = A′x.

Note that it is necessary that the statistic t_1(x) can be computed from z, which can be accomplished, for example, by making the first column of A constant. This case is covered in detail in ([8], Section IV.B, p. 2820) and in ([9], Section IV, p. 2460). Sampling x is accomplished by drawing a sample z* from p_z(z) and then drawing a sample x uniformly from the set {x : A′x = z*}.
The following simple example demonstrates the main theoretical concepts. We assume a data dimension of N = 2 so that the distribution can be visualized as an image. The feature transformation is simply the sum of the samples:

z = x_1 + x_2.

Under the exponential reference hypothesis, the feature distribution is chi-squared with 2N degrees of freedom and scaling 1/2:

p*_0,z(z) = z^{k/2−1} e^{−z} / Γ(k/2),

where k = 2N. For the given feature distribution p_z(z), we assume a Gaussian with a given mean μ_z and variance v_z. The MaxEnt projected PDF, given by p*_p(x; Φ, p_z), is shown in Figure 3. The density values shown in the figure, when properly scaled, summed to a value of 0.9998, which validates with numerical integration that p*_p(x; Φ, p_z) is a density. Note that the distribution is concentrated on the line x_1 + x_2 = μ_z = 2, and it is flat on this line, as would be expected for maximum entropy. To sample from this distribution, we first draw a sample z* from p_z(z) and then draw a sample x on the line given by x_1 + x_2 = z*. This can be done by sampling x_1 uniformly in [0, z*] and then letting x_2 = z* − x_1. Samples drawn in this way are shown on the right side of Figure 3.
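The sampling procedure for this example is only a few lines of code. The feature mean μ_z = 2 follows the text, while the variance v_z = 0.01 is an arbitrary assumption:

```python
import random

random.seed(2)

# UMS sketch for N = 2 positive data with z = x1 + x2: draw z* from the
# assumed Gaussian feature PDF, then sample uniformly on the segment
# {x >= 0 : x1 + x2 = z*}.
def sample_x(mu_z=2.0, v_z=0.01):
    z_star = random.gauss(mu_z, v_z**0.5)
    x1 = random.uniform(0.0, z_star)
    return x1, z_star - x1

xs = [sample_x() for _ in range(5000)]
# Every sample lies in the positive quadrant, near the line x1 + x2 = 2.
print(all(x1 >= 0 and x2 >= 0 for x1, x2 in xs))
```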
This example generalizes to higher dimension and to arbitrary linear transformations z = A x for full-rank N × M matrix A. In this case, p * 0,z (z) is not chi-square, and in fact is not available in closed-form. However, the moment-generating function is available in closed-form so the saddle point approximation may be used (See Section IV.A, p. 2245 in [11]). Samples of x are drawn by drawing a sample z * from p z (z) and then sampling uniformly in the set {x : A x = z * }. At high dimensions, this requires a form of Gibbs sampling called hit and run (see Section IV, p. 2460 in [9]).
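A minimal hit-and-run sketch for a set of the form {x ≥ 0 : A′x = z*} follows; the constraint values and the hand-derived null-space directions are illustrative assumptions, not from the paper. Each step picks a random direction within the constraint set, finds the interval of moves that keeps x non-negative, and jumps to a uniform point in that interval:

```python
import random

random.seed(3)

# Illustrative setup (assumptions): N = 4, with constraints
#   x1 + x2 + x3 + x4 = 4  and  x1 + x3 = 2.
# Orthogonal null-space directions of the constraint matrix, by hand:
D = [(1.0, 0.0, -1.0, 0.0), (0.0, 1.0, 0.0, -1.0)]

def hit_and_run(x, n_steps=2000):
    for _ in range(n_steps):
        # Random direction within the null space of the constraints.
        a, b = random.gauss(0, 1), random.gauss(0, 1)
        d = [a * D[0][i] + b * D[1][i] for i in range(4)]
        # Feasible interval for t such that x + t d >= 0 componentwise.
        t_lo, t_hi = -float("inf"), float("inf")
        for xi, di in zip(x, d):
            if di > 1e-12:
                t_lo = max(t_lo, -xi / di)
            elif di < -1e-12:
                t_hi = min(t_hi, -xi / di)
        t = random.uniform(t_lo, t_hi)
        x = [xi + t * di for xi, di in zip(x, d)]
    return x

x = hit_and_run([1.0, 1.0, 1.0, 1.0])
print(round(sum(x), 6), round(x[0] + x[2], 6))
```

Because every step moves within the null space of A′, both linear constraints are preserved exactly while the walk explores the feasible polytope.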

3.3. Unit Hypercube, X = U^N

Let x have elements limited to 0 ≤ x_i ≤ 1. This case is common when working with neural networks. This range is called the unit hypercube, denoted by U^N. The uniform reference hypothesis produces maximum entropy, and no norm-producing energy statistic is needed. Naturally, p_0(x) will have a constant value on any manifold M(z).
The following simple example demonstrates the main theoretical concepts. We assume a data dimension of N = 2 so that the distribution can be visualized as an image. The feature transformation is simply the sum of the samples:

z = x_1 + x_2.

For this case, the uniform distribution provides maximum entropy:

p*_0(x) = 1. (12)

Under this reference hypothesis, the feature distribution is Irwin–Hall, given by

p*_0,z(z) = 1/(2(N−1)!) Σ_{n=0}^N (−1)^n (N choose n) (z − n)^{N−1} sign(z − n),

where sign(0) = 0. For N = 2, this is a triangular distribution: p*_0,z(z) = z for 0 ≤ z ≤ 1, and p*_0,z(z) = 2 − z for 1 ≤ z ≤ 2. For the given feature distribution p_z(z), we assume a Gaussian with a given mean μ_z and variance v_z. The MaxEnt projected PDF, given by p*_p(x; Φ, p_z), is shown in Figure 4. The density values shown in the figure, when properly scaled, summed to a value of 0.999, which validates with numerical integration that p*_p(x; Φ, p_z) is a density. Note that the distribution is concentrated on the line x_1 + x_2 = μ_z, and it is flat on this line, as would be expected for maximum entropy. To sample from this distribution, we first draw a sample z* from p_z(z) and then draw a sample x on the line given by x_1 + x_2 = z*. This can be done by finding where the line intercepts the boundary of the unit square, and sampling uniformly in the interval between the intercepts. Note that this sampling differs from the previous example as a result of the upper bound at 1.
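This boundary-aware line sampling can be sketched as follows; μ_z = 1 and v_z = 0.01 are assumed values for illustration:

```python
import random

random.seed(4)

# UMS sketch on the unit square with z = x1 + x2: draw z* from the
# assumed Gaussian feature PDF, then sample uniformly on the part of the
# line x1 + x2 = z* that lies inside [0,1]^2.
def sample_x(mu_z=1.0, v_z=0.01):
    z_star = random.gauss(mu_z, v_z**0.5)
    # The line enters the square at x1 = max(0, z*-1) and leaves at min(1, z*).
    lo, hi = max(0.0, z_star - 1.0), min(1.0, z_star)
    x1 = random.uniform(lo, hi)
    return x1, z_star - x1

xs = [sample_x() for _ in range(5000)]
print(all(0.0 <= x <= 1.0 for pair in xs for x in pair))
```

Clipping the sampling interval to the square's boundary is exactly the difference from the positive-data case, where x_1 could range over the whole interval [0, z*].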
This example generalizes to higher dimension and to arbitrary linear transformations z = A x for full-rank N × M matrix A. In this case, p * 0,z (z) is no longer Irwin-Hall and in fact is not available in closed-form. However, the moment-generating function is available in closed-form so the saddle point approximation may be used (see Appendix in [13]). Samples of x are drawn by drawing a sample z * from p z (z) and then sampling uniformly in the set {x : A x = z * }. At high dimensions, this requires a form of Gibbs sampling called hit and run (see p. 2465 in [9]).

Implementation Issues
Implementing (8) seems like a daunting numerical task, since p*_0(x) is some canonical distribution, for which a real data sample x normally lies in the far tails of both p*_0(x) and p*_0,z(z). However, if the distributions are known exactly and are represented in the log domain,

log p*_p(x; Φ, p_z) = log p*_0(x) − log p*_0,z(Φ(x)) + log p_z(Φ(x)), (13)

then the difference log p*_0(x) − log p*_0,z(Φ(x)) typically remains within very reasonable limits. In some cases, terms in log p*_0(x) and log p*_0,z(z) cancel, leaving (13) only weakly dependent on x (for example, see Section IV.A, p. 2820 in [8]).
Evaluating log p*_0(x) is mostly trivial since p*_0(x) is normally a canonical distribution, such as the Gaussian, exponential, or uniform. Calculating log p*_0,z(z), however, remains the primary challenge in maximum entropy PDF projection. When evaluating p*_0,z(z) directly seems daunting, there are several ways to overcome the problem.

1. Saddle Point Approximation. If p*_0,z(z) is not available in closed form, the moment-generating function (MGF) might be tractable. This allows the saddle point approximation (SPA) to be used (see Section III in [11]). Note that the term "approximation" is misleading: the SPA approximates the shape of the MGF on a contour, not the absolute value, so the SPA expression for log p*_0,z(z) remains very accurate, even in the far tails, where p*_0,z(z) itself cannot be evaluated in machine precision. Examples of this include general linear transformations of exponential and chi-squared random variables (see Sections III.C and IV in [11]), general linear transformations of uniform random variables (Appendix in [13]), a set of linear-quadratic forms [14], and order statistics [15].

2. Floating reference hypothesis. There are conditions under which the MaxEnt reference hypothesis p*_0(x) is not unique, so it can depend on a parameter θ, and we write p*_0(x; θ). An example is when the feature z contains the sample mean and sample variance (see the example in Section 3.1). In this case, a Gaussian reference hypothesis p*_0(x; θ) can be modified to have any mean and variance, θ = [μ_0, σ_0²], and can serve as the MaxEnt reference hypothesis with no change at all in the resulting projected PDF. In other words, (13) is independent of θ; this can be verified by cancelling terms. Therefore, there is no reason that θ cannot be made to track the data, that is, let μ_0 = μ̂(x), σ_0² = σ̂²(x). By doing this, p*_0,z(z) will track z, allowing simple approximations based on the central limit theorem to be used to approximate p*_0,z(z).

3. Chain Rule. When p*_0,z(z) cannot be derived for a feature transformation, it may be possible to break the feature transformation into stages, where each stage can be easily analyzed. The next section is devoted to this.
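Item 1 can be illustrated with a case where the exact answer is known: the sum of N iid unit exponentials has an exact Gamma(N, 1) PDF, and the SPA built from its cumulant generating function matches the exact log-PDF up to a tiny constant offset, even deep in the tail. This is a sketch of the mechanism only, not the general linear-transformation case treated in [11]:

```python
import math

N = 20  # sum of N iid Exp(1) variables; the exact PDF is Gamma(N, 1)

def log_pdf_exact(z):
    return (N - 1) * math.log(z) - z - math.lgamma(N)

def log_pdf_spa(z):
    # CGF of Exp(1) is K(s) = -log(1 - s); the sum multiplies it by N.
    s_hat = 1.0 - N / z                 # saddle point solving K'(s) = z
    K = -N * math.log(1.0 - s_hat)
    K2 = N / (1.0 - s_hat)**2           # K''(s_hat)
    return K - s_hat * z - 0.5 * math.log(2 * math.pi * K2)

for z in (5.0, 20.0, 100.0):            # z = 100 is deep in the right tail
    print(z, round(log_pdf_spa(z) - log_pdf_exact(z), 5))
```

The log-domain error is the same constant at the mode and twenty standard deviations out, which is why the SPA is trustworthy exactly where direct evaluation underflows.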
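Item 2 can be checked numerically for the mean/variance feature of Section 3.1: the term log p*_0(x) − log p*_0,z(z) comes out the same for any reference mean and variance θ = [μ_0, σ_0²]. The data values below are arbitrary, and the check is a sketch under the iid Gaussian reference assumption:

```python
import math

def log_ref_terms(x, mu0, var0):
    # log p0(x) - log p0,z(z) for z = (sample mean, sample variance),
    # under the reference x_i ~ iid N(mu0, var0).
    N = len(x)
    k = N - 1
    m = sum(x) / N
    v = sum((xi - m)**2 for xi in x) / k
    lp0x = sum(-0.5 * math.log(2 * math.pi * var0)
               - (xi - mu0)**2 / (2 * var0) for xi in x)
    # Sample mean: N(mu0, var0/N)
    lp_m = -0.5 * math.log(2 * math.pi * var0 / N) - N * (m - mu0)**2 / (2 * var0)
    # Sample variance: k*v/var0 is chi-squared with k degrees of freedom
    w = k * v / var0
    lp_v = ((k / 2 - 1) * math.log(w) - w / 2 - (k / 2) * math.log(2)
            - math.lgamma(k / 2)) + math.log(k / var0)
    return lp0x - (lp_m + lp_v)

x = [0.3, -1.2, 2.5, 0.7, -0.4]
print(round(log_ref_terms(x, 0.0, 1.0), 6),
      round(log_ref_terms(x, 5.0, 9.0), 6))
```

The μ_0 and σ_0² terms cancel between the numerator and denominator, so the reference can be allowed to "float" with the data at no cost.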

Chain Rule
The primary numerical difficulty in implementing (8) is the calculation of p*_0,z(z). Solutions for many of the most useful feature transformations are available [9,11–13]. However, in many real-world applications, such as neural networks, the feature transformation cannot be easily written in a compact form z = [φ_1(x), φ_2(x), ..., φ_K(x)]. Instead, it consists of a multi-stage transformation, for example, y = T_1(x), w = T_2(y), and z = T_3(w). The individual stages T_m(·) could be the layers of a neural network. In this case, it is best to apply (8) recursively to each stage. This means that the distribution of the first-stage features p(y) is written using (6) with y taking the role of the input data, and so forth. This results in the chain-rule form:

p_p(x) = [ p*_0,x(x) / g_1(y) ] [ p*_0,y(y) / g_2(w) ] [ p*_0,w(w) / g_3(z) ] p_z(z),

where y = T_1(x), w = T_2(y), and z = T_3(w); p*_0,x(x), p*_0,y(y), p*_0,w(w) are canonical reference hypotheses used at each stage, for example (9), (11), and (12), depending on the ranges of x, y, and w, respectively; and g_m(·) denotes the distribution imposed on the output of stage m by that stage's reference hypothesis.
To understand the importance of the chain rule, consider how we would compute (6) without it. Let T(x) be the combined transformation and let p*_0(x) be one of the canonical reference distributions. Consider the difficulty in deriving p*_0,z(z): at each stage, the distribution of the output feature becomes more and more intractable, and trying to estimate p*_0,z(z) is futile because a canonical reference distribution is generally completely unrealistic as a PDF for real data. Furthermore, p*_0,z(z) is more often than not evaluated in the far tails of the distribution. With the chain rule, however, we can assume a suitable canonical reference hypothesis at the start of each stage, and we only need to derive the feature distribution imposed on the output of that stage.
As long as the reference hypothesis used at each stage meets the requirements stated in Section 2.1, the chain as a whole will indeed produce the desired MaxEnt projected PDF: the PDF with maximum entropy among all PDFs that generate the desired output feature distribution p(z) through the combined transformation [8]. An example of the application of the chain rule is the computation of Mel frequency cepstral coefficients (MFCCs), commonly used in speech processing. Let us consider a frame of data of length N, denoted by x. The processing is broken into the following stages:

1. The first step, denoted by y = T_1(x), is to convert x into N/2 + 1 magnitude-squared discrete Fourier transform (DFT) bins. Under the standard Gaussian assumption (9), the elements of y are independent and have chi-squared statistics (see Section VI.D.1, pp. 47–48 in [12]).

2. The second step is to sum the energy in a set of K Mel-spaced band functions, resulting in a set of K band energies. This can be written using the (N/2 − 1) × K matrix A as the linear transformation w = A′y. This feature transformation is explained in Section 3.2 above, so an exponential reference distribution can be assumed for y. Care must be taken that the K band functions add to a constant; this ensures that the energy statistic is "contained in the features".

3. The next step is to compute the log of the K band energies, u = log(w). This is a 1:1 transformation, for which PDF projection simplifies to computing the determinant of the transformation's Jacobian matrix (see Section VI.A, p. 46 in [12]).

4. The last step is the discrete cosine transform (DCT), which can be written as a linear transformation z = C u. If some DCT coefficients are discarded, then the transformation must be analyzed as in Section 3.1 above by including the energy statistic t(u) = u′u.
This example illustrates that complex feature transformations can be easily analyzed if broken into simple steps. More on the above example can be found in Sections V and VI in [8].
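A miniature two-stage chain of this kind can be verified numerically. In the sketch below (with an assumed Gaussian feature distribution on u), stage one maps positive data x in P² to w = x_1 + x_2 under the exponential reference (11), and stage two applies the 1:1 map u = log w, handled by its Jacobian; the chained projected PDF should integrate to 1 over P²:

```python
import math

def gauss_logpdf(u, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (u - mu)**2 / (2 * var)

# Two-stage chain on positive data x in P^2 (toy sketch; the N(0, 0.04)
# feature distribution on u is an assumption):
#   stage 1: w = x1 + x2 under the exponential reference, so p0,w is Gamma(2,1)
#   stage 2: u = log w, a 1:1 map handled by its Jacobian |du/dw| = 1/w
def log_p_chain(x1, x2):
    w = x1 + x2
    lp0x = -(x1 + x2)                    # log exp(-t1(x)), Equation (11)
    lp0w = math.log(w) - w               # log of the Gamma(2, 1) pdf
    u = math.log(w)
    lpu = gauss_logpdf(u, 0.0, 0.04)     # assumed feature PDF on u
    ljac = -math.log(w)                  # log |du/dw|
    return (lp0x - lp0w) + (lpu + ljac)

# Numerical check that the chained projected PDF integrates to 1 on P^2.
h = 0.005
total = sum(math.exp(log_p_chain((i + 0.5) * h, (j + 0.5) * h)) * h * h
            for i in range(500) for j in range(500))
print(round(total, 3))
```

Each bracketed factor is one link of the chain; the canonical reference only has to be realistic within its own stage, never for the composite transformation.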

Large-N Conditional Distributions and Applications
When the feature value z* is fixed, sampling x on the manifold M(z*), called UMS, has some interesting interpretations relative to maximum entropy. Let the conditional distribution be written p(x|z*). Notice that p(x|z*) is not a proper distribution since all of the probability mass exists on the manifold M(z*), which has zero volume. Writing down p(x|z*) in closed form or determining its mean is intractable. It is useful, however, to know p(x|z*) because, for example, the mean of p(x|z*) is a point estimate of x based on z*, a type of MaxEnt feature inversion. Depending on the range of x, as exemplified by the three cases in Sections 3.1–3.3, p(x|z*) can be approximated by a surrogate distribution (see p. 2461 in [9]). The surrogate distribution is a proper distribution that (a) has its probability mass concentrated near M(z*), (b) has a constant value on M(z*), and (c) has its mean value on the manifold, so that x̄ ∈ M(z*). The surrogate distribution therefore meets the same conditions as p(x|z*) but is a proper distribution. The mean of the surrogate distribution is a very close approximation to the mean of p(x|z*), which can be called the centroid of M(z*); unlike the true mean, it can be computed. In Sections 3.1–3.3, the surrogate distribution is Gaussian, exponential, and truncated exponential, respectively. These are the MaxEnt distributions under the applicable constraints. It was shown, for example, when the range of x is the positive quadrant of R^N, that the centroid corresponds to the classical maximum entropy feature inversion approach for a dimension-reducing linear transformation of intensity data, used, for example, to sharpen images blurred by a point-spread function [9]. The method, however, is more general because it can be adapted to different ranges of x [9].
An important application of MaxEnt PDF projection is classification, in which a sample x is assigned to one of M classes according to the Bayesian classifier

ĥ = arg max_m p(H_m) p(x|H_m),

where p(H_m) is the prior class probability, and p(x|H_m) is a PDF estimate for class hypothesis H_m. For the classification problem, there are many classifier topologies for using (8) to construct p(x|H_m).

1. Class-specific features. One can specify a different feature transformation per class, z_m = Φ_m(x). This amounts to comparing the likelihood ratio between class hypothesis H_m and the reference distribution, computed using a class-dependent feature [16].