Abstract
The maximum entropy principle introduced by Jaynes proposes that a data distribution should maximize the entropy subject to constraints imposed by the available knowledge. Jaynes provided a solution for the case when constraints were imposed on the expected value of a set of scalar functions of the data. These expected values are typically moments of the distribution. This paper describes how the method of maximum entropy PDF projection can be used to generalize the maximum entropy principle to constraints on the joint distribution of this set of functions.
1. Introduction
1.1. Jaynes’ Maximum Entropy Principle
The estimation of probability density functions (PDFs) is the cornerstone of classical decision theory as applied to real-world problems. The maximum entropy principle of Jaynes [1] proposes that the PDF should have maximum entropy subject to constraints imposed by the knowledge one has about the density. Let $\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$ be a set of N random variables with joint PDF $p(\mathbf{x})$. The entropy of the distribution is given by
$$H\{p\} = -\int p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x}. \qquad (1)$$
Jaynes worked out the case when the knowledge about $\mathbf{x}$ consists of the expected values of a set of K measurements. More precisely, he considered K scalar functions $f_k(\mathbf{x})$, $1 \le k \le K$, and constrained their expected values:
$$E\{f_k(\mathbf{x})\} = \bar{f}_k, \quad 1 \le k \le K. \qquad (2)$$
If the functions $f_k$ are powers of the data, then (2) are moment constraints.
1.2. Feature Distribution Constraints
Jaynes’ results had initial applications in statistical mechanics and thermodynamics [2], and have since found applications in a wide range of disciplines [2,3,4,5]. However, we would like to extend the results by replacing constraints (2) with constraints that are more meaningful in real-world inference problems. Instead of knowing just the average values of the measurements, suppose we knew the joint distribution of the feature vector $\mathbf{z} = T(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_K(\mathbf{x})]$, denoted by $g(\mathbf{z})$. This carries more information than the average value of each measurement $f_k(\mathbf{x})$. Because the number of measurements K is small compared with the dimension N of $\mathbf{x}$, it is feasible to estimate $g(\mathbf{z})$ from a set of training samples using kernel-based PDF estimation methods, for example. This constraint is more general and can be adapted to produce something similar to Jaynes’ constraints (2) if the marginal measurement distributions are assumed independent and Gaussian with means $\bar{f}_k$. This has immediate applications in a wide range of fields, for example in speech analysis and recognition, where $\mathbf{z}$ could be the MEL frequency cepstrum coefficients (MFCC) [6] extracted from the time-series data $\mathbf{x}$, or in neural networks, where $\mathbf{z}$ could be the output of a network.
Note that the distribution $g(\mathbf{z})$ can be obtained from $p(\mathbf{x})$ by marginalization:
$$g(\mathbf{z}) = \int_{\mathcal{M}(\mathbf{z})} p(\mathbf{x}) \, d\mathbf{x}, \qquad (3)$$
where the integral is carried out on the level set or manifold given by
$$\mathcal{M}(\mathbf{z}) = \{\mathbf{x} : T(\mathbf{x}) = \mathbf{z}\}. \qquad (4)$$
The constraint problem can then be re-stated as follows:
Problem 1.
Given a known feature distribution $g(\mathbf{z})$, maximize the entropy of $p(\mathbf{x})$ subject to
$$\int_{\mathcal{M}(\mathbf{z})} p(\mathbf{x}) \, d\mathbf{x} = g(\mathbf{z}) \quad \text{for all } \mathbf{z}. \qquad (5)$$
The solution to this problem is called maximum entropy PDF projection [7,8,9].
1.3. Significance
The main significance of maximum entropy PDF projection is the de facto creation of a statistical model through the extraction of features. Once a feature extraction meeting some mild requirements (given below) has been identified, a statistical model has been determined. This has a number of advantages, not the least of which is that the “art” of extracting features, i.e., signal processing, is well established, and many good methods exist to extract meaningful information from data. For example, the extraction of MFCC features for processing speech signals has been developed to approximate human hearing [6], and, therefore, with maximum entropy PDF projection, should lead to statistical data models that share some qualities with human perception. Before maximum entropy PDF projection, feature extraction methods had to be compared based on secondary factors such as classification results. Maximum entropy PDF projection allows a feature extraction method to be evaluated based on its corresponding statistical model.
The use of the maximum entropy principle assures the fairest means of comparing two statistical models derived from competing feature extraction methods. In most real-world applications, we cannot know $p(\mathbf{x})$, and must be satisfied with estimating it from some training data. Suppose that we have a set of K training samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$, and have a number of proposed PDFs computed using (6) for various feature transformations. Let these projected PDFs be denoted by $G_1(\mathbf{x}), G_2(\mathbf{x}), \ldots$. We would like to determine which projected PDF (i.e., which feature vector) provides a “better” fit to the data. One approach would be to compare the PDFs based on the average log-likelihood
$$L_i = \frac{1}{K} \sum_{k=1}^{K} \log G_i(\mathbf{x}_k),$$
choosing the feature transformation that results in the largest value. However, likelihood comparison by itself is misleading, so one must also consider the entropy $Q_i$ of the distribution, which is the negative of the theoretical (expected) value of $\log G_i(\mathbf{x})$. Distributions that spread the probability mass over a wider area have higher entropy since the average value of $\log G_i(\mathbf{x})$ is lower. The two concepts of Q and L are compared in Figure 1, in which we show three competing distributions, $G_1$, $G_2$, and $G_3$. The vertical lines represent the locations of the K training samples. If $L_i$ is the average value of $\log G_i$ at the training sample locations, then clearly $L_1 > L_2$ and $L_1 > L_3$. However, choosing $G_1$ is very risky because it is over-adapted to the training samples. Clearly, $G_1$ has lower entropy since most of the probability mass is at places with higher likelihood. Therefore, it has achieved higher L at the cost of lower Q, a suspicious situation. On the other hand, $L_2 > L_3$, but $Q_2 \ge Q_3$. Therefore, $G_2$ has achieved higher L than $G_3$ without suffering lower Q, so choosing $G_2$ over $G_3$ is not risky. If we always choose among models that have maximum possible entropy for the given choice of features, we are likely to obtain better features and better generative models.
Figure 1.
Comparison of entropy Q and average log-likelihood L for three distributions. The vertical lines are the locations of training samples.
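The trade-off between L and Q can be made concrete with a small numerical sketch (purely illustrative, with arbitrary toy data, and not taken from the paper): a narrow kernel density estimate placed on the training samples achieves a higher average log-likelihood L than a moment-matched Gaussian, but only by giving up entropy Q.

```python
# Illustrative sketch (toy data, assumed bandwidths): average log-likelihood L versus
# differential entropy Q for an over-adapted density and a moment-matched Gaussian.
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
x_train = rng.normal(size=10)                 # hypothetical training samples
grid = np.linspace(-6.0, 6.0, 4001)           # grid for numerical entropy

def L_and_Q(pdf):
    L = np.mean(np.log(pdf(x_train)))                          # average training log-likelihood
    p = pdf(grid)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    Q = -np.sum(plogp) * (grid[1] - grid[0])                    # entropy by simple quadrature
    return L, Q

fit = norm(loc=x_train.mean(), scale=x_train.std())             # moment-matched Gaussian
kde = gaussian_kde(x_train, bw_method=0.05)                     # narrow kernels at the samples

for name, pdf in [("Gaussian fit", fit.pdf), ("narrow KDE ", kde)]:
    L, Q = L_and_Q(pdf)
    print(f"{name}: L = {L:7.3f}, Q = {Q:7.3f}")   # KDE: higher L, lower Q (suspicious)
```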
2. Main Results
2.1. MaxEnt PDF Projection
The solution to Problem 1 is based on PDF projection [10]. In PDF projection, one is given a feature distribution $g(\mathbf{z})$, where $\mathbf{z} = T(\mathbf{x})$, and constructs a PDF on the input data as follows:
$$G(\mathbf{x}) = \frac{p(\mathbf{x}|H_0)}{p(\mathbf{z}|H_0)} \, g(\mathbf{z}), \qquad \mathbf{z} = T(\mathbf{x}), \qquad (6)$$
where $p(\mathbf{x}|H_0)$ is a reference distribution meeting some mild constraints [10], and $p(\mathbf{z}|H_0)$ is the corresponding distribution imposed by $p(\mathbf{x}|H_0)$ on the measurements, i.e., Equation (3) applied to $p(\mathbf{x}|H_0)$. It can be shown that (a) $G(\mathbf{x})$ integrates to one and is therefore a valid PDF; (b) if $\mathbf{x}$ is drawn from $G(\mathbf{x})$, then $\mathbf{z} = T(\mathbf{x})$ is distributed according to $g(\mathbf{z})$, so $G(\mathbf{x})$ meets the constraint of Problem 1; and (c) the entropy of $G(\mathbf{x})$ depends on the choice of the reference distribution $p(\mathbf{x}|H_0)$.
The last item indicates that, to solve Problem 1, it is only necessary to select the reference distribution $p(\mathbf{x}|H_0)$ for maximum entropy (MaxEnt).
To understand the solution to this problem, it is useful to consider the sampling procedure for (6). To sample from distribution (6), one draws a feature sample $\mathbf{z}$ from the PDF $g(\mathbf{z})$; then, $\mathbf{x}$ is drawn from the set $\mathcal{M}(\mathbf{z})$, defined in (4). Note, however, that to conform to (6), it is necessary to draw the sample $\mathbf{x}$ from $\mathcal{M}(\mathbf{z})$ with probability proportional to the value of $p(\mathbf{x}|H_0)$. The distribution of $\mathbf{x}$ on the manifold may be thought of as the conditional distribution $p(\mathbf{x}|\mathbf{z}, H_0)$, and it is proportional to $p(\mathbf{x}|H_0)$. It is in fact
$$p(\mathbf{x}|\mathbf{z}, H_0) = \frac{p(\mathbf{x}|H_0)}{p(\mathbf{z}|H_0)}, \qquad \mathbf{x} \in \mathcal{M}(\mathbf{z}). \qquad (7)$$
It can be verified that (7) integrates to 1 on the manifold $\mathcal{M}(\mathbf{z})$. The entropy of (6) can be decomposed as the entropy of $g(\mathbf{z})$ plus the expected value, over $g(\mathbf{z})$, of the entropy of the conditional distribution (7) on the manifold (see Equation (8) in [8]).
Maximizing this quantity seems daunting, but there is one condition under which the conditional distribution (7) has maximum entropy for all $\mathbf{z}$, and that is when it is the uniform distribution on $\mathcal{M}(\mathbf{z})$ for every $\mathbf{z}$. This, in turn, is achieved when $p(\mathbf{x}|H_0)$ has a constant value on any manifold $\mathcal{M}(\mathbf{z})$.
This process of selecting the reference distribution for maximum entropy is called maximum entropy PDF projection [8,9]. The maximizing reference distribution is written $p(\mathbf{x}|H_0^*)$, and the MaxEnt distribution is written
$$G^*(\mathbf{x}) = \frac{p(\mathbf{x}|H_0^*)}{p(\mathbf{z}|H_0^*)} \, g(\mathbf{z}), \qquad \mathbf{z} = T(\mathbf{x}), \qquad (8)$$
which is the unique distribution that solves Problem 1.
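As a minimal illustration of how (8) is evaluated in practice, the following sketch assembles the three log-domain terms for the scalar energy feature $z = \|\mathbf{x}\|_2^2$ under a standard Gaussian reference (this case is developed in Section 3.1); the data dimension and the gamma feature density $g(z)$ are arbitrary choices made only for the example.

```python
# Minimal sketch (toy setup, not code from the paper): the MaxEnt projected log-PDF
#   log G(x) = log p(x|H0) - log p(z|H0) + log g(z)
# for the scalar energy feature z = ||x||^2 with a standard Gaussian reference.
import numpy as np
from scipy.stats import chi2, gamma

N = 64                                   # assumed data dimension
g_z = gamma(a=40.0, scale=2.0)           # assumed (learned) feature distribution g(z)

def log_G(x):
    z = np.sum(x ** 2)                                     # energy feature z = T(x)
    log_px_H0 = -0.5 * N * np.log(2.0 * np.pi) - 0.5 * z   # standard normal reference p(x|H0)
    log_pz_H0 = chi2.logpdf(z, df=N)                       # induced feature PDF p(z|H0)
    return log_px_H0 - log_pz_H0 + g_z.logpdf(z)

x = 1.2 * np.random.default_rng(0).normal(size=N)          # an arbitrary test input
print(log_G(x))
```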
In order that it is possible to select $p(\mathbf{x}|H_0)$ for MaxEnt, the feature transformation $T(\mathbf{x})$ must be such that the uniform distribution can be defined on $\mathcal{M}(\mathbf{z})$ for any $\mathbf{z}$. Thus, $\mathcal{M}(\mathbf{z})$ must be bounded and integrable. This condition is easily met if the feature contains information about the size of $\mathbf{x}$, so that when $\mathbf{z}$ is fixed to a finite value, $\mathbf{x}$ has a fixed norm. To state this formally, let there exist a function $h$ such that $\|\mathbf{x}\| = h(T(\mathbf{x}))$ for some valid norm $\|\cdot\|$ on the range of $\mathbf{x}$.
Once this condition is met, the MaxEnt reference distribution $p(\mathbf{x}|H_0^*)$ is any distribution that is constant on any level set $\mathcal{M}(\mathbf{z})$. This happens if there exists a function $c$ such that
$$p(\mathbf{x}|H_0^*) = c\left(T(\mathbf{x})\right).$$
Interestingly, any reference distribution meeting these constraints results in the same distribution (6) [8]. This means that, although $p(\mathbf{x}|H_0^*)$ is not unique, $G^*(\mathbf{x})$ is unique—it must be unique if it is the maximum entropy PDF.
The above conditions can be easily met by inserting an energy statistic into the feature set and defining a reference distribution that depends on $\mathbf{x}$ only through this energy statistic. The energy statistic is a scalar statistic from which it is possible to compute a valid norm on the range of $\mathbf{x}$, denoted by $\mathbb{X}$. In summary, the simplest way to solve for the MaxEnt projected PDF, given the range $\mathbb{X}$ of $\mathbf{x}$, involves these three steps:
- Identify a norm $\|\mathbf{x}\|$ valid in $\mathbb{X}$. A norm must meet the properties of absolute scalability, $\|a\mathbf{x}\| = |a| \, \|\mathbf{x}\|$, the triangle inequality, $\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|$, and positive definiteness, $\|\mathbf{x}\| = 0$ only if $\mathbf{x} = 0$.
- Identify a scalar statistic (energy statistic), computable from the features $\mathbf{z}$, from which it is possible to compute $\|\mathbf{x}\|$.
- Use a reference hypothesis depending on $\mathbf{x}$ only through this energy statistic.
The data generation process for MaxEnt PDF projection, corresponding to distribution (8), does not depend on the particular choice of $H_0^*$ and is the following:
- From the known feature distribution $g(\mathbf{z})$, draw a sample denoted by $\mathbf{z}^*$.
- Identify the set of all samples mapping to $\mathbf{z}^*$, denoted by $\mathcal{M}(\mathbf{z}^*)$ and defined in (4).
- Draw a sample $\mathbf{x}$ from this set, uniformly, so that no member of $\mathcal{M}(\mathbf{z}^*)$ is more likely to be chosen than another.
The maximum entropy nature of the solution can be recognized in the uniform sampling on the level set $\mathcal{M}(\mathbf{z}^*)$. The last item above is called uniform manifold sampling (UMS) [9]. The data generation process for three important cases of the data range is provided in Section 3.1, Section 3.2 and Section 3.3.
3. Examples
The implementation of MaxEnt PDF projection depends strongly on the range of the input data $\mathbf{x}$, denoted by $\mathbb{X}$. In this section, examples are provided for three important cases of $\mathbb{X}$.
3.1. Unbounded Data
Let $\mathbf{x}$ range everywhere in $\mathbb{R}^N$. The 2-norm is valid in $\mathbb{R}^N$ and can be computed from the total energy
$$e = \sum_{i=1}^{N} x_i^2 = \|\mathbf{x}\|_2^2.$$
The Gaussian reference hypothesis can be written in terms of $e$:
$$p(\mathbf{x}|H_0) = (2\pi)^{-N/2} \exp\left(-\frac{e}{2}\right), \qquad (9)$$
so naturally $p(\mathbf{x}|H_0)$ will have a constant value on any manifold $\mathcal{M}(\mathbf{z})$. It is not necessary to include $e$ explicitly in the feature set—it is only necessary that the 2-norm can be computed from $\mathbf{z}$.
The distribution $p(\mathbf{z}|H_0)$ can be determined in closed form for some feature transformations [11,12]. For others, the moment-generating function can be written in closed form, which allows the saddle point approximation to be used to compute $p(\mathbf{z}|H_0)$ [11]. More on this is presented in Section 4.1.
An important case where a closed-form solution exists is a linear transformation combined with the total energy, $\mathbf{z} = \left[\mathbf{A}^\top \mathbf{x}, \; \sum_{i=1}^{N} x_i^2\right]$.
This case is covered in detail in ([8], Section IV.C, p. 2821), and in ([9], Section III.B, p. 2459).
The following simple example demonstrates the main points of this case. Assume input data dimension $N = 3$ and a feature transformation consisting of the sample mean and sample variance:
$$\mathbf{z} = T(\mathbf{x}) = [m, v],$$
where
$$m = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad v = \frac{1}{N} \sum_{i=1}^{N} (x_i - m)^2.$$
Note that the total energy can be computed from $\mathbf{z}$,
$$\sum_{i=1}^{N} x_i^2 = N(v + m^2),$$
which satisfies the requirement that the 2-norm of $\mathbf{x}$ can be computed from $\mathbf{z}$.
Under the assumption that $\mathbf{x}$ is distributed according to the standard Normal distribution (9), $m$ will have mean 0 and variance $1/N$, and $v$ will have a chi-square distribution with $N-1$ degrees of freedom and scaling $1/N$, that is, $Nv \sim \chi^2_{N-1}$. Furthermore, $m$ and $v$ are statistically independent. Therefore, $p(\mathbf{z}|H_0) = p(m|H_0) \, p(v|H_0)$. For the given feature distribution $g(\mathbf{z})$, we assume the components of $\mathbf{z}$ are independent and Gaussian with given means and variances. The MaxEnt projected PDF, given by (8), is plotted on the left of Figure 2 for a slice of $\mathbb{R}^3$ at a fixed value of $x_3$. The density values shown in the figure, summed over all three axes and properly scaled, added to a value of 0.9999999998, which validates by numerical integration that (8) is a density. Notice that the probability is concentrated on a circular region. This can be understood in terms of the sampling procedure given below.
Figure 2.
(Left) illustration of the projected PDF on a slice of $\mathbb{R}^3$ at a fixed value of $x_3$; (Right) samples drawn from the sampling procedure (see text).
To sample from (8), we first draw a sample of $\mathbf{z}$ from $g(\mathbf{z})$, denoted by $\mathbf{z}^* = [m^*, v^*]$, which provides values for the sample mean and variance. Then, $\mathbf{x}$ must be drawn uniformly from the manifold $\mathcal{M}(\mathbf{z}^*)$, which is defined by the conditions on the sample mean and variance. This is easily accomplished if we note that the sample mean condition is met for any $\mathbf{x}$ of the form
$$\mathbf{x} = m^* \mathbf{1} + \mathbf{Q} \mathbf{u}, \qquad (10)$$
where $\mathbf{1}$ is the length-N vector of ones and $\mathbf{Q}$ is the $N \times (N-1)$ ortho-normal matrix spanning the space orthogonal to $\mathbf{1}$. To meet the second (variance) condition, it is necessary that
$$\|\mathbf{u}\|^2 = N v^*.$$
This condition defines a hypersphere in $N-1$ dimensions, which explains the circular region in Figure 2. This hypersphere is sampled uniformly by drawing $N-1$ independent Gaussian random variables, collected in the vector $\mathbf{u}$, then scaling so that $\|\mathbf{u}\|^2 = N v^*$. Then, $\mathbf{x}$ is constructed using (10). Samples drawn in this manner are shown on the right side of Figure 2. To agree with the left side of the figure, only samples with $x_3$ near the slice value are plotted.
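The sampling procedure just described is easy to code. The sketch below is a hedged illustration: the dimension N = 3 follows the example, but the means and standard deviations assumed for the Gaussian feature distribution are arbitrary choices, not the values used for Figure 2.

```python
# Hedged sketch: uniform manifold sampling for the sample-mean/sample-variance feature.
import numpy as np

rng = np.random.default_rng(0)
N = 3
mu = np.array([0.0, 1.0])      # assumed means of (sample mean, sample variance)
sig = np.array([0.3, 0.3])     # assumed standard deviations

# Orthonormal basis Q (N x (N-1)) of the subspace orthogonal to the ones vector.
Qfull, _ = np.linalg.qr(np.ones((N, 1)), mode='complete')
Q = Qfull[:, 1:]

samples = []
while len(samples) < 5000:
    m, v = rng.normal(mu, sig)                  # draw z* = (m*, v*) from g(z)
    if v <= 0:
        continue                                # reject the (rare) negative variances
    u = rng.normal(size=N - 1)                  # random direction in N-1 dimensions ...
    u *= np.sqrt(N * v) / np.linalg.norm(u)     # ... scaled so that ||u||^2 = N v*
    samples.append(m + Q @ u)                   # x = m* 1 + Q u, Equation (10)

samples = np.asarray(samples)
print(samples.mean(axis=1)[:3])                 # per-sample means, equal to the drawn m*
print(samples.var(axis=1)[:3])                  # per-sample (1/N) variances, equal to v*
```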
Please see the above-cited references for the treatment of general linear transformations.
3.2. Positive Data
Let $\mathbf{x}$ have positive-valued elements, so $\mathbf{x}$ ranges in the positive quadrant of $\mathbb{R}^N$, denoted by $\mathbb{P}^N$. This holds whenever spectral or intensity data are processed. The appropriate norm in this space is the 1-norm
$$\|\mathbf{x}\|_1 = \sum_{i=1}^{N} x_i.$$
To satisfy the conditions for maximum entropy, it must be possible to compute this statistic from the features. The exponential reference hypothesis can be written in terms of the 1-norm:
$$p(\mathbf{x}|H_0) = \exp\left(-\sum_{i=1}^{N} x_i\right) = \exp\left(-\|\mathbf{x}\|_1\right), \qquad (11)$$
so naturally (11) will have a constant value on any manifold $\mathcal{M}(\mathbf{z})$, and it is the appropriate reference hypothesis for maximum entropy. The inclusion of the 1-norm explicitly in the feature set is only one way to ensure that $\mathcal{M}(\mathbf{z})$ is compact—it is only necessary that the 1-norm can be computed from $\mathbf{z}$.
An important feature extraction is the linear transformation $\mathbf{z} = \mathbf{A}^\top \mathbf{x}$.
Note that it is necessary that the statistic $\|\mathbf{x}\|_1$ can be computed from $\mathbf{z}$, which can be accomplished, for example, by making the first column of $\mathbf{A}$ constant. This case is covered in detail in ([8], Section IV.B, p. 2820), and in ([9], Section IV, p. 2460). Sampling is accomplished by drawing a sample $\mathbf{z}^*$ from $g(\mathbf{z})$ and then drawing a sample uniformly from the set $\mathcal{M}(\mathbf{z}^*)$.
The following simple example demonstrates the main theoretical concepts. We assume a data dimension of $N = 2$ so that the distribution can be visualized as an image. The feature transformation is simply the sum of the samples:
$$z = T(\mathbf{x}) = x_1 + x_2.$$
Under the exponential reference hypothesis, the feature distribution is chi-square with $2N$ degrees of freedom and scaling $1/2$:
$$p(z|H_0) = \frac{z^{N-1} e^{-z}}{(N-1)!}, \qquad z \ge 0.$$
For the given feature distribution, we assume a Gaussian $g(z)$ with a given mean $\mu$ and variance $\sigma^2$. The MaxEnt projected PDF, given by (8), is plotted in Figure 3. The density values shown in the figure, when properly scaled, summed to a value of 0.9998, which validates by numerical integration that (8) is a density. Note that the distribution is concentrated on the line $x_1 + x_2 = \mu$, and is flat on this line, as would be expected for maximum entropy. To sample from this distribution, we first draw a sample $z^*$ from $g(z)$ and then draw a sample on the line given by $x_1 + x_2 = z^*$. This can be done by sampling $x_1$ uniformly in $[0, z^*]$, then letting $x_2 = z^* - x_1$. Samples drawn in this way are shown on the right side of Figure 3.
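A few lines of code make this sampling procedure concrete. This is a hedged sketch: the mean and standard deviation assumed for $g(z)$ are arbitrary, not the values used for Figure 3, and the closing comment indicates one way the idea extends beyond N = 2.

```python
# Hedged sketch: sampling the MaxEnt projected PDF for positive data with z = x1 + x2.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 4.0, 0.5                   # assumed mean and standard deviation of g(z)

def sample(n):
    out = []
    while len(out) < n:
        z = rng.normal(mu, sigma)      # draw the feature sample z* from g(z)
        if z <= 0:
            continue                   # z must be positive for positive-valued data
        x1 = rng.uniform(0.0, z)       # uniform on the segment x1 + x2 = z, x1, x2 >= 0
        out.append((x1, z - x1))
    return np.asarray(out)

xy = sample(10000)
print(xy.min(), xy.sum(axis=1)[:5])    # all elements >= 0; each pair sums to its drawn z*

# For general N, a uniform draw on the simplex {x >= 0, sum(x) = z} can be obtained as
#   x = z * rng.dirichlet(np.ones(N))
```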
Figure 3.
(Left) illustration of the projected PDF; (Right) samples drawn from the sampling procedure (see text).
This example generalizes to higher dimensions and to arbitrary linear transformations $\mathbf{z} = \mathbf{A}^\top \mathbf{x}$ for a full-rank matrix $\mathbf{A}$. In this case, $p(\mathbf{z}|H_0)$ is not chi-square and in fact is not available in closed form. However, the moment-generating function is available in closed form, so the saddle point approximation may be used (see Section IV.A, p. 2245 in [11]). Samples of $\mathbf{x}$ are drawn by drawing a sample $\mathbf{z}^*$ from $g(\mathbf{z})$ and then sampling uniformly in the set $\mathcal{M}(\mathbf{z}^*)$. At high dimension, this requires a form of Gibbs sampling called hit-and-run (see Section IV, p. 2460 in [9]).
3.3. Unit Hypercube, $\mathbb{U}^N$
Let $\mathbf{x}$ have elements limited to the interval $[0, 1]$. This case is common when working with neural networks. The range of $\mathbf{x}$ is then the unit hypercube, denoted by $\mathbb{U}^N$. The uniform reference hypothesis
$$p(\mathbf{x}|H_0) = 1, \qquad \mathbf{x} \in \mathbb{U}^N, \qquad (12)$$
produces maximum entropy. No norm-producing energy statistic is needed. Naturally, (12) will have a constant value on any manifold $\mathcal{M}(\mathbf{z})$.
The following simple example demonstrates the main theoretical concepts. We assume a data dimension of $N = 2$ so that the distribution can be visualized as an image. The feature transformation is simply the sum of the samples:
$$z = T(\mathbf{x}) = x_1 + x_2.$$
For this case, the uniform distribution (12) provides maximum entropy. Under this reference hypothesis, the feature distribution is Irwin-Hall, given by
$$p(z|H_0) = \frac{1}{(N-1)!} \sum_{k=0}^{\lfloor z \rfloor} (-1)^k \binom{N}{k} (z - k)^{N-1}, \qquad 0 \le z \le N,$$
where $\lfloor z \rfloor$ is the integer part of $z$. For $N = 2$, this is the triangular distribution
$$p(z|H_0) = \begin{cases} z, & 0 \le z \le 1, \\ 2 - z, & 1 < z \le 2. \end{cases}$$
For the given feature distribution, we assume a Gaussian $g(z)$ with a given mean $\mu$ and variance $\sigma^2$. The MaxEnt projected PDF, given by (8), is plotted in Figure 4. The density values shown in the figure, when properly scaled, summed to a value of 0.999, which validates by numerical integration that (8) is a density. Note that the distribution is concentrated on the line $x_1 + x_2 = \mu$ and is flat on this line, as would be expected for maximum entropy. To sample from this distribution, we first draw a sample $z^*$ from $g(z)$ and then draw a sample on the line given by $x_1 + x_2 = z^*$. This can be done by finding where this line intercepts the boundary of the unit square and sampling uniformly in the segment between the intercepts. Note that this sampling differs from the previous example as a result of the upper bound at 1.
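The sketch below illustrates this case (hedged: the mean and standard deviation assumed for $g(z)$ are arbitrary, not the values used for Figure 4). It evaluates the projected log-PDF, using the fact that the uniform reference (12) contributes zero in the log domain and that $p(z|H_0)$ is the triangular Irwin-Hall density, and it samples by clipping the line $x_1 + x_2 = z^*$ to the unit square.

```python
# Hedged sketch: projected log-PDF and sampling for the N = 2 unit-hypercube example.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 1.2, 0.15                  # assumed mean and standard deviation of g(z)

def log_irwin_hall_n2(z):
    # Irwin-Hall density for N = 2: triangular on [0, 2].
    p = np.where(z < 1.0, z, 2.0 - z)
    return np.log(np.clip(p, 1e-300, None))

def log_G(x):
    # log G(x) = 0 - log p(z|H0) + log g(z), valid for x inside the unit square.
    z = np.sum(x)
    return -log_irwin_hall_n2(z) + norm.logpdf(z, mu, sigma)

def sample(n):
    out = []
    while len(out) < n:
        z = rng.normal(mu, sigma)                    # draw z* from g(z)
        if not 0.0 < z < 2.0:
            continue                                 # z* must lie in the range of the feature
        lo, hi = max(0.0, z - 1.0), min(1.0, z)      # intercepts with the unit square
        x1 = rng.uniform(lo, hi)
        out.append((x1, z - x1))
    return np.asarray(out)

xy = sample(10000)
print(log_G(np.array([0.6, 0.6])), xy.min(), xy.max())   # samples stay inside [0, 1]
```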
Figure 4.
Illustration of the projected PDF.
This example generalizes to higher dimensions and to arbitrary linear transformations $\mathbf{z} = \mathbf{A}^\top \mathbf{x}$ for a full-rank matrix $\mathbf{A}$. In this case, $p(\mathbf{z}|H_0)$ is no longer Irwin-Hall and in fact is not available in closed form. However, the moment-generating function is available in closed form, so the saddle point approximation may be used (see the Appendix in [13]). Samples of $\mathbf{x}$ are drawn by drawing a sample $\mathbf{z}^*$ from $g(\mathbf{z})$ and then sampling uniformly in the set $\mathcal{M}(\mathbf{z}^*)$. At high dimension, this requires a form of Gibbs sampling called hit-and-run (see p. 2465 in [9]).
4. Advanced Concepts
4.1. Implementation Issues
Implementing (8) seems like a daunting numerical task, since $H_0$ is some canonical distribution for which a real data sample normally lies in the far tails of both $p(\mathbf{x}|H_0)$ and $p(\mathbf{z}|H_0)$. However, if the distributions are known exactly and are represented in the log domain, then the difference
$$\log p(\mathbf{x}|H_0) - \log p(\mathbf{z}|H_0) \qquad (13)$$
typically remains within very reasonable limits. In some cases, terms in $\log p(\mathbf{x}|H_0)$ and $\log p(\mathbf{z}|H_0)$ cancel, leaving (13) only weakly dependent on $\mathbf{x}$ (for example, see Section IV.A, p. 2820 in [8]).
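The cancellation is easy to demonstrate numerically. In the hedged sketch below (the energy feature with a standard Gaussian reference, as in Section 3.1, with arbitrary toy data), both log terms are far outside machine range when exponentiated, yet their $-z/2$ parts cancel exactly, so the difference (13) depends on the data only through $\log z$.

```python
# Hedged sketch: log-domain cancellation in (13) for the energy feature z = ||x||^2.
import numpy as np
from scipy.special import gammaln

N = 1000
rng = np.random.default_rng(1)
x = 3.0 * rng.normal(size=N)         # "real" data, far in the tails of H0 = N(0, I)
z = np.sum(x ** 2)                   # energy statistic; chi-squared with N dof under H0

log_px_H0 = -0.5 * N * np.log(2.0 * np.pi) - 0.5 * z
log_pz_H0 = (0.5 * N - 1.0) * np.log(z) - 0.5 * z - gammaln(0.5 * N) - 0.5 * N * np.log(2.0)

print(log_px_H0, log_pz_H0)          # both are huge negative numbers (order -10^3)
print(np.exp(log_px_H0))             # underflows to 0.0 in double precision

# The -z/2 terms cancel, so (13) depends on x only through log z:
diff = log_px_H0 - log_pz_H0
analytic = (-0.5 * N * np.log(2.0 * np.pi) - (0.5 * N - 1.0) * np.log(z)
            + gammaln(0.5 * N) + 0.5 * N * np.log(2.0))
print(diff, analytic)                # agree to machine precision
```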
Evaluating $p(\mathbf{x}|H_0)$ is mostly trivial since it is normally a canonical distribution, such as Gaussian, exponential, or uniform. Calculating $p(\mathbf{z}|H_0)$, however, remains the primary challenge in maximum entropy PDF projection. When evaluating $p(\mathbf{z}|H_0)$ seems daunting, there are several ways to overcome the problem.
- Saddle Point Approximation. If $p(\mathbf{z}|H_0)$ is not available in closed form, the moment-generating function (MGF) might be tractable. This allows the saddle point approximation (SPA) to be used (see Section III in [11]). Note that the term “approximation” is misleading because the SPA approximates the shape of the MGF on a contour, not the absolute value, so the SPA expression for $p(\mathbf{z}|H_0)$ remains very accurate in the far tails, even when $p(\mathbf{z}|H_0)$ itself cannot be evaluated in machine precision. Examples of this include general linear transformations of exponential and chi-squared random variables (see Section III.C and Section IV in [11]), general linear transformations of uniform random variables (Appendix in [13]), a set of linear-quadratic forms [14], and order statistics [15].
- Floating reference hypothesis. There are conditions under which the MaxEnt reference hypothesis is not unique, so it can depend on a parameter $\theta$, and we write $H_0(\theta)$. An example is when the feature contains the sample mean and sample variance (see the example in Section 3.1). In this case, a Gaussian reference hypothesis can be modified to have any mean and variance and can still serve as the MaxEnt reference hypothesis with no change at all in the resulting projected PDF. In other words, (13) is independent of $\theta$—this can be verified by cancelling terms. Therefore, there is no reason that $\theta$ cannot be made to track the data—that is, let the reference mean and variance equal the sample mean and variance of $\mathbf{x}$. By doing this, $H_0(\theta)$ will track $\mathbf{x}$, allowing simple approximations based on the central limit theorem to be used to approximate $p(\mathbf{z}|H_0(\theta))$.
- Chain Rule. When $p(\mathbf{z}|H_0)$ cannot be derived for a feature transformation, it may be possible to break the feature transformation into stages, where each stage can be easily analyzed. The next section is devoted to this.
4.2. Chain Rule
The primary numerical difficulty in implementing (8) is the calculation of $p(\mathbf{z}|H_0)$. Solutions for many of the most useful feature transformations are available [9,11,12,13]. However, in many real-world applications, such as neural networks, the feature transformation cannot be easily written in a compact form $\mathbf{z} = T(\mathbf{x})$. Instead, it consists of multi-stage transformations, for example, $\mathbf{z}_1 = T_1(\mathbf{x})$, $\mathbf{z}_2 = T_2(\mathbf{z}_1)$, and $\mathbf{z}_3 = T_3(\mathbf{z}_2)$. The individual stages could be the layers of a neural network. In this case, it is best to apply (8) recursively to each stage. This means that the distribution of the first-stage features is written using (6) with $\mathbf{z}_1$ taking the role of input data, and so forth. This results in the chain-rule form:
$$G(\mathbf{x}) = \frac{p(\mathbf{x}|H_1)}{p(\mathbf{z}_1|H_1)} \cdot \frac{p(\mathbf{z}_1|H_2)}{p(\mathbf{z}_2|H_2)} \cdot \frac{p(\mathbf{z}_2|H_3)}{p(\mathbf{z}_3|H_3)} \, g(\mathbf{z}_3), \qquad (14)$$
where $H_1$, $H_2$, $H_3$ are canonical reference hypotheses used at each stage, for example (9), (11), or (12), depending on the ranges of $\mathbf{x}$, $\mathbf{z}_1$, and $\mathbf{z}_2$, respectively.
To understand the importance of the chain rule, consider how we would compute (6) without it. Let $T$ be the combined transformation
$$\mathbf{z} = T(\mathbf{x}) = T_3(T_2(T_1(\mathbf{x}))),$$
and let $p(\mathbf{x}|H_0)$ be one of the canonical reference distributions. Consider the difficulty in deriving $p(\mathbf{z}|H_0)$. At each stage, the distribution of the output feature becomes more and more intractable, and trying to estimate it empirically is futile because a canonical reference distribution is generally completely unrealistic as a PDF for real data. Furthermore, $p(\mathbf{z}|H_0)$ is more often than not evaluated in the far tails of the distribution. With the chain rule, however, we can assume a suitable canonical reference hypothesis at the start of each stage, and only need to derive the feature distribution imposed on the output of that stage.
As long as the reference hypothesis used at each stage meets the stated requirements given in Section 2.1, then the chain as a whole will indeed produce the desired MaxEnt projected PDF, which is the PDF with maximum entropy among all PDFs that generate the desired output feature distribution through the combined transformation [8]!
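The chain rule lends itself to a simple recursive implementation. The sketch below uses my own notation (it is not code from [8]): each stage supplies its transformation, the log reference PDF of its input, and the log PDF that this reference induces on its output; the stage-wise log ratios and the final feature density add up to the projected log-PDF. The toy usage applies an energy feature followed by a 1:1 log transformation, with an arbitrary Gaussian assumed for $g$.

```python
# Hedged sketch of the chain rule (14): accumulate stage-wise log J-functions.
import numpy as np
from scipy.stats import chi2

def chain_log_G(x, stages, log_g):
    """stages: list of (T, log_ref_in, log_ref_out); log_g: log of the final feature density."""
    total = 0.0
    for T, log_ref_in, log_ref_out in stages:
        z = T(x)
        total += log_ref_in(x) - log_ref_out(z)    # log J-function of this stage
        x = z                                      # output becomes input of the next stage
    return total + log_g(x)                        # x now holds the final feature

# Toy two-stage example (arbitrary choices): z1 = ||x||^2, then z2 = log(z1).
N = 16
stage1 = (lambda x: np.array([np.sum(x ** 2)]),
          lambda x: -0.5 * N * np.log(2.0 * np.pi) - 0.5 * np.sum(x ** 2),   # Gaussian ref
          lambda z: chi2.logpdf(z[0], df=N))                                 # induced PDF
stage2 = (lambda z: np.log(z),                                               # 1:1 transform
          lambda z: -z[0],                                                   # Exp(1) reference
          lambda w: w[0] - np.exp(w[0]))                                     # induced PDF of log Exp(1)
log_g = lambda w: -0.5 * np.log(2.0 * np.pi) - 0.5 * (w[0] - 2.0) ** 2       # assumed g: N(2, 1)

x = np.random.default_rng(0).normal(size=N)
print(chain_log_G(x, [stage1, stage2], log_g))
```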
An example of the application of the chain rule is the computation of MEL frequency cepstral coefficients (MFCC), commonly used in speech processing. Let us consider a frame of data of length N, denoted by $\mathbf{x}$. The processing is broken into the following stages:
- The first step, denoted by $\mathbf{z}_1 = T_1(\mathbf{x})$, is to convert $\mathbf{x}$ into magnitude-squared discrete Fourier transform (DFT) bins. Under the standard Gaussian assumption (9), the elements of $\mathbf{z}_1$ are independent and have chi-squared statistics (see Section VI.D.1, pp. 47–48 in [12]).
- The second step is to sum the energy in a set of K MEL-spaced band functions, resulting in a set of K band energies. This can be written as the linear transformation $\mathbf{z}_2 = \mathbf{A}^\top \mathbf{z}_1$, where the columns of the matrix $\mathbf{A}$ are the band functions. This feature transformation is explained in Section 3.2 above, so an exponential reference distribution can be assumed for $\mathbf{z}_1$. Care must be taken that the K band functions add to a constant—this ensures the energy statistic is “contained in the features”.
- The next step is to compute the log of the K band energies, $\mathbf{z}_3 = \log(\mathbf{z}_2)$. This is a 1:1 transformation for which PDF projection simplifies to computing the determinant of the transformation’s Jacobian matrix (see Section VI.A, p. 46 in [12]).
- The last step is the discrete cosine transform (DCT), which can be written as a linear transformation $\mathbf{z}_4 = \mathbf{B}^\top \mathbf{z}_3$. If some DCT coefficients are discarded, then the transformation must be analyzed as in Section 3.1 above by including the energy statistic $\|\mathbf{z}_3\|_2^2$ among the retained features.
This example illustrates that complex feature transformations can be easily analyzed if broken into simple steps. More on the above example can be found in Sections V and VI in [8].
4.3. Large-N Conditional Distributions and Applications
When the feature value $\mathbf{z}$ is fixed, sampling on the manifold $\mathcal{M}(\mathbf{z})$, called UMS, has some interesting interpretations relative to maximum entropy. Let the conditional distribution be written $p(\mathbf{x}|\mathbf{z})$. Notice that $p(\mathbf{x}|\mathbf{z})$ is not a proper distribution since all the probability mass exists on the manifold $\mathcal{M}(\mathbf{z})$ of zero volume. Writing down $p(\mathbf{x}|\mathbf{z})$ in closed form or determining its mean is intractable. It is useful, however, to know $p(\mathbf{x}|\mathbf{z})$ because, for example, its mean is a point estimate of $\mathbf{x}$ based on $\mathbf{z}$, a type of MaxEnt feature inversion. However, depending on the range of $\mathbf{x}$, as exemplified by the three cases in Section 3.1, Section 3.2 and Section 3.3, $p(\mathbf{x}|\mathbf{z})$ can be approximated by a surrogate distribution (see p. 2461 in [9]). The surrogate distribution is a proper distribution that (a) has its probability mass concentrated near $\mathcal{M}(\mathbf{z})$, (b) has a constant value on $\mathcal{M}(\mathbf{z})$, and (c) has its mean value on the manifold. The surrogate distribution therefore meets the same conditions as $p(\mathbf{x}|\mathbf{z})$ but is a proper distribution. The mean of the surrogate distribution, which can be called the centroid of $\mathcal{M}(\mathbf{z})$, is a very close approximation to the mean of $p(\mathbf{x}|\mathbf{z})$, but can be computed. In Section 3.1, Section 3.2 and Section 3.3, the surrogate distribution is Gaussian, exponential, and truncated exponential, respectively. These are the MaxEnt distributions under the applicable constraints. It was shown, for example, when the range of $\mathbf{x}$ is the positive quadrant of $\mathbb{R}^N$, that the centroid corresponds to the classical maximum entropy feature inversion approach for a dimension-reducing linear transformation of intensity data, used for example to sharpen images blurred by a point-spread function [9]. The method, however, is more general because it can be adapted to different ranges of $\mathbf{x}$ [9].
5. Applications
5.1. Classification
Assume there are M classes and the M class hypotheses are $H_1, H_2, \ldots, H_M$. The general form of the classifier, obtained by applying the Bayes theorem and (8), is given by
$$\hat{m} = \arg\max_{m} \; P(H_m) \, \hat{p}(\mathbf{x}|H_m), \qquad (15)$$
where $P(H_m)$ is the prior class probability, and $\hat{p}(\mathbf{x}|H_m)$ is a PDF estimate for class hypothesis $H_m$. For the classification problem, there are many classifier topologies for using (8) to construct $\hat{p}(\mathbf{x}|H_m)$; several are listed below, and a minimal sketch of the general form is given after the list.
- Class-specific features. One can specify a different feature transformation per class, $\mathbf{z}_m = T_m(\mathbf{x})$, but the numerator $p(\mathbf{x}|H_0)$ is common to all classes, so the classifier rule becomes
$$\hat{m} = \arg\max_{m} \; P(H_m) \, \frac{g_m(\mathbf{z}_m)}{p(\mathbf{z}_m|H_0)}.$$
This amounts to just comparing the likelihood ratio between each class hypothesis and the reference distribution, computed using a class-dependent feature [16].
- It is not necessary to use a common reference hypothesis. A class-dependent reference hypothesis can be selected so that the feature $\mathbf{z}_m$ is an approximately sufficient statistic for discriminating the given class from the class-dependent reference hypothesis. Then,
$$\hat{p}(\mathbf{x}|H_m) = \frac{p(\mathbf{x}|H_{0,m})}{p(\mathbf{z}_m|H_{0,m})} \, g_m(\mathbf{z}_m),$$
where $H_{0,m}$ is the class-dependent reference hypothesis. Note that, when using the chain rule (14), there is not a single reference hypothesis associated with each class, but a series of stage-wise reference hypotheses. Note also that here we have relaxed the MaxEnt requirement for the reference hypothesis.
- Using a different feature to test each class hypothesis is not always a good idea. Some data can be “contaminated” with noise or interference, so it may not be suitable to test a hypothesis with just one feature. In this case, a class-specific feature mixture (CSFM) [17,18,19] may be appropriate. For the CSFM, we define a set of feature transformations $T_1, T_2, \ldots, T_M$. (We assume here that the number of feature transformations equals the number of classes, but this is not necessary.) Then, $\hat{p}(\mathbf{x}|H_m)$ is constructed as a mixture density using all the features:
$$\hat{p}(\mathbf{x}|H_m) = \sum_{j=1}^{M} w_{m,j} \, \frac{p(\mathbf{x}|H_{0,j})}{p(\mathbf{z}_j|H_{0,j})} \, g_{m,j}(\mathbf{z}_j),$$
where $H_{0,j}$ is the MaxEnt reference hypothesis corresponding to each feature transformation $T_j$, $\mathbf{z}_j = T_j(\mathbf{x})$, $g_{m,j}(\mathbf{z}_j)$ is the class-m density of feature j, and the $w_{m,j}$ are mixture weights.
- To solve the classification problem (15), it is necessary to obtain a segment of data that can be classified into one of M classes. The problem is often not that simple, and the location of the classifiable “event” may be unknown within a longer data recording, or the data recording may contain multiple events from multiple classes. Using MaxEnt PDF projection, it is possible to solve the data segmentation problem simultaneously with the classification problem [20,21].
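Below is the minimal sketch referred to above (my own structure, not code from the cited works): each class supplies its own feature transformation, reference terms, and learned feature density, and the classifier picks the class that maximizes the log prior plus the projected log-PDF (8).

```python
# Hedged sketch of the general classifier (15) built from projected PDFs.
import numpy as np

def classify(x, classes):
    """classes: list of dicts with keys 'log_prior', 'T', 'log_ref_x', 'log_ref_z', 'log_g'."""
    scores = []
    for c in classes:
        z = c['T'](x)
        log_G = c['log_ref_x'](x) - c['log_ref_z'](z) + c['log_g'](z)   # Equation (8)
        scores.append(c['log_prior'] + log_G)
    return int(np.argmax(scores)), scores
```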
5.2. Other Applications
MaxEnt PDF projection has applications in the analysis of networks and feature transformations. For example, in neural networks, it is possible to view a feed-forward neural network as a generative network, a duality relationship between two opposing types of networks [22]. In addition, the restricted Boltzmann machine (RBM) can be used as a PDF estimator with a tractable distribution [13]. In feature inversion, MaxEnt PDF projection can be used to find MaxEnt point estimates of the input data based on fixed values of the feature [9].
6. Conclusions
In this short paper, the method of maximum entropy PDF projection was presented as a generalization of Jaynes’ maximum entropy principle with moment constraints. The mathematical basis of maximum entropy PDF projection was reviewed and practical considerations and applications were presented.
Funding
This research was funded by Fraunhofer FKIE, Wachtberg, Germany.
Conflicts of Interest
The author declares no conflict of interest.
References
- Jaynes, E.T. Information Theory and Statistical Mechanics I. Phys. Rev. 1957, 106, 620–630. [Google Scholar] [CrossRef]
- Kesavan, H.K.; Kapur, J.N. The Generalized maximum Entropy Principle. IEEE Trans. Syst. Man Cybern. 1989, 19, 1042–1052. [Google Scholar] [CrossRef]
- Banavar, J.R.; Maritan, A.; Volkov, I. Applications of the principle of maximum entropy: From physics to ecology. J. Phys. Condens. Matter 2010, 22, 063101. [Google Scholar] [CrossRef] [PubMed]
- Holmes, D.E. (Ed.) Entropy, Special Issue on Maximum Entropy and Its Application; MDPI: Basel, Switzerland, 2018. [Google Scholar]
- Martino, A.D.; Martino, D.D. An introduction to the maximum entropy approach and its application to inference problems in biology. Heliyon 2018, 4, e00596. [Google Scholar] [CrossRef] [PubMed]
- Picone, J.W. Signal Modeling Techniques in Speech Recognition. Proc. IEEE 1993, 81, 1215–1247. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Maximum entropy PDF projection: A review. AIP Conf. Proc. 2017. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Maximum Entropy PDF Design Using Feature Density Constraints: Applications in Signal Processing. IEEE Trans. Signal Process. 2015, 63, 2815–2825. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Uniform Manifold Sampling (UMS): Sampling the Maximum Entropy PDF. IEEE Trans. Signal Process. 2017, 65, 2455–2470. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. The PDF Projection Theorem and the Class-Specific method. IEEE Trans. Signal Process. 2003, 51, 672–685. [Google Scholar] [CrossRef]
- Kay, S.M.; Nuttall, A.H.; Baggenstoss, P.M. Multidimensional probability density function approximations for detection, classification, and model order selection. IEEE Trans. Signal Process. 2001, 49, 2240–2252. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. The Class-Specific Classifier: Avoiding the Curse of Dimensionality (Tutorial). IEEE Aerosp. Electron. Syst. Mag. Spec. Tutor. Add. 2004, 19, 37–52. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Evaluating the RBM Without Integration Using PDF Projection. In Proceedings of the EUSIPCO 2017, Kos, Greece, 28 August–2 September 2017. [Google Scholar]
- Nuttall, A.H. Saddlepoint Approximation and First-Order Correction Term to The Joint Probability Density Function of M Quadratic and Linear Forms in K Gaussian Random Variables With Arbitrary Means and Covariances; NUWC Technical Report 11262; US Naval Undersea Warfare Center: Newport, RI, USA, 2000. [Google Scholar]
- Nuttall, A.H. Joint Probability Density Function of Selected Order Statistics And the Sum of the Remaining Random Variables; NUWC Technical Report 11345; US Naval Undersea Warfare Center: Newport, RI, USA, 2002. [Google Scholar]
- Baggenstoss, P.M. Class-Specific Features in Classification. IEEE Trans. Signal Process. 1999, 47, 3428–3432. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Optimal Detection and Classification of Diverse Short-Duration Signals. In Proceedings of the International Conference on Cloud Engineering, Boston, MA, USA, 11–14 March 2014; pp. 534–539. [Google Scholar]
- Baggenstoss, P.M. Class-specific model mixtures for the classification of time-series. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015. [Google Scholar]
- Baggenstoss, P.M. Class-Specific Model Mixtures for the Classification of Acoustic Time-Series. IEEE Trans. AES 2016, 52, 1937–1952. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. A multi-resolution hidden Markov model using class-specific features. IEEE Trans. Signal Process. 2010, 58, 5165–5177. [Google Scholar] [CrossRef]
- Baggenstoss, P.M. Acoustic Event Classification using Multi-resolution HMM. In Proceedings of the European Signal Processing Conference (EUSIPCO) 2018, Rome, Italy, 3–7 September 2018. [Google Scholar]
- Baggenstoss, P.M. On the Duality Between Belief Networks and Feed-Forward Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 1–11. [Google Scholar] [CrossRef] [PubMed]
© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).