A Kullback–Leibler View of Maximum Entropy and Maximum Log-Probability Methods

Entropy methods offer a convenient general approach to assigning a probability distribution when only partial information is available. The minimum cross-entropy principle selects the distribution that minimizes the Kullback–Leibler divergence subject to the given constraints. This general principle encompasses a wide variety of distributions, and generalizes other methods that have been proposed independently. There remains, however, some confusion about the breadth of entropy methods in the literature. In particular, the asymmetry of the Kullback–Leibler divergence gives rise to two important special cases when the target distribution is uniform: the maximum entropy method and the maximum log-probability method. This paper compares the performance of both methods under a variety of conditions. We also examine a generalized maximum log-probability method as a further demonstration of the generality of the entropy approach.


Introduction
Estimating the underlying probability distribution of the decision alternatives is an essential step for every decision that involves uncertainty [1]. For example, when making investments, the distribution over profitability is required, and when designing an engineered solution, the probability of failure for each option is required.
The method used for constructing a joint probability distribution depends on the properties of the problem and the information that is available. When all the conditional probabilities are known, Bayes' expansion formula provides an exact solution. The problem becomes more challenging, however, when incomplete information or computational intractability necessitates the use of approximate methods. Maximum likelihood estimation, Bayesian statistics [2], entropy methods [3], and copulas [4] are among the methods for estimating the parameter(s) underlying the distribution or the distribution itself.
Edwin Jaynes [3] proposed the minimum cross-entropy method as a means to determine prior probabilities in decision analysis. Entropy methods rely on the optimization of an objective function where the objective is the Kullback-Leibler divergence. The available information is incorporated in the form of constraints in the optimization problem. Both directions of the cross-entropy method are widely used in decision analysis, particularly in aggregating expert opinion [5].
Such entropy methods can produce many different distributions, which has led to confusion in some parts of the literature about the applicability and generality of the entropy approach. For example, [6] criticizes entropy methods and proposes maximizing the sum of log-probabilities (MLP) as a better alternative, without acknowledging that MLP is a special case of the minimum cross-entropy principle. As we shall see, even generalizations of the MLP method are special cases of entropy methods.
Given this observation, this paper seeks to clarify the relationship between the maximum entropy (ME) and the maximum log-probability (MLP) methods. It is well known that ME is a special case of cross entropy in which the target distribution is uniform [3,7]. We also highlight that the MLP method is a special case of minimum cross entropy (MCE) with a uniform posterior distribution. Thus, not only are the ME and MLP methods both entropy formulations, they are also both instantiations of minimum cross-entropy when a uniform distribution is involved. This paper first reviews the analytic solutions in both directions that highlight this relationship, providing much needed clarification.
In light of the close relationship between the ME and MLP methods, it is important to understand the properties of the methods to support the appropriate application of each. Thus, the second motivation of this paper is to characterize the consequences of using one method versus the other and the error that may result in each case. A simulation method is developed to quantify this error. This paper then derives insights on the performance of the ME and MLP methods based on the numeric results. Finally, the third motivation of this paper is an examination of the geometric properties of the solutions to the ME and MLP methods to further distinguish the two.
The results of this paper are important given the wide applicability of the ME and MLP methods. ME methods are used to approximate univariate distributions [8] and bivariate distributions [9], and in cases with bounds on the distribution [10]. The method has also found applications to utility assessments in decision analysis [11]. The MLP method, on the other hand, has received attention in the literature with applications to parameter estimation [12] and optimization [13].
The analysis of this paper is predicated on understanding entropy methods, including the formulations for the ME and MLP methods. Thus, the paper begins in Section 2 with background information on the relevant entropy methods, showing that the MLP method is a special case of minimum cross entropy. Then, in Section 3, we use a numeric example to highlight conditions under which each method outperforms the other. We examine generalizations in Sections 4 and 5 and geometric properties of the solutions in Section 6. Finally, Section 7 concludes.

The Minimum Cross Entropy (MCE) Problem
Cross entropy is a measure of the relatedness of two probability distributions, P and Q. It can be leveraged through the principle of minimum cross entropy (MCE) to identify the distribution P that satisfies a set of constraints and is closest to a target distribution Q, where the "closeness" is measured by the Kullback-Leibler divergence [14,15]. For a discrete reference distribution Q estimated with discrete distribution P, the Kullback-Leibler divergence is:

$$K(P : Q) = \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)} \tag{1}$$

where $p(x_i)$ and $q(x_i)$ represent the probabilities for outcomes $i = 1, \ldots, n$ of distributions P and Q, respectively [14,15]. The measure is nonnegative and is equal to zero if and only if the two distributions are identical. Importantly, the Kullback-Leibler divergence is not symmetric. It does not satisfy the triangle inequality, and $K(P : Q)$ and $K(Q : P)$ are not generally equal. Hence, depending on the direction of its objective function, the MCE problem can produce different results [16]. This property leads the Kullback-Leibler divergence to also be called the directed divergence.
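The asymmetry described above is easy to check numerically. The following is a minimal sketch, assuming natural logarithms and two small hypothetical distributions chosen only for illustration; the function name `kl_divergence` is not from the paper.

```python
import math

def kl_divergence(p, q):
    """K(P : Q) = sum_i p_i * log(p_i / q_i), with 0 * log(0 / q) taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical three-outcome distributions (illustration only).
p = [0.7, 0.2, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]

forward = kl_divergence(p, q)  # K(P : Q)
reverse = kl_divergence(q, p)  # K(Q : P)
print(forward, reverse)        # both nonnegative, but not equal to each other
```

Swapping the arguments changes the value, which is why the two directions of the MCE problem generally lead to different solutions.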
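We use the notation $P^1_{CE}$ to indicate the forward direction of the problem, i.e., Direction (1), where the goal is to minimize the divergence of the MCE distribution $P = \{p(x_i), i = 1, \ldots, n\}$ from a known target distribution $Q = \{q(x_i), i = 1, \ldots, n\}$. In this direction, the problem formulation is:

$$P^1_{CE}: \quad \min_{P} \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)} \quad \text{s.t.} \quad \sum_{i=1}^{n} p(x_i) = 1, \quad \sum_{i=1}^{n} f_j(x_i)\, p(x_i) = \mu_j, \quad p(x_i) \ge 0 \tag{2}$$

where $f_j(x_i)$ are the moment functions and $\mu_j$ denotes the known value of the j-th moment. We use the notation $P^2_{CE}$ to indicate the second direction, i.e., Direction (2), which is the reverse problem. The distribution $P = \{p(x_i), i = 1, \ldots, n\}$ is the target distribution for which the parameters are unknown. This reverse direction is a special case of the maximum likelihood problem and is formulated as:

$$P^2_{CE}: \quad \min_{P} \sum_{i=1}^{n} q(x_i) \log \frac{q(x_i)}{p(x_i)} \tag{3}$$

subject to the same constraints as in (2). The analytic solution of the MCE problem is known; it is a convex optimization problem solved using Lagrangian multipliers [16]. The solution for the minimum cross-entropy formulation in Direction (1), $P^{1*}_{CE}$, has an exponential form [16]:

$$p^{1*}_{CE}(x_i) = q(x_i)\, e^{-1 - \lambda_0 - \sum_j \lambda_j f_j(x_i)} \tag{4}$$

where $\lambda_0$ and $\lambda_j$ are the Lagrangian multipliers associated with the unity constraint and the j-th moment constraint, respectively (their signs depend on how the Lagrangian is written). Similarly, the solution in the reverse direction, $P^{2*}_{CE}$, has an inverse form:

$$p^{2*}_{CE}(x_i) = \frac{q(x_i)}{\lambda_0 + \sum_j \lambda_j f_j(x_i)} \tag{5}$$

where $\lambda_0$ and $\lambda_j$ are again the Lagrangian multipliers associated with unity and the j-th constraint, respectively. Refer to Appendix A for the calculations. Next, we use these analytic solutions to examine the relationship between MCE and the ME and MLP methods and show how MCE relates the two.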
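As a brief sketch of where the exponential and inverse forms come from (Appendix A contains the full calculations), the stationarity conditions of the two Lagrangians can be written out as below. The symbol $\mu_j$ for the known value of the j-th moment and the sign convention for the multipliers are assumptions made here for illustration; other conventions only change the signs of $\lambda_0$ and $\lambda_j$.

```latex
\begin{align*}
\mathcal{L}_1 &= \sum_i p(x_i)\log\frac{p(x_i)}{q(x_i)}
  + \lambda_0\Big(\sum_i p(x_i) - 1\Big)
  + \sum_j \lambda_j\Big(\sum_i f_j(x_i)\,p(x_i) - \mu_j\Big), \\
\frac{\partial \mathcal{L}_1}{\partial p(x_i)} &= \log\frac{p(x_i)}{q(x_i)} + 1 + \lambda_0 + \sum_j \lambda_j f_j(x_i) = 0
  \;\Longrightarrow\;
  p(x_i) = q(x_i)\,e^{-1-\lambda_0-\sum_j \lambda_j f_j(x_i)}, \\[6pt]
\mathcal{L}_2 &= \sum_i q(x_i)\log\frac{q(x_i)}{p(x_i)}
  + \lambda_0\Big(\sum_i p(x_i) - 1\Big)
  + \sum_j \lambda_j\Big(\sum_i f_j(x_i)\,p(x_i) - \mu_j\Big), \\
\frac{\partial \mathcal{L}_2}{\partial p(x_i)} &= -\frac{q(x_i)}{p(x_i)} + \lambda_0 + \sum_j \lambda_j f_j(x_i) = 0
  \;\Longrightarrow\;
  p(x_i) = \frac{q(x_i)}{\lambda_0 + \sum_j \lambda_j f_j(x_i)}.
\end{align*}
```

The logarithm of $p$ in the forward objective produces the exponential form, whereas the reciprocal of $p$ in the derivative of the reverse objective produces the inverse form.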

The Maximum Entropy (ME) Method
The ME method is an entropy approach that identifies the distribution with the largest entropy among the set of distributions that satisfy constraints imposed by known information [17,18]. The classic ME formulation uses Shannon's entropy as the objective function [18]. Then, for a discrete random variable X, the maximum entropy distribution $P^*_{ME}$ is the solution to the following optimization problem:

$$P^*_{ME} = \arg\max_{P} \; -\sum_{i=1}^{n} p(x_i) \log p(x_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} p(x_i) = 1, \quad \sum_{i=1}^{n} f_j(x_i)\, p(x_i) = \mu_j, \quad p(x_i) \ge 0 \tag{6}$$

In this notation, $f_j(x_i)$ are the moment functions, $\mu_j$ are the known moment values, and $p(x_i)$ indicates the probability of the outcome $X = x_i$. The constraints in (6) are imposed by unity and by the known moments, which represent the known information.
A well-known result is that the ME method is the special case of MCE in which the target distribution is uniform [3,7]. This fact is shown by solving the ME problem and obtaining:

$$p^*_{ME}(x_i) = e^{-1 - \lambda_0 - \sum_j \lambda_j f_j(x_i)} \tag{7}$$

where $\lambda_0$ and $\lambda_j$ are the Lagrangian multipliers associated with the unity and moment constraints. Notice that replacing $q(x_i)$ in the MCE solution in the forward direction (Equation (4)) with a uniform distribution gives a result that matches Equation (7), with the constant $1/n$ absorbed into $\lambda_0$. These matching solutions show that ME is the special case of MCE with a uniform target distribution Q. The calculations to solve (6) are in Appendix A.
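For a single mean constraint, the exponential form of the ME solution reduces the problem to finding one multiplier. The following is a minimal sketch, assuming a mean constraint only and a simple bisection search; the function name `max_entropy_mean`, the search bracket, and the example mean of 6 are illustrative choices, not part of the paper.

```python
import math

def max_entropy_mean(xs, mu, tol=1e-12):
    """Maximum entropy pmf on outcomes xs with mean mu: p_i proportional to exp(-lam * x_i)."""
    def pmf(lam):
        # Subtract the largest exponent before exponentiating to avoid overflow.
        exps = [-lam * x for x in xs]
        top = max(exps)
        w = [math.exp(e - top) for e in exps]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(x * p for x, p in zip(xs, pmf(lam)))

    # mean(lam) is decreasing in lam, so bisection applies (assumed bracket).
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) > mu:
            lo = mid   # mean still too large: increase lam
        else:
            hi = mid
    return pmf(0.5 * (lo + hi))

xs = list(range(1, 21))
p_me = max_entropy_mean(xs, mu=6.0)
p_uniform = max_entropy_mean(xs, mu=10.5)  # mean of the uniform pmf on 1..20
```

When the prescribed mean equals the mean of the uniform distribution, the multiplier is zero and the sketch returns the uniform distribution, as expected from the ME principle.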

The Maximum Log-Probability (MLP) Method
The MLP method is similarly based on an optimization. In this formulation, however, the objective is to maximize the sum of the log-probabilities. Thus, the MLP distribution is:

$$P^*_{MLP} = \arg\max_{P} \sum_{i=1}^{n} \log p(x_i) \tag{8}$$

subject to the same unity, moment, and non-negativity constraints as in (6). Then, the solution for the MLP method with mean and unity constraints can be written as:

$$p^*_{MLP}(x_i) = \frac{1}{\lambda_0 + \lambda_1 x_i} \tag{9}$$

Notice that replacing $q(x_i)$ in the MCE solution in the reverse direction (Equation (5)) with a uniform distribution gives a result that matches Equation (9), up to a rescaling of the multipliers. These matching solutions show that the MLP method is the special case of MCE in which the posterior distribution P is uniform. We also wish to highlight that the analytic center method proposed by Sonnevend [19] has been used in conjunction with MLP [6].
The results in this section illuminate the relationship between the ME and MLP methods; they are both instantiations of MCE and simply represent different directions of the problem.
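The inverse form of the MLP solution can also be computed with a one-dimensional search. The sketch below assumes a mean constraint only. Multiplying the inverse-form solution by its denominator and summing shows that the unity and mean constraints together imply $\lambda_0 + \lambda_1 \mu = n$, so a single unknown remains. The function name `max_log_prob_mean` and the example mean are illustrative assumptions.

```python
def max_log_prob_mean(xs, mu, iters=200):
    """MLP pmf with mean mu: p_i = 1 / (n + lam * (x_i - mu)), i.e. lambda_0 = n - lam * mu.

    Assumes min(xs) < mu < max(xs).
    """
    n = len(xs)
    d = [x - mu for x in xs]

    def score(lam):
        # Equals sum_i (x_i - mu) * p_i, which is zero exactly when the mean constraint holds.
        return sum(di / (n + lam * di) for di in d)

    # All denominators must stay positive, which bounds lam to an open interval.
    lo = -n / max(d)
    hi = n / -min(d)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:   # score is decreasing in lam
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [1.0 / (n + lam * di) for di in d]

xs = list(range(1, 21))
p_mlp = max_log_prob_mean(xs, mu=6.0)
```

At the root, the probabilities sum to one automatically, and their reciprocals are linear in the outcomes, which is the defining feature of the inverse form.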

Simulation to Quantify Error Based on the Underlying Distribution
Given the clarification that shows the similarity between the ME and MLP methods, it is important to understand how the methods are different in order to discern, if possible, the cases in which one method is preferable to the other. Comparing the functional forms of the solutions (7) and (9) is a starting point for discerning differences. We suspect that the ME method performs better when the underlying probability distribution has an exponential form, whereas the MLP method performs better when the underlying distribution is a rational probability mass function. This section investigates the role of the underlying distribution on method performance.
We design a simulation-based approach to study the performance of the two methods for different probability distribution functions. Generating numerical examples from target distributions facilitates the evaluation of the performance of these two methods in approximating the probability distribution for different distribution functions. Based on the functional forms of their solutions, we consider two distribution families:
1. Discretized exponential family distribution;
2. Discretized inverse family distribution.
In both families, L is the normalizing factor, X is the vector of random variables, and λ is the vector of parameters. For our study, we generate a test distribution belonging to one of the two families. Then we solve the ME and MLP problems using the desired information (the mean). We consider a simple univariate discrete case.

Simulation Steps
We assume that the underlying random variable X is discrete, with 20 outcomes, X = {1, . . . , 20}, and follows either a discretized exponential or a discretized inverse distribution. The Monte Carlo simulation is run 1000 times, with each run containing the following steps:
1. The functional form of the test distribution is selected.
2. The coefficients for the desired functional form are randomly generated.
3. The probabilities for each outcome are calculated based on the generated coefficients.
4. The given probabilities are normalized such that they sum to one.
5. The mean for the sampled data points is calculated.
6. The optimization problems are solved for $P^*_{ME}$ and $P^*_{MLP}$.
7. The Kullback-Leibler divergence and the total variation are calculated for each approximation.
In Step 7, the Kullback-Leibler divergence and total variation are calculated in order to serve as performance measures for both methods. The total variation is the sum of absolute differences between the original and estimated distribution for each outcome:

$$TV = \sum_{i=1}^{n} \left| p(x_i) - p^*(x_i) \right|$$

where $p(x_i)$ is the original probability and $p^*(x_i)$ is the estimated probability of outcome $x_i$. The results for the simulation are presented in the following two subsections. Note that in Step 1, functions of different orders may be used, and that in Step 6, the optimization can be solved with different constraints. We first report results when using a first order distribution and a constraint on the mean only, and then we present results with a second order distribution and constraints on both the mean and the second moment.
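The simulation steps above can be sketched as follows for the first order exponential case with a mean constraint. This is only an illustration under stated assumptions: the test distribution is taken to be proportional to exp(-(a x + b)) with a and b drawn uniformly from [0, 1], the divergence is computed as K(P : P*), and the helper names (`me_fit`, `mlp_fit`, `one_run`) and the use of bisection are choices made here, not the authors' implementation.

```python
import math
import random

def normalize(w):
    z = sum(w)
    return [wi / z for wi in w]

def bisect_decreasing(f, lo, hi, iters=200):
    """Root of a decreasing function f on the open interval (lo, hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def me_fit(xs, mu):
    """ME with a mean constraint: p_i proportional to exp(-lam * x_i) (xs sorted, short range)."""
    pmf = lambda lam: normalize([math.exp(-lam * (x - xs[0])) for x in xs])
    excess = lambda lam: sum((x - mu) * p for x, p in zip(xs, pmf(lam)))
    return pmf(bisect_decreasing(excess, -30.0, 30.0))

def mlp_fit(xs, mu):
    """MLP with a mean constraint: p_i = 1 / (n + lam * (x_i - mu))."""
    n, d = len(xs), [x - mu for x in xs]
    score = lambda lam: sum(di / (n + lam * di) for di in d)
    lam = bisect_decreasing(score, -n / max(d), n / -min(d))
    return [1.0 / (n + lam * di) for di in d]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def one_run(rng, xs):
    a, b = rng.random(), rng.random()                      # Step 2 (assumed range [0, 1])
    p = normalize([math.exp(-(a * x + b)) for x in xs])    # Steps 3-4 (assumed form; b cancels)
    mu = sum(x * pi for x, pi in zip(xs, p))               # Step 5
    out = {}
    for name, fit in (("ME", me_fit), ("MLP", mlp_fit)):   # Step 6
        est = fit(xs, mu)
        out[name] = (kl(p, est), total_variation(p, est))  # Step 7
    return out

rng = random.Random(0)
xs = list(range(1, 21))
runs = [one_run(rng, xs) for _ in range(100)]
avg_kl = {m: sum(r[m][0] for r in runs) / len(runs) for m in ("ME", "MLP")}
```

Under these assumptions the ME fit recovers the test distribution up to numerical precision, because the test distribution already has the ME functional form, while the MLP fit leaves a strictly positive average divergence.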

Results with a Discretized Exponential Distribution
We first examine the simulation results when the underlying distribution is a first order discretized exponential distribution, where L is the normalizing factor that makes the probabilities sum to one. This function has the same form as the exact solution of the ME method, so we expect the ME method to perform better with respect to the average divergence measures. The results of the simulation for both the ME and MLP methods are reported in Table 1. As expected, the ME method performs better in approximating this distribution, as shown by deviation measures that are several orders of magnitude smaller than those for the MLP method. The solution of the ME method has exactly the same form as the underlying distribution, making this method more precise in recovering it.
We next consider a second order discretized exponential function. A second order exponential function is the exact solution of the ME method with mean and variance constraints. For consistency of the comparison, however, we use both the ME and MLP methods with mean and second moment constraints only. Although the solution from the ME method has an exponential form, it is not exactly the same as the test distribution here. Nevertheless, we expect the ME method to perform better. The results in Table 2 confirm this expectation; the ME method produces significantly smaller divergence measures.

Results with a Discretized Inverse Distribution
Inverse functions have an expression similar to the solution of the MLP method. We explore the possibility that the MLP method performs better with respect to the divergence measures by repeating the simulation when sampling from a first order discretized inverse function. In this scenario, as expected, the MLP method outperforms the ME method with regard to the divergence measures. Table 3 summarizes the results for the simulation.
We conclude the numerical examples by reporting the simulation results for the second order discretized inverse distribution function. As in the discretized exponential example, we use both the mean and the second moment constraints since the order for random variable X has increased. The solution for the MLP method resembles the test distribution function, although they are not the same. As expected, the MLP method performs better than the ME method with respect to the performance measures defined. The numerical results reported in Table 4 show this comparison clearly.
The results discussed in this section confirm the conjecture that the underlying functional form, whether exponential or inverse, affects the performance of the ME method and the MLP method, and represents an important difference between the methods. Neither method outperforms the other in all cases. The ME method performs better when dealing with an exponential distribution function, whereas the MLP method performs better in the case of an underlying inverse function.

Simulation to Quantify Error Based on the Target Distribution
The results in the previous section suggest that the functional form of the underlying distribution plays an important role in selecting the direction of the MCE problem. In this section, we further differentiate the ME and MLP methods by examining the role of the target distribution. Specifically, we examine (i) whether the functional form of the target distribution affects the precision of the approximations and (ii) under which target functions the ME and MLP solutions get closer together or farther apart.
Assuming the general MCE problem, we consider two possible directions, calling them Direction (1) and Direction (2):

$$\text{Direction (1):} \;\; \min_{P} K(P : Q), \qquad \text{Direction (2):} \;\; \min_{P} K(Q : P)$$

Our goal is to investigate the role of the functional form of the target distribution $Q = \{q(x_i) \mid i = 1, \ldots, n\}$. We accomplish this goal with a simulation that recovers a distribution using different functional forms for the target distributions and that solves the MCE problem in both directions, as described in the next section.

Simulation for the Role of the Target Distribution
We use uniform sampling on the simplex [20] to generate the test distribution P. We reconstruct the distribution P with a different target distribution $Q = \{q(x_i) \mid i = 1, \ldots, n\}$ at each run, using the uniform, inverse, or exponential distribution. The underlying random variable X is assumed to be discrete with outcomes X = {1, . . . , 20}.
We run the Monte Carlo simulation 10,000 times. Each run of the simulation contains the following steps:
1.
2. The test distribution is generated using uniform sampling on the simplex. This represents a general case for the underlying distribution;
3. The mean, µ, for the test distribution is calculated as an input for the optimization model;
4. The µ calculated in Step 3 is used for the target distribution of the inverse and the exponential forms: a = 1/µ;
5. The second coefficient, the constant term, for the discretized exponential and the inverse function is randomly generated: b ∈ [0, 1];
6. The optimization problems are solved for $P^1_{CE}$ and $P^2_{CE}$;
7. The Kullback-Leibler divergence and the total variation are calculated for each approximation.
We also calculate the Euclidean norm between the solutions $P^1_{CE}$ and $P^2_{CE}$ for each target distribution:

$$\left\| P^1_{CE} - P^2_{CE} \right\|_2 = \sqrt{\sum_{i=1}^{n} \left( p^1_{CE}(x_i) - p^2_{CE}(x_i) \right)^2}$$

This value indicates the difference between the solutions from each direction when the target function is fixed and enables us to find the distributions for which they are closest/farthest.
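Two ingredients of this simulation are easy to illustrate: drawing a test distribution uniformly from the simplex and computing the Euclidean norm between two solutions. The sketch below uses one standard way to sample uniformly on the simplex, namely normalizing independent unit-rate exponential draws; this choice, the seed, and the function names are assumptions made here, and the paper's reference [20] may use a different construction.

```python
import math
import random

def sample_simplex(n, rng):
    """Draw a pmf uniformly from the (n-1)-simplex by normalizing i.i.d. Exp(1) draws."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    z = sum(w)
    return [wi / z for wi in w]

def euclidean_distance(p, q):
    """Euclidean norm of the difference between two distributions."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

rng = random.Random(0)                      # fixed seed, for reproducibility only
xs = list(range(1, 21))
p_test = sample_simplex(len(xs), rng)       # test distribution
mu = sum(x * p for x, p in zip(xs, p_test)) # input for the optimization models
```

The resulting mean `mu` would then be passed to the two MCE solvers, and `euclidean_distance` applied to their outputs.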

Results of Uniform Sampling on the Simplex
Uniform sampling over the simplex generates a test distribution without providing any information about the shape of the distribution function. It is therefore an appropriate sampling method for comparing the solutions of the MCE problem in the two directions. Table 5 summarizes the results for Direction (1) of the MCE problem, and Table 6 summarizes the results for Direction (2). Each column represents the deviation measures for a different target distribution Q used to reconstruct the test distribution P.
The results in both Tables 5 and 6 show that there is not much difference in using different target distributions. When the underlying distribution is sampled using uniform sampling on the simplex, the information about the shape of the function is not available. Using the MCE method to recover this general distribution, whether using a uniform, a discretized exponential, or a discretized inverse distribution, does not result in a significant difference. Although the distance when the target distribution is exponential or inverse is larger than when the target distribution is uniform, the solutions are still close to each other. This result reiterates the previous result: the MCE method performs similarly in both directions if there is no information other than the mean. A question that remains to be answered is whether this conjecture will hold if the information from higher moments is added to the MCE optimization. The next section examines this question.

The Generalized Maximum Log-Probability Method
The analytic solutions in Section 2 show that the MLP method is an instantiation of the more general MCE principle and raise the question of whether it is possible to improve the performance of the MLP method by using it in this more general scheme. We investigate this question. Specifically, we are interested in the case when the underlying distribution is a discretized exponential distribution. The numerical example in Section 3 shows that the ME method performs better than the MLP method in this case.
We use the Monte Carlo simulation described in Section 4.1. The underlying distribution is generated using the method described in Section 3 as a discretized exponential distribution whose coefficients are generated at random: a, b ∈ [0, 1]. We then use the MCE method in the reverse direction, minimizing K(Q : P), with the unity constraint and mean constraint. To generalize the MLP method, the target distribution Q is chosen from the exponential family rather than the uniform distribution. Precisely, Q is a discretized exponential distribution with coefficients a = 1/µ, where µ is the mean (available information), and c ∈ [0, 1]. The result of the Monte Carlo simulation indicates that the performance of the generalized MLP method is better than the MLP method itself.
The results are shown in Table 8. When comparing the results of Table 8 to those of Table 1, we notice that the ME method still performs better than both the generalized and regular MLP methods. However, the performance of the generalized MLP method improves significantly in comparison to the regular MLP method, both in terms of the Kullback-Leibler divergence and the total variation. This result suggests that the performance of the MLP method can be improved using the generalized form with a proper target distribution.
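The generalized MLP method can be sketched as the reverse-direction MCE problem with an arbitrary target Q. The code below is an illustration under assumptions: only unity and mean constraints are used, the exponential target is taken to be proportional to exp(-x/µ) (any constant term cancels on normalization), the test distribution with coefficient 0.3 is hypothetical, and `reverse_mce_mean` is a name introduced here.

```python
import math

def normalize(w):
    z = sum(w)
    return [wi / z for wi in w]

def reverse_mce_mean(xs, q, mu, iters=200):
    """Minimize K(Q : P) under unity and mean constraints.

    Stationarity gives p_i = q_i / (lam0 + lam1 * x_i); the two constraints imply
    lam0 + lam1 * mu = 1, so p_i = q_i / (1 + lam * (x_i - mu)) with one unknown lam.
    """
    d = [x - mu for x in xs]
    score = lambda lam: sum(qi * di / (1.0 + lam * di) for qi, di in zip(q, d))
    lo, hi = -1.0 / max(d), 1.0 / -min(d)   # keeps every denominator positive
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:                  # score is decreasing in lam
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [qi / (1.0 + lam * di) for qi, di in zip(q, d)]

xs = list(range(1, 21))
p_true = normalize([math.exp(-0.3 * x) for x in xs])   # hypothetical test distribution
mu = sum(x * p for x, p in zip(xs, p_true))

q_uniform = [1.0 / len(xs)] * len(xs)                  # uniform target: regular MLP
q_exp = normalize([math.exp(-x / mu) for x in xs])     # exponential target with a = 1/mu

p_mlp = reverse_mce_mean(xs, q_uniform, mu)
p_gen = reverse_mce_mean(xs, q_exp, mu)
```

With the uniform target the sketch reduces to the regular MLP solution. Repeating the two fits over many randomly generated test distributions and averaging the divergence measures would mirror the comparison reported in Table 8.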

Geometric Interpretation
Examining the geometric properties of the solutions to the ME and MLP methods provides further insight into the performance of each. A simple scenario is used for the analysis. The constraint set for both methods creates a bounded polyhedron, a polytope. We consider only constraints on unity and the mean. In the simplest form, if we assume that the random variable X has two outcomes, X = {1, 2}, then the feasible set contains, at most, one point. Figure 1 shows the case where µ = 1.5. The only feasible solution for this constraint set is P(X = 1) = P(X = 2) = 0.5. The dashed line indicates the second constraint, while the solid line refers to the first constraint. Thus, regardless of the objective function, both the ME method and the MLP method will produce the same solution.

Geometry with a Three-Outcome Variable
The problem becomes more complicated as the number of outcomes increases. For a random variable with three outcomes, the feasible set lies along the intersection of two planes (constraints). The first constraint, $\sum_x p(x) = 1$, creates a simplex. The second plane, $\sum_x x \cdot p(x) = \mu$, intersects the simplex, creating a line. In general, if $X = \{x_1, x_2, x_3\}$, then the line equation for the feasible set, parametrized by $t = p(x_3)$, can be written as:

$$L = \left( \frac{x_2 - \mu + (x_3 - x_2)\, t}{x_2 - x_1}, \;\; \frac{\mu - x_1 - (x_3 - x_1)\, t}{x_2 - x_1}, \;\; t \right) \tag{22}$$

For the special case of X = {1, 2, 3}, the line equation becomes:

$$L = \left( t + 2 - \mu, \;\; -2t + \mu - 1, \;\; t \right) \tag{23}$$

For the case where µ = 2, the line equation becomes L = (t, −2t + 1, t), where both methods find the optimal solution at point t = 1/3, or the uniform distribution. Figure 2 shows the line that is formed as the intersection of these two planes for the case where X = {1, 2, 3} and µ = 2.
It is important to note that L is the line equation and that not all the points on L are feasible. Every element of L has to be non-negative and no greater than one, satisfying the probability axioms. For example, in the case of X = {1, 2, 3} and µ = 2, the values for t can only be between 0 and 0.5. This observation poses a limitation for the Monte Carlo simulation we discuss next.
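The feasible segment of the line can be computed directly from the unity and mean constraints. The following is a minimal sketch, assuming ordered outcomes x1 < x2 < x3 and the parametrization t = p(x3); the function name `feasible_line` is introduced here for illustration.

```python
def feasible_line(xs, mu):
    """Return (point, t_min, t_max) for three ordered outcomes, with t = p(x3)."""
    x1, x2, x3 = xs  # assumes x1 < x2 < x3

    def point(t):
        # Solve the unity and mean constraints for p(x1) and p(x2) given t.
        p1 = (x2 - mu + (x3 - x2) * t) / (x2 - x1)
        p2 = (mu - x1 - (x3 - x1) * t) / (x2 - x1)
        return (p1, p2, t)

    t_min = max(0.0, (mu - x2) / (x3 - x2))  # from p(x1) >= 0 and t >= 0
    t_max = (mu - x1) / (x3 - x1)            # from p(x2) >= 0
    return point, t_min, t_max

point, t_min, t_max = feasible_line((1, 2, 3), 2)
print(t_min, t_max)   # 0.0 0.5
print(point(0.25))    # (0.25, 0.5, 0.25)
```

For X = {1, 2, 3} and µ = 2, the sketch reproduces the range of t between 0 and 0.5 stated above. Upper bounds of one on each probability follow automatically from non-negativity and the unity constraint.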

Monte Carlo Simulation
We design a simulation to observe the geometric properties of the solutions of the ME and MLP methods and to locate the solutions on the feasible set (line). We assume that the random variable X is discrete with three outcomes, X = {1, 11, 21}. Using the line Equation (22), we modify the mean, µ, and track the changes in the Kullback-Leibler divergence and the total variation. The algorithm can be summarized as follows:
1. The value for µ is determined: µ = {1, . . . , 21};
2. Based on the result of Step 1, the feasible range for $p(x_3) = t$ is determined using the line equation L of Equation (22);
3. The value for t is incremented by 0.005 from the minimum to the maximum that was computed in the previous step;
4. Using the line equation, the values for $p(x_1)$ and $p(x_2)$ are determined;
5. $P = (p(x_1), p(x_2), p(x_3))$ is specified as the desired test distribution;
6. The optimization problems are solved for $P^*_{ME}$ and $P^*_{MLP}$;
7. The Euclidean norm of the difference between the solutions of the ME and MLP methods is calculated:

$$\left\| P^*_{ME} - P^*_{MLP} \right\|_2 = \sqrt{\sum_{i=1}^{3} \left( p^*_{ME}(x_i) - p^*_{MLP}(x_i) \right)^2} \tag{24}$$

Euclidean Distance of the ME and MLP Solutions
Figure 3 shows the Euclidean distance between the solutions of the ME and the MLP methods for every value of the mean, µ = {1, . . . , 21}. The distance between the solutions of both methods is the smallest for the boundary cases: µ = 1 or µ = 21. These instances are the cases with only one feasible solution: P = (1, 0, 0) and P = (0, 0, 1). Hence, the solutions for the ME method and the MLP method coincide. The other minimum occurs in the case of µ = 11. In this case, the number of points in the feasible set is the maximum possible, but both methods provide the uniform solution: $P^*_{ME} = P^*_{MLP} = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)$. This solution is what one would expect from the ME method as it is the solution with the maximum uncertainty (i.e., maximum entropy). From these results, we see that the distance between the methods vanishes around the uniform distribution, but increases farther away from it. These results underscore the insights derived previously in this paper showing that there are conditions under which both the ME and MLP methods will produce the same results, and there are also conditions under which the solutions will differ.
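The qualitative shape of this curve can be reproduced with a short sketch. As a simplification assumed here, the code solves each method directly along the feasible line for every integer mean, using a ternary search over t (both objectives are concave along the line), rather than following the seven steps of the algorithm above. The function names are illustrative.

```python
import math

XS = (1, 11, 21)

def point(t, mu):
    """Point on the feasible line for mean mu, parametrized by t = p(x3)."""
    x1, x2, x3 = XS
    p1 = (x2 - mu + (x3 - x2) * t) / (x2 - x1)
    p2 = (mu - x1 - (x3 - x1) * t) / (x2 - x1)
    return (p1, p2, t)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def log_prob(p):
    return sum(math.log(pi) for pi in p) if min(p) > 0 else -math.inf

def argmax_on_line(objective, mu, iters=200):
    """Ternary search for the maximizer of a concave objective on the feasible segment."""
    x1, x2, x3 = XS
    lo = max(0.0, (mu - x2) / (x3 - x2))
    hi = (mu - x1) / (x3 - x1)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(point(m1, mu)) < objective(point(m2, mu)):
            lo = m1
        else:
            hi = m2
    return point(0.5 * (lo + hi), mu)

def me_mlp_distance(mu):
    p_me = argmax_on_line(entropy, mu)
    p_mlp = argmax_on_line(log_prob, mu)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_me, p_mlp)))

distances = {mu: me_mlp_distance(mu) for mu in range(1, 22)}
```

In this sketch the distance is zero at µ = 1 and µ = 21, where the feasible set is a single point, vanishes (up to numerical tolerance) at µ = 11, where both methods return the uniform distribution, and is strictly positive at intermediate means such as µ = 6.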

Conclusions
In this paper, we first reviewed the notion that both the ME and MLP methods are specific instantiations of the minimum cross-entropy principle. Through analysis and numerical examples, we then established that the information about the target distribution can significantly affect the performance of the methods. The ME method performs well with exponential distributions, whereas the MLP method has better performance with inverse distributions. We then used the minimum cross-entropy method to generalize the maximum log-probability approach.
The analysis shows that it is not, in general, possible to determine that one method (direction of the Kullback-Leibler divergence) yields better results than the other. Rather, the performance depends on the problem and the information that is available. This work highlights the need to appropriately match the method used to the information available and opens the door to future research on questions such as the performance of these methods in particular contexts and methods to capture all types of available information. We hope this work helps clarify some of the confusion and criticisms of entropy methods and their special cases in the literature. We also hope to see further applications of entropy methods in a variety of fields.

Figure 1. Feasible set for the probability distribution when the variable has two outcomes.

Figure 2. Feasible set (bold line) when the random variable has three outcomes.

Figure 3. Euclidean norm of the difference between the ME and MLP solutions.

Table 1. Univariate first order exponential function with 20 outcomes.

Table 2. Univariate second order exponential function with 20 outcomes.

Table 3. Univariate first order rational function with 20 outcomes.

Table 4. Univariate second order rational function with 20 outcomes.

Table 7. Euclidean distance of solutions of both directions.

Table 8. Performance of MLP vs. generalized MLP methods.