On an Objective Basis for the Maximum Entropy Principle

In this letter, we elaborate on some of the issues raised by a recent paper by Neapolitan and Jiang concerning the maximum entropy (ME) principle and alternative principles for estimating probabilities consistent with known, measured constraint information. We argue that the ME solution for the “problematic” example introduced by Neapolitan and Jiang has a stronger objective basis, rooted in results from information theory, than their proposed alternative solution. We also raise some technical concerns about the Bayesian analysis in their work, which was used to independently support their alternative to the ME solution. The letter concludes by noting some open problems involving maximum entropy statistical inference.


Introduction
In a recent paper, "A note of caution on maximizing entropy" [1], the authors considered the problem of estimating a probability mass function given supplied constraint information. They identified as "problematic" the maximum entropy solution for the example of a 3-sided die, where the given constraint information is that the mean die value is two. For this example, maximum entropy (ME) solves the following problem:

$\max_{p_1, p_2, p_3} H(p_1, p_2, p_3)$ subject to $\sum_{i=1}^{3} p_i = 1$, $\sum_{i=1}^{3} i\, p_i = 2$, $p_i \ge 0$.   (1)

In this case, one can easily show that the maximum entropy solution, consistent with the given constraints, is the uniform distribution $p_i = \frac{1}{3}$, $i = 1, 2, 3$. In their paper, the authors propose an alternative "objectively-based" approach for solving this problem. Specifically, they suppose that the probabilities are random variables, uniformly distributed over their ranges, which are prescribed by the given constraints, i.e., $p_1 = p_3 \in [0, 0.5]$ and $p_2 \in [0, 1]$. Accordingly, they choose, as their estimated probabilities, the expected values of these (uniformly distributed) random variables: $(\hat{p}_1, \hat{p}_2, \hat{p}_3) = (0.25, 0.5, 0.25)$. The authors argue on intuitive grounds that their solution may be preferable to the ME solution, as they state: "$p_2$ could be as high as 1, while the other probabilities are bounded above by 0.5....[so] we may be inclined to bet on 2. Once the information gives us reason to prefer one alternative over the others, it is troublesome to claim that the probabilities...are equal." They then also consider a Bayesian learning setting and show, under particular stated assumptions, that Bayesian updating is consistent with their proposed solution.

Beyond identifying what the authors call a "problematic" example for the maximum entropy principle, their paper gives historical background on the interpretation of probability, including excerpts of Jaynes' views on maximum entropy and some of the multiple senses in which, based on Jaynes' writings, one can construe that the maximum entropy principle gives "objective" probabilities.
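Both candidate assignments for the 3-sided die can be checked numerically. The following is a minimal sketch, assuming NumPy and SciPy are available; the solver (SLSQP), the starting point, and the rounding are illustrative choices of ours and are not taken from [1].

import numpy as np
from scipy.optimize import minimize

# Negative Shannon entropy (natural log); minimizing it maximizes the entropy.
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},             # probabilities sum to one
    {"type": "eq", "fun": lambda p: np.dot([1, 2, 3], p) - 2.0},  # mean die value is two
]
res = minimize(neg_entropy, x0=[0.2, 0.6, 0.2], bounds=[(0.0, 1.0)] * 3,
               constraints=constraints, method="SLSQP")
print("ME solution:", np.round(res.x, 4))      # -> [0.3333, 0.3333, 0.3333]

# Alternative from [1]: probabilities uniform over their constraint-implied ranges,
# p_1 = p_3 in [0, 0.5] and p_2 in [0, 1]; the estimate is the vector of expected values.
alt = np.array([0.5 / 2, 1.0 / 2, 0.5 / 2])    # means of U[0, 0.5], U[0, 1], U[0, 0.5]
print("Expected-value solution:", alt)         # -> [0.25, 0.5, 0.25]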
In this letter, we do not attempt to elucidate or specifically articulate Jaynes' understanding of the maximum entropy principle. The purpose of this letter is to elaborate further on the 3-sided die problem from [1] (as well as related problems, where ME is often applied) in order to further understand and explicate several statistically objective bases for preferring one set of probability assignments over another. In so doing, we will argue that there is strong, objective support for the ME solution, as opposed to the alternative solution proposed by Neapolitan and Jiang. We also identify some open problems in maximum entropy statistical inference.

"Most Probable" Interpretation of Maximum Entropy
In [2], Jaynes does provide a principled basis for preferring the maximum entropy solution over alternative probability assignments. Specifically, let N be the number of repeated trials of an experiment with K possible outcomes $\{\omega_1, \omega_2, \ldots, \omega_K\}$, and with some constraint information, such as the mean die value measured based on these N repeated trials. Note that the outcomes of the individual trials are not known. Nor is the number of occurrences $N_k$ of each distinct outcome $\omega_k$ known. However, suppose that we did know $N_k$, $k = 1, \ldots, K$. For large N, by the weak law of large numbers, e.g., [3], we know that $N_k/N \to p_k$ in probability, where $p_k$, $k = 1, \ldots, K$, are the true probabilities. Thus, if $(N_1, N_2, \ldots, N_K)$ were known, a good choice for the probability assignments would be the frequencies $\hat{p}_k = N_k/N$. Accordingly, estimating $p = (p_1, p_2, \ldots, p_K)$ amounts to estimating $(N_1, \ldots, N_K)$. Let $(x_1, x_2, \ldots, x_N)$, $x_i \in \{\omega_1, \ldots, \omega_K\}$, be a particular N-trial realization sequence (microstate) for the experiment, with associated macrostate (counts) $(N_1, N_2, \ldots, N_K)$ that agrees with the given constraint information. Suppose all such microstates are a priori equally likely. Then the probability of macrostate $(N_1, N_2, \ldots, N_K)$ is:

$P(N_1, \ldots, N_K) = \frac{1}{Z} \binom{N}{N_1, \ldots, N_K},$   (2)

where the multinomial coefficient $\binom{N}{N_1, \ldots, N_K} = \frac{N!}{N_1! \cdots N_K!}$ is the number of distinct microstates consistent with the (constraint-achieving) macrostate, and $Z$ is the total number of microstates consistent with the given constraint information. Since, for any given realization sequence with macrostate $(N_1, \ldots, N_K)$, we would form the probability estimate $\hat{p} = (N_1/N, \ldots, N_K/N)$, $P(N_1, \ldots, N_K)$ is also the probability that we form the probability assignment $(N_1/N, \ldots, N_K/N)$. Thus, if we choose $(N_1, \ldots, N_K)$ to maximize (2), we are determining the probability mass function $(p_1, p_2, \ldots, p_K) = (N_1/N, \ldots, N_K/N)$ that we are most likely to produce as an estimate, given the specified constraint information and the number of die tosses. To maximize (2), one maximizes the multinomial coefficient $\binom{N}{N_1, \ldots, N_K}$. Based on Stirling's approximation [4],

$\binom{N}{N_1, \ldots, N_K} \approx e^{N H(N_1/N, \ldots, N_K/N)},$   (3)

with $H(\cdot)$ Shannon's entropy function (in nats). Accordingly, one can closely approximate maximizing (2) subject to, e.g., a mean value constraint $\frac{1}{N}\sum_{i=1}^{K} i N_i = \mu$ by maximizing Shannon's entropy function $H(N_1/N, \ldots, N_K/N)$ subject to the constraint. Allowing unconstrained probabilities, rather than fractions of N, this amounts to solving:

$\max_{p_1, \ldots, p_K} H(p_1, \ldots, p_K)$ subject to $\sum_{i=1}^{K} p_i = 1$, $\sum_{i=1}^{K} i\, p_i = \mu$, $p_i \ge 0$.   (4)

Note, too, that since (2), or its approximation via (3), is the probability of forming the estimate $(p_1, \ldots, p_K)$, we can use (3) to evaluate the relative likelihoods of producing different candidate probability assignments, i.e., $\frac{P(p_1, \ldots, p_K)}{P(p'_1, \ldots, p'_K)} = e^{N (H(p_1, \ldots, p_K) - H(p'_1, \ldots, p'_K))}$. Thus, the likelihood of the maximum entropy solution, relative to an alternative distribution, grows exponentially with the entropy difference.
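To illustrate how quickly the Stirling-based approximation (3) becomes accurate, one can compare the exact multinomial counts for two constraint-consistent macrostates against the entropy-difference prediction. Below is a small sketch assuming NumPy and SciPy; the choice N = 120 is ours (it makes both macrostates integer-valued) and is purely illustrative.

import numpy as np
from scipy.special import gammaln

def log_multinomial(counts):
    # log of N!/(N_1! ... N_K!), computed via log-gamma for numerical stability
    counts = np.asarray(counts)
    return gammaln(counts.sum() + 1) - gammaln(counts + 1).sum()

def entropy_nats(p):
    # Shannon entropy in nats
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

N = 120                                  # divisible by 12, so both macrostates are integer-valued
me_counts  = [N // 3, N // 3, N // 3]    # macrostate of the ME solution (1/3, 1/3, 1/3)
alt_counts = [N // 4, N // 2, N // 4]    # macrostate of the alternative (1/4, 1/2, 1/4)

exact  = log_multinomial(me_counts) - log_multinomial(alt_counts)            # roughly 6.98 nats
approx = N * (entropy_nats([1/3] * 3) - entropy_nats([0.25, 0.5, 0.25]))     # roughly 7.07 nats
print(exact, approx)

Even at this modest N, the exact log-likelihood ratio and the entropy-difference approximation agree to within a few percent, and the agreement improves further as N grows.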
In [1], the authors acknowledged this statistically-based justification for the maximum entropy distribution, as applied to the 6-sided Brandeis die problem. Moreover, they motivated the 3-sided die problem by stating that "suppose a friend later tossed the die many times...". Thus, their 3-sided die problem genuinely does consider the scenario where the constraint information was accurately measured, based on many repeated die tosses. Accordingly, the above interpretation should be applicable to their 3-sided die problem, just as it is to the Brandeis die problem. For the 3-sided die problem, we have

$\frac{P(1/3,\, 1/3,\, 1/3)}{P(1/4,\, 1/2,\, 1/4)} = 2^{N(\log_2(3) - 1.5)} \sim 2^{0.08N}.$

For example, if N = 1000, the ME distribution is more than $10^{24}$ times more likely to be produced as the estimate than the proposed distribution from [1] (see the short numerical check at the end of this section). The only real assumption in this analysis is that realization sequences (microstates) consistent with the given constraint information are a priori equally likely.

There are also open problems involving maximum entropy statistical inference. One such problem is the choice of which constraints to impose; prior work has addressed the selection of constraints to impose, applied to a natural language processing task. Also, [6] used the Kullback distance and the Bayesian Information Criterion to choose relevant constraints and applied this approach to the analysis of a genome-wide association study. Nevertheless, it may be fruitful to further investigate alternative approaches for this problem.
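As a quick check of the arithmetic behind the N = 1000 figure above, the following few lines (plain Python, standard library only) compute the likelihood ratio on a log scale.

import math

N = 1000
delta_H_bits = math.log2(3) - 1.5                  # H(1/3,1/3,1/3) - H(1/4,1/2,1/4), in bits
log10_ratio = N * delta_H_bits * math.log10(2)     # convert the base-2 exponent to base 10
print(f"likelihood ratio ~ 10^{log10_ratio:.1f}")  # ~ 10^25.6, i.e., well over 10^24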

Conclusions
In this letter, we elaborated on some of the issues raised by a recent paper [1] concerning the maximum entropy (ME) principle and alternative principles for estimating probabilities consistent with known, measured constraint information. We have argued that the ME solution for the "problematic" example introduced in [1] has a stronger objective basis than their proposed alternative solution. We also noted some open problems involving maximum entropy statistical inference.