Entropy, Information, and the Updating of Probabilities

This paper is a review of a particular approach to the method of maximum entropy as a general framework for inference. The discussion emphasizes pragmatic elements in the derivation. An epistemic notion of information is defined in terms of its relation to the Bayesian beliefs of ideally rational agents. The method of updating from a prior to posterior probability distribution is designed through an eliminative induction process. The logarithmic relative entropy is singled out as a unique tool for updating (a) that is of universal applicability, (b) that recognizes the value of prior information, and (c) that recognizes the privileged role played by the notion of independence in science. The resulting framework—the ME method—can handle arbitrary priors and arbitrary constraints. It includes the MaxEnt and Bayes’ rules as special cases and, therefore, unifies entropic and Bayesian methods into a single general inference scheme. The ME method goes beyond the mere selection of a single posterior, and also addresses the question of how much less probable other distributions might be, which provides a direct bridge to the theories of fluctuations and large deviations.


Introduction
Inductive inference is a framework for coping with uncertainty, for reasoning with incomplete information. The framework must include a means to represent a state of partial knowledge (this is handled through the introduction of probabilities) and it must allow us to change from one state of partial knowledge to another when new information becomes available. Indeed, any inductive method that recognizes that a situation of incomplete information is in some way unfortunate (by which we mean that it constitutes a problem in need of a solution) would be severely deficient if it failed to address the question of how to proceed in those fortunate circumstances when new information becomes available. The theory of probability, if it is to be useful at all, demands a method for assigning and updating probabilities.
The challenge is to develop updating methods that are systematic, objective, and practical. When information consists of data and a likelihood function, Bayesian updating is the uniquely natural method of choice. Its foundation lies in recognizing the value of prior information: whatever was learned in the past is valuable and should not be disregarded, which amounts to requiring that beliefs ought to be revised but only to the extent required by the new data. This immediately raises a number of questions: How do we update when the information is not in the form of data? If the information is not data, what else could it possibly be? Indeed, what, after all, is "information"? On a separate line of development, the method of Maximum Gibbs-Shannon Entropy (MaxEnt) allows one to process information in the form of constraints on the allowed probability distributions. This provides a partial answer to one of our questions: in addition to data, information can also take the form of constraints. However, it immediately raises several other questions: What is the interpretation of entropy? Is there a unique entropy? Are Bayesian and entropic methods mutually compatible?
Since our topic is the updating of probabilities when confronted with new information, our starting point is to address the question, "what is information?". In Section 2, we develop a concept of information that is both pragmatic and Bayesian. "Information" is defined in terms of its effects on the beliefs of rational agents. The design of entropy as a tool for updating is the topic of Section 3. There, we state the design specifications that define what function entropy is supposed to perform, and we derive its functional form. To streamline the presentation, some of the mathematical derivations are left to the appendices.
To conclude, we present two further developments. In Section 4, we show that Bayes' rule can be derived as a special case of the ME method. An earlier derivation of this important result following a different line of argument was given by Williams [23] before a sufficient understanding of entropy as an updating tool had been achieved. It is not, therefore, surprising that Williams' achievement has not received the widespread appreciation it deserves. Thus, within the ME framework, entropic and Bayesian methods are unified into a single consistent theory of inference. One advantage of this insight is that it allows a number of generalizations of Bayes' rule [2,8]. Another is that it provides an important missing piece for the old puzzles of quantum mechanics concerning the so-called collapse of the wave function and the quantum measurement problem [24,25].
There is yet another function that the ME method must perform in order to fully qualify as a method of inductive inference. Once we have decided that the distribution of maximum entropy is to be preferred over all others, the following question arises immediately: the maximum of the entropy functional is never infinitely sharp, so are we confident that distributions that lie very close to the maximum are completely ruled out? In Section 5, the ME method is deployed to assess quantitatively the extent to which distributions with lower entropy are ruled out. The significance of this result is that it provides a direct link to the theories of fluctuations and large deviations. Concluding remarks are given in Section 6.

What Is Information?
The term "information" is used with a wide variety of different meanings [10,26,27]. There is the Shannon notion of information that is meant to measure an amount of information and is quite divorced from semantics. There is also an algorithmic notion of information that captures a notion of complexity and originates in the work of Solomonoff, Kolmogorov, and Chaitin [26]; there is a related notion of entropy as a minimum description length [28]. Furthermore, in the general context of the thermodynamics of computation, it is said that "information is physical" because systems "carry" or "contain" information about their own physical state [29][30][31] (see also [32,33]).
Here, we follow a different path [3,4]. We seek an epistemic notion of information that is closer to the everyday colloquial use of the term-roughly, information is what we request when we ask a question. In a Bayesian framework, this requires an explicit account of the relation between information and the beliefs of ideally rational agents. We emphasize that our concern here is with idealized rational agents. Our subject is not the psychology of actual humans who often change their beliefs by processes that are neither fully rational nor fully conscious. We adopt a Bayesian interpretation of probability as a degree of credibility: the degree to which we ought to believe that a proposition is true if only we were ideally rational. For a discussion of a decision theory that might be relevant to the economics and psychology of partially rational agents see [34][35][36]. An entropic framework for modelling economies that bypasses all issues of bounded rationality is described in [37].
It is implicit in the recognition that most of our beliefs are held on the basis of incomplete information that not all probability assignments are equally good; some beliefs are preferable to others in the very pragmatic sense that they enhance our chances to successfully navigate this world. Thus, a theory of probability demands a theory of updating probabilities in order to improve our beliefs.
We are now ready to address the question: What, after all, is "information"? The answer is pragmatic. Information is what information does. Information is defined by its effects: (a) it restricts our options as to what we are honestly and rationally allowed to believe; and (b) it induces us to update from prior beliefs to posterior beliefs. This, I propose, is a defining characteristic of information: Information is that which induces a change from one state of rational belief to another.
One aspect of this notion is that for a rational agent, the identification of what constitutes information-as opposed to mere noise-already involves a judgment, an evaluation. Another aspect is that the notion that information is directly related to changing our minds does not involve any reference to amounts of information, but it nevertheless allows precise quantitative calculations. Indeed, constraints on the acceptable posterior probabilities are precisely the kind of information that the method of maximum entropy is designed to handle. In short, Information constrains probability distributions. The constraints are the information.
To the extent that the probabilities are Bayesian, this definition captures the Bayesian notion that information is directly related to changing our minds, that it is the driving force behind the process of learning. It also incorporates an important feature of rationality: being rational means accepting that "not everything goes", and that our beliefs must be constrained in very specific ways. However, the indiscriminate acceptance of any arbitrary constraint does not qualify as rational behavior. To be rational, an agent must exercise some judgment before accepting a particular piece of information as a reliable basis for the revision of its beliefs, which raises questions about what judgments might be considered sound. Furthermore, there is no implication that the information must be true; only that we accept it as true. False information is information too, at least to the extent that we are prepared to accept it and allow it to affect our beliefs.
The paramount virtue of the definition above is that it is useful; it allows precise quantitative calculations. The constraints that constitute information can take a wide variety of forms. They can be expressed in terms of expected values, they can specify the functional form of a distribution, or be imposed through various geometrical relations. Examples are given in Section 5 and in [38].
Concerning the act of updating, it may be worthwhile to point out an analogy with dynamics. In Newtonian mechanics, the state of motion of a system is described in terms of momentum, and the change from one state to another is said to be "caused" by an applied force or impulse. Bayesian inference is analogous in that a state of belief is described in terms of probabilities, and the change from one state to another is "caused" by information. Just as a force is that which induces a change from one state of motion to another, so information is that which induces a change from one state of belief to another. Updating is a form of dynamics. In [39], the analogy is taken seriously: the logic is reversed and quantum mechanics is derived as an example of the entropic updating of probabilities.

The Pragmatic Design of Entropic Inference
Once we have decided, as a result of the confrontation of new information with old beliefs, that our beliefs require revision, the problem becomes one of deciding how precisely this ought to be done. First, we identify some general features of the kind of belief revision that one might count as rational. Then, we design a method-a systematic procedure-that implements those features. To the extent that the method performs as desired, we can claim success. The point is not that success derives from our method having achieved some intimate connection to the inner wheels of reality; success simply means that the method seems to be working.
The one obvious requirement is that the updated probabilities ought to agree with the newly acquired information. Unfortunately, this requirement, while necessary, is not sufficiently restrictive: we can update in many ways that preserve both internal consistency and consistency with the new information. Additional criteria are needed. What rules would an ideally rational agent choose?

General Criteria
The rules are motivated by the same pragmatic criteria that motivate the design of probability theory itself [8]: universality, consistency, and practical utility. However, this is admittedly too vague; we must be very specific about the precise way in which the criteria are implemented.

Universality
In principle, different systems and different situations could require different problem-specific induction methods. However, in order to be useful in practice, the method we seek must be of universal applicability. Otherwise, it would fail us when most needed, for we would not know which method to choose when not much is known about the system. To put it in different words, what we want to design is a general-purpose method that captures what all the other problem-specific methods might have in common. The idea is that the peculiarities of a particular problem will be captured by the specific constraints that describe the information that is relevant to the problem at hand.
The analogy with mechanics can be found here as well. The possibility of a science of mechanics hinges on identifying a law of motion of universal applicability (e.g., the Schrödinger equation), while the specifics of each system are introduced through initial conditions and the choice of potentials or forces. Here, we shall design an entropy of universal applicability, while the specifics of each problem are introduced through prior probabilities and the choice of constraints.

Parsimony
To specify the updating, we adopt a very conservative criterion that recognizes the value of information: what has been laboriously learned in the past is valuable and should not be disregarded unless rendered obsolete by new information. The only aspects of one's beliefs that should be updated are those for which new evidence has been supplied. Thus, we adopt the following.

Principle of Minimal Updating (PMU):
Beliefs should be updated only to the minimal extent required by the new information.
The special case of updating in the absence of new information deserves a comment. The PMU states that when there is no new information, ideally rational agents should not change their minds. In fact, it is difficult to imagine any notion of rationality that would allow the possibility of changing one's mind for no apparent reason.
Minimal updating offers yet another pragmatic advantage. As we shall see below, rather than identifying what features of a distribution are singled out for updating and then specifying the detailed nature of the update, we will adopt design criteria that stipulate what is not to be updated. The practical advantage of this approach is that it enhances objectivity-there are many ways to change something but only one way to keep it the same. The analogy with mechanics can be pursued even further: if updating is a form of dynamics, then minimal updating is the analogue of inertia. Rationality and objectivity demand a considerable amount of inertia.

Independence
The next general requirement turns out to be crucially important because without it, the very possibility of scientific theories would be compromised. The point is that every scientific model, whatever the topic, if it is to be useful at all, must assume that all relevant variables have been taken into account and that whatever was left out-the rest of the universe-should not matter. To put it another way, in order to do scientific work, we must be able to understand parts of the universe without having to understand the universe as a whole. Granted, a pragmatic understanding need not be complete and exact; it must be merely adequate for our purposes.
The assumption, then, is that it is possible to focus our attention on a suitably chosen system of interest and neglect the rest of the universe because the system and the rest of the universe are "sufficiently independent". Thus, in any form of science, the notion of statistical independence must play a central and privileged role. This idea, that some things can be neglected and that not everything matters, is implemented by imposing a criterion that tells us how to handle independent systems. The chosen criterion is quite natural: whenever two systems are a priori believed to be independent and we receive information about just one, it should not matter if the other is included in the analysis or not. This is an example of the PMU in action; it amounts to requiring that independence be preserved unless information about correlations is explicitly introduced.
Again, we emphasize that none of these criteria are imposed by nature. They are desirable for pragmatic reasons; they are imposed by design.

Entropy as a Tool for Updating Probabilities
Consider a set of propositions {x} about which we are uncertain. The proposition x can be discrete or continuous, in one or in several dimensions. It could, for example, represent the microstate of a physical system, a point in phase space, or an appropriate set of quantum numbers. The uncertainty about x is described by a probability distribution q(x). The goal is to update from the prior distribution q(x) to a posterior distribution p(x) when new information-by which we mean a set of constraints-becomes available. The question is, which distribution among all those that satisfy the constraints should we select?
Our goal is to design a method that allows a systematic search for the preferred posterior distribution. The central idea, first proposed by Skilling [16], is disarmingly simple: to select the posterior, first rank all candidate distributions in increasing order of "preference" and then pick the distribution that ranks the highest. Irrespective of what it is that makes one distribution "preferable" over another (we will get to that soon enough), it is clear that any such ranking must be transitive: if distribution p_1 is preferred over distribution p_2, and p_2 is preferred over p_3, then p_1 is preferred over p_3. Transitive rankings are implemented by assigning to each p a real number S[p], which is called the entropy of p, in such a way that if p_1 is preferred over p_2, then S[p_1] > S[p_2]. The selected distribution (one or possibly many, for there may be several equally preferred distributions) is that which maximizes the entropy functional.
The importance of Skilling's strategy of ranking distributions cannot be overestimated: it answers the questions "why entropy?" and "why a maximum?". The strategy implies that the updating method will take the form of a variational principle-the method of maximum entropy (ME)-involving a certain functional that maps distributions to real numbers. These features are not imposed by nature; they are all imposed by design. They are dictated by the function that the ME method is supposed to perform. (Thus, it makes no sense to seek a generalization in which entropy is a complex number or a vector; such generalized entropies would just not perform the desired function.) Next, we specify the ranking scheme, that is, we choose a specific functional form for the entropy S[p]. Note that the purpose of the method is to update from priors to posteriors so the ranking scheme must depend on the particular prior q and therefore, the entropy S must be a functional of both p and q. The entropy S[p, q] describes a ranking of the distributions p relative to the given prior q. S[p, q] is the entropy of p relative to q, and accordingly, S[p, q] is commonly called relative entropy. This is appropriate and sometimes we will follow this practice. However, since all entropies are relative, even when relative to a uniform distribution, the qualifier "relative" is redundant and can be dropped.
The functional S[p, q] is designed by a process of elimination-this is a process of eliminative induction. First, we state the desired design criteria; this is the crucial step that defines what makes one distribution preferable over another. Candidate functionals that fail to satisfy the criteria are discarded-hence, the qualifier "eliminative". As we shall see, the criteria adopted below are so constraining that there is a single entropy functional S[p, q] that survives the process of elimination.
This approach has a number of virtues. First, to the extent that the design criteria are universally desirable, the single surviving entropy functional will also be of universal applicability. Second, the reason why alternative entropy candidates are eliminated is quite explicit-at least one of the design criteria is violated. Thus, the justification behind the single surviving entropy is not that it leads to demonstrably correct inferences, but rather, that all other candidates demonstrably fail to perform as desired.

Specific Design Criteria
Consider a lattice of propositions generated by a set X of atomic propositions that are mutually exclusive and exhaustive and are labeled by a discrete index i = 1, 2, . . . , n. The extension to infinite sets and to continuous labels turns out to be straightforward. The index i might, for example, label the microstates of a physical system but, since the argument below is supposed to be of general validity, we shall not assume that the labels themselves carry any particular significance. We can always permute labels; this should have no effect on the updating of probabilities.
We adopt design criteria that reflect the structure of the lattice of propositions: the propositions are related to each other by disjunctions (OR) and conjunctions (AND), and the consistency of the web of beliefs is implemented through the sum and product rules of probability theory. Our criteria refer to the two extreme situations of propositions that are mutually exclusive and of propositions that are mutually independent. At one end, we deal with the probabilities of propositions that are highly correlated (if one proposition is true, the other is false and vice versa); at the other end, we deal with the probabilities of propositions that are totally uncorrelated (the truth or falsity of one proposition has no effect on the truth or falsity of the other). One extreme is described by the simplified sum rule, p(i ∨ j) = p(i) + p(j), and the other extreme by the simplified product rule, p(i ∧ j) = p(i)p(j). (For an alternative approach to the foundations of inference that exploits the various symmetries of the lattice of propositions see [40,41].)
The two design criteria and their consequences for the functional form of entropy are given below. Detailed proofs are deferred to the appendices.

DC1:
Probabilities that are conditioned on one subdomain are not affected by information about other non-overlapping subdomains.
Consider a subdomain D ⊂ X composed of atomic propositions i ∈ D and suppose the information to be processed refers to some other subdomain D′ ⊂ X that does not overlap with D, D ∩ D′ = ∅. In the absence of any new information about D, the PMU demands that we do not change our minds about probabilities that are conditional on D. Thus, we design the inference method so that q(i|D), the prior probability of i conditioned on i ∈ D, is not updated. The selected conditional posterior is
$$P(i|D) = q(i|D). \tag{1}$$
We adopt the following notation: priors are denoted by q, candidate posteriors by lower case p, and the selected posterior by upper case P. We shall write either p(i) or p_i. Furthermore, we adopt the notation, standard in physics, where the probabilities of x and θ are written p(x) and p(θ) with no implication that p refers to the same mathematical function.
We emphasize that the point is not that we make the unwarranted assumption that keeping q(i|D) unchanged is guaranteed to lead to correct inferences. It need not; induction is risky. The point is, rather, that in the absence of any evidence to the contrary, there is no reason to change our minds and the prior information takes priority.
The consequence of DC1 is that non-overlapping domains of i contribute additively to the entropy,
$$S[p, q] = \sum_i F(p_i, q_i), \tag{2}$$
where F is some unknown function of two arguments. The proof is given in Appendix A.
Comment 1: It is essential that DC1 refers to conditional probabilities; local information about a domain D′ can (via normalization) have a non-local effect on the probability of another domain D.

Comment 2:
An important special case is the "update" from a prior q(i) to a posterior P(i) in a situation in which no new information is available. The criterion DC1 applied to a situation where the subdomain D covers the whole space of i's, D = X, requires that in the absence of any new information, the prior conditional probabilities are not to be updated:
$$P(i|X) = q(i|X) \quad\text{or}\quad P(i) = q(i). \tag{3}$$

Comment 3:
The criterion DC1 implies Bayesian conditionalization as a special case. Indeed, if the information is given through the constraint p(D̃) = 0, where D̃ is the complement of D, then P(i|D) = q(i|D), which is referred to as Bayesian conditionalization. More explicitly, if θ is the variable to be inferred on the basis of prior information about a likelihood function q(i|θ) and observed data i′, then the update from the prior q to the posterior P,
$$q(i, \theta) = q(i)\, q(\theta|i) \;\longrightarrow\; P(i, \theta) = P(i)\, P(\theta|i), \tag{4}$$
consists of updating q(i) → P(i) = δ_{ii′} to agree with the new information and invoking the PMU so that P(θ|i′) = q(θ|i′) remains unchanged. Therefore,
$$P(i, \theta) = \delta_{ii'}\, q(\theta|i') \quad\text{and}\quad P(\theta) = q(\theta|i') = q(\theta)\,\frac{q(i'|\theta)}{q(i')}, \tag{5}$$
which is Bayes' rule. Thus, entropic inference is designed to include Bayesian inference as a special case. Note, however, that imposing DC1 is not identical to imposing Bayesian conditionalization: DC1 is not restricted to information in the form of absolute certainties, such as p(D) = 1.
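
The last remark can be illustrated with a minimal Python sketch. It assumes a hypothetical four-state prior and a hypothetical helper, `update_domain_probability`, neither of which appears in the paper. The helper imposes the constraint that the posterior probability of a subdomain equals r while leaving the probabilities conditional on the subdomain and on its complement unchanged, which is what minimal updating prescribes (this form of update is often called Jeffrey conditioning). Setting r = 1 recovers ordinary Bayesian conditionalization.

```python
def update_domain_probability(prior, domain, r):
    """Return a posterior whose total probability on `domain` is r, while the
    probabilities conditional on `domain` and on its complement stay equal
    to those of the prior."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    q_in = sum(p for i, p in prior.items() if i in domain)
    q_out = sum(p for i, p in prior.items() if i not in domain)
    if (r > 0 and q_in == 0) or (r < 1 and q_out == 0):
        raise ValueError("constraint is incompatible with the prior")
    posterior = {}
    for i, p in prior.items():
        if i in domain:
            posterior[i] = r * p / q_in if r > 0 else 0.0
        else:
            posterior[i] = (1 - r) * p / q_out if r < 1 else 0.0
    return posterior

# Hypothetical prior over four atomic propositions (illustration only).
prior = {"a": 0.1, "b": 0.3, "c": 0.4, "d": 0.2}
D = {"b", "c"}

soft = update_domain_probability(prior, D, 0.9)  # uncertain information about D
hard = update_domain_probability(prior, D, 1.0)  # certainty: conditionalization

# The conditional probability of "b" given D is the same before and after.
cond_prior = prior["b"] / (prior["b"] + prior["c"])
cond_soft = soft["b"] / (soft["b"] + soft["c"])
print(cond_prior, cond_soft, hard)
```

Although the constraint refers only to D, the probabilities of "a" and "d" also change through normalization, while their ratio stays fixed.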

Comment 4:
If the label i is turned into a continuous variable x, the criterion DC1 requires that information that refers to points infinitely close but just outside the domain D will have no influence on probabilities conditional on D. This may seem surprising, as it may lead to updated probability distributions that are discontinuous, but it is not a problem. In situations where we have explicit reasons to believe that conditions of continuity or differentiability hold, then such conditions should be imposed explicitly. The inference process should not be expected to discover and replicate information with which it was not supplied.

DC2:
When two systems are a priori believed to be independent and the information we receive about one of them makes no reference to the other, then it should not matter whether the latter is included in the analysis of the former or not.
Consider a system of propositions labeled by a composite index, i = (i_1, i_2) ∈ X = X_1 × X_2. For example, {i_1} = X_1 and {i_2} = X_2 might describe the microstates of two separate physical systems. Assume that all prior evidence led us to believe the two subsystems are independent, that is, any two propositions i_1 ∈ X_1 and i_2 ∈ X_2 are believed to be independent. This belief is reflected in the prior distribution: if the individual subsystem priors are q_1(i_1) and q_2(i_2), then the prior for the whole system is q_1(i_1)q_2(i_2). Next, suppose that new information is acquired such that q_1(i_1) would by itself be updated to P_1(i_1), and that q_2(i_2) would by itself be updated to P_2(i_2). DC2 requires that S[p, q] be such that the joint prior q_1(i_1)q_2(i_2) updates to the product P_1(i_1)P_2(i_2) so that inferences about one subsystem do not affect inferences about the other.
The consequence of DC2 is to fully determine the unknown function F in (2) so that probability distributions p(i) should be ranked relative to the prior q(i) according to the relative entropy,
$$S[p, q] = -\sum_i p(i) \log\frac{p(i)}{q(i)}. \tag{6}$$

Comment 1:
We emphasize that the point is not that when we have no evidence for correlations, we draw the firm conclusion that the systems must necessarily be independent. Induction involves risk; the systems might, in actual fact, be correlated through some unknown interaction potential. The point is rather that if the joint prior reflected independence and the new evidence is silent on the matter of correlations, then the evidence we actually have (namely, the prior) takes precedence, and there is no reason to change our minds. As before, the PMU requires that a feature of the probability distribution (in this case, independence) will not be updated unless the evidence requires it.

Comment 2:
We also emphasize that DC2 is not a consistency requirement. The argument we deploy is not that both the prior and the new information tell us the systems are independent, in which case consistency requires that it should not matter whether the systems are treated jointly or separately. DC2 refers to a situation where the new information does not say whether the systems are independent or not. Rather, the updating is being designed, through the PMU, so that the independence reflected in the prior is maintained in the posterior by default.
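
The behavior required by DC2 can be checked numerically for the logarithmic relative entropy. The sketch below is an illustration under stated assumptions, not part of the original derivation: the priors, the constrained function, and the helper names (`tilt`, `solve_multiplier`) are hypothetical. It uses the standard result that, for a single expected-value constraint, the maximizer of the relative entropy is proportional to q_i exp(λ f_i). A constraint on subsystem 1 alone is processed twice, once for subsystem 1 by itself and once for the joint system, and the two answers are compared. A correlated candidate that satisfies the same constraint is also shown to rank lower.

```python
import math

def relative_entropy(p, q):
    """S[p, q] = -sum_i p_i log(p_i / q_i), with 0 log 0 taken as 0."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tilt(prior, f, lam):
    """Normalized distribution proportional to prior_i * exp(lam * f_i)."""
    w = [qi * math.exp(lam * fi) for qi, fi in zip(prior, f)]
    z = sum(w)
    return [wi / z for wi in w]

def solve_multiplier(prior, f, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection for the multiplier lam that gives expected value <f> = target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean = sum(pi * fi for pi, fi in zip(tilt(prior, f, mid), f))
        if mean < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical independent priors for the two subsystems.
q1 = [0.5, 0.3, 0.2]
q2 = [0.6, 0.4]
f1 = [0.0, 1.0, 2.0]   # function of subsystem 1 only
target = 1.0           # new information: <f1> = 1

# Subsystem 1 treated by itself.
P1 = tilt(q1, f1, solve_multiplier(q1, f1, target))

# Joint treatment on the product space.
pairs = [(a, b) for a in range(3) for b in range(2)]
q_joint = [q1[a] * q2[b] for a, b in pairs]
f_joint = [f1[a] for a, b in pairs]
P_joint = tilt(q_joint, f_joint, solve_multiplier(q_joint, f_joint, target))

# DC2: the joint posterior should equal the product P1 * q2.
product = [P1[a] * q2[b] for a, b in pairs]
max_gap = max(abs(x - y) for x, y in zip(P_joint, product))

# A correlated candidate with the same marginals, hence the same <f1>.
eps = 0.02
correlated = list(product)
correlated[0] += eps   # (i1, i2) = (0, 0)
correlated[1] -= eps   # (0, 1)
correlated[2] -= eps   # (1, 0)
correlated[3] += eps   # (1, 1)

print(max_gap)
print(relative_entropy(P_joint, q_joint), relative_entropy(correlated, q_joint))
```

The joint posterior factorizes, and the candidate that introduces correlations not required by the constraint receives a lower entropy.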

Comment 3:
The generalization to continuous variables x ∈ X is approached as a Riemann limit from the discrete case. A continuous probability density p(x) or q(x) can be approximated by discrete distributions. Divide the region of interest X into a large number N of small cells. The probabilities of each cell are as follows:
$$p_i = p(x_i)\,\Delta x_i \quad\text{and}\quad q_i = q(x_i)\,\Delta x_i, \tag{7}$$
where ∆x_i is an appropriately small interval. The discrete entropy of p_i relative to q_i is as follows:
$$S[p, q] = -\sum_{i=1}^{N} \Delta x_i\, p(x_i) \log\frac{p(x_i)\,\Delta x_i}{q(x_i)\,\Delta x_i}, \tag{8}$$
and in the limit as N → ∞ and ∆x_i → 0 we get the Riemann integral
$$S[p, q] = -\int dx\, p(x) \log\frac{p(x)}{q(x)}. \tag{9}$$
(To simplify the notation, we include multi-dimensional integrals by writing d^n x = dx.) It is easy to check that the ranking of distributions induced by S[p, q] is invariant under coordinate transformations. The insight that coordinate invariance could be derived as a consequence of the requirement of subsystem independence first appeared in [5].
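
The Riemann limit can be checked numerically. The following sketch is only an illustration: the two Gaussian densities, the integration range, and the function names are assumptions chosen because the continuous relative entropy of two Gaussians has a well-known closed form. The cell probabilities are built as density times cell width, and the discrete relative entropy is compared with the exact value.

```python
import math

def gaussian(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def discrete_relative_entropy(p, q, a, b, n):
    """Discrete S[p, q] over n equal cells on [a, b], with cell probabilities
    p_i = p(x_i) * dx and q_i = q(x_i) * dx evaluated at cell midpoints."""
    dx = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * dx
        p_i = p(x) * dx
        q_i = q(x) * dx
        if p_i > 0:
            total -= p_i * math.log(p_i / q_i)
    return total

# Hypothetical densities: p = N(0, 1) and prior q = N(1, 2^2).
p = lambda x: gaussian(x, 0.0, 1.0)
q = lambda x: gaussian(x, 1.0, 2.0)

# Closed form for Gaussians: S = -[ln(s_q/s_p) + (s_p^2 + (m_p - m_q)^2)/(2 s_q^2) - 1/2].
exact = -(math.log(2.0 / 1.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5)

for n in (10, 100, 2000):
    print(n, discrete_relative_entropy(p, q, -10.0, 10.0, n), exact)

fine = discrete_relative_entropy(p, q, -10.0, 10.0, 2000)
```

Because the cell width cancels inside the logarithm, the discrete sum depends on the cells only through the probabilities assigned to them, which is the origin of the coordinate invariance mentioned above.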

The ME Method
We can now summarize the overall conclusion.
The ME method: The goal is to update from a prior distribution q to a posterior distribution when there is new information in the form of constraints C that specify a family {p} of candidate posteriors. The preferred posterior P is that which maximizes the relative entropy,
$$S[p, q] = -\sum_i p_i \log\frac{p_i}{q_i} \quad\text{or}\quad S[p, q] = -\int dx\, p(x) \log\frac{p(x)}{q(x)},$$
within the family {p} specified by the constraints C.
This extends the method of maximum entropy beyond its original purpose as a rule to assign probabilities from a given underlying measure (MaxEnt) to a method for updating probabilities from any arbitrary prior (ME). Furthermore, the logic behind the updating procedure does not rely on any particular meaning assigned to the entropy whether in terms of information, or heat, or disorder. Entropy is merely a tool for inductive inference.
No interpretation for S[p, q] is given and none is needed.
The derivation above has singled out a unique S[p, q] to be used in inductive inference. Other "entropies" (such as the one-parameter families of entropies proposed in [12][13][14]) might turn out to be useful for other purposes, perhaps as measures of some kind of "information", as measures of discrimination or distinguishability among distributions, of ecological diversity, or for some altogether different function, but they are unsatisfactory for the purpose of updating because they fail to perform the functions stipulated by the design criteria DC1 and DC2. They induce correlations that are unwarranted by the information in the priors or the constraints.
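
As a concrete illustration of the ME method, the sketch below updates a prior over the faces of a die when the new information is a constraint on the expected value. The helper `me_update`, the priors, and the target values are hypothetical choices made for this example. It relies on the standard form of the maximizer for a single expected-value constraint, p_i proportional to q_i exp(λ f_i), with the multiplier found by bisection. Two cases are shown: a uniform prior, which corresponds to the MaxEnt special case, and an informative prior with a constraint that merely restates the prior expected value, in which case no update is expected.

```python
import math

def me_update(prior, f, target, lo=-50.0, hi=50.0, iters=200):
    """Maximize S[p, q] = -sum_i p_i log(p_i / q_i) subject to
    sum_i p_i f_i = target and normalization. The maximizer has the form
    p_i proportional to q_i * exp(lam * f_i); lam is found by bisection."""
    def tilted(lam):
        w = [q * math.exp(lam * fi) for q, fi in zip(prior, f)]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(p * fi for p, fi in zip(tilted(lam), f))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]

# Case 1: uniform prior (MaxEnt special case), constraint <face> = 4.5.
uniform = [1 / 6] * 6
maxent = me_update(uniform, faces, 4.5)

# Case 2: informative prior, constraint equal to the prior expected value.
informed_prior = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]
prior_mean = sum(q * f for q, f in zip(informed_prior, faces))
unchanged = me_update(informed_prior, faces, prior_mean)

print([round(p, 4) for p in maxent])
print([round(p, 4) for p in unchanged])
```

In the second case the multiplier comes out as zero to numerical precision and the posterior coincides with the prior, as the PMU demands when the constraint carries nothing new.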

Bayes' Rule as a Special Case of ME
Back in Section 3.3.1, we saw that ME is designed to include Bayes' rule as a special case. Here, we wish to verify this explicitly [2]. The goal is to update our beliefs about θ ∈ Θ (θ represents one or many parameters) on the basis of three pieces of information: (1) the prior information codified into a prior distribution q(θ); (2) the new information conveyed by data x ∈ X (obtained in one or many experiments); and (3) the known relation between θ and x given by a model defined by the sampling distribution or likelihood, q(x|θ). The updating will result in replacing the prior probability distribution q(θ) by a posterior distribution P(θ) that applies after the data information has been processed.
The crucial element that will allow Bayes' rule to be smoothly integrated into the ME scheme is the realization that before the data are collected, not only do we not know θ, but we do not know x either. Thus, the relevant space for inference is not the space Θ but the product space Θ × X, and the relevant joint prior is q(x, θ) = q(θ)q(x|θ). Let us emphasize two points: first, the likelihood function is an integral part of the prior distribution; second, the prior information about how x is related to θ is contained in the functional form of the distribution q(x|θ) and not in the numerical values of the arguments x and θ, which, at this point, are still unknown.
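Next, data are collected and the observed values turn out to be x′. We must update to a posterior that lies within the family of distributions p(x, θ) that reflect the fact that the previously unknown x is now known to be x′, that is,

$$p(x) = \int d\theta\, p(x,\theta) = \delta(x - x'). \tag{10}$$

The information in this data constrains but is not sufficient to fully determine the joint distribution,

$$p(x,\theta) = p(x)\,p(\theta|x) = \delta(x - x')\,p(\theta|x'). \tag{11}$$

Any choice of p(θ|x′) is, in principle, possible. So far, the formulation of the problem parallels Section 3.3.1 exactly. We are, after all, solving the same problem. The next step is to apply the ME method. According to the ME method, the selected joint posterior P(x, θ) is that which maximizes the entropy,

$$S[p,q] = -\int dx\, d\theta\; p(x,\theta)\,\log\frac{p(x,\theta)}{q(x,\theta)}, \tag{12}$$

subject to the data constraints. Note that Equation (10) represents an infinite number of constraints on the family p(x, θ): there is one constraint and one Lagrange multiplier λ(x) for each value of x. Maximizing S, (12), subject to (10) and normalization,

$$\delta\left\{ S + \alpha\left[\int dx\, d\theta\; p(x,\theta) - 1\right] + \int dx\; \lambda(x)\left[\int d\theta\; p(x,\theta) - \delta(x - x')\right]\right\} = 0, \tag{13}$$

yields the joint posterior

$$P(x,\theta) = q(x,\theta)\,\frac{e^{\lambda(x)}}{Z}, \tag{14}$$

where Z is a normalization constant, and the multiplier λ(x) is determined from (10) as follows:

$$\int d\theta\; q(x,\theta)\,\frac{e^{\lambda(x)}}{Z} = q(x)\,\frac{e^{\lambda(x)}}{Z} = \delta(x - x'), \tag{15}$$

so that the joint posterior is

$$P(x,\theta) = q(x,\theta)\,\frac{\delta(x - x')}{q(x)} = \delta(x - x')\,q(\theta|x). \tag{16}$$

The corresponding marginal posterior probability P(θ) is

$$P(\theta) = \int dx\; P(x,\theta) = q(\theta|x') = q(\theta)\,\frac{q(x'|\theta)}{q(x')}, \tag{17}$$

which is Bayes' rule. Thus, Bayes' rule is derivable from, and therefore consistent with, the ME method.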
To summarize, the prior q(x, θ) = q(x)q(θ|x) is updated to the posterior P(x, θ) = P(x)P(θ|x), where P(x) = δ(x − x′) is fixed by the observed data while P(θ|x′) = q(θ|x′) remains unchanged. Note that in accordance with the PMU philosophy that drives the ME method, one only updates those aspects of one's beliefs for which corrective new evidence has been supplied. In [2,8,42], further examples are given that show how ME allows generalizations of Bayes' rule to situations where the data themselves are uncertain, where there is information about moments of x or moments of θ, or even where the likelihood function is unknown. In conclusion, the ME method can fully reproduce and then go beyond the results obtained by the standard Bayesian methods.
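The result above can be checked numerically on a small discrete model. The sketch below is only an illustration under assumed inputs: the three candidate values of θ, the prior weights, and the binomial likelihood are hypothetical choices, not taken from the paper. It builds the joint prior q(x, θ) = q(θ)q(x|θ), forms the joint posterior δ(x − x′)q(θ|x′), and compares its θ-marginal with Bayes' rule. It also compares the entropy of the selected posterior with that of one competing member of the constrained family.

```python
import math

# Hypothetical discrete model (an assumption for illustration only):
# theta is a coin bias, x is the number of heads in 4 tosses.
thetas = [0.2, 0.5, 0.8]
prior_theta = [0.5, 0.3, 0.2]   # q(theta), an arbitrary choice
n_tosses = 4
xs = list(range(n_tosses + 1))


def likelihood(x, theta):
    """Binomial sampling distribution q(x|theta)."""
    return math.comb(n_tosses, x) * theta**x * (1 - theta) ** (n_tosses - x)


# Joint prior q(x, theta) = q(theta) q(x|theta) on the product space.
joint_prior = {
    (x, t): prior_theta[k] * likelihood(x, t)
    for x in xs
    for k, t in enumerate(thetas)
}


def me_posterior(x_obs):
    """Joint posterior delta(x - x') q(theta|x') selected by the ME method."""
    q_x = sum(joint_prior[(x_obs, t)] for t in thetas)  # marginal q(x')
    return {
        (x, t): (joint_prior[(x, t)] / q_x if x == x_obs else 0.0)
        for x in xs
        for t in thetas
    }


def bayes_posterior(x_obs):
    """Textbook Bayes' rule: q(theta) q(x'|theta) / q(x')."""
    unnorm = [prior_theta[k] * likelihood(x_obs, t) for k, t in enumerate(thetas)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def relative_entropy(p, q):
    """S[p, q] = -sum p log(p/q), skipping terms with p = 0."""
    return -sum(pv * math.log(pv / q[key]) for key, pv in p.items() if pv > 0)


x_obs = 3
P = me_posterior(x_obs)
marginal_theta = [sum(P[(x, t)] for x in xs) for t in thetas]

# A competitor that satisfies the same data constraint but keeps p(theta|x') = q(theta).
competitor = {
    (x, t): (prior_theta[k] if x == x_obs else 0.0)
    for x in xs
    for k, t in enumerate(thetas)
}

print("ME marginal   :", marginal_theta)
print("Bayes' rule   :", bayes_posterior(x_obs))
print("S[ME posterior]:", relative_entropy(P, joint_prior))
print("S[competitor]  :", relative_entropy(competitor, joint_prior))
```

Both distributions concentrate all probability on x = x′, so they satisfy the same constraint; the competitor simply makes a different choice of p(θ|x′) and is ranked lower.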

Deviations from Maximum Entropy
The basic ME problem is to update from a prior q(x) given information specified by certain constraints. The constraints specify a family of candidate distributions as follows:

$$p_\theta(x) = p(x|\theta), \tag{18}$$

which can be conveniently labeled with a finite number of parameters $\theta^a$, a = 1 … n. (The generalization to an infinite number of parameters poses technical but not insurmountable difficulties.) Thus, the parameters θ are coordinates on the statistical manifold specified by the constraints. The distributions in this manifold are ranked according to their entropy,

$$S[p_\theta, q] = -\int dx\; p(x|\theta)\,\log\frac{p(x|\theta)}{q(x)} = S(\theta), \tag{19}$$

and the selected posterior is the distribution $p(x|\theta_0)$ that maximizes the entropy S(θ). (The notation indicates that $S[p_\theta, q]$ is a functional of $p_\theta$ while S(θ) is a function of θ.) The question we now address concerns the extent to which $p(x|\theta_0)$ should be preferred over other distributions with lower entropy or, to put it differently, to what extent is it rational to believe that the selected value ought to be the entropy maximum $\theta_0$ rather than any other value θ [1]? This is a question about the probability p(θ) of various values of θ. The original problem which led us to design the maximum entropy method was to assign a probability to the quantity x; we now see that the full problem is to assign probabilities to both x and θ. We are concerned not just with p(x), but rather with the joint distributions which we denote as π(x, θ); the universe of discourse has been expanded from X (the space of xs) to the product space X × Θ (Θ is the space of parameters θ).
To determine the joint distribution π(x, θ), we make use of essentially the only (universal) method at our disposal, the ME method itself, but this requires that we address the standard two preliminary questions: First, what is the prior distribution? What do we know about x and θ before we receive information about the constraints? Second, what is the new information that constrains the allowed joint distributions π(x, θ)?
This first question is the more subtle one: when we know absolutely nothing about the θs, we know neither their physical meaning nor whether there is any relation to the xs. A joint prior that reflects this lack of correlations is a product, q(x, θ) = q(x)q(θ). We will assume that the prior q(x) is known: it is the same prior we had used when we updated from q(x) to $p(x|\theta_0)$ using (19).
However, we are not totally ignorant about the θs: we know that they label distributions π(x|θ) on some as yet unspecified statistical manifold Θ. Then there exists a natural measure of distance in the space Θ. It is given by the information metric $d\ell^2 = g_{ab}\, d\theta^a d\theta^b$ [8,43], where

$$g_{ab} = \int dx\; \pi(x|\theta)\,\frac{\partial \log \pi(x|\theta)}{\partial\theta^a}\,\frac{\partial \log \pi(x|\theta)}{\partial\theta^b}, \tag{20}$$

and the corresponding volume elements are given by $g^{1/2}(\theta)\,d^n\theta$, where g(θ) is the determinant of the metric. The uniform prior for θ, which assigns equal probabilities to equal volumes, is proportional to $g^{1/2}(\theta)$, and therefore we choose $q(\theta) = g^{1/2}(\theta)$. Therefore, the joint prior is $q(x,\theta) = q(x)\,g^{1/2}(\theta)$. Next, we tackle the second question: what are the constraints on the allowed joint distributions π(x, θ)? Consider the space of all joint distributions. To each choice of the functional form of π(x|θ) (for example, whether we talk about Gaussians, Boltzmann-Gibbs distributions, or something else), there corresponds a different subspace defined by distributions of the form π(x, θ) = π(θ)π(x|θ). The crucial constraint is that which specifies the subspace by imposing that π(x|θ) takes the particular functional form given by the constraint (18), π(x|θ) = p(x|θ). This defines the meaning of the θs and also fixes the prior $g^{1/2}(\theta)$ on the relevant subspace.
The preferred joint distribution, P(x, θ) = P(θ)p(x|θ), is the distribution, π(x, θ) = π(θ)p(x|θ), that maximizes the joint entropy,

$$S[\pi, q] = -\int dx\, d\theta\; \pi(\theta)\,p(x|\theta)\,\log\frac{\pi(\theta)\,p(x|\theta)}{g^{1/2}(\theta)\,q(x)} = -\int d\theta\; \pi(\theta)\,\log\frac{\pi(\theta)}{g^{1/2}(\theta)} + \int d\theta\; \pi(\theta)\,S(\theta), \tag{21}$$

where S(θ) is given in (19). Varying (21) with respect to π(θ) with $\int d\theta\, \pi(\theta) = 1$ and p(x|θ) fixed yields the posterior probability that the value of θ lies within the small volume $g^{1/2}(\theta)\,d^n\theta$,

$$P(\theta)\,d^n\theta = \frac{1}{\zeta}\,e^{S(\theta)}\,g^{1/2}(\theta)\,d^n\theta \quad\text{with}\quad \zeta = \int d^n\theta\; g^{1/2}(\theta)\,e^{S(\theta)}. \tag{22}$$

Equation (22) is the result we seek. It tells us that, as expected, the preferred value of θ is the value $\theta_0$ that maximizes the entropy S(θ), Equation (19), because this maximizes the scalar density exp S(θ). However, it also tells us the degree to which values of θ away from the maximum are ruled out. (Note that the density exp S(θ) is a scalar function and the presence of the Jacobian factor $g^{1/2}(\theta)$ makes Equation (22) manifestly invariant under changes of the coordinates θ in the space Θ.) This discussion allows us to refine our understanding of the ME method. ME is not an all-or-nothing recommendation to pick the single distribution that maximizes entropy and reject all others. The ME method is more nuanced: in principle, all distributions within the constraint manifold ought to be included in the analysis; they contribute in proportion to the exponential of their entropy and this turns out to be significant in situations where the entropy maximum is not particularly sharp.
Going back to the original problem of updating from the prior q(x), given information that specifies the manifold {p(x|θ)}, the preferred update within the family {p(x|θ)} is $p(x|\theta_0)$, but to the extent that other values of θ are not totally ruled out, a better update is obtained by marginalizing the joint posterior P(x, θ) = P(θ)p(x|θ) over θ,

$$P(x) = \int d^n\theta\; P(x,\theta) = \int d^n\theta\; P(\theta)\,p(x|\theta). \tag{23}$$

In situations where the entropy maximum at $\theta_0$ is very sharp, we recover the old result,

$$P(x) \approx p(x|\theta_0). \tag{24}$$

When the entropy maximum is not very sharp, a more honest update is Equation (23), which, incidentally, is a form of superstatistics.
One of the limitations of the standard MaxEnt method is that it selects a single "posterior" $p(x|\theta_0)$ and strictly rules out all other distributions. The result (22) overcomes this limitation and finds many applications. For example, it extends the Einstein theory of thermodynamic fluctuations beyond the regime of small fluctuations; it provides a bridge to the theory of large deviations; and, suitably adapted for Bayesian data analysis, it leads to the notion of entropic priors [44].
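To see how sharply the density exp S(θ) can single out the entropy maximum, the following sketch evaluates a weight proportional to $g^{1/2}(\theta)\,e^{S(\theta)}$ on a grid. Everything specific here is an assumption made for illustration: the family is a toy one-parameter Bernoulli family for N independent binary variables, the prior `Q` is an arbitrary choice, and the grid integration is a crude midpoint rule. For this family the information metric is N/(θ(1 − θ)), and the θ-average of θ coincides with the marginalized probability of x = 1 for a single variable.

```python
import math

Q = (0.7, 0.3)  # hypothetical prior q(x) on x in {0, 1}; entropy maximum at theta = 0.3


def entropy(theta, n_copies=1):
    """S(theta) for n_copies independent binary variables, each with p(x=1) = theta."""
    s = 0.0
    for p, q in ((1.0 - theta, Q[0]), (theta, Q[1])):
        if p > 0:
            s -= p * math.log(p / q)
    return n_copies * s


def sqrt_metric(theta, n_copies=1):
    """g^{1/2}(theta) for the Bernoulli family, where g = n / (theta (1 - theta))."""
    return math.sqrt(n_copies / (theta * (1.0 - theta)))


def theta_posterior(n_copies, n_grid=2000):
    """Midpoint-grid approximation to P(theta) dtheta, proportional to exp(S) g^{1/2} dtheta."""
    d = 1.0 / n_grid
    grid = [(k + 0.5) * d for k in range(n_grid)]
    weights = [
        math.exp(entropy(t, n_copies)) * sqrt_metric(t, n_copies) * d for t in grid
    ]
    z = sum(weights)
    return grid, [w / z for w in weights]


def summarize(n_copies):
    """Return the mean and standard deviation of theta under P(theta)."""
    grid, prob = theta_posterior(n_copies)
    mean = sum(t * p for t, p in zip(grid, prob))
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, prob))
    return mean, math.sqrt(var)


for n in (1, 10, 100, 1000):
    mean, std = summarize(n)
    print(f"N={n:5d}  <theta>={mean:.4f}  std={std:.4f}")
```

For small N the distribution over θ is broad and the marginalized update differs noticeably from the entropy maximum; as N grows the spread shrinks and the single-distribution answer is recovered.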

Discussion
Consistency with the law of large numbers.
Entropic methods of inference are of general applicability, but there exist special situations (for example, those involving large numbers of independent subsystems) where inferences can be made by purely probabilistic methods without ever invoking the concept of entropy. In such cases, one can check (see, for example, [6,45]) that the two methods of calculation are consistent with each other. It is significant, however, that alternative entropies, such as those proposed in [12][13][14], do not pass this test [46,47], which rules them out as tools for updating. Some probability distributions obtained by maximizing the alternative entropies have, however, turned out to be physically relevant. It is, therefore, noteworthy that those successful distributions can also be derived through a more standard application of MaxEnt or ME, as advocated in this review [8,48,49,50,51]. In other words, what is being ruled out are not the distributions themselves, but the alternative entropies from which they were inferred.
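One familiar instance of such a consistency check, offered here as an illustration rather than as the specific calculation of [6,45], is the multinomial distribution. For N independent draws from q, the probability of observing frequencies f satisfies (1/N) log P ≈ S[f, q] for large N, so the purely probabilistic ranking of frequencies agrees with the entropic one. The distributions `q` and `f` below are arbitrary assumed values.

```python
import math


def relative_entropy(f, q):
    """S[f, q] = -sum f log(f/q)."""
    return -sum(fi * math.log(fi / qi) for fi, qi in zip(f, q) if fi > 0)


def log_multinomial_prob(counts, q):
    """Exact log-probability of the counts under N independent draws from q."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)
    for c, qi in zip(counts, q):
        log_p += c * math.log(qi) - math.lgamma(c + 1)
    return log_p


def entropy_gap(n, f, q):
    """Difference between (1/N) log P(counts) and S[f, q] for counts = N f."""
    counts = [round(fi * n) for fi in f]
    return log_multinomial_prob(counts, q) / n - relative_entropy(f, q)


q = [0.5, 0.3, 0.2]   # assumed single-draw probabilities
f = [0.2, 0.3, 0.5]   # assumed observed frequencies

for n in (10, 1000, 100000):
    print(f"N={n:6d}  (1/N) log P - S[f, q] = {entropy_gap(n, f, q):+.6f}")
```

The gap shrinks as N grows, which is the sense in which the two methods of calculation agree in the limit of many independent subsystems.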
On priors.
Choosing the prior density q(x) can be tricky. Sometimes, symmetry considerations can be useful but otherwise, there is no fixed set of rules to translate information into a probability distribution except, of course, for Bayes' rule and the ME method themselves.
What if the prior q(x) vanishes for some values of x? S[p, q] can be infinitely negative when q(x) vanishes within some region D. This means that the ME method confers an infinite preference on those distributions p(x) that vanish whenever q(x) does. One must emphasize that this is as it should be. A similar situation also arises in the context of Bayes' theorem, where assigning a vanishing prior represents a tremendously serious commitment because no amount of data to the contrary would allow us to revise it. In both ME and Bayes updating, we should recognize the implications of assigning a vanishing prior. Assigning a very low but non-zero prior represents a safer and possibly less prejudiced representation of one's prior beliefs.
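The effect of a vanishing prior can be made concrete with a short sketch. The three-state distributions and the likelihood values below are hypothetical, chosen only to show that S[p, q] is infinitely negative whenever p assigns weight where q vanishes, and that a Bayesian update can never revive a zero prior, whereas a very small non-zero prior can still be revised.

```python
import math


def relative_entropy(p, q):
    """S[p, q] = -sum p log(p/q); returns -inf if p > 0 somewhere q = 0."""
    s = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 log 0 contributes nothing
        if qi == 0:
            return -math.inf    # p puts weight where the prior vanishes
        s -= pi * math.log(pi / qi)
    return s


def bayes_update(prior, likelihoods):
    """Posterior proportional to prior times likelihood."""
    unnorm = [pr * lk for pr, lk in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


q_zero = [0.5, 0.5, 0.0]             # prior that vanishes on the third state
q_tiny = [0.4999995, 0.4999995, 1e-6]  # very low but non-zero prior
p = [0.4, 0.4, 0.2]

print(relative_entropy(p, q_zero))   # infinitely negative
print(relative_entropy(p, q_tiny))   # finite, though strongly penalized

# Hypothetical data that strongly favor the third state.
likelihoods = [1e-4, 1e-4, 1.0]
print(bayes_update(q_zero, likelihoods))  # third state stays at zero
print(bayes_update(q_tiny, likelihoods))  # third state can be revised upward
```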
Commuting and non-commuting constraints.
The ME method allows one to process information in the form of constraints. When we are confronted with several constraints, we must be particularly cautious. Should they be processed simultaneously or sequentially? And, if the latter, in what order? The answer depends on the problem at hand [42].
We refer to constraints as commuting when it makes no difference whether they are handled simultaneously or sequentially. The most common example is that of Bayesian updating on the basis of data collected in several independent experiments. In this case, the order in which the observed data $x' = \{x'_1, x'_2, \ldots\}$ are processed does not matter for the purpose of inferring θ. In general, however, constraints need not commute and when this is the case, the order in which they are processed is critical.
To decide whether constraints are to be handled sequentially or simultaneously, one must be clear about how the ME method handles constraints. The ME machinery interprets a constraint in a very mechanical way: all distributions satisfying the constraint are, in principle, allowed, while all distributions violating it are ruled out. Therefore, sequential updating is appropriate when old constraints become obsolete and are superseded by new information, while simultaneous updating is appropriate when old constraints remain valid. The two cases refer to different states of information, and therefore, it is to be expected that they will result in different inferences. These comments are meant to underscore the importance of understanding what information is and how it is processed by the ME method; failure to do so will lead to errors that do not reflect a shortcoming of the ME method but rather a misapplication of it.
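A small numerical sketch can make the non-commuting case tangible. The setup is hypothetical: a six-sided die with a uniform prior and two expected-value constraints, one on the mean of x and one on the probability of an even face. Each single-constraint update uses the standard exponential form p ∝ q exp(λa), with λ found by bisection (a simple numerical choice made here, not something prescribed by the text). Processing the constraints sequentially in the two possible orders gives different posteriors, and in each case only the last constraint is guaranteed to hold.

```python
import math


def me_update(prior, a, target, lo=-50.0, hi=50.0):
    """ME update for one constraint sum_i a_i p_i = target: p_i ∝ q_i exp(lam a_i)."""

    def tilted(lam):
        weights = [q * math.exp(lam * ai) for q, ai in zip(prior, a)]
        z = sum(weights)
        return [w / z for w in weights]

    def expected(p):
        return sum(pi * ai for pi, ai in zip(p, a))

    # The expected value of a increases monotonically with lam, so bisection works.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected(tilted(mid)) < target:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))


faces = [1, 2, 3, 4, 5, 6]
is_even = [0, 1, 0, 1, 0, 1]
uniform = [1 / 6] * 6


def mean_face(p):
    return sum(pi * x for pi, x in zip(p, faces))


def prob_even(p):
    return sum(pi * e for pi, e in zip(p, is_even))


# Constraint 1: <x> = 4.5.  Constraint 2: Prob(even) = 0.7.
seq_12 = me_update(me_update(uniform, faces, 4.5), is_even, 0.7)
seq_21 = me_update(me_update(uniform, is_even, 0.7), faces, 4.5)

print("1 then 2: <x> =", mean_face(seq_12), " P(even) =", prob_even(seq_12))
print("2 then 1: <x> =", mean_face(seq_21), " P(even) =", prob_even(seq_21))
```

Sequential processing treats the earlier constraint as superseded, which is why it is no longer satisfied at the end; if both constraints are meant to remain valid they would have to be imposed simultaneously.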

Pitfalls?
Entropy is a tool for reasoning and, as with all tools for reasoning or otherwise, it can be misused, leading to unsatisfactory results [52]. Should that happen, the inevitable questions are "what went wrong?" and "how do we fix it?" It helps to first ask what components of the analysis can be trusted so that the possible mistakes can be looked for elsewhere. The answers proposed by the ME method are radically conservative: problems always arise through a wrong choice of variables, priors, or constraints. Indeed, one should not blame the entropic method for not having discovered and taken into account relevant information that was not explicitly introduced into the analysis. Moreover, just as one would be very reluctant to question the basic rules of arithmetic, or the basic rules of calculus, one should not question the basic sum and product rules of the probability calculus and, taking this one step further, one should not question the applicability of entropy as the updating tool. The adoption of this conservative approach leads us to reject alternative entropies and quantum probabilities. Fortunately, those constructs are not actually needed: as mentioned above, those Tsallis distributions that have turned out to be useful can be derived with standard entropic methods [8,48,49,50,51], and quantum mechanics can be handled within standard probability theory without invoking exotic probabilities [39,53].

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. DC1-Mutually Exclusive Subdomains
In these appendices, we establish the consequences of the two criteria DC1 and DC2, leading to the final result: Equation (9). The details of the proofs are important not just because they lead to our final conclusions, but also because the translation of the verbal statement of the criteria into precise mathematical form is a crucial part of unambiguously specifying what the criteria actually say.
First, we prove that criterion DC1 leads to the expression Equation (2) for S[p, q]. Consider the case of a discrete variable, $p_i$ with i = 1 … n, so that $S[p, q] = S(p_1 \ldots p_n, q_1 \ldots q_n)$. Suppose the space of states X is partitioned into two non-overlapping domains $D$ and $\tilde{D}$ with $D \cup \tilde{D} = X$, and that the information to be processed is in the form of a constraint that refers to the domain $\tilde{D}$,

$$\sum_{j \in \tilde{D}} a_j p_j = A. \tag{A1}$$
DC1 states that the constraint on $\tilde{D}$ does not have an influence on the conditional probabilities $p_{i|D}$. It may, however, influence the probabilities $p_i$ within D through an overall multiplicative factor. To deal with this complication, consider then a special case where the overall probabilities of $D$ and $\tilde{D}$ are also constrained:

$$\sum_{i \in D} p_i = P_D \quad\text{and}\quad \sum_{j \in \tilde{D}} p_j = P_{\tilde{D}}, \tag{A2}$$

with $P_D + P_{\tilde{D}} = 1$. Under these special circumstances, constraints on $\tilde{D}$ will not influence the $p_i$s within D, and vice versa.
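To obtain the posterior, maximize S[p, q] subject to these three constraints,

$$0 = \delta\left[ S - \lambda\Big(\sum_{i \in D} p_i - P_D\Big) - \tilde{\lambda}\Big(\sum_{j \in \tilde{D}} p_j - P_{\tilde{D}}\Big) - \mu\Big(\sum_{j \in \tilde{D}} a_j p_j - A\Big)\right],$$

which leads to

$$\frac{\partial S}{\partial p_i} = \lambda \quad\text{for}\quad i \in D, \tag{A3}$$

$$\frac{\partial S}{\partial p_j} = \tilde{\lambda} + \mu a_j \quad\text{for}\quad j \in \tilde{D}. \tag{A4}$$

Equations (A1)-(A4) are n + 3 equations; we must solve for the $p_i$s and the three Lagrange multipliers, $\lambda$, $\tilde{\lambda}$, and $\mu$. Since $S = S(p_1 \ldots p_n, q_1 \ldots q_n)$, its derivative $\partial S/\partial p_i = f_i(p_1 \ldots p_n, q_1 \ldots q_n)$ could, in principle, also depend on all 2n variables. However, this violates the DC1 criterion because any arbitrary change in $a_j$ within $\tilde{D}$ would influence the $p_i$s within D. The only way that probabilities conditioned on D can be shielded from arbitrary changes in the constraints pertaining to $\tilde{D}$ is that for any i ∈ D, the function $f_i$ depends only on $p_j$s with j ∈ D. Furthermore, this must hold not just for one particular partition of X into domains $D$ and $\tilde{D}$, but it must hold for all conceivable partitions, including the partition into atomic propositions. Therefore, $f_i$ can depend only on $p_i$,

$$\frac{\partial S}{\partial p_i} = f_i(p_i, q_1 \ldots q_n). \tag{A5}$$

The power of the criterion DC1 is not exhausted yet. The information that affects the posterior can enter not just through constraints, but also through the prior. Suppose that the local information about domain $\tilde{D}$ is altered by changing the prior within $\tilde{D}$. Let $q_j \to q_j + \delta q_j$ for $j \in \tilde{D}$. Then (A5) becomes

$$\frac{\partial S}{\partial p_i} = f_i(p_i, q_1 \ldots q_j + \delta q_j \ldots q_n),$$

which shows that $p_i$ with i ∈ D will be influenced by information about $\tilde{D}$ unless $f_i$ with i ∈ D is independent of all the $q_j$s for $j \in \tilde{D}$. Again, this must hold for all possible partitions into $D$ and $\tilde{D}$, and therefore,

$$\frac{\partial S}{\partial p_i} = f_i(p_i, q_i) \quad\text{for all}\quad i \in X.$$

The choice of the functions $f_i(p_i, q_i)$ can be restricted further. If we maximize S[p, q], subject to constraints

$$\sum_i p_i = 1 \quad\text{and}\quad \sum_i a_i p_i = A,$$

we get

$$\frac{\partial S}{\partial p_i} = f_i(p_i, q_i) = \lambda + \mu a_i \quad\text{for all}\quad i \in X,$$

where λ and µ are Lagrange multipliers. Solving for $p_i$ gives a posterior,

$$P_i = g_i(q_i, \lambda, \mu, a_i),$$

for some functions $g_i$. As stated in Section 3.3 we do not assume that the labels i themselves carry any particular significance. This means, in particular, that for any proposition labeled i, we want the selected posterior $P_i$ to depend only on the numbers $q_i$, λ, µ, and $a_i$.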
We do not want to have different updating rules for different propositions: two different propositions $i$ and $i'$ with the same $q_i = q_{i'}$ and the same $a_i = a_{i'}$ should be updated to the same posteriors, $P_i = P_{i'}$. In other words, the functions $g_i$ and $f_i$ must be independent of i. Therefore,

$$\frac{\partial S}{\partial p_i} = f(p_i, q_i) \quad\text{for all}\quad i \in X. \tag{A6}$$

Integrating, one obtains

$$S[p, q] = \sum_i F(p_i, q_i) + \text{constant},$$

for some still undetermined function F. The constant has no effect on the entropy maximization and can be dropped. The corresponding expression for a continuous variable x is obtained by replacing i with x, and the sum over i with an integral over x, leading to Equation (2),

$$S[p, q] = \int dx\; F(p(x), q(x)).$$
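The locality property expressed by DC1 can be illustrated numerically for the logarithmic relative entropy, for which maximization under a single expected-value constraint gives p ∝ q exp(µa). The sketch below is an illustration under assumed inputs: the six-state prior, the coefficients $a_j$ (non-zero only on $\tilde{D}$), and the target value A are hypothetical, and the multiplier is found by bisection. It checks that the conditional probabilities within D are unchanged by a constraint that refers only to $\tilde{D}$.

```python
import math


def me_update(prior, a, target, lo=-50.0, hi=50.0):
    """ME update for sum_i a_i p_i = target using p_i ∝ q_i exp(mu a_i)."""

    def tilted(mu):
        weights = [q * math.exp(mu * ai) for q, ai in zip(prior, a)]
        z = sum(weights)
        return [w / z for w in weights]

    # The expected value of a is monotonically increasing in mu.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(p * ai for p, ai in zip(tilted(mid), a)) < target:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))


def conditional(p, domain):
    """Probabilities conditioned on the given list of state indices."""
    total = sum(p[i] for i in domain)
    return [p[i] / total for i in domain]


prior = [0.1, 0.2, 0.3, 0.1, 0.1, 0.2]   # hypothetical prior q_i
D = [0, 1, 2]                             # domain D
D_tilde = [3, 4, 5]                       # complementary domain
a = [0, 0, 0, 1, 2, 3]                    # the constraint refers only to D_tilde
A = 1.5                                   # assumed target value

posterior = me_update(prior, a, A)

print("conditional on D, prior    :", conditional(prior, D))
print("conditional on D, posterior:", conditional(posterior, D))
print("P(D) prior vs posterior    :",
      sum(prior[i] for i in D), sum(posterior[i] for i in D))
```

The overall probability of D changes (the multiplicative factor mentioned above), but the relative weights inside D do not.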

Appendix B. DC2-Independent Subsystems
Here, we show that DC2 leads to Equation (9). Let the microstates of a composite system be labeled by $(i_1, i_2) \in X = X_1 \times X_2$. We shall consider two special cases.

Case (a)
First, we treat the two subsystems separately. Suppose that for subsystem 1, we have the extremely constraining information that updates $q_1(i_1)$ to be $P_1(i_1)$, and for subsystem 2 we have no new information at all. For subsystem 1, we maximize $S_1[p_1, q_1]$ subject to the constraint $p_1(i_1) = P_1(i_1)$ and the selected posterior is, of course, $p_1(i_1) = P_1(i_1)$. For subsystem 2, we maximize $S_2[p_2, q_2]$ subject only to normalization and there is no update, $P_2(i_2) = q_2(i_2)$.
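When the systems are treated jointly, however, the inference is not nearly as trivial. We want to maximize the entropy of the joint system,

$$S[p, q] = \sum_{i_1, i_2} F\big(p(i_1, i_2),\, q_1(i_1)\,q_2(i_2)\big),$$

subject to the constraint on subsystem 1,

$$\sum_{i_2} p(i_1, i_2) = P_1(i_1).$$

Notice that this is not just one constraint: we have one constraint for each value of $i_1$, and each constraint must be supplied with its own Lagrange multiplier, $\lambda_1(i_1)$. Then,

$$\delta\left[ S - \sum_{i_1} \lambda_1(i_1)\Big(\sum_{i_2} p(i_1, i_2) - P_1(i_1)\Big)\right] = 0.$$

The independent variations $\delta p(i_1, i_2)$ yield the following:

$$f\big(p(i_1, i_2),\, q_1(i_1)\,q_2(i_2)\big) = \lambda_1(i_1),$$

where f is given in (A6),

$$\frac{\partial S}{\partial p} = \frac{\partial}{\partial p} F(p, q_1 q_2) = f(p, q_1 q_2).$$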
Next, we impose that the selected posterior is the product $P_1(i_1)\,q_2(i_2)$. The function f must be such that

$$f(P_1 q_2,\, q_1 q_2) = \lambda_1.$$

Since the RHS is independent of the argument $i_2$, the function f must be such that the $i_2$-dependence cancels out, and this cancellation must occur for all values of $i_2$ and all choices of the prior $q_2$. Therefore, we impose that for any value of x the function f(p, q) must satisfy

$$f(px, qx) = f(p, q).$$

Choosing x = 1/q, we obtain

$$f\Big(\frac{p}{q}, 1\Big) = f(p, q) \quad\text{or}\quad \frac{\partial F}{\partial p} = f(p, q) = \phi\Big(\frac{p}{q}\Big). \tag{A7}$$

Thus, the function f(p, q) has been reduced to a function φ(p/q) of a single argument.

Case (b)
Next, we consider a situation in which both subsystems are updated by extremely constraining information: when the subsystems are treated separately, $q_1(i_1)$ is updated to $P_1(i_1)$ and $q_2(i_2)$ is updated to $P_2(i_2)$. When the systems are treated jointly, we require that the joint prior for the combined system $q_1(i_1)\,q_2(i_2)$ be updated to $P_1(i_1)\,P_2(i_2)$.
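Treating the systems jointly, we maximize $S[p, q] = \sum_{i_1, i_2} F(p(i_1, i_2),\, q_1(i_1)\,q_2(i_2))$ subject to the two sets of constraints $\sum_{i_2} p(i_1, i_2) = P_1(i_1)$ and $\sum_{i_1} p(i_1, i_2) = P_2(i_2)$, with Lagrange multipliers $\eta_1(i_1)$ and $\eta_2(i_2)$. Using (A7), the independent variations $\delta p(i_1, i_2)$ yield

$$\phi\Big(\frac{p}{q_1 q_2}\Big) = \eta_1(i_1) + \eta_2(i_2).$$

Imposing that the selected posterior be the product $P_1(i_1)\,P_2(i_2)$ and writing $\xi = \exp \phi$, we obtain

$$\xi\Big(\frac{P_1 P_2}{q_1 q_2}\Big)\, e^{-\eta_2(i_2)} = e^{\eta_1(i_1)}. \tag{A10}$$

This shows that for any value of $i_1$, the dependences of the LHS on $i_2$ through $P_2/q_2$ and $\eta_2$ must cancel each other out. In particular, if for some subset of $i_2$s, the subsystem 2 is updated so that $P_2 = q_2$, which amounts to no update at all, the $i_2$ dependence on the left is eliminated but the $i_1$ dependence remains unaffected,

$$\xi\Big(\frac{P_1}{q_1}\Big)\, e^{-\eta_2'} = e^{\eta_1(i_1)},$$

where $\eta_2'$ is a constant independent of $i_2$. The same argument with the roles of subsystems 1 and 2 exchanged gives

$$\xi\Big(\frac{P_2}{q_2}\Big)\, e^{-\eta_1'} = e^{\eta_2(i_2)}.$$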
Substituting back into (A10), we obtain

$$\xi\Big(\frac{P_1 P_2}{q_1 q_2}\Big) = \xi\Big(\frac{P_1}{q_1}\Big)\, \xi\Big(\frac{P_2}{q_2}\Big),$$

where a constant factor $e^{-(\eta_1' + \eta_2')}$ is absorbed into a new function ξ. The general solution of this functional equation is a power,

$$\xi(xy) = \xi(x)\,\xi(y) \;\Longrightarrow\; \xi(x) = x^a, \quad\text{so that}\quad \phi(x) = a \log x + b,$$

where a and b are constants. Finally, integrate (A7),

$$\frac{\partial F}{\partial p} = \phi\Big(\frac{p}{q}\Big) = a \log\frac{p}{q} + b,$$

to get

$$F(p, q) = a\, p \log\frac{p}{q} + b' p + c, \quad\text{with}\quad b' = b - a.$$

The additive constant c may be dropped: it contributes a term that does not depend on the probabilities and has no effect on the ranking scheme. Furthermore, since S[p, q] will be maximized subject to constraints that include normalization, the $b'$ term has no effect on the selected distribution and can also be dropped. Finally, the multiplicative constant a has no effect on the overall ranking, except in the trivial sense that inverting the sign of a will transform the maximization problem to a minimization problem or vice versa. We can, therefore, set a = −1 so that maximum S corresponds to maximum preference, which gives us Equation (9) and concludes our derivation.
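The step from case (a) to case (b) can be probed numerically. Case (b) requires that φ evaluated at a product of ratios separate into a sum, φ(r₁r₂) = η₁ + η₂, which is equivalent to the vanishing of the mixed difference φ(ac) − φ(ad) − φ(bc) + φ(bd) for arbitrary positive ratios. The sketch below compares the logarithmic solution with a power-law function φ(r) = r^δ. The power law is a hypothetical alternative introduced here only for contrast, and the numerical ratios are arbitrary; both functions depend on p/q alone and so pass case (a), but only the logarithm passes case (b).

```python
import math


def phi_log(r, a=-1.0, b=0.0):
    """Logarithmic solution phi(r) = a log r + b."""
    return a * math.log(r) + b


def phi_power(r, delta=0.5):
    """Hypothetical power-law alternative, used only for contrast."""
    return r**delta


def mixed_difference(phi, a, b, c, d):
    """Vanishes for all a, b, c, d iff phi(r1 * r2) is a sum of a function of r1 and one of r2."""
    return phi(a * c) - phi(a * d) - phi(b * c) + phi(b * d)


# Arbitrary positive ratios P1/q1 (a, b) and P2/q2 (c, d).
ratios = (1.5, 0.5, 2.0, 0.8)

print("logarithm :", mixed_difference(phi_log, *ratios))
print("power law :", mixed_difference(phi_power, *ratios))
```

A non-zero mixed difference means that no choice of multipliers η₁(i₁) and η₂(i₂) can reproduce the product posterior, so updating with such a function would couple the two subsystems.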