Remarks on the Maximum Entropy Principle with Application to the Maximum Entropy Theory of Ecology

In the first part of the paper we work out the consequences of the fact that Jaynes' Maximum Entropy Principle (MEP), when translated into mathematical terms, is a constrained extremum problem for an entropy function $H(p)$ expressing the uncertainty associated with the probability distribution $p$. Consequently, if two observers use different independent variables $p$ or $g(p)$, the associated entropy functions have to be defined accordingly and they are different in the general case. In the second part we apply our findings to an analysis of the foundations of the Maximum Entropy Theory of Ecology (M.E.T.E.), a purely statistical model of an ecological community. Since the theory has received considerable attention from the scientific community, we hope to give a useful contribution to the same community by showing that the procedure of application of MEP, in the light of the theory developed in the first part, suffers from some inconsistencies. We exhibit an alternative formulation which is free from these limitations and which gives different results.


Introduction
The Maximum Entropy Principle (MEP; E.T. Jaynes, 1957, see [1,2]) is a powerful inference principle which allows one to determine the probability distribution that describes a system on the basis of the information available, usually in the form of averages of observables (random variables) of interest for the system. It is based on: (i) the enumeration of the system states $i = 1, \dots, N$; (ii) the introduction of one or more functions that translate the information available on the system into constraints on the probability, i.e., $f(p) = c$ where $c$ is a vector of average values; and (iii) a function measuring the uncertainty associated with a probability distribution that is a candidate to describe the system. We consider here only systems with a finite number of states, since the extra generality of an infinite number of states is not necessary for the aims of this work. The sought distribution is the one for which uncertainty is maximal on the set of distributions for which $f(p) = c$. This distribution thus represents the least biased estimate on the basis of the given information, whether or not its predictions are empirically confirmed by experiments. Each of the above outlined Steps (i)-(iii) has a profound meaning and impact on the implementation of the MEP procedure. For example, if the states of the system are a priori equally probable, the uncertainty function to use is the Shannon entropy $H(p)$, while if they are statistically described by a prior distribution $q$ the uncertainty function is (minus) the relative entropy $D(p|q)$. If the system states are the result of a coarse-graining procedure from a bigger set of a priori equiprobable states, then a prior distribution $q$ which counts the number of coarse-grained states has to be used for $D(p|q)$ in order to render the MEP procedure invariant with respect to the coarse graining, as explained in [3].
See also [4], where a number of different probability distributions used in ecology are derived for a system of individuals arranged in different cells by simply making different assumptions on the individuals (indistinguishable or not), on the constraints and on the coarse graining. The variety of MEP responses simply reflects the fact that different pieces of information are considered; if the predictions fail to be confirmed by experiment, this signals that information relevant for the statistical description of the system under study has been neglected.
A basic tenet of the Maximum Entropy Principle is that two observers who are given the same information expressed by Steps (i) and (ii) above must obtain the same result upon application of the MEP, i.e., the solution to the constrained extremum problem has to be unique. Notice that the MEP procedure does not specify the probability $p = (p_1, \dots, p_N)$ to use; thus if one observer uses $p$ and another uses $p'$, which is related to $p$ by a one-to-one transformation $p = g(p')$, then, provided that they set up the constraints in the form $f(p) = c$ and $f(g(p')) = c$ respectively, they are translating the same information on the system into mathematical terms. This is the same freedom one has in physics of using different systems of coordinates, related by an invertible transformation, to describe the position of a point. Therefore, to ensure that the results of the application of the MEP procedure are independent of the choice of the $p$ used, the uncertainty function has to be defined accordingly. Proposition 1 in Section 2 below is the main tool to determine the form of the uncertainty function for $g$-related distributions.
The analysis of this subtle point of the application of MEP is the main aim of this paper. We argue that, since there seems to be no preferred choice for $p$, all choices have to be considered equivalent; therefore the related uncertainty functions are also equivalent and can be derived from a single given one. The question is: which mathematical translation of the initial information justifies the use of the Shannon entropy as uncertainty function? We give an answer by resorting to the Boltzmann statistical derivation of the entropy function, which has the same form as Shannon's. This statistical procedure, based on independent tosses of indistinguishable particles in bins, can be applied to different models and allows one to determine from a sort of "first principle" the form of the entropy function to use. We think that there is room for further investigation of this subtle issue of the application of MEP.
In the second part of the paper we apply the above findings to a specific application of MEP to ecology called Maximum Entropy Theory of Ecology (M.E.T.E.), as presented in [5]. M.E.T.E. is a purely statistical model of an ecological community. Using the Maximum Entropy Principle, the theory aims at inferring the form of some of the most used distributions in ecology from the knowledge of macroscopic information on the system: number of individuals, number of species and total metabolic requirement. MEP has been applied in the field of statistical ecology mainly, but not exclusively, to derive one of the most important probability distributions, the Species Abundance Distribution (SAD) (see [6], or [7] where it is used for modeling species geographic distributions with presence-only data). Several papers ([3,8,9]) have revised the application of the MEP inference principle to ecological problems, stressing the importance of the choice of the system's coordinates [10-12], of the prior distribution [4,8] and of the entropy function [3,6]. The interest of [5] is that it considers simultaneously information on the distribution of individuals among species (SAD) and on the distribution of metabolic energy rate among individuals. In this richer scenario a number of probability distributions for different observables of interest are derived. The output of the work has grown to become a comprehensive theory of application of MEP to ecological problems, called Maximum Entropy Theory of Ecology (see the book [13,14]), which has been extensively tested with real ecosystem data in [15-17]. In Section 4.1 we review the assumptions of the model in [5] and the related MEP solution; in Section 4.2 we propose an alternative application of the MEP procedure starting from the same initial information but giving a different result.
In Section 5 we discuss the non-equivalence of the two procedures using Proposition 1 below and explain why the MEP solution in [5] is flawed by some inconsistencies.

MEP Formulation Using Different Variables
The Maximum Entropy Principle is generally expressed as: $\hat{p}$ is the distribution that maximizes Shannon entropy $H(p)$ over the set of distributions allowed by the constraints. Let $p'$, defined by $p' = g^{-1}(p)$ with $g$ invertible, be a different choice of the variable used to express the probability distribution and the constraints, and suppose that we use the method of Lagrange multipliers to solve this constrained extremum problem. Then the following Proposition holds (see Appendix A for a proof).

Proposition 1. $\hat{p}$ is the constrained extremum of $H(p)$ over the set $f(p) = c$ if and only if $\hat{p}' = g^{-1}(\hat{p})$ is the constrained extremum of $H(g(p'))$ over the set $f(g(p')) = c$.
From Proposition 1 we learn that if we are given the entropy function and the constraints as $(p, f(p), H(p))$ and we want to use another variable $p' = g^{-1}(p)$, then to find the same MEP solution we have to transform the constraints and the entropy function accordingly, using $(p', f \circ g(p'), H \circ g(p'))$. In this way, changing variable and finding the maximum are operations whose order can be interchanged. A natural question arises in this setting: which is the pair $(p, f(p))$, translating the initial information on the system into mathematical terms, that "is justified" in using the entropy function in Shannon form? We will give a partial answer to this question in Section 3 below.
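The interchange of the two operations can be made plausible by a short chain-rule computation. The following is only a sketch, under the additional assumption (not stated in the text) that $g$ is differentiable with an invertible Jacobian $Dg$; the proof in Appendix A may proceed differently. The symbols $\mathcal{L}$ and $\mathcal{L}'$ are notation introduced here for the two Lagrangians.

```latex
\begin{align*}
% Lagrangian in the original variable p, with multiplier vector \lambda
\mathcal{L}(p,\lambda) &= H(p) - \lambda \cdot \bigl(f(p) - c\bigr), \\
% Lagrangian in the new variable p', where p = g(p')
\mathcal{L}'(p',\lambda) &= H(g(p')) - \lambda \cdot \bigl(f(g(p')) - c\bigr)
                          = \mathcal{L}(g(p'),\lambda), \\
% Chain rule: the two gradients differ by the (transposed) Jacobian of g
\nabla_{p'} \mathcal{L}'(p',\lambda) &= Dg(p')^{\mathsf{T}} \,
      \nabla_{p} \mathcal{L}(p,\lambda)\Big|_{p = g(p')}, \\
% Since Dg(p') is assumed invertible, the stationary points correspond one-to-one
\nabla_{p'} \mathcal{L}'(p',\lambda) = 0
      &\iff \nabla_{p} \mathcal{L}(g(p'),\lambda) = 0 .
\end{align*}
```

Under these assumptions the stationary points of the two problems are mapped into each other by $g$, with the same multipliers, which is the content of Proposition 1.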
Note that the invariance requirement with respect to the choice of the independent variable is different from the requirement, introduced in the axiomatic derivation of MEP in [18], of invariance in the form of $H(p)$ with respect to a transformation of the type $y = \Gamma(x)$, where $x \in D$ are the (possibly infinite) states of a system and $p = p(x)$. For a system with finite states the transformation $\Gamma$ (called coordinate transformation in [18]) reduces to a relabeling of the states. Shannon entropy $H(p) = -\sum_{i=1}^{N} p_i \ln p_i$, being additive, is clearly invariant in form, expressing the fact that different observers may use different labels for the $N$ system states. This relabeling transformation, $p_k = g_k(p') = p'_{\phi_k}$ where $\phi_k$ is a permutation of the $N$ state labels, is thus a particular case of the more general $g$ considered in Proposition 1. It is clearly an invertible transformation for which $H(p)$ and $H(g(p'))$ coincide, but this is no longer assured for more general transformations $g$; see Section 3 below.

A Simple, Ecologically Oriented Example
To throw light on the questions considered above we take a very simple system, used in a number of physical models, that for our purposes can be considered as a minimal example of an ecological community. Consider a set $\mathcal{I}$ of $N_0$ identical individuals (balls) which belong to $S$ species (indistinguishable urns). Suppose that each species contains at least one individual, so that the maximal number of individuals allowed to one species is $N = N_0 - (S - 1)$. Suppose that the species labels can be interchanged, so that the observable quantity is not the number of individuals $n_\alpha$ of a given species $\alpha$ but the number $S_n$ of species which have exactly $n$ individuals, for $n = 1, \dots, N$.
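Suppose that the set of species $\mathcal{S}$ is equipped with the uniform probability $1/S$ and consider the random variable $\nu : \mathcal{S} \to \mathcal{N}$, $\alpha \mapsto \nu(\alpha) = n_\alpha$, with $\mathcal{N} = \{1, \dots, N\}$, that assigns to a species its abundance. Let $\varphi(n)$ be the density associated with the random variable $\nu$ (in the sequel $P$ denotes probability)
$$\varphi(n) = P(\nu = n) = P(\alpha \in \nu^{-1}(n)) = \frac{|\nu^{-1}(n)|}{S} = \frac{S_n}{S}, \qquad (1)$$
where $|A|$ is the number of elements of $A$. Set for later use $\mathcal{S}_n = \nu^{-1}(n) \subset \mathcal{S}$, therefore $|\mathcal{S}_n| = S_n$. Let the set of individuals $\mathcal{I}$ be equipped with the uniform probability $1/N_0$, let $\alpha : \mathcal{I} \to \mathcal{S}$, $i \mapsto \alpha(i)$ be the assignment of each individual to a species, and consider the random variable $\nu \circ \alpha : \mathcal{I} \to \mathcal{N}$, $i \mapsto \nu(\alpha(i))$. Moreover, consider the following subset $\mathcal{I}_n$ of $\mathcal{I}$, obtained by grouping together all the individuals belonging to the species having $n$ individuals (we call this subset the "multispecies" $n$):
$$\mathcal{I}_n = \{ i \in \mathcal{I} : \nu(\alpha(i)) = n \} \subset \mathcal{I}, \qquad I_n = |\mathcal{I}_n| .$$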
As a straightforward consequence, we have that $I_n = n S_n$. Let $\varphi_I(n)$ be the density associated with the random variable $\nu \circ \alpha$, that is
$$\varphi_I(n) = P(\nu \circ \alpha = n) = \frac{|\mathcal{I}_n|}{N_0} = \frac{I_n}{N_0} .$$
Both $\varphi$ and $\varphi_I$ are probability distributions on $\mathcal{N}$; it is easy to see that, since $I_n = n S_n$, they are related by an invertible transformation $\varphi_I = g(\varphi)$ defined as
$$\varphi_I(n) = g(\varphi)(n) = \frac{n S}{N_0}\, \varphi(n). \qquad (4)$$
In ecology the distribution $\varphi$ is called Species Abundance Distribution (SAD) and is probably the most important (and most investigated) indicator of how an ecological community is structured. While $\varphi(n)$ gives the probability of extracting a species of abundance $n$ from the set of species, $\varphi_I(n)$ gives the probability that an individual extracted from the set $\mathcal{I}$ belongs to a species of abundance $n$. Real communities are composed of many species with few individuals (rare species) and a few species which contain the majority of the individuals (common species). As a consequence, the two probability distributions are very different. The information on the system summarized by the numbers $N_0$, $S$ sets the following equalities
$$\sum_{n=1}^{N} n\, S_n = N_0, \qquad \sum_{n=1}^{N} \frac{I_n}{n} = S .$$
If we divide the above two equalities by $S$ and $N_0$ respectively, we obtain the following equations to be satisfied by $\varphi$ and $\varphi_I$ respectively
$$\sum_{n=1}^{N} n\, \varphi(n) = \frac{N_0}{S}, \qquad (6)$$
$$\sum_{n=1}^{N} \frac{\varphi_I(n)}{n} = \frac{S}{N_0}. \qquad (7)$$
Constraints (6) and (7) are $g$-related, that is, $\varphi$ satisfies Equation (6) if and only if $g(\varphi)$ satisfies Equation (7). Note that in the first case we are constraining the weighted arithmetic average of $1, \dots, N$, while in the second one we are constraining the weighted harmonic average. Note also that the actual information being used is the ratio $N_0/S$ or equivalently $S/N_0$, so that the MEP description of the system is independent of the size $N_0$, $S$ of the system provided that the ratio $N_0/S$ is kept constant. Suppose that two different observers translate into mathematical terms the information contained in the data $N_0$, $S$ using the two variables $\varphi$ and $\varphi_I$ introduced above, and that both invoke the MEP to find the least informative probability distribution allowed by the constraints.
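The relation between the species-based and the individual-based densities can be checked on a small hand-made community. The sketch below is only an illustration: the abundance list and the helper name `abundance_distributions` are hypothetical and do not come from the paper. Exact rational arithmetic is used so that the arithmetic-average and harmonic-average constraints can be compared without rounding.

```python
from collections import Counter
from fractions import Fraction

def abundance_distributions(abundances):
    """Return (phi, phi_I) for a list of species abundances n_alpha >= 1."""
    S = len(abundances)          # number of species
    N0 = sum(abundances)         # number of individuals
    S_n = Counter(abundances)    # S_n: number of species with exactly n individuals
    # Species-based density: phi(n) = S_n / S
    phi = {n: Fraction(s, S) for n, s in S_n.items()}
    # Individual-based density: phi_I(n) = I_n / N0 with I_n = n * S_n
    phi_I = {n: Fraction(n * s, N0) for n, s in S_n.items()}
    return phi, phi_I

abundances = [1, 1, 1, 2, 2, 5, 8]   # hypothetical community: S = 7, N0 = 20
S, N0 = len(abundances), sum(abundances)
phi, phi_I = abundance_distributions(abundances)

mean_abundance = sum(n * p for n, p in phi.items())    # weighted arithmetic average
mean_inverse = sum(p / n for n, p in phi_I.items())    # weighted average of 1/n
print(mean_abundance, mean_inverse)                    # prints 20/7 7/20
```

Here the first average equals $N_0/S$ and the second equals $S/N_0$, and the two densities are related entry by entry by the factor $nS/N_0$.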
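The application of MEP with entropy function $H(\varphi) = -\sum_{n=1}^{N} \varphi(n) \ln \varphi(n)$ and Constraint (6) gives, upon application of the method of Lagrange multipliers, the geometric probability distribution
$$\hat{\varphi}(n) = \frac{e^{-\lambda n}}{Z(\lambda)}, \qquad Z(\lambda) = \sum_{n=1}^{N} e^{-\lambda n}, \qquad (8)$$
while the MEP procedure with the same entropy function $H(\varphi_I) = -\sum_{n=1}^{N} \varphi_I(n) \ln \varphi_I(n)$ and Constraint (7) gives, with multiplier $\mu$ and normalization $Z_I(\mu) = \sum_{n=1}^{N} e^{-\mu/n}$, the solution
$$\hat{\varphi}_I(n) = \frac{e^{-\mu/n}}{Z_I(\mu)},$$
whose corresponding SAD is
$$\tilde{\varphi}(n) = g^{-1}(\hat{\varphi}_I)(n) = \frac{N_0}{S\, n}\, \frac{e^{-\mu/n}}{Z_I(\mu)}. \qquad (10)$$
It is easy to see that $\hat{\varphi} \neq \tilde{\varphi} = g^{-1}(\hat{\varphi}_I)$; therefore with this example we have shown that changing independent variables and using the MEP with the Shannon entropy are non-commuting operations in the general case. Had we used Constraint (7) with the transformed entropy
$$H(g^{-1}(\varphi_I)) = -\sum_{n=1}^{N} \frac{N_0\, \varphi_I(n)}{S\, n} \ln \frac{N_0\, \varphi_I(n)}{S\, n}, \qquad (11)$$
we would have obtained the solution $g(\hat{\varphi})$ with $\hat{\varphi}$ in Equation (8). Note that the transformed entropy in Equation (11) does not have the form of a relative entropy function, therefore it cannot be obtained by introducing a suitable prior probability distribution. Note also that the non-commutativity considered above is different from the non-commutativity of the MEP procedure with respect to coarse graining of the system states considered in [3,4].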
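The non-commutativity can also be observed numerically. The following is a minimal sketch, not the authors' computation: it assumes the two Shannon-entropy solutions have the exponential forms "weight proportional to $e^{-\lambda n}$" and "weight proportional to $e^{-\mu/n}$", solves for each multiplier by bisection, and maps the second solution back to a SAD through the factor $N_0/(Sn)$. The values $N_0 = 50$, $S = 10$ and the helper names `gibbs`, `solve_multiplier`, `shannon` are hypothetical.

```python
import math

def gibbs(h, mu):
    """Distribution with p_n proportional to exp(-mu * h_n), computed stably."""
    exponents = [-mu * x for x in h]
    top = max(exponents)                       # shift to avoid overflow
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def solve_multiplier(h, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection for mu such that the Gibbs average of h equals target.
    The average is a decreasing function of mu."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        avg = sum(p * x for p, x in zip(gibbs(h, mid), h))
        if avg > target:
            lo = mid                           # average too large: increase mu
        else:
            hi = mid
    return 0.5 * (lo + hi)

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

N0, S = 50, 10                                 # hypothetical community size
N = N0 - (S - 1)
ns = list(range(1, N + 1))

# Observer 1: Shannon entropy in phi, constraint sum_n n*phi(n) = N0/S
phi_hat = gibbs(ns, solve_multiplier(ns, N0 / S))

# Observer 2: Shannon entropy in phi_I, constraint sum_n phi_I(n)/n = S/N0
inv = [1.0 / n for n in ns]
phi_I_hat = gibbs(inv, solve_multiplier(inv, S / N0))

# Map observer 2's answer back to a SAD: phi(n) = N0/(S*n) * phi_I(n)
phi_tilde = [N0 / (S * n) * p for n, p in zip(ns, phi_I_hat)]

gap = max(abs(a - b) for a, b in zip(phi_hat, phi_tilde))
print(gap, shannon(phi_hat), shannon(phi_tilde))
```

Both SADs satisfy the same arithmetic-average constraint, yet they differ markedly, and the geometric one has the larger Shannon entropy in the species-based variable, as it must since it is the maximizer there.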

Which One is the Correct Application of MEP?
Both $\hat{\varphi}(n)$ in Equation (8) and $\tilde{\varphi}(n)$ in Equation (10) result from the application of the MEP starting from the same initial information $N_0$, $S$ on the system and provide non-equivalent answers to the question of how the abundance of individuals is distributed among the species. Since the sets of distributions allowed by the constraints (feasible sets) in the two formulations, Equations (6) and (7), are related by an invertible transformation $\mathcal{D}_I = g(\mathcal{D})$, the non-equivalence must reside in the choice of the form of the entropy functions, which are not $g$-related according to Proposition 1. On the basis of the same Proposition, once the form of the entropy function is established in a given formulation of MEP $(p, f(p), H(p))$, it is also determined for all $g$-related formulations. It seems thus necessary to invoke a sort of first principle to establish the correct form of the entropy function in the initial MEP formulation.
A possible approach could be to compare the form of the resulting probability distribution against field data. This is not entirely satisfactory, since the discrepancy between the MEP solution and the empirical curve could be due to the fact that we have neglected (or we are not aware of) some information on the system which is relevant for the question addressed. In this case the discrepancy between the model and the empirical pattern is a signal that other factors shaping the probability distribution are acting on the system. Incidentally, for the problem considered above, it is generally acknowledged in the ecological literature (see e.g., [5]) that the geometric SAD in Equation (8) has a poor fit with the empirical SAD, while the one introduced in (10) could provide a better fit since it is more flexible (it can be monotonically decreasing or display an interior mode depending on the value of the Lagrange multiplier).
At least for systems with a discrete number of states another criterion can be introduced, which uses the combinatorial argument that provides one of the justifications for the form of Shannon entropy and historically precedes the axiomatic derivation in [18]. This is the celebrated Boltzmann problem of statistical mechanics [19]. It is useful to briefly review its solution here because it tells us how to determine the correct form of the entropy function from first principles in cases where its determination is not obvious. It deals with the distribution of $K$ indistinguishable particles (individuals) among $M$ equally spaced energy levels $\varepsilon = 1, \dots, M$. Boltzmann's original formulation dealt with arbitrarily spaced energy levels, but this extra generality is not needed here. If $n_\varepsilon$ is the number of particles having energy $\varepsilon$ and $n = (n_1, \dots, n_M)$ is the vector of occupation numbers of the energy levels, the logarithm of the probability of $n$ is
$$\ln W(n) = \ln \left( \frac{1}{M^K}\, \frac{K!}{n_1! \cdots n_M!} \right), \qquad (12)$$
which, using the Stirling approximation $\ln n! \approx n \ln n$ (the neglected linear terms cancel because $\sum_\varepsilon n_\varepsilon = K$) and dropping the additive constant $-K \ln M$, can be rewritten as
$$\ln W(n) = K \ln K - \sum_{\varepsilon} n_\varepsilon \ln n_\varepsilon . \qquad (13)$$
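The maximization of $\ln W(n)$ under the constraints $\sum_\varepsilon n_\varepsilon = K$ and $\sum_\varepsilon \varepsilon\, n_\varepsilon = E_0$ gives the most common, i.e., least informative, vector of occupation numbers. The solution of the constrained maximization problem is the celebrated Boltzmann-Gibbs distribution
$$\hat{\psi}(\varepsilon) = \frac{\hat{n}_\varepsilon}{K} = \frac{e^{-\beta \varepsilon}}{Z(\beta)}, \qquad Z(\beta) = \sum_{\varepsilon=1}^{M} e^{-\beta \varepsilon}. \qquad (14)$$
Note that, writing $n_\varepsilon = \psi(\varepsilon) K$, $\ln W$ in Equation (13) coincides up to the constant factor $K$ with
$$H(\psi) = -\sum_{\varepsilon} \psi(\varepsilon) \ln \psi(\varepsilon). \qquad (15)$$
In the same manner, if we have $S$ indistinguishable species to be assigned to the $N$ levels (abundances) of $\mathcal{N}$, let $s = (S_1, \dots, S_N)$ be the occupation vector with constraints $\sum_n S_n = S$ and $\sum_n n S_n = N_0$. As before, setting $S \varphi(n) = S_n$, the least informative probability distribution is
$$\hat{\varphi}(n) = \frac{e^{-\lambda n}}{Z(\lambda)} \qquad (16)$$
and the associated entropy is
$$H(\varphi) = -\sum_{n} \varphi(n) \ln \varphi(n). \qquad (17)$$
Therefore the form of the entropy function for the formulation given by Equation (6) of the MEP problem using the variable $\varphi$ can be derived by the Boltzmann combinatorial argument.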
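The link between the multinomial multiplicity and the Shannon form can be checked numerically. The sketch below compares the exact logarithm of the multinomial coefficient $K!/(n_1! \cdots n_M!)$, computed with `math.lgamma`, to $K$ times the Shannon entropy of the occupation frequencies. The occupation numbers are hypothetical, and the constant term $-K \ln M$ is left out on both sides.

```python
import math

def log_multinomial(counts):
    """Exact ln( K! / (n_1! ... n_M!) ) via the log-gamma function."""
    K = sum(counts)
    return math.lgamma(K + 1) - sum(math.lgamma(n + 1) for n in counts)

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

counts = [4000, 3000, 2000, 1000]    # hypothetical occupation numbers n_eps
K = sum(counts)
psi = [n / K for n in counts]        # occupation frequencies psi(eps) = n_eps / K

exact = log_multinomial(counts)      # combinatorial quantity
approx = K * shannon(psi)            # Stirling-level approximation K * H(psi)
rel_err = abs(exact - approx) / approx
print(exact, approx, rel_err)
```

The discrepancy grows only logarithmically with $K$, so the relative error becomes negligible for large populations, which is the regime in which the combinatorial argument identifies the entropy function.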
Let us see which entropy function results if we apply the Boltzmann argument to the formulation given by Equation (7) using $\varphi_I$. Remember that
$$\text{(A)} \quad \varphi(n) = \frac{S_n}{S}, \qquad \text{(B)} \quad \varphi_I(n) = \frac{I_n}{N_0},$$
so basically in (A) we are "tossing" $S$ species in the $N$ bins $\mathcal{S}_n$, while in (B) we are "tossing" $N_0$ individuals in the $N$ multispecies $\mathcal{I}_n$. In the Boltzmann argument the tosses are supposed to be independent, and this is true for (A) but not for (B). Indeed a multispecies $\mathcal{I}_n$ is the grouping of $S_n$ species each having $n$ individuals, so the number of individuals in $\mathcal{I}_n$ can be updated only by multiples of $n$, i.e., by adding or subtracting a species of $n$ individuals. Hence the tosses of single individuals do not represent independent tosses. If we want to support the choice of the entropy function by the Boltzmann argument we have to consider tosses of groups of $n$ individuals. Let $i_n$ be the number of individuals tossed in $\mathcal{I}_n$. Therefore the related occupation vector has the form
$$i = \left( \frac{i_1}{1}, \frac{i_2}{2}, \dots, \frac{i_N}{N} \right), \qquad W(i) = \frac{1}{N^S}\, \frac{S!}{\prod_{n=1}^{N} (i_n/n)!}, \qquad (19)$$
which leads (recall that $i_n = N_0 \varphi_I(n)$) to an entropy of the form
$$-\sum_{n=1}^{N} \frac{N_0\, \varphi_I(n)}{S\, n} \ln \frac{N_0\, \varphi_I(n)}{S\, n}$$
and, under the Constraints (7) above, to the solution
$$\hat{\varphi}_I(n) = \frac{S\, n}{N_0}\, \frac{e^{-\lambda n}}{Z(\lambda)}. \qquad (20)$$
We have shown that, in the case where the probability distribution being used can be linked at least conceptually to independent repetitions of an experiment, we have a criterion to derive the correct form of the entropy function from a first principle. Of course, to build a general theory this simple example deserves further generalization.
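As a cross-check of the last step, the stationarity conditions can be worked out directly. This is only a sketch, assuming that the entropy associated with the grouped tosses is the Shannon form evaluated at $N_0 \varphi_I(n)/(S n)$, i.e., the transform prescribed by Proposition 1; the shorthand $x_n$ is notation introduced here and is not used in the paper.

```latex
\begin{align*}
% Shorthand for the rescaled variable (number of species of abundance n, divided by S)
x_n &:= \frac{N_0}{S\,n}\,\varphi_I(n), \qquad n = 1,\dots,N, \\
% The harmonic-average constraint on phi_I is the normalization of x
\sum_{n} \frac{\varphi_I(n)}{n} = \frac{S}{N_0}
    &\iff \sum_{n} x_n = 1, \\
% The normalization of phi_I is the arithmetic-average constraint on x
\sum_{n} \varphi_I(n) = 1
    &\iff \sum_{n} n\,x_n = \frac{N_0}{S}, \\
% Standard Lagrange-multiplier solution for a Shannon entropy with a mean constraint
\max_{x}\Bigl[-\sum_{n} x_n \ln x_n\Bigr]
    &\implies x_n = \frac{e^{-\lambda n}}{Z(\lambda)}, \qquad
      Z(\lambda) = \sum_{n=1}^{N} e^{-\lambda n}, \\
% Back to the individual-based variable
\varphi_I(n) &= \frac{S\,n}{N_0}\,\frac{e^{-\lambda n}}{Z(\lambda)} .
\end{align*}
```

In the rescaled variable the problem is exactly the species-based one, so the individual-based solution is the image under $g$ of the geometric distribution.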

Review of M.E.T.E. Assumptions
Let us review the assumptions of M.E.T.E. as presented in [5]. In [5], various metrics (probability distributions) are derived, but here we will content ourselves with the analysis of the derivation of two of them, namely the Species Abundance Distribution $\varphi(n)$ and the distribution of metabolic energy $\psi(\varepsilon)$ among the individuals.
Following [5], we consider a community made of $N_0$ individuals living on an area $A_0$, subdivided into $S$ species and having total metabolic energy rate $E_0$. Assume that each species has at least one individual and that the individual metabolic rate is a discrete quantity which ranges from 1 to a maximum $M$ in a suitable energy unit. With these assumptions, the abundance of a species $\alpha$ is an integer $n_\alpha \in \mathcal{N} = \{1, \dots, N\}$ with $N = N_0 - (S - 1)$, and the individual metabolic rate takes values in $\mathcal{M} = \{1, \dots, M\}$. Here we depart from [5] by assuming for the sake of simplicity that the individual metabolic rate is a discrete quantity, while in [5] it is assumed to be a continuous one. This change does not affect our argument, since to deal with the continuous case it is enough to replace the sums with integrals.
The information on the system is limited to the values $N_0$, $S$, $E_0$, and $A_0$, but for our aims the knowledge of $A_0$ is not relevant. Therefore we are dealing with a non-spatial model of a community of $S$ non-interacting species. On the basis of this sole knowledge we would like to answer two questions: $(Q_1)$ How is the energy distributed among the individuals? $(Q_2)$ How are the individuals distributed among the species? Note that, since the species labels can be exchanged, the correct formulation of $(Q_2)$ is how many of the species have exactly $n$ individuals for $n = 1, \dots, N$, which is the information contained in $\varphi(n)$.
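Both questions can be answered in a non-deterministic manner; for $(Q_2)$, we use the density $\varphi$ introduced in Equation (1), while, for $(Q_1)$, we introduce the map $m : \mathcal{I} \to \mathcal{M}$, $i \mapsto m(i)$, that we will consider as a random variable, and its density $\psi(\varepsilon)$ on $\mathcal{M}$
$$\psi(\varepsilon) = P(m = \varepsilon) = P(i \in m^{-1}(\varepsilon)) = \frac{I_\varepsilon}{N_0}, \qquad \forall \varepsilon \in \mathcal{M}, \qquad (21)$$
where $\mathcal{I}_\varepsilon = m^{-1}(\varepsilon) \subset \mathcal{I}$ and $I_\varepsilon = |\mathcal{I}_\varepsilon|$. For later use, consider also the joint probability distribution
$$P(n, \varepsilon) = P(i \in \mathcal{I}_n,\ i \in \mathcal{I}_\varepsilon), \qquad (22)$$
having marginals, respectively, $\varphi_I(n)$ and $\psi(\varepsilon)$, since the space $\mathcal{I}$ is doubly partitioned in multispecies and energy classes.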

Our Solution of M.E.T.E. Problem
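The following constraints
$$\sum_{\varepsilon} \psi(\varepsilon) = 1, \qquad \sum_{\varepsilon} \varepsilon\, \psi(\varepsilon) = \frac{E_0}{N_0}, \qquad (23)$$
$$\sum_{n} \varphi(n) = 1, \qquad \sum_{n} n\, \varphi(n) = \frac{N_0}{S}, \qquad (24)$$
contain all the information $(N_0, S, E_0)$ available on the system, and a straightforward application of MEP with the constraints in Equation (23) and the entropy in Equation (15) [with the constraints in Equation (24) and the entropy in Equation (17), respectively] gives the answers $\hat{\psi}$ in Equation (14) [$\hat{\varphi}$ in Equation (16)] to the above questions $(Q_1)$ and $(Q_2)$, i.e.,
$$\hat{\psi}(\varepsilon) = \frac{e^{-\beta \varepsilon}}{Z(\beta)}, \qquad \hat{\varphi}(n) = \frac{e^{-\lambda n}}{Z(\lambda)} .$$
Basically, we are solving the Boltzmann problem of statistical mechanics twice. Now, if one is interested in the joint probability $Q(n, \varepsilon) = P(\alpha \in \mathcal{S}_n,\ i \in \mathcal{I}_\varepsilon)$, we proceed as before, introducing the vector $c = (C_{1,1}, \dots, C_{N,M})$ where $C_{n,\varepsilon}$ counts the number of times the object $(\alpha, i)$ is assigned to the bin $\mathcal{S}_n \times \mathcal{I}_\varepsilon$. Hence the corresponding multiplicity $W(c)$ is to be maximized under the decoupled constraints ($\sum_{\varepsilon} C_{n,\varepsilon} = S_n$, $\sum_{n} C_{n,\varepsilon} = n_\varepsilon$), see Constraint (28). The related entropy is, setting $C_{n,\varepsilon} = Q(n, \varepsilon) N_0 S$,
$$H(Q) = -\sum_{n,\varepsilon} Q(n, \varepsilon) \ln Q(n, \varepsilon)$$
and the maximum entropy distribution with Constraint (28) can be shown to be
$$\hat{Q}(n, \varepsilon) = \frac{e^{-\lambda n}}{Z(\lambda)}\, \frac{e^{-\beta \varepsilon}}{Z(\beta)}. \qquad (30)$$
We claim that Equation (30) is the correct solution of the MEP problem dealt with in [5,13], in the sense that the choice of the entropy function adopted is supported by the Boltzmann argument. We do not claim that this solution has a better fit with empirically derived patterns than others, but only that it is derived in a consistent way. In this solution the two random variables $\nu$ and $m$ are uncorrelated, therefore the energy and species constraints can be dealt with either separately, giving Equations (14) and (16), or at the same time. This is a well known property of MEP: if the constraints concern only the marginals of a joint probability, then the MEP solution is the product of two one-dimensional distributions.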

Analysis of the Application of MEP in M.E.T.E.
The starting point of M.E.T.E. theory is the probability distribution $R(n, \varepsilon)$ on $\mathcal{N} \times \mathcal{M}$, see Equation (31) below, introduced in [5], formula (4c). Quoting [5], page 2702 above formula (1a): "$R(n, \varepsilon)$ is the probability that if a species is picked at random from the species list, then it has abundance $n$, and if an individual is picked at random from that species with abundance $n$, then its metabolic rate is $\varepsilon$" (is between $[\varepsilon, \varepsilon + d\varepsilon]$ in the original continuum energy formulation; italics are ours). This is a delicate point: we think that using "that species" makes the above definition logically inconsistent, because if there is more than one species with abundance $n$ it is impossible to know which species is being picked, and so the probability of picking an individual with a given metabolic rate is not uniquely defined. One can readily convince oneself of this by considering a toy model of the system along the lines of the example in Box 7.1 in [13], page 144, but with at least two species having equal abundance and containing individuals with the same metabolic rate. The only way for us to be logically consistent is to replace "that species" with "a species". This is equivalent to picking an individual of given metabolic rate from the multispecies $\mathcal{I}_n$. In this paper we have thus understood the definition of $R(n, \varepsilon)$ as amended in this way. Note however that in the sequel of [5], below formula (4c), it is (correctly) written "a species". Note also that the same ambiguity is present in the book [13], Chapter 7, pages 142 and 143. In our framework, we thus rephrase the definition as
$$R(n, \varepsilon) = \varphi(n)\, P(i \in \mathcal{I}_\varepsilon \mid i \in \mathcal{I}_n) \qquad (31)$$
and, using the definition of conditional probability $P(A, B) = P(A) P(B|A)$ together with (4) and (22), also as
$$R(n, \varepsilon) = \varphi(n)\, \frac{P(n, \varepsilon)}{\varphi_I(n)} = \frac{N_0}{n S}\, P(n, \varepsilon).$$
Therefore, the probability distributions $P(n, \varepsilon)$ and $R(n, \varepsilon)$ are related by the invertible transformation $g$ (we use the same symbol for the sake of simplicity)
$$P(n, \varepsilon) = g_n(R(n, \varepsilon)) = \frac{n S}{N_0}\, R(n, \varepsilon). \qquad (33)$$
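Moreover, the information $N_0$, $S$, $E_0$ sets the following constraints on $R(n, \varepsilon)$, see (1a)-(1c) in [5]:
$$\sum_{n,\varepsilon} n\, R(n, \varepsilon) = \frac{N_0}{S}, \qquad (34)$$
$$\sum_{n,\varepsilon} n\, \varepsilon\, R(n, \varepsilon) = \frac{E_0}{S}, \qquad (35)$$
$$\sum_{n,\varepsilon} R(n, \varepsilon) = 1, \qquad (36)$$
hence also $R(n, \varepsilon)$ is properly normalized. In [5], the MEP procedure is applied as follows: the probability distribution $R(n, \varepsilon)$ is the one that maximizes the Shannon entropy on the set defined by Constraints (34)-(36) above. It is therefore the least informative probability distribution on the basis of the sole knowledge of the macroscopic ratios $E_0/S$, $N_0/S$. Note that, as observed in [5], if the previous ratios are known, $E_0/N_0$ is also known. Standard application of the MEP gives the probability distribution depending on the unknown multipliers $\lambda$ and $\beta$ (see [5], formula (6))
$$R(n, \varepsilon) = \frac{e^{-\lambda n - \beta n \varepsilon}}{Z(\lambda, \beta)}. \qquad (38)$$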
The Lagrange multipliers $\lambda$ and $\beta$ have to be determined by inserting Equation (38) into the constraints in Equations (34) and (35), but an analytical solution of the resulting equations does not exist. Adopting some approximations, in [5] one is led to explicit formulae for the abundance marginal (the SAD), which is the celebrated Fisher log-series ([20]), for the energy marginal, and for the conditional probability of the metabolic rate given the abundance (Equation (41)). What is striking in the above result is that it prescribes a non-zero correlation (Equation (41)) between the distribution of energy among individuals and the distribution of abundance among species, even if nothing in the initial information seems to hint at a possible correlation. This is for us the signal of a flaw in the above application of MEP.
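The coupling can be made concrete on a small grid. The sketch below is an illustration with hypothetical multiplier values and grid sizes, not the approximations of [5]: it tabulates a distribution with weights proportional to $e^{-\lambda n - \beta n \varepsilon}$ and, for comparison, a product form with weights proportional to $e^{-\lambda n} e^{-\beta \varepsilon}$, and computes the covariance between abundance and metabolic rate in each case.

```python
import math

def normalize(table):
    """Divide every entry by the grand total so the table sums to 1."""
    total = sum(sum(row) for row in table)
    return [[x / total for x in row] for row in table]

def covariance(joint, ns, es):
    """Covariance of (n, e) where joint[i][j] is the probability of (ns[i], es[j])."""
    mean_n = sum(n * joint[i][j] for i, n in enumerate(ns) for j in range(len(es)))
    mean_e = sum(e * joint[i][j] for i in range(len(ns)) for j, e in enumerate(es))
    mean_ne = sum(n * e * joint[i][j]
                  for i, n in enumerate(ns) for j, e in enumerate(es))
    return mean_ne - mean_n * mean_e

lam, beta = 0.1, 0.05            # hypothetical multiplier values
ns = list(range(1, 21))          # abundances 1..20 (hypothetical grid)
es = list(range(1, 21))          # metabolic rates 1..20 (hypothetical grid)

# Coupled form: the exponent contains the product n * e
coupled = normalize([[math.exp(-lam * n - beta * n * e) for e in es] for n in ns])
# Product form: the exponent is a sum of a term in n and a term in e
product = normalize([[math.exp(-lam * n - beta * e) for e in es] for n in ns])

cov_coupled = covariance(coupled, ns, es)
cov_product = covariance(product, ns, es)
print(cov_coupled, cov_product)
```

For positive $\beta$ the coupled form gives a negative covariance, because the conditional mean of $\varepsilon$ decreases with $n$, while the product form gives zero up to rounding.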

The Correct Form of Entropy Function of M.E.T.E. Problem
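Our aim now is to derive the form of the entropy function for the MEP problem in the $R(n, \varepsilon)$ variable with constraints on energy and species in Equations (34)-(36) using the Boltzmann argument. Since $R(n, \varepsilon)$ is not a joint probability distribution, we derive the entropy function for the distribution $P(n, \varepsilon) = P(i \in \mathcal{I}_n,\ i \in \mathcal{I}_\varepsilon)$ in (22), which is related to $R(n, \varepsilon)$ by the change of variable (33), and use Proposition 1. The $g$-related constraints of Equations (34)-(36) for $P(n, \varepsilon)$ are
$$\sum_{n,\varepsilon} P(n, \varepsilon) = 1, \qquad \sum_{n,\varepsilon} \varepsilon\, P(n, \varepsilon) = \frac{E_0}{N_0}, \qquad \sum_{n,\varepsilon} \frac{P(n, \varepsilon)}{n} = \frac{S}{N_0} .$$
Let $i_{n,\varepsilon}$ be the number of individuals of the multispecies $\mathcal{I}_n$ having metabolic rate $\varepsilon$. Taking into account that we have $i_n/n$ independent tosses in the bin $\mathcal{I}_n$, we have to use the multiplicity factor $W(i)$ in Equation (19), so that $\ln W$ is the sum of $\ln W(i)$ and of a second term accounting for the arrangement of the $i_n$ individuals of each multispecies among the energy levels. The maximum of $\ln W$ is reached when both terms on the right hand side are maximized, but the first term $\ln W(i)$ is independent of the arrangement in the energy levels. Therefore we can maximize the second term separately. It is to be maximized under the $N + 1$ constraints (note that the constraint $\sum_{n,\varepsilon} i_{n,\varepsilon} = N_0$ is already enforced since $\sum_n i_n = N_0$)
$$\sum_{n,\varepsilon} \varepsilon\, i_{n,\varepsilon} = E_0, \qquad \sum_{\varepsilon} i_{n,\varepsilon} = i_n, \quad n = 1, \dots, N,$$
which gives the solution
$$i_{n,\varepsilon} = i_n\, \frac{e^{-\beta \varepsilon}}{\sum_{\varepsilon} e^{-\beta \varepsilon}} = i_n\, \frac{e^{-\beta \varepsilon}}{Z(\beta)} .$$
Since $i_{n,\varepsilon} = P(n, \varepsilon) N_0$ and $i_n = \varphi_I(n) N_0$, we have
$$P(n, \varepsilon) = \varphi_I(n)\, \frac{e^{-\beta \varepsilon}}{Z(\beta)}$$
and to get the complete solution we have to maximize $W(i)$, which has already been dealt with in the previous section, Equations (19) and (20). Therefore, the MEP procedure for the distribution $P(n, \varepsilon)$ with the above constraints gives the solution
$$\hat{P}(n, \varepsilon) = \frac{S\, n}{N_0}\, \frac{e^{-\lambda n}}{Z(\lambda)}\, \frac{e^{-\beta \varepsilon}}{Z(\beta)} .$$
Using the change of variables in Equation (33) and Proposition 1, we find the solution of the MEP problem in [5] with respect to the variables $R(n, \varepsilon)$
$$\hat{R}(n, \varepsilon) = \frac{N_0}{S\, n}\, \hat{P}(n, \varepsilon) = \frac{e^{-\lambda n}}{Z(\lambda)}\, \frac{e^{-\beta \varepsilon}}{Z(\beta)},$$
which coincides with $Q(n, \varepsilon)$ in (30).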

Discussion
A non-trivial task in the application of the MEP is the translation into mathematical terms (i.e., as constraints on the sought probability distribution) of the information available on the system. In some cases one may hesitate between mathematically equivalent formalizations. What is not so well known is that the use of Shannon entropy $H(p)$ may not be justified in all cases. We have provided a criterion, based on Boltzmann's method of the most probable occupation vector, to derive the correct form of the entropy function in a given formulation. Note that the problem addressed here is different from the search for a suitable prior distribution for the relative entropy $D(p|q)$. In the second part of the paper we have applied this analysis to critically examine the Maximum Entropy Theory of Ecology (M.E.T.E.). Even if the application of MEP contained in M.E.T.E. seems flawed, the resulting SAD proposed by M.E.T.E. is the Fisher log-series, which is widely accepted in the ecological community and considered as giving the best fit for many ecological communities (although the use of the log-series SAD has been questioned recently, see [21]). Therefore, the predictions of species abundance based on M.E.T.E. may well be in good agreement with field data. What seems hard to justify in M.E.T.E. is the claim that the information contained in $N_0$, $S$, $E_0$ produces a coupling between the distribution of abundance among species and the distribution of metabolic energy among individuals.