On Using Relative Information to Estimate Traits in a Darwinian Evolution Population Dynamics

Abstract: Since its inception, evolution theory has garnered much attention from the scientific community, and for a good reason: it theorizes how various living organisms came to be and what changes are to be expected in a certain environment. While many models of evolution have been proposed to track changes in species' traits, not much has been said about how to calculate or estimate these traits. In this paper, using information theory, we propose an estimation method for trait parameters in a Darwinian evolution model for species with one or multiple traits. We propose estimating parameters by minimizing the relative information in a Darwinian evolution population model, using either a classical or a stochastic gradient ascent on the corresponding log-likelihood. The proposed procedure is shown to be possible in a supervised or unsupervised learning environment, similarly to what occurs with Boltzmann machines. Simulations are provided to illustrate the method.


Introduction
The theory of evolution was famously championed by Darwin [1] in a publication where he stated his theory of natural selection. This theory soon became known as Darwinian evolution theory and can be construed as a series of postulates whose aim is to understand dynamical changes in organisms' traits. In other words, understanding the mechanisms of survival or disappearance of a species entails understanding the mechanisms by which species' traits are passed on over time to their offspring. From a statistical point of view, a species having multiple offspring is the realization of a random process by which a certain amount of information is passed on to an offspring. How much information is passed on, or how relevant that information (often found in the organism's genome) is, may determine the species' viability over time. Since the paper of Vincent et al. [2] on Darwinian dynamics and evolutionary game theory, there have been many studies related to Darwinian dynamics. In ecology in particular, Ackley et al. [3] proposed a model for competitive evolutionary dynamics, and Cushing [4] established difference equations for population dynamics. A Susceptible-Infected Darwinian model with evolutionary resistance was discussed in Cushing et al. [5]. Models for competitive species were proposed by Elaydi et al. [6]. However, the literature is very sparse on how to estimate species' traits (information stored or passed on to offspring) from these models using readily available data.
Information theory can help design estimation methods for species' traits given data. Before showing how to use information theory for such a purpose, let us mention that there are two approaches to information theory that are related but capture different aspects of it. First, consider a given sample of data from a probability distribution that depends on a parameter. Fisher's information [7] represents the amount of information that the sample contains about the parameter. It can be calculated as the inverse of the variance of the sampling error, the random discrepancy between an estimate and the estimated parameter, which arises from the sampling process itself (e.g., because the population as a whole has not been observed). Fisher's information is important because it is the inverse measure of the precision of a point estimate of a parameter: the higher Fisher's information, the more precise the point estimate. On the other hand, when the information content (or message) of a distribution is under consideration, Shannon's entropy [8] is used. The main difference here is that the content is not assumed to be a parameter and therefore is not unknown. Most importantly, there is no latency relative to the population involved, unlike with a sample, which is only a partial representation of a population. From Shannon's entropy, one can define the relative entropy, which allows us to compare different distributions and thus discriminate between different types of information content. Even though they have different interpretations, the two approaches to information are mathematically related: Fisher's information is the Hessian matrix, with respect to the parameter, of the relative entropy or Kullback-Leibler divergence. In discrete or continuous Darwinian dynamics, the main assumption is that of a deterministic relationship between inputs and outputs. As we stated above, the transmission of traits from parents to offspring is, in fact, a random process and should be analyzed as such. This suggests instead a stochastic relationship between inputs and outputs via a network of connections often referred to as weights.
More precisely, say u is the vector of inputs and v the vector of outputs, with network connection matrix W. If input and output samples (u_s, v_s), 1 ≤ s ≤ S, are both available, the stochastic relationship between inputs and outputs is described as P(v|u, W), the probability that input u generates output v when the weight matrix of the network is W. In supervised learning, the goal is to match as closely as possible the input-output distribution P(v|u, W) with the probability distribution P(v|u) related to the samples (u_s, v_s), by adjusting the weights W in a feedforward manner. In unsupervised learning, it is assumed that output samples are not given and only input samples (u_s), 1 ≤ s ≤ S, are available. However, using an appropriate energy function, one can obtain the joint probability distribution P(u, v, W) and infer from the rules of probability that P(u, W) = ∑_v P(u, v, W). The goal in unsupervised learning is therefore to match P(u, W) as closely as possible to the probability distribution P(u) related to the samples (u_s), 1 ≤ s ≤ S, by adjusting the weights W in a feedforward manner. Feedforward techniques require the minimization of an appropriate objective function, and since we are dealing with probability distributions, the Kullback-Leibler divergence is a natural choice of objective function.
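To make this matching concrete, the following minimal Python sketch (our own toy setup, not taken from the paper) fits a softmax model P(v|u, W) to an empirical conditional distribution P(v|u) by descending the Kullback-Leibler divergence; the discrete sizes, the softmax form, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 4 discrete inputs, 3 discrete outputs.
n_u, n_v = 4, 3
P_emp = rng.dirichlet(np.ones(n_v), size=n_u)   # empirical P(v|u) built from samples
P_u = np.full(n_u, 1.0 / n_u)                   # input distribution

W = np.zeros((n_u, n_v))                        # network weights

def model(W):
    """Softmax model P(v|u, W)."""
    e = np.exp(W - W.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl(P, Q):
    """Average KL divergence D(P(v|u) || Q(v|u)) over inputs u."""
    return float(np.sum(P_u[:, None] * P * np.log(P / Q)))

for step in range(2000):
    Q = model(W)
    # Gradient of the averaged KL w.r.t. W for a softmax model: P_u * (Q - P_emp).
    grad = P_u[:, None] * (Q - P_emp)
    W -= 0.5 * grad                             # gradient descent step on the KL

print("final KL:", kl(P_emp, model(W)))         # should be close to 0
```

The same skeleton, with the sum over v carried out analytically, is what the unsupervised variant described above would descend for the marginal P(u, W).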
In this paper, our focus is on using relative information, or relative entropy, to estimate trait parameters in a Darwinian population dynamics model. The procedure we propose is similar to what occurs in a Boltzmann machine. The remainder of the paper is organized as follows: in Section 2, we give a brief overview of Fisher's information theory, entropy, and relative entropy as they pertain to mathematical statistics. In Section 3, we discuss how to computationally estimate trait parameters of discrete Darwinian models in supervised and unsupervised environments by minimizing the relative information or Kullback-Leibler divergence. Finally, in Section 4, we make some concluding remarks.

Review of Information Theory
In this light review of information theory, we mention Fisher's information only for the sake of self-containment. As for Fisher's information and evolution dynamics, the reader can refer to [9] for a deeper analysis. For ease of analysis and comprehension, the assumptions and notations below are along the lines of that work.

Fisher's Information Theory
Let G(x, Θ) be the probability density of a random variable X, continuous or discrete, on an open set X × Ω ⊂ R × R^n. Here Θ = (θ_1, θ_2, ⋯, θ_n) is either a single parameter or a vector of parameters.

Definition 1. Given a random variable X with density function G(x, Θ) satisfying A_1 and A_2, Fisher's information of X is defined as
\[
I(\Theta) = E\!\left[\left(\frac{\partial \ln G(X, \Theta)}{\partial \Theta}\right)^{2}\right].
\]
When Θ is a vector with more than one coordinate, Fisher's information is a symmetric positive definite (thus invertible) matrix I(Θ) = (I_{kl}(Θ))_{1≤k,l≤n}, where
\[
I_{kl}(\Theta) = E\!\left[\frac{\partial \ln G(X, \Theta)}{\partial \theta_k}\,\frac{\partial \ln G(X, \Theta)}{\partial \theta_l}\right].
\]
Fisher's information I(Θ) represents the amount of information contained in an estimator of Θ, given data X.
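As a simple worked example (ours, not the paper's), take X ∼ N(θ, σ²) with σ² known. The definition above gives
\[
\ln G(x, \theta) = -\frac{(x-\theta)^2}{2\sigma^2} - \tfrac{1}{2}\ln(2\pi\sigma^2),
\qquad
\frac{\partial \ln G(x, \theta)}{\partial \theta} = \frac{x-\theta}{\sigma^2},
\]
\[
I(\theta) = E\!\left[\left(\frac{X-\theta}{\sigma^2}\right)^{2}\right] = \frac{E\big[(X-\theta)^2\big]}{\sigma^4} = \frac{1}{\sigma^2}.
\]
The smaller the noise variance σ², the larger Fisher's information, and the more precise a point estimate of θ can be, consistent with the interpretation above.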

Entropy and Relative Entropy
Let us start by recalling the following definition of Shannon's entropy and that of the Kullback-Leibler divergence or relative entropy.

Definition 2. Let µ be a probability distribution defined on a sample space Ω. Then the entropy of µ is given as
\[
H(\mu) = -\sum_{x \in \Omega} \mu(x) \ln \mu(x).
\]

Definition 3. Suppose µ and ν are two probability distributions defined on a sample space Ω. Then the Kullback-Leibler divergence or relative information D_KL(µ, ν) of µ relative to a fixed ν is given as
\[
D_{KL}(\mu, \nu) = \sum_{x \in \Omega} \mu(x) \ln \frac{\mu(x)}{\nu(x)}.
\]
There is an obvious connection between entropy and relative entropy:
\[
D_{KL}(\mu, \nu) = -H(\mu) - \sum_{x \in \Omega} \mu(x) \ln \nu(x).
\]
We notice that when µ = ν, we have that D_KL(µ, ν) = 0. It is known that Fisher's information induces a Riemannian metric ([10,11]) defined on a statistical manifold, that is, a smooth manifold whose points are probability measures defined on a probability space. It therefore represents the "informational" discrepancy between measurements. As such, it is related to the Kullback-Leibler divergence (KL), used typically to assess the difference between two probability distributions. Indeed, to see this, let θ, θ_0 ∈ Ω and set µ = µ(θ), ν = µ(θ_0). A second-order Taylor expansion of D_KL(µ(θ), µ(θ_0)) in θ around θ_0, together with the observations that D_KL(µ, ν)|_{µ=ν} = 0 and that the first-order term vanishes at θ = θ_0, gives
\[
D_{KL}\big(\mu(\theta), \mu(\theta_0)\big) \approx \tfrac{1}{2}\, (\theta - \theta_0)^{\top} I(\theta_0)\, (\theta - \theta_0).
\]
We conclude that the Hessian matrix of the Kullback-Leibler divergence is Fisher's information matrix when µ = µ(θ) and ν = µ(θ_0) are infinitesimally close to each other. There is an interesting discussion of the finite case in [12]. Indeed, suppose that Ω is a finite set with cardinality n. Let Θ = (θ_1, θ_2, ⋯, θ_n), let Λ(Ω) be the set of all probability measures on Ω, and let S_n(Ω) be the simplex
\[
S_n(\Omega) = \Big\{\Theta \in \mathbb{R}^{n} : \theta_k \ge 0,\ \sum_{k=1}^{n} \theta_k = 1\Big\}.
\]
In this case, Fisher's information components become
\[
I_{kl}(\Theta) = \frac{\delta_{kl}}{\theta_k},
\]
where δ_kl is the Kronecker symbol. It follows that natural selection can be read through Fisher's fundamental theorem (see [7]) and Kimura's maximal principle (see [13]) in terms of Fisher information: natural selection forms a gradient with respect to an information measure, and hence locally has the direction of maximal information increase. The rate of change of the mean fitness of the population is given by the information variance.
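The local quadratic relationship between the KL divergence and Fisher's information can be checked numerically. The sketch below (illustrative only) uses a Bernoulli family, for which I(θ) = 1/(θ(1−θ)), and compares the exact divergence with the quadratic approximation above.

```python
import numpy as np

def kl_bernoulli(theta, theta0):
    """Exact D_KL(Bern(theta) || Bern(theta0))."""
    return theta * np.log(theta / theta0) + (1 - theta) * np.log((1 - theta) / (1 - theta0))

def fisher_bernoulli(theta):
    """Fisher information of a Bernoulli(theta) variable."""
    return 1.0 / (theta * (1.0 - theta))

theta0 = 0.3
for eps in [0.05, 0.01, 0.001]:
    theta = theta0 + eps
    exact = kl_bernoulli(theta, theta0)
    approx = 0.5 * fisher_bernoulli(theta0) * eps ** 2   # second-order Taylor term
    print(f"eps={eps:>6}: exact={exact:.6e}  0.5*I*eps^2={approx:.6e}")
```

As eps shrinks, the two columns agree to more and more digits, which is exactly the statement that the Hessian of the KL divergence is the Fisher information.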
Let us highlight some of these facts by calculating the KL divergence D_KL(Θ_1, Θ_t) between the distributions ν = P(Θ_1) and µ = P(Θ_t) of Θ_1 and Θ_t, respectively, and D_KL(Θ_t, Θ_{t+1}) between the probability distributions ν = P(Θ_t) and µ = P(Θ_{t+1}) of Θ_t and Θ_{t+1}, respectively, for t = 1, 2, ⋯, n. See Figure 1 below. We will use the same parameters as above, with the same starting points for θ in the algorithm.
The minimization of D_KL(Θ_t, Θ_{t+1}) and the maximization of D_KL(Θ_1, Θ_t) both occur when Fisher's information is maximal, or equivalently when the variance of the estimator is minimal. From a dynamical system perspective, this occurs when the fixed point (e^{-1}, e) has been attained. This means that the KL divergence can be used to determine the critical points of a discrete Darwinian model. There is therefore a dichotomy between the problems of minimizing D_KL(Θ_t, Θ_{t+1}) and maximizing D_KL(Θ_1, Θ_t). Minimizing D_KL(Θ_t, Θ_{t+1}) amounts to matching the probability distributions P(Θ_t) and P(Θ_{t+1}) as closely as possible. When this happens, we know from above that we will have D_KL(Θ_t, Θ_{t+1}) ≈ 0, which in turn means that the fixed points of the discrete Darwinian model have been attained. We now discuss how to accomplish this with machine learning approaches, namely under supervised and unsupervised learning.
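A hedged sketch of how these two divergences could be monitored from simulated trait samples: assuming, for illustration only, that the traits at each generation are approximately Gaussian (the contracting trait map with noise below is a stand-in, not the paper's model), the closed-form Gaussian KL gives D_KL(Θ_t, Θ_{t+1}) and D_KL(Θ_1, Θ_{t+1}), and a plateau in both signals that a fixed point of the dynamics has been reached.

```python
import numpy as np

def kl_gaussian_1d(mu1, var1, mu2, var2):
    """KL divergence between the 1-D Gaussians N(mu1, var1) and N(mu2, var2)."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Illustrative trait trajectory: samples of theta at each generation t,
# generated by a simple contracting map plus noise (assumed, not the paper's system).
rng = np.random.default_rng(1)
T, M = 15, 500
samples = np.empty((T, M))
samples[0] = rng.normal(3.0, 1.0, M)
for t in range(1, T):
    samples[t] = 0.6 * samples[t - 1] + rng.normal(0.0, 0.2, M)

stats = [(s.mean(), s.var()) for s in samples]
for t in range(T - 1):
    d_step = kl_gaussian_1d(*stats[t], *stats[t + 1])   # D_KL(Theta_t, Theta_{t+1})
    d_init = kl_gaussian_1d(*stats[0], *stats[t + 1])   # D_KL(Theta_1, Theta_{t+1})
    print(f"t={t+1:2d}  D(t,t+1)={d_step:.4f}  D(1,t+1)={d_init:.4f}")
```

The first column drops toward zero while the second levels off at a constant, mirroring the behavior described for Figure 1.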

Single Darwinian Population Model with Multiple Traits
Now suppose we are in the presence of one species with density x possessing n traits given by the vector Θ = (θ_1, θ_2, ⋯, θ_n), together with a vector U = (u_1, u_2, ⋯, u_n). We will consider hypotheses (H_1)-(H_3), which specify the joint distribution of the independent traits θ_i, each with mean 0 and variance w_i^2, and the density of x_t. Under H_1-H_3, we will consider the discrete dynamical system (7), whose growth rate is G(x_t, Θ_t, U_t).
Remark 1. We note that in the context of Darwinian population dynamics, G(x_t, Θ_t, U_t) is the population growth rate, b(Θ) represents the birth rate, while c_U(Θ) is the intra-specific competition function. Σ represents the covariance matrix of the distribution of traits among phenotypes of the species.
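Since the display form of system (7) does not reproduce here, the following Python sketch simulates a generic discrete Darwinian model of this type. The multiplicative density update through G, the Gaussian-shaped birth rate peaked at U, the specific competition function, and the covariance Σ are all assumptions made for illustration, not the paper's exact specification.

```python
import numpy as np

# Assumed (illustrative) ingredients of a discrete Darwinian model:
#   x_{t+1}     = x_t * G(x_t, Theta_t, U)
#   Theta_{t+1} = Theta_t + Sigma @ grad_Theta ln G(x_t, Theta_t, U)
# with G = b(Theta) * exp(-c(Theta) * x), b a Gaussian-shaped birth rate
# and c a trait-dependent intra-specific competition function.
b0, c0 = 1.5, 0.1
U = np.array([1.0, 2.0])                 # environmental optimum (assumed)
Sigma = 0.05 * np.eye(2)                 # trait covariance among phenotypes

def b(theta):                            # assumed birth rate, peaked at U
    return b0 * np.exp(-0.5 * np.sum((theta - U) ** 2))

def c(theta):                            # assumed intra-specific competition
    return c0 * (1.0 + 0.1 * np.sum(theta ** 2))

def grad_ln_G(x, theta, h=1e-6):
    """Numerical gradient of ln G = ln b(theta) - c(theta) * x with respect to the traits."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = h
        g[i] = (np.log(b(theta + e)) - c(theta + e) * x
                - np.log(b(theta - e)) + c(theta - e) * x) / (2 * h)
    return g

x, theta = 0.5, np.array([0.2, 0.4])
for t in range(50):
    G = b(theta) * np.exp(-c(theta) * x)
    x, theta = x * G, theta + Sigma @ grad_ln_G(x, theta)   # density and trait updates

print("final density:", round(x, 4), "final traits:", np.round(theta, 4))
```

Under these assumed forms the traits drift toward the environmental optimum while the density settles where the growth rate equals one, which is the kind of fixed point discussed in the previous section.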
In this section, we propose two approaches to estimate the traits vector Θ in a Darwinian model, using relative information.

Supervised and Unsupervised Learning
Supervised and unsupervised learning are two types of machine learning techniques whose ultimate aim is to learn the best possible connections between a set of inputs and outputs. In supervised learning, both the inputs and the outputs are presented to a system that learns the best possible connections between them. In an unsupervised learning environment, only inputs are presented, and the system learns and describes the best possible outputs that would be related to the inputs. In the sequel, we propose a machine learning approach for the estimation of the trait vector Θ. We show in particular that, depending on the amount of information available, both machine learning approaches are possible. To that end, learning schemes for supervised and unsupervised learning are designed to estimate both W and K, which are the key parameters in the Darwinian model (7) proposed above. In fact, once the vectors W and K have been estimated, the data available on X_t and the second equation in (7) are used to evaluate Θ_t^estimated, the estimated value of Θ. In supervised learning, the values of Θ_t^estimated should match closely their sample counterparts Θ_t^samples. In unsupervised learning, in the absence of sample data Θ_t^samples, the estimated values Θ_t^estimated serve as actual values for Θ_t, the vector of traits.

Supervised Learning
Let n, T and M be given positive integers. In supervised learning, we are given a sample of inputs/outputs of the form (X_t^(m), Θ_t^(m)), 1 ≤ t ≤ T, 1 ≤ m ≤ M, where each Θ_t^(m) is a 1 × N vector. This means that in this case, we already know the inputs that generate the solution of the system (7). Therefore, using known techniques of conditional probabilities, we can write the conditional distribution P(Θ_{t+1} | X_{t+1}, W, K) of the traits given the observed densities and the weights. Since we are trying to estimate traits, in supervised learning the goal will be to determine how well P(Θ_{t+1} | X_{t+1}, W, K) matches P(Θ_t | X_t). Since these are distributions, we use the Kullback-Leibler (KL) divergence
\[
D_{KL}\big(P(\Theta_t | X_t),\, P(\Theta_{t+1} | X_{t+1}, W, K)\big) = -H\big(P(\Theta_t | X_t)\big) - \sum_{\Theta_t} P(\Theta_t | X_t)\, \ln P(\Theta_{t+1} | X_{t+1}, W, K),
\]
where H(P(Θ_t | X_t)) is the entropy of the distribution of Θ_t | X_t. At the sample level, for given M samples, the average of the KL divergence is obtained by averaging the above over the samples (X_t^(m), Θ_t^(m)); this average is the quantity appearing on the right-hand side of (12). We minimize the quantity on the right-hand side of (12), using a gradient scheme, by updating the weight vectors W and K. Hence the following result on supervised learning.
Theorem 1. Suppose that X_t, W, and K are as above and that we have an input/output sample (X_t^(m), Θ_t^(m)), 1 ≤ t ≤ T, 1 ≤ m ≤ M. The minimization process of the KL divergence can be achieved with a classic gradient ascent algorithm for the weights W and K, with an update scheme given as
\[
w_{l,\mathrm{new}} = w_{l,\mathrm{old}} + \alpha_w \frac{\partial \ln P(\Theta_{t+1}, W, K)}{\partial w_l}, \qquad
\kappa_{l,\mathrm{new}} = \kappa_{l,\mathrm{old}} + \alpha_\kappa \frac{\partial \ln P(\Theta_{t+1}, W, K)}{\partial \kappa_l},
\]
where α_w, α_κ > 0 are the learning rates of W and K, respectively, and ∂ln(P(Θ_{t+1}, W, K))/∂x for x = w_l, κ_l is given in Appendix A.
Corollary 1. Under the assumptions of Theorem 1 above, the minimization process of the KL divergence can be achieved more efficiently with a stochastic gradient ascent algorithm for the weights W and K, with an analogous update scheme in which the expectation over Θ_t is replaced by a single sampled realization, where α_w, α_κ > 0 are the learning rates of W and K, respectively.
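The update scheme can be sketched as follows. The energy-based form used here, b(θ; w, k) = wθ² + kθ with a one-dimensional discretized trait, is an assumption for illustration only, since the paper's b(X, Θ, W, K) is defined through system (7); the classic gradient ascent on the log-likelihood (equivalent to descending the KL divergence) is shown, and the stochastic variant of Corollary 1 would simply use one sampled trait per step instead of the full data moments.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discretized trait values (illustrative) and an assumed energy
#   b(theta; w, k) = w * theta**2 + k * theta,
# giving the Gibbs model P(theta | w, k) = exp(-b) / Z.
thetas = np.linspace(-3, 3, 61)

def probs(w, k):
    e = np.exp(-(w * thetas ** 2 + k * thetas))
    return e / e.sum()

# "Samples" of the trait drawn from a ground-truth model (w = 1.0, k = -0.5).
target = probs(1.0, -0.5)
data = rng.choice(thetas, size=2000, p=target)

w, k, lr = 0.2, 0.0, 0.05
for epoch in range(300):
    p = probs(w, k)
    # Gradient of the average log-likelihood for an exponential-family model:
    # (data moments) minus (model moments), with a sign from the -b convention.
    grad_w = -(np.mean(data ** 2) - np.sum(p * thetas ** 2))
    grad_k = -(np.mean(data) - np.sum(p * thetas))
    w += lr * grad_w          # classic gradient ascent on the log-likelihood,
    k += lr * grad_k          # equivalent to descending the KL divergence

print(f"estimated w={w:.3f}, k={k:.3f}")   # should approach 1.0 and -0.5
```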
Remark 2. One could use the Jeffreys prior as the probability distribution P(Θ_t) of Θ_t if no other information about its distribution is given. In this case, P(Θ_t) ∝ √(det I(Θ_t)), where det(A) is the determinant of the matrix A. We discussed in [9] that Fisher's information matrix for such a system depends on both W and K. This means that a Jeffreys prior is advisable only for algorithm initialization; otherwise, from Equation (10), the updating scheme would have to be changed according to the dependence of P(Θ_t) on both W and K. In fact, in this case, one would need to consider the respective partial derivatives ∂P(Θ_t)/∂x, for x = w_l, κ_l, which significantly increases the complexity of the problem.
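As a small side illustration of Remark 2 (ours, not the paper's code), an unnormalized Jeffreys prior weight can be obtained numerically as the square root of the determinant of a Fisher information matrix; the diagonal Fisher matrix below corresponds to the assumption of independent Gaussian traits with known variances w_i².

```python
import numpy as np

def jeffreys_unnormalized(fisher_matrix):
    """Unnormalized Jeffreys prior density: sqrt(det I(Theta))."""
    return np.sqrt(np.linalg.det(fisher_matrix))

# Illustrative case: independent Gaussian traits theta_i with known variances w_i**2
# have Fisher information diag(1 / w_i**2) for their means, so the prior weight
# is prod(1 / w_i).
w = np.array([0.5, 2.0])
I = np.diag(1.0 / w ** 2)
print(jeffreys_unnormalized(I))   # = 1 / (0.5 * 2.0) = 1.0
```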

Unsupervised Learning
Now, assume that X_t, W, and K are as above but that only a sample of inputs X_t^(m) is given. Since output samples are not available, we cannot use the learning scheme above, and we have to design a method that depends only on X_t^(m), W and K. We assume that the joint probability P(X_t, Θ_t, W, K) of X_t and Θ_t is
\[
P(X_t, \Theta_t, W, K) = \frac{e^{-b(X_t, \Theta_t, W, K)}}{Z},
\]
where Z is the normalizing constant, and we set P(X_t, W, K) = ∑_{Θ_t} P(X_t, Θ_t, W, K). In unsupervised learning, we would like to determine how well P(X_{t+1}, W, K) matches P(X_t). Since these are distributions, we use the Kullback-Leibler (KL) divergence
\[
D_{KL}\big(P(X_t),\, P(X_{t+1}, W, K)\big) = -H\big(P(X_t)\big) - \sum_{X_t} P(X_t)\, \ln P(X_{t+1}, W, K),
\]
where H(P(X_t)) is the entropy of the distribution of X_t. At the sample level, and for given M samples from the random variable X_t, the average of the KL divergence is the quantity on the right-hand side of (17). Similarly as above, we minimize the quantity on the right-hand side of (17), using a gradient scheme, by updating the weight vectors W and K. Hence the following result on unsupervised learning.
Theorem 2. Suppose that X_t, W, and K are as above and that we have an input sample X_t^(m), 1 ≤ t ≤ T, 1 ≤ m ≤ M. The minimization process of the KL divergence can be achieved with a classic gradient ascent algorithm for the weights W and K, with an update scheme given as
\[
w_{l,\mathrm{new}} = w_{l,\mathrm{old}} + \alpha_w \frac{\partial \ln P(X_{t+1}, W, K)}{\partial w_l}, \qquad
\kappa_{l,\mathrm{new}} = \kappa_{l,\mathrm{old}} + \alpha_\kappa \frac{\partial \ln P(X_{t+1}, W, K)}{\partial \kappa_l},
\]
where α_w, α_κ > 0 are the learning rates of W and K, respectively.
Corollary 2. Under the hypotheses of Theorem 2 above, the minimization process of the KL divergence can be achieved more efficiently with a stochastic gradient ascent algorithm for the weights W and K, with an analogous update scheme in which the sum over the trait values Θ_t is replaced by a single sampled realization, where α_w, α_κ > 0 are the learning rates of W and K, respectively.
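The distinctive step of the unsupervised case is the marginalization over the unobserved traits. The Python sketch below illustrates it on a discretized toy Gibbs model; the quadratic energy, the grids, and the ground-truth parameters are assumptions for illustration, not the paper's b(X, Θ, W, K).

```python
import numpy as np

rng = np.random.default_rng(3)

# Discretized grids for the observed density X and the hidden trait Theta (illustrative).
xs = np.linspace(0.0, 4.0, 41)
thetas = np.linspace(-2.0, 2.0, 41)
X, TH = np.meshgrid(xs, thetas, indexing="ij")

def marginal_x(w, k):
    """P(X, W, K) = sum over Theta of an assumed Gibbs joint exp(-b)/Z."""
    b = w * TH ** 2 + k * X + (X - TH) ** 2     # assumed energy, not the paper's b
    joint = np.exp(-b)
    joint /= joint.sum()
    return joint.sum(axis=1)                    # marginalize the hidden trait out

# Observed samples drawn from a ground-truth marginal (w = 2.0, k = 0.5).
data_idx = rng.choice(len(xs), size=3000, p=marginal_x(2.0, 0.5))

def avg_loglik(w, k):
    return np.mean(np.log(marginal_x(w, k)[data_idx]))

w, k, lr, h = 1.0, 0.0, 0.1, 1e-4
for epoch in range(500):
    # Finite-difference gradient of the average log marginal likelihood.
    gw = (avg_loglik(w + h, k) - avg_loglik(w - h, k)) / (2 * h)
    gk = (avg_loglik(w, k + h) - avg_loglik(w, k - h)) / (2 * h)
    w, k = w + lr * gw, k + lr * gk     # ascend the log-likelihood of P(X),
                                        # i.e. descend D_KL(P(X_t), P(X_t, W, K))

print(f"fitted w={w:.3f}, k={k:.3f}, avg log-likelihood={avg_loglik(w, k):.4f}")
```

A stochastic variant, as in Corollary 2, would estimate each gradient from a single sampled trait value rather than the full sum over the Theta grid.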

Remark 3.
1. D_KL(P(Θ_t|X_t), P(Θ_{t+1}|X_{t+1}, W, K)) is an expectation with respect to the random variable Θ_t|X_t; therefore, the right-hand side of Equation (12) is one of its Riemann sums.
2. Similarly, D_KL(P(X_t), P(X_{t+1}, W, K)) is an expectation with respect to the random variable X_t; therefore, the right-hand side of Equation (17) is one of its Riemann sums.
Remark 4. We note that there is a subtle but important difference between Equations (14) and (19). In Equation (14), all the data are available for training; therefore, all quantities can be calculated. In Equation (19), we are not given sample data Θ_t; however, we can use the Darwinian difference Equation (7) to evaluate both ∂b(X_t, Θ_t, W, K)/∂w_l and ∂b(X_t, Θ_t, W, K)/∂κ_l.

Simulations
In this simulation, we generate the data from system (7) as follows: we choose T = 200, M = 350, N = 2, c_0 = 0.1, b_0 = 1.5, and the vectors W ∼ 100 × Unif(0.5, 1), K ∼ Unif(0, 1), U ∼ Unif(0, 10), and Θ = (1, 2, 3). From this we generate the samples X_t^(m) and Θ_t^(m), 1 ≤ t ≤ T, 1 ≤ m ≤ M. In Figure 2, we start by illustrating the generated data, where the first column represents the dynamics of X_t and the second column represents the dynamics of Θ_t = (θ_1t, θ_2t). To evaluate the accuracy of the method in supervised learning, we evaluate the percentage of correct calculation of the vector parameters W and K as
\[
\text{Correct Percentage} = \frac{\#\{\|\text{Parameter}_{\text{true}} - \text{Parameter}_{\text{estimated}}\| \le \eta\}}{\#\text{ of samples}},
\]
for a threshold parameter η = 0.001, where ∥·∥ represents a norm in R^N. In Figure 3 below, we show the percentage of correct calculation for the given data under supervised learning.
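The accuracy metric above can be computed as in the following sketch; the function and variable names, as well as the synthetic estimates used to exercise it, are ours and purely illustrative.

```python
import numpy as np

def correct_percentage(true_params, estimated_params, eta=0.001):
    """Fraction of samples whose estimated parameter vector lies within eta
    (in Euclidean norm) of the true parameter vector."""
    true_params = np.asarray(true_params)          # shape (M, N)
    estimated_params = np.asarray(estimated_params)
    errors = np.linalg.norm(true_params - estimated_params, axis=1)
    return np.mean(errors <= eta)

# Hypothetical example with M = 350 samples and N = 2 parameters.
rng = np.random.default_rng(4)
W_true = np.tile([75.0, 90.0], (350, 1))
W_est = W_true + rng.normal(0.0, 0.001, size=W_true.shape)   # fake estimates
print(f"correct percentage: {100 * correct_percentage(W_true, W_est):.1f}%")
```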
Remark 5. Let us observe that the learning techniques proposed above are similar to those of a Boltzmann machine. Indeed, we can write the distribution in terms of e^{-E(X,Θ)}, where Z is a matrix determining the threshold value, and E(X, Θ) can be seen as an energy function along the lines of a Boltzmann machine; the constant b_0 can be chosen such that b_0 = ∑_{X,Θ} exp(−E(X, Θ)), see for instance [14]. However, with G(X, Θ) as above, the Darwinian system is strictly not a Boltzmann machine, since the diagonal terms of the matrix W are nonzero and the off-diagonal terms are zero. In fact, in a Boltzmann machine, the opposite is true: the diagonal terms are zero and the off-diagonal terms are nonzero.
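To make this structural difference concrete, here is a small illustrative comparison (our notation, assumed forms): the Darwinian-type energy uses only the diagonal of W, while a Boltzmann-machine energy uses only the off-diagonal couplings.

```python
import numpy as np

def energy_darwinian(theta, W):
    """Assumed Darwinian-type energy: only diagonal terms of W contribute."""
    return float(np.sum(np.diag(W) * theta ** 2))

def energy_boltzmann(s, W):
    """Boltzmann-machine energy: only off-diagonal couplings, zero diagonal."""
    W_off = W - np.diag(np.diag(W))
    return float(-0.5 * s @ W_off @ s)

W = np.array([[1.0, 0.3], [0.3, 2.0]])
theta = np.array([0.5, -1.0])
s = np.array([1.0, -1.0])
print(energy_darwinian(theta, W), energy_boltzmann(s, W))
```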

Conclusions
In this paper, we have shown how to estimate trait parameters in a Darwinian evolution dynamics model with one species and multiple traits, under supervised and unsupervised learning. The procedure can be implemented using either a classical or a stochastic gradient scheme. We have shown the similarity between the proposed procedure and Boltzmann machine learning, even though the energy function is not the same as that of a Boltzmann machine in the strictest sense of the term. The techniques proposed in this paper could certainly be adapted to readily available data. This is to say, this paper is a proof of concept meant to kickstart the conversation on how to bring modern estimation techniques to important problems of evolution theory with much-needed mathematical rigor.
Funding: This research was funded by The American Mathematical Society and Simmons Foundation, grant number AMS-SIMMONS-PUI-23028GR.

Appendix A

Now, we fix Θ_t = Θ* and X_t = X*. For a given l = 1, 2, ⋯, n and x = w_l or x = κ_l, it follows from (A3) and Equation (A1) that
\[
\frac{\partial P(X^*, \Theta_t, W, K)}{\partial x}
= -\frac{\partial b(X^*, \Theta_t, W, K)}{\partial x}\,\frac{e^{-b(X^*, \Theta_t, W, K)}}{Z}
+ \frac{e^{-b(X^*, \Theta_t, W, K)}}{Z} \sum_{X_t, \Theta_t} \frac{\partial b(X_t, \Theta_t, W, K)}{\partial x}\,\frac{e^{-b(X_t, \Theta_t, W, K)}}{Z}
\]
\[
= -\frac{\partial b(X^*, \Theta_t, W, K)}{\partial x}\, P(X^*, \Theta_t, W, K)
+ P(X^*, \Theta_t, W, K) \sum_{X_t, \Theta_t} \frac{\partial b(X_t, \Theta_t, W, K)}{\partial x}\, P(X_t, \Theta_t, W, K). \tag{A10}
\]
Since, from Equation (A9), we have to run through all values of Θ_t, this procedure may not be suitable for large networks.

Figure 1. In (a), D_KL(Θ_t, Θ_{t+1}); in (b), D_KL(Θ_1, Θ_{t+1}), for t = 1, 2, ⋯, 15. In (a), convergence to zero means that the distributions of Θ_t and Θ_{t+1} are similar. In (b), convergence to about 8.7 means that the distributions of Θ_t are no longer different from each other, making their KL divergence from the distribution of Θ_1 constant. In both cases, this is an indication that Fisher's information has been maximized, or that the variances of the estimates are the same and very small.

Figure 2. The left column represents X_t^(m) for m = 1, 2, 3 and t = 1, 2, ⋯, 200. The right column represents the samples θ_1t and θ_2t, respectively. From a dynamical system point of view, this is a situation where there is no fixed point (x, Θ), even though there is a fixed point for x_t.

Figure 3. (a) Percentage of correct learning for the vector W as a function of the number of samples used. The fluctuations are normal and are due to the random selection of samples at each iteration. We observe that, despite fluctuating, the percentage of correct learning increases with the number of samples and hovers around 75%. This is quite remarkable given that all the system's parameters are randomly reshuffled at each epoch. (b) Percentage of correct learning for the vector K. The percentage of correct learning for K also hovers around 75%, which shows in general that supervised learning can be quite effective at learning the key parameters in a Darwinian evolution dynamics model.
The gradient procedure updates the weights W and K by moving oppositely to the gradient of the average sample KL divergence, for selected learning rates α_W and α_K, as follows:
\[
w_{l,\mathrm{new}} = w_{l,\mathrm{old}} + \alpha_W \frac{\partial \ln P(\Theta_{t+1}, W, K)}{\partial w_l}, \qquad
\kappa_{l,\mathrm{new}} = \kappa_{l,\mathrm{old}} + \alpha_K \frac{\partial \ln P(\Theta_{t+1}, W, K)}{\partial \kappa_l}.
\]
With a stochastic gradient procedure, the quantity ∑_{Θ_t} [∂ln P(Θ_{t+1}, Θ_t, W, K)/∂x] P(Θ_t | X_{t+1}^(m), W, K) is replaced with a single realization, and the corresponding stochastic updates take the form w_{l,new} = w_{l,old} − α_W(·) and κ_{l,new} = κ_{l,old} − α_K(·), for carefully chosen constants α_w and α_K, where (·) denotes the single-realization gradient term.