Abstract
Since its inception, the theory of evolution has garnered much attention from the scientific community, and for good reason: it theorizes how various living organisms came to be and what changes are to be expected in a given environment. While many models of evolution have been proposed to track changes in species’ traits, much less has been said about how to calculate or estimate these traits. In this paper, using information theory, we propose an estimation method for trait parameters in a Darwinian evolution model for species with one or multiple traits. We propose estimating parameters by minimizing the relative information in a Darwinian evolution population model using either classical or stochastic gradient descent. The proposed procedure is shown to be feasible in a supervised or unsupervised learning environment, similarly to what occurs with Boltzmann machines. Simulations are provided to illustrate the method.
MSC:
37N30; 37N40; 39-08
1. Introduction
The theory of evolution was famously championed by Darwin [1] in a publication where he stated his theory of natural selection. This theory soon became known as Darwinian evolution theory and can be construed as a series of postulates whose aim is to understand dynamical changes in organisms’ traits. In other words, understanding the mechanisms of survival or disappearance of a species entails understanding the mechanisms by which species’ traits are passed on over time to their offspring. From a statistical point of view, a species having multiple offspring is the realization of a random process by which a certain amount of information is passed on to each offspring. The amount and the relevance of the information (often found in the organism’s genome) that is passed on may determine the species’ viability over time. Since the paper of Vincent et al. [2] on Darwinian dynamics and evolutionary game theory, there have been many studies related to Darwinian dynamics. In ecology in particular, Ackleh et al. [3] proposed a model for competitive evolutionary dynamics, and Cushing [4] established difference equations for population dynamics. A Susceptible–Infected Darwinian model with evolutionary resistance was discussed in Cushing et al. [5], and models for competing species were proposed by Elaydi et al. [6]. However, the literature is very sparse on how to estimate species’ traits (the information stored in or passed on to offspring) from these models using readily available data.
Information theory can help design estimation methods for species’ traits given data. Before showing how to use information theory for such a purpose, let us mention that there are two approaches to information theory that are related but capture different aspects of it. Firstly, consider a given sample of data from a probability distribution that depends on a parameter. Fisher’s information [7] represents the amount of information that the sample contains about the parameter. It can be interpreted as the inverse of the variance of the sampling error, the random discrepancy between an estimate and the estimated parameter that arises from the sampling process itself (e.g., from the fact that the population as a whole has not been observed). Fisher’s information is important because it measures the precision of a point estimate of a parameter: the higher Fisher’s information, the more precise the point estimate. On the other hand, when the information content (or message) of a distribution is under consideration, Shannon’s entropy [8] is used. The main difference here is that the content is not assumed to be a parameter and is therefore not unknown. Most importantly, there is no latency relative to the population involved, unlike with a sample, which is only a partial representation of a population. From Shannon’s entropy, one can define the relative entropy, which allows us to compare different distributions and thus discriminate between different types of information content. Even though they have different interpretations, we note that the two approaches to information are mathematically related: Fisher’s information is the Hessian matrix, with respect to the parameter, of the relative entropy or Kullback–Leibler divergence. In discrete or continuous Darwinian dynamics, the main assumption is that of a deterministic relationship between inputs and outputs. As we stated above, however, the transmission of traits from parents to offspring is in fact a random process and should be analyzed as such. This assumes that there is a stochastic relationship between inputs and outputs via a network of connections, often referred to as weights.
More precisely, say u is the vector of inputs and v the vector of outputs, with network connection matrix W. If input and output samples are both available, the stochastic relationship between inputs and outputs is described by $p(v \mid u; W)$, which is the probability that input u generates output v when the weight matrix of the network is W. In supervised learning, the goal is to match as closely as possible the input–output distribution $p(v \mid u; W)$ with the probability distribution $q(v \mid u)$ related to the samples, by adjusting the weights W in a feedforward manner. In unsupervised learning, it is assumed that output samples are not given and only input samples are given. However, using an appropriate energy function, one can obtain the joint probability distribution $p(u, v; W)$ and infer from the rules of probability that $p(u; W) = \sum_{v} p(u, v; W)$. The goal in unsupervised learning is therefore to match $p(u; W)$ as closely as possible to the probability distribution $q(u)$ related to the samples, by adjusting the weights W in a feedforward manner. Feedforward techniques require the minimization of an appropriate objective function, and since we are dealing with probability distributions, the Kullback–Leibler divergence seems to be the best choice of objective function.
In this paper, our focus is on using relative information, or relative entropy, to estimate trait parameters in a Darwinian population dynamics model. The procedure we propose is similar to what occurs in a Boltzmann machine. The remainder of the paper is organized as follows: in Section 2, we give a brief overview of Fisher’s information, entropy, and relative entropy as they pertain to mathematical statistics. In Section 3, we discuss how to computationally estimate trait parameters of discrete Darwinian models in supervised and unsupervised environments by minimizing the relative information or Kullback–Leibler divergence. Finally, in Section 4, we make some concluding remarks.
2. Review of Information Theory
In this light review of information theory, we mention Fisher’s information only for the sake of self-containment. For Fisher’s information and evolution dynamics, the reader can refer to [9] for a deeper analysis. For ease of analysis and comprehension, the assumptions and notations below follow the lines of that work.
2.1. Fisher’s Information Theory
Let $g(x;\theta)$ be the probability density of a random variable X, continuous or discrete, where the parameter $\theta$ lies in an open set $\Theta$. Here $\theta$ is either a single parameter or a vector of parameters.
In the sequel, we consider the following assumptions on the function $g$:
- A1: The support of $g(\cdot\,;\theta)$ is independent of $\theta$.
- A2: $g(x;\theta)$ is nonnegative and $g(x;\theta) > 0$ for all x in its support.
- A3: $g(x;\cdot) \in C^2(\Theta)$, the set of twice continuously differentiable functions of $\theta$, for all x.
The first assumption discards from consideration distributions such as the uniform distribution whose support is $[0,\theta]$. The second and third assumptions ensure the well-definiteness of $\ln g(x;\theta)$, of its first derivative (the score function) $\partial_\theta \ln g(x;\theta)$, and of its second derivative $\partial^2_\theta \ln g(x;\theta)$. In the sequel, the expected value of a random variable X is denoted as $E[X]$.
Definition 1.
Given a random variable X with density function $g(x;\theta)$ satisfying assumptions (A1)–(A3) above, Fisher’s information of X is defined as
$$ I(\theta) = E\!\left[\left(\frac{\partial}{\partial \theta} \ln g(X;\theta)\right)^{2}\right]. $$
When $\theta$ is a vector with more than one coordinate, Fisher’s information is a symmetric positive definite (thus invertible) matrix $I(\theta) = \big(I_{ij}(\theta)\big)_{i,j}$, where
$$ I_{ij}(\theta) = E\!\left[\frac{\partial \ln g(X;\theta)}{\partial \theta_i}\, \frac{\partial \ln g(X;\theta)}{\partial \theta_j}\right]. $$
Fisher’s information represents the amount of information contained in an estimator of $\theta$, given data X.
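To make Definition 1 concrete, the short symbolic computation below (an illustration added here, not taken from the original analysis) recovers the classical Fisher information $I(\theta) = 1/(\theta(1-\theta))$ of a single Bernoulli observation; the choice of distribution and the use of sympy are our own assumptions.

```python
import sympy as sp

# Illustrative check: Fisher's information of a Bernoulli(theta) observation,
# I(theta) = E[(d/dtheta ln g(X; theta))^2].
theta, x = sp.symbols('theta x', positive=True)

# Density g(x; theta) = theta^x (1 - theta)^(1 - x) for x in {0, 1}.
g = theta**x * (1 - theta)**(1 - x)
score = sp.diff(sp.log(g), theta)  # score function d/dtheta ln g

# Expectation over x in {0, 1}: sum of score^2 weighted by g.
I = sum((score**2 * g).subs(x, k) for k in (0, 1))
print(sp.simplify(I))  # 1/(theta*(1 - theta))
```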
2.2. Entropy and Relative Entropy
Let us start by recalling the following definition of Shannon’s entropy and that of the Kullback–Leibler divergence or relative entropy.
Definition 2.
Let μ be a probability distribution defined on a sample space Ω. Then the entropy of μ is given as
$$ H(\mu) = -\sum_{\omega \in \Omega} \mu(\omega) \ln \mu(\omega). $$
Definition 3.
Suppose μ and ν are two probability distributions defined on a sample space Ω. Then, the Kullback–Leibler divergence or relative information of μ relative to a fixed ν is given as
$$ D_{KL}(\mu \,\|\, \nu) = \sum_{\omega \in \Omega} \mu(\omega) \ln \frac{\mu(\omega)}{\nu(\omega)}. $$
There is an obvious connection between entropy and relative entropy:
$$ D_{KL}(\mu \,\|\, \nu) = -H(\mu) - \sum_{\omega \in \Omega} \mu(\omega) \ln \nu(\omega). $$
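The identity above is easy to verify numerically. The following minimal sketch (with arbitrary illustrative distributions of our own choosing) computes the entropy, the KL divergence, and the cross-entropy term, and checks that they satisfy the stated connection.

```python
import numpy as np

# Two discrete distributions on the same three-point sample space.
mu = np.array([0.5, 0.25, 0.25])
nu = np.array([0.4, 0.4, 0.2])

H_mu = -np.sum(mu * np.log(mu))       # Shannon entropy H(mu)
D_kl = np.sum(mu * np.log(mu / nu))   # D_KL(mu || nu)
cross = -np.sum(mu * np.log(nu))      # cross-entropy term of mu relative to nu

# Verifies D_KL(mu || nu) = -H(mu) + cross-entropy.
print(H_mu, D_kl, np.isclose(D_kl, cross - H_mu))
```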
We notice that when $\mu = \nu$, we have that $D_{KL}(\mu \,\|\, \nu) = 0$. It is known that Fisher’s information induces a Riemannian metric ([10,11]) defined on a statistical manifold, that is, a smooth manifold whose points are probability measures defined on a probability space. It therefore represents the “informational” discrepancy between measurements. As such, it is related to the Kullback–Leibler divergence (KL), used typically to assess the difference between two probability distributions. Indeed, to see this, let $f(\theta') = D_{KL}(p_{\theta} \,\|\, p_{\theta'})$, where $p_{\theta}$ denotes the distribution with density $g(\cdot\,;\theta)$. Let us use a second-order Taylor expansion of f in $\theta'$ about $\theta$. Then we will obtain
$$ f(\theta') \approx f(\theta) + \nabla f(\theta)^{\top} (\theta' - \theta) + \frac{1}{2} (\theta' - \theta)^{\top} H_f(\theta)\, (\theta' - \theta). $$
From the above observation, we have $f(\theta) = D_{KL}(p_\theta \,\|\, p_\theta) = 0$. We also have
$$ \nabla f(\theta) = -\,E\big[\nabla_\theta \ln g(X;\theta)\big] = 0. $$
It follows that only the quadratic term of the expansion remains. We conclude by noticing that the Hessian matrix $H_f(\theta)$ is precisely Fisher’s information matrix $I(\theta)$. Therefore, if $\theta$ and $\theta'$ are infinitesimally close to each other, that is, $\theta' = \theta + d\theta$ with $d\theta$ small, we have
$$ D_{KL}\big(p_{\theta} \,\|\, p_{\theta + d\theta}\big) \approx \frac{1}{2}\, d\theta^{\top} I(\theta)\, d\theta. $$
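This local relation between the KL divergence and Fisher’s information is easy to check numerically. The sketch below (for an illustrative Bernoulli family, again our own choice rather than the paper’s model) compares the exact divergence with the quadratic approximation $\frac{1}{2}\, d\theta^{\top} I(\theta)\, d\theta$.

```python
import numpy as np

# Check that D_KL(p_theta || p_{theta+dtheta}) ~ (1/2) dtheta^2 I(theta)
# for a Bernoulli(theta) family, where I(theta) = 1/(theta*(1 - theta)).
def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, dtheta = 0.3, 1e-3
fisher = 1.0 / (theta * (1 - theta))

exact = kl_bernoulli(theta, theta + dtheta)
approx = 0.5 * dtheta**2 * fisher
print(exact, approx)  # the two agree to leading order in dtheta
```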
There is an interesting discussion of the finite case in [12]. Indeed, suppose that $\Omega$ is a finite set with cardinality n. Let $\Omega = \{1, \dots, n\}$, let $\mathcal{P}(\Omega)$ be the set of all probability measures on $\Omega$, and let $\Delta_{n-1}$ be the simplex $\{p \in \mathbb{R}^n : p_i \geq 0,\ \sum_{i=1}^{n} p_i = 1\}$, which can be identified with $\mathcal{P}(\Omega)$. There is an isometry defined as $\psi(p) = 2(\sqrt{p_1}, \dots, \sqrt{p_n})$, mapping $\Delta_{n-1}$ onto a portion of the sphere of radius 2. In this case, Fisher’s information components become
$$ I_{ij}(p) = \frac{\delta_{ij}}{p_i}, $$
where $\delta_{ij}$ is the Kronecker symbol. It follows that
$$ D_{KL}(p \,\|\, p + dp) \approx \frac{1}{2} \sum_{i=1}^{n} \frac{(dp_i)^2}{p_i}. $$
The interpretation of the latter is Fisher’s fundamental theorem (see [7]) and Kimura’s maximal principle (see [13]) in terms of Fisher information: natural selection forms a gradient with respect to an information measure, and hence locally has the direction of maximal information increase. The rate of change of the mean fitness of the population is given by the information variance.
Let us highlight some of these facts by calculating the KL divergence between two pairs of distributions arising from the iterates of the model; see Figure 1 below. We will use the same parameters as above, with the same starting points in the algorithm.
Figure 1.
(a) KL divergence for the first pair of distributions; (b) KL divergence for the second pair. In (a), convergence to zero means that the two compared distributions are similar. In (b), convergence to about 8.7 means that the compared distributions are no longer different from each other, making their KL divergence from the reference distribution constant. In both cases, this is an indication that Fisher’s information has been maximized, or that the variances of the estimates are the same and very small.
The minimization of the KL divergence and the maximization of Fisher’s information both occur when the variance of the estimator is minimal. From a dynamical system perspective, this occurs when the fixed point has been attained. This means that the KL divergence can be used to determine the critical points of a discrete Darwinian model (see the numerical sketch below). Therefore, there is a dichotomy between the problems of minimizing the KL divergence and maximizing Fisher’s information. This amounts to matching the two probability distributions under comparison as closely as possible. When this happens, we know from above that the KL divergence will vanish, which in turn will mean that the fixed points of the discrete Darwinian model have been attained. To that end, we will discuss how to accomplish this with machine learning approaches, namely under supervised and unsupervised learning.
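As a quick numerical illustration of this use of the KL divergence, the sketch below simulates an ensemble of trajectories of a noisy Ricker-type map (a stand-in of our own choosing, not the Darwinian system of Section 3) and tracks the empirical KL divergence between the distributions of consecutive iterates; its decay toward zero signals the approach to a fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x):
    # Ricker-type growth with stable fixed point x* = 1.5 and small
    # multiplicative noise (both modeling choices are assumptions).
    return x * np.exp(1.5 - x) * rng.lognormal(0.0, 0.05, size=x.shape)

def empirical_kl(a, b, bins=50):
    # Smoothed histogram estimate of D_KL between two samples a and b.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, edges = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=edges)
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    return float(np.sum(p * np.log(p / q)))

x = rng.uniform(0.5, 3.5, size=10_000)  # ensemble of initial conditions
for t in range(30):
    x_next = step(x)
    print(t, empirical_kl(x, x_next))   # decays toward 0 near the fixed point
    x = x_next
```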
3. Evolution Population Dynamics and Relative Information
3.1. Single Darwinian Population Model with Multiple Traits
Now suppose we are in the presence of one species with density x possessing n traits given by the trait vector $u = (u_1, \dots, u_n)$ and a phenotype vector v. We will consider the following:
- (H1): $p(u)$ is the joint distribution of the independent traits $u_1, \dots, u_n$, each with mean 0 and variance $\sigma_i^2$.
- (H2): $\Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)$.
- (H3): The density of the phenotype vector v is given as $p(v - u)$ when the mean trait vector is u.
Under hypotheses (H1)–(H3), we will consider the discrete dynamical system
$$ \begin{cases} x_{t+1} = x_t\, r(x_t, v, u_t)\big|_{v = u_t}, \\ u_{t+1} = u_t + \Sigma\, \nabla_v \ln r(x_t, v, u_t)\big|_{v = u_t}, \end{cases} \qquad (7) $$
where $r(x, v, u)$ is the fitness of an individual of phenotype v in a population with density x and mean trait vector u, typically built from a birth rate b and an intra-specific competition function c.
Remark 1.
We note that in the context of Darwinian population dynamics, r is the population growth rate, b represents the birth rate, while c is the intra-specific competition function. $\Sigma$ represents the covariance matrix of the distribution of traits among phenotypes of the species.
In this section, we propose two approaches to estimate the trait vector u in a Darwinian model, using relative information.
3.2. Supervised and Unsupervised Learning
Supervised and unsupervised learning are two types of machine learning techniques whose ultimate aim is to learn the best possible connections between a set of inputs and outputs. In supervised learning, both the inputs and the outputs are presented to a system that learns the best possible connections between them. In an unsupervised learning environment, only inputs are presented, and the system learns and describes the best possible outputs that would be related to the inputs. Let us note that the distribution of the outputs depends on the inputs and on the parameters W and K.
In the sequel, we propose a machine learning approach for the estimation of the trait vector u. We show in particular that, depending on the amount of information available, different machine learning approaches are possible. To that end, learning schemes for supervised and unsupervised learning are designed to estimate both W and K, which are the key parameters in the Darwinian model (7) proposed above. In fact, once the vectors W and K have been estimated, the data available on the population density and the second equation in (7) are used to evaluate $\widehat{u}_t$, the estimated value of $u_t$. In supervised learning, the values of $\widehat{u}_t$ should match closely with their sample counterparts $u_t$. In unsupervised learning, in the absence of sample data for the traits, the estimated values $\widehat{u}_t$ should serve as actual values for $u_t$, the vector of traits.
3.2.1. Supervised Learning
Let n and M be given positive integers. In supervised learning, we are given an input/output sample of the form
$$ \mathcal{S} = \big\{ (u^{(m)}, v^{(m)}) : m = 1, \dots, M \big\}, $$
where each $u^{(m)}$ and $v^{(m)}$ is a vector. This means that, in this case, we already know the inputs that generate the solution of the system (7). The probability of v given u is $p(v \mid u; W, K)$. Therefore, using known techniques of conditional probabilities, we have
$$ p(v \mid u; W, K) = \frac{p(u, v; W, K)}{p(u; W, K)} = \frac{e^{-E(u, v; W, K)}}{\sum_{v'} e^{-E(u, v'; W, K)}}, \qquad (10) $$
where E is an energy function associated with the model (cf. Remark 5 below).
Since we are trying to estimate traits, in supervised learning the goal will be to determine how well $p(\cdot \mid u; W, K)$ matches the sample distribution $q(\cdot \mid u)$. Since these are distributions, we use the Kullback–Leibler (KL) divergence
$$ D_{KL}\big(q(\cdot \mid u) \,\|\, p(\cdot \mid u; W, K)\big) = -H\big(q(\cdot \mid u)\big) - \sum_{v} q(v \mid u) \ln p(v \mid u; W, K), $$
where
$$ H\big(q(\cdot \mid u)\big) = -\sum_{v} q(v \mid u) \ln q(v \mid u) $$
is the entropy of the distribution of the outputs given the inputs. At the sample level, for given M samples, the average of the KL divergence is given as
$$ \overline{D}_{KL} = \frac{1}{M} \sum_{m=1}^{M} D_{KL}\big(q(\cdot \mid u^{(m)}) \,\|\, p(\cdot \mid u^{(m)}; W, K)\big). \qquad (12) $$
We minimize the quantity on the right-hand side of (12) using gradient descent schemes that update the weight vectors W and K. Hence, the following result on supervised learning.
Theorem 1.
Suppose that p, q, W, and K are as above and that we have the input/output sample $\mathcal{S}$. The minimization process of the KL divergence can be achieved with a classical gradient descent algorithm for the weights W and K, with an update scheme given as
$$ W \leftarrow W - \eta_W \frac{\partial \overline{D}_{KL}}{\partial W}, \qquad K \leftarrow K - \eta_K \frac{\partial \overline{D}_{KL}}{\partial K}, \qquad (13) $$
where $\eta_W$ and $\eta_K$ are the learning rates of W and K, respectively, and the gradient $\partial \overline{D}_{KL} / \partial \beta$, for $\beta = W$ or $\beta = K$, is given in Appendix A.
Corollary 1.
Under the assumptions of Theorem 1 above, the minimization process of the KL divergence can be achieved more efficiently with a stochastic gradient descent algorithm for the weights W and K, with an update scheme given as
$$ W \leftarrow W - \eta_W \frac{\partial D^{(m)}_{KL}}{\partial W}, \qquad K \leftarrow K - \eta_K \frac{\partial D^{(m)}_{KL}}{\partial K}, \qquad (14) $$
where $D^{(m)}_{KL}$ denotes the KL term of a single sample m drawn at random at each step, and $\eta_W$ and $\eta_K$ are the learning rates of W and K, respectively.
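To make the supervised scheme concrete, the sketch below implements the stochastic updates of W and K for a hypothetical stand-in conditional model $p(v_i = 1 \mid u; W, K) = \sigma(W_i u_i + K_i)$ with diagonal coupling (a choice of ours, made in the spirit of the diagonal structure noted in Remark 5; the paper’s exact conditional and its gradients are those of Appendix A). For this stand-in, minimizing the sample KL divergence amounts to maximizing the conditional log-likelihood, and the single-sample gradient takes the familiar error-times-input form.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic supervised sample (u^(m), v^(m)) from a known W, K (assumption).
n, M = 4, 5000
W_true, K_true = rng.normal(size=n), rng.normal(size=n)
U = rng.normal(size=(M, n))
V = (rng.random((M, n)) < sigmoid(U * W_true + K_true)).astype(float)

W, K = np.zeros(n), np.zeros(n)   # initial weights
eta_w, eta_k = 0.05, 0.05         # learning rates
for epoch in range(20):
    for m in rng.permutation(M):  # one randomly drawn sample per update
        err = V[m] - sigmoid(W * U[m] + K)  # v - E[v | u; W, K]
        W += eta_w * err * U[m]             # stochastic descent step on the KL
        K += eta_k * err

print(np.round(W - W_true, 2), np.round(K - K_true, 2))  # near zero
```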
Remark 2.
One could use Jeffreys’ prior as the probability distribution of the traits if no other information about their distribution is given. In this case, the prior is proportional to $\sqrt{\det I(\theta)}$, where $\det A$ denotes the determinant of the matrix A. We discussed in [9] that Fisher’s information matrix for such a system depends on both W and K. This means that a Jeffreys prior is advisable only for algorithm initialization; otherwise, from Equation (10), the updating scheme would have to be changed according to the dependence of the prior on both W and K. In fact, in this case, one would need to consider the respective partial derivatives of $\sqrt{\det I}$ with respect to the components of W and K, which significantly increases the complexity of the problem.
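As a small illustration of this remark, a Jeffreys prior can be evaluated, up to normalization, directly from a Fisher information matrix; the Bernoulli example below is again our own choice.

```python
import numpy as np

# Jeffreys prior up to normalization: p(theta) proportional to sqrt(det I(theta)).
def jeffreys_unnormalized(fisher_matrix):
    return np.sqrt(np.linalg.det(fisher_matrix))

# For Bernoulli(theta), I(theta) = 1/(theta*(1 - theta)) as a 1x1 matrix,
# giving the familiar prior proportional to theta^(-1/2) * (1 - theta)^(-1/2).
theta = 0.3
print(jeffreys_unnormalized(np.array([[1.0 / (theta * (1 - theta))]])))
```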
3.2.2. Unsupervised Learning
Now, we assume that W and K are as above and that only a sample of inputs $\{u^{(m)} : m = 1, \dots, M\}$ is given. Since the outputs $v^{(m)}$ are not given, we cannot use the learning scheme above. We have to design a method that only depends on the inputs, W, and K. We assume that the joint probability of u and v is
$$ p(u, v; W, K) = \frac{1}{Z}\, e^{-E(u, v; W, K)}, $$
where Z is the constant $\sum_{u, v} e^{-E(u, v; W, K)}$.
In unsupervised learning, we would like to determine how well the marginal $p(\cdot\,; W, K)$ matches the sample distribution q of the inputs. Since these are distributions, we use the Kullback–Leibler (KL) divergence
$$ D_{KL}\big(q \,\|\, p(\cdot\,; W, K)\big) = -H(q) - \sum_{u} q(u) \ln p(u; W, K), $$
where $H(q)$ is the entropy of the distribution of the inputs. At the sample level, and for given M samples from the random variable u, the average of the KL divergence is given as
$$ \overline{D}_{KL} = -H(q) - \frac{1}{M} \sum_{m=1}^{M} \ln p\big(u^{(m)}; W, K\big). \qquad (17) $$
Similarly to the above, we minimize the quantity on the right-hand side of (17) using gradient descent schemes that update the weight vectors W and K. Hence, the following result on unsupervised learning.
Theorem 2.
Suppose that W and K are as above and that we have the input sample $\{u^{(m)} : m = 1, \dots, M\}$. The minimization process of the KL divergence can be achieved with a classical gradient descent algorithm for the weights W and K, with an update scheme given as
$$ W \leftarrow W - \eta_W \frac{\partial \overline{D}_{KL}}{\partial W}, \qquad K \leftarrow K - \eta_K \frac{\partial \overline{D}_{KL}}{\partial K}, \qquad (18) $$
where $\eta_W$ and $\eta_K$ are the learning rates of W and K, respectively, and the gradient $\partial \overline{D}_{KL} / \partial \beta$, for $\beta = W$ or $\beta = K$, is given in Appendix A.
Corollary 2.
Under the hypotheses of Theorem 2 above, the minimization process of the KL divergence can be achieved more efficiently with a stochastic gradient descent algorithm for the weights W and K, with an update scheme given as
$$ W \leftarrow W - \eta_W \frac{\partial D^{(m)}_{KL}}{\partial W}, \qquad K \leftarrow K - \eta_K \frac{\partial D^{(m)}_{KL}}{\partial K}, \qquad (19) $$
where $D^{(m)}_{KL}$ denotes the KL term of a single sample m drawn at random at each step, and $\eta_W$ and $\eta_K$ are the learning rates of W and K, respectively.
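The unsupervised updates can be sketched in the same stand-in setting. In the sketch below, the joint distribution is a Gibbs model with diagonal binary coupling (our assumption, again in the spirit of Remark 5), and, following Corollary 2, each model expectation is replaced by a single sampled realization, yielding a positive-phase-minus-negative-phase update reminiscent of Boltzmann machine learning.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_v(u, W, K):  # one realization of v given u
    return (rng.random(u.shape) < sigmoid(W * u + K)).astype(float)

def sample_u(v, W):     # one realization of u given v (no bias, by assumption)
    return (rng.random(v.shape) < sigmoid(W * v)).astype(float)

n, M = 4, 5000
U = (rng.random((M, n)) < 0.6).astype(float)  # input samples only
W, K = np.zeros(n), np.zeros(n)
eta_w, eta_k = 0.02, 0.02

for m in rng.permutation(M):
    u = U[m]
    v = sample_v(u, W, K)      # positive phase: clamped to the data
    u_f = sample_u(v, W)       # negative phase: one free-running
    v_f = sample_v(u_f, W, K)  #   realization from the model
    W += eta_w * (v * u - v_f * u_f)  # single-sample gradient estimate
    K += eta_k * (v - v_f)

print(np.round(W, 2), np.round(K, 2))
```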
Remark 4.
We note that there is a subtle but important difference between Equations (14) and (19). In Equation (14), all the data are available for training; therefore, all quantities can be calculated. In Equation (19), we are not given sample data for the outputs; however, we can use the Darwinian difference Equation (7) to evaluate both of the quantities required in the update.
3.3. Simulations
In this simulation, we generate the data from system (7) as follows: we choose the dimension n, the vectors W and K, and the initial conditions. From these, we generate the samples
$$ \big\{ (x_t, u_t) : t = 1, \dots, M \big\}. $$
In Figure 2, we start by illustrating the generated data, where the first column represents the dynamics of the population density and the second column represents the dynamics of the traits. To evaluate the accuracy of the method in supervised learning, we evaluate the percentage of correct calculation of the vector parameters W and K as the percentage of estimation runs for which $\|\widehat{W} - W\| < \epsilon$ (and, respectively, $\|\widehat{K} - K\| < \epsilon$) for a threshold parameter $\epsilon > 0$, where $\|\cdot\|$ represents a norm in $\mathbb{R}^n$. In Figure 3 below, we show the percentage of correct calculation for the given data under supervised learning.
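One hedged reading of this accuracy metric, written as a minimal helper (the norm and the run-wise aggregation are our assumptions), is the fraction of estimation runs whose estimate lands within an $\epsilon$-ball of the true parameter vector.

```python
import numpy as np

def percent_correct(estimates, true_vec, eps):
    # estimates: array of shape (runs, n); one estimated vector per run.
    estimates = np.asarray(estimates)
    hits = np.linalg.norm(estimates - true_vec, axis=1) < eps
    return 100.0 * hits.mean()

# Example: 2 of the 3 runs fall within eps = 0.5 of the truth -> 66.7%.
runs = np.array([[1.0, 2.0], [1.1, 2.2], [3.0, 0.0]])
print(percent_correct(runs, np.array([1.0, 2.0]), eps=0.5))
```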
Figure 2.
The left column of this figure represents the dynamics of the population density; the right column represents the corresponding trait samples. From a dynamical system point of view, this is a situation where one of the state variables has no fixed point even though the other does.
Figure 3.
(a) Percentage of correct learning for the vector W as a function of the number of samples used. The fluctuations are normal and are due to the random selection of samples at each iteration. We observe that, despite fluctuating, the percentage of correct learning increases with the number of samples and hovers around 75%. This is quite remarkable given that all the system’s parameters are randomly reshuffled at each epoch. (b) Percentage of correct learning for the vector K. The percentage of correct learning for K also hovers around the same value, which shows in general that supervised learning can be quite effective at learning the key parameters in a Darwinian evolution dynamics model.
Remark 5.
Let us observe that the learning techniques proposed above are similar to those of a Boltzmann machine. Indeed, the joint distribution takes the Gibbs form $p(u, v; W, K) = e^{-E(u, v; W, K)}/Z$, in which W is a diagonal matrix and the normalizing constant can be chosen so that the threshold values vanish. Written this way, E can be seen as an energy function along the lines of a Boltzmann machine; see, for instance, [14]. However, with E as above, the Darwinian system is, strictly speaking, not a Boltzmann machine, since the diagonal terms of the matrix W are nonzero and the off-diagonal terms are zero. In fact, in a Boltzmann machine, the opposite is true: the diagonal terms are zero and the off-diagonal terms are nonzero.
4. Conclusions
In this paper, we have shown how to estimate trait parameters in a Darwinian evolution dynamics model with one species and multiple traits under supervised and unsupervised learning. The procedure can be implemented using classical or stochastic gradient descent. We have shown the similarity between the proposed procedure and Boltzmann machine learning, even though the energy function involved is not that of a Boltzmann machine in the strictest sense of the term. The techniques proposed in this paper could certainly be adapted to readily available data. That is to say, this paper is a proof of concept meant to kickstart the conversation on how to bring modern estimation techniques to important problems of evolution theory with much-needed mathematical rigor.
Funding
This research was funded by The American Mathematical Society and Simmons Foundation, grant number AMS-SIMMONS-PUI-23028GR.
Data Availability Statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.
Acknowledgments
I would like to acknowledge Cleves Epoh Nsali Ewonga for invaluable comments that enhanced the quality of this manuscript.
Conflicts of Interest
The author declares no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A
Appendix A.1. Proof of Theorem 1
Proof.
We recall that
Therefore, using the definition of conditional probability, we have from Equation (A1) that
where
For simplicity, we write .
We recall that Z is also the marginal .
Therefore, it follows that
Now, we fix u and v. For a given parameter β equal to W or K, it follows from (A3) that
From Equation (A1), we obtain
The latter implies that
And also that
Using the definitions of conditional and marginal probabilities, together with Equation (A7), we have
Consequently, Equation (A4) becomes
To summarize, for W and K, respectively, we have
A classical gradient procedure will update the weights W and K by moving oppositely to the gradient of the average sample KL divergence, for selected learning rates $\eta_W$ and $\eta_K$, as follows:
Since, from Equation (A9), we have to run through all values of the output variable, this procedure may not be suitable for large networks. With a stochastic gradient procedure, the expectation is replaced with a single realization. Hence, we obtain the stochastic gradient scheme
for carefully chosen constants $\eta_W$ and $\eta_K$. □
Appendix A.2. Proof of Theorem 2
Proof.
The joint probability distribution of u and v given W and K is
$$ p(u, v; W, K) = \frac{1}{Z}\, e^{-E(u, v; W, K)}, \qquad (A12) $$
where Z is the constant $\sum_{u, v} e^{-E(u, v; W, K)}$. We also have the marginal of u given W and K as
$$ p(u; W, K) = \frac{1}{Z} \sum_{v} e^{-E(u, v; W, K)}, \qquad (A13) $$
and the conditional probability
$$ p(v \mid u; W, K) = \frac{p(u, v; W, K)}{p(u; W, K)} = \frac{e^{-E(u, v; W, K)}}{\sum_{v'} e^{-E(u, v'; W, K)}}. $$
Now, we fix u. Let β equal W or K. From Equation (A13) and logarithmic differentiation, we have that
Using the quotient rule in Equation (A12), we obtain
It follows from the above and Equation (A15) that
To summarize, for W and K, respectively, we have
A classical gradient procedure will update the weights W and K by moving oppositely to the gradient of the average sample KL divergence, for selected learning rates $\eta_W$ and $\eta_K$, as follows:
The classical approach requires running through all values of u and v, which again is not very suitable for large networks. With a stochastic gradient procedure, each of the two expectations involved is replaced with a single random realization. Hence, we obtain the stochastic gradient scheme
for carefully chosen constants $\eta_W$ and $\eta_K$. □
References
- Darwin, C. On the Origin of Species by Means of Natural Selection; John Murray: London, UK, 1859. [Google Scholar]
- Vincent, T.L.; Vincent, T.L.S.; Cohen, Y. Darwinian dynamics and evolutionary game theory. J. Biol. Dyn. 2011, 5, 215–226. [Google Scholar] [CrossRef]
- Ackleh, A.S.; Cushing, J.M.; Salceanu, P.L. On the dynamics of evolutionary competition models. Nat. Resour. Model. 2015, 28, 380–397. [Google Scholar] [CrossRef]
- Cushing, J.M. Difference equations as models of evolutionary population dynamics. J. Biol. Dyn. 2019, 13, 103–127. [Google Scholar] [CrossRef] [PubMed]
- Cushing, J.M.; Park, J.; Farrell, A.; Chitnis, N. Treatment of outcome in an SI model with evolutionary resistance: A Darwinian model for the evolutionary resistance. J. Biol. Dyn. 2023, 17, 2255061. [Google Scholar] [CrossRef] [PubMed]
- Elaydi, S.; Kang, Y.; Luis, R. The effects of evolution on the stability of competing species. J. Biol. Dyn. 2022, 16, 816–839. [Google Scholar] [CrossRef] [PubMed]
- Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. A 1922, 222, 309–368. [Google Scholar]
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 623–656. [Google Scholar] [CrossRef]
- Kwessi, E. Information theory in a Darwinian evolution population dynamics model. arXiv 2024, arXiv:2403.05044. [Google Scholar]
- Amari, S.; Nagaoka, H. Methods of Information Geometry; Chapter: Chentsov Theorem and Some Historical Remarks; Oxford University Press: Oxford, UK, 2000; pp. 37–40. [Google Scholar]
- Dowty, J.G. Chentsov’s theorem for exponential families. Inf. Geom. 2018, 1, 117–135. [Google Scholar] [CrossRef]
- Harper, M. Information geometry and evolutionary game theory. arXiv 2009, arXiv:0911.1383. [Google Scholar] [CrossRef]
- Kimura, M. On the change of population fitness by natural selection. Heredity 1958, 12, 145–167. [Google Scholar] [CrossRef]
- Hinton, G.E.; Sejnowski, T.J. Optimal perceptual inference. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 19–23 June 1983; pp. 448–453. [Google Scholar]