On Using Relative Information to Estimate Traits in a Darwinian Evolution Population Dynamics

Kwessi, Eddy

doi:10.3390/axioms13060406

Open AccessArticle

On Using Relative Information to Estimate Traits in a Darwinian Evolution Population Dynamics

by

Eddy Kwessi

Department of Mathematics, Trinity University, 1 Trinity Place, San Antonio, TX 78212, USA

Axioms 2024, 13(6), 406; https://doi.org/10.3390/axioms13060406

Submission received: 30 April 2024 / Revised: 11 June 2024 / Accepted: 12 June 2024 / Published: 16 June 2024

(This article belongs to the Special Issue Infinite Dynamical System and Differential Equations)

Download

Browse Figures

Versions Notes

Abstract

Since its inception, evolution theory has garnered much attention from the scientific community for a good reason: it theorizes how various living organisms came to be and what changes are to be expected in a certain environment. While many models of evolution have been proposed to track changes in species’ traits, not much has been said about how to calculate or estimate these traits. In this paper, using information theory, we propose an estimation method for trait parameters in a Darwinian evolution model for species with one or multiple traits. We propose estimating parameters by minimizing the relative information in a Darwinian evolution population model using either a classical gradient ascent or a stochastic gradient ascent. The proposed procedure is shown to be possible in a supervised or unsupervised learning environment, similarly to what occurs with Boltzmann machines. Simulations are provided to illustrate the method.

Keywords:

Darwinian evolution; traits; population dynamics; information theory

MSC:

37N30; 37N40; 39-08

1. Introduction

The theory of evolution was famously championed by Darwin [1] in a publication where he stated his theory of natural selection. This theory soon became known as Darwinian evolution theory and can be construed as a series of postulates whose aim is to understand dynamical changes in organisms’ traits. In other words, understanding the mechanisms of survival or disappearance of a species entails understanding mechanisms by which species traits are passed on overtime to their offspring. From a statistical point of view, a species having multiple offspring is the realization of a random process by which a certain amount of information is passed on to an offspring. How much information or how relevant information (often found in the organism’s genome) is passed on may determine the species viability overtime. Since the paper of Vincent et al. [2] on Darwinian dynamics and evolutionary game theory, there have been many studies related to Darwinian dynamics. In Ecology in particular, Ackley et al. [3] proposed a model for competitive evolutionary dynamics, and Cushing [4] established difference equations for population dynamics. A Susceptible-Infected Darwinian model with evolutionary resistance was discussed in Cushing et al. [5]. Models for competitive species were proposed by Elaydi et al. [6]. However, the literature is very sparse on how to estimate species’ traits (information stored or passed on to offspring) from these models using readily available data.

Information theory can help design estimation methods for species traits given data. Before showing how to use information theory for such a purpose, let us mention that there are two approaches to information theory that are related, but meaning and capturing different aspects of it. Firstly, consider a given sample of data from a probability distribution that depends on a parameter. Fisher’s information [7] represents the amount of information that the sample contains about the parameter. It can be calculated as the inverse of the sampling error which is the random discrepancy between an estimate and the estimated parameter, and which arises from the sampling process itself (e.g., due to the fact that the population as a whole has not been considered). Fisher’s information is important because it is the inverse measure of the precision of a point estimate of a parameter. Indeed, the higher Fisher’s information, the more precise is the point estimate. On the other hand, when the information content (or message) of a distribution is under consideration, Shannon’s entropy [8] is used. The main difference here is that it is assumed that the content is not a parameter and therefore is not unknown. Most importantly, there is no latency relative to the population involved, unlike a sample, which is only a partial representation of a population. From Shannon’s entropy, one can define the relative entropy, which allows us to compare different distributions and thus discriminate between different types of contents of information. Even though they have different interpretations, we note that the two approaches to information are mathematically related. Indeed, Fisher’s information is the Hessian matrix with respect to the parameter of the relative entropy or Kullback–Leibler divergence. In discrete or continuous Darwinian dynamics, the main assumption is that of a deterministic relationship between inputs and outputs. As we stated above, the transmission of traits from parents to offspring is, in fact, a random process and should be analyzed as such. This assumes that there is a stochastic relationship between inputs and outputs via a network of connections often referred to as weights.

More precisely, say u is the vector of inputs and v the vector of outputs with network connections matrix W. If input and output samples

(u_{s}, v_{s}), 1 \leq s \leq S

are both available, the stochastic relationship between inputs and outputs is described as

P (v | u, W)

, which is the probability that input u generates output v when the weight matrix of the network is W. In supervised learning, the goal is to match as closely as possible the input–output distribution

P (v | u, W)

with the probability distribution

P (v | u)

related to the samples

(u_{s}, v_{s})

, by adjusting the weights W in a feedforward manner. In unsupervised learning, it is assumed that output samples are not given and only input samples

(u_{s}), 1 \leq s \leq S

are given. However, using an appropriate energy function, one can obtain the joint probability distribution

P (u, v, W)

and infer from rules of probability that

P (u, W) = \sum_{v} P (u, v, W)

. The goal in unsupervised learning is therefore to match

P (u, W)

as closely as possible to the probability distribution

P (u)

related to samples

(u_{s}), 1 \leq s \leq S

, by adjusting the weights W in a feedforward manner. Feedforward techniques require the minimization of an appropriate objective function, and since we are dealing with probability distributions, the Kullback–Leibler divergence seems to be the best choice of objective function.

In this paper, our focus is on using relative information or entropy to estimate trait parameters in a Darwinian population dynamics model. The procedure we propose is similar to what occurs in a Boltzmann machine. The remainder of the paper is organized as follows: In Section 2, we make a brief overview of Fisher’s information theory, entropy and relative as it pertains to mathematical statistics. In Section 3, we discuss how to computationally estimate trait parameters of discrete Darwinian models in supervised and unsupervised environments by maximizing the relative information or Kullback–Leibler divergence. Finally, in Section 4, we make some concluding remarks.

2. Review of Information Theory

In this light review of information theory, we mention Fisher’s information only for self-containment sake. As for Fisher’s information and evolution dynamics, the reader can refer to [9] for a deeper analysis. For ease of analysis and comprehension, the assumptions and notations below are along the lines of that work.

2.1. Fisher’s Information Theory

Let

G (x, Θ)

be the probability density of a random variable X, continuous or discrete on an open set

X \times Ω \subset R \times R^{n}

. Here

Θ = (θ_{1}, θ_{2}, \dots, θ_{n})

is either a single parameter or a vector of parameters.

In the sequel, we consider the following assumptions on the function

G (x, Θ)

:

A₁ :: The support $\{x \in X : G (x, Θ) \neq 0\}$ of G is independent of $Θ$ .
A₂ :: $G (\cdot, Θ)$ is nonnegative and $G \in L^{1} (X)$ for all $Θ \in Ω$ .
A₃ :: $G (x, \cdot) \in C^{2} (Ω)$ , the set of continuously and twice differentiable functions of $Θ$ , for all $x \in X$ .

The first assumption is to discard from consideration the uniform distribution

G (x, θ) = \frac{1}{θ}

whose support is

(0, θ)

. The second and third assumptions allow the well-definiteness of

λ (x, θ) = \ln (G (x, θ))

, its first derivative or score function

g (x, Θ) = \nabla_{θ} λ (x, Θ)

and its second derivative

h (x, Θ) = \nabla_{Θ} g (x, Θ) = \nabla_{Θ}^{2} λ (x, Θ)

. In the sequel, the expected value of a random variable X is denoted as

E [X]

.

Definition 1.

Given a random variable X with density function

G (x, Θ)

satisfying

A_{1}

and

A_{2}

, Fisher’s information of X is defined as

I (Θ) = E_{X} [{(g (X, θ))}^{2}] = - E_{X} [h (X, Θ)] .

(1)

When

Θ

is vector of more than one coordinates, Fisher’s information is a symmetric positive definite (thus invertible) matrix

I (Θ) = {(I_{k l} (Θ))}_{1 \leq k, l \leq n}

, where

I_{k l} = E_{X} [\frac{\partial^{2} λ (X, Θ)}{\partial θ_{k} \partial θ_{l}}], for 1 \leq k, l \leq n .

(2)

Fisher’s information

I (Θ)

represents the amount of information contained in an estimator of

Θ

, given data X.

2.2. Entropy and Relative Entropy

Let us start by recalling the following definition of Shannon’s entropy and that of the Kullback–Leibler divergence or relative entropy.

Definition 2.

Let μ be a probability distribution defined on a sample space Ω. Then entropy of μ is given as

H (μ) = - E_{μ} [\ln [μ]] = \{\begin{matrix} - \sum_{θ \in Ω} μ (θ) \ln (μ (θ)), & if μ is discrete \\ - \int_{θ \in Ω} μ (θ) \ln (μ (θ)) d θ, & if μ is continuous \end{matrix} .

(3)

Definition 3.

Suppose μ and ν are two probability distributions defined on a sample space Ω. Then, the Kullback–Leibler divergence or relative information

D (μ, ν)

of μ relative to a fixed ν is given as

D_{K L} (μ, ν) = E_{μ} [\frac{\ln (μ)}{\ln (ν)}] = \{\begin{matrix} \sum_{θ \in Ω} μ (θ) \ln (\frac{μ (θ)}{ν (θ)}), & if μ and ν are discrete \\ \int_{θ \in Ω} μ (θ) \ln (\frac{μ (θ)}{ν (θ)}) d θ, & if μ and ν are continuous \end{matrix} .

(4)

There is an obvious connection between entropy and relative entropy:

D_{K L} (μ, ν) = - E_{μ} [\ln (ν)] - H (μ) .

We notice that when

μ = ν

, we have that

D_{K L} (μ, ν) = 0

. It is known that Fisher’s information induces a Riemannian metric ([10,11]) defined on a statistical manifold, that is, a smooth manifold whose points are probability measures defined on a probability space. It therefore represents the “informational” discrepancy between measurements. As such, it is related to the Kullback–Leibler divergence (KL), used typically to assess the difference between two probability distributions. Indeed, to see this, let

θ, θ_{0} \in Ω

. Let us use a second-order Taylor expansion in

μ

. Then we will obtain

\begin{matrix} D_{K L} (μ, ν) & = & D_{K L} (μ, ν) |_{μ (θ) = ν (θ)} \\ + & \nabla_{θ} (D_{K L} (μ, ν)) |_{μ (θ) = ν (θ)} + \frac{1}{2} {(θ - θ_{0})}^{T} H (θ) (θ - θ_{0}) + O ({(θ - θ_{0})}^{3}) . \end{matrix}

From the above observation, we have

D (μ, ν) |_{μ = ν} = 0

. We also have

\begin{matrix} \nabla_{θ} (D_{K L} (μ, ν)) & = & \frac{\partial}{\partial θ} E_{μ} [\frac{\ln (μ)}{\ln (ν)}] \\ = & E_{μ} [\frac{\partial}{\partial θ} (\frac{\ln (μ)}{\ln (ν)})] \\ = & E_{\partial μ} [\frac{\ln (μ)}{\ln (ν)}] + E_{ν} [\frac{\partial}{\partial θ} (\frac{μ}{ν})] . \end{matrix}

It follows that

\nabla_{θ} (D_{K L} (μ, ν)) |_{μ (θ) = ν (θ)} = 0

. We conclude by noticing that the Hessian matrix

H (θ) = \nabla_{θ}^{2} D_{K L} (μ, ν) |_{μ (θ) = ν (θ)} = E_{μ} [\frac{\partial^{2}}{\partial θ_{k} \partial θ_{l}} \ln (μ (θ))] = I (θ)

. Therefore, if

μ = μ (θ)

and

ν = μ (θ_{0})

are infinitesimally close to each other, that is,

μ (θ) = μ (θ_{0}) + \sum_{k} Δ θ_{k} \frac{\partial}{\partial θ_{k}} μ (θ) |_{θ = θ_{0}},

(5)

we have

D_{K L} (μ, ν) = \frac{1}{2} \sum_{k l} Δ θ_{k} Δ θ_{l} I_{k l} (θ) + O (Δ θ^{3}) .

(6)

There is an interesting discussion in the finite case in [12]. Indeed, suppose that

Ω

is a finite set with cardinality n. Let

Θ = (θ_{1}, θ_{2}, \dots, θ_{n})

and let

Λ (Ω)

be the set of all probability measures on

Ω

, and

S_{n} (Ω)

be the simplex

S_{n} (Ω) = \{Θ \in Ω : \sum_{i = 1}^{n} θ_{i} = 1\}

. There is a isometry

φ : Λ (Ω) \to S_{n} (Ω)

defined as

φ (μ) = (μ (1), μ (2), \dots, μ (n)) = (θ_{1}, θ_{2}, \dots, θ_{n})

. In this case, Fisher’s information components become

\begin{matrix} I_{k l} (θ) & = & E_{μ} [\frac{\partial^{2}}{\partial θ_{k} \partial θ_{l}} \ln (μ (θ))] = \sum_{j = 1}^{n} θ_{j} \frac{1}{θ_{k}} δ_{k j} \frac{1}{θ_{l}} δ_{l j} = \frac{1}{θ_{k}} δ_{k l}, \end{matrix}

where

δ_{k l}

is the Kronecker symbol. It follows that

\frac{\partial^{2} D (μ, μ_{0})}{\partial θ_{k} \partial θ_{l}} = \frac{1}{θ_{k}} δ_{k l} .

The interpretation of the latter is Fisher’s fundamental theorem (see [7]) and Kimura’s maximal principle (see [13]) in terms of Fisher information: natural selection forms a gradient with respect to an information measure, and hence locally has the direction of maximal information increase. The rate of change of the mean fitness of the population is given by the information variance.

Let us highlight some of these facts by calculating the KL divergence

D_{K L} (Θ_{1}, Θ_{t})

between distributions

ν = P (Θ_{1})

and

μ = P (Θ_{t})

of

Θ_{1}

and

Θ_{t}

, respectively, and

D_{K L} (Θ_{t}, Θ_{t + 1})

between the probability distributions

ν = P (Θ_{t})

and

μ = P (Θ_{t + 1})

of

Θ_{t}

and

Θ_{t + 1}

, respectively, for

t = 1, 2, \dots, n

. See Figure 1 below. We will use the same parameters as above with the same starting points for

θ

in the algorithm.

The minimization of

D_{K L} (Θ_{t}, Θ_{t + 1})

and the maximization of

D_{K L} (Θ_{1}, Θ_{t})

both occur when Fisher’s information is maximal or when the variance of the estimator is minimal. From a dynamical system perspective, this occurs when the fixed point

(e^{- 1}, e)

has been attained. This means that the KL divergence can be used to determine the critical points of a discrete Darwinian model. Therefore, there is a dichotomy between the problems of minimizing

D (Θ_{t}, Θ_{t + 1})

and maximizing

D_{K L} (Θ_{1}, Θ_{t})

. This amounts to matching the probability distributions

P (Θ_{t})

and

P (Θ_{t + 1})

as closely as possible. When this happens, we know from above that we will have

D_{K L} (Θ_{t}, Θ_{t + 1}) \approx 0

, which in turn will mean that the fixed points of the discrete Darwinian model have been attained. To that end, we will discuss how to accomplish this with machine learning approaches, namely under supervised and unsupervised learning.

3. Evolution Population Dynamics and Relative Information

3.1. Single Darwinian Population Model with Multiple Traits

Now suppose we are in the presence of one species with density x possessing n traits given by the vector

Θ = (θ_{1}, θ_{2}, \dots, θ_{n})

and a vector

U = (u_{1}, u_{2}, \dots, u_{n})

. We will consider the following:

(H₁): $b (Θ) = b_{0} \exp (- \sum_{i = 1}^{n} \frac{θ_{i}^{2}}{2 w_{i}^{2}})$ is the joint distribution of the independent traits $θ_{i}$ , each with mean 0 and variance $w_{i}^{2}$ .
(H₂): $c_{U} (Θ) = c_{0} \exp (- \sum_{i = 1}^{n} κ_{i} (θ_{i} - u_{i}))$ .
(H₃): The density of $x_{t}$ is given as $G (x, Θ, U) = b (Θ) \exp (- c_{U} (Θ) x)$ at $t = 1$ .

Under

H_{1} - H_{3}

, we will consider the discrete dynamical system

\{\begin{matrix} x_{t + 1} & = x_{t} G (x_{t}, Θ_{t}, U_{t}) \\ Θ_{t + 1} & = Θ_{t} + Σ g (x_{t}, Θ_{t}, U_{t}) \end{matrix},

(7)

where

Σ = (\begin{matrix} σ_{11} & σ_{12} & \dots & σ_{1 n} \\ σ_{21} & σ_{22} & \dots & σ_{2 n} \\ ⋮ & ⋮ & \dots & ⋮ \\ σ_{n 1} & σ_{n 2} & \dots & σ_{n n} \end{matrix}) .

(8)

Remark 1.

We note that in the context of Darwinian population dynamics,

G (x_{t}, Θ_{t}, U_{t})

is the population growth rate,

b (Θ)

represents the birth rate, while

c_{U} (Θ)

is the intra-specific competition function. Σ represents the covariance matrix of the distribution of traits among phenotypes of the species.

In this section, we propose two approaches to estimate the traits vector

Θ

in a Darwinian model, using relative information.

3.2. Supervised and Unsupervised Learning

Supervised and unsupervised learning are two types of machine learning techniques whose ultimate aim is to learn the best possible connections between a set of inputs and outputs. In supervised learning, the inputs and outputs are presented to a system that learns the best possible connections between inputs and outputs. In an unsupervised learning environment, only inputs are presented and the system learns and describes the best possible outputs that would be related to the inputs. Let us note that the distribution of

Θ_{t}

depends on

X_{t}, K = (κ_{1}, κ_{2}, \dots, κ_{n}), W = (w_{1}, w_{2}, \dots, w_{n})

, and

U = (u_{1}, u_{2}, \dots, u_{n})

.

In the sequel, we propose a machine learning approach for the estimation of trait vector

Θ

. We show in particular that depending on the amount of information available, machine learning approaches are possible. To that end, learning schemes for supervised and unsupervised learning are designed to estimate both W and K, which are the key parameters in the Darwinian model (7) proposed above. In fact, once the vectors W and K have been estimated, data available on

X_{t}

and the second equation in (7) are used to evaluate

Θ_{t}^{estimated}

, the estimated value of

Θ

. In supervised learning, the values of

Θ_{t}^{estimated}

should match closely with their sample counterparts

Θ_{t}^{samples}

. In unsupervised learning, in the absence of sample data

Θ_{t}^{samples}

, the estimated values

Θ_{t}^{estimated}

should serve as actual values for

Θ_{t}

, the vector of traits.

3.2.1. Supervised Learning

Let

n, T

and M be given positive integers. In supervised learning, we are given a sample of inputs/outputs sample in the form

\{(X_{t}^{(m)}, Θ_{t}^{(m)}), t = 1, 2 \dots, T; m = 1, 2 \dots, M\},

where each

Θ_{t}^{(m)}

is a

1 \times N

vector. This means that in this case, we already know the inputs that generate the solution of the system (7). The probability

P (X_{t} | Θ_{t}, W, K)

of

X_{t}

given

Θ_{t}, W, K

is

P (X_{t} | Θ_{t}, W, K) = b_{0} e^{- b (X_{t}, Θ_{t}, W, K)} .

(9)

Therefore, using known techniques of conditional probabilities, we have

P (Θ_{t} | X_{t}, W, K) = \frac{P (X_{t} | Θ_{t}, W, K) P (Θ_{t})}{Z},

(10)

where

Z = \sum_{Θ_{t}} P (X_{t} | Θ_{t}, W, K) P (Θ_{t}) .

Since we are trying to estimate traits, in supervised learning, the goal will be to determine how well

P (Θ_{t + 1} | X_{t + 1}, W, K)

matches

P (Θ_{t} | X_{t})

. Since these are distributions, we use the Kulback–Leibler (KL) divergence

\begin{matrix} D_{K L} (P (Θ_{t} | X_{t}), P (Θ_{t + 1} | X_{t + 1}, W, K)) = & - \sum_{θ_{t}} P (Θ_{t} | X_{t}) \ln (P (Θ_{t + 1} | X_{t + 1}, W, K)) - H (Θ_{t} | X_{t}), \end{matrix}

(11)

where

H (Θ_{t} | X_{t}) = - \sum_{θ_{t}} P (Θ_{t} | X_{t}) \ln (P (Θ_{t} | X_{t}))

is the entropy of the distribution of

Θ_{t} | X_{t}

. At the sample level for given M samples, the average of KL divergence is given as

〈 D_{K L} (P (Θ_{t} | X_{t}), P (Θ_{t + 1} | X_{t + 1}, W, K)) 〉 = - \frac{1}{M} [\sum_{m = 1}^{M} \ln (P (Θ_{t + 1}^{(m)} | X_{t + 1}^{(m)}, W, K)) + H [Θ_{t}^{(m)} | X_{t}^{(m)}]] .

(12)

We minimize the quantity on the right-hand side of (12), using a stochastic gradient descent scheme, by updating the weight vectors W and K, respectively. Hence, the following result on supervised learning.

Theorem 1.

Suppose that

X_{t}, W

, and K are as above and that we have input/output sample

\{(X_{t}^{(m)}, Θ_{t}^{(m)}), t = 1, 2, \dots, T, m = 1, 2, \dots, M\}

. The minimization process of the KL divergence can be achieved with a classic gradient ascent algorithm for weights W and K with an update scheme given as

\{\begin{matrix} w_{l, n e w} & = w_{l, o l d} + α_{W} \frac{\partial \ln (P (Θ_{t + 1}^{(m)} | X_{t + 1}^{(m)}, W, K))}{\partial w_{l}} \\ κ_{l, n e w} & = κ_{l, o l d} + α_{K} \frac{\partial \ln (P (Θ_{t + 1}^{(m)} | X_{t + 1}^{(m)}, W, K))}{\partial κ_{l}} \end{matrix},

(13)

where

α_{w}, α_{κ} > 0

are the learning rates of W and K, respectively, and

\frac{\partial \ln (P (Θ_{t + 1}^{(m)} | X_{t + 1}^{(m)}, W, K))}{\partial x}

for

x = w_{l}, κ_{l}

is given in the Appendix A.

Corollary 1.

Under the assumptions of Theorem 1 above, the minimization process of the KL divergence can be achieved more efficiently with a stochastic gradient ascent algorithm for weights W and K with an update scheme given as

\{\begin{matrix} w_{l, n e w} & = w_{l, o l d} - α_{W} [\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t + 1}^{(m)}, W, K)}{\partial w_{l}} - \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial w_{l}}] \\ κ_{l, n e w} & = κ_{l, o l d} - α_{K} [\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t + 1}^{(m)}, W, K)}{\partial κ_{l}} - \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial κ_{l}}] \end{matrix} .

(14)

where

α_{w}, α_{κ} > 0

are the learning rates of W and K, respectively.

Remark 2.

One could use Jeffrey’s prior as the probability distribution

P (Θ_{t})

of

Θ_{t}

if no other information about its distribution is given. In this case,

P (Θ_{t}) \propto \frac{1}{\sqrt{d e t (I (Θ_{t}))}}

, where

d e t (A)

is the determinant of the matrix A. We discussed in [9] that Fisher’s information matrix for such a system depends on both W and K. This means that a Jeffrey’s prior is advisable only for algorithm initialization; otherwise, from Equation (10), the updating scheme would have to be changed according to the dependence of

P (Θ_{t})

on both W and K. In fact, in this case, one would need to consider the respective partial derivatives

\frac{\partial \ln (P (θ_{t}))}{\partial x} = - \frac{1}{2} \frac{\frac{\partial}{\partial x} d e t (I (θ_{t}))}{d e t (I (θ_{t}))}

, for

x = w_{l}, κ_{l}

, which significantly increases the complexity of the problem.

3.2.2. Unsupervised Learning

Now, we assume that

X_{t}, W

, and K are as above and only a sample of inputs

\{X_{t}^{(m)}, m \in \{1, 2, \dots, M\}\}

is given. Since

Θ_{t}^{(m)}

is not given, we cannot use the learning scheme above. We have to design a method that only depends on

X_{t}^{(m)}, W

and K. We assume that the joint probability

P (X_{t}, Θ_{t}, W, K)

of

X_{t}

and

Θ_{t}

is

P (X_{t}, Θ_{t}, W, K) = \frac{1}{Z} e^{- b (X_{t}, Θ_{t}, W, K)},

(15)

where Z is the constant

Z = \sum_{X_{t}, Θ_{t}} e^{- b (X_{t}, Θ_{t}, W, K)}

.

In unsupervised learning, we would like to determine how well

P (X_{t + 1}, W, K)

matches

P (X_{t})

. Since these are distributions, we use the Kulback–Leibler (KL) divergence

\begin{matrix} D_{K L} (P (X_{t}), P (X_{t + 1}, W, K)) = & - \sum_{X_{t}} P ((X_{t})) \ln (P (X_{t + 1}, W, K)) - H (X_{t}), \end{matrix}

(16)

where

H (X_{t}) = - \sum_{X_{t}} P (X_{t}) \ln (P (X_{t}))

is the entropy of the distribution of

X_{t}

. At the sample level, and for given M samples from the random variable

X_{t}

, the average of KL divergence is given as

〈 D_{K L} (P (X_{t}), P (X_{t + 1}, W, K)) 〉 = - \frac{1}{M} [\sum_{m = 1}^{M} \ln (P (X_{t + 1}^{(m)}, W, K)) + H (X_{t}^{(m)})] .

(17)

Similarly, as above, we minimize the quantity on the right-hand side of (17), using a stochastic gradient descent scheme, by updating the weight vectors W and K, respectively. Hence, the following result is on unsupervised learning.

Theorem 2.

Suppose that

X_{t}, W

, and K are as above and that we have input/output sample

\{X_{t}^{(m)}, t = 1, 2, \dots, T, m = 1, 2, \dots, M\}

. The minimization process of the KL divergence can be achieved with a classic gradient ascent algorithm for weights W and K with an update scheme given as

\{\begin{matrix} w_{l, n e w} & = w_{l, o l d} + α_{W} \frac{\partial \ln (P (X_{t + 1}^{(m)}, W, K))}{\partial w_{l}} \\ κ_{l, n e w} & = κ_{l, o l d} + α_{K} \frac{\partial \ln (P (X_{t + 1}^{(m)}, W, K))}{\partial κ_{l}} \end{matrix} .

(18)

where

α_{w}, α_{κ} > 0

are the learning rates of W and K, respectively, and

\frac{\partial \ln (P (X_{t + 1}^{(m)}, W, K))}{\partial x}

, for

x = w_{l}, κ_{l}

is given in the Appendix A.

Corollary 2.

Under the hypotheses of Theorem 2 above, the minimization process of the KL divergence can be achieved more efficiently with a stochastic gradient ascent algorithm for weights W and K with an update scheme given as

\{\begin{matrix} w_{l, n e w} & = w_{l, o l d} - α_{W} [\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial w_{l}} - \frac{\partial b (X_{t}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial w_{l}}] \\ κ_{l, n e w} & = κ_{l, o l d} - α_{K} [\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial κ_{l}} - \frac{\partial b (X_{t}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial κ_{l}}] \end{matrix},

(19)

where

α_{w}, α_{κ} > 0

are the learning rates of W and K, respectively.

Remark 3.

1.: We observe that since $D_{K L} (P (Θ_{t} | X_{t}), P (Θ_{t + 1} | X_{t + 1}, W, K))$ is an expectation with respect to $Θ_{t} | X_{t}$ (see Definition 3 above), then the right-hand-side of Equation (12) is a Riemann sum for $D_{K L} (P (Θ_{t} | X_{t}), P (Θ_{t + 1} | X_{t + 1}, W, K))$ .
2.: Similarly, $D_{K L} (P (X_{t}, P (X_{t + 1}, W, K))$ is an expectation with respect to the random variable $X_{t}$ , therefore the right-hand-side of Equation (17) is one of its Riemann sum.

Remark 4.

We note that there is a subtle difference between Equations (14) and (19), but an important one. In Equation (14), all the data are available for training, therefore all quantities can be calculated. In Equation (19), we are not given sample data

Θ_{t}^{(m)}

, however, we can use the Darwinian difference Equation (7) to evaluate both

\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial x}

and

\frac{\partial b (X_{t}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial x_{l}}

.

3.3. Simulations

In this simulation, we generate the data from system (7) as follows: we choose

T = 200, M = 350, N = 2, c_{0} = 0.1, b_{0} = 1.5

and the vectors

W \sim 100 \times U n i f (0.5, 1)

,

K \sim U n i f (0.1), U \sim U n i f (0, 10)

and

Θ = (1, 2, 3)

. From this we generate the samples

\{(X_{t}^{(m)}, Θ_{t}^{(m)}), t = 1, \dots, T, m = 1, \dots, M\} .

In Figure 2, we start by illustrating of the generated data, where the first column represents the dynamics of

X_{t}

and the second column represents the dynamics of

Θ_{t} = (θ_{1 t}, θ_{2 t})

. To evaluate the accuracy of the method in supervised learning, we evaluate the percentage of correct calculation of vector parameters W and K as

Correct Percentage = \frac{# \{∥{Parameter}^{true} - {Parameter}^{estimated}∥ \leq η\}}{# of samples},

for a threshold parameter

η = 0.001

, where

∥\cdot∥

represents a norm in

R^{N}

. In Figure 3 below, we show the percentage of correct calculation for the given data under supervised learning.

Remark 5.

Let us observe that the learning techniques proposed above are similar to those of a Boltzmann machine. Indeed, we have

\begin{matrix} G (X, Θ) & = & \frac{1}{b_{0}} e^{- b (X, Θ, W, K)} \\ = & \frac{1}{b_{0}} e x p (- \sum_{i = 1}^{n} \frac{θ_{i}^{2}}{2 w_{i}^{2}} - c_{0} X e x p (- \sum_{i = 1}^{n} κ_{i} (θ_{i} - u_{i}))) \\ = & \frac{1}{b_{0}} e x p (- \frac{1}{2} Θ^{T} W Θ - Θ^{T} Z X) \\ = & \frac{1}{b_{0}} e x p (- E (X, Θ)), \end{matrix}

where

E (X, Θ) = \frac{1}{2} Θ^{T} W Θ + Θ^{T} Z

,

Θ = (θ_{1}, θ_{2}, \dots, θ_{n}), i = 1, 2, \dots, n

, and W is the diagonal matrix

D i a g (\frac{1}{w_{1}^{2}}, \frac{1}{w_{2}^{2}}, \dots, \frac{1}{w_{n}^{2}})

and Z is a matrix such that the threshold value is

Θ^{T} Z X = c_{0} X e x p (- \sum_{i = 1}^{n} κ_{i} (θ_{i} - u_{i}))

. Written this way,

G (X, Θ) = \frac{1}{b_{0}} e^{- E (X, Θ)}

can be seen as an energy function along the lines of a Boltzmann machine where the constant

b_{0}

can be chosen such that

b_{0} = \sum_{X, Θ} e x p (- E (X, Θ))

, see for instance [14]. However, with

G (X, Θ)

as above, the Darwinian system is strictly not a Boltzmann machine since the diagonal terms of the matrix W are nonzero and the off-diagonal terms are zero. In fact, in a Boltzmann machine, the opposite is true, that is, the diagonal terms are zero and the off-diagonal terms are nonzero.

4. Conclusions

In this paper, we have shown how to estimate trait parameters in a Darwinian evolution dynamics model with one species and multiple traits under supervised and unsupervised learning. The procedure can be implemented using a regular gradient or a stochastic gradient descent. We have shown the similarity between the proposed procedure and Boltzmann machine learning, even though the type of energy function is not the same as that of a Boltzmann machine in the strictest sense of the term. The techniques proposed in this paper could certainly be adaptable to readily available data. This is to say, this is a proof of concept meant to kickstart the conversation on how to bring modern techniques of estimation to important problems of evolution theory with much-needed mathematical rigor.

Funding

This research was funded by The American Mathematical Society and Simmons Foundation, grant number AMS-SIMMONS-PUI-23028GR.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Acknowledgments

I would like to acknowledge Cleves Epoh Nsali Ewonga for invaluable comments that enhanced the quality of this manuscript.

Conflicts of Interest

The author declares no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Appendix A.1. Proof of Theorem 1

Proof.

We recall that

P (X_{t} | Θ_{t}, W, K) = \frac{1}{b_{0}} e^{- b (X_{t}, Θ_{t}, W, K)} .

(A1)

Therefore, using the definition of condition probability, we have from Equation (A1) that

P (Θ_{t} | X_{t}, W, K) = \frac{P (X_{t} | Θ_{t}, W, K) P (Θ_{t})}{Z},

(A2)

where

Z = \sum_{Θ_{t} = θ_{t}} P (X_{t} | Θ_{t} = θ_{t}, W, K) P (Θ_{t} = θ_{t}) .

For simplicity, we write

Z = \sum_{Θ_{t}} P (X_{t} | Θ_{t}, W, K) P (Θ_{t})

.

We recall that Z is also the marginal

P (X_{t}, W, K) = \sum_{Θ_{t}} P (X_{t} | Θ_{t}, W, K) P (Θ_{t})

.

Therefore, it follows that

\ln (P (Θ_{t} | X_{t}, W, K)) = \ln (P (X_{t} | Θ_{t}, W, K)) + \ln (P (Θ_{t})) - \ln (\sum_{Θ_{t}} P (X_{t} | Θ_{t}, W, K) P (Θ_{t})) .

(A3)

Now, we fix

Θ_{t} = Θ_{*}

and

X_{t} = X_{*}

. For a given

l = 1, 2, \dots, n

and

x = w_{l}

or

x = κ_{l}

, it follows from (A3) that

\frac{\partial \ln (P (Θ_{*} | X_{*}, W, K))}{\partial x} = \frac{\frac{\partial P (X_{*} | Θ_{*}, W, K)}{\partial x}}{P (X_{*} | Θ_{*} |, W, K)} - \frac{\sum_{Θ_{t}} \frac{\partial}{\partial x} P (X_{*} | Θ_{t}, W, K) P (Θ_{t})}{\sum_{Θ_{t}} P (X_{*} | Θ_{t}, W, K) P (Θ_{t})} .

(A4)

From Equation (A1), we obtain

\frac{\partial P (X_{*} | Θ_{*}, W, K)}{\partial x} = - \frac{\partial b (X_{*}, Θ_{*}, W, K)}{\partial x} P (X_{*} | Θ_{*}, W, K) .

(A5)

The latter implies that

\frac{\frac{\partial P (X_{*} | Θ_{*}, W, K)}{\partial x}}{P (X_{*} | Θ_{*} |, W, K)} = - \frac{\partial b (X_{*}, Θ_{*}, W, K)}{\partial x} .

(A6)

And also that

\sum_{Θ_{t}} \frac{\partial}{\partial x} P (X_{*} | Θ_{t}, W, K) P (Θ_{t}) = - \sum_{Θ_{t}} \frac{\partial b (X_{*}, Θ_{t}, W, K)}{\partial x} P (X_{*} | Θ_{t}, W, K) P (Θ_{t}) .

(A7)

Using the definition of conditional probability, marginal, and Equation (A7), we have

\begin{matrix} \frac{\sum_{Θ_{t}} \frac{\partial}{\partial x} P (X_{*} | Θ_{t}, W, K) P (Θ_{t})}{\sum_{Θ_{t}} P (X_{*} | Θ_{t}, W, K) P (Θ_{t})} & = & \sum_{Θ_{t}} \frac{\sum_{Θ_{t}} \frac{\partial}{\partial x} P (X_{*} | Θ_{t}, W, K) P (Θ_{t})}{\sum_{Θ_{t}} P (X_{*} | Θ_{t}, W, K) P (Θ_{t})} \\ = & - \sum_{Θ_{t}} \frac{\partial b (X_{*}, Θ_{t}, W, K)}{\partial x} \frac{P (X_{*} | Θ_{t}, W, K) P (Θ_{t})}{P (X_{*}, W, K)} \\ = & - \sum_{Θ_{t}} \frac{\partial b (X_{*}, Θ_{t}, W, K)}{\partial x} \frac{P (X_{*}, Θ_{t}, W, K)}{P (X_{*}, W, K)} \\ = & - \sum_{Θ_{t}} \frac{\partial b (X_{*}, Θ_{t}, W, K)}{\partial x} P (Θ_{t} | X_{*}, W, K) . \end{matrix}

Consequently, Equation (A4) becomes

\frac{\partial \ln (P (Θ_{*} | X_{*}, W, K))}{\partial x} = - \frac{\partial b (X_{*}, Θ_{*}, W, K)}{\partial x} + \sum_{Θ_{t}} \frac{\partial b (X_{*}, Θ_{t}, W, K)}{\partial x} P (Θ_{t} | X_{*}, W, K) .

(A8)

To summarize, respectively, for

x = w_{l}

and

w = κ_{l}

, we have

\begin{matrix} \frac{\partial \ln (P (Θ_{t + 1}^{(m)} | X_{t + 1}^{(m)}, W, K))}{\partial w_{l}} & = - \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t + 1}^{(m)}, W, K)}{\partial w_{l}} + \sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial w_{l}} P (Θ_{t} | X_{t + 1}^{(m)}, W, K) \\ \frac{\partial \ln (P (Θ_{t + 1}^{(m)} | X_{t + 1}^{(m)}, W, K))}{\partial κ_{l}} & = - \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t + 1}^{(m)}, W, K)}{\partial κ_{l}} + \sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial κ_{l}} P (Θ_{t} | X_{t + 1}^{(m)}, W, K) . \end{matrix}

(A9)

A classical gradient procedure will update the weights W and K by moving oppositely to the gradient of the average sample KL divergence, for selected learning rates

α_{W}

and

α_{K}

, as follows:

\{\begin{matrix} w_{l, n e w} & = w_{l, o l d} + α_{W} \frac{\partial \ln (P (Θ_{t + 1}^{(m)} | X_{t + 1}^{(m)}, W, K))}{\partial w_{l}} \\ κ_{l, n e w} & = κ_{l, o l d} + α_{K} \frac{\partial \ln (P (Θ_{t + 1}^{(m)} | X_{t + 1}^{(m)}, W, K))}{\partial κ_{l}} \end{matrix} .

(A10)

Since from Equation (A9), we have to run through all values of

Θ_{t}

, this procedure may not be suitable for large networks. With a stochastic gradient procedure, the quantity

\sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial x} P (Θ_{t} | X_{t + 1}^{(m)}, W, K)

is replaced with a single realization

\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial x}

. Hence, we obtain a stochastic gradient scheme

\{\begin{matrix} w_{l, n e w} & = w_{l, o l d} - α_{W} [\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t + 1}^{(m)}, W, K)}{\partial w_{l}} - \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial w_{l}}] \\ κ_{l, n e w} & = κ_{l, o l d} - α_{K} [\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t + 1}^{(m)}, W, K)}{\partial κ_{l}} - \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial κ_{l}}] \end{matrix},

(A11)

for carefully chosen constants

α_{w}

and

α_{K}

. □

Appendix A.2. Proof of Theorem 2

Proof.

The joint probability distribution

P (X_{t}, Θ_{t}, W, K)

of

X_{t}

and

Θ_{t}

given

W, K

is

P (X_{t}, Θ_{t}, W, K) = \frac{1}{Z} e^{- b (X_{t}, Θ_{t}, W, K)},

(A12)

where Z is the constant

Z = \sum_{X_{t}, Θ_{t}} e^{- b (X_{t}, Θ_{t}, W, K)}

.

We also have the marginal of

X_{t}

given W and K as

P (X_{t}, W, K) = \sum_{Θ_{t}} P (X_{t}, Θ_{t}, W, K),

(A13)

and the condition probability

P (Θ_{t} | X_{t}, W, K) = \frac{P (X_{t}, Θ_{t}, W, K)}{P (X_{t}, W, K)},

(A14)

Now, we fix

X_{t} = X_{*}

. Let

l = 1, 2, \dots, n

and

x = w_{l}

or

x = κ_{l}

. From Equation (A13) and logarithm differentiation, we have that

\frac{\partial}{\partial x} \ln (P (X_{*}, W, K)) = \frac{\sum_{Θ_{t}} \frac{\partial}{\partial x} P (X_{*}, Θ_{t}, W, K)}{P (X_{*}, W, K)}

(A15)

Using the quotient rule in Equation (A12), we obtain

\begin{matrix} \frac{\partial}{\partial x} P (X_{*}, Θ_{t}, W, K) & = & \frac{\partial}{\partial x} (\frac{e^{- b (X_{*}, Θ_{t}, W, K)}}{Z}) \\ = & \frac{\frac{\partial}{\partial x} e^{- b (X_{*}, Θ_{t}, W, K)} Z}{Z^{2}} \\ - & \frac{e^{- b (X_{*}, Θ_{t}, W, K)} (\sum_{X_{t}, Θ_{t}} \frac{\partial}{\partial x} e^{- b (X_{t}, Θ_{t}, W, K)})}{Z^{2}} \\ = & - \frac{\partial b (X_{*}, Θ_{t}, W, K)}{\partial x} \frac{e^{- b (X_{*}, Θ_{t}, W, K)}}{Z} \\ + & \frac{e^{- b (X_{*}, Θ_{t}, W, K)}}{Z} \frac{(\sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial x} e^{- b (X_{t}, Θ_{t}, W, K)})}{Z} \\ = & - \frac{\partial b (X_{*}, Θ_{t}, W, K)}{\partial x} P (X_{*}, Θ_{t}, W, K) \\ + & P (X_{*}, Θ_{t}, W, K) (\sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial x} P (X_{t}, Θ_{t}, W, K)) \end{matrix}

It follows from the above and Equation (A15) that

\begin{matrix} \frac{\partial}{\partial x} \ln (P (X_{t + 1}^{(m)}, W, K)) & = & \frac{\sum_{Θ_{t}} \frac{\partial}{\partial x} P (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{P (X_{t + 1}^{(m)}, W, K)} \\ = & \sum_{Θ_{t}} \frac{\frac{\partial}{\partial x} P (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{P (X_{t + 1}^{(m)}, W, K)} \\ = & - \sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial x} \frac{P (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{P (X_{t + 1}^{(m)}, W, K)} \\ + & \sum_{Θ_{t}} \frac{P (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{P (X_{t + 1}^{(m)}, W, K)} (\sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial x} P (X_{t}, Θ_{t}, W, K)) \\ = & - \sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial x} P (Θ_{t} | X_{t + 1}^{(m)}, Θ_{t}, W, K) \\ + & \frac{\sum_{Θ_{t}} P (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{P (X_{t + 1}^{(m)}, W, K)} (\sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial x} P (X_{t}, Θ_{t}, W, K)) \\ = & - \sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial x} P (Θ_{t} | X_{t + 1}^{(m)}, Θ_{t}, W, K) \\ + & \frac{P (X_{t + 1}^{(m)}, W, K)}{P (X_{t + 1}^{(m)}, W, K)} (\sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial x} P (X_{t}, Θ_{t}, W, K)) \\ = & - \sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial x} P (Θ_{t} | X_{t + 1}^{(m)}, Θ_{t}, W, K) \\ + & \sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial x} P (X_{t}, Θ_{t}, W, K) . \end{matrix}

To summarize, respectively, for

x = w_{l}

and

w = κ_{l}

, we have

\{\begin{matrix} \frac{\partial}{\partial w_{l}} \ln (P (X_{t + 1}^{(m)}, W, K)) & = - \sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial w_{l}} P (Θ_{t} | X_{t + 1}^{(m)}, Θ_{t}, W, K) \\ + \sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial w_{l}} P (X_{t}, Θ_{t}, W, K) \\ \frac{\partial}{\partial κ_{l}} \ln (P (X_{t + 1}^{(m)}, W, K)) & = - \sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial κ_{l}} P (Θ_{t} | X_{t + 1}^{(m)}, Θ_{t}, W, K) \\ + \sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial κ_{l}} P (X_{t}, Θ_{t}, W, K) \end{matrix}

(A16)

A classical gradient procedure will update the weights W and K by moving oppositely to the gradient of the average sample KL divergence, for selected learning rates

α_{W}

and

α_{K}

, as follows:

\{\begin{matrix} w_{l, n e w} & = w_{l, o l d} + α_{W} \frac{\partial \ln (P (X_{t + 1}^{(m)}, W, K))}{\partial w_{l}} \\ κ_{l, n e w} & = κ_{l, o l d} + α_{K} \frac{\partial \ln (P (X_{t + 1}^{(m)}, W, K))}{\partial κ_{l}} \end{matrix} .

(A17)

The classical approach requires to run through all values of

X_{t}

and

Θ_{t}

, again not very suitable for large networks. With a stochastic gradient procedure, the quantity

\sum_{Θ_{t}} \frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}, W, K)}{\partial x} P (Θ_{t} | X_{t + 1}^{(m)}, W, K)

is replaced with a single random realization

\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial x}

, whereas the quantity

\sum_{X_{t}, Θ_{t}} \frac{\partial b (X_{t}, Θ_{t}, W, K)}{\partial κ_{l}} P (X_{t}, Θ_{t}, W, K)

is replace with a single random realization

\frac{\partial b (X_{t}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial x}

. Hence, we obtain a stochastic gradient scheme

\{\begin{matrix} w_{l, n e w} & = w_{l, o l d} - α_{W} [\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial w_{l}} - \frac{\partial b (X_{t}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial w_{l}}] \\ κ_{l, n e w} & = κ_{l, o l d} - α_{K} [\frac{\partial b (X_{t + 1}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial κ_{l}} - \frac{\partial b (X_{t}^{(m)}, Θ_{t}^{(m)}, W, K)}{\partial κ_{l}}] \end{matrix},

(A18)

for carefully chosen constants

α_{w}

and

α_{K}

. □

References

Darwin, C. On the Origin of Species by Means of Natural Selection; John Murray: London, UK, 1859. [Google Scholar]
Vincent, T.L.; Vincent, T.L.S.; Cohen, Y. Darwinian dynamics and evolutionary game theory. J. Biol. Dyn. 2011, 5, 215–226. [Google Scholar] [CrossRef]
Ackleh, A.S.; Cushing, J.M.; Salceneau, P.L. On the dynamics of evolutionary competition models. Nat. Resour. Model. 2015, 28, 380–397. [Google Scholar] [CrossRef]
Cushing, J.M. Difference equations as models of evolutionary population dynamics. J. Biol. Dyn. 2019, 13, 103–127. [Google Scholar] [CrossRef] [PubMed]
Cushing, J.M.; Park, J.; Farrell, A.; Chitnis, N. Treatment of outocme in an si model with evolutionary resistance: A darwinian model for the evolutionary resistance. J. Biol. Dyn. 2023, 17, 2255061. [Google Scholar] [CrossRef] [PubMed]
Elaydi, S.; Kang, Y.; Luis, R. The effects of evolution on the stability of competing species. J. Biol. Dyn. 2022, 16, 816–839. [Google Scholar] [CrossRef] [PubMed]
Fisher, R.A. On the mathematical foundation of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. 1922, 222, 594–604. [Google Scholar]
Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 623–656. [Google Scholar] [CrossRef]
Kwessi, E. Information theory in a darwinian evolution population dynamics model. arXiv 2024, arXiv:2403.05044. [Google Scholar]
Shun-ich, A.; Horishi, N. Methods of Information Geometry; chapter Chentsov Theorem and Some Historical Remarks; Oxford University Press: Oxford, UK, 2000; pp. 37–40. [Google Scholar]
Dowty, J.G. Chentsov theorem for exponential families. Inf. Geom. 2018, 1, 117–135. [Google Scholar] [CrossRef]
Harper, M. Information geometry and evolution game theory. arXiv 2009, arXiv:0911.1383. [Google Scholar] [CrossRef]
Kimura, M. On the change of population fitness by natural selection. Heredity 1958, 12, 145–167. [Google Scholar] [CrossRef]
Hinton, G.E.; Sejnowski, T.J. Optimal perceptual inference. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 19–23 June 1983; pp. 448–453. [Google Scholar]

Figure 1. In (a)

D (Θ_{t}, Θ_{t + 1})

; (b)

D (Θ_{1}, Θ_{t + 1})

for

t = 1, 2, \dots, 15

. In the (a), convergence to zero means that the distributions of

Θ_{t}

and

Θ_{t + 1}

are similar. In (b), convergence to about 8.7 means that the distributions of

Θ_{t}

are no more different from each other, making their KL divergence from the distribution of

Θ_{1}

constant. In both cases, this is an indication that Fisher’s information has been maximized or that the variances of the estimates are the same and very small.

Figure 1. In (a)

D (Θ_{t}, Θ_{t + 1})

; (b)

D (Θ_{1}, Θ_{t + 1})

for

t = 1, 2, \dots, 15

. In the (a), convergence to zero means that the distributions of

Θ_{t}

and

Θ_{t + 1}

are similar. In (b), convergence to about 8.7 means that the distributions of

Θ_{t}

are no more different from each other, making their KL divergence from the distribution of

Θ_{1}

constant. In both cases, this is an indication that Fisher’s information has been maximized or that the variances of the estimates are the same and very small.

Figure 2. In the left column of this figure, represented is

X_{t}^{(m)}

for

m = 1, 2, 3

and

t = 1, 2 \dots, 200

. The right column represents the samples

θ_{1 t}

and

θ_{2 t}

respectively. From a dynamical system point of view, this is a situation where there is no fixed point

(x, Θ)

even though there is a fixed point for

x_{t}

.

Figure 2. In the left column of this figure, represented is

X_{t}^{(m)}

for

m = 1, 2, 3

and

t = 1, 2 \dots, 200

. The right column represents the samples

θ_{1 t}

and

θ_{2 t}

respectively. From a dynamical system point of view, this is a situation where there is no fixed point

(x, Θ)

even though there is a fixed point for

x_{t}

.

Figure 3. (a) Percentage of correct learning for the vector W as a function of the number of samples used. The fluctuations are normal and are due to the random selection of samples at each iteration. We observe that despite fluctuating, the percentage of correct learning increases with the number of samples and hovers around 75%. This is quite remarkable given that all the system’s parameters are randomly reshuffled at each epoch. (b) Percentage of correct learning for the vector K. The percentage of correct learning for K also hovers around

75 %

which shows in general that supervised learning can be quite effective at learning the key parameters in a Darwinian evolution dynamics model.

Figure 3. (a) Percentage of correct learning for the vector W as a function of the number of samples used. The fluctuations are normal and are due to the random selection of samples at each iteration. We observe that despite fluctuating, the percentage of correct learning increases with the number of samples and hovers around 75%. This is quite remarkable given that all the system’s parameters are randomly reshuffled at each epoch. (b) Percentage of correct learning for the vector K. The percentage of correct learning for K also hovers around

75 %

which shows in general that supervised learning can be quite effective at learning the key parameters in a Darwinian evolution dynamics model.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kwessi, E. On Using Relative Information to Estimate Traits in a Darwinian Evolution Population Dynamics. Axioms 2024, 13, 406. https://doi.org/10.3390/axioms13060406

AMA Style

Kwessi E. On Using Relative Information to Estimate Traits in a Darwinian Evolution Population Dynamics. Axioms. 2024; 13(6):406. https://doi.org/10.3390/axioms13060406

Chicago/Turabian Style

Kwessi, Eddy. 2024. "On Using Relative Information to Estimate Traits in a Darwinian Evolution Population Dynamics" Axioms 13, no. 6: 406. https://doi.org/10.3390/axioms13060406

APA Style

Kwessi, E. (2024). On Using Relative Information to Estimate Traits in a Darwinian Evolution Population Dynamics. Axioms, 13(6), 406. https://doi.org/10.3390/axioms13060406

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

On Using Relative Information to Estimate Traits in a Darwinian Evolution Population Dynamics

Abstract

1. Introduction

2. Review of Information Theory

2.1. Fisher’s Information Theory

2.2. Entropy and Relative Entropy

3. Evolution Population Dynamics and Relative Information

3.1. Single Darwinian Population Model with Multiple Traits

3.2. Supervised and Unsupervised Learning

3.2.1. Supervised Learning

3.2.2. Unsupervised Learning

3.3. Simulations

4. Conclusions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A

Appendix A.1. Proof of Theorem 1

Appendix A.2. Proof of Theorem 2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI