Entropic Dynamics of Mutations in SARS-CoV-2 Genomic Sequences

In this paper, we investigate a certain class of mutations in genomic sequences by studying the evolution of the entropy and relative entropy associated with the base frequencies of a given genomic sequence. Even if the method is, in principle, applicable to every sequence which varies randomly, the case of SARS-CoV-2 RNA genome is particularly interesting to analyze, due to the richness of the available sequence database containing more than a million sequences. Our model is able to track known features of the mutation dynamics like the Cytosine–Thymine bias, but also to reveal new features of the virus mutation dynamics. We show that these new findings can be studied using an approach that combines the mean field approximation of a Markov dynamics within a stochastic thermodynamics framework.


Introduction
The sudden outburst in 2019 of the COVID-19 pandemic has generated a prompt and powerful reaction in the scientific and political community to fight against the worldwide menace represented by the virus ( [1]).One of the first actions undertaken was the deployment of a large genome sequencing effort, which has generated a very large database (about 10 6 sequence as of September 2023) of SARS-CoV-2 sequences in a short timespan.This unprecedented data richness, along with the certain identification of the ancestral virus sequence, has allowed scientists to undertake a detailed scrutiny of the virus evolution in the human host population.The main effort in the genetic research has been axed on functional domain analysis to identify regions in the sequence which are related to protein formation and thus responsible of key virus characteristics such as spreading speed or sensitivity to vaccine or drug treatments.
In this paper, we take a different approach, which can offer a complementary viewpoint on the dynamic of the virus mutation mechanism.In this study, we use the National Center for Biotechnology Information (NCBI, www.ncbi.nlm.nih.gov)database.We downloaded all of the complete RNA sequences with no unknown characters and with the same length (29,903 characters) of the Wuhan reference sequence classified as NC045512.2 in NCBI database.There were about 5600 sequences meeting the above criteria at the retrieval date of February 2023.These constitute the dataset for this study.Moreover, we reduce the high complexity of the nearly 30,000-base-long genomic viral sequences to the study of the four-dimensional probability vector p = (p A , p C , p G , p T ) of the A, C, G, T base frequencies.
In the following, we speak of entropy of a sequence, intending the Shannon entropy of the associated base frequency vector (p or q) for the sake of simplicity.We can then compute the entropy of the most ancient known virus sequence (the one found in Wuhan, China), which we denote with h(q); the entropy h(p) of (any of) the mutated sequences; and the relative entropy D(p|q) between the reference sequence and a mutated one.See, e.g., [2] for a gentle introduction to these notions of Information Theory.
The choice of the Shannon entropy as a statistical indicator of a sequence implies that two sequences that differ in a simple permutation in the bases are indistinguishable; moreover, their relative entropy is zero.Therefore, with the approach chosen in this study, only an accumulation of mutations that changes the base frequency is appreciable.We are aware that this is a drastic simplification of the actual mutation mechanism; nevertheless, we find that such a simple model of mutations is capable to reveal new and unexpected features of the mutation dynamics.Here is the plan of the paper.
In Section 2, we show that the accumulation of mutations in the sequence decreases the entropy of the sequence with respect to the ancestral one.This means that the mutations necessarily further increase the original unbalance of the proportions between the bases of the reference sequence (q C < q G < q A < q T ), enhancing the unbalance p T > p C and p A > p G , a phenomenon already reported in the literature (see, e.g., [3][4][5]), the so-called C → T bias.We find that the decrease in entropy has a analytically computable lower bound, called the minimal entropy curve, which is tight for many of the sequences in the dataset.
In Section 3, we investigate the dynamic of mutations introducing a simple Markovian model, which is used in population dynamic studies and which is akin to the classical Eherenfest urn model in statistical thermodynamics.We compute the mean field approximation of the Markovian dynamics, which gives a master equation type equation.We then compare the evolution of the entropy along the mean field solution with the minimal entropy curve.Note that, unlike the theoretical minimum entropy curve, which is based only on the knowledge of q, the Markovian model of mutation dynamic requires the knowledge of the Markov matrix P of transitions and trasversions, which is computed from the dataset in Section 5.This additional piece of information allows us to track the entropy evolution more closely.
In Section 4, we look at our Markovian dynamic model using the stochastic thermodynamics framework.This allows to describe the mutation bias-which acts as a drift term in the evolution of the base frequency-as the effect of the interaction of a small thermodynamic system with a thermal bath.We can, thus, compute the entropy flow and entropy production term related to the stochastic evolution (see [6][7][8]).Even if we can think of our set of sequences as a thermodynamic system only by analogy, this identification is useful to quantitatively describe the system's entropy evolution.

Computation of Minimal Entropy Curve
We consider the base frequency p of a sequence as a random variable, because the initial RNA sequence is subject to the error-prone copying mechanism.We can then ask ourselves if h(p) increases or decreases with time, or if it fluctuates around its initial value h(q).Since mutations accumulate with time, it is natural to use the relative entropy D(p|q) as a "time variable" and investigate how the entropy h(p) changes with D(p|q).In Figure 1a, we plotted the entropy h(q) of the reference sequence q (red dot) and the entropy h(p) of the mutated sequences in the dataset as a function of their relative entropy "distance" D(p|q).A clear pattern emerges: the entropy is decreasing with the relative entropy.This is a clear indication that the mutations are non-random, otherwise the mutations would more likely affect the most abundant base and the resulting base frequency vector would be more "uniform", hence with higher entropy.If the entropy is to decrease, this means that the effect of mutations has to further unbalance the initial base frequency vector q.In the sequel, we will address quantitatively this aspect of the mutation dynamic mechanism.
To start with, we want to determine if the decrease in entropy has a computable lower bound.This amounts to determine the probability p, which has minimal entropy h(p) over the set of probability distributions that satisfy the constraint D(p|q) = d and the normalization constraint.To this, we use the Lagrange multipliers method [9] for the Lagrange function (here, the index i The necessary first order condition for the extremality ∂G/∂p i = 0 for all i gives p i = C(µ)q λ λ+1 i .By setting β = λ/(λ + 1) and imposing the normalization constraint, we find the solution Note that for β = 1 we have p i (1) = q i .The value of the multiplier β = β(d) is determined by the constraint D(p(β)|q) = d, which translates into the following equation: The function f has a minimum in β = 1 with f (1) = 0; so, for d > 0, the equation To ascertain if they provide a local constrained minimum or maximum for h, we invoke the second order sufficient conditions (see again [9]) on the hessian matrix of G: p is a local minimum (resp.maximum) if and only if H p G( p) is positive (resp.negative) definite.In our case (here, δ ij is the Kronecker symbol), In Figure 2, we computed the evolution of the base frequencies (p A , p C , p G , p T ) with D(p, q).We see that there is a strong mutation bias favoring the substitution of C → T bases and perhaps a weak mutation bias G → A. A detailed study of the molecular nature of the bias is beyond the scope of this study; however, we notice that the above Formula (2) allows us to compute the evolution of the p i /p j ratio for sequences that are close to the minimum entropy curve.Assuming that their base frequency vector is well described by (2), one has since β(d) > 1, the initial unbalance q i /q j increases with d.While it is understandable that the C-T mutation bias lowers the entropy of the mutate sequence, Figure 1b shows that the mutation dynamics drives the decrease of the entropy to the minimum possible value.To our knowledge, this is a new result.
From Figure 1b one sees that the minimum entropy curve represent a lower bound for the sequence' entropy, which is saturated in the first part of the curve and which is loosened in the second part, giving evidence that there is some additional underlying mechanism in the mutation dynamics.In the following Section 3, we present a simple stochastic model to study this feature of the mutation dynamics.

A Stochastic Model of Mutation Dynamics
This kind of model is used in the Ehrenfest model of equilibrium thermodynamics (see, e.g., [10]) and in population dynamics [11] (see also [12] for the use of Markov models in mutation dynamics).We consider four urns (named A, B, C, D), each containing n i identical point particles with i ∈ E and ∑ i n i = N.At each time step ∆t, only one particle is randomly chosen from one urn and placed in one of the four urns.So, the change in the number of particles in urn i at time t is with ∆n i (t) ∈ {−1, 0, 1}.Let p i = n i /N be the probability that the chosen particle belongs to urn i, and let be the conditional probability that the particle in urn i at time t is moved to urn j at time t + ∆t.Then, the average value of ∆n i (t) is where P T is the transpose of P and I is the identity matrix.If the matrix P is independent of time, this model is a (time-homogeneous) discrete time Markov chain that can be used to describe the random variations in the base frequencies of the sequences.The following heuristic argument can be made rigorous (see, e.g., [13] and Appendix A).If the number of particles of the sequence N is sufficiently large, we can assume that the variance of the random variable n i is vanishing with N, so that Hence, n i ≈ ⟨n i ⟩ for large N. So, by multiplying ( 6) by 1/N and taking the average, we obtain If we take ∆t = 1/N as a time step (this means that a time T∼1 is the time required to move all the particle of the system on average), then we can write (11) as In the limit N → ∞, (12) becomes an equality and using (8), we obtain the following ODE for the probability p: Note that a probability distribution p = (p A , . . ., p T ) is stationary if P T p = p.In the following, we set W = P T − I, ∑ i W ij = 0, and we consider the Cauchy problem The above equation is called mean field approximation of the Markov chain [13].In statistical thermodynamics, it is known as a master equation-type dynamic [14].Equilibria W p = 0 of the master equation coincides with above-introduced stationary distributions P T p = p.One can easily show that the above Equation ( 14) can be rewritten using the matrix W as ṗi = ∑ j W ij p j − W ji p i = ∑ i J ij (15) where the quantity J ij = W ij p j − W ji p i is called probability current or thermodynamic flux term.
A simple check on (15) shows that if the matrix W is symmetric (W ij = W ji ), then the uniform distribution is an equilibrium distribution, and if the matrix W is non degenerate, then it is the only equilibrium; therefore, the entropy tends to its absolute maximum value when the system approach the equilibrium.Therefore, if the system entropy is to decrease as in our case, the matrix W (hence P) has to be non-symmetric.
If N is large, we can assume that the mean field dynamics is a good approximation of the mutation dynamic mechanism.For our sequences, N∼3 × 10 4 , which gives a pretty good approximation.We can, therefore, compute h(p(t)) along a solution p(t) of the Cauchy problem (14), and compare its evolution with the plot of Figure 1b; see Figure 3 below.Note that, unlike the theoretical minimum entropy curve, which requires only the knowledge of q, the mean field model of mutation dynamic requires the knowledge of the Markov matrix P. In Section 5, we show how to compute P from the sequences of the dataset.Prior to this, in Section 4, we investigate this Markovian mutation dynamic model using a stochastic thermodynamics framework.

Stochastic Thermodynamic Interpretation of Entropy Decrease
Stochastic thermodynamics is a recent research field at the intersection of classical statistical thermodynamics with information geometry (see, e.g., [6,7]).Some new and old thermodynamic inequalities have been introduced and interpreted in terms of information geometry [15], and then applied to the description of "small" thermodynamic systems in the non stationary regime, like molecular motors.Stochastic thermodynamics seems thus to be a promising tool to study the RNA chain of nucleic acid mutation mechanism.We consider a probabilistic system, which has four states (A, C, G, T), and we suppose that the sequence base frequency p = (p A , p C , p G , p T ) evolves randomly due to its internal dynamics and due to the interaction with an environment, which is responsible of the bias or drift.We want to compute the time evolution of the entropy h(p(t)) along a solution of the Cauchy problem (14).We thus have Now, write ln where X ij = ln p j W ij − ln p i W ji is the thermodynamic force.We can thus rewrite The non-negative quantity is interpreted as the system' entropy production therm.Note that J ij (p) = 0 for all i, j if p is the stationary distribution; therefore, the entropy production term vanishes when the system approaches the equilibrium distribution.The term with no definite sign is the entropy exchange (entropy flow) with the environment (heath bath).From Figure 1 we see that for our system we have Ṡ < 0 which from (18) necessarily implies that Ṡe < 0. Within this stochastic thermodynamic interpretation, the mutation bias act like a cold environment which lowers the entropy of the system.However, the necessarily non negative contribution Ṡi could induce a decrease of the entropy, which is slower that the one prescribed by the absolute minimum entropy curve of Figure 1b.In fact, the difference between the two curves in Figure 3 below is due to the positive contribution of the system entropy production term Ṡi > 0.

Log-Sum Inequality and the Entropy Production Term
To make this work self-contained, we briefly recall here a derivation contained in [7], which may be relevant for the interpretation of the entropy production term Ṡi .Suppose that the master equation matrix Ŵ = ∑ m k=1 W(k) is the sum of m various contribution terms W(k), which describe the interaction of the system with different environments.Then, one can repeat verbatim the derivation in ( 16) for Ŵ and, subsequently, substitute definition ( 17) with the following one: It is straightforward to rewrite the entropy production term Ṡi as Now, apply the log-sum inequality ( which is valid for non-negative numbers a 1 , . . .a m , and b 1 , . . ., b m .Then, (21) satisfies the inequality Therefore, failing to recognize that the master equation matrix W is the sum of different contributions describing the interaction of the thermodynamic system with various environments, one might underestimate the value of the system entropy production term Ṡi .In Section 5, we show how to compute the different matrices W(k) from our dataset.

The Case of SARS-CoV-2 Sequence Dataset
In this section, we apply the theory developed before to the case of the SARS-CoV-2 RNA virus, using the sequences dataset downloaded from the National Center for Biotechnology Information (NCBI) public repository.We retrieved the SARS-CoV-2 reference sequence classified as NC045512.2(the one collected in Wuhan, China, in December 2019), and all the sequences matching the following criteria: same length (29903 base pairs), com-plete, with no unknown characters, and from a human host.There are about 5600 sequences which constitute the dataset under study in this work.

Computation of Markov Matrix P from Data
Let x = (x 1 , . . ., x N ), N = 29903, be the reference sequence and let y = (y 1 , . . ., y N ) be a mutated sequence.We define for i, j ∈ E the (empirical) frequency vector associated with x and y q i = n i (x) N , p i = n i (y) N Therefore, the empirical matrix of conditional probabilities can be defined as where n ij (x, y) is the number of times the base x α = i is mutated in the base y α = j for α = 1, . . ., N. The quantity d H = 0, 1, 2, . . .
is the number of errors in the copying of the x sequence into y, and is called the Hamming distance [16] between the two sequences.Note that the Hamming distance is nonzero for two sequences, which differ by a simple base order exchange, whereas the relative entropy distance is zero in this case, since the base frequencies are unchanged.The Hamming distance thus gives a finer measure of discrepancy between two sequences.We have partitioned our dataset of about 5600 sequences in disjoined classes D H (0), D H (1), . . . of sequences, having the same Hamming distance k = 0, 1, . .., from the reference sequence x.
We obtained 48 classes, and we define the averaged matrix over class k as where |D H (k)| denotes the cardinality of D H (k). Correspondingly, we define W(k) = P(k) − I.In Figure 4, we have plotted the value of the entries of matrix P(k) as a function of the Hamming distance classes k/N.We see that the major contributions to P come from the conditional probabilities C → T (i.e., P CT ), G → T and G → A, giving another confirmation of the above mentioned C-T bias.

Mean Field Dynamics and Entropy Rate
In this section, we study the entropy evolution by comparing the minimal entropy curve with the mean field dynamics (14) for W = W(k) (see Figure 3a).We see that in the upper part of the curve, the two curves are close to each other, due to the fact that the high sequence length N ≈ 30,000 guarantees that the mean field dynamics is a good approximation of the Markovian dynamics.In the lower part of the curve, we see that the mean field solution (blue curve) prescribes a system entropy, which is higher than the minimal entropy curve.The difference is due to the non-negative entropy production term Ṡi > 0, while the fact that the entropy is globally decreasing is due to the mutation bias, which can be described as an interaction with a cold environment causing a negative entropy flux Ṡe < 0. The better fit of the mean field dynamics is consistent with the fact that the theoretical minimum entropy curve reflects only the knowledge of q, while the mean field model of mutation dynamic requires the knowledge of q and the Markov matrix P of transitions.
From Figure 3b, we also see that the entropy production term is higher in the case where the dynamics in ( 14) are described by W(40) with respect to W(1) (compare also Figure 5b,d).This is probably due to the fact that W(40) is averaged over a class of sequences D H (40), which contains more kind of mutations than D H (1) (see Section 5.1 above for the definition of D H ); therefore, it is likely that W(40) "contains" the interaction with multiple environments, a situation that can be described from a theoretical point of view along the lines of Section 4.1.To conclude, in Figure 5, we show the time-evolution of mean field (master equation) dynamics and the time-evolution of the various entropy rate terms Ṡ, Ṡi and Ṡe .

Conclusions
In this paper, we have presented an analysis of the mutations in the SARS-CoV-2 RNA sequences.Unlike the majority of genetic studies, which focus on the detailed functional analysis of very specific regions of the sequences, we have considered only the sequence base frequency as relevant information.Using a literary analogy, we discarded the poetry in the book, and we concentrated only on the differences due to typographic errors between the millions of printed copies.We can thus understand some features of the printing machine, and discover that is biased towards some kind of errors.The functioning of the printer can be described quantitatively by a probabilistic model, which is amenable to a stochastic thermodynamic interpretation.We modeled the accumulation of mutations in the RNA sequence as the slow drift of the probability p = p(t) describing a four-state thermodynamical system in contact with a thermal bath from the initial q = p(0).The evolution of the probability can be described as the mean field evolution of a Markov chain, whose matrix P is derived from data, and it describes the existence of a mutation bias since the entropy is decreasing.It is remarkable that, for SARS-CoV-2, the entropy decrease closely follows a theoretically computable lower bound.As far as we know, this is result is new.We think that this simple model can complement classical approaches to the problem of describing genetic variability.Indeed, our approach is not confined to the study of genetic sequences, and it is virtually applicable to any dynamical system described by a vector field ẋ = X(x, t) over a manifold M and a finite partition E of M (coarse graining).The coarse-grained system evolution is described by a sequence x = (x 1 , . . ., x N ), x i ∈ E, and the probability vector q is the so-called occupation measure of x.If we add a noise or drift term to the deterministic evolution X, then we have a set of perturbed trajectories y = (y 1 , . . .y N ) fluctuating around x.One could retrieve some aspects of the evolution of the perturbed system from a record of collected trajectories along the lines described in this work.
The vector field F is called the mean field dynamics (or the fluid limit approximation) associated with the discrete Markov chain.Let p(t, q) be the solution, and denote with p c (t) the stochastic process with continuous time interpolating the Markov chain.Moreover, let us consider the random variable D T (q) = max{|p c (t) − p(t, q)|, t ∈ 0, T} The following large deviation type estimate is given in [13].
Proposition A1.There exists a C > 0 such that for all ε > 0, T > 0 and sufficiently large N, one has Prob D T (q) ≥ ε p(0) = q ≤ 2ke −Cε 2 N For the Markov model presented in Section 3 (see ( 13)), we have F i (p) = ∑ j p j P ji − p i P ij = ((P T − I)p) i .

) hence the β 2 Figure 1 .
(d) > 1 solution of the equation f (β) = d yields a minimum, while the other β 1 (d) < 1 a maximum.If we plot the value of the entropy h(p(β(d)) along the two solutions β 2 (d) and β 1 (d) giving, respectively, the minimum and maximum possible value of the entropy h(p) for a given value of D(p|q) = d, we obtain the two branches of the violet curve in Figure 1b (upper branch has been cropped in the figure).We see that the lower bound is tight in the first part of the descent, and that there are mutated sequences that have minimal entropy.(a) Plot of the entropy vs. relative entropy of the sequences in the dataset (blue dots); red dots represent the entropy of the reference (Wuhan) sequence.(b) same as in (a); purple curve represents the minimum entropy curve.

Figure 2 .
Figure 2. Left to right and top to bottom: plot of p A , p C , p G and p T base frequencies as a function of relative entropy distance from q.

Figure 4 .
(a) Plot of the value of some of the entries of matrix P(k) as a function of the Hamming distance classes k/N; (b) the same as in (a), showing the entries that gives the major contributions (ten times higher than in (a)).

Figure 5 .
(a) Plot of the solution p(t) of the mean field dynamics for W(1); (b) plot of entropy rate Ṡ (black curve), internal entropy rate Ṡi (orange curve) and entropy flow rate Ṡe (blue curve) along the solution of the mean field dynamics for W(1); (c) the same as in (a) for W(40); (d) the same as in (b) for W(40).