# Rooting and Dating Large SARS-CoV-2 Trees by Modeling Evolutionary Rate as a Function of Time

## 1. Introduction

$T_A$) of representative viral strains [1,2,3,4,5]. Modern sequencing technology can quickly generate many viral genomic sequences from which a viral tree can be constructed. If the viral tree can be properly rooted, then the parameters µ and $T_A$ can be estimated given a tree of viral strains with sample collection times used for calibrating a molecular clock [6,7,8,9,10,11]. The most recent common ancestor of SARS-CoV-2 was dated in two recent studies to October–November 2019 [12] and mid-August 2019 [5], respectively. Using viral genomes from China, Pekar et al. [13] inferred that the first cryptic infection of SARS-CoV-2 occurred between mid-October and mid-November 2019.

$T_A$ was dated highly concordantly to 2019-06-12 and 2019-07-07 with the two trees, respectively. Treating the evolutionary rate as a constant would generate highly divergent and unreasonable estimates from these two trees.

## 2. An Improved TRAD Method

#### 2.1. The Rooting Step

$D_i$ is the root-to-tip distance when the root is placed at internal node i in Figure 1A. The first two columns do not change, but the four D columns differ depending on where the root is placed. Different computer programs use different dates as day 1. R counts days from 1 January 1970 (1970-01-01), whereas EXCEL uses 1900-01-01 as day 1. I will take the EXCEL convention of using 1900-01-01 as day 1, so 2019-12-10 is 43,809, 2020-01-05 is 43,835, and so on. The choice of reference date does not affect rooting and dating.
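The day-numbering convention can be made concrete with a short sketch. The function names below are illustrative and not part of TRAD or DAMBE. The sketch assumes the usual spreadsheet behaviour in which a non-existent 1900-02-29 is counted, so that for modern dates the effective day 0 is 1899-12-30.

```python
from datetime import date, timedelta

# EXCEL labels 1900-01-01 as day 1 but also counts a non-existent 1900-02-29,
# so for dates after February 1900 the effective day 0 is 1899-12-30.
EXCEL_DAY_ZERO = date(1899, 12, 30)
UNIX_DAY_ZERO = date(1970, 1, 1)  # origin used by R's Date class


def to_excel_serial(d: date) -> int:
    """Collection date -> EXCEL-style day number (hypothetical helper)."""
    return (d - EXCEL_DAY_ZERO).days


def to_unix_days(d: date) -> int:
    """Collection date -> days since 1970-01-01."""
    return (d - UNIX_DAY_ZERO).days


def from_excel_serial(serial: float) -> date:
    """EXCEL-style day number -> calendar date (fraction of a day dropped)."""
    return EXCEL_DAY_ZERO + timedelta(days=int(serial))


print(to_excel_serial(date(2019, 12, 10)), to_excel_serial(date(2020, 1, 5)))
```

The two conventions differ by a constant offset for every date, which is why the choice does not change the fitted rate or the calendar date of the ancestor.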

$D_1$ (Table 1) has the strongest relationship to T relative to $D_2$, $D_3$, and $D_4$ (Figure 2). In previous implementations of the rooting method [5,8,22], the evolutionary rate µ is assumed to be constant so the relationship between D and T is linear: $D=\mu \left(T-{T}_{A}\right)=-\mu {T}_{A}+\mu T$. Thus, the rooting point that yields the strongest linear relationship (e.g., the highest Pearson correlation r) between D and T is taken as the estimated root. This method has been implemented in TempEst [8], DAMBE [23] and TRAD [5,22]. For example, the Pearson correlations between T and D for the four candidate rooting positions (Table 1) show that internal node 1 (Figure 1A) is a better rooting position than internal nodes 2, 3 and 4. However, internal node 1 is not the best root with r as the criterion. Shifting the root from internal node 1 towards internal node 2 by 0.08709 achieves the highest r of 0.99786 between T and the root-to-tip distance, designated $D_{max.r}$ (Table 1). Moving the root to any other point along the branches will reduce r. This approach, with a constant µ and consequently a linear relationship between D and T, will be referred to as Model 1.
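The search for the maximum-r root along one branch can be sketched with the toy data in Table 1. This is a simple grid scan written for illustration, not the algorithm used in TRAD. The split of tips into two sides of the branch and the branch length of 0.20 are assumptions inferred from the $D_1$ and $D_2$ columns of Table 1.

```python
from math import sqrt

# Rows: (T as EXCEL day number, D1 from Table 1, side).
# side = +1 for tips that get closer to the root when it slides from node 1
# towards node 2 (S4, S5, S6), -1 for all other tips. The split is inferred
# from the D1 and D2 columns of Table 1, not read from the tree itself.
TIPS = [
    (43835, 0.60, +1),  # S4
    (43835, 0.61, +1),  # S5
    (43809, 0.48, +1),  # S6
    (43890, 1.07, -1),  # S1
    (43890, 1.12, -1),  # S2
    (43862, 0.80, -1),  # S3
    (43936, 1.50, -1),  # S7
    (43991, 2.20, -1),  # S8
    (43991, 2.20, -1),  # S9
    (43991, 2.20, -1),  # S10
    (43991, 2.21, -1),  # S11
    (43961, 1.87, -1),  # S12
]
BRANCH_LENGTH = 0.20  # node 1 to node 2, taken as |D2 - D1| in Table 1


def pearson_r(xs, ys):
    """Pearson correlation computed from centred sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_at_shift(shift):
    """r between T and D when the root sits `shift` from node 1 towards node 2."""
    times = [t for t, _, _ in TIPS]
    dists = [d - side * shift for _, d, side in TIPS]
    return pearson_r(times, dists)


def best_shift(step=1e-4):
    """Grid scan of the branch; returns the shift with the highest r."""
    n_steps = round(BRANCH_LENGTH / step)
    return max((i * step for i in range(n_steps + 1)), key=r_at_shift)


shift = best_shift()
print(f"best shift = {shift:.4f}, r = {r_at_shift(shift):.5f}")
```

A full implementation would repeat this scan over every branch of the tree and keep the global maximum; the sketch covers only the single branch discussed in the text.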

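$R^2$) based on Equation (1) is more appropriate than r as a criterion to choose the best rooting position. With a dependent variable y and two x variables $x_1$ and $x_2$, $R^2$ is

$$R^2=\frac{r_{yx_1}^2+r_{yx_2}^2-2\,r_{yx_1}r_{yx_2}r_{x_1x_2}}{1-r_{x_1x_2}^2}$$

where r with two subscripts is the Pearson correlation between the two subscripted variables (here y = D, $x_1$ = T and $x_2$ = $T^2$).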

$R^2$ values are also shown in the last row in Table 1 for the four candidate roots as well as for the root with the maximum r. With $R^2$ as the criterion, the optimal root is arrived at by shifting Node 1 towards Node 4 by a distance of 0.006638 (Figure 1). This would yield the maximum $R^2$ = 0.99854.
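The $R^2$ criterion can be sketched for the four candidate roots in Table 1. The sketch uses the standard two-predictor identity that expresses $R^2$ through pairwise Pearson correlations, with T and $T^2$ as the two predictors. Measuring T from the earliest sample is a numerical convenience assumed here; it leaves $R^2$ of the order-2 polynomial unchanged. All names are illustrative.

```python
from math import sqrt

# Collection times (EXCEL day numbers) in the row order of Table 1:
# S4, S5, S6, S1, S2, S3, S7, S8, S9, S10, S11, S12.
T = [43835, 43835, 43809, 43890, 43890, 43862,
     43936, 43991, 43991, 43991, 43991, 43961]
D = {
    "D1": [0.60, 0.61, 0.48, 1.07, 1.12, 0.80, 1.50, 2.20, 2.20, 2.20, 2.21, 1.87],
    "D2": [0.40, 0.41, 0.28, 1.27, 1.32, 1.00, 1.70, 2.40, 2.40, 2.40, 2.41, 2.07],
    "D3": [0.20, 0.21, 0.48, 1.47, 1.52, 1.20, 1.90, 2.60, 2.60, 2.60, 2.61, 2.27],
    "D4": [0.97, 0.98, 0.85, 0.70, 0.75, 0.43, 1.13, 1.83, 1.83, 1.83, 1.84, 1.50],
}


def pearson_r(xs, ys):
    """Pearson correlation computed from centred sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_squared_quadratic(times, dists):
    """R^2 of D = b0 + b1*T + b2*T^2 via the two-predictor correlation identity."""
    t0 = min(times)
    x1 = [t - t0 for t in times]   # shifted time keeps T^2 small
    x2 = [x * x for x in x1]
    ry1 = pearson_r(dists, x1)
    ry2 = pearson_r(dists, x2)
    r12 = pearson_r(x1, x2)
    return (ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2)


r2 = {name: r_squared_quadratic(T, d) for name, d in D.items()}
best = max(r2, key=r2.get)
print(best, {name: round(value, 4) for name, value in r2.items()})
```

Among the four internal nodes, the candidate with the largest value is the starting point from which the root is then shifted along adjacent branches.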

$R^2$ values to find the optimal rooting position along the tree with the maximum $R^2$. TRAD [5,22] is currently the only software package that can root a tree with so many leaves. It took less than a day to root the Apr3_21 tree on a regular desktop computer with an 11th Gen Intel(R) Core(TM) i7-11700 processor and 64 GB of memory, but a week to root the May7_22 tree.

#### 2.2. The Dating Step

$T_A$, one assuming µ as a constant and the other modelling µ as a linear function of time. We have two variables, D and T, illustrated in the previous section, to estimate $T_A$ (the time of origin of the common ancestor of the sampled SARS-CoV-2 genomes). In the dating step, D is the root-to-tip distance from the best root. When µ is constant, D and T are linearly related as follows:

$$D=\mu \left(T-{T}_{A}\right)=-\mu {T}_{A}+\mu T \tag{3}$$

Regressing $D_{max.r}$ in Table 1 on T yields $\mu =1.08986\times {10}^{-2}$ (changes/day/genome) and an intercept of −477.16399. Setting $-\mu {T}_{A}=-477.16399$, we get $T_A$ = 43,782.26, which is equivalent to 13 November 2019. From Equation (3), it is also clear that ${T}_{A}=T$ when D = 0. This approach has been applied to the estimation of ${T}_{A}$ of sampled SARS-CoV-2 genomes in a previous study by assuming a constant µ [5]. However, this approach would be problematic when µ is not constant, as shown in Figure 2. Not only will it bias the estimate of $T_A$ towards a more recent date, but it will also generate inconsistent estimates of ${T}_{A}$ depending on the period in which the viral genomes were sampled. For example, if we separate the data into two groups, with group 1 including the data collected before 2020-03-10 and group 2 including the data collected after 2020-03-10, then group 2 will date $T_A$ to a more recent time than group 1 because µ increases over time in Figure 2.
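The constant-rate dating step can be reproduced from the $D_{max.r}$ column of Table 1 with ordinary least squares. The sketch below is a minimal illustration with hypothetical function names. It centres the data before fitting, which is a numerical choice made here rather than something taken from the original implementation.

```python
from datetime import date, timedelta

EXCEL_DAY_ZERO = date(1899, 12, 30)  # effective day 0 of the EXCEL convention

# Collection times (EXCEL day numbers) and D_max.r from Table 1, in row order.
T = [43835, 43835, 43809, 43890, 43890, 43862,
     43936, 43991, 43991, 43991, 43991, 43961]
D_MAX_R = [0.51291, 0.52291, 0.39291, 1.15709, 1.20709, 0.88709,
           1.58709, 2.28709, 2.28709, 2.28709, 2.29709, 1.95709]


def fit_line(xs, ys):
    """Least-squares slope and intercept of y on x, using centred sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def date_ancestor_constant_rate(times, dists):
    """Model 1: D = -mu*T_A + mu*T, so T_A = -intercept / slope."""
    mu, intercept = fit_line(times, dists)
    return mu, intercept, -intercept / mu


mu, intercept, t_a = date_ancestor_constant_rate(T, D_MAX_R)
t_a_date = EXCEL_DAY_ZERO + timedelta(days=int(t_a))
print(f"mu = {mu:.7f}, intercept = {intercept:.4f}, T_A = {t_a:.2f} ({t_a_date})")
```

Running the same function separately on the samples collected before and after 2020-03-10 is one way to see the inconsistency described above, although each subset of this toy data set contains only a handful of points.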

$T_A$ can be estimated as before by setting D = 0 in Equation (1), and solving the resulting quadratic equation. The two roots of the function are

$$T_A=\frac{-B_1\pm\sqrt{B_1^2-4B_0B_2}}{2B_2}$$

$R^2$ gives us ${B}_{0}=\mathrm{39,848.83964}$, ${B}_{1}=-1.82494$, ${B}_{2}=2.08940\times {10}^{-5}$. Therefore, $\mu \left(T\right)={B}_{1}+2{B}_{2}T=-1.82494+4.17879\times {10}^{-5}T$. Our previous treatment of µ as a constant yields $\mu =1.08986\times {10}^{-2}$, which would be the evolutionary rate on 11 April 2020 given $\mu \left(T\right)$. The discriminant ${B}_{1}^{2}-4{B}_{0}{B}_{2}$ turned out to be −0.0000055554 and may be taken as 0, so ${T}_{A}\approx -\frac{{B}_{1}}{2{B}_{2}}=\mathrm{43,671.4}$, which is 2019-07-25. This is about 111 days earlier than the estimate of ${T}_{A}$ when µ is taken as a constant. The variance of ${T}_{A}$ can be estimated by bootstrapping.
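The quadratic dating step can be sketched directly from the three coefficients quoted above. Two choices in the sketch are assumptions rather than statements from the text: a non-positive discriminant is treated as zero, and when the discriminant is positive the larger root is returned, on the reasoning that D increases with T to the right of that root when $B_2$ > 0. The function names are illustrative.

```python
from datetime import date, timedelta
from math import sqrt

EXCEL_DAY_ZERO = date(1899, 12, 30)  # effective day 0 of the EXCEL convention


def rate_at(b1, b2, t):
    """Model 2 rate: derivative of D = B0 + B1*T + B2*T^2 with respect to T."""
    return b1 + 2.0 * b2 * t


def date_ancestor_linear_rate(b0, b1, b2):
    """Solve B0 + B1*T + B2*T^2 = 0 for T_A, assuming B2 > 0."""
    if b2 <= 0.0:
        raise ValueError("this sketch assumes a rate that increases with time")
    disc = b1 * b1 - 4.0 * b0 * b2
    if disc <= 0.0:
        # Discriminant at or below zero: use the vertex (repeated root).
        return -b1 / (2.0 * b2)
    # Assumption: the larger root is the relevant one, because D rises
    # with T only to the right of it.
    return (-b1 + sqrt(disc)) / (2.0 * b2)


# Coefficients quoted in the text for the toy data set.
B0, B1, B2 = 39848.83964, -1.82494, 2.08940e-5
t_a = date_ancestor_linear_rate(B0, B1, B2)
t_a_date = EXCEL_DAY_ZERO + timedelta(days=int(t_a))
print(f"T_A = {t_a:.1f} ({t_a_date}), rate at T_A = {rate_at(B1, B2, t_a):.2e}")
```

Because the discriminant is a small difference between two numbers of similar size, the result is sensitive to how many digits of the coefficients are retained, so coefficients should be carried at full precision in practice.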

## 3. Results

$T_A$ given the Apr3_21 tree, but differ a lot in the estimated $T_A$ for the May7_22 tree.

The $T_A$ estimates from Model 1 and Model 2 are similar with the Apr3_21 tree ($T_A$ is 2019-08-16 from Model 1 and 2019-06-12 from Model 2, Table 2). However, for the May7_22 tree (Figure 3B), where µ is apparently not constant, Model 1 generated an absurd estimate of $T_A$ of 4 March 2020 (Table 2). In contrast, both the Apr3_21 tree and the May7_22 tree under Model 2 generated consistent estimates of $T_A$ (2019-06-12 and 2019-07-07, respectively, Table 2).

$10^{-61}$, which led to a strong rejection of Model 1 in favor of Model 2. For the May7_22 tree, the p value is even smaller, as one would have expected by contrasting Figure 3A and Figure 3B, where the relationship between D and T visually curves more in Figure 3B than in Figure 3A. The results of the likelihood ratio tests are consistent with AIC and BIC as model-selection criteria. Both AIC and BIC strongly favor Model 2 over Model 1 (Table 2).
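The model comparison can be checked against the log-likelihoods in Table 2. The sketch below assumes a likelihood ratio test with one degree of freedom (Model 2 adds the single coefficient $B_2$) and counts only the regression coefficients as parameters (k = 2 for Model 1, k = 3 for Model 2); that parameter count is an assumption that reproduces the Apr3_21 columns of Table 2. Function names are illustrative.

```python
from math import erfc, log, sqrt


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio statistic and its chi-square p value for 1 df.

    For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, erfc(sqrt(stat / 2.0))


def aic(lnl, k):
    """Akaike information criterion."""
    return 2.0 * k - 2.0 * lnl


def bic(lnl, k, n):
    """Bayesian information criterion for n observations."""
    return k * log(n) - 2.0 * lnl


# Apr3_21 tree, values copied from Table 2.
N_LEAVES = 83688
LNL_MODEL1, LNL_MODEL2 = -235496.173, -235359.894

stat, p_value = lrt_one_df(LNL_MODEL1, LNL_MODEL2)
print(f"LRT statistic = {stat:.3f}, p = {p_value:.2e}")
print(f"AIC: {aic(LNL_MODEL1, 2):.3f} vs {aic(LNL_MODEL2, 3):.3f}")
print(f"BIC: {bic(LNL_MODEL1, 2, N_LEAVES):.3f} vs {bic(LNL_MODEL2, 3, N_LEAVES):.3f}")
```

For the May7_22 tree the log-likelihood difference is so large that the same calculation underflows to 0.0 in double precision, consistent with the even smaller p value reported in the text.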

## 4. Discussion

#### 4.1. The Pros and Cons of Using Large Trees

$T_A$. If a new strain originated at the end of March 2020 from one of those early lineages, displaced the original strains, and evolved at a rate different from the original, then including the genomes of these new strains would likely introduce a bias in the estimation of $T_A$. In this scenario, it is better to have a small tree of the 500 genomes than a large tree including new strains that may have a very different rate of evolution.

$T_A$ less reliable.

$T_A$.

#### 4.2. Strict and Uncorrelated Relaxed Clock

## 5. Conclusions

$T_A$ with two different viral phylogenies. Third, when µ increases with time, the estimated $T_A$ may be biologically absurd. In contrast, modelling µ as a linear function of time instead of a constant eliminates all these problems. I applied this approach to analyzing two large trees released by NCBI on 3 April 2021, and 7 May 2022, including 83,688 and 970,777 high-quality and full-length SARS-CoV-2 genomes, respectively, with complete sample collection dates for the included viral genomes. The most recent common ancestor of the sampled SARS-CoV-2 genomes was dated to 12 June 2019 with the Apr3_21 tree, and 7 July 2019 with the May7_22 tree with 970,777 leaves. The results also highlight the importance of having very large trees because of substantial rate heterogeneity among different SARS-CoV-2 lineages.

## References

- MacLean, O.A.; Lytras, S.; Weaver, S.; Singer, J.B.; Boni, M.F.; Lemey, P.; Kosakovsky Pond, S.L.; Robertson, D.L. Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen. PLoS Biol. **2021**, 19, e3001115.
- Wang, H.; Pipes, L.; Nielsen, R. Synonymous mutations and the molecular evolution of SARS-CoV-2 origins. Virus Evol. **2021**, 7, veaa098.
- Boni, M.F.; Lemey, P.; Jiang, X.; Lam, T.T.-Y.; Perry, B.; Castoe, T.; Rambaut, A.; Robertson, D.L. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat. Microbiol. **2020**, 5, 1408–1417.
- Lytras, S.; Xia, W.; Hughes, J.; Jiang, X.; Robertson, D.L. The animal origin of SARS-CoV-2. Science **2021**, 373, 968–970.
- Xia, X. Dating the Common Ancestor from an NCBI Tree of 83688 High-Quality and Full-Length SARS-CoV-2 Genomes. Viruses **2021**, 13, 1790.
- Xia, X. Distance-Based Phylogenetic Methods. In Bioinformatics and the Cell: Modern Computational Approaches in Genomics, Proteomics and Transcriptomics; Springer: Cham, Switzerland, 2018; pp. 343–379.
- Xia, X. DAMBE5: A comprehensive software package for data analysis in molecular biology and evolution. Mol. Biol. Evol. **2013**, 30, 1720–1728.
- Rambaut, A.; Lam, T.T.; Max Carvalho, L.; Pybus, O.G. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evol. **2016**, 2, vew007.
- Himmelmann, L.; Metzler, D. TreeTime: An extensible C++ software package for Bayesian phylogeny reconstruction with time-calibration. Bioinformatics **2009**, 25, 2440–2441.
- To, T.-H.; Jung, M.; Lycett, S.; Gascuel, O. Fast Dating Using Least-Squares Criteria and Algorithms. Syst. Biol. **2016**, 65, 82–97.
- Volz, E.M.; Frost, S.D.W. Scalable relaxed clock phylogenetic dating. Virus Evol. **2017**, 3, vex025.
- Kumar, S.; Tao, Q.; Weaver, S.; Sanderford, M.; Caraballo-Ortiz, M.A.; Sharma, S.; Pond, S.L.K.; Miura, S. An Evolutionary Portrait of the Progenitor SARS-CoV-2 and Its Dominant Offshoots in COVID-19 Pandemic. Mol. Biol. Evol. **2021**, 38, 3046–3059.
- Pekar, J.; Worobey, M.; Moshiri, N.; Scheffler, K.; Wertheim, J.O. Timing the SARS-CoV-2 index case in Hubei province. Science **2021**, 372, 412–417.
- van Dorp, L.; Acman, M.; Richard, D.; Shaw, L.P.; Ford, C.E.; Ormond, L.; Owen, C.J.; Pang, J.; Tan, C.C.S.; Boshier, F.A.T.; et al. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infect. Genet. Evol. **2020**, 83, 104351.
- Gómez-Carballa, A.; Bello, X.; Pardo-Seco, J.; Martinón-Torres, F.; Salas, A. Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders. Genome Res. **2020**, 30, 1434–1448.
- Rambaut, A.; Holmes, E.C.; O’Toole, Á.; Hill, V.; McCrone, J.T.; Ruis, C.; du Plessis, L.; Pybus, O.G. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat. Microbiol. **2020**, 5, 1403–1407.
- Chaw, S.-M.; Tai, J.-H.; Chen, S.-L.; Hsieh, C.-H.; Chang, S.-Y.; Yeh, S.-H.; Yang, W.-S.; Chen, P.-J.; Wang, H.-Y. The origin and underlying driving forces of the SARS-CoV-2 outbreak. J. Biomed. Sci. **2020**, 27, 73.
- Liu, Q.; Zhao, S.; Shi, C.-M.; Song, S.; Zhu, S.; Su, Y.; Zhao, W.; Li, M.; Bao, Y.; Xue, Y.; et al. Population Genetics of SARS-CoV-2: Disentangling Effects of Sampling Bias and Infection Clusters. Genom. Proteom. Bioinform. **2020**, 18, 640–647.
- Duchene, S.; Featherstone, L.; Haritopoulou-Sinanidou, M.; Rambaut, A.; Lemey, P.; Baele, G. Temporal signal and the phylodynamic threshold of SARS-CoV-2. Virus Evol. **2020**, 6, veaa061.
- Tay, J.H.; Porter, A.F.; Wirth, W.; Duchene, S. The Emergence of SARS-CoV-2 Variants of Concern Is Driven by Acceleration of the Substitution Rate. Mol. Biol. Evol. **2022**, 39, msac013.
- Pekar, J.E.; Magee, A.; Parker, E.; Moshiri, N.; Izhikevich, K.; Havens, J.L.; Gangavarapu, K.; Malpica Serrano, L.M.; Crits-Christoph, A.; Matteson, N.L.; et al. The molecular epidemiology of multiple zoonotic origins of SARS-CoV-2. Science **2022**, 377, 960–966.
- Xia, X. TRAD: Tip-Rooting and Ancestor-Dating; University of Ottawa: Ottawa, ON, Canada, 2021.
- Xia, X. DAMBE7: New and improved tools for data analysis in molecular biology and evolution. Mol. Biol. Evol. **2018**, 35, 1550–1552.
- Thorne, J.L.; Kishino, H.; Painter, I.S. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. **1998**, 15, 1647–1657.
- Aris-Brosou, S.; Yang, Z. Effects of models of rate evolution on estimation of divergence dates with special reference to the metazoan 18S ribosomal RNA phylogeny. Syst. Biol. **2002**, 51, 703–714.
- Hatcher, E.L.; Zhdanov, S.A.; Bao, Y.; Blinkova, O.; Nawrocki, E.P.; Ostapchuck, Y.; Schäffer, A.A.; Brister, J.R. Virus Variation Resource – Improved response to emergent viral outbreaks. Nucleic Acids Res. **2017**, 45, D482–D490.
- Xia, X. Improved method for rooting and tip-dating a viral phylogeny. In Handbook of Computational Statistics, II; Lu, H.H.-S., Scholkopf, B., Wells, M.T., Zhao, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2022.
- Worobey, M.; Levy, J.I.; Malpica Serrano, L.; Crits-Christoph, A.; Pekar, J.E.; Goldstein, S.A.; Rasmussen, A.L.; Kraemer, M.U.G.; Newman, C.; Koopmans, M.P.G.; et al. The Huanan Seafood Wholesale Market in Wuhan was the early epicenter of the COVID-19 pandemic. Science **2022**, 377, 951–959.

**Figure 1.** Conceptual illustration of the statistical framework for rooting the tree and dating the most recent common ancestor of the sampled genomes. (**A**) An unrooted viral tree with viral names in the format of Name|T where T is the collection date (mm-dd-year). Branch lengths are shown next to individual branches. Five internal nodes are numbered 1, 2, 3, 4, and 5, respectively. The branch length (0.67) between nodes 2 and 5 is bisected into the two green numbers by internal node 1, and into two red numbers by internal node 4. The root is unknown and could be anywhere along the branches. (**B**) The root-to-tip distance (D) when the root is placed at internal node 1.

**Figure 2.** The root-to-tip distance (D) increases with viral sample collection time (T). Dot color codes are blue for $D_1$, orange for $D_2$, black for $D_3$ and red for $D_4$. The order-2 polynomial regression line was fitted for $D_1$ and T only.

**Figure 3.** Changes in evolutionary rate over time visualized by plotting root-to-tip distance (D) over viral sample collection time (horizontal axis). D is from the optimal root estimated as described. (**A**) Relationship between D and T from the tree released by NCBI on 3 April 2021, and (**B**) on 7 May 2022.

**Figure 4.** Sequence alignment summarizing all genomic variation among seven SARS-CoV-2 genomes. The genome sequence name is in the form of ACCN|Country_code|Collection_date. EE: Estonia; CH: Switzerland; IT: Italy; CN: China. The four nucleotide sites characterizing the D614G lineage are in bold.

**Figure 5.** A fictitious viral tree with 14 viral genomes (S1 to S14) taken at different times shown as part of OTU names. Branch lengths are indicated next to the branch. An unrooted tree including S1 to S12 but excluding S13 and S14 would have only one branch connecting Node 2 and Node 3, with the branch length indicated by the pink 0.43. Similarly, an unrooted tree of all 14 samples has only one branch connecting Node 1 and Node 4, with the branch length indicated by the purple 0.40. Numbered circles are nodes mentioned in the text.

**Table 1.** Different root-to-tip distances ($D_1$, $D_2$, $D_3$, and $D_4$) when the candidate root is placed at internal nodes 1, 2, 3 and 4 in Figure 1A, respectively. T is virus collection time in the format of yyyy-mm-dd. The last two rows show (1) the Pearson correlation (r) between T and D and (2) the coefficient of determination ($R^2$) based on $D={b}_{0}+{b}_{1}T+{b}_{2}{T}^{2}$.

| Virus | T | $D_1$ | $D_2$ | $D_3$ | $D_4$ | $D_{max.r}$ | $D_{max.R^2}$ |
|---|---|---|---|---|---|---|---|
| S4 | 2020-01-05 | 0.60 | 0.40 | 0.20 | 0.97 | 0.51291 | 0.60664 |
| S5 | 2020-01-05 | 0.61 | 0.41 | 0.21 | 0.98 | 0.52291 | 0.61664 |
| S6 | 2019-12-10 | 0.48 | 0.28 | 0.48 | 0.85 | 0.39291 | 0.48664 |
| S1 | 2020-02-29 | 1.07 | 1.27 | 1.47 | 0.70 | 1.15709 | 1.06336 |
| S2 | 2020-02-29 | 1.12 | 1.32 | 1.52 | 0.75 | 1.20709 | 1.11336 |
| S3 | 2020-02-01 | 0.80 | 1.00 | 1.20 | 0.43 | 0.88709 | 0.79336 |
| S7 | 2020-04-15 | 1.50 | 1.70 | 1.90 | 1.13 | 1.58709 | 1.49336 |
| S8 | 2020-06-09 | 2.20 | 2.40 | 2.60 | 1.83 | 2.28709 | 2.19336 |
| S9 | 2020-06-09 | 2.20 | 2.40 | 2.60 | 1.83 | 2.28709 | 2.19336 |
| S10 | 2020-06-09 | 2.20 | 2.40 | 2.60 | 1.83 | 2.28709 | 2.19336 |
| S11 | 2020-06-09 | 2.21 | 2.41 | 2.61 | 1.84 | 2.29709 | 2.20336 |
| S12 | 2020-05-10 | 1.87 | 2.07 | 2.27 | 1.50 | 1.95709 | 1.86336 |
| r | | 0.9953 | 0.9949 | 0.9749 | 0.8580 | 0.99786 | 0.99486 |
| $R^2$ | | 0.9985 | 0.9909 | 0.9538 | 0.9362 | 0.99662 | 0.99854 |

**Table 2.** Summary of the dating results for the Apr3_21 tree (with 83,688 leaves) and the May7_22 tree (with 970,777 leaves). Note the absurd estimation of $T_A$ under Model 1 with the May7_22 tree.

| | Apr3_21 tree, Model 1 ^{(1)} | Apr3_21 tree, Model 2 ^{(2)} | May7_22 tree, Model 1 ^{(1)} | May7_22 tree, Model 2 ^{(2)} |
|---|---|---|---|---|
| $B_0$ ^{(3)} | −2415.05517 | 38,679.17639 | −3247.86320 | 122,456.46937 |
| $B_1$ ^{(3)} | 0.05527 | −1.80887 | 0.073993 | −5.61023 |
| $B_2$ ^{(3)} | N/A | 2.114018 × 10^{−5} | N/A | 6.425812 × 10^{−5} |
| µ ^{(4)} | 0.05527 | ${B}_{1}+2{B}_{2}T$ | 0.073993 | ${B}_{1}+2{B}_{2}T$ |
| $T_A$ ^{(5)} | 2019-08-16 | 2019-06-12 | 2020-03-04 | 2019-07-07 |
| $R^2$ ^{(6)} | 0.74468 | 0.74550 | 0.82732 | 0.84360 |
| lnL ^{(7)} | −235,496.173 | −235,359.894 | −2,677,701.465 | −2,631,291.332 |
| AIC ^{(7)} | 470,996.346 | 470,725.788 | 5,355,408.930 | 5,262,588.663 |
| BIC ^{(7)} | 471,015.015 | 470,753.793 | 5,355,444.288 | 5,262,624.021 |

^{(1)} Model 1 treats evolutionary rate µ as constant.

^{(2)} Model 2 treats evolutionary rate µ as a linear function of time, as in Equation (1).

^{(3)} $B_0$, $B_1$, $B_2$: regression coefficients as in Equation (1).

^{(4)} µ: evolutionary rate in number of mutations per genome per day.

^{(5)} $T_A$: date of the most recent common ancestor (MRCA) of the sampled genomes.

^{(6)} $R^2$: coefficient of determination from the order-2 polynomial.

^{(7)} Log-likelihood (lnL) of the models and the associated AIC and BIC for model selection.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Xia, X. Rooting and Dating Large SARS-CoV-2 Trees by Modeling Evolutionary Rate as a Function of Time. *Viruses* **2023**, *15*, 684.
https://doi.org/10.3390/v15030684