# Dating the Common Ancestor from an NCBI Tree of 83688 High-Quality and Full-Length SARS-CoV-2 Genomes


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. TRAD: Tip Rooting and Ancestor Dating

The distance from Node 1 to each tip ($D_{1j}$, where j = 1, 2, ..., 12 corresponds to the 12 sequences labelled S1 to S12 in Figure 1A) is, according to the molecular clock hypothesis, expected to be proportional to the tip sampling time ($T_j$). That is, a longer $D_{1j}$ corresponds to a proportionally later $T_j$. Figure 1B shows the 12 $T_j$ values and the corresponding $D_{1j}$, $D_{2j}$, and $D_{3j}$ values for internal Nodes 1, 2, and 3, respectively. If Node 1 is the true root, then the $D_{1j}$ column should be more highly correlated with the $T_j$ column (Figure 1B) than the other $D_{ij}$ columns are. Given that the Pearson correlation ${r}_{P}$ between $D_{1j}$ and $T_j$ is the largest among all internal nodes, Node 1 is deemed closer to the true root than the other internal nodes. The method also traverses the tree along the branches to find the rooting point with the highest ${r}_{P}$.
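The node-comparison step can be sketched as follows. This is a minimal illustration, not the TRAD implementation: the sampling times and node-to-tip distances are hypothetical values (they are not the entries of Figure 1B), the helper `pearson_r` is introduced here for illustration, and the sketch compares internal nodes only, without the additional search along branches.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical sampling times (days) for six tips; NOT the values in Figure 1B.
tip_times = [0, 10, 20, 30, 40, 50]

# Hypothetical node-to-tip distances for three candidate internal nodes.
node_to_tip = {
    "Node1": [0.1, 0.6, 1.0, 1.6, 2.0, 2.5],  # grows steadily with sampling time
    "Node2": [1.1, 0.6, 1.0, 1.6, 2.0, 2.5],  # one early tip sits far from the node
    "Node3": [2.1, 1.6, 1.0, 1.6, 2.0, 2.5],  # two early tips sit far from the node
}

# Correlate each node's distance column with the sampling times.
r_by_node = {node: pearson_r(d, tip_times) for node, d in node_to_tip.items()}

# The candidate with the highest correlation is taken as closest to the root.
best_node = max(r_by_node, key=r_by_node.get)
```

With these toy numbers the distances measured from `Node1` track sampling time most closely, so `Node1` is selected.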

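Letting D be the root-to-tip distance of a tip, T its sampling time, $T_A$ the time of the MRCA, and μ the evolutionary rate, the molecular clock gives us the following linear relationship to estimate $T_A$ and μ:

$$D = \mu\,(T - T_A) = -\mu T_A + \mu T$$

so that a regression of D on T has slope μ and intercept $-\mu T_A$.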

Regressing the $D_{1j}$ column on the $T_j$ column (Figure 1B) yields μ = 0.045 and $T_A$ = 9 December 2019. However, the best rooting position (with the highest r) lies along the branch between Node 1 and Node 5, at a distance of 0.087117 from Node 1 (Figure 1). This root gives μ = 0.046 (95% CI: 0.039 to 0.054) and $T_A$ = 10 December 2019. I will discuss later alternative approaches for estimating variation in $T_A$.
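The regression step can be illustrated with a short sketch. It assumes an ordinary least-squares fit of D on T, with the slope read as μ and $T_A$ recovered as −a/b (the reading given in the Figure 2 caption). The function `fit_line`, the four sampling dates, the rate of 0.05 per day, and the ancestor date of 1 December 2019 are all hypothetical and serve only to show that the fit recovers the values used to generate the data.

```python
from datetime import date

def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical generating values (assumptions, not estimates from the paper).
true_mu = 0.05
true_ancestor = date(2019, 12, 1)

# Hypothetical sampling dates, converted to day counts (proleptic Gregorian ordinals).
sampling_dates = [date(2020, 1, 10), date(2020, 2, 9), date(2020, 3, 20), date(2020, 5, 1)]
T = [d.toordinal() for d in sampling_dates]

# Root-to-tip distances under a strict clock: D = mu * (T - T_A).
D = [true_mu * (t - true_ancestor.toordinal()) for t in T]

a, b = fit_line(T, D)
mu_hat = b                                   # slope = evolutionary rate
t_a_hat = date.fromordinal(round(-a / b))    # x-intercept = ancestor date
```

Real data scatter around the line, so the recovered date would carry uncertainty; here the toy data are noise-free by construction.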

#### 2.2. The Phylogenetic Tree of 86582 SARS-CoV-2 Genomes

## 3. Results and Discussion

#### 3.1. Ancestor of Sampled SARS-CoV-2 Is Dated 16 August 2019

The regression of root-to-tip distance on sampling date gives $T_A$ = 16 August 2019 (Figure 2). The resulting rooted tree and the estimated root-to-tip distance for each genome are included as Tree_GoodDates_Rooted.dnd and Tree_GoodDates_RootToTipD.txt, respectively, within the Supplemental File DatingData.zip. I should emphasise that $T_A$ is not for the common ancestor of all SARS-CoV-2 lineages but only for the common ancestor of the sampled SARS-CoV-2 genomes. Only when we have included the most ancient lineages of SARS-CoV-2 would $T_A$ approximate the time of origin of SARS-CoV-2.

Earlier genomic data would have given $T_A$ = 13 December 2019. The genomic data available by 21 March 2020 would have given $T_A$ = 4 December 2019. The genomic data available by 8 May 2020 would have given $T_A$ = 20 October 2019. The large tree from NCBI released on 3 April 2021 pushes $T_A$ back to 16 August 2019. As illustrated in Figure 1C, only when our samples include the earliest lineages of SARS-CoV-2 can $T_A$ approximate the time of origin of all SARS-CoV-2 viruses. If our sample includes only those red lineages in Figure 1C, then it is impossible for us to date the common ancestor to Node 4.

#### 3.2. Assessing the Variation of the Estimated Origin Date of the MRCA

If all SARS-CoV-2 genomes evolved at exactly the same rate, then $T_i - D_i/\mu$ (where $T_i$ and $D_i$ are the collection time and root-to-tip distance for genome i, respectively) would all be equal to 16 August 2019. However, SARS-CoV-2 genomes do not evolve at exactly the same rate, so $T_i - D_i/\mu$ will not all be equal to 16 August 2019. If we designate $T_{Ai} = T_i - D_i/\mu$, then variation in $T_{Ai}$ could serve as an estimate of variation in the estimated ancestral time for the sampled SARS-CoV-2 genomes (Figure 3, which includes the mean, standard deviation, and 95% confidence limits). This visual display is perhaps more appropriate than using the standard error of regression coefficients to attach confidence intervals to the origin of the MRCA. Because of the very large sample size, the regression-based confidence interval for the date of the MRCA origin would be within two days, which could be highly misleading because the large number of genomes are not independent samples from the same distribution. Figure 3 also highlights the point that the evolutionary rate, although often estimated as a parameter, is associated with much sampling variation.
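A minimal sketch of the per-genome calculation follows. The rate and the five (collection day, root-to-tip distance) pairs are hypothetical, and the interval shown is a normal approximation (mean ± 1.96 SD), which is an assumption made here; how the 95% limits in Figure 3 were obtained is not stated in this section.

```python
from statistics import mean, stdev

mu = 0.05  # hypothetical rate (mutations/genome/day), not the paper's estimate

# Hypothetical genomes: (collection day T_i, root-to-tip distance D_i).
genomes = [(100, 5.5), (150, 7.0), (200, 10.5), (250, 12.0), (300, 15.0)]

# Implied ancestor time for each genome: T_Ai = T_i - D_i / mu.
t_ai = [t - d / mu for t, d in genomes]

t_a_mean = mean(t_ai)   # centre of the T_Ai distribution
t_a_sd = stdev(t_ai)    # sample standard deviation of T_Ai

# Assumption: normal-approximation 95% limits for the T_Ai values.
interval = (t_a_mean - 1.96 * t_a_sd, t_a_mean + 1.96 * t_a_sd)
```

Because every genome contributes its own $T_{Ai}$, the spread reflects rate heterogeneity among lineages rather than the (very small) standard error of the fitted regression coefficients.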

$T_A$ being 16 August 2019 ± 20 days. Dating results from this large tree provide concrete and convincing support for the hypothesis that SARS-CoV-2 might have been transmitted cryptically among human populations months before the viral outbreak [23,24].

The time since SARS-CoV-2 and RaTG13 diverged can be approximated by ($D_{\text{SARS-CoV-2,RaTG13}}$/2)/μ. $D_{\text{SARS-CoV-2,RaTG13}}$, the average distance between SARS-CoV-2 and RaTG13, is ~0.04/site, which translates to ~1200/genome. Therefore,
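As a rough check on the arithmetic implied by the expression (D/2)/μ, the sketch below plugs in the distance quoted above together with the rate reported in the Figure 2 caption. It assumes that this rate applies unchanged over the whole period, which is a strong assumption. The output is only the result of this calculation, not a figure quoted from the text.

```python
d_per_site = 0.04      # ~average SARS-CoV-2 to RaTG13 distance per site (from the text)
d_per_genome = 1200    # ~the same distance expressed per genome (from the text)
mu = 0.055273          # mutations/genome/day (Figure 2 caption)

# Sanity check: the two distance figures imply a genome of roughly 30,000 sites.
implied_sites = d_per_genome / d_per_site

# Each of the two lineages accounts for half of the pairwise distance.
days = (d_per_genome / 2) / mu
years = days / 365.25
```

Under these assumptions the divergence time comes out on the order of decades rather than months.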

#### 3.3. Outliers in Figure 2

#### 3.4. Dating the Origin of MRCA Based on SARS-CoV-2 from Cats

A regression on the 10 genomes sampled from infected cats gives $T_A$ = 30 July 2019 (Figure 4). This exemplifies the point that a small number of data points covering a long time is more valuable than a large number of data points from a single time point. One cannot use the TRAD approach to properly date the MRCA when there are only one or two collection time points, even if each time point is represented by many genomes, as is the case for farmed minks in the US and Europe.
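One way to see why the time span matters more than the number of genomes: in an ordinary least-squares regression the variance of the slope is σ²/Sxx, where Sxx is the sum of squared deviations of the collection times from their mean. The sketch below compares two hypothetical sampling designs. The function name `sum_sq_dev` and both designs are illustrative assumptions.

```python
def sum_sq_dev(times):
    """Sum of squared deviations of collection times from their mean (Sxx)."""
    m = sum(times) / len(times)
    return sum((t - m) ** 2 for t in times)

# Hypothetical designs (collection days).
spread_design = list(range(0, 300, 30))  # 10 genomes, one every 30 days
single_day_design = [120] * 1000         # 1000 genomes, all collected on one day

sxx_spread = sum_sq_dev(spread_design)      # large: the slope is well determined
sxx_single = sum_sq_dev(single_day_design)  # zero: the slope cannot be estimated
```

With all genomes collected on one day, Sxx is zero and the regression slope (and therefore μ and $T_A$) is undefined, however many genomes are available.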

#### 3.5. The Identification of Root Is Harder Than Dating the Common Ancestor

In a coin-tossing experiment, we may conclude that the best estimate of the probability of heads ($P_{head}$) is 0.5 because this P gives us the highest likelihood. However, we cannot say that the hypothesis of $P_{head}$ = 0.5 is significantly better than the alternative hypothesis of $P_{head}$ = 0.4999 or $P_{head}$ = 0.5001. In the same way, we identify the root by the Pearson correlation (r) between root-to-tip distance (D) and collection time (T). The candidate rooting location that gives us the highest r is chosen as the root. The root of the tree (Figure 5), indicated by a red dot, has the highest r = 0.86295. However, its neighboring nodes have r values almost as large. Some of them differ from the maximum r only beyond the fourth decimal place (Figure 5B). We, therefore, have the same problem as discriminating between one hypothesis with $P_{head}$ = 0.5 and alternative hypotheses with $P_{head}$ = 0.4999 and $P_{head}$ = 0.5001. Rather than identifying a specific rooting point, one can only identify a general “rooting region” on the tree (somewhat equivalent to the confidence interval of a point estimate), even with the very large tree of 83688 genomes.

$T_A$ in this study. Future sampling of SARS-CoV-2 should aim to sample beyond those in Figure 5. Given that the two closest relatives of SARS-CoV-2 were isolated from bats in Yunnan, China, with RaTG13 [6] from Mojiang Hani County and RmYN02 [39] from Mengla County, one should expect bats in these regions to be the most likely to harbor viruses representing early lineages of SARS-CoV-2. In particular, the Tongguan mineshaft in Mojiang, Yunnan, where miners suffered from pneumonia-like diseases similar to COVID-19, should be a focal point of investigation [40].

## 4. Conclusions

## Supplementary Materials

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Gilbert, M.T.; Rambaut, A.; Wlasiuk, G.; Spira, T.J.; Pitchenik, A.E.; Worobey, M. The emergence of HIV/AIDS in the Americas and beyond. Proc. Natl. Acad. Sci. USA **2007**, 104, 18566–18570.
2. MacLean, O.A.; Lytras, S.; Weaver, S.; Singer, J.B.; Boni, M.F.; Lemey, P.; Kosakovsky Pond, S.L.; Robertson, D.L. Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen. PLoS Biol. **2021**, 19, e3001115.
3. Wang, H.; Pipes, L.; Nielsen, R. Synonymous mutations and the molecular evolution of SARS-CoV-2 origins. Virus Evol. **2021**, 7, veaa098.
4. Boni, M.F.; Lemey, P.; Jiang, X.; Lam, T.T.-Y.; Perry, B.; Castoe, T.; Rambaut, A.; Robertson, D.L. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat. Microbiol. **2020**, 5, 1408–1417.
5. Lytras, S.; Hughes, J.; Martin, D.; Arné de, K.; Rentia, L.; Pond, S.K.; Xia, W.; Xiaowei, J.; Robertson, D. Exploring the Natural Origins of SARS-CoV-2 in the Light of Recombination. bioRxiv 2021. Available online: https://www.biorxiv.org/content/10.1101/2021.01.22.427830v3.abstract (accessed on 1 September 2021).
6. Zhou, P.; Yang, X.-L.; Wang, X.-G.; Hu, B.; Zhang, L.; Zhang, W.; Si, H.-R.; Zhu, Y.; Li, B.; Huang, C.-L.; et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature **2020**, 579, 270–273.
7. Rambaut, A.; Lam, T.T.; Max Carvalho, L.; Pybus, O.G. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evol. **2016**, 2, vew007.
8. Xia, X. DAMBE7: New and improved tools for data analysis in molecular biology and evolution. Mol. Biol. Evol. **2018**, 35, 1550–1552.
9. Himmelmann, L.; Metzler, D. TreeTime: An extensible C++ software package for Bayesian phylogeny reconstruction with time-calibration. Bioinformatics **2009**, 25, 2440–2441.
10. To, T.-H.; Jung, M.; Lycett, S.; Gascuel, O. Fast Dating Using Least-Squares Criteria and Algorithms. Syst. Biol. **2016**, 65, 82–97.
11. Volz, E.M.; Frost, S.D.W. Scalable relaxed clock phylogenetic dating. Virus Evol. **2017**, 3, vex025.
12. Buonagurio, D.A.; Nakada, S.; Parvin, J.D.; Krystal, M.; Palese, P.; Fitch, W.M. Evolution of human influenza A viruses over 50 years: Rapid, uniform rate of change in NS gene. Science **1986**, 232, 980–982.
13. Gojobori, T.; Moriyama, E.N.; Kimura, M. Molecular clock of viral evolution, and the neutral theory. Proc. Natl. Acad. Sci. USA **1990**, 87, 10015–10018.
14. Drummond, A.; Pybus, O.G.; Rambaut, A. Inference of viral evolutionary rates from molecular sequences. Adv. Parasitol. **2003**, 54, 331–358.
15. Xia, X. DAMBE5: A comprehensive software package for data analysis in molecular biology and evolution. Mol. Biol. Evol. **2013**, 30, 1720–1728.
16. Xia, X.; Yang, Q. A distance-based least-square method for dating speciation events. Mol. Phylogenet. Evol. **2011**, 59, 342–353.
17. Xia, X. TRAD: Tip-Rooting and Ancestor-Dating; University of Ottawa: Ottawa, ON, Canada, 2021.
18. Korber, B.; Fischer, W.M.; Gnanakaran, S.; Yoon, H.; Theiler, J.; Abfalterer, W.; Hengartner, N.; Giorgi, E.E.; Bhattacharya, T.; Foley, B.; et al. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus. Cell **2020**, 182, 812–827.e819.
19. Hatcher, E.L.; Zhdanov, S.A.; Bao, Y.; Blinkova, O.; Nawrocki, E.P.; Ostapchuck, Y.; Schäffer, A.A.; Brister, J.R. Virus Variation Resource—Improved response to emergent viral outbreaks. Nucleic Acids Res. **2017**, 45, D482–D490.
20. Vakatov, D. The NCBI C++ Toolkit Book; National Center for Biotechnology Information (US): Bethesda, MD, USA, 2009; Available online: https://ncbi.github.io/cxx-toolkit/ (accessed on 1 September 2021).
21. Apolone, G.; Montomoli, E.; Manenti, A.; Boeri, M.; Sabia, F.; Hyseni, I.; Mazzini, L.; Martinuzzi, D.; Cantone, L.; Milanese, G.; et al. Unexpected detection of SARS-CoV-2 antibodies in the prepandemic period in Italy. Tumori J. **2020**.
22. Amendola, A.; Canuti, M.; Bianchi, S.; Kumar, S.; Fappani, C.; Gori, M.; Colzani, D.; Pond, S.L.; Miura, S.; Baggieri, M.; et al. Molecular Evidence for SARS-CoV-2 in Samples Collected From Patients with Morbilliform Eruptions Since Late Summer 2019 in Lombardy, Northern Italy. 2021. Available online: https://ssrn.com/abstract=3883274 (accessed on 1 September 2021).
23. Zhang, Y.Z.; Holmes, E.C. A Genomic Perspective on the Origin and Emergence of SARS-CoV-2. Cell **2020**, 181, 223–227.
24. Andersen, K.G.; Rambaut, A.; Lipkin, W.I.; Holmes, E.C.; Garry, R.F. The proximal origin of SARS-CoV-2. Nat. Med. **2020**, 26, 450–452.
25. Qeadan, F.; Honda, T.; Gren, L.H.; Dailey-Provost, J.; Benson, L.S.; VanDerslice, J.A.; Porucznik, C.A.; Waters, A.B.; Lacey, S.; Shoaf, K. Naive Forecast for COVID-19 in Utah Based on the South Korea and Italy Models-the Fluctuation between Two Extremes. Int. J. Environ. Res. Public Health **2020**, 17, 2750.
26. Associated Press. St. George Man with New Virus Moves to Utah Hospital. Published on 29 February 2020. 2020. Available online: https://www.usnews.com/news/best-states/california/articles/2020-02-29/st-george-man-with-new-virus-moves-to-utah-hospital (accessed on 1 September 2021).
27. Oude Munnink, B.B.; Sikkema, R.S.; Nieuwenhuijse, D.F.; Molenaar, R.J.; Munger, E.; Molenkamp, R.; van der Spek, A.; Tolsma, P.; Rietveld, A.; Brouwer, M.; et al. Transmission of SARS-CoV-2 on mink farms between humans and mink and back to humans. Science **2021**, 371, 172.
28. Banerjee, A.; Doxey, A.C.; Mossman, K.; Irving, A.T. Unraveling the Zoonotic Origin and Transmission of SARS-CoV-2. Trends Ecol. Evol. **2021**, 36, 180–184.
29. Xia, X. Extreme genomic CpG deficiency in SARS-CoV-2 and evasion of host antiviral defense. Mol. Biol. Evol. **2020**, 37, 2699–2705.
30. Cui, J.; Li, F.; Shi, Z.-L. Origin and evolution of pathogenic coronaviruses. Nat. Rev. Microbiol. **2019**, 17, 181–192.
31. Niu, S.; Wang, J.; Bai, B.; Wu, L.; Zheng, A.; Chen, Q.; Du, P.; Han, P.; Zhang, Y.; Jia, Y.; et al. Molecular basis of cross-species ACE2 interactions with SARS-CoV-2-like viruses of pangolin origin. EMBO J. **2021**, e107786.
32. Rito, T.; Richards, M.B.; Pala, M.; Correia-Neves, M.; Soares, P.A. Phylogeography of 27,000 SARS-CoV-2 Genomes: Europe as the Major Source of the COVID-19 Pandemic. Microorganisms **2020**, 8, 1678.
33. Forster, P.; Forster, L.; Renfrew, C.; Forster, M. Phylogenetic network analysis of SARS-CoV-2 genomes. Proc. Natl. Acad. Sci. USA **2020**, 117, 9241–9243.
34. Gómez-Carballa, A.; Bello, X.; Pardo-Seco, J.; Martinón-Torres, F.; Salas, A. Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders. Genome Res. **2020**, 30, 1434–1448.
35. Mallapaty, S. After the WHO report: What’s next in the search for COVID’s origins. Nature **2021**, 592, 337–338.
36. Kumar, M.; Ching, L.; Astern, J.; Lim, E.; Stokes, A.J.; Melish, M.; Nerurkar, V.R. Prevalence of Antibodies to Zika Virus in Mothers from Hawaii Who Delivered Babies with and without Microcephaly between 2009–2012. PLoS Negl. Trop. Dis. **2016**, 10, e0005262.
37. Xia, X. Domains and Functions of Spike Protein in SARS-COV-2 in the Context of Vaccine Design. Viruses **2021**, 13, 109.
38. Chinazzi, M.; Davis, J.T.; Ajelli, M.; Gioannini, C.; Litvinova, M.; Merler, S.; Pastore y Piontti, A.; Mu, K.; Rossi, L.; Sun, K.; et al. The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak. Science **2020**, 368, 395.
39. Zhou, H.; Chen, X.; Tao, H.; Juan, L.; Hao, S.; Yanran, L.; Peihan, W.; Liu, D.; Yang, J.; Holmes, E.; et al. A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein. Curr. Biol. **2020**, 30, 2196–2203.
40. Rahalkar, M.C.; Bahulikar, R.A. Lethal Pneumonia Cases in Mojiang Miners (2012) and the Mineshaft Could Provide Important Clues to the Origin of SARS-CoV-2. Front. Public Health **2020**, 8, 581569.
41. Kimura, M. The Neutral Theory of Molecular Evolution; Cambridge University Press: Cambridge, UK, 1983.
42. Lanfear, R.; Kokko, H.; Eyre-Walker, A. Population size and the rate of evolution. Trends Ecol. Evol. **2014**, 29, 33–41.
43. Xia, X. Detailed Dissection and Critical Evaluation of the Pfizer/BioNTech and Moderna mRNA Vaccines. Vaccines **2021**, 9, 734.

**Figure 1.** Rationale for identifying which internal node is closest to the true root: (**A**) a tree with 12 tips (S1 to S12) whose sampling dates are appended to sequence names. Branch lengths are shown next to branches, and five internal nodes are numbered; (**B**) sampling dates of the 12 tips and node-to-tip distances from internal Nodes 1, 2, and 3 to the 12 tips. The best internal node is Node 1 with the highest r, but the best rooting position is along the branch between Nodes 1 and 5, at a distance of 0.087117 from Node 1; (**C**) samples including descendants of Node 1 can only date their MRCA at Node 1. Only when the most ancient lineage is sampled can we date the MRCA at Node 4.

**Figure 2.** Regression of root-to-tip distance (D) on sampling date (T) of 83688 SARS-CoV-2 genomes, for dating the origin time of the most recent common ancestor ($T_A$) of the sampled SARS-CoV-2 genomes and estimating the rate (μ) of sequence evolution from the regression equation (D = a + bT). μ = b = 0.055273 mutations/genome/day, and $T_A$ = −a/b = 43,693.3 = 15 August 2019 (where time is counted from 1 January 1900 as day 1, 2 January 1900 as day 2, etc.).

**Figure 3.** Frequency distribution of $T_{Ai}$ (= $T_i - D_i/\mu$) for characterizing variation in $T_A$ (the time for the origin of the common ancestor of the sampled SARS-CoV-2 genomes). The mean, standard deviation (SD), and 95% confidence interval are shown. The red arrow indicates the possible existence of another distribution, i.e., the distribution might be a mixture of two distributions.

**Figure 4.** The relationship between root-to-tip distance (D) and SARS-CoV-2 collection time (T) for non-human samples. A regression line is fitted to 10 samples from 10 infected cats (red dots). The blue arrowhead points to two overlapping red dots. The ancestor of cat SARS-CoV-2 was dated 30 July 2019.

**Figure 5.** Root identification is hard: (**A**) the tree of 83688 SARS-CoV-2 genomes, with the root indicated by a red dot and many clades collapsed to ease visualisation of viral genomes close to the root; (**B**) the same tree with different candidate rooting positions, each labelled with the Pearson correlation coefficient (r) between the root-to-tip distance and collection time. The red dot indicates the rooting point with the highest r, but a number of alternative root positions have nearly identical r.

**Figure 6.** Illustration of bias in dating the most recent common ancestor (MRCA) of SARS-CoV-2 lineages based on root-to-tip distance (D) and collection time (T), when the evolutionary rate has increased recently.


© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Xia, X.
Dating the Common Ancestor from an NCBI Tree of 83688 High-Quality and Full-Length SARS-CoV-2 Genomes. *Viruses* **2021**, *13*, 1790.
https://doi.org/10.3390/v13091790
