Using an Unsupervised Clustering Model to Detect the Early Spread of SARS-CoV-2 Worldwide
Abstract
1. Introduction
2. Materials and Methods
2.1. SARS-CoV-2 Sequencing Collection
2.2. Mutation Calling and Phylogeny Reconstruction
2.3. Feature Extraction and Data Clustering
2.4. Pairwise Genetic Distance
2.5. Simpson’s Diversity Index
2.6. Inferring Positive/Purifying Selection of Individual Sites
2.7. Pairwise Mutation Dependency Score
2.8. Statistical Analysis and Data Visualization
3. Results
3.1. Genetic Analysis Revealed High Genetic Diversity and Rapid Proliferation of SARS-CoV-2
3.2. Clustering of SARS-CoV-2 Displayed Varied Proportions of the Clusters in Different Continents
3.3. The Genetic Variance Analyses Indicated High Diversity between Clusters
3.4. Exploring the Mutations That Shaped the Geographical Distribution of Population Structure
3.5. The Global Spread of SARS-CoV-2
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Continent | Cluster A | Cluster B | Cluster C | Cluster D | Cluster E | Cluster F | Total |
---|---|---|---|---|---|---|---|
Africa | 3 | 4 | 65 | 7 | 10 | 9 | 98 |
Asia | 38 | 648 | 248 | 217 | 57 | 116 | 1324 |
Europe | 1137 | 990 | 3119 | 212 | 1108 | 2961 | 9527 |
North America | 94 | 334 | 625 | 1268 | 2274 | 170 | 4765 |
Oceania | 110 | 161 | 233 | 196 | 191 | 149 | 1040 |
South America | 6 | 5 | 44 | 10 | 5 | 49 | 119 |
Total | 1388 | 2142 | 4334 | 1910 | 3645 | 3454 | 16,873 |
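The cluster counts in the table above can be summarized per continent with a diversity measure. The sketch below assumes the common Gini-Simpson form of Simpson's diversity index, D = 1 - sum(p_i^2), applied to the proportion p_i of each cluster within a continent. That is an illustrative choice: the exact formulation and the unit it is applied to in Section 2.5 are not shown here. The function name `simpson_diversity` is hypothetical.

```python
# Cluster counts per continent, copied from the table above (columns A-F).
counts = {
    "Africa": [3, 4, 65, 7, 10, 9],
    "Asia": [38, 648, 248, 217, 57, 116],
    "Europe": [1137, 990, 3119, 212, 1108, 2961],
    "North America": [94, 334, 625, 1268, 2274, 170],
    "Oceania": [110, 161, 233, 196, 191, 149],
    "South America": [6, 5, 44, 10, 5, 49],
}


def simpson_diversity(cluster_counts):
    """Gini-Simpson index: 1 - sum(p_i^2) over cluster proportions p_i.

    Assumed form for illustration; 0 means a single cluster dominates
    completely, and the maximum for k clusters is 1 - 1/k.
    """
    total = sum(cluster_counts)
    if total == 0:
        raise ValueError("cannot compute diversity for an empty sample")
    return 1.0 - sum((n / total) ** 2 for n in cluster_counts)


# Compute the index for every continent and list them from most to least diverse.
diversity = {name: simpson_diversity(row) for name, row in counts.items()}
for name, value in sorted(diversity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {value:.3f}")
```

A value close to the six-cluster maximum of 1 - 1/6 indicates that the clusters are almost evenly represented in a continent, while a low value indicates that one cluster dominates the samples collected there.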
Mutation | Substitution | Amino Acid Substitution | Type | Gene | Frequency | Cluster A | Cluster B | Cluster C | Cluster D | Cluster E | Cluster F |
---|---|---|---|---|---|---|---|---|---|---|---|
C241T | C>T | Intron | Intron | Intron | 66.37% | 10 | 10 | 4238 | 2 | 3548 | 3391 |
T490A | T>A | D>E | N | ORF1ab | 1.04% | 0 | 0 | 1 | 174 | 0 | 0 |
T514C | T>C | H>H | S | ORF1ab | 0.97% | 0 | 162 | 1 | 0 | 0 | 0 |
C1059T * | C>T | T>I | N | ORF1ab | 21.69% | 1 | 8 | 2 | 0 | 3645 | 3 |
G1397A | G>A | V>I | N | ORF1ab | 1.12% | 0 | 186 | 0 | 0 | 1 | 2 |
G1440A | G>A | G>D | N | ORF1ab | 1.92% | 0 | 324 | 0 | 0 | 0 | 0 |
A2480G | A>G | I>V | N | ORF1ab | 3.60% | 608 | 0 | 0 | 0 | 0 | 0 |
C2558T | C>T | P>S | N | ORF1ab | 3.83% | 646 | 1 | 0 | 0 | 0 | 0 |
G2891A * | G>A | A>T | N | ORF1ab | 1.77% | 0 | 298 | 0 | 0 | 0 | 0 |
C3037T | C>T | F>F | S | ORF1ab | 67.26% | 2 | 7 | 4277 | 3 | 3611 | 3448 |
C3177T | C>T | P>L | N | ORF1ab | 1.05% | 0 | 0 | 1 | 171 | 6 | 0 |
C6312A | C>A | T>K | N | ORF1ab | 1.14% | 0 | 189 | 1 | 0 | 0 | 3 |
C8782T | C>T | S>S | S | ORF1ab | 11.42% | 1 | 21 | 5 | 1898 | 1 | 1 |
T9477A | T>A | F>Y | N | ORF1ab | 1.17% | 0 | 3 | 0 | 195 | 0 | 0 |
G11083T * | G>T | L>F | N | ORF1ab | 11.81% | 1342 | 485 | 52 | 21 | 54 | 39 |
C14408T * | C>T | P>L | N | ORF1ab | 67.47% | 1 | 8 | 4301 | 2 | 3636 | 3436 |
C14805T | C>T | Y>Y | S | ORF1ab | 9.39% | 1352 | 8 | 1 | 195 | 0 | 28 |
T17247C | T>C | R>R | S | ORF1ab | 3.00% | 500 | 5 | 1 | 0 | 0 | 0 |
C17747T * | C>T | P>L | N | ORF1ab | 6.92% | 1 | 0 | 0 | 1165 | 1 | 0 |
A17858G | A>G | Y>C | N | ORF1ab | 7.05% | 1 | 1 | 0 | 1187 | 0 | 0 |
C18060T | C>T | L>L | S | ORF1ab | 7.16% | 0 | 3 | 2 | 1202 | 1 | 0 |
T18736C | T>C | F>L | N | ORF1ab | 1.01% | 0 | 0 | 1 | 169 | 0 | 0 |
C18877T | C>T | L>L | S | ORF1ab | 2.67% | 2 | 2 | 440 | 4 | 0 | 2 |
A20268G | A>G | L>L | S | ORF1ab | 4.61% | 0 | 1 | 773 | 3 | 0 | 1 |
A23403G * | A>G | D>G | N | S | 67.65% | 4 | 4 | 4316 | 6 | 3634 | 3451 |
C23731T | C>T | T>T | S | S | 1.68% | 0 | 0 | 0 | 0 | 1 | 282 |
C23929T | C>T | Y>Y | S | S | 1.13% | 0 | 186 | 1 | 0 | 1 | 2 |
C24034T | C>T | N>N | S | S | 1.16% | 0 | 2 | 1 | 187 | 4 | 1 |
G25563T * | G>T | Q>H | N | ORF3a | 26.44% | 1 | 3 | 829 | 2 | 3625 | 2 |
G25979T | G>T | G>V | N | ORF3a | 1.16% | 0 | 2 | 1 | 193 | 0 | 0 |
G26144T * | G>T | G>V | N | ORF3a | 8.61% | 1387 | 62 | 0 | 1 | 1 | 1 |
T26729C | T>C | A>A | S | M | 1.07% | 0 | 1 | 1 | 179 | 0 | 0 |
C27046T | C>T | T>M | N | M | 2.13% | 0 | 1 | 5 | 0 | 0 | 353 |
G28077C | G>C | V>L | N | ORF8 | 1.13% | 0 | 1 | 1 | 188 | 0 | 0 |
T28144C * | T>C | L>S | N | ORF8 | 11.36% | 0 | 10 | 1 | 1903 | 2 | 0 |
C28657T | C>T | D>D | S | N | 1.21% | 0 | 3 | 3 | 196 | 1 | 2 |
T28688C | T>C | L>L | S | N | 1.07% | 0 | 178 | 1 | 0 | 1 | 0 |
C28863T | C>T | S>L | N | N | 1.19% | 1 | 2 | 2 | 193 | 2 | 0 |
G28881A | G>A | R>K | N | N | 20.54% | 4 | 3 | 3 | 1 | 1 | 3453 |
G28882A | G>A | R>K 1 | N | N | 20.49% | 1 | 2 | 0 | 0 | 0 | 3454 |
G28883C | G>C | G>R | N | N | 20.49% | 1 | 2 | 1 | 0 | 0 | 3453 |
A29700G | A>G | Intron | Intron | Intron | 1.04% | 0 | 0 | 4 | 167 | 4 | 1 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, Y.; Liu, Q.; Zeng, Z.; Luo, Y. Using an Unsupervised Clustering Model to Detect the Early Spread of SARS-CoV-2 Worldwide. Genes 2022, 13, 648. https://doi.org/10.3390/genes13040648