The Spike Protein of SARS-coV2 19B (S) Clade Mirrors Critical Features of Viral Adaptation and Coevolution

Pathogens including viruses evolve in tandem with diversity in their animal and human hosts. For SARS-coV2, the focus is generally for understanding such coevolution on the virus spike protein, since it demonstrates high mutation rates compared to other genome regions, particularly in the receptor-binding domain (RBD). Viral sequences of the SARS-coV2 19B (S) clade and variants of concern from different continents were investigated, with a focus on the A.29 lineage, which presented with different mutational patterns within the 19B (S) lineages in order to learn more about how SARS-coV2 may have evolved and adapted to widely diverse populations globally. Results indicated that SARS-coV2 went through evolutionary constrains and intense selective pressure, particularly in Africa. This was manifested in a departure from neutrality with excess nonsynonymous mutations and a negative Tajima D consistent with rapid expansion and directional selection as well as deletion and deletion–frameshifts in the N-terminal domain (NTD region) of the spike protein. In conclusion, we hypothesize that viral transmission during epidemics through populations of diverse genomic structures and marked complexity may be a significant factor for the virus to acquire distinct patterns of mutations within these populations in order to ensure its survival and fitness, explaining the emergence of novel variants and strains.


Introduction
SARS-coV2 is a member of the Coronaviridae family, with a wide range of viruses that affect humans, animals, or both [1]. Pathogens are being shown to coevolve in tandem with diversity in animals and humans [2][3][4][5][6]. The exact pathway and timeline of the virus's emergence and appearance of human cases is still unknown. Many theories exist to track the possible evolution of SARS-coV2 from animals, including recombination events that began with bat corona viruses (RmYN02, RpYN06, and PrC31); it was found to be the closest ancestor for the virus in the whole genome apart from the spike protein in which RaTG13 bat-derived virus is the closest. No intermediate host has been determined so far [2]. In the past two years, the virus evolved to give 12 variants, five of them are dominating and each with specific unique set of mutations. , which was reported in early November 2021. These variants have been divided into three categories either being variants of concern, interest or high consequences, the last which does not include any variant to date [7]. Although they circulate the globe, some dominate specific countries. SARS-coV2 mutations are mainly concentrated on the spike protein and open reading frame 1 (ORF1), but as time passed, mutations expanded to include other open reading frames (ORFs) and structural proteins including the membrane, envelope and nucleocapsid proteins.
From the host side, a Spike protein's main function is coupling angiotensin-converting enzyme 2 receptor (ACE2), recognizing and fusing to facilitate viral entry to the host cell [8]. With the emergence of more transmissible and mutable variants, understanding the evolutionary characteristic of SARS-coV2 s spike genomic region is critical for predicting the path that reinfection, vaccination, and therapeutics will take [9]. The spike protein contains several conserved areas, but the region of RBD in the S1 subunit is the highest mutable region of the spike. S1 is where the initiation of the attachment of the virus to the ACE2 starts [10,11]. The SARS-coV2 N-terminal domain (NTD region) of the spike protein also brought attention because its evolution is related to alteration of the viral antigenicity and promoting immune escaping. The more mutations and/or deletions in this region, the faster SARS-coV2 will adapt, evolve and evade the immune system [12]. The mutational robustness found in this virus demonstrates its strength to tolerate host range expansion and adaptation, phenotypic plasticity or environmental stressors such as temperature, virulence or attenuation, antigenicity and immune escape [13]. Furthermore, the uncontrolled community transmission of SARS-coV2 increases the possibility of the emergence of more transmissible variants, which is determined by host diversity in specific countries [14,15].
Here, we investigate SARS-coV2 sequences: specifically, the 19B (S) clade, the first dominating variant after the ancestral virus of Wuhan [16], obtained from various countries and continents in order to understand the sequel of evolutionary processes and viral adaptation in disparate environments, particularly the putative effect of the population variation on its distribution and mutational variations.

Study Design
This is a retrospective cross-sectional study to address the global variation in the SARS-coV2 evolution and how it may have adapted to the host selective pressure. This analysis is for the 19B (S) clade, the first dominating variant after the ancestral virus of Wuhan, covering the period from the start of the pandemic until February 2022.

Viral Genome Sequences
A total of 15,537 viral sequences of the 19B (S) clade were procured from the GISAID EpiCoV database [17] for the different continents at the period from the start of the pandemic until February 2022. Sequences of the lineages from A to A30, Bat, Pangolin as well as the variants of concern including Alpha, Beta, Gamma, Delta, Lambda and Omicron were also downloaded from the same database. They were chosen according to their first appearance in the collection dates. More focus was on the 19B (S) lineage A.29 that has been downloaded also.
We filtered sequences for the study based on the length of the sequence not to be less than 29,000 nucleotides, percentage of gaps, N stretches (unidentified nucleotides) of less than 5%, lack of clusters of mutations, and overlapping of reading frames. It was evaluated using the GISAID EpiCoV database and NEXTSTRAIN web tool [16,17].

Sequences Analysis
Sequences were analyzed using the Next-Generation Sequencing (NGS) analysis packages targeting areas of variations in all sequences included in the analysis, the COVID-19 genome annotator online tool for annotation and the NEXTSTRAIN online tool for defining clades for each sequence [16,18]. Sequences were aligned to SARS-coV2 isolate Wuhan-Hu-1 with the accession number NC 045512.2S, and mutations were examined. The variations of nonsynonymous and synonymous mutations were estimated for the whole genome and the spike protein for each sample.

Phylogenetic Analysis
Phylogenetic analysis was carried out for those variants using the MEGAX software [19]. Maximum likelihood analysis was performed with bootstrapping (1000 iterations).

Evolutionary Distance and Neutrality Testing
The evolutionary distance and neutrality were tested by the Tajima neutrality test using MEGAX software. p values were estimated for the proportion of synonymous to nonsynonymous as an indicator of neutrality according to Kimura [20].

19B (S) A.29 Lineage Analysis
The lineage of 19B (S) clade A.29 was taken as an example showing different patterns of mutations. Using a Python script, the frequency of mutations was calculated for each country involving the whole genome. Countries that reported A.29 were from Africa (Gambia and Sudan), Asia (India and Jordan), Europe (United Kingdom, Germany, and Belgium) as well as Oceania (Australia) and North America (the United States of America and Canada).

Data Visualization
The frequency of mutations for the lineage A.29 was displayed on bar charts and tables. It was displayed also in a timeline chart. Secondary RNA structures for the spike protein were then constructed by first extracting the Spike protein region using the seqkit command [21] and then uploading the sequences to the RNAFOLD online tool [22]. For the secondary structures created, the ensemble diversity and Minimal Free Energy (MFE) were estimated. To show the variation in the structure of the spike protein and the NTD region, a 3D model was created using the EXPASY translation [23] and the SWISSMODEL online tools [24].

The Variation of the 19 (S) Clade across Continents
The distribution of the 19B (S) clade samples varied, expectedly, between and within countries due to the wide differences in the sequencing efforts. Findings revealed the existence of different patterns of distribution of the 19B (S) clade lineages across continents, in which for example, the A.23, A and A.27 were dominating Africa with approximate percentages of 39.8%, 16.8% and 14.9%, respectively, while the A lineage dominated Asia with 61.1%. The A.2 lineage was found to be common in Europe, North America, South America and Oceania with approximate percentages of 37.2%, 40.6%, 76.2% and 56.9%, respectively. In Africa, the lineages A18 to A30 were found to be more common than in other continents (Supplementary Table S1).

Nonsynonymous Mutations Found to Be Exceeding in Africa
In Africa, the total number of the 19B(S) clade samples was 1147, the total number of nonsynonymous to synonymous mutations was 1468\1013, and there were 28 deletions and 18 deletion-frameshifts across the whole genome. Whereas in North America, where n = 8313, there were 3822 nonsynonymous mutations, 2225 synonymous, and 55 deletions and deletions-frameshifts (each) ( Table 1).
The same pattern of variation applies to other lineages and variants of concern. Based on the dates of collection reported in the databases combined with phylogenetic analysis, 11 of the lineages were found to have originated most likely in Africa including Senegal  The same pattern of variation applies to other lineages and variants of concern. Based on the dates of collection reported in the databases combined with phylogenetic analysis, 11 of the lineages were found to have originated most likely in Africa including Senegal

Tajima's Neutrality Test Returned a Negative D Value Hence Directional Selection
Signals of selection were manifested in the excess of nonsynonymous mutations in the global sample x235.7 (p = 0.0001), particularly in the African continent in comparison to other continents (Z = 3.91 p = 0.0001) despite the small sample size. Tajima's neutrality test returned a negative D value of -2.646764, which is consistent with expansion and directional selection. lineages sequences of the 19B (S) reported in the databases based on their dates of collection. They were 33 sequences in number. Note: Red (Africa), Yellow (Asia), Blue (Europe), Purple (North America) and Green (South America).

Tajima's Neutrality Test Returned a Negative D Value Hence Directional Selection
Signals of selection were manifested in the excess of nonsynonymous mutations in the global sample x235.7 (p = 0.0001), particularly in the African continent in comparison to other continents (Z = 3.91 p = 0.0001) despite the small sample size. Tajima's neutrality test returned a negative D value of -2.646764, which is consistent with expansion and directional selection.

Mutational Variation in the A.29 Lineage of the 19B (S) Clade and the Spike Protein Structural Variation
The A.29 lineage of the 19B (S) clade presented with a specific mutational pattern and hence was selected to address questions of adaptation and coevolution.   Deletions and deletion-frameshifts were concentrated in the NTD region of the S1 subunit of the spike protein, whereas nonsynonymous mutations were scattered in different ORFs of the viral genome including the spike protein itself. In the 3 UTR of the genome, there were extra-genic mutations at coronavirus 3 stem-loop II-like motif (s2m; located in the region from 29,728 to 29,768) at position 29,742 and 29,739 (Supplementary Figure S3).
Based on the CoVariants website [25], which gives an overview of SARS-coV2 variants and their mutations, which is supported by data from the GISAID platform, mutations from the above-mentioned samples were found mainly in those variants of concern as nonsynonymous mutations including: S:N501Y dominating 20I (Alpha, V1), 20H (Beta, V2) and 20J (Gamma, V3), S:H655Y dominating 20J (Gamma, V3), 21K (Omicron) and 21L (Omicron), M:I82S dominating 21B (Kappa), S:V143 and S:N211 dominating 21K (Omicron), and finally the extragenic mutation in the 3 UTR:29,742 in 21A, I and J (Delta) and 21B (Kappa) as the G nucleotide substituted to T, which is a synonymous mutation, but in those samples, it was substituted with an A nucleotide. None of the known variants showed substitution at 3 UTR: 29,739, which changed from C to T.
Visualization of structural variations in the spike protein were presented in secondary and 3D structures, all in comparison to the reference genome of SARS-coV2, Bat-derived virus of Yunnan, and Pangolin. There were small variations from the reference genome in the minimal free energy and ensemble diversity in secondary structures, and multiple areas of deletions were demonstrated in the 3D structures ( Supplementary Figures S4-S10).
Those samples were scattered across the pandemic years. The majority appeared in 2021 with a peak in March, with the exception of samples from Kombo city in Gambia and Kassala city in Sudan, whose samples were reported in August and December 2020, respectively (Figure 3).
Deletions and deletion-frameshifts were concentrated in the NTD region of the S1 subunit of the spike protein, whereas nonsynonymous mutations were scattered in different ORFs of the viral genome including the spike protein itself. In the 3′ UTR of the genome, there were extra-genic mutations at coronavirus 3′ stem-loop II-like motif (s2m; located in the region from 29,728 to 29,768) at position 29,742 and 29,739 (Supplementary Figure S3).
Based on the CoVariants website [25], which gives an overview of SARS-coV2 variants and their mutations, which is supported by data from the GISAID platform, mutations from the above-mentioned samples were found mainly in those variants of concern as nonsynonymous mutations including: S:N501Y dominating 20I (Alpha, V1), 20H (Beta, V2) and 20J (Gamma,V3), S:H655Y dominating 20J (Gamma,V3), 21K (Omicron) and 21L (Omicron), M:I82S dominating 21B (Kappa), S:V143 and S:N211 dominating 21K (Omicron), and finally the extragenic mutation in the 3′ UTR:29,742 in 21A, I and J (Delta) and 21B (Kappa) as the G nucleotide substituted to T, which is a synonymous mutation, but in those samples, it was substituted with an A nucleotide. None of the known variants showed substitution at 3′ UTR: 29,739, which changed from C to T.
Visualization of structural variations in the spike protein were presented in secondary and 3D structures, all in comparison to the reference genome of SARS-coV2, Bat-derived virus of Yunnan, and Pangolin. There were small variations from the reference genome in the minimal free energy and ensemble diversity in secondary structures, and multiple areas of deletions were demonstrated in the 3D structures ( Supplementary Figures S4-S10).
Those samples were scattered across the pandemic years. The majority appeared in 2021 with a peak in March, with the exception of samples from Kombo city in Gambia and Kassala city in Sudan, whose samples were reported in August and December 2020, respectively ( Figure 3).

Discussion
For a virus such as SARS-coV2, to jump the species barriers, it needs to pass through stages of natural selection and incur more genomic changes to increase its affinity to the human genomic counterparts including the ACE2. This was applied to the structure of the spike protein of the Bat-derived virus of Yunnan and Pangolin samples, which showed multiple deletions and deletion-frameshifts not represented in the reference genome of SARS-coV2 [26]. Viruses are able to evolve within the host itself, resulting in variants and strains that could infect and transmit better [27]. Once this happens in a certain population, a variant will form clusters and allow for more transmission and escaping of the immune surveillances of the host. Still, in continents such as in Africa, where highly diverse populations exist [28][29][30], the dynamics and timeline for the virus evolution and adaptation is not well understood due to the lack of adequate sampling. Taking Sudan as an example in which the 19B (S) clade was dominating and represented with different patterns of mutations in high frequency (49%) could be a clue for the virus track to adapt with population variability in order to increase its survival [26,31]. An example of some mutations found is the H655Y substitution, which enhances spike cleavage and viral growth [32]. N501Y and Y449H substitutions are believed to increase the virus transmissibility and fitness, although it has been shown that the Y449H substitution alone leads to a decrease in the affinity of binding to the ACE2 receptor, but when it came along with the N501Y, it was speculated that they will enhance binding affinity and escaping, based on epistatic shifting and modulation [33]. In addition, there is the deletion and deletion-frameshift in the NTD region, which plays an important role in conformation of the spike protein structure, binding to ACE2 and immune escaping, although it shows a higher rate of conservation [34]. These mutational patterns are scattered in different variants and strains of concern and interest, rather than being grouped in one variant including the 19B (S) strain. Patterns of similar mutations were shown in a few other countries including Gambia, India, Jordan, the United Kingdom, Germany, Belgium, the United States of America, Canada, and Australia with low frequencies, possibly due to migration, since the first samples were detected in Gambia in August followed by Sudan in December in the year 2020 based on collection dates reported in the GISAID platform.
Deletions could be either deleterious, in which the virus starts reverting to its ancestral state, or compensatory with changes that could mask the effect of the previous mutations. Such a mechanism will guarantee recovery of fitness, all depending on the effective population size of the organism. The mechanism is common in the RNA viruses [35]. For the virus to have a deletion-frameshift followed by deletion means that the virus is going through an intense selective pressure and through a survival strategy to overcome variation, which has been shown in the Omicron and Delta variants that manifest such a breadth of deletions in the S1 subunit [36]. By comparing SARS-coV2 to the SARS-coV2-related coronavirus (SC2r-CoV) lineages from bats and pangolins, researchers found out that the NTD area's gained indels during viral transmission across animals are the same as those reported during human transmissions [8]. According to one theory, increasing global human population immunity to natural SARS-coV2 infection is driving such a large selection pressure on the virus genome, which may select for convergent deletions at NTD to avoid being neutralized by neutralizing antibodies against NTD [37]. Another study demonstrated that rather than working in an antibody avoidance mechanism, NTD deletion in certain locations boosts viral infectivity by boosting the incorporation of cleaved S into virion [14].
As an RNA virus, SARS-coV2 could affect the host as a quasispecies population; hence, for indels mutation, it can be maintained, which will not be the case if the infected host came as one clonal population with the same genome sequences [38][39][40]. In addition, these quasispecies are maintained after selection and bottleneck events [41]. Having an adaptive landscape will allow overcoming the high functional constrains of the virus for more transmissibility and infectivity [35]. However, this landscape should not be from the pathogen side alone. A hypothesis called the Red Queen mentioned that the host and pathogen reciprocally struggle to maintain constant levels of fitness [42], and the virus copy number and mutational rate are highly determined by the variability within the host [36,43,44]. Many studies investigated the coadaptation of different ancient viruses; taking Adenovirus lineages as an example, it appeared to have speciated and coevolved with different vertebrate families, and host shifts to new taxa were accompanied by changes in genome content and those non-coevolved showed more severe pathology [43]. Others studied how this relationship can affect different part of the human genome: for example, the hepatitis C virus is able to affect reversibly the diversity of mitochondrial DNA, in which it can be used as a biomarker to investigate the stage of infection [44]. Other studies mentioned that having some genetic markers represented in certain population has a large effect on the disease progression: for example, the representation of the Y chromosome ancestry marker R1b1b2 in certain continents including in Africa with low frequencies is associated with a lower mortality rate from COVID-19 infection [45]. Interestingly, African viral sequences appeared to be ancestral to most of the lineages of the 19B (S) clade and for two variants of concern including Beta and Omicron, making the continent together with Asia the site for the emergence of these variants and in tally with the hypothesis that places of high effective population size are probably the main sources of novel variants. Tajima's Neutrality test D value is consistent with rapid expansion and directional selection, which are major features of the pandemic. In numerous studies, it was discovered that Africans have higher nucleotide and haplotype diversity than non-Africans [5,[28][29][30]. In another study, ACE2 receptors are presented with some different nonsynonymous mutations in African populations which were identified as signatures of selection affecting variation at regulatory regions related to ACE2 expression [46] along with multiple risk factors such as HIV coinfection, tuberculosis, anemia and population age [47][48][49].
The dearth of viral sequences from the African continent in this manuscript might be related to the paucity of sampling due to poorly resourced and ailing health systems or perhaps to the relatively low disease burden. Circumventing such hurdles is essential for a lucid understanding of the epidemiology of the pandemic and in verifying concepts such as the above [50,51].

Conclusions
Finally, for the virus to evolve within complex populations characterized by a higher effective population size, it may be pivotal for the virus to acquire distinct patterns of mutations that ensure its survival and fitness, thus becoming a source of novel variants of concern, as was observed early on in the 19 (S) clade of SARS-coV2, specifically in the A.29 lineage, which was first dominating the African continent and revealed distinct mutational variations that were scattered in other variants of concern. These variations raise many questions related to the function of the mutations and structural motif within the host that may lead to better viral fitness or transmissibility.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/microorganisms10102017/s1, Table S1: The distribution in the 19B (S) lineages in the different contents, in which the A.23 and A.27 were dominating Africa with approximate percentages of 39.67% and 14.996% respectively, while the A Lineage dominated Asia with 61.1%. The A.2 found to be common in Europe, North America, South America and Oceania with approximate percentages of 37.23%, 40.623%, 76.23% and 56.93% respectively Figure S1: The frequency of the shared synonymous mutations in samples of 19B (S) clade A.29 lineage with different pattern among countries over the whole genome include NSP16:A199A, NSP16:K137K, NSP9:N95N, NSP4:S76S, NSP4 - Figure S2: The frequency of the shared Deletion and Deletion-Frameshifts in samples of 19B (S) clade A.29 lineage with different pattern among countries over the whole genome include S: F140, S: V143, S: N211. Figure S3: The frequency of shared Extragenic mutations in the 3 UTR region in samples of 19B (S) clade A.29 lineage with different pattern among countries over the whole genome. Figure S4: A. Secondary structure, for the sample hCoV-19_Wuhan_HB-WH5-234_2020_EPI_ISL_455393_2020-03-18 (Reference: accession number NC 045512.2S), Minimum free energy of −1072.20 kcal/mol, the ensemble diversity is 898.62. B. 3D structure of the spike protein using the template 7cn8.1.A, seq identity: 91.68% with the description of Glycoprotein Cryo-EM structure of PCoV_GX spike