Molecular Epidemiology of SARS-CoV-2 during Five COVID-19 Waves and the Significance of Low-Frequency Lineages

SARS-CoV-2 lineages and variants of concern (VOC) have gained more efficient transmission and immune evasion properties with time. We describe the circulation of VOCs in South Africa and the potential role of low-frequency lineages in the emergence of future lineages. Whole genome sequencing was performed on SARS-CoV-2 samples from South Africa. Sequences were analysed with Nextstrain, pangolin tools and the Stanford University Coronavirus Antiviral & Resistance Database. In 2020, 24 lineages were detected, with B.1 (3%; 8/278), B.1.1 (16%; 45/278), B.1.1.348 (3%; 8/278), B.1.1.52 (5%; 13/278), C.1 (13%; 37/278) and C.2 (2%; 6/278) circulating during the first wave. Beta emerged late in 2020, dominating the second wave of infection. B.1 and B.1.1 continued to circulate at low frequencies in 2021 and B.1.1 re-emerged in 2022. Beta was outcompeted by Delta in 2021, which was thereafter outcompeted by Omicron sub-lineages during the 4th and 5th waves in 2022. Several significant mutations identified in VOCs were also detected in low-frequency lineages, including S68F (E protein); I82T (M protein); P13L, R203K and G204R/K (N protein); R126S (ORF3a); P323L (RdRp); and N501Y, E484K, D614G, H655Y and N679K (S protein). Low-frequency variants, together with circulating VOCs, may lead to convergence and the emergence of future lineages that may increase transmissibility and infectivity and escape vaccine-induced or natural host immunity.

In 2022, Omicron BA.2 outcompeted BA.1 from week 3, continuing through the fourth wave (Figure 1C). In addition, B.1.1 was observed again at week 5 in 2022 in a single case.

Phylogenetic Description of VOCs over Time
Compared to global data, our sequences clustered with several VOCs, of which Beta, Delta and Omicron separated into the dominant waves of infection in South Africa, as expected. These VOCs evolved from the clades 20C, 20A and 20B, respectively, after the first wave of infections (Figure 2A). Among our study samples, Alpha and Kappa were present, with 20B and 20A being the closest ancestors, respectively. An increase in the number of mutations gained across the SARS-CoV-2 whole genome was observed as new variants and their sub-lineages emerged from 2020 to 2022 (Figure 2A,B). From April to October 2020, 20A, 20B and 20C SARS-CoV-2 clades were dominant and characterised by ~8-30 mutations across their genomes. The number of mutations present in Beta increased from ~20 to 37 between November 2020 and June 2021. Several Delta sub-lineages were observed from June to November 2021, with Delta 21A having the lowest number of mutations (21-29) (Table S5).
B.1 was only observed again in July 2020, with 20 mutations, and decreased to 16 mutations by September 2020 (Figure 4B, Table S5). In January and February 2021, B.1 only had 9-15 mutations and evolved further from June (21 mutations) to August (23 mutations). From 2020 to 2021, during the first three waves of infection, B.1 retained five mutations, including P323L and G671I in the RdRp protein; and C136F, delY144 and D614G in the S protein (Table S5). The B.1 lineage was predominantly observed among adults 45-60 years of age diagnosed through community screening and test services (3/7; 57.1%) and out-patients (2/7; 28.6%) (Table S1). B.1.1 was observed consistently from May to September 2020, with mutations fluctuating daily, with at least 7 mutations present and a maximum of 20 mutations (Figure 4B, Table S6). Six mutations carried over from 2020 to 2021 (during the 1st, 3rd and 4th waves), including R203K and G204R in the N protein; H208Y in the nsp2 protein; Y138H in the nsp8 protein; R126S in the ORF3a protein; P323L in the RdRp protein; and D614G in the S protein (Table S6). This lineage was predominant among younger adults aged 25-44 years from community screening (14/28; 50.0%) and out-patients in casualty (8/28; 28.6%) (Table S1). B.1.1.52 was present from April to June 2020, with a maximum of 19 mutations present in April 2020, which was stable until May 2020 (16-18 mutations) and declined by June 2020 (10-15 mutations) (Figure 4B, Table S8). Nine mutations were identified in B.1.1.52 during the first wave. Mutations included H210R and S212R in the M protein, R203K and G204R in the N protein, Y138H in the nsp8 protein, T1639A in the PLpro protein, P323L in the RdRp protein, and D614G in the S protein.
More than 50% of the individuals infected with B.1.1.52 (7/13; 53.8%) were from community screening and out-patients from ARV clinics and casualty (Table S2). D614G was the only S protein mutation that was present with more than a single occurrence among the B.1.1.348 (87.5%; 7/8) and B.1.1.52 (92.9%; 13/14) lineages.
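The notion of mutations being "retained" or "carried over" across waves, as described above for B.1 and B.1.1, can be read as a set intersection over sampling periods. The following is a minimal Python sketch under that assumption; the function name `retained_mutations`, the period labels and the example mutation calls (including the placeholder labels `geneX:A10T` and `geneY:K20R`) are hypothetical and are not the study's data or pipeline.

```python
from functools import reduce


def retained_mutations(mutations_by_period):
    """Return the mutations present in every sampling period.

    mutations_by_period maps a period label to an iterable of mutation
    names; the result is the intersection of all of those sets.
    """
    sets = [set(calls) for calls in mutations_by_period.values()]
    if not sets:
        return set()
    return reduce(set.intersection, sets)


# Hypothetical mutation calls for one lineage (illustrative only).
example = {
    "wave 1": {"RdRp:P323L", "S:D614G", "S:C136F", "geneX:A10T"},
    "wave 2": {"RdRp:P323L", "S:D614G", "S:C136F"},
    "wave 3": {"RdRp:P323L", "S:D614G", "S:C136F", "geneY:K20R"},
}

retained = retained_mutations(example)
print(sorted(retained))
```

Only mutations seen in every period survive the intersection, so a mutation that is missed in one period because of low coverage would be dropped. A real analysis would likely need to account for missing data.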

Discussion
The SARS-CoV-2 global pandemic has been fuelled by several variants of concern or interest that display more efficient transmission or immune evasion properties. As VOCs evolved from the Wuhan-Hu1 strain, SARS-CoV-2 has undergone convergence, resulting in novel lineages that could have evolved from previous lineages or VOCs [1]. Our study describes the diversity of SARS-CoV-2 lineages and VOCs from the first to fifth COVID-19 waves in SA and the significance of lineages observed at low frequencies.
SARS-CoV-2-infected individuals were predominantly younger adults aged 25-60 years and females, with no noticeable difference in prevalence between 2020, 2021 and 2022. Globally, similar observations were made, with higher rates of infection observed among younger adults >25 years of age and the majority of resultant deaths observed among older adults >60 years [18][19][20][21][22]. The detection rate of SARS-CoV-2 was about 1.6-fold higher for females (60.1%) than for males (37.5%). Notably, the majority of cases were from the City of Johannesburg Metro district in Gauteng, which is expected since it is the hub of major public hospitals in Johannesburg, clinics and a variety of SARS-CoV-2 community screening sites. Approximately 72% of the samples sequenced belonged to individuals who sought community screening and testing services, followed by out-patients; therefore, the history and symptom profiles of these individuals were not recorded as they were for in-patients. Notably, community screening was 30.7% and 20.6% greater in 2022 compared to 2020 and 2021, respectively.
We identified 24 lineages, apart from the dominating VOCs in SA, from 2020 to 2022, which descended from the Wuhan-Hu1 strain, including B.  B.1.1, B.1.1.52 and C.1 were detected at frequencies of 11%, 14% and 23%, respectively, before the emergence of Beta in 2020 [23,24], which is higher than observed for Gauteng in this study except for B.1.1. These three lineages (B.1, B.1.1 and C.1) were the only low-frequency lineages that were not completely displaced and circulated during more than one wave. The B.1 lineage was predominant across Africa at the beginning of the pandemic [3]. B.1 gained up to 21 mutations across the genome over time from 2020 to 2021. B.1 was detected at a high prevalence of 46% in the United States of America (USA) and 11% in Turkey, whereas for Africa, it was 26.7% compared to 2% overall for SA and 3% in this study [25,26]. This lineage was responsible for the 2020 SARS-CoV-2 outbreak in Northern Italy [27]. Similar to our study, B.1 was detected in Spain through waves 2 to 4 [28]. B.1 was detected in less than 1% of SARS-CoV-2 genomes from Peru in 2021, whereas it was detected at a high frequency of 46.3% in Colombia [29,30].
The number of mutations in B.1.1 decreased from 20 to 8 by 2022. This lineage dominated in the UK at 25% and at 17% in the USA [25,26]. B.1.1 was also reported as one of the dominant (32.9%) SARS-CoV-2 lineages in Spain during the first wave in 2020, whereas B.1 was present at 2.7% during the same time [28]. In Peru, B.1.1 was detected at a frequency of 1%, while in Chile, it was reported as one of five lineages that dominated early in the pandemic [30,31]. B.1.1.348 was detected in Argentina during the first wave of SARS-CoV-2 infections at a low frequency of 3.7% and was present in Peru at 3% and at a higher frequency of 9.7% in Colombia [30,32,33]. Chile also reported B.1.1.348 as one of five dominant lineages early in the pandemic [31]. B.1.1.348 was also detected in Spain in a few cases [28].
Globally, C.1, C.1.2 and C.2 had a cumulative prevalence of <0.5%, with SA accounting for 94.6% (437/462), 89.5% (307/343) and 45.8% (33/72) of the data, respectively [25,26]. C.1 had ~16 mutations in 2020, which remained fairly consistent, with up to 19 mutations by 2021; this was similar to findings in previous studies from SA, in which the majority of sequences were geographically representative of KwaZulu-Natal and the Free State Provinces [10,34]. However, the latter study detected the B.1.1.54 (28.8%; 320/1111) and B.1.1.56 (9.4%; 104/1111) lineages (~14 mutations) from the parent lineage B.1.1 at high prevalence during the first wave across five SA provinces, but the majority of sequences were from KwaZulu-Natal [10]. Whereas the B.  1 and B.1.1, were the most prevalent in the Free State but were not detected in our study [34]. This may suggest that B.1.1.348, B.1.1.52 and C.2 were more specific to Gauteng, especially within the City of Johannesburg Metro. C.1.2 (20D clade) evolved from the C.1 lineage [11] and circulated at low frequencies in 2021 during the Delta wave. C.1.2 was a variant of interest for SA during 2021. It was only observed from June to December 2021 at a frequency of <10% during each month and never outcompeted the VOCs. C.1.2 was observed across all nine provinces of SA with <4% monthly prevalence from November 2021. Our study showed a slightly higher prevalence of 6.1% (53/870), representing ~32% of all the South African C.1.2 strains described [11]. C.1.2 had 23 highly prevalent mutations that occurred concurrently (co-evolved), with the exception of L585F (17%) and D936H (5%), which were observed in a previous study [11]. Significant mutations in the S protein were detected in low-frequency lineages and were also characteristic of the VOCs that were dominant in SA. Mutations including D614G, D215G, N440K, S477N, T478K, E484K and H655Y are responsible for enhancing ACE2 binding affinity, transmissibility or immune escape [15,35].
The low-frequency lineages were primarily observed among young adults aged 25-44 years from community screening and out-patients.
Since a very limited number of in-patients were identified with the B and C lineages, this suggests that they were not associated with severe illness.
Several important mutations were detected during more than one SARS-CoV-2 wave among the low-frequency lineages in this study. The S68F substitution in the E protein can assist with stabilising the protein structure [36,37]. A significant M protein mutation was I82T, which is involved in increasing structural stability and was most prevalent in the B.1.575 lineages in the USA in 2020 [38,39]. The N protein mutations, including P13L, R203K and G204R/K, may assist with increasing the transmission of SARS-CoV-2 but may also be associated with reduced severity of disease and, therefore, lower mortality rates when compared to individuals infected with virus carrying the wild-type N protein [40][41][42]. In the nsp1 protein, the E102K mutation encourages attachment to viral RNA, increasing replication and thereby increasing infectivity [43]. In contrast, del141-143 in nsp1, which was also reported in Omicron BA.4 lineages during the fifth wave in SA and described in Delta AY.63 in Norway, extends the duration of infection [5,44]. R126S in ORF3a compromises the viral structure by reducing its stability [45]. P323L was observed across all low-frequency lineages and was previously described as responsible for maintaining the RdRp protein structure and may enhance mutation rates [46,47].
Among the S protein mutations in the low-frequency lineages, the most significant were the following: N501Y, which increases affinity for human ACE-2 [48]; del69-70 and delY144, which may influence the antigenic properties of the N-terminal domain (NTD), with del69-70 additionally escaping neutralising antibodies and increasing viral replication and transmissibility [48][49][50][51]; E484K, which assists the virus in escaping host immunity and increases binding affinity to the host's ACE-2 receptor [51]; D614G, which was identified in all VOCs and found to improve ACE-2 receptor binding, thereby increasing viral transmissibility [52][53][54]; H655Y, which was also detected in Gamma and Omicron and increases virulence by encouraging furin cleavage, evading immunity and increasing transmission; and N679K, which increases glycosylation at the S1/S2 furin cleavage site and could thereby prevent syncytia formation [40,55].
The circulation of SARS-CoV-2 VOCs was similar to that reported previously in SA during the first five COVID-19 waves [56,57]. Wuhan-Hu1 ancestral strains drove the first wave (2020: epiweeks 14-42), while Beta (2020: epiweek 43 to 2021: epiweek 25), Delta (2021: epiweeks 20 to 45), Omicron BA.1/BA.2 (2021: epiweek 46 to 2022: epiweek 12) and Omicron BA.4/BA.5 (2022: epiweeks 13 to 42) drove the 2nd, 3rd, 4th and 5th waves, respectively. Several studies have previously reported on the circulation of VOCs in SA, which, except for Alpha, drove the major waves of infection from 2020 to 2022 [2,3,15,58,59]. We identified Alpha in 2021 between the Beta and Delta waves; however, it was never a dominant lineage and only circulated sparsely across 9 weeks. Beta evolved from the B.1.1 lineage (20C clade), with up to 37 mutations gained across the genome by the end of the second wave. After its initial detection in SA, Beta spread across 20 countries in Africa [3]. This VOC had eight mutations in the S protein, with the most significant being K417N, E484K and N501Y within the RBD [2,25,60]. Beta was not completely displaced after the second wave, since we detected it at low frequencies later in 2021 towards the tail-end of the Delta wave; it was also detected in several other provinces in SA and at <0.1% globally [56,61].
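The epiweek ranges quoted above can be turned into a simple lookup. The sketch below is one possible Python encoding. The names `PERIODS` and `periods_for` are hypothetical, and it assumes that (year, epiweek) pairs can be compared lexicographically. Because the Beta and Delta ranges overlap (2021 epiweeks 20 to 25), the function returns a list rather than a single label.

```python
# (label, (start_year, start_epiweek), (end_year, end_epiweek)),
# with boundaries copied from the ranges stated in the text.
PERIODS = [
    ("Ancestral (wave 1)", (2020, 14), (2020, 42)),
    ("Beta (wave 2)", (2020, 43), (2021, 25)),
    ("Delta (wave 3)", (2021, 20), (2021, 45)),
    ("Omicron BA.1/BA.2 (wave 4)", (2021, 46), (2022, 12)),
    ("Omicron BA.4/BA.5 (wave 5)", (2022, 13), (2022, 42)),
]


def periods_for(year, epiweek):
    """Return the labels of every period whose range contains the given week."""
    key = (year, epiweek)
    # Tuples compare element by element, so (2021, 5) < (2021, 25) < (2022, 1).
    return [label for label, start, end in PERIODS if start <= key <= end]


print(periods_for(2021, 22))  # falls in the Beta/Delta overlap
print(periods_for(2022, 5))
print(periods_for(2020, 13))  # before the first listed epiweek
```

Returning every matching period keeps the Beta/Delta overlap visible instead of silently assigning those weeks to one wave.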
We  [56]. Phylogenetic analysis showed that Delta descended from the B.1 lineage (20A clade) and gained up to 51 mutations across the genome by the end of the third wave, with 9-10 mutations in the S protein [60]. L452R and P681R were the common mutations identified in Delta sub-lineages, while T478K was the primary mutation in the S protein among the Delta lineages, which is responsible for interacting with the human ACE-2 receptor, increasing its infectivity [25,60,62]. From our study, Beta and Delta circulated for up to 30 consecutive weeks; however, the detection rate of Delta was much greater than Beta, suggesting Delta has higher rates of viral replication and transmission [62].
The main limitation identified in this study was the inconsistency of sequence data in 2020. The number of genomes sequenced in 2020 was limited since routine surveillance commenced late in 2020, and retrospective sequencing had to be performed. However, owing to random selection, our dataset appears to be a good representation, as the major VOC events observed in SA are reflected accurately when compared to the national surveillance data. In addition, the emergence of novel lineages and VOCs led to primer mismatches, which made it challenging to obtain high-quality sequence data with adequate coverage across the genome. To overcome this, primer optimisation was required to prevent incorrect lineage/VOC assignment. Therefore, in this study, only genome sequences with 65-99% coverage across the genome and with >80% coverage for the S protein were included.
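The inclusion criterion stated above can be expressed as a small filter. This is a hedged Python sketch, not the study's pipeline: the `GenomeQC` record, its field names and the sample values are hypothetical. It also assumes that the 65% genome-wide figure is an inclusive lower bound and that the quoted 99% is the upper end of the observed range rather than a cutoff.

```python
from dataclasses import dataclass


@dataclass
class GenomeQC:
    sample_id: str
    genome_coverage: float  # percent of the genome covered
    spike_coverage: float   # percent of the S gene covered


def passes_inclusion(qc, genome_min=65.0, spike_min=80.0):
    """Apply the coverage thresholds described in the text.

    Assumption: genome coverage must be at least genome_min and S coverage
    must be strictly greater than spike_min; no upper cap is applied.
    """
    return qc.genome_coverage >= genome_min and qc.spike_coverage > spike_min


# Hypothetical QC records (illustrative values only).
records = [
    GenomeQC("sample_A", 97.2, 99.0),
    GenomeQC("sample_B", 70.0, 78.5),  # fails on S coverage
    GenomeQC("sample_C", 60.1, 95.0),  # fails on genome-wide coverage
    GenomeQC("sample_D", 65.0, 80.1),
]

included = [qc.sample_id for qc in records if passes_inclusion(qc)]
print(included)
```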

Conclusions
Our study showed that, with the emergence of novel lineages and VOCs, the number of mutations increased over time, which may reduce the protection afforded by vaccine-induced or natural immune responses and help the virus withstand other environmental selective pressures. In addition, we identified that the B.1, B.1.1 and C.1 lineages were able to retain mutations over time and continued to circulate during the five COVID-19 waves in SA, although at much lower frequencies. These lineages presented with mutations that were also carried by the VOCs. It is possible that low-frequency lineages, together with circulating VOCs, can lead to convergence and recombination, which may result in the next novel lineage or variant that may further increase transmissibility and infectivity and escape vaccine-induced or natural host immunity. Therefore, it is important that low-frequency variants be studied in conjunction with VOCs across the globe to determine their impact on different populations.

Informed Consent Statement:
This study contributed to the national surveillance for SARS-CoV-2 in South Africa, for which formal patient consent was not required.