Genomic Diversity of SARS-CoV-2 in Algeria and North African Countries: What We Know So Far and What We Expect?

Here, we report a first comprehensive genomic analysis of SARS-CoV-2 variants circulating in North African countries, including Algeria, Egypt, Libya, Morocco, Sudan and Tunisia, with respect to genomic clades and mutational patterns. As of December 2021, a total of 1669 high-coverage whole-genome sequences submitted to the EpiCoV GISAID database were analyzed to infer clades and annotate mutations relative to the wild-type variant Wuhan-Hu-1. Phylogenetic analysis of SARS-CoV-2 genomes revealed the existence of eleven GISAID clades, with GR (variant of the spike protein S-D614G and nucleocapsid protein N-G204R), GH (variant of the ORF3a coding protein ORF3a-Q57H) and GK (variant S-T478K) being the most common with 25.9%, 19.9%, and 19.6%, respectively, followed by their parent clade G (variant S-D614G) (10.3%). Lower prevalence was noted for GRY (variant S-N501Y) (5.1%), S (variant ORF8-L84S) (3.1%) and GV (variant of the ORF3a coding protein NS3-G251V) (2.0%). Interestingly, 1.5% of total genomes were assigned as GRA (Omicron), the newly emerged clade. Across the North African countries, 108 SARS-CoV-2 lineages were identified using the Pangolin assignment, whereby most genomes fell within six major lineages and variants of concern (VOC) including B.1, the Delta variants (AY.X, B.1.617.2), C.36, B.1.1.7 and B.1.1. Analysis of mutations in SARS-CoV-2 genomes highlighted similar profiles, with the spike (S) D614G and ORF1b-P314L variants being the most frequent changes, found in 95.3% and 87.9% of total sequences, respectively. In addition, mutations affecting other viral proteins appeared frequently, including N:RG203KR, N:G212V, NSP3:T428I, ORF3a:Q57H, S:N501Y, M:I82T and E:V5F. These findings highlight the importance of genomic surveillance for understanding SARS-CoV-2 genetic diversity and its spread patterns, leading to better guidance of public health intervention measures.
The analytical approach of the present work could be implemented worldwide in order to overcome this health crisis through harmonized approaches.


Introduction
Officially, in late December 2019, the World Health Organization (WHO) was notified by the Chinese Health Authorities of pneumonia cases of unknown etiology detected in Wuhan City, Hubei Province [1], which could mark the emergence of a novel and serious threat to public health. On 7 January 2020, researchers from the Shanghai Public Health Clinical Center and School of Public Health reported the isolation of a new type of coronavirus (novel coronavirus, nCoV) [2] and a preliminary analysis of the Wuhan virus sequence (WH-Human_1.fasta.gz; GenBank/NCBI release MN908947.1), suggesting a possible zoonotic origin [3].
Between 10 and 15 January 2020, findings of unexplained pneumonia in a Shenzhen family cluster confirmed the presence of the novel coronavirus and suggested possible sustained human-to-human transmission [4], although the extent of this mode of transmission was unclear. Since the first report, the virus has spread to other territories, areas and countries. The current study aims to provide information, for the first time, about the geographic distribution of SARS-CoV-2 genomic lineages and potential diversification pathways of the virus in Algeria and North African countries. The circulation of these variants has broad epidemiological implications for public health, including ongoing vaccination efforts.

Epidemiological Dynamics and Genomic Data Processing
The complete genome sequences of SARS-CoV-2 isolates from Algeria and North African countries, including Egypt, Libya, Morocco and Tunisia, were retrieved from the EpiCoV database of the GISAID initiative [16]. As of 20 December 2021, 2599 genomes were downloaded. Only viruses from human hosts were retained; low-quality sequences (>5% NNNs) were removed and only full-length sequences (>29,000 nt) were used. In total, 1669 complete, high-coverage genome sequences from the dataset were selected to investigate the genetic characterization (Table 1).
Table 1. COVID-19 cases distribution and total analyzed SARS-CoV-2 genomes from North African countries.
Daily updates on the number of confirmed new cases of COVID-19 in Algeria were analyzed starting from February 2020 (for 20 months) using publicly released data provided by the Algerian Ministry of Health, Population and Hospital Reform (https://www.sante.gov.dz/) (accessed on 20 December 2021).
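The two sequence-level quality filters described above (full length >29,000 nt and no more than 5% ambiguous N bases) can be illustrated with a minimal sketch. The function names and the toy records are hypothetical and are not taken from the study's pipeline; the host filter, which relies on GISAID metadata, is not shown.

```python
def n_fraction(seq: str) -> float:
    """Return the fraction of ambiguous 'N' bases in a sequence."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 1.0


def passes_quality(seq: str, min_length: int = 29_000, max_n: float = 0.05) -> bool:
    """Keep sequences longer than min_length with at most max_n fraction of N."""
    return len(seq) > min_length and n_fraction(seq) <= max_n


def filter_genomes(records: dict) -> dict:
    """Return only the records that pass both thresholds."""
    return {name: seq for name, seq in records.items() if passes_quality(seq)}


# Hypothetical toy records (not real GISAID data)
records = {
    "good": "ACGT" * 7500,                      # 30,000 nt, no N
    "too_short": "ACGT" * 5000,                 # 20,000 nt
    "too_many_n": "ACGT" * 7000 + "N" * 2000,   # 30,000 nt, about 6.7% N
}
kept = filter_genomes(records)
print(sorted(kept))
```

Whether the 5% threshold is inclusive is not stated in the text; the sketch assumes sequences with exactly 5% N are kept.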

Sequence Alignment and Phylogenetic Analysis of Algerian Genomes
For the local Algerian virus comparison, thirty-six sequences were first aligned using a multiple sequence alignment algorithm (MAFFT v7.471) [17]. The maximum likelihood tree was reconstructed with the IQ-TREE server using the general time-reversible (GTR) model with rate heterogeneity (GTR + G) and 1000 ultrafast bootstrap replicates (http://www.iqtree.org accessed on 20 December 2021) [18]. The SARS-CoV-2 genomes were classified into lineages using the PANGOLIN web application (Phylogenetic Assignment of Named Global Outbreak LINeages) (https://pangolin.cog-uk.io accessed on 20 December 2021) [19]. The viral clades were assigned by the Nextclade tool (https://clades.nextstrain.org/ accessed on 20 December 2021) [20] and through the UShER web interface from the University of California, Santa Cruz (https://genome.ucsc.edu/cgi-bin/hgPhyloPlace accessed on 20 December 2021). Viral clades were defined on the basis of available genomes sharing the same pattern of mutations. The Algerian sequences were comparatively evaluated against the Wuhan reference genome (NC_045512.2-Wuhan-Hu-1) obtained from NCBI GenBank. Quality checks of the sequences and evaluation of genetic distance were performed in MEGA software version 6 [21] and the final dataset was displayed using Interactive Tree of Life (iTOL) v.4 (https://itol.embl.de/ accessed on 20 December 2021) [22].
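As a rough illustration of what a genetic distance between two aligned genomes measures, the sketch below computes a simple p-distance (proportion of differing sites). This is an assumption made for illustration: the text does not state which distance model was selected in MEGA, and the function name is hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    valid = set("ACGT")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Toy aligned fragments (hypothetical, not real genomes)
print(p_distance("ACGTACGT", "ACGTACGA"))  # one difference over eight sites
```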

Mutation Signature and Clade Assignment Analysis
The Nextclade web tool (https://clades.nextstrain.org accessed on 20 December 2021) and the online COVID-19 genome annotator 'coronapp' [23] were used to call mutation signatures and define SNP profiles for all genome sequences from North African countries by checking amino acid substitutions, deletions and insertions in specific regions, including the spike surface glycoprotein (S), polyprotein 1ab (nsp1-nsp16), structural proteins (S, E, M, and N) and other accessory proteins. In addition, genomic lineages and clades were inferred by GISAID and PANGOLIN databases according to the nomenclature system at the time of data collection.
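The principle behind mutation calling against a reference can be sketched as follows, assuming a query genome already aligned pairwise to the reference. This is not how Nextclade or coronapp are implemented; it only reports nucleotide substitutions in reference coordinates (labels such as C241T) and ignores indels and amino acid annotation. The function name is hypothetical.

```python
def call_substitutions(ref_aln: str, query_aln: str) -> list:
    """List nucleotide substitutions such as 'C241T' in reference coordinates.

    Both inputs are rows of the same pairwise alignment ('-' marks a gap).
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    substitutions = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r == "-":
            # Insertion in the query: the reference coordinate does not advance
            continue
        ref_pos += 1
        if r in valid and q in valid and r != q:
            substitutions.append(f"{r}{ref_pos}{q}")
    return substitutions


# Toy alignment (hypothetical): one substitution, one insertion, one deletion
print(call_substitutions("ACG-TAC", "ATGGTA-"))
```

Skipping gapped and ambiguous sites keeps low-coverage positions from being reported as substitutions.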

Epidemiology of SARS-CoV-2 in Algeria and North Africa
By 20 December 2021, over 2,500,000 SARS-CoV-2 cumulative cases had been reported in North African countries (Figure 1A), of which 214,592 cases were confirmed in Algeria, with more than 6190 deaths attributed to the virus and a case fatality ratio (CFR) of 2.88%. In addition, Algeria's western neighbor, Morocco, registered the highest rate of positivity among the North African countries with 34.3% of total cases (952,814), followed by Tunisia, Libya, Egypt, and Sudan with 721,031; 381,749; 375,330 and 45,112 positive cases, respectively (Table 1). The first confirmed positive case of SARS-CoV-2 infection in Algeria was reported on 25 February 2020, with cases occurring initially among international travelers until flights were stopped in March 2020. After the first case, the country experienced several waves of the pandemic. The second wave of viral introductions occurred between October and December 2020 and included migrants returning from Europe, followed by a third wave of rapid growth in mid-July and August 2021 in terms of the daily incidence of positivity and deaths (Figure 1B).
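The case fatality ratio quoted above for Algeria follows directly from the reported deaths and confirmed cases. A minimal sketch of the arithmetic (the function name is hypothetical):

```python
def case_fatality_ratio(deaths: int, confirmed_cases: int) -> float:
    """Return the CFR as a percentage of confirmed cases."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return 100.0 * deaths / confirmed_cases


# Figures reported in the text for Algeria as of 20 December 2021
algeria_cfr = case_fatality_ratio(6190, 214_592)
print(f"{algeria_cfr:.2f}%")  # 2.88%
```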

Phylogenetic Analysis of SARS-CoV-2 Genomes in Algeria
Of the 85 available sequences, 36 genomes that met the quality criteria for analysis (>90% coverage) were used to construct a maximum-likelihood phylogenetic tree. As presented in Figure 2, the phylogenetic analysis supports the PANGOLIN lineage assignment, in which the analyzed SARS-CoV-2 genomes belonged to six different B lineages. Algeria has little diversity in variant mapping, which is not surprising given limitations to whole genome sequencing. The Nextclade analysis revealed that seventeen of the 36 SARS-CoV-2 genomes belonged to the Delta clade (21J), with the rest being part of clades 20A, 20B and 20C. More recently, two genomes were submitted in December 2021 to the EpiCoV database and were grouped with 21K. Furthermore, GISAID analysis showed that the selected sequences belong to four high-level phylogenetic groups including G, GH, GR and GK, with 16 genomes (44.4%) as part of the GK (Delta) clade and nine others (25.8%) of the GH (Beta) clade. The time course of the phylogenetic analysis and clade distribution showed that clades G (Variant S-D614G), GH (Variant ORF3a-Q57H) and GR (Variant N-G204R) were the most prevalent in the first and second waves of viral introductions. However, this was no longer the case between early May and mid-July 2021, when clade GK took over. Expanded phylogenetic analysis was conducted to examine the genetic divergence of Algerian samples against global representative SARS-CoV-2 genomes present in the Nextstrain database. The mutation-resolved ML phylogenetic tree confirmed the assigned PANGO and Nextclade lineages, since 17 genomes grouped with the 21J representatives, six (16.6%) with the 20A clade, four (11.1%) with the 20B clade, eight (22.2%) with the 20C sequences and two (5.4%) with the 21K (Omicron) clade (Figure 2).
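The marker-based logic behind the GISAID clades named in this paper can be sketched as a small decision rule. This is a simplified illustration using only the marker variants quoted in this paper (S-D614G, N-G204R, ORF3a-Q57H, S-T478K, ORF8-L84S). It is not the GISAID assignment procedure, which relies on additional markers and phylogenetic context, and it would not separate later clades such as GRA. The function name and the mutation label format are assumptions.

```python
def assign_gisaid_clade(mutations: set) -> str:
    """Assign a coarse GISAID-style clade from a set of amino acid changes.

    Derived clades are tested before their parent clade G.
    """
    has_g = "S:D614G" in mutations
    if has_g and "S:T478K" in mutations:
        return "GK"
    if has_g and "N:G204R" in mutations:
        return "GR"
    if has_g and "ORF3a:Q57H" in mutations:
        return "GH"
    if has_g:
        return "G"
    if "ORF8:L84S" in mutations:
        return "S"
    return "O"  # 'other': no marker from this simplified list matched


# Hypothetical mutation profiles for illustration
print(assign_gisaid_clade({"S:D614G", "ORF3a:Q57H"}))  # GH
print(assign_gisaid_clade({"S:D614G", "S:T478K"}))     # GK
```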

Distribution of SARS-CoV-2 Lineages and Clades in North Africa
The variants from 1669 retrieved genomes were clustered in 108 SARS-CoV-2 lineages using the Pangolin web services tool, whereby most samples fell within six dominant lineages (Figures 3 and 4) 19.1%, 17.8%, 6.9%, and 5.9%, respectively. The Nextstrain clade assignment revealed that the analyzed genomes formed fifteen distinct clades, with 20A, 20D and 21J (Delta) being the most common with 26.4%, 24.1% and 18.7%, respectively, followed by 20I (Alpha) and 20B with 8.1% and 6.7%, respectively. Analysis of the distribution of SARS-CoV-2 clades in North African countries showed that clade GR was the most frequently identified with 25.9% of the total genomes, followed by GH and GK (19.9% and 19.6%, respectively) and their parent clade G (10.3%). Other less common clades including S and L were also identified in 3.1% and 1.7% of the analyzed sequences, respectively. Furthermore, about 5% of the genomes were not clustered into any of the major clades. Interestingly, 1.5% of total genomes were assigned as GRA (Omicron), the newly emerged clade.

Egypt
A total of fifty-two lineages have been identified by the Pangolin phylogenetic classification, of which five were most prevalent including C.36 (30.

Sudan
Nine Pango lineages were identified from Sudan, of which A.9 and B.1.351 were the most prevalent with 25.8% each, followed by B.1 with 12.9% of total analyzed genomes. In addition, four different clades were reported, with GH (51.6%) and S (35.5%) being dominant compared with clade GR (9.7%).

Phylogenetic Analysis of Omicron Genome Sequences from North Africa
A total of 25 genome sequences, collected between October and December 2021, were obtained from GISAID. Spatiotemporal phylogenetic analyses were conducted using the complete genomes available at the time, with two genomes from Algeria, one genome from Egypt and 22 genomes from Morocco. As shown in Figure 5A, the global phylogeny of SARS-CoV-2 sequences (Delta and Omicron) from North Africa (as of 20 December 2021) showed that Omicron sequences (21K) could have descended from Nextstrain clade 20B. The global subtrees (Figure 5B) showed evidence of different geographic origins of the Omicron lineage. The majority of analyzed sequences were closely related to BA.1 sequences from England and the remaining genomes were related to sequences from Scotland and the United States, suggesting multiple introductions of SARS-CoV-2 variants into North African countries. In addition, the phylogenetic analysis showed that the BA.1 genomes formed monophyletic clusters, indicating local transmission of the Omicron lineage in Morocco, in contrast to the two Algerian sequences.

Genomic Variation and Mutation Signature
The retrieved SARS-CoV-2 genomes from North African countries were compared with the reference NC_045512.2-Wuhan-Hu-1 and, as expected, significant numbers of nonsynonymous and synonymous mutations were detected. The annotated mutations, event by event, are summarized in Figure 6. The analysis of 1669 SARS-CoV-2 genomes highlighted a total of 42,685 mutation events compared to the reference (Supplementary data). A high prevalence of single-nucleotide polymorphisms (SNPs) was noted with 26,532 (62.17%) events over indels (insertion or deletions) with 2.31% and 0.014%, respectively. Furthermore, 11,222 events of silent SNPs falling in coding regions were detected, representing 26.30% of the total events. Overall, the C>T transition is the most common event, accounting for 42.63% with 18,198
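Tallying substitution classes such as C>T from a list of called events, and expressing a class as a share of all events, can be sketched as below. The event labels C241T, C3037T, C14408T and A23403G appear elsewhere in this paper; G100T is a hypothetical transversion added only for illustration, and the function names are assumptions. The final line reproduces the C>T share from the counts quoted above (18,198 of 42,685 events).

```python
from collections import Counter

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def substitution_class(event: str) -> str:
    """Turn a label such as 'C241T' into a class such as 'C>T'."""
    return f"{event[0]}>{event[-1]}"


def is_transition(event: str) -> bool:
    """True for purine-purine or pyrimidine-pyrimidine changes."""
    return (event[0], event[-1]) in TRANSITIONS


def percentage(part: int, total: int) -> float:
    """Share of part in total, as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Illustrative event list; 'G100T' is hypothetical
events = ["C241T", "C3037T", "C14408T", "A23403G", "G100T"]
class_counts = Counter(substitution_class(e) for e in events)
n_transitions = sum(is_transition(e) for e in events)

# C>T share computed from the counts reported in the text
ct_share = percentage(18_198, 42_685)
print(class_counts.most_common(1), n_transitions, ct_share)
```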

Variant Analysis of Omicron SARS-CoV-2 Genomes
The analysis of the genetic polymorphism of Omicron SARS-CoV-2 genomes compared to the Wuhan-Hu-1 reference genome revealed variable mutations between viruses. A total of 1455 mutation events were noted, with a high prevalence of single-nucleotide polymorphism (SNP) events (914, 62.8%), followed by silent SNPs (235, 16.15%) and deletions (5.85% of total events) (Supplementary data). The frequent mutation events observed for Omicron genomes are summarized in Table 2.

Discussion
Despite substantial advances, the implementation of genomic surveillance remains a challenge for most African countries where access to whole genome sequencing is limited. Since the first description of the SARS-CoV-2 sequence in late 2019 [3], an exponentially increasing number of virus genomes have been reported across the globe, with over ten million complete genomes deposited in the GISAID (https://www.gisaid.org) and GenBank (https://www.ncbi.nlm.nih.gov/sars-cov-2/) databases (both accessed on 20 December 2021). Nonetheless, SARS-CoV-2 genome sequence data from Africa constitute less than 0.7% of the total, with 70,421 sequences in genome repositories.
Due to the naturally expanding genetic diversity of SARS-CoV-2 viruses, extensive molecular surveillance and efforts to understand the patterns of the global spread of the pandemic have been introduced, including three main nomenclatures: PANGO lineages (Phylogenetic Assignment of Named Global Outbreak LINeages) [19], Nextstrain clades [20] and the GISAID classification. While PANGO lineages provide more detailed outbreak cluster information, the other two nomenclatures offer broad geographical and temporal clade trends.
This paper presents the first insight into a comprehensive analysis of genome sequences of SARS-CoV-2 circulating in Algeria and North African countries. The data of 1669 SARS-CoV-2 genomes submitted to the EpiCoV GISAID database as of 20 December 2021 were analyzed with respect to genomic clades and their geographic distribution. The results revealed the presence of different clades and variants, as defined by the GISAID, PANGOLIN and Nextclade tools, that could be involved in the varied exacerbation of symptoms and disease severity in local residents.
On 25 February 2020, a case of COVID-19 was reported in Algeria, the first Member State of the AFRO Region to be affected, leading the Algerian and neighboring health authorities to set up a response plan with rapid implementation to prevent and control the spread of SARS-CoV-2 [24,25]. Despite the restrictions and the lockdown measures applied in most North African countries, the virus continued to spread from one region to another [26], and evolved with numerous genetic variants being associated with higher infectivity [27]. So far, the retrieved SARS-CoV-2 genomes were clustered into twelve major clades, as defined by the GISAID database, and at least 108 Pango lineages, with six dominant variants including B.1, the Delta variants (AY.X, B.1.617.2), C.36, B.1.1.7 and B.1.1. Clades GR, GH and GK were the most frequently identified among the analyzed genomes, followed by G, GRY, GV and O clades with lower prevalence, confirming the heterogeneity of circulating strains. Interestingly, 1.5% of total genomes were assigned as GRA (Omicron), the newly emerged clade.
Early in the first outbreak, the SARS-CoV-2 genomes were classified into two major lineages, named the European superclade A (also referred to as L) and the East Asian superclade B (referred to as S) [28,29]. Later, several sublineages were introduced in the GISAID nomenclature, including the V, G, GH, GR, GV and GRY clades, based on marker mutations and phylogenetic analysis [30] (https://www.gisaid.org/ accessed on 20 December 2021).
Globally, the G clade and its derivatives GH, GR, and GV are the most common clades amongst the sequenced SARS-CoV-2 genomes [31]. Mercatelli and Giorgi [23] reported that GISAID clades G, GR and GV are prevalently present in Europe with relatively higher COVID-19 cases, deaths and CFRs, while the clades GH and GR have been mostly observed in the Americas, the top ranked continents with respect to CFR and local disease epidemiology parameters.
The dynamics of SARS-CoV-2 spread in North Africa were not so different from those observed worldwide, with first and second waves dominated by viruses belonging to clades 20A and 20B, followed by a third wave linked to the circulation of variants characterized by an increase in the number of severe forms of COVID-19, leading to more deaths. Similarly, a study that investigated SARS-CoV-2 sequences collected in the Eastern Mediterranean Region found that more than 65.8% of the viruses belong to clades 20A, 20B, and 20C (GISAID clades GR, GH, G and GV) [32]. Likewise, genome sequencing of SARS-CoV-2 isolated from Egyptian patients showed that most of the sequences can be assigned to clades G/GR/GH/O (as per the GISAID system) [33]. In addition, genomic surveillance applied to SARS-CoV-2 transmission in Morocco [34] between March and May 2020 revealed different aspects of the epidemic, with the introduction of SARS-CoV-2 strains from different European countries, where most genomes fell within clades 20A and 20B with different mutation patterns, giving rise to the diversity of SARS-CoV-2 lineages reported in this study.
New changes and variants of SARS-CoV-2 constantly emerge as long as transmission persists, causing major epidemics in the United Kingdom (UK) [19], Brazil [35,36], and South Africa [37]. The Delta variant (PANGO lineage: B.1.617.2), first detected in India, has spread quickly across the world and is designated a variant of concern by the World Health Organization, likely due to its higher transmissibility compared with the wild type; it is estimated to be about 60% more transmissible than the Alpha variant [38].
Recently, the Omicron variant (B.1.1.529) has become the main concern after the Delta variant due to its large number of mutations (26 to 32) in the genome compared with other variants, especially in the spike protein, many of which are located within the receptor binding domain (RBD) [39] and are known or predicted to contribute to escape from neutralizing antibodies and existing countermeasures. Omicron was also predicted to be associated with a rapid increase of COVID-19 cases (https://www.who.int/news/item/28-11-2021-update-on-omicron) (accessed on 30 December 2021). In a short period, the circulation of Omicron has been found in at least 65 countries and territories with thousands of confirmed cases (https://www.gisaid.org/hcov19-variants/ accessed on 20 December 2021).
The D614G spike mutation, which characterizes the G clade and its derivatives, has spread exponentially across the world and rapidly become the most prevalent lineage worldwide [40], occurring in over 92% of total analyzed genomes in this study. The A23403G mutation leading to the D614G spike (S) variant was found to be located in a heavily glycosylated residue in the viral spike, and has been implicated in increased infectivity and faster spreading of the virus during the COVID-19 pandemic compared to the wild type variant Wuhan-Hu-1 [41].
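How the nucleotide change A23403G maps to the amino acid change D614G can be worked through with a short sketch. It assumes that the spike coding sequence starts at position 21563 of NC_045512.2 and that the reference codon at residue 614 is GAT; both values are taken from the public reference annotation rather than from this paper, and the helper names are hypothetical. Only the two codons needed here are included in the translation table.

```python
SPIKE_CDS_START = 21563  # assumed 1-based start of the S gene in NC_045512.2

# Minimal codon table containing only the codons used in this example
CODON_TABLE = {"GAT": "D", "GGT": "G"}


def locate_in_cds(genome_pos: int, cds_start: int) -> tuple:
    """Return (codon number, position within codon), both 1-based."""
    offset = genome_pos - cds_start
    if offset < 0:
        raise ValueError("position lies upstream of the coding sequence")
    return offset // 3 + 1, offset % 3 + 1


def apply_substitution(codon: str, pos_in_codon: int, alt: str) -> str:
    """Replace the base at the given 1-based position of a codon."""
    bases = list(codon)
    bases[pos_in_codon - 1] = alt
    return "".join(bases)


codon_number, pos_in_codon = locate_in_cds(23403, SPIKE_CDS_START)
ref_codon = "GAT"  # assumed reference codon for spike residue 614
alt_codon = apply_substitution(ref_codon, pos_in_codon, "G")
label = f"{CODON_TABLE[ref_codon]}{codon_number}{CODON_TABLE[alt_codon]}"
print(codon_number, pos_in_codon, alt_codon, label)  # 614 2 GGT D614G
```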
It is worth mentioning that the Spike D614G mutation accompanies other frequent mutation sites in the ORF1ab (NSP3:C3037T, NSP3:T428I and NSP12:C14408T) region, the mutation at position 241 (C241T) targeting the 5′ UTR, as well as the mutations at positions 203 and 212 in the Nucleocapsid protein (N:RG203KR, N:G212V), in the receptor binding domain (RBD) of Spike (S:N501Y), and in the ORF3a protein (ORF3a:Q57H). Generally, Spike D614G and ORF1b-P314L variants are consistently related and co-occur in all geographic locations with increasing frequency [42]. The spike glycoprotein region mediates the infection of target cells through binding to its cognate receptor angiotensin converting enzyme 2 (ACE2) and initiating viral-host fusion and replication [43]. This region is reported to be the most essential for viral attachment and entry into the host cells [44,45]. Therefore, ACE2 expression in different tissues and interactions with SARS-CoV-2 are critical for the infection's progression to severe coronavirus disease 2019 (COVID-19) [46]. The P314L mutation in NSP12 (RNA-dependent polymerase) may play a causal role in viral replication, therefore enhancing its transmission ability and infectivity [44]. Moreover, extragenic SNPs in 5′ UTR:C241T may also affect the folding of the ssRNA and influence the replication rates of SARS-CoV-2 as it is found to occur most prominently [47]. Comparative genomic analysis of SARS-CoV-2 genomes revealed multiple crucial mutations to the Spike gene including K417N, K417T, E484K, N501Y, A570D, D614G, P681H, T716I, S982A and D1118H, which may aggravate the severity of SARS-CoV-2 more than the wild type variant, and potentially raise the concern of vaccine efficacy against novel strains [41,48].
The broad SARS-CoV-2 lineage diversity circulating in North African countries could intensify the impact of the pandemic in the region, affecting the efficacy of vaccines and displaying reduced antibody neutralization, even reducing the reliability of diagnosis schemes including the current primary method of detecting SARS-CoV-2, reverse transcription-quantitative polymerase chain reaction (RT-qPCR) [49,50]. However, within a very short period of time, research applied to COVID-19 diagnosis has advanced with ever-increasing knowledge and inventions, adapting available virus detection technologies and exploiting the power of interdisciplinary research to design novel diagnostic tools to improve detection efficiency [51,52]. Given the epidemiological behaviors, current evidence supports that VOCs, including Delta and the newly emerged Omicron variant, have rapidly escalated, becoming predominant across the globe and replacing previously circulating variants (https://nextstrain.org/ncov/gisaid/global) (accessed on 12 February 2022), adding up to a complex epidemiological scenario.
Compared with other variants and the early identified SARS-CoV-2 strains, the high frequency of mutations in the spike sequence of the Omicron variant raises concern about potential immune escape, and its impact remains to be determined [53]. However, a complete experimental evaluation of Omicron might take weeks or even months. Large-scale case-control studies are essential for investigating clinical severity, and the current situation must lead national governments to place a higher priority on timely collection and analysis. In fact, COVID-19 severity varies enormously depending on the country, the prevalence of vaccination, the population's characteristics and medical management guidelines [54].

Conclusions
Despite the presence of some limitations in the study, such as the absence of clinical data on patients, as well as unbalanced sample sizes among the analyzed countries, the data provide valuable information about the SARS-CoV-2 clades circulating in North African countries and may help inform the dynamics of the disease for better control measures and appropriate public health action as the pandemic spreads in Africa. Analysis of SARS-CoV-2 sequences highlighted, for the first time, the changing pattern of circulating SARS-CoV-2 lineages in Algeria and North Africa between February 2020 and December 2021. Distinct lineages of SARS-CoV-2 contributing to three separate waves of infections reflective of the epidemiological pattern were identified, leading to the detection of previously major circulating variants of concern (VOC) in addition to the newly emerged Omicron variant.
As is known, the African region is characterized by the largest infectious disease burden and the weakest public health infrastructures, which can be explained by the fact that a large population is vulnerable due to conflict, poor socio-economic status, food insecurity and limited access to better health services. Furthermore, the prolonged humanitarian crises facilitate the spread of the actual disease within and between countries as well as causing extensive deterioration of health. Given the current epidemic and limited understanding of the epidemiology of this disease, the coronavirus poses a serious challenge for the continent and the emergence of a serious health threat highlights the need to support African countries with 'Weaker Health Systems'.