First Phylogenetic Analysis of Malian SARS-CoV-2 Sequences Provides Molecular Insights into the Genomic Diversity of the Sahel Region

We are currently facing a pandemic of COVID-19, caused by a spillover of an animal-originating coronavirus to humans that occurred in the Wuhan region of China in December 2019. From China, the virus has spread to 188 countries and regions worldwide, reaching the Sahel region on 2 March 2020. Whole genome sequencing (WGS) data is crucial for understanding the spreading dynamics of the ongoing pandemic, but only limited sequencing data is available from the Sahel region to date. We therefore focused our efforts on generating the first Malian sequencing data. Screening 217 Malian patient samples for the presence of SARS-CoV-2 resulted in 38 positive isolates, from which 21 whole genome sequences were generated. Our analysis shows that both the early A (19B) clade and the later observed B (20A/C) clade are present in Mali, indicating multiple and independent introductions of SARS-CoV-2 to the Sahel region.

Introduction
SARS-CoV-2 (previously called 2019-nCoV) is a novel member of the genus Betacoronavirus and is responsible for the current, rapidly escalating COVID-19 pandemic.
To date, this virus, which has a lower pathogenicity than SARS-CoV but a higher human-to-human transmissibility [1], has caused more than 16 million infections and 648,000 associated deaths worldwide (as of 27 July 2020 [2]). The outbreak began in the city of Wuhan, China, in December 2019, and the epicenter shifted to Europe in mid-March 2020 [3]. From Europe and Asia, the wave of infection moved to North and South America as well as to Africa. The first COVID-19 case in the Sahel region was reported from Senegal on 2 March 2020 [4]. Within 3 weeks, the virus spread to all Sahel countries, including Burkina Faso, Mauritania, Chad, Niger, Sudan and Eritrea, reaching Mali on 25 March 2020 [5,6]. Since the Sahel region is currently facing several challenges, including a severe food and security crisis, it is difficult to judge the effects the COVID-19 pandemic will have on the already overwhelmed healthcare systems [7]. Especially in conflict-affected areas, where the population has only limited access to clean drinking water and handwashing facilities, hundreds of health centres are closed or not operating due to the poor security situation. Furthermore, social distancing measures are difficult to implement in dense African urban settings, and national governments struggle to divide the already insufficient budget between health, food and security emergencies [8]. It is therefore not surprising that the medical and technical equipment as well as the laboratory infrastructure in those countries is sometimes outdated and not on a level comparable to most European countries or the USA. Hence, only limited genomic data of SARS-CoV-2 exists from the Sahel region, and no whole genome sequences from Mali have been available to date.
On 25 March 2020, the first two COVID-19 cases (a 49-year-old woman living in Bamako and a 62-year-old patient from Kayes, both returning from France) were confirmed in Mali [6]. To support the Malian public health system, SARS-CoV-2 diagnostic capabilities were established at the Centre d'Infectiologie Charles Mérieux du Mali (CICM-Mali) by training scientific personnel and providing testing reagents, supported by the German Enhancement Initiative against Biological Threats in the G5 Sahel region. Following this training, the CICM-Mali was prepared as the second and central diagnostic center to support the diagnosis of SARS-CoV-2 for Bamako and its surrounding regions and started its activities on 3 April.
Since the confirmation of the first two cases in March 2020, a steady increase in the total number of cases and a rapid spread of SARS-CoV-2 within Mali have been observed. As of 26 July 2020, 2510 cumulative COVID-19 cases and 123 related deaths (case fatality ratio: 4.93%) have been reported in Mali [11]. Based on the data currently available, 13 out of 100,000 Malians have been infected, resulting in a moderate risk for the Malian population of acquiring COVID-19, especially considering that mitigation strategies such as social distancing cannot be applied to the same extent as in Europe [9]. To analyse the origins and distribution patterns of existing infections, and hence limit the spread of SARS-CoV-2, sequencing data of COVID-19 positive cases are needed.
medRxiv preprint (not certified by peer review), this version posted September 25, 2020; doi: https://doi.org/10.1101/2020.09.23.20165639. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
In this study, we analyzed the first Malian genome sequences of SARS-CoV-2, originating from patients from Bamako and its surrounding villages. Using comparative genomics, we set the results in the context of available African genome sequences, thereby providing data from underrepresented regions like the Sahel to the scientific community.

Patient sampling
Nasopharyngeal or oropharyngeal samples were collected from 217 suspected patients using

RNA extraction and SARS-CoV-2 detection
Viral RNA was extracted using the QIAamp Viral RNA Mini Kit (Qiagen, Hilden, Germany) and analyzed for the presence of SARS-CoV-2 at the Centre d'Infectiologie Charles Mérieux du Mali by reverse transcription quantitative PCR (RT-qPCR) according to the published protocol of Corman et al. [10], targeting the RdRp and E gene regions. The SuperScript III Platinum Polymerase qRT-PCR kit (ThermoFisher Scientific, Germering, Germany) was used for amplification. MS2 phages were added as internal control. To allow simultaneous detection of the E and MS2 genes, the E singleplex assay of Corman et al. was converted into a multiplex assay by the addition of MS2-specific primers and probes (Cy5-labeled) [12]. Positive results obtained in the E gene assay were confirmed using the discriminatory RT-qPCR RdRp gene assay. The 38 samples that tested positive for both the E and the RdRp gene were sent to the Bundeswehr Institute of Microbiology (IMB) for whole genome sequencing.
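The testing workflow described above amounts to a simple decision rule: the multiplexed E gene assay screens, the RdRp assay confirms, and the MS2 internal control guards against failed extraction or amplification. The following Python sketch is illustrative only. The names `AssayResult` and `classify`, the result labels, and the handling of a failed internal control are assumptions, since the text states neither Ct cut-offs nor how invalid runs were treated.

```python
from typing import NamedTuple


class AssayResult(NamedTuple):
    """Qualitative RT-qPCR calls for one sample (hypothetical structure)."""
    e_positive: bool      # E gene screening assay (multiplexed with MS2)
    rdrp_positive: bool   # discriminatory RdRp confirmation assay
    ms2_positive: bool    # MS2 phage internal control


def classify(result: AssayResult) -> str:
    """Return an illustrative call for one sample.

    Assumption: a negative target result is only trusted when the MS2
    internal control amplified; the source does not spell this rule out.
    """
    if result.e_positive and result.rdrp_positive:
        return "confirmed"      # eligible for whole genome sequencing
    if result.e_positive:
        return "unconfirmed"    # E positive, RdRp not confirmed
    if not result.ms2_positive:
        return "invalid"        # internal control failed, repeat the test
    return "negative"


print(classify(AssayResult(e_positive=True, rdrp_positive=True, ms2_positive=True)))
```

Under this rule, only samples labelled "confirmed" would correspond to the 38 samples forwarded for sequencing.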

Library preparation and sequencing
All samples were processed according to the published nCoV-2019 ARTIC sequencing protocol [13,14]. To this end, cDNA synthesis of extracted RNA was performed according to the manufacturer's instructions using the SuperScript IV First-Strand Synthesis System. Finally, nanopore sequencing was performed using SQK-LSK109 chemistry on a 9.4.1 SpotON Flow Cell on the GridION system (Oxford Nanopore Technologies, Oxford, UK).
Reads that passed sequence quality control were demultiplexed using Porechop [15], requiring barcodes at both ends of each read. Adapter trimming was performed by the ARTIC pipeline [13], with Wuhan-Hu-1 set as the reference strain for read mapping.

Phylogenetic Analysis
For the enrichment pipeline, Illumina reads were mapped to the reference strain Wuhan/Hu-1/2019 (GenBank accession MN908947) using bwa v0.7.17 [16]. The obtained BAM files were added as a separate read group to the BAM files generated by the ARTIC pipeline using SAMtools [17]. Follow-up steps of the ARTIC pipeline were performed manually using the combined BAM files as input. Additionally, variants were called for the Illumina mapping using BCFtools v1.9-168 of the SAMtools package [17] and merged with the variant VCF files from the ARTIC pipeline, if necessary. All variants were screened against problematic_sites_sarsCov2.vcf for homoplasic sites or sequencing issues that have the potential to adversely affect phylogenetic and evolutionary inference [18]. Final genomes were submitted to GISAID to make them publicly available.
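The screening of called variants against a list of problematic sites can be pictured as a position lookup between two VCF files. The Python sketch below is a minimal illustration under stated assumptions, not the procedure actually used: the helper names (`parse_vcf`, `flag_problematic`, `vcf_line`) are hypothetical, the records are toy data rather than content copied from problematic_sites_sarsCov2.vcf, and only plain single-sample VCF text is handled.

```python
def parse_vcf(vcf_text):
    """Yield (pos, ref, alt, filter) tuples from VCF text, skipping header lines."""
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
        yield int(fields[1]), fields[3], fields[4], fields[6]


def flag_problematic(sample_vcf, problematic_vcf):
    """Return sample variants whose position is listed in the problematic-sites VCF."""
    flagged_sites = {pos: flt for pos, _ref, _alt, flt in parse_vcf(problematic_vcf)}
    return [
        (pos, ref, alt, flagged_sites[pos])
        for pos, ref, alt, _flt in parse_vcf(sample_vcf)
        if pos in flagged_sites
    ]


def vcf_line(pos, ref, alt, flt):
    """Build one tab-separated VCF record (toy helper for this example)."""
    return "\t".join(["MN908947.3", str(pos), ".", ref, alt, ".", flt, "."])


# Toy records for illustration only; the FILTER label is an assumed placeholder.
problematic = "\n".join(["##fileformat=VCFv4.2", vcf_line(11083, "G", "T", "caution")])
sample = "\n".join([
    "##fileformat=VCFv4.2",
    vcf_line(11083, "G", "T", "PASS"),
    vcf_line(14408, "C", "T", "PASS"),
])

print(flag_problematic(sample, problematic))  # [(11083, 'G', 'T', 'caution')]
```

A lookup by position alone is deliberately conservative: any variant at a listed site is reported for manual review, regardless of the alternative allele.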
SNP and phylogenetic analyses were performed using a local installation of the nextstrain.org pipeline [5]. For this, strains listed in GISAID and belonging to the African subset (N=1203 as of 22 June 2020) were included in the initial analysis, then filtered to minimize redundancy and selected by relevance to possible travel and/or trade routes (N=73).

Testing of Malian patient samples
[...] SNPs are non-synonymous and are hence listed in Table 1
[...] Outbreak LINeages) algorithm [19], the GISAID clades, which use the actual letters of the marker mutations [20], and the Nextstrain clades, which are defined by a year and a letter code [21]. As the Pangolin lineage and the Nextstrain clade resulted in a fairly similar classification for the Malian SARS-CoV-2 genomes (see Figure 3), and it is not yet clear which nomenclature will endure, we discuss our results for both systems.
[...] COVID-19 already in Mali prior to his departure to Tunisia. Furthermore, nearest neighbour sequences to Malian genomes of this lineage are also locally widespread and originate from all over the world

Special features of Malian genomes
Besides the division of the Malian genomes into the A (19B) and B (20A/C) clusters, two additional observations are worth mentioning.
The first one concerns sample M002672, belonging to the A (19B) lineage. Here, a 9 bp deletion at nucleotide position 685, resulting in a 3 amino acid deletion in the ORF1a protein, was detected. This microdeletion has already been described in SARS-CoV-2 genomes from countries of the northern hemisphere such as Iceland, Sweden, England, Wales, Canada and the USA [22], but has, to our knowledge, not been detected in African samples so far. The phenotypic effect of this microdeletion remains elusive to this day.
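The statement that a 9 bp deletion removes exactly three amino acids can be checked with simple coordinate arithmetic. The Python sketch below is an illustration under two assumptions that are not stated in the text: that ORF1a starts at nucleotide 266 of MN908947 (as in the GenBank annotation), and that position 685 is the VCF-style anchor base, so the deleted bases are 686 to 694. The function names are hypothetical.

```python
ORF1A_START = 266  # assumed 1-based start of ORF1a in MN908947


def codon_number(nt_pos, cds_start):
    """1-based codon index of a 1-based genome position within a CDS."""
    return (nt_pos - cds_start) // 3 + 1


def describe_deletion(anchor_pos, length, cds_start=ORF1A_START):
    """Summarise which codons a deletion following `anchor_pos` removes."""
    first, last = anchor_pos + 1, anchor_pos + length
    return {
        "first_codon": codon_number(first, cds_start),
        "last_codon": codon_number(last, cds_start),
        "frame_preserved": length % 3 == 0,              # downstream frame intact
        "codon_aligned": (first - cds_start) % 3 == 0,   # starts on a codon boundary
    }


print(describe_deletion(685, 9))
# {'first_codon': 141, 'last_codon': 143, 'frame_preserved': True, 'codon_aligned': True}
```

Under these assumptions the deletion is in frame and codon-aligned, removing ORF1a codons 141 to 143 without shifting the downstream reading frame.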
The second observation was made in samples M002673 and M002707 at positions 14408 and 18973, respectively. In both samples, quasispecies, indicated by the simultaneous occurrence of at least two variants at the described position, were detected. Table 1 lists only the dominant, more frequently occurring mutation. As the quasispecies were observed independently of the sequencing technology (Nanopore and Illumina), artificially introduced mutations, e.g. via PCR, can be excluded. The C14408T mutation, which leads to the P314L substitution in ORF1b, is a known mutation and, next to D614G, another hallmark of the B (20*) cluster. This mutation seems to be advantageous for the virus and can hence be regarded as a product of directed selection. In contrast, the G18973T mutation, leading to the amino acid exchange V1836F, was detected only once before, in a German cluster (example strain: Germany/NRW-MPP-24/2020), but apparently did not become established. Nevertheless, we had the rare opportunity to witness the evolution of the virus within a human sample.
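The amino acid positions quoted for the two substitutions follow from the nucleotide coordinates once a start coordinate for ORF1b numbering is fixed. The Python sketch below assumes that ORF1b numbering starts at nucleotide 13468 of MN908947, a convention that is not stated in the text; the function name `locate_in_cds` is hypothetical.

```python
ORF1B_START = 13468  # assumed 1-based start of ORF1b numbering in MN908947


def locate_in_cds(nt_pos, cds_start=ORF1B_START):
    """Return (codon index, position within codon), both 1-based."""
    offset = nt_pos - cds_start
    return offset // 3 + 1, offset % 3 + 1


for name, pos in (("C14408T", 14408), ("G18973T", 18973)):
    codon, codon_pos = locate_in_cds(pos)
    print(f"{name}: ORF1b codon {codon}, codon position {codon_pos}")
# C14408T: ORF1b codon 314, codon position 2
# G18973T: ORF1b codon 1836, codon position 1
```

With this start coordinate, the computed codons 314 and 1836 agree with the reported P314L and V1836F exchanges. The codon positions are also consistent with the standard genetic code: a C to T change at the second base of a proline codon (CCN) gives leucine (CTN), and a G to T change at the first base of a valine codon (GTN) can give phenylalanine (TTT or TTC).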
Regarding the variant screening, we found only one variant, G11083T, in two strains (M002593 and M002659) that may confound evolutionary interpretations. Position 11083 is a major homoplasic site in ORF1ab [23] that is geographically ubiquitous; in strain M002659, the variant appears in both the Illumina and MinION reads.

Discussion
Up to now, only limited sequencing data of Northern Africa and the Sahel region is available compared to other regions of the world like America, Asia or Europe. Viral genomes spreading on the African continent are mainly accessible from states like South Africa [24], the Democratic Republic of Congo (DRC) and Kenya [25]. For the Sahel region, genomic information of SARS-CoV-2 is only provided by Nigeria [26] and Senegal [5,27], while data from countries like Mauritania, Niger, Burkina Faso, Chad, Eritrea and Sudan is still missing.
With our full-length SARS-CoV-2 genomes originating from Mali, we want to contribute to the body of sequencing data from the Sahel region, thereby enhancing the knowledge of the [...]
Since the A lineage contains the initial emerging pandemic strain as well as Asian genomes [19], we suggest a very early, but probably unrecognized, onset of infections in Mali. Factors that most likely contributed to the non-detection of the virus are, on the one hand, the fragile and overstretched health care system and, on the other hand, transmission through patients with only mild symptoms who were misdiagnosed or not recognized at all. These factors might also explain the relatively high mortality rate (5%) and the high estimated number of unreported cases in Mali.
In contrast, the B.1 lineage comprises sequences associated with the large Italian outbreak in February 2020 and represents, together with its sublineage B.1.1, the worldwide prevalent virus clade as of 21 July 2020 [19]. The African continent, especially South Africa, is dominated by lineage B (its origin still lying in Asia) and B.1 [24]. In Mali, only one third of the patient samples were assigned to lineage B.1, thus giving rise to speculation that this viral lineage [...] but also to reasonable national measures to restrict or terminate the SARS-CoV-2 spread within the country and hence worldwide.