High Rate of Mutational Events in SARS-CoV-2 Genomes across Brazilian Geographical Regions, February 2020 to June 2021

Brazil was considered one of the emerging epicenters of the coronavirus pandemic in 2021, experiencing over 3000 daily deaths caused by the virus at the peak of the second wave. In total, the country had more than 20.8 million confirmed cases of COVID-19, including over 582,764 fatalities. A set of emerging variants arose in the country, some of them posing new challenges for COVID-19 control. The goal of this study was to describe mutational events across samples from Brazilian SARS-CoV-2 sequences publicly available on the Global Initiative on Sharing Avian Influenza Data-EpiCoV (GISAID-EpiCoV) platform and to generate indexes of new mutations per genome. A total of 16,953 SARS-CoV-2 genomes were obtained, which were not proportionally representative of the five Brazilian geographical regions. A comparative sequence analysis was conducted to identify common mutations located at 42 positions of the genome (38 were in coding regions, whereas two were in the 5′ UTR and two in the 3′ UTR). Moreover, 11 were synonymous variants, 27 were missense variants, and more than 44.4% were located in the spike gene. Of the single nucleotide variations (SNVs) identified, 32 were found in genomes obtained from all five Brazilian regions. While a high genomic diversity has been reported in Europe given the large number of sequenced genomes, Africa has demonstrated high potential for new variants. In South America, Brazil and Chile have shown rates similar to those found in South Africa and India, providing enough “space” for new mutations to arise. Genomic surveillance is key to identifying the emerging variants of SARS-CoV-2 in Brazil and has shown that the country is one of the “hotspots” in the generation of new variants.


Introduction
In December 2019, a novel betacoronavirus was first detected in Wuhan, China. Coronavirus Disease (COVID-19) [1] has developed into a global pandemic, causing waves of epidemics, infecting over 219 million people and causing 4.55 million deaths globally by 30 August 2021 [2]. Local outbreak profiles were shaped by restriction measures, including lockdowns, commerce limitations, and travel control. The viral spread has led scientists to investigate genomic epidemiology, which plays a central role in characterizing and understanding the emergence of viruses [3][4][5]. The SARS-CoV-2 single-stranded RNA is 29.9 kb in size and has positive coding orientation, encoding four major structural proteins on its 3′ end: spike (S), envelope (E), membrane (M), and nucleocapsid (N). These proteins are essential to virus entry into cells and virus particle formation [6,7].
NGS-based SARS-CoV-2 genome characterizations have revolutionized the scale and depth of variant analysis worldwide. Never before has a viral genome been sequenced so globally in such a short time. Numerous reports have revealed potential adaptations at the nucleotide (nt) and amino acid (aa) levels, as well as structural heterogeneity of viral proteins, particularly in the S protein [8]. At the moment, the world's concern is focused on four functionally well-defined variants, B.1.1.7 (Alpha), B.1.351 (Beta), P.1 (Gamma), and B.1.617.2 (Delta), which are associated with viral fitness changes [9][10][11][12].
The mutation rate in SARS-CoV-2 is about 10⁻⁴ replacements of base pairs per year, and possible variations may appear in each replication cycle. In the context of investigating evolutionary events, it is possible to compare single-nucleotide polymorphisms (SNPs) in RNA sequences because mutations in coronaviruses arise from RdRp mistakes during viral genome replication [6,13,14].
The spike protein of SARS-CoV-2 contains an N-terminal S1 subunit and a C-terminal, membrane-proximal S2 subunit. The N-terminal domain (NTD), located in the S1A portion, recognizes carbohydrates, such as sialic acid, and is responsible for the attachment of the virus to the host cell surface. The receptor-binding domain (RBD) in the S1B portion interacts with the human ACE-2 receptor. Between S1 and S2 there is a PRRA sequence motif that functions as a furin cleavage site. The transmembrane domain has a second cleavage site, which also participates in viral entry into host cells [15]. The mutation D614G in S, for instance, is a frequently identified mutation that has been associated with increased virus transmissibility and infectivity, and it is possibly one of the origins of the B.1.1 branch, which is widely prevalent in many countries. Nonetheless, the mutation does not seem to alter the antigenicity of the S protein [16] and was not associated with any changes in disease severity [13].
In this study, we evaluated mutational events across SARS-CoV-2 sequences from Brazil publicly available in GISAID since the beginning of the epidemic (from February 2020 to June 2021). Moreover, we propose a mathematical relation between new mutants and sequenced genomes. This analysis is fundamental to understanding the changes in the viral genome leading to alterations in viral fitness and transmissibility across the population.

Data Retrieval
Whole genome sequences of SARS-CoV-2 obtained from COVID-19 cases in Brazil were downloaded from the Global Initiative on Sharing Avian Influenza Data-EpiCoV (GISAID-EpiCoV) platform (https://www.gisaid.org/, accessed on 30 June 2021) [17]. Only sequences submitted up to 30 June 2021 and complete genomes (above 29,000 bp) were included. The high coverage filter was also applied to ensure acceptable quality. According to GISAID, high coverage means that only entries with less than 1% of undefined bases (NNNs) and no insertions and deletions unless verified by the submitter are tolerated. Sequences with an unidentified division were also excluded from the final dataset. The sequences were downloaded in FASTA format. The annotated reference genome sequence of the SARS-CoV-2 isolate Wuhan-Hu-1 was retrieved from the NCBI database (Accession Number: NC_045512.2).
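The inclusion criteria above can be illustrated with a short Python sketch. It is only an approximation of the filters described in the text (the actual filtering was done through the GISAID interface); the function names and the in-memory FASTA handling are assumptions made for illustration, and the insertion/deletion check applied by GISAID is not reproduced.

```python
from typing import Dict, Iterator, Tuple

MIN_LENGTH = 29_000      # "complete genome": above 29,000 bp, as stated in the text
MAX_N_FRACTION = 0.01    # "high coverage": less than 1% undefined bases (N)


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def passes_filters(seq: str,
                   min_length: int = MIN_LENGTH,
                   max_n_fraction: float = MAX_N_FRACTION) -> bool:
    """Return True if the sequence is long enough and has few undefined bases."""
    if len(seq) <= min_length:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction < max_n_fraction


def filter_genomes(fasta_text: str) -> Dict[str, str]:
    """Keep only the records that satisfy both criteria (hypothetical helper)."""
    return {h: s for h, s in parse_fasta(fasta_text) if passes_filters(s)}
```

Calling `filter_genomes` on the text of a downloaded FASTA file would return a dictionary of the retained records keyed by header.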

Data Processing
A total of 16,953 SARS-CoV-2 complete genome sequences obtained in Brazil were included in the study. The data were grouped according to Brazilian regions (Central-West = 939; Northeast = 1847; North = 921; South = 1913; Southeast = 11,333). Genomes were classified by region and aligned against the SARS-CoV-2 reference genome using the Minimap2 aligner [18]. The SAM files from the alignments were sorted, converted to BAM, and indexed using Samtools v1.9 (The Sanger Institute, Hinxton, UK) [19]. The BAM files were processed with bcftools mpileup and bcftools call (part of the samtools framework) to call variants and generate genomic VCF files. The bcftools filter command was then used to filter the called variations and to derive the final VCF file. The Variant Effect Predictor (VEP) was used to assess the functional effects of detected variants on SARS-CoV-2 transcripts [20].
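The order of the alignment and variant-calling steps can be sketched as a list of command lines assembled in Python. The text does not report the parameters used, so the file names, the haploid setting (`--ploidy 1`), and the quality threshold (`QUAL>=20`) below are assumptions chosen for illustration only; the VEP annotation step is not included.

```python
import subprocess
from typing import List


def build_variant_calling_commands(sample_fasta: str,
                                   reference_fasta: str,
                                   prefix: str) -> List[List[str]]:
    """Assemble the commands for align -> sort -> index -> pileup -> call -> filter.

    All output file names are hypothetical and derived from `prefix`.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    pileup = f"{prefix}.mpileup.bcf"
    raw_vcf = f"{prefix}.raw.vcf.gz"
    final_vcf = f"{prefix}.filtered.vcf.gz"
    return [
        # Align genomes against the reference and write SAM output
        ["minimap2", "-a", "-o", sam, reference_fasta, sample_fasta],
        # Sort the alignments and convert them to BAM
        ["samtools", "sort", "-o", bam, sam],
        # Index the sorted BAM file
        ["samtools", "index", bam],
        # Compute genotype likelihoods (uncompressed BCF output)
        ["bcftools", "mpileup", "-f", reference_fasta, "-Ou", "-o", pileup, bam],
        # Call variants only; haploid calling is an assumption for a viral genome
        ["bcftools", "call", "-m", "-v", "--ploidy", "1",
         "-Oz", "-o", raw_vcf, pileup],
        # Keep calls above an assumed quality threshold
        ["bcftools", "filter", "-i", "QUAL>=20", "-Oz", "-o", final_vcf, raw_vcf],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it possible to inspect or log each step before execution; `run_pipeline` requires minimap2, samtools, and bcftools to be installed.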

Dynamics of SARS-CoV-2 Clades
Genomic surveillance of SARS-CoV-2 across Brazilian regions was performed using the Nextstrain platform (https://nextstrain.org/ncov, accessed on 30 June 2021), an open-source program that generates updated phylogeny with interactive visualization of publicly available SARS-CoV-2 genomes. The pipeline includes subsampling, alignment, maximum-likelihood phylodynamic analysis, temporal dating of ancestral nodes, discrete trait geographic reconstruction, and results visualization in Auspice [21]. Because of the large number of sequenced genomes, we subsampled the Brazilian genomes from GISAID by using Nextstrain's bioinformatics toolkit, which includes Python 3 scripts for preparing GISAID data for processing by Augur [22]. This was done by using the "subsample-max-sequences" option to randomly sample 300 strains from states that had a large number of sequenced genomes (selected through the "query" option). The Brazilian virus lineages were identified using Pangolin v3.1.5 as implemented on 25 June 2021 [23]. Additionally, metadata of all SARS-CoV-2 genomes submitted to the GISAID database were accessed on 30 June 2021 by using the complete genomes and high-coverage filters, and the genomic clades were inferred according to the GISAID nomenclature system at the time of data collection.
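The idea behind the subsampling step can be illustrated with a small, self-contained Python sketch. This is not Augur's implementation; it only mimics the behavior described above (a random cap of 300 strains for states with many genomes). The record layout, the `division` field name, and the function name are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List


def subsample_by_group(records: List[Dict[str, str]],
                       group_key: str = "division",
                       max_per_group: int = 300,
                       seed: int = 0) -> List[Dict[str, str]]:
    """Randomly keep at most `max_per_group` records per group (e.g., per state)."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    sampled = []
    for name in sorted(groups):
        members = groups[name]
        # Only groups that exceed the cap are subsampled; small groups are kept whole
        if len(members) > max_per_group:
            members = rng.sample(members, max_per_group)
        sampled.extend(members)
    return sampled
```

Capping only the over-represented groups keeps sparsely sequenced states intact while limiting the weight of heavily sequenced ones in the phylogeny.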

Mathematical Model to Estimate the Rate of Genome Mutants and Global Diversity Rate
A total of 1,070,424 SARS-CoV-2 complete and high-coverage genome sequences were obtained and divided by continent: 11,574 genomes from South America; 14,986 from Oceania; 678,977 from Europe; 60,777 from Asia; 296,009 from North America; and 8101 from Africa. The data were processed according to the steps below and are shown in Supplementary Materials (Table S1 and Table S2): (I) calculation of an estimate of how many genomes are hypothetically necessary to obtain a new mutant; (II) comparison of continents by creating an estimated index for the growth rate of variation; and (III) comparison of countries from the same regions to identify hotspots of new variants.
For (I), to calculate an estimate of mutants per sequencing, we used as the reference the region with the most sequencing (Europe), comprising 1,000,285 genomes and a maximum of 956 lineages identified in 49 countries. To estimate how many genomes are necessary to obtain a new mutant, we used a correction factor to equalize the values of lineages, presuming that all the regions are hypothetically sequenced at the same rate.
The correction factor is (ΣGE/ΣG) × (ΣLE/ΣL), where ΣG is the sum of total genomes used after filtering, ΣGE is the sum of total genomes from Europe, ΣL is the sum of total lineages, and ΣLE is the sum of total lineages from Europe. Additionally, we applied the same logic for lineage correction and estimated the number of genomes necessary to identify a new variant using a correction rate, taking the maximum number of lineages, genomes, and countries as determinant variables.
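As a minimal sketch, the correction factor and a raw genomes-per-lineage ratio can be written in Python as follows. The function and argument names are assumptions; how the factor is then applied to each region's ratio is documented in Tables S1 and S2 and is not reproduced here.

```python
def correction_factor(genomes_europe: float,
                      genomes_total: float,
                      lineages_europe: float,
                      lineages_total: float) -> float:
    """Compute (sum G_E / sum G) * (sum L_E / sum L) as defined in the text."""
    return (genomes_europe / genomes_total) * (lineages_europe / lineages_total)


def genomes_per_lineage(genomes: float, lineages: float) -> float:
    """Raw G/L ratio for a region, before any correction is applied."""
    return genomes / lineages


# Toy numbers for illustration only; they are not taken from the study.
toy_factor = correction_factor(genomes_europe=50, genomes_total=100,
                               lineages_europe=10, lineages_total=40)
```

With the toy values above, the factor is 0.5 × 0.25 = 0.125.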
For (II) and (III), to estimate an index of growing variants (IGV) and to identify hotspots, we defined l as the number of lineages per country and L as the sum of total lineages from each region. The index was estimated using the Shannon index variation ln(l/L) and the sum of products of both matrices (comparing each country and estimating the variation inside the region).
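One way to read the index described above is as a Shannon-type sum, in which each country contributes the product of its lineage proportion l/L and ln(l/L). The following Python sketch makes that reading explicit. The negative sign (so that the index is positive) follows the usual Shannon convention and is an assumption, as are the function names and the toy country labels.

```python
import math
from typing import Dict


def country_term(lineages_country: float, lineages_region: float) -> float:
    """Per-country contribution: -(l/L) * ln(l/L)."""
    p = lineages_country / lineages_region
    if p <= 0:
        return 0.0  # a country with no lineages contributes nothing
    return -p * math.log(p)


def region_index(lineages_by_country: Dict[str, float],
                 lineages_region: float) -> float:
    """Sum of the per-country terms (the 'sum of products') for one region."""
    return sum(country_term(l, lineages_region)
               for l in lineages_by_country.values())


# Toy example with hypothetical countries; values are not from the study.
toy_index = region_index({"country_A": 1, "country_B": 1}, lineages_region=2)
```

Under this reading, a single country's term cannot exceed 1/e (about 0.3679), reached when l/L = 1/e, which is in line with the largest per-country values reported in the results.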
It is important to point out that different nomenclatures for SARS-CoV-2 have been proposed, including by Nextstrain, cov-lineages.org [23], and GISAID [17]. Because of that, we also looked at the assigned clades in the GISAID database. According to data from the GISAID database, within a year of emergence, SARS-CoV-2 had evolved into nine clades, including L (to which virus reference strains belong), S, V, G, GH, GR, GV, GRY, and O [24]. The subsampled distribution of GISAID clades across Brazilian regions is shown in Figure 6. Overall, the clade GR (n = 16,142; 95.2%) was the most prevalent among the SARS-CoV-2 genomes submitted from Brazilian regions, followed by GRY (n = 375; 2.2%) and G (n = 245; 1.4%). Less common clades including V, GV, S, and L were identified in 0.04%, 0.02%, 0.02%, and 0.02% of the submitted genomes, respectively.
In addition, analysis based on the chronological distribution of SARS-CoV-2 clades in Brazil showed that clade G was predominant at the beginning of the pandemic. However, this could partially be an effect of the small number of genomes sequenced by then. After this initial stage, clade GR increased rapidly, accounting for 74.5% of all sequences in March 2020, 89.2% in April 2020, and 86.9% in May 2020, and increased further to become the most frequent clade, with more than 90% in June 2020 (Figure 7).

Numbers of Genomes to Identify a New Variant
Regarding the published data obtained from GISAID, one question remains unclear: is it possible to estimate how many genomes are necessary to identify a new mutant? The number of sequences processed by each country is proportional to investment and laboratory resources. Nevertheless, it is possible to mathematically equalize the numbers obtained by applying a numeric correction value. We downloaded 1,629,158 sequences and grouped them as follows: South America, North America, Europe, Africa, Asia, and Oceania (Table S1). Europe in total sequenced 1,000,285 genomes of SARS-CoV-2, identifying 956 Pango lineages in 49 countries. Differences in genomic surveillance and viral spread are evident within Europe, where the United Kingdom sequenced 41.6% of all available genomes. Understanding the variability and evolution of genomes is fundamental to mounting an effective response to contain the pandemic; however, it requires governmental effort and scientific support. These difficulties are common in other countries, leading to a gap in the ability to determine hotspot locations by number of sequences. In Table 1, we calculated how many genomes are necessary to identify a new variant, considering the quantity of available genomes. The data were equalized with a correction value estimated as 0.06485746617 (Table S1 and Table S2 show the estimated index and applied formulas). The column G/L (genomes/lineages ratio) shows the equalized data. This analysis focused on the relevance of regions as hotspots for new variants: where the G/L value is smaller than the others, there is a greater probability of finding a new mutant for each reported number. In addition, the last column (Table 1) shows a diversity analysis between countries of each region, considering the maximum number of lineages of each country between geographical regions.
This index, represented by alpha diversity (Table 1), shows the high diversity of lineages across Europe, with a value of 8.4996. Among European countries, the index ranged from 0.0072 to the highest value, 0.3676, found in France. Second in diversity was Africa, with 6.4225, where South Africa presented the highest rate (0.3381), followed by Asia at 5.9559. In the Asian region, the model placed India (0.3677) as a hotspot for the generation of new variants, followed by Japan (0.3542). Across South America, the index was 3.4064, probably explained by the low availability of data and low proportion of cases in some countries. Analyzing each country in this region, we observed that, according to the genomes available, Chile and Brazil presented the highest levels of variability (0.3679 and 0.3650, respectively). North America (1.6557) presented lower levels of diversity, although Canada was, surprisingly, a significant hotspot (0.3574). Oceania, as expected, presented a minimal level of estimated variation, with an index of 0.9893.

Discussion
As SARS-CoV-2 continues to circulate in the human population after more than one year of pandemic, it is natural to observe genetic differences between SARS-CoV-2 strains sampled in various locations. This is the largest study focused on genome-wide mutational spectra covering nucleotides, amino acids, and deletion mutations in 16,953 complete and high coverage SARS-CoV-2 genomes from the five Brazilian geographical regions. Furthermore, this preliminary and crucial analysis of the Brazilian SARS-CoV-2 genomes shows an increase in the number of mutations.
Viral mutations are probabilistic events because of a virus's random transmission between infected people. Viral load is variable and depends on such factors as the course of infection and host immunity. Some individuals are "super spreaders", which means that behavioral and environmental variables are relevant to infectivity, increasing successful transmission [25,26]. To date, Brazil has recorded around 20.8 million COVID-19 cases. Hence, here we are analyzing fewer than 0.1% of reported cases, comprising a snapshot of SARS-CoV-2 mutational status in Brazil.
A number of previous studies have examined variants within SARS-CoV-2 isolates; for example, [27,28] reported clustered groups of sequences showing geographical similarities, suggesting clusters of similar transmission in both time and viral strains. Resende et al. [29] showed that B.1.1.28 (E484K) is present in several states from the South, Northeast, and North Brazilian regions and dates its origin to 27 August 2020 (14 July-18 September). These findings documented a classical SARS-CoV-2 reinfection case with the emerging Brazilian lineage B.1.1.28 (E484K). Additionally, the authors provide evidence of this emerging Brazilian clade's geographic dissemination outside the Rio de Janeiro state. Naveca et al. [30] reported a preliminary genomic analysis of SARS-CoV-2 B.1.1.28 lineage circulating in the Brazilian Amazon region and their evolutionary relationship with emerging and potential SARS-CoV-2 Brazilian variants harboring mutations in the RBD of spike protein.
Phylogenetic analysis of 69 B.1.1.28 sequences isolated in the Amazonas state revealed the existence of two major clades that evolved locally from April to November 2020 without unusual mutations in the spike protein. In Africa, Motayo et al. [31] showed a high prevalence of the D614G spike mutation (82%) among the sequences analyzed.
Beyond identifying the mutations, these analyses allow continuous research focused on mapping how amino acid changes affect antibody binding. In a recent study, Nonaka et al. (2021) [32] reported the first case of reinfection from a genetically distinct SARS-CoV-2 lineage presenting the E484K spike mutation in Brazil, a variant associated with escape from neutralizing antibodies. Mutations in the RBD enhanced ACE2 binding, promoting viral infectivity and possibly disrupting the binding of neutralizing antibodies (NAb) to evade the host immune response. Antibodies targeting the RBD have been developed and used as therapeutics and are known as the major contributors to NAb responses. To control viral infection, a robust humoral immune response is essential in populations. Scanning mutations is important to map RBD changes, and it is used to predict escape mutations in antibody epitopes [33,34].
In Brazil, SARS-CoV-2 genomes sequenced to date clustered in at least nine major clades, as defined by the GISAID database. Of them, clade GR was the most frequently identified in Brazilian genomes, followed by GRY and G. After clade G (lineage B.1), defined by the spike protein's D614G mutation, was identified, it rapidly predominated in many locales where it was found. Theoretical evidence suggests that mutations in the viral spike may be linked to altered potential for host cell membrane fusion, which may result in increased person-to-person transmission and pathogenicity [13,30,31]. A sub-cluster of clade G then started to split into GR, GH, and GV and more recently into GRY. Similarly, a recent analysis of the distribution of SARS-CoV-2 genomes across continents [24] showed a marked expansion in the number of sequenced genomes clustered into the GR and GRY clades compared with clade G, suggesting higher fitness for transmission by the newer clades compared with their ancestral one.
Here we evaluated the distribution of SARS-CoV-2 mutations across the five Brazilian geographical regions, showing different allelic frequencies but a similar general distribution of all variants across regions. We also showed the presence of 27 missense variants in the entire genome, the largest share (44.4%) being present in the spike gene. Based on Pangolin software, we showed the presence of 61 SARS-CoV-2 lineages across Brazilian regions, with a high predominance of the Gamma variant. Based on Nextstrain clades, Brazilian genomes were classified into nine clades, with the majority belonging to clade 20B (n = 2724; 50.91%) and 20J (Gamma, V3) (n = 2516; 47.21%). In GISAID clades, there are also nine clades, with a predominance of the GR clade (95.2%).
Finally, we estimated the number of genomes necessary to report a new variant and how this index varies by continent. The aim of the diversity analysis was to show the hotspot regions in a snapshot of the pandemic. While Europe shows high genomic diversity given the large number of sequenced genomes, Africa is emerging as a hotspot for new variants. Asia is the third continent in terms of diversity. In South America, Brazil and Chile present mutation rates that are similar to those of South Africa and India. These numbers indicate that such regions are indeed hotspots for the emergence of new variants, especially when social restrictions are not applied strictly, leading to increased viral circulation. Genomic surveillance proved to be a potential tool for monitoring the circulation of SARS-CoV-2 and understanding the biological characteristics of the viral genome.