Genomic Epidemiology of SARS-CoV-2 in Tocantins State and the Diffusion of P.1.7 and AY.99.2 Lineages in Brazil

Tocantins is a state in the cross-section between the Central-West, North and Northeast regions of Brazilian territory; it is a gathering point for travelers and transportation from the whole country. In this study, 9493 genome sequences, including 241 local SARS-CoV-2 samples (collected from 21 December 2020, to 16 December 2021, and sequenced in the MinION platform) were analyzed with the following aims: (i) identify the relative prevalence of SARS-CoV-2 lineages in the state of Tocantins; (ii) analyze them phylogenetically against global SARS-CoV-2 sequences; and (iii) hypothesize the viral dispersal routes of the two most abundant lineages found in our study using phylogenetic and phylogeographic approaches. The performed analysis demonstrated that the majority of the strains sequenced during the period belong to the Gamma P.1.7 (32.4%) lineage, followed by Delta AY.99.2 (27.8%), with the first detection of VOC Omicron. As expected, there was mainly a dispersion of P.1.7 from the state of São Paulo to Tocantins, with evidence of secondary spreads from Tocantins to Goiás, Mato Grosso, Amapá, and Pará. Rio de Janeiro was found to be the source of AY.99.2 and from then, multiple cluster transmission was observed across Brazilian states, especially São Paulo, Paraiba, Federal District, and Tocantins. These data show the importance of trade routes as pathways for the transportation of the virus from Southeast to Northern Brazil.


Introduction
The ongoing COVID-19 pandemic is one of the greatest global threats in modern history. Since 2019, with the host spillover of the Severe Acute Respiratory Syndrome Coronavirus 2 Genomic surveillance provides important clues as to virus-host dynamics [12,13]. Long-read sequencing was first used in arboviral genomic studies [14,15] and then quickly adapted for SARS-CoV-2 given the public health emergency. Thus, next-generation sequencing (NGS) technologies represent a powerful tool for tracing the origin, spread, and transmission chains of outbreaks, as well as monitoring the evolution of etiological agents. The COVID-19 pandemic has triggered efforts for real-time surveillance strategies based on viral genome sequencing. In Brazil, the Corona-ômica-BR project (a network made up of 11 research institutions and dozens of researchers) is harnessing efforts and human resources to track the viral spread and support public health authorities. In this study, we report the complete genome sequences and phylogenetic analysis of 241 SARS-CoV-2 genomes detected in the Tocantins state, located in the North region of Brazil, from December 2020 to December 2021. Furthermore, the first description of SARS-CoV-2 molecular epidemiology in this Brazilian state is presented.

Sample Collection
Samples from patients with SARS-CoV-2-positive nasopharyngeal RT-qPCR were collected between 21 December 2020, and 16 December 2021, at the Central Public Health Laboratory of the State of Tocantins (Figure 1). Reverse-transcription-quantitative realtime polymerase chain reaction (RT-qPCR) for SARS-CoV-2 is the gold-standard method for laboratory diagnosis of COVID-19. The kit used for all samples was the Allplex 2019-nCoV assay, Seegene Inc, Korea (provided by the Brazilian Ministry of Health and used as a routine procedure at all public health laboratories in Brazil). The RT-qPCR reactions were conducted using 6.5 μL of RNA input, instead of the recommended 8 μL. Briefly, the Allplex kit employs target genes and labeled probes E (FAM), N (Quasar 670), RdRP (Cal Red 610), and an internal control (IC; HEX) in one multiplex RT-qPCR. All RT-qPCR reactions were performed using the QuantStudio 5 Real-Time PCR System (Applied Biosystems, Waltham, CA, USA). In total, 241 positive samples with quantification cycle (Cq)

Sample Collection
Samples from patients with SARS-CoV-2-positive nasopharyngeal RT-qPCR were collected between 21 December 2020, and 16 December 2021, at the Central Public Health Laboratory of the State of Tocantins ( Figure 1). Reverse-transcription-quantitative real-time polymerase chain reaction (RT-qPCR) for SARS-CoV-2 is the gold-standard method for laboratory diagnosis of COVID-19. The kit used for all samples was the Allplex 2019-nCoV assay, Seegene Inc., Seoul, Korea (provided by the Brazilian Ministry of Health and used as a routine procedure at all public health laboratories in Brazil). The RT-qPCR reactions were conducted using 6.5 µL of RNA input, instead of the recommended 8 µL. Briefly, the Allplex kit employs target genes and labeled probes E (FAM), N (Quasar 670), RdRP (Cal Red 610), and an internal control (IC; HEX) in one multiplex RT-qPCR. All RT-qPCR reactions were performed using the QuantStudio 5 Real-Time PCR System (Applied Biosystems, Waltham, CA, USA). In total, 241 positive samples with quantification cycle (Cq) below 25 for at least one primer were selected and submitted to the SARS-CoV-2 genome sequencing. The study was approved by the Ethics Committee (33202820.7.1001.5348).

RNA Extraction and Sequencing
The samples were processed at the Central Public Health Laboratory of the State of Tocantins, with an Extracta kit Viral RNA MVXA-P096 FAST (Loccus, Brazil) using an automated extractor (Extracta 96, Loccus, Brazil) following the manufacturer's guidelines. The cDNA synthesis was performed using Luna Script RT SuperMix (5×) (New England Biolabs, Ipswich, MA, USA). The synthesized cDNAs were used as templates for a multiplexed PCR reaction using the two non-overlapping primer pools provided by the ARTIC Network to generate~400 bp amplicons tiled across the genome (V3 nCov-2019 primers) (ARTIC primer set). Amplicons from both primer pools were combined and purified with a 1× volume of Ampure XP beads (Beckman Coulter, Brea, CA, USA). The MinION library preparation was performed using the Ligation Sequencing kit SQK-LSK-109 and Natives Barcoding kits EXP-NBD104 and EXP-NBD114 (Oxford Nanopore, Oxford, UK). The result- ing library was loaded on R9.4 Oxford MinION flowcells (FLO-MIN106) and sequenced using the MinION Mk1B device. ONT MinKNOW software was used to collect raw data. High-accuracy base-calling and quality control analyses were performed using Guppy (v6.0.1) and NanoPlot (v.1.33.0), respectively. Assembly of the high accuracy base called fastq files was performed using the nCoV-2019 novel coronavirus bioinformatics protocol (https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html, accessed on 12 January 2022) with Minimap2 [16] and Medaka (https://github.com/nanoporetech/medaka, accessed on 12 January 2022) for consensus sequence generation.

Phylogenetic Analysis
The 241 Tocantins sequences were combined with 9289 SARS-CoV-2 genome sequences to evaluate the phylogenetic relationships of the viruses circulating in a global context. Accordingly, all genome sequences and associated metadata deposited until 31 December 2021, were downloaded from GISAID (A total of 6,665,096). The first round of subsampling was performed by using Nextstrain's bioinformatics toolkit [23]. We selected 10,000 sequences collected from Brazil between 28 February 2020 to 31 December 2021, excluding data from Tocantins. Next, we selected 5000 sequences from South America collected between 28 February 2020 to 31 December 2021, excluding data we had already selected from Brazil. Finally, we selected 2000 sequences from Africa, Asia, China, Europe, North America, and Oceania, with the maximum date limited to 31 December 2021. For all the sequence selections, we limited the minimum-length parameter to 29,000 base pairs. These steps generated a random subsampling of 25,000 SARS-CoV-2 genomes. To reduce the dataset of genomes to allow feasible phylogenetic analysis, we then used the genome-sampler [24], which selects temporal, geographical, and diversity context samples from an available dataset containing focal sequences. Tocantins' sequences were defined as the focal sequence collection. These steps resulted in a dataset with 9075 sequences. We also downloaded from the GISAID database 211 sequences from Tocantins state and included three sequences from December 2019 collected in Wuhan, China (EPI_ISL_402125, EPI_ISL_406798, and EPI_ISL_434534). The resulting dataset has a total of 9530 sequences.
Multiple sequence alignment was performed using MAFFT with default settings and manually inspected using AliView v1.27 [25]. The maximum likelihood (ML) phylogenetic tree was built using IQ-TREE v2.1.2 [26]. The IQ-TREE2 analysis was performed under the generalized time-reversible (GTR) model of nucleotide substitution with empirical base frequencies (+F) plus FreeRate model (+R3), as selected by the ModelFinder software [27] and 1000 replicates of ultrafast bootstrapping (−B 1000) and SH-aLRT branch test (−alrt 1000). The ML tree was inspected in TempEst v1.5.3 [28] to investigate the temporal signal through a regression analysis of root-to-tip genetic distance against sampling dates. Thirty-seven sequences were strong outliers and were removed from the analysis, leaving 9493 sequences in the final dataset. By repeating the above steps, a strong temporal signal in the final dataset was confirmed (R2 of 0.83 and correlation coefficient of 0.91; Figure S1) and a dated phylogenetic tree was inferred using TreeTime [29]. Visualizations of the time-resolved phylogenetic tree were produced in R v4.1.2 using the ggtree package [30].
Additionally, we extracted sequences of two clades that grouped a large number of Tocantins sequences from the phylogenetic tree using the caper R package [31]. Multiple sequence alignment was performed using MAFFT with default settings and manually inspected using AliView. Spatial and temporal patterns of diffusion were estimated using a Bayesian Markov Chain Monte Carlo (MCMC) approach implemented in BEAST v1.10.4 [32] with BEAGLE v3.1.2 [33] to improve computational time. The HKY + G model [34] was used to model nucleotide evolution under a strict clock model with a Continuous Time Markov Chain (CTMC) [35] and a non-parametric Bayesian Skygrid coalescent model [36]. The best-fit nucleotide substitution models for the two data sets were identified according to the Bayesian information criterion (BIC) method in jMod-elTest v2.1.10 [37]. Location diffusion rates were estimated using the Bayesian stochastic search variable selection (BSSVS) model [38] with a discretization scheme defined as Brazilian states. Following the Nextstrain workflow [23], the initial evolutionary clock rate was set to 8 × 10 −4 substitutions per site per year. Two independent MCMC chains were performed with a length of 100 million generations sampling every 10,000 generations. The two independent runs were merged with Log Combiner v1.10.4 [32] and the convergence of the MCMC chain was assessed using Tracer v1.7.2 [39]. A maximum clade credibility tree (MCC) with annotated branches was then generated in TreeAnnotator v1.10.4, with 20% removed as burn-in [32]. The MCC phylogenetic trees were visualized using ggtree R package and the SpreaD3 v0.9.7.1 [40] was used to visualize the phylogeographic patterns embedded in MCC trees.

Phylogenetic Analysis
The time-resolved phylogenetic tree confirmed the PANGO lineages assigned, and the sequences from the Tocantins were allocated into branches close to Brazilian and South American sequences ( Figure 5). We also point out that during the initial period, the predominant lineages circulating in Brazil were B.1.1.28 and B.1.1.33, followed by the rise of P.2, a descendant of the B.1.1.28 lineage, which formed a clear, evolving differentiation. Furthermore, the first sequences obtained from the Tocantins were grouped in these clades as well, as expected. Subsequently, the majority of the analyzed samples were classified as belonging to the Brazilian P.1 lineage (Gamma), which has dominated the epidemiological scenario in Brazil since its emergence (November-December 2020). Additionally, P.1.7, which originated from the P.1 lineage (spike substitution P681H), was widely spread in Tocantins from March 2021 [8]. The first confirmed case of Delta lineage in Brazil occurred on 26 April 2021. In Tocantins state, the Delta VOC was first detected in samples collected on 29 June 2021, and became dominant in September 2021. We highlight here the high spread and circulation of the Delta variant, and, in particular, of the lineage AY.99.2 in Tocantins, as well as in all Brazilian regions ( Figure 5; see also Figure S3). The only BA.1 sequence identified in this study branched in a clade represented by 59 sequences, mostly represented by sequences from Australia (37.3% or 22 genomes), followed by sequences

Phylogenetic Analysis
The time-resolved phylogenetic tree confirmed the PANGO lineages assigned, and the sequences from the Tocantins were allocated into branches close to Brazilian and South American sequences ( Figure 5). We also point out that during the initial period, the predominant lineages circulating in Brazil were B.1.1.28 and B.1.1.33, followed by the rise of P.2, a descendant of the B.1.1.28 lineage, which formed a clear, evolving differentiation. Furthermore, the first sequences obtained from the Tocantins were grouped in these clades as well, as expected. Subsequently, the majority of the analyzed samples were classified as belonging to the Brazilian P.1 lineage (Gamma), which has dominated the epidemiological scenario in Brazil since its emergence (November-December 2020). Additionally, P.1.7, which originated from the P.1 lineage (spike substitution P681H), was widely spread in Tocantins from March 2021 [8]. The first confirmed case of Delta lineage in Brazil occurred on 26 April 2021. In Tocantins state, the Delta VOC was first detected in samples collected on 29 June 2021, and became dominant in September 2021. We highlight here the high spread and circulation of the Delta variant, and, in particular, of the lineage AY.99.2 in Tocantins, as well as in all Brazilian regions ( Figure 5; see also Figure S3). The only BA.1 sequence identified in this study branched in a clade represented by 59 sequences, mostly represented by sequences from Australia (37.3% or 22 genomes), followed by sequences from Brazil (13.6%, or eight genomes; six from São Paulo, one from Ceará, and one from the Tocantins) ( Figure 5). from Brazil (13.6%, or eight genomes; six from São Paulo, one from Ceará, and one from the Tocantins) ( Figure 5). We identified two subclades with high branch support, one for the P.1.7 lineage (SH-aLRT = 75.8) and one for the AY.99.2 lineage (SH-aLRT = 84.7), both grouping a large number of sequences from Tocantins, especially from our study. The P.1.7 clade was composed of 243 sequences, the majority of which were from Tocantins (49.4% or 120 genomes), followed by São Paulo (24.3% or 59 genomes) and Goiás (12.3% or 30 genomes). Furthermore, this clade was represented by 15 Brazilian states and one genome from Chile (EPI_ISL_2663617). The AY.99.2 clade was composed of 214 sequences. The majority of these were also from Tocantins (34.1% or 73 genomes). Moreover, this clade contained only Brazilian sequences, represented by 19 states and the Federal District. In order to determine the time of the most recent common ancestor (TMRCA) and the diffusion of these two subclades identified in our maximum-likelihood phylogenetic tree, we used coalescent and phylogeographic methods.  We identified two subclades with high branch support, one for the P.1.7 lineage (SH-aLRT = 75.8) and one for the AY.99.2 lineage (SH-aLRT = 84.7), both grouping a large number of sequences from Tocantins, especially from our study. The P.1.7 clade was composed of 243 sequences, the majority of which were from Tocantins (49.4% or 120 genomes), followed by São Paulo (24.3% or 59 genomes) and Goiás (12.3% or 30 genomes). Furthermore, this clade was represented by 15 Brazilian states and one genome from Chile (EPI_ISL_2663617). The AY.99.2 clade was composed of 214 sequences. The majority of these were also from Tocantins (34.1% or 73 genomes). Moreover, this clade contained only Brazilian sequences, represented by 19 states and the Federal District. In order to determine the time of the most recent common ancestor (TMRCA) and the diffusion of these two subclades identified in our maximum-likelihood phylogenetic tree, we used coalescent and phylogeographic methods.

Discussion
In this study, we report the first description of SARS-CoV-2 molecular epidemiology in Tocantins State, Brazil, from September 2020 to December 2021. We demonstrate that three main lineages dominated the SARS-CoV-2 scenario in Tocantins during this period of time: P.1 (33.4%), P.1.7 (34.9%), and AY.99.2 (17.9%). Therefore, our analysis showed that in September 2020, SARS-CoV-2 strains were mainly classified as B.1.1.28 and B.1.1.33, which is similar to observations in other Brazilian regions [2,41,42]. Nevertheless, these lineages were rapidly replaced by the variant of interest (VOI) Zeta (P.2) and, especially, by the Gamma VOC (P.1 and P.1.7), which rapidly spread and dominated the epidemiological scenarios across different regions of Brazil [8]. The P.1 lineage was first detected on 6 January 2021, by Japan's National Institute of Infectious Disease, in four travelers who arrived in Tokyo from Amazonas, Brazil, on 2 January 2021, at airport control [3,43]. The rapid increase in the number of hospitalizations caused by this SARS-CoV-2 variant was a significant source of economic and social disruption in Brazil and led to a significant amount of pressure on the health system. AY.99.2 was a prominent ramification of VOC Delta in all of South America before the emergence of VOC Omicron [44].
The The differences in viral kinetics found between the pre-Gamma, Gamma, and Delta variants in Tocantins suggest some incremental, but potentially adaptive, changes in viral dynamics. These changes are associated with the evolution of SARS-CoV-2 towards a more rapid viral spread. The emerging Brazilian Gamma variant, derived from the B.1.1.28 lineage, began spreading rapidly later in 2020 [45]. The Gamma variant spread from the north of Brazil, passing through Tocantins, towards the rest of the country. Moreover, the spread of Gamma was influenced by the demand for care for Amazon patients in other states that led to the transfer of patients to different hospitals throughout the country.   [46] reported that P.1+ lineages with convergent mutations evolved independently in other Brazilian states, and that the lineage designated as P.1.7 (P.1 + P681H) was the major lineage circulating outside Amazonas.
Our study indicates that the lineage AY.99.2 probably emerged in Rio de Janeiro state, and that multiple cluster transmission was observed across Brazilian states. We also point out a strong migration from Tocantins to Pará. All these observed migrations were probably influenced by the Belém-Brasília highway. In addition, the Tocantins sequences formed a separate clade, reinforcing the idea that local transmission occurred in the state. This may have been mainly influenced by the relaxation of non-pharmacological measures, such as social isolation and quarantines [46][47][48].
We recognize some limitations in the present study. First, the number of analyzed genomes corresponds to a small percentage of the number of cases. Second, the dispersal routes were influenced by multiple factors, including the geographic parameters and traffic metrics of people along the Belém-Brasília highway [49]. However, this may have been a determining factor in local transmission. Third, the dissemination of VOCs among Brazilian states is not a linear description [50,51]. The introduction of the Delta variant occurred in the first semester of 2021, during a period of restriction relaxation and incomplete vaccine schedules. Despite this, and despite the differences between states with large population concentrations (São Paulo, Rio de Janeiro), others, such as Tocantins, are influenced by the transport of agribusiness products [52,53]. In the case of the latter, the state of Tocantins stands out, as it is considered the last Brazilian agricultural frontier beyond the Amazon [11].
Our findings help to explain how the AY.99.2 lineage is transmitted so efficiently in Brazilian populations. Specifically, our results suggest that São Paulo, Federal District, Paraíba, and Tocantins states played an important role in this transmission in the Midwest, North, and Northeast regions of the country. This highlights that genomic surveillance, in addition to being useful in determining which viral variants are circulating in a given location, can also help elucidate viral trajectories. Our findings were also important to support actions to reinforce vaccination, epidemiological surveillance, and the planning of health units (intensive care units and inpatient beds), in addition to the structuring of the genomic surveillance department at the Central Public Health Laboratory of the State of Tocantins (LACEN-TO).
Despite the limitations of our study, we showed genomic surveillance to be a potential tool for monitoring the circulation of SARS-CoV-2 in Tocantins state. Although we used a relatively small number of SARS-CoV-2 sequences from Tocantins, we suggest that our analysis reflects the major variants circulating, as shown in other studies on other Brazilian states. However, levels of SARS-CoV-2 sequencing were significantly different between Brazil and other countries and across Brazilian regions and states [8]. For example, states from northern regions, such as Acre, Amapá, Roraima, and Tocantins, submitted 231, 517, 256, 452, and 890 sequences in GISAID until 31 December 2021, respectively. In comparison, states from regions in the southeast, such as São Paulo and Rio de Janeiro, submitted 45,434 and 11,081 sequences, respectively. This highlights the urgent need for implementing efficient strategies for genomic surveillance in Brazil across different regions, such as the action of the Corona-Ômica Network.Br-MCTI, allowing local public health agencies to know more about the pandemic situation in their states.
Finally, SARS-CoV-2 lineages presenting different kinetics are still emerging in the COVID-19 pandemic. We show the importance of the continuous monitoring of the epidemiology of COVID-19, as well as the specific studies of dynamics that are useful to defining public health measures. All viral surveillance information is deeply helpful in order to follow the evolution of the virus and its routes of dispersion.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/v14040659/s1, Figure S1: Root-to-tip regression of genetic distances and sampling dates for 9493 sequences in the final dataset. Correlation coefficient (r) and r squared are depicted above the graph. Table S1: Epidemiological characteristics of the 241 sequenced samples from Tocantins state. Figure S2: Spatiotemporal distribution of the 241 sequenced genomes from Tocantins state. The figure was generated using R v4.1.2 with the following packages: ggplot2, geojsonio, sf, broom and dplyr, Figure S3