Molecular Tracing of SARS-CoV-2 in Italy in the First Three Months of the Epidemic

The aim of this study is the characterization and genomic tracing by phylogenetic analyses of 59 new SARS-CoV-2 Italian isolates obtained from patients attending clinical centres in North and Central Italy until the end of April 2020. All but one of the newly-characterized genomes belonged to the lineage B.1, the most frequently identified in European countries, including Italy. Only a single sequence was found to belong to lineage B. A mean of 6 nucleotide substitutions per viral genome was observed, without significant differences between synonymous and non-synonymous mutations, indicating genetic drift as a major source for virus evolution. tMRCA estimation confirmed the probable origin of the epidemic between the end of January and the beginning of February with a rapid increase in the number of infections between the end of February and mid-March. Since early February, an effective reproduction number (Re) greater than 1 was estimated, which then increased reaching the peak of 2.3 in early March, confirming the circulation of the virus before the first COVID-19 cases were documented. Continuous use of state-of-the-art methods for molecular surveillance is warranted to trace virus circulation and evolution and inform effective prevention and containment of future SARS-CoV-2 outbreaks.


Introduction
Italy is one of the countries most-and earliest-affected in Europe by the COVID-19 pandemic (https: //gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6). The first autochthonous cases of Coronavirus 2019 Disease (COVID- 19) were observed starting from 21 February 2020 in Codogno (Lodi province), determining on 22 February 2020 the establishment of a "red zone" to contain the epidemic, encompassing 11 municipalities. Thereafter, in a short time, it became evident that the epidemic had already involved a large part of Lombardy region and then spread to neighboring regions and, substantially less, to the rest of the country. On 9 March lockdown was declared for the entire country. The rapidly increasing number of patients who required hospitalization in the intensive care unit suggested that the virus may have circulated for a long period and caused thousands of contagions before the epidemic became manifest [1].
SARS-CoV-2 was first detected in Italy in a couple of Chinese tourists coming from Wuhan on 31 January [2]. Subsequent evaluations have not shown a relationship between the sequence of these strains and those implicated in the epidemic in Lombardy [3].
On the contrary, the Codogno strains resulted strictly related with a strain of SARS-CoV-2 coming from Shanghai which caused a small outbreak in Munich around 20 January [1] and was probably spread later to other European countries and beyond the Atlantic [4]. These sequences are part of a clade initially defined as a European clade, the old Nexstrain A2a subclade, which is currently the most widespread outside China and probably responsible for most of the world pandemic [5].
In the face of more than 240,000 notified cases in Italy, the entire genomes available in public databases are still scarce (77 at the time of this study). The availability of large numbers of sequences collected over time is necessary for molecular surveillance of the epidemic and for evaluation and planning of effective control strategies. To perform this study, a network of Italian Clinical centres and Laboratories across Italy generated additional 59 full-length SARS-CoV-2 sequences from COVID-19 patients ranging from the end of February to the end of April. This contribution helps to trace the temporal origin, the rate of viral evolution and the population dynamics of SARS-CoV-2 in Italy by phylogeny.

Patients and Methods
A total of 59 SARS-CoV-2 whole genomes were newly-characterized from an equal number of patients affected by COVID-19, attending different clinical centres in Northern and Central Italy, from the beginning of the epidemic (22 February 2020) until 27 April 2020 (Table S1).
All of the data used in this study were previously anonymized as required by the Italian Data Protection Code (Legislative Decree 196/2003) and the general authorizations issued by the Data Protection Authority. Ethics Committee approval was deemed unnecessary because, under Italian law, all sensitive data were deleted and we collected only age, gender and sampling date (Art. 6 and Art. 9 of Legislative Decree 211/2003).
Eighteen sequences were obtained after isolating the virus in Vero E6 cells while the remaining 41 were obtained directly from biological samples such as nasopharyngeal swabs or broncho-alveolar lavages (39 and 2, respectively).
Full genome sequences were obtained with different protocols by amplifying 26 fragments as previously described (n = 42) [1] or using random hexamer primers (n = 8) or Ion AmpliSeq SARS-CoV-2 Research Panel (Thermo Fisher Scientific, Waltham, Massachusetts, USA) (n = 9). The PCR products were used to prepare a library for Illumina deep sequencing using a Nextera XT DNA Sample Preparation and Index kit (Illumina, San Diego, California, USA) in accordance with the manufacturer's manual, and sequencing was carried out on a Illumina MiSeq platform for 50 samples, while the remaining nine were sequenced on Ion GeneStudio™ S5 System instrument following the Ion AmpliSeq™ RNA libraries protocol (Thermo Fisher Scientific, Waltham, Massachusetts, USA). The results were mapped and aligned to the reference genome obtained from GISAID (https: //www.gisaid.org/, accession ID: EPI_ISL_412973) using Geneious software, v. 9.1.5 (Biomatters, Auckland, New Zealand) (http://www.geneious.com) [6] or Torrent Suite v. 5.10.1 (Euformatics Oy, Espoo, Finland) or BWA-mem and rescued using Samtools alignment/Map (Hinxton, UK) (v 1.9).

Sequence Data Sets
The newly-characterized 59 genomes plus three previously characterized isolates by us (EPI_ISL_417445-417447) [1] were aligned with a total of 77 Italian sequences available in public databases (GISAID, https://www.gisaid.org/) on 13 May 2020 and 452 genomes sampled in different European and Asian countries (513 and 16, respectively) representing all the different viral clades described in the Nextstrain platform (https://nextstrain.org/). The final data set thus included 588 sequences. Due to the large amount of available sequences, we focused the analysis on European strains by randomly selecting sequences from each country and by excluding identical strains or strains with more than 5% of gaps. We sampled the data in order to have no temporal gaps, by grouping the sequences by country/week/clade and randomly selecting the sequences in each group. We choose 15 sequences for clade A2 and 5 sequences for other clades (therefore including A and B Pangolin lineages and different sublineages) for each European country. For countries with less than the required sequence number we kept all the sequences. The sampling dates of the entire dataset ranged from 30 December 2019 to 27 April 2020. Table S2 shows the accession IDs, sampling dates and locations of the sequences included in the dataset.
A subset of sequences assigned to the old Nextstrain A2 clade (classified as B lineage for Pangolin), was generated for dating the epidemic, including all the Italian sequences, one German (EPI_ISL_406862) and three Chinese isolates from Shanghai, ancestral to the A2 clade (EPI_ISL_416327, EPI_ISL_416334 and EPI_ISL_416386). Coalescent and birth-death phylodynamic analyses were performed on the 136 Italian A2 sequences only.

Genetic Distance, Recombination, and Selection Pressure Analyses
The MEGA X program was used to evaluate the genetic distance between and within Italian sequences on the full length genome, with variance estimation performed using 1000 bootstrap replicates [8].
The RDP5 software was used to investigate the presence of potential recombination [9].
All of the genes were tested for selection pressure using Datamonkey (https://www.datamonkey. org/). Amino acid changes were evaluated using EPI_ISL_402123 as reference strain.

Phylogenetic and Phylodynamic Analyses
The simplest evolutionary model best fitting the sequence data was selected using the JmodelTest v.2.1.7 software [10], and proved to be the Hasegawa-Kishino-Yano model with a proportion of invariant sites (HKY+I).
The phylogenetic analysis for clade assignment was performed by RaxML [11] on the entire dataset of 588 genomes. During the period in which we were carrying out the study, the SARS-CoV-2 clade nomenclature system changed. In particular, Rambaut et al. proposed a dynamic nomenclature based on phylogenetic lineages, called Pangolin (Phylogenetic Assignment of Named Global Outbreak LINeages) [12]. For this reason, we used the old Nextstrain and the new Pangolin (freely available at https: //pangolin.cog-uk.io/) systems for strain classification. The new Nextstrain classification was performed by using the available script (https://github.com/nextstrain/ncov/blob/master/docs/running.md).
The virus' phylogeny, evolutionary rates, times of the most recent common ancestor (tMRCA) and demographic growth were co-estimated in a Bayesian framework using a Markov Chain Monte Carlo (MCMC) method implemented in v.1.10.4 and v.2.62 of the BEAST package [13,14].
A root-to-tip regression analysis was made using TempEst in order to investigate the temporal signal of the dataset [15].
The MCMC analysis was run until convergence with sampling every 10,000 generations. Convergence was assessed by estimating the effective sampling size (ESS) after 10% burn-in using Tracer v.1.7 software (http://tree.bio.ed.ac.uk/software/tracer/), and accepting ESS values of 200 or more. The uncertainty of the estimates was indicated by 95% highest marginal likelihoods estimated [17] by path sampling/stepping stone methods [16].
The final trees were summarized by selecting the tree with the maximum product of posterior probabilities (pp) (maximum clade credibility or MCC) after a 10% burn-in using Tree Annotator v.1.10.4 (included in the BEAST package), and were visualized using FigTree v.1.4.2 (http://tree.bio.ed. ac.uk/software/figtree/). Posterior probabilities >0.7 were considered significant.

Birth-Death Skyline Estimates of the Effective Reproductive Number (R e )
The birth-death skyline model implemented in Beast 2.62 was used to infer changes in the effective reproductive number (R e ), and other epidemiological parameters such as the death/recovery rate (δ), the transmission rate (λ), the origin of the epidemic, and the sampling proportion (ρ) [18]. Given that the samples were collected during a short period of time, a "birth-death contemporary" model was used.
The analyses were based on the previously selected HKY substitution model and the evolutionary rate was set to the value of 0.8 × 10 −3 subs/site/year, which corresponds to the mean substitution rate estimated using a relaxed clock under the exponential coalescent model as transformed into units per year.
For the birth-death skyline analysis, from one to two R e intervals and a log-normal prior with a mean (M) of 0.0 and a variance (S) of 1.0 were chosen, which allows the R e values to change between <1 (0.193) to >5. A normal prior with M = 48.7 and S = 15 (corresponding to a 95% interval from 24.0 to 73.4) was used for the rate of becoming uninfectious. These values are expressed as units per year and reflect the inverse of the time of infectiousness (5.3-19 days, mean 7.5) according to the serial interval estimated by Li et al. [19]. Sampling probability (ρ) was estimated assuming a prior Beta (alpha = 1.0 and beta = 999), corresponding to a minority of the sampled cases (between 10 −5 to 10 −3 ). The origin of the epidemic was estimated using a normal prior with M = 0.1 and S = 0.05 in units per year.
The MCMC analyses were run for 100 million generations and sampled every 10,000 steps. Convergence was assessed on the basis of ESS values (ESS > 200). Uncertainty in the estimates was indicated by 95% highest posterior density (95%HPD) intervals.
The mean growth rate was calculated on the basis of the birth and recovery rates (r = λ − δ), and the doubling time was estimated by the equation: doubling time = ln(2)/r [20].

Phylogenetic Analysis of the Whole Dataset
No recombination events were observed in the entire dataset according to analyses with RDP5 software.
Phylogenetic analysis by maximum likelihood showed that the Italian sequences were included in a single SARS-CoV-2 clade (the old Nextstrain A2 clade, corresponding to new Nextstain clades 20A and 20B) with the exception of three sequences: Two from Chinese patients visiting Italy at the end of January 2020 after being infected in Wuhan and one characterized by us from an Italian subject, living in Padua, sampled in March 2020, not reporting any recent trip outside Italy or contacts with subjects affected by COVID-19 (pp = 0.99) (Figure 1, clade 19A).
Viruses 2020, 12, x FOR PEER REVIEW No recombination events were observed in the entire dataset according to analyses with RDP5 software.
Phylogenetic analysis by maximum likelihood showed that the Italian sequences were included in a single SARS-CoV-2 clade (the old Nextstrain A2 clade, corresponding to new Nextstain clades 20A and 20B) with the exception of three sequences: Two from Chinese patients visiting Italy at the end of January 2020 after being infected in Wuhan and one characterized by us from an Italian subject, living in Padua, sampled in March 2020, not reporting any recent trip outside Italy or contacts with subjects affected by COVID-19 (pp = 0.99) (Figure 1, clade 19A). Recently, new nomenclature systems have been proposed for the SARS-CoV-2 clades. The new lineage assignment of 62 Italian isolates is reported on Table 1 with the correspondence to other naming systems (old and new Nextstrain). All of our isolates belonged to the lineage B.1, only one isolate was classified as lineage B. Recently, new nomenclature systems have been proposed for the SARS-CoV-2 clades. The new lineage assignment of 62 Italian isolates is reported on Table 1 with the correspondence to other naming systems (old and new Nextstrain). All of our isolates belonged to the lineage B.1, only one isolate was classified as lineage B.

Genetic Distances Analysis
The overall mean p-distance between all the Italian isolates was 2.3 (SE:0.3) s/10,000 nts, corresponding to a mean of 6.4 (SE: 0.8) substitutions per genome. The non-synonymous distance (dN) was 2.0 (SE: 0.4) non-syn s/10,000 non-syn nts while the overall synonymous mean distance (dS) was equal to 2.4 (SE: 0.05) syn s/10000 syn nts (dN/dS = 0.83). A higher heterogeneity was observed through months as, stratifying the genetic distances on the basis of the sampling time, we observed a higher heterogeneity among the strains isolated in February (n = 19) compared to those collected in March (n = 96) or April (n = 21) ( Table 2).

Differences in Amino Acids
Considering only the non-synonymous mutations and comparing the Italian genomes with the common ancestor (China, EPI_ISL_402123), there were 159 amino acid substitutions affecting different viral genes, (112 in ORF 1a/1b, 19 in S, 12 in ORF 3a, 4 in M, 3 in ORF7a, 6 in N, and one each in Orf7b, 8 and 10) of which only 15 (9.4%) were observed in 2 or more isolates, as summarized in Table 3. No amino acid changes were observed in the E gene. The previously described substitution D614G in the Spike protein was present in all the isolates belonging to the lineage B.1 and in the strain from Padua belonging to lineage B. Considering the Italian isolates, only 1 site resulted under significant selecting pressure by three different methods (MEME, FEL, FUBAR): Site 1,046 in the Spike protein that was present in three isolates from Padua. This G1046V mutation is located in the S2 subunit, between heptad repeat 1 and 2 domains. Mutations R203K-G204R in N gene were always simultaneously detected. It appears that these mutations discontinue a serine-arginine (S-R) dipeptide by introducing a lysine in-between them, having impacts on structure and function in the mutated N protein.
Fifty-two sequences in our dataset carried these mutations, particularly 11 of the 59 whole genome newly-characterized; six of these were from Tuscany, four from Milan and one from Padua.

Time Reconstruction of the SARS-CoV-2 Italian Lineage B.1 Phylogeny
Root-to-tip regression analysis of the temporal signal from the Italian B.1 subset revealed a weak association between genetic distances and sampling days (a correlation coefficient of 0.31 and a coefficient of determination (R 2 ) of 9.9 × 10 −2 ).
Comparison by BF test of the marginal likelihoods obtained by path sampling (PS) and stepping stone sampling (SS) of the strict vs. relaxed molecular clock (uncorrelated log-normal) showed that the second performed better than the former (strict vs. relaxed molecular clock BF(PS) = −71.9 and BF(SS) = −71.4 for relaxed clock). Comparison of the different demographic models showed that the BSP and the exponential growth models best fitted the data (BSP vs. constant population size BF(PS) = 27.9 and BF(SS) = 30.2 for BSP; constant population size vs. exponential growth BF(PS) = 7.3 and BF(SS) = 8.6) ( Table S3).
The mean tMRCA of the tree root ( Figure 2) was estimated at 107 days before present (BP) (95%HPD: 91.2-113.1), corresponding to 11 January 2020 (from 5 January to 27 January). The tMRCA of the subclade including all the Italian sequences was estimated to be 92.4 (95%HPD: 76.6-95) days BP, corresponding to 25 January (between 23 January and 10 February). Calendar dates of the tree root and the Italian clade were showed in red. The light blue box highlighted the three Padua isolates carrying G1046V mutation in S protein.
The Bayesian tree of the Italian sequences showed 15 small significant subclades including two to ten isolates (Figure 2).

Phylodynamic Analysis of the Italian Dataset
The Bayesian skyline plot of the Italian isolates showed an increase in the number of infections in the period between late February and mid-March 2020, with a rapid exponential growth between 4 and 16 March when it reached a plateau continuing until the last sampling time (Figure 3).  The Bayesian birth-death skyline plot of the Re estimates with 95%HPD with a single R group (corresponding to R0) estimated a mean value of 2.25 (1.5-3.1). Figure 4 (panels a and b) shows the changes of Re since the origin of the epidemic and suggests that Re was higher than 1 since the early days (mean initial Re = 1.4, 95%HPD: 0.08-2.9). The curve started to grow in early February and peaked to a mean value of 2.3 (95%HPD: 1.5-3.5) in the first half of March, and has since remained at this value. The curve obtained with three Re groups showed a slight decrease at mid-March ( Figure  4, panel b).
The origin of the epidemic was estimated at a mean 80.3 days BP (credibility interval: 60-109), corresponding to 7 February (between 9 January and 27 February). The recovery rate was estimated about 7.26 days (CI 4.7-16.0 days), and the transmission rate (λ) increased from 71.7 to 115.96 in units per year (corresponding to a growth rate of 0.06 and 0.18 year −1 ). On the basis of these data, the doubling time decreased from 5.1 days to 3.1 days in the period between early February and mid-March. The Bayesian birth-death skyline plot of the R e estimates with 95%HPD with a single R group (corresponding to R 0 ) estimated a mean value of 2.25 (1.5-3.1). Figure 4 (panels a and b) shows the changes of R e since the origin of the epidemic and suggests that R e was higher than 1 since the early days (mean initial R e = 1.4, 95%HPD: 0.08-2.9). The curve started to grow in early February and peaked to a mean value of 2.3 (95%HPD: 1.5-3.5) in the first half of March, and has since remained at this value. The curve obtained with three R e groups showed a slight decrease at mid-March (Figure 4, panel b). Viruses 2020, 12, x FOR PEER REVIEW

Discussion
Molecular tracing of SARS-CoV-2 coupled with advanced Bayesian and Maximum likelihood phylogenetic analysis provide detailed information about the epidemiology and evolution of emerging infections and helps to improve our understanding on the mechanisms of spreading of the The origin of the epidemic was estimated at a mean 80.3 days BP (credibility interval: 60-109), corresponding to 7 February (between 9 January and 27 February). The recovery rate was estimated about 7.26 days (CI 4.7-16.0 days), and the transmission rate (λ) increased from 71.7 to 115.96 in units per year (corresponding to a growth rate of 0.06 and 0.18 year −1 ). On the basis of these data, the doubling time decreased from 5.1 days to 3.1 days in the period between early February and mid-March.

Discussion
Molecular tracing of SARS-CoV-2 coupled with advanced Bayesian and Maximum likelihood phylogenetic analysis provide detailed information about the epidemiology and evolution of emerging infections and helps to improve our understanding on the mechanisms of spreading of the epidemic.
In a previous study [1], we characterized the viral sequences obtained from the first three patients coming from the Codogno area who were hospitalized at the very beginning of the epidemic in Italy. The Codogno strains correlated with an isolate from an outbreak occurred in Bavaria around 20 January [4]. The present analysis shows that all but one of 62 SARS-CoV-2 sequences obtained from This observation was also confirmed by other Italian studies [1,3]. The same clade is now the most widespread in the world and includes most of the published genomes [5]. The genetic distances among the Italian strains were relatively short, corresponding to an average of about 6.4 mutations per viral genome, even if single isolates may have a higher number of changes. After grouping the sequences according with the sampling months, while the within group mean genetic distances were higher in February compared to subsequent months, the genetic distance between different months increased with time. This observation confirms a continuous evolution of the viral genome (with the emergence of new divergent variants) mainly driven by genetic drift. No significant difference was observed between the non-synonymous and the synonymous substitutions (dn/ds = 0.8), suggesting the absence of relevant selective forces driving the evolution of the viral genome. This observation is further confirmed by the analysis of site-specific selective pressure in the Italian strains, which only showed a single site under significant positive selection in the S protein (position 1046) observed in three strains from Padua. Including in the phylogenetic tree 3 isolates from Shanghai and one from the first patient of the Bavarian cluster, being at the root of the B.1 lineage, the dated tree obtained suggests that SARS-CoV-2 entered Italy between late January and early February 2020. This timing matches with the first autochthonous European cluster of SARS-CoV-2 transmission in Bavaria (Germany), originated on 20 January [1, 4,21] by the introduction of a strain carried by the index patient coming from Shanghai, where the virus had been circulating since January. The skyline plot analysis of the Italian clade shows an exponential increase of the effective number of infections from late February to mid-March, in excellent agreement with the known epidemiological data (https://www.epicentro.iss.it/coronavirus/sars-cov-2-dashboard). In particular, a very rapid growth of the epidemic was detected between the beginning of March and the middle of the same month, when the curve reaches a plateau up to the end of sampling (27 April). The mean value of R 0 was estimated as 2.25 (1.5 to 3.1) in the entire period. A similar result was obtained by Stadler et al. on a smaller sample of 11 sequences mainly from patients with known travel history to Italy (https://virological.org/t/phylodynamic-analyses-based-on-11-genomes-fromthe-italian-outbreak/426). The estimated basic reproduction number (R 0 ) for SARS-CoV-2 has ranged mainly from 2 to 4, according to the different methods employed for the evaluation [22]. In Italy, values between 2.4 and 3.6 have been estimated in the early phase of COVID-19 epidemic before the control measures were taken [23][24][25]. Predictive mathematical models are fundamental to understand the dynamics of the epidemic, plan effective control strategies and verify the efficacy of those applied.
Using a birth-death skyline, we analyzed the changes of R e during the epidemic in Italy over the entire period. We observed that the R e was >1 since the first decade of February, suggesting that the infection was circulating within the population before the first notified (hospitalized) COVID-19 cases. The R e skyline plot reached a value of 2.3 in the first days of March, together with the rapid increase observed in the number of infections by BSP, and slightly decreased thereafter, in agreement with the official data on the course of the epidemic. Between February and March the estimated doubling time of the epidemic decreased from 5.1 to 3.1 days. This value was smaller than that obtained by us for the epidemic in China [26] and might be interpreted as a consequence of a delayed application of more stringent containment measures in Italy. In fact, a slight decrease of the R e value was observed only after mid-March, when a more rigorous social distancing was enforced across the entire country. The persistence of a R e value higher than one until April, in partial contrast with the epidemiological data (https://covstat.it/), could be due to the fact that our estimate was influenced by the circulation of the virus in the community, which is larger than the number of the officially registered clinical cases. It is well known that only a small minority of SARS-CoV-2 infections require hospitalization and that in Italy the number of cases of infection has widely exceeded the number of official reports. In a recent study, the prevalence of anti-SARS-CoV-2 antibodies in asymptomatic blood donors living in Milan was shown to increase from February to April, when the prevalence reached its maximum (about 7%) (https://www.medrxiv.org/content/10.1101/2020.05.11.20098442v2). However, in Italy the numbers of active cases began to decrease only in the second half of April, when the present study had already been stopped. Further studies on extended data collection will be required to estimate the effects of the containment measures.
The only one genome characterized in our study not belonging to lineage B.1 was isolated in a 76-year-old man living in the province of Padua (Veneto), who survived to serious COVID-19 manifestations despite old age and the presence of several comorbidities. He denied any contact with infected subjects and did not travel abroad. This virus belongs to the same lineage (B) of the first 2 cases imported into Italy from the Hubei region, China, at the end of January 2020, before Italy suspended flights from China. The couple landed at the Milan airport and travelled to other locations in Northern and Central Italy before the onset of symptoms requiring hospitalization in Rome, but they had not travelled to Padua. Thus, the origin of such a strain remains unexplained and further investigations are underway to evaluate whether this strain may have played a role in causing an epidemic, at least locally. It would also be interesting to investigate whether the currently predominant strain was for some reasons more epidemic than the initial strain, or if the spread of the latter was limited by random factors.
In conclusion, our data show the importance of molecular and phylogenetic evolutionary reconstruction in the surveillance of emerging infections. Of note, it appears that the outbreak in Italy, which involved hundreds of thousands of people, is mainly attributable to a single introduction of the virus and its uncontrolled circulation for a period of about four weeks. These results reaffirm the strategic importance of continuous surveillance and timely tracing to define and rapidly implement effective containment measures for a possible second wave of the pandemic.
Supplementary Materials: The following are available online at http://www.mdpi.com/1999-4915/12/8/798/s1, Table S1: Data of Italian Patients characterized in the present study, Table S2: Accession IDs, sampling dates and location of sequences included in the dataset, Table S3: Comparison among different demographic models based on Path Sampling (PS) and Stepping Stone (SS) sampling.