Genome Sequencing of SARS-CoV-2 Allows Monitoring of Variants of Concern through Wastewater

: Monitoring SARS-CoV-2 in wastewater has shown to be an effective tool for epidemiological surveillance. More specifically, RNA levels determined with RT-qPCR have been shown to track with the infection dynamics within the population. However, the surveillance of individual lineages circulating in the population based on genomic sequencing of wastewater samples is challenging, as the genetic material constitutes a mixture of different viral haplotypes. Here, we identify specific signature mutations from individual SARS-CoV-2 lineages in wastewater samples to estimate lineages circulating in Luxembourg. We compare circulating lineages and mutations to those detected in clinical samples amongst infected individuals. We show that especially for dominant lineages, the allele frequencies of signature mutations correspond to the occurrence of particular lineages in the population. In addition, we provide evidence that regional clusters can also be discerned. We focused on the time period between November 2020 and March 2021 in which several variants of concern emerged and specifically traced the lineage B.1.1.7, which became dominant in Luxembourg during that time. During the subsequent time points, we were able to reconstruct short haplotypes, highlighting the co-occurrence of several signature mutations. Our results highlight the potential of genomic surveillance in wastewater samples based on amplicon short-read data. By extension, our work provides the basis for the early detection of novel SARS-CoV-2 variants.

efforts also include a mass screening program to regularly perform RT-qPCR tests on residents and crossborder workers [1].
Additionally, biweekly monitoring of Luxembourg wastewater samples provides an overview of up to 73% of the population (https://www.list.lu/en/covid-19/coronastep/; accessed on 9 June 2021). While SARS-CoV-2 is primarily transmitted through aerosols [2], the virus can also persistently be detected in the urine and feces of infected people [3][4][5]. The presence of this virus in the excreta of infected patients led researchers to question its transmission through the water cycle and its persistence in the environment [6][7][8]. Based on current knowledge, the ability of the SARS-CoV-2 to be present and transported in water systems exists, but transmission via water currently seems unlikely [9]. From an epidemiological point of view, the concentration of SARS-CoV-2 in wastewater has been shown to be proportional to the number of COVID-19 cases in the catchment area, which enables wastewater-based surveillance [10][11][12][13][14]. Even though, further evaluation is required in terms of transmissibility [15] or stability [16] of SARS-CoV-2 in effluents, wastewater-based surveillance emerged as a valuable monitoring tool [17,18] with the potential to mitigate delays of individual testing [11].
In addition to its role as an early warning system, wastewater-based surveillance also offers the possibility of identifying the dominant variant of SARS-CoV-2 in order to reconstruct viral phylogenies [19]. Since both symptomatic and asymptomatic individuals contribute to sewage inputs, the resulting pooled sample of excreta from the entire population may provide a more complete picture of the genomic diversity of SARS-CoV-2 circulating in a community than clinical testing and sequencing alone. The regular appearance of novel variants of concern (VOCs), such as B.1.1.7 [20], P.1 [21], B.1.351 [22] and more recently B.1.617.2 [23], further illustrates the need for a genotypic surveillance. These VOCs are generally associated with increased infectivity [22,24], and they quickly started to emerge as dominant lineages in several countries by the end of 2020. The SARS-COV-2 B.1.617.2 variant is currently spreading quickly across Europe and has now become the dominant strain across much of the region. It continues to spread, displacing the circulation of other variants.
To identify viral genotypes from wastewater treatment plant (WWTP) samples, commonly targeted sequencing approaches are applied in which overlapping amplicons covering the SARS-CoV-2 genome are sequenced [25]. However, also untargeted metatranscriptomic sequencing followed by viral enrichment can provide sufficient information to distinguish circulating genotypes [26]. The identified strains or mutations frequently correspond to those detected in a clinical setting [27,28], but genomic sequencing of wastewater samples also provides the possibility to detect novel mutations [29]. However, genotyping SARS-CoV-2 in wastewater remains challenging as the samples represent a mixture of viral genomes and, particularly with short-read data, provide insufficient phasing information to reconstruct viral haplotypes [27]. Tracing individual signature mutations in wastewater can provide valuable information on the spread of VOCs [17], and the allele frequencies of individual mutations could be linked to observations from clinical sequencing early on during the pandemic [30].
Here, we present an approach to assess SARS-CoV-2 lineages circulating in the Luxembourg population from genomic sequencing of WWTP samples. We compare characteristic mutations of lineages present in patient-derived consensus sequences to corresponding mutations detected in wastewater over time and by location. Additionally, we reconstruct haplotypes of specific genomic regions from short-read data based on mutation co-occurrence. Our results emphasize the role of wastewater-based genomic epidemiology particularly regarding tracing the spread of novel VOCs.
Wastewater samples were collected over a 24 h period using automatic samplers (Teledyne ISCO, Lincoln, NE USA). Each composite sample was then transported to the laboratory at 4 °C, and viral RNA was isolated on the day of sampling. Larger particles (debris, bacteria) were removed from the samples by centrifugation at 2,400 g for 20 min at 4 °C. A volume of 120 mL of supernatant was filtered through Amicon ® Plus-15 centrifugal ultrafilter with a cutoff of 10 kDa (Millipore, Burlington, MA, USA) by centrifugation at 3,220 g for 25 min at 4 °C. The resulting concentrate was collected, and 140 µL was then processed to manually extract viral RNA using the QIAamp Viral RNA mini kit (Qiagen, Hilden, Germany), according to the manufacturer's protocol. Elution of viral RNA was performed in 60 µL of elution buffer and stored at −80 °C until RT-qPCR analysis.

SARS-CoV-2 RNA Detection in Wastewater Samples
After RNA extraction, detection of SARS-CoV-2 RNA was performed by RT-qPCR. The Allplex™ SARS-CoV-2 Assay kit (Seegene, Düsseldorf, Germany) was used for the detection of 4-target genes (E gene, RdRP gene, N gene and S gene) for multiplex RT-qPCR, according to the manufacturers' protocol. Fluorophores used for E gene of Sarbecovirus, RdRP/S gene, and N gene of SARS-CoV-2 were FAM, Cal Red 610 and Quasar 670, respectively. The PCR plate with 94 wells of samples' extracted RNA and 2 wells for positive and negative controls were prepared on the Seegene STARlet ® . RT-qPCR was carried out on the CFX96™ Real-Time PCR Detection System ® (Bio-Rad, Hercules, CA, USA) with the following cycling conditions: initial phase 50 °C for 20 min, second phase: 95 °C for 15 min, third phase: 45 cycles of 10 s at 95 °C, followed by 60 °C for 15 s, and final fourth phase 10 s at 72 °C. The samples were analyzed with the Seegene ® Viewer Software v3. 19.001 (Seegene, Düsseldorf, Germany).

Amplicon Sequencing of Wastewater Samples
Wastewater samples presenting one of the three SARS-CoV-2-specific CT values resulting from the Allplex™ SARS-CoV-2 Assay below 35 were subjected to an amplicon sequencing method initially implemented for the clinical samples. First, viral RNA reverse transcription and amplification was performed using a primer scheme based on the ARTIC version 1 protocol (https://www.protocols.io/view/ncov-2019-sequencingprotocol-bbmuik6w; accessed on 15 July 2021), generating 100 overlapping fragments of approximately 900 bp according to an adaptation of the initial primer scheme. Then, library preparation was performed with the DNA Prep, (M) Tagmentation kit (Illumina, San Diego, CA, USA) using 1-2 indexes to sequence 96-192 samples per flow cell, following the manufacturers' instructions adapted for a paired-end 150 bp strategy. Samples were processed for sequencing on a MiniSeq ® or MiSeq ® instrument (Illumina, San Diego, CA, USA).
A total of 157 SARS-CoV-2 positive wastewater samples were selected for sequencing according to the above criterion. Among them, sequencing data were obtained for 98 samples, mappings for 92, of which 79 samples had at least 70% reference coverage. Samples utilized for the analysis of mutations (79) were collected between 30 March 2020 and 4 March 2021 with most samples (58) collected between week 42 of 2020 and week 1 of 2021. Of the 13 sampling locations considered here, most samples used for the mutation analysis were obtained from the following 5 sites: BEG (18), PET (13), SCH (11), BET (7) and HESP (7).

SARS-CoV-2 Genome Sequences From Luxembourg Patients
Consensus SARS-CoV-2 genome sequences for Luxembourg patients (min. reference genome coverage 90%) were downloaded from GISAID [31] on the 4 June 2021, which included 9,133 sequences in total and 5,149 sequences dating from 29 February 2020 to 8 March 2021. The genome sequences were analyzed with a pipeline based on Nextstrain, Washington, US [32], including alignments to the reference sequence NC_045512.2, calling of single nucleotide polymorphisms (SNPs), and assignment of PANGO lineages with Pangolin v.3.1.7 (https://github.com/cov-lineages/pangolin; accessed on 16 July 2021). Virus variant call format (VCF) files were annotated with the SARS-CoV-2 28 April 2020 version of Annovar [33]. Location information, i.e., the affiliated WWTP to a patient's postal address, could be matched to 6,641 sequences in total and 3,762 sequences dating from 29 February 2020 to 8 March 2021.

Lineage Specific Signature Mutations
Clinical sequences could be assigned to 123 distinct PANGO lineages. Characteristic amino acid mutations for each of the lineages appearing in clinical samples were downloaded from outbreak.info (https://outbreak.info/ accessed on 15 July 2021). Characteristic mutations were considered signature mutations if they appeared uniquely in only one specific PANGO lineage within the set of Luxembourg patient consensus sequences.
Additionally, sublineages were grouped to reduce complexity. Sequences not assigned to a "B" lineage were grouped to the first level ("A", "C", "P"), and all "B" sublineages except "B.1.1.X" were grouped (e.g., B.1.177.77 was grouped with B.1.177 and B.1.160.28 was grouped with B.1.160). In addition, for grouped lineages, unique signature mutations were determined and utilized as markers to compare allele frequencies. We focused on the most common lineages with sufficient representation in signature mutations (

Haplotype Reconstruction
For each sample, regions that could potentially be linked by SNPs co-occurring on the same read were determined from the called variants. Specifically, ranges consisted of SNPs within 130 base-pairs given the read length of 150 bp. For ranges spanning a signature mutation, haplotypes were reconstructed with Gretel v0.094 [42] for each of the sample-specific interval(s) using the alignment files and further refined retaining only the haplotypes whose variants were present in the sample-specific VCF files.
For determining correlations between allele frequencies (WWTP samples) and occurrence of mutation in patient-derived samples, samples were grouped by calendar week, considering only weeks with at least 1 WWTP sample with the mutation and at least 10 clinical samples. Additionally, samples were also grouped by geographic location (WWTP of origin). The R function cor.test was applied.

Overview of Available Clinical Sequences
Genome sequencing of patient-derived samples from Luxembourg was carried out from the very beginning of the pandemic. Sequencing efforts increased with a spike in SARS-CoV-2 positive cases in the population in the third week of October 2020 (Figure S1  We detected 4,432 distinct nucleotide mutations (Supplementary Data 3), of which 1,120 (25%) could also be observed in consensus sequences from clinical samples. Most of 231 recurring mutations (defined here as present in more than five samples, Figure 1) can also be observed in the clinical consensus sequences (164 of 231 mutations, 71%), with a mean allele frequency (AF) of 0.28 (mean minimum AF: 0.07, mean maximum AF: 0.67). Most mutations could be classified as missense variants ( Table 1). The largest discrepancy for recurring mutations detected in wastewater and clinical samples can be observed for the S gene (Table 1, Figure 1).

Overview of Sequencing of Wastewater Samples
Particularly, a segment of the S gene proved to be enriched for mutations consistently detected over time, however at relatively low allele frequencies (Figure 1). These could not be detected in several wastewater samples with low average sequencing depth. Overall, the observed pattern of mutations varied over time with some consistency. Samples from different locations did not show distinctive patterns (Figure 1), except for specific regional clusters based on individual signature mutations.  0/2 0/0 0 (0%) 0 (0%) 1 (50%) 0 (0%) Figure 1. Nucleotide mutations detected in more than five wastewater samples with respective allele frequencies. Columns are ordered according to sampling date, and rows are ordered according to genomic position. Column annotations provide sample specific information such as the date of sampling, average sequencing depth per sample, and sampling location. Row annotations highlight the respective genes and whether a mutation has been detected in any clinical samples.

Comparison of Allele Frequencies and Relative Occurrence
To trace lineages in mixed wastewater samples, we downloaded characteristic amino acid mutations for each of the lineages assigned to clinical samples with Pangolin, i.e., a total of 123 distinct lineages, from outbreak.info [29] (Supplementary Data 4). Characteristic mutations were screened for uniqueness in the set, respectively, for the associated lineage or grouped lineage, yielding several signature mutations for most prevalent lineages ( Figure S3 in Supplementary Material).
In order to predict circulating SARS-CoV-2 lineages from the wastewater samples, we compared allele frequencies of signature mutations to the frequencies of lineages in clinical consensus sequences. Grouping samples by calendar week and lineage, we observed a high resemblance of the median allele frequency of signature mutations to the relative frequency of the respective lineages assigned to clinical samples ( Figure 2). This was the case for more abundant (B.1.160) and less abundant lineages such as B.1.474. During the period of higher sample availability (October to December 2020), a greater overlap was observed, but the sequencing depth also affects whether median allele frequencies of signature mutations can be interpreted as predictive for lineage abundances ( Figure S5

Characteristic Mutations of VOCs
A primary interest in the monitoring of wastewater samples is screening for the novel variants of concern. Particularly, lineage B.1.1.7 was of primary interest during the sampled interval. For several samples, we observed that an increasing number of signature mutations of B.1.1.7, at low frequencies, could be detected already in November

Regionally Specific Mutations
To assess the regional patterns of mutations, those detected in WWTP and clinical samples were screened according to their geographical location. Comparing individual mutations and their frequencies grouped by week and location resulted in a slight improvement in terms of correlation of WWTP allele frequency and occurrence in clinical samples (Pearson correlation coefficient: 0.76, p-value < 2.2 x 10 −16 , comparing 1,023 datapoints for 216 unique AA mutations, 14 calendar weeks and 8 locations). In total, 26 amino acid mutations were detected uniquely in the same location within WWTP and clinical samples, including only one characteristic mutation for a lineage. However, several signature mutations for B.1.1.420 were found at high frequencies in February and March 2021 samples from PET, corresponding to an increasing prevalence of those mutations within clinical samples (Figure 4).

Reconstruction of Short Lineage-Specific Haplotypes
In order to identify whether signature mutations co-occur on the same reads, we reconstructed haplotypes for sample-specific genomic ranges linked by SNPs at maximum 130 bps apart from each other. We identified 179 ranges including at least two signature mutations of the same lineage with mean length of 228 bps and maximum length of 1,390 bps. Most ranges included signature mutations for B.1.1.7 (60), followed by B.1.221 (55), and B.1.160 (28).
Focusing on B.1.1.7 specific haplotypes, we determined ranges in 19 samples between positions 27,737 and 28,484 including up to three signature mutations ( Figure 5): C27972T (ORF8:Q27*), G28048T (ORF8:R52I) and A28111G (ORF8:Y73C). In samples of earlier time points (November and December 2020), these signature mutations can sometimes already be detected, but they do rarely occur in the same predicted haplotypes. Additionally, the haplotypes containing early B.1.1.7 signatures are predicted at lower scores reflecting a lack of evidence in counts of SNP co-occurrence, which can also be caused by the low abundance of these haplotypes.
Later time points coinciding with higher spread of B.1.1.7 in the population tend to show more reliably predicted haplotypes include shared signature alleles for B.1.1.7 at higher prevalence ( Figure 5). This indicates that for this region, signature mutations tend to co-occur on the same reads; i.e., they can be linked by SNP co-occurrence on the same reads, derived from the same viral genomes.

Discussion
Genomic sequencing of SARS-CoV-2 from wastewater has the potential to be a valuable tool in population-wide monitoring of circulating lineages. Here, we show that detecting low and major frequency mutations from wastewater samples is possible by sequencing overlapping amplicons. Similar recent results support the same approach for such a purpose [43]. Even though, only roughly one third of all mutations can be matched to those called from patient-derived sequences, most of the recurrent mutations and lineage specific mutations can be found in both datasets.
Tracing signature mutations, or quasi-signature mutations [44], does not allow in general for the reconstructing of full-length haplotypes of SARS-CoV-2 and thus assessing the combinations of characteristic mutations. We show that for genomic regions enriched in mutations, haplotype reconstruction based on SNP co-occurrence could be feasible, but we were mainly able to predict short haplotypes for abundant lineages. The emergent genotypes are difficult to distinguish from spuriously predicted haplotypes. Coupled approaches with Illumina short-read and Oxford Nanopore-based long-read sequencing data have greater potential to allow for the reconstruction of full-length haplotypes from mixed samples [28,45].
In our comparison of clinical and WWTP samples during the period from October 2020 to March 2021, the timeframe in which B.1.1.7 emerged as dominant lineage in Luxembourg, we see that particularly for lineages with multiple unique signature mutations, the allele frequencies of mutations detected in wastewater corresponds to the occurrence of these lineages in the population. With a limited number of present signature mutations, this relationship becomes more variable, and especially for lowly abundant or emerging lineages, high-sequencing depth is required for capturing relevant mutations. While the detection of signature mutations for B.1.1.7 predates the detection in a clinical isolate as has been observed before [27], tracing emerging VOCs in wastewater data may remain challenging. Today, the SARS-CoV-2 variant B.1.1.7. has been de-escalated and no longer belongs to VOCs, mainly because it no longer circulates in the population. The B.1.617.2 variant is the dominant strain worldwide, including Luxembourg. It will continue to spread, displacing the circulation of the other variants, unless a new, more competitive virus emerges. Our findings and methodology presented here strongly position high-throughput sequencing of wastewater samples as a credible tool for analyzing the spread, dynamics and evolution of SARS-CoV-2 variants.
While the systematic sampling of wastewater allowed temporal and spatial monitoring of the epidemic, several samples did not contain a sufficient viral load to be sequenced. However, we were able to identify a local cluster of mutations related to B.1.1.420 in February and March 2021 for the location of PET. This cluster corresponds to a local outbreak with 92 SARS-CoV-2 positive cases in a retirement home in the area of the PET WWTP. This highlights the potential of wastewater monitoring to detect regionally specific clusters, as has been observed previously [27,28], even though the regional differences within Luxembourg are low, given the small size of the country. The detection of regional clusters also depends on factors such as mobility within the respective areas, which could be considered in a refined analysis.
Overall, genomic sequencing of SARS-CoV-2 represents a valuable addition to the sequencing of clinical isolates particularly for tracing VOCs in a wide proportion of the population. Even though, short-read sequencing data only allowed for the detection of individual signature mutations, and the cross-section of allele frequencies within several samples allowed for comprehensive tracing of circulating lineages, regionally specific clusters, and emergent VOCs.

Supplementary Materials:
The following are available online at www.mdpi.com/2073-4441/13/21/3018/s1, Figure S1: Weekly sequencing coverage, Figure S2: Lineages in clinical consensus sequences, Figure S3: Number of signature mutations for grouped lineages, Figure S4: Mapping statistics for wastewater samples, Figure S5: Differences relative occurrence and allele frequencies with sequencing depth and number of signature mutations, Figure S6: Comparison relative occurrence and allele frequencies grouped by week, Figure  Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Patient-derived sequences were downloaded from GISAID (4 June 2021). Weekly numbers of infected people were obtained from ECDC and downloaded (https://www.ecdc.europa.eu/en/publications-data/data-national-14-day-notification-rate-covid-19; accessed on 22 June 2021). Raw reads for sequencing data derived from WWTP samples have been uploaded to NCBI and are accessible under BioProject PRJNA765031. Preprocessing: IMP3 available from https://git-r3lab.uni.lu/IMP/imp3 includes FastQC available from https://github.com/s-andrews/FastQC. Picard is available from https://github.com/broadinstitute/picard; accessed on 20 June 2021. The phylodynamics PDP pipeline-based Nextstrain and the variant-calling pipeline using LoFreq are available on https://git-r3lab.uni.lu/covid19-genomics. Code for statistical analyses and generating figures is available in the following repository: https://git.list.lu/malte.herold/coronastep-variant-analysis-scripts/ Acknowledgments: The authors would like to thank all the wastewater syndicates (SIACH, SIVEC, STEP, SIDERO, SIDEN and SIDEST), the "Ville de Luxembourg", the "Gemeng Hesper" as well as the "Administration de la Gestion de l'Eau" (AGE) for their kind and valuable assistance in the wastewater sample collection, the acquisition of wastewater parameters and the collection of demographic data. Some experiments presented in this work were carried out using the high performance computing facilities of the University of Luxembourg (https://hpc.uni.lu; accessed on 1 July 2021).