Longitudinal Sequencing and Variant Detection of SARS-CoV-2 across Southern California Wastewater

: Wastewater-based epidemiology (WBE) is useful for detecting pathogen prevalence and may serve to effectively monitor diseases across broad scales. WBE has been used throughout the COVID-19 pandemic to track disease burden through quantifying SARS-CoV-2 RNA present in wastewater. Aside from case load estimation, WBE is being used to assay viral genomic diversity and emerging potential SARS-CoV-2 variants. Here, we present a study in which we sequenced RNA extracted from sewage influent obtained from eight wastewater treatment plants representing 16 million people in Southern California from April 2020 to August 2021. We sequenced SARS-CoV-2 with two methods: Illumina Respiratory Virus-Enriched metatranscriptomic sequencing (N = 269), and QIAseq SARS-CoV-2-tiled amplicon sequencing (N = 95). We classified SARS-CoV-2 reads into lineages and sublineages that approximated named variants and identified single nucleotide variants (SNVs), of which many are putatively novel SNVs and SNVs of unknown potential function and prevalence. Through our retrospective study, we also show that several SARS-CoV-2 sublineages were detected in wastewater before clinical detection, which may assist in the prediction of future variants of concern. Lastly, we show that sublineage diversity was similar across Southern California and that diversity changed over time, indicating that WBE is effective across megaregions. As the COVID-19 pandemic moves to new phases, and SARS-CoV-2 variants emerge, monitoring wastewater is important to understand local-and population-level dynamics of the virus. These results will aid in our ability to monitor the evolutionary potential of SARS-CoV-2 and help understand circulating SNVs to further combat COVID-19.


Introduction
The COVID-19 pandemic has had a profound impact on the human population, causing over 700 million cases of disease and more than 7 million human deaths worldwide [1].Caused by the emergence of the +ssRNA "severe acute respiratory syndrome coronavirus 2" [2], COVID-19 has caused public health to react and respond in novel ways to track the spread of the disease [3][4][5].One of these unexpected responses has been through the use of wastewater-based epidemiology (WBE) to monitor SARS-CoV-2 viral loads in wastewater throughout the COVID-19 pandemic [4,6,7].Having been used for approximately 80 years (although only formally adopted by the WHO in 2003) to monitor polio, WBE is known to be an effective surveillance method for a variety of diseases [8].As part of a worldwide effort to combat COVID-19, a massive assemblage of epidemiologists has developed methods and analyses for the examination of SARS-CoV-2 viral material in wastewater to track the spread and approximate cases of COVID-19 [4,[9][10][11].While direct sampling from patients is the most definitive method of COVID-19 diagnosis [12,13], it has been shown that clinical testing has probably undercounted the true number of cases [3,[14][15][16].This inaccuracy in case counts is likely due to a combination of supply issues, inability or reluctance to be tested for COVID-19, asymptomatic disease, and unreported at-home testing [3,14,15].As a partner to traditional public health responses, WBE has shown to be a valuable tool in predicting and assaying case counts across populations both small and large [4,7,17,18].Though WBE has shown to be a vital component of the world's fight against COVID-19, its major method of RT-qPCR on extracted RNA from wastewater samples is only able to quantify viral loads and cannot monitor the evolution of SARS-CoV-2 and the resulting viral variants.
There are many challenges to sequencing SARS-CoV-2 from wastewater samples [33][34][35].As a matrix of industrial, agricultural, and human-borne wastes, wastewater often contains a variety of detergents and other compounds that serve as PCR inhibitors and likely degrade viral particles [30,33,36,37].Similarly, SARS-CoV-2 is often at a low viral load, and the virus detected in wastewater is almost certainly fragmented, making sequencing difficult due to an inability to cover an entire genome in one assay [18,[38][39][40].Sequencing methods have been developed to address these challenges, such as viral enrichment, targeted amplification of viral regions, and various RNA extraction protocols but many of these methods are designed for clinical samples, so wastewater analyses remain technically difficult to accurately conduct [17,18,41,42].In order to increase our confidence in SARS-CoV-2 variant analyses, we used two sequencing library preparation methods on wastewater samples: The Illumina Respiratory Virus Oligonucleotide Panel, which enriches for respiratory virus nucleic acids before sequencing, and the QIAseq SARS-CoV-2 Primer Panel, which uses 200 PCR primer sets to amplify the entire SARS-CoV-2 genome.
Sequencing SARS-CoV-2 obtained from wastewater is a critical component of monitoring the ongoing COVID-19 pandemic [4].Here, we present a study in which we used metatranscriptomic sequencing and two methods of library preparation (Illumina Respiratory Virus Oligonucleotide Panel or QIAseq SARS-CoV-2 Primer Panel) to identify SNVs, clades, and sublineages of SARS-CoV-2 on 317 influent wastewater samples.These samples were obtained from eight WTPs across Southern California from April 2020 to August 2021 and represent the collective wastewater of approximately 16 million residents.We investigated several lines of inquiry through our study: First, what RNA viruses are represented in our samples?Second, what clades and sublineages of SARS-CoV-2 were present in Southern California wastewater, and can we detect variants of concern in wastewater?Third, what SNVs were present in Southern California wastewater, and can we detect these variants with both library preparation methods?Lastly, does wastewater sequencing allow for the early detection of variants before clinical sequencing?

Sample Collection and Handling
We previously reported the sample collection and handling procedure in Rothman et al., 2021 [18], and Rothman et al., 2022 [43].Briefly, we collected 317 1-L 24-h composite influent wastewater samples by autosampler at eight WTPs across Southern California between April 2020 and August 2021 (Table 1).We aliquoted and stored 50 mL of sample at 4 • C until processing.

Wastewater Sample RNA Extraction
We used two separate RNA extraction and library preparation protocols for the samples in this study.For one set (N = 269), we used a protocol based on Crits-Christoph, 2021 [26], and Rothman, 2021 [18], and we refer to those papers for detailed RNA extraction methods.Briefly, 50 mL of influent wastewater was pasteurized at 65 • C for 90 min in a water bath, filtered through 0.22-µM sterile filters (VWR, Radnor, PA, USA), then centrifugated at 3000× g with 10 kDa filters (MilliporeSigma, Burlington, MA, USA) and RNA was extracted with an Invitrogen PureLink RNA Mini Kit plus DNase (Invitrogen, Waltham, MA, USA).We will refer to these samples as "Illumina Respiratory Virus-enriched" (IRV) henceforth.
A second set of RNA extractions was carried out (N = 95) with a different extraction and library preparation protocol based on Steele, 2021 [35].Briefly, we added 25 mM MgCl 2 to 20 mL of wastewater, then acidified the samples to pH < 3.5 with HCl.We then transferred the mixture to a cellulose ester membrane (type HA; Millipore, Bedford, MA, USA) then bead bashed the filters in preloaded 2 mL ZR BashingBead lysis tubes (Zymo, Irvine, CA, USA) for 1 min.Lastly, we extracted total nucleic acid with a NucliSENS extraction kit with magnetic bead capture following the supplied protocol (bioMérieux, Durham, NC, USA).Libraries for these samples were then prepared as follows and will be referred to as "tiled amplicons".Note that some samples were only sequenced with the IRV method, while others were only sequenced with the tiled amplicon method.A total of N = 47 samples (N = 18 Hyperion and N = 29 Point Loma) were sequenced with both methodologies.

Sequencing Library Preparation
All sample library preparation and sequencing steps were carried out by the University of California Irvine Genomics High-Throughput Facility (GHTF).The GHTF prepared IRV-enriched libraries with the Illumina Respiratory Virus Oligonucleotide panel paired with an Illumina RNA prep with enrichment kit (Illumina, San Diego, CA, USA), following the manufacturer's protocol.The tiled amplicon libraries were prepared by using the QIAseq SARS-CoV-2 Primer Panel paired with a QIAseq FX DNA Library UDI kit (Qiagen, Germantown, MD, USA) and using the manufacturer's protocol.The GHTF sequenced the resulting paired-end libraries as either 2 × 100 bp or 2 × 150 bp (Supplemental File SF1) on an Illumina NovaSeq 6000 with an S4 300 cycling kit and sent the data as demultiplexed FASTQ files.

Bioinformatics and Sequence Data Processing
All data processing was conducted on the UCI High-Performance Community Computing Cluster (HPC3).We removed sequencing adapter sequences and low-quality bases with the BBTools software v.38.87 "bbduk" [44].We subsequently marked sequencing duplicates with Picard toolkit "MarkDuplicates" [45], removed reads mapping to the HG38 human genome with Bowtie2 [46], then used Kraken2 [47] and Bracken [48] to taxonomically classify our reads for reporting purposes and plotted those relative abundances as a stacked bar plot with the R package "ggplot2" [49,50].
Once we had removed the human reads, we aligned the reads to the SARS-CoV-2 Hu-1 reference strain [51] with Bowtie2, then sorted and indexed the resulting bam files with "samtools" [52].We used iVar [53] with the default settings to trim off QIAseq primer sequences and call single nucleotide variants (SNVs) with a Fisher's exact test of p < 0.05, as compared to the reference strain (Supplemental File SF1).Subsequently, we used Freyja [30] to assign SARS-CoV-2 lineage and sublineage identities to the alignments using the UShER phylogeny [54] and then de-mix the clades and sublineages within each sample to calculate approximate relative abundances.As we wanted to compare the results from IRV-enriched libraries to tiled amplicon libraries, we also used the iVar/Freyja pipeline to call SNVs and assign lineages/sublineages to these libraries, even though there were no true primers to remove.
To compare our wastewater sequence data to clinical sequencing, we obtained the date of sublineage detection and reported genomes from PANGO, GISAID, and the California Health and Human Services Agency [21,22,55,56] and associated these dates with our wastewater sampling dates.Due to the longitudinal nature of our data, we compared variant lineage abundances (as counts per million) over time with MaAsLin2 [57] where we had yearlong data, using WTP and sequencing batch as random effects.Likewise, we reported SNVs from 68 samples with detectible SARS-CoV-2 in Rothman 2021 [18] but here we reanalyzed the data and present the new results for consistency with our new methods.
We investigated SARS-CoV-2 sublineage alpha diversity through Kruskal-Wallis testing and beta diversity through Adonis PERMANOVA testing with the R package "vegan" [58] and measured the change in diversity with linear mixed-effects regression models (LMERs) with the R package "lmerTest" [59] using both WTP and sequencing batch as random effects.Lastly, we plotted all of the data with the R packages "ggplot2", "ggre-

SARS-CoV-2 Read Classification
Because SARS-CoV-2 was well-represented in our samples, we could classify reads mapping to SARS-CoV-2 through UShER SARS-CoV-2 barcoding and de-mixing to approximate relative abundances with Freyja.We were able to classify many of the mapped reads to specific named variants of interest (VOIs) and variants of concern (VOCs) (both currently circulating and historically significant) along with other sublineages of SARS- We obtained broad SARS-CoV-2 genome coverage with both sequencing approaches.When considering all samples together, IRV-prepared libraries covered 99.92% of the SARS-CoV-2 genome at a mean sequencing depth of only 2× at each base position.Tiled amplicon library preparation had both wider coverage and higher sequencing depth, with these libraries covering 99.95% of the genome at a mean depth of 76 reads per base (Figure S1).
In addition to large, overarching SARS-CoV-2 clades (i.e., VOI/VOCs), we often classified reads to a named PANGO sublineage, and we detected 1221 unique sublineages at greater than 0.1% proportional abundance, with substantial detection overlap between tiled-amplicon and IRV sequencing approaches (1215 and 1221 named sublineages, respectively, Supplemental File SF1).We often detected the presence of SARS-CoV-2 sublineages in wastewater before clinical sequencing reported detection: tiled-amplicon sequencing detected 515 (42.7%) in samples before clinical sequencing, in some cases by as much as several months, and IRV sequencing detected 364 (30%) before clinical reports, again often with substantial lead-time as above (Figure 3).

Discussion
Our study represents a large-scale effort to employ wastewater-based epidemiology (WBE) across a catchment area of 16 million people and supports the monitoring of SARS-CoV-2 evolution throughout the ongoing COVID-19 pandemic.Through respiratory vi-

Discussion
Our study represents a large-scale effort to employ wastewater-based epidemiology (WBE) across a catchment area of 16 million people and supports the monitoring of SARS-CoV-2 evolution throughout the ongoing COVID-19 pandemic.Through respiratory virus-enriched and tiled-amplicon RNA sequencing approaches, we classified SARS-CoV-2 lineages and single nucleotide variants (SNVs) and could approximate VOCs/VOIs across a yearlong study of Southern California wastewater.Like other studies, we captured SARS-CoV-2 mutations across the genome and show the potential to detect sublineages and SNVs months before clinical analyses of patient samples [18,26,27,30,39].While WBE is a powerful tool-and is not subject to many of clinical sequencing's drawbacks-we cannot use these methods to determine the exact source of SARS-CoV-2 variants [18,26,30,38] and instead propose the use of WBE to monitor populations instead of individuals.Our results suggest that multi-scale sampling of individual patients, local wastewater catchments (i.e., university campuses), and WTPs can give public health agencies vital information to identify novel SARS-CoV-2 variants and predict disease spread to further combat COVID-19 [4,11,30].

Classifying SARS-CoV-2 Reads and Comparing Wastewater to Clinical Sequencing
In most samples, we could classify SARS-CoV-2 RNA fragments at multiple levels of resolution-both at the named variants (i.e., Alpha, Beta, etc.) and sublineage levels (i.e., B.1.429,B.1.617.2,P.1, etc.)-and calculate semi-quantitative relative abundances.When considering the full year of data, sublineage diversity of SARS-CoV-2 was not different between WTPs, rather it changed monthly probably due to the similarity of proportional disease burden and proximity of the San Diego and Los Angeles counties [1].As expected, our sublineage quantification was not exactly concordant with clinical sequencing data, probably due to the aggregate nature of wastewater and our composite sampling, along with the lack of clinical specimens early in the pandemic [4,21,22,30,38].We do note, however, that ours and clinical data agree well during the emergence of the Delta variant, suggesting that wastewater can detect the potential evolutionary replacement of lineages accurately as has been recently shown with the domination of the Omicron variant [30,63].Additionally, when comparing our sequencing data specifically to California clinical data, we observed a higher apparent concordance between named SARS-CoV-2 clades, indicating that WBE is powerful on smaller scales, which may assist public health agencies with more targeted responses to disease.Similarly, we detected many SARS-CoV-2 sublineages earlier in wastewater than clinical sequencing-in some cases by several months-further supporting work indicating that WBE is useful for predicting disease load and the spread of novel variants [30,63].Naturally, we recognize this is a retrospective study, and rely on clinical sequencing to name and prioritize the variants we sequenced in wastewater, so we suggest that public health and wastewater sequencing be used in tandem to carefully monitor the evolutionary potential of SARS-CoV-2 [4].

Identifying SARS-CoV-2 Single Nucleotide Variants
Similar to previous work, we detected thousands of single nucleotide variants (SNVs) across samples and sequenced putatively novel or rare SNVs that have an unknown function or species host [18,26,27,31,63,64].For example, in many samples (from within or between WTPs), we detected SNVs at positions 9864, 22796, and 28971, which are exceedingly rare in public sequencing data, along with SNVs at 241, 14408, and 23403, which were common in 2020 [65,66].Our ability to detect both low-prevalence and near-ubiquitous SNVs indicates that WBE is broadly useful for accurate SNV detection and may provide a reasonable estimate of what SNVs are circulating across populations [26,30,31,63,64].
Likewise, when comparing our results to other wastewater studies, we detected variants or sublineages also reported in Nice, New York, Montana, Arizona, Northern California, Berlin, and across Austria, often at similar sampling dates, which shows that sequencing wastewater is reproducible and accurate at very large scales [26,31,39,67,68].Being that sequencing wastewater is technically challenging, we qualitatively compared two major methods of SARS-CoV-2 analysis and note that targeted amplification provided better sequencing depth and resolution, indicating its utility when presented with degraded lowtiter RNA and the detergent/PCR inhibitor content of wastewater along with our harsh extraction methods [4,18,41].Nevertheless, there is the possibility that targeted amplification may miss novel variants due to mutational differences in SARS-CoV-2 genomes [69,70], something suggested by our IRV method detecting Omicron where tiled amplicon did not.We also recognize that extracted RNA with two different methods, which likely confounds our results but we still suggest the use of targeted amplification approaches for wastewater samples, which supports previous work and method development [4,18,26,30,41,63,71,72].

Conclusions
Wastewater-based epidemiology has exploded into a worldwide endeavor and is a critical part of humanity's response to the COVID-19 pandemic [4,6,9,11].Our study demonstrates WBE's effectiveness in monitoring SARS-CoV-2 mutations across megaregions [18,26,30,31,39], which continues to be important as novel VOCs emerge and the popularity of at-home testing reduces public health's ability to accurately quantify COVID-19 cases [3,14,15].As WBE technology matures, future directions for method development should address the need to simultaneously detect and sequence variants from other circulating diseases-such as influenza, Respiratory Syncytial Virus, and Norovirus-alongside SARS-CoV-2.COVID-19 has demonstrated the need for scientists, wastewater agencies, and public health to work together to track the evolution and spread of SARS-CoV-2, especially in underserved areas, low population coverage, and places where the medical field is overburdened [4,16].WBE has the potential to discover emergent diseases and should be implemented across population centers as a sentinel for the next pandemic.

Figure 1 .
Figure 1.Stacked bar plots showing the relative abundances of RNA reads mapping to (A) the top 10 most proportionally abundant viruses plus all others in respiratory virus-enriched libraries and (B) SARS-CoV-2 plus other viruses in tiled-amplicon libraries.Plots are faceted by WTP and labeled with sampling date.

Figure 1 .
Figure 1.Stacked bar plots showing the relative abundances of RNA reads mapping to (A) the top 10 most proportionally abundant viruses plus all others in respiratory virus-enriched libraries and (B) SARS-CoV-2 plus other viruses in tiled-amplicon libraries.Plots are faceted by WTP and labeled with sampling date.

Figure 2 .
Figure 2. The relative proportional abundance of the ten most-abundant SARS-CoV-2 lineages plus others in (A) respiratory virus-enriched libraries, and (B) tiled-amplicon libraries faceted by WTP and labeled with sampling date.Note that one sample date from the North City Water Reclamation Plant is not shown.

Figure 3 .
Figure 3. SARS-CoV-2 sublineages at greater than 0.2% relative abundance (for plot visibility) first detected in wastewater samples in (A) respiratory virus-enriched libraries (IRV) and (B) tiled-amplicon libraries by date; (C) denotes the total number of SARS-CoV-2 sublineages first detected by our wastewater sequencing or clinical samples by IRV or tiled-amplicon libraries, respectively, without the relative abundance cutoff.

Figure 3 .
Figure 3. SARS-CoV-2 sublineages at greater than 0.2% relative abundance (for plot visibility) first detected in wastewater samples in (A) respiratory virus-enriched libraries (IRV) and (B) tiledamplicon libraries by date; (C) denotes the total number of SARS-CoV-2 sublineages first detected by our wastewater sequencing or clinical samples by IRV or tiled-amplicon libraries, respectively, without the relative abundance cutoff.

Figure 5 .
Figure 5. (A) Number of single nucleotide variants (SNVs) detected at each sample date and (B) nucleotide position across the SARS-CoV-2 genome for all samples colored by library preparation method (IRV signifies Illumina Respiratory Virus enrichment panel); (C,D) indicate the frequency of SNVs detected at each position of the SARS-CoV-2 genome across all respiratory virus-enriched and tiled-amplicon libraries, respectively.

Figure 5 .
Figure 5. (A) Number of single nucleotide variants (SNVs) detected at each sample date and (B) nucleotide position across the SARS-CoV-2 genome for all samples colored by library preparation

Table 1 .
Sample quantities, date spans of collection, and approximate influent flow and served population.WTP names include the abbreviations used throughout the study, and "IRV" denotes Illumina Respiratory Virus Enrichment library preparation.