Targeted Virome Sequencing Enhances Unbiased Detection and Genome Assembly of Known and Emerging Viruses—The Example of SARS-CoV-2

Targeted virome enrichment and sequencing (VirCapSeq-VERT) utilizes a pool of oligos (baits) to enrich all vertebrate-infecting viruses known up to 2015, increasing their detection sensitivity. The hybridisation of the baits to the target sequences can be partial, thus enabling the detection and genomic reconstruction of novel pathogens with <40% genetic diversity compared to the strains used for the baits’ design. In this study, we deploy this method in multiplexed mixes of viral extracts, and we assess its performance in the unbiased detection of DNA and RNA viruses after cDNA synthesis. We further assess its efficiency in depleting various background genomic material. Finally, as a proof-of-concept, we explore the potential usage of the method for the characterization of unknown, emerging human viruses, such as SARS-CoV-2, which may not be included in the baits’ panel. We mixed positive samples of equimolar DNA/RNA viral extracts from SARS-CoV-2, coronavirus OC43, cytomegalovirus, influenza A virus H3N2, parvovirus B19, respiratory syncytial virus, adenovirus C and coxsackievirus A16. Targeted virome enrichment was performed on a dsDNA mix, followed by sequencing on the NextSeq500 (Illumina) and, to evaluate its usability as a point-of-care (PoC) application, on the portable MinION sequencer. Genome mapping assembly was performed using viral reference sequences. The untargeted libraries contained less than 1% of total reads mapped on most viral genomes, while RNA viruses remained undetected. In the targeted libraries, the percentage of viral-mapped reads was substantially increased, allowing full genome assembly in most cases. Targeted virome sequencing can enrich a broad range of viruses, potentially enabling the discovery of emerging viruses.


Introduction
Emerging and re-emerging viruses either appear in a population for the first time or have existed previously, and they are characterized by rapid increases in incidence within geographical regions [1]. Emerging viruses adopt different strategies in order to evade or escape host immune defences [2]. They also exhibit high flexibility in adapting to their current and new hosts, frequently switching between animals and humans [3][4][5][6]. Virtually all emerging diseases originate from animal populations; during their adaptation to a new host, they might colonize it without causing disease while remaining able to re-infect another host. During this incubation period, and soon after the emergence of an unknown disease, the accurate and prompt diagnosis of the infectious agent and the development of containment measures, although challenging, are of paramount importance, since any delay may result in its full emergence. The new coronavirus SARS-CoV-2 has spread around [...] we have shown that this strategy can effectively enrich specific targets and their flanking genomic sequences [18,19].
Additionally, given the very special characteristic of MinION to produce extremely long reads, the mining of long and partially unknown genomic fragments from a complex nucleic acid mixture is very important for the characterization of emerging pathogens with an unknown genomic sequence, for which standard PCR amplification assays cannot be designed. The presence of background genomic material can affect the sensitivity of metagenomic diagnostics in portable devices, such as MinION, due to their limited throughput. This is crucial, especially in the case of viruses, where the viral genome usually corresponds to less than 1% of the total genomic material present in the sample.
Figure 1. Schematic representation of the study design. Targeted virome library preparation enriches both DNA and RNA viruses as well as unknown viruses (in pink) due to the partial hybridization of the molecular baits (green-orange) to the viral genomes. Targeted and untargeted (control) libraries are sequenced in parallel.
In this study, we examine the potential impact of targeted sequencing by hybridization on the unbiased detection and full genome reconstruction of both DNA and RNA viruses, in a single library preparation. We quantitatively evaluate the efficiency of the method in depleting various background genomic materials. Most importantly, we further explore the usage of the method in the characterisation of the, until recently, unknown virus SARS-CoV-2, which, intentionally, was not included in the baits' design. This proof-of-concept approach highlights the potential usage of the method in the de novo identification of emerging human viruses in the future, using updated baits' panels.
Viruses 2022, 14, 1272

Sample Preparation
Mixed viral NGS libraries were derived from DNA or RNA isolated from eight clinical samples (nasopharyngeal swabs, urine, tissue, bronchoalveolar lavage and blood) already diagnosed positive for the presence of five RNA viruses, namely severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), coronavirus OC43 (HCoV-OC43), influenza A virus H3N2 (infA(H3N2)), respiratory syncytial virus (RSV) and human coxsackievirus A16 (Cox A16) and three DNA viruses, namely, cytomegalovirus (CMV), parvovirus B19 (B19V) and human adenovirus C (HAdV-C). Care was taken so that the RNA and DNA samples included in this study were anonymized prior to analysis in line with the EU General Data Protection Regulation (GDPR) mandates. Extraction of the viral genomic DNA and RNA was performed with the automated platform NucliSens-EasyMag (BioMérieux, Marcy l'Etoile, France) according to the manufacturer's instructions. Quantification of the total RNA and DNA was performed on the Qubit fluorometer (ThermoFisher Scientific, Waltham, MA, USA), initially calibrated with the Qubit RNA HS Assay (declared assay range between 25-500 ng/mL; sample starting concentration between 250 pg/µL-100 ng/µL) and the Qubit dsDNA HS Assay (declared assay range between 1-500 ng/mL; sample starting concentration between 10 pg/µL-100 ng/µL). Equimolar concentrations of DNA/RNA viral extracts were mixed to be used as a template for the generation of targeted and untargeted NGS libraries.
The RNA pool was first subjected to reverse transcription and second strand synthesis using the Maxima H Minus Double-Stranded cDNA Synthesis Kit (ThermoFisher Scientific, Waltham, MA, USA). The DNA/RNA viral pool was then divided into four aliquots that were processed in parallel: two aliquots (one for the targeted and one for the untargeted library) were processed with the KAPA HyperPlus Library Preparation Kit (Roche Diagnostics, Basel, Switzerland), while the other two (one for the targeted and one for the untargeted library) were processed with the ONT SQK-LSK109 "Genomic DNA by Ligation" protocol (Oxford Nanopore Technologies, Oxford, UK) for sequencing on the MinION. Following the HyperCap Target Enrichment Kit (Roche Diagnostics, Basel, Switzerland) instructions, the VirCapSeq-VERT hybridization probes (kindly provided by Roche Diagnostics (Hellas)) were used for targeted virome enrichment, with ~two million probes covering the genomes of members of all viral taxa known (up to 2015) to infect vertebrates, including humans [15]. For the MinION targeted library, library preparation was performed after the target enrichment procedure, because the protein-containing MinION adapters are susceptible to temperature [18].
Performance of the virome target enrichment was verified by specific real-time PCRs for 3 viruses (Cox A16, HAdV-C, RSV), before and after the hybridization process. Similarly, depletion of the host genome was verified via real-time PCR against the ribonuclease P (RNase P) gene (data not shown). The normalised read depth was also estimated for each reference genome (Table 1). Both targeted and untargeted libraries were sequenced using the NextSeq500 platform (Illumina, San Diego, CA, USA) and the MinION sequencer (Oxford Nanopore Technologies, Oxford, UK).

Efficiency of Total Virome Target Enrichment
The objective of this study was to evaluate targeted total virome sequencing in both large-scale and portable platforms and to assess the efficacy of the method in detecting and characterizing unknown, emerging viruses such as SARS-CoV-2.
The untargeted Illumina sequencing run resulted in 32,235,720 single-end reads, the majority of which (93.45%) mapped to the human genome, while only 0.94% in total mapped to the viral genomes. The virome-targeted Illumina run resulted in 63,664,520 reads, with more than 83.03% on-target reads mapped to all viruses and only ~10% off-target reads, reflecting a substantial enrichment compared to the host background (human gDNA, human mtDNA, human rRNA/DNA) and the bacterial DNA (16S gene). The efficiency of the enrichment was lower in the case of Cox A16, as only 18 reads mapped to the reference sequence after the enrichment. It should be noted, though, that Cox A16, along with infA(H3N2), RSV and SARS-CoV-2, remained undetected in the untargeted library. Most importantly, 797,191 reads were mapped on the SARS-CoV-2 reference genome in the targeted library, although no specific baits for this genome were included in the panel. The targeted library was enriched in both DNA and RNA viral genomes, in contrast to the untargeted dataset, which contained reads only for the three DNA viruses, with the minor exception of nine reads mapping to the HCoV-OC43 genome. The genome coverage was substantially increased to more than 90% in all enriched viruses, with the exception of RSV (63.62%) and the unsuccessfully enriched Cox A16 (Table 1, Figure 2). The enrichment resulted in a more successful depletion of the human gDNA and the mtDNA (10-fold decrease) than of the human rRNA/DNA and the bacterial genomic material (Supplementary Figure S1).
For the MinION untargeted sequencing, the dataset contained 1,866,680 reads, while the targeted sequencing resulted in 1,129,865 reads. Only 0.6% of the untargeted reads mapped to the viral genomes, while the majority of them (70.86%) mapped to the human genome. In contrast, 35.41% of the reads in the targeted library mapped to the viral genomes (32.48% of total reads mapped to CMV), and only 14.07% mapped to the human genome.
A substantial fold increase in the reads (%) mapping to SARS-CoV-2 (~4×), CMV (~60×), HAdV-C (~100×) and B19V (~300×) was observed. Similarly, genome coverage increased to more than 90% in the case of B19V (from 63.44% to 98.86%) and HAdV-C (from 69.58% to 97.48%). SARS-CoV-2 genome coverage also increased (from 4.18% to 74.46%), but near-full genome reconstruction was not possible from this mapping alignment. The genome coverage was lower wherever the enrichment was moderate, as in the case of infA(H3N2) (from 0.00% to 25.72%), HCoV-OC43 (from 0.14% to 20.09%) and RSV (from 0.00% to 9.23%), allowing the detection of the viruses but not the complete reconstruction of their genomes. Cox A16 remained undetected in both the untargeted and targeted libraries. The reads (%) mapping to the human genome and the mtDNA decreased five-fold. Only a moderate two-fold decrease in the human rRNA/DNA reads was observed, while the enrichment did not successfully deplete the bacterial DNA (Table 1, Supplementary Figure S1).
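The fold increases quoted above are ratios of the percentage of mapped reads after and before enrichment. The following minimal sketch illustrates that arithmetic using the CMV percentages of total reads given in the Discussion (0.94% and 0.5% untargeted; 76.48% and 32.48% targeted). The function name `fold_enrichment` and the handling of undetected viruses are illustrative assumptions, not part of the study's pipeline.

```python
def fold_enrichment(untargeted_pct: float, targeted_pct: float) -> float:
    """Ratio of the percentage of mapped reads after vs. before enrichment.

    Assumption for illustration: when a virus is undetected in the
    untargeted library (0%), the ratio is reported as infinity.
    """
    if untargeted_pct <= 0:
        return float("inf")
    return targeted_pct / untargeted_pct


# CMV percentages of total reads, as reported in the text:
# (untargeted %, targeted %)
cmv_pct = {
    "Illumina": (0.94, 76.48),
    "MinION": (0.5, 32.48),
}

for platform, (before, after) in cmv_pct.items():
    print(f"{platform}: {fold_enrichment(before, after):.1f}x")
```

For the MinION run this gives a ratio of about 65, in line with the approximate ~60× reported for CMV.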

De Novo Assembly-Reconstruction of SARS-CoV-2
The efficient enrichment of SARS-CoV-2 sequences in both the Illumina- and the MinION-targeted libraries enabled the de novo reconstruction of contigs covering 89.99% and 57.18% of the viral genome, respectively (Figure 3). It should be mentioned, though, that the assessment of the assemblies revealed better metrics for MinION with regard to the sizes of the resulting contigs and the NG50 (1161 for Illumina vs. 1892 for MinION). Conversely, although the total length of the assembly was 10-fold higher for MinION, the substantially higher number of mismatches per 100 kbp (6478.48 vs. 256.39) and indels per 100 kbp (7409.76 vs. 7.43) reflect a profile with longer, yet erroneous contigs, which do not align with the reference sequence, resulting in lower overall coverage. The erroneous profile of the MinION de novo assembly is also reflected in the unaligned fraction, which exceeds 78.5% (368,756 bp out of 468,694 bp), in contrast to only 8% of unaligned assembly in the case of the Illumina dataset.
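As an illustration of two of the assembly metrics discussed above, the sketch below computes NG50 under its common definition (the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the reference genome length) and the unaligned fraction of an assembly. The contig lengths and genome length in the example are hypothetical; only the unaligned and total assembly lengths (368,756 bp and 468,694 bp) are taken from the text. This is not the assessment tool used in the study.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_length: int) -> Optional[int]:
    """Return NG50, or None if the contigs sum to less than half the genome."""
    half_genome = genome_length / 2
    cumulative = 0
    # Walk contigs from longest to shortest until half the genome is covered.
    for length in sorted(contig_lengths, reverse=True):
        cumulative += length
        if cumulative >= half_genome:
            return length
    return None


def unaligned_fraction(unaligned_bp: int, total_assembly_bp: int) -> float:
    """Fraction of the assembly length that does not align to the reference."""
    return unaligned_bp / total_assembly_bp


# Hypothetical contig set and genome length, for illustration only.
example_contigs = [4000, 3000, 2000, 1000, 500]
print(ng50(example_contigs, genome_length=10_000))

# Unaligned fraction of the MinION assembly, from the values in the text.
print(f"{unaligned_fraction(368_756, 468_694):.1%}")
```

The second value evaluates to roughly 78.7%, consistent with the statement that the unaligned fraction exceeds 78.5%.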

Discussion
The detection and characterization of viruses in clinical samples can be challenging, mainly because of the low viral titer and the complex combination of a viral and a robust host genomic background. Even though immense advances in NGS technologies have led to the development of various experimental protocols, the time and cost required to sequence large, complex libraries at high read depth remain a limitation [26,27]. Hence, considerable effort has been devoted to developing target-enrichment methods that capture selected genomic regions of interest.
Efficacy of the sequence capture is expressed by the extent to which the target regions are enriched, calculated as the proportion of total reads derived from the region of interest divided by the reads derived from the untargeted regions [28]. Applications of targeted enrichment can lead to a dramatic increase in sensitivity for virus detection, exceeding methods such as amplicon sequencing [29]. Target enrichment offers a more feasible solution than typical PCR and its variations, as it avoids non-specific amplification due to interaction between multiple primer pairs and overcomes issues such as the upper limit on the length of the amplicons generated and the normalization of the products' concentration [30]. Targeted enrichment has been shown to increase the sensitivity of detection and to enhance the full genome sequencing of the dengue 1-4, chikungunya and Zika viruses in clinical samples with low viral loads [14]. For example, in SARS-CoV-2 positive samples with low viral copy number, the positivity of the infection is confirmed by the amplification of orf1ab and 11 genes, which are related to virulence, and by the analysis of the sequencing output regarding the read number, the depth and sequencing coverage and the identity [31].
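The capture efficacy defined at the start of this paragraph can be sketched as a simple on-target to off-target ratio. The example below applies it to the approximate Illumina proportions reported in the Results (83.03% on-target and ~10% off-target reads); the function name `capture_efficacy` is an illustrative assumption, and the off-target value is only approximate in the text.

```python
def capture_efficacy(on_target: float, off_target: float) -> float:
    """Reads (or read proportion) from the target regions divided by
    reads (or read proportion) from the untargeted regions."""
    if off_target <= 0:
        raise ValueError("off_target must be positive to form a ratio")
    return on_target / off_target


# Approximate Illumina targeted-library proportions from the text.
on_target_pct = 83.03   # reads mapped to all viruses
off_target_pct = 10.0   # approximate off-target proportion ("~10%")

print(f"{capture_efficacy(on_target_pct, off_target_pct):.1f}")
```

Because the measure is a ratio, it gives the same value whether it is computed from read counts or from percentages of the same library.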
Target-enrichment methods are under continuous development, in order to achieve higher sensitivity with reference to the detection of mutations and the characterization of structural variability, in terms of insertions and deletions [26]. The elimination of genomic background DNA allows for a higher-depth NGS analysis of the remaining targeted sequences, thus enabling the detection of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) in clinical samples, as well as the retrieval of pathogen nucleic acids and the reconstruction of known and unknown genomes derived from infectious agents found in low abundance within the host [26,32]. This was confirmed in our study, as the sequencing output of the untargeted libraries revealed the presence of less than 1% of total viral reads in the datasets, while, specifically, respiratory RNA viruses such as infA(H3N2), RSV and Cox A16 remained undetected. Conversely, when targeted virome enrichment, based on the "VirCapSeq-VERT" strategy for large-scale and portable platforms, was applied prior to the sequencing analysis, the percentage of the total reads reported as viral in targeted libraries was 0.004% and 0.0028% for infA(H3N2) in the Illumina and the MinION runs, respectively. The percentage of total reads mapped on the CMV genome was 0.94% and 0.5% in the Illumina and the MinION sequencing runs of the untargeted libraries, while it increased to 76.48% and 32.48%, respectively, after the target enrichment with VirCapSeq-VERT, the platform which is known to result in a 100- to 1000-fold increase in viral genomes from blood [15]. Importantly, both the Illumina- and MinION-enriched libraries presented a higher capacity in mining viral sequences originating from RNA viruses.
According to our results, following an enrichment- or amplification-based protocol prior to high-throughput sequencing would enhance the detection of RNA viruses, since they remained, in most cases, undetected in the untargeted libraries [33,34].
In our study, we utilized the VirCapSeq-VERT strategy for the detection of the novel virus SARS-CoV-2, although the corresponding reference genome was not included in the initial design of the baits' panel. The baits hybridize to short conserved sequences, presenting genomic variation up to 40% [15]. The successful detection and reconstruction of SARS-CoV-2 reported in this study (96.12% coverage of the reference genome for the Illumina mapping alignment) is a proof-of-concept, indicating that partial alignment of the molecular baits to the viral genomes can enable the detection and typing of unknown, emerging human viruses. As highlighted by the recent pandemic, false negative diagnostic results (e.g., during PCR screening) may have grave consequences, and this becomes more important in cases where the sensitivity of the diagnostic assays can be compromised by mutations accumulated in regions that correspond to PCR assay targets (primer and probe annealing regions) [35]. Thus, technical advancements improving detection sensitivity via unbiased methods based on NGS are urgently needed. Especially in the context of emerging viruses, where the reference genome may not be available, enrichment allowing the de novo reconstruction of the genome is very useful. In this study, we deployed a simulating analytical pipeline to de novo reconstruct the SARS-CoV-2 genome from both Illumina and MinION total virome-enriched datasets. Our results also highlight the importance of the NGS platform accuracy with regard to the completeness of the assembly. Although the MinION-generated reads successfully covered the majority of the genome in the mapping alignment (using the reference as a guide), the de novo assembled contigs, based on the same reads, could only cover ~58% of the genome. These contigs, however, were substantially larger compared to those generated from the Illumina reads.
Taken together, these results support the conclusion that MinION datasets can be very useful in the reconstruction of unknown or highly repetitive viral genomes, since they largely contribute to their phasing, especially in combination with other NGS platforms generating shorter, yet more accurate reads [36,37].
Emerging and re-emerging zoonotic infectious diseases pose a great threat to public health, presenting a great increase in incidence that varies by geographic region and population [38]. Their transmission is facilitated by intercontinental travel [39], while their epidemiology is markedly affected by the complex host-pathogen interaction [38]. In a globalised world, trade across international borders facilitates the spread of pathogens, and therefore specific countermeasures are needed to rapidly control this spread [11]. Consequently, it is of great importance to investigate the precision, robustness and applicability of diagnostic platforms such as the portable sequencer MinION for PoC applications, allowing the effective and prompt containment of the spread and thereby limiting a subsequent global outbreak. However, many of the MinION library preparation steps, and especially the hybridization procedures deployed in our study, still need standard laboratory equipment, which restricts their applicability in the field. Further development of dedicated microfluidic devices that would integrate parts of the workflow would largely simplify these complex procedures.
Our study highlights the importance and the usability of total virome target enrichment, using both large-scale and portable NGS platforms. In the context of a prospective study, extensive testing using undiagnosed clinical specimens and comparison with real-time PCR diagnostics will facilitate the standardization of this method towards the development of a diagnostic platform for the detection and identification of emerging viruses with increased accuracy.

Conclusions
Targeted virome sequencing enhances the simultaneous detection of multiple DNA and RNA viruses extracted directly from clinical samples. The depletion of the background genomic material varies depending on its source. The method can be potentially used for the identification of unknown, emerging human viruses, which may differ from the viral strains used for the design of the baits' panel.