Molecular Characterization of Potato Virus Y (PVY) Using High-Throughput Sequencing: Constraints on Full Genome Reconstructions Imposed by Mixed Infection Involving Recombinant PVY Strains

In recent years, high throughput sequencing (HTS) has brought new possibilities to the study of the diversity and complexity of plant viromes. Mixed infection of a single plant with several viruses is frequently observed in such studies. We analyzed the virome of 10 tomato and sweet pepper samples from Slovakia, all showing the presence of potato virus Y (PVY) infection. Most datasets allow the determination of the nearly complete sequence of a single-variant PVY genome, belonging to one of the PVY recombinant strains (N-Wi, NTNa, or NTNb). However, in three to-mato samples (T1, T40, and T62) the presence of N-type and O-type sequences spanning the same genome region was documented, indicative of mixed infections involving different PVY strains variants, hampering the automated assembly of PVY genomes present in the sample. The N- and O-type in silico data were further confirmed by specific RT-PCR assays targeting UTR-P1 and NIa genomic parts. Although full genomes could not be de novo assembled directly in this situation, their deep coverage by relatively long paired reads allowed their manual re-assembly using very stringent mapping parameters. These results highlight the complexity of PVY infection of some host plants and the challenges that can be met when trying to precisely identify the PVY isolates involved in mixed infection.


Introduction
Plant viruses can be a source of economic loss by causing diseases in economically important plant species. Potyviruses are well known harmful pathogens [1,2], and the genus type species, potato virus Y (PVY) is considered to be one of the most important viruses affecting potato crops [3] and other economically important Solanaceous species (e.g., pepper, tomato, tobacco), as well as weeds and non-solanaceous hosts [1,4]. Furthermore, potyviruses are known to be commonly found in mixed infections, from which synergistic interactions with heterologous viruses are well documented, for example, by inducing more severe symptoms [5].
As the diseases caused by PVY are incurable under field conditions, prophylactic measures are focused on preventing or slowing virus spread in plant populations through the use of resistant varieties [6,7] of healthy controlled propagation material [8] or through the eradication of infected plants. All these phytosanitary actions require access to early, sensitive, and specific detection [9]. Although ELISA and RT-PCR-based techniques provide robust tools for the diagnosis of plant viruses, the large genetic diversity found in many viral species can impose severe constraints on the design and performance of some detection systems [10,11].
To identify a specific virus, molecular detection methods with sufficient sensitivity are usually implemented [12]. Modern diagnostics often require data not only at the level of virus species but they may also require differentiation between viral strains or individual isolates, as these can have different epidemiological properties relevant for the deployment of control strategies [13][14][15]. For example, in the case of PVY, the potato tuber necrotic ringspot disease (PTNRD), which can considerably decrease their market price, or even make the tubers unsellable, is associated with specific recombinant isolates [16,17].
The recombinant strains differ in their genome composition. Although often having a complex mosaic structure, some more or less conservative recombination junction sites have been identified [17,21]. Several RT-PCR-based approaches have been developed to discriminate PVY strains [11,22]. However, to unambiguously assign an isolate to a strain or to identify a divergent variant, a full-length genome characterization is required. The use of high-throughput sequencing (HTS) technologies has brought forth the possibility of unbiased virus detection and identification allowing the gain of a potentially complete view of a plant virome including access to full-length genome data [23,24].
In this work, viromes of 10 tomato and sweet pepper plants were determined from HTS data, enabling us to obtain (nearly) complete genome sequences for several PVY isolates. In some cases, the automated recovery of such full-length genomic sequences was, however, hampered by the occurrence of mixed infections involving different PVY variants.

PVY Isolates Identified in Non-Potato Hosts Belong to Recombinant Strains
The HTS analysis of nine tomato samples and one sweet pepper sample revealed the presence of PVY and, except for one tomato sample (T31), where only PVY infection was identified, the presence of additional virus(es) ( Table 1).
A single large contig covering the full-length (or nearly full-length) PVY genome was recovered from the HTS dataset for 7 of the 10 samples (T101, T20, T24, T31, PAP, SL50V, PAR-P2) ( Table 1). However, due to the presence of other virus(es), the role of PVY in the etiology of the range of symptoms observed (Table 1) remains undetermined. The multiple alignments and phylogenetic analysis of the obtained sequences, together with representatives of the different PVY strains, assigned the Slovak isolates to three molecular groups. The T101 tomato isolate shows a recombination pattern and strong phylogenetic affinities to the PVY-N-Wi strain. Five isolates from tomato (T20, T24, T31, SL50V, PAR-P2) belong to the PVY-NTNa recombinant strain, while the pepper PAP isolate shares the characteristics of members of the PVY-NTNb strain ( Figure 1C, Table 2). All the deduced PVY polyprotein sequences were of the same length (3061 aa) and showed all the expected conserved amino acid motifs characteristic of PVY without any obvious peculiarities or originalities.

Mixed Infection Involving Different PVY Isolates
De novo assembly of the T1, T40, and T62 HTS datasets, together with a visual inspection of mapped reads indicated the presence of genetically different PVY variants in the plant sample. To investigate this point, four informative genomic regions (nt 37-493, 704-2407, 3001-4948, and 5880-8806) from reference O-type (accession number U09509) and N-type (AJ585197) isolates, selected based on genome differences (see Karasev and Gray [17]), were used for mapping the reads of the T1, T40, and T62 datasets using stringent parameters (minimum overlap 25, minimum overlap identity 95% in Geneious mapper).
While reads corresponding to only one of the PVY strains were identified in all three samples in the 704-2407 and 3001-4948 genomic regions, both O-and N-type reads were obtained when analyzing the 37-493 and 5880-8806 genomic parts. These two regions show, respectively, 31.5-32.3% and 14.4-15% nucleotide divergence between the O and N strains ( Table 3). Due to the recombination pattern and to the recombination junction sites sharing similarities among the various PVY strains ( Figure 2B,C, Table 2), the exact identification of the PVY variants involved in the mixed infections present in the T1, T40, and T62 plants could not be established.
To confirm the presence of O-and N-type sequences covering the genome portions for which reads of both strains were identified (either UTR-P1 or NIa), RT-PCR assays were carried out using specific primers (Table 4). For all three plants, both expected PCR products (N and O-specific) were obtained, confirming the in silico data.

Mixed Infection Involving Different PVY Isolates
De novo assembly of the T1, T40, and T62 HTS datasets, together with a visual inspection of mapped reads indicated the presence of genetically different PVY variants in the plant sample. To investigate this point, four informative genomic regions (nt 37-493, 704-2407, 3001-4948, and 5880-8806) from reference O-type (accession number U09509) and N-type (AJ585197) isolates, selected based on genome differences (see Karasev and Gray [17]), were used for mapping the reads of the T1, T40, and T62 datasets using stringent parameters (minimum overlap 25, minimum overlap identity 95% in Geneious mapper).
While reads corresponding to only one of the PVY strains were identified in all three samples in the 704-2407 and 3001-4948 genomic regions, both O-and N-type reads were obtained when analyzing the 37-493 and 5880-8806 genomic parts. These two regions show, respectively, 31.5-32.3% and 14.4-15% nucleotide divergence between the O and N strains ( Table 3). Due to the recombination pattern and to the recombination junction sites sharing similarities among the various PVY strains ( Figure 2B, C, Table 2), the exact identification of the PVY variants involved in the mixed infections present in the T1, T40, and T62 plants could not be established.
To confirm the presence of O-and N-type sequences covering the genome portions for which reads of both strains were identified (either UTR-P1 or NIa), RT-PCR assays were carried out using specific primers (Table 4). For all three plants, both expected PCR products (N and O-specific) were obtained, confirming the in silico data.

Attempts to Reconstruct Complete PVY Genomic Sequences from Mixed Infection
Automated assembly of the T1, T40, and T62 HTS dataset did not result in complete genomes (the largest contig obtained represented <75% of a full genome and, with this exception, all other contigs represented less than about half of the full genome size) and did not allow the proper identification of the PVY variants involved. Therefore, a manual reconstruction approach was evaluated, using extremely stringent conditions that could potentially allow the separation of closely related variants. Since it was supposed that long reads would be critical for the reconstruction of closely related haplotypes, a specific trimming step in which only high-quality paired reads (>140 bp) were selected was performed. Tentative scaffolds were manually reconstructed from contig sequences and were then validated and extended by successive rounds of mapping at very high stringency (>95% reads length showing >98% nt identity) using only the long high quality reads selected as described above. This approach allowed for the gradual separation of haplotypes, even in genome regions in which the isolates in mixed infection showed in the order of a few percent of nucleotide divergence. It should be stressed that these reconstructions did not involve any reference sequence information and that the genomes were thus reconstructed without any a priori or fixed expectations on the final outcome. The obtained nearly complete genomes (Genbank Accession numbers MW685827-MW685832) were finally validated by mapping of the selected long high quality paired reads using a full identity criterion (100% identity over 100% of reads length), giving strong confidence in the obtained sequences thanks to their deep coverage (in all cases in excess of 250x average coverage). A phylogenetic analysis of the reconstructed genomes together with reference isolates finally allowed us to unambiguously determine that the T1 was simultaneously infected with NTNb and N:O isolates while the T40 and T62 ones were co-infected by NTNa and N-Wi isolates (Figure 3).

Discussion
Mixed infection of plants with several viruses belonging to different taxa is now frequently detected, especially thanks to the application of massive parallel sequencing [24][25][26][27][28]. In this work, the analysis of leaf samples from tomato and pepper indeed revealed mixed infections of PVY (a potyvirus) and other viruses (from the Cucumovirus, Carlavirus, and/or Ophiovirus genera, Table 1) further confirming the complexity of the virome that can be present in single plants [29,30].
De novo assembly of HTS reads or their mapping to reference viral genomes is standardly used in order to access the infecting viruses' genome data [9,31,32]. Because of frequent recombination events in PVY evolutionary history, resulting in a genome with mosaic structure involving both PVY-N and PVY-0 parents in the case of most PVY recombinant strains [17,33], such approaches can be challenging.
Assembly of the HTS reads generated from ribosomal-depleted total RNA (1.5-3.9 M of 93-185 bp length) allowed us to obtain full-length or nearly full-length PVY genomes (i.e., missing only a few nucleotides at the genome extremities) with a high coverage from 7 of the 10 tested samples (Table 1). In all these cases, a single PVY variant was unambiguously identified, belonging to the PVY-N-Wi, PVY-NTNa, or PVY-NTNb strains. These results complement a previous more limited work [34], showing the presence of recombinant PVY strains in Slovakia based on RT-PCR targeting various recombination junction (RJ) sites.
Recombination clearly played a significant role in the evolutionary history of PVY [10,17,35]. Recombination, together with mutation, increases the genetic variability of the given taxa, which can result in higher viral fitness or survival of the virus population in a previously nonviable environment, sometimes leading to the emergence of new resistance-breaking strains [36,37]. Indeed, a change in the PVY strain prevalence from a non-recombinant to recombinant one during the last few decades was generally observed [6,38,39] suggesting a better adaptation of current recombinant strains over non-recombinant ones. Similarly, the latest reports of PVY strain characterization in infected plants point to a growing incidence of recombinant PVY strains [40]. PVY recombinant strains were previously characterized to be quite conserved regarding the positions of recombination junctions within their genomes [21]. Four RJs seem to be deeply conserved and have very similar or even identical positions between several strains (PVY-N:O, PVY-N-Wi, PVY-NTNa, PVY-NTNb). From these RJs, the first and the last RJ show more variety

Discussion
Mixed infection of plants with several viruses belonging to different taxa is now frequently detected, especially thanks to the application of massive parallel sequencing [24][25][26][27][28]. In this work, the analysis of leaf samples from tomato and pepper indeed revealed mixed infections of PVY (a potyvirus) and other viruses (from the Cucumovirus, Carlavirus, and/or Ophiovirus genera, Table 1) further confirming the complexity of the virome that can be present in single plants [29,30].
De novo assembly of HTS reads or their mapping to reference viral genomes is standardly used in order to access the infecting viruses' genome data [9,31,32]. Because of frequent recombination events in PVY evolutionary history, resulting in a genome with mosaic structure involving both PVY-N and PVY-0 parents in the case of most PVY recombinant strains [17,33], such approaches can be challenging.
Assembly of the HTS reads generated from ribosomal-depleted total RNA (1.5-3.9 M of 93-185 bp length) allowed us to obtain full-length or nearly full-length PVY genomes (i.e., missing only a few nucleotides at the genome extremities) with a high coverage from 7 of the 10 tested samples (Table 1). In all these cases, a single PVY variant was unambiguously identified, belonging to the PVY-N-Wi, PVY-NTNa, or PVY-NTNb strains. These results complement a previous more limited work [34], showing the presence of recombinant PVY strains in Slovakia based on RT-PCR targeting various recombination junction (RJ) sites.
Recombination clearly played a significant role in the evolutionary history of PVY [10,17,35]. Recombination, together with mutation, increases the genetic variability of the given taxa, which can result in higher viral fitness or survival of the virus population in a previously nonviable environment, sometimes leading to the emergence of new resistance-breaking strains [36,37]. Indeed, a change in the PVY strain prevalence from a non-recombinant to recombinant one during the last few decades was generally observed [6,38,39] suggesting a better adaptation of current recombinant strains over non-recombinant ones. Similarly, the latest reports of PVY strain characterization in infected plants point to a growing incidence of recombinant PVY strains [40]. PVY recombinant strains were previously characterized to be quite conserved regarding the positions of recombination junctions within their genomes [21]. Four RJs seem to be deeply conserved and have very similar or even identical positions between several strains (PVY-N:O, PVY-N-Wi, PVY-NTNa, PVY-NTNb). From these RJs, the first and the last RJ show more variety between rare PVY recombinants, while the second and third RJs are more stably shared [17,19]. Because of this peculiar pattern of variation within the PVY species, it was previously recommended to use sequencing or multiple pairs of specific primers targeting RJs within the PVY genome in order to assign an isolate to a particular strain [11,41]. However, the growing evidence of "divergent" PVY isolates escaping accurate identification using such tests, which were developed with a limited pool of isolates [42], pinpoints the usefulness of complete genome analyses to characterize PVY strain variability [16]. Indeed, the challenges of viral strain identification due to extensive recombination were recently addressed for the sugarcane mosaic virus (SCMV), another member of genus potyvirus, where it was advised to use whole genome phylogenetic analysis for viral strain identification [43].
Mixed infections by different PVY isolates are reported to be quite common [44,45]. In the work reported here, 3 of the 10 tested samples (T1, T40, and T62) were identified as bearing a mix of different PVY variants. As a consequence of these mixed infections, it was not possible to de novo assemble full-length PVY genomes for these three samples. The mixed infection status was evident both from the assembled partial contigs and from the mapping of HTS reads against selected portions of PVY-O and PVY-N genomes, which showed the simultaneous presence of N-type and O-type sequences spanning the same genomic regions ( Table 2). These in silico results were further confirmed by N-and Otype specific RT-PCRs. Unfortunately, as a consequence of the conservation of the RJ region in the various PVY recombinant strains and of the sharing of genomic regions between the recombinants involved, it was not possible to automatically reconstruct full-length PVY genomes or haplotypes, even when trying a range of assembly parameters. While the data confirmed the presence of at least two PVY strains in the three samples, it was not possible to precisely determine the identity of the individual strains present in each sample by a standard bioinformatics approach. Indeed, these could potentially be identified as either NTNa and NWi or N:O and NTNb, or different combinations of these strains (Figure 2).
Only manual reassembly making use of long paired reads mapped at very high stringency against manually assembled scaffolds ultimately allowed the reconstruction of near-complete genomic sequences and the identification of the PVY strains involved in these mixed infections. In practice, as soon as two isolates in coinfection share a region of perfect identity that is longer than the reads available, it will not be possible to reconstruct the genomic haplotypes over this region. However, in practice, the various PVY recombinant strains are not identical in their shared genomic regions. The basis for this small divergence is that (i) the various recombinants independently emerged from different parental isolates and (ii) the recombinants have independently evolved and further diverged since their emergence. As a consequence, PVY recombinant strains are not identical in their shared genomic regions but, on the contrary, diverge by a limited percentage, as can clearly be seen in phylogenetic reconstructions focusing on such regions. This limited divergence translates into an average of 1-2(3) mutations/100 nt over the shared regions. Given the average 250 nt covered by the long paired reads used here, this small level of differentiation is sufficient to reconstruct or validate haplotypes over long genomic regions even if shared by co-infecting recombinants. Although this cannot be unambiguously established, we suggest that several factors were critical in the ability to perform this reconstruction including (i) mixed infections that involved only two isolates, (ii) very deep coverage (>250× in all cases), and (iii) availability of long (>140 nt) paired reads (and up to~600 nt of unread sequence between reads of a pair). The last two points were in particular critical for the ability to separate haplotypes over shared regions and unambiguously reconstruct haplotypes over recombination junctions.
It should be noted that even though we detected at least two PVY strains in mixed infection from a single leaf sample, one can wonder whether such a state is stable or transient because of possible virus-virus interactions, such as cross protection [46].
Several reports of PVY interactions in mixed infections have been previously published, showing an antagonistic nature, depending on the viral strains [13,47], and even partial cross-protection [48]. It was shown that PVY-NTN superinfection decreased the titers of PVY-N:O and PVY-N-Wi, while existing PVY-NTN infection prevented infection by secondary virus in some cases, while its titer was unaffected by PVY-N:O, PVY-N-Wi, or PVY-O [48]. Such information could be useful to discriminate between potential strain mixtures as in the case reported here, although it would be impossible to determine the time of infection by individual strains.
In this work, mixed infections involving PVY recombinant isolates proved recalcitrant to automated genome reconstruction and strain identification following Illumina HTS. It was however possible to manually perform these tasks. Such a problem with strain identification could be potentially avoided by using single molecule long sequencing reads such as those generated by MinION (Oxford Nanopore) as proposed by Pooggin [49] (2018). Indeed, Della Bartola et al. [45] (2020) showed that it is possible to assemble and differentiate individual PVY strains in mixed infections using MinION, while it was not possible to individuate strains using specific primer pairs which otherwise worked well for most samples, supporting the former proposal.

Plant Samples
Nine tomato (Solanum lycopersicum L.) plants and one sweet pepper (Capsicum annum L.) plant were sampled during their vegetation period (July-September 2017-2019) and analyzed for the presence of viruses using HTS (Table 1). All the samples originated from plants grown in home gardens in Slovakia.

HTS Analysis
Total RNAs were extracted from a single leaf from the upper part of plants (ca 0.2 g) using the Spectrum Plant Total RNA Kit (Sigma Aldrich, St. Louis, MO, USA). Subsequently, ribosomal RNAs were removed using the Ribo-Zero rRNA Removal Kit (Illumina, San Diego, CA, USA). Samples of ribosomal-depleted total RNAs were used for doublestranded cDNA synthesis using the SuperScript II kit (Thermo Fisher Scientific, Waltham, MA, USA). The cDNA was then purified with 2.2x AMPure XP beads and quantified with the Qubit 2.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). The samples were then processed with the transposon-based chemistry library preparation kit (Nextera XT, Illumina, San Diego, CA, USA). Low-cycle PCR and mutual indexing of the fragments were carried out. Fragments were purified with 1.8x AMPure XP beads (BeckmanCoulter, Brea, CA, USA) without size selection. The fragment size structure of the DNA libraries was assessed using the Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The equimolar pool of 4 nM DNA libraries was denatured, diluted to 10 pM, and sequenced (300-bp paired-end sequencing) on the Illumina MiSeq platform (Illumina, San Diego, CA, USA).
High-quality trimmed reads were used for de novo assembled using a CLC Genomics Workbench 7.5 (https://www.qiagenbioinformatics.com/, accessed on 5 October 2020) with automatic graph parameters set and with reads mapped back to contigs with the following parameters (Mismatch cost 2, Insertion cost 3, Deletion cost 3, Length fraction 0.7 and Similarity fraction 0.9) and minimum contig length 1000 bp. All contigs were subsequently aligned to the viral genomes database (ftp://ftp.ncbi.nih.gov/genomes/ Viruses/all.fna.tar.gz, accessed on 5 October 2020).
In the case of three samples for which mixed infections involving recombinant isolates did not allow the automated reconstruction of full-length genomes, the contigs obtained were manually assembled into scaffolds which were further extended and validated by successive rounds of reads mapping at extremely high stringency (95% of reads length, 98% identity) using CLC Genomics workbench v21.0.3. Only long (>140 nt), high-quality paired reads selected by a specific trimming step were used in this reconstruction that allowed us to separate coinfecting recombinants even in their shared genomic regions and to reconstruct the complete genomic haplotypes. Final validation of the obtained near-complete genomic sequences was performed by a final mapping of the selected highquality read pairs at an extreme stringency level (100% identity over 100% of reads length) to ensure the accuracy of the assembled genomes.

Strain Specific RT-PCR Assays
O-type and N-type specific RT-PCR targeting portions of the 5 UTR-P1 and NIa gene ( Table 4) were carried out on a template consisting of random hexamer-derived cDNA employing the same RNA isolation as used for the preparation of HTS libraries. For all primer combinations, the following cycling conditions were used: denaturation at 98 • C for 1 min, 35 cycles of amplification (98 • C for 30 s, 53 • C for 30 s, and 72 • C for 30 s), and a final extension at 72 • C for 5 min.