Unlocking the Hidden Genetic Diversity of Varicosaviruses, the Neglected Plant Rhabdoviruses

The genus Varicosavirus is one of six genera of plant-infecting rhabdoviruses. Varicosaviruses have non-enveloped, flexuous, rod-shaped virions and a negative-sense, single-stranded RNA genome. A distinguishing feature of varicosaviruses, which is shared with dichorhaviruses, is a bi-segmented genome. Before 2017, a sole varicosavirus was known and characterized, and then two more varicosaviruses were identified through high-throughput sequencing in 2017 and 2018. More recently, the number of known varicosaviruses has substantially increased in concert with the extensive use of high-throughput sequencing platforms and data mining approaches. The novel varicosaviruses have revealed not only sequence diversity, but also plasticity in terms of genome architecture, including a virus with a tentatively unsegmented genome. Here, we report the discovery of 45 novel varicosavirus genomes which were identified in publicly available metatranscriptomic data. The identification, assembly, and curation of the raw Sequence Read Archive reads has resulted in 39 viral genome sequences with full-length coding regions and 6 with nearly complete coding regions. The highlights of the obtained sequences include eight varicosaviruses with unsegmented genomes, which are linked to a phylogenetic clade associated with gymnosperms. These findings have resulted in the most complete phylogeny of varicosaviruses to date and shed new light on the phylogenetic relationships and evolutionary landscape of this group of plant rhabdoviruses. Thus, the extensive use of sequence data mining for virus discovery has allowed us to unlock of the hidden genetic diversity of varicosaviruses, the largely neglected plant rhabdoviruses.


Introduction
A recently discovered huge number of diverse viruses has revealed the complexities of the evolutionary landscape of replicating entities and the challenges associated with their classification [1], leading to the first comprehensive proposal of the virus world megataxonomy [2]. Nevertheless, a minuscule portion, likely a small fraction of one percent, of the virosphere has been characterized so far [3]. Therefore, we have a limited knowledge of the vast world virome, with its remarkable diversity, that includes every potential host organism assessed so far [4][5][6]. Data mining of publicly available transcriptome datasets has become an efficient and inexpensive strategy to unlock the diversity of the plant virosphere [5]. Data-driven virus discovery relies on the vast number of available datasets on the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI). This resource, which is growing at an exceptional rate and includes data of a large and diverse number of organisms, represents a substantial fraction of species that populate our planet, which makes the SRA database an invaluable source to identify novel viruses [7].
Varicosavirus is one of the six genera that are comprised of plant rhabdoviruses (family Rhabdoviridae, subfamily Betarhabdovirinae), and its members are thought to have a negative-sense, single-stranded, bi-segmented RNA genome [8]. Nevertheless, recently, we described the first apparently unsegmented varicosavirus [9]. In those varicosaviruses with segmented genomes, RNA 1 consists of one to two genes, with one of those encoding the RNA-dependent RNA polymerase L, while RNA 2 consists of three to five genes, with the first open reading frame (ORF) encoding a nucleocapsid protein (N) [8,10]. On the other hand, the only unsegmented varicosavirus described so far has five ORFs, in the order: 3 -N-Protein 2-Protein 3-Protein 4-L-5 [9]. Varicosaviruses appear to have a diverse host range that includes dicots, monocots, gymnosperms, ferns, and liverworts [6,9]. The vector of a sole member, lettuce big vein-associated virus (LBVaV), has been characterized, which is the chytrid fungus Olpidium spp. [11].
Until 2017, LBVaV was the only identified and extensively characterized varicosavirus [12][13][14], and then, in 2017 and 2018, two novel varicosaviruses were identified through high-throughput sequencing (HTS) [15,16]. However, in 2021 and 2022, there was a five-fold increase in the number of reported varicosaviruses, with 12 out 15 discovered through data mining of publicly available transcriptome datasets [6,9,17,18], while the other three were identified using HTS [19][20][21] (Supplementary Figure S1). Nevertheless, only some minor biological aspects, such as mechanical transmissibility, of some of these members were further characterized [15,20]. Therefore, varicosaviruses are, by far, the least-studied plant rhabdoviruses, and many aspects of their epidemiology remain elusive. In terms of genetic diversity, before this study, while greatly expanded by recent works, the Varicosavirus genus includes only three accepted species and 15 tentative members.
In this study, we identified 45 novel varicosaviruses by analyzing publicly available metatranscriptomic data. Thus, the extensive use of data mining for virus discovery has allowed us to unlock some of the hidden diversity of varicosaviruses, the much-neglected plant rhabdoviruses.

Identification of Plant Rhabdovirus Sequences from Public Plant RNA-seq Datasets
Three strategies were used to detect varicosavirus sequences: (1) Amino acid sequences corresponding to the nucleocapsid and polymerase proteins of known varicosaviruses were used as queries in tBlastn searches with the parameters word size = 6, expected threshold = 10, and scoring matrix = BLOSUM62, against the Viridiplantae (taxid: 33090) Transcriptome Shotgun Assembly (TSA) sequence databases. The obtained hits were manually explored and based on percentage identity, query coverage, and E-value (>1 × 10 −5 ) and shortlisted as likely corresponding to novel virus transcripts, which were then further analyzed. (2) Raw sequence data corresponding to the SRA database associated with the 1K study [22] were explored for varicosa-like virus sequences. (3) The Serratus database was explored, employing the serratus explorer tool [5], and using as queries the sequences of LBVaV, red clover varicosavirus, and black grass varicosavirus. Those SRA libraries that matched the query sequences (alignment identity > 45%; score > 10) were further explored in detail.

Sequence Assembly and Identification
The nucleotide (nt) raw sequence reads from each SRA experiment, which are associated with different NCBI bioprojects (Table 1), were downloaded and pre-processed by trimming and filtering with the Trimmomatic tool as implemented in http://www.usadellab. org/cms/?page=trimmomatic (accessed on 19 August 2022). The resulting reads were assembled de novo with rnaSPAdes using standard parameters on the Galaxy.org server. The transcripts obtained from the de novo transcriptome assembly were subjected to bulk local BLASTX searches (E-value < 1 × 10 −5 ) against a collection of varicosavirus protein se-quences available at https://www.ncbi.nlm.nih.gov/protein?term=txid140295 [Organism] (accessed on 19 August 2022). The resulting viral sequence hits of each bioproject were visually explored. Tentative virus-like contigs were curated (extended or confirmed) by iteratively mapping each SRA library's filtered reads. This strategy used BLAST/nhmmer to extract a subset of reads related to the query contig and used the retrieved reads to extend the contig and then repeated the process iteratively using the extended sequence as query. The extended and polished transcripts were reassembled using the Geneious v8.1.9 (Biomatters Ltd., San Diego, CA, USA) alignment tool with high sensitivity parameters. Bowtie2, available at http://bowtie-bio.sourceforge.net/bowtie2/index.shtml (accessed on 26 September 2022), was used with standard parameters for filtered read mapping to calculate the mean coverage of each assembled virus sequence.

Pairwise Sequence Identity
Percentage amino acid (aa) sequence identities of the L protein of those varicosaviruses identified in this study, as well as those available in the NCBI database, were calculated using SDTv1.2 [59]. Virus names, abbreviations, and NCBI accession numbers of the varicosaviruses already reported are shown in Supplementary Table S1.

Phylogenetic Analysis
Phylogenetic analysis based on the predicted polymerase protein of all available varicosaviruses was completed using MAFFT 7.505 (https://mafft.cbrc.jp/alignment/software) (accessed on 25 August 2022) with multiple aa sequence alignments and using FFT-NS-i as the best-fit model. The aligned aa sequences were used as inputs to generate phylogenetic trees using the maximum-likelihood method (best-fit model = E-INS-i) with the FastTree 2.1.11 tool (available at http://www.microbesonline.org/fasttree/) (accessed on 25 August 2022). Local support values were calculated with the Shimodaira-Hasegawa test (SH) and 1000 trees were resampled. The L proteins of four selected cytorhabdoviruses were used as outgroups. To explore the potential phylogenetic co-divergence of varicosaviruses with their associated host plants, plant host cladograms were generated in phyloT v.2 (https://phylot.biobyte.de/, accessed on 26 August 2022) based on NCBI Taxonomy. Connections were manually inferred between the viral and plant phylograms and cladograms and visually inspected.

Results and Discussion
Most varicosaviruses likely do not induce easily discernable disease symptoms. Since their presence is not expected in the sequencing libraries of apparently "healthy" vegetables, they are ideal candidates to be identified through mining publicly available metatranscriptomic data. Accordingly, very recently, 12 novel proposed varicosaviruses were discovered when publicly available transcriptome datasets were mined [6,9,17,18]. Therefore, to unlock the hidden diversity of varicosaviruses, we extensively searched for these viruses in already available plant transcriptome data. This bioinformatics research resulted in the identification of 45 novel varicosaviruses, including the corrected full-length coding genome segments of the previously reported Arceuthobium sichuanense-associated virus 2 (ASaV2) [18], which had apparently been reconstructed from the genome segments of two different varicosaviruses. We also identified three novel variants of three recently discovered varicosaviruses, confirming and strengthening the results previously reported by Bejerman et al. [9]. This significant number of newly discovered varicosaviruses represents a 3.5-fold increase in the known varicosaviruses (Supplementary Figure S1), which clearly highlights the importance of data-driven virus discovery to illuminate the landscape of largely overlooked taxonomic groups, such as varicosaviruses.
More details, identification, assembly, and curation of raw SRA reads in this study resulted in 39 viral genome sequences with full-length coding regions and six with nearly complete coding regions. These viruses were associated with 45 plant host species (Table 1). Most of the tentative plant hosts of the novel varicosaviruses are herbaceous dicots (24/45), nine are herbaceous monocots, eight are gymnosperms, and four are liverworts and ferns ( Table 1).
The genomes of 37 viruses identified in this study were bisegmented, where the RNA 1 of 36 of them encodes only the L protein, while the RNA 1 of Chamaemelum virus 1 (ChaV1) has an additional ORF 5' to the L gene, supported by the identification of the conserved intergenic sequence (see below), encoding a 171 aa putative protein (Table 1, Figure 1), which appears to be the first varicosavirus reported with an ORF in this position. The RNA 2 segments of these 37 viruses have three to five genes in the order 3 -N-PX-5 . Twelve of them have three genes, while 17 have four genes and eight contained five genes (Table 1, Figure 1). Of the previously reported varicosaviruses, six have three genes, four have four genes, and four have five genes; therefore, RNA 2 has a flexible genomic architecture and is apparently the most frequent genomic organization in the RNA 2 of varicosaviruses that includes four genes (21 members) or three genes (18 members).  Supplementary Table S1 and Table 1. The consensus gene junction sequences of the bisegmented varicosaviruses were determined to be 3 AU(N) 5 UUUUUGCUCU 5 (Table 2), while the gene junction sequences of all but one of the unsegmented varicosaviruses differed slightly in the 3 end, being GU(N) 5 instead of AU(N) 5 (Table 2). Strikingly, the consensus gene junction of the unsegmented Torreya virus 1 (TorV1) was similar to that of the bisegmented varicosaviruses. The potential implication of this difference in the gene junctions needs to be explored since it could be linked to the basal evolutionary grouping of TorV1 (see below).    Table 1; while the names and abbreviations of known viruses are listed in Supplementary Table S1. There is a great dearth of data on the potential functions of putative proteins, other than N and L, encoded by varicosaviruses, and, intriguingly, there were no conserved domains identified in these proteins. We grasped some shared identities, primarily for the cognate P3 (but also for several P2 proteins) (Table 1), though for most of the encoded proteins, the BlastP results were orphans, with no known signals or domains present and no clues towards their putative (or conserved) function. Thus, further studies should be focused on the functional characterization of these proteins to gain essential knowledge regarding the elusive proteome of varicosaviruses beyond the N and L proteins.
The pairwise aa sequence identities between the L proteins of all the reported varicosaviruses, including those identified in this study, showed great diversity and an overall low identity between the different varicosaviruses ( Figure 2, Supplementary Table S2). Relatively low sequence identity is a common feature among rhabdovirus taxa, characterized by a high level of diversity in both the genome sequence and organization [10]. In addition, the overall low sequence identity among the novel viruses detected here and with the previously described varicosaviruses suggests that despite the many viruses identified in this study, there likely remains a significant amount of virus "dark matter" for yet-to-be-discovered varicosaviruses.  Supplementary Table S1 and Table 1. When we analyzed the diversity between the variants of viruses which are likely members of the same species, we found that proteins encoded by the Brassica virus 2, Spinach virus 1, and Sciadopitys virus 1 variants were very similar. On the other hand, proteins encoded by the Brassica virus 1, Lolium virus 1, and Melilotus virus 1 variants were quite diverse, but, nevertheless, they showed aa identities for the N and L proteins exceeding 80%. Thus, we tentatively propose an aa sequence identity of 80% across the L gene as the threshold for species demarcation in the Varicosavirus genus, a taxonomic criterion which had previously not been fully defined [10]. This threshold is strongly supported by the comparison of the L protein aa sequence of 60 viruses (Figure 2, Supplementary Table S2). Based on this criterion, all 39 novel viruses with their complete coding region assembled in this study should be considered as belonging to novel Varicosavirus species, which would increase the number of members of the genus by more than an order of magnitude.
Bejerman et al. [9] tentatively reported the first unsegmented varicosavirus, Pinus flexilis virus 1 (PiFleV1), which was associated with the gymnosperm Pinus flexilis. In this study, we complemented that result by the discovery of eight additional unsegmented varicosaviruses which were exclusively associated with gymnosperms ( Table 1), some of which are linked to the same genus Pinus and present a significant co-evolution of viruses and hosts. These results robustly support a clade of gymnosperm-associated varicosaviruses with a distinct genome architecture, requiring the rewriting of a previously proposed key feature and fundamental marker of varicosaviruses: their genomic bisegmented nature. It is tempting to speculate that the unsegmented genomic architecture may be linked to the adaptation to gymnosperm hosts and a shared ancient evolutionary history of these viruses and hosts.
Interestingly, in the BlastP analyses of N, P2, and P3 of the gymnosperm-associated viruses, most of them had, as a best hit to the cognate proteins encoded by the putative bisegmented ASaV2 (Table 1), a virus apparently hosted by a parasitic plant of spruce (Picea, Pinacea). Furthermore, unexpectedly, the best hit of the putative P5 protein encoded on ASaV2 RNA2 was a fragment of the PiFleV1 L protein, while the deduced L protein on ASaV2 RN1 was not a best hit with PiFleV1, but instead, with the non-gymnosperm-linked MelRoV1 hosted by the Orobanchaceae parasitic plant Melampyrum roseum. Thus, we suspected that ASaV2 was potentially misassembled from fragments belonging to two different viruses. Consequently, we re-analyzed the original SRA data used by Sidhartan et al. [18] and were able to assemble two distinct varicosavirus genomes: one bisegmented genome presumably linked to the parasitic plant and one unsegmented genome most likely linked to spruce, which would support our hypothesis. We believe that there are several reasons that led to the original ASaV2 description: (i) the atypical and unexpected existence at the time of an unsegmented varicosavirus; (ii) the presence of two varicosaviruses in the very same sequencing library, which may be the first tentative evidence in the literature of co-infection of two varicosaviruses; and (iii) the fact that the sequence reads corresponding to the L gene region of the unsegmented varicosavirus were low, which may have affected the assembling pipelines used by the authors. All in all, independently verifying unexpected re-analysed SRA data may lead to a clearer understanding of the genomic structure of the mined RNA virus genomes. Nevertheless, the inability to return to the original biological material to replicate, confirm, and validate the assembled viral genome sequences is a significant limitation of the data mining approach for virus discovery. Thus, researchers must be cautious when analysing SRA public data for virus discovery and understand the preliminary nature of its results.
The phylogenetic analysis based on the deduced L protein aa sequences placed all unsegmented varicosaviruses, except TorV1, into a distinct clade. Interestingly, TorV1 was placed in a clade that was basal to all varicosaviruses ( Figure 1). This distinct phylogenetic branching and clustering of the unsegmented viruses suggests that they share a unique evolutionary history among varicosaviruses. Moreover, this may suggest that bisegmented varicosaviruses are evolutionarily younger than unsegmented ones. It may also mean that a genome split in varicosa-like viruses occurred after the radiation of gymnosperms and angiosperms. Bisegmented varicosaviruses did not cluster according to their genomic organization, nor did they cluster with the plant species associated with each virus (Figure 1). For example, brassica virus 1 and brassica virus 2 were placed in distinct clades, while two viruses associated with orchids (Ophius virus 1 and Caladenia virus 1) were placed in different clusters, and monocot-associated viruses were not all grouped together. On the other hand, all varicosaviruses associated with ferns and liverworts belonged to the same cluster, which was also shared with previously reported varicosaviruses from these plant types, while most of the grass-associated varicosaviruses were also clustered together ( Figure 1).
We generated a tanglegram to compare the virus phylogram and plant host cladogram to further explore virus-host relationships (Figure 3). This analysis showed that the viruses of some clades clearly co-diverged with their hosts, including the gymnosperm-associated virus clade, the SpV1 and Silene virus 1 clade, the grass-associated virus clade, and the clade of fern and liverworts viruses, suggesting a shared host-virus evolution in those clades ( Figure 3). However, the tanglegram topology also indicated that for most of the vari-cosaviruses, there was no apparent concordant evolutionary history with their plant hosts, similar to what was previously reported for invertebrate and vertebrate rhabdoviruses [60]. Several lines of evidence suggest that varicosaviruses may be vertically transmitted: (i) a close host-virus co-evolution in some clades may reflect species isolation and a lack of horizontal transmission, (ii) some viruses detected in this study were identified from seed transcriptomics databases, and (iii) an emerging characteristic of persistent, chronic infections of several plant viruses which are likely vertically transmitted are latent/asymptomatic infections, a characteristic which appears to be shared with varicosaviruses. Thus, further studies should be carried out to elucidate the transmission mode of varicosaviruses beyond the fungal-transmitted LBVaV [11]. It is worth mentioning that even with the availability of thousands of RNAseq libraries of fungi and arthropods, we failed to detect any evidence of varicosaviruses in those organisms, which could suggest that vectors of varicosaviruses are rare or non-existent.
Before the era of data-driven virus discovery, few viruses had been identified in gymnosperms [61][62][63][64]. However, when data mining was applied to publicly available transcriptomes, many novel viruses were identified in this large group of higher plants, highlighting the rich and diverse gymnosperm virosphere, which still is largely unexplored. A distinct clade of gymnosperm-associated viruses was recently identified within amalgaviruses [65], while we recently described two distinct caulimovirids and geminivirids linked to the gnetophyte Welwitschia mirabilis [66]. Eight unsegmented varicosaviruses associated with gymnosperms were identified in this study, and another was discovered by Bejerman et al. [9]. Taken together, all of these recently discovered viruses in gymnosperms strongly suggest that they may have evolutionary trajectories that are distinct from those infecting angiosperms. Thus, it is likely that further exploration of additional gymnosperm datasets or new transcriptome studies of other gymnosperms will yield plenty of novel viruses with unique features, highlighting their close evolution with their hosts. The clear association between gymnosperm-associated viruses and their hosts likely indicates a close coevolution, which suggest an early adaptation of this group of viruses to infect gymnosperms. This hypothesis is also supported by the distinct genomic architecture and divergent evolutionary history among varicosaviruses, as shown in the phylogenetic tree, which are characterized by long branches and distinctive clustering. Taken together, the gymnosperm-associated varicosaviruses could be taxonomically classified in a novel genus within the family Rhabdoviridae, subfamily Betarhabdovirinae, for which we suggest the name "Gymnorhavirus".
In summary, this study highlights the importance of the analysis of SRA public data as a valuable tool not only to accelerate the discovery of novel viruses, but also to gain insight into their evolution and to refine virus taxonomy. Using this approach, we looked for hidden varicosalike virus sequences to unlock the veiled diversity of a largely neglected plant rhabdovirus genus, the varicosaviruses. Our findings, including an approximately 3.5-fold expansion of the current genomic diversity within the genus, resulted in the most complete phylogeny of varicosaviruses to date, and they shed new light on the genomic architecture, phylogenetic relationships, and evolutionary landscape of this unique group of plant rhabdoviruses. Future studies should assess many intriguing aspects of the biology and ecology of these viruses such as potential symptoms, vertical transmission, and putative vectors.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/pathogens11101127/s1, Figure S1: Stacked bar chart showing the number of previously reported varicosaviruses and those in this study; Table S1: Virus names, abbreviations, and NCBI accession numbers of the varicosavirus sequences used in this study; Table  S2: Amino acid sequence identity of the complete L gene ORF.

Acknowledgments:
The authors would like to express their sincere gratitude to the generators of the underlying data used for this work, which are cited in Table 1. By following open-access practices and supporting accessible raw sequence data in public repositories available to the research community, they have promoted the generation of new knowledge and ideas.

Conflicts of Interest:
The authors declare no conflict of interest.