Illuminating the Plant Rhabdovirus Landscape through Metatranscriptomics Data

Rhabdoviruses infect a large number of plant species and cause significant crop diseases. They have a negative-sense, single-stranded unsegmented or bisegmented RNA genome. The number of plant-associated rhabdovirid sequences has grown in the last few years in concert with the extensive use of high-throughput sequencing platforms. Here, we report the discovery of 27 novel rhabdovirus genomes associated with 25 different host plant species and one insect, which were hidden in public databases. These viral sequences were identified through homology searches in more than 3000 plant and insect transcriptomes from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) using known plant rhabdovirus sequences as the query. The identification, assembly and curation of raw SRA reads resulted in sixteen viral genome sequences with full-length coding regions and ten partial genomes. Highlights of the obtained sequences include viruses with unique and novel genome organizations among known plant rhabdoviruses. Phylogenetic analysis showed that thirteen of the novel viruses were related to cytorhabdoviruses, one to alphanucleorhabdoviruses, five to betanucleorhabdoviruses, one to dichorhaviruses and seven to varicosaviruses. These findings resulted in the most complete phylogeny of plant rhabdoviruses to date and shed new light on the phylogenetic relationships and evolutionary landscape of this group of plant viruses. Furthermore, this study provided additional evidence for the complexity and diversity of plant rhabdovirus genomes and demonstrated that analyzing SRA public data provides an invaluable tool to accelerate virus discovery, gain evolutionary insights and refine virus taxonomy.


Introduction
The costs for high-throughput sequencing (HTS) have been significantly reduced each year due to advances in sequencing technologies; therefore, the number of genome and transcriptome sequencing projects has been steadily increasing, resulting in a massive number of nucleotides deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI). Over 16,000 petabases (10 15 bases) have been deposited in the SRA, with over 6000 petabases available as open-access data [1]. Thus, this large amount of data has provided significant challenges for data storage, bioinformatic analysis and management. This impressive and potentially useful amount of data concomitantly raised two issues: (I) high logistical costs of data management and (II) large amounts of neglected and unused data awaiting secondary analysis and repurposing. In the specific case of large plant sequencing project datasets, virome studies are scarce.
Abundant novel viruses, many of them not known to induce any apparent symptoms in their host or without a known host, have been identified from diverse environments using metagenomic approaches. This has highlighted our limited knowledge about the richness of a continuously expanding plant virosphere, which appears highly diverse in every potential host assessed so far [2][3][4][5]. Furthermore, the great number of viruses recently discovered by HTS, a miniscule portion of the virosphere, allowed a first glimpse of the path to a comprehensive megataxonomy of the virus world [6].
The scientific interest of the submitters of transcriptome datasets is often limited to a narrow objective within their specific field of study, which leaves a large amount of potentially valuable data not analyzed [7]. In such transcriptome datasets, viral sequences may be hidden in plain sight; thus, their analysis has become a valuable tool for the discovery of novel viral sequences [8][9][10][11][12][13][14][15][16]. In a recent consensus statement report, Simmonds and colleagues [17] contended that viruses that are known only from metagenomic data can, should and have been incorporated into the official classification scheme overseen by the International Committee on Taxonomy of Viruses (ICTV). Consequently, the analysis of public sequence databases constitutes a valuable resource for the discovery of novel plant viruses, which allows the reliable identification and characterization of new viruses in hosts with no previous record of virus infections [8]. This approach to virus discovery is inexpensive, as it does not require the acquisition of samples and subsequent sequencing but on secondary analyses of publicly available data to address novel research questions and objectives. At the same time, it is more wide-ranging and comprehensive than any other current approach due to the millions of datasets from a large variety of potential host species available from the NCBI-SRA [12].
Plant rhabdoviruses have negative-sense, single-stranded RNA genomes and are taxonomically classified in six genera: Cytorhabdovirus, Alphanucleorhabdovirus, Betanucleorhabdovirus and Gammanucleorhabdovirus for viruses that have an unsegmented genome and Dichorhavirus and Varicosavirus for viruses that have a bisegmented genome and infect both monocot and dicot plants [18]. These six genera were recently assigned to the subfamily Betarhabdovirinae within the family Rhabdoviridae [19]. Viruses classified in five of these genera are transmitted persistently by arthropods in which they also replicate [18,20], whereas varicosaviruses are transmitted by soil-borne chytrid fungi [18]. Cytoand nucleorhabdovirus genomes have six conserved canonical genes encoding in the order 3 -nucleocapsid protein (N)-phosphoprotein (P)-putative movement protein (P3)-matrix protein (M)-glycoprotein (G)-large polymerase (L)-5 ; the L gene of dichorhaviruses is located on RNA2 [21]. Up to three accessory genes with unknown functions have been identified among cyto-and nucleorhabdovirus genomes, leading to diverse genome organizations [21,22]. Conserved gene junction sequences separate each gene, and the overall coding region is flanked by 3 leader and 5 trailer sequences that feature partially complementary ends that may form a panhandle structure during replication [20]. Varicosavirus RNA 1 has one to two genes, with one of those encoding the RNA-dependent RNA polymerase L, while RNA 2 has three to five genes, with the first open reading frame (ORF) encoding a coat protein [20,21]. The 3 -and 5 -terminal sequences of the two varicosavirus genome segments are similar but do not exhibit inverse complementarities [21].
In this study, we queried the publicly available plant transcriptome datasets in the transcriptome shotgun assembly (TSA) database hosted at NCBI and identified 27 novel plant rhabdoviruses from 25 plant and one insect species, showing structural, functional and evolutionary cues to be classified in the family Rhabdoviridae; subfamily Betarhabdovirinae and genera Cytorhabdovirus, Alphanucleorhabdovirus, Betanucleorhabdovirus, Dichorhavirus and Varicosavirus.

Identification of Plant Rhabdovirus Sequences from Public Plant Transcriptome Datasets
The detection of plant rhabdovirus sequences was done as described by Longdon and colleagues [13]. Briefly, the amino acid sequences corresponding to the nucleocapsid and polymerase proteins of several known cyto-and nucleorhabdoviruses were used as query in tBlastn searches with the parameters word size = 6, expected threshold = 10 and scoring matrix = BLOSUM62 against the Viridiplantae (taxid:33090) and Hemiptera (taxid:7524) TSA databases. The obtained hits were explored by eye and based on the percentage identity, query coverage and E-value (>1 × 10 5 ), shortlisted as likely corresponding to novel virus transcripts, which were then further analyzed. Given the redundant nature of many retrieved hits, a step of contig clustering was implemented using the CD-hit suite with the standard parameters available at http://weizhongli-lab.org/cdhit_suite/cgi-bin /index.cgi?cmd=cd-hit (accessed on 10 March 2021). In addition, the raw sequence data corresponding to several SRA experiments associated with different NCBI Bioprojects (Table 1) were retrieved for further analyse

Sequence Assembly and Identification
The nucleotide (nt) raw sequence reads from each analyzed SRA experiment linked to the TSA projects returning rhabdovirus-like hits were downloaded and preprocessed by trimming and filtering with the Trimmomatic tool, as implemented in http://www.usad ellab.org/cms/?page=trimmomatic (accessed on 15 March 2021), and the resulting reads were assembled de novo with Trinity v2.6.6 using the standard parameters. The transcripts obtained from de novo transcriptome assembly were subjected to bulk local BLASTX searches (E-value < 1 × 10 5 ) against a Refseq virus protein database available at ftp://ftp. ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz (accessed on 3 March 2021). The resulting viral sequence hits of each bioproject were visually explored. Tentative virus contigs were extended by iterative mapping of each SRA library's raw reads. This strategy employed BLAST/nhmmer to extract a subset of reads related to the query contig, and these retrieved reads were used to extend the contig, and then, the process was repeated iteratively using as the query the extended sequence. The extended and polished transcripts, now presenting overlapping regions, were reassembled using the Geneious v8.1.9 (Biomatters Ltd., Auckland, New Zealand) alignment tool.

Pairwise Sequence Identity
Percentage amino acid (aa) sequence identities of the predicted ORFs of each plantassociated rhabdovirid identified in this study based on the available plant-associated rhabdovirus genome sequences were calculated using https://www.ebi.ac.uk/Tools/psa/ emboss_needle/ (accessed on 23 March 2021).

Phylogenetic Analysis
Phylogenetic analysis based on the predicted nucleocapsid proteins of all plant rhabdovirids, listed in Table S1, was done using MAFFT 7 https://mafft.cbrc.jp/alignme nt/software (accessed on 29 March 2021), with multiple aa sequence alignments using FFT-NS-i as the best-fit model. The aligned aa sequences were used as the input in MegaX software [23] to generate phylogenetic trees by the maximum-likelihood method (best-fit model = WAG + G + F). Local support values were computed using bootstraps with 1000 replicates.

Summary of Discovered Rhabdovirid Sequences
The complete coding regions of 17 novel rhabdoviruses were identified; in addition, partial genomic sequences for 10 novel viruses were assembled. These viruses were associated with 25 plant host species and one insect species (Table 1). The bioinformatic and source data of each of the 27 viral sequences, as well as the GenBank accession number and proposed classification, are listed in Table 1; the summary of the assembly statistics of each virus of the plant rhabdovirus sequences identified from the transcriptome data available in the NCBI database are presented in Table S2. Based on phylogenetic relatedness, genome organization and sequence identity, the novel viruses were tentative assigned to the established plant rhabdovirus genera Alphanucleorhabdovirus, Betanucleorhabdovirus, Cytorhabdovirus, Dichoravirus and Varicosavirus. Most of the tentative plant hosts of the novel viruses are herbaceous dicots (16/25), seven are herbaceous monocots, one a woody dicot and one a gymnosperm ( Table 1). The characteristics of deduced proteins encoded by each rhabdovirid sequence were determined by predictive algorithms and are shown in Table S3. The genomic architecture and evolutionary placement of the 27 discovered viruses are described below, grouped by affinity to members of the diverse genera within the Betarhabdovirinae.

Alphanucleorhabdovirus
The complete coding region of a novel putative alphanucleorhabdovirus, tentatively named Agave tequilana virus 1 (ATV1), with the genome organization 3 -N-P-P3-M-G-L-5 ( Figure 1A) was assembled from blue agave transcriptome data (Table 1). A nuclear localization signal (NLS) was predicted in every ATV1-encoded protein (Table S3). According to the NLS scores, the N protein is predicted to be located exclusively in the nucleus, whereas the P and L proteins have a partial nuclear localization, while the P3, M and G proteins are localized to both the nucleus and the cytoplasm. Leucine-rich nuclear export signals (NES) were predicted in the N, P and L proteins (Table S3). A transmembrane domain motif was detected in the C-terminus of the G protein, and a signal peptide was predicted in its Nterminus (Table S3). The consensus gene junction sequence 3 -AUUCUUUUUGGGUUG-5 of the ATV1 genome is similar to that of the alphanucleorhabdoviruses maize mosaic virus (MMV), maize Iranian mosaic virus (MIMV), Morogoro maize-associated virus (MMaV) and taro vein chlorosis virus (TaVCV) ( Table 2).
The consensus gene junction sequences of the viruses identified in this study are highlighted in light grey. * Names and abbreviations of newly identified viruses are listed in Table 1, while the names and abbreviations of known viruses are listed in Table S1. Pairwise aa sequence identities between the ATV1-encoded proteins and those from other alphanucleorhabdoviruses showed low sequence identities of 10.9-35.0% (Table 3). The nucleotide sequence identity for the complete genome sequence of ATV1 and other alphanucleorhabdoviruses ranged from 47-49.3 % (Table 3). Table 3. Pairwise identity percentages between ATV1 and the selected alphanucleorhabdoviruses.  Table S1 and Table 1.
The phylogenetic analysis based on the N protein aa sequences showed that ATV1 clustered with the monocot-infecting alphanucleorhabdoviruses MMV, MIMV, MMaV and TaVCV (Figure 2), which were also the most similar in pairwise sequence identity values, shared equivalent genome organizations and had similar conserved gene junction sequences ( Table 2).

Betanucleorhabdoviruses
The complete coding regions of three putative betanucleorhabdoviruses, tentatively named Asclepias syriaca virus 2 (AscSyV2), Plectranthus aromaticus virus 1 (PleArV1) and Rhododendron delavayi virus 1 (RhoDeV1), as well as the partial genomes of Cuscuta reflexa virus 1 (CusReV1) and Persicaria minor virus 1 (PerMiV1), were assembled in this study ( Table 1). The genome organization of these five viruses is 3 -N-P-P3-M-G-L-5 ( Figure 1A). A NLS was predicted in every protein encoded by AscSyV2, CusReV1, PleArV1 and RhoDeV1 (Table S3). According to the NLS scores, the CusReV1 and PleArV1 N proteins, RhoDeV1 P protein, CusReV1 M protein and PleArV1 and RhoDeV1 L proteins are expected to be exclusively nuclear, whereas the RhoDeV1 M protein is predicted to have a partial nuclear localization. The AscSyV2 and RhoDeV1 N proteins; AscSyV2, CusReV1 and PleArV1 P proteins; AscSyV2, CusReV1, PleArV1 and RhoDeV1 P3 proteins; AscSyV2 and PleArV1 M proteins; AscSyV2, CusReV1, PleArV1 and RhoDeV1 G proteins and AscSyV2 and CusReV1 L proteins are predicted to localize to both the nucleus and the cytoplasm. Leucine-rich NES were predicted in the AscSyV2, CusReV1 and PleArV1 N proteins; the CusReV1 and PleArV1 P proteins; the CusReV1 and RhoDeV1 P3 proteins; the PleArV1 M protein and the PleArV1 and RhoDeV1 L proteins (Table S3). A transmembrane domain was detected in the C-terminal sequence of the AscSyV2, CusReV1, PleArV1 and RhoDeV1 G proteins. A signal peptide was detected in the G protein N-terminus of AscSyV2, CusReV1 and RhoDeV1 (Table S3).
The deduced AscSyV2 M protein, PleArV1 G protein and RhoDeV1 L protein are the smallest such proteins reported so far among betanucleorhabdoviruses (Table S4).
Interestingly, the N proteins of AscSyV2, PleArV1 and RhoDeV1 are basic, whereas the CusReV1 N protein is acidic. Moreover, the CusReV1 and PleArV1 M proteins are basic, while the AscSyV2 and RhoDeV1 M proteins are acidic (Table S3).
Pairwise aa sequence identity values between each encoded protein of the five novel viruses and those from other betanucleorhabdoviruses vary significantly ( Table 4). The highest nucleotide sequence identity for the genome sequence between AscSyV2, PleArV1 and RhoDeV1 and other betanucleorhabdoviruses ranged from 51.4 to 62.1% (Table 4). Table 4. Pairwise identity percentages between AscSyV2, CusReV1, PleArV1 and RhoDeV1 and the selected betanucleorhabdoviruses.
PleArV1 vs. The phylogenetic analysis based on the deduced N protein aa sequences showed that CusReV1, RhoDeV1, PerMiV1 and PleArV1 clustered with the betanucleorhabdoviruses BFTaV1, BCaRV, BmV2, CdVCV1, DYVV, GSPNuV, SYNV and SYVV (Figure 2), which are also the most similar in pairwise sequence identity values of their cognate proteins; furthermore, all these viruses have a similar genome organization. Interestingly, AscSyV2 clustered closely with the betanucleorhabdoviruses AaNV and ApRVA and had similar pairwise sequence identity values for their cognate proteins. However, the genome organization of AaNV and ApRVA was different in that it included an accessory gene between the M and G genes.
Interestingly, the predicted N protein of almost every cytorhabdovirus identified in our study is acidic, but the PelRaV1 N protein is neutral, and the BeTaV1 N protein is basic. The P protein of almost every cytorhabdovirus identified in our study is acidic, but the TrAV1 P protein is basic. Moreover, the M protein of almost every cytorhabdovirus identified in our study is basic, except GymDenV1 and TrAV1 M proteins, which are acidic. Furthermore, the predicted G proteins of AscSyV1, DiCoV1, GlLV1, NymAV1 and PelRaV1 are acidic, except the LotCorV1 G protein is neutral, and the AntAmV1 and BeTaV1 G proteins are basic (Table S3).
The consensus gene junction sequences of the novel cytorhabdoviruses identified in our study are diverse ( Table 2) but similar to those previously reported for phylogenetically related cytorhabdoviruses. Interestingly, the intergenic sequence of the GymDenV1 and TrAV1 gene junctions starts with an adenine instead of a guanine residue like in every other plant rhabdovirus (Table 2).
Pairwise aa sequence identity values between each encoded protein of the thirteen novel viruses and those from known cytorhabdoviruses vary significantly ( Table 5). The highest nucleotide sequence identity for the genome sequence between AntAmV1, Asc-SyV1, BeTaV1, GlLV1, GymDenV1, NymAV1, TaEV1 and TrAV1 and the other cytorhabdoviruses ranged between 50.3 and 65.6% (Table 5). Table 5.
Pairwise identity percentages between the cytorhabdoviruses assembled in this study and the selected cytorhabdovirus.
The phylogenetic analysis based on the N protein aa sequence showed that the thirteen novel viruses grouped with known cytorhabdoviruses GlLV1 and NymAV1 clustered with Kenyan potato cytorhabdovirus (KePCyV), Trifolium pratense virus A (TpVA), strawberry virus 1 (StrV1) and tomato yellow mottle-associated virus (TYMaV), with whom they share a similar genomic organization. AChV1 and SuSV1 formed a clade with alfalfa dwarf virus (ADV), chrysanthemum yellow dwarf-associated virus (ChYDaV), raspberry vein chlorosis virus (RVCV) and strawberry crinkle virus (SCV). AscSyV1, LotCorV1 and PelRaV1 clustered with Actinidia cytorhabdovirus (AcCV), Bacopa monnieri virus 1 (BmV1) and Wuhan insect virus 4 (WhIV4), and all these viruses have a similar genomic organization. BeTaV1 clustered with cucurbit cytorhabdovirus 1 (CuCV1), and these viruses share a similar genomic organization. AntAmV1 clustered with paper mulberry mosaic-associated virus (PMuMaV), with which it shares a similar genomic organization. TaEV1 clustered with the cluster formed by AntAmV1 and PMuMaV, but these viruses have a dissimilar genomic organization. DiCoV1 clustered with Colocasia bobone diseaseassociated virus (CBDaV), but these viruses have a dissimilar genomic organization, while GymDenV1 and TrAV1 formed a monophyletic sister clade with the aphid-transmitted cytorhabdoviruses (Figure 2).

Dichorha-Like Viruses
The near-complete RNA 1 and RNA 2 sequences of a dichorha-like virus tentatively named Viola verecunda virus 1 (VVeV1) were identified in this study. We assembled three nonoverlapping fragments corresponding to RNA 1 containing the complete N, P3 and G genes and the near-complete RNA 2, which encodes the L protein (Table 1). Thus, the likely genome organization of VVeV1 is 3 -N-P-P3-M-G-5 for RNA 1 and 3 -L-5 for RNA 2 ( Figure 1D). A NLS was predicted in every deduced VVe1 protein (Table S3). According to the NLS scores, the N and L proteins should be exclusively localized to the nucleus, whereas the P3 and G proteins are predicted to localize to both the nucleus and the cytoplasm. Leucine-rich NESs were predicted in the N and L protein sequences (Table S3). A transmembrane domain was identified in the C-terminus of the G protein (Table S3). We were not able to identify a consensus gene junction sequence in the genome fragments of VVeV1 (Table 2). Pairwise aa sequence identities between the VVeV1-encoded proteins and those from known dichorhaviruses showed low sequence identities of 20% or less in the P3 and G proteins and 20.5-23.1% in the N protein ( Table 6). The phylogenetic analysis based on the N protein aa sequences showed that VVeV1 clusters with the dichorhaviruses but is placed by itself in a sister clade (Figure 2). Table 6. Pairwise identity percentages between VVeV1 and the dichorhaviruses.  Table S1 and Table 1.
The predicted LoPV1 N protein is the longest varicosavirus N protein described so far, whereas the MelRoV1 L protein is the smallest L protein described so far among the varicosaviruses, while the PiFleV1 L protein is the largest varicosavirus L protein described so far (Table S4). Interestingly, the PiFLeV1 N protein is basic, whereas the N proteins of AAnV1, BrRV1, LoPV1, MelRoV1, PhPiV1 and SpV1 are acidic. Moreover, AAnV1 and BrR1 P2s are basic, while P2s of LoPV1, MelRoV1, PiFleV1 and SpV1 are acidic. Furthermore, P3s of AAnV1, MelRoV1, PhPiV1 and PiFleV1 are basic, whereas the BrRV1 and LoPV1 P3s are acidic (Table S3).
The pairwise aa sequence identities between the cognate-encoded proteins of AAnV1, BrRV1, LoPV1, MelRoV1, PhPiV1, PiFleV1 and SpV1 and those from other varicosaviruses vary significantly ( Table 7). The phylogenetic analysis based on the deduced N protein aa sequences placed all seven novel varicosa-like viruses into a clade with the known varicosaviruses. BrRV1 and SpV1 clustered with RCaVV, and these viruses have a similar genomic organization with the three ORFs on RNA 2. AAnVi clustered with the clade formed by LoPV1 and AMVV1, and these viruses infect monocots. PhPV1 clustered with LBVaV, while the unique PiFleV1 was placed in a sister clade to the clade composed by LBVaV and PhPV1, and MelRoV1 clustered with vitis varicosavirus (VVV) (Figure 2). Table 7. Pairwise identity percentages between the varicosaviruses assembled in this study and the reported varicosaviruses.   Table S1 and Table 1.

Discussion
In the last few years, several novel plant rhabdoviruses, which may be asymptomatic, have been reported in HTS studies [47][48][49][50][51][52][53][54][55]. Thus, it is tempting to speculate that some plant rhabdovirus-like sequences may be hidden in published plant transcriptome databases, generated with diverse objectives beyond virus research, where viral RNA was inadvertently copurified with endogenous plant RNA and sequenced. To prove this point, we and others have previously identified the sequences of a small number of plant rhabdoviruses contained in public transcriptome databases [8,9,16]; however, an extensive search has not been conducted to date. Therefore, we queried the publicly available plant transcriptome datasets in the transcriptome shotgun assembly (TSA) database hosted at NCBI, which resulted in the identification of 27 novel plant rhabdoviruses.

Discovery of Novel Plant Rhabdoviruses Expands the Diversity and Evolutionary History of Rhabdovirids
The coding-complete or complete genomic sequences of 66 plant rhabdoviruses were reported by early 2021. Thirty-three of them are cytorhabdoviruses, thirteen alphanucleorhabdoviruses, ten betanucleorhabdoviruses, five dichorhaviruses, four varicosaviruses and one gammanucleorhabdovirus. Similarly, half of the novel sequences reported in this study are those of putative cytorhabdoviruses, indicating the extensive diversification of this group of viruses. The novel 27 viruses discovered in this study all appear to be members of new species and account for nearly half of the plant rhabdovirus species reported so far. In addition, five new putative betanucleorhabdoviruses shed more light on the diversity and evolution of this recently created genus. One newly identified putative alphanucleorhabdovirus and one novel putative dichorhavirus expand the diversity of their respective genera. Finally, we detected seven novel varicosa-like viruses, which doubles the number of known varicosaviruses, providing new information about the genomic diversity and evolution of these little-studied fungi-transmitted viruses. Thus, our study provides the most complete insight to date about the genomic diversity and evolution of plant rhabdoviruses, complementing the status quo of these viruses with additional data on genomic organization and highlighting their apparent genetic flexibility. Overall, the sequence identity between the newly discovered viruses and those plant rhabdoviruses already described was low, a common feature among these taxa characterized by a high level of diversity in both genome sequence and organization [20]. The low sequence identity of the novel viruses with the closest previously described virus may indicate that there is still a significant amount of virus "dark matter" within the plant rhabdovirus space worth exploring that potentially contains a significant number of yet to be discovered plant rhabdoviruses. Future works should focus on the analysis of additional RNA datasets of diverse potential hosts of the partial virus genomes identified in this study.
Viruses assigned to different species within the genera Alphanucleorhabdovirus, Betanucleorhabdovirus and Cytorhabdovirus have a nucleotide sequence identity lower than 75% in the complete genome sequence and occupy different ecological niches as evidenced by differences in the host species and/or arthropod vectors (https://talk.ictvonline.org/ic tv-reports/ictv_online_report/negative-sense-rna-viruses/w/rhabdoviridae) (accessed 11 April 2021). Cytorhabdoviruses may also have an aa sequence identity of less than 80% in all cognate ORFs. Viruses assigned to different species in the genus Dichorhavirus have less than 80% nucleotide sequence identity in the L gene. Based on these species demarcation criteria, ATV1 should be classified in a new Alphanucleorhabdovirus species and AscSyV2, PleArV1 and RhoDeV1 in a new species in the genus Betanucleorhabdovirus; due to incomplete sequences, PerMiV1 and CusReV1 cannot be classified based on these criteria. AntAmnV1, AscSyV1, BeTaV1, GlLV1, GymDenV1, LotCorV1, NymAV1, TaEV and TrAV1 should be classified in a new species in the genus Cytorhabdovirus, whereas AChV1, DiCoV1, PelRaV1 and SuSV1, due to incomplete genome sequence data, cannot be classified. The dichorha-like VVeV1 also awaits at least coding-complete sequence data before it can be formally classified in this genus. The species demarcation criteria recently proposed for the genus Varicosavirus by the ICTV Rhabdoviridae Study Group included a minimum nucleotide sequence divergence of 50% in the L protein. Based on this genetic distance, the varicosa-like viruses AAnV1, BrRV1, LoPV1, MelRoV1, PiFleV1 and SpV1 would not be considered different species, and PhPiV1 cannot be classified due to incomplete sequencing data. Considering the diverse host plant species these viruses were identified in and their phylogenetic relationships, these viruses should likely be classified in different species. Thus, we propose a nucleotide sequence identity of 75% across the genome and in the N gene as thresholds for species demarcation in the Varicosavirus genus, in line with the species demarcation criteria for the other genera of plant rhabdoviruses.

Host Range of the Novel Plant Rhabdoviruses
Most of the plant hosts in which the novel viruses of this study were identified are dicots; nevertheless, AChV1, which was detected in Chinese onions, is likely the first monocot-infecting cytorhabdovirus that belongs to a clade of aphid-transmitted viruses. Furthermore, most of the source hosts in which the novel viruses were identified are herbaceous plants, which, overall, are the most common hosts of plant rhabdoviruses [18]. However, one novel putative varicosavirus, Pinus flexilis virus 1, was associated with a gymnosperm host. Few viruses were identified previously from gymnosperms [12,[56][57][58][59]. Interestingly, a putative RdRp sequence was identified in the gymnosperm Sciadopitys verticillata and proposed to belong to a varicosavirus [14]. Therefore, to our knowledge, this study is the first to report the complete coding sequence of a plant rhabdovirus associated with a gymnosperm host and presenting a unique unsegmented genome, redefining the genome architecture of varicosaviruses. Future virus discovery studies should focus on gymnosperm viromes, which have been understudied and likely hide a rich and diverse number of novel viruses.

Diverse Genome Organizations of Plant Rhabdoviruses
Interestingly, the genome sequence of four novel viruses discovered in this study, GymDenV1 (only four predicted genes), TaEV and TrAV1 (both with only five predicted genes) and PiFleV1 (an unsegmented varicosa-like virus), have a unique genome architecture among plant rhabdoviruses that differs from the consensus genome organization reported for previously known cytorhabdoviruses and varicosaviruses [21]. The genomes of the cytorhabdoviruses GymDenV1, TaEV1 and TrAV1 lack a glycoprotein ORF, which, for rhabdoviruses, is not essential for replication and systemic movement, as demonstrated using an infectious clone of Sonchus yellow net virus [60]. Furthermore, isolates of citrusassociated rhabdovirus (CiaRV) were recently shown to have an impaired G ORF [61].
Five canonical structural protein genes 3 -N-P-M-G-L-5 are thought to be conserved among all rhabdoviruses [20], while plant rhabdovirus genomes contain at least six ORFs [21]. However, only four genes were predicted in the genome of the varicosavirus RCaVV [51]. Four genes were also identified in some of the varicosa-like virus genomes assembled in this study, suggesting that a minimal set of four genes may be sufficient for varicosaviruses to replicate in a plant host. This hypothetical minimal requirement appears to also apply to cytorhabdoviruses. For example, GymDenV1 encodes only the nucleocapsid core (NC) proteins N, P and L that are essential for virus replication and transcription and the M protein that is required for condensation of the core during virion assembly [20]. Thus, how GymDenV1 moves from cell to cell remains to be unraveled. Interestingly, no cell-to-cell movement protein has been identified so far in the varicosaviruses.
The discovery of diverse novel rhabdo-like viruses in metagenomics studies of arthropods [62] supports the assumption that arthropods have been fundamental to rhabdovirus evolution [63], and the G protein was found to be essential for virus attachment to predict cellular receptors in the midgut of arthropod vectors that facilitate virus uptake [20]. This agrees with the hypothesis that cytorhabdoviruses, nucleorhabdoviruses and dichorhaviruses evolved from viruses of plant-feeding arthropods that acquired movement proteins and assorted RNAi suppressors through recombination with preexisting plant viruses [3]. The viruses in these three genera appear to be the least plant-specialized among the negative-sense RNA viruses. However, the evolution of the nonenveloped negativesense RNA plant viruses, such as the fungal-transmitted varicosaviruses, which likely do not encode a G protein, clearly reflects the adaptation to a plant-specific lifestyle, raising the possibility that their origin is via a trans-kingdom horizontal transfer between fungi and plants [3]. A G protein-defective genome was recently identified in citrus isolates of CiaRV, and the authors speculated that the eventual "simplification" of viral genomes to adapt to plants without requiring an arthropod vector could provide an evolutionary advantage, especially in fruit trees that are propagated artificially by asexual modes, such as cutting and grafting [61]. Nevertheless, the tentative hosts of GymDenV1, TaEV1 and TrAV1 are herbaceous plants; thus, an evolutionary advantage linked to the lack of a G protein is not obvious for these viruses. Strikingly, TaEV1 is phylogenetically related to arthropod-transmitted cytorhabdoviruses. Thus, further studies should focus on the potential vector and the mode of transmission of GymDenV1, TaEV1 and TrAV1 to complement their peculiar minimalist genome organization and evolutionary links with biological data.

Phylogenetic Relationships among Plant Rhabdoviruses as Predictors for Vector Types
Among all plant rhabdoviruses studied so far, there is a strong correlation between phylogenetic relationships and vector types [18,20]. We therefore predict that the novel betanucleorhabdoviruses are likely aphid-transmitted, while the putative dichorhavirus VVeV1 is likely transmitted by Brevipalpus mites, and the varicosa-like viruses may be transmitted by chytrid fungi. Based on its phylogenetic clustering, the alphanucleorhabdovirus ATV1 is likely transmitted by a planthopper. Among the novel cytorhabdoviruses identified in this study, AChV1, AscSyV1, GlLV1, LotCorV1, NymAV1, PelRaV1 and SuSV1 are likely aphid-transmitted, while the vectors for AntAmV1, DiCoV1, GymDenV1, TaEV1 and TrAV1 cannot be predicted. The BeTaV1 genome was assembled from whitefly transcriptome data and, therefore, is likely a whitefly-transmitted cytorhabdovirus. The L protein-like TSA sequences included in the assembly of the BeTaV1 genome were previously reported to be 95-98% identical to soybean blotchy mosaic virus partial L gene sequences [18]; thus, it is tempting to speculate that the plant host of BeTaV1 may be soybeans. Recently, it was reported that the bean-associated cytorhabdovirus is efficiently transmitted by whiteflies, which was the first report of a whitefly-transmitted rhabdovirus [64], thus supporting our hypothesis that soybean blotchy mosaic virus may represent the second whitefly-transmitted rhabdovirus.
At least one transmembrane domain was identified in each P protein predicted in the genomes of a group of cytorhabdoviruses discovered in this study, consistent with a previous analysis of other cytorhabdoviruses, which also identified at least one transmembrane domain in every P protein [65]. Interestingly, this overprinted accessory protein is encoded by every cytorhabdovirus that appears to be aphid-transmitted. On the other hand, the P protein is not encoded in the genomes of aphid-transmitted betanucleorhabdoviruses. Therefore, the potential function of the P protein is unlikely to be directly associated with the vector specificity.

Cytorhabdoviruses
Two of the cytorhabdoviruses, AChV1 and SuSV1, whose genomes were only partially assembled, clustered phylogenetically in a monophyletic clade with ADV, ChYDaV, RVCV and SCV, all the viruses that contain a P6 accessory ORF between the G and L genes [66][67][68][69]. Given this clustering, it is likely that AChV1 and SuSV1 will have a similar genomic organization. At least one transmembrane domain has been predicted in every cytorhabdovirus with an accessory P6 ORF. Transmembrane domains were also predicted in the accessory ORF between the G and L genes of other cytorhabdoviruses [49,65,70]. Thus, the protein encoded by this small ORF may have membrane-associated functions similar to viroporins in vertebrate rhabdoviruses [71].
The phylogenetic relationships of the now expanded number of known cytorhabdoviruses provide some support for splitting the genus Cytorhabdovirus to establish three genera that represent different evolutionary lineages: (I) Alphacytohabdovirus would include species for all aphid-transmitted cytorhabdoviruses; (II) Betacytorhabdovirus would include species for all those cytorhabdoviruses likely transmitted by leafhoppers, planthoppers and whiteflies, as well as AntAmnV1, DiCoV1 and TaEV1, and (III) Gammacytorhabdovirus would include species for GymDenV1 and TrAV1 (Figure 2).

Nucleorhabdoviruses
The ATV1 genome that was assembled from the transcriptome data of the monocot agave does not encode any accessory ORFs, like most of the alphanucleorhabdoviruses, except the group of PYDV-like viruses that encode an X ORF of unknown function [18]. ATV1, while branching into a sister clade, appears to have a close evolutionary relationship with the cluster of planthopper-transmitted monocot-infecting alphanucleorhabdoviruses MIMV, MMV, MMaV and TaVCV. It is tempting to speculate that this clade evolved from a common ancestor that adapted to infect monocots and to be transmitted by planthoppers. Furthermore, the isoelectric point (IEP) of ATV1-predicted proteins is similar to that of MIMV, MMV, MMaV and TaVCV-predicted proteins, thus supporting a link between these viruses and ATV1.
All the betanucleorhabdoviruses identified in this study, AscSyV2, CusReV1, Per-MiV1, PleArV1 and RhoDeV1, appear to be associated with dicot hosts, which is consistent with every betanucleorhabdovirus reported so far [18]. The dicot host range of the betanucleorhabdoviruses is likely linked to their insect vector, given that every betanucleorhabdovirus where a vector has been experimentally determined is transmitted by aphids [18]. Although aphids can colonize both dicots and monocots [72], it has been suggested that these insects more successfully feed on dicots [73]. Regarding their genome organization, the five novel betanucleorhabdoviruses identified in this study encode six plant rhabdovirus ORFs in the conserved order 3 -N-P-P3-M-G-L-5 without any accessory ORFs, like the majority of previously known betanucleorhabdoviruses, such as the well-studied SYNV. CusReV1, PerMiV1, PleArV1 and RhoDeV1 likely belong to the same evolutionary lineage as BFTaV1, BCaRV, CdVCV1, DYVV, GSPNuV, SYNV and SYVV, which may represent the ancestral clade within the betanucleorhabdoviruses. However, AscSyV2 s closest evolutionary relationship is with the betanucleorhabdoviruses ApRVA and AaNV, which differ in their genome organization by having an accessory ORF located between the M and G genes [46,74]. Interestingly, the IEP of the AscSyV2, CusReV1, Per-MiV1, PleArV1 and RhoDeV1 N proteins is different to the IEP of the cognate proteins encoded by the phylogenetically related betanucleorhabdoviruses. It is unknown if this difference in the IEP may have a biological effect in terms of RNA or protein binding. Moreover, the RhoDeV1 M protein is acidic, while the BFTaV1, BCaRV, BmV2, CdVCV1, CusReV1, DYVV, GSPNuV, PleArV1, SYNV and SYVV N proteins are basic. It has been suggested that charge differences in the M proteins may be associated with differences in their abilities to interact with the negatively charged lipids of the membrane [75]. NLS and NES were predicted for most of the AscSyV2, ATV1, CusReV1, PerMiV1, PleArV1 and RhoDeV1-encoded proteins, which can be expected, since nucleorhabdoviruses replicate in the nuclei of infected cells [20].
The phylogenetic relationships of nucleorhabdoviruses, including the novel alphaand betanucleorhabdoviruses discovered in our analysis, clearly support the recent split of the previous genus Nucleorhabdovirus into the three genera Alphanucleorhabdovirus, Betanucleorhabdovirus and Gammanucleorhabdovirus [18].

Varicosaviruses
Currently, there are only three varicosaviruses recognized by the ICTV, LBVaV, AMVV1 and RCaVV, and a novel varicosavirus, VVV, was recently identified [52]. Nevertheless, the available information regarding varicosavirus gene functions is generally scarce, and the functional roles of the varicosavirus P2, P3, P4, P5 and P6 proteins remain to be elucidated. No transmembrane domain was predicted in any of the proteins encoded by the varicosavirus genomes identified in this study, as well as in those already described; thus, it appears that no varicosavirus-encoded protein has a membrane-associated function. The AAnV1, L and P2 proteins; LoPV1 N and P2 proteins and AMVV1 and MelRoV1 L proteins are predicted to be exclusively located in the nucleus, whereas NES are predicted in most N, P2 and P3 proteins and some P4 proteins (Table S3). This suggests a potential role of these proteins in the cell nucleus.
The N protein encoded by the unsegmented PiFLeV1 is basic; on the other hand, the N proteins encoded by the bisegmented varicosaviruses AAnV1, AMVV1, BrRV1, LBVaV, LoPV1, MelRoV1, PhPiV1, RCaVV, SpV1 and VVV are acidic. This difference in the IEP may be associated with a different replication mechanism of the unsegmented PiFLeV1 compared to the segmented varicosaviruses. Moreover, AAnV1 and BrRV1 P2s are basic, while the P2s of AMVV1, LBVaV, LoPV1, MelRoV1, PiFleV1, RCaVV, SpV1 and VVV are neutral or acidic. The biological significance of this difference is unknown, because the functional role of P2 still needs to be unraveled.
LoPV1 and AMVV1 are phylogenetically closely related; their N and P3 aa sequences are >50% identical, and both are associated with grasses in the family Poaceae. However, LoPV1 RNA1 has one ORF, while AMVV1 RNA1 has two [76]. These monocot-infecting varicosaviruses are phylogenetically related to AAnV1, which also is associated with a monocot host and shares a similar genomic organization with LoPV1. PhPiV1 clustered with LBVaV; thus, it is likely that its genome, which was partially assembled, has a similar genomic organization to LBVaV, with two ORFs encoded in RNA1 and five in RNA2 [77]. Interestingly, the RNA2 of four of the novel varicosa-like viruses identified in this study have three ORFs, similar to RCaVV and AMVV1 RNA2 [51,76]. On the other hand, the RNA2 of one novel varicosa-like virus identified in our study has four ORFs. Thus, the number of ORFs identified in the RNA2 of every novel varicosa-like virus reported in this study, except for PhPiV1, is different from the first identified varicosavirus, LBVaV, and the recently described varicosavirus, VVV, which have five ORFs [52,77]. The RNA1 segments of AAnV1, BrRV1, LoPV1, SpV1 and MelRoV1 are similar to the RNA1 of RCaVV and VVV in that they only encode the L protein, whereas a small ORF before the L gene was identified in PhPiV1 and previously in AMVV1 and LBVaV [51,52].
Among the varicosa-like viruses identified in this study, PiFleV1 is unique in terms of genome organization, since its genome is unsegmented, a characteristic that differs from the currently known bisegmented genome architecture of varicosaviruses [21]. This may be an adaptation to its gymnosperm host. Unfortunately, we were not able to assemble the complete genome of the varicosavirus associated with the gymnosperm Sciadopitys verticillata previously reported by Mushegian and colleagues [14] to support this hypothesis. Based on genome organization and phylogenetic placement, PiFleV1 appears to be the first example where rhabdoviruses with segmented and unsegmented genomes are closely related and may be classified in the same genus.

Conclusions
In summary, this study illustrates the complexity and diversity of plant rhabdoviruses genomes and demonstrates that analyzing SRA public data is a valuable tool not only to accelerate the discovery of novel viruses but, also, to gain insight into their evolution and to refine virus taxonomy. However, the inability to go back to the biological material to confirm viral genome sequences and to link the presence of the viruses to a specific host is the main drawback of the data mining approach for virus discovery. This limitation could lead to the potential misidentification of host species linked to viruses. Thus, researchers need to be cautious when analyzing SRA public data for virus discoveries.
Supplementary Materials: The following are available online at https://www.mdpi.com/artic le/10.3390/v13071304/s1. Table S1: Virus names, abbreviations and NCBI accession numbers of the plant rhabdovirus sequences used in this study. Table S2: Summary of the assembly statistics of the plant rhabdovirus sequences identified from the transcriptome data available in the NCBI database. Table S3: Characteristics of the deduced proteins encoded by each assembled rhabdovirid sequence determined by predictive algorithms. Table S4: Size of each encoded protein of every publicly available alpha-, beta and gamma nucleorhabdovirus, cytorhabdovirus, varicosavirus and dichorhavirus.