Conserved Motifs and Domains in Members of Pospiviroidae

In 1985, Keese and Symons proposed a hypothesis on the sequence and secondary structure of viroids from the family Pospiviroidae: their secondary structure can be subdivided into five structural and functional domains and “viroids have evolved by rearrangement of domains between different viroids infecting the same cell and subsequent mutations within each domain”; this article is one of the most cited in the field of viroids. Employing the pairwise alignment method used by Keese and Symons and in addition to more recent methods, we tried to reproduce the original results and extent them to further members of Pospiviroidae which were unknown in 1985. Indeed, individual members of Pospiviroidae consist of a patchwork of sequence fragments from the family but the lengths of fragments do not point to consistent points of rearrangement, which is in conflict with the original hypothesis of fixed domain borders.


Introduction
Viroids are the smallest known plant pathogens with a genome length ranging from 246 to 401 bases in size depending on the viroid species (Table S1) and variant. They solely consist of a circular, single-stranded RNA [1], which does not code for any protein [2]. Thus, for all of their biological functions-such as replication, processing, and transport [3]-they have to present sequence or structural features to exploit host proteins. For example, viroids from the family Pospiviroidae are transcribed by DNA-dependent RNA polymerase II (pol II) using the viroid RNA as a template in an asymmetric rolling circle mechanism in the nuclei of infected cells [4][5][6][7]. The thermodynamically optimal, stable structure of most members of the family Pospiviroidae consists of an unbranched series of short helices and small loops ( Figure 1); i.e., their sequence is an imperfect inverted repeat [8][9][10].
In 1985, Keese and Symons [11] proposed a model based on the secondary structure of viroids from the family Pospiviroidae: accordingly, their secondary structure can be subdivided into five structural and functional domains and "viroids have evolved by rearrangement of domains between different viroids infecting the same cell and subsequent mutations within each domain" [12]. These five domains, with precise borders [11][12][13], are as follows (Figure 1): • The terminal left (TL) domain plays an important role in the replication of Pospiviroidae as it contains the starting point for pol II [14][15][16][17]. Furthermore, two subgroups of Pospiviroidae differ by presence and absence, respectively, of two sequence motifs in the TL domain [18][19][20]: members of genus Pospiviroid and some members of Coleviroid contain a terminal conserved region (TCR; see Supplemental Figure S20), while members of genera Cocad-and Hostuviroid contain a terminal conserved hairpin (TCH; Figure S21). • The pathogenicity (P) domain is associated with symptom severity [21][22][23], but clearly not the only viroid part responsible for virulence (examples can be found in [22,[24][25][26][27]).
The P domain includes an oligopurine stretch in the upper part and a partly complementary oligopyrimidine stretch in the lower part of the secondary structure. Thus, the region shows alternative structures of low thermodynamic stability, giving rise to the name "premelting region" [28]. • The central (C) domain is the most conserved part between different viroids. It contains a loop E motif ( Figure S22b), which shows similarities to loop E of eukaryotic 5S rRNA, sarcin/ricin loop in 28S rRNA, and loop B of hairpin ribozymes [29,30]: it consists of five non-Watson-Crick basepairs and a bulged nucleotide involved in a triple pair; this unusual conformation allows for the formation of a UV-induced crosslink between two nucleotides of loop E [31][32][33]. The upper part of the C domain can be rearranged into a hairpin (HPI; Figure S22a), which is only present in thermodynamic metastable structures during replication or at biologically non-relevant temperatures [34]. Loop E and HPI are both involved in the processing of linear replication intermediates to mature circles [35][36][37][38][39]. • The variable (V) domain has the lowest sequence similarity even between closely related species [11] but contains one part of hairpin II (HPII; Figure S25), which is a metastable structural element critical for the transcription of (−)-stranded replication intermediates in pospiviroids [40,41]. The 3' part of HPII is located in the lower part of the TL domain. • The terminal right (TR) domain of genus Pospiviroid has been proposed to be involved in transport. For example, the RY motifs in the TR domain are the binding sites for viroid RNA-binding protein 1 (VirP1) [42][43][44], which is indispensable for replication and cell-to-cell transport [45,46].
The domain model was established with seven sequences from different viroid species of family Pospiviroidae [11]. The borders between these domains "were defined by sharp changes in sequence homology" [12] between the aligned subgroups of the seven viroids. However, in the original publication, a description of the used procedure was not given [11]; a further publication [12] states the use of the alignment program NUCALN [47]. Here, we describe an attempt to redo the determination of the domain boundaries for the original seven sequences as well as extend it for more recently added Pospiviroidae species-today counting 30-by the use of three different programs:

1.
NUCALN is no longer available, but its algorithm is still part of CLUSTALΩ [48] for the fast production of pairwise alignments. Thus, we modified CLUSTALΩ to output not only scores but also aligned positions from the fast alignment step (for details see Supplement and Figure S1). In the following, this modified CLUSTALΩ is called NUCALN.

2.
JALI, short for "Jumping Alignments" [49,50], determines the positions of recombination in a candidate sequence when compared to a seed alignment of related sequences. 3.
VMATCH [51] determines maximal substring matches between two sequences. NUCALN as well as JALI were only able to reproduce a few of the domain borders for certain pairwise combinations of viroid sequences; however, in general, exact positions were not deducible neither for the original viroid sequences nor for more recent ones. We conclude that Keese and Symons used some undisclosed method to determine the domain borders. Such a process-most probably guided by human intuition-is difficult to implement in a computer program. Regardless, results from JALI and VMATCH hint to past recombination events between members of Pospiviroidae. In these events, sequence fragments were exchanged between viroids, supporting the observations of [11] about the possible existence of a sequence toolbox, which allows viroids to easily evolve by recombination between different viroids infecting the same plant.          [11,52] to determine the domain boundaries. Sequence alignments were produced by MAFFT X-INS-I [53]. Consensus structures were generated by CONSTRUCT [54]. Consensus sequences and drawings were produced by R2R [55]. The color code used by R2R is explained in box (e); a co-varying position has Watson-Crick pairs that differ at both nucleotide positions among sequences; if only one position differs, the occurrence is classified as a compatible mutation [56]. Note that no basepair annotated as "covarying" or "compatible" is statistically significant according to the analysis by R-SCAPE [57]. (a) Domain names are given at top: TR, terminal right; P, pathogenicity; C, central; V, variable; TL, terminal left. The P and V domains given by [52] are marked by black lines and gray background; the "consistent" domains are marked by green lines (see Section 3.2). For viroid names, see Table S1. E, loop E ( Figure S22b

Materials and Methods
Software tools used to produce the results are listed in Table 1. GLE was used to visualize the results of JALI and VMATCH.

Consensus Sequence Compilation
Viroid reference sequences from Pospiviroidae were downloaded from the Viral Genomes database [68,69]; further members of each viroid were found via BLASTN in NCBI's "standard databases". A preliminary alignment of sequences from a single viroid species was produced via MAFFT L-INS-I. Based on this alignment, incomplete sequences were removed; if necessary, sequences were reverse complemented to (+)-strands, the start and end of sequences were adjusted to standard positions, and redundant sequences were removed. Then, a final alignment was produced via MAFFT X-INS-I with options --maxiterate 1000 and --retree 100. MAFFT X-INS-I takes into account structural information for the alignment; the iterative refinement and the guide tree prediction are repeated until no more improvement in the final score is made or the number of cycles reaches 1000 and 100, respectively. For each of the final alignments, a consensus sequence and structure were calculated via CONSTRUCT using RNAFOLD with options -p -c (partition folding, circular RNA). Drawings of the consensus structures were performed via R2R.

NUCALN
The algorithm, implemented in NUCALN [47], is still part of CLUSTALΩ for the fast production of pairwise alignments. Thus, we modified CLUSTALΩ to output not only the scores but also the k-tuple matches from the fast alignment step. For details, see Supplementary Materials. NUCALN searches for sequence stretches of at least k nucleotides in length that are identical between sequences x and y; these stretches are called k-tuple matches. In a dotplot of sequences x and y, the k-tuple matches are diagonals of at least k dots ( Figure S1). That is, each k-tuple match lies on a diagonal d m containing all nucleotide pairs x i and y j with m = i − j. A diagonal is significant if its number of k-tuple matches is above the mean number of k-tuple matches for all diagonals with matches. The k-tuple matches that occur in a window of length w around a significant diagonal lie in window space. Finally, from the k-tuple matches within window space, an alignment is produced via standard dynamic programming with scores +1 for each matching nucleotide pair of the k-tuple matches and −g for gaps in between the matches independent of the gap length. We slightly varied the parameters 2 ≤ k ≤ 6 and 1 ≤ g ≤ 4 around those used in [12] (k = 4, g = 4); these variations did not alter the results. The window length w = 100 used by [12] is clearly too large, because with members of Pospiviroidae, most k-tuple matches are quite close to the main diagonal; thus, we chose w = 25 to reduce the runtime of NUCALN and reduce the number of uninformative matches.

JALI
The parameters of JALI were optimized to closely reproduce the domain boundaries as given by Keese and Symons [52]. First, a sequence file with all seven sequences (PSTVd, GENBANK AC NC_002030 or PTVA; MPVd, TPM; CEV-A, CEV; TASVd, TASCG; CSVd, V01107; CCCVd, CCC; HSVd, X00009) given in [11,52] was assembled. A PERL script then selected one of the sequences in a loop as the candidate and then repeatedly called JALI to align the candidate sequence to the seed alignment using a range of scoring parameters (gap open −70 ≤ o ≤ −40, gap extension −6 ≤ e ≤ −2, jump −125 ≤ j ≤ −45); for match scoring, we used the RIBOSUM weights w [70] multiplied by 10, which gives a range w(A|C) = 6 to w(A|A) = 47. The positions of jumps were recorded and compared to the domain borders as given in [52] to find optimal parameter combinations. We finally used o = −60, e = −2, and j = −95. A toy example with JALI is shown in Figure S2; examples of JALI's output with different parameters are shown in Figure S3.

VMATCH
VMATCH is a tool that finds, for example, long matches between two sequences or repeats in a single sequence. We used the following options for a comparison of the two sequences: -d: report (only) direct matches; -l 50: report matches of length ≥ 50; -e 20: edit distance-only report matches that contain at most 20 mismatches, insertions or deletions; -leastscore 50: only report matches if their score is ≥ 50; -seedlength 4: length of exact seeds.

Pairwise Sequence Identity
We calculated pairwise sequence identity (PSI; in percent) for two aligned sequences by PSI = number of nucleotide matches min(length of non-aligned sequences) · 100 ; that is, the number of nucleotides was normalized to the length of the shorter of both sequences without gaps. There are other methods to calculate PSI, that all have some disadvantages. For example, normalization to the average length of both sequences, as used in [52], assigns artifactually low PSI values to an alignment of a short and a long sequence. This is of relevance because members of Pospiviroidae have sequence lengths in a wide range from 250 to 380 nt.
To determine the PSI of a domain X with sequence ranges i . . . j and k . . . j for viroid v, we extracted from a pairwise alignment of viroids v and w the sequence blocks i . . . j and k . . . j, calculated PSI for the individual blocks, and then averaged both values. If the domain ranges of both viroids differ from each other in the alignment, PSI X,v,w will differ from PSI X,w,v .
Average pairwise sequence identity (APSI; in percent) is calculated for N aligned sequences by

Can We Retrace the Results of Keese and Symons?
Keese and Symons [11,52] based their domain hypothesis on pairwise alignments of single sequences from seven viroid species, namely PSTVd, MPVd, CEVd, TASVd, CSVd (all belonging to the genus Pospiviroid), CCCVd (genus Cocadviroid), and HSVd (genus Hostuviroid; for full viroid names and genus affiliation see Table S1). Hence, we started with these sequences and aligned them using NUCALN with parameters as mentioned in [12].
In principle, one should compare all 21 pairwise alignments of the seven viroid sequences. As examples, pairwise alignments of PSTVd to the other six viroids are shown in Supplemental Figure S4. Each alignment can also be visualized as a dotplot; an example is shown in Figure 2a, and the further dotplots of PSTVd with the other six viroid sequences are shown in Figure S5. To ease this comparison, we show in Figure 2b an overlay of all optimal kmers from the 21 pairwise alignments produced with NUCALN on the basis of a MAFFT X-INS-I alignment. Figure 2d differs from Figure 2b by the use of consensus sequences; the latter should avoid any misjudgment due to a pathological choice of sequences. The dotplots of Figure 2b,d, however, are very similar to each other and their small differences are not of importance. The following results and discussions are based on pairwise sequence identities of these viroids; for an overview, we already refer to Table 2. Most of the conclusions presented in [11,12,52] are comprehensible from these dotplot overlays (Figure 2b,d). These conclusions are given verbatim below; changes due to more recent nomenclature are given in square brackets: • The V domain shows the "greatest variability between closely related viroids" [12] and "is the most variable region .  [11]. Indeed, only a few diagonals show up in the dotplots (Figure 2) in the V domain; minor similarities in the lower part of the V domain are due to the 3' part of HPII. The average pairwise sequence identity (APSI) in a MAFFT X-INS-I alignment of consensus sequences of species used in [11] is 64%, while the APSI of the V domain in the same alignment is only around 49% (Table S4b). • "In pairwise sequence comparisons of viroids containing highly homologous C domains . . . there is significantly less sequence homology in the P and V regions, occurring about 5-9 residues 5' and about 7-15 residues 3' of the inverted repeat" [11]. The number of kmer matches in the C domain regions is higher than in P and V regions; compare the dot colors of these domains in the overlay dotplots (Figure 2b,d).
In summary, the similarities and relationships between the members of Pospiviroidae are comprehensible. However, we were not able to exactly reproduce the PSI and APSI values of Keese and Symons. This is not simply due to different equations (see Section 2.5) used here and in [12]: we tried different approaches to calculate PSI, as well as different NUCALN parameters to create the basic alignments, all without reaching coincident results. Further possibilities, which we did not explore, include the alignments of domains instead of full-length sequences or the manual optimization of non-aligned regions in between the aligned kmers of NUCALN. Anyway, our PSI and APSI values are also in support of the above conclusions. More critical than the exact reproduction of PSI and APSI values, however, is the missing consistency of domain borders in our alignments of different pospiviroids and a missing idea of how to define these borders.

Consistent Domain Borders of Pospiviroid Members
Without having a proper definition for a domain border, we can still improve the position of borders given in [11,12] to be consistent between species of Pospiviroids. That is, the domain borders should align between species in an alignment of all Pospiviroid species ( Figure S7). As a starting point, we used the original domain borders of PSTVd ( Figure S7, top bars for PSTVd and background shading of all sequences). In the aligned sequences of most species, gaps are located close to the original borders. We positioned the new borders ( Figure 1, green lines; Figure S7, top) into these gap positions, which was equivalent to a shift of original borders by less than six nucleotides. Table S5 summarizes the positions of the now consistent domain borders and of the sequence stretches around the borders in consensus sequences and in more than 90% of individual sequences, which should help identify domain borders in new sequences.
We selected the new domain borders only for Pospiviroid members. The inclusion of sequences from the other genera of Pospiviroidae led to alignments with a high number of gaps because Hostu-and Cocadviroid sequences are a lot shorter than Pospiviroid sequences and Apscaviroid sequences have a low similarity to Pospiviroid sequences. This is also obvious from dotplots; f. e., see the dotplots of PSTVd vs. CCCVd ( Figure S5e) and PSTVd vs. HSVd ( Figure S5f). The seven Coleviroid members consist of combinations of six different structural elements with a common CCR ( Figure S15). The low number of sequences for the structural elements (two to three) does not allow us to conclude on domains.

Can We Extend the Domain Hypothesis to the Other Genera of Pospiviroidae?
With descriptions of further viroid species, beyond those used to establish the domain model [11], the model was extended to these new sequences. The additional species were CTiVd [71] from genus Cocadviroid and ASSVd [13], GYSVd-1 [13], and GYSVd-2 [72] from genus Apscaviroid (Table S1). Today, ten species belong to the genus Apscaviroid and a further ten have been preliminary assigned to Apscaviroid. An overlay of all optimal kmers from the 20 × 19/2 = 190 pairwise alignments produced with NUCALN on the basis of a MAFFT X-INS-I alignment is shown in Figure 3; alignment, consensus sequences, and structures of the ten officially accepted apscaviroids are shown in Figures S10 and S11, respectively.
In the dotplot (Figure 3), only two highly conserved motifs are clearly detectable. These are the TCR ( Figure S10b) and the CCR including HPI (Figures S23 and S24). An oligopurine and the partially complementary oligopyrimidine stretch in the P domain [13,18] are obvious in an alignment ( Figure S10) but only marginally in the dotplot. The absence of further motifs, such as HPII and RY in Pospiviroid members, makes it difficult to deduce any domain borders.  (Table S1). Domain borders of ASSVd [13] are marked by black lines. Blue bars and labels mark TCR ( Figure S10b), HPI ( Figure S23), and LCCR ( Figure S24).
Anyway, in most if not all publications with sequence descriptions of apscaviroids, it was mentioned that Apscaviroid sequences are composed of sequence stretches with high similarity to other members of Pospiviroidae. Given the correctness of the domain hypothesis, the borders of such a sequence stretches should coincide with the domain borders. To confirm this, we used the programs JALI and VMATCH. For example, in the publication of the first sequence of CDVd (formerly called citrus viroid IIIA, CVdIIIA) [73], the authors state: "the extended upper CCR [of CDVd] is most related to ASSVd . . . In contrast, the sequence of the extended lower CCR is most related to PBCVd . . . Thus, in the case of CVdIIIA, it appears that the CCR sequences were exchanged as two individual RNA strands from two different viroids." Indeed, JALI aligns with the TL, upper P, and upper C domain of CDVd to ASSVd and the TR and lower C domain to PBCVd (see black line representing aligned segments of CDVd in Figure 4a). In addition, VMATCH finds the longest and most significant matches in the TL and upper C domain between ASSVd and CDVd and in the lower C domain between PBCVd and CDVd (see red and blue lines in Figure 4b,c). Note, however, that none of the recombination points predicted by JALI exactly coincide with the published domain borders of ASSVd [13] and none of the sequence matches found by VMATCH coincide with domains of ASSVd .  TL  P  C  V  TR  V  C  P  TL   LCCR  TCR  UCCR  TCR   HPI  HPI   Viroids   50  100  150  200  250  300   The lack of match between recombination points and domain borders generally holds true. Further examples by JALI are shown in Figures S13, S14 and S16 for all members of genera Pospiviroid, Apscaviroid, and Coleviroid, respectively. Another example is CBCVd, a member of genus Cocadviroid infecting Citrus as well as hop; already Puchta et al. [74] described that CBCVd shares sequence stretches with members of genera Pospiviroid and Hostuviroid. Indeed, the sequence of CBCVd aligns best to upper TL, P, and V domains of HSVd, to C and lower V domain of HLVd, to TR domain of CEVd, and to lower TL domain of CCCVd ( Figure 5). Note that CBCVd and CEVd have only one RY motif, while most other members of Pospiviroid have two RY motifs (see alignment in Figure S17). This similarity between CBCVd and CEVd is in support of recombination between an CBCVd ancestor and an CEVd-like sequence .  TL  P  C  V  TR  V  C  P  TL   LCCR  UCCR  HPI  HPI   TL  P  C  V  TR  V  C  P  TL   UCCR  TCR  HPII  HPII  HPI HPI  RY2  RY2  RY1RY1   Viroids   50 100 DLVd, a member of genus Hostuviroid, might be a recombinant between different viroid genera [75], in this respect comparable to CBCVd (see Figures S18 and S19). DLVd's TL domain is similar to that of PCFVd from genus Pospiviroid, includes a TCR like other members of Pospiviroid but not a TCH as HSVd, the type strain of Hostuviroid.
Similarly to the missing match of recombination points and domain borders, points of duplication in natural variants of CCCVd [61,62] and CEVd [58][59][60] do not generally coincide with domain borders. For a few examples, see flags in the respective structures of Figure 1; the flags enclose sequence stretches that occur duplicated in longer-than-unit length sequence variants. The exception, however, is the sequence duplication of a CCCVd variant (LOCUS CCC1SLOW), published in 1982 [61], that starts and ends exactly at the V/TR border.

Conclusions
Our analysis with the software tools NUCALN, MAFFT, JALI, and VMATCH of evolutionary relations between viroids of family Pospiviroidae support the original finding by Keese and Symons in 1985 that these viroids have evolved by the rearrangement of sequence stretches between different viroids infecting the same cell and subsequent mutations. Even recombination between viroids from the different genera of Pospiviroidae are likely. We were, however, not able to reproduce the exact borders of these recombined stretches and thus have doubts regarding the validity of the strict domain model as defined by Keese and Symons. For all ten species of Pospiviroid, we were able to improve the original domain borders to be consistent between species with respect to a MAFFT alignment. Nevertheless, the conserved sequence and structure motifs-especially those of Pospiviroid members-are sufficient to subdivide their rod-shaped structure into biologically functional sections.  Figure S1. Illustration of parameters in the Wilbur-Lipman algorithm. Figure S2. Examples for alignment between a candidate sequence and a seed alignment using JALI. Figure S3. JALI output with CSVd as the candidate sequence and all other viroids presented in Keese and Symons (1985) as the seed alignment. Figure S4. Pairwise alignments of PSTVd to the other sequences used by Keese and Symons (1985) with NUCALN. Figure S5. Dotplots between the sequences used by Keese and Symons (1985). Figure S6. Dotplots between consensus sequences of the species used by Keese and Symons (1985). Figure S7. MAFFT alignment between consensus sequences of Pospiviroid members. Figure S8. MAFFT alignment between sequences used in Keese and Symons (1987). Figure S9. Overlay of dotplots from NUCALN alignments for all Pospiviroid members. Figure S10. MAFFT alignment between consensus sequences of Apscaviroid members. Figure S11. Consensus sequences and secondary structures of Apscaviroid members. Figure S12. Alignment of CDVd, ASSVd, and PBCVd. Figure S13.
MAFFT alignment between the consensus sequences of Coleviroid members. Figure S16. Similarity among Coleviroid members. Figure S17. MAFFT alignment between members of the genera Cocad-, Pospi-, and Hostuviroid. Figure S18. MAFFT alignment of DLVd, HSVd, and selected members of Pospiviroid. Figure S19. Similarity among DLVd, HSVd, and selected members of Pospiviroid. Figure S20. Sequence logo of "Terminal Conserved Region" (TCR). Figure S21. Sequence logo of "Terminal Conserved Hairpin" (TCH). Figure S22. Part of the central domain including hairpin I (HPI) based on an alignment of 899 Pospiviroid sequences. Figure S23. Part of the central domain including hairpin I (HPI) based on an alignment of 612 Apscaviroid sequences. Figure S24. LCCR of Apscaviroid members. Figure S25. Hairpin II (HPII) of Pospiviroid sequences. Figure S26. RY motif of Pospiviroid sequences. Supplementary tables: Table S1. Classification of viroids, viroid names, and abbreviations. Table S2. Pairwise sequence identity of the sequences used by Keese and Symons (1985) after alignment with NUCALN (k = 4 and g = 4). Table S3. Pairwise sequence identity of the sequences used by Keese and Symons (1985) after alignment with NUCALN with parameters k = 3 and g = 3. Table S4. Average pairwise sequence identity of consensus sequences of Pospiviroid species after alignment with MAFFT X-INS-I. Table S5. Consistent domain borders of Pospiviroid members. Table S6. Nomenclature for incompletely specified bases.
Author Contributions: G.S. conceived the study and wrote the manuscript; K.-P.W. and G.S. analyzed the data. All authors have read and agreed to the published version of the manuscript.