A Mathematical Analysis of HDV Genotypes: From Molecules to Cells

Hepatitis D virus ( HDV) is classified according to eight genotypes. The various genotypes are included in the HDVdb database, where each HDV sequence is specified by its genotype. In this contribution, a mathematical analysis is performed on RNA sequences in HDVdb. The RNA folding predicted structures of the Genbank HDV genome sequences in HDVdb are classified according to their coarse-grain tree-graph representation. The analysis allows discarding in a simple and efficient way the vast majority of the sequences that exhibit a rod-like structure, which is important for the virus replication, to attempt to discover other biological functions by structure consideration. After the filtering, there remain only a small number of sequences that can be checked for their additional stem-loops besides the main one that is known to be responsible for virus replication. It is found that a few sequences contain an additional stem-loop that is responsible for RNA editing or other possible functions. These few sequences are grouped into two main classes, one that is well-known experimentally belonging to genotype 3 for replication reminiscent of the editing mechanism in HDV genotype 3 exists in HDV genotype 7 has not been explored before and is predicted by eigenvalue analysis. Finally, when comparing native and shuffled sequences, it is shown that HDV sequences belonging to all genotypes are accentuated in their mutational robustness and thermodynamic stability as compared to other viruses that were subjected to such an analysis.


Introduction
RNA secondary structures assume functional roles during the virus life cycle and are therefore a topic of considerable interest [1,2]. Over the years, specific functions of RNA structure motifs have been examined in many viruses (for example, in the hepatitis C virus (HCV) [3][4][5]) by combining computational and experimental approaches. In recent years, the advent of high-throughput structure-probing methods has opened up exciting opportunities in elucidating the viral RNA structure repertoire at a large scale [6]. A large percentage of viral RNA motifs tend to possess linear secondary structures similar to the ones depicted in [7], which are often stem-loop structures designated by SL for their identification. Stem-loop structures can be found in a multitude of viruses, e.g., [7], and in general, they are important from the evolutionary perspective [8]. In viruses, stem-loop structural motifs have been investigated in our previous article [9]. This contribution is a continuation of [9] that specifically considers hepatitis delta virus (HDV), which has the smallest human viral genome and uses the envelope of hepatitis B virus (HBV) to generate infectious particles. Two main forms of HDV infection have been described: (1) co-infection together with HBV, with a high rate of viral clearance in adults clinically similar to HBV mono-infection [10], or (2) super-infection in the presence of a pre-existing HBV infection. The latter results in a persistent chronic HDV infection in 70-90% of the cases and early on was shown to be is associated with a high risk to develop cirrhosis and primary liver cancer (HCC) [11]. Since the discovery of this virus, eight genotypes with distinct geographical and ethnic regions and at least nine subtypes have been defined [12]. The lowest inter-genotype divergence of ≥10% was reported between HDV-5 and HDV-2, whereas the highest inter genotype difference was estimated for HDV-3 [12], which is predominant in South America. For certain genotypes, differences in clinical outcomes have been observed [13]. Genotype 3 is associated with fulminant hepatitis epidemics with high lethality rates [14].
An important issue in modeling and analyzing RNAs is the representation of their secondary structure, desirably in a simplified and yet useful manner. Several approaches have been devised, among which some major pioneering ones are the full graph representation in which each nucleotide is a node [15], a coarse-grain tree-graph representation in which each motif is a node [16], and a full tree forming a homeomorphically irreducible tree [17]. All of the aforementioned representations have been implemented in the Vienna RNA package [18]. The full graph representation is equivalent to the dot-bracket representation in the Vienna RNA package [18][19][20] and the ct file in mfold/UNAFold [21][22][23].
In the context of RNA secondary structure analysis, coarse-grain tree-graphs have found a variety of uses [16,[24][25][26][27][28]. It is also possible to generalize the coarse-grain representations to abstract shapes [29]. In [16,24], a coarse-grain representation of the RNA secondary structure was proposed, which was later named Shapiro's representation in the Vienna RNA package. In [25], topological indices were first suggested to be used for analyzing coarse-grain tree-graphs. In [26,27], it was found that the second smallest eigenvalue of the Laplacian matrix is able to provide a similarity measure for differentiating between various RNA tree-graph topologies. The smallest eigenvalue of the Laplacian matrix is identically zero and then the second smallest eigenvalue, which is called algebraic connectivity, provides a measure of how much the tree-graph is linear (a path) or compact (a star) as illustrated in [27]. This concept can be applied when filtering candidates in the process of RNA deleterious mutation prediction, which was used in the relevant prediction software RNAmute and its extension [28,[30][31]. Aside from design applications that have to do with conformational switching and multistable RNAs [32], it will also be used here to detect large conformational switches that were reported in [33] by applying a similar idea. In passing, it is worthwhile noting that conformational switching is present in both HCV [34] and HIV [35], as examples, in structural elements of different length scales and with different mechanisms than the relevant one in HDV. Reverting to [26,27], seminal theorems by Fiedler and Merris [36,37] were shown to be applicable for the examination of how the coarse-grain tree-graph representing the RNA secondary structure is shaped. Following Fiedler's work on the algebraic connectivity in general graphs, Merris has shown how tree graphs can be ordered by their algebraic connectivity [38], alongside a wider perspective of the Laplacian spectrum of a graph [39,40].
Following [9], in which the dataset was taken from [41], herein, the initial plan was to apply a similar methodology to perform structural classification in HDVdb [42]. Unlike in [9], the sequences in HDVdb are no longer small RNAs, and there are only 512 sequences. Nevertheless, when RNA folding prediction methods are applied on these sequences, stem-loop unbranched rod structures emerge that are mostly completely linear and can be represented by a tree-graph that is a path on n vertices. This prompted us to specifically examine the well-known conformation switch in HDV that was mentioned above [33] and check whether our methodology that was outlined in [9] can detect more HDV genotypes other than the peculiar genotype 3 that is associated with such a conformational switch. Because all genotypes are represented in HDVdb [42], the topic of genotypes was highlighted in our analysis. Our goal was to filter out most sequences in HDVdb that only exhibit an unbranched rod structure in their folding prediction by energy minimization and only collect the sequences that exhibit a double-hairpin branched structure, classifying them according to their genotypes. We performed the filtering using an approach that will be described in detail, also addressing the advantages of our approach, in the Results and Discussion sections. In genotype 3, the unbranched rod structure displayed in Figure  1 (containing the essentials of Figure 2A of [43]) is responsible for virus replication, and the double-hairpin branched structure displayed in Figure 1 (containing the essentials of Figure 2B of [43]), is responsible for RNA editing. With the issue of genotypes as concentration, aside from examining RNA structures at the level of molecules, we also address a yet unexplored topic of how genotypes may affect HDV viral kinetics at the level of cells and provide a future perspective. Finally, as in [9], we also examine the mutational robustness and thermodynamic stability of HDVdb sequences/structures. We demonstrate for completeness that these sequences are mutationally robust and thermodynamically stable, as expected, in comparison to their corresponding shuffled sequences and their predicted structures.

Materials and Methods
The mathematical analysis consists of two main components that have been put forth in [9]. The first component relies on filtering rod-like structures that are dominant in HDV because they correspond to virus replication. These rod-like structures can be distinguished by their unique coarse-grain tree-graph representation, which is a path on n vertices. Section 2.1 describes the filtering method that utilizes a formula for the second-lowest eigenvalue of the Laplacian matrix corresponding to a coarse-grain tree-graph representation that is a path on n vertices. Section 2.2 describes the second component that involves an analysis of mutational robustness and thermodynamic stability.

Defining the Laplacian Matrix of a Tree-Graph and Calculating Its Second Lowest Eigenvalue for a Path
The coarse-grain tree-graph representation of an RNA secondary structure, also known as the Shapiro representation [16], enables an initial analysis of the collection of RNA structures based on their constituent motifs and their compactness. The tree-graph can then be represented by its corresponding Laplacian matrix.
Let T = (V, E) be a tree with vertex set V = {v 1 , v 2 , …, v n } and edge set E. Let us denote by d(v) the degree of v where v ∈ V is a vertex of T. The Laplacian matrix of T is L(T) = (l ij ), where: The eigenvalues of the Laplacian matrix are independent of the choice of labeling for the nodes in the tree-graph T, which only amounts to interchanges of the rows and columns. For example, with the orderly labeling of the linear tree-graph containing four nodes as in Figure  2, the Laplacian matrix L(T) becomes: Zakh et al. Page 4 The smallest eigenvalue of the Laplacian matrix is identically zero. The second smallest eigenvalue, which is called the algebraic connectivity [36], provides a measure of how much the tree-graph is linear (a path) or compact (a star). In the following, we will derive a formula for the second smallest eigenvalue of a path in an alternative way to the one used in [36], directly utilizing the tridiagonal structure of the Laplacian matrix.
The Laplacian matrix is a slight deviation from a tridiagonal Toeplitz matrix, which stems from the fact that tree-graph extreme vertices are bounded to a single neighbor, rather than two. For example, the tridiagonal Toeplitz matrix corresponding to L 4 above is: As noted in [9], calculating the eigenvalues of the Laplacian matrix could be approached in principle using a methodology similar to that of calculating the eigenvalues of a tridiagonal Toeplitz matrix. Calculating the eigenvalues of such a matrix (such as L' 4 ) can be simplified by using a theorem whereby if L' = h(M) and the eigenvalues of M are calculated and denoted as λ 1 , λ 2 , …, λ n , then the eigenvalues of L' are, respectively, h(λ 1 ), h(λ 2 ), …, h(λ n ). In a generalized form for the matrix L' n , we obtain: where M n is an nxn tridiagonal matrix of the following form: Because the eigenvalues of I n are trivial, finding the eigenvalues of a tridiagonal Toeplitz matrix amounts to finding those of M n . This can be completed in at least two ways. The first, which is longer, is to use elementary properties of determinants to directly derive the characteristic polynomial and then find its roots. As worked out in [44], denoting the characteristic polynomial of the matrix M n by φ n (x) and using the transformation x = 2•cosΘ, the characteristic polynomial becomes, after a considerable derivation: ϕ n (2 • cosΘ) = sin(n + 1)Θ sinΘ (6) Zakh et al. Page 5 which is solved at Θ = kπ n + 1 (k = 1, …, n), yielding the eigenvalues: λ k (M n ) = 2 • cos kπ n + 1 (k = 1, …, n) (7) The second way, which is shorter, is to note that the expansion of the characteristic polynomial satisfies the three-point recurrence relationship that matches the recursive formula for the Chebyshev polynomials of the first kind; see also [45]: where T n (x) is the Chebyshev polynomial of the first kind of order n, e.g., T 1 (x) = x and T 2 (x)=x 2 −1. For x=cosΘ, the following relationship holds [46]: Thus, we obtain that the zeros of T n (x) are the roots given by Equation (7). Therefore, using Equation (4), the eigenvalues of L′ n are given by: Looking back at the Laplacian matrix for a path on n vertices in our application: which can also be written as: where P n is similar to M n of Equation (5) P n is a slight variation on M n , and using a similar procedure, it is found (e.g., see [44]) that its eigenvalues are given as the roots of: (14) which is solved at Θ = kπ n (k = 0, 1, …, n − 1), hence: From Equations (12) and (15), it follows that the eigenvalues of L n are: which leads to the trivial eigenvalue of zero for k = 0, and the smallest second eigenvalue for k=1 that is given by: The second smallest eigenvalue of the Laplacian matrix is called the algebraic connectivity [36] of T and labeled as a(T). Some of the properties of a(T) that concern the application presented here are mentioned in the Appendix of [9], in addition to the illustrative calculation of a(T) for the RNA secondary structure example shown in Figure 2.
Note that by convention of the choice of tree-graph representation, those loops with single isolated nucleotides are not accounted for as nodes, but the 5′-3′ ends are counted as a node. In the case of a star of four vertices, for example, a(T) = 1.0, which is the upper bound for the algebraic connectivity. A star applies for a tree-graph possessing three vertices or more (n ≥ 3), and the algebraic connectivity of a star is always unity [40]. The algebraic connectivity a(T) is characterized by some special properties described in the Appendix of [9] that are advantageous for the RNA secondary structure application that is presented here.

Mutational Robustness and Thermodynamic Stability
For quantitatively measuring mutational robustness, the neutrality η is calculated. Given an RNA sequence of length N, the neutrality is calculated by: 18) where d is the base-pair distance between secondary structure of the original sequence and secondary structure of the mutant, averaged over all 3N one-mutant neighbors. The base-pair distance available in the Vienna RNA package is used to calculate the distance between two RNA secondary structures. The RNA secondary structures in this study were predicted by energy minimization approach [18,21] using RNAfold available in the Vienna RNA package [18][19][20], noting that similar predictions can be completed with mfold/UNAFold [21][22][23].

Eigenvalue Analysis
As was conceptualized in [9], we start with an eigenvalue analysis of the HDV genome sequences that are available in [42]. Despite their length of several hundred nucleotides, the HDV genome sequences are known to be well predicted by energy minimization methods, tending to have linear rod-like structures. It is worthwhile noting that the genomes are circular, and because of the antigenome concept, a reverse complement should be taken before inserting the sequences as input to energy minimization software that relies on energy parameters [48] such as mfold (as used in the past in [43]) or RNAfold, choosing the circular folding option. MPGAfold [33] can also be used, but it is more involved to apply. Having experimented with some sequences that are relevant to the HDV editing that takes place in genotype 3 from patients in South America, the branched double-hairpin structures responsible for editing are found in mfold's suboptimal solutions as depicted in Figure 1 (containing the essentials of Figure 2B of [43]). There is a slight difference between the mfold and RNAfold solutions that worked to our advantage. While the branched double-hairpin structures are well captured in the manner that suboptimal solutions were devised in mfold [49], contrary to the detection of conformational switches by mutations application in [50], it would be more difficult to view them in the way that suboptimal solutions were devised in RNAsubopt [51]. Nevertheless, the optimal solutions of RNAfold with the structure ensemble are slightly more sensitive to deviations from the unbranched structure. Thus, while it is possible to extend our simulations to consider the first few suboptimal solutions of mfold, we found that it is enough, for a convincing sample of sequences that we experimented with, to only consider the optimal solution of RNAfold as a way to detect branched double-hairpin structures successfully and then double-check the relevant sequences by examining their suboptimal solutions with mfold (in some cases, the branched double-hairpin structure will already appear in the optimal solution of mfold).
The following table reports an eigenvalue analysis of all HDV genome sequences in [42] by calculating their second-smallest eigenvalue of the Laplacian matrix corresponding to the tree-graph representation of their folding prediction as illustrated in Figure 2. As explained above, for the folding prediction, we used RNAfold with only the optimal solution considered. The code for calculating the eigenvalues was written in Java and is available at https://github.com/ChurkinAlex/RNAStructureEigenvalueCalculator (accessed on July 3, 2021).
This will further be analyzed in Section 4 with respect to the formula a(T) = 2(1 − cos(π/n)) in Equation (17) for a linear tree-graph representation of the RNA secondary structure to show the tendency of HDV viruses to possess unbranched (rod-like) stem-loop structures represented by a path on n vertices. From the biological standpoint, it shows that the dominant structure is the unbranched structure responsible for virus replication. It should be noted that the number of branched structured sequences could have been calculated by alternative approaches such as going through each one of the vertices in the tree-graph and checking whether there is an external loop besides the ones on the two opposite ends. This is more cumbersome as compared to the mathematical approach as described above with using Equation (17), simply plugging the number of vertices n in the equation and checking whether the resultant algebraic connectivity is equal to the second smallest eigenvalue of the Laplacian matrices corresponding to the tree-graph representation obtained by our Java code. The running time to calculate the eigenvalues of the Laplacian matrices is notedly very small because the tree-graph vertices represent RNA secondary structure motifs (e.g., bulges and loops), and therefore the size of all Laplacian matrices is smaller than 90 × 90 according to the eigenvalue interval at the top row of Table 1. Thus, the mathematical approach, in this case, is both more elegant and efficient as compared to an alternative computational approach that is based on base-pairing or loop-type identification.
In examining the second column of Table 1 that lists the eigenvalue interval values themselves, one cannot assert that one particular HDV genotype is more remote or peculiar than the other HDV genotypes. However, the second eigenvalue of the Laplacian matrix is just a single similarity measure, and tree edit distances are more informative for this purpose. Thus, we computed pairwise tree edit distances using RNAdistance in the Vienna RNA package to perform an analysis considering a few representatives of each HDV genotype in order to assess the similarity between HDV genotypes. The results are displayed in Table 2. While the results indicate that there is no particular genotype more distant from other genotypes, which is in line with the eigenvalue analysis, the tree edit distance is a more comprehensive similarity measure for such purposes as compared to the second eigenvalue of the Laplacian matrix, which is not a distance.
Although the second eigenvalue of the Laplacian matrix is not a distance, it is still a useful similarity measure. We observe that for the HDV viral genomes circular folding predictions that are represented by the Shapiro coarse-grain tree-graph representation, the values in Table 1 are relatively very small compared to the values calculated for other RNA secondary structures found in nature (e.g., [26,27]), and the properties of the algebraic connectivity listed in the Appendix of [9] can be used in a beneficial manner.

A Directed Structure-Based Search
Having observed in Table 1 that most of the HDV sequences possess unbranched rod-like structures (64% in total) that can be filtered out, we remain with a smaller pool of branched structured sequences. Within them, there is even a smaller pool of branched multi-hairpin structured sequences and a much smaller pool of branched multi-hairpin structure sequences in which the SL1 upper-left hairpin of Figure 1B (depicting the essentials drawn in Figure  2B of [43]) contains the sub-sequence "GAAC" in the external loop of the SL1 hairpin.
We chose to concentrate on this sub-sequence because it was shown to be important for RNA editing (e.g., [43]), and it is known that the GAAC tetraloop is a frequent sequence in the GANC family that has function capabilities. We find that only a few structures contain this hairpin composition. The structures of genotype 3 from the Amazon rainforest in South America (Peru, Bolivia, Brazil, Ecuador, Venezuela) contain it, validating our method, as well as two outliers that can be easily discarded as such. However, we noticed another group of structures that contain it belonging to genotype 7 from Cameroon, which is further discussed in Section 4. Figure 3 illustrates how the conformational switching from an unbranched structure to a branched structure in genotype 7 displayed in Figure 3A resembles and differs from the known conformational switching that takes place in genotype 3 displayed in Figure 3B. The SL1-like hairpin in Figure 3 contains the sub-sequence "GAAC" in its external loop, and it is also noted that it appears in the optimal solution of mfold, while the SL1 hairpin in Figure 3 that is exactly composed of the sub-sequence "GAAC" in its external loop appears in the second suboptimal solution of mfold, raising the possibility that genotype 7 is more susceptible to conformational switching than genotype 3.
It should also be noted that concerning the sub-sequence required for RNA editing, we have tried searching for the five nucleotides "AUAGU" (representing the AMBER/W site for genotype 3 in South America) that are mentioned in the caption of Figure 1 and are colored in the figure itself over the entire Genbank HDV genome sequences file of HDVdb. We found that they are most prevalent in genotype 7 in Cameroon with 26 out of 69 hits (38%), with genotype 3 in South America coming second with 10 out of 69 hits (14%), and with the rest of the genotypes in other geographical locations possessing much lower hits. However, because of the antigenome concept, finding the reverse complement of these five nucleotides near SL1 would have shown a precise similarity to the editing mechanism in genotype 3. In our case, alternative scenarios or mechanisms that are non-explored yet and utilize conformational switching are also possible.

Mutational Robustness and Thermodynamic Stability
Finally, to conclude the analysis of HDV sequences, it is worthwhile verifying that the HDV native sequences are more mutationally robust and more thermodynamically stable than their corresponding shuffled HDV sequences as expected. Equation (18) is used to calculate the neutrality, and in all of the calculations, the RNAfold of the Vienna RNA Package 2.0 [20] is used for RNA structure prediction by energy minimization. Figure 4 depicts the neutrality distribution of native and shuffled HDV sequences for all structures in the dataset, having used Python's multiprocessing module ("process" class) to speed up the calculation. Shuffling is performed by using shuffleseq from EMBOSS with dinucleotide shuffling. Figure

Discussion and Conclusions
The mathematical methods that are used herein to analyze HDV genotypes are conceptually within a similar framework as the ones we have used in [9], taking advantage of the properties of the Laplacian matrix and its second smallest eigenvalue to indicate how the coarse-grain representation of the RNA secondary structure is assumed in numerous scenarios but modified to our HDV application that aims to detect conformational switching. This also applies to the computational methods of using RNA folding prediction by energy minimization, by which we also calculate neutrality and MFE distributions. The results obtained in the previous section can now be analyzed, and as a consequence, a hypothesis can be formulated that suggests that the specific mechanism by which RNA editing occurs in HDV genotype 3 (as illustrated in detail, for example, in [52]) may not be limited to genotype 3 from the Amazon rainforest that originated this research [53]. By our analysis with HDVdb [42], we may have possibly found traces for it as well in genotype 7 from Cameroon.
From Table 1, one can notice that most of the HDV genome sequences in [42] (64%) are predicted to fold to an unbranched (rod-like) stem-loop structure represented by a path on n vertices because this stem-loop structure is responsible for the virus replication. If we are seeking to detect the branched double-hairpin structure that is responsible for the virus RNA editing, we can preliminary filter out a large number of sequences for which their second smallest eigenvalue of the Laplacian matrix corresponds to Equation (17) for the algebraic connectivity of a path on n vertices. In general, from the perspective of the ordering of trees by their algebraic connectivity [38], all of the predicted folded structures tend to have a very low algebraic connectivity number close to zero because of the tendency to have a linear stem-loop structure (for HDV genome sequences, even longer in linearity than the examples in the literature that were cited in [9]). By examining the different genotypes, one can observe in Table 1 that genotype 7 is contrasting the behavior of the total number of sequences in the first row of Table 1, and about 73% of the genotype 7 sequences are predicted to fold to a branched structure, which may signal that this genotype may exhibit an interesting story in its ability to perform a conformational switch that relates to function. As a possibility, after further filtering according to the directed structure-based search as described in sub-Section 3.2, it is found that alongside the known sequences of genotype 3 from South America, a hairpin that appears similar to SL1 as depicted in Figure 3B can also be detected in genotype 7 from Cameroon, as depicted in Figure 3A. Three of these sequences from genotype 7, besides MG711735, are MG711804, MG711754 and LT604971 (if these sequences are inserted into mfold or RNAfold, the reverse complement should be taken beforehand, and circular folding should be selected).
For completion of our analysis method that was outlined in [9], by the comparison between native and shuffled sequences in [35], it is found in Figure 4 that native sequences are more mutationally robust than shuffled sequences. Additionally, Figure 5 shows that native sequences are more thermodynamically stable than shuffled sequences. This can serve as a verification that, indeed, the RNA sequences in [42] significantly exhibit characteristics of natural RNAs. Furthermore, it can be viewed from Figures 4 and 5 that HDV sequences are accentuated in their mutational robustness and thermodynamic stability as compared to HCV [54] and HIV [55] and, arguably, to other viruses, because of their very high amount of base pairing in their unbranched structure. It is also interesting to note in passing that from the evolutionary perspective, the viral RNA structures explored in this study and previously in [9] have maintained a high degree of conservation, whereas the high mutation rate of RNA viruses and their substantial capacity for adaptation could have made the sequences unidentifiable and likened them to "evolutionary losers" in the archaeology of coding RNA [56]. A mathematical analysis could also benefit the concepts that have been described in [56] and relate to ancient-like RNA elements having a specific structure in genomic viral RNAs.
Future work relating to mathematical modeling of HDV genotypes is to address HDV viral kinetics across the different genotypes by a simple differential equation model [57,58], which can be solved by standard mathematical software such as Matlab without the need for a sophisticated numerical solution as in, for example, [59,60], where the model is nonlinear. Some parameter values can be fixed based on [61], while others are estimated for each patient according to each participanťs viral kinetics. As in [62], for HCV genotypes 1 and 2, we hypothesize that HDV genotypes [63,64] can also potentially affect HDV viral kinetics.
The association between HDV genotypes and RNA folding has not yet been elucidated. Further research in this field may decipher the mechanisms responsible for some of the clinical phenomena observed in patients infected with HDV. These include the wide variability between patients in HDV replication rate, HDV vs. HBV dominance, the severity of clinical presentation following acute HDV infection, and the rate of disease progression to cirrhosis and hepatocellular carcinoma. Some of these variations are linked with specific HDV genotypes, while others occur in patients infected with the same genotype. It is possible that sequence variations within (sub-genotypic) or across genotypes may yield different RNA folding patterns, which in turn could modify certain viral functional capacities such as editing and packaging efficiency as well as RNA-protein interactions. These conformational-functional alterations may bear an impact on viral infectivity, HBV HDV interactions, and immune system activation, eventually translating into distinct clinical patterns of disease presentation and progression. The outcome of RNA-protein interaction is likely to provide some key insights into these open questions. More generally, the link between the cellular level that is addressed by viral kinetics modeling and the molecular level of examining primary and secondary structures of the RNA molecule might be more thoroughly understood by the prediction of RNA-protein interactions, a computational sub field that has been developed in recent years and is actively being pursued these days, e.g., [65][66][67].
Future studies exploring structural-functional associations across HDV genotypes should center on the following perspective points: (1) the higher viral packaging efficiencies of genotype 1 vs. genotype 2 isolates; (2) the higher editing efficiencies of genotype 1 compared to those of genotype 2; (3) genotype 2 HDV infection is less frequently associated with fulminant hepatitis at the acute stage and follows a more benign long-term course (lower rate of progression to cirrhosis or hepatocellular carcinoma (HCC)) at the chronic stage as compared to genotype 1; (4) HDV genotype-dependent interactions with HBV (e.g., [68]).
To summarize, the biological and pathogenetic significance of HDV genotypes has not yet been elucidated in detail and is worthwhile exploring. Mathematical and computational methods can assist in these explorations and discover new findings. In this contribution, it is exemplified how a mathematical analysis of HDVdb [42] in which the aim was to explore how RNA tree-graphs are ordered by their algebraic connectivity assisted by the unique properties of the Laplacian matrices of tree-graphs and their second-lowest eigenvalue in Merris [38,69] can possibly yield new biological findings of clinical relevance. As per our analysis, we raise the possibility that HDV sequences of genotype 7 from Cameroon perform conformational switching that may affect their function, akin to a similar mechanism of RNA editing initially described in genotype 3 in South America. It is also possible that our analysis hints at another interesting biological function by way of conformational switching, other than RNA editing, that may take place in genotype 7. An analysis with a similar methodology can also be applied to other viruses and other biological agents of interest. From the pathogenetic point of view, a different clinical outcome of HDV infection has been observed for genotype 3 in comparison to the other genotypes. Due to the small numbers of genotype 7 cases with clinical information, such studies on disease outcome would be of interest.
Since the discovery of HDV, it was thought that this type of virus is only present in humans in an invariable association with HBV. However, recently, HDV-like sequences have been identified by metagenomic analyses in snakes (Boa constrictor), ducks (Anas species), rodents, fish, amphibians, and invertebrates (termites) without evidence of any HBV-like agent supporting infection. Most of these viruses have similar genomic features, including size and circular and unbranched rod-like structures. The snake-derived HDV protein bears 50% with L-HDAg. The duck-associated HDV protein shares only a homology of 32%. Detailed analysis of sequences and secondary structures as shown in this study for HDV genotypes will provide more insight on the relation of these agents to "human" HDV with respect to biology and possible pathogenesis.
Another open question is the interaction of certain HDV genotypes to corresponding genotypes of HBV. From the different genotypes of HBV, genotype F is the most prevalent in South America and is frequently associated with HDV super-infection. However, the more recent finding of the co-infection of HDV-3 with different genotypes of HBV suggests that the association between HDV-3 and HBV-F is not necessarily causally related to a more severe clinical course of infection [68]. This indicates that the properties of the genome, including secondary structures, correlate with clinical outcomes. This may also be true for HDV-7 by the results of our mathematical analysis. (A) To the left, the unbranched rod structure of HDV, isolate Peru-1 (L22063), obtained by mfold prediction with the optimal solution taken, containing the most essential information of Figure 2A of [43]. (B) To the right, the double-hairpin branched structure of HDV, isolate Peru-1 (L22063), obtained by mfold prediction with the second suboptimal solution taken, containing the most essential information of Figure 2B of [43]. The five nucleotides "AUAGU" comprising the editing site are colored.  (A) Secondary structure prediction by mfold/UNAFold of MG711735 from genotype 7 (upper part) with an SL1-like hairpin that contains the sub-sequence "GAAC" as in SL1. Conformational switching from an unbranched structure to a branched structure appears in the optimal solution. (B) Secondary structure prediction by mfold/UNAFold of L22063 from genotype 3 (upper part) with the SL1 hairpin that is known to be responsible for RNA editing. Conformational switching from an unbranched structure to a branched structure appears in the second suboptimal solution. The neutrality distribution of all structures from our dataset comparing native and shuffled sequences. Shuffled sequences with dinucleotide shuffling were obtained using shuffleseq from EMBOSS. Reported p-value (see text) is less than 0.0001. The MFE distribution of all structures in our dataset comparing native and shuffled sequences. Shuffled sequences with dinucleotide shuffling were obtained using shuffleseq from EMBOSS. Reported p-value (see text) is less than 0.0001.