Markov Chain-Based Sampling for Exploring RNA Secondary Structure under the Nearest Neighbor Thermodynamic Model and Extended Applications

Ribonucleic acid (RNA) secondary structures and branching properties are important for determining functional ramifications in biology. While energy minimization of the Nearest Neighbor Thermodynamic Model (NNTM) is commonly used to identify such properties (number of hairpins, maximum ladder distance, etc.), it is difficult to know whether the resultant values fall within expected dispersion thresholds for a given energy function. The goal of this study was to construct a Markov chain capable of examining the dispersion of RNA secondary structures and branching properties obtained from NNTM energy function minimization independent of a specific nucleotide sequence. Plane trees are studied as a model for RNA secondary structure, with energy assigned to each tree based on the NNTM, and a corresponding Gibbs distribution is defined on the trees. Through a bijection between plane trees and 2-Motzkin paths, a Markov chain converging to the Gibbs distribution is constructed, and fast mixing time is established by estimating the spectral gap of the chain. The spectral gap estimate is obtained through a series of decompositions of the chain and also by building on known mixing time results for other chains on Dyck paths. The resulting algorithm can be used as a tool for exploring the branching structure of RNA, especially for long sequences, and to examine branching structure dependence on energy model parameters. Full exposition is provided for the mathematical techniques used with the expectation that these techniques will prove useful in bioinformatics, computational biology, and additional extended applications.


Introduction
Computational and mathematical applications play a critical role in the analysis of the structure and function of biological molecules, including ribonucleic acid (RNA).RNA is an essential biological polymer with many roles including information transfer, regulation of gene expression, and catalysis of chemical reactions.The primary structure of an RNA molecule may be understood as a sequence of amino acids: arginine, urasil, guanine, and cytosine.As is standard, we frequently abbreviate these as A, U, G, and C, respectively.RNA molecules are single-stranded and may therefore interact with themselves, forming A-U, G-U, and G-C bonds.The secondary structure of an RNA molecule is a set of such bonds.
The determination of secondary structure is an important step to understanding an RNA molecule's full shape and therefore its function [1,2].Accordingly, secondary structure information is commonly used in tertiary structure prediction algorithms, see, e.g., [3][4][5][6].Identifying the secondary structure of RNA is crucial to understanding its function and mechanism in a cell [7].Thus, the structure of RNA is critical to the development of biological and pharmaceutical therapeutics.Biologists use inexpensive and expedient means to sequence RNA, but the experimental determination of structure is more difficult and time-consuming.Therefore, computational methods are the primary means to determine possible RNA secondary structures.
For decades, one of the main computational approaches for examining RNA structure and branching properties has been thermodynamic free energy minimization using Nearest Neighbor Thermodynamics Modeling (NNTM) [8][9][10].This free energy is in turn used in algorithms to predict secondary structure given an RNA sequence, see, e.g., [11][12][13].Under the NNTM, the free energy of a structure is computed as the sum of the free energy of its various substructures.Many common programs (e.g., mFold, RNAFold, RNA Structure, sFold, Vienna RNA, etc.) intake a single sequence to produce secondary structures based on NNTM energy minimizations performed via dynamic programming.Nearest neighbor parameter sets include both a set of rules, referred to as equations or features, and a set of parameter values used by the equations.Separate rules exist for predicting stabilities of helices, hairpin loops, small internal loops, large internal loops, bulge loops, multi-branch loops, and exterior loops.Other branching properties of interest include, but are not limited to, average ladder distance, maximum ladder distance, maximum branching degree, average contact distance, average branching degree, degree of branching at the exterior loop, number of multi-loops with n braches, etc.The online nearest neighbor database (NNDB) archives and stores complete nearest neighbor sets, including rules and corresponding parameter values [14].
A common challenge is inferring whether the predicted results of NNTM for a set of RNA structural features or branching properties are within expected dispersion thresholds for a given energy model.For example, is the number of hairpins more than 2-3 standard deviations greater than the expected mean for a given energy model?This challenge is particularly vexing if the sequence is relatively long (greater than 1000 nucleotides).If structural features or branching properties are determined to exceed expected energy model dispersion thresholds, it relays potential scientific and/or mechanistic insight.Continuing with our hairpin example, what if an NNTM model produces a result where the number of hairpins seems rather large for the given sequence length?If the number of hairpins exceeds the expected dispersion of the NNTM model, it might be inferred that the greater number of hairpins is evidence of natural selection.
The primary objective of the present study is to enable mathematical determination of the dispersion of RNA secondary structural features for a given sequence length.We present a Markov-based algorithm to provide samples of the branching structure under the NNTM and Gibbs distribution, but without reference to a particular sequence of nucleotides.The algorithm enables the determination of where the predicted feature or branching property for an actual sequence falls within this distribution, which in turn enables the determination of whether the predicted NNTM feature or branching property is within expected dispersion limits.
In particular, this work investigates RNA substructures called multi-loops, the places where three or more helices join.Though multi-loops are crucial to the overall shape of a secondary structure, the models used to predict them algorithmically do not produce accurate results [15].This investigation builds on an existing model of RNA branching [16] and provides a theoretical grounding for a Markov chain which may be used to algorithmically investigate branching properties of secondary structure models.The investigational foundation is a model for RNA secondary structure developed by Hower and Heitsch [16], in which secondary structures are in bijection with plane trees and the minimal energy structures of the model have been previously characterized.The present study characterizes the full Gibbs distribution of possible structures.Notably, Bakhtin and Heitsch [17] analyzed a very similar model and determined degree sequence properties of the distribution of plane trees asymptotically.However, the present study utilizes a Markov chain-based sampling algorithm to investigate the Gibbs distribution in the finite case.A full explanation of the plane tree model as well as the derivation of the energy functions is provided in Section 2.1.

Methods
The methods are divided into an overview of the RNA secondary structure NNTM plane tree model and energy functions (Section 2.1) and an all-encompassing explanation of the mathematical preliminaries that lay the foundation for the derived results and corresponding algorithms (Section 2.2).

Derivation of Energy Functions
The energy function studied here is derived from the Nearest Neighbor Thermodynamic Model (NNTM).The numerical parameters from the NNTM can be found in the NNDB [14].In calculating energy functions for the sequences, we consider thermodynamic parameter values published by Turner in 1989 [8], 1999 [9], and 2004 [10].
The plane trees that we study in this paper come from two combinatorial RNA sequences, both of the form A 4 (Y 5 ZA 4 YZ 5 A 4 ) n .The sequences of interest have (Y, Z) = (C, G) or (Y, Z) = (G, C).For both of these sequences, the set of maximally-paired secondary structures is in bijection with the set of plane trees of size n [18].Figure 1 shows one example of a secondary structure and corresponding plane tree.
These specific combinatorial sequences are chosen because they allow for the study of the relationship between NNTM multiloop parameters and the branching behavior of secondary structures without interference from the energy contributions have specific base pairing combinations.In particular, the only places where the free energy differs between different secondary structures (for the same sequence) is in the type and number of multi-loops, the branching at the exterior loop, the number of hairpins, and the number of internal nodes.All of these energies directly relate to branching, not to specific base pairs.This simplification achieved by focusing only on multi-loops and branching both creates a model that is more amenable to theoretical analysis and speed computation.
Note that these secondary structures should not be considered representative of naturally occurring secondary structures.Instead, the only properties of interest in these structures are branching-related properties.
Three constants determine the free energy contribution of multiloops under NNTM, a, b, and c.The value of a encodes the energy penalty per multiloop.The constant b specifies the energy penalty per single-stranded nucleotide in a multiloop.The value of c gives the energy penalty for each helix branching from a multiloop.
In addition to the multiloop parameters a, b, c discussed above, we must account for the energy contributions of stacking base pairs, hairpins, interior loops, and dangling energy contributions.The energy of one helix is given by h.The energy associated with a hairpin is f, and the energy contribution of an interior loop is i.Finally, the parameter g encodes the dangling energy contributions.All of these values can be computed directly from the parameters found in the NNTM.
We wish to compute the energy of the structure corresponding to plane tree t having (down) degree sequence d 0 , d 1 , …, d n−1 and root degree r.Note that the down degree of a node x is equal to the number of children of x, and, in the down degree sequence, d i is the number of non-root nodes with exactly i children.The energy contribution of all hairpin loops will be d 0 f, and similarly, the total energy of all interior loops will be d 1 i.For a multi-loop having down degree j, the energy contribution will be a + 4b(j + 1) + c(j + 1) + (j + 1)g, and so the contribution of all multi-loops is given by ∑ j = 2 n d j (a + 4b(j + 1) + c(j + 1) + g(j + 1)).Te root vertex of the tree corresponds to the exterior loop and has energy contribution gr.Finally, our structure has n helices, each with energy h.Summing all of these components gives the total energy.
where we have used the facts , and δ = a + 8b + 2c + h + 2g.Then, the energy function is αd 0 + βd 1 +γr +δn.Since n will be fixed, we disregard the term δn, giving E(t) = αd 0 + βd 1 + γr . ( Though we study these energy functions for arbitrary values of (α, β, γ), numerical values for both the input energy parameters from NNTM and the resulting energy function coefficients are given in Table 1.

Mathematical Preliminaries
Section 2.2 of this manuscript provides the necessary mathematical background, including a formal introduction of combinatorial objects and a review of the relevant Markov chain mixing results used to construct our resultant sampling Markov chain and corresponding mixing time proof in Section 3.

Combinatorial
Objects-A plane tree is a rooted, ordered tree.We will use T n to denote the set of plane trees with n edges.It is known that T n is given by the nth Catalan . In a plane tree, a leaf is a node with down degree 0, and an internal node is a non-root node with down degree 1.For a given plane tree t, we will use d 0 (t) to denote the number of leaves and d 1 (t) to denote the number of internal nodes.
For a plane tree t, the energy of the tree is given by where α and β are real parameters of the energy function.Note that this function is a simplification of the model due to Hower and Heitsch [16] discussed in Section 2.1.Making this simplification effectively disregards the energy contribution of the exterior loop, which is small in comparison to the total energy of a structure, especially for the longer sequences that are of interest to us.Other authors have made similar simplifications, e.g., [17].
For our purposes, we consider α and β to be arbitrary but fixed.We will consider a Gibbs distribution g on the set T n , where the weight of each tree t is given by where Z = ∑ y ∈ T n e −E(y) is a normalizing constant.
A Motzkin path of length n is a lattice path from (0, 0) to (n, 0), which consists of steps along the vectors U = (1, 1), H = (1, 0), and D = (1, −1) and never crosses below the x-axis.We can also represent Motzkin paths as strings from the alphabet {U, H, D} where, in any prefix, the number of Us is greater than or equal to the number of Ds.The number of Motzkin paths of length n is given by the Motzkin numbers M n where Motzkin numbers and Motzkin paths have been well-studied in the combinatorics literature, see, e.g., [20][21][22][23][24].
A Dyck path is a Motzkin path with no H steps.It is easy to see that a Dyck path must have even length, so we will use D n to denote the set of Dyck paths on length 2n.It is well known that D n = C n (see, e.g., [25]).
A 2-Motzkin path is a Motzkin path in which (1, 0) steps are given one of two distinguishable colors.Let M m 2 be the set of all 2-Motzkin paths of length m.We can also represent 2-Motzkin paths as strings from the alphabet {U, H, I, D}, where, as before, the number of Ds never exceeds the number of Us in any prefix.In a such a string x, we denote by |x| a the number of times the symbol a appears in x, where a ∈ {U, H, I, D}.Notice that we always have |x| U = |x| D .For any x ∈ M n 2 and k ∈ {1, • • •, n}, let x(k) denote the symbol at index k in the string representation of x.Additionally, the skeleton of a 2-Motzkin path x is the Dyck path of Us and Ds which results from removing all Hs and Is from x.We will denote the skeleton of x by σ(x).

A Bijection Between
T n and M n − 1 2 -We will use the particular bijection between plane trees and 2-Motzkin paths from Deutsch [26], which neatly encodes information about d 0 and d 1 .For clarity, we will overview the bijection here.
For a given plane tree t with n edges, assign a label from the set {U, H, I, D} to each edge e according to the following rules: • If e is the leftmost edge off a non-root node of down degree at least 2, assign the label U.
• If e is the rightmost edge off a non-root node of down degree at least 2, assign the label D.
• If e is the only edge off a non-root node of degree 1, assign the label I.
• If e is an edge off the root node, or if e is neither the leftmost nor the rightmost edge off its parent node, assign the label H. Now, if we traverse t in a preorder reading off these labels, we get a 2-Motzkin path of length n.However, this path will always begin with H, so we define Φ(t) to be the 2-Motzkin path of length n − 1 after this initial H is removed.Figure 2 gives an example of this labeling process.From Deutsch, we know not only that Φ is a bijection, but also that if x = Φ(t) then |x| I = d 1 (t) and |x| U + |x| H + 1 = d 0 (t).
Using this bijection, it is natural to extend our energy function to 2-Motzkin paths.We define the energy of a 2-Motzkin path x to be and we extend our definition of the distribution g to M n 2 accordingly.We note that, while this energy function does not capture all possible weightings on 2-Motzkin paths, it does capture all weightings possible under our simplification of the model due to Hower and Heitsch [16] after applying the bijection due to Deutsch [26].

Markov Chains-A Markov chain
ℳ is a sequence of random variables X 0 , X 1 , X 2 , • • • taking values in a state space Ω subject to the condition that All Markov chains that we consider will be implicitly time-homogeneous (meaning Pr(X t+1 = y | X t = x) does not depend on t) and finite (meaning |Ω| < ∞).The transition matrix of a time-homogeneous Markov chain is the matrix P: Ω × Ω → [0, 1] given by It is easy to see that if X 0 has distribution vector x, then X t has distribution vector P t x.
A finite Markov chain with transition matrix P is said to be ergodic if it has the following two properties.
It is well-known that if ℳ is ergodic, then there exists a unique distribution vector π, the stationary distribution, such that Pπ = π, and lim t→∞ P t (x, y) = π(y) for any states x, y ∈ Ω.
For ϵ > 0, the mixing time τ(ϵ) of ℳ is given by Intuitively, the mixing time gives a measure of the number of steps required for ℳ to get sufficiently close to its stationary distribution from any starting state.
Let ℳ be a finite ergodic Markov chain over a state space Ω with transition matrix P. Let the eigenvalues of P be λ 0 , As is standard, it will be convenient to denote the inverse of the spectral gap by relaxation time τ rel (ℳ): = 1/Gap(ℳ).
Additionally, the spectral gap is given by the following functional definition [27].
where the infimum is taken over all non-constant functions f : Ω ℝ.A direct consequence of this definition of the spectral gap is the following lemma.
Lemma 1.: Let ℳ 1 and ℳ 2 be ergodic Markov chains over Ω with the same stationary distribution.Let P 1 and P 2 be the transition matrices of ℳ 1 and ℳ 2 respectively.If for all x, y ∈ Ω and for some constant c > 0 we have P 1 (x, y) ≤ cP 2 (x, y), then Additionally, spectral gap is related to the mixing time by the following lemma [28].
Lemma 2.: Let ℳ be an ergodic Markov chain with state space Ω, and let λ 0 , λ 1 , …, λ |Ω|−1 be the eigenvalues of the transition matrix P as defined above.Then, for all ϵ > 0 and x ∈ Ω, we have We say that a Markov chain ℳ, whose state space depends on a variable n ∈ ℕ, is rapidly mixing if τ(ϵ) is bounded above by some polynomial in n and log(ϵ −1 ).For the specific chains studied in this manuscript, we will show that τ(ϵ)(ℳ) is bounded by a polynomial in n and log(ϵ −1 ) if and only if τ rel (ℳ) is bounded by a polynomial in n and log(ϵ −1 ).Our next lemma presents sufficient conditions.Lemma 3.: Let ℳ be an ergodic Markov chain with state space Ω and let λ 0 , λ 1 , …, λ |Ω|−1 be the eigenvalues of its transition matrix.Let ϵ > 0. If τ(ϵ) is bounded by a polynomial in n and log(ϵ −1 ), then τ rel is also bounded by a polynomial in n and log(ϵ −1 ).Further, suppose we have log(1/π(x)) bounded by some polynomial q(n) for all x ∈ Ω.Then, τ rel (ℳ)  being bounded by a polynomial in n and log(ϵ −1 ) implies that τ(ϵ) is also bounded by some polynomial in n and log(ϵ −1 ).
Proof.: Suppose that τ(ϵ) ≤ p(n, log(ϵ −1 )), where p is a polynomial.Beginning with the left hand side of Lemma 2, note that Then, applying Lemma 2 and the bound on τ(ϵ), where p′ is again a polynomial in n and log(ϵ −1 ).
Applying Lemma 2, where p′ is some polynomial.□

Coupling-A coupling of a Markov chain
for which the following properties hold.

1.
Each chain X t t = 0 ∞ and Y t t = 0 ∞ , when viewed in isolation, is a copy of ℳ, given initial states X 0 = x and Y 0 = y.
Formally, the first item above requires that the joint distribution of (X t , Y t ) (given (X t−1 , Y t−1 )) should satisfy the property that the marginal of X t (and also Y t ) is consistent with the probability transitions of ℳ.We define the coupling time T to be The following lemma [29] is useful in bounding the coupling time T.
Coupling time and mixing time are then related by the following theorem [28].
Theorem 1.: A Markov chain M with coupling time T has mixing time τ(ϵ) bounded by 2.2.5.Decomposition-We use two disjoint decomposition methods for bounding the spectral gap, one developed by Martin and Randall [30], and a very recent one given by Hermon and Salez [31], building on the work by Jerrum, Son, Tetali and Vigoda [32].We use both theorems because, while the latter gives better bounds, the former has more relaxed conditions, which is necessary in one of our applications.The setup for both methods is the same.
Let ℳ be an ergodic, reversible Markov chain over a state space Ω with transition matrix P and stationary distribution π.Suppose Ω can be partitioned into disjoint subsets Ω 1 , …, Ω m .
For each i ∈ [m], let ℳ i be the restriction of ℳ to Ω i , which is obtained by rejecting any transition that would leave Ω i .Let P i be the transition matrix of ℳ i Additionally, we define ℳ to be the projection chain of ℳ over the state space [m] as follows.Let the transition matrix P of ℳ be given by One can check that ℳ is reversible and has stationary distribution while each ℳ i has stationary distribution With this notation, we have the following theorem by Martin and Randall [30].
Theorem 2.: Defining ℳ i and ℳ as above, we have The theorem due to Hermon and Salez obtains better bounds if, for each pair (i, j) ∈ [m] × [m] with P (i, j) > 0, we can find an effective joint distribution (often referred to as a "coupling") κ ij : Ω i × Ω j → [0, 1] of the distributions π i and π j .In other words, we must have ∀y ∈ Ω j , ∑ x ∈ Ω i κ ij (x, y) = π j (y) . ( The quality of the joint distribution κ is defined as where the minimum is taken over all (x, y, i, j) with x ∈ Ω i , y ∈ Ω j for which P (i, j) > 0 and κ ij (x, y) > 0. Hermon and Salez [31] prove the following.
Theorem 3.: With P, P , P i , and χ defined as above, The utility of these decomposition theorems is that they allow us to break down a more complicated Markov chain into pieces that are easier to analyze.If we can show that the pieces rapidly mix, and the projection chain rapidly mixes, then we may conclude that the original chain rapidly mixes as well.
Additionally, to aid with the analysis of some projection chains, we will need another lemma from [30].
Let ℳ M be the Markov chain on [m] with Metropolis transitions P M (i, j) = 1 2Δ min{1, π Ω j π Ω i } whenever P (i, j) > 0, where Δ is the maximum degree of vertices in the transition graph of M. Let ∂ i (Ω j ) = {y ∈ Ω j : ∃x ∈ Ω i with P(x, y) > 0}.Then we have the following Lemma 5.: With ℳ M as defined above, suppose there exist constants a > 0 and b > 0 with 1. P(x, y) ≥ a for all x, y such that P(x, y) > 0.
In order to help analyze the mixing time of ℳ M , we will also require the following two lemmas.Note that Lemma 6 is used only in the proof of Lemma 7.
Lemma 6.: Let a i i = 1 m be a log concave sequence, with a i > 0 for all 1 ≤ i ≤ m.Then, for all 1 ≤ i ≤ j ≤ m.
Proof.: In order to use induction, we will slightly reframe the statement.We will prove We now proceed by induction on k.The base case, k = 0, is trivial.Now fix l > 0 and suppose that the induction hypothesis is true for k = l − 1, that is, By log concavity a i + l 2 ≥ a i + l − 1 a i + l + 1 , or, equivalently, Therefore, where the first inequality follows from the induction hypothesis, and the second inequality follows from log concavity.□ Lemma 7.: Let π be a probability distribution on [m].Let ℳ be a Markov chain on [m] with the transition probabilities and the appropriate self-loop probabilities P(i, i).If π(i) is log-concave in i, then ℳ has mixing time (and hence also relaxation time) O(m 2 ).
Proof.: We define a coupling (X t , Y t ) on ℳ as follows.If X t ≠ Y t , then at time step t + 1, flip a fair coin.

•
If tails, set X t+1 = X t , and update Y t+1 the same way as we did for X t+1 in the previous case.
Therefore, π(i) is not unimodal in i and is therefore also not log-concave in i, contradicting our hypothesis.Therefore we have E (Δφ) 2 | X t , Y t ≥ 1 4 , as desired.□

Results
Here we present the constructed Markov chain and corresponding algorithms devised for the sampling task and the proof of an upper bound on the relaxation time-that the chain mixes rapidly.Collectively, the results illustrate an analytical approach to calculate the dispersion of the secondary structure and corresponding branching properties of RNA based on the NNTM energy function minimization and without reference to a specific nucleotide sequence.

Our Markov Chain on M m 2
We define a Markov chain ℳ = X 0 , X 1 , X 2 , ⋯ on M m 2 to sample 2-Motzkin paths as a representation of plane trees.Here, we use m = n − 1 to denote the length of the 2-Motzkin paths corresponding to plane trees with n edges.
We define each step of ℳ as follows.First, pick a random element l uniformly from {1, 2, 3, 4}.Now choose y as follows.
• If l = 1, pick a random pair of consecutive symbols in X t , and call this pair s.If s is UD or HH, let s′ be either UD or HH with probabilities 1 1 + e −α and e −α 1 + e −α respectively.Let y be the string X t with s replaced by s′.Otherwise, let y = X t .Let y be the 2-Motzkin path given by changing the symbol in X t (j) to c. Otherwise, we let y = X t .

•
If l = 3, pick i and j each uniformly from {1, • • •, m}.If each of X t (i) and X t (j) are either U or D, let y be the string X t with the symbols at indices i and j swapped.Otherwise, let y = X t .
• If l = 4, pick a random pair of consecutive symbols in X t , and call this pair s.If s is of the form ab or ba for some a ∈ {U, D} and b ∈ {H, I}, let s′ be the reverse of s, and let y be the string X t with s replaced by s′.Otherwise, let y = X t .
One can see that ℳ is irreducible by noting that every path can be transformed to the path consisting of all H's.To make this transformation, first use the l = 4 rule to move all H's and I's to the end of the path.If there are any U's in the path, we must now have at least one consecutive pair UD.Use the l = 1 rule to convert the UD to a HH.From here we can repeat, again moving all H's to the end and replacing UD with HH, until only H's and I's remain.Finally, we can use the l = 2 rule to convert all I's to H's.Since all of these steps can also be taken in reverse, this gives a procedure to move between two arbitrary paths, demonstrating irreducibility.We can also conclude that ℳ is aperiodic, due to the existence of self-loops.
Combined with irreducibility, this establishes that ℳ is ergodic.
We claim that ℳ is reversible with respect to the stationary distribution π(x) = e −E(x) Z , where y) .This can be easily verified by considering the four move types listed above.For example, for the first move type given above (transforming UD to HH and vice versa), let x and y be the states of interest.Suppose that y has the consecutive symbols HH where x contains UD.Then, One can verify that similar computations hold for the remaining three types of moves.
Therefore, we conclude that the chain ℳ has stationary distribution π(x) = e −E(x) Z .
The Markov chain ℳ can be implemented in pseudocode as in Algorithm 1.Here, the Ber(p) function returns true with probability p, and false otherwise.We also use the addition of strings to denote concatenation.
Additionally, in order to convert the 2-Motzkin path X t into a plane tree, we use the algorithm in Algorithm 2, which assumes the existence of a Node object with children and parent attributes.

Mixing Time Results
Our main result is to prove the rapid mixing of the Markov chain defined in Section 3.1.An upper bound on the relaxation time is achieved by bounding the spectral gap from below.A spectral gap bound for the complex chain at hand is obtained through the use of multiple decomposition theorems, which give bounds on the spectral gap of the complex chain in terms of the spectral gaps of multiple simpler chains.The disjoint decomposition theorem due to Martin and Randall [30] provides a flexible approach to the decomposition of Markov chains.Very recent work by Hermon and Salez [31], building on the work of Jerrum, Son, Tetali, and Vigoda [32], proves a decomposition theorem with tighter bounds but stronger hypotheses.
Since this proof involves multiple decomposition steps, we provide an overview here.The primary tools used in this proof are the two decomposition theorems presented in Section 2.2.5.We first partition the state space of all 2-Motzkin paths by the number of Us in the path.The projection chain from this first decomposition is linear and is proved to be rapidly mixing using a result of Martin and Randall [30] (Lemma 8).Each of the restriction chains are decomposed again, this time by the pattern of H and I symbols.The projection chains for this second decomposition are shown to be rapidly mixing by coupling (Lemma 9).The restriction chains are decomposed a third time, this time according to the skeleton of U and D steps.The projection chains for this third decomposition are shown to be rapidly mixing by comparison to the classic mountain valley moves chain on Dyck paths (Lemma 10).This last set of restriction chains are found to be rapidly mixing by isomorphism to the chain consisting of adjacent transpositions on binary strings (Lemma 11).Finally, starting from the most restricted chains, we use the decomposition theorems to obtain a bound on the spectral gap of the original chain (Theorem 4).

Algorithm 1:
The main Markov chain algorithm.This pseudocode calculates X t given X 0 .
Require: X 0 is a valid 2-Motzkin path of length m.
x ← X 0 We now proceed with a formal presentation.We will use a series of decompositions of ℳ.
We will first decompose our state space M m 2 into S 0 , ⋯, S m/2 , where Let ℳ k denote the Markov chain ℳ restricted to the set S k , and let ℳ be the projection chain over this decomposition as outlined for Theorem 2.
Additionally, we will decompose each S k into the sets {T k,q : q ∈ (H + I) m−2k }, where (H + I) m−2k denotes the set of strings with length m − 2k from the alphabet {H, I}.We define T k,q to be the set of 2-Motzkin paths x ∈ S k such that the substring of H and I symbols in x is q.Let ℳ k, q denote the chain ℳ k restricted to T k,q , and let ℳ k be the projection chain of ℳ k over this decomposition.
Finally, we decompose each T k,q into the partition U k, q, s : s ∈ D k based on the skeletons of the 2-Motzkin paths.For each s ∈ D k , we define As before, we let ℳ k, q, s be the Markov chain ℳ k, q restricted to U k,q,s , and let ℳ k, q be the appropriate projection chain.For clarity, this four-level decomposition is summarized in Figure 3.

Algorithm 2:
Algorithm to convert a sampled 2-Motzkin path to a plan tree.The pseudocode calculates φ −1 (x).Proof.-Thechain ℳ is a linear chain with states k in {0, …, ⌊m/2⌋}, and with stationary distribution where π is defined as in Section 2.2.5.Notice that transitions in ℳ which move between the S k sets are those which change a HH substring into a UD or DU substring, or vise versa.Thus, the transitions in ℳ only increase or decrease k by at most 1.We seek to apply Lemma 5. To choose a, notice that for x ∈ S k and y ∈ S k±1 with P(x, y) > 0, we have 1 + e α or P (x, y) = 1 4(m − 1) 1 1 + e −α .
Note that the factor 1/4 comes from the choice l = 4, and the factor 1/(m − 1) comes from the fact that there are m − 1 adjacent pairs to pick from.Then, Thus, we pick a = Proof.-Noticethat ℳ k appears as a chain with states q in the set Q = (H + I) m−2k .
Additionally, transitions in ℳ k only occur between strings in Q that differ at only one index.The stationary distribution of ℳ k is given by π k (q) ∝ e (β − α) | q| H , where we have intentionally used the constant of proportionality to remove all dependence on k, which we consider, in this context, to be fixed.
Additionally, for q 1 , q 2 ∈ Q which differ at exactly one index, we have the transition probability We may show that ℳ k rapidly mixes by a simple coupling argument.Let X t , Y t t = 0 ∞ be our coupled Markov chain on Q × Q.We define one step in this coupled chain as follows.

1.
With probability One can check that each of (X t ) t and (Y t ) t are indeed copies of ℳ k .Additionally, notice that we will have X t = Y t after all m − 2k possible indices j have been updated.

By the Coupon Collector Theorem, we have the coupling time of this chain to be
⋅ O((m − 2k)log(m − 2k)) = O(mlogm).Thus, using Theorem 1, we have the mixing time (and the relaxation time) also O(m log m).□ Lemma 10.-ℳ k, q has relaxation time τ rel ℳ k, q = O m 2 , for all pairs (k, q).
Proof.-Notice that all x ∈ T k,q have equal energy, and that U k, q, s = m 2k for all s.Thus, ℳ k, q has a uniform stationary distribution.If we represent each set U k,q,s by the Dyck path s, we can think of ℳ k, q as a chain over D k .Since all the transitions in ℳ k, q that move between the U k,q,s sets are moves that exchange the positions of a U and a D, the transitions in ℳ k, q are simply the moves on elements of D k which exchange a U with a D. We call these moves on the elements of D k , transposition moves.
For each s 1 , s 2 ∈ D k that differ by a transposition move, the transition probabilities in our projection chain are given by The last equality above relies on counting the number of terms in the sum.Notice that for each x ∈ U k, q, s 1 , there is a unique y ∈ U k, q, s 2 for which P(x, y) > 0. Therefore, the number of terms is simply U k, q, s 1 = m 2k . Compare this chain to the traditional mountain valley Markov chain on D k , which we will denote by ℳ′.The transition probabilities of ℳ′ are given by P ′ s 1 , s 2 = 1 k 2 for each pair (s 1 , s 2 ) which differ by a mountain-valley move.It is known from Cohen [33] that Gap ℳ′ = 1 O k 2 .Thus, applying Lemma 1 to ℳ k, q and ℳ′, we -ℳ k, q, s has relaxation time τ rel ℳ k, q, s = O m 3 , for all valid triples (k, q, s).
Proof.-Notice that transitions in ℳ k, q, s consist only of moves which involve swapping an H or an I with an adjacent U or D. Additionally, all 2-Motzkin paths in U k,q,s have equal energy, so for all x, y ∈ U k,q,s such that P(x, y) > 0, we have P (x, y) = 1 8(m − 1) .
To determine the mixing time of ℳ k, q, s , consider an isomorphic chain.Let U′ be the set of all binary strings of length m with 2k zeros and m − 2k ones.Let ℳ′ be the Markov chain on U′ where each step does nothing with probability 7/8 and swaps a random pair of adjacent (potentially identical) digits with probability 1/8.From Wilson [34], we know that the spectral gap of ℳ′ is Finally, we can combine our bounds on the spectral gaps of all of these chains to prove our main result.Proof.-Weuse Lemmas 11 and 10 with Theorem 3 to obtain a bound on Gap ℳ k, q .
We define a coupling κ s 1 , s 2 for each pair s 1 , s 2 ∈ D k × D k with P k, q s 1 , s 2 > 0. For each such pair, notice that the set of pairs (x, y) ∈ U k, q, s 1 × U k, q, s 2 with P(x, y) > 0 is a perfect matching.Thus, we may set To compute χ, we begin by observing π(x) = π(y) for all x, y ∈ ℳ k, q .Note also for all skeletons s of length 2k.Before computing χ, we start by finding .
We now proceed with the calculation of χ.Recall that the minimum is taken over all tuples x, y, s 1 , s 2 where P s 1 , s 2 > 0 and κ 0 1 , s Theorem 3 then gives Similarly, we define a coupling κ q 1 , q 2 for each pair (q 1 , q 2 ) ∈ (H + I) m−2k × (H + I) m−2k with P k q 1 , q 2 > 0 to apply Theorem 3 to M k .Notice that once again, the set of pairs (x, y) ∈ T k, q 1 × T k, q 2 for which P(x, y) > 0 forms a perfect matching.Thus, we take To compute χ for this coupling, we again begin with a few preliminary computations.In all of the following, let x ∈ T k, q 1 , y ∈ T k, q 2 with P , q 2 > 0. Note that q 1 and q 2 have the same length and differ at only one index.We will show the computations for the case where q 1 has a I where q 2 has a H.The computations for the other case are nearly identical.

Now we can compute
χ = min π(x)P (x, y) π q 1 P q 1 , q 2 κ q 1 , q Applying Theorem 3 then gives Unfortunately, we have not been able to find a useful coupling for ℳ, so for the last step of our decomposition, we apply Theorem 2. Since Gap(ℳ) = O This gives us the required polynomial bound, and therefore Lemma 3 implies that ℳ is rapidly mixing.□

Discussion and Conclusions
The goal of this work was to identify a Markov chain and construct a corresponding algorithm by which to examine the non-uniform distribution and dispersion properties of NNTM RNA secondary structures and branching properties independent of a specific nucleotide sequence.This study successfully identifies the existence of a Markov chain, with a provably polynomial mixing time, which generates a Gibbs distribution on plane trees.This stationary probability distribution models branching characteristics of RNA secondary structure under the NNTM.While the exploration of sampled structures obtained from this algorithm are beyond the scope of the presented results, pseudocode (see Section 3.1) is provided to facilitate future work in this area.Below we discuss the direct applications and implications of this work to RNA modeling, the possibility of implementing a dynamic programming approach, the possibility of an approach using stochastic context-free grammars, other biological applications of this work, contributions of this work towards independent mathematical research interests, and limitations and future directions of the present work.

Applications to RNA Modeling
The most straightforward application of this work is in understanding the background distribution of the branching behavior for secondary structures predicted under the NNTM.While the NNTM is widely used to predict secondary structures from sequence data, little is known about the general branching characteristics of the predicted structures, independent of a specific input sequence.Quantities such as the number of hairpins, the maximum branching in a multiloop, the average branching in a multiloop, and the maximum ladder distance of the structure [7,35] help to characterize the branching behavior and could be computed from samples obtained from this algorithm.These quantities also have been studied in native structures and/or could be easily obtained from databases such as RNA STRAND [36].The parameter values of α, β, and γ corresponding to various revisions of the NNTM are given in Table 1 in Section 2.1.The Markov chain and corresponding algorithms presented will enable biologists to calculate the dispersion of key branching properties for a specific energy function.As described with the detailed hairpin dispersion example in the Introduction (Section 1), knowing whether branching properties fall within acceptable dispersion limits is crucial for deducing potential functional insight or hypothesizing other scientific ramifications.
Another key application to RNA modeling of the presented algorithms is the ability to explore the parameter space of possible values for α and β.While the various revisions of the NNTM correspond to specific values for these parameters, in principle any real-valued parameters could be used.Finding values for these parameters that approximate reality remains an open question.Yet, determination of how differences in parameter values change the distribution of NNTM branching properties, such as maximum ladder distance, is crucial.Moreover, parameter space exploration is necessary to identify and further explore the phase transitions that exist.The presented Markov chain and corresponding algorithms expedite such future computational experimentation.Therefore, collectively, the presented algorithm enables exploration that will greatly improve understanding of NNTM-based RNA secondary structures and branching properties, as well as identify potential limitations or specific branching structures where the NNTM models do not sufficiently emulate reality.For example, NNTM-based free energy minimization algorithms achieved an accuracy of at least 60% in only 9% of 16S secondary structures analyzed by Doshi et.al. [15].
The algorithm presented here can only sample under an energy function of the form αd 0 + βd 1 , and this does not capture the entirety of the model presented in [16], which considers energy functions of the form αd 0 + βd 1 + γr.However, the missing term, γr, represents the energy contribution of the exterior loop, and the exterior loop contributes less of the total free energy as sequence length increases.Therefore, when interested in sequences of at least moderate length, this algorithm may be able to provide insight, as long as information about the exterior loop is not the specific object of study.Note that other authors have made similar simplifications with respect to the exterior loop, e.g., [17].

Possibility of a Dynamic Programming Approach
This sampling problem to calculate the dispersion of NNTM RNA secondary structure and properties utilized Markov chain techniques.However, is it possible to utilize a dynamic programming algorithm?It is straightforward to sample Dyck paths under a uniform probability distribution using dynamic programming techniques.However, it is not clear whether a similar technique could be used for the Gibbs distribution we define here, due to the complexity of the energy function.In particular, large numeric computations may be required to handle the variable k, the number of U steps in a path.While Alonso presents a way to sample from the unweighted distribution Pr(k = l) ∝ m 2l C l time without large computations [37], it is unclear if a similar method may be used for the present application.

Possibility of an SCFG Approach
Stochastic context-free grammars (SCFGs) have been widely used in the field of RNA secondary structure prediction, e.g., [38][39][40][41].Most commonly, the probabilities for production rules in an SCFG are determined by training on a set of known secondary structures, often including covariance information from homologous structures.These approaches are not immediately applicable to the problem we study here, as they do not give any insight into the NNTM multiloop energy parameters.
However, some authors have constructed SCFGs based on the NNTM.In particular, Nebel and Scheid [38] construct an SCFG with 29 distinct production rules to mirror the NNTM features.They also present a sampling algorithm allowing for sampling structures of a fixed size using the grammar.However, they do not actually compute probabilities for the production rules that would allow one to sample from a Gibbs distribution (with NNTM energy) and instead rely on training on a set of known structures.Indeed, it is not clear from the paper whether such a set of probabilities must exist.
Even in the case of the simplified model we present in this manuscript, it is not clear how to assign probabilities to production rules in an SCFG so that the probability of obtaining a given structure matches the Gibbs probability under the NNTM.See Supplement 1: SCFG for more details.
Even if a suitable SCFG could be formulated, the SCFG approach is not necessarily superior.The sampling algorithm presented by Nebel and Scheid has time complexity O(n 3 ) and space complexity O(n 2 ).While the algorithm we present does have large time complexity, it only requires linear space, which may be an advantage for some applications.
Even though we cannot easily formulate a SCFG, it is reasonable to consider whether a context-free grammar (such as that presented in Supplement 1) could nonetheless be used as the basis for a dynamic programming algorithm.In fact, this is possible.The key idea is to create a table for each non-terminal symbol X and then populate entry k of the table with ∑e E(t) , where the sum is taken over all trees t ∈ T k which can be derived from symbol X.
Once the tables been populated with these (non-normalized) probabilities, a stochastic backtracking procedure can be used to obtain samples.
However, as in Section 4.2, an assumption that each arithmetic operation can be performed in unit time is not appropriate here.Because the elements of our dynamic programming tables are in fact parts of the partition function, we can conclude that the numbers involved could have up to O(n) digits.Each arithmetic operation, therefore, becomes much more expensive.While a polynomial-time dynamic programming algorithm based on a contextfree grammar is possible, an efficient dynamic programming algorithm would require substantially more work.

Extended Applications
The Markov chain mixing analysis techniques explored in this manuscript have the potential for useful application in a variety of fields.Markov chain Monte Carlo algorithms are widely used in several fields including, machine learning [42], econometrics [43], and Bayesian Statistics [44].In virtually all applications, an understanding of mixing time increases confidence in the results.In some situations, an understanding of mixing time may also allow for more efficient algorithm selection and implementation.
While many Markov chains with nonuniform stationary distributions have been used for biological applications (e.g., [45][46][47][48]), theoretical guarantees on the mixing time are generally not known.Instead, researchers must rely on convergence heuristics, and in fact, many introductions to Markov chain Monte Carlo written for biologists explain such heuristic techniques [49][50][51][52].Of course, heuristics can be misleading, and rigorous mixing time guarantees would be significantly preferable.The same techniques used in this work might be used to generate algorithms with rigorous mixing time bounds for other biological problems concerning a nonuniform distribution.
The mathematical techniques used in this manuscript have been widely used in mathematics, physics, and computer science, demonstrating their broader applicability.For numerous examples, we direct the reader to the books of Levin, Peres, and Wilmer [53]; Montenegro and Tetali [54]; and Jerrum [55].
As an example where similar techniques have found utility in biological applications, it is interesting to briefly consider the study of cladograms, which arise from phylogenetic trees.Mathematically, a cladogram is a binary tree with n labeled leaves and n − 2 unlabeled internal nodes.While an explicit formula is known for the exact number of cladograms of a given size, mixing time under certain dynamics has also been studied.For example, Aldous [56] studied a Markov chain where a leaf is removed at random and then attached to a random edge in the tree, obtaining a proof that the mixing time is bounded below by O(n 2 ) and bounded above by O(n 3 ).Further work by Schweinsberg [57] later proved an upper bound of O(n 2 ), closing the gap between the upper and lower bounds.

Independent Mathematical Research Interests
The plane trees examined as a model for RNA secondary structure are of independent mathematical interest.As Catalan objects, they have been studied combinatorially (see, for example, [25,58]), and Markov chains on Catalan objects have received significant attention over the years [33,34,[59][60][61], but with very few results providing tight estimates on the corresponding mixing times; most commonly these are discussed in the language of Dyck paths.Cohen's thesis [33] gives an overview of the known mixing time results for chains on Catalan objects.All of the chains surveyed there have a uniform distribution over the Catalan-sized state space as their stationary distribution.Among these, essentially the only known chain with tight bounds (upper and lower bounds differing by a small multiplicative constant) is due to Wilson [34] and gives the relaxation time of O(n 3 ) for the walk consisting of adjacent transpositions on Dyck paths.In comparison, in [59] the chain using all (allowed) transpositions has been shown to have relaxation time of O(n 2 ), and further conjectured to have O(n) as the relaxation time, in analogy with the random transposition shuffle of n cards.
Judging from the lack of progress on several of these chains, it is evident that determining mixing or relaxation time for these chains is typically a challenging problem, even in the case where the stationary distribution is uniform.
In the current work, the RNA secondary structure modeling naturally leads to a state-space on Catalan objects with a nonuniform distribution, making the corresponding mixing time analysis even more challenging.Another example where mixing times are estimated for Markov chains on Catalan objects with nonuniform stationary distribution is the work of Martin and Randall [30], which examines a Gibbs distribution on Dyck paths weighted by the number of returns to the x-axis.

Limitations and Future Directions
While the mixing time proved here is polynomial, it is almost certainly too large to allow for any practical computational sampling experiments.However, we conjecture the actual mixing time to be much smaller, and future work may provide a better bound.Even without additional theoretical results, interesting work is possible using the algorithm we present and heuristic methods for evaluating Markov chain mixing.See ( [62], Ch. 8) for a discussion of heuristic methods for monitoring Markov chain convergence.The corresponding plane tree has 4 edges and encodes the branching pattern seen in the secondary structure.A plane tree with edges labeled according to the bijection Φ, along with its corresponding 2-Motzkin path.
is H or I, choose a symbol c to be either H or I with probabilities e −α e −α + e −β and e −β e −α + e −β respectively.

Require: x is a valid 2 -
Motzkin path of length m. root ← new Node() // u will be where a new node will be added for an H or D symbol u ← root // v will be always the last node added v ← new Node() // the stack will keep track of previous values of u stack = new Stack() root.children.append(v)for i = 1 → m do node ← new Node() if x[i] = U then v.children.append(node)stack.push(u)u ← v else if x[i] = I then v.children.append(node)else if x[i] = H then u.children.append(node)else if x[i] = D then u.children.append(node)u ← stack.pop()v ← node return root Lemma 8.-ℳ has relaxation time τ rel (ℳ) = O m 4 .

1 4 ( 2 .
m − 1) 1 + e − | α| .To pick b, we let ∂ − S k = y ∈ S k : ∃x ∈ S k − 1 , P (x, y) > 0 for k ∈ {1, • • •, ⌊m/2⌋}, and we let ∂ + S k = y ∈ S k : ∃x ∈ S k + 1 , P (x, y) > 0 for k ∈ {0, • • •, ⌊m/2⌋ − 1}.Additionally, let A k for k ∈ {1, • • •, ⌊m/2⌋} be the subset of S k consisting of the 2-Motzkin paths in which the first D symbol appears immediately after a U. Let B k for k ∈ {0, • • •, ⌊m/2⌋ − 1} be the subset of S k consisting of the 2-Motzkin paths in which a pair of adjacent H symbols occur before all other H or I symbols.It is easy to see that A k ⊂ ∂ − (S k ) and B k ⊂ ∂ + (S k ).We have + e −β m − 2k , as there are C k ways to arrange the U and D symbols and m − 1 2k − 1 ways to insert m − 2k H or I symbols (treating H and I as being identical for now) without placing anything between the first D and the U immediately before it.The energy contribution of the U and D symbols is given by e −α(k+1) , and the energy contribution of the H and I symbols is (e −α + e −β ) m−2k .The required normalizing constant is Z n .Similarly, we also get π B k = C k e −α(k + 3) e −2β Z m m − 1 2k e −α + e −β m − 2k − 2 because there are C k ways to arrange the U and D symbols and m − 1 2k ways insert m − 2k − 1 H or I symbols (treating the initial pair of H's as a single symbol gives us only m − 2k − 1 symbols to insert).The energy contribution of the U's, D's, and the initial two H's isgiven by e −α(k + 3) e −2β , and the energy contribution of the remaining H's and I's is (e −α + e −β ) m−2k−2 .Finally, Z m is again a normalizing constant.Hence combining these two results, we have Additionally, one can check that π(i) is log concave in i.Hence, using Lemma 7, we get τ rel ℳ M = O m 2 , and in turn τ rel (ℳ) = O m 4 , as claimed.□ Lemma 9.-ℳ k has mixing time τ ℳ k = O(m log m), for all k.

2 .
Otherwise, pick a random index j ∈ [m − 2k].Let a ∈ {H, I} be a random symbol such that Pr(a = H) = e −α e −α + e −β and Pr(a = I) = e −β e −α + e −β .Now let X t+1 and Y t+1 be X t and Y t respectively, each with the jth symbol changed to a.
application of Lemma 3 allows us to conclude that the mixing time is also polynomially-bounded.Corollary 1.-ℳ is rapidly mixing.Proof.-Inorder to apply Lemma 3, we need to obtain a polynomial bound on log(1/π(x)) for all x ∈ Ω.Let t ∈ Ω have maximum energy among all elements of Ω.For any x ∈ Ω,

Figure 1 .
Figure 1.A ribonucleic acid (RNA) secondary structure for one of the combinatorial RNA sequences used in this work and its corresponding plane tree.The ordering of the edges in the plane tree is derived from the 3' to 5' ordering of the RNA sequence.Note that the exterior loop corresponds to the root of the plane tree.The diagram in (a) was generated by ViennaRNA[19].(a) A maximally-paired secondary structure for A 4 (C 5 GA 4 CG 5 A 4 ) 4 has 4 helices;(b)

Figure 3 .
Figure 3.The four level decomposition of M m 2 (left), and the projection chains corresponding to each decomposition (right).