1. Introduction
The hidden Markov model (HMM) is a probabilistic graphical model well-suited for stochastic processes and has achieved broad success in applications involving inference over time-series and sequential data [1,2,3]. While efficient inference via the linear-time Viterbi algorithm enables tractable decoding of latent state sequences from observed data [4], the practical effectiveness of the HMM depends critically on the availability of reliable methods for parameter estimation. Well-established learning techniques, in particular maximum likelihood estimation for fully observed training data and the expectation–maximization-based Baum–Welch algorithm for data with latent or missing structures, make it possible for HMM parameters to be robustly learned from data [5,6]. It is this combination, and especially the maturity of parameter estimation methodologies, that has translated the HMM’s theoretical formulation into a widely adopted modeling framework.
While the standard HMM is effective at modeling local dependencies, applications of sequential data may require inference of richer structural information that goes beyond simple stochastic processes. Such structures arise in diverse settings, such as syntactic and semantic dependencies in linguistic sequences [7,8], bio-residue interactions that determine biomolecular structure and function [9,10], and latent patterns in time-series data such as weather forecasting and market prediction [11,12,13,14]. While such higher-order dependences between random events have been modeled with extensions to the standard HMM [15,16], these extensions rarely achieve the same level of practical effectiveness as the standard HMM. Due to the increased model complexity, parameter estimation becomes substantially harder than in the standard HMM; the richer structural dependencies dramatically increase the number of parameters and complicate latent inference. In particular, learning in higher-order HMMs suffers from exponential parameter growth and data sparsity [17,18], while stochastic context-free grammar (SCFG) [19] training is computationally expensive and statistically ineffective [20,21].
More recently, the arbitrary-order hidden Markov model (α-HMM) [22] has been introduced as a generalization of the standard HMM with the expressiveness to overcome the shortcomings of previous extensions to the HMM, e.g., higher-order HMMs and the SCFG. The α-HMM is equipped with a mechanism to express dependences between events at arbitrary distances, with a well-defined joint probability distribution over an observable sequence of data and the hidden events that generate the data. Such dependences may form nested, parallel, and crossing patterns, some of which are beyond what the SCFG is capable of characterizing. One significant utility of the α-HMM, as with the standard HMM, is its ability to uncover higher-order dependencies among latent events through decoding. Though the task of decoding with the full strength of the α-HMM can be computationally intractable, it has been shown that prediction of an optimal set of dependences induced by exclusive pairs of hidden events can be accomplished much more efficiently, in polynomial time, on observed sequences of n data items [22].
To translate the theoretical potential of the α-HMM into practical utility, its decoding algorithm must be paired with an effective method for probabilistic parameter estimation. However, because the α-HMM is strictly more expressive than higher-order HMMs and stochastic context-free grammars, parameter estimation methods developed for those models are no longer adequate. Moreover, with large-scale datasets, higher-order dependency structures can vary substantially across instances, making a single global parameter set insufficient. To address these challenges, we propose an amortized learning method [23,24] that enables scalable, instance-specific parameter inference and fully exploits the modeling capacity of the α-HMM. Rather than estimating a single set of global parameters shared by all sequences, we learn an input-conditioned parameter estimator from data, which generates instance-specific α-HMM parameters for each input sequence whose dependence structures are to be predicted.
Specifically, the amortized parameter estimator is trained to infer parameters by optimizing a composite objective that integrates global and local structural constraints on the latent-event sequence. One term encourages the reference hidden-event path to receive a high score under the inferred parameters. A second term applies supervision to the reference hidden-state sequence, helping stabilize locally consistent state-transition behavior. A third term introduces a structured margin that separates the reference path from competing paths produced by Viterbi-style decoding under the same parameters. A fourth, task-specific term further guides the model toward the desired structured output. Together, these terms favor parameter settings that support correct global decoding, locally consistent latent states, and task-relevant structural predictions. Evaluated on RNA secondary structure prediction, the proposed framework achieves strong empirical performance competitive with state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 reviews the hidden Markov model and introduces some fundamentals of machine learning. Section 3 gives a thorough description of the α-HMM, its optimal decoding algorithm, and its expressiveness. Section 4 presents the amortized parameter inference framework in detail. Section 5 demonstrates the performance of the proposed work through comparisons with state-of-the-art methods on RNA secondary structure prediction. We conclude in Section 6 with some discussions.
3. Arbitrary-Order Hidden Markov Model
3.1. The α-HMM
The HMM is extended to encompass a framework for modeling higher-order dependences that go beyond linear relationships between consecutive events in stochastic processes. Technically, the dependency between two states can be specified via a distribution of the joint probability of data emissions by the two states [22].
Definition 2. An influence is an ordered pair of states (p, q), p, q ∈ S, equipped with a probability distribution B_pq over pairs of emitted symbols such that, for every fixed symbol a, the values B_pq(a, b) sum to one over b. Let {(X_t, Y_t)} be a stochastic process where X_i and X_j, i < j, are two state variables. Then X_i is a parent of X_j if there is an influence (p, q) such that X_i = p, X_j = q, and i is the largest such integer.
In addition, we denote with pa(X) the set of parents of a state variable X. Also let de(X) represent the set of descendants of X.
Definition 3. An arbitrary-order hidden Markov model (α-HMM) is a bivariate stochastic process {(X_t, Y_t)}, where X_t draws values from a finite set S of states and Y_t draws values from a finite set Σ of symbols, such that, for finite n, the joint probability of (X_1, …, X_n, Y_1, …, Y_n) is defined by Equation (3). We point out that Equation (3) does not specify how the last conditional probability term, the probability of emission Y_k given state X_k and the parents of X_k, is computed. This resembles the Bayesian network setting, where different computation methods may be adopted. Since the parent variables may be interdependent, one viable method is the Noisy-OR [29]. While it is a heuristic method, normalization with the partition function can be applied to ensure a well-defined probability distribution, as used in clique factorization in Markov random fields [30,31], assuming that influences on Y_k from its parent variables are independent. In the present work, however, we focus on the 1-HMM case (Definition 4 below), in which every state variable has at most one parent, so each emission is conditioned on at most a single influence; this is the assumption we adopt for this work.
By Definition 3, the conditional probability in (4) is then given by the distribution of the single influence acting on the emitting state.
Definition 4. For a fixed integer h, an α-HMM is called an h-HMM if, for every state variable X in the stochastic process, both |pa(X)| ≤ h and |de(X)| ≤ h.
In an h-HMM, no state variable can influence or be influenced by more than h variables. Thus, the 0-HMM is exactly the standard HMM, where each symbol is completely determined by its emitting state. The parameter h defines a family of h-HMM models with different expressive levels. In this paper, we focus on the 1-HMM case, which is the specific model used in our decoding algorithm and experiments.
3.2. Expressiveness of the α-HMM
The extension to the HMM with the new notion of influence has endowed the α-HMM with substantial modeling capability. First, by Definition 2 (for an influence from X_i to X_j), the index i needs to be the largest, which makes it possible to model, with the α-HMM, influences of nested as well as parallel patterns that underlie the essentials of the SCFG. In addition, the α-HMM can also model influences of crossing patterns, exceeding the SCFG in expressiveness [19]. Figure 1 (1) shows an example of the α-HMM’s ability to model nested, parallel, and crossing patterns of influences.
Moreover, the α-HMM has the modeling power of k-order HMMs, for any fixed k ≥ 2. Figure 1 (2) shows an example of an α-HMM that is a canonical simulation of second-order HMMs. It is not difficult to see that similar examples of the α-HMM can be constructed to simulate the work of k-order HMMs, for any fixed k ≥ 2.
3.3. Optimal Decoding with the α-HMM
One of the central tasks of modeling with the α-HMM is to decode the hidden states that have generated an observed sequence of symbols over the alphabet Σ, such that the decoded hidden variables achieve the maximum likelihood.
Definition 5. For an α-HMM with state set S and alphabet Σ, the decoding problem is, given an observation y1 y2 … yn, to find a sequence of states x1* x2* … xn* that maximizes the joint probability of the states and the observation (Equation (5)). We call x1* x2* … xn* the predicted optimal path for the observation. To solve Equation (5), we resort to Formulas (3) and (4), which make it possible to develop a dynamic programming algorithm for the optimal decoding.
Definition 6. Let 1 ≤ i < k ≤ n and s, t ∈ S. Define δ(k, t, i) as the maximum probability of a state sequence x1 … xk in which xk = t emits symbol yk, (xi, xk) is an influence, and i is the largest such index.
By (3) the joint probability can be factorized; accordingly, we derive a recurrence that computes δ(k, t, i) position by position over an observed sequence of symbols y1 … yn. The optimally decoded hidden state sequence for the observed sequence is then recovered by maximizing over the entries at position n and backtracking.
It is clear that a table indexed by the positions and states can be used to store the values δ(k, t, i), which can be computed by a dynamic programming algorithm whose worst-case time complexity is polynomial in n, where the number of states in the α-HMM is a constant.
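For concreteness, the dynamic program specializes, in the influence-free standard-HMM case, to the classic Viterbi recurrence. The following is a minimal sketch of that base case; the names pi, T, and E for the start, transition, and emission parameters are ours.

```python
import numpy as np

def viterbi(obs, pi, T, E):
    """Classic Viterbi decoding for a standard HMM (the influence-free case).

    obs : list of symbol indices, length n
    pi  : (S,) start distribution
    T   : (S, S) transition probabilities, T[i, j] = P(state j | state i)
    E   : (S, V) emission probabilities, E[s, v] = P(symbol v | state s)
    """
    n, S = len(obs), len(pi)
    logv = np.full((n, S), -np.inf)      # logv[k, s]: best log-prob of a path ending in s at k
    back = np.zeros((n, S), dtype=int)   # backpointers for path recovery
    logv[0] = np.log(pi) + np.log(E[:, obs[0]])
    for k in range(1, n):
        cand = logv[k - 1][:, None] + np.log(T)       # (S, S) candidate scores
        back[k] = np.argmax(cand, axis=0)             # best predecessor per state
        logv[k] = cand[back[k], np.arange(S)] + np.log(E[:, obs[k]])
    # Backtrack the optimal state path from the best final state.
    path = [int(np.argmax(logv[-1]))]
    for k in range(n - 1, 0, -1):
        path.append(int(back[k][path[-1]]))
    return path[::-1]
```

The α-HMM recurrence extends this table with the extra index tracking the most recent influencing position.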
4. Amortized Parameter Inference for the α-HMM
Higher-order dependency structures modeled by the α-HMM can vary substantially across instances, especially in the presence of large datasets, making a single global parameter set insufficient. To address these challenges, we propose an amortized learning approach that enables scalable, instance-specific parameter inference and fully exploits the modeling capacity of the α-HMM. Rather than estimating a single set of global parameters shared by all sequences, we learn an input-conditioned neural parameter estimator from data, which generates instance-specific α-HMM parameters for each input sequence whose dependence structures are to be predicted.
The input to an α-HMM is a variable-length sequence whose symbols are drawn from a fixed vocabulary (alphabet) Σ. The vocabulary size |Σ| and the number of states |S| determine the dimensionality of key model components: emission parameters are defined over Σ (distributions of size |Σ| per emitting state), and both pairwise influence and transition parameters form square tables over pairs of states. Specifically, every influence is defined over an ordered pair of states, encoding a left-to-right influence that accounts for the dependence between their symbol emissions.
For amortized learning, we first map each input sequence into a vectorized pre-encoder representation before feeding it into the estimator. This vectorization can be implemented either by one-hot encoding over Σ, or by a learned embedding table that maps each symbol to a d-dimensional vector, where d is a user-specified hyperparameter. To preserve order information, we optionally add a positional embedding so that the same symbol occurring at different positions can receive different representations. Positional embeddings can be implemented either as fixed sinusoidal features or as learned d-dimensional vectors. This yields a pre-encoder representation sequence v1, …, vn, where each vk is a d-dimensional vector. Overall, the above steps define a deterministic encoding from the input sequence to its pre-encoder representation.
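The steps above can be sketched as follows, using one-hot encoding plus fixed sinusoidal positional features. The four-symbol vocabulary and the dimension d = 4 are illustrative choices, not the paper's configuration.

```python
import numpy as np

# Illustrative 4-symbol vocabulary (RNA-like, for concreteness).
VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3}

def one_hot(sym):
    # One-hot vector of size |VOCAB| for a single symbol.
    v = np.zeros(len(VOCAB))
    v[VOCAB[sym]] = 1.0
    return v

def sinusoidal(pos, d):
    # Standard fixed sinusoidal positional features of (even) dimension d.
    i = np.arange(d // 2)
    angles = pos / (10000.0 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

def pre_encode(seq, d=4):
    # Pre-encoder representation: symbol vector plus positional embedding.
    return np.stack([one_hot(s) + sinusoidal(k, d) for k, s in enumerate(seq)])

X = pre_encode("ACGA")  # (4, 4) array; the two A's get different rows
```

Note that the positional term is what distinguishes the two occurrences of A in this example.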
However, both one-hot vectors and token embeddings are context-independent: the representation of a symbol depends only on its identity (and position, if positional encodings are used), but not on the surrounding symbols in the sequence. As a result, the same symbol occurring at the same position receives the same representation across different sequences, regardless of context. To incorporate contextual information, we apply a lightweight Transformer encoder. Its self-attention mechanism allows each position to aggregate information from the entire sequence, yielding context-dependent representations that are subsequently used by the parameter estimator. We denote the resulting contextual representation sequence by h1, …, hn, where hk is the contextual embedding at position k.
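To illustrate why self-attention yields context-dependent representations, here is a minimal single-head sketch; the weight matrices Wq, Wk, Wv are illustrative stand-ins for the full Transformer encoder used in the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One self-attention layer: every position attends to the whole sequence,
    so identical input vectors in different contexts map to different outputs."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # (n, n) attention weights
    return A @ V                                  # context-dependent representations
```

Feeding two sequences that share a row but differ elsewhere produces different outputs at the shared position, which is exactly the contextualization the encoder provides.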
As discussed in Section 3, any specific α-HMM is parameterized by its emission, transition, and influence parameters. We perform amortized inference with an input-conditioned estimator that maps an input sequence directly to such instance-specific parameters.
We point out that designing transitions and influences is particularly important because they govern the complexity of the latent states and the overall modeling capacity. An overly simple model structure may lack the specificity in expressiveness required by the task, while an overly complex one can make inference and training unstable or data-inefficient. In the simplest setting, the total number of unconstrained outputs can be viewed as the sum of three parts: emission logits for the involved states, influence logits for every pairwise influence, and an additional number of parameters for transitions (depending on the chosen transition structure and the number of states).
To map from the encoder space, we first apply an attention-based pooling layer to obtain a fixed-length summary vector z, which highlights informative positions in variable-length sequences. A multi-layer perceptron (MLP) then maps z to the three groups of unconstrained logits for emissions, transitions, and influences.
Since the emission and influence parameters represent (conditional) probability distributions, we apply softmax along the appropriate dimensions to project the unconstrained logits into the corresponding probability space. For the transition matrix, whose structure is user-specified and may vary across settings, we recommend a sigmoid-based constrained construction so that the resulting transition probabilities respect the prescribed structure and each row sums to one.
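As a sketch of this projection step: softmax normalizes each logit row into a distribution, while a sigmoid gate splits a row's mass between prescribed successors. The two-successor transition structure below is an illustrative assumption, not the exact structure used in the paper.

```python
import numpy as np

def softmax_rows(Z):
    # Project each row of a logit matrix onto the probability simplex.
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def constrained_transition_row(logit, stay_idx, leave_idx, n_states):
    """Build one transition row whose support is restricted to two prescribed
    successors: a single sigmoid gate splits the mass, so the row sums to one."""
    p = sigmoid(logit)
    row = np.zeros(n_states)
    row[stay_idx], row[leave_idx] = 1.0 - p, p
    return row
```

Rows outside the prescribed support stay exactly zero, which is what "respecting the structure" means here.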
We train the amortized estimator end-to-end (including the encoder) on a dataset of triples, each consisting of an observed sequence, a task annotation, and a reference hidden-state sequence consistent with the annotation under the task constraints.
We optimize with a composite objective consisting of four terms: (i) a path-level loss that encourages the reference hidden-event path to receive a high score under the inferred parameters; (ii) a state-sequence loss that applies supervision to the reference hidden-state sequence through its local state transitions, thereby helping stabilize the learning of locally consistent state-transition behavior; (iii) a structured adversarial loss that enforces separation between the reference path and a competing decoded path under the current predicted parameters; and (iv) a task-specific structured loss that further guides the model toward the desired structured output.
We combine these losses with learnable uncertainty-style weights: scalar weighting parameters are learned jointly with the estimator so that the relative contribution of each term is adapted during training. Concrete task-specific instantiations of these four terms are given in the following section.
5. Performance Evaluation
We evaluate the proposed amortized parameter inference method for the α-HMM on RNA secondary structure prediction, a well-studied problem in computational biology with a rich history of modeling approaches. Existing approaches to RNA secondary structure prediction often involve a trade-off between interpretability and predictive accuracy. In this section, we show that RNA secondary structure prediction can be formulated in a simple, accurate, and interpretable manner with the α-HMM, thereby demonstrating the effectiveness of the proposed parameter inference method.
5.1. The α-HMM for RNA Secondary Structure
A ribonucleic acid (RNA) consists of nucleotides adenine (A), guanine (G), cytosine (C), and uracil (U) that are chained together by covalent bonds to form the primary sequence. An RNA sequence can fold back on itself via canonical Watson–Crick base pairs A-U and C-G, and the wobble base pair G-U, via hydrogen bonds between the nucleotides. Consecutive, stacked base pairs, called stems, connected by unpaired single strands, called loops, constitute the secondary structure, acting as a scaffold for its three-dimensional structure (Figure 2). Therefore, determination or prediction of RNA secondary structure is important to understanding the three-dimensional structures and thus functions of RNAs.
The problem of RNA secondary structure prediction is to computationally discover the most plausible secondary structure given only a primary sequence. In essence, this requires determining which (remote) pairs of nucleotides exclusively form canonical base pairs. Such a phenomenon can be well captured by the expressive power of the α-HMM. Specifically, with the α-HMM, the nucleotides of an RNA molecule are modeled by emissions of states, the primary sequence of nucleotides connected by covalent bonds is modeled by the sequence of state transitions, and, most importantly, the base pairs are modeled by influences between states and their emissions of nucleotides.
Figure 3B shows a general α-HMM graph for modeling RNA secondary structures. The model can capture all patterns of higher-level stem relationships in the RNA secondary structure, including nested and parallel, as in tRNA (Figure 2), and crossing (pseudoknot), as in Figure 3A. In addition, every stem, a stack of base pairs, is modeled with three influences from state L to states P, Q, and R, respectively, where the innermost base pair is influenced by (L, P), the second innermost by (L, Q), and the third and beyond by (L, R).
It is clear that the hidden states of the α-HMM emitting an observable nucleotide sequence contain all the information needed for the secondary structure of the sequence. Thus, the structure prediction task is translated into a problem of decoding with the α-HMM.
5.2. RNA-Specific Parameterization of the α-HMM Estimator
To decode hidden states with the α-HMM on an input RNA sequence of length n, the transition, emission, and influence parameters are predicted by the learned parameter estimator from the sequence itself. From Figure 3, we formulate the following matrices:
Here, the transition probability matrix is defined over the states of the model. Most of its rows are fixed by the model structure, while the R-row is parameterized by learned quantities constrained to sum to one. In the RNA implementation used in this work, we further introduce two leave probabilities to control the transitions from states P and Q, respectively.
Estimating the parameters for the α-HMM on each input RNA sequence requires five scalars for the transition structure, four logits for the emission distribution, and 16 logits for the influence pair matrix, for a total of 25 sequence-specific parameters. Other quantities used during decoding, such as the start distribution and several state-dependent costs, are global trainable parameters shared across sequences.
To infer these 25 parameters, we first pass the RNA sequence through a neural encoder, which maps each nucleotide (A, C, G, U) to a learnable 640-dimensional embedding vector. The resulting embedding sequence is then processed by a three-layer Transformer encoder with rotary positional encoding, a hidden dimension of 640, 10 attention heads, GELU activations in the feed-forward sublayers, and dropout. The contextualized representation at each sequence position is then passed through a token-level trunk comprising a linear projection from 640 to 320 dimensions, a GELU activation, a layer-normalization layer, and one residual feed-forward block with hidden dimensions of 320 and dropout. These token features are then aggregated by an attention-pooling layer to obtain a sequence-level representation, which is mapped to the 25 sequence-specific parameters by a layer-normalization layer followed by a linear projection. For the tRNA experiments, we use a simplified instantiation of the model in which only three of the transition scalars are predicted as sequence-specific parameters, while the remaining two are not inferred and are instead fixed to zero. Consequently, the tRNA variant predicts 23 sequence-specific parameters rather than 25. The tRNA encoder is also smaller, using a two-layer Transformer with hidden dimensions of 128, feed-forward dimensions of 192, 4 attention heads, learned absolute positional embeddings, and dropout. In both model variants, the three influence roles are parameter-tied through a single shared nucleotide-pair matrix.
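The pooling-plus-projection head at the end of this pipeline can be sketched as follows, with dimensions reduced for illustration; the parameter names w, W_out, and b_out are ours.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, w):
    """Attention pooling: score each position with vector w, then take the
    softmax-weighted average, giving one fixed-length summary for any length n."""
    a = softmax(H @ w)          # (n,) attention weights over positions
    return a @ H                # (d,) sequence-level summary

def estimator_head(H, w, W_out, b_out):
    # Map the pooled summary to the 25 unconstrained sequence-specific outputs.
    return attention_pool(H, w) @ W_out + b_out
```

Because the pooled summary has a fixed dimension, the same head serves sequences of any length.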
5.3. Training Objective for RNA Secondary Structure Prediction
We train the model using AdamW [32] with a fixed learning rate and weight decay. Following the general loss formulation introduced in Section 4, we instantiate the RNA secondary structure objective as a combination of the path-level, structured adversarial, state-sequence, and base-pair supervision losses, with learnable scalar weights.
Let y denote an input RNA sequence, x the reference hidden-state sequence, and a the reference dot-bracket secondary structure. Let s(x; y) denote the score of the reference state path under the sequence-specific parameters predicted by the estimator. The path-level loss penalizes the model when s(x; y) is low, which encourages the reference path to receive a sufficiently high score.
The structured adversarial loss enforces a margin, controlled by constant margin parameters, between the score of the reference path and the scores of decoded hard negative candidates.
The state-sequence loss is defined as the negative log-likelihood of the reference hidden-state chain under the predicted start and transition distributions.
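A minimal sketch of this negative log-likelihood, with pi and T as assumed names for the predicted start and transition distributions:

```python
import numpy as np

def state_chain_nll(states, pi, T):
    """Negative log-likelihood of a reference hidden-state chain under the
    predicted start distribution pi and transition matrix T."""
    nll = -np.log(pi[states[0]])
    for a, b in zip(states[:-1], states[1:]):
        nll -= np.log(T[a, b])      # one transition term per consecutive pair
    return float(nll)
```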
In the RNA secondary structure setting, the task-specific term is instantiated as a pairwise base-pair supervision loss over positive and sampled negative base-pair candidates.
Decoding is discrete and non-differentiable in our implementation. In particular, the Viterbi-style dynamic program is used only to generate hard negative candidates for the structured adversarial loss, and gradients are not propagated through the discrete backtracking step. Thus, training does not rely on REINFORCE-style gradient estimation; instead, the amortized estimator is optimized through the differentiable surrogate losses above.
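To make the detachment concrete, here is a hedged sketch: the decoder contributes only discrete path indices, while the score of a fixed path is an ordinary differentiable function of the log-parameters. Start-distribution terms are omitted for brevity, and all names are illustrative.

```python
import numpy as np

def path_score(path, obs, logT, logE):
    """Score of a fixed state path: decoding supplies only the discrete
    indices, while the score itself is a sum of parameter terms (and hence
    differentiable in the parameters when computed in an autodiff framework)."""
    s = logE[path[0], obs[0]]
    for k in range(1, len(path)):
        s += logT[path[k - 1], path[k]] + logE[path[k], obs[k]]
    return s

def margin_loss(ref_path, neg_paths, obs, logT, logE, margin=1.0):
    # Hinge separation between the reference path and decoded hard negatives.
    ref = path_score(ref_path, obs, logT, logE)
    return sum(max(0.0, margin + path_score(p, obs, logT, logE) - ref)
               for p in neg_paths)
```

In a framework such as PyTorch, the negatives' index lists would be produced under no-grad decoding, so gradients flow only through the scoring sums.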
5.4. Datasets and Evaluation Settings
We evaluate the α-HMM on RNA secondary structure prediction under four complementary settings designed to assess performance within the tRNA family, on temporally held-out tRNA sequences, on broader RNA families within the training distribution, and on out-of-domain RNA families.
First, we evaluate the model on the curated T-psi-C dataset [33], which contains annotated tRNA sequences. On this dataset, we conduct two evaluations: (i) an in-domain evaluation using 10-fold cross-validation on tRNA sequences released before 2024; and (ii) a temporally held-out evaluation in which the model is trained on all pre-2024 sequences and evaluated on 14 tRNA sequences released in 2024.
Second, to assess generalization beyond the relatively homogeneous tRNA family, we train the broader-family model on RNAStralign and evaluate it on both an in-domain test split and the out-of-domain ArchiveII benchmark. RNAStralign is a widely used RNA secondary structure dataset containing 37,149 sequences from eight RNA families: 16S rRNA, tRNA, 5S rRNA, group I intron (grp1), SRP, tmRNA, RNase P, and telomerase RNA. ArchiveII contains 3975 sequences from ten RNA families, including two additional families, 23S rRNA and group II intron (grp2), beyond those represented in RNAStralign.
In the broader-family experiments, we follow the general data-processing protocol adopted for learning-based RNA secondary structure prediction benchmarks and exclude pseudoknotted structures and sequences longer than 600 nucleotides, since these are beyond the scope of the present work. After preprocessing, the resulting dataset sizes are 13,950 sequences for training, 2073 for validation, 1439 for the in-domain RNAStralign test split, and 1119 for the out-of-domain ArchiveII test set.
5.5. Baselines and Evaluation Protocol
We compare against CONTRAfold [34], ViennaRNA (RNAfold MFE and centroid) [35], LinearFold [36], SPOT-RNA [37], RNA-FM [38], E2Efold [39], MXfold2 [40], and KnotFold [41].
For the pre-2024 T-psi-C benchmark, all methods are evaluated under the same 10-fold cross-validation protocol, and we report mean ± std across folds (Table 1). For broader-family evaluation, we additionally report results on ArchiveII (Table 2).
5.6. Results on tRNA Benchmarks
As summarized in Table 1, our neural α-HMM achieves strong pair-level accuracy on tRNA, with particularly strong precision and F1 compared with the baselines considered here. These results indicate that the proposed amortized parameter inference framework is highly effective in the homogeneous tRNA setting.
Negative controls. To discourage memorization of common tRNA secondary-structure patterns, we construct two negative sets: (i) random negatives, where for each tRNA length L we generate a sequence by sampling each nucleotide uniformly from {A, C, G, U}; and (ii) shuffled negatives, where each tRNA sequence is randomly permuted, preserving the mononucleotide composition but disrupting the biological structure. These negatives are used only for evaluation. Our method yields low paired-nucleotide ratios (PNRs) on both shuffled and random negatives, indicating that the model does not simply predict high pairing rates on arbitrary sequences.
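The two negative-control constructions can be sketched as follows; the function names are ours.

```python
import random

def random_negative(length, alphabet="ACGU", seed=None):
    """Random negative: sample each nucleotide uniformly, matching the length."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))

def shuffled_negative(seq, seed=None):
    """Shuffled negative: permute the sequence, preserving the mononucleotide
    composition while disrupting the biological structure."""
    rng = random.Random(seed)
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)
```

The shuffled control is the stricter of the two, since it keeps the exact base composition of each real tRNA.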
Temporal evaluation. To assess temporal generalization, we evaluate the model on 14 tRNA sequences released in 2024, all of which are excluded from training and validation.
The α-HMM achieves precision = 1.000, recall = 1.000, and F1 = 1.000 on this held-out set. Although this result is encouraging, it should be interpreted cautiously because the temporal test set is small and drawn from a structurally homogeneous RNA family. We visualize one such prediction in Figure 4, showing high-quality structure recovery on entry tdbR00000239 compared with several baseline methods.
5.7. Results on Broader RNA Families
We further evaluate the proposed method beyond the relatively homogeneous tRNA family using the broader-family RNAStralign and ArchiveII benchmarks. On the in-domain RNAStralign test split, our method achieves strong precision, recall, and F1. On the out-of-domain ArchiveII benchmark, Table 2 shows that our method achieves the strongest F1 among the compared methods while also maintaining very high precision. Together, these results suggest that the proposed amortized α-HMM framework remains competitive not only within the broader training distribution but also on structurally diverse RNA families outside it.
Figure 5 shows a long 16S rRNA example (484 nt) predicted with high base-pair F1, indicating that our model captures long-range base-pair interactions.