1. Introduction
The hidden Markov model (HMM) is a probabilistic graphical model well-suited for stochastic processes and has achieved broad success in applications involving inference over time-series and sequential data [1,2,3]. While efficient inference via the linear-time Viterbi algorithm enables tractable decoding of latent state sequences from observed data [4], the practical effectiveness of the HMM depends critically on the availability of reliable methods for parameter estimation. Well-established learning techniques, in particular maximum likelihood estimation for fully observed training data and the expectation–maximization-based Baum–Welch algorithm for data with latent or missing structures, make it possible for HMM parameters to be robustly learned from data [5,6]. It is this combination, and especially the maturity of parameter estimation methodologies, that has translated the HMM’s theoretical formulation into a widely adopted modeling framework.
While the standard HMM is effective at modeling local dependencies, applications of sequential data may require inference of richer structural information that goes beyond simple stochastic processes. Such structures arise in diverse settings, such as syntactic and semantic dependencies in linguistic sequences [7,8], bio-residue interactions that determine biomolecular structure and function [9,10], and latent patterns in time-series data such as weather forecasting and market prediction [11,12,13,14]. While such higher-order dependences between random events have been modeled with extensions to the standard HMM [15,16], these extensions rarely achieve the same level of practical effectiveness as the standard HMM. Due to the increased model complexity, parameter estimation becomes substantially harder than in the standard HMM; the richer structural dependencies dramatically increase the number of parameters and complicate latent inference. In particular, learning in higher-order HMMs suffers from exponential parameter growth and data sparsity [17,18], while stochastic context-free grammar (SCFG) [19] training is computationally expensive and statistically ineffective [20,21].
More recently, the arbitrary-order hidden Markov model (α-HMM) [22] has been introduced as a generalization of the standard HMM with the expressiveness to overcome the shortcomings of previous extensions to the HMM, e.g., higher-order HMMs and the SCFG. The α-HMM is equipped with a mechanism to express dependences between events at arbitrary distances, with a well-defined joint probability distribution over an observable sequence of data and the hidden events that generate the data. Such dependences may form nested, parallel, and crossing patterns, some of which are beyond what the SCFG is capable of characterizing. One significant utility of the α-HMM, as with the standard HMM, is its ability to uncover higher-order dependencies among latent events through decoding. Though the task of decoding with the full strength of the α-HMM can be computationally intractable, it has been shown that prediction of an optimal set of dependences induced by exclusive pairs of hidden events can be accomplished much more efficiently, in polynomial time, on observed sequences of n data items [22].
To translate the theoretical potential of the α-HMM into practical utility, its decoding algorithm must be paired with an effective method for probabilistic parameter estimation. However, because the α-HMM is strictly more expressive than higher-order HMMs and stochastic context-free grammars, parameter estimation methods developed for those models are no longer adequate. Moreover, with large-scale datasets, higher-order dependency structures can vary substantially across instances, making a single global parameter set insufficient. To address these challenges, we propose an amortized learning method [23,24] that enables scalable, instance-specific parameter inference and fully exploits the modeling capacity of the α-HMM. Rather than estimating a single set of global parameters shared by all sequences, we learn an input-conditioned parameter estimator from data, which generates instance-specific α-HMM parameters for each input sequence whose dependence structures are to be predicted.
Specifically, the amortized parameter estimator is trained to infer parameters by optimizing a composite objective that integrates global and local structural constraints on the latent-event sequence. One term encourages the reference hidden-event path to receive a high score under the inferred parameters. A second term applies supervision to the reference hidden-state sequence, helping stabilize locally consistent state-transition behavior. A third term introduces a structured margin that separates the reference path from competing paths produced by Viterbi-style decoding under the same parameters. A fourth, task-specific term further guides the model toward the desired structured output. Together, these terms favor parameter settings that support correct global decoding, locally consistent latent states, and task-relevant structural predictions. Evaluated on RNA secondary structure prediction, the proposed framework achieves strong empirical performance competitive with state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 reviews the hidden Markov model and introduces some fundamentals of machine learning. Section 3 gives a thorough description of the α-HMM, its optimal decoding algorithm, and its expressiveness. Section 4 presents the amortized parameter inference framework in detail. Section 5 demonstrates the performance of the proposed work through comparisons with state-of-the-art methods on RNA secondary structure prediction. We conclude in Section 6 with some discussions.
3. Arbitrary-Order Hidden Markov Model
3.1. The α-HMM
The HMM is extended to encompass a framework for modeling higher-order dependences that go beyond linear relationships between consecutive events in stochastic processes. Technically, the dependency between two states can be specified via a distribution of the joint probability of data emissions by the two states [22].
Definition 2. An influence is an ordered pair of states (p, q), p, q ∈ S, equipped with a probability distribution B_pq over pairs of emitted symbols such that, for every fixed symbol a, the values B_pq(a, b) sum to one over b. Let {(X_t, Y_t)} be a stochastic process where X_i and X_j, i < j, are two state variables. Then X_i is a parent of X_j if there is an influence (p, q) such that X_i = p, X_j = q, and i is the largest such integer.
In addition, we denote with pa(X) the set of parents of a state variable X. Also let de(X) represent the set of descendants of X.
Definition 3. An arbitrary-order hidden Markov model (α-HMM) is a bivariate stochastic process {(X_t, Y_t)}, where X_t draws values from a finite set S of states and Y_t draws values from a finite set Σ of symbols, such that, for finite n, the joint probability of (X_1, …, X_n, Y_1, …, Y_n) is defined by Equation (3). We point out that Equation (3) does not specify how the last conditional probability term, the probability of emission Y_k given state X_k and the parents of X_k, is computed. This resembles the Bayesian network setting, where different computation methods may be adopted. Since the parent variables may be interdependent, one viable method is the Noisy-OR [29]. While it is a heuristic method, normalization with the partition function can be applied to ensure a well-defined probability distribution, as used in clique factorization in Markov random fields [30,31], assuming that influences on Y_k from its parent variables are independent. In the present work, however, we focus on the 1-HMM case (Definition 4 below), in which every state variable has at most one parent, so each emission is conditioned on at most a single influence; this is the assumption we adopt for this work.
By Definition 3, the conditional probability in (4) is then given by the distribution of the single influence acting on the emitting state.
Definition 4. For a fixed integer h, an α-HMM is called an h-HMM if, for every state variable X in the stochastic process, both |pa(X)| ≤ h and |de(X)| ≤ h.
In an h-HMM, no state variable can influence or be influenced by more than h variables. Thus, the 0-HMM is exactly the standard HMM, where each symbol is completely determined by its emitting state. The parameter h defines a family of h-HMM models with different expressive levels. In this paper, we focus on the 1-HMM case, which is the specific model used in our decoding algorithm and experiments.
3.2. Expressiveness of the α-HMM
The extension to the HMM with the new notion of influence has endowed the α-HMM with substantial modeling capability. First, by Definition 2 (for an influence from X_i to X_j), the index i needs to be the largest, which makes it possible to model, with the α-HMM, influences of nested as well as parallel patterns that underlie the essentials of the SCFG. In addition, the α-HMM can also model influences of crossing patterns, exceeding the SCFG in expressiveness [19]. Figure 1 (1) shows an example of the α-HMM’s ability to model nested, parallel, and crossing patterns of influences.
Moreover, the α-HMM has the modeling power of k-order HMMs, for any fixed k ≥ 2. Figure 1 (2) shows an example of an α-HMM that is a canonical simulation of second-order HMMs. It is not difficult to see that similar examples of the α-HMM can be constructed to simulate the work of k-order HMMs, for any fixed k ≥ 2.
3.3. Optimal Decoding with the α-HMM
One of the central tasks of modeling with the α-HMM is to decode the hidden states that have generated an observed sequence of symbols over the alphabet Σ, such that the decoded hidden variables achieve the maximum likelihood.
Definition 5. For an α-HMM with state set S and alphabet Σ, the decoding problem is, given an observation y1 y2 … yn, to find a sequence of states x1* x2* … xn* that maximizes the joint probability of the states and the observation (Equation (5)). We call x1* x2* … xn* the predicted optimal path for the observation. To solve Equation (5), we resort to Formulas (3) and (4), which make it possible to develop a dynamic programming algorithm for the optimal decoding.
Definition 6. Let 1 ≤ i < k ≤ n and s, t ∈ S. Define δ(k, t, i) as the maximum probability of a state sequence x1 … xk in which xk = t emits symbol yk, (xi, xk) is an influence, and i is the largest such index.
By (3) the joint probability can be factorized; accordingly, we derive a recurrence that computes δ(k, t, i) position by position over an observed sequence of symbols y1 … yn. The optimally decoded hidden state sequence for the observed sequence is then recovered by maximizing over the entries at position n and backtracking.
It is clear that a table indexed by the positions and states can be used to store the values δ(k, t, i), which can be computed by a dynamic programming algorithm whose worst-case time complexity is polynomial in n, where the number of states in the α-HMM is a constant.
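For concreteness, the dynamic program specializes, in the influence-free standard-HMM case, to the classic Viterbi recurrence. The following is a minimal sketch of that base case; the names pi, T, and E for the start, transition, and emission parameters are ours.

```python
import numpy as np

def viterbi(obs, pi, T, E):
    """Classic Viterbi decoding for a standard HMM (the influence-free case).

    obs : list of symbol indices, length n
    pi  : (S,) start distribution
    T   : (S, S) transition probabilities, T[i, j] = P(state j | state i)
    E   : (S, V) emission probabilities, E[s, v] = P(symbol v | state s)
    """
    n, S = len(obs), len(pi)
    logv = np.full((n, S), -np.inf)      # logv[k, s]: best log-prob of a path ending in s at k
    back = np.zeros((n, S), dtype=int)   # backpointers for path recovery
    logv[0] = np.log(pi) + np.log(E[:, obs[0]])
    for k in range(1, n):
        cand = logv[k - 1][:, None] + np.log(T)       # (S, S) candidate scores
        back[k] = np.argmax(cand, axis=0)             # best predecessor per state
        logv[k] = cand[back[k], np.arange(S)] + np.log(E[:, obs[k]])
    # Backtrack the optimal state path from the best final state.
    path = [int(np.argmax(logv[-1]))]
    for k in range(n - 1, 0, -1):
        path.append(int(back[k][path[-1]]))
    return path[::-1]
```

The α-HMM recurrence extends this table with the extra index tracking the most recent influencing position.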
4. Amortized Parameter Inference for the α-HMM
Higher-order dependency structures modeled by the α-HMM can vary substantially across instances, especially in the presence of large datasets, making a single global parameter set insufficient. To address these challenges, we propose an amortized learning approach that enables scalable, instance-specific parameter inference and fully exploits the modeling capacity of the α-HMM. Rather than estimating a single set of global parameters shared by all sequences, we learn an input-conditioned neural parameter estimator from data, which generates instance-specific α-HMM parameters for each input sequence whose dependence structures are to be predicted.
The input to an α-HMM is a variable-length sequence whose symbols are drawn from a fixed vocabulary (alphabet) Σ. The vocabulary size |Σ| and the number of states |S| determine the dimensionality of key model components: emission parameters are defined over Σ (distributions of size |Σ| per emitting state), and both pairwise influence and transition parameters form square tables over pairs of states. Specifically, every influence is defined over an ordered pair of states, encoding a left-to-right influence that accounts for the dependence between their symbol emissions.
For amortized learning, we first map each input sequence into a vectorized pre-encoder representation before feeding it into the estimator. This vectorization can be implemented either by one-hot encoding over Σ, or by a learned embedding table that maps each symbol to a d-dimensional vector, where d is a user-specified hyperparameter. To preserve order information, we optionally add a positional embedding so that the same symbol occurring at different positions can receive different representations. Positional embeddings can be implemented either as fixed sinusoidal features or as learned d-dimensional vectors. This yields a pre-encoder representation sequence v1, …, vn, where each vk is a d-dimensional vector. Overall, the above steps define a deterministic encoding from the input sequence to its pre-encoder representation.
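The steps above can be sketched as follows, using one-hot encoding plus fixed sinusoidal positional features. The four-symbol vocabulary and the dimension d = 4 are illustrative choices, not the paper's configuration.

```python
import numpy as np

# Illustrative 4-symbol vocabulary (RNA-like, for concreteness).
VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3}

def one_hot(sym):
    # One-hot vector of size |VOCAB| for a single symbol.
    v = np.zeros(len(VOCAB))
    v[VOCAB[sym]] = 1.0
    return v

def sinusoidal(pos, d):
    # Standard fixed sinusoidal positional features of (even) dimension d.
    i = np.arange(d // 2)
    angles = pos / (10000.0 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

def pre_encode(seq, d=4):
    # Pre-encoder representation: symbol vector plus positional embedding.
    return np.stack([one_hot(s) + sinusoidal(k, d) for k, s in enumerate(seq)])

X = pre_encode("ACGA")  # (4, 4) array; the two A's get different rows
```

Note that the positional term is what distinguishes the two occurrences of A in this example.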
However, both one-hot vectors and token embeddings are context-independent: the representation of a symbol depends only on its identity (and position, if positional encodings are used), but not on the surrounding symbols in the sequence. As a result, the same symbol occurring at the same position receives the same representation across different sequences, regardless of context. To incorporate contextual information, we apply a lightweight Transformer encoder. Its self-attention mechanism allows each position to aggregate information from the entire sequence, yielding context-dependent representations that are subsequently used by the parameter estimator. We denote the resulting contextual representation sequence by h1, …, hn, where hk is the contextual embedding at position k.
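To illustrate why self-attention yields context-dependent representations, here is a minimal single-head sketch; the weight matrices Wq, Wk, Wv are illustrative stand-ins for the full Transformer encoder used in the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One self-attention layer: every position attends to the whole sequence,
    so identical input vectors in different contexts map to different outputs."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # (n, n) attention weights
    return A @ V                                  # context-dependent representations
```

Feeding two sequences that share a row but differ elsewhere produces different outputs at the shared position, which is exactly the contextualization the encoder provides.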
As discussed in Section 3, any specific α-HMM is parameterized by its emission, transition, and influence parameters. We perform amortized inference with an input-conditioned estimator that maps an input sequence directly to such instance-specific parameters.
We point out that designing transitions and influences is particularly important because they govern the complexity of the latent states and the overall modeling capacity. An overly simple model structure may lack the specificity in expressiveness required by the task, while an overly complex one can make inference and training unstable or data-inefficient. In the simplest setting, the total number of unconstrained outputs can be viewed as the sum of three parts: emission logits for the involved states, influence logits for every pairwise influence, and an additional number of parameters for transitions (depending on the chosen transition structure and the number of states).
To map from the encoder space, we first apply an attention-based pooling layer to obtain a fixed-length summary vector z, which highlights informative positions in variable-length sequences. A multi-layer perceptron (MLP) then maps z to the three groups of unconstrained logits for emissions, transitions, and influences.
Since the emission and influence parameters represent (conditional) probability distributions, we apply softmax along the appropriate dimensions to project the unconstrained logits into the corresponding probability space. For the transition matrix, whose structure is user-specified and may vary across settings, we recommend a sigmoid-based constrained construction so that the resulting transition probabilities respect the prescribed structure and each row sums to one.
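As a sketch of this projection step: softmax normalizes each logit row into a distribution, while a sigmoid gate splits a row's mass between prescribed successors. The two-successor transition structure below is an illustrative assumption, not the exact structure used in the paper.

```python
import numpy as np

def softmax_rows(Z):
    # Project each row of a logit matrix onto the probability simplex.
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def constrained_transition_row(logit, stay_idx, leave_idx, n_states):
    """Build one transition row whose support is restricted to two prescribed
    successors: a single sigmoid gate splits the mass, so the row sums to one."""
    p = sigmoid(logit)
    row = np.zeros(n_states)
    row[stay_idx], row[leave_idx] = 1.0 - p, p
    return row
```

Rows outside the prescribed support stay exactly zero, which is what "respecting the structure" means here.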
We train the amortized estimator end-to-end (including the encoder) on a dataset of triples, each consisting of an observed sequence, a task annotation, and a reference hidden-state sequence consistent with the annotation under the task constraints.
We optimize with a composite objective consisting of four terms: (i) a path-level loss that encourages the reference hidden-event path to receive a high score under the inferred parameters; (ii) a state-sequence loss that applies supervision to the reference hidden-state sequence through its local state transitions, thereby helping stabilize the learning of locally consistent state-transition behavior; (iii) a structured adversarial loss that enforces separation between the reference path and a competing decoded path under the current predicted parameters; and (iv) a task-specific structured loss that further guides the model toward the desired structured output.
We combine these losses with learnable uncertainty-style weights: scalar weighting parameters are learned jointly with the estimator so that the relative contribution of each term is adapted during training. Concrete task-specific instantiations of these four terms are given in the following section.
5. Performance Evaluation
We evaluate the proposed amortized parameter inference method for the α-HMM on RNA secondary structure prediction, a well-studied problem in computational biology with a rich history of modeling approaches. Existing approaches to RNA secondary structure prediction often involve a trade-off between interpretability and predictive accuracy. In this section, we show that RNA secondary structure prediction can be formulated in a simple, accurate, and interpretable manner with the α-HMM, thereby demonstrating the effectiveness of the proposed parameter inference method.
5.1. The α-HMM for RNA Secondary Structure
A ribonucleic acid (RNA) consists of nucleotides adenine (A), guanine (G), cytosine (C), and uracil (U) that are chained together by covalent bonds to form the primary sequence. An RNA sequence can fold back on itself via canonical Watson–Crick base pairs A-U and C-G, and the wobble base pair G-U, via hydrogen bonds between the nucleotides. Consecutive, stacked base pairs, called stems, connected by unpaired single strands, called loops, constitute the secondary structure, acting as a scaffold for its three-dimensional structure (Figure 2). Therefore, determination or prediction of RNA secondary structure is important to understanding the three-dimensional structures and thus functions of RNAs.
The problem of RNA secondary structure prediction is to computationally discover the most plausible secondary structure given only a primary sequence. In essence, this requires determining which (remote) pairs of nucleotides exclusively form canonical base pairs. Such a phenomenon can be well captured by the expressive power of the α-HMM. Specifically, with the α-HMM, the nucleotides of an RNA molecule are modeled by emissions of states, the primary sequence of nucleotides connected by covalent bonds is modeled by the sequence of state transitions, and, most importantly, the base pairs are modeled by influences between states and their emissions of nucleotides.
Figure 3B shows a general α-HMM graph for modeling RNA secondary structures. The model can capture all patterns of higher-level stem relationships in the RNA secondary structure, including nested and parallel, as in tRNA (Figure 2), and crossing (pseudoknot), as in Figure 3A. In addition, every stem, a stack of base pairs, is modeled with three influences from state L to states P, Q, and R, respectively, where the innermost base pair is influenced by (L, P), the second innermost by (L, Q), and the third and beyond by (L, R).
It is clear that the hidden states of the α-HMM emitting an observable nucleotide sequence contain all the information needed for the secondary structure of the sequence. Thus, the structure prediction task is translated into a problem of decoding with the α-HMM.
5.2. RNA-Specific Parameterization of the α-HMM Estimator
To decode hidden states with the α-HMM on an input RNA sequence of length n, the transition, emission, and influence parameters are predicted by the learned parameter estimator from the sequence itself. From Figure 3, we formulate the following matrices:
Here, the transition probability matrix is defined over the states of the model. Most of its rows are fixed by the model structure, while the R-row is parameterized by learned quantities constrained to sum to one. In the RNA implementation used in this work, we further introduce two leave probabilities to control the transitions from states P and Q, respectively.
Estimating the parameters for the α-HMM on each input RNA sequence requires five scalars for the transition structure, four logits for the emission distribution, and 16 logits for the influence pair matrix, for a total of 25 sequence-specific parameters. Other quantities used during decoding, such as the start distribution and several state-dependent costs, are global trainable parameters shared across sequences.
To infer these 25 parameters, we first pass the RNA sequence through a neural encoder, which maps each nucleotide (A, C, G, U) to a learnable 640-dimensional embedding vector. The resulting embedding sequence is then processed by a three-layer Transformer encoder with rotary positional encoding, a hidden dimension of 640, 10 attention heads, GELU activations in the feed-forward sublayers, and dropout. The contextualized representation at each sequence position is then passed through a token-level trunk comprising a linear projection from 640 to 320 dimensions, a GELU activation, a layer-normalization layer, and one residual feed-forward block with hidden dimensions of 320 and dropout. These token features are then aggregated by an attention-pooling layer to obtain a sequence-level representation, which is mapped to the 25 sequence-specific parameters by a layer-normalization layer followed by a linear projection. For the tRNA experiments, we use a simplified instantiation of the model in which only three of the transition scalars are predicted as sequence-specific parameters, while the remaining two are not inferred and are instead fixed to zero. Consequently, the tRNA variant predicts 23 sequence-specific parameters rather than 25. The tRNA encoder is also smaller, using a two-layer Transformer with hidden dimensions of 128, feed-forward dimensions of 192, 4 attention heads, learned absolute positional embeddings, and dropout. In both model variants, the three influence roles are parameter-tied through a single shared nucleotide-pair matrix.
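The pooling-plus-projection head at the end of this pipeline can be sketched as follows, with dimensions reduced for illustration; the parameter names w, W_out, and b_out are ours.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, w):
    """Attention pooling: score each position with vector w, then take the
    softmax-weighted average, giving one fixed-length summary for any length n."""
    a = softmax(H @ w)          # (n,) attention weights over positions
    return a @ H                # (d,) sequence-level summary

def estimator_head(H, w, W_out, b_out):
    # Map the pooled summary to the 25 unconstrained sequence-specific outputs.
    return attention_pool(H, w) @ W_out + b_out
```

Because the pooled summary has a fixed dimension, the same head serves sequences of any length.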
5.3. Training Objective for RNA Secondary Structure Prediction
We train the model using AdamW [32] with a fixed learning rate and weight decay. Following the general loss formulation introduced in Section 4, we instantiate the RNA secondary structure objective as a combination of the path-level, structured adversarial, state-sequence, and base-pair supervision losses, with learnable scalar weights.
Let y denote an input RNA sequence, x the reference hidden-state sequence, and a the reference dot-bracket secondary structure. Let s(x; y) denote the score of the reference state path under the sequence-specific parameters predicted by the estimator. The path-level loss penalizes the model when s(x; y) is low, which encourages the reference path to receive a sufficiently high score.
The structured adversarial loss enforces a margin, controlled by constant margin parameters, between the score of the reference path and the scores of decoded hard negative candidates.
The state-sequence loss is defined as the negative log-likelihood of the reference hidden-state chain under the predicted start and transition distributions.
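A minimal sketch of this negative log-likelihood, with pi and T as assumed names for the predicted start and transition distributions:

```python
import numpy as np

def state_chain_nll(states, pi, T):
    """Negative log-likelihood of a reference hidden-state chain under the
    predicted start distribution pi and transition matrix T."""
    nll = -np.log(pi[states[0]])
    for a, b in zip(states[:-1], states[1:]):
        nll -= np.log(T[a, b])      # one transition term per consecutive pair
    return float(nll)
```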
In the RNA secondary structure setting, the task-specific term is instantiated as a pairwise base-pair supervision loss over positive and sampled negative base-pair candidates.
Decoding is discrete and non-differentiable in our implementation. In particular, the Viterbi-style dynamic program is used only to generate hard negative candidates for the structured adversarial loss, and gradients are not propagated through the discrete backtracking step. Thus, training does not rely on REINFORCE-style gradient estimation; instead, the amortized estimator is optimized through the differentiable surrogate losses above.
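To make the detachment concrete, here is a hedged sketch: the decoder contributes only discrete path indices, while the score of a fixed path is an ordinary differentiable function of the log-parameters. Start-distribution terms are omitted for brevity, and all names are illustrative.

```python
import numpy as np

def path_score(path, obs, logT, logE):
    """Score of a fixed state path: decoding supplies only the discrete
    indices, while the score itself is a sum of parameter terms (and hence
    differentiable in the parameters when computed in an autodiff framework)."""
    s = logE[path[0], obs[0]]
    for k in range(1, len(path)):
        s += logT[path[k - 1], path[k]] + logE[path[k], obs[k]]
    return s

def margin_loss(ref_path, neg_paths, obs, logT, logE, margin=1.0):
    # Hinge separation between the reference path and decoded hard negatives.
    ref = path_score(ref_path, obs, logT, logE)
    return sum(max(0.0, margin + path_score(p, obs, logT, logE) - ref)
               for p in neg_paths)
```

In a framework such as PyTorch, the negatives' index lists would be produced under no-grad decoding, so gradients flow only through the scoring sums.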
5.4. Datasets and Evaluation Settings
We evaluate the α-HMM on RNA secondary structure prediction under four complementary settings designed to assess performance within the tRNA family, on temporally held-out tRNA sequences, on broader RNA families within the training distribution, and on out-of-domain RNA families.
First, we evaluate the model on the curated T-psi-C dataset [33], which contains annotated tRNA sequences. On this dataset, we conduct two evaluations: (i) an in-domain evaluation using 10-fold cross-validation on tRNA sequences released before 2024; and (ii) a temporally held-out evaluation in which the model is trained on all pre-2024 sequences and evaluated on 14 tRNA sequences released in 2024.
Second, to assess generalization beyond the relatively homogeneous tRNA family, we train the broader-family model on RNAStralign and evaluate it on both an in-domain test split and the out-of-domain ArchiveII benchmark. RNAStralign is a widely used RNA secondary structure dataset containing 37,149 sequences from eight RNA families: 16S rRNA, tRNA, 5S rRNA, group I intron (grp1), SRP, tmRNA, RNase P, and telomerase RNA. ArchiveII contains 3975 sequences from ten RNA families, including two additional families, 23S rRNA and group II intron (grp2), beyond those represented in RNAStralign.
In the broader-family experiments, we follow the general data-processing protocol adopted for learning-based RNA secondary structure prediction benchmarks and exclude pseudoknotted structures and sequences longer than 600 nucleotides, since these are beyond the scope of the present work. After preprocessing, the resulting dataset sizes are 13,950 sequences for training, 2073 for validation, 1439 for the in-domain RNAStralign test split, and 1119 for the out-of-domain ArchiveII test set.
5.5. Baselines and Evaluation Protocol
We compare against CONTRAfold [34], ViennaRNA (RNAfold MFE and centroid) [35], LinearFold [36], SPOT-RNA [37], RNA-FM [38], E2Efold [39], MXfold2 [40], and KnotFold [41].
For the pre-2024 T-psi-C benchmark, all methods are evaluated under the same 10-fold cross-validation protocol, and we report mean ± std across folds (Table 1). For broader-family evaluation, we additionally report results on ArchiveII (Table 2).
5.6. Results on tRNA Benchmarks
As summarized in Table 1, our neural α-HMM achieves strong pair-level accuracy on tRNA, with particularly strong precision and F1 compared with the baselines considered here. These results indicate that the proposed amortized parameter inference framework is highly effective in the homogeneous tRNA setting.
Negative controls. To discourage memorization of common tRNA secondary-structure patterns, we construct two negative sets: (i) random negatives, where for each tRNA length L we generate a sequence by sampling each nucleotide uniformly from {A, C, G, U}; and (ii) shuffled negatives, where each tRNA sequence is randomly permuted, preserving the mononucleotide composition but disrupting the biological structure. These negatives are used only for evaluation. Our method yields low paired-nucleotide ratios (PNRs) on both shuffled and random negatives, indicating that the model does not simply predict high pairing rates on arbitrary sequences.
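The two negative-control constructions can be sketched as follows; the function names are ours.

```python
import random

def random_negative(length, alphabet="ACGU", seed=None):
    """Random negative: sample each nucleotide uniformly, matching the length."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))

def shuffled_negative(seq, seed=None):
    """Shuffled negative: permute the sequence, preserving the mononucleotide
    composition while disrupting the biological structure."""
    rng = random.Random(seed)
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)
```

The shuffled control is the stricter of the two, since it keeps the exact base composition of each real tRNA.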
Temporal evaluation. To assess temporal generalization, we evaluate the model on 14 tRNA sequences released in 2024, all of which are excluded from training and validation.
The α-HMM achieves precision = 1.000, recall = 1.000, and F1 = 1.000 on this held-out set. Although this result is encouraging, it should be interpreted cautiously because the temporal test set is small and drawn from a structurally homogeneous RNA family. We visualize one such prediction in Figure 4, showing high-quality structure recovery on entry tdbR00000239 compared with several baseline methods.
5.7. Results on Broader RNA Families
We further evaluate the proposed method beyond the relatively homogeneous tRNA family using the broader-family RNAStralign and ArchiveII benchmarks. On the in-domain RNAStralign test split, our method achieves strong precision, recall, and F1. On the out-of-domain ArchiveII benchmark, Table 2 shows that our method achieves the strongest F1 among the compared methods while also maintaining very high precision. Together, these results suggest that the proposed amortized α-HMM framework remains competitive not only within the broader training distribution but also on structurally diverse RNA families outside it.
Figure 5 shows a long 16S rRNA example (484 nt) predicted with high base-pair F1, indicating that our model captures long-range base-pair interactions.