Article

KOSLM: A Kalman-Optimal Hybrid State-Space Memory Network for Long-Term Time Series Forecasting

School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12684; https://doi.org/10.3390/app152312684
Submission received: 23 October 2025 / Revised: 24 November 2025 / Accepted: 24 November 2025 / Published: 29 November 2025
(This article belongs to the Special Issue Advanced Methods for Time Series Forecasting)

Abstract

Long-term time series forecasting (LTSF) remains challenging, as models must capture long-range dependencies and remain robust to noise accumulation. Traditional recurrent models, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM), often suffer from instability and information degradation over extended horizons. The state-of-the-art method xLSTMTime improves memory retention through exponential gating and enhanced memory-transition rules, but its gating design still lacks a principled optimality criterion. To address these issues, we propose the Kalman-Optimal Selective Long-Term Memory (KOSLM) model, which embeds a Kalman-optimal selective mechanism driven by the innovation signal within a structured state-space reformulation of LSTM. KOSLM dynamically regulates information propagation and forgetting to minimize state estimation uncertainty, providing both theoretical interpretability and practical efficiency. Extensive experiments across energy, finance, traffic, healthcare, and meteorology datasets show that KOSLM reduces mean squared error (MSE) by 14.3–38.9% compared with state-of-the-art methods, with larger gains at longer horizons. The model is lightweight, scalable, and achieves up to 2.5× speedup over Mamba-2. Beyond benchmarks, KOSLM is further validated on real-world Secondary Surveillance Radar (SSR) tracking under noisy and irregular sampling, demonstrating robust and generalizable long-term forecasting performance.

1. Introduction

Long-term time series forecasting (LTSF) aims to predict future values over extended horizons based on historical observations. Accurate long-range forecasts are critical for applications such as energy scheduling, climate modeling, financial planning, traffic management, and healthcare resource allocation. LTSF presents unique challenges: capturing long-range dependencies, mitigating error accumulation, and adapting to non-stationary temporal dynamics [1].
Traditional recurrent architectures, including Recurrent Neural Networks (RNNs) [2] and Long Short-Term Memory (LSTM) networks [3,4], often struggle with long sequences due to vanishing or exploding gradients [5]. While LSTM gates alleviate some of these issues, their heuristic design lacks explicit structural constraints derived from a principled optimality criterion, potentially leading to suboptimal memory retention over extended horizons. The recently proposed xLSTMTime further enhances long-range modeling by introducing the sLSTM and mLSTM components. The sLSTM employs scalar memory with exponential gating to manage long-term dependencies, while the mLSTM component uses matrix memory and a covariance update rule to enhance storage capacity and relevant information retrieval capabilities. Despite these improvements, xLSTMTime remains constrained by heuristic recurrent design.
The Kalman filter (KF) [6] provides optimal state estimation under Gaussian noise [7]. Recent studies have explored integrating KF with LSTM to improve the accuracy of time-series forecasting. Representative approaches include the following: (i) Deep Kalman Filters [8], which parameterize the state-transition and observation functions of KF using LSTM; (ii) KalmanNet [9], which employs an LSTM to learn residual corrections to the Kalman gain under a known KF model; and (iii) uncertainty-aware LSTM–KF hybrids [10], which estimate the covariance or uncertainty structures of KF through recurrent dynamics. However, these approaches generally maintain a loose coupling between LSTM and KF—they do not embed Kalman-inspired feedback directly into the internal gating dynamics of LSTM.
Recently, selective state space models (SSMs) [11,12] have demonstrated efficient sequence modeling with linear-time complexity. By dynamically modulating SSM’s parameters based on the input, they can filter task-irrelevant patterns while retaining critical long-term information. This selective modulation property motivates revisiting LSTM gates from a state-space perspective, into which Kalman-inspired feedback can be injected, thereby endowing the gating mechanism with Kalman-optimal structural constraints.
In this work, we propose the Kalman-Optimal Selective Long-Term Memory (KOSLM) model, which establishes a context-aware feedback pathway that optimally balances memory retention and information updating, providing both theoretical interpretability and practical efficiency.
Our main contributions are as follows:
  • State-space reformulation of LSTM: We formalize LSTM networks as input- and state-dependent SSMs, where each gate dynamically parameterizes the state-transition and input matrices. This framework provides a principled explanation of LSTM’s long-term memory behavior.
  • Kalman-optimal selective gating: Inspired by the Kalman filter and selective SSMs, we introduce a Kalman-optimal selective mechanism in which the state-transition and input matrices are linearly modulated by a Kalman gain learned from the innovation term, establishing a feedback pathway that minimizes state estimation uncertainty.
  • Applications to real-world forecasting: KOSLM consistently outperforms state-of-the-art baselines across LTSF benchmarks in energy, finance, traffic, healthcare, and meteorology, achieving 14.3–38.9% lower mean squared error (MSE) and a maximum reduction in mean absolute error (MAE) of 25.2%; it also delivers up to 2.5× faster inference than Mamba-2. In real-world Secondary Surveillance Radar (SSR) tracking under noisy and irregular sampling, KOSLM demonstrates strong robustness and generalization.
By bridging heuristic LSTM gating with principled Kalman-optimal estimation, KOSLM provides a robust, interpretable, and scalable framework for long-term sequence modeling, offering both methodological novelty and practical forecasting utility.

2. Background and Theory

2.1. LSTM Networks

RNNs model sequential data through recursive temporal computation. However, traditional RNNs often suffer from vanishing and exploding gradients when capturing long-term dependencies. The LSTM network [3] addresses this issue by introducing gating mechanisms that regulate the flow of information over time. The network structure of the LSTM neuron is shown in Figure 1.
Each LSTM unit maintains a cell state $C_t$ that carries long-term information and a hidden state $H_t$ that provides short-term representations. The cell state is updated according to
$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t,$  (1)
where $F_t$, $I_t$, and $O_t$ denote the forget, input, and output gates, respectively. These gates are nonlinear functions of the current input $x_t$ and the previous hidden state $H_{t-1}$. The final output is obtained as follows:
$H_t = O_t \odot \tanh(C_t).$  (2)
This gating mechanism allows the model to selectively retain or discard information, mitigating gradient degradation. However, the gates are learned heuristically through data-driven optimization rather than derived from an explicit structural optimality constraint, making the LSTM sensitive to noise and unstable for long-term dependencies.
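For reference, Equations (1) and (2) correspond to the following standard LSTM cell update. This is a generic sketch with explicitly passed weight matrices, included only to make the gating equations concrete; it is not KOSLM-specific code.

```python
import torch

def lstm_cell_step(x_t, H_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """Standard LSTM update, Eqs. (1)-(2), with X_t = [x_t; H_{t-1}]."""
    X = torch.cat([x_t, H_prev], dim=-1)
    F_t = torch.sigmoid(X @ W_f + b_f)       # forget gate
    I_t = torch.sigmoid(X @ W_i + b_i)       # input gate
    C_tilde = torch.tanh(X @ W_c + b_c)      # candidate cell state
    O_t = torch.sigmoid(X @ W_o + b_o)       # output gate
    C_t = F_t * C_prev + I_t * C_tilde       # Eq. (1): cell-state update
    H_t = O_t * torch.tanh(C_t)              # Eq. (2): hidden-state output
    return H_t, C_t
```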

2.2. State Space Models

2.2.1. Selective State Space Models

Selective State Space Models (S6) are a recent class of sequence models for deep learning that are broadly related to RNNs and classical SSMs. They are inspired by a particular system (Equation (3)) that maps a one-dimensional function or sequence $x_t \in \mathbb{R} \mapsto y_t \in \mathbb{R}$ through an implicit latent state $h_t \in \mathbb{R}^N$:
$h_t = A h_{t-1} + B x_t,$  (3a)
$y_t = M h_t,$  (3b)
where $A$ is the state-transition matrix, defining how the past latent state $h_{t-1}$ influences the current state; $B$ is the input matrix, determining how $x_t$ modulates $h_t$; and $M$ maps the implicit latent state $h_t$ to the output.
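Read as a discrete recurrence, Equation (3) can be unrolled directly. The toy NumPy sketch below, with arbitrarily chosen $A$, $B$, and $M$, is included only to make the latent-state update concrete and does not correspond to any trained model in this paper.

```python
import numpy as np

def ssm_unroll(x, A, B, M):
    """Unroll h_t = A h_{t-1} + B x_t and y_t = M h_t (Eq. (3)) over a scalar sequence."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t        # state transition plus input injection
        ys.append(M @ h)           # linear read-out of the latent state
    return np.array(ys)

# Toy usage with a 4-dimensional latent state and an assumed stable transition.
N = 4
A = 0.9 * np.eye(N)
B = np.ones(N)
M = np.ones(N) / N
y = ssm_unroll(np.sin(np.linspace(0.0, 6.28, 50)), A, B, M)
```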
These models integrate the SSM described above into deep learning frameworks and introduce input-dependent selective mechanisms (see Appendix A for a detailed discussion), achieving Transformer-level modeling capability with linear computational complexity.
The theoretical connections among LSTM, KF, and SSM provide the foundation for constructing a unified Kalman-optimal selective memory framework.

2.2.2. Kalman Filter

The KF [6] is a classical instance of the state-space model, providing the minimum mean-square-error (MMSE) estimate of hidden system states under noisy observations. The system dynamics are expressed as
$h_t = A h_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q_t),$  (4a)
$z_t = M h_t + v_t, \quad v_t \sim \mathcal{N}(0, R_t),$  (4b)
where $M$ is the same matrix as in the SSM introduced above, although in the context of the KF it is commonly referred to as the observation matrix; $w_t$ and $v_t$ denote process and observation noise with covariances $Q_t$ and $R_t$, respectively. Note that the observation $z_t$ can be regarded as the input $x_t$ in the SSMs.
At each time step, the KF performs two operations: prediction and update. The prediction step estimates the prior state $\hat{h}_t^-$ and prior error covariance $\hat{P}_t^-$ from the posterior state $\hat{h}_{t-1}$ and posterior error covariance $\hat{P}_{t-1}$ of the previous time step:
$\hat{h}_t^- = A \hat{h}_{t-1},$  (5)
$\hat{P}_t^- = A \hat{P}_{t-1} A^{\top} + Q_t.$  (6)
The update step then refines this prediction using the observation $z_t$:
$K_t = \hat{P}_t^- M^{\top} \left( M \hat{P}_t^- M^{\top} + R_t \right)^{-1},$  (7)
$\hat{h}_t = \hat{h}_t^- + K_t \left( z_t - M \hat{h}_t^- \right),$  (8)
$\hat{P}_t = (I - K_t M) \hat{P}_t^-.$  (9)
Here, $K_t$ is the Kalman gain, which minimizes the posterior error covariance $\hat{P}_t$. It determines how much new information (the innovation term $z_t - M \hat{h}_t^-$) is incorporated into the updated posterior state $\hat{h}_t$. This principle of optimal selective information integration forms the theoretical foundation for the innovation-driven gating design proposed later. A detailed justification for interpreting the Kalman gain as a prototype of dynamic selectivity is provided in Appendix B.1.
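For readers who prefer code to equations, the prediction–update cycle in Equations (5)–(9) can be written in a few lines of NumPy. The routine below is a generic textbook Kalman step using the notation above; it is not code taken from the KOSLM implementation.

```python
import numpy as np

def kalman_step(h_post, P_post, z, A, M, Q, R):
    """One KF cycle, Eqs. (5)-(9): prediction followed by innovation-weighted update."""
    # Prediction (Eqs. (5)-(6)): propagate the previous posterior through the dynamics.
    h_prior = A @ h_post
    P_prior = A @ P_post @ A.T + Q

    # Update (Eqs. (7)-(9)): weight the innovation by the Kalman gain.
    S = M @ P_prior @ M.T + R                    # innovation covariance
    K = P_prior @ M.T @ np.linalg.inv(S)         # Kalman gain, Eq. (7)
    innov = z - M @ h_prior                      # innovation term
    h_post_new = h_prior + K @ innov             # Eq. (8): posterior state
    P_post_new = (np.eye(P_prior.shape[0]) - K @ M) @ P_prior  # Eq. (9): posterior covariance
    return h_post_new, P_post_new, K, innov
```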

3. Proposed Method

3.1. Reformulating LSTM as a State-Space Model

Following the LSTM formulation in Equation (1), the LSTM cell can be equivalently reformulated as a time-varying SSM, in which the cell state $C_t$ evolves under nonlinear, input- and state-dependent dynamics:
$C_t = A_t C_{t-1} + B_t z_t.$  (10)
Here, $z_t$ serves as the observation in the KF (corresponding to the input $x_t$ in the SSM and LSTM), and the matrices $A_t$ and $B_t$ are determined by the forget gate $F_t$ and input gate $I_t$, respectively. A detailed derivation of this LSTM-to-SSM reconstruction, including the mapping of gating mechanisms to state-space parameters, is provided in Appendix B.2.

3.2. Kalman-Optimal Selectivity via Innovation-Driven Gain

We introduce the innovation term from the KF:
$\mathrm{Innov}_t = z_t - M_t A_{t-1} C_{t-1},$  (11)
which measures the discrepancy between the observation input $z_t$ and the prediction obtained from the previous cell state $C_{t-1}$, serving as a real-time correction signal between model prediction and actual measurement.
In the classical KF, this innovation drives the computation of the Kalman gain $K_t$ (Equation (7)), which regulates the incorporation of new information and the retention of the prior state during the update step. In KOSLM, rather than explicitly solving the Riccati recursion [13], we learn a functional mapping:
$K_t = \phi(\mathrm{Innov}_t; \theta_\phi),$  (12)
where $\phi(\cdot)$ is a lightweight multi-layer perceptron (MLP) with sigmoid activation, parameterized by $\theta_\phi$, whose width and depth are controlled by two hyperparameters, MLP_d and MLP_layers (e.g., MLP_d = 64, MLP_layers = 2). The sigmoid activation ensures that the estimated gain satisfies $K_t \in (0, 1)$, maintaining physical interpretability as an adaptive weighting coefficient and preventing numerical instability.
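A minimal PyTorch sketch of such a gain network is shown below. The hyperparameter names MLP_d and MLP_layers follow the text, but the exact layer arrangement of the authors' $\phi$ may differ, so this should be read as one plausible instantiation rather than the released code.

```python
import torch.nn as nn

class GainMLP(nn.Module):
    """Innovation-to-gain mapping K_t = phi(Innov_t; theta_phi), Eq. (12) (sketch).

    A lightweight MLP whose width and depth are set by mlp_d / mlp_layers; the
    final sigmoid keeps every entry of K_t inside (0, 1), as required in the text.
    """
    def __init__(self, innov_dim, gain_dim, mlp_d=64, mlp_layers=2):
        super().__init__()
        layers, d = [], innov_dim
        for _ in range(mlp_layers):
            layers += [nn.Linear(d, mlp_d), nn.Sigmoid()]
            d = mlp_d
        layers += [nn.Linear(d, gain_dim), nn.Sigmoid()]   # bounded gain output
        self.net = nn.Sequential(*layers)

    def forward(self, innov):          # innov: (batch, innov_dim)
        return self.net(innov)         # K_t: (batch, gain_dim), each entry in (0, 1)
```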
The learned gain dynamically regulates the trade-off between prior memory and new information, yielding a learnable yet principled mechanism for Kalman-optimal selectivity. The bounded output range of $K_t$ further stabilizes the state-space update and limits divergence during long-horizon inference. Appendix B.3 provides the theoretical derivation demonstrating that the innovation term serves as a sufficient statistic for learning the Kalman gain. Appendix B.4 further presents controlled experiments confirming that the learned gain accurately approximates the oracle Kalman gain across various $(Q, R)$ regimes (Table A1). Controlled ablation studies (see Section 4.2.2 and Table 1) show that increasing the hidden-layer size or depth yields only marginal improvements in predictive performance, indicating that KOSLM's performance is largely insensitive to the specific configuration of $\phi$. This confirms that the performance gains primarily stem from the Kalman-optimal structural design rather than from network capacity. These results jointly establish the theoretical and empirical foundation for embedding Kalman-optimal selectivity into deep learning models.
We then define the state-space evolution as follows:
$A_t = (I - K_t M_t) A, \qquad B_t = K_t,$  (13)
where $A$ is the base state-transition matrix, a learnable parameter of the model; $A_t$ and $B_t$ are dynamically modulated by the Kalman gain $K_t$.

3.3. Structural Overview of KOSLM

The KOSLM cell preserves the computational efficiency of a standard LSTM while embedding a Kalman-inspired feedback loop. The KOSLM network architecture is illustrated in Figure 2; at each timestep, the operations are as follows (a minimal code sketch of this procedure is given after the list):
  • Initialization: The learnable matrix $A$ is initialized following S4D-Lin (or S4D-Real in the real-valued case) [14], both of which are based on the HIPPO theory [15]. These schemes define the $n$-th element of $A$ as $-\tfrac{1}{2} + ni$ and $-(n+1)$, respectively;
  • Compute $z_t$ and $M_t$ via the candidate gate $\tilde{C}_t$ and output gate $O_t$;
  • Compute the innovation: $\mathrm{Innov}_t = z_t - M_t A C_{t-1}$;
  • Estimate the Kalman gain: $K_t = \phi(\mathrm{Innov}_t)$;
  • Update the state-transition matrices: $A_t = (I - K_t M_t) A$, $B_t = K_t$;
  • Propagate the hidden state: $C_t = A_t C_{t-1} + B_t z_t$;
  • Compute the output hidden representation: $H_t = M_t C_t$.
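The listed steps can be condensed into the following PyTorch-style sketch. For simplicity, $A$, $M_t$, and $K_t$ are treated as diagonal (element-wise) quantities, mirroring LSTM gating, and the S4D-based initialization of $A$ is omitted; both simplifications are our assumptions for illustration and need not match the released implementation.

```python
import torch
import torch.nn as nn

class KOSLMCell(nn.Module):
    """Minimal sketch of one KOSLM timestep (diagonal, element-wise variant)."""
    def __init__(self, input_dim, hidden_dim, mlp_d=64):
        super().__init__()
        d = input_dim + hidden_dim
        self.W_c = nn.Linear(d, hidden_dim)            # candidate gate -> z_t
        self.W_o = nn.Linear(d, hidden_dim)            # output gate    -> M_t
        self.A = nn.Parameter(torch.rand(hidden_dim))  # base transition (paper: S4D init)
        self.phi = nn.Sequential(                      # innovation -> Kalman gain
            nn.Linear(hidden_dim, mlp_d), nn.Sigmoid(),
            nn.Linear(mlp_d, hidden_dim), nn.Sigmoid())

    def forward(self, x_t, H_prev, C_prev):
        X = torch.cat([x_t, H_prev], dim=-1)
        z_t = torch.tanh(self.W_c(X))          # observation input (candidate gate)
        M_t = torch.sigmoid(self.W_o(X))       # diagonal observation matrix (output gate)
        innov = z_t - M_t * self.A * C_prev    # Innov_t = z_t - M_t A C_{t-1}
        K_t = self.phi(innov)                  # learned Kalman gain in (0, 1)
        A_t = (1.0 - K_t * M_t) * self.A       # A_t = (I - K_t M_t) A
        C_t = A_t * C_prev + K_t * z_t         # C_t = A_t C_{t-1} + B_t z_t, with B_t = K_t
        H_t = M_t * C_t                        # H_t = M_t C_t
        return H_t, C_t
```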
To further clarify the conceptual and structural differences between KOSLM and existing Kalman-based neural architectures, Table 2 provides a direct comparison with KalmanNet [9].

3.4. Theoretical Interpretation

Under linear–Gaussian assumptions and with a sufficiently expressive mapping $\phi$, the learned gain $K_t$ converges to the oracle Kalman solution. Consequently, KOSLM inherits the nonlinear expressive power of LSTM while achieving the minimum-variance estimation property of the Kalman filter in its linear regime. This leads to improved stability and robustness, particularly in long-horizon or noisy sequence modeling.

3.5. Practical Advantages

KOSLM offers several practical benefits:
  • Robustness: The feedback structure mitigates error accumulation and improves performance under noise or distributional shifts.
  • Efficiency: With only 0.24 M parameters, KOSLM achieves up to 2.5× faster inference than Mamba-2, while maintaining competitive accuracy.
  • Versatility: The model generalizes across diverse domains, from energy demand forecasting to radar-based trajectory tracking.

4. Results

This section provides a comprehensive evaluation of the proposed KOSLM model through large-scale long-term forecasting benchmarks (Section 4.1), component ablation studies (Section 4.2), efficiency assessments (Section 4.3), and a real-world radar trajectory tracking case study (Section 4.4). The experimental analyses collectively aim to validate both the predictive accuracy and robustness of KOSLM across diverse and noisy temporal conditions.

4.1. Main Experiments on Benchmark Datasets

To examine the capability of KOSLM in modeling long-range dependencies and maintaining stability over extended forecasting horizons, we conduct systematic experiments on nine widely used real-world datasets covering domains such as traffic flow, electrical load, exchange rate, meteorology, and epidemiology. The datasets vary in frequency, dimensionality, and temporal regularity, providing a comprehensive benchmark for assessing model generalization.

4.1.1. Dataset Details

We summarize the datasets used in this study as follows. Weather [16] contains 21 meteorological variables (e.g., temperature and humidity) recorded every 10 min throughout 2020. ETT (Electricity Transformer Temperature) [17] includes four subsets: two hourly-level datasets (ETTh1, ETTh2) and two 15-minute-level datasets (ETTm1, ETTm2). Electricity [18], derived from the UCI Machine Learning Repository, records hourly power consumption (kWh) of 321 clients from 2012 to 2014. Exchange [19] comprises daily exchange rates among eight countries. Traffic [20] consists of hourly road occupancy rates measured by 862 sensors on San Francisco Bay Area freeways from January 2015 to December 2016. The Illness (ILI) dataset [21] tracks the weekly number of influenza-like illness patients in the United States. Table 3 summarizes the statistical properties of all nine benchmark datasets. All datasets are divided into training, validation, and test subsets with a ratio of 7:1:2.

4.1.2. Implementation Details

All models are trained using the Adam optimizer without weight decay. The learning rate is selected from $[1 \times 10^{-3}, 1 \times 10^{-2}]$ via grid search. Batch size is set to 32 by default, adjustable up to 256 depending on GPU memory. Training is performed for 15 epochs, and the checkpoint with the lowest validation loss is used for testing. Experiments are repeated five times, and average results are reported (RMS values are provided in the Supplementary Materials). All models adopt a 2-layer architecture with hidden dimension 64. We use PyTorch’s default weight initialization, and no additional regularization (dropout or gradient clipping) is applied. All experiments are implemented in PyTorch 2.1.0 with Python 3.11 on NVIDIA RTX 4090 GPUs. Random seeds for Python, NumPy, and PyTorch are fixed to ensure reproducibility.
The gain function $\phi$ is implemented as a one-layer MLP with hidden dimension 64 and sigmoid activation, followed by a linear projection to the gain matrix $K_t$. The same $\phi$ network is shared across all timesteps to ensure parameter consistency.

4.1.3. Experimental Setup of Main Experiments

All experiments follow the evaluation protocol established in xLSTMTime [22], adopting prediction horizons of $T \in \{96, 192, 336, 720\}$ for standard datasets and $T \in \{24, 36, 48, 60\}$ for the weekly-sampled ILI dataset. We compare the proposed KOSLM with nine recent state-of-the-art baselines that represent diverse architectural paradigms, spanning state-space, recurrent, attention-based, and linear modeling frameworks:
  • SSM-based: FiLM [23], S-Mamba [24];
  • LSTM-based: xLSTMTime [22];
  • Transformer-based: FEDformer [25], iTransformer [26], Crossformer [27];
  • MLP/TCN-based: DLinear [28], PatchTST [29], TimeMixer [30].
This comprehensive selection enables a fair and systematic comparison across diverse sequence modeling paradigms.

4.1.4. Overall Performance

Table 4 and Table 5 report the long-term multivariate forecasting results on nine real-world datasets, evaluated by MSE and MAE. Across nearly all datasets and prediction horizons, the proposed KOSLM achieves the best or near-best performance, highlighting its superior generalization and robustness under diverse temporal dynamics.
  • Consistent superiority across domains: KOSLM outperforms all competing baselines, particularly under complex and noisy datasets. In terms of average MSE and MAE reductions, KOSLM achieves relative improvements of 31.96% and 18.77% on Traffic, 13.37% and 25.20% on Electricity, 5.47% and 20.27% on Exchange, 11.26% and 18.43% on Weather, and 22.04% and 23.09% on ILI. Moreover, on the four ETT benchmarks, KOSLM achieves MSE improvements of up to 27.46% and MAE improvements of up to 25.2%, demonstrating strong adaptability to varying periodic and nonstationary patterns. These consistent gains verify that the Kalman-inspired selective updating mechanism effectively filters noise and dynamically adjusts to regime shifts, ensuring stable forecasting accuracy over long horizons.
  • Stable error distribution and reduced variance: The MSE–MAE gap of KOSLM remains narrower than that of other baselines, implying reduced large deviations and more concentrated prediction errors. This indicates more stable error behavior, which is crucial for long-horizon forecasting where cumulative drift often occurs. The innovation-driven Kalman gain estimation provides adaptive correction at each timestep, ensuring smooth and consistent prediction trajectories under uncertain dynamics.
  • Strong scalability and generalization: KOSLM achieves leading performance not only on large-scale datasets (Traffic, Electricity) but also on small, noisy datasets (ILI), confirming robust generalization across different temporal resolutions and noise levels. Its consistent advantage over Transformer-based (e.g., iTransformer, FEDFormer, PatchTST), recurrent (e.g., xLSTMTime), and state-space models (e.g., S-Mamba, FiLM) demonstrates that the proposed Kalman-optimal selective mechanism provides an effective inductive bias for modeling long-term dependencies.
  • Advantage over LSTM-based architectures: Compared with advanced LSTM-based models such as xLSTMTime, KOSLM achieves consistently better results across nearly all datasets and horizons. This verifies that replacing heuristic gates with Kalman-optimal selective gating enhances memory retention and update stability. While xLSTMTime alleviates gradient decay via hierarchical memory, KOSLM further refines state updates through innovation-driven gain estimation, thereby achieving a more principled and stable information flow.
To further demonstrate the long-horizon stability of the proposed model, we compare KOSLM with the recent S-Mamba [24], a state-of-the-art state space model that represents the latest advancement in efficient sequence modeling. Figure 3 presents the forecasting trajectories at the long horizon ( T = 720 ) across five representative datasets. KOSLM maintains accurate trend alignment and amplitude consistency with the ground truth, showing particularly superior convergence behavior on smoother and more stationary datasets such as Weather and Exchange, where the Kalman gain adaptation stabilizes long-term predictions. In contrast, S-Mamba exhibits slight temporal lag and amplitude attenuation under extended forecasting conditions. These results visually confirm the advantage of the proposed Kalman-based feedback selectivity in preserving long-term temporal fidelity.

4.2. Ablation Study

To further analyze the effectiveness of each proposed component, we conduct ablation experiments on four widely used long-term time-series forecasting datasets (ETTm1, ETTh1, Traffic, and Exchange) with a prediction length of L = 720 , following the data preprocessing described in Section 4.1.1. All model variants are trained and tested following the default implementation settings described in Section 4.1.2, and the results are averaged over five runs to ensure statistical reliability.

4.2.1. Structural Ablation

We first evaluate the contribution of the Kalman-based structure by comparing three variants: (i) Full (KOSLM): the complete model, where $K_t = \phi(\mathrm{Innov})$ and $A_t = (I - K_t M_t) A$, $B_t = K_t$; (ii) No-Gain: $\phi$ receives the innovation but directly outputs $(A_t, B_t)$, removing the explicit computation of the Kalman gain $K_t$, i.e., the pathway $\mathrm{Innov} \rightarrow K_t \rightarrow (A_t, B_t)$ is replaced by $\mathrm{Innov} \rightarrow (A_t, B_t)$; (iii) No-Innov: the innovation statistic is removed; $\phi$ receives the standard network input (e.g., $[x_t; H_{t-1}]$), while $K_t$ is still computed and mapped to $(A_t, B_t)$ via the Kalman form, i.e., $[x_t; H_{t-1}] \rightarrow K_t \rightarrow (A_t, B_t)$.
As shown in Table 6, removing either the gain computation or the innovation input consistently leads to higher errors across all datasets. For instance, on ETTm1, the MSE increases by 32.9% when the gain path is removed, confirming that both the innovation statistic and Kalman gain are essential for stable long-horizon prediction. This finding aligns with the Kalman filtering principle that the innovation serves as a sufficient statistic for state correction.

4.2.2. Capacity Ablation

To verify that the performance gains of KOSLM mainly stem from the Kalman-optimal structural design rather than the network capacity, we conduct a capacity ablation study. We evaluate the ϕ network under four progressively larger configurations: (i) Linear: A single linear layer with input dimension equal to the number of input features and output dimension 64, without any activation; (ii) SmallMLP: A two-layer MLP with 128 hidden units and sigmoid activation; (iii) MediumMLP: A three-layer MLP with 256 hidden units per layer and sigmoid activation; (iv) HighCap: A four-layer MLP with 512 hidden units per layer and sigmoid activation.
All models are trained under identical settings with five independent random seeds, and the results are reported as mean ± standard deviation. This allows us to evaluate both performance trends and statistical reliability. From Table 1, it is evident that increasing the capacity of the $\phi$ network beyond SmallMLP does not consistently improve MSE or MAE; in some cases, the error even increases slightly. The small improvement of MediumMLP on ETTm1 and its slight deterioration on ETTh1, Traffic, and Exchange indicate that the performance gains are dominated by the Kalman-optimal structural design rather than by network depth or over-parameterization. This confirms the effectiveness of the innovation-driven gain estimation in KOSLM for stable long-horizon prediction.
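For concreteness, the four $\phi$ configurations compared above could be instantiated roughly as follows. The layer counts and hidden widths follow the descriptions in the text, while the input and output dimensions are placeholders; this is an illustrative sketch rather than the exact ablation code.

```python
import torch.nn as nn

def make_phi(variant, in_dim, out_dim):
    """Build one of the four capacity-ablation variants of the phi network (sketch)."""
    def mlp(widths):
        layers, d = [], in_dim
        for w in widths:
            layers += [nn.Linear(d, w), nn.Sigmoid()]
            d = w
        return nn.Sequential(*layers, nn.Linear(d, out_dim))
    if variant == "Linear":            # single linear layer, output dimension 64, no activation
        return nn.Linear(in_dim, 64)
    if variant == "SmallMLP":          # two-layer MLP, 128 hidden units
        return mlp([128])
    if variant == "MediumMLP":         # three-layer MLP, 256 hidden units per layer
        return mlp([256, 256])
    if variant == "HighCap":           # four-layer MLP, 512 hidden units per layer
        return mlp([512, 512, 512])
    raise ValueError(f"unknown variant: {variant}")
```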

4.3. Efficiency Benchmark

To evaluate the computational efficiency of KOSLM, we benchmark it against representative sequence modeling baselines, including Transformer, Mamba-2, and LSTM, across four key metrics: runtime scalability, end-to-end throughput, memory footprint, and parameter count. All experiments are conducted using PyTorch 2.1.0 with torch.compile optimization on a single NVIDIA RTX 4090 GPU.

4.3.1. Runtime Scalability

We assess runtime performance on both a controlled synthetic setup and the ETTm1 dataset. As illustrated in Figure 4, KOSLM exhibits near-linear growth in inference time with increasing sequence length $L$. For sequences longer than $L = 2\mathrm{k}$, KOSLM surpasses optimized Transformer implementations (FlashAttention [31]) and achieves up to 2.5× faster execution than the fused-kernel version of Mamba-2 [24] at $L > 4\mathrm{k}$. Compared with a standard PyTorch LSTM, KOSLM achieves an approximately 1.3–1.9× speedup across the tested sequence lengths, demonstrating robust scalability for long sequences.

4.3.2. Throughput Analysis

Figure 5 reports the end-to-end inference throughput measured in million tokens/s (M tokens/s). KOSLM maintains consistently high throughput across all tested sequence lengths, achieving up to 1.9 × higher throughput than LSTM and 2.7 × higher than Mamba-2 in the 4K–8K regime. The performance remains stable for even longer sequences, highlighting the model’s suitability for extended temporal dependencies.

4.3.3. Memory Footprint

Table 7 reports GPU memory usage during training for varying batch sizes with input length 2048 and models containing approximately 0.24 M parameters. KOSLM demonstrates favorable memory efficiency relative to similarly sized Mamba-2 implementations and remains comparable to a standard LSTM with roughly ten times more parameters. This efficiency results from KOSLM’s compact Kalman-optimal selective mechanism, which dynamically modulates state transitions with minimal parameter overhead.

4.3.4. Model Size

Table 8 summarizes the parameter counts. KOSLM contains only 0.24 M parameters, roughly 4.5% of the Transformer and 11% of the LSTM baselines, while achieving comparable or superior forecasting performance. The compact design emphasizes KOSLM’s suitability for deployment in resource-constrained environments.
This efficiency stems directly from the Kalman-optimal selective mechanism, which adaptively regulates the state update through innovation-driven gain modulation. Such a formulation not only reduces redundant computation but also provides a principled pathway for scaling to long sequences.

4.4. Real-World Application: Secondary Surveillance Radar (SSR) Target Trajectory Tracking

To further evaluate the robustness and practical generalizability of KOSLM under real-world noisy conditions, we conducted a real-world experiment on SSR target trajectory tracking. SSR is a ground-based air traffic surveillance system where aircraft respond to radar interrogations with encoded transponder signals, generating sparse range–azimuth sequences known as raw SSR plots. These observation sequences present three major challenges for sequential modeling:
  • High stochastic noise: Measurement noise leads to random fluctuations in the estimated positions;
  • Irregular sampling: Aircraft maneuvers and radar scan intervals result in uneven temporal spacing;
  • Correlated anomalies: Spurious echoes or missing detections introduce discontinuities in the trajectories.
These characteristics make SSR a natural yet challenging testbed for assessing model stability, noise resilience, and temporal consistency. KOSLM addresses these challenges by integrating context-aware selective state updates with learnable dynamics under the Kalman optimality principle, enabling robust filtering, smoothing, and extrapolation in partially observed environments.

4.4.1. Experimental Setup

To achieve a balance between realism and experimental controllability, we adopted a semi-physical simulation approach. Training data were derived from Automatic Dependent Surveillance–Broadcast (ADS-B) flight tracks obtained from the OpenSky Network [32]. Controlled Gaussian noise with an SNR of 33 dB was added to emulate SSR observation uncertainty. For evaluation, we deployed KOSLM on real SSR radar data collected from an operational ground-based system (peak transmit power 6 kW), capturing eight live air-traffic targets under operational conditions. Detailed data acquisition and preprocessing procedures are provided in Appendix C.
Unlike the main benchmarking experiments on standardized long-term time series forecasting (LTSF) datasets, this case study aims to demonstrate practical applicability and operational robustness. Due to irregular sampling, sparse observations, and high stochastic noise, conventional quantitative metrics such as MSE or MAE are not meaningful. Therefore, we focus on qualitative trajectory visualizations to illustrate model performance in realistic conditions.

4.4.2. Results and Analysis

Figure 6 presents representative trajectories of eight air-traffic targets, comparing the classical KF algorithm, Transformer, Mamba, and KOSLM. The classical KF exhibits locally inaccurate and jagged trajectories due to its fixed linear dynamics, which cannot adapt to irregular sampling or abrupt maneuvers. The Transformer captures general trends but produces fragmented and temporally inconsistent tracks under sparse and noisy observations. Mamba improves noise robustness but still shows local instability during complex maneuvers.
In contrast, KOSLM generates smooth, coherent, and temporally consistent trajectories that closely follow the true flight paths, highlighting its ability to handle high stochastic noise, irregular sampling, and correlated anomalies. This robustness stems from the innovation-driven Kalman gain and context-aware selective state updates, which allow KOSLM to adapt dynamically to non-stationary motion patterns.

5. Conclusions

This study addressed a fundamental limitation of recurrent architectures such as LSTM, whose heuristically designed gates lack structural constraints for optimality, leading to instability and information decay over long sequences. To overcome this issue, we proposed the Kalman-Optimal Selective Long-Term Memory Network (KOSLM), which reconstructs LSTM as a nonlinear, input- and state-dependent state-space model and integrates an innovation-driven Kalman-optimal gain path for principled information selection. This formulation unifies LSTM gating, selective state-space modeling, and Kalman filtering into a single theoretically grounded recurrent framework. We note that the proposed KOSLM is lightweight, containing only 0.24 M parameters—significantly smaller than typical LSTM or Transformer baselines—while still achieving competitive or superior performance. This compact design ensures efficient implementation on standard GPU hardware without imposing constraints on memory or computation.
Extensive experiments demonstrate that KOSLM achieves state-of-the-art performance on long-term forecasting benchmarks, while ablation studies confirm the essential role of the innovation statistic and Kalman-form gain in stabilizing long-horizon modeling. Moreover, validation on real-world SSR trajectory tracking highlights its robustness under noisy and non-stationary conditions.
In summary, embedding Kalman-optimal principles into deep recurrent networks provides both theoretical insights and practical benefits for robust long-term sequence modeling. Future work will focus on extending KOSLM beyond Gaussian assumptions and applying it to multimodal and cross-domain time-series scenarios.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app152312684/s1, Table S1: KOSLM run-to-run variability (RMS) over 5 runs.

Author Contributions

All authors contributed to the conception and design of the study. Material preparation, code implementation, data collection, and analysis were performed by X.T. and L.W. The first draft of the manuscript was written by L.W. and all authors commented on previous versions of the manuscript. X.T. and M.W. supervised the research and provided critical revisions and theoretical guidance. Y.Z. assisted in data verification and manuscript editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets and source code used in this study are publicly available at https://github.com/wl822513/KOSLM (accessed on 23 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Selective State Space Models

This section focuses on S6, which provides a unified framework connecting gating with time-varying state-space representations. Its associated selection mechanisms have become a central design principle in modern SSMs, such as Mamba [11]. These mechanisms improve modeling efficiency by dynamically identifying and retaining task-relevant information over long sequences, effectively compressing the latent state without compromising representational capacity.
  • Relation to Existing Concepts.
Selection mechanisms are conceptually related to several prior ideas, including gating, hypernetworks, and data-dependent parameterization:
  • Gating was originally introduced as a signal-control mechanism in recurrent neural networks (RNNs) such as LSTM [3] and GRU [33], where multiplicative gates regulate information updates across time steps. The concept has been generalized to other architectures, including gated convolutions and transformers [34,35], though often without an explicit interpretation in terms of temporal signal control.
  • Hypernetworks [36] generate network parameters dynamically through auxiliary networks, enabling the model’s internal dynamics to adapt based on input signals.
  • Data-dependent parameterization [37] represents a broader paradigm in which model parameters are directly conditioned on input data. Both gating and hypernetworks can be viewed as specific instances within this larger class.
While selection mechanisms share certain similarities with these paradigms, they form a distinct modeling category. Unlike classical gating or hypernetwork mechanisms that often operate locally or step-wise, selection mechanisms are explicitly designed to route, filter, or suppress sequence-level information in an input- or state-dependent manner, enabling stable long-horizon modeling. In selective SSMs, this is typically realized by parameterizing system matrices (e.g., Δ , A, B, C) as functions of the input at each time step.
  • From Implicit to Explicit Selection.
Early structured SSMs (S4) [38] encoded fixed inductive biases through learned structured dynamics, providing an implicit, input-independent form of selection via controlled signal propagation. Later models, including Mamba [11], introduced explicit selection, where the parameters of the state-space system (e.g., Δ , A, B, C) are conditioned directly on the current input, allowing the model to dynamically emphasize or suppress features at each step.
  • Semantic Clarification.
As discussed in S6 [11], while selection mechanisms can be loosely related to gating, hypernetworks, or general data dependence, these broader descriptions do not capture the defining characteristic of selection. The term selection is reserved for mechanisms that explicitly operate along the sequence dimension, enabling adaptive memory compression and long-range information control. From a dynamical systems perspective, such mechanisms can be interpreted through state-space discretization and signal propagation theory [39,40]. Importantly, classical RNN gating can be viewed as a local, step-wise precursor to these more general selection mechanisms.
  • Scope and Relevance.
The principle of selection underlies recent progress in linear-time sequence modeling. Building upon this paradigm, our proposed KOSLM extends selective SSMs by introducing an innovation-driven Kalman-optimal feedback pathway, transforming heuristic selection into a principled, uncertainty-aware mechanism for long-term sequence modeling.

Appendix B. Detailed Derivation

Appendix B.1. Kalman Gain as a Prototype for Dynamic Selectivity

In classical filtering theory, the Kalman gain $K(t)$ acts as a dynamic weighting factor that determines how prior state estimates and new observations are combined to produce a new state [6]. It is formally derived as the solution to the Riccati differential equation, and its value depends on the evolving uncertainty in the internal system state (captured by the prior covariance $P(t)$) and the noise characteristics of the observation process (captured by the measurement covariance $R$) [41]. This time-varying gain governs the extent to which incoming measurements correct the state estimate, ensuring minimum mean squared error under Gaussian noise assumptions.
This correction process is essentially a trade-off between the historical state information of the system and the incoming input information. In the context of deep learning, we interpret the observation as the current input sequence, and the prior state estimate as the latent representation of the system’s history. From this perspective, the Kalman gain plays a role analogous to a dynamic selection factor that balances the contribution of contextual knowledge (from the model state) and content information (from the input) in generating the updated representation. Therefore, the Kalman gain can be viewed as a principled prototype for content-aware and context-sensitive information selection. This perspective justifies the broader use of dynamic selection strategies in sequence models, especially when aiming to balance stability and adaptability in long-range modeling.

Appendix B.2. LSTM-to-SSM Reconstruction

To endow LSTM networks with structured modeling semantics, in this section we reconstruct them as a form of nonlinear, time-varying state-space model (see Equation (3a)). Specifically, we identify that the interactions among the forget gate, input gate, output gate, input signal, and cell state in LSTMs essentially constitute an input-dependent and state-aware state transition–observation process.
As illustrated in Figure 1, the gating mechanisms and the cell state $C_t$ in LSTM establish a pathway for memory propagation across time. The cell state $C_t$ is the core component that preserves long-term memory across the temporal dimension and models long-range dependencies. In terms of representational role, it is both semantically and structurally equivalent to the latent state $h_t$ in SSMs (see Equation (3a)). Therefore, we take $C_t$ as the central axis and analyze its information propagation process.
The forget gate $F_t$ regulates the degree to which short-term memory from the cell state $C_{t-1}$ is discarded. It is determined jointly by the current input $x_t$ and the previous hidden state $H_{t-1}$, thereby embodying an input- and state-dependent transition process:
$A_t \triangleq F_t = \sigma(W_f X_t + b_f),$
where $X_t = [x_t; H_{t-1}]$ denotes the concatenated input. The corresponding retained term is $A_t C_{t-1}$.
The pathway for new information input in LSTM consists of two steps:
  • The candidate cell state $\tilde{C}_t$ can be interpreted as a differentiable nonlinear mapping of the joint input $X_t$:
    $z_t \triangleq \tilde{C}_t = \tanh(W_c X_t + b_c),$
    where the linear transformation $W_c X_t + b_c$ fuses the external input $x_t$ and the hidden state $H_{t-1}$ into the observation space, with $\tanh(\cdot)$ introducing nonlinearity to produce an intermediate representation $z_t$. We treat this representation as the observation input at time $t$ under the SSM. (Here, the observation input denotes the externally observable signal that directly drives the state update process; it is written as $z_t$ in the KF and as $x_t$ in SSMs, which are semantically equivalent.)
  • The interaction between the input gate $I_t$ and the candidate state $\tilde{C}_t$ establishes a structured pathway for injecting new information into the state dynamics, analogous to the excitation of state evolution by external inputs in SSMs:
    $B_t z_t \triangleq I_t \odot \tilde{C}_t = \sigma(W_i X_t + b_i) \odot \tanh(W_c X_t + b_c),$
    where $B_t$ serves as the input matrix, structurally equivalent to the input gate $I_t$.
Consequently, the update process of the LSTM cell state $C_t$ at time $t$ (Equation (1)) can be rewritten in the form of a state-space transition:
$C_t = A_t C_{t-1} + B_t z_t,$
where $A_t$ and $B_t$ are realized through the forget and input gates driven by the joint input $X_t$, thereby endowing the model with the ability to selectively forget or memorize long-term information in an input-dependent and context-aware manner.
Finally, the output pathway of LSTM is given by
$H_t = O_t \odot \tanh(C_t),$
which is formally equivalent to the observation equation of a classical SSM:
$y_t = M_t C_t,$
where $M_t$ is formed by the output gate $O_t$ together with the nonlinear transformation $\tanh(\cdot)$.
The above reconstruction shows that the gating mechanism in LSTM can be interpreted as a class of nonlinear, input- and state-dependent SSMs. This perspective establishes a valid unification of LSTM gating mechanisms and state-space models, and reveals that LSTM’s long-term memory capability originates from the structured realization of both state-space modeling and efficient information selection. Building on this understanding, we further introduce the Kalman gain into this state-space structure, imposing a structural constraint that minimizes the uncertainty of state estimation errors.

Appendix B.3. Kalman-Optimal Selective Mechanism

Recent studies on time series modeling have proposed a class of selective mechanisms based on SSMs, among which the most representative work is S6 [11]. Its core idea points out the following:
“One method of incorporating a selection mechanism into models is by letting their parameters that affect interactions along the sequence (e.g., the recurrent dynamics of an RNN or the convolution kernel of a CNN) be input-dependent.”
  • Inspired by this, we design a selective mechanism based on Kalman-optimal state estimation within the SSM (Equation (3)). Unlike the input-dependent selection mechanism represented by S6, we make the key parameters ($A_t$, $B_t$), which control how the model selectively propagates or forgets information along the sequence dimension, depend on the innovation term: the deviation between the observation input and the prior state prediction in the observation space, denoted as Innov in this paper. This method constructs a learnable selection path that integrates observational inputs and latent state feedback, with the optimization objective of minimizing the uncertainty in state estimation errors.
  • Learnable Gain from Innovation.
In the KF algorithm, the closed-form solution of the Kalman gain $K_t$ (Equation (7)) is obtained by minimizing the covariance of the state estimation error [6]. Here, the observation matrix $M_t$ and the observation noise covariance $R_t$ are considered known system priors, usually derived from physical modeling or domain knowledge. Therefore, the dynamics of $K_t$ stem entirely from the statistical properties of the prior estimation error $e_t = h_t - \hat{h}_t^-$. Under the linear observation model (4b), $e_t$ is mapped into the observation space, yielding the following representation of the innovation term:
$\mathrm{Innov}_t = z_t - M_t \hat{h}_t^- = M_t (h_t - \hat{h}_t^-) + v_t.$
This equation shows that $\mathrm{Innov}_t$ is a linear mapping of $e_t$ which, although perturbed by additive observation noise $v_t$, still retains the full uncertainty information of $e_t$. Thus, $\mathrm{Innov}_t$ can serve as a sufficient statistic for estimating $K_t$, providing a theoretical basis for estimating $K_t$ directly from the innovation term. The estimation process is defined as
$K_t = \phi(\mathrm{Innov}_t; \theta_\phi),$
where $\phi(\cdot)$ is a differentiable nonlinear function, parameterized by a neural network and trained via gradient descent. The theoretical soundness of this formulation is confirmed by a controlled experiment (Appendix B.4), which demonstrates that $\phi(\cdot)$ can reliably recover the oracle Kalman gain from the innovation statistics with negligible estimation error.
  • Optimal Selectivity via Gain.
By substituting the KF prediction step (Equation (5)) into the state update equation (Equation (8)), the two processes can be expressed in the nonlinear, time-varying SSM form of Equation (3a):
$\hat{h}_t = A_t \hat{h}_{t-1} + B_t z_t, \qquad A_t = (I - K_t M_t) A, \qquad B_t = K_t.$
Within this framework, the parameters $A_t$ and $B_t$ are linearly modulated by $K_t$, which is driven by the innovation term, yielding a context-aware Kalman-optimal selective path that minimizes state estimation uncertainty and imposes structural optimality constraints on the LSTM.

Appendix B.4. Empirical Validation of Innovation-Driven Kalman Gain Learning

To empirically validate the theoretical claim that the Kalman gain can be dynamically inferred from the innovation term, we design a controlled linear–Gaussian synthetic experiment where the optimal gain has a closed-form solution. This allows us to directly compare the learned gain against the ground-truth Kalman gain and quantify their alignment over time.
  • Experimental Design.
We consider a one-dimensional linear dynamical system as in Equation (4). The analytical Kalman gain $K_t$ is computed from the standard Riccati recursion and serves as the oracle reference. We design three model variants for comparison:
(i) Oracle-KF: A standard Kalman filter using the known $(A, Q, R, M)$ parameters to compute $K_t$.
(ii) Supervised-$\phi$: A small MLP $\phi(\cdot)$ takes the innovation $\mathrm{innov}_t = z_t - M \hat{h}_t^-$ as input and predicts $\hat{K}_t = \phi(\mathrm{innov}_t)$, trained by minimizing $\mathcal{L} = \|\hat{K}_t - K_t\|^2$ over all timesteps. During training, the predicted prior state $\hat{h}_t^-$ is provided by the Oracle-KF to isolate the learning of the innovation-to-gain mapping. This tests whether the innovation term contains sufficient information to recover the oracle gain.
(iii) End-to-End: The same MLP is trained to map directly from $(z_t, \hat{h}_t^-)$ to $K_t$ without explicitly constructing the innovation term, serving as an ablation to assess the importance of innovation-driven modeling.
  • Setup and Metrics.
We simulate 10,000 trajectories of length $T = 100$ with fixed system parameters ($A = 0.9$, $M = 1$). To test robustness under different noise regimes, we consider four parameter settings by varying the process noise $Q$ and observation noise $R$: (i) $Q = 5$, $R = 0.1$; (ii) $Q = 5$, $R = 0.5$; (iii) $Q = 10$, $R = 0.1$; (iv) $Q = 100$, $R = 0.5$. The supervised models are trained for 200 epochs using the Adam optimizer with a learning rate of $10^{-3}$. Performance is evaluated by the following metrics (a code sketch of this setup is given after the metric definitions):
  • MSE of Gain: $\mathrm{MSE} = \frac{1}{T} \sum_t |K_t - \hat{K}_t|^2$, measuring how well the learned gain matches the oracle trajectory.
  • State Estimation RMSE: Root-mean-square error between the estimated and true state trajectories, where the state estimates are produced by running a Kalman update step with the learned $\hat{K}_t$, verifying that accurate gain learning improves filtering quality.
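A condensed sketch of this protocol is given below. It is our reconstruction (fewer trajectories, a single $(Q, R)$ setting, and the Supervised-$\phi$ variant only), intended to show how oracle gains and innovations are generated and how $\phi$ is fitted; it is not the authors' experiment script.

```python
import numpy as np
import torch
import torch.nn as nn

A, M, Q, R = 0.9, 1.0, 5.0, 0.1     # one of the four (Q, R) regimes
T, n_traj = 100, 1000               # reduced from the 10,000 trajectories in the paper

def simulate(rng):
    """Generate one trajectory of h_t = A h_{t-1} + w_t and z_t = M h_t + v_t."""
    h, z = np.zeros(T), np.zeros(T)
    for t in range(T):
        prev = h[t - 1] if t > 0 else 0.0
        h[t] = A * prev + rng.normal(0.0, np.sqrt(Q))
        z[t] = M * h[t] + rng.normal(0.0, np.sqrt(R))
    return h, z

def oracle_kf(z):
    """Scalar KF with the Riccati recursion; returns innovations and oracle gains."""
    h_post, P_post = 0.0, 1.0
    innovs, gains = [], []
    for z_t in z:
        h_prior, P_prior = A * h_post, A * P_post * A + Q
        K = P_prior * M / (M * P_prior * M + R)        # Eq. (7), scalar case
        innov = z_t - M * h_prior
        h_post = h_prior + K * innov
        P_post = (1.0 - K * M) * P_prior
        innovs.append(innov); gains.append(K)
    return np.array(innovs), np.array(gains)

rng = np.random.default_rng(0)
inn, gain = zip(*(oracle_kf(simulate(rng)[1]) for _ in range(n_traj)))
X = torch.tensor(np.concatenate(inn), dtype=torch.float32).unsqueeze(-1)
Y = torch.tensor(np.concatenate(gain), dtype=torch.float32).unsqueeze(-1)

# Supervised-phi: fit the innovation-to-gain mapping against the oracle gains.
phi = nn.Sequential(nn.Linear(1, 64), nn.Sigmoid(), nn.Linear(64, 1), nn.Sigmoid())
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)
for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(phi(X), Y)           # ||K_hat - K||^2
    loss.backward(); opt.step()
print("gain MSE:", loss.item())
```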
  • Results and Analysis.
Figure A1 presents the learned Kalman gain trajectories across all considered noise regimes. Across every $(Q, R)$ configuration, the Supervised-$\phi$ model closely matches the oracle Kalman gain $K_t$, producing stable and accurate trajectories throughout both transient and steady phases. In contrast, the End-to-End model produces a noisier and slightly biased gain curve, indicating that omitting the explicit innovation term fails to accurately capture the gain dynamics. This qualitative observation is quantitatively confirmed in Table A1: Supervised-$\phi$ achieves up to four orders of magnitude lower gain MSE and reduces state-estimation RMSE by 4.5–32.8% under all $(Q, R)$ noise settings. These results demonstrate that the innovation term is indeed sufficient for inferring the optimal Kalman gain and that explicitly modeling innovation-driven selectivity yields more stable and accurate state estimation. Together, the figure and table provide strong empirical support for our theoretical claim that innovation-guided gain learning constitutes a principled and robust mechanism for selective state updates.
Table A1. Quantitative Results of Kalman Gain Learning. Supervised-$\phi$ achieves orders-of-magnitude lower gain MSE and consistently better state estimation RMSE across all noise regimes compared to end-to-end learning, highlighting the critical role of explicitly modeling the innovation term in recovering accurate Kalman gain dynamics.

| (Q, R) | Supervised-$\phi$ MSE | Supervised-$\phi$ RMSE | End-to-End MSE | End-to-End RMSE |
|---|---|---|---|---|
| (5, 0.1) | $1.3 \times 10^{-6}$ | 0.317 | $3.5 \times 10^{-3}$ | 0.368 |
| (5, 0.5) | $2.9 \times 10^{-5}$ | 0.684 | $1.8 \times 10^{-3}$ | 0.694 |
| (10, 0.1) | $1.2 \times 10^{-6}$ | 0.216 | $1.3 \times 10^{-3}$ | 0.341 |
| (100, 0.5) | $2.2 \times 10^{-7}$ | 0.310 | $6.8 \times 10^{-4}$ | 0.747 |
Figure A1. Kalman gain learning across different noise regimes. Top-left: $(Q = 100, R = 0.5)$; top-right: $(Q = 5, R = 0.1)$; bottom-left: $(Q = 5, R = 0.5)$; bottom-right: $(Q = 10, R = 0.1)$. Supervised-$\phi$ closely matches the oracle $K_t$ in all cases, while end-to-end trajectories remain noisy and biased. These results confirm that innovation-driven modeling provides stable and accurate gain learning across both low- and high-noise conditions.

Appendix C. SSR Case Study: Data and Preprocessing Details

Appendix C.1. Data Acquisition

The semi-physical training dataset is constructed using ADS-B logs from the OpenSky Network, recorded on 15 June 2025, between 10:00 and 10:15 AM. The data cover a 300 km region surrounding Tokyo Haneda, Narita, and Incheon Airports. Each log provides updates every 5 s for 503 aircraft, including longitude, latitude, altitude, and timestamp information. A detailed statistical summary of the dataset is provided in Table A2.
To emulate realistic Secondary Surveillance Radar (SSR) observation noise, zero-mean Gaussian perturbations are applied to each measurement with a signal-to-noise ratio (SNR) of 33 dB, formulated as
$\tilde{X}_i = X_i + \mathcal{N}\!\left(0,\ \frac{\sigma_i^2}{10^{33/10}}\right),$
where $\sigma_i^2$ represents the empirical variance of the $i$-th feature. This noise configuration reflects typical radar tracking uncertainty under moderate atmospheric interference.
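A minimal version of this noise-injection step, assuming the features are stored as columns of a NumPy array, could look as follows; the 33 dB figure comes from the text, while the array layout and the function name are illustrative assumptions.

```python
import numpy as np

def add_ssr_noise(X, snr_db=33.0, rng=None):
    """Add zero-mean Gaussian noise at a fixed per-feature SNR (in dB).

    X is assumed to have shape (num_samples, num_features), e.g. columns for
    longitude, latitude, altitude, and timestamp.  The noise variance of
    feature i is var(X[:, i]) / 10**(snr_db / 10), matching the equation above.
    """
    rng = rng or np.random.default_rng()
    noise_var = X.var(axis=0) / (10.0 ** (snr_db / 10.0))
    return X + rng.normal(0.0, np.sqrt(noise_var), size=X.shape)
```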
Table A2. Summary of the Semi-physical ADS-B Dataset Used for Training. The dataset consists of 5 s updates for 503 aircraft collected within a 15 min window around major East Asian airports. Each sample includes four state variables: longitude, latitude, altitude, and timestamp.

| Targets | Update Frequency | # Features | Total Samples | Time Span |
|---|---|---|---|---|
| 503 | 5 s | 4 | 166,110 | 15 June 2025, 10:00–10:15 |

Appendix C.2. Normalization and Reverse Transformation

All input data are standardized to zero mean and unit variance:
$y = \frac{x - \bar{x}}{\sigma}, \qquad \hat{x} = \sigma \hat{y} + \bar{x}.$

Appendix C.3. Field Data Collection

For real-world evaluation, we use raw SSR plots collected by a field-deployed radar with a peak transmit power of 6 kW. These sequences contain irregular sampling, strong noise, and missing returns, providing a rigorous test of model robustness.

Appendix C.4. Processing Workflow

The overall pipeline—from semi-physical data generation to field testing—is summarized in Figure A2, where the upper branch denotes the training stage on noisy ADS-B data, and the lower branch represents inference on real SSR plots. This workflow ensures consistency between semi-physical training data and real-world SSR testing scenarios.
Figure A2. SSR trajectory modeling pipeline. The upper branch shows training using ADS-B data with noise simulation and normalization; the lower branch shows testing on real raw SSR plots using trained models, followed by trajectory prediction and comparison.

References

  1. Cheng, M.; Liu, Z.; Tao, X.; Liu, Q.; Zhang, J.; Pan, T.; Zhang, S.; He, P.; Zhang, X.; Wang, D.; et al. A comprehensive survey of time series forecasting: Concepts, challenges, and future directions. TechRxiv 2025.
  2. Salazar, C.; Banerjee, A.G. A distance correlation-based approach to characterize the effectiveness of recurrent neural networks for time series forecasting. Neurocomputing 2025, 629, 129641.
  3. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  4. Benjamin, J.; Mathew, J. Enhancing continuous integration predictions: A hybrid LSTM-GRU deep learning framework with evolved DBSO algorithm. Computing 2025, 107, 9.
  5. Pascanu, R.; Mikolov, T.; Bengio, Y. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML’13), Atlanta, GA, USA, 16–21 June 2013; pp. 1310–1318.
  6. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 1960, 82, 35–45.
  7. Yuan, X.; Li, J.; Kuruoglu, E.E. Robustness enhancement in neural networks with alpha-stable training noise. Digit. Signal Process. 2025, 156, 104778.
  8. Krishnan, R.G.; Shalit, U.; Sontag, D. Deep Kalman filters. arXiv 2015, arXiv:1511.05121.
  9. Revach, G.; Shlezinger, N.; Ni, X.; Escoriza, A.L.; Van Sloun, R.J.; Eldar, Y.C. KalmanNet: Neural network aided Kalman filtering for partially known dynamics. IEEE Trans. Signal Process. 2022, 70, 1532–1547.
  10. Dahan, Y.; Revach, G.; Dunik, J.; Shlezinger, N. Uncertainty quantification in deep learning based Kalman filters. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 13121–13125.
  11. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
  12. Dao, T.; Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv 2024, arXiv:2405.21060.
  13. Aliyu, B.K.; Osheku, C.A.; Adetoro, L.M.; Funmilayo, A.A. Optimal Solution to Matrix Riccati Equation—For Kalman Filter Implementation. In MATLAB; Katsikis, V.N., Ed.; IntechOpen: Rijeka, Croatia, 2012; Chapter 4.
  14. Gu, A.; Goel, K.; Gupta, A.; Ré, C. On the parameterization and initialization of diagonal state space models. Adv. Neural Inf. Process. Syst. 2022, 35, 35971–35983.
  15. Gu, A.; Gulcehre, C.; Paine, T.; Hoffman, M.; Pascanu, R. Improving the gating mechanism of recurrent neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 3800–3809.
  16. Weather Dataset. 2024. Weather Data from the BGC Jena Climate Data Archive. Available online: https://www.hypermeteo.com/en/?gad_source=1&gad_campaignid=14712996685&gclid=EAIaIQobChMIvJ_HqOORkQMVSpCDBx3MFTosEAAYASAAEgLx7fD_BwE (accessed on 12 May 2024).
  17. ETT Datasets. 2024. Energy Transformer Temperature Dataset (ETT) for Long-Term Forecasting. Available online: https://ieee-dataport.org/documents/merged-ett-dataset-time-series-forecasting (accessed on 12 May 2024).
  18. Electricity Dataset. 2024. Electricity Load Diagrams Dataset from UCI Machine Learning Repository. Available online: https://ieee-dataport.org/documents/uci-dataset (accessed on 12 May 2024).
  19. Exchange Rate Dataset. 2024. Daily Exchange Rate Dataset for Long-Term Time Series Forecasting, Provided with the ETT Benchmark Collection. Available online: https://github.com/juyongjiang/TimeSeriesDatasets (accessed on 12 May 2024).
  20. Traffic Dataset. 2024. Traffic Flow Data from California Performance Measurement System (PeMS). Available online: https://www.smartmicro.com/traffic-sensor?utm_medium=advert&utm_source=google&utm_campaign=search-traffic&utm_content=ad&utm_term=traffic_sensor&gad_source=1&gad_campaignid=22435495111&gclid=EAIaIQobChMIlvLc8OORkQMVQJKDBx3FOSSQEAAYASAAEgITNPD_BwE (accessed on 12 May 2024).
  21. Illness (ILI) Dataset. 2024. Weekly Influenza-Like Illness (ILI) Dataset for Forecasting Tasks, Available Within the ETT Benchmark Suite. Available online: https://www.researchgate.net/figure/Influenza-like-illness-ILI-data-from-the-FluID-database-Weekly-scaled-rates-of-ILI_fig1_337718309 (accessed on 12 May 2024).
  22. Alharthi, M.; Mahmood, A. xlstmtime: Long-term time series forecasting with xlstm. AI 2024, 5, 1482–1495. [Google Scholar] [CrossRef]
  23. Zhou, T.; Ma, Z.; Wen, Q.; Sun, L.; Yao, T.; Yin, W.; Jin, R. Film: Frequency improved legendre memory model for long-term time series forecasting. Adv. Neural Inf. Process. Syst. 2022, 35, 12677–12690. [Google Scholar]
  24. Wang, Z.; Kong, F.; Feng, S.; Wang, M.; Yang, X.; Zhao, H.; Wang, D.; Zhang, Y. Is mamba effective for time series forecasting? Neurocomputing 2025, 619, 129178. [Google Scholar] [CrossRef]
  25. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
  26. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. itransformer: Inverted transformers are effective for time series forecasting. arXiv 2023, arXiv:2310.06625. [Google Scholar]
  27. Zhang, Y.; Yan, J. Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  28. Su, J.; Shen, Y.; Wu, Z. DLinear: A Linear Complexity Approach for Long-Term Time Series Forecasting. arXiv 2022, arXiv:2205.13504. [Google Scholar]
  29. Huang, X.; Tang, J.; Shen, Y. Long time series of ocean wave prediction based on PatchTST model. Ocean. Eng. 2024, 301, 117572. [Google Scholar] [CrossRef]
  30. Wang, S.; Wu, H.; Shi, X.; Hu, T.; Luo, H.; Ma, L.; Zhang, J.Y.; Zhou, J. Timemixer: Decomposable multiscale mixing for time series forecasting. arXiv 2024, arXiv:2405.14616. [Google Scholar] [CrossRef]
  31. Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv 2024, arXiv:2307.08691. [Google Scholar]
  32. Zhang, J.; Wei, L.; Yanbo, Z. Study of ADS-B data evaluation. Chin. J. Aeronaut. 2011, 24, 461–466. [Google Scholar] [CrossRef]
  33. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  34. Mehta, H.; Gupta, A.; Cutkosky, A.; Neyshabur, B. Long range language modeling via gated state spaces. arXiv 2022, arXiv:2206.13947. [Google Scholar] [CrossRef]
  35. Hua, W.; Dai, Z.; Liu, H.; Le, Q. Transformer quality in linear time. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2022; pp. 9099–9117. [Google Scholar]
  36. Ha, D.; Dai, A.; Le, Q.V. Hypernetworks. arXiv 2016, arXiv:1609.09106. [Google Scholar] [PubMed]
  37. Poli, M. Data-dependent neural networks. arXiv 2023, arXiv:2305.13272. [Google Scholar]
  38. Gu, A.; Goel, K.; Ré, C. Combining recurrent, convolutional, and continuous-time models with structured state space. Adv. Neural Inf. Process. Syst. 2021, 34, 21984–21998. [Google Scholar]
  39. Funahashi, K.I.; Nakamura, Y. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Netw. 1993, 6, 801–806. [Google Scholar] [CrossRef]
  40. Tallec, C.; Ollivier, Y. Can recurrent neural networks warp time? arXiv 2018, arXiv:1804.11188. [Google Scholar] [CrossRef]
  41. Tahirovic, A.; Redzovic, A. Optimal Estimation for Continuous-Time Nonlinear Systems Using State-Dependent Riccati Equation (SDRE). arXiv 2025, arXiv:2503.10442. [Google Scholar]
Figure 1. Structure of the LSTM network.
Figure 2. Unfolded KOSLM network layer. The D-dimensional input $x_t$ is dynamically coupled with the N-dimensional hidden state $H_{t-1}$ and mapped to the output $H_t$ through the higher-dimensional cell state $C_t \in \mathbb{R}^T$. Compared to the heuristic forget and output gates of the classical LSTM, our mechanism introduces Kalman-optimal constraints by parameterizing the SSM parameters $(A_t, B_t)$ based on the innovation term.
Figure 3. KOSLM vs. S-Mamba: Forecasting comparison on five representative datasets with both input and prediction horizons set to 720. The blue line denotes the ground truth, and the red line indicates model predictions. KOSLM demonstrates superior long-horizon stability and trend consistency compared to S-Mamba.
Figure 4. Runtime benchmarks. Inference time as a function of sequence length L. KOSLM demonstrates near-linear scaling and maintains faster inference than Transformer and Mamba-2 for long sequences.
Figure 5. Throughput benchmarks. End-to-end inference throughput on RTX 4090. KOSLM maintains stable throughput as sequence length increases, outperforming LSTM and Mamba-2 for long sequences.
Figure 6. SSR target tracking under real-world operational conditions. Trajectories of eight air-traffic targets tracked from SSR measurements. Red curves denote raw SSR observations, blue curves denote tracking by the conventional KF algorithm (on-site SSR interrogator output), and teal curves indicate deep model predictions (Transformer, Mamba, or KOSLM), followed by KF smoothing. KOSLM achieves smoother, more accurate, and stable tracking under strong noise and irregular sampling conditions.
Table 1. Capacity ablation (corrected). Performance of KOSLM with varying $\phi$ network capacities on four long-term forecasting datasets (L = 720). Averaged over five runs with standard deviations. Lower values (indicated by the down arrow “↓”) correspond to better performance.
Dataset | Variant | #Params | MSE ↓ | MAE ↓ | Δ MSE vs. SmallMLP
ETTm1 | Linear | 0.183M | 0.480 ± 0.012 | 0.355 ± 0.008 | +46.3%
ETTm1 | SmallMLP | 0.255M | 0.328 ± 0.010 | 0.282 ± 0.007 | –
ETTm1 | MediumMLP | 0.389M | 0.335 ± 0.011 | 0.284 ± 0.001 | +2.13%
ETTm1 | HighCap | 0.898M | 0.342 ± 0.015 | 0.288 ± 0.010 | +4.27%
ETTh1 | Linear | 0.183M | 0.440 ± 0.010 | 0.330 ± 0.004 | +30.95%
ETTh1 | SmallMLP | 0.255M | 0.336 ± 0.008 | 0.291 ± 0.006 | –
ETTh1 | MediumMLP | 0.389M | 0.341 ± 0.010 | 0.294 ± 0.002 | +1.49%
ETTh1 | HighCap | 0.898M | 0.355 ± 0.012 | 0.302 ± 0.008 | +5.65%
Traffic | Linear | 0.183M | 0.290 ± 0.011 | 0.232 ± 0.006 | +9.02%
Traffic | SmallMLP | 0.255M | 0.266 ± 0.010 | 0.212 ± 0.007 | –
Traffic | MediumMLP | 0.389M | 0.270 ± 0.012 | 0.216 ± 0.002 | +1.50%
Traffic | HighCap | 0.898M | 0.279 ± 0.014 | 0.222 ± 0.005 | +4.89%
Exchange | Linear | 0.183M | 0.272 ± 0.010 | 0.305 ± 0.005 | +4.62%
Exchange | SmallMLP | 0.255M | 0.260 ± 0.009 | 0.291 ± 0.009 | –
Exchange | MediumMLP | 0.389M | 0.264 ± 0.010 | 0.294 ± 0.007 | +1.54%
Exchange | HighCap | 0.898M | 0.285 ± 0.012 | 0.310 ± 0.009 | +9.62%
Table 2. Core structural comparison between KalmanNet and KOSLM.
Aspect | KalmanNet | KOSLM (Ours)
Core idea | Strictly follows the classical KF architecture (Section 2.2.2); the neural network learns to correct the Kalman gain $K_t^{\mathrm{KF}}$ under partially known linear dynamics. | Reinterprets LSTM gating as a Kalman-optimal state estimation problem; state estimation does not strictly adhere to the KF equations, but directly learns $K_t$ from the innovation term.
System dynamics | Assumes fixed parameters $(A, M, Q, R)$; suitable for systems with known or partially known dynamics. | Learns $(A_t, B_t, M_t)$ from data; $A$ is a learnable parameter matrix; fully adaptive to nonlinear and nonstationary environments.
Form of gain | $K_t = K_t^{\mathrm{KF}} + f(s_t; \theta_f)$: learns a residual correction to the classical Kalman gain. | $K_t = \phi(\mathrm{Innov}_t; \theta_\phi)$: directly learns the gain function from the innovation.
Gain network input | $s_t = [\hat{h}_{t|t-1},\ z_t - M\hat{h}_{t|t-1}]$: uses both the predicted state and the innovation as inputs. | $\mathrm{Innov}_t = z_t - M_t A_{t-1} C_{t-1}$: relies solely on the innovation signal for gain computation (see Appendix B.3 and Appendix B.4, which prove its sufficiency).
Output role | Outputs a residual correction $\Delta K_t$ added to $K_t^{\mathrm{KF}}$. | Outputs $K_t$ and integrates gain estimation into the state transition: $A_t = (I - K_t M_t)A$, $B_t = K_t$, forming a unified recurrent–estimation pathway.
System dependency | Requires partial knowledge of $(A, H, Q, R)$ for baseline Kalman computation. | Fully data-driven; no analytical Kalman gain or explicit system parameters required.
Theoretical interpretation | Neural residual learning to compensate for model mismatch. | Innovation-driven dynamic selectivity that enforces Kalman-optimal information update behavior.
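To make the gain-related rows of Table 2 concrete, the sketch below writes out one recurrence step in the form implied by the KOSLM column: $\mathrm{Innov}_t = z_t - M A C_{t-1}$, $K_t = \phi(\mathrm{Innov}_t; \theta_\phi)$, and $C_t = (I - K_t M)A\,C_{t-1} + K_t z_t$. This is only an illustrative reading of the table; the gain network $\phi$, its width, the fixed observation map, and all tensor shapes are placeholder choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class KalmanOptimalStep(nn.Module):
    """Illustrative single step of an innovation-driven gain:
    Innov_t = z_t - M A C_{t-1},  K_t = phi(Innov_t),
    C_t = (I - K_t M) A C_{t-1} + K_t z_t, i.e., A_t C_{t-1} + B_t z_t."""

    def __init__(self, state_dim: int, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.A = nn.Parameter(torch.eye(state_dim))                    # learnable transition matrix
        self.M = nn.Parameter(0.01 * torch.randn(obs_dim, state_dim))  # observation map (kept fixed over t here)
        self.phi = nn.Sequential(                                      # gain network phi(Innov_t; theta_phi)
            nn.Linear(obs_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, state_dim * obs_dim),
        )

    def forward(self, z_t: torch.Tensor, c_prev: torch.Tensor):
        # z_t: (batch, obs_dim), c_prev: (batch, state_dim)
        pred_state = (self.A @ c_prev.unsqueeze(-1)).squeeze(-1)       # A C_{t-1}
        innov = z_t - pred_state @ self.M.T                            # z_t - M A C_{t-1}
        K_t = self.phi(innov).view(-1, self.A.shape[0], self.M.shape[0])  # (batch, state, obs)

        eye = torch.eye(self.A.shape[0], device=z_t.device)
        gain_term = eye - K_t @ self.M                                 # (I - K_t M), per sample
        c_t = (gain_term @ pred_state.unsqueeze(-1)).squeeze(-1) \
            + (K_t @ z_t.unsqueeze(-1)).squeeze(-1)                    # A_t C_{t-1} + B_t z_t with B_t = K_t
        return c_t, K_t


if __name__ == "__main__":
    step = KalmanOptimalStep(state_dim=16, obs_dim=4)
    z = torch.randn(8, 4)     # batch of observations
    c = torch.zeros(8, 16)    # initial cell state
    c, K = step(z, c)
    print(c.shape, K.shape)   # torch.Size([8, 16]) torch.Size([8, 16, 4])
```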
Table 3. Details of benchmark datasets used in our experiments.
Dataset | Frequency | # Features | Time Steps | Time Span
ETTh1 | 1 h | 7 | 17,420 | 2016–2017
ETTh2 | 1 h | 7 | 17,420 | 2017–2018
ETTm1 | 15 min | 7 | 69,680 | 2016–2017
ETTm2 | 15 min | 7 | 69,680 | 2017–2018
Exchange | 1 day | 8 | 7588 | 1990–2010
Weather | 10 min | 21 | 52,696 | 2020
Electricity | 1 h | 321 | 26,304 | 2012–2014
ILI | 7 days | 7 | 966 | 2002–2020
Traffic | 1 h | 862 | 17,544 | 2015–2016
Table 4. Long-term forecasting: Multivariate long-term forecasting results on the Traffic, Electricity, Exchange, Weather, and ILI datasets. The prediction length is set to $T \in \{96, 192, 336, 720\}$ for all datasets except ILI, which uses prediction lengths $\{24, 36, 48, 60\}$ due to its weekly resolution. The best results are shown in bold red, and the second-best are underlined purple. All results are averaged over 5 runs. Lower is better.
ModelsKOSLMxLSTMTimeFiLMiTransformerFEDFormerS-MambaCrossformerDLinearPatchTSTTimeMixer
MetricMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAE
Traffic960.2900.2300.3580.2420.4160.2940.3950.2680.5870.3660.3820.2610.5220.2900.6500.3960.4620.2950.4620.285
1920.2600.2100.3780.2530.4080.2880.4170.2760.6040.3730.3960.2670.5300.2930.5980.3700.4660.2960.4730.296
3360.2270.1790.3920.2610.4250.2980.4330.2830.6210.3830.4170.2760.5580.3050.6050.3730.4820.3040.4980.297
7200.2900.2320.4340.2870.5200.3530.4670.3020.6260.3820.4600.3000.5890.3280.6450.3940.5140.3220.5060.313
+31.96%Avg0.2660.2120.3910.2610.4420.3080.4280.2820.6100.3760.4340.2870.5500.3040.6250.3830.4810.3040.4850.297
Electricity960.1360.1840.1280.2210.1540.2670.1480.2400.1930.3080.1390.2350.2190.3140.1970.2820.1810.2700.1530.247
1920.1310.1840.1500.2430.1640.2580.1620.2530.2010.3150.1590.2550.2310.3220.1960.2850.1880.2740.1660.256
3360.1200.1770.1660.2590.1880.2830.1780.2690.2140.3290.1760.2720.2460.3370.2090.3010.2040.2930.1850.277
7200.1580.2050.1850.2760.2360.3320.2250.3170.2460.3550.2040.2980.2800.3630.2450.3330.2460.3240.2250.310
+13.37%Avg0.1360.1870.1570.2500.1860.2850.1780.2700.2140.3270.1700.2650.2440.3340.2120.3000.2050.2900.1820.272
Exchange960.1350.2120.0860.2040.0860.2060.1480.2780.0860.2070.2560.3670.0880.2180.0880.2050.0950.207
1920.2790.2920.1880.2920.1770.2990.2710.3150.1820.3040.4700.5090.1760.3150.1760.2990.1510.293
3360.3410.3390.3560.4330.3310.4170.4600.4270.3320.4181.2680.8830.3130.4270.3010.3970.2640.361
7200.2820.3220.7270.6690.8470.6911.1950.6950.8670.7031.7671.0680.8390.6950.9010.7140.5860.602
+5.47%Avg0.2590.2910.3390.4000.3600.4030.5190.4290.3670.4080.9400.7070.3540.4140.3670.4040.2740.365
Weather960.1440.1710.1440.1870.1990.2620.1740.2140.2170.2960.1650.2100.1580.2300.1960.2550.1770.2180.1630.209
1920.2240.2360.1920.2360.2280.2880.2210.2540.2760.3360.2140.2520.2060.2770.2370.2960.2250.2590.2080.250
3360.2490.2480.2370.2720.2670.3230.2780.2960.3390.3800.2740.2970.2720.3350.2830.3350.2780.2970.2510.287
7200.1690.1750.3130.3260.3190.3610.3580.3470.4030.4280.3500.3450.3980.4180.3450.3810.3540.3480.3390.341
+11.26%Avg0.1970.2080.2220.2550.2530.3090.2580.2780.3090.3600.2510.2760.2590.3150.2650.3170.2590.2810.2400.271
ILI241.1600.5071.5140.6941.9700.8753.1541.2353.2281.2603.0411.1862.2151.0811.3190.7541.4530.827
361.2620.5611.5190.7221.9820.8592.5441.0832.6791.1503.4061.2321.9630.9631.5790.8701.6270.903
481.0980.5451.5000.7251.8680.8962.4891.1122.6221.0803.4591.2212.1301.0241.5530.8151.6440.914
601.1180.5831.4180.7152.0570.9292.6751.0342.8571.0783.6401.3052.3681.0961.4700.7881.6330.908
+22.04%Avg1.1600.5491.4880.7141.9690.8902.7151.1162.8471.1703.3871.2362.1691.0411.4800.8071.5890.888
Table 5. Long-term forecasting: Forecasting results on the ETT datasets with prediction lengths $T \in \{96, 192, 336, 720\}$. The best results are shown in bold red, and the second-best are underlined purple. All results are averaged over 5 runs. Lower is better.
ModelsKOSLMxLSTMTimeFiLMiTransformerFEDFormerS-MambaCrossformerDLinearPatchTSTTimeMixer
MetricMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAEMSEMAE
ETTm1960.2040.2210.2860.3350.3340.3680.3790.4190.3330.3680.4040.4260.3450.3720.3290.3670.3200.357
1920.2330.2370.3290.3610.3770.3910.4260.4410.3760.3900.4500.4510.3800.3890.3670.3850.3610.381
3360.4140.3390.3580.3790.4260.4200.4450.4590.4080.4130.5320.5150.4130.4130.3990.4100.3900.404
7200.4810.3680.4160.4110.4910.4590.5430.4900.4750.4480.6660.5890.4740.4530.4540.4390.4580.441
+4.03%Avg0.3330.2910.3470.3720.4070.4100.4480.4520.3980.4050.5130.4960.4030.4070.3870.4000.3820.395
ETTm2960.1000.2320.1640.2500.1650.2560.1800.2640.2030.2870.1790.2630.2870.3660.1930.2920.1750.2590.1750.258
1920.1320.2740.2180.2880.2220.2960.2500.3090.2690.3280.2500.3090.4140.4920.2840.3620.2410.3020.2370.299
3360.2440.3820.2710.3220.2770.3330.3110.3480.3250.3660.3120.3490.5970.5420.3690.4270.3050.3430.2980.340
7200.2680.4080.3610.3800.3710.3890.4120.4070.4210.4150.4110.4061.7301.0420.5540.5220.4020.4000.2750.323
+24.39%Avg0.1860.3240.2540.3100.2590.3190.2880.3320.3050.3490.2880.3320.7570.6100.3500.4010.2810.3260.2460.306
ETTh1960.2980.2770.3680.3950.3860.4050.3760.4190.3860.4050.4230.4480.3860.4000.4140.4190.3750.400
1920.3370.3060.4010.4160.4410.4360.4200.4480.4430.4370.4710.4740.4370.4320.4600.4450.4790.421
3360.3950.3340.4220.4370.4870.4580.4590.4650.4890.4680.5700.5460.4810.4590.5010.4660.4840.458
7200.4710.3640.4410.4650.5030.4910.5060.5070.5020.4890.6530.6210.5190.5160.5000.4880.4980.482
+8.09%Avg0.3750.3200.4080.4280.4540.4470.4400.4600.4550.4500.5290.5220.4560.4520.4690.4540.4590.440
ETTh2960.1940.3380.2730.3330.2970.3490.3580.3970.2960.3480.7450.5840.3330.3870.3020.3480.2890.341
1920.2380.3840.3400.3780.3800.4000.4290.4390.3760.3960.8770.6560.4770.4760.3880.4000.3720.392
3360.2580.3940.3730.4030.4280.4320.4960.4870.4240.4311.0430.7310.5940.5410.4260.4330.3860.414
7200.3140.4320.3980.4300.4270.4450.4630.4740.4260.4441.1040.7630.8310.6570.4310.4460.4120.434
+27.46%Avg0.2510.3870.3460.3860.3830.4070.4370.4490.3810.4050.9420.6840.5590.5150.3870.4070.3640.395
Table 6. Structural ablation. Comparison of KOSLM and its structural variants on four long-term forecasting datasets (L = 720). Removing either the Kalman gain path or the innovation input leads to consistent performance degradation. All results are averaged over five runs with mean ± standard deviation. Lower values (indicated by the down arrow “↓”) correspond to better performance.
Dataset | Model | MSE ↓ | MAE ↓ | Δ MSE vs. Full
ETTm1 | Full | 0.498 ± 0.011 | 0.416 ± 0.007 | –
ETTm1 | No-Gain | 0.662 ± 0.017 | 0.553 ± 0.014 | +32.9%
ETTm1 | No-Innov | 0.583 ± 0.012 | 0.487 ± 0.008 | +17.1%
ETTh1 | Full | 0.510 ± 0.012 | 0.378 ± 0.010 | –
ETTh1 | No-Gain | 0.605 ± 0.022 | 0.448 ± 0.015 | +18.6%
ETTh1 | No-Innov | 0.554 ± 0.016 | 0.411 ± 0.010 | +8.6%
Traffic | Full | 0.321 ± 0.008 | 0.195 ± 0.016 | –
Traffic | No-Gain | 0.428 ± 0.011 | 0.260 ± 0.008 | +33.4%
Traffic | No-Innov | 0.375 ± 0.009 | 0.228 ± 0.007 | +16.9%
Exchange | Full | 0.266 ± 0.014 | 0.341 ± 0.023 | –
Exchange | No-Gain | 0.299 ± 0.018 | 0.383 ± 0.023 | +12.3%
Exchange | No-Innov | 0.294 ± 0.015 | 0.377 ± 0.019 | +10.6%
Table 7. Memory footprint under different batch sizes, input length = 2048. KOSLM exhibits lower memory usage than Transformer and Mamba-2, while remaining comparable to LSTM despite its smaller size.
Batch Size | Transformer (GB) | Mamba-2 (GB) | LSTM (GB) | KOSLM (GB)
1 | 0.223 | 0.085 | 0.081 | 0.061
2 | 0.363 | 0.158 | 0.109 | 0.104
4 | 0.640 | 0.290 | 0.166 | 0.188
8 | 1.180 | 0.561 | 0.283 | 0.344
16 | 2.256 | 1.103 | 0.518 | 0.668
32 | 4.408 | 2.188 | 0.987 | 1.317
64 | 8.712 | 4.357 | 1.926 | 2.615
128 | 17.000 | 8.696 | 3.803 | 5.211
Table 8. Model parameter sizes. KOSLM remains lightweight while providing competitive long-horizon forecasting performance.
Model | Parameters (M)
LSTM | 2.17
Transformer | 5.33
Mamba-2 | 0.21
FiLM | 1.50
KOSLM | 0.24
