A Neural Network-Enhanced Kalman Filter for Time Series Anomaly Detection in Cyber-Physical Systems

Ma, Zhongnan; Xu, Wentao; Zhou, Hao; Yu, Ke; Wu, Xiaofei

doi:10.3390/s26082332

by Zhongnan Ma, Wentao Xu, Hao Zhou, Ke Yu * and Xiaofei Wu

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(8), 2332; https://doi.org/10.3390/s26082332

Submission received: 10 March 2026 / Revised: 6 April 2026 / Accepted: 7 April 2026 / Published: 9 April 2026

(This article belongs to the Section Industrial Sensors)


Abstract

Cyber-physical systems (CPSs) represent sophisticated intelligent architectures that tightly couple computational elements, communication networks, and physical processes. Their deployments now span virtually every industrial and civilian domain—from power grids and manufacturing plants to autonomous transportation networks. Ensuring the secure operation of CPSs relies fundamentally on effective time series anomaly detection, which remains a challenging task due to the complex, often unknown system dynamics and non-negligible sensor noise present in real-world environments. To address these challenges, we introduce a Neural Network-Enhanced Kalman Filter (NNEKF), a novel anomaly detection framework that combines model-based filtering with data-driven learning. The NNEKF employs a two-stage trained neural network with a specialized architecture: the first stage learns the underlying dynamics of the CPS, while the second stage optimizes the computation of the Kalman gain during the update step. At inference time, the enhanced Kalman filter recursively estimates the likelihood of observed sensor measurements to identify anomalies, supported by a batched parallel inference scheme that delivers substantial speedups. Extensive experiments on benchmark datasets demonstrate that the NNEKF attains an average F1-score of 0.935, coupled with rapid inference and minimal model footprint—surpassing all competitive baselines and facilitating dependable real-time anomaly detection for CPS environments.

Keywords:

anomaly detection; Kalman Filter; neural network; time series; Cyber-Physical Systems

1. Introduction

Cyber-physical systems (CPSs) integrate sensing and control functionalities with physical processes across diverse industrial domains, including smart grids, smart factories, and intelligent transportation systems [1,2]. These architectures depend on continuous, high-fidelity measurements to coordinate interconnected devices and process large-scale data streams. Yet their operational integrity is jeopardized by cyber-attacks, sensor faults, and human errors—risks that can corrupt measurement validity and inflict substantial economic or environmental damage. Safeguarding CPS security and data fidelity is therefore paramount. Effective sensor monitoring is essential, motivating the widespread adoption of time series anomaly detection to flag deviations in measurement streams. Whereas traditional statistical or rule-based methods falter amid the complexity of modern CPSs, deep learning-based techniques have gained traction as a compelling alternative.

Detecting anomalies in measurement signals remains an active research area. In particular, multivariate time series anomaly detection represents a core challenge for CPSs, with numerous methods proposed in the recent literature [3,4,5].

Conventional approaches encompass statistical methods, classical machine learning algorithms, and, increasingly, deep learning techniques. These data-driven methods have achieved remarkable progress, yet they often demand substantial training data and incur high time complexity.

In contrast, the Kalman filter (KF) offers inherent robustness against noise with low computational complexity, making it particularly suitable for resource-constrained CPS applications [6,7]. The KF operates through a recursive two-step process: in the prediction step, the prior state estimate is projected forward using a system dynamic model; in the update step, this prediction is refined by fusing the available observation with an optimally computed Kalman gain. This theoretically grounded framework has consistently demonstrated exceptional capability in estimating the states of dynamic systems.

Nevertheless, the KF’s efficacy hinges critically on accurate prior knowledge of system dynamics, typically encoded through a fully characterized state-space model. When the underlying physics are partially unknown or the system exhibits complex, nonlinear behaviors, the KF’s assumptions break down, leading to suboptimal estimation and degraded anomaly detection performance.

This fundamental limitation motivates our proposal of a hybrid approach that synergistically combines the theoretical soundness of the KF with the model-agnostic representation capabilities of deep neural networks (DNNs). The integration of the KF and DNN methodologies for time series anomaly detection presents several key challenges and research questions:

RQ1: How can the state-space model be effectively learned from data?
RQ2: How can we improve the calculation of Kalman gain?
RQ3: What constitutes an appropriate evaluation metric for accurately distinguishing between normal and anomalous data points?

To address these questions, this paper proposes a novel time series anomaly detection framework for CPSs, termed the Neural Network-Enhanced Kalman Filter (NNEKF). Specifically, we employ a two-stage trained neural network with a specialized architecture. In the first stage, the neural network learns the state transition function of the CPS, thereby capturing its underlying dynamics. In the second stage, the network optimizes the calculation of the Kalman gain in the update step. During the detection phase, the refined KF is naturally applied to track the uncertainty of the hidden states of the system and estimate the likelihood of the observed sensor measurements over time. The NNEKF framework integrates the strengths of both the KF and neural networks for time series anomaly detection. As a result, the NNEKF benefits from the ability to capture highly nonlinear system dynamics while maintaining robustness against process and sensor noises inherent in CPSs. We summarize the main contributions of our paper as follows:

We propose a novel anomaly detection architecture that integrates Kalman filtering with deep learning. A two-stage trained neural network first encodes system dynamics through the state transition function, then adaptively refines Kalman gain computation. The enhanced filter recursively evaluates measurement likelihoods for efficient anomaly identification.
We introduce a batched parallel inference mechanism that delivers orders-of-magnitude inference acceleration, achieving fast response time and low memory footprint—key requirements for real-time, resource-constrained CPS deployments.
Comprehensive experiments demonstrate that our method achieves an average F1-score of 0.935, comparable to state-of-the-art approaches in CPS anomaly detection, while maintaining low inference latency.
Ablation studies validate the efficacy of core architectural components, including the attention mechanism, K-network, and two-stage training. We further analyze the model’s sensitivity to key hyperparameters, noise, and missing data.

2. Background

2.1. Cyber-Physical Systems

In CPSs, the physical domain encompasses real-world processes requiring supervision, while the cyber domain comprises communication, computation, and control functionalities. These layers interface through sensors and actuators: sensors acquire the system state $x_{t}$ and transduce it into observations $y_{t}$, while the cyber layer and actuators generate control inputs $u_{t}$. For anomaly detection, $y_{t}$ is analyzed to estimate its probability distribution. An alert is triggered when the residual error surpasses a predefined threshold or the measurement likelihood drops below a specified bound [8], as illustrated in Figure 1.

The CPS can be typically characterized by a discrete-time nonlinear state-space representation, comprising two fundamental components: a state transition model and a measurement model.

State Transition Model: Consider a hidden state vector $x_{t} \in \mathbb{R}^{n}$ at time step t. The state evolution is governed by the following:

$$x_{t} = f(x_{t-1}, u_{t}) + w_{t}, \quad w_{t} \sim \mathcal{N}(0, Q), \tag{1}$$

where the state transition incorporates both deterministic dynamics and stochastic perturbations. The deterministic component is described by a nonlinear mapping $f : \mathbb{R}^{n} \times \mathbb{R}^{m} \to \mathbb{R}^{n}$, with $x_{t} \in \mathbb{R}^{n}$ representing the system state and $u_{t} \in \mathbb{R}^{m}$ denoting the control input. The stochastic component $w_{t}$ represents zero-mean Gaussian process noise with covariance matrix $Q$.

Measurement Model: The observation process is modeled as follows:

$$y_{t} = h(x_{t}) + v_{t}, \quad v_{t} \sim \mathcal{N}(0, R), \tag{2}$$

where $y_{t} \in \mathbb{R}^{p}$ constitutes the observation vector, $h : \mathbb{R}^{n} \to \mathbb{R}^{p}$ is the measurement mapping, and $v_{t}$ represents zero-mean Gaussian measurement noise with covariance matrix $R$. At each time t, the observed measurement $y_{t}$ is therefore a noise-corrupted image of the true system state under $h$.
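To make Equations (1) and (2) concrete, the following is a minimal simulation sketch. The toy mappings `f` and `h`, the dimensions, and the noise covariances are illustrative assumptions and are not taken from any of the datasets discussed in this paper.

```python
import numpy as np

def simulate(f, h, x0, controls, Q, R, seed=0):
    """Roll out x_t = f(x_{t-1}, u_t) + w_t and y_t = h(x_t) + v_t."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    states, observations = [], []
    for u in controls:
        w = rng.multivariate_normal(np.zeros(Q.shape[0]), Q)  # process noise w_t
        x = f(x, u) + w
        v = rng.multivariate_normal(np.zeros(R.shape[0]), R)  # measurement noise v_t
        states.append(x)
        observations.append(h(x) + v)
    return np.array(states), np.array(observations)

# Hypothetical toy system: n = 2 states, m = 1 control input, p = 1 observation.
f = lambda x, u: np.array([0.9 * x[0] + 0.1 * np.sin(x[1]), 0.8 * x[1] + u[0]])
h = lambda x: np.array([x[0] + x[1]])
controls = np.ones((50, 1))
X, Y = simulate(f, h, [0.0, 0.0], controls, Q=0.01 * np.eye(2), R=0.04 * np.eye(1))
```

Only `Y` and `controls` would be visible to a detector; `X` is the hidden state that the filter has to estimate.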

2.2. Kalman Filter

Our proposed model builds upon the classical KF framework, necessitating a brief overview. The KF is an optimal linear recursive estimator that minimizes the mean squared error under Gaussian noise assumptions. For linear systems, there exist matrices $F$ (state transition), $B$ (control input), and $H$ (observation) such that

$$f(x_{t-1}, u_{t}) = F x_{t-1} + B u_{t}, \quad h(x_{t}) = H x_{t}. \tag{3}$$

The KF operates in two phases:

predict:

$$\hat{x}_{t|t-1} = F \hat{x}_{t-1} + B u_{t}, \tag{4a}$$

$$\hat{\Sigma}_{t|t-1} = F \hat{\Sigma}_{t-1} F^{T} + Q, \tag{4b}$$

$$\hat{y}_{t|t-1} = H \hat{x}_{t|t-1}, \tag{5a}$$

$$S_{t} = H \hat{\Sigma}_{t|t-1} H^{T} + R. \tag{5b}$$

update:

$$\hat{x}_{t} = \hat{x}_{t|t-1} + K_{t} \Delta y_{t}, \tag{6a}$$

$$\hat{\Sigma}_{t} = (I - K_{t} H) \hat{\Sigma}_{t|t-1}. \tag{6b}$$

The Kalman gain matrix $K_{t}$ is computed as follows:

$$K_{t} = \hat{\Sigma}_{t|t-1} H^{T} S_{t}^{-1}, \tag{7}$$

with innovation $\Delta y_{t} = y_{t} - \hat{y}_{t|t-1}$.
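The predict and update phases can be sketched in a few lines of NumPy. This is an illustrative implementation of the textbook linear KF, assuming known matrices; the function name `kf_step` and its argument order are choices made for this sketch only.

```python
import numpy as np

def kf_step(x_post, P_post, y, u, F, B, H, Q, R):
    """One predict/update cycle of the linear Kalman filter."""
    # Predict: Equations (4a) and (4b).
    x_prior = F @ x_post + B @ u
    P_prior = F @ P_post @ F.T + Q
    # Predicted observation and innovation covariance: Equations (5a) and (5b).
    y_prior = H @ x_prior
    S = H @ P_prior @ H.T + R
    # Gain, Equation (7): K = P_prior H^T S^{-1}, via a linear solve instead of an explicit inverse.
    K = np.linalg.solve(S.T, (P_prior @ H.T).T).T
    # Update: Equations (6a) and (6b).
    innovation = y - y_prior
    x_new = x_prior + K @ innovation
    P_new = (np.eye(len(x_post)) - K @ H) @ P_prior
    return x_new, P_new, innovation, S
```

For a scalar system with $F = H = 1$, $Q = 0$, $R = 1$, prior covariance 1, and observation 2, the gain is 0.5, so the posterior mean moves halfway from 0 to 2.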

The Extended Kalman Filter (EKF) [9] generalizes the KF to nonlinear systems via first-order Taylor linearization:

$$\hat{x}_{t|t-1} = f(\hat{x}_{t-1}, u_{t}), \quad F_{t} = J_{f}(\hat{x}_{t-1}), \tag{8}$$

$$\hat{y}_{t|t-1} = h(\hat{x}_{t|t-1}), \quad H_{t} = J_{h}(\hat{x}_{t|t-1}), \tag{9}$$

where $F_{t}$ and $H_{t}$ are the Jacobians of $f$ and $h$ evaluated at the current estimates; they replace $F$ and $H$ in covariance propagation and gain computation.
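The Jacobians in Equations (8) and (9) are normally derived analytically or by automatic differentiation. As a minimal sketch, assuming neither is available, a central finite difference gives a usable approximation. The helper `numerical_jacobian` and the toy mapping `h_toy` are hypothetical.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of func at x."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(func(x), dtype=float)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (np.asarray(func(x + step)) - np.asarray(func(x - step))) / (2 * eps)
    return J

# Hypothetical nonlinear measurement mapping with n = 2 and p = 2.
h_toy = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
H_t = numerical_jacobian(h_toy, np.array([1.0, 2.0]))
```

The result `H_t` would take the place of $H$ in Equations (5b), (6b), and (7).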

2.3. Related Work

Time series anomaly detection has witnessed substantial progress through deep learning methodologies.

Reconstruction-based architectures have demonstrated particular effectiveness for this task. The Deep Autoencoding Gaussian Mixture Model (DAGMM) [10] integrates deep autoencoders with Gaussian Mixture Models (GMMs) for latent space probabilistic modeling. OmniAnomaly [11] employs Gated Recurrent Units (GRUs) and Variational Autoencoders (VAEs) to identify anomalies based on reconstruction probabilities. USAD [12] introduces an unsupervised adversarial framework for multivariate time series, while GDN [13] leverages graph neural networks to capture sensor dependencies for anomaly detection. DTGMM [14] combines deep autoencoders, Transformer, and GMMs to enhance condition monitoring and early warning capabilities for critical equipment such as boiler superheaters and turbine bearings.

Attention mechanisms have emerged as a critical component for enhancing temporal modeling in anomaly detection. Anomaly Transformer [15] utilizes self-attention with a minimax training strategy to amplify discrepancies between normal and abnormal patterns. Beyond time series anomaly detection, attention–GRU architectures have demonstrated superior performance over traditional methods in industrial predictive maintenance scenarios [16]. These developments corroborate our motivation for employing attention mechanisms to refine state transition learning within a Kalman filtering framework.

Concurrently, model-based techniques—particularly Kalman filter variants—have gained renewed attention for anomaly detection in dynamical systems. The standard KF assumes linear dynamics and Gaussian noise, which often prove inadequate in practice. To address nonlinearities, the Extended Kalman Filter (EKF) [9] employs first-order Taylor expansion (Equations (8) and (9)), while the unscented Kalman filter (UKF) [17] utilizes deterministic sampling for improved state estimation. The Particle Filter (PF) [18] accommodates non-Gaussian distributions through sequential Monte Carlo methods. However, these extensions fundamentally rely on explicit system models, limiting applicability when dynamics are unknown or evolve over time.

Two emerging research directions have sought to integrate deep learning with Kalman filtering, directly motivating our work on state-space model learning (RQ1) and Kalman gain learning (RQ2).

Learnable Kalman Gain (RQ2). The conventional Kalman gain is theoretically optimal only for linear Gaussian systems; under nonlinear or non-Gaussian conditions, it becomes suboptimal and degrades tracking performance over time. To address this, KalmanNet [19] and Split-KalmanNet [20] employ GRUs to learn the gain directly from observation-state residuals, bypassing explicit covariance calculations. While these approaches demonstrate that neural augmentation in the update step improves filtering performance, they do not address the simultaneous learning of system dynamics.

Neural State-Space Models (RQ1). For systems with complex or partially known dynamics, an alternative strategy encodes observations into a latent space governed by a simplified (typically linear Gaussian) state-space model [21,22,23]. Complementing this, refs. [24,25,26] directly estimate state-space parameters via neural networks to mitigate model mismatch—a critical limitation of conventional KF requiring precise model specifications.

Our work advances neural-augmented Kalman filtering for anomaly detection. While NSIBF [27] learns state transition and measurement functions, it fails to capture long-range dependencies and incurs high inference costs. We address these limitations through: (1) an attention mechanism for refined state transition learning (RQ1), inspired by attention–GRU architectures [16], and (2) a lightweight neural network for dynamic Kalman gain computation (RQ2) with parallel inference for significant speedup. This integration achieves superior accuracy with lower computational overhead than prior approaches.

3. Methodology

This section presents our proposed neural network architecture for state-space model learning, Kalman gain learning, and the integrated NNEKF framework for time series anomaly detection.

3.1. State-Space Model Learning

Let $x_{t}$, $y_{t}$, and $u_{t}$ denote the hidden state, sensor observation, and actuator state, respectively, at discrete time t. To address RQ1, we propose a neural network architecture that employs self-attention mechanisms to learn the underlying state-space dynamics. As illustrated in Figure 2, our framework comprises three subnets (f, g, and h):

The network f learns the state transition function. It operates on two inputs: the previous hidden state $x_{t-1}$ and a historical sequence ${(y, u)}^{t-l:t-1}$ of sensor observations and actuator states from a sliding window of length l.
The historical sequence is first encoded by LSTM layers [28] into hidden representations $\{h_{t-l}, \dots, h_{t-1}\}$. Unlike prior work such as NSIBF [27] that relies solely on LSTM for temporal modeling, we employ a self-attention mechanism where $x_{t-1}$ serves as the query and each LSTM output $h_{i}$ supplies the key and value. Attention weights $\alpha_{i} = \mathrm{softmax}(a_{i})$ are computed from scores $a_{i}$ given by a learned compatibility function, allowing for dynamic focus on relevant historical moments. The context vector $c_{t-1} = \sum_{i=t-l}^{t-1} \alpha_{i} h_{i}$ is concatenated with $x_{t-1}$ and mapped to $x_{t}$ through an MLP.
The network g encodes the sensor observation $y_{t - 1}$ into the corresponding hidden state $x_{t - 1}$ .
The network h reconstructs the sensor observation ${\tilde{y}}_{t - 1}$ from hidden state $x_{t - 1}$ .
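The attention step inside the network f can be illustrated as follows. The text does not specify the compatibility function, so this sketch assumes scaled dot-product scoring with a hypothetical projection matrix `W_q`; the LSTM outputs are replaced by random stand-ins.

```python
import numpy as np

def attention_context(x_prev, H_seq, W_q):
    """x_prev: (n,) query state; H_seq: (l, d) encoder outputs; W_q: (d, n) assumed projection."""
    query = W_q @ x_prev                                # project the state into the key space
    scores = H_seq @ query / np.sqrt(H_seq.shape[1])    # compatibility scores a_i, shape (l,)
    scores -= scores.max()                              # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()       # softmax weights alpha_i
    context = alpha @ H_seq                             # c_{t-1} = sum_i alpha_i h_i
    return np.concatenate([context, x_prev]), alpha     # MLP input and weights

rng = np.random.default_rng(0)
n, d, l = 3, 4, 5
x_prev = rng.normal(size=n)
H_seq = rng.normal(size=(l, d))      # stand-in for LSTM hidden representations
W_q = rng.normal(size=(d, n))
features, alpha = attention_context(x_prev, H_seq, W_q)
```

The concatenated vector `features` is what the MLP would map to $x_{t}$.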

The network accepts $y_{t-1}$ and the historical sequence ${(y, u)}^{t-l:t-1}$ as inputs, producing the reconstructed observation $\tilde{y}_{t-1}$ and the predicted observation $\tilde{y}_{t}$ as outputs.

Assume the training dataset comprises T time points. The loss function is as follows:

$$L = \sum_{t=l}^{T} \left( w_{1} \| y_{t-1} - \tilde{y}_{t-1} \|_{2}^{2} + w_{2} \| y_{t} - \tilde{y}_{t} \|_{2}^{2} + w_{3} \| x_{t} - x_{t-1} \|_{2}^{2} \right), \tag{10}$$

where the first two terms capture reconstruction and prediction errors, the third term enforces temporal smoothness, and $w_{1}$, $w_{2}$, and $w_{3}$ denote weighting hyperparameters. We set $w_{1} = 0.45$, $w_{2} = 0.45$, and $w_{3} = 0.1$, yielding a combined weight of $0.9$ for the reconstruction and prediction terms. This configuration ensures the accurate modeling of system observations, while the relatively small weight on smoothness prevents over-regularization and preserves responsiveness to rapid state changes. The detailed hyperparameter configuration is provided in Appendix A.
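As a small illustration of Equation (10), the weighted loss can be evaluated on arrays of stacked time steps. The function name and argument layout are assumptions of this sketch; in training, the reconstructed and predicted observations would come from the networks g, h, and f.

```python
import numpy as np

def dynamics_loss(y_prev, y_prev_rec, y_curr, y_curr_pred, x_prev, x_curr,
                  w=(0.45, 0.45, 0.1)):
    """Weighted sum of reconstruction, prediction, and smoothness terms (one row per time step)."""
    sq = lambda a, b: np.sum((a - b) ** 2, axis=-1)   # squared L2 norm per time step
    per_step = (w[0] * sq(y_prev, y_prev_rec)         # reconstruction error
                + w[1] * sq(y_curr, y_curr_pred)      # prediction error
                + w[2] * sq(x_curr, x_prev))          # temporal smoothness
    return float(per_step.sum())
```

With three time steps, a reconstruction error of 2 per step, and all other terms zero, the loss is $0.45 \times 2 \times 3 = 2.7$.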

After training, we obtain the learned state transition and measurement functions:

$$x_{t} = f(x_{t-1}, \{y_{t-l:t-1}, u_{t-l:t-1}\}), \quad y_{t} = h(x_{t}). \tag{11}$$

These correspond to the functions $f(\cdot)$ and $h(\cdot)$ defined in Equations (1) and (2), respectively.

3.2. Kalman Gain Learning

3.2.1. Overall Architecture

To address RQ2, we present an enhanced KF algorithm in this section. Specifically, we introduce a K-network to learn the Kalman gain (KG) from data, which is then integrated with the previously learned system dynamics $f(\cdot)$ and measurement function $h(\cdot)$ from Section 3.1 within the KF framework. The overall architecture of our proposed approach is summarized in Figure 3. At each time step t, as in the KF, the forward propagation of the NNEKF is divided into two steps: predict and update. The difference is that the NNEKF tracks only the mean estimate and does not propagate a covariance estimate.

predict: The prior state estimate $\hat{x}_{t|t-1}$ is computed using the posterior estimate from the previous time step $\hat{x}_{t-1}$ along with sensor measurements and actuator states from a sliding window of the past l time steps, $W_{t-1} = \{y_{t-l:t-1}, u_{t-l:t-1}\}$, as defined in (12a). Note that the initial posterior estimate $\hat{x}_{1}$ is derived from the initial observation $y_{1}$ through the encoder $g(\cdot)$ via Equation (12c).

Subsequently, the prior observation estimate $\hat{y}_{t|t-1}$ is derived from $\hat{x}_{t|t-1}$ through (12b). The networks $f(\cdot)$, $g(\cdot)$, and $h(\cdot)$ are learned as described in Section 3.1.

$$\hat{x}_{t|t-1} = f(\hat{x}_{t-1}, W_{t-1}), \tag{12a}$$

$$\hat{y}_{t|t-1} = h(\hat{x}_{t|t-1}), \tag{12b}$$

$$\hat{x}_{1} = g(y_{1}). \tag{12c}$$

update: The NNEKF updates the posterior state estimate $\hat{x}_{t}$ using the new observation $y_{t}$ and prior estimate $\hat{x}_{t|t-1}$ through a process similar to the classical KF (Equations (13a) and (13b)). However, unlike the conventional KF, the KG is not computed analytically but learned directly from data using an RNN (13c). The recurrent architecture’s inherent memory enables implicit KG computation without explicit covariance statistics.

$$\Delta \hat{y}_{t} = y_{t} - \hat{y}_{t|t-1}, \tag{13a}$$

$$\hat{x}_{t} = \hat{x}_{t|t-1} + K_{t} \Delta \hat{y}_{t}, \tag{13b}$$

$$K_{t} = K(\Delta y_{t}, \Delta \hat{y}_{t}, \Delta \hat{x}_{t}, \Delta x_{t}). \tag{13c}$$

3.2.2. K-Network Architecture

In the following, we detail the design of the K-network for learning the Kalman gain (KG). The computation of KG requires inputs that capture the statistical relationships between observations and state estimates. At each time step t, the K-network receives inputs that encode the statistical properties of both the current observation $y_{t}$ and the previous state estimate $\hat{x}_{t-1}$.

We define the following input features to characterize the unknown statistical relationships within the CPS model:

F1: Observation difference $\Delta y_{t} = y_{t} - y_{t-1}$.
F2: Innovation difference $\Delta \hat{y}_{t} = y_{t} - \hat{y}_{t|t-1}$.
F3: State posterior difference $\Delta \hat{x}_{t} = \hat{x}_{t} - \hat{x}_{t-1}$.
F4: State update difference $\Delta x_{t} = \hat{x}_{t} - \hat{x}_{t|t-1}$.

These four features are designed to enable the K-network to learn the underlying statistics of states and observations effectively, thereby supporting accurate KG computation.
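One NNEKF step with these features can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: F3 and F4 depend on the posterior $\hat{x}_{t}$, which is only available after the update, so the values from the previous step are fed to the gain function; and the trained K-network is replaced by a hypothetical constant-gain stand-in `gain_fn`.

```python
import numpy as np

def nnekf_step(f, h, gain_fn, x_post, window, y, y_prev, dx_post_prev, dx_upd_prev):
    """One predict/update step with a learned gain; returns the new state and feature values."""
    x_prior = f(x_post, window)        # prior state estimate, Equation (12a)
    y_prior = h(x_prior)               # prior observation estimate, Equation (12b)
    obs_diff = y - y_prev              # F1: observation difference
    innovation = y - y_prior           # F2: innovation
    # F3 and F4 are passed in from the previous step (assumption of this sketch).
    K = gain_fn(obs_diff, innovation, dx_post_prev, dx_upd_prev)   # gain of shape (n, m)
    x_new = x_prior + K @ innovation   # posterior state estimate
    return x_new, h(x_new), x_new - x_post, x_new - x_prior

# Hypothetical stand-ins for the trained networks: n = 2 states, m = 1 observation.
f_toy = lambda x, W: 0.9 * x
h_toy = lambda x: x[:1]
gain_toy = lambda *features: np.array([[0.5], [0.0]])
x_new, y_post, dx_post, dx_upd = nnekf_step(
    f_toy, h_toy, gain_toy, np.array([1.0, 1.0]), None,
    np.array([2.0]), np.array([1.9]), np.zeros(2), np.zeros(2))
```

The last two return values are the F3 and F4 features to pass into the next call.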

We design the K-network using attention mechanisms and Gated Recurrent Units (GRUs). As shown in Figure 4, it integrates multi-head attention with GRU-based memory to model statistical dependencies for Kalman gain estimation. Input features are first embedded into a latent space, then processed by multi-head attention with residual connections and layer normalization. The refined features pass through a GRU to capture temporal dynamics in hidden state $h_{t}$, which is decoded into the Kalman gain $K_{t} \in \mathbb{R}^{n \times m}$ via a linear output layer, where n is the state dimension and m is the observation dimension, so that $K_{t}$ maps an observation residual to a state correction.

We define the MSE loss between predicted and actual observations:

$$L = \| \hat{y}_{t} - y_{t} \|^{2}, \tag{14}$$

where

$$\hat{y}_{t} = h(\hat{x}_{t}). \tag{15}$$

We partition the training dataset into N distinct sequences. Specifically, denoting the length of the i-th training sequence as $T_{i}$, the dataset can be formally represented as $D = \{Y_{i}\}_{i=1}^{N}$, where $Y_{i} = [y_{1}^{i}, y_{2}^{i}, \dots, y_{T_{i}}^{i}]$ represents the i-th sequence.

Letting $\theta$ denote the trainable parameters of the K-network, we formulate the mean squared error (MSE) loss function as follows:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_{i}} \| \hat{y}_{t}^{i} - y_{t}^{i} \|^{2}, \tag{16}$$

where $\hat{y}_{t}^{i}$ denotes the predicted observation at time step t for the i-th sequence.
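Equation (16) sums squared errors within each sequence and averages over sequences, so sequences of different lengths are handled naturally. A minimal sketch follows; the function name `kg_loss` is hypothetical, and the predictions would in practice come from running the full NNEKF forward pass.

```python
import numpy as np

def kg_loss(pred_seqs, obs_seqs):
    """Average over N sequences of the summed squared observation errors."""
    total = sum(np.sum((p - y) ** 2) for p, y in zip(pred_seqs, obs_seqs))
    return total / len(obs_seqs)

# Two toy sequences of different lengths (T_1 = 3, T_2 = 2) with 2-dimensional observations.
preds = [np.zeros((3, 2)), np.zeros((2, 2))]
obs = [np.ones((3, 2)), 2.0 * np.ones((2, 2))]
loss = kg_loss(preds, obs)
```

Here the per-sequence sums are 6 and 16, giving a loss of 11.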

While forward propagation requires the complete NNEKF architecture (including f-net and h-net), we focus exclusively on KG learning during backpropagation. Consequently, we freeze the pre-trained parameters of f-net and h-net, updating only the K-network’s parameters.

3.3. Analysis of Learning Algorithm

This subsection analyzes the gradient propagation mechanism to elucidate why end-to-end training suffers from severe gradient interference and how the proposed two-stage training achieves gradient decoupling to ensure stable convergence.

Let $\theta_{dyn} = \{\theta_{f}, \theta_{g}, \theta_{h}\}$ denote the parameters of the state-space model learning module (comprising networks f, g, and h), and let $\theta_{kg}$ denote the parameters of the Kalman gain network $K_{\theta_{kg}}$. The dynamic learning loss $L_{dyn}(\theta_{dyn})$ and the Kalman gain learning loss $L_{kg}(\theta_{dyn}, \theta_{kg})$ are defined in Equations (10) and (14), respectively.

3.3.1. Gradient Interference in End-to-End Training

Under end-to-end joint training, the total loss is

$$L_{total}(\theta_{dyn}, \theta_{kg}) = L_{dyn}(\theta_{dyn}) + \lambda L_{kg}(\theta_{dyn}, \theta_{kg}), \quad \lambda > 0. \tag{17}$$

The gradient with respect to $\theta_{dyn}$ contains a destructive cross-term:

$$\nabla_{\theta_{dyn}} L_{total} = \nabla_{\theta_{dyn}} L_{dyn} + \lambda \nabla_{\theta_{dyn}} L_{kg}, \tag{18}$$

where, writing $\hat{x}_{t|t}$ and $\hat{y}_{t|t}$ for the posterior estimates $\hat{x}_{t}$ and $\hat{y}_{t}$, the cross-term expands to the following:

$$\nabla_{\theta_{dyn}} L_{kg} = \sum_{t=1}^{T} 2 (\hat{y}_{t|t} - y_{t})^{\top} \frac{\partial h(\hat{x}_{t|t})}{\partial \hat{x}_{t|t}} \left( \frac{\partial \hat{x}_{t|t-1}}{\partial \theta_{dyn}} + K_{t} \frac{\partial (y_{t} - h(\hat{x}_{t|t-1}))}{\partial \theta_{dyn}} \right). \tag{19}$$

Because the Kalman filter is recursive, gradient conflicts accumulate over time steps, progressively amplifying modeling errors in the dynamic module. This mirrors the gradient pathology in Physics-Informed Neural Networks (PINNs), where the simultaneous optimization of competing objectives leads to destructive interference [29]. Specifically,

$\nabla_{θ_{d y n}} L_{d y n}$ pushes $θ_{d y n}$ toward stable, physically consistent state transitions;
$\nabla_{θ_{d y n}} L_{k g}$ adapts $θ_{d y n}$ to compensate for instantaneous Kalman gain errors, encouraging overfitting to closed-loop residuals.

Simultaneously optimizing these incompatible objectives causes training instability, slow convergence, and poor generalization—a failure mode analogous to PINN training collapse in convection-dominated PDEs [29].

3.3.2. Gradient Decoupling via Two-Stage Training

The proposed two-stage training eliminates interference by sequential optimization with parameter freezing.

Stage 1: State-Space Model Learning. Only the dynamic loss $L_{dyn}$ is optimized:

$$\theta_{dyn}^{\star} = \arg \min_{\theta_{dyn}} L_{dyn}(\theta_{dyn}).$$

The Kalman gain network is inactive, so the gradient updates for $\theta_{dyn}$ are guided solely by reconstruction and prediction errors. This ensures that the dynamic module converges to a physically consistent representation of the CPS dynamics, unaffected by the filtering objective.

Stage 2: Kalman Gain Learning. After Stage 1, $\theta_{dyn}^{\star}$ is frozen. As a result, the cross-term $\nabla_{\theta_{dyn}} L_{kg}$ vanishes, and the gradient of $L_{kg}$ no longer propagates into the dynamic module. The Kalman gain network is optimized independently within a fixed, stable feature space, completely decoupled from the dynamic model. Both sub-tasks therefore converge along their own optimal directions, free from conflicting gradient signals.

3.4. Anomaly Detection

In this section, we address anomaly detection in multivariate time series observations $y_{t}$. Let $y_{t} \in \mathbb{R}^{m}$ denote the m-dimensional observation vector at time t, where each component $y_{t,i}$ represents the i-th variable. An anomaly score $S_{t}$ is computed at each time step, and an observation is flagged as anomalous if $S_{t}$ exceeds a predefined threshold.

The anomaly score $S_{t}$ for observation $y_{t}$ requires an estimate $\hat{y}_{t}$ of the expected observation, obtained through the following recursive procedure:

1. Initialization: $\hat{x}_{1} = g(y_{1})$, where g is the encoder from Section 3.1.
2. Prediction: Compute the prior state estimate $\hat{x}_{t|t-1}$ via (12a) and the prior observation estimate $\hat{y}_{t|t-1}$ via (12b).
3. Update: Obtain the posterior state estimate $\hat{x}_{t}$ via (13a)–(13c), and compute the posterior observation estimate $\hat{y}_{t}$ via (15).

The posterior state estimate $\hat{x}_{t}$ is then used to predict the next prior state $\hat{x}_{t+1|t}$ and observation $\hat{y}_{t+1|t}$. The estimated observation $\hat{y}_{t}$ serves to compute the anomaly score $S_{t}$ for the current observation $y_{t}$.

To address RQ3, an appropriate anomaly metric is crucial for effective detection. A straightforward approach computes the squared Euclidean distance between $y_{t}$ and $\hat{y}_{t}$, equivalent to the loss function in (14):

$$S_{t} = \| \hat{y}_{t} - y_{t} \|^{2}. \tag{20}$$

Alternatively, following NSIBF [27], we employ the Mahalanobis distance (MD) [30]:

$$S_{t} = \sqrt{(y_{t} - \hat{y}_{t})^{\top} R^{-1} (y_{t} - \hat{y}_{t})}. \tag{21}$$

We introduce the covariance matrix $R$ to capture observation uncertainty, estimated as follows:

$$\tilde{y}_{t} = h(x_{t}), \quad \Delta \tilde{y}_{t} = \tilde{y}_{t} - y_{t}, \tag{22a}$$

$$\Delta \tilde{Y} = [\Delta \tilde{y}_{1}, \dots, \Delta \tilde{y}_{T}]^{\top} \in \mathbb{R}^{T \times m}, \tag{22b}$$

$$R = \mathrm{cov}(\Delta \tilde{Y}) \in \mathbb{R}^{m \times m}, \tag{22c}$$

where $\Delta \tilde{y}_{t}$ represents the reconstruction error at time t, and $R$ provides a global estimate of observation uncertainty across the training dataset of size T.
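The two anomaly scores and the estimate of $R$ can be sketched as follows. The small ridge term added to $R$ is an assumption of this sketch, included only to keep the matrix invertible when some residual dimensions are nearly constant; the function names are hypothetical.

```python
import numpy as np

def estimate_R(Y, Y_rec, ridge=1e-6):
    """Covariance of reconstruction residuals over the training set, Equations (22a)-(22c)."""
    residuals = Y_rec - Y                                    # shape (T, m)
    R = np.atleast_2d(np.cov(residuals, rowvar=False))       # shape (m, m)
    return R + ridge * np.eye(R.shape[0])                    # ridge: assumption, not in the text

def mse_score(y, y_hat):
    """Squared Euclidean distance, Equation (20)."""
    return float(np.sum((y_hat - y) ** 2))

def mahalanobis_score(y, y_hat, R):
    """Mahalanobis distance, Equation (21), using a linear solve instead of an explicit inverse."""
    d = y - y_hat
    return float(np.sqrt(d @ np.linalg.solve(R, d)))
```

With $R = \mathrm{diag}(4, 1)$ and a residual of $(2, 0)$, the squared-error score is 4 while the Mahalanobis score is 1, because the deviation lies along the high-variance dimension.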

As detailed in Algorithm 1, the per-step time complexity of the NNEKF is $O(l n^{2} + n m + m^{2})$, with l denoting the window length, n the state dimension, and m the observation dimension. By comparison, the standard Kalman filter requires $O(n^{3})$ operations stemming from the explicit covariance propagation in (4b) and the gain computation in (7). These complexities align when n and m are comparable in scale.

Algorithm 1 NNEKF Inference (Anomaly Detection) at Time Step t
Require: Posterior state estimate $\hat{x}_{t-1}$, historical window $W_{t-1} = \{y_{t-l:t-1}, u_{t-l:t-1}\}$, current observation $y_{t}$, pre-trained networks f, h, K, covariance matrix $R$ (estimated from training data).
Ensure: Posterior state estimate $\hat{x}_{t}$, anomaly score $S_{t}$.
1: Predict step
2: $\hat{x}_{t|t-1} \leftarrow f(\hat{x}_{t-1}, W_{t-1})$	▹ $O(l \cdot n^{2})$
3: $\hat{y}_{t|t-1} \leftarrow h(\hat{x}_{t|t-1})$	▹ $O(n \cdot m)$
4: Compute innovation and features
5: $\Delta y_{t} \leftarrow y_{t} - y_{t-1}$	▹ $O(m)$
6: $\Delta \hat{y}_{t} \leftarrow y_{t} - \hat{y}_{t|t-1}$	▹ $O(m)$
7: $\Delta \hat{x}_{t} \leftarrow \hat{x}_{t} - \hat{x}_{t-1}$	▹ $O(n)$
8: $\Delta x_{t} \leftarrow \hat{x}_{t} - \hat{x}_{t|t-1}$	▹ $O(n)$
9: Kalman gain computation
10: $K_{t} \leftarrow K(\Delta y_{t}, \Delta \hat{y}_{t}, \Delta \hat{x}_{t}, \Delta x_{t})$	▹ $O(n \cdot m)$
11: Update step
12: $\hat{x}_{t} \leftarrow \hat{x}_{t|t-1} + K_{t} \Delta \hat{y}_{t}$	▹ $O(n \cdot m)$
13: Compute anomaly score
14: if using MSE then
15: $S_{t} \leftarrow \| y_{t} - \hat{y}_{t|t-1} \|^{2}$	▹ $O(m)$
16: else
17: $S_{t} \leftarrow \sqrt{(y_{t} - \hat{y}_{t|t-1})^{\top} R^{-1} (y_{t} - \hat{y}_{t|t-1})}$	▹ $O(m^{2})$
18: end if
19: return $\hat{x}_{t}$, $S_{t}$

The baseline NSIBF [27] employs an unscented Kalman filter (UKF) built upon deterministic sigma points. In practice, its per-step cost is substantially higher than the $O(n^{3})$ bound suggests, owing to the overhead of propagating $2n + 1$ sigma points through nonlinear transformations. Furthermore, traditional Kalman-type estimators, including NSIBF, adhere to a rigid sequential structure: the estimate $\hat{x}_{t}$ is contingent upon $\hat{x}_{t-1}$, precluding straightforward parallelization across time steps.

The NNEKF circumvents this constraint by exploiting the learned encoder $g(\cdot)$, which maps an observation to a hidden state, to initialize states at arbitrary time steps, thereby severing long-range temporal dependencies. This facilitates the batched parallel inference framework depicted in Figure 5: the time series is segmented into B independent batches, each initialized by applying $g(\cdot)$ to its leading observation. All batches are processed simultaneously on the GPU using the same recursive update rules as sequential execution, and the resulting anomaly scores are concatenated. This parallel scheme reduces wall-clock time to roughly $O((T / B) \cdot (l n^{2} + n m + m^{2}))$, as corroborated empirically in Section 4.4, where the NNEKF delivers orders-of-magnitude acceleration relative to NSIBF across all four datasets.
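The batching idea can be sketched in NumPy, with vectorization over the batch axis standing in for GPU parallelism. Several simplifications are assumptions of this sketch: the history window is omitted, the score is the squared prior residual, any trailing remainder of the series is dropped, and the maps `g_toy`, `f_toy`, `h_toy`, and `gain_toy` are hypothetical stand-ins for the trained networks.

```python
import numpy as np

def batched_scores(Y, B, g, f, h, gain_fn):
    """Split Y (T, m) into B segments and run the recursion on all segments at once."""
    T, m = Y.shape
    L = T // B                                   # segment length
    segs = Y[: B * L].reshape(B, L, m)
    x = g(segs[:, 0])                            # initialise each segment from its first observation
    scores = np.zeros((B, L))
    for t in range(1, L):                        # L sequential steps instead of T
        x_prior = f(x)
        innovation = segs[:, t] - h(x_prior)
        scores[:, t] = np.sum(innovation ** 2, axis=-1)
        K = gain_fn(innovation)                  # per-segment gains, shape (B, n, m)
        x = x_prior + np.einsum("bnm,bm->bn", K, innovation)
    return scores.reshape(-1)                    # concatenate segment scores in time order

m = 2
g_toy = lambda y: y.copy()
f_toy = lambda x: 0.9 * x
h_toy = lambda x: x
gain_toy = lambda d: np.broadcast_to(0.5 * np.eye(m), (d.shape[0], m, m))
Y_demo = np.random.default_rng(0).normal(size=(40, m))
s_seq = batched_scores(Y_demo, 1, g_toy, f_toy, h_toy, gain_toy)
s_par = batched_scores(Y_demo, 4, g_toy, f_toy, h_toy, gain_toy)
```

The first segment of `s_par` matches the sequential run exactly; later segments differ only through their re-initialized state, which is the approximation that buys the parallelism.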

4. Experiments and Results

In this section, we evaluate our proposed method for anomaly detection on four real-world CPS datasets and compare the performance with several competitive anomaly detection methods.

4.1. Datasets and Baselines

We consider the following four real-world CPS datasets:

ASD [31]: Twelve server entities with 19 metrics (CPU, memory, network, VM, etc.) at 5 min intervals, with expert-labeled anomalies.
SMD [11]: Twelve machine entities with 38 metrics at 1 min intervals.
PUMP [27]: Water pump system data at 1 min granularity over five months.
SMAP [32]: Telemetry data from NASA’s Soil Moisture Active Passive (SMAP) satellite.

More specifications of our datasets are given in Table 1.

We compare the NNEKF against several established anomaly detection methods:

iForest [33]: A tree-based method detecting anomalies via recursive data partitioning.
DAGMM [10]: Combines deep autoencoders with Gaussian Mixture Models for latent space modeling.
OmniAnomaly [11]: A deep generative model using GRU-VAE with normalizing flows, employing reconstruction probabilities as anomaly scores.
AnomalyTrans [15]: Uses an anomaly–attention mechanism to compute association discrepancy with a minimax strategy.
DCdetector [34]: Employs dual-attention asymmetric design with contrastive learning for permutation-invariant representations.
DADA [35]: Uses adaptive bottlenecks for dynamic temporal compression and dual adversarial decoders to amplify deviations.
KAN-AD [36]: Replaces MLPs with Kolmogorov–Arnold Networks to capture nonlinear dependencies via decomposed univariate functions.
NSIBF [27]: Learns CPS dynamics via neural networks followed by Kalman filter state tracking.

DAGMM and OmniAnomaly represent early deep learning approaches, while AnomalyTrans, DCdetector, and DADA represent recent advances. KAN-AD achieves state-of-the-art performance by leveraging Kolmogorov–Arnold Networks for nonlinear temporal modeling. NSIBF serves as a crucial baseline combining neural networks with Kalman filtering, enabling direct comparison with our Neural Network-Enhanced Kalman Filter.

To evaluate the effects of two distinct anomaly scores, mean squared error (MSE) and Mahalanobis distance (MD), we introduce two corresponding models: NNEKF-MSE and NNEKF-MD. The former computes the MSE between the estimated observation and the ground-truth observation, while the latter computes the Mahalanobis distance using the covariance matrix $R$.
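The two scores can be sketched as follows. This is a minimal illustration: the MD form follows the square-root quadratic form in the final step of the inference algorithm, while the function names and the use of `np.linalg.solve` in place of an explicit inverse are choices made here.

```python
import numpy as np

def mse_score(y, y_hat):
    """Mean squared error between an observation and its estimate."""
    r = y - y_hat
    return float(np.mean(r ** 2))

def md_score(y, y_hat, R):
    """Mahalanobis distance sqrt(r^T R^{-1} r) of the residual r = y - y_hat."""
    r = y - y_hat
    return float(np.sqrt(r @ np.linalg.solve(R, r)))

y = np.array([1.0, 2.0])
y_hat = np.array([0.0, 0.0])
mse = mse_score(y, y_hat)                        # (1 + 4) / 2
md_identity = md_score(y, y_hat, np.eye(2))      # sqrt(1 + 4)
md_scaled = md_score(y, y_hat, np.diag([1.0, 4.0]))  # sqrt(1 + 4/4)
```

With an identity covariance the MD reduces to the Euclidean norm of the residual. A larger variance on the second feature down-weights its contribution, which is how $R$ rescales features with different noise levels.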

4.2. Performance and Analysis

We adopt precision, recall, and F1-score as evaluation metrics, with particular emphasis on F1 due to its balanced trade-off. Table 2 and Table 3 summarize the performance (mean ± half-width of 95% confidence interval) for all methods and datasets.
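For reference, F1 is the harmonic mean of precision and recall. The sketch below applies it to the mean precision and recall reported for NNEKF-MD on ASD in Table 2. Because the table reports means over runs, the F1 of the mean precision and recall need not match the mean F1 exactly, so this is only a sanity check.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Mean precision and recall reported for NNEKF-MD on ASD in Table 2.
f1_asd = f1_score(0.902, 0.928)
```

The result agrees with the reported F1 of 0.915 to three decimal places.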

KAN-AD achieves the best overall performance with the highest average F1-score (0.938). Our proposed NNEKF-MD and NNEKF-MSE rank second (0.935) and third (0.933), respectively. NSIBF exhibits consistent performance with the fourth best mean F1-score (0.918), which both NNEKF variants outperform.

Among baselines, KAN-AD and AnomalyTrans perform strongly on SMD, PUMP, and SMAP but degrade markedly on ASD, a dataset with limited training data (approximately 8000 samples per subset). This suggests that these methods may struggle in data-scarce scenarios.

To validate robustness and efficiency, Table 3 reports average F1-scores together with inference times (mean ± half-width of the 95% confidence interval over five independent runs). Although KAN-AD achieves the highest average F1-score, our NNEKF model ranks second while delivering the fastest inference time. Notably, the parallel implementation reduces inference time by over two orders of magnitude compared to NSIBF without sacrificing accuracy. This speedup stems from the batched parallel inference strategy introduced in Section 3.4, which eliminates the sequential bottleneck inherent in traditional Kalman filters.
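One common way to obtain a "mean ± half-width" figure from five runs is a Student-t interval. The sketch below assumes that construction, which the text does not confirm, and uses hypothetical run values rather than results from the paper.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% Student-t critical value, 4 degrees of freedom

def mean_and_half_width(values):
    """Return (mean, half-width of the 95% CI) for exactly five runs."""
    if len(values) != 5:
        raise ValueError("the hard-coded critical value assumes five runs")
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, T_CRIT_95_DF4 * std_err

# Hypothetical F1-scores from five runs (illustrative, not from the paper).
runs = [0.905, 0.910, 0.915, 0.920, 0.925]
mean_f1, half_width = mean_and_half_width(runs)
```

For these illustrative values the interval is roughly 0.915 ± 0.010, the same order of magnitude as the half-widths listed in Table 2.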

A detailed cost analysis is provided in Table 4 and Table 5. NSIBF requires 680–7300 s for combined training and inference across datasets, with 0.43–1.83 MB storage.

The NNEKF employs two-stage training: Stage 1 learns state-space dynamics (f-net and h-net); Stage 2 refines the Kalman gain network (K-net) while freezing the dynamic modules. Although this lengthens training, the NNEKF achieves substantially faster inference by replacing NSIBF's sigma point sampling with parallelized batch processing. With 32 batches applied uniformly across all datasets, our method achieves $3.1\times$ to $22.3\times$ speedups over NSIBF in total (training plus inference) time and requires only 0.80–3.22 MB of storage, making it well-suited for resource-constrained CPS environments.
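The speedup figures can be reproduced directly from Table 4 and Table 5. The short calculation below uses only values copied from those tables and separates the total-time ratio from the inference-only ratio.

```python
# Total time (s), copied from Table 4 (NSIBF) and Table 5 (NNEKF).
nsibf_total = {"ASD": 684.56, "SMD": 4052.92, "PUMP": 1896.92, "SMAP": 7301.12}
nnekf_total = {"ASD": 217.99, "SMD": 662.18, "PUMP": 177.86, "SMAP": 327.08}

# Inference time only (s), from the same tables.
nsibf_infer = {"ASD": 636.36, "SMD": 3908.17, "PUMP": 1861.13, "SMAP": 7239.29}
nnekf_infer = {"ASD": 2.17, "SMD": 10.86, "PUMP": 4.85, "SMAP": 18.92}

total_speedup = {k: nsibf_total[k] / nnekf_total[k] for k in nsibf_total}
infer_speedup = {k: nsibf_infer[k] / nnekf_infer[k] for k in nsibf_infer}
```

The total-time ratios range from about 3.1 (ASD) to about 22.3 (SMAP), whereas the inference-only ratios all exceed 100, which is where the "two orders of magnitude" statement applies.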

To better understand the NNEKF’s behavior, we examine its anomaly scores on ASD. Figure 6 presents the scores generated by NNEKF-MD, NNEKF-MSE, and NSIBF for segments from the ASD dataset. Red-highlighted regions denote true anomalies; purple-highlighted regions indicate predicted anomalies. While all three models produce elevated scores within anomaly regions, NNEKF-MSE exhibits irregular fluctuations due to noise sensitivity, and NSIBF suffers from false-positive fluctuations that degrade precision. In contrast, NNEKF-MD demonstrates the most reliable behavior: stable scores in normal regions with elevation confined to anomalous regions. This explains NNEKF-MD’s superior F1-score on ASD.

The superior performance of NNEKF-MD stems from its use of Mahalanobis distance, which incorporates the covariance matrix $R$ to decorrelate features. However, this reliance reveals a key limitation: $R$ is precomputed from training residuals (Equation (22)) and thus depends heavily on the accuracy of the learned state-space model. When the model accurately captures system dynamics, $R$ effectively enhances detection; otherwise, estimation errors propagate into the anomaly score.
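A minimal sketch of precomputing a residual covariance is given below. It assumes that the estimate amounts to a sample covariance of training residuals; Equation (22) is not reproduced here, so the exact form may differ. The function name `estimate_R` and the small ridge term `eps` are additions made for this illustration.

```python
import numpy as np

def estimate_R(y_train, y_pred, eps=1e-6):
    """Sample covariance of training residuals plus a small ridge term.

    Assumes at least two observation dimensions; eps is a hypothetical
    regularizer that keeps the matrix safely invertible.
    """
    resid = y_train - y_pred                  # (N, m) residuals
    R = np.cov(resid, rowvar=False)           # (m, m) covariance
    return R + eps * np.eye(R.shape[0])

rng = np.random.default_rng(0)
y_train = rng.normal(size=(500, 3))
# Toy predictions whose error grows from the first to the third feature.
y_pred = y_train + rng.normal(scale=[0.1, 0.5, 1.0], size=(500, 3))
R = estimate_R(y_train, y_pred)
```

If the learned model is poor, the residuals and therefore `R` reflect model error rather than sensor noise, which is the dependence described above.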

To simulate this scenario, we deliberately compromise state-space model learning by selecting improper loss weights $w_{1} = 0.3$; $w_{2} = 0.1$; $w_{3} = 0.6$ (Equation (10)), compared to the proper configuration ($w_{1} = 0.45$; $w_{2} = 0.45$; $w_{3} = 0.1$). This impairs f-net training, causing model mismatch. Figure 6 shows anomaly scores under this mismatch: both the MSE and MD produce compromised predictions, with MD exhibiting worse degradation, namely false negatives in anomalous regions and false-positive fluctuations in normal regions. This confirms that NNEKF-MD performance critically depends on state-space model accuracy.

A more principled alternative would be to estimate $R_{t}$ online within the K-network, but this would substantially increase complexity and inference latency. Given our goal of balancing accuracy, speed, and model footprint for real-time CPS deployments, we adopted the lightweight precomputation strategy.

4.3. Ablation Study

To evaluate the core contributions of the NNEKF, we conduct ablation studies on (1) the attention mechanism in state-space modeling, (2) the learnable K-network for Kalman gain, and (3) end-to-end versus two-stage training. We compare the full model against three ablated variants (Figure 7), each employing the MSE-based anomaly score:

Full Model: The complete NNEKF framework employing attention-based state transition learning, the dedicated K-network, and two-stage training.
w/o Attention: Replaces the attention mechanism with a standard LSTM in the state-space module to isolate its contribution to temporal dynamic modeling.
w/o K-Network (EKF): Substitutes the learned K-network with the analytical Extended Kalman Filter gain to validate the necessity of data-driven gain learning.
End-to-End: Jointly optimizes the state-space model and K-network via Equation (17) ( $λ = 1.0$ ), eliminating the two-stage training procedure.

As shown in Figure 7, the full model consistently outperforms all variants. The w/o Attention variant exhibits degraded F1-scores, confirming that attention is crucial for capturing complex temporal dependencies. The w/o K-Network variant shows the larger drop of the two architectural ablations, supporting the view that a data-driven Kalman gain is important for state estimation in nonlinear CPS environments. Notably, the End-to-End variant performs the worst across all datasets, corroborating our analysis in Section 3.3 regarding gradient interference and justifying the two-stage training strategy.

4.4. Parameter Sensitivity

A key feature of the NNEKF is batched parallel inference, which accelerates detection by segmenting the input time series into independent batches initialized via the observation mapping $h(\cdot)$. We investigate the impact of segment count on inference time and F1-score.

As shown in Figure 8, inference time decreases roughly in inverse proportion to the number of segments, consistent with the $O((T / B) \cdot (l n^{2} + n m + m^{2}))$ complexity derived in Section 3.4. This near-linear speedup is crucial for real-time CPS deployments.
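The scaling can be checked with simple arithmetic. The sketch below assumes that $l$, $n$, and $m$ denote the input window length, the state dimension, and the observation dimension, and compares the ideal factor of $B = 32$ with the serial and parallel NNEKF-MSE inference times listed in Table 3.

```python
def per_step_cost(l: int, n: int, m: int) -> int:
    """Dominant per-step term l*n^2 + n*m + m^2 from the stated complexity."""
    return l * n**2 + n * m + m**2

def sequential_steps(T: int, B: int) -> float:
    """Number of sequential filter steps when T samples are split into B batches."""
    return T / B

# ASD settings from Table A1 and Table 1: l = 10, n = 12, m = 19.
asd_cost = per_step_cost(l=10, n=12, m=19)

# Serial vs. 32-batch parallel NNEKF-MSE inference times (s) from Table 3.
serial = {"ASD": 61.47, "SMD": 362.92, "PUMP": 178.18, "SMAP": 694.94}
parallel = {"ASD": 2.17, "SMD": 10.86, "PUMP": 4.85, "SMAP": 18.92}
measured_speedup = {k: serial[k] / parallel[k] for k in serial}
```

The measured ratios fall between roughly 28 and 37, close to the ideal factor of 32 implied by the $T / B$ term. Deviations in either direction are plausible given GPU overheads and measurement noise.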

Figure 9 demonstrates the robustness of our parallelization strategy: F1-scores remain stable across segment counts, with only minimal degradation even at 128 segments. This negligible decline is far outweighed by the dramatic inference time reduction—over two orders of magnitude compared to the sequential NSIBF baseline. We therefore select 32 segments as the default configuration to balance speed and accuracy.

We further study the impact of the dimension of the hidden state $x_{t}$ on model performance. As shown in Figure 10, Figure 11, Figure 12 and Figure 13, a consistent trend emerges across all datasets: low-dimensional states lack sufficient capacity to encode observational information, whereas excessively high dimensions introduce redundancy and overfitting. Specifically, for ASD (observation dimension 19), the optimal state dimension is 12; for SMD (observation dimension 38), it is 16; for SMAP (observation dimension 25), it is 15; and for PUMP (observation dimension 44), it is 21 or 22. This pattern suggests that a moderate latent dimensionality strikes the best balance between representational capacity and generalization, avoiding both underfitting and overfitting.

4.5. Robustness Evaluation

We evaluate the robustness of our proposed model against two prevalent types of data corruption: additive Gaussian noise and random missing values.

We inject zero-mean Gaussian noise into the input features. The noise standard deviation is scaled relative to the data range as $\sigma = \alpha \cdot (\max(x) - \min(x))$, where $\alpha$ denotes the relative noise intensity. We vary $\alpha$ across $\{0.01, 0.05, 0.1, 0.2, 0.5\}$. Figure 14 presents the F1-scores under varying noise levels. As expected, the performance degrades gradually with increasing $\alpha$; however, our model maintains a competitive F1-score above 0.7 even under severe noise conditions ($\alpha = 0.5$).
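The noise-injection protocol can be sketched as follows. This illustration assumes the range is taken per feature, which the text does not specify, and the function name and toy data are hypothetical.

```python
import numpy as np

def add_range_scaled_noise(x, alpha, rng):
    """Add zero-mean Gaussian noise with sigma = alpha * (max - min) per feature."""
    data_range = x.max(axis=0) - x.min(axis=0)   # range of each feature
    sigma = alpha * data_range
    return x + rng.normal(size=x.shape) * sigma

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=(1000, 4))       # toy input features
noisy = {a: add_range_scaled_noise(x, a, rng) for a in (0.01, 0.05, 0.1, 0.2, 0.5)}
```

At $\alpha = 0.5$ the noise standard deviation equals half of each feature's range, which is why this setting counts as severe.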

We further evaluate the model's tolerance to incomplete data by randomly setting a fraction of input features to zero. The missing rate $r$ varies from $0.1$ to $0.7$ with a step size of $0.1$. Figure 15 illustrates the results. Notably, although performance declines as $r$ increases, the model retains reasonable performance even when 70% of features are missing ($r = 0.7$), demonstrating strong robustness against data incompleteness.
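A minimal sketch of the missing-value corruption is shown below. It assumes each entry is dropped independently with probability $r$; the text does not say whether the authors instead zero out an exact fraction, and the function name and toy data are hypothetical.

```python
import numpy as np

def drop_features(x, r, rng):
    """Set each entry to zero independently with probability r."""
    mask = rng.random(x.shape) < r
    return np.where(mask, 0.0, x), mask

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 2.0, size=(2000, 5))        # strictly nonzero toy data
corrupted, mask = drop_features(x, 0.7, rng)
```

Returning the mask alongside the corrupted array makes it easy to confirm the realized missing rate before evaluating a detector on the corrupted input.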

5. Conclusions

This paper proposes a Neural Network-Enhanced Kalman Filter (NNEKF), a novel anomaly detection approach for cyber-physical systems. The method integrates a neural network with a Kalman filter to learn system dynamics and optimize state estimation, achieving robust and efficient detection. Extensive experiments on four real-world CPS benchmark datasets demonstrate that the NNEKF achieves an average F1-score of 0.935, which is comparable to state-of-the-art methods, while offering low inference latency and minimal memory footprint, rendering it ideally suited for resource-constrained CPS environments.

Beyond detection accuracy, the NNEKF is designed to facilitate practical deployment in real-world applications. Latency: Under batched parallel inference, the wall-clock complexity is reduced to roughly $O((T / B) \cdot (l n^{2} + n m + m^{2}))$, yielding inference more than two orders of magnitude faster than the sequential NSIBF baseline. Scalability: The model consistently delivers strong performance across datasets of varying sizes and dimensions (ASD, SMD, PUMP, SMAP), demonstrating its adaptability to different scales of cyber-physical systems. Online deployment: With a minimal model footprint and low inference latency, the NNEKF can be effectively deployed for online sensor monitoring and real-time alerting.

Going forward, a promising direction to address the covariance limitation is to estimate $R_{t}$ online through low-rank approximation techniques. Recent advances such as RRKF [37] and LoKO [38] demonstrate that covariance matrices can be maintained with quadratic or linear complexity. Integrating these methods into our K-network would enable lightweight online adaptation for real-time CPS deployment, enhancing robustness while preserving minimal memory footprint.

Author Contributions

Conceptualization, Z.M.; methodology, Z.M. and W.X.; software, Z.M.; validation, Z.M., W.X. and H.Z.; formal analysis, K.Y.; investigation, X.W.; writing—original draft preparation, Z.M.; writing—review and editing, Z.M. and H.Z.; visualization, W.X.; supervision, K.Y. and X.W.; project administration, Z.M. and K.Y.; funding acquisition, K.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant No. 62371057 and the 111 Project of China under Grant No. B08004.

Institutional Review Board Statement

Not applicable. The study used only publicly available datasets and involved no direct human or animal subjects.

Informed Consent Statement

Not applicable. The study did not involve human subjects.

Data Availability Statement

The ASD and SMD datasets used in this study are publicly available and can be accessed at https://github.com/zhhlee/InterFusion/tree/main/data, accessed on 12 February 2026. The PUMP dataset used in this study is publicly available and can be accessed at https://github.com/cfeng783/NSIBF/tree/main/datasets, accessed on 12 February 2026. The SMAP dataset used in this study is publicly available and can be accessed at https://www.kaggle.com/datasets/patrickfleith/nasa-anomaly-detection-dataset-smap-msl/data, accessed on 12 February 2026. No new data were created.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Hyperparameter Configuration

This appendix provides detailed hyperparameter settings for the proposed NNEKF framework, including the neural network architectures, training configurations, and inference parameters. All experiments were conducted on a virtual GPU (vGPU). The underlying physical hardware is an NVIDIA GeForce RTX 4090 with 48 GB memory.

Table A1. Hyperparameter settings for the NNEKF framework.

Hyperparameter	Dataset	Value
Hidden state dimension $x_{t}$	ASD	12
	SMD	16
	PUMP	21/22
	SMAP	15
Input window length l	ASD	10
	SMD	10
	PUMP	15
	SMAP	10
Weighting hyperparameter $w_{1}$	ALL	0.45
Weighting hyperparameter $w_{2}$	ALL	0.45
Weighting hyperparameter $w_{3}$	ALL	0.1
Hidden dims for g net	ASD	64
	SMD	64
	PUMP	128
	SMAP	64
Hidden dims for h net	ASD	64
	SMD	64
	PUMP	128
	SMAP	64
Hidden dims for f net	ASD	128
	SMD	128
	PUMP	227
	SMAP	150
K-network attention heads	ASD	4
	SMD	4
	PUMP	4
	SMAP	4
K-network embedding dimension	ASD	95
	SMD	190
	PUMP	220
	SMAP	150
K-network GRU layers	ALL	3
Learning rate (Stage 1)	ALL	$5 \times 10^{- 4}$
Learning rate (Stage 2)	ALL	$1 \times 10^{- 3}$
Epochs (Stage 1)	ALL	100
Epochs (Stage 2)	ALL	100
Parallel batches B	ALL	32

References

[1] El-Shafeiy, E.; Alsabaan, M.; Ibrahem, M.I.; Elwahsh, H. Real-time anomaly detection for water quality sensor monitoring based on multivariate deep learning technique. Sensors 2023, 23, 8613.
[2] Xing, W.; Shen, J. Security control of cyber–physical systems under cyber attacks: A survey. Sensors 2024, 24, 3815.
[3] Blázquez-García, A.; Conde, A.; Mori, U.; Lozano, J.A. A Review on Outlier/Anomaly Detection in Time Series Data. ACM Comput. Surv. 2021, 54, 1–33.
[4] Zamanzadeh Darban, Z.; Webb, G.I.; Pan, S.; Aggarwal, C.; Salehi, M. Deep Learning for Time Series Anomaly Detection: A Survey. ACM Comput. Surv. 2024, 57, 1–42.
[5] DeMedeiros, K.; Hendawi, A.; Alvarez, M. A survey of AI-based anomaly detection in IoT and sensor networks. Sensors 2023, 23, 1352.
[6] Khodarahmi, M.; Maihami, V. A review on Kalman filter models. Arch. Comput. Methods Eng. 2023, 30, 727–747.
[7] Li, Q.; Li, R.; Ji, K.; Dai, W. Kalman Filter and Its Application. In Proceedings of the 2015 8th International Conference on Intelligent Networks and Intelligent Systems (ICINIS); IEEE: Piscataway, NJ, USA, 2015; pp. 74–77.
[8] Giraldo, J.; Urbina, D.; Cardenas, A.; Valente, J.; Faisal, M.; Ruths, J.; Tippenhauer, N.O.; Sandberg, H.; Candell, R. A Survey of Physics-Based Attack Detection in Cyber-Physical Systems. ACM Comput. Surv. 2019, 51, 1–36.
[9] Kalman, R.E.; Bucy, R.S. New Results in Linear Filtering and Prediction Theory. J. Basic Eng. 1961, 83, 95–108.
[10] Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In International Conference on Learning Representations (ICLR 2018); OpenReview.net: Vancouver, BC, Canada, 2018.
[11] Su, Y.; Zhao, Y.; Niu, C.; Liu, R.; Sun, W.; Pei, D. Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2828–2837.
[12] Audibert, J.; Michiardi, P.; Guyard, F.; Marti, S.; Zuluaga, M.A. Usad: Unsupervised anomaly detection on multivariate time series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 23–27 August 2020; pp. 3395–3404.
[13] Deng, A.; Hooi, B. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Palo Alto, CA, USA, 2021; Volume 35, pp. 4027–4035.
[14] Wang, S.; Zhao, C.; Liu, X.; Ni, X.; Chen, X.; Gao, X.; Sun, L. Hybrid Deep Learning Framework for Anomaly Detection in Power Plant Systems. Algorithms 2025, 18, 704.
[15] Xu, J.; Wu, H.; Wang, J.; Long, M. Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. arXiv 2022, arXiv:2110.02642.
[16] Kumar, D.; Addula, S.R.; Lind, M.; Brown, S.; Odion, S. AI-Driven Hybrid Deep Learning and Swarm Intelligence for Predictive Maintenance of Smart Manufacturing Robots in Industry 4.0. Electronics 2026, 15, 715.
[17] Julier, S.J.; Uhlmann, J.K. New extension of the Kalman filter to nonlinear systems. In Proceedings of the Signal Processing, Sensor Fusion, and Target Recognition VI; SPIE: Bellingham, WA, USA, 1997; Volume 3068, pp. 182–193.
[18] Djuric, P.; Kotecha, J.; Zhang, J.; Huang, Y.; Ghirmai, T.; Bugallo, M.; Miguez, J. Particle Filtering. IEEE Signal Process. Mag. 2003, 20, 19–38.
[19] Revach, G.; Shlezinger, N.; Ni, X.; Escoriza, A.L.; van Sloun, R.J.G.; Eldar, Y.C. KalmanNet: Neural Network Aided Kalman Filtering for Partially Known Dynamics. IEEE Trans. Signal Process. 2022, 70, 1532–1547.
[20] Choi, G.; Park, J.; Shlezinger, N.; Eldar, Y.C.; Lee, N. Split-KalmanNet: A Robust Model-Based Deep Learning Approach for State Estimation. IEEE Trans. Veh. Technol. 2023, 72, 12326–12331.
[21] Laufer-Goldshtein, B.; Talmon, R.; Gannot, S. A Hybrid Approach for Speaker Tracking Based on TDOA and Data-Driven Models. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 725–735.
[22] Zhou, L.; Luo, Z.; Shen, T.; Zhang, J.; Zhen, M.; Yao, Y.; Fang, T.; Quan, L. KFNet: Learning Temporal Camera Relocalization Using Kalman Filtering. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 4918–4927.
[23] Yoshida, W.; Hirose, K. Fast same-step forecast in SUTSE model and its theoretical properties. Comput. Stat. Data Anal. 2024, 190, 107861.
[24] Rangapuram, S.S.; Seeger, M.W.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. Deep State Space Models for Time Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Nice, France, 2018; Volume 31.
[25] Tian, Y.; Lai, R.; Li, X.; Xiang, L.; Tian, J. A Combined Method for State-of-Charge Estimation for Lithium-Ion Batteries Using a Long Short-Term Memory Network and an Adaptive Cubature Kalman Filter. Appl. Energy 2020, 265, 114789.
[26] Ma, X.; Zhang, S.; Tang, T.; Yu, D.; Wang, X.; Zhang, H.; Ding, L.; Dai, K. A Lightweight High-Impact Acceleration State Reconstruction Method for Multibody Dynamic Systems by an Extended Kalman Filter-Aided Time Neural Network. IEEE Sens. J. 2024, 24, 31524–31537.
[27] Feng, C.; Tian, P. Time Series Anomaly Detection for Cyber-Physical Systems via Neural System Identification and Bayesian Filtering. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 2858–2867.
[28] Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
[29] Krishnapriyan, A.; Gholami, A.; Zhe, S.; Kirby, R.; Mahoney, M.W. Characterizing possible failure modes in physics-informed neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Nice, France, 2021; Volume 34, pp. 26548–26560.
[30] De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D. The Mahalanobis Distance. Chemom. Intell. Lab. Syst. 2000, 50, 1–18.
[31] Li, Z.; Zhao, Y.; Han, J.; Su, Y.; Jiao, R.; Wen, X.; Pei, D. Multivariate Time Series Anomaly Detection and Interpretation Using Hierarchical Inter-Metric and Temporal Embedding. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 3220–3230.
[32] Hundman, K.; Constantinou, V.; Laporte, C.; Colwell, I.; Soderstrom, T. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 387–395.
[33] Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining; IEEE: Piscataway, NJ, USA, 2008; pp. 413–422.
[34] Yang, Y.; Zhang, C.; Zhou, T.; Wen, Q.; Sun, L. DCdetector: Dual Attention Contrastive Representation Learning for Time Series Anomaly Detection. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 3033–3045.
[35] Shentu, Q.; Li, B.; Zhao, K.; Shu, Y.; Rao, Z.; Pan, L.; Yang, B.; Guo, C. Towards a General Time Series Anomaly Detector with Adaptive Bottlenecks and Dual Adversarial Decoders. In Proceedings of the 13th International Conference on Learning Representations, ICLR 2025, Singapore, 24–28 April 2025; pp. 18810–18833.
[36] Zhou, Q.; Pei, C.; Sun, F.; Jing, H.; Gao, Z.; Zhang, H.; Xie, G.; Pei, D.; Li, J. KAN-AD: Time Series Anomaly Detection with Kolmogorov–Arnold Networks. In Proceedings of the International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025; PMLR; pp. 79136–79149.
[37] Schmidt, J.; Hennig, P.; Nick, J.; Tronarp, F. The rank-reduced Kalman filter: Approximate dynamical-low-rank filtering in high dimensions. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Nice, France, 2023; Volume 36, pp. 61364–61376.
[38] Abdi, H.; Sun, M.; Zhang, A.; Kaski, S.; Pan, W. LoKO: Low-Rank Kalman Optimizer for Online Fine-Tuning of Large Models. arXiv 2024, arXiv:2410.11551.

Figure 1. Anomaly detection architecture in CPSs.

Figure 2. Neural network architecture for state-space model learning. Blue, yellow, and green boxes represent networks f, g, and h, respectively.

Figure 3. The overall architecture of the NNEKF. Parameters of networks f and g are frozen, while parameters of network K are active (trainable).

Figure 4. The architecture of the K-network with an attention mechanism.

Figure 5. Serial inference versus batched parallel inference.

Figure 6. Visualization of anomaly scores for ASD data segments, with anomaly scores from NNEKF-MSE, NNEKF-MD, and NSIBF.

Figure 7. An F1-score comparison of the full NNEKF model against its ablated variants across three datasets.

Figure 8. Inference time as a function of the number of parallel batches across four datasets.

Figure 9. F1-score as a function of the number of parallel batches across four datasets.

Figure 10. F1-scores for different state dimensions on ASD dataset.

Figure 11. F1-scores for different state dimensions on SMD dataset.

Figure 12. F1-scores for different state dimensions on PUMP dataset.

Figure 13. F1-scores for different state dimensions on SMAP dataset.

Figure 14. F1-scores for different noise levels across four datasets.

Figure 15. F1-scores for different missing rates across four datasets.

Table 1. Dataset statistical information.

	Train	Test	Subsets	Dims	Anomalies (%)
ASD	102,331	51,840	12	19	4.61
SMD	304,168	304,174	12	38	5.84
PUMP	76,901	143,401	1	44	10.05
SMAP	135,183	427,617	1	25	13.13

Table 2. Experimental results on four datasets (ASD, SMD, PUMP, SMAP). Results are reported as mean ± half-width of 95% confidence interval over 5 independent runs. Best result is highlighted in bold.

Method	ASD			SMD
Method	P	R	F1	P	R	F1
iForest	$0.263 \pm 0.010$	$0.517 \pm 0.012$	$0.349 \pm 0.009$	$0.338 \pm 0.011$	$0.576 \pm 0.014$	$0.426 \pm 0.009$
DAGMM	$0.621 \pm 0.012$	$0.908 \pm 0.010$	$0.738 \pm 0.009$	$0.592 \pm 0.019$	$0.959 \pm 0.009$	$0.732 \pm 0.014$
OmniAnomaly	$0.743 \pm 0.015$	$0.918 \pm 0.012$	$0.821 \pm 0.010$	$0.872 \pm 0.016$	$0.969 \pm 0.011$	$0.918 \pm 0.010$
AnomalyTrans	$0.810 \pm 0.016$	$0.625 \pm 0.019$	$0.706 \pm 0.014$	$0.893 \pm 0.012$	$0.966 \pm 0.019$	$0.928 \pm 0.011$
DCdetector	$0.745 \pm 0.027$	$0.627 \pm 0.017$	$0.681 \pm 0.016$	$0.950 \pm 0.022$	$0.883 \pm 0.016$	$0.915 \pm 0.012$
DADA	$0.733 \pm 0.016$	$0.950 \pm 0.020$	$0.827 \pm 0.012$	$0.936 \pm 0.016$	$0.941 \pm 0.017$	$0.938 \pm 0.011$
KAN-AD	$0.839 \pm 0.019$	$0.926 \pm 0.012$	$0.880 \pm 0.011$	$0.956 \pm 0.014$	$0.980 \pm 0.012$	$0.968 \pm 0.010$
NSIBF	$0.847 \pm 0.011$	$0.951 \pm 0.014$	$0.896 \pm 0.009$	$0.970 \pm 0.007$	$0.976 \pm 0.011$	$0.973 \pm 0.007$
NNEKF-MSE	$0.881 \pm 0.019$	$0.929 \pm 0.014$	$0.904 \pm 0.012$	$0.988 \pm 0.009$	$0.967 \pm 0.017$	$0.977 \pm 0.010$
NNEKF-MD	$0.902 \pm 0.009$	$0.928 \pm 0.017$	$0.915 \pm 0.010$	$0.976 \pm 0.014$	$0.950 \pm 0.011$	$0.963 \pm 0.009$
Method	PUMP			SMAP
Method	P	R	F1	P	R	F1
iForest	$0.922 \pm 0.011$	$0.475 \pm 0.007$	$0.627 \pm 0.006$	$0.431 \pm 0.014$	$0.520 \pm 0.019$	$0.471 \pm 0.011$
DAGMM	$0.928 \pm 0.009$	$0.767 \pm 0.014$	$0.840 \pm 0.009$	$0.806 \pm 0.019$	$0.871 \pm 0.012$	$0.837 \pm 0.011$
OmniAnomaly	$0.939 \pm 0.007$	$0.815 \pm 0.015$	$0.873 \pm 0.009$	$0.818 \pm 0.016$	$0.894 \pm 0.019$	$0.854 \pm 0.012$
AnomalyTrans	$0.944 \pm 0.016$	$0.987 \pm 0.012$	$0.965 \pm 0.010$	$0.822 \pm 0.011$	$0.979 \pm 0.019$	$0.894 \pm 0.010$
DCdetector	$0.943 \pm 0.014$	$0.982 \pm 0.019$	$0.962 \pm 0.012$	$0.733 \pm 0.021$	$0.901 \pm 0.019$	$0.808 \pm 0.015$
DADA	$0.919 \pm 0.012$	$0.962 \pm 0.014$	$0.940 \pm 0.009$	$0.688 \pm 0.016$	$0.902 \pm 0.020$	$0.781 \pm 0.012$
KAN-AD	$0.964 \pm 0.012$	$0.971 \pm 0.011$	$0.967 \pm 0.009$	$0.979 \pm 0.011$	$0.901 \pm 0.016$	$0.938 \pm 0.010$
NSIBF	$0.960 \pm 0.010$	$0.941 \pm 0.009$	$0.950 \pm 0.006$	$0.927 \pm 0.011$	$0.880 \pm 0.012$	$0.852 \pm 0.009$
NNEKF-MSE	$0.957 \pm 0.009$	$0.969 \pm 0.012$	$0.963 \pm 0.007$	$0.879 \pm 0.015$	$0.895 \pm 0.012$	$0.887 \pm 0.010$
NNEKF-MD	$0.909 \pm 0.010$	$0.985 \pm 0.006$	$0.959 \pm 0.005$	$0.913 \pm 0.010$	$0.890 \pm 0.016$	$0.901 \pm 0.010$

Table 3. F1-score and inference time comparison on four datasets (ASD, SMD, PUMP, SMAP). Results are reported as mean ± half-width of 95% confidence interval over 5 independent runs. Best result is highlighted in bold, and second best is underlined.

Methods	F1-Score	Inference Time (s)
Methods	Avg F1	ASD	SMD	PUMP	SMAP
Baseline
AnomalyTrans	$0.873$	$31.81 \pm 1.43$	$83.90 \pm 2.56$	$69.92 \pm 1.96$	$108.42 \pm 3.07$
DCdetector	$0.842$	$8642.30 \pm 51.74$	$16{,}732.76 \pm 79.58$	$12{,}718.96 \pm 65.01$	$21{,}575.11 \pm 110.97$
DADA	$0.871$	$15.80 \pm 0.70$	$76.19 \pm 2.28$	$45.61 \pm 1.37$	$102.58 \pm 2.89$
KAN-AD	$0.938$	$6.36 \pm 0.38$	$30.16 \pm 1.73$	$17.61 \pm 1.12$	$37.16 \pm 1.96$
NSIBF	$0.918$	$636.36 \pm 5.72$	$3908.17 \pm 31.24$	$1861.13 \pm 16.17$	$7239.29 \pm 63.59$
Serial Inference
NNEKF-MSE	$0.933$	$61.47 \pm 4.51$	$362.92 \pm 27.23$	$178.18 \pm 10.42$	$694.94 \pm 40.61$
NNEKF-MD	$\underline{0.935}$	$64.80 \pm 4.59$	$375.14 \pm 29.32$	$188.05 \pm 10.58$	$735.02 \pm 41.61$
Parallel Inference
NNEKF-MSE	$0.933$	$2.17 \pm 0.19$	$10.86 \pm 0.47$	$4.85 \pm 0.32$	$18.92 \pm 1.28$
NNEKF-MD	$\underline{0.935}$	$\underline{2.23 \pm 0.22}$	$\underline{11.36 \pm 0.48}$	$\underline{4.97 \pm 0.30}$	$\underline{19.37 \pm 1.13}$

Table 4. Computational cost and model storage of NSIBF across four datasets.

Dataset	Training Time (s)	Inference Time (s)	Total Time (s)	Model Storage (MB)
ASD	48.20	636.36	684.56	0.43
SMD	144.75	3908.17	4052.92	0.45
PUMP	35.79	1861.13	1896.92	1.83
SMAP	61.83	7239.29	7301.12	0.49

Table 5. Computational cost and model storage of NNEKF across four datasets.

Dataset	Stage 1 (s)	Stage 2 (s)	Inference Time (s)	Total Time (s)	Model Storage (MB)
ASD	51.16	166.83	2.17	217.99	0.80
SMD	155.61	506.57	10.86	662.18	0.87
PUMP	40.89	132.17	4.85	177.86	3.22
SMAP	72.65	235.51	18.92	327.08	1.02

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
