In this section, we evaluate our proposed anomaly detection method on four real-world CPS datasets and compare its performance with that of several competitive anomaly detection methods.
4.1. Datasets and Baselines
We consider the following four real-world CPS datasets:
ASD [31]: Twelve server entities with 19 metrics (CPU, memory, network, VM, etc.) at 5 min intervals, with expert-labeled anomalies.
SMD [11]: Twelve machine entities with 38 metrics at 1 min intervals.
PUMP [27]: Water pump system data at 1 min granularity over five months.
SMAP [32]: Telemetry data from NASA’s Soil Moisture Active Passive (SMAP) satellite.
More specifications of our datasets are given in Table 1.
We compare the NNEKF against several established anomaly detection methods:
iForest [33]: A tree-based method detecting anomalies via recursive data partitioning.
DAGMM [10]: Combines deep autoencoders with Gaussian Mixture Models for latent space modeling.
OmniAnomaly [11]: A deep generative model using GRU-VAE with normalizing flows, employing reconstruction probabilities as anomaly scores.
AnomalyTrans [15]: Uses an anomaly–attention mechanism to compute association discrepancy with a minimax strategy.
DCdetector [34]: Employs dual-attention asymmetric design with contrastive learning for permutation-invariant representations.
DADA [35]: Uses adaptive bottlenecks for dynamic temporal compression and dual adversarial decoders to amplify deviations.
KAN-AD [36]: Replaces MLPs with Kolmogorov–Arnold Networks to capture nonlinear dependencies via decomposed univariate functions.
NSIBF [27]: Learns CPS dynamics via neural networks followed by Kalman filter state tracking.
DAGMM and OmniAnomaly represent early deep learning approaches, while AnomalyTrans, DCdetector, and DADA represent recent advances. KAN-AD achieves state-of-the-art performance by leveraging Kolmogorov–Arnold Networks for nonlinear temporal modeling. NSIBF serves as a crucial baseline combining neural networks with Kalman filtering, enabling direct comparison with our Neural Network-Enhanced Kalman Filter.
To evaluate the effects of two distinct anomaly scores, mean squared error (MSE) and Mahalanobis distance (MD), we introduce two corresponding models: NNEKF-MSE and NNEKF-MD. The former computes the MSE between the estimated observation and the ground truth, while the latter computes the Mahalanobis distance between them using a covariance matrix precomputed from training residuals.
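The two scores can be illustrated with a minimal NumPy sketch. It is not the NNEKF implementation: the function names, the `eps` regularizer, and the choice to return the (unsquared) distance are assumptions made for illustration; only the idea of a residual covariance estimated from training data comes from the text.

```python
import numpy as np

def mse_score(y_true, y_pred):
    """Per-timestep mean squared error between observation and estimate."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(r ** 2, axis=1)

def fit_residual_covariance(residuals, eps=1e-6):
    """Covariance of training residuals, shape (d, d).

    eps * I is an assumed regularizer that keeps the matrix invertible.
    """
    residuals = np.asarray(residuals, dtype=float)
    cov = np.atleast_2d(np.cov(residuals, rowvar=False))
    return cov + eps * np.eye(residuals.shape[1])

def md_score(y_true, y_pred, cov):
    """Per-timestep Mahalanobis distance of the residual under `cov`."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    solved = np.linalg.solve(cov, r.T).T        # cov^{-1} r for every timestep
    quad = np.einsum("ij,ij->i", r, solved)     # r^T cov^{-1} r
    return np.sqrt(np.maximum(quad, 0.0))
```

With an identity covariance the MD score reduces to the Euclidean norm of the residual, which makes the role of the covariance explicit: it rescales and decorrelates the residual before the norm is taken.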
4.2. Performance and Analysis
We adopt precision, recall, and F1-score as evaluation metrics, with particular emphasis on F1 due to its balanced trade-off. Table 2 and Table 3 summarize the performance (mean ± half-width of 95% confidence interval) for all methods and datasets.
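For reference, the three metrics can be computed from binary labels as in the sketch below. It uses plain point-wise counting; whether the reported numbers apply any additional adjustment to the predictions is not stated here, so this is an assumption, and the function name is hypothetical.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Point-wise precision, recall, and F1 for binary anomaly labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))     # anomalies correctly flagged
    fp = int(np.sum(~y_true & y_pred))    # normal points flagged as anomalous
    fn = int(np.sum(y_true & ~y_pred))    # anomalies that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high, which is why it is emphasized as the balanced metric.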
KAN-AD achieves the best overall performance with the highest average F1-score (0.938). Our proposed NNEKF-MD and NNEKF-MSE rank second (0.935) and third (0.933), respectively. NSIBF performs consistently and attains the fourth-best mean F1-score (0.918), which both NNEKF variants outperform.
Among baselines, KAN-AD and AnomalyTrans perform strongly on SMD, PUMP, and SMAP but degrade markedly on ASD, a dataset with limited training data (approximately 8000 samples per subset). This suggests that these methods may struggle in data-scarce scenarios.
To validate robustness and efficiency, Table 3 reports the mean ± half-width of the 95% confidence interval for average F1-scores and inference times over five independent runs. Although KAN-AD achieves the highest average F1-score, our NNEKF model ranks second while delivering the fastest inference time. Notably, it reduces inference time by over two orders of magnitude compared to NSIBF without sacrificing accuracy. This speedup stems from the batched parallel inference strategy introduced in Section 3.4, which eliminates the sequential bottleneck inherent in traditional Kalman filters.
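The "mean ± half-width" format can be reproduced from per-run scores as sketched below. The use of a Student-t interval is an assumption (the text does not say how the interval was constructed), and the function name and the small table of critical values are illustrative only.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by number of runs n (df = n - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776, 10: 2.262}

def mean_ci_halfwidth(values):
    """Return (mean, half-width) of an assumed 95% t-interval over runs."""
    n = len(values)
    if n not in T_CRIT_95:
        raise ValueError(f"no tabulated critical value for n={n}")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)              # sample standard deviation (n - 1)
    half = T_CRIT_95[n] * sd / math.sqrt(n)
    return mean, half
```

With five runs the critical value is 2.776, so the half-width is roughly 1.24 times the sample standard deviation of the per-run scores.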
A detailed cost analysis is provided in Table 4 and Table 5. NSIBF requires 680–7300 s for combined training and inference across datasets, with 0.43–1.83 MB storage.
The NNEKF employs two-stage training: Stage 1 learns the state-space dynamics (f-net and h-net); Stage 2 refines the Kalman gain network (K-net) while freezing the dynamic modules. Although this extends training marginally, the NNEKF achieves substantially faster inference by replacing NSIBF’s sigma point sampling with parallelized batch processing. With 32 batches applied uniformly across all datasets, our method reduces inference time by over two orders of magnitude relative to NSIBF and requires only 0.80–3.22 MB of storage, making it well-suited for resource-constrained CPS environments.
To better understand the NNEKF’s behavior, we examine its anomaly scores on ASD. Figure 6 presents the scores generated by NNEKF-MD, NNEKF-MSE, and NSIBF for segments from the ASD dataset. Red-highlighted regions denote true anomalies; purple-highlighted regions indicate predicted anomalies. While all three models produce elevated scores within anomaly regions, NNEKF-MSE exhibits irregular fluctuations due to noise sensitivity, and NSIBF suffers from false-positive fluctuations that degrade precision. In contrast, NNEKF-MD demonstrates the most reliable behavior: stable scores in normal regions with elevation confined to anomalous regions. This explains NNEKF-MD’s superior F1-score on ASD.
The superior performance of NNEKF-MD stems from its use of Mahalanobis distance, which incorporates the covariance matrix to decorrelate features. However, this reliance reveals a key limitation: the covariance matrix is precomputed from training residuals (Equation (22)) and thus depends heavily on the accuracy of the learned state-space model. When the model accurately captures system dynamics, the covariance matrix effectively enhances detection; otherwise, estimation errors propagate into the anomaly score.
To simulate this scenario, we deliberately compromise state-space model learning by selecting improper loss weights in Equation (10) in place of the proper configuration. This impairs f-net training, causing model mismatch.
Figure 6 shows anomaly scores under this mismatch: both the MSE and MD produce compromised predictions, with MD exhibiting worse degradation—false negatives in anomalous regions and false-positive fluctuations in normal regions. This confirms that NNEKF-MD performance critically depends on state-space model accuracy.
A more principled alternative would be to estimate the covariance matrix online within the K-net, but this would substantially increase complexity and inference latency. Given our goal of balancing accuracy, speed, and model footprint for real-time CPS deployments, we adopted the lightweight precomputation strategy.
4.4. Parameter Sensitivity
A key feature of the NNEKF is batched parallel inference, which accelerates detection by segmenting the input time series into independent batches, each initialized via the observation mapping. We investigate the impact of segment count on inference time and F1-score.
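The segmentation idea can be sketched as follows. This is only an illustration of processing several segments in lockstep, not the NNEKF code: `split_into_segments`, `run_batched`, and the `init_state` and `step` callables are hypothetical names, and dropping the trailing samples that do not fill a segment is a simplifying assumption.

```python
import numpy as np

def split_into_segments(series, n_segments):
    """Split a (T, d) series into n_segments contiguous, equal-length segments.

    Trailing samples that do not fill a whole segment are dropped here for
    simplicity; padding would be an equally valid choice.
    """
    series = np.asarray(series)
    seg_len = len(series) // n_segments
    if seg_len == 0:
        raise ValueError("more segments than time steps")
    trimmed = series[: seg_len * n_segments]
    return trimmed.reshape(n_segments, seg_len, series.shape[1])

def run_batched(segments, init_state, step):
    """Advance all segments together: seg_len sequential steps instead of T."""
    state = init_state(segments[:, 0])              # one initial state per segment
    outputs = []
    for t in range(segments.shape[1]):
        state, out = step(state, segments[:, t])    # vectorized over the batch axis
        outputs.append(out)
    return np.stack(outputs, axis=1)                # (n_segments, seg_len, ...)
```

Because every segment starts from its own initial state, the recursion no longer depends on the previous segment, and the number of sequential filter steps shrinks from the series length to the segment length.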
As shown in Figure 8, inference time falls roughly in inverse proportion to the number of segments, empirically confirming the complexity derived in Section 3.4. This near-linear speedup is crucial for real-time CPS deployments.
Figure 9 demonstrates the robustness of our parallelization strategy: F1-scores remain stable across segment counts, with only minimal degradation even at 128 segments. This negligible decline is far outweighed by the dramatic inference time reduction—over two orders of magnitude compared to the sequential NSIBF baseline. We therefore select 32 segments as the default configuration to balance speed and accuracy.
We further study the impact of the hidden state dimension on model performance. As shown in Figures 10–13, a consistent trend emerges across all datasets: low-dimensional states lack sufficient capacity to encode observational information, whereas excessively high dimensions introduce redundancy and overfitting. Specifically, for ASD (observation dimension 19), the optimal state dimension is 12; for SMD (observation dimension 38), the optimal dimension is 16; for SMAP (observation dimension 25), the optimal dimension is 15; and for PUMP (observation dimension 44), the optimal dimension lies between 21 and 22. This pattern suggests that a moderate latent dimensionality strikes the optimal balance between representational capacity and generalization, avoiding both underfitting and the curse of dimensionality.
4.5. Robustness Evaluation
We evaluate the robustness of our proposed model against two prevalent types of data corruption: additive Gaussian noise and random missing values.
We inject zero-mean Gaussian noise into the input features. The noise standard deviation is scaled relative to the data range by a relative noise intensity, which we vary across several levels. Figure 14 presents the F1-scores under varying noise levels. As expected, performance degrades gradually with increasing noise intensity; however, our model maintains a competitive F1-score above 0.7 even under severe noise conditions.
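A minimal sketch of this corruption is given below. The symbol `alpha` for the relative noise intensity, the per-feature definition of the data range, and the function name are assumptions made for illustration; the text only states that the standard deviation is scaled relative to the data range.

```python
import numpy as np

def add_gaussian_noise(x, alpha, rng=None):
    """Add zero-mean Gaussian noise with std = alpha * (per-feature data range)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    data_range = x.max(axis=0) - x.min(axis=0)   # per-feature range (assumption)
    sigma = alpha * data_range                   # broadcast across time steps
    return x + rng.normal(0.0, 1.0, size=x.shape) * sigma
```

Scaling by the range rather than using a fixed standard deviation keeps the corruption comparable across features with very different magnitudes.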
We further evaluate the model’s tolerance to incomplete data by randomly setting a fraction of input features to zero, sweeping the missing rate r over a range of values in fixed steps. Figure 15 illustrates the results. Notably, although performance declines as r increases, the model retains reasonable performance even at high missing rates, demonstrating strong robustness against data incompleteness.
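The missing-value corruption can be sketched in the same way. Masking each entry independently with probability r is an assumption about how "a fraction of input features" is selected, and the function name is hypothetical.

```python
import numpy as np

def mask_random_features(x, r, rng=None):
    """Set each entry of x to zero independently with probability r."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("missing rate r must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    keep = rng.random(x.shape) >= r      # True where the value is retained
    return np.where(keep, x, 0.0)
```

Sweeping r over a grid and re-evaluating the detector on each masked copy of the test set yields a robustness curve of the kind described above.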