3.2. Anomaly Detection Performance
In this section, we perform anomaly detection for various failure types and evaluate the results using the area under the receiver operating characteristic curve (AUC). The average AUC value of the results from 20 datasets is displayed for each failure type. The evaluation results produced by the reactive method for each action state are shown in
Figure 6. The AUC performance for failure types
and
is shown in
Figure 6. Since failures occurred in all action states except for action state 1 (the idle state), we evaluated the anomaly detection performance using the AUC for action states 2 to 27.
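As a minimal sketch of the evaluation metric, the AUC can be computed from anomaly scores and binary labels with the rank-based (Mann-Whitney) identity and then averaged over datasets. Python is used here for illustration only; the function names `auc` and `mean_auc` and the toy data are assumptions and are not taken from the original implementation.

```python
from statistics import mean

def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: 1 = anomalous, 0 = normal. Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one normal and one anomalous sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def mean_auc(datasets):
    """Average AUC over (scores, labels) pairs, one pair per dataset or action state."""
    return mean(auc(s, y) for s, y in datasets)

# Toy data, purely illustrative (not from the paper).
toy = [
    ([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]),  # anomalies always score higher
    ([0.1, 0.9, 0.2, 0.8], [0, 0, 1, 1]),  # only half of the pairs are ordered correctly
]
print(mean_auc(toy))
```

The same `mean_auc` helper could be applied per action state (states 2 to 27) before averaging over the 20 datasets, under the assumption that each state contains both normal and anomalous samples.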
Figure 7 depicts the AUC results for the history-based method for various parameter combinations. We performed a grid search over
and
and evaluated each combination using the mean AUC across all fault types on a held-out validation set. In
Figure 7, as
M and
increase,
Figure 7a shows an overall upward trend and
Figure 7b shows an overall downward trend, which reflects a trade-off. A larger
increases sensitivity to persistent faults but also raises the false alarm rate on normal data, while a larger
M incorporates more temporal context but may introduce stale information from distant action states. The grid search results in
Figure 7 show that performance improves up to
and begins to degrade for
due to increased false positives. The history-based method is therefore sensitive to its parameters, and an acceptable setting has to be found by testing several combinations on the target system. In this study, the parameters are set to
= 0.1 and
M = 15, which achieve the best trade-off between detection sensitivity and false alarm rate in the grid search.
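The grid search described above can be sketched as an exhaustive loop over both parameters, scoring each combination by the mean validation AUC across fault types. This is only a sketch under stated assumptions: the first parameter's symbol is not legible in this text, so it is called `theta` here; `validation_auc` is a hypothetical user-supplied callable standing in for training and evaluating the history-based method; and the surrogate function and grids are toy values, not the paper's.

```python
from itertools import product
from statistics import mean

def grid_search(theta_grid, m_grid, fault_types, validation_auc):
    """Return (best_mean_auc, theta, m) over the Cartesian grid.

    validation_auc(theta, m, fault_type) is a hypothetical callable that
    returns the validation AUC for one parameter pair and one fault type.
    """
    best = None
    for theta, m in product(theta_grid, m_grid):
        score = mean(validation_auc(theta, m, f) for f in fault_types)
        if best is None or score > best[0]:
            best = (score, theta, m)
    return best

# Toy surrogate with an arbitrarily placed peak (not the paper's results).
def toy_auc(theta, m, fault_type):
    return 0.9 - abs(theta - 0.2) - 0.01 * abs(m - 10)

best = grid_search([0.1, 0.2, 0.3], [5, 10, 20], ["fault_a", "fault_b"], toy_auc)
print(best)
```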
We compare our method with representative one-class and unsupervised baselines, including OCSVM, Isolation Forest (IForest), DIF, ABOD, VAE, DeepSVDD, COPOD, ROD, ECOD, and the recent time-series anomaly detection method SensitiveHUE [
25]. The performance results are presented in
Table 5. For IForest, we used 100 trees with a
subsampling size [
34]. OCSVM uses MATLAB’s
fitcsvm function with an RBF kernel (
KernelFunction='rbf') and standardization. COPOD, ROD, and ECOD are parameter-free algorithms. For ABOD, we used
neighbors. For VAE, the encoder/decoder sizes were (16, 8, 4) and (4, 8, 16), with latent dimension 4 and 50 epochs. For DeepSVDD, we used hidden neurons (32, 16) and 50 epochs. For DIF, we used
and
max_samples='auto'. For SensitiveHUE, we used sequence length 10 and 50 training epochs.
The cumulative probability method in
Table 5 applies the reactive method without separating the data by action state. The two methods are otherwise nearly identical: the reactive method learns and evaluates a separate model for each action state, whereas the cumulative probability method learns and evaluates a single model on all of the data. The cumulative probability method uses 50 GMM clusters; using more than 50 clusters makes little difference in performance. The reactive and history-based methods use three clusters per action state, and performance did not differ significantly as long as two or more clusters were used. Both methods yield higher AUC values than the cumulative probability method. The other algorithms were also applied separately to each action state, in the same way as the action-state-based methods proposed in this paper, and performance improved for most of them. The outcomes of applying the action state to other algorithms are reported in
Table 5, analogous to how the reactive method applies the action state to the cumulative probability method. In
Table 5, ‘Action state’ indicates that the algorithm was applied to each action state separately. For the action-state variants, AUC is computed per action state and averaged across states (excluding the idle state, which has no samples). These algorithms also attained better outcomes when applied to each action state separately, so the action state concept can be combined with a variety of algorithms. In this experimental setting, the proposed history-based action-state method achieves the highest mean AUC among the compared approaches.
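The difference between a pooled model and per-action-state models can be illustrated with a deliberately simple stand-in. The sketch below assumes a one-dimensional feature and replaces the GMM with a single Gaussian per state scored by the absolute z-score; the class and function names are hypothetical, and the data are toy values chosen only to show why routing samples to a state-specific model helps on multi-modal data.

```python
from collections import defaultdict
from statistics import mean, pstdev

class GaussianScorer:
    """1-D stand-in for a normal-behavior model: anomaly score = |z-score|."""
    def fit(self, xs):
        self.mu = mean(xs)
        self.sigma = pstdev(xs) or 1e-9  # guard against zero spread
        return self

    def score(self, x):
        return abs(x - self.mu) / self.sigma

def fit_per_state(train):
    """train: iterable of (action_state, value) pairs from normal operation only."""
    by_state = defaultdict(list)
    for state, x in train:
        by_state[state].append(x)
    return {s: GaussianScorer().fit(xs) for s, xs in by_state.items()}

def score_per_state(models, state, x):
    """Route the sample to the model of its own action state."""
    return models[state].score(x)

# Toy normal data: state 2 clusters near 1.0, state 3 near 5.0.
train = [(2, 1.0), (2, 1.1), (2, 0.9), (3, 5.0), (3, 5.1), (3, 4.9)]
pooled = GaussianScorer().fit([x for _, x in train])
models = fit_per_state(train)

# A reading of 3.0 in state 2 lies between the two clusters: the pooled model
# sees it as close to its overall mean, while the state-2 model flags it.
print(pooled.score(3.0), score_per_state(models, 2, 3.0))
```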
The improvement from action state decomposition varies significantly across algorithms. VAE and DeepSVDD show substantial improvement (from 0.54 to 0.75 and from 0.49 to 0.67, respectively), because these methods are sensitive to the complexity of the data distribution, and when trained on mixed multi-modal data, they struggle to learn a meaningful normal boundary. SensitiveHUE also shows a large improvement (from 0.63 to 0.84, +33%). In contrast, methods like ABOD that rely on local geometric properties show smaller improvement, as they are inherently more robust to multi-modal distributions. OCSVM and IForest show moderate improvement (3–5%), indicating that action state decomposition provides benefits across different algorithm families. Moreover, action-state models substantially reduce inference cost. The per-sample inference time is 0.75 μs for the reactive method and 0.87 μs for the history-based method, compared to 8.68 μs for the cumulative probability method (averaged over 10 runs, measured in MATLAB R2021a on an AMD Ryzen 9 5900X CPU). This approximately 10× speedup is achieved because each action-state model processes a smaller subset of data with fewer GMM clusters (3 per state vs. 50 for the cumulative method).
The proposed GMM-based methods achieve the highest AUC on the excavator data (
Table 5) because the excavator simulation model produces nearly linear relationships between action parameters and sensor outputs. Within each action state, the normal data forms compact, approximately Gaussian clusters in the 10-dimensional feature space, which the GMM captures precisely. The Mahalanobis distance combined with the chi-squared CDF (Equation (
14)) provides a principled probabilistic anomaly score between 0 and 1: the score is the cumulative probability, under the learned normal cluster, of a deviation no larger than the one observed, so values close to 1 indicate deviations that are unlikely to arise by chance. This scoring mechanism offers interpretable diagnostics. When an anomaly is detected, the operator can identify which GMM cluster was most violated, quantify the deviation magnitude in standard deviations, and trace the anomaly back to specific sensor dimensions through the covariance structure. Such interpretability is critical for industrial deployment, where maintenance decisions require understanding why an alarm was triggered, not merely that it was triggered. In contrast, black-box methods such as VAE (reconstruction error) and DeepSVDD (hypersphere distance) lack this direct probabilistic interpretation. Although ABOD achieves competitive AUC without action state decomposition (0.86), its angle-based scores do not provide the same level of physical interpretability as the GMM’s probabilistic framework.
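The scoring idea can be sketched for a single Gaussian cluster as follows. This is an illustration under assumptions, not a reproduction of Equation (14): it takes the score to be the chi-squared CDF of the squared Mahalanobis distance with as many degrees of freedom as feature dimensions, and it uses a closed form of the CDF that holds only for even degrees of freedom (which covers a 10-dimensional feature space). How scores from several GMM clusters are combined is not shown. All function names are hypothetical.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu)."""
    L = cholesky(cov)
    d = [xi - mi for xi, mi in zip(x, mu)]
    # Forward substitution solves L z = d; then d^T cov^{-1} d = z^T z.
    z = []
    for i in range(len(d)):
        z.append((d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return sum(v * v for v in z)

def chi2_cdf_even(x, dof):
    """Chi-squared CDF; closed form valid for positive even degrees of freedom only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return 1.0 - math.exp(-half) * total

def anomaly_score(x, mu, cov):
    """Score in [0, 1); values near 1 mean the deviation is unlikely under the cluster."""
    return chi2_cdf_even(mahalanobis_sq(x, mu, cov), len(x))

# Toy 2-D cluster (illustrative values only).
print(anomaly_score([2.0, 1.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]))
```

Because the squared distance decomposes as a sum of squared whitened components, the individual terms of `z` could be inspected to see which directions contribute most to an alarm; this is one possible way to obtain the kind of per-dimension diagnostics described above.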
To assess the sensitivity of deep learning methods to hyperparameters and architecture choices, we conducted extended experiments on the excavator dataset with VAE and DeepSVDD across multiple encoder architectures and training epochs.
Table 6 summarizes the results. For VAE, three encoder/decoder sizes were tested: small (16, 8, 4), medium (64, 32, 16), and large (128, 64, 32), with 10 and 50 epochs. For DeepSVDD, hidden layer configurations of (32, 16), (64, 32), and (128, 64, 32) were tested similarly.
Several findings emerge from
Table 6. First, the hyperparameters used in
Table 5 (VAE: encoder (16, 8, 4), 50 epochs, and DeepSVDD: hidden (32, 16), 50 epochs) achieve near-optimal performance, as increasing epochs from 10 to 50 yields negligible improvement for VAE (0.746→0.747 with AS) while providing modest gains for DeepSVDD (0.656→0.676 with AS). We adopted 50 epochs for all tables to ensure consistent convergence across datasets. Second, larger architectures provide modest gains with action state decomposition (VAE: 0.747→0.766, DeepSVDD: 0.676→0.694) but can degrade without it (DeepSVDD: 0.545→0.515), suggesting that larger networks overfit when trained on mixed multimodal data. For SensitiveHUE, longer sequence lengths improve performance both without AS (0.575→0.627→0.710) and with AS (0.817→0.840→0.862), but action state decomposition provides a much larger improvement than increasing seq_len. Even the shortest configuration with AS (seq_len = 5, 0.817) surpasses the best configuration without AS (seq_len = 20, 0.710). Increasing epochs from 10 to 50 yields only marginal gains at both seq_len = 10 (0.627→0.632 without AS, 0.840→0.843 with AS) and seq_len = 20 (0.710→0.716 without AS, 0.862→0.865 with AS), confirming rapid convergence. We adopted 50 epochs for SensitiveHUE in
Table 5 and the subsequent benchmark experiments to match the epoch count used for VAE and DeepSVDD. The improvement from AS is largest for shorter sequences (+0.242 at seq_len = 5 vs. +0.152 at seq_len = 20, both at 10 epochs), suggesting that action state decomposition compensates for limited temporal context. For DIF, increasing
from 10 to 50 improves both conditions, but further increasing to 100 shows no additional gain. These results suggest that the action state decomposition is generally more impactful than hyperparameter tuning for improving anomaly detection across the tested methods.
To assess statistical robustness, all baseline algorithms were evaluated across 20 independent test sequences, each containing different anomaly onset timings. The standard deviation of AUC across sequences was 0.02–0.07 for all algorithms, indicating stable performance. Algorithms with higher mean AUC (GMM, ABOD) tend to exhibit larger absolute standard deviations (±0.04–0.07) due to sensitivity to anomaly timing, while near-random algorithms (COPOD, ECOD) show smaller variability (±0.02–0.04).
We additionally evaluated all algorithms on type A (action parameter) faults to assess generalizability. Type A faults are inherently harder to detect because each fault affects only one of six action parameters, creating subtle shifts in a subset of the 10-dimensional feature space. Without action state decomposition, the best-performing algorithms for type A faults are IForest (mean AUC = 0.58) and COPOD (0.57), while GMM and DeepSVDD drop to near-random (0.48 and 0.49, respectively). Notably, action state decomposition provides inconsistent benefits for type A faults—some algorithms improve (e.g., DIF: 0.47→0.57) while others degrade (e.g., IForest: 0.58→0.50, ABOD: 0.53→0.45). This is because type A faults only affect action states that involve the faulty action parameter. In unaffected states, the data appears normal, and per-state models assign low anomaly scores. The proposed Bayesian fault diagnosis framework (
Section 3.3) addresses this limitation by exploiting the pattern of anomaly detection across action states rather than aggregate anomaly scores.
3.3. Fault Diagnosis
In industrial practice, it is difficult to collect sufficient data for each cause of failure. We therefore propose a fault diagnosis method that uses modeling to generate and evaluate a large amount of data for each cause of failure. Using the action state makes it possible not only to detect anomalies but also to diagnose failures. A type A failure can be diagnosed based on the conditional probabilistic model of an action state.
We analyze which action parameter caused each failure.
Figure 8a shows the anomaly detection probabilities for each action state when the fault was caused by the boom-up pilot (type
). The action states with a related fault action parameter have a relatively high anomaly detection probability. If the anomaly detection probability is known for each fault type
as shown in
Figure 8a,
can be calculated using (
22)–(
24). The
values are shown in
Figure 8b.
Figure 8b shows the average value of the results for 40 test datasets for which
is the cause of the failure. If the amount of anomaly detection data is sufficient, the cause of the failure is accurately predicted.
To comprehensively evaluate the fault diagnosis capability, we extended the Bayesian diagnosis to all six action fault types (–). For each true fault type, 40 test datasets (20 random-sequence and 20 scenario-sequence) were generated, and the posterior probability was computed by sequentially accumulating anomaly detections across action states. The diagnosis was made by selecting the fault type with the highest posterior after a given number of detected anomalies.
Table 7 shows the resulting
confusion matrix after 20 detected anomalies. The overall diagnosis accuracy is 86.7%. Five out of six fault types achieve accuracy above 92%, with
(boom up),
(arm in), and
(bucket out) reaching 97.5%. The diagnosis accuracy improves with the number of detected anomalies, from 27.5% after 3 detections, to 43.8% after 5, 77.5% after 10, 82.9% after 15, and 86.7% after 20. This convergence demonstrates that the Bayesian framework effectively accumulates evidence from multiple action states to identify the root cause.
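The sequential accumulation of evidence can be sketched as a naive Bayes update over fault hypotheses. Equations (22)–(24) are not reproduced in this text, so the sketch below is one plausible reading rather than the paper's formulation: each observation is an (action state, flagged) pair, `p_detect[f][s]` is an assumed table of detection probabilities for fault `f` in state `s`, observations are treated as conditionally independent, and the optional `reject_below` threshold is a hypothetical parameter that returns no diagnosis when the maximum posterior is low. Fault labels and probabilities are toy values.

```python
import math

def diagnose(observations, p_detect, prior=None, reject_below=None, eps=1e-6):
    """Posterior over fault types after a sequence of (state, flagged) observations.

    p_detect[f][s]: assumed P(anomaly flagged in action state s | fault f).
    Returns (best_fault_or_None, posterior_dict).
    """
    faults = list(p_detect)
    if prior is None:
        prior = {f: 1.0 / len(faults) for f in faults}  # uniform prior
    log_post = {f: math.log(prior[f]) for f in faults}
    for state, flagged in observations:
        for f in faults:
            # Clamp to avoid log(0) for probabilities of exactly 0 or 1.
            p = min(max(p_detect[f].get(state, 0.0), eps), 1.0 - eps)
            log_post[f] += math.log(p if flagged else 1.0 - p)
    # Normalize in a numerically stable way.
    top = max(log_post.values())
    weights = {f: math.exp(v - top) for f, v in log_post.items()}
    total = sum(weights.values())
    posterior = {f: w / total for f, w in weights.items()}
    best = max(posterior, key=posterior.get)
    if reject_below is not None and posterior[best] < reject_below:
        best = None  # pattern does not match any predefined fault well enough
    return best, posterior

# Toy table: fault_1 is mostly detected in state 2, fault_2 in state 3.
p_detect = {"fault_1": {2: 0.8, 3: 0.1}, "fault_2": {2: 0.1, 3: 0.8}}
best, posterior = diagnose([(2, True), (2, True), (3, False)], p_detect)
print(best, posterior)
```

Restricting the observations to flagged samples only reduces the update to a product of detection probabilities over the detected anomalies.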
The fault type
(boom down) shows the lowest diagnosis accuracy (40.0%). This is because the boom-down pilot pressure contributes relatively uniformly to sensor responses across action states, making it difficult to distinguish it from other fault types based on action-state-conditional detection patterns. In contrast, fault types that activate specific subsets of action states (e.g.,
boom up,
bucket in) produce distinctive detection patterns that enable high diagnosis accuracy. More specifically, the Bayesian fault diagnosis relies on the action-state-conditional detection probability
(Equation (
22)), where each fault type
ideally produces a unique pattern of detection probabilities across the 27 action states. For most fault types, the perturbed pilot pressure affects only action states where the corresponding action is active, yielding a sparse and distinctive pattern. However, the boom-down parameter (BD) appears in Equations (
5)–(
8) with relatively smaller coefficients in several of these equations compared to other actions, and its influence propagates to the pump responses in a manner similar to several other fault types. As a result, the detection probability column for
exhibits high similarity with those of other fault types (particularly
,
, and
), reducing the discriminative power of the Bayesian update. The confusion matrix confirms this: the 24 misdiagnosed
trials are spread across all five other fault types rather than concentrated on a single type, indicating a diffuse rather than systematic misclassification. Potential improvements could include incorporating the magnitude of anomaly scores (rather than binary detection) into the Bayesian framework, or augmenting the diagnosis with temporal patterns of consecutive anomaly detections across action states.
It should be noted that the current fault diagnosis framework assumes single-fault scenarios and requires the true fault type to be among the six predefined action faults. In practice, the simultaneous occurrence of multiple faults would produce a superposition of detection patterns, potentially degrading diagnosis accuracy. For unknown fault types, the maximum posterior probability can serve as a confidence measure. A low maximum posterior (e.g., below 0.5 after 20 detections) could trigger a “reject” option, indicating that the observed fault pattern does not match any predefined type.
3.4. Generalizability to Other Domains
To validate whether the feature-based decomposition concept—the core idea behind action state—generalizes beyond excavator systems, we conducted experiments on two public benchmark datasets, UCI Hydraulic Systems [
24] and Skoltech Anomaly Benchmark (SKAB) [
35]. In the excavator case, action states are derived from pilot pressure signals that directly control motor actions. However, the decomposition concept need not rely on action-specific features: any process variable that partitions the data into distinct operating regimes can serve as a decomposition basis. To test this broader hypothesis, we selected datasets in which a single process variable, unrelated to any action or control input, naturally partitions the operating conditions. If performance improves, even for a subset of algorithms, this indicates that the feature decomposition concept is applicable beyond action-specific features, and the pattern of which algorithms benefit points to practical conditions for effective decomposition. Unlike turbofan engines [
36] and chemical processes [
37], where per-mode modeling has been previously explored [
38], feature-based mode decomposition has not been systematically evaluated on hydraulic system or industrial process monitoring datasets.
The UCI Hydraulic Systems dataset monitors a hydraulic test rig with 17 sensors sampling at 1–100 Hz across 60-s duty cycles. We selected 10 low-frequency sensors (4 temperature, 2 flow, 1 vibration, and 3 efficiency/cooling sensors) sampled at 1–10 Hz, excluding the 6 pressure sensors and 1 motor power sensor that operate at 100 Hz, because their rapid intra-cycle dynamics are poorly summarized by cycle-level statistics. We computed 2 statistical features per sensor (mean and standard deviation), yielding 20-dimensional feature vectors from 2205 duty cycles. After filtering for stable operating conditions (1449 cycles), the accumulator pressure condition serves as the anomaly target. Cycles with optimal pressure (130 bar) are labeled normal, while reduced/degraded/failure conditions (115/100/90 bar) are labeled anomalous. Training uses 70% of normal-only samples, with the remaining 30% normal plus all anomaly samples forming the test set. For mode definition, we discretized the TS1 temperature sensor into 3 levels (low/medium/high) based on tercile boundaries.
The SKAB dataset contains multivariate time-series data from a water circulation testbed with eight features (two vibration, two temperature, two electrical, one pressure, and one flow measurement). The dataset provides pre-defined train/test splits with anomaly-free training samples and test samples from 34 anomaly scenarios. For mode definition, we discretized the Volume Flow RateRMS feature (coolant flow rate) into three levels based on tercile boundaries of the training data, partitioning the data into low-, medium-, and high-flow operating regimes.
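The mode definition used for both datasets can be sketched as follows: two cut points are estimated from the training values of the chosen feature, and every sample is then assigned to the low, medium, or high mode. The sketch assumes the default quantile convention of Python's `statistics.quantiles`, which may differ slightly from the convention used in the experiments; the function names and toy values are illustrative.

```python
from bisect import bisect_right
from statistics import quantiles

def tercile_boundaries(train_values):
    """Two cut points that split the training values into three roughly equal bins."""
    return quantiles(train_values, n=3)

def assign_mode(value, boundaries):
    """Return 0 (low), 1 (medium), or 2 (high) for a single feature value."""
    return bisect_right(boundaries, value)

# Toy training values for the decomposition feature (illustrative only).
train_feature = [1, 2, 3, 4, 5, 6, 7, 8, 9]
bounds = tercile_boundaries(train_feature)
train_modes = [assign_mode(v, bounds) for v in train_feature]
print(bounds, train_modes)
```

Test samples would be assigned with the same `bounds`, so that each per-mode model is evaluated only on samples falling into its own regime.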
Table 8 compares anomaly detection performance (AUC) with and without feature-based mode decomposition. All algorithms use the same hyperparameters as in
Table 5. Without mode decomposition, a single model is trained on all data. With mode decomposition, separate models are trained per mode.
The results demonstrate that feature-based mode decomposition can improve anomaly detection on both datasets, though the benefit varies substantially with dataset characteristics. On UCI Hydraulic, 7 out of 10 comparable algorithms improve with mode decomposition. ROD shows the largest relative gain (0.573 to 0.891, +55.5%), followed by VAE (0.599 to 0.928, +54.9%), SensitiveHUE (0.603 to 0.860, +42.6%), OCSVM (0.717 to 0.976, +36.1%), and DeepSVDD (0.610 to 0.719, +17.9%). IForest also shows substantial improvement (0.836 to 0.915, +9.4%). COPOD (0.468 to 0.257, −45.1%), ECOD (0.496 to 0.231, −53.4%), and DIF (0.650 to 0.356, −45.2%) degrade, likely because these methods are sensitive to the reduced sample size per mode. The history-based method yields the same AUC as the reactive method (both 0.765), because each duty cycle is an independent observation without temporal ordering between cycles, so there is no sequential context for the history method to accumulate.
On SKAB, 5 out of 10 comparable algorithms improve with mode decomposition: SensitiveHUE shows the largest gain (0.600 to 0.673, +12.2%), followed by DIF (0.528 to 0.585, +10.8%), COPOD (0.585 to 0.623, +6.5%), IForest (0.513 to 0.546, +6.4%), and ROD (0.622 to 0.656, +5.5%). In contrast, VAE (0.676 to 0.543, −19.7%) and DeepSVDD (0.657 to 0.512, −22.1%) show substantial degradation. The more modest improvement on SKAB compared to UCI Hydraulic may be partly attributed to a structural difference: fault-induced shifts in the decomposition feature can unbalance the test modes. In UCI Hydraulic, the TS1 temperature remains stable whether the system is normal or faulty, so training-data-based tercile boundaries produce balanced test modes (33%/34%/33%). In SKAB, however, many anomaly scenarios (e.g., water leakage, pump failures) directly alter the coolant flow rate, causing the test distribution to deviate substantially from the training distribution (test-mode shares of 74%/6%/20% vs. 33%/33%/33% in training). The dominant test mode (74%) contains a mixture of normal and diverse anomalous samples that a single per-mode model struggles to separate, limiting the decomposition benefit. This observation suggests a practical consideration: decomposition may be more effective when the selected feature’s distribution remains relatively stable between normal and anomalous conditions.
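The mode imbalance discussed above can be checked before training per-mode models by comparing the share of samples per mode in the training and test sets. The sketch below is an illustrative diagnostic, not part of the original experiments; the function names are hypothetical, and the toy label lists are constructed only to mirror the shares reported above.

```python
from collections import Counter

def mode_shares(modes, n_modes=3):
    """Fraction of samples falling into each mode (0 .. n_modes - 1)."""
    if not modes:
        raise ValueError("need at least one sample")
    counts = Counter(modes)
    return [counts.get(m, 0) / len(modes) for m in range(n_modes)]

def max_share_shift(train_modes, test_modes, n_modes=3):
    """Largest absolute change in any mode's share between training and test data."""
    train = mode_shares(train_modes, n_modes)
    test = mode_shares(test_modes, n_modes)
    return max(abs(a - b) for a, b in zip(train, test))

# Toy labels mirroring balanced training modes and a 74%/6%/20% test split.
train_modes = [0, 1, 2] * 10
test_modes = [0] * 74 + [1] * 6 + [2] * 20
print(mode_shares(test_modes), max_share_shift(train_modes, test_modes))
```

A large shift would suggest that the decomposition feature itself responds to faults, in which case a different feature may be a better basis for decomposition.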
An important observation across the three datasets is that the best-performing algorithm differs depending on the dataset characteristics, and the effectiveness of mode decomposition appears to depend, among other factors, on how well the decomposition feature separates operating conditions. On the excavator data (
Table 5), the proposed GMM-based history method achieves the highest AUC (0.890). This is because the excavator simulation model produces nearly Gaussian per-state distributions due to the linearity of the hydraulic system under no-load conditions, which aligns well with GMM’s parametric assumptions. Moreover, the GMM-based scoring provides interpretable anomaly explanations through the Mahalanobis distance and chi-squared CDF framework. On UCI Hydraulic, OCSVM achieves the highest AUC (0.976) with mode decomposition, outperforming the proposed GMM methods (reactive: 0.765). The UCI dataset has relatively few training samples per mode, a regime where the RBF kernel of OCSVM effectively captures the decision boundary without requiring large samples, while the GMM’s Mahalanobis-based scoring is less suited to the non-Gaussian feature distributions derived from statistical summaries (mean/std) of raw sensor signals. On SKAB, SensitiveHUE achieves the best performance with mode decomposition (0.673), followed by ROD (0.656).
These results indicate that the feature-based decomposition concept generalizes beyond the excavator’s action-state framework. The decomposition features used in UCI Hydraulic (temperature) and SKAB (flow rate) are process variables unrelated to any control action or input command, yet mode decomposition improves performance for the majority of algorithms on UCI Hydraulic and for half of the algorithms on SKAB. The contrasting results between the two datasets suggest that one potential factor influencing decomposition effectiveness is the stability of the decomposition feature’s distribution across normal and anomalous conditions. In UCI Hydraulic, where the temperature feature is largely unaffected by hydraulic faults, the decomposition produces balanced modes and substantial improvement. In SKAB, where faults alter the flow rate used for decomposition, the resulting mode imbalance may contribute to the reduced benefit. However, other dataset-specific factors, such as data dimensionality, anomaly diversity, and per-mode sample size, may also play a role. This analysis suggests that feature-based mode decomposition is broadly applicable and that it may be advantageous to select decomposition features that characterize operating conditions rather than fault symptoms.