Figure 7 illustrates a specific segment of the LIT-301 sensor data from the SWaT dataset. The top row shows normal data, while the bottom row shows abnormal data containing anomalies. From left to right, each column displays the raw time series, the LGAF-transformed image, and the resulting anomaly map. For the normal data, the anomaly map assigns an anomaly score of 0 to every pixel, indicating that no anomalies were detected at any time point. In contrast, the anomaly map for the abnormal data contains multiple pixels along the main diagonal that exceed the threshold, and the FAS value also surpasses its criterion, so the anomalies are successfully detected. These detected points correspond to time segments in the raw data whose patterns deviate from the normal ones, demonstrating the anomaly detection process end to end. The detailed results of anomaly detection are presented below.
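The decision logic described above can be sketched as follows. This is a simplified illustration, not the paper's exact implementation: the function name, the anomaly-map layout (pixel (t, t) corresponding to time step t), and the assumption that FAS is the maximum pixel score are all assumptions made for the sketch.

```python
import numpy as np

def detect_from_anomaly_map(anomaly_map, pixel_threshold, fas_threshold):
    """Flag a subsequence as anomalous when diagonal pixels exceed the
    pixel threshold and the overall FAS exceeds its criterion.

    anomaly_map : (N, N) array of per-pixel anomaly scores for one
                  subsequence of length N (hypothetical layout).
    """
    diag = np.diag(anomaly_map)             # pixel (t, t) ~ time step t
    flagged_steps = np.where(diag > pixel_threshold)[0]
    fas = anomaly_map.max()                 # assumed FAS definition
    is_anomalous = (len(flagged_steps) > 0) and (fas > fas_threshold)
    return is_anomalous, flagged_steps
```

For a normal subsequence whose map is all zeros, no time step is flagged; a map with above-threshold diagonal pixels yields the indices of the suspicious time steps.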
5.2.1. Module-Wise Analysis
This section evaluates the performance of our proposed system on the dataset, compares the performance of different modules, and provides an interpretation of the results. The experimental results include comparisons of segmentation methods, encoding techniques, and the detection models used. For all configurations other than the comparison targets, the settings achieving the highest performance were selected for the experiments.
Table 2 presents the experimental results comparing time series segmentation based on cyclic intervals with segmentation using regular intervals. By design, the proposed method cannot easily account for interrelations between different time series segments. However, segmenting data by cyclic intervals enables each subsequence to represent a typical normal pattern, which significantly improves overall anomaly detection performance even without considering such interrelations. Moreover, this approach preserves temporal relationships within a cycle, such as time dependencies within the data, further enhancing detection accuracy. That said, this method is difficult to apply to non-cyclic data and can degrade performance if an incorrect cycle is selected. Therefore, to apply cyclic segmentation effectively, preliminary processes such as data analysis or decomposition are essential.
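Once a cycle length has been estimated, cyclic segmentation reduces to cutting the series into one-cycle subsequences. The sketch below assumes the cycle length is already known; the paper's cycle-estimation procedure (e.g., via prior data analysis or decomposition) is not reproduced here.

```python
import numpy as np

def segment_by_cycle(series, cycle_length):
    """Split a 1-D series into non-overlapping subsequences of one cycle
    each, dropping any trailing remainder shorter than a full cycle
    (a simplified sketch; cycle_length must be estimated beforehand)."""
    series = np.asarray(series)
    n_cycles = len(series) // cycle_length
    return series[:n_cycles * cycle_length].reshape(n_cycles, cycle_length)
```

With a correct cycle length each row then holds one instance of the normal pattern, which is what makes the per-subsequence images comparable; regular-interval segmentation is the same operation with an arbitrary window length.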
Figure 8 shows examples of dividing the image by cycle and by regular interval.
Table 3 presents the experimental results comparing different image encoding methods. GGAF has an advantage in maintaining consistency within time series data, reducing the impact of noise, and preserving overall patterns. However, in the SWaT dataset used in this study, the sensor values exhibit oscillations with small variations and are not significantly affected by extreme values. Due to these data characteristics, the experimental results for LGAF and GGAF showed no substantial differences. This is because the stable variability in the data did not provide factors that could clearly distinguish the two methods. Similarly, no significant differences were observed in the UCR dataset.
In contrast, for the NAB dataset, the LGAF method failed to detect one of the three anomalous points, whereas the GGAF method successfully detected all three. This can be attributed to the oscillatory nature of normal data within a specific range, where GGAF’s ability to apply consistent criteria and reduce the influence of minor variations made it more effective at identifying anomalies caused by extreme values.
Figure 9 illustrates the results of LGAF and GGAF on the NAB dataset. The top row presents the results using LGAF, and the bottom row shows the results using GGAF. Images generated with LGAF present minimal differences between anomalous and normal data, resulting in low anomaly scores on the corresponding anomaly maps. In contrast, images generated with GGAF present more pronounced differences, and the anomaly maps show higher anomaly scores, successfully identifying the anomalies.
As such, GGAF may be more effective for detecting anomalies associated with extreme values or in scenarios where the data exhibits high variability and noise.
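The distinction between the two encodings comes down to how each subsequence is rescaled before the Gramian Angular Field transform: LGAF normalizes with the subsequence's own min/max, while GGAF uses the min/max of the whole training series, giving every subsequence a consistent criterion. The sketch below uses the summation-field (GASF) variant; the exact variant and normalization details in the paper are assumptions here.

```python
import numpy as np

def gaf(x, x_min, x_max):
    """Gramian Angular (Summation) Field of a 1-D subsequence,
    normalized with the supplied min/max."""
    x = np.asarray(x, dtype=float)
    scaled = np.clip(2 * (x - x_min) / (x_max - x_min) - 1, -1, 1)  # to [-1, 1]
    phi = np.arccos(scaled)                                         # polar angles
    return np.cos(phi[:, None] + phi[None, :])                      # GASF image

def lgaf(sub):
    # Local normalization: each subsequence uses its own range,
    # so small oscillations are stretched to fill [-1, 1].
    return gaf(sub, sub.min(), sub.max())

def ggaf(sub, series):
    # Global normalization: the full training series sets the range,
    # so extreme values stand out against stable subsequences.
    return gaf(sub, series.min(), series.max())
```

This makes the NAB result above plausible: under LGAF a mildly oscillating subsequence and an extreme one can produce similar images, whereas under GGAF only the extreme subsequence reaches the edges of the angular range.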
Table 4 presents the experimental results comparing different image anomaly detection models. The results, obtained on univariate data for each model, show similar performance across the models, with no significant differences on the dataset used in this study. As shown in
Table 5, RD, despite its simple structure, calculates feature differences across multiple layers. On the other hand, EAD uses the efficient PDN network, resulting in lower memory usage compared to RD.
In terms of training time, there were notable differences between the two models. RD employs a standard Dataloader, leading to relatively shorter training times. In contrast, EAD utilizes an InfiniteDataloader, which allows for infinite repetition of the dataset during training, resulting in longer training times. However, since EAD is not influenced by the number of training samples, it could potentially achieve shorter training times when handling large datasets.
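The training-time difference above stems from how the two loaders define an epoch. A minimal stand-in for an infinite loader is a generator that cycles the dataset forever, so training length is set by a fixed step budget rather than by dataset size; this is a generic sketch, not the actual InfiniteDataloader implementation used by EAD.

```python
def infinite_loader(dataset):
    """Cycle through a dataset indefinitely. Training then runs for a
    fixed number of steps, independent of how many samples exist
    (a simplified stand-in for EAD's InfiniteDataloader)."""
    while True:
        for batch in dataset:
            yield batch

# A fixed step budget decouples training time from dataset size:
loader = infinite_loader([1, 2, 3])
steps = [next(loader) for _ in range(7)]   # 7 steps over a 3-sample set
```

This is why EAD's training time does not shrink on small datasets (the step budget is fixed) but also does not grow on large ones, whereas a standard Dataloader scales with the number of samples per epoch.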
Given that the dataset used in this study primarily consists of simple patterns, the performance gap between EAD and RD was not significant. In conclusion, EAD is more suitable for efficient anomaly detection tasks with low memory requirements, while RD is better suited for detecting anomalies in complex datasets where higher precision is required.
5.2.2. Comparison with Baselines
This section compares the performance of the proposed system on the dataset with various baselines and provides an interpretation of the results. Similarly to the previous section, experiments were conducted using configurations that achieved the highest performance. However, a new criterion for the anomaly detection threshold was additionally introduced. This was based on the observation that evaluation results are highly influenced by the chosen threshold values. In this study, results were analyzed using (1) a general threshold and (2) an optimal threshold that yielded the best performance for each sensor.
The general threshold was derived by averaging the anomaly scores at actual anomaly points from a subdataset of the MVTecAD image dataset [38]. In contrast, the optimal threshold was determined by adjusting various threshold values and selecting the one that produced the best performance.
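The optimal-threshold selection can be sketched as an exhaustive sweep over candidate values. The candidate grid and the use of F1 as the selection criterion are assumptions for this sketch; the paper does not specify the exact search procedure.

```python
import numpy as np

def best_threshold(scores, labels, candidates):
    """Sweep candidate thresholds and keep the one maximizing F1
    (a sketch of per-sensor 'optimal threshold' selection)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = scores > t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Because this sweep uses the labeled test scores, it yields an upper bound on achievable performance for each sensor, which is exactly why results under the optimal threshold are reported alongside those under the general threshold.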
Table 6 presents the results of univariate time series anomaly detection experiments conducted on the SWaT dataset using individual sensors. Since the ground truth used in the experiments includes attack information for all sensors, an attack may be marked in the ground truth even if no attack occurred on the specific sensor being evaluated.
When applying the general threshold, the results showed superior performance compared to the baseline. Additionally, when optimal thresholds were set for each sensor, improvements were observed in the Recall and F1 score for most sensors, indicating enhanced detection performance for abnormal data. This suggests that the general threshold, typically having a higher value, imposed stricter detection criteria than the optimal threshold.
For sensors such as DPIT-301 and AIT-504, the results were identical or very similar. This was because the values of the optimal threshold and the general threshold for these sensors were closely aligned. Overall, detection performance was particularly strong when the data exhibited general patterns and when the train and test datasets had similar characteristics. This is because the process of segmenting normal patterns and generating images facilitates the learning of normal data, thereby enabling more effective detection of anomalies that deviate from these patterns.
However, differences in the data patterns of individual sensors led to varying levels of normal pattern formation, resulting in some variation in anomaly detection performance. Nevertheless, the overall performance was consistently strong when optimal thresholds were applied to each sensor.
Table 7 presents the results of univariate time series anomaly detection experiments conducted on the UCR and NAB datasets. The experimental results for the UCR dataset demonstrated the best performance compared to existing models, which can be attributed to the characteristics of the dataset. The time series in the UCR dataset consist of similar normal patterns, and the length of each pattern is neither too short nor too long. This allowed the data to be fully utilized without requiring additional preprocessing steps such as down-sampling. Furthermore, the differences in normal patterns between the training and testing data were minimal, making it relatively easier to detect anomalies when they appeared in the test set. These factors contributed to the superior performance observed on the UCR dataset compared to other datasets.
On the other hand, the results for the NAB dataset showed a Recall of 1.0, successfully detecting all anomalies, but the Precision and F1 score were relatively low. According to the dataset statistics reported for the baseline model TranAD [14], the anomaly ratio was 0.92%. However, in the actual data used for our experiments, the anomaly ratio was only 0.07%, with just 3 out of 4033 sequences labeled as anomalies. While the data surrounding the labeled anomalies exhibited significant deviations from the normal range, the Ground Truth designated only three specific points as anomalies.
Our system detected not only the labeled anomalies, but also the surrounding points affected by those anomalies, resulting in a relatively high number of False Positives and, consequently, lower Precision. However, since the system did not misclassify normal points outside the vicinity of anomalies as anomalies, the overall anomaly detection performance can still be regarded as robust.
Additionally, TSI-GAN [5], which employs a mechanism similar to ours by converting time series data into images and applying image anomaly detection techniques, achieved an F1 score of 0.846 on the InternalBleeding dataset. While there is a difference in the amount of data used, our model achieved a higher average F1 score of 0.939 based on experiments conducted on additional subdatasets.
Table 8 presents the results of multivariate time series anomaly detection experiments conducted on the SWaT dataset. The results of T2IAE are based on using the same GAF method as our system. In this study, anomaly detection was performed on a selected set of representative sensors from the SWaT dataset, and the results were aggregated to derive the final performance, which was then compared with baseline models. The experiments analyzed the results using a general threshold applied uniformly across all sensors and optimal thresholds tailored for each sensor.
The comparison of the two approaches showed that applying optimized thresholds for individual sensors reduced False Positives and enabled the detection of more anomalies. The final performance on the SWaT dataset was evaluated by integrating the anomaly points detected across the selected sensors. The anomaly detection results can be intuitively observed in
Figure 10. Our system demonstrated relatively low precision but consistently high recall. This indicates that the model flagged a larger number of points as anomalies, thereby missing very few true anomalies and maintaining a low False Negative Rate (FNR). The ability to identify anomalies without missing them, even at the cost of misclassifying some normal data as abnormal (False Positives), is often considered more valuable in certain domains. Therefore, the results of this study can be considered significant, especially in fields where rapid and reliable anomaly detection is crucial.
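The integration of per-sensor detections into a final multivariate decision can be sketched as a simple OR-aggregation over time steps. The paper does not spell out its exact integration rule, so treating a time step as anomalous whenever any sensor flags it is an assumption; it is consistent with the high-recall, lower-precision behavior reported above.

```python
import numpy as np

def aggregate_sensors(per_sensor_flags):
    """Combine per-sensor detections into a system-level decision:
    a time step is anomalous if any sensor flags it (OR-aggregation;
    an assumed integration rule, not the paper's specified one).

    per_sensor_flags : (n_sensors, T) array-like of booleans.
    """
    flags = np.asarray(per_sensor_flags, dtype=bool)
    return flags.any(axis=0)    # shape (T,)
```

OR-aggregation maximizes recall at the expense of precision, since a false positive on any single sensor propagates to the system-level result; a voting rule over correlated sensor groups would trade the other way.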
For this experiment, the analysis was conducted using representative sensors from similar sensor groups where attacks occurred, rather than averaging performance across all sensors in the SWaT dataset. Future research could consider the correlations between sensors and expand anomaly detection to a larger set of sensors. By aggregating results across more sensors, the system’s anomaly detection performance is expected to be further enhanced.