Sensors
  • Article
  • Open Access

17 October 2023

TCF-Trans: Temporal Context Fusion Transformer for Anomaly Detection in Time Series

1
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
2
School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
3
Chongqing Yuxin Road & Bridge Development Co., Ltd., Chongqing 400060, China
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Signal Processing and Machine Learning for Sensor Systems

Abstract

Anomaly detection tasks involving time-series signal processing have been important research topics for decades. In many real-world anomaly detection applications, no specific distribution fits the data, and the characteristics of anomalies vary. Under these circumstances, the detection algorithm must learn the data features well. Transformers, which apply the self-attention mechanism, have shown outstanding performance in modelling long-range dependencies. Although Transformer-based models have good prediction performance, they may be influenced by noise and ignore some unusual details that are significant for anomaly detection. In this paper, a novel temporal context fusion framework, Temporal Context Fusion Transformer (TCF-Trans), is proposed for anomaly detection in time series. The original feature-transmitting structure in the decoder of Informer is replaced with the proposed feature fusion decoder to fully utilise the features extracted from shallow and deep decoder layers. This strategy prevents the decoder from missing unusual anomaly details while maintaining robustness to noise in the data. In addition, we propose the temporal context fusion module to adaptively fuse the generated auxiliary predictions. Extensive experiments on public and collected transportation datasets validate that the proposed framework is effective for anomaly detection in time series. Additionally, the ablation study and a series of parameter sensitivity experiments show that the proposed method maintains high performance under various experimental settings.

1. Introduction

Anomaly detection aims to find patterns that do not comply with expected behaviour [1]. In many real-world applications, anomaly detection tasks are important research topics [2,3]. Typically, anomalies are categorised as point, contextual, and collective. Since it is common to find that no specific distributions fit the data, and the characteristics of anomalies are different, using traditional anomaly detection methods based on distance estimation or statistical theory may be challenging. Moreover, complex and changing intrinsic data characteristics, low recall rate, and high dimensional data [4] further impede the learning performance of traditional machine learning methods. Under these circumstances, the detection algorithm requires excellent learning ability of the data features. Deep learning methods commonly learn the complex dynamics in the data without relying upon underlying patterns within the data. This advantage makes them popular in dealing with anomaly detection tasks. Transformers, which apply the self-attention mechanism, have shown outstanding performances in modelling long-range dependencies among different deep learning methods.
Common methods for dealing with anomaly detection tasks [5,6,7] can be generally classified as traditional machine learning methods and deep learning methods.
SVM [8], One-class SVM (OC-SVM) [9], Isolation forest [10] and Local Outlier Factor (LOF) [11] are typical examples of machine learning anomaly detection algorithms. However, if raw samples are complex or dense, the detection performance of these methods will be limited. LODA [12] is a lightweight anomaly detector that ensembles different detectors and is suitable for data streams. LSCP [13] is another ensemble framework compatible with different types of base detectors and further determines the most competent base detector in the local region upon similarity measuring.
Although machine learning methods are suitable for some anomaly detection tasks, deep learning methods more effectively learn expressive representations of complex data in some real-world applications [4,14]. For example, deep support vector data description (DeepSVDD) [15] is applied to complex data for better feature selection. Recurrent neural networks (RNN) that capture time dependence are commonly used to recognise or predict sequences. RNN has been exploited with gating mechanisms to become common methods such as LSTM and Gated recurrent units (GRU). For example, one LSTM-based method is adopted for detecting urban anomalies [16]. DeepAnT [17] is a novel anomaly detection method in time series, which does not require a huge dataset. It primarily applies a CNN-based network to take a window range of time series and try to predict its value for the next stamp. Then, the predicted value is sent to an anomaly detector module to determine its abnormality. DAGMM [18] applies a compression network and an estimation network to achieve unsupervised anomaly detection. The compression network implements a deep autoencoder to generate a low-dimensional representation for each input. Then the estimation network, based on the Gaussian Mixture Model, takes the representation and predicts the corresponding likelihood. Parameters of both the two sub-networks are jointly optimised simultaneously. SO-GAAL [19] applies the generative adversarial learning framework, which consists of a generator and a discriminator used to detect anomalies. Alternatively, GDN [20] combines graph structure learning and attention weights to achieve good anomaly detection results in some fields. LUNAR [21] is another graph neural network-based anomaly detection method. It extracts information from the nearest neighbours of each node and further detects anomalies, and it can learn and adapt to different sets of data. 
Transformer [22]-based algorithms are also widely applied in anomaly detection tasks. For example, UTRAD [23] obtains stable training and accurate anomaly detection/localisation results based on a transformer-based autoencoder. Additionally, MT-RVAE [24] utilises the variational Transformer model with improved positional encoding and feature extraction to achieve satisfying anomaly detection performances.
In this paper, one Transformer-based network, Informer [25], is chosen as the baseline for dealing with anomaly detection tasks with data collected from real-world applications. The original Informer is an efficient Transformer variant that adopts the ProbSparse self-attention mechanism to significantly reduce time complexity and memory usage while outperforming existing methods, mainly in time-series forecasting tasks. Additionally, it can handle tasks in an unsupervised way, which avoids the cumbersome labelling cost. However, directly applying the original Informer to time-series anomaly detection tasks may not be appropriate. Since the original Informer is designed for time-series forecasting, it aims to find the overall trend of the target sequence and can ignore some unusual details. Moreover, Transformers may focus on dominant relationships among sequences while paying less attention to intrinsic details when dealing with short-term data. As shown in Figure 1, the Informer decoder has a straight-through feature-transmitting structure across its layers, and its output is based merely on the last layer, which contains the least noise but may miss some details. However, anomalies may be rare in anomaly detection tasks, and some minor details in the data may reflect an anomaly and cannot be ignored. Different features should therefore be utilised to improve the overall detection performance [26].
Figure 1. Demonstration example of the 3-layer decoder block diagram for the original Informer network.
To overcome the above-mentioned limitations, we propose to better utilise features from shallow and deep decoder layers with a new multi-layer feature fusion decoder. The original feature-transmitting structure in the decoder of Informer is replaced with the proposed feature fusion decoder to fully utilise the features extracted from shallow and deep decoder layers. This strategy prevents the decoder from missing unusual anomaly details while maintaining robustness from noises inside the data. Next, the auxiliary predictions generated by the decoder will be further adaptively fused based on similarities/distances and sequence information in the temporal context fusion module. This strategy exploits temporal context information of the data by a learnable weight to make the output more robust. We evaluate the proposed method using both the public and our collected transportation datasets for anomaly detection tasks and compare the results with recently proposed machine learning and deep learning methods.
The main contributions of our work can be summarised as follows:
  • We introduce a novel framework: Temporal Context Fusion Transformer (TCF-Trans) for unsupervised anomaly detection in time series based on temporal context fusion.
  • We replace the straight-through feature-transmitting structure in the decoder layers of Informer with the proposed feature fusion decoder, which fully utilises the features extracted from shallow and deep decoder layers. This strategy prevents the decoder from missing unusual anomaly details while maintaining robustness to noise in the data.
  • We propose the temporal context fusion module to fuse the auxiliary predictions generated by the decoder adaptively. This strategy alleviates noises or distortions caused by the single auxiliary prediction and fully uses temporal context information of the data.
  • Extensive experiments on the public and collected transportation datasets validate that the proposed framework is effective for anomaly detection tasks, such as transportation tasks in time series. In addition, a series of sensitivity experiments and the ablation study show that the proposed method maintains high performance under various experimental settings.
The remaining parts of this paper are organised as follows. Section 2 reviews related works and the background of the Transformer. Section 3 describes the details of the proposed method. Section 4 describes the experiments for validating the proposed method. The conclusion of this paper is presented in Section 5.

3. TCF-Trans: Temporal Context Fusion Transformer

3.1. Overall Structure

As shown in Figure 3, TCF-Trans consists of three main modules: an auxiliary prediction generator, a temporal context fusion module and an anomaly detection module. The auxiliary prediction generator performs feature learning, fusion and refinement of the processed input data based on an Encoder–decoder architecture. Next, the generated auxiliary predictions are further processed by the temporal context fusion module based on the similarity/distance and sequence information to generate output predictions adaptively. Finally, the output predictions are compared with the target data under anomaly detection criteria to produce anomaly scores. The final detection result will be determined based on the threshold. These modules will be presented in detail in the following sections.
Figure 3. Block diagram of overall structure of TCF-Trans.

3.2. Auxiliary Prediction Generator

The auxiliary prediction generator implements an Encoder–decoder architecture similar to the Informer [25]. The input $X^t$ is encoded into the hidden state representation $H^t$ as the encoder output. Next, the decoder produces the auxiliary predictions $\hat{P}^t$ based on the encoder output $H^t$ and the decoder input $X_{din}^t$.
The process in the encoder is based on the Informer baseline. However, as mentioned earlier, the decoder process of the Informer baseline is not well suited to anomaly detection tasks. To overcome this limitation, we propose a multi-layer feature fusion decoder that makes better use of the different features extracted in shallow and deep decoder layers and further refines them to generate auxiliary predictions. To demonstrate the idea of feature fusion in the decoder, we take a three-layer decoder as an example. As shown in Figure 4, the decoder consists of three layers, and its inputs are the encoder output $H^t$ and the decoder input $X_{din}^t$. The outputs of the three layers are denoted as $D = \{d_1, d_2, d_3\}$, where $d_i$ is the output of the $i$th layer. Our feature fusion aims to generate the merged and refined $d_i \in \mathbb{R}^{L}$ (denoted as $d_i^*$) for the $i$th layer based on $D$ and use it to produce auxiliary predictions to assist detection.
Figure 4. Demonstration example of the three-layer feature fusion decoder of TCF-Trans. ⊕ denotes feature fusion operations.
In the multi-layer decoder, shallow layers contain more information about details, while deep layers contain more depth representations of the data [41,42]. Therefore, we can fuse features from deep layers with those from shallow layers to obtain an optimal representation of the data, as follows:
$$d_i^* = [d_i, \ldots, d_1] \in \mathbb{R}^{L \times d_{feature}^*},$$
where $[\cdot, \ldots, \cdot]$ denotes the concatenation, $i$ is the index of the layer and $d_{feature}^*$ is the dimension of the fused feature.
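The concatenation above can be illustrated with a minimal NumPy sketch. The function name `fuse_decoder_features` and the assumption that every layer output has shape `(L, d_model)` are illustrative and not taken from the paper's implementation; the MLP refinement step is omitted.

```python
import numpy as np

def fuse_decoder_features(layer_outputs):
    """Return [d_1*, ..., d_N*] with d_i* = concat(d_i, ..., d_1) on the feature axis.

    layer_outputs: list of arrays [d_1, ..., d_N], each assumed to be (L, d_model).
    """
    fused = []
    for i in range(1, len(layer_outputs) + 1):
        # Take layers i, i-1, ..., 1 (deepest first), matching [d_i, ..., d_1].
        parts = layer_outputs[i - 1::-1]
        fused.append(np.concatenate(parts, axis=-1))
    return fused

# Toy example: three layers, sequence length 4, feature dimension 2.
layers = [np.full((4, 2), float(v)) for v in (1, 2, 3)]
fused = fuse_decoder_features(layers)
```

In this sketch the fused feature of layer $i$ has dimension $i \times d_{model}$, which is one way to read the statement that deeper layers fuse more features.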
During the fusion process, the deeper the layer, the more features it fuses. Under this circumstance, directly applying the FC layer to produce outputs makes it likely to lose information. Therefore, feature-refining tools such as multilayer perceptron (MLP) layers can be implemented to refine the fused features for deeper layers. Note that MLP layers can be represented as hidden layers in the MLP network and layer normalisation can be added to achieve stable transmission.
Lastly, auxiliary predictions produced by the FC layer using refined features can be defined as follows:
$$\hat{P}^t = \{\hat{P}_1^t, \ldots, \hat{P}_N^t\},$$
where $N$ is the number of layers in the decoder and $\hat{P}_i^t$ is the $i$th auxiliary prediction.
During the training of this module, the mean squared error (MSE) loss function can be chosen to compute loss among auxiliary predictions and target sequences. Moreover, weights can be assigned for each prediction to emphasise the importance of predictions produced by different layers. The loss function of this module can be expressed as follows:
$$Loss_1 = \sum_{i=1}^{N} w_i \, MSE(\hat{P}_i^t, Y^t),$$
where $w_i$ is the weight of the $i$th decoder layer.
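As a minimal sketch of this weighted loss, the following NumPy code evaluates $Loss_1$ for given predictions. The helper names `mse` and `auxiliary_loss` and the example weights are hypothetical; in actual training the same expression would be computed on framework tensors so that gradients can flow.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two arrays of the same shape."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

def auxiliary_loss(aux_preds, target, weights):
    """Loss_1 = sum_i w_i * MSE(P_i, Y) over the N auxiliary predictions."""
    if len(aux_preds) != len(weights):
        raise ValueError("one weight per auxiliary prediction is required")
    return sum(w * mse(p, target) for w, p in zip(weights, aux_preds))

# Toy example: two auxiliary predictions against an all-zero target.
target = np.zeros((2, 2))
preds = [np.ones((2, 2)), 2 * np.ones((2, 2))]   # MSEs are 1.0 and 4.0
loss = auxiliary_loss(preds, target, weights=[0.5, 0.25])
```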

3.3. Temporal Context Fusion Module and Anomaly Detection

After obtaining several auxiliary predictions produced by the former module, one direct way to combine them is to empirically assign a weight to each prediction. However, such a method takes a long time to reach a satisfying solution due to the large number of trials required. Moreover, applying a scalar weight limits the utilisation of the temporal context of these predictions. A scalar weight per prediction can only express that the predictions differ in importance, whereas different dimensions or points in the sequence may carry different intrinsic temporal contexts and may not share the same importance. In that case, the output prediction based on a fixed scalar weight may only partially take advantage of our feature fusion decoder. Therefore, we aim to produce the final prediction adaptively based on a weight learned from the similarities/differences of these auxiliary predictions and the fused temporal context.
As shown in Figure 5, auxiliary predictions are sent in the similarity/distance measurement block to determine their similarities/differences. We can calculate the similarities/differences among them as follows:
$$d_{i,j} = k(\hat{P}_i^t, \hat{P}_j^t), \quad i \neq j,$$
where $k(\cdot, \cdot)$ can be a user-defined distance or similarity measurement such as the Euclidean distance.
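A minimal sketch of this measurement step, assuming the Euclidean distance as $k(\cdot, \cdot)$ and computing one value per ordered pair $i \neq j$, is given below. The function name `pairwise_distances` and the dictionary output are illustrative choices.

```python
from itertools import permutations

import numpy as np

def pairwise_distances(aux_preds):
    """Return {(i, j): ||P_i - P_j||_2} for all ordered pairs i != j (1-based)."""
    distances = {}
    for i, j in permutations(range(len(aux_preds)), 2):
        # For 2-D inputs, np.linalg.norm defaults to the Frobenius norm,
        # i.e. the Euclidean norm of the flattened difference.
        distances[(i + 1, j + 1)] = float(np.linalg.norm(aux_preds[i] - aux_preds[j]))
    return distances

# Toy example with N = 3 auxiliary predictions of shape (2, 2).
preds = [np.zeros((2, 2)), np.ones((2, 2)), 3 * np.ones((2, 2))]
d = pairwise_distances(preds)
```

With a symmetric measure such as the Euclidean distance, $d_{i,j}$ and $d_{j,i}$ coincide, so an implementation could keep only the pairs with $i < j$.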
Figure 5. Demonstration example of temporal context fusion module of TCF-Trans. ⊕ denotes feature fusion operations.
In the meantime, a slice or all of each auxiliary prediction $P_i^t \in \mathbb{R}^{L_s \times d_s}$ can be chosen as the sequence information, where $L_s$ and $d_s$ are its length and dimension, respectively. Target sequences could also be added to the sequence information; we take the auxiliary predictions as an example for ease of understanding. Then, the temporal context fusion is processed as follows:
$$\hat{D}^t = [d_{1,2}, \ldots, d_{N,N-1}, P_1^t, \ldots, P_N^t],$$
where $[\cdot, \ldots, \cdot, \cdot, \ldots, \cdot]$ denotes the concatenation and $N$ is the number of layers in the decoder.
Next, the fused temporal context is processed by MLP layers for further feature extraction and refinement. Since we aim to make full use of the auxiliary predictions' similarities/differences and temporal information, the output of the MLP is sent to the FC layer to adaptively generate the weight $W^{t*}$ based on the weight resolution expected for the given data inputs (a learned weight with a higher dimension provides a higher resolution). Specifically, the learned weight $W^{t*}$ can be chosen from $W^t = \{W^t \in \mathbb{R}^{N}, W^t \in \mathbb{R}^{d_y \times N}, W^t \in \mathbb{R}^{L_y \times d_y \times N}\}$.
Last, the output prediction can be computed as follows:
$$\hat{Y}^t = \sum_{i=1}^{N} W_i^{t*} \cdot \hat{P}_i^t.$$
During the training of this module, we can also use the MSE loss function to compute the loss between the output prediction and the target sequences.
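The weighted combination of auxiliary predictions can be sketched with NumPy broadcasting, which covers the three weight resolutions $\mathbb{R}^{N}$, $\mathbb{R}^{d_y \times N}$ and $\mathbb{R}^{L_y \times d_y \times N}$ with the same code. This sketch assumes each prediction has shape `(L_y, d_y)` and takes the weight as a given array; in the proposed module the weight is produced by the MLP and FC layers, which are not modelled here. The name `fuse_predictions` is hypothetical.

```python
import numpy as np

def fuse_predictions(aux_preds, weight):
    """Compute Y_hat = sum_i W_i * P_i.

    aux_preds: list of N arrays, each assumed to be (L_y, d_y).
    weight: array of shape (N,), (d_y, N) or (L_y, d_y, N).
    """
    stacked = np.stack(aux_preds, axis=-1)          # (L_y, d_y, N)
    # Broadcasting aligns trailing axes, so all three weight shapes work.
    return np.sum(np.asarray(weight, dtype=float) * stacked, axis=-1)

# Toy example: N = 2 predictions with L_y = 2 and d_y = 3.
preds = [np.ones((2, 3)), 3 * np.ones((2, 3))]
y_scalar = fuse_predictions(preds, [0.5, 0.5])                       # one weight per prediction
y_per_dim = fuse_predictions(preds, [[1, 0], [0, 1], [0.5, 0.5]])    # one weight per dimension
```

The per-dimension weight lets each output dimension favour a different auxiliary prediction, which a single scalar per prediction cannot express.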
After we have obtained the output predictions, the anomaly detection module compares them with the target sequence under specific criteria. Here, we can choose MSE loss to generate the anomaly score as follows:
$$AnomalyScore = MSE(Y, \hat{Y}).$$
Once we have the anomaly score, a threshold can be set to determine anomalies. The threshold can be determined empirically or via grid search for better precision, but such methods require extensive trials and are time-consuming. Alternatively, methods based on Streaming Peaks-Over-Threshold (SPOT) [43] can be applied to determine the threshold $th$.
Therefore, values exceeding $th$ in the $AnomalyScore$ are considered potential anomalies. The main procedures for detecting anomalies via TCF-Trans are summarised in Algorithm 1.
Algorithm 1: Anomaly detection via TCF-Trans
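The scoring and thresholding steps of Section 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions: the score is computed per time step by averaging the squared error over the feature dimensions, and the threshold `th` is passed in directly rather than estimated with SPOT, which is not implemented here. The function names are hypothetical.

```python
import numpy as np

def anomaly_scores(target, prediction):
    """Per-time-step MSE between target Y and output prediction Y_hat.

    Both inputs are assumed to have shape (T, d); the result has shape (T,).
    """
    err = (np.asarray(target, dtype=float) - np.asarray(prediction, dtype=float)) ** 2
    return err.reshape(err.shape[0], -1).mean(axis=1)

def flag_anomalies(scores, th):
    """Mark time steps whose anomaly score exceeds the threshold th."""
    return np.asarray(scores) > th

# Toy example: three time steps, two dimensions.
y = np.zeros((3, 2))
y_hat = np.array([[0.0, 0.0], [2.0, 2.0], [0.0, 1.0]])
scores = anomaly_scores(y, y_hat)
flags = flag_anomalies(scores, th=1.0)
```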

4. Experiments

In this section, we validate the effectiveness of the proposed anomaly detection framework through several experiments. First, we describe the experimental setup. Then, we evaluate the proposed framework on three public datasets in Section 4.2. Next, we compare the proposed method with several state-of-the-art methods in the real-world transportation traffic dataset in Section 4.3. Moreover, we conduct an ablation study and parameter sensitivity experiments in Section 4.4 and Section 4.5, respectively.

4.1. Setup

We follow standard evaluation metrics in anomaly detection tasks, including the $F_1$ score, Precision, and Recall, to evaluate performance as follows:
$$Precision = \frac{TP}{FP + TP}, \quad Recall = \frac{TP}{FN + TP}, \quad F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall},$$
where $TP$ denotes the number of correct anomalous detections, $FP$ denotes the number of incorrect anomalous detections, and $FN$ denotes the number of incorrect normal detections. Higher $F_1$ score, precision and recall values indicate better performance.
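For reference, the three metrics can be computed from binary labels as in the following minimal pure-Python sketch. The function name is illustrative, labels are assumed to use 1 for anomalous and 0 for normal, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, 1 = anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators (no detections or no true anomalies).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: TP = 2, FP = 1, FN = 1.
p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```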
The proposed method is implemented with the PyTorch [44] framework and runs on an NVIDIA RTX 3080 GPU. Some comparison methods are based on publicly available code provided by PyOD [45]. We implement three layers ($l = 3$) for the feature fusion decoder, and the dimension of the model is $d_{model} = 512$.

4.2. Evaluation on Public Datasets

We evaluate the proposed method by applying it to three real-world public anomaly detection datasets. The first public dataset is a gesture dataset [46] collected in a real-world scenario. It records X and Y moving coordinates of one actor’s right hand into a time-series sequence. During the actor’s actions with the right hand, anomalous actions within a specific time period are recorded. In total, the number of data points in the gesture dataset is around 11,000, with a dimension of two. Around 70% of the samples are used to train the model, while the rest of the data are used for testing. The second and the third public datasets are provided in NAB (The Numenta Anomaly Benchmark) [47] with known anomaly causes collected in the real-world scenario. The second one contains temperature sensor data of an internal component of a large industrial machine (i.e., machine temperature dataset). The third one contains ambient temperature data in an office setting (i.e., ambient temperature dataset). Each dataset has more than 7000 univariate data samples collected in time series. Part of the dataset is used as the training set, while the rest is set for testing.
In the gesture dataset, we implement three state-of-the-art comparison methods, including LUNAR [21], DeepAnT [17], and DeepSVDD [15]. Comparison anomaly detection results on this dataset are shown in Table 1, where the bold-faced number in each column indicates the best result among all compared methods; this convention applies to all other tables in this paper. The results show that the proposed TCF-Trans obtains the best performance in terms of $F_1$ score, which indicates that this method achieves a good balance in its overall performance without biased detection. Although the recall of the proposed method is not the highest, the methods with higher recall values suffer from low precision, which means they are highly likely to generate false alarms. Therefore, the proposed method obtains competitive performance among the compared state-of-the-art anomaly detection methods. Visualisation examples of detection result slices achieved via the proposed method and LUNAR on the gesture dataset are presented in Figure 6. The X-axis represents recording time. Figure 6a represents the raw data from the test set. Figure 6b and Figure 6c are the corresponding anomaly scores of the proposed method and LUNAR, respectively. Potential anomalies occur continuously after around the 1600th recording time. Compared with LUNAR, the proposed method has more stable anomaly scores across the anomalous regions, whereas the anomaly scores obtained from LUNAR suffer from many missed alarms.
Table 1. Comparison anomaly detection results of the proposed method with other methods on the real-world gesture dataset.
Figure 6. Visualisation examples on the gesture dataset. (a) Raw data slice from the test set (b) Visualisation example of a detection result slice by the proposed method (c) Visualisation example of a detection result slice by LUNAR.
In the machine temperature dataset and the ambient temperature dataset, we implement four state-of-the-art comparison methods, including LSCP [13], LUNAR [21], SO-GAAL [19], and DeepSVDD [15]. Comparison anomaly detection results on the machine temperature dataset are shown in Table 2. Results show that the proposed method outperforms other methods in the $F_1$ score, with close precision and recall values, indicating that the proposed method can avoid biased detection results. Although some methods have higher recall values than the proposed method, the gap between their recall and precision is noticeable. This unbalanced performance prevents such methods from making accurate predictions among normal samples.
Table 2. Comparison anomaly detection results of the proposed method with other methods on the machine temperature dataset.
Table 3 summarises the comparison anomaly detection results on the ambient temperature dataset. Results show that TCF-Trans obtains satisfying results among the methods compared here. Meanwhile, we notice that the performance of some methods fluctuates noticeably across these three datasets, while the proposed method is more stable. This advantage indicates that the proposed method is less sensitive to changes in the dataset and may be applicable to many detection tasks.
Table 3. Comparison anomaly detection results of the proposed method with other methods on the ambient temperature dataset.

4.3. Evaluation of the Real-World Transportation Dataset

The proposed method is applied to one real-world collected transportation dataset for anomaly detection tasks to show its potential in other real-life applications. The dataset is collected in Chongqing, China, on a daily basis (i.e., real-world transportation dataset (days)), and it contains vehicle traffic data from several roads. This dataset with a daily collection rate can reflect long-term traffic anomalies, which can be valuable for local enforcement, helping to assess overall traffic planning and management. Moreover, since the relatively long collection interval requires a longer accumulation of data to enlarge the dataset, collection is more difficult, and this relatively small dataset already contains information for several months. Under such circumstances, the proposed method's learning ability with a small number of samples can also be evaluated. The statistical summary of this dataset is shown in Table 4. In total, the real-world transportation dataset (days) contains 270-day vehicle traffic data from different roads. Part of the dataset, which does not contain an anomaly, is used as the training set, while the rest is chosen as the testing set. We implement five state-of-the-art anomaly detection methods for comparison, including LODA [12], LSCP [13], LUNAR [21], SO-GAAL [19], and DeepSVDD [15]. Table 5 presents the comparison anomaly detection results on this dataset. Results show that TCF-Trans effectively detects anomalies in the real-world transportation dataset. It can be observed that the proposed method balances precision and recall. We attribute this ability to the proposed feature fusion strategy, which fuses features with different characteristics. This strategy can alleviate the drawbacks of being vulnerable to noise or missing anomaly details. As a result, the proposed method can obtain good overall detection performance. The visualisation example of a detection result slice is shown in Figure 7.
In this figure, the X-axis represents collected dates. Figure 7a contains the raw data slice from the test set, and Figure 7b shows the corresponding anomaly score on the test set. SPOT, which is mentioned in Section 3, can be used to determine the threshold without requiring time-consuming grid search trials. It can be observed that there are two types of anomalies in the raw data. For example, a continuous long-term vehicle traffic drop happens around the 100th date, which may reflect large-scale traffic control by local enforcement. The corresponding anomaly score achieved using the proposed method remains continuously high without a clear drop in this region, which shows that the method can effectively deal with this anomalous pattern. Some short-term sharp changes, such as points before the 140th date, may reflect temporary construction and belong to another type of anomaly. The anomaly scores in this region are also sufficiently stable, showing that the proposed method can effectively detect short-term anomalies.
Table 4. Statistical summary of the real-world transportation dataset (days).
Table 5. Comparison anomaly detection results of the proposed method with other methods on the real-world transportation dataset (days).
Figure 7. Visualisation examples on the real-world transportation dataset (days). (a) Raw data slice from the test set, where each graph line represents traffic from different raw data sources (b) visualisation example of a detection result slice achieved using the proposed method.

4.4. Ablation Study

In this section, we conduct the ablation study on the real-world transportation dataset (days) to analyse the effectiveness of each component of the proposed method. We implement four variants of the proposed method, including (i) the Informer baseline, (ii) the TCF-Trans w/o temporal context fusion, in which we replace the temporal context fusion module with one FC layer to generate the output directly, and (iii) the TCF-Trans w/o feature fusion, in which we do not fuse features from different layers. To minimise the impact of different threshold methods, we present results obtained via a grid search based on the $F_1$ score, which reflect the performance achievable in theory (marked with †) in this ablation study.
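A grid search over thresholds based on the $F_1$ score can be sketched as below. This is an illustrative pure-Python version: the candidate grid, the function names, and the tie-breaking rule (the first best candidate wins) are assumptions, and the search uses ground-truth labels, which is why such results indicate an upper bound rather than a deployable threshold.

```python
def f1_at_threshold(scores, labels, th):
    """F1 score when every score exceeding th is flagged as an anomaly."""
    preds = [s > th for s in scores]
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if not p and l)
    # 2TP / (2TP + FP + FN) is algebraically equal to 2PR / (P + R).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(scores, labels, candidates):
    """Return (threshold, F1) of the candidate with the highest F1 score."""
    best_th, best_f1 = None, -1.0
    for th in candidates:
        f1 = f1_at_threshold(scores, labels, th)
        if f1 > best_f1:
            best_th, best_f1 = th, f1
    return best_th, best_f1

# Toy example: the two high scores are the true anomalies.
scores = [0.1, 0.2, 0.9, 0.8, 0.15]
labels = [0, 0, 1, 1, 0]
th, f1 = best_threshold(scores, labels, candidates=[0.05, 0.5, 0.85])
```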
Based on results shown in Table 6, we make the following observations: (1) The proposed TCF-Trans utilises the advantages of each sub-module to achieve the optimal performance among these variants. (2) Since features from different layers have different characteristics, fusing features from different layers helps to improve detection performances. Fusing them can make the proposed method robust to noise and prevent the decoder from missing potential details related to anomalies. (3) Adaptively fusing the auxiliary predictions based on temporal context helps to satisfactorily generate results. On the contrary, directly transforming auxiliary predictions to the final output may be inappropriate.
Table 6. Ablation study results on the real-world transportation dataset (days).
Next, we train the proposed method with another optimiser, SGD, to evaluate its impact. Table 7 shows the results of using different optimisers. The results obtained using SGD are worse than those obtained using Adam. This lower performance may be caused by SGD becoming trapped in a local optimum.
Table 7. Comparison of the anomaly detection performance of the proposed method with different optimisers in relation to the real-world transportation dataset (days).
Moreover, we gradually decrease the amount of data used for training from 100% to 80% to show the proposed method's potential to achieve effective detection performance without requiring the accumulation of a large amount of data. The results for different ratios of training data are listed in Table 8. The $F_1$ scores do not vary much with ratios from 100% to 80%, which indicates that the proposed method has the potential to obtain satisfying detection performance with less training data. Moreover, the results further validate the proposed method's learning ability with few samples.
Table 8. Comparison of the anomaly detection performance of the proposed method with different ratios of training data on the real-world transportation dataset (days).

4.5. Parameter Sensitivity Experiments

In this section, the proposed method is implemented under different parameter settings to evaluate its sensitivity to parameter changes on the real-world transportation dataset (days). We also report results achieved via a grid search, in theory (with notations †), to reduce the impact of different threshold methods.
Since the decoder input $X_{din}^t$ combines the earlier piece of the sequence before the output and a placeholder, we aim to evaluate the effect of the reference sequence $X_{ref}^t$ with different lengths $L_{ref}$, as well as corresponding input sequence lengths. Therefore, we choose five reference and input length combinations to evaluate their effect. The results of the proposed method using different combinations of lengths are summarised in Table 9. Our $F_1$ scores with different lengths are close, which indicates that the proposed method is good at dealing with different input and reference sequences.
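The construction of the decoder input from a reference sequence and a placeholder can be sketched as follows. This assumes a zero-valued placeholder covering the output length, as in the Informer baseline; the text above only states that a placeholder is used. The function name `build_decoder_input` and the `pred_len` parameter are hypothetical.

```python
import numpy as np

def build_decoder_input(x_ref, pred_len):
    """Concatenate the reference sequence with a zero placeholder along time.

    x_ref: array of shape (L_ref, d), the piece of the sequence before the output.
    pred_len: number of time steps to be predicted (placeholder length).
    """
    x_ref = np.asarray(x_ref, dtype=float)
    placeholder = np.zeros((pred_len, x_ref.shape[1]))   # assumed zero padding
    return np.concatenate([x_ref, placeholder], axis=0)  # (L_ref + pred_len, d)

# Toy example: L_ref = 3, d = 2, two steps to predict.
x_din = build_decoder_input(np.ones((3, 2)), pred_len=2)
```

Varying `L_ref` in such a construction changes how much recent context the decoder sees, which is the quantity examined in Table 9.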
Table 9. Comparison of the anomaly detection performance of the proposed method with different lengths of input and reference sequence on real-world transportation dataset (days).
Meanwhile, we aim to determine the impact on a larger number of decoder layers. We increase the number of decoder layers to four (i.e., four-layer TCF-Trans) to validate its performance. We also compare the performances of the Informer baseline with four decoder layers (i.e., four-layer Informer baseline). As shown in Table 10, when comparing performances with the four-layer Informer baseline, the impacts of increasing decoder layers for the proposed method are not serious. This can be explained by our fusion strategy, which fully utilises features from shallow and deep layers to make the method more robust to the noise from a single layer while retaining important details for anomaly detection.
Table 10. Comparison of the anomaly detection performance of the proposed method with a four-layer decoder on the real-world transportation dataset (days).
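To make the intuition behind this robustness concrete, the following sketch fuses per-layer auxiliary predictions with softmax-normalised weights. It is only a hypothetical illustration of weighted fusion across layers and is not the exact fusion module of TCF-Trans; the function names, the softmax normalisation, and the fixed weight logits (which would be learnable parameters in a trained model) are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(aux_predictions, weight_logits):
    """Weighted sum of per-layer predictions, one weight per layer."""
    if len(aux_predictions) != len(weight_logits):
        raise ValueError("one weight logit is required per prediction")
    weights = softmax(weight_logits)
    length = len(aux_predictions[0])
    return [
        sum(w * pred[i] for w, pred in zip(weights, aux_predictions))
        for i in range(length)
    ]


# Toy outputs: the shallow layer reacts strongly at the last step,
# while the deep layer gives a smoother estimate.
shallow = [1.0, 2.0, 10.0]
deep = [1.0, 2.0, 2.0]
fused = fuse_predictions([shallow, deep], weight_logits=[0.0, 0.0])
```

With equal weights, the fused output retains part of the sharp response of the shallow layer while damping it, so neither layer alone determines the result.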
Loss functions other than the MSE loss may also be used for training. We therefore evaluate the proposed method with two other training losses: the MAE loss and the SmoothL1 loss. The results of using different types of training loss are summarised in Table 11. These results show that the $F_1$ scores do not vary much across the training losses, which indicates that different loss functions can be adopted in our method.
Table 11. Comparison of the anomaly detection performance of the proposed method with different types of training loss on the real-world transportation dataset (days).
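For reference, the three training losses compared in Table 11 can be written out as below. This is a plain-Python sketch using mean reduction; the SmoothL1 form with a transition point `beta` follows the common definition (quadratic below `beta`, linear above), and the default `beta=1.0` is an assumption since the section does not state the value used.

```python
def mse_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def smooth_l1_loss(pred, target, beta=1.0):
    """SmoothL1: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


# Toy prediction with one small error and one large error.
pred = [0.0, 0.5, 3.0]
target = [0.0, 0.0, 0.0]
mse = mse_loss(pred, target)
mae = mae_loss(pred, target)
smooth = smooth_l1_loss(pred, target)
```

The large error dominates the MSE far more than the MAE or SmoothL1 values, which is the main practical difference among the three choices.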
The experiments above show that the proposed method can handle anomaly detection tasks under different settings and is robust to these changes. Therefore, the proposed method, TCF-Trans, is an effective solution for anomaly detection in time series applications.

5. Conclusions

This paper has addressed anomaly detection tasks in time series based on a novel framework named temporal context fusion transformer (TCF-Trans). The model utilises the Transformer's excellent capacity for modelling long-range dependencies and fuses features extracted from shallow and deep decoder layers. This prevents the decoder from missing unusual anomaly details while maintaining robustness to noise in the data, and it improves detection performance. Additionally, the proposed temporal context fusion module adaptively fuses the auxiliary predictions generated by the decoder. It makes the output more robust to noise or distortions caused by a single auxiliary prediction and fully uses the temporal context information of the data with a learnable weight. We have performed extensive experiments using the proposed framework on the public and collected transportation datasets. The results have shown that the proposed framework is applicable to real-world time-series anomaly detection tasks such as those in transportation. Additionally, the ablation study and several parameter sensitivity experiments have shown that the proposed method can maintain high performance under various experimental settings.

Author Contributions

X.P. co-designed and implemented the method and prepared the manuscript draft, co-conducted the experiments, and analyzed the experiment results. H.L. helped in designing the algorithm and conducting the experiments. Y.L. helped in processing raw data and conducting the experiments. P.F. helped in collecting and processing raw data. Y.C. and Z.L. provided valuable advice to this work and edited the manuscript drafts. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to express their thanks to Chongqing Yuxin Road & Bridge Development Co., Ltd., for providing the real-world data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. 2009, 41, 1–58. [Google Scholar] [CrossRef]
  2. Cherdo, Y.; Miramond, B.; Pegatoquet, A.; Vallauri, A. Unsupervised Anomaly Detection for Cars CAN Sensors Time Series Using Small Recurrent and Convolutional Neural Networks. Sensors 2023, 23, 5013. [Google Scholar] [CrossRef] [PubMed]
  3. Xu, Z.; Yang, Y.; Gao, X.; Hu, M. DCFF-MTAD: A Multivariate Time-Series Anomaly Detection Model Based on Dual-Channel Feature Fusion. Sensors 2023, 23, 3910. [Google Scholar] [CrossRef] [PubMed]
  4. Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep learning for anomaly detection: A review. ACM Comput. Surv. 2021, 54, 1–38. [Google Scholar] [CrossRef]
  5. El Sayed, A.; Ruiz, M.; Harb, H.; Velasco, L. Deep Learning-Based Adaptive Compression and Anomaly Detection for Smart B5G Use Cases Operation. Sensors 2023, 23, 1043. [Google Scholar] [CrossRef] [PubMed]
  6. Kim, B.; Alawami, M.A.; Kim, E.; Oh, S.; Park, J.; Kim, H. A comparative study of time series anomaly detection models for industrial control systems. Sensors 2023, 23, 1310. [Google Scholar] [CrossRef]
  7. Lan, D.T.; Yoon, S. Trajectory Clustering-Based Anomaly Detection in Indoor Human Movement. Sensors 2023, 23, 3318. [Google Scholar] [CrossRef]
  8. Fisher, W.D.; Camp, T.K.; Krzhizhanovskaya, V.V. Anomaly detection in earth dam and levee passive seismic data using support vector machines and automatic feature selection. J. Comput. Sci. 2017, 20, 143–153. [Google Scholar] [CrossRef]
  9. Tian, Y.; Mirzabagheri, M.; Bamakan, S.M.H.; Wang, H.; Qu, Q. Ramp loss one-class support vector machine; A robust and effective approach to anomaly detection problems. Neurocomputing 2018, 310, 223–235. [Google Scholar] [CrossRef]
  10. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data TKDD 2012, 6, 1–39. [Google Scholar] [CrossRef]
  11. Mishra, S.; Chawla, M. A comparative study of local outlier factor algorithms for outliers detection in data streams. In Emerging Technologies in Data Mining and Information Security; Springer: Singapore, 2019; pp. 347–356. [Google Scholar]
  12. Pevnỳ, T. Loda: Lightweight on-line detector of anomalies. Mach. Learn. 2016, 102, 275–304. [Google Scholar] [CrossRef]
  13. Zhao, Y.; Nasrullah, Z.; Hryniewicki, M.K.; Li, Z. LSCP: Locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining, SIAM, Santa Barbara, CA, USA, 2–4 May 2019; pp. 585–593. [Google Scholar]
  14. Choi, K.; Yi, J.; Park, C.; Yoon, S. Deep learning for anomaly detection in time-series data: Review, analysis, and guidelines. IEEE Access 2021, 9, 120043–120065. [Google Scholar] [CrossRef]
  15. Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep one-class classification. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4393–4402. [Google Scholar]
  16. Trinh, H.D.; Giupponi, L.; Dini, P. Urban anomaly detection by processing mobile traffic traces with LSTM neural networks. In Proceedings of the 2019 16th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Boston, MA, USA, 10–13 June 2019; pp. 1–8. [Google Scholar]
  17. Munir, M.; Siddiqui, S.A.; Dengel, A.; Ahmed, S. DeepAnT: A deep learning approach for unsupervised anomaly detection in time series. IEEE Access 2018, 7, 1991–2005. [Google Scholar] [CrossRef]
  18. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  19. Liu, Y.; Li, Z.; Zhou, C.; Jiang, Y.; Sun, J.; Wang, M.; He, X. Generative adversarial active learning for unsupervised outlier detection. IEEE Trans. Knowl. Data Eng. 2019, 32, 1517–1528. [Google Scholar] [CrossRef]
  20. Deng, A.; Hooi, B. Graph Neural Network-Based Anomaly Detection in Multivariate Time Series. Proc. AAAI Conf. Artif. Intell. 2021, 35, 4027–4035. [Google Scholar] [CrossRef]
  21. Goodge, A.; Hooi, B.; Ng, S.K.; Ng, W.S. LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks. Proc. AAAI Conf. Artif. Intell. 2022, 36, 6737–6745. [Google Scholar] [CrossRef]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  23. Chen, L.; You, Z.; Zhang, N.; Xi, J.; Le, X. UTRAD: Anomaly detection and localization with U-Transformer. Neural Netw. 2022, 147, 53–62. [Google Scholar] [CrossRef]
  24. Wang, X.; Pi, D.; Zhang, X.; Liu, H.; Guo, C. Variational transformer-based anomaly detection approach for multivariate time series. Measurement 2022, 191, 110791. [Google Scholar] [CrossRef]
  25. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115. [Google Scholar] [CrossRef]
  26. Li, H.; Peng, X.; Zhuang, H.; Lin, Z. Multiple Temporal Context Embedding Networks for Unsupervised time Series Anomaly Detection. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 3438–3442. [Google Scholar]
  27. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  28. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  29. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  30. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  31. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm Sweden, 10–15 July 2018; pp. 4055–4064. [Google Scholar]
  32. Chen, H.; Wang, Z.; Tian, H.; Yuan, L.; Wang, X.; Leng, P. A Robust Visual Tracking Method Based on Reconstruction Patch Transformer Tracking. Sensors 2022, 22, 6558. [Google Scholar] [CrossRef] [PubMed]
  33. Xian, T.; Li, Z.; Zhang, C.; Ma, H. Dual Global Enhanced Transformer for image captioning. Neural Netw. 2022, 148, 129–141. [Google Scholar] [CrossRef] [PubMed]
  34. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, USA, 10–12 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  35. Liu, M.; Ren, S.; Ma, S.; Jiao, J.; Chen, Y.; Wang, Z.; Song, W. Gated Transformer Networks for Multivariate Time Series Classification. arXiv 2021, arXiv:2103.14438. [Google Scholar]
  36. Wang, C.; Xing, S.; Gao, R.; Yan, L.; Xiong, N.; Wang, R. Disentangled Dynamic Deviation Transformer Networks for Multivariate Time Series Anomaly Detection. Sensors 2023, 23, 1104. [Google Scholar] [CrossRef]
  37. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in Time Series: A Survey. arXiv 2023, arXiv:2202.07125. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  39. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  40. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  41. Wang, P.; Zheng, W.; Chen, T.; Wang, Z. Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice. arXiv 2022, arXiv:2203.05962. [Google Scholar]
  42. Xue, F.; Chen, J.; Sun, A.; Ren, X.; Zheng, Z.; He, X.; Chen, Y.; Jiang, X.; You, Y. A Study on Transformer Configuration and Training Objective. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  43. Siffer, A.; Fouque, P.A.; Termier, A.; Largouet, C. Anomaly detection in streams with extreme value theory. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1067–1075. [Google Scholar]
  44. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
  45. Zhao, Y.; Nasrullah, Z.; Li, Z. PyOD: A Python Toolbox for Scalable Outlier Detection. J. Mach. Learn. Res. 2019, 20, 1–7. [Google Scholar]
  46. Keogh, E.; Lin, J.; Fu, A. HOT SAX: Efficiently finding the most unusual time series subsequence. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005; p. 8. [Google Scholar] [CrossRef]
  47. Ahmad, S.; Lavin, A.; Purdy, S.; Agha, Z. Unsupervised real-time anomaly detection for streaming data. Neurocomputing 2017, 262, 134–147. [Google Scholar] [CrossRef]