Article

Reevaluating the Potential of a Vanilla Transformer Encoder for Unsupervised Time Series Anomaly Detection in Sensor Applications

1
Department of Computer Science, Chungbuk National University, Cheongju 28644, Republic of Korea
2
Department of Electronics Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(8), 2510; https://doi.org/10.3390/s25082510
Submission received: 14 January 2025 / Revised: 26 February 2025 / Accepted: 14 April 2025 / Published: 16 April 2025
(This article belongs to the Section Electronic Sensors)

Abstract

Sensors generate extensive time series data across various domains, and effective methods for detecting anomalies in such data are still in high demand. Unsupervised time series anomaly detection provides practical approaches to addressing the challenges of collecting anomalous data. For effective anomaly detection, a range of deep-learning-based models have been explored to handle temporal patterns inherent in time series data. In particular, Transformer encoders have gained significant attention due to their ability to efficiently capture temporal dependencies. Various studies have attempted architectural improvements to Transformer encoders to address the inherent complexity of time series data analysis. Unlike these previous studies, this work demonstrates that a vanilla Transformer encoder-based framework remains a competitive model for time series anomaly detection. Instead of modifying the architecture of the Transformer encoder, we identify key design choices and propose an asymmetric autoencoder-based framework that incorporates those design choices with a vanilla Transformer encoder and a linear layer decoder. The proposed framework has been evaluated on a range of unsupervised time series anomaly detection benchmarks, and the experimental results show that it achieves performance that is either superior or competitive compared to state-of-the-art models.

1. Introduction

Time series anomaly detection is used to identify unusual patterns or events in time series data [1]. This is an important problem across many domains like finance, manufacturing, and healthcare [2]. Due to its importance, this problem has been a focus of research for several decades. Initially, statistical methods [3,4,5,6] and machine-learning methods [7,8,9,10] were studied. Recently, with advancements in deep learning, there has been increasing interest in applying deep-learning models to time series anomaly detection [11]. In particular, unsupervised time series anomaly detection has attracted significant attention due to the challenges of collecting abnormal data in real-world scenarios, such as the rarity of anomalies and the diversity of anomaly patterns [12,13,14,15].
In unsupervised time series anomaly detection, numerous models have been proposed. Some studies have focused on capturing temporal dependencies using Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) [16,17,18], while others have used Convolutional Neural Networks (CNNs) and Temporal Convolutional Networks (TCNs) to extract temporal features locally and hierarchically [19,20,21]. Additionally, Graph Neural Networks have been employed to model complex relationships in multivariate time series data [22,23,24]. Generative models, such as Variational Autoencoders, Generative Adversarial Networks, Normalizing Flows, and Diffusion models, have also been explored. These models detect anomalies by analyzing reconstruction errors or deviations from learned data distributions [25,26]. Autoencoders, in particular, are widely adopted due to their simplicity and effectiveness in capturing temporal features [27].
Transformers outperformed previous models in natural language processing [28]. Since then, they have been successfully applied across diverse domains, proving their effectiveness. The success of Transformers has driven extensive research into their application to unsupervised time series anomaly detection [29,30,31,32]. In particular, Transformer encoders have gained attention for their ability to efficiently capture temporal dependencies. However, due to the inherent complexity of time series data, various studies have attempted to improve Transformer encoders by architectural modifications such as a prior-association branch [33] and a gated memory module [34]. These architectural improvements have led to notable performance gains on various benchmarks [29,33,34,35].
This study demonstrates that effective anomaly detection could be achieved without such structural modifications solely through careful design choices within a vanilla Transformer encoder-based asymmetric autoencoder framework. To substantiate this, we identify key design choices, which include dataset-level and sub-dataset-level normalizations (for preprocessing raw time series data), time series segment generation, segment-level normalization and denormalization, time series segment embedding, and a simple decoder for asymmetric autoencoder realization. Additionally, we propose an autoencoder-based anomaly detection framework that consists of a vanilla Transformer encoder and a linear layer decoder, incorporating key design choices. The proposed framework has been assessed on several widely used benchmarks for unsupervised time series anomaly detection. The experimental results show that the proposed framework achieves a performance superior to or comparable to recent Transformer encoder-based models with architectural modifications. These findings reveal the often-overlooked potential of vanilla Transformer encoders for unsupervised time series anomaly detection. The code for the proposed framework, along with its pretrained weights for the benchmark datasets, is publicly available at https://github.com/chatterboy/revisitVanillaTransEncUnsupTSAD (accessed on 13 April 2025).

2. Related Work

In unsupervised time series anomaly detection, deep-learning-based anomaly detection models have been actively studied [36]. In particular, Transformer encoder-based anomaly detection models have gained significant attention due to their ability to efficiently capture the temporal patterns inherent in time series data [37]. While Transformer encoder-based models have proven effective, they still encounter challenges stemming from the complex nature of time series data. These challenges include inter-variable dependencies, distribution shifts due to non-stationarity, sparsely occurring anomalies, and diverse anomaly patterns [38].
To address these challenges, previous studies have mainly attempted to modify the architecture of Transformer encoders. GTA [24] incorporated a graph neural network to effectively capture inter-variable dependencies in multivariate time series data. Anomaly Transformer [33] introduced an additional branch to capture associations in normal data while considering the inductive bias of anomalous data. DCdetector [39] utilized a two-tower architecture that simultaneously learns patch-wise and in-patch association representations to enhance the representation capability of normal data. MEMTO [34] leveraged a gated memory module to distinguish normal data and abnormal data based on the prototypical features of normal data. AnomalyLLM [35] employed a teacher-student framework in which the student network is trained to replicate the features of a teacher network, adapted from a pretrained large language model.
While previous studies have mainly concentrated on enhancing Transformer encoder architectures for effective anomaly detection, this study takes a different approach. Instead of modifying the architecture, we explore how effective unsupervised time series anomaly detection can be achieved purely through strategic design choices applied to a vanilla Transformer encoder-based asymmetric autoencoder framework.

3. Materials and Methods

This work aims to revisit the potential of a vanilla Transformer encoder in unsupervised time series anomaly detection. To evaluate the capabilities of a vanilla Transformer encoder, we propose an autoencoder-based framework equipped with a vanilla Transformer encoder and a linear layer decoder (see Figure 1). This framework uses a simple architecture while enabling a vanilla Transformer encoder to effectively capture temporal dependencies. It takes time series segments generated from (preprocessed) time series data as input and reconstructs them. The proposed framework consists of the following modules: time series preprocessing, time series segment generation, segment-level normalization and denormalization, time series segment embedding, and an asymmetric encoder-decoder.

3.1. Preprocessing Time Series Data

Preprocessing plays a vital role in the development of a deep-learning framework. We are concerned with normalization techniques for preprocessing in the context of unsupervised learning-based time series anomaly detection. They are used to convert the scale of time series data into a specified interval, which usually significantly impacts the performance of the anomaly detection process.
Representative normalization techniques include min-max normalization and z-score normalization. Let $X \in \mathbb{R}^{T \times V}$ represent time series data, where $x_{i,j}$ denotes the value at the i-th time step of the j-th variable. Its normalized time series data $\tilde{X} \in \mathbb{R}^{T \times V}$ can be expressed as $\tilde{X} = \phi(X)$, where $\phi(\cdot)$ is a normalization method defined as follows:
$$\phi_{\text{min-max}}(x_{i,j}) = \frac{x_{i,j} - \min_k\{x_{i,k}\}}{\max_k\{x_{i,k}\} - \min_k\{x_{i,k}\}} \quad \text{(min-max normalization)}$$
$$\phi_{\text{z-score}}(x_{i,j}) = \frac{x_{i,j} - \mu_{\cdot,j}}{\sigma_{\cdot,j}} \quad \text{(z-score normalization)}$$
where $\mu_{\cdot,j}$ and $\sigma_{\cdot,j}$ represent the mean and standard deviation of the j-th variable, respectively. As design choices for the normalization strategy, the framework examines the impact of three options: no normalization, min-max normalization, and z-score normalization. The best choice is determined based on performance evaluations for each dataset.
Some datasets are composed of multiple sub-datasets (see Table 1 and Table 2). For instance, the multivariate dataset MSL consists of 27 sub-datasets, and the univariate dataset ABP consists of 42 sub-datasets. For datasets consisting of multiple sub-datasets, we apply two levels of normalization sequentially: first at the sub-dataset level, followed by the dataset level, with three normalization options available at each level. Sub-dataset-level normalization is performed on each sub-dataset individually, while dataset-level normalization is applied across all sub-datasets as a whole. All nine combinations of normalization methods for the sub-dataset level and the dataset level are examined to select the best one. At inference time, the parameters computed during training (min, max, mean, and standard deviation) are reused.
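The two-level scheme described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the `eps` guard against division by zero, and the choice to compute min and max per variable over time steps (mirroring the per-variable mean and standard deviation of the z-score) are assumptions, as is the pairing of each test sub-dataset with the statistics of its own training sub-dataset.

```python
from itertools import product

import numpy as np

METHODS = ("none", "minmax", "zscore")

def fit_stats(x):
    """Per-variable statistics over the time axis of a (T, V) array."""
    return {"min": x.min(axis=0), "max": x.max(axis=0),
            "mean": x.mean(axis=0), "std": x.std(axis=0)}

def normalize(x, stats, method, eps=1e-8):
    """Apply 'none', 'minmax', or 'zscore' using previously fitted statistics."""
    if method == "none":
        return x
    if method == "minmax":
        return (x - stats["min"]) / (stats["max"] - stats["min"] + eps)
    if method == "zscore":
        return (x - stats["mean"]) / (stats["std"] + eps)
    raise ValueError(f"unknown method: {method}")

def fit_two_level(train_subsets, sub_method):
    """Fit sub-dataset-level statistics, then dataset-level statistics on the result."""
    sub_stats = [fit_stats(x) for x in train_subsets]
    stage1 = [normalize(x, s, sub_method) for x, s in zip(train_subsets, sub_stats)]
    data_stats = fit_stats(np.concatenate(stage1, axis=0))
    return sub_stats, data_stats

def apply_two_level(subsets, sub_stats, data_stats, sub_method, data_method):
    """Normalize with training statistics only, as is done at inference time."""
    stage1 = [normalize(x, s, sub_method) for x, s in zip(subsets, sub_stats)]
    return [normalize(x, data_stats, data_method) for x in stage1]

# The nine (sub-dataset-level, dataset-level) combinations to be searched.
combos = list(product(METHODS, METHODS))

# Toy data: two sub-datasets with three variables each (synthetic, for illustration).
rng = np.random.default_rng(0)
train = [rng.normal(5.0, 2.0, size=(100, 3)), rng.normal(-1.0, 0.5, size=(80, 3))]
sub_stats, data_stats = fit_two_level(train, "zscore")
out = apply_two_level(train, sub_stats, data_stats, "zscore", "minmax")
```

In practice, each of the nine combinations would be evaluated on the benchmark and the best-performing one retained per dataset.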

3.2. Generating Time Series Segments

Time series datasets often consist of long sequences dynamically collected in real-world environments. These characteristics can make it difficult to directly feed raw time series data into an anomaly detection model. To address this, fixed-length segments are generated from the original data, which makes analysis and processing by the model easier.
The windowing technique is a commonly used method for generating time series segments in unsupervised time series anomaly detection. It copies fixed-length segments from the original time series data at regular intervals using a sliding window. Varying the window length and stride produces different sets of time series segments from the same data. The process of generating a full set of time series segments from preprocessed data using the windowing technique can be outlined as follows:
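$$x = W(\tilde{X})$$
where $W(\cdot)$ is a window function, $W : \mathbb{R}^{T \times V} \rightarrow \mathbb{R}^{N \times L \times V}$, and $x$ and $\tilde{X}$ denote the time series segments and the preprocessed dataset from which they are generated, respectively.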
The optimal window length and stride size for capturing temporal dependencies effectively can vary across datasets. To address this, we adopt window lengths and stride sizes tailored to each dataset. Furthermore, stride size can be applied differently during training and evaluation. During evaluation, the stride size matches the window length, following the non-overlapping approach used in prior studies such as [17,33,34,39]. In contrast, during training, the stride size is selected based on whether an overlapping or non-overlapping approach is more effective for capturing temporal dependencies. This approach ensures efficient temporal pattern extraction during training.
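As an illustration of the windowing step, the sketch below slices a (T, V) array into (N, L, V) segments. The function name `make_segments` is hypothetical, and dropping a trailing remainder shorter than the window is an assumption of this sketch; the source does not state how such remainders are handled.

```python
import numpy as np

def make_segments(x, window, stride):
    """Slice a (T, V) array into (N, L, V) segments with a sliding window."""
    T = x.shape[0]
    if T < window:
        raise ValueError("series is shorter than the window length")
    # Start indices of every full window; a trailing partial window is dropped.
    starts = range(0, T - window + 1, stride)
    return np.stack([x[s:s + window] for s in starts])

series = np.arange(20, dtype=float).reshape(10, 2)            # T=10, V=2
train_segments = make_segments(series, window=4, stride=2)    # overlapping
eval_segments = make_segments(series, window=4, stride=4)     # non-overlapping
```

With a stride equal to the window length, every timestamp covered by a segment appears exactly once, which matches the non-overlapping evaluation setting; a smaller stride during training yields more, partially overlapping, segments.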

3.3. Segment-Level Normalization and Denormalization

Real-world time series data are often non-stationary, with statistical properties that change over time. This non-stationarity poses significant challenges in extracting meaningful temporal features. Since time series segments are derived from non-stationary data, they inherently retain this non-stationarity. To address these challenges, we use a segment-level normalization and denormalization technique designed to effectively capture temporal dependencies in non-stationary time series segments. As shown in Figure 1, the normalization is applied to time series segments prior to their embedding. The embedded time series segments pass through the encoder, and the corresponding time series segments are reconstructed by the decoder. The reconstructed segments are then denormalized back to the scale of the original time series segments.
We employ a reversible instance normalization method [40], originally proposed for time series forecasting to address distribution shifts between training and testing datasets, for normalizing and denormalizing time series segments. Given batched time series segments $x \in \mathbb{R}^{B \times L \times V}$ with batch size B, sequence length L, and number of variables V, the normalization method is applied to multivariate time series segments as follows:
$$\tilde{x}_{i,:,j} = \alpha_j \frac{x_{i,:,j} - \mu_{i,j}}{\sigma_{i,j}} + \beta_j$$
where $\tilde{x} \in \mathbb{R}^{B \times L \times V}$ represents the normalized time series segments, $\mu_{i,j}$ and $\sigma_{i,j}$ denote the mean and standard deviation of the data points of the j-th variable within a time series segment i, and $\alpha_j$ and $\beta_j$ are learnable parameters used for scaling and shifting of the j-th variable. The normalization has the effect of alleviating distributional shifts and scale variations caused by non-stationarity [40]. It is used with the expectation of making the time series stationary. The normalization method is applied in a segment-wise manner to the time series, where the mean and variance are computed and used for each segment.
The encoder and decoder are trained to reconstruct the normalized time series segments at the output of the decoder. For normalized time series segments reconstructed by the decoder, denormalization is carried out as follows:
$$\hat{x}_{i,:,j} = \sigma_{i,j} \frac{\bar{x}_{i,:,j} - \beta_j}{\alpha_j} + \mu_{i,j}$$
where $\hat{x} \in \mathbb{R}^{B \times L \times V}$ represents the final denormalized reconstructed time series segments produced by the model, while $\bar{x}$ represents the normalized reconstructed time series segments generated by the decoder.
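The normalization and denormalization pair can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, `alpha` and `beta` are fixed arrays here although they are learnable in the framework, and the small `eps` added to the variance is an assumption made for numerical stability.

```python
import numpy as np

def segment_normalize(x, alpha, beta, eps=1e-5):
    """Normalize each (segment, variable) pair of a (B, L, V) batch.

    Returns the normalized batch together with the per-segment statistics,
    which are needed later to undo the normalization.
    """
    mu = x.mean(axis=1, keepdims=True)                      # (B, 1, V)
    sigma = np.sqrt(x.var(axis=1, keepdims=True) + eps)     # (B, 1, V)
    return alpha * (x - mu) / sigma + beta, mu, sigma

def segment_denormalize(x_bar, mu, sigma, alpha, beta):
    """Map normalized reconstructions back to the original segment scale."""
    return sigma * (x_bar - beta) / alpha + mu

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, size=(4, 16, 3))    # B=4, L=16, V=3 (synthetic)
alpha = np.array([1.0, 2.0, 0.5])             # per-variable scale (must be non-zero)
beta = np.array([0.0, 0.1, -0.2])             # per-variable shift

x_tilde, mu, sigma = segment_normalize(x, alpha, beta)
x_back = segment_denormalize(x_tilde, mu, sigma, alpha, beta)
```

Because the statistics are computed per segment, two segments with very different levels or scales are mapped to a comparable range before reaching the encoder, and the stored `mu` and `sigma` restore the original scale afterwards.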

3.4. Time Series Segment Embedding

Time series segment embedding is needed to transform time series segments into a format suitable for the proposed autoencoder-based architecture, which consists of the encoder and decoder shown in Figure 1. The embedding module converts the data points of dimension V within time series segments into real-valued vectors of dimension F. To capture both individual data-point information and short-term temporal dependencies within time series segments, we use a 1D convolutional layer for embedding, i.e., the embedding method generates an embedding vector $e$ by considering individual data points along with their neighboring data points for a normalized time series segment $\tilde{x}$:
$$e = \mathrm{Conv1d}(\tilde{x})$$
where $e \in \mathbb{R}^{B \times L \times F}$ is a batch of B sequences, each consisting of L embedding vectors with a feature dimension of F. The 1D convolutional layer consists of F kernels, each with a length of 3, a stride of 1, and no bias term, to generate an embedding vector of dimension F. Zero padding is applied to ensure that the sequence length of the embedded time series segment matches that of the original time series segment.
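To make the shape bookkeeping of the embedding concrete, the following NumPy sketch applies F kernels of length 3 with stride 1, no bias, and zero padding along the time axis. It is a hand-written stand-in for a 1D convolutional layer, not the authors' code; the function name and the (F, V, K) weight layout are assumptions.

```python
import numpy as np

def conv1d_embed(x, weight):
    """Embed (B, L, V) segments into (B, L, F) with a bias-free 1D convolution.

    weight has shape (F, V, K) with odd kernel length K; stride is 1.
    """
    B, L, V = x.shape
    F, _, K = weight.shape
    pad = K // 2
    # Zero padding on both ends of the time axis keeps the output length at L.
    x_pad = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    # windows[b, l, k, v] holds the k-th neighbour of time step l.
    windows = np.stack([x_pad[:, k:k + L, :] for k in range(K)], axis=2)
    return np.einsum("blkv,fvk->blf", windows, weight)

rng = np.random.default_rng(0)
x_tilde = rng.normal(size=(2, 5, 3))              # B=2, L=5, V=3 (synthetic)
weight = rng.normal(size=(4, 3, 3))               # F=4 kernels of length 3
e = conv1d_embed(x_tilde, weight)

# Sanity check: a kernel that only passes the centre tap reproduces the input.
center = np.zeros((3, 3, 3))
for v in range(3):
    center[v, v, 1] = 1.0
identity_out = conv1d_embed(x_tilde, center)
```

Each output vector mixes a data point with its immediate predecessor and successor, which is how the embedding picks up short-term temporal context before the encoder is applied.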
Additionally, we choose not to use positional encoding (PE), which is commonly employed in Transformers to incorporate positional information for extracting position-based features. This decision is based on two key considerations. First, positional encoding works by injecting additional positional information into the data. However, this process may lead the model to interpret the sequential flow of the data based on the positional encoding rather than its original order. As a result, the model may fail to properly learn the natural sequential relationships in the data. In reconstruction-based methods, preserving the original sequence is crucial, and such distortions can negatively impact reconstruction performance. Second, a reconstruction model, such as an autoencoder, aims to reproduce the input data as accurately as possible. Adding positional encoding introduces new patterns to the data, which might cause the model to learn these patterns inappropriately or to overlook essential structural information from the original data by overemphasizing positional differences. This may impair the model’s reconstruction performance and thereby degrade its anomaly detection capabilities. To assess the impact of positional encoding, we have conducted an ablation study as part of our analysis (see Section 3.4).

3.5. Asymmetric Autoencoder

Autoencoders are widely used in unsupervised time series anomaly detection because of their simplicity and capability to reconstruct input data. An autoencoder consists of two components: an encoder and a decoder. The encoder extracts features from the input data while the decoder reconstructs the input data from these extracted features. According to Wang et al. [41], an excessively powerful decoder can impede the encoder’s ability to effectively capture meaningful features. This happens because the powerful decoder can be trained to nearly perfectly reconstruct the input data without strongly relying on useful latent representations produced by the encoder. Rather than encouraging the encoder to learn meaningful representations, the decoder may instead memorize trivial patterns, resulting in poor generalization.
To mitigate this issue, we propose an asymmetric autoencoder design where the encoder is more complex and parameter-rich while the decoder is kept simpler with fewer parameters. This asymmetry ensures that the decoder focuses solely on reconstruction, allowing the encoder to effectively capture rich and informative temporal features from the time series segments. Specifically, we use a vanilla Transformer encoder for the encoder and a linear layer for the decoder.
Given a batch of embedding vector sequences $e \in \mathbb{R}^{B \times L \times F}$, a vanilla Transformer encoder produces a batch of hidden vector sequences $h \in \mathbb{R}^{B \times L \times F}$ as follows:
$$h = \mathrm{VanillaTransformerEncoder}(e)$$
where $h$ is the output of the last encoder block in a vanilla Transformer encoder. Then, a decoder reconstructs normalized time series segments for given hidden vector sequences.
The decoder, implemented as a linear layer, transforms each hidden vector $h \in \mathbb{R}^{F}$ from the encoder into a data-point $\bar{x} \in \mathbb{R}^{V}$ within a normalized time series segment, as follows:
$$\bar{x} = W h$$
where $W \in \mathbb{R}^{V \times F}$ is a weight matrix for the decoder. The decoder processes each hidden vector generated by the encoder to reconstruct the normalized time series segment corresponding to the input time series segment provided to the encoder.
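The decoder side of the asymmetry is small enough to write out directly. The sketch below applies the same weight matrix to every hidden vector in a batch; the function name is hypothetical, the hidden vectors are random placeholders standing in for encoder output, and no bias term is used, following the equation above.

```python
import numpy as np

def linear_decode(h, W):
    """Apply x_bar = W h to every time step of a (B, L, F) batch; W is (V, F)."""
    return h @ W.T    # (B, L, V)

rng = np.random.default_rng(0)
B, L, F, V = 2, 5, 8, 3
h = rng.normal(size=(B, L, F))      # placeholder for the encoder output
W = rng.normal(size=(V, F))         # the decoder's only parameters: V * F weights
x_bar = linear_decode(h, W)
```

Since the decoder has only V × F weights and treats each time step independently, any temporal structure in the reconstruction has to come from the hidden vectors produced by the encoder.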

3.6. Training

The proposed framework is an autoencoder-based model designed to reconstruct given time series segments. To achieve this, we employ a reconstruction-based training approach that minimizes the discrepancy between the input and the reconstructed output. Specifically, the model is trained to reduce the difference between the input time series segments and the corresponding reconstructed segments generated by the framework. Let $x \in \mathbb{R}^{B \times L \times V}$ represent the input time series segments, and let $\hat{x} \in \mathbb{R}^{B \times L \times V}$ denote the reconstructed segments. The model is optimized using the following loss function:
$$\mathcal{L}(x, \hat{x}) = \frac{1}{B} \sum_{b=1}^{B} \lVert x_b - \hat{x}_b \rVert_2^2$$
where B is the batch size, and $x_b$ and $\hat{x}_b$ denote the b-th segments of $x$ and $\hat{x}$, respectively.
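A direct NumPy transcription of this loss is shown below as a sketch. It assumes that the squared norm of a segment is taken over all of its L × V entries; the function name is hypothetical.

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Batch mean of the squared L2 norm of each segment's reconstruction error."""
    diff = (x - x_hat).reshape(x.shape[0], -1)    # flatten each segment to L*V values
    return float((diff ** 2).sum(axis=1).mean())

x = np.zeros((2, 3, 2))         # B=2 segments with L=3, V=2
x_hat = np.ones((2, 3, 2))      # every entry is off by exactly 1
loss = reconstruction_loss(x, x_hat)   # each segment contributes 3 * 2 = 6
```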

3.7. Evaluation

We use an evaluation protocol commonly used in reconstruction-based methods. The protocol consists of three steps: first, computing anomaly scores; second, predicting whether each timestamp is normal or abnormal; and third, evaluating the predictions using standard evaluation metrics.
Various methods exist for computing anomaly scores in reconstruction-based approaches. We utilize the anomaly scoring method proposed in the literature [33,34,35,39]. For a given data-point $x_t \in \mathbb{R}^{V}$ from the batched time series segments $x$ and its reconstructed counterpart $\hat{x}_t \in \mathbb{R}^{V}$ produced by the trained model, the anomaly score $s_t$ at timestamp t is computed as follows:
$$s_t = \lVert x_t - \hat{x}_t \rVert_2$$
To predict the anomaly status $\hat{y}_t$ at each timestamp t within the time series data $\tilde{X}$, we compute it as follows:
$$\hat{y}_t = \begin{cases} 1, & s_t \geq \theta \\ 0, & \text{otherwise} \end{cases}$$
Here, $\hat{y}_t \in \{0, 1\}$ represents the anomaly status prediction, and $\theta$ is the threshold used to determine whether a data-point is anomalous or normal. The threshold is a critical parameter for accurate predictions. Among the various methods available for threshold selection, we adopt the approach used in the previous comparative studies [33,34,39].
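The scoring and thresholding steps can be illustrated with the short sketch below. It assumes that the score is the L2 norm of the per-timestamp reconstruction error and that a score equal to the threshold counts as anomalous. The threshold `theta` is simply passed in; how it is selected is left to the procedure of the cited studies and is not reproduced here.

```python
import numpy as np

def anomaly_scores(x, x_hat):
    """Per-timestamp L2 reconstruction error for (T, V) inputs."""
    return np.linalg.norm(x - x_hat, axis=1)

def predict(scores, theta):
    """Label a timestamp 1 (anomalous) when its score reaches the threshold."""
    return (scores >= theta).astype(int)

x = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])   # T=3, V=2 (toy values)
x_hat = np.zeros_like(x)                              # stand-in reconstruction
scores = anomaly_scores(x, x_hat)
labels = predict(scores, theta=2.0)
```

Only the second timestamp, whose reconstruction error has norm 5, exceeds the illustrative threshold of 2.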
To assess model performance, we employ several metrics. For experiments on multivariate benchmarks, we use Precision, Recall, and F1 scores. Given the ground truth anomaly labels $y \in \{0, 1\}^{T}$ and the predicted anomaly labels $\hat{y} \in \{0, 1\}^{T}$, these metrics are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Here, TP (True Positive) denotes data points correctly identified as anomalies, while TN (True Negative) refers to data points accurately classified as normal. FP (False Positive) represents instances mistakenly detected as anomalies despite being normal, whereas FN (False Negative) refers to data points incorrectly classified as normal when they are actually anomalies. Additionally, we apply Point Adjustment (PA) [42], a post-processing technique that considers an entire anomalous segment correctly identified if at least one data-point within the segment is detected as anomalous. The PA-based metrics are employed to ensure a fair comparison between the proposed model and recent models on multivariate benchmarks [33,34,35,39].
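The effect of Point Adjustment on these metrics can be seen in a small worked example. The sketch below is a plain-Python illustration with hypothetical function names and toy labels; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def point_adjust(y_true, y_pred):
    """Mark a whole ground-truth anomalous run as detected if any point in it is."""
    adjusted = list(y_pred)
    n = len(y_true)
    i = 0
    while i < n:
        if y_true[i] == 1:
            j = i
            while j < n and y_true[j] == 1:    # find the end of this anomalous run
                j += 1
            if any(y_pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def precision_recall_f1(y_true, y_pred):
    """Point-wise Precision, Recall, and F1 from binary label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]    # two anomalous segments
y_pred = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]    # one hit in the first segment, one false alarm

raw = precision_recall_f1(y_true, y_pred)
y_adj = point_adjust(y_true, y_pred)
adjusted = precision_recall_f1(y_true, y_adj)
```

Here a single detected point inside the first segment raises Recall from 0.2 to 0.6 and Precision from 0.5 to 0.75 after adjustment, while the second segment, which received no detection, still counts as missed.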
While the PA-based metrics capture real-world anomaly detection scenarios [42], they have recognized limitations in evaluating model performance [43,44]. To overcome these shortcomings, we also incorporate affiliation metrics [45]. Unlike Precision, Recall, and the F1 score, which assess performance at the data-point level, affiliation metrics evaluate performance at the segment (event) level. Given the ground truth anomaly segments $e = \{e_1, e_2, \ldots, e_n\}$ and the predicted anomaly segments $\hat{e} = \{\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_m\}$, affiliation Precision, affiliation Recall, and affiliation F1 score are defined as follows:
$$\text{affiliation Precision} = \frac{1}{|S|} \sum_{i \in S} P_{\text{precision}_i}(e_i)$$
$$\text{affiliation Recall} = \frac{1}{n} \sum_{i=1}^{n} P_{\text{recall}_i}(\{\hat{e}_{I_i}\})$$
$$\text{affiliation F1 score} = \frac{2 \times \text{affiliation Precision} \times \text{affiliation Recall}}{\text{affiliation Precision} + \text{affiliation Recall}}$$
Here, $S$ represents ground truth anomaly segments that have at least one corresponding predicted anomaly segment. $P_{\text{precision}_i}$ indicates whether the predicted anomaly segments assigned to the i-th ground truth anomaly segment correctly predict it, while $\{\hat{e}_{I_i}\}$ denotes the predicted anomaly segments associated with the i-th ground truth anomaly segment. Additionally, $P_{\text{recall}_i}$ represents whether the i-th ground truth anomaly segment correctly matches its corresponding predicted anomaly segments. The affiliation metrics are employed for comparison on univariate benchmarks.

4. Comparative Experiments and Results

4.1. Benchmark Datasets

The benchmark datasets utilized in this study comprise real-world time series data collected from various domains, including space environments, industrial systems, and healthcare. These datasets inherently contain anomalies, which are identified by domain experts rather than through predefined synthetic distortions. To better characterize the nature of anomalies within these datasets, we classify them into three types [36]:
  • Point anomaly: Individual data points that significantly deviate from expected values.
  • Contextual anomaly: Data points that appear normal in isolation but are deemed anomalous when viewed within a specific context.
  • Collective anomaly: A group of data points that, when considered together, exhibit abnormal behavior, even if individual points appear normal.
Each benchmark comprises a combination of these anomaly types, depending on its inherent characteristics. These datasets are employed to assess the proposed model, demonstrating its effectiveness in detecting anomalies across diverse real-world scenarios.

4.1.1. Multivariate Datasets

We use six multivariate time series anomaly detection benchmarks, as described in Table 1. The MSL [16] contains telemetry data with labeled anomalies from the Mars Science Laboratory rover. The SMAP [16] includes telemetry and anomaly labels from the Soil Moisture Active Passive satellite. The SMD [25] is a labeled multivariate time series dataset containing metrics like CPU and memory usage from 28 server machines over 5 weeks. The PSM [46] features anonymized server metrics from eBay spanning 13 weeks of training and 8 weeks of testing. The GECCO [38] focuses on drinking water quality monitoring, providing multivariate time series data with physical and chemical properties. The SWAN-SF [38] includes multivariate time series data on extreme space weather conditions for space weather analysis.
Table 1. Description of multivariate time series anomaly detection benchmark datasets. The symbol “#” denotes “number of”.
| Dataset | # Sub-Datasets | # Dimensions | # Training Data Points | # Test Data Points | Anomaly Ratio (%) |
|---|---|---|---|---|---|
| MSL | 27 | 55 | 58,317 | 73,729 | 10.53 |
| SMAP | 54 | 25 | 138,004 | 435,826 | 12.84 |
| SMD | 28 | 38 | 708,405 | 708,420 | 4.16 |
| PSM | 1 | 25 | 132,481 | 87,841 | 27.76 |
| GECCO | 1 | 9 | 69,260 | 69,261 | 1.05 |
| SWAN-SF | 1 | 38 | 60,000 | 60,000 | 32.60 |
The anomaly ratio is that of the test set.

4.1.2. Univariate Datasets

We employ nine univariate time series anomaly detection benchmark datasets, described in Table 2. These datasets are derived from the UCR Anomaly Archive [47], following the categorization by Goswami et al. [48]. The UCR Anomaly Archive consists of 250 sub-datasets, each originating from a specific domain. Goswami et al. [48] grouped these sub-datasets into nine distinct domains: ABP, Acceleration, Air Temperature, ECG, EPG, Gait, NASA, Power Demand, and RESP. To construct the datasets, we concatenated all sub-datasets within each domain.
Table 2. Description of univariate time series anomaly detection benchmark datasets.
| Dataset | # Sub-Datasets | # Training Data Points | # Test Data Points | Anomaly Ratio (%) |
|---|---|---|---|---|
| ABP | 42 | 1,036,746 | 1,841,461 | 0.37 |
| Acceleration | 7 | 38,400 | 62,337 | 1.71 |
| Air Temperature | 13 | 52,000 | 54,392 | 0.82 |
| ECG | 91 | 1,795,083 | 6,047,314 | 0.38 |
| EPG | 25 | 119,000 | 410,415 | 0.45 |
| Gait | 33 | 1,157,571 | 2,784,520 | 0.38 |
| NASA | 11 | 38,500 | 86,296 | 0.86 |
| Power Demand | 11 | 197,149 | 311,629 | 0.61 |
| RESP | 17 | 868,000 | 2,452,953 | 0.12 |
The anomaly ratio is that of the test set.

4.2. Baselines

To assess the effectiveness of the proposed framework, we compare it with recent advanced deep-learning methods for unsupervised time series anomaly detection. For multivariate datasets, the baseline methods include Anomaly Transformer [33], DCdetector [39], MEMTO [34], and AnomalyLLM [35]. For univariate datasets, the comparisons feature TS-TCC [49], THOC [17], NCAD [50], and AnomalyLLM [35].

4.3. Implementation Details

The proposed framework was implemented using Python 3.8.19 and PyTorch 2.1.0. Training and evaluation were conducted on a single NVIDIA A100 GPU, utilizing the Adam optimizer [51]. The hyperparameter configurations for the multivariate and univariate datasets are detailed in Table A1 and Table A2, respectively. Key hyperparameters were determined through a grid search, while other parameters were set to commonly used default values.

4.4. Main Results

Table 3 presents the F1 scores comparing the proposed vanilla Transformer encoder-based model with state-of-the-art time series anomaly detection methods, including Anomaly Transformer [33], DCdetector [39], MEMTO [34], and AnomalyLLM [35], across the six multivariate time series datasets.
The results for the proposed model were obtained following the aforementioned experimental procedures, while the results for the other models were taken from their respective publications. The findings reveal that the proposed model achieved competitive performance across most datasets, with superior results on some. Specifically, it recorded the highest F1 scores of 0.975 and 0.920 on SMAP and GECCO, respectively. For SMD, PSM, and SWAN-SF, it achieved the second-highest performance, following AnomalyLLM [35]. In particular, on GECCO, the proposed model outperformed both DCdetector [39] and AnomalyLLM [35] by a substantial margin.
These results highlight that, despite its relatively simple architecture, the proposed model delivers performance comparable to or better than more advanced models. Through thoughtful design choices, the vanilla Transformer encoder-based model effectively captures the temporal dependencies inherent in time series data. Further details on the experimental results are provided in Table A3.
Table 4 presents the Affiliation F1 scores comparing the proposed vanilla Transformer encoder-based model with state-of-the-art time series anomaly detection methods, including TS-TCC [49], THOC [17], NCAD [50], and AnomalyLLM [35], across the nine univariate time series datasets. The results of the proposed model were obtained using the outlined experimental procedures, while those for the other models were sourced from the experiments in AnomalyLLM [35].
The findings show that, with few exceptions, the proposed model demonstrated competitive or superior performance on most datasets. Notably, it achieved the highest scores of 0.966, 0.802, and 0.763 on Acceleration, ECG, and RESP, respectively. On ABP, Air Temperature, Gait, NASA, and Power Demand, as well as on the average across datasets, the model delivered the second-best performance. Although the proposed model showed a noticeable performance gap compared to AnomalyLLM [35] on ABP and EPG, it performed competitively or even better on the remaining datasets.
As with the multivariate dataset experiments, these results confirm that the proposed model, despite its relatively simple architecture, achieves performance comparable to or exceeding that of advanced methods. This shows that with thoughtful design choices, a vanilla Transformer encoder-based model can effectively capture temporal features. Additional details on the experimental results are provided in Table A4.
Additionally, we conducted further comparative experiments on the proposed model using various evaluation metrics on multivariate benchmarks. We utilized accuracy and PA-based F1-score, along with Affiliation Precision, Affiliation Recall, Range-AUC-ROC [52], Range-AUC-PR [52], VUS-ROC [52], and VUS-PR [52]. Unlike accuracy and F1 score, which measure performance at the data-point level, the other metrics (Affiliation Precision, Affiliation Recall, Range-AUC-ROC, Range-AUC-PR, VUS-ROC, and VUS-PR) assess performance at the event (anomalous segment) level. These event-level evaluation metrics are intended to offer a more comprehensive and robust assessment of model performance in time series anomaly detection.
Table 5 presents the evaluation results of the proposed model alongside compared models across the benchmarks. The results for the proposed model are obtained from our experiments, while those for the compared models are based on the reported performance of DCdetector [39]. Our findings indicate that the relative superiority of each model depends on the specific evaluation metric used.
First, we observed that a model that performs well on data-point-level metrics does not necessarily achieve the best results on event-level metrics. For example, in the MSL dataset, DCdetector [39] demonstrated the highest performance in data-point-level metrics but did not achieve the top scores in certain event-level metrics. Similarly, in the SMAP dataset, while the proposed model performed best overall in data-point-level metrics, it achieved the highest performance in only two out of six event-level metrics.
Second, in event-level evaluation, model rankings fluctuate considerably depending on the selected metric. For instance, in the MSL dataset, the leading model alternates between the proposed model and DCdetector [39], depending on the evaluation metric. Likewise, in the SMAP dataset, the ranking varies among the proposed model, DCdetector [39], and Anomaly Transformer [33]. These findings highlight the importance of using diverse evaluation metrics for a more comprehensive assessment of model performance.
Finally, in the PSM dataset, the proposed model consistently outperformed the compared models across both data-point-level and event-level metrics.
In conclusion, our experiments show that the proposed model delivers competitive performance compared to the latest models, Anomaly Transformer [33] and DCdetector [39]. This underscores that effective unsupervised anomaly detection can be accomplished solely through thoughtful design choices, without requiring architectural modifications to the Transformer encoder-based autoencoder model.

4.5. Ablation Study

We conducted an ablation study to evaluate the impact of the key design principles underlying the proposed framework. These principles include: (1) segment-level normalization and denormalization, (2) the use of positional encoding, and (3) preprocessing of time series data.

4.5.1. Segment-Level Normalization and Denormalization

To evaluate the effectiveness of segment-level normalization and denormalization, we assessed the model’s performance with and without RevIN. For this analysis, three multivariate datasets (MSL, SMAP, and SMD) and three univariate datasets (ECG, Gait, and RESP) were selected. The F1 score was used as the evaluation metric for the multivariate datasets, while the affiliation F1 score was applied for the univariate datasets.
Table 6 shows the F1 scores for the multivariate datasets, and Table 7 presents the affiliation F1 scores for the univariate datasets. In the experiments with the multivariate datasets, incorporating RevIN consistently led to improved performance across all datasets. Similarly, in the univariate dataset experiments, RevIN generally achieved superior performance. These findings demonstrate that employing RevIN can enhance the performance of both multivariate and univariate anomaly detection tasks.
RevIN was originally developed to address the distribution shift that occurs between training and testing datasets in time series forecasting. In this study, we used RevIN not only to mitigate distribution shifts between training and testing datasets but also to alleviate distribution shifts between time series segments in window-based anomaly detection. Such segment-level distribution shifts can prevent a model from effectively learning general temporal patterns during training. By addressing this issue, RevIN facilitates the efficient capture of general temporal patterns.
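The idea of segment-level normalization and denormalization can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the example handles a single univariate window in plain Python, and the learnable affine parameters that RevIN includes are omitted for brevity.

```python
import math

def normalize_segment(segment, eps=1e-5):
    """Normalize one window with its own mean and standard deviation.

    Returns the normalized values and the statistics needed to undo the step.
    (Hypothetical helper; RevIN's learnable affine parameters are omitted.)
    """
    mean = sum(segment) / len(segment)
    var = sum((x - mean) ** 2 for x in segment) / len(segment)
    std = math.sqrt(var + eps)  # eps avoids division by zero on flat windows
    return [(x - mean) / std for x in segment], (mean, std)

def denormalize_segment(values, stats):
    """Map values (e.g., a reconstruction) back to the original scale."""
    mean, std = stats
    return [v * std + mean for v in values]

# Two windows with the same shape but very different levels.
low = [1.0, 2.0, 3.0, 2.0]
high = [101.0, 102.0, 103.0, 102.0]

norm_low, stats_low = normalize_segment(low)
norm_high, stats_high = normalize_segment(high)

# After normalization both windows look the same to the model;
# denormalization restores the original level of each window.
restored_high = denormalize_segment(norm_high, stats_high)
```

In this toy case the level shift between the two windows disappears after normalization, which is the segment-level distribution shift the paragraph above refers to. For multivariate data, the same step would be applied per channel.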

4.5.2. Positional Encoding

The analysis revealed that incorporating positional encoding (PE) does not consistently improve the model’s performance in time series anomaly detection. Experiments on multivariate datasets, such as MSL, SMAP, and SMD, produced mixed results when absolute positional encoding (APE) or learnable PE was applied. Notably, the vanilla Transformer encoder achieved the highest F1 scores without any form of PE, as shown in Table 6. This indicates that preserving the temporal context in its original form may be more effective for reconstruction-based anomaly detection tasks.
Reconstruction tasks often rely on maintaining the integrity of the temporal context. Adding positional information can cause the model to differentiate features unnecessarily, potentially leading to overfitting on anomalous data. For univariate datasets, such as ECG, Gait, and RESP (see Table 7), the exclusion of PE also resulted in higher affiliation F1 scores. This aligns with the hypothesis that PE may introduce additional complexity, which is counterproductive for models focused on reconstructing temporal patterns.
Overall, these findings underscore that positional encoding is not essential for the vanilla Transformer encoder in this task. Instead, direct modeling of temporal dependencies can yield better results.

4.5.3. Preprocessing Time Series Data

Preprocessing strategies played a crucial role in determining the model’s performance. The experiments examined various combinations of normalization applied at the dataset level and the sub-dataset level, including min-max normalization and z-score normalization. The results in Table 8 and Table 9 provide the following key insights. For all the multivariate datasets, the highest F1 scores were achieved without applying any normalization at either the sub-dataset level or the dataset level. For the univariate datasets, the best affiliation F1 scores were obtained with different combinations of sub-dataset-level and dataset-level normalization, depending on the dataset.
Choosing preprocessing techniques, such as normalization, that align with the specific characteristics of a dataset is crucial for preserving the integrity of its temporal structure. Appropriate preprocessing strategies for multivariate and univariate datasets enhance the model’s ability to distinguish normal patterns from anomalies, thereby improving reconstruction accuracy. These findings advocate for a tailored approach to preprocessing in time series anomaly detection, emphasizing its significant impact on model performance.
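The two normalization levels can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: it assumes that sub-dataset-level normalization uses the statistics of each sub-dataset separately, that dataset-level normalization uses statistics pooled over all sub-dataset, and that the sub-dataset-level step runs first. The names `minmax`, `zscore`, and `preprocess` are hypothetical, and in practice the statistics would typically be estimated on training data only.

```python
import statistics

def minmax(values):
    """Scale values to [0, 1] using their own minimum and maximum."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def zscore(values):
    """Standardize values using their own mean and population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]

def preprocess(sub_datasets, sub_norm=None, dataset_norm=None):
    """Apply optional sub-dataset-level, then dataset-level normalization.

    Assumption: the order and pooling strategy are illustrative only.
    """
    if sub_norm is not None:
        sub_datasets = [sub_norm(s) for s in sub_datasets]
    if dataset_norm is not None:
        # Pool all sub-datasets, normalize once, then split back.
        flat = dataset_norm([v for s in sub_datasets for v in s])
        out, start = [], 0
        for s in sub_datasets:
            out.append(flat[start:start + len(s)])
            start += len(s)
        sub_datasets = out
    return sub_datasets

subs = [[0.0, 5.0, 10.0], [100.0, 150.0, 200.0]]
no_norm = preprocess(subs)                                   # "-" / "-"
sub_only = preprocess(subs, sub_norm=minmax)                 # "minmax" / "-"
both = preprocess(subs, sub_norm=minmax, dataset_norm=zscore)  # "minmax" / "z-score"
```

The three calls mirror the kinds of configuration entries listed in Table A1 and Table A2 (no normalization, sub-dataset-level min-max only, and min-max followed by z-score).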

4.6. Hyperparameter Sensitivity

We performed a hyperparameter sensitivity analysis for the key hyperparameters of our proposed framework. Figure 2 presents the results of this analysis.
For the training window step size, the results indicate that the framework demonstrates robust performance within a specific range. However, performance declines significantly as the step size decreases, especially in the MSL and SMD datasets. This drop may result from variations in the time series segments generated according to the training window step size.
Regarding the window length, the framework exhibits high sensitivity to this hyperparameter. For the MSL and SMAP datasets, performance tends to decrease as the window length increases. Conversely, in the SMD dataset, shorter window lengths lead to poorer performance. Additionally, in the SMAP dataset, performance declines when the window length deviates from 50, either shorter or longer. Since window length directly affects the generation of training and testing time series segments, it is a critical hyperparameter. Given that the optimal window length varies across datasets, selecting an appropriate value is crucial for achieving the best results.
For the number of heads in the Transformer-based encoder, this hyperparameter appears to have minimal impact on model performance. While employing multiple attention heads allows the model to capture diverse dependencies, the results suggest that increasing the number of heads does not enhance performance. This indicates that effective anomaly detection can be achieved with a smaller number of dependencies.
The model dimension size shows varying effects depending on the dataset. For MSL and SMAP, performance decreases as the model dimension size increases. In contrast, for SMD, performance deteriorates when the model dimension size is either smaller or larger than 128. In PSM, performance remains unchanged across all model dimension sizes. These results highlight the need to carefully tune the model dimension size for optimal performance.
The impact of the number of encoder blocks varies depending on the dataset. For SMAP and PSM, performance remains almost unchanged regardless of the number of encoder blocks. In contrast, MSL and SMD exhibit greater performance fluctuations, indicating that this parameter can affect model performance in specific datasets.
The learning rate results show trends similar to those for the number of encoder blocks. In SMAP and PSM, performance is largely unaffected by changes in the learning rate. However, in MSL and SMD, significant variations in performance are observed, indicating that the learning rate is also an important hyperparameter.
Lastly, the dropout ratio results suggest that the framework achieves robust performance across different dropout ratios. Even in MSL, where the largest performance variation is noted, the changes are relatively minor.

5. Discussion

This work highlights the potential of a vanilla Transformer encoder with carefully selected design choices for unsupervised time series anomaly detection. Despite its simplicity, the proposed framework delivers performance that is competitive with, and in some cases exceeds, state-of-the-art models across various benchmark datasets. Notably, our framework achieved overall performance comparable to that of the state-of-the-art model AnomalyLLM [35], which relies on abnormal data injection [50,53,54], a technique that synthesizes abnormal data from normal data during training. In contrast, our framework uses only normal data for training. These findings underscore the viability of a vanilla Transformer encoder-based model as an effective model for unsupervised time series anomaly detection.
In Transformer-based unsupervised time series anomaly detection, existing research has focused on enhancing architectural aspects of the vanilla Transformer encoder while simultaneously incorporating various design choices. Anomaly Transformer [33] introduced a prior-association branch based on an adjacent-concentration inductive bias to enhance the distinction between normal and anomalous data. This additional branch enables the Transformer encoder to learn temporal dependencies that are easier to capture in normal data but more challenging for anomalous data, thereby improving overall model performance. Beyond architectural improvements, Anomaly Transformer [33] also explored design choices, such as time series segmentation with adjusted window length and a segment embedding method along with a positional embedding method.
DCdetector [39] employed a two-tower architecture that leverages multi-head attention in Transformer encoders to extract both patch-wise and in-patch association representations. This architecture differentiates normal and anomalous data by ensuring consistency between association representations derived from two distinct views of normal data. In addition to architectural enhancements, DCdetector [39] adopted several design choices, including time series segmentation, segment-level normalization to address non-stationarity, and segment embedding with positional encoding.
MEMTO [34] incorporated a gated memory module to store and leverage prototypical normal features (representative features extracted from normal data) to facilitate the reconstruction of normal data while making the reconstruction of anomalous data more challenging. This module was integrated into a Transformer encoder as an architectural improvement. Furthermore, MEMTO [34] explored several design choices, such as time series segment generation with adjusted window length, time series segment embedding along with positional encoding, and the use of an asymmetric autoencoder to enhance feature learning.
AnomalyLLM [35] employed a teacher-student framework that leverages a pretrained large language model (LLM) with strong representational capabilities to perform representation learning, effectively distinguishing normal and abnormal data representations. It applies synthetic distortions to normal data to generate pseudo-anomalous data while utilizing a prototypical Transformer encoder to remove irrelevant information and extract informative features. In addition, AnomalyLLM [35] incorporated several design choices, such as segment-level normalization to mitigate non-stationarity, time series segment embedding along with positional encoding, and an asymmetric encoder-decoder structure.
While prior studies have explored architectural enhancements of Transformer encoders alongside various design choices, their primary contributions emphasize architectural modifications rather than the design choices themselves. In contrast, our study demonstrates that effective anomaly detection can be achieved solely through carefully structured design choices, without modifying the Transformer encoder’s architecture when integrated into an asymmetric encoder-decoder framework utilizing a linear decoder.
Additionally, the prior studies have not explicitly stated certain design choices, such as data preprocessing techniques that significantly impact model performance. In contrast, our study clearly specifies all considered design choices and has conducted an ablation study to evaluate their individual impact on anomaly detection performance. We believe this work offers valuable insights for the development of time series anomaly detection models.
While our proposed framework demonstrated promising results, some limitations remain. First, unsupervised time series anomaly detection is inherently challenging due to the absence of abnormal data in a training dataset [17,33,34,35,39,49,50]. With only normal data available for training, it becomes difficult to optimize hyperparameters, like the threshold θ in Equation (11), that are sensitive to abnormal data.
Second, our evaluation of multivariate datasets in the experiments was conducted using the point adjustment (PA) method [42], which is known to potentially overestimate model performance [43,44]. To ensure fair comparisons with other models, we applied the same PA method as used in prior studies. However, for univariate datasets, we employed affiliation metrics [45], which are independent of the PA method, to provide a more reliable assessment of our framework’s performance in comparison to other models.
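To make the overestimation concern concrete, the following is a minimal sketch of the commonly used point adjustment protocol, in which an entire ground-truth anomaly segment counts as detected if at least one of its points is flagged. The function names and the toy labels are hypothetical and are not taken from the paper or from any benchmark.

```python
def point_adjust(pred, label):
    """Mark a whole ground-truth anomaly segment as detected if any
    point inside it was predicted anomalous (hypothetical helper)."""
    adjusted = list(pred)
    n = len(label)
    i = 0
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:  # find the end of the segment
                j += 1
            if any(pred[i:j]):
                for k in range(i, j):
                    adjusted[k] = 1
            i = j
        else:
            i += 1
    return adjusted

def f1_score(pred, label):
    """Point-wise F1 score for binary predictions."""
    tp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, label) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: two anomaly segments, one hit once, one missed entirely.
label = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
pred = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

raw_f1 = f1_score(pred, label)                       # 0.25
pa_f1 = f1_score(point_adjust(pred, label), label)   # 8/11, about 0.727
```

In this toy case a single correctly flagged point raises the F1 score from 0.25 to about 0.727, which illustrates why PA-based scores can look optimistic.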
Third, we evaluated our proposed framework using a non-overlapping window evaluation protocol, where the window step size matches the window length. All the compared methods followed the same protocol for performance evaluation. In addition to the non-overlapping window protocol [17,33,34,35,39], some studies use overlapping window evaluation protocols. For example, a step size of 1 is employed in certain methods [23,24,30,55,56,57]. These differences in window step sizes during evaluation make direct comparisons across studies more challenging.
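The difference between the two protocols can be sketched with a small windowing helper. This is an illustrative sketch only: `sliding_windows` is a hypothetical name, and dropping trailing samples that do not fill a complete window is an assumption of this example rather than something stated in the paper.

```python
def sliding_windows(series, window_length, stride):
    """Cut a series into windows of fixed length, advancing by `stride`.

    Assumption: trailing samples that do not fill a full window are dropped.
    """
    last_start = len(series) - window_length
    return [series[start:start + window_length]
            for start in range(0, last_start + 1, stride)]

series = list(range(10))

# Non-overlapping protocol: stride equals the window length.
non_overlapping = sliding_windows(series, window_length=5, stride=5)

# Overlapping protocol with a step size of 1.
overlapping = sliding_windows(series, window_length=5, stride=1)
```

With a stride of 1, each time point is scored in several windows, whereas the non-overlapping protocol scores each point once. Table A1 and Table A2 list a test window stride equal to the window length, while the training window stride can differ (for example, 100 versus a window length of 25 for MSL).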
Lastly, we found that adjusting the window step size during training proved effective. However, keeping the step size fixed during the generation of time series segments restricts the model’s ability to capture the diverse temporal features present in time series data. This limitation may have contributed to the significant performance gap observed compared to AnomalyLLM [35] on certain univariate datasets.
Future research should address the limitations identified in this study. First, the reliance on training datasets containing only normal data restricts the model’s ability to generalize to diverse anomaly patterns. Incorporating semi-supervised or self-supervised learning techniques could improve the model’s capacity to handle unseen anomalies. Additionally, the evaluation methodology, which relies on the point adjustment method and non-overlapping window protocols, could be enhanced by exploring alternative metrics and approaches to ensure more comprehensive and equitable model comparisons. Investigating dynamic or adaptive windowing strategies may further enhance the detection of temporal patterns. Lastly, extending the framework to accommodate more diverse datasets and application scenarios, including resource-constrained environments, could improve its practicality and scalability.

6. Conclusions

This study revisited the potential of a vanilla Transformer encoder for unsupervised time series anomaly detection. We demonstrated that when paired with a linear layer decoder and thoughtfully selected design choices, a vanilla Transformer encoder can achieve performance comparable to, or even surpassing, that of state-of-the-art models across various univariate and multivariate benchmark datasets.
Our findings underscore the pivotal role of key design choices, such as segment-level normalization and denormalization, the omission of positional encoding, and precise hyperparameter tuning, in the model’s success. Despite its simplicity, the proposed asymmetric autoencoder framework, built on a vanilla Transformer encoder, effectively captures temporal dependencies for time series anomaly detection, emphasizing the often-underestimated potential of vanilla Transformer encoders.

Author Contributions

Conceptualization, C.S.H. and K.M.L.; methodology, C.S.H. and K.M.L.; software, C.S.H.; validation, C.S.H., H.K. and K.M.L.; formal analysis, C.S.H. and K.M.L.; investigation, K.M.L.; resources, H.K. and K.M.L.; data curation, C.S.H.; writing—original draft preparation, C.S.H.; writing—review and editing, K.M.L.; visualization, C.S.H.; supervision, C.S.H.; project administration, K.M.L.; funding acquisition, H.K. and K.M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A5A8026986).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The benchmark datasets used in this study are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
1D: One-dimensional
ABP: Arterial blood pressure benchmark
APE: Absolute Positional Encoding
CNN: Convolutional Neural Network
ECG: Electrocardiogram benchmark
EPG: Electrical Penetration Graph benchmark
GECCO: Genetic and Evolutionary Computation Conference benchmark
GRU: Gated Recurrent Unit
LSTM: Long Short-Term Memory
MSL: Mars Science Laboratory benchmark
PA: Point Adjustment
PE: Positional Encoding
PSM: Pooled Server Metrics benchmark
RESP: Respiration benchmark
RevIN: Reversible Instance Normalization
RNN: Recurrent Neural Network
SegND: Segment-level Normalization-Denormalization
SMAP: Soil Moisture Active Passive satellite benchmark
SMD: Server Machine Dataset benchmark
SWAN-SF: Space Weather Analytics for Solar Flares benchmark
TCN: Temporal Convolutional Network

Appendix A

Table A1. Hyperparameter configurations on multivariate datasets.
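| Hyperparameter | MSL | SMAP | SMD | PSM | GECCO | SWAN-SF |
|---|---|---|---|---|---|---|
| # epochs | 1 | 1 | 6 | 4 | 4 | 5 |
| Batch size | 32 | 32 | 32 | 32 | 32 | 32 |
| # layers | 1 | 3 | 5 | 2 | 2 | 2 |
| # heads | 1 | 1 | 8 | 8 | 8 | 8 |
| Model dimension size | 96 | 16 | 128 | 256 | 128 | 512 |
| Feed-forward dimension size | 96 | 32 | 256 | 512 | 256 | 1024 |
| Window length | 25 | 50 | 300 | 30 | 10 | 400 |
| Training window stride | 100 | 100 | 300 | 30 | 10 | 200 |
| Test window stride | 25 | 50 | 300 | 30 | 10 | 400 |
| Dropout ratio | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.0 |
| Learning rate | 3 × 10⁻⁴ | 3 × 10⁻⁴ | 3 × 10⁻⁴ | 3 × 10⁻⁴ | 3 × 10⁻⁴ | 3 × 10⁻⁴ |
| Sub-dataset-level norm | - | - | - | - | - | - |
| Dataset-level norm | - | - | - | - | z-score | - |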
Table A2. Hyperparameter configurations of univariate datasets.
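| Hyperparameter | ABP | Acceleration | Air Temperature | ECG | EPG | Gait | NASA | Power Demand | RESP |
|---|---|---|---|---|---|---|---|---|---|
| # epochs | 3 | 5 | 10 | 1 | 3 | 2 | 5 | 2 | 1 |
| Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 8 | 32 | 32 |
| # layers | 2 | 2 | 1 | 2 | 2 | 2 | 2 | 3 | 1 |
| # heads | 8 | 8 | 8 | 8 | 8 | 8 | 1 | 8 | 8 |
| Model dimension size | 512 | 256 | 256 | 256 | 256 | 256 | 32 | 128 | 256 |
| Feed-forward dimension size | 1024 | 512 | 512 | 512 | 512 | 512 | 32 | 256 | 512 |
| Window length | 100 | 200 | 50 | 100 | 30 | 350 | 100 | 100 | 100 |
| Training window stride | 100 | 400 | 50 | 100 | 30 | 350 | 100 | 100 | 100 |
| Test window stride | 100 | 200 | 50 | 100 | 30 | 350 | 100 | 100 | 100 |
| Dropout ratio | 0.0 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.0 | 0.1 |
| Learning rate | 5 × 10⁻⁵ | 3 × 10⁻⁵ | 3 × 10⁻³ | 3 × 10⁻⁴ | 5 × 10⁻⁵ | 5 × 10⁻⁵ | 5 × 10⁻⁵ | 3 × 10⁻³ | 3 × 10⁻³ |
| Sub-dataset-level norm | minmax | minmax | - | minmax | minmax | minmax | - | minmax | minmax |
| Dataset-level norm | z-score | z-score | - | z-score | - | - | - | z-score | z-score |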
Table A3. Detailed performance evaluation results on six multivariate time series datasets. Bold values indicate the best performance, and underlined values denote the second-best.
| Model | MSL P | MSL R | MSL F1 | SMAP P | SMAP R | SMAP F1 | SMD P | SMD R | SMD F1 | PSM P | PSM R | PSM F1 | GECCO P | GECCO R | GECCO F1 | SWAN-SF P | SWAN-SF R | SWAN-SF F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anomaly Transformer | 0.921 | 0.952 | 0.936 | 0.942 | 0.994 | 0.967 | 0.894 | 0.955 | 0.923 | 0.969 | 0.989 | 0.979 | - | - | - | - | - | - |
| DCdetector | 0.937 | 0.997 | 0.966 | 0.956 | 0.989 | 0.970 | 0.836 | 0.911 | 0.872 | 0.971 | 0.987 | 0.979 | 0.383 | 0.597 | 0.466 | 0.955 | 0.596 | 0.734 |
| MEMTO | 0.921 | 0.968 | 0.944 | 0.938 | 0.996 | 0.966 | 0.891 | 0.984 | 0.935 | 0.975 | 0.992 | 0.983 | - | - | - | - | - | - |
| AnomalyLLM | 0.937 | 0.979 | 0.958 | 0.944 | 0.969 | 0.956 | 0.934 | 0.998 | 0.965 | 0.996 | 0.998 | 0.997 | 0.511 | 0.793 | 0.620 | 0.873 | 0.745 | 0.804 |
| Ours | 0.917 | 0.971 | 0.943 | 0.964 | 0.986 | 0.975 | 0.894 | 0.989 | 0.939 | 0.992 | 0.994 | 0.993 | 0.919 | 0.921 | 0.920 | 0.861 | 0.749 | 0.801 |
Table A4. Affiliation precision, affiliation recall, and affiliation F1 score results for our model and other comparisons on nine univariate time series anomaly detection datasets. Bold values represent a top result and underline values denote a second result on a metric.
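| Model | ABP AP | ABP AR | ABP AF1 | Acceleration AP | Acceleration AR | Acceleration AF1 | Air Temperature AP | Air Temperature AR | Air Temperature AF1 | ECG AP | ECG AR | ECG AF1 | EPG AP | EPG AR | EPG AF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TS-TCC | 0.763 | 0.745 | 0.754 | 0.555 | 0.543 | 0.549 | 0.980 | 0.957 | 0.969 | 0.758 | 0.782 | 0.784 | 0.928 | 0.935 | 0.931 |
| THOC | 0.822 | 0.808 | 0.815 | 0.782 | 0.770 | 0.776 | 0.984 | 0.958 | 0.971 | 0.762 | 0.758 | 0.760 | 0.911 | 0.905 | 0.908 |
| NCAD | 0.802 | 0.786 | 0.794 | 0.855 | 0.842 | 0.849 | 0.762 | 0.747 | 0.758 | 0.737 | 0.732 | 0.735 | 0.795 | 0.783 | 0.789 |
| AnomalyLLM | 0.931 | 0.910 | 0.920 | 0.965 | 0.948 | 0.956 | 0.989 | 0.959 | 0.974 | 0.768 | 0.808 | 0.787 | 0.935 | 0.932 | 0.933 |
| Ours | 0.725 | 0.991 | 0.838 | 0.934 | 1.000 | 0.966 | 0.951 | 0.993 | 0.972 | 0.698 | 0.942 | 0.802 | 0.743 | 0.966 | 0.840 |

| Model | Gait AP | Gait AR | Gait AF1 | NASA AP | NASA AR | NASA AF1 | Power Demand AP | Power Demand AR | Power Demand AF1 | RESP AP | RESP AR | RESP AF1 | Average AP | Average AR | Average AF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TS-TCC | 0.798 | 0.790 | 0.794 | 0.512 | 0.508 | 0.511 | 0.767 | 0.759 | 0.763 | 0.561 | 0.560 | 0.560 | 0.736 | 0.731 | 0.735 |
| THOC | 0.788 | 0.780 | 0.784 | 0.902 | 0.891 | 0.896 | 0.777 | 0.772 | 0.775 | 0.382 | 0.395 | 0.389 | 0.790 | 0.782 | 0.786 |
| NCAD | 0.864 | 0.852 | 0.858 | 0.869 | 0.853 | 0.861 | 0.724 | 0.723 | 0.723 | 0.613 | 0.612 | 0.613 | 0.780 | 0.770 | 0.776 |
| AnomalyLLM | 0.891 | 0.852 | 0.871 | 0.969 | 0.953 | 0.961 | 0.888 | 0.884 | 0.886 | 0.736 | 0.736 | 0.736 | 0.897 | 0.887 | 0.892 |
| Ours | 0.767 | 0.998 | 0.867 | 0.892 | 0.962 | 0.926 | 0.766 | 0.994 | 0.865 | 0.629 | 0.969 | 0.763 | 0.789 | 0.979 | 0.871 |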

References

  1. Cook, A.A.; Mısırlı, G.; Fan, Z. Anomaly detection for IoT time-series data: A survey. IEEE Internet Things J. 2019, 7, 6481–6494.
  2. Zamanzadeh Darban, Z.; Webb, G.I.; Pan, S.; Aggarwal, C.; Salehi, M. Deep learning for time series anomaly detection: A survey. ACM Comput. Surv. 2024, 57, 1–42.
  3. Basu, S.; Meckesheimer, M. Automatic outlier detection for time series: An application to sensor data. Knowl. Inf. Syst. 2007, 11, 137–154.
  4. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice, 2nd ed.; OTexts: Melbourne, Australia, 2018; Available online: https://otexts.com/fpp2/ (accessed on 8 May 2018).
  5. Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000; pp. 93–104.
  6. Hahsler, M.; Bolaños, M. Clustering data streams based on shared density between micro-clusters. IEEE Trans. Knowl. Data Eng. 2016, 28, 1449–1461.
  7. He, J.; Cheng, Z.; Guo, B. Anomaly detection in satellite telemetry data using a sparse feature-based method. Sensors 2022, 22, 6358.
  8. Paffenroth, R.; Kay, K.; Servi, L. Robust PCA for anomaly detection in cyber networks. arXiv 2018, arXiv:1801.01571.
  9. Ramaswamy, S.; Rastogi, R.; Shim, K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000; pp. 427–438.
  10. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  11. Schmidl, S.; Wenig, P.; Papenbrock, T. Anomaly detection in time series: A comprehensive evaluation. Proc. VLDB Endow. 2022, 15, 1779–1797.
  12. Schmidt, M.; Simic, M. Normalizing flows for novelty detection in industrial time series data. arXiv 2019, arXiv:1906.06904.
  13. Malhotra, P.; Ramakrishnan, A.; Anand, G.; Vig, L.; Agarwal, P.; Shroff, G. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv 2016, arXiv:1607.00148.
  14. Munir, M.; Siddiqui, S.A.; Dengel, A.; Ahmed, S. DeepAnT: A deep learning approach for unsupervised anomaly detection in time series. IEEE Access 2018, 7, 1991–2005.
  15. Bashar, M.A.; Nayak, R. TAnoGAN: Time series anomaly detection with generative adversarial networks. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020; pp. 1778–1785.
  16. Hundman, K.; Constantinou, V.; Laporte, C.; Colwell, I.; Soderstrom, T. Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 387–395.
  17. Shen, L.; Li, Z.; Kwok, J. TimeSeries anomaly detection using temporal hierarchical one-class network. Adv. Neural Inf. Process. Syst. 2020, 33, 13016–13026.
  18. Shen, L.; Yu, Z.; Ma, Q.; Kwok, J.T. Time series anomaly detection with multiresolution ensemble decoding. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 9567–9575.
  19. Ngu, H.C.V.; Lee, K.M. CL-TAD: A contrastive-learning-based method for time series anomaly detection. Appl. Sci. 2023, 13, 11938.
  20. Cherdo, Y.; Miramond, B.; Pegatoquet, A.; Vallauri, A. Unsupervised anomaly detection for cars CAN sensors time series using small recurrent and convolutional neural networks. Sensors 2023, 23, 5013.
  21. Yue, Z.; Wang, Y.; Duan, J.; Yang, T.; Huang, C.; Tong, Y.; Xu, B. TS2Vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; Volume 36, pp. 8980–8987.
  22. Zhao, M.; Peng, H.; Li, L.; Ren, Y. Graph attention network and informer for multivariate time series anomaly detection. Sensors 2024, 24, 1522.
  23. Deng, A.; Hooi, B. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 4027–4035.
  24. Chen, Z.; Chen, D.; Zhang, X.; Yuan, Z.; Cheng, X. Learning graph structures with transformer for multivariate time-series anomaly detection in IoT. IEEE Internet Things J. 2021, 9, 9179–9189.
  25. Su, Y.; Zhao, Y.; Niu, C.; Liu, R.; Sun, W.; Pei, D. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2828–2837.
  26. Geiger, A.; Liu, D.; Alnegheimish, S.; Cuesta-Infante, A.; Veeramachaneni, K. TadGAN: Time series anomaly detection using generative adversarial networks. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 33–43.
  27. Talukder, S.; Yue, Y.; Gkioxari, G. TOTEM: TOkenized time series embeddings for general time series analysis. arXiv 2024, arXiv:2402.16412.
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  29. Wang, C.; Xing, S.; Gao, R.; Yan, L.; Xiong, N.; Wang, R. Disentangled dynamic deviation transformer networks for multivariate time series anomaly detection. Sensors 2023, 23, 1104.
  30. Lai, C.Y.A.; Sun, F.K.; Gao, Z.; Lang, J.H.; Boning, D. Nominality score conditioned time series anomaly detection by point/sequential reconstruction. Adv. Neural Inf. Process. Syst. 2024, 36, 76637–76655.
  31. Feng, Y.; Zhang, W.; Fu, Y.; Jiang, W.; Zhu, J.; Ren, W. SensitiveHue: Multivariate time series anomaly detection by enhancing the sensitivity to normal patterns. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 782–793.
  32. Yue, W.; Ying, X.; Guo, R.; Chen, D.; Shi, J.; Xing, B.; Chen, T. Sub-adjacent transformer: Improving time series anomaly detection with reconstruction error from sub-adjacent neighborhoods. arXiv 2024, arXiv:2404.18948.
  33. Xu, J. Anomaly transformer: Time series anomaly detection with association discrepancy. arXiv 2021, arXiv:2110.02642.
  34. Song, J.; Kim, K.; Oh, J.; Cho, S. Memto: Memory-guided transformer for multivariate time series anomaly detection. Adv. Neural Inf. Process. Syst. 2023, 36, 57947–57963.
  35. Liu, C.; He, S.; Zhou, Q.; Li, S.; Meng, W. Large language model guided knowledge distillation for time series anomaly detection. arXiv 2024, arXiv:2401.15123.
  36. Choi, K.; Yi, J.; Park, C.; Yoon, S. Deep learning for anomaly detection in time-series data: Review, analysis, and guidelines. IEEE Access 2021, 9, 120043–120065.
  37. Ma, M.; Han, L.; Zhou, C. Research and application of transformer based anomaly detection model: A literature review. arXiv 2024, arXiv:2402.08975.
  38. Lai, K.H.; Zha, D.; Xu, J.; Zhao, Y.; Wang, G.; Hu, X. Revisiting time series outlier detection: Definitions and benchmarks. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Virtual, 6–14 December 2021.
  39. Yang, Y.; Zhang, C.; Zhou, T.; Wen, Q.; Sun, L. DCDetector: Dual Attention contrastive representation learning for time series anomaly detection. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 3033–3045.
  40. Kim, T.; Kim, J.; Tae, Y.; Park, C.; Choi, J.H.; Choo, J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021.
  41. Wang, X.; Pi, D.; Zhang, X.; Liu, H.; Guo, C. Variational transformer-based anomaly detection approach for multivariate time series. Measurement 2022, 191, 110791.
  42. Ren, H.; Xu, B.; Wang, Y.; Yi, C.; Huang, C.; Kou, X.; Zhang, Q. Time-series anomaly detection service at microsoft. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 3009–3017.
  43. Garg, A.; Zhang, W.; Samaran, J.; Savitha, R.; Foo, C.S. An evaluation of anomaly detection and diagnosis in multivariate time series. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 2508–2517.
  44. Kim, S.; Choi, K.; Choi, H.S.; Lee, B.; Yoon, S. Towards a rigorous evaluation of time-series anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; Volume 36, pp. 7194–7201.
  45. Huet, A.; Navarro, J.M.; Rossi, D. Local evaluation of time series anomaly detection algorithms. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 635–645.
  46. Abdulaal, A.; Liu, Z.; Lancewicki, T. Practical approach to asynchronous multivariate time series anomaly detection and localization. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event, Singapore, 14–18 August 2021; pp. 2485–2494.
  47. Wu, R.; Keogh, E.J. Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. IEEE Trans. Knowl. Data Eng. 2021, 35, 2421–2429.
  48. Goswami, M.; Challu, C.; Callot, L.; Minorics, L.; Kan, A. Unsupervised model selection for time-series anomaly detection. arXiv 2022, arXiv:2210.01078. [Google Scholar]
  49. Eldele, E.; Ragab, M.; Chen, Z.; Wu, M.; Kwoh, C.K.; Li, X.; Guan, C. Time-series representation learning via temporal and contextual contrasting. arXiv 2021, arXiv:2106.14112. [Google Scholar]
  50. Carmona, C.U.; Aubet, F.X.; Flunkert, V.; Gasthaus, J. Neural contextual anomaly detection for time series. arXiv 2021, arXiv:2107.07702. [Google Scholar]
  51. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  52. Paparrizos, J.; Boniol, P.; Palpanas, T.; Tsay, R.S.; Elmore, A.; Franklin, M.J. Volume under the surface: A new accuracy evaluation measure for time-series anomaly detection. Proc. VLDB Endow. 2022, 15, 2774–2787. [Google Scholar] [CrossRef]
  53. Hendrycks, D.; Mazeika, M.; Dietterich, T. Deep anomaly detection with outlier exposure. arXiv 2018, arXiv:1812.04606. [Google Scholar]
  54. Zhang, C.; Zhou, T.; Wen, Q.; Sun, L. TFAD: A decomposition time series anomaly detection architecture with time-frequency analysis. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 2497–2507. [Google Scholar]
  55. Audibert, J.; Michiardi, P.; Guyard, F.; Marti, S.; Zuluaga, M.A. Usad: Unsupervised anomaly detection on multivariate time series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 23–27 August 2020; pp. 3395–3404. [Google Scholar]
  56. Tuli, S.; Casale, G.; Jennings, N.R. Tranad: Deep transformer networks for anomaly detection in multivariate time series data. arXiv 2022, arXiv:2201.07284. [Google Scholar] [CrossRef]
  57. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. Timesnet: Temporal 2d-variation modeling for general time series analysis. arXiv 2022, arXiv:2210.02186. [Google Scholar]
Figure 1. Overview of the proposed framework for unsupervised time series anomaly detection. This framework is designed to reconstruct input time series segments and consists of several key modules: preprocessing time series data, generating time series segments, segment-level normalization and denormalization, time series segment embedding, and an asymmetric encoder-decoder.
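The segment-level normalization and denormalization step named in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-channel mean and standard deviation computed from each segment alone, in the spirit of reversible instance normalization [40]. The function names (seg_normalize, seg_denormalize, reconstruct) and the identity function standing in for the Transformer encoder and linear decoder are hypothetical.

```python
import numpy as np

def seg_normalize(segment, eps=1e-5):
    """Normalize one segment of shape (length, channels) with its own statistics.

    Assumption: per-channel mean/std over the time axis of this segment only.
    """
    mean = segment.mean(axis=0, keepdims=True)
    std = segment.std(axis=0, keepdims=True)
    return (segment - mean) / (std + eps), (mean, std)

def seg_denormalize(normalized, stats, eps=1e-5):
    """Map a normalized reconstruction back to the original scale of the segment."""
    mean, std = stats
    return normalized * (std + eps) + mean

def reconstruct(segment, model):
    """Normalize, reconstruct in the normalized space, then denormalize."""
    z, stats = seg_normalize(segment)
    z_hat = model(z)  # placeholder for encoder + linear decoder
    return seg_denormalize(z_hat, stats)

rng = np.random.default_rng(0)
segment = rng.normal(loc=50.0, scale=5.0, size=(100, 3))

# Hypothetical stand-in model: the identity function.
recon = reconstruct(segment, model=lambda z: z)

# Per-time-step reconstruction error, usable as an anomaly score.
score = ((segment - recon) ** 2).mean(axis=1)
```

With the identity stand-in, the reconstruction equals the input and the score is near zero. With a trained model, time steps that the model fails to reconstruct receive larger scores.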
Figure 2. Hyperparameter sensitivity analysis results of seven key hyperparameters for our proposed framework on four multivariate datasets. The hyperparameters are training window step size, window length, the number of attention heads, model dimension size, the number of encoding blocks, learning rate, and dropout.
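Two of the hyperparameters in Figure 2, window length and training window step size, control how segments are cut from a series. The sketch below shows one plausible way to do this with NumPy. The function name make_segments is hypothetical, and the use of non-overlapping windows at test time is an assumption made for illustration only.

```python
import numpy as np

def make_segments(series, window_length, step):
    """Cut a series of shape (time, channels) into windows of a fixed length.

    A step smaller than window_length yields overlapping windows.
    Trailing samples that do not fill a whole window are dropped.
    """
    if len(series) < window_length:
        raise ValueError("series is shorter than one window")
    starts = range(0, len(series) - window_length + 1, step)
    return np.stack([series[s:s + window_length] for s in starts])

# Toy series: 10 time steps, 2 channels.
series = np.arange(20, dtype=float).reshape(10, 2)

# Overlapping windows (step < window length), e.g. for training.
train_segments = make_segments(series, window_length=4, step=2)

# Non-overlapping windows (step == window length); an assumed test-time setting.
test_segments = make_segments(series, window_length=4, step=4)

print(train_segments.shape, test_segments.shape)
```

A smaller step produces more training segments from the same series at the cost of more redundancy between them.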
Table 3. F1 score results on the six multivariate datasets. Bold values indicate the best performance, and underlined values denote the second-best.
| Model | MSL | SMAP | SMD | PSM | GECCO | SWAN-SF |
| --- | --- | --- | --- | --- | --- | --- |
| Anomaly Transformer [33] | 0.936 | 0.967 | 0.923 | 0.979 | - | - |
| DCdetector [39] | 0.966 | 0.970 | 0.872 | 0.979 | 0.466 | 0.734 |
| MEMTO [34] | 0.944 | 0.966 | 0.935 | 0.983 | - | - |
| AnomalyLLM [35] | 0.956 | 0.965 | 0.958 | 0.997 | 0.620 | 0.804 |
| Ours | 0.943 | 0.975 | 0.939 | 0.993 | 0.920 | 0.801 |
Table 4. Affiliation F1 score results on nine univariate datasets. Bold values indicate the best performance, and underlined values denote the second-best.
| Model | ABP | Acceleration | Air Temperature | ECG | EPG |
| --- | --- | --- | --- | --- | --- |
| TS-TCC [49] | 0.754 | 0.549 | 0.969 | 0.784 | 0.931 |
| THOC [17] | 0.815 | 0.776 | 0.971 | 0.760 | 0.908 |
| NCAD [50] | 0.794 | 0.849 | 0.758 | 0.735 | 0.789 |
| AnomalyLLM [35] | 0.920 | 0.956 | 0.974 | 0.787 | 0.933 |
| Ours | 0.838 | 0.966 | 0.972 | 0.802 | 0.840 |

| Model | Gait | NASA | Power Demand | RESP | Average |
| --- | --- | --- | --- | --- | --- |
| TS-TCC [49] | 0.794 | 0.511 | 0.763 | 0.560 | 0.735 |
| THOC [17] | 0.784 | 0.896 | 0.775 | 0.389 | 0.786 |
| NCAD [50] | 0.858 | 0.861 | 0.723 | 0.613 | 0.776 |
| AnomalyLLM [35] | 0.871 | 0.961 | 0.886 | 0.736 | 0.892 |
| Ours | 0.867 | 0.926 | 0.865 | 0.763 | 0.871 |
Table 5. Performance results of anomaly detection models across eight evaluation metrics. Aff. Precision and Aff. Recall refer to affiliation Precision and affiliation Recall, respectively, while R-AUC-ROC and R-AUC-PR denote Range-AUC-ROC and Range-AUC-PR. Bold values indicate the best performance for each evaluation metric within each benchmark.
| Dataset | Model | Accuracy | F1 Score | Aff. Precision | Aff. Recall | R-AUC-ROC | R-AUC-PR | VUS-ROC | VUS-PR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MSL | Anomaly Transformer [33] | 0.987 | 0.939 | 0.518 | 0.960 | 0.900 | 0.879 | 0.882 | 0.863 |
| MSL | DCdetector [39] | 0.991 | 0.966 | 0.518 | 0.974 | 0.932 | 0.916 | 0.932 | 0.917 |
| MSL | Ours | 0.988 | 0.943 | 0.617 | 0.955 | 0.957 | 0.880 | 0.957 | 0.880 |
| SMAP | Anomaly Transformer [33] | 0.991 | 0.964 | 0.514 | 0.987 | 0.963 | 0.941 | 0.955 | 0.934 |
| SMAP | DCdetector [39] | 0.992 | 0.970 | 0.515 | 0.986 | 0.960 | 0.942 | 0.952 | 0.935 |
| SMAP | Ours | 0.993 | 0.945 | 0.591 | 0.885 | 0.956 | 0.897 | 0.956 | 0.895 |
| PSM | Anomaly Transformer [33] | 0.987 | 0.974 | 0.554 | 0.803 | 0.918 | 0.930 | 0.887 | 0.907 |
| PSM | DCdetector [39] | 0.990 | 0.979 | 0.547 | 0.829 | 0.916 | 0.929 | 0.884 | 0.906 |
| PSM | Ours | 0.996 | 0.993 | 0.774 | 0.927 | 0.972 | 0.958 | 0.959 | 0.950 |
Table 6. Ablation analysis results of segment-level normalization-denormalization and positional encoding on three multivariate datasets. SegND represents segment-level normalization-denormalization, APE represents absolute positional encoding, and numbers in the cells represent F1 scores. ✗ signifies that the corresponding one is not used, while ✓ indicates that the corresponding one is used. Bold values indicate the best performance.
| SegND | Positional Encoding | MSL | SMAP | SMD |
| --- | --- | --- | --- | --- |
| ✗ | ✗ | 0.909 | 0.740 | 0.830 |
| ✗ | APE (sinusoid) | 0.846 | 0.734 | 0.794 |
| ✗ | APE (learnable) | 0.902 | 0.710 | 0.845 |
| ✓ | ✗ | 0.943 | 0.975 | 0.939 |
| ✓ | APE (sinusoid) | 0.914 | 0.966 | 0.904 |
| ✓ | APE (learnable) | 0.924 | 0.957 | 0.923 |
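For reference, the sinusoidal absolute positional encoding compared in Table 6 is usually computed as in the sketch below, which follows the standard formulation from the original Transformer work. The function name and the toy embedding are hypothetical; how the authors add the encoding to their segment embedding is not specified in this section, so simple addition is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(length, d_model):
    """Return a (length, d_model) matrix of fixed sinusoidal position codes.

    Even columns hold sin(pos / 10000^(2i/d_model)), odd columns the matching cos.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    positions = np.arange(length)[:, None]                      # (length, 1)
    inv_freq = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(positions * inv_freq)
    pe[:, 1::2] = np.cos(positions * inv_freq)
    return pe

pe = sinusoidal_positional_encoding(length=8, d_model=4)

# Assumed usage: add the encoding to a (length, d_model) segment embedding.
embedded = np.ones((8, 4))
with_position = embedded + pe
```

The learnable variant replaces this fixed matrix with a trainable parameter of the same shape. In Table 6, the configuration with segment-level normalization-denormalization and no positional encoding has the highest F1 score on all three datasets.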
Table 7. Ablation analysis results of segment-level normalization-denormalization and positional encoding on three univariate datasets. SegND, APE, ✗, and ✓ are described in Table 6. The numbers in the cells represent affiliation F1 scores. Bold values indicate the best performance.
| SegND | Positional Encoding | ECG | Gait | RESP |
| --- | --- | --- | --- | --- |
| ✗ | ✗ | 0.739 | 0.783 | 0.693 |
| ✗ | APE (sinusoid) | 0.772 | 0.768 | 0.702 |
| ✗ | APE (learnable) | 0.746 | 0.819 | 0.703 |
| ✓ | ✗ | 0.802 | 0.867 | 0.763 |
| ✓ | APE (sinusoid) | 0.752 | 0.820 | 0.729 |
| ✓ | APE (learnable) | 0.759 | 0.850 | 0.721 |
Table 8. Ablation analysis results of preprocessing time series data on three multivariate datasets, each comprising multiple sub-datasets (see Table 1). Sub-Dataset-Level refers to a case where a normalization method is applied to each sub-dataset within a dataset, while Dataset-Level refers to a case where a normalization method is applied to a single dataset that concatenates all its sub-datasets. The numbers in the cells represent F1 score results. ✗ signifies that the corresponding one is not used, while ✓ indicates that the corresponding one is used. Bold values indicate the best performance.
| Sub-Dataset-Level | Dataset-Level | MSL | SMAP | SMD |
| --- | --- | --- | --- | --- |
| ✗ | ✗ | 0.943 | 0.975 | 0.939 |
| ✗ | min-max | 0.919 | 0.925 | 0.931 |
| ✗ | z-score | 0.887 | 0.707 | 0.877 |
| min-max | ✗ | 0.611 | 0.842 | 0.812 |
| min-max | min-max | 0.605 | 0.830 | 0.807 |
| min-max | z-score | 0.750 | 0.640 | 0.740 |
| z-score | ✗ | 0.923 | 0.692 | 0.811 |
| z-score | min-max | 0.922 | 0.823 | 0.808 |
| z-score | z-score | 0.887 | 0.655 | 0.799 |
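The difference between the two preprocessing levels defined in the Table 8 caption can be illustrated with a small NumPy sketch. The helper names are hypothetical, and for brevity the statistics are computed on the data being normalized; in practice they would typically be fit on training data and reused for test data.

```python
import numpy as np

def min_max(x, eps=1e-8):
    """Scale each channel of x (time, channels) to roughly [0, 1]."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    return (x - lo) / (hi - lo + eps)

def z_score(x, eps=1e-8):
    """Standardize each channel of x to zero mean and unit variance."""
    return (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + eps)

def normalize_sub_dataset_level(sub_datasets, method):
    """Apply the method to every sub-dataset separately."""
    return [method(x) for x in sub_datasets]

def normalize_dataset_level(sub_datasets, method):
    """Concatenate all sub-datasets, apply the method once, then split back."""
    lengths = [len(x) for x in sub_datasets]
    merged = method(np.concatenate(sub_datasets, axis=0))
    return np.split(merged, np.cumsum(lengths)[:-1], axis=0)

# Two toy sub-datasets with very different value ranges.
a = np.array([[0.0], [10.0]])
b = np.array([[100.0], [200.0]])

sub_level = normalize_sub_dataset_level([a, b], min_max)
dataset_level = normalize_dataset_level([a, b], min_max)

print(sub_level[0].ravel(), sub_level[1].ravel())
print(dataset_level[0].ravel(), dataset_level[1].ravel())
```

Sub-dataset-level scaling maps each sub-dataset to its own [0, 1] range, while dataset-level scaling preserves the relative offset between sub-datasets. The rows of Table 8 that list a method in both columns presumably chain the two steps; the order of application is not stated in this section.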
Table 9. Ablation analysis results of preprocessing time series data on four univariate datasets, each comprising multiple sub-datasets (see Table 2). Sub-Dataset-Level and Dataset-Level are described in Table 8. The numbers in the cells represent affiliation F1 scores. ✗ signifies that the corresponding one is not used, while ✓ indicates that the corresponding one is used. Bold values indicate the best performance.
| Sub-Dataset-Level | Dataset-Level | ECG | EPG | Gait | NASA |
| --- | --- | --- | --- | --- | --- |
| ✗ | ✗ | 0.672 | 0.840 | 0.725 | 0.926 |
| ✗ | min-max | 0.657 | 0.829 | 0.709 | 0.872 |
| ✗ | z-score | 0.670 | 0.812 | 0.712 | 0.892 |
| min-max | ✗ | 0.792 | 0.723 | 0.867 | 0.846 |
| min-max | min-max | 0.779 | 0.726 | 0.850 | 0.862 |
| min-max | z-score | 0.802 | 0.707 | 0.857 | 0.850 |
| z-score | ✗ | 0.735 | 0.667 | 0.825 | 0.914 |
| z-score | min-max | 0.756 | 0.677 | 0.815 | 0.883 |
| z-score | z-score | 0.764 | 0.711 | 0.831 | 0.870 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Han, C.S.; Kim, H.; Lee, K.M. Reevaluating the Potential of a Vanilla Transformer Encoder for Unsupervised Time Series Anomaly Detection in Sensor Applications. Sensors 2025, 25, 2510. https://doi.org/10.3390/s25082510
