Article

Transformer-Autoencoder-Based Unsupervised Temporal Anomaly Detection for Network Traffic with Dual Prediction and Reconstruction

by Jieke Lu 1, Xinyi Yang 2, Yang Liu 2,3, Haoran Zuo 2, Feng Zhou 2, Tong Yu 4, Dengmu Liu 2, Tianping Deng 2 and Lijun Luo 2,*
1 Electric Power Research Institute of Guangxi Power Grid Co., Ltd., Nanning 530000, China
2 Hubei Key Laboratory of Internet of Intelligence, School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China
3 People’s Bank of China Qinghai Branch, Xining 810001, China
4 School of Computer Science, Northeast Electric Power University, Jilin City 132000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(4), 2143; https://doi.org/10.3390/app16042143
Submission received: 15 January 2026 / Revised: 18 February 2026 / Accepted: 19 February 2026 / Published: 23 February 2026
(This article belongs to the Special Issue Deep Learning and Its Applications in Natural Language Processing)

Abstract

With the rapid growth of large-scale networks, traditional rule-based and supervised anomaly detection methods struggle with a heavy reliance on labeled data, slow response to rapidly changing patterns, and difficulty in capturing complex temporal anomalies. At the same time, real-world traffic exhibits strong class imbalance, with normal samples overwhelmingly dominant, causing many existing models to miss subtle but critical abnormal behaviors. To address these challenges, this paper proposes an unsupervised temporal anomaly detection framework for network traffic based on a Transformer-autoencoder bidirectional prediction and reconstruction model. The framework combines the advantages of autoencoders and regression models, using multi-head self-attention and positional encoding to capture long-range temporal dependencies in traffic sequences. A masked decoding mechanism is further employed to prevent information leakage from future time steps. The model jointly generates forward predictions, backward predictions, and reconstructed sequences, and multiple anomaly scoring strategies integrating prediction and reconstruction errors are designed to enhance sensitivity to point, contextual, and collective anomalies under highly imbalanced data. Experiments on three public benchmark datasets demonstrate that the proposed method significantly improves detection performance, achieving an F1 score of up to 0.960 and a precision of 0.949, with recall approaching 1.0, while reducing false alarms, showing strong applicability to practical network security scenarios.

1. Introduction

With the rapid proliferation of large-scale network systems [1], cloud services, and IoT devices, together with the continuous evolution of attack patterns, network security has become increasingly complex and diversified [2,3,4], as shown in Figure 1. This phenomenon makes network traffic anomaly detection an important research focus in information security [5]. Since traffic generated by intrusion behaviors differs essentially from that produced by legitimate user activities, such differences can be exploited to identify abnormal flows that threaten normal network operations, enabling potential security risk detection, fault diagnosis, and configuration optimization through real-time traffic monitoring and analysis [6,7,8]. Time series analysis can be leveraged to capture recurrent patterns and unexpected deviations [9]. Researchers further integrate time series analysis with deep learning and machine learning, enabling it to achieve more powerful nonlinear modeling and representation learning capabilities [10,11]. This combination can identify anomalous behaviors from high-dimensional, non-stationary traffic time series [12].
However, existing time series forecasting and anomaly detection methods still have limitations when applied to complex real-world network security scenarios [13,14]. First, many studies are still built on supervised or semi-supervised frameworks and rely heavily on manually labeled anomalous samples [13,15]. In real network environments, anomalous samples are extremely scarce and highly imbalanced, which biases the model during training and makes it difficult to generalize to novel attacks and unknown anomalies [13,14]. Second, traditional methods based on statistical models or on recurrent neural network (RNN), long short-term memory (LSTM), and convolutional neural network (CNN) architectures usually model temporal dependencies or local patterns from only a single perspective [16,17]. They struggle to capture multivariate dependencies and complex patterns in which periodic and bursty behaviors coexist [18], and they have limited capability to jointly detect point anomalies, contextual anomalies, and collective anomalies. In addition, recent unsupervised or weakly supervised methods based on autoencoders, generative adversarial networks (GANs), and dynamic thresholds have achieved good performance on some public datasets, but they are generally sensitive to the stability of the data distribution and to the noise level [15,19,20]. When the anomaly rate is extremely low or the service workload changes rapidly, these methods tend to suffer from non-adaptive thresholds and overfitting to normal patterns [14,19]. Therefore, there is an urgent need for an unsupervised temporal anomaly detection framework that combines the global modeling capability of Transformers with both prediction and reconstruction information [21]. Such a framework should automatically learn multi-scale temporal patterns and high-dimensional dependency structures in network traffic without requiring a large number of labels [14,21].
To achieve high-precision and robust network traffic anomaly detection under conditions of extreme class imbalance and dynamically evolving environments [22], this paper introduces an unsupervised anomaly detection approach. However, applying unsupervised learning to real-world network traffic still presents several challenges [23]. First, network traffic exhibits pronounced peaks, troughs, and periodic fluctuations, leading to a non-stationary distribution. If the model merely memorizes historical normal patterns, changes in the operating environment can easily cause frequent false positives or false negatives. Second, truly malicious traffic typically accounts for only a tiny fraction of the overall volume, while attack behaviors are diverse and rapidly evolving, making the model more likely to overfit normal patterns and leaving it insufficiently expressive for rare, morphologically diverse anomalies [24]. Finally, traffic time series contain both short-term bursts and long-range periodic structures, requiring the detection model to simultaneously capture fine-grained local variations and long-term temporal dependencies.
This study is devoted to accurately detecting anomalies in high-dimensional, non-stationary, and severely imbalanced network traffic [25]. We design an anomaly detection framework tailored to network time-series data. The proposed method enhances sensitivity to various types of anomalies while improving robustness to noise, resulting in more accurate and stable detection performance in practical network monitoring scenarios.
The main contributions of this work are threefold:
(1)
We formulate unsupervised network traffic anomaly detection as a multivariate time-series modeling problem and develop a practical detection framework that supports point, contextual, and collective anomalies.
(2)
We propose a Transformer-autoencoder-based architecture that integrates bidirectional prediction and sequence reconstruction within a unified encoder–decoder design, enabling the model to capture both short-term bursts and long-range temporal patterns while alleviating boundary effects commonly observed in one-directional predictors.
(3)
We design an anomaly scoring mechanism with multiple fusion strategies that jointly exploits prediction and reconstruction deviations to improve robustness under noisy and dynamically changing traffic, together with static and optimization-driven dynamic thresholding to convert continuous scores into actionable alarms.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the proposed anomaly detection model. Section 4 describes the associated algorithms in detail. Section 5 outlines the experimental setup and reports the evaluation results, and Section 6 concludes the paper.

2. Related Work

This section reviews related work on network traffic anomaly detection, with emphasis on temporal modeling and intrusion detection. As networks scale and encrypted traffic becomes pervasive, accurately detecting anomalies in high-dimensional time series has drawn growing interest from both academia and industry. However, existing approaches still struggle with non-stationary traffic, severe class imbalance, and diverse anomaly types in real-world settings [26,27]. Therefore, it is necessary to systematically organize and analyze prior work, laying a solid foundation for designing more effective unsupervised temporal anomaly detection models.

2.1. Time Series Modeling

Research on time-series forecasting and anomaly detection has a long history in statistics, signal processing, and machine learning [28,29]. Early work mainly relied on classical statistical models such as autoregressive, moving average, and autoregressive integrated moving average, which model temporal dependencies through linear recursion and have shown good performance on stationary series with clear trend and periodic patterns [28,29,30]. However, these models struggle to capture nonlinear dynamics and complex, weakly periodic behaviors that are common in modern network systems, limiting their applicability in highly dynamic environments such as large-scale data centers and industrial networks [31,32,33].
With the rapid development of deep learning, neural-network-based methods have become the mainstream approach for time series modeling [34]. RNNs and their variants, including LSTM networks and gated recurrent unit (GRU) networks, have been widely used to learn nonlinear temporal dependencies [34,35,36,37]. Koutnik et al. [16] improved the traditional RNN architecture by introducing a clock module to reduce computation time. Abbasimehr et al. [17] combined LSTM with a multi-head attention mechanism to develop a forecasting model that can both discover long-term patterns and capture key features. Shaik et al. [38] integrated graph neural networks with reinforcement learning and used Bayesian optimization for intelligent hyperparameter tuning, enabling accurate prediction in dynamic environments. These improved schemes further enhance the ability to capture long-term patterns, multi-scale periodicity, and complex feature interactions in multivariate sequences [16,17,38]. However, the sequential nature of RNN-type models makes training and inference difficult to parallelize, leading to higher computational costs on long sequences, and such models are vulnerable to vanishing or exploding gradients. As a result, their robustness and generalization are limited when applied to large-scale, noisy network traffic in real-world anomaly detection scenarios.

2.2. Intrusion Detection

Intrusion detection and network traffic anomaly detection constitute another important line of research. Since the early concept of intrusion detection as analyzing system and network data to identify malicious behavior [39], traditional methods have primarily relied on manually crafted rules and statistical thresholds over flow-level features. Although rule-based systems can effectively detect known attack signatures, they are labor-intensive to maintain, sensitive to evolving attack patterns, and often unable to cope with the explosive growth in network traffic and the increasing prevalence of encryption. To address these limitations, recent work has shifted toward machine learning and deep learning approaches, which automatically learn discriminative features from raw or lightly processed network data. Representative examples include hybrid optimization-enhanced deep residual networks, such as GSOOA-1DDRSN [40], for feature selection and classification in intrusion detection, which improve both detection accuracy and runtime efficiency over traditional rule-based and shallow-learning methods.
Deep learning techniques have increasingly attracted the attention of researchers in the intrusion detection field [41]. Among supervised approaches, Alashjaee [42] proposed an Attention-CNN-LSTM hybrid intrusion detection model, which uses CNN to extract spatial features from network traffic and an attention mechanism to assign higher weights to key features. Hegde et al. [43] kept the multilayer perceptron (MLP) architecture unchanged but improved detection performance by optimizing data quality, addressing the poor real-time performance of traditional model-driven methods in practical deployments. Semi-supervised methods such as RLAD [44] combine deep reinforcement learning with active learning to adapt to continuously evolving anomaly patterns in real-world multivariate sequences. In weakly supervised frameworks, Elaziz et al. [45] integrated deep reinforcement learning with a variational autoencoder (VAE), enabling business-process anomaly detection with only a limited number of labeled anomalous samples. For unsupervised models, Zhang et al. [46] used a graph convolutional network (GCN) together with a graph attention network (GAT) to capture the relational structure in building energy-consumption data. More recently, Baldoni et al. [47] fitted principal component analysis (PCA) on normal traffic and used the reconstruction error as an anomaly score, supporting rapid response and offering potential for detecting zero-day attacks.
Several deep time-series anomaly detection models are also closely related to our scenario. Geiger et al. [48] proposed TadGAN, which leverages GAN to generate realistic data and employs LSTM-based encoder/decoder generators along with dual critics to reconstruct sequences, thereby capturing complex temporal correlations in time-series data. Hundman et al. [49] addressed spacecraft telemetry anomaly detection characterized by high dimensionality, non-stationarity, and scarce labels. They used a two-layer LSTM network for one-step prediction while incorporating anomalous sequence pruning and historical feedback to reduce false alarms. Hsieh et al. [50] developed a framework based on LSTM Autoencoder, learning temporal patterns solely from normal data. It triggers alerts through majority voting over multiple window-level predictions for the same timestamp. Xu et al. [21] proposed the Anomaly Transformer, which adopts a dual-branch anomaly attention mechanism and measures their discrepancy using a symmetric KL divergence.
Despite their effectiveness, these methods still have several limitations. Most supervised and semi/weakly supervised approaches rely on continuous label acquisition and expert feedback, which is costly under large-scale, rapidly evolving network traffic. Hybrid prediction-reconstruction detectors are widely adopted, but they face two practical issues. First, one-directional predictors often suffer from boundary artifacts due to missing context near window edges. Second, network traffic anomalies may exhibit phase shifts or morphology distortions. Graph-based and PCA-based models often assume relatively stable relational structures or linear subspaces, making it difficult to capture highly nonlinear behaviors and concept drift in real-world network environments. GAN- and LSTM-based architectures may suffer from training instability, difficulty in modeling very long-range dependencies, and sensitivity to threshold selection, which can lead to delayed or noisy alarms in highly non-stationary traffic. In addition, the Anomaly Transformer requires maintaining dual association branches and solving a minimax optimization problem, which increases memory and computational cost, complicates hyperparameter tuning, and may limit its scalability in large-scale network monitoring scenarios.
This paper focuses on unsupervised time-series anomaly detection in network traffic under highly imbalanced and dynamically evolving conditions. Building on the strengths and limitations of existing studies, we aim to improve the detection of point anomalies, contextual anomalies, and collective anomalies while maintaining robustness to noise and distribution shift. Unlike approaches that rely solely on forecasting, reconstruction, or a single anomaly score, we propose a Transformer-autoencoder-based bidirectional framework combining forecasting with reconstruction. The framework jointly models forward and backward temporal dependencies and reconstructs normal patterns, then fuses multiple error signals via adaptive thresholding. Specifically, we introduce a boundary-aware bidirectional scoring rule to improve edge sensitivity. Apart from that, we compute reconstruction discrepancy via local DTW alignment, improving robustness to minor temporal misalignment. This design enhances the sensitivity to subtle anomalous behaviors in long sequences and mitigates the impact of class imbalance. Experiments on benchmark datasets such as SMD, PSM, and ASD demonstrate that, compared with baseline methods, our approach achieves significant gains in F1 score, precision, and recall, highlighting its effectiveness and practicality in real-world network security scenarios.

3. System Model

This section presents the system modeling of the proposed unsupervised network-traffic anomaly detection framework. We first describe how raw multivariate traffic is aggregated, cleaned, normalized, and transformed into fixed-length windows suitable for temporal modeling. We then introduce the Transformer-autoencoder architecture used to learn normal temporal patterns, including the encoder–decoder structure and the dual prediction–reconstruction design. Finally, we formulate the training objective, which jointly optimizes forward and backward prediction together with sequence reconstruction to capture both short-term fluctuations and long-range dependencies in network traffic.
Our model adopts a modular design tailored for large-scale network traffic monitoring. Figure 2 illustrates the overall architecture of the proposed bidirectional forecasting-reconstruction model for unsupervised network-traffic anomaly detection. The model follows an encoder–decoder paradigm and is equipped with dual regression heads and a reconstruction head, specifically designed to capture long-range temporal dependencies and subtle structural changes in traffic sequences. The architecture consists of an input preprocessing module, a temporal embedding layer, a lightweight Transformer encoder, a masked decoder, and three output branches that collaboratively perform forward forecasting, backward forecasting, and sequence reconstruction.

3.1. Input Representation and Preprocessing

Let the preprocessed multivariate network-traffic time series be denoted as $\{x_t\}_{t=1}^{T}$, $x_t \in \mathbb{R}^{F}$, where $T$ is the total number of time steps and $F$ is the feature dimension. After segmenting the sequence using a sliding window of length $L$, we obtain a batch input tensor $\mathbf{x} \in \mathbb{R}^{B \times L \times F}$, where $B$ denotes the batch size.
Each input window is then fed into an MLP to project the original feature space into a unified model dimension. This projection maps heterogeneous traffic features into a shared latent space suitable for attention operations, improving representational capacity while serving as a learnable feature-fusion layer. On top of the projected features, temporal positional encodings are added to inject sequential information that pure attention mechanisms lack, enabling the model to distinguish early and late events within a window and to capture periodic or trending behaviors. The resulting temporal embedding sequence is used as the base representation and fed into the encoder for subsequent global dependency modeling.

3.1.1. Traffic Aggregation and Feature Extraction

Network traffic is collected as multivariate time series comprising flow-level and protocol-level metrics. To reduce noise and accommodate multi-scale temporal patterns, the raw data are first aggregated over fixed time intervals such as minute-level windows, using robust statistics such as the median, which mitigates the influence of outliers and improves the robustness of downstream modeling.
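To make this aggregation step concrete, the following NumPy sketch (hypothetical helper name, not the authors' code) median-aggregates raw samples into fixed one-minute bins; the median suppresses the outlier spike that a mean would absorb:

```python
import numpy as np

def median_aggregate(timestamps, values, interval=60.0):
    """Aggregate raw samples into fixed `interval`-second bins using the
    median, which is robust to outlier spikes in bursty traffic."""
    bins = (np.asarray(timestamps) // interval).astype(int)
    return np.array([np.median(values[bins == b]) for b in np.unique(bins)])

ts = np.array([0.0, 10.0, 50.0, 70.0, 110.0])
vals = np.array([1.0, 2.0, 100.0, 4.0, 6.0])  # 100.0 is an outlier spike
agg = median_aggregate(ts, vals)              # one median per minute bin
```

Here the first bin yields 2.0 despite the 100.0 spike, illustrating why robust statistics help downstream modeling.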
From the aggregated traffic, three groups of features are extracted:
(1)
Statistical features: Mean, median, standard deviation, variance, minimum, and maximum, which characterize the overall distribution and short-term variability of traffic within each interval.
(2)
Temporal features: Descriptors of temporal dynamics, such as frequency-domain components via Fourier transforms, sliding-window traffic rates, and differences between adjacent time steps, which help distinguish normal periodic patterns from abrupt changes.
(3)
Traffic-type features: Protocol and flow descriptors, including protocol type, packet size distribution, source/destination ports, and session duration. They can capture application-level behavior and support the identification of protocol-level attacks.
After feature extraction, PCA is applied to reduce dimensionality while preserving at least 95% of the variance. This step alleviates redundancy, lowers computational cost, and provides a compact yet informative representation of each time step.
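A minimal SVD-based PCA sketch illustrating the 95%-variance criterion (illustrative helper, not the paper's implementation):

```python
import numpy as np

def pca_reduce(X, var_threshold=0.95):
    """Project centered data onto the fewest principal components whose
    cumulative explained variance reaches `var_threshold`."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = (S ** 2) / np.sum(S ** 2)        # explained-variance ratios
    k = int(np.searchsorted(np.cumsum(var_ratio), var_threshold)) + 1
    return Xc @ Vt[:k].T                         # (n_samples, k) scores

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 5))
noisy_copy = 2.0 * base + 0.01 * rng.normal(size=(200, 5))
X = np.hstack([base, noisy_copy])               # 10 features, ~5 informative
Z = pca_reduce(X)                               # far fewer than 10 columns
```

Because five features are near-duplicates of the other five, the retained components are far fewer than the original ten, which is the redundancy-removal effect the text describes.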

3.1.2. Normalization and Missing-Value Handling

To harmonize heterogeneous feature scales, min-max normalization is employed to linearly map each feature to the range [0, 1]. This prevents features with large magnitudes from dominating the learning process and accelerates model convergence. The normalization formula is as follows:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x$ is the raw value, $x'$ is the normalized value, and $\min(x)$ and $\max(x)$ denote the minimum and maximum values of the feature, respectively.
Missing values, which may arise due to temporary measurement failures, network interruptions, or collection delays, are handled using forward filling. Each missing entry is replaced with the most recent valid observation in the same feature dimension. This simple yet effective strategy preserves continuity in the time series and avoids discarding partially corrupted samples.
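The two preprocessing rules above can be sketched as follows (helper names are illustrative; a leading NaN would remain unfilled in this simple version):

```python
import numpy as np

def forward_fill(x):
    """Replace each NaN with the most recent valid value in the same
    column (a leading NaN would remain in this simple version)."""
    x = x.copy()
    for j in range(x.shape[1]):
        for t in range(1, x.shape[0]):
            if np.isnan(x[t, j]):
                x[t, j] = x[t - 1, j]
    return x

def min_max_normalize(x):
    """Linearly map each feature column to [0, 1]; a guard in the
    denominator leaves constant columns at 0 instead of dividing by zero."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / np.where(hi > lo, hi - lo, 1.0)

raw = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [3.0, 30.0]])
filled = forward_fill(raw)          # NaN replaced by 10.0
norm = min_max_normalize(filled)    # each column mapped to [0, 1]
```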

3.1.3. Windowing and Embedding

The normalized multivariate series is segmented into overlapping windows of fixed length L with a stride of 1 time step. With L = 30, each window contains 30 consecutive time points and serves as the basic input unit for the model, balancing temporal resolution and computational efficiency.
For a single sample, the k-th window can be formalized as:
$$x^{(k)} = \left[\, x_k, x_{k+1}, \ldots, x_{k+L-1} \,\right] \in \mathbb{R}^{L \times F}, \qquad k = 1, \ldots, T - L + 1$$
The linear embedding layer maps the PCA-reduced feature vectors into a latent space of dimension $d_{\mathrm{model}}$, producing an embedded sequence with a unified model dimension $z^{(k)} \in \mathbb{R}^{L \times d_{\mathrm{model}}}$. To encode temporal order, sinusoidal positional encoding is added to the embeddings, enabling the model to distinguish different positions in the sequence and to capture periodic patterns in network traffic. Specifically, we have:
$$h_t^{(k)} = \sigma\!\left( W_{\mathrm{mlp}}\, x_t^{(k)} + b_{\mathrm{mlp}} \right), \qquad z_t^{(k)} = h_t^{(k)} + p_t, \qquad t = 1, \ldots, L$$
where $W_{\mathrm{mlp}} \in \mathbb{R}^{F \times d_{\mathrm{model}}}$ and $b_{\mathrm{mlp}}$ are learnable parameters, $p_t$ denotes the positional encoding at time step $t$, and $\sigma(\cdot)$ is the activation function.
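The windowing and sinusoidal positional encoding described above can be sketched as follows (L = 30 windows with stride 1; helper names are illustrative):

```python
import numpy as np

def make_windows(x, L):
    """Segment a (T, F) series into overlapping windows of length L with
    stride 1, yielding a (T - L + 1, L, F) tensor."""
    T = x.shape[0]
    return np.stack([x[k:k + L] for k in range(T - L + 1)])

def positional_encoding(L, d_model):
    """Standard sinusoidal positional encoding of shape (L, d_model):
    sine on even dimensions, cosine on odd dimensions."""
    pos = np.arange(L, dtype=float)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

x = np.random.default_rng(1).normal(size=(100, 8))  # T = 100, F = 8
windows = make_windows(x, L=30)                     # (71, 30, 8)
pe = positional_encoding(30, 64)                    # added to embeddings
```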

3.2. Transformer-Autoencoder Architecture

3.2.1. Encoder Design

The overall architecture of the encoder and decoder is shown in Figure 3. Input Processing embeds the observed input window for the encoder using a learnable embedding layer and positional encoding. Each time window capturing local temporal context is then passed to the Transformer-based encoder, which learns a compact representation of normal temporal dynamics. The encoder is built by stacking Transformer blocks composed of multi-head self-attention and enhanced feed-forward sublayers. Each encoder layer uses four parallel attention heads to analyze correlations between different time steps across multiple representation subspaces, allowing the model to learn both short-term bursts and long-term seasonal patterns in network traffic. The feed-forward network refines the attention outputs via a three-stage dense architecture, extracting richer nonlinear features while progressively reducing dimensionality. Dropout is applied after both multi-head attention and feed-forward sublayers to mitigate overfitting and stabilize training. Each block is wrapped with residual connections and layer normalization to improve gradient flow and accelerate convergence. Overall, this lightweight encoder design balances modeling capacity and computational cost, making the framework suitable for large-scale traffic monitoring.
The encoder input for a specific window is denoted as $Z^{(k)} = \left[ z_1^{(k)}, \ldots, z_L^{(k)} \right] \in \mathbb{R}^{L \times d_{\mathrm{model}}}$. When computing the self-attention mechanism, the input sequence is first linearly transformed into the Query (Q), Key (K), and Value (V) matrices, which represent the query features of the current element, the matching keys, and the corresponding content, respectively:
$$Q = Z^{(k)} W^{Q}, \qquad K = Z^{(k)} W^{K}, \qquad V = Z^{(k)} W^{V}$$
where $Z^{(k)}$ is the input sequence and $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$ are learnable weight matrices.
The similarity between each query vector and all key vectors is computed via dot products to obtain the attention score matrix, which characterizes the correlations among elements. To prevent excessively large dot-product values that may cause gradient instability, a scaling factor 1 d k is introduced, where d k denotes the dimensionality of the key vectors. The softmax function is applied to normalize the attention scores into a probability distribution, ensuring that the weights sum to 1. Finally, the normalized attention weights are multiplied with the corresponding value vectors and summed to form a weighted aggregation, producing a context-aware representation at each position and thus yielding a new output sequence. The calculation of single-head self-attention is
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V$$
Multi-head attention is written as:
$$\mathrm{MHA}(Z^{(k)}) = \mathrm{Concat}\!\left( \mathrm{Attn}_1, \ldots, \mathrm{Attn}_H \right) W^{O}$$
where $H$ denotes the number of attention heads and $W^{O}$ represents the output transformation matrix.
Compared with the standard Transformer encoder, the feed-forward network is extended with an additional Dense layer (ReLU → GELU → Linear) and 256 hidden units to progressively smooth and refine the representation, thereby enhancing the model’s capacity to capture nonlinear relationships in the time series. Dropout with rate 0.1 is applied after both the attention and feed-forward sublayers to reduce overfitting and accelerate convergence. The output of each encoder layer satisfies
$$\tilde{Z}^{(k)} = \mathrm{LayerNorm}\!\left( Z^{(k)} + \mathrm{Dropout}\!\left( \mathrm{MHA}(Z^{(k)}) \right) \right)$$
$$H^{(k)} = \mathrm{LayerNorm}\!\left( \tilde{Z}^{(k)} + \mathrm{Dropout}\!\left( \mathrm{FFN}(\tilde{Z}^{(k)}) \right) \right)$$
where FFN denotes a feedforward network. The output of the final encoder block provides a contextualized representation for each time step in the input window, capturing both short-term fluctuations and long-range dependencies in the traffic.
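A single encoder block matching the two equations above can be sketched in NumPy (untrained random weights, a single window, dropout omitted, and a simplified two-layer ReLU feed-forward network in place of the paper's three-stage ReLU → GELU → Linear design):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(Z, Wq, Wk, Wv):
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def encoder_layer(Z, p):
    """One block: multi-head attention and FFN, each wrapped with a
    residual connection and layer normalization (dropout omitted)."""
    heads = [self_attention(Z, *w) for w in p["heads"]]
    Z1 = layer_norm(Z + np.concatenate(heads, axis=-1) @ p["Wo"])
    ffn = np.maximum(Z1 @ p["W1"], 0.0) @ p["W2"]   # simplified ReLU FFN
    return layer_norm(Z1 + ffn)

rng = np.random.default_rng(0)
L, d, dk, H = 30, 64, 16, 4                         # 4 heads, 256-unit FFN
p = {"heads": [tuple(rng.normal(scale=0.1, size=(d, dk)) for _ in range(3))
               for _ in range(H)],
     "Wo": rng.normal(scale=0.1, size=(H * dk, d)),
     "W1": rng.normal(scale=0.1, size=(d, 256)),
     "W2": rng.normal(scale=0.1, size=(256, d))}
out = encoder_layer(rng.normal(size=(L, d)), p)     # (30, 64)
```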

3.2.2. Decoder Design and Multi-Branch Heads

During decoding, the model adopts an autoregressive scheme to integrate the encoder’s historical contextual information for both prediction and reconstruction. Target Processing forms the decoder input sequence and applies the same embedding and positional encoding before masked decoding. The decoder is initialized by using the encoder output at the last time step as a seed, which is further expanded by injecting two-step Gaussian noise perturbations. This design introduces diversity, mitigates overfitting, and supports robust multi-step forecasting. The decoder is structurally similar to the encoder, but it incorporates an autoregressive generation mechanism and an encoder–decoder cross-attention mechanism. Specifically, by employing a masked self-attention matrix M, we ensure that the representation at the i-th time step focuses solely on the current and historical positions. This prevents information leakage from future time steps and enforces temporal causality. The elements of M are
$$M_{ij} = \begin{cases} 0, & j \le i, \\ -\infty, & j > i. \end{cases}$$
The masked attention is written as:
$$\mathrm{Attn}_{\mathrm{mask}}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top} + M}{\sqrt{d_k}} \right) V$$
The final hidden state of the encoder is used to initialize the decoder input and is temporally expanded. Small Gaussian perturbations are added at the beginning to improve robustness and reduce overfitting. In the second attention sub-layer, the encoder–decoder attention mechanism uses the encoder outputs as keys and values, enabling the decoder to exploit global context when generating each time step. The output head consists of a stack of dense layers followed by a nonlinear activation function (sigmoid), which maps decoder states back to the original feature space and constrains the output range to better fit the reconstruction objective.
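The causal mask and masked attention defined above can be sketched as:

```python
import numpy as np

def causal_mask(L):
    """Additive mask M: 0 on and below the diagonal, -inf above it, so
    step i attends only to steps j <= i (no future leakage)."""
    M = np.zeros((L, L))
    M[np.triu_indices(L, k=1)] = -np.inf
    return M

def masked_attention(Q, K, V):
    """Masked scaled dot-product attention; returns (weights, output)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + causal_mask(Q.shape[0])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w, w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
w, out = masked_attention(Q, K, V)
```

The attention weights above the diagonal are exactly zero, so the first time step attends only to itself, confirming the temporal-causality property.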
Built upon a shared decoder backbone, the model implements three functional branches:
(1)
Forward prediction branch: The first segment performs forward prediction along the original time order to forecast future one-step value or multi-step traffic conditions, focusing on capturing trend variations and bursty traffic changes.
(2)
Backward prediction branch: The middle segment reconstructs the input window by operating on a time-reversed version of the input sequence. After decoding, the predicted results are reversed back to the original timeline, providing complementary contextual information for the beginning part of the window.
(3)
Reconstruction branch: The final segment reconstructs the central region of the window, aiming to faithfully reproduce normal patterns and thereby highlight structural deviations.
For a given window x ( k ) , the decoder output can be written as:
$$\left( \tilde{y}_{k-1},\; \tilde{y}_k, \tilde{y}_{k+1}, \ldots, \tilde{y}_{k+L-1},\; \tilde{y}_{k+L}^{\mathrm{fwd}} \right) = g_{\theta}\!\left( x^{(k)} \right)$$
where $\tilde{y}_{k-1}$ denotes the one-step backward prediction on the left side of the window, $\tilde{y}_{k+L}^{\mathrm{fwd}}$ denotes the one-step forward prediction on the right side of the window, and $\tilde{y}_t$ represents the reconstructed value for the internal time steps $t \in [k, k+L-1]$. The function $g_{\theta}(\cdot)$ denotes the overall mapping parameterized by $\theta$, which comprises the Transformer encoder–decoder and the three-branch output heads.
When a probabilistic output formulation is used, the Softmax in the output layer normalizes decoder outputs to probabilities. In this way, the model performs regression at both ends and reconstruction in the middle, forming a unified architecture that couples forward prediction, backward prediction, and sequence reconstruction. A weighted joint loss across the three branches is used to train the entire network end-to-end, which encourages the encoder–decoder to learn normal behaviors while enhancing sensitivity to anomalies in temporal dynamics and feature structures. The three branches are ultimately fused into a unified anomaly score, which is then processed by an adaptive thresholding module to produce the final anomaly labels.
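Under the output formulation above, separating the decoder output into the three branches reduces to simple slicing (illustrative shapes; the decoder emits L + 2 steps per window):

```python
import numpy as np

def split_branches(decoder_out):
    """Split a (L+2, F) decoder output into the backward one-step
    prediction, the L-step window reconstruction, and the forward
    one-step prediction."""
    return decoder_out[0], decoder_out[1:-1], decoder_out[-1]

L_win, F_dim = 5, 3
dec = np.arange((L_win + 2) * F_dim, dtype=float).reshape(L_win + 2, F_dim)
y_rev, y_rec, y_fwd = split_branches(dec)
```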

3.2.3. Training Objective and Optimization

The network is trained in an unsupervised manner by minimizing a weighted combination of three mean-squared-error (MSE) terms: the forward prediction loss, the backward prediction loss, and the reconstruction loss, denoted $\mathrm{Loss}_{\mathrm{fwd}}$, $\mathrm{Loss}_{\mathrm{rev}}$, and $\mathrm{Loss}_{\mathrm{rec}}$, respectively. Let the number of training windows be $N_{\mathrm{win}}$. For the $k$-th window, denote its true values as $(y_{k-1}, y_k, \ldots, y_{k+L-1}, y_{k+L})$.
The three losses are computed as follows:
$\mathrm{Loss}_{\mathrm{fwd}} = \frac{1}{N_{\mathrm{win}}} \sum_{k=1}^{N_{\mathrm{win}}} \left\| y_{k+L} - \hat{y}_{k+L}^{\mathrm{fwd}} \right\|_2^2, \quad \mathrm{Loss}_{\mathrm{rev}} = \frac{1}{N_{\mathrm{win}}} \sum_{k=1}^{N_{\mathrm{win}}} \left\| y_{k-1} - \hat{y}_{k-1}^{\mathrm{rev}} \right\|_2^2, \quad \mathrm{Loss}_{\mathrm{rec}} = \frac{1}{N_{\mathrm{win}} L} \sum_{k=1}^{N_{\mathrm{win}}} \sum_{i=0}^{L-1} \left\| y_{k+i} - \tilde{y}_{k+i} \right\|_2^2.$
The total loss is defined as:
$\mathrm{Loss} = \alpha\, \mathrm{Loss}_{\mathrm{fwd}} + \beta\, \mathrm{Loss}_{\mathrm{rec}} + \gamma\, \mathrm{Loss}_{\mathrm{rev}}$
where α , β and γ are weighting parameters reflecting the central role of reconstruction while retaining significant contributions from both prediction directions.
Overall, the training objective of the proposed framework is to learn model parameters that minimize the total loss defined in Formula (14):
$\theta^{*} = \arg\min_{\theta}\ \mathrm{Loss}(\theta),$
where θ denotes all trainable parameters of the Transformer-autoencoder model.
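To make the objective concrete, the three MSE terms and their weighted combination can be sketched in NumPy. The function name `joint_loss`, the argument names, and the array shapes are illustrative assumptions rather than the paper's implementation; the default weights match the values used later in the experiments ($\alpha$ = 0.25, $\beta$ = 0.5, $\gamma$ = 0.25).

```python
import numpy as np

def joint_loss(y_next, y_next_hat, y_prev, y_prev_hat, y_win, y_win_rec,
               alpha=0.25, beta=0.5, gamma=0.25):
    """Weighted sum of forward, backward, and reconstruction MSE terms.

    Illustrative shapes: one-step targets/predictions are (N_win, d);
    windows and reconstructions are (N_win, L, d).
    """
    loss_fwd = np.mean(np.sum((y_next - y_next_hat) ** 2, axis=-1))
    loss_rev = np.mean(np.sum((y_prev - y_prev_hat) ** 2, axis=-1))
    loss_rec = np.mean(np.sum((y_win - y_win_rec) ** 2, axis=-1))
    # Loss = alpha * Loss_fwd + beta * Loss_rec + gamma * Loss_rev
    return alpha * loss_fwd + beta * loss_rec + gamma * loss_rev
```

With perfect predictions and reconstructions, all three terms vanish and the total loss is zero.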

4. Algorithm Design

Building on the above system modeling, this section details the algorithmic design of the proposed framework. We first formalize how the model outputs are converted into anomaly scores, including bidirectional prediction errors, reconstruction-based scores, and their fusion into a unified indicator. We then describe the thresholding and post-processing procedures that transform continuous scores into interval-level alarms. Finally, we present the complete training and detection workflows, integrating data preprocessing, window construction, model optimization, score computation, and alarm generation into a coherent algorithm.

4.1. Anomaly Score

The outputs of the prediction and reconstruction branches are translated into pointwise anomaly scores. The design aims to combine sensitivity to abrupt local changes with robustness to benign fluctuations.

4.1.1. Bidirectional Prediction Error

With bidirectional prediction, the model generates forecasts from both the forward and backward decoding directions to assess its ability to predict traffic values. In forward prediction, the decoder processes the original time series directly, starting from the beginning and sequentially predicting the next traffic value. In backward prediction, the input sequence is reversed, the decoder generates predictions starting from the end point, and the outputs are reversed back to the original order to align with the time steps. To mitigate initialization artifacts, the earliest time steps in each window are excluded from scoring by setting their errors to zero. For each time step $t$, the squared error between the true value $y_t$ and each of the forward prediction $\hat{y}_t^{\mathrm{fwd}}$ and the backward prediction $\hat{y}_t^{\mathrm{rev}}$ is first computed to evaluate predictive accuracy. The one-way prediction errors are
$e_t^{\mathrm{fwd}} = \left\| y_t - \hat{y}_t^{\mathrm{fwd}} \right\|_2^2$
$e_t^{\mathrm{rev}} = \left\| y_t - \hat{y}_t^{\mathrm{rev}} \right\|_2^2$
In the central overlap region where both forward and backward predictions are available, the prediction error is taken as their average, which reduces directional noise. At the non-overlapping boundaries, i.e., the first and last N/2 time steps, the maximum of the two errors is used to emphasize boundary anomalies. Let $\Omega_{\mathrm{mid}}$ denote the overlapping central interval where both forward and backward predictions exist, and $\Omega_{\mathrm{bd}}$ denote the boundary interval where only unidirectional predictions exist. The bidirectional prediction anomaly score is then defined as:
$s_t^{\mathrm{pred}} = \begin{cases} \frac{1}{2}\left(e_t^{\mathrm{fwd}} + e_t^{\mathrm{rev}}\right), & t \in \Omega_{\mathrm{mid}}, \\ \max\left(e_t^{\mathrm{fwd}}, e_t^{\mathrm{rev}}\right), & t \in \Omega_{\mathrm{bd}}, \\ 0, & \text{otherwise}. \end{cases}$
To reduce noise, $s_t^{\mathrm{pred}}$ is further processed by exponentially weighted moving average (EWMA) smoothing followed by masking of the initial segment:
$\tilde{s}_t^{\mathrm{pred}} = \lambda\, s_t^{\mathrm{pred}} + (1 - \lambda)\, \tilde{s}_{t-1}^{\mathrm{pred}}, \quad t \geq 2$
where $\lambda \in (0, 1)$ is the smoothing coefficient and $\tilde{s}_1^{\mathrm{pred}} = s_1^{\mathrm{pred}}$.
$s_t^{\mathrm{mask}} = \begin{cases} 0, & t \leq m, \\ \tilde{s}_t^{\mathrm{pred}}, & t > m, \end{cases} \qquad m = \lceil 0.01\, T \rceil$
The errors of the first 1% of time steps are set to zero, yielding the final bidirectional prediction-based anomaly score $s_t^{\mathrm{mask}}$. This masking suppresses false positives caused by initial-segment artifacts, while the bidirectional design alleviates the well-known issue that single-direction predictors underperform near sequence boundaries due to missing context, thereby enhancing detection robustness across the entire window.
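The scoring steps above, overlap averaging, boundary maximum, EWMA smoothing, and initial-segment masking, can be sketched as follows. The helper name `pred_score` and the NaN convention for marking unavailable prediction directions are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

def pred_score(e_fwd, e_rev, lam=0.3, mask_frac=0.01):
    """Fuse per-step forward/backward errors into s_t^mask.

    e_fwd, e_rev: length-T arrays with np.nan where that direction has
    no prediction. Overlap -> average of both errors; boundary -> the
    available error (np.fmax ignores NaN); neither -> 0.
    """
    e_fwd = np.asarray(e_fwd, float)
    e_rev = np.asarray(e_rev, float)
    both = ~np.isnan(e_fwd) & ~np.isnan(e_rev)
    s = np.where(both, 0.5 * (e_fwd + e_rev), np.fmax(e_fwd, e_rev))
    s = np.nan_to_num(s, nan=0.0)  # neither direction available -> 0
    # EWMA smoothing: s~_t = lam * s_t + (1 - lam) * s~_{t-1}
    smooth = np.empty_like(s)
    smooth[0] = s[0]
    for t in range(1, len(s)):
        smooth[t] = lam * s[t] + (1.0 - lam) * smooth[t - 1]
    # zero out the first 1% of time steps to suppress start-up artifacts
    smooth[: int(mask_frac * len(s))] = 0.0
    return smooth
```

Setting `lam=1.0` disables the smoothing, which makes the overlap/boundary fusion behavior easy to inspect in isolation.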

4.1.2. Reconstruction Error

The reconstruction component extracts the reconstructed sequence at intermediate time steps and compares it with the original sequence, measuring the model's ability to reproduce the original data and capturing structural anomalies in the feature space. The reconstruction error employs dynamic time warping (DTW) to capture morphological variations. DTW handles differing lengths and rhythmic changes in time series, tolerates minor time-step offsets, and minimizes alignment errors, making it highly sensitive to abnormal deviations in nonlinear traffic patterns such as sudden attack spikes. Let the reconstruction sequence generated by the model during the inference phase be $\{\tilde{y}_t\}_{t=1}^{T}$. Centered at time step $t$, we extract a local segment of length $2l + 1$:
$\mathbf{y}_t = \left( y_{t-l}, \ldots, y_{t+l} \right)$
$\tilde{\mathbf{y}}_t = \left( \tilde{y}_{t-l}, \ldots, \tilde{y}_{t+l} \right)$
DTW computes the optimal alignment path $P_t = \{(i, j)\}$ between the two segments, with cost
$d_t^{\mathrm{DTW}} = \frac{1}{|P_t|} \sum_{(i,j) \in P_t} \left\| y_i - \tilde{y}_j \right\|_2$
The DTW cost is used directly as the anomaly score of the reconstruction branch:
$s_t^{\mathrm{rec}} = d_t^{\mathrm{DTW}}$
This method is better at capturing differences in form and phase, making it more sensitive to anomalies such as sudden attacks that distort patterns or cause temporal misalignment.
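A minimal dynamic-programming sketch of the normalized DTW cost $d_t^{\mathrm{DTW}}$ is shown below. The function name `dtw_score` is hypothetical and the implementation is unoptimized; the paper does not specify its DTW implementation.

```python
import numpy as np

def dtw_score(a, b):
    """Mean pointwise distance along the optimal DTW warping path.

    a, b: arrays of shape (length,) or (length, d); 1-D inputs are
    treated as univariate series. Returns total path cost divided by
    path length, i.e., the normalized score used for s_t^rec.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # accumulated-cost matrix with an (inf-padded) border
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # backtrack the optimal alignment path P_t
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        moves = {(i - 1, j): acc[i - 1, j],
                 (i, j - 1): acc[i, j - 1],
                 (i - 1, j - 1): acc[i - 1, j - 1]}
        i, j = min(moves, key=moves.get)
    path.append((0, 0))
    return sum(dist[i, j] for i, j in path) / len(path)
```

Note the phase tolerance this buys: a spike shifted by one step (e.g., `[0,0,1,0]` vs. `[0,1,0,0]`) aligns perfectly under DTW, whereas a pointwise error would penalize it.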

4.1.3. Score Fusion Strategies

From a statistical perspective, combining complementary evidence sources improves the stability of anomaly assessment. In particular, bidirectional prediction provides two temporally consistent views that can be averaged in the overlap region to reduce directional noise, while reconstruction error captures structural deviations. Their fusion yields a more reliable separation between normal fluctuations and truly abnormal segments, which is especially important when estimating the duration of collective anomalies under noise and distribution shifts.
We formulate four fusion schemes to cover different noise regimes. PRED/REC rely on a single source and may be sensitive to branch-specific noise. SUM behaves like an “OR” rule but may increase false alarms when either branch is noisy. MULT acts as an “AND-like” gating that amplifies timesteps where both prediction and reconstruction evidence is strong while suppressing single-source fluctuations, which is particularly beneficial for noisy network-traffic time series.
(1)
PRED: It uses only prediction error, emphasizing trend breaks and sudden spikes, suitable for detecting abrupt attacks such as DDoS.
$S_t^{\mathrm{PRED}} = s_t^{\mathrm{mask}}$
(2)
REC: It uses only reconstruction error, focusing on structural inconsistencies, suitable for detecting local protocol manipulations and context anomalies.
$S_t^{\mathrm{REC}} = s_t^{\mathrm{rec}}$
(3)
SUM: It computes a weighted sum of the two errors, typically with equal weights. This scheme balances sensitivity to both trend and structural anomalies and can be tuned to specific scenarios.
$S_t^{\mathrm{SUM}} = w_{\mathrm{pred}}\, s_t^{\mathrm{mask}} + w_{\mathrm{rec}}\, s_t^{\mathrm{rec}}$
where $w_{\mathrm{pred}} + w_{\mathrm{rec}} = 1$, and the relative importance of the prediction and reconstruction branches can be adjusted on the validation set.
(4)
MULT: It computes the element-wise product of prediction and reconstruction errors, nonlinearly amplifying time steps where both errors are large while suppressing noise in single indicators. This scheme is particularly effective in high-noise environments and for detecting subtle, multi-faceted anomalies such as stealthy infiltration.
$S_t^{\mathrm{MULT}} = \left( s_t^{\mathrm{mask}} + \varepsilon \right) \left( s_t^{\mathrm{rec}} + \varepsilon \right)$
where $\varepsilon > 0$ is a small smoothing constant that prevents numerical underflow. This strategy significantly amplifies anomaly signals when both prediction and reconstruction errors increase simultaneously, thereby enhancing the separability of high-confidence anomalies.
The final point-level composite anomaly score is denoted as S t , and any one of the four fusion methods described above may be selected across different experiments. These fusion schemes allow the model to flexibly adapt to different deployment environments and anomaly types without modifying the underlying network architecture.
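The four fusion schemes reduce to a single per-timestep dispatch, sketched below; the function name `fuse_scores` and its defaults are illustrative assumptions.

```python
import numpy as np

def fuse_scores(s_pred, s_rec, scheme="MULT", w_pred=0.5, eps=1e-8):
    """Combine masked prediction and reconstruction scores per time step.

    scheme: PRED | REC | SUM (weighted, w_pred + w_rec = 1) |
    MULT (AND-like product with a small eps against underflow).
    """
    s_pred = np.asarray(s_pred, float)
    s_rec = np.asarray(s_rec, float)
    if scheme == "PRED":
        return s_pred
    if scheme == "REC":
        return s_rec
    if scheme == "SUM":
        return w_pred * s_pred + (1.0 - w_pred) * s_rec
    if scheme == "MULT":
        return (s_pred + eps) * (s_rec + eps)
    raise ValueError(f"unknown fusion scheme: {scheme}")
```

The MULT branch makes the "AND-like" gating visible: a large score in only one branch is suppressed, while timesteps where both branches agree are amplified.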

4.2. Thresholding and Post-Processing

4.2.1. Static k-Sigma Thresholding

The simplest thresholding strategy assumes that anomaly scores for normal traffic approximately follow a unimodal distribution. In this case, a static threshold is defined as:
$\mathrm{Threshold}_{\mathrm{static}} = \mu + k \cdot \sigma$
where $\mu$ and $\sigma$ are the mean and standard deviation of the anomaly scores on a validation subset, and $k$ is a constant corresponding to a "$3\sigma$/$4\sigma$" rule. Time steps with scores exceeding $\mathrm{Threshold}_{\mathrm{static}}$ are labeled as anomalous.
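The static rule is a one-liner in practice; the helper name `static_threshold` is illustrative.

```python
import numpy as np

def static_threshold(val_scores, k=3.0):
    """k-sigma threshold from validation-score statistics (minimal sketch)."""
    mu, sigma = np.mean(val_scores), np.std(val_scores)
    return mu + k * sigma
```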

4.2.2. Dynamic Cost-Based Thresholding

A fixed k-sigma threshold can be brittle under distribution shift or varying noise levels across network scenarios. We therefore adopt a cost-based dynamic threshold by balancing score separation between normal/anomalous regions and alarm volume to avoid excessive false positives.
The dynamic thresholding method searches for the optimal value of z by minimizing a cost function, where z indicates how many standard deviations the threshold deviates from the mean. Specifically, we consider several factors: the difference between the overall mean and the mean of points below the threshold, denoted as δ mean ; the difference between the overall standard deviation and the standard deviation of points below the threshold, denoted as δ std ; the number of points exceeding the threshold; and the number of consecutive above-threshold segments. Based on these factors, we define a cost function z cost to balance the statistical discrepancy between anomalous and normal points and the amount of points labeled as anomalies. We then search within a given range for the z that minimizes this cost and use it to determine the threshold. The dynamic threshold is defined as:
$\mathrm{Threshold}_{\mathrm{dynamic}} = \mu + z \cdot \sigma$
where $\mu$ is the mean of the error, $\sigma$ is the standard deviation of the error, and $z$ is searched automatically with SciPy's fmin optimizer (scipy.optimize.fmin).
The cost function is defined as:
$z_{\mathrm{cost}}(z) = -\,\frac{\Delta\mu / \mu + \Delta\sigma / \sigma}{\mathrm{num\_above} + \mathrm{num\_sequences}^2}$
where the numerator $\Delta\mu/\mu + \Delta\sigma/\sigma$ measures how well the candidate threshold separates the error distribution into a normal part (below threshold) and a tail (above threshold). Larger $\Delta\mu$ and $\Delta\sigma$ indicate that the below-threshold errors become more normal-like, and minimizing $z_{\mathrm{cost}}$ pushes the optimizer toward thresholds that increase this separation, reducing the risk of setting the threshold too high. Meanwhile, the denominator $\mathrm{num\_above} + \mathrm{num\_sequences}^2$ penalizes alarm volume from two complementary perspectives: num_above, the number of error points above the current threshold, discourages labeling an excessive number of points as anomalous, while num_sequences, the number of consecutive anomalous segments, penalizes the fragmentation of alarms. Squaring num_sequences makes this penalty super-linear, strongly discouraging thresholds that create many disjoint anomalous segments even when num_above is moderate. Together, these terms act against overly sensitive thresholds and jointly control false alarms without sacrificing event coverage.
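A sketch of the cost-based search is given below. For self-containment we substitute a simple grid search over $z$ for the scipy.fmin search used in the paper; the cost follows the definition above and assumes positive, non-constant errors. The function name `dynamic_threshold` and the grid range are illustrative assumptions.

```python
import numpy as np

def dynamic_threshold(errors, z_grid=None):
    """Pick Threshold = mu + z*sigma by minimizing the separation/volume cost.

    Grid search stands in for the scipy.optimize.fmin search of the paper.
    """
    e = np.asarray(errors, float)
    if z_grid is None:
        z_grid = np.arange(1.0, 6.01, 0.25)
    mu, sigma = e.mean(), e.std()
    best_z, best_cost = None, np.inf
    for z in z_grid:
        thr = mu + z * sigma
        above = e > thr
        if not above.any():
            continue  # degenerate: nothing flagged, no separation to measure
        below = e[~above]
        d_mu, d_sigma = mu - below.mean(), sigma - below.std()
        # number of contiguous above-threshold segments (rising edges)
        num_seq = int(np.sum(np.diff(above.astype(int)) == 1) + above[0])
        cost = -(d_mu / mu + d_sigma / sigma) / (above.sum() + num_seq ** 2)
        if cost < best_cost:
            best_z, best_cost = z, cost
    if best_z is None:  # nothing ever exceeded the threshold; use largest z
        best_z = float(z_grid[-1])
    return mu + best_z * sigma, best_z
```

On a toy series of 100 unit errors plus three spikes of 10, any threshold between the two groups yields the same cost, so the search settles on a threshold that flags exactly the three spikes.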

4.2.3. Anomalous Interval Extraction

After thresholding, all time steps with scores exceeding the threshold are initially marked as anomalous. Point-wise anomaly labels often lead to fragmented and noisy alarms, so we apply interval-level post-processing that merges contiguous detections and prunes spurious short segments. This converts point-wise detections into event-like anomaly intervals, providing a direct characterization of anomaly duration that is more actionable and easier to interpret for network operations.
First, consecutive anomalous points ($\hat{y}_t = 1$) are grouped into preliminary intervals $I = \{[a_m, b_m]\}_{m=1}^{M}$. To avoid fragmented alarms, each interval is expanded by a padding parameter $p$:
$[a_m, b_m] \rightarrow [\max(1, a_m - p),\ \min(T, b_m + p)]$
Intervals whose resulting gap is smaller than a merge tolerance $g$ are then merged.
Second, to mitigate false positives, we prune intervals that are unlikely to represent real events. For an interval $[a, b]$, let its length be $l = b - a + 1$ and its anomaly density be
$\rho([a, b]) = \frac{1}{l} \sum_{t=a}^{b} \hat{y}_t$
Intervals with $l < l_{\min}$ or $\rho([a, b]) < \rho_{\min}$ are discarded, where $l_{\min}$ and $\rho_{\min}$ are user-defined hyperparameters.
Finally, we assign each interval an aggregate severity score computed from its constituent point scores. For each remaining interval [ a , b ] , we compute an interval-level severity score to support ranking and triage:
$\mathrm{score}([a, b]) = \mathrm{Agg}\left\{ S_t \right\}_{t=a}^{b}$
where Agg ( · ) can be the maximum, mean, or a robust statistic such as the average of the top-q percentile values. The final output of the system thus consists of a set of anomalous intervals [ s t a r t , e n d , s c o r e ] , where score denotes the interval-level anomaly score. These intervals can be directly consumed by network administrators and downstream security systems for further investigation and mitigation.
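The three post-processing steps can be sketched end to end as follows. The function name `extract_intervals`, the parameter defaults, and the choice of the maximum as the aggregator `Agg` are illustrative assumptions.

```python
import numpy as np

def extract_intervals(labels, scores, pad=2, gap=2, min_len=3, min_density=0.5):
    """Turn point-wise binary labels into padded, merged, pruned intervals.

    Returns a list of (start, end, severity), severity being the max
    point score inside the interval (one choice of the aggregator Agg).
    """
    labels = np.asarray(labels, int)
    T = len(labels)
    idx = np.flatnonzero(labels)
    if idx.size == 0:
        return []
    # 1) contiguous runs of anomalous points, expanded by `pad`
    runs, start = [], idx[0]
    for a, b in zip(idx[:-1], idx[1:]):
        if b != a + 1:
            runs.append([start, a])
            start = b
    runs.append([start, idx[-1]])
    runs = [[max(0, a - pad), min(T - 1, b + pad)] for a, b in runs]
    # 2) merge intervals whose gap is at most `gap`
    merged = [runs[0]]
    for a, b in runs[1:]:
        if a - merged[-1][1] <= gap:
            merged[-1][1] = max(merged[-1][1], b)
        else:
            merged.append([a, b])
    # 3) prune short or low-density intervals, then attach severity
    out = []
    for a, b in merged:
        length = b - a + 1
        density = labels[a:b + 1].mean()
        if length >= min_len and density >= min_density:
            out.append((a, b, float(np.max(scores[a:b + 1]))))
    return out
```

The output triples correspond directly to the [start, end, score] records consumed by downstream systems.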

4.3. Algorithm Implementation

Algorithm 1 illustrates the overall implementation procedure of the proposed Transformer-autoencoder-based unsupervised anomaly detection method. The algorithm takes as input the multivariate training series D train and test series D test , together with the window length L, batch size B, number of training epochs E, loss weights α , β , γ , and the chosen anomaly-score fusion scheme F ( · ) (PRED/REC/SUM/MULT). The outputs are the optimized model parameters θ , the anomaly score sequence, and the corresponding binary labels on D test . At the beginning of the algorithm, the raw traffic is cleaned, normalized, and imputed, then statistical, temporal, and traffic-type features are extracted and compressed by PCA. The processed time series are segmented into overlapping windows of length L, and the Transformer-autoencoder parameters are initialized.
Algorithm 1 Transformer-Autoencoder-based Unsupervised Anomaly Detection
Require: Multivariate training series $D_{\mathrm{train}}$; test series $D_{\mathrm{test}}$; window length $L$; batch size $B$; training epochs $E$; loss weights $\alpha$, $\beta$, $\gamma$; chosen fusion scheme $F(\cdot)$ (PRED/REC/SUM/MULT)
Ensure: Trained model parameters $\theta^{*}$; anomaly scores $\{S_t\}$ and binary labels $\{\hat{r}_t\}$ on $D_{\mathrm{test}}$
1: Preprocessing and window construction
2: Clean, normalize, and fill missing values in $D_{\mathrm{train}}$
3: Extract statistical, temporal, and traffic-type features; apply PCA
4: Segment $D_{\mathrm{train}}$ into overlapping windows $\{W_i\}_{i=1}^{N_{\mathrm{train}}}$ of length $L$
5: Initialize Transformer-autoencoder parameters $\theta$
6: Training
7: for epoch = 1 to $E$ do
8:     Shuffle $\{W_i\}_{i=1}^{N_{\mathrm{train}}}$ and split into mini-batches of size $B$
9:     for each mini-batch $\mathcal{B} = \{W_i\}$ do
10:         Embed each time step and add positional encodings
11:         Encode windows via the Transformer encoder to obtain latent states
12:         Decode via three branches to obtain forward predictions $\hat{y}_t^{\mathrm{fwd}}$, backward predictions $\hat{y}_t^{\mathrm{rev}}$, and reconstructions $\tilde{y}_t$
13:         Compute MSE losses $\mathrm{Loss}_{\mathrm{fwd}}$, $\mathrm{Loss}_{\mathrm{rev}}$, $\mathrm{Loss}_{\mathrm{rec}}$
14:         Compute total loss $\mathrm{Loss} = \alpha\,\mathrm{Loss}_{\mathrm{fwd}} + \beta\,\mathrm{Loss}_{\mathrm{rec}} + \gamma\,\mathrm{Loss}_{\mathrm{rev}}$
15:         Update $\theta$ by backpropagation using the Adam optimizer
16:     end for
17:     Apply learning-rate scheduling and early stopping on validation loss
18: end for
19: Set $\theta^{*} \leftarrow \theta$
20: Detection and threshold selection
21: Apply the same preprocessing and feature extraction to $D_{\mathrm{test}}$
22: Segment $D_{\mathrm{test}}$ into windows $\{W_j^{\mathrm{test}}\}_{j=1}^{N_{\mathrm{test}}}$
23: for each test window $W_j^{\mathrm{test}}$ do
24:     Run the encoder–decoder with parameters $\theta^{*}$
25:     Compute bidirectional prediction errors $e_t^{\mathrm{fwd}}$, $e_t^{\mathrm{rev}}$ and the fused prediction error $s_t^{\mathrm{mask}}$ (average in the overlap region, maximum at the boundaries)
26:     Compute reconstruction error $s_t^{\mathrm{rec}}$
27:     Fuse errors into a unified anomaly score $S_t = F(s_t^{\mathrm{mask}}, s_t^{\mathrm{rec}})$
28: end for
29: Estimate score mean $\mu$ and standard deviation $\sigma$ on validation scores
30: Search for the threshold parameter $z$ via cost-based optimization and set $\mathrm{Threshold} = \mu + z\sigma$
31: for each time step $t$ in $D_{\mathrm{test}}$ do
32:     if $S_t \geq \mathrm{Threshold}$ then
33:         $\hat{r}_t \leftarrow 1$    // anomalous
34:     else
35:         $\hat{r}_t \leftarrow 0$    // normal
36:     end if
37: end for
38: Post-process $\{\hat{r}_t\}$ to merge nearby anomalous points into intervals and compute interval-level scores
39: return $\theta^{*}$, anomaly scores $\{S_t\}$, and labels $\{\hat{r}_t\}$
During the training stage, all training windows are first partitioned into mini-batches of size B. For each mini-batch, every time step is projected into the model space and augmented with positional encodings, then passed through the Transformer encoder to obtain latent representations. The decoder, equipped with forward, backward, and reconstruction branches, generates the one-step-ahead prediction, the one-step-back prediction, and the reconstructed subsequence. The corresponding mean-squared errors are computed for the three branches and combined into a weighted total loss α Loss fwd + β Loss rec + γ Loss rev . Model parameters are updated by backpropagation using the Adam optimizer, while learning-rate scheduling and early stopping on the validation loss are employed to prevent overfitting and stabilize convergence. After all epochs are completed, the final parameters are stored as θ .
In the detection stage, the same preprocessing, feature extraction, and windowing pipeline is applied to D test . Each test window is fed into the trained encoder–decoder with parameters θ to obtain forward and backward predictions as well as reconstructions. These are converted into bidirectional prediction errors and reconstruction errors, which are then aggregated into a unified point-wise anomaly score S t = F ( s t mask , s t rec ) according to the selected fusion strategy. A validation subset or recent sliding horizon is used to estimate the score mean and standard deviation and to search the threshold parameter z via a cost-based optimization, yielding a data-driven decision threshold, Threshold = μ + z σ . Time steps whose scores exceed this threshold are marked as anomalous. Finally, adjacent anomalous points are merged into contiguous intervals, short or low-density segments are pruned, and an interval-level severity score is assigned to each remaining segment. The algorithm thus produces a compact set of abnormal intervals with scores that can be directly consumed by network operators and downstream security systems.

5. Performance Evaluation

This section evaluates the effectiveness and robustness of the proposed unsupervised network traffic anomaly detection model. We first describe the experimental setup, including datasets, baselines, evaluation metrics, and implementation details. We then present quantitative results on three public multivariate time-series datasets, followed by a comparison between detected anomaly intervals and the ground truth. Finally, we present an ablation study on anomaly score fusion strategies.

5.1. Experimental Setup

5.1.1. Hardware and Software Environment

All experiments are conducted on a workstation equipped with an Intel(R) Core(TM) Ultra 7 258V CPU at 2.20 GHz (Intel, Santa Clara, CA, USA) and a 64-bit Windows operating system. The implementation is based on Python 3.11 and the TensorFlow 2.14.1 deep learning framework, using PyCharm Professional 2024.2 as the integrated development environment. Data preprocessing and analysis rely on NumPy 1.26.4, Pandas 2.1.4, and SciPy 1.10.0, while Scikit-learn 1.1.3, TensorFlow 2.14.1, and Keras 2.14.0 are used for machine learning and deep learning utilities. For time-series analysis we employ statsmodels 0.14.4 and Pyts 0.12.0, and Matplotlib 3.9.2 and Seaborn 0.13.2 are used for visualization.

5.1.2. Datasets

To evaluate the generalization capability of the proposed model, we adopt three widely used multivariate time-series benchmark datasets.
SMD (Server Machine Dataset): This dataset contains five weeks of monitoring data collected from 28 different machines, each with 38 sensor measurements sampled every minute. The first five days contain only normal data, while intermittent anomalies are injected into the last five days. The overall anomaly ratio is approximately 5.84%. For each entity, the time series is split into two equal parts: the first half is used for training and the second half for testing, yielding 708,405 training samples, 141,681 validation samples taken from the last 20% of the training data, and 708,420 test samples.
PSM (Pooled Server Metrics): This dataset is derived from internal metrics of multiple application servers at eBay. It provides 13 weeks of training data and 8 weeks of test data, resulting in 132,481 training samples, 26,398 validation samples, and 87,841 test samples.
ASD (Application Server Dataset): ASD is a multivariate time-series dataset collected from a large Internet company for anomaly detection and explanation. It includes 12 server entities, each characterized by 19 metrics related to CPU, memory, network, and virtual machines. The sampling interval is 5 min and the overall anomaly ratio is about 4.61%. The first 30 days are used for training, with the last 30% of training data reserved for validation, and the last 15 days form the test set.

5.1.3. Baseline Methods

We compare the proposed model with five representative unsupervised or semi-supervised time-series anomaly detection methods.
(1)
TadGAN: A GAN-based method that uses LSTM networks as the generator and two critic networks to capture complex temporal dependencies. It reconstructs normal patterns via adversarial training and computes anomaly scores by combining reconstruction errors with critic outputs.
(2)
LSTM Dynamic Threshold: A prediction-based LSTM model with a dynamic thresholding mechanism. A two-layer LSTM predicts future values, and the prediction errors are smoothed by EWMA. A data-driven dynamic threshold is then learned to distinguish anomalous from normal points.
(3)
LSTM Autoencoder: An LSTM autoencoder that learns a compact representation of normal sequences and reconstructs them. Anomalies are detected when the reconstruction error exceeds a preset threshold, based on the assumption that normal data can be accurately reconstructed whereas abnormal data cannot.
(4)
Anomaly Transformer: A Transformer-based unsupervised anomaly detector with a dual-branch anomaly attention mechanism. It models prior and series-wise associations and generates anomaly scores from a weighted combination of reconstruction error and association discrepancy optimized via a minimax strategy.
(5)
Variational Autoencoder (VAE): A probabilistic generative model that learns a latent-variable distribution of normal sequences by optimizing the evidence lower bound. It reconstructs inputs while regularizing the latent space via a KL divergence term, yielding a smooth and structured representation of normal patterns. During inference, anomalies are identified by large reconstruction errors or low data likelihood under the learned latent distribution.
We select baselines that represent complementary methodological families and also help reveal the limitations of single-source detection. Specifically, LSTM Dynamic Threshold is a prediction-only detector, while LSTM Autoencoder and VAE are reconstruction-oriented baselines. Anomaly Transformer is chosen as a strong Transformer-based attention baseline. Finally, we include TadGAN as a representative GAN-based method, which combines reconstruction error with an additional information source. These baselines cover generative, prediction-based, reconstruction-based, probabilistic latent-variable, and attention-based paradigms, providing a comprehensive comparison against the proposed Transformer-autoencoder hybrid framework.

5.1.4. Evaluation Metrics

Following common practice in anomaly detection, we report accuracy, precision, recall, F1 score, true positive rate (TPR), false positive rate (FPR), average precision (AP), and time-to-detect (TTD). Among them, precision and recall characterize the trade-off between false alarms and missed detections, and the F1 score is the harmonic mean of precision and recall, providing a balanced metric. In addition, AP averages precision across recall levels, which is particularly informative under class imbalance. We also report TTD to evaluate detection timeliness for anomalies, defined as the start time of the detected abnormal interval minus the start time of the true abnormal interval.
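For reference, the point-wise precision/recall/F1 computation used throughout can be sketched in a few lines of plain Python; the helper name `point_metrics` is illustrative.

```python
def point_metrics(y_true, y_pred):
    """Precision, recall, and F1 from binary point-wise labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```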

5.1.5. Parameter Settings

In our experiments, when computing the total loss we set α = 0.25, β = 0.5, and γ = 0.25 as the weights of $\mathrm{Loss}_{\mathrm{fwd}}$, $\mathrm{Loss}_{\mathrm{rec}}$, and $\mathrm{Loss}_{\mathrm{rev}}$, respectively. When selecting anomaly thresholds, we employ the dynamic thresholding approach.
The model is trained for 20 epochs with a batch size of 32, using an Adam-based optimizer with an initial learning rate of 0.001. Layer normalization is applied after residual connections, L2 weight decay of 1 × 10 5 is imposed, and two callbacks are employed: (i) ReduceLROnPlateau halves the learning rate when the validation loss plateaus for 5 epochs, and (ii) EarlyStopping terminates training when no improvement is observed on the validation loss for 15 consecutive epochs.

5.2. Overall Detection Performance

After training the proposed model and all baselines on the three datasets, we compute anomaly scores and apply the same thresholding and post-processing strategy for a fair comparison. Table 1 summarizes the quantitative results in terms of accuracy, F1 score, recall, precision, TPR, FPR, AP and TTD.
Across all datasets, the proposed model shows a consistently stronger trade-off between missed detections and false alarms than the baselines: it maintains accuracy above 0.98 and recall above 0.97 while keeping FPR extremely low, and it achieves the best AP, up to 0.93, indicating robust precision-recall performance under severe class imbalance. Beyond correctness, the model is also more timely: its TTD is generally closer to zero, showing that the detected abnormal interval start aligns well with the true onset, whereas many baselines exhibit large-magnitude offsets, either substantial positive delays or highly unstable negative offsets, indicating poor temporal localization of detected abnormal intervals. Baseline methods also often achieve one metric only by sacrificing others. For example, Anomaly Transformer can obtain high recall but may suffer from very low precision and large FPR, resulting in much lower AP and delayed detection. TadGAN and VAE may reach moderate accuracy yet still yield substantially weaker AP and large positive TTD, reflecting slower onset detection. Overall, these results demonstrate that our framework provides a more favorable and practically useful combination of high sensitivity, low false alarms, strong ranking quality, and timely interval onset detection.
Our method integrates bidirectional prediction errors with reconstruction errors, making the anomaly score sensitive to both abrupt deviations and subtle temporal drifts. This design helps reduce missed detections that are common in baselines with weaker temporal modeling or less robust scoring. For example, our recall is consistently high, while TadGAN can miss a large portion of anomalies and VAE may severely under-detect on certain datasets, and LSTM Dynamic Threshold may fail to capture evolving multi-step abnormal episodes. The Transformer’s self-attention further improves recall by capturing long-range dependencies that are difficult for these baselines to model reliably.
Precision benefits from two aspects. On the one hand, the decoder’s masked self-attention prevents future information leakage, avoiding overly optimistic predictions that distort the error distribution. On the other hand, the fusion of prediction and reconstruction views suppresses spurious spikes caused by noise, which reduces false positives. This is reflected not only in our strong precision and low FPR, but also in consistently high AP across datasets (0.870/0.929/0.914), meaning that precision remains strong across a wide range of recall levels. In comparison, reconstruction-only baselines such as LSTM Autoencoder and VAE tend to either over-trigger or poorly separate normal fluctuations from anomalies, leading to much lower AP. Similarly, Anomaly Transformer shows unstable precision under challenging settings, which produces many false alarms and degrades AP, even when recall is high.
Figure 4 compares the F1 scores of all methods on the SMD, PSM, and ASD datasets. The proposed model consistently achieves the highest F1 score on all three benchmarks, reaching 0.931, 0.960, and 0.955 respectively, while the best competing baseline remains below 0.48 on SMD and PSM and below 0.30 on ASD. This large margin indicates that the proposed framework is better at capturing discriminative patterns in multivariate network traffic, and can maintain stable detection performance across different domains and anomaly types. The advantage mainly comes from the multi-head self-attention mechanism and positional encoding, which allow the model to capture long-range temporal dependencies and cross-feature interactions in parallel, thereby modeling both bursty collaboration among multiple IP addresses and gradual distribution shifts over long sequences.
In contrast, the baseline models exhibit several structural limitations that explain their inferior F1 scores. TadGAN shows a delayed response to abrupt traffic changes. Under strongly non-stationary traffic, the generator and discriminator are prone to training imbalance and local optima, resulting in unstable gradients and a poor characterization of normal traffic. LSTM Dynamic Threshold suffers from vanishing or exploding gradients when dealing with long sequences, making it difficult to preserve distant temporal dependencies and leading to both false positives and false negatives. LSTM Autoencoder is not sufficiently sensitive to trend changes. When the traffic pattern is noisy and multi-periodic, the hidden states of LSTM cannot effectively encode complex superposed patterns, so the reconstruction error becomes weak and the model lacks robustness. Although Anomaly Transformer leverages relational priors, it must maintain both prior-association and series-association branches simultaneously, which incurs high memory cost, and its minimax-style objective requires careful tuning of multiple loss weights. In practice, this makes optimization difficult and prevents the model from achieving its theoretical best performance. As a probabilistic reconstruction baseline, VAE regularizes the latent space with a prior, which can overly smooth rare but informative deviations. Meanwhile, its likelihood-based training objective may encourage reconstructing averaged patterns under complex temporal mixtures, reducing the sharpness of anomaly-related reconstruction signals and ultimately limiting F1 performance in challenging traffic regimes.
Overall, the proposed model consistently yields higher accuracy, recall, precision, and F1 scores than all baselines on the three datasets, while maintaining a very low FPR. In addition, it achieves higher AP, indicating more reliable ranking quality under severe class imbalance, and exhibits more favorable TTD, reflecting better temporal alignment between detected abnormal intervals and the true anomaly onset. This demonstrates that the model can effectively detect diverse anomaly patterns in network traffic, and that it achieves a favorable trade-off between sensitivity, robustness, and detection timeliness.
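For reference, the point-wise metrics reported in this comparison (precision, recall/TPR, F1, and FPR) follow directly from the confusion counts. The minimal sketch below illustrates the standard definitions; the function name and plain-Python style are illustrative, not the authors' implementation:

```python
def detection_metrics(y_true, y_pred):
    """Point-wise detection metrics from binary labels and predictions.

    y_true/y_pred are sequences of 0 (normal) and 1 (anomalous).
    Recall equals TPR; FPR is the fraction of normal points flagged.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # = TPR
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, f1, fpr
```

Under severe imbalance, accuracy alone is misleading (a model predicting "normal" everywhere scores highly), which is why F1, FPR, and AP are reported jointly.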

5.3. Effect of Anomaly Duration

After reporting point-wise metrics, our pipeline further performs interval-level post-processing to convert point detections into event-like anomaly intervals. This reduces fragmented alarms and improves interpretability in practice. To further assess the practical usefulness of the proposed method, we conduct a case study that compares the detected anomalous intervals with ground-truth annotations across the three datasets. Figure 5 visualizes a representative SMD time series, where the upper panel highlights the true abnormal regions and the lower panel overlays the filtered detection results. Table 2 lists several typical interval pairs for SMD, PSM, and ASD, showing the correspondence between labeled and detected segments.
On the SMD dataset, the detected abnormal intervals exhibit substantial overlap with the true abnormal regions and often extend slightly beyond the annotated boundaries, indicating that our model can capture not only the core anomalous episodes but also their onset and decay phases around the labeled windows. Similar behavior is observed on the PSM and ASD datasets. The predicted intervals cover most of the annotated abnormal windows and introduce only a few short additional segments, which can be interpreted as transition periods before and after the main anomalous events. These observations suggest that the proposed framework not only achieves strong point-wise detection capability but also produces temporally coherent abnormal regions that align well with expert annotations. Operationally, sustained anomalies typically correspond to prolonged abnormal network behavior, which can accumulate damage more severely than isolated spikes. Our window-based modeling combined with interval-level post-processing is therefore well suited for capturing such collective anomalies, rather than producing fragmented point-wise alarms. The dual-view scoring further improves the stability of segment detection by reducing noise-induced fluctuations, which is critical for estimating anomaly duration in practice.
Combined with the low false positive rate and high recall reported earlier, this interval-level consistency supports the applicability of the method to real-time network monitoring scenarios, where missing critical anomalies is far more costly than raising a small number of manageable false alarms. Accordingly, we distinguish two levels in our pipeline: (i) point-wise anomaly scoring, where the model outputs an anomaly score for each time step based on prediction/reconstruction errors, and (ii) event-level anomaly characterization, where anomaly duration is obtained by grouping consecutive abnormal time steps into contiguous intervals.
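The grouping step in (ii) amounts to a single pass over the binary detection flags. The sketch below is one straightforward way to implement it; `group_intervals` and the optional `min_len` filter are illustrative names, not the exact post-processing code used in the paper:

```python
def group_intervals(flags, min_len=1):
    """Group consecutive abnormal time steps into inclusive [start, end]
    intervals; intervals shorter than `min_len` steps are dropped, which
    suppresses fragmented single-point alarms."""
    intervals, start = [], None
    for t, flag in enumerate(flags):
        if flag and start is None:
            start = t                      # interval opens
        elif not flag and start is not None:
            if t - start >= min_len:
                intervals.append((start, t - 1))
            start = None                   # interval closes
    if start is not None and len(flags) - start >= min_len:
        intervals.append((start, len(flags) - 1))  # open tail interval
    return intervals
```

For example, `group_intervals([0, 1, 1, 0, 1])` yields the two intervals `(1, 2)` and `(4, 4)`; raising `min_len` to 2 keeps only the first.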
It is important to note that anomaly duration and the model's internal response are related but not strictly one-to-one; the internal response is reflected only indirectly, through persistent score elevation. Duration is determined by the temporal continuity of the anomaly score, whereas internal responses are manifested through changes in learned representations and attention patterns that influence the magnitude and persistence of prediction/reconstruction errors. In general, longer anomalies tend to yield sustained error elevation and a more stable separation between normal and abnormal score distributions. However, short but severe anomalies may trigger sharp local responses, while long but mild drifts may cause smaller yet persistent deviations. Our fusion of bidirectional prediction and reconstruction errors provides complementary evidence that stabilizes segment detection.

5.4. Effect of Anomaly Score Fusion Strategies

Our model performs unsupervised time-series anomaly detection based on a Transformer-autoencoder, in which the anomaly-score fusion strategy is one of the key factors affecting detection performance. To investigate its effectiveness, we compare four strategies: element-wise multiplication (MULT), prediction-only (PRED), a weighted sum of prediction and reconstruction errors (SUM), and reconstruction-only (REC). For each strategy, we conduct experiments on the three datasets and report the corresponding anomaly detection results to analyze their impact on model performance.
Since the final anomaly score depends on how prediction and reconstruction evidence is combined, all four fusion strategies are evaluated under the same training and thresholding protocol. Figure 6 compares MULT, PRED, SUM, and REC across the SMD, PSM, and ASD datasets. On SMD, REC achieves the highest F1 score, with MULT very close behind, while PRED and SUM are noticeably lower. On PSM, MULT clearly dominates, followed by PRED, whereas SUM and REC lag by a considerable margin. On ASD, the gap between strategies becomes even more pronounced: MULT attains the best F1 score by a large margin, and the three alternatives (PRED, SUM, REC) perform substantially worse and relatively close to one another. Overall, the bar chart indicates that MULT is the consistently strongest strategy across datasets, showing that the overall gains are not tied to a single ad hoc choice but are supported by a systematic comparison.
Since anomalies are highly imbalanced, Figure 7 further details the behavior of each fusion strategy in terms of ROC and PR curves, which provides a more informative view in rare-event settings. For MULT in Figure 7a, the ROC curves on all three datasets rise steeply toward the top-left corner with high AUC values, and the corresponding PR curves remain close to the upper boundary over a wide recall range, especially on PSM and ASD. This indicates that a large proportion of anomalies can be detected at very low false-positive rates, supporting the effectiveness of MULT under imbalance. Under PRED in Figure 7b, the ROC curves for SMD and PSM remain strong, but the ASD ROC curve flattens considerably, and the PR curves drop more quickly with increasing recall, particularly on SMD and ASD, indicating a sharper precision-recall trade-off and a higher proportion of false alarms when the decision threshold is relaxed. For SUM in Figure 7c, the ROC curves still show acceptable AUCs, yet the PR curves on SMD and PSM are clearly lower and more sensitive to the operating point, while ASD maintains a relatively high PR curve. REC in Figure 7d exhibits an excellent ROC curve on SMD and reasonable performance on PSM, but both ROC and PR curves on ASD are noticeably degraded, reflecting weaker separability between normal and anomalous points.
These differences can be explained by how each fusion strategy exploits prediction and reconstruction errors. In realistic network scenarios, many APT-like attacks simultaneously exhibit temporal deviations like slow drifts or bursty spikes, and local structural anomalies such as protocol-field changes or packet-size distribution shifts. In such cases, both prediction error and reconstruction error tend to increase on truly abnormal segments. MULT leverages this by nonlinearly amplifying points where both errors are large while suppressing points where only one error is elevated, thereby enhancing the contrast between genuine anomalies and single-source noise. When combined with the bidirectional prediction design, MULT also improves sensitivity to anomalies at the beginning and end of sequences, where information is often incomplete for one direction of prediction. This explains why its ROC and PR curves are both strong and stable across datasets, and why it is particularly effective in noisy, dynamic environments where fixed linear weights are hard to tune. PRED, which relies solely on temporal prediction error, is simple and responsive to sudden trend changes, but it cannot distinguish natural volatility from malicious deviations in feature space, leading to higher false positives. REC is well suited to capturing contextual and collective anomalies with clear structural patterns and shows good robustness to sensor noise, but it is less sensitive to isolated point anomalies and long-term drifts. SUM provides a flexible linear combination of the two errors, yet its effectiveness depends heavily on carefully chosen weights. When the error distributions differ significantly, suboptimal weighting can either drown out important anomaly signals or overemphasize benign fluctuations.
Taken together, these results suggest that MULT is a strong default choice for deployment in heterogeneous, attack-rich environments. MULT produces a high anomaly score only when both prediction and reconstruction errors are simultaneously large. This property suppresses false alarms caused by branch-specific fluctuations, while preserving sensitivity to true anomalies that consistently perturb both signals. It achieves adaptive fusion directly through multiplication, reducing reliance on specific patterns in training data and enhancing the model’s generalization capabilities, making it more suitable for dynamic environments. Meanwhile, PRED or REC may be preferable when prior knowledge indicates that anomalies are dominated by pure temporal or pure structural deviations, respectively.
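The four strategies can be sketched in a few lines. The snippet below is a minimal illustration of the fusion logic only; the min-max normalization step and the `alpha` weight for SUM are assumptions for making the two error sources comparable, not details taken from the paper:

```python
import numpy as np

def fuse_scores(pred_err, rec_err, strategy="MULT", alpha=0.5):
    """Combine per-step prediction and reconstruction errors into one
    anomaly score. Errors are min-max normalized first (an assumption
    here) so the two sources live on a comparable scale."""
    def norm(e):
        e = np.asarray(e, dtype=float)
        span = e.max() - e.min()
        return (e - e.min()) / span if span > 0 else np.zeros_like(e)

    p, r = norm(pred_err), norm(rec_err)
    if strategy == "MULT":   # high only when BOTH errors are high
        return p * r
    if strategy == "PRED":   # temporal-deviation evidence only
        return p
    if strategy == "REC":    # structural-deviation evidence only
        return r
    if strategy == "SUM":    # fixed linear weighting, needs tuning
        return alpha * p + (1 - alpha) * r
    raise ValueError(f"unknown strategy: {strategy}")
```

The multiplicative form makes the suppression property explicit: if either normalized error is near zero at a time step, the fused score is near zero regardless of the other branch, which is exactly why single-source fluctuations do not raise alarms.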

5.5. Ablation Studies

To understand the contribution of each component in our framework, we conduct an ablation study by varying key modules in the preprocessing, scoring, and thresholding stages. As reported in Table 3, we evaluate each variant using accuracy, precision, recall, F1 score, TPR/FPR, AP, and TTD.
We first compare the default PCA setting (retaining 95% of the variance) against lower retention ratios (0.90 and 0.80), as well as a configuration with the PCA module removed entirely. When the retention ratio is reduced to 0.90, performance degrades (F1 drops to 0.919 and AP to 0.851) and FPR increases to 0.018. Further reducing it to 0.80 leads to a more pronounced decline, indicating that overly aggressive compression discards informative variations relevant to rare anomalies. Removing PCA results in a sharp performance collapse (F1 = 0.605, AP = 0.433) and a large-magnitude onset misalignment (TTD = −345), indicating substantially worse temporal localization of detected abnormal intervals. Overall, these results show that retaining 95% of the variance provides the best overall balance.
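Selecting components by cumulative explained variance can be sketched with a plain SVD, as below; this is an illustrative NumPy re-implementation of the standard variance-ratio criterion, not the paper's actual preprocessing code:

```python
import numpy as np

def pca_reduce(X, var_ratio=0.95):
    """Project X (n_samples x n_features) onto the fewest principal
    components whose cumulative explained variance reaches `var_ratio`
    (0.95 in the full-model setting). Returns (projected data, k)."""
    Xc = X - X.mean(axis=0)                      # center features
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (s ** 2).sum()                # explained-variance ratios
    k = min(int(np.searchsorted(np.cumsum(var), var_ratio)) + 1, len(s))
    return Xc @ Vt[:k].T, k
```

The same criterion is available directly in scikit-learn as `PCA(n_components=0.95)`; lowering `var_ratio` to 0.90 or 0.80 reproduces the ablated settings.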
We then study the smoothing window used to stabilize the anomaly score sequence. Without masking (smoothing_window = 0), the model preserves recall (0.996) but suffers from an increased FPR (0.034) and reduced AP (0.831), revealing a substantial rise in spurious detections caused by window-boundary effects. With the default 1% mask (full model), FPR is reduced to 0.015 while recall stays at 0.996, demonstrating that the procedure effectively mitigates boundary-induced false positives without noticeably increasing missed detections. Only when the masking level is increased to 2–3% do the metrics become visibly over-optimistic (FPR decreases to 0.007 and F1 rises to 0.965). We therefore emphasize that our top-1% error masking is not a hyper-parameter trick for inflating the reported numbers, nor is it intended to hide failure cases or erase true anomalies; rather, it is a targeted remedy for boundary effects.
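As a concrete illustration of score stabilization, a moving-average filter over the raw anomaly-score sequence damps isolated single-step spikes before thresholding. This sketch assumes a simple uniform kernel; the paper's exact smoothing formulation may differ:

```python
import numpy as np

def smooth_scores(scores, window=5):
    """Moving-average smoothing of a raw anomaly-score sequence.
    A uniform kernel of length `window` averages each point with its
    neighbors, so one-step noise spikes are attenuated while sustained
    elevations (true anomalous episodes) survive."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(scores, dtype=float), kernel, mode="same")
```

With `window = 1` the sequence is returned unchanged, matching the smoothing_window = 0 ablation in spirit: no temporal averaging is applied.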
To justify the use of DTW for reconstruction error evaluation, we additionally compare the DTW-based formulation (full model) with point-wise and area-based error computations. In comparison, point-wise error slightly reduces recall (0.979), leading to lower F1 (0.929), suggesting that strict point-to-point alignment can miss anomalies under temporal misalignment. The area-based variant substantially lowers precision (0.764) and increases FPR (0.032), resulting in a much lower AP (0.761). This suggests that area-style aggregation tends to over-accumulate benign deviations, inflating anomaly scores in noisy regions and increasing false positives. These results support that DTW more effectively handles time shifts and phase offsets that commonly arise in network monitoring streams, yielding more reliable detection and ranking quality.
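The tolerance of DTW to time shifts comes from its elastic alignment: each point may match one or several points in the other sequence. The following is a textbook O(n·m) dynamic-programming distance, shown only to illustrate the mechanism the error formulation relies on; it is not the paper's optimized implementation:

```python
import math

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences.
    D[i][j] holds the minimal cumulative cost of aligning a[:i] with
    b[:j]; each step may advance either sequence or both, which absorbs
    small time shifts and phase offsets."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]
```

A point-wise error would penalize `[1, 2, 2, 3]` against `[1, 2, 3]` even though the shapes agree; DTW assigns it zero cost, which is precisely the misalignment robustness the ablation attributes to the DTW-based formulation.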
Finally, we compare the dynamic thresholding strategy used in the full model with static k-sigma thresholds ( k = 1 and k = 2 ). A static threshold with k = 1 yields almost perfect recall (0.999) but causes precision to collapse to 0.298 and FPR to surge to 0.244. Increasing to k = 2 still underperforms the dynamic threshold in both F1 (0.885) and AP (0.793). This demonstrates that static thresholds are highly sensitive to score distribution shifts and noise, whereas dynamic thresholding maintains a consistently better balance between sensitivity and false-alarm control, which is especially important under severe imbalance and non-stationary behaviors.
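The contrast between the two regimes can be made concrete. The sketch below implements the static k-sigma rule and one simple sliding-window variant of a dynamic threshold; the windowed formulation is an illustrative assumption, not the paper's exact optimization-driven thresholding:

```python
import numpy as np

def static_threshold(scores, k=2.0):
    """Static k-sigma rule: flag points above mean + k * std computed
    once over the whole score sequence."""
    s = np.asarray(scores, dtype=float)
    return s > s.mean() + k * s.std()

def dynamic_threshold(scores, window=50, k=2.0):
    """Sliding-window variant: each point is compared against the mean
    and std of its recent history, so the threshold adapts to score
    distribution shifts instead of being fixed globally."""
    s = np.asarray(scores, dtype=float)
    flags = np.zeros(len(s), dtype=bool)
    for t in range(len(s)):
        hist = s[max(0, t - window):t + 1]  # local history incl. current
        flags[t] = s[t] > hist.mean() + k * hist.std()
    return flags
```

When the score distribution drifts, the global mean and std of the static rule no longer describe the local regime, inflating either FPR (k too small) or missed detections (k too large); the windowed statistics track the drift instead.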
In summary, the ablation results in Table 3 validate the rationale behind our module choices. Collectively, these findings confirm that the final configuration is driven by interpretable and reproducible component contributions.

6. Conclusions

In this paper, we have investigated unsupervised temporal anomaly detection for network traffic under practical constraints such as scarce labels and severe class imbalance, and proposed a Transformer-autoencoder-based anomaly detection framework that integrates the reconstruction capability of autoencoders with the forecasting capability of regression models to perform bidirectional prediction and reconstruction in a unified encoder–decoder architecture. The model effectively captures both long-range temporal dependencies and local structural patterns. To improve reliability in dynamic and noisy environments, we further introduced multi-source anomaly scoring with fusion strategies (especially the MULT scheme) and adopted both statistical and optimization-driven dynamic thresholding to balance false alarms and missed detections. Extensive experiments on three public benchmarks demonstrate the effectiveness of the proposed method, which achieves F1 scores of 0.931/0.960/0.955 on SMD/PSM/ASD, respectively, with a false positive rate as low as 0.015 on SMD, substantially outperforming several representative baselines including TadGAN, LSTM-based predictor/autoencoder variants, Anomaly Transformer, and VAE. Overall, the results indicate that the proposed framework can robustly detect diverse anomaly types in complex traffic patterns under highly imbalanced and dynamically varying conditions.

Nevertheless, we acknowledge that a more comprehensive operational assessment would benefit from additional event-based metrics. In particular, future work will incorporate false alarms per hour to better quantify alarm reliability in real deployments, along with a more systematic threshold-sensitivity study to assess robustness under varying noise levels and distribution shifts. These extensions will further strengthen the practical interpretability of the proposed framework.

Author Contributions

Conceptualization, J.L. and L.L.; methodology, X.Y. and Y.L.; software, H.Z. and F.Z.; validation, D.L., T.D. and T.Y.; formal analysis, X.Y. and Y.L.; resources, J.L. and T.Y.; data curation, H.Z. and F.Z.; writing—original draft preparation, L.L., J.L. and X.Y.; writing—review and editing, L.L., D.L. and T.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded in part by the Key Research and Development Program of Hubei Province, China under Grant 2024BAB016 and in part by the Key Research and Development Program of Hubei Province, China under Grant 2025BAB051.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available on request from the authors.

Conflicts of Interest

Author Jieke Lu was employed by the Electric Power Research Institute of Guangxi Power Grid Co., Ltd. Author Tong Yu was employed by State Grid Hubei Information and Communication Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Attack scenario diagram.
Figure 2. Model structure.
Figure 3. Encoder–decoder architecture.
Figure 4. Comparison of F1 scores across three datasets for different models.
Figure 5. Comparison of true anomalies and detected anomalies.
Figure 6. Comparison of F1 scores for different anomaly scoring fusion strategies.
Figure 7. ROC and PR curves under four fusion strategies on three datasets.
Table 1. Performance metrics comparison across three datasets for different models.

| Dataset | Model | Accuracy | F1 Score | Recall | Precision | TPR | FPR | AP | TTD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SMD | Ours | 0.986 | 0.931 | 0.996 | 0.873 | 0.996 | 0.015 | 0.870 | −1.8 |
| SMD | TadGAN | 0.906 | 0.329 | 0.244 | 0.504 | 0.244 | 0.025 | 0.215 | 355.7 |
| SMD | LSTM Dynamic Threshold | 0.675 | 0.327 | 0.843 | 0.203 | 0.843 | 0.343 | 0.186 | −8847.2 |
| SMD | LSTM Autoencoder | 0.677 | 0.329 | 0.843 | 0.204 | 0.843 | 0.341 | 0.187 | −8202.5 |
| SMD | Anomaly Transformer | 0.511 | 0.277 | 0.996 | 0.161 | 0.996 | 0.539 | 0.145 | 427.8 |
| SMD | VAE | 0.896 | 0.197 | 0.135 | 0.360 | 0.138 | 0.025 | 0.175 | 441.5 |
| PSM | Ours | 0.977 | 0.960 | 0.970 | 0.949 | 0.970 | 0.020 | 0.929 | −194.29 |
| PSM | TadGAN | 0.449 | 0.435 | 0.762 | 0.305 | 0.762 | 0.672 | 0.355 | −6313.2 |
| PSM | LSTM Dynamic Threshold | 0.438 | 0.475 | 0.914 | 0.321 | 0.914 | 0.746 | 0.317 | −34,958.6 |
| PSM | LSTM Autoencoder | 0.492 | 0.459 | 0.774 | 0.326 | 0.774 | 0.616 | 0.350 | −8049.4 |
| PSM | Anomaly Transformer | 0.485 | 0.269 | 0.341 | 0.223 | 0.341 | 0.459 | 0.201 | −5015.8 |
| PSM | VAE | 0.410 | 0.420 | 0.769 | 0.289 | 0.748 | 0.697 | 0.333 | −11,563 |
| ASD | Ours | 0.990 | 0.955 | 1.000 | 0.941 | 1.000 | 0.011 | 0.914 | −5.14 |
| ASD | TadGAN | 0.533 | 0.298 | 0.953 | 0.176 | 0.953 | 0.515 | 0.173 | −391.6 |
| ASD | LSTM Dynamic Threshold | 0.810 | 0.180 | 0.201 | 0.162 | 0.201 | 0.120 | 0.124 | −75.7 |
| ASD | LSTM Autoencoder | 0.280 | 0.218 | 0.969 | 0.123 | 0.969 | 0.800 | 0.122 | −2271.3 |
| ASD | Anomaly Transformer | 0.513 | 0.188 | 0.542 | 0.113 | 0.542 | 0.490 | 0.107 | −213.4 |
| ASD | VAE | 0.268 | 0.215 | 0.969 | 0.121 | 0.969 | 0.813 | 0.121 | −2271.3 |
Table 2. Comparison table of partial true anomalies and detected anomaly intervals.

| Dataset | True Abnormal Intervals | Detected Abnormal Intervals |
| --- | --- | --- |
| SMD | [15,849, 16,368], [16,963, 17,517], [18,071, 18,528], [19,367, 20,088], [20,786, 21,195], [24,679, 24,682], [26,114, 26,116], [27,554, 27,556] | [15,847, 17,779], [17,844, 17,924], [18,010, 18,657], [18,719, 18,719], [19,259, 20,156], [24,273, 25,349], [26,075, 26,431], [27,457, 28,058] |
| PSM | [139,542, 139,873], [139,877, 139,879], [140,279, 140,629], [142,920, 142,925], [142,932, 142,944], [143,003, 143,004], [145,048, 145,049], [147,693, 147,694] | [139,542, 139,879], [140,279, 140,629], [141,608, 141,608], [142,920, 143,004], [144,063, 144,081], [145,048, 145,049], [145,399, 145,418], [147,685, 147,732] |
| ASD | [760, 766], [1064, 1299], [2758, 2773], [2874, 2886], [3012, 3026], [3160, 3306], [3626, 3639] | [760, 766], [1064, 1299], [2758, 2773], [2838, 2886], [3012, 3026], [3160, 3306], [3626, 3639] |
Table 3. Ablation study results.

| Module Setting | Accuracy | F1 Score | Recall | Precision | TPR | FPR | AP | TTD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full Model | 0.986 | 0.931 | 0.996 | 0.873 | 0.996 | 0.015 | 0.870 | −1.8 |
| PCA = 0.9 | 0.984 | 0.919 | 0.996 | 0.854 | 0.996 | 0.018 | 0.851 | −1.8 |
| PCA = 0.8 | 0.974 | 0.878 | 0.996 | 0.785 | 0.996 | 0.028 | 0.783 | 0 |
| PCA removed | 0.878 | 0.605 | 0.996 | 0.435 | 0.996 | 0.134 | 0.433 | −345 |
| smoothing_window = 0 | 0.986 | 0.931 | 0.996 | 0.855 | 0.996 | 0.034 | 0.831 | −10.7 |
| smoothing_window = 0.02 | 0.989 | 0.943 | 0.996 | 0.895 | 0.996 | 0.012 | 0.892 | −1.8 |
| smoothing_window = 0.03 | 0.993 | 0.965 | 0.996 | 0.935 | 0.996 | 0.007 | 0.932 | −1.8 |
| rec_error_type = point | 0.986 | 0.929 | 0.979 | 0.883 | 0.979 | 0.013 | 0.867 | −1.8 |
| rec_error_type = area | 0.971 | 0.865 | 0.996 | 0.764 | 0.996 | 0.032 | 0.761 | −1.8 |
| static threshold k = 1 | 0.779 | 0.459 | 0.999 | 0.298 | 0.999 | 0.244 | 0.377 | −32.4 |
| static threshold k = 2 | 0.976 | 0.885 | 0.996 | 0.796 | 0.996 | 0.027 | 0.793 | −1.8 |