Bidirectional Temporal Attention Convolutional Networks for High-Performance Network Traffic Anomaly Detection

Wang, Feng; Huang, Yufeng; Shi, Yifei

doi:10.3390/info17010061


by Feng Wang¹,²,*, Yufeng Huang¹ and Yifei Shi¹

¹ State Key Laboratory of Rail Transit Vehicle System, Southwest Jiaotong University, Chengdu 610031, China
² Informatization and Network Management Office, Southwest Jiaotong University, Chengdu 610031, China
* Author to whom correspondence should be addressed.

Information 2026, 17(1), 61; https://doi.org/10.3390/info17010061

Submission received: 29 November 2025 / Revised: 1 January 2026 / Accepted: 5 January 2026 / Published: 9 January 2026


Abstract

Deep learning-based network traffic anomaly detection, particularly using Recurrent Neural Networks (RNNs), often struggles with high computational overhead and difficulties in capturing long-range temporal dependencies. To address these limitations, this paper proposes a Bidirectional Temporal Attention Convolutional Network (Bi-TACN) for robust and efficient network traffic anomaly detection. Specifically, dilated causal convolutions with expanding receptive fields and residual modules are employed to capture multi-scale temporal patterns while effectively mitigating the vanishing gradient problem. Furthermore, a bidirectional structure integrated with Efficient Channel Attention (ECA) is designed to adaptively weight contextual features, preventing sparse attack indicators from being overwhelmed by dominant normal traffic. A Softmax-based classifier then leverages these refined representations to execute high-performance anomaly detection. Extensive experiments on the NSL-KDD and UNSW-NB15 datasets demonstrate that Bi-TACN achieves average accuracies of 88.51% and 82.5%, respectively, significantly outperforming baseline models such as Bi-TCN and Bi-GRU in terms of both precision and convergence speed.

Keywords:

network traffic; anomaly detection; recurrent neural networks; temporal convolutional networks; attention mechanism

1. Introduction

Modern network infrastructures face an evolving threat landscape characterized by sophisticated attack vectors that include zero-day exploits, polymorphic malware, and advanced persistent threats (APTs). Network traffic anomaly detection, which is the task of identifying malicious patterns in communication flows, has emerged as a critical defense mechanism [1,2]. Although anomaly detection methods based on a recurrent architecture have demonstrated utility in practical applications, they often fail to exploit network traffic features when facing diverse and evolving attack vectors, thereby compromising both detection efficiency and accuracy [3,4]. Consequently, developing a high-performance anomaly detection algorithm capable of fully extracting traffic features is crucial for enhancing network security defenses.

As network traffic is fundamentally time-series data containing evolving information, modeling its temporal dynamics is critical for detecting anomalies [5]. In recent years, machine learning has emerged as the dominant technique in the field of time-series anomaly detection. Based on the availability of labels in the training data, machine learning methods are primarily categorized into two classes: unsupervised learning and supervised learning [6]. Typical unsupervised methods, including Isolation Forest [7], K-Nearest Neighbors (KNN) [8], and Principal Component Analysis (PCA) [9], eliminate the dependency on annotated data. However, when confronting complex network attack scenarios, these methods often exhibit compromised detection accuracy and are prone to high false alarm rates [10]. In contrast, supervised learning methods leverage known attack patterns to achieve superior accuracy, establishing themselves as the preferred solution for intrusion detection.

Among supervised methods, Recurrent Neural Networks (RNNs) have gained prominence due to their inherent ability to capture temporal dependencies, making them well suited to analyzing network traffic anomalies [11]. RNNs capture temporal associations in data through layer-by-layer recurrent processing [12,13]. However, they are susceptible to the vanishing gradient problem, making it difficult to capture long-term temporal features [14]. To overcome this challenge, Long Short-Term Memory (LSTM) [15] has been widely adopted in time series forecasting and anomaly detection. Sahoo et al. [16] demonstrated that the gating mechanisms of LSTM effectively mitigate the limitations of standard RNNs, leading to superior prediction accuracy. Nevertheless, the large number of parameters in LSTM results in excessively long training times [13]. To improve efficiency, the Gated Recurrent Unit (GRU) [17] was introduced. The GRU features a simplified architecture with update and reset gates that selectively regulate the flow of information [18]. This mechanism significantly reduces the number of parameters, effectively enhancing detection efficiency.

The anomaly detection methods mentioned above all adopt a unidirectional feature propagation structure, which can extract temporal feature information only from the past and not from the future, leading to insufficient information utilization. To overcome this limitation, bidirectional network structures emerged, aiming to enhance detection performance by simultaneously learning past and future time series information. Jiang et al. [19] proposed a network model based on bidirectional neural networks. Specifically, a Bidirectional GRU (Bi-GRU) was constructed to separately learn temporal sequence information from both past and future directions, effectively improving the performance of network traffic anomaly detection. However, this approach not only increased the complexity of the chain structure and the training time but also remained insufficiently sensitive to long sequences. Subsequently, Chen et al. [20] proposed an anomaly detection model based on Bidirectional Temporal Convolutional Networks (Bi-TCN). The integration of dilated causal convolutions and residual connections enables the Bi-TCN to effectively capture the features of long-term historical data, thereby accelerating the training process and significantly enhancing its generalizability. Although Bi-TCN improved generalization ability, it neglects the varying importance of different traffic features for anomaly detection. Consequently, sparse attack indicators are prone to being overwhelmed by dominant normal traffic features, which compromises both detection efficiency and accuracy.

In recent years, deep learning models based on Transformers [21] and Graph Neural Networks (GNNs) [22] have demonstrated superior performance in domains such as fault diagnosis and anomaly detection. Nevertheless, they still encounter significant challenges in the practical application of network traffic anomaly detection. Transformer architectures typically rely on global self-attention mechanisms [23], where the computational complexity grows quadratically with the sequence length, resulting in substantial computational overhead and memory consumption. Similarly, graph-based methods require the construction of complex topological structures, which introduces significant pre-processing latency when handling high-speed streaming data. Given the stringent requirements for real-time response and deployment in resource-constrained environments like edge devices in network security defense, exploring more efficient and lightweight architectures remains a pivotal research direction. Therefore, this paper focuses on exploring the potential of temporal convolutional architectures, aiming to overcome the feature extraction bottlenecks of RNNs without sacrificing detection speed.

This paper proposes a high-performance network traffic anomaly detection algorithm based on the Bidirectional Temporal Attention Convolutional Network (Bi-TACN). First, to address the vanishing gradient problem and the difficulty of extracting long-sequence features, the method extracts time series information with dilated causal convolutions and widens the range of information extraction through dilated convolution kernels and residual modules. Second, because existing models do not distinguish how much each feature contributes to detection, the method adopts a bidirectional structure to fully extract contextual features and introduces the Efficient Channel Attention (ECA) module [24] to adaptively weight them, thereby accurately extracting the key features of anomaly states. Finally, a detection model based on Bi-TACN is constructed, and its advantages in feature extraction capability and anomaly detection accuracy are verified through comparative experiments with existing typical methods.

The main contributions of this paper are summarized as follows:

(1): The integration of dilated causal convolution and a residual module is capable of effectively capturing long-range temporal sequence information, thereby mitigating the vanishing gradient issues inherent in traditional RNNs.
(2): The Bidirectional structure fully extracts contextual feature information of network traffic, and the ECA module is introduced to assign weights to contextual features.
(3): A network traffic anomaly detection model based on Bi-TACN is constructed, and comparative experiments are conducted with existing typical methods on public datasets to verify its feature extraction capability and anomaly detection accuracy.

The rest of the paper is structured as follows: Section 2 describes the components underlying the Bi-TACN; Section 3 presents the proposed formulation; Section 4 presents the complete experimental setup, including the experiment description, the evaluation indicators, the ablation experiment, and the comparative performance analysis; finally, Section 5 concludes the paper and highlights its main contributions.

2. Related Work

The Bi-TACN extracts time-series information of network traffic by extending dilated causal convolution and residual module, and introduces a bidirectional structure and ECA module to fully utilize and accurately extract network traffic features, thereby achieving network traffic anomaly detection. The crucial related components of the Bi-TACN are detailed below.

2.1. Dilated Causal Convolution

The architecture of conventional convolution presents a challenge when processing time series data. Specifically, by learning local data features at a certain time point, the output at the current time step inadvertently incorporates features from the subsequent time step. This violates the temporal order of the features within the time series sequence [25]. To address this issue, causal convolution is introduced for time series data processing, which constrains the convolution operation along the temporal dimension [26]. Figure 1 illustrates the dilated causal convolution module, which comprises an input layer, two hidden layers, and an output layer.

Given a network traffic time series $X = \{x_{1}, x_{2}, \dots, x_{T}\}$, the temporal features at time $t$ can only be extracted from $\{x_{1}, x_{2}, \dots, x_{t}\}$ and must not rely on the future time series $\{x_{t+1}, x_{t+2}, \dots, x_{T}\}$. Causal convolution utilizes one-sided padding to ensure that the input sequence length is identical to the output sequence length, thereby preventing the use of localized future data for prediction. Its formulation is given by

$$(F * X)(x_{t}) = \sum_{k=1}^{K} f_{k}\, x_{t-K+k} \tag{1}$$

where $F = \{f_{1}, f_{2}, \dots, f_{K}\}$ represents the causal convolution filter, $K$ is the filter size, and $x_{t-K+k}$ is the sampled element.

However, causal convolution can only expand its receptive field by increasing the size of the convolution kernel. Since a larger kernel size typically introduces more parameters, the dilated convolution kernel is employed within the causal convolution framework [27]. The dilated convolution kernel samples feature information at a specific interval (or spacing), which increases the range of information covered by the kernel. Furthermore, the dilated kernel does not change the size of the causal convolution kernel. Its formulation is given by

$$(F_{d} * X)(x_{t}) = \sum_{k=1}^{K} f_{k}\, x_{t-(K-k)d} \tag{2}$$

where $F_{d} = \{f_{1}, f_{2}, \dots, f_{K}\}$ is the dilated convolution filter, $d$ denotes the dilation rate, and $x_{t-(K-k)d}$ represents the element after interval sampling. Setting $d = 1$ recovers Equation (1).
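The two formulas can be checked on a toy sequence. The following is a minimal sketch in plain Python (not the PyTorch implementation used in the experiments) that evaluates Equation (2) directly, treating samples before the start of the sequence as zeros to mimic one-sided padding. The function names `dilated_causal_conv` and `receptive_field` are illustrative, and the receptive-field formula assumes one convolution per stacked layer.

```python
def dilated_causal_conv(x, f, d=1):
    """Evaluate Eq. (2): y_t = sum_{k=1..K} f_k * x_{t-(K-k)d}.

    x is a list of scalars, f the filter of size K, d the dilation rate.
    Indices that fall before the start of x are treated as zeros
    (left-side padding), so the output has the same length as x.
    """
    K = len(f)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(1, K + 1):
            idx = t - (K - k) * d      # position of the sampled element
            if idx >= 0:               # skip the zero-padded region
                acc += f[k - 1] * x[idx]
        y.append(acc)
    return y


def receptive_field(K, dilations):
    """Receptive field of stacked dilated causal layers.

    Assumes one convolution of size K per layer; each layer with
    dilation d widens the field by (K - 1) * d time steps.
    """
    return 1 + (K - 1) * sum(dilations)


x = [1, 2, 3, 4, 5]
print(dilated_causal_conv(x, [1, 1], d=1))   # [1.0, 3.0, 5.0, 7.0, 9.0]
print(dilated_causal_conv(x, [1, 1], d=2))   # [1.0, 2.0, 4.0, 6.0, 8.0]
print(receptive_field(2, [1, 2, 4]))         # 8
```

With $d = 1$ each output adds the current sample and its immediate predecessor; with $d = 2$ it adds the current sample and the one two steps back, so the same two-tap filter reaches further into the past without extra parameters.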

2.2. Residual Structure

As the network depth increases, training becomes increasingly challenging. This difficulty is primarily attributed to the vanishing gradient phenomenon, which tends to occur during the Stochastic Gradient Descent (SGD) process due to the multi-layer backpropagation of error signals. However, the residual module effectively addresses the training difficulties associated with deep networks [28], as illustrated in Figure 2. The residual mapping is constructed by combining identity connections with dilated causal convolution layers and activation layers. These modules are stacked within the model and can be formulated as follows:

$$x_{l+1} = x_{l} + H(x_{l}) \tag{3}$$

where $x_{l}$ denotes the output features of the $l$-th layer, and $H(x_{l})$ denotes the residual mapping, which is composed of dilated causal convolutions and activation functions.

According to Equation (3), the input can bypass the operations of the feature extraction layer and is directly added to the output feature of the feature extraction layer. Furthermore, the residual mapping accelerates the model’s convergence speed and enhances stability, thereby facilitating the optimization process.
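The way the identity connection eases gradient flow can be made explicit with a short derivation. The sketch below assumes the blocks are stacked exactly as in Equation (3), with no extra activation after the addition; the symbols $\mathcal{L}$ (the training loss), $L$ (the index of the last block), and $I$ (the identity matrix) are introduced here for illustration and do not appear in the original text. Applying Equation (3) repeatedly from block $l$ to block $L$ gives

$$x_{L} = x_{l} + \sum_{i=l}^{L-1} H(x_{i})$$

and differentiating the loss with respect to $x_{l}$ by the chain rule yields

$$\frac{\partial \mathcal{L}}{\partial x_{l}} = \frac{\partial \mathcal{L}}{\partial x_{L}} \left( I + \frac{\partial}{\partial x_{l}} \sum_{i=l}^{L-1} H(x_{i}) \right)$$

The identity term passes the gradient from block $L$ back to block $l$ unchanged, so the signal reaching early layers does not shrink to zero merely because the convolutional term is small.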

2.3. Bidirectional Structure

The aforementioned model only utilizes the unidirectional features of network traffic, resulting in incomplete feature extraction. To address this limitation, a bidirectional structure [19] is introduced. This structure additionally processes the network traffic time series in the reverse direction using a second, independent function, thereby capturing the global features of the network traffic time series.

As illustrated in Figure 3, to capture the complete temporal context, we construct a bidirectional architecture consisting of two independent dilated causal convolution branches: a forward branch c and a reverse branch h. Crucially, these two branches do not share parameters, allowing them to learn distinct feature patterns adapted to their respective temporal directions. The learning process is jointly optimized, meaning that both branches are updated simultaneously by backpropagation. The specific learning and updating principles are as follows:

For the processed network traffic time series $X = \{x_{0}, x_{1}, \dots, x_{N}\}$, the original sequence is first reversed to obtain the reversed sequence:

$$X' = \{x_{N}, \dots, x_{1}, x_{0}\} \tag{4}$$

Subsequently, the reversed sequence is fed into a reverse TCN (Temporal Convolutional Network) for training to learn the network traffic features:

$$c' = \mathrm{TCN}(X') \tag{5}$$

where $c'$ represents the reverse network traffic features.

Finally, an activation function is applied to the reversed features to perform a non-linear transformation, yielding the activated network traffic features:

$$h' = \sigma(W c' + b) \tag{6}$$

where $h'$ is the activated network traffic feature, $W$ is the weight matrix, $b$ is the bias, and $\sigma$ is the activation function.
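The reverse-and-process scheme of Equations (4) to (6) can be illustrated with a small plain-Python sketch. Several simplifications here are assumptions rather than the authors' implementation: a single causal convolution stands in for each TCN branch, ReLU with unit weight and zero bias stands in for the activation $\sigma(W c' + b)$, the reverse features are flipped back so that both branches line up on the same time index, and the two branches are combined by per-step concatenation, as Section 3.2 describes for the fusion function. The names `causal_conv` and `bidirectional_features` are hypothetical.

```python
def causal_conv(x, f, d=1):
    """Single dilated causal convolution (stand-in for one TCN branch)."""
    K = len(f)
    return [
        sum(f[k] * x[t - (K - 1 - k) * d]
            for k in range(K) if t - (K - 1 - k) * d >= 0)
        for t in range(len(x))
    ]


def relu(values):
    """Stand-in activation; the paper's sigma, W and b are not specified here."""
    return [max(0.0, v) for v in values]


def bidirectional_features(x, f_fwd, f_rev, d=1):
    """Return per-time-step (forward, reverse) feature pairs.

    The two branches use separate filters, i.e. they share no parameters.
    """
    h_fwd = relu(causal_conv(x, f_fwd, d))   # sees x_0 .. x_t only
    x_rev = x[::-1]                          # Eq. (4): reversed sequence
    c_rev = causal_conv(x_rev, f_rev, d)     # Eq. (5): reverse branch
    h_rev = relu(c_rev)                      # Eq. (6): activation
    h_rev_aligned = h_rev[::-1]              # assumption: re-align to forward time
    return list(zip(h_fwd, h_rev_aligned))   # concatenation per time step


pairs = bidirectional_features([1, 2, 3, 4], f_fwd=[1, 1], f_rev=[1, 1])
print(pairs)
```

For the input `[1, 2, 3, 4]` with two-tap filters of ones, the forward feature at step $t$ is $x_{t-1} + x_{t}$ and the aligned reverse feature is $x_{t} + x_{t+1}$, so each time step receives context from both sides even though each branch is strictly causal in its own direction.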

2.4. Efficient Channel Attention (ECA)

Attention mechanisms have shown immense potential in improving model performance [29]. However, most existing methods focus on designing increasingly complex attention modules to achieve better performance, which inevitably increases model complexity and computational overhead. To reduce computational load while simultaneously preventing model overfitting, a lightweight and low-complexity module called ECA is introduced. ECA can not only generate weights for each channel but also learn the correlations among different channels. For time series data, key features are assigned larger weights while irrelevant features are assigned smaller weights. Through the backpropagation process, the model learns to automatically assign larger weights to key temporal features while suppressing irrelevant noise. Therefore, ECA focuses on useful information and enhances the network’s sensitivity to critical features. Figure 4 illustrates the structure of the ECA module.

The ECA module first performs Global Average Pooling (GAP) to capture the global contextual information of each channel. For channel $c$, the GAP process can be expressed as

$$\mathrm{GAP}(X)_{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{cij} \tag{7}$$

where $C$, $H$, and $W$ represent the number of channels, height, and width of the input feature map $X$, respectively, and $X_{cij}$ represents the $(i, j)$-th element of channel $c$ in the input feature map $X$. The process outputs a $C$-dimensional vector that reflects the average response of each channel.

Subsequently, the ECA module captures the local cross-channel dependencies using a 1D convolution layer. The ECA attention mechanism dynamically determines the kernel size $k$ of the 1D convolution layer. This adaptability ensures that ECA can accommodate different numbers of channels and capture local dependencies between channels more effectively. The kernel size $k$ can be calculated by

$$k = \left| \frac{\log_{2} C + b}{\gamma} \right|_{\mathrm{odd}} \tag{8}$$

where $|\cdot|_{\mathrm{odd}}$ denotes the nearest odd number, $b$ is the bias, and $\gamma$ is the hyperparameter controlling the range of local dependencies. A smaller $\gamma$ value yields a larger kernel size, encompassing a broader range of channel dependencies. This paper adopts the default hyperparameter settings from the original paper [24], specifically setting $\gamma = 2$ and $b = 1$. The output of the 1D convolution layer is then represented as

$$\mathrm{Conv1D}(\mathrm{GAP}(X)) = F_{\mathrm{Conv1D}}(\mathrm{GAP}(X)) \tag{9}$$

where $F_{\mathrm{Conv1D}}$ denotes the 1D convolution operation with a kernel size of $k$.

To facilitate the adaptive learning of cross-channel correlations within the model, the output of the 1D convolution layer undergoes a non-linear transformation using the Sigmoid activation function, which yields the attention weight vector $A$:

$$A = \sigma(\mathrm{Conv1D}(\mathrm{GAP}(X))) \tag{10}$$

Finally, the attention weight vector $A$ is multiplied element-wise by the original input feature map $X$ to achieve channel attention rescaling. The recalibrated feature map output is calculated as follows:

$$Y_{cij} = A_{c} \cdot X_{cij} \tag{11}$$

where $Y_{cij}$ is the recalibrated feature map output, with $c \in \{1, \dots, C\}$, $i \in \{1, \dots, H\}$, and $j \in \{1, \dots, W\}$.

Not all features are equally important in complex network traffic, since capturing a longer history often introduces more noise from irrelevant features. The ECA module addresses this limitation by assigning adaptive weights to different feature channels. Therefore, while Bi-TCN ensures the model has sufficient historical information, ECA enables the model to focus on the most critical features within that information. This combination improves the representation ability of the model without significantly increasing the complexity.
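Equations (7) to (11) can be traced end to end in a few lines. The following plain-Python sketch is illustrative only and makes several assumptions: the feature map is stored as `C` channels of length `T` (the height and width of Equation (7) collapsed into one temporal axis), the cross-channel convolution uses zero padding and no bias, and its weights are passed in as a list standing for learned parameters. The helper `eca_kernel_size` rounds Equation (8) by truncating and then bumping even values up by one, which is one common way to obtain an odd kernel size; the paper does not state which rounding it uses.

```python
import math


def eca_kernel_size(C, gamma=2, b=1):
    """Eq. (8) with truncate-then-make-odd rounding (an assumed convention)."""
    t = int(abs((math.log2(C) + b) / gamma))
    return t if t % 2 == 1 else t + 1


def eca(X, weights):
    """Apply channel attention to X, a list of C channels of equal length.

    weights: odd-length list of 1D convolution weights (hypothetical
    learned parameters). Returns (Y, A): rescaled map and channel weights.
    """
    C, k = len(X), len(weights)
    pad = k // 2
    # Eq. (7): global average pooling, one value per channel.
    gap = [sum(channel) / len(channel) for channel in X]
    # Eq. (9): 1D convolution across neighbouring channels, zero padded.
    conv = []
    for c in range(C):
        acc = 0.0
        for j in range(k):
            idx = c + j - pad
            if 0 <= idx < C:
                acc += weights[j] * gap[idx]
        conv.append(acc)
    # Eq. (10): sigmoid turns the responses into weights in (0, 1).
    A = [1.0 / (1.0 + math.exp(-v)) for v in conv]
    # Eq. (11): rescale every element of channel c by A[c].
    Y = [[A[c] * v for v in X[c]] for c in range(C)]
    return Y, A


print(eca_kernel_size(64))    # 3
print(eca_kernel_size(128))   # 5

X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
Y, A = eca(X, weights=[0.0, 1.0, 0.0])
print(A)   # channels with larger average response get larger weights
```

The module adds only `k` parameters regardless of the number of channels, which is the sense in which the attention is lightweight.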

3. Proposed Formulation

The operational workflow of the Bi-TACN algorithm is illustrated in Figure 5 and consists of the following steps:

First, data preprocessing is performed on the raw dataset to generate the required input data for subsequent model utilization.

Subsequently, the Bi-TACN model is designed to fully integrate the advantages of dilated causal convolution, residual connections, the bidirectional structure, and the ECA attention mechanism, thereby extracting comprehensive network traffic feature information.

Next, the extracted feature information is fused and input into the Softmax layer to complete the training of the Bi-TACN model.

Finally, based on the aforementioned data preprocessing and Bi-TACN model training, anomaly traffic detection is performed.

3.1. Data Preprocessing

To verify the practical effect of the proposed algorithm in network traffic anomaly detection, the standard NSL-KDD benchmark dataset [30] is selected to evaluate its performance. Before the data are fed into the model, data preprocessing is performed, which primarily consists of two stages: feature encoding and normalization. First, data type conversion is performed on non-numerical feature data. In the NSL-KDD dataset, each traffic sample contains 38 numerical features and 3 symbolic features. Since non-numerical features cannot be directly used as input for the anomaly detection model, One-Hot Encoding is used to convert the non-numerical features in the dataset into numerical types [31]. Following this operation, the sample dimension is expanded to 122. Subsequently, missing value handling and normalization are performed on the numerical feature data. The 122-dimensional feature data are first checked for missing values, and any missing values found are removed. Normalization [32] is then performed, which uses a unified scale to eliminate differences between features and maps all numerical values in the dataset to the range $[0, 1]$, as shown in Equation (12):

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{12}$$

where $X_{norm}$ represents the normalized value of the data, and $X_{min}$ and $X_{max}$ represent the minimum and maximum values of the input feature, respectively.
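The encoding and normalization stages can be sketched on a toy table. The plain-Python example below is a minimal illustration, not the authors' preprocessing script: the three-column records are invented, rows containing a missing value (`None`) are dropped as one possible reading of the removal step, and a constant column is mapped to zeros to avoid division by zero in Equation (12). The function names are hypothetical.

```python
def one_hot(values):
    """Encode a symbolic column; categories are ordered alphabetically."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories


def min_max(column):
    """Eq. (12): map a numerical column to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # assumption: constant column becomes 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def preprocess(rows, symbolic_idx):
    """Drop incomplete rows, one-hot symbolic columns, scale the rest."""
    rows = [r for r in rows if all(v is not None for v in r)]
    out_columns = []
    for i, column in enumerate(zip(*rows)):
        if i in symbolic_idx:
            encoded, _ = one_hot(column)
            out_columns.extend(zip(*encoded))   # one new column per category
        else:
            out_columns.append(min_max(column))
    return [list(r) for r in zip(*out_columns)]


# Invented records: (duration, protocol, bytes); the last row is incomplete.
records = [
    [0, "tcp", 10],
    [5, "udp", 20],
    [10, "tcp", 30],
    [3, None, 5],
]
for row in preprocess(records, symbolic_idx={1}):
    print(row)
```

Here the single symbolic column expands into two indicator columns, in the same way that the three symbolic NSL-KDD features expand the sample from 41 to 122 dimensions.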

3.2. Bi-TACN Pretrain

First, the preprocessed network traffic sequence is reversed. The forward and reverse sequences are, respectively,

$$\begin{cases} X = \{x_{0}, x_{1}, \dots, x_{N}\} \\ X' = \{x_{N}, \dots, x_{1}, x_{0}\} \end{cases} \tag{13}$$

Subsequently, the forward and reverse network traffic sequences are separately input into the Bi-TACN model for training. After being processed by dilated causal convolution, these sequences are connected to residual connections, which preserves important intrinsic features while avoiding the problems of vanishing gradient or overfitting. Following this, key features are extracted in the ECA attention module and fused to integrate the forward network traffic features and the reverse network traffic features. This fusion process is formulated as

$$H = \mathrm{Fusion}(h, h') \tag{14}$$

where $H$ is the fused feature, and $\mathrm{Fusion}$ is the fusion function, which concatenates the forward network traffic features and the reverse network traffic features.

Next, the fused feature is input into the Softmax classifier [33], which can be expressed as

$$y = \mathrm{Softmax}(W_{H} H + b_{H}) \tag{15}$$

where $W_{H}$ represents the classifier’s weight, and $b_{H}$ represents the classifier’s bias.

During training, the Cross-Entropy Loss Function [34] is used for parameter updating. Assuming the true label $x$ is a $C$-dimensional vector, the formula for the Cross-Entropy Loss Function is

$$L(x, y) = -\sum_{i=1}^{C} x_{i} \log y_{i} \tag{16}$$

where $x_{i}$ represents the $i$-th element of the true label, and $y_{i}$ represents the predicted probability that the sample belongs to the $i$-th class.
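Equations (15) and (16) can be checked numerically. The following plain-Python sketch assumes the logits $W_{H} H + b_{H}$ have already been computed and that the true label is one-hot; subtracting the maximum logit before exponentiating and adding a small `eps` inside the logarithm are standard numerical safeguards added here, not details taken from the paper.

```python
import math


def softmax(logits):
    """Eq. (15) applied to precomputed logits; shifted for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(x, y, eps=1e-12):
    """Eq. (16): x is the true label vector, y the predicted probabilities."""
    return -sum(xi * math.log(yi + eps) for xi, yi in zip(x, y))


probs = softmax([2.0, 1.0, 0.1])          # hypothetical logits for 3 classes
print(probs)
print(cross_entropy([1, 0, 0], probs))    # small loss: class 0 is favoured
print(cross_entropy([0, 0, 1], probs))    # larger loss: class 2 is unlikely
```

Because only the entry of the true class is non-zero in a one-hot label, the loss reduces to the negative log-probability that the model assigns to the correct class.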

4. Experimental Validation

4.1. Experiment Description

The NSL-KDD dataset is an updated version of the KDD Cup 1999 dataset, which effectively avoids the problem of redundant records found in the KDD Cup 1999 dataset. In the NSL-KDD dataset, as shown in Table 1, in addition to “Normal” network traffic, there are four types of anomaly network traffic, which are DoS, Probe, R2L, and U2R. This paper uses 80% of the KDDTrain+ dataset for training and 20% for validation, and randomly samples each category from the KDDTest+ dataset for testing.

It should be noted that the training, validation, and testing of the proposed algorithm are conducted using PyCharm Professional v2024.1 on a Windows 11 operating system, with the deep learning environment built via Anaconda utilizing the PyTorch 2.3.0 framework. The underlying hardware platform features an Intel Core i5-12600KF CPU, an NVIDIA GeForce RTX 4060 GPU, and 32 GB of RAM.

4.2. Evaluation Indicators

To evaluate the performance of the proposed Bi-TACN algorithm in network traffic anomaly detection, this paper uses the following five metrics [35]: Accuracy, Precision, Recall, F1-score, and AUC-ROC (Area Under the Receiver Operating Characteristic Curve) [36]. The specific formulas for the first four are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{17}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{18}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{19}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{20}$$

where TP represents the number of anomaly network traffic instances correctly identified by the algorithm, FP represents the number of normal network traffic instances incorrectly identified as anomaly by the algorithm, FN represents the number of anomaly network traffic instances not correctly identified by the algorithm, and TN represents the number of normal network traffic instances correctly identified by the algorithm.
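Equations (17) to (20) translate directly into code. The sketch below is a minimal plain-Python illustration; the confusion counts are invented for the example and are not results from the paper, and returning 0.0 when a denominator is zero is an assumed convention. For the multi-class setting used later, the same function could be applied per class by treating that class as positive and all others as negative.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute Eqs. (17)-(20) from confusion counts.

    Ratios with a zero denominator are reported as 0.0 (an assumption).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 8 attacks caught, 5 missed, 2 false alarms, 85 normal.
metrics = binary_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this invented example accuracy is 0.93 even though recall is only about 0.62, which shows why accuracy alone can hide missed detections when normal traffic dominates.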

4.3. Ablation Experiment

To verify the influence of the bidirectional structure and the ECA module on the model’s detection accuracy, an ablation study was carried out, comparing the full Bi-TACN algorithm against two variants: Bi-TACN without the bidirectional structure and Bi-TACN without the ECA module. Each experiment was repeated five times with different random seeds to ensure the reliability of the results and minimize stochastic effects. The specific parameter settings for the Bi-TACN algorithm are shown in Table 2.

The results of the ablation experiment are shown in Figure 6. It can be seen that the Bi-TACN algorithm utilizing the bidirectional structure improved average detection accuracy by 2.1%, 1.3%, and 1.9% on the training set, validation set, and test set, respectively. Since the Bi-TACN algorithm without the bidirectional structure still contains the ECA module, the improvement in detection accuracy can be attributed to the contribution of the bidirectional structure to the Bi-TACN algorithm. The bidirectional structure does not solely consider subsequent network traffic, but also comprehensively considers the correlation information with preceding network traffic. This indicates that the bidirectional structure provides the Bi-TACN algorithm with multi-angle contextual analysis capability, fully capturing network traffic features and enhancing the algorithm’s detection ability.

At the same time, compared with the Bi-TACN algorithm without the ECA module, the Bi-TACN algorithm containing the ECA module improved its average detection accuracy by 1.7%, 1.7%, and 2.3% on the training set, validation set, and test set, respectively. The two algorithm architectures are completely identical; the only difference lies in the inclusion of the ECA module. Analyzing the operational process of the algorithm reveals that the ECA module enhances the feature representation ability of the Bi-TACN algorithm in network traffic anomaly detection by assigning higher attention weights to the anomaly state data in the network traffic, while simultaneously ensuring that it is not interfered with by redundant information. This verifies that the ECA module can help the Bi-TACN algorithm effectively extract key features of the anomaly state from network traffic information, thereby correctly classifying the anomaly types of network traffic.

4.4. Performance Analysis

To further verify the network traffic anomaly detection performance of the proposed algorithm, comparative experiments were specially conducted against typical algorithms, including classic algorithms such as LSTM [16], GRU [37], Bi-GRU [19], and Bi-TCN [20]. A comprehensive comparison of the traffic anomaly detection performance of different typical algorithms was made, covering both convergence speed and accuracy evaluation metrics. It is worth noting that to ensure the fairness and consistency of the evaluation and comparison, all baseline methods compared in this study were re-implemented and evaluated under the exact same experimental conditions.

The training set accuracy curves for different typical algorithms are shown in Figure 7. It can be seen that, as the number of iterations increases, the training set accuracy of all algorithms shows an overall upward trend. Specifically, the curve for the Bi-TACN algorithm’s training set data has converged and reached its best accuracy after 6 iterations. Compared with other classic algorithms, its advantage of faster convergence is relatively obvious. Further analysis reveals that the Bi-TACN algorithm possesses stronger network traffic feature extraction capabilities, which accelerates the convergence speed and improves algorithm performance.

The comprehensive detection results of different typical algorithms are shown in Table 3. Comparison reveals that the proposed Bi-TACN algorithm achieves significant improvement across all evaluation metrics.

In terms of average accuracy, Bi-TACN reached 88.5%, outperforming LSTM, GRU, Bi-GRU, and Bi-TCN by 7.2%, 5.1%, 2.9%, and 2.1%, respectively. This significant improvement indicates that the proposed algorithm has a superior ability to make correct overall judgments. Similarly, Bi-TACN demonstrated substantial gains in precision, achieving 89.4%. Compared to the baseline models, this represents an improvement ranging from 5.0% to 9.0%, which effectively reduces the false alarm rate in anomaly detection. For recall, the Bi-TACN algorithm reached 89.6%, showing performance enhancements of 17.5%, 13.1%, 8.2%, and 7.3% compared to LSTM, GRU, Bi-GRU, and Bi-TCN, respectively. These results highlight the model’s high sensitivity in identifying the majority of positive instances, thereby minimizing missed detections. Notably, the F1-score stood at 89.5%, which is 6.2% to 17.4% higher than those of the other typical algorithms, indicating its excellent capability in handling both false alarm and missed detection scenarios simultaneously. Furthermore, the AUC-ROC reached 0.87, further confirming the superior discriminative capability of the Bi-TACN architecture. This metric underscores the model’s effectiveness in distinguishing between normal and anomalous traffic patterns.

To evaluate the classification capabilities of the model across different traffic types, we further analyze the confusion matrix in Figure 8, and present the detailed accuracy, precision, recall, and F1-score in Table 4.

The experimental results indicate that the model maintained an accuracy of over 90% across all classes. For the ‘Normal’ and ‘DoS’ categories, which have the largest sample sizes, the model demonstrated robust performance. Specifically, the ‘Normal’ category achieved a Recall of 99.3% with an F1-score of 90.6%, while the ‘DoS’ class reached an F1-score of 96.7% and a Precision of 98.0%. In detecting the ‘R2L’ class, the model achieved a Precision of 94.8% and a Recall of 82.2%, resulting in an F1-score of 88.1%. However, despite the high overall accuracy, the Recall rates for ‘Probe’ and ‘U2R’ are lower, at 44.94% and 58.50%, respectively. This indicates that the model is less sensitive in capturing minority attack patterns.

In summary, the Bi-TACN algorithm outperforms other typical algorithms across all evaluation metrics, indicating that the Bi-TACN algorithm possesses superior network traffic detection performance.

4.5. Generalization Verification

Although the NSL-KDD dataset is the most widely used for network intrusion detection, it contains outdated attack types and traffic patterns that no longer align with contemporary network environments. In contrast, the UNSW-NB15 [38] dataset provides a hybrid of real modern normal activities and synthetic contemporary attacks, which better reflects real-world network behavior. Therefore, this study further adopts this dataset to validate the generalization capability of the proposed model.

Experimental results demonstrate that the proposed algorithm achieves an overall accuracy of 82.5% on the UNSW-NB15 dataset, confirming its effectiveness in handling complex and class-imbalanced modern network traffic. As shown in the confusion matrix in Figure 9, the model exhibits exceptional recognition capabilities for majority classes with large sample sizes. Specifically, the “Generic” class achieves the highest accuracy of 98.10%, with the vast majority of samples correctly classified. Furthermore, “Normal” traffic reaches a classification accuracy of 94.62%; such high-precision filtering of normal traffic is crucial for reducing the false positive rate (FPR) of intrusion detection systems.

Despite the strong overall performance, the confusion matrix reveals significant misclassifications in certain categories. The detection accuracy for “DoS” remains relatively low, with a majority of samples misidentified as “Exploits,” suggesting that these two categories share highly similar characteristics and are difficult to distinguish within the current feature space. Similarly, a substantial number of “Fuzzers” samples are misclassified as either “Exploits” or “DoS.” Regarding minority classes, while “Worms” are detected effectively, this comes at the cost of a high false alarm rate. For “Analysis” and “Backdoors,” the model struggles to extract discriminative features due to data scarcity, resulting in limited identification capabilities.

5. Conclusions

This paper proposes a high-performance network traffic anomaly detection algorithm based on Bi-TACN. The Bi-TACN leverages a bidirectional temporal convolutional structure to comprehensively extract contextual feature information, and incorporates the ECA module to adaptively weigh these features. This design ensures the precise extraction of key features reflecting anomaly states, thereby achieving superior anomaly detection performance. Through ablation experiments and comparative experiments conducted on the NSL-KDD dataset, and generalization verification based on the UNSW-NB15, the principal findings are summarized as follows:

(1): The Bi-TACN integrates dilated causal convolutions to expand the range of information extraction, which facilitates the capture of multi-scale temporal patterns. The associated residual modules effectively mitigate the vanishing gradient problem and accelerate model convergence.
(2): The bidirectional structure integrated with the ECA module helps the proposed Bi-TACN extract key features of anomaly states from network traffic while suppressing interference from redundant information, thereby correctly classifying the anomaly types of network traffic.
(3): Comparison with classical baseline algorithms verifies that the proposed Bi-TACN algorithm converges faster and achieves superior network traffic anomaly detection performance.
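To make the first finding concrete, the sketch below shows how stacked dilated causal convolutions widen the receptive field, using the kernel size (3) and dilation factors (1, 2, 4) listed in Table 2. It assumes one convolution per layer; a residual block containing two convolutions would correspond to `convs_per_layer=2`. The function names are hypothetical, and this pure-Python version is meant only to illustrate the operation, not to reproduce the authors' implementation.

```python
def receptive_field(kernel_size, dilations, convs_per_layer=1):
    """Time steps (current plus past) visible at the top of a dilated causal stack."""
    return 1 + convs_per_layer * (kernel_size - 1) * sum(dilations)


def causal_dilated_conv1d(x, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - j * dilation], with zeros before t = 0.

    Only current and past samples contribute, so the operation is causal.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    print(receptive_field(3, [1, 2, 4]))  # one convolution per layer

    # Push a unit impulse through three layers; the number of non-zero
    # outputs equals the receptive field of the stack.
    signal = [1.0] + [0.0] * 19
    for d in (1, 2, 4):
        signal = causal_dilated_conv1d(signal, [1.0, 1.0, 1.0], d)
    print(sum(1 for v in signal if v != 0.0))
```

Under these assumptions the three-layer stack covers 15 time steps. A backward branch can be obtained by applying the same operation to the time-reversed sequence, which is one common way to realize a bidirectional structure.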

Finally, we recognize the challenges posed by the severe class imbalance in intrusion detection datasets. In this work, we prioritized evaluating the model’s intrinsic architecture and did not apply explicit data balancing techniques such as oversampling or cost-sensitive learning. Future work will focus on integrating data augmentation methods to further enhance the detection rates of rare attack types.
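As an illustration of what cost-sensitive learning could look like, the sketch below derives inverse-frequency class weights from the KDDTrain+ counts in Table 1. This weighting was not used in the present work; the formula `total / (num_classes * count)` is one common heuristic among several, and the function name is hypothetical.

```python
# KDDTrain+ sample counts from Table 1.
TRAIN_COUNTS = {"Normal": 67343, "DoS": 45927, "Probe": 11656, "R2L": 995, "U2R": 52}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


if __name__ == "__main__":
    for name, weight in balanced_class_weights(TRAIN_COUNTS).items():
        print(f"{name:>6}: {weight:.3f}")
```

With these counts, ‘U2R’ receives a weight several orders of magnitude larger than ‘Normal’, so such weights would typically be clipped or combined with resampling in practice.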

Author Contributions

Methodology, F.W.; Software, Y.H.; Validation, Y.H.; Writing—original draft, Y.H.; Writing—review & editing, F.W. and Y.S.; Supervision, F.W.; Funding acquisition, F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China, China (grant number 51875481), and the National Natural Science Foundation of Sichuan, China (grant number 23NSFSC0370).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study, the NSL-KDD benchmark dataset, are publicly available and open-access.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fotiadou, K.; Velivassaki, T.-H.; Voulkidis, A.; Skias, D.; Tsekeridou, S.; Zahariadis, T. Network Traffic Anomaly Detection via Deep Learning. Information 2021, 12, 215.
2. Javaheri, D.; Gorgin, S.; Lee, J.-A.; Masdari, M. Fuzzy Logic-Based DDoS Attacks and Network Traffic Anomaly Detection Methods: Classification, Overview, and Future Perspectives. Inf. Sci. 2023, 626, 315–338.
3. Narmadha, S.; Balaji, N.V. Improved Network Anomaly Detection System Using Optimized Autoencoder-LSTM. Expert Syst. Appl. 2025, 273, 126854.
4. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures. Neural Comput. 2019, 31, 1235–1270.
5. Zavrak, S.; Iskefiyeli, M. Flow-Based Intrusion Detection on Software-Defined Networks: A Multivariate Time Series Anomaly Detection Approach. Neural Comput. Appl. 2023, 35, 12175–12193.
6. Chowdhury, R.; Sen, S.; Goswami, A.; Purkait, S.; Saha, B. An Implementation of Bi-Phase Network Intrusion Detection System by Using Real-Time Traffic Analysis. Expert Syst. Appl. 2023, 224, 119831.
7. Gao, J.; Ozbay, K.; Hu, Y. Real-Time Anomaly Detection of Short-Term Traffic Disruptions in Urban Areas through Adaptive Isolation Forest. J. Intell. Transp. Syst. 2025, 29, 269–286.
8. Chen, X.; Yuan, Z.; Feng, S. Anomaly Detection Based on Improved k-Nearest Neighbor Rough Sets. Int. J. Approx. Reason. 2025, 176, 109323.
9. Baldoni, S.; Battisti, F. Histogram-Based Network Traffic Representation for Anomaly Detection through PCA. Comput. Netw. 2025, 265, 111276.
10. Ness, S.; Eswarakrishnan, V.; Sridharan, H.; Shinde, V.; Janapareddy, N.V.P.; Dhanawat, V. Anomaly Detection in Network Traffic Using Advanced Machine Learning Techniques. IEEE Access 2025, 13, 16133–16149.
11. Yin, C.; Zhang, S.; Wang, J.; Xiong, N.N. Anomaly Detection Based on Convolutional Recurrent Autoencoder for IoT Time Series. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 112–122.
12. Rezk, N.M.; Nordström, T.; Ul-Abdin, Z. Shrink and Eliminate: A Study of Post-Training Quantization and Repeated Operations Elimination in RNN Models. Information 2022, 13, 176.
13. Ren, X.; Gu, H.; Wei, W. Tree-RNN: Tree Structural Recurrent Neural Network for Network Traffic Classification. Expert Syst. Appl. 2021, 167, 114363.
14. Noh, S.-H. Analysis of Gradient Vanishing of RNNs and Performance Comparison. Information 2021, 12, 442.
15. Wan, X.; Liu, H.; Xu, H.; Zhang, X. Network Traffic Prediction Based on LSTM and Transfer Learning. IEEE Access 2022, 10, 86181–86190.
16. Sahoo, B.B.; Jha, R.; Singh, A.; Kumar, D. Long Short-Term Memory (LSTM) Recurrent Neural Network for Low-Flow Hydrological Time Series Forecasting. Acta Geophys. 2019, 67, 1471–1481.
17. Nasreen Fathima, A.H.; Ibrahim, S.P.S.; Khraisat, A. Enhancing Network Traffic Anomaly Detection: Leveraging Temporal Correlation Index in a Hybrid Framework. IEEE Access 2024, 12, 136805–136824.
18. Ullah, S.; Chen, X.; Han, H.; Wu, J.; Dong, J.; Liu, R.; Ding, W.; Liu, M.; Li, Q.; Qi, H.; et al. A Novel Hybrid Ensemble Approach for Wind Speed Forecasting with Dual-Stage Decomposition Strategy Using Optimized GRU and Transformer Models. Energy 2025, 329, 136739.
19. Jiang, L.; Zhang, D.; Zhu, Y.; Zhang, X. Abnormal Traffic Detection Based on a Fusion BiGRU Neural Network. In Advances in Swarm Intelligence; Springer: Cham, Switzerland, 2023; Volume 13969, pp. 232–245.
20. Chen, J.; Lv, T.; Cai, S.; Song, L.; Yin, S. A Novel Detection Model for Abnormal Network Traffic Based on Bidirectional Temporal Convolutional Network. Inform. Softw. Tech. 2023, 157, 107166.
21. Tran, T.M.; Bui, D.C.; Nguyen, T.V.; Nguyen, K. Transformer-Based Spatio-Temporal Unsupervised Traffic Anomaly Detection in Aerial Videos. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 8292–8309.
22. Ahmad Awan, K.; Ud Din, I.; Almogren, A.; Han, Z.; Guizani, M. TrustAware-GNN: Graph-Neural-Network-Based Trust Management for IoT Anomaly Detection. IEEE Internet Things J. 2025, 12, 37670–37681.
23. Xiang, L.; Bing, H.; Li, X.; Hu, A. A Frequency Channel-Attention Based Vision Transformer Method for Bearing Fault Identification across Different Working Conditions. Expert Syst. Appl. 2025, 262, 125686.
24. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
25. Song, Y.; Luktarhan, N.; Shi, Z.; Wu, H. TGA: A Novel Network Intrusion Detection Method Based on TCN, Bi-GRU and Attention Mechanism. Electronics 2023, 12, 2849.
26. Zhang, R.; Sun, F.; Song, Z.; Wang, X.; Du, Y.; Dong, S. Short-Term Traffic Flow Forecasting Model Based on GA-TCN. J. Adv. Transport. 2021, 2021, 1338607.
27. Liang, H.; Cao, J.; Zhao, X. Multi-Sensor Data Fusion and Bidirectional-Temporal Attention Convolutional Network for Remaining Useful Life Prediction of Rolling Bearing. Meas. Sci. Technol. 2023, 34, 105126.
28. Park, K.; Soh, J.W.; Cho, N.I. A Dynamic Residual Self-Attention Network for Lightweight Single Image Super-Resolution. IEEE Trans. Multimed. 2023, 25, 907–918.
29. Zhi, J.; Song, T.; Yu, K.; Yuan, F.; Wang, H.; Hu, G.; Yang, H. Multi-Attention Module for Dynamic Facial Emotion Recognition. Information 2022, 13, 207.
30. Protić, D.D. Review of KDD Cup ‘99, NSL-KDD and Kyoto 2006+ Datasets. Vojnoteh. Glas. 2018, 66, 580–596.
31. Li, Y.; Xu, Y.; Cao, Y.; Hou, J.; Wang, C.; Guo, W.; Li, X.; Xin, Y.; Liu, Z.; Cui, L. One-Class LSTM Network for Anomalous Network Traffic Detection. Appl. Sci. 2022, 12, 5051.
32. Duan, X.; Fu, Y.; Wang, K. Network Traffic Anomaly Detection Method Based on Multi-Scale Residual Classifier. Comput. Commun. 2023, 198, 206–216.
33. Wei, G.; Wang, Z. Adoption and Realization of Deep Learning in Network Traffic Anomaly Detection Device Design. Soft Comput. 2021, 25, 1147–1158.
34. Kim, T.-Y.; Cho, S.-B. Web Traffic Anomaly Detection Using C-LSTM Neural Networks. Expert Syst. Appl. 2018, 106, 66–76.
35. Hooshmand, M.K.; Hosahalli, D. Network Anomaly Detection Using Deep Learning Techniques. CAAI Trans. Intell. Technol. 2022, 7, 228–243.
36. Nawaz, A.; Abu Ali, N.; Ahmad, A. Fall Detection System Using Tabular GAN for Data Augmentation with Integration of Isolation Forest Model. Appl. Soft Comput. 2025, 185, 113931.
37. He, Y.; Chen, R.; Li, X.; Hao, C.; Liu, S.; Zhang, G.; Jiang, B. Online At-Risk Student Identification Using RNN-GRU Joint Neural Networks. Information 2020, 11, 474.
38. Fathima, A.; Khan, A.; Uddin, M.F.; Waris, M.M.; Ahmad, S.; Sanin, C.; Szczerbicki, E. Performance Evaluation and Comparative Analysis of Machine Learning Models on the UNSW-NB15 Dataset: A Contemporary Approach to Cyber Threat Detection. Cybern. Syst. 2025, 56, 1160–1176.

Figure 1. Dilated causal convolution modules.

Figure 2. Residual module.

Figure 3. Bidirectional structure.

Figure 4. Efficient channel attention module.

Figure 5. Bi-TACN algorithm flow chart.

Figure 6. Accuracy comparison for Bi-TACN algorithm ablation test.

Figure 7. Training set accuracy curves of different typical algorithms.

Figure 8. Confusion matrix for NSL-KDD dataset classification.

Figure 9. Confusion matrix for UNSW-NB15 dataset classification.

Table 1. Distribution of Samples in KDDTrain+ and KDDTest+ Datasets.

Category	KDDTrain+	KDDTest+	Total
Normal	67,343	9711	77,054
DoS	45,927	7458	53,385
Probe	11,656	2421	14,077
R2L	995	2754	3749
U2R	52	200	252
Total	125,973	22,544	148,517

Table 2. Parameters of the Proposed Algorithm.

Parameter	Value
Optimizer	Adam
Batch size	64
Training epochs	50
Learning rate	0.001
Number of layers	3
Dilation factors	1, 2, 4
Kernel size	$1 \times 3$
Number of channels	[32, 64, 128]
Dropout rate	0.5

Table 3. Detection results of different typical algorithms on NSL-KDD dataset.

Method	Accuracy (%)	Variance	Precision (%)	Recall (%)	F1-Score (%)	AUC-ROC
LSTM	$81.30 \pm 1.21$	1.46	$80.39 \pm 1.21$	$72.06 \pm 1.28$	$72.05 \pm 1.21$	0.81
GRU	$83.37 \pm 1.32$	1.74	$81.49 \pm 1.18$	$76.47 \pm 1.35$	$78.90 \pm 1.29$	0.82
Bi-GRU	$85.65 \pm 1.25$	1.56	$84.36 \pm 1.32$	$81.42 \pm 1.35$	$82.68 \pm 1.14$	0.84
Bi-TCN	$86.38 \pm 1.24$	1.54	$84.26 \pm 1.24$	$82.33 \pm 0.91$	$83.28 \pm 1.01$	0.86
Bi-TACN	$88.51 \pm 1.27$	1.61	$89.36 \pm 1.19$	$89.59 \pm 1.29$	$89.47 \pm 1.22$	0.87

Table 4. Classification performance of different traffic types on the NSL-KDD dataset.

Class	Accuracy (%)	Precision (%)	Recall (%)	F1 (%)
Normal	91.1	83.3	99.3	90.6
DoS	97.8	98.0	95.5	96.7
Probe	93.6	90.9	44.9	60.1
U2R	99.6	93.6	58.5	72.0
R2L	97.3	94.8	82.2	88.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
