Anomaly Detection in Imbalanced Network Traffic Using a ResCAE-BiGRU Framework

Nong, Xiaofeng; Qin, Kuangyu; Xie, Xingliu

doi:10.3390/sym17122087

by Xiaofeng Nong ¹, Kuangyu Qin ²,³ and Xingliu Xie ²,⁴,*

¹ Network and Information Center, Guilin Tourism University, Guilin 541006, China
² Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China
³ Network and Information Technology Center, Guilin University of Electronic Technology, Guilin 541004, China
⁴ School of Information Engineering, Guilin University, Guilin 541006, China
* Author to whom correspondence should be addressed.

Symmetry 2025, 17(12), 2087; https://doi.org/10.3390/sym17122087

Submission received: 7 October 2025 / Revised: 9 November 2025 / Accepted: 2 December 2025 / Published: 5 December 2025

(This article belongs to the Section Computer)

Abstract

To address the critical challenge of low detection rates for rare anomaly classes in network traffic, a problem exacerbated by severe data imbalance, this paper proposes a deep learning framework for anomaly detection in imbalanced network traffic. Initially, the framework employs the Isolation Forest (iForest) and SMOTE-Tomek techniques for outlier removal and data balancing, respectively, to enhance data quality. The model first undergoes unsupervised pre-training using a symmetrically designed Residual Convolutional Autoencoder (ResCAE) to learn robust feature representations. Subsequently, the pre-trained encoder is integrated with a Bidirectional Gated Recurrent Unit (BiGRU) to capture temporal dependencies within the traffic features. During the fine-tuning phase, a Sharpness-Aware Minimization (SAM) optimizer is employed to enhance the model’s generalization capability. The experimental results on the public CICIDS2017 and UNSW-NB15 datasets reveal the model’s outstanding performance, achieving an accuracy, precision, recall, and F1-score of 99.33%, 99.53%, 99.33%, and 99.41%, respectively. Comparative analysis against baseline models confirms that the proposed method not only surpasses traditional machine learning algorithms but also holds a significant advantage over contemporary deep learning models. The results validate that this framework effectively resolves the issue of low detection rates for rare anomaly classes caused by data imbalance, offering a powerful and robust solution for building high-performance anomaly detection frameworks.

Keywords:

traffic anomaly detection; autoencoder; deep learning; Bidirectional Gated Recurrent Unit (BiGRU); convolutional neural networks (CNN)

1. Introduction

With the continuous expansion of network infrastructures, the security challenges confronting computer networks have become increasingly severe. Network traffic anomaly detection serves as a critical mechanism for ensuring network stability and performance. By monitoring and analyzing network traffic data in real time, potential network disruptions and anomalous behaviors can be promptly identified, thereby enhancing the overall reliability of computer networks. In this context, anomaly detection serves as the core technical methodology, identifying statistical deviations from a learned baseline of normal behavior. These anomalies, or statistical deviations, can represent significant network events, such as flash crowds, equipment failures, or novel traffic patterns that differ from the norm. Effectively identifying them is critical for network monitoring and management. Emerging network phenomena, such as high-volume traffic events (e.g., flash crowds or DDoS events), can render critical infrastructure services unavailable by exhausting system resources, potentially resulting in significant economic losses. Although traditional security mechanisms are widely deployed, the evolution of attack vectors and the expansion of the attack surface have given rise to cyber threats that are increasingly large-scale, persistent, and destructive. Traditional machine learning methods, such as Decision Trees (DTs) [1], Naive Bayes (NB) [2], and Support Vector Machines (SVMs) [3], typically rely on the manual extraction and analysis of traffic features based on expert knowledge to build models. However, these methods possess a limited capacity to capture intricate data patterns and consequently struggle to address the increasingly complex traffic modeling challenges [4].

Deep learning methods offer novel approaches to network traffic anomaly detection by leveraging their powerful capabilities in automatic feature extraction and high-dimensional data modeling. Through multi-layered neural network architectures, deep learning models automatically extract data features via nonlinear transformations, thereby addressing the shortcomings inherent in traditional machine learning methods [5,6,7]. For instance, Althubiti et al. [8] employed a Long Short-Term Memory (LSTM) recurrent neural network classifier for traffic classification tasks. Their experimental results on the KDDCup99 dataset demonstrated that the LSTM classifier outperformed several high-performing traditional classifiers. Similarly, Li et al. [9] proposed an accurate anomaly detection method based on pseudo-anomaly injection, featuring an efficient feature extraction framework and a novel Denoising Autoencoder-Generative Adversarial Network (DAE-GAN) model. Their framework utilized an innovative packet-windowing technique to extract both spatial and temporal features from network traffic. More recently, complex architectures have demonstrated significant potential. Graph Neural Network (GNN)-based methods, for instance, can effectively capture anomalous traffic interactions within the network topology [10]. Concurrently, Transformer-based models leverage their self-attention mechanism to learn long-term dependencies in traffic sequences, also showing great promise for anomaly detection tasks [11]. In another study, Wang et al. [12] introduced an anomalous flow detection system based on a hybrid deep learning model capable of rapidly locating the source of the anomalous flow. Compared to SDN-based anomaly detection methods, the proposed method significantly enhances fine-grained detection by utilizing multidimensional features.

Despite these advances, deep learning methods still face the challenge of data imbalance in traffic anomaly detection, resulting in low detection rates for minority-class attacks. When normal traffic samples far exceed anomalous traffic in volume, the model tends to classify most inputs as the majority (normal) class. This leads to poor detection performance for critical minority attack classes—an outcome that is unacceptable in real-world network monitoring scenarios [6]. Zhou et al. [13] proposed a traffic anomaly detection method that combines an Autoencoder (AE) and a residual neural network. In their approach, the autoencoder first reconstructs the input data for feature extraction, and these features are subsequently used to train the residual network. While this method improved model performance, it failed to account for the imbalanced nature of the data, consequently causing the model to overfit the majority class while underfitting minority classes.

Traditional rule-based monitoring methods are often ineffective against novel, unforeseen anomaly types due to their reliance on pre-defined signatures. In contrast, anomaly-based detection, which compares network activity against established normal behavior patterns, offers a more flexible approach, though it can suffer from high false positive rates if the “normal” baseline is not modeled accurately.

Among the most prominent datasets in this field is the KDDCup99 dataset, which has served as a benchmark in numerous studies [14,15]. It was followed by an enhanced version, NSL-KDD, which has also been extensively studied in the literature [16,17,18,19,20]. The NSL-KDD dataset improved upon its predecessor by removing redundant records to create a more balanced sample distribution, establishing it as a more suitable evaluation benchmark. As a result, it has been widely used to validate the performance of various algorithms [21]. However, it is crucial to note that the NSL-KDD dataset is derived from network traffic captured over two decades ago. Therefore, it fails to represent modern cyberattack techniques, such as Advanced Persistent Threats (APTs), attacks over encrypted traffic, or emerging threats within the Internet of Things (IoT) ecosystem [22]. This limitation means that high performance achieved by models on the NSL-KDD dataset does not necessarily translate into effective defense capabilities in contemporary network environments [23]. This disparity underscores the critical need for evaluation using more recent and diverse datasets, such as CICIDS2017. While the CICIDS2017 and UNSW-NB15 datasets were originally curated for intrusion detection research, in this study, they serve as standard, publicly available benchmarks for network traffic anomaly detection. This is due to their realistic, imbalanced distribution and, most importantly, their clearly labeled classes of non-normal (anomalous) traffic, which allows for a robust evaluation of the proposed methodology in classifying rare, deviant behavior.

Cui et al. [24] proposed a novel multi-module integrated intrusion detection system that utilizes stacked autoencoders for feature extraction. Their method addresses data imbalance by combining a Gaussian Mixture Model (GMM) with a Wasserstein Generative Adversarial Network (WGAN) and employs a CNN and Long Short-Term Memory (LSTM) network for classification. While this approach effectively reduced the model’s false alarm rate, WGANs are susceptible to mode collapse and training instability during sample generation. To mitigate the instability challenges associated with GAN training, researchers have explored alternative data augmentation strategies. For instance, Variational Autoencoder (VAE)-based approaches generate high-quality minority samples by learning the latent distribution of data, offering a more stable training process [25]. Other studies have adopted self-supervised learning paradigms, such as contrastive learning, to enhance a model’s ability to distinguish between normal and anomalous traffic. This is achieved by learning effective representations from unlabeled data, thereby mitigating the impact of data imbalance without direct sample generation [26].

Beyond data-level solutions, improvements at the optimization algorithm level are also critical for enhancing model performance. When processing complex and high-dimensional network traffic data, a model must not only fit the training data but also generalize well to unseen data to effectively counter continuously evolving attack methods.

Foret et al. [27] proposed the Sharpness-Aware Minimization (SAM) algorithm, which enhances model generalization. Instead of seeking parameters that simply minimize the training loss (a sharp minimizer), SAM identifies parameters within a neighborhood characterized by uniformly low loss values (a flat minimizer). In summary, although the aforementioned methods have achieved commendable detection results, the majority still struggle with the challenge of data imbalance, resulting in poor detection rates for minority classes. To address these persistent challenges, this paper proposes a deep learning framework for anomaly detection in imbalanced network traffic.

The main contributions of this study are as follows:

We propose a novel and effective flow detection model, ResCAE-BiGRU. This model extracts multi-scale spatial features using a split residual structure and captures bidirectional temporal dependencies via a BiGRU layer. By integrating the advantages of ResNet and Autoencoder architectures and employing the Sharpness-Aware Minimization (SAM) optimizer, the model significantly enhances generalization and improves the detection rate of minority anomaly classes.
We introduce a data preprocessing pipeline that first utilizes the Isolation Forest algorithm to remove outliers from the majority (normal) class, thereby sharpening the decision boundary. Subsequently, the SMOTE-Tomek technique is applied to synthesize high-quality minority class samples, addressing the data imbalance problem while enhancing sample diversity and bolstering the model’s detection capabilities for rare anomalies.
We conduct a rigorous evaluation of the proposed model on the CICIDS2017 and UNSW-NB15 datasets, which contain diverse and realistic modern cyber threats. The proposed model’s superiority is demonstrated through extensive comparative experiments with existing state-of-the-art methods. Performance is rigorously assessed using standard metrics, including accuracy, precision, recall, and F1-score, confirming the effectiveness and robustness of our ResCAE-BiGRU method.

2. Background

2.1. Residual Network

The Residual Network (ResNet) is a landmark convolutional neural network (CNN) architecture proposed by He et al. [28]. ResNet was specifically designed to address the “network degradation” problem commonly encountered during the training of extremely deep neural networks. Traditionally, it was assumed that deeper networks possess stronger learning and expressive capabilities. However, empirical evidence has shown that when the network depth exceeds a certain threshold, simply stacking additional layers leads to an increase in both training and testing errors.

To overcome this problem, ResNet introduces the concept of a “residual block”. The core idea of the residual block is the addition of a “shortcut connection”, also known as a “skip connection”. This design allows the optimizer to easily learn an identity mapping if a given layer is redundant; this is achieved by driving the weights of the layer’s transformation, F(x), toward zero. It is critical to note that these F(x) layers are always fully trained via backpropagation; the “bypass” refers to the identity path for the signal (+x), not a skip of the training process. This mechanism significantly simplifies the training of very deep networks, allowing gradients to propagate backward more smoothly and effectively alleviating the vanishing gradient problem. The computation within a standard residual block can be expressed by the following formula:

y = F (x, \{W_{i}\}) + x

(1)

where:

x represents the input vector to the residual block.
y represents the output vector of the block.
The function $F (x, \{W_{i}\})$ is the residual mapping to be learned, which typically comprises a stack of two or three convolutional layers, each followed by batch normalization and a ReLU activation function.
$W_{i}$ represents the set of weight parameters associated with the layers in the residual mapping.
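
As a minimal illustration of Equation (1), the sketch below implements a residual block on plain Python lists. The residual mapping used here is a hypothetical element-wise scale and shift followed by ReLU, chosen only for brevity; it is not the convolutional mapping used in ResNet or in this paper.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def residual_mapping(x, weight, bias):
    """Toy stand-in for F(x, {W_i}): element-wise scale, shift, then ReLU."""
    return relu([weight * a + bias for a in x])


def residual_block(x, weight, bias):
    """Compute y = F(x, {W_i}) + x, where '+ x' is the shortcut connection."""
    fx = residual_mapping(x, weight, bias)
    return [f + a for f, a in zip(fx, x)]


x = [1.0, -2.0, 3.0]
# With zero weights the residual mapping vanishes and the block is an identity.
identity_out = residual_block(x, weight=0.0, bias=0.0)
# With non-zero weights the block adds a correction on top of the input.
shifted_out = residual_block(x, weight=0.5, bias=0.0)
```

The first call shows why a redundant block is cheap to learn: once the weights of F reach zero, the output equals the input.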

2.2. Autoencoder

An autoencoder (AE) [29] is a classic unsupervised learning neural network whose primary objective is to learn efficient, compressed feature representations of input data. By setting the input’s reconstruction as the learning objective, the network is trained to approximate an identity function. This is achieved by first mapping the input to a compressed, low-dimensional representation and then reconstructing the original data from that representation. Unlike linear dimensionality reduction techniques such as Principal Component Analysis (PCA) [30], an autoencoder can perform nonlinear transformations, enabling it to capture more complex and deeply embedded patterns within the data. A standard autoencoder has two main parts:

Encoder: This part of the network takes the high-dimensional input data and compresses it into a lower-dimensional representation, often called the latent space or bottleneck.
Decoder: This part takes the compressed representation from the encoder and attempts to reconstruct the original high-dimensional input data from it.

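The basic mathematical representation of an autoencoder is as follows, where φ denotes the encoder mapping, ψ denotes the decoder mapping, z is the latent representation, and x′ is the reconstruction: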

z = φ (x)

(2)

x^{'} = ψ (z) = ψ (φ (x))

(3)

The training process of the autoencoder drives the learning of network parameters by minimizing the reconstruction error between the input x and the reconstruction output x′. This ensures that the resulting latent space representation captures the essential information required to effectively reconstruct the original data. This process compels the encoder to learn the most salient features of the data. Consequently, the autoencoder serves as a powerful tool for applications such as dimensionality reduction, data denoising, and as a pre-training component for supervised learning tasks.
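
To make Equations (2) and (3) concrete, the following sketch uses a deliberately tiny linear autoencoder that compresses a 2-D input into a 1-D code and measures the reconstruction error. The projection direction `w` and the sample points are hypothetical, and a real autoencoder would learn nonlinear mappings by gradient descent rather than use a fixed direction.

```python
def encode(x, w):
    """phi: project a 2-D input onto direction w, giving a 1-D code z."""
    return x[0] * w[0] + x[1] * w[1]


def decode(z, w):
    """psi: map the 1-D code z back to 2-D along the same direction w."""
    return [z * w[0], z * w[1]]


def reconstruction_mse(x, w):
    """Mean squared error between x and its reconstruction psi(phi(x))."""
    x_rec = decode(encode(x, w), w)
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


w = [0.6, 0.8]  # unit-length direction (0.36 + 0.64 = 1)
on_axis = reconstruction_mse([3.0, 4.0], w)    # lies along w
off_axis = reconstruction_mse([4.0, -3.0], w)  # orthogonal to w
```

Inputs that the bottleneck can represent are reconstructed almost exactly, while inputs outside that subspace incur a large error. That gap is what the training objective minimizes over the data.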

2.3. GRU and BiGRU Models

The Gated Recurrent Unit (GRU) is a prominent variant of the Recurrent Neural Network (RNN) that offers a simplified alternative to the Long Short-Term Memory (LSTM) architecture. The GRU streamlines the LSTM structure by merging the input and forget gates into a single update gate and unifying the cell state and hidden state, as depicted in Figure 1. Through its use of gating mechanisms, the GRU effectively mitigates the vanishing and exploding gradient problems prevalent in traditional RNNs. This efficiency and structural simplicity make the GRU particularly adept at capturing long-term dependencies in sequential data.

A GRU is composed of two primary gating mechanisms: the update gate and the reset gate. The update gate, denoted as $z_{t}$ in Equation (4), determines what proportion of the previous hidden state, $h_{t - 1}$, is carried forward to the current time step versus what is incorporated from the new candidate hidden state. This gate employs a sigmoid activation function, which constrains its output to a range between 0 and 1. A value approaching 1 signifies that most of the previous state is retained, whereas a value near 0 indicates that the previous state is largely replaced by the candidate state. The update gate is computed as follows:

z_{t} = σ (W_{z} x_{t} + U_{z} h_{t - 1} + b_{z})

(4)

where:

$z_{t}$ is the output vector of the update gate at time step t.
$W_{z}$, $U_{z}$, and $b_{z}$ are the input weight matrix, recurrent weight matrix, and bias vector for the update gate, respectively.

The reset gate, denoted as $r_{t}$, governs the degree to which information from the previous hidden state ($h_{t - 1}$) is disregarded. When the gate’s output value approaches 0, the model effectively “forgets” past information and focuses on the current input to form its candidate hidden state. Conversely, a value close to 1 allows more of the previous state’s information to be retained. Its computation is defined by the following formula:

r_{t} = σ (W_{r} x_{t} + U_{r} h_{t - 1} + b_{r})

(5)

The candidate hidden state, denoted as $\tilde{h}_{t}$ in Equation (6), serves as a proposed update for the hidden state at the current time step. It is generated by combining the current input ($x_{t}$) with the information from the previous hidden state ($h_{t - 1}$) that was filtered by the reset gate. This combined representation is typically passed through a hyperbolic tangent (tanh) activation function to capture new information.

\tilde{h}_{t} = \tanh (W_{h} x_{t} + U_{h} (r_{t} \odot h_{t - 1}) + b_{h})

(6)

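The hidden state at the current time step, $h_{t}$, is then obtained by interpolating between the previous hidden state and the candidate hidden state, weighted by the update gate:

h_{t} = z_{t} \odot h_{t - 1} + (1 - z_{t}) \odot \tilde{h}_{t}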

(7)

where:

$r_{t}$ is the output vector of the reset gate.
$W_{h}$ and $b_{h}$ represent the weight matrix and bias vector for the candidate hidden state, respectively.
$h_{t - 1}$ is the hidden state from the previous time step, t − 1.
$x_{t}$ is the input at the current time step, t.
$σ$ denotes the sigmoid activation function.
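
The gate computations can be traced with a scalar sketch (input size 1, hidden size 1). All weight values below are hypothetical. The final blend follows the convention stated in the gate description, in which an update gate near 1 retains the previous state; some libraries and papers use the opposite labeling.

```python
import math


def sigmoid(a):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x_t, h_prev, p):
    """One scalar GRU step; p holds hypothetical weights W_*, U_*, b_*."""
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate filters how much of h_prev is visible.
    h_cand = math.tanh(p["W_h"] * x_t + p["U_h"] * (r_t * h_prev) + p["b_h"])
    # z_t near 1 keeps the previous state; z_t near 0 adopts the candidate.
    return z_t * h_prev + (1.0 - z_t) * h_cand


base = {"W_z": 0.0, "U_z": 0.0, "W_r": 1.0, "U_r": 1.0, "b_r": 0.0,
        "W_h": 1.0, "U_h": 1.0, "b_h": 0.0}
# A large positive bias saturates the update gate near 1 (retain).
h_keep = gru_step(1.0, 0.5, {**base, "b_z": 20.0})
# A large negative bias saturates it near 0 (replace with the candidate).
h_replace = gru_step(1.0, 0.5, {**base, "b_z": -20.0})
```

Saturating the update gate in either direction makes its role visible: `h_keep` stays at the previous state, while `h_replace` is essentially the candidate state.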

The Bidirectional Gated Recurrent Unit (BiGRU) is an extension of the standard GRU architecture designed to capture contextual information more comprehensively. It consists of two parallel GRU networks: a forward GRU that processes the input sequence chronologically, and a backward GRU that processes it in reverse. At any given time step t, the final output is generated by combining (typically by concatenation) the hidden states from both the forward and backward passes. This structure allows the hidden state at each step to encapsulate information from both past and future contexts, significantly enhancing the model’s ability to understand long-range dependencies and often leading to improved performance.
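
The bidirectional wiring can be sketched independently of the cell itself. In the example below, a hypothetical running-sum cell stands in for the GRU so that the outputs are easy to verify by hand; only the forward pass, the reversed pass, and the per-step pairing are the point.

```python
def run_direction(sequence, cell, h0=0.0):
    """Apply a recurrent cell left to right and return the state at every step."""
    states, h = [], h0
    for x_t in sequence:
        h = cell(x_t, h)
        states.append(h)
    return states


def bidirectional(sequence, cell):
    """Pair forward and backward states so each step sees past and future."""
    forward = run_direction(sequence, cell)
    # Process the reversed sequence, then flip back to align time steps.
    backward = run_direction(sequence[::-1], cell)[::-1]
    return [(f, b) for f, b in zip(forward, backward)]


def toy_cell(x_t, h_prev):
    """Hypothetical stand-in for a GRU cell: a simple running sum."""
    return h_prev + x_t


outputs = bidirectional([1.0, 2.0, 3.0], toy_cell)
```

At the middle step, the forward component summarizes the inputs up to that point (1 + 2) and the backward component summarizes the inputs from that point onward (3 + 2). A BiGRU exposes the same kind of past and future context at every step.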

3. Materials and Methods

3.1. Model Framework

The primary objective of anomaly traffic detection methods is to achieve superior performance in identifying malicious traffic, particularly in scenarios with a limited number of attack samples. As illustrated in Figure 2, the proposed framework comprises two main components: a data processing module and a traffic detection module. The Input block in the diagram represents the raw network traffic traces (e.g., from the CICIDS2017 and UNSW-NB15 datasets), which are detailed in Section 3.5.1. The data processing module conducts numerical conversion, outlier removal, and data normalization. Outliers among the normal class samples are first removed using the Isolation Forest method. Subsequently, to address class imbalance, additional minority attack class samples are synthesized using the SMOTE-Tomek technique, which preserves the consistency of the original data distribution. The traffic detection module features a traffic anomaly detection model based on a Residual Convolutional Autoencoder and Bidirectional Gated Recurrent Unit (ResCAE-BiGRU) architecture. This model first employs the split residual structure of the ResCAE to extract multi-scale spatial features. Subsequently, the BiGRU captures bidirectional temporal dependencies from the data. To enhance the model’s generalization capability, the Sharpness-Aware Minimization (SAM) method is integrated into the training stage alongside the Stochastic Gradient Descent (SGD) optimizer, thereby improving the detection rate of minority attack samples. Finally, the model’s performance is evaluated on the CICIDS2017 and UNSW-NB15 datasets using standard metrics, including accuracy, precision, recall, and F1-score, to verify its efficacy.

3.2. Data Processing

3.2.1. Numerical Conversion

Different encoding strategies are utilized for features and labels. First, non-numerical categorical features in the dataset are processed using the One-Hot Encoding technique. This method converts each categorical feature containing N unique categories into N new binary features. This transformation effectively maps the categorical text labels into a high-dimensional vector space, enabling the model to process these non-numeric features while avoiding the introduction of artificial ordinal relationships. Secondly, Label Encoding is employed to transform the model’s prediction target, the network traffic attack labels. This technique maps each unique string label to a unique integer. This transformation is necessary for the calculation of the loss function during model training.
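
The two encodings can be illustrated with a small sketch. The feature values and label strings below are hypothetical examples, and the category order (alphabetical) is an assumption of this sketch rather than something the paper specifies.

```python
def one_hot_encode(values):
    """Map each categorical value to a binary vector with one column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]
    return categories, rows


def label_encode(labels):
    """Map each unique string label to a unique integer."""
    classes = sorted(set(labels))
    mapping = {c: i for i, c in enumerate(classes)}
    return mapping, [mapping[lab] for lab in labels]


# Hypothetical feature values and labels, for illustration only.
protocols = ["tcp", "udp", "icmp", "tcp"]
labels = ["BENIGN", "DoS", "BENIGN", "PortScan"]

proto_categories, proto_onehot = one_hot_encode(protocols)
label_mapping, label_ids = label_encode(labels)
```

One-hot encoding is used for input features because integer codes would imply an ordering among categories. For the target, integer class indices are enough, since classification losses consume class identities rather than magnitudes.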

3.2.2. Outlier Processing

In real-world network environments, data may contain anomalous values or outliers resulting from factors such as traffic bursts or data collection errors. To address this, the Isolation Forest (iForest) [31] algorithm is employed for outlier detection and removal. To prevent the risk of removing rare but valid malicious samples, this outlier removal process is applied exclusively to the BENIGN (normal) class samples within the training set. All minority attack class samples are explicitly excluded from this step. The specific process is as follows:

1. A random subsample of size ψ is drawn from the training dataset to form the root node of an isolation tree (iTree).
2. At the current node, a feature is randomly selected. A split value, p, is then randomly chosen between the minimum and maximum values of that feature for the data points within the node. The node’s data is then partitioned into two child nodes based on this split.
3. The partitioning process from Step 2 is repeated recursively for each child node. This continues until a node contains only a single data point, or a predefined maximum tree depth is reached.
4. Steps 1 through 3 are repeated to construct a specified number, t, of iTrees, thus forming an ensemble known as an “isolation forest”.
5. To normalize path lengths, the average path length for a dataset of size n, denoted as c(n), is used as a normalization factor. The formula for c(n) is:

c (n) = \begin{cases} 2 [H (n - 1) - (n - 1) / n], & n > 2 \\ 1, & n = 2 \\ 0, & \text{otherwise} \end{cases}

(8)

where H(i) is the harmonic number, which can be approximated by $\ln (i) + δ$, and δ ≈ 0.577 is the Euler-Mascheroni constant.

6. To obtain an anomaly score for a data point x, it is passed through the trained forest. The path length, h(x), is measured for each iTree, and the average path length across all trees, E[h(x)], is computed. The final anomaly score, s(x,n), is then calculated as follows:

s (x, n) = 2^{- \frac{E [h (x)]}{c (n)}}

(9)

where E[h(x)] is the average path length (expected value) of sample x across all iTrees. An anomaly score close to 1 indicates a high likelihood of being an outlier, while a score much smaller than 0.5 suggests the point is likely normal.
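
The scoring arithmetic in Equations (8) and (9) can be checked numerically with the short sketch below. It covers only the normalization factor and the score, not tree construction, and the subsample size and path lengths are hypothetical values.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant (delta in the text)


def harmonic(i):
    """Approximate the harmonic number H(i) as ln(i) + delta."""
    return math.log(i) + EULER_GAMMA


def c(n):
    """Normalization factor c(n) from Equation (8)."""
    if n > 2:
        return 2.0 * (harmonic(n - 1) - (n - 1) / n)
    if n == 2:
        return 1.0
    return 0.0


def anomaly_score(mean_path_length, n):
    """Anomaly score s(x, n) = 2 ** (-E[h(x)] / c(n)) from Equation (9)."""
    return 2.0 ** (-mean_path_length / c(n))


n = 256  # hypothetical subsample size
short_path = anomaly_score(2.0, n)     # isolated quickly, score closer to 1
typical_path = anomaly_score(c(n), n)  # E[h(x)] equal to c(n) gives 0.5
long_path = anomaly_score(20.0, n)     # isolated slowly, score well below 0.5
```

A point whose average path length equals c(n) scores exactly 0.5, which is why 0.5 acts as the reference level in the interpretation above. Shorter paths push the score toward 1 and longer paths push it toward 0.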

3.2.3. Normalization

To address issues arising from disparate feature scales, we employ Min-Max Normalization. This technique rescales the range of each feature to a specific interval, typically [0, 1], thereby ensuring that all features contribute more equally to the model’s training process and mitigating potential biases. The formula for Min-Max normalization is as follows:

x_{norm} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

(10)

where x is the original feature value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature in the dataset, respectively. The resulting value, $x_{norm}$, represents the scaled feature.

To prevent data leakage, the normalization process is applied after the dataset has been partitioned into training and validation sets. The MinMaxScaler is first fit only on the training data, and the parameters learned from the training data (i.e., its min and max values) are then applied to transform the validation set.
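
The fit-on-train, transform-on-validation pattern can be sketched without any library as follows. The two-feature data is hypothetical, and the guard for constant features is an assumption of this sketch.

```python
def fit_min_max(train_rows):
    """Learn the per-feature minimum and maximum from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def transform_min_max(rows, mins, maxs):
    """Apply Equation (10) per feature using previously learned statistics."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # guard constant features
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


# Hypothetical two-feature data, for illustration only.
train = [[10.0, 0.0], [20.0, 5.0], [30.0, 10.0]]
validation = [[15.0, 12.0]]

mins, maxs = fit_min_max(train)
train_scaled = transform_min_max(train, mins, maxs)
validation_scaled = transform_min_max(validation, mins, maxs)
```

Because the statistics come from the training set alone, a validation value can fall outside [0, 1], as the second feature does here. That is expected and is the cost of avoiding leakage.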

3.2.4. SMOTE-Tomek Process

SMOTE (Synthetic Minority Oversampling Technique) is a widely used technique for addressing class imbalance in datasets [32]. The fundamental principle of SMOTE is to generate synthetic instances of the minority class to create a more balanced dataset. The algorithm operates by first randomly selecting a minority class instance, x, from the minority class set C. It then identifies the k nearest neighbors of x within the same class, typically using the Euclidean distance. Next, a synthetic sample, $x_{new}$, is generated by interpolating between the selected instance x and one of its randomly chosen nearest neighbors, as described in Equation (11). In this equation, α is a random value drawn from a uniform distribution between 0 and 1, which ensures the synthetic sample lies along the line segment connecting the original instance and its selected neighbor.

x_{new} = x + α \cdot (\hat{x} - x)

(11)

where:

$x_{new}$ is the newly generated synthetic sample.
x is the original sample selected from the minority class.
$\hat{x}$ is a neighbor chosen at random from the k nearest neighbors of x.
α is a random number in the interval [0, 1].
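
A single SMOTE interpolation step, as given in Equation (11), can be sketched as follows. The minority points, the value of k, and the fixed random seed are hypothetical; a full implementation would repeat this step until the desired class balance is reached.

```python
import math
import random


def k_nearest(x, candidates, k):
    """Return the k candidates closest to x by Euclidean distance, excluding x itself."""
    others = [c for c in candidates if c is not x]
    return sorted(others, key=lambda c: math.dist(x, c))[:k]


def smote_sample(x, minority, k, rng):
    """Generate one synthetic sample x_new = x + alpha * (x_hat - x)."""
    x_hat = rng.choice(k_nearest(x, minority, k))
    alpha = rng.random()  # uniform in [0, 1)
    return [a + alpha * (b - a) for a, b in zip(x, x_hat)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
x = minority[0]
x_new = smote_sample(x, minority, k=2, rng=rng)
```

With k = 2, the distant point [5.0, 5.0] is never selected as a neighbor, so the synthetic sample always lies on one of the two short segments leaving the origin.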

The SMOTE algorithm effectively mitigates the overfitting problem associated with random oversampling methods. While SMOTE offers a novel approach to addressing data imbalance, it possesses inherent limitations. Specifically, the synthesis of new samples is determined entirely by the selection of a root sample and one of its minority class neighbors. A key drawback is that this synthesis process operates without consideration for the distribution of the majority class. If both the root sample and its selected neighbor are situated within a dense region of the minority class, the resulting synthetic sample is likely to be appropriately positioned. Conversely, if either the root sample or its neighbor is an outlier, the resulting synthetic instance may be generated in a region dominated by the majority class, causing the new minority sample to encroach upon the majority class space.

To address the issues of class overlap and blurred decision boundaries caused by the standard SMOTE algorithm, this study utilizes the hybrid SMOTE-Tomek method to mitigate the data imbalance problem. The methodology first employs the SMOTE algorithm to generate synthetic minority class samples. Subsequently, the Tomek Links [33] algorithm is applied to cleanse the augmented dataset by removing existing Tomek Link pairs. This two-step process effectively eliminates noise and overlapping instances near the class boundary. In our data processing workflow, the initial dataset is partitioned into training and validation sets at a 7:3 ratio. The SMOTE-Tomek procedure is then applied exclusively to the minority class within the training set. This yields a balanced and refined training dataset, which is subsequently shuffled prior to being used for model training.
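
The cleaning step relies on the definition of a Tomek link: two samples from different classes that are each other's nearest neighbor. The sketch below finds such pairs by brute force on hypothetical 1-D data and removes both members of each pair, matching the pair removal described above. It is an illustration of the definition, not the implementation used in the experiments.

```python
import math


def nearest_index(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_index(i, points)
        if i < j and labels[i] != labels[j] and nearest_index(j, points) == i:
            links.append((i, j))
    return links


# Hypothetical 1-D data: classes 0 and 1 meet between 0.9 and 1.0.
points = [[0.0], [0.2], [0.9], [1.0], [1.5], [1.7]]
labels = [0, 0, 0, 1, 1, 1]

links = tomek_links(points, labels)
# Drop both members of every link to clean the class boundary.
drop = {idx for pair in links for idx in pair}
cleaned = [(p, lab) for k, (p, lab) in enumerate(zip(points, labels))
           if k not in drop]
```

Only the boundary pair at 0.9 and 1.0 is removed; samples deep inside either class are untouched, which widens the margin between the classes.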

3.3. The ResCAE-BiGRU

The proposed ResCAE-BiGRU model for abnormal flow detection comprises three primary components, as illustrated in Figure 3. The detailed layer-wise configuration of the ResCAE-BiGRU architecture is provided in Table 1. The architectural flow, as depicted in Figure 3, proceeds as follows:

ResCAE Feature Extraction: The preprocessed traffic data is fed into the ResCAE module, which utilizes two residual blocks to perform convolution operations. This process extracts multi-scale spatial features and progressively reduces dimensionality, yielding the encoded hidden layer feature vector, denoted as h.
Reshaping: The hidden layer feature vector h is reshaped. This conversion adapts the 4-dimensional format (N, C, H, W), which is required by the Conv2d layers used in our ResCAE, into the 3-dimensional sequential format (N, L, H_in) required by the BiGRU module by employing squeeze and permute operations.
BiGRU Temporal Processing: The reshaped sequential features are then input into the BiGRU module, which processes the sequence bidirectionally to capture both forward and backward temporal dependencies. This results in the final output feature representation, h′.
Classification: This feature representation h′ is passed through a Dropout layer and a fully connected (Linear) layer employing a Softmax activation function to perform traffic classification, thereby achieving anomaly detection.

Figure 3. The proposed ResCAE-BiGRU model architecture.

Table 1. ResCAE-BiGRU network architecture details.

Layer Name	Layer Type	Hyperparameters
RB1_path1_conv1	Conv2d Layer	Filters: 16, Kernel size: 3, Stride: 2, Padding: 1, Activation: ReLU
RB1_path1_bn1	Batch Normalization	Applied to the output of conv1
RB1_path1_conv2	Conv2d Layer	Filters: 16, Kernel size: 3, Stride: 1, Padding: 1, Activation: ReLU
RB1_path1_bn2	Batch Normalization	Applied to the output of conv2
RB1_path2_conv1	Conv2d Layer	Filters: 16, Kernel size: 1, Stride: 2, Padding: 1, Activation: ReLU
RB1_path2_bn1	Batch Normalization	Applied to the output of conv1
RB2	ResidualBlock	The structure of RB2 is the same as RB1, but with 32 filters.
ConvTranspose2d	ConvTranspose2d Layer	Filters: 16, Kernel size: 3, Stride: 2, Padding: 1, Activation: ReLU
ConvTranspose2d	ConvTranspose2d Layer	Filters: 1, Kernel size: 3, Stride: 2, Padding: 1, Activation: Sigmoid
BiGRU	BiGRU Layer	Input size: 32, hidden size: 64, layers: 2, bidirectional: True
Dropout	Dropout Layer	Dropout Rate: 50%

A critical aspect of our model is the transformation of the 1D input feature vector into a structure suitable for 2D convolution and sequential modeling. After data preprocessing, each network sample is a 1D vector with $N_{features}$ features. We explicitly reshape this vector into a 4D tensor by adding two dimensions, resulting in an input shape of [Batch_Size, 1, 1, $N_{features}$].

This transformation is the basis for our hybrid approach:

Spatial Feature Extraction (ResCAE): The ResCAE module, which utilizes Conv2d layers, treats the [N, C = 1, H = 1, W = $N_{features}$] input as a 1 × $N_{features}$ “image”. The 2D convolutional kernels (e.g., kernel_size = 3) slide along the W dimension, capturing local patterns and correlations among adjacent features in the vector. We define this as “spatial” feature extraction, as it learns the relationships within local groups of features (e.g., statistical features related to packet length).

Temporal Dependency Modeling (BiGRU): The ResCAE encoder processes this input, outputting a compressed 4D feature map (e.g., [N, 32, 1, $W_{out}$]). As defined in the ResCAE_BiGRU forward pass, this map is reshaped into a 3D tensor of shape [N, $W_{out}$, 32] using squeeze and permute operations. This tensor is then fed to the BiGRU, which interprets it as a sequence of length $W_{out}$ where each step has 32 features. The “temporal” modeling thus refers to capturing the sequential, contextual dependencies (both forward and backward) along this sequence of CNN-extracted feature patches.
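The shape bookkeeping described above can be traced with a short, dependency-free sketch. It is only an illustration under stated assumptions: the helper names (`conv_out`, `encoder_output_shape`, `squeeze_and_permute`) and the value of 78 input features are hypothetical, and the output-size rule is the standard convolution formula with dilation 1. Only the main path of each residual block is traced. Table 1 lists padding 1 for the 1 × 1 shortcut convolution; under this formula it is padding 0 that makes the shortcut width match the main path, so that value is assumed in the check at the end.

```python
def conv_out(size, kernel, stride, padding):
    """Output size along one axis for a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_output_shape(n_features, channels=(16, 32)):
    """Trace (C, H, W) through the main path of two residual blocks."""
    c, h, w = 1, 1, n_features
    for c_out in channels:
        # First conv of the block: kernel 3, stride 2, padding 1 (downsamples).
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        # Second conv of the block: kernel 3, stride 1, padding 1 (keeps size).
        h, w = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)
        c = c_out
    return c, h, w


def squeeze_and_permute(feature_map):
    """Turn one sample's [C][1][W] map into a [W][C] sequence."""
    squeezed = [channel[0] for channel in feature_map]  # drop H = 1 -> [C][W]
    return [list(step) for step in zip(*squeezed)]      # transpose -> [W][C]


if __name__ == "__main__":
    n_features = 78  # hypothetical value, chosen only for illustration
    c, h, w_out = encoder_output_shape(n_features)
    print("encoder output (C, H, W):", (c, h, w_out))
    print("BiGRU input (L, H_in):", (w_out, c))
    # Shortcut path (assumed padding 0) has the same width as the main path.
    print("shortcut width after RB1:", conv_out(n_features, 1, 2, 0))
```

With these assumptions the height stays at 1 through both blocks, which is what allows the squeeze step to drop it, and the channel count of 32 becomes the per-step input size of the BiGRU listed in Table 1.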

3.4. Model Training Strategy

To enhance the generalization ability of the model, a “pre-training and fine-tuning” training strategy is employed. During the unsupervised pre-training stage, the complete ResCAE autoencoder architecture is trained to reconstruct its input data. Crucially, this stage utilizes the entire training dataset, which includes both normal and various attack traffic samples, albeit without their corresponding labels. By learning to reconstruct this diverse set of traffic, the encoder is compelled to learn a robust and generalized feature representation that captures the underlying patterns common across all data types. The primary objective of this stage is to learn an effective data representation rather than to perform classification, and it utilizes the Mean Squared Error Loss (MSELoss) function to quantify reconstruction error. The formula for MSELoss is:

L_{MSE} = \frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - {\hat{x}}_{i})}^{2}

(12)

where:

n represents the total number of elements in the input vector.
$x_{i}$ is the value of the i-th element in the original input vector.
${\hat{x}}_{i}$ is the corresponding value in the output vector reconstructed by the decoder.

Minimizing this loss function compels the encoder to learn a compressed feature representation that retains the core information of the original data, thereby providing high-quality initial weights for the subsequent fine-tuning phase.
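As a small worked instance of Equation (12), the sketch below computes the reconstruction error for a single input vector. The function name `mse_loss` and the example values are illustrative assumptions, not the authors' code.

```python
def mse_loss(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    n = len(x)
    return sum((xi - xhi) ** 2 for xi, xhi in zip(x, x_hat)) / n


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0]
    reconstructed = [1.0, 2.0, 5.0]  # only the last element is off, by 2
    print(mse_loss(original, reconstructed))  # (0 + 0 + 4) / 3
```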

The fine-tuning phase commences by loading the pretrained encoder weights into the final ResCAE-BiGRU model. For training the classifier, the Stochastic Gradient Descent (SGD) optimizer is augmented with the Sharpness-Aware Minimization (SAM) algorithm [27]. SAM enhances the model’s generalization ability by seeking parameter neighborhoods with uniformly low loss values, thereby improving classification accuracy. This optimization process involves a two-step update strategy.

The first step is to identify the point within the parameter neighborhood that yields the steepest increase in loss. Given the current model parameters w, SAM calculates a perturbation vector, $\hat{ε}(w)$, that points in the direction of steepest loss ascent within a neighborhood of radius ρ. The objective is to proactively identify the “worst-case” perturbation in the parameter neighborhood. This perturbation vector is calculated as follows:

\hat{ε} (w) = ρ \frac{\nabla_{w} L_{S} (w)}{{‖\nabla_{w} L_{S} (w)‖}_{2}}

(13)

where:

S represents the training data set.
w denotes the model parameters.
$\nabla_{w} L_{S} (w)$ is the gradient of the training loss $L_{S}$ with respect to the current parameters w.
ρ is a hyperparameter that defines the size of the neighborhood.

Once the perturbation vector $\hat{ε}(w)$ is calculated, the model temporarily updates the parameters to this worst-case position, defined as $w + \hat{ε}(w)$.

The second step involves performing the main parameter update based on the gradient at this worst-case point. Specifically, the loss function is evaluated at the perturbed position $w + \hat{ε}(w)$. The resulting gradient, termed the “adversarial gradient”, is then used to update the original model parameters w. It is approximated as follows:

\nabla_{w} L_{S}^{adv} (w) \approx \nabla_{w} L_{S} (w) |_{w + \hat{ε} (w)}

(14)

Finally, the SGD optimizer uses this adversarial gradient to perform the gradient descent update on the original model parameters w. Through this two-step “ascent-descent” strategy, SAM guides the SGD optimizer to find not just a minimal loss value, but a broad, flat minimum. Convergence to such a region implies that the model exhibits less sensitivity to minor perturbations in its parameters, consequently demonstrating stronger robustness and improved generalization performance.
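The two-step procedure in Equations (13) and (14) can be sketched on a toy problem. This is a minimal illustration, not the authors' training loop: the function `sam_step`, the quadratic toy loss, and its gradient function are assumptions introduced here, and momentum is omitted for brevity. The default learning rate and ρ mirror the values reported in Section 3.5.4.

```python
import math


def sam_step(w, grad_fn, lr=0.01, rho=0.05):
    """One SAM update on a list of parameters w, using plain SGD as the base."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # zero gradient: no ascent direction, nothing to update
    # Step 1 (Eq. 13): perturbation of length rho along the gradient direction.
    eps = [rho * gi / norm for gi in g]
    w_adv = [wi + ei for wi, ei in zip(w, eps)]
    # Step 2 (Eq. 14): gradient evaluated at the perturbed point ...
    g_adv = grad_fn(w_adv)
    # ... applied to the ORIGINAL parameters, not to w_adv.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def toy_grad(w):
    """Gradient of the toy loss 0.5 * (w0^2 + 10 * w1^2)."""
    return [w[0], 10.0 * w[1]]


if __name__ == "__main__":
    w = [3.0, 1.0]
    for _ in range(5):
        w = sam_step(w, toy_grad)
    print(w)
```

The key design point is that the perturbed parameters are used only to evaluate the gradient; the update itself is taken from the unperturbed w, which is why a practical implementation needs two forward-backward passes per batch.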

The loss function employed in this study is Focal Loss, a modification of the standard cross-entropy loss. Originally proposed by Lin et al. [34] in their work on RetinaNet, it was designed to solve the problem of extreme class imbalance encountered by deep learning models during training. The core principle of Focal Loss is to modulate the standard cross-entropy loss using a dynamically scaled factor. This factor reduces the contribution of well-classified examples to the total loss, thereby forcing the model to focus on learning from a smaller set of difficult-to-classify samples during training. Specifically, the scaling factor dynamically adjusts the contribution of each sample to the loss based on the model’s predicted confidence. For easily classified samples with high prediction confidence (i.e., a predicted probability for the true class approaching 1), the corresponding loss contribution is significantly reduced. Conversely, for difficult-to-classify samples with low confidence (i.e., a predicted probability for the true class approaching 0), the loss contribution remains relatively high.

Focal Loss modifies the cross-entropy loss by introducing a modulating factor, ${(1 - p_{t})}^{γ}$, and is formulated as follows:

L_{FL} (p_{t}) = - {(1 - p_{t})}^{γ} \log (p_{t})

(15)

where:

The factor ${(1 - p_{t})}^{γ}$ is the core modulating factor.
$p_{t}$ represents the model’s predicted probability for the ground-truth class. For an easily classified sample where $p_{t} \to 1$, this factor approaches zero, thereby reducing the sample’s contribution to the total loss. Conversely, for a misclassified sample where $p_{t} \to 0$, the factor approaches one, leaving the loss value largely unaffected.
$γ \geq 0$ is a tunable focusing parameter that adjusts the rate at which easily classified samples are down-weighted. A higher value of $γ$ intensifies this down-weighting effect. When $γ = 0$, the Focal Loss function becomes equivalent to the standard cross-entropy loss.

To further address the imbalance between positive and negative samples, a weighting factor, $α_{t}$, is introduced, which is formulated as follows:

α_{t} = \begin{cases} α, & \text{if } y = 1 \\ 1 - α, & \text{if } y = 0 \end{cases}

(16)

This factor is controlled by the hyperparameter $α \in [0, 1]$, which balances the importance assigned to positive and negative classes. Combining the weighting factor with the modulating factor gives the complete Focal Loss:

L_{FL} (p_{t}) = - α_{t} {(1 - p_{t})}^{γ} \log (p_{t})

(17)

Consequently, the complete Focal Loss formulation addresses class imbalance statically via the $α_{t}$ term while dynamically managing the imbalance between easy and difficult samples via the ${(1 - p_{t})}^{γ}$ term.
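The behavior of Equations (15) to (17) can be checked numerically with a per-sample sketch of the binary form given above. The function name `focal_loss` is hypothetical, and the default values α = 0.25 and γ = 2 are illustrative choices only; the values used in the experiments are not stated in this section.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample.

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p                # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha    # Eq. (16)
    p_t = min(max(p_t, 1e-12), 1.0)               # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)  # Eq. (17)


if __name__ == "__main__":
    easy = focal_loss(0.9, 1)   # confident and correct
    hard = focal_loss(0.1, 1)   # confident and wrong
    print("easy sample loss:", easy)
    print("hard sample loss:", hard)
    # With gamma = 0 the modulating factor disappears (weighted cross-entropy).
    print("gamma = 0:", focal_loss(0.9, 1, alpha=1.0, gamma=0.0), -math.log(0.9))
```

For the easy sample the modulating factor is (1 − 0.9)² = 0.01, so its cross-entropy term is scaled down by two orders of magnitude, whereas for the hard sample the factor is 0.81 and the loss is almost unchanged.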

3.5. Experiments

3.5.1. Dataset

The CICIDS2017 dataset employed in this study is a network traffic dataset developed through a collaboration between the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC). Although it was originally curated for intrusion detection research, in this study it serves as a standard, publicly available benchmark for network traffic anomaly detection. This is due to its realistic, imbalanced distribution and, most importantly, its clearly labeled classes of anomalous traffic, which allow for a robust evaluation of the proposed methodology in classifying rare, deviant behavior. The dataset comprises network traffic collected over a one-week period in a real-world environment: Monday contains only normal traffic, whereas Tuesday through Friday contain a mixture of normal traffic and various attack types. In total, the dataset is distributed across eight files containing 2,813,797 samples, each described by 79 features. Each sample is labeled either as normal or as one of 14 attack types, such as FTP Brute-force, SSH Brute-force, Web Attack, DoS, and Botnet. Each record contains detailed information, including IP addresses, port numbers, and protocol types. To mitigate the inherent class imbalance, a subset consisting of one quarter of the normal data and all attack data was selected for the experiment. The resulting data distribution is detailed in Table 2.

The UNSW-NB15 dataset was generated using the IXIA PerfectStorm tool at the University of New South Wales (UNSW) Canberra to create a hybrid of realistic normal network activities and synthetic contemporary attack behaviors. From this, a total of 100 GB of raw traffic was captured in PCAP format using the tcpdump tool. The dataset encompasses nine distinct attack categories: Fuzzers, Analysis, Backdoors, Denial of Service (DoS), Exploits, Generic, Reconnaissance, Shellcode, and Worms. Subsequently, a total of 49 features and their corresponding class labels were extracted using the Argus and Bro-IDS tools in conjunction with twelve feature-extraction algorithms. Compared with the legacy KDD99 dataset, UNSW-NB15 more accurately reflects the traffic characteristics of modern networks and presents a more realistic and balanced data distribution. The class distribution within the UNSW-NB15 dataset is illustrated in Figure 4.

3.5.2. Evaluation Metrics

To evaluate the performance of the proposed model, several performance metrics are considered, each representing a specific aspect of classification effectiveness. The selected evaluation indicators include Accuracy, Precision, Recall, and F1-Score.

Accuracy is defined as the proportion of correctly classified samples among all samples and is commonly used to assess the overall performance of the model. In this context, True Positive (TP) refers to the number of positive samples correctly identified by the model, False Negative (FN) denotes the number of positive samples incorrectly classified as negative, False Positive (FP) indicates the number of negative samples incorrectly classified as positive, and True Negative (TN) represents the number of negative samples correctly identified by the model.

Precision is used to represent the proportion of correctly predicted positive samples among all samples predicted as positive. Recall, on the other hand, indicates the proportion of actual positive samples that are correctly identified by the model. These two metrics are essential for evaluating the model’s classification performance with respect to a specific class.

Precision and recall often move in opposite directions, so neither metric alone gives a complete picture. Therefore, the F1-score is introduced to comprehensively evaluate the model’s performance. The F1-score is the harmonic mean of precision and recall, providing a more balanced measure of the model’s classification capability across different categories. This metric is particularly useful when there is an uneven class distribution. The formulas for these evaluation metrics are as follows:

\begin{cases} \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \\ \text{Precision} = \frac{TP}{TP + FP} \\ \text{Recall} = \frac{TP}{TP + FN} \\ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \end{cases}

(18)

However, in a multiclass imbalanced setting, a single aggregate F1-score (such as Micro F1 or Macro F1) can obscure poor performance on critical minority classes. To address this, our evaluation strategy is two-fold:

Per-Class F1-Score: Our primary evaluation, presented in Section 4, reports the F1-score for each class individually. We believe this granular, per-class analysis is the most transparent method, as it explicitly reveals the model’s effectiveness (or failure) on specific minority attack classes, which is the core challenge of this research.

Weighted F1-Score: For the summary comparison tables, we report the Weighted F1-score. This metric calculates the F1-score for each class and then averages these scores, weighting each class by its support (the number of true samples of that class).

We chose this Weighted F1-score for aggregation because it reflects the model’s overall performance relative to the dataset’s realistic, imbalanced distribution. We contend that this combination—a transparent per-class breakdown to rigorously assess minority classes, supplemented by a weighted-average score to reflect overall performance on the imbalanced data—provides a more complete and honest evaluation than a single Macro F1-score alone.
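The difference between the per-class, weighted, and macro views can be illustrated with a small, self-contained sketch. The function names and the ten-sample toy labels are assumptions made for illustration; they are not the paper's data or evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using one-vs-rest counts for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was not p
            fn[t] += 1  # true class t was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        pred_pos = tp[label] + fp[label]
        true_pos = tp[label] + fn[label]
        precision = tp[label] / pred_pos if pred_pos else 0.0
        recall = tp[label] / true_pos if true_pos else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 weighted by class support."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[c] * support[c] for c in support) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted average of per-class F1 over the true classes."""
    scores = per_class_f1(y_true, y_pred)
    classes = set(y_true)
    return sum(scores[c] for c in classes) / len(classes)


if __name__ == "__main__":
    y_true = ["normal"] * 8 + ["attack"] * 2
    y_pred = ["normal"] * 8 + ["normal", "attack"]  # one attack is missed
    print(per_class_f1(y_true, y_pred))
    print("weighted:", weighted_f1(y_true, y_pred))
    print("macro:", macro_f1(y_true, y_pred))
```

In this toy case the minority class has an F1-score of 2/3 while the weighted average stays near 0.89, which is why the per-class breakdown is reported alongside the weighted aggregate.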

3.5.3. Experimental Environment Configuration

The software and hardware configurations of the experimental platform used in this study are shown in Table 3.

3.5.4. Training Parameter Settings

For optimal model training and performance, the hyperparameters were configured as follows. The model was trained for 50 epochs with a batch size of 256. The optimizer selected was Sharpness-Aware Minimization (SAM) using a Stochastic Gradient Descent (SGD) base and a learning rate of 0.01. For the two training stages, the pre-training loss function was Mean Squared Error (MSE), and the fine-tuning loss function was Focal Loss. These hyperparameter settings are summarized in Table 4.

A brief explanation of the above parameters is as follows:

Model training employed a two-stage strategy consisting of pre-training and fine-tuning. In the first stage, the ResCAE module underwent unsupervised pre-training, enabling its encoder to learn a robust and generalized feature representation of the input data. The second stage involved supervised fine-tuning, for which the hyperparameters were configured as follows.

A batch size of 256 was selected to balance gradient stability with available GPU memory resources, with the value being a power of two to optimize hardware utilization.

The Sharpness-Aware Minimization (SAM) optimizer was utilized with a Stochastic Gradient Descent (SGD) base. This choice was made to guide the model towards flatter loss minima, which is correlated with enhanced generalization capabilities. The base SGD optimizer was configured with an initial learning rate of 0.01 and a momentum of 0.9, while the neighborhood radius for SAM (ρ) was set to 0.05.

To dynamically adjust the learning rate during training, a CosineAnnealingLR scheduler was employed. This scheduler modulates the learning rate over the training epochs according to a cosine function curve. The learning rate begins at a pre-defined maximum and gradually decays to a minimum value over the course of training. This strategy enables the use of a larger learning rate in the initial training phases for rapid exploration of the solution space, and a smaller learning rate in the later stages to facilitate precise convergence near an optimal solution. Training was conducted for a total of 50 epochs, over which the scheduler completed one full annealing cycle to ensure sufficient model convergence.
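The schedule described above follows the usual closed-form cosine annealing rule, sketched below without any framework dependency. The function name is hypothetical, and a minimum learning rate of 0 is assumed because the text does not state one; the base learning rate of 0.01 and the 50-epoch cycle are taken from this section.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.01, t_max=50, eta_min=0.0):
    """Learning rate at a given epoch for a single cosine annealing cycle."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


if __name__ == "__main__":
    for epoch in (0, 10, 25, 40, 50):
        print(epoch, round(cosine_annealing_lr(epoch), 6))
```

The rate starts at the base value, passes through half of it at the midpoint of the cycle, and reaches the assumed minimum at the final epoch.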

4. Results and Analysis

4.1. Performance Evaluation on CICIDS2017 and UNSW-NB15

In real-world online environments, anomalous behaviors often exhibit a degree of stochasticity. Furthermore, such anomalous traffic typically occurs far less frequently than normal traffic patterns, creating an inherent class imbalance. This phenomenon was confirmed in our preceding analysis of the datasets. Therefore, a critical step in evaluating an anomaly detection model is to assess its performance in detecting each distinct type of anomalous behavior. To this end, a per-class performance evaluation is required for each anomaly label.

This research first used the CICIDS2017 dataset to verify the proposed ResCAE-BiGRU model. Figure 5 shows how the model’s accuracy on the training and validation sets evolves with the number of training epochs. Even with only 50 training epochs, the model achieves strong performance on the validation set. Figure 6 shows the confusion matrix of the model’s performance on the CICIDS2017 dataset. It should be noted that, for the purpose of clear visualization, several attack sub-categories listed in Table 2 have been aggregated into their parent categories. For example, attacks such as “FTP-Patator” and “SSH-Patator” are represented under the single “BruteForce” label. Furthermore, attack classes with extremely few samples, such as “Heartbleed” and “Web Attack-SQL Injection”, have been omitted from the matrix to maintain readability.

Figure 7 illustrates the training and validation performance of the ResCAE-BiGRU model on the UNSW-NB15 dataset over 50 epochs. The figure contains two plots: one depicting the model’s accuracy and the other showing its loss. In both plots, the x-axis represents the training epoch, and the y-axis represents the metric value (accuracy or loss).

The model exhibited strong convergence and generalization capabilities during the training process. As shown in the loss curves, the training and validation losses both decreased rapidly in tandem from the initial epoch, eventually stabilizing at a low value of approximately 0.44 after about 30 epochs. The two loss curves remain in close proximity throughout training, and the small gap between them indicates that the model effectively avoided overfitting. A similar trend is observed in the accuracy curves, where the training and validation accuracies rose concurrently, increasing from an initial 70% to a stable peak of approximately 91%. Despite slight fluctuations in the validation accuracy, its overall trend remained highly consistent with the training accuracy, suggesting a robust training process.

Collectively, the concurrent decrease in loss and increase in accuracy across both the training and validation sets demonstrate that the model was trained effectively, reached a stable state of convergence, and possesses strong generalization performance. The model’s classification performance on the dataset is further detailed by the confusion matrix in Figure 8.

Table 5 presents the per-class accuracy, precision, recall, and F1-scores on the CICIDS2017 and UNSW-NB15 datasets, demonstrating the proposed model’s high efficacy and generalization capabilities.

While the proposed model demonstrates exceptional overall performance across most categories, it is important to analyze its limitations. As shown in Table 5, the model’s performance on the “Infiltration” attack class is notably poor, with an F1-score of only 0.073. This can be directly attributed to the extreme rarity of this attack in the dataset, which contains only 34 samples for this category. Even with the application of the SMOTE-Tomek technique, which is designed to address class imbalance, such a small number of initial samples makes it difficult to generate a sufficiently diverse and representative set of synthetic data. Consequently, the model struggles to learn the distinguishing features of this attack class, highlighting a challenge for oversampling-based methods when faced with extremely scarce data. Acknowledging this limitation provides a direction for future work, which could explore few-shot learning techniques for such rare attack types.

4.2. Ablation Experiment

To evaluate the contribution of iForest [31] and the Sharpness-Aware Minimization (SAM) optimizer [27], an ablation study was conducted on the CICIDS2017 dataset. The results are presented in Table 6.

The baseline ResCAE + BiGRU model achieved an accuracy of 97.42% and an F1-score of 97.29%. This performance is attributed to ResCAE’s ability to extract spatial features from traffic data while mitigating gradient explosion, and BiGRU’s capacity to capture long-range dependencies and prevent information loss during learning. Integrating SAM (ResCAE + BiGRU + SAM) improved the accuracy and F1-score to 98.39% and 98.35%, respectively. This improvement is because SAM enhances model generalization by guiding convergence towards flatter loss minima rather than sharp ones.

Incorporating iForest for outlier removal (iForest + ResCAE + BiGRU) yielded an accuracy of 98.55% and an F1-score of 98.30%. This is because pre-processing with iForest clarifies the decision boundary between normal and attack classes, which mitigates class overlap and prevents the subsequent oversampling step from generating noisy, borderline synthetic samples. When applying SMOTE (SMOTE + ResCAE + BiGRU), the accuracy and F1-score rose to 99.03% and 98.94%, respectively, by addressing the inherent class imbalance and allowing the model to learn more effectively from minority attack classes.

Finally, the full proposed model, which combines all components, achieved the highest performance, with an accuracy of 99.33% and an F1-score of 99.41%. These results confirm that each component contributes positively to the model’s overall efficacy in detecting attack classes.

4.3. Comparison Experiment

To evaluate the detection effectiveness of the proposed model, its performance was benchmarked on the CICIDS2017 dataset. For this experiment, a subset of the data was created by combining the normal traffic (from Monday) with all available attack traffic. This dataset was then partitioned into a 70% training set and a 30% testing set. The proposed model’s performance was then compared against several baseline models: DT [1], CNN-LSTM [35], AE-ResNet [13], and LCVAE-CBiLSTM [5]. The experimental results are summarized in Table 7 and Figure 9. As shown, the proposed model achieved an accuracy of 99.33%, a precision of 99.53%, a recall of 99.33%, and an F1-score of 99.41%.

When compared to the traditional machine learning model (DT), the proposed model demonstrates superior performance across all four metrics, including an accuracy improvement of over one percentage point. This suggests that for complex network traffic data, deep learning architectures can learn more effective feature representations than shallow models. Traditional methods like Decision Trees often rely on manual feature engineering and have a limited capacity to capture the intricate patterns inherent in such data.

The performance gap is particularly evident when compared to the standard CNN-LSTM model, which our model outperforms by a significant margin: the proposed model achieves an F1-score of 99.41%, whereas the CNN-LSTM scores only 81.36%. CNN-LSTM’s recall is particularly low at 76.83%, indicating a failure to detect a substantial number of attacks. This highlights that the proposed model’s architecture, incorporating a residual autoencoder (ResCAE) and advanced optimization (SAM), provides substantially improved feature extraction and generalization capabilities compared to a standard CNN-LSTM.

In comparison to AE-ResNet, our model shows a notable advantage in both accuracy and recall, suggesting a more comprehensive identification of attack samples. Although AE-ResNet achieves slightly higher precision and F1-scores, our model excels in recall while maintaining competitively high precision, indicating a better balance in detecting true positives. The proposed model also surpasses LCVAE-CBiLSTM in overall accuracy, precision, and F1-score. Although LCVAE-CBiLSTM achieves the highest recall, this comes at the cost of lower precision, resulting in a less balanced overall performance, as reflected by its lower F1-score compared to our model.

Overall, the superior performance of the proposed framework is not due to a single component but to the synergistic combination of its architecture and training strategy, which directly address the core challenges of imbalanced traffic detection. First, the architectural design provides superior feature extraction. Unlike DT, a traditional machine learning method, our ResCAE-BiGRU learns deeper, non-linear features. AE-ResNet also uses a residual network, but its architecture lacks a temporal analysis component; our framework explicitly includes a BiGRU layer to capture the contextual and sequential dependencies inherent in network attacks. Second, our optimization strategy is more focused on imbalance and generalization. The CNN-LSTM model’s low F1-score (81.36%) is likely due to its use of standard binary cross-entropy, which is heavily biased towards the majority class on imbalanced data. Our model, in contrast, employs Focal Loss, forcing it to prioritize hard-to-classify minority attacks. Furthermore, our model utilizes the SAM optimizer, which improves generalization by seeking flat loss minima, a technique not adopted in any of the compared papers. This comprehensive approach, combining iForest cleaning, ResCAE deep features, BiGRU temporal analysis, Focal Loss for imbalance, and SAM for generalization, explains why our model achieves the best overall performance in building a robust and balanced classifier.

5. Conclusions

To address the challenge of low anomaly detection rates caused by class imbalance in network traffic, particularly for minority anomaly classes, this paper proposes a deep learning framework for anomaly detection in imbalanced network traffic, ResCAE-BiGRU.

The proposed methodology begins with a two-stage data optimization process. First, outliers are removed using the Isolation Forest (iForest) algorithm to enhance data quality. Subsequently, the SMOTE-Tomek technique is applied to oversample minority anomaly classes, thereby mitigating the class imbalance problem. The model training phase employs a two-stage strategy. Initially, the ResCAE module is pre-trained in an unsupervised manner, enabling its encoder to learn robust, generalized feature representations from the traffic data. In the supervised fine-tuning stage, the pre-trained ResCAE encoder is integrated with a Bidirectional Gated Recurrent Unit (BiGRU) network to capture bidirectional temporal dependencies within the data features. Furthermore, the Sharpness-Aware Minimization (SAM) optimizer, utilizing an SGD base, is employed to guide the model towards flatter loss minima, which enhances its generalization capabilities.

Comprehensive experimental results on the CICIDS2017 and UNSW-NB15 datasets demonstrate the excellent performance of the proposed model. A comparative analysis confirms that the ResCAE-BiGRU model not only surpasses traditional machine learning algorithms but is also highly competitive with, and in some aspects superior to, other contemporary deep learning models. The results demonstrate that this method effectively addresses the challenge of class imbalance, significantly improving the overall performance of anomaly detection in network traffic. This work provides a robust foundation for developing more effective network traffic anomaly detection frameworks. Future work will focus on validating the model’s effectiveness and generalizability across a broader range of datasets.

Author Contributions

Conceptualization, X.N. and X.X.; methodology, X.X.; software, K.Q.; validation, X.N., X.X. and K.Q.; formal analysis, K.Q.; investigation, X.X.; resources, K.Q.; data curation, X.X.; writing—original draft preparation, X.N.; writing—review and editing, X.X.; visualization, X.N.; supervision, X.X.; project administration, X.X.; funding acquisition, X.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Education Science Planning Project for the 14th Five-Year Plan, grant number 2023B324; the Key Project of Guangxi Social Science Community Think Tank, grant number Zkybkt202497; and the University-level Project of Guilin Tourism University, grant number 2023Z14.

Data Availability Statement

The CICIDS2017 dataset is open to the public and can be obtained from the Canadian Institute for Cybersecurity website: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 2 March 2025). The UNSW-NB15 dataset is open to the public and can be obtained from the UNSW Canberra Cyber Centre’s website: https://research.unsw.edu.au/projects/unsw-nb15-dataset (accessed on 2 March 2025).

Acknowledgments

The authors would like to thank all the members involved in this project for their help in developing this article and all the anonymous reviewers for their criticisms and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AE	Autoencoder
APTs	Advanced Persistent Threats
BiGRU	Bidirectional Gated Recurrent Unit
CIC	Canadian Institute for Cybersecurity
CNNs	Convolutional Neural Networks
CSE	Communications Security Establishment
DAE	Denoising Autoencoder
DDoS	Distributed Denial-of-Service
DTs	Decision Trees
FN	False Negative
FP	False Positive
GAN	Generative Adversarial Network
GMM	Gaussian Mixture Model
GNN	Graph Neural Network
GRU	Gated Recurrent Unit
iForest	Isolation Forest
IoT	Internet of Things
LSTM	Long Short-Term Memory
MSELoss	Mean Squared Error Loss
NB	Naive Bayes
PCA	Principal Component Analysis
ReLU	Rectified Linear Unit
ResCAE	Residual Convolutional Autoencoder
ResNet	Residual Network
RNN	Recurrent Neural Network
SAM	Sharpness-Aware Minimization
SDN	Software-Defined Networking
SGD	Stochastic Gradient Descent
SMOTE	Synthetic Minority Oversampling Technique
SVM	Support Vector Machines
TN	True Negative
TP	True Positive
VAE	Variational Autoencoder
WGAN	Wasserstein Generative Adversarial Network

References

1. Ahmim, A.; Maglaras, L.; Ferrag, M.A.; Derdour, M.; Janicke, H. A Novel Hierarchical Intrusion Detection System Based on Decision Tree and Rules-Based Models. In Proceedings of the 2019 15th International Conference on Distributed Computing in Sensor Systems (DCOSS), Santorini, Greece, 29–31 May 2019; pp. 228–233.
2. Chen, S.; Webb, G.I.; Liu, L.; Ma, X. A Novel Selective Naïve Bayes Algorithm. Knowl. Based Syst. 2020, 192, 105361.
3. Tao, P.; Sun, Z.; Sun, Z. An improved intrusion detection algorithm based on GA and SVM. IEEE Access 2018, 6, 13624–13631.
4. D’hooge, L.; Wauters, T.; Volckaert, B.; De Turck, F. Inter-Dataset Generalization Strength of Supervised Machine Learning Methods for Intrusion Detection. J. Inf. Secur. Appl. 2020, 54, 102564.
5. Hou, T.; Xing, H.; Liang, X.; Su, X.; Wang, Z. A Marine Hydrographic Station Networks Intrusion Detection Method Based on LCVAE and CNN-BiLSTM. J. Mar. Sci. Eng. 2023, 11, 221.
6. Fu, Y.; Du, Y.; Cao, Z.; Li, Q.; Xiang, W. A Deep Learning Model for Network Intrusion Detection with Imbalanced Data. Electronics 2022, 11, 898.
7. Shams, E.A.; Rizaner, A.; Ulusoy, A.H. A Novel Context-Aware Feature Extraction Method for Convolutional Neural Network-Based Intrusion Detection Systems. Neural Comput. Appl. 2021, 33, 13647–13665.
8. Althubiti, S.; Nick, W.; Mason, J.; Yuan, X.; Esterline, A. Applying Long Short-Term Memory Recurrent Neural Network for Intrusion Detection. In Proceedings of the SoutheastCon, St. Petersburg, FL, USA, 19–22 April 2018; pp. 1–5.
9. Li, Z.; Chen, S.; Dai, H.; Xu, D.; Chu, C.-K.; Xiao, B. Abnormal Traffic Detection: Traffic Feature Extraction and DAE-GAN with Efficient Data Augmentation. IEEE Trans. Reliab. 2023, 72, 498–510.
10. Lo, W.W.; Layeghy, S.; Sarhan, M.; Gallagher, M.; Portmann, M. E-GraphSAGE: A Graph Neural Network Based Intrusion Detection System for IoT. In Proceedings of the NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium, Budapest, Hungary, 25–29 April 2022; pp. 1–9.
11. Manocchio, L.D.; Layeghy, S.; Lo, W.W.; Kulatilleke, G.K.; Sarhan, M.; Portmann, M. FlowTransformer: A Transformer Framework for Flow-Based Network Intrusion Detection Systems. Expert Syst. Appl. 2024, 241, 122564.
12. Wang, K.; Fu, Y.; Duan, X.; Liu, T.; Xu, J. Abnormal Traffic Detection System in SDN Based on Deep Learning Hybrid Models. Comput. Commun. 2024, 216, 183–194.
13. Zhou, P.; Zhou, Z.; Wang, L.; Zhao, W. Network intrusion detection method based on autoencoder and RESNET. Appl. Res. Comput. 2020, 37, 224–226.
14. Khan, M.A.; Ghazal, T.M.; Lee, S.W.; Rehman, A. Data Fusion-Based Machine Learning Architecture for Intrusion Detection. Comput. Mater. Contin. 2022, 70, 3399–3413.
15. Peng, W.; Kong, X.; Peng, G.; Li, X.; Wang, Z. Network Intrusion Detection Based on Deep Learning. In Proceedings of the 2019 International Conference on Communications, Information System and Computer Engineering (CISCE), Haikou, China, 5–7 July 2019; pp. 431–435.
16. Tang, T.A.; Mhamdi, L.; McLernon, D.; Zaidi, S.A.R.; Ghogho, M. Deep Recurrent Neural Network for Intrusion Detection in SDN-Based Networks. In Proceedings of the 2018 4th IEEE Conference on Network Softwarization and Workshops (NetSoft), Montreal, QC, Canada, 25–29 June 2018; pp. 202–206.
17. Riyaz, B.; Ganapathy, S. A Deep Learning Approach for Effective Intrusion Detection in Wireless Networks Using CNN. Soft Comput. 2020, 24, 17265–17278.
18. Almiani, M.; AbuGhazleh, A.; Al-Rahayfeh, A.; Atiewi, S.; Razaque, A. Deep Recurrent Neural Network for IoT Intrusion Detection System. Simul. Model. Pract. Theory 2020, 101, 102031.
19. Le, T.-T.-H.; Kim, Y.; Kim, H. Network Intrusion Detection Based on Novel Feature Selection Model and Various Recurrent Neural Networks. Appl. Sci. 2019, 9, 1392.
20. Li, Y.; Xu, Y.; Liu, Z.; Hou, H.; Zheng, Y.; Xin, Y.; Zhao, Y.; Cui, L. Robust Detection for Network Intrusion of Industrial IoT Based on Multi-CNN Fusion. Measurement 2020, 154, 107450.
21. Alrayes, F.S.; Zakariah, M.; Amin, S.U.; Iqbal Khan, Z.; Helal, M. Intrusion Detection in IoT Systems Using Denoising Autoencoder. IEEE Access 2024, 12, 122401–122425.
22. Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of Intrusion Detection Systems: Techniques, Datasets and Challenges. Cybersecurity 2019, 2, 20.
23. Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep Learning Approach for Intelligent Intrusion Detection System. IEEE Access 2019, 7, 41525–41550.
24. Cui, J.; Zong, L.; Xie, J.; Tang, M. A Novel Multi-Module Integrated Intrusion Detection System for High-Dimensional Imbalanced Data. Appl. Intell. 2023, 53, 272–288.
25. Li, Z.; Huang, C.; Qiu, W. An Intrusion Detection Method Combining Variational Auto-Encoder and Generative Adversarial Networks. Comput. Netw. 2024, 253, 110724.
26. Lopez-Martin, M.; Sanchez-Esguevillas, A.; Arribas, J.I.; Carro, B. Contrastive Learning over Random Fourier Features for IoT Network Intrusion Detection. IEEE Internet Things J. 2023, 10, 8505–8513.
27. Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. arXiv 2020, arXiv:2010.01412.
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
29. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
30. Roweis, S. EM algorithms for PCA and SPCA. Adv. Neural Inf. Process. Syst. 1997, 10, 626–632.
31. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422.
32. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
33. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29.
34. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
35. Kim, A.; Park, M.; Lee, D.H. AI-IDS: Application of Deep Learning to Real-Time Web Intrusion Detection. IEEE Access 2020, 8, 70245–70261.

Figure 1. The BiGRU and GRU architectures.

Figure 2. Schematic overview of the proposed framework.

Figure 4. UNSW-NB15 Dataset Distribution.

Figure 5. ResCAE-BiGRU training and validation performance on CICIDS2017 dataset.

Figure 6. ResCAE-BiGRU confusion matrix for CICIDS2017 dataset.

Figure 7. ResCAE-BiGRU training and validation performance on UNSW-NB15 dataset.

Figure 8. ResCAE-BiGRU confusion matrix for UNSW-NB15 dataset.

Figure 9. Comparison results with existing model performance.

Table 2. CICIDS2017 Dataset Distribution.

Category	Description	Samples
Normal data	BENIGN	568,274
Attack data	DoS Hulk	229,198
	PortScan	157,703
	DDoS	127,082
	DoS GoldenEye	10,289
	FTP-Patator	7894
	SSH-Patator	5861
	DoS Slowloris	5771
	DoS Slowhttptest	5485
	Bot	1943
	Web Attack-Brute Force	1497
	Web Attack-XSS	648
	Infiltration	34
	Web Attack-SQL Injection	21
	Heartbleed	11
Total		1,121,711
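
The degree of imbalance in Table 2 can be quantified directly from the sample counts. The following is a minimal Python sketch that computes each class's share of the dataset and its ratio to the BENIGN class. The counts are copied from Table 2; the variable and function names are illustrative and are not taken from the paper.

```python
# Sample counts copied from Table 2 (CICIDS2017 dataset distribution).
CICIDS2017_COUNTS = {
    "BENIGN": 568_274,
    "DoS Hulk": 229_198,
    "PortScan": 157_703,
    "DDoS": 127_082,
    "DoS GoldenEye": 10_289,
    "FTP-Patator": 7_894,
    "SSH-Patator": 5_861,
    "DoS Slowloris": 5_771,
    "DoS Slowhttptest": 5_485,
    "Bot": 1_943,
    "Web Attack-Brute Force": 1_497,
    "Web Attack-XSS": 648,
    "Infiltration": 34,
    "Web Attack-SQL Injection": 21,
    "Heartbleed": 11,
}


def class_shares(counts):
    """Return each class's fraction of all samples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratios(counts, majority="BENIGN"):
    """Return the number of majority-class samples per sample of each class."""
    return {label: counts[majority] / n for label, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(CICIDS2017_COUNTS)
    ratios = imbalance_ratios(CICIDS2017_COUNTS)
    # List classes from rarest to most frequent.
    for label, n in sorted(CICIDS2017_COUNTS.items(), key=lambda kv: kv[1]):
        print(f"{label:26s} {n:>9,d}  share={shares[label]:.6f}  "
              f"BENIGN:class={ratios[label]:,.0f}:1")
```

The three rarest classes (Infiltration, Web Attack-SQL Injection, and Heartbleed) together contribute only 66 samples, and BENIGN outnumbers Heartbleed by more than 50,000 to 1.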

Table 3. Experimental platform configuration.

Name	Configuration
Operating System	Windows 11
GPU	NVIDIA GeForce RTX 4060
CPU	Intel Core i7-13700H
RAM	32 GB
VRAM	8 GB
Python Version	3.8.8
PyTorch Framework	2.4.1 + cu118
CUDA Version	12.3

Table 4. Experimental hyperparameter settings.

Parameter	Value
batch_size	256
Epochs	50
Optimizer	SAM (with SGD base)
Learning rate (lr)	0.01
Loss Function (Pre-training)	MSELoss
Loss Function (Fine-tuning)	Focal Loss
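
To make two of the Table 4 entries more concrete, the sketch below illustrates, in plain Python, the focal loss used for fine-tuning and a single SAM update with a plain SGD base step. It is a toy under stated assumptions, not the authors' training code. Only `lr = 0.01` comes from Table 4; the focusing parameter `gamma`, the class weight `alpha`, and the SAM neighbourhood radius `rho` are not given in this section, so the values used here (2.0, 1.0, and 0.05) are placeholders. The quadratic toy loss and all function names are hypothetical.

```python
import math


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Multi-class focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are assumed values; gamma = 0 and alpha = 1 give cross-entropy.
    """
    m = max(logits)                                # shift logits for a stable softmax
    exps = [math.exp(z - m) for z in logits]
    p_t = max(exps[target] / sum(exps), 1e-12)     # probability of the true class
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def sam_sgd_step(w, grad_fn, lr=0.01, rho=0.05):
    """One SAM update with a plain SGD base step (rho is an assumed radius)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)                             # already at a stationary point
    # Step 1: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: apply the gradient taken at the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


# Hypothetical toy objective: 0.5 * sum(c_i * w_i**2), with an analytic gradient.
CURVATURE = [1.0, 10.0]


def toy_loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURE, w))


def toy_grad(w):
    return [c * wi for c, wi in zip(CURVATURE, w)]


if __name__ == "__main__":
    w = [1.0, 1.0]
    for step in range(5):
        w = sam_sgd_step(w, toy_grad)
        print(f"step {step + 1}: loss = {toy_loss(w):.4f}")
    # A confidently correct sample is down-weighted relative to cross-entropy.
    print(focal_loss([5.0, 0.0, 0.0], target=0),
          focal_loss([5.0, 0.0, 0.0], target=0, gamma=0.0))
```

The factor `(1 - p_t)**gamma` shrinks the loss of well-classified samples, so training focuses on hard and rare examples, while the SAM step needs two gradient evaluations per update and therefore roughly doubles the cost of each iteration compared with plain SGD.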

Table 5. Performance on CICIDS2017 and UNSW-NB15.

Dataset	Label	Precision	Recall	F1-Score	Accuracy (overall, per dataset)
CICIDS2017	BENIGN	1.000	0.998	0.994	0.993
	DoS	0.993	0.998	0.996
	DDoS	0.998	0.997	0.998
	PortScan	0.994	1.000	0.997
	BruteForce	0.921	1.000	0.959
	WebAttack	1.000	0.800	0.889
	Botnet	0.454	0.997	0.624
	Infiltration	0.038	0.846	0.073
UNSW-NB15	Normal	0.976	0.959	0.967	0.911
	Generic	0.982	0.899	0.938
	Exploits	0.907	0.948	0.927
	Fuzzers	0.833	0.837	0.836
	DoS	0.767	0.871	0.816
	Reconnaissance	0.859	0.863	0.861
	Analysis	0.510	0.768	0.613
	Backdoors	0.520	0.670	0.586
	Shellcode	0.996	0.587	0.739
	Worms	1.000	0.538	0.700
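
The F1-score is the harmonic mean of precision and recall, which is why a class with very high recall can still score poorly in Table 5 when its precision is low. As a quick illustration, the following sketch recomputes F1 for three rare CICIDS2017 classes from the three-decimal precision and recall values in the table. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the CICIDS2017 rows of Table 5.
RARE_CLASS_ROWS = {
    "WebAttack": (1.000, 0.800, 0.889),
    "Botnet": (0.454, 0.997, 0.624),
    "Infiltration": (0.038, 0.846, 0.073),
}

if __name__ == "__main__":
    for label, (p, r, reported) in RARE_CLASS_ROWS.items():
        print(f"{label:13s} recomputed F1 = {f1_score(p, r):.3f}  reported = {reported:.3f}")
```

For Infiltration, a recall of 0.846 combined with a precision of 0.038 means that most flows flagged as Infiltration are false positives, and the harmonic mean falls to about 0.073.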

Table 6. Ablation experiment results.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
ResCAE + BiGRU	97.42	97.30	97.42	97.29
ResCAE + BiGRU + SAM	98.39	98.35	98.39	98.35
iForest + ResCAE + BiGRU	98.55	98.07	98.55	98.30
SMOTE + ResCAE + BiGRU	99.03	98.93	99.03	98.94
Proposed	99.33	99.53	99.33	99.41
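
In every row of Table 6 the accuracy and recall values coincide (for example, 97.42 and 97.42). This is the pattern that support-weighted averaging of per-class recall produces: weighting each class's recall by its share of the samples reduces to the overall fraction of correct predictions. The averaging scheme is not stated in this section, so reading the table as weighted-averaged is an assumption. The sketch below demonstrates the identity on a small, made-up confusion matrix; all names and numbers in it are hypothetical.

```python
# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted class).
CONFUSION = [
    [950, 30, 20],
    [5, 40, 5],
    [2, 1, 7],
]


def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def weighted_recall(cm):
    """Per-class recall averaged with weights equal to each class's support."""
    total = sum(sum(row) for row in cm)
    result = 0.0
    for i, row in enumerate(cm):
        support = sum(row)
        if support == 0:
            continue                      # a class with no samples contributes nothing
        result += (support / total) * (row[i] / support)
    return result


def macro_recall(cm):
    """Unweighted mean of per-class recall, which treats rare classes equally."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm) if sum(row) > 0]
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    print(f"accuracy        = {accuracy(CONFUSION):.4f}")
    print(f"weighted recall = {weighted_recall(CONFUSION):.4f}")
    print(f"macro recall    = {macro_recall(CONFUSION):.4f}")
```

On imbalanced data the macro average is usually the more demanding summary, because a poorly detected rare class lowers it as much as a poorly detected majority class would.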

Table 7. Model comparison results on CICIDS2017 dataset.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
DT [1]	98.33	97.22	97.38	98.37
CNN-LSTM [35]	93	86.47	76.83	81.36
AE-ResNet [13]	99.23	99.63	98.55	99.78
LCVAE-CBiLSTM [5]	98.69	98.84	99.83	98.49
Proposed	99.33	99.53	99.33	99.41

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Nong, X.; Qin, K.; Xie, X. Anomaly Detection in Imbalanced Network Traffic Using a ResCAE-BiGRU Framework. Symmetry 2025, 17, 2087. https://doi.org/10.3390/sym17122087
