3.3.1. Convolutional Neural Network
The CNN is a widely used deep learning module. Its core principle is to extract features from the input data through convolution operations and to abstract information via stacked network layers, enabling local feature perception and the capture of spatial correlations [35]. In current mainstream intrusion detection datasets, the raw network traffic to be analyzed is typically packet-level pcap data, from which flow-level CSV files are usually extracted. Given this characteristic of network traffic datasets, this paper employs a 1D-CNN rather than a 2D-CNN for feature extraction, avoiding the introduction of redundant information and additional computational complexity.
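As a brief illustration (not the preprocessing pipeline used in the paper), the sketch below assumes a flow-level CSV export named flows.csv with a "Label" column; each numeric row then becomes a one-dimensional feature vector suitable for a 1D-CNN:

```python
import numpy as np
import pandas as pd

# Minimal sketch (hypothetical file name and column layout): each row of a
# flow-level CSV is one sample, and its numeric columns form a 1-D vector.
df = pd.read_csv("flows.csv")                        # assumed flow-level export
labels = df["Label"]                                 # assumed label column name
features = df.drop(columns=["Label"]).to_numpy(dtype=np.float32)

# Each sample is a flat vector of flow statistics (e.g., 80 values), so a
# 1-D convolution along the feature axis suffices; a 2-D kernel would only
# add redundant neighbourhood structure and extra computation.
print(features.shape)                                # (num_flows, num_features)
```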
In intrusion detection, specific combinations of features typically indicate particular types of attack. The datasets used in this paper are essentially structured tables. The convolutional kernel of the 1D-CNN slides along the feature dimension, capturing local correlations between adjacent features; the output tensor has dimensions (1, 80). The CICIDS2017 and UNSW-NB15 datasets were generated with the feature extraction tools CICFlowMeter (https://www.unb.ca/cic/research/applications.html#CICFlowMeter, accessed 12 January 2026) and Argus, respectively. CICFlowMeter arranges its 80 features sequentially by groups such as forward-flow statistics and flags [27], while Argus arranges features sequentially by basic, content, and temporal characteristics [29]. This ordering places related features next to one another, which forms the foundation for the 1D-CNN's extraction of local spatial patterns during convolution.
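To make the sliding-window behaviour concrete, the following PyTorch sketch (the kernel size and channel count are illustrative assumptions, not the paper's hyperparameters) convolves a single 80-feature flow record along the feature axis:

```python
import torch
import torch.nn as nn

# One flow record with 80 features, shaped (batch, channels, length) for Conv1d.
x = torch.randn(1, 1, 80)

# A 1-D kernel of (assumed) size 3 slides along the feature dimension, so each
# output position mixes a small window of adjacent, related features
# (e.g., neighbouring forward-flow statistics).
conv = nn.Conv1d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
out = conv(x)
print(out.shape)   # torch.Size([1, 32, 80])
```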
The 1D-CNN architecture employed in this paper is illustrated in Figure 1. The model sequentially incorporates a Conv1D layer, a ReLU activation function, and a max pooling layer. Conv1D applies a one-dimensional convolution kernel in a sliding-window manner to extract local spatial features from the input data, computing the dot product between the input and the kernel weights to produce a one-dimensional feature map; its operation is illustrated in Figure 2. Following the Conv1D stage, a Reshape operation is applied: since the original UNSW-NB15 features lack sequence depth after vectorization, the 196-dimensional feature vectors are reshaped into a 14 × 14 temporal feature matrix, providing the contextual foundation for the subsequent BiLSTM layers. After convolution, the ReLU activation function is applied, defined as follows:
$$\mathrm{ReLU}(z_i) = \max(0, z_i)$$
Here, $z_i = W \cdot x + b$, where $z_i$ denotes the linear output value of the $i$-th output channel, $W$ represents the weights of the one-dimensional convolution kernel, $x$ is the input data, and $b$ is the bias term. The ReLU function determines neuron activation based on a threshold [32], masking neurons with negative outputs to promote network sparsity. Following activation, the feature maps undergo max pooling, which retains the salient features while reducing the spatial dimension, thereby preventing overfitting and improving learning efficiency; max pooling reduces the sample size without altering any weights [36]. During the early phases of training, parameter updates in each layer continuously shift the input distributions of the following layers, which can slow or destabilize training. To counteract this covariate shift between layers, which can hinder convergence or trigger vanishing or exploding gradients, a Batch Normalization (BN) layer is introduced after max pooling. The BN computation steps are as follows:
Let a batch of inputs be denoted as $\mathcal{B} = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the batch size. The mean of this batch is
$$\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i,$$
and the variance is
$$\sigma_{\mathcal{B}}^{2} = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_{\mathcal{B}}\right)^{2}.$$
Each input is then normalized as
$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}},$$
where a small constant $\epsilon$ is introduced to avoid a zero denominator. Finally, $\hat{x}_i$ undergoes a linear transformation through two variables:
$$y_i = \gamma \hat{x}_i + \beta,$$
where $\gamma$ and $\beta$ are learnable parameters during training, enabling the network to represent any distribution.
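As a minimal sketch of the block described above, assuming PyTorch (the channel count and layer sizes are illustrative assumptions; the 196-dimensional input and the 14 × 14 reshape follow the description above), the pipeline Conv1D → ReLU → max pooling → BN can be written as follows, ending with the Reshape that prepares the sequence for the BiLSTM:

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """Illustrative Conv1D -> ReLU -> MaxPool -> BatchNorm feature extractor."""
    def __init__(self, in_ch: int = 1, out_ch: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)  # local spatial features
        self.relu = nn.ReLU()                                           # mask negative responses
        self.pool = nn.MaxPool1d(kernel_size=2)                         # keep salient features, halve length
        self.bn = nn.BatchNorm1d(out_ch)                                # stabilize layer input distributions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.bn(self.pool(self.relu(self.conv(x))))

x = torch.randn(32, 1, 196)        # a batch of (assumed) 196-dim UNSW-NB15 feature vectors
feat = CNNBlock()(x)               # -> (32, 2, 98), i.e. 196 values per sample
seq = feat.reshape(32, 14, 14)     # Reshape: 196 values -> 14 time steps x 14 features for the BiLSTM
print(seq.shape)                   # torch.Size([32, 14, 14])
```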
3.3.2. Bidirectional Long-Short-Term Memory
Long Short-Term Memory (LSTM) is a specialized type of RNN that utilizes gating mechanisms to learn long-term dependencies in data. Its architecture incorporates cell states and gating mechanisms, addressing the limitations of traditional RNNs in modeling long-term dependencies and mitigating vanishing or exploding gradients during training.
The LSTM network architecture, as shown in Figure 3, primarily consists of three gates: the forget gate, the input gate, and the output gate. The forget gate determines whether the memory from the previous time step is retained or discarded. Let $x_t$ denote the input at time step $t$ and $h_{t-1}$ denote the hidden state at the previous time step. The output of the forget gate is $f_t$, where $W_{xf}$ and $W_{hf}$ represent the weights, $b_f$ is the forget gate bias, and the sigmoid function $\sigma$ serves as the activation function. The mathematical expression for the forget gate is as follows:
$$f_t = \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right).$$
When the forget gate outputs $f_t = 1$, all of the previous memory is retained; when $f_t = 0$, it is entirely discarded.
The input gate updates the memory by determining which information from the current input is written into the cell state. Here, $W_{xi}$ and $W_{hi}$ represent the weights, while $b_i$ denotes the input gate bias. The mathematical expression for the input gate is as follows:
$$i_t = \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right).$$
The input gate also incorporates a candidate memory unit, which generates candidate memories using the tanh activation function. Here, $W_{xc}$ and $W_{hc}$ represent the weights, while $b_c$ denotes the candidate memory unit bias. The expression for the candidate memory is:
$$\tilde{C}_t = \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right).$$
After the past memory and the current input information have been selected in this way, the memory cell state is updated as follows:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t.$$
The output gate determines which components of $C_t$ are passed on as the current state output to the next cell, where $C_t$ is first scaled by the tanh function. Let $W_{xo}$ and $W_{ho}$ denote the weights, and $b_o$ denote the output gate bias. The mathematical expressions for the output gate and the hidden state are as follows:
$$o_t = \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right),$$
$$h_t = o_t \odot \tanh\left(C_t\right).$$
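The gate equations above can be collected into a single cell update. The following NumPy sketch (dimensions and random initialization are arbitrary, purely for illustration) performs one LSTM time step exactly as written, with W playing the role of the input-to-gate weights ($W_{xf}, W_{xi}, \ldots$) and U the hidden-to-gate weights ($W_{hf}, W_{hi}, \ldots$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b hold the per-gate parameters."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])    # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                            # memory-cell update
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    h_t = o_t * np.tanh(c_t)                                      # hidden state
    return h_t, c_t

# Toy dimensions: 14-dim input, 8-dim hidden state.
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((8, 14)) for k in "fico"}
U = {k: rng.standard_normal((8, 8)) for k in "fico"}
b = {k: np.zeros(8) for k in "fico"}
h, c = np.zeros(8), np.zeros(8)
h, c = lstm_step(rng.standard_normal(14), h, c, W, U, b)
```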
As previously mentioned, LSTMs perform well on time-series data. In some network traffic scenarios, however, the characteristics of an attack depend not only on historical traffic but also on subsequent traffic. Imrana et al. [37] proposed a bidirectional long short-term memory (BiLSTM) network. This model consists of forward and backward LSTM networks, with the time-series data replicated and fed into both. By also collecting feature information from the reversed sequence during training, this bidirectional dependency addresses the detection limitations of traditional LSTMs. BiLSTM is thus a variant of LSTM that learns from the data both forward, from the start of the sequence to its end, and backward, from the end to the start [37]. Structurally, it incorporates two LSTMs, as shown in Figure 4: one processes the time series in the forward direction, the other in the backward direction, and their outputs are concatenated. BiLSTM can therefore exploit all available input information, from both past and future, within a given time window. It retains the gating mechanism of traditional LSTMs, mitigating potential vanishing and exploding gradients during training.
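In a framework such as PyTorch, this forward/backward structure corresponds to setting bidirectional=True, which runs both directions and concatenates their hidden states at every time step (the hidden size of 64 below is an assumption for illustration, not the paper's setting):

```python
import torch
import torch.nn as nn

# 14 time steps of 14-dim features, as produced by the Reshape step above.
seq = torch.randn(32, 14, 14)

# Two stacked BiLSTM layers; each direction has an (assumed) hidden size of 64.
bilstm = nn.LSTM(input_size=14, hidden_size=64, num_layers=2,
                 batch_first=True, bidirectional=True)
out, (h_n, c_n) = bilstm(seq)

# Forward and backward hidden states are concatenated at every time step,
# so the feature dimension doubles to 128.
print(out.shape)   # torch.Size([32, 14, 128])
```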
As described in Section 3.3.1, in the BiLSTM stage the feature vectors extracted by the 1D-CNN are segmented into equal-length sub-segments, each containing the same number of features. Each sub-segment serves as the input for one time step of the LSTM, which then learns the logical relationships among the features across time steps. The BiLSTM employed in this paper is also better suited to later-stage features such as data backhaul, helping to further confirm the potential attack type.
Based on the above analysis, BiLSTM is well suited to network intrusion detection. This study employs a two-layer BiLSTM to capture the temporal features of network traffic. Combined with the 1D-CNN's ability to capture local spatial features, the hybrid CNN-BiLSTM architecture integrates the strengths of both, balancing the temporal and spatial characteristics of the data and extracting features as fully as possible.
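Putting the two components together, a minimal end-to-end sketch of such a hybrid model could look as follows; the layer sizes, hidden dimension, and number of classes are assumptions for illustration rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """Illustrative hybrid: 1D-CNN for local spatial features, BiLSTM for temporal context."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 2, kernel_size=3, padding=1),   # local spatial features
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.BatchNorm1d(2),
        )
        self.bilstm = nn.LSTM(input_size=14, hidden_size=64, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.cnn(x)                     # (B, 2, 98) for a 196-dim input
        seq = feat.reshape(x.size(0), 14, 14)  # 14 time steps x 14 features
        out, _ = self.bilstm(seq)              # (B, 14, 128): both directions concatenated
        return self.classifier(out[:, -1, :])  # classify from the final time step

logits = CNNBiLSTM()(torch.randn(32, 1, 196))
print(logits.shape)   # torch.Size([32, 10])
```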