A Deep Learning Model for Network Intrusion Detection with Imbalanced Data

Abstract: With the increase in the number and types of network attacks, traditional firewalls and data encryption methods can no longer meet the needs of current network security. As a result, intrusion detection systems have been proposed to deal with network threats. The current mainstream intrusion detection algorithms rely on machine learning but suffer from low detection rates and the need for extensive feature engineering. To address the issue of low detection accuracy, this paper proposes a traffic anomaly detection model named DLNID (a deep learning model for network intrusion detection), which combines an attention mechanism with a bidirectional long short-term memory (Bi-LSTM) network. The model first extracts sequence features of data traffic through a convolutional neural network (CNN), then reassigns the weights of each channel through the attention mechanism, and finally uses Bi-LSTM to learn the sequence features. Public intrusion detection datasets are generally seriously imbalanced. To address this issue, this paper employs adaptive synthetic sampling (ADASYN) to expand the minority class samples and eventually form a relatively symmetric dataset, and uses a modified stacked autoencoder for data dimensionality reduction with the objective of enhancing information fusion. DLNID is an end-to-end model, so it does not require manual feature extraction. Tested on the public network intrusion detection benchmark dataset NSL-KDD, the model achieves better accuracy and F1 score than the comparison methods, reaching 90.73% and 89.65%, respectively.


Introduction
With the rapid development of computer and communications networks, Internet technology has provided more convenient services to people around the world than ever before. However, the number and types of cyberattacks (such as network viruses, malicious eavesdropping, and malicious attacks), which increase year by year, pose serious threats to people's information security and property safety. Therefore, information security and communications security have become crucial to both individuals and society as a whole [1,2]. Firewalls are widely deployed as a basic means of security. However, because they are difficult to configure manually and lag behind new types of attacks, they are no longer sufficient for organizations that need high security (e.g., government units and military bases) [3]. Therefore, network security researchers have proposed a new means to quickly identify and deal with anomalous network behavior: intrusion detection systems (IDSs).
The IDS has proven to be an efficient and promising approach. It detects known threats and malicious activities by monitoring traffic data in computer systems and issues alerts when these threats are detected [4]. There are two types of monitoring for malicious activities. One is signature-based detection, which, like antivirus software, compares traffic with previously collected attack features; the other is anomaly-based detection, which compares traffic with normal traffic to make a judgment. In the KDD99 dataset, Stolfo et al. classified network attacks into four categories, namely, the denial-of-service attack (DoS), user-to-root attack (U2R), remote-to-local attack (R2L), and probe attack [5].
Many researchers now advocate combining intrusion detection with machine learning (ML) technologies to detect network attacks by building effective models. The authors in [6] propose the use of naive Bayes for the identification of anomalous networks and compare it with decision trees (another classical machine learning algorithm). The authors in [7] combine the support vector machine (SVM) with the genetic algorithm to optimize the selection, parameters, and weights of SVM features, thus improving the accuracy of network attack identification. The authors in [8] improve detection by constructing a multi-level random forest model to detect anomalous network behavior. The authors in [9] improve the existing k-nearest neighbor (KNN) classifier by combining K-means clustering with the KNN classifier to improve detection accuracy. The authors in [10] propose a novel intrusion detection method that first decomposes the network data into smaller subsets with the C4.5 decision tree algorithm and then creates multiple SVM models for the subsets, which reduces the time complexity and improves the detection rate of unknown attacks.
However, traditional machine learning methods usually emphasize feature engineering, which consumes considerable computational resources, and usually learn only shallow features, leading to less satisfactory detection results. Many scholars have therefore turned their attention to deep learning, hoping to feed network traffic data directly into the model and skip the feature selection step. In one study [11], the authors propose a model structure based on deep belief networks (DBNs) and probabilistic neural networks (PNNs), which reduces the dimensionality of the data using a DBN and then classifies the data using a PNN, and which is superior to the traditional PNN. The authors in [12] propose a convolutional neural network-based detection method that processes traffic data into image form, removing the need to design features manually. In another study [13], the authors use RNNs for botnet anomaly detection, exploiting the effectiveness of RNNs on timing features to further improve classification accuracy. Table 1 summarizes the relevant research.

Author | Method | Description | Category
[12] | CNN | Proposed to use CNN to detect network traffic data processed into the form of pictures. | Deep Learning
Zhao et al. [11] | DBN and PNN | Proposed a model structure based on DBN and PNN to reduce the dimensionality of the data using DBN and then classify the data using PNN. | Deep Learning
Su et al. [14] | CNN and LSTM | Proposed a model based on CNN and LSTM to detect each attack type. | Deep Learning
Electronics 2022, 11, 898
However, network traffic data are unevenly distributed, and none of the above networks exploits the correlation between traffic features. In this paper, a DLNID model is proposed to solve these remaining problems, using adaptive synthetic sampling (ADASYN) for data augmentation of unbalanced samples and a modified stacked autoencoder for data dimensionality reduction. To train and test the DLNID model, we use the NSL-KDD dataset for simulation testing. This paper makes the following contributions:
(1) A DLNID model combining an attention mechanism and Bi-LSTM is proposed. This model can classify network traffic data accurately;
(2) To address the issue of imbalanced network data, ADASYN is used for data augmentation of the minority class samples, eventually making the number of samples of each type relatively symmetrical and allowing the model to learn adequately;
(3) An improved stacked autoencoder is proposed and used for data dimensionality reduction with the objective of enhancing information fusion.
The rest of this paper is organized as follows: Section 2 details the techniques and innovations used in this paper and presents a diagram of the architecture of the DLNID model. Section 3 presents information about the NSL-KDD dataset used in this paper. Section 4 provides experimental results and analysis. In Section 5, we summarize our study and propose future research.

ADASYN
Adaptive synthetic sampling (ADASYN) [15] is an adaptive oversampling algorithm for minority class samples. Compared with other data expansion algorithms, it generates more instances in regions of the feature space with lower density and fewer instances in regions with higher density. This property adaptively shifts the decision boundary toward difficult-to-learn samples, so ADASYN is better suited than other data augmentation algorithms to network traffic with severe data imbalance. The algorithm is executed in the following steps:
Step 1: Calculate the number of samples to be synthesized, G, which can be expressed as
$$G = (n_b - n_s) \times \beta \tag{1}$$
where $n_b$ is the number of majority class samples, $n_s$ is the number of minority class samples, and $\beta \in (0, 1)$.
Step 2: For each minority sample, find its K nearest neighbors by Euclidean distance and denote by $r_i$ the proportion of majority class samples among these neighbors, which can be expressed as
$$r_i = \frac{k}{K} \tag{2}$$
where K is the number of neighbors, and k is the number of majority class samples among them.
Step 3: Calculate the number of samples to be synthesized for each minority sample according to G, and synthesize the samples according to Equation (4):
$$g = G \times r_i \tag{3}$$
$$Z_i = X_i + (X_{Zi} - X_i) \times \lambda \tag{4}$$
where g is the number of samples to be synthesized, $Z_i$ is the new synthesized sample, $X_i$ is the current minority sample, $X_{Zi}$ is a random minority sample among the K neighbors of $X_i$, and $\lambda \in (0, 1)$.
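The three steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function name `adasyn`, the default values of `beta` and `K`, the rounding of g to an integer, and the choice to skip a minority sample that has no minority neighbor are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adasyn(minority, majority, beta=0.5, K=5, seed=0):
    """Return synthetic minority samples following Steps 1-3.

    Assumes K is no larger than the number of other samples.
    """
    rng = random.Random(seed)
    # Step 1: total number of samples to synthesize, G = (n_b - n_s) * beta.
    G = (len(majority) - len(minority)) * beta

    # Label 0 marks a minority sample, label 1 a majority sample.
    labeled = [(x, 0) for x in minority] + [(x, 1) for x in majority]
    synthetic = []
    for x_i in minority:
        # Step 2: K nearest neighbors of x_i, excluding x_i itself.
        others = [(p, lab) for p, lab in labeled if p is not x_i]
        others.sort(key=lambda item: euclidean(x_i, item[0]))
        neighbors = others[:K]
        r_i = sum(lab for _, lab in neighbors) / K  # r_i = k / K

        # Step 3: g = G * r_i, rounded here to a whole number of samples.
        g = round(G * r_i)
        minority_neighbors = [p for p, lab in neighbors if lab == 0]
        if not minority_neighbors:
            continue  # assumption: nothing to interpolate toward, so skip
        for _ in range(g):
            x_z = rng.choice(minority_neighbors)
            lam = rng.random()
            # Z_i = X_i + (X_Zi - X_i) * lambda
            synthetic.append(tuple(a + (b - a) * lam for a, b in zip(x_i, x_z)))
    return synthetic
```

Because every synthetic point is an interpolation between two minority samples, it always lies on the segment joining them, and minority samples surrounded by more majority neighbors receive more synthetic points.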

Autoencoder
An autoencoder [16] is an unsupervised learning network architecture in which the input and output dimensions are the same, and the number of nodes in the middle layer is generally smaller than the number of nodes in the input and output layers. Figure 1 illustrates a typical autoencoder consisting of two main components, i.e., the encoder and the decoder. It uses deep learning techniques to find an efficient representation of the input data while losing as little information as possible. In short, the encoder compresses the original data into a lower-dimensional representation, which the decoder then reconstructs into the original data. According to this working principle, the trained encoder can be used as a tool for data dimensionality reduction. Compared with the traditional principal component analysis (PCA) [17] dimensionality reduction method, the autoencoder can apply nonlinear transformations, which facilitates the learning of deeper representations of the data.
Although the autoencoder can achieve better data dimensionality reduction than other dimensionality reduction methods, we aimed to propose an autoencoder that both performs dimensionality reduction and enhances data robustness to adapt to complex network scenarios. Dropout [18] discards each neuron with probability p during network training iterations. Due to this mechanism, each neuron is not overly dependent on other neurons, which reduces overfitting and improves the generalization ability of the model to some extent. Combining the two ideas, we apply dropout within a stacked autoencoder to obtain the low-dimensional representation. Since each dimension may be discarded during training, the information carried by each dimension is more comprehensive than that obtained by a traditional autoencoder after dimensionality reduction, which facilitates model learning. Based on these ideas, we propose a stacked autoencoder structure with added dropout, as shown in Figure 2.
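The dropout mechanism described above can be illustrated with a few lines of plain Python. This is a sketch only: the helper name `dropout` and the example activations are hypothetical, and the 1 / (1 - p) rescaling of surviving values ("inverted dropout") is an assumption borrowed from common deep learning libraries, since the text states only that each neuron is discarded with probability p.

```python
import random


def dropout(values, p, rng):
    """Zero each element with probability p and rescale the survivors.

    The rescaling by 1 / (1 - p) is an assumption (inverted dropout); it
    keeps the expected value of each activation unchanged during training.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


rng = random.Random(0)
hidden = [0.2, 0.5, 0.1, 0.9]  # hypothetical activations of one encoder layer
print(dropout(hidden, p=0.5, rng=rng))
```

At inference time dropout is switched off, so the trained encoder maps each input to a deterministic low-dimensional representation.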

Channel Attention
The attention mechanism is based on the idea that, when observing an image, people usually focus more on some local regions than on the image as a whole. At ImageNet 2017, the WMW team proposed the squeeze-and-excitation (SE) network based on the channel attention mechanism [19] and won the image classification challenge by a large margin.
The convolutional block attention module (CBAM) [20] improves on SE by adding a max-pooling branch, and through a large number of experiments, the authors of [20] showed that this addition effectively improves classification performance. Based on these ideas, in this paper, the CBAM used in 3D image processing was applied, with modifications, to the intrusion detection model for 2D data. As shown in Figure 3, the flow of the CBAM for 2D data processing is composed of two important phases, i.e., squeeze and excitation. In the squeeze phase, the traffic data are reduced by average pooling or max pooling from a (c, w)-dimensional form to a (c, 1)-dimensional form to obtain the global information of each channel. In the excitation phase, the compressed data are adaptively recalibrated by a multilayer perceptron (MLP), which returns a weight for each channel.
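The squeeze and excitation phases can be sketched in plain Python as below. This is an illustration under assumptions, not the paper's implementation: it uses both pooled descriptors and a shared bias-free two-layer MLP followed by a sigmoid, as in the CBAM channel attention of [20], and the names `mlp`, `channel_attention`, `w1`, and `w2` are hypothetical. The paper's own modifications to the CBAM are not reproduced here.

```python
import math


def mlp(vec, w1, w2):
    """Two-layer perceptron with a ReLU in between (no biases, for brevity)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(x, w1, w2):
    """Reweight the channels of x, given as c lists of w values each.

    w1 has shape (hidden, c) and w2 has shape (c, hidden).
    """
    # Squeeze: (c, w) -> (c, 1) by average pooling and by max pooling.
    avg = [sum(ch) / len(ch) for ch in x]
    mx = [max(ch) for ch in x]
    # Excitation: the shared MLP scores both descriptors, and a sigmoid of
    # their sum gives one weight per channel.
    scores = [a + m for a, m in zip(mlp(avg, w1, w2), mlp(mx, w1, w2))]
    weights = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    # Scale every value in a channel by that channel's weight.
    scaled = [[wt * v for v in ch] for wt, ch in zip(weights, x)]
    return scaled, weights
```

The output keeps the (c, w) shape of the input, so the block can be placed between the convolutional layers and the Bi-LSTM layer without changing the surrounding dimensions.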


Bidirectional LSTM
Long short-term memory (LSTM) [21,22] introduces memory cells and cell states to overcome the long-term dependency problem that exists in recurrent neural networks (RNNs) [23]. The long-term dependency problem is the exploding or vanishing gradient problem caused by repeated matrix multiplications when RNNs compute the relationship between distant nodes. In one time step, the LSTM network is updated through three gates and a cell state: $i_t$, $f_t$, and $o_t$ represent the input gate, the forget gate, and the output gate, respectively. $\sigma$ (sigmoid) and tanh are two distinct activation functions. $c_t$ represents the current cell state, $c_{t-1}$ represents the previous cell state, and $\tilde{c}_t$ represents the candidate memory cell. $h_t$ represents the hidden state of the current cell, and $h_{t-1}$ represents the hidden state of the previous cell.
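For reference, a standard formulation of the LSTM update that is consistent with these definitions is given below. The input $x_t$, the weight matrices $W$ and $U$, the biases $b$, and the element-wise product $\odot$ are notational assumptions that do not appear in the text, so the original equations may differ in form.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of $c_t$ is what lets gradients flow across many time steps without being repeatedly multiplied by the same matrix.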
The bidirectional LSTM (Bi-LSTM) network [24] improves on its LSTM predecessor by adding backward hidden states $\overleftarrow{h}_t$ to the existing forward hidden states $\overrightarrow{h}_t$, giving it a forward-looking capability similar to that of the hidden Markov model (HMM).
In one time step of the Bi-LSTM network, $h_t$ represents the hidden state of the current cell, $h_{t-1}$ represents the hidden state of the previous cell, $\overrightarrow{h}_t$ represents the forward hidden state of the current cell, and $\overleftarrow{h}_t$ represents the backward hidden state of the current cell.
For network traffic, Bi-LSTM can effectively utilize the temporal features present in the contextual information to improve model training. Its structure is shown in Figure 4.
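A common way to write the Bi-LSTM update in terms of these hidden states is sketched below. The input $x_t$, the $\mathrm{LSTM}(\cdot)$ shorthand for one LSTM step, and the concatenation $[\,\cdot\,;\,\cdot\,]$ are assumptions made for illustration, so the original equations may be written differently.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1}) \\
\overleftarrow{h}_t &= \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1}) \\
h_t &= [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]
\end{aligned}
```

The forward pass reads the sequence from the first step to the last and the backward pass reads it in reverse, so each $h_t$ summarizes context on both sides of position t.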

Network Architecture
As shown in Figure 5, the overall architecture of the DLNID model consists of seven parts: the input layer, the encoder layer, multiple convolutional layers, the attention layer, the Bi-LSTM layer, the fully connected layer, and the output layer. In the input layer, the model accepts the network traffic data from the dataset. In the encoder layer, the model uses the encoder part of the trained improved stacked autoencoder to perform dimensionality reduction on the data. In the convolutional layers, the model applies multiple convolutional operations to extract features from the dimension-reduced data. In the attention layer, the model uses the CBAM to redistribute the weights of the channels, assigning higher weights to more important channels. In the Bi-LSTM layer, the model extracts the feature information of each dimension and learns the relationships between the dimensions. In the fully connected layer and the output layer, the model passes the learned features to the classifier and outputs the classification results. Algorithm 1 presents the training process of the DLNID model.


Data Analysis
The experiments in this paper use the NSL-KDD dataset [5], an improved version of the KDD99 dataset [25] that addresses the data redundancy problem of KDD99 and is one of the benchmark datasets used to evaluate the performance of IDSs. It consists of a training set (KDDTrain+), containing 125,973 traffic samples, and a test set (KDDTest+), containing 22,544 traffic samples. To better reflect the complexity of real networks, the training set contains only 19 attack types, while another 17 attack types appear only in the test set.
The NSL-KDD dataset has a total of 42 dimensions, one of which is the classification label, while the rest are features. For binary classification, the classification labels are divided into two categories, i.e., normal and anomaly. For multiclassification, the classification labels are divided into five categories, i.e., normal, DoS, R2L, U2R, and probe.

One-Hot Encoding
Since three of the features are non-numerical and the model can only accept numerical input, one-hot encoding was adopted to convert these three features into numerical features. For example, the values of protocol_type are TCP, UDP, and ICMP, and after encoding, the values become [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively. After encoding, each sample in the dataset has 122 dimensions.
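The encoding of protocol_type can be reproduced with a short helper. This is a minimal sketch: the function name `one_hot` and the fixed category order are assumptions made for the example, and libraries such as sklearn provide equivalent encoders.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of value and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Category order is an assumption; it fixes which position each value maps to.
PROTOCOL_TYPES = ["TCP", "UDP", "ICMP"]
for name in PROTOCOL_TYPES:
    print(name, one_hot(name, PROTOCOL_TYPES))
```

If the three non-numerical features are protocol_type, service, and flag with 3, 70, and 11 distinct values (counts commonly reported for NSL-KDD, not stated in the text), then the 38 remaining numerical features plus 84 one-hot columns give the 122 dimensions mentioned above.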

Data Augmentation
The number of U2R and R2L samples in the NSL-KDD test set is much higher than that in the training set, where these classes make up only a small percentage of the samples. As a result, the trained model has difficulty distinguishing them. We therefore used the aforementioned ADASYN algorithm to expand the classes (such as U2R and R2L) that account for a small percentage of the original training set, balancing the proportions of majority and minority samples. This alleviates the imbalance problem in the network data to a certain extent and further boosts the generalization ability of the model.

Normalization
A large gap between the value ranges of different features within the dataset can cause problems such as slow model training and little improvement in accuracy. To tackle this issue, the MinMaxScaler [26] was adopted to map the data into the range [0, 1] as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{14}$$
where $x'$ is the normalized value, $x_{\max}$ is the maximum value of the feature, and $x_{\min}$ is its minimum value.
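Equation (14) applied to a single feature column can be sketched as follows. The function name `min_max_scale` is hypothetical, and returning 0.0 for a constant column is an assumption made to avoid division by zero; the experiments themselves rely on the sklearn MinMaxScaler.

```python
def min_max_scale(column):
    """Map the values of one feature column into [0, 1] per Equation (14)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


durations = [2, 4, 6]  # hypothetical raw values of one feature
print(min_max_scale(durations))
```

The minimum of the column maps to 0 and the maximum maps to 1, so features with very different raw ranges end up on a comparable scale.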

Results
In this section, we detail the experimental settings and the performance metrics of the model. In addition, we present two sets of ablation experiments to verify the reliability of the data augmentation and improved dimensionality reduction approaches proposed in Section 2. Finally, we compare the model with models from other papers.

Experimental Settings
In this study, all experiments were conducted on an Intel(R) Core(TM) i5-1035G1 CPU @ 1.00 GHz 1.19 GHz running Windows 10, using Python 3.7, PyTorch 1.10, and the sklearn library to write and simulate the model.

Performance Metrics
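The confusion matrix was used to summarize the predictions of the model. Accuracy (Acc), precision (Pre), recall (Rec), and F1 score (F1) were selected as the performance indicators for binary classification, while recall and false-positive rate (FPR) were used as the performance indicators for multiclassification. The performance indicators are computed as follows:
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$
$$Pre = \frac{TP}{TP + FP} \tag{16}$$
$$Rec = \frac{TP}{TP + FN} \tag{17}$$
$$F1 = \frac{2 \times Pre \times Rec}{Pre + Rec} \tag{18}$$
$$FPR = \frac{FP}{FP + TN} \tag{19}$$
where TP and TN are the numbers of correctly classified positive and negative samples, FP is the number of negative samples misclassified as positive, and FN is the number of positive samples misclassified as negative.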
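These indicators follow directly from the four confusion matrix counts, as the sketch below shows. The function name `binary_metrics`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute Acc, Pre, Rec, F1, and FPR from confusion matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"Acc": acc, "Pre": pre, "Rec": rec, "F1": f1, "FPR": fpr}


# Hypothetical counts, not taken from the experiments in this paper.
metrics = binary_metrics(tp=8, fp=2, tn=6, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For the multiclassification task, the same function can be applied per class by treating one class as positive and all other classes as negative.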

Result Analysis
The experiments evaluated the proposed network on binary classification (normal and anomaly) and on multiclassification (normal, DoS, R2L, U2R, and probe). With the network parameters chosen as shown in Table 2, high accuracy and F1 score were achieved on the KDDTest+ test set. Figures 6 and 7 show the experimental results as confusion matrices. Most samples were classified correctly and therefore appear on the diagonal, indicating good classification performance. However, a comparison of the two figures shows that the performance of the proposed model was somewhat degraded in the multiclassification experiments compared with the binary classification experiments. Table 3 provides the false-positive and recall rates for the different attack types under the multiclassification task; the aim in intrusion detection is a lower false-positive rate and a higher recall rate. From the analysis, it can be concluded that, despite the data augmentation process, the U2R category was more likely to be misclassified because the test set contains many more U2R samples than the training set.

Comparison of Data Enhancement Methods
Table 4 shows the results of a comparison test in which different data augmentation algorithms were used while the network model and the dimensionality reduction method remained unchanged. Compared with the unprocessed data, each performance indicator of the proposed model improved greatly. Compared with the SMOTE data augmentation method, the data augmentation method used in this paper also improved the accuracy and F1 score, with increases of 2.09% and 2.61%, respectively.

Dimensionality Reduction Comparison
Table 5 shows the results of a comparison test in which different dimensionality reduction methods were used while the model and the data augmentation method remained unchanged. Compared with PCA, each performance indicator of the model in this paper improved greatly. Compared with the autoencoder, the improved stacked autoencoder used in this paper also improved the accuracy and F1 score, with increases of 4.64% and 3.28%, respectively.
Figure 8 compares the proposed DLNID model with other reference models in terms of accuracy, and it can be seen that the accuracy of DLNID is higher than that of the other models. Table 6 compares the proposed model with other network models in terms of various performance metrics, from which it can be seen that the proposed DLNID model outperforms its comparison peers in terms of accuracy and F1 score, reaching 90.73% and 89.65% on the KDDTest+ dataset, respectively. Compared with traditional machine learning methods such as GAR-Forest or NB Tree, the proposed method requires no manual feature extraction and improves the accuracy rate. Compared with the autoencoder, the improved stacked autoencoder proposed in this paper enhances the information set after dimensionality reduction and achieves better classification results. Compared with the CNN, the proposed model first uses a CNN to extract feature information, then reassigns the weights of the channels by using the attention mechanism, and finally learns the relationships between features in the network traffic with Bi-LSTM, thereby achieving improved classification performance.

Conclusions
To address the issues of data imbalance in network data and low detection accuracy, we adopted the ADASYN oversampling algorithm as the data augmentation method to tackle the imbalance in network intrusion data, used a stacked autoencoder with added dropout as the dimensionality reduction method to improve the generalization ability of the model, and built the network structure by combining the channel attention mechanism with the bidirectional LSTM network. The accuracy and F1 score of the proposed network model reached 90.73% and 89.65% on the KDDTest+ test set, respectively. Compared with other reference network models, the proposed DLNID model offered better classification performance. The proposed network model is considered useful for the current development of network intrusion detection. In the future, we plan to combine the DLNID model with an actual network capture module to implement an online intrusion detection model.

Figure 5. Overall structure of the model.


Table 1. Summary of relevant research.

Table 4. Comparison of data enhancement methods.
