An Efficient Intrusion Detection Method Based on LightGBM and Autoencoder

Abstract: Owing to the insidious characteristics of network intrusion behaviors, developing an efficient intrusion detection system remains a major challenge, especially in the era of big data, where both the volume of traffic and the dimensionality of each traffic record are high. To address the shortcomings of common machine learning algorithms in network intrusion detection, such as insufficient accuracy, a network intrusion detection system based on LightGBM and an autoencoder (AE) is proposed. The LightGBM-AE model proposed in this paper comprises three steps: data preprocessing, feature selection, and classification. It adopts the LightGBM algorithm for feature selection and then uses an autoencoder for training and detection. When data containing network intrusion behaviors are fed into the autoencoder, a large reconstruction error arises between the original input and the data reconstructed by the autoencoder, which provides a basis for intrusion detection. According to the reconstruction error, an appropriate threshold is set to distinguish symmetrically between normal behavior and attack behavior. The experiments are carried out on the NSL-KDD dataset and implemented in PyTorch. In addition to the autoencoder, a variational autoencoder (VAE) and a denoising autoencoder (DAE) are also applied to intrusion detection and compared with existing machine learning algorithms such as Decision Tree, Random Forest, KNN, GBDT, and XGBoost. The evaluation uses standard classification metrics: accuracy, precision, recall, and F1-score. The experimental results show that the method can efficiently separate attack behavior from normal behavior according to the reconstruction error, and comparison with the other methods verifies its effectiveness and superiority.


Introduction
In recent years, computer networks have developed rapidly, gradually taking on the role of central information systems in modern life. The growth in the size, applications, and infrastructure of computer networks has exposed them to serious threats such as malicious activities, network intruders, and cyber criminals. Dealing with these harmful network activities is one of today's research priorities worldwide. Network intrusion detection is an important data analysis task that helps identify network intrusions and protect network security [1]. Depending on the modeling method used, intrusion detection systems fall into two categories [2]: misuse-based detection and anomaly-based detection. Misuse-based detection compares traffic against signatures of known attacks; it is effective for known attacks but ineffective against unknown ones. In contrast, anomaly-based intrusion detection methods can identify unknown or zero-day attacks.
detection; on the NSL-KDD dataset, it achieves 82% accuracy. The second method uses a deep learning algorithm, a network intrusion detection system combining gated recurrent units and long short-term memory (GRU-LSTM), with an accuracy close to 88%. More and more deep learning methods are being applied to intrusion detection and show excellent performance. Kaichen Yang et al. [16] studied how adversarial examples affect the performance of deep neural networks (DNNs) trained to detect network intrusions in a black-box setting. Their work shows that even when the internal information of the target model is withheld from the adversary, the adversary can still produce effective adversarial examples against the trained DNN classifier. They trained a DNN model for a network intrusion detection system on the NSL-KDD dataset and achieved 89% accuracy. Beyond DNNs, other neural networks can also perform intrusion detection effectively, such as RNNs. Chuanlong Yin et al. [17] used recurrent neural networks (RNNs), a deep learning method, for intrusion detection.
The performance of the model in binary classification was studied, with experiments on the NSL-KDD dataset. The accuracy of the RNN is 83.28%, higher than that of machine learning methods such as J48, artificial neural networks, random forests, and support vector machines. The experimental results show that the RNN algorithm is well suited to building high-precision classification models. The autoencoder is an unsupervised deep learning framework designed to reconstruct its input at the output while minimizing the reconstruction error [18]. S. Zavrak et al. [19] adopted autoencoder and variational autoencoder methods and compared them with the OCSVM algorithm. They experimented on the CICIDS2017 dataset, extracted flow-based features from it, computed the ROC curve and AUC value, and analyzed the performance of the method under different thresholds. Their results show that the AUC value obtained by the variational autoencoder is 0.7596, better than those of the autoencoder and the one-class support vector machine, but it is not easy to determine a threshold that provides both high detection accuracy and a low false alarm rate. Cosimo Ieracitano et al. [20] developed an intelligent intrusion detection system based on statistical analysis and an autoencoder. Data analytics and statistical techniques are combined for feature extraction; an autoencoder then reduces the 102-dimensional feature vector to a 50-dimensional latent feature vector e and reconstructs the original input from the 50 compressed features. The feature vector e is used as the input of a final softmax layer for binary classification. The effectiveness of the proposed IDS was tested on the benchmark NSL-KDD dataset, achieving an accuracy of 84.21%, superior to algorithms such as LSTM and MLP.

Dataset and Methodology
In this section, we first introduce the NSL-KDD dataset used in this article and then describe the method, including data preprocessing, feature selection, and classification.

NSL-KDD Dataset
The NSL-KDD dataset is an improved version of the original KDD99 dataset and is widely used as a benchmark in many intrusion detection systems. The NSL-KDD dataset solves some of the inherent problems of KDD99, such as the large number of redundant and repeated records in the training and test sets, which bias classifiers toward more frequent samples. It provides training and test datasets, denoted here as KDDTrain+ and KDDTest+, containing 125,973 and 22,544 instances, respectively. In addition, the NSL-KDD dataset contains four different attack classes: Probe, DoS, R2L, and U2R. The distributions of KDDTrain+ and KDDTest+ over normal traffic and the four attack types are shown in Table 1. The attack types of the NSL-KDD dataset are grouped into these four categories as follows:
1. Probe: attacks that collect information about the network in order to evade its security controls.
2. DoS: attacks that slow down or shut down a machine by sending the server more traffic than the system can process. Legitimate network traffic or access to services is affected by DoS attacks.
3. R2L: attacks that gain illegal access to computers by sending remote spoofed packets to the system.
4. U2R: attacks that obtain root access; the attacker, starting as a normal user, discovers and exploits system vulnerabilities to escalate privileges.
As shown in Table 2, the NSL-KDD dataset contains a total of 39 attacks, each assigned to one of the four categories (Probe, DoS, R2L, and U2R). Furthermore, the test set introduces a set of new attacks that do not appear in the training set; these new attacks are shown in bold. Figure 1 shows a flow chart of the proposed method. First, the NSL-KDD dataset is preprocessed: the min-max normalization technique is applied to scale the data into the interval [0,1], and the symbolic features are converted into numerical values using one-hot encoding. Afterward, the LightGBM algorithm is used for feature selection, and the optimal features are selected from the 41 original features to form the optimal feature subset. Finally, AE-based models are developed to evaluate the detection performance of the IDS in the binary classification scenario.

Data Preprocessing
Data preprocessing is a necessary step before training the model. It includes two parts: data normalization and one-hot-encoding.

Data Normalization
The min-max normalization method is adopted to scale each numerical feature value x_{f,j} into the range [0,1], according to:

x'_{f,j} = (x_{f,j} − min(x_f)) / (max(x_f) − min(x_f))

where max(x_f) and min(x_f) represent the maximum and minimum values of the f-th (numerical) feature x_f, and x'_{f,j} is the normalized feature value, ranged between [0,1].
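As a concrete illustration, the normalization step above can be sketched as follows. This is a minimal NumPy version, not the authors' code; the guard for constant columns is our own addition:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each numerical feature column of X into the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against constant columns (max == min) to avoid division by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span

X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 30.0]])
X_norm = min_max_normalize(X)  # each column now spans [0, 1]
```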

One-Hot-Encoding
One-hot encoding is used to convert the three categorical features protocol_type, service, and flag (x2, x3, and x4, respectively) into numeric values. In particular, each categorical attribute is represented by binary values. For instance, the x2 feature (protocol_type) has three attributes: tcp, udp, and icmp. One-hot encoding converts them into the binary vectors [1,0,0], [0,1,0], and [0,0,1], respectively. In the same way, the x3 and x4 features (service and flag) are also converted into one-hot vectors. In total, the 41-dimensional features are mapped to 122-dimensional features (38 continuous, and 84 binary values related to features x2, x3, and x4).
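A minimal sketch of this encoding step, using pandas' `get_dummies` on a hypothetical mini-sample (the full NSL-KDD feature set is not reproduced here):

```python
import pandas as pd

# Hypothetical mini-sample with two of the three categorical NSL-KDD features.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp"],
    "flag": ["SF", "S0", "SF"],
})
# Each categorical column is expanded into one binary column per attribute value.
onehot = pd.get_dummies(df, columns=["protocol_type", "flag"])
```

With three protocol values and two flag values, each row of `onehot` has five binary columns, exactly one `1` per original categorical feature.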

Feature Selection
Next, feature selection is applied, which is essential to the classification task. Feature selection reduces computational complexity, improves the performance of learning algorithms, eliminates redundant information in the dataset, and improves generalization [21]. LightGBM is a boosting framework released by Microsoft in 2017 that is more powerful and faster than XGBoost, with greatly improved performance, as described in [22]. The performance of the LightGBM model has been widely recognized in several data mining and machine learning challenges. Therefore, we use LightGBM and its feature importance scores for feature selection.
The LightGBM model is a collection of decision trees. Unlike other GBDT models, LightGBM estimates the gain of a split from both strong and weak learners (instances with big and small gradients g_i) via gradient-based one-side sampling (GOSS). The training instances are sorted in descending order by the absolute values of their gradients, and the top a% of instances with larger gradients are retained to form the instance subset A. From the residual set A^c, formed by the (1 − a)% of instances with smaller gradients, a subset B of size b·|A^c| is sampled at random. Finally, instances are split according to the estimated variance gain Ṽ_j(d) over the subset A ∪ B:

Ṽ_j(d) = (1/n) [ ( Σ_{x_i ∈ A_l} g_i + ((1 − a)/b) Σ_{x_i ∈ B_l} g_i )² / n_l^j(d) + ( Σ_{x_i ∈ A_r} g_i + ((1 − a)/b) Σ_{x_i ∈ B_r} g_i )² / n_r^j(d) ]

where A_l = {x_i ∈ A : x_{ij} ≤ d} and A_r = {x_i ∈ A : x_{ij} > d} (B_l and B_r are defined analogously), d is the candidate split point at which the best gain is sought, and the coefficient (1 − a)/b normalizes the gradient sums over B back to the size of A^c.
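The GOSS sampling step described above can be sketched as follows. This is an illustrative NumPy version, not LightGBM's internal implementation, and the toy gradients are hypothetical:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Gradient-based one-side sampling: keep the top a-fraction of instances
    by |gradient| (subset A) and randomly sample b*|A_c| instances from the
    remainder (subset B), re-weighting B by (1 - a) / b."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(np.abs(gradients))[::-1]   # descending by |g_i|
    top_k = int(a * n)
    A = order[:top_k]                             # large-gradient instances, all kept
    rest = order[top_k:]                          # the residual set A_c
    B = rng.choice(rest, size=int(b * len(rest)), replace=False)
    weights = np.ones(n)
    weights[B] = (1.0 - a) / b                    # normalize B back to the size of A_c
    return np.concatenate([A, B]), weights

g = np.linspace(-1.0, 1.0, 100)                   # toy gradients
used, w = goss_sample(g, a=0.2, b=0.1)
```

With n = 100, a = 0.2, and b = 0.1, the split is estimated on 20 + 8 = 28 instances instead of all 100, which is the source of LightGBM's speedup.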
The trees in the LightGBM model are constructed based on the above steps. Let the feature set be x_1, x_2, . . . , x_m, with i = 1, 2, . . . , m. Then, according to the number of times each feature is used to split the training data across all trees, the feature importance score FIS_i is calculated, and the feature importance score set is represented as FIS = {w_1 x_1, w_2 x_2, . . . , w_m x_m}, where w_i represents the weight of each feature and x_i represents the feature. Figure 2 shows the feature importance scores of the NSL-KDD features computed by the LightGBM algorithm. The accuracy of the LightGBM algorithm has been verified in a large number of experiments, and multiple thresholds on the feature importance score are set to select features. The experiment started with all 41 original features and ended with a subset containing only the selected features. As Table 3 shows, the accuracy of the model changes as different numbers of features are selected; the highest accuracy, 99.20%, is reached when 21 features are selected. The three categorical features among them (protocol_type, service, and flag) are converted into 84 dimensions by one-hot encoding, which, together with the other 18 features, forms the optimal feature subset, so the input data has 102 dimensions. This optimal feature subset is taken as the input for intrusion detection, and the autoencoder is used to detect network intrusions. The selected 21 features are shown in Table 4.
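Once importance scores are available (for instance from a trained LightGBM model's `feature_importances_` attribute, which counts splits per feature), selecting the top-k features reduces to a sort. The scores and feature names below are made up for illustration:

```python
import numpy as np

def select_top_k(importances, feature_names, k):
    """Return the k feature names with the highest importance scores."""
    order = np.argsort(importances)[::-1]   # indices sorted by descending score
    return [feature_names[i] for i in order[:k]]

# Hypothetical importance scores for four NSL-KDD features.
scores = [12, 85, 3, 40]
names = ["duration", "src_bytes", "urgent", "dst_bytes"]
top2 = select_top_k(scores, names, 2)
```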

Autoencoders
To detect normal and attack behaviors (Probe, DoS, R2L, U2R) of the NSL-KDD dataset, deep learning models based on AE, VAE, and DAE are developed respectively. AE, VAE, and DAE are all deep learning models containing symmetry. See the following subsection for details.

Autoencoder
The autoencoder consists of encoder and decoder operations. First, the encoder converts the input data vector into a lower-dimensional representation; then, the decoder attempts to reconstruct the original input from the compressed vector. The AE is trained in an unsupervised manner and can learn salient features from unlabeled data [23]. Figure 3 shows the AE structure used in this article. The input data vector x is encoded into a lower-dimensional representation e:

e = ς(Wx + b)

where W denotes the weight matrix, b the bias vector, and ς the activation function of the encoder. The decoding operation then reconstructs the input vector from the encoded representation e:

x̂ = ξ(W′e + b′)

where ξ is the activation function of the decoder, W′ and b′ are the decoder's weight matrix and bias vector, and x̂ is the vector reconstructed from e. The network is trained to minimize the error between the reconstruction x̂ and the input x.

Figure 3. The autoencoder structure adopted in this paper. The autoencoder (AE) consists of two parts: encoding and decoding. The encoding operation converts the input vector x into a compressed representation e, while the decoding operation attempts to reconstruct the input from e, so that x̂ ≈ x.

Variational Autoencoder
Like the standard autoencoder, the variational autoencoder (VAE) is a deep generative model with latent variables and an architecture composed of an encoder and a decoder. The purpose of training is to minimize the reconstruction error between the reconstructed data and the input data. Using Bayesian inference and probabilistic graphical model methods, the input data is encoded into a low-dimensional latent space and then decoded back.
The posterior probability function q φ (z|x) is used as a probability encoder to approximate the intractable posterior p θ (z|x).
We assume that the prior distribution p_θ(z) is a multivariate Gaussian with a diagonal covariance matrix, and we randomly sample a point z from it. To make this sampling trainable, the reparameterization trick is introduced [24]. The decoder p_θ(x|z) then maps this point in the latent space back to the original input space. The loss function of the VAE is defined as:

L(θ, φ; x) = −E_{q_φ(z|x)}[log p_θ(x|z)] + D_KL(q_φ(z|x) ‖ p_θ(z))

where D_KL is the Kullback-Leibler divergence, which intuitively measures how close the approximate posterior q_φ(z|x) is to the prior p_θ(z).
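The two terms of this loss can be written out numerically. The sketch below is a NumPy illustration of the closed-form Gaussian KL term and a summed-squared-error reconstruction term, not the training code used in the paper:

```python
import numpy as np

def kl_divergence(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction term (summed squared error) plus the KL regularizer."""
    recon = np.sum((x - x_hat) ** 2)
    return recon + kl_divergence(mu, log_var)

z_dim = 4
mu, log_var = np.zeros(z_dim), np.zeros(z_dim)  # posterior equal to the prior
x = np.ones(10)
```

When the approximate posterior matches the prior exactly (mu = 0, log_var = 0) and the reconstruction is perfect, both terms vanish and the loss is zero.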

Denoising Autoencoder
The denoising autoencoder (DAE) is a variant of the autoencoder [25]. The DAE introduces a corruption process P(x̃|x) into the reconstruction of the input x: a corrupted version x̃ is drawn from P(x̃|x), the encoder is built as f_θ(x̃) = s(Wx̃ + b), and the decoder reconstructs as g_θ′(y) = s(W′y + b′). To calculate the reconstruction error, the DAE uses the same criterion as the autoencoder, except that the reconstruction is computed from the corrupted input x̃:

arg min_{θ,θ′} (1/n) Σ_{i=1}^{n} L(x_i, g_θ′(f_θ(x̃_i)))
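One common choice for the corruption process P(x̃|x) is masking noise, which zeroes a random fraction of the input entries. The sketch below is illustrative, and the fraction p is a hypothetical setting:

```python
import numpy as np

def add_masking_noise(X, p=0.1, seed=0):
    """Corrupt inputs by zeroing a random fraction p of the entries,
    one common choice for the DAE corruption process P(x_tilde | x)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) >= p   # keep each entry with probability 1 - p
    return X * mask

X = np.ones((4, 6))
X_tilde = add_masking_noise(X, p=0.25)
```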

Classification
The AE model extracts the characteristics of the input data by adjusting the model parameters, retaining the key information of the input data so as to achieve a minimal reconstruction error. When building an autoencoder model for intrusion detection, the key hyperparameters to set include the number of network layers, the number of neurons in each hidden layer, the number of epochs, the learning rate, and the batch size. However, there is currently no principled way to find their optimal values.
The number of network layers is related to the dimensionality of the input data: when the input dimension is large, a larger number of layers is generally used. At the same time, a three-layer model can often achieve good detection results for most data [26]. Reducing the number of neurons layer by layer compresses the data progressively and extracts the important information. However, the compression between layers cannot be too aggressive, as excessive compression results in the loss of important information.
When a set of test data is input into the trained intrusion detection model, a large reconstruction error arises between inputs containing attacks and the data reconstructed by the autoencoder. In this paper, the mean square error is used to estimate this error, and the reconstruction error is defined as:

RE(x, x̂) = (1/m) Σ_{i=1}^{m} (x_i − x̂_i)²

where x is the input data, x̂ is the corresponding reconstruction vector, and x and x̂ have the same dimension m. The specific detection process is shown in Algorithm 1.
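Computed over a whole batch, the per-sample reconstruction error can be sketched as:

```python
import numpy as np

def reconstruction_error(X, X_hat):
    """Per-sample mean squared error between inputs and their reconstructions."""
    return np.mean((X - X_hat) ** 2, axis=1)

X = np.array([[0.0, 0.0], [1.0, 1.0]])
X_hat = np.array([[0.0, 0.0], [0.0, 0.0]])
errors = reconstruction_error(X, X_hat)  # one error per row
```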

Algorithm 1 The Detection Algorithm with a Trained AE
Input: the test dataset X = {x}; Output: accuracy, precision, recall, and F1-score;
1: Step 1: Encoder process
2: e_1 = f(W_1 x + b_1)
3: for i = 2 to L do
4:   e_i = f(W_i e_{i−1} + b_i)
5: end for
6: EndStep
7: Step 2: Decoder process
8: x̂_L = f(W′_L e_L + b′_L)
9: for i = L − 1 down to 1 do
10:   x̂_i = f(W′_i x̂_{i+1} + b′_i)
11: end for
12: EndStep
13: Step 3: Threshold selection and classification
14: scores = reconstruction errors RE(x, x̂) for all x in X
15: Calculate TPR, FPR = roc_curve(x_label, scores)
16: Set Threshold = argmax(TPR − FPR)
17: if score ≤ Threshold then
18:   the data point is normal
19: else
20:   the data point is an attack
21: end if
22: EndStep

The size of the reconstruction error is the basic criterion for judging whether a data point is normal or an attack. To perform network intrusion detection more accurately, the reconstruction error is analyzed further here. First, the threshold T on the reconstruction error is determined. If the reconstruction error is greater than T, the data point is classified as attack behavior; data points whose reconstruction error is less than or equal to T are classified as normal behavior. Using the labels of the test set and the reconstruction errors as input, the TPR and FPR are calculated. For intrusion detection, the higher the TPR and the lower the FPR, the better the detection result; therefore, the threshold is set at the point maximizing TPR − FPR. A set of test data is then input into the trained model, and detection proceeds according to Algorithm 1.
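The threshold-selection step above (maximizing TPR − FPR over the ROC curve, i.e. Youden's J statistic) can be sketched with scikit-learn's `roc_curve`. The toy errors and labels below are hypothetical; note that scikit-learn's thresholds are inclusive, so the comparison here uses >=:

```python
import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(labels, errors):
    """Choose the reconstruction-error threshold maximizing TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(labels, errors)
    return thresholds[np.argmax(tpr - fpr)]

# Toy data: 1 = attack (large error), 0 = normal (small error).
errors = np.array([0.01, 0.02, 0.03, 0.80, 0.90])
labels = np.array([0, 0, 0, 1, 1])
t = pick_threshold(labels, errors)
pred = (errors >= t).astype(int)   # scikit-learn thresholds are inclusive
```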

Experimental Conditions
This experiment is based on Python 3.6 and PyTorch 1.3; the experimental environment uses the Ubuntu 18.04 64-bit operating system with an RTX 2080 Ti GPU and 64 GB of memory.

Performance Evaluation
To measure the effectiveness of the AE algorithm in intrusion detection, the detection results are divided into four types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN), which can be expressed in the form of a confusion matrix [27], as shown in Table 5. The performance of the proposed IDS is measured using the traditional metrics: accuracy, precision, recall, and F1-score (or F-measure):
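These four metrics follow directly from the confusion-matrix counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts: with a perfectly balanced confusion matrix,
# all four metrics equal 0.5.
acc, prec, rec, f1 = classification_metrics(tp=5, fp=5, tn=5, fn=5)
```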

Parameter Settings and Training Details
The performance of the AE, VAE, and DAE models in intrusion detection is studied experimentally. Figure 3 shows the structure ([102:48:32:16:32:48:102]) adopted by our proposed autoencoder; the VAE and DAE adopt the same structure. The activation function of each layer is ReLU, and the optimizer is Adam. In addition, two important parameters need to be set: the learning rate and the number of epochs.
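A minimal PyTorch sketch of an autoencoder with this [102:48:32:16:32:48:102] layout is given below. It is our own illustrative reconstruction, assuming ReLU on every layer (as stated above) and the learning rate of 0.004 selected in the next subsection; the random mini-batch stands in for preprocessed NSL-KDD features:

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Symmetric autoencoder with the [102:48:32:16:32:48:102] layout."""
    def __init__(self, in_dim=102):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 48), nn.ReLU(),
            nn.Linear(48, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(16, 32), nn.ReLU(),
            nn.Linear(32, 48), nn.ReLU(),
            nn.Linear(48, in_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AE()
optimizer = torch.optim.Adam(model.parameters(), lr=0.004)
criterion = nn.MSELoss()

x = torch.rand(8, 102)        # one mini-batch of preprocessed features
x_hat = model(x)              # reconstruction
loss = criterion(x_hat, x)    # reconstruction error to minimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
```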
The learning rate is an important parameter of the AE model, with values ranging from 0 to 1. Too large a learning rate leads to an exploding loss, while too small a learning rate leads to slow convergence or over-fitting of the model. To set the learning rate, we first try 0.1, 0.01, 0.001, 0.0001, 0.00001, and so on as test values in turn, and then search for the optimal value within the best-performing interval.
As can be seen from Figure 4, the learning rate is decreased by a factor of 10 starting from 0.1, and the detection accuracy changes accordingly. When the learning rate is 0.1, the accuracy is 76.06%. As the learning rate decreases, the accuracy increases, reaching its peak of 86.84% at a learning rate of 0.001; as the learning rate continues to decrease, the accuracy drops again. Since the accuracy is highest in the interval [0.001, 0.01], we continue to look for the optimal learning rate there. As can be seen from Figure 5, when the learning rate is set within [0.001, 0.01], the overall accuracy stays at a high level; the highest accuracy, 89.82%, is obtained at a learning rate of 0.004, so we set the learning rate to 0.004.

Results and Analysis
This paper proposes an intrusion detection system based on feature selection and an autoencoder and compares the performance of three autoencoders: AE, VAE, and DAE. All experiments and analyses were performed on the benchmark NSL-KDD dataset. Standard evaluation metrics were used to assess the proposed intrusion detection system: accuracy, precision, recall, and F1-score. According to the feature importance scores calculated by the LightGBM algorithm, the most important features were extracted and used as input for the deep learning methods (AE, VAE, DAE) and the ML methods, including DT, RF, KNN, GBDT, and XGBoost.

Table 6 gives the experimental results of AE, VAE, and DAE. The accuracy of AE is 89.82%, the precision 91.81%, the recall 90.16%, and the F1-score 90.98%. The accuracy of VAE is 84.28%, the precision 84.92%, the recall 88.01%, and the F1-score 86.43%. Finally, the accuracy of DAE is 84.30%, the precision 85.22%, the recall 87.63%, and the F1-score 86.41%. On all four evaluation metrics, AE outperforms VAE and DAE and obtains the best test results.

We then compare our three autoencoder models with and without feature selection; the results are shown in Table 7. When all features are used, the accuracy of the AE model is 87.48%. After feature selection, the accuracy rises to 89.82%, an increase of 2.34 percentage points. With or without feature selection, the accuracy of the AE model is higher than that of VAE and DAE.

Some common machine learning algorithms have long been used in network intrusion detection, such as DT, RF, KNN, GBDT, and XGBoost. The results of these five algorithms are also shown in Table 7, from which it can be seen that DT achieved the highest accuracy among them. When DT uses all features for network intrusion detection, its accuracy is 78.72%, which is 8.76 percentage points lower than AE.
After feature selection, the accuracy of DT is 80.09%, which is 9.73 percentage points lower than AE. Therefore, whether compared with the two deep learning models VAE and DAE or with the machine learning algorithms, the performance of AE is the best. By setting different numbers of hidden layers (HL) and neurons and evaluating their performance, we can find the optimal AE structure; Table 8 reports the accuracy of different AE models with or without feature selection. Finally, as shown in Table 9, the AE model proposed in this paper is compared with the latest methods proposed in the literature, which are likewise trained and tested on the NSL-KDD dataset.

Conclusions
In this paper, we discussed the shortcomings of existing intrusion detection systems and evaluated the performance of a LightGBM-AE-based intrusion detection model. The LightGBM model is mainly used in classification and regression tasks; in our proposed model, we select features based on the feature importance scores generated by the LightGBM model, and this feature selection method is the first of its kind in intrusion detection and intrusion classification. After the input data is mapped by the encoder and the decoder, reconstructed data is generated, and according to the size of the reconstruction error, a reasonable threshold can be set to distinguish normal data from attack data. The model proposed in this paper is compared with two other autoencoder models, VAE and DAE, and with machine learning algorithms such as decision trees. The comparison shows that the classification accuracy of the proposed model is higher, reaching 89.82%, giving it an advantage over the other models. The experimental results show that the method achieves a good detection effect in network intrusion detection.
In the future, we plan to develop a more accurate deep learning model to carry out network intrusion detection effectively. The proposed work performs binary classification and can only distinguish normal traffic from attacks; it can soon be extended to multi-class classification so that specific attack types can be identified. To perform intrusion detection faster, we will also explore applying the distributed learning method proposed by Langer, M. [28] to our research.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: