Improving the Classification Effectiveness of Intrusion Detection by Using Improved Conditional Variational AutoEncoder and Deep Neural Network

Intrusion detection systems play an important role in preventing security threats and protecting networks from attacks. However, with the emergence of unknown attacks and imbalanced samples, traditional machine learning methods suffer from lower detection rates and higher false positive rates. We propose a novel intrusion detection model that combines an improved conditional variational AutoEncoder (ICVAE) with a deep neural network (DNN), namely ICVAE-DNN. ICVAE is used to learn and explore potential sparse representations between network data features and classes. The trained ICVAE decoder generates new attack samples according to the specified intrusion categories to balance the training data and increase the diversity of training samples, thereby improving the detection rate of the imbalanced attacks. The trained ICVAE encoder is not only used to automatically reduce data dimension, but also to initialize the weight of DNN hidden layers, so that DNN can easily achieve global optimization through back propagation and fine tuning. The NSL-KDD and UNSW-NB15 datasets are used to evaluate the performance of the ICVAE-DNN. The ICVAE-DNN is superior to the three well-known oversampling methods in data augmentation. Moreover, the ICVAE-DNN outperforms six well-known models in detection performance, and is more effective in detecting minority attacks and unknown attacks. In addition, the ICVAE-DNN also shows better overall accuracy, detection rate and false positive rate than the nine state-of-the-art intrusion detection methods.


Introduction
In recent years, with the rapid development of cloud computing, LoRa, NB-IoT, 5G communication and artificial intelligence technologies, Internet of Things (IoT) technology has boomed, and hundreds of millions of devices are now connected to the IoT. However, because many IoT nodes collect and store large amounts of private user data, IoT systems have become an ideal target for cyber attackers, and attacks on the IoT are increasing [1,2]. Gemalto's IoT security report shows that more than half of companies still cannot determine whether they have suffered IoT vulnerability attacks. In addition, the report surveyed 950 IT and business decision makers and found that only 59% of companies encrypted all IoT-related data [3]. The popularity of IoT technology and the intelligence of devices have brought great convenience to people, but the use of new technologies and intelligent devices has also brought new security and privacy risks. For example, [...] The conditional variational AutoEncoder (CVAE) is an extension of VAE [33]. It embeds a one-hot encoded label vector in the encoder and decoder, converting the unsupervised training mode into a supervised one. CVAE not only automatically extracts high-level features and reduces the dimensionality of network features, but also generates new attack samples of the specified categories. In order to initialize the weights of the DNN hidden layers using the CVAE encoder, we improve CVAE by embedding the intrusion labels only in the decoder and not in the encoder; we name the result ICVAE. This paper makes the following main contributions. First, we use ICVAE to learn the distribution of complex traffic and classes through supervised learning. The network parameters of the ICVAE encoder are used to initialize the weights of the DNN hidden layers. 
Second, latent variables with Gaussian noise and specified labels are fed into the trained ICVAE decoder (generating network) to generate specific new attack records, so as to balance the training data and increase the diversity of training samples, thus improving the detection rate of minority attacks and unknown attacks. Third, DNN is used to automatically extract high-level features, and adjust network weights by back propagation and fine-tuning to better address the classification problem of complex, large-scale and non-linear network traffic. Finally, the proposed model is evaluated on the NSL-KDD [34,35] and UNSW-NB15 [36][37][38] datasets. Compared with the well-known classification methods, the proposed model not only reaches better overall accuracy, recall, and false positive rate, but also achieves higher detection rate in minority attacks and unknown attacks.
The remainder of this paper is organized as follows. The related works are introduced in Section 2. Section 3 describes the ICVAE and DNN algorithms. Section 4 proposes a novel intrusion detection model and shows in detail how the model works. Section 5 demonstrates the experimental details and results. Finally, Section 6 provides some conclusions and further work.

Related Works
Although there is CVAE-related work in other fields, there is no report on the combination of ICVAE and DNN for intrusion detection. Kawachi et al. [39] employed a VAE for supervised anomaly detection. Sun et al. [40] used a VAE to learn sparse representations for anomaly detection. Chandy et al. [41] used a VAE as a deep generative model to simulate network attack detection problems. Osada et al. [42] employed a VAE as a semi-supervised learning method for intrusion detection. All of these works detect intrusions with a VAE rather than a CVAE. Lopez-Martin et al. [16] used a conditional VAE (CVAE) to build an ID-CVAE classifier that performs classification and feature recovery. ID-CVAE classifies test samples by applying the nearest neighbor method, based on the Euclidean distance, to the reconstructed test data. In contrast, our proposed model not only generates data according to categories, but also uses a DNN classifier to perform classification.
Deep learning integrates high-level feature extraction and classification, overcomes some limitations of shallow learning, and has further advanced intrusion detection systems. Recently, deep learning models have been widely used in the field of intrusion detection. Stacked AutoEncoders have been used to detect attacks in IEEE 802.11 networks with an overall accuracy of 98.60% [43]. Ma et al. [44] presented a hybrid method combining spectral clustering and deep neural networks to detect attacks, with an overall accuracy of 72.64% on the NSL-KDD dataset. A gated recurrent unit recurrent neural network (GRU-RNN) was used to build an intrusion detection system in a software-defined network (SDN) with an accuracy of 89% [45]. Shone et al. [15] employed a stacked non-symmetric AutoEncoder and random forest (RF) to detect attacks. Muna et al. [46] proposed an anomaly detection technique for internet industrial control systems (IICSs) based on deep learning, which used a deep auto-encoder for feature extraction and a deep feedforward neural network for classification. Tamer et al. [20] employed the restricted Boltzmann machine (RBM) to classify normal and abnormal network traffic. Imamverdiyev [18] used the multilayer deep Gaussian-Bernoulli RBM method to detect DoS attacks with an accuracy of 73.23% on the NSL-KDD dataset.
The above evaluation results are encouraging, but these classification techniques still have two detection defects: a low detection rate for unknown attacks and a high false positive rate for minority attacks. To overcome these problems, this paper uses the ICVAE decoder to generate new attack samples according to the specified intrusion categories, thereby improving the detection rate of unknown attacks and minority attacks. The ICVAE encoder automatically learns the latent representation of the input data and reduces the dimensionality of the features. Furthermore, the ICVAE encoder is used to initialize the weights of the DNN hidden layers, which makes it easier for the DNN to approach a global optimum through back propagation and fine-tuning of the network parameters.

Variational AutoEncoder (VAE)
The Variational AutoEncoder (VAE) is an important generative model consisting of an encoder network Q_φ(Z|X) and a decoder network P_θ(X|Z), as shown in Figure 1. VAE can learn approximate inference and can be trained using gradient descent. The encoder network with parameters φ learns an efficient compression of the data into a lower-dimensional space; that is, it maps data X to a continuous latent variable Z. The decoder network with parameters θ uses the latent variable to generate data; that is, it maps Z to reconstructed data X̂. Here we use deep neural networks to construct the encoder and decoder with parameters φ and θ, respectively.

Figure 1. VAE architecture (panel labels: probability distribution, decoder network).

The core idea of VAE is to use the probability distribution P(X) to sample data points that match this distribution, where X represents a random variable of the data. The goal of VAE is to reconstruct the input data as closely as possible, that is, to maximize the log likelihood log P(X) [31,47], as follows:

$$\log P(X) - D_{KL}\left[Q_\phi(Z|X) \,\|\, P_\theta(Z|X)\right] = \mathbb{E}_{Q_\phi(Z|X)}\left[\log P_\theta(X|Z)\right] - D_{KL}\left[Q_\phi(Z|X) \,\|\, P(Z)\right] \tag{1}$$

Here the variational lower bound objective [31,47] is defined as follows:

$$\mathcal{L}(\theta, \phi; X) = \mathbb{E}_{Q_\phi(Z|X)}\left[\log P_\theta(X|Z)\right] - D_{KL}\left[Q_\phi(Z|X) \,\|\, P(Z)\right] \tag{2}$$

L is the variational lower bound, which is called the VAE objective function. The first term in Equation (2) represents the reconstruction loss. It encourages the decoder to learn to reconstruct the input data. The second term in Equation (2) uses the KL (Kullback-Leibler) divergence to minimize the difference between the encoder's distribution Q(Z|X) and the prior distribution P(Z), so that the learned distribution Q(Z|X) stays close to the prior distribution P(Z). Therefore, the goal of training VAE is to maximize the data generation probability log P(X|Z) and minimize the difference between the learned distribution Q(Z|X) and the prior distribution P(Z). In other words, the goal of training VAE is to maximize the variational lower bound L.
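The two terms of the lower bound in Equation (2) can be evaluated in closed form for a single sample under common modelling choices. The following is a minimal sketch, assuming a diagonal Gaussian encoder, a standard normal prior and a Bernoulli decoder (choices the paper states later for ICVAE); the function names and the clipping constant `eps` are illustrative assumptions, not the authors' implementation.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)], summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log P(x | z) for a Bernoulli decoder whose outputs x_hat lie in (0, 1)."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

def variational_lower_bound(x, x_hat, mu, log_var):
    """Single-sample estimate of the bound: reconstruction term minus KL term."""
    return (bernoulli_log_likelihood(x, x_hat)
            - kl_diag_gaussian_to_standard_normal(mu, log_var))
```

Training maximizes this quantity, which is equivalent to minimizing the sum of the cross-entropy reconstruction loss and the KL loss.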

Improved Conditional Variational AutoEncoder (ICVAE)
The Conditional Variational AutoEncoder (CVAE) is an extension of VAE [33], obtained by conditioning the encoder and decoder on the class Y, as shown in Figure 2. The encoder Q(Z|X, Y) is now conditioned on two variables, X and Y, and the decoder P(X|Z, Y) is now conditioned on two variables, Z and Y.

Figure 2. CVAE architecture (panel labels: probability distribution, decoder network).

Hence, the variational lower bound objective of CVAE [32,33] is defined as follows:

$$\mathcal{L}(\theta, \phi; X, Y) = \mathbb{E}_{Q_\phi(Z|X,Y)}\left[\log P_\theta(X|Z,Y)\right] - D_{KL}\left[Q_\phi(Z|X,Y) \,\|\, P(Z|Y)\right] \tag{3}$$

The conditional probability distributions of the CVAE encoder and decoder both depend on the class label Y. In order to use the encoder network of CVAE to initialize the network parameters of the DNN, we modify the CVAE structure so that the class label Y is embedded only in the decoder network. The architecture of ICVAE is shown in Figure 3. The decoder is conditioned on two variables, Z and Y, whereas the encoder is conditioned on one variable, X.

Figure 3. ICVAE architecture (panel labels: probability distribution, decoder network).

ICVAE is composed of the encoder network Q_φ(Z|X) and the decoder network P_θ(X|Z, Y). In the decoder, class labels are used as an extra input, so that the decoder probability distribution is conditioned on the latent variable Z and the class label Y, while the encoder does not receive the class label Y. When decoding, the latent variable Z and the label Y are concatenated and fed to the decoder, which generates new attack samples of the specified class. Hence, the variational lower bound of ICVAE is defined as follows:

$$\log P(X|Y) - D_{KL}\left[Q_\phi(Z|X) \,\|\, P_\theta(Z|X,Y)\right] = \mathbb{E}_{Q_\phi(Z|X)}\left[\log P_\theta(X|Z,Y)\right] - D_{KL}\left[Q_\phi(Z|X) \,\|\, P(Z|Y)\right] \tag{4}$$

Here the variational lower bound objective of ICVAE is rewritten as:

$$\mathcal{L}(\theta, \phi; X, Y) = \mathbb{E}_{Q_\phi(Z|X)}\left[\log P_\theta(X|Z,Y)\right] - D_{KL}\left[Q_\phi(Z|X) \,\|\, P(Z|Y)\right] \tag{5}$$

L(θ, φ; X, Y) in Equation (5) consists of two parts: a log reconstruction likelihood E[log P(X|Z, Y)] and a KL divergence D_KL[Q(Z|X) ‖ P(Z|Y)]. The first term reconstructs X using the conditional probability distribution P(X|Z, Y), and the second term uses the KL divergence to measure how closely the encoder distribution Q(Z|X) approximates the prior distribution P(Z|Y). In ICVAE, we maximize the variational lower bound objective L(θ, φ; X, Y). In this model, we use the class label as the conditional variable Y, and we sample Z from a multivariate standard normal distribution N(0, I). By changing the value of Y, such as the attack class in the NSL-KDD dataset, ICVAE's decoder P(X|Z, Y) can generate new attack samples of the specified category.
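The decoder input described above, a latent sample concatenated with a one-hot class label, can be sketched as follows. This is only an illustration: the latent dimension of 10 and the class index used in the usage note are hypothetical values, and the trained decoder itself is not shown.

```python
import random

def one_hot(class_index, num_classes):
    """Return a one-hot label vector for the requested class."""
    vector = [0.0] * num_classes
    vector[class_index] = 1.0
    return vector

def decoder_input(latent_dim, class_index, num_classes, rng):
    """Sample z ~ N(0, I) and concatenate it with the one-hot label y."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return z + one_hot(class_index, num_classes)
```

For example, `decoder_input(10, 3, 5, random.Random(0))` returns a 15-dimensional vector whose last five entries select the fourth of five classes; feeding such vectors to a trained decoder is what produces samples of the chosen category.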

The Proposed Intrusion Detection Framework
The framework of the proposed ICVAE-DNN is shown in Figure 4. ICVAE-DNN consists of three main phases: (1) training ICVAE, where the training samples are used to train the ICVAE, and the reconstruction loss of each training sample is stored according to its attack class; (2) generating new attacks, where the ICVAE decoder generates new attack samples for specified classes, and each newly generated attack sample is merged into the original training data set only if it satisfies the reconstruction loss condition of its class; (3) detecting attacks, where the ICVAE encoder is used to initialize the weights of the DNN hidden layers, the merged training data set is used to train the DNN classifier, and the trained DNN classifier is used to detect attacks on the testing data set.

Training ICVAE
The input of ICVAE must be a real-valued vector, so each symbolic feature in the intrusion detection dataset is first converted to a numerical feature. For example, the NSL-KDD [34,35] dataset contains 3 symbolic features and 38 numerical features, and the UNSW-NB15 [36][37][38] dataset contains 3 symbolic features and 39 numerical features. All symbolic features are transformed into a binary one-hot encoding. After encoding, the NSL-KDD and UNSW-NB15 datasets have 122-dimensional and 196-dimensional features, respectively. ICVAE is composed of an encoder and a decoder, as shown in Figure 4. For the encoder, we model Q(Z|X) as a multivariate Gaussian distribution. For the decoder, we model P(X|Z, Y) as a multivariate Bernoulli distribution. The output of the decoder network is the reconstructed data, which is the predicted probability of each feature.
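The one-hot conversion described above can be sketched as follows. The helper names and the toy protocol values are assumptions for illustration. For NSL-KDD, the 122 dimensions are consistent with the commonly reported 3, 70 and 11 distinct values of the three symbolic features (38 + 3 + 70 + 11 = 122); this breakdown is not stated in the text itself.

```python
def build_vocabulary(values):
    """Map each distinct symbolic value to a column index (sorted for determinism)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

def one_hot_encode(value, vocabulary):
    """Encode one symbolic value as a binary vector with a single 1."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary[value]] = 1.0
    return vector

def encode_record(numeric_features, symbolic_features, vocabularies):
    """Append the one-hot encoding of each symbolic feature to the numeric ones."""
    encoded = list(numeric_features)
    for value, vocabulary in zip(symbolic_features, vocabularies):
        encoded.extend(one_hot_encode(value, vocabulary))
    return encoded

# Toy example with a single symbolic feature (hypothetical values).
protocol_vocab = build_vocabulary(["tcp", "udp", "icmp", "tcp"])
record = encode_record([0.5, 12.0], ["udp"], [protocol_vocab])
```

In this sketch a symbolic value that was not seen when the vocabulary was built raises a `KeyError`, so the vocabulary should be built from all values that can occur in both the training and the test data.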
We use the min-max normalization method to scale all data X to [0,1]. After preprocessing all the data in the intrusion detection dataset, we train the ICVAE to optimize the loss over the encoder parameters φ and the decoder parameters θ, using balanced sampling via label shuffling [48] and the Adam [49] optimization algorithm. The ICVAE loss is composed of a reconstruction loss and a KL loss. The KL loss uses variational inference to approximate the distribution P(Z|Y) with the deep neural network Q(Z|X), so the ICVAE may have a KL-vanishing problem. ICVAE directly compares the reconstructed attack with the original attack through the encoding and decoding steps. However, the new attack samples generated by the ICVAE decoder may differ greatly from one another, and may deviate from the spatial distribution of the original attacks. In order to better select the newly generated attack samples, we calculate the reconstruction loss of each training sample by class and then use the maximum reconstruction loss of each class as the screening criterion. We assume that each component P(x_i|z, y) of the decoder output, where i = 1, ..., n indexes the features, obeys a Bernoulli distribution, i.e.,

$$P(x_i = 1 \mid z, y) = \alpha_i, \qquad P(x_i = 0 \mid z, y) = 1 - \alpha_i \tag{6}$$

where α_i is the i-th component of α_{z,y}. For an observation, the likelihood is:

$$P(x \mid z, y) = \prod_{i=1}^{n} \alpha_i^{x_i} (1 - \alpha_i)^{1 - x_i} \tag{7}$$

The decoder output is the parameter of the Bernoulli distribution, that is,

$$\alpha_{z,y} = \mathrm{Decoder}(z, y) = \hat{x} \tag{8}$$

Then the negative log likelihood is:

$$-\log P(x \mid z, y) = -\sum_{i=1}^{n} \left[ x_i \log \hat{x}_i + (1 - x_i) \log (1 - \hat{x}_i) \right] \tag{9}$$

The negative log likelihood in Equation (9) is the cross entropy. We use this cross entropy as the reconstruction loss of the decoder. From here on, i indexes training samples and d indexes features. After each training sample (x_i, y_i) is fed into the trained ICVAE, the reconstruction loss l_i(x_i, y_i) can be calculated as follows:

$$l_i(x_i, y_i) = -\sum_{d=1}^{n} \left[ x_{i,d} \log \hat{x}_{i,d} + (1 - x_{i,d}) \log (1 - \hat{x}_{i,d}) \right] \tag{10}$$

The maximum reconstruction loss maxL_j of the j-th class is written as follows:

$$maxL_j = k \cdot \max_{i \,:\, y_i \in \text{class } j} l_i(x_i, y_i) \tag{11}$$

where k is the scaling factor of the maximum reconstruction loss; typically k is 1.0.
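The preprocessing and screening quantities described above (min-max scaling, the cross-entropy reconstruction loss of one sample, and the scaled per-class maximum loss) can be sketched in plain Python. The function names and the clipping constant `eps` are assumptions for illustration; in practice the reconstructions would come from the trained ICVAE.

```python
import math

def min_max_scale(column):
    """Scale one feature column to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

def reconstruction_loss(x, x_hat, eps=1e-7):
    """Cross entropy between a sample x and its Bernoulli reconstruction x_hat."""
    loss = 0.0
    for xd, pd in zip(x, x_hat):
        pd = min(max(pd, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= xd * math.log(pd) + (1.0 - xd) * math.log(1.0 - pd)
    return loss

def max_loss_per_class(losses, labels, k=1.0):
    """Return k times the largest reconstruction loss observed for each class."""
    largest = {}
    for loss, label in zip(losses, labels):
        largest[label] = max(largest.get(label, float("-inf")), loss)
    return {label: k * value for label, value in largest.items()}
```

With k = 1.0, a generated sample is accepted only if it is reconstructed at least as well as the worst-reconstructed real sample of its class; a smaller k would make the filter stricter.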

Generating New Attacks
As in training, we model the encoder distribution Q(Z|X) as a multivariate Gaussian distribution. For generation, we define the multivariate standard normal distribution N(0, I) as the prior distribution P(Z|Y), that is, Z ∼ N(0, I). We sample a latent variable z from N(0, I) under a specified label ŷ and feed both into the trained decoder to generate a new attack sample (x̂, ŷ). Assuming that the new attack sample belongs to class j, i.e., ŷ ∈ class j, the generated sample (x̂, ŷ) is fed into the trained ICVAE and the reconstruction loss l(x̂, ŷ) is calculated according to Equation (10). Then, we compare the reconstruction loss l(x̂, ŷ) with the maximum loss maxL_j of the corresponding class j. If l(x̂, ŷ) < maxL_j, the newly generated sample is merged into the original training set S; otherwise the sample is discarded. The newly generated attack samples are merged into the original training set according to the following criterion:

$$S = \begin{cases} S \cup \{(\hat{x}, \hat{y})\}, & \text{if } l(\hat{x}, \hat{y}) < maxL_j \\ S, & \text{otherwise} \end{cases} \tag{12}$$
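A minimal sketch of the acceptance rule described in this section: a generated sample is kept only if its reconstruction loss is below the maximum loss of its class. The function name and the data layout (a list of (features, label) pairs and a dictionary of per-class thresholds) are assumptions for illustration.

```python
def merge_generated(training_set, generated, losses, max_loss):
    """Return a new training set extended with the generated samples that pass.

    training_set: list of (features, label) pairs
    generated:    list of (features, label) pairs produced by the decoder
    losses:       reconstruction loss of each generated sample
    max_loss:     dict mapping each class label to its threshold maxL_j
    """
    merged = list(training_set)  # copy, so the original set is not modified
    for (x_hat, y_hat), loss in zip(generated, losses):
        if loss < max_loss[y_hat]:
            merged.append((x_hat, y_hat))
    return merged
```

The strict inequality mirrors the condition l(x̂, ŷ) < maxL_j in the text; samples whose loss equals the threshold are discarded.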

Detecting Attacks
We employ DNN to detect attacks. DNN is a six-layer feedforward deep neural network. The activation function of all hidden layers in DNN is ReLU6 [50], and the activation function of the output layer in DNN is softmax. The network structure of DNN hidden layers is exactly the same as that of ICVAE encoder. ICVAE encoder can automatically extract high-level features, so the weight of the trained ICVAE encoder is used to initialize the weight of DNN hidden layers, then the merged training data set is used to fine tune DNN classifier, and the DNN classifier is optimized by Adam [49] algorithm. Finally, test samples are fed into the trained DNN classifier to detect attacks.
The proposed intrusion detection model is detailed in Algorithm 1.

Algorithm 1
[...] randomly initialized with scaling variance, and biases are initialized to 0.
3: Train the ICVAE using the training data set and calculate the maximum reconstruction loss maxL for each category in the training data set according to Equation (11).
4: Sample z from the multivariate standard normal distribution N(0, I), specify the attack class ŷ, and feed them into the trained ICVAE decoder to generate a new attack sample x̂. According to Equation (12), the newly generated sample (x̂, ŷ) is merged into the training data set S.
5: The weights of the trained ICVAE encoder are used to initialize the weights of the DNN hidden layers. First, all hidden layers are frozen and the parameters of the output layer are adjusted by back propagation; then all hidden layers are unfrozen, and the merged training data set is used to fine-tune the DNN classifier.
6: Test samples are fed into the trained DNN classifier to detect attacks.
7: return the classification result.
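The initialization and two-stage fine-tuning in step 5 can be illustrated with a framework-free sketch. Everything here is an assumption for illustration: layers are plain dictionaries, the `trainable` flag only records which layers an optimizer would be allowed to update, and the Glorot-style scale is one possible reading of "scaling variance". It is not the authors' TensorFlow code.

```python
import copy
import random

def make_layer(n_in, n_out, rng):
    """Create a dense layer with variance-scaled random weights and zero biases."""
    scale = (2.0 / (n_in + n_out)) ** 0.5  # assumed Glorot-style scaling
    return {
        "weights": [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)],
        "biases": [0.0] * n_out,
        "trainable": True,
    }

def build_dnn_from_encoder(encoder_layers, num_classes, rng):
    """Copy the encoder layers as frozen hidden layers and add a new output layer."""
    hidden = copy.deepcopy(encoder_layers)
    for layer in hidden:
        layer["trainable"] = False  # stage 1: only the output layer is trained
    last_width = len(hidden[-1]["biases"])
    return hidden + [make_layer(last_width, num_classes, rng)]

def unfreeze(dnn_layers):
    """Stage 2: make every layer trainable for fine-tuning on the merged data."""
    for layer in dnn_layers:
        layer["trainable"] = True
```

For the 122-80-40-20-10-5 structure reported for NSL-KDD, the encoder would contribute the four hidden layers 122-80, 80-40, 40-20 and 20-10, and the new output layer would map 10 units to 5 classes.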

Performance Evaluation
We use six commonly used metrics to evaluate intrusion detection performance: accuracy, detection rate (DR), precision, recall, false positive rate (FPR), and F1-score. Table 1 shows the confusion matrix consisting of true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP and TN indicate that attack and normal records are correctly classified, respectively; FP represents a normal record that is incorrectly predicted as an attack; FN represents an attack record that is incorrectly classified as a normal record. The accuracy, DR, precision, recall and FPR are defined as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$DR = \frac{TP}{TP + FN}$$

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$FPR = \frac{FP}{FP + TN}$$

The F1-score is the harmonic mean of recall and precision. Compared with accuracy, the F1-score is more suitable for evaluating the detection performance on imbalanced samples. It is defined as:

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
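The metrics above follow directly from the four confusion-matrix counts. The sketch below uses the standard binary definitions (with DR computed as TP / (TP + FN), which is an assumption about the paper's exact formula); the function name and the example counts are illustrative only.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # also used here as the detection rate
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "detection_rate": recall,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
    }

# Hypothetical counts: 80 attacks detected, 20 missed, 10 false alarms, 90 true normals.
metrics = detection_metrics(tp=80, tn=90, fp=10, fn=20)
```

The sketch does not guard against empty denominators, for example a class with no positive predictions, which would need explicit handling on real data.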

Datasets
Currently, the most common data sets used to evaluate the performance of network intrusion detection systems in the literature are the NSL-KDD [34,35] and UNSW-NB15 [36][37][38] data sets. Therefore, we selected the NSL-KDD and UNSW-NB15 data sets to validate the proposed model.

NSL-KDD Dataset
The NSL-KDD dataset, presented by Tavallaee et al. [52], is derived from the raw KDD Cup 99 [51,52] dataset. The NSL-KDD dataset removes the duplicate and redundant records of the KDD Cup 99 dataset and is more suitable for evaluating the performance of intrusion detection systems. There are five classes in the NSL-KDD data set: one normal class and four attack classes, namely, Probe, denial of service (DoS), user to root (U2R), and remote to local (R2L).
The NSL-KDD dataset is imbalanced, with fewer U2R and R2L records. We used three files from the NSL-KDD dataset to evaluate intrusion detection performance: KDDTrain+_20Percent.txt (a 20% subset of the full training set), KDDTest+.txt, and KDDTest-21.txt (a subset of the full test set, excluding records of difficulty level 21). In our experiments, KDDTrain+_20Percent is used as the training set, and KDDTest+ and KDDTest-21 are used as test sets. Table 2 shows the number of records for each category in the NSL-KDD dataset. As can be seen from Table 2, approximately 50% of the attacks in the testing dataset are unknown attacks that did not appear in the training dataset.

UNSW-NB15 Dataset
UNSW-NB15 [36][37][38] is a newer data set that reflects real modern normal activities and contains synthetic contemporary attacks. It is completely different from NSL-KDD and reflects a more modern and complex threat environment. The raw network packets of the UNSW-NB15 data set were captured with the Tcpdump tool; 49 features with the class label were then generated by Argus, the Bro-IDS tool and 12 algorithms [38]. The full dataset contains a total of 2,540,044 records. A partition of the full dataset is divided into a training set and a test set according to the hierarchical sampling method, namely, UNSW_NB15_training-set.csv and UNSW_NB15_testing-set.csv. The training dataset consists of 175,341 records, whereas the testing dataset contains 82,332 records. The number of features in the partitioned dataset [37] is different from the number of features in the full dataset [36]. The partitioned data set has only 43 features with the class label, because 6 features (i.e., dstip, srcip, sport, dsport, Ltime and Stime) were removed from the full dataset. The partitioned dataset contains ten categories: one normal class and nine attack classes, namely, generic, exploits, fuzzers, DoS, reconnaissance, analysis, backdoor, shellcode and worms. Table 3 shows in detail the class distribution of the UNSW-NB15 dataset.

Experimental Setup
Our experiments were carried out to evaluate the performance of the proposed model. We used three test sets: NSL-KDD (KDDTest+), NSL-KDD (KDDTest-21), and UNSW-NB15. We compared the results of the proposed model with other well-known detection methods. The proposed system was implemented in TensorFlow on a ThinkStation with 64 GB RAM, an Intel E5-2620 CPU and the 64-bit Windows 10 operating system. An appropriate number of hidden layers can improve the generalization performance of the DNN classifier. Since the number of input units is less than 200, based on experience, the candidate number of hidden layers is {3, 4}. The DNN extracts features automatically, so the number of hidden units is set in a decreasing manner from layer to layer, with values chosen from {two times the number of categories, four times the number of categories, eight times the number of categories, more than eight times the number of categories}. When the learning rate is too large, the network oscillates during training and fails to converge. In TensorFlow, the default learning rate of the Adam optimizer is 0.001; we also consider half of this value and a smaller one, so the candidate learning rates are {1 × 10^−3, 5 × 10^−4, 1 × 10^−4}. L2 regularization is used to avoid over-fitting. If the L2 coefficient is zero, we get back the original model; if it is very large, it penalizes the weights too heavily and leads to under-fitting, so the candidate L2 coefficients are {1 × 10^−4, 1 × 10^−5}. In order to overcome the vanishing gradient problem caused by Sigmoid and the exploding gradient problem caused by ReLU, we use ReLU6 [50] as the activation function of the hidden layers. The parameters of the ICVAE-DNN network configuration are searched according to the principles above. Grid search and three-fold cross-validation are performed to find the hyperparameters that yield the most accurate predictions. 
The grid search traverses each group of hyperparameters in the search space, and each group is evaluated by three-fold cross-validation. Three-fold cross-validation divides the original training dataset into three subsets, each of which contains the same proportion of each class of data. In each run of the model, two subsets are used to train the model and the remaining subset is used to test it. By running the model three times, each subset is used once for testing, and the accuracy score is computed as the average accuracy of the model on the three testing subsets. Finally, the hyperparameters with the best cross-validation score are taken as the optimal parameters. The optimal network structures of the proposed model on the NSL-KDD and UNSW-NB15 data sets are 122-80-40-20-10-5 and 196-140-80-40-20-10, respectively. In the ICVAE encoder, the activation function of all hidden layers is ReLU6 [50], and the activation function of the output layer is linear. In the ICVAE decoder, the activation function of all hidden layers is ReLU6 [50], and the activation function of the output layer is Sigmoid. In the DNN, the activation function of all hidden layers is ReLU6 [50], and the activation function of the output layer is Softmax. The learning rate of ICVAE is 5 × 10^−4, the learning rate of DNN is 1 × 10^−4, the L2 regularization coefficient is 1 × 10^−4, and the optimization algorithm is Adam [49]. Based on these optimal parameters, the training curves of ICVAE-DNN are shown in Figures 5 and 6. As can be seen from Figures 5b and 6b, the initial loss of the DNN was relatively low, which suggests that after initializing the weights of the DNN hidden layers with the ICVAE encoder, the DNN started close to a good optimum. The average loss of ICVAE and DNN and the accuracy on the training data show that the network has essentially converged.
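The search procedure described in this section, a grid over candidate hyperparameters scored by stratified three-fold cross-validation, can be sketched as follows. The `evaluate` callback is hypothetical: it stands for training a model with the given hyperparameters on the training indices and returning its accuracy on the held-out indices. The round-robin fold assignment is one simple way to keep class proportions similar and is an assumption, not the authors' code.

```python
import itertools
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, rng):
    """Split sample indices into folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

def grid_search(grid, labels, evaluate, n_folds=3, seed=0):
    """Return the hyperparameters with the best mean cross-validation score."""
    folds = stratified_folds(labels, n_folds, random.Random(seed))
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in range(n_folds):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            scores.append(evaluate(params, train_idx, test_idx))
        mean_score = sum(scores) / n_folds
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

A grid such as `{"learning_rate": [1e-3, 5e-4, 1e-4], "l2": [1e-4, 1e-5]}` corresponds to the candidate values listed in the text; the numbers of layers and hidden units can be added as further keys.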
We compared performance from two aspects: the oversampling method and the classification method. Tables 4-8 show the comparison results for different oversampling methods. Tables 9-11 show the performance comparison between the ICVAE-DNN and six well-known models. In addition, the detection performance of the ICVAE-DNN is further compared with other state-of-the-art models in Table 12.

The Detection Performance
As is evident from Tables 2 and 3, the training samples of the NSL-KDD and UNSW-NB15 datasets are imbalanced. U2R and R2L are the minority classes in the NSL-KDD dataset, and worms and shellcode are the minority classes in the UNSW-NB15 dataset. We use the ICVAE decoder to generate records of the specified categories to balance the training data, and the results are shown in Tables 4 and 5. The most popular oversampling methods used to synthesize minority attack samples are random over sampler (ROS) [53], SMOTE [54], and ADASYN [55]. In order to compare ICVAE-DNN with these oversampling techniques, three classification models are constructed based on the three oversampling methods, namely ROS-DNN, SMOTE-DNN and ADASYN-DNN. Tables 6-8 show the comparison results.

Table 8. Comparison of detection performance for different oversampling methods on the UNSW-NB15 dataset (%).

As can be seen from Tables 6 and 7, the proposed ICVAE-DNN achieved the best detection performance on the NSL-KDD (KDDTest+) and NSL-KDD (KDDTest-21) data sets. Table 8 shows that, on the UNSW-NB15 dataset, the proposed ICVAE-DNN has a higher detection rate than all oversampling methods except ROS-DNN (only in the backdoor class), SMOTE-DNN (in the DoS and reconnaissance classes) and ADASYN-DNN (in the fuzzers and analysis classes). ROS-DNN has a higher detection rate than ICVAE-DNN in the backdoor class (28.82% more), SMOTE-DNN has a higher detection rate than ICVAE-DNN in the DoS and reconnaissance classes (14.73% and 3.52% more, respectively), and the detection rate of ADASYN-DNN in the fuzzers and analysis classes is 23.72% and 71.35% higher than that of our model, respectively. However, compared to the three well-known oversampling methods, ICVAE-DNN has better overall accuracy, precision, F1-score and FPR. 
These results may be due to shortcomings of the three oversampling techniques ROS, SMOTE and ADASYN. ROS simply copies training samples, which easily leads to overfitting and reduces the generalization performance of the classifier. SMOTE uses the nearest neighbor method to generate new samples for each minority sample, which is prone to over-generalization. ADASYN uses the Γ distribution to automatically determine the number of samples that need to be synthesized for each minority sample; it is susceptible to outliers and can change the spatial distribution of the original samples. ICVAE uses the spatial distribution of latent variables to generate samples under the guidance of class labels, and uses the reconstruction error to filter the generated samples so that they are more consistent with the spatial distribution of the original samples. In addition, the trained ICVAE encoder was used to initialize the weights of the hidden layers of the DNN classifier, which made it easier for the DNN classifier to approach a global optimum, thereby improving classification performance. Tables 6-8 also indicate that the proposed ICVAE is better suited to the classification of imbalanced samples.
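The difference between copying and interpolating minority samples can be made concrete with a simplified sketch. It is not the imbalanced-learn implementation of ROS or SMOTE: the function names are hypothetical, and the neighbor passed to the SMOTE-like function is assumed to have been found already by a nearest neighbor search.

```python
import random

def random_oversample(minority, n_new, rng):
    """ROS-style oversampling: draw exact copies of existing minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_like_sample(x, neighbor, rng):
    """SMOTE-style sample: a random point on the segment between x and a neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]
```

Copies add no new information, which is why ROS is prone to overfitting, while interpolated points always lie between existing samples and can fall into regions occupied by other classes. ICVAE instead draws new points from the learned latent space and screens them by reconstruction loss.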

We compared the results of ICVAE-DNN with some well-known classification methods: KNN (K-Nearest Neighbor), MultinomialNB (multinomial naive Bayes), SVM, RF, DNN, and DBN. The evaluation is based on the five metrics introduced in Section 5.1, and the results are given in Tables 9-11. As can be seen from Table 9, ICVAE-DNN had the highest overall accuracy, recall, precision and F1-score on the NSL-KDD (KDDTest+) data set among all the well-known classifiers; only RF had a better FPR (by 0.11%). Moreover, compared with the other classifiers, ICVAE-DNN has a higher detection rate in the DoS, U2R and R2L classes. RF has a slightly higher detection rate in the normal class than ICVAE-DNN (0.11% more), and MultinomialNB has a 7.64% higher detection rate in the Probe class than ICVAE-DNN. However, ICVAE-DNN has the highest overall detection rate. In addition, ICVAE-DNN achieves the highest detection rates in the two important minority classes U2R and R2L, indicating that ICVAE-DNN is more effective in detecting minority attacks and unknown attacks. Table 10 shows that ICVAE-DNN achieves the best overall performance, except that RF has a better overall FPR (by 1.34%). The detection rate of RF in the normal class is 1.34% higher than that of ICVAE-DNN. In the Probe class, MultinomialNB achieved a 2.92% higher detection rate than ICVAE-DNN. However, RF and MultinomialNB performed poorly on the other measures. ICVAE-DNN had the highest detection rates in the minority classes U2R and R2L. Tables 9 and 10 show that all classifiers have very low detection rates for U2R and R2L attacks. The main reason is that there were too few U2R and R2L attacks in the training data set (11 and 209 samples, respectively). As can be seen from Table 2, almost half of the U2R and R2L attacks in the testing data set never appeared in the training data set, such as httptunnel, snmpgetattack, snmpguess, etc. 
As a result, none of the classifiers is fully trained on these classes. Moreover, some attacks in the R2L class, such as sendmail and snmpguess, exhibit features that are highly similar to normal records, which can cause the classifier to misclassify them as normal records. Table 11 shows that ICVAE-DNN achieves the best overall performance compared to the other six well-known models, except that DBN has a higher overall recall (by 3.22%). ICVAE-DNN reaches the highest detection rate in the normal, reconnaissance, analysis, backdoor, shellcode and worms classes. MultinomialNB achieves the highest detection rate of 70.11% in the DoS class, which suggests that the DoS attack features conform to the multinomial distribution. KNN achieved the highest detection rate of 96.63% in the generic class, but the overall DR of KNN was 1.67% lower than that of ICVAE-DNN. In the fuzzers class, SVM achieved a 39.66% higher detection rate than ICVAE-DNN. The detection rate of DBN in the exploits class was 16.4% higher than that of our model, and its overall detection rate was also 3.22% higher, but its overall F1-score was 2.16% lower. However, ICVAE-DNN achieved the highest detection rates in the important minority classes: analysis, backdoor, shellcode and worms. For example, the detection rate of ICVAE-DNN in the worms class was 68.44% and 79.55% higher than that of KNN and SVM, respectively. In addition, all classifiers have low detection rates in the analysis and backdoor classes, mainly because the analysis and backdoor attack features are highly similar to the exploits attack features. As a result, classifiers misclassify most of the analysis and backdoor attacks as exploits attacks.

Additional Comparison
To better demonstrate the performance of ICVAE-DNN, we compare it with nine state-of-the-art intrusion detection methods, namely, SCDNN (spectral clustering and deep neural network) [44], STL (self-taught learning) [56], DNN [57], Gaussian-Bernoulli RBM [18], RNN-IDS [58], CASCADE-ANN (a multiclass cascade of artificial neural networks) [59], ID-CVAE (intrusion detection CVAE) [16], EM Clustering and DT [38]. Table 12 compares ICVAE-DNN with these models on three datasets in terms of accuracy, DR and FPR. On the NSL-KDD (KDDTest+) and NSL-KDD (KDDTest-21) datasets, the proposed method achieves the best performance in terms of accuracy, DR and FPR. As shown in Table 12, ICVAE-DNN achieves the highest accuracy (89.08%) and DR (95.68%) on the UNSW-NB15 data set, but its FPR is not the lowest: the CASCADE-ANN proposed by Baig et al. [59] achieves an FPR that is 5.91% lower than that of ICVAE-DNN, although its accuracy and DR are worse. The experimental results show that ICVAE-DNN has higher accuracy and DR, and a lower FPR, than the other state-of-the-art intrusion detection methods, with the single exception of the FPR of CASCADE-ANN on the UNSW-NB15 dataset. Based on these results, we conclude that ICVAE-DNN has better detection performance for network intrusion detection.
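For reference, the three metrics used in this comparison can be computed from the binary (attack versus normal) confusion counts. The Python sketch below assumes the usual definitions, namely accuracy = (TP + TN) / (TP + TN + FP + FN), DR = TP / (TP + FN) and FPR = FP / (FP + TN). It further assumes that an attack record predicted as any attack class counts as detected. The function name and the toy labels are hypothetical, and the numbers are not taken from Table 12.

```python
def binary_ids_metrics(y_true, y_pred, normal_label="normal"):
    """Compute accuracy, detection rate (DR) and false positive rate (FPR).

    Any label other than `normal_label` is treated as an attack (assumption).
    """
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        is_attack = t != normal_label
        flagged = p != normal_label
        if is_attack and flagged:
            tp += 1          # attack correctly flagged
        elif is_attack:
            fn += 1          # attack missed
        elif flagged:
            fp += 1          # normal record raised as an alarm
        else:
            tn += 1          # normal record correctly passed
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "dr": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }

# Toy labels invented for illustration only (not results from the paper).
y_true = ["normal"] * 4 + ["dos", "dos", "probe", "r2l", "r2l", "u2r"]
y_pred = ["normal", "normal", "normal", "dos",
          "dos", "dos", "probe", "normal", "r2l", "r2l"]

metrics = binary_ids_metrics(y_true, y_pred)
print(metrics)
```

The sketch makes the trade-off noted above explicit: a model can raise its DR by flagging more records, but every additional normal record flagged also raises its FPR.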

Conclusions
In this paper, we propose a novel intrusion detection approach called ICVAE-DNN that combines the ICVAE with a DNN. For large data sets, ICVAE can learn and explore the potential sparse representations between network data features and categories. The trained ICVAE encoder is used to initialize the weights of the DNN hidden layers. With this initialization, the DNN trains faster and more easily than a traditional multi-layer perceptron network, which helps it avoid becoming trapped in local minima. The ICVAE decoder is able to generate various unknown attack samples according to the specified intrusion categories, which not only balances the training data set but also increases the diversity of training samples, so ICVAE-DNN can improve the detection rate of minority attacks and unknown attacks. The DNN automatically extracts high-level abstract features from the training data and thereby reduces the data dimension, which mitigates the curse of dimensionality. It integrates feature extraction and classification into a single system that extracts features and performs classification without extensive heuristic rules or manual experience. The classification performance of ICVAE-DNN is evaluated on the NSL-KDD (KDDTest+), NSL-KDD (KDDTest-21), and UNSW-NB15 datasets and compared with six well-known classifiers. The experimental results show that the proposed ICVAE-DNN provides higher detection rates for minority attacks (i.e., U2R, R2L, shellcode and worms) than the six well-known classification algorithms: KNN, MultinomialNB, RF, SVM, DNN and DBN. In addition, compared with the state-of-the-art classifiers (SCDNN, STL, DNN, Gaussian-Bernoulli RBM, RNN-IDS, ID-CVAE, CASCADE-ANN, EM Clustering and DT), the proposed ICVAE-DNN achieves higher accuracy and detection rate and a lower false positive rate, with the exception of the FPR of CASCADE-ANN on the UNSW-NB15 dataset. These experiments show that ICVAE-DNN is well suited to detecting network intrusions, especially minority attacks and unknown attacks.
In future work, we plan to study more effective ways to further improve the detection performance for minority attacks and unknown attacks. In particular, we plan to use adversarial learning to explore the spatial distribution of the ICVAE latent variables in order to better reconstruct input samples. Through adversarial learning, samples similar to minority attacks can be synthesized, which increases the diversity of the training samples. As a result, the detection performance of ICVAE-DNN can be further improved.