Comparison of Machine Learning and Deep Learning Models for Network Intrusion Detection Systems

Abstract: The development of robust anomaly-based network intrusion detection systems, which are preferred over static signature-based systems, is vital for cybersecurity. A flexible and dynamic security system is required to tackle new attacks. Current intrusion detection systems (IDSs) struggle to attain both a high detection rate and a low false alarm rate. To address this issue, in this paper, we compare different machine learning (ML) and deep learning (DL) models on the CIDDS datasets and propose an ensemble model that combines the best-performing ones.


Introduction
Cybersecurity plays a crucial role in nearly all aspects of our lives, including social, economic, and political systems. Security researchers have developed various network intrusion detection methods and tools to detect and prevent network intrusions from evolving cyber-attacks [1]. Intrusion detection is a way of monitoring a network for malicious activity or policy violations. It differs from a traditional firewall, which relies on a static set of rules, in its dynamic nature. With the advent of new technologies and their ever-growing reliance on cyberspace and cloud computing, it is more critical than ever to protect this virtual aspect of our lives.

Related Work
As network intrusions evolve, it has become imperative to develop dynamically generated flow-based datasets instead of static datasets with both positive and negative examples. The intrusion detection system (IDS) is deployed at a strategic point within the network. It records inbound and outbound traffic from all the devices in the network, creating a flow-based dataset.
KDD Cup 99 datasets [7] have been used widely for the past decade for the evaluation of network IDSs [8]. The KDD99 dataset was created from DARPA network dataset files [9]. Obeidat et al. [10] applied different ML algorithms on the KDD dataset, such as J48 Decision Tree, Random Forest, Multilayer Perceptron, Naïve Bayes, and Bayes Network classifiers. In this research work, Random Forest achieved the highest accuracy rate for detecting and classifying all KDD dataset attacks. DL algorithms have also been applied to this dataset: a deep neural network has been implemented and attained high accuracy [11]. Chowdhury et al. [12] implemented few-shot intrusion detection using support vector machines (SVMs) and 1-nearest neighbor on features extracted from a trained deep convolutional neural network (CNN). However, the KDD99 dataset suffers from bias caused by redundancy. In [5], the analysis of the KDD99 dataset showed that about 78% and 75% of the records are duplicated in the train and test sets, respectively. This high percentage of redundancy causes models to be biased and unreliable.
Tavallaee et al. [8] developed the NSL-KDD dataset from the KDD99 dataset, primarily by removing redundancy. Various ML and DL algorithms have been applied to this dataset. Revathi et al. [13] and Ever et al. [14] implemented various ML techniques with high testing accuracy on the NSL-KDD dataset for intrusion detection. Su et al. developed the BAT model [15], which combines Bidirectional Long Short-Term Memory (BLSTM) and an attention mechanism; the attention mechanism generates key features of traffic data from the packet vectors produced by the BLSTM model. Furthermore, Zhang et al. [16] implemented a DL model utilizing an autoencoder on the NSL-KDD dataset. However, these datasets date back almost 20 years. Besides their own shortcomings, they might not contain attack scenarios based on evolving threats.
Sperotto et al. [17] published an IP flow-based intrusion detection dataset. It provides a classification of attacks and defense techniques through the detection of scans, worms, botnets, and denial-of-service (DoS) attacks. However, it does not include normal user behavior. Shiravi et al. [18] included network traffic for normal user behavior as well, but their dataset lacked communication with an external server. This is vital because, in real-world scenarios, attacks are mostly launched through external servers. Thus, an IDS trained only on internal-server traffic may fail to replicate such attack scenarios.
A flow-based intrusion detection dataset should replicate real-world attack scenarios by including an external server as well. It is also imperative for the dataset to contain both positive and negative examples: anomalous or intrusion behavior and normal user behavior. Moreover, data generation should evolve over time, adding new threats, and should stay as close to real-world conditions as possible.
Ring et al. [4,5] developed the Coburg intrusion detection dataset (CIDDS), a labeled flow-based dataset. It was implemented in OpenStack by creating a virtual office environment imitating daily network usage. The network traffic is recorded in unidirectional NetFlow format. NetFlow, developed by Cisco Systems, is a flexible and extensible method to record network performance data that has become a prevalent means of monitoring traffic in high-speed networks [19]. Attacks such as denial of service, brute force, and port scans are implemented and labeled for supervised learning. Verma et al. [20] implemented K-nearest neighbor classification and K-means clustering algorithms on the first version of this dataset, CIDDS-001. However, their approach included biased features, such as the host IP address identifying the attacker. Furthermore, cross-validation, which establishes the robustness of a model, was not implemented, and no comparison with deep learning models was made.
In this study, we will use both the CIDDS-001 and CIDDS-002 datasets independently as well as in combined form. We will implement different ML models and DL models for comparison. We propose and develop an ensemble model that combines the best ML and DL models to improve the overall performance. General performance metrics and time costs will be calculated to determine model effectiveness and efficiency, respectively. Model benchmarking will also be performed on the CIC-IDS2017 dataset, which includes a comparison with state-of-the-art models.

Dataset Used
CIDDS-001 and CIDDS-002 (Coburg intrusion detection datasets) [4,5] are used for the evaluation of anomaly-based network intrusion detection in this study. Both are labeled flow-based datasets containing unidirectional NetFlow data. They were generated by emulating a small business environment in OpenStack. Each dataset contains ten attributes from NetFlow along with manually labeled classes and attack types. Our focus is on these attributes and classes for both internal and external servers.
Attributes such as the source IP address, destination IP address, and date were not used, on the assumption that these features may introduce bias without being useful for detecting attacks. Table 1 shows an overview of the attributes and classes of both CIDDS-001 and CIDDS-002 that were considered for this study. The CIDDS-001 dataset includes 92 attacks: 70 within the OpenStack environment and 22 involving external servers. The CIDDS-002 dataset includes an additional 43 attacks within the OpenStack environment. Furthermore, we balanced the datasets by under-sampling, such that all class labels have the same number of instances; this improves fairness and representation and reduces bias. We also combined the CIDDS-001 and CIDDS-002 internal server data for analysis. After balancing, each dataset is divided into an 80% training set and a 20% test set. Table 2 shows the number of training and test samples.
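As an illustration, under-sampling to the minority-class count followed by an 80/20 split can be sketched in a few lines of stdlib Python (the toy class names and counts below are hypothetical, not drawn from CIDDS):

```python
import random
from collections import Counter, defaultdict

def balance_and_split(rows, labels, test_frac=0.2, seed=42):
    """Under-sample every class to the minority-class count, then
    shuffle and split into train/test partitions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, lab in zip(rows, labels):
        by_class[lab].append((row, lab))
    n_min = min(len(v) for v in by_class.values())
    balanced = []
    for lab, items in by_class.items():
        balanced.extend(rng.sample(items, n_min))  # under-sample each class
    rng.shuffle(balanced)
    n_test = int(len(balanced) * test_frac)
    return balanced[n_test:], balanced[:n_test]   # train, test

# toy flows: three imbalanced classes
rows = [[i] for i in range(100)]
labels = ["normal"] * 70 + ["dos"] * 20 + ["portscan"] * 10
train, test = balance_and_split(rows, labels)
counts = Counter(lab for _, lab in train + test)
print(counts)   # every class reduced to 10 instances
```

After balancing, each class contributes equally, so accuracy is no longer dominated by the majority ("normal") class.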

ML Models Used in This Study
In this study, we applied different ML and DL models on CIDDS datasets and compared their performances. A spot check is performed on the CIDDS-001 external dataset using different ML models to identify the best performing models. ML models such as K-nearest neighbor (KNN) [21], XGBoost [22], and classification and regression trees (CARTs) [23] performed best for this dataset. In the following subsections, we briefly describe the applied models.

K-Nearest Neighbor (KNN)
KNN [21] is one of the essential supervised learning algorithms in ML. It is non-parametric, making no assumptions about the underlying distribution of the data. First, K is chosen: the number of neighbors to consider when classifying a new point. The class is then predicted by a majority vote among the K training samples nearest to the new point. Euclidean, Manhattan, and Minkowski are some predominant distance measures, while the Hamming distance is used for categorical variables.
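The vote described above can be sketched in pure Python with the Euclidean distance (the toy 2-D "flows" and labels are illustrative, not CIDDS data):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training
    points under the Euclidean distance."""
    dists = sorted(
        (math.dist(p, x), y) for p, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# toy 2-D feature vectors: two clusters, two classes
X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
y = ["attack", "attack", "normal", "normal"]
print(knn_predict(X, y, (0.15, 0.15)))  # "attack"
```

Swapping `math.dist` for a Manhattan or Minkowski distance changes only the sorting key, which is why KNN generalizes so easily across distance measures.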

XGBoost
XGBoost [22] is an optimized distributed gradient boosting algorithm that is highly efficient. It improves upon gradient boosting machines through system optimization and algorithmic refinements. Regular gradient boosting, stochastic gradient boosting, and regularized gradient boosting are its three main forms. For efficiency, its system features include parallelization, distributed computing, out-of-core computing, and cache optimization. These features enable training across multiple machines with optimized data structures, improving both the quality of the optimum found and the run time.
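The core boosting idea, sequentially fitting weak learners to the residuals of the current model, can be sketched in a few lines. This is a conceptual illustration of gradient boosting with squared loss and decision stumps, not XGBoost's optimized, regularized implementation:

```python
def stump_fit(x, residuals):
    """Find the threshold split on a 1-D feature minimising squared
    error against the current residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, rounds=20, lr=0.3):
    """Each round fits a stump to the residuals and adds a shrunken
    copy of it to the ensemble, steadily reducing the error."""
    pred = [0.0] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = stump_fit(x, resid)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
pred = boost(x, y)
mse = sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
print(round(mse, 4))  # residual error shrinks geometrically per round
```

XGBoost adds regularization, second-order gradients, and the system-level optimizations listed above on top of this basic loop.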

CART
Classification and regression trees (CARTs) [23] can be used for both classification and regression problems. The output of CART is a decision tree, generally visualized as a binary tree splitting on one attribute at each node. It is not significantly impacted by outliers in the input variables and can use the same variable more than once in different parts of the tree. Decision trees are composed of decision nodes, branches, and leaf nodes. The tree grows from the root node and splits at each internal node, while the leaf nodes hold the output variable. The stopping criterion, as pointed out by Zhang [24], is that all leaf nodes represent a single class, attaining homogeneity within prespecified training counts. Figure 1 shows the overall decision tree for the CIDDS-001 external dataset, with a detailed figure included as separate Supplementary Material, where each tree node can be inspected.
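How a CART node chooses its split can be illustrated with the Gini impurity, a standard CART splitting criterion (the toy feature values and labels below are hypothetical):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(x, y):
    """Pick the threshold on one feature that minimises the
    size-weighted Gini impurity of the two child nodes."""
    best = None
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, t)
    return best

# toy feature (e.g. flow duration) that separates two classes cleanly
x = [0.2, 0.3, 0.4, 1.5, 1.7, 1.9]
y = ["attack", "attack", "attack", "normal", "normal", "normal"]
print(best_split(x, y))  # (0.0, 0.4): a perfectly pure split
```

Growing a full tree repeats this search recursively on each child until the stopping criterion (pure leaves or a minimum node size) is met.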


DL Models Used in This Study
In this paper, we apply convolutional neural networks (CNNs) [25] and long short-term memory (LSTM) [26] as base DL models, optimized with encoding. The embedding layer has been used for the encoding of different features. In the following subsections, we briefly describe the architecture of DL methods and encoding applied in this study.

Embedding Layer
The embedding layer is used for encoding in this study. It has been primarily used in natural language processing with great success [27]. The embedding layer precedes DL models [28]. It is initialized with random weights. With every epoch, the layer also optimizes the encoding. Due to its dynamic nature, embedding performs better than static encodings like one hot encoding [28]. The embedding output dimension is set to 7 with respect to the number of variables in this study.
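To make the idea concrete, the following is a minimal, dependency-free sketch of an embedding as a trainable lookup table; the class name, initial weight range, and learning rate are illustrative assumptions, not the paper's implementation:

```python
import random

class Embedding:
    """A learned-encoding sketch: each categorical code maps to a
    trainable dense vector, unlike static one-hot encoding."""
    def __init__(self, vocab_size, dim=7, seed=0):
        rng = random.Random(seed)
        # initialised with small random weights, as described above
        self.table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def __call__(self, codes):
        # encoding is just a row lookup per categorical code
        return [self.table[c] for c in codes]

    def update(self, code, grad, lr=0.1):
        # during training, gradients nudge only the rows that were used,
        # so the encoding itself improves with every epoch
        row = self.table[code]
        self.table[code] = [w - lr * g for w, g in zip(row, grad)]

emb = Embedding(vocab_size=3, dim=7)   # e.g. 3 protocol values, 7-dim output
vecs = emb([0, 2])
print(len(vecs), len(vecs[0]))   # 2 7
```

In a real DL framework the update step is handled by backpropagation; this sketch only shows why the encoding is dynamic rather than fixed like one-hot vectors.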

Convolutional Neural Network (CNN)
Firstly, for CNN, encoded or non-encoded input is fed into a convolutional layer with 64 filters, with the dimension adjusted using a lambda layer. The filter size is 2 × 2, and the activation function used is the rectified linear unit (ReLU). The output of dimension 7 × 7 × 64 is fed into another 2D convolutional layer with 128 filters of size 2 × 2 after a dropout layer; the dropout layer is used to minimize overfitting. The output of dimension 7 × 7 × 128 is then fed into another 2D convolutional layer with 256 filters of size 2 × 2. Thereafter, a 2D max-pooling layer of size 2 × 2 is applied, giving an output of size 3 × 3 × 256, and a last convolutional layer with 128 filters is used. After one more 2D max-pooling layer and flattening, the output is fed into a dense block with four hidden layers and the final output layer. Adam [29] was used as the optimizer for the CNN architecture; it uses an adaptive learning-rate methodology to calculate individual learning rates for each parameter. Softmax, which assigns each class a probability such that the probabilities sum to one, is used as the activation function for the output layer. The detailed parameter description for the CNN model is shown in Table 3.

Long Short-Term Memory (LSTM)
For LSTM, encoded data are fed into two consecutive LSTM layers with dropouts enabled. The output is then fed into the dense layers with the final output at the end. ReLU is used as the activation function for the LSTM layers and softmax for the final output layer.

The Model Checkpoint function is used for both models to extract the best model out of all the epochs based on the validation dataset. The detailed parameter description for the LSTM model is shown in Table 4.
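The layer dimensions quoted for the CNN can be checked with simple size arithmetic. The sketch below assumes 'same'-padded convolutions (padding mode is not stated in the text, but it is the only choice consistent with a 2 × 2 filter preserving the 7 × 7 size) and non-overlapping 2 × 2 max pooling:

```python
def conv2d_same(size, filters):
    """A 'same'-padded convolution keeps the spatial size and
    changes only the depth (number of filters)."""
    return (size, filters)

def maxpool2d(size, depth, pool=2):
    """Non-overlapping pooling floors the spatial size by the pool factor."""
    return (size // pool, depth)

# trace the spatial size / depth through the layers described above
size, depth = 7, 1                      # 7 embedded features
size, depth = conv2d_same(size, 64)     # -> 7 x 7 x 64
size, depth = conv2d_same(size, 128)    # -> 7 x 7 x 128 (after dropout)
size, depth = conv2d_same(size, 256)    # -> 7 x 7 x 256
size, depth = maxpool2d(size, depth)    # -> 3 x 3 x 256
size, depth = conv2d_same(size, 128)    # -> 3 x 3 x 128
size, depth = maxpool2d(size, depth)    # -> 1 x 1 x 128
flat = size * size * depth              # flattened input to the dense block
print(size, depth, flat)   # 1 128 128
```

The flattened 128-dimensional vector is what feeds the four hidden dense layers before the softmax output.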

Ensemble Model
Ensemble learning combines different models to optimize performance, either through different combining algorithms or by using different training datasets. Weighted averaging and stacking are two standard combining algorithms.
The proposed ensemble model merges ML and DL models based on weighting or stacking [30], as shown in Figure 2. A model-averaging ensemble combines predictions from different models and generally improves performance. A grid search weighted average ensemble is used for this purpose. The model weights are values assigned to different models based on their individual performance; the weights are normalized between 0 and 1 such that they sum to one across all ensemble members. Some models may have high sensitivity, whereas others may have high specificity. Combining them according to their performance, without completely discarding any model, can improve overall performance.
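A minimal sketch of the grid search weighted average ensemble follows; the grid step, the toy per-model probabilities, and the labels are illustrative assumptions, not the paper's actual models:

```python
from itertools import product

def weighted_ensemble(probs, weights):
    """Average per-class probabilities from several models using the
    given weights (which sum to one), then take the argmax class."""
    n_classes = len(probs[0][0])
    preds = []
    for i in range(len(probs[0])):
        mixed = [sum(w * p[i][c] for w, p in zip(weights, probs))
                 for c in range(n_classes)]
        preds.append(mixed.index(max(mixed)))
    return preds

def grid_search_weights(probs, y_true, step=0.1):
    """Search normalized weight triples on a coarse grid and keep the
    combination with the highest accuracy."""
    best = (0.0, None)
    steps = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    for w1, w2 in product(steps, steps):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:
            continue               # weights must sum to one
        weights = (w1, w2, max(w3, 0.0))
        preds = weighted_ensemble(probs, weights)
        acc = sum(p == t for p, t in zip(preds, y_true)) / len(y_true)
        if acc > best[0]:
            best = (acc, weights)
    return best

# toy class probabilities from three hypothetical models on 4 samples
m1 = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
m2 = [[0.6, 0.4], [0.4, 0.6], [0.5, 0.5], [0.6, 0.4]]
m3 = [[0.7, 0.3], [0.7, 0.3], [0.2, 0.8], [0.3, 0.7]]
y = [0, 0, 1, 1]
acc, w = grid_search_weights((m1, m2, m3), y)
print(acc)   # best grid accuracy on this toy data
```

The grid search evaluates every weight combination on held-out predictions, so a weak model naturally receives a small but non-zero weight rather than being discarded.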
The next method used for combination was stacking. Unlike a grid search weighted average ensemble, a stacking ensemble can use the set of predictions as a base to conditionally weight the input predictions differently. Detailed information on the stacking ensemble model is shown in Supplementary Material Figure S1.

Performance Metrics
In this study, 10-fold cross-validation was used to evaluate the performance of the model and to determine its robustness. In 10-fold cross-validation, the data are partitioned into ten equal parts. Then, one part is left out for validation, and training is performed on the remaining nine parts. This process is repeated until all parts are used for validation.
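The fold bookkeeping can be sketched as follows; contiguous folds are used for simplicity, since the paper does not specify whether folds were shuffled or stratified:

```python
def k_fold_indices(n, k=10):
    """Partition indices 0..n-1 into k folds; each fold serves as the
    validation set exactly once, the rest form the training set."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# every sample appears in a validation fold exactly once
seen = []
for train, val in k_fold_indices(25, k=10):
    seen.extend(val)
print(sorted(seen) == list(range(25)))  # True
```

Averaging the per-fold accuracies (and reporting their standard deviation, as in Table 5) is what makes the procedure a robustness check rather than a single train/test split.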

The confusion matrix, precision, recall, and F1-score were used as performance metrics. For three classes, the confusion matrix is 3 × 3; for five classes, it is 5 × 5. The diagonal of the matrix contains the correctly predicted values. The metrics are derived from the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
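These metrics follow directly from the confusion matrix. A small sketch with a hypothetical 3 × 3 matrix (the counts are invented for illustration):

```python
def per_class_metrics(cm):
    """Precision, recall and F1 per class from a square confusion
    matrix cm[true][pred]; the diagonal holds correct predictions."""
    n = len(cm)
    out = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp   # column minus diagonal
        fn = sum(cm[c][p] for p in range(n)) - tp   # row minus diagonal
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out.append((round(prec, 3), round(rec, 3), round(f1, 3)))
    return out

# hypothetical 3-class matrix (e.g. normal / attacker / victim)
cm = [[50, 2, 0],
      [1, 47, 2],
      [0, 1, 49]]
print(per_class_metrics(cm))
```

In the multi-class setting, TN for a class is everything outside its row and column, which is why precision and recall can be computed from row and column sums alone.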
Furthermore, the time cost has also been evaluated for different models to determine their efficiency.

CIDDS-001 External Dataset
Firstly, different ML and DL models were applied to the CIDDS-001 external dataset with five classes. To determine the robustness of these models, 10-fold cross-validation was applied to the training dataset. The results are shown in Table 5. CART performed better than the other ML models, with an average mean accuracy of 99.31% and a standard deviation (SD) of 0.15.
The SDs for the other models were low as well, showing the general robustness and generalizability of the applied models. Furthermore, CNN without encoded input achieved only 68.10% mean accuracy in 10-fold cross-validation, while LSTM without encoded input achieved 98.97% mean accuracy with an SD of 0.08. With the embedding layer added for dynamic encoding, both CNN and LSTM improved: CNN with embedding reached a mean accuracy of 99.15% with an SD of 0.08, and LSTM with embedding reached 99.18% with an SD of 0.10, performing marginally better than the CNN model. Total execution time was higher for the deep learning models in comparison to the ML models; CART had the best execution time of 0.65 s in 10-fold cross-validation.
Furthermore, an independent test was performed on the remaining 20% of the dataset that was not used for training. The results are shown in Table 5. Weighted and stacked ensemble models were included for this test: the weighted ensemble comprised CART, CNN (with embedding), and LSTM (with embedding), whereas the stacked ensemble comprised CNN (with embedding) and LSTM (with embedding). CART achieved an accuracy of 99.38% with the lowest execution time of 0.067 s. Detailed performance metrics for CART are given in Table 6. Both deep learning models achieved around 99% accuracy. However, the stacked ensemble combining CNN and LSTM did not improve the results, achieving 98% accuracy. The weighted ensemble assigned weights of 0.105, 0.421, and 0.474 to LSTM (with embedding), CNN (with embedding), and CART, respectively. It improved the result only slightly, by 0.02%, but at a much higher training cost. Hence, ensemble models were not applied to the other datasets.

CIDDS-001 Internal and CIDDS-002 Dataset
With the pre-analysis of different models on the CIDDS-001 external dataset, ensemble models were dropped for this part of the analysis. This study was carried out on the CIDDS-001 internal and CIDDS-002 datasets, both with three classes. An independent test was carried out on the 20% of each dataset that was not used for training, with different ML and DL models for both datasets. The results are shown in Table 7. Overall, CART performed better than the others both in terms of accuracy and training time. It achieved 99.88% accuracy with 15.5 s of training time for the CIDDS-001 internal dataset and 99.64% accuracy with 0.77 s of training time for the CIDDS-002 dataset. KNN and XGBoost accuracies also increased in comparison to the CIDDS-001 external dataset but remained lower than CART in overall performance. Both DL models with embedding encoding performed well, with 99.85% and 99.51% accuracy for the CIDDS-001 internal and CIDDS-002 datasets, respectively. Training time for CNN (with embedding) was slightly lower than that of LSTM (with embedding) but significantly higher than CART's.
Overall, there was no significant difference between the ML and DL models except for training time. Since CART consistently performed better than the other ML models both in terms of accuracy and training time, KNN and XGBoost were dropped from the next analysis. Detailed performance metrics for CART are given in Tables 8 and 9 for the CIDDS-001 internal and CIDDS-002 datasets, respectively.

CIDDS-001 Internal and CIDDS-002 Combined Dataset
For the last part of the analysis, an independent test was carried out on the combined CIDDS-001 internal and CIDDS-002 dataset. CART, CNN (with embedding), and LSTM (with embedding) performed best in the previous analyses, so they were chosen for this final analysis. The results are shown in Table 10. Following the same pattern as the previous analyses, CART performed better than the other models both in terms of accuracy and training time. It achieved 99.85% accuracy, slightly better than CNN's 99.77% and LSTM's 99.79%. However, CART's training time of 23.91 s was significantly better than CNN's 841.46 s and LSTM's 1172.90 s. Furthermore, the performance metrics of CART are given in Table 11.

Model Benchmarking
Verma et al. [20] implemented KNN classification on the first version of this dataset, CIDDS-001. We implemented KNN on the balanced CIDDS-001 external dataset, and the results are shown in Table 5. CART and the DL models with embedding outperform KNN, both in 10-fold cross-validation and in the independent test. Furthermore, we applied our models to the intrusion detection evaluation dataset (CIC-IDS2017) [6] for benchmarking. For this purpose, we used CART and CNN as the ML and DL models, respectively, chosen based on performance metrics and training time costs. CNN with embedding was preferred over LSTM with embedding due to a negligible loss in overall accuracy with better training time. We compared our results with state-of-the-art models implemented by Pelletier et al. [31]. The results are shown in Table 12. Pelletier et al. [31] implemented a neural network and a random forest, which achieved consistently high accuracy, over 99%, on all of the attacks. We used attacks with a high number of instances and balanced them by under-sampling. Our DL and ML models achieved comparable and marginally better accuracies than the previous models.

Feature Importance
One of the essential benefits of using ML models is feature importance. It helps determine the effect different attributes have on classification: feature importance algorithms assign each attribute a score based on how well it predicts the class. This gives a general idea of the impact of different features on classification and can be very useful in feature engineering. In our study, we used CART to calculate feature importance among the seven features described in Table 1.
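As a rough illustration of the idea (not CART's exact computation, which accumulates impurity decreases over every split in the full tree), the Gini gain of each feature's best single split can serve as a normalized per-feature score:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(col, y):
    """Largest impurity decrease obtainable with one threshold split
    on a single feature: a crude stand-in for tree-based importance."""
    base, best = gini(y), 0.0
    for t in sorted(set(col)):
        left = [yi for xi, yi in zip(col, y) if xi <= t]
        right = [yi for xi, yi in zip(col, y) if xi > t]
        if not left or not right:
            continue
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        best = max(best, base - child)
    return best

def importances(X_cols, y):
    """Normalize per-feature gains so the scores sum to one."""
    gains = [split_gain(col, y) for col in X_cols]
    total = sum(gains) or 1.0
    return [g / total for g in gains]

# toy columns: a "duration"-like feature separates the classes, the other is noise
duration = [0.1, 0.2, 0.3, 2.0, 2.5, 3.0]
noise = [5, 1, 4, 2, 6, 3]
y = ["attack"] * 3 + ["normal"] * 3
scores = importances([duration, noise], y)
print(scores[0] > scores[1])  # True: the separating feature dominates
```

Scores like these are what Figure 3 visualizes per dataset, with the caveat that a full-tree computation also credits features used deeper in the tree.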
Feature importance for different sets of data is shown in Figure 3. For the CIDDS-001 external server, duration has the most impact on the classification. However, for the CIDDS-001 internal server flags scored significantly higher than other attributes. In CIDDS-002, source port and bytes scored relatively higher than other attributes. Finally, in the combined dataset, flags scored higher as well.
Flags scored higher in the combined dataset, similar to CIDDS-001, because CIDDS-001 is nearly eight times larger than CIDDS-002, so the feature importance score is biased towards the CIDDS-001 dataset. Due to the difference in the number of classes between the internal and external datasets, feature importance differs across the respective datasets.


Conclusions
We have implemented different ML and DL algorithms as intrusion detection systems in this study. As a benchmark, the recently developed CIDDS dataset has been used. The dataset is balanced by under-sampling such that all classes have the same number of instances.
Overall, CART performed well across all datasets in terms of performance metrics, including an accuracy of 99.8%, and training time. There was no significant difference in ML model performance between the internal and external network datasets. Non-encoded DL models did not perform well; however, after the implementation of embedding, both CNN and LSTM performed on par with CART, with accuracies around 99% on all the datasets considered. The dynamic nature of embedding encoding, which improves with training, enabled this better performance. The weighted ensemble combining CART, CNN, and LSTM achieved good accuracies but at a high time cost because of the overhead of DL training. The stacked ensemble (CNN+LSTM) did not improve on the base CNN and LSTM models and added further training cost due to its sequential architecture. From the experiments and literature review, we observe that ensemble models are not beneficial for low-ambiguity problems (where cross-validated accuracies exceed 95%) but could prove useful in complex scenarios where accuracies are below 90%.
For a comparative analysis, a genetic neural network [32] was also implemented but did not show any significant improvement in the accuracies in the CIDDS dataset. The performance results are given in the supplementary document in Table S1 and Figure S2.
Across all datasets, an accuracy of 99% and an F1 score of 1 were achieved. In terms of training cost, CART had the lowest training time. The DL models had higher training times, with the ensemble models and the genetic neural network having the highest. Even so, training time was relatively low for such large datasets for both the ML and base DL models, increasing the viability of our models. Due to the higher training cost and negligible performance improvement, the ensemble model was not used for the internal server and combined datasets, which are larger than the external server dataset.
Further benchmarking was performed by applying our ML and DL models to the CIC-IDS2017 dataset and comparing them with state-of-the-art models. The performance metrics for these models were already at 99%, so there was little room for significant improvement. However, our models' performance was on par with, or even slightly better than, previous models. This helps establish our approach as a robust option for an intrusion detection system, at least on labeled flow-based datasets.
The training and testing of these models were carried out on a system with an Intel i7-9750 processor, 64 GB RAM, and an Nvidia 2080 graphics card. Training time was reduced by utilizing the CUDA cores of the graphics card.
In this study, high performance metrics with relatively low training times were achieved for both ML and DL models in network intrusion detection, with CART performing best. The ML models performed well due to the low ambiguity of the feature vectors in the dataset; with a more complex dataset, DL models may outperform them. However, ML models offer control over feature engineering, which DL methods lack. Feature importance helps select the most relevant features of the data, for example via a mutual information criterion, and reduce the feature set for better accuracy and faster computation.
For future studies, an improved dataset with more types of attacks can be developed in an OpenStack environment for the benchmark dataset. Further studies should include remedies for sophisticated adversarial attacks using a generative adversarial network (GAN) [33]. Both ML and DL models should be implemented for interpretability and accuracy.
Author Contributions: N.T.: methodology, software, resources, formal analysis, writing-original draft. Z.L.: resources, investigation, writing-review and editing. D.B.K.: conceptualization, supervision, writing-review and editing. B.G.: conceptualization, supervision, validation, writing-review and editing. K.R.: project administration, funding acquisition, conceptualization, supervision, validation, writing-review and editing. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported partially by CISCO Inc., a research grant. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of CISCO Inc.

Conflicts of Interest:
The authors declare no conflict of interest.