CNN-Based Network Intrusion Detection against Denial-of-Service Attacks

Abstract: As cyberattacks become more intelligent, it is challenging to detect advanced attacks in a variety of fields including industry, national defense, and healthcare. Traditional intrusion detection systems are no longer enough to detect these advanced attacks with unexpected patterns. Attackers bypass known signatures and pretend to be normal users. Deep learning is an alternative for solving these issues. Deep Learning (DL)-based intrusion detection does not require a large set of attack signatures or a list of normal behaviors to generate detection rules. DL defines intrusion features by itself by training on empirical data. We develop a DL-based intrusion model focusing especially on denial of service (DoS) attacks. For the intrusion dataset, we use the KDD CUP 1999 dataset (KDD), the most widely used dataset for the evaluation of intrusion detection systems (IDS). KDD consists of four attack categories: DoS, user to root (U2R), remote to local (R2L), and probing. Numerous KDD studies have employed machine learning to classify the dataset into the four categories or into two categories, attack and benign. Rather than focusing on the broad categories, we focus on the various attacks belonging to the same category. Unlike the other categories of KDD, the DoS category has enough samples for training each attack. In addition to KDD, we use CSE-CIC-IDS2018, which is the most up-to-date IDS dataset. CSE-CIC-IDS2018 contains more advanced DoS attacks than those in KDD. In this work, we focus on the DoS category of both datasets and develop a DL model for DoS detection. We develop our model based on a Convolutional Neural Network (CNN) and evaluate its performance through comparison with a Recurrent Neural Network (RNN). Furthermore, we suggest the optimal CNN design for better performance through numerous experiments.


Introduction
As cyberattacks evolve, attackers are exploiting unknown vulnerabilities and bypassing known signatures. One of the most representative network security solutions is an intrusion detection system (IDS). There are two types of IDSs. One is misuse detection, which detects attacks based on known signatures, and the other is anomaly detection, which detects abnormal behavior based on normal use patterns. While misuse detection has difficulty detecting unknown attacks, anomaly detection has the advantage of being able to detect them. However, anomaly detection produces many false alarms because it is challenging to define the variety of normal use patterns. Deep Learning (DL) is a technique that compensates for these weaknesses by learning its own features through a deep neural network. Employing DL in an IDS can therefore compensate for the drawbacks of both approaches. In this paper, we use 'corrected KDD', which is the most widely and commonly used dataset for IDS studies. Furthermore, we use CSE-CIC-IDS 2018, which is the most up-to-date IDS dataset and contains advanced DoS attacks such as Slowloris and Slowhttptest.

Trends of IDS Studies
Several works have studied IDSs [5]. Jing-Xin et al. [6] apply an Artificial Neural Network (NN) to a Network IDS (NIDS) and propose an NIDS prototype. Manso et al. [7] propose an IDS based on Software Defined Networking (SDN). The proposed IDS detects DDoS attacks and informs the SDN controller. Karim et al. [8] study the experimental performance of a Snort-based IDS (S-IDS) in a network. Xu et al. [9] suggest a Distributed Reflection Denial-of-Service (DRDoS) detection and defense model based on Deep Forest (DDDF). In particular, they focus on attacks on Internet of Things (IoT) devices and in big data environments. There are also several studies on machine learning-based anomaly detection schemes for Industrial Wireless Sensor Networks (IWSNs) [10]. Zhang et al. [11] suggest a Hierarchical Intrusion Detection System (HIDS) based on statistical preprocessing and NN classification. Koc et al. [12] show that Hidden Naive Bayes (HNB), a data mining model, can be used in an IDS. Hodo et al. [13] analyze IoT threats using an Artificial Neural Network (ANN) to detect DoS/DDoS attacks, focusing especially on classifying normal and threat patterns. Chung et al. [14] present a hybrid IDS that uses intelligent dynamic swarm-based rough set (IDS-RS) for feature selection and simplified swarm optimization for intrusion data classification. They use KDD as the dataset and find that the proposed model can increase the performance. Aydin et al. [15] suggest a hybrid IDS that combines two IDSs, one for misuse detection and one for anomaly detection. Al-Jarrah et al. [16] use a Time Delayed Neural Network (TDNN) structure to maximize the recognition rate of network attacks. In addition, Karthick et al. [17] propose an IDS based on a probabilistic classifier and a Hidden Markov Model (HMM). Wahab et al. [18] point out the problem of maximizing the detection of Virtual Machine-based DDoS attacks in a cloud system and propose a trust model for it. They also propose defense and detection mechanisms especially for cloud-based systems in a further study [19]. Chen et al. [20] propose a Low-rate Denial-of-Service (LDoS) attack detection model using the Hilbert-Huang transform and trust evaluation.

Trends of IDS Studies Based on Machine Learning and Deep Learning
Numerous IDS studies that employ machine learning have used KDD dataset [21,22]. Sabhnani et al. [23] evaluate the performance of a comprehensive set of pattern recognition and machine learning algorithms in four attack categories of KDD. They employ MLP (Multilayer Perceptron), K-means clustering, and Gaussian classifier and suggest the optimal algorithm showing the high detection accuracy by 4 types of attacks. The experimental result shows that DoS and U2R are the most accurate when applying K-means clustering. R2L and probing have the highest accuracy when using Gaussian classifier and MLP, respectively. Mulay et al. [24] suggest a model combining Support Vector Machine (SVM) and a decision tree. They evaluate the proposed model using KDD and then show that the combined model has a higher accuracy and reduces training and testing time than that of a model with an SVM or decision tree. Further works on KDD improve the performance of intrusion detection by the kernel type of SVM [25,26]. Hasan et al. [25] analyze the type of kernel with a best performance for an SVM-based intrusion detection. They generate new datasets called KDD99Train+ and KDD99Test+ by preprocessing duplicated data belonging to both training and testing datasets. Using the newly generated datasets, they found out that the ability of the SVM classification depends on the type of kernel and hyperparameter setting. Yao et al. [26] propose an enhanced SVM model of weighted kernel functions based on the characteristics of training dataset. According to the experimental evaluation, they find out that the performance of the proposed model is better than the existing SVM model. Kim et al. [27] compare the detection accuracy using DL as well as machine learning, such as SVM, decision tree, NN, and CNN models. They carry out binary classification that classifies KDD into benign and attack. They also perform multiclass classification that classifies the dataset into the 4 categories. 
While all the four models have high accuracy in the binary classification, the performance of intrusion detection in the multiclass classification varies depending on the type of models. According to the experimental results, the performances of decision tree and NN are lower than that of SVM and CNN. Yin et al. [28] perform the binary classification and multiclass classification based on RNN which is one of DL models. The experimental results show that the accuracy of binary classification is higher than that of multiclass classification. In addition, they find out that hyperparameters such as hidden nodes and learning rate affect the detection accuracy.
Further studies [29,30] that improve the performance of KDD classification with the proposed model have been addressed. Sheikhan et al. [29] propose a three-layers RNN architecture, which classifies features as input and attack types, as a misused-based IDS. They compare the proposed model with other machine learning methods in terms of Detection Rate (DR), False Alarm Rate (FAR) and Cost Per Example (CPE). Their experimental results show that the proposed model improves the classification rate, especially in R2L attacks. They also present better DR and CPE when compared to Multilayer Perceptron and Elman-based intrusion detectors. Bontemps et al. [30] propose a new collective anomaly detection model based on Long-short Term Memory RNN (LSTM-RNN). They show that various output reactions depend on the number of inputs of LSTM-RNN and the proposed model is effective in detecting group anomalies.
Numerous studies [31][32][33] have also addressed CNN-based detection of attacks in binary and multiple categories. Khan et al. [31] point out the disadvantages of using machine learning algorithms to obtain intrusion detection models. They also propose ways to combine CNN-based network intrusion detection models with the softmax algorithm. They evaluate the proposed model using KDD, and the experimental results show that the model detects intrusions more efficiently than the SVM and Deep Belief Network (DBN) algorithms. Tavallaee et al. [34] select some records from KDD and propose a new dataset called 'NSL-KDD'. Several pieces of research use NSL-KDD. Li et al. [32] propose image conversion methods for NSL-KDD data and analyze how CNN models automatically learn the transformed intrusion data. They find that the CNN model is sensitive to the image transformation of the data and can be used for intrusion detection. Gao et al. [35] propose an IDS that combines an incremental Extreme Learning Machine (I-ELM) with an Adaptive Principal Component Analysis (A-PCA) using the NSL-KDD and UNSW-NB15 datasets. Chu et al. [36] also use NSL-KDD to detect attacks. Upadhyay et al. [33] use KDD with 36 features randomly selected from the 41 KDD features. They transform the dataset into images of size 6 × 6 and store the remaining features in separate variables to train the CNN model. The experimental results show that the proposed model yields less than 2% error in detecting intrusions.
Fares et al. [37] study how to achieve higher detection rates and lower false alarm rates. Niyaz et al. [38] employ the Self-Taught Learning (STL) algorithm to develop an IDS. Tang et al. [39] suggest an IDS for SDN based on a Deep Neural Network (DNN) using NSL-KDD. Ingre et al. [40] employ an ANN for intrusion detection using NSL-KDD. The proposed model uses the tansig transfer function with the Levenberg-Marquardt (LM) and BFGS quasi-Newton backpropagation (BFGS) algorithms. The experimental results show that the performance of binary classification with the LM and BFGS algorithms is higher than that of multiclass classification with five categories.
Electronics 2020, 9, 916 5 of 21
Also, Erol et al. [41] propose an ANN-based IDS using KDD. In addition, Ibrahim et al. [42] use a Distributed Time-Delay ANN to model a network IDS. Tan et al. [43] suggest an IDS that uses the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset and the random forest algorithm to train the classifier for intrusion detection. Farnaaz et al. [44] also use the random forest algorithm to build their IDS. Ye [45] uses a Principal Component Neural Network (PCNN) and a Multiclass SVM (MSSVM) algorithm to detect key features in network intrusion signals with KDD. In addition, Ali et al. [46] develop a Fast Learning Network (FLN) based on Particle Swarm Optimization (PSO) to approach the IDS problem differently. They name the proposed model PSO-FLN and show that it has higher testing accuracy than meta-heuristic algorithms for training ELM and FLN classifiers. Yang et al. [47] propose the LM-Back Propagation (BP) NN model to increase the performance of IDSs in IoT. Compared to the PSO-BP and BP NN models, the proposed model shows a higher DR and fewer false alarms. Seo [48] proposes a data preprocessing technique that controls the ratio of training data in sparse classes to increase the performance of the model. He evaluates the technique with k-Nearest Neighbor, SVM, and decision tree classifiers, and the experimental results show that the accuracy with the preprocessed data is higher than with the original data. Amma et al. [49] propose a DoS detection model based on Deep Radial Intelligence (DeeRaI) with Cumulative Incarnation (CUI). They use NSL-KDD and UNSW-NB15 as datasets.
There are numerous IDS studies using the ISCX dataset. Koay et al. [50] propose a novel multi-classifier system based on novel entropy and machine learning classifier using ISCXIDS 2012. Idhammad et al. [51] propose semi-supervised DDoS detection based on entropy estimation, co-clustering, information gain ratio, and extra-trees ensemble classifier. Yassin et al. [52] suggest K-means clustering and Naive Bayes combined KMC + NBC-based IDS. Soheily-Khah et al. [53] propose K-means, random forest combined kM-RF-based hybrid intrusion detection. In addition, Faker et al. [54] propose IDS based on DNN, Random Forest, and Gradient Boosting Tree classification using CICIDS 2017. Zhang et al. [55] propose an IDS called DCF-IDS by combining DL network and gcForest (deep random forest). Zhou et al. [56] analyzed CSE-CIC-IDS 2018 using 6 types of ML algorithms such as Random Forest, Naive Bayes, Decision Tree, Neural Network (MLP), Quadratic Discriminant, and K-Neighbors. Kim et al. [57] propose a CNN-based IDS using CSE-CIC-IDS 2018. Chadza et al. [58] carry out predicting intrusions using HMM. They use CSE-CIC-IDS 2018 and evaluate three initialization techniques such as uniform, random, and count-based.
We focus on the DoS category not only in KDD, the most widely used IDS dataset, but also in CSE-CIC-IDS 2018, the most up-to-date IDS dataset.

Designing IDS Model Based on CNN
In this section, we explain the training and testing datasets we use and design our CNN-based IDS model.

DoS Datasets
The samples of DoS attacks in KDD and CSE-CIC-IDS 2018 are shown in Table 2. The smurf attack, which has the largest number of samples in KDD, exhausts network resources by causing massive numbers of Internet Control Message Protocol (ICMP) packets to be sent to a victim system. The attacker broadcasts ICMP requests with the source IP address forged as the victim's address, so that the replies flood the victim. The neptune attack, which has the second largest number of samples, is a SYN flooding attack that creates incomplete TCP sessions to exhaust the resources of the victim server. Besides smurf and neptune, several other attacks belong to the DoS category of KDD. However, they do not have enough samples to build reliable trained models. CSE-CIC-IDS 2018 also contains several DoS attacks, as shown in Table 2. These attacks are more advanced than those in KDD. We perform binary and multiclass classification on the DoS attacks in CSE-CIC-IDS 2018 as well as in KDD. We observe whether our DL model can distinguish the fine-grained features of DoS attacks within the same category, as well as the features of the attack and benign classes.

Creating the Attack Images
KDD consists of 41 traffic features and 1 feature that indicates the class to which each record belongs. Of the 41 traffic features, 38 are numerical and 3 are symbolic. We transform the symbolic data into numerical data to unify all data formats. The 3 symbolic features are the protocol type at the TCP/IP layer, the service type of the target system, and the flag type, which shows the connection state of the session. There are three protocol types: ICMP, TCP, and UDP. These protocols are transformed into the 3-dimensional vectors (1,0,0), (0,1,0), and (0,0,1) through one-hot encoding. Likewise, the 67 service types, including HTTP and FTP, are transformed into a 67-dimensional vector, and the 9 flag types are transformed into a 9-dimensional vector. These transformations yield a 79-dimensional vector (3 + 67 + 9). When this 79-dimensional vector is combined with the 38 original numerical features, a 117-dimensional vector is generated. In addition, we rescale all the numerical features to lie between 0 and 255 in order to convert the 117-dimensional vector into an image of 13 × 9 pixels, because each color channel of an image must be represented by a value between 0 and 255. We then feed these images to our CNN model. We convert the numerical samples into images because CNN is a DL model designed for learning from images.
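The encoding steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' preprocessing code: the helper names (`one_hot`, `rescale_0_255`, `to_image`) are hypothetical, min-max scaling is assumed as the rescaling method (the text only states the target range of 0 to 255), and the row-major 13-row by 9-column layout is an assumption as well.

```python
# Illustrative sketch of the KDD record-to-image conversion (assumptions noted below).

PROTOCOLS = ["icmp", "tcp", "udp"]  # order follows the text: (1,0,0), (0,1,0), (0,0,1)


def one_hot(value, vocabulary):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown symbol: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]


def rescale_0_255(column):
    """Min-max rescale one feature column to integers in [0, 255].

    Assumption: min-max scaling; the paper only gives the target range.
    """
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0 for _ in column]
    return [round(255 * (x - lo) / (hi - lo)) for x in column]


def to_image(vector, rows, cols):
    """Reshape a flat feature vector into a rows x cols grid (row-major, assumed)."""
    if len(vector) != rows * cols:
        raise ValueError("vector length does not match image size")
    return [vector[r * cols:(r + 1) * cols] for r in range(rows)]


# Dimension check from the text: 3 + 67 + 9 one-hot entries plus 38 numeric features.
KDD_VECTOR_LENGTH = 3 + 67 + 9 + 38  # 117 = 13 * 9

# Toy record: a TCP one-hot prefix followed by placeholder zeros.
toy_vector = one_hot("tcp", PROTOCOLS) + [0] * (KDD_VECTOR_LENGTH - 3)
toy_image = to_image(toy_vector, 13, 9)
```

The same `to_image` helper would cover the 78 CSE-CIC-IDS 2018 features with `rows=13, cols=6`.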
In this paper, we generate two types of image datasets. One is an RGB set, which has 3 color channels (Red, Green, and Blue), and the other is a grayscale set, which has a single channel. An RGB image is an overlay of three single-color images and is finally converted into an array of M × N × 3 pixels, where M and N are the number of columns and rows, respectively [59]. We observe how accuracy varies depending on the type of image. When the 13 × 9 pixel image is transformed into grayscale and RGB, 13 × 9 × 1 and 13 × 9 × 3 images are generated, respectively. CSE-CIC-IDS 2018 consists of 78 numerical features including destination port, type of protocol, and flow duration. We rescale these features and transform them into grayscale and RGB images of 13 × 6 pixels. Figure 1 shows the steps of creating the attack images explained above.

Designing CNN Model
CNN is the most widely used DL model for image recognition. It consists of convolution layers, which extract the features of the image, and a fully connected layer, which determines the class to which the input image belongs. A convolution layer extracts the distinctive features of the image while preserving its spatial information, and a pooling layer added after the convolution layer reduces the size of the feature data. The output size of a convolution layer is given by the following equation:

$$O = \frac{L - K + 2P}{S} + 1 \qquad (1)$$

Here O is the length of the output and L is the length of the input image. K is the kernel size and P is the amount of zero padding added at each end of the dimension. Finally, S is the stride of the kernel on the convolution layer.
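As a worked illustration of the standard convolution output-length relation, (L - K + 2P) / S + 1, the following sketch computes the feature-map size for the image and kernel sizes used in this paper. The function name `conv_output_length` is hypothetical, and the padding value of 0 in the examples is an assumption, since the text does not state which padding the model uses.

```python
def conv_output_length(length, kernel, padding=0, stride=1):
    """Output length of a convolution along one dimension.

    length  -- input length L
    kernel  -- kernel size K
    padding -- zero padding P added at each end (assumed 0 unless stated)
    stride  -- stride S of the kernel
    """
    span = length - kernel + 2 * padding
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    return span // stride + 1


# A 13 x 9 KDD image with each kernel size from the experiments, stride 1, no padding.
for k in (2, 3, 4):
    height = conv_output_length(13, k)
    width = conv_output_length(9, k)
    print(f"{k} x {k} kernel -> {height} x {width} feature map")
```

Without padding, each convolution shrinks the map by K - 1 in both dimensions, so small inputs such as 13 × 6 leave little room for several stacked 4 × 4 layers; zero padding of suitable size keeps the spatial size unchanged.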
While multiple convolution layers may learn images with complex features more effectively, performance does not always increase in proportion to the number of convolutional layers. Because the relation between the number of convolutional layers and performance depends on the characteristics of the input images, we need to find the optimal design by training a variety of designs. We design our models considering hyperparameters such as the type of image (grayscale or RGB), the number of convolutional layers, and the kernel size, which determines the number of weights in each filter of a convolution layer. Figure 2 shows the structure of our CNN model. We develop our model in the Python programming language with TensorFlow [60].

Experimental Evaluation
In this section, we train our CNN model on the DoS datasets described in Section 3 and evaluate its performance in binary and multiclass classification.

Scenario
The proposed CNN model receives grayscale or RGB images as its input. Two further parameters of the model can be changed, namely the number of convolutional layers and the kernel size, as described in Section 3. We call these parameters hyperparameters and create 18 scenarios from them, as shown in Table 3.

The CNN model consists of 1, 2, or 3 convolutional layers, and the number of kernels, which corresponds to the number of neurons per layer, doubles with each layer. The kernel size is usually set to 3 × 3. We therefore take 3 × 3 as the median value and also experiment with sizes of 2 × 2 and 4 × 4 to find the optimal size. The kernel generates a feature map by moving over the image in steps of a designated stride. We set the stride to 1 [61] to extract the features densely. Figure 3 shows examples of our CNN designs. We chose 6 scenarios (RGB-1, GS-1, RGB-5, GS-5, RGB-9, and GS-9) that illustrate CNN designs with different numbers of layers, kernel sizes, and color channels.
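The 18 scenarios follow from combining 2 image types, 3 kernel sizes, and 3 layer counts. The sketch below enumerates them. Table 3 is not reproduced in this text, so the numbering used here (kernel size varies slowest, layer count fastest) is an assumption inferred from the results discussion, where scenario 8 is described as a 4 × 4 kernel with two convolutional layers; the names `build_scenarios`, `KERNEL_SIZES`, and the dictionary keys are hypothetical.

```python
from itertools import product

IMAGE_TYPES = {"RGB": 3, "GS": 1}   # scenario prefix -> number of color channels
KERNEL_SIZES = [2, 3, 4]            # 2 x 2, 3 x 3, 4 x 4
LAYER_COUNTS = [1, 2, 3]            # number of convolutional layers


def build_scenarios():
    """Enumerate the hyperparameter grid as {scenario name: settings}.

    Assumed numbering: index 1..9 runs over kernel sizes first,
    then over layer counts within each kernel size.
    """
    scenarios = {}
    for prefix, channels in IMAGE_TYPES.items():
        grid = product(KERNEL_SIZES, LAYER_COUNTS)
        for index, (kernel, layers) in enumerate(grid, start=1):
            scenarios[f"{prefix}-{index}"] = {
                "channels": channels,
                "kernel": (kernel, kernel),
                "conv_layers": layers,
                "stride": 1,  # stride is fixed to 1 in all scenarios
            }
    return scenarios


scenarios = build_scenarios()
for name in ("RGB-1", "RGB-5", "RGB-9"):
    print(name, scenarios[name])
```

Under this assumed numbering, the three example designs RGB-1, RGB-5, and RGB-9 differ in both kernel size and layer count, which matches the description of the scenarios chosen for Figure 3.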
Experiments are performed with binary and multiclass classifications. Table 4 shows the detection classes for each classification.

Evaluation of Binary Classification
The experimental results of binary classification for the 18 scenarios show that most scenarios achieve more than 99% accuracy. To evaluate the performance of the proposed model, we also calculate the F1-score. The F-score is an index that reflects both precision and recall, and the F1-score is the F-score with the weight β set to 1, so that precision and recall count equally. The F1-score is defined in Equation (2):

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (2)$$

In deep learning, the results may vary slightly from one training run to another. The accuracy reported in this paper is therefore the average of five runs for each scenario, so that even slight differences reflect the effect of the hyperparameters.

For binary classification of KDD, the best RGB scenarios are, in order, RGB-3, RGB-8, and RGB-6, and the best grayscale scenarios are GS-8, GS-6, and GS-3. For CSE-CIC-IDS 2018, the best RGB scenarios are RGB-8, RGB-9, and RGB-6, and the best grayscale scenarios are GS-8, GS-3, GS-6, and GS-9. That is, for both RGB and grayscale images in binary classification, the scenario with three convolutional layers performs best when the kernel size is 2 × 2 or 3 × 3, and the scenario with two convolutional layers performs best when the kernel size is 4 × 4.

A more detailed analysis of binary classification by hyperparameter follows. Table 5 compares the number of correctly classified samples, the number of misclassified samples, and the accuracy of the nine RGB scenarios (RGB-1 to RGB-9) and the nine grayscale scenarios (GS-1 to GS-9) for KDD. Because the RGB scenario is more accurate than the grayscale scenario in every comparison, we can see that generating RGB images of DoS traffic is a way to improve the detection performance.
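The F-score described above, and the averaging of repeated runs, can be illustrated with a short sketch. The general form used here is the standard definition F_β = (1 + β²) · P · R / (β² · P + R), which reduces to the F1-score for β = 1. The function names and the five accuracy values are hypothetical and serve only to show the calculation; they are not results from the paper.

```python
def f_beta(precision, recall, beta=1.0):
    """F-score for a given beta; beta = 1 gives the F1-score."""
    if precision == 0 and recall == 0:
        return 0.0  # conventional value when both terms are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean(values):
    """Arithmetic mean, used to average repeated runs of one scenario."""
    values = list(values)
    if not values:
        raise ValueError("no values to average")
    return sum(values) / len(values)


# Hypothetical accuracies from five runs of a single scenario (illustrative only).
example_runs = [0.991, 0.993, 0.992, 0.990, 0.994]
print("mean accuracy:", mean(example_runs))
print("F1 for P=0.98, R=0.96:", f_beta(0.98, 0.96))
```

Because the F1-score is a harmonic mean, it stays close to the smaller of precision and recall, which is why it is reported alongside accuracy.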
In the experimental results for CSE-CIC-IDS 2018, all the scenarios with RGB images are likewise more accurate than the grayscale scenarios, as shown in Figure 4.

Table 6 shows the experimental results of binary classification for KDD. When the kernel is 2 × 2 or 3 × 3, the accuracy increases with the number of convolutional layers (1 L, 2 L, 3 L); with more layers, the features are extracted more accurately. With the 4 × 4 kernel, both the RGB and grayscale scenarios perform best with two convolutional layers. Because performance does not always increase in proportion to the number of convolutional layers, we conclude that a 4 × 4 kernel should be combined with two convolutional layers to achieve better performance. Indeed, RGB-8 has the second highest accuracy among the nine RGB scenarios, and GS-8 has the highest accuracy among the grayscale scenarios. Thus, a 4 × 4 kernel with two convolutional layers is among the best-performing combinations of hyperparameters.

Table 7 compares the accuracy for kernel sizes of 2 × 2, 3 × 3, and 4 × 4 for KDD. For the RGB scenarios, accuracy shows no consistent trend (e.g., increasing or decreasing) with kernel size when there are two or three convolutional layers. The grayscale scenarios show no consistent trend either, indicating that kernel size alone does not affect accuracy as much as the type of image or the number of convolutional layers does. Similar to KDD, CSE-CIC-IDS 2018 shows no particular pattern in the scenarios with the 4 × 4 kernel, as shown in Figure 6.

Analysis of the Accuracy in Multiclass Classification
As with binary classification, we compare the accuracy of the scenarios according to the parameter settings in multiclass classification of KDD and CSE-CIC-IDS 2018. For KDD, the most accurate scenarios are, in order, RGB-3, RGB-5, and RGB-6 for RGB and GS-8, GS-5, and GS-6 for grayscale. For CSE-CIC-IDS 2018, they are RGB-8, RGB-9, and RGB-6 for RGB and GS-9, GS-3, and GS-6 for grayscale. In other words, both RGB and grayscale images perform better with two or three convolutional layers than with one. A closer look at the multiclass classification results by hyperparameter follows.

Table 8 shows how the accuracy varies between RGB and grayscale images for KDD; all other hyperparameter values are the same, and only the number of color channels (RGB or grayscale) differs. As in binary classification for KDD, RGB images are generally more accurate than grayscale images, with one exception: GS-8 is more accurate than RGB-8. This scenario is most likely affected by the combination of the number of convolutional layers and the kernel size. However, since all other scenarios are more accurate with RGB images, we conclude that RGB images give higher performance than grayscale images in multiclass classification. In the experimental results for CSE-CIC-IDS 2018, all the scenarios with RGB images are more accurate than the grayscale scenarios, as shown in Figure 7.

Table 9 shows the experimental results of multiclass classification for KDD and compares the accuracy by the number of convolutional layers.

Number of Convolutional Layers
In Table 9, both the RGB and grayscale scenarios show that accuracy increases with the number of layers (1 L, 2 L, 3 L) when the kernel size is 2 × 2. However, when the kernel size is 3 × 3 or 4 × 4, accuracy is not proportional to the number of convolutional layers. In multiclass classification, the number of convolutional layers is thus proportional to performance only for the 2 × 2 kernel, whereas in binary classification it is proportional to performance for both the 2 × 2 and 3 × 3 kernels. We can therefore say that the number of convolutional layers has a lower impact on performance in multiclass classification.
Figure 8 compares, for CSE-CIC-IDS 2018, the accuracy of scenarios that are identical except for the number of convolutional layers. When the kernels are 2 × 2 and 3 × 3, the accuracies of both RGB and grayscale are much higher with three convolutional layers than with one or two. When the kernel size is 4 × 4, the accuracy is high with two and three convolutional layers but not with one. This suggests that a larger kernel tends to give better performance.
Table 10 shows the accuracy according to kernel size for KDD. Just as binary classification shows no specific pattern, no pattern correlating kernel size with performance is found in multiclass classification. In the experimental results for CSE-CIC-IDS 2018, there is likewise no particular pattern of accuracy when the kernel size is 2 × 2 or 3 × 3, as shown in Figure 9. However, when the kernel size is 4 × 4, the accuracy is much higher than with the other sizes (2 × 2 and 3 × 3). In particular, as mentioned in Section 4.3.2, the accuracy with a large kernel is higher regardless of the number of convolutional layers.

Discussion
The experiments in Section 4 show that the proposed CNN model achieves an accuracy higher than 99% in both binary and multiclass classification. Here we detect the DoS attacks using an RNN model to compare its performance with that of the proposed model. An RNN extends a neural network (NN) to sequential data, and its hidden nodes form a cyclic (recurrent) structure. In this experiment, we design a simple RNN model using Keras with five embedding vectors and a sigmoid activation function as hyperparameters. Table 11 shows the precision, recall and F1-score for each class of the binary and multiclass classifications using the RNN for KDD. Precision is the fraction of the test samples that the model assigns to a class that truly belong to that class. Recall is the fraction of the test samples that truly belong to a class that the model assigns to that class. The F1-score combines precision and recall into a single number, namely their harmonic mean. The RNN model has 99% accuracy in binary classification, almost the same as that of our CNN model. In multiclass classification, however, the RNN model has 100% accuracy in Smurf detection, while the accuracies for Neptune and benign are 80% and 85%, respectively.
Interestingly, distinguishing Smurf from Neptune attacks does not cause many misdetections, whereas distinguishing benign traffic from Neptune attacks causes many. That is why the accuracy of the RNN model is lower than that of the CNN model in the multiclass classification.
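The per-class precision, recall and F1-score defined above can be derived from a confusion matrix. The following is a minimal sketch in plain Python. The function name and the counts are hypothetical, chosen only to show how confusion between two classes lowers their scores; they are not the results reported in Table 11.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall and F1-score for each class.

    confusion[i][j] is the number of samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column sum
        actually_k = sum(confusion[k])                     # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up counts for illustration only; these are NOT results from this paper.
labels = ["benign", "smurf", "neptune"]
toy_confusion = [
    [7, 0, 3],   # true benign: 3 samples misclassified as neptune
    [0, 10, 0],  # true smurf: all classified correctly
    [1, 0, 9],   # true neptune: 1 sample misclassified as benign
]
toy_metrics = per_class_metrics(toy_confusion, labels)
for name, values in toy_metrics.items():
    print(name, values)
```

In this toy matrix the Smurf class is untouched by the errors, while the benign and Neptune classes lose precision and recall to each other, which is the kind of pattern described above.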
In the experimental results of CSE-CIC-IDS 2018, the accuracies of both the RNN-based binary and multiclass classifications are significantly lower than those of CNN-based detection, as shown in Table 12. In binary classification, the detection accuracy for benign traffic is about 2% lower than that for attacks. In the multiclass classification, the accuracy of benign detection is only 73.5%. Among the attacks, DoS-GoldenEye, DDoS-LOIC-HTTP, and DoS-Slowloris are detected with high accuracies of 97%, 95%, and 89%, respectively. For the other attacks, however, the accuracies are below 50%. DoS-Hulk and DDoS-HOIC result in many false positives, and DoS-SlowHTTPTest is often incorrectly detected as DoS-Hulk.
Experimental results show that the accuracy for CSE-CIC-IDS 2018 is generally lower than that for KDD. This is because our KDD model divides samples into 3 categories (benign, Smurf, and Neptune), while the CSE-CIC-IDS 2018 model divides samples into 7 categories (benign and 6 advanced DoS attacks). Furthermore, the experimental results with the RNN model suggest that advanced DoS attacks do not have novel characteristics compared to traditional DoS attacks, and that their characteristics do not appear to be time-series features.
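To make the recurrent structure used in this comparison concrete, the sketch below runs a single recurrent cell with a sigmoid activation over a short sequence. It is a simplified illustration in plain Python, not the Keras model used in our experiments; the function names, weights, and dimensions are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rnn_forward(sequence, w_in, w_rec, bias):
    """Run one recurrent cell over a sequence of feature vectors.

    Each step computes h_t = sigmoid(w_in * x_t + w_rec * h_{t-1} + bias),
    starting from h_0 = 0, and the list of hidden states is returned.
    """
    hidden_size = len(bias)
    h = [0.0] * hidden_size
    states = []
    for x in sequence:
        # The comprehension reads the previous h; h is rebound only afterwards.
        h = [
            sigmoid(
                sum(w_in[i][j] * x[j] for j in range(len(x)))
                + sum(w_rec[i][j] * h[j] for j in range(hidden_size))
                + bias[i]
            )
            for i in range(hidden_size)
        ]
        states.append(h)
    return states


# Toy one-dimensional example: the same input is presented at both time steps.
states = rnn_forward(
    sequence=[[0.0], [0.0]],
    w_in=[[1.0]],
    w_rec=[[1.0]],
    bias=[0.0],
)
print(states)
```

Although both inputs are identical, the two hidden states differ because the second step also receives the first hidden state. This dependence on earlier steps is what an RNN can exploit when the data carry time-series structure.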

Conclusions
We developed a CNN-based model for the detection of DoS attacks using KDD and CSE-CIC-IDS 2018. KDD contains four attack categories: DoS, U2R, R2L, and Probing. Most deep learning-based KDD studies have carried out binary classification that distinguishes benign traffic from attacks across all categories. These studies have also performed multiclass classification that distinguishes the four categories in KDD.
We focused on the single category of DoS and performed detection of different attacks within that category. We also used the most up-to-date IDS dataset, which contains advanced DoS attacks such as DoS-Hulk, DoS-SlowHTTPTest, DoS-GoldenEye, DoS-Slowloris, DDoS-LOIC-HTTP, and DDoS-HOIC. We generated two types of intrusion image, RGB and grayscale, and designed our CNN model considering the number of convolutional layers and the kernel size. To evaluate our model, we created 18 scenarios from these hyperparameters: the type of image, the number of convolutional layers, and the kernel size. We performed the binary and multiclass classifications for each scenario, and then suggested the optimal scenarios with the highest performance.
Our experimental results have shown that RGB images yield higher accuracy than grayscale images in both binary and multiclass classifications. In addition, we found that both RGB and grayscale images performed best with three convolutional layers when the kernel size is 2 × 2 or 3 × 3. When the kernel size is 4 × 4, deploying two convolutional layers gives the highest accuracy. In multiclass classification, performance was generally high with more than one convolutional layer. However, the best model should be found by trying various hyperparameter settings, because performance is not proportional to the number of convolutional layers. The kernel size has not been found to have a significant impact on either binary or multiclass classification. We compared the proposed model with an RNN model to verify its performance. For KDD, the CNN model achieved an accuracy of 99% or more in both binary and multiclass classifications, while the RNN achieved 99% accuracy in binary classification and 93% in multiclass classification. For CSE-CIC-IDS 2018, the CNN model achieved an average accuracy of 91.5%, while the RNN model achieved an average accuracy of 65%.
In other words, the CNN model proposed in this paper was better able than the RNN model to identify specific DoS attacks with similar characteristics. As future work, multiclass classification will also be carried out for attacks belonging to other categories in KDD and CSE-CIC-IDS 2018. Furthermore, our model will be applied to other intrusion datasets to improve its performance.