An Analysis of the KDD99 and UNSW-NB15 Datasets for the Intrusion Detection System

Abstract: The significant growth of internet technologies makes network security a crucial issue. An intrusion detection system (IDS) can be introduced to protect networks from various attacks. Despite the growing body of IDS research, there is a lack of studies that analyze the available IDS datasets. Therefore, this study presents a comprehensive analysis of the relevance of the features in the KDD99 and UNSW-NB15 datasets. Three methods were employed: rough-set theory (RST), a back-propagation neural network (BPNN), and a discrete variant of the cuttlefish algorithm (D-CFA). First, the dependency ratio between the features and the classes was calculated using the RST. Second, each feature in the datasets became an input for the BPNN, to measure its ability to classify each class. Third, a feature-selection process was carried out over multiple runs, to indicate how frequently each feature was selected. The results indicate that some features in the KDD99 dataset can be used to achieve a classification accuracy above 84%. Moreover, a few features in both datasets were found to contribute strongly to classification performance: these features appeared in feature combinations that resulted in high accuracy, and they were also frequently selected during the feature-selection process. The findings of this study are anticipated to help cybersecurity academics create lightweight yet accurate IDS models with a smaller number of features for developing technologies.


Introduction
Due to the increasing demand for computer networks and network technologies, attack incidents are growing day by day, making the intrusion detection system (IDS) an essential tool for keeping networks secure. The IDS has proven to be effective against many different attacks, such as denial of service (DoS), structured query language (SQL) injection, and brute-force attacks [1][2][3]. Two approaches are to be considered when developing an IDS [4]: misuse-based and anomaly-based. In the misuse-based approach, the IDS attempts to match the patterns of already known network attacks, and its database is updated continuously with the patterns of newly discovered attacks. The anomaly-based IDS, on the other hand, attempts to detect unknown network attacks by comparing traffic to regular connection patterns. Anomaly-based IDSs are considered adaptive, but they are prone to generating a high number of false positives [4,5]. In related work, a hierarchical IDS that uses machine-learning and knowledge-based approaches was introduced and tested using the KDD99 dataset. Reference [39] proposed an ensemble model based on J48, RF, and REPTree and evaluated it using the KDD99 and NSL-KDD datasets; a correlation-based approach was implemented to reduce the features of the datasets. Reference [40] examined the reliability of a few machine-learning models, such as RF and gradient-boosting machines, in real-world IoT settings. To do so, data-poisoning attacks were simulated by using a stochastic function to modify the training data. The UNSW-NB15 and ToN_IoT datasets were employed for the experiments of that study.
It is essential to address that the KDD99 and UNSW-NB15 datasets do not contain attacks related to the cloud computing, such as the SQL injection. Reference [41] proposed a countermeasure to detect these attacks, specifically in the cloud environment. The method in Reference [41] can be applied to the cloud environment, without the need for an application's source code.
In this study, the features in the KDD99 and UNSW-NB15 datasets were analyzed by using rough-set theory (RST), a back-propagation neural network (BPNN), and a discrete variant of the cuttlefish algorithm (D-CFA). The analysis provides an in-depth examination of the relevance of each feature to the malicious-attack classes. It also studies the symmetry of the records' distribution among the classes. The results of the analysis suggest a few features and combinations that can be used for creating an accurate IDS model. This study also describes and gives the properties of the datasets mentioned above. Despite the availability of other works that have tried to analyze the two datasets, it is important to study the most common datasets in this domain continuously, not only to confirm their relevance but also to expand the findings on these datasets. The main contributions of this paper can be listed as follows:
• Give a detailed description of the KDD99 and UNSW-NB15 datasets.
• Point out the similarities between the two datasets.
This paper includes five sections. The description and properties of the KDD99 and UNSW-NB15 datasets are provided in Section 2. Section 3 explains the methodology and experimental setup. The results and discussions are given in Section 4. Conclusion and future work are provided in Section 5.

Datasets' Description and Properties
The KDD99 is very common among researchers in IDS research. A survey by Reference [42] found that 142 studies used the KDD99 dataset from 2010 until 2015. The dataset is available with 41 features (excluding the labels) and five classes, namely Normal, DoS, Probe, remote-to-local (R2L), and user-to-root (U2R). The KDD99 (ten-percent variant) contains 494,021 and 311,029 records in the training and testing sets, respectively. The classes in the training and testing sets of the KDD99 are imbalanced, as shown in Figure 1. The DoS class has the highest number of records, while the Normal class comes second. Moreover, the testing set contains a higher number of records that are classified as R2L. This distribution of records was found to contain a large number of duplicated records. The number of records of each class, with their amount of duplication, is provided in Table 1. A graphical representation of the amount of record duplication for each class is given in Figure 2. The highest amount of duplication in the training set belongs to the DoS and Probe classes, whereas the highest amount in the testing set belongs to DoS and R2L. The Probe class in the testing set also contains a fair amount of duplication. It is essential to note that the U2R class contains no duplications in the training set. Overall, the full training and testing sets of the KDD99 dataset contain 348,437 (70.53%) and 233,813 (75.17%) duplicated records, respectively; about five percent more duplications were present in the testing set.

The available UNSW-NB15 dataset contains 42 features (excluding the labels) and ten classes, namely Normal, Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. Its training set includes 175,341 records, while the testing set has 82,332 records. The classes in the training and testing sets of the UNSW-NB15 are also imbalanced, as illustrated in Figure 3. The Normal class in both sets contains the highest number of records. In contrast, Generic and Exploits come second. The Fuzzers class includes a fair number of records as well, but the rest of the classes show a low number of records compared to the mentioned classes. However, it was found that the training set of the UNSW-NB15 contains a high number of duplicated records, whereas the testing set does not contain any. Based on the details given in Table 2, the full training set contains 42.24% duplicated records. Figure 4 illustrates the duplications for each class in the training set. These duplications are found mainly in the Generic, DoS, and Exploits classes. The Reconnaissance class also contains a fair amount of duplication.

The class-distribution difference between the two datasets is shown in Figure 5. The KDD99 has a higher number of records that represent a malicious-attack class. Both training and testing sets of the KDD99 have an almost identical percentage of attack and normal records. As for the UNSW-NB15, the distribution of records between the attack and normal classes is more balanced than in the KDD99. Moreover, the percentages of the attack and normal classes across the two sets are slightly different.

The names of the features in each dataset are given in Table 3. The features in the KDD99 dataset are categorized into four groups, given in Table 4. The first group (basic) contains nine features that include necessary information, such as the protocol, service, and duration. The second group (content) represents thirteen features containing information about the content, such as the login activities. The third group (time) provides nine time-based features, such as the number of connections that are related to the same host within a two-second period. The fourth group (host) contains ten host-based features, which provide information about the connection to the host, such as the rate of connections that have the same destination port number being accessed by different hosts.
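Duplication counts like those reported for Tables 1 and 2 can be reproduced mechanically. Below is a minimal Python sketch, assuming each record is represented as a tuple of feature values; the toy records are illustrative, not actual KDD99 rows.

```python
from collections import Counter

def duplication_stats(records):
    """Count duplicated records: a record is a duplicate if an identical
    record already appears elsewhere in the set."""
    counts = Counter(records)
    n_dup = sum(c - 1 for c in counts.values())  # extra copies beyond the first
    return n_dup, 100.0 * n_dup / len(records)

# Toy set of 4 connection records; two of them repeat the first record.
toy = [("tcp", 10), ("tcp", 10), ("udp", 5), ("tcp", 10)]
n, pct = duplication_stats(toy)  # n == 2, pct == 50.0
```

Applied to the full KDD99 training set, the same computation would yield the 348,437 (70.53%) figure stated above.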

Table 3. Names of the features in the KDD99 and UNSW-NB15 datasets.

Table 4. Groups of the features in the KDD99 dataset.
As for the features in the UNSW-NB15 dataset, they are categorized into five groups, provided in Table 5. The first group (flow) includes the protocol feature, which identifies the protocol between the hosts, such as TCP or UDP. The second group (basic) represents the necessary connection information, such as the duration and number of packets between the hosts; fourteen features are categorized in this group. The third group (content) provides content information from the TCP connection, such as the window-advertisement values and base sequence numbers. It also provides some information about the HTTP connections, such as the data size transferred using the HTTP service; eight features are present in this group. The fourth group (time) includes eight features that use time, such as the jitter and arrival time of the packets. The fifth group (additional) includes eleven additional features, such as whether a login was successfully made. Moreover, the fifth group includes a few features that count the number of rows using a specific service within a flow of 100 records, based on a sequential order. It is important to note that a few features described in Reference [9], namely srcip, sport, dstip, dsport, stime, and ltime, were not present in the actual dataset; therefore, they were not included in this study. Moreover, f2-9 was present in the dataset but was not described or categorized in Reference [9]; therefore, it was categorized into the basic group.

Table 5. Groups of the features in the UNSW-NB15 dataset.

A few features were found to be in common between the two datasets. KDD99's features f1-1, f1-2, f1-3, f1-5, and f1-6 are in common with UNSW-NB15's features f2-1, f2-2, f2-3, f2-7, and f2-8. f1-1 and f2-1 describe the connection duration; f1-2 and f2-2 give the protocol type, such as transmission control protocol (TCP) or user datagram protocol (UDP); f1-3 and f2-3 state the service used at the destination, such as file transfer protocol (FTP) or domain name system (DNS); and f1-5, f2-7, f1-6, and f2-8 give the number of transmitted data bytes between the source and destination. Some other features between the two datasets share similar characteristics. As described in Table 6, both datasets contain features that use connection flags. Connection flags provide additional information, such as synchronization (SYN) and acknowledgment (ACK). There are ten features in the KDD99 that use flags, whereas the UNSW-NB15 has only four. Table 6 also shows that the number of features that involve connection counts is higher in the KDD99 than in the UNSW-NB15. Further, the UNSW-NB15 was found to contain more time-based and size-based features.

Table 6. Similarities of the features in KDD99 and UNSW-NB15.

Categories compared in Table 6: common features; features that use connection flags; features that count connections; time-based features; and size-based features (transmitted bits, bytes, or packets).
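For reference, the feature correspondence described above can be captured as a small lookup table. The structure below is illustrative and not part of either dataset's distribution; the descriptions paraphrase the text.

```python
# Common KDD99 -> UNSW-NB15 feature correspondence, per the description above.
COMMON_FEATURES = {
    "f1-1": ("f2-1", "connection duration"),
    "f1-2": ("f2-2", "protocol type (TCP/UDP)"),
    "f1-3": ("f2-3", "service used at the destination (FTP, DNS, ...)"),
    "f1-5": ("f2-7", "transmitted data bytes between source and destination"),
    "f1-6": ("f2-8", "transmitted data bytes between source and destination"),
}

# Example lookup: which UNSW-NB15 feature matches KDD99's duration feature?
match, meaning = COMMON_FEATURES["f1-1"]  # match == "f2-1"
```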

Methodology
The dataset analysis was carried out by using three methods, namely the RST, a back-propagation neural network (BPNN), and the D-CFA. First, using the RST, the dependency between the features and each of the attack classes was calculated. Second, using the BPNN, the classification accuracy (ACC) of each feature for detecting a malicious-attack class was computed. Lastly, the D-CFA was used for feature selection, picking the most relevant features over multiple iterations and runs, to indicate the most frequently selected features. The BPNN was used to evaluate the selected features, as a wrapper feature-selection approach. To calculate the ACC from the BPNN, the records in the datasets were first transformed and normalized. Figure 6 illustrates the main steps that were taken to analyze the KDD99 and UNSW-NB15 datasets. The three methods used for the analysis and their evaluation measurements are explained in detail in the following subsections.


Rough-Set Theory (RST)
The RST was used to find the dependency between the features and the classes. For this analysis, each feature was used to calculate its dependency on each of the malicious-attack classes. Based on References [29,43], the dependency ratio (called depRatio) was calculated, using Equation (1).
depRatio(X) = |lower(X)| / |U| (1)
where U denotes all the records, and X signifies the set of records that are used to classify the two classes, normal and attack. The depRatio is a value between 0 and 1: if depRatio = 1, then X is a crisp set and the two classes can be classified correctly, and if depRatio < 1, then X is a rough set. The ratio is defined based on the lower approximation (called lower), which is calculated by using Equation (2):
lower(X) = ∪ { [r]_f : [r]_f ⊆ X } (2)
where lower(X) is the set of records that belong only to the target decision (X), which can be used to classify the decision without any uncertainty. It is the union of all the records for both classes in [r]_f that are entirely contained by the selected feature (f). Once the depRatio(X) of the features is calculated, an average of the depRatio (called ADR) can be computed by using Equation (3):
ADR = (1/n) Σ_{i=1}^{n} depRatio_i(X) (3)
where n is the number of features in the dataset. The ADR can be used to indicate the dependency of all the features on a specific attack class, where a higher ADR value designates a higher dependency.
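Equations (1)-(3) can be illustrated with a short sketch. The Python code below is illustrative, not the authors' implementation; it treats an equivalence class [r]_f as the set of records sharing one value of the selected feature, and counts a class toward the lower approximation only when all of its records carry the same label.

```python
from collections import defaultdict

def dep_ratio(feature_values, labels):
    """Equation (1): |lower(X)| / |U| for one feature against a binary decision."""
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)           # equivalence classes [r]_f by feature value
    # Equation (2): records in classes that map to a single label, without uncertainty.
    lower = sum(len(ys) for ys in groups.values() if len(set(ys)) == 1)
    return lower / len(labels)

def adr(feature_columns, labels):
    """Equation (3): average dependency ratio over the n features."""
    return sum(dep_ratio(col, labels) for col in feature_columns) / len(feature_columns)

# Toy data: the first feature separates normal/attack perfectly, the second does not.
f1 = [0, 0, 1, 1]
f2 = [5, 5, 5, 9]
y = ["normal", "normal", "dos", "dos"]
# dep_ratio(f1, y) == 1.0 (crisp set), dep_ratio(f2, y) == 0.25 (rough set)
```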

Back-Propagation Neural Network (BPNN)
The BPNN used was based on the implementation provided in Reference [44]. Back-propagation is the training algorithm for adjusting the weights and biases of the neural network [44].
Formally, as illustrated in Figure 7, every input, inp_i, has a weight, w_i, that corresponds to the strength of the connection. The weighted sum of the inputs plus the bias, b, is passed to the activation function, σ, to generate the output, y [45]. This process can be demonstrated by using Equation (4):
y = σ( Σ_i w_i · inp_i + b ) (4)
In order to keep the structure of the neural network simple, only one hidden layer was used. As for the nodes in the hidden layer, different numbers of nodes were set and evaluated, with a maximum number equal to n. The logistic sigmoid function was used as the activation function at the hidden and output layers, as given in Equation (5):
σ(v_i) = 1 / (1 + e^(−v_i)) (5)
where e is the exponential constant, and v_i represents the input value of the function. The mean squared error (MSE) was used to calculate the error loss during the training of the neural network, using Equation (6):
MSE = (1/R_n) Σ (out − desired)² (6)
where R_n is the number of records in the training set, out is the output of the function, and desired is the expected output value. The final weights and biases are obtained by minimizing the output of the error-loss function. The training procedure ends after the maximum number of epochs is reached. The same parameters as in Reference [44] were used to train the BPNN; they are given in Table 7. Furthermore, for data preprocessing (transformation and normalization), the non-numeric values were transformed and then normalized by using a min-max function, given in Equation (7):
normalized = (value − minimum) / (maximum − minimum) (7)
where normalized denotes the normalized value, and minimum and maximum refer to the smallest and largest values of that input. In order to report the ACC, every single feature in the datasets was used as an input to train a BPNN model. The ACC and average ACC (AACC) that result from the training can be calculated by using Equations (8) and (9), respectively:
ACC = (TP + TN) / (TP + TN + FP + FN) (8)
AACC = (1/n) Σ_{i=1}^{n} ACC_i (9)
where TP and TN signify that the classification was correct; FP and FN indicate that the output was classified incorrectly; and TP, TN, FP, and FN are calculated based on all the outputs y of each BPNN model.
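The building blocks in Equations (4)-(8) can be sketched directly in code. The following is a minimal Python sketch (not the study's C# implementation); the helper names are illustrative.

```python
import math

def sigmoid(v):
    """Equation (5): logistic activation."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron(inputs, weights, bias):
    """Equation (4): y = sigma(sum_i w_i * inp_i + b)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mse(outputs, desired):
    """Equation (6): mean squared error over the training records."""
    return sum((o - d) ** 2 for o, d in zip(outputs, desired)) / len(outputs)

def min_max(value, minimum, maximum):
    """Equation (7): scale one input value into [0, 1]."""
    return (value - minimum) / (maximum - minimum)

def acc(tp, tn, fp, fn):
    """Equation (8): classification accuracy."""
    return (tp + tn) / (tp + tn + fp + fn)

# Examples:
# neuron([1.0, 0.0], [0.0, 0.0], 0.0) == sigmoid(0) == 0.5
# min_max(5, 0, 10) == 0.5
# acc(40, 40, 10, 10) == 0.8
```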

Discrete Cuttlefish Algorithm (D-CFA)
The standard cuttlefish algorithm (CFA) [46] and its discrete variant (D-CFA) [21] have four search strategies, comprising two exploration (global) strategies and two exploitation (local) strategies, which are based on the skin-color-changing behavior of the cuttlefish. In the D-CFA, a new solution, Sol_new, is generated based on Reflection and Visuality, using Equation (10):
Sol_new = Reflection ∪ Visuality (10)
where ∪ is the union of the produced discrete data (features). Algorithm 1 gives the pseudo-code of the D-CFA for solving the feature-selection problem. The BPNN was used to evaluate the picked features, and the classification accuracy was used as the fitness function during the search process. The flowchart of the D-CFA process is given in Figure 8.
Each solution, Sol_i, in the population (called Dpop) includes two subsets: picked_ftr and unpicked_ftr. The finally selected features are assigned to picked_ftr, whereas the unselected features are assigned to unpicked_ftr. No feature is repeated between the subsets, where picked_ftr ∩ unpicked_ftr = none. To illustrate, consider a total of 20 features in the dataset with a picked_ftr of size 5; unpicked_ftr will then be equal to 15 features. The improvement phase of the algorithm uses four search strategies, which are explained in the following subsections:
• Global search 1 (lines 8-12 of Algorithm 1)
The first global search of the algorithm finds a new solution, Sol_new, using Equation (10), where the required values of the Reflection and Visuality are calculated by using Equations (11) and (12), respectively. Reflection and Visuality are subsets of features with sizes equal to the values of R• and V•, which specify the number of features to be picked from Sol_i's picked_ftr and unpicked_ftr. Equations (13) and (14) are used to compute the values of R• and V•, respectively.
where random(zero, picked_ftr.size) is a number that is randomly generated between zero and the number of picked features in the picked_ftr subset. The union of the subsets generated from the Reflection and Visuality is used to create the picked_ftr subset of the new solution, Sol_new. All unpicked features are placed in the unpicked_ftr subset of Sol_new.
• Local search 1 (lines 13-17 of Algorithm 1)
The first local search in the algorithm finds a new solution, Sol_new, using Equation (10), based on Sol_best. The picked_ftr and unpicked_ftr subsets of the new solution are computed by using Equations (15) and (16), respectively. R• is computed by using Equation (17), which is then used to specify the feature index for replacement from picked_ftr. V• is computed by using Equation (18), to specify the feature replacement from the unpicked_ftr subset of Sol_best.
where Sol_best.unpicked_ftr.size is equal to the number of features in the unpicked_ftr subset of Sol_best.
• Local search 2 (lines 18-22 of Algorithm 1)
The second local search calculates an average based on Sol_best, to generate an average solution (called Sol_Avg) with the same two subsets (picked_ftr and unpicked_ftr). Then a new solution, Sol_new, is computed based on the subsets of Sol_Avg, using Equations (13)-(15). Sol_Avg always contains one feature less than the picked_ftr of Sol_best: for each generation, one feature from the picked_ftr subset is removed and moved to the unpicked_ftr subset, to create the Sol_new and update the Sol_Avg.
• Global search 2 (lines 23-27 of Algorithm 1)
In the second global search, a new solution, Sol_new, is generated with random subsets of features, through a process similar to the population initialization. An excerpt of Algorithm 1:
7: For each Sol_i in the Dpop
8: Global search 1
9: Update picked_ftr and unpicked_ftr subsets for Sol_new using Equations (10)-(14)
10: Evaluate the Sol_new using the BPNN
11: If f(Sol_new) > f(Sol_best) then Sol_best = Sol_new
12: If f(Sol_new) > f(Sol_i) then Sol_i = Sol_new
13: Local search 1
14: Update picked_ftr and unpicked_ftr subsets for Sol_new using Equations (10), (15)
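The solution representation and the first global strategy can be sketched as follows. This is a hedged illustration, not the authors' implementation: the fitness evaluation via the BPNN is omitted, and the random-size draws stand in for Equations (13) and (14).

```python
import random

def init_solution(n_features, n_picked, rng):
    """A D-CFA solution holds two disjoint subsets: picked_ftr and unpicked_ftr."""
    idx = list(range(n_features))
    rng.shuffle(idx)
    return {"picked_ftr": set(idx[:n_picked]),
            "unpicked_ftr": set(idx[n_picked:])}

def global_search_1(sol, rng):
    """Sketch of global search 1: the new picked_ftr is the union (Equation (10))
    of a Reflection subset drawn from picked_ftr and a Visuality subset drawn
    from unpicked_ftr, with random sizes R and V."""
    r = rng.randint(0, len(sol["picked_ftr"]))       # size of Reflection
    v = rng.randint(0, len(sol["unpicked_ftr"]))     # size of Visuality
    reflection = set(rng.sample(sorted(sol["picked_ftr"]), r))
    visuality = set(rng.sample(sorted(sol["unpicked_ftr"]), v))
    picked = reflection | visuality                  # Equation (10)
    universe = sol["picked_ftr"] | sol["unpicked_ftr"]
    return {"picked_ftr": picked, "unpicked_ftr": universe - picked}

rng = random.Random(42)
sol = init_solution(20, 5, rng)    # the example above: 5 picked, 15 unpicked
new = global_search_1(sol, rng)    # subsets stay disjoint and cover all 20 features
```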

Results and Discussions
In this section, three experiments were carried out, to analyze the training sets of the KDD99 and UNSW-NB15 datasets. First, the lower and depRatio between each feature and attack class were calculated. Second, the ACC of the features for detecting malicious attack classes in the datasets were computed, using the BPNN. Lastly, the D-CFA was used for feature selection, to find the most frequently selected features. This section also discusses and compares all the obtained results from the experiments.
The C# (C-Sharp) programming language was used for the experiments, which were executed on a desktop computer with a 2.8 GHz CPU (i5-8400) and 8 GB of RAM.

Calculating the Lower Approximations and Dependencies of the Features
The ADR of the features in the KDD99 and UNSW-NB15 can be seen in Figures 9 and 10, respectively. Figure 9 shows that the features in the KDD99 had their highest ADR values for the U2R and R2L attacks, and their lowest values for the DoS and all attacks combined. Specifically, feature f1-5 showed the highest ADR across all attacks, and f1-6 was found to be second. In the results for the UNSW-NB15, shown in Figure 10, the highest ADR values were for the Shellcode and Worms attacks, and the lowest were for Generic and all attacks combined. Specifically, the highest ADR across all attacks was achieved by using f2-1, and f2-13 achieved the second highest.

The lower and depRatio of the features for each attack in the KDD99 and UNSW-NB15 are given in Appendix A, Tables A1 and A2, respectively. As shown in Table A1, f1-5 had the highest lower and depRatio values for the Probe and R2L. As for the DoS and all the attacks combined, f1-24 had the highest values. The U2R showed the highest values when using f1-33. It is essential to note that f1-12, f1-20, and f1-21 resulted in lower and depRatio values of zero. f1-12 is a binary value that is used to indicate whether a login was made; f1-21 is related to the user's logins and is used to indicate whether a login was associated with a "hot" list, as referred to in Reference [8]; and f1-20 is used for indicating the commands of the outbound FTP connections. It was found that f1-20 and f1-21 have zero values in all records, and removing them is suggested for any classification task.

Based on the results given in Appendix A, Table A2, f2-1 showed the highest lower and depRatio values for Fuzzers, DoS, Exploits, Reconnaissance, Shellcode, and all attacks combined. As for the Backdoors and Worms attacks, f2-7 showed the highest values. Unlike f1-20 and f1-21 in the KDD99, none of the features in the UNSW-NB15 resulted in lower and depRatio values of zero. The lowest depRatio was achieved by f2-23, which gives the value of the TCP window advertisement from the destination connection; most of the values of f2-23 in the dataset were found to be equal to 255 or zero.

Classification Accuracy Analysis: Examining the Features for the Detection of Each Attack
The neural networks behave differently based on the number of inputs and hidden nodes in their structure. As described in Reference [47], the number of hidden-layer nodes can be set to a value that ranges between the number of inputs and the number of outputs. Therefore, 41 and 42 simulations to train the BPNN were carried out for the KDD99 and UNSW-NB15 datasets, respectively. For example, to report the results of this experiment for analyzing the features' ability to classify the attack class in the KDD99, the total number of simulations is equal to (number of features × 2) = (41 × 2) = 82. Figures 11 and 12 illustrate the AACC of all the features in the KDD99 and UNSW-NB15, respectively.
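The per-feature simulations can be sketched with a minimal one-hidden-layer back-propagation network, training each feature in isolation while sweeping the hidden-node count. The sketch below uses synthetic single-feature data, and its hyperparameters (learning rate, epochs, initialization) are illustrative assumptions, not the study's settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bpnn(X, y, n_hidden, epochs=3000, lr=0.5, seed=0):
    """Minimal one-hidden-layer back-propagation network (sigmoid units),
    trained with full-batch gradient descent on a binary target."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(0, 1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 1, (n_hidden, 1));    b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)                 # forward pass
        out = sigmoid(h @ W2 + b2)
        d_out = (out - y) * out * (1 - out)      # backward pass
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(0)
        W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(0)
    return lambda X_: (sigmoid(sigmoid(X_ @ W1 + b1) @ W2 + b2) > 0.5).ravel()

# One feature at a time, over a range of hidden-node counts, as in the
# experiment; the data here is synthetic, not KDD99/UNSW-NB15.
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = (X.ravel() > 0.5).astype(float)
for n_hidden in (1, 4, 8):
    predict = train_bpnn(X, y, n_hidden)
    acc = (predict(X) == y.astype(bool)).mean()
    print(n_hidden, acc)
```

In the actual experiment, each of the 41 (or 42) features would be fed in this way and the resulting accuracies aggregated into the per-node-count AACC curves of Figures 11 and 12.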


It can be seen in Figure 11 that the AACC for the KDD99 features was higher when the number of hidden nodes ranged between 4 and 25; beyond that range, it almost plateaued.
In contrast, the AACC for the UNSW-NB15's features, as shown in Figure 12, illustrated an improvement when the number of nodes exceeds 7. The best ACC for each feature in the KDD99 and UNSW-NB15 is shown in Figures 13 and 14, respectively. Figure 13 shows that f 1-23 had the highest accuracy, while f 1-2 , f 1-4 , f 1-24 , f 1-25 , f 1-26 , f 1-29 , f 1-38 , and f 1-39 showed a noticeable difference when compared to the other features. f 1-23 in isolation resulted in a best ACC of 98.32%; then, f 1-2 and f 1-24 come in second, with a best ACC of 84.92% and 85.33%, respectively. As for the features in the UNSW-NB15, as shown in Figure 14, the best ACC was reported using f 2-16 and f 2-42 , with an ACC of 52.32% and 52.47%, respectively. The AACC of the features in the UNSW-NB15 was reported at 50.11%. These results indicate that the BPNN was able to train with a higher accuracy when using the features in the KDD99 than those in the UNSW-NB15.
Symmetry 2020, 12, x FOR PEER REVIEW 18 of 34
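If the per-feature, per-node-count accuracies are stored in a table, the best ACC per feature (Figures 13 and 14) and the AACC (interpreted here as the average ACC across features, consistent with the reported 50.11% figure) can be derived as follows. The numbers below are toy values, not the study's results:

```python
# acc[f][h] = ACC of feature f trained alone with h hidden nodes (toy values)
acc = {
    "f1": {4: 0.80, 8: 0.84, 16: 0.85},
    "f2": {4: 0.52, 8: 0.55, 16: 0.54},
}

# Best ACC per feature, as plotted in Figures 13 and 14.
best = {f: max(by_nodes.values()) for f, by_nodes in acc.items()}

# AACC over all features for each hidden-node count, as in Figures 11 and 12.
aacc = {h: sum(by_nodes[h] for by_nodes in acc.values()) / len(acc)
        for h in (4, 8, 16)}

print(best)
print(aacc)
```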

Figure 14. ACC of each feature in the UNSW-NB15, using the best value from all the hidden layer nodes simulations.

The Most Frequently Selected Features Using the D-CFA
In this work, the D-CFA was used for feature selection over multiple runs, to pick different subsets of features. These features are picked based on the highest classification accuracy achieved from a BPNN training. The parameters that were involved in the training of the BPNN are provided in Table 7. To find the most relevant features in the KDD99 and UNSW-NB15 datasets, two measurement approaches were considered. First, the D-CFA was applied to find the most relevant features for each attack in both datasets. Second, the D-CFA was simulated twenty times for each dataset, to find the most frequently picked features over those runs. Since f 1-20 and f 1-21 contain a value of zero in all the records, they were excluded from both measurement approaches. As for the D-CFA's parameters, MaxIter and Dpop were set to a value of 10.
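The selection loop can be viewed as a population-based wrapper, where each candidate is a binary feature mask and the fitness is the accuracy of a classifier trained on the masked features. The code below is a heavily simplified stand-in for the D-CFA (a single-bit-flip search rather than the actual cuttlefish reflection and visibility operators), with MaxIter and Dpop both set to 10 as in the study; the fitness function here is a toy surrogate, not a BPNN:

```python
import random

def wrapper_select(n_features, fitness, max_iter=10, d_pop=10, seed=42):
    """Population-based wrapper feature selection (simplified stand-in for
    the D-CFA): keep the binary mask with the best fitness found across
    max_iter generations of d_pop single-bit-flip candidates."""
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(n_features)]
    best_fit = fitness(best)
    for _ in range(max_iter):
        for _ in range(d_pop):
            cand = best[:]
            cand[rng.randrange(n_features)] ^= 1  # flip one bit (local move)
            f = fitness(cand)
            if f > best_fit:
                best, best_fit = cand, f
    return best, best_fit

# Toy fitness: features 0 and 2 are "relevant"; selecting noise features
# is slightly penalized, mimicking accuracy-driven selection.
relevant = {0, 2}
def toy_fitness(mask):
    return sum(1.0 for i, m in enumerate(mask) if m and i in relevant) \
         - 0.1 * sum(m for i, m in enumerate(mask) if i not in relevant)

mask, fit = wrapper_select(6, toy_fitness)
print(mask, fit)
```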
Since the classes are not balanced in both datasets (see Figures 1 and 3) and the first measurement approach examines the relevancy of the features to each attack, the number of records was modified. The modification was done manually, where the datasets were split into multiple subsets. Each of these subsets includes one attack and an equal number of records from the normal class. This was done due to the lack of records in the training set for specific classes, such as the R2L and U2R in the KDD99. For example, the subset that was used to select features for the Probe attack in the KDD99 contains 8214 records, of which 4107 records belong to the attack class, and the rest are for the normal class. After simulating the experiment for the first measurement approach, the results are given in Table 8. The number of nodes in the hidden layer was considered, and multiple runs were carried out, to find a proper number of nodes to achieve the highest ACC possible.

Table 8 reports the selected features, the number of hidden-layer nodes (labeled no. of nodes), and the ACC for each attack in both datasets. Even though the KDD99 has fewer features, its first attack class (DoS) had the highest number of selected features, and the ACC of detecting that attack was also the highest (99.40%). There were only twelve selected features for the DoS attack class in the UNSW-NB15, whereas the ACC was reported at 86.57%. It can be observed from Table 8 that the number of selected features for the KDD99 attack classes is higher than that in the UNSW-NB15: an average of 25.2 features were selected for the attacks in the KDD99, whereas an average of 19.1 features were selected for the attacks in the UNSW-NB15. It is essential to note that the lowest number of selected features was for the Generic attack in the UNSW-NB15, which was ten features.
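The per-attack subset construction described above can be sketched as follows; `balanced_subset` is a hypothetical helper, and the toy records stand in for the real dataset rows:

```python
import random

def balanced_subset(records, labels, attack, normal="normal", seed=1):
    """Build a one-attack-vs-normal subset with equal class sizes, as done
    before the per-attack feature selection (e.g., 4107 Probe records plus
    4107 normal records for the KDD99 Probe subset)."""
    rng = random.Random(seed)
    atk = [r for r, y in zip(records, labels) if y == attack]
    nor = [r for r, y in zip(records, labels) if y == normal]
    n = min(len(atk), len(nor))          # equal number from each class
    subset = [(r, attack) for r in rng.sample(atk, n)] + \
             [(r, normal) for r in rng.sample(nor, n)]
    rng.shuffle(subset)
    return subset

# Toy data: 3 "probe" records and 7 "normal" records.
recs = list(range(10))
labs = ["probe"] * 3 + ["normal"] * 7
sub = balanced_subset(recs, labs, "probe")
print(len(sub))  # 6: three probe records and three normal records
```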
The second-lowest number of selected features was for the Fuzzers, Reconnaissance, and Shellcode attacks, each with 18 features, obtaining an ACC of 90.40%, 89.85%, and 90.75%, respectively. Furthermore, the results in Table 8 also indicate that the KDD99 offers a 2.97% higher AACC than the UNSW-NB15.
The experiment for the second measurement approach was conducted by using the full training sets of the KDD99 and UNSW-NB15. The selected features from this experiment were evaluated by using the BPNN with the parameters given in Table 7. As for the structure of the neural network, only one hidden node was used, to keep its implementation simple. The fitness of the updated solutions from the D-CFA is based on the ACC after each evaluation; the D-CFA aims to increase the ACC regardless of the number of selected features. After twenty simulations for each dataset, the results are given in Tables 9 and 10. Based on the output of these runs, the frequency of each feature being selected was measured. Table 9 gives the selection frequency of each feature in the KDD99, as well as its rank when compared to the others; the ranks were calculated based on the number of times a feature was selected. Furthermore, the resulting ACC from training the BPNN is also provided in Table 9. It can be observed that f 1-23 had the best rank, being selected nineteen times; f 1-29 was selected sixteen times and had the second rank. These two features belong to the time group (see Table 4). As for the third rank, f 1-1 had fourteen selections during the twenty runs; f 1-1 belongs to the basic group (see Table 4). In terms of ACC, run numbers nine, nineteen, and twenty resulted in the highest ACC.
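Tallying the selection frequency across runs, as done for Tables 9 and 10, can be sketched as below; the run outputs are toy values:

```python
from collections import Counter

def selection_frequency(runs):
    """Count how often each feature index appears across the selected
    subsets of multiple feature-selection runs, ranked most-frequent
    first (as in Tables 9 and 10)."""
    freq = Counter(f for selected in runs for f in selected)
    return freq.most_common()

# Toy output of four runs; feature 23 is picked every time.
runs = [{1, 23, 29}, {23, 29, 11}, {23, 1}, {23, 29}]
for feature, count in selection_frequency(runs):
    print(feature, count)
```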
The frequency of each feature being selected in the UNSW-NB15 dataset is given in Table 10. It can be observed from Table 10 that f 2-10 had the best rank, as it was selected in every run. In the second rank, f 2-29 was selected thirteen times out of all the runs. As for the third rank, f 2-11 had a selection frequency of twelve times. f 2-20 , which represents the window advertisement value for the TCP connection of the source, was not selected in any of the runs. Besides, the base sequence number for the TCP connection of the source (f 2-21 ) was selected only three times. These two features had the lowest ranks when compared to the other features. It is important to stress that the highest ACC was obtained by run numbers nine, twelve, and seventeen. The commonly selected features between these three runs are f 2-10 , f 2-11 , and f 2-29 , which are the top-three-ranked features from all the runs. These features belong to the basic and content groups (see Table 5).
The following can be concluded from the analysis done in this study:

• The KDD99 dataset contains more duplicated records than the UNSW-NB15 dataset.

• The UNSW-NB15's testing set does not contain any duplication, whereas its training set does.

• Both datasets have imbalanced classes, and their normal-to-attack class ratio is not balanced.

• In terms of the normal-to-attack class ratio, the UNSW-NB15 dataset is slightly more balanced.

• There are five features in common between the datasets (see Table 6).

• There is a feature in the UNSW-NB15 dataset (f 2-9 ) that is not described by the original creators of the UNSW-NB15 [9,27].

• The KDD99 dataset has 22 features that share similar characteristics with those in the UNSW-NB15 dataset.

• The features in the KDD99 dataset are able to train a classifier with a higher ACC than those in the UNSW-NB15 dataset.

• The features in the UNSW-NB15 dataset show a higher depRatio and ADR than those in the KDD99 dataset.

• For training a neural network when using either of the analyzed datasets, it is suggested to use a minimum of three nodes in the hidden layer, to increase the performance of the training.

• On average, more features were selected from the KDD99 than from the UNSW-NB15 during the feature selection process for the classification task.

• It is suggested to employ f 1-1 , f 1-23 , and f 1-29 in the KDD99 and f 2-10 , f 2-11 , and f 2-29 in the UNSW-NB15 for any classification task, as they showed their involvement in achieving a high ACC.

• f 2-20 in the UNSW-NB15 was not selected during the feature selection process, indicating the irrelevance of the feature for the classification task.

• The most selected features from the KDD99 belong to the basic and time groups (see Table 4), whereas the most selected features from the UNSW-NB15 belong to the basic and content groups (see Table 5). The basic group in both datasets contains four common features, out of nine in the KDD99 and fourteen in the UNSW-NB15.

• There are many similarities between the features in the KDD99 and UNSW-NB15. These similarities indicate that the KDD99 is still relevant for the IDS domain, even though it is over twenty years old.

• Many of the features in both datasets are extracted from the headers of the packets. This extraction can be a simple task, given the available tools, such as TShark [48]. TShark can be used to select specific fields from the headers of the packets; then, the required features can be extracted. This process can be utilized in the development of emerging technologies, such as the IoT and real-time systems.
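As a sketch of the last point: assuming TShark is invoked with its field-extraction options (e.g., `tshark -r capture.pcap -T fields -e ip.src -e ip.dst -e tcp.window_size_value -E separator=,`), the resulting text can be parsed into per-packet records from which features are derived. The rows below are made-up illustrations, not real capture data:

```python
import csv
import io

# Made-up sample of TShark field output (comma-separated, one packet per
# line), matching the -e fields listed in the command above.
raw = """10.0.0.2,10.0.0.9,255
10.0.0.9,10.0.0.2,0
"""

fields = ["ip.src", "ip.dst", "tcp.window_size_value"]
records = [dict(zip(fields, row)) for row in csv.reader(io.StringIO(raw))]
for rec in records:
    print(rec["ip.src"], "->", rec["ip.dst"], "win", rec["tcp.window_size_value"])
```

From such records, header-derived features (e.g., a TCP window advertisement analogous to f 2-23 ) can be computed without full payload inspection, which keeps the extraction lightweight for IoT and real-time settings.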