IntruDTree: A Machine Learning-Based Cyber Security Intrusion Detection Model

: Cyber security has recently received enormous attention in today’s security concerns, due to the popularity of the Internet-of-Things (IoT), the tremendous growth of computer networks, and the huge number of relevant applications. Thus, detecting various cyber-attacks or anomalies in a network and building an effective intrusion detection system that performs an essential role in today’s security is becoming more important. Artiﬁcial intelligence, particularly machine learning techniques, can be used for building such a data-driven intelligent intrusion detection system. In order to achieve this goal, in this paper, we present an Intrusion Detection Tree (“IntruDTree”) machine-learning-based security model that ﬁrst takes into account the ranking of security features according to their importance and then build a tree-based generalized intrusion detection model based on the selected important features. This model is not only effective in terms of prediction accuracy for unseen test cases but also minimizes the computational complexity of the model by reducing the feature dimensions. Finally, the effectiveness of our IntruDTree model was examined by conducting experiments on cybersecurity datasets and computing the precision, recall, fscore, accuracy, and ROC values to evaluate. We also compare the outcome results of IntruDTree model with several traditional popular machine learning methods such as the naive Bayes classiﬁer, logistic regression, support vector machines, and k-nearest neighbor, to analyze the effectiveness of the resulting security model.


Introduction
In recent days, the demand for cyber security and protection against various types of cyber-attacks has been ever increasing. The main reason is the popularity of Internet-of-Things (IoT), the tremendous growth of computer networks, and the huge number of relevant applications that are used by individuals or groups for the purpose of either personal or commercial use. Cyber attacks such as the denial-of-service (DoS) attack [1], computer malware [2], or unauthorized access [1] led to irreparable damage and financial losses in large-scale networks. For instance, according to [3], in May 2017, one ransomware virus brought huge losses to many organizations and industries, including finance, medical care, energy, and universities as well, causing a loss of 8 billion dollars. Other statistics have estimated that on average, a data breach costs an affected organization 3.9 million USD, and 8.19 In our approach, we first took into account the ranking of security features according to their importance in modeling. We then build a tree-based generalized intrusion detection model based on the selected important features. Once the complete tree has been built by utilizing the training security data, the test data is used to validate the model. This approach is not only effective in terms of prediction accuracy for unseen test cases by minimizing the over-fitting in modeling but also minimizes the computational complexity of the model by reducing the feature dimensions while building the resultant model.
The contributions of this work can be summarized as follows.
• We first highlight the importance of security features for high dimensions in a machine learning-based intrusion detection model.

•
We then present an intrusion detection tree "IntruDTree" machine-learning-based security model that first takes into account the ranking of security features according to their importance and then build a tree-based generalized model based on the selected important features. • Finally, we conduct experiments to evaluate the effectiveness of our intrusion detection model IntruDTree.
The experimental results show that our IntruDTree model significantly outperforms previous ones for detecting cyber intrusions in various unseen test cases.
The rest of the paper is organized as follows. Section 2 provides the background and related work of intrusion detection techniques. In Section 3, we present our intrusion detection tree IntruDTree machine-learning-based security model by taking into account feature importance. We evaluate the resultant security model and report the experimental results by conducting experiments on cybersecurity dataset in Section 4. Finally, Section 5 concludes this paper and highlights the future work.

Background and Related Work
An intrusion detection system is typically used to identify malicious cyber-attack behavior on a network while monitoring and evaluating the daily activities in a network or computer system to detect security risks or threats [9,11]. A number of research has been conducted in the area of cybersecurity with the capability of detecting and preventing cyber attacks or intrusions. Signature-based network intrusion detection is one of the well known systems used in the cyber industry [14]. This system takes into account a known signature and has seen widespread adoption including commercial success in recent time. On the other hand, the anomaly-based approach has advantage over the signature-based approach for detecting unseen attacks, including the ability to identify unknown or zero-day attacks [10,18]. This approach monitors network traffic and finds attack behavioral patterns by analyzing the relevant security data. Various data mining and machine learning techniques are used to analyze such security incident patterns for making useful decisions [5,16,19]. The main drawback of the anomaly-based approach is that it may produce high false alarm rates as it may categorize the previously unseen system behaviors as anomalies [10]. Thus, limiting the false positive rates of an intrusion detection system must be a top priority [13]. Therefore, a machine learning-based effective detection approach is needed to minimize these issues.
The area of machine learning is typically considered as a branch of artificial intelligence, which is closely related to computational statistics, data mining and data science, and mainly focuses on making computers to learn from data [19,20]. It has strong ties to mathematical theories, methods, statistical analysis, optimization, and various real-world application domains in the field. Thus, in the area of cybersecurity, machine learning is a type of data-driven method in which understanding the raw security data is the first step to build an intelligent security model for making predictions about future incidents. Although, the association analysis is popular in machine learning techniques to build rule-based intelligent systems [21][22][23], in this work, we primarily focus on classification learning techniques [16,24] that are typically used to build a predictive model by utilizing a given training dataset. For instance, to build a data-driven predictive model, several popular techniques such as the probability-based naive Bayes classifier, hyperplane-based support vector machines, instance-learning-based k-nearest neighbors, the sigmoid function based logistic regression technique, as well as rule-based classification like decision trees have been used [16,19].
In the domain of cybersecurity, particularly for detecting intrusions or cyber-attacks, a number of researchers used the machine-learning classification techniques mentioned above. For instance, Li et al. [25] presented an approach to classify the predefined attack categories such as DoS, Probe or Scan, U2R, R2L, as well as normal traffic utilizing the most popular KDD'99 cup dataset by using the hyperplane-based support vector machine classifier with a RBF kernel. In [26], Amiri et al. used a least-squared support vector machine classifier to train the model utilizing large data sets to create a faster system. In [27], Hu et al. used a variation of the support vector machine classifier in their study in order to classify the anomalies. Wagner et al. [28] employed a one-class support vector machine classifier in their study for the purpose of anomaly detection and different types of attacks such as NetBIOS scans, DoS attacks, POP spams, and Secure Shell (SSH) scans. The authors in [29] used this support vector machine classifier for the purpose of detecting unknown computer worms' activity. Similarly, Kotpalliwar et al. [30], Saxena et al. [31], Pervez et al. [32], Li et al. [25], Shon et al. [33], and Kokila et al. [34] used a support vector machine classifier in their studies for the purpose of building intrusion detection systems.
In addition to the support vector machine classifier discussed above, several other classifiers are used to detect intrusions. For instance, Kruegel et al. [35] used a probability-based Bayesian network to classify the events that process TCP/IP packets. A denial-of-service (DoS) intrusion detector using the same Bayesian network has been described by Benferhat et al. in their study [36]. Panda et al. [37] have employed the probability-based naïve Bayes classifier for the purpose of analyzing the KDD'99 cup data set, in which the data consist of four categories of attacks-Probe or Scan, DoS, U2R, and R2L. Koc et al. [38] used the same naive Bayes classifier in order to build a multi-class intrusion detection system. An instance-based learning algorithm, the KNN, is another popular machine learning method in which the classification of a point is determined by the k-nearest neighbors of that data point. Shapoorifard et al. [39], Vishwakarma et al. [40], and Sharifi et al. [41] used KNN classification technique in their studies for the purpose of intrusion detection systems. Several research studies [42,43] have been done using the logistic regression model as well for identifying malicious traffic and intrusions. In addition to these approaches, authors in [44] considered neural classifier, and those in [45] considered wavelet transform for anomaly detection, particularly DoS attacks.
Among the machine learning techniques, a tree-based method known as decision tree is considered as one of the most popular machine learning classification methods for the purpose of building predictive models. The best known methods for automatically building decision trees are the ID3 [46] and C4.5 [47] algorithms. Recently, a behavioral decision tree algorithm called BehavDT for analyzing behavioral patterns was proposed by Sarker et al. [48]. A significant amount of research in the domain of cybersecurity, such as Ingre et al. [49], Malik et al. [50], Relan et al. [51], Rai et al. [52], Puthran et al. [53], Moon et al. [54], Balogun et al. [55], and Sangkatsanee et al. [56] used the decision tree classification approach in their studies for the purpose of building intrusion detection systems. However, with the high dimensions of security features, a decision tree model may cause several issues such as high variance with over-fitting, high computational cost and time, and low prediction accuracy.
Unlike the above approaches, in this work, we present a machine learning-based security model "IntruDTree" that first takes into account the ranking of security features according to their importance and then build a generalized tree for detecting intrusions based on the selected important features in order to resolve the issues highlighted above.

Materials and Methods
In this section, we present our machine learning-based cyber security intrusion detection model IntruDTree. This comprised of several processing steps: exploring the security dataset, preparing raw data, determining feature importance and ranking, and building the resultant tree-like model. In the following section, we briefly discuss these steps one by one in order to achieve our goal.

Exploring Security Dataset
Security datasets typically represent a collection of information records consisting of several security features and related facts that can be used for the purpose of building a data-driven cybersecurity intrusion detection model [15]. Thus, it is important to understand the nature of raw cybersecurity data and the patterns of security incidents in order to detect malicious behavior or anomalies. In this work, we used an intrusion dataset consisting of two categories of class variables-normal and anomaly-publicly available in Kaggle [57], the world's largest machine learning and data science community. The dataset consists of a total of 41 features, among them, 3 features are qualitative, the protocol_type, f lag, and the other 38 features are quantitative, including duration, logged_in, srv_serror_rate, dst_host_count, dst_bytes. Table 1 shows all the security features, including their value type. The dataset contains more than twenty five thousand instances that were collected from a wide variety of intrusions simulated in a military network environment. To collect these raw data, including TCP/IP dump data for a network, an environment was created by simulating a typical US Air Force local area network. The network was created like a real cyber environment and blasted with multiple cyber-attacks that are known as anomalies [57]. All the security features mentioned in Table 1 are not identical in terms of data distribution, and vary from feature to feature. For instance, Figure 1 and Figure 2 show the data distribution for two different features, duration and dst_bytes, respectively. In order to build our machine learning-based intrusion detection model, we first prepared the raw dataset with the feature values mentioned above. Effectively processing and ranking these security features according to the requirements, building a target machine-learning based detection model, and eventually, data-driven pattern-based decision making, could play a role in providing intelligent cybersecurity services, particularly detecting anomalies or intrusions.

Preparing Raw Security Data
Data preparation comprises both feature encoding and scaling according to the characteristics of the given intrusion dataset.

•
Feature encoding: As mentioned earlier, the dataset contains both the numeric and nominal values of the given security features. Although most of the features are numerically valued, several are nominally valued, such as protocol_type, service, f lag, shown in Table 1, and the class value [anomaly, normal] as well. Thus, it is needed to convert all the nominal valued features into vectors in order to fit these data to the target machine learning-based intrusion detection model. Although, "One Hot Encoding" is a popular approach, we used "Label Encoding" in this work. The reason is that, in one hot encoding technique, a significant number of feature dimensions increase. On the other hand, the label-encoding approach directly converts the feature values into particular numeric values. Let us consider an example in terms of the feature protocol_type. Label encoding can turn the values [tcp, udp, icmp, udp, icmp] into vectors [0, 1, 2, 1, 2]. • Feature scaling: In data pre-processing, feature scaling is also known as data normalization. The values of the security features are in different ranges that vary from feature to feature. For instance, Figure 1 and Figure 2 show the data distributions of two different features, duration and dst_bytes, respectively. For some data points, the value is very low while for some data points, it is much higher, as shown in Figures 1 and 2. Thus, data scaling method is used to normalize the range of the feature values, known as the independent variables as well. In order to do this, we used a Standard Scaler that normalizes the security features with the mean value = 0 and standard deviation = 1. The normalized values are then ready for further analysis in order to build the security model.

Determining Feature Importance and Ranking
In our approach, once the data exploration and preparation have been done, we calculate the importance score of each security feature discussed above and rank them according to their importance score, in order to select a set of significant features for further processing. Generally, feature importance provides a score that indicates how useful or valuable each feature in a cybersecurity dataset, to build a tree-based intrusion detection model. Feature importance is estimated as the decrease in node impurity Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 27 April 2020 doi:10.20944/preprints202004.0481.v1 weighted by the probability of reaching that node. The node probability is simply calculated by the number of instances that reach a particular node divided by the total number of instances. The higher the value, the more important the feature. The value varies from 0 to 1. A value of 0 means that the output of the model does not depend on the feature at all and 1 means that the output of the model is directly associated with the feature. Thus, the importance of each feature is derived from how "pure" it is. In statistics and data mining, "Gini Index" is a well known measure of the impurity of a node that typically measures how often a random element would be wrongly detected [19]. Thus, it is the probability of wrongly classifying a randomly chosen element in a security dataset according to the class distribution in the dataset. For a binary split, the Gini Index [19] of a node n can be expressed as - where p i represents the probability of an element being classified to a particular security class and p l and p r are the proportions of examples in node n p that are assigned to child nodes n l and n r , respectively. Therefore, how much each feature decreases the impurity is then computed using the Gini impurity formulas defined above. The more a feature decreases the impurity, the more important the feature is. We then rank the security features according to their calculated importance score. Thus, after ranking the features, our approach is capable of selecting the top n security features according to their importance score values for the purpose of building an effective tree-based security model by taking into account the selected n features. This thus enables us to identify correlations and patterns in a security data set to transform them into significantly lower dimension datasets without loss of any significant information for modeling.

Designing Intrusion Detection Tree
Once the security features are ready to process, we design a tree-like model in order to make decisions in a data-driven intelligent intrusion detection system. In our model, we take into account the selected security features that are determined according to their importance score and ranking, rather than using all the features available in the given dataset. To design a tree-like model, we start from a root node. It breaks down a given training dataset into smaller subsets and an associated branch of the tree is incrementally created. To identify the attribute for the root node in each level, the "Gini Index" [19] defined in the earlier section is used. An attribute with a lower Gini index is selected to incrementally develop the tree. Thus, we expand the tree by creating the required number of branches consisting of internal and leaf nodes and their connecting edges or arcs. Internal nodes are labelled with the security features selected earlier and each leaf node of the tree is labelled with a target class, either anomaly or normal in this work. The arcs of the tree coming from a node are labeled with each of the possible values of the relevant feature. The final result is a multi-level tree with a number of terminal or leaf nodes that includes anomaly or normal, as shown in Figure 3. Overall, our intrusion detection model IntruDTree takes into account mainly two properties, (i) reducing the feature dimensions by determining the feature importance and ranking, and (ii) to construct a multi-level tree by taking into account the selected important features. Figure 3 shows an example of an IntruDtree considering several features like f lag, service, duration, logged_in and their sample values in a given intrusion dataset.
The overall process for building an IntruDTree is set out in Algorithm 1. Given a training intrusion dataset, DS = {X 1 , X 2 , ..., X m }, where m represents the data size. Each instance is represented by n-dimensional features. The training data also belong to diverse cyber-attack classes CA = {normal, anomaly}. The outcome is an IntruDTree, which is a rule-based classification tree associated with DS. For instance, according to value is RSTR, then the outcome is anomaly". Similarly, another rule with multiple features could be "if flag value is SF, service is f tb, and duration <= 4, then the outcome is anomaly". Thus, a number of security rules can be extracted by traversing the generated IntruDTree, that can be used to detect whether the test case is normal or an anomaly. return N as a leaf node labeled with the class CA. 10 end 11 if imp_feature_list is empty then 12 return N as a leaf node labeled with the majority class in DS; // majority voting 13 end 14 identify the highest precedence feature F split for splitting and assign F split to the node N. 15 foreach feature value val ∈ F split do 16 create subset DS sub of DS containing val.

Experimental Results and Discussion
In this section, we briefly analyze and report the experimental results of this work. For this, we first setup our experiments to evaluate our IntruDTree cyber security model, and then discuss the experimental results in various dimensions related to our analysis. Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 27 April 2020 doi:10.20944/preprints202004.0481.v1

Experimental Setup
In order to evaluate our IntruDTree cyber security model, we aim to answer the following three questions: • Question 1: Does the feature importance score and corresponding ranking strategy in IntruDTree model simplify the security dataset by reducing the negligible features, and help to build a generalized data-driven security model? • Question 2: Is the IntruDTree machine-learning-based security model able to effectively detect cyber intrusions and to provide significant outcome results for unseen test cases? • Question 3: How effective is our IntruDTree model compared to traditional machine learning classification-based methods?
In answering these questions, we have conducted experiments on a security dataset consisting of the anomalies and normal classes discussed in an earlier section. We implemented all the methods in Python programming language, in which we used Scikit-learn, the most popular machine learning library for predictive data analysis, and executed them on a Windows PC. In the following subsections, we first define the evaluation metrics that are taken into account to evaluate our security model and then discuss the experimental results that answer the above questions identified for this experimental analysis.

Evaluation Metric
In order to measure the effectiveness of our IntruDTree cyber security model, we compute the outcome results in terms of precision, recall, fscore, ROC value, and overall accuracy, defined as below [19] - where, TP denotes true positives, FP denotes false positives, TN denotes true negative, and FN denotes false negatives in these above formal definitions of precision, recall, fscore, and overall accuracy. In addition to these evaluation metrics, we also take into account the ROC (Receiver Operating Characteristic) curve that is created by plotting the true positive rate (TPR) against the false positive rate (FPR) of the resultant security model.

Effect of Feature Importance Score and Ranking
In order to answer the first question highlighted above, we show the importance score of each feature in the dataset and the significance of this score in our IntruDTree model in this experiment. For this, Figure 4 shows the calculated importance score for various features utilizing the given security dataset. If we observe this figure, we see that the calculated score of all features is not identical in a given dataset and may vary from feature-to-feature according to their effect on the target class. According to Figure 4, the feature src_bytes has the highest score of 0.258 (around 25%), whereas another feature is_host_login has a score closer to the value 0 for this dataset. In order to show the comparative significance of each feature, Figure 4 shows their importance score values in a descending order, where the values are arranged from the largest to the smallest number. The higher the value, the more important the feature in our IntruDTree model. Thus, based on the score values, we can conclude that all the features in a given security dataset might not be similar significant to build a Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 27 April 2020 doi:10.20944/preprints202004.0481.v1 data-driven security model and a number of features comparatively with very low score values, less than a threshold t, can be reduced to simplify the dataset. In this experiment, we have selected all the features to build the model that satisfy the threshold value t = 0.02. As a result, we finally selected the top 14 features according to their importance score values shown in Table 2. Therefore, according to the results shown in Figure 4 and Table 2, we can conclude that the feature importance score and corresponding ranking strategies in our IntruDTree model is able to simplify the security dataset by reducing the negligible features. Consequently, it can help to build a generalized data-driven security model with a smaller number of security features.

Outcome Results of IntruDTree Model
In order to answer the second question highlighted above, in this experiment, we show the outcome results of our machine learning-based cyber security intrusion detection model IntruDTree.
To calculate the outcome results for unseen test cases, we first built the model using a subset of 80% data Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 27 April 2020 doi:10.20944/preprints202004.0481.v1 of the given dataset and used the remaing 20% data for testing the model. The results are calculated by generating a confusion matrix that reports the number of false positives, false negatives, true positives, and true negatives. Based on these values, Table 3 reports the prediction results in terms of precision, recall, fscore, and accuracy for each individual class, either normal or anomaly, utilizing the given dataset, in order to show the outcome results of our IntruDTree model. If we observe Table 3, we see that for each class, this model gives significant results of the precision, recall, fscore, and accuracy, that ensures the model effectiveness while predicting anomalies or intrusions. In addition, Figure 5 shows the outcome results as an ROC curve that represents the TPR (true positive rate) against the FPR (false positive rate) of our IntruDTree model. From Figure 5, we can see that the true positive rate, i.e., correctly classified, is high close to the maximum value 1, while the false positive rate, i.e., falsely classified, is low. Thus, from the overall experimental results shown in Table 3 and Figure 5, we can conclude that our machine learning-based IntruDTree cyber security intrusion detection model is able to effectively detect either the anomalies or normal class according to their occurring patterns in the security dataset and consequently gives a significant outcome for unseen test cases.

Effectiveness Comparison
In order to answer the third question, in this experiment, we compute and compare the effectiveness of our cyber security intrusion detection model IntruDTree, with the traditional popular machine learning-based methods. To show the effectiveness of various machine learning-based security models, we first select several popular baseline methods, such as NaiveBayes (NB), Logistic Regression (LR), K-Nearest Neighbor (KNN), and Support Vector Machines (SVM) for the purpose of comparison. For each model, we compute the outcome results utilizing the same security datasets, in order to compare the models fairly.
To compare the effectiveness mentioned above, Figure 6 shows the relative comparison of precision, recall, fscore and accuracy, utilizing the security datasets discussed in earlier section. For each machine learning-based model, we use the same training and testing set of data, where 80% data are used to train the security model and the rest 20% data are used for testing the model. If we observe Figure 6, we see that our IntruDTree model has a better outcome than the traditional models. The reason is that in our IntruDTree model, we first determined the important features and then built the intrusion detection tree by taking into account the selected features. As a result, it minimizes the model variance and over-fitting issues, and generalizes the security model as well by considering the important features. Thus, the model is able not only to increase the prediction results for unseen test cases but also to minimize the computational complexity of the resultant data-driven security model. Therefore, according to the results shown in Figure 6 and the above experimental analysis, we can conclude that our IntruDTree model is more effective than the traditional machine learning classification based methods while analyzing the cybersecurity data.

Conclusions
In this paper, we have presented an intrusion detection tree IntruDTree machine-learning-based security model. In our approach, we have first taken into account the ranking of security features according to their importance and then built a tree-based generalized intrusion detection model based on the selected important features. We have done this to make the security model effective in terms of prediction accuracy for unseen test cases, and efficient by reducing the computational cost with the processing of less number of features while generating the resultant tree-like model. Finally, the effectiveness of our IntruDTree model was examined by conducting a range of experiments on cybersecurity datasets. We have also compared the outcome results of IntruDTree model with several traditional popular machine learning methods to analyze the effectiveness of the resulting security model. Future work could assess the effectiveness of the IntruDTree model by collecting large datasets with more dimensions of security features in IoT security services, and measuring its effectiveness at the application level in the domain of cybersecurity.