XGB-RF: A Hybrid Machine Learning Approach for IoT Intrusion Detection

In the past few years, Internet of Things (IoT) devices have evolved rapidly, and their use is increasing steadily to make our daily activities easier than ever. However, numerous security flaws persist on IoT devices because the majority of them lack the memory and computing resources necessary for adequate security operations. As a result, IoT devices are affected by a variety of attacks. A single attack on network systems or devices can lead to significant damage to data security and privacy. Machine-learning techniques can be applied to detect IoT attacks. In this paper, a hybrid machine learning scheme called XGB-RF is proposed for detecting intrusion attacks. The proposed hybrid method was applied to the N-BaIoT dataset, which contains hazardous botnet attacks. Random forest (RF) was used for feature selection, and the eXtreme Gradient Boosting (XGB) classifier was used to detect different types of attacks on IoT environments. The performance of the proposed XGB-RF scheme is evaluated with several evaluation metrics, which show that the model successfully detects 99.94% of the attacks. Compared with state-of-the-art algorithms, our proposed model achieved better performance for every metric. As the proposed scheme is capable of detecting botnet attacks effectively, it can significantly contribute to reducing the security concerns associated with IoT systems.


Introduction
The Fourth Industrial Revolution will be fueled by cutting-edge and innovative technologies in which Internet of Things (IoT) devices will play integral roles [1]. Countries prepared to accept these changes will undoubtedly have a better chance of success in the near future. IoT is one of the fastest expanding fields in the history of technology, with around 50 billion devices in use by the end of 2020 [2]. As a result, the security threats to these devices are increasing on a large scale. Because IoT devices lack fundamental security protocols, they have become tempting targets for attackers. According to recent estimates, IoT devices are subjected to an average of 5200 attacks per month [3]. In the first half of 2019 alone, attacks against IoT devices tripled compared with the previous year [4]. According to Checkpoint's study, 71% of security experts have observed an increase in security risks in IoT networks since the spread of COVID-19 [5]. Recently, the Internet of Medical Things (IoMT) [6,7], Long Range (LoRa) IoT networks [8], blockchain-based supply chain systems [9,10], and smart industries [11] have also been targeted by a huge number of attackers. Many of these attacks are extremely hazardous for the devices. Mirai and BASHLITE, found in 2016 and 2015, respectively, are two common IoT botnets that infected numerous IoT devices over the years [12]. Among them, the Mirai Distributed Denial-of-Service (DDoS) attack was the third most frequent IoT threat in 2018. Moreover, a total of 15.4 million DDoS attacks are expected worldwide by 2023 [13]. With the rise in cybercrime, detecting these attacks has become a crucial field of research [14]. Because these issues are highly alarming, an effective framework is needed to assure the security of IoT devices. Regular security techniques alone would not be sufficient to detect botnet attacks and ensure the safety of these systems.
Many researchers have been working on developing intrusion detection systems in recent years, as attacks increasingly target IoT devices. They mainly focus on two types of botnet attacks in the IoT environment, namely host-based [15] and network-based [16]. Common machine learning approaches for intrusion detection include support vector machine (SVM), K-nearest neighbor (KNN), neural networks, and Naïve Bayes [17][18][19]. Shafiq et al. [20] presented a methodology for selecting an efficient machine learning algorithm for detecting malicious attacks on IoT devices. Based on their experimental analysis, they identified Naïve Bayes as the most efficient algorithm. Soe et al. [21] applied a feature selection approach to the N-BaIoT dataset to detect IoT attacks. They obtained nearly 99% accuracy using three distinct machine learning algorithms: Naïve Bayes, ANN, and the J48 decision tree. Diro et al. [22] introduced an intrusion detection method for IoT devices that uses deep learning as the primary detection tool and achieved significant improvement. Ahmad et al. [23] evaluated the NSL-KDD dataset for intrusion detection using random forest (RF), SVM, and extreme learning machine (ELM) algorithms. Their findings indicated that ELM outperformed SVM when dealing with large data. In another study, a network-based intrusion detection method proposed by Deng et al. [24] achieved a detection rate of 96.8% when utilizing K-means clustering after manually selecting 8 to 16 features. Mirsky et al. [25] proposed an autonomous unsupervised model that monitors security issues in network environments using an ensemble of autoencoders. This classification model, which operates on low computational resources, is primarily composed of two steps: offline training and online testing on IoT devices. In addition, Radford et al. [26] proposed an unsupervised learning approach in which Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNN) are used for network traffic anomaly detection on a public dataset (https://www.unb.ca/cic/datasets/ids-2017.html (accessed 30 September 2021)). However, no research has provided an in-depth analysis of machine learning's applicability to host-based intrusion detection for IoT environments [27].
From the literature, it is clear that machine learning-based techniques, which are widely used in the IoT security domain, have proven effective for developing workable models for detecting IoT threats. As most attacks in the IoT environment happen in real time, a fast attack detection algorithm is required that still maintains high accuracy. This is possible by using fewer features, which reduces the system complexity and consequently the execution time, and is therefore very helpful in real-time attack detection. To address this issue, we propose a machine learning-based approach for intrusion detection in IoT systems in which the random forest (RF) algorithm is used to select the most important features from the N-BaIoT dataset to boost detection accuracy. Then, the eXtreme Gradient Boosting (XGB) algorithm is utilized to characterize and detect malicious attacks. This combination is essential for IoT-based intrusion detection because we have to deal with a big dataset where both time and accuracy are crucial. When RFE and RFECV are used, attributes are evaluated recursively, one at a time, to check whether each attribute improves the performance. If an attribute does not improve the classification performance, it is discarded. This process continues until all the attributes are checked. RF, on the other hand, ranks the features according to information gain rather than checking the importance of each feature one by one. After that, the proposed system is compared with other state-of-the-art algorithms applied to the same N-BaIoT dataset, where our system achieves much better performance in every case. The key contributions of this paper are as follows:

• A hybrid machine learning algorithm named XGB-RF is proposed for the first time, in which the prominent attributes are selected using the RF algorithm and then classified using the XGB algorithm.

• The proposed algorithm has also been compared with other state-of-the-art machine learning algorithms.

Materials and Methods
The hybrid method proposed for detecting multi-class attacks in the N-BaIoT dataset is called XGB-RF classification. First, the RF feature selection algorithm is used to select suitable features. Then, the XGB classification algorithm is used to identify each type of attack that occurs in the IoT network. For this reason, the proposed method is named XGB-RF. Using our proposed strategy, the dimensionality of the N-BaIoT dataset is reduced from 115 to 40 features that relate to the same window. The following diagram (Figure 1) represents our proposed system architecture.
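The two-stage idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (generated with `make_classification`), scikit-learn's `GradientBoostingClassifier` is used as a stand-in for the XGBoost classifier, the min-max scaling step is taken from the workflow diagram, and all hyper-parameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for N-BaIoT: 115 features, 3 classes (hypothetical data).
X, y = make_classification(n_samples=600, n_features=115, n_informative=12,
                           n_redundant=10, n_classes=3, class_sep=2.0,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Min-max scaling, fitted on the training split only to avoid leakage.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Stage 1: rank features by random forest importance and keep the top 40.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train_s, y_train)
top_k = 40
selected = np.argsort(rf.feature_importances_)[::-1][:top_k]

# Stage 2: boosted-tree classifier on the reduced feature set
# (scikit-learn gradient boosting used here in place of XGBoost).
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train_s[:, selected], y_train)
accuracy = clf.score(X_test_s[:, selected], y_test)
print(f"kept {len(selected)} of {X.shape[1]} features, accuracy = {accuracy:.3f}")
```

Swapping in `xgboost.XGBClassifier` for the second stage should only require changing the estimator, since it follows the same `fit`/`score` interface.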

Figure 1 stage labels recovered from the diagram: Min-Max Scaling; Feature Selection.

Data Acquisition
This research considers the N-BaIoT dataset [28] for detecting IoT botnet attacks; it is one of the most recent datasets, published in 2020 and publicly available (https://www.kaggle.com/mkashifn/nbaiot-dataset (accessed 5 June 2021)). The dataset comprises traffic from nine IoT sensors captured in a local network using Wireshark at the central switch. It has 115 statistically engineered attributes derived from the pcap files. Seven statistical measures (mean, variance, count, magnitude, radius, covariance, and correlation coefficient) were computed over five separate time windows. A time window is a predetermined timeframe during which data are extracted from the traffic; because it includes time windows, this dataset is suited for a stateful Intrusion Detection System (IDS). Three or more statistical measures were computed for each of four quantities extracted from the traffic (packet count, jitter, outbound packet size, and combined outbound and inbound packet size), yielding a total of twenty-three features.
A total of 229,829 samples were used; among them, 13,113 are benign samples and 216,716 are malicious attack samples. The malicious attacks consist of 27,188; 9502; 23,361; 15,148; 26,210; 21,205; 24,250; 21,995; 23,755; and 24,102 samples of Mirai.ack, Mirai.scan, Mirai.syn, Mirai.udp, Mirai.udpplain, Gafgyt.combo, Gafgyt.junk, Gafgyt.scan, Gafgyt.tcp, and Gafgyt.udp, respectively. In this study, a multi-class (i.e., 11 classes) dataset was used. A description of these classes is provided below:
While still a UDP flood, the default behavior of Mirai is to randomize the source port and the destination ports. When combined with multiple source IPs (coming from multiple bots), the result is a flood of UDP traffic that can be difficult to fingerprint on an upstream router or firewall because there is no common source IP, source port, or destination port.
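As a quick consistency check on the sample counts quoted above, the per-class figures can be summed and compared with the stated totals. The dictionary below simply restates the numbers from the text; the variable names are illustrative.

```python
# Per-class attack sample counts as quoted in the text.
attack_counts = {
    "Mirai.ack": 27_188,
    "Mirai.scan": 9_502,
    "Mirai.syn": 23_361,
    "Mirai.udp": 15_148,
    "Mirai.udpplain": 26_210,
    "Gafgyt.combo": 21_205,
    "Gafgyt.junk": 24_250,
    "Gafgyt.scan": 21_995,
    "Gafgyt.tcp": 23_755,
    "Gafgyt.udp": 24_102,
}
benign = 13_113

total_attacks = sum(attack_counts.values())
total = benign + total_attacks
n_classes = len(attack_counts) + 1  # ten attack classes plus benign

# Share of benign traffic, which quantifies the class imbalance.
benign_share = benign / total
print(total_attacks, total, n_classes, f"{benign_share:.1%}")
```

The sums reproduce the 216,716 attack samples and 229,829 total samples, and benign traffic makes up roughly 5.7% of the data.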

• Mirai UDP plain (Class_6): In contrast to UDP flooding, a UDP plain attack is far more "surgical" and effective. Its effectiveness comes from how the attacking bot "picks" ports: rather than flooding all of them, the bot targets the one that is most frequently used. Focusing the attack on a single target instead of going all-in boosts the chances of success.

Data Pre-Processing
The traffic in the N-BaIoT dataset is unbalanced, as the number of benign records is far smaller than the number of attack records. Moreover, the dataset reports the specific attack counts for each subclass. In the N-BaIoT dataset, there are 115 statistically engineered features retrieved from the pcap files. All these features are self-explanatory; a feature name can be generalized as Header_time-window_statistical variable. These features carry the following information:
- Header description
- Five separate time windows (100 ms; 500 ms; 1.5 s; 10 s; and 1 min)
- Seven statistical variables (mean; variance; count; magnitude; radius; covariance; and correlation coefficient)
The decay factor value is used throughout the dataset, and this research refers to the corresponding time windows as L5, L3, L1, and so on. Using a time window, which is a preset period over which information is retrieved from the traffic, makes this dataset appropriate for a stateful IDS. Four quantities were extracted from the pcap files: packet count, jitter, the size of outbound packets alone, and the combined size of outbound and inbound packets. Three or more statistical measures were computed for each of these four quantities, resulting in a total of 23 features. To obtain the 115 features in this dataset, these 23 distinct features were computed over the five different time windows. The 115 features presented in Table 1 have been used in this study. Note that some abbreviated terms used in Table 1 are described in Table 2.

Feature Selection
Three feature selection methods that are popular in IoT-based Intrusion Detection Systems (IDS) are used for comparison: Recursive Feature Elimination (RFE), Recursive Feature Elimination with Cross-Validation (RFECV), and Select-K-Best. In addition, RF-based feature selection has also been employed. A brief description of the algorithms is given below.

RF-Based Feature Selection
The relative significance of each independent variable in the RF model can also be used for feature selection. A detailed description is given in Section 2.4.1.

Recursive Feature Elimination with Cross-Validation (RFECV)
If there are too many features in the raw data, applying recursive feature elimination with cross-validation (RFECV) to remove some of the unnecessary features is also very effective [29]. The best subset of features is selected for the supplied estimator by omitting features through recursive feature elimination. The major difference between RFECV and RFE is that in RFECV the estimator is evaluated on hold-out fold data for each feature subset. As a result, the best feature subset can be identified by ranking the CV scores.
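The RFECV procedure described above can be illustrated with scikit-learn's `RFECV` class. The sketch below is a hedged example on synthetic data; the estimator settings, the elimination step, and the number of folds are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Small synthetic multi-class problem (hypothetical stand-in for the real data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=2, n_classes=3, random_state=0)

# Recursively drop the two least important features per round and score
# every candidate subset on hold-out folds.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=0),
    step=2,
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
    min_features_to_select=2,
)
selector.fit(X, y)

X_reduced = selector.transform(X)
print("features kept:", selector.n_features_)
print("selection mask:", selector.support_)
```

Plain RFE corresponds to the same elimination loop without the cross-validated scoring, so the number of features to keep has to be fixed in advance.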

Select-K-Best
The Select-K-Best method keeps the k top-scoring features. The feature score is based on univariate statistical analysis, that is, a one-by-one assessment of the features. It can be applied to both classification and regression data. A score is generated for each feature, and the K features with the highest scores are then chosen.
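A minimal sketch of Select-K-Best using scikit-learn's `SelectKBest` is shown below. The text does not state which univariate score function was used, so the ANOVA F-test (`f_classif`) is an assumption here, as are the synthetic data and the value of k.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data for illustration only.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=2, n_classes=3, random_state=0)

# Score every feature independently, then keep the k highest-scoring ones.
k = 5
selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
X_top = selector.transform(X)
top_indices = selector.get_support(indices=True)

print("scores per feature:", selector.scores_.round(2))
print("indices of the k best features:", top_indices)
```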

Classifiers
This section describes the algorithms used in the proposed approach and the inner workings of these well-established machine learning models.

Random Forest (RF)
As the name suggests, the random forest (RF) algorithm is a collection of classification trees in which each tree casts one vote for the class it assigns to the input data [30]. The class with the majority of votes is then selected as the outcome. L. Breiman introduced the idea in 2001 on top of the Bagging (Bootstrap-Aggregating) model, and it is now known as random forest. RF is employed not only for classification and regression but also for feature selection (FS). Genuer et al. contributed a methodological model for ranking variable importance through an in-depth analysis of the variable importance index [31]. Their work examines the sensitivity of the RF variable importance score to sample size, number of variables, method parameters, and the presence of correlated variables. As a supervised learning algorithm, random forest builds a forest that is an ensemble of decision trees. It outperforms many machine learning algorithms in terms of accuracy, particularly for big datasets with numerous features. It is usually trained with the bagging method, in which a random selection (with replacement) is drawn from the training samples to construct each tree. This implies that new trees do not depend on earlier ones. The model aggregates the predictions of the individual decision trees to produce its output. In general, several measures are used to calculate feature importance, such as the Gini index, mean decrease accuracy, and the permutation importance method (which overcomes the imperfection of mean decrease impurity). Owing to the bagging model, RF reduces over-fitting and the variance among decision trees. Furthermore, handling missing data is a common capability of the RF architecture.
The RF classifier combines a number of classification trees. Based on the feature importance criteria, 40 notable features are selected. The classification result is calculated using Equation (1).
where T is the training set from the original dataset S and K indicates the subsets from the T dataset. For each subset, the algorithm automatically generates K decision trees with the help of a random vector. C(t) represents the classification result, where c_i(T) denotes the classification result of the i-th decision tree. Here, P is the target category. Several random forest hyper-parameters are used either to enhance the model's prediction capability or to speed up the algorithm. RF achieves high performance on high-dimensional data by performing an implicit FS process. Gini importance can be used within RF as a measurement criterion for feature importance. These relevance scores, obtained as a by-product of training, help identify the features that are significant to the classifier.
Equation (2) is used to measure the Gini impurity. Here, t denotes a node, which can be any node of the RF decision trees. Gini impurity, a computationally efficient approximation of entropy, is used to find the optimal split. Furthermore, f_j in Equation (3) is the fraction of the n_j samples out of the total number of samples n, and j = 0, 1 represents the class.
A decrease δi is achieved by splitting the samples and sending them to two separate sub-nodes (t_p and t_q) according to a threshold on variable Θ. Equation (4) reflects this procedure.
Next, an exhaustive search is performed over all values of Θ available at the node and over all thresholds. Afterward, considering all nodes t, the reductions in Gini impurity are accumulated for each variable separately with Equation (5). I_G reflects how often feature Θ is selected for a split and how significant it is within the classifier for the problem at hand.
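The Gini-based split search described above can be sketched in a few lines. The code assumes the standard textbook definitions, which may differ in notation from Equations (2)-(5): the impurity of a node is one minus the sum of squared class fractions, and the decrease for a split is the parent impurity minus the size-weighted impurities of the two sub-nodes. The function names and the toy data are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_j f_j^2, with f_j = n_j / n."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity decrease when a parent node is split into two sub-nodes."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))

def best_threshold(values, labels):
    """Exhaustive search over candidate thresholds on a single variable."""
    best_theta, best_gain = None, 0.0
    for theta in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [l for v, l in zip(values, labels) if v <= theta]
        right = [l for v, l in zip(values, labels) if v > theta]
        gain = gini_decrease(labels, left, right)
        if gain > best_gain:
            best_theta, best_gain = theta, gain
    return best_theta, best_gain

# Toy variable that separates the two classes perfectly at 0.3.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
theta, decrease = best_threshold(values, labels)
print(theta, decrease)
```

Summing such decreases over every node and tree in which a variable is used gives that variable's Gini importance, which is the quantity used to rank features.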
eXtreme Gradient Boosting (XGBoost)
After selecting features using RF, an eXtreme Gradient Boosting classifier is utilized in the proposed approach to detect botnet attacks based on the selected features. XGBoost, also known as XGB, is a promising tree-based ensemble learning classifier and is regarded as one of the most effective implementations of gradient-boosted decision trees. Gradient-boosted decision trees utilize a series of decision trees, each of which learns from the preceding tree and affects the subsequent tree; therefore, they improve the model and develop a powerful learner [32]. XGBoost combines weak classifiers to create a strong one [33]. As opposed to RF, where independent trees are generated and work individually, XGBoost incorporates feedback from previously accepted decision trees. Each iteration of gradient boosting optimizes the given loss function. The goal is to minimize the residual from the previous step, where the residual can be interpreted as the difference between the predicted value and the true value. Once the residual reaches a threshold, the final model is accepted for further use. If the number of decision trees reaches its preset limit before the residual drops below the threshold, training is stopped and the final model is selected. Parallel execution, faster execution than conventional gradient boosting, and support for regularization are a few salient features of the XGB model.
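The residual-fitting loop described above can be illustrated with a bare-bones gradient boosting sketch. This is plain gradient boosting with a squared loss on a toy regression task, not XGBoost itself (which additionally uses regularization and second-order information); the learning rate, tree depth, tree limit, and residual threshold are all assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy one-dimensional regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3   # shrinkage applied to each new tree
max_trees = 50        # stop when this many trees have been added ...
tolerance = 0.02      # ... or when the training MSE falls below this value

prediction = np.full_like(y, y.mean())  # initial constant model
mse_history = [float(np.mean((y - prediction) ** 2))]
trees = []

for _ in range(max_trees):
    residual = y - prediction  # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # feedback from earlier trees
    trees.append(tree)
    mse_history.append(float(np.mean((y - prediction) ** 2)))
    if mse_history[-1] < tolerance:
        break

print(f"trees used: {len(trees)}, "
      f"MSE {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each tree is trained on what the current ensemble still gets wrong, which is the dependence between trees that distinguishes boosting from the independently grown trees of a random forest.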
XGBoost's objective function [34] for evaluating the model's performance can be represented as Equation (6):
$\mathrm{obj}(\theta) = t(\theta) + r(\theta) \qquad (6)$
where θ represents the parameters, r denotes the regularization term, and t is the training loss. Algorithm 2 depicts the XGBoost model architecture.

Model Performance Analysis
The purpose of this paper is to construct classification models using a training dataset and to assess their performance on the dataset. To evaluate the performance of our proposed model, several performance evaluation metrics are used. We calculate accuracy (ACC), F1 score, Kappa index, Matthews correlation coefficient (MCC), sensitivity (SE), specificity (SP), threat score, and balanced accuracy score. Furthermore, six confusion matrices for the different classifiers are presented in Section 3.2.

Algorithm 2 (continued):
L(Y, Ω) [33], where L(Y, F(s)) represents a differentiable loss function and T is the total number of samples.
Fit a base tree h_m using the training set (S_i, pr_im), where i = 1, 2, ..., N
5 Compute the multiplier
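The evaluation metrics listed in this section can all be derived from confusion-matrix counts. The sketch below uses the standard binary definitions (for a multi-class problem they would be applied per class in a one-vs-rest fashion and then averaged); the function name and the example counts are hypothetical.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)             # recall / true positive rate
    specificity = tn / (tn + fp)             # true negative rate
    accuracy = (tp + tn) / n
    # Chance agreement used by Cohen's kappa.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "threat_score": tp / (tp + fn + fp),  # critical success index
        "kappa": (accuracy - expected) / (1 - expected),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Illustrative counts: 100 attack and 100 benign samples.
m = binary_metrics(tp=90, fn=10, fp=5, tn=95)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```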

Experimental Results
The experimental results are based on the original test data. A hold-out approach is used for dividing the dataset: 75% of the data are used for training, and the remaining 25% are used for testing. The performance comparison among the classification algorithms and the confusion matrices are shown in Sections 3.1 and 3.2. In addition, the evaluation on different train-test schemes, the Receiver Operating Characteristic (ROC) curve, and the comparison with other studies are presented in Sections 3.3-3.5, respectively.
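The size of the hold-out split can be worked out directly from the dataset size. The sketch below assumes that the test share is rounded up, which is the convention used by scikit-learn's `train_test_split`; whether the authors used that function is not stated.

```python
import math

n_samples = 229_829      # total samples reported for the dataset
test_fraction = 0.25     # 75-25 hold-out split

# Assumed rounding: test size rounded up, training set takes the remainder.
n_test = math.ceil(n_samples * test_fraction)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under this assumption the test set contains 57,458 instances, which agrees with the test-set size quoted for the timing measurement.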

Performance Measures
After selecting features with RF and classifying them using XGB, several statistical measures are computed to assess the performance of the proposed method. Five other machine learning (ML) schemes have also been used: (i) RF-RF (RF classification using RF-based feature selection), (ii) RF-RFE (RF classification using RFE-based feature selection), (iii) RF-RFECV (RF classification using RFECV-based feature selection), (iv) RF-SelectKBest (RF classification using Select-K-Best feature selection), and (v) RF-WFS (RF classification without feature selection). The classification performance is shown in Table 3. From Table 3, it is clear that the best accuracy (99.9426%) is achieved by the XGB-RF model, while the RF-RF, RF-RFE, RF-RFECV, RF-SelectKBest, and RF-WFS models each obtained nearly 90% accuracy. XGB-RF also obtained the highest results for the other performance metrics, including sensitivity (99.9426%), specificity (99.9942%), F1 score (99.9426%), and balanced accuracy (99.9683%). The proposed model has the lowest error score (0.06%). The proposed model therefore surpasses the remaining models by a large margin, and the RF-WFS approach performed worst among all of these models. In addition, we measured the execution time of our proposed approach. For the whole test set (57,458 instances), it required 57.822 s on a machine with a Core i9 processor and 64 GB of RAM. This implies that the time required for each attack detection is only 0.0010063 s using XGB-RF.
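The per-detection time and the error score quoted above follow from simple arithmetic on the reported figures, as the short check below shows. The throughput value is an additional derived quantity, not a number reported in the text.

```python
total_seconds = 57.822    # reported time for the whole test set
n_test = 57_458           # reported number of test instances
accuracy_percent = 99.9426

per_instance = total_seconds / n_test   # seconds per detection
throughput = n_test / total_seconds     # detections per second (derived)
error_percent = 100.0 - accuracy_percent

print(f"{per_instance:.7f} s per instance")
print(f"{throughput:.1f} instances per second")
print(f"{error_percent:.2f}% error")
```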

Confusion Matrix
A multi-class confusion matrix is used to compare the behavior of the different classifiers. As the N-BaIoT dataset has multiple classes, the multi-class confusion matrix lets us visualize the confusion that arises when predicting attacks. The confusion matrices for the different classifiers are shown in Figure 3.
Here, the same test set is used in all cases in Figure 3 for a fair comparison. From the confusion matrix of (a) RF-RF, it is noticeable that the classifier rarely recognizes Gafgyt.tcp attacks: it correctly classifies only 7 Gafgyt.tcp attacks out of 5939. In the case of Gafgyt.udp attacks, RF-RF classifies almost all the attacks correctly. However, it misclassifies 5930 Gafgyt.tcp attacks, which lowers the accuracy score. The confusion matrix of (b) XGB-RF, which is our proposed method, overcomes the issues encountered by the previously discussed classifier in detecting Gafgyt.tcp and Gafgyt.udp attacks. The misclassification rate is extremely low for the XGB-RF model, which successfully recognizes 99.9426% of attacks.
Again, from the (c) RF-RFECV confusion matrix, we see that it rarely recognizes Gafgyt.tcp attacks. It correctly classifies only 7 Gafgyt.tcp attacks out of 5939, while it misclassifies 5930 attacks as Gafgyt.udp. The accuracy of RF-RFECV is 89.6585%, which is much lower than that of our proposed approach. Similarly, for (d) RF-RFE, (e) RF-SelectKBest, and (f) RF-WFS, a large amount of misclassification occurs for Gafgyt.tcp attacks as well.

Evaluation on Different Train-Test Schemes
As mentioned earlier, in the proposed approach, 75% of the data are used for training, and the remaining 25% are used for testing. The 75-25% division was chosen based on our previous experience. We also evaluate the performance using other train-test splitting schemes, namely 70-30% and 67-33%. These schemes are also used by other N-BaIoT studies [35,36]. The results presented in Table 4 show that the choice of split has a very minimal effect on performance.

ROC Curve
It can be seen from Figure 3 that our proposed hybrid XGB-RF accurately classified all the classes, including the Gafgyt.udp class. On the other hand, other methods such as RF-RF, RF-RFECV, RF-RFE, RF-SelectKBest, and RF-WFS cannot classify the Gafgyt.udp class well. Therefore, we have generated a receiver operating characteristic (ROC) curve for Gafgyt.udp vs. all other classes in Figure 4. It can be seen that the proposed method recognizes Gafgyt.udp better than the other approaches. Note that the ROC curves for RF-RF, RF-RFECV, RF-RFE, RF-SelectKBest, and RF-WFS overlap, as they all exhibit the same performance. Additionally, Mason et al. [37] showed that the area under the ROC curve (AUROC) is equivalent to the Mann-Whitney U-statistic and is related to statistical p values. As the AUROC of our proposed method is about 1, which corresponds to a near-perfect classifier, we can equivalently say p < 0.001. Therefore, we have not performed a separate statistical test or ANOVA test for statistical significance.
LGBA-NN) approaches on the N-BaIoT dataset that resulted in an accuracy of 99.10%, 99%, 89.75%, 99.84%, and 90%, respectively. Most of the above-mentioned approaches used state-of-the-art classifiers, but their performance is still lower than that of our proposed XGB-RF approach. XGB-RF is a hybrid machine learning model that selects the prominent feature subset and improves classification accuracy. This is because XGB-RF removes redundant and irrelevant features. Therefore, it provides a better decision boundary that improves the classification performance and reduces the runtime. From Table 5, it is clear that our proposed model outperforms all the previous approaches.
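The equivalence between AUROC and the Mann-Whitney U statistic mentioned in the ROC discussion can be checked numerically. The sketch below computes the area in two ways on a handful of hypothetical one-vs-rest scores: by counting how often a positive sample outscores a negative one (the normalized U statistic), and by trapezoidal integration of the ROC curve. It illustrates the equivalence only and does not reproduce the p-value argument.

```python
def auroc_pairwise(pos, neg):
    """Normalized Mann-Whitney U: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auroc_trapezoid(pos, neg):
    """Area under the ROC curve obtained by sweeping the decision threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(pos + neg), reverse=True):
        tpr = sum(p >= threshold for p in pos) / len(pos)
        fpr = sum(q >= threshold for q in neg) / len(neg)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical classifier scores for the target class vs. all other classes.
pos_scores = [0.9, 0.8, 0.7, 0.4]
neg_scores = [0.6, 0.4, 0.3, 0.1]

u_based = auroc_pairwise(pos_scores, neg_scores)
curve_based = auroc_trapezoid(pos_scores, neg_scores)
print(u_based, curve_based)
```

Both computations give the same area, including the half credit for the tied pair of scores.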

Conclusions
The future IoT will have a profound effect on our economic and social life. Therefore, it is urgent to keep IoT systems secure. Intrusion detection systems are highly effective in identifying possible security risks and breaches. The proposed XGB-RF based hybrid machine learning scheme successfully detects different types of intrusions. Five different schemes were investigated in this work. Among these, it has been shown that the accuracy, sensitivity, and Kappa index of XGB-RF are almost 10% higher than those of the other schemes. The effectiveness was tested on the N-BaIoT dataset with more than 99% accuracy, which is also higher than that of different state-of-the-art machine learning schemes. Since the security and confidentiality of IoT devices are crucial for their success, the proposed approach can contribute substantially to enhancing the security of IoT systems. However, due to the continual growth of new types of attacks, identifying unknown attacks has proven to be a difficult task. Currently, our proposed approach takes 0.0010063 s to detect a single attack. Near-future work could reduce the detection time while maintaining high accuracy so that the approach can be implemented in a busy IoT system. In addition, we will analyze the performance of machine learning classifiers for detecting unknown attacks in IoT environments.

Figure 1. Workflow diagram of the proposed system.

Algorithm 2: Steps followed for the XGBoost classifier
Input: $S \in \mathbb{R}^{n \times d}$, where the data are of dimension d and n represents the number of samples.
Target: $F \in \mathbb{R}^{n \times 1}$
Output: $\sum_{i=1}^{C} P_i = 1$, ∀i ∈ C = 2, where C = the number of classes and P ∈ [0, 1] for an unseen test sample s. P represents the posterior probability.
1 Initialization: $F_0(s) = \arg\min_{\Omega} \sum_{i=1}^{T} L(Y, \Omega)$

A performance comparison is performed between our proposed XGB-RF model and other contemporary studies with respect to several performance indices, where five of the other methods used the same N-BaIoT dataset. Therefore, we have used those studies in the comparison. Adeel et al. In addition, Serpil et al. used RF and Deep Multilayer Perceptron (DMLP) on the same CICIDS2017 dataset and obtained an accuracy of 99.67% and 91%, respectively. Again, Kathleen et al. employed a Support Vector Machine-Decision Tree-Naïve Bayes (SVM-DT-NB) classifier on the KDDCup99 dataset, which produced an accuracy of 99.62%. However, Yan et al., Chaw et al., Abdulkareem et al., Tran et al. In addition, Abdullah et al. utilized the Naïve Bayes-J48 Decision Tree-Artificial Neural Network (NB-J48-ANN), Classification and Regression Trees (CART), Recurrent Neural Network (RNN), Collective Deep Learning and Local-Global Best Bat Algorithm for Neural Networks (

Table 1. Name of the features used in this study.

Table 3. Performance of Classification Algorithms.

Table 4. Performance Evaluation on Different Train-Test Schemes.

Table 5. Performance Comparison with Other Studies.