BoostedEnML: Efficient Technique for Detecting Cyberattacks in IoT Systems Using Boosted Ensemble Machine Learning

Following the recent advances in wireless communication leading to increased Internet of Things (IoT) systems, many security threats are currently ravaging IoT systems, causing harm to information. Considering the vast application areas of IoT systems, ensuring that cyberattacks are holistically detected to avoid harm is paramount. Machine learning (ML) algorithms have demonstrated high capacity in helping to mitigate attacks on IoT devices and other edge systems with reasonable accuracy. However, the dynamics of operation of intruders in IoT networks require more improved IDS models capable of detecting multiple attacks with a higher detection rate and lower computational resource requirement, which is one of the challenges of IoT systems. Many ensemble methods have been used with different ML classifiers, including decision trees and random forests, to propose IDS models for IoT environments. The boosting method is one of the approaches used to design an ensemble classifier. This paper proposes an efficient method for detecting cyberattacks and network intrusions based on boosted ML classifiers. Our proposed model is named BoostedEnML. First, we train six different ML classifiers (DT, RF, ET, LGBM, AD, and XGB) and obtain an ensemble using the stacking method and another with a majority voting approach. Two different datasets containing high-profile attacks, including distributed denial of service (DDoS), denial of service (DoS), botnets, infiltration, web attacks, heartbleed, portscan, and botnets, were used to train, evaluate, and test the IDS model. To ensure that we obtained a holistic and efficient model, we performed data balancing with synthetic minority oversampling technique (SMOTE) and adaptive synthetic (ADASYN) techniques; after that, we used stratified K-fold to split the data into training, validation, and testing sets. Based on the best two models, we construct our proposed BoostedEnsML model using LightGBM and XGBoost, as the combination of the two classifiers gives a lightweight yet efficient model, which is part of the target of this research. Experimental results show that BoostedEnsML outperformed existing ensemble models in terms of accuracy, precision, recall, F-score, and area under the curve (AUC), reaching 100% in each case on the selected datasets for multiclass classification.


Introduction
Monitoring computer networks in recent times has become more convenient and efficient through the use of intrusion detection systems (NIDS) that detect all abnormal entry point in the network are transmitted to the network monitoring expert (NME), which can either be a human or intelligent system, for analysis and consequently take possible actions. One of the current challenges with this scenario is the rate of false alarms generated by the IDS that may result in alert fatigue and failure in the system. In a case where there is a prevalence of alert fatigue, the network experts may spend unnecessary time investigating many false alarms and less time responding to realistic attacks. Hence, the need to reduce false-alarm rates has been studied in the literature [8,9]. In the case of a botnet attack that floods the entire IoT network with streams of bots causing resource depletion and network service interception, artificial intelligence (AI) devices are necessary to detect such floods.
The traditional method in monitoring network flows is the use of human experts who can easily become overwhelmed with false-alarm fatigue. Intelligent machine experts can overcome this problem. ML approaches to monitor both misuse and anomaly-based network traffic have been investigated with different performances in terms of accuracy, precision, recall, and F1-score. In [10], an expanded survey on the various implementation of ML in NIDS relating to IoT environment was presented.
According to [11], authors proposed an IDS based on ensemble ML. The system achieved an accuracy of 99.3% during testing. Other authors also achieved high degrees of accuracy in their proposals [12][13][14]. One major problem in the domain of IDS models using ML has been the rate of false alarms which continues to reduce the practicality rate of implementation of IDS. When systems are designed for the purpose of recommendation activities or for filtering emails into spam or not, the impact of false negatives (FNs) and false positives (FPs) may be neglected [15]. However, when it concerns intrusion whose effect can be more disastrous, reducing the FPs and FNs to their most feasible minimum is extremely important.
In this paper, we propose an IDS that uses boosted ensemble ML classifiers (Boost-edEnML) aimed at enhancing the performance of IDS models in attack detection and classification with reduced false-alarm rates. Network packets are processed using ML algorithms to detect, analyze, and classify the traffic into their respective categories so that triggered alerts can be more reliable, reducing the computational overhead cost of managing false signals in the system. We implement our proposed model based on boosting algorithms as they showed better performance over other algorithms tested in this paper in model complexity, accuracy, and time function. Furthermore, the use of BoostedEnML in this work demonstrates that boosting classifiers such as LGBM and XGB can be combined to significantly improve the detection rate of ML IDS models in classifying attacks in an IoT environment as opposed to existing ML IDS models, which did not implement the combination of these two algorithms.
In the proposed IDS, we train, validate, and test different models based on random forest (RF) [16], AdaBoost [17], XGBoost [18], LightGBM [19], extra tree (ET), and decision tree (DT) [20] classifiers. Except for the DT, other algorithms already exist as ensemble classifiers based on the aggregation of various DT algorithms. A combination of these using a new method usually results in improved performance, in our case, for research. We develop the proposed model on the CSE-CIC-IDS2018 and CIC-IDS2017 datasets, which are the most comprehensive datasets for IDS development [21] currently available. IDS models and, generally, ML algorithms generalize better on balanced data by learning the same features from each class in the dataset. The two datasets used for this work contain imbalance; therefore, we handle the imbalance in our dataset using two main oversampling techniques, which are synthetic minority oversampling technique (SMOTE) [22] and adaptive synthetic sampling (ADASYN) [23,24]. In the NIDS domain, several ensemble ML approaches have been discussed [25][26][27][28][29][30][31], but none have used these classifier combinations to the best of our knowledge. Our proposed approach detects intrusion more accurately and precisely compared to existing systems [25][26][27]29].
The key contribution of this research are outlined as follows:

1.
A search algorithm based on GridSearchCV was implemented to select the most fundamental parameters necessary to obtain a high-performing IDS model. This ensures that the model learns holistically on the dataset.

2.
We performed feature selection to obtain the most predominant features of the datasets and used an ensemble technique to combine the features to obtain a comprehensive array of best performing features. 3.
We implemented oversampling techniques, such as SMOTE and ADASYN, to handle data imbalance in our two datasets, thereby obtaining a highly accurate classification model. These datasets are widely used in similar and recent research. 4.
We implemented several ensemble models and selected the best models depending on time-cost function and overall accuracy. Models based on boosting algorithm showed better performance; hence, they were used to develop the BoostedEnML as proposed.
In each step, the resulting model was validated for a multiclass classification task.

5.
We evaluated the model performance on two robust datasets having various intrusion attempts and used the AUC to validate the performance accuracy.
On evaluation, experimental results show that the proposed BoostedEnML IDS model accurately classified the network traffic flows in the used datasets with reduced FN, FP, and FAR, and maintained a high detection rate for packets of data on the IoT network. Our IDS model for IoT systems showed improved performance over existing models discussed in the literature. In addition, the proposed approach helps to reduce the model complexity by using lightweight algorithms to develop the ensemble model. With the grid search cross-validation applied, we ensured that the proposed model learns from the most relevant network traffic features and uses the algorithm's best parameters to save training time.
The rest of the paper is organized in the following pattern. Section 2 presents the background of ML in IDs, selected algorithms, and related propose works. Our approach to achieving the proposed model is presented in Section 3. In Section 4, we present, analyze, and interpret our research findings, and then we conclude our paper in Section 5.

Background and Related Work
Currently, many research breakthroughs exist in the IDS for network security applied to IoT systems. Notwithstanding, there still exist significant challenges, some of which include a lack of a consistent understanding of normality introduced by network unpredictability, heterogeneous nature of network traffic, unavailability of appropriate public IDS datasets, and vulnerable environments and loopholes that grant access to attackers who actively search for and exploit security flaws. Some security researchers have opined that these challenges are uniquely inherent in IDS in networks and may not be observed in other domains [32]. IoT system security challenges are evolving with the expansion of the application domain of the technology. The IoT layers comprising the perception, the network, and application layers continuously face different threats. The application layer sitting at the topmost part of the network transmits information between the network and other services and tends to face most of the threats due to the connection interface established between other devices [5]. In [33], authors proposed an ensemble IDS model for the IoT environment using gradient boosting algorithm for a binary class classification task. The proposed model reached an accuracy of 98.27% and a precision of 96.40% using XGBoost for feature selection.
Data generation in IoT systems has witnessed a great expansion in the last decades, and transmitting such a volume of data over a regular network has been challenged with high computational resource requirements, low bandwidth, and advanced network attacks. One approach to overcoming the resource constraint and increased cyberattacks is using a cloud computing environment with massive storage capacity, high computational power, and configurable resources integrated with virtualization capabilities for data storage [34]. Flooding the IoT network at all layers with DDoS attacks such as UDP flood, ICMP/Ping flood, SYN flood, ping of death, and zero-day DDoS attacks have resulted in high data loss. Nie et al. in [35] proposed a novel intrusion detection system in the IoT domain to deal with such intrusive attacks as distributed denial of service (DDoS), packet-sniffing, and man-in-the-middle attacks. The authors used the GAN method to train an IDS model using the CSE-CICIDS2018 and CICDDoS2019 datasets, the most recent and complete datasets for training and testing IDSs. The research showed that the models achieved about 97% accuracy in both datasets in the training and evaluation phases. Mitigating DDoS, DoS, botnet, and infiltration attacks on the IoT networks has recently been a challenging task [36].
In [37], authors proposed many IDS models based on machine learning to mitigate attacks on IoT devices in the smart city setting. Different ML algorithms and ensemble methods, such as the stacking, bagging, and boosting methods, were used to develop the ensemble model. On evaluation, the proposed ensemble models reached an accuracy and recall of 0.999. Several Ml algorithms were used by [38] to propose the IDS model for IoT networks. In the work, the authors used K-nearest neighbor (KNN), support vector machine (SVM), artificial neural network (ANN), and other ML algorithms in their work. The models were trained using the train-test split method at an 80:20 ratio; the resulting models were evaluated on the BoT-IoT dataset and achieved an accuracy of 99% with the KNN. Furthermore, Ref. [39] proposed an IDS model for cyberattack monitoring based on the bagging ensemble method with an accuracy of 99.67% on the NSL-KDD dataset.
Currently, several open-source network monitoring solutions are leveraged to provide network security by capturing the TCP/IP packets in the networks. Suricata [40] and Snort [41] are the most commonly used open-source traffic monitoring software. Both have shown some limitations in recording attacks during operation. Suricata and Snort work based on predefined rules to detect malicious attacks [42,43]. One of the major drawbacks of these systems is that any deviation from the predetermined rules would result in a false alarm. Again, it requires that a security expert study both existing attacks and novel network deviations under defined conditions that define the database's signatures. Attackers exploit the vulnerabilities regularly discovered in IoT networks and use the same to tamper with the events protocol. Since this process is dynamic, using a manual approach to define attack features can be ineffective and burdensome to handle.
In addition, considering the extensive data generated by the IoT systems, manually searching for attacks in the dataset can be a hassle. An attempt to proffer a solution is the application of machine learning, which today has gained exceptional popularity in industries and the scientific community in IoT cybersecurity [44][45][46][47][48]. The machine learning technique primarily used in IDS systems is supervised learning, where the database is provided with features and labels to classify the network traffic. Ensemble learning defines an approach where several base learners, referred to as weak learners, are aggregated based on specific rules to form a stronger classifier algorithm [49]. With ensemble methods, models achieve better performance in predicting the nature of the traffic flow, as overfitting and class imbalance are handled with a better approach [50]. In a nutshell, many of the existing ensemble models implement DT architecture in a bagged or boosted manner, leading to improved results.
The bagging method uses different samples of the train data on the algorithms at different times and rates, resulting in different submodels whose average is the desired output of the training. The voting ensemble uses majority voting (soft or hard) for classification tasks, as used in [51] with an accuracy of 99%, and averaging for regression tasks to combine the outputs of the base learners. Bagged DT and RF models are the most widely used bagging ensemble models [50]. On the contrary, the boosting algorithm forces each weak classifier to concentrate on a specific component of the data in the training distribution, thereby transforming groups of weak classifiers into strong ones with improved accuracy. Through this approach, later learners are pressed to concentrate on the mistakes made by earlier learners. Hence, the later classifiers are trained to overcome the mistakes of the earlier classifiers. As a result, each baseline learner in the boosting ensemble can concentrate more on the data points that the other learners misunderstood. When the data are pooled, boosting produces a more precise prediction [17].

Machine Learning Models
In this section, an overview of the selected ML algorithms used in this work is presented. For simplicity, we discuss the decision tree, AdaBoost, extra tree, random forest, LightGBM and XGBoost.

1.
Decision Tree (DT): Decision trees (DTs) are data structures composed of elements called nodes. Following a hierarchical model, the tree has a root node, where the tree begins; sequentially, the tree is composed of child nodes, where each node can have other children or subtrees. A leaf or terminal node is a node that has no offspring. The initial data enters the tree's root and passes through the decision nodes until reaching the leaf node, which presents the result of the processing. Usually, three main variations of DT are prominent in use for IDS designs: ID3 [52], C4.5 [53], and CART [54].

2.
Adaptive Gradient Boosting (AdaBoost): Freund et al. [17] proposed the AdaBoost as a boosting learner that creates a chain of classifiers in succession on the same dataset in such a manner that subsequent classification improves on the errors of the earlier classification. The algorithm achieves this by assigning higher weights to the incorrectly classified classes and lower weights to the correctly classified classes, thereby ensuring that the incorrectly classified instances gain priority during the next phase. The exact process repeats until the best possible result is achieved and the algorithm has used all the instances in the data. As implemented in [55], authors proposed an IDS based on AdaBoost using the CIC-IDS2017 dataset as a training dataset. Applying SMOTE, an accuracy of 81.31% and an F-score of 81.31% were achieved during testing. Although achieving good accuracy, this resulted in a lot of false predictions that need to be improved.

3.
Extra Tree (ET) Classifier: This algorithm improves the performance of DT and RF by incorporating a more significant number of trees into its network. As a result, compared with other ML algorithms, it has the highest number of trees and computational resource requirements. This algorithm works on the principle of meta-estimator and applies an averaging rule to increase predicted accuracy and reduce overfitting. First, the meta-estimator fits several randomized decision trees on different subsamples of the same dataset. Then, it aggregates the results of multiple decorrelated decision trees collected in a forest to output a classification result. The package is available in the sklearn.ensemble.ExtraTreesClassifier library for use in any ML tasks [56].

4.
Random Forest: This algorithm, proposed by Breiman [16], has shown great results in both classification and regression problems, making it the most used ensemble algorithm. By constructing component trees, the algorithm reduces the connection of different decision trees. It extends the attributes of bagged decision trees by inculcating randomized attributes. More importantly, the performance gains observed in RF are achieved through the randomness in the attribute selection process, not from the splits in the decision trees which are created based on a subset of the data attributes [15]. As a popular ensemble algorithm, several authors have used it in IDS [57][58][59]. In [58], authors proposed an IDS model which used principal component analysis (PCA) for dimensionality reduction and random forest classifier for classification. The result was compared with support vector machines (SVM), naive Bayes, and classical decision trees. On testing, authors claimed that the model achieved an accuracy of 96.78%, making it preferable over the others, which achieved less accuracy.

5.
Extreme Gradient Boosting (XGBoost): Extreme gradient boosting (XGBoost) [60] is an extension of the implementation of gradient boosting tree proposed by Friedman et al. [61]. Because it offers parallel computation, cache awareness, a built-in regularization strategy to avoid overfitting, and tree optimization by a split-finding algorithm, XGBoost generally outperforms gradient boosting in terms of performance as it has a quick training and inference time. In [62], an efficient IDS model based on XGBoost was proposed for computer networks. The model was trained and evaluated on the network socket layer-knowledge discovery in databases (NSL-KDD) dataset with an accuracy of 98.70%. 6.
Light Gradient Boosting Machine (LightGBM): Observing the high training time requirement for gradient boosting decision trees (GBDT), Ke et al. [19] proposed two novel techniques to overcome the challenge based on Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). This new implementation was named LightGBM, and it improved training and inference time of GBDT by 20%. Since its development, it has shown highly impressive results even in IDS systems, as shown in [63,64].

SMOTE and ADASYN for Imbalanced Dataset
One of the many challenges affecting the efficiency of ML models is the inadequacy of data points in the dataset used to train the models. Hence, the model cannot learn comprehensively from the available data, creating room for incomplete knowledge in some instances. In the case of our dataset, there are over 13 million benign traffic in the CSE-CIC-IDS2018, with some attacks such as SQL injection having only 87 data instances. In addition, in the CIC-IDS2017, the heartbleed attack has only 11 instances compared with the benign instances with 2 million data points. Some techniques have been proposed to solve this problem, usually based on either oversampling or undersampling methods. In undersampling, the majority class is reduced to be suitable to the minority classes, which leads to the loss of vital information, while in oversampling, the minority classes are increased to be equal or approximate to the majority classes. SMOTE [22] and ADASYN [23] are two of the many oversampling techniques used in handling data imbalance. SMOTE first selects a minority class instance r randomly and finds its K-nearest minority class neighbors. The synthetic instance is then created by choosing one of the K-nearest neighbors p at random and connecting r and p to form a line segment in the feature space. Finally, the synthetic instances are generated as a convex combination of the two chosen instances, r and p. ADASYN is based on the idea of adaptively generating minority data samples according to their distributions: more synthetic data are generated for minority class samples that are harder to learn compared to those minority samples that are easier to learn. Other derivatives of the SMOTE method include borderline-SMOTE [65], borderline-SMOTE SVM, SMOTEN, SMOTENC, and KmeansSMOTE, which are all available in the Imblearn-learn library [66].

Ensemble Machine Learning
Ensemble learning in ML aggregates the results of different ML classifications aimed at achieving better performances in accuracy and attack classification detection rate. In ensemble learning (EL), homogeneous and sometimes heterogeneous algorithmic classifiers can be combined to build an improved predictive model with better inference time [67]. The applicability of ML techniques differs between use cases and the characteristics of the dataset on which it is built. This implies that the technique used for one project dataset might not be applicable to another of the same or similar domain [68]. Hence, EL tries to achieve a model that can be used in the application domain with better results. Different EL models perform differently from each other in the IDS domain based on the dataset used to develop the model. Usually, three main classes/methods of EL exist, including bagging, stacking, and boosting.
Bagging entails averaging the predictions from many decision trees that have been fitted to various samples of the same dataset. It usually incorporates three main approaches, including bootstrapping samples of the train dataset, fitting unpruned DTs on each sample, and use of simple voting or averaging of predictions to obtain the final results. Some known examples of this include bagged decision tree (BDT), random forest (RF), and extra tree (ET) [69]. Given a training set T = t1, . . . , tn with responses L = l1, . . . , ln, the bagging algorithm repeatedly (P times) selects a random sample accompanied by replacement of the training set, then fits trees of different sizes to these samples. This can be achieved using the procedure shown in Algorithm 1. Train a classification tree, fb on Tb, Lb.

4:
After training, predictions for unseen samples x 5: obtain the final predictions from all the individual fb on x by taking the average of all predictions for regression or taking the majority vote for a classification problem using Equation (1). 6

: end for
This approach leads to a better model with reduced variance of the IDS model without increasing the bias. This shows that in a case where the predictions of a single tree are extremely noise-sensitive on the training set, as long as the trees are not correlated, the average of the trees is insensitive to noise. Hence, bagging yields reliable IDS models for IoT environment. When we train many trees on a single dataset (training data), the trees would produce strongly correlated trees (even with the same tree many times not considering whether the training algorithm is deterministic or nondeterministic), which tends to cause overfitting and bias; bagging or bootstrapping the samples in the datasets is a measure to ensuring decorrelation in the trees by showing them different samples of data during the training process in the training sample [70]. More specifically, we calculate an estimate of the uncertainty of the prediction as the standard deviation, σ of the predictions from all the individual regression or classification trees on x according to Equation (2): Stacking, also known as stacked generalization, is an ensemble modeling technique that includes using data from many models' predictions as features to construct a new model and make predictions. In other words, during stacking, we fit different models on the same train data, obtain the results of the predictions, and use another algorithm to combine the predictions for improved results. This approach ensures that the learned features from the first model are maintained by the second model, thereby showing improved results compared to the single model. By using heterogeneous weak models trained on the same data sample, more robust IDS models are obtained [71,72]. Popular EL algorithms based on stacking are blending and super ensemble.
When boosting is implemented, there is sequential addition of the members of the ensemble algorithms which corrects the predictions of the previous classifier and generates a weighted average of the predictions as the output. This feature of boosting algorithm accounts for their better performances over stacked and bagged ensemble classifier. Common examples include AdaBoost, XGB, LGBM, and GBDT [50,73]. Assuming that the boosting ensemble is defined in terms of weighted sum of L weak learners, we obtain the function shown in Equation (3) where c l are coefficients and w l are weak learners.
One drawback of this approach is the difficulty to achieve faster optimization convergence. To arrest this challenge, instead of solving for the coefficients and the weak learners in one try, we implement an iterative optimization approach that is more cost-efficient and tractable. In this scenario, each weak learner is added one by one, checking the iteration for the best possible pair that it gives (coefficient and weak learner) to update the current ensemble model. Hence, we define recurrently the value of s l in a way such that s l (.) = s l−1 + c l w l (.) (4) In which case the values of c l and w l are selected such that s l is the model which has the best fit on the train data, therefore it presents the best possible improvement over s (l−1) according to Equation (4). If we define E(.) as fitting error of the given model and e(.,.) to be the loss/error function, we denote the following: As a result, rather than optimizing "globally" over all of the L models in the total, we approach the optimum by optimizing "locally" creating and gradually adding the learning algorithm to the strong model. Hence, Equation (5) presents a comprehensive approach to the design of highly optimized ensemble classifier based on booting technique. A typical algorithmic representation of the procedure for implementing the boosting algorithm is shown in Algorithm 2 with primary focus on the AdaBoost classifier upon which other boosting classifiers are built. A summary of related literature reviewed in this section is presented in Table 1. Normalize the weights of the samples 5: For available features from the set, train a classifier using a single feature and evaluate the training error 6: Choose the classifier with the lowest error 7: Update the weights of the training samples: increase if classified wrongly by this classifier, decrease if correctly 8: end for 9: Form the final strong classifier as the linear combination of the T classifiers.

Materials and Methods
The materials used for this research and the method are discussed in detail in this section. The well-elaborated architecture describing the process flow is given in Figure 1. The methodology is specifically divided into five different phases, namely, (a) data collection, (b) data preprocessing, (c) ensemble feature selection, (d) model classification, and (e) anomaly detection (classification). We begin the proposed IDS model design by checking the database for important datasets that best meet the specific objectives of this paper. Data in raw format are composed of irregularities and misinformation that must be preprocessed. We perform feature engineering to remove redundant features, then develop the IDS model, as shown in Figure 1. In the end, the final model proposed in this work is implemented using the pseudocode presented in Algorithm 3.

Data Collection
The performance of the ML model is as important as the data used in the training process. For this reason, in our work, we searched through the available datasets to select the most wide and comprehensive datasets upon which we could build our IDS model. Two recent datasets were selected which are publicly available for research purposes: CICIDS2017 (http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/, accessed on 7 February 2022), consisting of over 2 million instances among which 83% are benign and 17% are attack classes, and CSE-CIC-IDS2018 (https://registry.opendata.aws/cse-cic-ids2018/, accessed on 7 February 2022), comprising 83% benign and 17% attack, were used in this paper. The datasets collected from these sources are contained in different folders in CSV format. To obtain a robust dataset, we first aggregated all the different CSV files into a single file for each of the selected datasets. The data contain relevant information of the problem domain and needed to be cleaned for further analysis. The selected datasets are maintained by the Canadian Institute of Cybersecurity and the University of New South Wales [76]. Other commonly used datasets include KDD Cup'99, NSL-KDD, UNSW-NB15, Bot-IoT, CICDDoS2019. Usually, the dataset is divided into train and test portions. The collected dataset contains information about network flows recorded in forward and backward order. Some of the features of the datasets include source IP, destination IP, timestamp, flow duration, flow bytes, etc.

Algorithm 3
The algorithm for the BoostedEnsML 1: Define the number of folds in the split, s 2: Initialize s = 0 3: while s 10 do: 4: Train the selected classifier (LGBM or XGB) using 9 parts of the 10 folds and perform prediction on the other part 5: XGB and LGBM are used for predictions on the train set and test data. 6: s += 1 7: end while 8: Using Stacking Classifier to combine the predictions from the two base models. 9: BoostedEnsML is applied to the test data to make final predictions.

Data Preprocessing
The raw datasets shown in Tables 2 and 3 consist of 15 different classes each and one benign class. There are different web, DDoS, and DoS attacks in the datasets. We merged these related attacks into their respective classes. For instance, the DoS GoldenEye, DoS slowloris, and DoS Slowhttptest in the CSE-CIC-IDS2018 dataset were merged into the DoS attack, while in the CICIDS2017, the DoS Hulk, DoS SlowHTTPTest, DoS GoldenEye, and DoS Slowloris were also merged. Similarly, the same approach was used to merge the DDoS flows in both datasets. Usually, data come in raw form and cannot be implemented in that form in ML algorithms. It is important that the crude datasets are cleaned, sanitized, transformed, and features reduced to ensure that attack features used in the ML classifier are the best features. In cleaning and sanitizing the datasets, we removed duplicate rows and columns; rows containing special characters (@,#,%) were checked, and such special characters were deleted. We noticed that some instances in the dataset were 'inf' and NULL values, so the null value rows and columns were deleted. This cleaning was performed on both datasets. ML classifiers can correctly handle all numeric inputs; we converted all non-numeric data into numeric using the LabelEncoder. LabelEncoder is used in ML to encode the y-label into numeric values in the range of 0 to n_classes-1.
For better understanding of the correlation between the traffic features in the dataset, we performed statistical analysis including univariate, bivariate, and multivariate analysis using data visualization tools such as Matplotlib, Seaborn, and Plotly. We observed in the dataset during exploratory data analysis (EDA) that most of the numeric data are of higher values than others. For this reason, we used MaxAbsScaler (Maximum Absolute Scaler) to transform the data into the range of zero and one (0 and 1). There exist the standard scaler, min max scaler, and robust scaler. One advantage of the Max Absolute Scaler over other feature transformation techniques is that it estimates, scales, and transforms each feature one by one in such a way that the maximum value of each feature in the train dataset will be 1.0; hence, the center of the data is maintained and sparsity is not destroyed.
As shown in Tables 2 and 3, our datasets contain severe data imbalances, having benign features in millions and attacks in thousands and hundreds. To handle the imbalance, we used two different sampling techniques to reduce the dataset size without affecting the model's performance. We first oversampled the minority classes using SMOTE and ADASYN. These data sampling techniques generate synthetic instances for the minority class using its features. They ensure that the original information contained in the dataset is maintained. Next, we reduced the benign class using a random undersampling technique, which randomly removes some samples of the benign class without affecting its contribution to the model performance. Similar attack types were merged to obtain seven labels for CSE-CICIDS2018 (consisting of six attacks and one benign feature) and nine for CIC-IDS2017 (consisting of eight attacks and one benign feature). Next, we needed the training, validation, and test datasets. The Sklearn library provides the train-test split function for splitting the dataset, while the Keras module provides the train-test-validation split option. In this research, we used the StratifiedKFold cross-validation split function to achieve better performance. StratifiedKFold was used to split the data into ten different subsets or folds, and in each training iteration, nine different subsets were used for the training and validation, while one was used for testing the performance of the model. The process was repeated ten times until all the samples in the folds were used, thus ensuring that each data point participated in the model training. By using this method, data leaking, which occurs when some test data are visible during training and causes the model to be biased toward the test data, may be avoided.

Hyperparameter Optimization and Ensemble Feature Selection
Hyperparameter optimization (HPO) is an automated method for picking classifier parameters to train the model. While model parameters (MPs) are learned and updated by the model during training, ML programmers define the hyperparameters for the classifier. This paper employed two search strategies to find the optimal hyperparameters for improving model performance. The RomandomizedSearchCV and GridSearchCV were used. While the RandomSearch algorithm randomly selects parameters based on the search space provided to each holding other parameters constant, the GridSearch CV exhaustively searches the grid of parameters and reports the best candidate parameters. Depending on the number of iterations (n_iter) defined, the RandomSearch can be faster than the Grid-Search. Usually, the parameters to be tuned are defined based on the ML algorithm being implemented. For instance, in RF, the parameters tuned are max_features, n_estimators and oob_score, whereas the max_depth, n_estimators and learning_rate are tuned for the XGBoost classifier. When we compared the tuning results, we discovered that the parameters produced with GridSearchCV improved the model's performance more than those obtained with RandomSearchCV; thus, all future training of all models was based on the GridSearchCV-tuned parameters. As previously indicated, some variables in the dataset are unimportant because they have little influence on the traffic flow characteristics. Hence, we used feature selection to determine which features contribute the most to determining network flow characteristics. Therefore, the random forest algorithm for feature importance was implemented, and the first 64 most important features, evaluated by the RF feature importance method, were chosen. Thus, our models were developed based on the chosen 64 features. Reducing these features helps to lower the model complexity and improve training cost while achieving the same performance output. The selected feature map is shown in Table 4.

Model Selection and Training
In IDS implementation, detecting various forms of network intrusion requires IDS to be capable of functioning in multiclass mode. Hence, our task is a multiclass task. ML algorithms are widely used in this domain [77,78]. In this paper, decision tree (DT), extra tree (ET), random forest (RF), AdaBoost (AD), XGBoost (XGB), and LightGBM (LGBM) are the selected algorithms. In research on boosted algorithms, DT, ET, and RF were used as a baseline to evaluate the computational complexity of the boosted algorithms and the resulting BoostedEnML. First, six different models were developed individually for each of the algorithms; and their performances were evaluated. In the ensemble model design, each of the classifiers, DT, RF, XGB, LGBM, were aggregated. Then, to obtain an ensemble IDS model based on DT, other models were used as estimators while DT was used as the meta-learner. This approach was repeated for all the classifiers to obtain the desired results: ensemble decision tree (Ens_DT), ensemble random forest (Ens_RF), ensemble AdaBoost (Ens_AD), ensemble XGBoost (Ens_XGB), ensemble LightGBM (Ens_LGBM). We compared the different ensemble methods by implementing an ensemble using voting and stacking classifiers. These two are called the ensemble hard majority voting (Ens_HMV) and ensemble stacking model (EnSM). Finally, we used the two boosting classifiers (XGB and LGBM) to develop the BoostedEnML which is proposed in this paper. The algorithm presented in Algorithm 3 helps in implemented the BoostedEnML IDS model. We began by defining and initializing the number of splits we wanted each of the folds to have. We used 10 K-folds in each split during which training was performed on 9 folds; the remaining 1 fold was used to validate the model performance. This process was repeated until the 10 folds were completed, thereby using all the data in the train set. Although other approaches show similar or related performances, we demonstrated that BoostedEnML can be used to achieve network traffic classification with high accuracy and reduced computational cost.

Evaluation Metrics
The metrics accuracy, precision, recall, F-score, area under the curve (AUC), confusion matrix, and receiver operating curve (ROC) were used to check how the model performed on the test data. In the field of ML, these metrics are highly used in evaluating the performances of trained models. While the accuracy is a very good evaluation metric for ML tasks, it is not highly recommended for multiclass classification tasks involving imbalanced datasets. This is because high accuracy on imbalanced data may not have resulted from a generalized learning attribute of the model. Hence, other metrics were combined in this work. Given that TP_os, TN_eg, FP_os, and FN_eg are the definition for true positive, true negative, false positive, and false negative outcomes of the models, respectively, the evaluation metrics can be defined by Equations (6)-(10) for the weighted macro performance of the model in terms of the accuracy, precision, recall, and F-score. TP_os represents the samples in our dataset that were correctly classified as positive, TN_eg are samples that were correctly identified as negative, FP_os represents the instances that were negative but were mistakenly identified as positive by the model, and FN_eg represents the positive instances that were classified as negative by the model. The confusion matrix shows the model's performance in classifying each sample correctly or wrongly on a graph. The AUC-ROC curve was originally designed for binary problems; however, it can be adapted for multiclass problems using the OneVersesRest (OVR) or OneVerseOne (OVO) and 'multiclass' arguments. The one-vs.-one algorithm is used to calculate the average of the ROC-AUC scores in pairs, and the one-vs.-rest algorithm calculates the average scores of the ROC-AUC for each network flow label against all other class labels, as shown in Equation (11). We can set the multiclass keyword argument in the function to 'ovo or OVR' while the average is set to 'macro'. This way, we can use the AUC-ROC curve function for multiclass problems.

Accuracy macro =
TP_os + TN_eg TP_os + TN_eg + FP_os + FN_eg (6) Precision macro = TP_os TP_os + FP_eg (7) Recall macro = TP_os TP_os + FN_eg (8) where c is the total number of classes and AUC(j | k) is the AUC with class j as the positive class and class k as the negative class. In general, AUC(j | k) = AUC(k | j) in the multiclass case [79]. Equation (12) extends Equation (11) for weighted ROC-AUC curves. The modification is to change the value for the average to 'weighted' and other arguments are retained. The 'weighted' [80] returns the prevalent weighted average for each of the class in the dataset.
In the experimental setup for this task, we used Python Numpy, Pandas, Matplotlib, and the machine learning library Scikit-learn for the software. The code was executed on a computer running on Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz, 3600 Mhz, 4 Core(s), 16 GB (15.9 GB usable), Windows 10 Home Single Language 64-bit and NVIDIA GeForce GTX 1050 Ti GPU.

Results and Discussion
In this section, we present and discuss the results obtained from the experiment. As earlier stated, we performed the experiment using two well-known datasets: CIC-IDS2017 and CSE-CICIDS2018, which are publicly available for research purposes [76]. First, we oversampled the data points such that there were almost the same values for each of the samples. For instance, bot, which has 286,191 samples against the benign traffic, with 12,484,708 instances, in Table 2 needed to be increased, otherwise the model would only learn the features of the benign traffic since it would see more of the packets injected as benign. The datasets obtained for the training, validation, and testing after handling imbalances with SMOTE and ADASYN, and splitting using StratifiedKfold crossvalidation, are presented in Table 5. For each of the nine class labels in CICIDS2017, there are 606,812 instances for training and 67,242 instances for testing. The same applied to the CSE-CIC-IDS2018 dataset.  After oversampling the datasets, the resulting data points were very high for the ML task; so we performed undersampling and selected a total of 5,189,072 (6%) data instances of the CSE-CICIDS2018, and 30% of the CIC-IDS2017 dataset with a total of 6,671,664 samples. The two datasets both had a total of 80 features each after preprocessing. A total of 64 features were selected, as shown in Table 4 Figure 2 shows the first 10 important features in the CICIDS2017 dataset as generated with the random forest feature importance. First, we show the results obtained after training the models on the CSE-CICIDS2018 dataset. The performance of each model in terms of the accuracy, precision, recall, F-score, model size, and test time are presented in Table 6. The results show that the task classifies the labels into their respective seven classes as contained in the dataset; identifying, at each time, one of the categories of the network traffic. During the test, the accuracy for each of the ML algorithms, DT, RF, ET, AD, LGBM, and XGB, are 98.7%, 98.4%, 98.3%, 97.8%, 98.8%, and 98.9%, respectively. It can also be observed that XGB has the highest accuracy, precision, recall, F-score, and AUC, compared with other ML algorithms. Hence, it achieves the best performance in correctly identifying each network traffic according to its category. This is expected as it has shown very high accuracy in previous works, outperforming some deep learning models in some datasets [81]. Furthermore, the LGBM model follows the XGB having obtained accuracy, precision, recall, F-score and AUC of 98.8%, 98.83%, 98.83%, 98.83%, and 99.96%, respectively.
LGBM is a lightweight version of the XGB algorithm specifically designed for timing optimization with high accuracy, as seen in this current task. In general, a close look at the evaluation metrics shows close, and almost the same, values obtained for each of the models for each metric used. For instance, DT achieved almost 99% for all the metrics, and RF achieved approximately 98% for all the metrics as well as ET classifier. This is achieved as a result of the balanced dataset and cross-validation approach used. In all cases, each algorithm generalizes very well on the traffic, and thereby gains knowledge to identify to which class the packet belongs. Since all the models trained on the algorithms have almost similar performances, we measured the train and test time for each model to enable us to select the most suitable model for further tasks of ensemble design.
As shown in Table 6, ET required the highest amount of time to predict the different attack classes, using about 15.1 s. This is attributed to the large number of trees in its architecture, so ET was excluded from being used as a base learner in ensemble models.
DT, RF, LGBM, and XGB had total test times of 0.25 s, 9.98 s, 3.4 s, and 4.25 s, respectively. Therefore, we chose them as base learners for ensemble models. The results obtained for the CIC-IDS2017 dataset using the various metrics are shown in Table 7. On this dataset, the DT, RF, ET, AD, LGBM, and XGB classifiers detected each class with an accuracy of 99.59%, 99.45%, 99.68%, 69.67%, 99.16%, and 99.51%, respectively. In terms of the AUC score for each of the classifiers, the DT, RF, ET, AD, LGBM, and XGB reached 99.76%, 99.98%, 99.97%, 67.9%, 96.81%, and 99.97%, respectively, with ET and XGB having the same AUC score of 99.97%. Considering the precision and recall performances of the six models, we observe that each model has high values, which demonstrates the capacity of each of them to give reliable predictions while detecting network traffic. In precision, the DT, RF, AD, LGBM, and LGBM classifiers reached precisely 99.59%, 99.48%, 99.68%, 66.76%, 96.96%, and 99.52%. These performances show that ET and XGB can classify the flow packets with higher precision. In general, ET achieved the best performance in all metrics, although it had the highest detection or prediction time and memory requirement. Due to the large memory capacity and training and testing time requirement for the ET classifier, we selected XGB and LGBM which had similar performance ratings. On the other hand, DT had a prediction time of 0.18 s. LGBM, being a lightweight model, had the lowest memory requirement of about 3.1 MB with an accuracy of 99.16%. Therefore, the models on both datasets detected and classified each traffic with high performances in comparison with other existing methods [4,12]. We used a stacking method (StackingClassifier) to combine all the algorithms to develop ensembles for each classifier. Hence, we obtained Ens_DT (with DT as metalearner), Ens_RF (with RF as meta-learner), Ens_LGBM (with LGBM as meta-learner), and Ens_XGB (XGB as meta-learner). To obtain the classifier based on majority vote, (EnsHMV), we used the four base classifiers as estimator and hard voting as the argument for the voting function. BoostedEnML was then developed using LGBM and XGB only.
Since our task is based on multiclass classification, we show the confusion matrix for Ens_RF and Ens_LGBM classifiers in Figure 5. Almost all the various network traffic types were correctly classified. From the confusion matrix, we can see that during the test for brute force, DDoS, and DOS with the Ens_RF model, all the 67,424 data points in the dataset were correctly identified as either brute force, DDoS, or DoS with 100% accuracy. On the other hand, 64,071 instances were identified as benign, 1 instance was misclassified as DDoS, 3344 were misclassified as infiltration attacks, and 7 were misclassified as web attacks while detecting benign traffic on the CSE-CICIDS2018 dataset. Similarly, on the CICIDS2017 dataset, the Ens_LGBM had only 1, 1, 3342, and 8 misclassifications of bot, DDoS, infiltration, and web attacks, respectively, while detecting benign flows, showing an FNR of 0.05%. However, our proposed BoostedEnsML model outperformed all other ensemble models achieving 100% accuracy, precision, recall, F-score, and AUC for all the different attacks in both datasets, as can be seen in the confusion matrix in Figures 6 and 7. Although other IDS models for IoT scenarios have achieved almost the same accuracy [75], our work demonstrates that using only algorithms based on boosting techniques with balanced datasets can present an improvement on existing works. The model based on the HMV technique, called Ens_HMV, also outperformed other models, reaching high accuracy in both models. Notably, Ens_HMV on the two datasets achieved almost the same performance with the BoostedEnsML model but with regards to memory capacity, the BoostedEnsML (200 MB) is preferred as it has lower computational power than the Ens_HMV (500 MB).  The ROC curve shows the relationship between the true positive rate (TPR) and false positive rate (FPR) for the model performance in detection and classification of each attack. The ROC curve obtained on the CSE-CICIDS2018 dataset for LGBM, RF, DT, and ET is presented in Figure 8. In each case, the AUC score is nearly 1.0, which indicates that the model has high accuracy in correctly classifying the various attacks and benign labels. In addition, the FPR is nearly zero for each of the models, showing a high rate of reduction in false alarms which have been a serious issue in ML used for IDS. Hence, our model outperforms most of the state-of-the-art models [25][26][27] through the methodology adopted for the research. With high detection rate, the proposed model correctly classifies the various network traffic passing through the IoT environment, thereby helping to reduce exposure to cyberattacks.
We applied the ensemble model developed using voting technique (EnsHMV) that is based on bagging classifier for a classification task and the IDS model based on stacked boosting algorithms (BoostedEnML) on each of the datasets to identify how each of them performs in detecting and classifying the network packets into their respective classes. We considered each class as a separate entity to evaluate the classifier's ability to differentiate it from the normal traffic (benign). The result for this experiment is shown in Table 8. The results illustrate that for the various attack in the two datasets, both IDS models showed high precision, recall, and F-score, reaching 100% in correctly classifying the classes. More specifically, while EnsHMV and BoostedEnML performed similarly on the 2018 dataset, BoostedEnML outperformed EnsHMV on both datasets. In detecting infiltration attacks on the CSE-CICIDS2018 dataset, the two models report that the attack is infiltration with 100% recall, while on CICIDS2017, the EnsHMV detected an infiltration attack with a recall of 99.67% against BoostedEnML that reached 100%. The results generally show a low possibility of false alarms in both scenarios.
In terms of the F-score, which is the weighted mean of the recall and precision of the model behavior, Table 8 demonstrates that the EnsHMV reached 0.9636%, 0.9984%, 0.9999%, 0.9989%, 0.9890%, 0.9969%, 1.00%, 0.995%, 0.9992%, and 0.9988% in classifying the benign, botnet, brute force, DDoS, DoS, heartbleed, infiltration, portscan, and web attack traffics in the CICIDS2017 dataset, respectively; while on the CSE-CICIDS2018 dataset, the EnsHMV attained an F-score performance of 0.9978%, 1.000%, 1.000%, 1.000%, 0.9999%, 0.9999%, and 1.000% in classifying the benign, botnet, brute force, DDoS, DoS, infiltration, and web attack flows, respectively. Similarly, the BoostedEnML showed higher performance than the EnsHMV in relation to the F-score measure on both datasets. Specifically, on the CICIDS2017, the BoostedEnML showed an F-score of 1.000%, 0.9999%, 0.9999%, 0.9980%, 1.000%, 1.000%, 1.000%, 0.9999%, and 1.000%, respectively, in the classification of the benign, botnet, brute Force, DDoS, Dos, heartbleed, infiltration, portscan, and web attack flows. It also achieved 0.9998%, 1.000%, 0.9999%, 1.000%, 0.9999%, 1.000%, and 1.0000% in detecting the benign, botnet, brute force, DDoS, DoS, infiltration, and web attack packets in the CSE-CICIDS2018 dataset.   We compared the performance of our models with those of existing models in the literature, as presented in Table 9. In the work of Das et al. [11], the proposed model achieved an accuracy of 92% for the ensemble decision tree, and our En_DT achieved 97.8%, which is about a 5.8% improvement. In addition, while the ensemble based on the neural network (NN), a deep learning model, achieved 99.5%, our BoostedEnsML achieved 100% in all evaluation metrics, showing that the proposed approach is better. On the same dataset as used in our work, the ensemble model based on stacking RF and KNN with DT used as meta-learner in Kim et al. [75] detected the attacks and benign traffic with the accuracy of 99.9%, while our work detected each traffic with 100%, showing 0.1% improvement after handling data imbalance which was not stated in the work of Kim et al. [75]. This indicates that with a balanced dataset integrated with feature selection, the performance of IDS models can be enhanced. There is also a need to evaluate the effect of different feature selection techniques and data imbalance methods on the general behavior of IDS models in detecting and classifying network flows in IoT systems. Our work will try to investigate this idea in future studies.

Conclusions
IoT devices are being used in different facets of human endeavors today, leading to the creation of extensive networks and, consequently, a tremendous amount of network data transmission. In addition to this, cyberattacks are witnessed in IoT systems exponentially, prompting the urgency to develop approaches capable of mitigating these attacks. In this paper, we proposed an ensemble model based on boosting algorithms such as XGB and LGBM. First, we solved the data imbalance problem by using two methods of oversampling technique (ADASYN and SMOTE) and compared the results obtained. A balanced dataset obtained with SMOTE showed better performance than that of ADASYN. This case, notwithstanding, can be relative. We performed several experiments on different ML algorithms, including DT, RF, ET, AD, XGB, and LGBM, and developed an ensemble classifier for each model. In the end, the proposed BoostedEnsML model was developed using the best-performing boosting classifiers (XGB and LGBM), achieving the best performance of 100% in the classification of the various attacks in IoT networks, including DDoS, DoS, web attacks, infiltration, portscan, heartbleed, and botnets. BoostedEnsML outperformed all other ensemble models discussed in the literature. Hence, we obtained a highly efficient, reliable, and accurate IDS model for detecting cyberattacks in IoT networks. In this current work, two ensemble models based on boosting techniques (XGB and LGBM) were used to propose an ensemble model using the stacking technique. Future work will explore more ensemble model approaches and deep learning algorithms to further improve IoT intrusion detection. In addition, we hope to integrate more feature selection techniques to evaluate the effects of different network features on the performance of an ensemble IDS model in preventing IoT-based network intrusion, as well as evaluate more boosting algorithms such as Catboost and GBDT, to develop an ensemble of four boosting classifiers.