XGBoost for Imbalanced Multiclass Classification-Based Industrial Internet of Things Intrusion Detection Systems
Abstract
1. Introduction
- We built an XGBoost model to improve the detection of attack types with imbalanced distributions using two benchmark IIoT IDS datasets, X-IIoTID and TON_IoT. The raw datasets were preprocessed in several steps, including normalizing features, encoding labels, and splitting the data into training and testing sets.
- We evaluated the proposed method with imbalanced multiclass classification by measuring two performance metrics, which were the confusion matrix and the learning curve.
- We compared the proposed method’s results with those of related methods. The experimental results showed that our model enhanced the performance of attack detection for imbalanced multiclass classification in IIoT-based IDS datasets and outperformed previous models on the same datasets.
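The preprocessing steps named in the first contribution (feature normalization, label encoding, and a training/testing split) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the helper names and the toy values are hypothetical, and the min–max scaling and 70:30 ratio are the choices described in Section 2.3.

```python
import random

def min_max_scale(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in column]

def encode_labels(labels):
    """Map each distinct label to an integer in 0..n_classes-1 (sorted order)."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[y] for y in labels], classes

def train_test_split_70_30(rows, targets, seed=0):
    """Shuffle the sample indices reproducibly and split them 70:30."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(0.7 * len(idx)))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [targets[i] for i in train],
            [rows[i] for i in test], [targets[i] for i in test])

# Toy data (hypothetical values, not taken from either dataset)
feature = [2.0, 4.0, 6.0, 10.0]
labels = ["normal", "ddos", "normal", "xss"]

scaled = min_max_scale(feature)
encoded, classes = encode_labels(labels)
```

In practice the minimum and maximum would usually be computed on the training split only and then reused for the testing split, so that no information from the test data leaks into the scaling.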
2. Materials and Methods
2.1. ML-Based IoT/IIoT IDS Context
2.2. IIoT Datasets and the Imbalanced Multiclass Problem
2.3. The Proposed Method
- The data processing. We performed feature normalization, label encoding, and splitting of the training and testing datasets from the original TON_IoT and X-IIoTID. In the feature normalization, we used a min–max scale for input features following the formula:

  $$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

  In the label encoding, we converted the non-numerical multiclass output values to numerical data for the ML model to learn. We encoded the target labels (Y) with values between 0 and ($n_{classes} - 1$), where $n_{classes}$ is the number of different values in Y. We used the LabelEncoder function of the Sklearn library in Python for this task. In splitting the training and testing data, we divided the data with a ratio of 70:30, respectively; this ratio determines the number of samples used separately for the training and testing processes of the proposed method.
- XGBoost-based IIoT IDS. XGBoost is a representative sequential ensemble model. We chose the XGBoost model because of its advantages, including learning from its mistakes, extensive hyperparameters for fine-tuning, handling of imbalanced data, and processing of null values. The sequential ensemble method is known as boosting, in which each model attempts to correct the mistakes of the previous models in the sequence. XGBoost is a boosting algorithm that has been shown to boost weak learners in both classification and regression problems. XGBoost creates each new tree by considering the previous prediction value for the given input data and then maximizing the gain in prediction. We present the main concept of the XGBoost-based IIoT IDS in Algorithm 1.

  In Algorithm 1, the training process iteratively adds a new tree that fits the mistakes (residuals) of the prior trees. After that, the process combines the new tree with the previous trees to generate the final prediction. The prediction value is shown in Equation (2).

  $$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{2}$$

  Here, $\hat{y}_i^{(t)}$ is the prediction for sample $x_i$ after $t$ trees, and $f_t$ is the tree added at step $t$. The model selects the leaf node to split based on the largest gain (loss reduction); meanwhile, it continuously measures the node loss during the training process. The model adds one tree at a time by learning a new function that fits the residual of the last prediction. After training, the full set of trees is obtained, each of which contains leaf nodes with corresponding scores. Finally, the predicted value is calculated by adding the related scores of each tree. In order to avoid over-fitting, XGBoost needs to find the optimal solution that balances the reduction of the objective function against the model complexity. Hence, XGBoost takes the Taylor expansion of the loss function up to the second order and adds a regularization term.
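  Therefore, the XGBoost model prediction is shown in Equation (3), where $K$ is the number of trees and $f_k$ is the $k$-th tree:

  $$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \tag{3}$$

  The training objective function of XGBoost includes two parts, the training error and the regularization, as shown in Equation (4) for step $t$:

  $$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k) \tag{4}$$

  Taylor expansion of the training objective: in the step-$t$ objective function above, $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$, so the previous prediction $\hat{y}_i^{(t-1)}$ can be considered as the variable and the $t$-th weak learner $f_t(x_i)$ as the delta change. XGBoost therefore uses a second-order Taylor expansion to approximate the step-$t$ objective function:

  $$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \sum_{k=1}^{t} \Omega(f_k)$$

  where $g_i = \partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$. In the equation above, at the current step $t$, the previous prediction $\hat{y}_i^{(t-1)}$ and all regularization terms before step $t$ are known values; so, they are constants in the step-$t$ objective function. We remove the constant terms (because they do not affect the optimization of the objective function), and we obtain:

  $$\widetilde{\mathrm{Obj}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

  The tree mapping function is defined as follows, where $q$ maps a sample to one of the $T$ leaves and $w_j$ is the score of leaf $j$:

  $$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{d} \rightarrow \{1, \ldots, T\}$$

  Next, we rewrite the objective function with the regularization term $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$:

  $$\widetilde{\mathrm{Obj}}^{(t)} = \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

  Grouping the samples by leaf, with $I_j = \{ i \mid q(x_i) = j \}$, $G_j = \sum_{i \in I_j} g_i$, and $H_j = \sum_{i \in I_j} h_i$, the optimized objective function is as follows:

  $$\widetilde{\mathrm{Obj}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left( H_j + \lambda \right) w_j^2 \right] + \gamma T$$

  Now, the step-$t$ objective function is a function of the leaf weights $w_j$; $G_j$ and $H_j$ are known values, because they depend only on the loss function and the previous-step prediction values. So, we can use the following equation to obtain the best $w_j$ that minimizes the objective function:

  $$\frac{\partial \widetilde{\mathrm{Obj}}^{(t)}}{\partial w_j} = G_j + \left( H_j + \lambda \right) w_j = 0$$

  The optimal $w_j$ is:

  $$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

  Furthermore, the corresponding minimal objective value is:

  $$\widetilde{\mathrm{Obj}}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

  The splitting criterion for the weak learner is as follows. Firstly, we obtain the step-$t$ objective function. Next, we build the $t$-th tree. This tree should be constructed to reduce the objective function value as much as possible. To build this tree, we allow only one node split at a time and search for the best split, that is, the one that causes the greatest reduction. Hence, for each split, we measure how much the tree objective function value is reduced, which is the difference between its value before the node split and its value after the node split. Gain (G) is the amount by which the objective function value is reduced by the split:

  $$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{\left( G_L + G_R \right)^2}{H_L + H_R + \lambda} \right] - \gamma$$

  $I_L$ is the left child leaf of the split; $I_R$ is the right child leaf; and $I = I_L \cup I_R$ is the parent leaf. $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of $g_i$ and $h_i$ over $I_L$ and $I_R$, respectively. For simplicity, each leaf can calculate its Similarity Score ($S$):

  $$S(I) = \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda}$$

  The splitting gain can then be expressed as:

  $$\mathrm{Gain} = \frac{1}{2} \left[ S(I_L) + S(I_R) - S(I) \right] - \gamma$$

  Based on the chosen loss function, $g_i$ and $h_i$ are the first-order and second-order derivatives used to calculate the tree node similarity and the tree leaf output $w_j$. For classification, the loss function is a custom log loss function (as shown in Equation (19)).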
- Classification evaluation metrics. Cortés-Leal et al. [42] used a performance metric to mitigate IIoT attacks by measuring the impact of energy consumption during transmission in IoT and WSN environments. However, we considered other performance metrics, since they suit the ML approach in this work. In particular, we used a confusion matrix (CM) and learning curves to evaluate the performance of our proposed model. A CM contains the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Based on these CM values, we calculated other performance evaluation metrics, including precision (P), recall (R), and the F1 score.

  The precision (P) measures how accurate the model is when it classifies a sample as positive:

  $$P = \frac{TP}{TP + FP}$$

  The recall (R) measures the ability of the model to detect positive samples:

  $$R = \frac{TP}{TP + FN}$$

  The F1 score is the harmonic mean of precision and recall:

  $$F1 = \frac{2 \times P \times R}{P + R}$$

  The learning curve shows how the performance of the model changes as the number of training instances increases. The cross-validation score represents the evaluation performance in the learning curve.
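For a multiclass problem, TP, FP, and FN are obtained per class by treating that class as positive and all others as negative. The sketch below derives precision, recall, and F1 for each class from a confusion matrix; the function name and the 3-class matrix are hypothetical and serve only to illustrate the definitions.

```python
def per_class_metrics(cm):
    """Return {class_index: (precision, recall, f1)} from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    out = {}
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        out[k] = (p, r, f1)
    return out

# Hypothetical 3-class confusion matrix with one dominant class
cm = [
    [90, 5, 5],
    [2, 8, 0],
    [1, 0, 4],
]
metrics = per_class_metrics(cm)

# Macro average: every class counts equally, regardless of its size
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)
```

Reporting the metrics per class (or as a macro average) matters on imbalanced data, because a single overall accuracy figure is dominated by the majority class.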
Algorithm 1: XGBoost-based IDS model.
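The core step of the tree-building loop, scoring a candidate split, can be illustrated with a small sketch. It uses the standard XGBoost expressions for the optimal leaf weight, w* = -G / (H + λ), and for the split gain, ½ [S(left) + S(right) - S(parent)] - γ with S = G² / (H + λ). As a simplifying assumption, it uses the binary log loss, whose derivatives with respect to the raw score are g = p - y and h = p(1 - p), rather than the multiclass loss used in the paper; the function names and the values of λ and γ are hypothetical.

```python
import math

def grad_hess_logloss(y_true, raw_score):
    """First and second derivatives of binary log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)

def leaf_weight(G, H, lam):
    """Optimal leaf output w* = -G / (H + lambda)."""
    return -G / (H + lam)

def similarity(G, H, lam):
    """Similarity score G^2 / (H + lambda) of a node."""
    return G * G / (H + lam)

def split_gain(left, right, lam, gamma):
    """Gain of splitting a parent node into `left` and `right`.

    Each side is a list of (g, h) pairs; the parent is their union.
    """
    GL = sum(g for g, _ in left)
    HL = sum(h for _, h in left)
    GR = sum(g for g, _ in right)
    HR = sum(h for _, h in right)
    return 0.5 * (similarity(GL, HL, lam) + similarity(GR, HR, lam)
                  - similarity(GL + GR, HL + HR, lam)) - gamma

# Toy node: four samples, all starting from raw score 0 (p = 0.5)
labels = [1, 1, 0, 0]
stats = [grad_hess_logloss(y, 0.0) for y in labels]  # (g, h) per sample
lam, gamma = 1.0, 0.0                                # hypothetical settings

# A split that separates the two classes versus one that mixes them
gain_good = split_gain(stats[:2], stats[2:], lam, gamma)
gain_bad = split_gain([stats[0], stats[2]], [stats[1], stats[3]], lam, gamma)

# Output of the leaf holding the two positive samples
w_left = leaf_weight(sum(g for g, _ in stats[:2]),
                     sum(h for _, h in stats[:2]), lam)
```

The class-separating split yields a positive gain, while the mixing split yields a gain of zero, so a greedy search over candidate splits would keep the first and reject the second.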
3. Results and Discussion
3.1. Imbalanced Multiclass Classification
3.2. Performance Learning Curve
3.3. Performance Comparison
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
Abbreviations
IoT | Internet of Things |
IIoT | Industrial Internet of Things |
IDS | Intrusion Detection System |
ML | Machine Learning |
NIDS | Network Intrusion Detection System |
XGBoost | eXtreme Gradient Boosting |
M2M | Machine-to-Machine |
M2P | Machine-to-People |
H2M | Human-to-Machine |
M2H | Machine-to-Human |
MitM | Man-in-the-Middle |
DL | Deep Learning |
RNN | Recurrent Neural Network |
LSTM | Long Short-Term Memory |
GRU | Gated Recurrent Unit |
U2R | User to Root |
LR | Logistic Regression |
LDA | Linear Discriminant Analysis |
SVM | Support Vector Machine |
NB | Naïve Bayes |
kNN | k-Nearest Neighbors |
GBM | Gradient Boosting Machines |
RF | Random Forest |
NN | Neural Network |
AP2PFL-MLP | Asynchronous Peer-to-Peer Federated Learning-Multilayer Perceptron |
TP2SF | Trustworthy Privacy-Preserving Secured Framework |
WSN | Wireless Sensor Network |
DR | Detection Rate |
FAR | False Alarm Rate |
DDoS | Distributed Denial of Service |
CM | Confusion Matrix |
TP | true positive |
TN | true negative |
FP | false positive |
FN | false negative |
References
- Latif, S.; Idrees, Z.; Zou, Z.; Ahmad, J. DRaNN: A deep random neural network model for intrusion detection in industrial IoT. In Proceedings of the 2020 International Conference on UK-China Emerging Technologies (UCET), Glasgow, UK, 20–21 August 2020; pp. 1–4.
- Nwakanma, C.I.; Nwadiugwu, W.; Lee, J.M.; Kim, D.S. Real-Time validation scheme using blockchain technology for Industrial IoT. In Proceedings of the 2019 Korean Institute of Communications and Information Sciences Summer Conference, Jeju, Korea, 19–21 June 2019; pp. 379–382.
- Boyes, H.; Hallaq, B.; Cunningham, J.; Watson, T. The Industrial Internet of Things (IIoT): An analysis framework. Comput. Ind. 2018, 101, 1–12.
- Hafeez, I.; Antikainen, M.; Ding, A.Y.; Tarkoma, S. IoT-KEEPER: Detecting malicious IoT network activity using online traffic analysis at the edge. IEEE Trans. Netw. Serv. Manag. 2020, 17, 45–59.
- Muna, A.H.; Moustafa, N.; Sitnikova, E. Identification of malicious activities in Industrial Internet of Things based on deep learning models. J. Inf. Secur. Appl. 2018, 41, 1–11.
- Defense Use Case. Analysis of the Cyber Attack on the Ukrainian Power Grid. Electricity Information Sharing and Analysis Center (E-ISAC) 388, 2015. Available online: https://africautc.org/wp-content/uploads/2018/05/E-ISAC_SANS_Ukraine_DUC_5.pdf (accessed on 7 May 2022).
- Alladi, T.; Chamola, V.; Zeadally, S. Industrial control systems: Cyberattack trends and countermeasures. Comput. Commun. 2020, 155, 1–8.
- Sitnikova, E.; Foo, E.; Vaughn, R.B. The power of hands-on exercises in SCADA cybersecurity education. In Information Assurance and Security Education and Training; Springer: Berlin/Heidelberg, Germany, 2013; pp. 83–94.
- Dash, S.; Chakraborty, C.; Giri, S.K.; Pani, S.K.; Frnda, J. BIFM: Big-data driven intelligent forecasting model for COVID-19. IEEE Access 2021, 9, 97505–97517.
- Koroniotis, N.; Moustafa, N.; Sitnikova, E. A new network forensic framework based on deep learning for Internet of Things networks: A particle deep framework. Future Gener. Comput. Syst. 2020, 110, 91–106.
- Vaiyapuri, T.; Binbusayyis, A. Application of deep autoencoder as an one-class classifier for unsupervised network intrusion detection: A comparative evaluation. PeerJ Comput. Sci. 2020, 6, e327.
- Le, T.T.H.; Kim, J.; Kim, H. Analyzing effective of activation functions on recurrent neural networks for intrusion detection. J. Multimed. Inf. Syst. 2016, 3, 91–96.
- Le, T.T.H.; Kim, J.; Kim, J.; Thu, H.L.T.; Kim, H. Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection. In Proceedings of the 2016 International Conference on Platform Technology and Service (PlatCon), Jeju, Korea, 15–17 February 2016; pp. 1–5.
- Le, T.T.H.; Kim, J.; Kim, H. An Effective Intrusion Detection Classifier Using Long Short-Term Memory with Gradient Descent Optimization. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Korea, 13–15 February 2017; pp. 1–6.
- Le, T.T.H.; Kang, H.; Kim, H. The Impact of PCA-Scale Improving GRU Performance for Intrusion Detection. In Proceedings of the 2019 International Conference on Platform Technology and Service (PlatCon), Jeju, Korea, 28–30 January 2019; pp. 1–6.
- Le, T.-T.-H.; Kim, Y.; Kim, H. Network Intrusion Detection Based on Novel Feature Selection Model and Various Recurrent Neural Networks. Appl. Sci. 2019, 9, 1392.
- Garcia-Teodoro, P.; Diaz-Verdejo, J.; Maciá-Fernández, G.; Vázquez, E. Anomaly-based network intrusion detection: Techniques, systems and challenges. Comput. Secur. 2009, 28, 18–28.
- Al-Khateeb, H.; Epiphaniou, G.; Reviczky, A.; Karadimas, P.; Heidari, H. Proactive Threat Detection for Connected Cars Using Recursive Bayesian Estimation. IEEE Sens. J. 2018, 18, 4822–4831.
- Le, T.-T.-H.; Park, T.; Cho, D.; Kim, H. An Effective Classification for DoS Attacks in Wireless Sensor Networks. In Proceedings of the 2018 Tenth International Conference on Ubiquitous and Future Networks (ICUFN), Prague, Czech Republic, 3–6 July 2018; pp. 689–692.
- Sornsuwit, P.; Jaiyen, S. A New Hybrid Machine Learning for Cybersecurity Threat Detection Based on Adaptive Boosting. Appl. Artif. Intell. 2019, 33, 462–482.
- Wu, H.; Han, H.; Wang, X.; Sun, S. Research on Artificial Intelligence Enhancing Internet of Things Security: A Survey. IEEE Access 2020, 8, 153826–153848.
- Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J.; Alazab, A. A Novel Ensemble of Hybrid Intrusion Detection System for Detecting Internet of Things Attacks. Electronics 2019, 8, 1210.
- Ullah, I.; Mahmoud, Q.H. A Two-Level Flow-Based Anomalous Activity Detection System for IoT Networks. Electronics 2020, 9, 530.
- Le, T.-T.-H.; Kim, H.; Kang, H.; Kim, H. Classification and Explanation for Intrusion Detection System Based on Ensemble Trees and SHAP Method. Sensors 2022, 22, 1154.
- Alsamiri, J.; Alsubhi, K. Internet of Things Cyber Attacks Detection using Machine Learning. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 627–634.
- Pacheco, J.; Benitez, V.H.; Felix-Herran, L.C.; Satam, P. Artificial Neural Networks-Based Intrusion Detection System for Internet of Things Fog Nodes. IEEE Access 2020, 8, 73907–73918.
- Al-Hawawreh, M.; Sitnikova, E.; Aboutorab, N. Asynchronous Peer-to-Peer Federated Capability-Based Targeted Ransomware Detection Model for Industrial IoT. IEEE Access 2021, 9, 148738–148755.
- Al-Hawawreh, M.; Sitnikova, E.; Aboutorab, N. X-IIoTID: A Connectivity-Agnostic and Device-Agnostic Intrusion Data Set for Industrial Internet of Things. IEEE Internet Things J. 2022, 9, 3962–3977.
- Alsaedi, A.; Moustafa, N.; Tari, Z.; Mahmood, A.; Anwar, A. TON_IoT Telemetry Dataset: A New Generation Dataset of IoT and IIoT for Data-Driven Intrusion Detection Systems. IEEE Access 2020, 8, 165130–165150.
- Kumar, P.; Gupta, G.P.; Tripathi, R. TP2SF: A Trustworthy Privacy-Preserving Secured Framework for sustainable smart cities by leveraging blockchain and machine learning. J. Syst. Archit. 2021, 115, 101954.
- Booij, T.M.; Chiscop, I.; Meeuwissen, E.; Moustafa, N.; den Hartog, F.T.H. ToN_IoT: The Role of Heterogeneity and the Need for Standardization of Features and Attack Types in IoT Network Intrusion Data Sets. IEEE Internet Things J. 2022, 9, 485–496.
- Meidan, Y.; Bohadana, M.; Mathov, Y.; Mirsky, Y.; Shabtai, A.; Breitenbacher, D.; Elovici, Y. N-BaIoT: Network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Comput. 2018, 17, 12–22.
- Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015; pp. 1–6.
- Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst. 2019, 100, 779–796.
- Zhou, J.; Cao, Z.; Dong, X.; Vasilakos, A.V. Security and privacy for cloud-based IoT: Challenges. IEEE Commun. Mag. 2017, 55, 26–33.
- Wan, J.; Tang, S.; Shu, Z.; Li, D.; Wang, S.; Imran, M.; Vasilakos, A.V. Software-defined Industrial Internet of Things in the context of industry 4.0. IEEE Sens. J. 2016, 16, 7373–7380.
- Sengupta, J.; Ruj, S.; Bit, S.D. A comprehensive survey on attacks, security issues and blockchain solutions for IoT and IIoT. J. Netw. Comput. Appl. 2020, 149, 102481.
- Choi, C.; Choi, J. Ontology-based security context reasoning for power IoT-cloud security service. IEEE Access 2019, 7, 110510–110517.
- Unwala, I.; Taqvi, Z.; Lu, J. IoT security: ZWave and Thread. In Proceedings of the 2018 IEEE Green Technologies Conference (GreenTech), Austin, TX, USA, 4–6 April 2018; pp. 176–182.
- Siboni, S.; Sachidananda, V.; Meidan, Y.; Bohadana, M.; Mathov, Y.; Bhairav, S.; Elovici, Y. Security testbed for Internet-of-Things devices. IEEE Trans. Reliab. 2019, 68, 23–44.
- Nguyen, T.G.; Phan, T.V.; Nguyen, B.T.; So-In, C.; Baig, Z.A.; Sanguanpong, S. SeArch: A collaborative and intelligent NIDS architecture for SDN-based cloud IoT networks. IEEE Access 2019, 7, 107678–107694.
- Cortés-Leal, A.; Del-Valle-Soto, C.; Cardenas, C.; Valdivia, L.J.; Del Puerto-Flores, J.A. Performance Metric Analysis for a Jamming Detection Mechanism under Collaborative and Cooperative Schemes in Industrial Wireless Sensor Networks. Sensors 2022, 22, 178.
Dataset | Features | Limitation |
---|---|---|
N-BaIoT [32] | An IoT environment simulation was set up to collect normal status and botnet attacks. This simulation included some IoT devices such as access points, wifi, wired connection, and a router. A small-scale network-based Wireshark for network traffic collecting aimed to reduce many packets in the high-bandwidth network. | There were no telemetry data from IoT sensors and data traces of operating systems. |
UNSW-NB15 [33] | The dataset was created in a network traffic simulation using an IXIA traffic generator and was saved in four CSV files. This dataset contained a normal vector and nine attack vectors. Each vector had 47 features and the class target feature. | The dataset did not contain security events against operating systems and IoT networks. |
Bot-IoT [34] | Large-scale raw packets from different virtual machines were collected. This dataset contained malware events and various botnet attacks with various data features. | The dataset did not contain hacking vectors against traces of operating systems and IoT systems. |
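Device | Target Value | Number of Samples |
---|---|---|
IoT_Fridge | normal | 35,000 |
IoT_Fridge | ddos | 5000 |
IoT_Fridge | injection | 5000 |
IoT_Fridge | backdoor | 5000 |
IoT_Fridge | password | 5000 |
IoT_Fridge | ransomware | 2902 |
IoT_Fridge | xss | 2042 |
IoT_Garage_Door | normal | 35,000 |
IoT_Garage_Door | ddos | 5000 |
IoT_Garage_Door | password | 5000 |
IoT_Garage_Door | backdoor | 5000 |
IoT_Garage_Door | injection | 5000 |
IoT_Garage_Door | ransomware | 2902 |
IoT_Garage_Door | xss | 1156 |
IoT_Garage_Door | scanning | 529 |
IoT_GPS_Tracker | normal | 35,000 |
IoT_GPS_Tracker | password | 5000 |
IoT_GPS_Tracker | backdoor | 5000 |
IoT_GPS_Tracker | injection | 5000 |
IoT_GPS_Tracker | ddos | 5000 |
IoT_GPS_Tracker | ransomware | 2833 |
IoT_GPS_Tracker | xss | 577 |
IoT_GPS_Tracker | scanning | 550 |
IoT_Modbus | normal | 35,000 |
IoT_Modbus | injection | 5000 |
IoT_Modbus | backdoor | 5000 |
IoT_Modbus | password | 5000 |
IoT_Modbus | xss | 577 |
IoT_Modbus | scanning | 529 |
IoT_Motion_Light | normal | 35,000 |
IoT_Motion_Light | ddos | 5000 |
IoT_Motion_Light | password | 5000 |
IoT_Motion_Light | injection | 5000 |
IoT_Motion_Light | backdoor | 5000 |
IoT_Motion_Light | ransomware | 2264 |
IoT_Motion_Light | scanning | 1775 |
IoT_Motion_Light | xss | 449 |
IoT_Thermostat | normal | 35,000 |
IoT_Thermostat | password | 5000 |
IoT_Thermostat | injection | 5000 |
IoT_Thermostat | backdoor | 5000 |
IoT_Thermostat | ransomware | 2264 |
IoT_Thermostat | xss | 449 |
IoT_Thermostat | scanning | 61 |
IoT_Weather | normal | 35,000 |
IoT_Weather | password | 5000 |
IoT_Weather | backdoor | 5000 |
IoT_Weather | ddos | 5000 |
IoT_Weather | injection | 5000 |
IoT_Weather | ransomware | 2865 |
IoT_Weather | xss | 866 |
IoT_Weather | scanning | 529 |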
Output | Target Value | Number of Samples |
---|---|---|
Class 1 | Normal | 421,417 |
Class 1 | RDOS | 141,261 |
Class 1 | Scanning_vulnerability | 52,852 |
Class 1 | Generic_scanning | 50,277 |
Class 1 | BruteForce | 47,241 |
Class 1 | MQTT_cloud_broker_subscription | 23,524 |
Class 1 | Discovering_resources | 23,148 |
Class 1 | Exfiltration | 22,134 |
Class 1 | insider_malcious | 17,447 |
Class 1 | Modbus_register_reading | 5953 |
Class 1 | False_data_injection | 5094 |
Class 1 | C&C | 2863 |
Class 1 | Dictionary | 2572 |
Class 1 | TCP Relay | 2119 |
Class 1 | fuzzing | 1313 |
Class 1 | Reverse_shell | 1016 |
Class 1 | crypto-ransomware | 154 |
Class 1 | MitM | 117 |
Class 2 | Normal | 421,417 |
Class 2 | RDOS | 141,261 |
Class 2 | Reconnaissance | 127,590 |
Class 2 | Weaponization | 67,260 |
Class 2 | Lateral_movement | 31,596 |
Class 2 | Exfiltration | 22,134 |
Class 2 | Tampering | 5094 |
Class 2 | C&C | 2863 |
Class 2 | Exploitation | 1133 |
Class 2 | crypto-ransomware | 154 |
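
The degree of imbalance in the X-IIoTID counts listed above can be quantified with a short calculation. The sketch below takes the Class 1 sample counts from the table, computes the ratio between the largest and smallest classes, and derives inverse-frequency ("balanced") class weights. The weighting scheme is a common heuristic shown for illustration only; the source does not state that the authors applied it.

```python
# Class 1 sample counts for X-IIoTID, copied from the table
counts = {
    "Normal": 421_417,
    "RDOS": 141_261,
    "Scanning_vulnerability": 52_852,
    "Generic_scanning": 50_277,
    "BruteForce": 47_241,
    "MQTT_cloud_broker_subscription": 23_524,
    "Discovering_resources": 23_148,
    "Exfiltration": 22_134,
    "insider_malcious": 17_447,
    "Modbus_register_reading": 5_953,
    "False_data_injection": 5_094,
    "C&C": 2_863,
    "Dictionary": 2_572,
    "TCP Relay": 2_119,
    "fuzzing": 1_313,
    "Reverse_shell": 1_016,
    "crypto-ransomware": 154,
    "MitM": 117,
}

total = sum(counts.values())

# Ratio between the most and least frequent classes
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency weights: n_samples / (n_classes * class_count)
weights = {name: total / (len(counts) * c) for name, c in counts.items()}
```

The Normal class is several thousand times larger than MitM, so the heuristic gives MitM a far larger weight. Such weights could be supplied as per-sample weights during training (for example through the `sample_weight` argument of `XGBClassifier.fit`), which is one way to make errors on rare attack types count more.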
Dataset | Output Class/Device | Precision | Recall | F1 |
---|---|---|---|---|
X-IIoTID | Class 1 | 1.0 | 0.9975 | 0.999 |
X-IIoTID | Class 2 | 1.0 | 0.998 | 0.999 |
TON_IoT | IoT_Fridge | 0.9995 | 0.999 | 0.9993 |
TON_IoT | IoT_Garage_Door | 0.9995 | 1.0 | 0.9995 |
TON_IoT | IoT_GPS_Tracker | 1.0 | 1.0 | 1.0 |
TON_IoT | IoT_Modbus | 1.0 | 1.0 | 1.0 |
TON_IoT | IoT_Motion_Light | 0.999 | 0.9953 | 0.9984 |
TON_IoT | IoT_Thermostat | 0.995 | 0.9957 | 0.996 |
TON_IoT | IoT_Weather | 0.995 | 1.0 | 0.9975 |
Attack Type | DPM–DDM [27] | DT [28] | Proposed Method (XGBoost) |
---|---|---|---|
crypto-ransomware | 100 | 99.86 | 100 |
Exploitation | 64.66 | 98.52 | 100 |
C&C | 99.27 | 89.66 | 100 |
Tampering | 99.72 | 99.47 | 100 |
Exfiltration | 99.98 | 89.76 | 100 |
Lateral_movement | 91.48 | 98.52 | 100 |
Weaponization | 99.76 | 99.97 | 100 |
Reconnaissance | 93 | 99.22 | 99.8 |
RDOS | 99.94 | 99.99 | 100 |
Normal | - | - | 100 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Le, T.-T.-H.; Oktian, Y.E.; Kim, H. XGBoost for Imbalanced Multiclass Classification-Based Industrial Internet of Things Intrusion Detection Systems. Sustainability 2022, 14, 8707. https://doi.org/10.3390/su14148707