A Novel Study: GAN-Based Minority Class Balancing and Machine-Learning-Based Network Intruder Detection Using Chi-Square Feature Selection
Abstract
1. Introduction
- The class imbalance problem is addressed by optimizing the hyperparameters of a generative adversarial network (GAN) model for tabular (numeric) data generation.
- The new data generated from the UNSW-NB15 dataset reduced the class imbalance across the different categories of network attacks.
- Results on the generated dataset were compared with those on the original UNSW-NB15 dataset, which validated the generated data and improved classification precision.
- The chi-square method is used for feature selection, and classical ML methods perform binary and multi-class classification of network attacks on both the original and the newly generated datasets.
- The results and comparative analysis showed that the proposed framework outperforms previous studies, making it a more reliable and valid method of network intrusion detection.
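The chi-square selection named in the contributions ranks each feature by how strongly its values associate with the class label. The paper's own implementation is not reproduced here; the following is a minimal, self-contained sketch of the statistic for a single categorical (or discretized) feature:

```python
from collections import Counter

def chi_square_score(feature, labels):
    """Chi-square statistic between one categorical feature and the class labels.

    Higher scores indicate a stronger feature/class association, so features
    can be ranked by score and only the top-k retained.
    """
    n = len(labels)
    feat_counts = Counter(feature)
    label_counts = Counter(labels)
    joint = Counter(zip(feature, labels))
    score = 0.0
    for f, fc in feat_counts.items():
        for c, lc in label_counts.items():
            expected = fc * lc / n                 # count expected under independence
            observed = joint.get((f, c), 0)
            score += (observed - expected) ** 2 / expected
    return score

# Toy example: a feature perfectly aligned with the label scores higher
# than an uninformative one.
labels  = [0, 0, 0, 0, 1, 1, 1, 1]
f_good  = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
f_noise = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
print(chi_square_score(f_good, labels) > chi_square_score(f_noise, labels))  # True
```

In practice, continuous UNSW-NB15 features would be binned or normalized to non-negative values before scoring, as chi-square operates on frequency counts.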
2. Related Work
3. Proposed Methodology
3.1. GAN-Based Minority Class Data Generation
3.2. Dataset Preprocessing
3.3. Feature Selection Using Chi-Square
3.4. ML Classification
Algorithm 1 Proposed framework: from feature input to classification
Input: UNSW-NB15 dataset features (FVi)
Output: Classification of three datasets D1, D2, D3
Step 1: Take all features (FVi).
Step 2: Separate out the minority-class instances (MVi).
Step 3: Generate new minority-class instances (NMVi) using the hyperparameter-optimized GAN model of the proposed study.
Step 4: Form three datasets: the original UNSW-NB15 dataset (D1); the newly generated minority-class instances from Step 3 combined with the normal-class instances of the original dataset (D2); and the original UNSW-NB15 instances combined with the GAN-generated minority-class instances (D3).
Step 5: Normalize the features of D1, D2, D3 using Equations 1, 2, 3 and 4.
Step 6: Apply chi-square feature selection on D1, D2, D3.
Step 7: Obtain three feature sets (GVi) from the chi-square method on D1, D2, D3.
Step 8: Conduct Experiments 1, 2 and 3 by feeding D1, D2, D3 to the ML classifiers.
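The normalization in Step 5 is defined by the paper's Equations 1-4, which are not reproduced on this page. As an illustrative stand-in, a plain min-max scaling per feature column (a common choice for this kind of preprocessing) can be sketched as:

```python
def min_max_normalize(column):
    """Scale one numeric feature column to [0, 1].

    Illustrative only: the proposed framework's exact normalization is given
    by Equations 1-4 in the paper.
    """
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: map everything to 0
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]

# Each feature column of D1, D2 and D3 is normalized independently.
# Column names here are hypothetical stand-ins for UNSW-NB15 features.
dataset = {"dur": [0.0, 2.0, 4.0], "sbytes": [100, 300, 500]}
normalized = {name: min_max_normalize(col) for name, col in dataset.items()}
print(normalized["dur"])     # [0.0, 0.5, 1.0]
print(normalized["sbytes"])  # [0.0, 0.5, 1.0]
```

Scaling to a common range matters here because chi-square scoring and distance-based classifiers such as KNN are both sensitive to the raw magnitudes of the features.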
4. Results and Discussion
4.1. Datasets Description
4.2. GAN Hyperparameters Optimization and Learning Environment
4.3. Experiment 1: Original UNSW-NB15 Dataset
4.4. Experiment 2: GAN-Based Dataset
4.5. Experiment 3: Original UNSW-NB15 + GAN Dataset
5. Comparison
6. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Folino, G.; Sabatino, P. Ensemble based collaborative and distributed intrusion detection systems: A survey. J. Netw. Comput. Appl. 2016, 66, 1–16.
- Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2019, 2, 1–22.
- Bayerl, P.S.; Karlović, R.; Akhgar, B.; Markarian, G. Community Policing—A European Perspective; Springer: Berlin/Heidelberg, Germany, 2017.
- Li, J.; Qu, Y.; Chao, F.; Shum, H.P.; Ho, E.S.; Yang, L. Machine learning algorithms for network intrusion detection. AI Cybersecur. 2019, 151–179.
- Anderson, J.P. Computer Security Threat Monitoring and Surveillance; Technical Report; James P. Anderson Company: Fort Washington, PA, USA, 1980.
- Hoque, M.S.; Mukit, M.; Bikas, M.; Naser, A. An implementation of intrusion detection system using genetic algorithm. arXiv 2012, arXiv:1204.1336.
- Jianhong, H. Network intrusion detection algorithm based on improved support vector machine. In Proceedings of the 2015 International Conference on Intelligent Transportation, Big Data and Smart City, Halong Bay, Vietnam, 19–20 December 2015; pp. 523–526.
- Zaman, M.; Lung, C.H. Evaluation of machine learning techniques for network intrusion detection. In Proceedings of the NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium, Taipei, Taiwan, 23–27 April 2018; pp. 1–5.
- Vinayakumar, R.; Soman, K.; Poornachandran, P. Applying convolutional neural network for network intrusion detection. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Manipal, Karnataka, India, 13–16 September 2017; pp. 1222–1228.
- Kwon, D.; Natarajan, K.; Suh, S.C.; Kim, H.; Kim, J. An Empirical Study on Network Anomaly Detection Using Convolutional Neural Networks. In Proceedings of the ICDCS, Vienna, Austria, 2–6 July 2018; pp. 1595–1598.
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
- Sun, Y.; Wong, A.K.; Kamel, M.S. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif. Intell. 2009, 23, 687–719.
- Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, Australia, 10–12 November 2015; pp. 1–6.
- Hodo, E.; Bellekens, X.; Hamilton, A.; Tachtatzis, C.; Atkinson, R. Shallow and deep networks intrusion detection system: A taxonomy and survey. arXiv 2017, arXiv:1701.02145.
- Amin, A.; Anwar, S.; Adnan, A.; Nawaz, M.; Howard, N.; Qadir, J.; Hawalah, A.; Hussain, A. Comparing oversampling techniques to handle the class imbalance problem: A customer churn prediction case study. IEEE Access 2016, 4, 7940–7957.
- Aditsania, A.; Saonard, A.L. Handling imbalanced data in churn prediction using ADASYN and backpropagation algorithm. In Proceedings of the 2017 3rd International Conference on Science in Information Technology (ICSITech), Bandung, Indonesia, 25–26 October 2017; pp. 533–536.
- Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 3573–3587.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Ng, W.W.; Hu, J.; Yeung, D.S.; Yin, S.; Roli, F. Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Trans. Cybern. 2014, 45, 2402–2412.
- Almomani, O. A feature selection model for network intrusion detection system based on PSO, GWO, FFA and GA algorithms. Symmetry 2020, 12, 1046.
- Tan, X.; Su, S.; Huang, Z.; Guo, X.; Zuo, Z.; Sun, X.; Li, L. Wireless sensor networks intrusion detection based on SMOTE and the random forest algorithm. Sensors 2019, 19, 203.
- Zhang, H.; Huang, L.; Wu, C.Q.; Li, Z. An effective convolutional neural network based on SMOTE and Gaussian mixture model for intrusion detection in imbalanced dataset. Comput. Netw. 2020, 177, 107315.
- Fu, Y.; Du, Y.; Cao, Z.; Li, Q.; Xiang, W. A Deep Learning Model for Network Intrusion Detection with Imbalanced Data. Electronics 2022, 11, 898.
- Wu, T.; Fan, H.; Zhu, H.; You, C.; Zhou, H.; Huang, X. Intrusion detection system combined enhanced random forest with SMOTE algorithm. Eurasip J. Adv. Signal Process. 2022, 2022, 1–20.
- Mulyanto, M.; Faisal, M.; Prakosa, S.W.; Leu, J.S. Effectiveness of focal loss for minority classification in network intrusion detection systems. Symmetry 2020, 13, 4.
- Rani, M. Effective network intrusion detection by addressing class imbalance with deep neural networks. Multimed. Tools Appl. 2022, 81, 8499–8518.
- Ashrapov, I. Tabular GANs for uneven distribution. arXiv 2020, arXiv:cs.LG/2010.00638.
- Ashrapov, I. GANs for Tabular Data. 2020. Available online: https://github.com/Diyago/GAN-for-tabular-data (accessed on 11 October 2022).
- Zong, W.; Chow, Y.W.; Susilo, W. A two-stage classifier approach for network intrusion detection. In Proceedings of the International Conference on Information Security Practice and Experience, Tokyo, Japan, 25–27 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 329–340.
- Toldinas, J.; Venčkauskas, A.; Damaševičius, R.; Grigaliūnas, Š.; Morkevičius, N.; Baranauskas, E. A novel approach for network intrusion detection using multistage deep learning image recognition. Electronics 2021, 10, 1854.
References | Year | Methods | Dataset | Results |
---|---|---|---|---|
[21] | 2019 | SMOTE to solve class imbalance and classical classification method | KDD 99 Cup | Highest Accuracy = 92.57% |
[22] | 2020 | SMOTE, GMM, and CNN | UNSW-NB15 | Binary: Accuracy = 98.8%, F1 = 95.53% |
[22] | 2020 | SMOTE, GMM, and CNN | UNSW-NB15 | Multi-class: Accuracy = 96.54%, F1 = 97.26% |
[22] | 2020 | SMOTE, GMM, and CNN | CICIDS2017 | Multi-class: Accuracy = 99.85%, F1 = 99.86% |
[25] | 2020 | Focal-loss-based DNN and CNN | NSL-KDD | DNN: Bi.-F1 = 83.92%, M.-F1 = 47.33%; CNN: Bi.-F1 = 84.87%, M.-F1 = 51.96% |
[25] | 2020 | Focal-loss-based DNN and CNN | UNSW-NB15 | DNN: Bi.-F1 = 79.24%, M.-F1 = 98.90%; CNN: Bi.-F1 = 95.57%, M.-F1 = 95.51% |
[23] | 2022 | Bi-LSTM with ADASYN and attention-weight assignment for class imbalance | NSL-KDD | Accuracy = 90.73%, F1-Score = 89.65% |
[24] | 2022 | SMOTE-KNN for class imbalance and random forest for classification | NSL-KDD | Testing Accuracy = 78.47% |
[26] | 2022 | Modified cross-entropy for class imbalance and neural network for classification | NSL-KDD | Accuracy = 85.56% |
[26] | 2022 | Modified cross-entropy for class imbalance and neural network for classification | UNSW-NB15 | Accuracy = 90.76% |
Classes | Original Instances (Train + Test) | GAN-Based Instances (Train + Test) | Original + GAN-Based Instances |
---|---|---|---|
Analysis | 2000 + 677 = 2677 | 1477 + 453 = 1930 | 4607 |
Backdoor | 1746 + 583 = 2329 | 1264 + 366 = 1630 | 3959 |
DoS | 12,264 + 4089 = 16,353 | 9588 + 2746 = 12,334 | 28,687 |
Exploits | 33,393 + 11,132 = 44,525 | 26,217 + 7588 = 33,805 | 78,330 |
Fuzzers | 18,184 + 6062 = 24,246 | 14,089 + 4150 = 18,239 | 42,485 |
Generic | 40,000 + 18,871 = 58,871 | 31,609 + 12,984 = 44,593 | 103,464 |
Reconnaissance | 10,491 + 3496 = 13,987 | 8132 + 2348 = 10,480 | 24,467 |
Shellcode | 1133 + 378 = 1511 | 827 + 207 = 1034 | 2545 |
Worms | 130 + 45 = 175 | 54 + 15 = 69 | 244 |
Normal | 56,000 + 37,000 = 93,000 | 56,000 + 37,000 = 93,000 | 93,000 |
Total (Attacked + Normal) | 164,674 + 93,000 = 257,674 | 124,060 + 93,000 = 217,060 | 288,788 + 93,000 = 381,788 |
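The severity of the imbalance the GAN is meant to mitigate can be quantified directly from the original class counts above, e.g. as the ratio between the largest and smallest class. A small sketch using the table's numbers:

```python
# Original UNSW-NB15 class counts (train + test) from the table above.
original_counts = {
    "Normal": 93000, "Generic": 58871, "Exploits": 44525, "Fuzzers": 24246,
    "DoS": 16353, "Reconnaissance": 13987, "Analysis": 2677,
    "Backdoor": 2329, "Shellcode": 1511, "Worms": 175,
}

majority = max(original_counts, key=original_counts.get)
minority = min(original_counts, key=original_counts.get)
ratio = original_counts[majority] / original_counts[minority]
print(f"{majority} vs {minority}: imbalance ratio = {ratio:.0f}:1")
# Normal vs Worms: imbalance ratio = 531:1
```

A ratio of roughly 531:1 between Normal and Worms illustrates why classifiers trained on the raw dataset tend to neglect the rarest attack categories.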
Serial Number | Parameter | Value |
---|---|---|
1 | Generator time | 1.1 |
2 | Bot filter quantile | 0.0001 |
3 | Top filter quantile | 0.99 |
4 | Loss | RMSE |
5 | Maximum depth | 2 |
6 | Maximum bin | 100 |
7 | Learning rate | 0.001 |
8 | Random state | Yes |
9 | Estimators | 100 |
10 | Batch size | 500 |
11 | Patience | 25 |
Classification | Method | Accuracy | Recall | Precision | F1-Score | G-Mean | AUC | Time (sec) Train + Test = Total |
---|---|---|---|---|---|---|---|---|
Binary | MLP | 92.61% | 92.61% | 92.74% | 92.64% | 0.928 | 0.928 | 126.6 + 0.1 = 126.7 |
Binary | KNN | 93.36% | 93.36% | 93.36% | 93.36% | 0.928 | 0.928 | 0.1 + 13.4 = 13.5 |
Binary | Logistic Regression | 93.36% | 93.36% | 93.36% | 93.36% | 0.84 | 0.85 | 0.1 + 13.4 = 13.5 |
Binary | Decision Tree | 93.97% | 93.97% | 93.97% | 93.97% | 0.935 | 0.935 | 3.2 + 0.0 = 3.2 |
Binary | Random Forest | 95.59% | 95.59% | 95.60% | 95.60% | 0.953 | 0.953 | 8.8 + 0.2 = 9.0 |
Binary | Extra Trees | 95.35% | 95.35% | 95.36% | 95.35% | 0.950 | 0.950 | 7.0 + 0.2 = 7.2 |
Multi-class | MLP | 78.63% | 78.63% | 75.73% | 75.36% | 0.86 | 0.91 | 149.2 + 0.1 = 149.3 |
Multi-class | KNN | 78.41% | 78.41% | 79.40% | 78.82% | 0.87 | 0.78 | 0.1 + 14.7 = 14.8 |
Multi-class | Logistic Regression | 65.47% | 65.47% | 61.53% | 62.59% | 0.77 | 0.78 | 15.2 + 0.0 = 15.2 |
Multi-class | Decision Tree | 81.38% | 81.38% | 81.18% | 80.90% | 0.88 | 0.80 | 5.3 + 0.0 = 5.4 |
Multi-class | Random Forest | 83.36% | 83.36% | 83.31% | 82.35% | 0.90 | 0.89 | 10.9 + 0.6 = 11.5 |
Multi-class | Extra Trees | 83.16% | 83.16% | 82.91% | 82.35% | 0.89 | 0.88 | 7.7 + 0.6 = 8.3 |
Classification | Method | Accuracy | Recall | Precision | F1-Score | G-Mean | AUC | Time (sec) Train + Test = Total |
---|---|---|---|---|---|---|---|---|
Binary | MLP | 91.61% | 91.61% | 91.78% | 91.55% | 0.923 | 0.92 | 70.0 + 0.0 = 70.0 |
Binary | KNN | 93.03% | 93.03% | 93.05% | 93.04% | 0.930 | 0.929 | 0.1 + 11.2 = 11.3 |
Binary | Logistic Regression | 86.56% | 86.56% | 87.12% | 86.34% | 0.847 | 0.852 | 2.9 + 0.0 = 3.0 |
Binary | Decision Tree | 93.80% | 93.80% | 93.80% | 93.80% | 0.937 | 0.937 | 2.5 + 0.0 = 2.5 |
Binary | Random Forest | 95.41% | 95.41% | 95.44% | 95.42% | 0.954 | 0.954 | 7.9 + 0.2 = 8.0 |
Binary | Extra Trees | 95.25% | 95.25% | 95.27% | 95.25% | 0.953 | 0.952 | 5.7 + 0.2 = 5.9 |
Multi-class | MLP | 81.02% | 81.02% | 79.29% | 77.87% | 0.87 | 0.91 | 123.2 + 0.1 = 123.3 |
Multi-class | KNN | 79.95% | 79.95% | 80.62% | 80.21% | 0.87 | 0.766 | 0.1 + 15.1 = 15.3 |
Multi-class | Logistic Regression | 68.62% | 68.62% | 62.22% | 64.27% | 0.77 | 0.78 | 12.1 + 0.0 = 12.2 |
Multi-class | Decision Tree | 82.39% | 82.39% | 82.14% | 82.04% | 0.888 | 0.807 | 9.6 + 0.0 = 9.6 |
Multi-class | Random Forest | 84.53% | 84.53% | 83.84% | 83.58% | 0.90 | 0.88 | 11.8 + 0.5 = 12.3 |
Multi-class | Extra Trees | 84.36% | 84.36% | 83.70% | 83.59% | 0.90 | 0.88 | 9.0 + 0.5 = 9.6 |
Classification | Method | Accuracy | Recall | Precision | F1-Score | G-Mean | AUC | Time (sec) Train + Test = Total |
---|---|---|---|---|---|---|---|---|
Binary | MLP | 95.00% | 95.00% | 95.00% | 94.95% | 0.920 | 0.922 | 216.2 + 0.1 = 216.3 |
Binary | KNN | 95.84% | 95.84% | 95.83% | 95.84% | 0.950 | 0.949 | 0.4 + 60.0 = 60.4 |
Binary | Logistic Regression | 89.39% | 89.39% | 89.35% | 89.14% | 0.849 | 0.855 | 8.2 + 0.0 = 8.2 |
Binary | Decision Tree | 97.68% | 97.68% | 97.68% | 97.68% | 0.971 | 0.972 | 9.6 + 0.1 = 9.7 |
Binary | Random Forest | 98.05% | 98.05% | 98.05% | 98.04% | 0.974 | 0.973 | 30.0 + 0.5 = 30.5 |
Binary | Extra Trees | 98.14% | 98.14% | 98.14% | 98.14% | 0.976 | 0.976 | 23.0 + 1.2 = 24.2 |
Multi-class | MLP | 80.25% | 80.25% | 78.86% | 77.50% | 0.88 | 0.92 | 335.5 + 0.1 = 335.6 |
Multi-class | KNN | 80.69% | 80.69% | 81.81% | 81.20% | 0.88 | 0.83 | 0.2 + 35.3 = 35.5 |
Multi-class | Logistic Regression | 67.38% | 67.38% | 64.50% | 64.64% | 0.78 | 0.74 | 22.9 + 0.0 = 22.9 |
Multi-class | Decision Tree | 86.80% | 86.80% | 87.04% | 86.24% | 0.919 | 0.90 | 6.4 + 0.1 = 6.5 |
Multi-class | Random Forest | 87.38% | 87.38% | 87.84% | 86.67% | 0.923 | 0.939 | 20.2 + 1.0 = 21.2 |
Multi-class | Extra Trees | 87.44% | 87.44% | 87.81% | 86.79% | 0.923 | 0.94 | 15.7 + 1.0 = 16.7 |
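Alongside accuracy, the tables report the G-mean, which is better suited to imbalanced data because it penalizes a classifier that sacrifices the minority class. For the binary case it is the geometric mean of the true-positive and true-negative rates; a minimal sketch of how these columns can be computed from confusion-matrix counts:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and G-mean from binary confusion counts.

    G-mean = sqrt(TPR * TNR), so it stays low whenever either class is
    poorly detected, which is exactly the failure mode of imbalanced data.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    g_mean = math.sqrt(recall * specificity)
    return accuracy, precision, recall, g_mean

# A classifier that ignores the minority class scores high accuracy but G-mean 0:
metrics = binary_metrics(tp=0, fp=0, fn=10, tn=990)
print(metrics[0])   # accuracy 0.99
print(metrics[3])   # G-mean 0.0
```

This is why, in the tables above, a high accuracy paired with a noticeably lower G-mean (e.g. logistic regression) signals uneven performance across the attack and normal classes.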
Study | Year | Methods | Dataset | Results |
---|---|---|---|---|
[29] | 2018 | Two-stage approach for network intruder detection using SMOTE | UNSW-NB15 | Multi-class classification Accuracy = 85.78% |
[25] | 2020 | Focal-loss-based DNN and CNN | UNSW-NB15 | DNN Binary-F1 = 90.41%, Multi-F1 = 39.78%, CNN Binary-F1 = 86.03%, Multi-F1 = 39.52% |
[30] | 2021 | Image-based network intrusion detection and DL-based classification | UNSW-NB15 | ML-Net Binary Micro-Accuracy = 92.87% Multi-class Micro-Accuracy = 72.31% |
[26] | 2022 | Modified cross entropy applied for class imbalance and neural network applied for classification | UNSW-NB15 | Accuracy = 90.76% |
Proposed study | 2022 | GAN-based class balancing and chi-square feature selection-based ML classification | UNSW-NB15 | Binary Accuracy = 98.14% Precision = 98.14% F1-score = 98.14% G-Mean = 0.976, AUC = 0.976 |
Proposed study | 2022 | GAN-based class balancing and chi-square feature selection-based ML classification | UNSW-NB15 | Multi-class Accuracy = 87.44% Precision = 87.81% F1-score = 86.79% G-Mean = 0.923, AUC = 0.94
Share and Cite
Alabrah, A. A Novel Study: GAN-Based Minority Class Balancing and Machine-Learning-Based Network Intruder Detection Using Chi-Square Feature Selection. Appl. Sci. 2022, 12, 11662. https://doi.org/10.3390/app122211662