Enhancing Machine Learning Model Prediction with Feature Selection for Botnet Intrusion Detection

Baich, Marwa; Sael, Nawal

doi:10.3390/engproc2025112055

Open Access Proceeding Paper

Marwa Baich ^* and Nawal Sael

Laboratory of Information Technology and Modeling, Faculty of Sciences, Casablanca 7955, Morocco

^* Author to whom correspondence should be addressed.

Presented at the 7th edition of the International Conference on Advanced Technologies for Humanity (ICATH 2025), Kenitra, Morocco, 9–11 July 2025.

Eng. Proc. 2025, 112(1), 55; https://doi.org/10.3390/engproc2025112055

Published: 29 October 2025


Abstract

Increased vulnerabilities brought about by the explosive growth of the Internet of Things (IoT) call for improved security measures to protect systems from attacks. Intrusion Detection Systems (IDS) that use machine learning (ML) are essential for identifying vulnerabilities. Among various threats, botnets are particularly challenging due to their persistence and complexity. This study explores the application of ML techniques (RF, NB, DT, KNN, LR, and XGBoost) for intrusion detection in IoT networks, with a focus on handling imbalanced data and applying feature selection methods. On the Bot-IoT dataset, the study used Lasso feature selection and the SMOTE data balancing technique to obtain a high accuracy of 99.99% with low execution times using the XGBoost model.

Keywords:

intrusion detection system; feature selection; machine learning; Internet of Things

1. Introduction

The explosive growth of the IoT has transformed how we interact with the environment by enabling seamless connectivity and large-scale data collection. This greater connectivity has also introduced new security risks, since IoT networks and devices are exposed to various kinds of cyberattacks. One key technique for combating these threats is the IDS, which plays a vital role in monitoring misuse-prone IoT infrastructures and protecting them from malicious activity [1]. Intrusion Detection Systems are designed to detect and raise alerts about abnormal activity that could signify an attempt to compromise the security of a system or network [1]. Botnets are a significant challenge for IDS, as they consist of compromised devices remotely controlled by attackers to execute coordinated attacks such as DDoS. These attacks flood the targeted system with traffic, effectively blocking legitimate users from accessing its services. The decentralized and diverse nature of the IoT devices involved in botnets makes detection difficult, as they can operate silently using minimal resources. This underscores the need for advanced IDS capable of identifying complex attack patterns and distinguishing between normal and malicious activities to effectively counter botnet threats in IoT networks [2].

IoT networks present significant problems for intrusion detection systems because they may have constrained energy resources, restricted computing power, and specialized protocols. Researchers have looked into several methods for creating efficient Intrusion Detection Systems for IoT to address these issues. Using ML methods to identify intricate patterns and anomalies in IoT network data is one promising strategy [3]. Another crucial aspect of IDS for the IoT is the ability to handle the enormous volume of data generated by IoT devices. This necessitates effective methods for data processing and analysis, as well as the flexibility to adjust to the ever-changing IoT network environment. Unfortunately, these datasets often contain many redundant or unnecessary features, which may have an adverse effect on the performance of ML models [4]. Such features can lead to several issues, such as longer training time, since processing and analyzing more features requires more computational resources and time.

This article explores how feature selection (FS) techniques influence the efficiency of machine learning (ML) algorithms. Three approaches (Filter, Wrapper, and Embedded methods) were used with six techniques to classify intrusions using the BoT-IoT dataset. The dataset focuses on botnet attacks and is well suited to modeling real-world, high-risk intrusion detection.

The remainder of this study is structured as follows: The strategies for ML-based IDS that have been developed by researchers are described in Section 2. We give a detailed explanation of our study approach, including information regarding the dataset, preprocessing, feature selection, data balancing techniques, and ML algorithms employed in Section 3. The outcomes of FS and ML models on the BoT-IoT dataset are shown in Section 4. Section 5 presents the conclusion and outlines directions for future work.

2. State-of-the-Art Analysis

Many ML techniques have been explored to efficiently identify anomalies and possible intrusions. For example, in [5], the authors presented a detection system using an Artificial Neural Network (ANN) and the BoT-IoT dataset to identify DDoS attacks. They addressed a significant issue, namely the data imbalance problem, by applying the Synthetic Minority Over-sampling Technique (SMOTE). The authors in [6] used the UNSW-NB15 dataset to evaluate ML techniques, including Logistic Regression (LR), Decision Tree (DT), and XGBoost models, for binary classification. Class imbalance in this dataset was addressed with SMOTE oversampling. Their experiments point to the decision tree performing best, with 94% accuracy. The NSL-KDD dataset is used in [7] as the input for classification tasks. In that study, ML techniques including LR, Support Vector Machine (SVM), and Random Forest (RF) are employed to classify the preprocessed data, and their performance is assessed using precision, recall, and accuracy. The findings indicate that RF and LR achieve accuracies below 78%, while SVM surpasses 98% accuracy. In [8], an ML-based IDS is proposed that uses a custom variant of the RF model. The system is evaluated on the TON-IoT and UNSW-NB15 datasets in terms of accuracy and sensitivity and is compared with nine well-known machine learning algorithms. The findings suggest that the proposed IDS improves IoT security, particularly for resource-constrained devices, and is effective at defending against a variety of network attacks.

Feature selection aims to enhance model efficiency and accuracy by selecting a subset of relevant variables from the full dataset, while also reducing the risk of overfitting. Several studies have employed feature selection and data balancing techniques to improve classification performance in IDS. Many works rely on filter-based FS methods to select discriminative features for attack detection. In [9], a hybrid FS approach combining Information Gain and Gain Ratio was used to select 15, 11, and 11 features for DoS, scanning, and MITM attacks, respectively, followed by Random Forest classification, achieving detection accuracies of 99.95%, 99.96%, and 99.97%, respectively, on the IoTID20 dataset. In [10], an IDS based on LSTM and KNN employed Gain Ratio for FS, yielding 97.28% and 92.29% accuracy, respectively, on the BoT-IoT dataset. Study [11] applied SMOTE to address class imbalance and used correlation-based FS; Random Forest achieved 99.88% accuracy on CICIDS2017. An IDS for Industry 4.0 was proposed in [12], using PCA and Random Forest, achieving 98.9% accuracy and a 97.8% detection rate on BoT-IoT. In [13], six ML algorithms (RF, GB, DT, ANN, NB, LR) were evaluated on BoT-IoT, with RF (99.99%) and ANN (99.91%) showing the best performance. The study in [14] targeted injection attacks in IoT environments using FS and classifiers such as SVM, RF, and DT, with DT achieving 99% accuracy on the AWID dataset. Reference [15] proposed a PSO-based FS approach using XGB, reaching 83% accuracy in multiclass classification and 98% accuracy in binary classification on IoTID20. In [16], FS methods such as ANOVA, F-Test, and RFE were used with SVM, RF, and DT on NSL-KDD, where RF obtained the highest accuracy across attack types: DoS (0.87), Probe (0.86), R2L (0.76), and U2R (0.98). In [17], FS and feature extraction (FE) methods were compared on TON-IoT, with FE yielding better accuracy and lower sensitivity to feature changes; MLP performed best with FE, while DT performed best with FS. Study [18] compared Genetic Algorithm (GA) and correlation-based FS on IoTID20, showing that DT and RF achieved 100% on all metrics when using GA-selected features. Baich et al. [19] evaluated ML models (RF, NB, SVM, DT) with FS (Pearson correlation, Fisher score) and FE (PCA) on NSL-KDD, reporting that DT with Fisher score achieved 99.26% accuracy with a prediction time of only 0.04 s.

Even though many studies have explored feature selection (FS) and data balancing techniques to improve IDS performance, these methods are often treated independently and not fully investigated in combination. A thorough integration of FS and balancing techniques could significantly enhance IDS efficiency, especially in the context of IoT. Performance evaluation typically relies on standard metrics, namely precision, recall, accuracy, and F1-score. However, additional factors such as execution time and false alarm rate are equally critical, as they directly impact the system's responsiveness and reliability in real-world deployment. A recurring trend in IoT security research is the use of simulated datasets such as NSL-KDD, which, despite enabling controlled experimentation, often fail to reflect the complexity and unpredictability of real IoT environments [20]. While FS techniques are commonly used to identify relevant features and reduce dimensionality, data balancing, which is essential for managing class imbalance, is less frequently addressed, and few studies explore their combined application. This gap highlights the need for a more comprehensive approach. In our work, we propose an IoT-focused IDS framework that addresses these limitations by incorporating real-world datasets to better capture network variability. The SMOTE technique is used to mitigate class imbalance, ensuring fair representation of minority classes. At the same time, FS strategies are applied to improve model efficiency and reduce computational cost. We evaluate six FS methods, analyzing their impact on model accuracy and processing time, both of which are essential for real-time detection in IoT systems. Our model evaluation relies on a diverse set of metrics: Accuracy, Precision, Recall, F1-score, and Matthews Correlation Coefficient (MCC), along with execution time and false alarm rate. This comprehensive assessment provides insight into both the effectiveness and efficiency of IDS solutions for practical deployment in IoT networks.

3. Research Methodology

Following dataset collection, the procedure starts with data preprocessing and balancing, where class imbalance is addressed using SMOTE. Next, six FS techniques are evaluated: two filter methods, two embedded methods, and two wrapper methods. After feature selection, a variety of ML techniques are applied: XGBoost, Random Forest (RF), k-Nearest Neighbors (k-NN), Naïve Bayes (NB), Decision Trees (DT), and Logistic Regression (LR). These models are among the most commonly used in state-of-the-art applications. The performance of these models is assessed using execution time, false alarm rate, accuracy (AC), precision (PR), recall (RC), F1-score (F1), and the Matthews correlation coefficient (MCC). This comprehensive investigation aims to identify the intrusion detection methods that perform best in practical situations, as illustrated in Figure 1.

3.1. Datasets Used

The BoT-IoT dataset focuses on botnet attacks. It offers researchers a useful tool for creating and evaluating intrusion detection strategies suited to the unique difficulties of IoT security against botnet vulnerabilities. The full BoT-IoT dataset contains more than 72,000,000 records spread across 74 files, with each row containing 46 features. The four files that make up the extracted 5% subset have a combined size of 1.07 GB and contain approximately 3 million records [21]. Dataset source: BoT-IoT Dataset; UNSW Canberra, Canberra, ACT, Australia.

3.2. Data Preprocessing

In the preprocessing phase, we first address missing values, duplicates, and outliers through data cleaning procedures. After that, categorical variables are transformed into numerical representations via encoding, which makes it easier to incorporate them into the models. In order to prevent any one characteristic from predominating during model training, normalization techniques are also utilized to scale numerical features to a similar range. Data preprocessing steps were essential to ensure the data’s compatibility with machine learning algorithms and improve model performance.
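As an illustration of the cleaning, encoding, and normalization steps described above, the following minimal sketch uses pandas on a toy table. The column names (proto, bytes, attack), the median and "unknown" fill values, and the choice of min-max scaling are assumptions made for this example; the paper does not specify the exact procedures used, and outlier handling is omitted here.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Clean, encode, and min-max scale a flow table (illustrative only)."""
    df = df.drop_duplicates().copy()
    features = [c for c in df.columns if c != label_col]
    for col in features:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Assumption: fill missing numeric values with the column median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Assumption: fill missing categories, then map each to an integer code.
            df[col] = df[col].fillna("unknown").astype("category").cat.codes
    # Min-max normalization to [0, 1]; constant columns are set to 0.
    for col in features:
        lo, hi = df[col].min(), df[col].max()
        df[col] = 0.0 if hi == lo else (df[col] - lo) / (hi - lo)
    return df

# Toy records (hypothetical values, not taken from BoT-IoT).
raw = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp", None, "tcp"],
    "bytes": [100.0, 300.0, 100.0, 200.0, None],
    "attack": [1, 1, 1, 0, 1],
})
clean = preprocess(raw, "attack")
print(clean)
```

The label column is left untouched so that scaling never alters the class values.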

3.3. Feature Selection (FS)

Selecting features is an important step in ML and data analysis, aimed at picking out a smaller set of crucial features from a larger set that are vital for predicting the target variable or reaching the desired outcome [17].

Filter methods in feature selection involve selecting features based on their intrinsic properties, such as statistical characteristics or correlation with the target variable, without directly incorporating a specific machine learning model. Two common techniques used in filter methods are ANOVA (Analysis of Variance) and Fisher’s Score.
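To make the filter idea concrete, the sketch below computes a Fisher score per feature with NumPy and derives the one-way ANOVA F-statistic from it. It uses one common definition of the Fisher score (between-class scatter divided by within-class scatter); the paper does not state which variant was used, and the synthetic data and function names are hypothetical.

```python
import numpy as np

def fisher_score(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-feature ratio of between-class scatter to within-class scatter."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xk = X[y == label]
        between += len(Xk) * (Xk.mean(axis=0) - overall_mean) ** 2
        within += len(Xk) * Xk.var(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero

def anova_f(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """One-way ANOVA F-statistic: the same ratio scaled by degrees of freedom."""
    n, k = len(y), len(np.unique(y))
    return fisher_score(X, y) * (n - k) / (k - 1)

# Synthetic data: one feature separates the classes, the other is pure noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
informative = y * 3.0 + rng.normal(size=400)
noise = rng.normal(size=400)
X = np.column_stack([informative, noise])

scores = fisher_score(X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, best first
print(scores, ranking)
```

Neither score involves a classifier, which is why filter methods are typically cheap to compute.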

Wrapper methods: These techniques evaluate candidate feature subsets by training the actual learning model on them and use the model's own performance indicators to choose features. This study uses two wrapper techniques: Forward Feature Selection (FFS) and Recursive Feature Elimination (RFE).
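The two wrapper strategies can be sketched with scikit-learn's RFE and SequentialFeatureSelector classes. The decision tree estimator, the synthetic data, and the target of three features are assumptions chosen for illustration, not the configuration used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for network-flow data (hypothetical, not BoT-IoT).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
estimator = DecisionTreeClassifier(max_depth=4, random_state=0)

# RFE: fit on all features, drop the least important one, and repeat.
rfe = RFE(estimator, n_features_to_select=3).fit(X, y)

# Forward selection: start from an empty set and add, at each step,
# the feature that most improves the cross-validated score.
ffs = SequentialFeatureSelector(estimator, n_features_to_select=3,
                                direction="forward", cv=3).fit(X, y)

rfe_idx = rfe.get_support(indices=True)
ffs_idx = ffs.get_support(indices=True)
print("RFE keeps:", rfe_idx, "FFS keeps:", ffs_idx)
```

Both selectors retrain the model many times, which is the main reason wrapper methods tend to cost more than filter methods.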

Embedded methods: These methods select features during the model training process, relying on learning algorithms that automatically identify the most important features as they are trained. The embedded techniques employed in this paper are Lasso and Ridge.
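A minimal sketch of embedded selection with Lasso follows, using scikit-learn's SelectFromModel. The L1 penalty drives the coefficients of uninformative features to zero, so the features with nonzero coefficients form the selected subset. The value of alpha, the synthetic data, and the use of a 0/1 label as regression target are assumptions for this example only.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data in which only features 0 and 1 determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

X_std = StandardScaler().fit_transform(X)

# Fit Lasso and keep the features whose coefficients are not shrunk to zero.
selector = SelectFromModel(Lasso(alpha=0.05)).fit(X_std, y)
support = selector.get_support()        # boolean mask over the 10 features
X_reduced = selector.transform(X_std)   # data restricted to selected features
print("selected features:", np.flatnonzero(support))
```

Ridge uses an L2 penalty, which shrinks coefficients without setting them exactly to zero, so a Ridge-based selector would need an explicit magnitude threshold to discard features.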

3.4. Data Balancing

The BoT-IoT dataset presents a significant class imbalance (0.1% Normal and 99% attacks). This unequal class distribution may skew the outcomes of ML models. To address this challenge, we adopted the Synthetic Minority Over-sampling Technique (SMOTE). Instead of simply duplicating samples, this method interpolates between existing instances to create synthetic samples of the minority class. By doing so, it enhances classifier performance on imbalanced datasets and helps reduce classification errors. This effectiveness has made SMOTE one of the most widely adopted and reliable methods for data balancing.
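The interpolation step at the core of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration rather than the implementation used in this study: the function name, the default of five neighbors, and the toy data are assumptions, and a maintained implementation such as the SMOTE class in the imbalanced-learn package would normally be preferred.

```python
import numpy as np

def smote(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    # Pairwise squared distances within the minority class.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)               # a point is not its own neighbor
    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    # Each synthetic point lies on the segment between a sample and a neighbor.
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 samples with 3 features (hypothetical values).
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 3))
synthetic = smote(X_minority, n_new=180)
print(synthetic.shape)
```

Because every synthetic sample is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class.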

4. Results and Discussion

4.1. Machine Learning (ML) Results

In this experiment, six ML algorithms (DT, LR, RF, NB, KNN, and XGBoost) were evaluated using the BoT-IoT dataset, which includes benign and malicious traffic. To evaluate model performance in the highly imbalanced dataset, several metrics were used: Matthews Correlation Coefficient (MCC), precision, recall, F1-score, accuracy, execution time, and false alarm rate (FAR). MCC provided a balanced measure of prediction quality, while precision and recall assessed the model’s ability to avoid false alarms and detect real attacks, respectively. The F1-score balanced precision and recall, and accuracy, although commonly used, could be misleading in imbalanced datasets. Execution time was crucial for assessing model responsiveness in IoT environments, and FAR measured how often normal traffic was misclassified as attacks. The study also explored how various feature selection techniques (Fisher Score, ANOVA, Ridge, Lasso, RFE, and FFS) impacted these metrics. Results showed that some feature selection methods reduced execution time without sacrificing predictive performance, helping to identify the best algorithm-feature selection combinations for efficient IoT intrusion detection systems.
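The following sketch shows how these metrics follow from the four confusion-matrix counts and why accuracy alone can mislead on imbalanced data. It assumes a binary setting with "attack" as the positive class; the paper does not state the positive class or the averaging scheme, and the counts below are hypothetical.

```python
from math import sqrt

def ids_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Binary IDS metrics with 'attack' as the positive class (assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        # FAR: share of normal traffic that is flagged as an attack.
        "far": fp / (fp + tn) if fp + tn else 0.0,
    }

# Hypothetical case: 99,900 attack flows, 100 normal flows, and a
# classifier that labels every flow as an attack.
m = ids_metrics(tp=99_900, fp=100, tn=0, fn=0)
print(m)
```

In this extreme case accuracy is 99.9% even though every normal flow is misclassified, which the MCC of 0 and the FAR of 1.0 make visible.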

Table 1 shows how the different ML algorithms behave in the absence of feature selection. In this case, all algorithms attain very high accuracy of nearly 100%. RF, DT, KNN, and XGBoost all achieve 99.99% accuracy, while NB and LR achieve 99.97% and 99.51%, respectively. This suggests that, because the dataset is well suited to these methods, good performance can be achieved even without dimensionality reduction. However, when feature selection is not used, some algorithms have a long execution time, such as KNN (6888 s), which can be a limitation in real-world applications that require fast processing.

In the following, we explore how various feature selection techniques affect the performance of ML models in intrusion detection. By evaluating multiple algorithms under different FS methods, we aim to demonstrate the impact of FS on both predictive accuracy and computational efficiency.

Table 2 shows that, with filter techniques such as Fisher Score and ANOVA, accuracy and computation time vary substantially across the machine learning algorithms. RF, DT, and XGBoost continue to exhibit very high accuracy (99.99%) in the majority of configurations while requiring reasonable computation times, especially once ANOVA is applied. For example, with ANOVA, the execution time of RF drops from 552 s without feature selection to 287 s at the same accuracy. ANOVA also sharply reduces the computing time of KNN (from 6888 s to 191 s) while maintaining the same accuracy, indicating that feature selection is particularly beneficial for resource-intensive algorithms. Logistic Regression has a lower computing time with both filters (155 s with Fisher Score and 26 s with ANOVA, against 210 s without feature selection), which might be helpful for applications where speed is crucial, despite a noticeable drop in accuracy with the Fisher Score (98.12%).
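The size of these reductions can be expressed as speedup factors. The short calculation below uses only the execution times reported in Table 1 (no feature selection) and in the ANOVA rows of Table 2; the variable names are illustrative.

```python
# (time without FS, time with ANOVA) in seconds, copied from Tables 1 and 2.
times = {
    "RF": (552, 287),
    "NB": (7.5, 1.5),
    "DT": (145, 42),
    "KNN": (6888, 191),
    "LR": (210, 26),
    "XGBoost": (43.68, 14),
}

# Speedup = time before feature selection / time after feature selection.
speedup = {name: round(before / after, 1)
           for name, (before, after) in times.items()}

for name, factor in sorted(speedup.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {factor}x faster with ANOVA")
```

KNN benefits the most, with roughly a 36-fold reduction, while RF gains a little under a factor of two.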

Table 2 also shows that embedded techniques such as Lasso and Ridge maintain high accuracy while often reducing computation time. For instance, RF with Ridge achieves 99.96% accuracy with a computation time of 199 s, a significant reduction from the 552 s reported in Table 1.

The wrapper techniques are Forward Feature Selection (FFS) and Recursive Feature Elimination (RFE). RFE achieves high accuracy for all evaluated algorithms, up to 99.99%. Its execution times vary widely, however, from 2 s for NB to 4761 s for KNN. FFS, on the other hand, often results in shorter execution times and reaches 99.99% accuracy for several algorithms. XGB is particularly notable for its speed, executing in just 13 s while maintaining 99.99% accuracy.

Figure 2 and Figure 3 illustrate the accuracy and execution time of various ML algorithms across different feature selection techniques. While all algorithms achieved over 99% accuracy, execution times varied. KNN had the longest execution time, with RF also requiring significant time. LR and NB showed shorter execution times but lower Matthews’ correlation coefficient values, indicating performance challenges. DT and XGBoost consistently showed strong performance across all metrics, including accuracy, precision, F1, recall, and MCC, while maintaining low execution times, making them ideal for efficient and reliable classification tasks in practical machine learning applications.

4.2. Feature Importance Analysis

The BoT-IoT dataset contained 42 features, and we applied the FS techniques to extract the most relevant ones. This focused our analysis on the most significant aspects of the data, thereby improving the effectiveness of our intrusion detection models.

Within IoT environments, examining which features are selected provides valuable insight into intrusion detection. Features such as Stime, mean, sum, srate, saddr, and bytes are chosen by several of the feature selection methods, highlighting their important role in capturing the network behaviors that reveal harmful or suspicious activity. Their repeated selection reinforces their significance for this task. In contrast, attributes such as dport, seq, rate, and sport were not found to be significant by the selection methods, which implies that their contribution to intrusion detection is limited in this context. By concentrating on the key features shown in Figure 4, these methods have the potential to create better and more accurate models for detecting malicious activities in IoT settings, improving resource allocation and reducing the data overload caused by less useful features.

4.3. Comparison of Our Results with State-of-the-Art Results

To compare different IDS approaches from the state of the art, it is crucial to use the same dataset to ensure fair evaluations and to better understand the strengths and weaknesses of each method under uniform conditions. Table 3 shows the highest accuracy obtained in each study. Our approach attains an accuracy of 99.99% with Lasso and XGBoost, with an execution time of 13 s. Processing time is an important consideration: while the performance of other techniques, including Random Forest and decision trees, is often excellent, their execution times are either missing or unspecified. In conclusion, even though our approach provides remarkable accuracy, the results emphasize the importance of finding a balance between accuracy and processing speed to maximize performance based on the requirements of each application. The BoT-IoT dataset has limitations for real-world cybersecurity applications, as it primarily focuses on IoT infrastructure in a controlled testbed using virtual machines, which may not reflect the complexities of actual IoT networks. Its large size (43 features and 73 million records) also poses significant computational challenges, potentially limiting access for researchers with limited resources.

5. Conclusions

This research emphasizes the significance of careful feature selection in both enhancing model performance and achieving computational efficiency, ultimately improving the practical application of machine learning techniques in real-world scenarios. We used the BoT-IoT dataset to compare the classification performance of several ML techniques (RF, NB, DT, KNN, LR, and XGB) under different feature selection strategies. Our results report accuracy together with other important metrics, such as false alarm rate and execution time. Across the algorithms examined, XGBoost consistently showed high accuracy and fast execution, especially when feature selection techniques were applied. Random Forest and Decision Tree also demonstrated high accuracy across the feature selection methods, showing their reliability in a range of data scenarios. In addition, our study underlines the importance of feature selection for improving model accuracy and making the most of computational resources. Algorithms such as NB, LR, and KNN showed different levels of change in accuracy under the various feature selection techniques, indicating the subtle influence of feature selection on algorithm effectiveness. Future research will explore the potential of deep learning techniques to further enhance classification performance and handle the complexities of real-world IoT data more effectively. These advancements could provide more robust solutions for cybersecurity and other practical applications.

Author Contributions

Conceptualization, N.S. and M.B.; methodology, N.S. and M.B.; software, M.B.; validation, N.S.; formal analysis, N.S. and M.B.; investigation, M.B.; resources, M.B.; data curation, M.B.; writing—original draft preparation, M.B.; writing—review and editing, N.S.; visualization, N.S. and M.B.; supervision, N.S.; project administration, N.S. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

[1] Zaman, S.; Tauqeer, H.; Ahmad, W.; Shah, S.M.A.; Ilyas, M. Implementation of Intrusion Detection System in the Internet of Things: A Survey. In Proceedings of the 2020 IEEE 23rd International Multitopic Conference (INMIC), Bahawalpur, Pakistan, 5–7 November 2020; pp. 1–6.
[2] Sharma, P. Critical Review of Various Intrusion Detection Techniques for Internet of Things. In Proceedings of the 2nd International Conference on Data, Engineering and Applications (IDEA), Bhopal, India, 28–29 February 2020; pp. 1–6.
[3] Alaiz-Moreton, H.; Aveleira-Mata, J.; Ondicol-Garcia, J.; Muñoz-Castañeda, A.L.; García, I.; Benavides, C. Multiclass Classification Procedure for Detecting Attacks on MQTT-IoT Protocol. Complexity 2019, 2019, 6516253.
[4] Rahim, R.; Ahanger, A.S.; Khan, S.M.; Ma, F. Analysis of IDS using Feature Selection Approach on NSL-KDD Dataset. In Proceedings of the SCRS Conference on Intelligent Systems, Bangalore, India, 5–6 September 2022; Volume 26.
[5] Soe, Y.N.; Santosa, P.I.; Hartanto, R. DDoS Attack Detection Based on Simple ANN with SMOTE for IoT Environment. In Proceedings of the 2019 Fourth International Conference on Informatics and Computing (ICIC), Semarang, Indonesia, 16–17 October 2019; pp. 1–5.
[6] Alissa, K.; Alyas, T.; Zafar, K.; Abbas, Q.; Tabassum, N.; Sakib, S. Botnet Attack Detection in IoT Using Machine Learning. Comput. Intell. Neurosci. 2022, 2022, 4515642.
[7] Raghuvanshi, A.; Singh, U.K.; Sajja, G.S.; Pallathadka, H.; Asenso, E.; Kamal, M.; Singh, A.; Phasinam, K. Intrusion Detection Using Machine Learning for Risk Mitigation in IoT-Enabled Smart Irrigation in Smart Farming. J. Food Qual. 2022, 2022, 3955514.
[8] Al-Ambusaidi, M.; Yinjun, Z.; Muhammad, Y.; Yahya, A. ML-IDS: An Efficient ML-Enabled Intrusion Detection System for Securing IoT Networks and Applications. Soft Comput. 2024, 28, 1765–1784.
[9] Maniriho, P.; Niyigaba, E.; Bizimana, Z.; Twiringiyimana, V.; Mahoro, L.J.; Ahmad, T. Anomaly-Based Intrusion Detection Approach for IoT Networks Using Machine Learning. In Proceedings of the 2020 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM), Surabaya, Indonesia, 17–18 November 2020.
[10] Swarna Sugi, S.S.; Ratna, S.R. Investigation of Machine Learning Techniques in Intrusion Detection System for IoT Network. In Proceedings of the 2020 3rd International Conference on Intelligent Sustainable Systems (ICISS), Thoothukudi, India, 3–5 December 2020; pp. 1164–1167.
[11] Ali, M.L.; Thakur, K.; Schmeelk, S.; Debello, J.; Dragos, D. Deep Learning vs. Machine Learning for Intrusion Detection in Computer Networks: A Comparative Study. Appl. Sci. 2025, 15, 1903.
[12] Idouglid, L.; Tkatek, S.; Elfayq, K.; Guezzaz, A. A Novel Anomaly Detection Model for the Industrial Internet of Things Using Machine Learning Techniques. Radioelectron. Comput. Syst. 2024, 2024, 143–151.
[13] Kerrakchou, I.; El Hassan, A.A.; Chadli, S.; Emharraf, M.; Saber, M. Selection of Efficient Machine Learning Algorithm on Bot-IoT Dataset for Intrusion Detection in Internet of Things Networks. Indones. J. Electr. Eng. Comput. Sci. 2023, 31, 1784–1793.
[14] Gaber, T.; El-Ghamry, A.; Hassanien, A.E. Injection Attack Detection Using Machine Learning for Smart IoT Applications. Phys. Commun. 2022, 52, 101685.
[15] Sarwar, A.; Hasan, S.; Khan, W.U.; Ahmed, S.; Marwat, S.N.K. Design of an Advance Intrusion Detection System for IoT Networks. In Proceedings of the 2022 2nd International Conference on Artificial Intelligence (ICAI), Islamabad, Pakistan, 30–31 March 2022; pp. 46–51.
[16] Venkatesan, S. Design an Intrusion Detection System Based on Feature Selection Using ML Algorithms. Math. Stat. Eng. Appl. 2023, 72, 702–710.
[17] Li, J.; Othman, M.S.; Chen, H.; Yusuf, L.M. Optimizing IoT Intrusion Detection System: Feature Selection versus Feature Extraction in Machine Learning. J. Big Data 2024, 11, 36.
[18] Altulaihan, E.; Almaiah, M.A.; Aljughaiman, A. Anomaly Detection IDS for Detecting DoS Attacks in IoT Networks Based on Machine Learning Algorithms. Sensors 2024, 24, 713.
[19] Baich, M.; Hamim, T.; Sael, N.; Chemlal, Y. Machine Learning for IoT Based Networks Intrusion Detection: A Comparative Study. Procedia Comput. Sci. 2022, 215, 742–751.
[20] Yang, Z.; Liu, X.; Li, T.; Wu, D.; Wang, J.; Zhao, Y.; Han, H. A Systematic Literature Review of Methods and Datasets for Anomaly-Based Network Intrusion Detection. Comput. Secur. 2022, 116, 102675.
[21] BoT-IoT Dataset. Available online: https://ieee-dataport.org/documents/bot-iot-dataset (accessed on 11 April 2025).

Figure 1. Attack and anomaly detection process.

Figure 2. Accuracy comparison of different ML algorithms.

Figure 3. Execution time comparison of ML algorithms.

Figure 4. Number of records of each feature.

Table 1. ML algorithms result without feature selection.

Algorithms	Accuracy	Precision	Recall	F1	MCC	FAR	Time (s)
RF	99.99%	99.99%	99.99%	99.99%	98.01%	0.019	552
NB	99.97%	99.99%	99.97%	99.98%	47.34%	0.2475	7.5
DT	99.99%	99.99%	99.99%	99.99%	97.53%	0.019	145
KNN	99.99%	99.99%	99.99%	99.99%	93.00%	0.0099	6888
LR	99.51%	99.99%	99.51%	99.75%	15.60%	0.059	210
XGB	99.99%	100%	99.99%	99.99%	99.50%	0.019	43.68

Table 2. ML algorithms result with Feature Selection.

Techniques	Algorithms	Accuracy	Precision	Recall	F1	MCC	FAR	Time (s)
Fisher score	RF	99.99%	100%	99.99%	99.99%	98.54%	0.25	787
	NB	99.96%	99.99%	99.97%	99.98%	45.67%	0.24	5.76
	DT	99.99%	99.99%	99.99%	99.99%	96.13%	0.019	88
	KNN	99.99%	99.99%	99.99%	99.99%	85.57%	0.059	1881
	LR	98.12%	99.99%	98.12%	99.05%	76%	0.099	155
	XGBoost	99.99%	99.99%	100%	99.99%	97.49%	0.049	23
ANOVA	RF	99.99%	100%	99.99%	99.99%	97.15%	0.029	287
	NB	98.38%	100%	98.38%	99.18%	9.11%	0.2475	1.5
	DT	99.99%	99.99%	99.99%	99.99%	96.13%	0.019	42
	KNN	99.99%	99.99%	99.99%	99.99%	82.91%	0.0099	191
	LR	99.38%	100%	99.38%	99.05%	14.47%	0.099	26
	XGBoost	99.99%	100%	99.99%	99.99%	97.15%	0.049	14
Ridge	RF	99.96%	99.99%	99.96%	99.98%	50.40%	0.049	199
	NB	98.03%	100%	98.03%	99.00%	8.25%	0.0001	1.8
	DT	99.96%	99.99%	99.96%	99.98%	49.46%	0.059	17
	KNN	99.96%	99.99%	99.96%	99.98%	49.68%	0.029	710
	LR	99.98%	99.98%	100%	99.99%	15.85%	0.99	5.20
	XGBoost	99.95%	99.99%	99.95%	99.97%	46.31%	0.039	35
Lasso	RF	99.99%	100%	99.99%	99.99%	98.54%	0.277	300
	NB	99.95%	99.99%	99.95%	99.97%	38.61%	0.25	1.5
	DT	99.99%	100%	99.99%	99.99%	97.15%	0.001	43
	KNN	99.99%	99.99%	99.99%	99.99%	78.61%	0.0198	268
	LR	99.13%	99.99%	99.13%	99.56%	11.97%	0.039	651
	XGB	99.99%	100%	99.99%	99.99%	98.07%	0.001	13
Recursive Feature Elimination	RF	99.98%	99.99%	99.99%	99.99%	96.13%	0.019	297
	NB	99.98%	99.98%	99.99%	99.99%	15.35%	0.910	2
	DT	99.99%	99.99%	99.99%	99.99%	97.00%	0.039	76
	KNN	99.99%	99.99%	99.99%	99.99%	90.00%	0.039	4761
	LR	99.40%	99.99%	99.40%	99.70%	14.82%	0.009	77
	XGB	99.99%	99.99%	99.99%	99.99%	97.06%	0.019	15
Forward feature selection	RF	99.99%	100%	99.99%	99.99%	99.02%	0.01	322
	NB	99.95%	99.99%	99.95%	99.97%	38.16%	0.257	2
	DT	99.99%	99.99%	99.99%	99.99%	98.01%	0.019	44
	KNN	99.99%	99.99%	99.99%	99.99%	89.92%	0.019	3406
	LR	98.45%	99.99%	98.45%	99.22%	8.68%	0.069	83
	XGB	99.99%	99.99%	99.99%	99.99%	97.53%	0.019	13

Table 3. Comparison of IDS approaches on the BoT-IoT dataset.

Ref	FS	ML	Evaluation Metrics	Accuracy	Time
[10]	Gain Ratio	KNN	Geometric mean, Kappa statistics, detection time, accuracy.	92.29%	-
[12]	PCA	RF	Accuracy, Detection Rate, False alarm Rate.	98.9%	-
[13]	Not mentioned	DT	Accuracy, precision, recall, F-measure.	99.99%	Not mentioned
Our approach	Lasso	XGBoost	Accuracy, precision, recall, F-score, MCC, false alarm rate, time (s)	99.99%	13 s

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
