Article

AI-Powered Cybersecurity Models for Training and Testing IoT Devices

Center for Cybersecurity and Forensics Education (C2SAFE), Illinois Institute of Technology, 3300 S Federal St., Chicago, IL 60616, USA
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13073; https://doi.org/10.3390/app152413073
Submission received: 16 October 2025 / Revised: 2 December 2025 / Accepted: 5 December 2025 / Published: 11 December 2025

Abstract

The rapid expansion of the Internet of Things (IoT) has introduced a significant attack surface, making robust security solutions essential. Traditional signature-based methods are often inadequate against modern, versatile threats. Consequently, research has shifted towards AI and machine learning models for intrusion detection. However, much of the existing research suffers from ‘siloed evaluation,’ where models are trained and tested on single, often outdated datasets, leading to poor generalization in real-world, diverse environments. This paper addresses this critical gap by presenting a comprehensive benchmark of leading AI-powered models across a suite of contemporary IoT cybersecurity datasets from the Canadian Institute of Cybersecurity (CIC). We evaluate a range of machine learning and deep learning algorithms, focusing on their detection performance, cross-dataset generalization, and robustness to provide a realistic assessment of their capabilities for securing modern IoT ecosystems.

1. Introduction

Having grown into a vital infrastructure linking billions of devices globally, the Internet of Things (IoT) has impacted every aspect of contemporary life [1]. According to projections, the number of IoT-connected devices will keep increasing at an exponential rate, enabling enhanced healthcare systems, smart homes, and vital industrial operations [2]. However, this rapid growth has created an unparalleled attack surface, turning commonplace items into possible entry points for malicious actors. High-profile attacks, such as the Mirai botnet, have demonstrated the catastrophic potential of insecure IoT devices, capable of launching massive Distributed Denial-of-Service (DDoS) attacks that can disrupt essential internet services [1,3]. Consequently, the development of robust, scalable, and intelligent security solutions is no longer an option but a critical necessity.
Conventional security measures, including Intrusion Detection Systems (IDS) that rely on signatures, have not been adequate to handle the particular difficulties presented by the Internet of Things [4]. These techniques are frequently unsuccessful against new or zero-day attacks and struggle to keep up with the versatile nature of modern malware. To develop more dynamic and adaptable defensive systems, the research community has increasingly turned to artificial intelligence (AI) and machine learning (ML) [5,6]. AI-powered models offer a possible alternative to static, rule-based security because they can learn the intricate patterns of typical network behavior and identify subtle irregularities suggestive of a cyberattack [7].
Although there is a large body of research on applying AI to IoT security, a significant gap remains in the literature: there is no consistent, thorough assessment across a variety of modern datasets. Numerous studies claim excellent detection accuracy, yet they evaluate on only one dataset, which is frequently out of date [8]. This practice produces excessively optimistic performance claims and yields models that are unable to generalize to different network settings, protocols, or attack scenarios, a crucial shortcoming for real-world deployment [9]. A model trained exclusively on smart home traffic, for example, may be entirely ineffective in an Industrial IoT (IIoT) or Internet of Medical Things (IoMT) setting. This “siloed evaluation” approach hinders the ability of the research community to meaningfully compare different models and understand their true capabilities and limitations.
To overcome this critical gap, this paper presents a comprehensive and standardized benchmark of leading AI-powered models across a diverse suite of contemporary IoT cybersecurity datasets from CIC. By leveraging a collection of datasets spanning 2019 to 2024, which includes general IoT, IoMT, MQTT, Edge-IIoT, and ToN-IoT scenarios [10,11,12], we aim to provide a realistic and comprehensive assessment of model performance, focusing on both detection efficacy and cross-domain generalization. Specifically, our research is guided by the following questions:
  • RQ1: How can AI-powered models improve the detection of cyberattacks in diverse IoT environments compared to traditional methods?
  • RQ2: What challenges, such as concept drift and feature space heterogeneity, arise when training and testing AI models across heterogeneous IoT datasets?
To overcome these limitations and answer these research questions, this paper makes the following key contributions:
  • Unified Benchmarking Framework: We establish a standardized preprocessing and evaluation pipeline applied consistently across five distinct, modern IoT datasets (2019–2024), mitigating the bias found in single-dataset studies.
  • Cross-Domain Performance Analysis: We provide a granular analysis of AI model performance across varying IoT contexts, including Industrial IoT (IIoT), Internet of Medical Things (IoMT), and general IoT, highlighting specific protocol-based vulnerabilities (e.g., MQTT).
  • Evaluation of Generalization Capabilities: We empirically evaluate the challenges of feature heterogeneity and concept drift, offering insights into the realistic limitations of deploying models trained on general traffic in specialized environments.
The remainder of this paper is organized as follows. Section 2 reviews existing literature on AI-based IoT security and identifies current limitations. Section 3 provides a detailed overview of the datasets utilized, including their specific attack distributions and the preprocessing techniques applied. Section 4 details the proposed benchmarking methodology, including the machine learning architectures employed and the experimental design. Section 5 presents comprehensive experimental results and performance metrics. Section 6 discusses the implications of these findings with a focus on the challenges of cross-domain deployment. Finally, Section 7 concludes the paper and outlines directions for future work.

2. Related Work

This section reviews the state of the art in IoT security and intrusion detection. We first examine the inherent vulnerabilities and the evolving threat landscape of IoT ecosystems. Subsequently, we analyze the deployment of AI and ML techniques utilizing CIC datasets for intrusion detection. Finally, we critically evaluate the limitations of existing methodologies, specifically focusing on generalization and adversarial robustness, to delineate the research gaps addressed by this study.

2.1. The IoT Threat Landscape

The exponential growth of the Internet of Things has introduced severe security challenges, primarily due to the inherent resource constraints of edge devices. Unlike traditional computing systems, IoT devices often lack the processing power and memory required to support robust cryptographic protocols or on-device security agents such as firewalls [13]. This structural vulnerability has facilitated the emergence of several high-impact attack vectors:
Botnets represent perhaps the most pervasive threat to the IoT ecosystem, leveraging malware to compromise vast fleets of insecure devices. For instance, the Mirai botnet exploited default credentials in IP cameras and home routers to orchestrate unprecedented attacks [3], transforming these devices into zombie nodes capable of executing remote commands. These compromised networks are frequently weaponized to launch Distributed Denial-of-Service (DDoS) attacks, where attackers coordinate simultaneous traffic floods to exhaust the bandwidth or resources of a target server, rendering critical services inaccessible [14]. Furthermore, the threat landscape has evolved beyond simple service disruption to include data exfiltration and extortion. Ransomware variants targeting the IoT (RoT) have emerged as a significant risk, capable of encrypting essential device functionalities, such as smart locks or industrial sensors, and demanding payment for their restoration [15].

2.2. AI-Driven Intrusion Detection Systems

Given the volume and velocity of IoT traffic, manual traffic analysis is infeasible. Consequently, the research community has increasingly adopted AI to automate threat detection. A critical enabler of this research has been the availability of high-quality, labeled datasets from CIC, which provide realistic benchmarks for training algorithms [10,16].
Classical Machine Learning Approaches: Early implementations largely utilized datasets such as CIC-IDS-2017 and CSE-CIC-IDS-2018. Researchers achieved significant success applying classical algorithms like Random Forest [17] and Support Vector Machines (SVM) [18]. These models proved effective at establishing baseline normal behavior and flagging statistical deviations indicative of attacks.
Deep Learning Integration: More recent studies have integrated Deep Neural Networks (DNNs) to capture non-linear relationships in high-dimensional network data [19,20,21]. While these studies frequently report detection accuracies exceeding 99%, these metrics are often derived under controlled conditions within a specific dataset, raising questions regarding their practical viability in dynamic environments.

2.3. Limitations and Research Gaps

Despite the reported high accuracy of AI-based IDS, the literature exhibits critical methodological shortcomings that hinder real-world deployment:
Poor Cross-Domain Generalization: A predominant limitation is the lack of model transferability. An IDS trained exclusively on smart home traffic (General IoT) often fails to generalize to the distinct protocols and behaviors found in the Internet of Medical Things (IoMT) or Industrial IoT (IIoT) [9]. Current literature rarely evaluates models across heterogeneous environments, leading to “siloed” solutions that are overfitted to specific network topologies [22].
Lack of Adversarial Resilience: Furthermore, standard evaluation metrics often neglect the susceptibility of AI models to adversarial machine learning. Sophisticated attackers can introduce subtle perturbations to network traffic, imperceptible to humans, that cause the classifier to misclassify malicious traffic as benign [23,24]. The majority of existing IoT IDS studies fail to benchmark against such evasion techniques, suggesting that reported robustness may be overstated [25,26].
While prior work confirms the potential of AI for IoT security, there is a notable absence of comprehensive benchmarking that addresses both cross-domain generalization and resilience. This study aims to bridge these gaps by systematically evaluating model performance across diverse, modern CIC datasets.

3. Datasets and Preprocessing

To build and test our AI models, we need good data. This section describes the datasets we chose for our experiments. It also explains the steps we took to clean and prepare the data before feeding it to the models. These steps are very important because the quality of the data directly affects how well the AI models can perform.

3.1. Description of the Datasets

We utilized a collection of modern IoT datasets from CIC to evaluate the robustness and cross-domain generalization of our models. We chose these specific datasets because they represent a wide spectrum of IoT environments, ranging from smart homes and industrial control systems to medical devices, and include recent, complex attack vectors. Using multiple heterogeneous datasets is key to our goal of verifying how well models generalize beyond their training domain. Table 1 provides a comprehensive summary of these datasets, detailing the specific domain focus, total sample count, feature dimensions, and the primary attack categories included in each dataset.
  • ToN-IoT Dataset (2021): This dataset comes from a large and realistic network that includes both regular IoT devices and Industrial IoT (IIoT) devices. Its main strength is its diversity and complexity, making it a good test for how models handle a mixed environment. It includes attacks like DDoS, ransomware, and man-in-the-middle [10].
  • Edge-IIoTset (2022): This dataset focuses on Industrial IoT. The data was collected from a network with edge devices, common in smart factories. Its strength is its focus on IIoT-specific attacks and its realistic feature set, which includes 14 types of attacks like MQTT brute force and injection attacks [11].
  • IoT-2023 Dataset: To establish a performance baseline on more generalized threats, we incorporate this dataset, which represents a modern, broad-scope IoT network.
  • MQTT-IDS Dataset: For protocol-specific analysis, we utilize this dataset, which is narrowly focused on the lightweight MQTT messaging protocol common in many sensor devices.
  • IoMT-2024 Dataset: A critical component of our study, this recent collection is centered on the Internet of Medical Things, allowing us to test model performance in the high-stakes healthcare domain [12].
  • ACI-IoT Dataset (2023): Finally, to evaluate model robustness, we leverage this specialized dataset, which contains not only standard attacks but also intentionally crafted adversarial samples designed to evade AI-based detection.

3.2. Data Preprocessing Pipeline

AI models cannot be trained directly on raw network data since it is messy. In order to clean and format the data, we developed a preprocessing pipeline with multiple steps:
  • Data Cleaning: The first step is to handle errors in the data. This includes finding any missing values (NaN) or infinite values. We removed rows with missing values to ensure our data was complete. We also converted all class labels from text to numbers so the models could process them (e.g., ‘Benign’ becomes 0, ‘DDoS’ becomes 1, etc.).
  • Feature Selection: Given the heterogeneity of the datasets, where feature counts varied significantly (often exceeding 80 attributes), we first identified the intersection of common network features across all datasets to ensure consistency. To further optimize the input space, we applied Principal Component Analysis (PCA). This technique projected the high-dimensional data into a lower-dimensional subspace, resulting in a final set of 30 principal components. These 30 components retained the most critical variance of the original data and served as the standardized input vectors for training all machine learning models.
  • Handling Class Imbalance: In any real network, most of the traffic is normal (benign), and only a small part is an attack, which creates an “imbalanced” dataset. To fix this, we used a technique called SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates new, synthetic examples of the attack classes so that the model has more attack data to learn from.
  • Data Normalization: The values of different features can be on very different scales. This difference can confuse some AI models. To solve this, we used Min-Max scaling to change all feature values to be in the same range, from 0 to 1. This helps the models train faster and more effectively.
By using this careful preprocessing pipeline, we make sure that the data we feed into our AI models is clean, balanced, and in the right format. This allows for a fair comparison between the models and helps us understand their true performance.
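The cleaning, feature-intersection, label-encoding, and Min-Max steps described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the column names, the toy rows, and every helper name below are hypothetical, and the PCA and SMOTE steps are deliberately left out.

```python
import math

def common_features(column_sets):
    """Feature names present in every dataset (set intersection), sorted."""
    return sorted(set.intersection(*(set(cols) for cols in column_sets)))

def drop_incomplete(rows, features):
    """Keep only rows whose selected features are all present and finite."""
    return [r for r in rows
            if all(r.get(f) is not None and math.isfinite(r[f]) for f in features)]

def encode_labels(rows, label_key="label"):
    """Map text labels to integers, with 'Benign' fixed at 0."""
    names = sorted({r[label_key] for r in rows} - {"Benign"})
    mapping = {"Benign": 0, **{name: i + 1 for i, name in enumerate(names)}}
    return [mapping[r[label_key]] for r in rows], mapping

def fit_min_max(rows, features):
    """Record the per-feature minimum and maximum."""
    return {f: (min(r[f] for r in rows), max(r[f] for r in rows)) for f in features}

def apply_min_max(rows, features, stats):
    """Scale each feature to [0, 1] using previously recorded statistics."""
    scaled = []
    for r in rows:
        vec = []
        for f in features:
            lo, hi = stats[f]
            vec.append(0.0 if hi == lo else (r[f] - lo) / (hi - lo))
        scaled.append(vec)
    return scaled

# Hypothetical column lists for two datasets with partly different schemas.
columns_a = ["flow_duration", "pkt_rate", "ttl", "label"]
columns_b = ["flow_duration", "pkt_rate", "mqtt_flags", "label"]
features = [c for c in common_features([columns_a, columns_b]) if c != "label"]

# Toy rows; two of them contain an infinite or missing (NaN) value.
rows = [
    {"flow_duration": 2.0, "pkt_rate": 10.0, "label": "Benign"},
    {"flow_duration": 4.0, "pkt_rate": float("inf"), "label": "DDoS"},
    {"flow_duration": 6.0, "pkt_rate": 30.0, "label": "DDoS"},
    {"flow_duration": float("nan"), "pkt_rate": 5.0, "label": "MITM"},
    {"flow_duration": 10.0, "pkt_rate": 20.0, "label": "MITM"},
]
clean = drop_incomplete(rows, features)
y, label_map = encode_labels(clean)
stats = fit_min_max(clean, features)
X = apply_min_max(clean, features, stats)
```

Keeping the fitting step (`fit_min_max`) separate from the transform step makes it possible to learn the scaling statistics on training data alone and reuse them on held-out data.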

4. Methodology

This section outlines the systematic framework employed to evaluate the performance, generalization, and robustness of various AI-powered models for IoT intrusion detection. The framework is intended to offer a thorough benchmark, explicitly addressing our main research questions.

4.1. Selected Intrusion Detection Models

To fully understand which AI methods are most effective for real-world IoT security, we implemented the comprehensive experimental framework illustrated in Figure 1. Within this workflow, we selected seven different models for our benchmark study to represent a wide range of analytical capabilities: classical machine learning models, deep learning architectures that can find complex patterns, and a hybrid approach to improve stability. By testing this variety, we can compare the models fairly and determine their strengths and weaknesses across different IoT environments. Table 2 summarizes our selection, detailing the specific role and justification for including each algorithm in our methodology.
The experiments using these models were organized into three distinct phases: a basic benchmark to compare overall model performance, cross-dataset testing to measure how well the models generalize to new, unseen network environments, and an adversarial evaluation to check model reliability under sophisticated attack. The specific setup for these phases is detailed in the following subsections.

4.2. Experimental Design

To ensure the reproducibility and statistical validity of our results, all experiments were conducted on a dedicated workstation equipped with an AMD Ryzen 9 processor, 32 GB of RAM, and an NVIDIA GeForce RTX 4070 GPU (8 GB VRAM). The software environment utilized Python (v3.13.9), leveraging scikit-learn (v1.7.2) for CPU-based classical models and PyTorch (v2.9.1) for GPU-accelerated deep learning architectures. A fixed random seed of 42 was applied to all data splitting and model initialization procedures. For full reproducibility, the specific hyperparameter configurations and architectural details for all models are listed in Table 3.
The experimental protocol is structured into two distinct phases:
Dataset Partitioning and Specifications: To ensure transparency regarding dataset distribution, the harmonized Global Dataset was strictly partitioned. We allocated 80% of the data for Training. The remaining 20% was set aside as a hold-out set, which was further subdivided: 90% for Testing and 10% for Validation. To explicitly address the class imbalance inherent in IoT telemetry, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the Training set, effectively eliminating imbalance by generating synthetic samples for minority classes. The Testing and Validation sets remained unmodified to reflect real-world distributions. Furthermore, while the datasets primarily feature MQTT, the flow-based features utilized (e.g., packet inter-arrival times) are protocol-agnostic, supporting the models’ generalizability to emerging IoT protocols such as CoAP or NB-IoT.
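The partitioning arithmetic above works out to 80% training, 18% testing, and 2% validation of the full dataset. A minimal sketch of such a split is shown below. It assumes a plain shuffled split (the text does not state whether this split was stratified), and the function name and the sample count of 1000 are hypothetical; only the seed of 42 comes from the text.

```python
import random

def partition(indices, seed=42, train_frac=0.80, test_frac_of_holdout=0.90):
    """Shuffle indices, then split into train / test / validation.

    The hold-out part (1 - train_frac) is itself divided into
    test_frac_of_holdout for testing and the remainder for validation.
    """
    rng = random.Random(seed)          # fixed seed for repeatable splits
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    train, holdout = shuffled[:n_train], shuffled[n_train:]
    n_test = round(len(holdout) * test_frac_of_holdout)
    return train, holdout[:n_test], holdout[n_test:]

train_idx, test_idx, val_idx = partition(range(1000))
print(len(train_idx), len(test_idx), len(val_idx))  # 800 180 20
```

Any oversampling would then be applied to the rows selected by `train_idx` only, leaving the test and validation rows untouched.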
  • Phase 1: Performance Benchmarking (Internal Validation): To establish a reliable baseline, the four source datasets were harmonized (mapping diverse attack labels to a standardized set of classes) and merged to create a unified Global Training Dataset. We employed a 10-fold stratified cross-validation scheme. In each iteration, the dataset was partitioned into ten subsets: nine for training and one for validation.
Data Leakage Prevention: Crucially, to prevent variance contamination, all data-dependent preprocessing steps were isolated within the cross-validation loop. The Principal Component Analysis (PCA) projection was fitted exclusively on the training folds, and the learned components were subsequently used to transform the validation folds. Similarly, SMOTE was applied only to the training data. The validation folds remained composed of pristine, original samples (transformed only by the training-derived PCA) to prevent data leakage and ensure realistic performance estimates.
  • Phase 2: Cross-Dataset Generalization (External Testing): To evaluate the models’ resilience to domain shift and their applicability to real-world scenarios, we constructed a strictly held-out “Unseen Test Set.” This dataset comprises the ACI-IoT 2023 dataset in its entirety, augmented with hold-out samples of two specific attack classes that were present in the Global Dataset but absent in ACI-IoT. Models trained on the full Global Dataset were evaluated on this unseen set without any further retraining or fine-tuning. This step rigorously assesses the transferability of the learned features to new network contexts and heterogeneous feature spaces.
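The leakage-prevention rule described for Phase 1 (fit every data-dependent transform on the training folds only, oversample the training folds only) can be sketched as follows. This is a standard-library illustration under stated assumptions, not the authors' code: a Min-Max scaler stands in for the PCA projection, `oversample` is a simplified SMOTE-like interpolation that picks random same-class partners instead of nearest neighbours, and the toy data and all function names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=42):
    """Spread the indices of each class evenly over k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

def fit_min_max(X):
    """Learn per-column minimum and maximum from the given rows only."""
    columns = list(zip(*X))
    return [min(c) for c in columns], [max(c) for c in columns]

def transform(X, lo, hi):
    """Scale rows with previously learned statistics (stand-in for PCA)."""
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in X]

def oversample(X, y, rng):
    """Balance classes by interpolating between random same-class pairs."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            a, b = rng.choice(rows), rng.choice(rows)
            t = rng.random()
            X_out.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
            y_out.append(label)
    return X_out, y_out

def leakage_free_splits(X, y, k=10, seed=42):
    """Yield (X_train, y_train, X_val, y_val) with train-only fitting."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, seed)
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        lo, hi = fit_min_max(X_train)                  # training folds only
        X_train = transform(X_train, lo, hi)
        X_val = transform([X[i] for i in val_idx], lo, hi)  # reuse train stats
        X_train, y_train = oversample(X_train, y_train, rng)  # training only
        yield X_train, y_train, X_val, [y[i] for i in val_idx]

# Toy imbalanced data: 90 samples of class 0 and 10 of class 1.
X = [[float(i), float(i % 7)] for i in range(100)]
y = [1 if i % 10 == 0 else 0 for i in range(100)]
splits = list(leakage_free_splits(X, y))
```

Because the validation fold is transformed with training-derived statistics, its scaled values may fall slightly outside [0, 1]; that is expected and is part of what makes the estimate realistic.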

5. Results and Evaluation

In this section, we present the empirical results of our experiments and provide a detailed analysis of each model’s performance. Initially, we evaluated every model using 10-fold cross-validation on our training data to establish a performance baseline. Subsequently, we subjected the models to a more rigorous evaluation by testing them on the completely unseen ACI-IoT dataset to determine their ability to generalize to novel attack scenarios.
To guarantee a comprehensive assessment, we did not rely solely on a single score. Instead, we employed a suite of standard evaluation metrics to capture the full performance spectrum, which ranges from the identification of actual threats to the frequency of false positives. These metrics are defined as:
  • Accuracy: The overall percentage of correct predictions out of all predictions made.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • Precision: This metric indicates the reliability of the model when it flags an event as an attack. High precision implies a low rate of false alarms.
$$\text{Precision}\ (P) = \frac{TP}{TP + FP}$$
  • Recall: This measures the proportion of actual attacks that the model successfully identified. High recall indicates that the model rarely misses a threat.
$$\text{Recall}\ (R) = \frac{TP}{TP + FN}$$
  • F1-Score: Given that accuracy can be misleading on imbalanced datasets, the F1-Score provides a balanced view by combining Precision and Recall into a single harmonic mean.
$$\text{F1-Score} = \frac{2 \times P \times R}{P + R}$$
  • AUC: This measures the model’s overall ability to distinguish between classes across all thresholds. It is calculated as the integral of the ROC curve:
$$\text{AUC} = \int_{0}^{1} TPR(fpr)\, d(fpr)$$
  • Training Time: This metric measures the computational efficiency of the learning phase. It is defined as the total wall-clock time required for the model to complete its training process on the dataset.
  • Inference Time: This measures the operational latency of the model. It is defined as the total time taken by the model to process and classify all samples in the testing set (referred to as Support in our classification reports). This metric is crucial for assessing suitability for real-time deployment.
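As an illustration of the definitions above, the following sketch computes the four count-based metrics from confusion-matrix entries and approximates the AUC integral with the trapezoidal rule over the empirical ROC curve. The function names and the example numbers are hypothetical and are not taken from the paper's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(labels, scores):
    """Trapezoidal area under the ROC curve (label 1 = positive class).

    Assumes both classes are present. Samples with tied scores are
    processed together so they form a single point on the curve.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr, tpr = fp / neg, tp / pos
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area

metrics = classification_metrics(tp=90, tn=880, fp=10, fn=20)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(metrics["accuracy"], metrics["precision"], auc)  # 0.97 0.9 0.75
```

For the multi-class results reported per attack category, the same AUC computation would be applied one class at a time, treating that class as positive and all others as negative.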
The following subsections detail the specific performance of each model.

5.1. Support Vector Machine (SVM)

The Support Vector Machine is a powerful classification algorithm known for its effectiveness in high-dimensional spaces [18]. However, its performance is highly dependent on the choice of kernel and regularization parameters.

5.1.1. Training Performance Analysis

In terms of computational efficiency, the SVM exhibited a significant computational cost during the learning phase, requiring a total training time of 2017.63 s. However, this heavy upfront cost yielded a highly efficient model for deployment, achieving a rapid inference time of just 0.8430 s across the entire test set.
Despite this operational speed, the model achieved a modest overall accuracy of 62.30% during the 10-fold cross-validation phase (Table 4). This aggregate score masks highly varied performance across attack categories, as revealed by the classification report in Table 4 and the confusion matrix in Figure 2.
The model demonstrates exceptional proficiency in identifying MITM (42,014 correctly classified) and Scanning_and_Recon attacks (42,197 correctly classified), achieving F1-Scores of 0.93 for both (Figure 2). Conversely, it fails catastrophically in detecting DoS_Botnet attacks. As seen in Figure 2, only 246 out of approximately 45,000 DoS samples were correctly identified, with the vast majority misclassified as Malware Exploit or Benign traffic. This suggests the SVM’s linear boundaries struggled to capture the distributed nature of the botnet traffic.

5.1.2. Cross-Domain Generalization on ACI-IoT Dataset

To assess real-world applicability, the SVM was tested against the ACI-IoT dataset. The resulting ROC curve is shown in Figure 3.
The analysis reveals a stark contrast between training performance and generalization. While Scanning_and_Recon maintained robust detection capabilities (AUC = 0.896), the MITM class, which performed exceptionally well during training, dropped to a critical AUC of 0.214, indicating significant overfitting to the training domain’s specific artifacts.
Notably, Brute Force (AUC = 0.794) and DDoS (AUC = 0.787) showed moderate generalization potential as seen in Figure 3.

5.2. Random Forest (RF)

Random Forest is an ensemble method that constructs a multitude of decision trees, making it robust against overfitting [17].

5.2.1. Training Performance Analysis

In terms of computational efficiency, the Random Forest model required a training duration of 2386.86 s. However, it proved highly efficient during the operational phase, recording a total inference time of just 8.20 s for the entire test set.
The model demonstrated exceptional classification performance, achieving a near-perfect accuracy of 99.77%. As shown in Table 5, the model maintained perfect precision (1.00) across all classes and near-perfect recall, with only a negligible drop to 0.99 for the Benign class.
The confusion matrix in Figure 4 further confirms the model’s high fidelity. Unlike the SVM, the Random Forest successfully distinguished between complex attack vectors with minimal misclassifications.

5.2.2. Cross-Domain Generalization on ACI-IoT Dataset

When tested on the unseen dataset, the RF model showed exceptional generalization, significantly outperforming the SVM.
As illustrated in Figure 5, the model achieved perfect or near-perfect separation for the vast majority of classes, including Malware Exploit (AUC = 1.000), Web Attack (AUC = 1.000), and Benign traffic (AUC = 0.997). This indicates that the tree-based ensemble successfully learned robust, non-linear features that remained valid across different network environments, confirming its suitability as a primary detection engine.

5.3. XGBoost (Extreme Gradient Boosting)

XGBoost is an advanced gradient boosting method renowned for its performance and speed [27].

5.3.1. Training Performance Analysis

XGBoost demonstrated exceptional performance on the training data, achieving an overall accuracy of 99.74%. In terms of computational efficiency, the model required a training time of 1331.97 s and recorded a swift inference time of 7.30 s. As detailed in Table 6, the model attained perfect or near-perfect precision, recall, and F1-scores (rounding to 1.00) across all categories. The confusion matrix in Figure 6 corroborates this result: although there are minor misclassifications, specifically ‘Benign’ traffic occasionally being mislabeled as ‘DoS’ or ‘Brute Force’, the diagonal density remains remarkably high, indicating a low error rate during the training phase.

5.3.2. Cross-Domain Generalization on ACI-IoT Dataset

The robust generalization capability of XGBoost is evident in the ACI-IoT test results, shown in Figure 7. Unlike models that see a sharp performance drop on unseen data, XGBoost maintained high stability. It achieved perfect AUC scores of 1.000 for ‘Malware Exploit’ and ‘Web Attack’ classes. ‘Benign’, ‘Brute Force’, and ‘Scanning’ classes also showed superior separation with AUCs exceeding 0.99. While ‘DDoS’ and ‘MITM’ presented slightly more challenge, the model still retained strong predictive power with AUCs of 0.974 and 0.975, respectively.

5.4. Convolutional Neural Network (CNN)

Transitioning to deep learning, the CNN was adapted to find localized patterns among the features of a network flow [28,29].

5.4.1. Training Performance Analysis

The CNN achieved a remarkable accuracy of 98.99%, with a total training time of 4550.33 s and an inference time of 29.37 s. As shown in Table 7, the model exhibits a highly balanced performance with precision, recall, and F1-scores consistently reaching 0.99 or higher across almost all classes. The confusion matrix in Figure 8 visually confirms this result: although there are minor misclassifications, such as slight overlaps between ‘MITM’ and ‘Malware’, the diagonal dominance indicates that the CNN successfully extracted distinct feature maps for each attack signature during training.

5.4.2. Cross-Domain Generalization on ACI-IoT Dataset

In the generalization test (Figure 9), the CNN demonstrated exceptional robustness, rivaling the tree-based models. It achieved perfect AUC scores of 1.000 for ‘Malware Exploit’ and ‘Web Attack’, and near-perfect scores for ‘Benign’ and ‘Brute Force’ (AUC > 0.99). The ‘Scanning and Recon’ class also saw a significant improvement over earlier iterations with an AUC of 0.984. While the ‘MITM’ class proved the most challenging with an AUC of 0.910, the overall results suggest that the CNN’s feature extraction capabilities are highly resilient to domain shifts.

5.5. Autoencoder

The Autoencoder represents a shift to unsupervised learning, trained only on benign traffic to detect anomalies via high reconstruction error [30,31].

5.5.1. Training Performance and Anomaly Detection

A detection threshold was set at the 95th percentile of benign reconstruction errors (a value of 0.000331). The model demonstrated significant computational efficiency, recording a training time of 438.95 s and a testing time of 3.5659 s. On a benign-only test set (Table 8), the model correctly identified 95% of the traffic, yielding a 5% false positive rate. The confusion matrix in Figure 10 visually represents this result.
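The thresholding rule can be illustrated with a short sketch: take the 95th percentile of reconstruction errors measured on benign traffic, then flag any sample whose error exceeds it. The synthetic error values and the helper names below are hypothetical, and the linear-interpolation percentile is one common convention that may differ from the one used in the experiments.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def is_anomaly(error, threshold):
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return error > threshold

# Hypothetical benign reconstruction errors: 0.00001, 0.00002, ..., 0.001.
benign_errors = [i / 100000 for i in range(1, 101)]
threshold = percentile(benign_errors, 95)

# By construction, about 5% of benign samples land above the threshold.
false_alarms = sum(is_anomaly(e, threshold) for e in benign_errors)
print(false_alarms)  # 5
```

This makes the trade-off explicit: a percentile-based threshold fixes the benign false positive rate in advance (here 5%), and detection quality then depends on how far attack traffic pushes the reconstruction error above that value.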

5.5.2. Generalization and Anomaly Detection Capability

To evaluate its ability to detect threats, the autoencoder was tested on a mixed dataset. The resulting ROC curve (Figure 11) shows an outstanding AUC of 0.975, indicating that reconstruction error is an extremely effective metric for separating normal from anomalous traffic and confirming its power as a robust anomaly detector.

5.6. Gated Recurrent Unit (GRU)

The Gated Recurrent Unit model represents a paradigm shift from static, flow-by-flow analysis to dynamic, sequential analysis. As the first model in our evaluation capable of recognizing temporal patterns across multiple network flows, it is uniquely positioned to detect sophisticated attacks that unfold over time [32].

5.6.1. Training Performance Analysis

The GRU model achieved a robust overall accuracy of 98.62%, validating its efficacy in handling sequential network data. In terms of computational overhead, the model recorded a training time of 2534.67 s and an inference time of 93.63 s. The classification report in Table 9 demonstrates exceptional precision and recall across the majority of attack vectors. Notably, the model achieved perfect or near-perfect F1-scores (0.99–1.00) for high-impact classes such as ‘DoS_Botnet’, ‘MITM’, and ‘Malware_Exploit’. While the performance remains high overall, slight variations were observed in ‘Scanning_and_Recon’ (Precision 0.95) and ‘DDos’ (Recall 0.95), suggesting that these specific attack signatures present a slightly higher degree of complexity for the temporal model.
This high fidelity is further confirmed by the confusion matrix in Figure 12. The matrix displays a dominant diagonal, confirming that the GRU successfully learned the temporal distinctions between classes. The few misclassifications align with the findings in the classification report, primarily occurring within the ‘Scanning_and_Recon’ and ‘DDos’ categories, while distinct attacks like ‘Web_Attack’ are identified with high precision.
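The relationship between a confusion matrix and the per-class figures in a classification report can be made explicit with a short sketch. The matrix below is a three-class toy example, not the data behind Figure 12, and the function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(confusion):
    """Compute precision, recall and F1 per class.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # rest of the row
        fp = sum(confusion[r][c] for r in range(n)) - tp  # rest of the column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics

# Toy 3-class matrix (illustrative values only).
toy_confusion = [
    [90, 10, 0],
    [5, 95, 0],
    [0, 0, 100],
]
toy_metrics = per_class_metrics(toy_confusion)
for idx, m in enumerate(toy_metrics):
    print(idx, {k: round(v, 3) for k, v in m.items()})
```

Off-diagonal mass in a row lowers that class's recall, while off-diagonal mass in a column lowers its precision, which is how a class can show recall of 0.95 with precision of 0.99, or the reverse.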

5.6.2. Cross-Domain Generalization on ACI-IoT Dataset

The GRU’s generalization capabilities are illustrated by the ROC curves in Figure 13. The model yields perfect Area Under the Curve (AUC) scores for Malware_Exploit (AUC = 1.000) and Web_Attack (AUC = 1.000), and near-perfect scores for Brute_Force (AUC = 0.998) and Benign (AUC = 0.996) traffic. The exception is the ‘DDos’ class, whose curve yields a lower AUC of 0.876, consistent with the lower recall reported for this class in Table 9. Despite this, the strong performance across diverse categories like MITM (AUC = 0.966) and Scanning_and_Recon (AUC = 0.983) indicates that the GRU’s temporal features are robust and transferable to unseen network environments.

5.7. Stacking Ensemble (SE)

To leverage the diverse predictive strengths of our individual models, we constructed a stacking ensemble. This model combines the predictions of the top-performing base learners (SVM, RF, XGBoost, CNN, Autoencoder, and GRU), using a sophisticated XGBoost model as a meta-learner. The meta-learner’s task is to intelligently weigh the inputs from the base models to produce a final, more accurate, and robust classification.
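The input to a stacking meta-learner is usually a table of base-model outputs. The sketch below shows only that assembly step, under stated assumptions: the paper does not specify how base predictions were generated (for example, out-of-fold), the model names and values are illustrative, and the function `build_meta_features` is hypothetical. Fitting the XGBoost meta-learner on the resulting rows is omitted.

```python
def build_meta_features(base_outputs):
    """Concatenate each base model's per-sample output vector into one row.

    base_outputs: dict mapping model name -> list of per-sample vectors
    (class probabilities, or a single anomaly score for an autoencoder).
    Returns the model order used and the list of meta-feature rows.
    """
    names = sorted(base_outputs)
    n_samples = len(base_outputs[names[0]])
    if any(len(base_outputs[name]) != n_samples for name in names):
        raise ValueError("all base models must score the same samples")
    rows = []
    for i in range(n_samples):
        row = []
        for name in names:
            row.extend(base_outputs[name][i])
        rows.append(row)
    return names, rows

# Two samples, two base models (illustrative values only).
base_outputs = {
    "xgboost": [[0.9, 0.1], [0.2, 0.8]],   # class probabilities
    "autoencoder": [[0.0001], [0.004]],    # reconstruction error
}
names, meta_rows = build_meta_features(base_outputs)
print(names, meta_rows)
```

A common design choice is to build these rows from predictions on data the base models did not train on, so that the meta-learner does not simply learn to trust overfitted base outputs.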

5.7.1. Training Performance Analysis

The Stacking Ensemble model achieved a near-perfect overall accuracy of 99.98%, demonstrating the immense power of synergistic model combination. Remarkably, this performance was achieved with significant computational efficiency, recording a training time of 303.65 s and an inference time of just 3.32 s. As shown in the classification report in Table 10, the precision, recall, and F1-scores for every attack category round to 1.00. This indicates that the meta-learner effectively synthesized the strengths of the base models to correct individual errors, resulting in a model that is statistically perfect across the vast majority of the 3.6 million test samples.
The confusion matrix in Figure 14 provides a granular view of this performance. While the diagonal is overwhelmingly dominant, proving the model’s robustness, the matrix reveals very specific, minor confusion patterns that the aggregate metrics obscure. For instance, there is a slight overlap between Web Attack and MITM, as well as minor misclassifications of Benign traffic as Brute Force. However, compared to the individual base learners, these errors are negligible, confirming that the ensemble successfully creates exceptional decision boundaries.

5.7.2. Cross-Domain Generalization on ACI-IoT Dataset

The Stacking Ensemble’s capabilities are further validated by its cross-domain generalization performance, illustrated in Figure 15. The model achieves perfect AUC scores of 1.00 for five out of seven categories: Benign, Brute_Force, Malware_Exploit, Scanning_and_Recon, and Web_Attack.
The model shows slightly distinct behaviors for DDoS (AUC = 0.98) and MITM (AUC = 0.97), where the curves indicate a marginal trade-off between sensitivity and specificity compared to the other classes. Despite this, the consistently high AUC values across such diverse attack types confirm that the Stacking Ensemble has learned highly generalized feature representations, making it the most robust candidate for real-world deployment in unpredictable network environments.

6. Discussion

The empirical evaluation presented in Section 5 provides a multidimensional view of AI-powered intrusion detection, revealing critical trade-offs between detection fidelity, computational efficiency, and cross-domain generalization. While previous works often focus solely on detection accuracy, our results highlight that the “optimal” model is highly context-dependent, determined by the specific constraints of the deployment environment. This section synthesizes the performance metrics (Accuracy, F1-Score, AUC, and Operational Latency) to define the optimal operational role for each architecture. A strategic comparison of these findings is presented in Table 11.

6.1. Operational Efficiency vs. Detection Fidelity

For IoT ecosystems, the time required to detect a threat (Inference Time) is often as critical as the accuracy of the detection. The results reveal three distinct performance zones:
  • High-Latency Zone (Deep Learning): The GRU incurred the highest computational costs, requiring 93.63 s for inference, over 10× slower than the tree-based models. This establishes a clear boundary: while the GRU is essential for temporal analysis, its latency prohibits its use as a primary, inline packet filter on low-power edge devices.
  • Balanced Zone (Tree-Based Ensembles): The tree-based models represent the optimal balance for general application. XGBoost outperformed RF in both training speed (1331 s vs. 2386 s) and inference speed (7.30 s vs. 8.20 s), while maintaining nearly identical accuracy (99.74% vs. 99.77%).
  • High-Efficiency Zone (SVM and Stacking): The SVM offered the lowest inference latency (0.84 s) but at the cost of unacceptable detection accuracy (62.30%). Conversely, the Stacking Ensemble, despite its architectural complexity, achieved an inference time of 3.32 s, faster than the individual tree-based models, while delivering the highest accuracy (99.98%). This suggests that the ensemble’s meta-learner efficiently shortcuts the decision process, making it a viable candidate for near-real-time systems.
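The aggregate inference times above can be converted into per-sample latency and throughput, which are often easier to compare against a deployment budget. The sketch below assumes that each reported inference time is the wall-clock time for one pass over the test set whose size appears in the corresponding classification report; the paper does not state this explicitly. The helper names are hypothetical.

```python
# (inference seconds from Table 11, test-set size from Tables 5, 6, 9 and 10)
# Assumption: each time covers one full pass over that test set.
REPORTED = {
    "Random Forest": (8.20, 3_615_379),
    "XGBoost": (7.30, 3_615_379),
    "GRU": (93.63, 6_025_629),          # GRU samples are flow sequences
    "Stacking Ensemble": (3.32, 3_615_378),
}

def per_sample_latency_us(total_seconds, n_samples):
    """Average latency per sample in microseconds."""
    return total_seconds / n_samples * 1e6

def samples_per_second(total_seconds, n_samples):
    """Average throughput in samples per second."""
    return n_samples / total_seconds

for model, (seconds, samples) in REPORTED.items():
    print(f"{model:18s} {per_sample_latency_us(seconds, samples):8.2f} us/sample "
          f"{samples_per_second(seconds, samples):12.0f} samples/s")
```

Note that the GRU was evaluated on a larger test set than the tabular models, so a per-sample comparison differs from a comparison of total inference times.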

6.2. Performance Analysis by Architecture

6.2.1. The Limitations of Linear Boundaries (SVM)

The SVM results serve as a cautionary baseline. Despite its superior inference speed, the model’s inability to handle non-linear feature interactions led to catastrophic failures in detecting complex, distributed attacks. Specifically, it achieved an F1-Score of only 0.32 for DoS Botnet attacks. Furthermore, its generalization collapsed when exposed to the unseen ACI-IoT dataset, with the MITM AUC dropping to 0.214. Consequently, SVMs should be deprecated as standalone IDS engines in modern IoT environments, suitable only for ultra-low-power microcontrollers performing simple port-scan detection.

6.2.2. The Dominance of Tree-Based Ensembles (XGBoost)

XGBoost emerged as the superior standalone engine, demonstrating a “triple-threat” capability. First, it excelled in training efficiency, completing training in about 44% less time than Random Forest (1331 s vs. 2386 s). Second, it demonstrated robust generalization, maintaining an AUC > 0.97 across almost all classes in the unseen dataset, which suggests that it learns transferable attack signatures rather than dataset-specific noise. Finally, it achieved precision that rounds to 1.00 for every class in the test set (Table 6). These characteristics make XGBoost the recommended standard for commercial IoT gateways and Edge-AI processors where a balance of high throughput and security is required.
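The training-efficiency figure can be checked from the training times in Table 11. As a worked calculation, with $t_{\mathrm{RF}}$ and $t_{\mathrm{XGB}}$ introduced here only as shorthand for the two training times:

```latex
\[
\frac{t_{\mathrm{RF}} - t_{\mathrm{XGB}}}{t_{\mathrm{RF}}}
  = \frac{2386 - 1331}{2386}
  = \frac{1055}{2386}
  \approx 0.442,
\qquad
\frac{t_{\mathrm{RF}}}{t_{\mathrm{XGB}}}
  = \frac{2386}{1331}
  \approx 1.79 .
\]
```

The 44% figure is therefore a reduction in training time; expressed as a speed ratio, XGBoost trained roughly 1.8 times as fast as Random Forest.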

6.2.3. Specialized Deep Learning Capabilities

The deep learning models excelled in specific niches that tabular classifiers cannot address:
  • Autoencoder (Zero-Day Detection): With an AUC of 0.975 on the mixed dataset, the Autoencoder showed that reconstruction error is a reliable proxy for malicious intent. Its primary utility is not classification but filtration: identifying novel “unknown unknowns” that supervised models might misclassify as benign.
  • GRU (Temporal Forensics): The GRU’s strength lies in its temporal context. While it suffered from lower recall on DDoS (0.95) compared to XGBoost, it achieved excellent generalization on complex, multi-stage attacks like Reconnaissance (AUC 0.991). However, due to its high inference latency, it is best suited for a retrospective role, analyzing session logs asynchronously rather than filtering live traffic.

6.3. The Stacking Ensemble: Synergy in Defense

The Stacking Ensemble validated the hypothesis that heterogeneous architectures can mitigate individual weaknesses. By combining the statistical rigor of XGBoost, the pattern recognition of CNNs, and the temporal awareness of GRUs, the ensemble achieved 99.98% accuracy. Notably, it achieved perfect cross-domain generalization scores (AUC = 1.00) for five out of seven traffic categories on the unseen ACI-IoT dataset. This makes the Stacking Ensemble the definitive choice for Cloud-Based SIEMs (Security Information and Event Management), where computational resources are abundant and the cost of a False Negative is far higher than the cost of computation.

6.4. Deployment Recommendations

Based on this analysis, we propose a tiered deployment architecture for secure IoT ecosystems:
  • Tier 1 (Device Level): Deploy Autoencoders on edge devices. Their fast inference (3.56 s) and ability to flag anomalies without labeled data make them ideal for signaling when a device is behaving erratically.
  • Tier 2 (Gateway Level): Deploy XGBoost. Its high throughput and generalization capabilities allow it to filter 99.7% of known threats at the network edge, preventing upstream congestion.
  • Tier 3 (Cloud Level): Deploy the Stacking Ensemble and GRU. Traffic flagged as suspicious by Tier 1 or 2 is routed here for deep inspection. The GRU analyzes the temporal sequence, while the Ensemble provides the final, high-confidence classification.
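The tiered architecture above can be sketched as a simple routing function. This is an illustrative outline rather than a deployed design: the callables stand in for the Autoencoder, XGBoost, and Stacking Ensemble/GRU stages, the 0.9 gateway confidence cut-off is a hypothetical parameter not taken from the paper, and only the anomaly threshold (0.000331) comes from Section 5.5.1.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class Verdict:
    tier: str   # which tier produced the final decision
    label: str  # predicted traffic class

def route_flow(
    features: Sequence[float],
    reconstruction_error: Callable[[Sequence[float]], float],          # Tier 1
    gateway_classify: Callable[[Sequence[float]], Tuple[str, float]],  # Tier 2
    cloud_classify: Callable[[Sequence[float]], str],                  # Tier 3
    anomaly_threshold: float = 0.000331,
    gateway_confidence: float = 0.9,  # hypothetical cut-off
) -> Verdict:
    """Pass a flow through the tiers, escalating to the cloud only when flagged."""
    anomalous = reconstruction_error(features) > anomaly_threshold
    label, confidence = gateway_classify(features)
    if not anomalous and confidence >= gateway_confidence:
        return Verdict("gateway", label)
    # Flagged by Tier 1 (anomaly) or Tier 2 (low confidence): deep inspection.
    return Verdict("cloud", cloud_classify(features))

# Stub models for demonstration only.
normal = route_flow([0.1, 0.2], lambda f: 0.0001,
                    lambda f: ("Benign", 0.99), lambda f: "Benign")
suspicious = route_flow([9.0, 9.0], lambda f: 0.01,
                        lambda f: ("Benign", 0.99), lambda f: "Malware_Exploit")
print(normal, suspicious)
```

In this sketch the cloud tier is consulted only for flagged traffic, which reflects the intent of keeping the slower models off the live path.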

7. Conclusions and Future Work

This study presented a comprehensive, cross-domain benchmark of leading AI-powered models for Intrusion Detection Systems (IDS) across a diverse suite of contemporary CIC IoT security datasets. Our methodology directly addressed the critical issue of ‘siloed evaluation’ by systematically testing for cross-dataset generalization, which is essential for real-world IoT deployment. The extensive evaluation allowed us to definitively rank models not just by their laboratory accuracy but by their real-world applicability and resilience to domain shift.

7.1. Answering the Research Questions

The empirical results provide clear and definitive answers to our core research questions, establishing a new baseline for future IoT IDS development:
  • RQ1: How can AI-powered models improve the detection of cyberattacks in diverse IoT environments compared to traditional methods?
AI-powered models offer a profound improvement, achieving near-perfect accuracy and providing superior generalization capabilities. Our benchmark identifies XGBoost and the Gated Recurrent Unit (GRU) as the most robust individual classifiers. The GRU, in particular, demonstrated unparalleled generalization (e.g., AUC of 0.996 for benign traffic on the unseen ACI-IoT dataset) by successfully learning critical temporal patterns that persist across different network environments. Furthermore, the Autoencoder (AUC of 0.975) confirms that unsupervised learning is a highly effective, complementary tool for general anomaly and zero-day attack detection, which is a major capability gap for traditional signature-based systems. The successful models show that AI can move beyond simple pattern matching to understand the context and sequence of network behavior.
  • RQ2: What challenges, such as concept drift and feature space heterogeneity, arise when training and testing AI models across heterogeneous IoT datasets?
Concept drift and feature heterogeneity are significant challenges, as evidenced by the catastrophic failure of the classical SVM (F1-Score of 0.32 for DoS Botnet, Table 4) and the notable performance drop experienced by high-accuracy models like Random Forest during cross-domain testing. This challenge underscores that many models learn features specific to the source network’s traffic (noise), rather than features truly indicative of malice (signal). Our best solution, the Stacking Ensemble, directly overcame these issues by synergistically integrating multiple specialized models, including the Autoencoder for anomaly scoring and the GRU for temporal context, to achieve near-perfect cross-domain performance (e.g., AUC of 1.00 for both benign and reconnaissance traffic, Section 5.7.2). This architecture indicates that robustness against heterogeneity requires a multi-faceted approach, combining feature-based, anomaly-based, and sequential-based detection methods.

7.2. Final Model Recommendations and Future Work

The Stacking Ensemble, leveraging the combined intelligence of its base models, emerged as the state-of-the-art solution for high-stakes, off-device forensic analysis or central IDS in the cloud, offering unparalleled accuracy and generalization. For immediate deployment on resource-constrained edge devices, XGBoost remains the most efficient high-performance model due to its speed, low memory footprint, and strong generalized accuracy. The GRU, while more computationally intensive, is indispensable for systems needing to detect advanced, multi-stage, low-and-slow attacks that rely on temporal stealth. Moving forward, future research must translate these high-performance models into deployable, practical systems:
  • Lightweight Edge Deployment: Developing efficient, optimized versions of high-performing deep learning models (like GRU) and ensemble methods is critical. This involves exploring model quantization, knowledge distillation, and pruning techniques to enable effective detection directly on resource-constrained IoT and edge computing devices.
  • Federated and Continual Learning: Implementing federated learning to train a global model collaboratively across diverse, distributed IoT networks could effectively address concept drift and feature heterogeneity in a privacy-preserving manner, allowing models to continually adapt to new environments without centralized data aggregation.
  • Adversarial Resilience and XAI: Future dataset creation and modeling should explicitly prioritize adversarial machine learning to stress-test model robustness against sophisticated evasion techniques. Furthermore, integrating explainable AI (XAI) is vital to make the complex decisions of ensemble and deep learning models transparent and trustworthy for security analysts.
By focusing on these practical and advanced research areas, the community can transition from achieving high accuracy in the laboratory to deploying truly robust, generalizable, and intelligent security solutions capable of securing the rapidly expanding and evolving IoT ecosystem.

Author Contributions

Conceptualization, S.Q. and K.T.; methodology, S.Q.; software, K.T.; validation, S.Q., K.T., A.H.K. and M.D.; formal analysis, S.Q., M.D. and K.T.; investigation, S.Q., K.T. and A.H.K.; resources, M.D.; data curation, A.H.K.; writing—original draft preparation, S.Q. and K.T.; writing—review and editing, S.Q., M.D., K.T. and A.H.K.; visualization, S.Q. and K.T.; supervision, M.D.; project administration, M.D.; funding acquisition, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Dr Maurice Dawson.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found at the Canadian Institute for Cybersecurity (CIC) website (https://www.unb.ca/cic/datasets/ (accessed on 4 December 2025)) and the UNSW ToN-IoT repository (https://research.unsw.edu.au/projects/toniot-datasets (accessed on 4 December 2025)).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Al-Garadi, M.A.; Mohamed, A.; Al-Ali, A.K.; Du, X.; Ali, I.; Guizani, M.A. A survey of IoT: Applications, security, and private-preserving schemes. J. Netw. Comput. Appl. 2020, 22, 1646–1685. [Google Scholar]
  2. Statista Research Department. Number of Internet of Things (IoT) Connected Devices Worldwide from 2019 to 2030; Statista: Hamburg, Germany, 2024. [Google Scholar]
  3. Antonakakis, M.; April, T.; Bailey, M.; Bernhard, M.; Bursztein, E.; Cochran, J.; Durumeric, Z.; Halderman, J.A.; Invernizzi, L.; Kallitsis, M.; et al. Understanding the Mirai Botnet. In Proceedings of the 26th USENIX Security Symposium, Vancouver, BC, Canada, 16–18 August 2017; pp. 1093–1110. [Google Scholar]
  4. Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2019, 2, 20. [Google Scholar] [CrossRef]
  5. Buczak, A.L.; Guven, E. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 2015, 18, 1153–1176. [Google Scholar] [CrossRef]
  6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  7. Chaabouni, N.; Mosbah, M.; Zemmari, A.; Sauvignac, C.; Faruki, P. Network intrusion detection for IoT security based on learning techniques. IEEE Commun. Surv. Tutor. 2019, 21, 2671–2701. [Google Scholar] [CrossRef]
  8. Hindy, H.; Brosset, D.; Bayne, E.; Seeam, A.K.; Tachtatzis, C.; Atkinson, R.; Bellekens, X. A comprehensive review of the CIC-IDS2017 dataset and a preliminary study on its usage for intrusion detection systems. In Proceedings of the IEEE International Conference on Big Data, Online, 10–13 December 2020; pp. 1–10. [Google Scholar]
  9. Arp, D.; Quiring, E.; Pendlebury, F.; Warnecke, A.; Pierazzi, F.; Wressnegger, C.; Cavallaro, L.; Rieck, K. DOS and DON’TS of Machine Learning in Computer Security. In Proceedings of the 29th USENIX Security Symposium, Virtual, 12–14 August 2020; pp. 3971–3988. [Google Scholar]
  10. Moustafa, N. A new distributed architecture for evaluating AI-based security systems at the edge: Network TON_IoT datasets. Sustain. Cities Soc. 2021, 72, 102994. [Google Scholar] [CrossRef]
  11. Ferrag, M.A.; Friha, O.; Hamouda, D.; Maglaras, L.; Janicke, H. Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset for the Edge of Industrial Internet of Things. IEEE Access 2022, 10, 40281–40306. [Google Scholar] [CrossRef]
  12. Dadkhah, S.; Pinto Neto, E.C.; Ferreira, R.; Molokwu, R.C.; Sadeghi, S.; Ghorbani, A.A. CICIoMT2024: Attack vectors in healthcare devices—A multi-protocol dataset for assessing IoMT device security. Preprints 2024, 28, 101351. [Google Scholar] [CrossRef]
  13. Neshenko, N.; Bou-Harb, E.; Crichigno, J.; Kaddoum, G.; Ghani, N. Demystifying IoT security: An exhaustive survey of security vulnerabilities and defense mechanisms. IEEE Commun. Surv. Tutor. 2019, 21, 2702–2733. [Google Scholar] [CrossRef]
  14. Kolias, C.; Kambourakis, G.; Stavrou, A.; Voas, J. DDoS in the IoT: Mirai and other botnets. Computer 2017, 50, 80–84. [Google Scholar] [CrossRef]
  15. Yaqoob, I.; Ahmed, E.; Rehman, M.H.U.; Ahmed, A.I.A.; Al-Garadi, M.A.; Imran, M.; Guizani, M. The rise of ransomware and emerging security challenges in the Internet of Things. Comput. Netw. 2017, 129, 444–458. [Google Scholar] [CrossRef]
  16. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. A detailed look at the CICIDS2017 dataset. In Proceedings of the International Conference on Information Systems Security and Privacy, Funchal, Portugal, 22–24 January 2018; pp. 172–188. [Google Scholar]
  17. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  18. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  19. Haider, W.; Hu, J.; Slay, J.; Turnbull, B.P.; Xie, Y. A deep CNN-based approach for intrusion detection in the IoT. In Proceedings of the 21st Saudi Computer Society National Computer Conference (NCC), Riyadh, Saudi Arabia, 25–26 April 2018; pp. 1–6. [Google Scholar]
  20. Panwar, H.; Gupta, P.; Monea, M.K.; Kumar, N. A deep learning based approach for intrusion detection in IoT network. In Proceedings of the 10th International Conference on Security of Information and Networks, Jaipur, India, 13–15 October 2017; pp. 33–39. [Google Scholar]
  21. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  22. Muna, A.H.; Moustafa, N.; Sitnikova, E. A comprehensive review of intrusion detection systems for IoT: A deep learning approach. IEEE Internet Things J. 2022, 1–11. [Google Scholar]
  23. Papernot, N.; McDaniel, P.; Jha, S.; Fredrikson, M.; Celik, Z.B.; Swami, A. The limitations of deep learning in adversarial settings. In Proceedings of the IEEE European Symposium on Security and Privacy, Saarbrücken, Germany, 21–24 March 2016; pp. 372–387. [Google Scholar]
  24. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199. [Google Scholar]
  25. Carlini, N.; Wagner, D. Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–24 May 2017; pp. 39–57. [Google Scholar]
  26. Apruzzese, G.; Anderson, H.S.; Drichel, A.; Glukhov, V.; Hemberg, E.; Hoeschele, M.; Lashkari, A.H.; Lew, P.; Lupu, E.C.; Mittal, S. The Role of Machine Learning in Cybersecurity. In Proceedings of the ACM Computing Surveys, New York, NY, USA, 15 January 2023; pp. 1–38. [Google Scholar]
  27. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  28. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  29. Wang, W.; Zhu, M.; Zeng, X.; Ye, X.; Sheng, Y. Malware traffic classification using convolutional neural networks for representation learning. In Proceedings of the 2017 International Conference on Information Networking (ICOIN), Da Nang, Vietnam, 11–13 January 2017; pp. 712–717. [Google Scholar]
  30. Sakurada, M.; Yairi, T. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, Gold Coast, Australia, 2 December 2014; pp. 4–11. [Google Scholar]
  31. Mirsky, Y.; Doitshman, T.; Elovici, Y.; Shabtai, A. Kitsune: An ensemble of autoencoders for online network intrusion detection. In Proceedings of the Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 18–21 February 2018. [Google Scholar]
  32. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
Figure 1. The comprehensive experimental framework. The workflow consists of four main stages: (1) Data Preparation and Unification, (2) Data Preprocessing Pipeline, (3) Model Training and Benchmarking using stratified cross-validation, and (4) Cross-Domain Generalization Testing on unseen datasets.
Figure 2. Confusion Matrix for SVM Model.
Figure 3. Multi-Class ROC curve for the SVM model (SGDClassifier) on the ACI-IoT test dataset.
Figure 4. Confusion Matrix for the Random Forest Model.
Figure 5. Multi-Class ROC curve for the Random Forest model on the ACI-IoT test dataset.
Figure 6. Confusion Matrix for the XGBoost Classifier.
Figure 7. Multi-Class ROC curve for the XGBoost model on the ACI-IoT test dataset.
Figure 8. Confusion Matrix for the CNN Model.
Figure 9. Multi-Class ROC curve for the CNN model on the ACI-IoT test dataset.
Figure 10. Confusion Matrix for the Autoencoder IDS on a Benign-Only Test Set.
Figure 11. Binary ROC curve for the Autoencoder model on a mixed test dataset.
Figure 12. Confusion Matrix for the GRU Model.
Figure 13. Multi-Class ROC curve for the GRU model on the ACI-IoT test dataset.
Figure 14. Confusion Matrix for the Stacking Ensemble Model.
Figure 15. Multi-Class ROC curve for the Stacking Ensemble model on the ACI-IoT test dataset.
Table 1. Summary of Selected IoT Datasets, Domains, and Attack Vectors.
| Dataset | Year | IoT Domain | Key Attack Types | Samples (Approx.) |
|---|---|---|---|---|
| ToN-IoT [10] | 2021 | IIoT and General | DDoS, Ransomware, XSS, Backdoor | 22,339,000 |
| Edge-IIoTset [11] | 2022 | Industrial Edge | MQTT-BF, Injection, MITM, Scanning | 2,200,000 |
| IoT-2023 | 2023 | General IoT | Mirai, Spoofing, Reconnaissance | 46,700,000 |
| ACI-IoT | 2023 | Adversarial | Perturbed Traffic (Adversarial Examples) | 1,200,000 |
| IoMT-2024 [12] | 2024 | Medical (IoMT) | MIM, Data Exfiltration, Spoofing | 8,700,000 |
Table 2. Selection Rationale for AI-Powered Intrusion Detection Models.
| Algorithm | Rationale for Selection (Justification) | Role in Benchmark |
|---|---|---|
| Classical Machine Learning (ML) | | |
| Random Forest (RF) | Chosen as a robust and efficient model. RF performs well on large datasets with many features and naturally resists fitting too closely to the training data (overfitting). | Assesses the performance ceiling of efficient tree-based methods. |
| XGBoost | Selected as a leading gradient boosting technique. It is recognized for its computational speed and ability to achieve top-tier performance by aggressively correcting mistakes from previous iterations. | Represents state-of-the-art ensemble tree methods. |
| SVM | Included as a method that finds the clearest separation boundary between attack and normal data. It provides a strong, non-probabilistic approach for complex classification tasks. | Acts as a strong, non-probabilistic classification benchmark. |
| Deep Learning (DL) Architectures | | |
| CNN | We adapted the CNN to process network flows as a stream of features. It is effective at finding short, local patterns and hidden, hierarchical relationships within the feature set. | Captures local and hierarchical features within the flow data. |
| GRU | Chosen for its efficiency in handling sequential data. The GRU models how network traffic changes over time, which is key to finding multi-stage attacks, using fewer resources than older recurrent networks. | Captures temporal patterns in network sessions with high efficiency. |
| Autoencoder (AE) | Used for unsupervised anomaly detection. It is trained only on normal traffic; any high error in recreating a new flow signals an anomaly or potential unknown (zero-day) attack. | Assesses capacity for unsupervised anomaly and zero-day attack detection. |
| Hybrid Approach | | |
| Stacking Ensemble | Created to combine the strengths of multiple models. The ensemble integrates predictions from the individual top-performing models using a meta-learner to make a final, more stable, and accurate decision. | Aims for superior, generalizable performance across heterogeneous datasets. |
Table 3. Hyperparameter Configuration and Model Architectures.
| Model | Key Hyperparameters | Selected Configuration |
|---|---|---|
| SVM (SGD) | Loss Function | Hinge (Linear SVM) |
| | Alpha (Regularization) | 3.18 |
| Random Forest | n_estimators (Trees) | 100 |
| | Max Depth | None (Full Expansion) |
| XGBoost | Learning Rate | 0.1 |
| | Max Depth | 10 |
| | n_estimators | 1000 |
| CNN | Architecture | 2 Conv1D Layers (32, 128 filters) |
| | Kernel Size | 3 |
| | Optimizer | Adam (lr = 0.001) |
| GRU | Hidden Units | 128 (2 Layers) |
| | Sequence Length | 4 flows |
| | Dropout Rate | 0.2 |
| Autoencoder | Latent Dimension | 16 |
| | Anomaly Threshold | 0.000331 (95th Percentile) |
| Stacking Ensemble | Meta-Learner | XGBoost Classifier |
Table 4. Classification Report for the SVM Model.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 0.25 | 0.30 | 0.27 | 451,922 |
| Brute Force | 0.78 | 0.74 | 0.76 | 451,922 |
| DDoS | 0.91 | 0.75 | 0.82 | 451,922 |
| DoS Botnet | 0.46 | 0.24 | 0.32 | 451,922 |
| Malware Exploit | 0.33 | 0.49 | 0.39 | 451,923 |
| MITM | 0.91 | 0.90 | 0.90 | 451,923 |
| Scanning and Recon | 0.84 | 0.93 | 0.88 | 451,922 |
| Web Attack | 0.70 | 0.64 | 0.67 | 451,923 |
| Weighted Avg | 0.65 | 0.62 | 0.63 | 3,615,379 |
Table 5. Classification Report for the Random Forest Model.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 1.00 | 0.99 | 1.00 | 451,922 |
| Brute Force | 1.00 | 1.00 | 1.00 | 451,922 |
| DDoS | 1.00 | 1.00 | 1.00 | 451,922 |
| DoS Botnet | 1.00 | 1.00 | 1.00 | 451,922 |
| Malware Exploit | 1.00 | 1.00 | 1.00 | 451,923 |
| MITM | 1.00 | 1.00 | 1.00 | 451,923 |
| Scanning and Recon | 1.00 | 1.00 | 1.00 | 451,922 |
| Web Attack | 1.00 | 1.00 | 1.00 | 451,923 |
| Weighted Avg | 1.00 | 1.00 | 1.00 | 3,615,379 |
Table 6. Classification Report for the XGBoost Model.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 1.00 | 0.99 | 1.00 | 451,922 |
| Brute Force | 1.00 | 1.00 | 1.00 | 451,922 |
| DDoS | 1.00 | 1.00 | 1.00 | 451,922 |
| DoS Botnet | 1.00 | 1.00 | 1.00 | 451,922 |
| Malware Exploit | 1.00 | 1.00 | 1.00 | 451,923 |
| MITM | 1.00 | 1.00 | 1.00 | 451,923 |
| Scanning and Recon | 1.00 | 1.00 | 1.00 | 451,922 |
| Web Attack | 1.00 | 1.00 | 1.00 | 451,923 |
| Weighted Avg | 1.00 | 1.00 | 1.00 | 3,615,379 |
Table 7. Classification Report for the CNN Model.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 0.99 | 0.98 | 0.99 | 451,922 |
| Brute_Force | 0.99 | 0.99 | 0.99 | 451,922 |
| DDoS | 1.00 | 1.00 | 1.00 | 451,922 |
| DoS_Botnet | 0.98 | 1.00 | 0.99 | 451,922 |
| Malware_Exploit | 0.98 | 0.99 | 0.99 | 451,923 |
| MITM | 0.99 | 0.99 | 0.99 | 451,923 |
| Scanning_and_Recon | 1.00 | 0.99 | 0.99 | 451,922 |
| Web_Attack | 0.99 | 0.99 | 0.99 | 451,923 |
| Weighted Avg | 0.99 | 0.99 | 0.99 | 3,615,379 |
Table 8. Classification Report for the Autoencoder (Threshold: 0.000331 at 95th percentile).
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 1.00 | 0.95 | 0.97 | 454,133 |
| Malicious | 0.00 | 0.00 | 0.00 | 0 |
| Accuracy | | | 0.95 | 454,133 |
| Macro Avg | 0.50 | 0.47 | 0.49 | 454,133 |
| Weighted Avg | 1.00 | 0.95 | 0.97 | 454,133 |
Table 9. Classification Report for the GRU Model (Overall Test Accuracy: 0.9862).
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 0.98 | 0.99 | 0.98 | 753,203 |
| Brute_Force | 0.98 | 0.98 | 0.98 | 753,203 |
| DDos | 0.99 | 0.95 | 0.97 | 753,203 |
| DoS_Botnet | 1.00 | 0.99 | 1.00 | 753,204 |
| MITM | 0.99 | 1.00 | 1.00 | 753,204 |
| Malware_Exploit | 1.00 | 1.00 | 1.00 | 753,204 |
| Scanning_and_Recon | 0.95 | 0.99 | 0.97 | 753,204 |
| Web_Attack | 0.99 | 0.99 | 0.99 | 753,204 |
| Accuracy | | | 0.99 | 6,025,629 |
| Macro Avg | 0.99 | 0.99 | 0.99 | 6,025,629 |
| Weighted Avg | 0.99 | 0.99 | 0.99 | 6,025,629 |
Table 10. Classification Report for the Stacking Ensemble Model (Overall Test Accuracy: 0.9998).
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 1.00 | 0.99 | 1.00 | 451,922 |
| Brute_Force | 1.00 | 1.00 | 1.00 | 451,922 |
| DDos | 1.00 | 1.00 | 1.00 | 451,922 |
| DoS_Botnet | 1.00 | 1.00 | 1.00 | 451,923 |
| MITM | 1.00 | 1.00 | 1.00 | 451,922 |
| Malware_Exploit | 1.00 | 1.00 | 1.00 | 451,922 |
| Scanning_and_Recon | 1.00 | 1.00 | 1.00 | 451,923 |
| Web_Attack | 1.00 | 1.00 | 1.00 | 451,922 |
| Accuracy | | | 1.00 | 3,615,378 |
| Macro Avg | 1.00 | 1.00 | 1.00 | 3,615,378 |
| Weighted Avg | 1.00 | 1.00 | 1.00 | 3,615,378 |
Table 11. Strategic Comparison of Model Suitability and Performance Metrics.

| Model | Acc. (%) | Gen. (AUC) | Train Time (s) / Infer. Time (s) | Optimal Deployment Scenario |
|---|---|---|---|---|
| SVM | 62.30 | 0.21 (Poor) | 20170.84 | Legacy/Microcontroller: Simple pre-filtering on constrained hardware for detecting scanning activity only. |
| Random Forest | 99.77 | 0.99 (Exc.) | 23868.20 | General Purpose: Robust baseline for standard IoT gateways; high resistance to overfitting. |
| XGBoost | 99.74 | 0.99 (Exc.) | 13317.30 | Edge Gateway Standard: Best all-rounder. Recommended for primary deployment on IoT hubs due to its optimal speed-accuracy balance. |
| CNN | 98.93 | 0.98 (Good) | 455029.37 | Packet Inspection: Best suited for inputs where feature engineering is difficult (e.g., raw payload bytes). |
| Autoencoder | N/A | 0.97 | 54383.56 | Zero-Day Filter: Anomaly detection layer. Should sit in front of a classifier to flag unknown threats. |
| GRU | 99.30 | 0.99 (Exc.) | 253493.63 | Forensic Analysis: Too slow for real-time edge blocking. Best used for offline analysis of session logs (low-and-slow attacks). |
| Stacking Ens. | 99.98 | 1.00 (Perf.) | 303 * / 3.32 | Cloud Core/SIEM: The ultimate defense engine. Deployed in the cloud to aggregate alerts with near-zero false positives. |

* Training time refers to the meta-learner training phase on pre-generated predictions.
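
Table 11 suggests placing the autoencoder in front of a classifier so that traffic it cannot reconstruct well is flagged as a possible unknown threat. One way this layering could look is sketched below. The label `UNKNOWN`, the function `layered_verdict`, and the toy scoring functions are all hypothetical stand-ins and do not come from the paper.

```python
from typing import Callable, Sequence

# Hypothetical label for traffic that the anomaly layer flags.
UNKNOWN = "Unknown_Anomaly"

def layered_verdict(
    features: Sequence[float],
    reconstruction_error: Callable[[Sequence[float]], float],
    classify: Callable[[Sequence[float]], str],
    threshold: float,
) -> str:
    """Stage 1: anomaly filter. Stage 2: supervised classifier."""
    if reconstruction_error(features) > threshold:
        # The sample does not resemble the traffic the autoencoder learned.
        return UNKNOWN
    return classify(features)

# Toy stand-ins for trained models (assumptions for illustration only).
def toy_error(features: Sequence[float]) -> float:
    """Mean squared magnitude, standing in for autoencoder reconstruction error."""
    return sum(x * x for x in features) / len(features)

def toy_classifier(features: Sequence[float]) -> str:
    """Single-threshold rule, standing in for a multi-class classifier."""
    return "DDoS" if features[0] > 0.5 else "Benign"

for sample in ([0.1, 0.1], [0.6, 0.1], [3.0, 3.0]):
    print(sample, "->", layered_verdict(sample, toy_error, toy_classifier, threshold=0.5))
```

In this arrangement the classifier only sees samples the anomaly layer considers familiar, so the threshold controls how much benign traffic is diverted to the unknown bucket.
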

Quaye, S.; Khan, A.H.; Tapre, K.; Dawson, M. AI-Powered Cybersecurity Models for Training and Testing IoT Devices. Appl. Sci. 2025, 15, 13073. https://doi.org/10.3390/app152413073
