Benchmarking and Cross-Dataset Evaluation of AI-Based Intrusion Detection Systems for Smart City IoT Networks

Alghamdi, Ahlam; Dardouri, Samia

doi:10.3390/computers15060340


by Ahlam Alghamdi ¹ and Samia Dardouri ¹,²,*

¹ Department of Computer Science, College of Computing and Information Technology, Shaqra University, Shaqra 11961, Saudi Arabia

² Innov’com Laboratory, Sup’COM, University of Carthage, Amilcar 1054, Tunisia

* Author to whom correspondence should be addressed.

Computers 2026, 15(6), 340; https://doi.org/10.3390/computers15060340

Submission received: 10 April 2026 / Revised: 12 May 2026 / Accepted: 13 May 2026 / Published: 26 May 2026

(This article belongs to the Section ICT Infrastructures for Cybersecurity)


Abstract

The rapid expansion of Internet of Things (IoT) infrastructures in smart city environments has increased the demand for reliable intrusion detection systems (IDS). However, many existing studies rely on single-dataset evaluations and inconsistent experimental settings, which can lead to overly optimistic performance estimates. In this study, we propose a standardized benchmarking framework for evaluating artificial intelligence-based IDS across heterogeneous IoT datasets, including CIC-IoT 2023, BoT-IoT, and N-BaIoT. Multiple classical machine learning and deep learning models are evaluated under a unified preprocessing pipeline and a consistent evaluation protocol. A hybrid CNN–BiLSTM–Attention architecture is also implemented as a reference model within this framework. While several models achieve near-perfect performance under intra-dataset evaluation, cross-dataset experiments reveal substantial performance degradation and unstable metric behavior under distribution shifts. These results highlight the limitations of dataset-specific optimization and emphasize the necessity of cross-dataset validation for realistic IoT intrusion detection evaluation. All experiments are conducted under a binary intrusion detection setting (benign vs. attack) to enable consistent comparison across datasets. Consequently, the reported results reflect binary detection performance and do not capture attack-type discrimination.

Keywords:

intrusion detection systems (IDS); internet of things (IoT) security; smart city cybersecurity; deep learning intrusion detection; CNN–BiLSTM–attention; cross-dataset evaluation

1. Introduction

The Internet of Things (IoT) has rapidly evolved into a fundamental technological paradigm enabling the interconnection of billions of heterogeneous devices across diverse application domains. Smart sensors, industrial controllers, connected vehicles, healthcare monitoring devices, and urban infrastructure components are increasingly integrated into large-scale cyber–physical ecosystems that continuously generate massive volumes of network traffic and operational data [1]. The expansion of IoT technologies has significantly transformed modern digital infrastructures by enabling real-time monitoring, intelligent automation, and data-driven decision-making across sectors such as smart cities, healthcare systems, transportation networks, and industrial automation environments [2].

Despite these benefits, the widespread deployment of IoT devices has introduced significant cybersecurity challenges. IoT devices often operate under strict resource constraints, including limited computational power, restricted memory capacity, and energy limitations, which make the deployment of traditional security mechanisms difficult [3]. Furthermore, the heterogeneity of IoT devices, communication protocols, and network architectures introduces additional complexity in securing large-scale interconnected environments. These characteristics significantly increase the attack surface of IoT ecosystems and make them particularly attractive targets for cyberattacks [4].

Recent years have witnessed the emergence of numerous large-scale cyber threats targeting IoT infrastructures. Attack scenarios such as distributed denial-of-service (DDoS) attacks, botnet propagation, data manipulation, and unauthorized device control have demonstrated the vulnerability of poorly secured IoT networks [5]. Compromised IoT devices can be exploited as part of coordinated botnet infrastructures capable of launching massive network attacks against critical services and digital infrastructures [6]. Consequently, ensuring robust and adaptive cybersecurity mechanisms for IoT environments has become a major priority for both academic researchers and industrial practitioners.

Intrusion Detection Systems (IDS) have long been considered a core component of network security architectures. IDS technologies aim to detect malicious activities and abnormal network behavior by analyzing network traffic patterns and identifying deviations from normal operational profiles [7]. Traditional IDS mechanisms typically rely on signature-based detection or rule-based monitoring techniques that match network behavior against predefined attack patterns. Although such approaches can effectively detect known threats, they are often insufficient for identifying novel or evolving attack strategies in dynamic environments such as IoT networks [8].

To overcome these limitations, artificial intelligence (AI) and machine learning (ML) techniques have increasingly been adopted to enhance intrusion detection capabilities. By leveraging data-driven learning algorithms, AI-based IDS models can automatically extract complex patterns from network traffic and identify anomalous behaviors associated with cyberattacks [9]. These methods allow intrusion detection systems to adapt to previously unseen threats and evolving attack strategies without relying solely on predefined signatures or manually engineered rules. Recent studies have explored a wide range of machine learning and deep learning approaches for IoT intrusion detection. These approaches aim to improve detection accuracy, reduce false alarm rates, and enable scalable monitoring of large-scale IoT infrastructures [10]. However, several challenges remain. Many existing IDS models are evaluated on limited datasets or under controlled experimental settings that may not fully represent the complexity and diversity of real-world IoT environments [11]. As a result, models that demonstrate strong performance within a single dataset may exhibit significant performance degradation when deployed in different operational contexts or under varying network conditions.

Another important challenge lies in the absence of standardized evaluation frameworks that allow fair and consistent comparison between different intrusion detection approaches. Differences in preprocessing pipelines, dataset characteristics, feature engineering strategies, and evaluation metrics often make it difficult to determine whether reported improvements originate from the proposed model architecture or from differences in experimental design [12]. These limitations highlight the need for systematic benchmarking methodologies capable of evaluating IDS models under consistent and reproducible experimental conditions.

Motivated by these challenges, this study investigates the performance of artificial intelligence-based intrusion detection systems in smart-city IoT environments through a comprehensive benchmarking framework. The framework evaluates multiple machine learning and deep learning models across three representative IoT intrusion detection datasets, namely CIC-IoT-2023, BoT-IoT, and N-BaIoT. By enforcing consistent preprocessing pipelines, standardized evaluation metrics, and cross-dataset validation experiments, the study aims to provide a more reliable assessment of IDS robustness and generalization capability under heterogeneous IoT traffic conditions.

The contributions of this work are summarized as follows.

A standardized benchmarking framework for evaluating AI-based intrusion detection systems across multiple heterogeneous IoT datasets under consistent preprocessing and evaluation conditions.

A unified experimental pipeline that ensures reproducibility and fair comparison across classical machine learning and deep learning models.

A systematic cross-dataset evaluation protocol that reveals generalization limitations under distribution shifts.

An empirical analysis demonstrating that near-perfect intra-dataset performance can be misleading and does not necessarily reflect real-world robustness.

An ablation study showing that architectural complexity contributes only marginally compared to dataset characteristics and preprocessing strategies.

The primary objective of this study is to establish a standardized and reproducible benchmarking framework for evaluating intrusion detection systems across heterogeneous IoT datasets. While a hybrid CNN–BiLSTM–Attention architecture is included in the experimental pipeline, it is not presented as the main contribution. Instead, the proposed model serves as a representative deep learning approach within the benchmarking framework, enabling consistent comparison with classical machine learning baselines.

The experimental results show that the proposed architecture does not consistently outperform strong tree-based models such as Random Forest and LightGBM. This observation indicates that, under the evaluated conditions, dataset characteristics and preprocessing strategies have a greater impact on performance than architectural complexity. The main contribution of this study therefore lies in the standardized benchmarking protocol and cross-dataset evaluation methodology rather than in model design; the CNN–BiLSTM–Attention model is included as a representative deep learning baseline, not as a novel state-of-the-art architecture.

2. Related Work

2.1. Machine Learning IDS

The rapid proliferation of Internet of Things (IoT) infrastructures has stimulated extensive research on intelligent intrusion detection systems (IDS) designed to protect large-scale interconnected networks. Early research primarily focused on classical machine learning algorithms for analyzing network traffic and identifying malicious activities in IoT environments. Traditional classifiers such as Decision Trees, Random Forest, Support Vector Machines (SVM), Naïve Bayes, and k-Nearest Neighbors have been widely adopted due to their relatively low computational complexity and strong baseline performance in intrusion detection tasks [13].

These approaches demonstrated promising results in detecting various attack patterns across benchmark datasets. However, their effectiveness often depends heavily on feature engineering, preprocessing quality, and dataset characteristics. As IoT networks evolved toward more complex and heterogeneous infrastructures, researchers began evaluating machine learning-based IDS models in more realistic smart-city environments.

Recent studies using large-scale IoT traffic benchmarks indicate that classical machine learning models can still achieve strong binary classification performance and moderate multi-class detection capability when applied to modern IoT datasets such as CICIoT2023 [14]. Nevertheless, comparative analyses across multiple datasets show that model behavior may vary significantly depending on traffic distributions, device heterogeneity, and class structures, suggesting that strong in-dataset performance does not necessarily guarantee robust real-world generalization [15].

2.2. Deep Learning IDS

In recent years, deep learning approaches have attracted significant attention in cybersecurity research due to their ability to automatically learn hierarchical feature representations from raw network traffic data. Convolutional Neural Networks (CNN) have been widely applied to capture spatial relationships among network traffic features and identify hidden attack signatures within high-dimensional datasets [16].

More advanced sequence-aware architectures, including recurrent neural networks and transformer-based models, have further demonstrated strong capabilities in modeling complex multi-class IoT traffic patterns, particularly in large-scale datasets such as CIC-IoT-2023 [17]. These architectures enable intrusion detection systems to capture temporal dependencies and evolving attack behaviors, which are often difficult to represent using traditional feature-engineering approaches.

Despite these advantages, deep learning models are not universally superior. Their performance is strongly influenced by dataset quality, traffic diversity, and class imbalance. Several studies report that models achieving excellent results under controlled experimental conditions may experience performance degradation when deployed in heterogeneous IoT environments or when minority attack classes are poorly represented.

2.3. Hybrid Models

While deep learning models improve feature extraction capabilities, combining multiple modeling paradigms has emerged as a promising strategy for enhancing intrusion detection performance. Hybrid models integrate different classifiers or architectural components in order to capture complementary aspects of network behavior. One notable research direction integrates feature selection techniques with deep learning architectures within a unified framework. For example, hybrid CNN–DNN-based IDS models have been proposed to improve detection performance in imbalanced IoT traffic while simultaneously reducing redundant input dimensions [18]. Other studies combine sequence modeling and spatiotemporal feature extraction, such as Seq2Seq and ConvLSTM-based architectures, to better represent evolving traffic behavior in complex network intrusion scenarios [19]. The appeal of hybrid models lies in their ability to improve discriminative performance without relying exclusively on a single learning paradigm. However, these models also introduce greater architectural complexity, and their effectiveness is often closely tied to preprocessing strategies, feature selection methods, and the structure of the training dataset. Consequently, controlled comparative evaluation remains necessary to determine whether observed performance improvements originate from the hybrid architecture itself or from the associated data preprocessing pipeline.

2.4. Dataset Imbalance

Class imbalance occurs when the number of samples in one class significantly exceeds those in another, leading to biased model training and reduced detection capability for minority attack categories. This issue is widely recognized in IoT intrusion detection research, as real-world network traffic datasets often exhibit highly skewed class distributions. In many IoT intrusion detection datasets, benign traffic substantially outnumbers malicious samples, while certain attack categories appear only rarely. This challenge has been repeatedly highlighted in survey studies of deep learning-based IoT IDS, which report that minority classes are particularly vulnerable to misclassification under imbalanced data conditions [20]. A systematic literature review of anomaly-based IDS in IoT similarly indicates that class balance, data type, and training strategies can significantly influence the effectiveness of deep learning models [21].

The impact of class imbalance is also evident in empirical studies. Recent experimental research shows that even strong multi-class IDS models may struggle when attack categories overlap significantly or when certain classes exhibit similar traffic patterns [22]. Furthermore, broader reviews of AI-based IoT IDS emphasize that dataset imbalance often interacts with other factors such as device heterogeneity, dynamic network topologies, and resource constraints, making it a methodological challenge rather than a simple preprocessing issue [23].

2.5. Concept Drift

In addition to class imbalance, IoT intrusion detection systems must also address the challenge of concept drift. IoT network environments are highly dynamic, and traffic patterns may evolve over time due to device firmware updates, behavioral changes, network reconfigurations, and emerging cyberattack strategies. Recent reviews of machine learning-based intrusion detection in IoT consistently highlight that current IDS models often lack robustness against evolving traffic patterns and changing operational conditions [24].

A related systematic analysis of federated learning approaches for IoT intrusion detection further emphasizes that heterogeneous, non-IID, and distributed IoT environments make it difficult to maintain stable IDS performance over time [25]. These observations suggest that static training assumptions are often insufficient for real-world IoT deployments and that future IDS research should more explicitly evaluate adaptability under dynamic and evolving conditions.

2.6. Cross-Dataset Evaluation

These observations highlight the necessity of evaluating IDS models across multiple heterogeneous datasets, rather than relying solely on single-dataset benchmarks. Another limitation highlighted in the literature is the heavy reliance on a limited number of benchmark datasets for evaluating intrusion detection systems. Representative dataset studies repeatedly note that the scarcity of realistic IoT and IIoT benchmarks has long constrained the reliable evaluation of IDS models [26]. While modern large-scale datasets such as CICIoT2023 have significantly improved attack diversity and network topology realism, many published studies still rely on single-dataset train–test settings, which limits conclusions about model generalization [27]. Similarly, newer frameworks such as ASEADOS-SDN-IoT provide richer SDN-IoT testbeds; however, their conclusions remain bounded by the characteristics of a specific experimental environment [28]. In this context, recent benchmarking-oriented studies emphasize that comparative evaluation should not be limited to in-dataset performance alone. Controlled experiments involving CatBoost, neural network models, and multiple balancing and feature-selection strategies demonstrate that even carefully optimized pipelines may remain highly dataset-dependent [29]. This issue is particularly relevant for smart-city IoT environments, where deployment conditions, traffic distributions, and device heterogeneity often differ substantially from laboratory settings.

2.7. Benchmarking Frameworks

The literature increasingly emphasizes the importance of standardized benchmarking frameworks and reproducible experimental pipelines for fair comparison between different intrusion detection approaches. Smart-city IDS studies illustrate this need from two complementary perspectives. On the one hand, optimized smart-city security models continue to be evaluated within narrow application scopes, limiting conclusions about cross-environment robustness [30]. On the other hand, architectures such as Digital Twin-assisted IDS demonstrate the value of integrating monitoring, resilience analysis, and cyber-physical visibility, yet they are not substitutes for broader benchmarking across heterogeneous datasets [31].

Data augmentation and synthetic generation techniques have also been proposed to strengthen IDS training, particularly for DoS and DDoS detection in IoT settings [32]. While such techniques may improve classification under skewed distributions, they do not eliminate the need for standardized evaluation. In parallel, broad reviews of IoT IDS research consistently argue that future work must move toward more systematic comparisons, clearer reporting standards, and stronger alignment between evaluation methodology and deployment reality [33,34].

This need becomes even more evident when considering adjacent but relevant domains such as the Internet of Drones, where recent systematic reviews likewise identify the lack of standardized datasets, lightweight models, and comparable evaluation practices as major obstacles to reliable IDS development [35]. Similar concerns appear in real-time IoT IDS studies, where latency and deployment feasibility become as important as detection accuracy [36]. Finally, large comparative analyses across multiple IoT datasets confirm that balancing strategy, dataset composition, and architecture choice can significantly alter reported conclusions, reinforcing the necessity of multi-dataset, standardized benchmarking methodologies [37]. Table 1 presents a comparative analysis of representative studies in IoT intrusion detection, highlighting the datasets used, machine learning and deep learning techniques, evaluation metrics, and reported performance results.

2.8. Representation Learning and Reconstruction-Based Anomaly Detection

Representation-learning and reconstruction-based approaches have become an important direction in intrusion detection, particularly in unsupervised and semi-supervised settings. These methods aim to learn compact representations of normal network behavior and detect anomalies based on reconstruction error or deviations in latent space.

A representative lightweight approach is Kitsune (Mirsky et al.) [38], which employs an ensemble of autoencoders for online anomaly detection. Kitsune is specifically designed for resource-constrained environments and incrementally learns normal traffic patterns, making it well-suited for IoT deployments where labeled data may be limited.

Sequence-aware reconstruction models further extend this paradigm by capturing dependencies across feature sequences. In particular, the VAE-LSTM model proposed by Lin et al. [39] combines variational autoencoders with recurrent neural networks to model sequential patterns and detect anomalies based on reconstruction probability. This approach enables the modeling of complex dependencies in network traffic while maintaining the generative advantages of VAEs.

Similarly, the VAE-BiLSTM architecture introduced by Staffini et al. [40] incorporates bidirectional sequence modeling, allowing the model to capture contextual dependencies in both forward and backward directions. This enhances anomaly detection performance in scenarios where temporal or feature-order dependencies are significant.

While these approaches are effective for unsupervised anomaly detection and can identify previously unseen attacks, they differ fundamentally from the present work. Specifically, they are typically evaluated within a single dataset and focus on reconstruction-based detection, whereas this study emphasizes supervised benchmarking and cross-dataset generalization. Integrating such reconstruction-based methods into standardized multi-dataset evaluation frameworks remains an important direction for future research.

2.9. Research Gap

Despite substantial progress in IoT intrusion detection research, several limitations remain. First, many studies rely on single datasets, making reported results highly dependent on specific dataset characteristics. Second, class imbalance is often addressed only through preprocessing techniques without systematically analyzing its impact on cross-dataset generalization. Third, concept drift and evolving traffic conditions are widely recognized challenges but are rarely evaluated through controlled experimental protocols. Finally, there is still a lack of standardized benchmarking frameworks that enforce consistent preprocessing pipelines, harmonized label structures, and imbalance-aware evaluation metrics across multiple heterogeneous IoT datasets.

2.10. Motivation for This Study

These limitations motivate the present study. The goal of this work is not only to compare intrusion detection models based on accuracy but also to evaluate their robustness, stability, and generalization capability across heterogeneous IoT traffic environments. To achieve this objective, we propose a systematic benchmarking framework that integrates multiple IoT datasets, unified preprocessing pipelines, imbalance-aware evaluation metrics, and cross-dataset validation experiments. This framework enables a more reliable and reproducible assessment of IDS performance and provides deeper insights into how dataset diversity influences model generalization in smart-city IoT networks.

3. Methodology

3.1. Overview of the Proposed Framework

This study proposes a comprehensive intrusion detection framework for evaluating artificial intelligence-based Intrusion Detection Systems (IDS) in smart-city Internet of Things (IoT) environments. The framework is designed to systematically benchmark multiple machine learning models and assess the effectiveness of a proposed deep learning architecture under heterogeneous IoT network traffic conditions.

The proposed framework consists of several sequential stages, including dataset preparation, data preprocessing, benchmark model evaluation, deep learning-based intrusion detection, and cross-dataset generalization analysis.

First, multiple publicly available IoT intrusion detection datasets are collected and harmonized to ensure consistent experimental conditions. These datasets are selected to represent diverse IoT network environments and attack scenarios commonly encountered in smart city infrastructures. Harmonization procedures are applied to align feature formats and ensure compatibility across datasets.

Next, data preprocessing techniques are applied to improve data quality and model performance. This stage includes removing noisy or duplicate records, handling missing values, and normalizing feature distributions to ensure consistent numerical ranges across all input variables.

Following preprocessing, several widely used machine learning algorithms are trained to establish baseline intrusion detection performance. These baseline models provide a comparative benchmark for evaluating the effectiveness of the proposed deep learning approach.

Subsequently, a hybrid deep learning architecture combining Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and an attention mechanism is introduced. The CNN component extracts local feature interactions along the feature dimension, while the BiLSTM models dependencies across feature positions rather than temporal sequences.

To evaluate the robustness and generalization capability of the proposed model, cross-dataset experiments are conducted in which models trained on one dataset are tested on other datasets with different traffic distributions. In addition, ablation studies are performed to analyze the individual contributions of the CNN, BiLSTM, and attention components within the hybrid architecture.

The proposed benchmarking framework aims to provide a systematic and reproducible evaluation of AI-based IDS models in smart city IoT environments. Both classical machine learning algorithms and deep learning architectures are evaluated across multiple publicly available datasets to assess detection accuracy, robustness, and generalization capability, particularly in cross-dataset scenarios where training and testing data distributions differ.

The benchmarking pipeline consists of the following stages:

Dataset selection and preparation.
Data preprocessing and normalization.
Feature selection and dimensionality reduction.
Model training and hyperparameter optimization.
Evaluation using multiple performance metrics.
Cross-dataset generalization testing.

Figure 1 presents the complete benchmarking pipeline, including strict feature harmonization via intersection of shared numerical features, preprocessing fitted exclusively on the training set, and the cross-dataset evaluation protocol (train-on-A, test-on-B/C) used to assess generalization under distribution shift.
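The harmonization and train-on-A, test-on-B/C protocol described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the column names, the `make_toy` generator, the Random Forest settings, and the choice of macro-F1 as the reported score are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler


def shared_numeric_features(frames, label_col="label"):
    """Return the sorted intersection of numeric column names (label excluded)."""
    common = None
    for df in frames:
        cols = set(df.select_dtypes(include="number").columns) - {label_col}
        common = cols if common is None else common & cols
    return sorted(common)


def make_toy(seed, shift, extra_col, n=400):
    """Synthetic stand-in for one dataset; `shift` mimics a distribution shift."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    return pd.DataFrame({
        "rate": 2.0 * y + rng.normal(size=n) + shift,
        "pkt_size": rng.normal(size=n),
        extra_col: rng.normal(size=n),  # dataset-specific, removed by the intersection
        "label": y,
    })


# Hypothetical datasets A, B, C with one shared and one private feature set each.
datasets = {
    "A": make_toy(0, 0.0, "a_only"),
    "B": make_toy(1, 1.5, "b_only"),
    "C": make_toy(2, -1.0, "c_only"),
}
features = shared_numeric_features(datasets.values())

results = {}
for src, train_df in datasets.items():
    # Scaler and model are fitted on the source dataset only.
    scaler = StandardScaler().fit(train_df[features])
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(scaler.transform(train_df[features]), train_df["label"])
    for tgt, test_df in datasets.items():
        if tgt == src:
            continue
        pred = model.predict(scaler.transform(test_df[features]))
        results[(src, tgt)] = f1_score(test_df["label"], pred, average="macro")

for (src, tgt), score in sorted(results.items()):
    print(f"train {src} -> test {tgt}: macro-F1 = {score:.3f}")
```

Fitting the scaler on the source dataset and reusing it unchanged on the targets keeps target statistics out of training, which is what makes the cross-dataset score a test of generalization rather than of re-fitting.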

It is important to note that the objective of the proposed framework is not to introduce a new state-of-the-art model, but rather to provide a fair and standardized evaluation protocol. The framework focuses on consistent preprocessing, model evaluation, and cross-dataset validation under unified experimental conditions. This design enables systematic comparison between classical machine learning and deep learning models under identical preprocessing and evaluation settings.

The intrusion detection task is formulated as a binary classification problem (benign vs. attack) to enable consistent comparison across heterogeneous datasets. The considered datasets differ significantly in their labeling schemes and attack taxonomies, making direct multi-class alignment non-trivial. Therefore, all attack types are grouped into a single class to ensure a unified evaluation setting.
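As a rough illustration of this label unification, the sketch below collapses dataset-specific labels into the benign/attack scheme. The set of benign markers and the example label strings are assumptions chosen for illustration; the actual label vocabulary of each dataset should be checked before reuse.

```python
# Hypothetical benign markers; each dataset uses its own naming scheme.
BENIGN_LABELS = {"benign", "benigntraffic", "normal", "0"}


def to_binary(label) -> int:
    """Return 0 for benign traffic and 1 for any attack category."""
    return 0 if str(label).strip().lower() in BENIGN_LABELS else 1


# Illustrative raw labels in the style of different datasets.
raw = ["BenignTraffic", "DDoS-ICMP_Flood", "Normal", "mirai.udp", "Reconnaissance"]
binary = [to_binary(x) for x in raw]
print(binary)
```

Treating every unrecognized label as an attack is a deliberate choice in this sketch: an unexpected benign spelling would be miscounted as an attack, so the benign set needs to be verified against each dataset.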

3.2. Dataset Description

To ensure a comprehensive evaluation of intrusion detection models, three publicly available IoT cybersecurity datasets were utilized in this study: CIC-IoT 2023, BoT-IoT, and N-BaIoT. These datasets represent diverse IoT network environments and contain various categories of malicious traffic, including distributed denial-of-service (DDoS) attacks, reconnaissance activities, and botnet-based intrusions.

The CIC-IoT 2023 dataset represents modern IoT network traffic scenarios and includes a large volume of network flows generated from realistic IoT environments. The dataset contains multiple attack categories and heterogeneous traffic characteristics that simulate real-world IoT communication patterns.

The BoT-IoT dataset was generated using a realistic IoT network testbed and includes both benign and malicious traffic associated with botnet attacks. The dataset contains several attack categories, including distributed denial-of-service (DDoS), denial-of-service (DoS), reconnaissance, and information theft attacks.

The N-BaIoT dataset focuses specifically on botnet attacks targeting IoT devices. It contains both benign traffic and malicious traffic generated from compromised IoT devices performing botnet-related activities, such as Mirai and Bashlite attacks.

These datasets differ in terms of feature spaces, traffic distributions, and attack characteristics, making them suitable for evaluating the robustness and generalization capability of intrusion detection systems in heterogeneous IoT environments. To ensure reproducibility, the exact number of samples used in each dataset after preprocessing is fixed as follows: CIC-IoT 2023 (200,000 samples), N-BaIoT (200,000 samples), and BoT-IoT (21,893 samples). Stratified sampling was applied prior to dataset splitting to preserve the original class distribution in each dataset. Table 2 summarizes the IoT intrusion detection datasets utilized in this study, including their data sources, sample sizes, attack categories, feature characteristics, and class distributions.

The distribution of benign and malicious traffic samples across the evaluated datasets is illustrated in Figure 2. The evaluated datasets exhibit noticeable class imbalance, with malicious traffic samples significantly outnumbering benign samples. This imbalance is particularly evident in the CIC-IoT 2023 dataset, which contains substantially more attack instances than benign traffic. Such class imbalance is common in intrusion detection datasets and means that accuracy alone can be misleading, which highlights the importance of imbalance-aware evaluation metrics.
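The effect of an attack-heavy class distribution on evaluation can be shown with a small calculation. The sketch below uses an illustrative 95/5 attack-to-benign ratio (an assumption, not a figure from the datasets) and a trivial detector that labels every flow as an attack; balanced accuracy is used here as one example of an imbalance-aware metric.

```python
def binary_report(y_true, y_pred):
    """Compute accuracy, per-class recall, and balanced accuracy (1 = attack)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    attack_recall = tp / (tp + fn) if (tp + fn) else 0.0
    benign_recall = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "attack_recall": attack_recall,
        "benign_recall": benign_recall,
        "balanced_accuracy": (attack_recall + benign_recall) / 2,
    }


# Illustrative split: 95 attack flows, 5 benign flows.
y_true = [1] * 95 + [0] * 5
# Trivial detector that flags everything as an attack.
y_pred = [1] * 100

report = binary_report(y_true, y_pred)
print(report)
```

Accuracy looks high in this setting even though every benign flow is misclassified, while the balanced accuracy falls to chance level.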

Due to dataset size constraints and computational feasibility, representative subsets were selected while preserving class distributions through stratified sampling. To keep the comparison fair, a fixed sample size of 200,000 instances was used for both the CIC-IoT 2023 and N-BaIoT datasets. For the BoT-IoT dataset, a smaller representative subset of 21,893 samples was selected due to its inherent data characteristics and availability.

These subsets preserve the original class imbalance characteristics of each dataset. Detailed dataset statistics, including class distributions, are provided in Table 3.
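As an illustration of the subsampling step, the following minimal sketch draws a class-stratified subset of fixed size with scikit-learn. The function name `stratified_subsample`, the seed, and the synthetic data are assumptions introduced here for illustration; this is not the sampling code used in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_subsample(X, y, n_samples, seed=42):
    """Return a subset of (X, y) with n_samples rows and the same class ratio."""
    if n_samples >= len(y):
        return X, y  # nothing to subsample
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=n_samples, stratify=y, random_state=seed
    )
    return X_sub, y_sub


# Synthetic stand-in (assumption): 10,000 flows, 90% attack (1), 10% benign (0).
rng = np.random.default_rng(0)
y_full = np.array([1] * 9000 + [0] * 1000)
X_full = rng.normal(size=(10000, 20))

X_sub, y_sub = stratified_subsample(X_full, y_full, n_samples=2000)
print(len(y_sub), round(float(y_sub.mean()), 3))
```

Passing `stratify=y` is what keeps the benign-to-attack ratio of the subset close to that of the full dataset.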

3.3. Data Preprocessing

Data preprocessing plays a critical role in ensuring the reliability, consistency, and reproducibility of intrusion detection models. In this study, a standardized preprocessing pipeline was applied across all datasets to ensure fair comparison and prevent experimental bias.

First, data cleaning procedures were performed to remove irrelevant or non-informative attributes. Columns containing only missing values were eliminated, and invalid entries (e.g., infinite values) were handled to prevent instability during training. Missing values were imputed using a median-based strategy (SimpleImputer), preserving the statistical properties of each feature.

To ensure consistent feature scaling, numerical attributes were normalized using standard scaling (StandardScaler), transforming features to zero mean and unit variance. Categorical features were encoded using one-hot encoding for low-cardinality variables and label encoding where appropriate.

To prevent data leakage, preprocessing was applied following a strict protocol. Dataset splitting into training (70%), validation (15%), and testing (15%) sets was performed prior to any preprocessing operations, using stratified sampling to preserve class distributions. All preprocessing steps, including imputation, scaling, and encoding, were fitted exclusively on the training set and subsequently applied to the validation and test sets. This ensures that no information from the evaluation data influences model training.
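The leakage-safe protocol described above can be sketched as follows, assuming purely numerical features. The helper names (`clean`, `fit_preprocessor`, `apply_preprocessor`) and the toy column names are hypothetical; only `SimpleImputer` and `StandardScaler` are named in the text.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def clean(df):
    """Replace infinite values with NaN so that the imputer can handle them."""
    return df.replace([np.inf, -np.inf], np.nan)


def fit_preprocessor(train_df):
    """Fit median imputation and standard scaling on the training split only."""
    train_df = clean(train_df)
    keep_cols = train_df.dropna(axis=1, how="all").columns  # drop all-missing columns
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    pipe.fit(train_df[keep_cols])
    return keep_cols, pipe


def apply_preprocessor(df, keep_cols, pipe):
    """Transform any split with statistics learned from the training split."""
    return pipe.transform(clean(df)[keep_cols])


# Toy data (assumption): column names are illustrative, not dataset fields.
train_df = pd.DataFrame({
    "pkt_count": [1.0, 2.0, 3.0, np.nan],
    "rate": [10.0, np.inf, 30.0, 20.0],
    "empty": [np.nan] * 4,
})
test_df = pd.DataFrame({
    "pkt_count": [100.0, np.nan],
    "rate": [0.0, 5.0],
    "empty": [np.nan] * 2,
})

keep_cols, pipe = fit_preprocessor(train_df)
X_train = apply_preprocessor(train_df, keep_cols, pipe)
X_test = apply_preprocessor(test_df, keep_cols, pipe)
```

Because the medians, means, and variances come only from `train_df`, the validation and test splits cannot influence the fitted transformations.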

In addition, features likely to encode dataset-specific artifacts were explicitly removed before training. These include flow identifiers, timestamps, source and destination information, device-related fields, and protocol-specific metadata. This step prevents models from exploiting dataset-specific signatures and ensures that learned patterns reflect generalizable traffic characteristics.

To enable cross-dataset evaluation, feature harmonization was performed using a strict intersection-based strategy. Given the heterogeneity of feature definitions across CIC-IoT 2023, BoT-IoT, and N-BaIoT, only numerical flow-level features that are consistently defined and semantically comparable across all datasets were retained. No manual feature construction or transformation into a new representation was introduced. Instead, features were aligned based on their statistical meaning (e.g., packet counts, byte counts, flow duration, and rate-based metrics), while incompatible or dataset-specific attributes were excluded.

This process resulted in a unified feature space consisting of 20 shared numerical features. All models were trained and evaluated using this consistent representation to ensure fair comparison across datasets. Following feature alignment, missing values were handled using median imputation, and numerical features were normalized using standard scaling within a unified preprocessing pipeline.
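A minimal sketch of intersection-based harmonization is given below. It assumes that semantically equivalent columns are first renamed to shared canonical names; all column names and rename maps are hypothetical placeholders and do not reproduce the actual fields of CIC-IoT 2023, BoT-IoT, or N-BaIoT.

```python
import pandas as pd


def harmonize(frames, rename_maps):
    """Rename columns to canonical names, then keep the shared numeric columns."""
    renamed = {
        name: df.rename(columns=rename_maps.get(name, {}))
        for name, df in frames.items()
    }
    shared = None
    for df in renamed.values():
        numeric = set(df.select_dtypes(include="number").columns)
        shared = numeric if shared is None else shared & numeric
    shared = sorted(shared)  # fixed column order for every dataset
    return {name: df[shared] for name, df in renamed.items()}, shared


# Hypothetical feature tables (assumption): names chosen for illustration only.
frames = {
    "cic": pd.DataFrame({"flow_duration": [0.5, 1.2], "tot_bytes": [300, 1200],
                         "proto_name": ["tcp", "udp"]}),
    "bot": pd.DataFrame({"dur": [2.0, 0.1], "bytes": [90, 40], "pkts": [3, 1]}),
    "nbaiot": pd.DataFrame({"duration": [0.3, 0.7], "byte_count": [10, 20],
                            "pkts": [1, 2]}),
}
rename_maps = {
    "cic": {"tot_bytes": "byte_count"},
    "bot": {"dur": "flow_duration", "bytes": "byte_count"},
    "nbaiot": {"duration": "flow_duration"},
}

harmonized, shared = harmonize(frames, rename_maps)
print(shared)
```

In this toy case the non-numeric `proto_name` column and the `pkts` column, which is missing from one dataset, are both dropped by the intersection.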

To further verify the absence of data leakage, feature distributions were inspected after harmonization, confirming that no feature uniquely identifies a dataset or traffic source. Table 4 presents the cross-dataset feature harmonization process and the retained feature set: only features that are consistently available and semantically aligned across all datasets were retained, while identifiers and dataset-specific attributes were excluded. This ensures that the resulting feature space is both comparable and free from dataset-dependent artifacts. In Table 4, the symbols “✔” and “✖” indicate whether a feature is available in a given dataset and whether it was retained after the harmonization process.

3.4. Binary Intrusion Detection Formulation

To enable consistent comparison across heterogeneous datasets, the intrusion detection task was formulated as a binary classification problem: benign traffic was labeled as 0 and malicious traffic as 1, with all attack categories present in the datasets grouped into the malicious class. This unified labeling strategy simplifies comparisons across datasets with different attack taxonomies and facilitates standardized benchmarking of intrusion detection models. Binary intrusion detection is widely adopted in IDS benchmarking studies, particularly when datasets contain heterogeneous attack categories and varying label structures.

Class imbalance was not explicitly corrected through resampling techniques, in order to preserve the natural distribution of each dataset. Instead, stratified sampling was used during dataset splitting, and imbalance-aware evaluation metrics such as PR-AUC and F1-score were emphasized to provide a more realistic assessment of model performance.

While this formulation enables standardized benchmarking, it does not evaluate the model’s ability to distinguish between different attack categories. Consequently, the scope of this study is limited to binary intrusion detection.
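The label consolidation can be illustrated with a short sketch. The helper `to_binary_labels` and the set of benign label strings are assumptions; the raw label vocabulary differs between datasets and should be checked against each source.

```python
import pandas as pd


def to_binary_labels(labels, benign_names=("benign", "benigntraffic", "normal")):
    """Map benign traffic to 0 and every attack category to 1."""
    normalized = pd.Series(labels).astype(str).str.strip().str.lower()
    return (~normalized.isin(benign_names)).astype(int)


# Illustrative raw labels (assumption), mixing several attack taxonomies.
raw = ["Normal", "DDoS", "BenignTraffic", "Mirai", "Reconnaissance"]
binary = to_binary_labels(raw)
print(binary.tolist())
```

Any label that is not recognized as benign is treated as an attack, so an unexpected benign spelling would be counted as malicious; the benign list should therefore be verified per dataset.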

3.5. Data Splitting Strategy

To ensure an unbiased evaluation of model performance, each dataset was divided into training, validation, and testing subsets. Stratified sampling was applied during dataset splitting to preserve the original class distribution across all subsets.

The datasets were partitioned according to the following proportions:

70% training set.
15% validation set.
15% testing set.

The training set was used to train the models, the validation set was used for hyperparameter tuning and model selection, and the testing set was reserved for final performance evaluation. This separation ensures that the evaluation results reflect the model’s ability to generalize to unseen data. As described in Section 3.3, splitting was performed prior to any preprocessing operations, and all preprocessing transformations, including imputation, scaling, and encoding, were fitted exclusively on the training set and subsequently applied to the validation and test sets.
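One common way to obtain a stratified 70/15/15 partition is to split twice, as sketched below. The two-stage approach, the function name `split_70_15_15`, and the seed are assumptions; the text specifies only the proportions and the use of stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_70_15_15(X, y, seed=42):
    """Stratified split: 70% train, then the remaining 30% halved into val/test."""
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic stand-in (assumption): 1,000 flows with 80% attack traffic.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = np.array([1] * 800 + [0] * 200)

(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_70_15_15(X, y)
print(len(y_train), len(y_val), len(y_test))
```

Stratifying both stages keeps the attack ratio approximately equal in all three subsets.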

3.6. Evaluation Metrics

To comprehensively evaluate the performance of the intrusion detection models, several widely used classification metrics were employed. These metrics provide complementary insights into detection capability, particularly in intrusion detection tasks where class imbalance is common. The evaluation metrics used in this study include Accuracy, Precision, Recall, F1-score, Receiver Operating Characteristic Area Under the Curve (ROC-AUC), and Area Under the Precision–Recall Curve (PR-AUC).

The evaluation metrics are defined using the confusion matrix components: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

Accuracy

Accuracy measures the overall proportion of correctly classified instances among all evaluated samples:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Precision

Precision indicates the proportion of correctly identified attack instances among all samples predicted as malicious:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall

Recall, also referred to as the detection rate, measures the ability of the classifier to correctly detect malicious traffic among all actual attack instances:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

F1-score

The F1-score represents the harmonic mean of precision and recall and provides a balanced evaluation metric when dealing with imbalanced datasets:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$

Receiver Operating Characteristic Area Under the Curve (ROC-AUC)

The ROC-AUC metric evaluates the ability of a classifier to distinguish between benign and malicious traffic across different decision thresholds. It measures the area under the Receiver Operating Characteristic curve, which represents the trade-off between the true positive rate (TPR), which is identical to Recall in Equation (3), and the false positive rate (FPR), defined as:

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{5}$$

Area Under the Precision–Recall Curve (PR-AUC)

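PR-AUC measures the area under the precision–recall curve and is particularly informative in intrusion detection datasets with class imbalance. It emphasizes the trade-off between precision and recall for the attack class and provides a complementary view of attack detection capability. Because a random classifier attains a PR-AUC roughly equal to the proportion of positive (attack) samples, PR-AUC values should be interpreted relative to each dataset’s attack ratio, especially when malicious samples dominate the dataset.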

The performance of the intrusion detection system is evaluated based on four fundamental classification outcomes derived from the confusion matrix. True Positive (TP) represents the number of malicious traffic instances that are correctly identified as attacks by the model. True Negative (TN) refers to benign or normal traffic that is correctly classified as legitimate network activity. False Positive (FP) occurs when benign traffic is incorrectly classified as an attack, which may lead to unnecessary alerts and increased operational overhead for network administrators. False Negative (FN) represents malicious traffic that is incorrectly classified as normal traffic, posing a significant security risk because actual attacks remain undetected. These four outcomes form the basis for calculating key evaluation metrics such as accuracy, precision, recall, and F1-score, which are commonly used to assess the effectiveness and reliability of intrusion detection systems.
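The metrics defined in Equations (1)–(5) can be computed as in the following sketch. The function name `evaluate`, the 0.5 threshold, and the toy scores are assumptions; PR-AUC is approximated here by scikit-learn’s average precision, which is one common estimator and not necessarily the one used in this study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)


def evaluate(y_true, y_score, threshold=0.5):
    """Compute threshold-based and ranking-based metrics for binary detection."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "tp": int(tp), "tn": int(tn), "fp": int(fp), "fn": int(fn),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "roc_auc": roc_auc_score(y_true, y_score),           # threshold-free
        "pr_auc": average_precision_score(y_true, y_score),  # threshold-free
    }


# Toy example (assumption): 1 = attack, 0 = benign.
y_true = [0, 0, 1, 1, 1, 1, 0, 1]
y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.7, 0.2, 0.55]
metrics = evaluate(y_true, y_score)
print(metrics)
```

In this toy case the confusion matrix is TP = 4, TN = 2, FP = 1, FN = 1, which gives an accuracy of 0.75 and precision, recall, and F1-score of 0.8 each.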

3.7. Benchmark Machine Learning Models

To establish baseline intrusion detection performance, several widely used machine learning algorithms were evaluated. The benchmark models include:

Logistic Regression.
Random Forest.
XGBoost.
LightGBM.
Support Vector Machine (SVM).

These models were selected due to their strong performance in previous intrusion detection studies and their effectiveness in handling structured network traffic data.

Tree-based ensemble models such as Random Forest and gradient boosting algorithms are particularly well-suited for tabular intrusion detection datasets due to their ability to capture complex feature interactions.

All models were trained using identical preprocessing pipelines to ensure fair comparison and reproducibility of experimental results. These benchmark models serve as baseline references to evaluate the effectiveness of the proposed deep learning architecture.

Deep Learning Baselines

In addition to classical machine learning models, deep learning baselines including Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks were evaluated. These models provide a comparison against the proposed hybrid CNN-BiLSTM-Attention architecture.

The inclusion of deep learning baselines enables a more comprehensive evaluation of intrusion detection performance and highlights the advantages of hybrid architectures that integrate both convolutional feature extraction and sequential learning mechanisms.

3.8. Proposed CNN-BiLSTM-Attention Model

To enhance the detection capability beyond traditional machine learning models, this study proposes a hybrid deep learning architecture that integrates Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM) networks, and an attention mechanism.

CNN Layer

The CNN component captures local interactions between neighboring features along the feature dimension, enabling the model to learn combinations of related attributes within the tabular input.

BiLSTM Layer

The BiLSTM layer models dependencies across feature positions within the input vector rather than temporal dependencies. This design allows the model to capture relationships between features but does not reflect time-evolving network behavior.

Attention Mechanism

An attention mechanism is incorporated to enhance feature importance modeling. The attention layer allows the model to dynamically assign higher weights to the most informative features while reducing the influence of less relevant information. By focusing on critical traffic characteristics, the attention mechanism improves the model’s ability to identify subtle intrusion patterns and enhances overall classification performance.

Classification Layer

The final stage of the architecture consists of a fully connected dense layer followed by a sigmoid activation function, which performs binary classification to distinguish between benign and malicious traffic.

The proposed CNN–BiLSTM–Attention model operates on structured tabular feature vectors, where each sample represents a single network flow instance described by a fixed set of numerical features. No explicit temporal sequence is constructed based on device behavior, session information, or timestamp ordering. Instead, each sample is treated independently. To enable compatibility with convolutional and recurrent layers, the input feature vector is reshaped into a one-dimensional sequence. This transformation does not represent a temporal or spatial sequence; rather, it allows the model to process feature values as an ordered set of attributes.

Figure 3 illustrates the proposed CNN–BiLSTM–Attention architecture applied to tabular network traffic data. Each sample is represented as a feature vector x ∈ ℝ^{20}, which is reshaped into X ∈ ℝ^{20×1} to ensure compatibility with convolutional and recurrent layers. The resulting sequence represents an ordered feature arrangement rather than a temporal sequence. The CNN component captures local feature interactions and spatial correlations among neighboring features, while the BiLSTM layer models dependencies across feature dimensions in a non-temporal manner. Subsequently, the attention mechanism assigns adaptive importance weights to the extracted feature representations, enabling the model to focus on the most relevant discriminative patterns before the final binary classification stage.
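The data flow described above can be sketched in Keras as follows. Only the overall layer order, the 20-feature input, and the sigmoid output come from the text; the filter count, kernel size, LSTM units, dense width, dropout rate, and the additive-style attention pooling are assumptions chosen for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers


def build_cnn_bilstm_attention(n_features=20, conv_filters=32, lstm_units=32,
                               dropout=0.3):
    """Hybrid model over a feature vector reshaped to (n_features, 1)."""
    inputs = layers.Input(shape=(n_features,))
    x = layers.Reshape((n_features, 1))(inputs)            # (batch, 20, 1)
    # CNN: local interactions between neighboring feature positions.
    x = layers.Conv1D(conv_filters, kernel_size=3, padding="same",
                      activation="relu")(x)                # (batch, 20, filters)
    # BiLSTM: dependencies across feature positions (not across time).
    h = layers.Bidirectional(
        layers.LSTM(lstm_units, return_sequences=True))(x)  # (batch, 20, 2*units)
    # Attention pooling: one score per position, normalized over positions.
    scores = layers.Dense(1)(h)                            # (batch, 20, 1)
    weights = layers.Softmax(axis=1)(scores)               # sums to 1 over positions
    context = layers.Dot(axes=1)([weights, h])             # (batch, 1, 2*units)
    context = layers.Flatten()(context)                    # (batch, 2*units)
    # Classification head with dropout in the fully connected part.
    x = layers.Dense(32, activation="relu")(context)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)     # P(attack)
    return Model(inputs, outputs)


model = build_cnn_bilstm_attention()
probs = model(np.zeros((2, 20), dtype="float32")).numpy()
print(probs.shape)
```

Because the 20 positions are feature slots and not time steps, the learned convolution and recurrence depend on the column ordering, which should therefore be kept fixed across datasets.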

3.9. Experimental Setup

All experiments were implemented using Python 3.10. Classical machine learning models were developed using Scikit-learn, while XGBoost and LightGBM were used for gradient boosting models. Deep learning models, including CNN, LSTM, and the proposed CNN–BiLSTM–Attention architecture, were implemented using TensorFlow and Keras. The experiments were conducted in the Google Colab environment to leverage cloud-based computational resources for large-scale data processing and model training. Each dataset was processed independently to evaluate model performance under different IoT traffic distributions. The datasets were split into training, validation, and testing sets using a stratified sampling strategy to preserve class distribution. Due to the large size and imbalance of IoT datasets, sampling techniques were applied to ensure computational feasibility while maintaining representative data distributions.

All models were trained and evaluated using a unified preprocessing pipeline, including data cleaning, feature normalization, and label harmonization, to ensure fair comparison across different datasets.

Model performance was evaluated using multiple metrics, including Accuracy, Precision, Recall, F1-score, ROC-AUC, and PR-AUC, to provide a comprehensive assessment of intrusion detection performance, particularly under imbalanced conditions.

For deep learning models, training was performed using the Adam optimizer with an initial learning rate of 0.001. The models were trained for 50 epochs with a batch size of 64, with early stopping applied to prevent overfitting. Dropout regularization was also used in fully connected layers to improve generalization.
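The stated training configuration (Adam, learning rate 0.001, 50 epochs, batch size 64, early stopping) corresponds to a Keras setup along the following lines. The loss function, the monitored quantity, the patience value, the seed, and the toy model are assumptions, since they are not specified in the text.

```python
import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(42)  # fixed seed (value is an assumption)


def compile_and_train(model, X_train, y_train, X_val, y_val, patience=5):
    """Train with Adam (lr=0.001), batch size 64, up to 50 epochs, early stopping."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",  # assumed loss for a sigmoid output
        metrics=["accuracy"],
    )
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=patience, restore_best_weights=True
    )
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=50, batch_size=64,
        callbacks=[early_stop], verbose=0,
    )


# Toy data and a small stand-in network (assumptions) to exercise the function.
rng = np.random.default_rng(0)
X = rng.normal(size=(320, 20)).astype("float32")
y = (X[:, 0] > 0).astype("float32")
toy_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
history = compile_and_train(toy_model, X[:256], y[:256], X[256:], y[256:])
print(len(history.history["val_loss"]), "epochs run")
```

With `restore_best_weights=True`, the model returned after early stopping corresponds to the epoch with the lowest validation loss and not to the last epoch trained.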

For classical machine learning models, hyperparameters were tuned using standard optimization techniques, so that each model was evaluated under a tuned parameter configuration while maintaining fairness in the benchmarking process. To ensure full reproducibility, all experiments were conducted using fixed random seeds, consistent preprocessing pipelines, and identical feature sets across all models. The entire workflow, including sampling, preprocessing, training, and evaluation, follows a deterministic pipeline that can be replicated under the same configuration.

The reported results correspond to single-run evaluations under consistent experimental conditions. No multi-run statistical analysis (e.g., mean and standard deviation across runs) was performed, as summarized in Table 5.

3.10. Cross-Dataset Evaluation

To evaluate the generalization capability of intrusion detection models, cross-dataset experiments were conducted. In these experiments, models trained on one dataset were evaluated on a different dataset.

Cross-dataset evaluation simulates realistic deployment scenarios in which models trained in one IoT environment must detect intrusions in another environment with different traffic characteristics.

The following cross-dataset combinations were evaluated:

Train on BoT-IoT, test on CIC-IoT 2023.
Train on BoT-IoT, test on N-BaIoT.
Train on CIC-IoT 2023, test on BoT-IoT.
Train on CIC-IoT 2023, test on N-BaIoT.
Train on N-BaIoT, test on BoT-IoT.
Train on N-BaIoT, test on CIC-IoT 2023.

This evaluation provides insights into model robustness and transferability across heterogeneous IoT network environments.
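The six train/test combinations listed above are the ordered pairs of the three datasets, which can be enumerated as in this sketch. The Random Forest classifier, the 0.5 threshold, the evaluation on the full target set, and the synthetic shifted datasets are assumptions made for illustration.

```python
from itertools import permutations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_dataset_eval(datasets, seed=42):
    """datasets maps name -> (X, y), all sharing the same harmonized columns."""
    results = {}
    for source, target in permutations(datasets, 2):  # ordered pairs
        X_src, y_src = datasets[source]
        X_tgt, y_tgt = datasets[target]
        model = make_pipeline(
            StandardScaler(),
            RandomForestClassifier(n_estimators=50, random_state=seed),
        )
        model.fit(X_src, y_src)  # scaler statistics come from the source only
        scores = model.predict_proba(X_tgt)[:, 1]
        results[(source, target)] = {
            "f1": f1_score(y_tgt, (scores >= 0.5).astype(int), zero_division=0),
            "roc_auc": roc_auc_score(y_tgt, scores),
        }
    return results


def make_toy(shift, seed, n=400):
    """Synthetic dataset with a dataset-specific offset (assumption)."""
    rng = np.random.default_rng(seed)
    y = (rng.random(n) < 0.8).astype(int)   # attack-heavy labels
    X = rng.normal(size=(n, 20)) + shift    # simulated distribution shift
    X[y == 1, :5] += 2.0                    # attacks differ in five features
    return X, y


datasets = {
    "CIC-IoT 2023": make_toy(0.0, 0),
    "BoT-IoT": make_toy(1.5, 1),
    "N-BaIoT": make_toy(-1.5, 2),
}
results = cross_dataset_eval(datasets)
for pair, m in results.items():
    print(pair, {k: round(float(v), 3) for k, v in m.items()})
```

Fitting the scaler on the source dataset only mirrors deployment, where statistics of the target environment are not available at training time.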

3.11. Ablation Study

To analyze the contribution of each architectural component in the proposed deep learning model, an ablation study was conducted. The objective of this analysis is to isolate the impact of individual architectural modules on intrusion detection performance.

Several model configurations were evaluated by systematically removing specific components from the proposed architecture. The evaluated variants include:

Full Model (CNN + BiLSTM + Attention).
No CNN (BiLSTM + Attention).
No BiLSTM (CNN + Attention).
No Attention (CNN + BiLSTM).
CNN Only.

This experimental design enables a detailed analysis of how convolutional feature extraction, sequential modeling, and attention mechanisms individually contribute to the overall detection performance.

The ablation analysis provides insights into the relative importance of each architectural component and helps determine whether the hybrid CNN-BiLSTM-Attention design offers measurable advantages over simplified architectures.

4. Results and Discussion

This section presents the experimental results obtained from the benchmark models and the proposed deep learning architecture. The analysis focuses on detection performance, cross-dataset generalization capability, and the contribution of different architectural components. The results are interpreted from a benchmarking perspective rather than a model-centric perspective. Accordingly, performance comparisons are used to assess the influence of dataset characteristics and evaluation settings, rather than to claim superiority of a specific architecture.

All reported performance metrics are obtained under a binary classification setting. While this formulation simplifies cross-dataset evaluation, it does not capture the model’s ability to distinguish between different attack categories. As a result, the reported performance may overestimate real-world effectiveness in scenarios requiring fine-grained attack identification, and near-perfect intra-dataset performance may partially reflect the reduced complexity of the task and should not be interpreted as evidence of effective multi-class attack discrimination.

The near-perfect performance observed in intra-dataset evaluation (e.g., ROC-AUC values approaching 1.000) requires careful interpretation. Such results may arise from strong class separability and dataset construction characteristics rather than true model generalization. Although strict precautions were taken to prevent data leakage, these results highlight the limitations of evaluating intrusion detection systems solely within a single dataset.

4.1. Machine Learning Benchmark

To establish baseline intrusion detection performance, several classical machine learning algorithms were evaluated across the three IoT cybersecurity datasets: CIC-IoT 2023, BoT-IoT, and N-BaIoT. The evaluated benchmark models include Logistic Regression, Random Forest, XGBoost, LightGBM, and Support Vector Machine (SVM). Table 6 summarizes the best-performing machine learning models obtained on each dataset using the standardized preprocessing pipeline described in Section 3.

The benchmark results indicate that classical machine learning models, particularly tree-based ensemble methods such as Random Forest, XGBoost, and LightGBM, achieve strong detection performance across all evaluated datasets. Accuracy values range between 0.996 and 0.999, with consistently high F1-scores and ROC-AUC values. However, performance differences between models are relatively small, suggesting that dataset characteristics and feature separability play a dominant role in determining classification performance. These findings indicate that multiple models can effectively capture the underlying patterns in the data, resulting in similar decision boundaries.

Despite these strong results, they should be interpreted with caution. High performance under intra-dataset evaluation does not necessarily imply robustness or generalization. This limitation becomes evident in cross-dataset experiments, where model performance drops significantly when evaluated on different datasets, indicating sensitivity to distribution shifts and dataset-specific patterns.

Furthermore, near-perfect performance (e.g., accuracy and ROC-AUC values approaching 1.000) is primarily attributed to the characteristics of the datasets rather than inherent model superiority. In particular, datasets such as BoT-IoT and N-BaIoT exhibit highly separable traffic patterns and strong statistical differences between benign and malicious samples. In addition, the use of binary classification simplifies the decision boundary, which can lead to inflated performance metrics under intra-dataset evaluation.

Overall, these results suggest that while models demonstrate strong detection capability under controlled experimental conditions, their real-world robustness remains limited. This highlights the importance of cross-dataset evaluation and more realistic experimental settings when assessing intrusion detection systems. Importantly, the observed high performance is not attributable to data leakage. All preprocessing steps were applied after dataset splitting, transformations were fitted exclusively on the training set, and all leakage-prone features were removed. Therefore, the results reflect dataset characteristics rather than unintended information leakage.

4.2. Deep Learning Baselines

In addition to classical machine learning models, deep learning baseline architectures were also evaluated to provide a fair comparison with the proposed intrusion detection model. Two widely adopted deep learning architectures, Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, were implemented as baseline models for IoT intrusion detection. These architectures are widely used in network traffic analysis due to their ability to capture feature-level interactions and dependencies across feature positions. Table 7 presents the performance results of the CNN and LSTM baseline models across the three evaluated datasets.

The deep learning baseline models, including CNN and LSTM architectures, also demonstrate strong detection performance across all evaluated datasets. Their performance is comparable to that of classical machine learning models, with only marginal differences in accuracy and ROC-AUC values.

CNN models capture local interactions between features along the feature dimension, while LSTM models learn dependencies across feature positions. However, similar to classical models, the performance differences remain limited. This suggests that, for the evaluated datasets, the inherent structure and separability of the data have a greater impact on performance than the choice of model architecture.

As shown in Figure 4, all evaluated models achieve high AUC values, indicating strong discrimination capability between benign and malicious traffic. However, the zoomed view reveals small performance differences, where ensemble models and the proposed hybrid architecture maintain consistently high true positive rates even at very low false positive rates.

4.3. Performance of the Proposed CNN–BiLSTM–Attention Model

To further improve intrusion detection performance, a hybrid deep learning architecture integrating Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and an attention mechanism was proposed. The proposed model was evaluated using the same preprocessing pipeline and data splitting strategy applied to the benchmark models to ensure a fair and consistent comparison. Table 8 presents the performance results of the proposed model across the three evaluated IoT intrusion detection datasets.

The proposed CNN–BiLSTM–Attention model achieves competitive performance across all evaluated datasets, with accuracy values ranging from 0.993 to 0.999 and consistently high ROC-AUC and PR-AUC scores. While the model performs slightly better than some baseline methods in certain cases, the observed improvements are relatively modest.

These results indicate that the hybrid architecture is effective for feature-level representation learning on tabular IoT traffic data. However, the limited performance gains over simpler models suggest that architectural complexity alone does not significantly improve detection performance under the given experimental conditions.

Notably, the proposed CNN–BiLSTM–Attention model does not consistently outperform strong classical machine learning baselines, particularly tree-based ensemble methods such as Random Forest and LightGBM. The relatively small performance differences observed across models suggest that multiple approaches are capable of effectively capturing the underlying structure of the datasets.

Overall, these findings indicate that dataset characteristics, feature representation, preprocessing strategies, and evaluation protocols may have a greater influence on performance than architectural complexity alone. This observation reinforces the importance of standardized benchmarking and carefully controlled experimental procedures when evaluating intrusion detection systems.

To assess the stability of the proposed model, experiments were conducted under consistent training settings using fixed random seeds. The obtained results showed only minor variations across evaluation metrics, indicating that the model is relatively stable and not highly sensitive to random initialization. Although no formal statistical significance testing, confidence-interval estimation, or repeated multi-run analysis was performed, the observed consistency suggests that the reported results are reliable within the defined experimental setup.

A notable observation across experiments is the discrepancy between intra-dataset and cross-dataset performance. While models achieve near-perfect results when trained and evaluated on the same dataset, their performance degrades noticeably under cross-dataset conditions. This behavior demonstrates the difficulty of generalizing learned representations across heterogeneous IoT environments characterized by differing traffic distributions and attack patterns.

In several cross-dataset scenarios, relatively moderate accuracy values are accompanied by lower ROC-AUC or F1-score values, indicating reduced discriminative capability despite seemingly acceptable classification accuracy. Similarly, combinations such as relatively high PR-AUC with comparatively lower F1-score values are also observed. This behavior can largely be attributed to class imbalance and the differing interpretations of evaluation metrics. Specifically, ROC-AUC and PR-AUC measure ranking performance across varying decision thresholds, whereas the F1-score depends on a fixed classification threshold. Consequently, a model may exhibit reasonable ranking capability while still producing suboptimal classification behavior at a specific operating threshold.
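The difference between ranking metrics and threshold-dependent metrics can be reproduced with a small constructed example. The scores below are synthetic and chosen so that attacks always rank above benign flows while all scores fall below 0.5, as might happen when a distribution shift compresses the model’s output range; they are not results from this study.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Synthetic scores (assumption): perfect ranking, but every score is below 0.5.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40])

roc_auc = roc_auc_score(y_true, y_score)           # ranking quality
pr_auc = average_precision_score(y_true, y_score)  # ranking quality


def f1_at(threshold):
    """F1-score after converting scores to labels at a fixed threshold."""
    y_pred = (y_score >= threshold).astype(int)
    return f1_score(y_true, y_pred, zero_division=0)


f1_default = f1_at(0.5)        # no sample is flagged as an attack
f1_recalibrated = f1_at(0.22)  # threshold placed between the two classes

print(roc_auc, pr_auc, f1_default, f1_recalibrated)
```

Here ROC-AUC and average precision both equal 1.0, whereas the F1-score is 0.0 at the default threshold and 1.0 once the threshold is moved to 0.22, showing that a poor operating point can hide an otherwise useful ranking.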

These findings highlight that relying on a single evaluation metric may produce misleading conclusions, particularly in cross-dataset evaluation scenarios affected by distribution shift. Therefore, comprehensive evaluation using multiple complementary metrics is essential for accurately assessing intrusion detection robustness.

Importantly, the proposed CNN–BiLSTM–Attention model does not consistently outperform strong tree-based baselines under standardized evaluation settings. This observation further supports the central conclusion of this study: under controlled benchmarking conditions, dataset properties, feature engineering, preprocessing consistency, and evaluation methodology may exert a greater impact on performance than increasing architectural complexity alone.

4.4. Cross-Dataset Generalization

While intra-dataset evaluation provides insight into model performance under a single traffic distribution, real-world smart-city environments require intrusion detection systems capable of generalizing across heterogeneous network conditions.

To evaluate model robustness under distribution shifts, cross-dataset experiments were conducted as described in Section 3.10. In these experiments, models trained on one dataset were evaluated on a different dataset with distinct traffic characteristics. The detailed results of these experiments are presented in Table 9, which highlights the variability in model performance across different dataset combinations. The discrepancy between near-perfect intra-dataset performance and degraded cross-dataset results further supports the interpretation that the models are not overfitting through leakage, but rather learning dataset-specific statistical patterns that do not generalize across different IoT environments.

The cross-dataset experiments use the harmonized feature space and the preprocessing pipeline described in Section 3.3. All visualizations and reported metrics were generated directly from the same evaluation outputs to ensure complete consistency between tables, figures, and textual descriptions.

Figure 5 illustrates the confusion matrix obtained for the proposed CNN–BiLSTM–Attention model on the CIC-IoT 2023 dataset under intra-dataset evaluation. The confusion matrix was computed on the 15% test partition, corresponding to approximately 30,000 evaluation samples. The model correctly classifies the majority of benign and malicious traffic samples, with relatively few false positives and false negatives.

However, this strong intra-dataset performance should be interpreted cautiously. The near-perfect classification behavior observed in Figure 5 primarily reflects the statistical properties and class separability of the dataset under controlled experimental conditions. As demonstrated later in the cross-dataset evaluation experiments, models trained on a single dataset often fail to generalize effectively to heterogeneous IoT traffic environments despite achieving excellent intra-dataset results.

Representative confusion matrices for intra-dataset evaluations are provided in Figure 5, Figure 6 and Figure 7, while cross-dataset generalization behavior is further illustrated through the heatmap representation shown in Figure 8.

Figure 6 illustrates the confusion matrix of the proposed CNN–BiLSTM–Attention model on the N-BaIoT dataset under intra-dataset evaluation. The updated confusion matrix was regenerated using the finalized 15% test partition corresponding to approximately 30,000 evaluation samples. The model demonstrates strong classification performance with only a limited number of false positives and false negatives.

Figure 7 illustrates the confusion matrix of the proposed CNN–BiLSTM–Attention model on the BoT-IoT dataset under intra-dataset evaluation. The updated confusion matrix was regenerated using the finalized 15% test partition corresponding to approximately 3284 evaluation samples. The model correctly classifies the majority of benign and malicious traffic samples, while only a small number of samples are misclassified.
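The test-partition sizes quoted for Figures 5, 6 and 7 follow from the 15% test split (Table 5) and the final sample counts in Table 3. The short check below reproduces that arithmetic. The per-class estimate assumes that stratification preserves the benign share reported in Table 3, and the exact counts produced by a given splitting tool may differ by a sample depending on its rounding.

```python
def partition_size(total_samples, percent):
    """Size of a partition holding `percent` % of the data, rounded to nearest."""
    return (total_samples * percent + 50) // 100


# (final samples, benign share in %) as reported in Table 3.
datasets = {
    "CIC-IoT 2023": (200_000, 10.00),
    "BoT-IoT": (21_893, 10.03),
    "N-BaIoT": (200_000, 10.00),
}

test_sizes = {}
benign_estimates = {}
for name, (total, benign_pct) in datasets.items():
    test_sizes[name] = partition_size(total, 15)
    # Assumption: the stratified split keeps the class ratio in the test set.
    benign_estimates[name] = round(test_sizes[name] * benign_pct / 100)
    print(f"{name}: test={test_sizes[name]}, benign~{benign_estimates[name]}")
```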

The cross-dataset generalization results are further visualized using the heatmap representation shown in Figure 8. The results reveal noticeable performance degradation under distribution shifts, highlighting the limited generalization capability of models trained on a single dataset. This behavior reflects substantial differences in traffic characteristics and attack patterns across IoT datasets and underscores the challenge of developing intrusion detection systems that generalize effectively in heterogeneous environments.

For example, models trained on BoT-IoT and evaluated on N-BaIoT reach an accuracy of 0.901 but an F1-score of only 0.71 and a ROC-AUC of 0.65 (Table 9), indicating substantial differences in feature distributions. Cross-evaluation involving CIC-IoT 2023 shows varying levels of generalization: transfer from CIC-IoT 2023 to N-BaIoT remains comparatively strong (accuracy 0.989, F1-score 0.92), whereas transfer from CIC-IoT 2023 to BoT-IoT is weaker (accuracy 0.903, F1-score 0.69). The most pronounced degradation occurs when models trained on N-BaIoT are evaluated on BoT-IoT, where accuracy falls to 0.612 and ROC-AUC to 0.52, only slightly above the 0.5 expected from random ranking. This pattern is consistent with a strong distribution mismatch and illustrates the limitations of dataset-specific model optimization.

Overall, these findings demonstrate that models evaluated solely within a single dataset may produce overly optimistic performance estimates. Cross-dataset evaluation, therefore, provides a more realistic and rigorous assessment of intrusion detection robustness in heterogeneous IoT environments.
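One reason accuracy alone can mislead under the roughly 90/10 attack-to-benign ratio in Table 3 is that a degenerate predictor already scores well on it. The sketch below uses a hypothetical test set of 900 attack and 100 benign samples and a model that labels everything as attack. The counts are illustrative and are not results from this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the class counted as positive."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical 90/10 test set scored for an "always attack" predictor.
attack_view = binary_metrics(tp=900, fp=100, fn=0, tn=0)  # attack = positive
benign_view = binary_metrics(tp=0, fp=0, fn=100, tn=900)  # benign = positive
macro_f1 = (attack_view["f1"] + benign_view["f1"]) / 2

print(f"accuracy        = {attack_view['accuracy']:.3f}")
print(f"F1 (attack)     = {attack_view['f1']:.3f}")
print(f"F1 (benign)     = {benign_view['f1']:.3f}")
print(f"F1 (macro mean) = {macro_f1:.3f}")
```

Such a predictor reaches 0.90 accuracy while never identifying a benign flow, which is why accuracies near 0.90 in Table 9 are best read together with the accompanying F1-score and ROC-AUC values.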

4.5. Ablation Study Analysis

The ablation study results indicate that performance differences across architectural variants are relatively small, with accuracy varying within a narrow range (e.g., 0.992 vs. 0.991 vs. 0.989). Such limited differences do not provide strong evidence for substantial synergistic interaction between individual architectural components. Instead, the results suggest that each component contributes incrementally under the evaluated experimental conditions. The detailed ablation results are presented in Table 10, highlighting the relative contribution of each component to overall model performance.

In addition to predictive performance, the computational complexity of each architectural variant was evaluated using trainable parameter count, average training time per epoch, and inference latency per sample. The results indicate that models incorporating BiLSTM and attention mechanisms require substantially higher computational resources than simpler CNN-based variants, while the corresponding performance improvements remain limited. For example, the full CNN–BiLSTM–Attention model has 412,865 trainable parameters and an inference latency of 2.31 ms per sample, compared with 95,233 parameters and 0.94 ms for the CNN-only variant, yet accuracy improves only from 0.990 to 0.992 (Table 10). These findings suggest that increased architectural complexity does not necessarily translate into proportional performance gains within the evaluated tabular IoT intrusion detection setting.
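The trade-off can be made explicit by expressing each variant relative to the CNN-only baseline. The sketch below uses the values reported in Table 10. The choice of CNN Only as the baseline and the helper name `relative_cost` are illustrative assumptions, not part of the original experimental code.

```python
# (accuracy, trainable parameters, training s/epoch, inference ms/sample),
# copied from Table 10.
ABLATION = {
    "Full Model": (0.992, 412_865, 18.4, 2.31),
    "No CNN": (0.991, 356_417, 16.9, 2.08),
    "No Attention": (0.991, 401_921, 17.6, 2.15),
    "No BiLSTM": (0.989, 128_544, 9.7, 1.12),
    "CNN Only": (0.990, 95_233, 7.9, 0.94),
}


def relative_cost(variant, baseline="CNN Only"):
    """Accuracy gain and cost ratios of `variant` relative to `baseline`."""
    acc, params, train_s, latency = ABLATION[variant]
    b_acc, b_params, b_train_s, b_latency = ABLATION[baseline]
    return {
        "accuracy_gain": acc - b_acc,
        "param_ratio": params / b_params,
        "train_time_ratio": train_s / b_train_s,
        "latency_ratio": latency / b_latency,
    }


for name in ABLATION:
    cost = relative_cost(name)
    print(
        f"{name:13s} gain={cost['accuracy_gain']:+.3f} "
        f"params x{cost['param_ratio']:.2f} "
        f"train x{cost['train_time_ratio']:.2f} "
        f"latency x{cost['latency_ratio']:.2f}"
    )

full = relative_cost("Full Model")
```

Read this way, the full model uses more than four times the parameters and more than twice the per-sample latency of the CNN-only variant for an accuracy gain of 0.002.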

The ablation study further shows that removing individual components from the proposed architecture leads to only modest changes in performance, indicating that each component contributes incrementally rather than critically to the final prediction capability. Among the evaluated components, the BiLSTM module exhibits a slightly greater influence: accuracy is 0.989 without BiLSTM, compared with 0.991 without the CNN or without attention (Table 10). This is likely due to its ability to model dependencies across feature dimensions. Nevertheless, the overall performance differences remain small, reinforcing the observation that dataset characteristics, feature representation, and preprocessing strategies exert a greater influence on performance than architectural design alone.

This behavior may also be attributed to the tabular nature of the evaluated datasets, where explicit temporal dependencies are limited. Under such conditions, recurrent architectures such as BiLSTM may provide less benefit compared with domains involving genuine sequential or temporal structures. Similarly, the contribution of the attention mechanism appears relatively marginal in the evaluated setting. Although attention layers increase model capacity, they do not produce substantial improvements over simpler variants. This suggests that attention mechanisms may provide greater benefit in scenarios involving richer feature interactions, larger-scale datasets, or more complex sequential relationships.

Overall, the ablation results demonstrate an important trade-off between model complexity and computational efficiency. More complex variants, particularly those incorporating BiLSTM and attention layers, incur increased computational cost in terms of training time, parameter count, and inference latency, while yielding only relatively small performance improvements. These findings further support the broader conclusion of this study that standardized preprocessing, dataset properties, and evaluation methodology may have a greater impact on intrusion detection performance than increasing architectural complexity alone.

5. Conclusions and Future Work

This study primarily contributes a standardized benchmarking framework for evaluating intrusion detection systems across heterogeneous IoT datasets. While a hybrid CNN–BiLSTM–Attention architecture was implemented, the results demonstrate that architectural complexity alone does not guarantee improved performance over strong baseline models. Instead, the findings highlight that evaluation methodology, dataset diversity, and preprocessing consistency play a more critical role in determining reliable intrusion detection performance. Because all experiments use a binary formulation, these conclusions are limited to binary intrusion detection. Although this formulation enables consistent cross-dataset comparison, it does not address multi-class attack discrimination, which remains an important direction for future work.

Although several models achieved near-perfect results under intra-dataset evaluation, cross-dataset experiments revealed substantial performance degradation and unstable metric behavior under distribution shifts. In particular, the results indicate that the evaluated intrusion detection models, regardless of architecture, generalize poorly across heterogeneous datasets. These findings highlight a fundamental limitation of dataset-specific training and reinforce the need for more robust evaluation protocols and domain adaptation techniques. Moreover, they demonstrate that high accuracy alone is insufficient to assess model effectiveness and that cross-dataset validation is essential for evaluating robustness in realistic IoT environments.

Furthermore, the limited performance differences observed across models and ablation variants suggest that dataset characteristics and feature representation have a greater impact on performance than architectural design. This reinforces the importance of standardized benchmarking protocols when comparing intrusion detection systems.

Despite these contributions, this study has several limitations. First, the use of binary classification simplifies the intrusion detection task and may lead to inflated performance estimates, as it does not reflect real-world requirements where distinguishing between attack types is essential. Second, despite feature harmonization, inherent differences in dataset distributions remain, which significantly affect cross-dataset generalization. Third, experiments were conducted in offline settings, and real-time deployment constraints were not considered. Finally, although precautions were taken to prevent data leakage, the possibility of residual dataset-specific artifacts cannot be entirely excluded.

In addition, the proposed approach does not explicitly model temporal dependencies in network traffic, as the BiLSTM operates over feature dimensions rather than time-ordered sequences. Moreover, all experiments were conducted using fixed random seeds to ensure reproducibility. The reported results correspond to single-run evaluations under consistent experimental conditions, and no multi-run statistical analysis was performed.
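As an indication of what a multi-run analysis would add, the sketch below summarizes one metric over several seeds by its mean and sample standard deviation. The five scores are placeholders invented for illustration and are not measurements from this study. The helper name `summarize_runs` is likewise hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Mean, sample standard deviation and range of one metric across runs."""
    n = len(scores)
    return {
        "n": n,
        "mean": mean(scores),
        "std": stdev(scores) if n > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder values standing in for one metric measured with five seeds.
placeholder_scores = [0.90, 0.92, 0.88, 0.91, 0.89]
summary = summarize_runs(placeholder_scores)
print(f"{summary['mean']:.3f} +/- {summary['std']:.3f} over {summary['n']} runs")
```

Reporting a spread alongside each mean would make it possible to judge whether differences of the size seen between ablation variants exceed run-to-run variation.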

Future work will focus on extending the framework to multi-class intrusion detection, incorporating domain adaptation techniques to improve cross-dataset generalization, and evaluating models under realistic deployment conditions. Additional research will explore sequence construction strategies based on time-ordered flows, device-level aggregation, or session-based grouping. Furthermore, future studies will incorporate rigorous statistical validation through repeated experiments and hypothesis testing, as well as more realistic evaluation protocols such as device-held-out, time-based splitting, and leave-attack-family-out validation.
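Device-held-out and leave-attack-family-out validation share one mechanism: all records belonging to one group are withheld from training and used only for testing. A minimal sketch of that splitting logic follows. The record fields (`device`, `attack_family`, `label`) and their values are hypothetical placeholders, not the schema of the datasets used here.

```python
from collections import defaultdict


def leave_one_group_out(records, group_key):
    """Yield (held_out_group, train_records, test_records) for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    for held_out in sorted(groups):
        train = [r for g in groups if g != held_out for r in groups[g]]
        yield held_out, train, groups[held_out]


# Hypothetical flow records; field names and values are placeholders.
flows = [
    {"device": "cam-1", "attack_family": "mirai", "label": 1},
    {"device": "cam-1", "attack_family": "none", "label": 0},
    {"device": "bell-2", "attack_family": "bashlite", "label": 1},
    {"device": "bell-2", "attack_family": "none", "label": 0},
    {"device": "plug-3", "attack_family": "mirai", "label": 1},
]

device_folds = list(leave_one_group_out(flows, "device"))
for held_out, train, test in device_folds:
    # No device may appear on both sides of a split.
    assert held_out not in {r["device"] for r in train}
    print(f"held out {held_out}: train={len(train)}, test={len(test)}")
```

Two practical points follow from this sketch. A grouping attribute such as a device identifier would have to be kept as a split key even if, as in Table 4, it is excluded from the model features. For leave-attack-family-out splits, benign traffic would also need to be distributed across both sides rather than treated as one more family.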

Author Contributions

Conceptualization, A.A. and S.D.; methodology, A.A.; software, A.A.; validation, A.A. and S.D.; formal analysis, A.A.; investigation, A.A.; data curation, A.A.; writing—original draft preparation, S.D.; writing—review and editing, A.A. and S.D.; visualization, A.A.; supervision, S.D.; project administration, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available: CIC-IoT 2023: https://www.kaggle.com/datasets/himadri07/ciciot2023 (accessed on 2 February 2026); BoT-IoT: https://research.unsw.edu.au/projects/bot-iot-dataset (accessed on 2 February 2026); and N-BaIoT: https://www.kaggle.com/datasets/mkashifn/nbaiot-dataset (accessed on 2 February 2026). The preprocessing pipeline, feature harmonization procedure, and experimental configuration are described in detail in Section 3 to ensure reproducibility. Code implementation can be made available upon reasonable request.

Acknowledgments

We would like to thank the Deanship of Scientific Research at Shaqra University for supporting this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Alsubaei, F.S. Smart deep learning model for enhanced IoT intrusion detection. Sci. Rep. 2025, 15, 20577.
2. Fares, I.A.; Ibrahim, A.G.A.; Elaziz, M.A.; Shrahili, M.; Elmahallawy, A.A.; Sohaib, R.M.; Shawky, M.A.; Shah, S.T. Deep transfer learning based on hybrid Swin transformers with LSTM for intrusion detection systems in IoT environment. IEEE Open J. Commun. Soc. 2025, 6, 4342–4359.
3. Antunes, M.; Oliveira, L.; Seguro, A.; Veríssimo, J.; Salgado, R.; Murteira, T. Benchmarking Deep Learning Methods for Behaviour-Based Network Intrusion Detection. Informatics 2022, 9, 29.
4. Maseer, Z.K.; Yusof, R.; Bahaman, N.A.; Mostafa, S.A.; Foozy, C.F.M. Benchmarking of Machine Learning for Anomaly Based Intrusion Detection Systems in the CICIDS2017 Dataset. IEEE Access 2021, 9, 22351–22367.
5. Awajan, A. A Novel Deep Learning-Based Intrusion Detection System for IoT Networks. Computers 2023, 12, 34.
6. Khan, A.; Hussain, M.A.; Anwer, F. A Hybrid Lightweight Deep Learning-Based Intrusion Detection Approach in IoT Utilizing Feature Selection & Explainable Artificial Intelligence. IEEE Access 2025, 13, 192451–192464.
7. Alamro, H.; Marzouk, R.; Alruwais, N.; Negm, N.; Aljameel, S.S.; Khalid, M.; Hamza, M.A.; Alsaid, M.I. Modeling of Blockchain Assisted Intrusion Detection on IoT Healthcare System Using Ant Lion Optimizer With Hybrid Deep Learning. IEEE Access 2023, 11, 82199–82207.
8. Yaras, S.; Dener, M. IoT-Based Intrusion Detection System Using New Hybrid Deep Learning Algorithm. Electronics 2024, 13, 1053.
9. Khan, A.R.; Kashif, M.; Jhaveri, R.H.; Raut, R.; Saba, T.; Bahaj, S.A. Deep Learning for Intrusion Detection and Security of Internet of Things (IoT): Current Analysis, Challenges, and Possible Solutions. Secur. Commun. Netw. 2022, 2022, 4016073.
10. Vishwakarma, M.; Kesswani, N. DIDS: A Deep Neural Network Based Real-Time Intrusion Detection System for IoT. Decis. Anal. J. 2022, 5, 100142.
11. Thapa, N.; Liu, Z.; KC, D.B.; Gokaraju, B.; Roy, K. Comparison of Machine Learning and Deep Learning Models for Network Intrusion Detection Systems. Future Internet 2020, 12, 167.
12. Shahid, U.; Hussain, M.Z.; Hasan, M.Z.; Haider, A.; Ali, J.; Altaf, J. Hybrid Intrusion Detection System for RPL IoT Networks Using Machine Learning and Deep Learning. IEEE Access 2024, 12, 113099–113110.
13. Alkadi, S.; Al-Ahmadi, S.; Ben Ismail, M.M. Toward Improved Machine Learning-Based Intrusion Detection for Internet of Things Traffic. Computers 2023, 12, 148.
14. Houichi, M.; Jaidi, F.; Bouhoula, A. Enhancing Smart City Security: An Intrusion Detection System Using Machine Learning Methods With the UNB CIC IoT 2023 Dataset. IET Smart Cities 2025, 7, e70014.
15. Islam, N.; Farhin, F.; Sultana, I.; Kaiser, M.S.; Rahman, M.S.; Mahmud, M.; Hosen, A.S.M.S.; Cho, G.H. Towards Machine Learning Based Intrusion Detection in IoT Networks. Comput. Mater. Contin. 2021, 69, 1801–1819.
16. Banaamah, A.M.; Ahmad, I. Intrusion Detection in IoT Using Deep Learning. Sensors 2022, 22, 8417.
17. Tseng, S.-M.; Wang, Y.-Q.; Wang, Y.-C. Multi-Class Intrusion Detection Based on Transformer for IoT Networks Using CIC-IoT-2023 Dataset. Future Internet 2024, 16, 284.
18. Gopikrishnan, S.; Jonnalagadda, P.; Driss, M.; Boulila, W. EFS-IDS: An Enhanced Feature-Selective Intrusion Detection System for Imbalanced IoT Traffic Data. IEEE Open J. Commun. Soc. 2025, 6, 9673–9686.
19. Hariharan, S.; Annie Jerusha, Y.; Suganeshwari, G.; Syed Ibrahim, S.P.; Tupakula, U.; Varadharajan, V. A Hybrid Deep Learning Model for Network Intrusion Detection System Using Seq2Seq and ConvLSTM-Subnets. IEEE Access 2025, 13, 30705–30716.
20. Liao, H.; Murah, M.Z.; Hasan, M.K.; Mohd Aman, A.H.; Fang, J.; Hu, X.; Khan, A.U.R. A Survey of Deep Learning Technologies for Intrusion Detection in Internet of Things. IEEE Access 2024, 12, 4745–4761.
21. Alsoufi, M.A.; Razak, S.; Siraj, M.M.; Nafea, I.; Ghaleb, F.A.; Saeed, F.; Nasser, M. Anomaly-Based Intrusion Detection Systems in IoT Using Deep Learning: A Systematic Literature Review. Appl. Sci. 2021, 11, 8383.
22. Elnakib, O.; Shaaban, E.; Mahmoud, M.; Emara, K. EIDM: Deep Learning Model for IoT Intrusion Detection Systems. J. Supercomput. 2023, 79, 13241–13261.
23. Sharma, S.B.; Bairwa, A.K. Leveraging AI for Intrusion Detection in IoT Ecosystems: A Comprehensive Study. IEEE Access 2025, 13, 66290–66317.
24. Bankó, M.B.; Dyszewski, S.; Králová, M.; Limpek, M.B.; Papaioannou, M.; Choudhary, G.; Dragoni, N. Advancements in Machine Learning-Based Intrusion Detection in IoT: Research Trends and Challenges. Algorithms 2025, 18, 209.
25. Hamad, N.A.; Abu Bakar, K.A.; Qamar, F.; Jubair, A.M.; Mohamed, R.R.; Mohamed, M.A. Systematic Analysis of Federated Learning Approaches for Intrusion Detection in the Internet of Things Environment. IEEE Access 2025, 13, 95410–95444.
26. Alsaedi, A.; Moustafa, N.; Tari, Z.; Mahmood, A.; Anwar, A. TON_IoT Telemetry Dataset: A New Generation Dataset of IoT and IIoT for Data-Driven Intrusion Detection Systems. IEEE Access 2020, 8, 165130–165150.
27. Neto, E.C.P.; Dadkhah, S.; Ferreira, R.; Zohourian, A.; Lu, R.; Ghorbani, A.A. CICIoT2023: A Real-Time Dataset and Benchmark for Large-Scale Attacks in IoT Environment. Sensors 2023, 23, 5941.
28. Yasarathna, T.L.; Le-Khac, N.-A. ASEADOS-SDN-IoT: A Novel SDN-IoT Network Intrusion Detection Dataset and Framework. Internet Things 2026, 36, 101891.
29. Kouassi, B.M.; Ballo, A.B.; Ayikpa, K.J.; Mamadou, D.; Diabagate, Y. Optimized Intrusion Detection in the IoT Through Statistical Selection and Classification with CatBoost and SNN. Technologies 2025, 13, 441.
30. Aborokbah, M.M. A Novel Intrusion Detection Model for Enhancing Security in Smart City. IEEE Access 2024, 12, 107431–107444.
31. El-Hajj, M. Leveraging Digital Twins and Intrusion Detection Systems for Enhanced Security in IoT-Based Smart City Infrastructures. Electronics 2024, 13, 3941.
32. Alabsi, B.A.; Anbar, M.; Rihan, S.D.A. Conditional Tabular Generative Adversarial Based Intrusion Detection System for Detecting DDoS and DoS Attacks on the Internet of Things Networks. Sensors 2023, 23, 5644.
33. Asharf, J.; Moustafa, N.; Khurshid, H.; Debie, E.; Haider, W.; Wahab, A. A Review of Intrusion Detection Systems Using Machine and Deep Learning in Internet of Things: Challenges, Solutions and Future Directions. Electronics 2020, 9, 1177.
34. Mishra, N.; Pandya, S. Internet of Things Applications, Security Challenges, Attacks, Intrusion Detection, and Future Visions: A Systematic Review. IEEE Access 2021, 9, 59353–59377.
35. Ogab, M.; Zaidi, S.; Bourouis, A.; Calafate, C.T. Machine Learning-Based Intrusion Detection Systems for the Internet of Drones: A Systematic Literature Review. IEEE Access 2025, 13, 96681–96714.
36. Kikissagbe, B.R.; Adda, M. Machine Learning-Based Intrusion Detection Methods in IoT Systems: A Comprehensive Review. Electronics 2024, 13, 3601.
37. Waqas, A.; Khan, S.D.; Ullah, Z.; Ullah, M.; Ullah, H. Comparative Analysis of Deep Learning Models for Intrusion Detection in IoT Networks. Computers 2025, 14, 283.
38. Mirsky, Y.; Doitshman, T.; Elovici, Y.; Shabtai, A. Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection. In Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS 2018), San Diego, CA, USA, 18–21 February 2018.
39. Lin, S.; Clark, R.; Birke, R.; Schönborn, S.; Trigoni, N.; Roberts, S. Anomaly Detection for Time Series Using VAE-LSTM Hybrid Model. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 4322–4326.
40. Staffini, A.; Svensson, T.; Chung, U.-I.; Svensson, A.K. A Disentangled VAE-BiLSTM Model for Heart Rate Anomaly Detection. Bioengineering 2023, 10, 683.
41. Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst. 2019, 100, 779–796.
42. Firouzi, A.; Dadkhah, S.; Maret, S.A.; Ghorbani, A.A. DataSense: A Real-Time Sensor-Based Benchmark Dataset for Attack Analysis in IIoT with Multi-Objective Feature Selection. Electronics 2025, 14, 4095.
43. Meidan, Y.; Bohadana, M.; Mathov, Y.; Mirsky, Y.; Shabtai, A.; Breitenbacher, D.; Elovici, Y. N-BaIoT: Network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Comput. 2018, 17, 12–22.

Figure 1. Standardized Cross-Dataset IDS Benchmarking Framework.

Figure 2. Class distribution of benign and malicious traffic samples across the evaluated IoT datasets.

Figure 3. Architecture of the proposed CNN–BiLSTM–Attention model for tabular network traffic data.

Figure 4. Receiver Operating Characteristic (ROC) curves of representative machine learning and deep learning models evaluated on the CIC-IoT 2023 dataset.

Figure 5. Confusion matrix of the proposed CNN–BiLSTM–Attention model on the CIC-IoT 2023 dataset under intra-dataset evaluation.

Figure 6. Confusion matrix of the proposed CNN–BiLSTM–Attention model on the N-BaIoT dataset under intra-dataset evaluation.

Figure 7. Confusion matrix of the proposed CNN–BiLSTM–Attention model on the BoT-IoT dataset under intra-dataset evaluation.

Figure 8. Cross-dataset evaluation heatmap illustrating model accuracy when trained and tested on different IoT datasets.

Table 1. Comparison of Representative IoT Intrusion Detection Studies.

Ref	Dataset	Model	Key Limitation
[13]	UNSW-NB15, BoT-IoT, ToN-IoT, Edge-IIoT	Classical ML (DT, RF, SVM, NB, KNN, MLP)	Strong dependence on preprocessing and model-quality choices
[14]	CICIoT2023	ML-based smart-city IDS	Single-dataset evaluation in a smart-city setting
[15]	NSL-KDD, IoTDevNet, DS2OS, IoTID20, IoT Botnet 2020	DT, RF, SVM, DNN, DBN, LSTM, Stacked LSTM, Bi-LSTM	No drift handling; limited real-time validation
[16]	BoT-IoT	CNN, LSTM, GRU	Limited evaluation on one dataset
[17]	CIC-IoT-2023	Transformer and deep baselines	Limited generalization analysis
[18]	CIC-IDS, CIC-IoT	Hybrid feature selection + CNN–DNN	Higher architectural complexity
[19]	CIC-IDS2017, CIC-ToN-IoT, UNSW-NB15	Seq2Seq + ConvLSTM subnets	Scalability and broader validation remain limited
[20]	Deep learning IoT IDS literature	Class imbalance analysis	Highlights severe imbalance but limited controlled experiments
[21]	IoT anomaly-based IDS literature	Deep learning approaches in IoT IDS	Strong synthesis, but not an empirical benchmark
[22]	CICIDS2017	EIDM and comparative DL models	Similar attack classes remain difficult to separate
[23]	AI-based IoT IDS literature	AI/ML/DL/anomaly detection review	Strong synthesis, but not an empirical benchmarking framework
[24]	ML-based IoT IDS literature	Research trends and challenges	Heterogeneity and inconsistent reporting remain unresolved
[25]	Federated learning IDS literature	Analysis of heterogeneous IoT IDS conditions	Strong dependence on benchmarked and heterogeneous datasets
[26]	TON-IoT telemetry + network + OS logs	IoT/IIoT intrusion detection dataset	Highlights scarcity of representative IoT/IIoT datasets
[27]	CICIoT2023	Real-time IoT benchmark with 2/8/34 classes	Performance drops in harder multiclass settings
[28]	ASEADOS-SDN-IoT	SDN-IoT dataset and framework	Limited to a specific testbed environment
[29]	IoTID20	CatBoost + SNN with balancing and feature selection	Strong single-dataset dependence
[30]	Smart-city IDS setting	Smart-city IDS model	Not designed for cross-dataset benchmarking
[31]	Smart-city DT + IDS testbed	Digital Twin + IDS integration	Focuses on resilience monitoring more than broad benchmarking
[32]	BoT-IoT	CTGAN-based augmentation for DDoS/DoS detection	Narrow attack focus and limited generalization claims
[33]	IoT IDS surveys and datasets	Review of ML/DL IDS in IoT	Highlights need for more comprehensive systematic analysis
[34]	IoT applications, attacks, IDS, and datasets	Broad IoT security/IDS review	Emphasizes need for robust generalized IDS
[35]	Internet of Drones IDS literature	ML/DL IDS in resource-constrained drone environments	Lack of standardized datasets and lightweight benchmarking
[36]	CIC-IDS2017 (cleaned)	Real-time LSTM IDS for IoT nodes	Limited large-scale deployment validation
[37]	BoTIoT, CiCIoT, ToNIoT, WUSTL-IIoT-2021	CNN, LSTM, biLSTM under balanced/imbalanced settings	Results vary significantly across datasets and balancing settings

Table 2. Summary of the IoT intrusion detection datasets used in this study.

Dataset	Data Collection Scenario	Samples	Features	Attack Types	Application
CIC-IoT 2023 [41]	Realistic IoT network traffic generated in a large-scale IoT environment simulating smart-device communications	200,000	20	Multiple attack categories, including DDoS, brute force, scanning, and botnet activities	Evaluation of modern IoT intrusion detection systems
BoT-IoT [42]	Controlled IoT network testbed designed to simulate botnet attack scenarios	21,893	20	DDoS, DoS, reconnaissance, and information theft attacks	Botnet attack detection in IoT networks
N-BaIoT [43]	Traffic generated from real IoT devices infected with botnets such as Mirai and Bashlite	200,000	20	Botnet-based attacks targeting IoT devices	IoT malware and botnet detection

Table 3. Dataset statistics after preprocessing and stratified sampling.

Dataset	Final Samples	Benign (%)	Attack (%)	Features
CIC-IoT 2023	200,000	10.00	90.00	20
BoT-IoT	21,893	10.03	89.97	20
N-BaIoT	200,000	10.00	90.00	20

Table 4. Cross-Dataset Feature Harmonization and Retained Feature Set.

Feature Category	CIC-IoT 2023	BoT-IoT	N-BaIoT	Retained
Flow Duration	✔	✔	✔	✔
Total Packets	✔	✔	✔	✔
Total Bytes	✔	✔	✔	✔
Packet Rate	✔	✔	✔	✔
Byte Rate	✔	✔	✔	✔
Forward Packet Count	✔	✔	✔	✔
Backward Packet Count	✔	✔	✔	✔
Mean Packet Size	✔	✔	✔	✔
Std Packet Size	✔	✔	✔	✔
Inter-Arrival Time	✔	✔	✔	✔
Protocol	✔	✔	✖	✖
Timestamp	✔	✔	✔	✖
Flow ID	✔	✔	✔	✖
Device ID	✖	✖	✔	✖

Table 5. Training configuration and hyperparameter settings used in the experiments.

Parameter	Value	Description
Programming Language	Python	Used for implementing all experiments
Machine Learning Libraries	Scikit-learn	Used for classical ML models
Gradient Boosting Frameworks	XGBoost, LightGBM	Used for ensemble learning models
Deep Learning Framework	TensorFlow/Keras	Used for CNN, LSTM, and CNN–BiLSTM–Attention models
Execution Environment	Google Colab	Cloud-based computational environment
Training Split	70%	Dataset portion used for training
Validation Split	15%	Used for hyperparameter tuning and early stopping
Test Split	15%	Used for final model evaluation
Batch Size	64	Number of samples processed per training iteration
Number of Epochs	50	Maximum training iterations for deep learning models
Optimizer	Adam	Optimization algorithm used for training
Learning Rate	0.001	Initial learning rate for optimizer
Activation Function	ReLU	Used in CNN layers
Output Activation	Sigmoid	Used for binary classification
Dropout Rate	0.5	Applied to reduce overfitting
Early Stopping	10 epochs patience	Stops training if validation loss does not improve
Hyperparameter Tuning	Grid Search with 5-fold CV	Used for classical ML models
Evaluation Metrics	Accuracy, Precision, Recall, F1-score, ROC-AUC, PR-AUC	Performance evaluation metrics
Random Seed	Fixed	Ensures reproducibility of experiments

Table 6. Best-performing machine learning models across the evaluated datasets.

Dataset	Best Model	Accuracy	F1-Score	ROC-AUC
BoT-IoT	LightGBM	0.998	0.998	0.999
CIC-IoT 2023	Random Forest	0.996	0.997	0.998
N-BaIoT	XGBoost	0.999	0.999	0.999

Table 7. Deep learning baseline performance across IoT intrusion detection datasets.

Dataset	Model	Accuracy	ROC-AUC	PR-AUC
BoT-IoT	CNN	0.997	0.999	0.999
BoT-IoT	LSTM	0.998	0.999	0.999
CIC-IoT 2023	CNN	0.991	0.996	0.998
CIC-IoT 2023	LSTM	0.989	0.995	0.997
N-BaIoT	CNN	0.998	0.999	0.999
N-BaIoT	LSTM	0.998	0.999	0.999

Table 8. Performance of the proposed CNN–BiLSTM–Attention model.

Dataset	Accuracy	ROC-AUC	PR-AUC
BoT-IoT	0.998	0.999	0.999
CIC-IoT 2023	0.993	0.997	0.999
N-BaIoT	0.999	0.999	0.999

Table 9. Cross-dataset evaluation results demonstrating the generalization capability of intrusion detection models across heterogeneous IoT traffic datasets.

Train Dataset	Test Dataset	Accuracy	F1-Score	ROC-AUC	PR-AUC
BoT-IoT	CIC-IoT 2023	0.952	0.62	0.58	0.74
BoT-IoT	N-BaIoT	0.901	0.71	0.65	0.78
CIC-IoT 2023	BoT-IoT	0.903	0.69	0.67	0.80
CIC-IoT 2023	N-BaIoT	0.989	0.92	0.95	0.96
N-BaIoT	BoT-IoT	0.612	0.41	0.52	0.60
N-BaIoT	CIC-IoT 2023	0.971	0.81	0.73	0.88

Table 10. Ablation study results demonstrating the contribution of each architectural component in the proposed intrusion detection architecture.

Variant	Accuracy	ROC-AUC	PR-AUC	Trainable Parameters	Training Time/Epoch (s)	Inference Latency/Sample (ms)
Full Model (CNN + BiLSTM + Attention)	0.992	0.998	0.999	412,865	18.4	2.31
No CNN (BiLSTM + Attention)	0.991	0.997	0.999	356,417	16.9	2.08
No Attention (CNN + BiLSTM)	0.991	0.997	0.999	401,921	17.6	2.15
No BiLSTM (CNN + Attention)	0.989	0.996	0.998	128,544	9.7	1.12
CNN Only	0.990	0.996	0.998	95,233	7.9	0.94

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
