1. Introduction
The Internet of Things (IoT) has rapidly evolved into a fundamental technological paradigm enabling the interconnection of billions of heterogeneous devices across diverse application domains. Smart sensors, industrial controllers, connected vehicles, healthcare monitoring devices, and urban infrastructure components are increasingly integrated into large-scale cyber–physical ecosystems that continuously generate massive volumes of network traffic and operational data [1]. The expansion of IoT technologies has significantly transformed modern digital infrastructures by enabling real-time monitoring, intelligent automation, and data-driven decision-making across sectors such as smart cities, healthcare systems, transportation networks, and industrial automation environments [2].
Despite these benefits, the widespread deployment of IoT devices has introduced significant cybersecurity challenges. IoT devices often operate under strict resource constraints, including limited computational power, restricted memory capacity, and energy limitations, which make the deployment of traditional security mechanisms difficult [3]. Furthermore, the heterogeneity of IoT devices, communication protocols, and network architectures introduces additional complexity in securing large-scale interconnected environments. These characteristics significantly increase the attack surface of IoT ecosystems and make them particularly attractive targets for cyberattacks [4].
Recent years have witnessed the emergence of numerous large-scale cyber threats targeting IoT infrastructures. Attack scenarios such as distributed denial-of-service (DDoS) attacks, botnet propagation, data manipulation, and unauthorized device control have demonstrated the vulnerability of poorly secured IoT networks [5]. Compromised IoT devices can be exploited as part of coordinated botnet infrastructures capable of launching massive network attacks against critical services and digital infrastructures [6]. Consequently, ensuring robust and adaptive cybersecurity mechanisms for IoT environments has become a major priority for both academic researchers and industrial practitioners.
Intrusion Detection Systems (IDS) have long been considered a core component of network security architectures. IDS technologies aim to detect malicious activities and abnormal network behavior by analyzing network traffic patterns and identifying deviations from normal operational profiles [7]. Traditional IDS mechanisms typically rely on signature-based detection or rule-based monitoring techniques that match network behavior against predefined attack patterns. Although such approaches can effectively detect known threats, they are often insufficient for identifying novel or evolving attack strategies in dynamic environments such as IoT networks [8].
To overcome these limitations, artificial intelligence (AI) and machine learning (ML) techniques have increasingly been adopted to enhance intrusion detection capabilities. By leveraging data-driven learning algorithms, AI-based IDS models can automatically extract complex patterns from network traffic and identify anomalous behaviors associated with cyberattacks [9]. These methods allow intrusion detection systems to adapt to previously unseen threats and evolving attack strategies without relying solely on predefined signatures or manually engineered rules.
Recent studies have explored a wide range of machine learning and deep learning approaches for IoT intrusion detection. These approaches aim to improve detection accuracy, reduce false alarm rates, and enable scalable monitoring of large-scale IoT infrastructures [10]. However, several challenges remain. Many existing IDS models are evaluated on limited datasets or under controlled experimental settings that may not fully represent the complexity and diversity of real-world IoT environments [11]. As a result, models that demonstrate strong performance within a single dataset may exhibit significant performance degradation when deployed in different operational contexts or under varying network conditions.
Another important challenge lies in the absence of standardized evaluation frameworks that allow fair and consistent comparison between different intrusion detection approaches. Differences in preprocessing pipelines, dataset characteristics, feature engineering strategies, and evaluation metrics often make it difficult to determine whether reported improvements originate from the proposed model architecture or from differences in experimental design [12]. These limitations highlight the need for systematic benchmarking methodologies capable of evaluating IDS models under consistent and reproducible experimental conditions.
Motivated by these challenges, this study investigates the performance of artificial intelligence-based intrusion detection systems in smart-city IoT environments through a comprehensive benchmarking framework. The framework evaluates multiple machine learning and deep learning models across three representative IoT intrusion detection datasets, namely CIC-IoT 2023, BoT-IoT, and N-BaIoT. By enforcing consistent preprocessing pipelines, standardized evaluation metrics, and cross-dataset validation experiments, the study aims to provide a more reliable assessment of IDS robustness and generalization capability under heterogeneous IoT traffic conditions.
The contributions of this work are summarized as follows.
A standardized benchmarking framework for evaluating AI-based intrusion detection systems across multiple heterogeneous IoT datasets under consistent preprocessing and evaluation conditions.
A unified experimental pipeline that ensures reproducibility and fair comparison across classical machine learning and deep learning models.
A systematic cross-dataset evaluation protocol that reveals generalization limitations under distribution shifts.
An empirical analysis demonstrating that near-perfect intra-dataset performance can be misleading and does not necessarily reflect real-world robustness.
An ablation study showing that architectural complexity contributes only marginally compared to dataset characteristics and preprocessing strategies.
The primary objective of this study is to establish a standardized and reproducible benchmarking framework for evaluating intrusion detection systems across heterogeneous IoT datasets. While a hybrid CNN–BiLSTM–Attention architecture is included in the experimental pipeline, it is not presented as the main contribution. Instead, the proposed model serves as a representative deep learning approach within the benchmarking framework, enabling consistent comparison with classical machine learning baselines.
The experimental results demonstrate that the proposed architecture does not consistently outperform strong tree-based models such as Random Forest and LightGBM. This observation indicates that, under the evaluated conditions, dataset characteristics and preprocessing strategies have a greater impact on performance than architectural complexity. Therefore, the main contribution of this study lies in the standardized benchmarking protocol and cross-dataset evaluation methodology rather than in model design.
3. Methodology
3.1. Overview of the Proposed Framework
This study proposes a comprehensive intrusion detection framework for evaluating artificial intelligence-based Intrusion Detection Systems (IDS) in smart-city Internet of Things (IoT) environments. The framework is designed to systematically benchmark multiple machine learning models and assess the effectiveness of a proposed deep learning architecture under heterogeneous IoT network traffic conditions.
The proposed framework consists of several sequential stages, including dataset preparation, data preprocessing, benchmark model evaluation, deep learning-based intrusion detection, and cross-dataset generalization analysis.
First, multiple publicly available IoT intrusion detection datasets are collected and harmonized to ensure consistent experimental conditions. These datasets are selected to represent diverse IoT network environments and attack scenarios commonly encountered in smart city infrastructures. Harmonization procedures are applied to align feature formats and ensure compatibility across datasets.
Next, data preprocessing techniques are applied to improve data quality and model performance. This stage includes removing noisy or duplicate records, handling missing values, and normalizing feature distributions to ensure consistent numerical ranges across all input variables.
Following preprocessing, several widely used machine learning algorithms are trained to establish baseline intrusion detection performance. These baseline models provide a comparative benchmark for evaluating the effectiveness of the proposed deep learning approach.
Subsequently, a hybrid deep learning architecture combining Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and an attention mechanism is introduced. The CNN component extracts local feature interactions along the feature dimension, while the BiLSTM models dependencies across feature positions rather than temporal sequences.
To evaluate the robustness and generalization capability of the proposed model, cross-dataset experiments are conducted in which models trained on one dataset are tested on other datasets with different traffic distributions. In addition, ablation studies are performed to analyze the individual contributions of the CNN, BiLSTM, and attention components within the hybrid architecture.
The proposed benchmarking framework aims to provide a systematic and reproducible evaluation of AI-based IDS models in smart city IoT environments. Both classical machine learning algorithms and deep learning architectures are evaluated across multiple publicly available datasets to assess detection accuracy, robustness, and generalization capability, particularly in cross-dataset scenarios where training and testing data distributions differ.
The benchmarking pipeline consists of the following stages:
Dataset selection and preparation.
Data preprocessing and normalization.
Feature selection and dimensionality reduction.
Model training and hyperparameter optimization.
Evaluation using multiple performance metrics.
Cross-dataset generalization testing.
Figure 1 presents the complete benchmarking pipeline, including strict feature harmonization via intersection of shared numerical features, preprocessing fitted exclusively on the training set, and the cross-dataset evaluation protocol (train-on-A, test-on-B/C) used to assess generalization under distribution shift.
It is important to note that the objective of the proposed framework is not to introduce a new state-of-the-art model, but rather to provide a fair and standardized evaluation protocol. The framework focuses on consistent preprocessing, model evaluation, and cross-dataset validation under unified experimental conditions. This design enables systematic comparison between classical machine learning and deep learning models under identical preprocessing and evaluation settings.
The intrusion detection task is formulated as a binary classification problem (benign vs. attack) to enable consistent comparison across heterogeneous datasets. The considered datasets differ significantly in their labeling schemes and attack taxonomies, making direct multi-class alignment non-trivial. Therefore, all attack types are grouped into a single class to ensure a unified evaluation setting.
3.2. Dataset Description
To ensure a comprehensive evaluation of intrusion detection models, three publicly available IoT cybersecurity datasets were utilized in this study: CIC-IoT 2023, BoT-IoT, and N-BaIoT. These datasets represent diverse IoT network environments and contain various categories of malicious traffic, including distributed denial-of-service (DDoS) attacks, reconnaissance activities, and botnet-based intrusions. The CIC-IoT 2023 dataset represents modern IoT network traffic scenarios and includes a large volume of network flows generated from realistic IoT environments. The dataset contains multiple attack categories and heterogeneous traffic characteristics that simulate real-world IoT communication patterns. The BoT-IoT dataset was generated using a realistic IoT network testbed and includes both benign and malicious traffic associated with botnet attacks. The dataset contains several attack categories, including distributed denial-of-service (DDoS), denial-of-service (DoS), reconnaissance, and information theft attacks. The N-BaIoT dataset focuses specifically on botnet attacks targeting IoT devices. It contains both benign traffic and malicious traffic generated from compromised IoT devices performing botnet-related activities, such as Mirai and Bashlite attacks.
These datasets differ in terms of feature spaces, traffic distributions, and attack characteristics, making them suitable for evaluating the robustness and generalization capability of intrusion detection systems in heterogeneous IoT environments. To ensure reproducibility, the exact number of samples used in each dataset after preprocessing is fixed as follows: CIC-IoT 2023 (200,000 samples), N-BaIoT (200,000 samples), and BoT-IoT (21,893 samples). Stratified sampling was applied prior to dataset splitting to preserve the original class distribution in each dataset.
Table 2 summarizes the IoT intrusion detection datasets utilized in this study, including their data sources, sample sizes, attack categories, feature characteristics, and class distributions.
The distribution of benign and malicious traffic samples across the evaluated datasets is illustrated in Figure 2. The evaluated datasets exhibit noticeable class imbalance, where malicious traffic samples significantly outnumber benign samples. This imbalance is particularly evident in the CIC-IoT 2023 dataset, which contains a substantially larger number of attack instances compared with benign traffic. Such class imbalance is common in intrusion detection datasets and highlights the importance of robust evaluation metrics.
Because of dataset size and computational constraints, representative subsets were drawn through stratified sampling: 200,000 instances each for CIC-IoT 2023 and N-BaIoT, and a smaller subset of 21,893 samples for BoT-IoT, selected due to its inherent data characteristics and availability. These subsets preserve the original class imbalance characteristics of each dataset. Detailed dataset statistics, including class distributions, are provided in Table 3.
3.3. Data Preprocessing
Data preprocessing plays a critical role in ensuring the reliability, consistency, and reproducibility of intrusion detection models. In this study, a standardized preprocessing pipeline was applied across all datasets to ensure fair comparison and prevent experimental bias.
First, data cleaning procedures were performed to remove irrelevant or non-informative attributes. Columns containing only missing values were eliminated, and invalid entries (e.g., infinite values) were handled to prevent instability during training. Missing values were imputed using a median-based strategy (SimpleImputer), preserving the statistical properties of each feature.
To ensure consistent feature scaling, numerical attributes were normalized using standard scaling (StandardScaler), transforming features to zero mean and unit variance. Categorical features were encoded using one-hot encoding for low-cardinality variables and label encoding where appropriate.
To prevent data leakage, preprocessing was applied following a strict protocol. Dataset splitting into training (70%), validation (15%), and testing (15%) sets was performed prior to any preprocessing operations, using stratified sampling to preserve class distributions. All preprocessing steps, including imputation, scaling, and encoding, were fitted exclusively on the training set and subsequently applied to the validation and test sets. This ensures that no information from the evaluation data influences model training.
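The fit-on-train-only protocol can be illustrated with a small standard-library sketch. It is not the authors' code: the paper uses scikit-learn's SimpleImputer and StandardScaler, whereas the hypothetical helpers `fit_preprocessor` and `transform` below reimplement median imputation and zero-mean, unit-variance scaling by hand, so that it is visible which statistics are learned from the training split alone.

```python
import math
import statistics

def _is_valid(value):
    # Missing (None), NaN and infinite entries are all treated as missing.
    return value is not None and math.isfinite(value)

def fit_preprocessor(train_rows):
    """Learn per-feature (median, mean, std) from the TRAINING rows only."""
    params = []
    for j in range(len(train_rows[0])):
        observed = [row[j] for row in train_rows if _is_valid(row[j])]
        median = statistics.median(observed)
        filled = [row[j] if _is_valid(row[j]) else median for row in train_rows]
        mean = statistics.fmean(filled)
        std = statistics.pstdev(filled) or 1.0  # avoid dividing by zero for constant features
        params.append((median, mean, std))
    return params

def transform(rows, params):
    """Apply the training-set statistics to any split (train, validation or test)."""
    result = []
    for row in rows:
        scaled = []
        for value, (median, mean, std) in zip(row, params):
            if not _is_valid(value):
                value = median
            scaled.append((value - mean) / std)
        result.append(scaled)
    return result

# Toy data: the statistics come from `train`; `test` is only transformed.
train = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [4.0, float("inf")]]
test = [[None, 20.0]]
params = fit_preprocessor(train)
test_scaled = transform(test, params)
```

Calling `fit_preprocessor` on the validation or test rows would be exactly the leakage that the protocol is designed to rule out.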
In addition, features likely to encode dataset-specific artifacts were explicitly removed before training. These include flow identifiers, timestamps, source and destination information, device-related fields, and protocol-specific metadata. This step prevents models from exploiting dataset-specific signatures and ensures that learned patterns reflect generalizable traffic characteristics.
To enable cross-dataset evaluation, feature harmonization was performed using a strict intersection-based strategy. Given the heterogeneity of feature definitions across CIC-IoT 2023, BoT-IoT, and N-BaIoT, only numerical flow-level features that are consistently defined and semantically comparable across all datasets were retained. No manual feature construction or transformation into a new representation was introduced. Instead, features were aligned based on their statistical meaning (e.g., packet counts, byte counts, flow duration, and rate-based metrics), while incompatible or dataset-specific attributes were excluded.
This process resulted in a unified feature space consisting of 20 shared numerical features. All models were trained and evaluated using this consistent representation to ensure fair comparison across datasets. Following feature alignment, missing values were handled using median imputation, and numerical features were normalized using standard scaling within a unified preprocessing pipeline.
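A minimal sketch of intersection-based harmonization is given below. The dataset keys, column names, and alias tables are hypothetical placeholders, not the actual schemas of the three datasets. The sketch assumes that each dataset's native columns are first mapped to a shared semantic name and that only names present in every dataset are kept.

```python
def harmonize(columns_by_dataset, aliases_by_dataset):
    """Return the sorted list of shared feature names across all datasets.

    columns_by_dataset: dataset name -> list of native column names
    aliases_by_dataset: dataset name -> {native column: shared semantic name}
    Columns without an alias (identifiers, dataset-specific fields) are dropped.
    """
    shared = None
    for name, columns in columns_by_dataset.items():
        aliases = aliases_by_dataset[name]
        canonical = {aliases[col] for col in columns if col in aliases}
        shared = canonical if shared is None else shared & canonical
    return sorted(shared or [])

# Hypothetical schemas for illustration only.
columns = {
    "dataset_a": ["pkts", "bytes", "dur", "src_ip"],
    "dataset_b": ["tot_pkts", "tot_bytes", "flow_id"],
    "dataset_c": ["packet_count", "byte_count", "duration"],
}
aliases = {
    "dataset_a": {"pkts": "packet_count", "bytes": "byte_count", "dur": "flow_duration"},
    "dataset_b": {"tot_pkts": "packet_count", "tot_bytes": "byte_count"},
    "dataset_c": {"packet_count": "packet_count", "byte_count": "byte_count",
                  "duration": "flow_duration"},
}
shared_features = harmonize(columns, aliases)
```

In this toy case `flow_duration` is dropped because one dataset lacks it, which mirrors how a strict intersection reduces the feature space to the attributes that every dataset provides.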
To further verify the absence of data leakage, feature distributions were inspected after harmonization, confirming that no feature uniquely identifies a dataset or traffic source. As shown in Table 4, only features that are consistently available and semantically aligned across all datasets were retained, while identifiers and dataset-specific attributes were excluded. This ensures that the resulting feature space is both comparable and free from dataset-dependent artifacts.
Table 4 presents the cross-dataset feature harmonization process and the retained feature set used in this study. The symbols “✔” and “✖” indicate whether a feature is available in a given dataset and whether it was retained after the harmonization process.
3.4. Binary Intrusion Detection Formulation
To enable consistent comparison across heterogeneous datasets, the intrusion detection task was formulated as a binary classification problem. In this configuration, benign traffic was labeled as 0, while malicious traffic was labeled as 1, and all attack categories present in the datasets were grouped into the malicious class. This unified labeling strategy simplifies comparisons across datasets with different attack taxonomies and facilitates standardized benchmarking of intrusion detection models. Binary intrusion detection is widely adopted in IDS benchmarking studies, particularly when datasets contain heterogeneous attack categories and varying label structures.
Class imbalance was not explicitly corrected through resampling techniques in order to preserve the natural distribution of each dataset. Instead, stratified sampling was used during dataset splitting, and imbalance-aware evaluation metrics such as PR-AUC and F1-score were emphasized to provide a more realistic assessment of model performance.
While this formulation enables standardized benchmarking, it does not evaluate the model’s ability to distinguish between different attack categories. Consequently, the scope of this study is limited to binary intrusion detection.
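The label consolidation described above amounts to a one-line mapping. In the sketch below, the function name `to_binary` and the set of strings treated as benign are assumptions for illustration; the exact label spellings differ between datasets and are not given in the text.

```python
from collections import Counter

# Assumed spellings of the benign class; real datasets may use other strings.
BENIGN_NAMES = frozenset({"benign", "normal"})

def to_binary(label):
    """Map a dataset-specific label to 0 (benign) or 1 (any attack category)."""
    return 0 if str(label).strip().lower() in BENIGN_NAMES else 1

raw_labels = ["Benign", "DDoS", "Reconnaissance", "normal", "Mirai"]
binary_labels = [to_binary(label) for label in raw_labels]
class_counts = Counter(binary_labels)  # quick check of the resulting class balance
```

Counting the resulting classes right after the mapping is a cheap way to confirm that the imbalance reported for each dataset is preserved.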
3.5. Data Splitting Strategy
To ensure an unbiased evaluation of model performance, each dataset was divided into training, validation, and testing subsets. Stratified sampling was applied during dataset splitting to preserve the original class distribution across all subsets.
The datasets were partitioned according to the following proportions:
70% training set.
15% validation set.
15% testing set.
The training set was used to train the models, the validation set was used for hyperparameter tuning and model selection, and the testing set was reserved for final performance evaluation. This separation ensures that the evaluation results reflect the model’s ability to generalize to unseen data. As described in Section 3.3, splitting was performed prior to any preprocessing operations, and all preprocessing transformations were fitted exclusively on the training set before being applied to the validation and test sets.
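A stratified 70/15/15 split can be sketched with the standard library alone. The helper name `stratified_split`, the seed value, and the rounding rule are assumptions; in a scikit-learn workflow the same effect is usually obtained with two calls to `train_test_split` using its `stratify` argument.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve per-class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

# Toy example: 80 attack samples (1) and 20 benign samples (0).
labels = [1] * 80 + [0] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
```

Each split keeps the 80/20 attack-to-benign ratio of the toy data, which is the property the stratification is meant to guarantee.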
3.6. Evaluation Metrics
To comprehensively evaluate the performance of the intrusion detection models, several widely used classification metrics were employed. These metrics provide complementary insights into detection capability, particularly in intrusion detection tasks where class imbalance is common. The evaluation metrics used in this study include Accuracy, Precision, Recall, F1-score, Receiver Operating Characteristic Area Under the Curve (ROC-AUC), and Area Under the Precision–Recall Curve (PR-AUC).
The evaluation metrics are defined using the confusion matrix components: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
Accuracy measures the overall proportion of correctly classified instances among all evaluated samples:
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision indicates the proportion of correctly identified attack instances among all samples predicted as malicious:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
Recall, also referred to as the detection rate, measures the ability of the classifier to correctly detect malicious traffic among all actual attack instances:
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
The F1-score represents the harmonic mean of precision and recall and provides a balanced evaluation metric when dealing with imbalanced datasets:
\[ F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
The ROC-AUC metric evaluates the ability of a classifier to distinguish between benign and malicious traffic across different decision thresholds. It measures the area under the Receiver Operating Characteristic curve, which represents the trade-off between the true positive rate (TPR) and the false positive rate (FPR):
\[ \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{ROC\text{-}AUC} = \int_{0}^{1} \mathrm{TPR} \, d(\mathrm{FPR}) \]
PR-AUC measures the area under the precision–recall curve and is particularly informative for imbalanced intrusion detection datasets because it emphasizes the trade-off between precision and recall for the attack class. Since a random classifier attains a PR-AUC approximately equal to the proportion of attack samples, PR-AUC values should be interpreted relative to the attack prevalence of each dataset, especially when malicious samples dominate the dataset.
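Both threshold-free metrics can be computed directly from scores, as the following sketch shows. It uses two standard estimators that are assumptions here rather than the paper's stated implementation: ROC-AUC as the fraction of attack/benign pairs ranked correctly, and PR-AUC as average precision. The pairwise loop is quadratic and meant only for illustration on small inputs.

```python
def roc_auc(y_true, scores):
    """Fraction of (attack, benign) pairs where the attack scores higher; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def average_precision(y_true, scores):
    """Step-wise PR-AUC estimate: mean of the precision at the rank of each attack sample.

    Assumes at least one attack sample and no tied scores.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc = roc_auc(y_true, scores)           # 4 of 6 pairs are ordered correctly
ap = average_precision(y_true, scores)  # mean of 1/1, 2/3 and 3/4
```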
The performance of the intrusion detection system is evaluated based on four fundamental classification outcomes derived from the confusion matrix. True Positive (TP) represents the number of malicious traffic instances that are correctly identified as attacks by the model. True Negative (TN) refers to benign or normal traffic that is correctly classified as legitimate network activity. False Positive (FP) occurs when benign traffic is incorrectly classified as an attack, which may lead to unnecessary alerts and increased operational overhead for network administrators. False Negative (FN) represents malicious traffic that is incorrectly classified as normal traffic, posing a significant security risk because actual attacks remain undetected. These four outcomes form the basis for calculating key evaluation metrics such as accuracy, precision, recall, and F1-score, which are commonly used to assess the effectiveness and reliability of intrusion detection systems.
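The threshold-based metrics follow directly from the four confusion-matrix counts. The sketch below uses hypothetical helper names and adopts the common convention of returning 0.0 when a denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with attack = 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy case: 95 attacks, 5 benign flows, and a classifier that flags everything.
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
```

In this toy case the trivial classifier reaches 0.95 accuracy and a recall of 1.0 while misclassifying every benign flow, which illustrates why high scores on attack-dominated data need to be read alongside the class distribution.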
3.7. Benchmark Machine Learning Models
To establish baseline intrusion detection performance, several widely used machine learning algorithms were evaluated. The benchmark models include tree-based ensembles such as Random Forest, XGBoost, and LightGBM.
These models were selected due to their strong performance in previous intrusion detection studies and their effectiveness in handling structured network traffic data.
Tree-based ensemble models such as Random Forest and gradient boosting algorithms are particularly well-suited for tabular intrusion detection datasets due to their ability to capture complex feature interactions.
All models were trained using identical preprocessing pipelines to ensure fair comparison and reproducibility of experimental results. These benchmark models serve as baseline references to evaluate the effectiveness of the proposed deep learning architecture.
In addition to classical machine learning models, deep learning baselines including Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks were evaluated. These models provide a comparison against the proposed hybrid CNN-BiLSTM-Attention architecture.
The inclusion of deep learning baselines enables a more comprehensive evaluation of intrusion detection performance and makes it possible to assess whether hybrid architectures that integrate both convolutional feature extraction and sequential learning mechanisms offer measurable advantages.
3.8. Proposed CNN-BiLSTM-Attention Model
To enhance the detection capability beyond traditional machine learning models, this study proposes a hybrid deep learning architecture that integrates Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM) networks, and an attention mechanism.
The CNN component captures local interactions between neighboring features along the feature dimension, enabling the model to learn combinations of related attributes within the tabular input.
The BiLSTM layer models dependencies across feature positions within the input vector rather than temporal dependencies. This design allows the model to capture relationships between features but does not reflect time-evolving network behavior.
An attention mechanism is incorporated to enhance feature importance modeling. The attention layer allows the model to dynamically assign higher weights to the most informative features while reducing the influence of less relevant information. By focusing on critical traffic characteristics, the attention mechanism is intended to improve the model’s ability to identify subtle intrusion patterns.
The final stage of the architecture consists of a fully connected dense layer followed by a sigmoid activation function, which performs binary classification to distinguish between benign and malicious traffic.
The proposed CNN–BiLSTM–Attention model operates on structured tabular feature vectors, where each sample represents a single network flow instance described by a fixed set of numerical features. No explicit temporal sequence is constructed based on device behavior, session information, or timestamp ordering. Instead, each sample is treated independently. To enable compatibility with convolutional and recurrent layers, the input feature vector is reshaped into a one-dimensional sequence. This transformation does not represent a temporal or spatial sequence; rather, it allows the model to process feature values as an ordered set of attributes.
Figure 3 illustrates the proposed CNN–BiLSTM–Attention architecture applied to tabular network traffic data. Each sample is represented as a feature vector $x \in \mathbb{R}^{20}$, which is reshaped into $X \in \mathbb{R}^{20 \times 1}$ to ensure compatibility with convolutional and recurrent layers. The resulting sequence represents an ordered feature arrangement rather than a temporal sequence. The CNN component captures local interactions among neighboring features, while the BiLSTM layer models dependencies across feature positions in a non-temporal manner. Subsequently, the attention mechanism assigns adaptive importance weights to the extracted feature representations, enabling the model to focus on the most relevant discriminative patterns before the final binary classification stage.
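Two steps of this data flow can be made concrete without a deep learning library: the reshape from a flat vector to a length-20 sequence with one channel, and the attention weighting. The text does not specify the attention variant, so the sketch assumes the simplest common form, a softmax over per-position scores followed by a weighted sum. The function names and the toy inputs are hypothetical.

```python
import math

def reshape_to_sequence(features):
    """Turn a flat vector of d features into a (d, 1) nested list: d positions, 1 channel."""
    return [[value] for value in features]

def attention_pool(hidden_states, scores):
    """Softmax the per-position scores and return (weights, weighted sum of hidden states).

    hidden_states: one vector per feature position (e.g. recurrent-layer outputs)
    scores: one unnormalized relevance score per position (learned in a real model)
    """
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, hidden_states)) for k in range(dim)]
    return weights, context

x = [float(i) for i in range(20)]  # one flow described by 20 numerical features
sequence = reshape_to_sequence(x)  # shape (20, 1)

# Two toy positions with equal scores receive equal attention weights.
weights, context = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because the position axis is just the column order of the feature table, the attention weights here rank feature positions rather than time steps.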
3.9. Experimental Setup
All experiments were implemented using Python 3.10. Classical machine learning models were developed using Scikit-learn, while XGBoost and LightGBM were used for gradient boosting models. Deep learning models, including CNN, LSTM, and the proposed CNN–BiLSTM–Attention architecture, were implemented using TensorFlow and Keras. The experiments were conducted in the Google Colab environment to leverage cloud-based computational resources for large-scale data processing and model training. Each dataset was processed independently to evaluate model performance under different IoT traffic distributions. The datasets were split into training, validation, and testing sets using a stratified sampling strategy to preserve class distribution. Due to the large size and imbalance of IoT datasets, sampling techniques were applied to ensure computational feasibility while maintaining representative data distributions.
All models were trained and evaluated using a unified preprocessing pipeline, including data cleaning, feature normalization, and label harmonization, to ensure fair comparison across different datasets.
Model performance was evaluated using multiple metrics, including Accuracy, Precision, Recall, F1-score, ROC-AUC, and PR-AUC, to provide a comprehensive assessment of intrusion detection performance, particularly under imbalanced conditions.
For deep learning models, training was performed using the Adam optimizer with an initial learning rate of 0.001. The models were trained for 50 epochs with a batch size of 64, with early stopping applied to prevent overfitting. Dropout regularization was also used in fully connected layers to improve generalization.
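The interaction between the 50-epoch budget and early stopping can be sketched as follows. The monitored quantity (validation loss) and the patience value are assumptions, since the text does not state them, and the function name is hypothetical. In Keras the equivalent behaviour is normally configured through the EarlyStopping callback.

```python
def early_stopping_epoch(val_losses, max_epochs=50, patience=5):
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    Training stops once the validation loss has not improved for `patience`
    consecutive epochs, or when `max_epochs` is reached.
    """
    best_loss, best_epoch, waited = float("inf"), 0, 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, last_epoch

# Toy validation curve: best value at epoch 3, then three epochs without improvement.
history = [0.90, 0.70, 0.60, 0.65, 0.64, 0.66, 0.50]
best_epoch, stopped_at = early_stopping_epoch(history, patience=3)
```

With a patience of 3 the run stops at epoch 6 and never reaches the later improvement at epoch 7, which shows how the patience setting trades training time against the chance of missing a late gain.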
For classical machine learning models, hyperparameters were tuned using standard optimization techniques to ensure fair comparison across models. This strategy ensured that each model was evaluated under a tuned parameter configuration while maintaining fairness in the benchmarking process.
To ensure full reproducibility, all experiments were conducted using fixed random seeds, consistent preprocessing pipelines, and identical feature sets across all models. The entire workflow, including sampling, preprocessing, training, and evaluation, follows a deterministic pipeline that can be replicated under the same configuration. The reported results correspond to single-run evaluations under consistent experimental conditions. No multi-run statistical analysis (e.g., mean and standard deviation across runs) was performed, as summarized in Table 5.
3.10. Cross-Dataset Evaluation
To evaluate the generalization capability of intrusion detection models, cross-dataset experiments were conducted. In these experiments, models trained on one dataset were evaluated on a different dataset.
Cross-dataset evaluation simulates realistic deployment scenarios in which models trained in one IoT environment must detect intrusions in another environment with different traffic characteristics.
The following cross-dataset combinations were evaluated:
Train on BoT-IoT, test on CIC-IoT 2023.
Train on BoT-IoT, test on N-BaIoT.
Train on CIC-IoT 2023, test on BoT-IoT.
Train on CIC-IoT 2023, test on N-BaIoT.
Train on N-BaIoT, test on BoT-IoT.
Train on N-BaIoT, test on CIC-IoT 2023.
This evaluation provides insights into model robustness and transferability across heterogeneous IoT network environments.
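The six train/test combinations listed above are the ordered pairs of the three datasets, which can be enumerated as in the following sketch. The helper names `train_fn` and `eval_fn` are hypothetical placeholders for the actual training and evaluation routines, which are not specified here.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

DATASETS = ["BoT-IoT", "CIC-IoT 2023", "N-BaIoT"]


def cross_dataset_pairs(datasets: List[str]) -> List[Tuple[str, str]]:
    """All ordered (train, test) pairs with train != test."""
    return list(permutations(datasets, 2))


def run_cross_dataset(
    datasets: List[str],
    train_fn: Callable[[str], object],          # hypothetical: returns a fitted model
    eval_fn: Callable[[object, str], float],    # hypothetical: returns a metric value
) -> Dict[Tuple[str, str], float]:
    """Train once per source dataset, then evaluate on every other dataset."""
    models = {name: train_fn(name) for name in datasets}
    return {
        (source, target): eval_fn(models[source], target)
        for source, target in cross_dataset_pairs(datasets)
    }


pairs = cross_dataset_pairs(DATASETS)
```

Training each source model once and reusing it for both targets keeps the comparison consistent, since both target evaluations then share exactly the same fitted model.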
3.11. Ablation Study
To analyze the contribution of each architectural component in the proposed deep learning model, an ablation study was conducted. The objective of this analysis is to isolate the impact of individual architectural modules on intrusion detection performance.
Several model configurations were evaluated by systematically removing specific components from the proposed architecture. The evaluated variants include:
Full Model (CNN + BiLSTM + Attention).
No CNN (BiLSTM + Attention).
No BiLSTM (CNN + Attention).
No Attention (CNN + BiLSTM).
CNN Only.
This experimental design enables a detailed analysis of how convolutional feature extraction, sequential modeling, and attention mechanisms individually contribute to the overall detection performance.
The ablation analysis provides insights into the relative importance of each architectural component and helps determine whether the hybrid CNN-BiLSTM-Attention design offers measurable advantages over simplified architectures.
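One way to keep the ablation variants consistent is to describe each as a set of on/off switches applied to a single model builder. The sketch below only encodes the five configurations listed above; the `Variant` class and its field names are illustrative assumptions, and no model-building code is implied.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    name: str
    use_cnn: bool
    use_bilstm: bool
    use_attention: bool


ABLATION_VARIANTS: List[Variant] = [
    Variant("Full Model", use_cnn=True, use_bilstm=True, use_attention=True),
    Variant("No CNN", use_cnn=False, use_bilstm=True, use_attention=True),
    Variant("No BiLSTM", use_cnn=True, use_bilstm=False, use_attention=True),
    Variant("No Attention", use_cnn=True, use_bilstm=True, use_attention=False),
    Variant("CNN Only", use_cnn=True, use_bilstm=False, use_attention=False),
]


def describe(variant: Variant) -> str:
    """Render the active components, e.g. 'CNN + BiLSTM'."""
    flags = (
        ("CNN", variant.use_cnn),
        ("BiLSTM", variant.use_bilstm),
        ("Attention", variant.use_attention),
    )
    return " + ".join(label for label, enabled in flags if enabled)


def removed_components(variant: Variant) -> List[str]:
    """Components of the full model that this variant drops."""
    flags = (
        ("CNN", variant.use_cnn),
        ("BiLSTM", variant.use_bilstm),
        ("Attention", variant.use_attention),
    )
    return [label for label, enabled in flags if not enabled]


descriptions = [describe(v) for v in ABLATION_VARIANTS]
```

Driving every variant from the same flags makes it easier to confirm that only the intended component differs between two runs.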
4. Results and Discussion
This section presents the experimental results obtained from the benchmark models and the proposed deep learning architecture. The analysis focuses on detection performance, cross-dataset generalization capability, and the contribution of different architectural components.
All reported performance metrics were obtained under a binary classification setting. While this formulation simplifies cross-dataset evaluation, it does not capture the model’s ability to distinguish between different attack categories. As a result, the reported performance may overestimate real-world effectiveness in scenarios requiring fine-grained attack identification, and near-perfect intra-dataset performance may partially reflect the reduced complexity of the task rather than effective multi-class attack discrimination.
The results are interpreted from a benchmarking perspective rather than a model-centric perspective. Accordingly, performance comparisons are used to assess the influence of dataset characteristics and evaluation settings, rather than to claim superiority of a specific architecture.
The near-perfect performance observed in intra-dataset evaluation (e.g., ROC-AUC values approaching 1.000) requires careful interpretation. Such results may arise from strong class separability and dataset construction characteristics rather than true model generalization. Although strict precautions were taken to prevent data leakage, these results highlight the limitations of evaluating intrusion detection systems solely within a single dataset.
4.1. Machine Learning Benchmark
To establish baseline intrusion detection performance, several classical machine learning algorithms were evaluated across the three IoT cybersecurity datasets: CIC-IoT 2023, BoT-IoT, and N-BaIoT. The evaluated benchmark models include Logistic Regression, Random Forest, XGBoost, LightGBM, and Support Vector Machine (SVM).
Table 6 summarizes the best-performing machine learning models obtained on each dataset using the standardized preprocessing pipeline described in
Section 3.
The benchmark results indicate that classical machine learning models, particularly tree-based ensemble methods such as Random Forest, XGBoost, and LightGBM, achieve strong detection performance across all evaluated datasets. Accuracy values range between 0.996 and 0.999, with consistently high F1-scores and ROC-AUC values. However, performance differences between models are relatively small, suggesting that dataset characteristics and feature separability play a dominant role in determining classification performance. These findings indicate that multiple models can effectively capture the underlying patterns in the data, resulting in similar decision boundaries.
Despite their strength, these results should be interpreted with caution. High performance under intra-dataset evaluation does not necessarily imply robustness or generalization. This limitation becomes evident in the cross-dataset experiments, where model performance drops significantly when models are evaluated on different datasets, indicating sensitivity to distribution shifts and dataset-specific patterns.
Furthermore, near-perfect performance (e.g., accuracy and ROC-AUC values approaching 1.000) is primarily attributed to the characteristics of the datasets rather than inherent model superiority. In particular, datasets such as BoT-IoT and N-BaIoT exhibit highly separable traffic patterns and strong statistical differences between benign and malicious samples. In addition, the use of binary classification simplifies the decision boundary, which can lead to inflated performance metrics under intra-dataset evaluation.
Overall, these results suggest that while models demonstrate strong detection capability under controlled experimental conditions, their real-world robustness remains limited. This highlights the importance of cross-dataset evaluation and more realistic experimental settings when assessing intrusion detection systems. Importantly, the observed high performance is unlikely to be attributable to data leakage: all preprocessing steps were applied after dataset splitting, transformations were fitted exclusively on the training set, and all identified leakage-prone features were removed. The results are therefore more plausibly explained by dataset characteristics than by unintended information leakage.
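The leakage precautions described above (split first, then fit transformations on the training portion only) can be sketched as follows. This is a simplified, dependency-free illustration with a single numeric feature; the seed value, the 15% test fraction applied to a toy list, and the class and function names are assumptions rather than the study's actual implementation.

```python
import random
from statistics import mean, pstdev
from typing import List, Tuple


def split_indices(n: int, test_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle indices with a fixed seed and split them into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


class StandardScaler1D:
    """Standardize one feature using statistics learned from training data only."""

    def fit(self, values: List[float]) -> "StandardScaler1D":
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # guard against a constant feature
        return self

    def transform(self, values: List[float]) -> List[float]:
        return [(v - self.mean_) / self.std_ for v in values]


# Hypothetical single-feature dataset with 20 samples.
feature = [float(v) for v in range(20)]

# 1) Split first, with a fixed seed so the partition is reproducible.
train_idx, test_idx = split_indices(len(feature), test_fraction=0.15, seed=42)

# 2) Fit the scaler on the training portion only.
scaler = StandardScaler1D().fit([feature[i] for i in train_idx])

# 3) Apply the already-fitted scaler to both portions.
train_scaled = scaler.transform([feature[i] for i in train_idx])
test_scaled = scaler.transform([feature[i] for i in test_idx])
```

Fitting the scaler before splitting would let test-set statistics influence the training features, which is the leakage path this ordering avoids.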
4.2. Deep Learning Baselines
In addition to classical machine learning models, deep learning baseline architectures were evaluated to provide a fair comparison with the proposed intrusion detection model. Two widely adopted architectures, Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, were implemented as baseline models for IoT intrusion detection. These architectures are common in network traffic analysis because they can capture feature-level interactions and dependencies across feature positions.
Table 7 presents the performance results of the CNN and LSTM baseline models across the three evaluated datasets.
The deep learning baseline models, including CNN and LSTM architectures, also demonstrate strong detection performance across all evaluated datasets. Their performance is comparable to that of classical machine learning models, with only marginal differences in accuracy and ROC-AUC values.
CNN models capture local interactions between features along the feature dimension, while LSTM models learn dependencies across feature positions. However, similar to classical models, the performance differences remain limited. This suggests that, for the evaluated datasets, the inherent structure and separability of the data have a greater impact on performance than the choice of model architecture.
As shown in
Figure 4, all evaluated models achieve high AUC values, indicating strong discrimination capability between benign and malicious traffic. However, the zoomed view reveals small performance differences, where ensemble models and the proposed hybrid architecture maintain consistently high true positive rates even at very low false positive rates.
4.3. Performance of the Proposed CNN–BiLSTM–Attention Model
To further improve intrusion detection performance, a hybrid deep learning architecture integrating Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and an attention mechanism was proposed. The proposed model was evaluated using the same preprocessing pipeline and data splitting strategy applied to the benchmark models to ensure a fair and consistent comparison.
Table 8 presents the performance results of the proposed model across the three evaluated IoT intrusion detection datasets.
The proposed CNN–BiLSTM–Attention model achieves competitive performance across all evaluated datasets, with accuracy values ranging from 0.993 to 0.999 and consistently high ROC-AUC and PR-AUC scores. While the model performs slightly better than some baseline methods in certain cases, the observed improvements are relatively modest.
These results indicate that the hybrid architecture is effective for feature-level representation learning on tabular IoT traffic data. However, the limited performance gains over simpler models suggest that architectural complexity alone does not significantly improve detection performance under the given experimental conditions.
Notably, the proposed CNN–BiLSTM–Attention model does not consistently outperform strong classical machine learning baselines, particularly tree-based ensemble methods such as Random Forest and LightGBM. The relatively small performance differences observed across models suggest that multiple approaches are capable of effectively capturing the underlying structure of the datasets.
Overall, these findings indicate that dataset characteristics, feature representation, preprocessing strategies, and evaluation protocols may have a greater influence on performance than architectural complexity alone. This observation reinforces the importance of standardized benchmarking and carefully controlled experimental procedures when evaluating intrusion detection systems.
To assess the stability of the proposed model, experiments were conducted under consistent training settings using fixed random seeds. The obtained results showed only minor variations across evaluation metrics, suggesting that the model behaves consistently under the defined setup. However, because formal statistical significance testing, confidence intervals, and repeated multi-run analysis were not performed, sensitivity to random initialization was not measured directly, and the reported results should be regarded as reliable only within the defined experimental setup.
A notable observation across experiments is the discrepancy between intra-dataset and cross-dataset performance. While models achieve near-perfect results when trained and evaluated on the same dataset, their performance degrades noticeably under cross-dataset conditions. This behavior demonstrates the difficulty of generalizing learned representations across heterogeneous IoT environments characterized by differing traffic distributions and attack patterns.
In several cross-dataset scenarios, relatively moderate accuracy values are accompanied by lower ROC-AUC or F1-score values, indicating reduced discriminative capability despite seemingly acceptable classification accuracy. Similarly, combinations such as relatively high PR-AUC with comparatively lower F1-score values are also observed. This behavior can largely be attributed to class imbalance and the differing interpretations of evaluation metrics. Specifically, ROC-AUC and PR-AUC measure ranking performance across varying decision thresholds, whereas the F1-score depends on a fixed classification threshold. Consequently, a model may exhibit reasonable ranking capability while still producing suboptimal classification behavior at a specific operating threshold.
These findings highlight that relying on a single evaluation metric may produce misleading conclusions, particularly in cross-dataset evaluation scenarios affected by distribution shift. Therefore, comprehensive evaluation using multiple complementary metrics is essential for accurately assessing intrusion detection robustness.
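The gap between ranking metrics and threshold-dependent metrics can be made concrete with a small constructed example. The scores below are hypothetical and are chosen so that every attack sample is ranked above every benign sample while all scores stay below the default 0.5 threshold, as might happen when score calibration shifts between datasets.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise ranking estimate of ROC-AUC (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1-score after converting scores to labels at a fixed threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical scores: perfectly ranked, but all below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.30, 0.40]

ranking_quality = roc_auc(y_true, scores)       # 1.0: ranking is perfect
f1_default = f1_at(y_true, scores, 0.5)         # 0.0: no sample is flagged
f1_recalibrated = f1_at(y_true, scores, 0.25)   # 1.0: threshold matches the scores
```

The same scores therefore yield a perfect ROC-AUC and an F1-score of zero at the default threshold, which is why the two families of metrics can disagree under distribution shift.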
Importantly, the proposed CNN–BiLSTM–Attention model does not consistently outperform strong tree-based baselines under standardized evaluation settings. This observation further supports the central conclusion of this study: under controlled benchmarking conditions, dataset properties, feature engineering, preprocessing consistency, and evaluation methodology may exert a greater impact on performance than increasing architectural complexity alone.
4.4. Cross-Dataset Generalization
While intra-dataset evaluation provides insight into model performance under a single traffic distribution, real-world smart-city environments require intrusion detection systems capable of generalizing across heterogeneous network conditions.
To evaluate model robustness under distribution shifts, cross-dataset experiments were conducted as described in
Section 3.10. In these experiments, models trained on one dataset were evaluated on a different dataset with distinct traffic characteristics. The detailed results of these experiments are presented in
Table 9, which highlights the variability in model performance across different dataset combinations. The discrepancy between near-perfect intra-dataset performance and degraded cross-dataset results further supports that models are not overfitting through leakage, but rather learning dataset-specific statistical patterns that do not generalize across different IoT environments.
The cross-dataset experiments use the finalized harmonized feature space and the preprocessing pipeline described in
Section 3.3. All visualizations and reported metrics were generated directly from the finalized evaluation outputs to ensure consistency between tables, figures, and textual descriptions.
Figure 5 illustrates the confusion matrix obtained for the proposed CNN–BiLSTM–Attention model on the CIC-IoT 2023 dataset under intra-dataset evaluation. The confusion matrix was computed on the finalized 15% test partition, corresponding to approximately 30,000 evaluation samples. The model correctly classifies the majority of benign and malicious traffic samples, with relatively few false positives and false negatives.
However, this strong intra-dataset performance should be interpreted cautiously. The near-perfect classification behavior observed in
Figure 5 primarily reflects the statistical properties and class separability of the dataset under controlled experimental conditions. As demonstrated later in the cross-dataset evaluation experiments, models trained on a single dataset often fail to generalize effectively to heterogeneous IoT traffic environments despite achieving excellent intra-dataset results.
Representative confusion matrices for intra-dataset evaluations are provided in Figure 5, Figure 6, and Figure 7, while cross-dataset generalization behavior is further illustrated through the heatmap representation shown in Figure 8.
Figure 6 illustrates the confusion matrix of the proposed CNN–BiLSTM–Attention model on the N-BaIoT dataset under intra-dataset evaluation. The confusion matrix was computed on the finalized 15% test partition, corresponding to approximately 30,000 evaluation samples. The model demonstrates strong classification performance with only a limited number of false positives and false negatives.
Figure 7 illustrates the confusion matrix of the proposed CNN–BiLSTM–Attention model on the BoT-IoT dataset under intra-dataset evaluation. The confusion matrix was computed on the finalized 15% test partition, corresponding to approximately 3284 evaluation samples. The model correctly classifies the majority of benign and malicious traffic samples, while only a small number of samples are misclassified.
The cross-dataset generalization results are further visualized using the heatmap representation shown in
Figure 8. The results reveal noticeable performance degradation under distribution shifts, highlighting the limited generalization capability of models trained on a single dataset. This behavior reflects substantial differences in traffic characteristics and attack patterns across IoT datasets and underscores the challenge of developing intrusion detection systems that generalize effectively in heterogeneous environments.
For example, models trained on the BoT-IoT dataset exhibit reduced detection performance when evaluated on N-BaIoT traffic, indicating substantial differences in feature distributions. Similarly, cross-evaluation involving CIC-IoT 2023 demonstrates varying levels of generalization across datasets. A particularly notable case occurs when models trained on N-BaIoT are evaluated on BoT-IoT, where detection performance decreases significantly compared with intra-dataset evaluation. This further confirms the presence of a strong distribution mismatch and the limitations of dataset-specific model optimization.
Overall, these findings demonstrate that models evaluated solely within a single dataset may produce overly optimistic performance estimates. Cross-dataset evaluation, therefore, provides a more realistic and rigorous assessment of intrusion detection robustness in heterogeneous IoT environments.
4.5. Ablation Study Analysis
The ablation study results indicate that performance differences across architectural variants are relatively small, with accuracy varying within a narrow range (e.g., 0.992 vs. 0.991 vs. 0.989). Such limited differences do not provide strong evidence for substantial synergistic interaction between individual architectural components. Instead, the results suggest that each component contributes incrementally under the evaluated experimental conditions. The detailed ablation results are presented in
Table 10, highlighting the relative contribution of each component to overall model performance.
In addition to predictive performance, the computational complexity of each architectural variant was evaluated using trainable parameter count, average training time per epoch, and inference latency per sample. The results indicate that models incorporating BiLSTM and attention mechanisms require substantially higher computational resources compared with simpler CNN-based variants. However, the corresponding performance improvements remain relatively limited. For example, the full CNN–BiLSTM–Attention model substantially increases parameter count and inference latency compared with the CNN-only variant, while providing only marginal accuracy improvement. These findings suggest that increased architectural complexity does not necessarily translate into proportional performance gains within the evaluated tabular IoT intrusion detection setting.
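The complexity measures mentioned above can be approximated with standard per-layer parameter formulas and a simple wall-clock timer. The layer sizes below are hypothetical and do not correspond to the architecture reported in this study; the LSTM formula uses the one-bias-per-gate convention (some frameworks use two bias vectors and therefore report slightly larger counts), and attention parameters are omitted because they depend on the specific attention formulation.

```python
import time
from typing import Callable, Sequence


def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus one bias per output channel."""
    return kernel_size * in_channels * out_channels + out_channels


def lstm_params(input_size: int, hidden_size: int) -> int:
    """Four gates, each with input weights, recurrent weights and one bias vector."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def bilstm_params(input_size: int, hidden_size: int) -> int:
    """A bidirectional layer holds two independent LSTMs."""
    return 2 * lstm_params(input_size, hidden_size)


def dense_params(in_features: int, out_features: int) -> int:
    return in_features * out_features + out_features


def latency_per_sample(predict_fn: Callable[[Sequence], object],
                       samples: Sequence, repeats: int = 5) -> float:
    """Best-of-N wall-clock seconds per sample for a batch prediction call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(samples)
        best = min(best, time.perf_counter() - start)
    return best / len(samples)


# Hypothetical layer sizes, for illustration only.
conv = conv1d_params(in_channels=1, out_channels=32, kernel_size=3)
cnn_only_total = conv + dense_params(32, 1)
cnn_bilstm_total = conv + bilstm_params(32, 64) + dense_params(128, 1)
```

With these assumed sizes the recurrent layer accounts for nearly all trainable parameters, which illustrates how adding a BiLSTM can dominate model size even when the accuracy gain is small.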
The ablation study further shows that removing individual components from the proposed architecture leads to only modest changes in performance. This behavior indicates that each component contributes incrementally rather than critically to the final prediction capability. Among the evaluated components, the BiLSTM module exhibits a slightly greater influence relative to other components, likely due to its ability to model dependencies across feature dimensions. Nevertheless, the overall performance differences remain small, reinforcing the observation that dataset characteristics, feature representation, and preprocessing strategies exert a greater influence on performance than architectural design alone.
This behavior may also be attributed to the tabular nature of the evaluated datasets, where explicit temporal dependencies are limited. Under such conditions, recurrent architectures such as BiLSTM may provide less benefit compared with domains involving genuine sequential or temporal structures. Similarly, the contribution of the attention mechanism appears relatively marginal in the evaluated setting. Although attention layers increase model capacity, they do not produce substantial improvements over simpler variants. This suggests that attention mechanisms may provide greater benefit in scenarios involving richer feature interactions, larger-scale datasets, or more complex sequential relationships.
Overall, the ablation results demonstrate an important trade-off between model complexity and computational efficiency. More complex variants, particularly those incorporating BiLSTM and attention layers, incur increased computational cost in terms of training time, parameter count, and inference latency, while yielding only relatively small performance improvements. These findings further support the broader conclusion of this study that standardized preprocessing, dataset properties, and evaluation methodology may have a greater impact on intrusion detection performance than increasing architectural complexity alone.
5. Conclusions and Future Work
This study primarily contributes a standardized benchmarking framework for evaluating intrusion detection systems across heterogeneous IoT datasets. While a hybrid CNN–BiLSTM–Attention architecture was implemented, the results demonstrate that architectural complexity alone does not guarantee improved performance over strong baseline models. Instead, the findings highlight that evaluation methodology, dataset diversity, and preprocessing consistency play a more critical role in determining reliable intrusion detection performance. Because all experiments used a binary formulation, the conclusions of this study are limited to binary intrusion detection. Although this formulation enables consistent cross-dataset comparison, it does not address multi-class attack discrimination, which remains an important direction for future work.
Although several models achieved near-perfect results under intra-dataset evaluation, cross-dataset experiments revealed substantial performance degradation and unstable metric behavior under distribution shifts. In particular, the results indicate that the evaluated intrusion detection models, regardless of architecture, fail to generalize effectively across heterogeneous datasets. These findings highlight a fundamental limitation of dataset-specific training and reinforce the need for more robust evaluation protocols and domain adaptation techniques. Moreover, they demonstrate that high accuracy alone is insufficient to assess model effectiveness and that cross-dataset validation is essential for evaluating robustness in realistic IoT environments.
Furthermore, the limited performance differences observed across models and ablation variants suggest that dataset characteristics and feature representation have a greater impact on performance than architectural design. This reinforces the importance of standardized benchmarking protocols when comparing intrusion detection systems.
Despite these contributions, this study has several limitations. First, the use of binary classification simplifies the intrusion detection task and may lead to inflated performance estimates, as it does not reflect real-world requirements where distinguishing between attack types is essential. Second, despite feature harmonization, inherent differences in dataset distributions remain, which significantly affect cross-dataset generalization. Third, experiments were conducted in offline settings, and real-time deployment constraints were not considered. Finally, although precautions were taken to prevent data leakage, the possibility of residual dataset-specific artifacts cannot be entirely excluded.
In addition, the proposed approach does not explicitly model temporal dependencies in network traffic, as the BiLSTM operates over feature dimensions rather than time-ordered sequences. Moreover, all experiments were conducted using fixed random seeds to ensure reproducibility. The reported results correspond to single-run evaluations under consistent experimental conditions, and no multi-run statistical analysis was performed.
Future work will focus on extending the framework to multi-class intrusion detection, incorporating domain adaptation techniques to improve cross-dataset generalization, and evaluating models under realistic deployment conditions. Additional research will explore sequence construction strategies based on time-ordered flows, device-level aggregation, or session-based grouping. Furthermore, future studies will incorporate rigorous statistical validation through repeated experiments and hypothesis testing, as well as more realistic evaluation protocols such as device-held-out, time-based splitting, and leave-attack-family-out validation.
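Two of the evaluation protocols named above, device-held-out and time-based splitting, can be sketched in a few lines. This is only an illustration of the splitting logic under the assumption that each record carries a device identifier and a timestamp; the field values and function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def device_held_out_split(device_ids: Sequence[str],
                          held_out: Set[str]) -> Tuple[List[int], List[int]]:
    """Put every record from the held-out devices in the test set."""
    train = [i for i, d in enumerate(device_ids) if d not in held_out]
    test = [i for i, d in enumerate(device_ids) if d in held_out]
    return train, test


def time_based_split(timestamps: Sequence[float],
                     test_fraction: float) -> Tuple[List[int], List[int]]:
    """Train on the earliest records and test on the most recent ones."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = len(order) - int(round(len(order) * test_fraction))
    return order[:cut], order[cut:]


# Hypothetical records: device identifier and capture time for each flow.
devices = ["cam-1", "cam-1", "plug-2", "plug-2", "bulb-3", "bulb-3"]
times = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

dev_train, dev_test = device_held_out_split(devices, held_out={"bulb-3"})
time_train, time_test = time_based_split(times, test_fraction=0.5)
```

Both splits prevent a model from being tested on records that closely resemble its training data, either because they come from the same device or because they were captured in the same time window.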