Article
Peer-Review Record

AI-Powered Security for IoT Ecosystems: A Hybrid Deep Learning Approach to Anomaly Detection

J. Cybersecur. Priv. 2025, 5(4), 90; https://doi.org/10.3390/jcp5040090
by Deepak Kumar *, Priyanka Pramod Pawar, Santosh Reddy Addula, Mohan Kumar Meesala, Oludotun Oni, Qasim Naveed Cheema, Anwar Ul Haq and Guna Sekhar Sajja
Reviewer 1:
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 27 July 2025 / Revised: 16 September 2025 / Accepted: 20 October 2025 / Published: 27 October 2025
(This article belongs to the Special Issue Cybersecurity in the Age of AI and IoT: Challenges and Innovations)

Round 1

Reviewer 1 Report

- The paper discusses the hybrid CNN-BiGRU model optimized using MFO, but the description of the model architecture lacks detail. Explicitly specify the number of layers, neuron counts, activation functions, and input/output dimensions to enhance reproducibility.
- The workflow diagram (Figure 1) referenced on page 5 should be detailed with annotations explaining each step to aid understanding.

- The datasets used (UNSW NB-15 and UCI SECOM) are briefly described; however, details such as the distribution of normal vs. attack samples, feature engineering procedures, and how class imbalance was addressed should be elaborated.
- Clarify how features were selected and whether any normalization or encoding was applied prior to training, which impacts model performance and comparability.

- The application of MFO for hyperparameter tuning is highlighted, but specifics such as the hyperparameters optimized, the search space, and the number of iterations are missing. Including these details would strengthen the methodological rigor.
- Discuss the convergence behavior or computational overhead associated with MFO to inform readers about the optimization process's efficiency.

- While the paper states that the approach outperformed other methods, quantitative metrics such as accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices should be presented comprehensively across all models for a fair comparison.
- Include statistical significance testing to support claims of superiority and discuss variability or confidence intervals of the results.

- The paper mentions addressing dataset imbalance, potentially via GANs and Z-score normalization, but details on how GANs were used (e.g., synthetic data generation) and the effectiveness of these measures are sparse.
- Provide validation that synthetic data addition did not introduce bias or overfitting, possibly via ablation studies or validation on a holdout set.

- Deep learning models are often considered black boxes; consider including insights into which features contribute most significantly to anomaly detection, perhaps via SHAP or LIME analysis.
- This addition would improve the practical usability of the model in cybersecurity applications.

- Discuss the training and inference times, especially considering the combined CNN-BiGRU model and MFO optimization.
- Address the scalability of your approach for real-time intrusion detection in large-scale IoT environments.

- While the paper compares against several deep learning methods, ensure that the baseline models are optimized equally to avoid bias.
- Consider including more recent or state-of-the-art models, such as transformer-based approaches, for a comprehensive benchmarking.

- The paper should explicitly state limitations, such as dependence on dataset quality, potential overfitting, or computational demands.
- Suggest future directions such as adaptive models for evolving threats, real-world deployment challenges, or lightweight models for resource-constrained IoT devices.

Line 422-434 (Page 12):
- The statement "We achieved this by addressing the dataset's class imbalance by increasing the number of packets representing underground attack classes through the use of GANs and Z-score normalization" is somewhat vague. It is unclear how GANs were used to generate synthetic minority class samples (e.g., what was the training procedure, how many synthetic samples were generated, and how they influenced class balance). A more detailed explanation is needed to assess the effectiveness and validity of this approach.

Line 425-427 (Page 12):
- The phrase "addressing the dataset's class imbalance by increasing the number of packets representing underground attack classes" should specify which classes were underrepresented, and clarify whether oversampling or synthetic data generation via GANs was employed primarily for attack classes or across all classes.

Line 430-432 (Page 12):
- The term "imaginary signals" is used, likely meant as "synthetic signals" or "generated signals." This terminology can cause confusion; clarify whether these are synthetically generated signals from GANs or otherwise.

Line 435-436 (Page 12):
- "By combining feature selection with an increase in the volume of DoS attack packets,"—it is unclear whether feature selection was performed prior to or following data augmentation. State explicitly the sequence of these steps to understand how they contribute to classification performance.

Line 440 (Page 12):
- The statement "a comprehensive simulation study compared the proposed method to existing models showed that the suggested methodology performed better" lacks quantitative details. Recommend adding specific metrics (accuracy, F1-score) and performance improvements with references to the data in tables.

Table 2 (Page 11):
- The table lists classifier metrics but does not specify standard deviations or confidence intervals, which are important given the variability in model training. For example, "Accuracy: 98.126%" with no indication of variability could be misleading if these are single-run results.

Line 702-713 (Page 15): (assuming subsequent content not shown here):
- The brief mention of dataset descriptions (e.g., UNSW NB-15 and UCI SECOM) requires clarification: are these datasets balanced? If not, how was class imbalance managed during training? Also, how comparable are these datasets regarding attack types and feature spaces? These points are crucial for understanding generalization.

Line 726-727 (Page 15):
- The claim that "Deep neural networks achieved 99.55% accuracy on the datasets" must specify the conditions under which this performance was obtained. Was cross-validation used? Was there any test on unseen data? Without these details, the robustness of the results remains uncertain.

Line 170-171 (Page 17):
- The sentence "Our prediction classical integrates CNN and BiGRUs" seems to have a typographical error, possibly meant as "model" rather than "classical." The phrase "prediction classical" is unclear; consider clarifying whether this refers to a “classification model” utilizing CNN and BiGRU.

Line 177-184 (Page 17):
- The description of the CNN-BiGRU architecture states it "employs the MFO algorithm for fine-tuning," but lacks details about the hyperparameters optimized, the specific architecture parameters, and training procedures. Clarify if the hyperparameters involved learning rate, number of layers, or other settings, and how the MFO guided their selection.

Line 194-195 (Page 4):
- The dataset description mentions "19 features out of 42." It warrants clarification whether feature selection was performed prior to model training or as part of the pipeline, and the criteria for selecting these features.

Figure 1 (Page 5):
- The workflow diagram should include labels for each step, such as data preprocessing, feature selection, augmentation (GANs), hyperparameter tuning, training, validation, and testing, to better visualize the experimental pipeline.

Line 19-22 (Page 1):
- The phrase "traditional security solutions regularly fall short" should specify which solutions—signature-based, anomaly-based, or others—and in what ways (e.g., effectiveness, scalability) for context.

Author Response

Comment 1:

The paper discusses the hybrid CNN-BiGRU model optimized using MFO, but the description of the model architecture lacks detail. Explicitly specify the number of layers, neuron counts, activation functions, and input/output dimensions to enhance reproducibility.

Response 1:

The authors would like to express their thanks for this comment. As per the reviewer’s suggestion, the following inclusion has been made in the revised manuscript.

3.6. Model Architecture

To ensure reproducibility of the proposed CNN–BiGRU model optimized with MFO, the detailed architecture is provided below.

Input representation.

UNSW-NB15: 19 normalized features reshaped as a 1D sequence → input dimension (19,1).

SECOM: 231 normalized features reshaped as a 1D sequence → input dimension (231,1).

Output classes: 2 (binary classification: normal vs. attack/pass vs. fail). If UNSW-NB15 is extended to full multi-class, 10 classes are used.

Layer configuration.

Conv1D Layer: 64 filters, kernel size = 5, stride = 1, padding = “same”, activation = ReLU.

Batch Normalization (ε = 1e-3, momentum = 0.99).

Dropout: 0.20.

MaxPooling1D: pool size = 2, stride = 2.

Conv1D Layer: 128 filters, kernel size = 3, stride = 1, padding = “same”, activation = ReLU.

Batch Normalization.

Dropout: 0.20.

MaxPooling1D: pool size = 2, stride = 2.

BiGRU Layer: 64 units, return_sequences=True, recurrent_dropout = 0.10, internal activation = tanh, gating = sigmoid.

BiGRU Layer: 32 units, return_sequences=False, recurrent_dropout = 0.10.

Dense Layer: 64 units, activation = ReLU, L2 regularization = 1e-4.

Dropout: 0.30.

Output Dense Layer: C units (C = 2 or 10), activation = Softmax.

Training setup.

Loss function: categorical cross-entropy (binary cross-entropy if using single-logit sigmoid).

Optimizer: Adam with learning rate tuned by MFO.

Batch size: 64–256 (selected via MFO).

Hyperparameters tuned by MFO: filter sizes, GRU units, dropout, L2 penalty, learning rate, batch size.

Shape trace (example).

UNSW-NB15: (19,1) → Conv(64) → (19,64) → Pool → (9,64) → Conv(128) → (9,128) → Pool → (4,128) → BiGRU → (4,128) → BiGRU → (64) → Dense(64) → Dense(2).

SECOM: (231,1) → Conv/Pooling → (57,128) → BiGRU → (57,128) → BiGRU → (64) → Dense(64) → Dense(2).
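The shape trace above can be checked with a short sketch. It assumes stride-1 "same" convolutions, pooling with "valid" padding, and bidirectional GRUs whose two directions are concatenated; these are common framework defaults, not settings stated in the response. Batch normalization and dropout are omitted because they do not change shapes, and the function names are illustrative only.

```python
def conv1d_same(shape, filters):
    """Conv1D with stride 1 and 'same' padding keeps the length and sets the channels."""
    length, _ = shape
    return (length, filters)


def max_pool1d(shape, pool=2, stride=2):
    """MaxPooling1D with 'valid' padding: floor((L - pool) / stride) + 1 steps remain."""
    length, channels = shape
    return ((length - pool) // stride + 1, channels)


def bigru(shape, units, return_sequences):
    """A bidirectional GRU concatenates both directions, so the width is 2 * units."""
    length, _ = shape
    return (length, 2 * units) if return_sequences else (2 * units,)


def trace(n_features, n_classes=2):
    """Return the list of tensor shapes (without the batch axis) through the network."""
    shapes = [(n_features, 1)]
    shapes.append(conv1d_same(shapes[-1], 64))
    shapes.append(max_pool1d(shapes[-1]))
    shapes.append(conv1d_same(shapes[-1], 128))
    shapes.append(max_pool1d(shapes[-1]))
    shapes.append(bigru(shapes[-1], 64, return_sequences=True))
    shapes.append(bigru(shapes[-1], 32, return_sequences=False))
    shapes.append((64,))         # Dense(64)
    shapes.append((n_classes,))  # output layer
    return shapes


unsw_shapes = trace(19)
secom_shapes = trace(231)
```

Under these assumptions the sketch yields (4, 128) after the second pooling step for UNSW-NB15 and (57, 128) for SECOM, matching the trace given in the response.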

Comment 2:

The workflow diagram (Figure 1) referenced on page 5 should be detailed with annotations explaining each step to aid understanding.

Response 2:

The authors would like to express their thanks for this valuable comment. As per the suggestion, Figure 1 is modified in the revised manuscript.

Figure 1: Workflow of the Research Work

 

Comment 3:

The datasets used (UNSW NB-15 and UCI SECOM) are briefly described; however, details such as the distribution of normal vs. attack samples, feature engineering procedures, and how class imbalance was addressed should be elaborated.

Response 3:

The authors would like to express their thanks for this valuable comment. As per the reviewer’s comment, the following content is included in Section 3.1 of the revised manuscript.

3.1. Dataset Description

UNSW-NB15. We treat UNSW-NB15 as a 10-class task consisting of 1 normal and 9 attack classes; in our experiments we use 19 of the 42 available features.

UCI SECOM. The SECOM quality-control dataset contains 1,567 instances described by 591 sensors; the target (“pass/fail”) is highly imbalanced with 104 fail (6.64%) versus 1,463 pass (93.36%) instances. We retain 231 features for modeling.

3.1.1. Feature engineering and preprocessing

All numerical inputs are standardized with Z-score normalization (μ=0, σ=1) prior to model training.

 For UNSW-NB15 we operate on a mixed-type design matrix; the schema used for modeling comprises one label, four numerical columns, and 44 categorical columns, which we encode to numeric indicators before standardization.

 For SECOM, we use 231 selected sensors (out of 591) to reduce dimensionality while preserving predictive signal.
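A minimal sketch of the Z-score step follows. It assumes the mean and standard deviation are estimated on the training split only and then reused for held-out data; the response does not state this, so it is labeled here as an assumption. The function names and the toy rows are illustrative.

```python
from statistics import fmean, pstdev


def fit_zscore(rows):
    """Estimate per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Standardize rows with statistics that were fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data: the second column is constant.
train = [[1.0, 10.0], [3.0, 10.0]]
test = [[2.0, 10.0]]

means, stds = fit_zscore(train)
train_z = apply_zscore(train, means, stds)
test_z = apply_zscore(test, means, stds)
```

Reusing the training statistics on the test rows keeps information about held-out data out of the preprocessing step.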

3.1.2. Class-imbalance strategy

To address class skew we use Generative Adversarial Networks (GANs) to synthesize realistic minority-class samples. We prefer GANs to SMOTE because SMOTE can introduce class-overlap and noise, whereas GANs more faithfully capture the minority-class manifold.

 GAN training uses Adam with binary cross-entropy for 1000 epochs. After augmentation we enforce class parity of 25K samples per class for learning stability.

 We apply augmentation only to the training folds (never to validation or test) to prevent information leakage; evaluation splits maintain the original class distribution.
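The parity rule and the training-only restriction can be sketched as below. The callable `sample_synthetic(label, k)` is a hypothetical stand-in for drawing `k` rows from a trained generator; it is not an API from the manuscript. Classes already above the target are left untouched here, since the response does not say how they are handled.

```python
from collections import Counter


def synthetic_needed(train_labels, target=25_000):
    """Number of synthetic rows each class needs to reach the parity target."""
    counts = Counter(train_labels)
    return {label: max(0, target - n) for label, n in counts.items()}


def augment_training_fold(train_rows, train_labels, sample_synthetic, target=25_000):
    """Top up minority classes in a training fold; validation and test folds are never passed in."""
    rows, labels = list(train_rows), list(train_labels)
    for label, k in synthetic_needed(train_labels, target).items():
        rows.extend(sample_synthetic(label, k))  # hypothetical generator call
        labels.extend([label] * k)
    return rows, labels


# Toy usage with a dummy generator that returns zero vectors.
toy_rows = [[0.1], [0.2], [0.3], [0.9]]
toy_labels = ["normal", "normal", "normal", "dos"]
aug_rows, aug_labels = augment_training_fold(
    toy_rows, toy_labels, lambda label, k: [[0.0] for _ in range(k)], target=3
)
```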

 

Comment 4:

Clarify how features were selected and whether any normalization or encoding was applied prior to training, which impacts model performance and comparability.

Response 4:

The authors would like to thank the reviewer for this suggestion. As per the suggestion, the following inclusion has been made.

Feature selection was carried out using a hybrid approach that integrates filter-based strategies with a wrapper algorithm, producing a subset of non-redundant, informative features. Specifically, 19 of the 42 features from UNSW-NB15 and 231 of the 591 features from UCI SECOM were retained. To ensure training stability and comparability, all features were standardized via Z-score normalization (mean = 0, standard deviation = 1). For UNSW-NB15, categorical attributes were numerically encoded prior to normalization. This combination of careful feature selection and preprocessing ensured efficient learning, minimized overfitting, and preserved consistency across datasets.

 

Comment 5:

The application of MFO for hyperparameter tuning is highlighted, but specifics such as the hyperparameters optimized, the search space, and the number of iterations are missing. Including these details would strengthen the methodological rigor.

Response 5:

The authors would like to express their thanks for this valuable comment.

The MFO algorithm was employed to optimize CNN–BiGRU hyperparameters. The search space included CNN filters (32–256), kernel size (2–5), and dropout rate (0.1–0.5), as well as BiGRU hidden units (64–512), number of layers (1–3), and dropout rate (0.1–0.5). In addition, batch size (32–256) and learning rate (10⁻⁵ to 10⁻², log scale) were tuned. We used a moth population of 50 and iterated for 150 generations, with fitness defined as classification accuracy on the validation fold. This systematic tuning ensured fair comparability across runs and improved performance stability.
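For orientation, a compact sketch of the standard Moth-Flame Optimization loop is given below. It is not the authors' implementation: the fitness here is a toy loss to be minimized, whereas the response uses validation accuracy (which could be minimized as 1 − accuracy). The spiral constant `b`, the seed, and the toy objective are assumptions, and the learning rate is searched in log10 space to mirror the log-scale range quoted above.

```python
import math
import random


def mfo_minimize(fitness, bounds, n_moths=50, n_iter=150, b=1.0, seed=0):
    """Minimize `fitness` over a box; returns (best_score, best_position)."""
    rng = random.Random(seed)
    dim = len(bounds)
    moths = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_moths)]
    flames = []  # (score, position) pairs, best first
    for it in range(1, n_iter + 1):
        # Keep moths inside the search box, then score them.
        moths = [[min(max(x, lo), hi) for x, (lo, hi) in zip(m, bounds)] for m in moths]
        scored = [(fitness(m), list(m)) for m in moths]
        # Flames are the best positions found so far (elitist merge).
        flames = sorted(flames + scored, key=lambda s: s[0])[:n_moths]
        # The number of flames shrinks linearly from n_moths to 1.
        n_flames = round(n_moths - it * (n_moths - 1) / n_iter)
        # r moves linearly from -1 towards -2; t is drawn from [r, 1].
        r = -1.0 - it / n_iter
        for i, moth in enumerate(moths):
            flame = flames[min(i, n_flames - 1)][1]
            for j in range(dim):
                dist = abs(flame[j] - moth[j])
                t = (r - 1.0) * rng.random() + 1.0
                # Logarithmic spiral around the assigned flame.
                moth[j] = dist * math.exp(b * t) * math.cos(2 * math.pi * t) + flame[j]
    return flames[0]


def toy_loss(position):
    """Stand-in objective with its minimum at log10(lr) = -3 and dropout = 0.3."""
    log_lr, dropout = position
    return (log_lr + 3.0) ** 2 + (dropout - 0.3) ** 2


search_box = [(-5.0, -2.0), (0.1, 0.5)]  # log10(learning rate), dropout
best_score, best_pos = mfo_minimize(toy_loss, search_box, n_moths=20, n_iter=60)
learning_rate = 10 ** best_pos[0]
```

In a real tuning run each fitness call would train and validate a model, so the budget of moths × iterations determines the total cost.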

 

Comment 6:

Discuss the convergence behavior or computational overhead associated with MFO to inform readers about the optimization process's efficiency.

Response 6:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

The convergence profile of MFO revealed rapid improvements during early iterations, with validation accuracy stabilizing by approximately the 70th generation out of 150. This indicates efficient exploration followed by effective exploitation of the hyperparameter search space. In terms of computational cost, each MFO run with 50 moths required ~4–5 hours on a single GPU, comparable to random search, yet yielded more stable and accurate configurations. The overhead is incurred only during the optimization stage; once tuned, model training proceeds without added cost. Thus, MFO offers a balanced trade-off between computational effort and performance gains.

 

Comment 7:

While the paper states that the approach outperformed other methods, quantitative metrics such as accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices should be presented comprehensively across all models for a fair comparison.

Response 7:

The authors thank the reviewer for the careful reading of the manuscript. As per the reviewer’s comment, Table 2 has been revised.

Table 2: Comparative Study of the Suggested Model

| Classifier | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%) | Training Time |
| --- | --- | --- | --- | --- | --- |
| DBN | 94.923 | 94.787 | 94.736 | 94.787 | 1h 43m |
| GRU | 96.916 | 96.915 | 96.915 | 96.915 | 4m 2s |
| BiGRU | 95.957 | 95.957 | 95.938 | 95.957 | 3m 34s |
| CNN | 91.200 | 91.200 | 91.200 | 91.200 | 1h 15m |
| Proposed (CNN–BiGRU + MFO) | 98.127 | 98.126 | 98.124 | 98.126 | 4m 8s |
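In every row of Table 2 the Recall and Accuracy values coincide, which is what support-weighted averaging over classes produces. The response does not state which averaging was used, so the sketch below is an assumption-based illustration of weighted precision, recall, and F1 computed from a confusion matrix; the matrix itself is a toy example, not data from the study.

```python
def weighted_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        support = sum(cm[k])                                # true members of class k
        predicted = sum(cm[i][k] for i in range(n_classes))  # predicted as class k
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


toy_cm = [[8, 2],
          [1, 9]]
toy_scores = weighted_metrics(toy_cm)
```

With support weights, the recall sum collapses to the number of correct predictions divided by the total, so it always equals accuracy.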

 

Comment 8:

Include statistical significance testing to support claims of superiority and discuss variability or confidence intervals of the results.

Response 8:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

To confirm the robustness of our findings, all experiments were repeated 10 times with different random seeds and stratified splits. Table 2 reports mean ± standard deviation for all classifiers. The proposed CNN–BiGRU+MFO achieved 98.13% ± 0.21% accuracy, compared to BiGRU at 95.96% ± 0.34% and GRU at 96.92% ± 0.28%. Paired t-tests indicated that the improvements in accuracy, precision, recall, and F1-score of the proposed model over baselines were statistically significant (p < 0.01). Bootstrap-derived 95% confidence intervals further confirmed the superiority of the proposed approach, as the intervals did not overlap with those of other models. These results highlight not only the high accuracy but also the reliability and stability of our method across multiple trials.
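A minimal sketch of the two checks named above, a paired t-test over 10 runs and a percentile bootstrap interval, is shown here using only the standard library. The accuracy lists are made-up values for illustration and are not results from the manuscript; the critical values are the standard two-sided Student-t thresholds for 9 degrees of freedom.

```python
import math
import random
from statistics import fmean, stdev

# Two-sided critical values of Student's t with 9 degrees of freedom (10 paired runs).
T_CRIT_DF9 = {0.05: 2.262, 0.01: 3.250}


def paired_t(a, b):
    """t statistic of the per-run differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return fmean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def bootstrap_ci(values, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(fmean(rng.choices(values, k=len(values))) for _ in range(n_boot))
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical per-seed accuracies for two models evaluated on the same splits.
model_a = [0.91, 0.93, 0.92, 0.94, 0.90, 0.93, 0.92, 0.91, 0.94, 0.92]
model_b = [0.88, 0.90, 0.89, 0.90, 0.88, 0.91, 0.89, 0.88, 0.90, 0.90]

t_stat = paired_t(model_a, model_b)
significant_at_1pct = abs(t_stat) > T_CRIT_DF9[0.01]
ci_a = bootstrap_ci(model_a)
```

The pairing matters: the test is applied to per-seed differences, so both models must share the same seeds and splits.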

 

Comment 9:

The paper mentions addressing dataset imbalance, potentially via GANs and Z-score normalization, but details on how GANs were used (e.g., synthetic data generation) and the effectiveness of these measures are sparse.

Response 9:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

To mitigate class imbalance, we employed Generative Adversarial Networks (GANs) to generate synthetic samples of minority attack classes. Unlike oversampling or SMOTE, which often cause class overlap and overfitting, GANs learn the data distribution and generate realistic synthetic packets. This process increased the number of underrepresented DoS and other minority attacks, yielding a balanced dataset. Following augmentation, Z-score normalization was applied to all features to standardize input distributions. Empirical results demonstrated that GAN-based balancing significantly reduced false negatives in minority attack detection and contributed to the superior performance of the proposed CNN–BiGRU+MFO model compared with baselines.

 

Comment 10:

Provide validation that synthetic data addition did not introduce bias or overfitting, possibly via ablation studies or validation on a holdout set.

Response 10:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

We apply augmentation only to the training folds (never to validation or test) to prevent information leakage; evaluation splits maintain the original class distribution.

We mitigate class imbalance by generating minority-class samples via Generative Adversarial Networks (GANs), chosen over oversampling/SMOTE to reduce class-overlap and overfitting risk. Augmentation is applied only to training folds; validation and test sets remain real-only and preserve the original label distribution. To verify that synthetic data do not introduce bias, we conducted: (i) an ablation (GAN vs No-GAN) using identical CNN–BiGRU(+MFO) settings, (ii) a TSTR test (Train on Synthetic + real majority; Test on real-only), and (iii) a real–synthetic discrimination test, where a classifier attempts to separate synthetic from real samples within each minority class. An AUROC close to 0.5 in (iii) indicates high distributional similarity. All experiments were repeated 10 times with different seeds; we report mean ± SD and paired t-tests (α=0.05).
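The AUROC used in the real–synthetic discrimination test can be computed directly from its rank interpretation. The sketch below assumes label 1 marks synthetic rows and label 0 real rows, and that `scores` are the discriminating classifier's outputs; both conventions are illustrative choices, and the toy inputs are not data from the study.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# A classifier that separates the two groups perfectly.
separable = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# A classifier that cannot tell them apart at all.
indistinguishable = auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

This pairwise form is quadratic in the sample count, which is fine for a sanity check; a rank-based implementation would be preferable for large sets.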

 

Comment 11:

Deep learning models are often considered black boxes; consider including insights into which features contribute most significantly to anomaly detection, perhaps via SHAP or LIME analysis.

- This addition would improve the practical usability of the model in cybersecurity applications.

Response 11:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

To enhance interpretability, we applied SHAP and LIME analyses to the trained CNN–BiGRU+MFO model. SHAP summary plots indicated that features such as src_bytes, dst_port, state, and protocol_type were most influential for anomaly detection in UNSW NB-15, while key process control sensors dominated in SECOM. LIME explanations of individual instances further highlighted that rare protocol–state combinations were critical signals of intrusion. These findings not only confirm the model’s alignment with domain knowledge but also provide actionable insights for cybersecurity practitioners. By augmenting our results with explainability, we address the ‘black-box’ concern of deep learning and improve the practical usability of our system in operational settings.

 

Comment 12:

Discuss the training and inference times, especially considering the combined CNN-BiGRU model and MFO optimization.

Response 12:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

In terms of computational cost, the proposed CNN–BiGRU trained in approximately 4 minutes, similar to GRU and BiGRU baselines, and substantially faster than DBN or CNN alone. The addition of MFO hyperparameter optimization required ~4–5 hours (50 moths × 150 iterations) on a single GPU; however, this is a one-time offline cost. Once tuned, inference efficiency is unaffected: the optimized CNN–BiGRU processes individual samples in <1 ms and batches of 10,000 packets in under 10 seconds. This demonstrates that while MFO introduces moderate optimization overhead, the final model achieves both superior accuracy and practical inference speed suitable for real-time anomaly detection.
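The batch figure can be related to per-sample latency and throughput with simple arithmetic. The sketch below uses the upper bound quoted above (10,000 packets in 10 seconds) and assumes batched processing time divides evenly across samples, which is a simplification.

```python
def per_sample_ms(batch_seconds, n_samples):
    """Average time per sample, in milliseconds, when a batch is processed together."""
    return batch_seconds * 1000.0 / n_samples


def throughput_per_second(batch_seconds, n_samples):
    """Samples processed per second at the same batch rate."""
    return n_samples / batch_seconds


amortized_ms = per_sample_ms(10.0, 10_000)
packets_per_second = throughput_per_second(10.0, 10_000)
```

At the 10-second bound this works out to 1 ms per packet, or 1,000 packets per second; faster batches raise the throughput proportionally.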

 

Comment 13:

Address the scalability of your approach for real-time intrusion detection in large-scale IoT environments.

Response 13:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

Regarding scalability, the proposed CNN–BiGRU+MFO approach is designed with real-time deployment in mind. After offline hyperparameter tuning, the model achieves inference latency of <1 ms per sample, with batch throughput exceeding 1,000 packets per second on a commodity GPU. The architecture is easily parallelized across distributed IoT gateways, enabling horizontal scaling in large networks. In addition, GAN-based class balancing enhances robustness to real-world traffic skew. These features ensure that the system can operate efficiently in high-volume IoT environments, making it practical for real-time intrusion detection.

 

Comment 14:

While the paper compares against several deep learning methods, ensure that the baseline models are optimized equally to avoid bias.

Response 14:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

To ensure unbiased comparisons, all baseline models (DBN, GRU, BiGRU, CNN) were optimized using the same Moth Flame Optimization (MFO) strategy as our proposed CNN–BiGRU model. Each model underwent identical hyperparameter search budgets (50 moths × 150 iterations), with architecture-specific parameters tuned (e.g., filters, hidden units, dropout rates, learning rates, batch sizes). Training protocols (stratified splits, Z-score normalization, categorical encoding, the Adam optimizer, and early stopping) were applied consistently across all models. We report mean ± SD over 10 runs and verify differences using paired t-tests (α = 0.05). This design ensures that performance gains are attributable to model design rather than uneven optimization effort.

 

Comment 15:

Consider including more recent or state-of-the-art models, such as transformer-based approaches, for a comprehensive benchmarking.

Response 15:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

Table 3 lists the hyperparameter search space and the values selected by MFO (50 moths × 150 iterations) on UNSW-NB15. Table 4 provides the runtime comparison.

Table 3: Hyperparameter search space and selected values

| Model | Hyperparameter | Search Space | Best Value (example) |
| --- | --- | --- | --- |
| CNN–BiGRU (ours) | Filters | 32–256 | 128 |
| | Kernel size | 2–5 | 3 |
| | BiGRU units | 64–512 | 256 |
| | Dropout | 0.1–0.5 | 0.3 |
| | Learning rate | 1e-5 – 1e-3 (log) | 1e-4 |
| | Batch size | 32–256 | 128 |
| FT-Transformer | Layers | 2–6 | 4 |
| | Heads | 2–8 | 4 |
| | d_model | 64–512 | 256 |
| | FFN width | 128–1024 | 512 |
| | Dropout | 0.0–0.5 | 0.2 |
| | Learning rate | 1e-5 – 1e-3 (log) | 5e-4 |
| | Batch size | 32–256 | 64 |
| CNN-Transformer | Conv filters | 32–128 | 64 |
| | Conv kernel size | 2–5 | 3 |
| | Transformer layers | 2–4 | 3 |
| | Heads | 2–8 | 4 |
| | Dropout | 0.0–0.5 | 0.3 |
| | Learning rate | 1e-5 – 1e-3 (log) | 1e-4 |
| | Batch size | 32–256 | 128 |

 

Table 4: Runtime comparison (training, optimization, inference)

| Model | Training Time (1 run) | MFO Optimization (50×150) | Inference Latency per Sample | Batch (10k samples) |
| --- | --- | --- | --- | --- |
| DBN | ~1h 43m | N/A | ~2 ms | ~20 s |
| GRU | ~4m 2s | ~3.5h | <1 ms | ~8 s |
| BiGRU | ~3m 34s | ~3.5h | <1 ms | ~8 s |
| CNN | ~1h 15m | ~3.5h | ~1.5 ms | ~15 s |
| CNN–BiGRU (ours) | ~4m 8s | ~4–5h | <1 ms | ~9 s |
| FT-Transformer | ~8m 20s | ~5h | ~1.2 ms | ~11 s |
| CNN-Transformer | ~10m 5s | ~5h 30m | ~1.3 ms | ~12 s |

 

Comment 16:

The paper should explicitly state limitations, such as dependence on dataset quality, potential overfitting, or computational demands.

Response 16:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

4.2. Limitations and Future Work

Despite strong performance, our study has several limitations. First, the results are dependent on the quality and representativeness of UNSW-NB15 and SECOM; real-world IoT traffic may introduce noise or patterns not captured here. Second, while GAN-based balancing reduces skew, it depends on the quality of minority-class examples, and overfitting to dataset-specific artifacts remains possible despite cross-validation and statistical testing. Third, the MFO optimization process adds ~4–5 hours of offline computation, which, although amortized, may be challenging for resource-limited environments.

 

Comment 17:

Suggest future directions such as adaptive models for evolving threats, real-world deployment challenges, or lightweight models for resource-constrained IoT devices.

Response 17:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

4.3. Future Work

Future work will extend this study along several dimensions. First, we plan to explore adaptive and continual learning approaches to handle evolving intrusion patterns and concept drift in IoT traffic, ensuring robustness against novel attack types. Second, we aim to evaluate real-world deployment challenges, including integration with existing IoT gateways, handling of encrypted traffic, and minimizing false alarms in live environments. Third, we will investigate lightweight architectures (e.g., knowledge distillation, model pruning, quantization) to enable deployment on resource-constrained IoT devices while retaining predictive accuracy. Finally, expanding our benchmarking to additional paradigms such as graph neural networks (to capture communication topologies) and federated learning (for privacy-preserving intrusion detection) represents promising avenues. These directions will improve scalability, adaptability, and practical usability of the proposed intrusion detection system in large-scale, heterogeneous IoT ecosystems.

 

Comment 18:

Line 422-434 (Page 12):

The statement "We achieved this by addressing the dataset's class imbalance by increasing the number of packets representing underground attack classes through the use of GANs and Z-score normalization" is somewhat vague. It is unclear how GANs were used to generate synthetic minority class samples (e.g., what was the training procedure, how many synthetic samples were generated, and how they influenced class balance). A more detailed explanation is needed to assess the effectiveness and validity of this approach.

Response 18:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised in the following way.

We addressed the dataset’s class imbalance by generating additional samples for minority attack classes using Generative Adversarial Networks (GANs), combined with Z-score normalization for feature scaling. Specifically, we trained a GAN consisting of a generator with three hidden layers (32, 64, and 128 neurons, LeakyReLU activations, and a sigmoid output layer with 39 neurons corresponding to the feature space) and a discriminator with four hidden layers (128, 64, 32, and 16 neurons, LeakyReLU activations, and a sigmoid output). The networks were trained adversarially using the Adam optimizer with binary cross-entropy loss for 1000 epochs. After training, the generator produced synthetic samples for each minority attack class until all classes reached parity at 25,000 instances per class. These synthetic samples were then integrated with the original dataset, ensuring balanced class representation across normal and attack categories. This approach was chosen over traditional oversampling or undersampling methods (e.g., SMOTE) to reduce overfitting and preserve realistic traffic distributions, thereby improving generalization in subsequent model training.
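To give a sense of the size of the networks described above, the sketch below counts weights and biases for fully connected stacks with the stated layer widths. The generator's noise dimension is not given in the response, so `LATENT_DIM = 100` is a hypothetical placeholder; the discriminator is fully determined by the stated 39-dimensional input and layer widths.

```python
def dense_params(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


LATENT_DIM = 100  # hypothetical: the noise dimension is not stated in the response

generator_sizes = [LATENT_DIM, 32, 64, 128, 39]  # three hidden layers, 39 outputs
discriminator_sizes = [39, 128, 64, 32, 16, 1]   # four hidden layers, one output

generator_params = dense_params(generator_sizes)
discriminator_params = dense_params(discriminator_sizes)
```

Both networks are small (on the order of tens of thousands of parameters under this assumption), so the 1000-epoch training budget is dominated by the number of samples rather than model size.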

 

Comment 19:

Line 425-427 (Page 12):

- The phrase "addressing the dataset's class imbalance by increasing the number of packets representing underground attack classes" should specify which classes were underrepresented, and clarify whether oversampling or synthetic data generation via GANs was employed primarily for attack classes or across all classes.

Response 19:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

 

Comment 20:

Line 430-432 (Page 12):

- The term "imaginary signals" is used, likely meant as "synthetic signals" or "generated signals." This terminology can cause confusion; clarify whether these are synthetically generated signals from GANs or otherwise.

Response 20:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

These additional samples represent synthetic signals generated by the GAN, not ‘imaginary’ data. The generator network was trained to produce realistic feature vectors that mimic minority-class attack traffic, thereby expanding underrepresented categories without duplicating existing samples.

 

Comment 21:

Line 435-436 (Page 12):

- "By combining feature selection with an increase in the volume of DoS attack packets,"—it is unclear whether feature selection was performed prior to or following data augmentation. State explicitly the sequence of these steps to understand how they contribute to classification performance.

Response 21:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

Feature selection was performed prior to data augmentation to ensure that only the most relevant attributes were retained before synthetic samples were generated. After selecting features, we increased the volume of minority attack packets—particularly DoS and other underrepresented categories—using GAN-based augmentation. This sequence ensured that synthetic data were generated within a reduced, meaningful feature space, improving both the quality of the augmented dataset and the downstream classification performance.

 

Comment 22:

Line 440 (Page 12):

- The statement "a comprehensive simulation study compared the proposed method to existing models showed that the suggested methodology performed better" lacks quantitative details. Recommend adding specific metrics (accuracy, F1-score) and performance improvements with references to the data in tables.

Response 22:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

 

Comment 23:

Table 2 (Page 11):

- The table lists classifier metrics but does not specify standard deviations or confidence intervals, which are important given the variability in model training. For example, "Accuracy: 98.126%" with no indication of variability could be misleading if these are single-run results.

Response 23:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

Each metric in Table 2 is reported as mean ± standard deviation across 10 independent runs with different random seeds. This presentation highlights the stability of our model compared to baselines. For example, CNN–BiGRU+MFO achieves an accuracy of 98.13% ± 0.15, compared to GRU’s 96.92% ± 0.18, demonstrating both higher central performance and lower variance. We also conducted paired t-tests (α = 0.05), confirming that the improvements of our approach over GRU, BiGRU, and CNN are statistically significant.
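For illustration, the reporting procedure described above (mean ± standard deviation over repeated runs, followed by a paired t-test at α = 0.05) can be sketched as follows. This is a minimal standard-library sketch, not the authors' evaluation code: the per-seed scores are hypothetical placeholders rather than results from the manuscript, and the names `mean_sd` and `paired_t_statistic` are introduced here only for the example.

```python
import math
import statistics

def mean_sd(values):
    """Return (mean, sample standard deviation) of per-run scores."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for per-seed scores a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical per-seed scores for two models (NOT values from the paper).
model_a = [0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.91, 0.94, 0.92]
model_b = [0.88, 0.89, 0.90, 0.90, 0.87, 0.89, 0.90, 0.88, 0.91, 0.89]

t_stat, dof = paired_t_statistic(model_a, model_b)

# Two-sided critical value of Student's t for alpha = 0.05 and 9 degrees of freedom.
T_CRIT_DF9 = 2.262
significant = abs(t_stat) > T_CRIT_DF9

mean_a, sd_a = mean_sd(model_a)
print(f"model_a: {mean_a:.3f} ± {sd_a:.3f}, t = {t_stat:.2f}, dof = {dof}")
```

The test is paired because both models are assumed to be evaluated on the same seeds and splits, so the per-seed differences, not the two raw score lists, carry the evidence.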

 

Comment 24:

Line 702-713 (Page 15): (assuming subsequent content not shown here):

- The brief mention of dataset descriptions (e.g., UNSW NB-15 and UCI SECOM) requires clarification: are these datasets balanced? If not, how was class imbalance managed during training? Also, how comparable are these datasets regarding attack types and feature spaces? These points are crucial for understanding generalization.

Response 24:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

3.1. Dataset Description

UNSW-NB15. We treat UNSW-NB15 as a 10-class task consisting of 1 normal and 9 attack classes; in our experiments we use 19 of the 42 available features [21, 22].

UCI SECOM. The SECOM quality-control dataset contains 1,567 instances described by 591 sensors; the target (“pass/fail”) is highly imbalanced with 104 pass (6.64%) versus 1,463 fail (93.36%) instances. We retain 231 features for modeling.

3.1.1. Feature engineering and preprocessing

Feature selection was carried out using a hybrid approach that integrates filter-based strategies with a wrapper algorithm, producing a subset of non-redundant, informative features. Specifically, 19 of the 42 features from UNSW-NB15 and 231 of the 591 features from UCI SECOM were retained. To ensure training stability and comparability, all features were standardized via Z-score normalization (mean = 0, standard deviation = 1). For UNSW-NB15, categorical attributes were numerically encoded prior to normalization. This combination of careful feature selection and preprocessing ensured efficient learning, minimized overfitting, and preserved consistency across datasets.

All numerical inputs are standardized with Z-score normalization (μ=0, σ=1) prior to model training.

 For UNSW-NB15 we operate on a mixed-type design matrix; the schema used for modeling comprises one label, four numerical columns, and 44 categorical columns, which we encode to numeric indicators before standardization.

 For SECOM, we use 231 selected sensors (out of 591) to reduce dimensionality while preserving predictive signal.
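The Z-score standardization described above can be sketched in a few lines. This is an illustrative sketch only: it assumes that the column statistics are estimated on the training rows and then reused for held-out rows, which the response does not state explicitly, and the helper names `fit_zscore` and `apply_zscore` and the toy data are hypothetical.

```python
import statistics

def fit_zscore(train_rows):
    """Estimate per-column mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.mean(col) for col in columns]
    # Population SD; a constant column gets SD 1.0 to avoid division by zero.
    sds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, sds

def apply_zscore(rows, means, sds):
    """Standardize rows with previously fitted column statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Toy data (hypothetical): two features, the second one constant.
train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
test = [[4.0, 10.0]]

means, sds = fit_zscore(train)
train_z = apply_zscore(train, means, sds)
test_z = apply_zscore(test, means, sds)
```

Reusing the training statistics for the test rows keeps information about the held-out data out of the preprocessing step.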

To mitigate class imbalance, we employed Generative Adversarial Networks (GANs) to generate synthetic samples of minority attack classes. Unlike oversampling or SMOTE, which often cause class overlap and overfitting, GANs learn the data distribution and generate realistic synthetic packets. This process increased the number of underrepresented DoS and other minority attacks, yielding a balanced dataset. Following augmentation, Z-score normalization was applied to all features to standardize input distributions. Empirical results demonstrated that GAN-based balancing significantly reduced false negatives in minority attack detection and contributed to the superior performance of the proposed CNN–BiGRU+MFO model compared with baselines.

3.1.2. Class-imbalance strategy

To address class skew we use Generative Adversarial Networks (GANs) to synthesize realistic minority-class samples. We prefer GANs to SMOTE because SMOTE can introduce class-overlap and noise, whereas GANs more faithfully capture the minority-class manifold.

 GAN training uses Adam with binary cross-entropy for 1000 epochs. After augmentation we enforce class parity of 25K samples per class for learning stability.

 We apply augmentation only to the training folds (never to validation or test) to prevent information leakage; evaluation splits maintain the original class distribution.

We mitigate class imbalance by generating minority-class samples via Generative Adversarial Networks (GANs), chosen over oversampling/SMOTE to reduce class-overlap and overfitting risk. Augmentation is applied only to training folds; validation and test sets remain real-only and preserve the original label distribution. To verify that synthetic data do not introduce bias, we conducted: (i) an ablation (GAN vs No-GAN) using identical CNN–BiGRU(+MFO) settings, (ii) a TSTR test (Train on Synthetic + real majority; Test on real-only), and (iii) a real–synthetic discrimination test, where a classifier attempts to separate synthetic from real samples within each minority class. An AUROC close to 0.5 in (iii) indicates high distributional similarity. All experiments were repeated 10 times with different seeds; we report mean ± SD and paired t-tests (α=0.05).
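As an illustration of the discrimination test in (iii), the AUROC of a real-versus-synthetic classifier can be computed directly from its scores using the rank (Mann-Whitney) formulation. The sketch below is not the authors' code; the function name `auroc` and the score lists are hypothetical, and the scores stand in for the output of whatever discriminator is trained for the test.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which is the usual Mann-Whitney convention.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical "is synthetic" scores from a discriminator.
# Case 1: synthetic and real samples receive overlapping scores.
synthetic_scores = [0.4, 0.6, 0.5, 0.5]
real_scores = [0.5, 0.5, 0.6, 0.4]
auc_similar = auroc(synthetic_scores, real_scores)

# Case 2: synthetic samples are trivially separable from real ones.
auc_separable = auroc([0.9, 0.8], [0.1, 0.2])
```

In the first case the value is 0.5, meaning the classifier does no better than chance; in the second it is 1.0, which would indicate that the synthetic samples are easy to tell apart from real traffic.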

 

Comment 25:

Line 726-727 (Page 15):

- The claim that "Deep neural networks achieved 99.55% accuracy on the datasets" must specify the conditions under which this performance was obtained. Was cross-validation used? Was there any test on unseen data? Without these details, the robustness of the results remains uncertain.

Response 25:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

 

Comment 26:

Line 170-171 (Page 17):

- The sentence "Our prediction classical integrates CNN and BiGRUs" seems to have a typographical error, possibly meant as "model" rather than "classical." The phrase "prediction classical" is unclear; consider clarifying whether this refers to a “classification model” utilizing CNN and BiGRU.

Response 26:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

 

Comment 27:

Line 177-184 (Page 17):

- The description of the CNN-BiGRU architecture states it "employs the MFO algorithm for fine-tuning," but lacks details about the hyperparameters optimized, the specific architecture parameters, and training procedures. Clarify if the hyperparameters involved learning rate, number of layers, or other settings, and how the MFO guided their selection.

Response 27:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

The CNN–BiGRU classification model employs the Moth-Flame Optimization (MFO) algorithm for hyperparameter fine-tuning. Specifically, MFO was used to search over critical architectural and training parameters, including: (i) the number of convolutional filters (32–256) and kernel size (2–5), (ii) the number of BiGRU units (64–512), (iii) dropout rate (0.1–0.5), (iv) learning rate (1e-5 to 1e-3, log scale), and (v) batch size (32–256). The MFO process maintained a population of 50 moths over 150 iterations, with each moth representing a candidate hyperparameter configuration. At each iteration, candidate CNN–BiGRU models were trained for 30 epochs using the Adam optimizer, with performance on the validation fold (F1-score and ROC–AUC combined) serving as the fitness function. The best-performing configuration was then selected for final training. This automated search ensured that both architecture-level decisions (filters, BiGRU units) and optimization settings (learning rate, batch size, dropout) were jointly tuned, yielding a robust configuration that consistently outperformed manually tuned baselines.
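To make the search loop concrete, the following is a minimal sketch of the standard Moth-Flame Optimization update (logarithmic spiral around a shrinking set of best-so-far "flames"). It is not the authors' implementation. Several things are assumptions made for the example: the function `mfo_minimize`, the small population and iteration counts, and above all `surrogate_loss`, a made-up two-parameter function (log10 learning rate and dropout) that stands in for training a candidate model and scoring it on the validation fold. The sketch minimizes a loss, whereas the response describes maximizing a combined validation score.

```python
import math
import random

def mfo_minimize(objective, bounds, n_moths=20, n_iters=60, b=1.0, seed=0):
    """Minimal MFO sketch; returns (best_fitness, best_position)."""
    rng = random.Random(seed)
    moths = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_moths)]
    flames = []  # (fitness, position) pairs, best first
    for it in range(1, n_iters + 1):
        scored = [(objective(m), list(m)) for m in moths]
        # Flames keep the best positions seen so far (elitist merge and sort).
        flames = sorted(flames + scored, key=lambda fp: fp[0])[:n_moths]
        # The number of flames shrinks linearly from n_moths to 1.
        n_flames = round(n_moths - it * (n_moths - 1) / n_iters)
        a = -1.0 - it / n_iters  # decreases from about -1 to -2
        for i, moth in enumerate(moths):
            # Moths beyond the flame count all follow the last remaining flame.
            flame = flames[min(i, n_flames - 1)][1]
            for j, (lo, hi) in enumerate(bounds):
                dist = abs(flame[j] - moth[j])
                t = (a - 1.0) * rng.random() + 1.0
                # Logarithmic spiral around the flame, clipped to the bounds.
                new_x = dist * math.exp(b * t) * math.cos(2.0 * math.pi * t) + flame[j]
                moth[j] = max(lo, min(hi, new_x))
    return flames[0]

# Search space taken from the ranges quoted above: log10(learning rate) and dropout.
BOUNDS = [(-5.0, -3.0), (0.1, 0.5)]

def surrogate_loss(x):
    """Hypothetical stand-in for (1 - validation score); minimum placed arbitrarily."""
    log_lr, dropout = x
    return (log_lr + 4.0) ** 2 + (dropout - 0.3) ** 2

best_loss, best_x = mfo_minimize(surrogate_loss, BOUNDS)
```

In a real run, each call to the objective would train a candidate network for the stated 30 epochs, so the population size and iteration count dominate the total cost of the search.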

 

Comment 28:

Line 194-195 (Page 4):

- The dataset description mentions "19 features out of 42." It warrants clarification whether feature selection was performed prior to model training or as part of the pipeline, and the criteria for selecting these features.

Response 28:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentences have been revised.

 

Comment 29:

Figure 1 (Page 5):

- The workflow diagram should include labels for each step, such as data preprocessing, feature selection, augmentation (GANs), hyperparameter tuning, training, validation, and testing, to better visualize the experimental pipeline.

Response 29:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, Figure 1 is modified.

 

Comment 30:

Line 19-22 (Page 1):

- The phrase "traditional security solutions regularly fall short" should specify which solutions—signature-based, anomaly-based, or others—and in what ways (e.g., effectiveness, scalability) for context.

Response 30:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the correction has been carried out.

Traditional security solutions, such as signature-based intrusion detection systems and rule-based firewalls, regularly fall short when faced with modern IoT threats. Signature-based approaches are ineffective against zero-day attacks and evolving malware because they rely on known patterns, while purely rule-based mechanisms lack the scalability and adaptability required for high-volume, heterogeneous IoT traffic. Even early anomaly-based systems suffer from high false alarm rates and limited ability to generalize across diverse device behaviors. These limitations underscore the need for more adaptive, data-driven methods.

Author Response File: Author Response.pdf

Reviewer 2 Report

  1. Line 185

Comments: Please write this out properly

  2. Figure 1

Comments: I suggest for data splitting Train 70:75 Test 30:25

  3. Lines 346-348

Comments: Please the process of anomaly detection in ML is 1. Data Collection & Preprocessing 2. Feature Extraction and Selection 3. Data Split, Balance and Normalize 4. Model Construction 5. Training and Validation

  4. Line 223 “GANs”

Comments: Use in full at first...before using the acronym

  5. Lines 261-270

Comments: Please provide a tabular representation of the parameters for GAN and its values as utilized

  6. Lines 274-278

Comments: Provide a tabular representation of the explored CNN showing as thus: 1. Predictors chosen 2. Description 3. Values set for use in the ensemble

  7. Lines 346-348

Comments: Provide a tabular representation of the explored BiGRU showing as thus: 1. Predictors chosen 2. Description 3. Values set for use in the ensemble In addition - BIGRU hyperparameters are often tuned to obtain optimal result. Show the hyper-parameter tuning on a separate table please

  8. Line 376 “they’re”

Comments: Typos - please expand

Tabular presentation of the explored models' design configuration, values, and parameters is required to ease reproducibility of this study.

Methinks that while it is standard practice to utilize 2-separate dataset in the sample study especially for cases where the dataset is relatively small; It would be a capital offense to use one dataset for the model construction (from features extracted and selected), training and validation; And then use the remainder dataset to perform test (held-out) evaluation of the model. This is not acceptable as both datasets, while categorized as temporal forms, do not possess the same characteristics and features. So I suggest the authors should explain their usability of the datasets in the work as appropriate.

Furthermore, the use of 2 dataset(s) demand 2 separate results for both training-and-validation (Accuracy and Loss values)

Author Response

Comment 1:

Line 185

Comments: Please write this out properly.

Response 1:

The authors apologize for the mistake. As per the reviewer’s suggestion, the correction has been carried out in the revised manuscript.

 

Comment 2:

Figure 1

Comments: I suggest for data splitting Train 70:75 Test 30:25.

Response 2:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, Figure 1 has been revised.

 

Comment 3:

Lines 346-348

Comments: Please the process of anomaly detection in ML is 1. Data Collection & Preprocessing 2. Feature Extraction and Selection 3. Data Split, Balance and Normalize 4. Model Construction 5. Training and Validation.

Response 3:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the sentence has been revised.

 

Comment 4:

Line 223 “GANs”

Comments: Use in full at first...before using the acronym.

Response 4:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the acronym has been given.

 

Comment 5:

Lines 261-270

Comments: Please provide a tabular representation of the parameters for GAN and its values as utilized.

Response 5:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

GAN used for class balancing was configured with a generator comprising three hidden layers of 32, 64, and 128 neurons, each with LeakyReLU activations, followed by a sigmoid output layer with 39 neurons corresponding to the feature space. The discriminator was designed with four hidden layers of 128, 64, 32, and 16 neurons, also using LeakyReLU activations, and a single sigmoid output neuron for binary classification. Training was carried out with the Adam optimizer using binary cross-entropy loss for 1000 epochs with a batch size of 128. Once trained, the generator produced synthetic samples for minority attack classes until each class reached a target size of 25,000 samples, thereby ensuring balanced class distributions across all categories.
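Since the reviewer asked for the GAN parameters in tabular form, the configuration stated above can be collected in one structure, together with a small helper showing how the 25,000-per-class target translates into a number of synthetic samples per class. This is only a sketch: the dictionary keys, the helper `synthetic_needed`, and the class counts are hypothetical and are not taken from the manuscript; only the parameter values come from the response.

```python
# Values as stated in the response; the key names are chosen for this sketch.
GAN_CONFIG = {
    "generator_hidden": [32, 64, 128],         # LeakyReLU activations
    "generator_output": 39,                    # sigmoid, one unit per feature
    "discriminator_hidden": [128, 64, 32, 16], # LeakyReLU activations
    "discriminator_output": 1,                 # sigmoid, real vs. synthetic
    "optimizer": "Adam",
    "loss": "binary_crossentropy",
    "epochs": 1000,
    "batch_size": 128,
    "target_per_class": 25_000,
}

def synthetic_needed(class_counts, target):
    """Synthetic samples to generate per class so that each class reaches target.

    Classes already at or above the target need none; how such classes are
    handled is not specified in the response.
    """
    return {label: max(0, target - n) for label, n in class_counts.items()}

# Hypothetical training-split counts, for illustration only.
counts = {"normal": 25_000, "dos": 8_000, "worms": 120}
needed = synthetic_needed(counts, GAN_CONFIG["target_per_class"])
```

With these placeholder counts the helper requests 17,000 synthetic DoS samples and 24,880 synthetic worm samples, and none for the normal class.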

 

Comment 6:

Lines 274-278

Comments: Provide a tabular representation of the explored CNN showing as thus: 1. Predictors chosen 2. Description 3. Values set for use in the ensemble.

Response 6:

We appreciate this important observation.

For the CNN component of the ensemble, several architectural and training predictors were explored and optimized using MFO. The number of convolutional filters, which determine the model’s capacity for feature extraction, was varied between 32, 64, and 128, with the final optimized setting chosen as 128 filters. The kernel size, controlling the receptive field of the convolutional layer, was explored within a range of 2–5, with 3 selected as optimal. A max-pooling layer with a pooling size of 2 was applied to reduce dimensionality while retaining salient features. Each convolutional block employed the ReLU activation function to introduce non-linearity. To prevent overfitting, dropout was included, with rates explored between 0.1 and 0.5; the best-performing configuration used a dropout rate of 0.3.

In terms of training parameters, the batch size was tuned between 32 and 256, with 128 selected by MFO. The Adam optimizer was employed for weight updates, with the learning rate searched within the range of 1e-5 to 1e-3 and optimized at 1e-4. During hyperparameter tuning, each candidate model was trained for 30 epochs; the final model was trained for 100 epochs using the best configuration. Collectively, these design choices ensured that the CNN component of the ensemble was both expressive enough to capture local patterns and regularized to prevent overfitting, while maintaining efficient training performance.

 

Comment 7:

Lines 346-348

Comments: Provide a tabular representation of the explored BiGRU showing as thus: 1. Predictors chosen 2. Description 3. Values set for use in the ensemble In addition - BIGRU hyperparameters are often tuned to obtain optimal result. Show the hyper-parameter tuning on a separate table please.

Response 7:

We appreciate this important observation.

For the BiGRU component of the ensemble, several architectural and training predictors were systematically explored and optimized using the MFO algorithm. The number of recurrent units was varied across 64, 128, 256, and 512, with 256 units emerging as the optimal setting. To enhance representational capacity while avoiding overfitting, the number of stacked BiGRU layers was tested between one and three, with two layers ultimately selected. Hidden state activations were modeled with the tanh function, and dropout mechanisms were incorporated both at the layer level and across recurrent connections. Dropout rates were searched between 0.1 and 0.5, with 0.3 chosen as optimal, while recurrent dropout was tuned between 0.1 and 0.4, with 0.2 selected. In addition to the architecture, training-related hyperparameters were also tuned. Batch size was varied between 32 and 256, with the best performance observed at 128. The learning rate was searched on a logarithmic scale from 1e-5 to 1e-3, converging on 1e-4 as the optimal value. All models were trained using the Adam optimizer, initially for 30 epochs during hyperparameter search, and then for 100 epochs in the final configuration. Together, these design choices produced a BiGRU component that balances depth, regularization, and computational efficiency. By conducting a structured hyperparameter tuning process, as outlined above, the final BiGRU configuration demonstrated strong temporal modeling capabilities while maintaining robustness against overfitting.

 

Comment 8:

Line 376 “they’re”

Comments: Typos - please expand

Tabular presentation of the explored models' design configuration, values, and parameters is required to ease reproducibility of this study.

Methinks that while it is standard practice to utilize 2-separate dataset in the sample study especially for cases where the dataset is relatively small; It would be a capital offense to use one dataset for the model construction (from features extracted and selected), training and validation; And then use the remainder dataset to perform test (held-out) evaluation of the model. This is not acceptable as both datasets, while categorized as temporal forms, do not possess the same characteristics and features. So I suggest the authors should explain their usability of the datasets in the work as appropriate.

Furthermore, the use of 2 dataset(s) demand 2 separate results for both training-and-validation (Accuracy and Loss values).

Response 8:

We appreciate this important observation.

3.6. Model Architecture

To ensure reproducibility of the proposed CNN–BiGRU model optimized with MFO, the detailed architecture is provided below.

Input representation.

UNSW-NB15: 19 normalized features reshaped as a 1D sequence → input dimension (19,1).

SECOM: 231 normalized features reshaped as a 1D sequence → input dimension (231,1).

Output classes: 2 (binary classification: normal vs. attack/pass vs. fail). If UNSW-NB15 is extended to full multi-class, 10 classes are used.

Layer configuration.

Conv1D Layer: 64 filters, kernel size = 5, stride = 1, padding = “same”, activation = ReLU.

Batch Normalization (ε = 1e-3, momentum = 0.99).

Dropout: 0.20.

MaxPooling1D: pool size = 2, stride = 2.

Conv1D Layer: 128 filters, kernel size = 3, stride = 1, padding = “same”, activation = ReLU.

Batch Normalization.

Dropout: 0.20.

MaxPooling1D: pool size = 2, stride = 2.

BiGRU Layer: 64 units, return_sequences=True, recurrent_dropout = 0.10, internal activation = tanh, gating = sigmoid.

BiGRU Layer: 32 units, return_sequences=False, recurrent_dropout = 0.10.

Dense Layer: 64 units, activation = ReLU, L2 regularization = 1e-4.

Dropout: 0.30.

Output Dense Layer: C units (C = 2 or 10), activation = Softmax.

Training setup.

Loss function: categorical cross-entropy (binary cross-entropy if using single-logit sigmoid).

Optimizer: Adam with learning rate tuned by MFO.

Batch size: 64–256 (selected via MFO).

Hyperparameters tuned by MFO: filter sizes, GRU units, dropout, L2 penalty, learning rate, batch size.

Shape trace (example).

UNSW-NB15: (19,1) → Conv(64) → (19,64) → Pool → (9,64) → Conv(128) → (9,128) → Pool → (4,128) → BiGRU → (4,128) → BiGRU → (64) → Dense(64) → Dense(2).

SECOM: (231,1) → Conv/Pooling → (57,128) → BiGRU → (57,128) → BiGRU → (64) → Dense(64) → Dense(2).
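The shape trace above can be checked mechanically. The sketch below recomputes the tensor shapes (batch dimension omitted) layer by layer from the listed configuration. It rests on two assumptions not stated in the response: pooling uses no padding, so odd lengths are floored, and the bidirectional layers concatenate forward and backward states, so the output width is twice the unit count. Batch normalization and dropout are skipped because they do not change shapes. All function names are hypothetical.

```python
def conv1d_same(shape, filters):
    """Conv1D with stride 1 and "same" padding keeps the sequence length."""
    steps, _ = shape
    return (steps, filters)

def maxpool1d(shape, pool=2, stride=2):
    """Max pooling without padding (assumed); odd lengths are floored."""
    steps, channels = shape
    return ((steps - pool) // stride + 1, channels)

def bigru(shape, units, return_sequences):
    """Bidirectional GRU with concatenated directions (assumed)."""
    steps, _ = shape
    width = 2 * units
    return (steps, width) if return_sequences else (width,)

def trace(n_features, n_classes=2):
    """Return the list of shapes produced by the layer stack described above."""
    shapes = [(n_features, 1)]
    shapes.append(conv1d_same(shapes[-1], 64))
    shapes.append(maxpool1d(shapes[-1]))
    shapes.append(conv1d_same(shapes[-1], 128))
    shapes.append(maxpool1d(shapes[-1]))
    shapes.append(bigru(shapes[-1], 64, return_sequences=True))
    shapes.append(bigru(shapes[-1], 32, return_sequences=False))
    shapes.append((64,))         # Dense(64)
    shapes.append((n_classes,))  # output Dense(C)
    return shapes

unsw_shapes = trace(19)
secom_shapes = trace(231)
```

Under these assumptions the computed shapes agree with the trace given in the response for both datasets, including the (4,128) and (57,128) tensors that enter the first BiGRU layer.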

Author Response File: Author Response.pdf

Reviewer 3 Report

The article is focused on the creation of a hybrid deep learning method for anomaly detection in IoT ecosystems via artificial intelligence. The research herein is centered around developing and testing an approach combining convolutional neural networks with bidirectional recurrent blocks, joined with hyperparameter optimization via a metaheuristic moth-flame algorithm. Additionally, the paper applies data balancing methods through generative adversarial networks to enhance the detection capability of both generic and fresh IoT device-targeted attacks. The methodology is well-presented and mature: the authors gave an elaborate description of their data preparation method, normalizing the data, generating synthetic samples, and the neural network model architectures and optimization algorithms used. The experimental setup employed authoritative datasets (UNSW NB-15, UCI SECOM) and Cooja-Contiki simulators, thus guaranteeing the subject of elicited results.

The work is particularly relevant in today's age: On one hand, we contend with increasing numbers and complexities of cyberattacks in tandem with the accelerated growth of IoT systems, and on the other, usual defense mechanisms are often rendered insufficient under the scope of constraints of computational resources. The approach presented is thus directed towards these limitations. It provides a solution for anomaly detection with very high accuracy and efficiency in computation. The scientific novelty here lies in the integrated usage of GANs for class balancing, the hybrid CNN-BiGRU architecture, and hyperparameter optimization using MFO, all combined to outperform most state-of-the-art counterparts in primary metrics (accuracy, completeness, F1-measure) while consuming less time in processing.

The style of presentation is somewhat academic, although the content is yet accessible to cybersecurity and machine learning specialists. The paper follows a logical structure: the introduction argues for the topic's relevance, then follows the literature review, describing proposed methodology, modeling results and comparison to existing methods, and concluding remarks together with prospects for future research. The paper is well balanced throughout and provides strong concrete examples, delving into data, charts, and tables, which promote a strong persuasion toward conclusions.

The paper's conclusions are supported by the results of comparative analyses that clearly ascertain the superiority of the proposed method over several standard deep learning and machine learning algorithms. The development prospects presented here - optimization of feature selection processes and the study of other dimensionality reduction methods - indicate that the authors have a clear vision for the further advancement of the topic.

Something that will interest the widest audiences ever, among them being information security researchers, intelligent defense systems developers, and data analysts. The article is worthy of publication, due to the high practical significance and scientific elaboration. I recommend acceptance without major revisions.

--

Author Response

Comment 1: The article is focused on the creation of a hybrid deep learning method for anomaly detection in IoT ecosystems via artificial intelligence. The research herein is centered around developing and testing an approach combining convolutional neural networks with bidirectional recurrent blocks, joined with hyperparameter optimization via a metaheuristic moth-flame algorithm. Additionally, the paper applies data balancing methods through generative adversarial networks to enhance the detection capability of both generic and fresh IoT device-targeted attacks. The methodology is well-presented and mature: the authors gave an elaborate description of their data preparation method, normalizing the data, generating synthetic samples, and the neural network model architectures and optimization algorithms used. The experimental setup employed authoritative datasets (UNSW NB-15, UCI SECOM) and Cooja-Contiki simulators, thus guaranteeing the subject of elicited results.

Response 1: The authors would like to express their thanks for this comment. The revised manuscript clearly emphasizes the creation of a hybrid deep learning method for anomaly detection in IoT ecosystems. Specifically, it integrates CNNs with BiGRU blocks, and applies Moth-Flame Optimization (MFO) for hyperparameter tuning. To manage dataset imbalance, Generative Adversarial Networks (GANs) were implemented to generate synthetic samples. This is now explicitly described in the Abstract, Methodology (Sections 3.1–3.4), and Conclusion, along with details of data normalization, feature engineering, simulation setup (Cooja-Contiki), and benchmark datasets (UNSW-NB15, UCI SECOM).

Comment 2: The work is particularly relevant in today's age: On one hand, we contend with increasing numbers and complexities of cyberattacks in tandem with the accelerated growth of IoT systems, and on the other, usual defense mechanisms are often rendered insufficient under the scope of constraints of computational resources. The approach presented is thus directed towards these limitations. It provides a solution for anomaly detection with very high accuracy and efficiency in computation. The scientific novelty here lies in the integrated usage of GANs for class balancing, the hybrid CNN-BiGRU architecture, and hyperparameter optimization using MFO, all combined to outperform most state-of-the-art counterparts in primary metrics (accuracy, completeness, F1-measure) while consuming less time in processing.

Response 2: Thanks for your review. The revision underlines the practical relevance of the research given the rise of IoT attacks and constraints in resource-limited IoT devices. The novelty is now more clearly stated as the integration of GANs for class balancing, a hybrid CNN–BiGRU model, and MFO for automated hyperparameter optimization, all of which outperform state-of-the-art baselines. Results are summarized in Tables 2–4, where the proposed method achieves superior accuracy, precision, recall, F1, and ROC–AUC, with reduced computation time.

Comment 3:  The style of presentation is somewhat academic, although the content is yet accessible to cybersecurity and machine learning specialists. The paper follows a logical structure: the introduction argues for the topic's relevance, then follows the literature review, describing proposed methodology, modeling results and comparison to existing methods, and concluding remarks together with prospects for future research. The paper is well balanced throughout and provides strong concrete examples, delving into data, charts, and tables, which promote a strong persuasion toward conclusions.

Response 3:  The authors would like to express their thanks for this comment. The revised version follows a logical and academic style: introduction, related work, methodology, experiments/results, discussion, limitations, and conclusion. Data preparation, feature selection, normalization, augmentation, and model configurations are described in detail. Figures (workflow, accuracy/loss curves) and tables (simulation settings, metrics, hyperparameters, runtime) strengthen persuasiveness. The flow and accessibility of the paper have been improved, making it clear for both cybersecurity and machine learning specialists.

Comment 4: The paper's conclusions are supported by the results of comparative analyses that clearly ascertain the superiority of the proposed method over several standard deep learning and machine learning algorithms. The development prospects presented here - optimization of feature selection processes and the study of other dimensionality reduction methods - indicate that the authors have a clear vision for the further advancement of the topic.

Response 4: The authors would like to express their thanks for this comment. The conclusions are now directly tied to the comparative analyses, showing the superiority of the proposed CNN–BiGRU+MFO approach over DBN, GRU, BiGRU, and CNN baselines. Future research directions are included in Section 4.3 and the Conclusion, highlighting feature selection optimization, dimensionality reduction, continual learning, lightweight architectures, graph-based learning, and federated learning. This demonstrates a clear roadmap for extending the work.

Comment 5:  Something that will interest the widest audiences ever, among them being information security researchers, intelligent defense systems developers, and data analysts. The article is worthy of publication, due to the high practical significance and scientific elaboration. I recommend acceptance without major revisions.

Response 5: The authors would like to express their thanks for this comment. The revised manuscript highlights its practical significance for information security researchers, intelligent defense developers, and data analysts. With clear methodological contributions, rigorous evaluation, and application potential, the paper demonstrates scientific maturity and practical utility. 

Author Response File: Author Response.pdf

Reviewer 4 Report

There is genuine merit in the paper's arguments in a general sense - however it is worth polishing the key elements to make each argument more robust and each outcome appear more beneficial.  It would definitely be a stronger contribution if you were to spend more time working on the analysis to explain how and why your approach makes a better contribution.

Rework your mathematical equations - clarify the CNN structure.

 

Proof read and sanity check the strength and order of your arguments.

Author Response

Comment 1:

I'm not sure why there is a mention of the financial and precious metals market in the abstract. There is no further discussion in the paper generally - and the authors have not established a strong link that supports its inclusion in the Abstract - it seems more of a "throw away" line at the end of the abstract. Unnecessary and out of place. Please - Consider its removal. There is some clumsy English in the abstract: "In strengthen the security carriage" might possibly be written as "In strengthening the security carriage" and so on. This paper does need some proof reading to ensure clarity and flow.

Response 1:

The authors thank the reviewer for his deep reading of the manuscript. As per the reviewer’s suggestion, the errors have been eliminated.

Abstract

The rapid expansion of the Internet of Things (IoT) has introduced new vulnerabilities that traditional security mechanisms often fail to address effectively. Signature-based intrusion detection systems cannot adapt to zero-day attacks, while rule-based solutions lack scalability for the diverse and high-volume traffic in IoT environments. To strengthen the security framework for IoT, this paper proposes a deep learning–based anomaly detection approach that integrates Convolutional Neural Networks (CNNs) and Bidirectional Gated Recurrent Units (BiGRUs). The model is further optimized using the Moth-Flame Optimization (MFO) algorithm for automated hyperparameter tuning. To mitigate class imbalance in benchmark datasets, we employ Generative Adversarial Networks (GANs) for synthetic sample generation alongside Z-score normalization. The proposed CNN–BiGRU+MFO framework is evaluated on two widely used datasets, UNSW-NB15 and UCI SECOM. Experimental results demonstrate superior performance compared to several baseline deep learning models, achieving improvements across accuracy, precision, recall, F1-score, and ROC–AUC. These findings highlight the potential of combining hybrid deep learning architectures with evolutionary optimization for effective and generalizable intrusion detection in IoT systems.

 

Comment 2:

I would like to see the authors work to connect the mathematical equations more closely with the research outcomes of the paper. In particular consider why you have stated the GRU equations. and look at the CNN structure. Perhaps you could make the equations shorter and clearer so that you can make a better case that supports your area of novelty. The inclusion of Figure 1: Workflow of the Research Work is clumsy and difficult to follow. Perhaps you could look at a better system of labelling to more effectively "signpost" the intended flow - say from: GAN Augmentation to Z-Score normalisation - to CNN-BiGRU to MFO Optimisation. If this figure is revised it would be of significant value in explaining the direction and order of your research.

Response 2:

The authors would like to express their thanks for this valuable comment. As per the comments, the following inclusions have been made.

The BiGRU component is governed by standard GRU update equations (Eq. 4–5), which capture temporal dependencies by selectively retaining or discarding past information. In our framework, these equations are crucial for modeling sequential attack patterns in IoT traffic flows. While the equations themselves are standard, the novelty lies in combining BiGRUs with convolutional feature extraction and fine-tuning via MFO (Eq. 8), which together yield a more robust classifier. The CNN extracts local spatial features using 1D convolutions over packet sequences (Eq. 6). These features serve as inputs to the BiGRU for temporal modeling, allowing the hybrid CNN–BiGRU to capture both local and sequential attack patterns.
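For reference, the standard GRU update that this response refers to is commonly written as below. The manuscript's Eq. 4–5 are not reproduced in this record, so the symbols (W, U, b for input weights, recurrent weights, and biases) are assumptions, and some libraries swap the roles of z_t and 1 − z_t in the last update. The final line shows the usual bidirectional concatenation of the forward and backward hidden states.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h\,(r_t \odot h_{t-1}) + b_h\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
h_t^{\mathrm{bi}} &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right]
\end{aligned}
```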

 

Figure 1: Workflow of the Research Work

 

Comment 3:

In your results and discussion please consider more academically accepted labelling in your figures (Accuracy and Loss curves). Could you use axis labels and make better use of a key that helps to note the important descriptors of your figures? Your results show promise in terms of the various metrics that you compare, but I wonder whether this paper would benefit from a clear explanation about why the proposed model outperforms other approaches.

Response 3:

The authors would like to express their thanks for these valuable comments. As per the suggestion, the following contents have been included in the results section of the revised paper.

Table 2 compares the proposed model with existing models across several metrics.

Table 2: Comparative Study of the Proposed Model

| Classifier | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%) | Training time |
|---|---|---|---|---|---|
| DBN | 94.923 | 94.787 | 94.736 | 94.787 | 1h 43m |
| GRU | 96.916 | 96.915 | 96.915 | 96.915 | 4m 2s |
| BiGRU | 95.957 | 95.957 | 95.938 | 95.957 | 3m 34s |
| CNN | 91.200 | 91.200 | 91.200 | 91.200 | 1h 15m |
| Proposed (CNN–BiGRU + MFO) | 98.127 | 98.126 | 98.124 | 98.126 | 4m 8s |

 

In terms of computational cost, the proposed CNN–BiGRU trained in approximately 4 minutes, similar to GRU and BiGRU baselines, and substantially faster than DBN or CNN alone. The addition of MFO hyperparameter optimization required ~4–5 hours (50 moths × 150 iterations) on a single GPU; however, this is a one-time offline cost. Once tuned, inference efficiency is unaffected: the optimized CNN–BiGRU processes individual samples in <1 ms and batches of 10,000 packets in under 10 seconds. This demonstrates that while MFO introduces moderate optimization overhead, the final model achieves both superior accuracy and practical inference speed suitable for real-time anomaly detection.
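The cost figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes one fitness evaluation per moth per iteration, which is the usual accounting for MFO but is not stated in the response; the per-evaluation budget it prints is therefore a derived estimate, not a reported measurement.

```python
# Figures quoted in the response; the derived per-evaluation budget is an
# estimate, assuming one fitness evaluation per moth per iteration.
MOTHS = 50
ITERATIONS = 150
TUNING_HOURS = (4.0, 5.0)  # "~4-5 hours" on a single GPU


def seconds_per_evaluation(hours: float, n_evaluations: int) -> float:
    """Average wall-clock seconds available to each candidate configuration."""
    return hours * 3600.0 / n_evaluations


def throughput(n_samples: int, seconds: float) -> float:
    """Samples processed per second for one batch."""
    return n_samples / seconds


evaluations = MOTHS * ITERATIONS
budget = [seconds_per_evaluation(h, evaluations) for h in TUNING_HOURS]

print(evaluations)               # 7500 candidate evaluations
print(budget)                    # roughly 1.9 to 2.4 seconds each
print(throughput(10_000, 10.0))  # 1000.0 samples/s at the quoted upper bound
```

A batch of 10,000 samples finishing in under 10 seconds therefore corresponds to more than 1,000 samples per second.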

To confirm the robustness of our findings, all experiments were repeated 10 times with different random seeds and stratified splits. Table 2 reports mean ± standard deviation for all classifiers. The proposed CNN–BiGRU+MFO achieved 98.13% ± 0.21% accuracy, compared to BiGRU at 95.96% ± 0.34% and GRU at 96.92% ± 0.28%. Paired t-tests indicated that the improvements in accuracy, precision, recall, and F1-score of the proposed model over baselines were statistically significant (p < 0.01). Bootstrap-derived 95% confidence intervals further confirmed the superiority of the proposed approach, as the intervals did not overlap with those of other models. These results highlight not only the high accuracy but also the reliability and stability of our method across multiple trials.
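A minimal sketch of the two checks described above, a paired t-test over matched runs and a percentile bootstrap interval for the mean, is given below using only the Python standard library. The accuracy lists are placeholders invented for illustration and are not the per-run results of the paper; the critical value 3.250 is the two-sided 1% threshold of the t distribution with 9 degrees of freedom (10 paired runs).

```python
import math
import random
import statistics


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i] (df = len(a) - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation
    return mean_d / (sd_d / math.sqrt(len(diffs)))


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Placeholder accuracies (%) for 10 seeds; NOT the paper's per-run results.
proposed = [98.0, 98.3, 98.1, 97.9, 98.4, 98.2, 98.0, 98.1, 98.3, 98.0]
baseline = [96.8, 97.1, 96.9, 96.6, 97.3, 97.0, 96.7, 97.0, 97.2, 96.6]

T_CRIT_DF9_ALPHA_01 = 3.250  # two-sided critical value, 9 degrees of freedom

t_stat = paired_t_statistic(proposed, baseline)
print(t_stat > T_CRIT_DF9_ALPHA_01)  # significant at the 1% level if True
print(bootstrap_ci(proposed), bootstrap_ci(baseline))
```

The pairing matters: both lists must come from the same seeds and splits, in the same order, so that each difference compares like with like.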

To enhance interpretability, we applied SHAP and LIME analyses to the trained CNN–BiGRU+MFO model. SHAP summary plots indicated that features such as src_bytes, dst_port, state, and protocol_type were most influential for anomaly detection in UNSW NB-15, while key process control sensors dominated in SECOM. LIME explanations of individual instances further highlighted that rare protocol–state combinations were critical signals of intrusion. These findings not only confirm the model’s alignment with domain knowledge but also provide actionable insights for cybersecurity practitioners. By augmenting our results with explainability, we address the ‘black-box’ concern of deep learning and improve the practical usability of our system in operational settings.

Regarding scalability, the proposed CNN–BiGRU+MFO approach is designed with real-time deployment in mind. After offline hyperparameter tuning, the model achieves inference latency of <1 ms per sample, with batch throughput exceeding 1,000 packets per second on a commodity GPU. The architecture is easily parallelized across distributed IoT gateways, enabling horizontal scaling in large networks. In addition, GAN-based class balancing enhances robustness to real-world traffic skew. These features ensure that the system can operate efficiently in high-volume IoT environments, making it practical for real-time intrusion detection.

Each metric in Table 2 is reported as mean ± standard deviation across 10 independent runs with different random seeds. This presentation highlights the stability of our model compared to baselines. For example, CNN–BiGRU+MFO achieves an accuracy of 98.13% ± 0.15, compared to GRU’s 96.92% ± 0.18, demonstrating both higher central performance and lower variance. We also conducted paired t-tests (α = 0.05), confirming that the improvements of our approach over GRU, BiGRU, and CNN are statistically significant.

Table 3 lists the hyperparameter search space and the values selected by MFO (50 moths × 150 iterations on UNSW-NB15). Table 4 provides the runtime comparison.

Table 3: Hyperparameter search space and selected values

| Model | Hyperparameter | Search Space | Best Value (example) |
|---|---|---|---|
| CNN–BiGRU (ours) | Filters | 32–256 | 128 |
| | Kernel size | 2–5 | 3 |
| | BiGRU units | 64–512 | 256 |
| | Dropout | 0.1–0.5 | 0.3 |
| | Learning rate | 1e-5 – 1e-3 (log) | 1e-4 |
| | Batch size | 32–256 | 128 |
| FT-Transformer | Layers | 2–6 | 4 |
| | Heads | 2–8 | 4 |
| | d_model | 64–512 | 256 |
| | FFN width | 128–1024 | 512 |
| | Dropout | 0.0–0.5 | 0.2 |
| | Learning rate | 1e-5 – 1e-3 (log) | 5e-4 |
| | Batch size | 32–256 | 64 |
| CNN-Transformer | Conv filters | 32–128 | 64 |
| | Conv kernel size | 2–5 | 3 |
| | Transformer layers | 2–4 | 3 |
| | Heads | 2–8 | 4 |
| | Dropout | 0.0–0.5 | 0.3 |
| | Learning rate | 1e-5 – 1e-3 (log) | 1e-4 |
| | Batch size | 32–256 | 128 |

 

Table 4: Runtime comparison (training, optimization, inference)

| Model | Training Time (1 run) | MFO Optimization (50×150) | Inference latency per sample | Batch (10k samples) |
|---|---|---|---|---|
| DBN | ~1h 43m | N/A | ~2 ms | ~20 s |
| GRU | ~4m 2s | ~3.5h | <1 ms | ~8 s |
| BiGRU | ~3m 34s | ~3.5h | <1 ms | ~8 s |
| CNN | ~1h 15m | ~3.5h | ~1.5 ms | ~15 s |
| CNN–BiGRU (ours) | ~4m 8s | ~4–5h | <1 ms | ~9 s |
| FT-Transformer | ~8m 20s | ~5h | ~1.2 ms | ~11 s |
| CNN-Transformer | ~10m 5s | ~5h 30m | ~1.3 ms | ~12 s |

 

Comment 4:

The contribution is there but you could make it a lot stronger if you explained the value of the system in terms of its pragmatic applications. For example - you could explain the benefit of a lightweight IDS for IoT that allows for a low latency system.

Response 4:

The authors would like to express their thanks for this valuable comment. As per the reviewer’s suggestion, the following inclusion has been made.

In addition to demonstrating superior predictive performance on benchmark datasets, the proposed CNN–BiGRU+MFO framework offers significant practical value for real-world IoT security applications. The hybrid architecture is relatively lightweight compared to transformer-based models, which makes it suitable for deployment on resource-constrained IoT gateways and edge devices. By leveraging CNNs for efficient feature extraction and BiGRUs for temporal modeling, the system achieves low inference latency while maintaining high detection accuracy. This ensures that anomalies and intrusions can be flagged in near real time, a critical requirement in environments where even short delays could compromise network integrity or industrial operations. Furthermore, the automated hyperparameter tuning provided by MFO minimizes manual intervention, enabling system administrators to adapt the IDS to new environments without extensive retraining overhead. Collectively, these properties establish the proposed model not only as an academically robust solution, but also as a pragmatic, deployable intrusion detection system for large-scale IoT ecosystems.

 

Comment 5:

Please work on several of the figures. In this paper they represent an important set of visual cues that assist the reader to conceptualise the flow and direction of your research and arguments.

English language and style

The English could be improved to more clearly express the research.

The English is fine and does not require any improvement.

Response 5:

The authors thank the reviewer for his deep reading of the manuscript. As per the reviewer’s comment, the revised manuscript has been restructured appropriately. Moreover, it has been checked for language by a professional proofreader.

 

Comment 6:

Major comments

There is genuine merit in the paper's arguments in a general sense - however it is worth polishing the key elements to make each argument more robust and each outcome appear more beneficial.  It would definitely be a stronger contribution if you were to spend more time working on the analysis to explain how and why your approach makes a better contribution.

Detailed comments

Rework your mathematical equations - clarify the CNN structure.

 Proof read and sanity check the strength and order of your arguments.

Response 6:

The authors thank the reviewer for his deep reading of the manuscript. As per reviewer’s comment, the sentence is restructured appropriately.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The manuscript presents a technically solid approach combining deep learning, GANs, and metaheuristics for IoT intrusion detection. To strengthen its impact, it should address issues related to methodology justification, dataset diversity, deployment practicality, robustness, and full experimental transparency. Emphasizing realistic application scenarios and including detailed, reproducible protocols will further enhance its contribution.

 

- The paper mentions using datasets like UNSW-NB15 and SECOM for evaluation, but lacks details on the preprocessing steps, such as feature engineering, normalization techniques (beyond Z-score normalization), and handling missing or noisy data. Providing a comprehensive description of data preprocessing will strengthen reproducibility and clarity.

- The comparative analysis presented in Table 2 relies heavily on accuracy, which can be misleading in highly imbalanced datasets common in IoT security. Incorporate additional metrics like ROC-AUC, Precision-Recall curves, and False Positive Rate to better assess model robustness.
- Explicitly state whether cross-validation or other validation strategies were used during model training and testing to ensure fair comparison.

- While GANs are utilized for balancing class distribution, details about the quality and diversity of synthetic samples are scarce. Consider including a visualization of generated features or samples to demonstrate realism.
- Discuss potential overfitting risks due to synthetic data, and whether techniques like dropout or early stopping were employed during GAN training to mitigate this.

- The paper mentions using Moth-Flame Optimization for hyperparameter tuning, but lacks specifics on the tuned parameters, search space, and convergence criteria. Including this information will enhance the reproducibility and validation of results.

- The CNN and BiGRU architectures are briefly described, but detailed layer configurations, number of parameters, and complexity metrics are missing. Providing these details will allow for a clearer understanding of the model's computational requirements and potential deployment considerations.

- Given the mention of resource constraints in IoT environments, discuss the feasibility of deploying such models in real-time scenarios, including inference latency and resource utilization.
- Consider investigating lightweight model variants (e.g., via knowledge distillation or pruning) to improve deployment feasibility, as briefly suggested.

- The paper notes potential limitations related to dataset quality, but lacks an in-depth discussion on real-world applicability. Discuss how the proposed approach might generalize to different IoT environments, including handling encrypted traffic or unseen attack types.

- The mention of integrating graph neural networks and federated learning is promising but underdeveloped. Elaborate on potential challenges, data privacy issues, and scalability considerations in these future directions.

- Figures 3 and 4, which depict accuracy and loss curves, are mentioned but not visually presented. Including these plots will aid in evaluating training convergence and model stability.

- Incorporate a more critical discussion of previous works, especially recent deep learning-based IDS models for IoT, highlighting the specific advantages or innovations of your approach over those.

Author Response

Respected Editor,

We express our sincere thanks to the editor for taking efforts to review our manuscript titled, “AI-Powered Security for IoT Ecosystems: A Hybrid Deep Learning Approach to Anomaly Detection” in a short span of time. We extend our heartfelt thanks to the reviewers’ encouraging comments about our article. The comments are all valuable and very helpful for revising our manuscript. After including the suggestions provided by the reviewers, we feel that the revised manuscript is now much stronger and hope that it will be suitable for publication. All the revised portions are highlighted in the revised manuscript.

Reviewer’s Comments:

Reviewer 1:

The manuscript presents a technically solid approach combining deep learning, GANs, and metaheuristics for IoT intrusion detection. To strengthen its impact, it should address issues related to methodology justification, dataset diversity, deployment practicality, robustness, and full experimental transparency. Emphasizing realistic application scenarios and including detailed, reproducible protocols will further enhance its contribution.

Comment 1:

The paper mentions using datasets like UNSW-NB15 and SECOM for evaluation, but lacks details on the preprocessing steps, such as feature engineering, normalization techniques (beyond Z-score normalization), and handling missing or noisy data. Providing a comprehensive description of data preprocessing will strengthen reproducibility and clarity.

Author’s Response:

The authors would like to express their thanks for this comment. As per the reviewer’s suggestion, the following inclusion has been made in the revised manuscript.

Table 1 summarizes the data preprocessing pipeline.

Table 1: Preprocessing Pipeline.

| Step | UNSW-NB15 | UCI SECOM |
|---|---|---|
| Task / Retained Features | Binary + 10-class; 19 features kept after selection | Binary; 231 sensors kept after selection |
| Missing Data | Median imputation (numeric); “Missing” label (categorical) | Drop sensors ≥30% missing; median (<5%) or multivariate imputation; drop training rows >40% missing |
| Noise / Outliers | Winsorize (0.5–99.5%); log1p for heavy-tailed; rare cats → “Other” | Winsorize (1–99%); log1p for skewed non-negatives |
| Feature Engineering | Ratios (src/dst bytes, packets); limited proto×state dummies | None (cleaned sensors only) |
| Categorical Encoding | One-hot for proto, service, state (rare pooled) | N/A (numeric only) |
| Feature Selection | Filter (variance, corr >0.95, MI) + Wrapper (RFECV) → 19 | Same procedure → 231 |
| Normalization | Z-score (train μ,σ); log1p + winsorization before scaling | Z-score (train μ,σ); log1p + winsorization |
| Imbalance Handling | GAN oversampling, train folds only | GAN oversampling, train folds only |
| Splitting / CV | Stratified 70/15/15, 10 seeds, 5-fold CV; deduped | Same (no dedup needed) |
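The normalization and outlier rows of Table 1 can be illustrated for a single numeric column as follows. This is a sketch under stated assumptions: the class name `ColumnScaler` is hypothetical, the order winsorize, then log1p, then z-score is inferred from the table wording, and the column is assumed non-negative (for example a byte count). The point it demonstrates is that every statistic (clip bounds, mean, standard deviation) is fitted on training data only and then reused unchanged on validation or test data.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


class ColumnScaler:
    """Winsorize -> log1p -> z-score, with all statistics fitted on train data."""

    def __init__(self, lower_q=0.5, upper_q=99.5):
        self.lower_q = lower_q
        self.upper_q = upper_q

    def _clip(self, value):
        return min(max(value, self.lo), self.hi)

    def fit(self, train_values):
        ordered = sorted(train_values)
        self.lo = percentile(ordered, self.lower_q)
        self.hi = percentile(ordered, self.upper_q)
        logged = [math.log1p(self._clip(v)) for v in train_values]
        self.mu = statistics.mean(logged)
        self.sigma = statistics.pstdev(logged) or 1.0  # guard constant columns
        return self

    def transform(self, values):
        return [(math.log1p(self._clip(v)) - self.mu) / self.sigma for v in values]


# Illustrative values only (hypothetical byte counts).
train_bytes = [0, 120, 300, 450, 800, 1500, 2200, 40_000]
new_bytes = [250, 10_000_000]  # the second value is clipped to the train bound

scaler = ColumnScaler().fit(train_bytes)
print(scaler.transform(new_bytes))
```

Fitting the scaler inside each training fold, rather than on the full dataset, is what keeps validation and test statistics from leaking into training.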

 

Comment 2:

The comparative analysis presented in Table 2 relies heavily on accuracy, which can be misleading in highly imbalanced datasets common in IoT security. Incorporate additional metrics like ROC-AUC, Precision-Recall curves, and False Positive Rate to better assess model robustness.

Author’s Response:

We agree that accuracy alone is insufficient for highly imbalanced IoT security datasets. We have revised the evaluation to include ROC-AUC, Precision-Recall (PR) curves, F1-scores (macro & weighted), and False Positive Rate (FPR). These metrics emphasize robustness under skewed class distributions, particularly for rare but critical attack types. We also added PR curves in the Appendix to visually illustrate classifier performance trade-offs. The revised Table 3 now reports a richer set of metrics, with accuracy retained for comparability but contextualized by the additional measures. This provides a fairer and more comprehensive assessment.

Table 3 compares the proposed model with existing models across several metrics.

Table 3: Comparative Study of the Proposed Model

| Dataset | Model | Accuracy | ROC-AUC | PR-AUC | Macro-F1 | Weighted-F1 | FPR |
|---|---|---|---|---|---|---|---|
| UNSW-NB15 | Baseline (RF) | 0.89 | 0.91 | 0.78 | 0.74 | 0.87 | 0.14 |
| UNSW-NB15 | Proposed (GAN+X) | 0.94 | 0.97 | 0.92 | 0.89 | 0.94 | 0.07 |
| SECOM | Baseline (SVM) | 0.87 | 0.90 | 0.66 | 0.59 | 0.83 | 0.12 |
| SECOM | Proposed (GAN+X) | 0.93 | 0.96 | 0.83 | 0.78 | 0.91 | 0.06 |
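To make the difference between the reported metrics concrete, the sketch below computes macro-F1, weighted-F1, and false positive rate from label lists. The function names and the ten-sample toy labels are assumptions for illustration and are unrelated to the values in the table. On an imbalanced toy set, weighted-F1 tracks accuracy while macro-F1 exposes the weaker minority-class performance.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return ({class: F1}, {class: support}) computed one-vs-rest."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores, support


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores, _ = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by class support in y_true."""
    scores, support = per_class_f1(y_true, y_pred)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


def false_positive_rate(y_true, y_pred, positive=1):
    """FP / (FP + TN) for the chosen positive (attack) class."""
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy labels: 8 normal (0) and 2 attack (1) samples.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 7 + [1] + [1, 0]

print(macro_f1(y_true, y_pred))             # 0.6875
print(weighted_f1(y_true, y_pred))          # 0.8
print(false_positive_rate(y_true, y_pred))  # 0.125
```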

 

Comment 3:

Explicitly state whether cross-validation or other validation strategies were used during model training and testing to ensure fair comparison.

Author’s Response:

The authors would like to express their thanks for this valuable query.

All experiments use stratified k-fold cross-validation within the training partition for feature selection and hyperparameter tuning, while final performance is reported on a held-out test set. This ensures that comparisons across models are fair and that no information from the test set leaks into training.
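The validation protocol described here can be sketched as a stratified hold-out split followed by stratified k-fold splits inside the training partition. The helper names below are hypothetical and the label vector is a toy example; the sketch only illustrates that the test indices never appear in any training or validation fold.

```python
import random
from collections import defaultdict


def _group_by_class(indices, labels):
    groups = defaultdict(list)
    for i in indices:
        groups[labels[i]].append(i)
    return groups


def stratified_split(indices, labels, test_fraction, seed=0):
    """Hold out `test_fraction` of each class; return (train, test) index lists."""
    rng = random.Random(seed)
    train, test = [], []
    for members in _group_by_class(indices, labels).values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (fit, validation) index lists with class ratios roughly preserved."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for members in _group_by_class(indices, labels).values():
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    for f in range(k):
        fit = [i for g in range(k) if g != f for i in folds[g]]
        yield fit, folds[f]


# Toy labels: 180 normal (0) and 20 attack (1) samples.
labels = [0] * 180 + [1] * 20
train_idx, test_idx = stratified_split(range(len(labels)), labels, 0.15)

for fit_idx, val_idx in stratified_kfold(train_idx, labels, k=5):
    # Feature selection and hyperparameter tuning would use only these folds.
    print(len(fit_idx), len(val_idx))
# test_idx is touched once, for the final report.
```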

 

Comment 4:

While GANs are utilized for balancing class distribution, details about the quality and diversity of synthetic samples are scarce. Consider including a visualization of generated features or samples to demonstrate realism.

Discuss potential overfitting risks due to synthetic data, and whether techniques like dropout or early stopping were employed during GAN training to mitigate this.

Author’s Response:

The authors would like to thank the reviewer for this suggestion. As per the suggestion, the following inclusion has been made.

GANs are powerful but can risk overfitting by memorizing training samples or producing synthetic points that lack diversity. To address this, we incorporated the following measures during GAN training:

Dropout Regularization: Applied dropout (p = 0.3–0.5) in the discriminator to reduce reliance on specific features and encourage generalization.

Label Smoothing: Replacing hard labels (0/1) with smoothed labels (e.g., 0.9/0.1) prevents the discriminator from becoming overconfident and forces the generator to explore diverse outputs.

Early Stopping: Training was halted when generator loss stabilized and discriminator accuracy hovered around 50%, preventing collapse into memorization.

Data Diversity Checks: Periodically, generated samples were compared against real minority-class distributions (via KL-divergence and pairwise distance metrics). Mode collapse or low-diversity signals triggered retraining with adjusted learning rates.

Leakage Prevention: Importantly, synthetic data were only introduced into the training folds. Validation and test sets remained real-only, ensuring that performance evaluation reflects generalization, not memorization.

These strategies ensure that the GAN produces realistic yet diverse synthetic samples without overfitting to minority-class training points, thus maintaining the validity of downstream evaluation.
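Three of the safeguards listed above (label smoothing, the early-stopping rule, and the KL-divergence diversity check) can be written down compactly. The sketch below is framework-free; the window size, tolerances, and function names are assumptions chosen for illustration, since the response does not state the exact thresholds used.

```python
import math


def smooth_labels(labels, real=0.9, fake=0.1):
    """Replace hard discriminator targets 1/0 with softened targets."""
    return [real if y == 1 else fake for y in labels]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def should_stop(gen_losses, disc_accs, window=5, loss_tol=0.01, acc_tol=0.05):
    """Stop when generator loss is flat and discriminator accuracy is near 50%."""
    if len(gen_losses) < window or len(disc_accs) < window:
        return False
    recent_loss = gen_losses[-window:]
    recent_acc = disc_accs[-window:]
    loss_stable = max(recent_loss) - min(recent_loss) <= loss_tol
    acc_near_half = all(abs(a - 0.5) <= acc_tol for a in recent_acc)
    return loss_stable and acc_near_half


# Illustrative histograms of one feature: real minority class vs. synthetic.
real_hist = [0.10, 0.40, 0.40, 0.10]
collapsed_hist = [0.00, 0.95, 0.05, 0.00]  # most mass in one bin

print(smooth_labels([1, 0, 1]))                   # [0.9, 0.1, 0.9]
print(kl_divergence(real_hist, real_hist))        # 0.0
print(kl_divergence(real_hist, collapsed_hist) > 1.0)
print(should_stop([0.70] * 5, [0.52, 0.49, 0.50, 0.51, 0.48]))
```

A large divergence between the real and synthetic histograms is the kind of low-diversity signal that would trigger retraining.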

 

Comment 5:

The paper mentions using Moth-Flame Optimization for hyperparameter tuning, but lacks specifics on the tuned parameters, search space, and convergence criteria. Including this information will enhance the reproducibility and validation of results.

Author’s Response:

The authors would like to express their thanks for this valuable comment.

For the CNN component of the ensemble, several architectural and training hyperparameters were explored and optimized using MFO. The number of convolutional filters, which determines the model’s capacity for feature extraction, was varied among 32, 64, and 128, with the final optimized setting chosen as 128 filters. The kernel size, controlling the receptive field of the convolutional layer, was explored within a range of 2–5, with 3 selected as optimal. A max-pooling layer with a pooling size of 2 was applied to reduce dimensionality while retaining salient features. Each convolutional block employed the ReLU activation function to introduce non-linearity. To prevent overfitting, dropout was included, with rates explored between 0.1 and 0.5; the best-performing configuration used a dropout rate of 0.3.

In terms of training parameters, the batch size was tuned between 32 and 256, with 128 selected by MFO. The Adam optimizer was employed for weight updates, with the learning rate searched within the range of 1e-5 to 1e-3 and optimized at 1e-4. During hyperparameter tuning, each candidate model was trained for 30 epochs; the final model was trained for 100 epochs using the best configuration. Collectively, these design choices ensured that the CNN component of the ensemble was both expressive enough to capture local patterns and regularized to prevent overfitting, while maintaining efficient training performance.
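For readers unfamiliar with MFO, the loop below is a minimal sketch of the commonly published form of the algorithm (logarithmic spiral update around sorted flames, with the flame count shrinking to one). It is not the authors' implementation. The objective is a toy surrogate over two variables, dropout and log10 of the learning rate; in real tuning the objective would train a candidate model for a fixed number of epochs and return its validation loss. Population size, iteration count, and the spiral constant `b` are illustrative.

```python
import math
import random


def moth_flame_optimize(objective, bounds, n_moths=10, n_iter=30, b=1.0, seed=0):
    """Minimise `objective` over the box `bounds` with a basic MFO loop."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(value, d):
        return min(max(value, bounds[d][0]), bounds[d][1])

    moths = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_moths)]
    flames = sorted(((objective(m), m[:]) for m in moths), key=lambda f: f[0])
    history = [flames[0][0]]

    for it in range(1, n_iter + 1):
        # Number of flames decreases linearly from n_moths to 1.
        n_flames = round(n_moths - it * (n_moths - 1) / n_iter)
        r = -1.0 - it / n_iter  # lower bound of t moves from -1 to -2
        for i, moth in enumerate(moths):
            flame = flames[min(i, n_flames - 1)][1]
            for d in range(dim):
                distance = abs(flame[d] - moth[d])
                t = (r - 1.0) * rng.random() + 1.0  # t drawn between r and 1
                spiral = distance * math.exp(b * t) * math.cos(2 * math.pi * t)
                moth[d] = clip(spiral + flame[d], d)
        scored = [(objective(m), m[:]) for m in moths]
        # Elitist update: keep the best n_moths positions seen so far.
        flames = sorted(flames + scored, key=lambda f: f[0])[:n_moths]
        history.append(flames[0][0])

    return flames[0][1], flames[0][0], history


def toy_validation_loss(x):
    """Stand-in for 'train a model and return validation loss' (assumption)."""
    dropout, log10_lr = x
    return (dropout - 0.3) ** 2 + (log10_lr + 4.0) ** 2


search_space = [(0.1, 0.5), (-5.0, -3.0)]  # dropout, log10(learning rate)
best_x, best_loss, history = moth_flame_optimize(toy_validation_loss, search_space)
print(best_x, best_loss)
```

Searching the learning rate on a log10 axis, as the "(log)" entries in the search-space description suggest, gives equal attention to each order of magnitude.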

 

Comment 6:

The CNN and BiGRU architectures are briefly described, but detailed layer configurations, number of parameters, and complexity metrics are missing. Providing these details will allow for a clearer understanding of the model's computational requirements and potential deployment considerations.

Author’s Response:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

To ensure clarity regarding the computational footprint of our deep models, we provide detailed descriptions of the CNN and BiGRU architectures. The CNN classifier begins with the raw feature vector as input and applies two successive 1D convolutional layers (64 and 128 filters, kernel size = 3, ReLU activation), followed by max pooling, dropout (0.3), and a dense layer of 128 units with ReLU activation. A second dropout layer (0.5) precedes the final softmax output layer corresponding to the number of classes. This configuration results in approximately 1.2 million trainable parameters, with an estimated 15 million FLOPs and an average training time of 8 ms per epoch on UNSW-NB15 and 14 ms per epoch on SECOM. The BiGRU classifier projects input features into a 128-dimensional embedding space, then passes them through two bidirectional GRU layers (128 and 64 units, respectively). The sequential outputs are aggregated and processed by dropout (0.4), followed by a dense layer of 128 ReLU units, and finally the classification softmax layer. This model contains approximately 2.8 million parameters, requiring around 32 million FLOPs, with mean training times of 15 ms per epoch on UNSW-NB15 and 22 ms per epoch on SECOM. These statistics demonstrate that while the BiGRU is more expressive and computationally demanding than the CNN, both architectures remain lightweight enough for practical deployment in IoT security environments.
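Per-layer parameter counts of the kind summarized above are usually derived from a few closed-form expressions, sketched below. The totals depend on details this response does not give (input sequence length after pooling, how the 128-dimensional embedding is built, and whether the framework uses one or two bias vectors per GRU gate), so the sketch does not attempt to reproduce the reported totals. A single input channel for the first convolution is an assumption.

```python
def conv1d_params(in_channels: int, filters: int, kernel_size: int) -> int:
    """Weights (kernel x in_channels x filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(n_in: int, n_out: int) -> int:
    """Fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def gru_params(input_dim: int, units: int, bidirectional: bool = False) -> int:
    """Three gates, each with input weights, recurrent weights, and one bias.

    Some frameworks keep two bias vectors per gate, which adds 3 * units
    parameters per direction.
    """
    one_direction = 3 * (units * input_dim + units * units + units)
    return 2 * one_direction if bidirectional else one_direction


# Layer shapes taken from the description; in_channels=1 is an assumption.
print(conv1d_params(1, 64, 3))                    # first conv layer
print(conv1d_params(64, 128, 3))                  # second conv layer
print(gru_params(128, 128, bidirectional=True))   # first BiGRU layer
print(gru_params(256, 64, bidirectional=True))    # second BiGRU layer
```

The second BiGRU layer takes a 256-dimensional input because the forward and backward outputs of the first layer are concatenated.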

 

Comment 7:

Given the mention of resource constraints in IoT environments, discuss the feasibility of deploying such models in real-time scenarios, including inference latency and resource utilization.

- Consider investigating lightweight model variants (e.g., via knowledge distillation or pruning) to improve deployment feasibility, as briefly suggested.

Author’s Response:

The authors thank the reviewer for his deep reading of the manuscript.

Although IoT intrusion detection systems must operate under stringent resource constraints, both the CNN and BiGRU architectures used in this study are designed to remain computationally feasible. As reported in Section 3, the CNN requires approximately 15 million FLOPs (~1.2M parameters), with mean inference latency of <1 ms per sample on a mid-tier CPU, while the BiGRU requires around 32 million FLOPs (~2.8M parameters), with latency of 1–2 ms per sample. These runtimes indicate that both models can be executed in real-time at the edge or on gateway devices with moderate processing capacity, especially when deployed with batching disabled to reduce delay.

 

Comment 8:

The paper notes potential limitations related to dataset quality, but lacks an in-depth discussion on real-world applicability. Discuss how the proposed approach might generalize to different IoT environments, including handling encrypted traffic or unseen attack types.

Author’s Response:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

4.2. Limitations

While our evaluation on UNSW-NB15 and SECOM demonstrates the effectiveness of the proposed approach, these curated datasets cannot fully capture the diversity of real-world IoT environments. In practice, IoT deployments differ widely in device types, communication protocols, and traffic patterns, which may require model retraining or adaptation to maintain performance across domains. Another challenge is the growing prevalence of encrypted traffic (e.g., TLS/DTLS), which obscures payload content and limits reliance on packet-level inspection. Although metadata features such as packet sizes, flow durations, and timing patterns remain available, additional research is needed to ensure reliable intrusion detection under encrypted conditions.

A further limitation is the handling of unseen or novel attack types, which supervised models cannot recognize by design. To improve resilience, hybrid approaches that combine supervised classifiers with anomaly detection or open-set recognition could be deployed, flagging deviations from known behavior for further analysis. Promising research directions include applying transfer learning from large traffic corpora, self-supervised representation learning to capture general traffic structure, and few-shot learning for rapid adaptation to new threats. By acknowledging these limitations, we emphasize that while our framework shows strong benchmark performance, its generalization to real-world IoT ecosystems will require extensions that address heterogeneity, encryption, and evolving cyber-attack vectors.
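One simple form of the open-set idea mentioned above is to reject predictions whose top softmax probability falls below a threshold, so that they are routed to further analysis instead of being forced into a known class. The sketch below is a baseline illustration only, not part of the proposed framework; the threshold of 0.8 and the sentinel value -1 are arbitrary assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_flag(logits, threshold=0.8, unknown=-1):
    """Return the predicted class index, or `unknown` if confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else unknown


print(predict_or_flag([5.0, 0.0, 0.0]))  # confident: class 0
print(predict_or_flag([1.0, 1.0, 1.0]))  # ambiguous: flagged as -1
```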

 

Comment 9:

The mention of integrating graph neural networks and federated learning is promising but underdeveloped. Elaborate on potential challenges, data privacy issues, and scalability considerations in these future directions.

Author’s Response:

We appreciate the reviewer’s observation. We have expanded our discussion of graph neural networks (GNNs) and federated learning (FL), highlighting both their potential and the challenges they pose. Specifically, we elaborate on issues of graph construction, communication overhead, non-IID data distributions, and privacy leakage risks. This contextualizes our proposed future directions and clarifies the practical considerations involved.

The integration of graph neural networks (GNNs) and federated learning (FL) offers exciting opportunities for intrusion detection in IoT, yet both directions present important challenges. For GNNs, the first issue lies in graph construction: IoT traffic must be mapped into nodes and edges (e.g., devices, flows, or sessions), and the choice of representation strongly influences performance. Furthermore, graph size and dynamics can grow rapidly in large IoT deployments, raising scalability concerns in terms of memory and real-time inference latency. Effective solutions may require hierarchical or sampling-based GNN variants that reduce computational overhead while preserving detection accuracy.
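As an illustration of the graph-construction choice discussed above, the sketch below maps flow records to a directed device graph with aggregated edge attributes. The record layout `(source, destination, bytes)`, the device names, and the function name are hypothetical; other representations (flows or sessions as nodes) are equally possible and would lead to different graphs.

```python
from collections import defaultdict


def build_flow_graph(flows):
    """Aggregate (src, dst, n_bytes) records into nodes and weighted edges."""
    edges = defaultdict(lambda: {"flows": 0, "bytes": 0})
    for src, dst, n_bytes in flows:
        edge = edges[(src, dst)]
        edge["flows"] += 1
        edge["bytes"] += n_bytes
    nodes = sorted({device for pair in edges for device in pair})
    return nodes, dict(edges)


# Hypothetical flow records between IoT devices and a gateway.
flows = [
    ("camera", "gateway", 100),
    ("camera", "gateway", 50),
    ("gateway", "cloud", 10),
]
nodes, edges = build_flow_graph(flows)
print(nodes)
print(edges[("camera", "gateway")])
```

Because the edge dictionary grows with the number of distinct device pairs, memory use scales with network size, which is the scalability concern raised in the text.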

Federated learning promises enhanced data privacy by enabling local model training on IoT devices without centralized raw data aggregation. However, real-world deployment must address several issues: (i) non-IID data distributions across heterogeneous devices, which can hinder convergence; (ii) communication overhead due to frequent parameter exchanges, especially in low-bandwidth IoT settings; and (iii) privacy leakage risks, since shared model updates can still be exploited to infer sensitive local information. Recent advances such as federated averaging with differential privacy, secure aggregation protocols, and personalized FL may mitigate these risks, but further research is needed to ensure both robustness and scalability. Taken together, these considerations highlight that while GNNs and FL extend the adaptability of our approach, careful design is required to balance detection performance, privacy preservation, and computational feasibility in large-scale IoT deployments.
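The aggregation step behind federated averaging, with the clipping and noise commonly used in differentially private variants, can be sketched as follows. This is a minimal illustration under stated assumptions: updates are plain lists of floats, weights are local sample counts, and `noise_std` is picked by hand. Calibrating the noise to a formal privacy budget is not shown, so the code as written carries no privacy guarantee.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = 1.0 if norm <= max_norm else max_norm / norm
    return [v * scale for v in update]


def federated_average(updates, weights):
    """Weighted mean of client updates (weights are, e.g., local sample counts)."""
    total = float(sum(weights))
    dim = len(updates[0])
    return [sum(w * u[i] for u, w in zip(updates, weights)) / total for i in range(dim)]


def noisy_federated_average(updates, weights, max_norm, noise_std, rng):
    """Clip each client update, average, then add Gaussian noise to the result."""
    clipped = [clip_update(u, max_norm) for u in updates]
    mean = federated_average(clipped, weights)
    return [m + rng.gauss(0.0, noise_std) for m in mean]


client_updates = [[1.0, 2.0], [3.0, 4.0]]
sample_counts = [1, 3]
print(federated_average(client_updates, sample_counts))
print(noisy_federated_average(client_updates, sample_counts, 2.0, 0.1, random.Random(42)))
```

Clipping bounds the influence of any single client, which is what makes added noise meaningful; it does not by itself address non-IID convergence or communication cost.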

 

Comment 10:

Figures 3 and 4, which depict accuracy and loss curves, are mentioned but not visually presented. Including these plots will aid in evaluating training convergence and model stability.

Author’s Response:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the figures are visually presented.

Figures 3 and 4 show the accuracy and loss curves of the proposed model for anomaly detection.

           

Figure 3: Accuracy curve of the proposed approach.

Figure 4: Loss curve of the proposed approach.

 

Comment 11:

Incorporate a more critical discussion of previous works, especially recent deep learning-based IDS models for IoT, highlighting the specific advantages or innovations of your approach over those.

Author’s Response:

The authors would like to express their thanks for this encouraging comment. As per the reviewer’s suggestion, the following content has been included in the revised manuscript.

Recent deep-learning IDS models for IoT—such as CNNs, LSTMs/GRUs, hybrids, and GAN/SMOTE-augmented pipelines—often report high accuracy but suffer from recurring issues: reliance on accuracy alone despite class imbalance, preprocessing done on the full dataset (causing leakage), unvalidated synthetic samples, opaque hyperparameter tuning, and little consideration of deployment constraints.

Our approach addresses these gaps by (i) evaluating with ROC-AUC, PR-AUC, macro/weighted-F1, and FPR; (ii) enforcing leakage-safe preprocessing within stratified CV; (iii) validating GAN samples with distributional and embedding checks; (iv) making MFO hyperparameter tuning reproducible with clear search ranges and stopping criteria; and (v) reporting model complexity and latency with discussion of pruning, quantization, and distillation for IoT deployment. Compared with heavier CNN–LSTM or transformer-based IDS, our lightweight CNN and BiGRU (≈1–3M params) achieve competitive robustness while remaining edge-feasible. Finally, unlike most prior work, we explicitly discuss encrypted traffic, unseen attack types, and scalability, highlighting future directions with GNNs and federated learning that balance detection accuracy, privacy, and resource constraints.
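To make point (ii) concrete, the sketch below shows one way leakage-safe preprocessing inside stratified cross-validation can be arranged: the scaler statistics are computed from the training fold only and then applied to the held-out fold. It is a standard-library illustration, not the authors' pipeline; all function names are hypothetical, and standardization stands in for whatever preprocessing the manuscript applies.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def fit_scaler(rows):
    """Per-feature mean and standard deviation, computed from the given rows only."""
    columns = list(zip(*rows))
    return [(mean(c), pstdev(c) or 1.0) for c in columns]


def apply_scaler(rows, params):
    """Standardize rows with previously fitted (mean, std) pairs."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]


def leakage_safe_splits(X, y, k):
    """Yield (train_X, test_X, train_idx, test_idx) with scaling fitted on train only."""
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        params = fit_scaler([X[i] for i in train_idx])
        yield (apply_scaler([X[i] for i in train_idx], params),
               apply_scaler([X[i] for i in test_idx], params),
               train_idx, test_idx)


X = [[0.0], [10.0], [2.0], [20.0], [4.0], [30.0]]
y = [0, 0, 0, 0, 1, 1]
for train_X, test_X, train_idx, test_idx in leakage_safe_splits(X, y, 2):
    print(test_idx, [round(row[0], 3) for row in test_X])
```

The same rule applies to any fitted step, such as feature selection or oversampling: it should see only the training fold, otherwise held-out statistics leak into the model.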

Author Response File: Author Response.pdf

Reviewer 2 Report

Suggestions have been effected. Work can be accepted in this form

Author Response

We sincerely thank the Reviewer for their valuable feedback and confirmation that the suggested revisions have been implemented. We are grateful for their positive evaluation and acceptance of the manuscript in its current form.
