Peer-Review Record

Research on Intrusion Detection Method Based on Transformer and CNN-BiLSTM in Internet of Things

Sensors 2025, 25(9), 2725; https://doi.org/10.3390/s25092725

by Chunhui Zhang, Jian Li^*, Naile Wang and Dejun Zhang

Reviewer 1: Anonymous

Reviewer 2: Youngchul Bae

Reviewer 3: Alejandra Guadalupe Silva Trujillo

Reviewer 4: Łukasz Więcław

Submission received: 24 March 2025 / Revised: 23 April 2025 / Accepted: 23 April 2025 / Published: 25 April 2025

(This article belongs to the Section Internet of Things)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a hybrid model based on CNN-BiLSTM-Transformer, which better handles complex features and long-sequence dependencies in intrusion detection. For the issue of data class imbalance, the Borderline-SMOTE method is introduced to enhance the model’s ability to recognize minority-class attack samples. To tackle the problem of redundant features in the original dataset, a comprehensive feature selection strategy combining XGBoost, Chi-square (Chi2), and Mutual Information is adopted to ensure the model focuses on the most discriminative features.

1. The motivations of the manuscript should be further highlighted, e.g., what problems existed in previous works?
2. In the introduction, the authors should clearly indicate the contributions and innovations of this paper. It is necessary to add a research background introduction and a detailed explanation of the research motivation, so as to attract more potential audiences.
3. The literature survey is insufficient. You must review all significant similar works that have been done. Also, review some of the good recent works that have been done in this area and are more similar to your paper. For each work, first explain the problem that has been addressed in that work. Then explain the approach used to deal with that problem. After that, compare that work with your work and conclude the difference and the benefit of your work with that. The article can be further enhanced by connecting and comparing the ongoing work with some existing literature. You must add and review all significant similar works that have been done. For example, https://doi.org/10.1109/JIOT.2025.3545741; https://doi.org/10.1109/TAFFC.2025.3547753 and so on.
4. More statistical methods are recommended to analyze the experimental results, such as precision, recall, and F1-score to provide a comprehensive performance evaluation of the proposed method.
5. In Table 2 (Configuration of experimental environment and model parameters), how were the values of the parameters in the methods used determined?
6. The conclusion and motivation of the work should be added in a clearer way. Could you tell me the limitations of the proposed method? How will you solve them? Please add this part to the manuscript.

Comments on the Quality of English Language

There are some grammatical errors seen in the paper. Check carefully for a few clerical errors and formatting issues.

Author Response

Comments 1:

The motivations of the manuscript should be further highlighted, e.g., what problems existed in previous works?

Response 1:

Thank you very much for your insightful comment. We fully agree with your suggestion that the motivations of the manuscript should be more clearly emphasized. In the revised version, we have enhanced the introduction by explicitly summarizing the main challenges faced by existing intrusion detection methods and pointing out the specific problems in previous works. Specifically, we analyzed and summarized the key limitations of existing works, including: (1) the difficulty for single-model architectures to capture diverse features of IoT network traffic, (2) inadequate handling of data imbalance issues, which causes models to bias toward majority classes, and (3) the lack of effective feature selection strategies, resulting in redundancy in input features. These points are now clearly presented to strengthen the motivation of our research. These revisions can be found on page 3, lines 112 to 118.

Comments 2:

In the introduction, the authors should clearly indicate the contributions and innovations of this paper. It is necessary to add a research background introduction and a detailed explanation of the research motivation, so as to attract more potential audiences.

Response 2:

Thank you for your valuable comment. We fully agree with your suggestion. Therefore, we have revised the Introduction section to provide a clearer explanation of the research motivation and explicitly state the contributions and innovations of our work. Specifically, we added the sentence “Consequently, the development of an intrusion detection technology that can maintain a high detection rate while significantly reducing the false alarm rate has emerged as the core challenge in current research on ensuring the security of the Internet of Things.” at the end of Page 2, Paragraph 1, to highlight the core research challenge addressed by our study.

In addition, we added a new paragraph beginning with “To deal with these challenges, the main contributions and innovations of this study are as follows:” on Page 3, Paragraph 3, which systematically outlines the major contributions of our proposed approach. These include the design of a hybrid model combining CNN, BiLSTM, and Transformer modules, the use of Borderline-SMOTE to address data imbalance, and an effective feature selection strategy that improves detection performance. These revisions aim to emphasize the significance, novelty, and practical value of our research.
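As background on the oversampling step named above, the following is a minimal pure-Python sketch in the spirit of Borderline-SMOTE. It is a simplified illustration, not the implementation used in the manuscript: the function names, the neighbour count `k`, the fixed seed, and the choice to draw a fixed total of `n_new` synthetic samples are all assumptions. The idea is to find minority samples whose neighbourhood is mostly, but not entirely, majority-class ("danger" samples) and to interpolate new minority points only around those.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _nearest(point, candidates, k):
    """Indices of the k candidates closest to `point`, skipping `point` itself."""
    others = [i for i, c in enumerate(candidates) if c is not point]
    others.sort(key=lambda i: _sq_dist(point, candidates[i]))
    return others[:k]


def borderline_smote(minority, majority, n_new, k=5, seed=0):
    """Return `n_new` synthetic minority samples placed near the class boundary.

    `minority` and `majority` are lists of feature vectors (lists of floats).
    This is a simplified, hypothetical variant written for illustration.
    """
    if len(minority) < 2:
        return []
    rng = random.Random(seed)
    points = minority + majority          # indices >= len(minority) are majority
    n_min = len(minority)

    danger = []
    for p in minority:
        neighbours = _nearest(p, points, k)
        n_major = sum(1 for i in neighbours if i >= n_min)
        # "Danger" samples: at least half, but not all, neighbours are majority.
        if k / 2 <= n_major < k:
            danger.append(p)
    if not danger:
        return []

    synthetic = []
    for _ in range(n_new):
        p = rng.choice(danger)
        # Interpolate between p and one of its nearest minority neighbours.
        q = minority[rng.choice(_nearest(p, minority, k))]
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


# Made-up 2-D data: two minority points sit next to majority points.
minority = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0]]
majority = [[1.1, 1.0], [1.0, 1.1], [5.0, 5.0], [5.0, 6.0]]
new_points = borderline_smote(minority, majority, n_new=4, k=3)
```

Because each synthetic point is an interpolation between two existing minority samples, it always lies on a segment between them; minority samples far from the boundary are never used as seeds.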

Comments 3:

The literature survey is insufficient. You must review all significant similar works that have been done. Also, review some of the good recent works that have been done in this area and are more similar to your paper. For each work, first explain the problem that has been addressed in that work. Then explain the approach used to deal with that problem. After that, compare that work with your work and conclude the difference and the benefit of your work with that. The article can be further enhanced by connecting and comparing the ongoing work with some existing literature. You must add and review all significant similar works that have been done. For example, https://doi.org/10.1109/JIOT.2025.3545741; https://doi.org/10.1109/TAFFC.2025.3547753 and so on.

Response 3:

Thank you for your detailed and constructive feedback. We fully agree with your comment regarding the insufficiency of the original literature review. Therefore, we have significantly expanded the Related Work section in the revised manuscript. Specifically, we have added a comprehensive review of more than ten recent and relevant studies.

For each cited work, we have followed your recommended structure: (1) identified the core problem addressed in the study, (2) described the proposed solution, (3) outlined its limitations, and (4) provided a clear comparison with our proposed approach.

These revisions help to better contextualize our research, highlight its novelty, and demonstrate the advantages of our proposed model in areas such as feature extraction, handling of data imbalance, and global dependency modeling. The detailed additions and comparisons can be found on Pages 2–3, lines 53 to 111 of the revised manuscript.

Comments 4:

More statistical methods are recommended to analyze the experimental results, such as precision, recall, and F1-score to provide a comprehensive performance evaluation of the proposed method.

Response 4:

Thank you very much for your valuable comment. We fully agree with this comment. Therefore, to provide a more comprehensive and rigorous evaluation of the proposed method, we have added multiple statistical metrics—accuracy, precision, recall, and F1-score—to all relevant parts of the experimental analysis.

Specifically, these metrics have been included in the performance evaluation of both the CIC-IDS2017 and BoT-IoT datasets. In addition, we have consistently incorporated these metrics into the comparative experiments with baseline models as well as the ablation studies with partial variants of our proposed model.
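For reference, the four metrics named in this response follow directly from the binary confusion-matrix counts. The sketch below is illustrative only: the function name `binary_metrics` and the example counts are assumptions, not values from the manuscript, and "attack" is treated as the positive class.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/tn/fn are counts with "attack" as the positive class.
    Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for illustration only.
print(binary_metrics(tp=90, fp=10, tn=880, fn=20))
```

On an imbalanced dataset, accuracy alone can look high even when many attacks are missed, which is why precision, recall, and F1 are reported alongside it.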

Comments 5:

In Table 2 (Configuration of experimental environment and model parameters), how were the values of the parameters in the methods used determined?

Response 5:

Thank you for your valuable comment. We fully agree with this point. Therefore, we have revised Table 2 by adding a new column titled “Selection Justification”, which provides clear explanations for how each parameter was determined.

Specifically, we elaborated on the selection process of each setting, such as: The learning rate was chosen through grid search over a range of values (0.1, 0.01, 0.001, 0.0001); The batch size was determined by testing multiple values (32, 64, 128) to balance memory efficiency and training stability; The number of epochs was controlled using early stopping with a patience value of 5; The optimizer and loss function follow common practice in binary classification using deep learning.
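The selection procedure described here can be sketched as a grid search with early stopping bounding the number of epochs. This is a minimal sketch under stated assumptions: `train_one_epoch` is a hypothetical callback that trains for one epoch and returns a validation loss, the epoch cap of 100 is a placeholder not given in the response, and searching learning rate and batch size jointly is one possible reading of the text rather than the authors' confirmed procedure.

```python
import itertools

LEARNING_RATES = [0.1, 0.01, 0.001, 0.0001]  # values listed in the response
BATCH_SIZES = [32, 64, 128]                  # values listed in the response
PATIENCE = 5                                 # early-stopping patience from the response
MAX_EPOCHS = 100                             # assumption: upper bound is not stated


def fit_with_early_stopping(train_one_epoch, lr, batch_size,
                            patience=PATIENCE, max_epochs=MAX_EPOCHS):
    """Train until validation loss fails to improve for `patience` epochs.

    `train_one_epoch(lr, batch_size, epoch)` is a hypothetical callback.
    Returns (best validation loss, number of epochs actually run).
    """
    best_loss = float("inf")
    stale_epochs = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        val_loss = train_one_epoch(lr, batch_size, epoch)
        epochs_run = epoch + 1
        if val_loss < best_loss:
            best_loss = val_loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return best_loss, epochs_run


def grid_search(train_one_epoch):
    """Return (best_loss, lr, batch_size) for the best grid configuration."""
    results = []
    for lr, batch_size in itertools.product(LEARNING_RATES, BATCH_SIZES):
        best_loss, _ = fit_with_early_stopping(train_one_epoch, lr, batch_size)
        results.append((best_loss, lr, batch_size))
    return min(results)
```

In practice `train_one_epoch` would wrap the real training loop; early stopping then determines the epoch count per configuration instead of a fixed number.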

This revised version of Table 2 can be found on Page 10, in the section titled “Table 2. Configuration of experimental environment and model parameters”.

Comments 6:

The conclusion and motivation of the work should be added in a clearer way. Could you tell me the limitations of the proposed method? How will you solve them? Please add this part to the manuscript.

Response 6:

Thank you for your valuable comment. We agree that a clear articulation of the motivation, limitations, and future directions significantly enhances the completeness of the manuscript.

We have, accordingly, revised the Conclusion section by adding a new paragraph that explicitly outlines the limitations of the proposed method and suggests directions for future research. Specifically, we addressed:

(1) The computational complexity of the deep neural network architecture, which may hinder deployment on resource-constrained IoT edge devices;

(2) The reliance on offline datasets, which limits evaluation of the model’s real-time detection performance.

To address these issues, we proposed the following future work:

(1) Exploring model compression, parameter sharing, and knowledge distillation to reduce computational overhead;

(2) Developing an end-to-end real-time intrusion detection system for evaluation in practical IoT environments;

(3) Extending the framework to domain-specific applications such as Industrial IoT (IIoT) and Vehicle-to-Everything (V2X) networks.

These revisions can be found in Section 4 (Conclusion), Page 18-19, lines 556 to 583.

Comments on the Quality of English Language:

There are some grammatical errors seen in the paper. Check carefully for a few clerical errors and formatting issues.

Response:

Thank you for your valuable comment. We agree with this suggestion. Therefore, we have thoroughly revised the manuscript to correct grammatical errors, clerical mistakes, and formatting issues to improve the overall quality of English throughout the paper.

Reviewer 2 Report

Comments and Suggestions for Authors

Lack of Logical Explanation for Performance Gap

The proposed CNN-BiLSTM-Transformer model significantly outperforms all baseline models, achieving 99.66% accuracy, whereas Decision Tree, Random Forest, CNN, and LSTM models all show results below 95%. However, the manuscript does not provide a logical explanation for this substantial ~5% performance gap. It is important to explain why the proposed model is able to achieve such high performance, especially in contrast to well-established models like Random Forest and CNN, which typically perform well in classification tasks. A detailed analysis highlighting the strengths of the proposed model and its effectiveness in addressing the challenges (e.g., temporal dependency, feature extraction) is required.

Early High Accuracy Suggests Easy-to-Learn Data

According to Figure 5, the model reaches approximately 96.5% accuracy at the very beginning of training. This strongly suggests that the dataset is relatively easy for the model to learn, raising concerns about overfitting or data leakage. If only the proposed model shows this behavior, the same training curve should be provided for the baseline models to objectively demonstrate the superiority of the proposed approach. Additionally, the initial values of the trainable parameters in the deep learning models should be clearly stated, as they can significantly influence training behavior.

Insufficient Analysis of Classification Errors

While a confusion matrix is provided to show classification results, the paper does not offer any analysis of the errors. It is critical to understand why these errors occurred, for example, whether certain classes are systematically misclassified due to class imbalance or feature similarity. A logical interpretation of the confusion matrix would strengthen the model evaluation and provide useful insights for future improvements.

Overly General Future Work

The "Future Work" section in the conclusion is too vague and lacks specificity. Phrases like "optimize the architecture" or "explore other cybersecurity domains" are overly broad. It would be more valuable to outline concrete directions, such as deploying the model on real-time IoT systems, exploring lightweight architectures for edge computing, or applying the approach to specific threat categories like DDoS or malware detection.

Comments on the Quality of English Language

While the technical content of the manuscript is meaningful, the overall quality of English writing needs improvement for clarity and readability.

Author Response

Comments 1:

Lack of Logical Explanation for Performance Gap

The proposed CNN-BiLSTM-Transformer model significantly outperforms all baseline models, achieving 99.66% accuracy, whereas Decision Tree, Random Forest, CNN, and LSTM models all show results below 95%. However, the manuscript does not provide a logical explanation for this substantial ~5% performance gap. It is important to explain why the proposed model is able to achieve such high performance, especially in contrast to well-established models like Random Forest and CNN, which typically perform well in classification tasks. A detailed analysis highlighting the strengths of the proposed model and its effectiveness in addressing the challenges (e.g., temporal dependency, feature extraction) is required.

Response 1:

Thank you very much for your insightful comment. We fully agree with your suggestion. Accordingly, we have added a detailed analysis to explain the performance gap between our proposed CNN-BiLSTM-Transformer model and the baseline models.

Specifically, we clarified how the integration of spatial feature extraction (via CNN), temporal dependency modeling (via BiLSTM), and global context capturing (via Transformer) significantly enhances the model’s capability to detect complex and diverse attack patterns.

In contrast, traditional machine learning models such as Decision Tree and Random Forest lack the ability to learn temporal and hierarchical relationships from network traffic data. CNN-only models focus on local spatial features but neglect sequential information, while LSTM-only models are limited in capturing long-range dependencies and global interactions.

Furthermore, we emphasized the contributions of our data preprocessing techniques, including feature selection, oversampling, and noise reduction, which collectively improve data quality and model robustness. This explanation has been added to the revised manuscript on Page 14, lines 438 to 453.

Comments 2:

Early High Accuracy Suggests Easy-to-Learn Data

According to Figure 5, the model reaches approximately 96.5% accuracy at the very beginning of training. This strongly suggests that the dataset is relatively easy for the model to learn, raising concerns about overfitting or data leakage. If only the proposed model shows this behavior, the same training curve should be provided for the baseline models to objectively demonstrate the superiority of the proposed approach. Additionally, the initial values of the trainable parameters in the deep learning models should be clearly stated, as they can significantly influence training behavior.

Response 2:

Thank you very much for your valuable comment. We fully agree with your concern and have made several revisions to address this issue.

To investigate the rapid performance improvement of the proposed model and eliminate concerns about potential overfitting or data leakage, we have added a new figure (Figure 6) showing the training and testing accuracy curves of two baseline models (CNN and LSTM). The results demonstrate that these baseline models experience a noticeably slower increase in accuracy during the initial training phase and ultimately converge to lower accuracies (92.5% for CNN and 91.9% for LSTM) compared to our proposed model. This confirms that the early high accuracy observed is not due to data issues but rather the model's superior feature extraction and learning capabilities.

Furthermore, we have clarified the initialization strategies used for the trainable parameters in all deep learning models. Specifically, we applied the Kaiming Normal initialization for the convolutional and fully connected layers, and the default Xavier initialization provided by PyTorch for the LSTM and Transformer modules. These initializations contributed to faster convergence and greater training stability across all models. These revisions have been incorporated into the revised manuscript on Page 12, lines 379 to 402.
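For readers unfamiliar with the two schemes named above, the sketch below shows their standard scale formulas: Kaiming (He) normal draws weights with standard deviation sqrt(2 / fan_in) for ReLU layers, and Xavier (Glorot) uniform draws from [-a, a] with a = sqrt(6 / (fan_in + fan_out)). The layer sizes and function names are hypothetical, and this pure-Python version only illustrates the formulas; it is not the authors' PyTorch code.

```python
import math
import random


def kaiming_normal_std(fan_in):
    """Standard deviation of Kaiming (He) normal init for ReLU layers."""
    return math.sqrt(2.0 / fan_in)


def xavier_uniform_bound(fan_in, fan_out):
    """Half-width `a` of the Xavier (Glorot) uniform range [-a, a]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def init_matrix(fan_in, fan_out, scheme, rng):
    """Return a fan_out x fan_in weight matrix drawn with the given scheme."""
    if scheme == "kaiming_normal":
        std = kaiming_normal_std(fan_in)
        draw = lambda: rng.gauss(0.0, std)
    elif scheme == "xavier_uniform":
        bound = xavier_uniform_bound(fan_in, fan_out)
        draw = lambda: rng.uniform(-bound, bound)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [[draw() for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical layer shapes, chosen only for the example.
w_fc = init_matrix(fan_in=128, fan_out=64, scheme="kaiming_normal", rng=rng)
w_rec = init_matrix(fan_in=128, fan_out=128, scheme="xavier_uniform", rng=rng)
```

Both schemes scale the initial weights to the layer width so that activation variance stays roughly constant from layer to layer at the start of training.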

Comments 3:

Insufficient Analysis of Classification Errors

While a confusion matrix is provided to show classification results, the paper does not offer any analysis of the errors. It is critical to understand why these errors occurred—for example, whether certain classes are systematically misclassified due to class imbalance or feature similarity. A logical interpretation of the confusion matrix would strengthen the model evaluation and provide useful insights for future improvements.

Response 3:

Thank you for your constructive feedback. We fully agree with your comment. Therefore, we have added a detailed analysis of the confusion matrix to identify potential causes of false positives and false negatives.

Specifically, we explain that the relatively higher number of false positives may result from certain normal traffic patterns exhibiting strong feature similarity with attack traffic, particularly near class boundaries. In contrast, false negatives may stem from the diversity and stealthy nature of some attack behaviors, which can cause them to resemble benign traffic. Although the Borderline-SMOTE technique was employed to mitigate class imbalance, real-world traffic variability still poses challenges in clearly distinguishing some samples.

This additional analysis strengthens the evaluation of our model and provides practical insight for future improvements. The revisions have been incorporated into the revised manuscript on Page 10-11, lines 347 to 365.
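The two error types discussed in this response can be quantified from the confusion matrix as a false positive rate (benign traffic flagged as an attack) and a false negative rate (attacks that were missed). The sketch below uses made-up counts, and the function name `error_rates` is an assumption introduced for illustration.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate), attack = positive."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of benign flows flagged as attacks
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # share of attack flows that were missed
    return fpr, fnr


# Made-up counts for illustration only.
fpr, fnr = error_rates(tp=950, fp=30, tn=970, fn=50)
print(f"FPR={fpr:.3f}, FNR={fnr:.3f}")
```

Reporting the two rates separately makes it visible whether errors concentrate on the benign side or the attack side of the class boundary.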

Comments 4:

Overly General Future Work

The "Future Work" section in the conclusion is too vague and lacks specificity. Phrases like "optimize the architecture" or "explore other cybersecurity domains" are overly broad. It would be more valuable to outline concrete directions, such as deploying the model on real-time IoT systems, exploring lightweight architectures for edge computing, or applying the approach to specific threat categories like DDoS or malware detection.

Response 4:

Thank you for your helpful suggestion. We agree that the “Future Work” section should be more specific and actionable. Accordingly, we have revised this section to outline three concrete directions for future research:

(1) Investigating lightweight strategies such as model compression, parameter sharing, and knowledge distillation to maintain high detection accuracy while enhancing adaptability and resource efficiency for deployment on edge devices.

(2) Developing an end-to-end real-time intrusion detection system prototype to evaluate the model's performance in actual IoT environments and achieve sub-millisecond response times for attack detection.

(3) Extending this methodology to specialized domains like the Industrial Internet of Things (IIoT) and Vehicle-to-Everything (V2X) networks, tailoring it to meet the unique security requirements of these fields.

These revisions can be found in Section 4 (Conclusion), Page 18-19, lines 556 to 583.

Comments on the Quality of English Language:

While the technical content of the manuscript is meaningful, the overall quality of English writing needs improvement for clarity and readability.

Response:

Reviewer 3 Report

Comments and Suggestions for Authors

This study introduces a novel hybrid architecture leveraging the strengths of Convolutional Neural Networks (CNNs), Bidirectional Long Short-Term Memory networks (BiLSTMs), and Transformers.

The proposal presents an interesting approach; however, there are several inconsistencies that should be addressed to enhance the clarity and quality of the manuscript.

In the Introduction, the repeated use of the term 'Reference' when citing previous works becomes overly redundant. The authors are encouraged to rephrase these instances to improve the narrative flow.

Additionally, the main concept underpinning the study—BiLSTM—is not introduced until page 7. It would be advisable to define and contextualize this concept earlier in the manuscript.

In lines 364 and 365, the use of the term 'Literature' is ambiguous. The authors should verify whether these are actual references, and revise the wording accordingly to avoid confusion.

Finally, the Conclusions section could be strengthened by including more specific insights into potential directions for future work, particularly those that highlight the contributions of the current study.

Comments on the Quality of English Language

Several issues affect its clarity and overall presentation. Overall, the manuscript would benefit from a thorough language review to improve clarity and fluency.

Author Response

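Comments 1:

In the Introduction, the repeated use of the term 'Reference' when citing previous works becomes overly redundant. The authors are encouraged to rephrase these instances to improve the narrative flow.

Response 1: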

Thank you very much for your valuable suggestion. We fully agree with this comment. To improve the narrative fluency and avoid the repetitive use of the term “Reference,” we have revised the relevant citations in the Introduction. Specifically, instead of starting sentences with “Reference [X],” we now incorporate the citation numbers directly into the narrative. For example, we use expressions such as “Reference [6] tackled…”, “[7] developed…”, “[8] proposed…” and “[9] introduced…” to present the cited work in a more natural and readable manner. These changes enhance the coherence and flow of the related work discussion. The revised text can be found on Pages 2–3, lines 53 to 111.

Comments 2:

Additionally, the main concept underpinning the study—BiLSTM—is not introduced until page 7. It would be advisable to define and contextualize this concept earlier in the manuscript.

Response 2:

Thank you for pointing this out. We fully agree with this comment. Since BiLSTM is a core component of our proposed model, we have revised the manuscript to introduce and explain this concept earlier to enhance the reader’s understanding of the model’s methodological foundation. Specifically, we have added a detailed explanation of the BiLSTM structure and its advantages at the beginning of Section 2.1 (Overall Framework of the Intrusion Detection Model), on page 3-4, lines 137 to 162. In the revision, we describe how BiLSTM captures bidirectional temporal dependencies, why it is suitable for time-sequential data such as network traffic, and how it improves the detection of complex attacks compared to traditional unidirectional LSTM models. We also explain its role within our hybrid CNN-BiLSTM-Transformer architecture.

Comments 3:

In lines 364 and 365, the use of the term 'Literature' is ambiguous. The authors should verify whether these are actual references, and revise the wording accordingly to avoid confusion.

Response 3:

Thank you for your valuable comment. We agree with your observation that the use of the term "Literature" in this context may lead to ambiguity regarding whether the mentioned models are actual referenced studies. To address this, we have revised the wording to directly cite the specific references (e.g., [6], [7], [9], etc.) and describe the corresponding models more clearly. This change avoids potential confusion and clarifies that these are indeed previously published works used for performance comparison. The revised content can be found on page 15-16, lines 462–484 of the updated manuscript.

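Comments 4:

Finally, the Conclusions section could be strengthened by including more specific insights into potential directions for future work, particularly those that highlight the contributions of the current study.

Response 4: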

Thank you very much for your thoughtful suggestion. We agree with your comment that the Conclusions section should provide clearer and more specific insights into potential directions for future work, while also highlighting the key contributions of the current study. Accordingly, we have revised the Conclusions section to summarize our contributions more clearly and propose detailed future research directions based on the current limitations and practical deployment considerations. These directions include the exploration of lightweight model optimization techniques (e.g., model compression and knowledge distillation), real-time IDS system implementation and testing in real-world IoT scenarios, and the adaptation of our proposed method to domain-specific applications such as Industrial IoT (IIoT) and Vehicle-to-Everything (V2X) environments. The updated content can be found in the Conclusions section (Page 18-19, lines 556 to 583). We hope that these enhancements better communicate the practical value of our work and the future potential of our approach.

Comments on the Quality of English Language:

Several issues affect its clarity and overall presentation. Overall, the manuscript would benefit from a thorough language review to improve clarity and fluency.

Response:

Reviewer 4 Report

Comments and Suggestions for Authors

Let me start with the most critical aspect of the work, namely its title.

While the paper claims to present an IoT-specific intrusion detection solution, a closer examination reveals that this assertion may be overstated. Although widely used, the CIC-IDS2017 dataset was not originally tailored exclusively for IoT networks. Relying on this dataset alone does not definitively demonstrate IoT specificity, as the attack patterns and data characteristics could equally apply to traditional network settings.

To firmly establish IoT specificity, the model should ideally be tested on a dataset that is explicitly collected from IoT environments like CICIoT2023 or BoT-IoT.

The work itself is of a very good level, but in my opinion it should not include the phrase “in Internet of Things” in the title.

The study’s primary objective is to develop a hybrid deep learning model (combining CNN, BiLSTM, and Transformer) augmented with data balancing and feature selection techniques to accurately detect abnormal IoT traffic. This research question is clearly motivated by the increasing severity of IoT cyberattacks and the limitations of existing methods in capturing the diverse characteristics of IoT traffic. However, the research data used does not clearly indicate that the problem has been properly addressed. This would require the use of other, additional data.

However, it should be noted that this study adds a comprehensive, state-of-the-art solution to the existing set of IDS studies.

Also, the authors highlight known issues (heterogeneous IoT traffic, evolving attack vectors, and data imbalance) that make conventional IDS techniques less effective. Addressing these issues is timely and important, as evidenced by recent research efforts in IoT intrusion detection systems (IDS) using deep learning and advanced sampling techniques. The manuscript clearly situates itself in this context, citing recent IoT-focused IDS studies (2021–2024).

The methodology presented is generally solid and well thought-out. The authors follow a clear pipeline, but there are some areas for potential improvement or further clarification:

The paper should clarify how the dataset was divided into training and test sets to ensure a fair evaluation. Given that CIC-IDS2017 contains temporal traffic data, an ideal split would prevent any data leakage.
In this work IDS is treated as a binary classifier. Given the dataset contains numerous attack categories authors could evaluate how well the model can distinguish different attacks.
The paper does not discuss inference speed or model size. It would improve the work to address model efficiency: e.g., how fast can it process traffic and could it operate in real-time?

The conclusions drawn in the paper are well-supported by the experimental evidence. The proposed model achieves exceptionally high detection performance. The authors note that a recent hybrid CNN–BiLSTM with SMOTE approach (Wang et al., 2024) reached about 97.7% accuracy on the same dataset. In contrast, the new CNN–BiLSTM–Transformer model attains 99.80% accuracy. Thus, it pushes the performance boundary in the IDS subject area. The set of performance metrics used is comprehensive and appropriate for evaluating an intrusion detection system.

The article uses 45 references, which is a robust number for a journal paper, and overall they appear to be well-chosen and up-to-date. However, there are a couple of small concerns:

Ref [31] and Ref [41] are labeled as “RETRACTED ARTICLE”. The authors do mark them as such, which is transparent, but one wonders if those citations are necessary.
It looks like the reference list has a duplicate entry for the Ref [39] and Ref [43].

There are occasional minor grammatical quirks. For example the phrase “the proposed model reaches 99.69%, slightly lower than BaysCNN+PCA (97.69%)” is clearly a mistake in wording – 99.69 is higher, not lower, than 97.69.

In conclusion, this manuscript presents a significant and well-executed study. However, without the addition of an additional dataset, more specific in terms of IoT, a title change would have to be considered.

Author Response

Comments 1:

Response 1:

Thank you very much for your insightful and valuable feedback. We completely agree with your comment. In order to address this concern and more convincingly demonstrate the IoT-specific effectiveness of our proposed method, we have conducted additional experiments using the BoT-IoT dataset, which is widely recognized in the field of IoT security research. Accordingly, we have added a new subsection entitled “3.5 Analysis of Experimental Results on the BoT-IoT Dataset”, which presents a comprehensive performance evaluation of our CNN-BiLSTM-Transformer model on the BoT-IoT dataset. The new content includes comparative experiments with mainstream baseline models (e.g., Decision Tree, Random Forest, CNN, LSTM, BiLSTM, CNN-LSTM, CNN-BiLSTM) under the same preprocessing pipeline. The results clearly show that our model consistently outperforms all others in accuracy, precision, recall, and F1-score, even under the more challenging conditions of the BoT-IoT dataset. This new subsection can be found on page 17-18, lines 519-555 of the revised manuscript.

Comments 2:

The paper should clarify how the dataset was divided into training and test sets to ensure a fair evaluation. Given that CIC-IDS2017 contains temporal traffic data, an ideal split would prevent any data leakage.

Response 2:

Thank you very much for your valuable comment. We fully agree with your concern regarding the importance of avoiding data leakage, especially when using temporally ordered traffic datasets like CIC-IDS2017. To address this issue and ensure the fairness of our experimental evaluation, we have clarified the dataset splitting strategy in the revised manuscript. Specifically, we have added a detailed explanation in Section 2.1, stating that we adopted a chronological data split strategy, in which traffic from specific dates was entirely assigned to the training set, and traffic from other, non-overlapping dates was used as the test set. This ensures that there is no temporal overlap between the training and test data, thereby preventing any form of data leakage and improving the credibility of the experimental results. This revision can be found on page 3-4, Section 2.1, lines 137–162.
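The chronological split described in this response can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(capture_date, features, label)`, the function name, the example dates, and the assignment of days to each set are all placeholders, since the response does not state which dates went to which set.

```python
from datetime import date


def chronological_split(records, train_days, test_days):
    """Split flow records by capture date so no day appears in both sets.

    `records` is an iterable of (capture_date, features, label) tuples;
    this layout is a hypothetical one chosen for the example.
    """
    train_days, test_days = set(train_days), set(test_days)
    overlap = train_days & test_days
    if overlap:
        raise ValueError(f"dates assigned to both sets: {sorted(overlap)}")
    train = [r for r in records if r[0] in train_days]
    test = [r for r in records if r[0] in test_days]
    return train, test


# Hypothetical records; the dates and day-to-set assignment are placeholders.
records = [
    (date(2017, 7, 3), [0.1, 0.2], 0),
    (date(2017, 7, 4), [0.9, 0.4], 1),
    (date(2017, 7, 5), [0.3, 0.8], 1),
    (date(2017, 7, 7), [0.2, 0.1], 0),
]
train, test = chronological_split(
    records,
    train_days=[date(2017, 7, 3), date(2017, 7, 4), date(2017, 7, 5)],
    test_days=[date(2017, 7, 7)],
)
```

Any resampling or feature-selection step would then need to be fitted on `train` only, otherwise information from the test days could still leak into training.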

Comments 3:

In this work IDS is treated as a binary classifier. Given the dataset contains numerous attack categories authors could evaluate how well the model can distinguish different attacks.

Response 3:

Thank you very much for your insightful suggestion. We fully agree that evaluating the model’s ability to distinguish between different types of attacks in a multi-class classification setting is an important research direction. In the current work, our primary objective is to maximize the overall detection accuracy by focusing on the binary classification task, distinguishing between normal traffic and malicious traffic, which is a critical first step for real-time and resource-constrained IoT environments. Nonetheless, we appreciate the reviewers' perspective. We plan to explore multi-class classification schemes in subsequent research to enhance the granularity and practical deployment of the model. We thank the reviewers for their valuable perspectives on this matter.

Comments 4:

The paper does not discuss inference speed or model size. It would improve the work to address model efficiency: e.g., how fast can it process traffic and could it operate in real-time?

Response 4:

Thank you very much for your valuable suggestion. We agree that model efficiency, including inference speed and deployability on resource-constrained devices, is a critical consideration for practical intrusion detection systems. To address this point, we have emphasized the importance of model efficiency in the Conclusions section (pages 18-19, lines 570-583). Specifically, we have discussed the potential computational challenges of deploying deep neural network-based models in real-time IoT environments, and outlined future research directions such as exploring lightweight techniques (e.g., model compression, parameter sharing, and knowledge distillation) to enhance inference speed and model adaptability. We have also proposed developing a real-time IDS prototype system for further evaluation in real-world scenarios. We hope this addition reflects our awareness of the importance of model efficiency and our intention to pursue it in future work. Thank you again for your insightful comment.
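As a rough illustration of the kind of measurement the reviewer asks for, inference throughput can be estimated by timing a prediction function over batches of flows. The sketch below is hypothetical and not part of the manuscript: `measure_throughput` is an assumed helper, and `dummy_predict` is a stand-in for a trained model.

```python
import time


def measure_throughput(predict, batches, warmup=1):
    """Return (flows per second, mean seconds per batch) over the timed batches."""
    for batch in batches[:warmup]:
        predict(batch)  # warm-up calls are executed but not timed
    timed = batches[warmup:]
    if not timed:
        raise ValueError("need more batches than warm-up calls")

    n_flows = 0
    start = time.perf_counter()
    for batch in timed:
        predict(batch)
        n_flows += len(batch)
    elapsed = time.perf_counter() - start
    return n_flows / elapsed, elapsed / len(timed)


def dummy_predict(batch):
    # Stand-in for a trained model: flags a flow when its feature sum is positive.
    return [1 if sum(features) > 0 else 0 for features in batch]


# 11 hypothetical batches of 64 flows with 20 features each; the first is warm-up.
batches = [[[0.1] * 20 for _ in range(64)] for _ in range(11)]
flows_per_sec, sec_per_batch = measure_throughput(dummy_predict, batches)
```

Comparing the measured flows per second with the expected arrival rate of traffic on the target device gives a first indication of whether real-time operation is feasible. Model size would have to be reported separately.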

Comments 5:

Ref [31] and Ref [41] are labeled as “RETRACTED ARTICLE”. The authors do mark them as such, which is transparent, but one wonders if those citations are necessary. It looks like the reference list has a duplicate entry for the Ref [39] and Ref [43].

Response 5:

Thank you for your careful review and constructive suggestion. We fully agree with your observation regarding the retracted articles and duplicate references. Regarding the retracted articles: we have removed Ref [41] entirely, as it was not essential to our argument, and we have replaced Ref [31] (retracted article) with Ref [35] (a more recent and reliable source) to support the same point. Regarding the duplicate entries: we have deleted Ref [43] (a duplicate of Ref [39]) and renumbered the subsequent references accordingly. The revised reference list appears on pages 19-21.

Comments 6:

Response 6:

We sincerely appreciate your careful reading and valuable feedback. We have corrected the numerical inconsistency by revising the sentence to: "the proposed model reaches 99.69%, significantly higher than BaysCNN+PCA (97.69%)". Additionally, we have thoroughly proofread the entire manuscript to improve grammatical accuracy and clarity. Thank you for helping us enhance the quality of our paper.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Author Response

Thank you very much for your valuable comments and kind confirmation. Your constructive feedback greatly helped us improve the clarity and rigor of our work. We truly appreciate your time and effort throughout the review process.

Reviewer 2 Report

Comments and Suggestions for Authors

Comment 2

The authors state that Kaiming Normal initialization was applied to the convolutional and fully connected layers, while the default Xavier initialization provided by PyTorch was used for the LSTM and Transformer modules. They further claim that these initialization strategies contributed to faster convergence and greater training stability.

While these initialization methods are theoretically known to improve convergence and stability, the fact that the proposed model achieved a training accuracy of 96.5% at the very beginning of training appears rather unusual and difficult to attribute solely to the initialization strategy. Initializations are generally designed to facilitate smooth gradient flow and stable training, but they do not inherently guarantee such a high accuracy at the onset of training. Therefore, the current explanation does not fully address the reviewer’s concern.

In addition, the paper reports a training accuracy of 96.5% and a test accuracy exceeding 99%, which may indeed reflect a structurally strong model, but could also be the result of a fortuitous initialization. To support the generalizability of the model and the claimed robustness of the initialization strategy, I recommend training the model under the same conditions for at least five independent runs, and reporting the average accuracy and standard deviation. This would provide a more reliable and reproducible performance evaluation, beyond a single experimental outcome.

Author Response

Comments 1:
While these initialization methods are theoretically known to improve convergence and stability, the fact that the proposed model achieved a training accuracy of 96.5% at the very beginning of training appears rather unusual and difficult to attribute solely to the initialization strategy. Initializations are generally designed to facilitate smooth gradient flow and stable training, but they do not inherently guarantee such a high accuracy at the onset of training. Therefore, the current explanation does not fully address the reviewer’s concern.

Response 1:
Thank you very much for your valuable comment. We fully understand your concern and agree that the observed high training accuracy at the beginning of training cannot be solely attributed to initialization strategies. In the revised manuscript, we have clarified that this early performance boost is primarily the result of the coordinated design of the CNN-BiLSTM-Transformer architecture and the data preprocessing pipeline.

Specifically, the hybrid model architecture allows for the simultaneous extraction of spatial, temporal, and contextual features, enabling more effective learning even in the initial training stages. Meanwhile, feature selection, noise reduction through Isolation Forest and LOF, and class balancing using Borderline-SMOTE help enhance the data quality and representation.

This explanation has been revised on Page 12, Lines 379–407.

Comments 2:
In addition, the paper reports a training accuracy of 96.5% and a test accuracy exceeding 99%, which may indeed reflect a structurally strong model, but could also be the result of a fortuitous initialization. To support the generalizability of the model and the claimed robustness of the initialization strategy, I recommend training the model under the same conditions for at least five independent runs, and reporting the average accuracy and standard deviation. This would provide a more reliable and reproducible performance evaluation, beyond a single experimental outcome.

Response 2:
We sincerely appreciate this important suggestion regarding reproducibility and generalizability. In response, we conducted five independent training runs under identical data splits, preprocessing, and hyperparameter settings. The results demonstrated consistent performance across all runs, with an average training accuracy of 99.42% (±0.46%) and average testing accuracy of 99.34% (±0.37%). These findings indicate that the model's performance is stable and not the result of chance or a particular initialization instance. The corresponding content has been added on Page 12, Lines 408–423.
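The "mean (± standard deviation)" summary over repeated runs can be computed as in the sketch below. The per-run accuracies are placeholders chosen for illustration, not the values obtained by the authors, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the response does not state which estimator was used.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    if len(accuracies) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample std, divides by n - 1 (assumed)
    return mean, std


# Placeholder accuracies (%) for five runs; NOT results from the paper.
run_accuracies = [99.0, 99.2, 99.4, 99.6, 99.8]
mean_acc, std_acc = summarize_runs(run_accuracies)
summary = f"{mean_acc:.2f}% (±{std_acc:.2f}%)"
```

For the runs to be independent in the sense the reviewer intends, each one would use a different random seed for weight initialization while the data split and hyperparameters stay fixed.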

Reviewer 3 Report

Comments and Suggestions for Authors

The revised version of the manuscript shows clear improvement, and the authors have addressed most of the issues raised in the previous review, particularly regarding the structure, clarity of key concepts, and the development of the conclusions.

However, there are still important issues that need to be corrected regarding the references. In multiple instances, which weakens the narrative and makes it difficult to assess the relevance of the cited work. More critically, there are several occurrences of the error message 'Error. Reference source not found,' which suggests problems with the reference management system.

These issues must be addressed to ensure the manuscript meets the standards of academic quality and readability expected for publication.

Author Response

Comments:
However, there are still important issues that need to be corrected regarding the references. In multiple instances, which weakens the narrative and makes it difficult to assess the relevance of the cited work. More critically, there are several occurrences of the error message 'Error. Reference source not found,' which suggests problems with the reference management system.

Response:
Thank you for pointing this out. We fully agree with your comment regarding the reference issues. To address this, we have carefully reviewed and revised all in-text citations and the reference list to ensure accuracy and consistency.

In particular, the occurrences of the error message "Error. Reference source not found" were due to reference management issues, which have now been fully resolved. Each of these previously problematic citations has been correctly linked to its corresponding reference entry, ensuring that all cited works can now be properly identified and assessed.

We have also double-checked the relevance and clarity of all references to strengthen the manuscript’s narrative and academic rigor. Thank you again for your careful review and helpful suggestions.
