TSE-APT: An APT Attack-Detection Method Based on Time-Series and Ensemble-Learning Models
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposed an APT detection method based on time-series and ensemble learning models with the attention mechanism. The authors fused dynamic time-series features with static features and proposed an ensemble learning model with a dynamic weight allocation mechanism. The experiments are adequate. This study provided a novel analytical method for APT attack detection.
Suggestions to improve the manuscript:
1. For the abstract, I would recommend that the authors refine the language, reduce redundancy, and clearly articulate the novelty and advantages of the proposed method.
2. In the section “Introduction”, the author should tell the difference between this study and other similar studies before the last paragraph. This will further justify the gap this study intends to fill.
3. For related works, temporal features and ensemble learning were introduced. The candidate models that were selected and integrated are not described. Please add the related recent works.
4. The model structure in Section 3.3.1 lacks formal mathematical expressions, and key parameter settings in the experimental section are not sufficiently detailed. The authors need to provide clearer mathematical formulations and a comprehensive description of important parameters.
5. The performance presentation in Section 4.1.4 is insufficiently clear. A single ROC curve alone cannot comprehensively demonstrate the overall model effectiveness. It is suggested that the authors supplement additional experimental images.
Author Response
Comments 1: For the abstract, I would recommend that the authors refine the language, reduce redundancy, and clearly articulate the novelty and advantages of the proposed method.
Response 1: Thank you for pointing this out. We agree with this comment. We have revised the abstract to remove redundant language, articulate the advantages of the proposed approach more clearly, and describe the functions of the three base models and the problems they address. This change can be found on page 1, abstract, lines 3 and 15.
“ Advanced Persistent Threat (APT) attacks pose a serious challenge to traditional detection methods, which often suffer from high false-alarm rates and limited accuracy due to the multi-stage and covert nature of such attacks. In this paper, we propose TSE-APT, a time-series ensemble model that addresses these two limitations. It combines multiple machine learning models, including Random Forest (RF), Multi-Layer Perceptron (MLP), and Bidirectional Long Short-Term Memory (BiLSTM) networks, to dynamically capture correlations among the stages of the attack process based on time-series features, and it uncovers hidden features through model integration to significantly improve the accuracy and robustness of APT detection. First, we extract a collection of dynamic time-series features, such as the traffic mean, flow duration, and flag frequency, and fuse them with static contextual features, including the port service matrix and the protocol type distribution, to capture the multi-stage behaviors of APT attacks effectively. Then, we propose an ensemble learning model with a dynamic weight allocation mechanism that uses a self-attention network to adaptively adjust each sub-model's contribution. Experiments show that time-series feature fusion significantly enhances detection performance: the RF, MLP, and BiLSTM models achieve 96.7% accuracy, with considerable improvements in recall and false positive rate. The adaptive mechanism further optimizes model performance and reduces false alarms. This study provides an analytical method for APT attack detection that considers both temporal dynamics and static contextual characteristics, offering new ideas for security protection in complex networks.”
Comments 2: In the section “Introduction”, the author should tell the difference between this study and other similar studies before the last paragraph. This will further justify the gap this study intends to fill.
Response 2: Thank you for pointing this out. We agree with this comment. We have added a paragraph that summarizes the three detection approaches discussed earlier in the Introduction and compares them with our method to justify the gap this study fills. This change can be found on page 2, paragraph 4.
“ APT attacks are complex, multi-stage, and dynamic in nature. As discussed above, existing approaches, including machine learning, provenance-graph-based, and Transformer-based methods, each have certain advantages. From our point of view, identifying key features accurately, especially those reflecting dynamic behavior, remains essential for detecting multi-stage APT attacks. The combination of static and dynamic features can offer a more complete view of attack processes. Although the traditional approaches with provenance graphs and Transformers can automatically capture complex patterns and relationships in the data, they often suffer from high false-alarm rates and lack interpretability. Therefore, we argue that combining both static and dynamic features and ensembling multiple models can better capture the contextual patterns of APT attacks while maintaining efficiency and interpretability.”
Comments 3: For related works, temporal features and ensemble learning were introduced. The candidate models that were selected and integrated are not described. Please add the related recent works.
Response 3: Thank you for pointing this out. We agree with this comment. We have reviewed recent work related to ensemble learning models and rewritten the related work section accordingly. This change can be found on page 4, paragraph 2.
“ Ensemble learning is popular for achieving better results by integrating multiple models [27]. Saini et al. [28] used deep learning and machine learning models to classify APT attacks; their approach improves generalization by analyzing network traffic, host logs, and other sources and by integrating multiple base models, enhancing detection accuracy and reducing false positives and missed detections. Arefin et al. [29] aggregated predictions from various models and selected the one with the highest probability, proving the algorithm's effectiveness. Our method improves accuracy further by combining dynamic weight adjustment with a self-attention mechanism, enabling the model to focus on important features and reduce computational load.”
Comments 4: The model structure in Section 3.3.1 lacks formal mathematical expressions, and key parameter settings in the experimental section are not sufficiently detailed. The authors need to provide clearer mathematical formulations and a comprehensive description of important parameters.
Response 4: Thank you for pointing this out. We agree with this comment. We now describe the three baseline models more clearly, provide in Section 4.1.3 the parameters omitted from the previous descriptions, and elaborate on the model improvements. This change can be found on page 10, paragraphs 5-10, and page 13, Section 4.1.3.
“ 1) RF [36], MLP, and BiLSTM [37] serve as the base models. Each model is trained independently according to its characteristics and advantages, performs the classification task on the dataset, and outputs preliminary results.
2) RF is an ensemble of decision trees that excels at processing high-dimensional structured data and capturing the most important features, so it can effectively handle datasets with an unbalanced distribution or a large number of features. In intrusion detection, RF can quickly evaluate each feature attribute for selection and detection.
3) To make the RF model faster and more effective, we first use mutual information to remove irrelevant features before training. This reduces computation and makes the model more stable. After feature selection, we apply a grid-search strategy to optimize the parameters of the RF model, including the number of trees and the maximum tree depth, ensuring that the model achieves the optimal balance between complexity and generalization performance. The overall configuration of the RF parameters is shown in Section 4.1.3.
4) We adopt a residual MLP architecture, where each hidden-layer block includes a skip connection that allows the input to bypass the transformation and be added directly to the output. This design alleviates vanishing-gradient issues and enables deeper network training. Furthermore, each block uses Dropout to mitigate overfitting and Layer Normalization to stabilize the learning process. Finally, we replace the conventional ReLU activation with the GELU function, which provides smoother gradient flow and has been shown to outperform ReLU in modern deep learning models. These enhancements collectively improve the model's ability to learn complex patterns in network traffic and contribute to more robust APT attack detection. The overall configuration of the residual MLP parameters is shown in Section 4.1.3, and a minimal sketch of the block appears after Table 3 below.
5) We adopt a hybrid BiLSTM-CNN architecture to jointly model global temporal dependencies and local traffic patterns. The BiLSTM captures long-range contextual information from the input sequence, while the subsequent CNN layer extracts the localized, bursty behaviors commonly found in multi-stage APT attacks. An adaptive max-pooling layer summarizes the sequence into a compact representation for final classification. This combined design improves robustness to noise and enhances the detection of subtle, stage-based attack traces. The overall configuration of the BiLSTM-CNN parameters is shown in Section 4.1.3.
6) To ensure the reproducibility of our experiments and a fair comparison across models, Table 3 summarizes the key hyperparameters and their corresponding search spaces for each model used in this study, including RF, MLP, BiLSTM, XGBoost, and the self-attention mechanism. The selected values are based on prior literature or grid-search optimization and were used consistently throughout all subsequent experiments.
Model | HyperParameters | Search Space | Selected
RF | Number of Trees | [10, 50, 100] | 50
RF | Random State | [0, 42, 100] | 42
MLP | Hidden Layers | [2, 3, 4, 5] | 4
MLP | Activation Function | [ReLU, GELU, Tanh] | GELU
MLP | Dropout Rate | [0.1, 0.2, 0.3, 0.5] | 0.3
BiLSTM | Hidden Size | [64, 128, 256] | 128
BiLSTM | Layers | [1, 2] | 1
BiLSTM | CNN Kernel Size | [3, 5, 7] | 3
BiLSTM | CNN Channels | [64, 128, 256] | 128
BiLSTM | Dropout | [0.1, 0.3, 0.5] | 0.3
BiLSTM | Sequence Length | [4, 6, 8] | 6
XGBoost | Booster | [gbtree, gblinear] | gbtree
XGBoost | Max Depth | [3, 6, 10] | 6
XGBoost | Subsample | [0.6, 0.8, 1.0] | 0.8
XGBoost | Colsample_bytree | [0.6, 0.8, 1.0] | 0.8
XGBoost | Number of Trees | [50, 100, 150] | 100
Self-Attention | Attention Heads | [2, 4, 8] | 4
Self-Attention | Q/K/V Dimensions | [32, 64, 128] | 64
Self-Attention | Feedforward Dimension | [128, 256, 512] | 256
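For concreteness, a minimal PyTorch sketch of the residual MLP block described in item 4) is given below. The block count (4), dropout rate (0.3), and GELU activation follow Table 3; the hidden width of 128 is an assumed value, and this is an illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """One hidden block: Linear -> GELU -> Dropout, plus a skip connection
    and Layer Normalization, as described in item 4)."""
    def __init__(self, dim: int, dropout: float = 0.3):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input bypasses the transformation and is added back to its output.
        return self.norm(x + self.drop(F.gelu(self.fc(x))))

class ResidualMLP(nn.Module):
    """Stack of 4 residual blocks (Table 3), ending in a binary classifier head."""
    def __init__(self, in_dim: int, hidden: int = 128, n_blocks: int = 4):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(n_blocks)])
        self.head = nn.Linear(hidden, 2)  # benign vs. attack logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.blocks(self.proj(x)))
```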
Comments 5: The performance presentation in Section 4.1.4 is insufficiently clear. A single ROC curve alone cannot comprehensively demonstrate the overall model effectiveness. It is suggested that the authors supplement additional experimental images.
Response 5: Thank you for pointing this out. We agree with this comment. Since a single ROC diagram cannot fully illustrate the training process, we have added ROC and Precision-Recall diagrams for each of the three models; these show more clearly how the three baseline models perform during training and evaluation and support the effectiveness of selecting these three models. This change can be found on page 13, Section 4.1.4, and page 14.
“ This section discusses the training process and evaluation results of the RF, MLP, and BiLSTM models. All models are trained using both static and time-series features. To ensure a fair comparison, we used the optimal hyperparameter configurations detailed in the appendix.
To fully evaluate model performance, we report ROC curves, Precision-Recall curves, and confusion matrices for each model. These evaluation views allow us to assess classification capability from multiple perspectives.
Figure 6 shows the evaluation results for the three models. All models achieve high AUC values and exhibit strong classification performance, successfully distinguishing between benign and malicious traffic. In particular, the confusion matrices show a clear diagonal advantage, indicating high prediction accuracy.”
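A minimal sketch of how these three views can be produced with scikit-learn is shown below; `y_score` is assumed to be a model's predicted attack probabilities on the test set, and the 0.5 decision threshold is our assumption.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_recall_curve, roc_auc_score, roc_curve)

def evaluate(y_true: np.ndarray, y_score: np.ndarray, threshold: float = 0.5):
    """ROC curve, PR curve, and confusion matrix for one model."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    cm = confusion_matrix(y_true, (y_score >= threshold).astype(int))
    return {"roc_auc": roc_auc_score(y_true, y_score),          # area under ROC
            "pr_auc": average_precision_score(y_true, y_score),  # area under PR
            "roc": (fpr, tpr), "pr": (rec, prec), "cm": cm}
```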
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper presents a new APT detection method (TSE-APT) via time-series features and ensemble learning, which achieves promising results. The integration of dynamic weight allocation with self-attention is particularly innovative.
There are some suggestions for revising the paper:
1. Clarify the problem statement in the abstract, explicitly declare the limitations of existing methods to the specific innovations of TSE-APT.
2. Removed "Stated in Section 3.3 and evaluated in Section 4.2.",“Stated in Section 3.2 and evaluated in Section 4.2.”in contribution statement of introduction section. 3. For time-series features (IAT, FRC, FV, F2), discuss real-world examples of how they differ between normal and APT traffic.
4. Describe the training sequence of sub-models (RF, MLP, BiLSTM). For example, are they trained independently first, then integrated?
5. In conclusion section, discuss how TSE-APT could be deployed in real scenarios.
Author Response
Comments 1: Clarify the problem statement in the abstract, explicitly declare the limitations of existing methods to the specific innovations of TSE-APT.
Response 1: Thank you for the above suggestion. These comments are valuable and very helpful. We have read through them carefully and made corrections. We have revised the abstract to state the limitations of existing methods and to further describe the functions of the three base models and the problems they address. This change can be found on page 1, abstract.
“ Advanced Persistent Threat (APT) attacks pose a serious challenge to traditional detection methods, which often suffer from high false-alarm rates and limited accuracy due to the multi-stage and covert nature of such attacks. In this paper, we propose TSE-APT, a time-series ensemble model that addresses these two limitations. It combines multiple machine learning models, including Random Forest (RF), Multi-Layer Perceptron (MLP), and Bidirectional Long Short-Term Memory (BiLSTM) networks, to dynamically capture correlations among the stages of the attack process based on time-series features, and it uncovers hidden features through model integration to significantly improve the accuracy and robustness of APT detection. First, we extract a collection of dynamic time-series features, such as the traffic mean, flow duration, and flag frequency, and fuse them with static contextual features, including the port service matrix and the protocol type distribution, to capture the multi-stage behaviors of APT attacks effectively. Then, we propose an ensemble learning model with a dynamic weight allocation mechanism that uses a self-attention network to adaptively adjust each sub-model's contribution. Experiments show that time-series feature fusion significantly enhances detection performance: the RF, MLP, and BiLSTM models achieve 96.7% accuracy, with considerable improvements in recall and false positive rate. The adaptive mechanism further optimizes model performance and reduces false alarms. This study provides an analytical method for APT attack detection that considers both temporal dynamics and static contextual characteristics, offering new ideas for security protection in complex networks.”
Comments 2: Removed "Stated in Section 3.3 and evaluated in Section 4.2.",“Stated in Section 3.2 and evaluated in Section 4.2.”in contribution statement of introduction section.
Response 2: Thank you for the above suggestion. We agree with the comment and rewrote the passage in the revised manuscript as follows, on page 3, paragraphs 2-5:
“ The main contributions of this work are as follows:
The accuracy of the proposed TSE-APT method on the CIC-IDS2018 dataset was 97.32%, and the false positive rate was 0.69%, which was better than the baseline models.
Demonstrated that combining dynamic time-series features with static flow data improves detection accuracy by over 2% and recall by approximately 3% compared to using static features alone.
Propose a dynamic weight strategy for the allocation of sub-models. It effectively improves the F1 score by approximately 2% and recall by over 3%, identifying new attacks more accurately."
Comments 3: For time-series features (IAT, FRC, FV, F2), discuss real-world examples of how they differ between normal and APT traffic.
Response 3: Thank you for the above suggestion. We are grateful for it. As suggested by the reviewer, we have added more detailed descriptions and examples to the introduction of each feature, and we have also conducted SHAP interpretability experiments to show the importance and behavior of several features in the overall detection. These changes can be found on 1) page 7, paragraph 5; 2) page 8, paragraph 1; 3) page 8, paragraph 4; 4) page 9, paragraph 3; 5) page 16, Section 4.2.4.
“ 1) Figure 2 illustrates the distribution of IAT for benign and XSS traffic samples. As shown in the figure, benign traffic exhibits significantly larger and more stable IAT values, suggesting regular and continuous communication behavior. In contrast, the IAT values for XSS traffic are much lower and more variable, indicating frequent and bursty packet transmissions. This observation is not limited to XSS attacks; similar IAT behavior has been observed in other APT scenarios, such as data exfiltration or stealthy remote-access sessions. Therefore, IAT serves as a strong and generalizable indicator of abnormal timing patterns across different attack types.
2) Figure 3. illustrates the distribution of FRC values for benign and DoS traffic samples. As shown in the figure, benign traffic exhibits consistently low and stable FRC values, indicating smooth and regular network usage with minimal fluctuations in traffic rate over time. In contrast, DoS traffic displays significantly higher and more dynamic FRC values, indicating sudden and intense flow rate changes characteristic of burst attacks. This pattern aligns with the nature of DoS attacks that typically generate large volumes of traffic in a short period. The distinct separation between the two trends demonstrates that FRC effectively captures traffic anomalies and distinguishes malicious behavior from normal activity.
3) Figure 4 illustrates the distribution of FV values for benign and DoS traffic samples. Benign traffic maintains very low FV values, mostly below 10, indicating steady and predictable communication with minimal variation in traffic volume. In contrast, DoS traffic exhibits significantly higher FV values, ranging from over 1,000 to nearly 10,000, highlighting drastic and unstable changes in flow behavior. These high variance values reflect the bursty nature of such attacks, where traffic surges occur in short bursts, causing dramatic changes in the overall traffic pattern. The clear separation between the curves confirms that FV is an effective feature for identifying abnormal network behavior, particularly in detecting DoS attacks characterized by erratic and high-volume traffic patterns.
4) Figure 5 illustrates the distribution of F2 values for benign and Bot traffic samples. As the figure shows, benign traffic maintains consistently low F2 values, mostly between 0.01 and 1, indicating normal, balanced usage of TCP flags such as SYN, ACK, and FIN during regular communication. In contrast, Bot traffic exhibits sharply elevated and highly oscillating F2 values, ranging from over 100 to nearly 1,500. This pattern suggests excessive or repetitive use of specific flag bits, which often reflects the automated, scripted communication typical of botnet behavior. This abnormal frequency likely reflects attempts by the attacker to manipulate TCP flag behavior to avoid detection or to control botnet communication. The sharp contrast between the two curves confirms that F2 is an effective feature for capturing low-level protocol misuse and distinguishing botnet-based malicious activity from legitimate traffic.
5) To better understand how the model makes predictions, we applied SHAP to analyze the contribution of each feature.
As shown in Figure 7, FRC exhibits a relatively high average absolute SHAP value, and FV is also among the top-20 features by SHAP value, indicating their effectiveness in detecting APT attacks.”
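To make the features concrete, a sketch of how such window-level statistics could be computed with pandas is given below. The exact definitions of FRC, FV, and F2 are our plausible readings of the text, and the column names ('timestamp', 'bytes', 'flag_count') and the 10-second sub-interval are hypothetical, not the paper's specification.

```python
import pandas as pd

def window_features(win: pd.DataFrame) -> pd.Series:
    """win holds the packets of one time window, with assumed columns
    'timestamp' (seconds), 'bytes', and 'flag_count' (TCP flag occurrences)."""
    ts = win["timestamp"].sort_values()
    iat = ts.diff().dropna()                       # inter-arrival times (IAT)
    vol = win.set_index(pd.to_datetime(win["timestamp"], unit="s"))["bytes"] \
             .resample("10s").sum()                # traffic volume per sub-interval
    return pd.Series({
        "IAT_mean": iat.mean(),                    # benign: large and stable
        "FRC": vol.diff().abs().mean(),            # flow-rate change magnitude
        "FV": vol.var(),                           # flow variance over the window
        "F2": win["flag_count"].sum() / max(len(win), 1),  # flag frequency
    })
```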
Comments 4: Describe the training sequence of sub-models (RF, MLP, BiLSTM). For example, are they trained independently first, then integrated?
Response 4: We are grateful for the suggestion. In accordance with your advice, we have added a more detailed explanation of the training sequence of the sub-models: the three models are first trained independently and then integrated. A more detailed analysis was added on page 14, paragraph 1.
“ Table 4 shows the experimental results of the new temporal features in detecting network-traffic-based APT attacks, with the three models trained independently. We can see from Table 4 that introducing temporal features significantly improves the overall metrics of all three algorithms. Compared with static features alone, the accuracy, precision, recall, F1 score, and false positive rate of the three models all improve markedly, especially accuracy and recall. This suggests that temporal features make the characteristics of APT attack traffic more apparent, helping discover the low-frequency behaviors in the attacks and reducing false positives and omissions.”
Comments 5:In conclusion section, discuss how TSE-APT could be deployed in real scenarios.
Response 5: We have rewritten this part according to the Reviewer's suggestion, looking ahead to the future of this work and considering how it would be applied in real scenarios. This change can be found on page 18, paragraph 3.
“ In the future, we envision the TSE-APT framework as a modular detection engine integrated into intrusion detection or monitoring systems. By analyzing incoming traffic over a period of time, APT protection can be provided in different environments. In addition, to improve the effectiveness of APT attack detection, we will continue to extend the experiments. First, we need to use more APT datasets to validate the effectiveness of this method in more realistic situations. Second, we will further optimize the performance and efficiency of the module, striving to achieve real-time traffic analysis and processing so that it can be applied in practical environments as soon as possible. Finally, with the rise of Transformer models and multimodal learning, we can combine these advanced techniques with ensemble learning and self-attention mechanisms to improve the performance of time-series prediction."
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper presents TSE-APT, an interesting and relevant method for detecting Advanced Persistent Threat (APT) attacks by fusing time-series features with an adaptive ensemble learning model. The approach is well-motivated, and the empirical evaluation on a large-scale dataset is a significant strength. The core ideas of combining static and dynamic features and using a self-attention mechanism for dynamic model weighting are promising. However, the manuscript requires major revisions to meet the standards for publication. The following suggestions are intended to help you strengthen the paper.
- Novelty and Positioning
- Clarify Contribution: The introduction does not sufficiently differentiate TSE-APT from more recent and sophisticated deep learning models for APT detection, such as transformer or graph-based approaches. Please add a clear discussion that contrasts your work with these state-of-the-art methods, emphasizing the specific advantages of your lightweight, feature-driven approach (e.g., lower computational cost, better interpretability).
- Strengthen Contribution List: The list of contributions should be revised to state measurable achievements rather than the actions you performed. For example, instead of "Propose the TSE-APT method...", state what the method achieves, such as "Achieves a false positive rate of 0.69% and an accuracy of 97.32% on the CIC-IDS 2018 dataset, outperforming...".
- Structure and Formalism
- Dedicated Related Work Section: The literature review currently mixed into the Introduction and other sections should be consolidated into a dedicated "Related Work" section. This will improve the paper's flow and allow for a more structured overview of prior research in (a) time-series features for traffic analysis, (b) ensemble learning for security, and (c) attention mechanisms.
- Formal Problem Statement: Please add a "Problem Formulation" section before the methodology. This section should formally define the inputs (e.g., feature vectors from time-windowed flows), the output (binary APT/non-APT label), the threat model assumptions, and the mathematical formulas for your primary evaluation metrics.
- Methodological Transparency and Reproducibility
- Dataset and Preprocessing: The description of the dataset needs more detail. Please justify the choice of the sliding window parameters. Crucially, explain how the class imbalance between malicious and benign flows was handled during training and validation to avoid model bias. A small flowchart of the entire preprocessing pipeline would be very helpful.
- Hyperparameters: The paper lacks sufficient detail for reproducibility. Please provide the hyperparameter grids or the final selected values for all models (RF, MLP, BiLSTM, XGBoost). For the self-attention algorithm, you must specify the dimensions of Q, K, and V, and the number of attention heads used in your experiments.
- Release Artifacts: To maximize the impact and verifiability of your work, we strongly encourage you to release the source code, model weights, and scripts for data processing under an open-source license.
- Evaluation and Analysis
- Statistical Significance: The reported accuracy gains are often small. Without statistical testing, it is unclear if these improvements are significant or due to random data splits. Please perform 5-fold cross-validation or use statistical tests (e.g., McNemar's test, paired t-tests) to validate your claims of superiority and report confidence intervals or p-values.
- Computational Performance: For the method to be considered practical, its performance overhead must be known. Please add a section reporting the training time, inference latency (per flow or window), and peak GPU/CPU memory usage for TSE-APT and the baseline models on your stated hardware.
- Stronger Baselines: The current baselines are the components of your own model. To make a more compelling case, please compare TSE-APT against at least one recent transformer-based and one graph neural network (GNN) baseline on the same data split, or provide a strong justification for their exclusion.
- Expanded Ablation Study: The ablation study is a good start, but it could be more comprehensive. The corresponding table could be expanded to show the performance drop when removing individual components, such as a "TSE-APT (no time-series)" or "TSE-APT (no self-attention)" variant.
- Explainability: To increase trust and practical value, please include explainability analysis. Visualizations from SHAP or plots of the self-attention weights could illustrate which features are most important for detecting a specific APT attack, which would be a valuable contribution.
- Figure Readability: The fonts in the figures are too small to be legible, especially in print. Please increase all font sizes, use thicker lines, and select a color-blind-safe palette.
Comments for author File: Comments.pdf
Comments on the Quality of English Language
The manuscript would benefit significantly from a thorough review by a native English speaker or a professional editing service. The current language quality obscures some of the key ideas. Specific issues include:
- Sentence Structure: There are numerous run-on sentences. For example, the sentence in the Abstract that begins "It combines multiple machine learning models..." is long and convoluted. These should be broken into shorter, clearer sentences.
- Acronyms: Acronyms such as RF, MLP, BiLSTM, and APT are used without being defined at their first appearance. Please define each acronym the first time it is used (e.g., "Random Forest (RF)") and then use the acronym consistently.
- Inconsistent Terminology: The paper uses different phrases to refer to the same concept, which can be confusing. For instance, "dynamic integration design," "ensemble learning module," and the meta-model with XGBoost/self-attention seem to refer to the same component. Please choose a single, consistent term for each concept and use it throughout the manuscript.
- Tense and Grammar: There are inconsistencies in verb tense and some grammatical errors that should be corrected during proofreading.
Author Response
Comments 1: The introduction should be revised to differentiate TSE-APT from recent, more advanced models. A direct comparison with transformer-based (e.g., LogShield 2023, TBDetector 2024) and graph-based APT detectors is necessary. The key innovation—a lightweight, computationally efficient model that combines well-understood statistical features with a dynamic ensemble—needs to be a central argument. The authors should argue why this simpler approach is advantageous, perhaps in terms of resource consumption or interpretability, over more complex deep learning models.
Response 1: Thank you for pointing this out. We are grateful for the suggestion. Following your advice, we have added a more detailed discussion of transformer-based and graph-based approaches in the introduction. We further highlight the key innovations and argue why our approach is advantageous compared with these methods. This change can be found on page 2, paragraphs 2-4.
“ Provenance-graph-based methods (e.g., Graph Neural Networks [10], LSTM [11]) build causal graphs from system logs to capture the relationships between processes, files, and network activities. Provenance graphs preserve the time and context of a system's behavior, making them effective for identifying the multi-stage and stealthy nature of APT attacks. For example, Hossain et al. [12] used causality tracking and provenance graphs to construct a model; they created a dependency graph from audit logs, which enables faster, real-time monitoring. Ren et al. [13] constructed a knowledge graph of APT attacks by combining deep learning with specialized knowledge; this work can dynamically adjust defense strategies. Compared with other methods, the advantage of provenance-graph-based approaches lies in their ability to capture the context and causal relationships of system activities, which is especially useful for modeling multi-stage APT attacks. However, these methods depend strongly on complete and high-quality audit logs; as a result, provenance-based models may produce inaccurate results when the log data is noisy or incomplete. In recent years, with the rise of Transformers, APT detection methods with powerful sequence-modeling and temporal-analysis capabilities have emerged.
Transformer-based methods inherit the time-series modeling capability of the Transformer [14] and rely on pre-training and few-shot learning; they can comprehensively model and analyze attacks. For example, LogShield [15] uses a Transformer model to analyze event sequences from system logs, capturing context and temporal information to detect multi-stage and low-frequency APT attacks. In addition, at the end of 2024, TBDetector [16] first turns system logs into causal graphs and feature sequences, then uses a Transformer to learn context features, and finally detects APT attacks by computing anomaly scores. These models demonstrate strong performance on benchmark datasets and are effective at modeling complex multi-stage attack chains. However, due to their large parameter sizes and the lack of interpretability of black-box decision logic, these methods are often difficult to deploy in detection scenarios, especially where computational resources are limited or where interpretability is essential.
APT attacks are complex, multi-stage, and dynamic in nature. As discussed above, existing approaches, including machine learning, provenance-graph-based, and Transformer-based methods, each have certain advantages. From our point of view, identifying key features accurately, especially those reflecting dynamic behavior, remains essential for detecting multi-stage APT attacks. The combination of static and dynamic features can offer a more complete view of attack processes. Although the traditional approaches with provenance graphs and Transformers can automatically capture complex patterns and relationships in the data, they often suffer from high false-alarm rates and lack interpretability. Therefore, we argue that combining both static and dynamic features and ensembling multiple models can better capture the contextual patterns of APT attacks while maintaining efficiency and interpretability.”
Comments 2: Create a Dedicated "Related Work" Section: The literature review currently interspersed in the introduction should be moved to a separate "Related Work" section. This section should systematically review:
Time-series feature extraction for anomaly and APT detection: Discuss the evolution of feature engineering in this domain and what has been effective.
Ensemble learning techniques in cybersecurity: Review how different ensemble strategies (e.g., bagging, boosting, stacking) have been applied and their limitations.
The use of attention mechanisms in traffic analysis: This is a key component of your method, but is not explicitly reviewed as a standalone topic in the related work. A subsection here would bridge that gap.
Response 2: We gratefully appreciate your valuable suggestion and have rewritten this part accordingly. We retain the first two parts of the related work, on time-series feature extraction and on ensemble learning for APT attacks, and have mainly added a review of the use of attention mechanisms in traffic analysis. The additions appear on page 4, paragraphs 4-5.
“ Attention mechanisms have been widely adopted in traffic-based threat detection. Unlike traditional models that treat all input features equally, attention-based models dynamically assign different weights to different parts of the input, helping to highlight key indicators of APT attacks. For example, Choi, S. [31] proposed a multi-attention mechanism network that combines temporal and feature attention to improve detection accuracy and interpretability. Similarly, Su, Tongtong, et al.[32] combined attention with convolutional layers to extract both local and global features, resulting in improved detection accuracy across datasets.
Jiao et al. [33] use attention mechanisms to analyze the current system load, which helps reduce false positives by focusing on the most important data. This shows that attention can find and highlight key information. In our method, we add self-attention to the ensemble model to further enhance feature discrimination, enabling the model to allocate attention to key parts of the input and suppress noise.”
Comments 3: Formalize the Problem Statement: A dedicated "Problem Formulation" or "Problem Statement" section should be added. This section would formally define:
Input: The structure of the network flow data and the sliding window approach (e.g., a sequence of feature vectors).
Output: The binary classification goal (APT or benign).
Threat Model: The specific characteristics of the APT attacks being targeted.
Evaluation Metrics: The mathematical definitions of metrics like Accuracy, Precision, Recall, F1-score, and FPR.
Response 3: Thank you for pointing out this problem in the manuscript. We gratefully appreciate your valuable suggestion. To address this comment, we have added Section 5 (Discussion), in which Section 5.2.1 analyzes the input design in detail; we have also added the definitions and formulas of the five evaluation metrics to Section 4.1.1 (experimental environment and preparation) to help readers understand the method's performance more easily. These changes are shown on 1) page 17, paragraphs 4-5, and 2) page 12, paragraph 6.
“ 1) Since the CIC-IDS2018 dataset contains about 2.83 million attack flows and about 2.17 million normal flows, the imbalance may bias the model's predictions.
First, we use a stratified split to ensure that the class ratio is the same in the training, test, and validation sets. In addition, we monitor F1 scores and the FPR in real time to avoid misleading evaluation results. For RF, we also configure class balancing in advance to balance the dataset. After these measures, the probability of biased predictions is very small.
2) To present the evaluation results, five representative metrics were used, namely Accuracy, Precision, Recall, F1 Score, and False Positive Rate (also known as the false-alarm rate). The definitions of these five performance metrics are shown in Table 1. Specifically, a true positive ("TP") indicates that a malicious attack was successfully detected, and a true negative ("TN") indicates that the method successfully recognized benign activity. Conversely, if a malicious attack is incorrectly detected as benign activity (a false negative, "FN"), or benign activity is classified as malicious (a false positive, "FP"), the classification result is considered wrong. The F1 score, widely used to measure the accuracy of a test, is the harmonic mean of precision and recall, with an optimal value of 1.
Metrics | Formula
Accuracy | (TP+TN)/(TP+FP+FN+TN)
Precision | TP/(TP+FP)
Recall | TP/(TP+FN)
F1 Score | 2×((precision×recall)/(precision+recall))
False Positive Rate | FP/(FP+TN)
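As a quick sanity check, the five metrics of Table 1 can be computed directly from the four counts; note in particular that recall is TP/(TP+FN):

```python
def metrics_from_counts(tp: int, tn: int, fp: int, fn: int) -> dict:
    """The five metrics of Table 1, computed from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # TP over all actual positives
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),       # false positive (false-alarm) rate
    }
```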
Comments 4: Revise the Contributions List: The bulleted list of contributions should be rephrased to highlight measurable achievements rather than the authors' actions.
For example:
Instead of: "Propose the TSE-APT method..."
Use: "Achieved a 97.32% accuracy and a 0.69% false-positive rate on the CIC-IDS2018 dataset, outperforming baseline models."
Instead of: "Construct a multi-dimensional feature space..."
Use: "Demonstrated that combining dynamic time-series features with static flow data improves detection accuracy by over 2% and recall by approximately 3% compared to using static features alone."
Response 4: Thank you for pointing out this problem in the manuscript. We have made corrections according to the Reviewer's comments. We revised the contributions as follows: first, we further reflect the measurable effect of the method; second, we remove the redundant items; and third, we add more comparison conditions to show model performance. These changes are shown on page 3, paragraph 1.
“ The main contributions of this work are as follows:
The accuracy of the proposed TSE-APT method on the CIC-IDS2018 dataset was 97.32%, and the false positive rate was 0.69%, which was better than the baseline models.
Demonstrated that combining dynamic time-series features with static flow data improves detection accuracy by over 2% and recall by approximately 3% compared to using static features alone.
Propose a dynamic weight strategy for the allocation of sub-models. It effectively improves the F1 score by approximately 2% and recall by over 3%, identifying new attacks more accurately. ”
Comments 5: Provide Detailed Hyperparameter Settings: For the sake of reproducibility, the paper must include the specific hyperparameters used for all models (RF, MLP, BiLSTM, and XGBoost). This could be in the form of a table in the appendix. The authors should also specify the number of heads used in the self-attention mechanism and the dimensions of the Query, Key, and Value vectors.
Reproducibility of Experiments: For the sake of reproducibility, providing more comprehensive details on the hyperparameters would be helpful. For instance, for the MLP model, details such as the number of neurons in each of the four hidden layers, the learning rate, and the optimizer used would be beneficial. Similarly, for Random Forest, specifying parameters like the maximum tree depth for the 50 trees used would be valuable.
Response 5: Thank you for pointing this out. We agree with this comment. We now describe the three baseline models more clearly and provide in Section 4.1.3 the parameters omitted from the previous descriptions, together with more detail on the model improvements. This change can be found on page 13, Section 4.1.3.
“ To ensure the reproducibility of our experiments and a fair comparison across models, Table 3 summarizes the key hyperparameters and their corresponding search spaces for each model used in this study, including RF, MLP, BiLSTM, XGBoost, and the self-attention mechanism. The selected values are based on prior literature or grid-search optimization and were used consistently throughout all subsequent experiments.
Model | HyperParameters | Search Space | Selected
RF | Number of Trees | [10, 50, 100] | 50
RF | Random State | [0, 42, 100] | 42
MLP | Hidden Layers | [2, 3, 4, 5] | 4
MLP | Activation Function | [ReLU, GELU, Tanh] | GELU
MLP | Dropout Rate | [0.1, 0.2, 0.3, 0.5] | 0.3
BiLSTM | Hidden Size | [64, 128, 256] | 128
BiLSTM | Layers | [1, 2] | 1
BiLSTM | CNN Kernel Size | [3, 5, 7] | 3
BiLSTM | CNN Channels | [64, 128, 256] | 128
BiLSTM | Dropout | [0.1, 0.3, 0.5] | 0.3
BiLSTM | Sequence Length | [4, 6, 8] | 6
XGBoost | Booster | [gbtree, gblinear] | gbtree
XGBoost | Max Depth | [3, 6, 10] | 6
XGBoost | Subsample | [0.6, 0.8, 1.0] | 0.8
XGBoost | Colsample_bytree | [0.6, 0.8, 1.0] | 0.8
XGBoost | Number of Trees | [50, 100, 150] | 100
Self-Attention | Attention Heads | [2, 4, 8] | 4
Self-Attention | Q/K/V Dimensions | [32, 64, 128] | 64
Self-Attention | Feedforward Dimension | [128, 256, 512] | 256
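As one illustration of how these search spaces translate into a tuning run, the XGBoost rows of Table 3 could be searched as sketched below. The F1 scoring objective and the 5-fold cross-validation are our assumptions rather than settings stated in the paper, and `X_train`/`y_train` stand for the preprocessed flow features and labels.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {                           # search spaces copied from Table 3
    "booster": ["gbtree", "gblinear"],
    "max_depth": [3, 6, 10],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "n_estimators": [50, 100, 150],
}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, scoring="f1", cv=5, n_jobs=-1)
# search.fit(X_train, y_train)   # X_train, y_train: preprocessed flow features
# print(search.best_params_)
```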
Comments 6: Enhance the Dataset Description: The manuscript should elaborate on the experimental data setup:
o Justify the choice of the 15-minute sliding window and 1-minute step.
o Explain how the class imbalance between malicious (2.83 million) and benign (2.17 million) flows was addressed during training to prevent model bias.
o Include a flowchart illustrating the data preprocessing pipeline from raw .pcap files to the final feature set.
Response 6: We gratefully thank the reviewer for the precious time spent making constructive remarks. To address this comment, we have added Section 5 (Discussion): Section 5.1 discusses why the CIC-IDS2018 dataset is used and how future research should extend to more APT datasets, and Section 5.2 discusses how the data imbalance problem is addressed and presents a flowchart of the data preprocessing. This change can be found on page 16, Section 5.
“ 5.2. Discussion on data processing:
5.2.1. Design of time windows and step sizes
The 15-minute sliding window was chosen because APT attacks are covert and long-lasting, and different attack phases do not appear together within a short period. A longer window, such as 15 minutes, therefore helps capture the context and preserve the temporal relationships between different attack actions.
The 1-minute step size improves the temporal resolution of detection, ensuring that the model can catch early anomalous patterns as the window slides and does not miss the beginning of an attack.
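A minimal sketch of this 15-minute window / 1-minute step scheme is shown below, under the assumption that flows are indexed by a pandas DatetimeIndex of flow start times; it is an illustration of the windowing logic, not the authors' exact pipeline.

```python
import pandas as pd

WINDOW = pd.Timedelta(minutes=15)   # long enough to span several attack stages
STEP = pd.Timedelta(minutes=1)      # fine-grained starts so attack onsets are caught

def sliding_windows(flows: pd.DataFrame):
    """Yield overlapping 15-minute windows of flows, advanced by 1 minute."""
    t, end = flows.index.min(), flows.index.max()
    while t + WINDOW <= end:
        yield flows.loc[t : t + WINDOW]   # all flows starting within the window
        t += STEP
```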
5.2.2. Handling the unbalanced dataset
Since the CIC-IDS2018 dataset contains about 2.83 million attack flows and about 2.17 million normal flows, the imbalance may bias the model's predictions.
First, we use a stratified split to ensure that the class ratio is the same in the training, test, and validation sets. In addition, we monitor F1 scores and the FPR in real time to avoid misleading evaluation results. For RF, we also configure class balancing in advance to balance the dataset. After these measures, the probability of biased predictions is very small.
5.2.3. Flowchart description
Figure 8 shows the data preprocessing steps. First, starting from .pcap files containing raw network traffic, CICFlowMeter computes flow statistics and exports them in CSV format for subsequent training and evaluation. The generated CSV files undergo preprocessing, including missing-value imputation and feature normalization, to ensure consistency and stability in downstream learning tasks. Next, traffic labels are converted to binary values (0 for benign, 1 for attack) to facilitate supervised learning. Finally, stratified sampling divides the dataset into training and test sets while preserving the original class distribution. The processed data are then fed into the RF, BiLSTM, and MLP models for training and evaluation.”
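The described pipeline can be sketched in a few lines; the CSV path and the exact "Label"/"Benign" column convention are assumptions based on the CICFlowMeter output format, and the 80/20 split ratio is ours.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("flows.csv")                    # CICFlowMeter CSV (path assumed)
y = (df.pop("Label") != "Benign").astype(int)    # binary label: 0 benign, 1 attack
X_tr, X_te, y_tr, y_te = train_test_split(       # stratified: class ratios preserved
    df, y, test_size=0.2, stratify=y, random_state=42)

imputer = SimpleImputer(strategy="median").fit(X_tr)   # fit on training data only
scaler = StandardScaler().fit(imputer.transform(X_tr))
X_tr = scaler.transform(imputer.transform(X_tr))
X_te = scaler.transform(imputer.transform(X_te))
```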
Comments 7: Improve Algorithm and Figure Clarity:
o The font size in all figures (especially Figures 2-6) must be increased for readability. Lines should be thicker, and a color-blind-safe palette should be used.
o The self-attention mechanism (Algorithm 1) needs to be more explicit about the dimensions of the inputs and the number of attention heads.
Response 7: We fully understand the reviewer's concern. We have increased the font sizes in Figures 2-6, thickened the main lines, and made the colors more distinguishable; the main parameters of the attention mechanism are given in the table in Section 4.1.3:
Model | HyperParameters | Search Space | Selected
Self-Attention | Attention Heads | [2, 4, 8] | 4
Self-Attention | Q/K/V Dimensions | [32, 64, 128] | 64
Self-Attention | Feedforward Dimension | [128, 256, 512] | 256
Comments 8: Expand Evaluation Metrics: The evaluation should be more comprehensive. In addition to the existing metrics, the authors should report:
o Confusion Matrices: For each model, to show the breakdown of true/false positives and negatives.
o ROC-AUC and PR-AUC: The Area Under the Receiver Operating Characteristic Curve (ROC-AUC) and the Area Under the Precision-Recall Curve (PR-AUC) should be reported in tables for all models to allow for a more robust comparison, especially given the class imbalance.
Response 8: Thank you for pointing this out. We agree with this comment. Since a single ROC diagram cannot fully illustrate the training process, we have added ROC and Precision-Recall diagrams for each of the three models; these show more clearly how the three baseline models perform during training and evaluation and support the effectiveness of selecting these three models. This change can be found on page 13, Section 4.1.4, and page 14.
“ This section discusses the training process and evaluation results of the RF, MLP, and BiLSTM models. All models are trained using both static and time-series features. To ensure a fair comparison, we used the optimal hyperparameter configurations detailed in the appendix.
To fully evaluate model performance, we report ROC curves, Precision-Recall curves, and confusion matrices for each model. These evaluation views allow us to assess classification capability from multiple perspectives.
Figure 6 shows the evaluation results for the three models. All models achieve high AUC values and exhibit strong classification performance, successfully distinguishing between benign and malicious traffic. In particular, the confusion matrices show a clear diagonal advantage, indicating high prediction accuracy.”
Comments 9: Incorporate Statistical Significance Testing: To validate that the observed performance gains are not due to chance, the authors should perform statistical tests (e.g., McNemar's test or a paired t-test) on the model results, reporting p values or confidence intervals. This is crucial for claims of superiority, even with small margins.
Response 9: We gratefully appreciate your valuable suggestion. We followed the reviewer's suggestion and added McNemar's tests for the three base models; the results verify that the observed improvements are statistically significant. This change can be found on page 14, paragraph 3.
“ To further ensure that the observed performance improvements are not due to random variation, we performed McNemar's tests on the three models, as shown in Table 5. All pairwise comparisons (RF vs. MLP, RF vs. BiLSTM, MLP vs. BiLSTM) yielded statistically significant p-values (p < 0.001), suggesting that the observed performance gains are robust and not accidental. We conclude that incorporating temporal features not only improves model performance but does so consistently and significantly across different model architectures.
Comparison | McNemar's Statistic | P-value
RF vs MLP | 29610.04 | < 0.001
RF vs BiLSTM | 29152.00 | < 0.001
MLP vs BiLSTM | 658.44 | < 0.001
”
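For reference, a sketch of one pairwise test using statsmodels is shown below; `exact=False` selects the chi-square form of McNemar's test, which matches the large statistics reported in Table 5, and `y_test`, `rf_pred`, `mlp_pred` are placeholders for the held-out labels and model predictions.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(y_true, pred_a, pred_b):
    """McNemar's test on the disagreement table of two classifiers."""
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    table = [[np.sum(a_ok & b_ok), np.sum(a_ok & ~b_ok)],
             [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
    return mcnemar(table, exact=False, correction=True)

# res = mcnemar_compare(y_test, rf_pred, mlp_pred)  # e.g., RF vs. MLP
# print(res.statistic, res.pvalue)
```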
Comments 10: Report Computational Overhead: A crucial aspect for practical deployment is the model's efficiency. A new subsection in the "Experimental Analysis" should be dedicated to reporting:
o Training Time: The time taken to train each model.
o Inference Latency: The time required to classify a single flow or a window of flows.
o Resource Consumption: Peak CPU/GPU memory usage during training and inference on the specified hardware (RTX 4060).
Response 10: We appreciate the reviewer's suggestion to report computational overhead metrics such as training time, inference latency, and resource consumption; these are indeed critical aspects for real-world deployment. We have added Section 4.2.6 to the experimental section to demonstrate the time and resource efficiency of our method in practical applications and to provide an outlook for future work. The change is shown on page 17, Section 4.2.6.
“ 4.2.6. Overhead Analysis
Ensemble learning models have been effectively deployed in numerous applications, such as machine translation and object-recognition tasks. However, their computational and storage overhead restricts their deployment to high-end platforms. We therefore assess the overhead of TSE-APT in terms of training time, run-time memory usage, and inference latency.
All experiments were performed on a workstation with an Intel i7 processor, 32 GB of RAM, and an NVIDIA RTX 4060 GPU. The complete training process, including all three base models and the ensemble module, took approximately 840 seconds on the CIC-IDS2018 dataset; each base model was trained independently in about 411 seconds on average. While the total training time is not minimal, it is acceptable considering the dataset size and the model's high accuracy and F1 score. We will continue to improve the method toward reliable real-time monitoring of system traffic.
At inference time, classifying a single flow or flow sequence requires only 1-2 milliseconds, with GPU memory usage consistently under 2.2 GB, making the system lightweight enough for practical deployment. Compared with lighter models, our method offers a better balance between detection effectiveness and resource efficiency, making it suitable for real-world APT detection tasks where accuracy is paramount.”
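A sketch of how such latency and peak-memory figures could be measured for the PyTorch sub-models is given below; the warm-up and run counts are arbitrary choices, and this is one plausible measurement harness rather than the authors' exact procedure.

```python
import time
import torch

def measure_inference(model: torch.nn.Module, x: torch.Tensor,
                      device: str = "cuda", warmup: int = 10, runs: int = 100):
    """Average per-batch latency (ms) and peak GPU memory (GB)."""
    model, x = model.to(device).eval(), x.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels and caches
            model(x)
        torch.cuda.synchronize(device)
        t0 = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize(device)   # wait for queued GPU work to finish
    latency_ms = (time.perf_counter() - t0) / runs * 1000
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return latency_ms, peak_gb
```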
Comments 11: Broaden the Ablation Study: The ablation study is a strong feature, but could be more exhaustive. A table showing the performance drop when each of the following components is removed would be highly informative:
o The model without time-series features.
o The model without static features.
o The model without the self-attention mechanism.
Response 11: Thank you for your valuable suggestion regarding the ablation study. We agree that evaluating the performance impact of removing individual components strengthens the analysis. First, we analyze the results of the ablation experiment in detail; second, we present the results in a subtractive format, where each row removes one component from the full model. These changes can be found on page 15, paragraphs 4-6.
“ Excluding the self-attention mechanism leads to a further performance decline. The F1-score drops from 97.51% to 96.36%, and recall decreases to 94.56%. The self-attention mechanism allows the model to dynamically focus on the most relevant features; removing it may impair its ability to detect complex attack patterns.
In addition, removing the ensemble learning structure results in a slight drop in performance, with accuracy reduced to 96.78% and recall decreasing to 94.06%. This demonstrates that combining multiple models enables the system to benefit from different types of feature learning, thereby improving its robustness.
Finally, when the time-series features are removed, the model’s accuracy drops to 94.50%, and recall declines to 91.61%, indicating the critical role of temporal dynamics in identifying APT behaviors. Without these sequential features, the model struggles to capture temporal dependencies across flows, leading to reduced detection performance.
Modification | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) | FPR (%)
Full model (TSE-APT) | 97.32 | 99.26 | 96.23 | 97.51 | 0.69
- Self-attention | 96.92 | 99.34 | 94.56 | 96.36 | 0.67
- Ensemble learning | 96.78 | 98.93 | 94.06 | 96.25 | 0.77
- Time-series features | 94.50 | 96.29 | 91.61 | 95.89 | 1.61
”
Comments 12: Add Explainability and Interpretability: To build trust and facilitate operational adoption, the authors should include visualizations that explain the model's decisions. Techniques like SHAP (SHapley Additive exPlanations) or visualizations of the attention weights could show which features (e.g., IAT, FRC) are most influential in flagging a potential APT attack.
Response 12: Thank you for your valuable suggestion. We are grateful for it. As suggested by the reviewer, we have added SHAP interpretability experiments and details showing the importance and behavior of several features in the overall detection. This change can be found on page 16, Section 4.2.4.
“ To better understand how the model makes predictions, we applied SHAP to analyze the contribution of each feature.
As shown in Figure 7, features such as FRC and FV exhibit relatively high average SHAP values, indicating their significant role in distinguishing APT traffic. These features reflect flow-level behavior patterns, capturing anomalies in transmission consistency and variability.
In addition, Fwd Seg Size Min, Fwd Pkt Len Max, and Pkt Size Avg also appear among the top contributors. These features describe the size and spread of packets, which are important in network traffic interactions.
Finally, this SHAP-based analysis confirms the conclusions of our method and provides supporting evidence for the proposed combination of static and temporal features.”
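A sketch of the SHAP analysis for the tree-based sub-model is shown below; synthetic data stands in for the real flow features, and for the MLP/BiLSTM sub-models a model-agnostic explainer (e.g., shap.KernelExplainer) would be needed instead of TreeExplainer.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X)
# Depending on the SHAP version, per-class outputs come as a list or a 3-D array.
if isinstance(sv, list):
    sv = sv[1]                    # attributions for the attack class
elif sv.ndim == 3:
    sv = sv[:, :, 1]
shap.summary_plot(sv, X, max_display=20)   # top-20 features by mean |SHAP|
```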
Comments 13: Include Stronger Baselines: For a more convincing comparison, the authors should benchmark TSE-APT against at least one transformer-based and one graph-neural network-based APT detection model on the same dataset split. If this is not feasible, a detailed justification for their exclusion is needed.
Response 13: We sincerely appreciate the reviewer's suggestion to include comparisons with transformer-based and graph neural network (GNN)-based APT detection methods. We agree that such comparisons would further strengthen the empirical evaluation of our proposed model. However, Transformer and GNN models often require significant architectural modifications to accommodate tabular flow features and sequential data, including feature embedding, positional encoding, and graph-construction strategies; these changes involve extensive hyperparameter tuning and structural alignment. In this study, we therefore focused on classical, computationally accessible baselines (RF, MLP, BiLSTM) and their integration through ensemble learning. We plan to include Transformer- and GNN-based models in future work under a unified evaluation framework. This is discussed on page 18, paragraph 2.
“5.3. Discussion of other models:
In this study, we selected representative classical models RF, MLP, and BiLSTM as baseline methods and integrated them through a self-attention-based ensemble strategy. These models offer stable performance and computational efficiency on network attack data. In future work, we plan to incorporate and compare with advanced architectures such as Transformer-based and GNN-based models, which have shown promise in modeling temporal and structural dependencies in traffic data.”
Comments 14: Ambiguity in the Ensemble Mechanism: There is a significant point of confusion in the description of the ensemble model in Section 3.3. The paper first details a self-attention mechanism for dynamically weighting the base models. Immediately after, it introduces XGBoost as a meta-model for making a final prediction. It is unclear how these two components relate. Are they alternative approaches, or does the output of the self-attention layer feed into XGBoost? Clarifying the precise architecture of the final ensemble is crucial for reproducibility and understanding the method's core innovation.
Response 14: We appreciate the reviewer’s insightful comment. To address this ambiguity, we have revised Section 3.3.3 to explain the relationship between the self-attention mechanism and the XGBoost classifier. Specifically, in our ensemble architecture, the self-attention mechanism adaptively assigns weights to the predictions of base models based on input features and confidence scores. These weighted predictions are then fed into an XGBoost classifier, which serves as a meta-learner to make the final prediction. Thus, the attention mechanism enhances the reliability of the base model outputs, while XGBoost captures nonlinear interactions among them to improve classification performance. We believe this clarification improves both the transparency and reproducibility of our method. The change can be found on page 12, paragraph 4.
“ In our ensemble architecture, the self-attention mechanism adaptively assigns weights to the predictions of base models based on input features and confidence scores. These weighted predictions are then fed into an XGBoost classifier, which acts as a meta-learner to make the final decision. Thus, attention and XGBoost work in a cascaded manner: attention enhances the reliability of base model outputs, and XGBoost captures nonlinear interactions to improve final classification accuracy.”
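To make the cascade concrete, the sketch below shows one plausible way to feed attention-weighted base-model probabilities into an XGBoost meta-learner. The meta-feature layout (weighted predictions concatenated with the raw probabilities and weights) and all variable names are illustrative assumptions, not the exact implementation.

```python
import numpy as np
import xgboost as xgb

def meta_features(P, omega):
    """P: (n, 3) malicious-class probabilities from RF, MLP, and BiLSTM;
    omega: (n, 3) per-sample attention weights (rows sum to 1)."""
    return np.hstack([P * omega, P, omega])  # weighted preds plus raw inputs

# Hypothetical training/test arrays assumed to exist.
meta = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
meta.fit(meta_features(P_train, w_train), y_train)
y_final = meta.predict(meta_features(P_test, w_test))
```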
Comments 15: Clarity of Algorithm 1: The pseudo-code for the self-attention mechanism could be made clearer. Specifically:
The operation `γ ← FC(P)` (line 9) needs more detail. How is the probability vector `P` from the base models processed by a fully-connected layer?
The `⊕` operator in `C ← C ⊕ γ` (line 10) is undefined. Please specify if this is concatenation, element-wise addition, or another operation.
Response 15: Thank you for the above suggestions. As the reviewer correctly noted, we did not explain how the fully connected layer processes the probability vector, nor did we define the operator used in the algorithm. First, the probability vectors generated by the base models are fed into a fully connected layer to capture the relative confidence and consistency among the base-model predictions. Second, we clarify that the ⊕ operator in the algorithm denotes concatenation. The change can be found on page 11, paragraph 4, and Algorithm 1.
“ The self-attention mechanism is delineated in Algorithm 1. Given the input features X, the prediction P[m] of each base model, and the number of attention heads H, the output of Algorithm 1 is the predicted probability that the traffic is malicious. First, the features of a single flow are converted into Query, Key, and Value representations, and multi-head self-attention captures correlations across different semantic subspaces, yielding a global context representation C. Next, a confidence vector γ is extracted by feeding the probability vector P from the sub-models into an FC layer, which applies a linear transformation that projects P into a higher-dimensional embedding; this enables the model to learn a latent representation of the relative confidence and consistency among the base-model predictions. The two representations are then concatenated and passed to the weight generator, which outputs the normalized weights ω measuring the relative reliability of the three base models on the current sample. The final prediction Y is obtained as the weighted sum of P[m] with ω.”
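The module below is a minimal sketch of our reading of Algorithm 1, not the authors' code: it treats each flow as a length-1 sequence (a simplification), and all dimensions and layer names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    """Multi-head self-attention over flow features, an FC confidence vector
    from sub-model probabilities, concatenation (the ⊕ step), and
    softmax-normalized per-model weights."""
    def __init__(self, d_feat, d_model=64, n_heads=4, n_models=3):
        super().__init__()
        self.embed = nn.Linear(d_feat, d_model)   # maps X into the Q/K/V space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fc = nn.Linear(n_models, d_model)    # gamma <- FC(P)
        self.out = nn.Linear(2 * d_model, n_models)

    def forward(self, X, P):
        # X: (batch, d_feat) flow features; P: (batch, n_models) probabilities.
        h = self.embed(X).unsqueeze(1)            # length-1 sequence per flow
        C, _ = self.attn(h, h, h)                 # global context C
        gamma = self.fc(P)                        # confidence embedding
        fused = torch.cat([C.squeeze(1), gamma], dim=-1)   # C ⊕ gamma
        omega = torch.softmax(self.out(fused), dim=-1)     # normalized weights
        return (omega * P).sum(dim=-1), omega     # Y = sum_m omega[m] * P[m]
```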
Comments 16: Justification of the Dataset for APT Detection: The manuscript convincingly argues that APTs are distinct due to their long-term, multi-stage nature. However, the evaluation is performed on the CIC-IDS2018 dataset, which
comprises individual, well-known attack types like DDoS and XSS. The paper would be more persuasive if it included a clearer discussion on why detecting these specific traffic types serves as a valid proxy for detecting a full-fledged APT campaign.
Response 16: We gratefully thank the reviewer for the precious time spent making these constructive remarks. To address this point, we have added Section 5, "Discussion", in which Section 5.1 explains why the CIC-IDS2018 dataset is used and how future research should be extended to dedicated APT datasets. The change can be found on page 16, Section 5.
“ 5.1. Justification of the Dataset for APT Detection:
In this paper, the CIC-IDS2018 dataset is used instead of a dedicated APT attack dataset for the following reasons:
Unlike an APT attack, which is a long-term, multi-stage campaign designed to evade detection over an extended period, the attacks in the CIC-IDS2018 dataset are short-lived and easier to detect. First, the dataset covers a wide range of network behaviors, such as lateral movement and port scanning, which are also key and frequently used components of APT attacks. Second, the CIC-IDS2018 dataset consists of benign and malicious traffic generated by practical applications; it is feature-rich and well suited to feature extraction and to learning complex attack behavior patterns. Finally, we recognize that APT detection ultimately requires evaluation on datasets that simulate long-term, multi-stage stealth behavior. As part of our ongoing and future work, we plan to use APT-focused datasets such as DARPA TC or other DARPA datasets to more thoroughly validate the model's ability to handle multi-stage persistent threats.”
Point 1: Thorough English Editing: The manuscript requires a comprehensive edit for grammar, tense consistency, and sentence structure. Run-on sentences should be broken up.
Response 1: We sincerely thank the reviewers for their valuable suggestions. In response, we carefully revised the entire manuscript to improve grammar, maintain tense consistency, and enhance sentence structure, and we corrected misspelled words and incorrect tenses. Run-on sentences have been identified and appropriately divided or reorganized to ensure clarity and readability. We also enlisted the help of native English speakers to ensure that the quality of the language meets publication standards. All changes have been incorporated into the revision for your review.
Point 2: Consistent Acronym Usage: Acronyms like RF, MLP, BiLSTM, and APT must be defined at their first use and used consistently throughout the paper.
Response 2: Thank you for pointing this out. We have carefully reviewed the manuscript to ensure that all acronyms, including RF (Random Forest), MLP (Multilayer Perceptron), and BiLSTM (Bidirectional Long Short-Term Memory), are clearly defined at their first mention. After their initial definitions, we have consistently used the corresponding acronyms throughout the manuscript to maintain clarity and readability.
Point 3: Standard Section Ordering: The paper should be restructured to follow a conventional scientific format: Introduction, Related Work, Problem Formulation, Methodology, Experimental Setup, Results and Discussion, and Conclusion.
Response 3: Thank you for your helpful suggestion. In the revised manuscript, we have reorganized the sections to better align with the standard scientific structure. The current structure now follows the conventional order: Introduction, Related Work, Materials and Methods, Discussion, and Conclusion. We believe this new structure improves the logical flow and readability of the paper.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
I now believe the manuscript has the quality necessary for publication in our prestigious journal.