Article
Peer-Review Record

APT Attribution Using Heterogeneous Graph Neural Networks with Contextual Threat Intelligence

Electronics 2025, 14(23), 4597; https://doi.org/10.3390/electronics14234597
by Abdirahman Jibril Mead and Abdullahi Arabo *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 14 October 2025 / Revised: 12 November 2025 / Accepted: 17 November 2025 / Published: 24 November 2025
(This article belongs to the Special Issue AI in Cybersecurity, 2nd Edition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper constructs a tripartite heterogeneous graph linking APTs, TTPs, and CKC stages, embedding TTPs with SBERT and applying GraphSAGE for attribution. The structure is coherent and the experiments are informative, but the evaluation design and data transparency remain insufficient, weakening the methodological rigor and credibility for real-world application.

1. The prediction setup risks label leakage since APT nodes are classified while APT–TTP edges already encode ground-truth associations; report or time-based isolation is required for valid evaluation.
2. The CTI-HAL dataset lacks documentation of sources, volume, and sampling criteria, preventing reproducibility and fair comparison.
3. The CKC one-hot representation and temporal edge construction are ill-defined. Ablation experiments are necessary to verify real contribution.
4. Comparative claims are inconsistent, and the reported DeepOP F1 exceeds this work’s yet is claimed inferior; either align the tasks and datasets or qualify the comparison appropriately.
5. Interpretability analysis is superficial, relying on visualizations without case-level reasoning paths or subgraph evidence supporting explainable attribution.
6. System evaluation omits runtime complexity, scalability, and stability across time, leaving operational feasibility for SOC deployment unsubstantiated.

 

Author Response

Reviewer 1 – Comments and Responses

 

Comment 1 – Label Leakage and Dataset Split

The prediction setup risks label leakage since APT nodes are classified while APT–TTP edges already encode ground-truth associations; report or time-based isolation is required for valid evaluation.

Response:
We acknowledge this valuable observation. While our graph construction follows a campaign-level partition (80/20 split), we recognise that shared TTP nodes could potentially introduce partial information leakage. This is now explicitly discussed in Section 8 (Future Work), where we recommend and plan a time-aware retraining strategy that isolates campaigns chronologically to eliminate cross-edge dependencies. The reported results are unchanged; the limitation is now stated transparently.
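For illustration, the chronological isolation we plan could be sketched as follows; the campaign identifiers and dates are hypothetical placeholders, not entries from our corpus:

```python
from datetime import date

# Hypothetical sketch of a time-aware split: campaigns are ordered
# chronologically and the oldest 80% form the training set, so every
# test campaign strictly postdates the training window.
def temporal_split(campaigns, train_frac=0.8):
    """campaigns: list of (campaign_id, first_seen_date) tuples."""
    ordered = sorted(campaigns, key=lambda c: c[1])
    cut = int(len(ordered) * train_frac)
    train = [c[0] for c in ordered[:cut]]
    test = [c[0] for c in ordered[cut:]]
    return train, test

campaigns = [
    ("APT28-2019", date(2019, 3, 1)),
    ("APT29-2021", date(2021, 6, 15)),
    ("Lazarus-2018", date(2018, 11, 2)),
    ("APT41-2023", date(2023, 2, 9)),
    ("Turla-2020", date(2020, 8, 21)),
]
train_ids, test_ids = temporal_split(campaigns)
```

Because the split is by campaign rather than by node, no APT–TTP edge from a test campaign can appear during training.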

 

Comment 2 – Dataset Documentation (CTI-HAL vs APTNotes)

The CTI-HAL dataset lacks documentation of sources, volume, and sampling criteria, preventing reproducibility and fair comparison.

Response:
We thank the reviewer for identifying this. The dataset description has been fully revised to clarify that the final corpus used is APTNotes, not CTI-HAL, which was only used in preliminary trials. Section 4.1 (Feature Extraction) and Table 2 now document dataset provenance, collection period, number of APT groups, TTP nodes, and CKC stages. Licensing information and temporal coverage (2018–2024) have also been included to ensure reproducibility and transparency.

 

 

 

Comment 3 – CKC Representation and Temporal Edges

The CKC one-hot representation and temporal edge construction are ill-defined. Ablation experiments are necessary to verify real contribution.

Response:
We appreciate the reviewer’s feedback on clarifying the contribution of CKC semantics and temporal relations. The paper now explicitly states that each TTP vector is enriched with both its semantic embedding and one-hot CKC stage encoding, ensuring lifecycle-aware differentiation of techniques. This enhancement enables the model to reason over procedural stages as well as behavioural similarity.

The effectiveness of this lifecycle-aware design is reflected in Figure 6, where the GraphSAGE variant, incorporating CKC-integrated TTP embeddings, achieves the highest Macro-F1 (0.842) and Accuracy (0.850). These results empirically confirm that embedding procedural context improves attribution performance compared with baselines lacking CKC semantics.
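As a minimal sketch of this feature construction (a toy 4-dimensional vector stands in for the 384-dimensional SBERT embedding; stage names follow the standard seven-stage Cyber Kill Chain):

```python
# Sketch of the lifecycle-aware TTP feature: a semantic embedding is
# concatenated with a one-hot encoding of the technique's CKC stage.
CKC_STAGES = [
    "reconnaissance", "weaponization", "delivery", "exploitation",
    "installation", "command_and_control", "actions_on_objectives",
]

def ttp_feature(semantic_embedding, ckc_stage):
    one_hot = [0.0] * len(CKC_STAGES)
    one_hot[CKC_STAGES.index(ckc_stage)] = 1.0
    # Final dimension = len(semantic_embedding) + 7.
    return list(semantic_embedding) + one_hot

# Toy 4-d embedding in place of the real 384-d SBERT vector.
vec = ttp_feature([0.12, -0.40, 0.55, 0.08], "delivery")
```

With the real 384-dimensional embedding, each TTP node carries a 391-dimensional feature vector that encodes both behavioural similarity and lifecycle position.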

 

Comment 4 – Comparative Analysis and F1 Discrepancy

Comparative claims are inconsistent, and the reported DeepOP F1 exceeds this work’s yet is claimed inferior; either align tasks and datasets or qualify the comparison appropriately.

Response:
We appreciate this clarification. DeepOP’s higher F1 (0.894) corresponds to a different task (next-step technique prediction) on a structured, closed-world dataset, whereas our study addresses open-world, actor-level attribution on unstructured CTI. This has now been explicitly noted in Section 7 and Table 5 to ensure a fair and transparent comparison.

 

Comment 5 – Interpretability and Explainability

Interpretability analysis is superficial, relying on visualisations without case-level reasoning paths or subgraph evidence supporting explainable attribution.

Response:
We agree that deeper interpretability will strengthen the framework. Section 8 (Future Work) now recommends integrating GNNExplainer and attention-based visualisations to trace top-k TTP and CKC paths influencing APT attribution decisions. We also propose an optional abstain mechanism for uncertain classifications in SOC workflows.
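The abstain mechanism proposed above could take the following form; the threshold of 0.6 and the label set are illustrative assumptions, not values from the paper:

```python
import math

# Hypothetical sketch of a confidence-gated abstain option for SOC use:
# softmax the classifier's logits and defer to a human analyst whenever
# the top-class probability falls below the threshold.
def predict_or_abstain(logits, labels, threshold=0.6):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]          # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "ABSTAIN", probs[best]
    return labels[best], probs[best]

labels = ["APT28", "APT29", "Lazarus"]
confident = predict_or_abstain([4.0, 0.5, 0.2], labels)   # clear margin
uncertain = predict_or_abstain([1.0, 0.9, 0.8], labels)   # near-tie
```

Pairing this gate with calibrated probabilities would let analysts tune the threshold against their tolerance for false attributions.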

Comment 6 – System Evaluation and Scalability

System evaluation omits runtime complexity, scalability, and stability across time, leaving operational feasibility for SOC deployment unsubstantiated.

Response:
We appreciate this observation. While the current work focuses on attribution accuracy and interpretability, future iterations will incorporate runtime profiling, scalability metrics, and integration into SOC pipelines to evaluate real-world performance.

 

Reviewer 2 – Comments and Responses

 

Comment 1 – Dataset Provenance and Train/Test Split

The paper references CTI-HAL and sources like ThreatFox/MISP/OTX but does not specify exact sources, time windows, licensing, or ATT&CK/CKC mapping procedures. The 80/20 train/test split may risk leakage if reports from the same campaign share TTP–TTP edges or nodes.

Response:
We appreciate this important clarification. The final dataset used is APTNotes (2018–2024), not CTI-HAL, as detailed in Section 4.1 and Table 2. ThreatFox and similar feeds were used only for enrichment validation. We now explicitly discuss possible edge-level leakage and outline a future time-based isolation protocol to fully eliminate shared campaign dependencies (Section 8).

 

Comment 2 – SBERT Variant and Reproducibility

Specify the SBERT variant, tokenizer version, graph-construction script, and random seeds; publish code or an artifact (even anonymized) with config files and a script that rebuilds the tripartite graph from raw CTI.

Response:
We have clarified these implementation details in Section 4.1 and Section 4.3. The SBERT model used is all-MiniLM-L6-v2 (384-dimensional) with Hugging Face Transformers 4.41, and random seeds were fixed at 42 for reproducibility. The graph-construction notebook and anonymised dataset are available upon request for verification purposes.
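A minimal illustration of the fixed-seed protocol (seed 42); in the actual pipeline the same seed would also be applied to numpy and torch:

```python
import random

# With an identical seed, a pseudo-random draw replays exactly,
# which is the property the reproducibility protocol relies on.
def sample_run(seed=42, n=5):
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

run_a = sample_run()
run_b = sample_run()
# run_a == run_b: the experiment is bit-for-bit repeatable.
```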

 

 

 

Comment 3 – Explainability and CKC Context

The paper claims explainability via CKC context; consider adding attribution explanations per prediction (e.g. top-k TTPs and CKC paths that influenced the APT label).

Response:
We have expanded the discussion in Section 8 to recommend attribution methods using GNNExplainer or attention-weight visualisations to identify the most influential TTPs and CKC paths contributing to each classification. This directly addresses explainability requirements for SOC adoption.

 

Comment 4 – Figures and Captions

Several figures are referenced (1–10); ensure they are present with legible axes, units, and captions.

Response:
All figures have been verified for presence, clarity, and consistent captioning. Figure labels and references were synchronised, and redundant placeholders removed.

 

Comment 5 – Editorial and Formatting Issues

A few hyphenation breaks and placeholder front-matter tags (e.g. Received/Revised/DOI) were observed.

Response:
All typographical inconsistencies, hyphenation breaks, and placeholder tags have been corrected for final production submission.

 

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript proposes a heterogeneous GNN for APT attribution built on a tripartite graph. TTP nodes use SBERT (384-d) embeddings concatenated with a one-hot CKC stage vector; message passing uses GraphSAGE. On a dataset referred to as CTI-HAL, the best model reports Accuracy ≈ 85% and Macro-F1 ≈ 0.84, outperforming LSTM, GAT and a GAT+SAGE hybrid; the paper also contrasts against external baselines (APT-MMF, DeepOP, CSKG4APT, GRAIN). The authors position the work as explainable and SOC-ready.

There are some things that should be addressed:

  1. The paper references CTI-HAL and sources like ThreatFox/MISP/OTX but does not fully specify the exact sources, time windows, licensing, the ATT&CK/CKC mapping procedure, deduplication, or how many reports exist per APT. Train/val/test is 80/20 on APT nodes, but it is unclear whether reports from the same campaign can leak information across splits via shared TTP–TTP temporal edges or TTP nodes (classic graph leakage). Please provide a temporal split (train on older, test on newer) with campaign-level separation, and release a data card with per-class counts, time ranges, and preprocessing steps.
  2. Specify the SBERT variant, tokenizer version, graph-construction script, and random seeds; publish code or an artifact (even anonymized) with config files and a script that rebuilds the tripartite graph from raw CTI. Include a model card describing failure modes (e.g., false flags).
  3. The paper claims explainability via CKC context; consider adding attribution explanations per prediction (e.g., top-k TTPs and CKC paths that most influenced the APT label, via GNNExplainer/attention weights or gradient-based saliency). Include calibrated confidences and an abstain option for SOC workflows.
  4. Several figures are referenced (1–10); ensure they are present with legible axes, units, and captions (some appear as placeholders in the submitted PDF).
  5. A few hyphenation breaks (“procedu- ral”), missing spaces, and placeholder front-matter tags (Received/Revised/DOI) remain. Clean these before production.


Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for the careful and thoughtful revision. The revised manuscript shows clear improvement. The following aspects could still be improved.
1. Section 4 remains somewhat lengthy. Consider simplifying the descriptions to enhance readability.
2. The presentation of Figure 9 is inappropriate. Consider converting it into a table to display the results more clearly.

 

Author Response

We appreciate the reviewer’s feedback on the length of Section 4 (Methodology) and the presentation of Figure 9.
The section has been condensed for clarity while retaining all essential figures, equations, and technical details.
Redundant implementation text was merged, and transitions were simplified for a smoother flow from dataset preparation to model training.
Figure 9 (classification report) has been converted into Table 4, presenting per-class Precision, Recall, and F1-scores in a clear numerical format consistent with MDPI standards.
These revisions improve readability, coherence, and visual consistency without removing any methodological information.

 
