Review Reports
- Jaehwan Baek,
- Jeonghoon O and
- Seungwoo Jeong
- et al.
Reviewer 1: István Üveges
Reviewer 2: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents an innovative and well-structured study that explores the optimization of transformer-based models for multi-label text classification within the domain of electronic product reviews. The integration of BERTopic-based features is particularly noteworthy, as it provides a novel approach to enriching document representations with topic-level semantics. Overall, the paper contributes valuable methodological insights to the field of applied NLP. Below are my suggestions for the authors to further improve their paper.
- The phrase “Preprocessing AND tokenization” is somewhat awkward. In Algorithm 1, step (1) lists “preprocessing (tokenization, stopword removal)”, which suggests that tokenization is already included as part of preprocessing, so the redundancy in this phrasing should be avoided. Furthermore, it is not specified whether standard normalization procedures such as lowercasing were applied during TF-IDF preprocessing. Lowercasing is a common practice to ensure consistent term representation. Similarly, stemming or lemmatization could have been considered to reduce morphological variants of words to their base forms. Including this information would improve the transparency and reproducibility of the preprocessing pipeline (an illustrative sketch of such a pipeline follows this list).
- The expression “2048-dimensional nodes” is understandable but slightly unconventional. The more standard terminology in neural network descriptions would be “2048 units” or “2048 neurons”.
- In the description of Algorithm 2, the authors state that each document is divided into 256-token segments with 50 % overlap to minimize context disruption. This fixed-length sliding-window approach indeed mitigates information loss and allows for smoother contextual transitions across segments. However, since segmentation is not aligned with sentence boundaries, some linguistic or discourse-level coherence may still be lost when sentences are split across windows. Employing sentence-boundary-aware or adaptive windowing strategies could potentially improve semantic consistency and interpretability.
- The authors note that 77.38 % of the documents were processed within a single window, indicating that in most cases the model’s predictions were based on full-text inputs rather than aggregated across multiple segments. This implies that the mean-pooling mechanism affected only a minority of instances. It would be useful if the authors could comment on whether the remaining multi-window cases exhibited greater local variation in class evidence, as such variation might affect the robustness and stability of document-level predictions (a windowing and mean-pooling sketch is given after this list).
- In Table 1, the “Total” row contains Korean characters (“총합 1490개”), which appear to be remnants from the software environment used to generate the table. For clarity and consistency in an English-language manuscript, it is recommended to replace these with an English equivalent such as “Total: 1490 documents (100 %)”.
- The inline LaTeX expression (lines 348–353) for the fused input vector does not render correctly, with part of the equation appearing in red. This seems to be a formatting or compilation issue rather than intentional emphasis and should be corrected for proper typesetting.
- The integration of BERTopic-derived features into the TF-IDF + MLP pipeline is innovative but somewhat unconventional. While this hybridization likely enriches the representation space with thematic information, representing topics as one-hot IDs instead of continuous topic-distribution vectors may reduce semantic expressiveness. It would be valuable for the authors to clarify how this integration influenced generalization performance and whether continuous topic-probability vectors were considered as an alternative representation (a sketch contrasting the two representations follows this list).
- The description of the label-thresholding procedure is somewhat ambiguous. Earlier sections mention a default threshold of τ = 0.5, but it is unclear whether this same value was consistently applied across all models or if thresholds were tuned based on validation data. Providing these details would clarify how the authors managed the trade-off between recall and precision in the multi-label classification setting.
- The description in Section 3.4 could be more explicit regarding the label-assignment process. Although the dataset is multi-label, the phrasing “that label was finally assigned to the document” could be misinterpreted as implying a single-label output. It would help readers if the authors explicitly confirmed that thresholding is applied independently to each label, thereby allowing multiple tactic labels to be assigned to a single document (see the thresholding sketch after this list).
- The authors explicitly state that recall is particularly important in their application context, as their goal is to minimize missed relevant labels, that is, to reduce false negatives in cyber threat detection. However, they use the F₀.₅ metric for evaluation, which gives greater weight to precision than to recall. This introduces a conceptual inconsistency: if recall is the primary concern, a recall-oriented metric such as F₂ would be more appropriate.
Throughout the paper, the authors emphasize that Model 4 is optimized for precision, whereas Model 5 seeks a balance between precision and recall. In this context, the use of F₀.₅ may be relevant for evaluating precision-focused models. Nevertheless, if the overarching objective is to maximize recall, the choice of evaluation metric should either be adjusted accordingly (e.g., using F₂) or explicitly justified to reflect a precision-oriented evaluation goal (the general Fβ definition is recalled below for reference).
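To make the preprocessing point above concrete, the following is a minimal sketch of a TF-IDF pipeline with explicit lowercasing, stopword removal, and lemmatization. It assumes a scikit-learn/NLTK setup and is purely illustrative; it is not the authors' actual code.

```python
import nltk  # may require: nltk.download('punkt'); nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    # Tokenize and reduce each token to its lemma (base form).
    return [lemmatizer.lemmatize(tok) for tok in nltk.word_tokenize(text)]

# lowercase=True normalizes case before tokenization;
# stop_words="english" removes common function words after tokenization.
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    tokenizer=lemma_tokenizer,
)
X = vectorizer.fit_transform(["Attackers exfiltrated credentials via phishing emails."])
```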
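Regarding the sliding-window and mean-pooling comments, the sketch below shows fixed-length 256-token segmentation with 50 % overlap and document-level aggregation of per-window scores. Function names and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def split_into_windows(token_ids, window=256, overlap=0.5):
    """Fixed-length windows with the given fractional overlap."""
    stride = max(1, int(window * (1 - overlap)))  # 128 tokens for 50 % overlap
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the document
    return windows

def document_scores(per_window_logits):
    """Mean-pool per-window label scores into one document-level vector."""
    return np.mean(np.asarray(per_window_logits), axis=0)

# Example: a 600-token document yields windows starting at 0, 128, 256, 384.
doc = list(range(600))
print([len(w) for w in split_into_windows(doc)])  # [256, 256, 256, 216]
```

A document of 256 tokens or fewer produces a single window, which corresponds to the 77.38 % single-window case noted above; only longer documents pass through the mean-pooling step.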
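The alternative topic representation raised above can be stated compactly. The sketch below contrasts a one-hot topic ID with a continuous topic-probability vector when fused with a TF-IDF vector by concatenation; the dimensions and variable names are assumptions for illustration only.

```python
import numpy as np

n_topics = 10
tfidf_vec = np.random.rand(2048)  # assumed TF-IDF dimensionality (illustrative)

# (a) One-hot topic ID: only the single assigned topic carries signal.
topic_id = 3
one_hot = np.zeros(n_topics)
one_hot[topic_id] = 1.0
fused_onehot = np.concatenate([tfidf_vec, one_hot])     # shape (2058,)

# (b) Continuous topic distribution: every topic contributes proportionally,
#     preserving graded thematic information (e.g., BERTopic probabilities).
topic_probs = np.array([0.02, 0.05, 0.10, 0.55, 0.08, 0.05, 0.05, 0.04, 0.03, 0.03])
fused_probs = np.concatenate([tfidf_vec, topic_probs])  # shape (2058,)
```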
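On the thresholding question, independent per-label thresholding in a multi-label setting can be summarized in a few lines. The τ = 0.5 default is the value mentioned in the manuscript; the label names and probabilities below are illustrative.

```python
import numpy as np

tau = 0.5  # default decision threshold mentioned in the manuscript
label_names = ["Initial Access", "Execution", "Persistence", "Exfiltration"]  # illustrative

# Per-label sigmoid probabilities for one document (illustrative values).
probs = np.array([0.81, 0.12, 0.64, 0.07])

# The threshold is applied to each label independently, so a document can
# receive zero, one, or several tactic labels.
assigned = [name for name, p in zip(label_names, probs) if p >= tau]
print(assigned)  # ['Initial Access', 'Persistence']
```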
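For reference, the general Fβ definition makes the precision/recall weighting explicit: β < 1 favours precision (F₀.₅), while β > 1 favours recall (F₂).

```latex
F_{\beta} = (1 + \beta^{2}) \cdot \frac{P \cdot R}{\beta^{2} P + R},
\qquad
F_{0.5} = \frac{1.25\, P R}{0.25\, P + R},
\qquad
F_{2} = \frac{5\, P R}{4 P + R}
```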
Author Response
We would like to express our sincere gratitude to the reviewer for the valuable and constructive comments.
The attached document contains the authors’ detailed responses to each comment, including explanations, the corresponding revisions made in the manuscript, and the specific page numbers where each change has been implemented.
(Please note that all page numbers and revision details mentioned in this response refer to the Revised_Manuscript version.)
Once again, we sincerely thank the reviewer for the thoughtful and insightful feedback, which greatly helped us improve the quality and clarity of our work.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
1. The manuscript claims performance improvements using hybrid models, yet it fails to provide sufficient methodological clarity. Key implementation details, such as feature dimensionality, fusion strategy (e.g., early vs. late concatenation; a sketch contrasting the two appears after this list), and classifier architecture, are either omitted or described only superficially.
2. The rationale behind selecting TF-IDF, ModernBERT, and BERTopic as the feature sources is weak:
a. Why was ModernBERT preferred over other transformer variants?
b. What empirical or theoretical basis supports the combination of these specific representations?
3. While micro-level precision and F₀.₅ scores offer valuable insights, they obscure performance differences across tactical categories. Given the inherent class imbalance within the ATT&CK framework, macro-level insights and category-specific analysis are crucial. Incorporating confusion matrices or per-category precision metrics is recommended to enhance reliability (see the per-class reporting sketch after this list).
4. In practice, CTI reports often contain ambiguous or overlapping descriptions of adversary behavior. The study does not address how semantic drift or tactic overlap was handled during annotation or classification. This omission raises concerns about label validity and model robustness.
5. The assertion that the proposed models “reduce analyst workload” in SOC environments is unsubstantiated. No user studies, deployment trials, or feedback from practitioners are presented. The industrial relevance remains hypothetical and should be framed accordingly.
6. Phrases like “significantly enhancing detection performance” and “demonstrated practical applicability” read more like marketing than academic argumentation. The tone should be revised to reflect scientific rigor and cautious interpretation of results.
7. Some related works are recommended for citation:
a. https://doi.org/10.1109/ACCESS.2021.3107579
b. https://doi.org/10.1109/CCWC60891.2024.10427746
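To clarify the distinction raised in point 1, the following sketch contrasts early fusion (concatenating feature vectors before a single classifier) with late fusion (combining the outputs of separately trained classifiers). All names, dimensions, and values are illustrative assumptions, not the manuscript's implementation.

```python
import numpy as np

# Illustrative feature blocks for one document.
tfidf_feat = np.random.rand(2048)  # lexical TF-IDF features (assumed dimensionality)
bert_feat = np.random.rand(768)    # transformer pooled embedding (assumed)
topic_feat = np.random.rand(10)    # topic representation (assumed)

# Early fusion: concatenate the representations, then feed a single classifier.
early_input = np.concatenate([tfidf_feat, bert_feat, topic_feat])  # shape (2826,)
# probs_early = mlp.predict_proba(early_input)  # hypothetical single model over the fused vector

# Late fusion: each representation gets its own classifier; the per-label
# probabilities are combined afterwards (here by simple averaging).
probs_per_model = np.array([
    [0.80, 0.10, 0.60],  # classifier trained on TF-IDF features
    [0.70, 0.20, 0.55],  # classifier trained on transformer embeddings
    [0.60, 0.15, 0.70],  # classifier trained on topic features
])
probs_late = probs_per_model.mean(axis=0)  # [0.70, 0.15, 0.6167]
```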
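And for point 3, a minimal sketch of how macro-averaged and per-tactic metrics could complement the micro-level scores, using scikit-learn; the labels and predictions are illustrative.

```python
import numpy as np
from sklearn.metrics import classification_report, fbeta_score

tactics = ["Initial Access", "Execution", "Exfiltration"]  # illustrative subset

# Multi-label indicator matrices (rows = documents, columns = tactics), illustrative.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Per-tactic precision/recall/F1 plus macro averages, which expose
# weak categories that micro-averaging can hide.
print(classification_report(y_true, y_pred, target_names=tactics, zero_division=0))
print("macro F0.5:", fbeta_score(y_true, y_pred, beta=0.5, average="macro"))
print("micro F0.5:", fbeta_score(y_true, y_pred, beta=0.5, average="micro"))
```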
Author Response
We would like to express our sincere gratitude to the reviewer for the valuable and constructive comments.
The attached document contains the authors’ detailed responses to each comment, including explanations, the corresponding revisions made in the manuscript, and the specific page numbers where each change has been implemented.
(Please note that all page numbers and revision details mentioned in this response refer to the Revised_Manuscript version.)
Once again, we sincerely thank the reviewer for the thoughtful and insightful feedback, which greatly helped us improve the quality and clarity of our work.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I thank the Authors very much for their thorough and well-argued replies. After reading your response, I think you have successfully addressed all the issues I raised, including the need for clearer methodological explanations, terminological precision, and justification of your evaluation choices (for example, regarding preprocessing, the use of overlapping windows, and the choice of the F₀.₅ metric).
I also appreciate that you did not only revise the text but made the reasoning behind your decisions transparent. The additions have made the manuscript clearer and more reproducible, and the revised tables and figures also improve the overall presentation.
Reviewer 2 Report
Comments and Suggestions for Authors
All problems have been addressed.