Article
Peer-Review Record

Automated Mapping of Common Vulnerabilities and Exposures to MITRE ATT&CK Tactics

Information 2024, 15(4), 214; https://doi.org/10.3390/info15040214
by Ioana Branescu 1, Octavian Grigorescu 1 and Mihai Dascalu 1,2,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 4 March 2024 / Revised: 5 April 2024 / Accepted: 8 April 2024 / Published: 10 April 2024
(This article belongs to the Special Issue Advances in Cybersecurity and Reliability)

Round 1

Reviewer 1 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

The latest version of the paper has made some improvements based on previous feedback. However, there are a few additional comments that should be addressed to continue enhancing the quality of the paper. Please find the attached review report. 

Comments for author File: Comments.pdf

Comments on the Quality of English Language


Author Response

The latest version of the paper has made some improvements based on previous feedback. However, there are a few additional comments that should be addressed to continue enhancing the quality of the paper. Please review the following comments and ensure they are properly addressed in the paper for further improvement.
Response:
We thank the reviewer for the suggestions and thorough assessment. We deeply appreciate your time and effort. We have addressed all comments to the best of our ability.

Comments:
1. The paper shows redundancy in the use of full terms and their abbreviations. For example, 'Common Vulnerabilities and Exposures (CVE)' appears in both the abstract and the introduction (Page 1). I recommend introducing the full term first and then using its abbreviation thereafter for clarity and consistency.
Response:
We appreciate the constructive suggestion. We introduced the full term first and then used its abbreviation for all technical terms. This change improved the readability and clarity of the paper. Examples of modified terms: CVE, DoS, MLM.

2. Please revise the organization paragraph of the paper. The first section is already included in this introduction section. Replace 'First' with 'Second' in this paragraph (Page 2).
Response
We have replaced 'First' with 'Second' in the organization paragraph of the Introduction section.

3. Figure 4 should be converted into a table for improved clarity.
Response
We have converted Figure 4 into a table (see Table 3) in section 3.2.3.

4. The paper reports precision, recall, and F1 score only. It is therefore recommended to include additional performance metrics, such as accuracy and ROC, to thoroughly validate the models and support the in-depth performance analysis claimed as a contribution (Page 2).

Response
We thank the reviewer for the suggestion, and we appreciate the feedback. We have now added the weighted accuracy score, computed from the confusion matrices, for the best model at line #398. Note that we cannot report accuracy scores for all experiments in the main results table (Table 5) because not all of them were saved, and rerunning every experiment solely for this purpose would clutter the results without significant added value. We have also added a new column to Table 6 presenting the accuracy per tactic.
Regarding AUC-ROC, we hope the reviewer will understand that rerunning all configurations multiple times would have been highly impractical, and reporting ROC curves for multi-class, multi-label classification would have generated a large number of sub-figures. As such, we opted to report accuracy alongside the F1 scores. Thank you kindly for your understanding.
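For clarity, the sketch below illustrates how such per-tactic and weighted accuracy values can be derived from binary prediction matrices; it is a minimal illustration using NumPy with placeholder data, not the exact evaluation code used in the paper.

```python
# Minimal illustration (placeholder data, not the paper's evaluation code):
# per-tactic accuracy and a support-weighted accuracy for multi-label predictions.
import numpy as np

# Binary indicator matrices of shape (n_samples, n_tactics); in the paper there
# would be 14 columns, one per ATT&CK tactic.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Accuracy per tactic: fraction of CVEs whose binary label for that tactic is correct.
per_tactic_acc = (y_true == y_pred).mean(axis=0)

# Weighted accuracy: weight each tactic by its support (number of positive instances),
# mirroring the weighted F1 score already reported.
support = y_true.sum(axis=0)
weighted_acc = np.average(per_tactic_acc, weights=support)

print(per_tactic_acc, round(weighted_acc, 3))
```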

5. All models, including CyBERT, SecBERT, SecRoBERTa, TARS, and T5, were trained and tested on a single dataset. Why was a standard dataset not considered for benchmarking purposes?
Response
We thank the reviewer for the question. There is no standard dataset available for this particular task, unlike other domains with well-established benchmarks. Building a comprehensive dataset is therefore crucial, and our dataset is one of the main contributions of this paper. To create this version, we expanded our previous dataset, added data from ENISA, and then manually labeled more entries to ensure a minimum of 170 instances per tactic.

6. What measures were implemented to verify that the reported performance is not due to underfitting on the dataset utilized?
Response
We thank the reviewer for the question and suggestion. Taking into account the dataset's characteristics, several measures were implemented to ensure that the reported performance was not merely a result of underfitting the dataset. First, we employed cross-validation methods (for both training and validation) to assess the model's generalization performance across different subsets of the dataset. 
We conducted sensitivity analyses and robustness tests (i.e., experiments varying the number of epochs and parameters) to evaluate the model's performance under various conditions and to ensure that it could generalize well to unseen data beyond the training set.
Moreover, we carefully selected evaluation metrics that are sensitive to the nuances of the task and robust to the challenges posed by imbalanced, short-text data. For instance, we relied on the weighted F1 score, which is well suited for evaluating classification performance under class imbalance and varying sample sizes.
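As an illustration of the evaluation protocol described above, the following sketch shows a cross-validated, weighted-F1 evaluation loop; it assumes scikit-learn and a generic classifier exposing fit/predict on multi-label indicator matrices, and is not the authors' actual training pipeline.

```python
# Hedged sketch of cross-validated evaluation with the weighted F1 score
# (assumes a generic fit/predict classifier; not the actual pipeline).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def cross_validated_f1(model_factory, X, Y, n_splits=5, seed=42):
    """Return the mean and standard deviation of the weighted F1 across folds."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True,
                                    random_state=seed).split(X):
        model = model_factory()                  # fresh model for every fold
        model.fit(X[train_idx], Y[train_idx])    # Y: multi-label indicator matrix
        preds = model.predict(X[val_idx])
        # Weighted F1 accounts for the class imbalance across the 14 tactics.
        scores.append(f1_score(Y[val_idx], preds,
                               average="weighted", zero_division=0))
    return float(np.mean(scores)), float(np.std(scores))
```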

A suggestion to include the following publications in your improved paper.
M. Ishaque, M. G. M. Johar, A. Khatibi, and M. Yamin, "Hybrid deep learning based intrusion detection system using Modified Chicken Swarm Optimization algorithm," in ARPN Journal of Engineering and Applied Sciences, vol. 18, no. 14, pp. 1707-1718, Sep. 30, 2023. [Online]. Available: https://doi.org/10.59018/0723212
Response
We appreciate the comprehensive review and this constructive suggestion, and we thank the reviewer for it. The suggested paper shares ideas and findings with our work. We have added a paragraph (starting at line #25) and cited the paper (line #30) in the Introduction, which helped us build a clearer overall picture:
“One approach to defending against such attacks involves determining if a given action is legitimate or malicious, also known as intrusion detection, which often incorporates Machine Learning (ML) techniques. Various methodologies exist for this purpose: one focuses on real-time defense by analyzing data such as networking traffic or system calls from the time the attack is attempted or even ongoing; alternatively, another approach involves analyzing different sources of data left after the attack (e.g., syslog records) [2]. In our present investigation, our emphasis lies in building a tool for security auditing purposes, which anticipates the potential impacts that a vulnerability or series of vulnerabilities may have upon a system, even in the absence of an actual attack. A strong understanding of vulnerabilities is crucial in creating an efficient defense against such threats.”

Reviewer 2 Report (Previous Reviewer 3)

Comments and Suggestions for Authors

This is good research with comprehensive explanations and detailed discussion. However, there are some flaws to be solved:

1) In line #426, "it confuses these tactics with other tactics": how can these problems be solved?

2) In line #459, "manually label more data": this is not a sensible solution. Thus, how can the mentioned problem be solved sensibly?

3) In lines #480 to #484, "want to explore whether taking into account any other type of data or meta-data...." and "take advantage of the significant number of unlabeled CVEs...": the authors have honestly pointed out these important points, which are crucial to the success of this research topic. Should the authors include these to increase the significance of the project's contribution?

 

Author Response

This is good research with comprehensive explanations and detailed discussion. However, there are some flaws to be solved:
Response
We appreciate the comprehensive review and this constructive suggestion, and we thank the reviewer for them. We also appreciate the time and effort the reviewer dedicated to reviewing our work. We have addressed all comments and answered them in detail.

1) In line #426, "it confuses these tactics with other tactics": how can these problems be solved?
Response
We thank the reviewer for the question and suggestion. As we mentioned in the paper, some tactics depend on methodology and can be conceptually related. We have added explanations and potential solutions for the most commonly confused ones: Defense Evasion and Impact. 
For the confusion with Defense Evasion, we have added the following fragment at line #422: “For the confusion with Defense Evasion, a more balanced dataset would be necessary because Defense Evasion tends to dominate and appears even when it's not applicable.” 
For the confusion with the Impact tactic, we have added the following fragment at line #442: “For the confusion with the Impact tactic (as stated by MITRE, Impact consists of techniques that adversaries use to disrupt availability or compromise integrity by manipulating business and operational processes), we infer that reaching Impact can occur following the malicious acquisition of credentials, suggesting a causal relationship between the two tactics. Providing more examples of Impact, especially ones not related to credentials since it's one of the rarer classes, would be beneficial in avoiding confusion.”

2) In line #459, "manually label more data": this is not a sensible solution. Thus, how can the mentioned problem be solved sensibly?
Response:
We appreciate the question, and we thank the reviewer. Indeed, it is challenging to significantly expand the dataset with only a team of three individuals. We have added the following paragraph with explanations at line #462:
“We have extended the initial dataset and integrated data from ENISA (details in section \ref{corpus}). We have observed that over time, other organizations (e.g., ENISA) have publicly released precisely the data we needed. Hence, it's possible that this could occur again in the future, allowing us to incorporate new data from such sources. Another approach is to explore the semi-supervised approach proposed in future work, which leverages the abundance of unlabeled CVEs.”

3) In lines #480 to #484, "want to explore whether taking into account any other type of data or meta-data...." and "take advantage of the significant number of unlabeled CVEs...": the authors have honestly pointed out these important points, which are crucial to the success of this research topic. Should the authors include these to increase the significance of the project's contribution?
Response
We appreciate the question and thank the reviewer. Indeed, we have mentioned exploring alternative data types or metadata, as well as leveraging unlabeled CVEs. Incorporating these aspects could significantly augment the project's contribution; however, we plan to address these opportunities in future work. As outlined in the future work section, several promising directions warrant consideration. First, expanding the dataset manually offers a direct pathway. Second, investigating the inclusion of various types of data or metadata, such as severity scores, during training holds promise for enhancing model performance.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper addresses the task of understanding and categorizing vulnerabilities in the dynamic cybersecurity landscape, emphasizing the necessity for structured and uniform cybersecurity knowledge systems. The MITRE Corporation has established two influential sources: the Common Vulnerabilities and Exposures (CVE) list, dedicated to identifying and addressing software vulnerabilities, and the MITRE ATT&CK Enterprise Matrix, which serves as a framework for categorizing adversary actions and defense strategies. Notably, these two sources are presently not directly linked.

The paper introduces an automated approach to map CVEs to the corresponding MITRE ATT&CK tactics, employing state-of-the-art Large Language Models to tackle the multi-label classification problem. Additionally, the study conducts a comprehensive error analysis to improve understanding of the models' performance and limitations.
Strengths
● The code is publicly available on GitHub
● Performance is evaluated through the weighted F1 score due to dataset imbalance
● The resulting model has a small size and is easy to deploy
Weaknesses
● All the architectures under consideration exhibit poor performance, which undermines the potential adoption of this approach in real-world scenarios
● This work shares only minor differences from the authors' prior research. Specifically, the previous study concentrated on classifying techniques (31 in this dataset, excluding less populated ones), while the current work is centered on classifying tactics (14 in this dataset). As far as I understand, each tactic encompasses various corresponding techniques. Therefore, the classification you aim to achieve in this study can be directly deduced from the prior research.
● The dataset employed exhibits notably small dimensions, with nearly half of the classes containing fewer than 100 samples. Consequently, these classes exhibit subpar performance, remaining unclassified.
● Additionally, there is ambiguity regarding the repetition of experiments to assess the statistical significance of the results. If such repetitions have occurred, it is advisable to include the standard deviation of performance in the presentation of results.


Suggestions
● Given the significant performance challenges posed by the six sparsely populated classes (number of samples less than 132), I suggest exploring data augmentation techniques to mitigate this issue and enhance the training phase by increasing the number of available samples.
● The considered architectures employ distinctly varied parameters, such as learning rate and the number of epochs. To promote a fair comparison among the different proposals, it would be advisable to standardize these parameters across all instances.
● Conduct multiple runs of experiments, employing various train-test splits, to evaluate the variability of the results.

Comments on the Quality of English Language

Average

Author Response

The paper addresses the task of understanding and categorizing vulnerabilities in the dynamic cybersecurity landscape, emphasizing the necessity for structured and uniform cybersecurity knowledge systems. The MITRE Corporation has established two influential sources: the Common Vulnerabilities and Exposures (CVE) list, dedicated to identifying and addressing software vulnerabilities, and the MITRE ATT&CK Enterprise Matrix, which serves as a framework for categorizing adversary actions and defense strategies. Notably, these two sources are presently not directly linked.

The paper introduces an automated approach to map CVEs to the corresponding MITRE ATT&CK tactics, employing state-of-the-art Large Language Models to tackle the multi-label classification problem. Additionally, the study conducts a comprehensive error analysis to improve understanding of the models' performance and limitations.

Strengths

  • The code is publicly available on GitHub
  • Performance is evaluated through the weighted F1 score due to dataset imbalance
  • The resulting model has a small size and is easy to deploy

 

Response: We appreciate the feedback. Thank you kindly!

 

Weaknesses

  • All the architectures under consideration exhibit poor performance, which undermines the potential adoption of this approach in real-world scenarios

 

Response: We thank the reviewer for the feedback. We succeeded in extending the initial dataset of almost 2,000 entries to approximately 10,000 entries, which also brought a significant boost in the performance of the proposed models. This makes our approach more suitable for use in real-life scenarios. Extending the dataset manually even further is quite difficult due to the quality of CVE text descriptions and the lack of detail for certain CVEs.

 

  • This work shares only minor differences from the authors' prior research. Specifically, the previous study concentrated on classifying techniques (31 in this dataset, excluding less populated ones), while the current work is centered on classifying tactics (14 in this dataset). As far as I understand, each tactic encompasses various corresponding techniques. Therefore, the classification you aim to achieve in this study can be directly deduced from the prior research.

 

Response: We thank the reviewer for the feedback. The dataset was considerably increased, and the classification achieved in the current study cannot be directly deduced from the prior research. At the moment, there are over 200 MITRE techniques but only 14 tactics. As you already mentioned, the previous study focused on only 31 techniques, so for some new vulnerabilities (covered by the remaining 170+ techniques), the previous models would not work. However, the current models, trained on all 14 tactics, should be able to classify new CVEs, making the current approach easier to use in real-world scenarios. This is even more relevant now that we have improved our dataset and results considerably. Moreover, this study uses different, more advanced methods than the previous study, focusing on larger language models. We agree that the current study builds on the previous one, so we added more detailed information about the differences between the two studies.

 

  • The dataset employed exhibits notably small dimensions, with nearly half of the classes containing fewer than 100 samples. Consequently, these classes exhibit subpar performance, remaining unclassified.

 

Response: We thank the reviewer for the feedback. We succeeded in extending the dataset to approximately 10,000 entries, with at least 170 entries for each of the 14 MITRE ATT&CK tactics. This also solved the issue of the classes that previously showed subpar performance and remained unclassified.

 

  • Additionally, there is ambiguity regarding the repetition of experiments to assess the statistical significance of the results. If such repetitions have occurred, it is advisable to include the standard deviation of performance in the presentation of results.

 

Response: We thank the reviewer for the feedback. We have assessed the statistical significance of the results by running multiple rounds of experiments in a cross-validation setup. We also included the description of our experimental setup and the standard deviation for all our results, both for validation and testing.

 

Suggestions

 

  • Given the significant performance challenges posed by the six sparsely populated classes (number of samples less than 132), I suggest exploring data augmentation techniques to mitigate this issue and enhance the training phase by increasing the number of available samples.

 

Response: We succeeded in extending the dataset to approximately 10,000 entries, with at least 170 samples (over 132) for each tactic; as such, the need to augment the dataset diminished. Our new dataset is still imbalanced; however, text augmentation techniques did not help in our case. We tried using TextAttack and GPT-4 for augmentation, but given the specialized cybersecurity domain and the relatively fixed format and vocabulary of a CVE, the augmentations we obtained were not necessarily relevant or useful. Nevertheless, we added a note about this potential technique in our study.
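For illustration only, a minimal sketch of this kind of off-the-shelf augmentation is given below; it uses TextAttack's EmbeddingAugmenter with an assumed configuration and a made-up CVE-style description, and does not reproduce the exact setup from our experiments.

```python
# Illustrative sketch (assumed configuration, hypothetical CVE text): word-level
# augmentation of a CVE description with TextAttack's EmbeddingAugmenter.
from textattack.augmentation import EmbeddingAugmenter

cve_description = ("Buffer overflow in the XYZ service allows remote attackers "
                   "to execute arbitrary code via a crafted packet.")

augmenter = EmbeddingAugmenter(pct_words_to_swap=0.1,
                               transformations_per_example=2)

for variant in augmenter.augment(cve_description):
    # For the fixed vocabulary of CVEs (product names, attack verbs), such swaps
    # often produce text that is no longer relevant, which limited their usefulness.
    print(variant)
```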

 

  • The considered architectures employ distinctly varied parameters, such as learning rate and the number of epochs. To promote a fair comparison among the different proposals, it would be advisable to standardize these parameters across all instances.

 

Response: The architectures vary greatly, and we tweaked the parameters according to the initial training of each model. More details were added in the paper.

 

  • Conduct multiple runs of experiments, employing various train-test splits, to evaluate the variability of the results.

 

Response: We have assessed the statistical significance and variability of the results by running multiple rounds of experiments in a cross-validation setup. We also included the description of our experimental setup (train, validation, and test sets) and the standard deviation for all our results, both for validation and testing.

 

Reviewer 2 Report

Comments and Suggestions for Authors

To elevate the quality of the paper, consider the following comments

The citations are all over the place in this paper. They need some serious fixing.

For under-represented classes, consider enhancing them by sourcing additional Common Vulnerabilities and Exposures (CVE) descriptions. Additionally, manually synthesize more data points that maintain context and relevance. This approach is likely to yield reasonable results.

Furthermore, the evaluation metrics should not be limited to the performance of F1-score. A holistic and diverse set of metrics is necessary for a comprehensive understanding of model performance. Consider using metrics such as ROC-AUC, precision curve, and confusion matrix. 

You may also discuss potential limitations of the proposed approach, such as the feasibility of manually synthesizing large amounts of data or the risk of introducing bias during the synthesis process.

 

Comments on the Quality of English Language

The Quality of English Language should be revised carefully.

 

Author Response

To elevate the quality of the paper, consider the following comments

 

  • The citations are all over the place in this paper. They need some serious fixing.

 

Response: Thank you for the observation. We have fixed the formatting issue regarding the citations.

 

  • For under-represented classes, consider enhancing them by sourcing additional Common Vulnerabilities and Exposures (CVE) descriptions. Additionally, manually synthesize more data points that maintain context and relevance. This approach is likely to yield reasonable results.

 

Response: Thank you for your feedback! We succeeded in extending the dataset to approximately 10,000 entries, with at least 170 entries for each of the 14 MITRE ATT&CK tactics. The dataset was merged with an existing dataset created by ENISA, and then new CVEs were manually labeled to obtain a higher count for sparsely populated classes, which also led to a significant performance improvement.

 

  • Furthermore, the evaluation metrics should not be limited to the performance of F1-score. A holistic and diverse set of metrics is necessary for a comprehensive understanding of model performance. Consider using metrics such as ROC-AUC, precision curve, and confusion matrix.

 

Response: Indeed, a confusion matrix was useful for an in-depth understanding of the model’s performance. Given our multi-label task, where each CVE might have one or multiple associated tactics, we computed 14 confusion matrices, each showing the number of TN, TP, FN, and FP predictions for each tactic, and added them in Appendix A of our paper.
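As a minimal illustration of how such per-tactic matrices can be obtained (placeholder data and a subset of tactic names; not the exact code behind Appendix A):

```python
# Sketch: one confusion matrix per tactic for multi-label predictions
# (placeholder data; assumes scikit-learn).
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

TACTICS = ["Initial Access", "Execution", "Persistence"]  # subset for brevity

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

# Result has shape (n_tactics, 2, 2); each matrix is laid out as [[TN, FP], [FN, TP]].
for tactic, cm in zip(TACTICS, multilabel_confusion_matrix(y_true, y_pred)):
    tn, fp, fn, tp = cm.ravel()
    print(f"{tactic}: TP={tp} FP={fp} FN={fn} TN={tn}")
```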

 

  • You may also discuss potential limitations of the proposed approach, such as the feasibility of manually synthesizing large amounts of data or the risk of introducing bias during the synthesis process.

 

Response: Thank you for the observation! The labeling process was performed by multiple experts in the cybersecurity area following a set of common general guidelines, starting from the standardized approach proposed by the Mapping MITRE ATT&CK to CVEs for Impact methodology. We have already highlighted that the labels are, to some extent, open to interpretation and that the process of manually creating a much larger dataset is a difficult one.

Reviewer 3 Report

Comments and Suggestions for Authors

There are three major problems in this research:

1) There is no reference list provided and thus all the cited references show garbage inside the document. No way for me to track and trace the sources used.

2) The title is quite awkward at first glance. The findings show the problems of existing technology: it is a kind of overturning approach, leading to the need to extend the research. The title should reflect the contributions.

3) The presentation is a very big problem. It shows a lot of experimental information but fails to present it clearly. It is easy to get lost in this article.

Comments on the Quality of English Language

English is mostly alright, but there are a few typos in the sentence structure.

Author Response

There are three major problems in this research:

 

1) There is no reference list provided and thus all the cited references show garbage inside the document. No way for me to track and trace the sources used.

 

Response: Thank you for the observation. We think that there was an issue right before we exported the document to upload it. We have fixed the formatting issue regarding the citations.

 

 

2) The title is quite awkward at first glance. The findings show the problems of existing technology: it is a kind of overturning approach, leading to the need to extend the research. The title should reflect the contributions.

 

Response: Thank you for your feedback! We wanted to be consistent with the previously published study. Following the dataset extension and the updates to the experiments, we believe the updated title now better encapsulates our work.

 

3) The presentation is a very big problem. It shows a lot of experimental information but fails to present it clearly. It is easy to get lost in this article.

 

Response: Thank you for your feedback! We added a paragraph detailing the structure of the paper in the Introduction section for clarification. Our structure follows APA and journal guidelines, and we removed the H3 headings. We hope everything is clearer now.
