Automated Mapping of Common Vulnerabilities and Exposures to MITRE ATT&CK Tactics

Abstract: Effectively understanding and categorizing vulnerabilities is vital in the ever-evolving cybersecurity landscape, since a single exposure can have a devastating effect on an entire system. Given the increasingly massive number of threats and the size of modern infrastructures, the need for structured, uniform cybersecurity knowledge systems arose. To tackle this challenge, the MITRE Corporation set up two powerful sources of cyber threat and vulnerability information, namely the Common Vulnerabilities and Exposures (CVEs) list, focused on identifying and fixing software vulnerabilities, and the MITRE ATT&CK Enterprise Matrix, a framework for defining and categorizing adversary actions and ways to defend against them. At the moment, the two are not directly linked, even though such a link would have a significant positive impact on the cybersecurity community. This study aims to automatically map CVEs to the corresponding 14 MITRE ATT&CK tactics using state-of-the-art transformer-based models. Various architectures, from encoders to generative large-scale models, are employed to tackle this multilabel classification problem. Our results are promising, with a SecRoBERTa model performing best with a weighted F1 score of 78.88%, closely followed by SecBERT (78.77%), CyBERT (78.54%), and TARS (78.01%), while GPT-4 showed weak performance in zero-shot settings (22.04%). In addition, we perform an in-depth error analysis to better understand the models' performance and limitations. We release the code used for all experiments as open source.


Introduction
Whether we are talking about the everyday lives of individuals or the daily activities of any organization, digital technology has become a core element of modern society. As the scale of software systems has dramatically grown, the attack surface for cybercriminals has also significantly extended, thus making cybersecurity a top priority.
One approach to defending against such attacks involves determining if a given action is legitimate or malicious, also known as intrusion detection, which often incorporates Machine Learning (ML) techniques. Various methodologies exist for this purpose: one focuses on real-time defense by analyzing data such as networking traffic or system calls from the time the attack is attempted or even ongoing; alternatively, another approach involves analyzing different sources of data left after the attack (e.g., syslog records) [1]. In our current investigation, we focus on building a tool for security auditing purposes, which anticipates the potential impacts that a vulnerability or series of vulnerabilities may have upon a system, even in the absence of an actual attack. A strong understanding of vulnerabilities is crucial in creating an efficient defense against such threats.
Many large-scale security assessments involve a direct or indirect mapping of vulnerabilities to how they can be exploited. This is needed to understand how to defend against attacks or how exploiting a series of vulnerabilities on different nodes of a wider infrastructure can form a kill chain and what the effects of such an event are. Of course, a team of cybersecurity specialists can perform all these activities manually. However, time is essential in this field, and a delayed response can sometimes be as bad as a total lack of action. This aspect, coupled with the time delay between the moment when a vulnerability is discovered and when the corresponding information is published, results in a critical requirement for a fast and automated manner to better understand vulnerabilities.
To address the challenges of handling the increasingly large number of security flaws, the MITRE Corporation set up two powerful sources of cyber threat and vulnerability information, namely the CVE list and the MITRE ATT&CK Enterprise Matrix. The CVE list catalogs publicly disclosed computer security flaws. When someone refers to a CVE, they mean a security flaw assigned a CVE ID number [2]. Each CVE record includes an ID number, a brief description of the security vulnerability, and other references, like vulnerability reports and advisories. ATT&CK stands for MITRE Adversarial Tactics, Techniques, and Common Knowledge. The MITRE ATT&CK framework is a knowledge base and model for cyber adversary behavior. It reflects the various phases of an adversary's attack life cycle and the platforms they are known to target [3]. It is a free tool widely adopted by private and public sector organizations of all sizes and industries. Users include security defenders, penetration testers, red teams, cyber threat intelligence teams, and any internal teams interested in building secure systems, applications, and services [4]. The ATT&CK Enterprise Matrix includes 14 tactics (i.e., Reconnaissance, Resource Development, Initial Access, Execution, Persistence, Privilege Escalation, Defense Evasion, Credential Access, Discovery, Lateral Movement, Collection, Command and Control, Exfiltration, and Impact) listed in a logical sequence, thereby indicating the possible phases of an attack [3].
Currently, CVEs and the MITRE ATT&CK Matrix are separate entities, even though they are conceptually strongly connected. Linking them leads to a better understanding of the potential risks associated with given vulnerabilities and to the further adoption of better responses and defensive strategies. This mapping facilitates a more systematic and standardized approach to vulnerability management and helps organizations prioritize their mitigation efforts based on known adversary techniques and tactics. One of the main reasons such a link does not exist for now is that building it manually would require a lot of effort and specialized human resources. However, due to the recent progress in ML, mapping the two entities using artificial intelligence has become promising.
This study builds on the work of Grigorescu et al. [5] and aims to automatically map CVEs to the corresponding list of 14 MITRE ATT&CK tactics using state-of-the-art transformer-based models. Instead of addressing the entire collection of 201 possible techniques, we focus on the higher-level task, which is more suitable for practice and provides an overarching view of potential exposure. The initial dataset had several shortcomings, namely a high imbalance and the lack of examples for certain techniques. As such, we now focus on the 14 tactics that have better coverage and a more balanced distribution instead of focusing only on the 31 techniques retained after applying a minimum threshold [5]. Additionally, Grigorescu et al. [5] used classical ML models (naïve Bayes classifiers and/or support vector machines) and convolutional neural networks, coupled with older techniques for word embeddings like word2vec, while the current study brings improvements by focusing on newer models and architectures, such as T5 or GPT-4.
The main contributions of this paper are as follows:
• We introduce and release a new, extended dataset of 9985 entries (starting from the previous work of Grigorescu et al. [5] with 1718 entries), with each tactic now having at least 170 samples;
• We establish a strong baseline involving various transformer-based architectures ranging from encoder models (e.g., CyBERT and SecBERT) and encoder-decoder models (e.g., TARS and T5) to zero-shot learning, with GPT-4 being used for solving the multilabel tactic classification task;
• We perform an in-depth results analysis to better understand the models' performance and limitations;
• We release as open source the dataset and the code used for our experiments, which have been published on GitHub (https://github.com/readerbench/CVE2ATT-CKtactics, accessed on 18 February 2024).
The paper is structured as follows: Section 2 presents the related work regarding vulnerability classification based on the MITRE matrix. Further on, we present our method, including our dataset and the employed models. We used three types of transformer architectures, namely encoder (represented by SecBERT, CyBERT, and SecRoBERTa), encoder-decoder (represented by TARS and T5), and decoder-only (GPT-4). We then present and discuss our results and, in the end, present our conclusions and directions for future work.

Related Work
Mapping ATT&CK to CVE for Impact [6] is an initiative started by the Center for Threat-Informed Defense in October 2021 and sponsored mainly by AttackIQ and JPMorgan Chase [7]. This research project aimed to establish a critical connection between vulnerability management, threat modeling, and compensating controls by mapping a CVE to MITRE tactics and techniques. The project did not have an automated component and relied on a rigorous, public labeling methodology. For now, the project publicly released a set of 840 labeled vulnerabilities, which are available on GitHub [8]. Both the dataset and the methodology were considered during the labeling process of the dataset introduced by Grigorescu et al. [5], which is described later on in detail and used in the current study.
Kuppa et al. [9] presented a multilabel text classification approach for mapping CVEs to ATT&CK techniques and proposed a multihead joint embedding neural network architecture. They addressed the issue of insufficient labeled data using an unsupervised labeling method that assigned labels without supervision to 17 techniques. This method relied on extracting phrases from threat reports and ATT&CK technique descriptions and then mapping a CVE to a technique based on the cosine distance between the CVE's description and the technique's vector. The lack of supervision for this approach is a benefit; however, its capability to classify only 17 techniques leaves room for major improvement.
Ampel et al. [10] introduced another approach for mapping CVEs to ATT&CK tactics. Their study compared different techniques and their performances: classical ML models (i.e., random forests, SVM, naïve Bayes, and logistic regression), deep learning models (i.e., RNN, LSTM, and BiLSTM), pretrained language models (i.e., GPT-2, BERT, and RoBERTa), and self-distillation. Their distillation approach combines relational knowledge from a large, pretrained model (the teacher, in this case, is a pretrained RoBERTa model) and a prior untrained model (the student). As a result, the trained student model generalizes better to unseen data than a model without knowledge distillation. This approach obtained encouraging results, with an F1 score of 76.18%. However, the classification was only multiclass, not multilabel, as a CVE was only mapped to one corresponding tactic. This rarely happens in reality, where a vulnerability is often exploited using multiple tactics.
Grigorescu et al. [5] also presented a multilabel text classification approach on the same dataset considered in this work. The difference is that Grigorescu et al. [5] mapped techniques instead of tactics. Ideally, the mapping of techniques is the final goal. Still, only a few techniques (31 out of 192) exceeded a minimum occurrence threshold, and several limitations were observed, especially due to the high imbalance. As such, this study focuses on tactics as an overarching goal. Grigorescu et al. [5] classified text descriptions from CVEs using SciBERT and adversarial attacks for the 31 techniques with an F1 score of 47.84%. Our approach has some similarities, but the current work explores a wider, more powerful set of models.
Haddad et al. [11] focused on mapping CVEs to the MITRE CWE Top 25 Weaknesses by approaching the problem as a ranking problem and aimed to release a dataset of 4012 manually annotated samples (not yet available). The main difference between our task and the task proposed by Haddad et al. [11] is that the CWE weaknesses focus on identifying and categorizing specific vulnerabilities or weaknesses, while the MITRE tactics focus on categorizing and describing adversary behaviors and objectives during cyber attacks. Fine-tuned deep learning models like SentenceBERT and rankT5 featured good semantic understanding for this ranking task. Specific metrics for ranking tasks, like the mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (NDCG), were used for evaluating the models. MAP@k and NDCG@k compare the top-k predictions to the ground truth (true positives), so MAP@1 and NDCG@1 in fact represent a classification measure, which should be more relevant for comparison with our task. The best model was a fine-tuned SentenceBERT (SBERT), with a MAP@1 score of 85% and an NDCG@1 score of 84.46%.

Method
First, our method significantly enlarged the previous corpus by Grigorescu et al. [5], as language models require as much data as possible to learn efficiently. Second, even though the new dataset also exhibits class imbalance, each class has at least 170 samples, which is a noteworthy improvement over the initial dataset, where some classes had a very small number of samples (e.g., Command and Control had only 7 samples). Furthermore, we experimented with multiple types of transformer architectures in a cross-validation setup to validate our results thoroughly.

Corpus
The current dataset started from the one introduced by Grigorescu et al. [5] for MITRE technique classification. Each technique had a corresponding tactic; thus, the tactic mapping can be easily derived from the initial dataset. The dataset contained 1718 entries, of which 820 CVEs were provided by MITRE-Engenuity [6]. The remaining entries were manually annotated by Grigorescu et al. [5] in a team of five people. Data quality was ensured by performing multiple validation rounds between the team members [5].
In our first expansion, 8000 more vulnerabilities were extracted from ENISA's 2018-2019 state of vulnerabilities report [12]. The European Union (EU) Agency for Cybersecurity (ENISA) is dedicated to achieving a high common level of cybersecurity across the continent. It released this document, which highlights the state of cybersecurity vulnerabilities at the end of 2019. A list of these vulnerabilities, along with some data analysis tasks performed on this data, can be found on GitHub [13]. The provided data also includes the corresponding tactics for part of the vulnerabilities extracted and added to our dataset.
Additionally, 250 more vulnerabilities were manually searched and added to the new dataset to ensure that each of the 14 classes had at least 170 sample vulnerabilities. A comparison between the initial dataset and the new one is presented in Table 1.
Let us consider the following example regarding a specific vulnerability (i.e., CVE-2020-5413) to argue for the complexity of the annotation process: "Spring Integration framework provides Kryo Codec implementations as an alternative for Java (de)serialization. When Kryo is configured with default options, all unregistered classes are resolved on demand. This leads to the "deserialization gadgets" exploit when provided data contains malicious code for execution during deserialization. In order to protect against this type of attack, Kryo can be configured to require a set of trusted classes for (de)serialization. Spring Integration should be proactive against blocking unknown "deserialization gadgets" when configuring Kryo in code". This entry was first mapped in terms of techniques to Command and Scripting Interpreter because it can be exploited through the Spring Framework; second, it was mapped to Exploitation for Client Execution because it can lead to a client attack, which runs malicious code on the host. Further, the entry was mapped to Execution in terms of tactics, since this is the corresponding tactic for both previous techniques.
From an exploratory data analysis point of view, several notable observations about our extended dataset must be made. As shown in Figure 1, the set is unbalanced, with "Defense Evasion" being the dominant tactic. "Privilege Escalation", "Persistence", "Execution", "Discovery", and "Lateral Movement" are also well represented. Some other tactics, like "Initial Access", "Collection", "Credential Access", "Command and Control", and "Impact", still have hundreds of samples, while the rest of the tactics have few samples; more specifically, Exfiltration has 171 samples, while Resource Development and Reconnaissance each have 170 samples. Most vulnerabilities have been annotated with only one tactic (see Figure 2); nevertheless, the count of entries with two tactics follows closely, thus arguing for the inadequacy of labeling only with the dominant tactic, as performed by Ampel et al. [10]. As shown in Figure 3, most vulnerabilities have under 200 words in their description, which should not represent an issue for BERT's input length limit of 512 tokens.
Noise removal was used for preprocessing the text of a CVE. This refers to removing those characters, digits, or even pieces of text that might degrade the performance of the text analysis. In our case, this refers to the version of a given vulnerable piece of software and the identifiers of other vulnerabilities. It is relevant for the model to know the targeted product, but knowing the version itself might confuse the model. With this in mind, the numerical versions were removed and replaced with the word [version]. References to the identifiers of other vulnerabilities were also replaced with the generic word "CVE".
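The noise removal step described above can be illustrated with a minimal sketch. The two regular expressions and the function name `remove_noise` are assumptions made for illustration; the text only states that version numbers become [version] and that identifiers of other vulnerabilities become "CVE", not which exact rules were applied.

```python
import re

# Hypothetical patterns: the exact rules used in the study are not specified.
CVE_ID = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
VERSION = re.compile(r"\bv?\d+(?:\.\d+)+[a-z]?\b", re.IGNORECASE)


def remove_noise(description: str) -> str:
    """Replace CVE identifiers and dotted version numbers with generic tokens."""
    # Identifiers are handled first so that their digits are never
    # mistaken for a version number.
    text = CVE_ID.sub("CVE", description)
    return VERSION.sub("[version]", text)


example = ("In freewvs before 0.1.1, a user could create a large file. "
           "See also CVE-2020-5413.")
cleaned = remove_noise(example)
print(cleaned)
```

The product name ("freewvs") is preserved, while the version and the cross-referenced identifier are generalized, which matches the stated goal of keeping the targeted product visible to the model.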
We split the data into two sets: 80% for training and validation, while the remaining 20% of the entries were selected for the static test set. The static test partition was considered to achieve a fair comparison across all employed architectures. For this split, the dataset was shuffled first, and the separation was made using the MultilabelStratifiedShuffleSplit class from the 'iterative-stratification' Python package [14], which kept the label distribution constant after the train-test split in the context of multilabel tasks; see Sechidis et al. [15].
Given the imbalance in the dataset, the tactic distribution is an important aspect when evaluating the results. Nevertheless, all tactics are well represented, and we did not opt to perform oversampling or undersampling but instead kept the initial distributions.
Next, the common training and validation dataset was shuffled and split again using MultilabelStratifiedShuffleSplit four times into individual training and validation static subsets. Each time, the training set had 60% of the initial data, while the validation subset contained 20% of the initial data. In this way, the results could be validated on various training and validation datasets for all models while keeping a constant test subset.
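The splitting protocol can be summarized with the sketch below. It only illustrates the proportions (one static 20% test set, then four 60%/20% training and validation splits drawn from the remaining 80%). As a loudly stated simplification, plain random shuffling is used in place of the multilabel stratification described above, so label ratios are not preserved here; the function name `split_protocol` is hypothetical.

```python
import random


def split_protocol(n_samples, n_repeats=4, seed=0):
    """Return a static test set and `n_repeats` (train, validation) splits.

    Assumption: random shuffling stands in for multilabel stratification,
    so this sketch reproduces the split sizes, not the label distribution.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = n_samples // 5            # 20% of all data, fixed once
    test = indices[:n_test]
    train_val = indices[n_test:]       # remaining 80%

    n_val = n_samples // 5             # 20% of the initial data
    folds = []
    for _ in range(n_repeats):
        shuffled = train_val[:]
        rng.shuffle(shuffled)
        # First 20% of the initial data for validation, the rest (60%) for training.
        folds.append((shuffled[n_val:], shuffled[:n_val]))
    return test, folds


test_idx, folds = split_protocol(1000)
```

Because the test indices are removed before the repeated splits are drawn, no test sample can leak into any training or validation subset.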

Models
Transformer-based models have reshaped the NLP landscape since their introduction [16] by achieving state-of-the-art performance on a wide variety of tasks. Depending on the architecture, these deep learning models can be classified into three large categories: encoders, decoders, and encoder-decoder models. The transformer encoder model (e.g., Bidirectional Encoder Representations from Transformers, BERT [17]) takes as input a piece of text, represented as a sequence of tokens, and produces a fixed-size vector representation of the input, which can be further used for classification or clustering. In contrast, the transformer decoder model (e.g., GPT-4 [18]) takes as input a sequence of tokens (the prompt) and outputs a sequence of tokens one at a time, with each token being conditioned on the previously generated ones, thus making this model especially suited for tasks like text generation. Finally, the encoder-decoder transformer (e.g., T5 [19]) takes as input a sequence of tokens, encodes it into a vector representation, and then passes it to the decoder to generate the output text, thus making this model a good fit for many tasks, including translation or question answering. We performed experiments with models from each category.

Transformer Encoders
We selected all publicly available encoders on HuggingFace for the cybersecurity domain [20][21][22]. First, SecBERT (or SecureBERT) [23] is a BERT-based model pretrained on a cybersecurity-related corpus. It was fine-tuned based on the Masked Language Modeling (MLM) procedure [17]; namely, the model learned to guess randomly masked words from a diverse set of security texts (e.g., Cisco security reports, the NIST website, and cybersecurity and hacking books).
Second, SecRoBERTa [24] was fine-tuned on the same cybersecurity corpus as SecBERT, but, as its name suggests, the base model is RoBERTa, which is an optimized variation of BERT. This variant was trained on a larger corpus compared to BERT, benefited from retuned hyperparameters (those of the original BERT were found to be suboptimal), and used a different, dynamic word masking strategy during training.
Third, CyBERT [25] is a domain-specific BERT model that has been fine-tuned with a large corpus of textual unlabeled cybersecurity data. It was fine-tuned starting from the BERT base cased model with MLM [17], which is similar to SecBERT. More specifically, the first step consisted of extending the model's vocabulary with specific cybersecurity words. Once the vocabulary was extended and the embedding space was updated, the model was trained to guess randomly masked words from the specific cybersecurity corpus; see Ranade et al. [25]. The main difference between SecBERT and CyBERT lies in the data the two models were tuned on. More specifically, CyBERT was trained using a corpus of labeled sequences from ICS (industrial control systems) device documentation collected across a wide range of vendors and devices [25], while SecBERT was trained using a large amount of online cybersecurity-related text data, including books, blogs, news, security reports, videos (subtitles), journals and conferences, white papers, tutorials, and survey papers [23].
These models (CyBERT, SecRoBERTa, and SecBERT) were used to encode the CVE text description, followed by a dropout layer and a linear layer with 14 output nodes, one for each class, coupled with a BCEWithLogitsLoss [26] loss function. After a sigmoid is applied, each output represents a probability for the corresponding class, based on which it is decided if the CVE can be exploited using a given tactic.
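To make the role of the loss and of the per-class probabilities concrete, the following sketch reimplements in plain Python what a binary cross-entropy on logits computes for one sample, together with the decision step. It is an illustration rather than the training code: the numerically stable formula and mean reduction mirror the usual behavior of such a loss, while the 0.5 decision threshold and the function names are assumptions, since the text does not state the threshold used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly on logits.

    Uses the stable form max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def predict(logits, threshold=0.5):
    """Turn one logit per tactic into a multi-hot prediction.

    The threshold value is an assumption made for this sketch.
    """
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


logits = [2.0, -1.0, 0.0]      # toy values for three labels
targets = [1, 0, 1]
loss = bce_with_logits(logits, targets)
labels = predict(logits)
```

Each label is scored independently, which is what allows several tactics to be predicted for the same CVE.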

Transformer Encoder-Decoder Architectures
TARS (Task-Aware Representation of Sentences for Generic Text Classification) is a transformer-based model that leverages the information provided by class labels and shows good performance in few-shot and zero-shot settings [27]. It is based on a BERT encoder, with the most relevant difference being that it also has a task-shared decoder, which outputs a Boolean value, thus extending the capabilities of transfer learning compared to text classifiers based on BERT and its variations. For those classifiers, the final linear layer and activation are task-specific and must be learned separately from scratch for each task [27]. The concept of task-aware representation refers to the reformulation of each classification problem, multiclass or multilabel, into a generic binary classification for each label by feeding the transformer a sentence and a potential label (more or less descriptive) and establishing whether the association holds or not. TARS formalizes the classification function as $f: \langle \text{task label}, \text{text} \rangle \rightarrow \{0, 1\}$ and addresses the incomplete knowledge transfer, as the same decoder can now be used across arbitrary tasks in this context, thereby allowing the transfer of the full model, not only of the encoder component, as in the case of the previous BERT models [27]. This approach was chosen especially to explore if, by providing some information about what a tactic means by itself, the model better classifies the under-represented classes unsuccessfully recognized by the previous BERT models, which are not aware of each label's meaning. This model is also suitable for multilabel classification, as it establishes for every possible tactic if there is an association between a CVE description and that tactic or not.
Two experiments were performed: one where the labels have a short name (e.g., "Reconnaissance") and one where the labels are a short description of the tactics (e.g., "gather the information they can use to plan future operations").
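The reformulation into per-label binary decisions, and the difference between the two labeling experiments, can be sketched as follows. This is not the FLAIR API: `to_binary_pairs` is a hypothetical helper that only shows how one multilabel example expands into 14 (label, text, Boolean) triples, with either the short name or a description serving as the label text.

```python
TACTICS = [
    "Reconnaissance", "Resource Development", "Initial Access", "Execution",
    "Persistence", "Privilege Escalation", "Defense Evasion",
    "Credential Access", "Discovery", "Lateral Movement", "Collection",
    "Command and Control", "Exfiltration", "Impact",
]


def to_binary_pairs(text, gold_tactics, descriptions=None):
    """Expand one multilabel example into one binary example per tactic.

    `descriptions` optionally maps a tactic name to a longer label text;
    tactics without an entry fall back to their short name.
    """
    descriptions = descriptions or {}
    return [
        (descriptions.get(tactic, tactic), text, tactic in gold_tactics)
        for tactic in TACTICS
    ]


cve_text = "Example CVE description."   # placeholder text, not a real entry
short_pairs = to_binary_pairs(cve_text, {"Execution"})
long_pairs = to_binary_pairs(
    cve_text,
    {"Execution"},
    {"Reconnaissance": "gather the information they can use to plan future operations"},
)
```

Since every tactic yields its own yes/no decision, any number of tactics can be assigned to the same description.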
The Text-to-Text Transfer Transformer (T5) [19] relies on a transformer-based architecture with a text-to-text approach: it takes text as input and learns to produce new text as output. The main goal of the model is to create a unified framework for various NLP tasks, such as translation, question answering, and classification. This text-to-text approach enabled the direct usage of the same model, objective, training procedure, and decoding process for all considered tasks [19].
To take advantage of transfer learning capabilities and the text-to-text format, the training entries were adapted as shown in Table 2. The tactics formed of two words (e.g., Lateral Movement and Initial Access) were reduced to only one word (e.g., Movement and Access). The predictions contained padding sequences and also had the format of the labels shown in Table 2, so they had to be parsed and transformed into the original encoding to compute the scores.

Table 2.
Text | Label
In freewvs before 0.1.1, a user could create a large file that freewvs will try to read, which will terminate a scan process. This has been patched in 0.1.1. | 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1
CVE Text: In freewvs before 0.1.1, a user could create a large file that freewvs will try to read, which will terminate a scan process. This has been patched in 0.1.1.% | Predicted MITRE tactics: Access, Persistence, Impact
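The conversion between the multi-hot encoding and the text-to-text format, including the parsing of generated predictions, might look like the sketch below. Only "Access" (for Initial Access) and "Movement" (for Lateral Movement) are confirmed by the text; every other one-word name in `SHORT_NAMES`, in particular "Credential" and "Control", is an assumption, as are the function names. The `<pad>` and `</s>` strings are the usual T5 special tokens, and any extra delimiter used in the table is not reproduced.

```python
# One-word label per tactic, in matrix order. Names other than "Access" and
# "Movement" are assumptions made for this sketch.
SHORT_NAMES = {
    "Reconnaissance": "Reconnaissance",
    "Resource Development": "Development",
    "Initial Access": "Access",
    "Execution": "Execution",
    "Persistence": "Persistence",
    "Privilege Escalation": "Escalation",
    "Defense Evasion": "Evasion",
    "Credential Access": "Credential",
    "Discovery": "Discovery",
    "Lateral Movement": "Movement",
    "Collection": "Collection",
    "Command and Control": "Control",
    "Exfiltration": "Exfiltration",
    "Impact": "Impact",
}
PREFIX = "Predicted MITRE tactics:"


def to_t5_example(text, multi_hot):
    """Build the (input text, target text) pair for one entry."""
    names = [s for s, flag in zip(SHORT_NAMES.values(), multi_hot) if flag]
    return f"CVE Text: {text}", f"{PREFIX} " + ", ".join(names)


def parse_t5_output(generated):
    """Map generated text back to the 14-dimensional multi-hot encoding."""
    cleaned = generated.replace("<pad>", " ").replace("</s>", " ")
    tail = cleaned.split(PREFIX)[-1]          # text after the prefix, if present
    predicted = {token.strip() for token in tail.split(",")}
    return [1 if s in predicted else 0 for s in SHORT_NAMES.values()]


gold = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
source, target = to_t5_example("In freewvs before 0.1.1, ...", gold)
decoded = parse_t5_output("<pad> " + target + "</s>")
```

Unknown or malformed tokens in the generated text are simply ignored, so a noisy prediction degrades to missing labels rather than raising an error.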

Transformer Decoder Architectures
Large language models, like GPT-4, were trained on an extremely large quantity of data and instructions and exhibit enhanced zero-shot learning capabilities; these models capture the semantic similarities and differences between classes, thus making it possible to generalize well and obtain state-of-the-art results for classification tasks without additional training.
GPT-4 is the most recent model from the GPT family. It was evaluated on various tasks that tested its ability to understand and generate natural language text, particularly in more complex and nuanced scenarios [18]. For example, GPT-4 was tested on specialized exams designed for humans (e.g., the bar exam), and it achieved a score that falls in the top 10% of test takers compared to the bottom 10% for its predecessor, GPT-3.5 [18].
For now, the model is accessible only through OpenAI's API [28], which allows for easy experimentation despite its size, which would otherwise require a lot of resources. However, using the API for GPT-4 is significantly more expensive (30 times) than for GPT-3.5, thereby making it costly for large datasets. Fortunately, our dataset was relatively small and did not create issues in this direction.
After the system prompt that contextualizes the task, two prompts briefly describing the task were used to carry out the classification with GPT-4. For the first one, the prompt contained only the tactic names, while the second one also contained a short description of each tactic after its name, which was similar to the TARS experiment. The completion format was specified in the prompt to interpret the response and measure performance. The used prompts are shown in Table 3: the first and the last table rows were utilized for all experiments. The prompts from the Prompt column were used in the experiments without descriptions, while the prompts with tactic descriptions from the Extra Prompt column were used in the experiments with descriptions.

Experimental Setup for Fine-Tuned Models
The experiments were performed on an NVIDIA A100 GPU. For evaluating the model, in the context of the presented multilabel classification task, we were interested in generating overall correct mappings (high precision) and the correct classification of as many examples as possible per class (high recall). To take into consideration both aspects, the F1 score, which is the harmonic mean of precision and recall, was considered the most relevant metric [5]. More specifically, due to the severe imbalance in the dataset, the weighted F1 score was chosen as the final metric, as it computes the F1 score for each label and averages these scores weighted by the number of true instances of each label. The macro F1 score, which does not consider the imbalance in the dataset and gives equal importance to each class, was also computed for a better overall model analysis. The other relevant metrics, like precision and recall, were also computed as weighted averages for the overall score.
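The difference between the two averaging schemes can be made concrete with a small, dependency-free sketch. The toy labels below are invented for illustration only, and the helper names are hypothetical; assigning an F1 of 0 to a label with no true and no predicted instances is a convention chosen for this sketch.

```python
def per_label_f1(y_true, y_pred):
    """Return (F1 per label, support per label) for multi-hot label lists."""
    scores, supports = [], []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)           # number of true instances
    return scores, supports


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1: every class counts equally."""
    scores, _ = per_label_f1(y_true, y_pred)
    return sum(scores) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-label F1 weighted by each label's support."""
    scores, supports = per_label_f1(y_true, y_pred)
    total = sum(supports)
    return sum(s * w for s, w in zip(scores, supports)) / total if total else 0.0


# Toy data: label 0 is frequent and always right, label 1 is rare and always wrong.
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 1]]
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this toy case, the weighted score is dominated by the frequent label, whereas the macro score exposes the failure on the rare one, which is why both are reported for an imbalanced dataset.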
We also computed the standard deviation (SD) for each reported metric. It measures the variability of values from their mean to determine the stability of the models on different training and validation sets. A smaller SD indicates a more robust model, while a larger one shows greater performance fluctuation, thereby making the model less reliable. Moreover, the scores for each class were computed to better understand how the imbalance in the dataset would impact the results. Given the multilabel task, confusion matrices were computed for each class.
During training, the number of epochs was chosen by observing the behavior during longer experiments based on when the validation loss had reached a constant value. For all models, a learning rate search was performed to find the optimal value based on the best average validation score for each model. For all transformer encoders, the AdamW optimizer was used, the batch size was set to 16, and the learning rate was set to $3 \times 10^{-5}$ during training. No input truncation was needed, because the longest vulnerability description was shorter than BERT's input length limit of 512 tokens [17]. The variants based on BERT base uncased were used. For TARS, we considered the implementation [27] from the FLAIR Framework [29] based on BERT base. A learning rate of 0.02 was set, and the model was trained for 20 epochs. The T5 base model from HuggingFace [30] was used for fine-tuning experiments. During the 10-epoch training, an AdamW optimizer, a batch size of 16, and a learning rate of $2 \times 10^{-4}$ were used.

Results
All the previously described architectures that involved training (i.e., CyBERT, SecBERT, SecRoBERTa, TARS, and T5) were evaluated on both validation (Table 4) and testing datasets (Table 5). The GPT-4 experiments performed in a zero-shot setting were only conducted once for the static testing dataset (20% of the available data). More precisely, based on the initial splits computed with respect to the constant tactic distribution, the validation results were computed after cross-validation experiments on the common training and validation set, which represents 80% of all the available data. During the validation experiments, the best parameters (e.g., learning rate) were established. Each model was trained four times on 60% of the available data and also evaluated each time on 20% of the available data, with the reported data from Table 4 being the average of the four results and the standard deviation being computed for each metric. The validation results (Table 4) have highlighted the stability and consistency of the models while being trained and validated on different data subsets. The standard deviation (SD) for all the computed metrics was less than 4.5%, thereby indicating a small variability among the models and therefore arguing for their reliability.
Next, the models trained on the four variable training subsets from the previously described validation experiments were tested on the static testing set. Moreover, in order to take advantage of as much data as possible, with the perspective of deploying the models in the future, a fifth experiment was executed, with training on the whole training and validation set (i.e., 80% of available data) and testing on the static test set. The reported results from Table 5 represent the average of these five experiments and the standard deviation for each metric.
The best weighted F1 score for both the validation (77.65%) and the testing experiments (78.88%) was achieved by SecRoBERTa, by a very close margin, followed by SecBERT and CyBERT. Both TARS experiments exhibited a slightly lower weighted F1 score but a better macro F1 score on the validation experiments, which indicates that, given the information of what a tactic means, TARS may be less influenced by the strong dataset imbalance. Another interesting aspect is that, in multiple validation runs, the short-label TARS experiment showed better results than the long-label experiment, which is counterintuitive, as the latter offered the model more context about what each tactic means. T5, on the other hand, showed lower performance on all metrics.
Table 6 highlights the per-tactic results for our best-performing model, SecRoBERTa. The reported results were computed after training the model on the whole training and validation dataset and testing on the fixed test dataset. Figure 4 highlights the correlation between the per-class F1 score and the number of samples in the training dataset. As expected from the imbalance in the dataset, the best F1 score was obtained for the dominant class, namely Defense Evasion (F1 score of 91.66%). However, the next most frequent classes (Privilege Escalation, Persistence, Execution, Discovery, and Lateral Movement) also showed promising results, with F1 scores of over 74%. Despite being less frequent classes, Exfiltration and Resource Development showed good F1 scores of 88.88% and 79.13%, respectively. Weaker results were registered for Initial Access and Impact, with F1 scores slightly over 30%.
For the best experiment, we also computed the weighted accuracy (89.52%) and the accuracy score, computed as the subset accuracy (65.34%). These two metrics differ because, in multilabel classification, the subset accuracy counts a sample as correct only if the set of labels predicted for it exactly matches the corresponding set of true labels [31].
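The gap between the two accuracy figures can be illustrated with a small sketch. It uses toy multi-hot labels (hypothetical data, three tactics instead of 14) and scikit-learn's `accuracy_score`, which computes the subset accuracy for multilabel inputs. The per-label accuracy shown next to it (one minus the Hamming loss) is only an assumed stand-in for a more lenient metric; it is not necessarily the weighted accuracy reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Toy multi-hot labels for 4 samples and 3 tactics (hypothetical data).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],   # exact match
                   [0, 1, 1],   # one extra tactic predicted
                   [1, 0, 0],   # one tactic missed
                   [0, 0, 1]])  # exact match

# For multilabel indicator arrays, accuracy_score is the subset accuracy:
# a sample counts only if its whole label set is predicted exactly.
subset_acc = accuracy_score(y_true, y_pred)

# Per-label accuracy: fraction of individual (sample, tactic) decisions
# that are correct. Used here as an illustrative, more lenient metric.
per_label_acc = 1.0 - hamming_loss(y_true, y_pred)

print(f"subset accuracy:    {subset_acc:.3f}")
print(f"per-label accuracy: {per_label_acc:.3f}")
```

Two of the four samples are partially correct, so they contribute nothing to the subset accuracy while still contributing most of their individual label decisions to the per-label figure.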
Given the relatively large number of classes in the current multilabel classification task (14), the macro F1 score of 68.67%, which gives equal weight to each class, shows that even though the model naturally tends to classify the class with the largest number of samples better, it still differentiates between the remaining classes, thus extracting relevant linguistic features from the CVE description.
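The difference between the macro and the weighted F1 score can be seen on a toy example with one frequent and one rare tactic. The data below are hypothetical and chosen only to make the effect visible; the calls are scikit-learn's `f1_score` with `average="macro"` and `average="weighted"`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-hot labels: column 0 is a frequent tactic, column 1 a rare one
# (hypothetical data, not taken from the paper's dataset).
y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [0, 1]])

# Per-class F1: the frequent class is perfect, the rare class is missed once.
per_class = f1_score(y_true, y_pred, average=None)

# Macro F1: unweighted mean over classes, so the rare class counts fully.
macro = f1_score(y_true, y_pred, average="macro")

# Weighted F1: mean over classes weighted by the number of true samples,
# so the frequent class dominates.
weighted = f1_score(y_true, y_pred, average="weighted")

print("per-class:", per_class)
print(f"macro: {macro:.4f}  weighted: {weighted:.4f}")
```

Because the rare class is the one with the lower score, the macro average comes out below the weighted average, which mirrors the relationship between the macro and weighted F1 scores discussed above.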
As shown in Table 5, GPT-4 exhibited weak performance in a zero-shot setup compared to the fine-tuned models. Adding a short description of each tactic to the prompt did not improve the results and instead led to slightly worse performance. The imbalanced dataset did not influence GPT-4's performance; nevertheless, its macro F1 score was much lower than those from the prior experiments, which confirms that the current task is difficult even for large language models such as GPT-4.
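To make the zero-shot setup more concrete, the sketch below shows one way a prompt could be assembled and a free-text answer converted into a multi-hot vector over the 14 tactics. The exact prompts are those listed in Table 3 and are not reproduced here; the prompt wording, the helper names `build_prompt` and `parse_tactics`, and the substring-matching rule are all hypothetical, and no model API is called.

```python
# The 14 MITRE ATT&CK Enterprise tactics.
TACTICS = [
    "Reconnaissance", "Resource Development", "Initial Access", "Execution",
    "Persistence", "Privilege Escalation", "Defense Evasion",
    "Credential Access", "Discovery", "Lateral Movement", "Collection",
    "Command and Control", "Exfiltration", "Impact",
]


def build_prompt(cve_description, tactic_descriptions=None):
    """Assemble a zero-shot prompt (hypothetical wording, not Table 3's)."""
    lines = ["Which MITRE ATT&CK tactics apply to the following CVE?",
             "Answer with a comma-separated list chosen from:"]
    for tactic in TACTICS:
        if tactic_descriptions and tactic in tactic_descriptions:
            # Optional variant: include a short description per tactic.
            lines.append(f"- {tactic}: {tactic_descriptions[tactic]}")
        else:
            lines.append(f"- {tactic}")
    lines.append(f"CVE description: {cve_description}")
    return "\n".join(lines)


def parse_tactics(response_text):
    """Map a free-text answer to a multi-hot vector via name matching."""
    text = response_text.lower()
    return [1 if tactic.lower() in text else 0 for tactic in TACTICS]


prompt = build_prompt("A crafted file allows attackers to cause a DoS.")
vector = parse_tactics("Execution, Privilege Escalation")
print(vector)
```

The resulting vector has the same shape as the labels used for the fine-tuned models, so the same multilabel metrics can be applied to both.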

Discussion
The top-performing models were SecRoBERTa (weighted F1: 78.88%), SecBERT (weighted F1: 78.77%), and TARS (weighted F1: 78.10%). For the well- and medium-represented classes, the per-class F1 score had satisfactory values, with some variations depending on the model. The models clearly extracted valuable information during training and learned, to some degree, to map those tactics, which indicates significant potential regarding the information that can be extracted from a CVE's textual description.
The fine-tuned models still had minor problems when tackling the strong imbalance between the classes (i.e., 171 samples for the least frequent class compared to 7552 samples for the most frequent one). Also, the results obtained for the Defense Evasion tactic were influenced by the very large amount of data for this class, as the models inherently tended to be biased toward predicting the majority tactic. Reducing the confusion with Defense Evasion would require a more balanced dataset, because Defense Evasion tended to dominate and was predicted even when it was not applicable. Adding synthetic data by augmenting the existing samples from the under-represented classes (e.g., using TextAttack [32] or state-of-the-art LLMs like GPT-4) was considered, but given the specialized cybersecurity terms and the relatively fixed structure of a CVE, this technique would not have provided realistic additional information for the models.
Our best model had problems identifying two classes (i.e., Initial Access and Impact), for which its F1 scores were below 50%. The precision score was over 65%; thus, when the model predicted these tactics, it was usually right. However, these classes had a low recall score of under 25%, which shows that the model frequently missed them. Given the multilabel task, the model sometimes predicted fewer tactics for entries with multiple associated tactics, thereby ignoring tactics like Initial Access or Impact. Other times, it confused the two previously mentioned tactics with other tactics. Upon closer inspection, the model had problems discriminating between Initial Access and Defense Evasion, as well as between Impact and Credential Access. The first confusion might come from the fact that both tactics usually take advantage of public-facing services, with Initial Access being an intermediary step needed for the attacker to enter a context where avoiding detection is actually needed. The second confusion can also be explained by the conceptual link between Impact, which involves compromising the integrity of a system, and Credential Access, which involves obtaining sensitive data that leads to the system's compromise.
MITRE states that the Impact tactic consists of techniques that adversaries use to disrupt availability or compromise integrity by manipulating business and operational processes. As such, we infer that Impact can be reached following the malicious acquisition of credentials, which suggests a causal relationship between the two tactics. Since Impact is one of the rarer classes, providing more examples of it, especially ones not related to credentials, would help avoid this confusion.
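The link between the precision and recall levels quoted above for Initial Access and Impact and their low F1 scores follows directly from the definition of F1 as the harmonic mean of the two. The sketch below plugs in the boundary values from the text (65% precision, 25% recall) purely as illustrative inputs; they are not the exact per-class values from Table 6.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Boundary values quoted in the text, used only as illustrative inputs.
f1 = f1_from_precision_recall(0.65, 0.25)
print(f"F1 = {f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a low recall keeps the F1 score low even when precision is reasonably high.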
Given our multilabel task, computing a single 14 × 14 confusion matrix was not possible; instead, fourteen 2 × 2 confusion matrices containing the number of true positive, true negative, false positive, and false negative samples for the best model were computed using multilabel_confusion_matrix from sklearn [33] (additional tables are presented in Appendix A).
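A minimal sketch of this computation on toy data is given below. The labels and the three tactic names are hypothetical placeholders; only the call to `multilabel_confusion_matrix` corresponds to what the text describes. scikit-learn returns one 2 × 2 matrix per label, laid out as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Toy multi-hot labels for 4 samples and 3 tactics (hypothetical data).
tactic_names = ["Execution", "Persistence", "Impact"]
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

# Shape (n_labels, 2, 2); each matrix is [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred)

for name, matrix in zip(tactic_names, mcm):
    tn, fp, fn, tp = matrix.ravel()
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

Each per-tactic matrix treats the task as a binary one-vs-rest decision, which is why it cannot show which other tactic a missed label was confused with.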
Furthermore, it is interesting to note that GPT-4 only had F1 scores of over 50% for Discovery, Privilege Escalation, and Exfiltration. This shows that, for a specific classification task such as ours, a very large generative model may perform surprisingly poorly in zero-shot settings, well below significantly smaller, fine-tuned models.

Error Analysis
Table 7 shows different predictions of the best model, from fully correct classifications to only partially correct classifications and, finally, completely wrong classifications. As illustrated, some CVEs contain much more information than others, which makes the classification difficult for those with very short text. Especially for these cases, adding more data beyond the CVE textual description alone would be helpful. We have extended the initial dataset and integrated data from ENISA (details in Section 3.1). We have observed that, over time, other organizations (e.g., ENISA) have publicly released precisely the required data. Hence, this could occur again in the future, which would allow us to incorporate new data from such sources. Another envisioned approach is to explore semisupervised learning, as proposed in the future work section, which leverages the abundance of unlabeled CVEs.
Moreover, the model tended to correctly classify some tactics based on more semantically obvious keywords, like 'execute'. The word 'execute' appears in many CVE descriptions, thus making it easier for the model to classify the respective CVE as Execution. The same happens for CVEs containing the words 'authenticated' (which tend to be classified as Privilege Escalation) and 'credentials' (which tend to be classified as Credential Access), without considering the deeper meaning of the CVE. This aspect, coupled with the fact that some vulnerabilities have similar texts with similar affected software (e.g., 'Oracle Java SE' and 'Cisco Web Security'), makes it easier to confuse the model based on text sequences that are not necessarily relevant.
However, many vulnerabilities are indeed open to interpretation, as their correct labels are highly influenced by the labeling methodology. This happens because many tactics are also conceptually linked, which makes it difficult to derive a tactic sequence only from the textual description. The most direct way to solve this issue is to manually label more data, focusing on the classes with low performance. Unfortunately, this is not a trivial task, as such CVEs are less frequent in the wild and are likely coupled with a more popular tactic.

Conclusions
This study builds on the work of Grigorescu et al. [5] and targets vulnerability mapping to tactics instead of techniques, treating it as a multilabel classification task. The experiments involved larger, more powerful models applied to a significantly larger dataset, leading to a weighted F1 score of 78.88% and consistent performance on different vulnerability subsets.
Three types of experiments were performed: with transformer encoder architectures (i.e., CyBERT, SecBERT, and SecRoBERTa), with encoder-decoder architectures (i.e., TARS and T5), and with zero-shot learning using GPT-4. Of the proposed approaches, SecRoBERTa performed best, with a weighted F1 score of 78.88%, closely followed by SecBERT and TARS. A major benefit of our models is their relatively small size, which makes them easier to deploy.
The best-performing model exhibited some weaknesses, with F1 scores of under 50% on two classes, namely Initial Access and Impact. Inherently, the obtained results were strongly influenced by the dataset imbalance. Still, the NLP task itself is also difficult, an aspect confirmed by the overall low performance of the zero-shot GPT-4 experiment.
Several directions for future research are envisioned. The most straightforward approach is to extend the dataset manually. Additionally, we want to explore whether taking into account any other type of data or metadata associated with a vulnerability (e.g., the severity score) during training brings any improvement. Another possible approach is to take advantage of the significant number of unlabeled CVEs and experiment with weakly supervised and semisupervised learning. Starting from our best-performing model, we could spot potential CVEs for specific classes and label those to ensure better variability when training our models. Moreover, we aim to introduce a hierarchical approach that tackles both tactics and techniques in a multitask learning setup.

Figure 2. Distribution of MITRE tactic number per sample.

Figure 4. F1 score (red lines) on top of CVE training distribution (blue bars) for SecRoBERTa.

Table 2. Initial training entry and T5 training entry.

Table 3. GPT-4 zero-shot experiment prompts with and without tactic description.

Table 4. Performance results on the validation sets (bold marks the best model and performance).

Table 5. Performance results for the test set (bold marks the best model and performance).

Table 7. Cont.
vulnerability exists in Pulse Connect Secure before 9.1R11.4 allows a remote authenticated attacker to execute arbitrary code as the root user via maliciously crafted meeting room.
discovered to contain a stack overflow in the function acurite_00275rm_decode at /devices/acurite.c. This vulnerability allows attackers to cause a DoS via a crafted file.
the HTTPS decryption feature of Cisco Web Security Appliance (WSA) could allow an unauthenticated, remote attacker to cause a DoS condition. The vulnerability is due to insufficient validation of Secure Sockets Layer (SSL) server certificates. An attacker could exploit this vulnerability by installing a malformed certificate in a web server and sending a request to it through the Cisco WSA. A successful exploit could allow the attacker to cause an unexpected restart of the proxy process on an affected device.

Table A2. Confusion matrix for Reconnaissance.

Table A3. Confusion matrix for Resource Development.

Table A4. Confusion matrix for Initial Access.

Table A5. Confusion matrix for Execution.

Table A6. Confusion matrix for Persistence.

Table A7. Confusion matrix for Privilege Escalation.

Table A8. Confusion matrix for Defense Evasion.

Table A9. Confusion matrix for Credential Access.

Table A10. Confusion matrix for Discovery.

Table A11. Confusion matrix for Lateral Movement.

Table A12. Confusion matrix for Collection.

Table A13. Confusion matrix for Command and Control.

Table A15. Confusion matrix for Impact.