V2W-LLM: Automated Vulnerability to Weakness Mapping Based on Large Language Model

Wang, Ziguo; Nian, Mei; Jing, Yaling; Zhang, Jun

doi:10.3390/info17060513


by Ziguo Wang ¹, Mei Nian ¹,*, Yaling Jing ¹ and Jun Zhang ²

¹ College of Computer Science & Technology, Xinjiang Normal University, Urumqi 830054, China

² Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China

* Author to whom correspondence should be addressed.

Information 2026, 17(6), 513; https://doi.org/10.3390/info17060513

Submission received: 20 April 2026 / Revised: 17 May 2026 / Accepted: 19 May 2026 / Published: 22 May 2026

(This article belongs to the Topic New Trends in Cybersecurity and Data Privacy)


Abstract

To address the rapid growth of software vulnerabilities, the latency of manual expert classification, and the limitations of existing methods restricted to fixed categories, this paper proposes V2W-LLM, an automated vulnerability-to-weakness mapping model based on Large Language Models (LLMs). First, a dataset of CVE-CWE description pairs is constructed based on established expert correlations from MITRE. Subsequently, the LLM is instruction-tuned on this dataset to leverage its reasoning capabilities in generating CWE-style descriptive text for newly disclosed, unmapped vulnerabilities. Finally, using a BAAI-based embedding model, the semantic representations of the generated text and official CWE descriptions are computed to identify the optimal mapping via cosine similarity (Top-1). Experimental results indicate that V2W-LLM achieves an accuracy of 90.18% and a Macro-F1 of 87.64% in common categories. Furthermore, on the public ChatGPT-VDMEval and the latest 2024 NVD datasets, the model attains F1 scores of 86.02% and 94.02% respectively, validating its effectiveness in automating the vulnerability-to-weakness mapping process.

Keywords:

cybersecurity; vulnerability classification; large language model; instruction fine-tuning; vector retrieval


1. Introduction

The contemporary cybersecurity landscape is characterized by escalating complexity, where the exponential surge in software vulnerabilities poses substantial threats to personal privacy, corporate assets, and national critical infrastructure as illustrated in Figure 1. The Common Vulnerabilities and Exposures (CVE) [1] system, providing a list of uniquely identified public cybersecurity vulnerabilities, has become the de facto standard for describing security flaws in both industry and academia. Within the security assurance of the entire software life-cycle, effective defense and the formulation of targeted mitigation strategies require more than just awareness of a vulnerability’s existence; they necessitate a profound understanding of its root cause. Consequently, accurately identifying the underlying weakness types from massive, heterogeneous vulnerability reports is paramount for software developers, security analysts, and end-users alike [2].

Common Weakness Enumeration (CWE) [3] serves as an industry standard for classifying software weaknesses or vulnerabilities. By systematically abstracting vulnerability causes, it addresses the fundamental question of why a vulnerability occurs. Establishing precise mapping relationships from specific CVE instances to abstract CWE categories is critical for identifying attack chains. According to statistics from the National Vulnerability Database (NVD), the number of publicly disclosed vulnerabilities has risen exponentially since 2017, with the volume in 2024 alone exceeding 40,000. Such a vast and continuously growing volume of data poses significant challenges for automated, high-precision vulnerability classification [4]. The inherent latency of manual analysis prevents newly disclosed vulnerabilities from being rapidly attributed, creating cognitive blind spots in defense and increasing the risk of exploitation [5].

To address these challenges, existing research has proposed various automated classification methods, yet significant limitations persist [6]. Early methods based on traditional machine learning relied on shallow keyword matching, struggling to comprehend the complex contextual semantics in vulnerability descriptions, which resulted in low accuracy for complex CVEs [7]. Although recent deep learning-based methods have improved semantic understanding, they typically model the task as a closed-set classification problem. This setup assumes that all categories in the test set appear in the training set. However, the CWE system comprises hundreds of categories with an extreme long-tail distribution, where numerous rare or novel weaknesses lack sufficient training samples. Conventional classifiers struggle when encountering CWE categories unseen during training. Furthermore, some similarity-retrieval methods attempt to directly match text embeddings of CVEs and CWEs. Given that CVEs describe concrete instances while CWEs provide abstract definitions, the significant disparity in their expression forms and abstraction levels limits the precision of direct matching. Overcoming the closed-set constraint and bridging the gap between concrete instances and abstract definitions are key to achieving high-precision automated mapping. The innovations and contributions of this paper are summarized as follows:

Construction of a semantic-enhanced vulnerability-weakness dataset: To address the semantic scarcity in traditional CVE-CWE ID datasets, we enhance the data by incorporating CWE names and descriptions. We constructed an instruction-tuning dataset comprising CVE descriptions, CWE names, and CWE descriptions, providing a data foundation for the model to learn the transition from concrete attack events to abstract weakness definitions.

A novel generative reasoning-retrieval mapping paradigm: Addressing the expressive disparity between CVE descriptions and CWE definitions, we leverage the reasoning and generative capabilities of Large Language Models (LLMs). The model reasons and transforms CVE descriptions into CWE-style representations, which are then matched against the CWE library via similarity retrieval. This approach resolves the precision limitations caused by direct matching between concrete instances and abstract definitions.

Design of three specialized fine-tuning strategies: Unlike traditional classification models, we designed three fine-tuning strategies to guide the LLM in understanding the underlying logic of why a specific CVE description corresponds to a particular CWE description. This enables the model to reason and generate CWE-style descriptions even when encountering novel CVE inputs.

2. Related Work

Early research primarily utilized machine learning algorithms and statistical features to investigate the automated mapping from Common Vulnerabilities and Exposures (CVE) to Common Weakness Enumeration (CWE). For instance, Rehman et al. [8] converted CVE description texts into feature vectors using TF-IDF or Bag-of-Words (BoW) models. Subsequently, researchers employed classic classifiers such as Support Vector Machines (SVM) [9,10] and Naive Bayes (NB) [11] for training. Within this domain, Terdchanakul et al. [12] proposed an N-gram IDF technique to extract keyword statistical features from vulnerability texts, training Logistic Regression and Random Forest classifiers for automated classification. Furthermore, Albanese et al. [13] introduced CVE2CWE, which utilizes TF-IDF vectors and cosine similarity to calculate the matching degree between new CVEs and CWE categories. However, these models are context-independent, relying entirely on keyword overlap while neglecting word order, syntax, and deep semantics. Consequently, they struggle to achieve high accuracy when distinguishing between vulnerabilities with similar technical terminology but different root causes, such as “Buffer Over-read” versus “Buffer Overflow.”

To address the lack of word order and contextual understanding in traditional methods, researchers shifted toward early deep learning models. These approaches typically utilize word embedding techniques like Word2Vec to convert sequences into embeddings, which are then fed into Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), or Long Short-Term Memory (LSTM) networks. The final hidden states or pooling outputs are passed through fully connected layers and activation functions (e.g., Sigmoid) to generate predictions. For example, Nakagawa et al. [14] and Saklani et al. [15] leveraged CNNs to predict vulnerability severity levels from descriptions, while other studies explored hybrid CNN-LSTM architectures [16]. Although these models capture word order, their contextual understanding is often unidirectional. To mitigate this, Zhang et al. [17] proposed an automated classification technique based on BiGRU and TextCNN, capturing both sequential and local features. Nevertheless, when processing complex and highly technical CVE descriptions, these neural networks exhibit limited capacity in capturing long-range dependencies and deep bidirectional contexts, which constrains CVE-CWE mapping performance.

In recent years, pre-trained language models, represented by BERT [18], have become a research focal point. Das et al. [19] introduced V2W-BERT, which employs a Siamese BERT network with end-to-end fine-tuning to learn semantic correlations between CVE and CWE descriptions. This task-specific training effectively addresses mapping challenges for rare or even zero-shot CWEs [20]. Yang et al. [21] explored dual-attention mechanisms and improved adversarial training to prevent models from missing key features or succumbing to overfitting [22]. While fine-tuning pre-trained models as classifiers has achieved high accuracy on benchmarks, this paradigm faces two critical limitations: the closed-set problem, where classifiers can only predict CWE categories encountered during training—failing to scale to the full CWE list of over 900 categories—and the long-tail problem, stemming from the highly imbalanced distribution of CWE data in CVEs. These models remain heavily biased toward the “Top-25” CWE categories, often discarding the remaining infrequent classes.

To resolve these closed-set and long-tail issues, researchers have begun leveraging the general capabilities of Large Language Models (LLMs). Text2Weak, proposed by Simonetto et al. [23], is a representative approach that utilizes LLM embedding capabilities. This method employs general models like OpenAI’s text-embedding-ada-002 [24] to map official CWE descriptions into a vector index. When a new CVE emerges, its description is vectorized to retrieve the most matching weakness type via cosine similarity. However, this approach essentially calculates direct similarity between concrete vulnerability instances (CVE) and abstract weakness definitions (CWE). By relying solely on surface-level feature extraction, it ignores the core logical reasoning and generative potential of LLMs. This failure to bridge the significant disparity in expression and abstraction levels leads to limited mapping precision and practical utility. In summary, while existing research has progressed, it faces a bottleneck: traditional discriminative models are restricted by closed-set assumptions and long-tail distributions, whereas preliminary LLM applications like Text2Weak remain “black boxes” that fail to activate the model’s deep reasoning potential for heterogeneous text matching.

In response to these limitations, this paper explores a generative retrieval paradigm to break through the constraints of fixed-category classifiers. By introducing parameter-efficient fine-tuning (PEFT), we reduce training costs while fully activating the LLM’s reasoning and generative capabilities. Instead of direct text matching, our method empowers the model to reason and transform concrete CVE descriptions into transparent, standardized CWE-style descriptions based on contextual understanding. These generated descriptions are subsequently matched against the official CWE knowledge base. This process not only circumvents the challenges of direct matching between heterogeneous texts but also provides intuitive intermediate reasoning results, achieving high-precision, scalable, and interpretable automated vulnerability mapping in real-world scenarios.

3. Method

This paper proposes a generative-retrieval model, V2W-LLM, as illustrated in Figure 2, which integrates three primary modules. The data construction module is responsible for collecting vulnerability and weakness text from the NVD and CWE security knowledge bases, followed by rigorous text pre-processing and cleaning to establish sentence-pair associations via CWE-IDs. The generative module employs a fine-tuned large language model to infer and generate CWE-style descriptions based on input CVE descriptions. Finally, the retrieval mapping module vectorizes official CWE descriptions to build a retrieval index, then projects the generated descriptions into the same vector space, where the most matching CWE-ID is identified through cosine similarity calculation to achieve the final mapping.

3.1. Data Construction Module

The task of the data construction module is to construct a high-quality CVE-CWE description text pair dataset. Unlike traditional methods that use simple correspondences between CVE identifiers and CWE-IDs, this paper collects the descriptive text for each CVE and the detailed descriptive text of the corresponding CWE, forming semantically rich text pairs. The specific data construction process is as follows:

(a) Raw Data Acquisition: First, the raw knowledge bases were downloaded from their official sources. The official JSON data feeds for all CVE entries from 1999 to 2023 were collected from the National Vulnerability Database (NVD). Simultaneously, the XML files containing detailed information for all CWE entries, including names, descriptions, hierarchical relationships, and usage status, were downloaded from the official CWE website.

(b) Initial Extraction and Flattening: Python 3.10.20 scripts were written to parse and clean the raw JSON files obtained from the official NVD website. The core information of each CVE entry was extracted to form a tuple (CVE-ID, CVE description, associated CWE-ID list). Since one CVE may be associated with multiple CWEs, each tuple was then flattened into one record per (CVE, CWE-ID) pair.

(c) Semantic Linking: This is the key step of the dataset construction in this paper. Using the CWE-ID as a common key, each record generated in step (b) was linked to the official CWE dictionary. The semantically neutral CWE-ID symbols in these records were replaced with their corresponding official, complete CWE descriptions, thereby constructing semantically rich pairs of CVE description and CWE description text.

(d) CWE Filtering: To ensure the precision and operability of model learning, two levels of filtering were performed on the CWE data according to the official MITRE best practice guidelines. First, at the abstraction level, only the Base- and Variant-level CWE entries, which have the finest granularity, were retained. Second, regarding usage specifications, only categories with the status Allowed were kept. This effectively removed CWE categories that were too abstract or not recommended for mapping, as shown in Table 1.
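Steps (b) through (d) can be sketched in a few lines of Python. The record layout, the field names (cve_id, cwe_ids, abstraction, usage), the CVE identifiers, and the placeholder description strings below are hypothetical simplifications for illustration. The real NVD JSON feeds and CWE XML files use richer schemas, and this is not the authors' script.

```python
# Hypothetical, simplified inputs. Real NVD and CWE data are far richer.
cve_records = [
    {"cve_id": "CVE-0000-0001", "description": "SQL injection in a login form.",
     "cwe_ids": ["CWE-89"]},
    {"cve_id": "CVE-0000-0002", "description": "Out-of-bounds read and write in a parser.",
     "cwe_ids": ["CWE-125", "CWE-787"]},
    {"cve_id": "CVE-0000-0003", "description": "Input is not validated before use.",
     "cwe_ids": ["CWE-20"]},
]

# Descriptions are placeholders, not the official CWE text.
cwe_catalog = {
    "CWE-89": {"name": "Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection')",
               "description": "<CWE-89 description>", "abstraction": "Base", "usage": "Allowed"},
    "CWE-125": {"name": "Out-of-bounds Read",
                "description": "<CWE-125 description>", "abstraction": "Base", "usage": "Allowed"},
    "CWE-787": {"name": "Out-of-bounds Write",
                "description": "<CWE-787 description>", "abstraction": "Base", "usage": "Allowed"},
    "CWE-20": {"name": "Improper Input Validation",
               "description": "<CWE-20 description>", "abstraction": "Class", "usage": "Discouraged"},
}


def flatten(records):
    """Step (b): emit one (cve_id, description, cwe_id) row per associated CWE."""
    return [(r["cve_id"], r["description"], cwe_id)
            for r in records for cwe_id in r["cwe_ids"]]


def is_kept(entry):
    """Step (d): keep only Base/Variant entries whose usage status is Allowed."""
    return entry["abstraction"] in {"Base", "Variant"} and entry["usage"] == "Allowed"


def build_pairs(records, catalog):
    """Step (c): replace each CWE-ID with its name and description text."""
    pairs = []
    for cve_id, cve_desc, cwe_id in flatten(records):
        entry = catalog.get(cwe_id)
        if entry is None or not is_kept(entry):
            continue  # unknown, too abstract, or not allowed for mapping
        pairs.append({
            "cve_id": cve_id,
            "cve_desc": cve_desc,
            "cwe_id": cwe_id,
            "cwe_name": entry["name"],
            "cwe_desc": entry["description"],
        })
    return pairs


pairs = build_pairs(cve_records, cwe_catalog)
```

In this toy input, the CVE with two CWE-IDs yields two rows after flattening, and the row linked to the Class-level entry is dropped by the filter.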

3.2. Generation Module

The generative module is the core of V2W-LLM. Its objective is to fine-tune a pre-trained large language model (LLM), enabling it to generate semantically aligned CWE descriptive text based on a profound understanding of CVE descriptions.

3.2.1. Fine-Tuning Methods

To generate CWE text descriptions that meet the requirements while reducing resource demands, this paper adopts the Low-Rank Adaptation (LoRA) method [25], a mainstream and efficient approach in the field of fine-tuning. Its core idea is that when adapting to downstream tasks, the change in weights is low-rank, meaning that weight updates primarily occur within a low-dimensional subspace. Consequently, instead of directly updating the entire pre-trained weight matrix W_{0}, LoRA freezes W_{0} and injects a pair of small, trainable low-rank decomposition matrices A and B via a bypass to approximate the update \Delta W. This can be formalized as follows:

h = W_{0} x + \Delta W x = W_{0} x + B A x

(1)

In Equation (1), x represents the input vector and h represents the output vector. W_{0} is the original pre-trained weight matrix, while the low-rank matrices A and B are the only parameters updated during training. The number of trainable parameters is reduced from the number of entries in W_{0} to the combined number of entries in A and B, where the rank r is typically much smaller than the dimensions of W_{0}. This allows the number of trainable parameters to be reduced by several orders of magnitude, thereby significantly lowering the resources required for training and improving training efficiency while maintaining performance.
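The low-rank update can be illustrated with a minimal pure-Python sketch. The shapes (W_{0} of size d × k, B of size d × r, A of size r × k), the zero initialization of B, and the 4096 × 4096 example layer are assumptions taken from the general LoRA formulation rather than details reported in this paper.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W0, A, B, x):
    """Compute h = W0 x + B (A x): frozen path plus low-rank bypass."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def parameter_counts(d, k, r):
    """Return (entries of W0, entries of A and B combined)."""
    return d * k, r * (d + k)


random.seed(0)
d, k, r = 6, 5, 2  # toy sizes, chosen only for illustration
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]                                  # trainable, zero at start
x = [random.gauss(0, 1) for _ in range(k)]

# With B = 0 the bypass contributes nothing, so h equals the frozen output.
h = lora_forward(W0, A, B, x)

# Assumed 4096 x 4096 projection with rank 8:
full, lora = parameter_counts(4096, 4096, 8)  # 16777216 and 65536
```

For this assumed layer size, the bypass trains 256 times fewer parameters than a full update of the same matrix would.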

3.2.2. Fine-Tuning Strategy

To explore the optimal model fine-tuning paradigm, this paper designs three fine-tuning strategies and determines their impact on overall framework performance through experiments. The strategies differ in their input-output formats and in how the model is guided. To describe them precisely, we first define the following notation:

f (\cdot): The fine-tuned model under a given strategy, abstracted as a function.

CVE_desc: The vulnerability description text.

CWE_name: The name of the weakness.

CWE_desc: The weakness description text.

⊕: The concatenation operator for textual sequences.

(a) Strategy A: Baseline Generation

f_{A} (CVE_desc) \to CWE_desc

(2)

The task is modeled as a standard sequence-to-sequence mapping, where the model is trained to directly generate the corresponding CWE description from the CVE description. The training and inference processes remain consistent. This strategy relies entirely on the model’s autonomous learning of complex semantic correlations from phenomenon to essence.

(b) Strategy B: Generation with Teacher Signals

To reduce the learning difficulty of the model during the training phase, Strategy B introduces the CWE name as a strong prompt signal to guide the model to focus on generating the correct descriptive content. However, during inference, the model still needs to make independent predictions based on the CVE description.

Training:

f_{B} (CVE_desc, CWE_name) \to CWE_desc

(3)

Inference:

f_{B} (CVE_desc) \to CWE_desc

(4)

(c) Strategy C: Structured Generation

The model is trained to generate structured text containing both the name and the description, and the training and inference processes remain consistent. This structured output format first determines the CWE name and then generates the corresponding standard description. This mechanism, similar to a Chain-of-Thought (CoT), can enhance the relevance and accuracy of the generated text. The instruction templates used during the training process are shown in Table 2.

f_{C} (CVE_desc) \to CWE_name \oplus CWE_desc

(5)
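The difference between the three strategies comes down to what the model sees as input and what it is asked to produce. The sketch below makes this concrete; the prompt wording and the function build_example are hypothetical and do not reproduce the instruction templates in Table 2.

```python
def build_example(strategy, cve_desc, cwe_name, cwe_desc, training=True):
    """Return an (input_text, target_text) pair for one strategy.

    The template strings are illustrative placeholders only.
    """
    prompt = f"CVE description: {cve_desc}"
    if strategy == "A":
        # Equation (2): CVE_desc -> CWE_desc, same format in training and inference.
        return prompt, cwe_desc
    if strategy == "B":
        # Equations (3) and (4): the CWE name is given as a hint only during training.
        if training:
            prompt += f"\nCWE name: {cwe_name}"
        return prompt, cwe_desc
    if strategy == "C":
        # Equation (5): the target concatenates the name and the description.
        return prompt, f"CWE name: {cwe_name}\nCWE description: {cwe_desc}"
    raise ValueError(f"unknown strategy: {strategy!r}")


train_b = build_example("B", "Heap overflow in parser.", "Out-of-bounds Write",
                        "<CWE description>", training=True)
infer_b = build_example("B", "Heap overflow in parser.", "Out-of-bounds Write",
                        "<CWE description>", training=False)
train_c = build_example("C", "Heap overflow in parser.", "Out-of-bounds Write",
                        "<CWE description>")
```

Only Strategy B has inputs that differ between training and inference; Strategies A and C use the same format in both phases.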

3.2.3. Retrieval Mapping Module

(a) Embedding Model Selection: The BGE (BAAI General Embedding) model [26] released by the Beijing Academy of Artificial Intelligence (BAAI) has consistently ranked among the top in the MTEB (Massive Text Embedding Benchmark) [27] retrieval tasks. Furthermore, the BGE model exhibits excellent domain adaptability. Its training corpus covers a vast and diverse range of network texts and technical documents, providing it with a stronger understanding of professional terminology common in the cybersecurity domain. Therefore, the bge-large-en-v1.5 model is selected for this study.

(b) Offline Knowledge Base Construction: After determining the embedding model, a persistent vector knowledge base is constructed through an offline process. First, the IDs and descriptions of the filtered official CWE entries are traversed and encoded into 1024-dimensional semantic vectors. After aggregating all vectors, a Faiss (Facebook AI Similarity Search) [28] index is constructed. Finally, the generated Faiss index (index.faiss) and the mapping file (id_map.json), which records the correspondence between vectors and CWE-IDs, are stored persistently to provide instant and accurate retrieval capabilities for the online inference stage.

(c) Mapping Method: The generated CWE descriptive text is represented as a vector through the retrieval mapping module. By calculating the cosine similarity between this representation and each CWE vector in the knowledge base, the CWE-ID with the highest similarity is selected as the final mapping result. Given the generated CWE descriptive text vector E_{gen} and the knowledge base CWE vector E_{cwe}^{i}, the matching degree between them is calculated as follows:

sim (E_{gen}, E_{cwe}^{i}) = \frac{E_{gen} \cdot E_{cwe}^{i}}{\| E_{gen} \| \| E_{cwe}^{i} \|}

(6)

where \| E \| denotes the L_{2} norm of vector E. The CWE-ID is selected based on maximum similarity. By calculating the similarity between the query vector E_{gen} and all M CWE vectors in the knowledge base, the index j corresponding to the entry with the highest score is determined by the following formula:

j = \arg\max_{i \in \{1, \dots, M\}} sim (E_{gen}, E_{cwe}^{i})

(7)

The final prediction result is ID_{j}, where M represents the total number of entries in the CWE knowledge base.
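Equations (6) and (7) amount to a brute-force nearest-neighbor search. The sketch below uses toy 3-dimensional vectors and placeholder IDs in place of the 1024-dimensional bge-large-en-v1.5 embeddings and the Faiss index described above; the function names and data are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity as in Equation (6); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top1_cwe(e_gen, cwe_vectors, id_map):
    """Equation (7): return the CWE-ID with the highest similarity and its score."""
    scores = [cosine(e_gen, e_cwe) for e_cwe in cwe_vectors]
    j = max(range(len(scores)), key=scores.__getitem__)
    return id_map[j], scores[j]


# Toy knowledge base: row i of cwe_vectors belongs to id_map[i].
cwe_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
id_map = ["CWE-A", "CWE-B", "CWE-C"]  # placeholder IDs
e_gen = [0.5, 0.9, 0.1]               # stand-in for the embedded generated text

predicted_id, score = top1_cwe(e_gen, cwe_vectors, id_map)
```

When all vectors are normalized to unit length, cosine similarity reduces to the inner product, so the same ranking can be obtained from an inner-product index over normalized embeddings.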

3.3. Methodological Synthesis

The three modules described above—Data Construction (Section 3.1), Generation (Section 3.2), and Retrieval Mapping (Section 3.2.3)—do not operate in isolation but form a cohesive, data-driven pipeline. The synergy of these components is designed to overcome the semantic gap and long-tail challenges identified in existing literature. To rigorously validate the efficacy of this integrated approach, the following section presents a multi-dimensional evaluation. We first assess the core “reasoning” quality of the generation module (Section 4.3) and then evaluate the final end-to-end mapping performance across multiple real-world datasets (Section 4.4 and Section 4.5).

4. Results

4.1. Dataset and Experimental Procedure

This paper collected all CVE entries and their corresponding CWE-IDs from 1999 to 2023 from the NVD knowledge base [1,3]. Through the data construction module, a total of 65,167 CVE-CWE description pairs were ultimately obtained.

Dataset 1: A subset comprising 16 common CWE categories with a total of 34,672 CVE vulnerabilities [1,3], as detailed in Table 3. This dataset was partitioned into training, validation, and test sets according to a ratio of 8:1:1.

Dataset 2: The latest 2024 NVD dataset, which encompasses 220 CWE categories and a total of 13,033 vulnerabilities [1,3].

Dataset 3: The ChatGPT-VDMEval public dataset, originating from the study by Liu et al. [29], which contains 151 CWE categories and 8223 CVE vulnerabilities.

In terms of the experimental workflow, this paper first utilizes Dataset 1 to validate the effectiveness of the proposed model architecture and perform benchmark comparisons against existing mainstream methods. Subsequently, to enable the model to capture a more comprehensive range of vulnerability features, the full dataset of 65,167 entries is employed for the final training phase. Datasets 2 and 3 remain strictly unseen throughout the training process and are utilized solely to evaluate the model’s cross-dataset generalization performance and its capability to identify newly emerging vulnerabilities in real-world scenarios.

4.2. Parameter Settings and Evaluation Indicators

The experimental environment was configured as follows: a server with a single NVIDIA L20 (48 GB) GPU was used, and all fine-tuning procedures were implemented using the open-source LLaMA-Factory framework. Meta-Llama-3-8B was selected as the base large language model. Built upon the Transformer [30] architecture, this model possesses robust capabilities in general language understanding and generation. To achieve efficient model adaptation, the LoRA fine-tuning method was employed. Detailed parameter configurations are summarized in Table 4.

This paper establishes a hierarchical evaluation framework. First, regarding the generative module, standard text generation metrics, ROUGE-L and BLEU-4, are adopted to quantitatively evaluate the similarity between model-generated CWE descriptions and official standard descriptions in terms of content semantics and syntactic structure. Second, for the end-to-end CVE-to-CWE-ID mapping task, this study treats the process as a classification problem and employs Accuracy, Macro-Precision, Macro-Recall, and Macro-F1 as evaluation metrics. The calculation formulas are as follows:

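Acc = \frac{\sum_{i = 1}^{N} I ({Predict_CWE}_{i} = {True_CWE}_{i})}{N}

(8)

P_{i} = \frac{TP_{i}}{TP_{i} + FP_{i}}, \quad Macro\text{-}P = \frac{1}{n} \sum_{i = 1}^{n} P_{i}

(9)

R_{i} = \frac{TP_{i}}{TP_{i} + FN_{i}}, \quad Macro\text{-}R = \frac{1}{n} \sum_{i = 1}^{n} R_{i}

(10)

F1_{i} = \frac{2 \times P_{i} \times R_{i}}{P_{i} + R_{i}}, \quad Macro\text{-}F1 = \frac{1}{n} \sum_{i = 1}^{n} F1_{i}

(11)

Here N is the number of test samples, I (\cdot) is the indicator function, n is the number of CWE categories, and TP_{i}, FP_{i}, and FN_{i} denote the true positives, false positives, and false negatives for category i.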
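Equations (8) to (11) can be computed directly from two label lists. In the minimal sketch below, the toy labels, the choice to average over every label appearing in either list, and the convention of scoring 0.0 when a denominator is zero are assumptions, not details given in the paper.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, macro-precision, macro-recall and macro-F1."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        precisions.append(p_c)
        recalls.append(r_c)
        f1s.append(f_c)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "macro_p": sum(precisions) / n,
        "macro_r": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy labels: the rare class CWE-125 is never predicted correctly.
y_true = ["CWE-79", "CWE-79", "CWE-89", "CWE-89", "CWE-125"]
y_pred = ["CWE-79", "CWE-89", "CWE-89", "CWE-89", "CWE-79"]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case the accuracy is 0.6 while the Macro-F1 is about 0.43, because the missed rare class counts as much as the frequent ones in the macro average.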

4.3. Generation Quality Assessment and Analysis

To determine the optimal generative model and training strategy, comparative experiments were first conducted on the three proposed fine-tuning strategies across various LLMs to evaluate the quality of the generated CWE descriptions.

Comparative Analysis of Strategies: The results in Table 5 indicate that Strategy C consistently achieves the best performance across all models. Taking the Llama3-8B model as an example, Strategy C achieves ROUGE-L and BLEU-4 scores of 92.31 and 90.68, respectively, significantly outperforming Strategy A and Strategy B. This demonstrates that the multi-task learning paradigm in Strategy C—which requires the model to simultaneously generate both the CWE name and description—enables the model to establish stronger CVE-CWE semantic associations. The generated CWE name serves as an internal prompt that enhances the model’s generalization capability. In contrast, Strategy A produces concise outputs but lacks critical CWE name information, while Strategy B suffers from descriptive bias in certain cases due to the inconsistency between its training and inference phases.

Furthermore, the scale of the model significantly impacts the generation quality. Across all strategies, the larger-parameter models, Llama3-8B and Qwen3-8B, consistently outperform Qwen3-4B. Notably, under the optimal Strategy C, Llama3-8B exhibits the best performance with a ROUGE-2 score of 91.16. Since ROUGE-2 measures the bigram overlap between the generated and reference texts, this high score indicates that the model is highly accurate in generating key technical phrases such as “buffer overflow” and “improper neutralization.” Based on the above analysis, the combination of the Llama3-8B model and Strategy C demonstrates superior generative capabilities across all metrics. The training loss curves in Figure 3 further reveal its excellent training dynamics. Consequently, the combination of Llama3-8B and Strategy C is identified as the most effective generative module and is adopted as the optimal implementation for the V2W-LLM framework.
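To make the bigram-overlap idea behind ROUGE-2 concrete, the following sketch counts clipped bigram matches between a generated and a reference sentence. Lowercasing with whitespace tokenization and the two example sentences are assumptions; standard ROUGE implementations may tokenize or stem differently, so their scores need not match this sketch.

```python
from collections import Counter


def bigram_counts(text):
    """Count bigrams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(candidate, reference):
    """Return (precision, recall, F1) over clipped bigram overlap."""
    cand, ref = bigram_counts(candidate), bigram_counts(reference)
    overlap = sum((cand & ref).values())  # Counter '&' keeps the minimum counts
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative sentences, not official CWE text.
reference = "the product reads data past the end of the intended buffer"
candidate = "the product reads data past the end of the buffer"
precision, recall, f1 = rouge2(candidate, reference)
```

Here eight of the ten reference bigrams are recovered, so dropping a single word ("intended") already costs two bigrams of recall.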

4.4. Experimental Results and Analysis

The experimental results for common CWE types on Dataset 1 are summarized in Table 6. Deep learning-based models significantly outperform traditional machine learning methods across all evaluated metrics. To further validate the effectiveness of our approach, we implemented V2W-BERT as a high-performance classification baseline. Among all the models compared, the proposed V2W-LLM model demonstrates the best overall performance, attaining the highest values in most metrics, especially those reflecting categorical balance (Macro-Precision, Macro-Recall, and Macro-F1).

Specifically, the model achieves an Accuracy of 90.18 and a Macro-F1 score of 87.64, representing a further improvement over the TF-IDF+W2V+TCNN-BiGRU model. Notably, while V2W-BERT achieves the highest Accuracy of 91.35, its Macro-F1 score of 81.14 is significantly lower than that of V2W-LLM. This suggests that while fixed-head classifiers excel at label matching, they may struggle with the semantic nuances across diverse categories. The experimental results demonstrate that by deeply integrating the generative capabilities of large language models, V2W-LLM can capture complex feature correlations in end-to-end mapping tasks. It maintains high classification precision across the evaluated CWE types and proves to be an effective and robust solution for automated vulnerability mapping.

4.5. Experimental Comparison

Due to the significant heterogeneity in the selection of vulnerability types and data volumes across different studies, a direct performance comparison between the proposed model and existing research is infeasible. Consequently, we selected several of the most common vulnerability types that overlap with previous literature to conduct a comparative analysis. The classification performance and the corresponding confusion matrix are presented in Table 7 and Figure 4, respectively.

The results indicate that in data-dense categories such as CWE-78 and CWE-89, our model demonstrates performance parity with existing research, with metric variances remaining within a minimal range. Notably, in categories with relatively smaller data scales, such as CWE-125 and CWE-787, our model exhibits a distinct advantage over the compared literature, with Macro-F1 scores improving by 3.13 and 2.43, respectively. This proves that the V2W-LLM model can effectively mitigate the heavy reliance of traditional methods on large-scale annotated datasets. It possesses superior semantic generalization and robustness when handling few-shot or data-sparse categories, enabling a more precise capture of deep features within low-frequency classes.

4.6. Ablation Experiment Analysis

To verify the individual contributions of the core components within V2W-LLM to the overall classification performance, an ablation study was conducted. The analysis results are summarized in Table 8.

The experimental results demonstrate that when the generative module and retrieval mapping are removed in favor of a standard direct fine-tuning mode (i.e., mapping CVE descriptions directly to CWE-IDs), the Macro-F1 score decreases from 87.64 to 84.95. This proves that converting abstract IDs into semantically rich descriptive text and performing retrieval calibration can significantly enhance the model’s depth of understanding regarding vulnerability features.

Removing the retrieval mapping corresponds to the strategy of generating CWE description text only, the results of which are detailed in Table 5. When the data construction module is removed, the Macro-F1 score plummets to 56.14. Furthermore, a scheme based solely on vector similarity matching (removing both data construction and generative modules) performs reasonably in terms of Accuracy (Acc); however, its Macro-F1 value of 61.86 reflects the inherent limitations of unsupervised retrieval in handling complex classification and long-tail distributions.

In summary, the performance of V2W-LLM does not stem from a single improvement but is the result of the synergy between three modules: data construction, generative mapping, and retrieval optimization. Additionally, the impact of key hyperparameters on the Macro-F1 metric is further explored through sensitivity analysis in Figure 5.

To further verify the robustness of the V2W-LLM model and determine the optimal parameter configuration, we conducted a detailed sensitivity analysis on the low-rank order r, scaling factor α, and dropout ratio. Experimental results demonstrate that as r increases from 2 to 8, the Macro-F1 score of the model exhibits a significant upward trend, reflecting that an adequate rank is crucial for capturing complex vulnerability semantic features, whereas performance gradually stabilizes when r exceeds 8, indicating that the representation capacity of the model has reached saturation. Regarding the scaling factor α, the model performance follows an inverted U-shaped distribution, peaking at α = 1.5, which underscores its sensitivity in balancing the inherent knowledge of the pre-trained model with the magnitude of weight updates during fine-tuning. Furthermore, the model performance shows a clear negative correlation with the dropout ratio, where increasing the ratio leads to a substantial decline in classification accuracy; this suggests that in fine-grained CVE-to-CWE mapping tasks, maintaining the integrity of critical semantic features is more vital than preventing overfitting via high-ratio stochastic deactivation. In summary, the model achieves the optimal balance between performance and computational efficiency with the configuration of r = 8, α = 1.5, and a dropout of 0.1.
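To make the roles of r, α, and dropout concrete, the sketch below implements a single LoRA-adapted linear layer in plain Python. It assumes the standard formulation from the LoRA paper by Hu et al. listed in the references, h = Wx + (α/r)·B(Ax), with dropout applied to the input of the low-rank branch during training. This section does not state which scaling rule or dropout placement the authors' training framework uses, so both are assumptions, and the dimensions and random weights are purely illustrative.

```python
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha=1.5, r=8, dropout=0.1, training=False, rng=None):
    """Compute h = W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in weight, A is r x d_in, B is d_out x r.
    Dropout on the low-rank branch input is an assumed placement.
    """
    assert len(A) == r, "A must have r rows"
    scale = alpha / r
    x_branch = x
    if training and dropout > 0.0:
        rng = rng or random.Random(0)
        keep = 1.0 - dropout
        # Inverted dropout: zero some inputs and rescale the survivors.
        x_branch = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    base = matvec(W, x)
    update = matvec(B, matvec(A, x_branch))
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative sizes only; real attention projections are far larger.
rng = random.Random(42)
d_in, d_out, r = 16, 16, 8
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [rng.gauss(0, 1) for _ in range(d_in)]

h0 = lora_forward(x, W, A, B)  # identical to the frozen layer before training
print("scale alpha/r =", 1.5 / 8)
```

Under this formulation, r = 8 and α = 1.5 give a scaling factor of 0.1875, so the low-rank update is damped relative to the frozen weights. This matches the text's description of α as a balance between pre-trained knowledge and the magnitude of the fine-tuning update.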

5. Discussion

To validate the practical effectiveness of V2W-LLM in realistic scenarios, we evaluate the fully fine-tuned model on Datasets 2 and 3.

These independent test sets encompass a broader range of CWE categories and include the latest 2024 vulnerability samples. As summarized in Table 9, the model achieves an Accuracy of 88.21% and a Weighted-F1 score of 86.92% across all 220 CWE categories in Dataset 2. On Dataset 3, the Accuracy and F1 score reach 87.68% and 86.02%, respectively.

Performance improves when the evaluation focuses on more common categories. Narrowing the target in Dataset 2 to the Top-100 categories increases Accuracy and F1 to 89.29% and 88.37%, and focusing further on the Top-25 core weaknesses raises them to 93.67% and 94.44%. Dataset 3 follows the same trend, which indicates that our generation-retrieval paradigm generalizes across different data distributions.
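The Top-K evaluation can be sketched as follows. The sketch assumes that a Top-K subset keeps only the samples whose ground-truth CWE is among the K most frequent categories, and that F1 is support-weighted; the text calls the metric Weighted-F1, and Recall equals Accuracy in every row of Table 9, which is what support-weighted averaging produces. The function names and toy labels are hypothetical.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, label):
    """One-vs-rest F1 for a single label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def evaluate_top_k(y_true, y_pred, k=None):
    """Accuracy and support-weighted F1 on samples whose true label is in the top k.

    k=None keeps every category. The filtering rule is an assumption.
    """
    keep = {c for c, _ in Counter(y_true).most_common(k)}
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in keep]
    t_sub = [t for t, _ in pairs]
    p_sub = [p for _, p in pairs]
    n = len(pairs)
    acc = sum(t == p for t, p in pairs) / n
    wf1 = sum(cnt / n * f1_for_class(t_sub, p_sub, c)
              for c, cnt in Counter(t_sub).items())
    return {"n": n, "accuracy": acc, "weighted_f1": wf1}


# Toy labels (hypothetical): two frequent categories and one tail category.
y_true = ["CWE-79"] * 5 + ["CWE-89"] * 3 + ["CWE-23"] * 2
y_pred = ["CWE-79"] * 4 + ["CWE-89"] + ["CWE-89"] * 3 + ["CWE-22", "CWE-23"]

all_res = evaluate_top_k(y_true, y_pred)
top2_res = evaluate_top_k(y_true, y_pred, k=2)
print(all_res)
print(top2_res)
```

Dropping the tail category removes the samples the toy predictor handles worst, so accuracy rises from the full set to the Top-2 subset, mirroring the direction of the trend in Table 9.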

The interpretability of V2W-LLM is demonstrated through five representative 2024 CVE case studies in Table 10. These cases illustrate the workflow, advantages, and limitations of our framework. The first four successful cases cover diverse high-risk vulnerabilities, such as Cross-Site Scripting (XSS), Server-Side Request Forgery (SSRF), Command Injection, and Information Exposure, highlighting the model’s broad applicability. For example, in CVE-2024-28191, where the input describes “injecting arbitrary web scripts,” the model does not predict an ID directly but generates a standardized definition: “Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting').” This output shows that the model translates colloquial text into a formal weakness definition. Because the generated text closely matches the official CWE-79 definition, it receives the highest similarity score and is mapped correctly.

Conversely, the final failure case reveals limitations in handling hierarchical relationships and long-tail data. In CVE-2024-21626, the ground truth is CWE-404 (Improper Resource Shutdown or Release), a specific Variant category. However, the model generates a description for CWE-403, its Base (parent) category. Although abstract Pillar and Class levels are filtered during preprocessing, distinguishing between closely related Base and Variant levels remains an open challenge.

Regarding the tail CWE categories, this study opted to include them in fine-tuning rather than discard them. The reasoning and generative capabilities of LLMs help mitigate the long-tail issue to a certain extent. By embedding all 748 CWE categories and calculating cosine similarity with the generated descriptions, our approach avoids the limitation of traditional classification methods, which can only recognize a predefined and limited number of CWEs. V2W-LLM can therefore be extended to additional CWEs and has the potential to infer latent ones, which makes it a practical option for automated vulnerability analysis.
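The retrieval step described here, embedding the CWE descriptions and selecting the one with the highest cosine similarity to the generated text, can be sketched as below. To stay self-contained, the sketch replaces the BAAI embedding model with a toy bag-of-words vector, which is a stand-in and not the authors' embedding. The three-entry catalog is a tiny excerpt of official CWE names, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence-embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Tiny excerpt of the CWE catalog; the paper embeds 748 categories.
CWE_CATALOG = {
    "CWE-79": "Improper Neutralization of Input During Web Page Generation "
              "('Cross-site Scripting')",
    "CWE-78": "Improper Neutralization of Special Elements used in an OS Command "
              "('OS Command Injection')",
    "CWE-200": "Exposure of Sensitive Information to an Unauthorized Actor",
}


def map_to_cwe(generated_text, catalog):
    """Return (cwe_id, score) of the catalog entry most similar to the text (Top-1)."""
    query = embed(generated_text)
    score, cwe_id = max((cosine(query, embed(desc)), cid)
                        for cid, desc in catalog.items())
    return cwe_id, score


# A slightly off-template generation still lands on the intended entry.
best_id, best_score = map_to_cwe(
    "Improper neutralization of input during web page generation (XSS)",
    CWE_CATALOG,
)
print(best_id, round(best_score, 3))
```

Because the final answer is always an entry of the catalog, covering a new CWE only requires adding its description to the catalog and embedding it, rather than retraining a fixed classification head.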

6. Conclusions

This paper proposes and implements V2W-LLM, a vulnerability-to-weakness mapping model based on large language models. Through a three-stage workflow consisting of data construction, generative modeling, and retrieval mapping, the model recasts the traditional classification task as a generation-retrieval paradigm. Experimental results demonstrate that V2W-LLM achieves robust performance in vulnerability-to-weakness mapping. Although the generative nature of LLMs introduces a risk of semantic hallucinations, our framework mitigates this risk through a semantic calibration step that grounds the generated outputs in a verified CWE knowledge base.

To provide a clear overview of the architectural advancements achieved by this paradigm, Table 11 explicitly summarizes the novel contributions of V2W-LLM and contrasts them with traditional classification paradigms.

7. Future Work

While V2W-LLM provides an effective solution for automated mapping, several avenues for future exploration remain:

1. Hierarchical Knowledge Integration: Future research will focus on integrating the complex hierarchical relationships of the CWE into the model’s loss function. This aims to resolve current discrepancies in handling fine-grained categories and improve the model’s ability to distinguish between Base and Variant-level weaknesses.

2. Automation Chain Extension: We plan to extend the current CVE-CWE mapping capability to include broader security standards, such as CAPEC (Common Attack Pattern Enumeration and Classification) and the MITRE ATT&CK framework. This will enable the automated generation of complete attack-defense paths from raw vulnerability reports.

3. Multi-source Cyber Knowledge Graph Reasoning: A key long-term goal is the construction of a comprehensive cybersecurity knowledge graph integrating multi-source data including CVE, CWE, CAPEC, and ATT&CK. By leveraging advanced Knowledge Graph Embedding (KGE) models, we aim to perform link prediction and reasoning completion to uncover latent security threats and provide a holistic view of the global threat landscape.

Author Contributions

Conceptualization, Z.W. and M.N.; methodology, Z.W.; software, Z.W.; validation, Z.W., J.Z. and Y.J.; formal analysis, Z.W.; investigation, Z.W.; resources, M.N.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, M.N.; visualization, Z.W.; supervision, M.N.; project administration, M.N.; funding acquisition, M.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2023D01A46); the 2025 Special Research Project on Education Network Security (CAETCS25006); the National Key R&D Program (2024YFF0908203-3); the Shanghai Cooperation Organization Science and Technology Partnership Program and International Science and Technology Cooperation Program (2025E01038); and the Xinjiang “Tianshan Talent” Training Program for Outstanding Engineers (EB0210).

Data Availability Statement

The data and code presented in this study are openly available in Gitee at https://gitee.com/wangziguo999/V2W.git (accessed on 10 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. MITRE. Common Vulnerabilities and Exposures (CVE). Available online: https://cve.mitre.org/ (accessed on 15 September 2024).
2. Komaragiri, V.B.; Edward, A. AI-driven vulnerability management and automated threat mitigation. Int. J. Sci. Res. Manag. 2022, 10, 981–998.
3. MITRE. Common Weakness Enumeration (CWE). Available online: https://cwe.mitre.org/ (accessed on 15 September 2024).
4. Iannone, E.; Guadagni, R.; Ferrucci, F.; De Lucia, A.; Palomba, F. The secret life of software vulnerabilities: A large-scale empirical study. IEEE Trans. Softw. Eng. 2022, 49, 44–63.
5. Haddad, O.A.; Ikram, M.; Ahmed, E.; Lee, Y. Prompting the Priorities: A First Look at Evaluating LLMs for Vulnerability Triage and Prioritization. arXiv 2025, arXiv:2510.18508.
6. Uddin, M.N.; Zhang, Y.; Hei, X. Deep learning aided software vulnerability detection: A survey. arXiv 2025, arXiv:2503.04002.
7. Risse, N.; Böhme, M. Uncovering the limits of machine learning for automatic vulnerability detection. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 4247–4264.
8. Rehman, S.; Mustafa, K. Software design level vulnerability classification model. Int. J. Comput. Sci. Secur. IJCSS 2012, 6, 238.
9. Aota, M.; Kanehara, H.; Kubo, M.; Murata, N.; Sun, B.; Takahashi, T. Automation of vulnerability classification from its description using machine learning. In Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 8–10 July 2020; pp. 1–7.
10. Davari, M.; Zulkernine, M.; Jaafar, F. An automatic software vulnerability classification framework. In Proceedings of the 2017 International Conference on Software Security and Assurance (ICSSA), Altoona, PA, USA, 24–25 July 2017; pp. 44–49.
11. Na, S.; Kim, T.; Kim, H. A study on the classification of common vulnerabilities and exposures using naïve bayes. In Proceedings of the International Conference on Broadband and Wireless Computing, Communication and Applications, Asan, Republic of Korea, 5–7 November 2016; pp. 657–662.
12. Terdchanakul, P.; Hata, H.; Phannachitta, P.; Matsumoto, K. Bug or not? Bug report classification using n-gram idf. In Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), Shanghai, China, 17–22 September 2017; pp. 534–538.
13. Albanese, M.; Adebiyi, O.; Onovae, F. CVE2CWE: Automated Mapping of Software Vulnerabilities to Weaknesses Based on CVE Descriptions. In Proceedings of the 21st International Conference on Security and Cryptography (SECRYPT), Dijon, France, 8–10 July 2024; pp. 500–507.
14. Nakagawa, S.; Nagai, T.; Kanehara, H.; Furumoto, K.; Takita, M.; Shiraishi, Y.; Takahashi, T.; Mohri, M.; Takano, Y.; Morii, M. Character-level convolutional neural network for predicting severity of software vulnerability from vulnerability description. IEICE Trans. Inf. Syst. 2019, 102, 1679–1682.
15. Saklani, S.; Kalia, A. Severity prediction of software vulnerabilities using convolutional neural networks. Inf. Comput. Secur. 2025, 33, 613–630.
16. Sun, X.; Li, L.; Bo, L.; Wu, X.; Wei, Y.; Li, B. Automatic software vulnerability classification by extracting vulnerability triggers. J. Softw. Evol. Process 2024, 36, e2508.
17. Zhang, H.; He, D. Research on Automatic Vulnerability Classification Technology Based on BiGRU-TextCNN Framework. J. Inf. Secur. Res. 2024, 10, 446–452. (In Chinese)
18. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
19. Das, S.S.; Serra, E.; Halappanavar, M.; Pothen, A.; Al-Shaer, E. V2W-BERT: A framework for effective hierarchical multiclass classification of software vulnerabilities. In Proceedings of the 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), Porto, Portugal, 6–9 October 2021; pp. 1–12.
20. Zhu, C.; Du, G.; Wu, T.; Cui, N.; Chen, L.; Shi, G. BERT-based vulnerability type identification with effective program representation. In Proceedings of the International Conference on Wireless Algorithms, Systems, and Applications, Dalian, China, 7–9 April 2022; pp. 271–282.
21. Yang, J.; Li, W.; He, J.; Zhou, S.; Li, T.; Wang, Y. Vulnerability Classification Method Based on Dual Attention Mechanism and Improved Adversarial Training. Appl. Res. Comput. 2024, 41, 3447–3454. (In Chinese)
22. Wang, T.; Qin, S.; Chow, K.P. Towards vulnerability types classification using pure self-attention: A common weakness enumeration based approach. In Proceedings of the 2021 IEEE 24th International Conference on Computational Science and Engineering (CSE), Shenyang, China, 20–22 October 2021; pp. 146–153.
23. Simonetto, S.; van Ede, T.S.; Bosch, P.; Jonker, W.; Oostveen, R. Text2Weak: Mapping CVEs to CWEs using description embeddings analysis. In Proceedings of the 4th Workshop on Artificial Intelligence-Enabled Cybersecurity Analytics, Barcelona, Spain, 26 August 2024.
24. Neelakantan, A.; Xu, T.; Puri, R.; Radford, A.; Han, J.M.; Tworek, J.; Yuan, Q.; Tezak, N.; Kim, J.W.; Hallacy, C.; et al. Text and code embeddings by contrastive pre-training. arXiv 2022, arXiv:2201.10005.
25. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
26. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 2318–2335.
27. Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Dubrovnik, Croatia, 2–6 May 2023; pp. 2014–2037.
28. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547.
29. Liu, X.; Tan, Y.; Xiao, Z.; Zhuge, J.; Zhou, R. Not the end of story: An evaluation of ChatGPT-driven vulnerability description mappings. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 3724–3731.
30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
31. Wang, Q.; Gao, Y.; Ren, J.; Zhang, B. An automatic classification algorithm for software vulnerability based on weighted word vector and fusion neural network. Comput. Secur. 2023, 126, 103070.

Figure 1. Number of vulnerabilities.

Figure 2. Architecture of V2W-LLM.

Figure 3. Loss Change Curve.

Figure 4. Confusion Matrix.

Figure 5. Hyperparameter sensitivity experiment.

Table 1. Statistics of data construction and filtering.

Phase	Description	Quantity
Initial CWE	Initial CWE categories	969
Filtered CWE	Filtered CWE categories	748
Initial CVE	Total number of CVEs (1999–2023)	96,460
Filtered CVE	〈CVE description, CWE description〉 text pairs	65,167

Table 2. Instruction fine-tuning template.

Keyword	Description
Instruction	Given the vulnerability description, identify and provide the name and description of the corresponding Common Weakness Enumeration.
Input	Vulnerability description: {}
Output	According to the vulnerability description, its corresponding weakness is: {}
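As an illustration of how the template in Table 2 could be turned into training records, the sketch below fills the Input and Output slots for one 〈CVE description, CWE description〉 pair. The lowercase JSON keys follow a common instruction-tuning record layout and are an assumption, as are the function name and the abbreviated example CVE text; only the three template strings come from Table 2.

```python
import json

# Template strings copied from Table 2.
INSTRUCTION = (
    "Given the vulnerability description, identify and provide the name and "
    "description of the corresponding Common Weakness Enumeration."
)
INPUT_TEMPLATE = "Vulnerability description: {}"
OUTPUT_TEMPLATE = (
    "According to the vulnerability description, its corresponding weakness is: {}"
)


def build_record(cve_description, cwe_text):
    """Fill the Table 2 template for one CVE/CWE description pair.

    The key names are an assumed record layout, not confirmed by the paper.
    """
    return {
        "instruction": INSTRUCTION,
        "input": INPUT_TEMPLATE.format(cve_description),
        "output": OUTPUT_TEMPLATE.format(cwe_text),
    }


# Hypothetical example pair for illustration only.
record = build_record(
    "A web application allows remote attackers to inject arbitrary web script "
    "via a crafted parameter.",
    "Improper Neutralization of Input During Web Page Generation "
    "('Cross-site Scripting')",
)
print(json.dumps(record, indent=2))
```

Note that the target is the CWE name and description text rather than a bare CWE-ID, which is what lets the retrieval step compare the generated text with the official descriptions.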

Table 3. Common CWE weakness types.

CWE ID	CWE Name	CWE ID	CWE Name
CWE-79	Cross-site Scripting (XSS)	CWE-78	OS Command Injection
CWE-89	SQL Injection	CWE-22	Path Traversal
CWE-787	Out-of-bounds Write	CWE-121	Stack-based Buffer Overflow
CWE-20	Improper Input Validation	CWE-862	Missing Authorization
CWE-125	Out-of-bounds Read	CWE-120	Classic Buffer Overflow
CWE-352	Cross-site Request Forgery	CWE-434	Unrestricted File Upload
CWE-416	Use After Free	CWE-476	NULL Pointer Dereference
CWE-200	Exposure of Sensitive Info	CWE-287	Improper Authentication

Table 4. Key hyperparameter settings for model fine-tuning.

Hyperparameter	Value	LoRA Parameter	Value
Training Epochs	5.0	lora_rank	8
Batch Size	4	lora_alpha	1.5
Gradient Accumulation Steps	16	lora_dropout	0.1
Learning Rate	5 × 10⁻⁵	lora_target_modules	all

Table 5. Performance comparison of different model parameters and fine-tuning strategies on text generation quality.

Model	Strategy A				Strategy B				Strategy C
Model	B-4	R-1	R-2	R-L	B-4	R-1	R-2	R-L	B-4	R-1	R-2	R-L
Qwen3-4B	83.07	85.24	82.57	85.28	44.39	54.11	36.42	50.73	85.31	86.55	84.65	86.95
Qwen3-8B	88.10	88.89	85.35	87.82	57.82	62.09	45.34	56.65	89.17	89.85	87.33	89.48
Llama3-8B	88.15	90.29	87.58	90.23	64.87	71.59	60.79	70.03	90.68	91.90	90.16	92.31

Notes: B-4: BLEU-4; R-1: ROUGE-1; R-2: ROUGE-2; R-L: ROUGE-L.

Table 6. Performance comparison of different methods on common CWE types.

Method (Model)	Acc	Macro-Pre	Macro-Rec	Macro-F1
N-TF-IDF+W2V+KNN	72.05	81.22	72.05	74.03
N-TF-IDF+W2V+SVM	85.66	85.04	77.00	80.33
TF-IDF+W2V+TRNN	85.57	84.21	81.17	82.27
TF-IDF+W2V+DPCNN	86.70	85.05	82.77	83.58
TF-IDF+W2V+TCNN	87.46	86.16	83.79	84.40
TF-IDF+W2V+TCNN-DA	87.94	86.38	84.40	85.24
TF-IDF+W2V+TCNN-BiGRU	89.70	87.84	85.07	86.22
V2W-BERT	91.35	81.62	81.44	81.14
V2W-LLM	90.18	88.29	87.11	87.64

Table 7. CWE classification performance compared with existing literature.

CWE-ID	Ref. [31] Macro-F1 (%)	Ref. [21] Macro-F1 (%)	Ours Macro-F1 (%)	Gain (Ours − Ref. [21])
CWE-78	91.40	89.60	90.97	+1.37
CWE-79	99.42	98.29	98.07	−0.22
CWE-89	99.69	99.28	97.89	−1.39
CWE-125	89.84	88.16	91.29	+3.13
CWE-352	97.05	97.15	97.98	+0.83
CWE-787	79.56	87.50	89.93	+2.43

Table 8. Ablation study of the core components in V2W-LLM.

Model	Acc	Macro-Pre	Macro-Rec	Macro-F1
w/o Data Construction	65.88	70.57	58.12	56.14
w/o Retrieval Mapping	-	-	-	-
w/o Generation and Retrieval Mapping	88.41	85.39	84.87	84.95
w/o Data Construction and Generation	80.10	76.31	77.65	61.86
V2W-LLM	90.18	88.29	87.11	87.64

Table 9. End-to-end mapping performance on different test sets and CWE subsets.

Dataset	CWE Category	CVE Count	Accuracy	Precision	Recall	F1
Dataset 2	All-220	13,033	88.21%	86.45%	88.21%	86.92%
	Top-100	12,852	89.29%	88.20%	89.29%	88.37%
	Top-50	12,407	91.24%	91.62%	91.24%	91.21%
	Top-25	11,549	93.67%	95.41%	93.67%	94.44%
Dataset 3	All-151	8223	87.68%	85.18%	87.68%	86.02%
	Top-100	8158	87.95%	85.86%	87.95%	86.46%
	Top-50	7813	90.31%	89.91%	90.31%	89.77%
	Top-25	6919	93.51%	94.72%	93.51%	94.02%

Table 10. Workflow, interpretability, and error case analysis of V2W-LLM.

CVE-ID	Input: CVE Description	Generated CWE Description	Mapped Result	Ground Truth
CVE-2024-28191	…inject arbitrary web script…	Improper Neutralization of Input During Web Page Generation (’Cross-site Scripting’)	CWE-79	CWE-79
CVE-2024-21893	…allows attackers… to read arbitrary files on the… file system	Relative Path Traversal (Dot Dot Slashed)	CWE-23	CWE-23
CVE-2024-22243	…an attacker could provide a specially crafted SpEL … could result in remote code execution	Improper Neutralization of Special Elements used in an OS Command	CWE-78	CWE-78
CVE-2024-24919	…allows remote unauthenticated users to read arbitrary files…	Exposure of Sensitive Information to an Unauthorized Actor	CWE-200	CWE-200
CVE-2024-21626	…a file descriptor leak vulnerability… in runc… in the web interface…	Improper Handling of File Descriptors or Handles	CWE-403	CWE-404

Note: CVE examples are selected from the 2024 NVD dataset to demonstrate model performance.

Table 11. Summary of novel contributions of V2W-LLM compared with traditional paradigms.

Core Aspect	Traditional Paradigm (e.g., V2W-BERT)	Proposed V2W-LLM Framework	Impact & Advantage
Task Modeling	Direct label-mapping (Fixed-head Classifier)	Generative-Retrieval Reasoning	Resolves long-tail data scarcity and architectural bottlenecks.
Scalability	Limited to closed-set common categories (Top-20)	Expands to comprehensive open-set spectrum (220 classes)	Handles real-world, large-scale cybersecurity databases effectively.
Data Efficiency	Requires massive training volumes to converge	High sample efficiency (Achieves superior metrics with 50% less data)	Reduces annotation costs and excels in data-scarce scenarios.
Interpretability	Black-box output of solitary IDs	Provides a text-based semantic bridge	Enables procedural interpretability and human-in-the-loop auditing.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

