1. Introduction
The contemporary cybersecurity landscape is characterized by escalating complexity, where the exponential surge in software vulnerabilities poses substantial threats to personal privacy, corporate assets, and national critical infrastructure, as illustrated in Figure 1. The Common Vulnerabilities and Exposures (CVE) [1] system, which provides a list of uniquely identified, publicly disclosed cybersecurity vulnerabilities, has become the de facto standard for describing security flaws in both industry and academia. Across the entire software life cycle, effective defense and targeted mitigation require more than awareness that a vulnerability exists; they require a thorough understanding of its root cause. Consequently, accurately identifying the underlying weakness types from massive, heterogeneous vulnerability reports is essential for software developers, security analysts, and end-users alike [2].
Common Weakness Enumeration (CWE) [3] is an industry standard for classifying software weaknesses. By systematically abstracting the causes of vulnerabilities, it answers the fundamental question of why a vulnerability occurs. Establishing precise mappings from specific CVE instances to abstract CWE categories is critical for identifying attack chains. According to National Vulnerability Database (NVD) statistics, the number of publicly disclosed vulnerabilities has risen exponentially since 2017, exceeding 40,000 in 2024 alone. Such a large and continuously growing volume of data poses significant challenges for automated, high-precision vulnerability classification [4]. Because manual analysis is slow, newly disclosed vulnerabilities cannot be attributed quickly, which creates blind spots in defense and increases the risk of exploitation [5].
To address these challenges, researchers have proposed various automated classification methods, yet significant limitations persist [6]. Early methods based on traditional machine learning relied on shallow keyword matching and struggled to capture the complex contextual semantics of vulnerability descriptions, resulting in low accuracy on complex CVEs [7]. Although recent deep learning-based methods have improved semantic understanding, they typically model the task as closed-set classification, which assumes that every category in the test set also appears in the training set. However, the CWE system comprises hundreds of categories with an extremely long-tailed distribution, and many rare or novel weaknesses lack sufficient training samples; conventional classifiers therefore cannot predict CWE categories unseen during training. Other approaches based on similarity retrieval attempt to match the text embeddings of CVEs and CWEs directly. Because CVEs describe concrete instances while CWEs provide abstract definitions, the large gap in expression and abstraction level limits the precision of such direct matching. Overcoming the closed-set constraint and bridging the gap between concrete instances and abstract definitions are therefore key to high-precision automated mapping. The innovations and contributions of this paper are summarized as follows:
Construction of a semantic-enhanced vulnerability-weakness dataset: To address the semantic scarcity of traditional CVE-CWE ID datasets, we enrich the data with CWE names and descriptions. We construct an instruction-tuning dataset comprising CVE descriptions, CWE names, and CWE descriptions, providing a data foundation for the model to learn the transition from concrete attack events to abstract weakness definitions.
A novel generative reasoning-retrieval mapping paradigm: To address the expressive disparity between CVE descriptions and CWE definitions, we leverage the reasoning and generative capabilities of Large Language Models (LLMs). The model reasons over a CVE description and transforms it into a CWE-style representation, which is then matched against the CWE library via similarity retrieval. This approach addresses the precision limitations caused by directly matching concrete instances against abstract definitions.
Design of three specialized fine-tuning strategies: Unlike traditional classification models, we design three fine-tuning strategies that guide the LLM to learn the underlying logic of why a specific CVE description corresponds to a particular CWE description. This enables the model to reason about novel CVE inputs and generate CWE-style descriptions for them.
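As a rough illustration of the semantic-enhanced dataset described in the first contribution, the Python sketch below turns one CVE-CWE pair into an instruction-tuning record whose target text contains the CWE name and description rather than the bare ID. The field names (`instruction`, `input`, `output`), the instruction wording, and the JSON Lines serialization are assumptions made for illustration; the exact schema and prompt used in this work are not specified here.

```python
import json

# Hypothetical instruction wording; the prompt used in this work is not specified here.
INSTRUCTION = (
    "Read the CVE description and describe the underlying software weakness "
    "in the style of a CWE entry."
)


def build_record(cve_description: str, cwe_name: str, cwe_description: str) -> dict:
    """Turn one CVE-CWE pair into a single instruction-tuning example.

    The target enriches the bare CWE ID with the CWE name and description,
    so the model is trained to produce abstract weakness text, not an ID.
    """
    return {
        "instruction": INSTRUCTION,
        "input": cve_description.strip(),
        "output": f"{cwe_name.strip()}: {cwe_description.strip()}",
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    record = build_record(
        # Illustrative CVE-style text, not a real CVE entry.
        "Cross-site scripting in the search page of an example web application "
        "allows remote attackers to inject arbitrary web script via the q parameter.",
        "Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting')",
        "The product does not neutralize or incorrectly neutralizes user-controllable "
        "input before it is placed in output that is used as a web page that is served "
        "to other users.",
    )
    print(to_jsonl([record]))
```

Keeping the CWE name and definition in the target, instead of an ID such as CWE-79, is what gives the model abstract weakness text to imitate.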
2. Related Work
Early research primarily used machine learning algorithms and statistical features to automate the mapping from CVE to CWE. For instance, Rehman et al. [8] converted CVE description texts into feature vectors using TF-IDF or Bag-of-Words (BoW) models. Subsequently, researchers trained classic classifiers such as Support Vector Machines (SVM) [9,10] and Naive Bayes (NB) [11]. In a related line of work, Terdchanakul et al. [12] proposed an N-gram IDF technique that extracts keyword statistical features from bug report texts, training Logistic Regression and Random Forest classifiers for automated classification. Furthermore, Albanese et al. [13] introduced CVE2CWE, which uses TF-IDF vectors and cosine similarity to compute the degree of match between new CVEs and CWE categories. However, these models are context-independent: they rely entirely on keyword overlap while neglecting word order, syntax, and deep semantics. Consequently, they struggle to distinguish vulnerabilities that share technical terminology but differ in root cause, such as “Buffer Over-read” versus “Buffer Overflow.”
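The TF-IDF and cosine-similarity matching attributed to CVE2CWE above can be sketched in a few lines of standard-library Python. This is not the CVE2CWE implementation: the tokenizer and the smoothed IDF variant, log((1 + N)/(1 + df)) + 1, are assumptions. The sketch also makes the context-independence concrete, since each text is reduced to an unordered bag of weighted terms.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics; word order is discarded later."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document.

    Uses raw term counts and a smoothed IDF, log((1 + N) / (1 + df)) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_cwes(cve_text: str, cwe_texts: dict[str, str]) -> list[tuple[str, float]]:
    """Rank CWE IDs by TF-IDF cosine similarity to one CVE description.

    IDF statistics are computed over the CWE texts plus the query.
    """
    ids = list(cwe_texts)
    vectors = tfidf_vectors([cwe_texts[i] for i in ids] + [cve_text])
    query = vectors[-1]
    # zip stops at len(ids), so the trailing query vector is not scored against itself.
    scores = [(cwe_id, cosine(query, vec)) for cwe_id, vec in zip(ids, vectors)]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

When two candidate definitions share almost all of their vocabulary, as the official texts of CWE-125 (Out-of-bounds Read) and CWE-787 (Out-of-bounds Write) do, the ranking hinges on a single token such as “reads” versus “writes”, which shows how fragile keyword overlap is for weaknesses that differ mainly in root cause.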
To address the lack of word-order and contextual understanding in traditional methods, researchers turned to early deep learning models. These approaches typically use word embedding techniques such as Word2Vec to convert token sequences into embeddings, which are then fed into Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs), or Long Short-Term Memory (LSTM) networks. The final hidden states or pooled outputs are passed through fully connected layers and activation functions (e.g., Sigmoid) to produce predictions. For example, Nakagawa et al. [14] and Saklani et al. [15] used CNNs to predict vulnerability severity levels from descriptions, while other studies explored hybrid CNN-LSTM architectures [16]. Although these models capture word order, their contextual modeling is often unidirectional. To mitigate this, Zhang et al. [17] proposed an automated classification technique based on BiGRU and TextCNN that captures both sequential and local features. Nevertheless, when processing complex, highly technical CVE descriptions, these networks have limited capacity to capture long-range dependencies and deep bidirectional context, which constrains CVE-CWE mapping performance.
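The prediction head described in this paragraph, a fully connected layer followed by a sigmoid over a pooled encoder output, can be sketched as follows. The pooled vector, weights, and three-label set are hypothetical placeholders rather than trained values, and the encoder itself is omitted. The sketch also shows why such models are closed-set: the output layer has exactly one unit per label fixed at training time, so a CWE outside `LABELS` can never be predicted.

```python
import math

# Hypothetical label set fixed when the model is trained: one output unit per CWE.
LABELS = ["CWE-79", "CWE-89", "CWE-787"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict(pooled: list[float], weights: list[list[float]], bias: list[float],
            threshold: float = 0.5) -> list[str]:
    """Fully connected layer plus sigmoid applied to a pooled encoder output.

    `pooled` stands in for the final hidden state or pooled output of a
    CNN/RNN/GRU/LSTM encoder; `weights` holds one row per entry in LABELS.
    Returns every label whose probability exceeds `threshold`.
    """
    if len(weights) != len(LABELS) or len(bias) != len(LABELS):
        raise ValueError("the output layer needs exactly one unit per label")
    probs = [sigmoid(sum(w * x for w, x in zip(row, pooled)) + b)
             for row, b in zip(weights, bias)]
    # Only labels in LABELS can ever be returned: the label space is closed.
    return [label for label, p in zip(LABELS, probs) if p > threshold]
```

A sigmoid per unit treats each label independently (multi-label style); a softmax over the same units would be the single-label alternative, but either way the set of possible outputs is fixed by the layer's width.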
In recent years, pre-trained language models, exemplified by BERT [18], have become a research focus. Das et al. [19] introduced V2W-BERT, which uses a Siamese BERT network with end-to-end fine-tuning to learn semantic correlations between CVE and CWE descriptions. This task-specific training effectively addresses mapping challenges for rare and even zero-shot CWEs [20]. Yang et al. [21] explored dual-attention mechanisms and improved adversarial training to keep models from missing key features or overfitting [22]. While fine-tuning pre-trained models as classifiers has achieved high accuracy on benchmarks, this paradigm faces two critical limitations. The first is the closed-set problem: classifiers can only predict CWE categories encountered during training and therefore fail to scale to the full CWE list of over 900 categories. The second is the long-tail problem, which stems from the highly imbalanced distribution of CWE labels across CVEs. These models remain heavily biased toward the “Top-25” CWE categories and often discard the remaining infrequent classes.
To resolve these closed-set and long-tail issues, researchers have begun to leverage the general capabilities of LLMs. Text2Weak, proposed by Simonetto et al. [23], is a representative approach that exploits LLM embedding capabilities. It uses general-purpose embedding models such as OpenAI’s text-embedding-ada-002 [24] to map official CWE descriptions into a vector index. When a new CVE emerges, its description is vectorized, and the best-matching weakness type is retrieved via cosine similarity. However, this approach essentially computes direct similarity between concrete vulnerability instances (CVEs) and abstract weakness definitions (CWEs). Relying solely on surface-level feature extraction, it ignores the logical reasoning and generative potential of LLMs, and its failure to bridge the large disparity in expression and abstraction level limits mapping precision and practical utility. In summary, existing research has progressed but faces a bottleneck: traditional discriminative models are restricted by closed-set assumptions and long-tail distributions, whereas preliminary LLM applications such as Text2Weak remain “black boxes” that do not activate the model’s deep reasoning potential for heterogeneous text matching.
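A Text2Weak-style pipeline, which indexes CWE descriptions once and answers each new CVE with a cosine-similarity lookup, might be sketched as follows. The `toy_embed` function is a hypothetical feature-hashing stand-in for a learned embedding model such as text-embedding-ada-002; no embedding API is called, and the flat exhaustive search is a simplification of whatever index a real system would use.

```python
import math
import re
import zlib
from typing import Callable

Vector = list[float]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Hypothetical stand-in for a learned embedding model (feature hashing).

    Each lowercase token increments one of `dim` buckets chosen by CRC32,
    which is deterministic across runs, unlike Python's built-in hash().
    """
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


class CweIndex:
    """Flat (exhaustive) cosine-similarity index over CWE descriptions."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed) -> None:
        self.embed = embed
        self.ids: list[str] = []
        self.vectors: list[Vector] = []

    def add(self, cwe_id: str, description: str) -> None:
        # Normalize once at insertion so a dot product equals cosine similarity.
        self.ids.append(cwe_id)
        self.vectors.append(l2_normalize(self.embed(description)))

    def search(self, cve_description: str, k: int = 1) -> list[tuple[str, float]]:
        """Return the k CWE IDs whose descriptions are most similar to the query."""
        q = l2_normalize(self.embed(cve_description))
        scored = [(cwe_id, sum(a * b for a, b in zip(q, v)))
                  for cwe_id, v in zip(self.ids, self.vectors)]
        scored.sort(key=lambda s: s[1], reverse=True)
        return scored[:k]
```

Whatever embedding model is plugged in, the query here is the raw CVE description, so the score is exactly the direct instance-to-definition comparison that the paragraph above identifies as the weakness of this approach.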
In response to these limitations, this paper explores a generative retrieval paradigm that breaks through the constraints of fixed-category classifiers. By introducing parameter-efficient fine-tuning (PEFT), we reduce training costs while fully activating the LLM’s reasoning and generative capabilities. Instead of matching texts directly, our method enables the model to reason over a concrete CVE description and, based on contextual understanding, transform it into a transparent, standardized CWE-style description. The generated description is then matched against the official CWE knowledge base. This process not only avoids the difficulties of directly matching heterogeneous texts but also exposes intuitive intermediate reasoning results, enabling high-precision, scalable, and interpretable automated vulnerability mapping in real-world scenarios.
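The generate-then-retrieve flow described above can be outlined as below. This is a sketch under stated assumptions, not the authors’ implementation: `generate` stands in for the fine-tuned LLM (here replaced by the hypothetical `fake_generate`, which returns a fixed sentence), and bag-of-words cosine similarity stands in for the embedding model used for retrieval. What it illustrates is structural: retrieval compares a generated CWE-style text with CWE definitions, and the intermediate text is returned alongside the ranking so it can be inspected.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (stand-in for an embedding model)."""
    ca = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    cb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(count * cb[t] for t, count in ca.items())
    na = math.sqrt(sum(c * c for c in ca.values()))
    nb = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class MappingResult:
    generated: str                    # intermediate CWE-style text, kept for inspection
    ranking: list[tuple[str, float]]  # (CWE ID, similarity), best first


def map_cve(cve_description: str, generate: Callable[[str], str],
            cwe_library: dict[str, str], k: int = 3) -> MappingResult:
    """Generate a CWE-style description, then retrieve the closest CWE entries.

    Retrieval compares generated abstract text with CWE definitions instead of
    comparing the concrete CVE description with them directly.
    """
    generated = generate(cve_description)
    scored = [(cwe_id, bow_cosine(generated, definition))
              for cwe_id, definition in cwe_library.items()]
    scored.sort(key=lambda s: s[1], reverse=True)
    return MappingResult(generated=generated, ranking=scored[:k])


def fake_generate(cve_description: str) -> str:
    """Hypothetical placeholder for the fine-tuned LLM: ignores its input."""
    return ("The product does not neutralize user-controllable input before it is "
            "placed in output that is used as a web page.")
```

Because the candidate set is simply the CWE library passed in, adding a new CWE entry requires no change to any output layer, in contrast to the closed-set classifiers discussed above.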
Author Contributions
Conceptualization, Z.W. and M.N.; methodology, Z.W.; software, Z.W.; validation, Z.W., J.Z. and Y.J.; formal analysis, Z.W.; investigation, Z.W.; resources, M.N.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, M.N.; visualization, Z.W.; supervision, M.N.; project administration, M.N.; funding acquisition, M.N. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2023D01A46); the 2025 Special Research Project on Education Network Security (CAETCS25006); the National Key R&D Program (2024YFF0908203-3); the Shanghai Cooperation Organization Science and Technology Partnership Program and International Science and Technology Cooperation Program (2025E01038); and the Xinjiang “Tianshan Talent” Training Program for Outstanding Engineers (EB0210).
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- MITRE. Common Vulnerabilities and Exposures (CVE). Available online: https://cve.mitre.org/ (accessed on 15 September 2024).
- Komaragiri, V.B.; Edward, A. AI-driven vulnerability management and automated threat mitigation. Int. J. Sci. Res. Manag. 2022, 10, 981–998.
- MITRE. Common Weakness Enumeration (CWE). Available online: https://cwe.mitre.org/ (accessed on 15 September 2024).
- Iannone, E.; Guadagni, R.; Ferrucci, F.; De Lucia, A.; Palomba, F. The secret life of software vulnerabilities: A large-scale empirical study. IEEE Trans. Softw. Eng. 2022, 49, 44–63.
- Haddad, O.A.; Ikram, M.; Ahmed, E.; Lee, Y. Prompting the Priorities: A First Look at Evaluating LLMs for Vulnerability Triage and Prioritization. arXiv 2025, arXiv:2510.18508.
- Uddin, M.N.; Zhang, Y.; Hei, X. Deep learning aided software vulnerability detection: A survey. arXiv 2025, arXiv:2503.04002.
- Risse, N.; Böhme, M. Uncovering the limits of machine learning for automatic vulnerability detection. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 4247–4264.
- Rehman, S.; Mustafa, K. Software design level vulnerability classification model. Int. J. Comput. Sci. Secur. IJCSS 2012, 6, 238.
- Aota, M.; Kanehara, H.; Kubo, M.; Murata, N.; Sun, B.; Takahashi, T. Automation of vulnerability classification from its description using machine learning. In Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 8–10 July 2020; pp. 1–7.
- Davari, M.; Zulkernine, M.; Jaafar, F. An automatic software vulnerability classification framework. In Proceedings of the 2017 International Conference on Software Security and Assurance (ICSSA), Altoona, PA, USA, 24–25 July 2017; pp. 44–49.
- Na, S.; Kim, T.; Kim, H. A study on the classification of common vulnerabilities and exposures using naïve bayes. In Proceedings of the International Conference on Broadband and Wireless Computing, Communication and Applications, Asan, Republic of Korea, 5–7 November 2016; pp. 657–662.
- Terdchanakul, P.; Hata, H.; Phannachitta, P.; Matsumoto, K. Bug or not? Bug report classification using n-gram idf. In Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), Shanghai, China, 17–22 September 2017; pp. 534–538.
- Albanese, M.; Adebiyi, O.; Onovae, F. CVE2CWE: Automated Mapping of Software Vulnerabilities to Weaknesses Based on CVE Descriptions. In Proceedings of the 21st International Conference on Security and Cryptography (SECRYPT), Dijon, France, 8–10 July 2024; pp. 500–507.
- Nakagawa, S.; Nagai, T.; Kanehara, H.; Furumoto, K.; Takita, M.; Shiraishi, Y.; Takahashi, T.; Mohri, M.; Takano, Y.; Morii, M. Character-level convolutional neural network for predicting severity of software vulnerability from vulnerability description. IEICE Trans. Inf. Syst. 2019, 102, 1679–1682.
- Saklani, S.; Kalia, A. Severity prediction of software vulnerabilities using convolutional neural networks. Inf. Comput. Secur. 2025, 33, 613–630.
- Sun, X.; Li, L.; Bo, L.; Wu, X.; Wei, Y.; Li, B. Automatic software vulnerability classification by extracting vulnerability triggers. J. Softw. Evol. Process 2024, 36, e2508.
- Zhang, H.; He, D. Research on Automatic Vulnerability Classification Technology Based on BiGRU-TextCNN Framework. J. Inf. Secur. Res. 2024, 10, 446–452. (In Chinese)
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
- Das, S.S.; Serra, E.; Halappanavar, M.; Pothen, A.; Al-Shaer, E. V2W-BERT: A framework for effective hierarchical multiclass classification of software vulnerabilities. In Proceedings of the 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), Porto, Portugal, 6–9 October 2021; pp. 1–12.
- Zhu, C.; Du, G.; Wu, T.; Cui, N.; Chen, L.; Shi, G. BERT-based vulnerability type identification with effective program representation. In Proceedings of the International Conference on Wireless Algorithms, Systems, and Applications, Dalian, China, 7–9 April 2022; pp. 271–282.
- Yang, J.; Li, W.; He, J.; Zhou, S.; Li, T.; Wang, Y. Vulnerability Classification Method Based on Dual Attention Mechanism and Improved Adversarial Training. Appl. Res. Comput. 2024, 41, 3447–3454. (In Chinese)
- Wang, T.; Qin, S.; Chow, K.P. Towards vulnerability types classification using pure self-attention: A common weakness enumeration based approach. In Proceedings of the 2021 IEEE 24th International Conference on Computational Science and Engineering (CSE), Shenyang, China, 20–22 October 2021; pp. 146–153.
- Simonetto, S.; van Ede, T.S.; Bosch, P.; Jonker, W.; Oostveen, R. Text2Weak: Mapping CVEs to CWEs using description embeddings analysis. In Proceedings of the 4th Workshop on Artificial Intelligence-Enabled Cybersecurity Analytics, Barcelona, Spain, 26 August 2024.
- Neelakantan, A.; Xu, T.; Puri, R.; Radford, A.; Han, J.M.; Tworek, J.; Yuan, Q.; Tezak, N.; Kim, J.W.; Hallacy, C.; et al. Text and code embeddings by contrastive pre-training. arXiv 2022, arXiv:2201.10005.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
- Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 2318–2335.
- Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Dubrovnik, Croatia, 2–6 May 2023; pp. 2014–2037.
- Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547.
- Liu, X.; Tan, Y.; Xiao, Z.; Zhuge, J.; Zhou, R. Not the end of story: An evaluation of ChatGPT-driven vulnerability description mappings. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 3724–3731.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Wang, Q.; Gao, Y.; Ren, J.; Zhang, B. An automatic classification algorithm for software vulnerability based on weighted word vector and fusion neural network. Comput. Secur. 2023, 126, 103070.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.