1. Introduction
In recent years, the increasing scale and complexity of software systems have led to a continuous rise in software vulnerabilities. According to the National Vulnerability Database (NVD), over 40,000 Common Vulnerabilities and Exposures (CVEs) were reported in 2024, compared to 28,092 in 2023 [1]. This trend highlights the growing urgency of developing effective methods for automated vulnerability detection.
Existing approaches to vulnerability detection can be broadly categorized into traditional rule-based methods and data-driven deep learning methods. Traditional methods rely on handcrafted rules, Abstract Syntax Tree (AST) analysis, and Control-Flow Graph (CFG) inspection; however, these approaches often suffer from poor scalability and high false positive rates. In contrast, deep learning-based methods such as LSTMs [2,3,4], GNNs [5,6,7], and transformer architectures [8,9] learn vulnerability patterns from large-scale code corpora and have demonstrated superior performance in identifying previously unseen vulnerabilities.
Despite the remarkable capabilities of LLMs in code understanding and generation, applying them directly to vulnerability detection presents several challenges. First, the high cost of fine-tuning and limited detection accuracy remain major issues. Although LLMs can identify vulnerabilities, their performance frequently lags behind specialized models unless fine-tuned, which demands significant additional computational resources and labeled data [10]. Second, the hallucination problem persists. LLMs can easily produce fabricated content [11], which can lead to vulnerabilities being misidentified; a strong evaluation mechanism is therefore required to guarantee that generated knowledge is accurate. Third, current methods rely on code retrieval techniques that usually match query code against a database using only source-code text embeddings, so they frequently fail to identify patterns encoded in evolutionary history or structural anomalies.
Therefore, it is essential to address these shortcomings in order to enhance LLMs' capabilities for code vulnerability detection.
To further the construction of reliable and knowledge-driven vulnerability analysis systems, we present Retrieval-Augmented Semantic Mapping for Vulnerability Detection (RASM-Vul), a framework combining semantic analysis and knowledge enhancement. There are three main modules that make up our framework:
LLM-Based Knowledge Extraction and Code Variant Generation: To enhance the diversity of vulnerable and fixed code, this module employs a closed-loop “generate–evaluate–reflect” mechanism that can extract structured vulnerability insights and generate semantically equivalent code variants, thereby enriching the feature space while preserving original semantics.
Multi-view Vector Knowledge Base Construction: We build a comprehensive index system that covers semantic (code), structural (AST), and auxiliary (line/AST change) dimensions, enabling the system to capture both static code patterns and dynamic repair logic.
Multi-stage Retrieval and Vulnerability Detection Mechanism: To dynamically integrate evidence from various channels, we present the WRRF (Weighted Reciprocal Rank Fusion) algorithm. The mechanism reduces retrieval noise and improves LLM reasoning by adaptively prioritizing the most pertinent context based on the problem scope.
These components collectively form a closed-loop system that enhances LLM performance across semantic diversity, pattern coverage, and knowledge reliability.
2. Related Work
2.1. Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) improves output accuracy and contextual relevance by combining text generation and information retrieval. There are three stages to the typical workflow. In order to create an index, pertinent knowledge is first organized and vectorized using embedding models during the knowledge base construction phase. The system then retrieves pertinent information fragments based on the input query during the retrieval stage. The final response is generated by feeding the retrieved context into a generative model during the generation stage. This paradigm allows models to leverage external data without retraining.
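The three-stage workflow above can be sketched in a few lines of Python. This is a minimal, self-contained illustration only: the bag-of-words embedding stands in for a real embedding model, and the documents, query, and prompt wording are hypothetical.

```python
from collections import Counter
from math import sqrt

# Toy embedding: a bag-of-words Counter stands in for a real embedding model.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1: knowledge base construction -- organize and vectorize documents.
docs = [
    "buffer overflow caused by missing bounds check",
    "use after free in cleanup handler",
    "sql injection via unsanitized query string",
]
index = [(d, embed(d)) for d in docs]

# Stage 2: retrieval -- find the most relevant fragment for the query.
query = "missing bounds check leads to overflow"
qv = embed(query)
ranked = sorted(index, key=lambda p: cosine(qv, p[1]), reverse=True)
context = ranked[0][0]

# Stage 3: generation -- feed the retrieved context to a generative model
# (represented here only by the assembled prompt string).
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
print(context)
```

The key property of the paradigm is visible even in this sketch: the generator's prompt is grounded in retrieved external knowledge rather than model parameters alone.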
RAG has been extensively utilized in numerous natural language processing and software engineering applications, including question-answering systems [12] and code generation [13]. These studies demonstrate that efficient vulnerability detection requires matching across multiple code representation dimensions rather than treating code as a linear token sequence.
2.2. LLM-Based Code Vulnerability Detection
Recent methods show that while LLMs can be useful for finding vulnerabilities [14], most of them depend on either costly fine-tuning or simple retrieval strategies, suggesting that better retrieval methods could improve detection across different vulnerability types.
Supervised Fine-Tuning: Fine-tuning updates the model’s parameters on a labeled vulnerability dataset, helping the model learn to recognize specific security patterns. Syafiq et al. [15] adapted the DeepSeek-Coder model to train dedicated classifiers for different Common Weakness Enumeration (CWE) types. Fine-tuning usually yields high accuracy within the training distribution, but it is computationally expensive and generalizes poorly to new vulnerability types.
In-Context Learning (ICL): To save computational resources during training, ICL exploits an LLM’s inference abilities by including demonstration examples in the prompt. SVA-ICL [16] improves this approach by using information-fusion techniques to choose high-quality demonstration samples that combine source code with vulnerability descriptions. The efficacy of ICL is inherently constrained by the context-window limit and the quality of the chosen demonstrations.
Retrieval and Tool Augmentation: Recent methodologies enhance LLMs with external context to alleviate hallucinations. Li et al. [17] utilized CodeQL static analysis to verify LLM predictions, while Du et al. [18] and Yin et al. [10] retrieve vulnerability knowledge to guide generation.
2.3. Code Vulnerabilities and Their Causes
To understand the prerequisites for effective vulnerability detection, it is essential to examine how security flaws manifest in real code, as officially classified by CWE and CVE. CWE captures the theoretical root cause (e.g., logic error vs. memory corruption), while CVE records the specific instance. The repair patterns needed to fix these problems vary widely in how difficult they are to understand.
As shown in Figure 1, we categorize these manifestations into two primary patterns that pose distinct challenges for detection models:
Statement-Level Manifestations (e.g., CVE-2016-10088): These vulnerabilities typically stem from missing validations or incorrect operators. The fix is often localized and additive (e.g., inserting a boundary check). Detection models can identify these largely through local context and line-level differences.
Structural-Level Manifestations (e.g., CVE-2021-3493): Complex issues such as race conditions often require architectural changes. As seen in Case 2, the fix involves extending locking scopes and altering control flow paths. Crucially, the vulnerable code and patched code share high lexical overlap, causing models relying on simple text similarity to fail. Successfully distinguishing these pairs requires modeling the syntactic structure (AST) and global dependencies.
Figure 1.
Comparison of Vulnerability Types. (Case 1) shows a statement-level fix (CVE-2016-10088) where a security check is added. (Case 2) illustrates a structural-level fix (CVE-2021-3493), requiring lock scope expansion and control flow changes.
This observation, that relying on a single representation of code (e.g., plain text) cannot cover the full range of CWE classifications, drives our methodology. Instead, a robust detection framework must use a multi-view representation that combines the explicit semantic signals from statement-level changes with the implicit structural signals from AST or CPG analysis.
2.4. Data Generation and Augmentation
The lack of labeled data and limited semantic reasoning ability remain major problems for deep learning-based vulnerability detection [19]. Recent advancements in LLMs have led to new solutions, mainly divided into interactive refinement and synthetic injection.
Interactive refinement frameworks such as TICODER [20] use a user-feedback loop, forming test cases from natural-language instructions to improve the model’s understanding of user intent. Injection-based methods, in contrast, focus on synthesizing vulnerable samples automatically. VulGen [21] uses hierarchical clustering to mine patterns and inject vulnerabilities into safe code. VGX [22] likewise uses a transformer-based method to produce realistic samples by injecting defects derived from historical fixes and expert knowledge.
Existing methods mainly focus on increasing the volume of training data. We believe that simply adding more data is not enough; instead, dense sampling across the vulnerability pattern manifold through systematic code variation is essential. This makes sure that there is a “historical case” in the knowledge base that is semantically close to any given query, which helps with accurate retrieval and reasoning.
3. Approach
Based on the three major challenges proposed in the introduction, this research designs and implements RASM-Vul, a retrieval-augmented framework that enhances vulnerability detection through multi-view semantic mapping. Our approach leverages external knowledge and pattern matching rather than attempting to endow LLMs with intrinsic structural understanding.
Figure 2 shows the framework of RASM-Vul.
3.1. LLM-Based Knowledge Extraction and Code Variant Generation
In this stage, our system implements a multi-stage knowledge extraction pipeline that leverages LLM-driven analysis to process function-level vulnerability datasets, generating structured fine-grained vulnerability knowledge bases to address the knowledge insufficiency and hallucination problems in large model-driven vulnerability detection.
The framework includes three parts: the knowledge extraction engine (Section 3.1.1), the code variant generator (Section 3.1.2), and the multi-stage validation processing pipeline (Section 3.1.3).
3.1.1. Knowledge Extraction Engine
We use DeepSeek-V2.5 Coder as the LLM engine to obtain structured knowledge about vulnerabilities. To ensure experimental validity and prevent data leakage, we follow a strict data isolation protocol in which knowledge extraction is performed only on the training split. The validation and test splits are entirely excluded from this phase, ensuring that the retrieval system never sees the correct answers from the evaluation sets.
The CWE classification and MITRE ATT&CK framework form the basis of our knowledge schema. They are meant to cover the entire vulnerability lifecycle. To meet RAG’s information needs, which include understanding context, causal reasoning, and differential analysis, we create a nine-dimensional typology that is divided into three logical groups:
Context & Semantics: Includes Functional Purpose, Functional Behavior, Code Semantics, and Context Metadata. These dimensions give the LLM a general idea of what the code is supposed to do and how it works, which helps to prevent context-loss hallucinations.
Vulnerability Mechanics: This includes Vulnerability Behavior, Trigger Conditions, and Prerequisites. These show the cause-and-effect chain of the defect and list the exact conditions (such as boundary values and race windows) that need to be met for the vulnerability to be triggered.
Remediation & Classification: This includes Solution Patterns and Security Context (CWE/CVE definitions). Along with the vulnerability mechanics, these allow the system to perform differential analysis between the vulnerable code and the fixed version.
Because we use specialized prompts to guide the LLM to extract these structured fields, we are able to turn unstructured code pairs into a rich semantic knowledge base that can support fine-grained pattern matching.
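The nine-dimensional typology above can be captured as a simple structured record. The sketch below is illustrative only: the field names follow the paper's dimension names, but the dataclass layout and the example values are our assumptions.

```python
from dataclasses import dataclass

# Illustrative schema for the nine-dimensional knowledge typology,
# grouped into the three logical categories described in the text.
@dataclass
class VulnerabilityKnowledge:
    # Context & Semantics
    functional_purpose: str = ""
    functional_behavior: str = ""
    code_semantics: str = ""
    context_metadata: str = ""
    # Vulnerability Mechanics
    vulnerability_behavior: str = ""
    trigger_conditions: str = ""
    prerequisites: str = ""
    # Remediation & Classification
    solution_patterns: str = ""
    security_context: str = ""  # CWE/CVE definitions

# Hypothetical entry for illustration.
entry = VulnerabilityKnowledge(
    functional_purpose="Parses a user-supplied length field",
    vulnerability_behavior="Heap overflow when length exceeds buffer size",
    trigger_conditions="length > 256 with attacker-controlled input",
    security_context="CWE-787: Out-of-bounds Write",
)
print(len(VulnerabilityKnowledge.__dataclass_fields__))
```

Each field is filled by a dedicated extraction prompt, so a populated record directly serves as a retrievable knowledge-base entry.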
3.1.2. Code Variant Generator
As shown in Figure 3, our framework uses two methods to create code variants that amplify vulnerability characteristics and fix characteristics, namely semantic mutation and knowledge enhancement. Semantic mutation alters local content while maintaining existing semantic properties, augmenting vulnerability diversity. Knowledge enhancement seeks to retain current code characteristics while utilizing the knowledge derived in Section 3.1.1 to strengthen existing code features.
Training-Only Variant Generation: The code variant generation process is applied solely to training samples in order to broaden the knowledge base. We apply 16 semantic mutation strategies together with the knowledge enhancement techniques to produce several versions of each of the 3789 training pairs. This keeps the validation and test data completely separate while still creating a wide range of code variants. No variants are generated from validation or test samples, so the retrieval system never encounters near-duplicates of test functions during inference.
Table 1 shows the semantic mutation strategies we use to generate code variants:
In addition to mutation strategies, our framework also uses enhancement strategies. The purpose of enhancement strategies is to strengthen the semantic features of current code. Through prompt engineering, we combine the extracted knowledge with the code to be enhanced, allowing the LLM to enhance code features based on the code and its related knowledge. For vulnerable code, our framework amplifies vulnerability features while preserving semantics; for fixed code, it enhances security measures and robustness. The knowledge enhancement process can be defined as follows.
Let C be the original code, K the extracted knowledge set, and C′ the enhanced code. The enhancement process can be represented as C′ = F_LLM(C, K, T), where F_LLM is the LLM generation function and T is the enhancement target, i.e., semantic mutation or knowledge enhancement.
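The enhancement step C′ = F_LLM(C, K, T) can be sketched as prompt assembly plus an LLM call. Everything below is a hedged illustration: the prompt wording is our assumption (the paper's actual templates are in Appendix A), and the LLM is mocked with a placeholder function.

```python
# Sketch of the enhancement step C' = F_LLM(C, K, T); prompt wording is
# an assumption, not the paper's actual template.
def build_enhancement_prompt(code: str, knowledge: dict, target: str) -> str:
    facts = "\n".join(f"- {k}: {v}" for k, v in knowledge.items())
    return (
        f"Task: {target}\n"
        f"Known vulnerability facts:\n{facts}\n"
        f"Code:\n{code}\n"
        "Rewrite the code, preserving its semantics while strengthening "
        "the features described above."
    )

def enhance(code: str, knowledge: dict, target: str, llm) -> str:
    """F_LLM: produce the enhanced code C' from C, K, and target T."""
    return llm(build_enhancement_prompt(code, knowledge, target))

# Mock LLM for illustration: prepends a bounds check to the original code.
def mock_llm(prompt: str) -> str:
    original = prompt.split("Code:\n")[1].split("\nRewrite")[0]
    return "if (len > sizeof(buf)) return -1;\n" + original

enhanced = enhance(
    "memcpy(buf, src, len);",
    {"trigger_conditions": "len exceeds buffer size"},
    "knowledge enhancement",
    mock_llm,
)
print(enhanced)
```

In the real pipeline the mock would be replaced by a call to the DeepSeek model, and the output would pass through the evaluation stage described next.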
3.1.3. Multi-Stage Validation Processing Pipeline
Our framework implements a generate → evaluate → reflect pipeline to ensure knowledge quality and code semantics. Details of the generation stage can be found in Section 3.1.1 and Section 3.1.2.
Evaluation Stage: This stage validates nine knowledge types and two code variant categories using specialized prompts. Code evaluation focuses on semantic preservation and functional equivalence, while knowledge assessment verifies CVE/CWE consistency and technical accuracy.
Reflection Stage: This stage optimizes generated content through iterative refinement, where G′ = R(G, E, Φ), in which G represents the initial content, E contains the evaluation results, and Φ enforces constraints including semantic preservation and consistency maintenance. This dual-mechanism approach ensures high-quality knowledge generation while maintaining dataset integrity.
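The generate → evaluate → reflect control flow can be written as a short loop. The generator, evaluator, and refiner below are toy placeholders standing in for the LLM-driven components; only the control flow mirrors the pipeline described above.

```python
# Minimal sketch of the generate -> evaluate -> reflect pipeline.
def closed_loop(generate, evaluate, reflect, max_rounds: int = 3):
    content = generate()
    for _ in range(max_rounds):
        report = evaluate(content)          # e.g., semantic-preservation check
        if report["passed"]:
            return content
        content = reflect(content, report)  # G' = R(G, E, constraints)
    return content

# Toy components: "evaluation" demands the marker string be present.
gen = lambda: "variant v0"
ev = lambda c: {"passed": "checked" in c, "hint": "add bounds check"}
ref = lambda c, r: c + " checked"

result = closed_loop(gen, ev, ref)
print(result)
```

Bounding the number of rounds (here `max_rounds`) keeps the loop from cycling when the evaluator never passes; the paper's actual stopping criterion is not specified here.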
3.2. Multi-View Vector Knowledge Base Construction
Our framework implements a multi-view vector database system to efficiently store and retrieve code representations across multiple abstraction levels. This enables comprehensive pattern matching against vulnerability information.
3.2.1. Vectorization Representation Structure
Our framework adopts a multi-view vectorization strategy to convert different dimensional information of code, syntax trees, and knowledge into vector representations:
Code Vectorization: Based on the code understanding capabilities of BERT models, the original vulnerable code, fixed code, and their code variants are each encoded into high-dimensional vectors. Pre-trained BERT models capture code functional intent, logic, and structure, as well as latent defect and repair features.
Syntax Tree Vectorization: By constructing ASTs, the structural information of code is converted into vector representations. Our framework first uses tree-sitter to parse code into ASTs, then generates node sequences through deep tree traversal, and finally uses BERT models to encode AST sequences into vectors. This syntax-based vectorization representation captures code syntactic structures, control flow, and data dependency patterns, enabling structural similarity matching that complements lexical and semantic retrieval channels.
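The "deep tree traversal" that linearizes an AST into an encodable node sequence can be illustrated compactly. The paper parses C/C++ with tree-sitter; as a self-contained stand-in, this sketch applies the same idea to a Python AST using the standard library, producing a node-type sequence that a BERT-style encoder could then embed as text.

```python
import ast

# Pre-order (deep) traversal yielding node-type names.
def preorder(node: ast.AST):
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)

def linearize(code: str) -> list[str]:
    """Turn source code into a flat AST node-type sequence."""
    return list(preorder(ast.parse(code)))

seq = linearize("if n > len(buf):\n    raise ValueError('overflow')")
print(" ".join(seq))
```

The resulting sequence preserves syntactic topology (e.g., an `If` guarding a `Raise`), which is exactly the signal that lexical embeddings of the raw text tend to miss.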
Knowledge Content Vectorization: Text-encoding techniques convert structured knowledge from vulnerability data (such as vulnerability behaviors, solutions, and functional purposes) into vector representations. Knowledge vectorization chunks the knowledge by knowledge type, ensuring that the semantic connections between different types of knowledge are accurately represented in vector space.
3.2.2. Multi-Layered Index Structure
To support efficient semantic retrieval, the system constructs five specialized vector indices exclusively using training data:
Strict Index Construction Protocol: The training split (3789 paired samples) and resulting variants are all that is used to build the five vector indices. It is strictly forbidden to use the validation and test splits in any index construction. This ensures that there are no data leaks and supports the integrity of the experimental evaluation.
Static Feature Indices (Code, AST, Knowledge): These three indices capture the static snapshot of a code sample. The Code Index and Knowledge Index store semantic vector representations of the source code text and the natural-language descriptions extracted from it, facilitating lexical and intent matching. The Syntax (AST) Index complements these by vectorizing linearized AST sequences to encode structural information, letting the system find vulnerabilities based on syntactic topology (such as specific control-flow patterns) rather than textual similarity alone.
Assisted Analysis Repair Indices (Line & AST Changes): The Line Changes Index stores vector representations of historical diffs (added/deleted lines), capturing explicit repair logic at the statement level. The AST Changes Index tracks changes in the syntax tree’s topology, making it possible to detect structural modifications (such as expanding a lock’s scope) that simple text diffs might miss. These evolutionary indices provide the contrastive evidence needed in the ensuing retrieval phase, and both are helpful for characterizing vulnerability patterns.
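The five-index layout can be sketched as one vector store per view. This is a toy illustration under stated assumptions: the character-trigram embedding stands in for the real encoder (the paper uses UniXCoder-base with 768-dimensional vectors), and the sample content is hypothetical.

```python
from math import sqrt

# Toy embedding: character-trigram counts stand in for a BERT-family encoder.
def embed(text: str) -> dict:
    vec = {}
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        vec[tri] = vec.get(tri, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One vector store per view; populated only from the training split.
indices = {name: [] for name in
           ["code", "ast", "knowledge", "line_changes", "ast_changes"]}

def add_sample(sample_id: str, views: dict):
    for channel, text in views.items():
        indices[channel].append((sample_id, embed(text)))

def search(channel: str, query: str, top_k: int = 3):
    qv = embed(query)
    scored = [(sid, cosine(qv, v)) for sid, v in indices[channel]]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

add_sample("CVE-2016-10088", {
    "code": "if (!access_ok(buf, len)) return -EFAULT;",
    "line_changes": "+ if (!access_ok(buf, len)) return -EFAULT;",
})
print(search("code", "access_ok(buf, n)"))
```

Keeping the channels as separate stores is what later allows each one to contribute its own ranked list to the fusion stage, rather than collapsing all views into a single embedding.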
3.3. Multi-Stage Retrieval and Vulnerability Detection Mechanism
Our framework uses a two-stage retrieval mechanism and corresponding detection strategies to achieve vulnerability identification and analysis functions based on Retrieval-Augmented Generation (RAG) technology. This mechanism effectively addresses limitations of traditional vulnerability detection methods in semantic understanding and knowledge utilization, while additionally addressing hallucination problems and knowledge limitations in LLM vulnerability detection through integrated retrieval and knowledge-enhanced detection methods.
3.3.1. Code Feature Analysis Stage
In the feature analysis stage, our framework performs a preliminary code analysis by using the LLM to bridge the gap between static test code and dynamic change indices. Because test samples lack historical modification records (diffs), we employ a “Predict-then-Retrieve” strategy.
First, the model analyzes the function to identify the problem type (statement-level or syntactic-level). Crucially, based on this identification, the model then predicts potential vulnerability locations:
For statement-level issues, the model acts as a static analyzer to pinpoint specific suspicious code lines (e.g., a missing check at line N).
For syntactic-level issues, the model identifies potentially vulnerable paths within the AST structure.
These predicted features (suspicious lines or AST paths) are then encoded into vectors that act as pseudo-change queries. This allows the system to retrieve semantically similar historical repair patterns from the Line Changes Index and AST Changes Index, enabling the static test code to work with the dynamic evolutionary knowledge stored in the knowledge base.
At the same time, the analysis stage produces a summary of the code’s behavior that acts as the query for the Knowledge Index.
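The Predict-then-Retrieve step can be sketched as follows. The prediction contents, field names, and the diff-like formatting are illustrative placeholders; in the real system the suspicious lines and AST paths come from the LLM's analysis.

```python
# Hedged sketch of pseudo-change query construction. Since test code has
# no diff history, predicted features are formatted to resemble the
# entries stored in the change indices.
def make_pseudo_change_query(problem_type: str, prediction: dict) -> str:
    if problem_type == "statement-level":
        # Format predicted suspicious lines like a historical diff hunk.
        return "\n".join(f"+ {line}" for line in prediction["suspicious_lines"])
    # Syntactic-level: serialize the predicted vulnerable AST path.
    return " -> ".join(prediction["ast_path"])

q1 = make_pseudo_change_query(
    "statement-level",
    {"suspicious_lines": ["if (!access_ok(buf, len)) return -EFAULT;"]},
)
q2 = make_pseudo_change_query(
    "syntactic-level",
    {"ast_path": ["FunctionDef", "While", "Call:mutex_unlock"]},
)
print(q1)
print(q2)
```

Formatting the query the same way the indexed diffs are formatted is the essential trick: it puts static test code and historical repair records in the same embedding space.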
3.3.2. WRRF Recall Stage
In order to effectively merge retrieval results from diverse views, we present the Weighted Reciprocal Rank Fusion (WRRF) algorithm. Traditional RRF treats all sources the same; the proposed WRRF instead uses dynamic weights based on the identified problem type T. The fusion score is computed as follows:

Score(s) = Σ_{c ∈ C_T} w_c(T) / (k + rank_c(s))

where rank_c(s) is the rank of sample s in channel c, k is a smoothing constant (default 60), and C_T is the active channel set. Crucially, w_c(T) represents the dynamic weight, which is adjusted to reflect the reliability of each channel for a specific problem type.
The weight configuration is based on design principles that focus on solving problems:
Statement-Level Strategy: The system assigns more weight to surface-level lexical features when errors are localized, such as missing checks. The Code (w = 0.35) and Line Changes (w = 0.25) channels carry the most weight, since they directly compare the predicted suspicious lines against past repair patterns. Structural channels such as AST provide additional context, while AST Changes receives less weight because the topology changes are small in these cases.
Syntactic-Level Strategy: For architectural problems such as control-flow errors, the focus shifts to structural topology. We emphasize the Syntax/AST (w = 0.4) and AST Changes (w = 0.3) channels to capture how logic flow changes and reorganizes. Textual features (code, line changes) are down-weighted, providing functional context rather than core diagnostic evidence.
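The fusion rule above reduces to a few lines of code. The sketch below follows the standard reciprocal-rank-fusion form; the 0.35/0.25 weights match the statement-level configuration reported in Section 4.4, while the remaining weights and the candidate rankings are illustrative assumptions.

```python
# Sketch of Weighted Reciprocal Rank Fusion: each channel contributes
# w_c / (k + rank) for every candidate it ranks, and candidates are
# ordered by their summed score.
def wrrf(rankings: dict, weights: dict, k: int = 60) -> list:
    scores = {}
    for channel, ranked_ids in rankings.items():
        w = weights.get(channel, 0.0)
        for rank, sid in enumerate(ranked_ids, start=1):
            scores[sid] = scores.get(sid, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

statement_weights = {"code": 0.35, "line_changes": 0.25,
                     "ast": 0.2, "ast_changes": 0.2}  # last two: illustrative
rankings = {
    "code":         ["s1", "s2", "s3"],
    "line_changes": ["s2", "s1", "s3"],
    "ast":          ["s3", "s2", "s1"],
}
print(wrrf(rankings, statement_weights))
```

Note how s2 wins despite never ranking first in the heaviest channel: consistent mid-rank evidence across weighted channels outscores a single top rank, which is exactly the noise-damping behavior the mechanism is designed for.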
3.3.3. Re-Ranking Stage
In the re-ranking stage, our framework performs semantic re-ranking on the recalled candidates by leveraging the LLM’s ability to assess relevance among the query code, the retrieved structural patterns (from the AST indices), and the related knowledge. Candidates are re-ordered by semantic relevance, structural similarity, and knowledge correlation, and the most relevant knowledge entries are returned.
During the re-ranking process, our framework passes the test code and retrieved knowledge to the LLM, which analyzes each candidate result’s vulnerability behavior description, solution patterns, and functional context in order to evaluate its matching degree with the target query and ultimately determine the knowledge entry that best matches the current code vulnerability.
3.3.4. Knowledge Enhanced Vulnerability Detection Stage
The final vulnerability identification phase follows knowledge retrieval. All acquired knowledge is used to examine the target: the LLM receives the test code, the relevant statements, the identified structural concerns, and the related knowledge. Following the RAG methodology, the LLM analyzes the code snippets, structural patterns, and vulnerability knowledge, compares the present code against known vulnerability patterns and repair procedures, and produces a detection result.
This knowledge enhancement improves the accuracy and interpretability of LLM-based vulnerability detection: the system can identify security vulnerabilities and provide detailed analysis and remediation suggestions, offering substantial technical support for real-world code security audits.
4. Experimental Setup
In this section, we first discuss the dataset characteristics used in our research, then baseline methods. Finally, we cover the experimental setup and the metrics used to evaluate it.
4.1. Dataset
We use the PrimeVul dataset [23] as the basis for our framework. PrimeVul is a recently curated dataset for training and evaluating Code Language Models (Code LMs) on vulnerability detection. The dataset contains 6968 vulnerable functions and 228,800 non-vulnerable functions, covering 140 Common Weakness Enumeration (CWE) types.
We chose the Paired subset of the dataset because it contains paired samples of vulnerable code and their fixed versions. This subset lets us examine how well the model can find subtle vulnerability patterns and distinguish vulnerable code from fixed code.
Table 2 shows the statistics for the dataset used in our research. The dataset includes CVE and CWE metadata and related vulnerability descriptions, giving it richer contextual information (usable for creating code variants) and additional knowledge compared to datasets that contain only code and labels.
We carried out a correlation analysis on key code features in order to learn more about the vulnerability characteristics of both vulnerable and fixed code. Figure 4 and Figure 5 show the correlations between features at the statement and structural levels; nearly all vulnerable–fixed pairs exhibit features at both levels.
The statement-level heatmap shows strong correlations between if statements, function calls, and pointer operations, supporting our decision to prioritize these features in the Line Changes Index for finding statement-level vulnerabilities.
On the other hand, the structural-level heatmap shows how data structure access (e.g., struct pointer access) and complexity metrics (e.g., cyclomatic complexity) are related. These patterns require a wider context for detection, which is why we use AST indices and the appropriate weights in our WRRF algorithm for structural flaws.
4.2. Baseline Methods
To comprehensively compare our method’s performance with existing methods, we consider five categories of baseline methods:
1. Static analysis: CppCheck, FlawFinder, Semgrep
2. Deep learning-based detection: LineVul, SVulD
3. BERT-based models: UniXCoder, CodeBERT, GraphCodeBERT
4. LLMs: CodeT5+, DeepSeek-V3, Qwen2.5-72B
5. LLM-based analysis methods: Vul-RAG, SVA-ICL
CppCheck [24] is a widely used open-source static analysis tool for C/C++ code. It provides unique code analysis features to detect errors, undefined behavior, and dangerous coding constructs. CppCheck aims to minimize false positives and can analyze non-standard C/C++ syntax common in embedded system development.
FlawFinder [25] is a static analysis tool designed to scan source code for potential security vulnerabilities and coding flaws. It supports multiple programming languages and provides a ranked list of potential issues, helping developers prioritize fixes based on the severity of the detected flaws.
Semgrep [26] is a fast, open-source static analysis tool that searches code, finds bugs, and enforces secure guardrails and coding standards.
LineVul [8], proposed by Fu et al., is a transformer-based model for line-level vulnerability prediction. It utilizes the self-attention mechanism of the BERT architecture to capture long-distance dependencies in code sequences, achieving more precise localization of vulnerable lines.
SVulD [27] trains a model to learn distinguishing semantic representations of functions regardless of their lexical similarity, which is closely related to our research.
CodeBERT [28], developed by Microsoft Research, is a dual-mode pre-trained model based on the transformer architecture. It is trained on both natural language and programming language corpora, making it effective across a wide range of code understanding tasks, including vulnerability detection. We use it after LoRA fine-tuning in our vulnerability detection experiments.
UniXCoder [29] is a pre-trained model that works well for code understanding and generation tasks. It supports many programming languages and combines textual and structural information in its representations, improving generalization when analyzing code. We use it after LoRA fine-tuning in our vulnerability detection experiments.
GraphCodeBERT [30] is a structure-aware pre-trained model based on the transformer architecture. Unlike traditional models that treat code as a token sequence, it uses Data Flow Graphs (DFGs) in the pre-training stage to capture variable dependencies and the semantic structure of code. This design makes it well suited to tasks that require logical reasoning about code. We use it after LoRA fine-tuning in our vulnerability detection experiments.
CodeT5+ [31] is a unified encoder–decoder code LLM. It employs a diverse mixture of pre-training objectives (including span denoising, causal language modeling, and contrastive learning) to achieve state-of-the-art performance in both code generation and understanding. Its flexible architecture allows it to handle complex code analysis scenarios effectively. In our experiments, we formulate the task as a sequence-to-sequence (Seq2Seq) generation problem in order to leverage the model’s pre-trained generative capabilities.
DeepSeek-V3 [32] is a powerful Mixture-of-Experts (MoE) language model with 671B parameters, activating 37B per token. It shows strong performance in general code understanding and reasoning, making it a competitive baseline for large-scale LLM vulnerability detection.
Qwen2.5 [33] is an LLM series launched by Alibaba Cloud. It provides powerful natural language processing capabilities and excels in multiple fields, including code generation and interpretation. In our research, we use Qwen2.5-72B as the base model in RASM-Vul.
Vul-RAG [18] is an LLM-based vulnerability detection technique that leverages a knowledge-level RAG framework to detect vulnerabilities. In our research, we use DeepSeek-V3 as the base model for Vul-RAG.
SVA-ICL [16] is an LLM-based software vulnerability assessment method with vulnerability detection as its core component. Since it uses an approach similar to ours (in-context learning to evaluate vulnerabilities), we choose it as a comparison baseline.
4.3. Evaluation Metrics
To evaluate our framework’s effectiveness in code vulnerability detection, we consider the following four metrics:
Accuracy: Measures the proportion of correctly labeled functions: Accuracy = (TP + TN) / (TP + TN + FP + FN).
Precision: Measures the proportion of true vulnerabilities among detected vulnerabilities: Precision = TP / (TP + FP).
Recall: Evaluates how many vulnerabilities are correctly detected: Recall = TP / (TP + FN).
F1-score: The harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall).
Pair Accuracy (Pair-Acc): Measures the accuracy of the method in correctly identifying both the vulnerable code and its corresponding fixed version within a pair: Pair-Acc = (number of fully correct pairs) / (total number of pairs). This metric is specific to our paired dataset and evaluates the model’s ability to discern subtle differences between vulnerable and patched code.
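These five metrics, including the pair-level one, can be computed directly from per-function predictions. The sketch below uses the standard confusion-matrix definitions; the label convention (1 = vulnerable, 0 = fixed) and the toy data are illustrative.

```python
# Compute Accuracy, Precision, Recall, F1, and Pair-Acc from predictions.
# `pairs` lists (vulnerable_idx, fixed_idx) index pairs into the label arrays.
def metrics(y_true, y_pred, pairs):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Pair-Acc: both members of a pair must be classified correctly.
    pair_acc = sum(y_pred[v] == 1 and y_pred[f] == 0 for v, f in pairs) / len(pairs)
    return acc, prec, rec, f1, pair_acc

# Two pairs: (vuln@0, fixed@1) and (vuln@2, fixed@3).
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 1, 1]  # second fixed sample misclassified as vulnerable
print(metrics(y_true, y_pred, [(0, 1), (2, 3)]))
```

The toy data makes the difference visible: function-level accuracy is 0.75, but Pair-Acc drops to 0.5 because one pair is only half correct, which is why Pair-Acc is the stricter measure of discerning vulnerable from patched code.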
4.4. Detailed Setup of Our Work
4.4.1. Our Framework
Core Architecture: We use Python 3.11.4 to build a modular framework, use LangChain 0.3.19 for LLM integration and prompt management, and adopt a pipeline processing architecture to support knowledge extraction, vectorization, and retrieval-enhanced detection.
Knowledge Extraction and Code Enhancement: We utilize DeepSeek-V2.5-1210 to implement a nine-dimensional knowledge extraction engine, generating code variants and knowledge exclusively from training data.
Multi-view Vectorization: We use UniXCoder-base as the embedding model. It produces 768-dimensional vector representations and builds five specialized indices from training data only: the code index, syntax index, knowledge index, line change index, and AST change index. These indices support multi-level semantic retrieval.
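A minimal sketch of the retrieval step over one such index, assuming embeddings (e.g., 768-dimensional UniXCoder-base vectors) have already been computed; NumPy arrays stand in for the real vector store, and the function names are illustrative rather than the framework’s actual API.

```python
# Sketch: cosine-similarity retrieval over a single per-view index.
# Assumption: embeddings are precomputed; numpy replaces the real store.
import numpy as np

def build_index(vectors):
    """Stack and L2-normalize embeddings so dot product = cosine similarity."""
    mat = np.asarray(vectors, dtype=np.float64)
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def retrieve(index, query, top_k=5, threshold=0.15):
    """Return (position, similarity) pairs above the similarity threshold."""
    q = np.asarray(query, dtype=np.float64)
    sims = index @ (q / np.linalg.norm(q))
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]
```

The same pattern is repeated for each of the five views; the top_k and threshold defaults mirror the retrieval parameters given in Section 4.4.2.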
Strict Data Isolation Implementation: Our framework enforces comprehensive data separation to prevent leakage:
- Knowledge Base Phase: Only training samples (3789 pairs) are used for knowledge extraction, code variant generation, and index construction.
- Inference Phase: Test samples (435 pairs) are processed through the retrieval system without any knowledge base contamination.
- Validation Phase: Validation samples (480 pairs) are used solely for hyperparameter tuning, specifically WRRF weight optimization via grid search.
WRRF Retrieval System: The implemented Weighted Reciprocal Rank Fusion (WRRF) algorithm dynamically adjusts the weights of five retrieval channels according to problem type, improving retrieval accuracy and semantic relevance through a two-stage retrieval mechanism.
Dynamic Weight Configuration: We design different weight strategies for statement-level and syntactic-level problems, with statement-level problems focusing on the code and line change channels (respective weights of 0.35 and 0.25) and syntactic-level problems focusing on the AST and AST change channels (0.4 and 0.3). These weights are optimized using the validation set through grid search without ever using test data.
Vulnerability Detection Model: The framework supports multiple LLMs for vulnerability detection, including DeepSeek-V3 and Qwen2.5-72B, achieving knowledge-driven vulnerability analysis through RAG-enhanced prompts. All prompt templates are provided in Appendix A.
4.4.2. Model Parameters
The following parameters were carefully configured based on actual code implementation and experimental verification:
LLM Generation Parameters:
- max_tokens = 4096: Supports complete function-level code analysis; vulnerability functions averaging 50–100 lines are processed without truncation.
- temperature = 0.3: Balances determinism and diversity, ensuring consistency in vulnerability detection results.
- top_p = 0.9: Optimizes output quality, following best practices for code generation tasks.
WRRF Algorithm Parameters:
- k_RRF = 60: Smoothing parameter for the reciprocal rank fusion algorithm.
- Statement-level weight configuration: Code channel (0.35), line change channel (0.25), AST channel (0.2), knowledge channel (0.15), AST change channel (0.05).
- Syntactic-level weight configuration: AST channel (0.4), AST change channel (0.3), code channel (0.15), line change channel (0.1), knowledge channel (0.05).
Vector Retrieval Parameters:
- top_k = 5: Returns the five most relevant candidate results per retrieval round.
- embedding_dim = 768: Standard vector dimension for the UniXCoder-base model.
- similarity_threshold = 0.15: Semantic similarity threshold to ensure retrieval quality.
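Putting these parameters together, the WRRF scoring rule described in Section 4.4.1 can be sketched as follows. This is an illustrative reconstruction under the stated weights and k_RRF = 60, not the framework’s actual implementation; the input format (per-channel ranked candidate lists) is our assumption.

```python
# Sketch of Weighted Reciprocal Rank Fusion (WRRF): each candidate's score
# is the weighted sum over channels of 1 / (k_RRF + rank). Weights follow
# the statement-level / syntactic-level configurations listed above.

K_RRF = 60  # smoothing parameter from Section 4.4.2

WEIGHTS = {
    "statement": {"code": 0.35, "line_diff": 0.25, "ast": 0.2,
                  "knowledge": 0.15, "ast_diff": 0.05},
    "syntactic": {"ast": 0.4, "ast_diff": 0.3, "code": 0.15,
                  "line_diff": 0.1, "knowledge": 0.05},
}

def wrrf(channel_rankings, problem_type, top_k=5):
    """Fuse ranked lists; channel_rankings maps channel name to an
    ordered list of candidate IDs (best first)."""
    weights = WEIGHTS[problem_type]
    scores = {}
    for channel, ranking in channel_rankings.items():
        w = weights.get(channel, 0.0)
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + w / (K_RRF + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]
```

Because only the weights change between problem types, switching between statement-level and syntactic-level fusion adds no retrieval cost, which matches the latency claim made for WRRF in Section 5.3.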
5. Experimental Results
We formulate four research questions (RQs 1–4) to systematically evaluate how well the proposed RASM-Vul detects code vulnerabilities:
RQ-1. What is the overall performance of RASM-Vul in function-level vulnerability detection?
RQ-2. What is the contribution of each component in RASM-Vul, and are all five indices necessary?
RQ-3. Does the WRRF retrieval method in the framework outperform alternative retrieval strategies?
RQ-4. How well can the current framework distinguish vulnerable code from fixed code in paired detection, and how does the relevance of the retrieved evidence affect this ability?
5.1. RQ-1: Overall Vulnerability Detection Performance Analysis
To assess the overall performance of RASM-Vul, we compared it with several baseline methods on the PrimeVul dataset.
Table 3 reports each method’s performance on the four metrics of accuracy, precision, recall, and F1-score.
The experimental results demonstrate that RASM-Vul exhibits substantial performance advantages across various LLM backends. In particular:
Overall Performance Leadership: RASM-Vul (DeepSeek-V3) achieves 60.11% accuracy and a 66.79% F1-score, significantly outperforming all baseline methods.
Outstanding Recall Performance: RASM-Vul (Qwen2.5-72B) attains a recall of 87.36%, indicating that multi-view semantic mapping recovers most real vulnerabilities by matching them against historical vulnerability patterns, a property that is particularly valuable in security detection applications.
Comparison with State-of-the-Art Retrieval Methods: RASM-Vul (DeepSeek-V3) surpasses the best current retrieval-based method, SVA-ICL, by 10 percentage points in accuracy and 8.44 percentage points in F1-score, showing that our multi-view similarity search outperforms single-channel retrieval. Our framework also exceeds the knowledge-enhanced Vul-RAG method by 5.62 percentage points in accuracy and 16.46 percentage points in F1-score, indicating that our multi-view fusion strategy is more effective than traditional knowledge-level RAG.
Table 3 shows that while RASM-Vul attains a higher recall (up to 87.36%), its precision remains at roughly 52–57%. We contend that this tradeoff is operationally warranted for two reasons. The first is the “look-alike” challenge: in the paired dataset, fixed functions exhibit exceptionally high lexical similarity with vulnerable functions, and the model’s tendency to flag fixed code as vulnerable (producing false positives) reflects a deliberately conservative safety bias, identifying the risky context even when the subtle fix line is missed. The second is safety-critical optimization: in security auditing, the cost of a false negative (a missed vulnerability) significantly outweighs that of a false positive, and RASM-Vul is intended as a high-recall screening tool. To quantify this, we report the F2-score, which weights recall twice as heavily as precision. RASM-Vul achieves an F2-score of 0.742, exceeding all baselines and confirming its utility for reducing security risk.
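For reference, the F-beta score generalizes F1 by weighting recall beta times as heavily as precision; beta = 2 gives the F2-score used here. A minimal sketch:

```python
# Sketch: F-beta score, F_b = (1 + b^2) * P * R / (b^2 * P + R).
# beta = 2 (F2) weights recall twice as heavily as precision.

def f_beta(precision, recall, beta=2.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 2, a high-recall/moderate-precision system scores better than its mirror image with the values swapped, which is exactly the behavior desired for a screening tool.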
We additionally applied the Wilcoxon signed-rank test to the paired prediction results to verify that the performance improvements of RASM-Vul are not merely random. RASM-Vul shows a statistically significant improvement over the strongest generative baseline, CodeT5+, and the improvement over the GraphCodeBERT baseline is also highly significant. These results indicate that our multi-view retrieval framework is robust and consistently outperforms the fine-tuned baselines.
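As an illustration of this procedure, a Wilcoxon signed-rank test can be applied to paired per-sample scores from two methods; the score vectors below are synthetic stand-ins, not the paper’s data.

```python
# Sketch: Wilcoxon signed-rank test on paired per-sample scores.
# The arrays are synthetic illustrations; the real analysis uses the
# paired prediction results behind Table 3.
from scipy.stats import wilcoxon

rasm_scores = [0.90, 0.80, 0.70, 0.95, 0.85, 0.75, 0.90, 0.80, 0.88, 0.92]
base_scores = [0.70, 0.75, 0.60, 0.80, 0.71, 0.72, 0.65, 0.78, 0.69, 0.74]

stat, p_value = wilcoxon(rasm_scores, base_scores)
# A small p_value indicates the improvement is unlikely to be random.
```

Because the test operates on signed rank differences rather than raw magnitudes, it makes no normality assumption, which suits per-sample detection scores.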
These results demonstrate that RASM-Vul is effective at vulnerability detection. Its multi-view semantic mapping architecture supplies LLMs with precisely matched historical vulnerability patterns and structured knowledge, helping them better identify vulnerabilities.
5.2. RQ-2: Indices Contribution Analysis
We performed thorough ablation experiments to quantify the contribution of each component of RASM-Vul and to determine whether all five indices are necessary.
Table 4 displays the efficacy of various component configurations, illustrating the impact of each module.
The results of the ablation experiments reveal several critical insights:
Multi-View Retrieval Effectiveness: The results clearly demonstrate the value of our multi-view retrieval approach:
Statement-level matching enhances recall for both LLM backends (DeepSeek-V3: 43.75% → 70.34%; Qwen2.5-72B: 61.84% → 86.87%), showing its effectiveness in identifying line-level vulnerability patterns.
Syntactic pattern matching enhances statement-level analysis by identifying structural vulnerability patterns. In the DeepSeek-V3 configuration, the combination of syntactic pattern, AST diff, and knowledge retrieval achieves the highest recall (90.57%).
Knowledge integration: Knowledge retrieval consistently improves detection capabilities, especially when combined with statement-level features. Adding knowledge to the statement and line diff features of DeepSeek-V3 raises the F1-score by 5.84 percentage points, from 57.20% to 63.04%.
Full Integration Superiority: The full RASM-Vul architecture achieves the best overall performance:
For DeepSeek-V3, the full framework achieves the best F1-score (66.79%), outperforming every ablated configuration and improving on the baseline model (46.90%) by 19.89 percentage points.
For Qwen2.5-72B, the integrated system achieves an F1-score of 64.96% with a particularly strong recall of 87.36%, showing that the framework generalizes across LLM architectures.
The gains from full integration are consistent: DeepSeek-V3 improves by 3.75 percentage points over its best partial configuration (Statement + Line Diff + Knowledge), and Qwen2.5-72B by 1.08 percentage points over its best partial configuration.
These results support our design philosophy that supplying LLMs with diverse types of vulnerability evidence through retrieval is more effective than relying on their intrinsic pattern recognition. The consistent gains across both LLM backends show that RASM-Vul’s multi-view semantic mapping greatly improves vulnerability detection across different model architectures.
5.3. RQ-3: Retrieval Algorithm Ablation Experiment
To verify the effectiveness of the proposed WRRF retrieval algorithm, we designed detailed ablation experiments comparing the performance of different retrieval strategies on vulnerability detection tasks. The base model for this research question was DeepSeek-V3. The experimental results are shown in Table 5.
The experimental results show the following:
Single-Channel Limitations: Using only code similarity for retrieval supplies limited contextual knowledge, similar to few-shot prompt engineering, which explains the low F1-score (59.01%). This approach cannot capture vulnerability patterns that manifest in other code representations.
Multi-Channel Knowledge Expansion: Adding further retrieval channels (code, syntax, knowledge, etc.) greatly expands the knowledge available to the LLM. The 4.55 percentage point improvement in F1-score shows that providing multiple views of a vulnerability makes it easier for the model to detect.
RRF Integration Benefits: The Reciprocal Rank Fusion algorithm further improves retrieval quality by combining results across channels, achieving a 64.87% F1-score. This shows that principled fusion of multi-view evidence outperforms simple averaging.
WRRF Adaptive Optimization: Our Weighted Reciprocal Rank Fusion (WRRF) algorithm introduces dynamic weight adjustment based on vulnerability characteristics, yielding the highest performance (66.79% F1). The 1.92 percentage point improvement over RRF demonstrates that adaptive channel weighting better matches the nuanced requirements of different vulnerability patterns. Importantly, this enhancement requires no additional computation time, as WRRF only modifies channel weights while retaining the same retrieval operations.
Figure 6 visualizes this progression clearly: WRRF yields the sharpest separation between correct and incorrect predictions, with true positives concentrated in high-confidence regions (0.8–1.0). The figure shows how our stepwise refinement of the retrieval mechanism, from single-channel retrieval to adaptive multi-view mapping, systematically improves the quality of the knowledge provided to the LLM.
These results demonstrate that our multi-view semantic mapping framework with the WRRF algorithm substantially improves vulnerability detection by providing LLMs with more accurate and relevant historical evidence across different code representations.
5.4. RQ-4: Fine-Grained Discrimination Capability and Retrieval Evidence Analysis
In practical applications, vulnerability detection systems must not only find vulnerable code but also clearly distinguish vulnerable functions from their non-vulnerable (fixed) versions. To assess this capability, we use the Pair Accuracy (Pair-Acc) metric.
Table 6 shows the results of the comparison on the Paired subset of the data.
5.4.1. Quantitative Analysis of Paired Detection
The experimental findings demonstrate the significant challenge of identifying nuanced vulnerability mitigations:
The Challenge of Look-Alike Code: Traditional deep learning techniques encounter considerable difficulty on this task. LineVul correctly classifies only 5.06% of pairs, and SVulD, which was specifically designed to distinguish look-alike code through subtle semantic representation learning, performs only marginally better at 5.52%. This shows that intrinsic pattern recognition alone cannot capture fine-grained repair logic.
Superiority of Knowledge-Enhanced Retrieval: RASM-Vul achieves a pair accuracy of 21.38%, a 38.8% relative improvement over direct LLM detection (DeepSeek-V3: 15.40%), with WRRF proving the most effective retrieval strategy. This shows that dynamically weighting structural and differential evidence (line/AST changes) is necessary to determine whether a vulnerability has been fixed.
5.4.2. Attribution to Retrieval Effectiveness
To understand why RASM-Vul succeeds in capturing these subtle differences where other methods fail, we evaluated the semantic relevance of the retrieved evidence, using the BERTScore between the retrieved content and ground-truth vulnerability descriptions. The quantitative analysis shows an average BERTScore of 0.6732 for retrieved code and 0.4052 for retrieved knowledge. As illustrated in Figure 7, there is a strong correlation between retrieval quality and detection success:
High Relevance Drives Discrimination: In successful detection cases (left side of Figure 7), the retrieved code and knowledge exhibit high similarity to the test function’s logic and vulnerability type (CWE match). This provides the LLM with explicit historical references for how such vulnerabilities manifest and are fixed.
Mismatch Leads to Failure: In failed cases (right side of Figure 7), the retrieval system fetches irrelevant patterns (e.g., matching a Buffer Overflow case with an Integrity Check pattern), causing the LLM to hallucinate or to miss the vulnerability.
Figure 7.
Impact of retrieval relevance on detection: correct vs. incorrect CWE type matching.
5.4.3. Error Propagation and Resilience Analysis
To evaluate the reliability of the “predict-then-retrieve” pipeline, we conducted a comprehensive error analysis combining quantitative oracle testing and qualitative failure case studies.
Quantitative Analysis: Classification and Resilience
We use an outcome-based oracle analysis rather than heuristic rules, defining the “ground truth” as the retrieval strategy (statement vs. structure) that yields the correct detection result. Compared against this oracle, the agent’s strategy predictions achieve a classification accuracy of 81.22%, showing that the prediction stage is highly reliable. Moreover, in 61.3% of cases both strategies produce the same detection result, indicating that the system is inherently robust.
We additionally measured error propagation by comparing a simulated “hard switch” strategy (which uses only the predicted channel) with our deployed WRRF soft fusion. The hard switch achieved 52.76% accuracy, versus 60.11% for the full RASM-Vul system. This gain (+7.35 percentage points) shows that WRRF’s adaptive weighting effectively prevents error propagation: because secondary channels retain non-zero weights, complementary evidence is preserved even when the agent’s predicted strategy is suboptimal, allowing the system to recover from misclassifications.
Qualitative Analysis: Failure Modes
Next, we examined representative failure cases (see Appendix B for detailed code snippets and retrieval logs) to characterize the limits of our framework.
False Negatives: The system struggles with vulnerabilities involving complex cross-function dependencies, such as non-obvious memory leaks. Because our embedding-based retrieval focuses on local semantic similarity, it can miss long-range logical contracts, causing the LLM to default to a safe “secure” prediction.
False Positives: The elevated false positive rate stems mainly from a conservative safety bias. Fixed code frequently exhibits high lexical similarity with its vulnerable version (so-called “look-alike” code), and the model tends to mark such code as risky to avoid missing potential threats, prioritizing recall over precision.
6. Discussion
6.1. Research Findings and Implications
Based on the experimental results and ablation investigations, we derive three critical implications for vulnerability identification utilizing LLMs:
Synergy of Multi-Dimensional Semantics: The ablation study (RQ-2) verifies that code is not a linear text sequence but rather a multidimensional construct encompassing semantics, syntax, and evolutionary history. Our multi-view fusion elevates the F1-score to 66.79%, whereas single-channel retrieval reaches a plateau. The incorporation of line/AST change indices is a crucial element in the “paired detection” task (RQ-4). This indicates that historical repair patterns (diffs) provide LLMs with the necessary contrastive data to distinguish between vulnerable functionalities and their patched counterparts.
The WRRF method optimizes this synergy by employing a weight configuration that has been empirically validated rather than arbitrarily selected. The channel contribution analysis reveals a distinct division: for statement-level vulnerabilities, the code and line changes indices account for 74.9% and 25.1%, respectively, establishing them as the primary sources of evidence for retrieval. Conversely, for structural-level vulnerabilities, the dependence predominantly switches to AST changes (74.8%) and AST indices. This information supports the notion that the dynamic weighting technique aligns with the physical characteristics of vulnerabilities.
Figure 8 illustrates that the system maintains a superior performance level (about 0.95) within the weight range of [0.3, 0.35]. This indicates that the framework’s robustness derives from the adaptive mechanism itself rather than from its excessive alignment with certain hyperparameters.
Necessity of Adaptive Retrieval (WRRF): Static RAG methods ignore the fact that vulnerabilities manifest in many different forms. Our WRRF algorithm (RQ-3) acts as a dynamic attention mechanism, shifting focus between “surface-level features” (code/line indices) for statement-level bugs and “structural features” (syntax/AST indices) for architectural bugs. This adaptivity to different error types is a key source of RASM-Vul’s strength, and it outperforms fixed-weight baselines by 1.92 percentage points in F1-score.
Retrieval Relevance as a Confidence Proxy: The correlation study in RQ-4 shows that detection failures are frequently preceded by low-relevance retrievals (low BERTScores). This indicates that the retrieved evidence serves not only as context but also as a calibration anchor. Future systems may use the retrieval similarity score as a trustworthiness metric: if retrieval confidence is low, the system should flag the detection result as doubtful, reducing the risk of silent hallucinations.
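This confidence-gating idea can be sketched as a thin wrapper around the detector’s output. The threshold and record format below are hypothetical illustrations of the proposal, not part of the deployed system.

```python
# Sketch: gating detection results on retrieval relevance.
# The 0.4 cut-off and the output fields are hypothetical assumptions.

def gate_detection(prediction, retrieval_similarity, threshold=0.4):
    """Attach a trust flag to a detection based on retrieval relevance."""
    trusted = retrieval_similarity >= threshold
    return {
        "prediction": prediction,
        "retrieval_similarity": retrieval_similarity,
        "status": "confident" if trusted else "doubtful",  # shown to auditor
    }
```

In a screening pipeline, “doubtful” results could be routed to manual review rather than silently reported, directly addressing the hallucination risk discussed above.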
Latency and Complexity Analysis: To assess practical deployability, we measured inference latency on an NVIDIA RTX 3090. Each function required roughly 3.3 s to process, of which 1.2 s was spent on feature analysis, 1.4 s on WRRF retrieval, and 0.7 s on RAG inference. Although this latency exceeds that of lightweight static analysis, it is acceptable in asynchronous scenarios such as automated code reviews or nightly CI/CD builds, where the significant accuracy gain justifies the additional processing time.
6.2. Threats to Validity
Quality Validation of Knowledge Base and Code Variants: To ensure the reliability of our constructed knowledge base, we carried out a thorough check of both the extracted knowledge and the created code variants.
We performed a manual check on a stratified random sample of 100 cases covering the ten most common CWEs. The validation employed a double-blind methodology with a human expert and an LLM assessor, neither of whom was involved in the experiments. The manual assessment found 68% accuracy, suggesting that the extracted knowledge is largely correct and reusable; the LLM evaluator reported 60%. The Inter-Annotator Agreement (Cohen’s Kappa) was 0.65, indicating substantial agreement between the two assessors. Without the “generate–evaluate–reflect” technique (Section 3.1.3), initial extraction accuracy dropped to roughly 40%, underscoring the importance of the reflection stage.
Because isolated snippets cannot be compiled directly, we also used tree-sitter to automatically check all 7564 generated samples and verify the quality of the mutation variants (Section 3.1.2). The automated assessment indicated high reliability: 98.14% of variants achieved syntactic validity (error-free ASTs), and 86.86% preserved the original vulnerability logic, a level of label consistency sufficient to serve as a proxy for regression testing. The structural preservation rate was 57.89%, an important result implying that structural diversity (e.g., loop transformations) is successfully introduced, which improves model generalization while maintaining high semantic similarity (88.50%).
Generalizability to Other Programming Languages: Our experiments are currently limited to C/C++. While the WRRF retrieval logic is language-agnostic, the AST parsers and mutation rules are tailored to C-family syntax. The framework still needs to be evaluated on other ecosystems commonly used in vulnerability research, such as Java [34] and Python [35].
Limitations in Detection Granularity: The current framework operates at the function level, whereas complex vulnerabilities often arise from manipulation of global state or inter-procedural data flow. Recent studies have shown that many real-world vulnerabilities span multiple files [36]. The absence of repository-level context (e.g., caller/callee relationships) constrains the model’s capacity to identify deep logical defects that cross the boundaries of a single function.
Computational Resource and Data Constraints: Owing to limited computing resources, some ablation experiments were run on data subsets. In addition, RASM-Vul rests on the assumption that “similar bugs have happened before”: the system works well for recurring vulnerability patterns (CWEs) but may struggle with zero-day vulnerabilities or unique logic errors that have no historical counterparts in the knowledge base. The quality of the vector index is strictly determined by the diversity of the training data.
7. Conclusions
This paper introduces RASM-Vul, a retrieval-augmented framework designed to address the problems LLMs face with hallucination and fine-grained vulnerability patterns. Our approach bridges general code understanding and specialized security analysis by building a multi-view vector knowledge base that combines code, syntax, and evolutionary changes, together with our WRRF algorithm for adaptive retrieval. Extensive evaluation on the PrimeVul Paired dataset shows that RASM-Vul achieves a state-of-the-art F1-score of 66.79%. On the difficult paired detection task, our framework reaches 21.38% pair accuracy, far surpassing existing baselines (approximately 5%), demonstrating that incorporating historical repair knowledge is key to distinguishing look-alike vulnerable code from fixed code. Our results indicate that the future of automated vulnerability detection depends not only on larger models but also on more intelligent, adaptive retrieval systems that can deliver precise comparative evidence. In future work, we will broaden the retrieval scope to repository-level contexts and explore cross-language knowledge transfer to further improve robustness.
Author Contributions
T.Z.: Conceptualization, methodology, writing—original draft preparation, writing—review and editing; C.M.: Conceptualization, validation, writing—review and editing, resources; L.Z.: Software, formal analysis, data curation, visualization; J.Y.: Validation, methodology, investigation, funding acquisition; L.N.: Supervision, project administration. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by Heilongjiang Provincial Natural Science Foundation of China, grant number ZL2025F005, and National Natural Science Foundation, grant number 62172123.
Data Availability Statement
The datasets analyzed during the current study are available from the PrimeVul repository (https://github.com/DLVulDet/PrimeVul) (accessed on 15 December 2024). The source code of the experiments presented in this paper is available on request from the corresponding author.
Acknowledgments
The authors would like to thank the anonymous reviewers for their helpful comments.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AST | Abstract Syntax Tree |
| BERT | Bidirectional Encoder Representations from Transformers |
| CPG | Code Property Graph |
| CVE | Common Vulnerabilities and Exposures |
| CWE | Common Weakness Enumeration |
| GNN | Graph Neural Network |
| ICL | In-Context Learning |
| LLM | Large Language Model |
| LoRA | Low-Rank Adaptation |
| LSTM | Long Short-Term Memory |
| MoE | Mixture-of-Experts |
| NVD | National Vulnerability Database |
| RAG | Retrieval-Augmented Generation |
| RASM-Vul | Retrieval-Augmented Semantic Mapping for Vulnerability Detection |
| WRRF | Weighted Reciprocal Rank Fusion |
Appendix B. Detailed Failure Case Analysis
This appendix provides a detailed examination of the specific code snippets and reasoning processes for the representative failure cases of FN and FP (Listing A1).
| Listing A1. Vulnerable code example for CVE-2022-23578 (TensorFlow). |
Appendix B.1. Case Study 1: False Negative (Memory Leak)
Target: CVE-2022-23578 (TensorFlow)
Issue: Implicit memory leak on error handling path.
Prediction: Safe (Incorrect)
Analysis
The model failed to detect this because the leak is indirect. There is no explicit malloc without free. Instead, it requires understanding the lifecycle of the NodeItem object and the side effects of create_kernel. The retrieval system failed to find a structurally similar “leak-on-error-path” pattern, leading the LLM to trust the visible error handling logic.
Appendix B.2. Case Study 2: False Positive (Integer Overflow)
Target: CVE-2022-32545 (ImageMagick)
Issue: Fixed version of an Integer Overflow vulnerability.
Prediction: Vulnerable (CWE-125)
Appendix B.2.1. Code Snippet (Fixed Version)
The code involves pointer arithmetic, which the model flagged as risky despite the presence of checks (Listing A2).
| Listing A2. Fixed code snippet for CVE-2022-32545 (ImageMagick). |
Appendix B.2.2. Analysis
The model incorrectly flagged this safe code because it shares high lexical similarity with the vulnerable version. The pattern of iterating pointers (*p++, PushShortPixel) inside a loop is a strong heuristic for buffer overflows in the knowledge base. Acting conservatively, the LLM flagged the “risky smell” despite the logic being sound.
References
- NVD. National Vulnerability Database. 2024. Available online: https://nvd.nist.gov/ (accessed on 28 March 2024).
- Wu, F.; Wang, J.; Liu, J.; Wang, W. Vulnerability detection with deep learning. In Proceedings of the 2017 3rd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2017; IEEE: New York, NY, USA, 2017; pp. 1298–1302. [Google Scholar]
- Russell, R.; Kim, L.; Hamilton, L.; Lazovich, T.; Harer, J.; Ozdemir, O.; Ellingwood, P.; McConley, M. Automated vulnerability detection in source code using deep representation learning. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; IEEE: New York, NY, USA, 2018; pp. 757–762. [Google Scholar]
- Lin, G.; Zhang, J.; Luo, W.; Pan, L.; Xiang, Y. POSTER: Vulnerability discovery with function representation learning from unlabeled projects. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October 2017; pp. 2539–2541. [Google Scholar]
- Cheng, X.; Wang, H.; Hua, J.; Xu, G.; Sui, Y. Deepwukong: Statically detecting software vulnerabilities using deep graph neural network. ACM Trans. Softw. Eng. Methodol. 2021, 30, 38. [Google Scholar] [CrossRef]
- Wang, H.; Ye, G.; Tang, Z.; Tan, S.H.; Huang, S.; Fang, D.; Feng, Y.; Bian, L.; Wang, Z. Combining graph-based learning with automated data collection for code vulnerability detection. IEEE Trans. Inf. Forensics Secur. 2020, 16, 1943–1958. [Google Scholar] [CrossRef]
- Cao, S.; Sun, X.; Bo, L.; Wei, Y.; Li, B. Bgnn4vd: Constructing bidirectional graph neural-network for vulnerability detection. Inf. Softw. Technol. 2021, 136, 106576. [Google Scholar] [CrossRef]
- Fu, M.; Tantithamthavorn, C. Linevul: A transformer-based line-level vulnerability prediction. In Proceedings of the 19th International Conference on Mining Software Repositories, Pittsburgh, PA, USA, 23–24 May 2022; pp. 608–620. [Google Scholar]
- Cao, Y.; Ju, X.; Chen, X.; Gong, L. MCL-VD: Multi-modal contrastive learning with LoRA-enhanced GraphCodeBERT for effective vulnerability detection. Autom. Softw. Eng. 2025, 32, 67.
- Yin, X.; Ni, C.; Wang, S. Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability. IEEE Trans. Softw. Eng. 2024, 50, 3071–3087.
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 1–55.
- Siriwardhana, S.; Weerasekera, R.; Wen, E.; Kaluarachchi, T.; Rana, R.; Nanayakkara, S. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Trans. Assoc. Comput. Linguist. 2023, 11, 1–17.
- Lu, S.; Duan, N.; Han, H.; Guo, D.; Hwang, S.W.; Svyatkovskiy, A. ReACC: A retrieval-augmented code completion framework. arXiv 2022, arXiv:2203.07722.
- Gao, Z.; Wang, H.; Zhou, Y.; Zhu, W.; Zhang, C. How Far Have We Gone in Vulnerability Detection Using Large Language Models. arXiv 2023, arXiv:2311.12420.
- Atiiq, S.A.; Gehrmann, C.; Dahlén, K.; Khalil, K. From generalist to specialist: Exploring CWE-specific vulnerability detection. arXiv 2024, arXiv:2408.02329.
- Gao, C.; Chen, X.; Zhang, G. SVA-ICL: Improving LLM-based software vulnerability assessment via in-context learning and information fusion. Inf. Softw. Technol. 2025, 186, 107803.
- Li, Z.; Dutta, S.; Naik, M. LLM-assisted static analysis for detecting security vulnerabilities. arXiv 2024, arXiv:2405.17238.
- Du, X.; Zheng, G.; Wang, K.; Zou, Y.; Wang, Y.; Deng, W.; Lou, Y. Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG. arXiv 2024, arXiv:2406.11147.
- Ding, B.; Qin, C.; Zhao, R.; Luo, T.; Li, X.; Chen, G.; Xia, W.; Hu, J.; Luu, A.T.; Joty, S. Data augmentation using large language models: Data perspectives, learning paradigms and challenges. arXiv 2024, arXiv:2403.02990.
- Fakhoury, S.; Naik, A.; Sakkas, G.; Chakraborty, S.; Lahiri, S.K. LLM-based test-driven interactive code generation: User study and empirical evaluation. IEEE Trans. Softw. Eng. 2024, 50, 2254–2268.
- Nong, Y.; Ou, Y.; Pradel, M.; Chen, F.; Cai, H. VulGen: Realistic vulnerability generation via pattern mining and deep learning. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; IEEE: New York, NY, USA, 2023; pp. 2527–2539.
- Nong, Y.; Fang, R.; Yi, G.; Zhao, K.; Luo, X.; Chen, F.; Cai, H. VGX: Large-scale sample generation for boosting learning-based software vulnerability analyses. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13.
- Ding, Y.; Fu, Y.; Ibrahim, O.; Sitawarin, C.; Chen, X.; Alomair, B.; Wagner, D.; Ray, B.; Chen, Y. Vulnerability detection with code language models: How far are we? In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 26 April–6 May 2025; IEEE Computer Society: New York, NY, USA, 2025; pp. 469–481.
- Cppcheck Team. Cppcheck. 2024. Available online: http://cppcheck.net/ (accessed on 28 March 2024).
- Wheeler, D.A. Flawfinder: Static Analysis Tool for Finding Potential Security Vulnerabilities in Source Code. Available online: https://dwheeler.com/flawfinder/ (accessed on 4 April 2025).
- Semgrep. 2021. Available online: https://semgrep.dev (accessed on 12 January 2025).
- Ni, C.; Yin, X.; Yang, K.; Zhao, D.; Xing, Z.; Xia, X. Distinguishing look-alike innocent and vulnerable code by subtle semantic representation learning and explanation. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 29 November 2023; pp. 1611–1622.
- Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 13 November 2020; pp. 1536–1547.
- Guo, D.; Lu, S.; Duan, N.; Wang, Y.; Zhou, M.; Yin, J. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 9 May 2022; pp. 7212–7225.
- Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-training code representations with data flow. arXiv 2020, arXiv:2009.08366.
- Wang, Y.; Le, H.; Gotmare, A.; Bui, N.; Li, J.; Hoi, S. CodeT5+: Open code large language models for code understanding and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 11 December 2023; pp. 1069–1088.
- DeepSeek-AI. DeepSeek-V3 Technical Report. 2024. Available online: https://arxiv.org/pdf/2412.19437 (accessed on 4 April 2025).
- Yang, A.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Huang, H.; Jiang, J.; Tu, J.; Zhang, J.; Zhou, J.; et al. Qwen2.5-1M Technical Report. arXiv 2025, arXiv:2501.15383.
- Bui, Q.C.; Scandariato, R.; Ferreyra, N.E.D. Vul4J: A dataset of reproducible Java vulnerabilities geared towards the study of program repair techniques. In Proceedings of the 19th International Conference on Mining Software Repositories, Pittsburgh, PA, USA, 23–24 May 2022; pp. 464–468.
- Wartschinski, L.; Noller, Y.; Vogel, T.; Kehrer, T.; Grunske, L. VUDENC: Vulnerability detection with deep learning on a natural codebase for Python. Inf. Softw. Technol. 2022, 144, 106809.
- Wang, X.; Hu, R.; Gao, C.; Wen, X.C.; Chen, Y.; Liao, Q. ReposVul: A repository-level high-quality vulnerability dataset. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 14–20 April 2024; pp. 472–483.
Figure 2.
Overview of RASM-Vul. The workflow is composed of three interconnected modules: (1) Knowledge Base Construction: Extracts structured vulnerability knowledge and generates code variants to build multi-view vector indices (code, syntax, knowledge, line/AST changes); (2) Multi-stage Retrieval: The WRRF algorithm is utilized to dynamically fuse evidence from different channels based on the predicted problem type; (3) Knowledge-Enhanced Detection: The LLM re-ranks candidates and performs final reasoning using the retrieved historical evidence to distinguish vulnerabilities.
Figure 3.
The process of generating code variants. The framework augments the knowledge base through a closed-loop “generate–evaluate–reflect” cycle. Semantic mutation applies 16 predefined rules (e.g., loop transformation) to diversify code structure while preserving program logic, and knowledge enhancement uses LLMs to attach security context, producing enriched versions of the original samples. This pipeline ensures that the retrieval system has access to a wide range of vulnerability patterns.
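The closed loop in Figure 3 can be sketched as a retry-with-feedback routine. In this hypothetical sketch, `generate`, `evaluate`, and `reflect` are stand-ins for the LLM-backed steps; the paper's actual prompts and acceptance criteria are not reproduced here.

```python
def build_variants(sample, generate, evaluate, reflect, max_rounds=3):
    """Closed-loop 'generate-evaluate-reflect' sketch (cf. Figure 3).

    A candidate variant is generated, checked by the evaluator, and,
    on failure, the evaluator's feedback is refined by the reflector
    and fed back into the next generation round. Only variants that
    pass evaluation enter the knowledge base; None means every round
    failed.
    """
    feedback = None
    for _ in range(max_rounds):
        variant = generate(sample, feedback)      # propose a variant
        ok, feedback = evaluate(variant)          # accept or explain why not
        if ok:
            return variant
        feedback = reflect(variant, feedback)     # turn critique into guidance
    return None
```

The key design point is that rejected variants are not discarded silently: the evaluator's critique is routed back into generation, which is what makes the loop "closed".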
Figure 4.
Statement-level feature correlation. This heatmap reveals strong correlations between control flow statements (e.g., if, function_calls) and pointer operations, suggesting that statement-level vulnerabilities often manifest in these interactions.
Figure 5.
Structure-level feature correlation. The correlations between structural definitions, pointer access, and cyclomatic complexity highlight how structural vulnerabilities are deeply tied to data structure manipulation and code complexity.
Figure 6.
Comparison of the distributions of vulnerability detection results for different retrieval methods. The scatter plots distinguish correct predictions (green) from incorrect ones (red); the red dashed line marks the 0.5 confidence threshold. Comparing single-channel retrieval (left) with WRRF fusion (right), WRRF produces the clearest decision boundary, with true positives concentrated in the high-confidence region (0.8–1.0). This indicates that adaptive multi-view fusion resolves ambiguous cases and makes detection more reliable.
Figure 8.
Sensitivity analysis of WRRF weights. The curve shows that retrieval performance remains stable across a broad range of code-channel weights. The gray area marks the stable region, and the chosen weight (dashed line) lies within this optimal range, indicating that the system is robust to small changes in this hyperparameter.
Table 1.
Semantics-preserving mutation rules.
| Rule ID | Operation Type | Preserved Semantics | Purpose |
|---|---|---|---|
| 1 | Rename variable | Logic, flow | Increase lexical diversity |
| 2 | for ↔ while | Loop behavior | Test loop syntax robustness |
| 3 | x++ ↔ x+1 ↔ x+=1 | Increment result | Evaluate operator sensitivity |
| 4 | Merge declarations | Type, scope | Assess code organization impact |
| 5 | Split declarations | Variable behavior | Test declaration granularity |
| 6 | Swap if-else blocks | Branch logic | Test resistance to reordering |
| 7 | if ↔ ternary expr | Condition result | Evaluate syntax flexibility |
| 8 | ternary ↔ if | Logic, output | Same as above |
| 9 | Use temp variables | Expression result | Test decomposition robustness |
| 10 | Split conditions | Boolean logic | Assess nesting sensitivity |
| 11 | Reorder statements | Behavior, state | Evaluate order robustness |
| 12 | if-continue ↔ if-else | Loop control | Test loop structure handling |
| 13 | Swap assignments | Semantics | Evaluate order sensitivity |
| 14 | Swap string operands | Comparison result | Test operand order robustness |
| 15 | Decompose ++/-- | Final value | Evaluate unary op handling |
| 16 | switch ↔ if-else | Branch logic | Test multi-branch detection |
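Rule 2 in Table 1 (for ↔ while) can be illustrated at the text level. The helper below is a hypothetical sketch, not the paper's implementation — an actual mutation engine would rewrite the parse tree — and it records the one semantic caveat: a `continue` in the loop body would skip the step expression in the rewritten form.

```python
def for_to_while(init: str, cond: str, step: str, body: str) -> str:
    """Rule 2 (for <-> while): emit a 'while' form equivalent to
    'for (init; cond; step) { body }'. Loop behavior is preserved
    provided 'body' contains no 'continue', which would bypass the
    step expression moved to the end of the loop."""
    return f"{init};\nwhile ({cond}) {{\n{body}\n    {step};\n}}"

# Rewrites a simple C counting loop into its while-loop equivalent.
print(for_to_while("int i = 0", "i < n", "i++", "    sum += a[i];"))
```

Mutations of this kind keep the program's logic fixed while varying its surface syntax, which is exactly what lets the retrieval index cover lexically diverse instances of the same vulnerability pattern.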
Table 2.
PrimeVul dataset.
| Dataset | Vul | Fixed | Total | Vul:Fix |
|---|---|---|---|---|
| PrimeVul Paired | 4704 | 4704 | 9408 | 1:1 |
| Train | 3789 | 3789 | 7578 | 1:1 |
| Valid | 480 | 480 | 960 | 1:1 |
| Test | 435 | 435 | 870 | 1:1 |
Table 3.
Performance comparison of baseline and proposed methods.
| Method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| *Static Analysis* | | | | |
| CppCheck | 0.5000 | 0.0000 | 0.0000 | 0.0000 |
| FlawFinder | 0.5023 | 0.5040 | 0.2897 | 0.3679 |
| Semgrep | 0.4874 | 0.4881 | 0.5195 | 0.5033 |
| *Deep Learning-based Detection* | | | | |
| LineVul | 0.5172 | 0.5109 | 0.8069 | 0.6257 |
| SVulD | 0.5230 | 0.5137 | 0.8621 | 0.6438 |
| *BERT-based Models* | | | | |
| UniXcoder | 0.5368 | 0.5419 | 0.4759 | 0.5067 |
| CodeBERT | 0.5230 | 0.5298 | 0.4092 | 0.4617 |
| GraphCodeBERT | 0.5138 | 0.5181 | 0.4276 | 0.4685 |
| *Large Language Models (Direct)* | | | | |
| CodeT5+-770M | 0.5345 | 0.5234 | 0.7724 | 0.6240 |
| DeepSeek-V3 | 0.5058 | 0.5053 | 0.4375 | 0.4690 |
| Qwen2.5-72B | 0.5322 | 0.5275 | 0.6184 | 0.5693 |
| *LLM-based Methods* | | | | |
| SVA-ICL | 0.5011 | 0.5008 | 0.6989 | 0.5835 |
| Vul-RAG | 0.5449 | 0.5500 | 0.4639 | 0.5033 |
| *Proposed Method* | | | | |
| RASM-Vul (DeepSeek-V3) | 0.6011 | 0.5721 | 0.8023 | 0.6679 |
| RASM-Vul (Qwen2.5-72B) | 0.5276 | 0.5179 | 0.8736 | 0.6496 |
Table 4.
RQ-2: Ablation study of RASM-Vul components (bold indicates the best performance within each LLM).
| Component Configuration | Precision | Recall | F1-Score |
|---|---|---|---|
| *DeepSeek-V3 Ablation Study* | | | |
| DeepSeek-V3 (Baseline) | 0.5053 | 0.4375 | 0.4690 |
| + Statement Only | 0.5240 | 0.7034 | 0.6006 |
| + Syntactic-pattern Only | 0.5149 | 0.6368 | 0.5694 |
| + Statement + Line Diff | 0.5239 | 0.6299 | 0.5720 |
| + Syntactic-pattern + AST Diff | 0.5184 | 0.4851 | 0.5012 |
| + Statement + Line Diff + Knowledge | 0.5197 | 0.8207 | 0.6304 |
| + Syntactic-pattern + AST Diff + Knowledge | 0.5038 | 0.9057 | 0.6495 |
| + Full RASM-Vul | 0.5721 | 0.8023 | 0.6679 |
| *Qwen2.5-72B Ablation Study* | | | |
| Qwen2.5-72B (Baseline) | 0.5275 | 0.6184 | 0.5693 |
| + Statement Only | 0.5054 | 0.8687 | 0.6390 |
| + Syntactic-pattern Only | 0.5071 | 0.7402 | 0.6019 |
| + Statement + Line Diff | 0.5169 | 0.7724 | 0.6194 |
| + Syntactic-pattern + AST Diff | 0.5091 | 0.6414 | 0.5677 |
| + Statement + Line Diff + Knowledge | 0.5133 | 0.8456 | 0.6388 |
| + Syntactic-pattern + AST Diff + Knowledge | 0.5122 | 0.7724 | 0.6159 |
| + Full RASM-Vul | 0.5170 | 0.8736 | 0.6496 |
Table 5.
Performance comparison of different retrieval algorithms.
| Retrieval Algorithm | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Single Channel (Code Only) | 0.5423 | 0.5387 | 0.6521 | 0.5901 |
| Traditional Weighted Fusion | 0.5689 | 0.5542 | 0.7423 | 0.6356 |
| RRF Fusion | 0.5786 | 0.5618 | 0.7689 | 0.6487 |
| WRRF Fusion (Dynamic Weights) | 0.6011 | 0.5721 | 0.8023 | 0.6679 |
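The fusion variants compared above belong to the reciprocal-rank-fusion family, in which each retrieval channel contributes a score of 1/(k + rank) per candidate. A minimal sketch follows, assuming the conventional smoothing constant k = 60; the per-channel weights are illustrative stand-ins for the weights RASM-Vul predicts from the query type, and plain RRF is recovered when all weights equal 1.

```python
def wrrf(rankings: dict[str, list[str]],
         weights: dict[str, float],
         k: int = 60) -> list[str]:
    """Weighted Reciprocal Rank Fusion (sketch).

    'rankings' maps each retrieval channel (e.g., code, syntax,
    knowledge, diff) to its ranked list of candidate IDs; 'weights'
    holds per-channel weights (default 1.0, giving plain RRF).
    Each candidate accumulates w / (k + rank) per channel, and the
    fused list is sorted by total score, highest first.
    """
    scores: dict[str, float] = {}
    for channel, ranked in rankings.items():
        w = weights.get(channel, 1.0)
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two channels disagree; the weighted fusion favors the code channel.
fused = wrrf({"code": ["A", "B", "C"], "syntax": ["B", "C", "A"]},
             {"code": 0.7, "syntax": 0.3})
```

Because 1/(k + rank) decays smoothly, a candidate ranked highly by a heavily weighted channel can outscore one ranked first by a lightly weighted channel, which is the behavior the dynamic weights exploit.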
Table 6.
Paired detection accuracy comparison.
| Method | Paired Accuracy | Correct Pairs/Total |
|---|---|---|
| FlawFinder | 0.92% | 4/435 |
| LineVul | 5.06% | 22/435 |
| SVulD | 5.52% | 24/435 |
| SVA-ICL | 7.59% | 33/435 |
| Vul-RAG | 14.25% | 62/435 |
| DeepSeek-V3 | 15.40% | 67/435 |
| Qwen2.5-72B | 11.26% | 49/435 |
| RASM-Vul (DeepSeek-V3, Multi-Channel) | 14.94% | 65/435 |
| RASM-Vul (DeepSeek-V3, RRF) | 15.86% | 69/435 |
| RASM-Vul (DeepSeek-V3, WRRF) | 21.38% | 93/435 |
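Paired accuracy in Table 6 credits a method only when both members of a (vulnerable, patched) pair are classified correctly, which is why it is much stricter than per-function accuracy. A minimal sketch, assuming predictions are encoded as 1 = vulnerable and 0 = non-vulnerable:

```python
def paired_accuracy(preds: list[tuple[int, int]]) -> float:
    """Paired accuracy over (vulnerable, fixed) function pairs, as in
    PrimeVul's paired split: a pair counts as correct only if the
    model labels the vulnerable version 1 AND its patched counterpart
    0. 'preds' holds one (pred_on_vul, pred_on_fixed) tuple per pair.
    """
    correct = sum(1 for vul, fixed in preds if vul == 1 and fixed == 0)
    return correct / len(preds)

# 93 correct pairs out of 435 gives the 21.38% reported for the
# WRRF configuration in Table 6.
rate = paired_accuracy([(1, 0)] * 93 + [(1, 1)] * 342)
```

A classifier that labels everything vulnerable scores high recall but zero paired accuracy, so this metric directly measures the ability to separate a vulnerability from its near-identical fix.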
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.