Article

Structure-Enhanced Prompt Learning for Graph-Based Code Vulnerability Detection

1 School of Cyberspace Security, Hainan University, Haikou 570228, China
2 School of Computer Science and Technology, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6128; https://doi.org/10.3390/app15116128
Submission received: 24 April 2025 / Revised: 24 May 2025 / Accepted: 26 May 2025 / Published: 29 May 2025

Abstract

Recent advances in prompt learning have opened new avenues for enhancing natural language understanding in domain-specific tasks, including code vulnerability detection. Motivated by the limitations of conventional binary classification methods in capturing complex code semantics, we propose a novel framework that integrates a two-stage prompt optimization mechanism with hierarchical representation learning. Our approach leverages graphon theory to generate task-adaptive, structurally enriched prompts by encoding both contextual and graphical information into trainable vector representations. To further enhance representational capacity, we incorporate the pretrained model CodeBERTScore, a syntax-aware encoder, and Graph Neural Networks, enabling comprehensive modeling of both local syntactic features and global structural dependencies. Experimental results on three public datasets—FFmpeg+Qemu, SVulD and Reveal—demonstrate that our method performs competitively across all benchmarks, achieving accuracy rates of 64.40%, 83.44% and 90.69%, respectively. These results underscore the effectiveness of combining prompt-based learning with graph-based structural modeling, offering a more accurate and robust solution for automated vulnerability detection.

1. Introduction

The number of vulnerabilities in the Common Vulnerabilities and Exposures (CVE) database [1] continues to grow, despite improvements in software security. This trend increases the risk of cyberattacks and leads to serious economic and social consequences. To reduce security risks and maintain stability, efficient vulnerability detection methods are essential in today’s interconnected digital world.
Recent advancements in deep learning have significantly enhanced the detection of code vulnerabilities. Early methods treated source code as textual data, which limited their ability to capture structural and semantic dependencies. Later models, such as those based on pretrained language models and Graph Neural Networks, improved representation learning by incorporating both textual and structural information. These models further evolved by integrating techniques such as code property graphs and residual connections, enhancing their ability to handle vulnerability detection more effectively. More recently, prompt learning has emerged as an effective fine-tuning strategy for pretrained models, advancing the state of the art. Despite these advances, existing approaches still struggle to leverage task-specific knowledge effectively, especially in graph-structured code, where alignment remains challenging.
Although task-specific knowledge integration via prompts has proven effective in the NLP and CV domains [2,3], particularly through knowledge-aware designs [4,5], its application to graph-structured code representations remains challenging. In theory, graph-level prompts could enhance vulnerability detection by encoding structural task knowledge. However, the highly abstract nature of graph data complicates explicit alignment between prompt structures and code semantics. This gap highlights the need for systematic approaches to evaluate and optimize graph-aware prompt adaptability in code vulnerability analysis.
Furthermore, Graph Neural Networks (GNNs) suffer from long-term dependency issues [6], particularly in large code property graphs (CPGs), where distant nodes struggle to exchange information effectively. As the node count increases, this limitation becomes more pronounced. Recent studies [7] indicate that the performance of CPG-based vulnerability detection methods degrades dramatically under these conditions. Figure 1 illustrates a typical Use-After-Free (UAF) vulnerability, where the variable VAR8 is prematurely released at line 6 and subsequently reused at line 7. Detecting this vulnerability requires propagating taint information across eight-hop neighbors in the CPG (highlighted in Figure 2), which exceeds the effective receptive field of conventional GNNs. Although prior studies [8,9] employ coarse-grained program dependence graphs (PDGs) to extract vulnerability features, such statement-level embeddings often fail to capture fine-grained local semantics, which are crucial for identifying subtle vulnerabilities.
To address these challenges, we design prompt optimization for vulnerability detection as a two-stage process: (1) generating graph-level prompts enriched with task-specific knowledge derived from downstream task samples, and (2) making predictions by integrating these prompts with text semantic labels. In the prompt generation phase, we introduce graphon theory [10] to systematically model structural knowledge.
To further improve vulnerability detection, we propose a hierarchical representation learning strategy. This approach synergizes Convolutional Neural Networks (CNNs) with graph-based neural networks to perform syntax- and semantic-aware embeddings at multiple granularities. Specifically, a TextCNN [11] analyzes local code token sequences to capture fine-grained syntactic patterns, while GNNs model global structural dependencies in the CPG. This dual-path architecture enables the extraction of multi-scale vulnerability features, bridging the gap between coarse-grained program semantics and fine-grained code-level vulnerabilities.
To evaluate the effectiveness of the proposed method, we conducted comprehensive experiments on three widely used publicly available datasets: FFmpeg+Qemu [12], SVulD [13] and Reveal [14]. Experimental results demonstrate that our method consistently outperforms state-of-the-art models, confirming the effectiveness of both the structure-enhanced prompt design and the syntax-aware module. Specifically, our approach achieved an accuracy of 64.40% on the FFmpeg+Qemu dataset, representing a notable improvement over existing methods. On the SVulD and Reveal datasets, our method achieved accuracies of 83.44% and 90.69%, respectively, demonstrating its robustness and generalizability across diverse codebases.
In summary, this paper makes the following contributions: First, we propose a novel structure-enhanced prompt learning framework that integrates prompt learning with graph-based code representations to improve the accuracy and robustness of vulnerability detection. Second, we design a graphon-based prompt generation mechanism, which estimates class-specific structural patterns from aligned CPGs to construct informative, task-relevant graph prompts. Third, we develop a hierarchical representation learning strategy that combines a TextCNN for capturing fine-grained syntactic features and a GCN for modeling global code structure, enabling multi-level semantic understanding. Fourth, we conduct experiments on three public datasets, demonstrating that our method consistently surpasses SOTA baselines in both performance and generalization.
The rest of this paper is organized as follows: Section 2 reviews related research efforts in the area of software vulnerability detection. Section 3 presents our methodology. Section 4 evaluates our proposal with extensive experiments. Section 5 concludes this work and highlights some future research directions.

2. Related Work

2.1. Learning-Based Vulnerability Detection

In recent years, the application of deep learning techniques has significantly advanced the field of vulnerability detection. Pioneering research by [15] utilized Convolutional Neural Networks (CNNs) to interpret code as if it were a natural language, subsequently employing Random Forest classifiers for the actual detection task.
Vuldeepecker [16] utilized BiLSTM models alongside code gadgets to identify vulnerabilities associated with library or API calls. This approach was later refined by SySeVR [17], which incorporated data and control dependencies for a more comprehensive detection framework. Furthermore, CEVulDet [18] applied centrality analysis to simplify program dependency graphs, enhancing the efficiency of CNN-based vulnerability detection.
Graph-based methods have become prominent in vulnerability detection. Devign utilizes Graph Neural Networks (GNNs) to model code dependencies effectively. In contrast, DeepWukong [8] analyzes the dependency graphs of the program by dividing them into subgraphs for GNN processing. To enhance detection accuracy, ReGVD incorporates residual connections within GNN layers. CPVD [19] represents code through code property graphs, employing Graph Attention Networks and Convolution Pooling Networks to extract feature vectors. Similarly, MAGNET [20] constructs a multi-granularity meta-path graph to capture the structural information of code snippets. GCL4SVD [21] combines Code Graph Embedding to identify vulnerability patterns with Graph Confident Learning Denoising to reduce noise and improve accuracy. Furthermore, TrVD [22] and SCALE [23] apply tree structures to enhance detection performance.
Other methods include VulCNN [24], which transforms code into image-like representations, and CSGVD, which uses pretrained language models for feature initialization.
Previous methods typically simplified vulnerability detection into binary classification tasks, reducing labels to a 0/1 format. This simplification hindered the model’s ability to capture rich, high-dimensional features and led to the loss of critical spatial and structural details. In contrast, our approach introduces prompts with task-specific structural information, which enhances the model’s navigation within high-dimensional feature space. This allows the model to preserve complex semantic and spatial attributes, resulting in a more detailed and accurate vulnerability classification.

2.2. Pretrained Models for Source Code Analysis

Inspired by the success of NLP pretraining, specialized models such as C-BERT [25], CodeBERT [26] and GraphCodeBERT [27] have been developed for programming languages, excelling in tasks like defect detection and code completion. Moreover, CodeBERTScore [28] refines token similarity scoring, surpassing models like RoBERTa [29] and CodeBERT. Derived from GPT-3, CodeX [30] demonstrates robust performance in code translation and refactoring.
Recognizing that code-based pretrained models often require task-specific fine-tuning to excel in applications such as vulnerability detection, our approach addresses this challenge by seamlessly integrating code encoding modules. These modules leverage the rich knowledge embedded in pretrained models, progressively combining the semantic perception module with graph-structure modeling to strengthen the model's capability. To further optimize the fine-tuning process, we introduce textual prompts as supervisory signals, thereby improving model performance in downstream vulnerability detection tasks.

2.3. Prompt Learning

Prompt learning has emerged as a solution to overcome the limitations of early unsupervised NLP tasks. By incorporating explicit prompts, it improves both accuracy and manageability [31]. Various techniques have been developed to generate effective prompts, including text mining, gradient-based approaches [32,33], and continuous prompt learning. The latter optimizes word vectors, providing precise control [34,35,36].
Recent surveys by Liu et al. [37] and Wang et al. [38] underline the significant impact of prompt learning in NLP tasks and code intelligence. Additionally, Zhang et al. [39] took this a step further by incorporating structural information into prompts, which enhances their effectiveness for vulnerability detection.
In the context of image–text pairs, CLIP [3] and its extension CoOp [40] have shown notable improvements in performance, achieved through the use of learnable prompts. These advancements underscore the growing importance of prompt-based methods in various domains.
Prompt learning has made significant strides in the fields of natural language processing and computer vision. However, its application in software engineering remains in its early stages. Inspired by this paradigm, we developed task-specific text prompts, enriched with structural information, to enhance the detection of vulnerabilities in source code, leading to a series of novel contributions in this domain.

3. Methodology

The goal of our research is to tackle the task of function-level source code vulnerability detection, considering it as a binary classification problem. Specifically, our objective is to determine whether a given function in the source code is vulnerable or not. In this section, we begin by presenting a comprehensive overview of our framework. We delineate the key components and their interconnections, highlighting the novel aspects of our approach. Then, we provide a step-by-step demonstration of how our framework can be utilized for software vulnerability detection.

3.1. Solution Overview

As depicted in Figure 3, our methodology is architecturally divided into three critical phases: code feature construction, structure-enhanced prompt and vulnerability detection. 
In the code feature construction phase, we first normalize the source code of the target function. Using advanced static analysis techniques, we construct the CPGs to encapsulate the behavior of the function. Subsequently, we perform a preorder traversal of the CPG, merging leaf nodes with Abstract Syntax Tree (AST) attributes into syntactically meaningful subwords. These subwords are then embedded using a pretrained language model. To extract both syntactic and semantic information between subwords, we employ a TextCNN as the core component of the statement encoder. Additionally, we utilize Graph Convolutional Networks (GCNs) to analyze the structural relationships in the code and capture dependencies between nodes in the property graph. This synergistic combination enables multi-scale vulnerability feature extraction, effectively preserving both fine-grained textual nuances and coarse-grained structural characteristics inherent in source code.
During the structure-enhanced prompt phase, we introduce a novel graph prompt mechanism derived from text prompts. This mechanism integrates generalized code information and the structures related to the task into the prompts. To achieve this, we leverage graph theory to systematically incorporate task-related knowledge into the prompt generation process. A graphon is a continuous, measurable function $W : [0,1] \times [0,1] \to [0,1]$, defined on the unit square, which describes the probability of an edge existing between any two vertices in a graph G with an infinite number of nodes. Here, $W(x, y)$ denotes the probability of an edge between the two vertices corresponding to coordinates x and y in G. The function satisfies non-negativity, $0 \le W(x, y) \le 1$, and symmetry, $W(x, y) = W(y, x)$, indicating that the probability of an edge between x and y is identical to that between y and x. Graphons effectively model the complexity and randomness of real-world networks, facilitating a deeper understanding of graph formation processes. This insight guides the generation of graph structures, allowing us to extract knowledge from graph data in downstream tasks by estimating graphons. We then utilize these estimated graphons to generate graph-level prompts enriched with task-related knowledge. Furthermore, we incorporate the corresponding text prompts to refine and optimize the generated graph prompts, enhancing their adaptability to downstream tasks.
In the final stage, we align the embedding spaces of the code and the prompts to enhance vulnerability detection. We employ a joint loss function, combining Triplet Loss and Cross-Entropy Loss, to optimize model parameters. During training, we compute the cosine similarity between paired embeddings and minimize the prediction error using Cross-Entropy loss based on class labels. Simultaneously, we construct triplets consisting of an anchor (a code sample), a positive sample (a prompt representing a vulnerability) and a negative sample (a prompt corresponding to non-vulnerability). The training objective is to reduce the latent space distance between the anchor and the positive sample while increasing the distance between the anchor and the negative sample. By leveraging a joint learning paradigm, the model effectively captures discriminative relationships between code semantics and vulnerability patterns, enabling accurate vulnerability classification through cross-modal feature alignment.
In the sections that follow, we provide an in-depth examination of the essential components that constitute our methodology.

3.2. Code Feature Construction

3.2.1. Normalization

We initiate the process by removing non-ASCII characters and comments from each function’s source code using regular expressions, thereby focusing solely on relevant code elements. Subsequently, we map both user-defined function and variable names to standardized symbolic names like “FUN1” and “VAR1” to anonymize and generalize the code for uniform analysis. However, standard API names, keywords and punctuation marks are retained to maintain their semantic and syntactic importance. These steps collectively ensure a clean, consistent and standardized code representation, as illustrated in Figure 4.
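To make the normalization step concrete, the following is a minimal Python sketch of this pipeline; the regular expressions, the keyword whitelist and the function-versus-variable heuristic are our illustrative assumptions rather than the paper's exact implementation.

```python
import re

# Illustrative subset; the paper retains all standard keywords and API names.
C_KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return",
              "sizeof", "struct", "malloc", "free", "memcpy", "printf"}

def normalize_function(src: str) -> str:
    src = re.sub(r"/\*.*?\*/", "", src, flags=re.DOTALL)  # strip block comments
    src = re.sub(r"//[^\n]*", "", src)                    # strip line comments
    src = src.encode("ascii", "ignore").decode()          # drop non-ASCII characters

    fun_map, var_map = {}, {}
    def rename(m: re.Match) -> str:
        name = m.group(0)
        if name in C_KEYWORDS:                            # keep keywords and standard APIs
            return name
        is_call = src[m.end():].lstrip().startswith("(")  # crude function-name heuristic
        table, prefix = (fun_map, "FUN") if is_call else (var_map, "VAR")
        if name not in table:
            table[name] = f"{prefix}{len(table) + 1}"     # FUN1, VAR1, ...
        return table[name]

    return re.sub(r"\b[A-Za-z_]\w*\b", rename, src)
```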

3.2.2. Extracting the Code Property Graph

To extract the required CPG, we utilize Joern [41], a tool for parsing source code and generating graph representations. The CPG unifies Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and Program Dependence Graphs (PDGs) into a cohesive structure, allowing for in-depth vulnerability analysis.
As shown in Figure 5, a property graph is a directed, edge-labeled, attributed multi-graph where nodes represent program constructs (e.g., functions, variables, statements) with type-specific attributes. The directed edges define the relationships between these constructs, capturing the structural and semantic properties. In our research, we utilize a subset of Joern’s CPG, focusing on nodes and undirected edges to extract key structural and semantic features.
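A hedged sketch of driving Joern from Python is shown below; `joern-parse` and `joern-export` are Joern's standard CLI entry points, but the exact flag names vary across Joern releases, so treat them as assumptions to verify against the installed version.

```python
import subprocess

def extract_cpg(source_dir: str, cpg_path: str = "cpg.bin", out_dir: str = "cpg_out") -> None:
    # Parse the source tree into a binary CPG, then export it for downstream processing.
    subprocess.run(["joern-parse", source_dir, "--output", cpg_path], check=True)
    subprocess.run(["joern-export", cpg_path, "--repr", "cpg14",
                    "--format", "dot", "--out", out_dir], check=True)
```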

3.2.3. Syntax-Aware Encoder

Existing DL-based vulnerability detectors [12,14,42] use GNNs to learn program semantics from diverse code representations (e.g., AST, CFG and PDG) for model training. However, the neighborhood aggregation mechanism used in GNNs is not well suited for modeling tree structures, making it difficult to effectively capture the hierarchical syntactic relationships between parent and child nodes.
To this end, we adopt a hierarchical representation learning strategy that generates syntax-aware neural embeddings for each node containing AST attributes in CPGs. By fusing AST attribute nodes, we effectively capture rich unstructured semantic information and significantly reduce the overall number of nodes in the graph (see Section 4). The extracted unstructured information is subsequently combined with structured relations (i.e., control and data flows between statements), and the integrated representation is fed into a GNN model to learn a holistic representation of the code.
Specifically, as shown in Figure 6, we first merge the leaf nodes with the AST attributes in the CPG into meaningful subwords using preorder traversal. Then, the embeddings of these subwords serve as the initial feature representations for training the statement encoder. We employ a TextCNN as the core component of the statement encoder to capture both syntactic and semantic relationships between subwords.
Given a set of mergeable AST nodes $T$ within a function, where each subtree $t \in T$, all nodes and their feature vectors in $t$ are fed into the statement encoder. This process computes the final representation of the corresponding statement:
$$x_{t_i} = \mathrm{TextCNN}(t_i)$$
Based on our syntactic and semantic perception neural embedding strategy, the feature vector of each statement node in the CPG effectively retains both the lexical and syntactic information of the source code.
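A minimal PyTorch sketch of such a TextCNN statement encoder is given below; the kernel sizes (1, 3, 5, 7) and the 32-dimensional embeddings follow Section 4, while the filter count and the projection layer are our assumptions.

```python
import torch
import torch.nn as nn

class StatementEncoder(nn.Module):
    """TextCNN over the merged AST subword sequence of one statement."""
    def __init__(self, vocab_size: int, embed_dim: int = 32,
                 kernel_sizes=(1, 3, 5, 7), n_filters: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
        )
        self.proj = nn.Linear(n_filters * len(kernel_sizes), embed_dim)

    def forward(self, subword_ids: torch.Tensor) -> torch.Tensor:
        # subword_ids: (batch, seq_len) token ids of the merged AST subwords
        x = self.embed(subword_ids).transpose(1, 2)           # (batch, dim, seq)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.proj(torch.cat(pooled, dim=1))            # x_{t_i}: (batch, embed_dim)
```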

3.2.4. Graph Feature Encoder

After performing the syntax-aware aggregation operation in the previous step, we obtain a reconstructed code graph structure. This structure consists of nodes with syntactic information and edges incorporating DDG and CFG attributes. The node features are denoted as $X_{new} = \{x_{t_1}, x_{t_2}, \ldots, x_{t_i}\}$, while the adjacency matrix is represented by $A_{new}$.
We utilize a GCN to analyze the structural relationships within the code and capture dependencies between nodes in the property graph. The GCN model takes as input both the adjacency matrix $A_{new}$ and the encoded node feature matrix $X_{new}$, enabling a comprehensive analysis of the graph's topology and node attributes.
$$g_i = \mathrm{GCN}(X_{new}, A_{new})$$
To aggregate information from all nodes within the graph, we apply the Global Mean Pool operation. This operation computes the average of node features across the entire graph, producing a unified vector representation that encapsulates the collective information.
$$G_i = \mathrm{GlobalMeanPool}(g_i)$$
where $G_i$ represents the output of the graph feature encoder.
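The following sketch expresses this graph feature encoder with PyTorch Geometric (our library choice, not stated in the paper); the three GCN layers and 0.2 dropout follow the configuration reported in Section 4.2.

```python
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphFeatureEncoder(nn.Module):
    """Three-layer GCN over DDG/CFG edges, followed by global mean pooling."""
    def __init__(self, in_dim: int = 32, hidden: int = 32, dropout: float = 0.2):
        super().__init__()
        self.convs = nn.ModuleList([GCNConv(in_dim, hidden),
                                    GCNConv(hidden, hidden),
                                    GCNConv(hidden, hidden)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x, edge_index, batch):
        # x: (num_nodes, in_dim) statement embeddings; edge_index: DDG/CFG edges
        for conv in self.convs:
            x = self.drop(conv(x, edge_index).relu())
        return global_mean_pool(x, batch)   # one vector G_i per graph
```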

3.3. Structure-Enhanced Prompt

In visual and language tasks [43], prompts often incorporate task-specific knowledge [4,5]. Similarly, we hypothesize that well-structured prompts encapsulating task-relevant knowledge can enhance model performance and improve cross-task generalization. To achieve this, we propose leveraging graphons to preserve structural information in code while integrating prior knowledge. Our framework consists of two key steps: (1) generating structured prompts and (2) performing prompt ensembling.
The overall process of our structure-enhanced prompt framework is illustrated in Algorithm 1, which comprises two main stages: generating structured prompts and integrating them via prompt ensembling. These components are discussed in detail in the subsequent subsections.
Algorithm 1 Structure-enhanced prompt generation.
Require: Training graph set $G_{train}$, class labels C, top-K node count K, prompt node count K
Ensure: Final structure-enhanced prompt $P_j$
1: for each class c in C do
2:    $G_c \leftarrow \{G \in G_{train} \mid \mathrm{label}(G) = c\}$
3:    Align graph structures and estimate graphon $W_c$
4:    Sample K nodes from $W_c$ to generate adjacency matrix $A_c$
5:    Extract node feature matrix $X_c$ from aligned graphs
6:    $P_c \leftarrow \mathrm{GCN}_{prompt}(X_c, A_c)$
7: end for
8: for each class i do
9:    Construct text-based prompt $P_i$ with class-specific context
10:   $P_{all} \leftarrow \mathrm{concat}(P_i, P_c)$
11:   $P_j \leftarrow \mathrm{MLP}(P_{all})$
12: end for
13: return $P_j$

3.3.1. Generation of Structured Prompt

To facilitate knowledge-driven generation of structured prompts, we first identify key structural features in graph data and encode them effectively. Since graphs differ from Euclidean data, their structure is best captured using adjacency matrices. Extracting structure-specific knowledge from class-associated graphs enables the creation of task-specific prompts.
In our approach, we utilize graphons to efficiently model task-related knowledge within graph structures. A graphon [44] is a continuous representation of large-scale networks, capturing generalized structural properties. It defines probabilistic node relationships as random functions over the unit square. Graphon estimation typically employs spectral methods, singular value decomposition or smoothing techniques on observed graph data. Formally, a graphon is a continuous, bounded and symmetric function:
$$W : [0,1]^2 \to [0,1]$$
which can be interpreted as the weight matrix of a graph with an infinite number of nodes [45]. In this work, we define a graphon as a two-dimensional matrix of the form:
$$W = [w_{kk'}] \in [0,1]^{K \times K}$$
where $w_{kk'}$ denotes an element of matrix $W$, representing the probability of an edge between node $k$ and node $k'$. When treating a class of graphs as a set, the corresponding graphon captures its generalized structural characteristics and adjacency patterns. Leveraging the generative properties of graphons, we construct new topological structures as prompts for further analysis.
Building on the previous analysis, we extract class-specific knowledge for downstream tasks by using graphons [46] as templates. These graphons, which encode generalized topological structures, form the foundation for generating class-specific subprompts at the graph level. The process involves a series of well-defined steps, as outlined below:
  • Given a dataset with two types of graphs, we divide the training set $G_{train}$ by class to obtain two graph sets $\{G_c \mid c = 0, 1\}$, representing robust and vulnerable code, respectively.
  • The degree of the nodes serves as the alignment metric for each graph set $G_c$. The alignment procedure begins by sorting the nodes in descending order of degree, then reorganizing the adjacency matrix accordingly.
  • The graphon $W_c \in \mathbb{R}^{K \times K}$ is estimated from the aligned graphs in $G_c$ using
    $$W_c = \mathrm{Estimation}(G_c)$$
    where $\mathrm{Estimation}(\cdot)$ denotes the graphon estimation operator and $G_c$ represents the aligned graph set. $K$ is the number of nodes that account for 80% of the total nodes in the graph set $G_c$. We adopt Universal Singular Value Thresholding (USVT) [47] for graphon estimation. This method stacks adjacency matrices of aligned graphs and applies Singular Value Decomposition (SVD) to extract dominant structural features. The resulting graphon $W_c$ captures the generalized structural characteristics of the graphs in $G_c$, providing a distribution that can generate topological structures.
The topology of a graph-level structured prompt, consisting of K nodes, is generated through the following sampling process:
$$v_n \sim \mathrm{Uniform}[0, 1], \quad \text{for } n = 1, \ldots, K$$
$$a_{ij} \sim \mathrm{Bernoulli}(W_c(i, j)), \quad i, j \in [K]$$
First, $K$ nodes are independently sampled from a uniform distribution over the interval $[0, 1]$. Next, an adjacency matrix $A_c = [a_{i,j}] \in \{0, 1\}^{K \times K}$ is constructed using a Bernoulli distribution parameterized by $W_c$. Since the estimated graphon captures the common structural patterns and adjacency relationships across a family of graphs, the subprompts generated via the Bernoulli distribution are expected to reflect these shared structures.
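A NumPy sketch of this estimation-and-sampling step is shown below, assuming degree-aligned adjacency matrices zero-padded to K × K; the USVT threshold constant and the padding scheme are our assumptions.

```python
import numpy as np

def estimate_graphon(adjs: list, k: int, thresh: float = 0.2) -> np.ndarray:
    # Degree-align each adjacency matrix, truncate/zero-pad to k x k, and average.
    padded = np.zeros((len(adjs), k, k))
    for i, a in enumerate(adjs):
        order = np.argsort(-a.sum(axis=0))            # sort nodes by descending degree
        a = a[np.ix_(order, order)][:k, :k]
        padded[i, :a.shape[0], :a.shape[1]] = a
    mean_adj = padded.mean(axis=0)
    u, s, vt = np.linalg.svd(mean_adj)
    s[s < thresh * np.sqrt(k)] = 0.0                  # universal singular value threshold
    return np.clip(u @ np.diag(s) @ vt, 0.0, 1.0)     # graphon estimate W_c

def sample_prompt_topology(w_c: np.ndarray, seed: int = 0) -> np.ndarray:
    # Draw each edge a_ij ~ Bernoulli(W_c(i, j)); symmetrize and drop self-loops.
    rng = np.random.default_rng(seed)
    a = (rng.random(w_c.shape) < w_c).astype(np.int8)
    upper = np.triu(a, 1)
    return upper + upper.T                            # adjacency matrix A_c
```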
In the alignment step, nodes from different graphs are matched to form corresponding node feature matrices. Given that graphs in each set $G_c$ may contain varying numbers of nodes, positions without corresponding nodes are filled with zero vectors in the aligned node feature matrices. Subsequently, an averaging technique is applied to obtain the fused node feature matrix $X_c \in \mathbb{R}^{K \times F}$ for the resulting subprompt:
$$X_c = \mathrm{Mean}(\{x_i^c \mid i = 1, \ldots, N_c\})$$
where $F$ represents the dimensionality of the node features and $\mathrm{Mean}(\cdot)$ denotes the averaging operation. The term $x_i^c$ represents the feature matrix of the $i$-th graph in the aligned graph set, while $N_c$ is the total number of graphs in $G_c$. Since the dataset contains different sample sizes for each class, $N_c$ may vary across classes.
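The fusion step can be sketched in the same style; zero-padding each aligned feature matrix to K rows before averaging is our reading of the alignment description.

```python
import numpy as np

def fuse_node_features(feats: list, k: int) -> np.ndarray:
    # feats: per-graph aligned feature matrices (n_i x F); pad rows to k, then average.
    f_dim = feats[0].shape[1]
    padded = np.zeros((len(feats), k, f_dim))
    for i, x in enumerate(feats):
        n = min(k, x.shape[0])
        padded[i, :n] = x[:n]
    return padded.mean(axis=0)   # X_c in R^{K x F}
```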
From this, the graph-level subprompt is generated. To effectively capture class-specific structural information, a GCN is applied to generate structured graph prompts for each class:
$$P_c = \mathrm{GCN}_{prompt}(X_c, A_c)$$
These prompts, termed as structured prompts, encode task-relevant topological and feature representations.

3.3.2. Prompt Ensembling

Building on structured prompts, we introduce the idea of “Optimizable Prompts”, where contextual words are represented as continuous vectors learned end-to-end from available datasets. To maintain inter-class consistency, all categories follow a unified template:
$$P_i = [V]_1 [V]_2 \ldots [V]_N [CLS] [V]_{N+1} \ldots [V]_M$$
In this expression, $[V]_N$ represents task-relevant contextual information (e.g., "The structure of a $[CLS]$ function ..."). Within each prompt $P_i$, the $[CLS]$ token is substituted by the name of the corresponding $i$-th class. While Equation (11) places $[CLS]$ in the middle, alternative positions, such as at the beginning or end, are also considered.
Moreover, we integrate graphical prompts into text prompts, incorporating them as auxiliary information. The integrated prompt formulation is
$$P_{all} = \mathrm{concat}(P_i, P_c)$$
The final structure-enhanced prompt consists of contextual content, classification-related text and graph-based structural information, enriching prompts for better interpretability and model performance. The fused features are then processed through a Multi-Layer Perceptron (MLP):
$$P_j = \mathrm{MLP}(P_{all})$$
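A PyTorch sketch of this prompt-ensembling step follows; the number of context vectors and the MLP layer widths are assumptions (Table 2 specifies only a three-layer MLP), and the mean-pooling of context tokens is our simplification of the template in Equation (11).

```python
import torch
import torch.nn as nn

class PromptEnsemble(nn.Module):
    """Fuse a learnable text prompt P_i with a structured graph prompt P_c."""
    def __init__(self, n_ctx: int = 8, dim: int = 32, n_classes: int = 2):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_classes, n_ctx, dim) * 0.02)  # [V]_1..[V]_M
        self.cls_embed = nn.Embedding(n_classes, dim)                       # [CLS] tokens
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))                       # three layers

    def forward(self, class_id: torch.Tensor, graph_prompt: torch.Tensor):
        # Text prompt P_i: pooled context tokens plus the class token embedding.
        p_i = self.ctx[class_id].mean(dim=1) + self.cls_embed(class_id)
        p_all = torch.cat([p_i, graph_prompt], dim=-1)    # P_all = concat(P_i, P_c)
        return self.mlp(p_all)                            # P_j
```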

3.4. Vulnerability Detection

A critical challenge in vulnerability detection is to reduce misclassification errors. To address this, we propose a joint loss function that combines Triplet Loss and Cross-Entropy Loss to optimize model performance. Triplet Loss utilizes a triplet structure (anchor, positive and negative), where the anchor is a randomly selected function, the positive encodes its true vulnerability type and the negative is its patched counterpart. As shown in Figure 7, Triplet Loss optimizes two objectives: (1) clustering similar vulnerabilities by maximizing the cosine similarity between the anchor and positive sample and (2) enhancing the distinction between vulnerable and patched code by minimizing the similarity between the anchor and negative sample. This dual optimization improves the model’s ability to differentiate potential vulnerabilities. Additionally, Cross-Entropy Loss refines classification by computing the cosine similarity between paired embeddings and minimizing prediction errors based on class labels.

3.4.1. Model Training

The training process, illustrated in Figure 8, takes three inputs: the target function ($f_{anchor}$), the structurally enhanced positive prompt ($f_{positive}$) and the structurally enhanced negative prompt ($f_{negative}$). These inputs are projected into a shared feature space, where their relationships are optimized using a joint loss function that combines Triplet Loss and Cross-Entropy Loss.
To learn meaningful embeddings for $f_{anchor}$, we accumulate the gradients for each input function and backpropagate them to update the network parameters. This ensures that structurally similar functions remain close in the learned space, while distinct functions are pushed apart.
To quantify the relative similarity between samples, we define a loss function that operates on an anchor ($a$), a positive example ($p$) and a negative example ($n$), with a predefined margin $m = 0.7$. This criterion ensures that positive pairs are closer than negative pairs in the embedding space. The Triplet Loss component is formulated as
$$L_{triplet}(a, p, n) = \max(0, \, \mathrm{sim}(a_i, n_i) - \mathrm{sim}(a_i, p_i) + m)$$
where the cosine similarity $\mathrm{sim}(x_i, y_j)$ is given by
$$\mathrm{sim}(x_i, y_j) = \frac{x_i \cdot y_j}{\|x_i\| \cdot \|y_j\|}$$
To enhance classification accuracy, Cross-Entropy Loss is also applied, ensuring that embeddings align with the correct class labels. The final joint loss function is defined as
$$L_{joint} = \lambda L_{triplet} + (1 - \lambda) L_{CE}$$
By setting λ = 0.2 , we prioritize classification accuracy while still leveraging metric learning to refine the embedding space. This joint optimization effectively captures structural similarities, reduces misclassification errors and improves vulnerability detection performance.
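The joint objective of Equations (14)-(16) can be written compactly as follows; the sketch assumes batched embedding tensors and uses the reported margin m = 0.7 and weight λ = 0.2.

```python
import torch
import torch.nn.functional as F

def joint_loss(anchor, positive, negative, logits, labels,
               margin: float = 0.7, lam: float = 0.2) -> torch.Tensor:
    # Cosine-based triplet term, Equation (14): push sim(a, p) above sim(a, n) by margin.
    sim_pos = F.cosine_similarity(anchor, positive)
    sim_neg = F.cosine_similarity(anchor, negative)
    l_triplet = torch.clamp(sim_neg - sim_pos + margin, min=0).mean()
    # Cross-entropy term on the class predictions.
    l_ce = F.cross_entropy(logits, labels)
    # Equation (16): weighted combination with lambda = 0.2.
    return lam * l_triplet + (1 - lam) * l_ce
```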

3.4.2. Detecting Vulnerability

In the vulnerability detection phase, the model leverages the embeddings generated during training to identify potential vulnerabilities within the input code. The prediction probability is computed as
$$P(y = i \mid x) = \frac{\exp(\cos(F_i, P_j) \cdot \tau)}{\sum_{k=1}^{K} \exp(\cos(F_i, P_k) \cdot \tau)}$$
In this equation, $\tau$ denotes a gradient scaling parameter, initialized to $\lg(1/0.07)$.
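A minimal sketch of Equation (17) is given below; whether τ multiplies the cosine similarity directly or through an exponential reparameterization (as in CLIP) is not specified in the text, so the direct form is assumed here.

```python
import torch
import torch.nn.functional as F

def predict_proba(code_emb: torch.Tensor, prompts: torch.Tensor, tau: float) -> torch.Tensor:
    # code_emb: (batch, dim) code features F_i; prompts: (n_classes, dim) prompts P_k.
    sims = F.cosine_similarity(code_emb.unsqueeze(1), prompts.unsqueeze(0), dim=-1)
    return torch.softmax(sims * tau, dim=-1)   # row i gives P(y = k | x_i) over classes
```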
By integrating structure-enhanced prompts, our approach enriches the embedded space, enabling the model to capture task-specific features more effectively. The framework leverages contrastive learning to automatically discover vulnerability signatures, using structured prompts to guide the model in extracting discriminative features. This improves the precision of vulnerability detection while reducing false positives and false negatives.

3.5. Computational Complexity and Scalability Analysis

Our framework comprises two core phases, structured prompt generation and model training, both of which incur modest computational costs that scale gracefully with problem size. During the prompt generation phase, aligning node degrees across $M$ graphs of size up to $N$ requires $O(MN \log N)$ time. Subsequently, estimating a compact $K \times K$ graphon using USVT has a complexity of $O(MK^2 + K^3)$. The following sampling and feature fusion steps operate in $O(K^2 + KF)$ time, where $F$ denotes the dimensionality of node features. Model training involves a total of 352,673 parameters across the TextCNN, GCN and MLP modules. Scalability is further enhanced by offline graphon estimation, top-K node pruning (covering 80% of graph mass) and standard GPU-accelerated batching. Together, these strategies enable efficient handling of tens of thousands of code samples while maintaining stable memory usage (under 16 GB) and consistent runtime performance.

4. Experiments

In this section, we first conduct extensive experiments using three widely used datasets to answer the following Research Questions (RQs):
  • RQ1: How does our method perform with varying model parameters?
  • RQ2: Does the introduction of the syntax-aware embedding module provide better detection capability and stability?
  • RQ3: Does introducing code structure information in text prompts have better detection ability and stability?
  • RQ4: How does our method perform compared to state-of-the-art vulnerability detection methods?
Subsequently, we proceed by visualizing the prompts to delve deeper into the efficacy of our approach. Finally, we conduct an analysis of potential threats to its effectiveness.

4.1. Datasets

In our experiments, we evaluate the model on three public vulnerability detection benchmarks: FFmpeg+Qemu [12], SVulD [13] and Reveal [14]. The FFmpeg+Qemu dataset originates from two widely used open-source projects and comprises 22k code snippets, of which 10k have been identified as vulnerable. SVulD, based on Fan et al. [48], contains both before-fixed and after-fixed code in the training set; on this dataset, the model is required to identify before-fixed code as vulnerable and after-fixed code as non-vulnerable simultaneously. Reveal comprises over 18k code snippets, about 2k of which exhibit known vulnerabilities. Table 1 summarizes the statistics of the datasets.
Following established practices in vulnerability detection research [12,13,49,50], we use an 8:1:1 split for training, validation and test sets. The training set is used to learn the model parameters, the validation set for hyperparameter tuning and model selection and the test set exclusively for final performance evaluation.

4.2. Experimental Setup

In the code encoder, we integrate a TextCNN with convolutional kernels of sizes 1, 3, 5 and 7, along with a three-layer GCN, applying a dropout rate of 0.2 to mitigate overfitting. In the prompt module, a single-layer GCN is combined with a three-layer MLP to effectively capture graphical prompts and contextual information. To ensure smooth convergence, we employ the AdamW optimizer with a CosineAnnealingLR scheduler, gradually reducing the learning rate from $1 \times 10^{-2}$ to $1 \times 10^{-4}$. We choose ReLU as the activation function and apply L2 regularization to enhance the model's generalization ability. Cross-Entropy Loss and Triplet Loss serve as objective functions to guide training. The model undergoes 300 training iterations for performance optimization. Table 2 provides a detailed overview of the hyperparameter settings and configuration choices used in model training and optimization.
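For reference, a sketch of the corresponding optimizer and scheduler configuration is shown below; the placeholder model and the weight-decay value are assumptions, since Table 2 does not report the L2 coefficient.

```python
import torch

model = torch.nn.Linear(32, 2)   # placeholder for the full detection network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-4)  # L2 via weight decay
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-4)

for epoch in range(300):
    # ... forward pass, joint loss, optimizer.step() ...
    scheduler.step()             # cosine decay from 1e-2 toward 1e-4
```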
To assess model performance, we adopt four standard evaluation metrics: precision, recall, F1 score and accuracy.
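These metrics can be computed with scikit-learn as in the following sketch, treating the vulnerable class as the positive label.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    # Binary averaging: the vulnerable class (label 1) is the positive class.
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": p, "recall": r, "f1": f1}
```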

4.3. Result Analysis

4.3.1. RQ1: Parameter Analysis

To rigorously evaluate and optimize our model’s performance and efficiency, we undertook a multi-faceted analysis, which we delineate in this section.
First, we analyzed the structural properties of the CPGs constructed from the FFmpeg+Qemu, SVulD and Reveal datasets, focusing on four key metrics: node length, fused node count, and the node counts per sample before and after fusion. As illustrated in Figure 9, these insights guided our design choices for initial node dimensions and function representation architecture. Across all three datasets, we observed consistent statistical patterns, leading us to define a 32-dimensional embedding space for initial node representations. Additionally, we structured fused syntax-aware nodes in an 8 × 32 configuration, where 80% of the fused nodes contain fewer than eight individual nodes. Furthermore, we present the number of nodes per sample before and after syntax-aware fusion. The significant reduction in node count after fusion mitigates the long-range dependency problem in GNNs, particularly improving the effectiveness of information exchange between distant nodes.
Second, selecting an optimal batch size requires a nuanced understanding of task-specific factors. Smaller batch sizes enhance model robustness but prolong training, whereas larger batches accelerate training but may introduce memory constraints and gradient estimation inaccuracies. To determine the most effective batch size, we conducted experiments on the three datasets mentioned earlier, evaluating batch sizes of 16, 32, 64, 128, 256 and 512, as depicted in Figure 10a. The results consistently indicated that a batch size of 64 achieved the best performance, guiding our selection for subsequent experiments.
Third, we examined how prompt dimension affects model performance on three datasets. By combining textual and structural prompts, we tested dimensions of 8, 16, 32, 64 and 128. As shown in Figure 10b, accuracy initially improved with larger dimensions but declined beyond a certain threshold. A dimension of 32 achieved the best balance between accuracy and stability and was used in subsequent experiments.

4.3.2. RQ2: Syntax-Aware Embedding Effectiveness

The CPG representation of code combines multiple structural elements, including the AST, providing a solid foundation for vulnerability detection. However, traditional GNNs struggle to adapt to the tree structure. Our analysis shows that directly applying unified graph neural architectures (e.g., GCN) to hybrid code leads to suboptimal performance in vulnerability detection. To overcome this, we propose a hierarchical learning framework. First, we use syntax-aware embedding techniques to model AST nodes in the CPG. Then, we apply a GCN to extract features closely related to security vulnerabilities, improving the model’s ability to detect potential vulnerabilities.
To systematically evaluate the impact of syntax-aware encoding on vulnerability detection capabilities, we designed a controlled ablation study to compare two architectures: (1) the baseline GCN and (2) the GCN enhanced with a syntax-aware embedding module. Both configurations were implemented as binary classifiers and evaluated on the three datasets mentioned earlier. To ensure statistical reliability, we conducted 20 independent trials for each configuration, maintaining identical hyperparameter settings (as detailed in Table 2).
In the experimental results shown in Table 3, similar trends are observed across the three datasets. Here, we focus on the FFmpeg+Qemu dataset for a more detailed discussion. The baseline GCN model exhibits relatively weak overall performance. In contrast, CNN-GCN outperforms the GCN across all evaluation metrics, including accuracy (58.05% vs. 57.14%), precision (56.95% vs. 55.69%), recall (58.05% vs. 54.60%) and F1 score (55.47% vs. 53.52%), demonstrating that incorporating CNN-based syntax-aware feature extraction enhances the model’s discriminative ability to some extent. However, CNN-GCN also shows significantly higher standard deviations (up to ±1.44), indicating that the model’s performance is more unstable and sensitive to variations in the experimental setup. This implies that while CNN-based syntax-aware feature extraction improves representation capacity to some degree, it does not fundamentally overcome the modeling limitations of GCNs.
In comparison, the Text Prompt method exhibits clear advantages over CNN-GCN across all metrics—accuracy, precision, recall and F1 score. Notably, the median accuracy and precision both exceed 60%, and the distribution is more concentrated. These results suggest that leveraging natural language prompts can effectively enhance the model’s understanding of task objectives, leading to more robust and efficient feature learning.

4.3.3. RQ3: Structure-Enhanced Prompt Effectiveness

To evaluate the impact of incorporating structural prompts, we compared our architecture (SE Prompt) with a plain-text baseline (Text Prompt) lacking structural information. To ensure experimental rigor and maintain consistency with RQ2, we fixed all hyperparameter settings and conducted 20 independent runs for each configuration.
In the experimental results shown in Table 3, the Text Prompt method helps the model better understand and process the semantic information in the input data by introducing specific textual prompts. This approach leads to notable performance improvements on the FFmpeg+Qemu dataset, where it achieves an accuracy of 61.32%, precision of 60.41%, recall of 59.71% and F1 score of 59.53%.
In contrast, SE Prompt, which combines structured information such as graph structures or domain knowledge, provides richer contextual information, further enhancing the model’s performance. On the FFmpeg+Qemu dataset, the four evaluation metrics of SE Prompt are significantly better than text prompts. This result indicates that structured enhancement of prompts offers substantial advantages in improving accuracy and consistency.
The observed performance trends in the SVulD and Reveal datasets align closely with those in the FFmpeg+Qemu dataset, thereby validating the effectiveness and generalizability of our method across diverse vulnerability detection tasks.
Our results were validated through statistical tests, using the p-value (Mann–Whitney U) and effect size metrics. As shown in Table 4, all p-values are less than 0.001, indicating that our method significantly outperforms the three baseline models across all datasets and metrics. The consistently large effect sizes (reaching 1.00) further underscore the practical and theoretical superiority of our approach.
To further validate the effectiveness of SE Prompt, we present the ROC curves of all models across the three datasets in Figure 11. SE Prompt consistently produces curves closer to the top-left corner, indicating higher true-positive rates across a wide range of thresholds. This effect is particularly evident on the FFmpeg+Qemu and SVulD datasets, suggesting that incorporating structural information significantly enhances the model’s discriminative power. Furthermore, SE Prompt achieves consistently higher AUC values across all datasets, reinforcing its robustness and strong generalization capability.

4.3.4. RQ4: Comparison with State of the Art

To evaluate the effectiveness of our proposed method, we conducted a series of comparative experiments using state-of-the-art (SOTA) techniques on the FFmpeg+Qemu, SVulD and Reveal datasets. First, we compared our method with two vulnerability detection approaches that do not utilize pretrained models [12,14], while the remaining methods were based on pretrained models [23,26,51,52,53,54]. The detailed results of these comparisons are presented in Table 5.
This table shows the performance of several vulnerability detection models on three datasets. The models are evaluated using four metrics: accuracy (Acc), precision (Pre), recall (Rec) and F1 score (F1). Our method achieves the best or near-best results in most cases, showing a clear advantage overall.
On the FFmpeg+Qemu dataset, our method achieved an accuracy of 64.40%, slightly lower than UnixCoder’s 65.19%. However, it recorded the highest precision (63.59%) and F1 score (63.41%) among all models. These results reflect a well-balanced trade-off between precision and recall, enhancing detection accuracy without compromising coverage. The improvement on the SVulD dataset is particularly notable. While maintaining a high accuracy of 83.44%, our method achieved a breakthrough in precision with 75.89%, nearly four times higher than the second-best model LineVul (15.95%). This demonstrates stronger discriminative capability in handling complex vulnerability patterns. Furthermore, on the Reveal dataset, our method retained a competitive advantage, achieving an F1 score of 56.11% and an accuracy of 90.69%, thereby surpassing previous benchmarks. Although its recall (55.14%) was slightly lower than that of UnixCoder, the overall performance was strengthened by a more effective balance between precision and recall.
Our method achieves the highest F1 scores across all three cross-domain datasets, demonstrating strong generalization and robustness. This stability largely stems from its deep understanding of code semantics, which enables the model to effectively capture vulnerability patterns across different contexts. In particular, the method shows clear superiority on the imbalanced SVulD and Reveal datasets, where prompt learning helps focus on essential and underrepresented patterns, reducing the negative impact of skewed data distributions. These empirical results highlight both the effectiveness and competitive advantage of our approach in the field of vulnerability detection while also suggesting its practical potential for real-world deployment.

4.4. Feature Visualization

4.4.1. Graphon Visualization

To demonstrate the effectiveness of prompts with task-related knowledge, we provide a graphical structure visualization using the existing dataset. In this study, the graphical structure is represented as a matrix $W = [w_{kk'}] \in [0, 1]^{K \times K}$, where each element $W_{ij}$ denotes the probability of an edge existing between node $i$ and node $j$. In the visualization, a darker color indicates a higher probability of an edge being present at a given location.
As shown in Figure 12, subfigures (a)–(f) compare structural patterns across the FFmpeg+Qemu, SVulD and Reveal datasets. Despite dataset differences, structural consistency is largely preserved, demonstrating that our graphon-based prompt generation captures class-specific topologies. Ablation results further confirm that removing graph-based prompts degrades classification performance, underscoring their contribution to model robustness.

4.4.2. Code Feature Visualization

To investigate how our prompt module influences neural network feature extraction, we designed a sequence of visualization experiments on three distinct datasets. Specifically, we applied the UMAP technique [55] to project three feature sets, namely the fused code embeddings and the two classes of structure-enhanced prompt embeddings generated by our model, into a shared two-dimensional space. In the visualizations, circular markers denote the combined code features, whereas star markers highlight the two classes of prompt features. To emphasize their separation, we overlaid blue perpendicular bisectors between the prompt-feature clusters, as shown in Figure 13a–c. The density curves at the margins of each scatter plot reveal a clear distinction: our carefully tuned prompt tokens and similarity metrics effectively segregate the two feature populations. Although occasional misalignments produce a few false positives, they do not obscure the overall boundary. Taken together, these visualizations demonstrate the discriminative power and robustness of our approach.

4.5. Threats to Validity

4.5.1. Internal Validity

The model’s internal validity depends on the methods used for hyperparameter optimization and data preprocessing. We attempted to fine-tune hyperparameters using rigorous experimental design. However, resource constraints prevented us from performing exhaustive hyperparameter sweeps, leaving room for future exploration of alternative configurations. Moreover, our approach uses CPGs as input features, so model performance depends on the quality of the generated graph structure. To mitigate this threat, we selected datasets that are widely endorsed by the academic community.

4.5.2. External Validity

The model’s external validity has certain limitations, mainly related to the datasets used. Our study uses C/C++ code with function-level vulnerability labels, which may affect the model’s generalization ability. To address this, we validated our method using three publicly available datasets that are well regarded in the academic community. Additionally, model performance might vary when applied to more complex codebases or different programming languages (e.g., Java). However, it is important to note that our model’s core architecture and algorithms are inherently language-agnostic.

5. Conclusions

Inspired by recent advances in prompt learning within the field of natural language processing (NLP), this paper introduces a two-stage prompt optimization framework combined with a hierarchical representation learning strategy, both of which show significant advantages in code vulnerability detection. Building upon conventional prompt text, we incorporate graphon theory to construct structurally enhanced prompts that are task-specific and knowledge-rich. These prompts transform contextual variables and graphical structures into trainable vector representations, enabling dynamic selection of the most effective prompts during training. Furthermore, our approach leverages the pretrained CodeBERTScore model in conjunction with a TextCNN and a GCN, allowing us to effectively capture both local code semantics and syntactic features while modeling global structural dependencies. By integrating code features with prompt features in a unified architecture, our model achieves more accurate and robust vulnerability prediction. Experimental results on three publicly available datasets (FFmpeg+Qemu, SVulD and Reveal) demonstrate that our method significantly outperforms current SOTA models, thereby validating its effectiveness and superiority in code vulnerability detection tasks.
In future work, we will explore advanced graph-structure modeling and optimization techniques to boost both accuracy and generalization in vulnerability detection. Although our current efforts focus on functional-level flaws in C/C++ code, we intend to support multiple vulnerability types, extend analyses to other programming languages and refine detection granularity from the function level down to individual lines.

Author Contributions

Conceptualization, W.C.; methodology, W.C.; software, W.C.; validation, W.C.; formal analysis, W.C.; investigation, W.C.; resources, W.C.; data curation, W.C.; writing—original draft preparation, W.C.; writing—review and editing, W.C. and C.Y.; visualization, W.C.; supervision, C.Y. and H.Z.; project administration, C.Y. and H.Z.; funding acquisition, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grant No. 62362022, the National Key Research and Development Program of China under Grant No. 2018YFB2100805 and the Key Research and Development Program of Hainan Province under grant No. ZDYF2020008, ZDYF2022GXJS230. This research was also supported by Hainan Province Intelligent Software Engineering Research Center and the Key Laboratory of Big Data and Smart Services of Hainan Province.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

“FFmpeg+Qemu”: https://drive.google.com/file/d/1LrGV9i5A90qO8S49Bmo3K9AVQyl1sbOI/view (accessed on 15 December 2024); “SVulD”: https://drive.google.com/file/d/1fw3SmCJjUCche2cSAhBjjnii7TO3qBje/view (accessed on 15 December 2024); “Reveal”: https://drive.google.com/file/d/1TcV_KzeBWCnAChl92g6vonpNhSVB0H0A/view (accessed on 15 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. MITRE. Common Vulnerabilities and Exposures. Available online: https://www.cve.org/ (accessed on 15 January 2025).
  2. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 6–12 December 2020.
  3. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763.
  4. Hu, S.; Ding, N.; Wang, H.; Liu, Z.; Wang, J.; Li, J.; Wu, W.; Sun, M. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 2225–2240.
  5. Chen, X.; Zhang, N.; Xie, X.; Deng, S.; Yao, Y.; Tan, C.; Huang, F.; Si, L.; Chen, H. KnowPrompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 2778–2788.
  6. Alon, U.; Yahav, E. On the Bottleneck of Graph Neural Networks and its Practical Implications. arXiv 2021, arXiv:2006.05205.
  7. Wen, X.C.; Chen, Y.; Gao, C.; Zhang, H.; Zhang, J.M.; Liao, Q. Vulnerability Detection with Graph Simplification and Enhanced Graph Representation Learning. In Proceedings of the 45th International Conference on Software Engineering, ICSE’23, Melbourne, Australia, 14–20 May 2023; IEEE Press: Piscataway, NJ, USA, 2023; pp. 2275–2286.
  8. Cheng, X.; Wang, H.; Hua, J.; Xu, G.; Sui, Y. DeepWukong: Statically detecting software vulnerabilities using deep graph neural network. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2021, 30, 1–33.
  9. Hin, D.; Kan, A.; Chen, H.; Babar, M.A. LineVD: Statement-level Vulnerability Detection using Graph Neural Networks. In Proceedings of the 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), Pittsburgh, PA, USA, 23–24 May 2022; pp. 596–607.
  10. Diaconis, P.; Janson, S. Graph limits and exchangeable random graphs. arXiv 2007, arXiv:0712.2749.
  11. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1746–1751.
  12. Zhou, Y.; Liu, S.; Siow, J.; Du, X.; Liu, Y. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. arXiv 2019, arXiv:1909.03496.
  13. Ni, C.; Yin, X.; Yang, K.; Zhao, D.; Xing, Z.; Xia, X. Distinguishing Look-Alike Innocent and Vulnerable Code by Subtle Semantic Representation Learning and Explanation. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, 5–7 December 2023; pp. 1611–1622.
  14. Chakraborty, S.; Krishna, R.; Ding, Y.; Ray, B. Deep learning based vulnerability detection: Are we there yet? IEEE Trans. Softw. Eng. 2021, 48, 3280–3296.
  15. Russell, R.; Kim, L.; Hamilton, L.; Lazovich, T.; Harer, J.; Ozdemir, O.; Ellingwood, P.; McConley, M. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 757–762.
  16. Li, Z.; Zou, D.; Xu, S.; Ou, X.; Jin, H.; Wang, S.; Deng, Z.; Zhong, Y. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. In Proceedings of the 2018 Network and Distributed System Security Symposium, San Diego, CA, USA, 18–21 February 2018; Internet Society: Reston, VA, USA, 2018.
  17. Li, Z.; Zou, D.; Xu, S.; Jin, H.; Zhu, Y.; Chen, Z. SySeVR: A framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Secur. Comput. 2022, 19, 2244–2258.
  18. Peng, B.; Liu, Z.; Zhang, J.; Su, P. CEVulDet: A Code Edge Representation Learnable Vulnerability Detector. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 1–8.
  19. Zhang, C.; Liu, B.; Xin, Y.; Yao, L. CPVD: Cross project vulnerability detection based on graph attention network and domain adaptation. IEEE Trans. Softw. Eng. 2023, 49, 4152–4168.
  20. Wen, X.C.; Gao, C.; Ye, J.; Li, Y.; Tian, Z.; Jia, Y.; Wang, X. Meta-path based attentional graph learning model for vulnerability detection. IEEE Trans. Softw. Eng. 2024, 50, 360–375.
  21. Wang, Q.; Li, Z.; Liang, H.; Pan, X.; Li, H.; Li, T.; Li, X.; Li, C.; Guo, S. Graph Confident Learning for Software Vulnerability Detection. Eng. Appl. Artif. Intell. 2024, 133, 108296.
  22. Tian, Z.; Tian, B.; Lv, J.; Chen, Y.; Chen, L. Enhancing vulnerability detection via AST decomposition and neural sub-tree encoding. Expert Syst. Appl. 2024, 238, 121865.
  23. Wen, X.C.; Gao, C.; Gao, S.; Xiao, Y.; Lyu, M.R. SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, 16–20 September 2024; pp. 235–247.
  24. Wu, Y.; Zou, D.; Dou, S.; Yang, W.; Xu, D.; Jin, H. VulCNN: An image-inspired scalable vulnerability detection system. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 22–24 May 2022; pp. 2365–2376.
  25. Buratti, L.; Pujar, S.; Bornea, M.; McCarley, S.; Zheng, Y.; Rossiello, G.; Morari, A.; Laredo, J.; Thost, V.; Zhuang, Y.; et al. Exploring Software Naturalness through Neural Language Models. arXiv 2020, arXiv:2006.12641.
  26. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1536–1547.
  27. Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-training Code Representations with Data Flow. arXiv 2021, arXiv:2009.08366.
  28. Zhou, S.; Alon, U.; Agarwal, S.; Neubig, G. CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 13921–13937.
  29. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
  30. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374.
  31. Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A.H.; Riedel, S. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2463–2473.
  32. Jiang, Z.; Xu, F.F.; Araki, J.; Neubig, G. How can we know what language models know? Trans. Assoc. Comput. Linguist. 2020, 8, 423–438.
  33. Shin, T.; Razeghi, Y.; Logan IV, R.L.; Wallace, E.; Singh, S. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 4222–4235.
  34. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059.
  35. Zhong, Z.; Friedman, D.; Chen, D. Factual probing is [MASK]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5017–5033.
  36. Li, Y.; Liang, F.; Zhao, L.; Cui, Y.; Ouyang, W.; Shao, J.; Yu, F.; Yan, J. Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. arXiv 2022, arXiv:2110.05208.
  37. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35.
  38. Wang, C.; Yang, Y.; Gao, C.; Peng, Y.; Zhang, H.; Lyu, M.R. Prompt Tuning in Code Intelligence: An Experimental Evaluation. IEEE Trans. Softw. Eng. 2023, 49, 4869–4885.
  39. Zhang, C.; Liu, H.; Zeng, J.; Yang, K.; Li, Y.; Li, H. Prompt-Enhanced Software Vulnerability Detection Using ChatGPT. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 14–20 April 2024; pp. 276–277.
  40. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
  41. Yamaguchi, F.; Golde, N.; Arp, D.; Rieck, K. Modeling and discovering vulnerabilities with code property graphs. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 18–21 May 2014; pp. 590–604.
  42. Wang, H.; Ye, G.; Tang, Z.; Tan, S.H.; Huang, S.; Fang, D.; Feng, Y.; Bian, L.; Wang, Z. Combining graph-based learning with automated data collection for code vulnerability detection. IEEE Trans. Inf. Forensics Secur. 2021, 16, 1943–1958.
  43. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16816–16825.
  44. Goldenberg, A.; Zheng, A.X.; Fienberg, S.E.; Airoldi, E.M. A survey of statistical network models. Found. Trends® Mach. Learn. 2010, 2, 129–233.
  45. Lovász, L. Large Networks and Graph Limits; American Mathematical Society: Providence, RI, USA, 2012; Volume 60.
  46. Frieze, A.; Kannan, R. Quick approximation to matrices and applications. Combinatorica 1999, 19, 175–220.
  47. Chatterjee, S. Matrix estimation by universal singular value thresholding. Ann. Stat. 2015, 43, 177–214.
  48. Fan, J.; Li, Y.; Wang, S.; Nguyen, T.N. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. In Proceedings of the 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR), Seoul, Republic of Korea, 29–30 June 2020; pp. 508–512.
  49. Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv 2021, arXiv:2102.04664.
  50. Wen, X.C.; Wang, X.; Gao, C.; Wang, S.; Liu, Y.; Gu, Z. When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 345–357.
  51. Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8696–8708.
  52. Guo, D.; Lu, S.; Duan, N.; Wang, Y.; Zhou, M.; Yin, J. UniXcoder: Unified cross-modal pre-training for code representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 7212–7225.
  53. Fu, M.; Tantithamthavorn, C. LineVul: A Transformer-based Line-Level Vulnerability Prediction. In Proceedings of the 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), Pittsburgh, PA, USA, 23–24 May 2022; pp. 608–620.
  54. Zhang, J.; Liu, Z.; Hu, X.; Xia, X.; Li, S. Vulnerability Detection by Learning From Syntax-Based Execution Paths of Code. IEEE Trans. Softw. Eng. 2023, 49, 4196–4212.
  55. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2020, arXiv:1802.03426.
Figure 1. Example of a UAF vulnerability, where the variable VAR8 is freed and then reused.
Figure 2. The simplified CPG of the vulnerable function.
Figure 3. Overview of our approach.
Figure 4. Steps of normalization.
Figure 5. An example of a CPG (nodes and undirected edges) generated by Joern.
Figure 6. Syntax-aware embedding.
Figure 7. Triplet loss: enhancing similarity within vulnerability types while differentiating across types.
Figure 8. The vulnerability detection process.
Figure 9. Node length, fused node count, and per-sample node count, measured before and after fusion in the graph structure.
Figure 10. Performance under varying batch sizes and prompt dimensions on the three datasets.
Figure 11. Comparison of ROC curves across datasets (SE Prompt vs. baselines).
Figure 12. Visualization of the graphon matrix on the three datasets.
Figure 13. Visualization of features on the three datasets.
Table 1. The distribution of the datasets.

Dataset        Total    Vul      Non-Vul   Vul Ratio (%)
FFmpeg+Qemu    22,361   10,067   12,294    45.02
SVulD          28,730    5,260   23,470    18.31
Reveal         18,169    1,664   16,505     9.16
Table 2. Parameter settings in our model.

Parameter             Setting              Parameter         Setting
Loss function         CE & Triplet Loss    Batch size        64
Activation function   ReLU                 Learning rate     1 × 10⁻² to 1 × 10⁻⁴
Optimizer             AdamW                Number of epochs  300
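To make the settings in Table 2 concrete, the sketch below shows how they might be wired together in PyTorch. This is a minimal illustration under stated assumptions, not the authors' released code: the Detector module, the triplet margin of 1.0, the cosine schedule used to traverse the reported learning-rate range, and the loss weight lam are all assumptions layered on the settings above.

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Stand-in classifier head (illustrative only; not the paper's model)."""
    def __init__(self, dim: int = 768, num_classes: int = 2):
        super().__init__()
        # ReLU activation, per Table 2.
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = Detector()

# Table 2: cross-entropy combined with a triplet loss (the margin value is an assumption).
ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

# Table 2: AdamW over 300 epochs; the learning rate spans 1e-2 to 1e-4.
# The cosine annealing used to move across that range is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-4)

def training_step(feats, labels, anchor, positive, negative, lam: float = 0.5) -> float:
    """One batch update (batch size 64 per Table 2); the weight `lam` is illustrative."""
    optimizer.zero_grad()
    loss = ce_loss(model(feats), labels) + lam * triplet_loss(anchor, positive, negative)
    loss.backward()
    optimizer.step()  # scheduler.step() is then called once per epoch
    return loss.item()
```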
Table 3. Vulnerability detection results (%) on the three datasets (mean ± std over 20 runs).

Dataset       Method        Accuracy       Precision      Recall         F1 Score
FFmpeg+Qemu   GCN           57.14 ± 0.18   55.69 ± 0.18   54.60 ± 0.04   53.52 ± 0.21
              CNN-GCN       58.05 ± 1.12   56.95 ± 1.36   58.05 ± 1.12   55.47 ± 1.44
              Text Prompt   61.32 ± 0.34   60.41 ± 0.35   59.71 ± 0.89   59.53 ± 0.67
              SE Prompt     64.40 ± 0.50   63.59 ± 0.47   63.33 ± 0.48   63.41 ± 0.49
SVulD         GCN           82.36 ± 0.89   71.96 ± 0.72   57.56 ± 0.38   59.31 ± 0.68
              CNN-GCN       82.52 ± 0.68   74.65 ± 1.45   57.66 ± 0.97   59.39 ± 0.92
              Text Prompt   82.69 ± 0.15   75.59 ± 0.29   65.60 ± 0.96   68.37 ± 0.84
              SE Prompt     83.44 ± 0.18   75.89 ± 0.53   67.80 ± 0.69   70.69 ± 0.60
Reveal        GCN           88.90 ± 0.15   44.61 ± 0.14   50.00 ± 0.15   47.15 ± 0.02
              CNN-GCN       89.21 ± 0.35   55.35 ± 1.62   50.08 ± 0.89   50.42 ± 1.69
              Text Prompt   89.69 ± 0.06   61.68 ± 1.16   51.76 ± 0.58   52.01 ± 1.25
              SE Prompt     90.69 ± 0.62   56.63 ± 0.51   55.14 ± 0.55   56.11 ± 0.66
Table 4. Statistical analysis: p-value and effect size (SE Prompt vs. baselines).

Model         Dataset       Accuracy (p/e)     Precision (p/e)    Recall (p/e)       F1 Score (p/e)
GCN           FFmpeg+Qemu   5.79 × 10⁻⁸/1.00   6.75 × 10⁻⁸/1.00   6.75 × 10⁻⁸/1.00   6.57 × 10⁻⁸/1.00
              SVulD         1.37 × 10⁻⁷/1.00   6.78 × 10⁻⁸/1.00   6.78 × 10⁻⁸/1.00   6.73 × 10⁻⁸/1.00
              Reveal        5.60 × 10⁻⁸/1.00   5.65 × 10⁻⁸/1.00   3.42 × 10⁻¹/1.00   5.65 × 10⁻⁸/1.00
CNN-GCN       FFmpeg+Qemu   6.27 × 10⁻⁸/1.00   6.78 × 10⁻⁸/1.00   6.66 × 10⁻⁸/1.00   6.75 × 10⁻⁸/1.00
              SVulD         1.16 × 10⁻⁷/1.00   1.60 × 10⁻⁵/1.00   6.77 × 10⁻⁸/1.00   6.72 × 10⁻⁸/1.00
              Reveal        5.42 × 10⁻⁵/1.00   7.72 × 10⁻⁵/1.00   3.42 × 10⁻⁵/1.00   7.72 × 10⁻⁵/1.00
Text Prompt   FFmpeg+Qemu   6.32 × 10⁻⁸/1.00   6.79 × 10⁻⁸/1.00   6.79 × 10⁻⁸/1.00   6.76 × 10⁻⁸/1.00
              SVulD         6.96 × 10⁻⁸/1.00   2.21 × 10⁻⁷/1.00   6.79 × 10⁻⁸/1.00   6.79 × 10⁻⁸/1.00
              Reveal        4.32 × 10⁻⁵/1.00   6.20 × 10⁻⁵/1.00   3.42 × 10⁻⁵/1.00   6.20 × 10⁻⁵/1.00
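The p-values in Table 4 are on the scale expected from a rank-based comparison of the 20 runs per method (for two samples of size 20 with complete separation, the asymptotic two-sided Mann–Whitney p-value is roughly 6.8 × 10⁻⁸). The sketch below shows how such a p-value and a Cliff's delta effect size could be computed with SciPy; the choice of test and effect-size measure here is an assumption about the analysis, and the per-run scores are hypothetical placeholders, not the paper's data.

```python
from itertools import product
from scipy.stats import mannwhitneyu

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-sample pairs."""
    greater = sum(x > y for x, y in product(a, b))
    less = sum(x < y for x, y in product(a, b))
    return (greater - less) / (len(a) * len(b))

# Hypothetical per-run accuracies (20 runs each); the real per-run scores are not listed.
se_prompt = [64.40 + 0.05 * i for i in range(20)]
baseline = [57.14 + 0.05 * i for i in range(20)]

stat, p = mannwhitneyu(se_prompt, baseline, alternative="two-sided", method="asymptotic")
print(f"p = {p:.2e}, Cliff's delta = {cliffs_delta(se_prompt, baseline):.2f}")
# With fully separated samples this prints p ≈ 6.8e-8 and delta = 1.00,
# matching the scale of the entries in Table 4.
```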
Table 5. Comparison with state-of-the-art vulnerability detectors on the three datasets (all metrics in %).

              FFmpeg+Qemu                  SVulD                        Reveal
Method        Acc    Pre    Rec    F1      Acc    Pre    Rec    F1      Acc    Pre    Rec    F1
Devign        56.89  52.50  64.67  57.95   73.57   9.72  50.31  16.29   87.49  31.55  36.65  33.91
Reveal        61.07  55.50  70.70  62.19   82.58  12.92  40.08  19.31   81.77  31.55  61.14  41.62
CodeBERT      62.37  61.55  48.21  54.07   80.56  14.33  55.32  22.76   87.51  43.63  56.15  49.10
CodeT5        63.36  58.65  68.61  63.24   78.73  14.32  62.36  23.30   89.53  51.15  54.51  52.78
UnixCoder     65.19  59.93  59.98  59.96   77.54  15.11  72.24  24.99   88.48  47.44  68.44  56.04
EPVD          63.03  59.32  62.15  60.70   76.75  14.26  69.58  23.67   88.87  48.60  63.93  55.22
LineVul       62.37  61.55  48.21  54.07   80.57  15.95  64.45  25.58   87.51  43.63  56.15  49.10
Ours          64.40  63.59  63.33  63.41   83.44  75.89  67.80  70.69   90.69  56.63  55.14  56.11
