Data-Driven Transferable Modeling for Cross-Project Software Vulnerability Detection via Dual-Feature Stacking Ensemble

Liu, Yu; Liu, Bin; Wang, Shihai; Hu, Bin; Jin, Yujie

doi:10.3390/math14050780


by Yu Liu ¹, Bin Liu ¹, Shihai Wang ¹,*, Bin Hu ² and Yujie Jin ²

¹ School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
² Department of Information Science and Engineering, Changsha Normal University, Changsha 410100, China
* Author to whom correspondence should be addressed.

Mathematics 2026, 14(5), 780; https://doi.org/10.3390/math14050780

Submission received: 27 January 2026 / Revised: 14 February 2026 / Accepted: 24 February 2026 / Published: 26 February 2026

(This article belongs to the Special Issue Advances and Applications for Data-Driven/Model-Free Control)


Abstract

In recent years, deep learning-based vulnerability detection has drawn wide attention for its data-driven ability to analyze code semantics and learn vulnerability patterns without predefined models. However, data distribution differences across projects limit model generalization. Transfer learning provides a solution, yet most studies ignore expert-designed metrics. This paper proposes Decpvd, a data-driven cross-project software vulnerability detection method based on a dual-feature stacking ensemble. It builds an adaptive and transferable model using only code and vulnerability label data from source and target projects. It extracts code semantic features via Gated Graph Neural Networks, incorporates expert metrics from tools, performs cross-domain data-driven modeling with TrAdaBoost, and adaptively fuses the two features through stacking, overcoming fixed-weight fusion limitations. Experiments on six cross-project groups from three real datasets (FFmpeg, LibTIFF, LibPNG) show that Decpvd achieves an average AUC of 0.814, significantly outperforming mainstream baselines.

Keywords:

cross-project vulnerability detection; transfer learning; semantic metrics; expert metrics; adaptive model fusion; data-driven modeling

MSC:

68T07

1. Introduction

As software systems rapidly evolve toward complexity and scale, the relevance of code logic and heterogeneity across projects have become increasingly prominent. Software vulnerabilities, as latent flaws in code design and implementation, not only lead to system functional abnormalities and operational interruptions but also may serve as triggers for security risks, posing severe challenges to the stability and reliability of software systems [1]. Essentially, vulnerability detection achieves accurate discrimination between “normal code and vulnerable code” through in-depth mining and pattern recognition of code features, a process that highly relies on data-driven feature learning and adaptive modeling capabilities. With the advancement of artificial intelligence technology, deep learning has provided a novel technical pathway for vulnerability detection by virtue of its powerful automatic feature extraction and semantic understanding capabilities. By constructing end-to-end model architectures, deep learning can automatically analyze the syntactic structures and semantic logic of code, mine the hidden feature patterns of vulnerabilities, and eliminate the need for manual design of detection rules, thereby significantly improving the automation level and efficiency of vulnerability detection [2,3,4,5,6].

However, current models suffer from poor generalization ability, and their performance generally drops sharply when detecting vulnerabilities in unseen projects [7]. This is attributed to the significant data distribution differences across different projects [8]. According to existing research [9,10,11], transfer learning represents a viable alternative approach. Dual-GD-DDAN [9] adopts a structure combining bidirectional recurrent neural networks with dual generators and dual discriminators to extract semantic metrics for project code analysis. CPVD [10] combines Graph Attention Networks (GATs) with Domain Adaptation to address the issues of label scarcity and the loss of code structural semantic information in cross-project software vulnerability detection. DAM2P [11] integrates deep domain adaptation with the maximum margin principle, learning domain-invariant features through adversarial training of a bidirectional Recurrent Neural Network (RNN) generator and a Generative Adversarial Network (GAN) to tackle the problem of imbalanced cross-project software vulnerability detection. In previous cross-project vulnerability detection research, most studies have only considered semantic metrics generated through code representation learning in deep learning, while neglecting expert metrics manually designed by domain experts [12].

Notably, recent studies in related detection tasks (e.g., electricity theft detection, complex network analysis) have demonstrated that synergistic modeling of multiple features (e.g., temporal–spectral features) and mining of critical information (e.g., key nodes) can effectively enhance model generalization across heterogeneous scenarios [13,14,15], which provides valuable insights for addressing the cross-project vulnerability detection problem. For instance, Zhao et al. [13] proposed a time–frequency synergistic modeling network that integrates temporal pathways and spectral features, achieving a robust performance in theft detection tasks with data distribution differences—this verifies the effectiveness of multi-feature fusion for cross-scenario detection. In complex network analysis, Zhao et al. [14,15] developed key node mining methods by integrating structural and neighborhood information, highlighting that deep learning-based multi-feature fusion can improve the model’s ability to capture critical patterns in heterogeneous data. These findings provide valuable technical insights for optimizing cross-project vulnerability detection: leveraging deep learning’s feature extraction advantages while integrating multiple complementary features is a feasible direction to address existing limitations.

This paper presents Decpvd, an improved cross-project software vulnerability detection method for C/C++ code based on a dual-feature stacking ensemble. Decpvd builds upon the CSVD-TF baseline, which pioneered the fusion of expert and semantic metrics. The method employs Gated Graph Neural Networks (GGNNs) [16] to extract code semantic features. It then trains separate base models on the expert and semantic features using TrAdaBoost transfer learning [17]. Finally, it uses a stacking ensemble strategy [18] to fuse the two models for binary classification, replacing the fixed-weight fusion scheme of CSVD-TF with an adaptive combination of semantic and expert metrics and thereby addressing the poor adaptability of fixed-weight fusion in cross-project scenarios. The main contributions of this study are as follows:

1. A vulnerability detection method named Decpvd is presented as an improved approach. Targeting the field of cross-project vulnerability detection, this method is developed on the basis of CSVD-TF (which pioneers expert–semantic feature fusion) and achieves efficient and accurate cross-project vulnerability detection through the collaborative work of three modules: the Code Representation Module, the Model Construction Module, and the Vulnerability Detection Module.
2. A model fusion mechanism based on a stacking ensemble is designed, which is capable of adaptively integrating two transfer learning models built on expert-metric features and semantic-metric features, respectively. This mechanism upgrades the fixed-weight fusion of CSVD-TF, realizes the effective complementarity between the two types of metrics, and further solves the problem of the poor adaptability of fixed-weight fusion in cross-project scenarios, thereby enhancing the performance of the vulnerability detection method.
3. We conduct large-scale evaluation experiments on real-world software project datasets for Decpvd. The experimental results demonstrate that Decpvd significantly outperforms current mainstream methods in cross-project vulnerability detection tasks, especially showing better adaptability than CSVD-TF due to its adaptive stacking fusion strategy.

2. Related Work

Related work is elaborated from two core dimensions: source code representation learning, which transforms source code into semantic-rich feature vectors to capture potential patterns of software vulnerabilities; and cross-project vulnerability detection, which leverages source project data and prior knowledge to predict labels of unlabeled projects by minimizing cross-domain feature distribution differences in the feature space for improved vulnerability classification accuracy.

2.1. Representation Learning of Source Code in Software Vulnerability Detection

In the early stages of software vulnerability detection research, researchers primarily focused on employing expert metrics [19,20,21,22]. These metrics are manually designed by security experts based on their professional expertise and practical experience, which can accurately capture security-related key features in software and lay a solid and reliable foundation for vulnerability detection. With the advancement of deep learning technology, researchers have leveraged such techniques to mine and analyze latent semantic information in code, identify concealed vulnerability patterns, and, thus, enhance the accuracy of vulnerability detection. Dam et al. [23] utilized the Long Short-Term Memory (LSTM) model to transform code token sequences into vector representations for extracting semantic and syntactic features of the code. Steenhoek et al. [24] employed pre-trained models such as Code2Vec [25] and CodeBERT [26] to investigate code semantic features. VulDeePecker [2] represented programs using semantically related code snippets and trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model for code representation. SySeVR [27] converted programs into vectors by extracting syntactic features (SyVCs) and semantic features (SeVCs) for Recurrent Neural Network (RNN) analysis to detect vulnerabilities. Inspired by image classification techniques, Wu et al. [28] transformed function source code into images with key program details preserved and employed Convolutional Neural Networks (CNNs) for vulnerability detection. Unlike treating code as a linearly arranged sequence of characters or tokens, graph-based analysis methods can effectively capture the internal dependency and structural information of code. Devign [3] extracts graphical information from code to construct a representation model for characterizing the inherent code structure and then employs Gated Graph Neural Networks (GGNNs) for feature learning on the code graph. 
Funded [29] interconnects code statements through relational edges to capture the syntactic, semantic, and flow information of the program. Reveal [30] represents the code structure and semantics using a Code Property Graph (CPG) and mines critical information through a heterogeneous graph transformer and convolutional pooling modules. ReGVD [4] initializes node features using a pre-trained model, constructs a graph structure from flattened source code token sequences, introduces residual connections between graph neural network layers during training, and generates code graph embeddings. EnGS2F [31] adopts an enhanced graph-structured representation learning method to strengthen the Program Dependence Graph (PDG) from both structural and nodal dimensions. In summary, deep learning-based semantic feature extraction has become a common practice in software vulnerability detection, which significantly enhances the efficiency and accuracy of vulnerability identification. Among all methods, graph-structured code representations outperform sequence-based ones in capturing code structural information, thus presenting more prominent advantages for vulnerability detection.

2.2. Cross-Project Software Vulnerability Detection

Cross-project software vulnerability detection addresses the challenge of insufficient labeled vulnerability samples in target projects by transferring vulnerability knowledge from other projects to construct detection models. It is a research hotspot and core challenge in the field of software security. Compared with in-project detection, coding style differences, feature distribution shifts, and the absence of labeled data in the target domain in cross-project scenarios constrain the generalization capability of traditional models, thus making cross-domain transfer and feature alignment the core research directions in this field. Nguyen et al. proposed the cross-project vulnerability detection model Dual-GD-DDAN [9]. To solve the mode collapse, boundary distortion, and data distortion in traditional GAN-based domain adaptation methods, this model realizes precise source–target domain data mapping via a dual-generator–dual-discriminator architecture and preserves data clustering structures by incorporating manifold regularization. Zhang et al. introduced the CPVD [10] cross-project vulnerability detection method, which combines Graph Attention Networks (GATs) with domain adaptation: it comprehensively represents code syntactic and semantic information through a Code Property Graph (CPG), extracts deep graph features via dual-attention GAT and convolutional pooling networks, resamples source domain data with SMOTE, and reduces cross-domain distribution discrepancy through domain adversarial learning. Li et al. proposed the VulGDA [8] framework integrating graph embedding and deep domain adaptation techniques, which is applicable to various cross-domain scenarios including zero-shot learning. This framework represents code syntactic and semantic relationships via a CPG, generates graph embedding vectors by aggregating neighborhood information with a GGNN, and minimizes cross-domain distribution discrepancy using Maximum Mean Discrepancy (MMD).

Tao et al. [32] proposed a cross-modal feature enhancement and fusion-based vulnerability detection method, which achieves fine-grained alignment of source code and assembly code at the statement/instruction and variable/register levels through compilation and debugging techniques, generates dual-modal slices based on program slicing, extracts semantic features from source code and assembly code with self-attention+CNN and BiGRU respectively, enhances cross-modal features via a co-attention mechanism, and completes feature fusion through attention-weighted summation. DAM2P [11] integrates deep domain adaptation with the maximum margin principle to address label scarcity and data imbalance: it learns domain-invariant features through bidirectional RNN generators and GAN adversarial learning and constructs a cross-domain kernel classifier to maximize the margin between source domain vulnerability data and target domain data for improved transfer detection performance in imbalanced scenarios. Cai et al. proposed the CSVD-TF [12], which combines transfer learning with metric fusion strategies for low-label target project scenarios: it extracts 39-dimensional expert metrics via the Understand tool and CodeBERT/BERT-Whitening-based semantic metrics, trains an XGBoost base classifier with TrAdaBoost transfer learning using labeled source and target project data, and fuses the two types of metric detection results at the model level with a fixed weight ratio of 0.4:0.6.

In recent years, software vulnerability detection based on large pre-trained language models (LLMs) has emerged as a research hotspot. By leveraging pre-training on massive amounts of code, such methods learn universal code semantic features and have achieved exceptional performance in single-project vulnerability detection. GRACE [33] enhances LLM-based software vulnerability detection by integrating graph structural information from the code and employing in-context learning. VulTrLM [34] guides LLMs to focus on key vulnerability semantics by deconstructing the Abstract Syntax Tree (AST) and enhancing it with annotations, thereby improving the accuracy of vulnerability detection. Semantic SAST [35] combines Tree-sitter AST parsing with LLM reasoning to automatically extract patterns from CVE patches, surpassing the capabilities of traditional Static Application Security Testing (SAST) tools.

Current cross-project software vulnerability detection methods predominantly focus on semantic feature extraction and cross-domain transfer, while generally ignoring the valuable information of traditional expert-designed metrics, leading to incomplete feature utilization. Although CSVD-TF innovatively integrates expert metrics and semantic metrics and improves detection performance through model-level fusion, its fixed weight allocation strategy cannot adaptively adjust the fusion ratio according to cross-project feature distribution disparities or vulnerability types, resulting in poor flexibility and adaptability. To address this limitation, we propose a cross-project software vulnerability detection method based on a dual-feature stacking ensemble, named Decpvd.

3. Approach

The framework of the proposed approach, Decpvd, is illustrated in Figure 1. Cross-project software vulnerability detection faces two problems: inconsistent data distributions between source and target projects, and the difficulty of effectively fusing expert and semantic features. Decpvd addresses both with a dual-feature stacking ensemble. First, it employs TrAdaBoost transfer learning [17] to train base models separately on expert features and semantic features. Next, it uses stacking ensemble learning: the predicted probabilities from the two base models are treated as meta-features, and a meta-model is trained on them to achieve feature fusion. Finally, it performs binary classification detection of cross-project software vulnerabilities. Decpvd comprises three modules: the Code Representation Module, the Model Construction Module, and the Vulnerability Detection Module. The following sections explain these three modules in detail.
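The stacking step can be illustrated with a minimal sketch. It assumes a logistic-regression meta-model trained by plain gradient descent; this section does not name the meta-learner, so that choice, the function names `fit_meta_model` and `predict_proba`, the hyperparameters, and the toy probabilities are all illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_meta_model(meta_X, y, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner (an assumed choice).

    meta_X holds one (p_expert, p_semantic) pair per sample, i.e. the
    vulnerability probabilities predicted by the two base models.
    """
    w, b = [0.0, 0.0], 0.0
    n = len(y)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for (p_e, p_s), label in zip(meta_X, y):
            err = sigmoid(w[0] * p_e + w[1] * p_s + b) - label
            grad_w[0] += err * p_e
            grad_w[1] += err * p_s
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, p_expert, p_semantic):
    """Fused vulnerability probability from the two base-model outputs."""
    w, b = model
    return sigmoid(w[0] * p_expert + w[1] * p_semantic + b)


# Toy meta-features, invented for illustration only.
meta_X = [(0.9, 0.5), (0.8, 0.4), (0.7, 0.6), (0.1, 0.5), (0.2, 0.6), (0.3, 0.4)]
y = [1, 1, 1, 0, 0, 0]
model = fit_meta_model(meta_X, y)
print(predict_proba(model, 0.85, 0.45))
```

The learned coefficients play the role that a hand-set ratio plays in fixed-weight fusion, but they are fitted to the data. In common stacking practice the meta-features come from held-out (out-of-fold) predictions so that the meta-model does not simply learn the base models' training-set overconfidence.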

3.1. Code Representation Module

The primary function of this module is to extract representative and practical metrics from source code, transforming statistical data and code tokens into numerical vectors. This conversion facilitates subsequent data analysis and model construction. The core inputs of this module are the Source Project Codes and Target Project Codes (as shown in Figure 1). Through two parallel extraction pathways—expert feature extraction and semantic feature extraction—standardized preprocessing and feature extraction are performed on these two types of input codes. Ultimately, this process outputs expert feature vectors and semantic feature vectors corresponding to both the source and target projects, which are further integrated into Source Embedding and Target Embedding, respectively, providing feature inputs for the subsequent model construction module. The Source Project Codes and Target Project Codes undergo identical preprocessing and feature extraction procedures within this module, ensuring uniformity in the feature space and preventing cross-project feature distribution biases.

3.1.1. Expert Metrics

Expert metrics are manually designed by domain experts, and we utilize the commercial tool Understand [36] to collect them. This tool is capable of calculating both traditional and object-oriented source code metrics for Java and C/C++ projects. We adopt the 39 expert metrics screened by Cai et al. [12] as the expert feature set in this study, with the selection rationale mainly reflected in three aspects: first, their research is tailored to the cross-project software vulnerability detection scenario, and the feature selection scheme is a targeted optimization for this task, which is highly consistent with the research objective of this paper; second, they have verified the non-redundancy of these 39 metrics through dimensionality reduction and feature selection methods, effectively avoiding the negative impacts of redundant features on model training; and third, their research objects are also C/C++ projects, which matches the code feature distribution and programming characteristics of the research objects in this paper, making the selected features more adaptable to our feature extraction requirements. These 39 metrics are roughly divided into five dimensions: code size, complexity, readability, maintainability, and performance, with specific details shown in Table 1.

The expert features extracted in this study are represented as a 39-dimensional numerical feature vector, encompassing five major dimensions: code size, complexity, readability, maintainability, and performance. Physically, these features capture the shallow-level formal characteristics of the code from a traditional programming perspective, reflecting its external statistical properties and engineering-oriented attributes. They represent an empirical summary by domain experts of features relevant to code vulnerabilities.

3.1.2. Semantic Metrics

In recent years, deep learning has been widely applied in vulnerability detection due to its ability to precisely analyze code semantics and learn vulnerability characteristics [37]. Source code exhibits stronger structural and hierarchical features compared to natural language [3]. Representing code functionality with graph structures is more accurate and comprehensive in reflecting the intrinsic relationships between code fragments than using token sequences. Our Decpvd method employs Gated Graph Neural Networks to capture code graph information and obtain vector representations as semantic metrics.

First, we conduct standardized preprocessing on the source code fragments, which comprises three steps:

1. Removing comment information from the code to eliminate potential interference from natural language text;
2. Uniformly mapping user-defined variable names to standardized ones to avoid generating irrelevant features due to diverse variable naming;
3. Uniformly mapping user-defined function names to standardized ones to reduce the impact caused by differences in function naming styles.
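The three preprocessing steps can be sketched with regular expressions. This is a simplified illustration, not the authors' implementation: the placeholder names `VAR1`/`FUN1`, the short keyword and library-call lists, and the rule "an identifier followed by `(` is a function" are all assumptions, and the sketch does not handle string literals or type names correctly.

```python
import re

# Hypothetical, deliberately short allow-lists; a real pipeline needs fuller ones.
C_KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return",
              "sizeof", "struct", "unsigned", "long", "const", "static"}
LIBRARY_CALLS = {"malloc", "free", "memcpy", "strlen", "printf"}

COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
IDENT_RE = re.compile(r"\b[A-Za-z_]\w*\b")
CALL_RE = re.compile(r"\s*\(")


def normalize(code: str) -> str:
    """Strip comments, then map user-defined names to VARn / FUNn."""
    code = COMMENT_RE.sub("", code)          # step 1: remove comments
    var_map, fun_map = {}, {}

    def rename(match):
        name = match.group(0)
        if name in C_KEYWORDS or name in LIBRARY_CALLS:
            return name                       # keep language and library names
        if CALL_RE.match(code, match.end()):  # step 3: identifier followed by "("
            return fun_map.setdefault(name, f"FUN{len(fun_map) + 1}")
        return var_map.setdefault(name, f"VAR{len(var_map) + 1}")  # step 2

    return IDENT_RE.sub(rename, code)


src = "int add(int a, int b) { /* sum */ return a + b; // done\n}"
print(normalize(src))
```

Each distinct name receives the same placeholder everywhere it occurs, so the data and control dependencies between statements are preserved while project-specific naming is removed.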

In this study, the standardization and normalization of variable/function names primarily aim to eliminate cross-project feature distribution biases caused by differences in naming conventions (e.g., different projects may use distinct names for the same variable while maintaining consistent logic). This constitutes a critical step for feature space alignment in cross-project vulnerability detection. Regarding potential security signals embedded in variable/function names (e.g., keywords like “password” or “encrypt”), this study compensates for their potential loss by preserving the Program Dependence Graph (PDG) structural information and semantic logic of the code. The control-flow and data-flow dependencies captured by the PDG already encompass the core security semantics of the code, with shallow-level signals in naming contributing significantly less to vulnerability detection than structural semantic information.

Subsequently, we utilize the open-source static code analysis tool Joern [38] to extract Program Dependence Graphs (PDGs) from the preprocessed source code fragments. Joern is a mainstream tool for extracting graph-structured representations from C/C++ code. To address potential parsing errors (such as inaccuracies in parsing complex macro definitions or nested functions), we conducted a 10% manual sampling inspection of the extracted PDGs and filtered out functions that failed to be parsed correctly. PDGs can simultaneously capture data dependence relations and control dependence relations, establishing multimodal associations among nodes in the Abstract Syntax Tree (AST). Their hierarchical structure not only clearly presents the syntactic features of the code but also profoundly reveals its semantic logic. An example of this process is illustrated in Figure 2.

Since deep learning models require numerical vectors as inputs, we convert graph structures into feature vectors (including node features and graph structural features) through graph embedding techniques. For node features, after decomposing the code statements in each node into a list of tokens, we use the word2vec model [39] to embed them into fixed-length vectors. Regarding graph structural features, we employ adjacency matrices to vectorize the relationships within the graph. Specifically, we construct an adjacency matrix $A \in \mathbb{R}^{N \times N}$, where $N$ is the total number of nodes in the Program Dependence Graph (PDG). The matrix element $A_{i,j}$ represents the connection between the source node $v_i$ and the target node $v_j$: when a data dependency edge or a control dependency edge links them, $A_{i,j}$ is assigned the identification value of that edge type; when there is no connection, it is set to 0.
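A minimal sketch of this adjacency encoding follows. The concrete identification values (1 for data dependence, 2 for control dependence) and the function names are assumptions; the text only states that the two edge types receive different values and that missing edges are 0.

```python
# Assumed identification values; the source only says they differ per edge type.
NO_EDGE, DATA_DEP, CONTROL_DEP = 0, 1, 2


def build_adjacency(num_nodes, edges):
    """Build A (N x N) from (source, target, edge_type) triples, 0-based ids."""
    adj = [[NO_EDGE] * num_nodes for _ in range(num_nodes)]
    for src, dst, edge_type in edges:
        adj[src][dst] = edge_type
    return adj


def split_by_type(adj, edge_type):
    """Binary adjacency matrix for one edge type."""
    return [[1 if value == edge_type else 0 for value in row] for row in adj]


# Toy PDG with three statements, invented for illustration.
edges = [(0, 1, CONTROL_DEP), (0, 2, DATA_DEP), (1, 2, DATA_DEP)]
A = build_adjacency(3, edges)
print(A)
print(split_by_type(A, DATA_DEP))
```

A single matrix with one value per cell cannot record a node pair that has both a data and a control dependence; keeping one binary matrix per edge type, as `split_by_type` produces, avoids that collision and matches the per-edge-type matrices used in GGNN message passing.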

After transforming the source code into graph-structured data encompassing both data dependency and control dependency relationships, we employ Graph Neural Networks for further feature embedding on the graph samples. The input includes two types of data: labeled source-project graph embedding data and unlabeled target-project graph embedding data. Pre-training on labeled source-project graph data yields the corresponding graph feature vectors, while the model trained on the source project is used to extract feature vectors for unlabeled target-project graph data. In this study, the model parameters used for PDG (Program Dependence Graph) construction and feature extraction of the target project are reused from those of the source project. Moreover, the GGNN (Gated Graph Neural Network) model pre-trained on the source project is directly employed for semantic feature extraction of the target project. This approach ensures the uniformity of the semantic feature spaces between the source and target projects.

The GGNN [16] adopted for feature embedding in Decpvd iteratively aggregates information from each node and its neighbors and uses Gated Recurrent Units (GRUs) to update the node states over multiple time steps, producing the final node features. In detail, for each node $v_u$ in the graph, its node vector is initialized as $h_u^{(1)} = [m_u^{\top}, 0]^{\top}$, i.e., the initial feature vector $m_u$ of the node padded with zeros. Let $T$ denote the total number of time steps for neighborhood aggregation, and assume there are $k$ edge types in the graph, indexed by $p$, each corresponding to an adjacency matrix $A_p$. To obtain the propagated information of all nodes, at each time step $t \leq T$, all nodes transmit information according to the adjacency matrix $A_p$ of the respective edge type, which is mathematically expressed as:

$$a_{u,p}^{(t)} = A_p^{\top} \left( W_p \left[ h_1^{(t)\top}, \dots, h_m^{(t)\top} \right] + b \right) \tag{1}$$

Here, $a_{u,p}^{(t)}$ denotes the neighborhood aggregation information of node $v_u$ under edge type $p$ at time step $t$, $W_p$ is the learnable weight matrix for this edge type, $b$ is the bias term, and $h_1^{(t)\top}, \dots, h_m^{(t)\top}$ are the transposes of the state vectors of all neighbor nodes of $v_u$ at time step $t$. The neighborhood aggregation information of $v_u$ across all edge types is then merged via the aggregation function AGG, and the merged information is fed into the GRU together with the node's previous time-step state to update the node state, as follows:

$$h_u^{(t+1)} = \mathrm{GRU}\left( h_u^{(t)}, \mathrm{AGG}\left( \left\{ a_{u,p}^{(t)} \right\}_{p=1}^{k} \right) \right) \tag{2}$$

Here, $\mathrm{AGG}(\cdot)$ denotes an aggregation function (e.g., SUM/MEAN/MAX) for fusing neighborhood information from different edge types; $h_u^{(t)}$ and $h_u^{(t+1)}$ represent the state of node $v_u$ at time step $t$ and its updated state at time step $t+1$ after GRU processing, respectively.
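One propagation step of Equations (1) and (2) can be sketched in plain Python. Several details here are assumptions rather than the authors' configuration: SUM is used as AGG, the adjacency matrices are binary, the GRU follows the standard update/reset-gate formulation without gate biases, states are treated as row vectors (so `h W` stands in for `W h`), and the parameter names in the `gru` dictionary are hypothetical.

```python
import math


def vec_mat(v, M):
    """Multiply row vector v (length d_in) by matrix M (d_in x d_out)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_states(node_features, hidden_dim):
    """h_u^(1) = [m_u, 0]: pad each node feature vector with zeros."""
    return [list(m) + [0.0] * (hidden_dim - len(m)) for m in node_features]


def ggnn_step(H, adjs, Ws, b, gru):
    """One time step: message passing (Eq. 1), SUM aggregation, GRU update (Eq. 2).

    H: node states; adjs / Ws: one binary adjacency matrix and one weight
    matrix per edge type; b: shared bias; gru: dict of six d x d matrices.
    """
    n, d = len(H), len(H[0])
    agg = [[0.0] * d for _ in range(n)]
    for A, W in zip(adjs, Ws):
        # Transformed state of every node for this edge type: h_j W_p + b.
        msgs = [[m + bk for m, bk in zip(vec_mat(h, W), b)] for h in H]
        for j in range(n):
            for u in range(n):
                if A[j][u]:  # edge j -> u: node u receives node j's message
                    for k in range(d):
                        agg[u][k] += msgs[j][k]
    new_H = []
    for h, a in zip(H, agg):
        z = [sigmoid(x + y) for x, y in zip(vec_mat(a, gru["Wz"]), vec_mat(h, gru["Uz"]))]
        r = [sigmoid(x + y) for x, y in zip(vec_mat(a, gru["Wr"]), vec_mat(h, gru["Ur"]))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(x + y)
                for x, y in zip(vec_mat(a, gru["Wh"]), vec_mat(rh, gru["Uh"]))]
        new_H.append([(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)])
    return new_H


# Two nodes, one edge type, edge 0 -> 1; weights chosen by hand for illustration.
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z2 = [[0.0, 0.0], [0.0, 0.0]]
gru = {"Wz": Z2, "Uz": Z2, "Wr": Z2, "Ur": Z2, "Wh": I2, "Uh": Z2}
H1 = init_states([[1.0], [2.0]], hidden_dim=2)
H2 = ggnn_step(H1, [[[0, 1], [0, 0]]], [I2], [0.0, 0.0], gru)
print(H2)
```

Calling `ggnn_step` repeatedly for `T` steps lets information travel along paths of up to `T` edges, which is how a node's final state comes to reflect its wider dependence context.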

After PDG (Program Dependence Graph) construction, graph embedding, and iterative aggregation via the GGNN (Gated Graph Neural Network), the resulting semantic features are represented as fixed-length graph embedding feature vectors. Physically, these vectors capture the deep-level semantic features of the code from a logical perspective, precisely reflecting both data dependencies and control dependencies within the code. They reveal the intrinsic execution logic and semantic associations of the code, thereby addressing the limitation of expert features in representing the underlying logical structures of the code.

3.2. Model Construction Module

In the context of cross-project vulnerability detection, there often exist discrepancies in data distributions between the source domain and the target domain. To address this issue, the Model Construction Module constructs two transfer learning models using TrAdaBoost [17], specifically targeting expert metrics and semantic metrics respectively. This approach enhances the data-fitting ability of single-feature models. After completing the aforementioned model construction, to fully leverage the informational advantages embedded in different feature dimensions, we adopt the stacking ensemble strategy [18] to fuse the outputs of models built upon multiple feature dimensions.

3.2.1. Training of Transfer Learning Base Models Based on TrAdaBoost

To alleviate the issue of cross-domain distribution shift, this section employs TrAdaBoost as the transfer learning framework to achieve effective knowledge transfer from the source domain to the target domain. The core advantage of TrAdaBoost lies in its ability to dynamically adjust the weights of source domain samples, thereby minimizing the influence of source domain samples with substantial distribution differences from the target domain and enhancing the contributions of samples that facilitate effective transfer.

1. Expert metrics-based model

Taking $X_{\text{source}}^{\text{expert}}$ and $y_{\text{source}}^{\text{expert}}$ as inputs, the model training process consists of three core steps: Sample Weight Initialization, Base Learner Training with Weighted Error Rate Calculation, and Dynamic Sample Weight Updating.

Sample Weight Initialization: The initial weights of all samples in the source domain are set to a uniform distribution, ensuring that each sample contributes equally to model training in the initial state, as shown in Equation (3):

$$w_i^{1} = \frac{1}{n_s}, \quad i = 1, 2, \dots, n_s \tag{3}$$

Here, $w_i^{1}$ represents the initial weight of the $i$-th sample in the source domain during the first iteration, and $n_s$ denotes the total number of samples in the source domain.

Base Learner Training with Weighted Error Rate Calculation: In each iteration, an XGBoost base learner [40] is trained under the constraint of the current sample weights, and the weighted error rate of this iteration is calculated via Equation (4) to quantify the classification performance of the model:

$$\varepsilon_t = \frac{\sum_{i=1}^{n_s} w_i^{t} \cdot I\left(h_t(x_i) \neq y_i\right)}{\sum_{i=1}^{n_s} w_i^{t}} \tag{4}$$

In the formula, $\varepsilon_t$ represents the weighted error rate in the $t$-th iteration; $w_i^{t}$ is the weight of the $i$-th sample in the source domain during the $t$-th iteration; $h_t(x_i)$ denotes the predicted label of sample $x_i$ by the XGBoost base learner in the $t$-th iteration; $y_i$ is the true label of sample $x_i$; and $I(\cdot)$ is the indicator function (which takes a value of 1 when the prediction is incorrect and 0 when the prediction is correct).

Dynamic Sample Weight Updating: The sample weights are adjusted based on the weighted error rate to enhance the contribution of correctly classified samples and diminish the influence of incorrectly classified samples (noise samples). The updating formula is shown in Equation (5):

w_{i}^{t + 1} = w_{i}^{t} \cdot \exp (α_{t} \cdot I (h_{t} (x_{i}) = y_{i}))

(5)

Here, $w_{i}^{t + 1}$ represents the updated weight of sample i in the (t + 1)-th iteration and $α_{t} = \frac{1}{2} \ln (\frac{1 - ε_{t}}{ε_{t}})$ is the weight adjustment coefficient in the t-th iteration, which is negatively correlated with the weighted error rate (i.e., a lower error rate results in a larger adjustment coefficient). When the sample is correctly predicted, $I (h_{t} (x_{i}) = y_{i}) = 1$ and the weight is multiplied by $\exp (α_{t})$, which increases it whenever $ε_{t} < 0.5$. When the sample is misclassified, the indicator equals 0 and the weight is left unchanged, so its share of the total weight in Equation (4) decreases.

The model iteratively executes the aforementioned steps until the preset number of iterations is reached or the early stopping condition is satisfied, at which point the training is terminated.
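To make one round of Equations (3)–(5) concrete, the following is a minimal pure-Python sketch. It is an illustration under stated assumptions rather than the implementation used in this study: the function names are hypothetical, the base-learner predictions are supplied as a plain list instead of coming from XGBoost, and the clipping of the error rate is added only to keep the logarithm finite.

```python
import math

def init_weights(n_s):
    """Equation (3): uniform initial weights over the n_s source samples."""
    return [1.0 / n_s] * n_s

def weighted_error(weights, preds, labels):
    """Equation (4): weight mass on misclassified samples over total weight mass."""
    wrong = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
    return wrong / sum(weights)

def update_weights(weights, preds, labels, eps, clip=1e-10):
    """Equation (5): scale the weight of each correctly classified sample by exp(alpha_t)."""
    # Assumption (not in the paper): clip eps away from 0 and 1 so the log stays finite.
    eps = min(max(eps, clip), 1.0 - clip)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    new_weights = [w * math.exp(alpha) if p == y else w
                   for w, p, y in zip(weights, preds, labels)]
    return new_weights, alpha

# Toy round: four source samples, the third one is misclassified.
labels = [1, 0, 0, 1]
preds = [1, 0, 1, 1]
w1 = init_weights(len(labels))
eps1 = weighted_error(w1, preds, labels)
w2, alpha1 = update_weights(w1, preds, labels, eps1)
```

In this toy round the error rate is 0.25, so the three correctly classified samples are scaled by the same factor while the misclassified sample keeps its weight of 0.25.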

From the perspective of the optimization objective, Equation (5) can be decomposed into two cases corresponding to correct and incorrect classifications: the weights of correctly classified samples are amplified exponentially, while the weights of incorrectly classified samples remain unchanged. This operation is equivalent to minimizing the weighted empirical risk of the next iteration given by

R_{t + 1} = \frac{\sum_{i = 1}^{n_{s}} w_{i}^{t + 1} I (h_{t + 1} (x_{i}) \neq y_{i})}{\sum_{i = 1}^{n_{s}} w_{i}^{t + 1}}

, which makes subsequent base learners assign a higher optimization priority to correctly classified samples that match the target domain distribution, and gradually reduces the proportion of noisy samples or distribution-shifted samples in the source domain within the loss function. From the perspective of convergence characteristics, the sequence of weighted error rates

\{ε_{t}\}_{t = 1}^{T}

exhibits a monotonic non-increasing property during effective iterations. Base learners continuously optimize over the feature regions that were correctly classified in previous iterations, thus ensuring that the weight update process converges toward the direction of the minimum weighted empirical risk. In addition, samples matching the target domain distribution are consistently correctly classified across multiple iterations, with their weights amplified exponentially and ultimately becoming the dominant samples in source domain training. In contrast, the weights of distribution-shifted samples show no growth and are gradually neglected in the training process.
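The two-case decomposition of Equation (5) mentioned above can be written out explicitly by substituting the definition of the adjustment coefficient. The block below is a worked restatement of Equation (5) only and introduces no new quantities:

```latex
\exp(\alpha_t) = \sqrt{\frac{1 - \varepsilon_t}{\varepsilon_t}}
\quad\Longrightarrow\quad
w_i^{t+1} =
\begin{cases}
w_i^{t} \sqrt{\dfrac{1 - \varepsilon_t}{\varepsilon_t}}, & h_t(x_i) = y_i \\[2ex]
w_i^{t}, & h_t(x_i) \neq y_i
\end{cases}
```

The scaling factor exceeds 1 only when the weighted error rate is below 0.5. For example, an error rate of 0.25 gives a factor of $\sqrt{3} \approx 1.732$ for correctly classified samples.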

TrAdaBoost is prone to negative transfer risks in highly imbalanced scenarios with significant cross-domain discrepancies (i.e., noise samples from the source domain dominate training, leading to degraded performance in the target domain). This study effectively mitigates such risks through two key measures: (1) applying SMOTE oversampling to the source domain data to alleviate class imbalance and (2) introducing an early stopping threshold (50 iterations) in TrAdaBoost’s sample weight updating process, which halts training immediately when the weighted error rate ceases to decline, thereby preventing excessive weight accumulation on noisy samples.

2. Semantic metrics-based model

The semantic metrics-based model adopts the same TrAdaBoost framework, hyperparameter settings, and training process as the expert metrics-based model. The only difference lies in the input, which is replaced with the semantic feature subset

X_{source}^{semantic}

and its corresponding labels

y_{source}^{semantic}

. Training is accomplished through the same logic of weight initialization, error rate calculation, and weight updating, thereby facilitating the transfer of semantic feature knowledge from the source domain to the target domain.

3.2.2. Model Fusion Based on Stacking

Since single-feature transfer models struggle to capture the full range of feature information needed for vulnerability detection, Decpvd employs a stacking ensemble learning strategy. It feeds the outputs of the expert metrics-based model and the semantic metrics-based model into a meta-model. The stacking fusion process consists of two steps: meta-feature generation and meta-model training.

1. Meta-feature generation

A 5-fold cross-validation (5-fold CV) strategy is employed to process the source domain data to prevent overfitting:

Both the source domain expert feature data $X_{source}^{expert}$ and semantic feature data $X_{source}^{semantic}$ are partitioned into five non-overlapping subsets.
In each round of cross-validation, four subsets are selected as the training set, and the remaining subset serves as the validation set.
The expert metrics-based model and the semantic metrics-based model are fine-tuned using the training set. Subsequently, the prediction probabilities of both models are obtained using the validation set.
The prediction probabilities from the expert feature model $P_{expert}$ and the semantic feature model $P_{semantic}$ on the validation set are concatenated to form a two-dimensional meta-feature vector $P_{meta} = [P_{expert}, P_{semantic}]$ . Meanwhile, the true labels of the validation set are collected as the meta-labels $y_{meta}$ .

The aforementioned process is repeated until the completion of the 5-fold cross-validation. Subsequently, the meta-feature vectors and meta-labels from all validation sets are aggregated to form the training set for the meta-model.

D_{meta} = \{(P_{meta}^{(i)}, y_{meta}^{(i)}) ∣ i = 1, 2, \dots, N\}

(6)

Here, N represents the total amount of source domain data.
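The out-of-fold procedure behind Equation (6) can be sketched in pure Python as follows. This is a simplified illustration, not the authors' code: `fit_prior` is a hypothetical stand-in for the TrAdaBoost base models, the shuffling seed and the fold assignment rule are assumptions, and the toy features are made up.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Split range(n) into k non-overlapping folds after shuffling."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def out_of_fold_meta_features(X_expert, X_semantic, y, fit_expert, fit_semantic, k=5, seed=0):
    """Build D_meta: each sample is scored by models trained on the other k-1 folds."""
    n = len(y)
    P_meta = [None] * n
    for fold in kfold_indices(n, k, seed):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        y_train = [y[i] for i in train]
        predict_e = fit_expert([X_expert[i] for i in train], y_train)
        predict_s = fit_semantic([X_semantic[i] for i in train], y_train)
        for i in fold:
            # Two-dimensional meta-feature [P_expert, P_semantic] for sample i.
            P_meta[i] = [predict_e(X_expert[i]), predict_s(X_semantic[i])]
    return P_meta, list(y)

def fit_prior(X_train, y_train):
    """Hypothetical base learner: always predicts the training positive rate."""
    rate = sum(y_train) / len(y_train)
    return lambda x: rate

# Toy source data (made up for illustration).
X_e = [[float(i)] for i in range(10)]
X_s = [[float(-i)] for i in range(10)]
y = [0, 1] * 5
P_meta, y_meta = out_of_fold_meta_features(X_e, X_s, y, fit_prior, fit_prior)
```

Because every sample is scored by models that never saw it during training, the meta-model is fitted on predictions that resemble those it will receive at test time.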

2. Meta-model training

We select Logistic Regression as the meta-model due to its advantages of a simple structure and effective fitting capability for low-dimensional meta-features:

The logistic regression model is trained on the meta-training set $D_{meta}$, with the meta-feature vectors $P_{meta}$ as inputs and the meta-labels $y_{meta}$ as targets.
By learning the mapping relationship between the prediction probabilities of the two base models and the true labels, the meta-model adaptively adjusts the weights of the two base models. It ultimately outputs the fused prediction probabilities, achieving complementarity and enhancement among multiple feature models.

The stacking fusion process is capable of adaptively learning the optimal fusion weights for the two types of base models, effectively integrating the complementary information from expert features and semantic features. This provides more comprehensive predictive support for software vulnerability detection.

In the meta-model training phase of the stacking ensemble strategy adopted in this paper, the log-likelihood loss of the logistic regression meta-model is a convex function, whose learning dynamics essentially consist of the iterative evolution of the parameter vector

θ = {[θ_{0}, θ_{1}, θ_{2}]}^{⊤}

and the monotonic convergence of the loss function under the gradient descent method. Parameter updates are executed along the direction opposite to the gradient of the loss function, which, for a sufficiently small learning rate, ensures that the loss function is monotonically non-increasing and converges toward the global minimum, with no risk of becoming trapped in a local optimum. Herein, the learned values of parameters

θ_{1}

and

θ_{2}

directly realize the adaptive learning of the predictive contribution degrees of the two base models: if

θ_{1} > θ_{2}

, it indicates that the predictive information of the expert metrics-based model makes a greater contribution to vulnerability classification; if

θ_{2} > θ_{1}

, the semantic metrics-based model exerts a more prominent contribution. This mathematical mechanism of adaptive weight learning endows the stacking ensemble with a better ability to adapt to the characteristics of data distribution compared with fixed-weight fusion methods, thereby achieving the optimal fusion of feature information.
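The learning dynamics described above can be reproduced with a small batch gradient descent loop. The sketch below is illustrative only: the learning rate, the number of steps, and the toy meta-features are assumptions, and the study itself may rely on a library solver for logistic regression.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(theta, P_meta, y_meta):
    """Mean negative log-likelihood of the logistic meta-model."""
    total = 0.0
    for (p_e, p_s), y in zip(P_meta, y_meta):
        q = sigmoid(theta[0] + theta[1] * p_e + theta[2] * p_s)
        q = min(max(q, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total / len(y_meta)

def fit_meta_model(P_meta, y_meta, lr=0.5, steps=200):
    """Batch gradient descent on theta = [theta_0, theta_1, theta_2]."""
    theta = [0.0, 0.0, 0.0]
    losses = [log_loss(theta, P_meta, y_meta)]
    n = len(y_meta)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (p_e, p_s), y in zip(P_meta, y_meta):
            err = sigmoid(theta[0] + theta[1] * p_e + theta[2] * p_s) - y
            grad[0] += err / n
            grad[1] += err * p_e / n
            grad[2] += err * p_s / n
        # Step against the gradient of the convex loss.
        theta = [t - lr * g for t, g in zip(theta, grad)]
        losses.append(log_loss(theta, P_meta, y_meta))
    return theta, losses

# Toy meta-features [P_expert, P_semantic]; here the expert probability is more informative.
P_meta = [[0.9, 0.6], [0.8, 0.4], [0.7, 0.7], [0.2, 0.5], [0.3, 0.6], [0.1, 0.3]]
y_meta = [1, 1, 1, 0, 0, 0]
theta, losses = fit_meta_model(P_meta, y_meta)
```

Since both meta-features are probabilities on the same [0, 1] scale, comparing the learned $θ_{1}$ and $θ_{2}$ is a reasonable first indication of which base model the meta-model relies on more.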

3.3. Vulnerability Detection Module

The vulnerability detection module represents the final application stage of the model. Its core lies in leveraging the trained architecture of “dual-feature transfer base models + stacking meta-model” to perform binary vulnerability classification predictions on unlabeled data in the target domain. Algorithm 1 intuitively illustrates the end-to-end workflow of the Decpvd vulnerability detection module, and the core steps of this workflow are elaborated in detail below. Based on the trained models, the probability prediction and label conversion are accomplished in two steps:

1. Probability Prediction in the Target Domain by Dual-Feature Models

Algorithm 1 Decpvd Vulnerability Detection Algorithm (End-to-End)

Require: $X_{target}^{expert}$ (Target domain expert features), $X_{target}^{semantic}$ (Target domain semantic features), $M_{expert}$ (Pre-trained expert base model), $M_{semantic}$ (Pre-trained semantic base model), $M_{meta}$ (Pre-trained meta-model), $τ = 0.5$ (Classification threshold)
Ensure: ${\hat{Y}}_{target}$ (Final vulnerability detection labels of target domain)

1: Initialize empty sets: $P_{target}^{expert} = \emptyset$, $P_{target}^{semantic} = \emptyset$, ${\hat{Y}}_{target} = \emptyset$
2: # Step 1: Dual-feature model probability prediction for target domain
3: for each sample $x_{i} \in X_{target}^{expert}$ do
4:     $p_{i}^{expert} = M_{expert} (x_{i})$ {Forward propagation for expert feature positive probability}
5:     Add $p_{i}^{expert}$ to $P_{target}^{expert}$
6: end for
7: for each sample $x_{i} \in X_{target}^{semantic}$ do
8:     $p_{i}^{semantic} = M_{semantic} (x_{i})$ {Forward propagation for semantic feature positive probability}
9:     Add $p_{i}^{semantic}$ to $P_{target}^{semantic}$
10: end for
11: # Step 2: Meta-feature concatenation
12: for each $i \in \{1, 2, \dots, | X_{target}^{expert} |\}$ do
13:     $P_{target}^{meta} (i) = [P_{target}^{expert} (i), P_{target}^{semantic} (i)]$ {Generate 2D meta-feature vector}
14: end for
15: # Step 3: Meta-model fusion to get final probability
16: ${\hat{P}}_{target} = M_{meta} (P_{target}^{meta})$ {Fusion of dual-feature probabilities}
17: # Step 4: Binary label conversion by threshold
18: for each ${\hat{p}}_{i} \in {\hat{P}}_{target}$ do
19:     if ${\hat{p}}_{i} \geq τ$ then
20:         ${\hat{y}}_{i} = 1$ {Classified as vulnerable}
21:     else
22:         ${\hat{y}}_{i} = 0$ {Classified as non-vulnerable}
23:     end if
24:     Add ${\hat{y}}_{i}$ to ${\hat{Y}}_{target}$
25: end for
26: return ${\hat{Y}}_{target}$

Let the expert feature subset of the target domain data be denoted as

X_{target}^{expert}

and the semantic feature subset as

X_{target}^{semantic}

. These are separately input into their corresponding transfer base models:

Expert Feature Model Prediction: Input $X_{target}^{expert}$ into the trained expert feature transfer model, which outputs the positive class prediction probability $P_{target}^{expert} \in [0, 1]$ for the target domain samples. (The closer $P_{target}^{expert}$ is to 1, the higher the confidence that the sample is classified as “vulnerable”.)
Semantic Feature Model Prediction: Similarly, input $X_{target}^{semantic}$ into the semantic feature transfer model to obtain the positive class prediction probability $P_{target}^{semantic} \in [0, 1]$ for the target domain samples.

2. Final Label Generation by the Stacking Meta-Model

Fuse the prediction probabilities from the dual-feature models to obtain the final classification labels:

Meta-Feature Concatenation: Concatenate $P_{target}^{expert}$ and $P_{target}^{semantic}$ for the target domain samples into a two-dimensional target domain meta-feature vector $P_{target}^{meta} = [P_{target}^{expert}, P_{target}^{semantic}]$ .
Meta-Model Probability Output: Input $P_{target}^{meta}$ into the trained stacking meta-model (logistic regression) to obtain the final positive class prediction probability ${\hat{P}}_{target} \in [0, 1]$ .
Binary Label Conversion: Use 0.5 as the classification threshold to convert the continuous probability into a discrete label:

${\hat{y}}_{target} = \begin{cases} 1, & {\hat{P}}_{target} \geq 0.5 \ \text{(classified as “vulnerable”)} \\ 0, & {\hat{P}}_{target} < 0.5 \ \text{(classified as “non-vulnerable”)} \end{cases}$

(7)

where ${\hat{y}}_{target}$ represents the final prediction label for the target domain samples.
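Steps 2 to 4 of Algorithm 1 (concatenation, logistic fusion, and thresholding via Equation (7)) reduce to a few lines once the base-model probabilities are available. The following sketch assumes those probabilities have already been computed; the function name and the meta-model parameters `theta` are hypothetical placeholders rather than values reported in this paper.

```python
import math

def fuse_and_classify(p_expert, p_semantic, theta, tau=0.5):
    """Fuse two base-model probabilities with a logistic meta-model and apply threshold tau."""
    fused, labels = [], []
    for p_e, p_s in zip(p_expert, p_semantic):
        # Logistic regression on the 2D meta-feature [p_e, p_s].
        z = theta[0] + theta[1] * p_e + theta[2] * p_s
        p_hat = 1.0 / (1.0 + math.exp(-z))
        fused.append(p_hat)
        # Equation (7): label 1 (vulnerable) if p_hat >= tau, else 0.
        labels.append(1 if p_hat >= tau else 0)
    return fused, labels

theta = [-4.0, 5.0, 3.0]      # hypothetical meta-model parameters, for illustration only
p_expert = [0.9, 0.2, 0.6]    # made-up expert-model probabilities
p_semantic = [0.8, 0.1, 0.3]  # made-up semantic-model probabilities
fused, labels = fuse_and_classify(p_expert, p_semantic, theta)
```

With these illustrative inputs, only the first sample receives a fused probability above 0.5 and is therefore labeled as vulnerable.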

In this study, a threshold of 0.5 is selected for binary classification, primarily based on three considerations:

First, 0.5 represents the universal benchmark threshold for probability-based decision-making in binary classification tasks. In scenarios without specific business biases, this threshold serves as the equal-probability boundary for determining whether a sample belongs to the positive or negative class.

Second, this threshold has been validated through hyperparameter ablation experiments (see Section 6.2 for details), where the model achieved optimal performance in terms of the AUC metric when using a threshold of 0.5.

Third, for the cross-project software vulnerability detection task, this study aims to achieve a balanced optimization between false positives and false negatives: a low threshold would lead to a significant increase in the false positive rate, thereby raising the manual cost of code review; conversely, a high threshold would result in a surge in the false negative rate, undermining effective vulnerability detection. The threshold of 0.5 strikes a balance between these two extremes, aligning with the engineering application orientation of this method.

4. Experiment

In this section, we describe our experimental setup, encompassing the research questions we formulated, the datasets employed, the baseline methods compared, the evaluation metrics utilized, and the implementation details.

4.1. Research Questions

To evaluate the effectiveness of our proposed Decpvd and analyze the contributions of its individual components, we have formulated the following three research questions:

RQ1: How effective is Decpvd in cross-project software vulnerability detection?

In this Research Question, we aim to evaluate how Decpvd performs in terms of vulnerability detection capability in cross-project software scenarios compared to existing software vulnerability detection methods. To this end, we have selected five representative baselines and conducted comparative experiments on real-world project datasets.

RQ2: Can the effective integration of two types of metrics enhance the performance of Decpvd?

In this Research Question, our goal is to assess how much the integration of the two types of metrics contributes to the performance of Decpvd and to further validate the superiority of the adaptive fusion method over fixed-weight fusion mechanisms. We train models that combine the two metrics under different fixed-weight fusion settings and then compare them with our adaptive fusion approach.

RQ3: How do the Gated Graph Neural Network and model ensemble synergistically affect the performance of Decpvd?

In this Research Question, our aim is to validate the synergistic value of the Gated Graph Neural Network (GGNN) and model fusion within Decpvd: GGNN takes charge of uncovering implicit vulnerability features at the code semantic level, while model fusion centers on the complementary integration of multi-feature models. We employ ablation experiments to eliminate the possibility that the performance enhancement merely results from a single component.

4.2. Dataset

In our experiments, we utilized three real-world projects (FFmpeg, LibTIFF, and LibPNG) as our experimental subjects. These experimental subjects have already been employed in previous cross-project software vulnerability detection research [9,11,12,41]. Specifically, FFmpeg is an open-source, cross-platform multimedia processing toolkit. LibTIFF is an open-source C library dedicated to processing and manipulating TIFF image files. Similarly, LibPNG is an open-source C library primarily used for handling and operating PNG image files. The vulnerability data was primarily annotated from two major sources: the National Vulnerability Database (NVD) and Common Vulnerabilities and Exposures (CVE). Both sources use CVE identifiers as unique identifiers, facilitating the differentiation of individual vulnerabilities. Table 2 shows the statistical details of our experimental subjects. To simulate the cross-project vulnerability detection scenarios, we sequentially selected one project as the source project and the other two as target projects. For example, if FFmpeg was designated as the source project, then LibTIFF or LibPNG would be set as the target projects.

To ensure the reproducibility of the experiments, the key data processing details are summarized as follows:

1. Code function extraction: The Joern tool was used for function-level slicing of the source files of C/C++ projects. Independent executable functions were extracted as basic sample units, while empty functions, comment functions, and test functions were filtered out to avoid interfering with the experimental results.

2. Vulnerability labeling standards: Based on the NVD and CVE databases, vulnerable functions in the project source code were matched through CVE identifiers and labeled as 1; normal functions not matched with any CVE identifiers were labeled as 0.

3. Cross-project data division: We employed a division strategy where all labeled samples from the source project, along with 15% of the labeled samples from the target project (selected randomly), were used to form the training set. The remaining 85% of the labeled samples from the target project were designated as the test set. This approach was consistently applied across all six cross-project combinations, ensuring no sample overlap between the training and test sets within each combination to accurately simulate real-world cross-project vulnerability detection scenarios.
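The division strategy in item 3 can be sketched as follows. This is a minimal illustration: the function name, the random seed, and the rounding rule for the 15% portion are assumptions, and the text does not state whether the random selection is stratified by label, so the sketch uses plain random sampling.

```python
import random

def cross_project_split(source, target, target_train_frac=0.15, seed=42):
    """Train on all source samples plus a random fraction of target samples; test on the rest."""
    rng = random.Random(seed)  # assumption: fixed seed for repeatability
    idx = list(range(len(target)))
    rng.shuffle(idx)
    n_train = round(target_train_frac * len(target))  # assumption: round to nearest integer
    train = list(source) + [target[i] for i in idx[:n_train]]
    test = [target[i] for i in idx[n_train:]]
    return train, test

# Toy samples identified by (project, index) pairs.
source = [("src", i) for i in range(50)]
target = [("tgt", i) for i in range(100)]
train, test = cross_project_split(source, target)
```

Because the target indices are shuffled once and then cut at a single position, no target sample can appear in both the training and the test set.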

4.3. Baseline Methods

To comprehensively and objectively evaluate the performance of the Decpvd method, we have selected five typical and representative methods in the field of software vulnerability detection as baseline comparison models, as detailed below:

ReGVD [4] is a software source code vulnerability detection model based on Graph Neural Networks (GNNs). It transforms source code into a flat sequence of tokens and constructs a graph structure, where node features are initialized solely by the token embedding layer of a pre-trained programming language model. By employing residual connections and combining graph-level sum and max pooling, ReGVD generates graph embeddings for the source code.

Devign [3] is a method that employs Graph Neural Networks (GNNs) for vulnerability identification. In terms of its specific workflow, it first extracts graphical information from functions and utilizes this information to construct a structural representation of the code. Subsequently, it transmits these graphical data to a Gated Graph Neural Network (GGNN), which then performs classification on the code.

Dual-GD-DDAN [9] utilizes a bidirectional recurrent neural network to extract semantic metrics. This model is equipped with two distinct generators, whose functions are, respectively, to obtain the code sequences of the source project and the target project. Furthermore, Dual-GD-DDAN incorporates two discriminators, with each one specifically tasked with conducting discrimination operations on the two distinct projects.

DAM2P [11] is a deep domain adaptation method specifically designed for cross-project imbalanced software vulnerability detection. It fundamentally integrates deep domain adaptation with the maximum-margin principle to address two critical challenges: the scarcity of labeled vulnerability data and automatic feature learning. This method learns domain-invariant features through adversarial training between a bidirectional RNN generator and a GAN (Generative Adversarial Network), thereby narrowing the distribution gap between the source domain and the target domain. Additionally, it devises a cross-domain kernel classifier that leverages the maximum-margin principle to effectively handle the data imbalance characteristic.

CSVD-TF [12] is a transfer learning approach tailored for cross-project software vulnerability detection. Its core lies in integrating the complementary strengths of expert metrics and semantic metrics through the TrAdaBoost transfer learning framework. In this method, it first employs the UnderStand tool to extract 39 types of expert metrics. Subsequently, it utilizes CodeBERT combined with BERT-Whitening to extract semantic metrics. Separately, models are constructed using XGBoost as the base classifier for each type of metric. Finally, a model-level fusion with a weight ratio of 0.4:0.6 is adopted to obtain the final results.

4.4. Evaluation Metrics

In our experiments, we adopted AUC (Area Under the ROC Curve) and MCC (Matthews Correlation Coefficient) as evaluation metrics. Both of them are well-suited for vulnerability detection scenarios with class imbalance and can comprehensively reflect the model’s performance.

AUC is the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the decision threshold varies, and thus measures the model’s discrimination ability. Its value ranges within [0, 1], and a value closer to 1 indicates better detection performance. TPR and FPR are calculated as follows:

\begin{cases} TPR = \frac{TP}{TP + FN} \\ FPR = \frac{FP}{FP + TN} \end{cases}

(8)

MCC takes into account all four elements of the confusion matrix (TP, TN, FP, and FN) and can effectively avoid evaluation bias caused by class imbalance. Its value ranges within [−1, 1], where 1 indicates a perfect prediction, 0 indicates a random prediction, and −1 indicates a completely reversed prediction. The formula is as follows:

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}

(9)
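For reference, Equations (8) and (9) and a rank-based AUC can be computed with the short pure-Python sketch below. The helper names and the toy labels and scores are illustrative; returning 0 when the MCC denominator vanishes is a common convention assumed here, and the AUC is computed as the fraction of positive-negative pairs ranked correctly, with ties counted as one half.

```python
import math

def confusion(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def tpr_fpr(tp, fp, tn, fn):
    """Equation (8)."""
    return tp / (tp + fn), fp / (fp + tn)

def mcc(tp, fp, tn, fn):
    """Equation (9); assumption: return 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced example (made up): 3 vulnerable and 5 non-vulnerable functions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
tp, fp, tn, fn = confusion(y_true, y_pred)
tpr, fpr = tpr_fpr(tp, fp, tn, fn)
mcc_value = mcc(tp, fp, tn, fn)
auc_value = auc(y_true, scores)
```

Note that the AUC depends only on how the scores rank the samples, whereas TPR, FPR, and MCC depend on the labels obtained after thresholding.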

4.5. Implementation Details

In this study, all experiments were conducted on a GPU server equipped with an A800 GPU featuring 80 GB of memory. The implementation of Decpvd was based on the deep learning framework PyTorch (version 2.0.1), and the programming language used was Python 3.10. During the Program Dependence Graph (PDG) generation phase for extracting semantic metrics, the Joern tool (version 1.1.172) was employed for feature extraction.

To enhance the reproducibility of the Decpvd model, this paper systematically summarizes the hyperparameters, corresponding values and selection methods for each of its core modules, as presented in Table 3.

Notably, module hyperparameters not listed in Table 3 (e.g., $β_{1} = 0.9$ and $β_{2} = 0.999$ for the Adam optimizer, and a subsampling ratio of 1.0 for XGBoost) are set to the default values of their respective frameworks and kept fixed throughout the experiments.

5. Results

5.1. RQ1: How Effective Is Decpvd in Cross-Project Software Vulnerability Detection?

To address RQ1, we selected five representative baselines (ReGVD [4], Devign [3], Dual-GD-DDAN [9], DAM2P [11], and CSVD-TF [12]) to conduct cross-project vulnerability detection comparison experiments on real-world project datasets (FFmpeg, LibTIFF, and LibPNG). We compared their performance based on the AUC and MCC metrics.

In Table 4, we present comparison results between our approach, Decpvd, and the five baselines for six project combinations. As shown in the table, Decpvd achieves the leading AUC values across all combinations, with an average of 0.814, significantly outperforming all baseline models. This result validates the effectiveness of its dual-feature stacking ensemble architecture. Specifically, by leveraging TrAdaboost transfer learning to separately explore the values of expert features and semantic features and then integrating them through stacking to achieve feature complementarity, it effectively enhances feature adaptability and recognition accuracy in cross-project scenarios.

We employed the Wilcoxon signed-rank test [42] to statistically validate our experimental results. As shown in Table 4, all calculated p-values are less than 0.05, which indicates that the performance improvement of Decpvd over the baseline models is statistically significant.

ReGVD achieves a performance close to Decpvd in one scenario (FFmpeg→LibPNG, AUC = 0.842), but there is a significant gap in overall average AUC (0.573). As a Graph Neural Network-based model, ReGVD can capture code syntactic features through token embeddings and graph structure modeling. However, it lacks a transfer learning mechanism tailored to address cross-project domain discrepancies, resulting in a sharp performance decline in scenarios with substantial distribution differences between source and target projects (e.g., LibPNG→LibTIFF, AUC = 0.367). Devign (average AUC = 0.634) models code graph structures using a GGNN but fails to incorporate transfer learning components, making it ill-equipped to handle domain variations in coding styles and vulnerability distributions across projects. CSVD-TF (average AUC = 0.686), while integrating expert metrics and semantic metrics, employs a fixed-weight (0.4:0.6) model-level fusion strategy, lacking the meta-feature adaptive learning capability of stacking ensembles. Consequently, it cannot dynamically adjust feature weights across different cross-project scenarios, leading to an inferior performance compared to Decpvd. DAM2P (average AUC = 0.652) designs a cross-domain kernel classifier to address data imbalance issues, while Dual-GD-DDAN (average AUC = 0.628) extracts semantic features using bidirectional RNNs and narrows domain gaps through a dual generator–discriminator architecture. However, both rely solely on single semantic features and lack the engineering attribute support from expert features, resulting in limited discriminative power.

Overall, Decpvd maintains a consistently high performance across all cross-project scenarios, demonstrating particularly pronounced advantages in more challenging settings (e.g., LibPNG→FFmpeg, AUC = 0.845; LibTIFF→FFmpeg, AUC = 0.848). This indicates that its stacked ensemble architecture exhibits strong robustness against distribution discrepancies between different source–target project pairs. In contrast, baseline models generally exhibit scenario dependence: for instance, CSVD-TF performs relatively well in FFmpeg→LibPNG (AUC = 0.783) and LibTIFF→FFmpeg (AUC = 0.789) but suffers a dramatic performance drop in LibPNG→FFmpeg (AUC = 0.53). This reflects the limitations of fixed-weight fusion and single-transfer frameworks in adapting to diverse cross-project scenarios.

Figure 3 shows the performance comparison between Decpvd and the five baseline models in terms of MCC. Consistent with the AUC results, Decpvd maintains consistently high MCC values across all scenarios, while the baseline models exhibit significant fluctuations. In scenarios with substantial domain discrepancies, the MCC values of baseline models approach 0 (equivalent to random classification), while Decpvd still maintains high classification accuracy. This further validates the superiority of its dual-feature stacking ensemble architecture in addressing cross-project imbalance scenarios.

5.2. RQ2: Can the Effective Integration of Two Types of Metrics Enhance the Performance of Decpvd?

To address RQ2, we compared the performance of Decpvd using fixed-weight fusion methods against that using stacking-based ensemble approaches. Specifically, we trained models with different fixed-weight mechanisms for combining two metrics and benchmarked their performance against the Decpvd method. The results are presented in Figure 4.

As shown in Figure 4, none of the fixed-weight fusion strategies outperformed Decpvd in terms of AUC performance. Moreover, these approaches exhibited significant scenario dependence and uncertainty regarding optimal weight selection.

The optimal fixed-weight ratios exhibit significant variability across different cross-project scenarios. For instance, in the “FFmpeg→LibPNG” scenario, a weight ratio of 0.7:0.3 demonstrates a relatively superior performance, whereas in the “LibPNG→FFmpeg” scenario, a 0.8:0.2 ratio proves more effective. In the “LibPNG→LibTIFF” scenario, a 0.6:0.4 ratio approaches the performance peak for this category of fusion strategies. This variability stems from differing demands for feature types across source–target project pairs—e.g., engineering-oriented code features are more critical in some scenarios, while semantic-logical features dominate in others. Fixed-weight approaches fail to dynamically adapt to such domain-specific disparities.
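For comparison with stacking, fixed-weight fusion can be sketched as a convex combination of the two base-model probabilities. This is an assumed form: the text reports only the ratios, so the function name, the parameterization by the expert weight, and the toy probabilities below are illustrative.

```python
def fixed_weight_fusion(p_expert, p_semantic, w_expert):
    """Combine probabilities with a fixed expert weight; the semantic model gets 1 - w_expert."""
    if not 0.0 <= w_expert <= 1.0:
        raise ValueError("w_expert must lie in [0, 1]")
    return [w_expert * p_e + (1.0 - w_expert) * p_s
            for p_e, p_s in zip(p_expert, p_semantic)]

# Made-up base-model probabilities for three target samples.
p_expert = [0.9, 0.2, 0.6]
p_semantic = [0.8, 0.1, 0.3]

# Sweep the expert weight from 0.0 to 1.0 in steps of 0.1.
sweep = {w / 10: fixed_weight_fusion(p_expert, p_semantic, w / 10) for w in range(11)}
```

Each entry of `sweep` corresponds to one fixed ratio. Such a weight has to be chosen separately for every source–target pair, whereas the stacking meta-model learns its coefficients from the out-of-fold predictions.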

The average AUC performance of fixed-weight fusion methods is significantly inferior to that of Decpvd: across all fixed-weight schemes, the mean AUC values cluster within the range of 0.60–0.75, whereas Decpvd achieves an average AUC of 0.814, indicating a substantial performance gap. Even the optimal fixed-weight combinations in individual scenarios fail to surpass Decpvd’s corresponding AUC values, demonstrating that feature fusion with fixed ratios cannot fully exploit the complementary strengths of expert and semantic features. In contrast, Decpvd’s stacking-based ensemble approach enables adaptive learning of feature integration strategies through a meta-model, thereby achieving effective synergies between engineering-oriented expert features and semantic–logical features.

5.3. RQ3: How Do the Gated Graph Neural Network and Model Ensemble Synergistically Affect the Performance of Decpvd?

To address RQ3, we performed ablation experiments in the Decpvd framework, comparing two dimensions affecting vulnerability detection:

1. The GGNN’s role in feature extraction: We contrasted GGNN-based extraction with the pre-trained model CodeBERT [36] (w/o GGNN).

2. Model ensemble effectiveness: We tested performance with and without ensemble (w/o ensemble), uncovering the mechanisms behind its gains.

The ablation experiments in this study primarily focus on evaluating the performance contribution of our core innovation (the integration of GGNN and stacking). We did not conduct systematic ablation tests on the transfer learning algorithms or the types of meta-models, as these two components represent established and mature technologies. The core focus of our research lies in their combination and optimization, rather than redesigning them from scratch.

As shown in Figure 5, comparing the AUC scores of the complete Decpvd model and its “w/o GGNN” variant demonstrates the effectiveness of the GGNN in feature extraction. The complete Decpvd model achieves an average AUC of 0.814, whereas removing the GGNN reduces this to 0.804, indicating the GGNN’s role in enhancing feature quality. By modeling structural dependencies in code (e.g., control flow and data flow relationships), the GGNN compensates for CodeBERT’s limitations in capturing syntactic structures, thereby producing features with stronger vulnerability discrimination capabilities. Notably, the GGNN’s impact varies across scenarios. In the “FFmpeg→LibTIFF” (0.791 with the GGNN vs. 0.808 without) and “LibTIFF→LibPNG” (0.786 with the GGNN vs. 0.798 without) transfer learning tasks, removing the GGNN slightly improved AUC performance. The reason is that the target projects (LibTIFF/LibPNG) in these two scenarios are lightweight image processing libraries characterized by simple function structures and straightforward control/data flow dependencies. Consequently, the graph structures constructed from the PDG contain redundant information. In such cases, the GGNN’s advantage in extracting features from complex graph structures cannot be effectively leveraged; instead, the graph node aggregation process introduces slight noise, so that semantic features extracted solely by CodeBERT perform better. This phenomenon is specific to cross-project scenarios involving simple code structures. In contrast, for scenarios with complex code structures, such as FFmpeg→LibPNG, the GGNN’s graph feature extraction remains the key factor driving performance improvements.

A comparison of Decpvd with its “w/o ensemble” variant shows that model ensembling is the main source of Decpvd’s performance. Removing the ensemble module lowers the average AUC from 0.814 to 0.708 (a 0.106 decrease), the largest drop caused by removing any single component. Decpvd’s stacking ensemble integrates the complementary strengths of the expert feature-based model and the semantic feature-based model: a meta-model learns the relationship between their predicted probabilities and the true labels. The absence of the ensemble module is particularly detrimental in challenging scenarios: in the “LibPNG→FFmpeg” transfer task, the AUC falls from 0.845 to 0.527 (a 0.318 decrease), while in “LibPNG→LibTIFF”, it falls from 0.747 to 0.546 (a 0.201 decrease). These results indicate that the ensemble module substantially mitigates scenario-specific noise through the adaptive fusion of multi-model predictions.
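The fusion step described above can be illustrated with a minimal sketch. It assumes the meta-model is a plain logistic regression trained by gradient descent on the two base-model probabilities; the text does not specify the meta-model type, so this choice, the function names, and the toy probabilities are all hypothetical. In practice the inputs would be out-of-fold predictions from the K-fold procedure listed in Table 3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta_model(base_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-model on base-model probabilities.

    base_probs: list of (p_expert, p_semantic) pairs, ideally out-of-fold
    predictions so the meta-model is not trained on leaked labels.
    """
    w = [0.0, 0.0]
    b = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (p1, p2), y in zip(base_probs, labels):
            err = sigmoid(w[0] * p1 + w[1] * p2 + b) - y
            grad_w[0] += err * p1
            grad_w[1] += err * p2
            grad_b += err
        # Average-gradient step on the log-loss
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b

def predict_meta(w, b, p_expert, p_semantic):
    """Fused vulnerability probability from the two base-model outputs."""
    return sigmoid(w[0] * p_expert + w[1] * p_semantic + b)

# Toy (hypothetical) data: the expert model is informative, the semantic model is noisy.
base_probs = [(0.9, 0.4), (0.8, 0.6), (0.7, 0.5), (0.85, 0.3),
              (0.2, 0.6), (0.1, 0.4), (0.3, 0.5), (0.15, 0.7)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_meta_model(base_probs, labels)
```

On this toy data the learned weight for the informative base model ends up larger than the weight for the noisy one. This is the sense in which stacking adapts the fusion to each scenario instead of fixing the weights in advance.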

A comparative evaluation of all four model variants reveals that the “w/o GGNN&ensemble” configuration achieves the lowest average AUC (0.703) compared to models lacking only individual components. This performance pattern underscores the synergistic relationship between the GGNN and the ensemble module: the GGNN provides high-quality foundational features as input to the ensemble, while the ensemble module maximizes the discriminative value of GGNN-derived features through complementary integration.

6. Discussion

In this section, we discuss several aspects associated with our study.

6.1. Impact of Feature Selection and Feature Importance Weighting Approaches

To further verify the practical value of the feature selection and feature importance weighting methods adopted in this study, we conducted additional comparative experiments, using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) as the evaluation metric. We analyzed the AUC results under three experimental setups (expert features only, semantic features only, and fused features that combine the two) to evaluate how the feature selection and integration strategy affects vulnerability detection performance. The experimental results are presented in Figure 6.

From the results of AUC distributions for different feature types, it can be observed that single feature types exhibit obvious limitations in their discriminative performance. Neither the AUC value distributions of expert features nor those of semantic features deliver the optimal discriminative performance, whereas the AUC distribution of the feature set after integrated fusion is significantly superior to that of the two single feature types overall. This result directly verifies the effectiveness of the feature selection and feature importance weighting methods adopted in this study.

In a more in-depth analysis, expert features are constructed based on professional domain expertise and possess strong domain specificity, thus serving as the foundation for the discriminative performance of the model. However, such features fail to explore the underlying correlations behind the data. In contrast, semantic features are capable of mining hidden semantic correlations from the textual and structural dimensions of data, thereby enriching the dimensional coverage of features. Nevertheless, they tend to introduce irrelevant noise due to the inherent complexity of the data itself, which restricts the discriminative performance of standalone semantic features. In this study, we realize the fusion of these two types of standalone features: on the one hand, the core domain information of expert features and the valid correlation information of semantic features are retained; on the other hand, the contribution weights of the two feature types are rationally assigned by means of ensemble learning. This integration ultimately achieves a significant improvement in the discriminative capability of the feature set.
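For reference, the AUC used in this comparison can be computed directly from prediction scores as the probability that a randomly chosen vulnerable function is ranked above a randomly chosen non-vulnerable one. The sketch below is a plain pairwise implementation. The three score lists are hypothetical toy values chosen only to show the calculation and are not results from the experiments.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy scores for five functions (1 = vulnerable).
labels = [1, 1, 0, 0, 0]
expert_scores = [0.8, 0.3, 0.4, 0.2, 0.1]
semantic_scores = [0.4, 0.7, 0.5, 0.6, 0.1]
fused_scores = [0.6, 0.5, 0.45, 0.4, 0.1]

auc_expert = auc_score(labels, expert_scores)      # 5 of 6 pairs ordered correctly
auc_semantic = auc_score(labels, semantic_scores)  # 4 of 6 pairs ordered correctly
auc_fused = auc_score(labels, fused_scores)        # all 6 pairs ordered correctly
```

In this toy case each single feature type misranks a different function, and the fused scores rank all pairs correctly.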

6.2. Impact of Key Hyperparameters

To investigate how key hyperparameters influence model performance, we designed hyperparameter ablation experiments. Specifically, we adhered to the univariate control method (where only the target hyperparameter was varied while all others were fixed at the optimal values reported in the paper). We focused on the three most critical hyperparameters of the Decpvd model: the number of GGNN (Gated Graph Neural Network) iterations, the maximum depth of decision trees in XGBoost, and the binary classification threshold for stacking. We selected the cross-project dataset combination of FFmpeg→LibPNG and employed the AUC (Area Under the Curve) metric to evaluate model performance. The results are presented in Figure 7.

As can be seen from Figure 7, the three core hyperparameters exert a significant impact on the cross-project vulnerability detection performance (AUC) of the Decpvd model:

1. GGNN Iteration Steps: The AUC reaches its peak at 6 iterations. At 3 iterations, the extraction of semantic features is insufficient, while at 9 iterations, overfitting occurs, leading to a slight decline in performance.

2. Stacking Classification Threshold: The highest AUC is achieved at a threshold of 0.5. Deviations from this value (e.g., 0.3 or 0.7) result in biases in binary classification decisions, causing a noticeable drop in performance.

3. XGBoost Maximum Tree Depth: The optimal AUC is attained at a maximum tree depth of 6. At a depth of 3, the modeling of features suffers from underfitting, whereas at a depth of 10, the risk of overfitting increases.

In summary, the experiments validate the rationality of the hyperparameter configuration adopted in the paper (six iterations for GGNN, a stacking threshold of 0.5, and a maximum tree depth of six for XGBoost). This configuration effectively balances feature extraction capability with model generalization, providing reliable support for cross-project vulnerability detection performance.
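The univariate control procedure used in this subsection can be sketched as a one-at-a-time sweep. The default values and candidate grids below are taken from Table 3. The dictionary keys, the function names, and the scoring function are hypothetical: `placeholder_score` stands in for "train Decpvd with this configuration and return the AUC" and does not produce real results.

```python
def univariate_sweep(evaluate, defaults, grid):
    """Vary one hyperparameter at a time, holding all others at their defaults."""
    results = {}
    for name, candidates in grid.items():
        for value in candidates:
            config = dict(defaults)  # copy so the other settings stay fixed
            config[name] = value
            results[(name, value)] = evaluate(config)
    return results

# Defaults and candidate values as listed in Table 3 (key names are assumptions).
DEFAULTS = {"ggnn_steps": 6, "xgb_max_depth": 6, "stacking_threshold": 0.5}
GRID = {
    "ggnn_steps": [3, 6, 9],
    "xgb_max_depth": [3, 6, 10],
    "stacking_threshold": [0.3, 0.5, 0.7],
}

def placeholder_score(config):
    # Hypothetical stand-in for a real training run; NOT an experimental result.
    return -sum(abs(config[k] - DEFAULTS[k]) for k in DEFAULTS)

results = univariate_sweep(placeholder_score, DEFAULTS, GRID)
best = {name: max(values, key=lambda v: results[(name, v)])
        for name, values in GRID.items()}
```

Each hyperparameter contributes three runs, so the sweep costs nine evaluations instead of the 27 a full grid over the three parameters would need. The trade-off is that interactions between hyperparameters are not explored.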

6.3. Efficiency Comparison Between Decpvd and Baseline Models

In cross-project software vulnerability detection, efficiency matters as much as detection accuracy: under a limited code review budget, the ability to discover more vulnerabilities for the same effort directly determines the engineering practicality of a method. Following previous studies on defect prediction [43,44], we measure code review cost in lines of code (LOC), adopt Recall@20% Effort (R@20%E) as the core evaluation metric, and focus on the comparison of efficiency.

R@20%E is defined as the ratio of the number of vulnerabilities successfully identified by the model to the total number of actual vulnerabilities when 20% of code review costs are invested. The higher this metric, the stronger the vulnerability detection efficiency of the model under limited review costs. We compare Decpvd with other vulnerability detection models, and the results are shown in Figure 8.

As shown in Figure 8, Decpvd achieves the highest R@20%E value (approximately 0.37), ahead of all compared methods. Devign (approximately 0.355) and ReGVD (approximately 0.33) follow, while the R@20%E values of DAM2P, CSVD-TF, and Dual-GD-DDAN decrease in that order, with Dual-GD-DDAN showing the weakest efficiency (approximately 0.245). This result indicates that Decpvd can locate vulnerabilities more efficiently under a limited code review budget, supporting its engineering practicality in cross-project software vulnerability detection.
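The R@20%E definition above can be made concrete with a short sketch. Two details are assumptions, since the text does not fix them: functions are inspected in descending order of predicted probability, and inspection stops at the first function that would exceed the LOC budget. The function name and the toy data are hypothetical.

```python
def recall_at_effort(functions, effort_ratio=0.2):
    """Recall achieved when reviewing at most `effort_ratio` of the total LOC.

    functions: list of (predicted_probability, loc, is_vulnerable) tuples.
    """
    total_loc = sum(loc for _, loc, _ in functions)
    total_vuln = sum(1 for _, _, vuln in functions if vuln)
    if total_vuln == 0:
        raise ValueError("no vulnerable functions in the evaluation set")
    budget = effort_ratio * total_loc
    # Assumed ranking policy: most suspicious functions first.
    ranked = sorted(functions, key=lambda f: f[0], reverse=True)
    inspected_loc = 0
    found = 0
    for _, loc, vuln in ranked:
        if inspected_loc + loc > budget:
            break  # assumed stopping rule: next function does not fit the budget
        inspected_loc += loc
        if vuln:
            found += 1
    return found / total_vuln

# Hypothetical toy project: 100 LOC in total, 3 vulnerable functions.
toy_functions = [
    (0.9, 10, True),
    (0.8, 10, False),
    (0.7, 30, True),
    (0.4, 20, True),
    (0.2, 30, False),
]
r_at_20 = recall_at_effort(toy_functions)
```

With a 20-LOC budget, only the two top-ranked functions are reviewed, and one of the three vulnerable functions is found, giving an R@20%E of 1/3 on the toy data.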

7. Conclusions

This paper proposes a cross-project software vulnerability detection method based on a dual-feature stacking ensemble, named Decpvd. This method significantly improves vulnerability detection performance in cross-project scenarios through the collaborative learning and adaptive fusion of expert features and semantic features. The main conclusions are as follows:

1. The dual-feature collaborative extraction mechanism effectively enhances the comprehensiveness of feature representation. Decpvd integrates expert metrics and semantic features, where the former captures code engineering attributes (such as complexity and maintainability), while the latter explores code structural dependencies and logical semantics through a PDG (Program Dependence Graph) and GGNN (Gated Graph Neural Network). The complementary nature of these two types of features provides richer discriminative information for cross-domain detection.

2. The combination of TrAdaBoost transfer learning and stacking ensemble strategy synergistically addresses the challenges of cross-domain adaptation and adaptive fusion. TrAdaBoost mitigates interference from samples with significant distribution disparities by dynamically adjusting the weights of source domain samples, thereby improving the cross-domain adaptability of base models. The stacking ensemble overcomes the adaptability limitations of fixed-weight fusion in different cross-project scenarios by utilizing a meta-model to adaptively learn the predictive probability mapping relationship between the two types of base models.

3. Decpvd demonstrates exceptional generalization ability and robustness on real-world project datasets. In six sets of cross-project experiments, Decpvd outperforms five mainstream baseline methods, including ReGVD and DAM2P, in terms of both AUC (Area Under the Curve) and MCC (Matthews Correlation Coefficient). Moreover, Decpvd maintains a stable performance in scenarios with significant data distribution disparities, validating its practical value.
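As an illustration of the sample reweighting summarized in point 2, the following is a minimal sketch of one TrAdaBoost round, following the update rule of the original algorithm (Dai et al., "Boosting for transfer learning"). It is not the authors' implementation. The function name, the clamping of the target error, and the toy inputs are assumptions, and weight normalization before each round is left to the caller.

```python
import math

def tradaboost_update(src_w, tgt_w, src_err, tgt_err, n_rounds):
    """One TrAdaBoost reweighting round.

    src_err / tgt_err: per-sample absolute errors |h(x) - y| in [0, 1]
    of the current base learner on source and target samples.
    """
    n = len(src_w)
    # Fixed factor for source samples; depends on source size and round count.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / n_rounds))
    # Weighted error of the base learner on the target samples only.
    eps = sum(w * e for w, e in zip(tgt_w, tgt_err)) / sum(tgt_w)
    # Assumed safeguard: the algorithm requires 0 < eps < 0.5.
    eps = min(max(eps, 1e-10), 0.499)
    beta_tgt = eps / (1.0 - eps)
    # Misclassified source samples are down-weighted (likely dissimilar to the target).
    new_src = [w * beta_src ** e for w, e in zip(src_w, src_err)]
    # Misclassified target samples are up-weighted, as in AdaBoost.
    new_tgt = [w * beta_tgt ** (-e) for w, e in zip(tgt_w, tgt_err)]
    return new_src, new_tgt

# Hypothetical toy round: 4 source samples, 3 target samples.
src_w = [1.0, 1.0, 1.0, 1.0]
tgt_w = [1.0, 1.0, 1.0]
src_err = [0, 1, 0, 1]   # second and fourth source samples misclassified
tgt_err = [0, 0, 1]      # third target sample misclassified

new_src, new_tgt = tradaboost_update(src_w, tgt_w, src_err, tgt_err, n_rounds=10)
```

In the toy round, the weights of the misclassified source samples shrink while the weight of the misclassified target sample doubles. This is how source samples that disagree with the target distribution gradually lose influence.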

In this study, we selected three C/C++ libraries—FFmpeg, LibTIFF, and LibPNG—as experimental subjects because these projects represent classic benchmark datasets for cross-project vulnerability detection, facilitating comparative validation with existing research (e.g., CSVD-TF, DAM2P). The core innovation of our proposed method lies in its adaptive stacking fusion strategy, which is generalizable and not limited to these three libraries. Subsequent research will extend its application to a wider variety of C/C++ projects (e.g., operating system kernels, industrial-grade applications) to validate its external validity.

In future research, we plan to expand the multi-dimensional feature types and cross-language adaptability. Currently, Decpvd primarily focuses on the fusion of expert and semantic features in C/C++ projects. In the future, we can further incorporate multi-dimensional information: On the one hand, we will introduce code-level semantic association features, such as semantic similarity between cross-project code snippets and API call pattern similarity, to capture implicit semantic associations related to vulnerabilities across different projects. On the other hand, we will extend our approach to scenarios involving multiple programming languages, such as Java and Python, and propose concrete technical solutions to address the fundamental challenges of differing syntax, toolchains, and relevant code metrics among different languages.

Specifically, for the syntactic differences of different languages, aiming at the absence of pointers in Java and dynamic typing in Python, we will optimize the expert metric system accordingly—adding object-oriented related metrics (e.g., class coupling degree, inheritance depth) for Java to adapt to its object-oriented features, and adding dynamic type-related metrics (e.g., dynamic function call frequency, type checking completeness) and decorator-related metrics for Python to match its dynamic programming characteristics. For the differences in toolchains, we will adopt language-adaptive graph structure extraction tools, using Soot instead of Joern for PDG construction in Java projects (to better support Java’s class structure and bytecode analysis) and PyCG for PDG extraction in Python projects (to adapt to Python’s dynamic function calls and module dependencies).

Furthermore, we will design a universal cross-language feature extraction framework to enhance the method’s generality. This framework will include two core modules: a language adaptive preprocessing module (to normalize language-specific syntax elements and unify the feature representation space) and a cross-language feature alignment module (to align expert and semantic features of different languages through cross-language code embedding, eliminating feature distribution differences caused by language heterogeneity). Through the above concrete measures, we will effectively address the technical challenges of cross-language extension and improve the generality of Decpvd.

Author Contributions

Conceptualization, Y.L. and S.W.; methodology, Y.L.; software, Y.L.; validation, Y.L.; investigation, Y.L., B.L. and S.W.; data curation, Y.L., B.H. and Y.J.; writing—original draft preparation, Y.L.; writing—review and editing, B.H. and Y.J.; supervision, B.L. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2026 Natural Science Foundation of Hunan Province (2026JJ81212, 2026JJ81221).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GGNN	Gated Graph Neural Network
PDG	Program Dependence Graph
AST	Abstract Syntax Tree
GRU	Gated Recurrent Unit
GAN	Generative Adversarial Network
RNN	Recurrent Neural Network
Bi-LSTM	Bidirectional Long Short-Term Memory
LSTM	Long Short-Term Memory
CNN	Convolutional Neural Network
GAT	Graph Attention Network
CPG	Code Property Graph
MMD	Maximum Mean Discrepancy
AUC	Area Under the ROC Curve
MCC	Matthews Correlation Coefficient
TPR	True Positive Rate
FPR	False Positive Rate
TP	True Positive
TN	True Negative
FP	False Positive
FN	False Negative
NVD	National Vulnerability Database
CVE	Common Vulnerabilities and Exposures

References

1. Aslan, M.; Aktu, S.S.; Ozkan-Okay, M.; Yilmaz, A.A.; Akin, E. A Comprehensive Review of Cyber Security Vulnerabilities, Threats, Attacks, and Solutions. Electronics 2023, 12, 1333.
2. Li, Z.; Zou, D.; Xu, S.; Ou, X.; Jin, H.; Wang, S.; Deng, Z.; Zhong, Y. VulDeePecker: A deep learning-based system for vulnerability detection. arXiv 2018, arXiv:1801.01681.
3. Zhou, Y.; Liu, S.; Siow, J.; Du, X.; Liu, Y. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 32, Volume 13 of 20: 32nd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
4. Nguyen, V.A.; Nguyen, D.Q.; Nguyen, V.; Le, T.; Tran, Q.H.; Phung, D. ReGVD: Revisiting Graph Neural Networks for Vulnerability Detection. In Proceedings of the 44th International Conference on Software Engineering Companion (ICSE ’22 Companion), Pittsburgh, PA, USA, 22–24 May 2022.
5. Kong, L.; Luo, S.; Pan, L.; Wu, Z.; Li, X. A multi-type vulnerability detection framework with parallel perspective fusion and hierarchical feature enhancement. Comput. Secur. 2024, 140, 103787.
6. Nguyen, H.Q.; Hoang, T.; Dam, H.K.; Ghose, A. Graph-based explainable vulnerability prediction. Inf. Softw. Technol. 2025, 177, 107566.
7. Risse, N.; Liu, J.; Böhme, M. Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection. arXiv 2025, arXiv:2408.12986v2.
8. Li, X.; Xin, Y.; Zhu, H.; Yang, Y.; Chen, Y. Cross-domain vulnerability detection using graph embedding and domain adaptation. Comput. Secur. 2023, 125, 103017.
9. Nguyen, V.; Le, T.; de Vel, O.; Montague, P.; Grundy, J.; Phung, D. Dual-Component Deep Domain Adaptation: A New Approach for Cross Project Software Vulnerability Detection. In Proceedings of the Advances in Knowledge Discovery and Data Mining; Springer: Cham, Switzerland, 2020; pp. 699–711.
10. Zhang, C.; Liu, B.; Xin, Y.; Yao, L. CPVD: Cross Project Vulnerability Detection Based on Graph Attention Network and Domain Adaptation. IEEE Trans. Softw. Eng. 2023, 49, 4152–4168.
11. Nguyen, V.; Le, T.; Tantithamthavorn, C.; Grundy, J.; Phung, D. Deep Domain Adaptation with Max-Margin Principle for Cross-Project Imbalanced Software Vulnerability Detection. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–34.
12. Cai, Z.; Cai, Y.; Chen, X.; Lu, G.; Pei, W.; Zhao, J. CSVD-TF: Cross-project software vulnerability detection with TrAdaBoost by fusing expert metrics and semantic metrics. J. Syst. Softw. 2024, 213, 15.
13. Zhao, N.; Huang, Z.; Hua, R.; Li, Y.; Zheng, R.; Shen, Q.; Wang, J. TFSM: A network for time-frequency synergistic modeling integrating Mamba temporal pathway and spectral features for electricity theft detection. Expert Syst. Appl. 2026, 297, 129425.
14. Zhao, N.; Feng, Q.; Wang, H.; Jing, M.; Lin, Z.; Wang, J. A Key Node Mining Method Based on K-Shell and Neighborhood Information. Appl. Sci. 2024, 14, 6012.
15. Zhao, N.; Wang, H.; Wen, J.; Li, J.; Jing, M.; Wang, J. Identifying critical nodes in complex networks based on neighborhood information. New J. Phys. 2023, 25, 083020.
16. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated graph sequence neural networks. arXiv 2015, arXiv:1511.05493.
17. Dai, W.; Yang, Q.; Xue, G.R.; Yu, Y. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning; ACM: New York, NY, USA, 2007; pp. 193–200.
18. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
19. Neuhaus, S.; Zimmermann, T.; Holler, C.; Zeller, A. Predicting vulnerable software components. In Proceedings of the 14th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 28–31 October 2007; pp. 529–540.
20. Shin, Y.; Meneely, A.; Williams, L.; Osborne, J.A. Evaluating Complexity, Code Churn, and Developer Activity Metrics as Indicators of Software Vulnerabilities. IEEE Trans. Softw. Eng. 2011, 37, 772–787.
21. Walden, J.; Stuckman, J.; Scandariato, R. Predicting vulnerable components: Software metrics vs. text mining. In Proceedings of the 2014 IEEE 25th International Symposium on Software Reliability Engineering; IEEE: New York, NY, USA, 2014; pp. 23–33.
22. Grieco, G.; Grinblat, G.L.; Uzal, L.; Rawat, S.; Feist, J.; Mounier, L. Toward large-scale vulnerability discovery using machine learning. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, New Orleans, LA, USA, 9–11 March 2016; pp. 85–96.
23. Dam, H.K.; Tran, T.; Pham, T.; Ng, S.W.; Grundy, J.; Ghose, A. Automatic feature learning for predicting vulnerable software components. IEEE Trans. Softw. Eng. 2018, 47, 67–85.
24. Steenhoek, B.; Rahman, M.M.; Jiles, R.; Le, W. An empirical study of deep learning models for vulnerability detection. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE); IEEE: New York, NY, USA, 2023; pp. 2237–2248.
25. Alon, U.; Zilberstein, M.; Levy, O.; Yahav, E. code2vec: Learning distributed representations of code. Proc. ACM Program. Lang. 2019, 3, 1–29.
26. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv 2020, arXiv:2002.08155.
27. Li, Z.; Zou, D.; Xu, S.; Jin, H.; Zhu, Y.; Chen, Z. SySeVR: A framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Secur. Comput. 2021, 19, 2244–2258.
28. Wu, Y.; Zou, D.; Dou, S.; Yang, W.; Xu, D.; Jin, H. VulCNN: An image-inspired scalable vulnerability detection system. In Proceedings of the 44th International Conference on Software Engineering; Association for Computing Machinery: New York, NY, USA, 2022; ICSE ’22; pp. 2365–2376.
29. Wang, H.; Ye, G.; Tang, Z.; Tan, S.H.; Huang, S.; Fang, D.; Feng, Y.; Bian, L.; Wang, Z. Combining graph-based learning with automated data collection for code vulnerability detection. IEEE Trans. Inf. Forensics Secur. 2020, 16, 1943–1958.
30. Chakraborty, S.; Krishna, R.; Ding, Y.; Ray, B. Deep learning based vulnerability detection: Are we there yet? IEEE Trans. Softw. Eng. 2021, 48, 3280–3296.
31. Xiao, P.; Xiao, Q.; Zhang, X.; Wu, Y.; Yang, F. Vulnerability Detection Based on Enhanced Graph Representation Learning. IEEE Trans. Inf. Forensics Secur. 2024, 19, 5120–5135.
32. Tao, W.; Su, X.; Wan, J.; Wei, H.; Zheng, W. Vulnerability detection through cross-modal feature enhancement and fusion. Comput. Secur. 2023, 132, 103341.
33. Lu, G.; Ju, X.; Chen, X.; Pei, W.; Cai, Z. GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning. J. Syst. Softw. 2024, 212, 112031.
34. Peng, T.; Li, Z.; Zhang, Y. VulTrLM: LLM-assisted vulnerability detection via AST decomposition and comment enhancement. Empir. Softw. Eng. 2026, 26, 1–28.
35. Luo, Y.; Chen, Z.; Dong, Y.; Zhang, H.; Sun, Y.; Xie, F.; Dong, Z. Improving SAST Detection Capability with LLMs and Enhanced DFA. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages; Association for Computing Machinery: New York, NY, USA, 2025; LMPL ’25; pp. 66–70.
36. SciTools Limited Liability Company. SciTools Understand; Computer Software. 2025. Available online: https://scitools.com/ (accessed on 15 April 2025).
37. Shiri Harzevili, N.; Boaye Belle, A.; Wang, J.; Wang, S.; Jiang, Z.M.J.; Nagappan, N. A Systematic Literature Review on Automated Software Vulnerability Detection Using Machine Learning. ACM Comput. Surv. 2024, 57, 1–36.
38. Anon. Joern; Computer Software. 2025. Available online: https://joern.io/ (accessed on 15 April 2025).
39. Wolf, L.; Hanani, Y.; Bar, K.; Dershowitz, N. Joint word2vec Networks for Bilingual Semantic Representations. Int. J. Comput. Linguist. Appl. 2014, 5, 27–42.
40. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; KDD ’16; pp. 785–794.
41. Nguyen, V.; Le, T.; Le, T.; Nguyen, K.; DeVel, O.; Montague, P.; Qu, L.; Phung, D. Deep Domain Adaptation for Vulnerable Code Function Identification. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
42. Wilcoxon, F. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution; Kotz, S., Johnson, N.L., Eds.; Springer: New York, NY, USA, 1992; pp. 196–202.
43. Chen, X.; Xia, H.; Pei, W.; Ni, C.; Liu, K. Boosting multi-objective just-in-time software defect prediction by fusing expert metrics and semantic metrics. J. Syst. Softw. 2023, 206, 111853.
44. Ni, C.; Wang, W.; Yang, K.; Xia, X.; Liu, K.; Lo, D. The best of both worlds: Integrating semantic features with expert features for defect prediction and localization. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering; Association for Computing Machinery: New York, NY, USA, 2022; pp. 672–683.

Figure 1. The framework of Decpvd.

Figure 2. Illustration of the Semantic Metrics Extraction Process.

Figure 3. The comparison results between Decpvd and five baselines in terms of MCC.

Figure 4. The impact of using different fixed fusion weights and a stacking ensemble on the performance in terms of AUC.

Figure 5. The heatmap results of the ablation study.

Figure 6. Box plot of AUC distribution for different feature types.

Figure 7. Impact of key hyperparameters on AUC.

Figure 8. Efficiency comparison of different models.

Table 1. Expert metrics employed by Decpvd.

Dimension	Metric Name
Code size	CountDeclClass
	CountDeclFunction
	CountLine
	CountLineBlank
	CountLineCode
	CountLineCodeDecl
	CountLineComment
	CountLineInactive
	CountLinePreprocessor
Complexity	AvgCyclomatic
	AvgCyclomaticModified
	AvgCyclomaticStrict
	AvgEssential
	MaxCyclomatic
	MaxCyclomaticModified
	MaxCyclomaticStrict
	MaxEssential
	MaxNesting
	SumCyclomatic
	SumCyclomaticModified
	SumCyclomaticStrict
	SumEssential
Readability	AvgLine
	AvgLineBlank
	AvgLineCode
	AvgLineComment
	AltAvgLineBlank
	AltAvgLineCode
	AltAvgLineComment
	AltCountLineBlank
	AltCountLineCode
	AltCountLineComment
	RatioCommentToCode
Maintainability	CountStmt
	CountStmtDecl
	CountStmtEmpty
	CountStmtExe
Performance	CountLineCodeExe
	CountSemicolon

Table 2. Statistical details pertaining to the experimental subjects.

Project	Vulnerable Functions	Non-Vulnerable Functions
FFmpeg	806	4808
LibTIFF	79	418
LibPNG	30	370

Table 3. Hyperparameter settings.

Model Module	Hyperparameter Name	Value	Selection Method
word2vec	Embedding Dim.	256	Grid search [128, 256, 512]
GGNN	Embedding Dim.	256	Grid search [128, 256, 512]
GGNN	Iteration Steps	6	Grid search [3, 6, 9]
GGNN-Adam	Learning Rate	$10^{-4}$	Fine-tuning
GGNN-Adam	Weight Decay	$10^{-3}$	Fine-tuning
GGNN	Loss Function	BCELoss	Ref. classic domain studies
XGBoost	Learning Rate $\eta$	0.1	Grid search [0.01, 0.1, 0.2]
XGBoost	Max Tree Depth	6	Grid search [3, 6, 10]
XGBoost	L2 Regularization $\lambda$	1	Grid search [0, 1, 5]
XGBoost	Iterations	200	Grid search [100, 200, 300]
TrAdaBoost	Early Stopping	50	Ref. classic domain studies
Stacking	K-fold CV	5	Classic value
Stacking	Classification Threshold	0.5	Grid search [0.3, 0.5, 0.7]

Table 4. AUC comparison of Decpvd and five baseline models.

Source→Target	Decpvd	CSVD-TF	DAM2P	Dual-GD-DDAN	Devign	ReGVD
FFmpeg→LibPNG	0.866	0.783	0.669	0.595	0.775	0.842
FFmpeg→LibTIFF	0.791	0.741	0.665	0.667	0.716	0.718
LibPNG→FFmpeg	0.845	0.530	0.658	0.552	0.628	0.429
LibPNG→LibTIFF	0.747	0.614	0.595	0.650	0.550	0.367
LibTIFF→FFmpeg	0.848	0.789	0.631	0.559	0.568	0.660
LibTIFF→LibPNG	0.786	0.658	0.696	0.746	0.566	0.423
Average	0.814	0.686	0.652	0.628	0.634	0.573
p-value	–	*	*	*	*	*

Notes: * means p-value < 0.05.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

