4.1. Experimental Design
The primary objective of the experiments in this section is to evaluate the performance of the proposed DAP2ER method and compare it with state-of-the-art techniques in the domain of cross-project imbalanced software vulnerability detection. Specifically, the focus is on the task of cross-domain vulnerability detection, where the imbalance between vulnerable and non-vulnerable functions is pronounced.
For experimental datasets, this study uses two real-world source code datasets, each containing both vulnerable and non-vulnerable functions. The first dataset, FFmpeg, includes 187 vulnerable functions and 5427 non-vulnerable functions. The second dataset, LibPNG, contains 43 vulnerable functions and 551 non-vulnerable functions.
It is important to note that these datasets exhibit extreme class imbalance, with the proportion of vulnerable data representing only about 0.51% to 11.65% of the non-vulnerable data. Our observations suggest that in cross-domain vulnerability detection, the smaller the proportion of vulnerable samples relative to non-vulnerable samples within a given source-target domain pair, the more severe the imbalance problem. This issue can also arise when transitioning between different source–target domain pairs.
To explicitly address the severe imbalance that varies across projects and transfer directions, we adopt a unified imbalance-handling strategy for each source→target experiment based on the labeled source training split. Specifically, we use a cost-sensitive classification objective by applying class weights in the source-domain classification loss, where the weight of each class is computed from the source training data. This makes misclassification of minority classes more costly and alleviates the dominance of the majority class during optimization. In addition, to ensure sufficient exposure of minority samples during training, mini-batches are formed with class-balanced sampling on the source training set. Notably, because the target domain is unlabeled during training under the UDA setting, imbalance handling is performed only on the labeled source side, while target samples are incorporated through domain alignment and the proposed high-confidence selection mechanism.
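As a concrete illustration of this strategy, the snippet below sketches inverse-frequency class weights and class-balanced mini-batch sampling in plain Python. The inverse-frequency formula N/(K·n_c) and the equal per-class batch composition are assumptions for illustration; the exact weighting scheme used in the experiments may differ.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class c by N / (K * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def balanced_batch(labels, batch_size, rng):
    """Draw a mini-batch with equal samples per class (with replacement)."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    per_class = batch_size // len(by_class)
    batch = []
    for idxs in by_class.values():
        batch.extend(rng.choices(idxs, k=per_class))
    return batch

# FFmpeg-like imbalance: 187 vulnerable (1) vs. 5427 non-vulnerable (0)
labels = [1] * 187 + [0] * 5427
weights = inverse_frequency_weights(labels)
batch = balanced_batch(labels, 128, random.Random(42))
```

Under this scheme the minority (vulnerable) class receives a weight roughly 29 times larger than the majority class, while every mini-batch still contains both classes in equal proportion.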
In this experiment, we aim to demonstrate the transfer learning capability of the proposed method for imbalanced cross-domain software vulnerability detection. We use FFmpeg, a multimedia application dataset, as the source domain and LibPNG, an image processing application dataset, as the target domain. It is noteworthy that the labels of the target domain dataset are hidden during training and are only revealed during the testing phase to evaluate the model’s performance.
In the data processing and embedding stage, we perform a series of preprocessing steps before feeding the source code datasets into the neural network. First, we standardize the source code by removing comments, blank lines, and non-ASCII characters. Then, we map user-defined variables to symbolic variable names and user-defined functions to symbolic function names. Additionally, integers, real numbers, and hexadecimal numbers are replaced with a generic number token, while strings are replaced with a generic string token. Subsequently, we embed source code statements into numerical vectors by tokenizing each code statement into a sequence of code tokens, constructing a frequency vector representing the information of that statement, and multiplying this frequency vector by a learnable embedding matrix.
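The preprocessing and embedding pipeline above can be sketched as follows. The regular expressions, the symbolic token names (FUN1, VAR1, NUM, STR), and the toy vocabulary/embedding matrix are illustrative assumptions, not the exact implementation.

```python
import re

def normalize_code(code):
    """Remove comments, blank lines, and non-ASCII characters (illustrative)."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)   # block comments
    code = re.sub(r"//[^\n]*", "", code)                # line comments
    code = code.encode("ascii", "ignore").decode()      # non-ASCII characters
    return "\n".join(l for l in code.splitlines() if l.strip())

def symbolize(code, user_vars, user_funcs):
    """Map user-defined names to symbolic tokens; abstract literals."""
    for i, f in enumerate(user_funcs, 1):
        code = re.sub(r"\b%s\b" % re.escape(f), "FUN%d" % i, code)
    for i, v in enumerate(user_vars, 1):
        code = re.sub(r"\b%s\b" % re.escape(v), "VAR%d" % i, code)
    code = re.sub(r'"[^"]*"', "STR", code)                # string literals
    code = re.sub(r"\b0[xX][0-9a-fA-F]+\b", "NUM", code)  # hex numbers first
    code = re.sub(r"\b\d+(\.\d+)?\b", "NUM", code)        # integers / reals
    return code

def freq_embed(statement, vocab, emb):
    """Token-frequency vector of a statement times a (|vocab| x d) matrix."""
    tokens = re.findall(r"\w+|\S", statement)
    freq = [tokens.count(w) for w in vocab]
    d = len(emb[0])
    return [sum(freq[i] * emb[i][j] for i in range(len(vocab))) for j in range(d)]

src = 'int my_len(char *buf) { // count\n  int n = 0x10; return n + 2; }'
clean = symbolize(normalize_code(src), ["buf", "n"], ["my_len"])
vec = freq_embed("VAR1 = NUM + NUM", ["VAR1", "NUM"], [[1.0, 0.0], [0.0, 1.0]])
```

In practice the embedding matrix is a learnable parameter updated by backpropagation; the fixed identity-like matrix here only illustrates the frequency-vector multiplication.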
To comprehensively evaluate the effectiveness of the proposed DAP2ER method in cross-domain multi-classification tasks, we compare it against five widely used and representative baseline methods from the domain adaptation literature: SourceOnly [19], DANN [19], PseudoLabeling [31], MMD [32], and CORAL [33]. These methods cover the mainstream technological approaches and can verify the advantages and necessity of DAP2ER from different perspectives.
All comparison methods share the same backbone network architecture, with consistent training epochs, optimizers, and data partition strategies; the only difference lies in whether the corresponding domain alignment or pseudolabeling modules and loss terms are introduced. All domain adaptation methods strictly follow the UDA setup: during training, only the true labels of the source domain are used, and no label information from the target domain is used.

The Source-Only method trains the classifier using only the labeled source domain data, without any domain alignment strategy, and directly transfers the model to the target domain for testing, serving as the baseline for cross-domain performance. The pseudolabeling method first trains an initial model on the source domain and then generates pseudolabels for the target domain samples; in subsequent training, high-confidence target samples are used as supervision signals, improving the model's adaptation to the target domain. This strategy is generally effective when the model's early predictions are reliable, but it is sensitive to pseudolabel noise.

MMD is a statistical distribution alignment method that minimizes the distance between the source and target feature distributions in a reproducing kernel Hilbert space (RKHS). Its advantages are simplicity and stability, but it mainly performs marginal distribution alignment, which can lead to class confusion when there is class imbalance or significant differences in class-conditional distributions. CORAL aligns the covariance matrices of source and target features to achieve second-order statistical matching, making it a lightweight distribution alignment strategy. Compared to MMD, CORAL does not rely on kernel functions and has lower computational overhead; like MMD, however, it aligns global statistics and has limited ability to perform class-level alignment.
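For reference, the two statistical alignment losses can be written compactly in plain Python: a linear-kernel MMD (mean matching; the full method uses kernel embeddings in an RKHS) and the CORAL covariance-matching loss with its standard 1/(4d²) scaling. These are textbook forms, not the baselines' exact implementations.

```python
def mean_vec(X):
    """Per-dimension mean of a list of feature vectors."""
    n, d = len(X), len(X[0])
    return [sum(x[j] for x in X) / n for j in range(d)]

def linear_mmd(Xs, Xt):
    """Squared distance between source/target feature means (linear-kernel MMD)."""
    ms, mt = mean_vec(Xs), mean_vec(Xt)
    return sum((a - b) ** 2 for a, b in zip(ms, mt))

def covariance(X):
    """Unbiased d x d feature covariance matrix."""
    n, d = len(X), len(X[0])
    m = mean_vec(X)
    return [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in X) / (n - 1)
             for j in range(d)] for i in range(d)]

def coral(Xs, Xt):
    """CORAL loss: squared Frobenius distance between covariances / (4 d^2)."""
    Cs, Ct = covariance(Xs), covariance(Xt)
    d = len(Cs)
    fro2 = sum((Cs[i][j] - Ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return fro2 / (4 * d * d)

Xs = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
Xt = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]   # Xs shifted by +1 in dimension 0
mmd_val, coral_val = linear_mmd(Xs, Xt), coral(Xs, Xt)
```

The toy example also shows the limitation discussed above: a pure shift of the target features is visible to MMD (nonzero loss) but invisible to CORAL, since covariances are translation-invariant; neither loss sees class labels at all.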
DANN is a classic deep adversarial domain adaptation method that uses a gradient reversal layer and domain discriminator for adversarial training, enabling the feature extractor to learn domain-invariant representations, thereby reducing the discrepancy between source and target domains. This method has shown robust performance in multiple UDA tasks and is one of the key adversarial baselines.
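The gradient reversal layer (GRL) can be illustrated conceptually: it is an identity in the forward pass and negates (and scales) the gradient in the backward pass, so minimizing the discriminator loss pushes the feature extractor to *confuse* the discriminator. In practice this is implemented as a custom autograd function; the sketch below only shows the numeric behavior.

```python
def grl_forward(x):
    """Forward pass: identity, features pass through unchanged."""
    return x

def grl_backward(grad_from_discriminator, lam):
    """Backward pass: flip the sign and scale by lam before the gradient
    reaches the feature extractor."""
    return [-lam * g for g in grad_from_discriminator]
```

With lam = 0 the adversarial signal is switched off entirely, which is why GRL-based methods often pair the layer with a ramp-up schedule for lam.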
Building upon these baselines, our proposed DAP2ER method introduces mechanisms such as progressive weight scheduling and high-confidence pseudolabel/prototype alignment. The method prioritizes preserving the source domain’s discriminative capability during the early stages of training, gradually increasing domain adversarial and target domain constraints to achieve more stable and superior performance on the target domain. This approach is especially beneficial in scenarios involving cross-domain transfer, class-conditional alignment, and imbalance.
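A minimal sketch of how the high-confidence selection and prototype alignment described above might operate, assuming selection by maximum predicted probability and prototypes computed as per-class feature means; the actual DAP2ER formulation may differ.

```python
def select_confident(probs, tau):
    """Keep indices of samples whose max class probability >= tau (assumed rule)."""
    return [i for i, p in enumerate(probs) if max(p) >= tau]

def class_prototypes(feats, labels, num_classes):
    """Per-class mean feature vector (assumes each class has >= 1 sample)."""
    protos = []
    for c in range(num_classes):
        members = [feats[i] for i in range(len(feats)) if labels[i] == c]
        d = len(feats[0])
        protos.append([sum(x[j] for x in members) / len(members)
                       for j in range(d)])
    return protos

def prototype_loss(protos_s, protos_t):
    """Mean squared distance between matching source/target class prototypes."""
    total = sum((a - b) ** 2
                for ps, pt in zip(protos_s, protos_t)
                for a, b in zip(ps, pt))
    return total / len(protos_s)

idx = select_confident([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]], tau=0.8)
protos_s = class_prototypes([[0.0, 0.0], [2.0, 2.0]], [0, 1], 2)
loss = prototype_loss(protos_s, [[1.0, 0.0], [2.0, 2.0]])
```

Because target prototypes are built from pseudolabeled samples, a higher tau trades coverage for purity, which is exactly the sensitivity examined in the hyperparameter analysis later in this section.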
In the model configuration, all experiments were implemented using the PyTorch framework (version 1.7.1) and accelerated with NVIDIA GPUs. For reproducibility, we fixed the random seed to 42 for all experiments. In all methods (including baselines and DAP2ER), we applied the same class-imbalance handling strategy (class-weighted source classification loss and class-balanced sampling on the labeled source training data) to ensure fair comparison across different domain adaptation objectives.
During training, the Adam optimizer was employed to update model parameters with fixed learning rate and weight decay settings. The models were optimized over 150 training epochs. The source domain data were split into 80% for training and 20% for testing, while the target domain data were split into 50% for training and 50% for testing. Importantly, the target domain training set did not use its true labels. The batch size was set to 128 to ensure sufficient sample support for gradient updates.
In the early stages of training, domain adaptation-related losses were progressively increased through a scheduling mechanism in order to balance the conflict between classification learning and cross-domain alignment tasks. This progressive approach ensures that the relevant loss weights start at lower values, gradually increasing to avoid oscillations and ensuring model stability and generalization in the later stages of training.
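The exact schedule is not specified here; a common choice for such progressive ramp-ups (used, for example, in DANN) is λ(p) = 2/(1 + e^(−γp)) − 1, where p is training progress in [0, 1]. The function and the steepness γ below are assumptions for illustration:

```python
import math

def progressive_weight(epoch, total_epochs, gamma=10.0):
    """Ramp an adaptation-loss weight from 0 toward 1 over training.

    p is the fraction of training completed; gamma controls how quickly
    the weight saturates. The specific schedule is an assumption."""
    p = epoch / total_epochs
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0
```

The weight starts at exactly 0, so early epochs are dominated by the source classification loss, and approaches 1 near the end of the 150-epoch run, letting the alignment terms take full effect only once the classifier is stable.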
For each run, we first randomly shuffled samples in both the source and target domains using a fixed random seed, then partitioned the data as follows: for the source domain, we split the dataset into 80% for training and 20% for evaluation; for the target domain, we split the dataset into 50% for training and 50% for evaluation. During training, target-domain labels were not used; the target training split was treated as unlabeled data, and only served to compute the domain-adversarial, entropy minimization, and prototype alignment objectives. The held-out target split was used exclusively for performance evaluation with ground truth labels.
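The partitioning protocol above can be sketched as follows (seed 42 and split fractions from the text; dataset sizes are the FFmpeg and LibPNG totals):

```python
import random

def split_indices(n, train_frac, seed=42):
    """Shuffle sample indices with a fixed seed, then split into train/eval."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

src_train, src_eval = split_indices(5614, 0.8)  # source: 80% / 20%
tgt_train, tgt_eval = split_indices(594, 0.5)   # target: 50% / 50%, train half unlabeled
```

Because the seed is fixed, every method sees exactly the same partitions, and the target evaluation split stays disjoint from the unlabeled target training split.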
The experimental evaluation employs the standard classification evaluation metrics of Accuracy, Precision, Recall, F1-score, and AUC to thoroughly assess the performance of the proposed method on the target domain. These metrics provide a comprehensive assessment of classification effectiveness, particularly in scenarios involving class imbalance. The detailed calculation procedures are outlined below.
For each sample in the target domain, its feature representation was first extracted, then the corresponding outputs were generated by the prediction model. These outputs were processed and converted into a probability distribution, with the probability of the vulnerability class being taken as the predicted probability of the sample belonging to the positive class.
We determined the predicted label for each sample by setting a threshold of 0.5. If the predicted probability of vulnerability was greater than or equal to 0.5, the sample was classified as positive; otherwise, it was classified as negative. The specific calculation formula is as follows:
\[ \hat{y}_i = \begin{cases} 1, & p_i \ge 0.5 \\ 0, & p_i < 0.5 \end{cases} \]
Based on the predicted labels, we evaluate the model performance using the following standard classification metrics: Accuracy: The proportion of correctly classified samples, computed as follows:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision: The proportion of true positive samples among all samples predicted as positive, computed as follows:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall: The proportion of true positive samples among all actual positive samples, computed as follows:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
F1-score: The harmonic mean of Precision and Recall, balancing both metrics, computed as follows:
\[ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
AUC: The area under the Receiver Operating Characteristic (ROC) curve, reflecting the classifier's performance across various thresholds. A higher AUC indicates better model performance. AUC is calculated based on the true labels and predicted probabilities:
\[ \text{AUC} = \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N} \mathbb{1}\!\left(p_i > p_j\right) \]
where \(P\) and \(N\) denote the sets of positive and negative samples, respectively.
In the above equations, True Positive (TP) refers to the number of positive samples correctly predicted as positive, True Negative (TN) refers to the number of negative samples correctly predicted as negative, False Positive (FP) refers to the number of negative samples incorrectly predicted as positive, and False Negative (FN) refers to the number of positive samples incorrectly predicted as negative.
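The thresholding step and all five metrics can be computed directly from true labels and predicted probabilities. The sketch below uses the rank-based (Mann–Whitney) formulation of AUC, which equals the area under the ROC curve:

```python
def binary_metrics(y_true, probs, threshold=0.5):
    """Accuracy, Precision, Recall, F1 at a fixed threshold, plus AUC."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, y in zip(y_true, preds) if t == 1 and y == 1)
    tn = sum(1 for t, y in zip(y_true, preds) if t == 0 and y == 0)
    fp = sum(1 for t, y in zip(y_true, preds) if t == 0 and y == 1)
    fn = sum(1 for t, y in zip(y_true, preds) if t == 1 and y == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # AUC = probability a random positive outranks a random negative
    pos = [p for t, p in zip(y_true, probs) if t == 1]
    neg = [p for t, p in zip(y_true, probs) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    auc = wins / (len(pos) * len(neg))
    return acc, prec, rec, f1, auc

acc, prec, rec, f1, auc = binary_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```

Note that AUC depends only on the ranking of the probabilities, not on the 0.5 threshold, which is why it complements the thresholded metrics under class imbalance.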
4.2. Comparative Experiments
To evaluate the effectiveness of DAP2ER in cross-project imbalanced software vulnerability detection tasks, we establish a bidirectional transfer setup between two projects and compare it with five representative unsupervised/weakly-supervised domain adaptation baseline methods: Source-Only, Pseudolabel, MMD, CORAL, and DANN. The comparison results are presented in Table 2 and Table 3. The evaluation metrics include Accuracy, Precision, Recall, F1-score, and AUC, which comprehensively assess detection performance under class-imbalanced scenarios.
As shown in Table 2 and Table 3, DAP2ER outperforms all other methods across all core metrics on the source-to-target transfer task, achieving an F1-score of 0.8794 and an AUC of 0.9593, the highest among the compared methods. Specifically, the strongest baseline, MMD, achieves an F1-score of 0.6807, which DAP2ER improves upon by 0.1987; similarly, DAP2ER achieves an AUC 0.2575 higher than DANN's 0.7018. Notably, while the Pseudolabel method exhibits extremely high Recall, its Precision is only 0.4698, indicating that it tends to generate noisy pseudolabels and over-predict the positive class under significant domain shift and imbalance. In contrast, DAP2ER maintains high Recall while significantly improving Precision, resulting in a more balanced and stable detection performance.
In the target-to-source transfer task, DAP2ER also demonstrates substantial advantages, achieving Accuracy = 89.30%, Precision = 91.18%, Recall = 86.53%, F1-score = 88.80%, and AUC = 95.92%. Compared to the best baseline, DANN, DAP2ER improves F1-score by 0.2104 and AUC by 0.2357. While the Source-Only baseline achieves relatively high Recall, it significantly lags behind DAP2ER in F1-score, indicating that relying solely on source-domain supervision cannot effectively overcome cross-project distribution discrepancies.
Overall, DAP2ER demonstrates superior F1-score and AUC in both transfer directions, highlighting its ability to not only improve the identification of vulnerable samples in the target domain but also enhance the control of false positives and false negatives in class-imbalanced settings. This showcases the robustness and effectiveness of the proposed progressive alignment strategy and multi-constraint collaborative optimization in cross-domain vulnerability detection tasks.
We observe that vulnerabilities with weaker lexical/structural cues and higher context dependence are generally harder to detect in the cross-project setting. In particular, logic-related flaws and subtle semantic misuse patterns often manifest through dispersed code semantics rather than localized tokens, making their representations less transferable across projects. Similarly, boundary-condition and rare corner-case vulnerabilities tend to be underrepresented in the training data and to exhibit high intra-class diversity, which increases the uncertainty of pseudolabeling and weakens class-conditional alignment. In contrast, vulnerabilities with more explicit local signatures are comparatively easier to capture. Overall, these observations suggest that detection performance degrades when vulnerability categories are rare, semantically subtle, or heavily reliant on long-range program context, which is further amplified by domain shift and severe class imbalance in cross-project vulnerability detection.
4.3. Ablation Study
To further validate the contribution of each module in DAP2ER, we conducted ablation experiments by systematically removing or disabling key components of the model. Specifically, we evaluate the following four configurations: Only DANN, which retains only the domain adversarial module while removing entropy minimization and prototype alignment; w/o Entropy, which removes the entropy minimization module while retaining both domain adversarial training and prototype alignment; w/o Prototype, which removes the prototype alignment module while retaining both domain adversarial training and entropy minimization; and Full DAP2ER, the complete model incorporating all three modules.
As shown in Table 4, DAP2ER achieves the best overall performance: Accuracy increases from 0.870 to 0.873, F1-score from 0.8780 to 0.8785, and Precision from 0.8225 to 0.8376. However, Recall slightly decreases from 0.9416 to 0.9235, indicating that the model adopts a more conservative approach to positive-class predictions with the introduction of the entropy constraint and prototype guidance. This reduces false positives and improves Precision, albeit at the expense of a slight reduction in Recall. This observation aligns with the design goal of high-confidence screening and class-conditional alignment to suppress noise propagation. It is worth noting that the performance metrics for w/o Entropy and Only DANN are virtually identical in this setup, with minimal fluctuations in AUC. Similarly, the difference between w/o Prototype and Only DANN is marginal, indicating that adversarial alignment plays a dominant role in performance improvement. Meanwhile, the benefits of entropy minimization and prototype-guided alignment are primarily observed in enhancing prediction reliability and decision boundary robustness, leading to a modest but stable improvement in the overall metrics for DAP2ER.
To further visually assess the contribution of each key module, we present the ablation results as a heatmap (Figure 2). From the overall color distribution, it is evident that the three variants (Only DANN, w/o Entropy, and w/o Prototype) exhibit nearly identical color intensities across the four metrics of Accuracy, Precision, F1-score, and Recall. This suggests that domain adversarial alignment is the main driving force behind model performance under the current transfer setting, providing a stable foundational alignment capability for target-domain transfer. In contrast, the removal of either entropy minimization or prototype guidance does not lead to significant performance degradation, indicating that these modules primarily serve as fine-tuning mechanisms for decision boundaries and class-conditional structures, rather than being the sole contributors to baseline performance.
Notably, the heatmap reveals a deeper color in the Precision and F1-score dimensions for DAP2ER, along with a slight improvement in Accuracy. This suggests that the improvements primarily stem from enhanced prediction reliability and discriminability in the target domain. Specifically, entropy minimization reduces uncertainty in target-domain predictions, promoting clearer classification boundaries. On the other hand, the prototype-guided strategy enhances intra-class sample aggregation and minimizes inter-class confusion through class-conditional constraints, thereby improving the Precision. The heatmap also shows a slight decrease in Recall for DAP2ER; combined with the improved Precision, this reflects a more robust but relatively conservative decision tendency in the target domain. This tradeoff, in which a reduction in Recall is compensated for by a higher Precision, results in an overall improvement in F1-score.
4.4. Hyperparameter Sensitivity Analysis
To evaluate the sensitivity of our method to changes in the key hyperparameters, we conducted a series of experiments focusing on the impact of the domain-adversarial loss weight, the confidence threshold, the hidden layer size h, and the prototype alignment loss weight on model performance. In these experiments, the target-domain AUC was used as the primary evaluation metric.
Specifically, we independently varied each of these four hyperparameters while keeping all others fixed. For each setting, the model was trained using the same protocol and the corresponding AUC on the target domain was recorded. The results cover a range of values for the adversarial and prototype loss weights, as well as variations in the hidden layer size h and the confidence threshold. The hyperparameter sensitivity analysis results are shown in Table 5.
As shown in Figure 3, increasing the adversarial loss weight from 0.1 to 0.3 leads to a clear improvement in target AUC, with the best performance achieved at 0.3. When the weight is further increased to 1.0, the AUC decreases. This trend indicates that a small adversarial weight provides insufficient adversarial signals, leading to inadequate feature alignment, whereas an overly large weight may cause excessive domain confusion, weakening class-discriminative information and destabilizing optimization. Therefore, we set the adversarial loss weight to 0.3 by default in order to strike a better tradeoff between domain invariance and class separability.
As shown in Figure 4, the confidence threshold has a significant impact on performance, with the best AUC obtained at a threshold of 0.8. When the threshold is reduced to 0.7, the AUC drops noticeably to 0.84, suggesting that more noisy pseudolabels are retained and the resulting class-conditional alignment can be corrupted, leading to negative transfer. When the threshold is increased to 0.9, the AUC also slightly decreases, which is likely because overly strict filtering retains too few target samples and yields unreliable class-wise statistics. Overall, these results support choosing 0.8 as the default threshold.
As shown in Figure 5, the model achieves the highest AUC at an intermediate hidden layer size. With a smaller hidden layer, performance is lower, indicating that limited model capacity may restrict representation learning and hinder the learning of robust decision boundaries under domain shift. When h becomes too large, the AUC also decreases, which may be attributed to increased overfitting risk or more difficult optimization due to the enlarged parameter space. Considering both performance and complexity, we adopt this intermediate value as the default hidden layer width.
As shown in Figure 6, we observe an overall trend in which larger values of the prototype loss weight lead to worse performance: the best AUC is achieved at the smallest tested value, while performance degrades substantially when the weight increases to 0.5 and 0.7. This observation suggests that prototype alignment acts as a strong class-conditional constraint and is highly sensitive to pseudolabel quality. When the prototype loss is overly weighted, target samples with noisy pseudolabels may be forcibly pulled towards incorrect prototypes; this distorts class clusters and amplifies the effect of residual noise, causing negative transfer. Hence, we keep the prototype loss weight small and combine it with a progressive weighting strategy to improve training stability in early epochs.
In summary, the above sensitivity analyses indicate that our method remains effective within a reasonable range of hyperparameter values, although clear optima exist for the examined parameters. Based on the target-domain AUC results, we subsequently used the best-performing values of the adversarial loss weight, confidence threshold, hidden layer size, and prototype loss weight identified above as the default settings in our experiments to obtain stable and strong cross-domain detection performance.