4.1. Experimental Setup
The ConSynergy framework was evaluated against widely used concurrency bug benchmarks, including DataRaceBench [6], a subset of concurrent bugs from Juliet [7], and the DeepRace dataset [8]. DataRaceBench provides both C and Fortran programs; the C subset, comprising a total of 204 records, is used for the experiments. The DeepRace dataset comprises three distinct subsets (pthread, OpenMP Private, and OpenMP Critical), which collectively provide a balanced ratio of positive and negative samples.
The bug detection task is formally modeled as a binary classification problem. Performance quantification relies on standard metrics derived from ground truth annotations [41]: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Overall performance is evaluated using Accuracy (Acc), Precision (P), Recall (R), and the F1 score (F1).
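For reference, these metrics follow their standard definitions: P = TP/(TP + FP), R = TP/(TP + FN), F1 = 2·P·R/(P + R), and Acc = (TP + TN)/(TP + TN + FP + FN).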
The experimental environment utilized a server running Anolis OS 8.6, configured with dual Intel® Xeon® Max 9468 CPUs (96 logical cores), 512 GB of RAM, and a single NVIDIA A800 80 GB GPU. The experiments are structured to investigate the following research questions (RQs):
RQ1: Evaluation of LLM Variability: Assessing the detection performance of ConSynergy across different foundational LLMs on the target datasets.
RQ2: Comparison with SOTA Baselines: Comparative analysis of ConSynergy’s effectiveness against contemporary static and dynamic concurrency bug detection tools.
RQ3: Ablation Study: Investigating the contribution of the three core architectural components: (a) Concurrency-Aware Program Slicing, (b) Chain-of-Thought (CoT) Data Flow Modeling, and (c) SMT-based Formal Verification.
RQ4: Real-World Application and Tool Comparison: Evaluating ConSynergy’s capability in detecting bugs within complex, real-world projects compared to established static analysis tools.
4.2. RQ1: Performance in Concurrency Bug Detection
To address RQ1, the multi-stage performance of ConSynergy is evaluated on the concurrency subsets of DataRaceBench and the Juliet Test Suite, employing three distinct SOTA LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro). Performance metrics (P, R, and F1) are quantified at four key stages of the pipeline: extraction and slicing, data flow analysis, SMT constraint generation, and final detection. The results are summarized in Table 1.
The initial Extraction and Slicing phase achieved perfect scores (P, R, F1 = 1.0) across all LLMs and datasets. This confirms the efficacy of the LLM-assisted, iterative correction mechanism (Phase 2) in accurately identifying concurrency-critical statements and ensuring the syntactic and semantic integrity of the minimal program slice for downstream analysis.
The initial decline in performance is observed at the Data Flow Analysis stage (Phase 3). On the DataRaceBench dataset, Gemini 2.5 Pro achieved the highest F1 score (), demonstrating high effectiveness in modeling inter-thread data propagation. The high precision scores across all models (e.g., for GPT-4o and Claude 3.5 Sonnet on Juliet) indicate that the CoT prompting effectively mitigates false positives by meticulously reasoning about data dependencies.
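As an illustration of this stage, the following sketch shows how a step-by-step data flow prompt might be structured; the template wording, the COT_DATAFLOW_TEMPLATE constant, and the build_dataflow_prompt helper are illustrative assumptions rather than the exact prompt used by ConSynergy.

# Hypothetical CoT prompt template for the data flow analysis phase.
# The step wording and the build_dataflow_prompt helper are illustrative
# assumptions, not ConSynergy's actual prompt.
COT_DATAFLOW_TEMPLATE = """You are analysing a concurrency-focused program slice.
Reason step by step:
1. List every shared variable and the threads that access it.
2. For each shared variable, trace its definitions and uses across threads.
3. Note which accesses are ordered by locks, joins, or barriers.
4. Report each unordered write/read or write/write pair as a candidate race.

Program slice:
{slice_code}
"""

def build_dataflow_prompt(slice_code: str) -> str:
    # Fill the template with the concurrency-aware slice produced upstream.
    return COT_DATAFLOW_TEMPLATE.format(slice_code=slice_code)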
Performance degradation is most noticeable in the downstream SMT Constraint Generation stage (Phase 4), which is directly dependent on the accuracy of the preceding data flow phase. The significant decline in F1 score (e.g., GPT-4o on DataRaceBench dropped from to ) suggests that errors in identifying subtle inter-thread data flows lead to the synthesis of inaccurate or incomplete Z3 constraints. This confirms the cascading dependency of the pipeline, where LLM hallucination during data flow modeling directly impacts the formal correctness required for verification.
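To make the role of the synthesized Z3 constraints concrete, the minimal sketch below shows one way an interleaving-feasibility query can be encoded with the z3 Python bindings; the event names and happens-before edges are illustrative assumptions, not the constraints ConSynergy emits for any particular benchmark.

# Minimal sketch of an interleaving-feasibility check with Z3 (z3-solver).
# The events (w1, r2, rel, acq) and the happens-before edges below are
# illustrative assumptions, not ConSynergy's actual constraint encoding.
from z3 import Ints, Solver, sat

w1, r2, rel, acq = Ints("w1 r2 rel acq")   # integer timestamps of four events

s = Solver()
s.add(w1 < rel)    # thread 1: write to shared x happens before the lock release
s.add(acq < r2)    # thread 2: lock acquire happens before the read of x
# s.add(rel < acq) # uncomment to model the release/acquire synchronisation edge

def order_unconstrained(solver, a, b):
    # A race on (a, b) is feasible if neither order is forced by the
    # happens-before constraints, i.e. both a < b and b < a are satisfiable.
    return solver.check(a < b) == sat and solver.check(b < a) == sat

print("race feasible:", order_unconstrained(s, w1, r2))

With the release/acquire edge left out, both orders are satisfiable and the race is reported as feasible; adding the commented edge forces the write before the read and the query becomes unsatisfiable.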
In the final Detection phase, the errors accumulated across the preceding stages determine the final metric values. Overall, GPT-4o demonstrates the most robust performance across all phases and datasets, achieving an F1 score of on Juliet and on DataRaceBench. While Claude 3.5 Sonnet exhibits high initial precision in some stages, its lower recall in the final detection stage (e.g., on DataRaceBench) suggests that it missed more True Positive paths than GPT-4o did.
In summary, these results validate the overall effectiveness of the LLM-guided pipeline. They also underscore a critical finding: accuracy in multi-stage concurrency detection exhibits strong inter-phase dependency. Robust performance requires maintaining high accuracy in both the probabilistic modeling (Phase 3) and the formal constraint synthesis (Phase 4), as errors accumulate and reduce the final verification precision.
4.3. RQ2: Comparison with Baselines
To benchmark the performance of ConSynergy, several representative concurrency bug detection methods were selected, including pre-trained models requiring fine-tuning (CodeBERT [27], GraphCodeBERT [42], and CodeT5 [43]) and two GNN-based approaches (Devign [20] and ReGVD [44]). Additionally, a Zero-Shot LLM baseline was included, in which GPT-4o was directly prompted for bug analysis. For all trainable models, the dataset was randomly partitioned into training, validation, and test sets at a 7:1.5:1.5 ratio, and all comparative experiments were conducted on the same test set. In the DeepRace dataset, DP denotes the pthread subset, while DO1 and DO2 represent the first and second OpenMP subsets, respectively. Experimental parameters for the fine-tuned models were set consistently with prior work [45]. The experimental results are shown in Table 2.
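For reproducibility of the baseline setup, the sketch below shows one way the 7:1.5:1.5 stratified split described above can be produced; the split_dataset helper and the use of scikit-learn are assumptions for illustration, not the paper's released script.

# Minimal sketch of the 7 : 1.5 : 1.5 split used for the trainable baselines
# (assumed helper, not the paper's released script).
from sklearn.model_selection import train_test_split

def split_dataset(samples, labels, seed=42):
    # First carve off 30% of the data, then halve it into validation and test,
    # giving a 70/15/15 partition; stratify to keep the label ratio balanced.
    x_tr, x_rest, y_tr, y_rest = train_test_split(
        samples, labels, test_size=0.30, random_state=seed, stratify=labels)
    x_va, x_te, y_va, y_te = train_test_split(
        x_rest, y_rest, test_size=0.50, random_state=seed, stratify=y_rest)
    return (x_tr, y_tr), (x_va, y_va), (x_te, y_te)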
The proposed ConSynergy method significantly outperforms all baseline approaches in terms of both average Precision () and average F1 score () across the four test datasets. This superior performance confirms that the constrained, multi-stage LLM approach effectively balances sensitivity (Recall) and specificity (Precision) in concurrency bug detection.
Fine-tuned pre-trained models (CodeBERT, GraphCodeBERT, CodeT5) exhibit generally high average Recall (0.877 for CodeT5), a result characteristic of a predictive bias toward the positive (vulnerable) class. This conservative strategy minimizes False Negatives but simultaneously generates an excessive number of False Positives, ultimately compromising the F1 score (CodeT5 average F1: ).
Graph learning methods (Devign, ReGVD) demonstrate a more balanced performance profile than the fine-tuned pre-trained models, yet their overall efficacy (ReGVD average F1: ) remains inferior to ConSynergy. This limitation stems from their inherent reliance on the completeness and fidelity of upstream static analysis tools used for graph construction, which often struggle to capture the complex, context-sensitive inter-thread dependencies essential for accurate concurrency modeling.
The Zero-Shot baseline yielded a remarkably high average Recall (0.983), confirming the LLM’s inherent capability to identify potential bug patterns. However, its low average Precision (0.507), driven by an excessive False Positive rate, highlights the risk of unconstrained LLM misclassification. Without the formalized structural and semantic constraints provided by our slicing and SMT verification pipeline, the model tends to over-generalize, resulting in poor practical utility.
A critical practical advantage of ConSynergy is its significantly reduced computational overhead (Table 3). Methods requiring fine-tuning (CodeBERT, GraphCodeBERT, CodeT5) incur substantial training overhead, with CodeT5 requiring over 42 min due to its extensive parameter size. Similarly, the hybrid architecture of ReGVD results in a lengthy training duration (33 min 35 s total).
In contrast, ConSynergy requires zero training time, relying exclusively on LLM inference coupled with efficient static analysis. The total inference overhead for ConSynergy is only 15 s, representing an approximately 168-fold reduction (roughly 2520 s vs. 15 s) compared to the most computationally expensive baseline (CodeT5). This superior efficiency underscores the practicality of integrating ConSynergy into rapid, large-scale concurrent code analysis workflows.
4.4. RQ3: Ablation Study
To rigorously ascertain the contribution of each core component to ConSynergy’s overall efficacy, three critical ablation experiments were conducted. These variants specifically isolate the impact of the framework’s unique hybrid elements:
LLM Slicing: Replacing our static analysis-assisted program slicing with direct, unconstrained slice generation by the LLM.
No Chain-of-Thought: Eliminating the structured, step-by-step reasoning prompt during the data flow analysis phase.
LLM Verification: Substituting the SMT-based formal verification of path feasibility with direct LLM judgment.
Figure 2 illustrates the performance impact of these ablations, using GPT-4o across the three subsets of the DeepRace dataset (DP, DO1, and DO2).
The complete ConSynergy framework exhibits optimal performance, achieving an average Precision (), Recall (), and F1 score () across the test subsets. This superior balance confirms the merit of our hybrid design.
The most severe degradation in performance was observed in the LLM Slicing configuration (average F1 score: ). This sharp decline underscores a fundamental limitation of LLM-only approaches: LLMs struggle to precisely identify and preserve the minimal yet necessary program slice required for accurate concurrency analysis. Relying solely on LLM output for slicing leads to the unintentional omission of crucial control flow and data dependency statements pertaining to shared variables, and the resulting loss of context cannot be recovered in downstream phases.
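To illustrate what the static analysis-assisted slicing is expected to preserve, the simplified sketch below keeps statements that touch shared variables, synchronization primitives, or the control flow guarding them; the Stmt record and the keep_in_slice and slice_program helpers are assumptions for illustration, not ConSynergy's actual slicer, which additionally applies LLM-guided iterative correction.

# Simplified sketch of a concurrency-aware slicing criterion. The Stmt record
# and keep_in_slice predicate are illustrative assumptions, not ConSynergy's
# actual slicer.
from dataclasses import dataclass, field

@dataclass
class Stmt:
    text: str
    vars_used: set = field(default_factory=set)  # variables read or written
    is_sync: bool = False                        # lock/unlock, join, barrier
    is_control: bool = False                     # branch/loop guarding accesses

def keep_in_slice(stmt: Stmt, shared_vars: set) -> bool:
    # Keep statements that access shared state, perform synchronisation,
    # or decide whether such accesses execute at all.
    return bool(stmt.vars_used & shared_vars) or stmt.is_sync or stmt.is_control

def slice_program(stmts, shared_vars):
    return [s for s in stmts if keep_in_slice(s, shared_vars)]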
Replacing the SMT solver with Direct LLM Verification also resulted in a significant performance drop (average F1 score: ). While marginally better than the LLM Slicing variant, this result strongly indicates that LLMs, even when provided with accurately generated data flow paths, lack the formal rigor required for sound verification. Formal constraint satisfaction via SMT is indispensable for guaranteeing the feasibility of the interleaving path, a function that probabilistic LLM reasoning cannot reliably perform.

The elimination of the structured reasoning prompt (No Chain-of-Thought) resulted in a smaller, yet measurable, performance reduction. This confirms that the explicit CoT guidance enhances the LLM's internal consistency and improves the fidelity of cross-thread data flow modeling. While CoT is less critical than the slicing or verification mechanisms, its absence diminishes the model's ability to accurately identify subtle data dependencies, thereby propagating minor errors downstream.
4.5. RQ4: Concurrency Bug Detection in Real Software
Performance was evaluated against real-world Common Vulnerabilities and Exposures (CVEs). The experimental scope was significantly expanded beyond the initial ConVul dataset [46] to include 16 documented concurrency bugs, incorporating high-profile bugs from 2023 and 2024 in the Linux kernel. ConSynergy was benchmarked against four distinct baselines representing different technical paradigms: two industry-standard static analysis tools (Facebook's Infer [47] and MathWorks' Polyspace [48]) and two SOTA LLM-based agents (GitHub Copilot [49] and the LangGraph-based ai-code-inspector [50]).
The evaluation results, summarized in Table 4, demonstrate a significant advantage in detection effectiveness for ConSynergy. Our method successfully identified 14 out of 16 bugs (an 87.5% detection rate), substantially surpassing both traditional static tools and LLM-based agents.
In the comparison between LLM-based approaches, GitHub Copilot demonstrated respectable performance, correctly identifying 12 out of 16 bugs, outperforming the autonomous ai-code-inspector agent (8 detected). While the agentic workflow employs iterative critique loops intended to improve reasoning, our experiments suggest that in the specific domain of concurrency, this complexity often leads to “over-reasoning” or hallucinated safety guarantees, resulting in a lower detection rate compared to the direct inference of Copilot. Nevertheless, both pure LLM methods struggled with deep causal chains in the Linux kernel (e.g., CVE-2024-44903) compared to ConSynergy’s neuro-symbolic verification.
Regarding analysis time, traditional static analysis tools proved to be the most computationally efficient. Infer showed the lowest overhead (<1 s), followed by Polyspace (averaging ≈ 11 s). However, this speed came at the cost of accuracy and stability: Infer suffered from a high False Negative rate, while Polyspace frequently failed to complete analysis due to environment dependency issues (marked as “failed”).
Among the LLM-based approaches, ConSynergy proved to be the most efficient, with an average analysis time of 21 s. In contrast, GitHub Copilot incurred a significantly higher latency (averaging ≈60 s) due to unconstrained token generation. The ai-code-inspector agent was the most computationally expensive (≈92 s), primarily due to the latency inherent in local model inference (via Ollama) required to execute its iterative generation and reflection workflow. ConSynergy thus represents an optimal trade-off: it provides the detection depth of formal verification while remaining nearly 3× faster than Copilot and more than 4× faster than the autonomous agent.
The handling of recent complex bugs highlights ConSynergy’s adaptability. For instance, in CVE-2024-42111, the race condition involved a subtle use-after-free protected by a complex lock hierarchy. While static tools timed out due to state-space explosion and GitHub Copilot missed the lock release timing, ConSynergy’s sliced SMT constraints correctly identified the feasibility of the race. However, limitations remain: the False Negative in CVE-2017-6346 persists because the static slicing component lacks support for specific C language constructs used in this driver, pointing to future work in enhancing the parser’s language compatibility.
To verify broad applicability, 200 C programs from the SV-COMP ConcurrencySafety benchmark were tested, as shown in Table 5. This allowed comparison not only with Infer and Polyspace but also with direct, unconstrained LLM baselines (GPT, Claude, Gemini).
On the SV-COMP suite, ConSynergy achieved the highest overall Accuracy () and F1 score ().
Static Analysis Comparison: Infer exhibited a substantial False Negative count (FN = 18), as seen in cases like gcd-2.c and per-thread-array-join-counter-race-4.c, where its lightweight nature fails to capture subtle, implicit inter-thread data races. Polyspace, conversely, was characterized by a high False Positive rate (FP = 34), notably in per-thread-array-join-counter.c and thread-local-value-cond.c. In these examples, Polyspace incorrectly flags protected resources, demonstrating a failure to fully resolve the complex semantics of condition variables and mutex release conditions.
LLM Comparison: The unconstrained LLM baselines (GPT, Claude, Gemini) achieved high Recall (average ≈ 0.85) but were severely penalized by high False Positive counts (average ≈ 55), confirming the hallucination problem discussed previously. ConSynergy effectively addresses this trade-off: by employing formal SMT verification, it reduces the False Positive count dramatically (FP = 15) while maintaining high Recall (0.880), thus achieving a superior balance in practical concurrency bug detection.
To further verify the cross-language adaptability of the proposed method and broaden test-case coverage, testing was also performed on Go concurrent programs. The tool presented here can support other languages with minimal customization, whereas Infer and Polyspace currently do not support Go. This study selected non-blocking samples from GoBench [57] for testing, divided into two subsets: GOREAL and GOKER. The results are shown in Table 6. Since GoBench does not include TN samples, only TP, FP, and FN were used to measure the tool's performance (i.e., Precision, Recall, and F1 rather than Accuracy). The experimental results on GoBench demonstrate that the tool's performance is comparable to that on C/C++, indicating its capability to detect concurrency bugs in cross-language environments.