LLMLoc: A Structure-Aware Retrieval System for Zero-Shot Bug Localization
Abstract
1. Introduction
- We design a novel framework that fuses semantic and structural embeddings, reducing reliance on test-based bug localization (BL) techniques.
- We incorporate AST, control-flow, and data-flow signals to mitigate semantic bias in LLM-based retrieval.
- We conduct experiments on all 835 bugs in Defects4J v2.0.0 and show significant improvements in key metrics such as MAP and MRR compared to existing methods.
- We validate the effectiveness of LLMLoc in real-world maintenance environments with limited test quality and highlight its potential to extend to broader applications such as vulnerability detection and automated patch generation.
2. Background
2.1. Traditional Bug Localization Techniques
2.2. Machine Learning–Based Fault Localization
2.3. Large Language Models and Bug Localization
2.4. Structure-Aware Embedding Research
3. Methodology
3.1. Preprocessing
3.2. Embedding and Structural Information Generation
3.3. Candidate Generation and Re-Ranking
3.4. Candidate List Integration
3.5. Inference
4. Experiment
4.1. Experimental Setup
4.2. Dataset
4.3. Evaluation Metrics
4.4. Baselines
- RQ1. Limitations of a baseline LLM: We assess how effectively an LLM can perform bug localization when provided only with a bug report and the entire code corpus as input, without any additional retrieval strategies or structural information. This experiment establishes the fundamental performance level and inherent limitations of a purely LLM-based approach [11,12,24,28].
- RQ2. Contribution of SASR’s structure–semantic integration: We analyze the extent to which SASR, which combines CodeBERT-based semantic scores [13] with AST structural signals, improves quantitative metrics such as Top-k accuracy, MAP, and MRR. This evaluation verifies whether structural information enhances candidate quality and positively influences the inputs used during LLM inference [17,18,19,32].
- RQ3. Stabilization effect of tournament-based inference: We evaluate the consistency and reproducibility of results when candidates are partitioned into batches, ranked using Top-3 voting, and then finalized through a Top-5 selection from a pooled set (a minimal sketch of this procedure follows this list). This setup tests whether the tournament design mitigates the variability caused by the stochastic nature of LLM inference [20,33,34].
- RQ4. Synergistic effect of combining SASR and the tournament method: We examine whether the integration of high-quality candidates from SASR with the stability of tournament-based inference produces improvements that go beyond additive gains. The goal is to confirm additional benefits over individual methods in both Top-k accuracy and ranking-based metrics [18,24,28,35].
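To make the tournament procedure in RQ3 concrete, the following is a minimal sketch of batched Top-3 voting followed by a pooled Top-5 selection. The batch size, number of voting rounds, and all function and variable names are illustrative assumptions; the paper itself specifies only the batching, per-batch Top-3 voting, and final Top-5 selection.

```python
import random

def tournament_rank(candidates, llm_top3, batch_size=10, rounds=3, seed=0):
    """Stabilize stochastic LLM ranking via batched Top-3 voting.

    candidates: ordered list of candidate methods (e.g., the SASR top-k list).
    llm_top3:   callable that takes one batch and returns the three methods
                the LLM judges most suspicious (one LLM call per batch).
    """
    rng = random.Random(seed)
    votes = {c: 0 for c in candidates}
    pooled = set()

    for _ in range(rounds):                  # repeat to average out LLM stochasticity
        shuffled = candidates[:]
        rng.shuffle(shuffled)                # vary batch composition between rounds
        for i in range(0, len(shuffled), batch_size):
            batch = shuffled[i:i + batch_size]
            for winner in llm_top3(batch):   # Top-3 voting within each batch
                votes[winner] += 1
                pooled.add(winner)

    # Final Top-5 selection from the pooled winners, ordered by vote count.
    return sorted(pooled, key=lambda c: votes[c], reverse=True)[:5]
```

Because every candidate competes in several randomly composed batches, a single unlucky LLM response cannot eliminate a method outright, which is the stabilization effect RQ3 measures.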
- Baseline LLM: This group directly inputs the bug report and the entire method corpus (on average about 8000 methods) into a single prompt without incorporating any structural signals or retrieval preprocessing [11,12,28]. It provides the minimum performance benchmark of a purely LLM-based approach and serves as the control group for RQ1.
- SASR-only: This group isolates the contribution of the proposed structure–semantic retrieval method. By combining CodeBERT-based semantic embeddings [13] with AST-based structural embeddings [16,17,32], SASR re-ranks candidates and restricts the LLM input to the top 20 functions (see the fusion sketch after this list). This setup directly addresses RQ2 by quantifying the impact of structural information alone.
- Proposed LLMLoc: This approach integrates SASR with additional signals from SBIR [30], Ochiai [4], and Suspiciousness Ranking (SR) [36] to form the final candidate set, followed by tournament-based inference. The design jointly tests RQ3 and RQ4 by verifying whether SASR improves candidate quality and the tournament procedure enhances inference stability in a complementary manner.
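As a companion sketch for RQ2 and RQ4, the λ-weighted structure–semantic fusion at the heart of SASR might look roughly as follows. This is a minimal illustration assuming cosine similarity over precomputed CodeBERT and AST-metric vectors; the function names, the normalization, and the exact combination rule are our assumptions, not the authors' implementation.

```python
import numpy as np

def sasr_rank(query_sem, query_struct, sem_embs, struct_embs, lam=0.7, top_k=20):
    """Rank methods by a λ-weighted blend of semantic and structural similarity.

    query_sem / sem_embs:       CodeBERT embeddings of the bug report and of each method.
    query_struct / struct_embs: AST metric-based vectors (control/data-flow features).
    lam:                        weight on the semantic signal (see the λ schedule table).
    """
    def cos(q, m):
        # cosine similarity of a query vector against each row of a matrix
        return (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-9)

    fused = lam * cos(query_sem, sem_embs) + (1.0 - lam) * cos(query_struct, struct_embs)
    return np.argsort(-fused)[:top_k]  # indices of the top-k candidate methods
```

The top_k=20 cutoff mirrors the SASR-only configuration above, which restricts the LLM input to the top 20 functions.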
4.5. Experimental Results
4.6. Ablation Study
5. Discussion
5.1. Analysis of Experimental Results
5.2. Threats to Validity
6. Related Work
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Wong, W.E.; Gao, R.; Li, Y.; Abreu, R.; Wotawa, F. A survey on software fault localization. IEEE Trans. Softw. Eng. 2016, 42, 707–740.
2. Zou, D.; Liang, J.; Xiong, Y.; Ernst, M.D.; Zhang, L. An empirical study of fault localization families and their combinations. IEEE Trans. Softw. Eng. 2021, 47, 332–347.
3. Jones, J.A.; Harrold, M.J. Empirical evaluation of the Tarantula automatic fault-localization technique. In Proceedings of the ASE, Lisbon, Portugal, 7–11 November 2005; pp. 273–282.
4. Abreu, R.; Zoeteweij, P.; Van Gemund, A.J.C. An evaluation of similarity coefficients for software fault localization. In Proceedings of the PRDC, Riverside, CA, USA, 18–20 December 2006; pp. 39–46.
5. Abreu, R.; Zoeteweij, P.; Van Gemund, A.J.C. On the accuracy of spectrum-based fault localization. In Proceedings of the TAICPART-MUTATION, Windsor, UK, 10–14 September 2007; pp. 89–98.
6. Wong, W.E.; Debroy, V.; Gao, R.; Li, Y. The DStar method for effective software fault localization. IEEE Trans. Reliab. 2014, 63, 290–308.
7. Jia, Y.; Harman, M. An analysis and survey of the development of mutation testing. IEEE Trans. Softw. Eng. 2011, 37, 649–678.
8. Papadakis, M.; Traon, Y.L. Metallaxis-FL: Mutation-based fault localization. Softw. Test. Verif. Reliab. 2015, 25, 605–628.
9. Li, X.; Li, W.; Zhang, Y.; Zhang, L. DeepFL: Integrating multiple fault diagnosis dimensions for deep fault localization. In Proceedings of the ISSTA, Beijing, China, 15–19 July 2019; pp. 169–180.
10. Briand, L.C.; Labiche, Y.; Liu, X. Using machine learning to support debugging with Tarantula. In Proceedings of the ISSRE, Trollhättan, Sweden, 5–9 November 2007; pp. 137–146.
11. Wu, Y.; Li, Z.; Zhang, J.M.; Papadakis, M.; Harman, M.; Liu, Y. Large language models in fault localisation. arXiv 2023, arXiv:2308.15276.
12. Yang, A.Z.H.; Le Goues, C.; Martins, R.; Hellendoorn, V. Large language models for test-free fault localization. In Proceedings of the ICSE, Lisbon, Portugal, 14–20 April 2024.
13. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv 2020, arXiv:2002.08155.
14. Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-training code representations with data flow. In Proceedings of the ICLR, Vienna, Austria, 3–7 May 2021.
15. Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code Llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950.
16. Sadat, A.; Agarwal, P.; Lin, H.; Zhang, C.; Bendersky, M.; Najork, M. LameR: LLM-augmented multi-stage ranking for code retrieval. arXiv 2023, arXiv:2305.15489.
17. Li, Z.; Wang, X.; Wang, S.; Nguyen, T.N. SANTA: Structure-aligned neural text-to-code retrieval. In Proceedings of the FSE, San Francisco, CA, USA, 3–9 December 2023; pp. 1472–1484.
18. Xu, H.; Zhang, Z.; Li, J.; Wang, X.; Cheung, S.-C. FlexFL: Boosting fault localization with LLMs via flexible feature learning. IEEE Trans. Softw. Eng. 2025, 51, 535–548.
19. Lou, Y.; Zhu, Q.; Dong, J.; Li, X.; Sun, Z.; Hao, D.; Zhang, L.; Zhang, L. Boosting coverage-based fault localization via graph-based representation learning. In Proceedings of the ESEC/FSE, Athens, Greece, 23–28 August 2021; pp. 664–676.
20. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173.
21. Tsumita, S.; Hayashi, S.; Amasaki, S. Large-scale evaluation of method-level bug localization with FinerBench4BL. In Proceedings of the SANER, Macao, China, 21–24 March 2023; pp. 815–824.
22. Zhang, W.; Li, Z.; Wang, Q.; Li, J. FineLocator: Improving bug localization by query expansion. Inf. Softw. Technol. 2019, 110, 121–135.
23. Campos, J.; Riboira, A.; Perez, A.; Abreu, R. GZoltar: An Eclipse plug-in for testing and debugging. In Proceedings of the ASE, Essen, Germany, 3–7 September 2012; pp. 378–381.
24. Kang, S.; An, G.; Yoo, S. A quantitative and qualitative evaluation of LLM-based explainable fault localization. Proc. ACM Softw. Eng. 2024, 1, 64.
25. Li, Y.; Wang, S.; Nguyen, T.N. Fault localization with code coverage representation learning. In Proceedings of the ICSE, Madrid, Spain, 22–30 May 2021; pp. 661–673.
26. Zhang, Z.; Lei, Y.; Mao, X.; Li, P. CNN-FL: An effective approach for localizing faults using CNNs. In Proceedings of the SANER, Hangzhou, China, 24–29 February 2019; pp. 445–459.
27. Zeng, S.; Tan, H.; Zhang, H.; Li, J.; Zhang, Y.; Zhang, L. An extensive study on pretrained models for program understanding. In Proceedings of the ISSTA, Virtual Event, 18–22 July 2022; pp. 39–51.
28. Widyasari, R.; Ang, J.W.; Nguyen, T.G.; Sharma, N. Demystifying faulty code: Step-by-step reasoning in large language models for fault localization. In Proceedings of the SANER, Rovaniemi, Finland, 12–15 March 2024; pp. 568–579.
29. Razzaq, A.; Buckley, J.; Patten, J.V.; Chochlov, M.; Sai, A.R. BoostNSift: A query boosting and code sifting technique for method level bug localization. In Proceedings of the SCAM, Luxembourg, 27–28 September 2021; pp. 81–91.
30. Le, T.B.; Oentaryo, R.J.; Lo, D. Information retrieval and spectrum-based bug localization: Better together. In Proceedings of the ESEC/FSE, Bergamo, Italy, 30 August–4 September 2015; pp. 579–590.
31. Zhou, J.; Zhang, H.; Lo, D. Where should the bugs be fixed? More accurate IR-based bug localization based on bug reports. In Proceedings of the ICSE, Zurich, Switzerland, 2–9 June 2012; pp. 14–24.
32. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80.
33. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
34. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 6–12 December 2020.
35. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223.
36. Abreu, R.; Zoeteweij, P.; Golsteijn, R.; Van Gemund, A.J.C. A practical evaluation of spectrum-based fault localization. J. Syst. Softw. 2009, 82, 1780–1792.
37. Li, Y.; Wang, S.; Nguyen, T.N. Fault localization to detect co-change fixing locations. In Proceedings of the ESEC/FSE, Singapore, 14–18 November 2022; pp. 659–671.
38. Zhang, M.; Li, X.; Zhang, L.; Khurshid, S. Boosting spectrum-based fault localization using PageRank. In Proceedings of the ISSTA, Santa Barbara, CA, USA, 10–14 July 2017; pp. 261–272.
39. Zhang, M.; Li, Y.; Li, X.; Chen, L.; Zhang, Y.; Zhang, L. An empirical study of boosting spectrum-based fault localization via PageRank. IEEE Trans. Softw. Eng. 2021, 47, 1089–1113.
40. Böhme, M.; Soremekun, E.O.; Chattopadhyay, S.; Ugherughe, E.; Zeller, A. Where is the bug and how is it fixed? An experiment with practitioners. In Proceedings of the ESEC/FSE, Paderborn, Germany, 4–8 September 2017; pp. 117–128.
41. Meta AI. Blog of Meta Llama 3. Available online: https://ai.meta.com/blog/ (accessed on 28 September 2025).
42. HuggingFace. Model card of Llama3-8B-Instruct. Available online: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct (accessed on 28 September 2025).


| Aspect | SANTA | GNN-Based Methods | Proposed LLMLoc (SASR) |
|---|---|---|---|
| Structural representation | Graph-based AST alignment between bug reports and code; requires node/edge labeling | Learned graph embeddings via supervised training on defect datasets | AST metric-based embedding using control/data-flow features without supervision |
| Semantic representation | Textual similarity using TF-IDF or code embeddings | Pretrained code encoders with fine-tuning | CodeBERT semantic embeddings directly fused with structural signals |
| Fusion strategy | Rule-based graph alignment scores | Neural fusion layers within GNN framework | Weighted retrieval fusion using adaptive λ (heuristic balancing semantic vs. structural) |
| Training requirement | Supervised with aligned graph pairs | Supervised on large labeled defect graphs | Zero-shot; no task-specific training required |
| Primary goal | Improve graph matching accuracy | Capture topological relations for defect prediction | Enhance retrieval robustness and ranking stability under test-free conditions |

| λ Schedule | Top-1 | Top-3 | Top-5 | MAP | MRR |
|---|---|---|---|---|---|
| {0.3, 0.5, 0.7, 0.85} (default) | 238 | 367 | 416 | 0.336 | 0.364 |
| {0.2, 0.4, 0.6, 0.8} | 240 | 365 | 428 | 0.339 | 0.370 |

| Condition | λ Value | Interpretation |
|---|---|---|
| No report or test available | 0.30 | Structural signal emphasized |
| 1–50 words | 0.50 | Balanced structural and semantic signals |
| 51–150 words | 0.70 | Beginning of semantic signal emphasis |
| More than 150 words | 0.85 | Strong emphasis on the semantic signal |
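Read as code, the schedule above maps bug-report length to λ as follows (a direct transcription of the table; the function name is ours):

```python
def select_lambda(report_words):
    """Pick the semantic weight λ from the bug-report length (per the table above)."""
    if not report_words:          # no report or test available
        return 0.30
    if report_words <= 50:        # 1–50 words
        return 0.50
    if report_words <= 150:       # 51–150 words
        return 0.70
    return 0.85                   # more than 150 words
```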

| Project | #Bugs | Avg. #Methods | Avg. Buggy Methods |
|---|---|---|---|
| Chart | 26 | 5485 | 1.6 |
| Closure | 176 | 7927 | 1.8 |
| Lang | 65 | 3013 | 1.4 |
| Math | 106 | 3902 | 1.7 |
| Mockito | 38 | 2023 | 1.2 |
| Time | 27 | 4121 | 1.5 |
| Collections | 28 | 1640 | 1.3 |
| Codec | 18 | 1213 | 1.1 |
| Compress | 47 | 2482 | 1.4 |
| Csv | 16 | 1870 | 1.3 |
| Gson | 18 | 3110 | 1.2 |
| JacksonCore | 26 | 2934 | 1.5 |
| JacksonDatabind | 112 | 6285 | 1.7 |
| JacksonXml | 6 | 1544 | 1.2 |
| Jsoup | 93 | 3221 | 1.5 |
| JxPath | 22 | 2313 | 1.4 |
| Cli | 39 | 1765 | 1.3 |
| Total | 835 | 6843 | 1.6 |

| Method | Top-1 | Top-3 | Top-5 | MAP | MRR |
|---|---|---|---|---|---|
| Baseline (LLM only) | 220 | 311 | 358 | 0.325 | 0.287 |
| SASR | 221 | 345 | 411 | 0.325 | 0.347 |
| LLMLoc (Ours) | 238 | 367 | 416 | 0.336 | 0.364 |

| Method | Top-1 | Top-3 | Top-5 | MAP | MRR |
|---|---|---|---|---|---|
| Baseline (LLM only) | 220 | 311 | 358 | 0.325 | 0.287 |
| SASR (no Tournament) | 195 | 364 | 437 | 0.294 | 0.342 |
| SASR + Tournament | 221 | 345 | 411 | 0.324 | 0.347 |
| LLMLoc (Ours) | 238 | 367 | 416 | 0.336 | 0.364 |
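For reference, the MAP and MRR columns in the tables above are the standard ranking metrics, averaged over bugs. A minimal sketch of how they are typically computed from per-bug rankings (function and variable names are ours):

```python
def map_mrr(rankings, ground_truth):
    """Compute mean average precision (MAP) and mean reciprocal rank (MRR).

    rankings:     {bug_id: list of methods, most suspicious first}
    ground_truth: {bug_id: set of actual buggy methods}
    """
    ap_sum = rr_sum = 0.0
    for bug, ranked in rankings.items():
        buggy = ground_truth[bug]
        hits, precisions, first_hit = 0, [], None
        for rank, method in enumerate(ranked, start=1):
            if method in buggy:
                hits += 1
                precisions.append(hits / rank)   # precision at each relevant hit
                first_hit = first_hit or rank
        ap_sum += sum(precisions) / max(len(buggy), 1)
        rr_sum += 1.0 / first_hit if first_hit else 0.0
    n = len(rankings)
    return ap_sum / n, rr_sum / n                # (MAP, MRR)
```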

| Condition | Total (s/bug) | Tournament (s/bug) | LLM Inference (s/bug) | # LLM Calls | GPU Peak (MB) | CPU RSS (MB) |
|---|---|---|---|---|---|---|
| Baseline Top-K (no tournament) | 2.485 ± 0.171 | 0.000 | 2.312 ± 0.196 | 1.0 | 15,442 | 1,479 |
| SASR (no tournament) | 2.434 ± 0.187 | 0.000 | 2.343 ± 0.187 | 1.0 | 15,460 | 1,470 |
| LLMLoc (full) | 3.866 ± 0.195 | 1.498 ± 0.151 | 2.271 ± 0.186 | 2.0 | 15,465 | 1,469 |
