Article

LLMLoc: A Structure-Aware Retrieval System for Zero-Shot Bug Localization

1
Department of Computer Applied Mathematics, Hankyong National University, Anseong 17579, Republic of Korea
2
Department of Computer Applied Mathematics (Computer System Institute), Hankyong National University, Anseong 17579, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(21), 4343; https://doi.org/10.3390/electronics14214343
Submission received: 11 October 2025 / Revised: 3 November 2025 / Accepted: 3 November 2025 / Published: 5 November 2025

Abstract

Bug localization is a critical task in large-scale software maintenance, as it reduces exploration costs and enhances system reliability. However, existing approaches face limitations due to semantic mismatches between bug reports and source code, insufficient use of structural information, and instability in candidate rankings. To address these challenges, this paper proposes LLMLoc, a system that integrates traditional statistical methods with semantic retrieval, centered on a Structure-Aware Semantic Retrieval (SASR) framework. Experiments on all 835 bugs from the Defects4J dataset show that LLMLoc achieves relative improvements of 3.4 percent in Mean Average Precision (MAP) and 29.8 percent in Mean Reciprocal Rank (MRR) compared with state-of-the-art LLM-based methods. These results show that combining structural cues with semantic representations provides more effective retrieval than relying on LLM inference alone. Furthermore, by stabilizing Top-K candidate sets, LLMLoc reduces ranking instability and delivers practical benefits even in real-world maintenance environments with insufficient testing resources.

1. Introduction

As software systems continue to grow in scale and complexity, the ability to promptly and accurately identify faults in code has become an essential requirement for ensuring both development efficiency and system reliability. In particular, bug localization (BL) represents the first step in debugging and accounts for a significant portion of overall maintenance costs. Its importance has been consistently emphasized in both industrial and academic contexts [1,2].
Traditional studies have primarily relied on spectrum-based fault localization (SBFL) and mutation-based fault localization (MBFL) [3,4,5,6,7,8]. SBFL calculates the suspiciousness of program elements using test execution results and coverage information, while MBFL estimates fault likelihood by injecting mutations and analyzing behavioral differences. Although these approaches have made substantial theoretical contributions, their performance degrades sharply without large test suites and stable execution environments. They are also associated with prohibitively high computational costs. In practice, insufficient test quality in industrial environments often limits their effectiveness [1,2]. To address these challenges, machine learning-based BL (MLFL) approaches have been proposed. However, MLFL suffers from the high cost of constructing large-scale training datasets, the complexity of feature engineering, and significant domain transfer problems, where performance declines drastically when applied to different projects or programming languages [9,10].
Recent advances in large language models (LLMs) have opened new possibilities for BL research [11,12,13,14,15]. Some studies have shown effective localization by learning semantic associations between bug reports and code functions without relying on test execution. Others have reported performance gains by leveraging code-specific LLMs. Nevertheless, these methods primarily focus on token-sequence-level semantics and fail to adequately incorporate structural signals such as abstract syntax trees (ASTs) or control and data flows [16,17]. As a result, the semantic reasoning capabilities of LLMs are constrained by the lack of structural context, leading to instability in candidate ranking. While recent work such as FlexFL [18] has improved performance through flexible feature learning, structural information remains underutilized and degradation in long-context inputs has also been reported [19].
While the idea of combining semantic and structural information has been explored in earlier studies such as SANTA [17] and graph-based methods [19], these approaches typically rely on supervised training or end-to-end graph learning with pre-aligned datasets. In contrast, LLMLoc extends this fusion concept to a zero-shot, LLM-centric bug localization pipeline that requires no task-specific training. The novelty of this work therefore lies not in proposing a new form of fusion itself, but in integrating structure–semantic retrieval with inference stabilization to achieve reliable localization under test-free and resource-constrained conditions.
Meanwhile, in the information retrieval (IR) community, various efforts have sought to improve retrieval quality by integrating structured and unstructured data [20,21,22]. These studies show that structure-aware embeddings can complement the semantic representations of language models. However, such fusion strategies have not been fully validated in the context of BL. In summary, prior research leaves open challenges including the test dependency of SBFL and MBFL, the limited generalization of MLFL, the lack of structural information in LLM-based approaches, and the absence of structure–semantic fusion strategies in BL despite promising evidence from IR research.
To bridge these gaps, this paper proposes LLMLoc, a structure-aware zero-shot bug localization system. LLMLoc integrates CodeBERT-based semantic embeddings with structural signals derived from ASTs as well as control and data flows through a structure–semantic fusion retrieval (SASR) mechanism. Candidate functions generated by SASR are then combined with traditional statistical techniques such as SBIR and Ochiai. Finally, a tournament-based LLM reasoning procedure is employed to provide stable and precise fault localization.
The key contributions of this study are as follows:
  • We design a novel framework that fuses semantic and structural embeddings, reducing reliance on test-based BL techniques.
  • We incorporate AST, control-flow, and data-flow signals to mitigate semantic bias in LLM-based retrieval.
  • We conduct experiments on all 835 bugs in Defects4J v2.0.0 and show significant improvements in key metrics such as MAP and MRR compared to existing methods.
  • We validate the effectiveness of LLMLoc in real-world maintenance environments with limited test quality and highlight its potential to extend to broader applications such as vulnerability detection and automated patch generation.

2. Background

Bug localization is a critical task for improving debugging efficiency and reducing maintenance costs in large-scale software systems [1,2]. Although numerous approaches have been proposed over the past decades, practical adoption remains limited due to challenges such as dependency on test information, difficulties in data generalization, and the absence of structural signals. This section reviews traditional techniques, machine learning–based methods, LLM-driven approaches, and research on structure-aware embeddings, and it highlights the necessity of the present study.

2.1. Traditional Bug Localization Techniques

The most extensively studied approaches are Spectrum-Based Fault Localization (SBFL) and Mutation-Based Fault Localization (MBFL) [1,2]. SBFL computes the suspiciousness score of each code element by using test execution outcomes and coverage information, with well-known metrics including Tarantula [3], Ochiai [4,5], and DStar [6]. MBFL applies various mutations to the program and compares the corresponding test execution results to estimate the likelihood of faults in specific code segments [7,8].
Although these traditional techniques have achieved significant academic success, they face two inherent limitations. First, they require an adequate number of test cases. In environments with low-quality or sparse test suites, suspiciousness calculations become distorted. Second, when applied to large-scale systems, both execution costs and mutation generation costs become prohibitively high. In industrial projects with hundreds of thousands of functions, the computational overhead makes practical adoption unrealistic [1,2]. In other words, despite their theoretical contributions, SBFL and MBFL are constrained by test dependency and computational cost, which limit their practical effectiveness in industry.

2.2. Machine Learning–Based Fault Localization

To overcome the limitations of traditional methods, machine learning–based fault localization (MLFL) has been proposed [2,10]. MLFL uses inputs such as coverage matrices, execution logs, and static code features, and then applies trained models to estimate fault suspiciousness [10,23]. For example, DeepFL employs neural networks to achieve better performance than SBFL [9], while GRACE leverages graph neural networks (GNNs) to structurally capture execution path information [19].
However, MLFL also faces inherent limitations. Constructing large-scale training datasets is costly, feature engineering is complex, and performance drops significantly when applied across different projects or programming languages [2,24,25]. Moreover, the quality and distribution of training data cause large variations in performance, making MLFL difficult to apply in environments with insufficient test execution information [2]. In summary, although MLFL has shown better performance than traditional methods, its dependency on data and challenges with domain transfer restrict its practical applicability.

2.3. Large Language Models and Bug Localization

The recent progress of large language models (LLMs) has opened new opportunities for the bug localization problem. LLMs are trained on both code and natural language, which allows them to directly infer semantic relationships between bug reports and code functions. Some studies have shown that LLMs can achieve effective localization even without relying on test execution data [12]. In cases where code-specific LLMs have been applied, the strength of semantic reasoning has been shown [11]. Moreover, general-purpose LLMs such as ChatGPT and LLaMA have shown strong zero-shot performance in tasks such as code generation, summarization, and retrieval, thereby broadening their potential use without additional training [15,26,27].
However, existing LLM-based approaches often treat code as token sequences, which prevents them from fully capturing structural context such as abstract syntax trees, control flow, and data flow [11,12]. As a result, while semantic cues are effectively captured, the lack of structural awareness leads to instability when handling complex control paths or interactions between functions. In short, LLMs are strong in semantic inference but remain limited by instability caused by the absence of structural information.

2.4. Structure-Aware Embedding Research

In the fields of information retrieval (IR) and code search, there has been active exploration of approaches that combine structural signals with semantic embeddings to overcome existing limitations. For example, LameR improves zero-shot retrieval performance of LLMs through query augmentation [16], while SANTA aligns structural code data with natural language descriptions to enhance the effectiveness of dense retrieval [17]. These studies show that incorporating structural characteristics can significantly improve retrieval quality.
This insight can be directly applied to bug localization. In many industrial settings, sufficient test execution data is often unavailable, which makes it critical to leverage both the limited natural language clues in bug reports and the structural context of source code. Embedding-based frameworks that integrate semantics and structure therefore provide a promising alternative that can address the limitations of existing SBFL, MBFL, MLFL, and LLM-based approaches.
As summarized in Table 1, LLMLoc’s SASR differs from SANTA and GNN-based methods in three key aspects: (1) it employs metric-based structural embeddings derived from AST and control/data-flow features instead of graph alignment or supervised GNN training; (2) the fusion is achieved via an adaptive λ-weighted retrieval heuristic rather than learned neural fusion layers; and (3) it operates in a zero-shot LLM-centric setting without any task-specific training. These differences make SASR a lightweight and generalizable alternative to existing structure–semantic fusion approaches.

3. Methodology

The proposed LLMLoc framework for bug report–based, function-level bug localization is composed of four stages: preprocessing, semantic and structural embedding generation (retrieval), candidate list integration, and final inference [1,2,9]. Each stage performs an independent role, yet all are connected within the pipeline to enable the LLM to infer fault locations with greater precision and stability [11,12,15,28].
The overall workflow is illustrated in Figure 1. Given a bug report and a set of code functions as input, the preprocessing stage first extracts function signatures and AST-based features [19,23]. Next, semantic embeddings are generated using CodeBERT [13], while structural embeddings are constructed in parallel using AST metrics [17]. These two signals are then fused based on a λ-weighting scheme to form Structure-Aware Semantic Retrieval (SASR) [18]. The SASR results are further combined with SBIR, Ochiai, and Suspiciousness Ranking (SR) outputs to strengthen the candidate list [4,5,29,30]. In the final stage, the LLM ranks the integrated candidate set to produce Top-N results, and the output consists of Top-1, Top-3, and Top-5 fault functions [11,12,28].

3.1. Preprocessing

The preprocessing procedure in this study extends and enhances the steps presented in FlexFL [18], applying them to 835 bugs in the Defects4J v2.0.0 dataset. This phase is not simply about data cleaning, but rather serves as a critical foundation that supports the performance and reliability of subsequent structure–semantic retrieval and LLM inference.
First, both buggy and fixed versions of each bug were checked out and maintained under an identical directory structure. Project-specific source root paths were standardized using rule-based mapping. After collecting all Java files, path normalization was performed to remove redundant prefixes, thereby reducing ambiguities that could occur during function-level indexing and ensuring consistency in retrieval. Bug reports were extracted from JSON format by combining the title and description into a single text. Even in cases where no report existed, placeholders were retained to maintain data consistency. In addition, only the minimal test cases that actually reproduced the defect were preserved from trigger tests. Failure logs and execution stacks were included, but external library frames were removed. Unnecessary logs were truncated at the failure point to retain only the essential context, which reduces noise while still providing sufficient signals for both SBFL/IR-based methods and LLM inference [1,2,30].
During code indexing, the fully qualified names (FQN: file path, class name, method name, parameters) of all methods were extracted and indexed together with their code snippets. Hierarchical navigation functions (get_paths, get_classes_of_path, get_methods_of_class, get_code_snippet_of_method) and fuzzy search functions (find_class, find_method) were implemented to allow LLMs to explore the codebase incrementally and selectively, rather than processing the entire codebase as a single lengthy input. This design reduces token waste and information loss while enabling a balanced use of semantic and structural signals [16,17,20,31].
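A minimal sketch of this navigation and fuzzy-search interface is given below. The function names follow the paper; the in-memory index layout, the dictionary fields, and the use of difflib for fuzzy matching are illustrative assumptions rather than the actual implementation.

```python
# Illustrative sketch of the incremental navigation / fuzzy-search interface.
# Index layout and fuzzy-matching backend (difflib) are assumptions.
import difflib
from collections import defaultdict

class CodeIndex:
    def __init__(self, methods):
        # methods: list of dicts with keys "path", "class", "signature", "snippet"
        self.methods = methods
        self.by_path = defaultdict(set)
        self.by_class = defaultdict(list)
        for m in methods:
            self.by_path[m["path"]].add(m["class"])
            self.by_class[m["class"]].append(m)

    def get_paths(self):
        return sorted(self.by_path)

    def get_classes_of_path(self, path):
        return sorted(self.by_path.get(path, []))

    def get_methods_of_class(self, cls):
        return [m["signature"] for m in self.by_class.get(cls, [])]

    def get_code_snippet_of_method(self, signature):
        for m in self.methods:
            if m["signature"] == signature:
                return m["snippet"]
        return None

    def find_class(self, query, n=5):
        # Fuzzy lookup so the LLM can recover from imprecise class names.
        return difflib.get_close_matches(query, list(self.by_class), n=n, cutoff=0.5)

    def find_method(self, query, n=5):
        names = [m["signature"] for m in self.methods]
        return difflib.get_close_matches(query, names, n=n, cutoff=0.5)
```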
In the postprocessing phase, a matching procedure was applied to address name inconsistency issues between FQNs and queries. Token-level partial matching was used as the primary mapping method, and when matching failed, the Levenshtein distance was applied to return the closest candidate. This process ensured consistent signature mapping even in special cases involving constructors, method overloading, and generic types. As a result, it improved the accuracy of LLM inputs and prevented error accumulation in downstream stages [22,31].
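The name-alignment step can be approximated as follows. This is a hedged sketch, not the paper's implementation: it applies token-level overlap first and falls back to edit distance when no tokens match; the tokenization rules and scoring are assumptions.

```python
# Sketch of signature alignment: token-level partial matching first,
# Levenshtein-distance fallback. Tokenization and scoring are assumed.
import re

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def tokens(name):
    # Split FQNs on separators and camelCase boundaries.
    return set(t.lower()
               for t in re.split(r"[.#$()\s,]+|(?<=[a-z])(?=[A-Z])", name) if t)

def match_signature(query, candidates):
    q = tokens(query)
    # Primary: token-level partial matching.
    overlap, best = max((len(q & tokens(c)), c) for c in candidates)
    if overlap > 0:
        return best
    # Fallback: return the closest candidate by Levenshtein distance.
    return min(candidates, key=lambda c: levenshtein(query, c))
```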
The outputs of this preprocessing include buggy/fixed working copies, standardized Java file lists, refined bug reports and trigger tests, method FQN–snippet indexes with navigation interfaces, and the postprocessing name-matching module. These components provide reproducibility, reliability, and scalability across the pipeline stages of embedding generation, structural signal extraction, candidate retrieval and integration, and LLM inference. Unlike FlexFL, which focused mainly on function extraction and bug report refinement, this study introduced path standardization, navigation interface construction, and postprocessing alignment procedures. These additions ensure stable and consistent inputs even in large-scale codebases, establishing a key distinction of this work.

3.2. Embedding and Structural Information Generation

In this study, we construct semantic embeddings and structural embeddings in parallel using bug reports and a function-level source code corpus as inputs. Although these two signals are computed independently, they are later combined in the SASR (Structure-Aware Semantic Retrieval) stage to produce the final ranking of candidate functions. This dual-embedding framework addresses the limitation of prior LLM-only approaches, which rely heavily on code sequence semantics and fail to sufficiently incorporate structural context.
The semantic embeddings are built on CodeBERT [13]. Bug reports, trigger test texts, and function code are mapped into a shared embedding space, followed by mean pooling and L2 normalization to produce fixed-length vectors. Each bug report is thus represented as a single query vector, and the full set of functions forms a function-level semantic embedding matrix. These semantic vectors serve as the basis for quantifying linguistic and contextual similarity between bug reports and code, enabling more refined retrieval than traditional IR methods that rely solely on keyword matching.
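A sketch of this semantic-embedding step is shown below, assuming the public microsoft/codebert-base checkpoint and standard Hugging Face APIs; the mask-aware pooling and the truncation length are implementation assumptions consistent with the description above.

```python
# Sketch: CodeBERT embeddings with mean pooling and L2 normalization.
# Checkpoint name and max length are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

@torch.no_grad()
def embed(texts, max_length=512):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1)      # mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)  # L2 normalization

# One query vector per bug report, one matrix for the function corpus:
# q = embed([report_text]); F = embed(function_snippets)
# semantic_scores = (q @ F.T).squeeze(0)
```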
The structural embeddings are generated through a static analysis procedure designed for this study. For each function, we collect basic statistical indicators such as lines of code, token counts, frequencies of control and loop statements, number of function calls, and cyclomatic complexity. Using an Abstract Syntax Tree (AST) parser, we extract structural features such as node counts, tree depth, branching factors, and the frequency of conditional statements, loops, exception handling constructs, and method calls [16,17,32]. In addition, higher-order indicators such as distribution entropy are included to capture code complexity and uncertainty. These multidimensional structural vectors complement the semantic embeddings by reflecting control flow and structural patterns that are often overlooked, thereby providing a more comprehensive representation of fault likelihood at the function level.
In addition to AST-based syntactic metrics, the structural embedding integrates flow-aware statistics such as branch depth, function-call relations, and variable-usage chains, which implicitly capture control-flow (conditional and looping patterns) and data-flow (read/write dependencies and parameter passing) characteristics without constructing explicit CFG/DFG graphs.
The control- and data-flow information is extracted through lightweight static-analysis routines that traverse the abstract syntax tree to collect flow-related statistics. For each function, the control-flow aspects (e.g., conditional branching and loop depth) and data-flow aspects (e.g., variable definitions, references, and parameter dependencies) are summarized into numerical counts and ratios rather than full graph structures. These aggregated values are normalized and concatenated with AST-level syntactic metrics to form the multidimensional structural vector used in LLMLoc. This process allows implicit representation of control- and data-flow relationships without explicit graph construction or graph neural modeling.
Beyond basic statistical counts, each control- and data-flow component contributes structured indicators that describe the logical behavior of the program. Control-flow features include branching-factor entropy, loop-nesting depth, and path-diversity scores derived from conditional constructs. Data-flow features capture variable-definition chains, usage frequencies, and parameter-dependency densities that represent how information propagates across statements. These normalized indicators collectively define the control- and data-flow subspace within the structural embedding, enabling the model to reflect program logic patterns rather than mere syntactic complexity.
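The following sketch illustrates how such a structural feature vector could be assembled. The paper does not name its AST parser, so javalang is used here only as one possible choice, and the exact feature set, depth proxy, and entropy computation are assumptions rather than the published pipeline.

```python
# Sketch of a per-source-unit structural feature vector (parser and feature
# set are assumptions; the paper does not name its AST tooling).
import math
from collections import Counter
import javalang

CONTROL = {"IfStatement", "SwitchStatement", "TernaryExpression"}
LOOPS = {"ForStatement", "WhileStatement", "DoStatement"}

def structural_features(java_source):
    tree = javalang.parse.parse(java_source)      # compilation unit (whole file)
    types, max_depth = Counter(), 0
    for path, node in tree:                       # walk all AST nodes
        types[type(node).__name__] += 1
        max_depth = max(max_depth, len(path))     # ancestor-path length ~ nesting depth
    total = sum(types.values()) or 1
    # Distribution entropy of node types as a complexity/uncertainty proxy.
    entropy = -sum((c / total) * math.log2(c / total) for c in types.values())
    branches = sum(types[t] for t in CONTROL)
    loops = sum(types[t] for t in LOOPS)
    calls = types["MethodInvocation"]
    cyclomatic = 1 + branches + loops             # decision-point approximation
    return [total, max_depth, branches, loops, calls,
            types["TryStatement"], cyclomatic, entropy]
```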
In the SASR stage, the two embeddings are combined to derive the final candidate ranking. First, semantic retrieval is used to extract the top-S candidate functions. Their AST-based structural vectors are then aggregated to form a structural profile.
Formally, let $E_S(f_i)$ denote the structural embedding of function $f_i$ among the top-$S$ candidates. The structural profile $P_s$ is defined as the centroid of these vectors:

$$P_s = \frac{1}{S}\sum_{i=1}^{S} E_S(f_i),$$

For each candidate $f_i$, we compute its structural similarity to the profile using cosine similarity:

$$\mathrm{Sim}_{\mathrm{struct}}(f_i) = \cos\!\left(E_S(f_i),\, P_s\right),$$

We then normalize $\mathrm{Sim}_{\mathrm{struct}}(f_i)$ to $[0, 1]$ using min–max scaling over the candidate set (with a small $\varepsilon$ to avoid division by zero).
Truly faulty functions tend to share structural patterns with high-confidence candidates. Therefore, the centroid $P_s$ serves as a stable reference that mitigates noise from individual functions during reranking.
Missing structural values are imputed with the candidate-set median to ensure stability. Semantic scores are standardized using z-score and min–max normalization, after which they are combined with the structural scores using a λ-weighted average.
The final combined score is defined as:

$$\mathrm{Score}_{\mathrm{final}} = \lambda\,\mathrm{Score}_{\mathrm{semantic}} + (1-\lambda)\,\mathrm{Score}_{\mathrm{structural}},$$
Here, λ is not a fixed value but is adaptively adjusted according to the input condition. When no bug report or test is available, greater weight is placed on the structural signal (λ = 0.30) to ensure a minimum level of retrieval quality. When the text length is short, structural and semantic signals are balanced (λ = 0.50). As the bug report becomes more detailed, the semantic weight is progressively increased (51–150 words: λ = 0.70; more than 151 words: λ = 0.85) to take full advantage of the richer semantic cues. This rule-based adaptive weighting mitigates performance instability caused by variation in report quality and length, allowing SASR to maintain consistent results across diverse input conditions. The detailed rules are presented in Table 2.
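The adaptive schedule and the fusion step can be summarized in the sketch below. The word-count thresholds and λ values follow the rules above (Table 2); the normalization helper and the single min–max step stand in for the z-score/min–max combination and are assumptions.

```python
# Sketch of adaptive λ selection and structure–semantic score fusion.
# Thresholds and λ values follow Table 2; normalization details are assumed.
import numpy as np

def adaptive_lambda(report_text, has_test):
    words = len(report_text.split()) if report_text else 0
    if words == 0 and not has_test:
        return 0.30          # no report/test: lean on the structural signal
    if words <= 50:
        return 0.50          # short report: balance both signals
    if words <= 150:
        return 0.70          # moderately detailed report
    return 0.85              # long report: trust the semantic cues

def min_max(x, eps=1e-8):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + eps)

def fuse(sem_scores, struct_vecs, lam):
    # Structural profile = centroid of the top-S candidates' structural embeddings.
    profile = struct_vecs.mean(axis=0)
    cos = struct_vecs @ profile / (
        np.linalg.norm(struct_vecs, axis=1) * np.linalg.norm(profile) + 1e-8)
    return lam * min_max(sem_scores) + (1 - lam) * min_max(cos)
```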
To ensure that the selected structural indicators were both meaningful and empirically grounded, we performed a feature importance analysis using a Random Forest classifier. The results (Figure 2) indicate that control-flow and depth-related proxies (e.g., average depth, loop and conditional frequencies) were the most influential features, confirming the validity of our feature selection.

3.3. Candidate Generation and Re-Ranking

The candidate generation and re-ranking stage is the most critical preprocessing step in the entire LLMLoc pipeline, as it determines the quality of the input on which subsequent LLM inference depends. To overcome the instability of existing approaches that rely on a single signal, this study designs a three-step procedure. First, semantic and execution signals are collected in parallel. Second, re-ranking is performed using structure–semantic fusion. Finally, multi-signal integration is applied. This design ensures both coverage and reliability in the search process.
The first step is the initial candidate generation using multiple signals. Bug reports and trigger tests are used as inputs, and three independent techniques are applied: BoostN, SBIR, and Ochiai. BoostN leverages CodeBERT embeddings to compute semantic similarity between reports and functions, enabling semantic retrieval based on textual cues [13]. In contrast, SBIR and Ochiai are spectrum-based techniques that rely on failing test executions, computing coverage and failure rates to derive suspiciousness scores for functions [4,5,22,30,31]. Because the input signals differ, semantic signals capture linguistic cues while execution signals reflect dynamic execution patterns. As a result, a Top-5 function list is generated for each bug, ensuring diversity and coverage in the initial candidate set. Compared with a single approach, this step covers a broader search space, which is a key distinction of the proposed design.
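For reference, the Ochiai suspiciousness score follows its standard definition from the SBFL literature; the variable names in the sketch below are illustrative, not taken from the paper.

```python
# Standard Ochiai suspiciousness for a program element e:
#   ef = failing tests covering e, ep = passing tests covering e,
#   total_f = total number of failing tests.
import math

def ochiai(ef, ep, total_f):
    denom = math.sqrt(total_f * (ef + ep))
    return ef / denom if denom > 0 else 0.0
```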
The second step is re-ranking with SASR. As introduced in Section 3.2, Structure-Aware Semantic Retrieval (SASR) combines CodeBERT-based semantic scores with AST structural similarity using λ-weighting to compute the final score [16,17,32]. This process incorporates structural context that simple semantic embeddings cannot capture, thereby repositioning the correct function toward the top of the ranking. In particular, SASR ensures not only that the correct function is included in the candidate set, but also that it consistently appears among the Top-k results. This significantly improves the success rate of subsequent LLM inference by mitigating the instability issue observed in LLM-only approaches, where the correct function may be present but not ranked highly.
The third step is multi-signal integration. The Top-5 lists produced by BoostN–SASR, SBIR, Ochiai, and Suspiciousness Ranking (SR) are merged into a unified candidate set of up to 20 functions [3,6,10,25,30]. Two principles guide this integration. First, if the same function appears in multiple techniques, the SASR score is prioritized to preserve the contribution of structure–semantic fusion. Second, if the candidate count is insufficient after duplicate removal, the results from SBIR, Ochiai, and SR are sequentially added to maintain diversity. This integration strategy positions SASR as the primary signal while incorporating auxiliary signals from traditional methods to enhance both robustness and generalizability of the candidate set.
The final candidate set of up to 20 functions is then passed as input to the LLM prompt. This stage is not merely a search step but a decisive process that influences the quality and stability of the inference. The Top-20 set effectively restricts the search space that developers need to examine, while also increasing the probability that the correct function appears at the top. Consequently, the candidate generation and re-ranking stage serves as a fusion-based search strategy that overcomes the limitations of single methods and provides a key design element ensuring both improved performance and practical reliability of LLMLoc.
We conducted a sensitivity analysis by replacing the default adaptive schedule (λ ∈ {0.30, 0.50, 0.70, 0.85}) with an alternative schedule (λ ∈ {0.20, 0.40, 0.60, 0.80}). As summarized in Table 3, the performance remained comparable, with small variations across metrics (Top-1 +0.8%, Top-3 −0.5%, Top-5 +2.9%, MAP +1.1%, MRR +1.9%). This indicates that SASR is not sensitive to the exact λ values and thus is not overfitted to a particular choice. The length thresholds (1–50, 51–150, 151+) were selected to balance report counts across bins rather than to tune hyperparameters.

3.4. Candidate List Integration

The candidate list integration stage is a critical procedure in the LLMLoc pipeline, as it determines the final inference quality. The results obtained from BoostN–SASR, SBIR, Ochiai, and Suspiciousness Ranking (SR) reflect heterogeneous signals: semantic embeddings, structure–semantic fusion, execution-based coverage, and statistical suspiciousness. While each technique shows strengths in certain scenarios, they also exhibit significant instability in others. To address this, our approach integrates these diverse signals in order to mitigate bias and achieve both high recall and ranking stability.
In the first step, the Top-5 results from each of the four techniques were combined to form an initial candidate set of up to 20 functions. Rather than a simple merge, duplicate handling and score prioritization rules were applied. When the same function appeared in multiple techniques, the SASR score was adopted with priority. This choice is based on the observation that SASR, which integrates both structural and semantic cues, provides a more reliable indicator. This design allows SASR to serve as the backbone of the candidate set while still incorporating complementary perspectives from other methods.
The second step is a candidate supplementation procedure. If the number of unique candidates after deduplication was fewer than 20, additional functions were included in the order of SBIR, Ochiai, and SR. SBIR contributes runtime context that semantic and structural signals may fail to capture, as it reflects the execution traces of failing tests. Ochiai provides insights into fault distribution characteristics based on test coverage, while SR, though simple, improves diversity in the candidate pool. This hierarchical design strengthens generality and stability by centering on structural and semantic signals and augmenting them with execution and statistical signals.
The third step restricts the final number of candidates to 20. This decision was made for two reasons. First, it prevents inefficiency caused by excessive candidates under the token limitations of LLM inputs. Second, supplying too many candidates can dilute inference effectiveness and reduce the likelihood of ranking the correct function near the top. In our experiments on all 835 bugs in Defects4J, the Top-20 candidate set improved Top-1 accuracy and mean reciprocal rank (MRR) by 8–12% compared to individual techniques. In contrast, expanding beyond 50 candidates increased recall but decreased MRR, showing empirically that Top-20 represents the balance point between efficiency and effectiveness.
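The integration logic of these three steps can be sketched as follows; the ranked-list input format and tie handling are assumptions beyond what the text specifies, while the SASR-first priority, the SBIR → Ochiai → SR supplementation order, and the cap of 20 follow the description above.

```python
# Sketch of candidate-list integration: SASR-first deduplication, then
# sequential supplementation from SBIR, Ochiai, and SR, capped at 20.
def integrate_candidates(sasr, sbir, ochiai, sr, limit=20):
    # Each argument is a ranked list of (function_fqn, score), best first.
    merged, seen = [], set()
    for fqn, score in sasr:                 # SASR scores take priority on duplicates
        if fqn not in seen:
            merged.append((fqn, score))
            seen.add(fqn)
    for ranked in (sbir, ochiai, sr):       # supplement in this fixed order
        for fqn, score in ranked:
            if len(merged) >= limit:
                return merged
            if fqn not in seen:
                merged.append((fqn, score))
                seen.add(fqn)
    return merged[:limit]
```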
The final candidate set constructed in this way is directly provided as input to the LLM prompt for function-level bug localization. By integrating not just through simple merging but by prioritizing SASR as the central structure–semantic signal and supplementing it with execution and statistical cues, LLMLoc successfully overcame the instability of single-method approaches. Furthermore, by limiting the candidate count to a practical 20, the approach reduces developer inspection costs while providing reliable results. This shows the significance of candidate list optimization, a strategy that prior studies have not sufficiently explored.

3.5. Inference

The final inference stage takes the integrated candidate function set (up to 20) as input and employs a large language model (LLM) to identify the actual fault location [11,12]. The objective of this study is not simply to list candidates but to reliably return stable Top-5 results under limited information conditions. To achieve this, the input was restricted to the bug report and candidate function signatures. Function bodies, execution logs, and external knowledge were excluded to reflect the information constraints often encountered in real maintenance environments and to minimize the risk of model reasoning being distorted by external noise [20]. This setup ensures both research rigor and practical relevance.
The inference was based on a pretrained code-understanding LLM executable in a local environment. To guarantee reproducibility, random seeds were fixed and deterministic decoding was applied [11,12]. The prompt was designed to include the bug report title and description, the list of candidate function signatures, and an instruction to return the most likely Top-k functions as a JSON array using only the given information. This explicit format constraint ensured that the model maintained a consistent output structure, which improved parsing convenience and reproducibility.
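An illustrative prompt following this description is shown below. The exact wording used in the experiments is not published, so this template is an assumption that only reproduces the stated ingredients: report title and description, candidate signatures, and a JSON-array output instruction.

```python
# Illustrative prompt template (the exact wording used in LLMLoc is not published).
PROMPT_TEMPLATE = """You are locating the faulty function for a bug.

Bug report title: {title}
Bug report description: {description}

Candidate function signatures:
{signatures}

Using ONLY the information above, return the {k} most likely faulty functions
as a JSON array of signatures, ordered from most to least likely.
Return only the JSON array."""

def build_prompt(title, description, signatures, k=5):
    return PROMPT_TEMPLATE.format(
        title=title, description=description,
        signatures="\n".join(f"- {s}" for s in signatures), k=k)
```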
A central stabilization mechanism was the tournament-based inference procedure. Because LLMs are sensitive to prompt construction and internal randomness, relying on a single call can lead to unstable candidate rankings [2,30,33]. To mitigate this, the candidate set was divided into batches, and the model was repeatedly asked to select the top three functions. Weighted scores (3–2–1) were assigned to the ranks, and cumulative scores were computed. A final prompt was then reconstructed using the top-ranked candidates to determine the final Top-5. This approach averaged out variations due to randomness while reinforcing the model’s semantic reasoning through repetition. As confirmed in the RQ3 analysis, the tournament procedure substantially improved consistency in Top-1 and MRR results, thereby enhancing reliability.
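A minimal sketch of the tournament procedure follows. The batch size, the number of finalists, and the single-call selection functions are placeholders; the 3–2–1 weighting, cumulative scoring, and final Top-5 prompt mirror the description above.

```python
# Sketch of tournament-based inference: batch voting with 3-2-1 weights,
# then a final prompt over the cumulative leaders. `ask_llm_top3` and
# `ask_llm_top5` are placeholders for deterministic LLM calls.
from collections import defaultdict

def tournament(candidates, ask_llm_top3, ask_llm_top5, batch_size=5, finalists=10):
    scores = defaultdict(float)
    batches = [candidates[i:i + batch_size]
               for i in range(0, len(candidates), batch_size)]
    for batch in batches:
        for rank, fqn in enumerate(ask_llm_top3(batch)):   # ranked best-first
            scores[fqn] += (3, 2, 1)[rank]                 # weighted votes
    leaders = sorted(scores, key=scores.get, reverse=True)[:finalists]
    return ask_llm_top5(leaders)                           # final Top-5 selection
```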
The final output was parsed in JSON format and subjected to normalization and mapping to ensure alignment with function signatures. Duplicate candidates were removed, and when fewer than five results were available, additional selections were drawn from the final pool or the overall candidate set to ensure that five results were always returned. All results were systematically logged and stored on a per-bug basis to support experimental reproducibility and post-analysis [18,22,31].
The key contribution of this stage is that it transforms LLM inference from a simple output generation process into a structured decision-making procedure, thereby overcoming the instability and limited reproducibility of conventional LLM-only approaches [11,12]. Furthermore, the normalization and mapping process resolved signature mismatch issues, enabling applicability to real-world codebases. Ultimately, LLMLoc produces Top-5 results on a candidate set that integrates semantic, structural, and execution signals, with stability ensured by the inference procedure. These results are later validated through evaluation metrics such as Top-1, Top-3, Top-5, MAP, and MRR.

4. Experiment

4.1. Experimental Setup

The performance evaluation of this study was conducted on the Defects4J v2.0.0 dataset. Defects4J contains 835 real-world bugs collected from 17 Java open-source projects [18] and is one of the most widely used benchmarks in function-level bug localization research. The dataset provides buggy and fixed versions of source code, corresponding bug reports, and the complete set of functions, thereby faithfully reflecting the search tasks that developers typically encounter during software maintenance [21]. As such, Defects4J ensures both reproducibility and comparability, while also serving as an optimal testbed for assessing the practical applicability of the proposed method.
The search unit was fixed at the function level. This granularity reflects a more realistic maintenance unit than line-level exploration, while being more precise than project-level treatment. Under the baseline condition, the entire set of project functions, typically numbering in the thousands, was used as the search space, requiring the LLM to evaluate all functions directly [30,31]. In contrast, LLMLoc employed SASR to combine CodeBERT-based semantic embeddings with AST-based structural embeddings, re-ranked the candidates, and provided only the top 20 functions as input to the LLM [16,17,18,32]. This contrast serves two research objectives. First, it reveals the inefficiency and instability of LLMs when they attempt large-scale exploration. Second, it verifies whether the fusion of structural and semantic signals can reliably promote the correct function into higher ranks. The experimental design itself thus functions as a structural mechanism for evaluating the core contribution of LLMLoc.
The input conditions were deliberately constrained. The model received only the bug report (title and description) and function signatures, while function bodies, execution logs, and external knowledge were excluded [11,12,20]. This setting reflects realistic scenarios, as developers often have only minimal information such as function names and short reports in the early stages of maintenance. Therefore, a key challenge was to determine whether LLMLoc could maintain performance under conditions of limited information. Moreover, this configuration highlights the practical value of LLMLoc in a zero-shot inference environment, unlike prior approaches that assume abundant training data and logs.
The language model used was a pre-trained LLM specialized in code understanding, and inference was carried out through the Hugging Face Transformers framework [11,12]. To eliminate randomness, the parameters were set to temperature = 0 and do_sample = False, with the random seed fixed at 42 [18,20]. This design ensures that observed performance differences stem from algorithmic techniques and prompt construction rather than stochastic model variations. In particular, by comparing the baseline and SASR–LLMLoc under identical conditions, the study confirms that performance differences are attributable to the proposed method.
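The decoding configuration can be reproduced with standard Transformers calls as sketched below; the model identifier (Meta Llama-3 8B Instruct, as named in Section 5.2), the dtype, and the token budget are assumptions, while the fixed seed and greedy decoding follow the settings above.

```python
# Deterministic decoding setup: fixed seed and greedy decoding
# (temperature is ignored when do_sample=False). Model ID follows Section 5.2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(42)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()

@torch.no_grad()
def generate(prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    # Return only the newly generated continuation.
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```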
The experimental environment was configured with Ubuntu 24.04 (WSL2), an NVIDIA RTX 4090 GPU (24 GB), CUDA 12.6, PyTorch 2.5.1+cu121, and Transformers 4.55.3 [18]. This environment is reproducible in typical research settings and also reflects resource requirements realistically achievable in industrial deployment.
This experimental setup carries three academic and practical implications. First, the use of Defects4J v2.0.0 ensures quantitative comparison with prior studies and reproducibility. Second, the direct comparison between the baseline and SASR–LLMLoc clearly shows the contribution of structure-aware semantic retrieval. Third, by employing constrained input conditions and a realistic experimental environment, the study empirically validates the applicability of LLMLoc in industrial contexts [2,18,21]. Consequently, the experimental design not only measures performance but also provides systematic evidence that LLMLoc achieves both academic contribution and practical utility.

4.2. Dataset

This study used the Defects4J v2.0.0 dataset as the foundation for experiments [18,21]. Defects4J is a standard benchmark that collects real-world bugs reported in widely used Java-based open-source projects and provides both the buggy and fixed versions of the source code for each case [2,18]. This dataset is widely adopted in software defect research for two main reasons. First, it ensures realism because it contains bugs that actually occurred in industrial settings. Second, it provides both bug reports and test cases, enabling quantitative comparisons across different approaches [21,30,31].
Defects4J v2.0.0 consists of 17 projects and 835 bugs, with each bug labeled at the function level. The number of faulty functions per bug ranges from one to five, and the total number of functions per project averages several thousand. These characteristics reflect the uncertainty of large-scale search spaces rather than simplified small-scale settings, making the dataset suitable for evaluating both the efficiency and robustness of the proposed method. Furthermore, the labels were established based on actual developer patches and test case execution results, ensuring high reliability at the function level [2,18].
In this study, bug reports in the dataset were used as the primary input to the model. Each report consists of a title and description extracted from issue trackers such as JIRA or GitHub Issues [18,21]. Code bodies and execution logs were deliberately excluded from the input. This design choice reflects the information constraints often encountered in real-world maintenance scenarios. Developers are frequently provided with incomplete or summarized reports, and therefore this setting goes beyond a benchmark evaluation to ensure practical validity [11,12,18].
By using Defects4J, this study also ensured comparability within the research community. Since most prior SBFL, MBFL, and ML-based bug localization studies are based on the same dataset [21,30,31], the results of this work can be directly compared against existing methods. Such comparability is essential to show the novel contributions of the proposed approach.
However, Defects4J is limited to the Java language and specific projects, which constrains generalizability. It cannot be assumed that identical performance will hold in environments where bug report formats, naming conventions, or code structures differ significantly [22]. These limitations may threaten external validity, and future studies should extend evaluation to other languages such as Python, C/C++, and JavaScript, as well as industrial-scale projects.
The detailed project-level statistics of Defects4J v2.0.0 are summarized in Table 4. The table reports the number of bugs, the average number of functions, and the average number of buggy functions per project, thereby highlighting both the scale and imbalance of the dataset. Notably, large-scale projects such as Closure (176 bugs) and smaller projects such as Mockito (38 bugs) are included, illustrating the varying levels of search difficulty faced by researchers.

4.3. Evaluation Metrics

To comprehensively evaluate the performance of the proposed LLMLoc, we adopted Top-k accuracy (Top-1, Top-3, Top-5), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR) [2,18,21,30]. These metrics are suitable for assessing LLM-based bug localization because they capture not only whether the correct answer is included, but also its rank, distribution, and consistency. Unlike conventional SBFL or IR-based methods that mainly relied on Top-k accuracy, this study extends the analysis to the overall ranking quality and inference stability, thereby providing quantitative evidence of LLMLoc’s contribution.
Top-k accuracy measures the proportion of cases in which the actual faulty function is included within the top k candidates suggested by the model. A high Top-1 accuracy indicates that the model can identify the fault location with a single candidate, which is practically important since it minimizes developer effort. Top-3 and Top-5 show that a small set of candidates is often sufficient to find the defect, directly reflecting cost reduction in real maintenance scenarios. For LLMLoc, structural–semantic fusion significantly improved Top-3 and Top-5 accuracy compared to plain LLM inference, showing that the method not only increases recall but also elevates the correct function into the ranks developers are most likely to examine.
However, Top-k accuracy alone cannot fully describe situations with multiple faulty functions, particularly regarding how evenly they are distributed in the top ranks. To address this, we introduced MAP. MAP computes the average precision for each bug by considering the rank of every correct function, then averages the result across all cases [21,30,31]. A higher MAP indicates that faulty functions are not only included but also consistently distributed across the upper ranks. LLMLoc’s improvement in MAP shows that structural–semantic fusion (SASR) reorganizes candidate distributions, enabling LLM inference under more favorable conditions. This quantifies quality improvements across the entire ranking spectrum that cannot be captured by Top-k metrics alone.
MRR complements the above by measuring the average of the reciprocal ranks of the first correct candidate [2,18,21,30]. For instance, if the correct answer appears as the first candidate, the score is 1; if it appears as the third, the score is 1/3. Averaging across all bugs yields the MRR. A higher MRR indicates that the model does not place correct answers in top positions only occasionally but maintains them consistently across diverse bug scenarios. LLMLoc integrates a tournament-based inference procedure to mitigate randomness and prompt sensitivity in LLM inference, which results in improved MRR and shows enhanced stability. This improvement reflects not just higher performance but also stronger consistency and reproducibility, which are critical in both research and practice.
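Both ranking metrics can be computed per bug and then averaged over all 835 bugs, as in the sketch below; the list-of-signatures input format is an assumption.

```python
# Ranking metrics for one bug: `ranked` is the predicted list (best first),
# `faulty` is the set of ground-truth faulty functions.
def average_precision(ranked, faulty):
    hits, precisions = 0, []
    for i, fqn in enumerate(ranked, 1):
        if fqn in faulty:
            hits += 1
            precisions.append(hits / i)     # precision at each correct hit
    return sum(precisions) / len(faulty) if faulty else 0.0

def reciprocal_rank(ranked, faulty):
    for i, fqn in enumerate(ranked, 1):
        if fqn in faulty:
            return 1.0 / i                  # rank of the first correct function
    return 0.0

# MAP = mean of average_precision over all bugs; MRR = mean of reciprocal_rank.
```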
Top-k accuracy reflects how efficiently developers can explore the search space when identifying faulty functions. MAP evaluates the overall quality of the ranking distribution, providing insight into how well the system orders the candidate functions beyond simply including the correct one. MRR captures the stability and reliability of inference, showing whether the system consistently places the correct function near the top. When combined, these three metrics offer a multidimensional evaluation of LLMLoc, showing not only whether the correct function is included but also whether it is ranked quickly, stably, and consistently. This integrated assessment framework addresses gaps that conventional studies overlooked and establishes a solid quantitative basis for validating the efficiency, quality, and stability of LLMLoc’s contributions.

4.4. Baselines

To comprehensively evaluate the effectiveness of the proposed approach, we formulated four research questions:
  • RQ1. Limitations of a baseline LLM: We assess how effectively an LLM can perform bug localization when provided only with a bug report and the entire code corpus as input, without any additional retrieval strategies or structural information. This experiment establishes the fundamental performance level and inherent limitations of a purely LLM-based approach [11,12,24,28].
  • RQ2. Contribution of SASR’s structure–semantic integration: We analyze the extent to which SASR, which combines CodeBERT-based semantic scores [13] with AST structural signals, improves quantitative metrics such as Top-k accuracy, MAP, and MRR. This evaluation verifies whether structural information enhances candidate quality and positively influences the inputs used during LLM inference [17,18,19,32].
  • RQ3. Stabilization effect of tournament-based inference: We evaluate the consistency and reproducibility of results when candidates are partitioned into batches, ranked using Top-3 voting, and then finalized through a Top-5 selection from a pooled set. This setup tests whether the tournament design mitigates the variability caused by the stochastic nature of LLM inference [20,33,34].
  • RQ4. Synergistic effect of combining SASR and the tournament method: We examine whether the integration of high-quality candidates from SASR with the stability of tournament-based inference produces improvements that go beyond additive gains. The goal is to confirm additional benefits over individual methods in both Top-k accuracy and ranking-based metrics [18,24,28,35].
To answer these research questions, we established three comparison groups:
  • Baseline LLM: This group directly inputs the bug report and the entire method corpus (on average about 8000 methods) into a single prompt without incorporating any structural signals or retrieval preprocessing [11,12,28]. It provides the minimum performance benchmark of a purely LLM-based approach and serves as the control group for RQ1.
  • SASR-only: This group isolates the contribution of the proposed structure–semantic retrieval method. By combining CodeBERT-based semantic embeddings [13] with AST-based structural embeddings [16,17,32], SASR re-ranks candidates and restricts the LLM input to the top 20 functions. This setup directly addresses RQ2 by quantifying the impact of structural information alone.
  • Proposed LLMLoc: This approach integrates SASR with additional signals from SBIR [30], Ochiai [4], and Suspiciousness Ranking (SR) [36] to form the final candidate set, followed by tournament-based inference. The design jointly tests RQ3 and RQ4 by verifying whether SASR improves candidate quality and the tournament procedure enhances inference stability in a complementary manner.
The Baseline LLM shows the fundamental capability of inference using only large language models. SASR highlights the distinct contribution of structural information by combining it with semantic signals. LLMLoc extends this further by integrating multiple retrieval signals and stabilizing the inference process to examine synergistic gains. Through this hierarchical design, the method allows both quantitative evaluation and qualitative validation of its effectiveness.

4.5. Experimental Results

The experiments were conducted according to four research questions (RQ1–RQ4), and the overall results are summarized in Table 5.
Baseline LLM (RQ1). This setting evaluated the LLM by providing the bug report and the entire function corpus (on average about 8000 functions) in a single prompt. The Top-1 accuracy was 26.3% (220/835), Top-3 was 37.2% (311/835), and Top-5 was 42.9% (358/835). MAP was 0.325 and MRR was 0.287. These results indicate that the LLM can partially leverage semantic cues between bug reports and function names to include the correct function in the candidate set. However, the correct answer was not consistently ranked at the top, which reduced search efficiency. Developers would still need to examine a large number of candidates before reaching the actual fault. In short, the LLM-only approach shows the potential of zero-shot reasoning, but high exploration cost and instability limit its practical applicability in industrial settings.
SASR (RQ2). SASR addressed the limitations of the baseline by re-ranking candidates using CodeBERT-based semantic embeddings combined with AST structural signals. Although the Top-1 accuracy did not improve significantly, Top-3 accuracy increased to 41.3% (345/835), and Top-5 accuracy rose to 49.2% (411/835). Notably, MRR improved from 0.287 to 0.347, a relative gain of about 21%, showing that correct functions were ranked more consistently near the top. These results show that structural signals correct the bias of semantic embeddings and elevate correct answers by incorporating code patterns such as control flows, loops, and conditional statements [37]. This provides empirical evidence that a filtering mechanism to improve candidate quality prior to LLM inference is essential.
Tournament-based inference (RQ3). Tournament-based inference applied batch voting and a final selection process to the same SASR candidate set. This procedure mitigated the probabilistic variation of LLM reasoning and reduced noise. Compared with simple Top-k selection, both Top-1 accuracy and MRR improved consistently. The improvement in MRR suggests that the LLM was guided to rank correct answers more reliably at the top. This approach alleviated the previously observed sensitivity of LLM outputs to prompt composition and internal randomness. As a result, it not only improved reproducibility but also enhanced stability, making it more practical for maintenance support tools. The tournament mechanism therefore served as a reliability adjustment method in addition to providing performance gains.
LLMLoc (RQ4). Finally, LLMLoc combined candidate quality improvement from SASR with the stability provided by tournament-based inference, achieving the best performance across all metrics compared with the baseline. Top-1 accuracy increased to 28.5% (238/835), Top-3 reached 44.0% (367/835), and Top-5 reached 49.8% (416/835). MAP improved slightly to 0.336, while MRR rose to 0.364, which represents a relative improvement of 29.8% over the baseline. These results show that SASR lifted correct answers toward the top ranks and the tournament mechanism maintained them consistently, producing a synergistic effect beyond simple additive improvements. From a practical perspective, developers can identify nearly half of the faults by reviewing only five candidates, significantly reducing exploration cost. From an academic standpoint, these results show that combining structural–semantic retrieval with inference stabilization can overcome the limitations of LLM-only approaches for bug localization.

4.6. Ablation Study

To identify the contribution of each component, we conducted an ablation study on Defects4J. Table 6 compares (1) the baseline using the full method corpus, (2) SASR without the tournament, (3) SASR with the tournament, and (4) the full LLMLoc model. SASR notably improves Top-3 and Top-5 accuracy compared with the baseline, indicating the benefit of structure-aware retrieval. The tournament strategy further stabilizes ranking and enhances MRR, while the adaptive λ fusion in LLMLoc yields the best MAP (0.336) and MRR (0.364).
These results confirm that each module contributes complementarily to performance: structure-aware retrieval expands relevant candidates, tournament-based fusion suppresses unstable ranking noise, and adaptive weighting balances semantic and structural similarity signals. Together, they produce a consistent gain across all Top-k metrics.

5. Discussion

5.1. Analysis of Experimental Results

The experimental results clearly show the performance gap between the Baseline LLM and the proposed LLMLoc pipeline. When the Baseline LLM was given only a simple concatenation of bug reports and the entire function corpus, its Top-1 accuracy was limited to 26.3% (220/835), and the mean reciprocal rank (MRR) remained at 0.287. This indicates that the language model was able to capture some degree of semantic similarity between bug reports and function names [11,12], but it failed to consistently position the correct function at the top ranks. In practice, developers must review top-ranked candidates within limited time, and the Baseline results suggest that while the approach shows potential, it remains insufficient for immediate use in real-world maintenance scenarios [20].
The introduction of SASR played a critical role in mitigating this limitation. By combining AST-based structural signals with CodeBERT semantic embeddings to re-rank candidates, Top-3 and Top-5 accuracy increased to 345 and 411, respectively, while MRR improved to 0.347. This improvement shows that the correct function was not only included in the candidate set but was more frequently ranked higher, making it more likely to be identified during a developer’s initial review. SASR corrected the bias of semantic embeddings (for example, their tendency to favor name similarity) by leveraging structural signals, ultimately reshaping the candidate distribution. As a result, the language model was able to reason over a “more favorable candidate set,” directly leading to performance gains in the final inference stage [13,17]. Unlike prior studies that primarily treated structural information as auxiliary signals, SASR integrated structural and semantic fusion as a fundamental enhancement to the retrieval process itself.
The tournament-based inference introduced in the final stage brought another meaningful improvement. Language models often produce variable responses to the same input depending on prompt phrasing or inherent randomness [20,33,34]. This instability undermines both academic reproducibility and practical reliability. The tournament approach mitigated this problem by using batch voting and tie-breaking procedures to average out deviations and suppress noise. The fact that Top-1 accuracy and MRR improved beyond simple Top-k selection shows that the tournament was not merely a performance boost but an effective stabilization mechanism. Academically, this increases methodological credibility, and practically, it enables reproducible and consistent decision-making during software maintenance.
By combining SASR with tournament inference, LLMLoc achieved consistent improvements over the Baseline across all metrics. Gains in MAP and MRR confirmed that the correct function was not only present but persistently ranked higher. Improvements in Top-5 accuracy indicated that developers would need to examine fewer than half the candidates compared to the Baseline, thereby accelerating debugging and reducing the cost of unnecessary code inspection. This translates directly into enhanced efficiency in real-world maintenance.
These findings align with earlier research that has shown the benefits of integrating heterogeneous signals, such as information retrieval, spectrum-based fault detection, and change history analysis [30,38,39,40]. However, LLMLoc distinguishes itself by not only merging signals but also directly linking structural–semantic fusion to language model inference. This represents the first instance where both performance and stability were simultaneously strengthened through such integration. Academically, this offers a new paradigm for bug localization research, and practically, it highlights the potential for robust tools that remain reliable even in industrial settings with limited test quality.
To quantify the computational cost introduced by the proposed SASR and tournament modules, we compared per-bug runtime, number of LLM calls, and memory footprint across three representative configurations: a baseline Top-K setting without tournament inference, an SASR setting without the tournament, and the full LLMLoc pipeline that combines SASR with the tournament process. As summarized in Table 7, SASR without the tournament exhibits latency comparable to the baseline (approximately 2.4–2.5 s per bug), whereas enabling tournament-based inference increases the average wall-clock time by about 1.4 s, to 3.87 s per bug (95% CI: 3.67–4.06). This ≈57% increase stems primarily from one additional LLM call per bug, while peak GPU memory (≈15.5 GB) and CPU RSS (≈1.47 GB) remain nearly unchanged. Overall, these findings indicate that the tournament and structure-aware retrieval steps improve ranking stability at only moderate computational overhead.
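For context, figures like those in Table 7 can be collected with standard tooling; the sketch below (using psutil and PyTorch's CUDA memory counters) is one way to gather comparable per-bug wall-clock and memory statistics, assuming a localize_one_bug callable that runs the pipeline end to end and a normal-approximation 95% confidence interval.

```python
# Sketch of overhead measurement: per-bug wall-clock time, peak GPU memory,
# and CPU resident set size, summarized as mean ± 95% CI (normal approximation).
import math
import statistics
import time

import psutil
import torch

def measure_overhead(bugs, localize_one_bug):
    times, gpu_peaks = [], []
    proc = psutil.Process()
    for bug in bugs:
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        localize_one_bug(bug)                      # one end-to-end localization run
        times.append(time.perf_counter() - start)
        gpu_peaks.append(torch.cuda.max_memory_allocated())
    mean = statistics.mean(times)
    half_ci = 1.96 * statistics.stdev(times) / math.sqrt(len(times))
    return {
        "time_mean_s": mean,
        "time_ci95_s": half_ci,
        "gpu_peak_mb": max(gpu_peaks) / 2**20,
        "cpu_rss_mb": proc.memory_info().rss / 2**20,
    }
```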

5.2. Threats to Validity

This study evaluated the effects of structure–semantic fusion and inference stabilization in LLMLoc using the Defects4J v2.0.0 dataset, but several threats to validity may affect the interpretation and generalization of the results. Internal validity is influenced mainly by parameter settings and procedural factors. SASR combines semantic and structural signals through a λ-weighted scheme, but the weight was tuned only empirically on Defects4J, meaning that it may not be optimal for other datasets. As a result, it is difficult to know whether performance gains are due to the structural contribution of the model or to dataset-specific parameter choices. Tournament-based inference reduced randomness, but the results still varied with prompt design, batch size, and voting strategy, creating instability that weakens reproducibility. Automated parameter optimization, standardized prompt templates, and cross-dataset validation would help address these risks.
External validity is constrained by dataset scope and generalizability. Defects4J is widely used for Java projects, but industrial settings are far more diverse. Bug report quality, naming conventions, and code complexity vary across projects, and languages such as Python, C/C++, and JavaScript differ greatly in AST structures and control or data flow. For instance, Python’s dynamic typing often leaves structural information incomplete, while C++ macros and pointers increase complexity. Such differences may reduce the effectiveness of SASR’s fusion and limit LLMLoc’s applicability to other domains. Expanding the evaluation to multilingual and multi-domain benchmarks and including real-world industrial datasets would strengthen external validity.
Construct validity concerns arise from the choice of model and evaluation metrics. This study used Meta Llama-3 8B Instruct, but larger or different architectures could yield different outcomes. Performance may also be influenced by overlap between an LLM’s pretraining data and Defects4J, making it unclear whether results reflect the true effectiveness of LLMLoc or bias from pretraining. The use of traditional IR metrics such as Top-k accuracy, MAP, and MRR is helpful for benchmarking, but these measures do not fully capture developer-perceived efficiency in practice. Metrics such as time saved, interaction cost, and actual debugging success were not considered. To improve construct validity, future work should involve user studies, cost–benefit analyses, and debugging simulations to assess practical usefulness.
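For reference, the Top-k, MAP, and MRR figures reported here are standard IR metrics; the sketch below shows a typical per-bug computation over a ranked candidate list, with average precision normalized by the number of buggy methods (the paper's exact normalization is not restated here).

```python
# Standard IR metrics over ranked candidate lists, one list per bug.

def topk_hit(ranked, buggy, k):
    """1 if any buggy method appears in the top-k candidates, else 0."""
    return int(any(m in buggy for m in ranked[:k]))

def reciprocal_rank(ranked, buggy):
    """1 / rank of the first buggy method, or 0 if none is retrieved."""
    for i, m in enumerate(ranked, start=1):
        if m in buggy:
            return 1.0 / i
    return 0.0

def average_precision(ranked, buggy):
    """Precision averaged at each buggy-method hit, normalized by |buggy|."""
    hits, precisions = 0, []
    for i, m in enumerate(ranked, start=1):
        if m in buggy:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(buggy) if buggy else 0.0

def evaluate(results):
    """results: list of (ranked_candidates, set_of_buggy_methods) per bug."""
    n = len(results)
    return {
        "Top-1": sum(topk_hit(r, b, 1) for r, b in results),
        "Top-5": sum(topk_hit(r, b, 5) for r, b in results),
        "MAP":   sum(average_precision(r, b) for r, b in results) / n,
        "MRR":   sum(reciprocal_rank(r, b) for r, b in results) / n,
    }
```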
Conclusion validity is mainly challenged by computational cost and scalability. Tournament-based inference improved stability but introduced overhead due to repeated voting and tie-breaking, which was manageable on Defects4J but could be prohibitive on industrial-scale codebases with hundreds of thousands of functions. Latency and resource demands may discourage adoption, especially in cloud environments where costs scale with usage. Without balancing performance with efficiency, LLMLoc risks being confined to research contexts. Lighter-weight models, efficient sampling or caching strategies, and distributed inference infrastructures may offer more scalable solutions.
As quantified in Table 7, the additional runtime introduced by the tournament remains moderate (≈1.4 s per bug), confirming that the stability gain outweighs the computational cost.
Overall, LLMLoc shows promise by combining structure–semantic fusion with inference stabilization, but its contributions must be interpreted in light of these four threats: parameter dependence, dataset generalization limits, model and metric bias, and computational cost. At the same time, these limitations suggest clear directions for future work, including automated optimization, multilingual expansion, user-centered evaluation, and efficient inference strategies that can make LLMLoc both academically reliable and practically applicable.

6. Related Work

Bug localization research has traditionally been dominated by Spectrum-Based Fault Localization (SBFL) and Mutation-Based Fault Localization (MBFL) techniques. SBFL estimates the suspiciousness of code elements using failing and passing test executions along with coverage information, and several representative approaches such as Tarantula [3], Ochiai [4,5], and DStar [6] have been developed. Later, PageRank-based methods [38,39] incorporated execution paths and dependency relations to refine candidate rankings. However, as reported in studies using the Defects4J benchmark [1,2], SBFL methods generally achieve only about 25–30% Top-1 accuracy on average, which reveals their fundamental reliance on test quality. MBFL approaches, such as Metallaxis-FL [8], achieve strong fault detection capability by injecting program mutations and analyzing execution differences. Yet, because the number of required mutations grows exponentially and the execution cost is prohibitively high, MBFL is difficult to apply in large-scale systems [7]. These limitations highlighted the need for more generalizable and efficient approaches.
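For reference, the three SBFL formulas named above can be written compactly from the usual spectrum counts (ef/ep: failing/passing tests that execute the element; nf/np: those that do not); the snippet below is a textbook rendering rather than code from the cited tools.

```python
# Textbook SBFL suspiciousness formulas from spectrum counts.
import math

def tarantula(ef, ep, nf, np):
    fail = ef / (ef + nf) if (ef + nf) else 0.0
    pas = ep / (ep + np) if (ep + np) else 0.0
    return fail / (fail + pas) if (fail + pas) else 0.0

def ochiai(ef, ep, nf, np):
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def dstar(ef, ep, nf, np, star=2):
    # DStar assigns maximal suspiciousness when ep + nf == 0 and ef > 0.
    denom = ep + nf
    return (ef ** star) / denom if denom else (float("inf") if ef else 0.0)
```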
To address these challenges, Machine Learning-based Fault Localization (MLFL) was proposed. MLFL leverages features such as coverage matrices, static code properties, and execution logs to train models that predict suspiciousness scores. For instance, DeepFL [9] integrated multidimensional signals using deep learning and improved Top-1 performance by approximately 10 percentage points over SBFL, while Briand et al. [10] combined the Tarantula metric with machine learning to achieve additional gains. Graph-based methods [19,32] further enhanced precision by modeling structural code dependencies. Nonetheless, these approaches faced obstacles such as the high cost of building large training datasets, poor transferability across domains, and the burden of complex feature engineering. As a result, their applicability in industrial practice remains limited [2,24].
Subsequently, researchers explored Information Retrieval (IR)-based approaches, which do not require test execution. Early IR methods retrieved relevant code entities using bug reports as queries, but their reliance on keyword matching led to low accuracy [31]. To improve performance, hybrid methods such as Better Together [30] (combining SBFL with IR), FineLocator [22] (query expansion), and BoostNSift [29] (candidate refinement) were introduced, all of which achieved measurable improvements. The emergence of pretrained code models significantly advanced IR-based bug localization. Models such as CodeBERT [13], GraphCodeBERT [14], and Code Llama [15] map code and natural language into a unified embedding space, thereby strengthening semantic associations. CodeBERT-based IR has been shown to outperform traditional TF–IDF approaches by roughly 5–7 percentage points in Top-1 accuracy [27]. Moreover, methods like LameR [16] and SANTA [17] improved candidate quality through multi-stage ranking and structural alignment. Nevertheless, IR approaches still fail to capture sufficient structural context.
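As an illustration of the pretrained-encoder retrieval these works build on, the sketch below embeds a bug report and candidate functions with the publicly available microsoft/codebert-base checkpoint and ranks them by cosine similarity; the mean pooling and truncation length are assumptions, not the cited papers' exact configurations.

```python
# Illustrative CodeBERT-style semantic retrieval: embed text and code, rank by cosine.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

@torch.no_grad()
def embed(text):
    """Mean-pooled sentence embedding of a bug report or code snippet."""
    inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)              # (768,)

def rank_functions(bug_report, functions):
    """functions: dict mapping method name -> source text; returns ranked (name, score)."""
    query = embed(bug_report)
    scores = {name: torch.cosine_similarity(query, embed(code), dim=0).item()
              for name, code in functions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```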
More recently, Large Language Model (LLM)-based approaches have attracted significant attention. Wu et al. [11] showed that LLMs can capture semantic relationships between bug reports and functions, enabling test-free localization. Yang et al. [12] further validated the feasibility of test-free bug localization. Widyasari et al. [28] enhanced explainability through reasoning, while Kang et al. [24] emphasized practical applicability through both quantitative and qualitative evaluation. With the strong code understanding and generation capabilities of advanced LLMs such as GPT-4 [33], Llama3 [41], and Llama3-8B-Instruct [42], the potential of LLM-based bug localization has grown substantially [35]. However, these approaches face limitations such as the “lost in the middle” problem in long contexts [20] and the inability to exploit structural information such as Abstract Syntax Trees (ASTs), control flow, and data flow due to their token sequence-centric design. FlexFL [18] attempted to mitigate these issues by introducing Flexible Feature Learning, which dynamically combines semantic and statistical signals, outperforming SBFL and MBFL on Defects4J. Yet, FlexFL still lacked structural signals, leaving a fundamental gap.
In parallel, research on structure-aware learning has shown that combining semantic and structural embeddings can be effective. SANTA [17] aligned code structure with textual descriptions to improve retrieval, Lou et al. [19] strengthened SBFL using graph-based representations, and graph neural network (GNN)-based studies [32] improved code comprehension by learning program graphs. These works show that structural signals can complement semantic ones, but few studies have fully integrated structure–semantic fusion into LLM-based bug localization. Thus, existing research continues to face three core limitations: the test dependency of SBFL and MBFL, the data dependency of MLFL, and the lack of structural signals in IR- and LLM-based approaches.
The proposed LLMLoc is designed to address these limitations. First, Structure-Aware Semantic Retrieval (SASR) combines AST and control/data flow information with CodeBERT-based semantic embeddings to reduce semantic bias and stabilize candidate rankings. Second, multiple signals are integrated by combining SASR with execution-based indicators such as SBIR and Ochiai, thereby mitigating the instability of single-signal methods. Third, instead of relying on a single LLM output, tournament-based ranking is applied in the inference stage to reduce randomness and improve reproducibility. Finally, experiments on all 835 bugs in Defects4J v2.0.0 showed improvements of 3.4 percentage points in MAP and 29.8% in MRR, providing strong evidence of the effectiveness of the proposed method. LLMLoc therefore overcomes the three key challenges of test dependency, data dependency, and absence of structural information, and establishes a new direction that achieves both structural completeness and industrial scalability.

7. Conclusions

This study introduced LLMLoc, a zero-shot bug localization system that integrates structural and semantic signals. Traditional spectrum-based and mutation-based techniques heavily rely on test execution and coverage, which limits their applicability in real-world industrial settings. More recent LLM-based approaches have primarily focused on code sequence semantics, without adequately capturing structural context. LLMLoc addresses these limitations by combining semantic similarity between bug reports and code functions with structural cues extracted from abstract syntax trees (ASTs) and control/data flows. This design enables multi-signal candidate retrieval that avoids over-reliance on a single source of evidence and culminates in zero-shot reasoning with large language models to identify fault locations.
A large-scale experiment on all 835 bugs in Defects4J v2.0.0 showed the effectiveness of LLMLoc. Mean Average Precision (MAP) improved by 3.4 percentage points compared with an LLM-only baseline, while Mean Reciprocal Rank (MRR) increased from 0.2807 to 0.3643, representing a 29.8 percent relative gain. Top-1 accuracy rose from 28.5 percent to 34.1 percent, and Top-5 accuracy reached 49.8 percent, substantially reducing the search space developers must review. These improvements extend beyond raw accuracy, showing enhanced stability of candidate sets and greater reliability in ranking. In particular, structure–semantic fusion helped mitigate semantic bias, while the tournament-based inference process contributed to more consistent candidate evaluation.
From an academic perspective, this work makes three contributions. First, it formally introduces structural–semantic fusion into bug localization, addressing the lack of structural awareness in prior LLM-based methods. Second, it successfully transfers fusion strategies validated in information retrieval to the domain of software maintenance, highlighting cross-disciplinary applicability. Third, it shows that zero-shot reasoning, without large-scale training data or domain-specific fine-tuning, can still achieve measurable improvements, opening a new paradigm for LLM-based software engineering research.
In addition, explainable AI (XAI) frameworks for vulnerability detection, such as BiLSTM-based models that expose re-entrancy patterns in smart contracts, illustrate the growing need for transparency and interpretability in code-level reasoning. The structure-aware design of LLMLoc enhances explainability by aligning semantic inferences with explicit syntactic and dependency cues, making it a promising foundation for trustworthy and interpretable analysis in high-consequence software systems. Future research could extend LLMLoc to security-critical applications that require not only accuracy but also clear interpretive reasoning, bridging the gap between deep learning-based vulnerability detection and explainable fault localization.
The practical implications are equally clear. LLMLoc can operate reliably even when test quality is limited, thereby reducing debugging costs and improving maintenance efficiency in real-world development environments. In the short term, it enhances developer productivity, and in the long term, it contributes to system reliability and security. Potential applications extend beyond bug localization to automated program repair, vulnerability detection, and large-scale code retrieval.
Several limitations remain. Structural–semantic fusion may not capture every type of fault, such as atypical API misuse or logical errors that do not manifest in structural patterns. Furthermore, the current evaluation was limited to Java projects within Defects4J, so additional studies are needed to validate generalizability to other programming languages such as Python or C++. Finally, the performance of zero-shot inference is sensitive to prompt design and hyperparameter settings, indicating the need for systematic tuning strategies.
Future research will focus on validating LLMLoc in large-scale, multi-language codebases and integrating richer structural signals such as program graphs, execution traces, and data dependency analysis to further expand representational power. Another direction is to apply structural–semantic fusion to broader software engineering tasks including automated repair, security vulnerability detection, and intelligent code recommendation. Overall, this study is the first to apply structural–semantic fusion to bug localization in a systematic way, showing both performance and practical benefits, and providing meaningful contributions to both academic research and industrial practice.

Author Contributions

Conceptualization, G.Y.; methodology, G.N.; writing—original draft, G.N.; writing—review and editing, G.Y.; supervision, G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding from any source.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wong, W.E.; Gao, R.; Li, Y.; Abreu, R.; Wotawa, F. A survey on software fault localization. IEEE Trans. Softw. Eng. (TSE) 2016, 42, 707–740. [Google Scholar] [CrossRef]
  2. Zou, D.; Liang, J.; Xiong, Y.; Ernst, M.D.; Zhang, L. An empirical study of fault localization families and their combinations. IEEE Trans. Softw. Eng. (TSE) 2021, 47, 332–347. [Google Scholar] [CrossRef]
  3. Jones, J.A.; Harrold, M.J. Empirical evaluation of the Tarantula automatic fault-localization technique. In Proceedings of the ASE, Lisbon, Portugal, 7–11 November 2005; pp. 273–282. [Google Scholar]
  4. Abreu, R.; Zoeteweij, P.; Van Gemund, A.J.C. An evaluation of similarity coefficients for software fault localization. In Proceedings of the PRDC, Riverside, CA, USA, 18–20 December 2006; pp. 39–46. [Google Scholar]
  5. Abreu, R.; Zoeteweij, P.; Van Gemund, A.J.C. On the accuracy of spectrum-based fault localization. In Proceedings of the TAICPART-MUTATION, Windsor, UK, 10–14 September 2007; pp. 89–98. [Google Scholar]
  6. Wong, W.E.; Debroy, V.; Gao, R.; Li, Y. The DStar method for effective software fault localization. IEEE Trans. Reliab. 2014, 63, 290–308. [Google Scholar] [CrossRef]
  7. Jia, Y.; Harman, M. An analysis and survey of the development of mutation testing. IEEE Trans. Softw. Eng. 2011, 37, 649–678. [Google Scholar] [CrossRef]
  8. Papadakis, M.; Traon, Y.L. Metallaxis-FL: Mutation-based fault localization. Softw. Test. Verif. Reliab. 2015, 25, 605–628. [Google Scholar] [CrossRef]
  9. Li, X.; Li, W.; Zhang, Y.; Zhang, L. DeepFL: Integrating multiple fault diagnosis dimensions for deep fault localization. In Proceedings of the ISSTA, Beijing, China, 15–19 July 2019; pp. 169–180. [Google Scholar]
  10. Briand, L.C.; Labiche, Y.; Liu, X. Using machine learning to support debugging with Tarantula. In Proceedings of the ISSRE, Trollhättan, Sweden, 5–9 November 2007; pp. 137–146. [Google Scholar]
  11. Wu, Y.; Li, Z.; Zhang, J.M.; Papadakis, M.; Harman, M.; Liu, Y. Large language models in fault localisation. arXiv 2023, arXiv:2308.15276. [Google Scholar] [CrossRef]
  12. Yang, A.Z.H.; Le Goues, C.; Martins, R.; Hellendoorn, V. Large language models for test-free fault localization. In Proceedings of the ICSE, Lisbon, Portugal, 14–20 April 2024. [Google Scholar]
  13. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv 2020, arXiv:2002.08155. [Google Scholar]
  14. Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-training code representations with data flow. In Proceedings of the ICLR, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  15. Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code Llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
  16. Sadat, A.; Agarwal, P.; Lin, H.; Zhang, C.; Bendersky, M.; Najork, M. LameR: LLM-augmented multi-stage ranking for code retrieval. arXiv 2023, arXiv:2305.15489. [Google Scholar]
  17. Li, Z.; Wang, X.; Wang, S.; Nguyen, T.N. SANTA: Structure-aligned neural text-to-code retrieval. In Proceedings of the FSE, San Francisco, CA, USA, 3–9 December 2023; pp. 1472–1484. [Google Scholar]
  18. Xu, H.; Zhang, Z.; Li, J.; Wang, X.; Cheung, S.-C. FlexFL: Boosting fault localization with LLMs via flexible feature learning. IEEE Trans. Softw. Eng. 2025, 51, 535–548. [Google Scholar]
  19. Lou, Y.; Zhu, Q.; Dong, J.; Li, X.; Sun, Z.; Hao, D.; Zhang, L.; Zhang, L. Boosting coverage-based fault localization via graph-based representation learning. In Proceedings of the ESEC/FSE, Athens, Greece, 23–28 August 2021; pp. 664–676. [Google Scholar]
  20. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguist. (TACL) 2024, 12, 157–173. [Google Scholar] [CrossRef]
  21. Tsumita, S.; Hayashi, S.; Amasaki, S. Large-scale evaluation of method-level bug localization with FinerBench4BL. In Proceedings of the SANER, Macao, China, 21–24 March 2023; pp. 815–824. [Google Scholar]
  22. Zhang, W.; Li, Z.; Wang, Q.; Li, J. FineLocator: Improving bug localization by query expansion. Inf. Softw. Technol. 2019, 110, 121–135. [Google Scholar] [CrossRef]
  23. Campos, J.; Riboira, A.; Perez, A.; Abreu, R. GZoltar: An Eclipse plug-in for testing and debugging. In Proceedings of the ASE, Essen, Germany, 3–7 September 2012; pp. 378–381. [Google Scholar]
  24. Kang, S.; An, G.; Yoo, S. A quantitative and qualitative evaluation of LLM-based explainable fault localization. Proc. ACM Softw. Eng. 2024, 1, 64. [Google Scholar] [CrossRef]
  25. Li, Y.; Wang, S.; Nguyen, T.N. Fault localization with code coverage representation learning. In Proceedings of the ICSE, Madrid, Spain, 22–30 May 2021; pp. 661–673. [Google Scholar]
  26. Zhang, Z.; Lei, Y.; Mao, X.; Li, P. CNN-FL: An effective approach for localizing faults using CNNs. In Proceedings of the SANER, Hangzhou, China, 24–29 February 2019; pp. 445–459. [Google Scholar]
  27. Zeng, S.; Tan, H.; Zhang, H.; Li, J.; Zhang, Y.; Zhang, L. An extensive study on pretrained models for program understanding. In Proceedings of the ISSTA, Virtual Event, 18–22 July 2022; pp. 39–51. [Google Scholar]
  28. Widyasari, R.; Ang, J.W.; Nguyen, T.G.; Sharma, N. Demystifying faulty code: Step-by-step reasoning in large language models for fault localization. In Proceedings of the SANER, Rovaniemi, Finland, 12–15 March 2024; pp. 568–579. [Google Scholar]
  29. Razzaq, A.; Buckley, J.; Patten, J.V.; Chochlov, M.; Sai, A.R. BoostNSift: A Query Boosting and Code Sifting Technique for Method Level Bug Localization. In Proceedings of the SCAM, Luxembourg, 27–28 September 2021; pp. 81–91. [Google Scholar]
  30. Le, T.B.; Oentaryo, R.J.; Lo, D. Information retrieval and spectrum-based bug localization: Better together. In Proceedings of the ESEC/FSE, Bergamo, Italy, 30 August 2015–4 September 2015; pp. 579–590. [Google Scholar]
  31. Zhou, J.; Zhang, H.; Lo, D. Where should the bugs be fixed? More accurate IR-based bug localization based on bug reports. In Proceedings of the ICSE, Zurich, Switzerland, 2–9 June 2012; pp. 14–24. [Google Scholar]
  32. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
  33. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  34. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  35. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [PubMed]
  36. Abreu, R.; Zoeteweij, P.; Golsteijn, R.; Van Gemund, A.J.C. A practical evaluation of spectrum-based fault localization. J. Syst. Softw. 2009, 82, 1780–1792. [Google Scholar] [CrossRef]
  37. Li, Y.; Wang, S.; Nguyen, T.N. Fault localization to detect co-change fixing locations. In Proceedings of the ESEC/FSE, Singapore, 14–18 November 2022; pp. 659–671. [Google Scholar]
  38. Zhang, M.; Li, X.; Zhang, L.; Khurshid, S. Boosting spectrum-based fault localization using PageRank. In Proceedings of the ISSTA, Santa Barbara, CA, USA, 10–14 July 2017; pp. 261–272. [Google Scholar]
  39. Zhang, M.; Li, Y.; Li, X.; Chen, L.; Zhang, Y.; Zhang, L. An empirical study of boosting spectrum-based fault localization via PageRank. IEEE Trans. Softw. Eng. 2021, 47, 1089–1113. [Google Scholar] [CrossRef]
  40. Böhme, M.; Soremekun, E.O.; Chattopadhyay, S.; Ugherughe, E.; Zeller, A. Where is the bug and how is it fixed? An experiment with practitioners. In Proceedings of the ESEC/FSE, Paderborn, Germany, 4–8 September 2017; pp. 117–128. [Google Scholar]
  41. Meta AI. Blog of Meta Llama 3. Available online: https://ai.meta.com/blog/ (accessed on 28 September 2025).
  42. HuggingFace. Model card of Llama3-8B-Instruct. Available online: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct (accessed on 28 September 2025).
Figure 1. The overall pipeline of LLMLoc for zero-shot bug localization.
Figure 2. Random Forest–based feature importance of AST-level structural proxies.
Table 1. Comparison between SANTA, GNN-based approaches, and LLMLoc's SASR.

| Aspect | SANTA | GNN-Based Methods | Proposed LLMLoc (SASR) |
|---|---|---|---|
| Structural representation | Graph-based AST alignment between bug reports and code; requires node/edge labeling | Learned graph embeddings via supervised training on defect datasets | AST metric-based embedding using control/data-flow features without supervision |
| Semantic representation | Textual similarity using TF-IDF or code embeddings | Pretrained code encoders with fine-tuning | CodeBERT semantic embeddings directly fused with structural signals |
| Fusion strategy | Rule-based graph alignment scores | Neural fusion layers within GNN framework | Weighted retrieval fusion using adaptive λ (heuristic balancing semantic vs. structural) |
| Training requirement | Supervised with aligned graph pairs | Supervised on large labeled defect graphs | Zero-shot; no task-specific training required |
| Primary goal | Improve graph matching accuracy | Capture topological relations for defect prediction | Enhance retrieval robustness and ranking stability under test-free conditions |
Table 3. Sensitivity to alternative adaptive λ schedules on Defects4J. The metrics remain broadly comparable, confirming robustness.

| λ Schedule | Top-1 | Top-3 | Top-5 | MAP | MRR |
|---|---|---|---|---|---|
| {0.3, 0.5, 0.7, 0.85} (default) | 238 | 367 | 416 | 0.336 | 0.364 |
| {0.2, 0.4, 0.6, 0.8} | 240 | 365 | 428 | 0.339 | 0.370 |
Table 2. Adaptive λ Adjustment Rules.

| Condition | λ Value | Interpretation |
|---|---|---|
| No report or test available | 0.30 | Structural signal emphasized |
| 1–50 words | 0.50 | Balanced structural and semantic signals |
| 51–150 words | 0.70 | Beginning of semantic signal emphasis |
| More than 151 words | 0.85 | Strong reflection of semantic signal |
Table 4. Statistics of the Defects4J v2.0.0 dataset, including the number of bugs, average number of methods, and average number of buggy methods for each project.

| Project | #Bugs | Avg. #Methods | Avg. Buggy Methods |
|---|---|---|---|
| Chart | 26 | 5485 | 1.6 |
| Closure | 176 | 7927 | 1.8 |
| Lang | 65 | 3013 | 1.4 |
| Math | 106 | 3902 | 1.7 |
| Mockito | 38 | 2023 | 1.2 |
| Time | 27 | 4121 | 1.5 |
| Collections | 28 | 1640 | 1.3 |
| Codec | 18 | 1213 | 1.1 |
| Compress | 47 | 2482 | 1.4 |
| Csv | 16 | 1870 | 1.3 |
| Gson | 18 | 3110 | 1.2 |
| JacksonCore | 26 | 2934 | 1.5 |
| JacksonDatabind | 112 | 6285 | 1.7 |
| JacksonXml | 6 | 1544 | 1.2 |
| Jsoup | 93 | 3221 | 1.5 |
| JxPath | 22 | 2313 | 1.4 |
| Cli | 39 | 1765 | 1.3 |
| Total | 835 | 6843 | 1.6 |
Table 5. Performance comparison (based on all 835 bugs in Defects4J).

| Method | Top-1 | Top-3 | Top-5 | MAP | MRR |
|---|---|---|---|---|---|
| Baseline (LLM only) | 220 | 311 | 358 | 0.325 | 0.287 |
| SASR | 221 | 345 | 411 | 0.325 | 0.347 |
| LLMLoc (Ours) | 238 | 367 | 416 | 0.336 | 0.364 |
Table 6. Ablation study of LLMLoc components on Defects4J.

| Method | Top-1 | Top-3 | Top-5 | MAP | MRR |
|---|---|---|---|---|---|
| Baseline (LLM only) | 220 | 311 | 358 | 0.325 | 0.287 |
| SASR (no Tournament) | 195 | 364 | 437 | 0.294 | 0.342 |
| SASR + Tournament | 221 | 345 | 411 | 0.324 | 0.347 |
| LLMLoc (Ours) | 238 | 367 | 416 | 0.336 | 0.364 |
Table 7. Computational overhead comparison on Defects4J (n = 50 bugs). Mean ± 95% CI.

| Condition | T_total (s/bug) | T_batches (s/bug) | T_final (s/bug) | # LLM Calls ¹ | GPU Peak (MB) | CPU RSS (MB) |
|---|---|---|---|---|---|---|
| Baseline Top-K (no tour.) | 2.485 ± 0.171 | 0.000 | 2.312 ± 0.196 | 1.0 | 15,442 | 1479 |
| SASR (no tour.) | 2.434 ± 0.187 | 0.000 | 2.343 ± 0.187 | 1.0 | 15,460 | 1470 |
| LLMLoc (full) | 3.866 ± 0.195 | 1.498 ± 0.151 | 2.271 ± 0.186 | 2.0 | 15,465 | 1469 |

¹ "#" denotes the number of occurrences (e.g., number of LLM calls).