Article

CBR2: A Case-Based Reasoning Framework with Dual Retrieval Guidance for Few-Shot KBQA

1 School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou 510275, China
2 National Key Laboratory for Complex Systems Simulation, Beijing 100029, China
3 School of Computer Science and Technology, Xidian University, Xi’an 710071, China
4 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510275, China
5 College of Systems and Society, Australian National University, Canberra 0200, Australia
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(1), 17; https://doi.org/10.3390/bdcc10010017
Submission received: 17 November 2025 / Revised: 11 December 2025 / Accepted: 19 December 2025 / Published: 4 January 2026
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))

Abstract

Recent advances in large language models (LLMs) have driven substantial progress in knowledge base question answering (KBQA), particularly under few-shot settings. However, symbolic program generation remains challenging due to its strict structural constraints and high sensitivity to generation errors. Existing few-shot methods often rely on multi-turn strategies, such as rule-based step-by-step reasoning or iterative self-correction, which introduce additional latency and exacerbate error propagation. We present CBR2, a case-based reasoning framework with dual retrieval guidance for single-pass symbolic program generation. Instead of generating programs interactively, CBR2 constructs a unified structure-aware prompt that integrates two complementary types of retrieval: (1) structured knowledge from ontologies and factual triples, and (2) reasoning exemplars retrieved via semantic and function-level similarity. A lightweight similarity model is trained to retrieve structurally aligned programs, enabling effective transfer of abstract reasoning patterns. Experiments on KQA Pro and MetaQA demonstrate that CBR2 achieves significant improvements in both accuracy and syntactic robustness. Specifically on KQA Pro, it boosts Hits@1 from 72.70% to 82.13% and reduces syntax errors by 25%, surpassing the previous few-shot state-of-the-art.

1. Introduction

In recent years, large language models (LLMs) like GPT [1] and Llama [2] have shown remarkable potential in knowledge base question answering (KBQA) [3,4,5,6], particularly in few-shot scenarios where annotated training data is scarce. The primary goal of KBQA is to transform a natural language question into an executable logical form (LF) and run it over a Knowledge Base (KB) to obtain the answer [7,8,9,10]. KBQA shields users from the heterogeneity of underlying data sources while enabling reliable responses, thereby making it crucial in domains that require high accuracy and strong logical reasoning, such as law, finance, and medicine [11]. However, symbolic program generation is highly structured, requires strict compositionality, and tolerates almost no generation errors. Under few-shot supervision, LLMs often struggle to induce precise reasoning patterns from limited in-context examples, leading to syntactically invalid or semantically incorrect programs [12,13].
To address these challenges, existing studies have primarily explored two major paradigms (Figure 1). The first paradigm, rule-guided interactive reasoning, employs a symbolic agent to construct the program step-by-step under the control of retrieved abstract rules, often with the LLM performing candidate evaluation or slot filling at each step [3,14]. The second paradigm, multi-turn correction strategies, first generates an initial program and then refines it iteratively through prompt rewriting, self-correction, or feedback-based validation to improve syntactic and semantic consistency [4,5].
Although these paradigms differ in implementation, they share a common strategy: decomposing program generation into multiple steps to reduce reasoning complexity and improve controllability. However, this multi-step design introduces notable drawbacks, including increased inference latency, higher risk of error accumulation (i.e., valid reasoning steps built on erroneous prior outputs [15]), and a strong reliance on explicit structural scaffolding. Moreover, the process of generating the program’s structural logic is often separated from grounding it to factual knowledge in the KB, which can lead to inconsistencies. Figure 2 illustrates two typical types of generation errors: (1) semantic grounding errors, where the model incorrectly maps a natural-language predicate to a KG relation, causing the direction or meaning of the reasoning path to deviate; and (2) condition omission, in which the generated reasoning path is structurally valid but fails to include essential constraints required to capture the intended semantics. These issues substantially limit the model’s generalization capacity under few-shot supervision.
In this work, we propose CBR2 (Case-Based Reasoning with Dual Retrieval guidance), a single-pass symbolic program generation framework with dual retrieval guidance. Unlike iterative decoding or symbolic rule execution, CBR2 constructs a unified, structure-aware prompt that integrates both ontology-constrained contextual knowledge and retrieved reasoning cases, guiding the LLM to produce an executable program in a single decoding pass. Specifically, our prompt design incorporates:
  • Ontology-constrained contextual knowledge: concept hierarchies and factual triples relevant to the question;
  • Dual-source retrieved reasoning cases: one set retrieved via semantic similarity, and the other via function-level structural similarity.
By jointly leveraging symbolic constraints and inductive reasoning patterns within a single prompt, CBR2 can generate accurate and interpretable programs without requiring multi-turn interaction or post-hoc correction.
Our main contributions are summarized as follows:
  • We propose CBR2, a novel case-based reasoning framework with dual retrieval guidance for single-pass symbolic program generation in few-shot KBQA.
  • We design a dual-view case retrieval mechanism that captures both semantic similarity and function-level structural similarity, enabling generalization without iterative correction.
  • We unify structured ontology knowledge and retrieved reasoning cases into a single, structure-aware prompt, effectively integrating symbolic constraints with reasoning exemplars.
Extensive experiments on two widely used KBQA benchmarks, KQA Pro and MetaQA, demonstrate that CBR2 consistently improves program accuracy and significantly reduces syntax error rates compared to strong few-shot baselines, showcasing its effectiveness across different datasets and question types.
The remainder of this paper is organized as follows. In Section 2, we review the literature on three mainstream KBQA paradigms. In Section 3, we introduce the preliminaries of our framework, including relevant concepts and elements. Next, Section 4 provides a detailed description of the individual modules of the framework. Then, Section 5 introduces the adopted datasets, baselines, implementation details, and evaluation metrics. After that, we discuss the results in Section 6. Finally, we conclude and outline future work in Section 7.

2. Related Works

2.1. Semantic Parsing-Based KBQA

Semantic parsing (SP) has been a long-standing approach to KBQA, aiming to convert natural language questions into executable queries such as SPARQL, Cypher, or other logical forms [16,17]. Early SP-based KBQA systems relied heavily on hand-crafted grammar rules and domain-specific templates. Representative examples include STAGG [16], NSM [17], and QGG [18], which leverage structured grammar, template alignment, or intermediate symbolic representations to enhance interpretability and executability. While effective in constrained domains, these systems require extensive manual design and struggle to adapt to new schemas.
To alleviate the annotation bottleneck, weakly supervised SP methods have been explored, including distant supervision [19], reinforcement learning [20], and semantic similarity matching [21]. These approaches reduce the need for fully labeled question-program pairs, but often suffer from noisy labels and unstable training, especially in complex reasoning scenarios.
More recently, fully supervised SP-based KBQA methods have adopted pre-trained language models such as BERT [22], BART [23], and T5 [24] as encoders or encoder–decoder architectures. These models either directly generate logical forms such as SPARQL [25,26,27,28,29] or produce intermediate representations that can be mapped to executable queries [8,30]. By leveraging the strong semantic representation capabilities of PLMs, these models reduce reliance on hand-crafted rules and achieve strong in-domain performance.
However, they still face significant challenges in real-world deployment. These approaches typically rely on large-scale, high-quality annotated data—pairs of natural language questions and corresponding logical forms—which are expensive to collect. In particular, they often generalize poorly to unseen KB schemas or few-shot settings. This limitation stems from insufficient explicit modeling of schema-level constraints (e.g., entity/relation types, domain restrictions) and robust compositional reasoning operations. Overall, SP-based KBQA methods—whether grammar-driven, weakly supervised, or PLM-based—face persistent challenges in scalability and generalization, which motivates the shift towards LLM-based symbolic reasoning explored in the next section.

2.2. LLM-Based Symbolic Reasoning

The emergence of large language models (LLMs) has shifted KBQA research towards symbolic program generation in few-shot and zero-shot settings [31,32,33]. Recent work in this area can be broadly categorized into two paradigms.
(1) Rule-guided interactive reasoning. This paradigm retrieves symbolic rules or structured templates from the knowledge schema and incrementally constructs programs under these constraints, with the LLM selecting actions or filling parameters at each step. Representative methods include Rule-KBQA [14], which adopts rule-guided prompting for controllable reasoning in complex KBQA; KBQA-o1 [34], which interacts with the KB environment through an agent for stepwise logical form generation; Inter-KBQA [3], which uses interpretable symbolic action sequences to construct programs step-by-step; FlexKBQA [35], which introduces dynamic schema constraint adaptation to balance flexibility and structural control; and MusTQ [36], which targets multi-step temporal KGQA by integrating rule-based temporal operators into the program generation process. These approaches ensure structural validity and interpretability, but rely on explicit rule repositories and require multiple inference rounds.
(2) Multi-turn correction strategies. Another line of research first generates an initial program in one pass and then iteratively refines it through prompt rewriting, self-correction, or feedback-based validation [4,5,37,38]. For example, CodeAlignKGQA [5] formulates complex KGQA as knowledge-aware constrained code generation and applies multi-turn code alignment to progressively fix logical and semantic errors; SymKGQA [4] combines symbolic reasoning structures with execution feedback to iteratively refine logic, enhancing interpretability and execution accuracy; FUn-FuSIC [39] introduces an iterative repair loop that uses a suite of strong and weak verifiers to successively refine candidate logical forms; and prompt design techniques from related tasks [40] have been adapted to KBQA to improve structural alignment and reduce logical errors. While more robust to initial mistakes, these approaches incur additional latency and may suffer from inconsistencies between structural logic and factual grounding.
Despite their differences, both paradigms face common limitations: multi-turn decoding increases inference latency and computation overhead; prior errors in entity linking or relation selection often propagate through sequential steps, yielding programs that are structurally valid but factually incorrect; heavy reliance on explicit scaffolding such as rule bases or templates limits adaptability; and the separation of structural reasoning from factual grounding hampers few-shot generalization. These challenges motivate our single-pass approach that unifies both aspects within one decoding process.

2.3. Case-Based Reasoning in KBQA

Case-based reasoning (CBR) [41] has emerged as an interpretable and data-efficient approach to KBQA by reusing and adapting solutions from prior cases. Early retrieval-based methods relied on surface-level matching, limiting their ability to handle complex reasoning. Recent work [42] advanced this paradigm by retrieving semantically similar question-logical form pairs and adapting them via entity and relation substitution, demonstrating strong performance in both in-domain and cross-domain evaluations. Follow-up research has focused on enhancing structural reasoning signals: GS-CBR-KBQA [43] introduces graph-structured representations to capture fine-grained entity-relation dependencies during case retrieval and adaptation, CBR-KBQA [44] retrieves and adapts relevant KB subgraphs to improve factual grounding, while StructCBR [45] leverages subtree-level similarity between logical forms of cases and candidate outputs for better decoding decisions. Additionally, PerKGQA [46] retrieves cases with similar path template and encodes them for scoring paths generated during inference phase.
However, most existing retrieval strategies rely solely on either direct encoding of sentences for surface-level semantic similarity or schema-based rule matching for literal similarity. These approaches often fail to capture deeper structural reasoning patterns and recall sufficient beneficial samples to guide inference. Our work extends this line by introducing a dual-view retrieval mechanism that jointly considers semantic similarity and function-level structural similarity, offering richer reasoning guidance under few-shot constraints.

3. Preliminaries

3.1. Knowledge Base and Ontology Structure

We define a KB $\mathcal{K}$ as consisting of two complementary parts:
$$\mathcal{K} = \mathcal{K}_{\text{ontology}} \cup \mathcal{K}_{\text{facts}} = \{(s, p, o)\},$$
where each triple $(s, p, o)$ follows the RDF convention, with subject $s$, predicate $p$, and object $o$.
$\mathcal{K}_{\text{ontology}}$ contains schema-level relations that describe the structural organization of concepts, such as instanceOf, subclassOf, and hasAttribute. These triples encode the type system and hierarchical constraints that define how entities map to abstract concepts.
$\mathcal{K}_{\text{facts}}$ consists of factual assertions about specific entities, representing real-world relationships and attribute values.
Separating ontology knowledge from factual knowledge is critical for symbolic reasoning: the former constrains the reasoning space, resolves entity type ambiguity, and guides function selection, while the latter provides factual grounding for answering specific questions.

3.2. KoPL Logic Formalism

We adopt the Knowledge-oriented Programming Language (KoPL) [47] as the target formalism for symbolic reasoning. KoPL is a lightweight, function-oriented query language specifically designed for interpretable and type-safe reasoning over KBs. Its design enforces strict input/output type signatures and deterministic execution semantics, ensuring that each reasoning step produces a well-defined intermediate state that can be directly executed on the KB.
KoPL defines a finite set of 27 atomic functions, covering entity retrieval, attribute filtering, relation traversal, logical composition, comparative and extremal reasoning, verification, and counting. These functions are governed by a type system that explicitly encodes allowable compositions, preventing syntactically valid but semantically meaningless chains.
Formally, a KoPL program is an ordered sequence
$$P = [f_1, f_2, \ldots, f_n],$$
where each function $f_i$ consumes either entities retrieved from the KB or intermediate results from previous steps, and outputs a value whose type is compatible with the subsequent function.
Structurally, KoPL programs can be represented as a directed acyclic graph (DAG) [48], where nodes correspond to symbolic functions and edges represent typed data dependencies. This structure supports step-by-step execution and provides a transparent, human-readable trace of the reasoning process, which is crucial for interpretability and error analysis.
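To make this concrete, the sketch below shows one possible in-memory representation of a KoPL program and its dependency structure. The container types and the example question are our own illustration; the function names (Find, Relate, FilterConcept, Count) are drawn from the KoPL function inventory.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KoPLStep:
    """One symbolic function call in a KoPL program."""
    function: str                                          # e.g., "Find", "Relate", "Count"
    inputs: List[str] = field(default_factory=list)        # textual arguments
    dependencies: List[int] = field(default_factory=list)  # indices of prior steps

# Illustrative program for "How many films did Christopher Nolan direct?"
program = [
    KoPLStep("Find", ["Christopher Nolan"]),            # step 0: locate the entity
    KoPLStep("Relate", ["director", "backward"], [0]),  # step 1: traverse the relation
    KoPLStep("FilterConcept", ["film"], [1]),           # step 2: enforce the type constraint
    KoPLStep("Count", [], [2]),                         # step 3: aggregate
]

# The dependency lists make the DAG explicit: each step consumes the outputs
# of the steps it references, so any topological order is a valid execution order.
```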

3.3. Few-Shot KBQA Problem Definition

Given a natural language question $q$, the goal of the KBQA task is to generate an executable symbolic program $P$ such that running $P$ over the knowledge base $\mathcal{K}$ produces the correct answer entity set:
$$A = \text{Execute}(P, \mathcal{K}).$$
The knowledge base is defined as
$$\mathcal{K} = \mathcal{K}_{\text{ontology}} \cup \mathcal{K}_{\text{facts}},$$
where $\mathcal{K}_{\text{ontology}}$ provides schema-level information such as type constraints and hierarchical relations, and $\mathcal{K}_{\text{facts}}$ contains instance-level assertions used during entity grounding and symbolic execution. The execution function $\text{Execute}(\cdot)$ interprets each symbolic operator in $P$ and performs step-by-step reasoning over $\mathcal{K}$ to obtain the final answer.
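As an illustration of $\text{Execute}(\cdot)$, a minimal interpreter might dispatch each operator over the intermediate results of its dependencies. The sketch below reuses the KoPLStep representation from Section 3.2; the KB accessors (find_entities, neighbors, is_instance_of) are hypothetical stand-ins, not the actual KoPL engine.

```python
def execute(program, kb):
    """Minimal sketch of Execute(P, K): run steps in order, threading
    intermediate results through the declared dependencies."""
    results = []
    for step in program:
        args = [results[i] for i in step.dependencies]
        if step.function == "Find":
            out = kb.find_entities(step.inputs[0])            # hypothetical KB accessor
        elif step.function == "Relate":
            relation, direction = step.inputs
            out = kb.neighbors(args[0], relation, direction)  # hypothetical KB accessor
        elif step.function == "FilterConcept":
            out = [e for e in args[0]
                   if kb.is_instance_of(e, step.inputs[0])]   # hypothetical KB accessor
        elif step.function == "Count":
            out = len(args[0])
        else:
            raise ValueError(f"unsupported function: {step.function}")
        results.append(out)
    return results[-1]  # the final step's output is the answer
```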
In the few-shot KBQA setting, we make use of a set of reference exemplars
$$E = \{(q_i, P_i)\},$$
where each pair consists of a natural language question and its corresponding gold symbolic program. These exemplars may be incorporated into the prompt or retrieved from a case base, and serve as structural guidance for inducing correct compositional reasoning.
Formally, the program generator $G$ conditions on the target question $q$, the exemplar set $E$, and ontology knowledge to produce a symbolic program:
$$P = G(q, E, \mathcal{K}_{\text{ontology}}).$$
The objective is to learn appropriate reasoning patterns from exemplar-based guidance and generate programs that are both syntactically valid and semantically executable over $\mathcal{K}$.

4. CBR2 Framework

Figure 3 illustrates the overall architecture of CBR2, a case-based reasoning framework with dual retrieval guidance for symbolic program generation. The key idea is to unify structured knowledge and retrieved reasoning cases into a single, structure-aware prompt that enables the LLM to produce an executable KoPL program in one pass. This design avoids the multi-turn interaction and error accumulation common in pipeline-based methods, while also reducing inference latency.
The framework comprises four modules:
(1) Knowledge Retrieval—The system retrieves ontology axioms and factual triples relevant to the input question to form a symbolic context. For example, for the question “When was Richard Widmark nominated for an Academy Award for Best Supporting Actor?”, the retrieved facts include “Richard Widmark has nominated for relation in forward direction with Academy Award for Best Supporting Actor”. The ontology slice further provides schema constraints such as “human has award received relation in forward direction with real property has follows qualifier”. Together, these two sources supply the symbolic background necessary for program grounding and execution.
(2) Dual-View Case Retrieval—The system retrieves exemplar reasoning cases from two complementary perspectives. The semantic-view model recalls questions involving similar topics (e.g., awards or nominations), whereas the structural-view model identifies KoPL programs with reasoning patterns similar to the target query. For this example, the latter often surfaces chains such as Find → Find → QueryRelationQualifier, which match the required reasoning structure.
(3) Prompt Construction—The retrieved knowledge and exemplar cases are encoded into a unified, structure-aware prompt $M_q$. The prompt contains system instructions, KoPL function descriptions, fact and ontology snippets, and labeled few-shot examples. By embedding these symbolic constraints and structurally similar exemplars directly into the prompt, the model is guided at both semantic and structural levels during program generation.
(4) Single-Pass Program Generation—The LLM decodes the entire KoPL program $P_q$ in a single step, which is then executed over the knowledge base to obtain the final answer. In the running example, executing the generated program yields the answer “1947-01-01”.
This unified workflow tightly integrates symbolic constraints with inductive reasoning patterns, ensuring executable correctness while maintaining efficient inference.

4.1. Knowledge Retrieval

The knowledge retrieval module aims to construct a symbolic context for program generation by selecting knowledge triples from the knowledge base (KB) that are semantically related to the input question q. We adopt a semantic retrieval approach to bypass explicit entity linking, which can suffer from ambiguity and incomplete lexicons.
Each triple $(s, p, o) \in \mathcal{K}$ (as defined in Section 3.1) is linearized into a textual form (e.g., “Barack Obama is a human”) and embedded using a pre-trained sentence-level encoder (BGE-m3). The question is encoded in the same space, and cosine similarity is computed to retrieve the top-K most relevant triples. The retrieval is implemented with FAISS [49] for efficient large-scale search.
Two categories of triples are retrieved:
  • Ontology triples: schema-level relations such as instanceOf, subclassOf, and hasAttribute, which provide type constraints and concept hierarchies.
  • Factual triples: entity-level assertions that capture real-world facts.
We retrieve ontology triples separately from factual triples to preserve type-aware constraints. For example, in the question “When was Titanic released?”, ontology knowledge (e.g., Titanic is a film and film has attribute publication date) constrains the retrieval to attributes of the correct type, ensuring that only date-type properties are considered. This prevents inappropriate function choices, such as applying FilterNum (intended for numerical attributes) to date-valued properties, and improves both interpretability and execution accuracy by narrowing the reasoning space.
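A minimal sketch of this retrieval step follows, assuming BGE-m3 is loaded through sentence-transformers and cosine similarity is realized as inner product over normalized embeddings; the one-line triple linearization is a simplification of the templated sentences described above.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")  # frozen encoder, as in Section 5.3

def build_index(triples):
    """Linearize (s, p, o) triples into sentences and index their embeddings."""
    texts = [f"{s} {p} {o}" for s, p, o in triples]
    emb = encoder.encode(texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on unit vectors
    index.add(np.asarray(emb, dtype="float32"))
    return index, texts

def retrieve(question, index, texts, k=10):
    """Return the top-k triples most similar to the question."""
    q = encoder.encode([question], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q, dtype="float32"), k)
    return [texts[i] for i in idx[0]]

# Ontology and factual triples are indexed separately, so each category
# contributes its own top-K (K_o = K_f = 10) to the symbolic context.
```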

4.2. Dual-View Case Retrieval

In few-shot KBQA, high-quality reasoning examples are critical for inducing compositional symbolic patterns. CBR2 retrieves cases from a training pool via a dual-view strategy that balances semantic similarity and reasoning-structure similarity.
Semantic-view retrieval. We encode both the test question and all training questions using a pre-trained sentence-level encoder, and compute cosine similarity to select the top-k semantically related examples. This view captures entity, attribute, and lexical overlaps, providing direct contextual grounding.
Structural-view retrieval. To complement the semantic view, we retrieve examples based on the structural similarity of their KoPL programs, as shown in Figure 4. For each annotated program $P$, we extract its function-level sketch $\text{Sketch}(P) = (f_1, f_2, \ldots)$. Given two programs $P_i$ and $P_j$, we compute an offline structural similarity score using SequenceMatcher:
$$s_{ij} = \text{SeqMatch}\big(\text{Sketch}(P_i), \text{Sketch}(P_j)\big),$$
which yields a continuous measure of their alignment. These automatically derived scores supervise a lightweight dual-encoder model operating on question pairs. The model encodes two questions,
$$h_i = \text{Enc}(q_i), \qquad h_j = \text{Enc}(q_j),$$
and predicts their structural similarity using cosine similarity:
$$\hat{s}_{ij} = \cos(h_i, h_j).$$
The encoder is fine-tuned with a mean squared error objective,
$$\mathcal{L} = \text{MSE}(\hat{s}_{ij}, s_{ij}),$$
allowing it to approximate program-level similarity purely from question text. The model is a lightweight dual-encoder fine-tuned from the MiniLM-L6-v2 backbone (22M parameters), trained on the KQA Pro training set, and optimized in under 60 min on a single Tesla V100 GPU (32 GB). At inference time, it retrieves questions whose underlying reasoning structure is most compatible with that of the test query.
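The supervision and training loop can be sketched as follows, using Python's difflib.SequenceMatcher for the alignment score and the sentence-transformers CosineSimilarityLoss, which minimizes the MSE between $\cos(h_i, h_j)$ and the label $s_{ij}$. The toy data and pair sampler are illustrative; the real pairs are drawn from the KQA Pro training set.

```python
import itertools
from difflib import SequenceMatcher
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def sketch_similarity(sketch_i, sketch_j):
    """s_ij: alignment ratio between two function-level sketches."""
    return SequenceMatcher(None, sketch_i, sketch_j).ratio()

# Toy (question, function sketch) pairs; real ones come from annotated programs.
data = [
    ("Who directed Titanic?", ["Find", "Relate", "QueryName"]),
    ("Who wrote Hamlet?", ["Find", "Relate", "QueryName"]),
    ("How many films did Nolan direct?", ["Find", "Relate", "FilterConcept", "Count"]),
]
pairs = [
    InputExample(texts=[qi, qj], label=sketch_similarity(si, sj))
    for (qi, si), (qj, sj) in itertools.combinations(data, 2)
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loader = DataLoader(pairs, shuffle=True, batch_size=2)
loss = losses.CosineSimilarityLoss(model)  # MSE between cos(h_i, h_j) and s_ij
model.fit(train_objectives=[(loader, loss)], epochs=1)
```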
As illustrated by the example in Figure 3, consider the query: “When was Richard Widmark nominated for an Academy Award for Best Supporting Actor?” The semantic-view retriever primarily focuses on entity and context-level similarity, whereas the structural-view retriever captures similarity in the underlying reasoning steps (e.g., Find → Find → QueryRelation). This contrast highlights why combining both retrieval views yields a more reliable and structurally aligned case set $C_q$. In practice, we retrieve the top-k examples from each view independently and merge them (with deduplication) into a unified set, ensuring that $C_q$ contains both semantically relevant cues and structurally aligned reasoning patterns.
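A possible merge step, assuming each retrieved case is a dict keyed by its question text, is sketched below:

```python
def merge_cases(semantic_hits, structural_hits):
    """Union of the two top-k lists, keeping retrieval order and deduplicating."""
    seen, merged = set(), []
    for case in semantic_hits + structural_hits:
        if case["question"] not in seen:
            seen.add(case["question"])
            merged.append(case)
    return merged
```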

4.3. Prompt Construction

As noted in Section 3.2, although KoPL programs are often serialized linearly, they are inherently DAGs, where each step may depend on one or more previous outputs. Preserving this structural information in the prompt is essential for inducing correct compositional reasoning and preventing invalid data flow.
We adopt a structure-aware representation for KoPL programs: (1) Single-dependency steps are expressed as chained compositions, e.g., r2: r1 → Relate[located in] → FilterConcept[country], which encodes the execution order and data passing explicitly; (2) Multi-dependency steps, such as logical conjunctions or comparisons, use explicit tuple references, e.g., (r2, r3) → And[], indicating that the function consumes outputs from multiple prior steps; (3) Independent steps initiate new branches for parallel reasoning, allowing different sub-goals to be resolved concurrently before merging.
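The serialization of these three step types might look like the sketch below, reusing the KoPLStep representation from Section 3.2. For brevity it emits one line per step, whereas the paper's format additionally collapses consecutive single-dependency steps into a single chain.

```python
def serialize_program(program):
    """Render a KoPL step list in the structure-aware format of this section."""
    lines = []
    for i, step in enumerate(program):
        call = f"{step.function}[{', '.join(step.inputs)}]"
        refs = [f"r{d}" for d in step.dependencies]
        if not refs:                                             # independent step: new branch
            lines.append(f"r{i}: {call}")
        elif len(refs) == 1:                                     # single dependency: chained
            lines.append(f"r{i}: {refs[0]} → {call}")
        else:                                                    # multi-dependency: tuple refs
            lines.append(f"r{i}: ({', '.join(refs)}) → {call}")
    return "\n".join(lines)

# For the example program from Section 3.2 this yields:
# r0: Find[Christopher Nolan]
# r1: r0 → Relate[director, backward]
# r2: r1 → FilterConcept[film]
# r3: r2 → Count[]
```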
The prompt template integrates multiple information sources: (i) concise role instructions specifying KoPL syntax, typing rules, and execution semantics; (ii) definitions of all KoPL functions with explicit argument and return types; (iii) several retrieved examples from the dual-view case retrieval module (Section 4.2), each paired with its factual triples, ontology triples, and structure-aware program; (iv) the target question and its retrieved knowledge, followed by a “Program:” slot for completion.
Compared with flat function sequences, this representation explicitly encodes data-flow dependencies and enforces type constraints, thereby reducing common syntactic errors (e.g., calling a function with the wrong number of arguments) and semantic errors (e.g., referencing an attribute that does not exist for the given entity type). Moreover, by exposing reusable structural patterns—such as “filter by type → relate → count”—the approach improves generalization to novel compositions, enabling the LLM to reason beyond memorized sequential forms.
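Putting the pieces together, the unified prompt $M_q$ might be assembled as below; the instruction wording and field layout are our own illustration of components (i)–(iv), not the exact template used in the paper.

```python
PROMPT_TEMPLATE = """You are a KoPL program writer. Follow KoPL syntax, typing
rules, and execution semantics exactly.

## KoPL functions
{function_defs}

## Examples
{examples}

## Target
Question: {question}
Facts:
{facts}
Ontology:
{ontology}
Program:"""

def build_prompt(question, facts, ontology, cases, function_defs):
    """Assemble the unified structure-aware prompt M_q."""
    examples = "\n\n".join(
        f"Question: {c['question']}\nFacts:\n{c['facts']}\n"
        f"Ontology:\n{c['ontology']}\nProgram:\n{c['program']}"
        for c in cases  # each case carries its own knowledge and serialized program
    )
    return PROMPT_TEMPLATE.format(
        function_defs=function_defs,
        examples=examples,
        question=question,
        facts="\n".join(facts),
        ontology="\n".join(ontology),
    )
```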

4.4. Single-Pass Program Generation

The final structured prompt is fed into a frozen LLM, which produces the entire KoPL program in a single decoding step. This design contrasts with multi-turn correction or interactive reasoning, eliminating intermediate validation stages and thereby avoiding additional latency, cumulative error propagation, and reliance on external symbolic controllers.
After decoding, we perform a brief syntactic and type check to ensure that all variables are defined, argument types match function signatures, and the program conforms to KoPL syntax. Any minor issues can be repaired using deterministic rules, though in practice such errors are rare due to the explicit constraints embedded in the prompt.
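A sketch of such a check over the serialized format of Section 4.3 is shown below; it validates reference definitions and function names for one function per line, while real type checking would additionally compare argument and return types against the KoPL signatures.

```python
import re

def check_program(lines, known_functions):
    """Lightweight post-decoding validation: every referenced step must be
    defined earlier, and every function name must be a known KoPL function."""
    defined = set()
    pattern = re.compile(r"(r\d+):\s*(?:\(?([r\d,\s]*)\)?\s*→\s*)?(\w+)\[")
    for line in lines:
        m = pattern.match(line)
        if not m:
            return False, f"unparseable line: {line}"
        target, refs, fn = m.groups()
        if fn not in known_functions:
            return False, f"unknown function: {fn}"
        for ref in re.findall(r"r\d+", refs or ""):
            if ref not in defined:
                return False, f"undefined reference: {ref}"
        defined.add(target)
    return True, "ok"
```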
The validated program is then executed by the KoPL interpreter to obtain the final answer. By integrating symbolic constraints directly into the prompt and decoding once, this approach combines the efficiency of minimal inference steps with the robustness of type-safe symbolic reasoning, while maintaining high accuracy even in few-shot settings. This robustness also stems from the fact that the retrieved symbolic knowledge is not used in isolation. Although semantic retrieval may surface triples that are lexically related but imperfect with respect to ontology constraints, the accompanying retrieved cases provide structural cues—such as valid reasoning patterns and type-consistent relation usage—that guide the model toward constructing executable programs. During single-pass decoding, these two retrieval signals act in concert, reducing the likelihood that noisy or schema-incompatible triples will influence the final program. This interaction further explains why CBR2 achieves stable performance without the need for iterative correction.

5. Experimental Setup

5.1. Datasets

We evaluate CBR2 on two widely used KBQA benchmarks: KQA Pro [47] and MetaQA [50]; Table 1 summarizes the dataset statistics.
KQA Pro [47] is a large-scale KBQA dataset built over Wikidata, where each question is paired with a gold KoPL program and the corresponding answer. It covers diverse reasoning types, including multi-hop reasoning, comparisons, and qualifier constraints, and supports fine-grained evaluation by reasoning category. We use its full validation set for the main results. For ablation studies and comparison with prior few-shot methods, we follow Inter-KBQA [3] and adopt a balanced subset of 9 × 100 questions, sampled uniformly across nine reasoning categories to reduce experimental cost while preserving type coverage.
MetaQA [50] is a movie-domain KBQA benchmark supporting 1-hop, 2-hop, and 3-hop reasoning questions. Following prior works [5], we adopt the KoPL-annotated version converted by GraphQ IR [8].

5.2. Baselines

To evaluate the effectiveness of our proposed CBR2 framework, we compare against a diverse set of representative KBQA methods, grouped into three categories following prior work [3].
SP-based Methods. These methods treat KBQA as semantic parsing, mapping natural language questions into executable logical forms such as SPARQL or KoPL:
  • BART (SPARQL) [23]: a fully supervised seq2seq parser trained to generate SPARQL queries.
  • GraphQ IR [8]: represents queries as graph-structured intermediate forms to preserve compositional semantics and enable precise mapping from natural language to executable queries.
  • NSM (SE) [17]: neural symbolic machine with search-based execution, used for MetaQA.
  • TransferNet [51]: knowledge transfer across domains for semantic parsing, used for MetaQA.
Prompting Methods with LLMs. These approaches leverage large language models in few-shot or zero-shot settings via prompt engineering:
  • IO prompting [1]: in-context learning with input-output exemplars.
  • Chain-of-Thought (CoT) [52]: prompting to elicit intermediate reasoning steps.
  • Self-Consistency (SC) [53]: aggregating answers from multiple CoT generations.
LLMs + KBs Methods. These methods integrate LLMs with symbolic controllers, schema rules, or retrieval mechanisms to improve structural validity and factual grounding:
  • LLM-ICL: an alternate to Pangu used for evaluation in FlexKBQA [35] and SymKGQA [4].
  • FlexKBQA [35]: leverages large language models to generate executable KB queries with minimal task-specific supervision.
  • Inter-KBQA [3]: interactive KBQA with schema-level rule retrieval and step-wise execution.
  • SymKGQA [4]: schema-constrained symbolic program generation with GPT-4.
  • Rule-KBQA [14]: rule-guided symbolic reasoning for complex KBQA.
  • CodeAlignKGQA [5]: multi-turn code alignment and correction for KBQA programs.

5.3. Implementation Details

Our system follows a retrieval → prompt construction → program generation → execution pipeline. For knowledge retrieval, we adopt a dual-encoder architecture implemented with FAISS to independently index ontology triples and factual triples. Each triple is linearized into a natural language sentence and embedded using the BGE-m3 [54] model without fine-tuning. At inference time, the input question is encoded in the same vector space, and the top-$K_o = 10$ ontology triples and top-$K_f = 10$ factual triples are retrieved separately to preserve explicit type constraints.
Case retrieval is conducted from two complementary views. The semantic view uses the same BGE-m3 embeddings to retrieve the top-$k_s = 5$ semantically similar questions from the training pool. The structural view uses a lightweight all-MiniLM-L6-v2 [55] encoder fine-tuned on KoPL function transition sequences. Training pairs are dynamically sampled from the KBQA training set, with similarity labels computed from the normalized longest common subsequence of function names. This encoder retrieves the top-$k_r = 5$ structurally similar programs, enabling the reuse of abstract reasoning patterns across semantically different questions.
All retrieved knowledge and cases are integrated into a single structure-aware prompt in the DAG format described in Section 4.3, including full function definitions and typed dependencies. Program generation is performed using the Qwen3-Plus API [56] with greedy decoding. A lightweight syntax checker validates variable references and type compatibility before execution by the KoPL interpreter. Minor syntax issues (e.g., missing brackets, variable typos) are repaired deterministically, though such cases are rare due to the explicit constraints in the prompt.
Our implementation uses FAISS 1.7.2 for retrieval and PyTorch 2.4.1 [57] for model training. Retrieval takes under 0.1 s per question on average, while program generation via the Qwen3-Plus API takes about 1.8 s. The complete pipeline processes a question in approximately 2 s end-to-end, making it suitable for interactive KBQA scenarios.

5.4. Evaluation Metrics

We report two evaluation metrics:
Hits@1. The proportion of questions for which the top-ranked predicted answer matches the gold answer. Since KQA Pro contains only single-answer questions, Hits@1 is equivalent to execution accuracy in our setting.
Syntax Error Rate (SER). The proportion of generated programs that cannot be parsed or violate type constraints, indicating the syntactic controllability of program generation.
Both metrics are computed on the official validation or test splits of each dataset.
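Under the assumption that each prediction carries its top-ranked answer together with a flag indicating whether the program parsed and type-checked, the two metrics reduce to simple counts:

```python
def evaluate(predictions, golds):
    """predictions: list of (top1_answer or None, parsed_ok) pairs,
    aligned with the gold answers. Returns percentages."""
    n = len(golds)
    hits = sum(ans == gold for (ans, ok), gold in zip(predictions, golds))
    syntax_errors = sum(not ok for ans, ok in predictions)
    return {"Hits@1": 100.0 * hits / n, "SER": 100.0 * syntax_errors / n}
```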

6. Results and Discussion

In this section, we present a comprehensive evaluation of CBR2. Our experiments aim to examine whether the proposed single-pass case-based reasoning framework not only achieves competitive or superior performance compared to iterative decoding strategies, but also maintains robustness across reasoning types and datasets. We further analyze the contribution of each key component through ablation studies.
Specifically, we address the following four research questions:
  • RQ1: Can a single-pass generation framework match or surpass the performance of existing multi-turn generation methods? This directly evaluates our core motivation of avoiding iterative decoding without sacrificing accuracy.
  • RQ2: Can CBR2 maintain stable generalization performance across different reasoning types? We investigate its robustness by conducting category-wise analysis on diverse reasoning patterns.
  • RQ3: How well does CBR2 generalize across datasets? We assess its cross-domain applicability by comparing performance on two distinct KBQA benchmarks.
  • RQ4: What is the contribution of each component in CBR2? We quantify the impact of knowledge retrieval, dual-view case retrieval, and structure-aware prompting via ablation studies.
Across all result tables, the highest-performing values are indicated in bold.

6.1. RQ1: Baseline Comparison

On the full KQA Pro validation set (Table 2), CBR2 achieves 82.13% Hits@1 and a 3.71% SER, outperforming the best prior few-shot baseline, CodeAlignKGQA, by +9.43 points while reducing syntax errors by 25%. CodeAlignKGQA already incorporates code-structure alignment, yet its improvements remain limited without explicit knowledge grounding or retrieval-driven contextualization. The substantial gains of CBR2 arise from two complementary factors.
First, richer contextual grounding is enabled by conditioning the LLM on retrieved KB facts and ontology triples during decoding. These knowledge snippets act as explicit semantic anchors, helping the model disambiguate relations, avoid argument mismatches, and maintain type consistency. As a result, CBR2 suppresses relation hallucination patterns commonly observed in purely code-aligned models.
Second, targeted case conditioning through dual-view case retrieval (semantic + reasoning) ensures that the exemplars used for few-shot adaptation match both the topical domain and the compositional program structure of the input question. This alignment provides a “soft template” for the LLM that guides step ordering, operator selection, and variable grounding. Together, these mechanisms enable CBR2 to approach the performance of fully supervised BART + SPARQL (83.28%) despite using no gradient-based task adaptation.
Beyond comparing against few-shot methods, we further evaluate CBR2 against the rule-guided interactive system Inter-KBQA. Although Inter-KBQA leverages explicit rule repositories, symbolic controllers, and multi-turn execution feedback, CBR2 achieves higher overall accuracy (84.11% vs. 83.33% on the balanced 900-sample set, Table 3) while relying solely on a simpler and faster single-pass generation pipeline. This finding highlights that the retrieved ontology knowledge and reasoning-case exemplars in CBR2 implicitly encode many of the structural constraints that interactive pipelines enforce explicitly. By avoiding iterative decoding and execution cycles, CBR2 eliminates both the latency overhead and the risk of compounding errors inherent to multi-turn symbolic reasoning.
Overall, these results illustrate that CBR2 strikes an effective balance between controllability and efficiency: it leverages structural signals and retrieved knowledge to produce faithful programs, yet maintains the simplicity and stability of single-pass generation.

6.2. RQ2: Performance Across Reasoning Categories

On the full KQA Pro validation set (Figure 5), CBR2 achieves the highest Hits@1 in six out of seven reasoning categories, with particularly large gains in Qualifier (+29.14) and Count (+26.42) over the strongest prior few-shot baseline (CodeAlignKGQA). These two categories are especially challenging because they require precise filtering, attribute grounding, and aggregation over sets of entities—operations that LLMs often struggle with when relying only on surface-form prompts. The integration of ontology triples in CBR2 helps alleviate this brittleness by: (1) providing explicit qualifier–value mappings that anchor temporal, spatial, and contextual modifiers to the correct KG fields, and (2) enumerating valid numerical attributes for each entity type, thereby reducing ambiguity when constructing countable sets or selecting aggregatable fields.
Substantial improvements are also observed in Logical reasoning (+10.10), which frequently involves multi-branch aggregation, set intersection, and long-range dependency propagation across intermediate program states. In these settings, the structural-view case retrieval plays a pivotal role: it provides reusable multi-operator templates and codifies explicit dependency chains, reducing the likelihood of errors such as incorrect variable reuse or misordered operations.
In contrast, CBR2 shows mild degradation in reasoning types that depend less on compositional structure and more on fine-grained lexical cues or implicitly expressed comparison signals. On the full validation set, the Comparison category requires interpreting subtle comparative expressions (e.g., “earlier than,” “higher than”), where the decisive semantic cues are often weakly realized. Single-pass generation, while efficient and stable for structured multi-step reasoning, lacks an opportunity to iteratively refine or correct such fine-grained lexical interpretations. In comparison, CodeAlignKGQA benefits from stronger coupling and can more reliably anchor these subtle cues during decoding.
A related phenomenon is observed for the QA category in the 900-sample subset. The model must select among multiple semantically proximate attributes, and minor lexical differences may shift attribute grounding. Multi-turn or feedback-driven frameworks—as exemplified by Rule-KBQA’s schema-guided deduction agent—can revise or prune candidates over several decision steps, whereas CBR2 must commit to a single decoding trajectory, making it more sensitive to slight lexical ambiguity. A similar contrast appears between SA and SB. SB features an explicit comparison structure and a constrained candidate range, allowing CBR2 to leverage structural cues effectively. SA, however, involves a larger candidate pool and implicitly expressed selection conditions, where iterative symbolic filtering—rather than single-pass generation—aligns more naturally with the task structure.
Overall, these observations highlight that the few areas where CBR2 lags are precisely those where iterative refinement or feedback-based correction is most beneficial. Importantly, CBR2 remains competitive in these categories, while maintaining clear advantages in reasoning types that rely on structural composition, type consistency, and multi-step dependencies.

6.3. RQ3: Generalization Across Datasets

To further evaluate the cross-domain robustness of CBR2, we conduct a 10-shot evaluation on the movie-domain MetaQA benchmark. In contrast to KQA Pro, MetaQA features a much smaller and domain-specific knowledge base, limited relation diversity, and relatively simple multi-hop reasoning patterns. Furthermore, the entities and relations in MetaQA bear little lexical or structural resemblance to those in KQA Pro. Crucially, we do not retrain or adapt the structural-similarity retriever for this dataset; instead, we directly reuse the model trained exclusively on KQA Pro.
As shown in Table 4, even under this zero-adaptation setup, CBR2 achieves Hits@1 scores of 99.7%, 99.8%, and 99.8% on the 1-hop, 2-hop, and 3-hop tasks respectively, while maintaining a consistently zero SER across all depths. These results match or surpass multi-turn few-shot baselines that rely on extensive in-context examples (up to 100 shots), despite CBR2 using only 10-shot demonstrations and a single-pass decoding strategy.
This strong cross-domain performance highlights several key strengths of CBR2. First, the dual-view case retrieval mechanism provides robust semantic and structural anchoring that generalizes beyond the source domain. Second, the symbolic program format enables the model to transfer reasoning behaviors even when the underlying KB schema shifts significantly. Finally, the ability to reuse the structural retriever without retraining shows that CBR2 does not rely on domain-specific overfitting but instead captures higher-level patterns of functional composition. Together, these factors demonstrate that CBR2 can maintain high accuracy when transferred to new knowledge structures and reasoning tasks, without requiring any additional supervision, retraining, or domain-specific adaptation.

6.4. RQ4: Ablation Study

We conduct an ablation study on the balanced 9 × 100 subset of KQA Pro to quantify the contribution of each component in CBR2, as shown in Table 5. The full model achieves 84.11% Hits@1, and removing any retrieval branch or knowledge context leads to clear and sometimes substantial degradation, confirming that CBR2 relies on the complementary interaction between retrieval signals and structured knowledge.
Effect of case retrieval. Removing the reasoning-case branch (w/o reasoning case) results in a 9.44-point drop, demonstrating that reasoning cases provide reusable program templates that stabilize the compositional structure of KoPL generation. Without them, the model tends to produce under-specified or structurally incomplete programs. Eliminating semantic-case retrieval (w/o semantic case) leads to the largest decrease (−20.55 points), highlighting that lexical and contextual cues are essential for grounding argument values and attribute names. This indicates that CBR2 relies on semantic alignment to avoid errors such as attribute mismatches and entity drift. When retrieved cases are replaced with random ones, accuracy collapses to 30.22%, showing that CBR2 does not merely benefit from additional context, but specifically from relevant analogical patterns that match the query’s reasoning intent. This underscores the importance of dual-view retrieval in few-shot compositions.
Effect of knowledge retrieval. Knowledge retrieval also contributes significantly. Removing all knowledge from the KB (w/o knowledge) reduces performance by 5.89 points, illustrating that program grounding requires explicit information beyond the examples provided in the prompt. Removing only ontology triples (w/o ont knowledge) causes a 3.78-point drop, showing that structural constraints still play an important role in preventing illegal function–entity combinations and enforcing type consistency. Interestingly, the ontology-only condition (w/o fact knowledge) performs slightly worse than removing all knowledge, suggesting that factual triples are more crucial for resolving ambiguous operations, while ontology constraints alone may sometimes bias the decoder toward structurally plausible yet semantically incorrect choices. This highlights the necessity of combining symbolic constraints with referential grounding.
Effect of structured prompting. Finally, switching the structured DAG-style program prompt to a flat sequential chain (Chain) yields a 4.22-point drop. This indicates that the explicit data-flow dependencies preserved by the DAG representation provide actionable clues for the LLM, helping it track intermediate entities and compose multi-hop logic more reliably. Flattening this structure weakens the model’s ability to reason about variable bindings and execution ordering, leading to more frequent compositional errors.
Overall, the ablation results indicate that each module mitigates a distinct class of errors: semantic cases reduce attribute and argument ambiguity; structural cases prevent incomplete reasoning chains; ontology knowledge enforces type correctness; factual knowledge grounds entity selection; and the DAG-style prompt preserves variable flow. These complementary roles jointly stabilize single-pass KoPL generation.

6.5. Error Case Study

To better understand the failure patterns of CBR2, we manually examined 200 mispredicted cases from the validation set and categorized them into two broad classes, as shown in Table 6: (1) semantic grounding errors, where the decoded program uses an incorrect attribute or schema element, and (2) logical reasoning errors, where the multi-hop dependency or relation structure is corrupted. We identified 63 semantic grounding errors and 82 logical errors, which indicates that these two categories together account for the majority of failures.
Case 1 exemplifies a semantic grounding error: although elevation above sea level and altitude above sea level are lexically similar, they correspond to different schema attributes. Symbolic program execution requires exact schema alignment, and even slight lexical deviations can lead to empty results. This reflects the brittleness of schema-grounded symbolic operations and suggests future directions such as attribute canonicalization or schema-aware decoding.
Case 2 illustrates a logical reasoning error: the orientation of the influenced by relation is incorrectly reversed. Although the surface relation appears correct, the semantic direction, which is crucial for asymmetric relations, is flipped, yielding a logically inverted reasoning path. This indicates a broader difficulty in maintaining consistent relation-direction grounding during decoding.
Together, these analyses show that most residual errors in CBR2 stem from fine-grained semantic grounding and relation- or chain-level logical inconsistencies. Addressing these limitations, for example through schema normalization, direction-aware decoding, or the incorporation of more robust self-correction mechanisms, represents a promising direction for further strengthening single-pass program generation.

7. Conclusions and Future Work

7.1. Summary

We have presented CBR2, a unified case-based reasoning framework that integrates ontology-constrained knowledge retrieval with semantically and structurally aligned case retrieval, forming a coherent single-pass LLM-based symbolic program generation pipeline. By jointly leveraging factual knowledge, ontological structure, and dual-view case analogies, CBR2 provides a principled way to guide LLMs toward faithful and executable reasoning programs without requiring gradient-based adaptation or task-specific fine-tuning. Across both KQA Pro and MetaQA, the framework achieves state-of-the-art performance while preserving symbolic interpretability and stable cross-category behavior, especially on qualifier-related, logical, and counting questions.

7.2. Limitations

Although CBR2 demonstrates strong empirical performance, the framework inherits several inherent liabilities of large language models (LLMs), including hallucinated facts, structurally inconsistent reasoning chains, and sensitivity to prompt formatting and demonstration ordering. These vulnerabilities are particularly salient in multi-hop symbolic program generation, where a single erroneous step may render the entire KoPL program unexecutable. CBR2 mitigates these risks through ontology-constrained retrieval, which restricts semantically compatible operations; structural-view case retrieval, which stabilizes multi-hop composition by providing template-aligned demonstrations; and the executability of KoPL programs, which naturally exposes type mismatches and ill-formed data-flow.
Beyond general LLM constraints, the framework also has several method-specific shortcomings. Its performance depends on the quality and coverage of retrieved examples; when structurally relevant cases are not present in the pool, program generation may degrade. The reliance on the KoPL formalism limits expressivity for tasks requiring richer reasoning operators. Furthermore, the system currently lacks an automatic mechanism for repairing programs when execution fails. These limitations highlight directions for future methodological enhancement.

7.3. Future Directions

Looking forward, we plan to extend CBR2 to open-domain KBQA, where large and heterogeneous knowledge sources introduce additional retrieval and grounding challenges. Another promising direction is to incorporate automatic error diagnosis and adaptive multi-turn correction, allowing the system to dynamically decide when iterative refinement is necessary while maintaining the efficiency of single-pass reasoning. In particular, such a fallback mechanism would be most beneficial in cases where the single-pass output exhibits structural invalidity, ambiguous entity or attribute grounding, or incomplete multi-hop dependencies—situations in which additional clarification or self-correction is likely to improve the final program.
Such hybrid designs may offer a better trade-off between accuracy, interpretability, and computational cost, ultimately contributing to more controllable and trustworthy neural-symbolic reasoning. Beyond these directions, we believe that the principles demonstrated by CBR2—particularly its integration of structured retrieval, ontology grounding, and executable reasoning—may extend to a broader range of applications beyond KBQA, including scientific information extraction, decision support in high-stakes domains such as law or medicine, and multi-hop fact verification. The framework’s emphasis on structural alignment and knowledge faithfulness offers a transferable paradigm that may inspire future developments in hybrid neural-symbolic reasoning systems.

Author Contributions

Conceptualization, X.H.; methodology, X.H.; software, T.L.; validation, X.H., L.X. and Z.D.; formal analysis, X.H.; investigation, X.H. and T.L.; resources, G.X.; data curation, L.X. and Z.D.; writing—original draft preparation, X.H.; writing—review and editing, X.H., H.T. and K.H.; visualization, L.X. and Z.D.; supervision, K.H. and G.X.; project administration, X.H.; funding acquisition, G.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Defense-related Science and Technology Key Lab Fund Project of China (grant No. 61420062401).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. KQA Pro is accessible at https://github.com/shijx12/KQAPro_Baselines (accessed on 16 December 2025), and MetaQA is available at https://github.com/yuyuz/MetaQA (accessed on 16 December 2025). No new data were created in this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  2. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  3. Xiong, G.; Bao, J.; Zhao, W. Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Volume 1, pp. 10561–10582. [Google Scholar] [CrossRef]
  4. Agarwal, P.; Kumar, N.; Bedathur, S. SymKGQA: Few-shot knowledge graph question answering via symbolic program generation and execution. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Volume 1, pp. 10119–10140. [Google Scholar]
  5. Agarwal, P.; Kumar, N.; Jagannath, S.B. Aligning Complex Knowledge Graph Question Answering as Knowledge-Aware Constrained Code Generation. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 3952–3978. [Google Scholar]
  6. Hu, X.; Tong, L.; Yang, J.; Xue, L.; Huang, K.; Xiao, G. Fuzzy Symbolic Reasoning for Few-Shot KBQA: A CBR-Inspired Generative Approach. In Proceedings of the Case-Based Reasoning Research and Development, ICCBR 2025, Biarritz, France, 30 June–3 July 2025; Bichindaritz, I., López, B., Eds.; Springer: Cham, Switzerland, 2025; pp. 96–110. [Google Scholar]
  7. Hu, X.; Jian, Y.; Xiao, G. Knowledge-injected Stepwise Reasoning on Complex KBQA. In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2024, Yokohama, Japan, 30 June–5 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8. [Google Scholar] [CrossRef]
  8. Nie, L.; Cao, S.; Shi, J.; Sun, J.; Tian, Q.; Hou, L.; Li, J.; Zhai, J. GraphQ IR: Unifying the Semantic Parsing of Graph Query Languages with One Intermediate Representation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5848–5865. [Google Scholar] [CrossRef]
  9. Shu, Y.; Yu, Z.; Li, Y.; Karlsson, B.F.; Ma, T.; Qu, Y.; Lin, C.Y. TIARA: Multi-grained Retrieval for Robust Question Answering over Large Knowledge Base. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 8108–8121. [Google Scholar] [CrossRef]
  10. Neelam, S.; Sharma, U.; Karanam, H.; Ikbal, S.; Kapanipathi, P.; Abdelaziz, I.; Mihindukulasooriya, N.; Lee, Y.S.; Srivastava, S.; Pendus, C.; et al. SYGMA: A System for Generalizable and Modular Question Answering Over Knowledge Bases. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 3866–3879. [Google Scholar] [CrossRef]
  11. Gu, Y.; Pahuja, V.; Cheng, G.; Su, Y. Knowledge Base Question Answering: A Semantic Parsing Perspective. In Proceedings of the 4th Conference on Automated Knowledge Base Construction, London, UK, 3–5 November 2022. [Google Scholar]
  12. Wang, X.; Li, S.; Ji, H. Code4Struct: Code Generation for Few-Shot Event Structure Prediction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 3640–3663. [Google Scholar] [CrossRef]
  13. Mishra, M.; Kumar, P.; Bhat, R.; Murthy, R.; Contractor, D.; Tamilselvam, S. Prompting with Pseudo-Code Instructions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 15178–15197. [Google Scholar] [CrossRef]
  14. Zhang, Z.; Wen, L.; Zhao, W. Rule-KBQA: Rule-guided reasoning for complex knowledge base question answering with large language models. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 8399–8417. [Google Scholar]
  15. Mukherjee, S.; Chinta, A.; Kim, T.; Sharma, T.A.; Tur, D.H. Premise-Augmented Reasoning Chains Improve Error Identification in Math Reasoning with LLMs. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
  16. Yih, W.-t.; Chang, M.W.; He, X.; Gao, J. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing of the AFNLP, Beijing, China, 26–31 July 2015. [Google Scholar]
  17. Liang, C.; Berant, J.; Le, Q.; Forbus, K.D.; Lao, N. Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 23–33. [Google Scholar] [CrossRef]
  18. Lan, Y.; Jiang, J. Query Graph Generation for Answering Multi-hop Complex Questions from Knowledge Bases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 969–974. [Google Scholar] [CrossRef]
  19. Mintz, M.; Bills, S.; Snow, R.; Jurafsky, D. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, 2–7 August 2009; pp. 1003–1011. [Google Scholar]
  20. Liang, C.; Norouzi, M.; Berant, J.; Le, Q.V.; Lao, N. Memory augmented policy optimization for program synthesis and semantic parsing. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 2–8 December 2018; pp. 10015–10027. Available online: https://dl.acm.org/doi/10.5555/3327546.3327665 (accessed on 16 December 2025).
  21. Dong, L.; Wei, F.; Zhou, M.; Xu, K. Question answering over freebase with multi-column convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; Volume 1, pp. 260–269. [Google Scholar]
  22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  23. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  24. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  25. Shin, R.; Lin, C.; Thomson, S.; Chen, C.; Roy, S.; Platanios, E.A.; Pauls, A.; Klein, D.; Eisner, J.; Van Durme, B. Constrained Language Models Yield Few-Shot Semantic Parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7699–7715. [Google Scholar] [CrossRef]
  26. Scholak, T.; Schucher, N.; Bahdanau, D. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 9895–9901. [Google Scholar] [CrossRef]
  27. Sun, Y.; Li, P.; Cheng, G.; Qu, Y. Skeleton parsing for complex question answering over knowledge bases. J. Web Semant. 2022, 72, 100698. [Google Scholar] [CrossRef]
  28. Banerjee, D.; Nair, P.A.; Kaur, J.N.; Usbeck, R.; Biemann, C. Modern baselines for SPARQL semantic parsing. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 2260–2265. [Google Scholar]
  29. Tran, D.; Pascazio, L.; Akroyd, J.; Mosbach, S.; Kraft, M. Leveraging text-to-text pretrained language models for question answering in chemistry. ACS Omega 2024, 9, 13883–13896. [Google Scholar] [CrossRef] [PubMed]
  30. Zhan, B.; Duan, Y.; Yang, X.; He, D.; Yan, S. Text2SPARQL: Grammar Pre-training for Text-to-QDMR Semantic Parsers from Intermediate Question Decompositions. In Proceedings of the International Conference on Neural Information Processing, Auckland, New Zealand, 2–6 December 2024; pp. 123–137. [Google Scholar]
  31. Hu, S.; Zou, L.; Zhang, X. A State-transition Framework to Answer Complex Questions over Knowledge Base. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2098–2108. [Google Scholar] [CrossRef]
  32. Jiang, J.; Zhou, K.; Dong, Z.; Ye, K.; Zhao, X.; Wen, J. StructGPT: A General Framework for Large Language Model to Reason over Structured Data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 9237–9251. [Google Scholar] [CrossRef]
  33. Sun, J.; Xu, C.; Tang, L.; Wang, S.; Lin, C.; Gong, Y.; Ni, L.M.; Shum, H.; Guo, J. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  34. Luo, H.; E, H.; Guo, Y.; Lin, Q.; Wu, X.; Mu, X.; Liu, W.; Song, M.; Zhu, Y.; Luu, A.T. KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
  35. Li, Z.; Fan, S.; Gu, Y.; Li, X.; Duan, Z.; Dong, B.; Liu, N.; Wang, J. FlexKBQA: A flexible LLM-powered framework for few-shot knowledge base question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18608–18616. [Google Scholar]
  36. Zhang, T.; Wang, J.; Li, Z.; Qu, J.; Liu, A.; Chen, Z.; Zhi, H. MusTQ: A Temporal Knowledge Graph Question Answering Dataset for Multi-Step Temporal Reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 11688–11699. [Google Scholar]
  37. Ye, X.; Yavuz, S.; Hashimoto, K.; Zhou, Y.; Xiong, C. RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 6032–6043. [Google Scholar] [CrossRef]
  38. Yu, D.; Zhang, S.; Ng, P.; Zhu, H.; Li, A.H.; Wang, J.; Hu, Y.; Wang, W.Y.; Wang, Z.; Xiang, B. DecAF: Joint Decoding of Answers and Logical Forms for Question Answering over Knowledge Bases. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  39. Sawhney, R.; Yadav, S.; Bhattacharya, I.; Mausam. Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 24578–24596. [Google Scholar] [CrossRef]
  40. Nan, L.; Zhao, Y.; Zou, W.; Ri, N.; Tae, J.; Zhang, E.; Cohan, A.; Radev, D. Enhancing Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 14935–14956. [Google Scholar] [CrossRef]
  41. Aamodt, A.; Plaza, E. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Commun. 1994, 7, 39–59. [Google Scholar] [CrossRef]
  42. Das, R.; Zaheer, M.; Thai, D.; Godbole, A.; Perez, E.; Lee, J.Y.; Tan, L.; Polymenakos, L.; McCallum, A. Case-based Reasoning for Natural Language Queries over Knowledge Bases. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 9594–9611. [Google Scholar] [CrossRef]
  43. Li, J.; Luo, X.; Lu, G. GS-CBR-KBQA: Graph-structured case-based reasoning for knowledge base question answering. Expert Syst. Appl. 2024, 257, 125090. [Google Scholar] [CrossRef]
  44. Das, R.; Godbole, A.; Naik, A.; Tower, E.; Zaheer, M.; Hajishirzi, H.; Jia, R.; Mccallum, A. Knowledge Base Question Answering by Case-based Reasoning over Subgraphs. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 4777–4793. [Google Scholar]
  45. Awasthi, A.; Chakrabarti, S.; Sarawagi, S. Structured case-based reasoning for inference-time adaptation of text-to-sql parsers. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 12536–12544. [Google Scholar]
  46. Dutt, R.; Bhattacharjee, K.; Gangadharaiah, R.; Roth, D.; Rose, C. PerKGQA: Question Answering over Personalized Knowledge Graphs. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 253–268. [Google Scholar] [CrossRef]
  47. Cao, S.; Shi, J.; Pan, L.; Nie, L.; Xiang, Y.; Hou, L.; Li, J.; He, B.; Zhang, H. KQA Pro: A Dataset with Explicit Compositional Programs for Complex Question Answering over Knowledge Base. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 6101–6119. [Google Scholar] [CrossRef]
  48. Liang, P.; Jordan, M.I.; Klein, D. Learning Dependency-Based Compositional Semantics. Comput. Linguist. 2013, 39, 389–446. [Google Scholar] [CrossRef]
  49. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
  50. Zhang, Y.; Dai, H.; Kozareva, Z.; Smola, A.; Song, L. Variational reasoning for question answering with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  51. Shi, J.; Cao, S.; Hou, L.; Li, J.; Zhang, H. TransferNet: An Effective and Transparent Framework for Multi-hop Question Answering over Relation Graph. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 4149–4158. [Google Scholar] [CrossRef]
  52. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  53. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  54. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 2318–2335. [Google Scholar] [CrossRef]
  55. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Kerrville, TX, USA, 2019. [Google Scholar]
  56. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025, arXiv:2505.09388. [Google Scholar] [CrossRef]
  57. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
Figure 1. Two dominant paradigms for symbolic program generation in few-shot KBQA. Rule-guided interactive reasoning: a symbolic agent incrementally constructs the program under retrieved abstract rules, with the large language model (LLM) assisting at each step. Multi-turn correction strategies: an initial program is generated and iteratively refined via prompt rewriting, self-correction, or feedback-based validation.
Figure 2. Two typical types of generation errors. (1) Inaccurate entity linking. (2) Condition omission.
Figure 3. The overall architecture of CBR2, including knowledge retrieval, dual-view case retrieval, prompt construction, and single-pass program generation.
Figure 4. Fine-tuning for structural-view retrieval.
Figure 5. Performance on the KQA Pro full validation set (11,797 samples) across different reasoning categories.
Table 1. Statistics of the datasets used in our experiments. A check mark (✓) denotes that the dataset is annotated with KoPL programs.

| Dataset | Hop   | KoPL | Train   | Valid  | Test   |
|---------|-------|------|---------|--------|--------|
| KQA Pro | -     | ✓    | 94,376  | 11,797 | 11,797 |
| MetaQA  | 1-hop | -    | 96,106  | 9992   | 9947   |
| MetaQA  | 2-hop | -    | 118,980 | 14,872 | 14,872 |
| MetaQA  | 3-hop | -    | 114,196 | 14,274 | 14,274 |
Table 2. Overall performance on the KQA Pro validation set. Comparative results for baseline methods are taken from the CodeAlignKGQA [5] paper.

| Method           | Models        | Hits@1 | SER% |
|------------------|---------------|--------|------|
| Fully Supervised | BART (SPARQL) | 83.28  | 8.2  |
| Fully Supervised | GraphQ IR     | 79.13  | -    |
| Few-shot         | FlexKBQA      | 42.68  | -    |
| Few-shot         | LLM-ICL       | 27.75  | -    |
| Few-shot         | SymKGQA       | 51.10  | 29.2 |
| Few-shot         | CodeAlignKGQA | 72.70  | 4.95 |
| Few-shot (Ours)  | CBR2          | 82.13  | 3.71 |
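For reference, CBR2's 3.71% syntax error rate corresponds to a relative reduction of (4.95 − 3.71)/4.95 ≈ 25% over CodeAlignKGQA, the strongest few-shot baseline in Table 2.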
Table 3. Performance on the KQA Pro 9 × 100 sample set. CT: Count; QA: QueryAttr; QAQ: QueryAttrQualifier; QN: QueryName; QR: QueryRelation; QRQ: QueryRelationQualifier; SA: SelectAmong; SB: SelectBetween; VF: Verify. Prompting results are taken from the Rule-KBQA [14] paper.

| Group      | Model       | CT | QA | QAQ | QN | QR | QRQ | SA | SB | VF | Overall |
|------------|-------------|----|----|-----|----|----|-----|----|----|----|---------|
| Prompting  | IO w/GPT-4  | 27 | 23 | 36  | 40 | 25 | 50  | 11 | 69 | 73 | 39.33   |
| Prompting  | CoT w/GPT-4 | 22 | 26 | 35  | 34 | 18 | 46  | 21 | 79 | 77 | 39.78   |
| Prompting  | SC w/GPT-4  | 25 | 28 | 33  | 38 | 22 | 51  | 19 | 86 | 75 | 41.89   |
| LLMs + KGs | Inter-KBQA  | 74 | 83 | 64  | 73 | 73 | 59  | 80 | 61 | 80 | 71.89   |
| LLMs + KGs | Rule-KBQA   | 82 | 87 | 79  | 82 | 84 | 75  | 87 | 88 | 86 | 83.33   |
| Ours       | CBR2        | 83 | 84 | 85  | 81 | 85 | 81  | 81 | 91 | 86 | 84.11   |
Table 4. Performance on the MetaQA dataset. Comparative results for baseline methods are taken from the CodeAlignKGQA [5] paper.

| Method               | Models                         | 1-Hop | 2-Hop | 3-Hop | SER% |
|----------------------|--------------------------------|-------|-------|-------|------|
| Fully Supervised     | NSM (SE)                       | 97.2  | 99.9  | 98.9  | -    |
| Fully Supervised     | TransferNet                    | 97.5  | 100.0 | 100.0 | -    |
| Few-Shot (100 Shots) | SymKGQA                        | 99.1  | 99.7  | 99.7  | 0.2  |
| Few-Shot (100 Shots) | CodeAlignKGQA (Gemini Pro 1.0) | 99.2  | 99.7  | 99.8  | 0.0  |
| Few-Shot (100 Shots) | CodeAlignKGQA (CodeLlama Ins.) | 99.6  | 99.8  | 99.7  | 0.0  |
| Few-Shot (10 Shots)  | Ours                           | 99.7  | 99.8  | 99.8  | 0.0  |
Table 5. Ablation study results on the KQA Pro 9 × 100 sample set.

| Category              | Model/Setting      | CT | QA | QAQ | QN | QR | QRQ | SA | SB | VF | Overall |
|-----------------------|--------------------|----|----|-----|----|----|-----|----|----|----|---------|
| Full                  | CBR2 (Ours)        | 83 | 84 | 85  | 81 | 85 | 81  | 81 | 91 | 86 | 84.11   |
| Target Program Format | Chain              | 78 | 82 | 81  | 73 | 82 | 68  | 80 | 90 | 85 | 79.89   |
| Case Recall           | w/o reasoning case | 77 | 75 | 79  | 62 | 77 | 65  | 77 | 85 | 75 | 74.67   |
| Case Recall           | w/o semantic case  | 63 | 64 | 57  | 52 | 68 | 63  | 56 | 79 | 70 | 63.56   |
| Case Recall           | random case        | 31 | 37 | 11  | 26 | 43 | 11  | 25 | 42 | 46 | 30.22   |
| Knowledge Recall      | w/o act knowledge  | 77 | 74 | 69  | 71 | 81 | 60  | 76 | 85 | 83 | 75.11   |
| Knowledge Recall      | w/o ont knowledge  | 79 | 81 | 78  | 78 | 80 | 77  | 79 | 88 | 83 | 80.33   |
| Knowledge Recall      | w/o knowledge      | 80 | 77 | 76  | 74 | 82 | 70  | 76 | 85 | 84 | 78.22   |
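Since each of the nine categories in the 9 × 100 sample set contains exactly 100 questions, the Overall column in Tables 3 and 5 is consistent with the unweighted mean of the nine per-category accuracies. A minimal sanity check in Python, using the full-model row of Table 5:

```python
# Per-category accuracies for the full CBR2 row (Table 5). Because every
# category contributes 100 samples, Overall is their unweighted mean.
scores = [83, 84, 85, 81, 85, 81, 81, 91, 86]
overall = sum(scores) / len(scores)
print(f"{overall:.2f}")  # 84.11
```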
Table 6. Representative error cases from the validation set.
Case 1: Semantic Grounding Error
Question: Which one of Pennsylvanian cities, with 717 as the local dialing code, has the lowest altitude above sea level?
Gold KoPL: r1: FindAll[] FilterStr[local dialing code | 717] FilterConcept[city of Pennsylvania] SelectAmong[elevation above sea level | smallest]
Pred KoPL: r1: FindAll[] FilterStr[local dialing code | 717] FilterConcept[city of Pennsylvania] SelectAmong[altitude above sea level | smallest]
Gold Answer: Harrisburg
Pred Answer: no
Case 2: Logical Reasoning Error
Question: How many popular musics are influenced by the band which has ISNI 0000 0001 2369 4269?
Gold KoPL: r1: FindAll[] FilterStr[ISNI | 0000 0001 2369 4269] FilterConcept[band] Relate[influenced by | backward] FilterConcept[popular music] Count[]
Pred KoPL: r1: FindAll[] FilterStr[ISNI | 0000 0001 2369 4269] FilterConcept[band] Relate[influenced by | forward] FilterConcept[popular music] Count[]
Gold Answer: 1
Pred Answer: 0
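Both failure modes in Table 6 can be localized automatically by aligning the gold and predicted programs step by step. The following is a minimal sketch, not part of the CBR2 pipeline, that parses the flat KoPL notation used above and reports the first diverging function call; the parse_kopl and first_divergence helpers are illustrative names introduced here.

```python
import re

def parse_kopl(text: str):
    """Parse flat KoPL text such as 'FindAll[] FilterStr[ISNI | 0000 ...]'
    into a list of (function, [args]) steps; the 'r1:' prefix is ignored."""
    steps = []
    for func, raw_args in re.findall(r"(\w+)\[([^\]]*)\]", text):
        args = [a.strip() for a in raw_args.split("|")] if raw_args.strip() else []
        steps.append((func, args))
    return steps

def first_divergence(gold: str, pred: str):
    """Return the index and step pair where the two programs first differ."""
    g, p = parse_kopl(gold), parse_kopl(pred)
    for i, (gs, ps) in enumerate(zip(g, p)):
        if gs != ps:
            return i, gs, ps
    if len(g) != len(p):  # one program is a strict prefix of the other
        return min(len(g), len(p)), None, None
    return None

# Case 2 from Table 6: the predicted program flips the relation direction.
gold = ("r1: FindAll[] FilterStr[ISNI | 0000 0001 2369 4269] FilterConcept[band] "
        "Relate[influenced by | backward] FilterConcept[popular music] Count[]")
pred = gold.replace("backward", "forward")
print(first_divergence(gold, pred))
# -> (3, ('Relate', ['influenced by', 'backward']),
#        ('Relate', ['influenced by', 'forward']))
```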
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
