Article

Enhanced Schema Linking with Large Language Models via Self-Verification and Value Hints

1 National University of Defense Technology, Changsha 410073, China
2 Information Support Force Engineering University, Wuhan 430010, China
3 Hubei Provincial Key Laboratory of Data Intelligence, Wuhan 430010, China
* Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(4), 104; https://doi.org/10.3390/bdcc10040104
Submission received: 9 February 2026 / Revised: 22 March 2026 / Accepted: 24 March 2026 / Published: 31 March 2026

Abstract

Schema linking, the task of identifying relevant database schema elements (tables and columns) for natural language queries, is a critical component in database-driven natural language interfaces. While existing approaches rely on question decomposition to handle complex queries, they often suffer from error propagation and low precision. In this paper, we propose a novel schema linking framework enhanced by self-verification (SV) and value hints (VHs) that significantly improves both precision and recall. Our approach introduces two key components: (1) self-verification (SV), an iterative refinement mechanism that validates and corrects initial predictions through explicit verification prompts, and (2) value hints (VHs), which explicitly guide the model to recognize database values mentioned in queries. We conduct comprehensive experiments on two benchmark datasets, Spider and BIRD, using two language models of 4B and 80B parameters. Our results demonstrate that SV + VH consistently improves performance across datasets, models, and method configurations, outperforming both decomposition-based approaches and compute-matched alternatives such as self-consistency under equivalent inference budgets.

1. Introduction

Schema linking—the process of identifying which database tables and columns are relevant to answering a given natural language question—is a fundamental task in database-driven natural language interfaces [1,2]. As organizations increasingly seek to democratize data access, enabling non-technical users to interact with complex databases through natural language has become critically important [3,4]. Schema linking serves as the bridge between ambiguous human language and structured database schemas.
Schema linking serves multiple critical functions in natural language database interfaces. It provides semantic disambiguation by mapping ambiguous natural language mentions to specific database objects [1]. For instance, when a user asks about “price,” schema linking determines whether this refers to products.unit_price or orders.total_price. Furthermore, schema linking can function as an independent access control layer, identifying potentially sensitive fields before query execution [5]. The task also enables interactive query refinement, where systems can present linking candidates for user confirmation in ambiguous scenarios.
Traditional schema linking approaches often rely on question decomposition, breaking complex queries into simpler sub-questions that can be independently linked to schema elements [6,7]. While decomposition can improve recall by ensuring all query aspects are addressed, it often introduces significant challenges. The decomposition process itself can be error-prone, and errors propagate through subsequent linking steps [5]. Moreover, decomposition-based approaches frequently exhibit low precision, identifying many irrelevant schema elements that can confuse downstream applications [8].
Recent advances in large language models (LLMs) have transformed schema linking research [5,9]. LLMs demonstrate remarkable few-shot learning capabilities, enabling effective schema linking with minimal task-specific training [10,11]. However, even state-of-the-art LLMs make errors in schema linking, particularly for complex queries involving multiple tables or implicit value references [8,12].
Self-verification and self-correction mechanisms have emerged as promising techniques for improving LLM outputs [13,14]. These approaches leverage the model’s ability to critically evaluate and refine its own predictions, often achieving significant improvements over single-pass generation. However, the effectiveness of self-correction in LLMs remains debated, with some studies suggesting that LLMs struggle to self-correct without external feedback [15]. While self-verification has been successfully applied to mathematical reasoning [13] and code generation [16], its application to schema linking remains unexplored.
In this paper, we propose an enhanced schema linking framework via self-verification and value hints that addresses the limitations of existing approaches. Our contributions are twofold:
1. We introduce self-verification (SV), an iterative refinement mechanism that validates initial schema linking predictions through explicit verification prompts. The verification process checks for completeness (are all necessary tables and columns included?), connectivity (can selected tables be joined?), and precision (are there unnecessary elements?).
2. We propose value hints (VHs), a technique that explicitly informs the model about database values mentioned in the query. This addresses a common failure mode where models miss columns needed for WHERE conditions because value matches are not explicitly highlighted.
We evaluate our approach on two widely used benchmarks: Spider [17] and BIRD [18]. Using two language models (Qwen3-4B and Qwen3-Next-80B-MoE), our experiments demonstrate that:
  • SV + VH consistently outperforms compute-matched alternatives (self-consistency) under equivalent inference budgets, confirming that the gains stem from the structured verification design rather than additional compute alone.
  • The combination of SV and VH achieves the best results among all evaluated methods under the same model and settings, with an SL F1 of 75.7% on BIRD (Decomp + SV + VH, three rounds) and 80.9% on Spider (SV + VH).
  • SV + VH generalizes across model scales (4B, 80B parameters), with larger relative gains on smaller models.
  • Multiple verification rounds progressively improve performance, with three rounds yielding 4.6% Column F1 improvement over single-round verification on BIRD.

2. Related Work

2.1. Schema Linking for Database Interfaces

Schema linking has evolved significantly with the introduction of cross-domain benchmarks. Spider [17] established the paradigm of evaluating generalization across unseen databases, while BIRD [18] introduced more realistic challenges including dirty data and complex domain knowledge. These benchmarks have driven substantial research into robust schema linking techniques [1,2].
Schema linking approaches can be broadly categorized into embedding-based and generation-based methods. Embedding-based approaches use neural encoders to compute similarity between question tokens and schema elements [19,20]. Generation-based approaches leverage language models to directly predict relevant schema elements [5,9]. Recent work has shown that generation-based approaches with LLMs achieve superior performance, particularly in cross-domain settings [21].
Lei et al. [1] conducted a comprehensive analysis of schema linking errors, identifying value linking (matching query values to database contents) as a major source of failures. Maamari et al. [8] argued that with sufficiently capable reasoning models, explicit schema linking may become less critical, though their experiments focused on very large models. Our work demonstrates that explicit schema linking with self-verification remains valuable even as model capabilities improve.

2.2. LLM-Based Schema Linking

The application of LLMs to schema linking has yielded remarkable improvements. Katsogiannis-Meimarakis et al. [5] introduced a decomposed approach that demonstrated strong performance on complex benchmarks. DAIL-SQL [9] focused on optimizing prompting strategies, demonstrating that example selection and representation significantly impact performance. C3-SQL [22] proposed iterative validation to correct errors. DIN-SQL [11] demonstrated that decomposing the text-to-SQL task into sub-problems with in-context learning significantly improves LLM performance, achieving state-of-the-art results on Spider and BIRD benchmarks.
Few-shot prompting has become the dominant paradigm for LLM-based schema linking [10,11]. Research has shown that demonstration selection based on question similarity improves performance [23,24]. Our work builds on these foundations while introducing verification-based refinement as a complementary technique.

2.3. Self-Verification in LLMs

Self-verification mechanisms have proven effective across various LLM applications. Weng et al. [13] demonstrated that LLMs can verify their mathematical reasoning through backward verification, achieving improvements over self-consistency approaches. Madaan et al. [14] introduced Self-Refine, an iterative refinement approach through self-feedback for code and text generation tasks, demonstrating that even state-of-the-art LLMs like GPT-4 can be further improved at test-time through iterative self-feedback. In contrast, Huang et al. [15] showed that LLMs often struggle to self-correct their reasoning without external feedback, highlighting the importance of structured verification prompts rather than naive self-correction.
In the context of structured prediction, self-correction has been applied to various generation tasks [16,25]. These approaches typically verify output correctness through semantic checks. Our work differs by applying self-verification to the schema linking stage, addressing errors at an early stage of the pipeline.

3. Methodology

3.1. Problem Formulation

Given a natural language question Q, a database schema S = {T_1, T_2, ..., T_n} where each table T_i = {c_{i1}, c_{i2}, ..., c_{im_i}} contains columns, and optionally a set of matched database values V, the schema linking task predicts a minimal relevant set of schema elements for the downstream task. Formally, it outputs a pair (T̂, Ĉ), where T̂ ⊆ S and Ĉ ⊆ ⋃_{T_i ∈ T̂} T_i, such that (i) every schema element required to interpret the semantics of Q is included (coverage) and (ii) no irrelevant elements are included (precision).
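The formulation above can be sketched as simple data structures with a validity check enforcing T̂ ⊆ S and column membership. This is an illustrative sketch, not the paper's implementation; the type and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Schema:
    tables: dict[str, list[str]]  # table name -> list of column names (S)

@dataclass
class LinkingPrediction:
    tables: set[str]   # predicted table subset T_hat
    columns: set[str]  # predicted "table.column" identifiers C_hat

def is_valid(pred: LinkingPrediction, schema: Schema) -> bool:
    """Check T_hat is a subset of S and every predicted column belongs
    to a predicted table (the structural constraints of Section 3.1)."""
    if not pred.tables <= set(schema.tables):
        return False
    for col in pred.columns:
        table, _, name = col.partition(".")
        if table not in pred.tables or name not in schema.tables.get(table, []):
            return False
    return True
```

Coverage and precision, in contrast, are semantic properties relative to the question and are what the evaluation metrics in Section 4.3 measure.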

3.2. Baseline Approaches

We first describe two baseline approaches that form the foundation of our method.

3.2.1. Base Method

The Base method directly prompts an LLM to predict relevant tables and columns given the question and schema. The prompt structure includes:
  • A system instruction (see Appendix A) describing the schema linking task;
  • The serialized database schema with table and column information;
  • The natural language question;
  • Few-shot demonstrations selected based on question similarity.
The model outputs a JSON object containing predicted tables and columns:
{"tables": ["table1", "table2"], "columns": ["table1.col1", "table2.col2"]}
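A minimal sketch of the Base method's prompt assembly and output parsing might look as follows. The function names and prompt layout are illustrative assumptions, not the paper's exact implementation; only the JSON output format is taken from the text above.

```python
import json

def build_base_prompt(system_instruction, schema_text, question, demos):
    """Assemble the Base prompt: instruction, serialized schema,
    few-shot demonstrations, then the target question."""
    parts = [system_instruction, "Schema:\n" + schema_text]
    for demo_q, demo_answer in demos:  # demos: list of (question, answer dict)
        parts.append(f"Question: {demo_q}\nAnswer: {json.dumps(demo_answer)}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

def parse_prediction(raw):
    """Parse the model's JSON output into (tables, columns) sets."""
    obj = json.loads(raw)
    return set(obj["tables"]), set(obj["columns"])
```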

3.2.2. Decomposition Method (Decomp)

Following Katsogiannis-Meimarakis et al. [5], the decomposition method first breaks the question into simpler sub-questions, then performs schema linking for each sub-question, and finally aggregates the results. We adopt this methodology as our decomposition baseline because it demonstrated superior schema linking performance on the BIRD benchmark compared to other decomposition-based approaches [5]. While decomposition can improve recall by ensuring all aspects of complex questions are addressed, it often introduces noise through incorrect decomposition or over-linking.

3.3. Self-Verification (SV)

Our self-verification mechanism introduces a second LLM pass that validates and corrects the initial prediction. The verification prompt (see Appendix A) explicitly asks the model to check:
1. Completeness: Are all tables needed for JOINs included? Are columns for WHERE conditions and SELECT clauses present?
2. Connectivity: If multiple tables are selected, can they be connected through foreign keys?
3. Precision: Are there any extra tables or columns that are not actually needed?
The verification prompt takes the following form:
“Given the database schema, question, and initial prediction, verify and correct the prediction if necessary. Check that all necessary tables and columns are included, tables can be joined, and no unnecessary elements are selected.”
The model outputs a corrected JSON prediction, which may be identical to the initial prediction if no corrections are needed.
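The iterative verification loop can be sketched as follows, assuming the LLM is exposed as a callable from prompt string to JSON string. Names and the prompt template interface are illustrative; the fixed-point early exit reflects the statement that the output may be identical to the input when no corrections are needed.

```python
import json

def self_verify(llm, verify_prompt_tmpl, question, schema_text, prediction, rounds=1):
    """Run up to `rounds` verification passes over an initial prediction.

    `llm` is any callable prompt -> JSON string (a hypothetical interface).
    Stops early when a round returns the prediction unchanged (fixed point).
    """
    for _ in range(rounds):
        prompt = verify_prompt_tmpl.format(
            schema=schema_text, question=question,
            prediction=json.dumps(prediction))
        corrected = json.loads(llm(prompt))
        if corrected == prediction:  # no further corrections proposed
            break
        prediction = corrected
    return prediction
```

The early exit also explains the diminishing returns reported in Section 5.2: once verification converges, extra rounds add cost without changing the output.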

3.4. Value Hints (VHs)

A common failure mode in schema linking occurs when the question mentions specific values that exist in the database, but the model fails to identify the columns containing those values. This is particularly problematic for WHERE conditions.
We address this by preprocessing the question against database contents to identify value matches and then explicitly including these matches in the prompt:
“The question mentions these values that appear in the database:
- “California” → found in: customers.state
- “2024” → found in: orders.year
This suggests these columns are needed for WHERE conditions.”
This explicit guidance helps the model recognize implicit value references that might otherwise be missed.

VH Matching Pipeline

The value-matching pipeline operates as an offline preprocessing step before LLM inference. Our implementation adopts the bridge content encoder from the open-source RESDSQL framework [26] (BSD-3 License). The matching procedure is as follows:
1. Tokenization and Normalization. The question is tokenized and each token is case-folded. A stopword list (common English words and single characters) is applied to filter out non-informative tokens. Multi-token n-grams (up to 6 tokens) are also extracted to handle multi-word entity mentions (e.g., “New York” and “United States”).
2. Candidate Generation. For each column in the database schema, we retrieve a sample of distinct cell values (up to 100 per column). Numeric and date values are normalized to string form. Column names themselves are also included as candidates.
3. Matching Strategy. Each question token (or n-gram) is compared against cell values using a two-stage matching approach:
  • Longest Common Subsequence (LCS): We use difflib.SequenceMatcher to compute the LCS ratio between the token and each cell value, identifying pairs with high overlap.
  • Fuzzy Matching: We apply rapidfuzz.fuzz.ratio to compute a character-level similarity score. A match is accepted if the similarity exceeds a threshold of τ = 0.85.
4. Ranking and Selection. For each matched question token, we rank candidate columns by their matching score and select the top-2 column matches. Matches involving stopwords or overly common database values are filtered out to reduce noise.
The VH preprocessing is computationally lightweight. On the BIRD development set (1534 samples across 95 databases), preprocessing completes in 25.9–54.1 s total (16.9–35.3 ms per sample); on the Spider development set (1034 samples across 20 databases), it completes in 4.6 s total (4.5 ms per sample). This overhead is negligible compared to LLM inference time and is performed only once as an offline step.

3.5. Combining Components

Our framework allows flexible combination of components:
  • Base: Single-pass generation;
  • Decomp: Question decomposition + Base;
  • Decomp + SV: Decomposition with verification;
  • Decomp + VH: Decomposition with value hints;
  • Decomp + SV + VH: Full system with all components;
  • SV-only: Verification-only with full schema initialization;
  • SV + VH: Verification with value hints, no decomposition;
  • VH-only: Value hints without verification.
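The variants above are compositions of independent stages, which can be expressed as simple function chaining. The stage stubs below are purely illustrative placeholders (real stages would wrap LLM calls); only the composition pattern is the point.

```python
def compose(*stages):
    """Chain stages, each mapping a prediction dict to a prediction dict."""
    def pipeline(pred):
        for stage in stages:
            pred = stage(pred)
        return pred
    return pipeline

# Hypothetical stand-ins for the real LLM-backed stages.
base = lambda pred: {"tables": {"orders"}, "columns": {"orders.total_price"}}
verify = lambda pred: {**pred, "columns": pred["columns"] | {"orders.year"}}

variants = {
    "Base": compose(base),
    "SV + VH": compose(base, verify),
}
```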

4. Experimental Setup

4.1. Datasets

We evaluate on two widely used schema linking benchmarks:
Spider [17]: A large-scale cross-domain dataset containing 10,181 questions across 200 databases. Spider features complex queries involving multiple tables, nested queries, and aggregations. We use the development set for evaluation.
BIRD [18]: A more challenging benchmark with 12,751 question–answer pairs across 95 large-scale databases. BIRD emphasizes real-world complexity, including dirty data, domain knowledge requirements, and external evidence. We use the development set for evaluation.

4.2. Implementation Details

We use Qwen3-4B-Instruct as our primary language model, served through vLLM for efficient inference. For model scale experiments, we additionally evaluate on Qwen3-Next-80B-MoE. Key hyperparameters include:
  • Temperature: 0.0 (greedy decoding);
  • Top-p: 1.0;
  • Maximum tokens: 2048;
  • Number of few-shot demonstrations: 3;
  • Batch size: 20.
For demonstration selection, we compute semantic similarity between questions using SimCSE embeddings [27] and select the top-k most similar examples from the training set.
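Demonstration selection reduces to a top-k cosine-similarity search over precomputed question embeddings. A minimal sketch, assuming embeddings (from SimCSE in the paper) are supplied as plain numeric vectors:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_demos(query_emb, train_embs, k=3):
    """Return indices of the k training questions most similar to the query."""
    ranked = sorted(range(len(train_embs)),
                    key=lambda i: cosine(query_emb, train_embs[i]),
                    reverse=True)
    return ranked[:k]
```

In practice this would be done with batched matrix operations over the full training set; the loop form here is only for clarity.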

4.3. Evaluation Metrics

We evaluate schema linking performance using precision, recall, and F1 scores at three levels:
  • Table level: Precision/recall/F1 for predicted tables;
  • Column level: Precision/recall/F1 for predicted columns;
  • Schema linking (SL): Combined metric considering both tables and columns, where a prediction is correct only if both the table set and column set exactly match the ground truth.
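These metrics can be computed directly from set overlap, with the SL metric requiring an exact match of both sets, as the description above specifies. A sketch with illustrative function names:

```python
def prf1(pred, gold):
    """Precision, recall, and F1 between predicted and gold element sets
    (used at both the table level and the column level)."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def sl_exact_match(pred_tables, pred_cols, gold_tables, gold_cols):
    """Strict schema-linking correctness: both the table set and the
    column set must exactly match the ground truth."""
    return (set(pred_tables) == set(gold_tables)
            and set(pred_cols) == set(gold_cols))
```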

5. Results

5.1. Main Results

Table 1 presents the main experimental results on BIRD and Spider benchmarks.
Key Observations:
  • Decomposition hurts column-level precision. On BIRD, the Decomp method achieves only 63.8% Column F1, compared to 78.2% for the Base method. This 14.4% drop demonstrates that question decomposition introduces significant noise, likely due to over-linking across sub-questions.
  • Self-verification substantially improves decomposition. Adding SV to Decomp improves Column F1 from 63.8% to 74.2% on BIRD (+10.4%) and from 67.8% to 74.8% on Spider (+7.0%). This validates our hypothesis that verification can filter out incorrectly linked elements.
  • Non-decomposition methods outperform decomposition. The SV-only and SV + VH methods achieve the best overall performance without using decomposition. On BIRD, SV-only achieves 91.1% Table F1 and 80.4% Column F1, surpassing all decomposition variants.
  • Value hints are most effective when combined with verification. The combination SV + VH achieves the best SL F1 on Spider (80.9%), demonstrating synergy between the two components. Value hints provide explicit guidance for WHERE conditions, while verification ensures precision.
Figure 1 visualizes these results, highlighting the performance differences across methods on both datasets.

5.2. Ablation: Verification Rounds

Table 2 shows the effect of multiple verification rounds.
Multiple verification rounds progressively improve column-level performance. On BIRD, Column F1 improves from 73.0% (one round) to 77.6% (three rounds), a gain of 4.6%. This suggests that each verification round catches additional errors missed in previous rounds. However, the gains diminish after 2–3 rounds, indicating convergence of the verification process. Figure 2 visualizes this progressive improvement.

5.3. Compute-Matched Baselines

A natural concern is whether the gains from SV simply reflect a larger inference budget (more LLM calls) rather than the verification design itself. To address this, we compare SV against self-consistency (SC) [28], a widely used compute-matched alternative that spends the same inference budget by sampling multiple predictions and taking a majority vote. For SC, we set temperature = 0.7 and aggregate predictions by selecting the tables and columns that appear in the majority of samples.
Table 3 presents the results. SC (2 rounds) uses two LLM calls (matching SV + VH, one round) and SC (4 rounds) uses four calls (matching SV + VH, three rounds).
Under matched compute budgets, SV + VH consistently outperforms SC on all metrics across both datasets. SC yields negligible gains over the Base method—on BIRD, SC (4 rounds) achieves only 78.3% Column F1 versus 78.2% for base, while SV + VH (1 round) already reaches 80.3%. This demonstrates that the improvement from SV is attributable to the structured verification design (completeness, connectivity, and precision checks), not merely to additional inference passes.
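The SC baseline's aggregation step can be sketched as a majority vote over element sets. This is an interpretation of the description above ("selecting the tables and columns that appear in the majority of samples"); the exact tie-breaking in the paper's implementation is not specified.

```python
from collections import Counter

def self_consistency(samples):
    """Majority-vote aggregation over sampled (tables, columns) predictions:
    an element is kept if it appears in more than half of the samples."""
    n = len(samples)
    t_counts = Counter(t for tables, _ in samples for t in tables)
    c_counts = Counter(c for _, cols in samples for c in cols)
    keep = lambda counts: {e for e, k in counts.items() if k > n / 2}
    return keep(t_counts), keep(c_counts)
```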

5.4. Model Scale Experiments

To assess the generality of our method across model scales, we evaluate SV + VH on an additional model: Qwen3-Next-80B-MoE (80B MoE, approximately 20B activated parameters). Table 4 presents the results.
Several observations emerge from the model scale experiments:
  • SV + VH is effective across model scales. On Qwen3-4B, SV + VH (one round) improves Column F1 by +1.6% on BIRD and +1.1% on Spider over the Base method. On the larger Qwen3-Next-80B-MoE, gains are smaller but still consistent: +0.2% Column F1 on BIRD and +1.0% on Spider. This pattern suggests that SV + VH provides complementary value that persists as model capability increases, though stronger models leave less room for improvement.
  • Diminishing returns with stronger models. The relative improvement from SV + VH decreases as model capability increases (Qwen3-4B > Qwen3-Next-80B-MoE), consistent with the hypothesis that stronger models already perform implicit verification during generation.

5.5. Analysis

5.5.1. Precision–Recall Trade-Off

Figure 3 illustrates the precision–recall characteristics of different methods through scatter plots, showing the trade-off between precision and recall at both table and column levels.
The Decomp method achieves very high recall (95.9% for tables, 93.5% for columns) but suffers from extremely low column precision (48.4%). This means nearly half of predicted columns are incorrect. In contrast, SV-only maintains high recall while dramatically improving precision.

5.5.2. Error Analysis

We conducted manual error analysis on 50 randomly sampled errors from each method.
Base method errors:
  • 42% missing columns for implicit value references;
  • 28% missing tables needed for JOINs;
  • 30% over-prediction of semantically similar columns.
Decomp method errors:
  • 56% over-prediction from decomposition noise;
  • 24% incorrect decomposition leading to wrong linking;
  • 20% missing elements not covered by any sub-question.
SV-only method errors:
  • 38% missing columns for complex implicit references;
  • 32% under-pruning of related but unnecessary elements;
  • 30% boundary cases with ambiguous relevance.
The analysis reveals that SV-only largely eliminates over-prediction errors while maintaining strong recall, explaining its superior performance. Figure 4 visualizes the error distribution for each method.
Figure 5 provides a comprehensive comparison of methods across all precision and recall metrics, illustrating how SV-only achieves superior balance across all dimensions.

6. Discussion

6.1. Why Does Self-Verification Work?

Our results demonstrate that self-verification substantially improves schema linking, particularly for precision. We hypothesize several reasons:
  • Evaluation is easier than generation. The verification task asks the model to judge whether specific elements are necessary, which is cognitively simpler than generating the complete set from scratch. This aligns with findings in other domains showing that LLMs are better evaluators than generators [29].
  • Explicit criteria focus attention. The verification prompt provides explicit checking criteria (completeness, connectivity, and precision) that guide the model’s attention to specific failure modes. This structured evaluation helps catch errors that might be missed in open-ended generation. This is consistent with the finding that LLMs require structured external guidance to effectively self-correct [15].
  • Multiple perspectives reduce blind spots. Each verification round provides an opportunity to reconsider the prediction from a fresh perspective, catching errors that survived previous rounds.

6.2. When to Use Each Method?

Based on our experimental results, we provide the following recommendations:
  • For high-precision requirements (e.g., minimizing false positives), use SV-only or SV + VH.
  • For high-recall requirements (e.g., ensuring all relevant elements are captured), use Decomp + SV + VH with multiple verification rounds.
  • When database values are frequently mentioned in queries, always enable value hints.
  • For efficiency-critical applications, use the Base method with careful prompt engineering.
Figure 6 summarizes the component effects, showing how each technique contributes to the final performance.

6.3. Computational Cost Analysis

Table 5 quantifies the computational cost of each method variant. All measurements are averaged over the Spider development set using Qwen3-4B served via vLLM on a single GPU.
SV + VH (one round) approximately doubles the token cost and latency compared to the Base method, which is expected given the additional verification pass. SV + VH (three rounds) incurs approximately 2.6× the token cost and latency of the Base method, with cost growing sub-linearly in the number of verification rounds. Note that SC achieves lower per-call latency because it uses parallel sampling, but as shown in Table 3, this budget does not translate to meaningful accuracy gains. For practical deployments, SV + VH (one round) offers the best cost-effectiveness trade-off, achieving most of the performance gain at only 2× the base cost.

6.4. Evaluation Metric Discussion

Our evaluation uses exact-set matching for both tables and columns, where a prediction is correct only if it exactly matches the ground truth set. We acknowledge that this metric can be strict in cases where multiple valid schema linkings exist for a given question—for example, when equivalent join paths exist or when surrogate keys could replace natural keys.
Regarding foreign key join columns, our SV verification prompt explicitly asks the model to include “foreign key columns for JOINs between selected tables.” The ground truth annotations in Spider and BIRD extract schema elements from gold SQL queries, which include JOIN columns. Therefore, our SV mechanism is aligned with the evaluation protocol rather than working against it. We verified that on both benchmarks, JOIN columns are consistently included in ground truth annotations when they appear in the gold SQL.
We note that a relaxed evaluation metric (e.g., crediting predictions that yield equivalent SQL) would likely benefit all methods equally and would not change the relative rankings. Developing such metrics is an important direction for future work in schema linking evaluation.

6.5. Limitations

Our work has several limitations:
  • Computational overhead. Self-verification requires additional LLM calls, increasing latency and cost. As quantified in Table 5, SV + VH (one round) approximately doubles the inference cost, while SV + VH (three rounds) incurs 4× the LLM calls. For latency-sensitive applications, one-round verification offers the best efficiency–accuracy trade-off.
  • Value matching preprocessing. The VH component requires offline preprocessing to identify database value matches. While this is lightweight for the evaluated benchmarks (4.5–35.3 ms per sample), it may become more expensive for very large databases with millions of rows. Strategies such as indexing or sampling could mitigate this for production deployment.
  • Model dependence. While our model scale experiments (Table 4) demonstrate that SV + VH generalizes across two models of different sizes (4B and 80B), the optimal number of verification rounds may need to be tuned per model depending on its specific capabilities and error patterns.

7. Conclusions

We presented an enhanced schema linking framework via self-verification and value hints for database-driven natural language interfaces. Our approach introduces two key components: self-verification (SV) for iterative refinement and value hints (VHs) for explicit value guidance.
Comprehensive experiments on Spider and BIRD benchmarks demonstrate that SV + VH consistently outperforms both decomposition-based approaches and compute-matched alternatives such as self-consistency. Under equivalent inference budgets, the structured verification design of SV yields significantly larger gains than naive multi-sample voting, confirming that the improvements stem from the verification mechanism itself. The Decomp + SV + VH method with three verification rounds achieves the best SL F1 of 75.7% on BIRD, while SV + VH achieves 80.9% on Spider. Model scale experiments across two architectures (Qwen3-4B and Qwen3-Next-80B-MoE) confirm that SV + VH generalizes beyond a single model, with the largest relative gains observed on smaller models.
Our ablation studies reveal that multiple verification rounds progressively improve performance, with three rounds yielding 4.6% Column F1 improvement over single-round verification. This validates the effectiveness of iterative self-verification for structured prediction tasks.
Future work includes extending self-verification to other structured prediction tasks, exploring more efficient verification strategies, and investigating the interaction between schema linking and query understanding.

Author Contributions

Conceptualization, L.M. and F.W.; methodology, L.M. and H.Z.; software, L.M. and D.W.; validation, D.W. and X.L.; formal analysis, D.W. and X.L.; investigation, L.M. and D.W.; resources, F.W. and H.Z.; data curation, D.W. and X.L.; writing—original draft preparation, L.M. and D.W.; writing—review and editing, X.L.; visualization, L.M. and D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Spider and BIRD datasets are publicly available. Spider is available at https://yale-lily.github.io/spider (accessed on 12 November 2024). BIRD is available at https://bird-bench.github.io/ (accessed on 27 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large Language Model
SQL: Structured Query Language
SV: Self-Verification
VHs: Value Hints
SL: Schema Linking
NL: Natural Language
P: Precision
R: Recall
F1: F1 Score

Appendix A. Prompt Templates

Schema Linking Prompt:
You are an expert database assistant. You will be given a Natural Language Question and a relational database schema. Your goal is to identify the schema elements (tables and columns) of the schema that are necessary to translate the given Natural Language Question into a SQL query.
Important: If the question mentions specific values that are found in the database, pay special attention to the columns containing those values—they are likely needed for WHERE conditions.
Create a JSON object in the following format: {"tables": ["table_name", ...], "columns": ["column_name", ...]}
Verification Prompt:
You are an expert database assistant performing verification. You have been given a Natural Language Question, a database schema, and an initial prediction of needed tables and columns.
Your task is to verify and correct the prediction if necessary. Check: 1. Are all tables needed for JOINs included? 2. Are all columns mentioned in WHERE conditions included? 3. Are all columns needed for SELECT (output) included? 4. Are foreign key columns for JOINs between selected tables included? 5. Are there any extra tables/columns that are NOT actually needed? Remove them.
If the prediction is correct, output the same JSON. If corrections are needed, output the corrected JSON.
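The verification prompt is applied iteratively (up to three rounds in our experiments). A minimal sketch of such a loop, where `call_llm` is a hypothetical wrapper for one LLM forward pass returning the parsed, corrected prediction, and the early-stopping rule at a fixed point is an illustrative choice:

```python
def self_verify(question, schema, prediction, call_llm, max_rounds=3):
    """Iteratively ask the model to verify and correct its schema prediction.

    call_llm(prompt) -> dict is a placeholder for one LLM call that returns
    the (parsed) corrected {"tables": ..., "columns": ...} object.
    Stops early once a verification round leaves the prediction unchanged.
    """
    for _ in range(max_rounds):
        prompt = (
            f"Question: {question}\nSchema: {schema}\n"
            f"Initial prediction: {prediction}\n"
            "Verify and correct the prediction if necessary."
        )
        corrected = call_llm(prompt)
        if corrected == prediction:  # fixed point: nothing left to correct
            break
        prediction = corrected
    return prediction
```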

Figure 1. Comparison of schema linking methods on BIRD and Spider benchmarks. The horizontal dashed lines indicate the best performance achieved by SV-only (BIRD) and SV + VH (Spider). Non-decomposition methods (VH-only, SV-only, and SV + VH) consistently outperform decomposition-based approaches.
Figure 2. Effect of verification rounds on schema linking performance. Column F1 and column precision show consistent improvement with additional verification rounds on BIRD, with +4.6% F1 and +6.7% precision gain from 1 to 3 rounds. Spider shows smaller but consistent gains.
Figure 3. Precision–recall scatter plots on BIRD dataset. (a) Table level: All methods achieve high recall, but Decomp sacrifices precision. (b) Column level: Decomp achieves very high recall (93.5%) but suffers from extremely low precision (48.4%). SV-only achieves the best balance with high precision and recall. Dashed lines represent iso-F1 contours.
Figure 4. Error distribution analysis for different methods. (a) Base method: Errors primarily from missing columns for implicit values. (b) Decomposition: Dominated by over-prediction from decomposition noise (56%). (c) SV-only: More balanced error distribution with reduced over-prediction.
Figure 5. Radar chart comparing base, Decomp, and SV-only methods on BIRD across six metrics. SV-only (green) achieves the best overall balance, while Decomp (red) sacrifices precision for recall. The chart clearly shows how decomposition dramatically reduces column precision while only marginally improving recall.
Figure 6. Component contribution analysis showing Column F1 changes on both datasets. Decomposition causes a significant drop in Column F1 (−14.4% on BIRD, −13.4% on Spider), which is partially recovered by adding SV. The SV-only method achieves the best Column F1 by avoiding decomposition entirely.
Table 1. Main experimental results on BIRD and Spider benchmarks. P/R/F1 denotes precision/recall/F1 scores (%). Best results in each column are bolded.

| Method | BIRD Table F1 | BIRD Column F1 | BIRD SL F1 | Spider Table F1 | Spider Column F1 | Spider SL F1 |
|---|---|---|---|---|---|---|
| Base | 89.8 | 78.2 | 70.7 | 86.9 | 81.2 | 78.8 |
| Decomp | 87.8 | 63.8 | 71.0 | 85.2 | 67.8 | 73.3 |
| Decomp + SV | 89.3 | 74.2 | 73.3 | 86.3 | 74.8 | 76.2 |
| Decomp + VH | 86.9 | 67.2 | 72.1 | 84.7 | 68.1 | 73.4 |
| Decomp + SV + VH | 89.2 | 76.1 | **75.3** | 86.5 | 75.5 | 76.9 |
| VH-only | **91.2** | 79.6 | 71.3 | 86.7 | 81.2 | 79.2 |
| SV-only | 91.1 | **80.4** | 73.1 | 87.6 | 82.4 | 80.1 |
| SV + VH | 90.6 | 79.8 | 72.8 | **87.9** | **82.9** | **80.9** |
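The P/R/F1 scores reported above are set-based: the predicted and gold schema elements are compared as sets of (normalized) names. A short restatement of the standard definitions, written as a sketch rather than the paper's evaluation code:

```python
def prf1(predicted, gold):
    """Set-based precision, recall, and F1 for schema linking.

    predicted / gold are collections of normalized schema element names
    (table names, or table.column names).
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # correctly predicted elements
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Over-predicting columns (as decomposition tends to do) raises recall but drives precision, and hence F1, down.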
Table 2. Ablation study on the number of self-verification rounds (Decomp + SV + VH). Results show F1 scores (%) on BIRD and Spider.

| SV Rounds | BIRD Table F1 | BIRD Column F1 | BIRD SL F1 | Spider Table F1 | Spider Column F1 | Spider SL F1 |
|---|---|---|---|---|---|---|
| 1 round | 88.1 | 73.0 | 74.7 | 87.1 | 75.6 | 76.3 |
| 2 rounds | 89.6 | 75.5 | 75.4 | 87.4 | 76.7 | 77.8 |
| 3 rounds | 89.7 | 77.6 | 75.7 | 86.7 | 76.8 | 77.2 |
Table 3. Compute-matched comparison between SV + VH and self-consistency (SC). All methods use Qwen3-4B. LLM calls denotes the number of forward passes per query. SV + VH consistently outperforms SC under matched compute budgets.

| Method | LLM Calls | BIRD Table F1 | BIRD Col F1 | BIRD SL F1 | Spider Table F1 | Spider Col F1 | Spider SL F1 |
|---|---|---|---|---|---|---|---|
| Base | 1 | 89.8 | 78.2 | 70.7 | 86.9 | 81.2 | 78.8 |
| SC (2 rounds) | 2 | 89.6 | 78.2 | 69.6 | 86.7 | 81.1 | 78.9 |
| SV + VH (1 round) | 2 | 91.9 | 80.3 | 71.5 | 87.4 | 81.8 | 79.3 |
| SC (4 rounds) | 4 | 89.9 | 78.3 | 70.1 | 86.7 | 81.0 | 78.6 |
| SV + VH (3 rounds) | 4 | 91.3 | 79.7 | 71.5 | 87.5 | 82.4 | 79.8 |
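The self-consistency baseline has to aggregate the n sampled predictions into one. A plausible per-element majority-voting sketch (the exact aggregation rule is not restated here, so treat this as illustrative rather than the baseline's definitive implementation):

```python
from collections import Counter

def majority_vote(predictions, threshold=0.5):
    """Aggregate n sampled schema predictions by per-element voting.

    predictions is a list of sets of schema element names, one set per
    sampled round; an element is kept if it appears in more than
    `threshold` of the rounds.
    """
    n = len(predictions)
    counts = Counter(elem for pred in predictions for elem in set(pred))
    return {elem for elem, c in counts.items() if c / n > threshold}
```

Unlike self-verification, this voting step cannot add an element that no sampled round proposed, which is one way to read why SC fails to recover missing columns in Table 3.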
Table 4. Effect of SV + VH across different model scales. SV + VH improves performance on both small and large models, though the gains are more pronounced on smaller models. Note: Minor differences in Base method results compared to Table 1 are due to different random seeds used in these experiments.

| Model | Method | BIRD Table F1 | BIRD Col F1 | BIRD SL F1 | Spider Table F1 | Spider Col F1 | Spider SL F1 |
|---|---|---|---|---|---|---|---|
| Qwen3-4B | Base | 89.9 | 78.3 | 70.7 | 86.6 | 80.9 | 78.7 |
| Qwen3-4B | SV + VH (1r) | 91.4 | 79.9 | 71.8 | 87.4 | 82.0 | 79.7 |
| Qwen3-4B | SV + VH (3r) | 91.5 | 80.0 | 71.4 | 87.2 | 82.2 | 80.3 |
| Qwen3-Next-80B-MoE | Base | 93.2 | 83.8 | 77.5 | 88.8 | 86.0 | 84.9 |
| Qwen3-Next-80B-MoE | SV + VH (1r) | 93.2 | 84.0 | 78.2 | 89.7 | 87.0 | 86.1 |
| Qwen3-Next-80B-MoE | SV + VH (3r) | 93.5 | 84.0 | 77.0 | 89.8 | 87.1 | 85.9 |
Table 5. Computational cost analysis per query on the Spider development set. Avg Latency includes all LLM calls for that method. VH preprocessing cost is excluded as it is a one-time offline step (see VH Matching Pipeline Section). Tokens = total input + output tokens.

| Method | LLM Calls | Avg Latency (s) | Avg Tokens |
|---|---|---|---|
| Base | 1 | 5.5 | 1916 |
| SC (n = 2) | 2 | 0.5 | 1946 |
| SC (n = 4) | 4 | 0.5 | 2005 |
| SV + VH (1r) | 2 | 7.0 | 2982 |
| SV + VH (3r) | 4 | 14.4 | 4938 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ma, L.; Wei, D.; Li, X.; Wen, F.; Zhang, H. Enhanced Schema Linking with Large Language Models via Self-Verification and Value Hints. Big Data Cogn. Comput. 2026, 10, 104. https://doi.org/10.3390/bdcc10040104
