Enhanced Schema Linking with Large Language Models via Self-Verification and Value Hints
Abstract
1. Introduction
1. We introduce self-verification (SV), an iterative refinement mechanism that validates initial schema linking predictions through explicit verification prompts. The verification process checks for completeness (are all necessary tables and columns included?), connectivity (can the selected tables be joined?), and precision (are there unnecessary elements?).
2. We propose value hints (VHs), a technique that explicitly informs the model about database values mentioned in the query. This addresses a common failure mode where models miss columns needed for WHERE conditions because value matches are not explicitly highlighted.
- SV + VH consistently outperforms compute-matched alternatives (self-consistency) under equivalent inference budgets, confirming that the gains stem from the structured verification design rather than additional compute alone.
- The combination of SV and VH achieves the best results among all evaluated methods under the same model and settings, with an SL F1 of 75.7% on BIRD (Decomp + SV + VH, three rounds) and 80.9% on Spider (SV + VH).
- SV + VH generalizes across model scales (4B, 80B parameters), with larger relative gains on smaller models.
- Multiple verification rounds progressively improve performance, with three rounds yielding a 4.6-point Column F1 improvement over single-round verification on BIRD.
2. Related Work
2.1. Schema Linking for Database Interfaces
2.2. LLM-Based Schema Linking
2.3. Self-Verification in LLMs
3. Methodology
3.1. Problem Formulation
3.2. Baseline Approaches
3.2.1. Base Method
- A system instruction (see Appendix A) describing the schema linking task;
- The serialized database schema with table and column information;
- The natural language question;
- Few-shot demonstrations selected based on question similarity.
- {"tables": ["table1", "table2"],
- "columns": ["table1.col1", "table2.col2"]}
3.2.2. Decomposition Method (Decomp)
3.3. Self-Verification (SV)
1. Completeness: Are all tables needed for JOINs included? Are columns for WHERE conditions and SELECT clauses present?
2. Connectivity: If multiple tables are selected, can they be connected through foreign keys?
3. Precision: Are there any extra tables or columns that are not actually needed?
“Given the database schema, question, and initial prediction, verify and correct the prediction if necessary. Check that all necessary tables and columns are included, tables can be joined, and no unnecessary elements are selected.”
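Under these criteria, the verification loop can be sketched as follows. The `llm` callable, the prompt assembly, and the JSON round-tripping are illustrative assumptions; the paper's exact implementation may differ.

```python
import json


def self_verify(llm, schema: str, question: str, prediction: dict,
                rounds: int = 3) -> dict:
    """Iteratively verify and correct a schema linking prediction.

    `llm` is a hypothetical callable mapping a prompt string to a JSON
    string containing the (possibly corrected) prediction. Each round
    re-checks completeness, connectivity, and precision, mirroring the
    verification prompt quoted above.
    """
    for _ in range(rounds):
        prompt = (
            "Given the database schema, question, and initial prediction, "
            "verify and correct the prediction if necessary. Check that all "
            "necessary tables and columns are included, tables can be "
            "joined, and no unnecessary elements are selected.\n"
            f"Schema: {schema}\nQuestion: {question}\n"
            f"Prediction: {json.dumps(prediction)}"
        )
        prediction = json.loads(llm(prompt))
    return prediction
```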
3.4. Value Hints (VHs)
“The question mentions these values that appear in the database:
- “California” → found in: customers.state
- “2024” → found in: orders.year
This suggests these columns are needed for WHERE conditions.”
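A hint block in this shape could be generated as below. The function name and the plain-ASCII arrow are our own illustrative choices; only the wording of the surrounding sentences follows the example above.

```python
def format_value_hints(matches: dict[str, list[str]]) -> str:
    """Render matched values as a hint block for the prompt.

    `matches` maps a value mentioned in the question to the qualified
    columns in which it was found, e.g. {"California": ["customers.state"]}.
    """
    lines = ["The question mentions these values that appear in the database:"]
    for value, columns in matches.items():
        lines.append(f'- "{value}" -> found in: {", ".join(columns)}')
    lines.append("This suggests these columns are needed for WHERE conditions.")
    return "\n".join(lines)
```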
VH Matching Pipeline
1. Tokenization and Normalization. The question is tokenized and each token is case-folded. A stopword list (common English words and single characters) is applied to filter out non-informative tokens. Multi-token n-grams (up to 6 tokens) are also extracted to handle multi-word entity mentions (e.g., “New York” and “United States”).
2. Candidate Generation. For each column in the database schema, we retrieve a sample of distinct cell values (up to 100 per column). Numeric and date values are normalized to string form. Column names themselves are also included as candidates.
3. Matching Strategy. Each question token (or n-gram) is compared against cell values using a two-stage matching approach:
- Longest Common Subsequence (LCS): We use difflib.SequenceMatcher to compute the LCS ratio between the token and each cell value, identifying pairs with high overlap.
- Fuzzy Matching: We apply rapidfuzz.fuzz.ratio to compute a character-level similarity score. A match is accepted if the similarity exceeds a fixed threshold.
4. Ranking and Selection. For each matched question token, we rank candidate columns by their matching score and select the top-2 column matches. Matches involving stopwords or overly common database values are filtered out to reduce noise.
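The pipeline above can be sketched in Python. The paper's fuzzy stage uses rapidfuzz.fuzz.ratio; to keep this sketch dependency-free we approximate both stages with the standard library's difflib.SequenceMatcher, and the 0.85 threshold, stopword list, and sampling assumptions are illustrative placeholders rather than the authors' settings.

```python
import difflib

# Illustrative stopword list; the paper's list covers common English
# words and single characters.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "what",
             "which", "from", "for", "with"}


def ngrams(tokens: list[str], max_n: int = 6):
    """Yield all 1..max_n token n-grams as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def match_values(question: str, column_values: dict[str, list[str]],
                 threshold: float = 0.85, top_k: int = 2) -> dict[str, list[str]]:
    """Match question n-grams against sampled cell values per column.

    `column_values` maps "table.column" to sampled distinct values
    (up to ~100 per column), already normalized to strings. Returns,
    for each matched n-gram, the top-k columns ranked by similarity.
    """
    tokens = [t for t in question.lower().split()
              if t not in STOPWORDS and len(t) > 1]
    hints: dict[str, list[str]] = {}
    for gram in ngrams(tokens):
        scored = []
        for col, values in column_values.items():
            best = max((difflib.SequenceMatcher(None, gram, v.lower()).ratio()
                        for v in values), default=0.0)
            if best >= threshold:
                scored.append((best, col))
        if scored:
            scored.sort(reverse=True)  # highest-similarity columns first
            hints[gram] = [col for _, col in scored[:top_k]]
    return hints
```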
3.5. Combining Components
- Base: Single-pass generation;
- Decomp: Question decomposition + Base;
- Decomp + SV: Decomposition with verification;
- Decomp + VH: Decomposition with value hints;
- Decomp + SV + VH: Full system with all components;
- SV-only: Verification-only with full schema initialization;
- SV + VH: Verification with value hints, no decomposition;
- VH-only: Value hints without verification.
4. Experimental Setup
4.1. Datasets
4.2. Implementation Details
- Temperature: 0.0 (greedy decoding);
- Top-p: 1.0;
- Maximum tokens: 2048;
- Number of few-shot demonstrations: 3;
- Batch size: 20.
4.3. Evaluation Metrics
- Table level: Precision/recall/F1 for predicted tables;
- Column level: Precision/recall/F1 for predicted columns;
- Schema linking (SL): Combined metric considering both tables and columns, where a prediction is correct only if both the table set and column set exactly match the ground truth.
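A minimal implementation of these definitions (set-based precision/recall/F1, with exact-match scoring at the SL level) might look like the following sketch; the function names are ours.

```python
def prf(pred: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 between predicted and gold element sets."""
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    tp = len(pred & gold)
    p = tp / len(pred)
    r = tp / len(gold)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def sl_correct(pred_tables, gold_tables, pred_cols, gold_cols) -> bool:
    """SL counts a prediction correct only if both the table set and the
    column set exactly match the ground truth."""
    return (set(pred_tables) == set(gold_tables)
            and set(pred_cols) == set(gold_cols))
```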
5. Results
5.1. Main Results
- Decomposition hurts column-level precision. On BIRD, the Decomp method achieves only 63.8% Column F1, compared to 78.2% for the Base method. This 14.4-point drop shows that question decomposition introduces substantial noise, likely due to over-linking across sub-questions.
- Self-verification substantially improves decomposition. Adding SV to Decomp raises Column F1 from 63.8% to 74.2% on BIRD (+10.4 points) and from 67.8% to 74.8% on Spider (+7.0 points). This supports our hypothesis that verification can filter out incorrectly linked elements.
- Non-decomposition methods outperform decomposition. The SV-only and SV + VH methods achieve the best overall performance without using decomposition. On BIRD, SV-only achieves 91.1% Table F1 and 80.4% Column F1, surpassing all decomposition variants.
- Value hints are most effective when combined with verification. The combination SV + VH achieves the best SL F1 on Spider (80.9%), demonstrating synergy between the two components. Value hints provide explicit guidance for WHERE conditions, while verification ensures precision.
5.2. Ablation: Verification Rounds
5.3. Compute-Matched Baselines
5.4. Model Scale Experiments
- SV + VH is effective across model scales. On Qwen3-4B, SV + VH (one round) improves Column F1 by 1.6 points on BIRD and 1.1 points on Spider over the Base method. On the larger Qwen3-Next-80B-MoE, gains are smaller but still consistent: 0.2 points on BIRD and 1.0 point on Spider. This pattern suggests that SV + VH provides complementary value that persists as model capability increases, though stronger models leave less room for improvement.
- Diminishing returns with stronger models. The relative improvement from SV + VH decreases as model capability increases (Qwen3-4B > Qwen3-Next-80B-MoE), consistent with the hypothesis that stronger models already perform implicit verification during generation.
5.5. Analysis
5.5.1. Precision–Recall Trade-Off
5.5.2. Error Analysis
- 42% missing columns for implicit value references;
- 28% missing tables needed for JOINs;
- 30% over-prediction of semantically similar columns.
- 56% over-prediction from decomposition noise;
- 24% incorrect decomposition leading to wrong linking;
- 20% missing elements not covered by any sub-question.
- 38% missing columns for complex implicit references;
- 32% under-pruning of related but unnecessary elements;
- 30% boundary cases with ambiguous relevance.
6. Discussion
6.1. Why Does Self-Verification Work?
- Evaluation is easier than generation. The verification task asks the model to judge whether specific elements are necessary, which is cognitively simpler than generating the complete set from scratch. This aligns with findings in other domains showing that LLMs are better evaluators than generators [29].
- Explicit criteria focus attention. The verification prompt provides explicit checking criteria (completeness, connectivity, and precision) that guide the model’s attention to specific failure modes. This structured evaluation helps catch errors that might be missed in open-ended generation. This is consistent with the finding that LLMs require structured external guidance to effectively self-correct [15].
- Multiple perspectives reduce blind spots. Each verification round provides an opportunity to reconsider the prediction from a fresh perspective, catching errors that survived previous rounds.
6.2. When to Use Each Method?
- For high-precision requirements (e.g., minimizing false positives), use SV-only or SV + VH.
- For high-recall requirements (e.g., ensuring all relevant elements are captured), use Decomp + SV + VH with multiple verification rounds.
- When database values are frequently mentioned in queries, always enable value hints.
- For efficiency-critical applications, use the Base method with careful prompt engineering.
6.3. Computational Cost Analysis
6.4. Evaluation Metric Discussion
6.5. Limitations
- Computational overhead. Self-verification requires additional LLM calls, increasing latency and cost. As quantified in Table 5, SV + VH (one round) approximately doubles the inference cost, while SV + VH (three rounds) incurs 4× the LLM calls. For latency-sensitive applications, one-round verification offers the best efficiency–accuracy trade-off.
- Value matching preprocessing. The VH component requires offline preprocessing to identify database value matches. While this is lightweight for the evaluated benchmarks (4.5–35.3 ms per sample), it may become more expensive for very large databases with millions of rows. Strategies such as indexing or sampling could mitigate this for production deployment.
- Model dependence. While our model scale experiments (Table 4) demonstrate that SV + VH generalizes across two models of different sizes (4B and 80B), the optimal number of verification rounds may need to be tuned per model depending on its specific capabilities and error patterns.
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| LLM | Large Language Model |
| SQL | Structured Query Language |
| SV | Self-Verification |
| VHs | Value Hints |
| SL | Schema Linking |
| NL | Natural Language |
| P | Precision |
| R | Recall |
| F1 | F1 Score |
Appendix A. Prompt Templates
References
- Lei, W.; Wang, W.; Ma, Z.; Gan, T.; Lu, W.; Kan, M.Y.; Chua, T.S. Re-examining the role of schema linking in text-to-SQL. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6943–6954. [Google Scholar]
- Wang, Y.; Liu, P.; Yang, X. Linkalign: Scalable schema linking for real-world large-scale multi-database text-to-sql. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025; pp. 977–991. [Google Scholar]
- Kim, H.; So, B.H.; Han, W.S.; Lee, H. Natural language to SQL: Where are we today? Proc. VLDB Endow. 2020, 13, 1737–1750. [Google Scholar] [CrossRef]
- Zhang, W.; Wang, Y.; Song, Y.; Wei, V.J.; Tian, Y.; Qi, Y.; Chan, J.H.; Wong, R.C.W.; Yang, H. Natural language interfaces for tabular data querying and visualization: A survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 6699–6718. [Google Scholar] [CrossRef]
- Katsogiannis-Meimarakis, G.; Mirylenka, K.; Scotton, P.; Fusco, F.; Labbi, A. In-depth Analysis of LLM-based Schema Linking. In Proceedings of the EDBT, Tampere, Finland, 24–27 March 2026; pp. 117–130. [Google Scholar]
- Min, S.; Zhong, V.; Zettlemoyer, L.; Hajishirzi, H. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6097–6109. [Google Scholar]
- Perez, E.; Lewis, P.; Yih, W.t.; Cho, K.; Kiela, D. Unsupervised question decomposition for question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 8864–8880. [Google Scholar]
- Maamari, K.; Abubaker, F.; Jaroslawicz, D.; Mhedhbi, A. The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models. In Proceedings of the NeurIPS 2024 Third Table Representation Learning Workshop, Vancouver, BC, Canada, 14 December 2024. [Google Scholar]
- Gao, D.; Wang, H.; Li, Y.; Sun, X.; Qian, Y.; Ding, B.; Zhou, J. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. Proc. VLDB Endow. 2024, 17, 1132–1145. [Google Scholar] [CrossRef]
- Nan, L.; Zhao, Y.; Zou, W.; Ri, N.; Tae, J.; Zhang, E.; Cohan, A.; Radev, D. Enhancing few-shot text-to-sql capabilities of large language models: A study on prompt design strategies. arXiv 2023, arXiv:2305.12586. [Google Scholar]
- Pourreza, M.; Rafiei, D. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Adv. Neural Inf. Process. Syst. 2023, 36, 36339–36348. [Google Scholar]
- Liu, G.; Tan, Y.; Zhong, R.; Xie, Y.; Zhao, L.; Wang, Q.; Hu, B.; Li, Z. Solid-SQL: Enhanced schema-linking based in-context learning for robust text-to-SQL. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 9793–9803. [Google Scholar]
- Weng, Y.; Zhu, M.; Xia, F.; Li, B.; He, S.; Liu, S.; Sun, B.; Liu, K.; Zhao, J. Large language models are better reasoners with self-verification. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 2550–2575. [Google Scholar]
- Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-refine: Iterative refinement with self-feedback. Adv. Neural Inf. Process. Syst. 2023, 36, 46534–46594. [Google Scholar]
- Huang, J.; Chen, X.; Mishra, S.; Zheng, H.S.; Yu, A.W.; Song, X.; Zhou, D. Large language models cannot self-correct reasoning yet. arXiv 2023, arXiv:2310.01798. [Google Scholar]
- Chen, X.; Lin, M.; Schärli, N.; Zhou, D. Teaching Large Language Models to Self-Debug. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; pp. 8746–8825. [Google Scholar]
- Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3911–3921. [Google Scholar]
- Li, J.; Hui, B.; Qu, G.; Yang, J.; Li, B.; Li, B.; Wang, B.; Qin, B.; Geng, R.; Huo, N.; et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Adv. Neural Inf. Process. Syst. 2023, 36, 42330–42357. [Google Scholar]
- Wang, B.; Shin, R.; Liu, X.; Polozov, O.; Richardson, M. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7567–7578. [Google Scholar]
- Scholak, T.; Schucher, N.; Bahdanau, D. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 9895–9901. [Google Scholar]
- Shi, L.; Tang, Z.; Zhang, N.; Zhang, X.; Yang, Z. A survey on employing large language models for text-to-sql tasks. ACM Comput. Surv. 2025, 58, 1–37. [Google Scholar] [CrossRef]
- Dong, X.; Zhang, C.; Ge, Y.; Mao, Y.; Gao, Y.; Chen, L.; Lin, J.; Lou, D. C3: Zero-shot Text-to-SQL with ChatGPT. arXiv 2023, arXiv:2307.07306. [Google Scholar]
- Ye, J.; Wu, Z.; Feng, J.; Yu, T.; Kong, L. Compositional exemplars for in-context learning. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 39818–39833. [Google Scholar]
- Rubin, O.; Herzig, J.; Berant, J. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 10–15 July 2022; pp. 2655–2671. [Google Scholar]
- Pan, L.; Saxon, M.; Xu, W.; Nathani, D.; Wang, X.; Wang, W.Y. Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies. Trans. Assoc. Comput. Linguist. 2024, 12, 484–506. [Google Scholar] [CrossRef]
- Li, H.; Zhang, J.; Li, C.; Chen, H. Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13067–13075. [Google Scholar]
- Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6894–6910. [Google Scholar]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623. [Google Scholar]
| Method | BIRD Table F1 | BIRD Column F1 | BIRD SL F1 | Spider Table F1 | Spider Column F1 | Spider SL F1 |
|---|---|---|---|---|---|---|
| Base | 89.8 | 78.2 | 70.7 | 86.9 | 81.2 | 78.8 |
| Decomp | 87.8 | 63.8 | 71.0 | 85.2 | 67.8 | 73.3 |
| Decomp + SV | 89.3 | 74.2 | 73.3 | 86.3 | 74.8 | 76.2 |
| Decomp + VH | 86.9 | 67.2 | 72.1 | 84.7 | 68.1 | 73.4 |
| Decomp + SV + VH | 89.2 | 76.1 | 75.3 | 86.5 | 75.5 | 76.9 |
| VH-only | 91.2 | 79.6 | 71.3 | 86.7 | 81.2 | 79.2 |
| SV-only | 91.1 | 80.4 | 73.1 | 87.6 | 82.4 | 80.1 |
| SV + VH | 90.6 | 79.8 | 72.8 | 87.9 | 82.9 | 80.9 |
| SV Rounds | BIRD Table F1 | BIRD Column F1 | BIRD SL F1 | Spider Table F1 | Spider Column F1 | Spider SL F1 |
|---|---|---|---|---|---|---|
| 1-round | 88.1 | 73.0 | 74.7 | 87.1 | 75.6 | 76.3 |
| 2-rounds | 89.6 | 75.5 | 75.4 | 87.4 | 76.7 | 77.8 |
| 3-rounds | 89.7 | 77.6 | 75.7 | 86.7 | 76.8 | 77.2 |
| Method | LLM Calls | BIRD Table F1 | BIRD Col F1 | BIRD SL F1 | Spider Table F1 | Spider Col F1 | Spider SL F1 |
|---|---|---|---|---|---|---|---|
| Base | 1 | 89.8 | 78.2 | 70.7 | 86.9 | 81.2 | 78.8 |
| SC (2 rounds) | 2 | 89.6 | 78.2 | 69.6 | 86.7 | 81.1 | 78.9 |
| SV + VH (1 round) | 2 | 91.9 | 80.3 | 71.5 | 87.4 | 81.8 | 79.3 |
| SC (4 rounds) | 4 | 89.9 | 78.3 | 70.1 | 86.7 | 81.0 | 78.6 |
| SV + VH (3 rounds) | 4 | 91.3 | 79.7 | 71.5 | 87.5 | 82.4 | 79.8 |
| Model | Method | BIRD Table F1 | BIRD Col F1 | BIRD SL F1 | Spider Table F1 | Spider Col F1 | Spider SL F1 |
|---|---|---|---|---|---|---|---|
| Qwen3-4B | Base | 89.9 | 78.3 | 70.7 | 86.6 | 80.9 | 78.7 |
| Qwen3-4B | SV + VH (1r) | 91.4 | 79.9 | 71.8 | 87.4 | 82.0 | 79.7 |
| Qwen3-4B | SV + VH (3r) | 91.5 | 80.0 | 71.4 | 87.2 | 82.2 | 80.3 |
| Qwen3-Next-80B-MoE | Base | 93.2 | 83.8 | 77.5 | 88.8 | 86.0 | 84.9 |
| Qwen3-Next-80B-MoE | SV + VH (1r) | 93.2 | 84.0 | 78.2 | 89.7 | 87.0 | 86.1 |
| Qwen3-Next-80B-MoE | SV + VH (3r) | 93.5 | 84.0 | 77.0 | 89.8 | 87.1 | 85.9 |
| Method | LLM Calls | Avg Latency (s) | Avg Tokens |
|---|---|---|---|
| Base | 1 | 5.5 | 1916 |
| SC (n = 2) | 2 | 0.5 | 1946 |
| SC (n = 4) | 4 | 0.5 | 2005 |
| SV + VH (1r) | 2 | 7.0 | 2982 |
| SV + VH (3r) | 4 | 14.4 | 4938 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Ma, L.; Wei, D.; Li, X.; Wen, F.; Zhang, H. Enhanced Schema Linking with Large Language Models via Self-Verification and Value Hints. Big Data Cogn. Comput. 2026, 10, 104. https://doi.org/10.3390/bdcc10040104

