SQL Statement Generation Enhanced Through the Fusion of Large Language Models and Knowledge Graphs
Abstract
1. Introduction
2. Methodology
2.1. Architecture Design
2.2. Design of Input Module
2.3. SQL-KG-Verifier
- 1. Information Extraction and Matching;
- 2. Rule Checking and Reasoning;
- 3. Iterative Error Correction and Systemic Security Assurance.
- (1) Information Extraction and Matching. The first step extracts the key elements of the input SQL query, including table names, field names, and relational structures, as well as those referenced in the user query. These extracted components are then matched against entity nodes in the knowledge graph, primarily for entity disambiguation. The verifier searches both the knowledge graph and the vector database to confirm that the corresponding entities, such as tables and fields, exist in the knowledge graph, ensuring that the generated output is consistent with the factual knowledge encoded in the graph. Finally, the validated elements, including table names, field names, relationships, and entities, are passed to the subsequent rule-checking and reasoning module; a minimal code sketch of this step is given after this list.
- (2) Rule Checking and Reasoning. Based on the structural information encoded in the knowledge graph, the verifier implements a multi-level rule-checking and reasoning mechanism with three components. (1) Schema constraint validation: verifies that referenced tables and fields exist in the database schema, checks data-type compatibility between compared fields, and validates primary-foreign-key relationships. (2) Semantic consistency rules: ensures that JOIN conditions connect semantically related entities, validates that WHERE-clause predicates are logically coherent, and checks for redundant or contradictory conditions. (3) Inference based on graphs and large language models: the verifier explores the information of neighboring nodes in the graph and uses the large language model to summarize that neighborhood in order to infer potential implicit relationships between entities. By combining the graph's connectivity with the semantic summarization capability of the large language model, this method identifies potentially missing JOIN conditions and, when direct relationships are unavailable, suggests alternative query paths. Through these mechanisms, the verifier identifies conflicts between the input data and the encoded rules, thereby validating the consistency of the output with the factual knowledge stored in the graph. The reasoning capabilities of the knowledge graph are also leveraged to locate the relevant nodes and relational information that support verification. The detailed hard-matching procedure is illustrated in Table 1, and a combined sketch of this check and the correction loop of step (3) follows this list.
- (3) Iterative Error Correction and Systemic Security Assurance. By explicitly strengthening connections through three mechanisms (mapping relational table structures to entity nodes, aligning foreign-key constraints with graph edges, and enriching field semantics via attribute annotations), the framework establishes a rigid structural alignment that serves a dual purpose. First, it enables a closed-loop refinement process: when discrepancies arise, the verifier isolates the mismatched information and feeds specific error details back to the LLM for iterative optimization. Second, beyond accuracy, this architecture acts as a security layer that cross-references the generated SQL against KG-encoded constraints to identify anomalies, such as hallucinations or adversarial inputs, and to preemptively mitigate risks like SQL injection or unauthorized access. Together, these mechanisms provide a tri-fold integrity guarantee (syntactic executability, semantic fidelity, and schema adherence), supporting the trustworthiness required for deploying LLM systems in complex production environments.
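The following is a minimal sketch of step (1), the extraction-and-matching stage referenced above. It assumes a Neo4j knowledge graph with the `table`/`field` node labels and `relation` edges used in Appendix A, and substitutes a deliberately simplified regex extractor for a full SQL parser; connection details and example names are illustrative, not taken from the paper.

```python
import re

from neo4j import GraphDatabase  # official Neo4j Python driver


def extract_elements(sql: str) -> dict:
    """Simplified extraction of table and column names from a SQL string.
    A production verifier would use a real SQL parser instead of regexes."""
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", sql, re.IGNORECASE)
    qualified = re.findall(r"\b[A-Za-z_]\w*\.([A-Za-z_]\w*)", sql)  # alias.column pairs
    return {"tables": set(tables), "columns": set(qualified)}


def match_against_kg(driver, elements: dict) -> list:
    """Return the extracted elements that cannot be matched to KG nodes."""
    missing = []
    with driver.session() as session:
        for tbl in elements["tables"]:
            found = session.run(
                "MATCH (t:table {name: $name}) RETURN count(t) > 0 AS found",
                name=tbl,
            ).single()["found"]
            if not found:
                missing.append(("table", tbl))
        for col in elements["columns"]:
            found = session.run(
                "MATCH (:table)-[:relation]->(f:field {name: $name}) "
                "RETURN count(f) > 0 AS found",
                name=col,
            ).single()["found"]
            if not found:
                missing.append(("field", col))
    return missing


if __name__ == "__main__":
    # Placeholder connection details and an example generated SQL statement.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    sql = "SELECT o.amount FROM orders o JOIN users u ON o.user_id = u.id"
    print(match_against_kg(driver, extract_elements(sql)))
```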
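Steps (2) and (3) can be sketched together as a bounded verify-and-regenerate loop: a KG path check mirroring the multi-hop queries in Appendix A, whose errors are fed back to the generator. `generate_sql` and `extract_tables` are hypothetical stand-ins for the LLM generator and the extractor above, and the retry budget is an assumption rather than a value from the paper.

```python
from typing import Callable, List

MAX_ROUNDS = 3  # assumed retry budget; not specified in the paper


def check_join_paths(session, tables: List[str]) -> List[str]:
    """Flag table pairs joined in the SQL that have no connecting path of
    length 1-5 in the knowledge graph (cf. the multi-hop queries in Appendix A)."""
    errors = []
    for i, t1 in enumerate(tables):
        for t2 in tables[i + 1:]:
            paths = session.run(
                "MATCH p = (a:table {name: $t1})-[*1..5]-(b:table {name: $t2}) "
                "RETURN count(p) AS paths",
                t1=t1, t2=t2,
            ).single()["paths"]
            if paths == 0:
                errors.append(f"No PK/FK path between '{t1}' and '{t2}' in the schema graph")
    return errors


def verify_and_refine(session, question: str,
                      generate_sql: Callable[[str, str], str],
                      extract_tables: Callable[[str], List[str]]) -> str:
    """Closed-loop refinement: feed rule-check errors back to the generator."""
    feedback = ""
    sql = generate_sql(question, feedback)
    for _ in range(MAX_ROUNDS):
        errors = check_join_paths(session, extract_tables(sql))
        if not errors:
            return sql  # all checked KG constraints are satisfied
        feedback = "Fix the following issues: " + "; ".join(errors)
        sql = generate_sql(question, feedback)  # iterative optimization
    return sql  # best effort after MAX_ROUNDS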
2.4. Figures, Tables and Schemes
2.4.1. Conceptual Layer
2.4.2. Data Layer
2.5. Knowledge Graph Retrieval
2.5.1. Agent-Based Retrieval
- Syntax Error Detection: The generated SQL statement is parsed using an SQL syntax analyzer to identify structural errors, such as missing keywords, mismatched parentheses, or incorrect clause order.
- Semantic Deviation Assessment: The system compares the semantic intent of the SQL statement generated by the large language model with the original user query using cosine similarity between their embedding representations. A similarity score below a predefined threshold (e.g., 0.75) indicates a potential semantic deviation (a sketch of this check follows this list).
- Schema Compliance Check: The SQL statement is validated against the database schema to ensure that all referenced tables, columns, and relationships exist and are used correctly.
- Execution Feasibility Test: A simulated execution plan is generated to identify potential runtime issues, such as ambiguous column references or type mismatches.
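As a concrete illustration of the semantic-deviation check above, the sketch below embeds the user question and the generated SQL and compares them by cosine similarity against the 0.75 threshold. The sentence-transformers encoder is an assumed choice, since the paper does not name an embedding model; the syntax and execution-feasibility checks would normally be delegated to the target database's own parser and EXPLAIN plan.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not named in the paper
SIMILARITY_THRESHOLD = 0.75  # threshold quoted in the text


def semantic_deviation(user_query: str, generated_sql: str) -> bool:
    """Return True when the generated SQL appears to drift from the user's intent."""
    q_vec, s_vec = _model.encode([user_query, generated_sql])
    cosine = float(np.dot(q_vec, s_vec) /
                   (np.linalg.norm(q_vec) * np.linalg.norm(s_vec)))
    return cosine < SIMILARITY_THRESHOLD


if __name__ == "__main__":
    print(semantic_deviation(
        "How many orders were placed in 2023?",
        "SELECT COUNT(*) FROM orders WHERE YEAR(order_date) = 2023",
    ))
```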
2.5.2. Entity Validation
2.5.3. Relationship Validation
2.5.4. Computational Cost Analysis
3. Experiments and Analysis of Results
3.1. Experimental Parameters and Performance Metrics
3.2. Experiment
3.2.1. Comparative Experiment
3.2.2. Ablation Experiment
- Think-on-Graph (ToG) explicitly models multi-step reasoning paths, enabling effective acquisition of distributed structural information and ensuring schema consistency throughout SQL generation;
- Vector-based retrieval filtering, applied prior to hard KG matching, introduces an initial semantic-level screening step by computing semantic similarity between the query and candidate facts, thereby aligning linguistic expressions and mitigating semantic drift in subsequent generation stages (see the sketch after this list);
- SQL-KG-Verifier (KG) fuses the LLM’s semantic reasoning with the KG’s logical constraints to perform dual verification—assessing both grammatical correctness and schema-level consistency—and provides iterative feedback for refinement, substantially enhancing result accuracy and robustness.
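For the vector-based retrieval filtering described above, a minimal sketch might look as follows: candidate KG facts are ranked by cosine similarity to the question and only the top-k survive into the hard KG matching stage. The encoder and the value of k are illustrative assumptions.

```python
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder


def prefilter_facts(question: str, facts: List[str], k: int = 10) -> List[str]:
    """Keep only the k candidate facts most similar to the question."""
    vectors = _model.encode([question] + facts)
    q_vec, fact_vecs = vectors[0], vectors[1:]
    sims = fact_vecs @ q_vec / (
        np.linalg.norm(fact_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [facts[i] for i in top]


if __name__ == "__main__":
    candidates = [
        "orders --Foreign Key--> users.id",
        "products --Primary Key--> product_id",
        "orders --Primary Key--> order_id",
    ]
    print(prefilter_facts("How many orders did each user place?", candidates, k=2))
```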
4. Conclusions and Future Work
- Dynamic Knowledge Graph Construction and Adaptive Integration: Explore lightweight, incremental knowledge graph construction mechanisms that leverage real-time updates from database metadata, enhancing the model’s adaptability to dynamic environments;
- Enhanced Controllability of the Generation Process: Introduce reinforcement learning rewards and supervised fine-tuning strategies, combined with prompt engineering optimization, to constrain model outputs to adhere to business semantics and SQL syntax, thereby reducing redundancy and logical errors [10];
- Cross-Domain Generalization and Zero-Shot Transfer: Improve the model’s ability to handle heterogeneous database structures through graph alignment, schema abstraction, and meta-learning strategies, mitigating the current strong dependency on training-domain knowledge [13].
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
| Example | Prompt Text |
|---|---|
| P1 | ### System Instruction You are an expert in SQL logic analysis. Your task is to interpret the user’s natural language query and convert it into a professional Text2SQL task description. ### Requirements 1. **Analyze Intention**: Clearly state what data the user is looking for and the business logic behind it. 2. **Deconstruct Logic**: Break down the request into SQL-related concepts (Select, Filter, Join, Group, Sort) without writing the actual code. 3. **Handle Ambiguity**: Highlight any implicit assumptions made during the analysis. ### Output Style Use a structured bullet-point format to present the analysis. Ensure the terminology is technically accurate (e.g., use terms like “Predicate,” “Aggregation,” “Foreign Key”). |
| P2 | **Role:** You are a Senior Database Engineer and an expert in SQL generation. Your specific expertise lies in **[Insert Dialect, e.g., MySQL 8.0/PostgreSQL]**. **Task:** Generate a precise and executable SQL query based on the provided “Table Schema” to answer the “User Request”. **Constraints & Rules:** 1. **Output Only:** Output **strictly** the SQL code block. Do not add any explanations, notes, or markdown text outside the code block. 2. **Syntax:** Ensure the SQL is syntactically correct for the specified dialect. 3. **Aliases:** Use explicit table aliases (e.g., ‘t1’, ‘t2’ or ‘users’, ‘orders’) for clarity, especially when performing JOINs. 4. **No Hallucination:** Only use columns and tables defined in the schema. If the request cannot be answered with the given schema, output a SQL comment: ‘-- Error: Insufficient schema context’. **Data Context:** **Table Schema:** {SQL related table structure} **User Request:** {Input Data/Natural Language Question} **Output format:** ‘‘‘sql -- Your SQL Query here |
| P3 | # Role You are an advanced SQL Parsing and Metadata Extraction Engine. You specialize in static code analysis to map data lineage. # Objective Analyze the provided SQL statement to extract two key types of information: 1. **Field Inventory:** A comprehensive list of all fields used per table. 2. **Relationship Mapping:** A structural representation of how tables are connected (Joins/Links). # Analysis Guidelines ### 1. Entity Extraction (Fields) * **Scope:** Extract columns from ALL clauses including ‘SELECT’, ‘FROM’, ‘WHERE’, ‘GROUP BY’, ‘ORDER BY’, ‘HAVING’, and ‘JOIN ON’. * **Table Resolution:** Identify the full table name. If a schema is provided (e.g., ‘sales.orders’), preserve it. * **Alias Handling (CRITICAL):** * You must **STRIP** all table aliases from column names. * *Example:* If the SQL is ‘SELECT t1.user_id’, extract ‘user_id’. * *Example:* If the SQL is ‘WHERE o.amount > 100’, extract ‘amount’. * **Calculations:** For derived fields (e.g., ‘COUNT(distinct user_id)’), extract only the underlying column name (‘user_id’). ### 2. Relationship Extraction (Topology) * Identify connections defined in ‘JOIN … ON’ clauses and implicit joins in ‘WHERE’ clauses. * Capture the **Source** (Left) and **Target** (Right) components of the relationship. * Determine the join type (INNER, LEFT, RIGHT, FULL, CROSS). # Output Format Return the result strictly as a **JSON object**. Do not include markdown formatting (like ‘‘‘json … ’’’) or explanatory text. Use the following schema: { “entities”: [ { “table_name”: “Name of the table”, “fields”: [ “List”, “of”, “distinct”, “columns”, “used” ] } ], “relationships”: [ { “join_type”: “INNER/LEFT/RIGHT/etc”, “source_table”: “Table name”, “source_field”: “Column used in join”, “target_table”: “Table name”, “target_field”: “Column used in join” } ] } # Input SQL “““ [INSERT YOUR SQL HERE] “““ |
References
- Mansuri, I.R.; Sarawagi, S. Integrating unstructured data into relational databases. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06), Atlanta, GA, USA, 3–7 April 2006; IEEE: New York, NY, USA, 2006; p. 29. [Google Scholar]
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Hogan, A.; Blomqvist, E.; Cochez, M.; d'Amato, C.; de Melo, G.; Gutierrez, C.; Labra Gayo, J.E.; Kirrane, S.; Neumaier, S.; Polleres, A.; et al. Knowledge graphs. ACM Comput. Surv. (CSUR) 2021, 54, 1–37. [Google Scholar] [CrossRef]
- Seranmadevi, R.; Addula, S.R.; Kumar, D.; Tyagi, A. Security and Privacy in AI: IoT-Enabled Banking and Finance Services. In Monetary Dynamics and Socio-Economic Development in Emerging Economies; IGI Global: Hershey, PA, USA, 2026; pp. 163–194. [Google Scholar]
- Peng, B.; Li, C.; He, P.; Galley, M.; Gao, J. Instruction Tuning with GPT-4. arXiv 2023, arXiv:2304.03277. [Google Scholar] [CrossRef]
- Hong, Z.; Yuan, Z.; Chen, H.; Zhang, Q.; Huang, F.; Huang, X. Knowledge-to-SQL: Enhancing SQL Generation with Data Expert LLM. In Proceedings of the Findings of the 62nd Annual Meeting of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics (ACL): Kerrville, TX, USA, 2024; pp. 10997–11008. [Google Scholar]
- Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv 2018, arXiv:1809.08887. [Google Scholar]
- Li, J.; Hui, B.; Qu, G.; Yang, J.; Li, B.; Li, B.; Wang, B.; Qin, B.; Geng, R.; Huo, N.; et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Adv. Neural Inf. Process. Syst. 2023, 36, 42330–42357. [Google Scholar]
- Yih, W.T.; Richardson, M.; Meek, C.; Chang, M.W.; Suh, J. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016; pp. 201–206. [Google Scholar]
- Perevalov, A.; Diefenbach, D.; Usbeck, R.; Both, A. Qald-9-plus: A multilingual dataset for question answering over dbpedia and wikidata translated by native speakers. In Proceedings of the 2022 IEEE 16th International Conference on Semantic Computing (ICSC), Virtual, 26–28 January 2022; pp. 229–234. [Google Scholar]
- Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2369–2380. [Google Scholar]
- Chaudhuri, S.; Narasayya, V.; Ramamurthy, R. Estimating progress of execution for SQL queries. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004; pp. 803–814. [Google Scholar]
- Wang, P.; Sun, B.; Dong, X.; Dai, Y.; Yuan, H.; Chu, M.; Gao, Y.; Qi, X.; Zhang, P.; Yan, Y. Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling. arXiv 2025, arXiv:2509.24403. [Google Scholar]
- Dönder, Y.D.; Hommel, D.; Wen-Yi, A.W.; Mimno, D.; Jo, U.E.S. Cheaper, Better, Faster, Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning. arXiv 2025, arXiv:2505.14174. [Google Scholar]
- Zhang, T.; Chen, C.; Liao, C.; Wang, J.; Zhao, X.; Yu, H.; Wang, J.; Li, J.; Shi, W. Sqlfuse: Enhancing text-to-sql performance through comprehensive llm synergy. arXiv 2024, arXiv:2407.14568. [Google Scholar]
- Pourreza, M.; Rafiei, D. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Adv. Neural Inf. Process. Syst. 2023, 36, 36339–36348. [Google Scholar]
- Gao, D.; Wang, H.; Li, Y.; Sun, X.; Qian, Y.; Ding, B.; Zhou, J. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv 2023, arXiv:2308.15363. [Google Scholar] [CrossRef]
- Qu, G.; Li, J.; Li, B.; Qin, B.; Huo, N.; Ma, C.; Cheng, R. Before generation, align it! A novel and effective strategy for mitigating hallucinations in text-to-sql generation. arXiv 2024, arXiv:2405.15307. [Google Scholar]
- Toyama, M. SuperSQL: An extended SQL for database publishing and presentation. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA, 1–4 June 1998; pp. 584–586. [Google Scholar]
- Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.-S.; Li, Q. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 6491–6501. [Google Scholar]
- Sun, J.; Xu, C.; Tang, L.; Wang, S.; Lin, C.; Gong, Y.; Ni, L.; Shum, H.-Y.; Guo, J. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]




| Relationship Type | Cypher Code |
|---|---|
| Node existence check | MATCH (n) WHERE exists(n.name) RETURN DISTINCT n.name AS nodeName |
| Single-hop relationship (primary key) | MATCH (t:table)-[r:relation]->(f:field) WHERE r.type = 'PK' AND t.name = 'table' AND f.name = 'field' RETURN t, f |
| Single-hop relationship | MATCH (t:table)-[r:relation]->(f:field) WHERE t.name = 'table' AND f.name = 'field' RETURN COUNT(*) > 0 |
| Multi-hop paths across tables | MATCH (t1:table)-[r*2..5]->(t2:table) WHERE ALL(k IN ['Pri_Key', 'For_Key'] WHERE ANY(rel IN r WHERE type(rel) = k)) RETURN t1, t2 |
| Multi-hop path of field relationships | MATCH (f1:field1)-[r*2..6]->(f2:field2) WHERE f1.name = 'field1' AND f2.name = 'field2' RETURN f1, f2 |
| Field type relationship | MATCH (f1:field)<-[:relation]-(t1:table), (f2:field)<-[:relation]-(t2:table) WHERE f1.name = f2.name AND f1.type = f2.type RETURN f1, f2 |
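For reference, the checks in the table above can be issued programmatically through the official Neo4j Python driver. The sketch below wraps the single-hop primary-key query from the table; the connection settings and the example table/field names are placeholders.

```python
from neo4j import GraphDatabase

# Placeholder connection settings.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))


def has_primary_key_relation(table: str, field: str) -> bool:
    """Single-hop primary-key check, mirroring the second row of the table above."""
    query = (
        "MATCH (t:table)-[r:relation]->(f:field) "
        "WHERE r.type = 'PK' AND t.name = $table AND f.name = $field "
        "RETURN count(*) > 0 AS found"
    )
    with driver.session() as session:
        return session.run(query, table=table, field=field).single()["found"]


print(has_primary_key_relation("orders", "order_id"))  # hypothetical table/field names
```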
| Entity | Entity Attributes |
|---|---|
| Table name | Primary key, foreign key |
| Field name | Type, Default value, Notes |
| Serial Number | Subject | Verb | Object |
|---|---|---|---|
| 1 | Table | Primary Key | Field |
| 2 | Table | Foreign Key | Field |
| Information Category | Description |
|---|---|
| Table Name | The logical name of the table, used to identify the data entity. |
| Field Name | The physical name of each column in the table. |
| Field Type | The data type of the field, such as INT, VARCHAR. |
| Field Description | Comments or business meaning of the field. |
| Default Value | The default value of the field when no value is assigned. |
| Field Key Type | Whether the field is a primary key or a foreign key. |
| Data Cleaning Rule Category | Description | Processing Method |
|---|---|---|
| Outlier Handling | Inconsistent or illogical entity, attribute, or relationship data. | Identify and remove outliers using a model. |
| Duplicate Data Handling | Multiple records with duplicate foreign keys. | Identify and remove duplicate records based on matching foreign keys. |
| Format Standardization | Inconsistent data formats. | Standardize the data format to a consistent entity-relationship-entity structure. |
| Table Structure Relationships | Knowledge Graph Relationships |
|---|---|
| Primary Key | Primary Key Identifier |
| Foreign Key | Foreign Key Identifier |
| Column | Basic Relationship |
| Table Structure | Knowledge Graph |
|---|---|
| Table | Entity |
| Field | Entity |
| Foreign Key | Relationship, Attribute |
| Primary Key | Relationship, Attribute |
| Table Comment | Attribute |
| Field Type | Attribute |
| Field Default Value | Attribute |
| Field Comment | Attribute |
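A minimal sketch of this data-layer mapping, assuming a Neo4j backend and the `relation` edge convention of Appendix A (with `Pri_Key`/`For_Key` type attributes), is shown below. The `schema` dictionary and the connection details are illustrative, not taken from the paper; in practice the schema would be read from the database metadata.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical relational schema; in practice this is read from INFORMATION_SCHEMA.
schema = {
    "orders": {
        "fields": {"order_id": "INT", "user_id": "INT", "amount": "DECIMAL"},
        "primary_key": "order_id",
        "foreign_keys": {"user_id": "users"},
    },
}


def build_graph(tx, schema):
    """Map tables/fields to entities and key constraints to typed relations."""
    for table, meta in schema.items():
        tx.run("MERGE (:table {name: $t})", t=table)
        for field, field_type in meta["fields"].items():
            tx.run(
                "MERGE (f:field {name: $f}) SET f.type = $ty "
                "WITH f MATCH (t:table {name: $t}) "
                "MERGE (t)-[:relation {type: 'Column'}]->(f)",
                f=field, ty=field_type, t=table,
            )
        tx.run(
            "MATCH (t:table {name: $t}), (f:field {name: $f}) "
            "MERGE (t)-[:relation {type: 'Pri_Key'}]->(f)",
            t=table, f=meta["primary_key"],
        )
        for field, _ref_table in meta["foreign_keys"].items():
            # The referenced table could additionally be linked here.
            tx.run(
                "MATCH (t:table {name: $t}), (f:field {name: $f}) "
                "MERGE (t)-[:relation {type: 'For_Key'}]->(f)",
                t=table, f=field,
            )


with driver.session() as session:
    session.execute_write(build_graph, schema)
```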
| Model | Method | BIRD (%) | Spider (%) |
|---|---|---|---|
| - | Agentar-Scale-SQL | 74.6 | - |
| - | GenaSQL | 70.53 | - |
| GPT-4 | DPG-SQL | - | 85.6 |
| GPT-4 | DIN-SQL | - | 85.3 |
| Qwen2.5-32B | KG + ToG (ours) | 52.71 | 77.26 |
| Qwen2.5-32B | DAIL-SQL | 49.59 | 75.68 |
| Qwen2.5-32B | TA-SQL | 48.59 | 72.68 |
| Qwen2.5-32B | SuperSQL | 45.08 | 70.42 |
| Qwen2.5-32B | - | 34.64 | 54.89 |
| Agent Model | WebQSP-EX: LLM | WebQSP-EX: ToG | WebQSP-EX: KG + ToG | QALD-10-EN-EX: LLM | QALD-10-EN-EX: ToG | QALD-10-EN-EX: KG + ToG | AdvHotpotQA-EX: LLM | AdvHotpotQA-EX: ToG | AdvHotpotQA-EX: KG + ToG |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-72B | 68.83% | 75.23% | 81.32% | 49.82% | 50.07% | 51.92% | 25.17% | 32.87% | 34.92% |
| Qwen2.5-32B | 65.59% | 77.13% | 80.15% | 46.61% | 47.50% | 48.43% | 23.81% | 27.31% | 33.28% |
| Qwen2.5-7B | 66.26% | 74.97% | 77.46% | 43.33% | 45.21% | 44.05% | 21.08% | 23.42% | 30.09% |
| Model | BIRD-EX: LLMs | BIRD-EX: KG | BIRD-EX: KG + ToG (LLMs-Only) | BIRD-EX: KG + ToG | Spider-EX: LLMs | Spider-EX: KG | Spider-EX: KG + ToG (LLMs-Only) | Spider-EX: KG + ToG |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-72B | 43.26 | 45.19 | 48.47 | 53.28 | 66.79 | 77.56 | 76.48 | 80.12 |
| Qwen2.5-32B | 31.44 | 37.90 | 33.17 | 42.90 | 57.33 | 73.50 | 60.03 | 77.26 |
| Qwen2.5-7B | 21.44 | 30.18 | 29.68 | 32.18 | 47.33 | 66.69 | 65.84 | 69.81 |