1. Introduction
Owing to its powerful data organization and management capabilities, together with mature transaction processing and data consistency guarantees, the relational database is widely used in industries such as finance, retail, and healthcare. However, the structured query language SQL presents a high barrier to entry, so researchers have proposed the Text-to-SQL task, which aims to translate natural language questions into structured query statements. By providing a natural language interface to the database, this task allows users to extract key information efficiently without a professional background, and it has become a research hotspot in the natural language processing and database communities.
The challenge of Text-to-SQL lies in bridging the semantic gap between natural language and SQL queries, encoding complex natural language inputs, and decoding them into richly structured SQL query statements. Previous studies have been dedicated to addressing these challenges, treating Text-to-SQL as a sequence-to-sequence task, focusing on encoder–decoder architectures, and training machine learning models using Text-to-SQL corpora [
1,
2,
3,
4].
The process of implementing Text-to-SQL based on the encoder–decoder architecture begins with data pre-processing, where training data are collected and cleaned. Afterward, the encoder transforms a natural language query into encoded information, capturing both syntax and semantic details. The decoder then utilizes this encoded information to progressively generate the corresponding SQL query, including components such as SELECT, FROM, and WHERE.
This study aims to address three critical technical challenges in the Chinese Text-to-SQL domain: (1) insufficient modeling of cross-lingual schema linking, which fails to capture deep semantic correlations between Chinese linguistic features and database schemas; (2) the underexplored potential of open-source large language models (LLMs) for semantic parsing tasks; and (3) error propagation effects inherent in conventional multi-stage processing pipelines, which significantly degrade system performance. To overcome these limitations, we propose an integrated framework that combines cross-lingual alignment enhancement, lightweight model fine-tuning, and a robust mapping mechanism between Chinese natural language queries and structured SQL representations. The technical specifics of each challenge and our corresponding solutions are systematically elaborated below.
Cross-language schema linking for Chinese Text-to-SQL needs to be strengthened. Existing solutions exhibit significant limitations. On one hand, current approaches tend to handle simple queries well, but their performance drops as database schema complexity grows and multi-table joins increase. On the other hand, while database schemas are usually in English, the language users speak (Chinese, for example) and the data they record may be in other languages. In such cases, SQL parsers trained on English Text-to-SQL corpora suffer a sharp drop in performance, or may even fail to comprehend the natural language queries correctly.
Meanwhile, the associated benchmarks have become increasingly intricate, evolving from single-domain to cross-domain datasets and accumulating more and more complex query–SQL pairs. Single-domain datasets collect query–SQL pairs for a single database drawn from a real-world task, such as ATIS [
5], SEDE [
6], etc. Cross-domain datasets contain multiple databases, to test the performance of models on complex, unseen SQL queries and their ability to generalize to new domains, such as WikiSQL [
7] and Spider [
8].
Many existing studies are conducted on Spider because of its challenging nature, which involves various complex operators. Research on Chinese Text-to-SQL tasks is, in comparison, less common. Given the increasing complexity of the existing benchmarks, and the marked difference between the language of Chinese natural language questions and the language used for database schema items, it is necessary to strengthen cross-language schema linking for the complex Chinese Text-to-SQL benchmarks.
The potential of open-source LLMs has not been fully tapped. Recently, open-source large language models (LLMs) have flourished. They offer rich linguistic knowledge learned from vast corpora through pre-training, capture the diversity and variability of natural language, understand contextual relationships within it, and excel at tasks such as programming, logical reasoning, and text generation.
Cutting-edge research combining pre-trained large-scale language models with prompt engineering has achieved remarkable results [
9,
10,
11,
12,
13].
However, most LLM-based studies rely on closed-source models with vast parameter scales, such as OpenAI LLMs, overlooking the potential of open-source LLMs. Addressing Text-to-SQL tasks with closed-source LLMs offers little room for optimization and often incurs considerable inference costs. Although open-source LLMs are somewhat limited in context understanding and text generation quality compared to closed-source ones, this gap can be narrowed through supervised fine-tuning.
Multi-stage pipeline methods lead to error propagation. For Text-to-SQL tasks, end-to-end methods often increase the learning difficulty. Most existing methods adopt a multi-stage model, assisting the SQL decoder with auxiliary tasks such as schema item classification and encoding token types, collaboratively resolving the Text-to-SQL task [
3,
14,
15].
Multi-stage methods decompose complex tasks into stages, each fulfilled by specialized components or modules, thereby enhancing the overall system performance. However, multi-stage pipeline methods also present the challenge of error propagation [
16,
17,
18]. Error propagation refers to mistakes in one stage of the pipeline potentially cascading to subsequent stages, causing errors in system output.
Addressing error propagation is a significant challenge in designing multi-stage pipeline systems.
To explore the performance of open-source large language models on Chinese Text-to-SQL tasks, and to investigate potential methods for strengthening cross-lingual schema linking and reducing the learning difficulty of Text-to-SQL conversion, we conducted preliminary experiments on the Chinese Text-to-SQL CSpider dataset. Initially, we performed supervised fine-tuning on a relatively small open-source LLM, LlaMa2-7b-chat, organizing the training data with only a basic question representation, and obtained nearly a 40% improvement in exact match on the development set. This preliminary experiment convinced us that using LLMs to tackle Chinese Text-to-SQL tasks is a promising direction.
To validate the feasibility of our ideas, we propose a three-stage pipeline method consisting of a Redundant Database Schema Items Filtering Encoder (RDsiF-Encoder for short), a generative pre-trained LLM SQL parser, and an SQL query correcting decoder:
Stage 1: Train a filtering encoder to discard database information irrelevant to the query, improving cross-lingual schema linking accuracy and reducing the SQL parser’s learning difficulty.
Stage 2: Utilize a supervised fine-tuned LLM to parse the query and generate the corresponding SQL statement.
Stage 3: Train a correcting decoder to refine the SQL statements from Stage 2, further enhancing the overall model performance.
Contributions We propose a three-stage pipeline method (
Figure 1) to address Chinese Text-to-SQL challenges, emphasizing three key improvements over existing approaches:
Schema Filtering with Cross-Language Optimization: Unlike conventional schema linking methods [
3,
19] that focus on structural encoding, we design a filter encoder enhanced with a Chinese information injected layer and data type-aware attention. This specifically addresses ambiguity in cross-language schema linking and eliminates interference from data type-mismatched columns, overcoming limitations in existing graph-based encoders [
4,
20] that neglect cross-lingual alignment.
IECQN question representation for LLM fine-tuning: We propose IECQN question representation (Instruction, Example, Chosen schema, Question, NatSQL) for LLM fine-tuning, which systematically structures prompts beyond the decomposition strategies in [
12] and achieves better alignment with database contexts than the holistic representations in [
13].
Error Correction for Multi-Stage Pipelines: Unlike existing solutions that use prompt engineering methods for error correction [
12], we introduce a dedicated correction model trained on a newly collected dataset to rectify error propagation across stages.
We conducted a series of evaluations and analyses, and the results show that our methods have achieved state-of-the-art (SOTA) performance in execution accuracy on the CSpider dataset.
Outline The remainder of this paper is structured as follows.
Section 2 introduces the latest Text-to-SQL solutions and related studies.
Section 3 presents our contributions to the task.
Section 4, Section 5, and Section 6 present the experimental setup, results, and conclusions of this work, respectively.
3. Methodology
Pre-processing and post-processing play pivotal roles in Text-to-SQL task optimization [
28]. The pre-processing stage enhances generation efficiency and accuracy through schema pruning, while post-processing ensures SQL executability via error correction and validation. These complementary components form a critical optimization pathway for large language models (LLMs) in Text-to-SQL applications. To address the adaptability challenges of conventional methods in complex Chinese database scenarios, we propose a novel modular three-stage optimization framework that systematically integrates (1) data-driven pre-processing for redundant schema filtering, (2) core LLM-based SQL generation, and (3) post-processing with dedicated error correction models. This architecture represents the first comprehensive implementation of integrated pre-processing–post-processing optimization specifically designed for Chinese Text-to-SQL applications.
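To make the data flow between the three stages concrete, the following minimal Python sketch shows how the components could be chained; every name in it (filter_encoder, parser_llm, corrector, build_iecqn_prompt, natsql_to_sql) is an illustrative placeholder rather than part of our released implementation.

```python
def text_to_sql(question, schema, filter_encoder, parser_llm, corrector,
                build_iecqn_prompt, natsql_to_sql):
    """Wire the three FGCSQL stages together (all components are placeholders)."""
    # Stage 1: keep only the tables/columns the filtering encoder deems relevant.
    chosen_schema = filter_encoder.filter(question, schema)

    # Stage 2: build the IECQN prompt, let the fine-tuned LLM emit NatSQL,
    # then transpile NatSQL into an executable SQL query.
    prompt = build_iecqn_prompt(question, chosen_schema)
    natsql = parser_llm.generate(prompt)
    sql = natsql_to_sql(natsql, schema)

    # Stage 3: the correcting decoder predicts edit operations and returns refined SQL.
    return corrector.correct(question, chosen_schema, sql)
```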
3.1. Redundant Database Schema Items Filtering Encoder
In a given database D, with table list $\mathcal{T} = \{t_1, t_2, \dots, t_N\}$ and column lists $\mathcal{C} = \{C_1, C_2, \dots, C_N\}$, where $C_i = \{c_i^1, c_i^2, \dots, c_i^{n_i}\}$, natural language queries typically do not involve all database schema items. Superfluous schema items are likely to negatively affect Text-to-SQL parser performance. Moreover, aligning entities mentioned in natural language queries with database schema items is one of the challenges in Text-to-SQL tasks. Thus, we did not use all database schema items as prompts for the LLM but trained an RDsiF-Encoder instead. We define the filtering task as a classification process, where, for each query Q, the probability of each schema item in the database being necessary to Q is predicted, and superfluous tables and columns are filtered out. The diagram is shown in
Figure 2.
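As a minimal sketch of how the predicted probabilities could be turned into a filtered schema, the snippet below keeps only the highest-scoring tables and columns; the cut-off values are illustrative assumptions, not the settings used in our experiments.

```python
def filter_schema(table_probs, column_probs, top_k_tables=4, top_k_columns=5):
    """Keep the schema items most likely to be needed by the question.

    table_probs:  {table_name: P(table is relevant to Q)}
    column_probs: {table_name: {column_name: P(column is relevant to Q)}}
    The cut-offs (top_k_tables, top_k_columns) are illustrative only.
    """
    kept_tables = sorted(table_probs, key=table_probs.get, reverse=True)[:top_k_tables]
    chosen = {}
    for table in kept_tables:
        cols = column_probs.get(table, {})
        chosen[table] = sorted(cols, key=cols.get, reverse=True)[:top_k_columns]
    return chosen
```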
3.1.1. Organization Method of Input Data
In order for the classifier to fully capture the correspondence between question Q and the schema items, the Chinese semantics of each schema item and the data type of each column are taken as part of the input. Specifically, we organize the input to the model in the following form: $Q \mid t_1(s): c_1^1(s, l), \dots, c_1^{n_1}(s, l) \mid \dots \mid t_N(s): c_N^1(s, l), \dots, c_N^{n_N}(s, l)$, where $n_i$ represents the number of columns contained in table $t_i$, s represents the Chinese semantic information corresponding to the schema item, and l represents the data type corresponding to the column, such as NUMBER and TEXT.
We input the combination of query and database schema items into XLM-RoBERTa, a cross-lingual pre-trained language model. The RDsiF-Encoder structure is detailed in
Figure 3.
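For illustration, a flattened input sequence of this form could be built as follows before tokenization by XLM-RoBERTa; the '|' and parenthesis delimiters are assumptions and may differ from the exact separators used in our implementation.

```python
def build_encoder_input(question, tables):
    """Flatten the question and schema items into one sequence for XLM-RoBERTa.

    `tables` is a list of dicts such as:
      {"name": "car_makers", "zh": "汽车制造商",
       "columns": [{"name": "fullname", "zh": "全称", "dtype": "TEXT"}, ...]}
    The delimiters below are illustrative only.
    """
    parts = [question]
    for t in tables:
        cols = ", ".join(f"{c['name']} ({c['zh']}, {c['dtype']})" for c in t["columns"])
        parts.append(f"{t['name']} ({t['zh']}): {cols}")
    return " | ".join(parts)
```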
3.1.2. Chinese Information Injected Layer for Schema Items
For the Chinese Text-to-SQL task, there is a noteworthy aspect: schema items are predominantly named in English, often with the inclusion of abbreviations, while the queries are presented in Chinese. The cross-linguistic nature and the abstract expression of schema items undoubtedly increase the difficulty of classification.
The omission of schema items in the Text-to-SQL mapping inevitably hinders generative models from capturing potential correct schema items. Thus, it is necessary to enhance cross-language schema linking to address this limitation.
Therefore, we designed a Chinese information injected layer (CII layer for short) to inject the Chinese semantic information of schema items into their corresponding original embeddings. Even if the original names of the database schema items are abstract, the CII layer can help the filtering encoder capture the schema items involved in the natural language query.
Here, the multi-head scaled dot-product attention mechanism [29] combines the schema item embedding $e_{\mathrm{schema}}$ with its Chinese semantic information $e_{\mathrm{zh}}$ to capture the associations between the database schema and the Chinese natural language description. This mechanism computes weighted similarities between $e_{\mathrm{schema}}$ and $e_{\mathrm{zh}}$, uses these similarities as attention scores to guide the blending of information, and finally obtains the Chinese semantic injected schema embedding $e_{\mathrm{schema}}^{\mathrm{zh}}$.
The multi-head mechanism (with H heads) enables the model to learn information in parallel across different subspaces, with each head capturing different dimensions of semantic relevance. This enriches the semantic features captured by the model.
Subsequently, the Chinese semantic injected schema embedding is combined with the original embedding through a residual connection, followed by layer normalization, to ensure training stability and to retain the original information. The result is an enhanced embedding representation infused with Chinese semantic information, which improves the model’s ability to recognize and understand abstract database schema items.
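A compact PyTorch sketch of such a layer is shown below, assuming the schema item embeddings attend to the embeddings of their Chinese descriptions; the hidden size and number of heads are placeholder values.

```python
import torch.nn as nn

class ChineseInfoInjectedLayer(nn.Module):
    """Inject Chinese semantic information into schema item embeddings (illustrative sketch)."""
    def __init__(self, hidden_size=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, schema_emb, chinese_emb):
        # schema_emb:  (batch, n_items, hidden)     original schema item embeddings
        # chinese_emb: (batch, n_zh_tokens, hidden)  embeddings of the Chinese descriptions
        injected, _ = self.cross_attn(query=schema_emb, key=chinese_emb, value=chinese_emb)
        # Residual connection + layer normalization retain the original information
        # while blending in the Chinese semantic signal.
        return self.norm(schema_emb + injected)
```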
3.1.3. Type-Aware Cross Attention for Columns
Queries often refer to the values of particular columns, and, according to the data type of the referenced value, columns with mismatched data types should be filtered out. To enhance the classifier’s ability to perceive the data type of each column, type-aware cross attention is designed on top of the CII layer. Its realization is similar to that of the CII layer; the difference is that both the Chinese semantic information and the data type information are incorporated, so that the representation capability of the output column embedding is enhanced and the classifier can perceive the data type of the input column.
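Because the mechanism is described only at this level of detail, the sketch below shows one plausible realization in which column embeddings attend jointly to their Chinese semantics and a learned data type embedding; the type vocabulary and fusion strategy are assumptions.

```python
import torch
import torch.nn as nn

class TypeAwareCrossAttention(nn.Module):
    """Fuse Chinese semantics and data type information into column embeddings (one plausible realization)."""
    def __init__(self, hidden_size=768, num_heads=8, num_types=4):
        super().__init__()
        self.type_emb = nn.Embedding(num_types, hidden_size)    # e.g., TEXT, NUMBER, TIME, OTHERS
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, column_emb, chinese_emb, type_ids):
        # column_emb: (batch, n_cols, hidden); type_ids: (batch, n_cols)
        type_emb = self.type_emb(type_ids)                      # (batch, n_cols, hidden)
        context = torch.cat([chinese_emb, type_emb], dim=1)     # keys/values carry both signals
        fused, _ = self.cross_attn(query=column_emb, key=context, value=context)
        return self.norm(column_emb + fused)
```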
3.1.4. Column Injected Layer for Tables
The information mentioned in Q may not include table names, which prevents the classifier from correctly classifying potentially relevant tables. For example, column
c, which is necessary for
Q, belongs to table
t, but table
t is not explicitly mentioned in
Q. Drawing on the work of [
3], we designed a column injected layer to enhance the relevance of tables to their columns.
The output is an enhanced table embedding into which information about the columns it contains has been injected.
3.1.5. Tables and Columns Classifier
To complete the binary classification task, two fully connected layers are set up at the end of the encoder, where the input x refers to either the enhanced table embedding or the pooled type-aware column embedding.
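A minimal sketch of such a classification head follows; the hidden width and activation function are assumptions rather than the exact configuration used in the RDsiF-Encoder.

```python
import torch.nn as nn

class SchemaItemClassifier(nn.Module):
    """Two fully connected layers predicting whether a schema item is needed (illustrative sketch)."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 2)   # binary: relevant vs. redundant
        self.act = nn.ReLU()

    def forward(self, x):
        # x: enhanced table embedding or pooled type-aware column embedding
        logits = self.fc2(self.act(self.fc1(x)))
        return logits.softmax(dim=-1)
```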
3.2. Generative Pre-Trained LLM for SQL Parser
Language models that have undergone pre-training have learned a wealth of linguistic knowledge from large corpora, so they better capture the diversity and variability of natural language and understand the contextual relationships within it. This ability to comprehend context allows the generative SQL parser to adapt more flexibly to various language expressions and query structures, handling complex natural language queries more effectively.
Using open-source LLMs to tackle Text-to-SQL tasks provides more room for optimization, achieving an advantage in parameter scale while ensuring performance.
Before conducting supervised fine-tuning for the large language model (LLM), it is essential to prepare training data. In the context of Text-to-SQL, particular attention should be given to the method of representing questions. In question representation, a primary challenge is the effective combination of natural language questions with database schema information. The goal of question representation is to control the probability of the LLM generating SQL statements, maximizing the likelihood of generating correct SQL queries.
$\max_{f} P\big(S \mid f(Q, D)\big)$, where $P(S \mid f(Q, D))$ denotes the probability of generating the correct SQL query S given the query Q and the associated database schema information D. The function f encapsulates the question representation process, and the optimization aims to maximize this conditional probability, ensuring the generation of accurate SQL queries by the LLM.
To organize the training and test data required for LLM supervised fine-tuning, we proposed a question representation method: IECQN.
In IECQN, first, the instruction (I) articulates the task to be solved by the SQL parser and the output standard; then, an example (E) output helps the SQL parser clarify the output standard further; the chosen database schema information (C) is expressed through listing the N most relevant tables and M columns within those tables associated with the question; next, the natural language query (Q) is presented. Finally, we use an intermediate representation to bridge the gap between natural language questions and SQL queries. NatSQL simplifies queries while preserving core SQL functionalities, significantly reducing the generation complexity for LLMs. Its natural language-like expressions align better with Chinese semantic logic, alleviating the model’s pressure to handle complex SQL syntax (such as deep nesting and HAVING clauses). Additionally, NatSQL and standard SQL support bidirectional lossless conversion, retaining complete semantic information for seamless plug-and-play usage. Therefore, we use the NatSQL corresponding to the original gold SQL query as the target response.
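To illustrate how the IECQN elements could be assembled into a single training prompt, we give a small sketch below; apart from the "###Input"/"###Response" markers discussed in Section 5.4, the exact wording and section markers are illustrative assumptions.

```python
def build_iecqn_prompt(instruction, example, chosen_schema, question):
    """Assemble Instruction, Example, Chosen schema, and Question into one prompt.

    `chosen_schema` is the output of the RDsiF-Encoder, e.g.
      {"car_makers": ["id", "fullname"], "model_list": ["maker", "model"]}
    During supervised fine-tuning, the NatSQL query corresponding to the gold SQL
    serves as the target response.
    """
    schema_lines = "\n".join(f"{table}: {', '.join(cols)}"
                             for table, cols in chosen_schema.items())
    return (
        f"###Instruction:\n{instruction}\n\n"
        f"###Example:\n{example}\n\n"
        f"###Chosen schema:\n{schema_lines}\n\n"
        f"###Input:\n{question}\n\n###Response:\n"
    )
```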
Most existing LLM-based methods focus on prompt engineering and in-context learning and do not explore the potential of supervised fine-tuning. The goal of supervised fine-tuning is to minimize the following empirical loss:
$\theta^{*} = \arg\min_{\theta} \sum_{(Q_i, D_i, S_i) \in \mathcal{D}_{\mathrm{train}}} \mathcal{L}\big(\mathrm{LLM}_{\theta}(f(Q_i, D_i)),\, S_i\big)$
Here, the goal is to find the best-performing parameters $\theta^{*}$ that minimize the accumulated loss over the training set $\mathcal{D}_{\mathrm{train}}$. The function f takes a query $Q_i$ and its corresponding database information $D_i$ and encapsulates the question representation process; $\mathrm{LLM}_{\theta}(f(Q_i, D_i))$ denotes the SQL query generated by the model with parameters $\theta$; and the loss function $\mathcal{L}$ evaluates the deviation of this prediction from the gold annotation $S_i$. Through this optimization, the fine-tuned model is better positioned to capture the nuances of the targeted tasks and perform with greater accuracy.
As pre-trained language models grow larger, the difficulty of traditional fine-tuning increases. Considering computational resources and training costs, we opted for parameter-efficient fine-tuning. The trainable parameters in some parameter-efficient fine-tuning (PEFT) methods are significantly fewer compared to the parameter scale of a LLM [
30,
31,
32].
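As a hedged example, parameter-efficient fine-tuning with LoRA can be configured with the Hugging Face peft library roughly as follows; the rank, alpha, dropout, and target modules shown are illustrative hyperparameters, not the values tuned in our experiments.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-chat-hf"        # or another open-source LLM such as ChatGLM2-6b
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in LlaMa-style blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a small fraction of weights are trainable
```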
The fine-tuned LLM is used for inference, creating NatSQL queries by processing the input natural language question and the specified database information. The question representation method used during inference is consistent with the method employed to organize the training data.
The SQL parser first generates the intermediate representation NatSQL, which can be translated to SQL queries via a non-trainable transpiler [
23]. During inference, a beam search of size B is performed, with the first executable SQL query output to ensure execution accuracy.
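A sketch of this selection step is given below; trying the beam candidates in order and returning the first one that executes follows the description above, while validating against a SQLite copy of the target database is an implementation assumption.

```python
import sqlite3

def first_executable(candidate_sqls, db_path):
    """Return the first beam-search candidate that executes without error."""
    for sql in candidate_sqls:                   # candidates in beam order
        try:
            with sqlite3.connect(db_path) as conn:
                conn.execute(sql)
            return sql
        except sqlite3.Error:
            continue
    return candidate_sqls[0] if candidate_sqls else None   # fall back to the top beam
```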
We conducted a series of experiments to validate the effectiveness of our proposed question representation method IECQN, while also discussing the vast potential of supervised fine-tuning on LLMs for Text-to-SQL tasks.
3.3. SQL Query Correcting Decoder
A widely recognized downside of multi-stage pipeline methods is error propagation, where incorrect categorization of redundant items inevitably affects the generative capabilities of large language models for SQL query generation. Hence, we collected correct and incorrect inference results from $\mathrm{LLM}_{\theta_i}$ (i denotes a specific parameter-efficient fine-tuning) to train an SQL query correcting decoder, mitigating the effects of error propagation. Given a natural language query Q, database schema information D, and the erroneous SQL query $S'$, the goal of the correcting decoder is to predict the edit operations m required for correction and the corrected SQL $S^{*}$. Following existing works [33], we defined the correction task as a sequence-to-sequence generation:
$P\big(m, S^{*} \mid Q, D, S'\big) = \prod_{t} P\big(m_t, s_t \mid Q, D, S', m_{<t}, s_{<t}\big)$
The equation represents the probability of correctly predicting each edit operation $m_t$ and the updated SQL snippet $s_t$ at time step t, conditioned on the input query Q, schema D, the erroneous SQL $S'$, and the sequence of all previously predicted edits and SQL snippets up to time $t-1$. This sequence-to-sequence generation framework iteratively refines the erroneous query towards the correct form, ultimately producing the corrected SQL $S^{*}$.
To ensure that the training data for the correcting decoder reflects the true distribution of errors, we performed k-fold cross-validation on the Chinese Text-to-SQL dataset using a specific pre-trained LLM, synthesizing a correction dataset.
Specifically, for the training set, the Chinese Text-to-SQL training set was divided into k parts, using k−1 subsets to supervised fine-tune the LLM while evaluating on the remaining subset. If the parsing result is not an exact set match or execution match to the gold annotation, it is marked as erroneous; otherwise, it is deemed correct and included in the correction training set. For the test set, the LLM is supervised fine-tuned on the training set of the complete Chinese Text-to-SQL dataset, and the resulting model is evaluated on the development set of the Chinese Text-to-SQL dataset; the evaluation results are then included in the test set of the error correction dataset. The training data organization and fine-tuning methods used for the supervised fine-tuning of the LLM are consistent with those described in Section 3.2. The specific pre-trained LLM refers to the pre-trained LLM used by the generative pre-trained LLM SQL parser.
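The k-fold synthesis procedure can be outlined as in the sketch below; fine_tune, parse, and matches_gold are placeholders for the supervised fine-tuning, inference, and exact-set/execution matching steps and do not correspond to a released API.

```python
from sklearn.model_selection import KFold

def build_correction_trainset(examples, k=5):
    """Synthesize correction training data via k-fold cross-validation (sketch).

    `examples` are (question, schema, gold_sql) triples from the Chinese
    Text-to-SQL training set; fine_tune / parse / matches_gold are placeholders.
    """
    correction_data = []
    for train_idx, held_out_idx in KFold(n_splits=k, shuffle=True).split(examples):
        llm = fine_tune([examples[i] for i in train_idx])        # supervised fine-tuning
        for i in held_out_idx:
            question, schema, gold_sql = examples[i]
            predicted_sql = parse(llm, question, schema)
            label = "correct" if matches_gold(predicted_sql, gold_sql) else "erroneous"
            correction_data.append((question, schema, predicted_sql, gold_sql, label))
    return correction_data
```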
We used three editing operations: replace, insert, and delete, to correct the erroneous SQL queries in the correction dataset relative to the target output, completing the construction of the correction dataset (
Figure 4).
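One way to derive the replace/insert/delete operations between an erroneous SQL query and its target output is token-level sequence alignment; the difflib-based sketch below is illustrative and not necessarily the alignment procedure used to construct our correction dataset.

```python
import difflib

def edit_operations(erroneous_sql, target_sql):
    """Token-level replace/insert/delete operations turning the erroneous SQL into the target."""
    src, tgt = erroneous_sql.split(), target_sql.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=tgt).get_opcodes():
        if tag == "replace":
            ops.append(("replace", src[i1:i2], tgt[j1:j2]))
        elif tag == "insert":
            ops.append(("insert", [], tgt[j1:j2]))
        elif tag == "delete":
            ops.append(("delete", src[i1:i2], []))
    return ops
```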
It is worth noting that CSpider is translated from the Spider dataset, whose gold SQL queries were annotated by eleven different annotators with inconsistent annotation styles.
Also, the IECQN representation method we proposed in
Section 3.2 uses NatSQL as an intermediate representation. When the intermediate representation is converted to SQL queries, its style (NatSQL style for short) has certain differences from the gold SQL queries. For differences in style see the diagram below.
Gold SQL style tends to use uppercase letters to represent keywords, table names, and column names, and employs the AS keyword to alias tables. In contrast, NatSQL style leans towards the use of lowercase letters, directly utilizing table names without the AS keyword.
The construction rules for these two styles of SQL queries are inconsistent (
Figure 5), and the ultimate output of the generative NatSQL parser is the former. Considering all the above factors, to reduce the learning difficulty of the correction model and avoid conflicts in the construction rules between the erroneous SQL queries and the target outputs, when correcting the erroneous SQL queries, the NatSQL style SQL query corresponding to the original gold SQL in the CSpider dataset is taken as the target output. In this way, we can reduce the risk of erroneous corrections due to style differences, enhancing the robustness and accuracy of the correction decoder when dealing with actual SQL queries.
5. Results
5.1. Results on CSpider
We compared our proposed method with the following methods: RAT-SQL [
4], LGESQL [
39], FastRAT [
40], ChatGPT [
9], Heterogeneous Graph + Relative Position Attention [
21], RESDSQL [
3], and Auto-SQL-Correction [
33].
On the development set of the CSpider dataset, the performance of FGCSQL and other methods on the EM and EX evaluation metrics is shown in
Table 1.
Our best model variant, FGCSQL, performs close to the SOTA on the EM metric (−0.4% difference), and it outperforms all baseline variants on the EX metric. This indicates that our three-stage pipeline method can significantly reduce the learning difficulty of Text-to-SQL conversion. FGCSQL achieved a competitive performance, improving EX from 79.1% to 81.1%, showcasing the effectiveness of our method. Notably, all model variants combined with NatSQL show significant improvement on the EX metric compared to previous studies, as using NatSQL as an interim representation bridges the gap between natural language and SQL queries, further reducing the learning difficulty.
5.2. Generalization Validation
To evaluate the model’s generalization capability, we conducted experiments on the BIRD development set. BIRD’s database design simulates real-world scenarios, featuring cluttered data rows and complex schemas. Its difficulty stratification (Simple/Moderate/Challenging) effectively validates the model’s ability to handle schema redundancy in Chinese Text-to-SQL tasks.
Sampling Methodology: Through stratified sampling from 1534 development set samples, we constructed a 200-sample subset (proportionally distributed across difficulty levels) by translating English questions to Chinese, ensuring linguistic and task complexity representation.
Challenge Characteristics: Notably, BIRD contains numerous challenging cases involving numerical reasoning and nested queries, which impose heightened demands on Text-to-SQL solutions.
Experimental Results: Our three-stage pipeline achieved an execution accuracy (EX) of
66.27% without fine-tuning on the BIRD training data. While this performance gap (compared to the current SOTA [
41] EX of 73.01% on BIRD) reveals improvement potential, it preliminarily demonstrates our method’s cross-dataset generalization capability, particularly in handling complex schema relationships unseen during training.
5.3. Ablation Studies
To analyze the effectiveness of each design component, we conducted a series of ablation experiments on the CSpider development set.
5.3.1. Effect of Chinese Information Injection Layer
The design of the Chinese Information Injection Layer aims to address the difficulties of cross-language schema linking and reduce the parsing difficulty for the subsequent generative SQL parser. The AUC is calculated separately for table and column classification with and without the use of the Chinese information injected layer and type-aware attention, with the experimental results shown in
Table 2.
The CII layer improves the filtering encoder’s table classification AUC and column classification AUC by 0.89% and 0.92%, respectively. Type-aware attention improves the filter encoder’s column classification AUC by 0.82%. The enhanced performance of the AUC indicates that the CII layer is helpful to capture the mapping features between the Chinese information and the corresponding English schema items, and type-aware attention improves the classifier’s type perception ability for column information.
5.3.2. Effect of Redundant Database Schema Items Filtering Encoder
The design of the Redundant Database Schema Items Filtering Encoder aims to resolve the issues with end-to-end methods increasing the learning difficulty of Text-to-SQL. By filtering the database schema items corresponding to the natural language queries in CSpider through the Redundant Item Filtering Encoder, the most relevant database schema items are provided for the IECQN query representation.
Table 3 shows the performance comparison between using filtered database schema items and using all schema items as database information in the question representation, for LlaMa2-7b-chat and ChatGLM2-6b following LoRA fine-tuning.
For the fine-tuned LlaMa2-7b-chat as the SQL parser, using the filtered database schema item information led to a 4.0% increase in EM and a 7.9% increase in EX.
For the fine-tuned ChatGLM2-6b as the SQL parser, using the filtered database schema item information led to a 4.9% increase in EM and an 8.3% increase in EX.
By filtering out irrelevant information that would negatively impact the SQL parser using the Redundant Item Filtering Encoder, the fine-tuned LLM experienced a significant performance boost.
In order to qualitatively analyze the effect of the RDsiF-Encoder in filtering related schema items, we tested three queries, calculated the output probabilities of the RDsiF-Encoder, and drew the schema correlation probability heatmap (Figure 7). The heatmap uses a color gradient from blue to red to represent varying probability levels: blue shades indicate low probabilities, and red shades indicate high probabilities, with intensity increasing as the color transitions toward red.
Through visualization heatmap analysis, it is found that the redundancy pattern filtering mechanism in this study can effectively identify core table columns across queries of varying complexity. In simple query cases, the model focuses on the correct target car_names.model with a probability >0.98, accurately filtering out other table columns. When processing complex queries involving multi-table joins, the system demonstrates semantic understanding capabilities: despite two distinct query expressions, it consistently identifies the core table car_makers (probabilities 0.991 and 0.990) and its key columns fullname (0.993 and 0.739) and id (0.984 and 0.987), while moderately associating with the maker column in the model_list table (0.650 and 0.620 probabilities) to establish table joins.
Notably, the model can differentiate the semantic differences between columns with the same name across tables: when queries emphasize “car version”, the model column in the car_names table achieves a relevance of 0.981, while the identically named model column in the model_list table only scores 0.082, indicating the system captures contextual table structure differences through column information injection. Additionally, when users employ alternative expressions like “names”, the fullname column’s probability decreases from 0.993 to 0.739, yet the system maintains high-confidence selection of core table columns, proving that the Chinese semantic injection strategy effectively enhances robustness against natural language expression variations. These cases confirm that the filtering module significantly optimizes schema linking accuracy, establishing a reliable foundation for subsequent SQL generation.
5.3.3. Effect of IECQN Question Representation
By using IECQN question representation, an effective expression of natural language questions and database schema information increases the likelihood of generating the correct SQL queries. To validate the effectiveness of IECQN query representation, we compared the parser trained with basic question representation (BS) to one trained with IECQN representation, and the experimental results are shown in
Table 4.
Compared to basic question representation training, using IECQN representation led to a 12.0% increase in EM and 12.9% increase in EX for the fine-tuned LlaMa2-7b-chat as the SQL parser. For the fine-tuned ChatGLM2-6b as the SQL parser, a 27.6% increase in EM and 28.6% increase in EX were observed.
This is because the database information in IECQN representation is customized for the natural language question, eliminating interference from redundant information. On the other hand, using NatSQL as the target output narrows the gap between natural language queries and SQL statements, further lowering the learning difficulty of Text-to-SQL.
5.3.4. Effect of Parameter-Efficiently Fine-Tuned LLM
The purpose of using a parameter-efficiently fine-tuned LLM as the SQL parser is to explore the potential of an open-source LLM in Chinese Text-to-SQL tasks, achieving comparable or even superior performance to larger scale language models with lower training costs.
We compared the performance of LoRA fine-tuned LlaMa2-7b-chat and ChatGLM2-6b, using the basic question representation, against the original un-fine-tuned models. Evaluations used the same input context organization method: BS. Given ChatGPT’s excellent performance on various natural language tasks, we also added comparison results with ChatGPT.
Table 5 shows the experimental results.
Parameter-efficiently fine-tuned LlaMa2-7b-chat improved by 38.8% in EM and 44.6% in EX. Parameter-efficiently fine-tuned ChatGLM2-6b improved by 38.8% in EM and 44.6% in EX as well. Compared to the closed-source ChatGPT, parameter-efficiently fine-tuned LlaMa2-7b-chat was 20.5% ahead in EM and 2.9% ahead in EX. The significant gap in EM compared to EX is because EM is insensitive to the execution results of generated SQL; sometimes the correct SQL for the same query varies substantially in expression order and logic, with EM occasionally misclassifying correct SQL as wrong, leading to false negatives. This means some SQL queries generated by ChatGPT are more diverse and flexibly expressed, yet the execution results are accurate.
The parameter-efficiently fine-tuned open-source LLM achieved comparable performance in Chinese Text-to-SQL tasks, particularly the parameter-efficiently fine-tuned LlaMa2-7b-chat, which surpassed the 20b-parameter ChatGPT in both metrics.
5.3.5. Effect of SQL Query Correcting Decoder
The SQL query correcting decoder aims to resolve the error propagation problem inherent in multi-stage pipeline methods. In our experiments, the LLM used in the generative SQL parser was LlaMa2-7b-chat. We compared the output SQL queries from the first two stages of FGCSQL with the complete FGCSQL, with the results presented in
Table 6.
The use of the correction decoder led to a 0.7% improvement in EM and a modest 0.2% improvement in EX, nonetheless achieving the best performance on the CSpider dataset. This is because, on the one hand, the correction decoder partially shields the generative LLM from the negative impact that misclassification of redundant items has on SQL query generation, reducing error propagation; on the other hand, it reduces the risk of incorrect corrections due to style differences, enhancing the robustness and accuracy of the correction decoder when handling practical SQL queries.
5.4. Prompt Format Sensitivity Analysis
To evaluate the potential constraints of the IECQN instruction template on model outputs, we designed two categories of prompt variants: (1) Elements Reordering (ICEQ): adjusting the sequence of instruction, example, and chosen schema items to ICEQ, thereby disrupting the original template’s linear logic chain; (2) Weak-format Expressions: weakening the imperative nature of instructions through semantic softening and format simplification.
See
Figure 8 for examples of the modified prompts. By comparing the model performance under these variants, we conducted analysis from two perspectives: output stability (EX fluctuation) and semantic robustness (format sensitivity).
"Based on" → "Using the available": Softens the dependency on chosen database schema items by implying flexibility in interpreting “available” information.
"Convert" → "Try to generate": Replaces a direct command with a suggestion, reducing pressure on absolute correctness and allowing for potential fallibility.
Remove "only return the complete SQL statement": Eliminates the mandatory constraint, allowing unconstrained model outputs.
"###Input:\nquery \n\n###Response:\n" → "query": Directly presents the query without formatting markers.
5.4.1. Output Stability
Table 7 shows that, under the reordered prompts (ICEQ), the model achieves 80.7% EX on the test set, only 0.2% lower than the original IECQN’s 80.9%. However, the weak-format version yields 79.2% EX with a larger fluctuation of 1.7%. Further analysis reveals that 54.2% of the errors in the weak-format outputs stem from generating non-SQL content (e.g., explanatory text, code block markers), indicating significant model dependency on explicit instruction structures.
5.4.2. Format Robustness Metric
We define format robustness as $R = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[y_i \equiv \tilde{y}_i\big]$, where $y_i$ and $\tilde{y}_i$ denote the output of the i-th sample under the original and variant prompts, respectively, and $\mathbb{1}[\cdot]$ indicates whether the two outputs are consistent. The experimental results reveal the following:
This confirms that, while maintaining command intensity (non-weak formats), the IECQN structure effectively decouples domain semantics from format dependency, preserving core semantics while enabling generalization across prompt expression variants.
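Under the definition above, the metric could be computed as in the short sketch below, where execution-result equivalence is used to compare the two outputs (an assumption consistent with the EX metric).

```python
def format_robustness(original_outputs, variant_outputs, execute):
    """Fraction of samples whose SQL outputs remain execution-equivalent across prompt variants."""
    assert len(original_outputs) == len(variant_outputs)
    consistent = sum(
        execute(orig) == execute(var)
        for orig, var in zip(original_outputs, variant_outputs)
    )
    return consistent / len(original_outputs)
```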
5.5. Error Analysis and Correction Effectiveness Evaluation
To systematically identify potential errors in the SQL parsing process, we conducted a manual analysis of 100 randomly selected error cases from the correction dataset where the execution results differed from the gold SQL. The categorical error distribution is visualized in
Figure 9.
Schema-linking Errors Occur when the model fails to accurately map entities to corresponding table/column names in the database schema, typically manifested as confusion between homonymous fields or misuse of redundant schema information. These errors stem from incomplete filtering of irrelevant schema elements during the schema filtering phase, leading to entity mapping deviations in LLM-generated SQL.
Value Errors Characterized by incorrect value associations between user queries and database entries, including numerical discrepancies or format inconsistencies (e.g., date formatting mismatches).
JOIN Errors Involve improper join types or missing join conditions, particularly failures in detecting implicit relationships implied in user questions.
GROUP-BY Errors Manifest as incorrect grouping column selection or omission of essential grouping operations (e.g., parsing “calculate average salary per department” as a non-grouped aggregation).
Condition Errors Include comparison operator misuse (e.g., “>” vs. “>=”) and flawed logical condition nesting.
Nested Operation Errors Involve incorrect implementation or undetected nested queries/set operations.
Invalid SQL Structurally malformed SQL statements that fail execution.
Figure 10 is the error correction statistics of the SQL query correcting decoder for these 100 error samples.
The correction model successfully rectified only one schema-linking error, indicating strong coupling between schema linking errors and upstream processing stages.
No value errors were corrected, primarily due to the absence of database value samples in the correction process.
JOIN and nested operation errors showed zero correction rates, highlighting limitations in addressing semantic-level errors requiring sophisticated cross-table dependency analysis.
The SQL query correcting decoder has a more obvious correction effect on syntax-level errors (such as Condition and GROUP-BY), demonstrating effective mitigation of error propagation in grammatical constructs.
This error distribution suggests that, while our correction decoder effectively handles syntactic-level anomalies, it faces inherent challenges with semantic-level errors that demand deeper contextual understanding. Future improvements should focus on enhancing cross-table dependency analysis and implementing value-linking mechanisms.
6. Conclusions
6.1. Theoretical and Practical Implications
The proposed FGCSQL framework advances the field of cross-lingual Text-to-SQL parsing through three key contributions with both theoretical and practical significance. First, the three-stage pipeline architecture (filtering–generation–correction) establishes a theoretically grounded framework for decoupling schema linking, SQL parsing, and error correction—components that are traditionally entangled in end-to-end approaches. This decomposition provides new insights into managing error propagation in multi-stage semantic parsing systems.
Practically, the integration of parameter-efficient fine-tuning with open-source LLMs (e.g., LlaMa2-7b-chat) demonstrates a viable path for developing domain-specific Text-to-SQL systems without requiring massive computational resources. The operational success of Chinese-specific components (e.g., the Chinese information injected layer) offers a blueprint for adapting Text-to-SQL systems to non-English contexts, addressing a critical gap in multilingual NLP applications.
6.2. Summary of Ideas and Experimental Performance
This study introduces FGCSQL, a three-stage pipeline method for open-source large language model (LLM)-driven Chinese Text-to-SQL transformation. The proposed pipeline includes a redundant database schema items filtering encoder, a generative pre-trained LLM for SQL parsing, and an SQL query correcting decoder. These components are designed to address the challenges associated with Text-to-SQL tasks, particularly those involving cross-lingual schema linking, the complexity of queries, and error propagation within multi-stage processing.
The FGCSQL model demonstrates competitive performance on the CSpider dataset, a large-scale Chinese benchmark for cross-domain semantic parsing and Text-to-SQL tasks. The model achieves near-SOTA performance on exact match (EM) and surpasses all baseline variants on execution accuracy (EX), showing that the three-stage pipeline method significantly reduces the learning difficulty of Text-to-SQL translation.
Extensive ablation studies and comparison with various baseline approaches, including methods like LlaMa2-7b-chat with basic question representation, RAT-SQL, LGESQL, and others, highlight the effectiveness of each component within FGCSQL. Notable findings include the impact of the Chinese information injected layer and type-aware attention on improving schema item classification, and the effectiveness of the IECQN query representation, which customizes database schema information to natural language queries and uses NatSQL as an intermediate representation, further minimizing the learning difficulty.
FGCSQL offers superior execution accuracy compared to methods built on closed-source models such as ChatGPT, and it demonstrates the advantage of leveraging open-source LLMs: with parameter-efficient fine-tuning, they significantly outperform larger closed-source models like ChatGPT in both EM and EX.
Lastly, the SQL query correcting decoder, designed to mitigate error propagation, yields an incremental improvement in EM and establishes the best performance on the CSpider dataset. The correction decoder also demonstrates increased robustness and accuracy in dealing with practical SQL queries.
6.3. Limitations and Future Work
Two primary limitations warrant discussion. First, defining schema filtering as a binary classification task may lead to some potentially relevant tables/columns being mistakenly judged as redundant and lost, thereby affecting the contextual integrity and semantic accuracy of subsequent SQL generation. Second, our error correction mechanism, which focuses on syntactic validity and semantic alignment with user intent, does not guarantee that SQL statements are fully executable, which is a critical flaw for more complex queries.
Future research can advance the technical boundaries from the following directions: (1) integrate structural meta-information such as primary and foreign key relationships in the schema filtering stage to reduce the probability of mistakenly deleting key schema items; (2) design a value linking module based on real value retrieval, which identifies and completes implicit values through real data distribution analysis and context awareness, alleviating SQL parsing errors caused by values; (3) introduce execution feedback in the error correction stage to ensure the executability of the parsing results.