1. Introduction
Industry 4.0 has transformed manufacturing into intelligent, data-driven systems where structured data continuously flows from sensors, production lines, supply chains, and quality control systems [1]. These industrial databases underpin real-time decision-making, predictive maintenance, production optimization, and resource allocation. However, SQL query complexity remains a barrier for domain experts, production managers, and operators seeking insights from these databases. Typical queries that production managers need to issue include the following: “显示生产线A上个月的平均产量” (average production of line A last month), “查询设备B的故障历史” (failure history of equipment B), and “统计本月质量指标不合格的产品数量” (count of products with unqualified quality indicators this month). These requests require multi-table joins, aggregations, and temporal filtering that exceed most domain experts’ technical skills. Natural Language to SQL (NL2SQL) addresses this problem by converting natural language queries into executable SQL, allowing non-technical users to access relational databases directly in Industry 4.0 settings where rapid data exploration and ad hoc analysis drive production optimization, quality issue identification, equipment monitoring, and operational decision-making.
As big data and artificial intelligence technologies advance, relational databases store massive structured data across industrial domains, including manufacturing execution systems (MESs), enterprise resource planning (ERP), quality management systems, and Internet of Things (IoT) platforms. These databases typically contain hundreds of tables with thousands of columns, storing information about production schedules, equipment status, quality metrics, supply chain logistics, and customer orders. Allowing non-professional users to access and utilize these data remains a persistent challenge in database research. Traditional approaches require users to learn SQL syntax, understand complex database schemas, and write correct queries—skills that most domain experts and operators lack. This challenge is amplified in Industry 4.0, where data-driven insights directly affect operational efficiency, product quality, and competitiveness, making fast, reliable access to production databases critical.
The development of NL2SQL tasks has evolved through three major phases. Early rule-based methods [2] relied on manually written rules and templates that matched natural language patterns to SQL templates. While these methods achieved high precision for specific domains, they suffered from limited generalization and required extensive manual engineering for each new database schema. The second phase introduced end-to-end neural models that adopted encoder/decoder architectures, encoding natural language queries and database schemas into vector representations before generating SQL [3,4]. Methods such as RAT-SQL [5] enhanced schema understanding through relation-aware encoding, utilizing relation graphs to better handle multi-table join queries. IGSQL [6] constructed database schema interaction graphs using graph neural networks for context-dependent SQL generation. EditSQL [7] transformed SQL generation into edit operation sequences, gradually constructing SQL statements. These methods often use unified encoders that concatenate queries and schemas into long texts mapped to the same set of hidden vectors, then generate SQL through decoders. When semantic, program, and schema information are simultaneously compressed into this information bottleneck, certain information types become distorted in the representation space, leading to semantic-program interference. This manifests as frequent problems in the structural completeness (missing clauses, incorrect nesting) and schema consistency (wrong table/column references) of generated SQL.
In recent years, methods based on large language models (LLMs) have improved NL2SQL task performance through task decomposition, chain-of-thought reasoning, and in-context learning [8,9,10]. By leveraging large-scale unsupervised corpora in the pre-training stage to build general language representations, they provide a strong semantic modeling foundation for downstream NL2SQL tasks [11]. DIN-SQL [12] adopts task decomposition and self-correction strategies, MAC-SQL [13] uses multi-agent collaboration frameworks, and DAIL-SQL [14] employs database-aware methods. These methods show progress on English NL2SQL tasks but face challenges on Chinese NL2SQL tasks. Existing methods lack specialized optimization for Chinese language characteristics and struggle with ellipsis, flexible word order, and implicit relationships in Chinese queries [15,16].
Although existing LLM-based methods have alleviated this problem through task decomposition, the SQL generation module remains a unified model that needs to simultaneously handle the full spectrum of tasks from simple queries to complex nested queries. When semantic, program, and schema knowledge are processed together, different types of knowledge interfere with each other, making it difficult for models to accurately understand users’ query intentions.
We propose CIR-SQL (Chinese Intent Recognition SQL), a dual-model iterative optimization framework for intent-enhanced NL2SQL. CIR-SQL decouples NL2SQL into two specialized subtasks—intent recognition and SQL generation—using structured intent intermediate representations that separate semantic and program layers, thereby mitigating semantic-program interference. The framework consists of four main components: (1) a seven-category intent classification system for Chinese NL2SQL, based on analysis of Chinese query patterns; (2) structured intent representations with five components—intent category labels, candidate table-column sets, predicate template sets, SQL keyword sets, and natural language descriptions—that separate semantic and program knowledge; (3) a hierarchical backtracking strategy using execution feedback, with three-level backtracking (SQL, context, intent layers) that adapts to error types; (4) experimental validation on standard benchmarks. The framework is applicable to various domains including Industry 4.0 scenarios, where Chinese-speaking operators and domain experts interact with complex databases containing production data, quality metrics, equipment status, and supply chain information, enabling data-driven decision-making.
2. Related Work
Related work in the NL2SQL field has progressed through several generations, with current research primarily divided into three categories: end-to-end methods based on deep learning, methods based on large language models, and methods based on intelligent agents. Each category addresses different aspects of the NL2SQL challenge, with varying trade-offs between accuracy, computational cost, and generalization capabilities.
End-to-end deep learning methods typically use encoder-decoder architectures that encode natural language queries and database schemas into vector representations, then generate SQL through decoders. Representative methods include RAT-SQL [5], which uses relation-aware encoding with relation graphs to handle multi-table join queries; IGSQL [6], which constructs database schema interaction graphs using graph neural networks for context-dependent SQL generation; and EditSQL [7], which transforms SQL generation into edit operation sequences. These methods often use unified encoders that concatenate queries and schemas into long texts mapped to the same hidden vectors, then generate SQL through decoders. When semantic, program, and schema information are compressed into this bottleneck, some information types become distorted in the representation space, resulting in semantic-program interference. These methods primarily target English queries and offer limited support for Chinese language characteristics.
While end-to-end methods show promise, LLM-based approaches have gained traction. Large language models have improved NL2SQL performance through task decomposition, chain-of-thought reasoning, and in-context learning [8,9,10]. DIN-SQL decomposes complex queries into subtasks and uses self-correction mechanisms, guiding LLMs to generate SQL through few-shot examples and perform post-generation correction. DAIL-SQL uses database-aware methods that understand database structure features. C3 [17] employs ChatGPT for zero-shot SQL generation with carefully designed prompts. RESDSQL [18] decouples schema linking and skeleton parsing, handling schema matching and SQL structure generation separately. While these methods show progress on English NL2SQL, they struggle with Chinese NL2SQL. Although task decomposition mitigates this problem, the SQL generation module remains a unified model handling the full spectrum from simple to complex nested queries. When semantic, program, and schema knowledge are processed together, different knowledge types interfere, hindering accurate understanding of query intentions.
Beyond single-model approaches, multi-agent frameworks offer alternative architectures. Intelligent agent methods decompose NL2SQL into subtasks through specialized agents working collaboratively, allowing task specialization and parallel processing [19]. MAC-SQL uses a multi-agent framework with three agents (Selector, Decomposer, Refiner) handling different query aspects. The Selector selects minimal relevant schema subsets, reducing irrelevant information. The Decomposer breaks complex queries into sub-questions and sub-SQLs via chain-of-thought reasoning. The Refiner detects and corrects SQL errors through iterative optimization with execution feedback. This approach achieves task specialization through agent collaboration. However, agent-based methods face limitations. Although MAC-SQL decomposes tasks through multi-agents, information transfer between agents relies mainly on encoder/decoder context, failing to fully utilize structured intermediate representations. In error correction, methods like DIN-SQL and MAC-SQL use unified correction strategies that cannot perform targeted backtracking based on error types (syntax, semantic, schema), resulting in low error repair efficiency [20,21].
We propose CIR-SQL (Chinese Intent Recognition SQL), a dual-model iterative optimization framework for intent-enhanced NL2SQL. The key novelty lies in (1) the explicit decoupling of intent recognition and SQL generation into two specialized models with structured intent intermediate representations, which separates semantic and program layers more effectively than unified approaches; (2) a seven-category intent classification system specifically designed for Chinese NL2SQL query patterns; (3) a hierarchical backtracking strategy that adaptively selects backtracking levels (SQL, context, or intent) based on error types, providing more efficient error correction than uniform correction strategies. Unlike existing methods that use task decomposition but maintain unified SQL generation models, CIR-SQL creates a dedicated intent recognition model that produces structured intermediate representations, allowing the SQL generation model to focus solely on program synthesis without semantic interference.
3. Methodology
3.1. CIR-SQL Framework
Figure 1 illustrates CIR-SQL, a dual-model iterative optimization framework that uses large language models for Chinese NL2SQL parsing. The term “dual-model” refers to a two-stage pipeline in which each stage has a distinct task objective and processing flow: the first stage performs semantic understanding and structured intent extraction (intent classification plus extraction of the candidate table-column set T_c, the candidate predicate set P_c, the SQL keyword set K, and the natural language description D), and the second stage performs program synthesis (SQL generation from the structured intent representation I). The two stages use the same backbone model (e.g., Qwen3-8B) for fair comparison with baselines, but differ in inputs, outputs, prompt design, and decoding settings (e.g., temperature). Thus “dual-model” denotes task-level and flow-level specialization, not merely different prompts or hyperparameters. The framework contains three functional modules: intent recognition, context enhancement, and intent correction. The core dual-model design consists of the two specialized stages, supported by the intent correction module as a feedback mechanism that operates across both stages. This architecture decouples the semantic understanding stage from the program generation stage, effectively mitigating semantic-program interference.
The framework contains three core modules: the intent recognition module, the context enhancement module, and the intent correction module. The intent recognition module (first model) extracts structured intent information from Chinese natural language queries, including intent types, candidate table-column sets, predicate template sets, and SQL keyword sets. The context enhancement module (second model) integrates structured information into a complete structured intent representation and generates SQL queries based on this. The intent correction module detects SQL errors through execution feedback and selects different backtracking levels for correction according to error types, operating as a feedback mechanism that can invoke either of the two core models.
The system workflow proceeds as follows. Intent recognition on question Q extracts the intent type y and the structured information components (the candidate table-column set T_c, the candidate predicate set P_c, the SQL keyword set K, and the natural language description D). The context enhancement module then integrates these into the structured intent representation I and generates SQL queries using y and I. Finally, the intent correction module performs error correction. The hierarchical backtracking mechanism adapts to error types: syntax errors trigger local repair; semantic or schema errors backtrack to the context enhancement module; intent deviations backtrack to the intent recognition module. This data flow completes the process from user queries to final SQL generation.
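The workflow above can be sketched as a minimal control loop. All helper functions passed in (intent recognition, SQL generation, repair, execution) are illustrative stand-ins, not the paper's implementation; the toy run at the end simulates a syntax error repaired at the SQL layer.

```python
# Minimal sketch of the CIR-SQL workflow (helper functions are
# illustrative stand-ins, not the paper's actual implementation).

MAX_ITERS = 3  # the paper caps correction at three iterations

def cir_sql(question, recognize_intent, generate_sql, repair_sql, execute):
    y, parts = recognize_intent(question)          # stage 1: intent + (T_c, P_c, K, D)
    sql = generate_sql(question, y, parts)         # stage 2: SQL from structured intent
    for _ in range(MAX_ITERS):                     # stage 3: hierarchical correction
        ok, error_type = execute(sql)
        if ok:
            return sql
        if error_type == "syntax":                 # SQL layer: local repair
            sql = repair_sql(sql, error_type)
        elif error_type in ("semantic", "schema"): # context layer: regenerate SQL
            sql = generate_sql(question, y, parts)
        else:                                      # intent layer: re-recognize intent
            y, parts = recognize_intent(question)
            sql = generate_sql(question, y, parts)
    return sql

# Toy run: the first attempt lacks a terminating semicolon and is
# "repaired" at the SQL layer on the next iteration.
sql = cir_sql(
    "显示生产线A的产量",
    recognize_intent=lambda q: ("simple_select", {"tables": ["production"]}),
    generate_sql=lambda q, y, p: "SELECT output FROM production",
    repair_sql=lambda s, e: s + ";",
    execute=lambda s: (s.endswith(";"), "syntax"),
)
```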
3.2. Intent Recognition Module
The NL2SQL task is formalized as a generation task: SQL queries are generated token-by-token by designing appropriate prompts to guide the large language model M. The input is a triple X = (Q, S, C), where Q is a Chinese natural language question, S = (T, C_S) is the database schema (T = {t_1, …, t_|T|} is the table set and C_S = {c_1, …, c_|C_S|} is the column set), and C is optional context information. The basic generation process can be formalized as

P(Y | X) = ∏_{i=1}^{|Y|} P(y_i | Y_{<i}, X),

where Y is the generated SQL query (a token sequence), y_i is the i-th token of SQL query Y, Y_{<i} is the prefix of Y (the sequence of the first i − 1 tokens), and P(y_i | Y_{<i}, X) is the conditional probability of the i-th token of Y given the prefix Y_{<i} and the input triple X.
The intent recognition module generates the components of the structured intent information I = (y, T_c, P_c, K, D), where y is the intent type (such as simple_select, count_query, etc.), T_c is the candidate table-column set, P_c is the candidate predicate set, K is the SQL keyword set, and D is the natural language description. The function of the intent recognition module is

I = f_intent(X; M),

where f_intent is the intent recognition function, which guides the large language model M to recognize Chinese query intent and extract structured information through prompts. The intent type y is an important constraint on the subsequent generation of the SQL query Y, guiding the SQL generation process through the structured intent representation to ensure that the generated SQL query conforms to the user’s query intent.
The intent recognition module addresses two issues. First, the unique characteristics of the Chinese language (weak grammatical markers, prominent ellipsis, diverse colloquial expressions) create unstable mappings between query intent and SQL structure, necessitating a specialized module for accurate intent identification and structured information extraction. Second, decoupling intent recognition from SQL generation prevents mutual interference among semantic, program, and schema knowledge when processed together, effectively reducing semantic-program conflicts. Prior work on spoken language understanding and intent recognition shows that jointly modeling intent and structured information improves downstream task performance [21,22].
The intent recognition module is activated when structured intent information is needed. For simple queries, general LLM capabilities can be directly used for SQL generation without explicit intent recognition.
Based on probabilistic generative model theory, the intent recognition process can be decomposed into chained conditional probabilities:

P(I | X) = P(y | X) · P(T_c | y, X) · P(P_c | y, T_c, X) · P(K | y, T_c, P_c, X) · P(D | y, T_c, P_c, K, X),

where P(y | X), the three middle factors, and P(D | y, T_c, P_c, K, X) denote the probability distributions for intent recognition, structured information extraction, and description generation, respectively. Extraction proceeds sequentially: first the candidate table-column set T_c, then the candidate predicate set P_c conditioned on T_c, then the SQL keyword set K conditioned on T_c and P_c, and finally the natural language description D conditioned on T_c, P_c, and K. Under prompt engineering, the LLM generates the intent type y and the structured information components through multi-round prompts. Each round invokes the LLM, which computes the corresponding conditional probability and returns results; external code handles prompt construction and result parsing.
Intent type identification is guided by prompt engineering, with the probability distribution

P(y | X) = exp(s(y)) / Σ_{y′} exp(s(y′)),

where s(y) is the score of intent type y assigned by the LLM under prompt guidance, based on the generation probability of the LLM under prompt constraints:

s(y) = log P_M(ℓ_y | ρ_intent(Q, S)),

where ρ_intent is the intent recognition prompt template and ℓ_y is the label text of intent type y. Chinese grammar pattern matching uses a pattern library containing multiple patterns covering the seven intent types, and these patterns are injected into prompts through few-shot examples to guide the LLM in pattern matching and intent recognition.
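The normalization of per-label scores into the intent distribution can be sketched as a numerically stable softmax; the score values below are hypothetical log-probabilities for illustration, not measured outputs.

```python
import math

# Softmax over per-label LLM scores s(y), mirroring
# P(y | X) = exp(s(y)) / sum_{y'} exp(s(y')).
def intent_distribution(scores):
    m = max(scores.values())                       # subtract max for stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Hypothetical label scores (log-probabilities of each label text
# under the intent-recognition prompt):
scores = {"simple_select": -0.2, "count_query": -2.1, "filter_query": -3.0}
dist = intent_distribution(scores)
best = max(dist, key=dist.get)   # -> "simple_select"
```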
Choice of the seven intent categories. The seven categories (simple_select, count_query, filter_query, max_min_query, sort_query, join_query, group_by_query) were chosen to align with both the main SQL clause structures and common Chinese query patterns. We derived them from: (1) standard NL2SQL taxonomies that distinguish selection-only, aggregation, filtering, ordering, joining, and grouping; (2) an analysis of Chinese phrasing in Chase and CSpider (e.g., “有多少” for count, “最大/最小” for extrema, “按…排序” for sort). To verify that these categories are well supported by the data, we inferred intent from SQL structure on Chase (train+dev) and CSpider (train+dev+test); the resulting distribution shows that all seven categories are well represented in both datasets (e.g., join_query and filter_query are the most frequent in Chase; join_query and group_by_query in CSpider). This taxonomy covers the dominant query types in these benchmarks; queries requiring nested subqueries or window functions are handled within the closest category or corrected by the intent correction module. The number of categories is a trade-off between granularity and prompt complexity; we did not tune it on the test set.
The extraction of the candidate table-column set T_c is completed by guiding the LLM through prompt engineering:

P(T_c | y, X) = P_M(T_c | ρ_tc(Q, S, y)),

where ρ_tc is the table-column extraction prompt template; the extraction can be decomposed into first selecting the relevant tables and then the relevant columns within them. Candidate predicate set extraction depends on the intent type y and the extracted table-column set T_c. Prompt engineering guides the LLM to construct predicate conditions that match the intent type and column information. For instance, filter_query requires WHERE conditions based on columns, while simple_select needs no predicates (empty set). SQL keyword set extraction depends on y, T_c, and P_c. The LLM determines the required keywords based on the intent type and the previously extracted information. For example, group_by_query requires GROUP BY and HAVING, while simple_select only needs SELECT and FROM. These differentiated strategies use distinct prompt templates to ensure accurate, targeted structured information extraction.
Industrial and business queries follow patterns that map to the seven intent categories. Production monitoring queries like “显示生产线A的产量” (show production of line A) typically match simple_select, requiring basic table-column selection. Quality control queries such as “统计本月不合格产品数量” (count unqualified products this month) map to count_query with filtering. Equipment maintenance queries like “查询设备B最近10次故障记录” (query the last 10 failure records of equipment B) correspond to filter_query with temporal predicates. Production analysis queries such as “按生产线分组统计平均产量” (group by production line and calculate average production) map to group_by_query. Accurate classification of these industrial query patterns is essential for generating correct SQL queries that access production databases, quality management systems, and equipment monitoring platforms.
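A small slice of such a pattern library can be sketched with ordered regular expressions; the patterns below are drawn from the phrases quoted in the text and are illustrative, not the paper's full library (in practice these patterns are injected into prompts as few-shot examples rather than matched directly).

```python
import re

# Illustrative slice of a Chinese pattern library for the intent
# categories. Order matters: more specific patterns (grouping) are
# checked before generic ones (统计 alone signals count_query).
PATTERNS = [
    (r"按.{0,8}分组|分组统计|分类汇总", "group_by_query"),
    (r"有多少|数量|统计",              "count_query"),
    (r"最大|最小|最高|最低",           "max_min_query"),
    (r"排序|排列",                     "sort_query"),
    (r"显示|列出|查询",                "simple_select"),
]

def match_intent(question):
    for pattern, intent in PATTERNS:
        if re.search(pattern, question):
            return intent
    return None  # fall back to LLM classification

match_intent("统计本月不合格产品数量")    # -> "count_query"
match_intent("按生产线分组统计平均产量")  # -> "group_by_query"
```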
3.3. Context Enhancement Module
The context enhancement module is responsible for integrating the discrete structured information components (T_c, P_c, K, D) extracted by the intent recognition module into a complete structured intent representation I and generating SQL queries based on it. Based on probabilistic generative model theory, CIR-SQL models the Chinese NL2SQL task as an intent-enhanced two-stage generation process:

P(Y | X) = P(I | X) · P(Y | I, X).

The context enhancement module first integrates the discrete structured information components from the intent recognition module into the structured intent representation I. Combining the intent type y with I yields a unified context-enhanced representation that guides SQL generation.
Based on the structured intent representation, the context enhancement module guides the LLM to generate the SQL query Y:

P(Y | I, X) = ∏_{i=1}^{|Y|} P(y_i | Y_{<i}, I, X).

Under the prompt engineering framework, the SQL generation probability is

P(y_i | Y_{<i}, I, X) = P_M(y_i | ρ_sql(X, y, I, Y_{<i})),

where ρ_sql is the SQL generation prompt template containing the input triple X, the intent type y, the structured intent representation I, and the generated prefix Y_{<i}. The context enhancement module uses few-shot learning, dynamically selecting examples similar to the intent type y and the structured information to guide SQL generation. Structured information enters prompts via few-shot examples and explicit constraints, using I to constrain generation and ensure the SQL conforms to the intent.
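The assembly of such a constrained prompt might look like the following sketch; the template wording and field names are assumptions for illustration, not the paper's actual templates.

```python
def build_sql_prompt(question, schema, intent_type, intent_repr, examples):
    """Assemble an SQL-generation prompt: few-shot examples plus the
    structured intent representation rendered as explicit constraints.
    (Illustrative template; not the paper's exact wording.)"""
    shots = "\n\n".join(
        f"Question: {ex['q']}\nSQL: {ex['sql']}" for ex in examples
    )
    constraints = (
        f"Intent type: {intent_type}\n"
        f"Candidate tables/columns: {intent_repr['columns']}\n"
        f"Predicates: {intent_repr['predicates']}\n"
        f"Required keywords: {intent_repr['keywords']}"
    )
    return (f"{shots}\n\nSchema: {schema}\n{constraints}\n"
            f"Question: {question}\nSQL:")
```

Rendering the intent representation as explicit constraint lines, rather than leaving it implicit in the examples, is what lets the generation stage focus on program synthesis.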
3.4. Intent Correction Module
The intent correction module detects and corrects SQL errors using execution feedback and hierarchical backtracking strategies.
Given a flawed SQL query Y′ and error message feedback E obtained from external SQL tools, the intent correction module guides the large language model M to generate the correct SQL query Y:

Y = f_correct(Y′, E, y, I; M),

where f_correct is the correction function that guides the large language model M to perform error correction via prompts. The function incorporates the intent type y and the structured intent representation I as constraints and selects backtracking levels based on error types.

Error detection is implemented through execution feedback and prompt engineering, with error type identification as

e = f_detect(Y′, E),

where the error types e include syntax errors, semantic errors, schema errors, and connection errors.
The intent correction module adopts a hierarchical backtracking strategy, selecting backtracking levels according to error types:

ℓ = f_level(e), ℓ ∈ {SQL, context, intent},

where ℓ denotes the backtracking level. The hierarchical backtracking strategy follows: (1) syntax errors trigger local repair using SQL-layer correction prompts; (2) semantic or schema errors backtrack to the context enhancement module, which uses context-layer correction prompts to regenerate SQL; (3) intent deviations backtrack to the intent recognition module, which uses intent-layer correction prompts to re-identify intent and regenerate structured information. Corrections at each level use the corresponding prompt templates to guide the LLM.
The intent correction process is implemented through iterative optimization, terminating when execution succeeds or the maximum number of iterations is reached [23].
4. Experiments
4.1. Experimental Setup
Experiments use the Chase dataset [24] and the CSpider dataset [25] for evaluation. The Chase dataset contains 755 test samples, emphasizing complex Chinese natural language query processing. The CSpider dataset contains 1034 test samples covering multi-table databases, used to test model generalization capabilities. As shown in Figure 2, the datasets contain user natural language questions, database schema information, and corresponding standard SQL queries, demonstrating the conversion process from natural language to SQL.
Evaluation metrics include execution accuracy (EA) and exact match (EM), which are commonly used evaluation methods in current NL2SQL tasks. EA measures the proportion of queries where the execution result of predicted SQL matches the standard answer, and EM requires identical SQL structure. All experimental results are averaged over three runs to reduce randomness; further discussion of statistical analysis is provided in the limitations.
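The two metrics can be sketched as follows; the string-normalization used for EM here is a simplification (real EM evaluators parse the SQL and normalize clause order and aliases).

```python
def execution_accuracy(pred_results, gold_results):
    """EA: fraction of queries whose execution result matches the gold result."""
    hits = sum(p == g for p, g in zip(pred_results, gold_results))
    return hits / len(gold_results)

def exact_match(pred_sqls, gold_sqls):
    """EM: fraction of predicted SQL strings identical to the gold SQL.
    Case/whitespace normalization only; a simplification of real EM."""
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(p) == norm(g) for p, g in zip(pred_sqls, gold_sqls))
    return hits / len(gold_sqls)
```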
Baseline methods include the direct generation baseline Qwen3-8B (without task decomposition), DIN-SQL, MAC-SQL, DAIL-SQL, C3, and deep learning methods (IGSQL, EditSQL, RAT-SQL). All LLM-based methods use Qwen3-8B as the backbone model [26] to ensure fair comparison. We use Qwen3-8B (checkpoint Qwen/Qwen3-8B), Python 3.10.12, PyTorch 2.0.1, and transformers 4.36.0 for implementation. For reference, we also compare against proprietary large language models (GPT-4, Claude-3.5-Sonnet, and Gemini-Pro) using zero-shot prompting, acknowledging that these comparisons are not entirely fair due to model scale differences but provide useful context for understanding performance relative to state-of-the-art systems. The selection of these baselines references recent NL2SQL surveys [27].
4.2. Main Results
We compared large language models with different parameter scales under the CIR-SQL framework to validate model selection. Qwen3-8B outperforms other small-parameter models in accuracy and stability. We therefore use Qwen3-8B as the base model, which excels in Chinese understanding and code generation for NL2SQL tasks. Both modules use the same backbone (Qwen3-8B) but with distinct task objectives and processing flows (intent extraction vs. SQL generation), as well as different prompt strategies and parameter configurations. The intent recognition module uses temperature 0.1 for stable, deterministic classification. The SQL generation module uses temperature 0.2 to balance diversity and accuracy.
Table 1 shows the performance comparison of different models under the CIR-SQL framework. Qwen3-8B achieves the best performance among tested models. As shown in Table 2, CIR-SQL achieves 52.5% EA and 23.5% EM on the Chase dataset, and 62.2% EA and 47.6% EM on the CSpider dataset, outperforming all baseline methods. Compared to the best baseline method, DIN-SQL, CIR-SQL improves EA by 6.7 and 10.1 percentage points on the Chase and CSpider datasets, respectively.
For reference, we also evaluated proprietary large language models (GPT-4, Claude-3.5-Sonnet, Gemini-Pro) using zero-shot prompting on the Chase dataset. GPT-4 achieves 48.2% EA and 20.1% EM, Claude-3.5-Sonnet achieves 46.8% EA and 19.5% EM, and Gemini-Pro achieves 44.5% EA and 18.3% EM. These results are not directly comparable to our main results: the proprietary models differ in scale and API access and were used in zero-shot mode without our task decomposition. They are reported only as context; the fair comparison is among methods that share the same backbone (Qwen3-8B). Within that setting, CIR-SQL’s gains indicate that task-specific architecture and structured intent representation are effective.
4.3. Chinese Intent Recognition Performance Analysis
As shown in Table 3, there are significant performance differences across intent types. The N columns show the intent-type distribution in the training (and, for CSpider, full) data, inferred from SQL structure; EA and EM are computed on the Chase test set (755 samples), grouped by predicted intent. High-performing types (simple_select: 62.4% EA, count_query: 58.9% EA) contain explicit vocabulary patterns that are easily recognizable by the intent recognition module. Simple select queries often include clear verbs like “显示” (show), “列出” (list), or “查询” (query), making intent classification straightforward. Count queries typically contain explicit counting words like “有多少” (how many), “数量” (quantity), or “统计” (statistics), providing strong signals for intent recognition.
Challenging types (filter_query: 38.6% EA, group_by_query: 40.5% EA) involve complex logical structures that require deeper semantic understanding. Filter queries often contain implicit conditions or complex temporal expressions that are difficult to extract accurately. For example, queries like “上个月的数据” (last month’s data) require understanding temporal relationships and date calculations. Group by queries face challenges with diverse Chinese expressions for aggregation, where the same concept can be expressed in multiple ways (e.g., “按…分组”, “分组统计”, “分类汇总”), making pattern matching more difficult.
The performance gap between simple and complex intent types highlights the need for more sophisticated intent recognition mechanisms that can handle semantic ambiguity and implicit relationships in Chinese queries. Future improvements could incorporate semantic role labeling or dependency parsing to better understand query structure and intent.
4.4. Ablation Studies
As shown in Table 4, ablation studies show the contribution of each module. The baseline model (Qwen3-8B direct generation) achieves 30.5% EA on the Chase dataset. Adding the intent recognition module raises EA to 35.7% (+5.2 percentage points); adding the context enhancement module raises it to 43.3% (+7.6 points); further adding the intent correction module raises it to 52.5% (+9.2 points). Compared to the baseline model, the complete model improves EA by 22.0 and 27.0 percentage points on the Chase and CSpider datasets, respectively. These results indicate that the dual-model architecture and intent enhancement strategy are effective, and that task decoupling and structured intent representation contribute to improved NL2SQL performance.
4.5. Strategy Analysis
CIR-SQL designs differentiated construction strategies for different intent types, improving SQL generation quality through targeted structured information extraction. As shown in Table 5, different intent types adopt different strategies for constructing T_c (the candidate table-column set), P_c (the candidate predicate set), K (the SQL keyword set), and D (the natural language description).
The extraction strategy of the candidate table-column set T_c depends on the intent type y. For simple_select, the system selects all relevant columns from a single table. For join_query, it selects columns from multi-table join paths. The intent type directly affects the column selection scope.
The construction strategy of the candidate predicate set P_c depends on y and the already extracted T_c. Simple_select does not need predicates (empty set), while filter_query needs to construct the required WHERE conditions based on the extracted columns. Table-column information constrains the predicate construction method.
The determination of the keyword set K depends on y, T_c, and P_c. Simple_select only needs SELECT and FROM, while group_by_query needs GROUP BY and HAVING keywords. Previously extracted information guides keyword determination.
Simple query types (simple_select, count_query) mainly focus on single-table column selection and basic semantics. Complex query types (join_query, group_by_query) need to consider multi-table join paths, grouping aggregation, and other complex structures. This differentiated strategy design allows CIR-SQL to extract key information according to query intent characteristics, which guides SQL generation.
4.6. Implementation Details
This section provides implementation details to ensure reproducibility. The intent recognition module generates structured information components through sequential LLM calls. For candidate table-column set extraction, we use a top-K selection strategy for the initial candidate set, then filter to the top five most relevant columns based on semantic similarity scores. The predicate template set is generated by prompting the LLM to construct WHERE conditions based on the extracted columns, with a maximum of three predicate conditions per query to keep SQL complexity manageable. The SQL keyword set is determined by pattern matching against the seven intent categories, with each intent type having a predefined keyword template.
Few-shot example selection uses a similarity-based approach: for each query, we select three to five examples from the training set that share the same intent type and have similar schema structures. The examples are ranked by cosine similarity between query embeddings and example embeddings, computed using sentence transformers. Prompt templates are constructed dynamically based on intent type, with different templates for simple queries (requiring minimal context) versus complex queries (requiring full schema information).
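The few-shot selection step can be sketched as follows; the toy vectors stand in for the sentence-transformer embeddings the paper uses, and the example-pool layout is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_examples(query_vec, intent, pool, n=3):
    """Pick the n training examples that share the query's intent type,
    ranked by embedding similarity (n is three to five in the paper)."""
    same_intent = [ex for ex in pool if ex["intent"] == intent]
    same_intent.sort(key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return same_intent[:n]
```

Filtering by intent before ranking keeps the prompt examples structurally aligned with the target query, which is the point of the intent-first design.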
The hierarchical backtracking strategy uses a maximum of three iterations. Error detection classifies errors into four categories: syntax errors (detected via SQL parser), semantic errors (detected via execution with empty result sets or type mismatches), schema errors (detected via column/table not found exceptions), and connection errors (detected via database connection failures). The backtracking level selection follows: syntax errors → SQL layer (local repair), semantic/schema errors → Context layer (regenerate SQL), intent errors → Intent layer (re-identify intent).
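The detection-and-routing logic above can be sketched with SQLite, using EXPLAIN to compile a statement without running it. The message patterns are SQLite-specific assumptions (other engines need different rules), and intent errors, which are detected upstream, appear in the table only for completeness.

```python
import sqlite3

# Backtracking level per error category, as described in the text.
BACKTRACK_LEVEL = {
    "syntax":   "sql_layer",      # local repair
    "semantic": "context_layer",  # regenerate SQL
    "schema":   "context_layer",  # regenerate SQL
    "intent":   "intent_layer",   # re-identify intent
}

def classify_error(exc):
    """Map a database exception to an error category (SQLite message
    patterns; assumed, not the authors' exact rules)."""
    msg = str(exc).lower()
    if "syntax error" in msg:
        return "syntax"
    if "no such table" in msg or "no such column" in msg:
        return "schema"
    if "unable to open" in msg or "database is locked" in msg:
        return "connection"
    return "semantic"

def probe(sql, conn):
    """Return None if the SQL compiles against the schema, else its
    error category."""
    try:
        conn.execute("EXPLAIN " + sql)
        return None
    except sqlite3.Error as exc:
        return classify_error(exc)
```

Compiling via EXPLAIN catches syntax and schema errors cheaply before the query touches production data; semantic errors (empty results, type mismatches) still require actual execution.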
Table 6 reports computational costs and efficiency metrics. On average, CIR-SQL requires 2.3 LLM calls per query (1.0 for intent recognition, 1.0 for SQL generation, 0.3 for error correction). The average prompt length is 1850 tokens, and the average completion length is 120 tokens. The total average token usage per query is approximately 2200 tokens (1850 input + 350 output). The average latency is 3.2 s per query, with intent recognition taking 0.8 s, SQL generation taking 1.5 s, and error correction taking 0.9 s when needed. For comparison, the single-pass baseline (Qwen3-8B direct) uses about 1.0 LLM call, ∼1200 tokens per query, and ∼1.2 s latency; CIR-SQL incurs higher cost due to multi-stage processing but achieves substantially better accuracy, making it suitable for scenarios where accuracy is prioritized over cost.
An analysis of intent misclassification failures reveals that ambiguous queries with multiple possible interpretations pose challenges. For example, the query “找出销售额最高的产品” (find the product with highest sales) could be interpreted as either max_min_query (using MAX function) or sort_query (using ORDER BY LIMIT 1). When misclassified as sort_query, the generated SQL uses ORDER BY instead of MAX, leading to semantic differences in edge cases. This suggests the need for more sophisticated intent disambiguation mechanisms, possibly incorporating query result validation or multi-intent handling.
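The semantic gap between the two readings surfaces when sales values tie. The toy SQLite session below (schema and values are illustrative) makes the edge case concrete: the sort_query reading drops tied winners, while the max_min_query reading keeps them all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, sales REAL)")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [("A", 100.0), ("B", 300.0), ("C", 300.0)])

# sort_query reading: ORDER BY ... LIMIT 1 returns a single,
# arbitrarily chosen row among the tied maxima.
top_sort = conn.execute(
    "SELECT name FROM product ORDER BY sales DESC LIMIT 1").fetchall()

# max_min_query reading: returns every product attaining the maximum.
top_max = conn.execute(
    "SELECT name FROM product "
    "WHERE sales = (SELECT MAX(sales) FROM product)").fetchall()
```

With B and C tied at 300, `top_sort` holds one row while `top_max` holds both, so the two intent readings are not interchangeable.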
Complex query failures often involve nested subqueries with correlated conditions or window functions. These queries require deeper understanding of SQL semantics and multi-step reasoning that current prompt-based approaches struggle with. Future work could explore specialized modules for handling complex SQL patterns or incorporating SQL execution feedback earlier in the generation process.
4.7. Discussion
The experimental results show that CIR-SQL’s dual-model architecture addresses the interference problem in Chinese NL2SQL tasks. Performance improvements over baseline methods indicate that separating intent recognition from SQL generation is beneficial. Several limitations and trade-offs are discussed below.
Computational Cost vs. Accuracy Trade-off. CIR-SQL’s multi-stage processing requires more computational resources (2.3 LLM calls and 2200 tokens per query on average) compared to single-pass methods. The accuracy improvements justify this cost for applications where accuracy is critical, such as industrial database querying where incorrect SQL can lead to operational errors. For high-throughput scenarios with lower accuracy requirements, single-pass methods may be more appropriate.
Intent Classification Limitations. The seven-category intent classification system covers most common query patterns but struggles with ambiguous queries that could map to multiple intent types. The current approach selects a single intent, which may not capture the full complexity of some queries; when a query inherently requires multiple intents (e.g., both aggregation and ordering), forcing a single category can lead to suboptimal SQL or failure. Multi-label intent detection has been studied in dialogue systems [22] and could be adapted for NL2SQL; hierarchical intent structures that allow primary and secondary intent types are another promising direction for future work.
Generalization to Other Languages. While CIR-SQL is designed specifically for Chinese NL2SQL, the dual-model architecture and structured intent representation approach could potentially generalize to other languages. The key challenge would be adapting the intent classification system and pattern matching to language-specific characteristics. The framework’s modular design supports such adaptation, as the intent recognition module can be retrained or reconfigured for different languages while maintaining the overall architecture.
Industrial Application Considerations. For Industry 4.0 applications, CIR-SQL’s 3.2-s average latency may be acceptable for ad hoc queries but could be limiting for real-time monitoring systems requiring sub-second responses. The hierarchical backtracking mechanism adds latency when errors occur. Future optimizations could include parallel processing of intent recognition and initial SQL generation, or caching of common query patterns to reduce latency.
Comparison with Proprietary Models. While CIR-SQL outperforms GPT-4, Claude-3.5-Sonnet, and Gemini-Pro on the evaluated datasets, these comparisons are not entirely fair due to model scale differences. The results suggest that task-specific architecture design can achieve competitive performance with smaller models, which is important for deployment scenarios with resource constraints or data privacy requirements. The dual-model approach’s interpretability (through structured intent representations) also offers advantages over black-box, single-model approaches for industrial applications requiring explainability.
System Integration for Industry 4.0 Platforms. CIR-SQL can be integrated with Industry 4.0 platforms (MES, ERP, IoT systems) through a RESTful API architecture. The framework accepts natural language queries via HTTP POST requests, processes them through the dual-model pipeline, and returns SQL results or structured data in JSON format. For security, the system should implement authentication mechanisms (OAuth 2.0 or API keys) and query validation to prevent SQL injection risks. The modular architecture allows deployment as a microservice, enabling horizontal scaling for high-throughput industrial applications. Integration with existing database systems requires minimal configuration: the framework connects to industrial databases via standard SQL connectors (e.g., SQLAlchemy, JDBC) and can handle multiple database schemas simultaneously through schema-aware routing. Performance requirements for real-time industrial applications can be met through distributed deployment, where intent recognition and SQL generation modules can be scaled independently based on workload characteristics.
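A minimal sketch of the request/response contract described above, with the HTTP framework, authentication, and the dual-model pipeline abstracted away; `pipeline` is a hypothetical stand-in for the NL2SQL stages, and the field names are assumptions.

```python
import json

def handle_request(body, pipeline):
    """Parse a JSON POST body, run the NL2SQL pipeline, and return a
    JSON response. Auth (OAuth 2.0 / API keys) and SQL validation,
    which the text requires for production use, are omitted here."""
    try:
        request = json.loads(body)
        query = request["query"]
    except (json.JSONDecodeError, KeyError):
        return json.dumps({"status": "error",
                           "message": "request must contain a 'query' field"})
    return json.dumps({"status": "ok", "sql": pipeline(query)})
```

Keeping the handler framework-agnostic is what allows the microservice deployment the text describes: the same function can sit behind any WSGI/ASGI server and be scaled independently of the model workers.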
4.8. Case Study Analysis
To illustrate how CIR-SQL processes Chinese NL2SQL queries, we present three detailed case studies covering different intent types and complexity levels.
Case Study 1: Simple Select Query. Consider the query “显示所有学生的姓名和年龄” (show all students’ names and ages). The intent recognition module first classifies this as simple_select intent, recognizing the explicit pattern “显示…” (show…). The module then extracts the candidate table-column set by matching “学生” (student) to the student table and “姓名” (name) and “年龄” (age) to the corresponding columns. Since this is a simple select query, the predicate set is empty (no predicates needed) and the keyword set contains only SELECT and FROM. The context enhancement module integrates this structured information into the generation prompt and produces the SQL: SELECT name, age FROM student. This query executes successfully without requiring error correction, indicating that CIR-SQL handles straightforward queries with clear semantic patterns well.
Case Study 2: Complex Filter Query with Temporal Conditions. The query “查询最近一个月内订单金额大于1000元的客户信息” (query customer information for orders with an amount greater than 1000 in the last month) presents greater challenges. The intent recognition module identifies this as filter_query intent, recognizing the temporal expression “最近一个月内” (within the last month) and the comparison “大于” (greater than). The module extracts candidate columns from the order and customer tables, recognizing that the query requires a join between them. The predicate set includes temporal filtering (date within the last month) and a value comparison (amount > 1000), and the keyword set includes SELECT, FROM, JOIN, and WHERE. The context enhancement module generates initial SQL with a join condition, but the first attempt contains a syntax error in the date comparison. The intent correction module detects this as a syntax error, triggers SQL-layer backtracking, and corrects the date function usage, producing: SELECT customer.* FROM customer JOIN order ON customer.id = order.customer_id WHERE order.amount > 1000 AND order.date >= DATE('now', '-1 month'). This case shows how hierarchical backtracking handles syntax errors without full regeneration.
Case Study 3: Group By Query with Aggregation. The query “按部门统计每个部门的平均工资” (compute the average salary for each department) requires grouping and aggregation. The intent recognition module classifies this as group_by_query intent, recognizing the pattern “按…统计” (statistics by…). The module extracts the department and salary columns from the employee table and constructs the grouping condition on department. The keyword set includes SELECT, FROM, and GROUP BY. The context enhancement module generates the SQL: SELECT department, AVG(salary) FROM employee GROUP BY department. However, the initial execution reveals a semantic error: the query may return departments that do not exist in the current data. The intent correction module detects this as a semantic error, triggers context-layer backtracking, and regenerates the SQL with additional filtering to ensure only existing departments are returned. This case shows how different error types trigger the appropriate backtracking levels for error correction.
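The generated GROUP BY query from this case can be exercised on a toy employee table (values illustrative; ORDER BY is added only to make the output deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, department TEXT, salary REAL)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [("a", "HR", 5000.0), ("b", "HR", 7000.0),
                  ("c", "IT", 9000.0)])

# The case study's SQL, with ORDER BY appended for stable output.
rows = conn.execute(
    "SELECT department, AVG(salary) FROM employee "
    "GROUP BY department ORDER BY department").fetchall()
# rows == [('HR', 6000.0), ('IT', 9000.0)]
```

Because GROUP BY only produces groups that exist in the data, a probe like this is one way execution feedback can confirm or refute the semantic error the correction module flags.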
These case studies show that CIR-SQL handles queries of varying complexity, from simple selections to complex multi-table joins with temporal and aggregation operations. The structured intent representation guides SQL generation, while hierarchical backtracking handles errors.
Case Study 4: Industrial Order Tracking Query. To demonstrate Industry 4.0 applicability, we present a real case from the tracking_orders database, which models order management systems commonly used in manufacturing and supply chain operations. The query “最新订单的ID是什么?” (what is the ID of the latest order?) requires temporal ordering and limit operations. The intent recognition module classifies this as sort_query intent, recognizing the pattern “最新…” (latest…), which indicates ordering by date in descending order. The module extracts the order_id and date_order_placed columns from the orders table. The predicate set is empty (no filtering needed), and the keyword set includes SELECT, FROM, ORDER BY, and LIMIT. The context enhancement module generates the SQL: SELECT order_id FROM orders ORDER BY date_order_placed DESC LIMIT 1. This query executes successfully, showing that CIR-SQL handles industrial order tracking queries with temporal semantics.
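The sort_query SQL from this case can likewise be checked against a toy orders table (column names follow the case study; the dates are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, date_order_placed TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-05"), (2, "2024-03-12"), (3, "2024-02-20")])

# The case study's SQL: ISO date strings sort lexicographically,
# so DESC + LIMIT 1 yields the most recently placed order.
latest = conn.execute(
    "SELECT order_id FROM orders "
    "ORDER BY date_order_placed DESC LIMIT 1").fetchone()
# latest == (2,)
```

The lexicographic-ordering trick works only for ISO-8601 date strings; databases storing other date formats would need an explicit date conversion in the ORDER BY clause.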
Case Study 5: Industrial Manufacturer Management Query. Another real case from the manufactory_1 database involves manufacturer information retrieval, which is essential for supply chain management in Industry 4.0. The query “由“强东”创立的公司的总部在哪里?” (where is the headquarters of the company founded by “强东”?) requires filtering and attribute selection. The intent recognition module classifies this as filter_query intent, recognizing the pattern “由…创立” (founded by…), which indicates a WHERE condition on the founder field. The module extracts the headquarter and founder columns from the manufacturers table. The predicate set includes the condition founder = '强东', and the keyword set includes SELECT, FROM, and WHERE. The context enhancement module generates the SQL: SELECT headquarter FROM manufacturers WHERE founder = '强东'. This case shows how CIR-SQL handles manufacturer and supplier information queries, which are common in industrial supply chain management scenarios.
Case Study 6: Industrial Aggregation Query. The query “所有公司的平均收益、最大收益和总收益是多少?” (what are the average, maximum, and total revenue of all companies?) from the manufactory_1 database requires multiple aggregation functions. The intent recognition module classifies this as group_by_query intent, recognizing the aggregation patterns “平均…” (average…), “最大…” (maximum…), and “总…” (total…). The module extracts the revenue column from the manufacturers table; because the aggregation runs over all records, no grouping condition is needed. The keyword set includes SELECT and FROM together with the AVG, MAX, and SUM aggregate functions. The context enhancement module generates the SQL: SELECT AVG(revenue), MAX(revenue), SUM(revenue) FROM manufacturers. This case shows that CIR-SQL handles complex aggregation queries with multiple functions, which are essential for financial analysis and performance monitoring in industrial operations.
4.9. Error Analysis
We analyze failure cases to understand CIR-SQL’s limitations and identify areas for improvement.
Table 7 categorizes errors from the Chase test set. Syntax errors account for 12.3% of failures, primarily due to incorrect SQL keyword ordering or missing parentheses in complex nested queries. Semantic errors represent 28.7% of failures, often caused by incorrect table joins or missing WHERE conditions that lead to incorrect result sets. Schema errors constitute 18.5% of failures, typically resulting from incorrect column name mappings or table aliasing issues. Intent misclassification accounts for 15.2% of failures, where the intent recognition module incorrectly categorizes queries, leading to inappropriate structured information extraction. The remaining 25.3% of failures involve complex queries with multiple nested subqueries or advanced SQL features (e.g., window functions, CTEs) that exceed the current framework’s capabilities.
We also analyze failure cases from industrial-related queries in the CSpider dataset (398 samples from databases including tracking_orders, manufactory_1, customers_and_invoices, product_catalog, department_store, and customer_complaints). Quantitative analysis shows that industrial queries achieve similar overall accuracy to general queries (approximately 52–55% EA), but exhibit distinct error patterns. Order tracking queries (tracking_orders, 60 samples) show a higher rate of temporal expression handling errors (18% of failures) compared to general queries (12%). Manufacturer management queries (manufactory_1, 80 samples) have a higher intent misclassification rate (22% of failures) due to ambiguous sorting vs. aggregation patterns. Product catalog queries (product_catalog, 42 samples) show increased schema error rates (25% of failures) due to complex multi-table relationships. These quantitative differences highlight industrial-specific challenges: enumeration value handling in status tracking, intent disambiguation for financial metrics, and schema complexity in product management systems. Future improvements should focus on better handling of enumeration values, more accurate intent classification for sorting vs. aggregation queries, and improved table alias management in complex join scenarios.
5. Conclusions
We propose CIR-SQL, a dual-model iterative optimization framework for intent-enhanced Chinese NL2SQL that addresses accuracy problems caused by semantic-program interference. CIR-SQL decouples NL2SQL into intent recognition and SQL generation by introducing structured intent intermediate representation to separate semantic and program layers, effectively mitigating such interference. Experimental validation on the Chase and CSpider datasets shows that the dual-model architecture and intent enhancement strategy are effective.
CIR-SQL demonstrates broad applicability across various domains, including Industry 4.0 scenarios. The seven-category intent classification system handles diverse query patterns commonly found in business and industrial contexts. For example, production monitoring queries map to simple_select and count_query intents, quality control analysis uses filter_query and group_by_query, equipment maintenance tracking employs filter_query with temporal predicates, and supply chain optimization requires join_query across multiple tables. Chinese-speaking operators, domain experts, and business users can interact with complex databases using natural language queries without SQL expertise. This reduces technical barriers to data-driven decision-making and supports various analytical tasks. The dual-model architecture suits applications where users query multi-table databases containing structured information. The hierarchical backtracking strategy handles errors effectively, which is important for applications where query accuracy directly affects decision quality.
Future work will explore finer-grained intent classification for complex queries, adaptive strategies to adjust model parameters based on database complexity and schema characteristics, domain-specific knowledge bases for specialized terminology, and optimized hierarchical backtracking for specific query patterns. We also plan to investigate CIR-SQL integration with enterprise platforms such as MES and ERP systems for seamless natural language querying, and evaluate the framework on additional datasets with diverse schema structures and query types.