Article

ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models

by Benjamin G. Ascoli *, Yasoda Sai Ram Kandikonda and Jinho D. Choi *
Department of Computer Science, Emory University, Atlanta, GA 30322, USA
* Authors to whom correspondence should be addressed.
Future Internet 2025, 17(8), 325; https://doi.org/10.3390/fi17080325
Submission received: 13 June 2025 / Revised: 13 July 2025 / Accepted: 18 July 2025 / Published: 23 July 2025
(This article belongs to the Section Big Data and Augmented Intelligence)

Abstract

The task of Text-to-SQL enables anyone to retrieve information from SQL databases using natural language. While substantial progress has been made on this task, the two primary evaluation metrics—Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM)—suffer from inherent limitations that can misrepresent performance. Specifically, ESM’s rigid matching overlooks semantically correct but stylistically different queries, whereas EXE can overestimate correctness by ignoring structural errors that yield correct outputs. These shortcomings become especially problematic when assessing outputs from large language model (LLM)-based approaches without fine-tuning, which vary more in style and structure compared with their fine-tuned counterparts. Thus, we introduce a new metric, Enhanced Tree Matching (ETM), which mitigates these issues by comparing queries using both syntactic and semantic elements. Through evaluating nine LLM-based models, we show that EXE and ESM can produce false positive and negative rates as high as 23.0% and 28.9%, while ETM reduces these rates to 0.3% and 2.7%, respectively. We release our ETM script as open-source, offering the community a more robust and reliable approach to evaluating Text-to-SQL.

1. Introduction

While interacting with SQL databases through natural language interfaces makes them significantly more accessible to non-experts, the task of automatically mapping natural language requests to SQL queries for relational databases, known as Text-to-SQL, remains challenging. Recently, the advent of the transformer [1] and large language models (LLMs) [2,3] has led to major advances in this field. Notably, LLMs have overcome several challenges in Text-to-SQL, evidenced by their dominance on leaderboards for popular benchmarks like the Spider dataset [4] and the more challenging BIRD dataset [5], underscoring their effectiveness in handling complex, multi-table SQL query generation that previous approaches struggled with.
Evaluating Text-to-SQL models is also challenging because SQL equivalence has been shown to be undecidable [6]. Text-to-SQL models are tested using two metrics: Execution Accuracy (EXE) and Exact Set Matching Accuracy (ESM). EXE checks if the SQL execution result of the predicted query matches that of the gold standard query. However, EXE can yield false positives, as semantically different queries may produce the same output (Figure 1a). On the other hand, ESM assesses the predicted query by comparing sets of keywords and their arguments with those of the gold query. While more rigorous than EXE, ESM is prone to false negatives because SQL queries may be semantically equivalent yet syntactically diverse (Figure 1b). These issues raise the need for a more robust evaluation metric that accurately evaluates the performance of Text-to-SQL models.
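For intuition, the following minimal sketch (ours, not the official Spider evaluation script; the database path, table, and queries are hypothetical) captures the essence of an EXE-style check: only the executed outputs are compared, so two structurally different queries match whenever the data at hand fails to distinguish them.

```python
import sqlite3
from collections import Counter

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """EXE-style check: compare only the executed outputs of two queries.

    Rows are compared as multisets (order-insensitive, duplicate-sensitive),
    mirroring the stricter variant of EXE discussed in this paper.
    """
    with sqlite3.connect(db_path) as conn:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)

# As in Figure 1a, two semantically different queries "match" whenever the
# database happens to contain no row that distinguishes them, e.g., no dog
# aged exactly 10 (a false positive for EXE):
# execution_match("pets.db",
#                 "SELECT name FROM dogs WHERE age > 10;",
#                 "SELECT name FROM dogs WHERE age >= 10;")
```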
Models using pretrained LLMs without fine-tuning, such as GPT (henceforth PLM), perform particularly well on EXE, which is the main metric used on the Spider and BIRD leaderboards. Surprisingly, they do not show a similar level of performance on ESM. When using ESM as the primary metric, no PLM-based models rank highly, a stark contrast to the Spider and BIRD leaderboards. Therefore, it is critical to examine these metrics and determine the most suitable approach for an accurate evaluation of model performance, as the disparity between them disproportionately impacts PLM-based models compared with those using fine-tuned LLMs (henceforth, FLM).
Our main contributions are as follows:
  • We identify and analyze critical mismatches between commonly used standard Text-to-SQL evaluation metrics (Execution Accuracy and Exact Set Matching).
  • We introduce and implement Enhanced Tree Matching (ETM), a novel metric that integrates syntactic normalization and rule-based semantic equivalence to better evaluate structural correctness in SQL generation.
  • We conduct comprehensive empirical studies across nine models on the Spider and BIRD benchmarks, showing that ETM reduces variance in evaluation and resolves failure cases missed by EXE and ESM.
  • We present detailed error analyses and rule-level ablations, providing insight into the kinds of reasoning and structure that ETM captures more effectively than existing metrics.
This paper first examines potential issues in EXE and ESM and proposes a new enhanced metric, called ETM, which addresses many shortcomings present in the original metrics (Section 3). Nine state-of-the-art models are evaluated on the Spider and BIRD datasets, comparing their performance using EXE, ESM, and ETM (Section 4). Finally, a comprehensive error analysis is conducted on the evaluation results using these three metrics, revealing the superior robustness of ETM (Section 5). We posit that ETM will serve as a pivotal metric for assessing the real capabilities of LLM-based Text-to-SQL models, thereby enabling them to reach new heights of performance. All our resources, including the new evaluation script and the model outputs, are available through our open-source project: https://github.com/emorynlp/ETM, accessed on 17 July 2025.

2. Related Work

2.1. Text-to-SQL Models

The current state-of-the-art performance has been achieved by PLM-based models using GPT [7]. Dong et al. [8] introduced C3, which employs schema linking to rank relevant tables and columns and prompts GPT to generate SQL queries. Pourreza and Rafiei [9] proposed DIN-SQL, which predicts schema links, classifies query difficulty, and prompts GPT using template-based queries with debugging prompts. Gao et al. [10] presented DAIL-SQL, which searches for similar questions in the training set and uses them to create a few-shot prompt with GPT to generate an initial query. This query is then used to find more similar examples in the training set, and the most similar ones are used in a second few-shot prompt to generate the final query. Despite achieving high ranks on the Spider leaderboard, which is evaluated on EXE, none of these PLM-based models appear on the CoSQL leaderboard, which is evaluated on ESM [11].
Several FLM-based models, such as fine-tuned T5 [3], have also been introduced, showing comparable results to PLM-based models on Spider. RASAT [12], coupled with PICARD [13], incorporates relation-aware self-attention, enabling better schema understanding while inheriting pre-trained weights from T5. Li et al. [14] introduced Graphix-T5, which augments T5 with graph-aware layers to integrate semantic information from transformer blocks with structural information from graph neural networks. Li et al. [15] presented RESDSQL that utilizes an encoder to identify relevant schema items and a decoder to first generate the SQL skeleton with keywords. Li et al. [16] introduced CodeS, an open-source series of language models specifically designed for Text-to-SQL. CodeS undergoes incremental pre-training with a curated SQL-centric corpus and uses schema filtering and prompt construction techniques. SuperSQL [17], a hybrid FLM/PLM framework, uses a genetic learning algorithm to swap model components to improve its output.

2.2. Evaluation of SQL Equivalence

Although evaluating the equivalence of two queries plays a crucial role in advancing Text-to-SQL models, few works have addressed this challenge. Chu et al. [18] introduced Cosette, an automatic SQL solver that compiles queries over relational tables and checks their semantic equivalence, producing counterexamples when not equivalent; however, it supports limited SQL operations. Zhou et al. [19] presented EQUITAS, an automated verification tool that transforms SQL queries into first-order logic to verify equivalence, although its source code is not publicly available. Zhong et al. [20] proposed test-suite execution matching to measure semantic accuracy, which generates a small suite of slightly altered databases to help reduce the false positive rate of EXE. However, this approach is not scalable and suffers from the long execution times of some queries, especially when dealing with larger databases. More recently, Nooralahzadeh et al. [21] introduced soft and partial execution accuracy, which aimed to reduce error from ambiguous questions by allowing multiple answers to be correct. However, in doing this, it relaxes the definition of semantic equivalence, allowing for more false positives. Song et al. [22] evaluated SQL queries using the editing difference between their abstract syntax trees (TSED), which faces challenges, as two queries can vary in structure but still be equivalent. Zhan et al. [23] proposed FuncEvalGMN, which compares queries by transforming them into relational operator trees and using a graph matching network to assess functional equivalence by comparing their embeddings. However, it relies on an embedding similarity threshold for accurate results, which can struggle with subtle semantic differences and threshold tuning.
Therefore, the most accessible and widely used automatic evaluation approaches for Text-to-SQL remain EXE and ESM [4]. Their combined evaluation script provides options to disable value and distinct checks, which were employed due to prior model limitations. However, despite the proficiency of LLM-based models in handling these aspects, the results in the literature for Spider are still reported with both value and distinct checking disabled, obscuring the true performance of LLM-based models in real-world applications. BIRD uses EXE as its primary metric, but evaluates the query output without regard to duplicate rows or ordering. Since this can lead to even more false positives, we focus primarily on the original EXE metric, which has more strict output matching.
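To make this distinction concrete, the following toy Python comparison (our illustration; the rows are made up) contrasts BIRD-style duplicate-insensitive matching with the stricter multiset matching of the original EXE:

```python
from collections import Counter

pred_rows = [("Ace",), ("Ace",)]   # hypothetical predicted output with a duplicate row
gold_rows = [("Ace",)]             # hypothetical gold output

print(set(pred_rows) == set(gold_rows))          # True:  BIRD-style, ignores duplicates
print(Counter(pred_rows) == Counter(gold_rows))  # False: strict EXE-style multiset match
```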

3. Materials and Methods

For a comprehensive analysis of the two metrics, Execution Accuracy (EXE) and Exact Set Matching (ESM), we evaluate nine models on the Spider [4] and BIRD [5] datasets. Cases of false positives (Section 3.1) and negatives (Section 3.2) in ESM are thoroughly examined through this analysis and addressed in our new metric, ETM (Section 3.3).

3.1. False Positives in ESM

We first analyze the queries predicted by the models along with their gold standard counterparts that are considered equivalent by ESM but not by EXE. Since ESM is a more stringent metric, it is expected that no query pair considered a mismatch by EXE would be considered a match by ESM. Upon closer inspection, however, it becomes evident that ESM has several shortcomings in its evaluation approach.
One major issue is that ESM does not account for JOIN conditions, which are essential parts of many SQL queries. In Figure 2, the two queries produce different outputs such that EXE correctly considers them a mismatch. ESM mistakenly considers them a match, however, because it ignores the JOIN conditions (t2.breed_code vs. t2.breed_name).
Another issue arises when evaluating queries with the DISTINCT keyword. Even when distinct checks are enabled in the ESM script (Section 2.2), it considers DISTINCT only within aggregate keywords, like COUNT or AVG, failing to recognize it in simpler and more commonly used cases (Figure 3).
Additionally, the ESM script ignores specified LIMIT values even when value checks are enabled (Figure 4).

3.2. False Negatives in ESM

We also analyze the predicted and gold query pairs that EXE finds equivalent but not ESM. Some of these cases are false positives for EXE, where the queries are semantically distinct but still return the same result when executed. The other cases involve queries that are semantically equivalent but syntactically distinct, causing ESM to mistakenly consider them a mismatch. These false negatives occur because assessing semantic equivalence is often contingent on certain assumptions about the database.
In Figure 5, the queries are semantically equivalent only if the column dog_id is NON_NULL. This can be verified by the database schema, which gives information about tables and columns, such as primary-foreign key relationships and constraints. Likewise, the queries in Figure 3 can also be considered a match if the column name in the table dogs is UNIQUE. To this end, we carefully examine every false negative case and compile verifiable assumptions that are sufficiently general for any database schema to alleviate this challenge.
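Such assumptions can be verified mechanically against the schema. As an illustration, the sketch below (our own, written for a reasonably recent SQLite; the table and column names are placeholders) checks the NON_NULL and UNIQUE conditions used throughout this section via PRAGMA metadata queries:

```python
import sqlite3

def is_non_null(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """True if the column carries a NOT NULL constraint or is a primary key."""
    for _cid, name, _type, notnull, _default, pk in conn.execute(f"PRAGMA table_info({table})"):
        if name == column:
            return bool(notnull) or bool(pk)
    return False

def is_unique(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """True if the column is the primary key or covered by a single-column UNIQUE index."""
    if any(name == column and pk
           for _cid, name, _type, _nn, _dflt, pk in conn.execute(f"PRAGMA table_info({table})")):
        return True
    for _seq, index_name, unique, _origin, _partial in conn.execute(f"PRAGMA index_list({table})"):
        if unique:
            cols = [row[2] for row in conn.execute(f"PRAGMA index_info({index_name})")]
            if cols == [column]:
                return True
    return False

# e.g., the pair in Figure 5 should only be accepted if the assumption holds:
# is_non_null(conn, "dogs", "dog_id")
```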

3.3. New Evaluation Metric

We present Enhanced Tree Matching (ETM), a new evaluation metric that compares queries based on their abstract syntax tree (AST) rather than the set-based matching approach of ESM. ETM applies a set of verifiable equivalence rules to transform queries into normalized forms before comparison, reducing the false negatives present in ESM (Section 3.2). Figure 6 shows a high-level overview of ETM’s process for comparing SQL queries. Queries are parsed into ASTs and then normalized using predefined equivalence rules, so that structural differences that do not affect query meaning, such as aliases, are factored out. After normalization, originally different ASTs become equivalent, enabling semantic rather than textual query comparison.
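As a simplified illustration of this parse-normalize-compare flow (not the released ETM script), the sketch below uses the third-party sqlglot parser to apply two of the preprocessing rules from Table 1: lower-casing identifiers (P0) and dropping column aliases (P6, assuming the alias is not referenced elsewhere).

```python
import sqlglot                      # pip install sqlglot
from sqlglot import exp

def normalize(sql: str) -> str:
    """Apply a tiny subset of the Table 1 preprocessing rules to an AST."""
    tree = sqlglot.parse_one(sql, read="sqlite")

    def rules(node):
        # Rule P0: SQL identifiers are case-insensitive, so lower-case them all.
        if isinstance(node, exp.Identifier):
            return exp.to_identifier(node.name.lower())
        # Rule P6: drop a column alias (SELECT col AS c -> SELECT col); the real
        # metric must first confirm the alias is not referenced elsewhere.
        if isinstance(node, exp.Alias):
            return node.this
        return node

    return tree.transform(rules).sql()

def etm_like_match(pred_sql: str, gold_sql: str) -> bool:
    """Compare the normalized forms rather than the raw strings."""
    return normalize(pred_sql) == normalize(gold_sql)

# etm_like_match("SELECT C1 AS x FROM T1", "select c1 from t1")  -> True
```

In the full metric, the complete rule set in Table 1, Table 2 and Table 3 is applied in this manner before the normalized trees are compared.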
ETM addresses all the issues in Section 3.1 and Section 3.2, as well as other critical issues. Table 1, Table 2 and Table 3 provide a full list of equivalent queries and verifiable assumptions incorporated into ETM. The preprocessing rules (Table 1) handle basic syntactic variations like case sensitivity, table prefixes, column ordering, and aliasing, while the advanced equivalence rules (Table 2 and Table 3) capture more complex semantic relationships that require database schema constraints for verification.
The following details additional key updates in ETM.
  1. Keywords: The keywords LEFT JOIN, RIGHT JOIN, OUTER JOIN, INNER JOIN, CAST, CASE, and others, previously disregarded by ESM, are now properly considered.
  2. Foreign Key Preservation: ESM rebuilds queries such that all foreign keys become their primary key counterparts, causing false positives. In ETM, all foreign keys are preserved.
  3. Join Conditions: ESM never compares JOIN conditions between queries. Conditions for any JOIN are correctly assessed by ETM.
  4. Local Aliases: ESM extends aliases to the entire query, causing issues in subqueries where aliases are local. ETM properly scopes aliases to their corresponding subqueries (Listing 1); a minimal sketch of such scoping follows this list.
Listing 1. ESM evaluates this incorrectly because it does not recognize that t is an alias for t3 only within the subquery, while in the outer query it is an alias for t1.
SELECT c1 FROM t1 AS t JOIN t2 ON t.c1=t2.c2 WHERE c1 IN (SELECT c3 FROM t3 AS t);
  5. DISTINCT: While ESM checks for DISTINCT only within aggregate functions, ETM consistently considers it across the entire query (Section 3.1).
  6. IN with lists: ESM allows the keyword IN followed by a subquery but not by a list of values. ETM properly parses and evaluates lists within the IN keyword (Listing 2).
Listing 2. ESM disregards this query as it cannot parse a list of values.
SELECT c1 FROM t1 WHERE c1 IN (1, 2, 3);
  7. Complex Queries: ESM allows only a single subquery, intersection, or union operator. ETM correctly parses any query.
  8. Retrieval from Subquery: Queries retrieving columns from a subquery, such as SELECT c1 FROM (SELECT * FROM t1), are not properly parsed by ESM. ETM properly allows retrieving columns from subqueries.
  9. Parentheses: Queries using parentheses to order conditional statements are not handled correctly by ESM (Listing 3). ETM handles parentheses correctly.
Listing 3. ESM incorrectly parses this query the same way with and without the parentheses.
SELECT c1 FROM t1 WHERE c1 = x AND (c2 = y OR c1 = z);
  10. Alias Definition: In ESM, only table names can have aliases, and they must be defined with the optional AS keyword. ETM properly evaluates all aliases, including those for columns and expressions, and correctly allows aliases to be defined without AS.
  11. Quote Types: In SQL, single quotes denote a string literal, while double quotes can be used for column names or literals. ESM incorrectly treats all quotes the same way, while ETM correctly handles the different quote types.
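To make item 4 above concrete, the following minimal sketch (our illustrative logic, not ETM’s implementation) resolves aliases with a stack of per-query scopes, so that the alias t in Listing 1 maps to t3 inside the subquery and back to t1 outside it.

```python
from typing import Optional

class AliasScopes:
    """Resolve table aliases per (sub)query scope, as item 4 requires."""

    def __init__(self):
        self.stack = []                       # one {alias: table} dict per query level

    def enter_query(self):
        self.stack.append({})

    def exit_query(self):
        self.stack.pop()

    def define(self, alias: str, table: str):
        self.stack[-1][alias] = table

    def resolve(self, alias: str) -> Optional[str]:
        for scope in reversed(self.stack):    # innermost scope wins
            if alias in scope:
                return scope[alias]
        return None

# Walking Listing 1 with these scopes:
scopes = AliasScopes()
scopes.enter_query()                 # outer query
scopes.define("t", "t1")             # FROM t1 AS t
scopes.enter_query()                 # subquery
scopes.define("t", "t3")             # FROM t3 AS t
assert scopes.resolve("t") == "t3"   # inside the subquery, t means t3
scopes.exit_query()
assert scopes.resolve("t") == "t1"   # back outside, t means t1 again
```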

4. Results

4.1. Spider Models

Eight models are evaluated on the Spider dataset [4]: three PLM-based (DAIL, DIN, C3), four FLM-based (R+N, G+P, R+P, CodeS), and the hybrid SuperSQL (Super), which Table 4 groups with the PLM-based models. Section 2 describes these models. Below are their names as listed on the leaderboard (https://yale-lily.github.io/spider accessed on 17 January 2025):
  • DAIL: DAIL-SQL [10]
  • DIN: DIN-SQL [9]
  • C3: C3 [8]
  • Super: SuperSQL [17]
  • R+N: RESDSQL + NatSQL [15]
  • G+P: Graphix-T5 + PICARD [13,14]
  • R+P: RASAT + PICARD [12,13]
  • CodeS: CodeS [16]
We obtain the outputs for DAIL, DIN, C3, G+P, CodeS, and Super on the development set from their repositories and reproduce the outputs for R+N and R+P using their sources. We reproduce the outputs for all models on the evaluation set, as most were introduced before it was released.

4.2. BIRD Models

Five models are evaluated on the BIRD dataset [5]:
  • DAIL: the same model as described in Section 4.1
  • C3: the same model as described in Section 4.1
  • RESD: R+N without NatSQL (i.e., RESDSQL alone) [15]
  • CodeS-15: the 15B-parameter version of CodeS [16]
  • Super: the same model as described in Section 4.1

4.3. Results

Table 4 shows the results of the eight Spider models (Section 4.1) with respect to EXE, ESM, and ETM. For the development set, Super performs the best on EXE, while R+N performs best on ESM. Super’s EXE score is 2.6% higher than that of the next best model, CodeS, although its ESM score is 7.3% lower. This discrepancy diminishes to 1.5% with ETM; however, this still indicates that Super is not as strong as CodeS overall. DAIL, despite scoring better than every FLM-based model except CodeS on EXE, scores significantly worse on ETM and is outperformed by both R+N and G+P. The trend is evident: FLM-based models exhibit 3–7% decreases in performance from ESM to ETM, whereas PLM-based models show 1–14% increases. This impact is especially dramatic for zero-shot methods. For example, C3 performs relatively well on EXE (only 3.7% lower than CodeS) but extremely poorly on ESM (32.5% lower than CodeS) and then substantially recovers on ETM (16.2% lower than CodeS). Likewise, DIN also underperforms on ESM and thus gets a significant 4.6% boost when evaluated using ETM. We attribute this to ESM’s limitations in handling query styles that deviate from the Spider dataset. FLM-based models are less impacted because they are trained to learn those styles, whereas PLM-based models—which often generate queries with styles not captured in the training set—are often semantically correct but still get penalized by ESM.
On the evaluation set, the trend between ESM and ETM remains consistent. (Unfortunately, we were unable to run the G+P model, so its results on the evaluation set are omitted from Table 4.) Overall, PLM-based models dominate FLM-based models on EXE: the best PLM-based model, Super, gives a 2.3% higher score than the best FLM-based model, CodeS, and the other PLM-based models all outperform R+N and R+P on EXE. However, for ESM, CodeS outperforms Super by 8.2%, although the gap narrows to 2.8% on ETM. Note that the EXE scores of PLM-based models decrease from their originally reported values to our replicated results, while FLM-based models see similar or slightly higher scores. CodeS’s scores on the evaluation set were not reported, as CodeS was not submitted to the Spider leaderboard. We attribute these differences to the absence of distinct and value checking (Section 2.2), as well as the high variance in PLM-based approaches, discussed further in Section 5.2.
Table 5 illustrates the results of the five models evaluated on BIRD’s development set (Section 4.2). All models perform poorly on ESM, largely because the parser in the evaluation script fails to handle the complexity of the queries in the BIRD dataset. Consequently, run-time errors are common, and the script automatically classifies queries with such errors as syntactically incorrect, leading to notably low scores and indicating that ESM is not a suitable metric for a model’s performance on this dataset. By contrast, ETM—unaffected by these shortcomings—still shows a marked difference from EXE. As with the Spider results, models across the board perform worse on ETM than on EXE, though some are affected more severely than others. While Super achieves the highest EXE score, it is surpassed by CodeS-15 on ETM. C3 and RESD show a similar pattern: although C3 outperforms RESD on EXE, it suffers a larger drop from EXE to ETM, so RESD scores higher on ETM.
A notable shift occurs in model rankings from EXE to ETM, as PLM-based models that dominate on EXE often underperform relative to FLM-based models on ETM. A likely explanation is that PLMs, which excel in broader language understanding, make implicit assumptions about the databases that prove problematic when the queries are evaluated on ETM. As discussed further in Section 5.1, these types of assumptions typically do not affect performance except in edge cases, which may be absent in the dataset. In contrast, FLMs—though weaker on EXE—may avoid some of these pitfalls, resulting in higher ETM scores. Meanwhile, the performance decrease from ESM to ETM for the FLM-based models on Spider likely stems from their lack of optimization for certain SQL features that ESM does not assess (Section 3), causing them to handle those features incorrectly (e.g., generating arbitrary JOIN conditions would have no impact on ESM). These findings highlight the need for more robust evaluation metrics, such as ETM, to facilitate further improvements in Text-to-SQL.

5. Discussion

5.1. Model Evaluation

Upon analysis of why PLM-based models achieve high EXE scores but score lower on ETM, we find that they often generate queries that are equivalent to the gold query under certain table-specific verifiable assumptions. However, these assumptions are not enforced by the actual schema of those tables, so ETM correctly identifies that the queries are semantically distinct. In practice, the assumptions may hold for the tables in the database at hand, leading to false positives in EXE since the predicted queries are not guaranteed to produce correct results in other databases. In such scenarios, ETM is a more robust metric than EXE, as it yields fewer false positives.
For Spider, C3 and DIN produce much lower ESM scores compared with other models. Although DIN and C3 employ highly specialized prompts tailored to the dataset’s style—such as calibration hints and elaborate classification prompting—the other models make more extensive use of the training set. The FLM-based models, for instance, are directly fine-tuned on the training set, thereby imitating its query style. DAIL and Super both search the training set for questions similar to the input and use them for few-shot prompting. Because DIN and C3 do not directly leverage the training set, they exhibit greater creativity in query generation—precisely the behavior penalized by ESM. This penalty is alleviated with ETM, where the performance gaps shrink and more accurately represent model effectiveness.
The same reasoning explains why FLM-based models do not improve from ESM to ETM in Spider. Their generation styles largely match the dataset’s style, and ESM already accounts for the primary assumptions relevant to that style, so only a few of the verifiable equivalence rules introduced in ETM apply. Moreover, certain issues present in ESM that are addressed in our new metric (Section 3.3) cause some outputs to be evaluated more rigorously, lowering their ETM scores relative to ESM.

5.2. PLM Variance

The discrepancy between the published results and our reproduced results on EXE for the PLM-based models in Table 4 is in part due to the high natural variability inherent in PLMs. This variability not only hinders the replicability of the work but also creates a situation where, given enough attempts, even a worse model can outperform a more consistent model. This is exacerbated when EXE is used as the primary evaluation metric, since many of the tables in BIRD and especially Spider do not have sufficient edge cases to catch all the false assumptions made by these models.
ETM, however, aims to reduce this variance by being more stringent: it forces models to generate queries that predict the correct values in all cases, which is more challenging but yields more stable results. Even when models output SQL queries that differ only under certain edge cases, EXE is vulnerable to many false positives, whereas ETM is a model-agnostic metric that evaluates every output consistently.

5.3. Error Analysis

To assess whether our new metric gives a more accurate evaluation, we manually analyzed the false positives and false negatives generated by EXE, ESM, and ETM for each model on the Spider evaluation set and the BIRD development set. Since disabling either distinct or value checks leads to numerous false positives in both EXE and ESM, and most current state-of-the-art models predict values, we enabled these checks for our analysis.
Table 6 presents the error analysis results. Despite enabling distinct and value checks, EXE and ESM still yield a high volume of false positives and false negatives, respectively. Notably, ESM’s low false positive rate on BIRD partially stems from poor internal parsing, as it very rarely evaluates any two queries in BIRD as equivalent. For all models, the number of false positives from EXE and false negatives from ESM decreases significantly under ETM. The decrease in false positives from EXE stems from the new constraints in ETM that correctly identify mismatches, while the decrease in false negatives from ESM is attributed to our equivalence rules in Table 2 and Table 3. The false positive rate in ESM results from the issues described in Section 3.1, which are fixed in ETM.
On both EXE and ESM, false positives and false negatives disproportionately affect certain models. In particular, PLM-based models have higher false positive rates for EXE and higher false negative rates on ESM than FLM-based ones. In contrast, ETM exhibits a notably smaller discrepancy between the best and worst models, indicating it is less biased than either EXE or ESM. The few remaining false positives stem from inconsistencies in the database schema itself, where the actual database does not comply with the specified schema requirements. On Spider, the false negative rate is entirely eliminated, whereas on BIRD, a small number of false negatives persist. Although this is far fewer than the false positives for EXE, it suggests that additional equivalence rules could further reduce the false negative rate, which we will explore in future work.

5.4. Equivalence Rule Analysis

To examine this further, we analyze how ETM improves as equivalence rules are added (Figure 7). When only the preprocessing equivalence rules are used, there is already a large decrease in the false negative rate from ESM to ETM for both models shown on BIRD and for the C3 model on Spider, owing to the fixes in functionality (Section 3.3); the Super model on Spider shows a much smaller decrease because it imitates Spider’s style, which ESM was built for. Without these preprocessing rules, basic AST matching has much higher false negative rates (shown in Appendix A).
The overall trend shows that, as expected, each rule we cumulatively add decreases the false negative rate of ETM. However, the equivalence rules did not contribute equally to the improvement from ESM to ETM: some rules were particularly impactful, while others were almost never utilized. Rule 14 (which handles unnecessary uses of JOIN) was the most important addition on Spider for the models that directly use the training set (DAIL, Super, and the FLM-based models), because even the training set is inconsistent about whether its JOIN keywords are necessary, so the models that relied on it exhibit similar levels of variation. In contrast, the models that generate without access to training-set examples depend much more heavily on specific rules, indicating a preference for particular styles of SQL queries. For example, for the models based solely on GPT (C3 and DIN), Rule 6 was the most useful, indicating that GPT has a bias toward generating COUNT(c1) instead of COUNT(*). As more rules are added, the discrepancy in false negatives between the best and worst models decreases, showing that each added equivalence rule reduces bias in ETM.
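As an illustration of how such a rule can be mechanized, the sketch below (our simplification using the third-party sqlglot parser, not the released ETM script) applies Rule 6 by rewriting COUNT(c1) to COUNT(*) for columns that have already been verified as NON_NULL against the schema:

```python
import sqlglot                      # pip install sqlglot
from sqlglot import exp

def apply_rule_6(sql: str, non_null_columns: set) -> str:
    """Rule 6: COUNT(c1) is equivalent to COUNT(*) when c1 is NON_NULL.

    non_null_columns holds columns already verified against the schema
    (e.g., with PRAGMA-based checks); the names here are placeholders.
    """
    tree = sqlglot.parse_one(sql, read="sqlite")

    def rule(node):
        if (isinstance(node, exp.Count)
                and isinstance(node.this, exp.Column)
                and node.this.name in non_null_columns):
            return exp.Count(this=exp.Star())
        return node

    return tree.transform(rule).sql()

# With c1 verified NON_NULL, both sides of Rule 6 normalize identically:
# apply_rule_6("SELECT COUNT(c1) FROM t1", {"c1"})  -> 'SELECT COUNT(*) FROM t1'
```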

5.5. Limitations

Additional equivalence rules: Further equivalence rules could always be added to decrease the false negative rate of ETM. Missing equivalence rules can penalize certain styles of generation, leading to inaccurate model evaluation; adding them would also eliminate the remaining false negatives present in the BIRD evaluation. Addressing this is critical when evaluating the Text-to-SQL task.
In addition, while analyzing Spider and BIRD, we noticed that the gold queries sometimes make non-verifiable assumptions about the question or the real world (Figure 8).
A potential option to mitigate this issue would be to provide multiple correct queries for each question, allowing for a wider range of interpretations of each question. We recognize that Spider 2.0 is under development, and we hope that it corrects this aspect of Spider, but it is crucial to address this issue in both Spider and CoSQL.
Another potential limitation is the time complexity of ETM. Evaluation time grows with SQL complexity, since nested structures must be parsed to apply the rules. In practice, however, ETM is still much faster than EXE: EXE’s execution time scales with database size, which is extremely common in industry and is tested in the BIRD dataset, whereas ETM is agnostic to the size of the database.

6. Conclusions

This study introduces Enhanced Tree Matching (ETM), a novel evaluation metric for Text-to-SQL that overcomes several limitations of the previous metrics, Execution (EXE) and Exact Set Matching (ESM). Our findings indicate that ETM offers a substantial improvement by reducing the occurrences of both false positives and false negatives that commonly plague the earlier metrics. By adopting a more rigorous approach and incorporating verifiable equivalence rules to allow query diversity, ETM can discern more granular distinctions in query correctness, allowing for a more accurate measurement of the semantic accuracy of the generated queries and a better understanding of LLMs’ true capabilities in generating SQL queries. While EXE may be suitable for applications with few edge cases, domains where errors must be avoided demand a more rigorous metric like ETM.
Moving forward, we plan to extend the list of verifiable rules to strengthen ETM. We invite the community to help in building equivalence rules and updating ETM, thereby increasing its robustness in evaluating complex SQL query structures. With our framework, adding new rules to strengthen our metric is much easier than improving ESM or EXE. As we continue to refine and enhance ETM, our goal is to establish a new standard for evaluating Text-to-SQL models that can accurately represent their practical utility and technical proficiency in real-world applications. Once this evaluation is more robust, Text-to-SQL could even serve as a standard component of LLM evaluation pipelines for assessing model strengths and weaknesses. Text-to-SQL is particularly important for chatbots that rely on data stored in relational databases. With the introduction of ETM, we hope that more PLM-based methods will tackle multi-turn datasets like CoSQL, which is currently evaluated with ESM, since they will no longer be constrained by the lack of variation enforced by ESM.

Author Contributions

Conceptualization, B.G.A. and J.D.C.; methodology, B.G.A. and J.D.C.; formal analysis, B.G.A. and Y.S.R.K.; investigation, B.G.A.; software, B.G.A. and Y.S.R.K.; writing—original draft preparation, B.G.A. and J.D.C.; writing—review and editing, B.G.A. and J.D.C.; supervision, J.D.C.; funding acquisition, J.D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Our released evaluation script is publicly available under the Apache 2.0 license on https://github.com/emorynlp/ETM/, accessed on 17 July 2025.

Acknowledgments

We greatly acknowledge the support of Emory University.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

False Negatives with Equivalent Rules

Figure A1. False negative rates of ETM (%) on Spider (a) and BIRD (b) as our equivalence rules are accumulated. ESM: the original ESM metric; Pn: ETM using preprocessing rules P0 to Pn (Table 1); n: P8 + equivalence rules 1 to n (Table 2 and Table 3) are applied; ETM: P8 + all 26 equivalence rules are applied, which is our final ETM.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  2. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  3. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  4. Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; et al. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Brussels, Belgium, 2018. [Google Scholar]
  5. Li, J.; Hui, B.; Qu, G.; Yang, J.; Li, B.; Li, B.; Wang, B.; Qin, B.; Geng, R.; Huo, N.; et al. Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  6. Abiteboul, S.; Hull, R.; Vianu, V. Foundations of Databases; Addison-Wesley: Boston, MA, USA, 1995. [Google Scholar]
  7. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  8. Dong, X.; Zhang, C.; Ge, Y.; Mao, Y.; Gao, Y.; Chen, L.; Lin, J.; Lou, D. C3: Zero-shot Text-to-SQL with ChatGPT. arXiv 2023, arXiv:2307.07306. [Google Scholar]
  9. Pourreza, M.; Rafiei, D. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. arXiv 2023, arXiv:2304.11015. [Google Scholar]
  10. Gao, D.; Wang, H.; Li, Y.; Sun, X.; Qian, Y.; Ding, B.; Zhou, J. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. arXiv 2023, arXiv:2308.15363. [Google Scholar] [CrossRef]
  11. Yu, T.; Zhang, R.; Er, H.Y.; Li, S.; Xue, E.; Pang, B.; Lin, X.V.; Tan, Y.C.; Shi, T.; Li, Z.; et al. CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases. arXiv 2019, arXiv:1909.05378. [Google Scholar]
  12. Qi, J.; Tang, J.; He, Z.; Wan, X.; Cheng, Y.; Zhou, C.; Wang, X.; Zhang, Q.; Lin, Z. RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL. arXiv 2022, arXiv:2205.06983. [Google Scholar]
  13. Scholak, T.; Schucher, N.; Bahdanau, D. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. arXiv 2021, arXiv:2109.05093. [Google Scholar] [CrossRef]
  14. Li, J.; Hui, B.; Cheng, R.; Qin, B.; Ma, C.; Huo, N.; Huang, F.; Du, W.; Si, L.; Li, Y. Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing. arXiv 2023, arXiv:2301.07507. [Google Scholar] [CrossRef]
  15. Li, H.; Zhang, J.; Li, C.; Chen, H. RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. arXiv 2023, arXiv:2302.05965. [Google Scholar] [CrossRef]
  16. Li, H.; Zhang, J.; Liu, H.; Fan, J.; Zhang, X.; Zhu, J.; Wei, R.; Pan, H.; Li, C.; Chen, H. CodeS: Towards Building Open-source Language Models for Text-to-SQL. arXiv 2024, arXiv:2402.16347. [Google Scholar] [CrossRef]
  17. Li, B.; Luo, Y.; Chai, C.; Li, G.; Tang, N. The Dawn of Natural Language to SQL: Are We Fully Ready? Proc. VLDB Endow. 2024, 17, 3318–3331. [Google Scholar] [CrossRef]
  18. Chu, S.; Wang, C.; Weitz, K.; Cheung, A. Cosette: An Automated Prover for SQL. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, 8–11 January 2017. [Google Scholar]
  19. Zhou, Q.; Arulraj, J.; Navathe, S.B.; Harris, W.; Xu, D. Automated Verification of Query Equivalence Using Satisfiability Modulo Theories. Proc. VLDB Endow. 2019, 12, 1276–1288. [Google Scholar] [CrossRef]
  20. Zhong, R.; Yu, T.; Klein, D. Semantic Evaluation for Text-to-SQL with Distilled Test Suites. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 396–411. [Google Scholar] [CrossRef]
  21. Nooralahzadeh, F.; Zhang, Y.; Smith, E.; Maennel, S.; Matthey-Doret, C.; De Fondeville, R.; Stockinger, K. StatBot.Swiss: Bilingual Open Data Exploration in Natural Language. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 5486–5507. [Google Scholar] [CrossRef]
  22. Song, Y.; Ezzini, S.; Tang, X.; Lothritz, C.; Klein, J.; Bissyande, T.; Boytsov, A.; Ble, U.; Goujon, A. Enhancing Text-to-SQL Translation for Financial System Design. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP); IEEE Computer Society: Los Alamitos, CA, USA, 2024; pp. 252–262. [Google Scholar] [CrossRef]
  23. Zhan, Y.; Cui, L.; Weng, H.; Wang, G.; Tian, Y.; Liu, B.; Yang, Y.; Yin, X.; Xie, J.; Sun, Y. Towards Database-Free Text-to-SQL Evaluation: A Graph-Based Metric for Functional Correctness. In Proceedings of the 31st International Conference on Computational Linguistics; Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2025; pp. 4586–4610. [Google Scholar]
Figure 1. Examples of a false positive yielded by EXE (a) and a false negative yielded by ESM (b). (a) Semantically distinct queries having the same execution result, as there are no dogs with age ≥ 10. (b) Syntactically distinct but semantically equivalent queries to find the weight of the heaviest dog.
Figure 2. A query pair correctly considered a mismatch by EXE but considered a match by ESM.
Figure 3. A query pair mistakenly considered a match by ESM, as it overlooks the DISTINCT keyword.
Figure 4. A query pair mistakenly considered a match by ESM due to its disregard of the LIMIT values.
Figure 5. A semantically equivalent query pair under a verifiable assumption (dog_id is NON_NULL).
Figure 6. Overview of the Enhanced Tree Matching (ETM) process. Queries are parsed into their ASTs, which are then normalized using a set of predefined equivalence rules (Table 2) before comparison. After the rules are applied, queries are reduced to their most basic form, allowing us to compare them for equality.
Figure 7. False negative rates of ETM (%) on selected models for Spider and BIRD as our equivalence rules are accumulated. ESM: the original ESM metric; P: ETM using only the preprocessing equivalence rules (Table 1); n: P + equivalence rules 1 to n (Table 2 and Table 3) are applied; ETM: P + all 26 equivalence rules are applied, which is our final ETM. Full results can be found in Appendix A.
Figure 8. A question and gold query pair from Spider that assumes that every student has graduated. This assumption is not verifiable.
Table 1. Preprocessing equivalence rules implemented in ETM. t*: table, c*: column, d*: condition. ≡ separates the two equivalent query forms.

ID | Equivalent Queries | Verifiable Assumptions
P0 | SELECT c1 FROM t1; ≡ SELECT C1 FROM T1; | None
P1 | SELECT c1 FROM t1; ≡ SELECT t1.c1 FROM t1; | None
P2 | SELECT c1, c2 FROM t1; ≡ SELECT c2, c1 FROM t1; | None
P3 | SELECT t1.c1 FROM t1; ≡ SELECT t.c1 FROM t1 AS t; | None
P4 | SELECT _ FROM t1 JOIN t2; ≡ SELECT _ FROM t2 JOIN t1; | None
P5 | SELECT _ FROM _ WHERE x =/AND/OR y; ≡ SELECT _ FROM _ WHERE y =/AND/OR x; | None
P6 | SELECT col AS c FROM t1; ≡ SELECT col FROM t1; | None
P7 | SELECT _ FROM _ WHERE d1; ≡ SELECT _ FROM _ WHERE (d1); | None
P8 | SELECT "t1"."c1" FROM "t1"; ≡ SELECT t1.c1 FROM t1; | None
Table 2. Equivalent queries with verifiable assumptions implemented in ETM. t*: table, c*: column, d*: condition, q*: full query. ≡ separates the two equivalent query forms. Case 1: a primary key-foreign key relation, where t1.c1 is the primary key and t2.c2 is the foreign key. Case 2: t1.c1 must be non-composite and X can be any column(s) in t2. / denotes options, but consistency is required in selecting between options across corresponding elements of the queries.

ID | Equivalent Queries | Verifiable Assumptions
1 | SELECT _ FROM t1 WHERE c1 = (SELECT MIN/MAX(c1) FROM t1); ≡ SELECT _ FROM t1 ORDER BY c1 ASC/DESC LIMIT 1; | c1 is UNIQUE
2 | SELECT DISTINCT c1 FROM t1; ≡ SELECT c1 FROM t1; | c1 is UNIQUE
3 | SELECT c1 FROM t1 WHERE d1 INTERSECT/UNION SELECT c1 FROM t1 WHERE d2; ≡ SELECT c1 FROM t1 WHERE d1 AND/OR d2; | c1 is UNIQUE
4 | SELECT _ FROM t1 WHERE GROUP BY c1,c2,…; ≡ SELECT _ FROM t1 WHERE GROUP BY c1; | c1 is UNIQUE
5 | SELECT c1 FROM t1 EXCEPT (q1); ≡ SELECT c1 FROM t1 WHERE c1 NOT IN (q1); | c1 is UNIQUE and NON_NULL
6 | SELECT COUNT(*) FROM t1; ≡ SELECT COUNT(c1) FROM t1; | c1 is NON_NULL
7 | SELECT _ FROM t1 WHERE c1 is NOT NULL; ≡ SELECT _ FROM t1; | c1 is NON_NULL
8 | SELECT CAST(SUM(c1) AS FLOAT) / COUNT(*) FROM t1; ≡ SELECT AVG(c1) FROM t1; | c1 is NON_NULL
9 | SELECT COUNT(CASE WHEN d1 THEN 1/c1 ELSE NULL END) FROM t1; ≡ SELECT SUM(CASE WHEN d1 THEN 1 ELSE 0 END) FROM t1; | c1 is NON_NULL
10 | SELECT MIN/MAX(c1), _ FROM t1; ≡ SELECT c1, _ FROM t1 ORDER BY c1 ASC/DESC LIMIT 1; | t1 is not empty
11 | SELECT * FROM t1; ≡ SELECT c1, c2, … FROM t1; | t1 consists of only c1, c2, …
12 | SELECT _ FROM _ WHERE c1 = 'x'; ≡ SELECT _ FROM _ WHERE c1 = x; | x is a number not starting with zero
13 | SELECT _ FROM t2 WHERE c2 IN (SELECT c1 FROM t1 WHERE d1); ≡ SELECT _ FROM t1 JOIN t2 ON t1.c1 = t2.c2 WHERE d1; | Case 1 (refer to the caption)
14 | SELECT X FROM t1 JOIN t2 ON t1.c1 = t2.c2; ≡ SELECT X FROM t2; | Case 2 (refer to the caption)
15 | SELECT _ FROM _ WHERE SUBSTR(c1, 1, a) = x AND SUBSTR(c1, b, c) >/</>=/<= y; ≡ SELECT _ FROM _ WHERE c1 >/</>=/<= xy; | a + 1 = b
16 | SELECT _ FROM _ WHERE c1 LIKE 'x%'; ≡ SELECT _ FROM _ WHERE SUBSTR(c1, 1, n) = 'x'; | len(x) = n
Table 3. Equivalent queries with no verifiable assumption implemented in ETM. ≡ separates the two equivalent query forms.

ID | Equivalent Queries
17 | SELECT _ FROM _ ORDER BY c1; ≡ SELECT _ FROM _ ORDER BY JULIANDAY(c1);
18 | SELECT _ FROM _ WHERE c1 IN/NOT IN (x, y,…); ≡ SELECT _ FROM _ WHERE c1 =/!= x OR/AND c1 =/!= y OR/AND …;
19 | SELECT t1.c1 FROM t1 JOIN t2 ON t1.c1 = t2.c2; ≡ SELECT t2.c2 FROM t1 JOIN t2 ON t1.c1 = t2.c2;
20 | SELECT _ FROM t1 WHERE c1 IN (SELECT c1 FROM t1 WHERE d1); ≡ SELECT _ FROM t1 WHERE d1;
21 | q1; ≡ q1 UNION/INTERSECT q1;
22 | SELECT _ FROM t1 WHERE c1 BETWEEN x AND y; ≡ SELECT _ FROM t1 WHERE c1 >= x/y AND c1 <= x/y;
23 | SELECT _ FROM t1 WHERE c1 !=/>/</>=/<=/= x; ≡ SELECT _ FROM t1 WHERE NOT c1 =/<=/>=/</>/!= x;
24 | SELECT CASE WHEN d1 THEN x ELSE y END; ≡ SELECT IIF(d1, x, y);
25 | SELECT _ FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c2 WHERE t2._ IS NULL; ≡ SELECT _ FROM t1 WHERE t1.c1 NOT IN (SELECT c2 FROM t2);
26 | WITH q AS (q1) SELECT _ FROM q; ≡ SELECT _ FROM (q1);
Table 4. Model performance on the Spider dataset in %. Column-wise rankings are indicated in parentheses. The Evaluation Set columns display the results from the model outputs reproduced by us, while the Reported columns show the results on the evaluation set as reported in the respective literature and the leaderboard for those models.

Model | Dev EXE | Dev ESM | Dev ETM | Eval EXE | Eval ESM | Eval ETM | Reported EXE | Reported ESM
DAIL (PLM) | 82.9 (3) | 70.0 (6) | 71.5 (5) | 82.2 (3) | 66.1 (4) | 68.1 (5) | 86.2 (2) | 66.5 (4)
DIN (PLM) | 81.7 (5) | 60.1 (7) | 64.7 (7) | 81.6 (4) | 60.7 (6) | 64.8 (6) | 85.3 (3) | 60.0 (5)
C3 (PLM) | 79.8 (7) | 46.9 (8) | 59.8 (8) | 79.5 (5) | 43.9 (7) | 58.5 (7) | 82.3 (4) | -
Super (PLM) | 86.1 (1) | 72.1 (5) | 75.1 (2) | 85.3 (1) | 65.5 (5) | 70.4 (2) | 87.0 (1) | -
R+N (FLM) | 82.8 (4) | 80.5 (1) | 74.9 (3) | 78.4 (6) | 70.9 (2) | 70.4 (2) | 79.9 (5) | 72.0 (2)
G+P (FLM) | 80.1 (6) | 77.1 (3) | 72.3 (4) | - | - | - | 77.6 (6) | 74.0 (1)
R+P (FLM) | 76.7 (8) | 75.2 (4) | 69.3 (6) | 77.9 (7) | 69.3 (3) | 69.6 (4) | 75.5 (7) | 70.9 (3)
CodeS (FLM) | 83.5 (2) | 79.4 (2) | 76.6 (1) | 83.0 (2) | 73.7 (1) | 73.2 (1) | - | -
Table 5. Model performance (%) on the BIRD development set. Rankings are indicated in parentheses.

Model | EXE | ESM | ETM
DAIL (PLM) | 50.1 (3) | 8.0 (3) | 31.9 (3)
C3 (PLM) | 42.8 (4) | 5.7 (5) | 22.6 (5)
Super (PLM) | 52.1 (1) | 8.3 (2) | 33.1 (2)
RESD (FLM) | 37.4 (5) | 7.2 (4) | 25.4 (4)
CodeS-15 (FLM) | 51.6 (2) | 9.1 (1) | 36.2 (1)
Table 6. False positive and negative rates (%) for all models with respect to the three metrics on the Spider evaluation set and BIRD development set. ESM and EXE are evaluated with distinct and value checking enabled.

Dataset | Model | EXE FP | EXE FN | ESM FP | ESM FN | ETM FP | ETM FN
Spider | DAIL (PLM) | 16.3 | 0.0 | 5.0 | 3.2 | 0.1 | 0.0
Spider | DIN (PLM) | 19.5 | 0.0 | 6.1 | 9.8 | 0.0 | 0.0
Spider | C3 (PLM) | 23.0 | 0.0 | 2.8 | 17.5 | 0.1 | 0.0
Spider | Super (PLM) | 15.0 | 0.0 | 3.2 | 8.2 | 0.0 | 0.0
Spider | R+N (FLM) | 10.0 | 0.0 | 4.6 | 3.8 | 0.1 | 0.0
Spider | R+P (FLM) | 10.0 | 0.0 | 4.1 | 4.3 | 0.1 | 0.0
Spider | CodeS (FLM) | 12.1 | 0.0 | 4.7 | 4.0 | 0.0 | 0.0
BIRD | DAIL (PLM) | 17.4 | 0.0 | 1.6 | 26.3 | 0.3 | 1.1
BIRD | C3 (PLM) | 17.5 | 0.0 | 1.1 | 20.7 | 0.0 | 2.7
BIRD | Super (PLM) | 17.9 | 0.0 | 1.2 | 27.1 | 0.2 | 1.2
BIRD | RESD (FLM) | 11.1 | 0.0 | 1.7 | 20.7 | 0.1 | 1.0
BIRD | CodeS (FLM) | 14.8 | 0.0 | 1.2 | 28.9 | 0.1 | 0.8