Article
Peer-Review Record

Transforming Medical Data Access: The Role and Challenges of Recent Language Models in SQL Query Automation

Algorithms 2025, 18(3), 124; https://doi.org/10.3390/a18030124
by Nikola Tanković *, Robert Šajina and Ivan Lorencin *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 26 January 2025 / Revised: 18 February 2025 / Accepted: 19 February 2025 / Published: 21 February 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In this manuscript, the authors evaluate the performance of publicly available Large Language Models (LLMs), such as ChatGPT and Llama, in translating natural language queries into SQL queries for the medical community. These models assist both expert and non-expert personnel in navigating healthcare databases. They provide practical guidelines for using such models, taking into account the trade-offs between costs and accuracy. The study is interesting and has a highly practical scope. The manuscript is well-written and informative but could be improved. I ask the authors to respond to the following questions:

(1) The definition of Execution Accuracy seems somewhat subjective. While syntactic validity is well-defined as a hit-or-miss situation, Efficiency appears to be a subjective measure. Was Efficiency measured with unanimous agreement among the authors? How much variability should we expect from human assessment of this measure?

(2) It would be very instructive to include concrete prompt examples and the corresponding SQL queries obtained from the LLMs. I suggest the authors add an appendix to provide these examples. Additionally, I encourage the authors to illustrate the ablation study in Section 3.5 with concrete examples. While the paper does a good job of explaining the experiments, providing examples would enhance readability and understanding.

(3) In Figure 1, it is unclear which accuracy is being computed, EMA (Execution Accuracy with Manual Assessment) or EA (Execution Accuracy). Please clarify this issue.

(4) The information presented below Figure 2 does not match the data in Table 1. For example, Table 1 indicates that ChatGPT handles the smallest average number of input tokens, not Gemini. Similarly, Gemini produces the most compact output tokens, not Claude. If there is an additional criterion being used, please make it explicit and provide sufficient explanations.

(5) Please provide some explanation of the Pareto front analysis. This concept may not be familiar to me or many other readers, and a brief explanation would be helpful.

(6) I also request the authors to provide concrete examples in the ablation study section. It is challenging to visualize what the authors are describing without examples. Including examples would greatly facilitate understanding of the study's focus.

(7) It is surprising that Gemini's performance decreased when provided with examples from the database. Can the authors offer an explanation for this behavior?

(8) While the effort to translate natural language into medical SQL queries is commendable and beneficial for laypersons, it is likely that medical staff will derive even greater benefits from such systems once they are familiar with the domain-specific jargon (e.g., ICD taxonomies). I suggest that a study focused on translating more specific medical jargon into SQL queries could be conducted to evaluate its utility for the medical community more realistically.

(9) The apparent success of DeepSeek in various NLP tasks raises the question of how well this model performs compared to the others evaluated in the study. Would it be possible to include DeepSeek in the analysis? This is merely a suggestion, not a requirement.

Author Response

"We would like to thank Reviewer 1 for their time, effort, and constructive comments, which have significantly improved the quality of this manuscript. All changes are marked in BLUE. The authors' responses can be found in the following paragraphs."

 

In this manuscript, the authors evaluate the performance of publicly available Large Language Models (LLMs), such as ChatGPT and Llama, in translating natural language queries into SQL queries for the medical community. These models assist both expert and non-expert personnel in navigating healthcare databases. They provide practical guidelines for using such models, taking into account the trade-offs between costs and accuracy. The study is interesting and has a highly practical scope. The manuscript is well-written and informative but could be improved. I ask the authors to respond to the following questions:

(1) The definition of Execution Accuracy seems somewhat subjective. While syntactic validity is well-defined as a hit-or-miss situation, Efficiency appears to be a subjective measure. Was Efficiency measured with unanimous agreement among the authors? How much variability should we expect from human assessment of this measure?

Execution Accuracy (EA) in our study was defined as the comparison between the result set produced by the LLM-generated SQL query and the ground-truth query’s result set. This measure is objective in that it strictly checks whether the generated query returns the same output as the expected query when executed against the database.
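For illustration, this check can be expressed in a few lines of code; the sketch below is a hypothetical example assuming a SQLite backend and an order-insensitive comparison of result sets, not the evaluation script used in the study.

```python
# Hypothetical sketch of an execution-accuracy check (not the study's actual code):
# run the generated and ground-truth queries against the same database and compare
# their result sets as order-insensitive multisets.
import sqlite3
from collections import Counter

def execution_match(conn, generated_sql, ground_truth_sql):
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # an invalid generated query counts as a miss
    expected_rows = conn.execute(ground_truth_sql).fetchall()
    return Counter(generated_rows) == Counter(expected_rows)
```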

Regarding Efficiency, our study did not use it as a separate subjective measure in evaluating Execution Accuracy. However, if you are referring to efficiency in terms of query performance (e.g., execution time, resource usage), that was not part of our EA metric. If efficiency were to be included as a qualitative aspect, there could be variability in human judgment, but in our study, the comparison was strictly based on result set equivalence, avoiding subjectivity.

We included these clarifications in the Methodology section.

(2) It would be very instructive to include concrete prompt examples and the corresponding SQL queries obtained from the LLMs. I suggest the authors add an appendix to provide these examples. Additionally, I encourage the authors to illustrate the ablation study in Section 3.5 with concrete examples. While the paper does a good job of explaining the experiments, providing examples would enhance readability and understanding.

We appreciate the suggestion to include concrete prompt examples and corresponding SQL queries. In response, we have expanded the appendix to include detailed examples of the prompts used in our ablation study. Additionally, we have provided explicit question-ground truth-model output pairs to illustrate how different LLMs performed on specific queries. These additions should enhance clarity and facilitate a deeper understanding of the results.

We have included the prompts and example outputs in Appendix A and B.
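Purely as an illustration here (the table and column names below are assumptions in the spirit of MIMIC-III, and the pair is not copied from Appendix A or B), such an entry pairs a natural language question with a ground-truth query and a model output:

```python
# Hypothetical question / ground-truth / model-output triple, illustrative only;
# table and column names are assumptions, not taken from the paper's appendix.
example_entry = {
    "question": "How many female patients were admitted during 2012?",
    "ground_truth_sql": (
        "SELECT COUNT(DISTINCT p.subject_id) FROM patients p "
        "JOIN admissions a ON p.subject_id = a.subject_id "
        "WHERE p.gender = 'F' AND a.admittime LIKE '2012%'"
    ),
    "model_output_sql": (
        "SELECT COUNT(*) FROM patients p "
        "JOIN admissions a ON p.subject_id = a.subject_id "
        "WHERE p.gender = 'F' AND a.admittime LIKE '2012%'"
    ),
}
# In this hypothetical case the model output omits DISTINCT, so the two result
# sets can differ whenever a patient has multiple admissions in 2012.
```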

(3) In Figure 1, it is unclear which accuracy is being computed, EMA (Execution Accuracy with Manual Assessment) or EA (Execution Accuracy). Please clarify this issue.

Thank you for pointing this out. In Figure 1, the accuracy being computed is EA (Execution Accuracy), as we aimed to avoid subjectivity in our evaluation. We have clarified this explicitly in the figure caption and the main text to ensure there is no ambiguity.

(4) The information presented below Figure 2 does not match the data in Table 1. For example, Table 1 indicates that ChatGPT handles the smallest average number of input tokens, not Gemini. Similarly, Gemini produces the most compact output tokens, not Claude. If there is an additional criterion being used, please make it explicit and provide sufficient explanations.

Thank you for this observation regarding the token counts and the final price presented. The final cost is determined by both input and output token counts, multiplied by their respective per-token prices, which differ from provider to provider. While Table 1 presents the average token usage per model, the overall cost ranking considers both token consumption and model-specific pricing. We have clarified this in the revised text and ensured consistency between the tables and figures.
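As a sketch of the arithmetic (the per-million-token prices below are placeholders, not the actual provider rates used in the study), the ranking can flip once prices are applied:

```python
# Illustrative cost calculation; prices are placeholders, not real provider rates.
def query_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

# A model that consumes fewer tokens per query can still cost more overall
# if its per-token prices are higher.
fewer_tokens_higher_rate = query_cost(900, 120, 3.00, 15.00)   # = 0.0045
more_tokens_lower_rate = query_cost(1100, 150, 1.25, 5.00)     # = 0.002125
```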

(5) Please provide some explanation of the Pareto front analysis. This concept may not be familiar to me or many other readers, and a brief explanation would be helpful.

Thank you for the suggestion. We have added a brief explanation of Pareto front analysis in the paper to ensure clarity for all readers. Specifically, we now describe how the Pareto front represents the set of models that achieve the best trade-offs between key evaluation metrics, where no single model can improve in one metric without compromising another. This additional context should make the analysis more accessible to readers unfamiliar with the concept.
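For the interested reader, the sketch below shows how such a front can be computed over (accuracy, cost per query) pairs, assuming higher accuracy and lower cost are both preferred; the figures are placeholders, not the study's measured values.

```python
# Minimal Pareto-front sketch over (accuracy, cost_per_query) pairs;
# the numbers below are placeholders, not the study's results.
def pareto_front(models):
    """Keep models not dominated by another with accuracy >= and cost <=."""
    front = []
    for name, (acc, cost) in models.items():
        dominated = any(
            o_acc >= acc and o_cost <= cost and (o_acc, o_cost) != (acc, cost)
            for o_name, (o_acc, o_cost) in models.items() if o_name != name
        )
        if not dominated:
            front.append(name)
    return front

candidates = {"model_a": (0.65, 0.0048), "model_b": (0.61, 0.0024), "model_c": (0.55, 0.0031)}
print(pareto_front(candidates))  # ['model_a', 'model_b']; model_c is dominated by model_b
```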

(6) I also request the authors to provide concrete examples in the ablation study section. It is challenging to visualize what the authors are describing without examples. Including examples would greatly facilitate understanding of the study's focus.

We appreciate the request for more concrete examples in the ablation study section. In response, we have included detailed examples of different prompts used, as well as question-ground truth-model output pairs in Appendix A and B. These additions should improve clarity and make it easier for readers to understand the impact of different ablations on model performance.

(7) It is surprising that Gemini's performance decreased when provided with examples from the database. Can the authors offer an explanation for this behavior?

The performance drop of Gemini when provided with database examples may be due to overfitting, misinterpreting patterns, increased input complexity, or prompt sensitivity. The model might have prioritized example replication over generalization, leading to incorrect SQL. Additionally, longer inputs could have diluted focus, and formatting differences may have influenced interpretation. To address this, we have added a remark in the paper discussing these potential factors.

According to this comment, the following text is added:

“The performance drop of gemini-1.5-pro when provided with database examples may be due to overfitting, misinterpreting patterns, increased input complexity, or prompt sensitivity. The model might have prioritized example replication over generalization, leading to incorrect SQL. Additionally, longer inputs could have diluted focus, and formatting differences may have influenced interpretation.”

(8) While the effort to translate natural language into medical SQL queries is commendable and beneficial for laypersons, it is likely that medical staff will derive even greater benefits from such systems once they are familiar with the domain-specific jargon (e.g., ICD taxonomies). I suggest that a study focused on translating more specific medical jargon into SQL queries could be conducted to evaluate its utility for the medical community more realistically.

We appreciate this insightful suggestion. While our study focused on general natural language to SQL translation, we acknowledge that medical professionals, familiar with domain-specific jargon (e.g., ICD taxonomies), may benefit from a more specialized system. We have noted this in the paper as a potential direction for future research, where evaluating LLMs on translating structured medical queries into SQL could provide a more realistic assessment of their utility in clinical settings.

According to this comment, the following text is added to section 6:

“While the effort to translate natural language into medical SQL queries is commendable and beneficial for laypersons, it is likely that medical staff will derive even greater benefits from such systems once they are familiar with the domain-specific jargon (e.g., ICD taxonomies). While the study focused on general natural language to SQL translation, the authors acknowledge that medical professionals, familiar with domain-specific jargon, may benefit from a more specialized system. This has been noted as a potential direction for future research, where evaluating LLMs on translating structured medical queries into SQL could provide a more realistic assessment of their utility in clinical settings.”

(9) The apparent success of DeepSeek in various NLP tasks raises the question of how well this model performs compared to the others evaluated in the study. Would it be possible to include DeepSeek in the analysis? This is merely a suggestion, not a requirement.

At the time of conducting the research, DeepSeek-R1 was not yet available on the market. However, we acknowledge its potential and will include it in our future work. The authors are currently conducting research on this topic and plan to present the findings in upcoming studies. This has been added to the future directions and conclusion sections of the manuscript.

According to this comment, the following text is added to section 6:

“The potential of DeepSeek-R1 is also recognized, and its inclusion in future work is planned. Evaluating its performance in comparison with other state-of-the-art models will contribute to a deeper understanding of model capabilities and trade-offs in the context of SQL query generation for healthcare applications.” 

and the following text is added to the conclusion:

“The authors recognize the potential of DeepSeek-R1 and plan to include it in future work. The limitation of using 1,000 queries from the TREQS dataset is acknowledged, and efforts to address this through the incorporation of additional datasets and the generation of synthetic queries are planned to enhance robustness and generalizability. These enhancements will be explored as part of future work, and findings will be presented in subsequent studies.”


Reviewer 2 Report

Comments and Suggestions for Authors

The paper is devoted to the comparison of modern LLMs, including Llama, Mistral, Qwen, etc., for transforming medical questions into SQL queries. The main strengths of the paper are a strong analysis of the model effectiveness and a good description of methods. Overall, the paper is well-structured and well-written.

I recommend explicitly formulating the objective of the paper in the introduction. The literature review for some points (for example, RAG) is also too brief.

Author Response

"We would like to thank Reviewer 2 for their time, effort, and constructive comments, which have significantly improved the quality of this manuscript. All changes are marked in RED. The authors' responses can be found in the following paragraphs."

 

The paper is devoted to the comparison of modern LLMs, including Llama, Mistral, Qwen, etc., for transforming medical questions into SQL queries. The main strengths of the paper is a strong analysis of the model effectiveness and a good description of methods. Overall, the paper is well-structured and well-written.

I recommend explicitly formulating the objective of the paper in the introduction. The literature review for some points (for example, RAG) is also too brief.

According to this comment, the following text is added at the end of Introduction section:

The paper's contributions are as follows. First, this study aims to systematically evaluate the effectiveness of state-of-the-art large language models (LLMs) in transforming medical queries into executable SQL statements, with a focus on accuracy, consistency, and cost-efficiency. To achieve this, we assess both proprietary (e.g., GPT-4o, Claude 3.5) and open-source models (e.g., LLaMA 3.3-70B, Qwen-2.5-72B) on the MIMIC-3 and TREQS datasets, utilizing a diverse set of natural language queries. Second, the paper introduces a novel prompt engineering approach tailored for healthcare databases, incorporating schema previews, 1-shot examples, and data samples to optimize query accuracy and efficiency. Third, the study identifies critical trade-offs between accuracy, consistency, and token cost, offering actionable insights for selecting models based on specific use cases, such as cost-sensitive deployments or accuracy-critical medical applications. Finally, recognizing the limitations of current models in handling complex medical terminology, we propose future directions, including the integration of Retrieval-Augmented Generation (RAG) pipelines and ICD taxonomies, to enhance query precision and adaptability in high-stakes healthcare environments.
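As a purely illustrative sketch of how a prompt of this kind can be assembled from a schema preview, a 1-shot example, and data samples (the helper name and template wording below are assumptions, not the paper's exact prompt):

```python
# Hypothetical prompt assembly; the template wording is an assumption,
# not the exact prompt used in the paper.
def build_prompt(schema_preview, one_shot_example, data_samples, question):
    return (
        "You translate natural-language questions into SQL.\n\n"
        f"Database schema:\n{schema_preview}\n\n"
        f"Example question and SQL:\n{one_shot_example}\n\n"
        f"Sample rows:\n{data_samples}\n\n"
        f"Question: {question}\nSQL:"
    )
```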

 

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript is devoted to comparing different LLMs for generating accurate SQL queries from natural language. The work is very interesting.

1. But why did the authors not include DeepSeek in the study?
2. The authors wrote that "We used a total of 1000 natural language queries from the TREQS dataset, originating from the MIMIC-III dataset, a widely used deidentified critical care dataset."

A dataset with 1000 samples does not seem sufficient for the task at hand. How do the authors plan to supplement it for future work?

3. Metrics should be included in the conclusions section to make the final discussion more informative and understandable.

Author Response

We would like to thank Reviewer 3 for their time, effort, and constructive comments, which have significantly improved the quality of this manuscript. All changes are marked in GREEN. The authors' responses can be found in the following paragraphs.

 

The manuscript is devoted to comparing different LLMs for generating accurate SQL queries from natural language. The work is very interesting.

1. But why did the authors not include DeepSeek in the study?

At the time of conducting the research, DeepSeek-R1 was not yet available on the market. However, we acknowledge its potential and will include it in our future work. The authors are currently conducting research on this topic and plan to present the findings in upcoming studies. This has been added to the future directions and conclusion sections of the manuscript.

According to this comment, the following text is added to section 6:

“The potential of DeepSeek-R1 is also recognized, and its inclusion in future work is planned. Evaluating its performance in comparison with other state-of-the-art models will contribute to a deeper understanding of model capabilities and trade-offs in the context of SQL query generation for healthcare applications.” 

and the following text is added to the conclusion:

“The potential of DeepSeek-R1 is also recognized and should be included in future work. The current limitation of using 1,000 queries from the TREQS dataset should also be addressed by incorporating additional data sets and the generation of synthetic queries to enhance robustness and generalizability.”


2. The authors wrote that "We used a total of 1000 natural language queries from the TREQS dataset, originating from the MIMIC-III dataset, a widely used deidentified critical care dataset."

A dataset with 1000 samples does not seem sufficient for the task at hand. How do the authors plan to supplement it for future work?

Regarding the dataset size, we acknowledge the limitation of using 1000 queries from the TREQS dataset. In our future work, we plan to supplement this by incorporating additional datasets and generating synthetic queries to enhance robustness and generalizability. This has also been addressed in the conclusion section of the manuscript:

“From a cost-effectiveness perspective, the Pareto front analysis revealed distinct model advantages: while GPT-4o provides the highest accuracy at a moderate cost (0.00484 per query), Gemini-1.5-pro presents a well-balanced alternative between accuracy (60.8%) and affordability (0.00241 per query). Qwen-2.5-72B, the only open-source model on the Pareto front, offers a transparent and adaptable solution, albeit at a slightly higher cost.”

 

3. Metrics should be included in the conclusions section to make the final discussion more informative and understandable.

Metrics have been incorporated into the conclusions section to enhance the final discussion, making it more informative and understandable. 

 
