Article

Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs

by Luka Blašković 1, Nikola Tanković 1,*, Ivan Lorencin 1,* and Sandi Baressi Šegota 2,*
1 Faculty of Informatics, Juraj Dobrila University of Pula, 52100 Pula, Croatia
2 Department of Automation and Electronics, Faculty of Engineering, University of Rijeka, 51000 Rijeka, Croatia
* Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(10), 256; https://doi.org/10.3390/bdcc9100256
Submission received: 16 July 2025 / Revised: 12 September 2025 / Accepted: 3 October 2025 / Published: 11 October 2025

Abstract

Electronic health records (EHRs) are typically stored in relational databases, making them difficult to query for nontechnical users, especially under privacy constraints. We evaluate two practical clinical NLP workflows, natural language to SQL (NL2SQL) for EHR querying and retrieval-augmented generation for clinical question answering (RAG-QA), with a focus on privacy-preserving deployment. We benchmark nine large language models, spanning open-weight options (DeepSeek V3/V3.1, Llama-3.3-70B, Qwen2.5-32B, Mixtral-8 × 22B, BioMistral-7B, and GPT-OSS-20B) and proprietary APIs (GPT-4o and GPT-5). The models were chosen to represent a diverse cross-section spanning sparse MoE, dense general-purpose, domain-adapted, and proprietary LLMs. On MIMICSQL (27,000 generations; nine models × three runs), the best NL2SQL execution accuracy (EX) is 66.1% (GPT-4o), followed by 64.6% (GPT-5). Among open-weight models, DeepSeek V3.1 reaches 59.8% EX, while DeepSeek V3 reaches 58.8%, with Llama-3.3-70B at 54.5% and BioMistral-7B achieving only 11.8%, underscoring a persistent gap relative to general-domain benchmarks. We introduce SQL-EC, a deterministic SQL error-classification framework with adjudication, revealing string mismatches as the dominant failure (86.3%), followed by query-join misinterpretations (49.7%), while incorrect aggregation-function usage accounts for only 6.7%. This highlights lexical/ontology grounding as the key bottleneck for NL2SQL in the biomedical domain. For RAG-QA, evaluated on 100 synthetic patient records across 20 questions (54,000 reference–generation pairs; three runs), BLEU and ROUGE-L fluctuate more strongly across models, whereas BERTScore remains high on most, with DeepSeek V3.1 and GPT-4o among the top performers; pairwise t-tests confirm that significant differences were observed among the LLMs. Cost–performance analysis based on measured token usage shows per-query costs ranging from USD 0.000285 (GPT-OSS-20B) to USD 0.005918 (GPT-4o); DeepSeek V3.1 offers the best open-weight cost–accuracy trade-off, and GPT-5 provides a balanced API alternative. Overall, the privacy-conscious RAG-QA attains strong semantic fidelity, whereas the clinical NL2SQL remains brittle under lexical variation. SQL-EC pinpoints actionable failure modes, motivating ontology-aware normalization and schema-linked prompting for robust clinical querying.

1. Introduction

Electronic health records (EHRs) digitize patient histories, laboratory results, medications, and other clinical data, improving information sharing and enabling data-driven care [1,2]. Despite broad adoption, EHR data remain difficult to leverage due to volume, heterogeneity, and ambiguity, which complicate interpretation and the extraction of actionable insights [2,3,4]. While EHRs can reduce clinician administrative burden and improve coordination, claims about direct outcome improvements should be made cautiously and supported by evidence [5,6,7].
Recent progress in large language models (LLMs) for clinical NLP, spanning domain-specialized models (e.g., BioBERT, ClinicalBERT, PubMedGPT, and Med-PaLM) and strong general-purpose models (e.g., GPT, Claude, Llama, and DeepSeek), suggests utility for tasks such as clinical question answering, EHR summarization, and natural-language database querying [8,9,10,11,12,13,14,15,16]. However, deployment in healthcare is constrained by reliability and safety concerns (e.g., hallucinations, calibration and uncertainty, brittleness under distribution shift, and bias), which motivates rigorous task-specific evaluation and the use of guardrails.
Retrieval-augmented generation (RAG) can ground model outputs in curated clinical sources, improving timeliness, traceability, and evidence linkage for decisions that are time-critical [17,18]. In parallel, locally deployable LLMs offer privacy, latency, and compliance advantages by keeping sensitive data on-premises, albeit at increased computational, infrastructure, and MLOps costs [19,20,21,22,23,24].
This study draws motivation from the Croatian healthcare system, which serves as an example of a resource-constrained, developing EU member state facing significant challenges in healthcare delivery [25,26]. Croatia ranks 24th out of 35 in the 2018 Euro Health Consumer Index [27], with healthcare expenditures per capita (EUR 1787) substantially below the EU average (EUR 4029) [28]. Only about 4% of that amount, approximately EUR 71.5, is allocated to nationally organized preventive care programs, compared to the European average of 6% [28]. These figures, alongside workforce shortages (3.7 doctors and 7.5 nurses per 1000 inhabitants, compared to the EU averages of 4.2 and 8.2, respectively) [29], prolonged waiting times for diagnostics and surgeries, and geographic disparities in access to care, highlight the high burden and resource limitations common in similar healthcare environments.
We evaluate nine LLMs, seven open-source (DeepSeek-V3, DeepSeek-V3.1, Llama 3.3-70B, Qwen2.5-32B, Mixtral-8x22B, BioMistral-7B, and GPT-OSS-20B) and two proprietary (GPT-4o, GPT-5), on two core clinical NLP tasks: (1) natural language to SQL and (2) RAG-based clinical question answering. For natural language to SQL, we use the MIMICSQL dataset. For the question-answering task, we construct a privacy-preserving corpus of 100 synthetic patient records and evaluate model answers against 20 predefined clinical questions spanning common reasoning patterns. We use the following abbreviations throughout this paper:
  • NL2SQL: Natural language to SQL generation;
  • RAG-QA: Retrieval-augmented generation for question answering;
  • SQL-EC: Generated SQL error classification.
Beyond task accuracy, we introduce a deterministic SQL error classification framework for targeted diagnosis of NL2SQL failures. SQL-EC partitions erroneous queries into five categories: (i) schema mismatch (incorrect table/column references), (ii) incorrect aggregation usage (wrong filters, bounds, or aggregation function), (iii) join errors (missing/incorrect joins, key misuse, and fan-out), (iv) condition misinterpretation (operator misuse, missing conditions, and logical errors), and (v) string mismatch (errors due to vocabulary variations, unnormalized inputs, and improper column use). The classifier is rule-based and reproducible.
Contributions: (i) A head-to-head evaluation of seven open-source and two proprietary LLMs on NL2SQL and RAG-QA using controlled, privacy-preserving data was conducted; (ii) SQL-EC, a five-class, deterministic framework for diagnosing NL2SQL errors, was introduced; (iii) analysis of local versus hosted LLM deployment trade-offs (privacy, latency, compliance, and compute) in resource-constrained health systems was performed; and (iv) reproducible pipelines and prompts for NL2SQL and RAG-QA, including retrieval and evaluation configurations, were developed. Exact model versions, licenses, context lengths, quantization/precision, and other hyperparameters are specified in Section 3 for reproducibility. The code is available at https://github.com/fipu-lab/med-rag, accessed on 2 October 2025.
The remainder of this paper is structured as follows: Section 2 reviews the clinical applications of NL2SQL in healthcare, highlighting prevalent models, key technological advances, RAG in clinical environments, lexical challenges, and integration issues. Section 3 outlines the research methodology, including the evaluated NL2SQL and RAG-QA pipelines, evaluation metrics, and our SQL-EC framework, with a deterministic error classifier tailored for the medical domain. Section 4 reports results from nine LLMs across the examined tasks, supported by figures, tables, and quantitative metrics. Section 5 discusses limitations and future research directions, and Section 6 concludes the paper.

2. Literature Review

The integration of NLP into clinical settings depends on methods that can accurately process complex queries, retrieve relevant evidence, and manage ambiguous medical terminology. To better understand the state of the field, this review evaluates recent progress in clinical NL2SQL, RAG, and approaches to lexical and real-world integration challenges. By tracing these developments, we identify opportunities for more robust and secure clinical NLP pipelines.

2.1. Recent Developments of Clinical NL2SQL

NL2SQL, also referred to as text-to-SQL in the literature, is a specialized form of semantic parsing that automates the translation of human (natural) language into structured database queries [30]. In NL2SQL, the input sequence is a natural language query (e.g., “Show me all patients hospitalized in 2025”), while the output sequence is the corresponding SQL query (e.g., SELECT * FROM patients WHERE hospitalized_year = 2025).
Early deep neural networks fine-tuned for this task, such as Seq2SQL [30] and SQLNet [31], relied on direct, sequence-to-sequence translations without an explicit understanding of the database schema’s business context. Similarly, Roberts and Patra [32] developed a semantic parser to map clinical questions to logical forms. These early approaches required explicit training on a specific database structure.
Zhu et al. [33] review NL2SQL methods, categorizing them by training strategies such as prompt engineering and fine-tuning, and summarize key datasets and evaluation metrics. They emphasize the role of exact match and execution accuracy in evaluating model performance. Wang et al. [34] introduced the MIMICSQL dataset, derived from the MIMIC-III critical care database, which reformulates clinical information into a relational format and serves as the foundational benchmark for this domain. Moreover, they introduced TREQS, a seq2seq-based model designed to generate SQL queries from medical questions by leveraging dynamic temporal attention and controlled copying. However, the approach developed in that study, while laying the foundation for clinical NL2SQL, did not integrate much intrinsic information related to SQL itself, such as its inherent tree structure, and the model showed limitations when handling more diverse phrasing, achieving only 55% execution accuracy on paraphrased questions.
Transformer architectures brought significant improvements by encoding a richer linguistic and schema context. For instance, Pan et al. [35] proposed MedTS, a BERT-based NL2SQL generator tailored to EMR queries. MedTS employs a pre-trained BERT encoder with schema linking, which identifies table and column names in the input question, and a grammar-constrained LSTM decoder to generate an intermediate SQL syntax tree [35]. It achieved 78% execution accuracy on the MIMICSQL dataset, a substantial improvement over TREQS. The ablation study further demonstrated that schema linking and biomedical pre-trained encoders, such as BioBERT [8], yielded modest performance gains. In contrast, ClinicalBERT [9], trained on clinical notes, proved slightly less effective for question answering than BioBERT or general BERT [35].
By 2024, advancements in LLMs created new opportunities for NL2SQL systems. With increasing access to clinical data and the growing prevalence of large-scale architectures, researchers found that these models could generalize more effectively by learning patterns of natural language and formal SQL syntax. Moreover, large-scale pretraining enabled them to acquire competence in other domain-specific query languages, such as SPARQL and FHIRPath [36,37]. In particular, LLMs demonstrated the ability to internalize the formal syntax of SQL, including constructs such as SELECT, WHERE, JOIN, and GROUP BY, without requiring explicit schema-specific training. This observation is well established by studies evaluating performance on the SPIDER [38], WikiSQL [30], and BIRD [39] datasets, which are the de facto cross-domain standards most NL2SQL studies report on [40,41,42]. While LLMs can generate syntactically valid SQL, NL2SQL additionally requires a schema-specific context (e.g., tables, columns, and constraints). This is typically integrated via few-shot prompting or RAG, which is enabled by rapidly expanding context windows. Recent models illustrate this trend, with context lengths growing from 4k–32k tokens (GPT-3.5/4) to 100k (Claude 2), 200k (Claude 3), 2M (Gemini 1.5), and 400k (GPT-5) tokens [43,44,45,46,47]. These characteristics of modern LLMs have fueled further research into designing new NL2SQL frameworks, such as the M3 framework by Al Attrach et al. [48], which incorporates a proprietary model and the Model Context Protocol (MCP) [49]. The MCP is essentially an open standard that allows AI models to securely connect to external APIs. Using Claude 3.5 models, the framework achieves over 90% accuracy on 100 sampled natural language questions from the EHRSQL dataset.
Regarding open-weight models, a recent study by Chadha et al. [50] evaluated five open-source LLMs on the task of NL2SQL on the MIMICSQL dataset. Among these, the Llama 3.1-70B model achieved the best performance in generating correct SQL queries, as measured by exact match accuracy. To the best of our knowledge, no prior work has benchmarked newer open-weight LLMs, such as GPT-OSS [51], DeepSeek V3, and DeepSeek V3.1 [52], on the MIMICSQL dataset, either for clinical querying or for clinical RAG-based question answering.
Yet, while fine-tuned transformer models such as TREQS [34] and MedTS [35] initially show promising results in clinical NL2SQL, a recent study [53] emphasizes that high benchmark scores may not necessarily equate to robust generalization in the healthcare domain. Tarbell et al. [53] demonstrated this by re-evaluating medical NL2SQL models on newly constructed, more challenging splits of the MIMICSQL dataset. Although state-of-the-art models achieved up to 92% accuracy on the original test set, performance collapsed to approximately 28% on their redesigned evaluation, which deliberately introduced greater lexical and schema variation. This striking drop revealed that existing models were overfitting to dataset artifacts rather than learning transferable representations. Complementary evidence comes from broader evaluations of clinical LLMs. For instance, Rahman et al. [54] conducted local fine-tuning of ClinicLLM, an LLM trained on hospitals’ clinical notes, separately on data from four different hospitals. They found that models trained on one hospital’s data suffered notable performance drops when tested across other institutions and demographic groups, underscoring how benchmark gains may conceal limited robustness in practice. However, it is worth noting that this study analyzed readmission prediction rather than NL2SQL. Similarly, Chen et al. [55] provided extensive empirical evidence showing that general-purpose and medical LLMs, regardless of the parameter scale, prompting techniques, or fine-tuning strategies, still fall short of traditional machine learning models in clinical prediction tasks.
Building on these observations, we advocate for further research in real-world settings through open-source solutions, which provide a promising pathway to mitigating the data privacy concerns associated with proprietary LLMs. In this work, we contribute to this agenda by benchmarking state-of-the-art open-source models on NL2SQL and QA tasks in the clinical domain. A detailed discussion of open-source models and the challenges surrounding their integration is presented in Section 2.4.

2.2. Retrieval-Augmented Generation in Clinical Environments

LLMs demonstrate strong performance in natural language processing, but they face limitations such as knowledge cutoffs and hallucinations [56]. A common way to address these issues is retrieval-augmented generation (RAG). In simple terms, RAG helps an LLM “look up” relevant information from trusted sources before answering a question. It works in two steps: first, a retriever finds the most relevant documents, and then the generator produces a response that is explicitly conditioned on this evidence. Lewis et al. [57] formally proposed RAG as a framework for grounding model outputs in external knowledge.
Regarding healthcare, Amugongo et al. [58] reported that RAG has been primarily deployed in clinical decision support, medical QA, and the augmentation of EHRs. They highlighted common challenges in RAG development and application, including variable retrieval quality, the lack of standardized evaluation benchmarks, privacy concerns, and persistent ethical issues.
One of the most immediate applications of RAG in the medical domain is the summarization and extraction of key information from EHRs, which are the primary focus of this study. For example, Alkhalaf et al. [59] developed a RAG-based approach to generate structured summaries of patient records. They defined specific summary fields (e.g., age, weight, and major diagnoses) and used those field names as queries to retrieve corresponding snippets from the EHR, which the model then aggregated into a summary. This method significantly improved the inclusion of key facts in the generated summaries compared to unguided generation. Similarly, Myers et al. [60] evaluated RAG for extracting specific information across a patient’s hospital stay. They introduced tasks like listing all imaging procedures performed, compiling an antibiotic treatment timeline, and generating the set of active diagnoses for a hospitalization.
Healthcare-specific applications, such as ChatENT in otolaryngology [61] and hepatology-focused RAG pipelines [62], similarly improved accuracy and reduced hallucinations. Beyond external knowledge, patient-specific retrieval is utilized in frameworks like RAGnosis [63], which operates on-premises to retrieve similar prior cases, thereby supporting privacy-preserving reasoning by analogy. Zhao et al. proposed MedRAG, which integrates a curated knowledge graph to differentiate between diseases with overlapping symptoms [64]. Some solutions incorporate external knowledge bases alongside the patient’s record, for example, retrieving medical fact sheets to clarify medical terms or provide background information that the EHR itself may not contain [65].
While the RAG approach is still relatively novel and remains under active research and development, evaluating RAG systems is a non-trivial task [66]. Beyond accuracy, one must assess factual correctness, clinical relevance, safety (e.g., avoiding harmful medical advice), and even the model’s reasoning trace [67], which may assist healthcare professionals in assessing the final LLM output. Meanwhile, there is a growing push for new benchmarks, such as MIRAGE [68], which was introduced to evaluate medical RAG across multiple QA datasets by measuring how much retrieval improves accuracy over knowledge-cutoff models. This benchmark, developed by Xiong et al. [68], demonstrated that incorporating external knowledge increased GPT-4’s multiple-choice QA accuracy from 73% to 80%. Some papers combine automated metrics (such as BLEU, ROUGE-L, and METEOR) with expert reviews [67], while others utilize neural-network-powered evaluation metrics such as BERTScore [69,70], which essentially matches words in candidate and reference sentences by cosine similarity. Therefore, in the present study, we chose to evaluate our RAG results using both surface-level overlap metrics, such as BLEU and ROUGE-L, as well as semantic similarity metrics, such as BERTScore.

2.3. Lexical Challenges in Clinical Environments

Clinical notes contain a high density of acronyms and abbreviations. While these shorten data entry, they often introduce ambiguity. Synonymy, homonymy, and domain-specific terminology further increase the challenges of clinical querying and information retrieval. In a survey of 46 healthcare professionals, Jayatilake and Oyibo [71] asked participants to interpret 20 abbreviations drawn from patient records, and only 4 were correctly identified by more than half of the respondents, yielding an overall mean accuracy of 32%. This highlights the pervasive ambiguity of clinical terminology, where abbreviations such as “MS” may denote multiple sclerosis, mitral stenosis, or morphine sulfate, while common concepts like “heart attack” can appear as myocardial infarction (MI), cardiac infarction, or simply infarct [72]. This problem extends beyond medical diagnoses. For instance, Cheung et al. conducted an audit of medication orders used to communicate which medications to administer to patients and found frequent use of dangerous abbreviations in both paper and electronic prescribing systems, underscoring the risks such practices pose to patient safety. They found that almost one-third of patients had orders containing dangerous abbreviations [73]. Dangerous medical abbreviations are shorthand forms or symbols that are easily misread and can cause medication misinterpretations and adverse events. Common examples include “IV” (intravenous, often confused with the Roman numeral IV); “q.d.” (quaque die—Latin for “once daily”), sometimes mistaken for “qid” (quater in die, “four times daily”); and the use of Roman numerals in dosing instructions (e.g., “tab II”, misread as 11 instead of 2). To minimize such risks, medical institutions such as the National Institutes of Health (NIH) in the U.S. [74] and the Institute for Safe Medication Practices (ISMP) in Canada publish reports discouraging unsafe abbreviations in clinical communication [75]. However, not all condensed forms of medical language carry the same risks. Unlike ad hoc abbreviations, the Latin language has long functioned as a standardized international lexicon in medicine [76]. For example, Lysanets and Bieliaieva’s quantitative analysis of Latin terms in medical case reports shows that Latin usage aligns with the communicative strategies of the medical genre. Latin supports the narrative style and serves important educational purposes in medical reporting [77]. However, while Latin offers conciseness and standardization, its use is uneven. Some doctors prioritize it, while others mix it with English or local terms, and many non-physician staff and patients struggle to understand it [77].
The Unified Medical Language System (UMLS), developed by the U.S. National Library of Medicine in 1986, addresses a central challenge in health informatics: the coexistence of multiple biomedical terminologies, vocabularies, and coding systems. As a comprehensive compendium of controlled vocabularies, the UMLS provides standardized concepts, mappings across terminologies, and semantic relationships, thereby mitigating issues of synonymy, homonymy, and abbreviation ambiguity to support accurate clinical information retrieval and natural language processing [78]. Currently, the UMLS contains over four million concepts and twelve million concept names, drawn from 223 biomedical vocabularies, including widely used terminologies such as SNOMED CT, MeSH, LOINC, and ICD. Despite standardization efforts, ambiguity in abbreviations persists. Kim et al. report that over 33% of abbreviations in the UMLS have multiple meanings, with the proportion rising to 54% in clinical reports. Similar results were reported by Liu et al. [79], who extracted 161,666 unique (abbreviation and full form) pairs from the UMLS and found that 33% of abbreviations with six or fewer characters had multiple meanings, while UMLS abbreviations covered more than 66% of those in clinical reports.
Recent work on LLM-driven concept normalization has primarily focused on addressing lexical challenges in biomedical texts through prompting-based methods [80,81] and fine-tuning. However, some approaches also explore agent-based frameworks. Fan et al. [82], for example, coordinated a terminology expert agent with simulated doctor personas to boost recall, though their reliance on social media mentions limits clinical applicability. Borchert et al. [83] used few-shot prompting to simplify biomedical mentions, achieving 63% accuracy on the Spanish SympTEMIST dataset. Fine-tuning a multilingual model on biomedical corpora improved embedding quality across more than 50 languages, though performance remained dependent on knowledge graphs [84].
Despite these advances, we found no prior research that explicitly addresses NL2SQL or question answering in clinical environments while integrating ontology grounding through more comprehensive strategies, for example, systematic grounding to large-scale biomedical vocabularies such as those mapped in the UMLS. This gap highlights an open challenge: existing solutions remain focused primarily on prompting techniques and fine-tuned embeddings. While our study does not provide a full end-to-end solution, it lays the groundwork for future research by systematically examining the lexical and semantic barriers to robust clinical querying and by identifying the most critical challenges encountered during sequence generation, based on NL2SQL and RAG tasks and an accompanying SQL-EC common error analysis.

2.4. Integration Challenges and Policy Considerations

The integration of LLM-powered NLP pipelines into clinical environments poses persistent challenges related to privacy, compliance, and interoperability [85]. Healthcare data is among the most sensitive categories of personal information, and regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States and the General Data Protection Regulation (GDPR) in the European Union mandate strict protections for identifiable patient information [85]. Legal frameworks have heightened interest in privacy-preserving techniques [86], notably differential privacy, which introduces controlled statistical noise to safeguard individual records, and LLM firewalls, which are built-in security mechanisms that regulate and filter model inputs and outputs. The importance of these approaches is highlighted in the review by Rathod et al. [87].
Reliance on built-in safeguards alone may not meet institutional standards for strict data custody [88]. Greater attention has therefore been given to deploying LLMs locally, particularly following the release of the DeepSeek LLM [52], either on hospital premises or within secure cloud environments, thereby ensuring that clinical data remains under organizational control. Unlike closed proprietary API-based models that require transmitting sensitive prompts to external providers, open-weight models such as DeepSeek, Google Gemma [89], Mistral [90], Meta Llama [91], and OpenAI GPT-OSS [51] support fine-tuning and inference within controlled infrastructures, aligning more effectively with regulatory and institutional compliance requirements [92].

2.5. Synthesis and Key Takeaways

This literature review shows steady progress in clinical NLP; however, robustness under lexical shifts, evaluation reliability, and privacy-constrained deployment remain open problems. The key takeaways can be summarized as follows:
  • NL2SQL has improved; generalization remains fragile. Transformer and LLM-based parsers encode the SQL syntax and outperform early seq2seq systems, yet accuracy drops sharply under schema and phrasing variation, which motivates tighter schema linking and explicit ontology grounding [33,35,53,54,55].
  • RAG improves factuality, yet evaluation is still unsettled. RAG helps EHR summarization and clinical QA; however, retrieval quality is variable, benchmarks are still emerging, and LLM-as-a-judge methods can be biased, so combined automatic and expert evaluation is needed [57,58,59,60,66,68].
  • Clinical language is lexically unstable. Abbreviations and synonymy remain pervasive, and the UMLS mitigates variance, but ambiguity persists, which suggests integrating ontology-aware normalization into NL2SQL and RAG pipelines [71,72,78,79,82,83,84].
  • Deployment requires privacy-first design. Compliance and trust favor local inference and open-weight models combined with LLM firewalls and differential privacy to retain data custody and meet institutional policies [51,52,85,87,88,89,90,91,92].
Guided by these takeaways, our study targets lexical robustness in clinical querying, evaluates NL2SQL and RAG-QA, and prioritizes local LLM integration with both surface-level and semantic metrics.

3. Methodology

This study assesses the performance and cost-efficiency of LLMs on two clinical tasks: (1) NL2SQL generation using the MIMICSQL test set and (2) clinical question answering (RAG-QA) over synthetic patient records enriched through RAG. We evaluate nine LLMs, comprising seven open-source models deployable locally and two proprietary API-based models. The comparative analysis considers model accuracy, estimated inference costs, and practical feasibility for integration into clinical workflows. To address domain-specific lexical challenges, we further conduct a common-error analysis of NL2SQL outputs using our proposed SQL-EC framework. Through this framework, we identify the most frequent errors in SQL query generation by general-knowledge LLMs when applied to clinical querying, thereby establishing a foundation for future research.

3.1. Language Models Under Test

This section reviews the nine large language models evaluated, summarizing architectures, parameter counts, and other key features. The set includes seven open-source models (DeepSeek V3, DeepSeek V3.1, Llama 3.3-70B, Qwen2.5-32B, Mixtral-8x22B, BioMistral-7B, and GPT-OSS-20B) and two proprietary models (GPT-4o and GPT-5), covering a broad range of approaches.
We focus on instruction-tuned, text-only LLMs and exclude reasoning variants, as their added complexity and slower inference are unnecessary for the straightforward tasks considered in this study [33,93,94]. Prior work has shown that extended reasoning chains can propagate errors, increasing the likelihood of compounding mistakes and substantially raising the cost per query [95]. This makes them misaligned with our objective of lightweight open-source LLM integration. For instance, in the NL2SQL task, reasoning models cannot execute intermediate queries during inference to obtain execution results and further refine the query. Supporting such behavior would require tool use and an agentic framework, both of which are outside the scope of this work.
Such reasoning models are designed to handle more complex tasks, like scientific and engineering design, math and programming problem solving, or multi-hop question answering [93].
DeepSeek V3 and V3.1, developed by DeepSeek AI, represent advancements in balancing computational efficiency and high performance through a mixture-of-experts (MoE) architecture [96]. Both V3 and V3.1 contain 671 billion parameters, of which approximately 37 billion are actively utilized per input, as the MoE mechanism selectively activates the most relevant “experts”—essentially independent feed-forward sub-networks. To further improve memory efficiency, these models introduce Multi-head Latent Attention (MLA), an attention variant that routes information through latent vectors rather than full token-to-token connections [52]. Building on V3, V3.1 introduces hybrid reasoning modes, post-training enhancements to tool use, a wider context window, and broader multilingual support [97]. Both are released under the MIT License, a permissive open-source license [98].
Llama 3.3-70B (Meta AI) is a dense, 70-billion-parameter transformer model designed for general-purpose reasoning and dialogue. It adopts a decoder-only architecture, incorporating rotary positional embeddings (RoPEs) to handle long sequences and SwiGLU activation functions for improved efficiency [91]. The model is released under the Meta Llama 3 Community License Agreement [99], which permits research and certain commercial applications, though with more restrictions compared to permissive licenses such as the MIT License.
Qwen2.5-32B-instruct (Alibaba Cloud) is a 32-billion-parameter dense transformer model designed for reasoning and dialogue. It adopts rotary positional embeddings (RoPEs) and grouped-query attention (GQA) to enable efficient long-context inference [100]. Compared to its predecessor, Qwen2, Qwen2.5 features an improved pretraining corpus, enhanced alignment through supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), as well as broader multilingual coverage (supporting >30 languages). The model is released under the Apache-2.0 License [101].
Mixtral-8x22B-instruct (Mistral AI) is a sparse MoE decoder-only model with eight 22B experts; inference activates a subset per token (39B active of 141B total), yielding strong cost-efficiency [102]. It offers a 64K context window, native function calling, and strong math/coding performance, with optimization for English, French, German, Spanish, and Italian. It is released as a model with open weights under Apache-2.0, with an instruction-tuned variant utilized in this paper [101].
BioMistral-7B (BioMistral Team) is a 7B biomedical model obtained by further pretraining Mistral-7B-Instruct on PubMed Central Open Access texts. It retains Mistral’s dense decoder-only transformer with GQA, sliding-window attention, RoPE, RMSNorm, and SwiGLU for efficient inference [103]. The model is released under the Apache-2.0 License [101]. Although it is significantly smaller (7 billion parameters), we selected it as a representative example of a smaller LLM fine-tuned on biomedical textual data.
GPT-4o (OpenAI) is a multimodal model supporting text, vision, and audio inputs and capable of generating text, images, and sound. For consistency, we use its text-only chat interface. It matches GPT-4 Turbo on English text and coding tasks, improves non-English performance, and is optimized for low latency. Parameter counts remain undisclosed, with access limited to ChatGPT and the OpenAI API [104].
GPT-5 (OpenAI) is a flagship LLM that internally routes inputs between a fast main pathway and an optional, deeper thinking pathway. In line with our scope, we evaluate only the standard text-only chat configuration and exclude the reasoning mode. Relative to GPT-4, GPT-5 is designed to lower hallucination rates, enhance instruction following, improve code generation, and incorporate a safe completion alignment strategy. The model’s parameter count and training compute remain undisclosed, and access is proprietary via ChatGPT and the OpenAI API [105].
GPT-OSS-20B (OpenAI) is an open-weight MoE transformer designed for reasoning and tool use. It has 20.9B total parameters, with approximately 3.6B active per token via sparse routing. The weights, inference code, tool environments, and tokenizer are released under Apache-2.0 with a GPT-OSS usage policy. Although the model is positioned for agentic tasks such as browsing and Python tools, we evaluate it in a text-only chat setting with tools disabled. Along with the 120B variant, it constitutes the GPT-OSS family, OpenAI’s first open-weight model release since GPT-2 [51].
Table 1 provides a structured overview of the nine models under evaluation, highlighting their architectures, parameter counts, context lengths, and availability.
The nine models in this study were selected to provide a representative cross-section of available LLMs relevant for clinical applications. We included (i) large-scale sparse MoE models (DeepSeek, Mixtral, and GPT-OSS), (ii) state-of-the-art dense general-purpose models (Llama and Qwen), (iii) a biomedical domain-adapted model (BioMistral), and (iv) proprietary API models (GPT-4o and GPT-5). This combination allows us to compare across architectural paradigms, scales, and licensing conditions, as summarized in Table 1. Moreover, we did not re-evaluate Claude 3.5, GPT-4o-mini, or Gemini 1.5 Pro; these proprietary models were already benchmarked in our prior work [106], and their inclusion does not align with the central scope and thematic focus of this study. Our emphasis here is on open-source LLMs and their feasibility for local deployment.
For all evaluated open-source models (DeepSeek V3/V3.1, GPT-OSS-20B, Llama 3.3-70B, Qwen2.5-32B, and Mixtral-8 × 22B), we relied on Fireworks AI’s managed inference and internal quantization pipeline, which combines methods such as SmoothQuant [107] and GPTQ [108] with outlier-reduction transforms and evaluates quality using divergence-based metrics to closely track FP16 baselines while reducing costs and latency [109,110]. These models were accessed through Fireworks AI’s private managed endpoints, where requests are processed entirely within their isolated infrastructure and are not shared with model developers. By default, open-weight models are served under a zero-retention policy, where prompts and generations are retained only in volatile memory, with minimal metadata (e.g., token counts) logged [109].
For BioMistral-7B, we deployed a 15 GB checkpoint using FP16/BF16 weights via a private Fireworks AI endpoint, ensuring the same level of privacy guarantees. This managed private inference pattern avoids the operational complexity of maintaining GPU clusters, quantization toolchains, and autoscaling while enabling healthcare institutions to meet strict privacy and data residency requirements. Remark: A common misconception is that deploying local LLMs necessarily requires specialized GPU clusters to ensure data privacy [111,112]. Institutions may seek deployment options that ensure data residency and strict privacy controls, including on-premises deployments or managed providers with isolated environments, such as Fireworks AI. On-premises deployment essentially means running the LLM infrastructure within the organization’s own data centers or facilities, providing complete control over data processing and storage. Legal entities, such as the Hamburg Data Protection Authority, have outlined requirements for local operations and third-party providers, emphasizing the need for strict processing controls and the minimization of personal data processing [113]. Integration and privacy concerns are discussed in more detail in Section 2.4.

3.2. Task 1: Natural Language to SQL Generation (NL2SQL)

The NL2SQL task was conducted on the MIMICSQL dataset [34] using the nine models described in the previous section. Each model was evaluated across three independent runs on a test sample of 1000 cases, where each case consisted of a natural language query paired with its corresponding SQL query. In total, this setup produced 9 models × 3 runs × 1000 queries, yielding 27,000 generated SQL sequences for evaluation.
We set the temperature to 0 for all nine LLMs to produce results that are as deterministic as possible. The batch size was set to 5, meaning five test cases were processed simultaneously. No thinking models were used, as discussed in the previous section. Nucleus sampling (top-p) is irrelevant in our case because the temperature was set to 0 [114], resulting in little to no randomness in token selection. For the API, we set the retry limit to 3, the maximum output length to 200 tokens, and the timeout duration to 30 s.
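For concreteness, the snippet below gives a minimal sketch of such a generation call, assuming the OpenAI Python SDK (the open-weight models were reached through OpenAI-compatible managed endpoints); the model identifier and prompt-building helper are illustrative placeholders rather than the exact code used in this study.
```python
# Minimal sketch of the NL2SQL generation call (illustrative, not the exact
# study code). Assumes the OpenAI Python SDK; open-weight models can be reached
# by pointing base_url at the provider's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI()  # for open-weight models, set base_url to the managed endpoint

MAX_RETRIES = 3        # API retry limit
GENERATION_PARAMS = {
    "temperature": 0,  # as deterministic as the API allows
    "max_tokens": 200, # cap on generated SQL length
    "timeout": 30,     # seconds per request
}

def generate_sql(question: str, system_prompt: str, model_id: str) -> str:
    """Generate one SQL query for a natural language question."""
    for attempt in range(MAX_RETRIES):
        try:
            response = client.chat.completions.create(
                model=model_id,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": question},
                ],
                **GENERATION_PARAMS,
            )
            return response.choices[0].message.content.strip()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise
    return ""
```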
Remark: The cases in the test sample do not include the expected executed SQL results.
Remark: In machine learning tasks such as classification and regression, results are typically presented in terms of reference–predicted values. Since LLMs essentially generate sequences of tokens rather than single labels or scalar outputs, we adopt the terminology reference–generated throughout this work.
Table 2 presents example pairs of natural language questions and reference SQL queries from the MIMICSQL dataset.
To evaluate the accuracy of generated SQL queries, we execute both the generated SQL and the reference SQL against an SQLite database [115]. SQLite was chosen for its lightweight design, ease of setup, and fast processing speed, making it well-suited for research settings where production-level considerations are unnecessary. Rather than comparing queries at the syntax level, we assess their outputs, as two queries can produce functionally identical results despite syntactic differences (e.g., variations in JOIN orders or WHERE clause structures). Furthermore, because database outputs may be returned in a non-deterministic order, we sort both the reference and generated result sets lexicographically prior to comparison.
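A minimal sketch of this execution-based comparison is shown below, assuming a local SQLite database file populated with the MIMICSQL tables; the helper names are illustrative.
```python
# Sketch of execution-based comparison against a local SQLite database
# (illustrative helper names; assumes a database file with the MIMICSQL tables).
import sqlite3

def execute_sorted(db_path: str, sql: str) -> list[str]:
    """Run a query and return its rows sorted lexicographically,
    so that row order does not influence the comparison."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()
    return sorted(str(row) for row in rows)

def execution_match(db_path: str, generated_sql: str, reference_sql: str) -> bool:
    """True if the generated query executes and its (order-insensitive)
    result set equals that of the reference query."""
    try:
        generated_rows = execute_sorted(db_path, generated_sql)
    except sqlite3.Error:
        return False  # non-executable queries count as incorrect
    reference_rows = execute_sorted(db_path, reference_sql)
    return generated_rows == reference_rows
```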
The metric used to evaluate the executed result sets is referred to as execution accuracy (EX) in the literature [38,116]. EX is an objective measure defined as the comparison between the result set produced by the LLM-generated SQL query and that of the ground-truth (reference) query. For EX to be computed, both the generated and reference SQL queries must successfully execute and produce a result set.
To be considered correct under the EX metric, a generated query must first meet the following general conditions:
  • Syntactic Validity: The generated SQL query must execute without syntax errors, meaning it is successfully parsed and run by the SQLite engine.
  • Semantic Correctness: The query must return results consistent with the intended meaning of the natural language prompt.
  • Adherence to Constraints: The query must respect the structure and integrity constraints defined by the underlying database schema.
However, EX can sometimes produce a false positive, for example, when the generated SQL query returns the same output as the reference query (such as an empty set or only NULL values), despite being logically incorrect. For our 1000-case MIMICSQL set, we did not encounter such a scenario. However, we argue that this is important to address through alternative evaluation metrics, such as logical form accuracy [117], which compares query structure and intent rather than relying solely on output equivalence.
In addition to these general requirements, EX imposes specific criteria concerning the result set:
  • The result set must contain the same data values as the reference query;
  • The result set must include the same number of rows;
  • The order of rows is disregarded—all result sets are sorted before comparison. For instance, when evaluating a query such as “return the top 5 diagnoses,” the results are ordered alphabetically before comparison, even if the query lacks an explicit ORDER BY clause.
Although query efficiency, such as execution time and resource consumption, could serve as a qualitative indicator, it was not incorporated into the EX metric. Our evaluation focused solely on the correctness of result sets, thereby minimizing subjectivity. Partially correct queries were not evaluated in this study and are counted as incorrect by EX.
Accuracy was computed as the proportion of queries that satisfied all defined EX criteria relative to the total number of queries ($N_{\text{total queries}} = 1000$):

$$EX = \frac{N_{\text{correctly generated queries (LLM)}}}{N_{\text{total queries}}}$$

To evaluate the consistency of the models’ performances, we analyzed the variation in EX across the three independent runs. The sample standard deviation of EX was calculated as

$$\mathrm{SD}(EX) = \sqrt{\frac{1}{2}\sum_{i=1}^{3}\left(EX_i - \overline{EX}\right)^2},$$

where $EX_i$ denotes the execution accuracy of the $i$-th run and $\overline{EX}$ is their mean.
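For illustration, the snippet below computes the mean and sample standard deviation for three hypothetical per-run EX values (the numbers are placeholders, not results from this study); statistics.stdev uses the n - 1 denominator, which reduces to the 1/2 factor above for three runs.
```python
# Illustrative EX / SD(EX) computation; the three per-run accuracies below are
# hypothetical placeholder values, not results reported in this study.
from statistics import mean, stdev

ex_runs = [0.660, 0.655, 0.668]   # EX_1, EX_2, EX_3 for one model
ex_mean = mean(ex_runs)           # mean execution accuracy over the three runs
ex_sd = stdev(ex_runs)            # sample SD: sqrt(sum((EX_i - mean)^2) / (3 - 1))

print(f"mean EX = {ex_mean:.3f}, SD(EX) = {ex_sd:.4f}")
```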
For the cost evaluation, three primary factors were considered:
  • Input tokens: The total number of tokens in the prompt, encompassing the natural language query, schema details, instructions, and other contextual elements.
  • Output tokens: The number of tokens the model generates as part of the SQL response.
  • Pricing structure: Each model’s specific cost per input and output token.
The total cost per query was derived by summing the input and output token costs. These costs were estimated based on actual token counts recorded during inference for each model. The recorded token counts were then multiplied by the official pricing for each provider (OpenAI for GPT-4o and GPT-5 [118] and Fireworks AI for open-source models [119]). Token pricing is reported as the cost per million tokens, separately for input and output.
$$\text{TotalCost} = (\text{Tokens}_{\text{in}} \times \text{Price}_{\text{token}}^{\text{in}}) + (\text{Tokens}_{\text{out}} \times \text{Price}_{\text{token}}^{\text{out}})$$
No synthetic cost modeling was used; the reported values reflect token usage under real inference conditions. This method enabled a comprehensive assessment of each model’s cost-efficiency, providing a complementary perspective to the accuracy and cost analyses.
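The per-query cost computation can be sketched as follows; the token counts and per-million prices in the example are hypothetical placeholders, not the providers' published rates.
```python
# Sketch of the per-query cost computation from measured token counts.
# Prices are expressed per million tokens; the example numbers below are
# hypothetical placeholders, not the providers' actual rates.
def query_cost(tokens_in: int, tokens_out: int,
               price_in_per_m: float, price_out_per_m: float) -> float:
    """TotalCost = Tokens_in * Price_in + Tokens_out * Price_out."""
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000

cost = query_cost(tokens_in=1800, tokens_out=60,
                  price_in_per_m=2.50, price_out_per_m=10.00)
print(f"USD {cost:.6f} per query")
```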
As for the system prompt structure, we guided LLMs with a predefined system prompt to successfully execute NL2SQL. Each user query included schema details, example pairs, and execution constraints to guide the generation of SQL. These prompts, tailored for the MIMICSQL dataset, ensured that the outputs were valid, relevant, and consistent by framing the task and imposing domain-specific guidelines.
The prompt for the NL2SQL task includes the following components:
  • Job description: A brief but explicit statement defining the model’s role as a SQL query generator within the medical domain. This component anchors the model’s objective and ensures domain-specific alignment.
  • Instructions: A comprehensive list of behavioral constraints to shape the SQL outputs. These include formatting directives, SQL dialect restrictions (specific to SQLite), avoidance of placeholders, and preferences for readability and syntactic simplicity. Such instructions minimize ambiguity and promote consistency.
  • Schema information: A detailed presentation of the database schema, including table names, field types, and descriptions. This context helps the model to generate syntactically correct and semantically coherent SQL queries.
  • 1-shot example query: A single illustrative example that pairs a user query with its corresponding SQL translation. This demonstrates expected output structure and style, helping to condition the model’s subsequent generations.
  • Data preview: A selection of sample rows from key database tables, offering empirical insight into data content and structure. This component further enhances the model’s ability to align query logic with real-world data patterns.
The prompt structure (excerpt) for the NL2SQL task is provided in Appendix A.
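As a rough illustration of how these components fit together, the sketch below assembles a system prompt from placeholder strings; the component texts are abbreviated stand-ins, not the actual prompt used in this study (see Appendix A for the real excerpt).
```python
# Illustrative assembly of the NL2SQL system prompt from the components above.
# All component texts are abbreviated placeholders; the actual prompt excerpt
# used in this study is provided in Appendix A.
JOB_DESCRIPTION = "You are an SQL query generator for a medical (MIMICSQL) database."
INSTRUCTIONS = (
    "Return a single SQLite-compatible query as plain text. "
    "Do not use placeholders; prefer simple, readable SQL."
)
SCHEMA_INFO = "Tables and columns: demographic(...), diagnoses(...), procedures(...), prescriptions(...), lab(...)"
ONE_SHOT_EXAMPLE = (
    "Q: how many female patients are there?\n"
    "A: SELECT COUNT(DISTINCT subject_id) FROM demographic WHERE gender = 'F'"
)
DATA_PREVIEW = "demographic sample row: subject_id=..., gender=..., admission_type=..."

def build_nl2sql_prompt() -> str:
    """Concatenate the prompt components in the order described above."""
    return "\n\n".join(
        [JOB_DESCRIPTION, INSTRUCTIONS, SCHEMA_INFO, ONE_SHOT_EXAMPLE, DATA_PREVIEW]
    )
```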

3.3. Task 2: Retrieval-Augmented Question Answering (RAG-QA)

We implemented a RAG pipeline that utilizes the Qdrant vector database [120] to extract medical information from patient records. The pipeline consists of three principal steps: First, (1) chunking and indexing split patient records into smaller text segments, which are stored in the Qdrant vector database [120] as 512-dimensional vectors. Second, (2) retrieval and search identify and return the most relevant text segments from the database based on cosine similarity to the user’s query. Finally, (3) answer generation and evaluation generate responses from the retrieved context for 20 predefined medical questions covering the entire record. The generated answers are then compared with reference answers using three metrics: BLEU [121], ROUGE-L [122], and BERTScore [123].
Qdrant was selected as the vector database due to its high-performance vector similarity search capabilities and its ease of local deployment. It imposes zero infrastructure overhead, incurs no API or cloud costs, and provides users with full data control [120]. We leveraged this functionality to constrain search queries to vectors associated with specific patient IDs, thereby ensuring that only the clinically relevant context was retrieved. Qdrant offers native support for cosine similarity, integrates seamlessly with our Python-based development environment, and is distributed under the permissive Apache 2.0 License.
Let $D$ denote the indexed corpus of a single patient’s medical record, represented as a collection of embedded passages. Given a medical question $x$, we retrieve the top-$k$ relevant passages $\{z_1, \ldots, z_k\} \subseteq D$ using cosine similarity in the embedding space. The system then generates an answer $y$ by conditioning jointly on the query and the retrieved passages:

$$P(y \mid x) = P(y \mid x, z_1, z_2, \ldots, z_k),$$

where the retrieved passages are concatenated and provided as context to the language model. Here, $y$ is the final generated medical answer, enriched with information drawn from the patient’s medical record. We set $k = 3$, allowing the model to synthesize evidence across the three most semantically similar retrieved passages, rather than relying on a single passage.
In our evaluation, questions are quite simple and typically require evidence from only 1–2 sections of an 18-section record schema (see Table 3). We therefore ablated k to balance coverage and brevity: starting from k = 6 to maximize breadth, then reducing to k = 2 to minimize context length. We ultimately selected k = 3 as the smallest value that consistently captured all required evidence while avoiding redundancy and spurious distractions. Practically, k = 3 provides enough headroom to include complementary sections (e.g., history, diagnostics, and plan) when the top passage is narrowly scoped, whereas a larger k value yields diminishing returns and unnecessary context, and a smaller k value risks omissions. Thus, k = 3 offers the best coverage–conciseness trade-off for our particular scenario. Future work may explore adaptive or learning-based k thresholding strategies.
The evaluation was conducted on 100 synthetic patient records across 20 medical questions, yielding 100 × 20 = 2000 answers per LLM. With nine LLMs, this amounts to 18,000 answers per run and, over three runs, a total of 54,000 evaluated reference–generated answer pairs.
(1) Chunking and Indexing: The synthetic dataset includes 100 patient records in markdown format. Each record is divided into multiple medical sections and is structured using markdown headings (‘##’ and ‘###’). Each record is tagged with a unique patient ID for accurate identification. Entries indexed in the vector database contain both the patient ID and the corresponding sections for easy retrieval. Indexing is performed using 512-dimensional vector embeddings generated by OpenAI’s text-embedding-3-small, an Ada-based embedding model [124], and the Qdrant-default HNSW [125] (Hierarchical Navigable Small World) index, which is its most efficient structure for high-dimensional vectors. The m parameter is set to 16, the default number of bi-directional links for HNSW. Full precision (32-bit float) is used, with no quantization applied. No re-ranking was applied. For reproducibility in data preparation, we fixed the random seed to 42.
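The sketch below illustrates this chunking and indexing step with the qdrant-client and openai Python packages; collection names, payload fields, and the splitting heuristic are simplified illustrations rather than the exact pipeline code.
```python
# Illustrative chunking-and-indexing sketch using qdrant-client and openai;
# collection names, payload fields, and the splitting heuristic are simplified
# stand-ins for the actual pipeline.
import re
from openai import OpenAI
from qdrant_client import QdrantClient, models

openai_client = OpenAI()
qdrant = QdrantClient(":memory:")  # or a locally running Qdrant instance

qdrant.create_collection(
    collection_name="patient_records",
    vectors_config=models.VectorParams(size=512, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=16),  # default number of bi-directional links
)

def embed(text: str) -> list[float]:
    """512-dimensional embedding from text-embedding-3-small."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=text, dimensions=512
    )
    return response.data[0].embedding

def index_record(patient_id: str, markdown_record: str, start_id: int) -> int:
    """Split one markdown record on '##'/'###' headings and upsert its chunks."""
    chunks = [c.strip() for c in re.split(r"\n(?=#{2,3} )", markdown_record) if c.strip()]
    points = [
        models.PointStruct(
            id=start_id + i,
            vector=embed(chunk),
            payload={"patient_id": patient_id, "text": chunk},
        )
        for i, chunk in enumerate(chunks)
    ]
    qdrant.upsert(collection_name="patient_records", points=points)
    return start_id + len(chunks)
```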
To evaluate retrieval performance within this RAG-QA framework, we constructed a set of 20 predefined questions that cover the entire patient record. These questions are listed in Table 4.
(2) Retrieval and Search phase: Each of the 20 predefined questions is embedded into the same vector space as our patient records. The top three context chunks, identified through cosine similarity ranking, are retrieved and combined into a reference termed “RAG documentation”. This RAG documentation is then appended during the final answer generation stage.
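Continuing the indexing sketch above (and reusing its qdrant client, models import, and embed helper), the retrieval step can be sketched as a filtered top-3 similarity search whose hits are concatenated into the RAG documentation.
```python
# Retrieval sketch, continuing the indexing example above (reuses qdrant,
# models, and embed). The top-3 chunks for the given patient are concatenated
# into the "RAG documentation" string appended at answer generation.
def retrieve_rag_documentation(question: str, patient_id: str, k: int = 3) -> str:
    hits = qdrant.search(
        collection_name="patient_records",
        query_vector=embed(question),
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="patient_id", match=models.MatchValue(value=patient_id)
                )
            ]
        ),
        limit=k,  # top-k = 3 passages ranked by cosine similarity
    )
    return "\n\n".join(hit.payload["text"] for hit in hits)
```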
(3) Answer Generation and Evaluation: The retrieved context (i.e., RAG documentation) is subsequently integrated into the RAG system prompt, as detailed in Appendix C. The experiment is conducted using nine large language models, which we discussed in Section 3.1. To evaluate the results, we use three metrics described in the next chapter and perform similarity tests against the ground-truth (reference) answers per patient record to the predefined questions listed in Table 4.
In this generation phase, we used the same hyperparameters for the API and model temperature as in the NL2SQL generation task.
Figure 1 illustrates the complete pipeline of the RAG-QA framework.

3.3.1. RAG-QA Metrics

We employ three widely used evaluation metrics to assess the similarity between generated and reference texts: BLEU, ROUGE-L, and BERTScore. Higher scores across these metrics generally indicate greater alignment with the reference text. The application of these metrics is illustrated using an example reference–generated answer pair, and an illustrative computation sketch for all three metrics is given after the metric descriptions below.
Reference answer:“Patient presented with severe chest pain and shortness of breath”
Generated answer:“Patient reported severe chest pain and breathing difficulty”
  • BLEU [121] (Bilingual Evaluation Understudy) evaluates the generated text quality via n-gram overlap with a reference. We report BLEU-4, capturing n-gram matches up to n = 4.
    Figure 2 presents tokenized versions of the reference and generated answers for the example pair, with overlapping unigrams, bigrams, and trigrams highlighted in green, blue, and red, respectively, to illustrate lexical similarity.
    We employ the BLEU-4 scoring function from NLTK [126], which incorporates n-gram precision up to n = 4 and includes a brevity penalty. Summary statistics for n-gram matches in our example are given in Table 5, with the complete BLEU computation provided in Appendix C.1.
  • ROUGE-L [122] (Recall-Oriented Understudy for Gisting Evaluation) measures sequence-level similarity via the longest common subsequence (LCS), which captures non-contiguous token matches that preserve order. Higher LCS values reflect closer alignment in information structure. We use the ROUGE-L F1 score from Google’s ROUGE implementation, harmonizing LCS-based precision and recall.
    Figure 3 illustrates an LCS-based comparison between a reference and a generated answer example. Tokens retained in order are visually highlighted in pink to demonstrate the LCS.
    A summary of token-level statistics for LCS matching is provided inline, with the complete ROUGE-L F1 derivation given in Appendix C.2.
  • BERTScore [123] measures semantic similarity between generated and reference texts using contextual embeddings from pre-trained transformer models. Unlike lexical metrics, it aligns each token with its most semantically similar counterpart (in both directions) via cosine similarity.
    We report BERTScore F1, which integrates the precision and recall of token alignments, capturing semantic equivalence. This is particularly valuable in clinical contexts where lexical variation is quite common.
    Figure 4 illustrates this with a comparison between “shortness of breath” and “breathing difficulty”, showing that BERTScore captures semantic equivalence beyond surface token overlap.
    The formal mathematical definition and computation of BERTScore are provided in Appendix C.3.
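The following sketch shows how the three metrics above might be computed for the example reference–generated pair, assuming the nltk, rouge-score, and bert-score Python packages; the smoothing function is an addition here to avoid zero BLEU scores on very short texts and is not necessarily part of the exact configuration detailed in Appendix C.1.
```python
# Illustrative computation of the three metrics for the example pair above,
# assuming the nltk, rouge-score, and bert-score packages.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Patient presented with severe chest pain and shortness of breath"
generated = "Patient reported severe chest pain and breathing difficulty"

# BLEU-4: n-gram precision up to n = 4 with a brevity penalty.
bleu = sentence_bleu(
    [reference.split()], generated.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F1: harmonic mean of LCS-based precision and recall.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"].fmeasure

# BERTScore F1: contextual-embedding token alignment via cosine similarity.
_, _, f1 = bert_score([generated], [reference], lang="en")

print(f"BLEU-4={bleu:.3f}  ROUGE-L F1={rouge_l:.3f}  BERTScore F1={f1.item():.3f}")
```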
The prompt for the RAG-QA task includes the following components:
  • Job description: It defines the model’s role as a health administrator responding to clinical questions based on patient documentation. This sets the scope and tone for the expected outputs.
  • 1-shot example: A single illustrative query–response pair demonstrates the expected style, format, and specificity of the model’s answers.
  • RAG Documentation: It is an excerpt of a patient record retrieved in the RAG retrieval and search step. This context provides the factual basis for the model’s response and constrains it to a grounded answer.
  • Output format: Instructions require responses to be returned as plain text, without any formatting or markup, ensuring clarity and consistency.
The prompt structure (excerpt) for the RAG-QA task is provided in Appendix B.

3.4. Task 3: SQL Error Classification (SQL-EC)

We conducted an error-type analysis to better understand the nature of model failures in the NL2SQL task. Drawing on prior research, primarily focused on post-2020 developments involving LLMs and earlier machine learning-based approaches, while excluding rule-based systems, we identified five standard categories of errors frequently observed in generated SQL queries. This analysis was conducted in two stages: (1) applying our rule-based error classifier to systematically categorize erroneous results from the NL2SQL task and (2) manually annotating those cases where the classifier assigned the OTHER category when no specific error pattern was detected.
The primary goal of the SQL-EC task is to identify the most frequent and impactful failure patterns in the NL2SQL pipeline. Each error category is characterized by a set of heuristic rules that indicate when an incorrect SQL query conforms to a particular error type. These heuristics support both manual annotation and automated classification, and they are based on the origin of the error, whether from misinterpreting the natural language query, structuring the SQL query incorrectly (while remaining executable), or producing a non-executable query due to syntax errors. Therefore, we identified the following five most common error categories in SQL query generation:
  • Schema mismatch—Inconsistencies between a query and the database schema often arise from ambiguous or imprecise language [40,127,128,129].
  • Incorrect Aggregation Usage—Misapplication of aggregation operations typically stems from misunderstandings of how to group or summarize data [128,130].
  • Join errors—Omitted joins, missing tables, or misconfigured join types can result in incomplete or misleading query results across tables [40,127,129,131].
  • Condition misinterpretation—Logical mistakes within WHERE or HAVING clauses may lead to incorrect or overly broad results [128,129].
  • String mismatch—Challenges in handling string values often arise from vocabulary differences, improper column use, or flawed comparisons [128,132,133].
In addition to the literature background, we handcrafted several detection patterns tailored to the MIMICSQL database to identify these categories.
Table 6 summarizes these five error categories and heuristic rules guiding their assignment. This table constitutes our proposed framework for MIMICSQL error classification, which is grounded in prior work.
This two-stage analysis includes the following:
  • Deterministic classification: The first stage involved applying a rule-based error classifier (SQL-EC) to systematically categorize erroneous SQL generations. The classifier compared each generated SQL query against the ground truth from the MIMICSQL dataset and assigned one or more error categories deterministically. Each error log included the original natural language question, the corresponding reference query, and the incorrect query produced by the model. If no specific pattern matched, the classifier assigned the query to the OTHER category. This ensured consistency across all models and enabled multifaceted labeling when multiple types of errors were present.
  • Manual annotation: In the second stage, we manually reviewed cases that the classifier had assigned to the OTHER category. These manual checks ensured that residual errors were correctly interpreted and aligned with the predefined taxonomy, avoiding under-classification due to limitations of the deterministic rules.
The pseudocode for the deterministic rule-based error classifier from the first stage is shown in Algorithm 1.
Algorithm 1: Pseudocode of the proposed rule-based SQL error classification framework for the SQL-EC task
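To make the first-stage logic concrete, the following Python sketch approximates the deterministic classifier. The regular expressions shown cover only three of the five categories and are simplified stand-ins for the full heuristic rules of Table 6; function and label names are ours.
    import re

    # Simplified stand-ins for the Table 6 heuristics (subset of categories shown).
    def extract_string_literals(sql: str) -> set:
        """Quoted string literals, lower-cased for comparison."""
        return {m.lower() for m in re.findall(r"['\"]([^'\"]+)['\"]", sql)}

    def extract_tables(sql: str) -> set:
        """Table names referenced after FROM/JOIN keywords."""
        return {t.lower() for t in re.findall(r"(?:from|join)\s+\"?(\w+)\"?", sql, re.I)}

    def classify(generated: str, reference: str) -> list:
        """Assign one or more error categories to an invalid generated query."""
        labels = []
        gen, ref = generated.lower(), reference.lower()
        # Join errors: the set of referenced tables differs from the reference.
        if extract_tables(gen) != extract_tables(ref):
            labels.append("JOIN_ERROR")
        # Incorrect aggregation: aggregation functions differ between queries.
        aggs = ("count(", "avg(", "sum(", "min(", "max(")
        if {a for a in aggs if a in gen} != {a for a in aggs if a in ref}:
            labels.append("INCORRECT_AGGREGATION")
        # String mismatch: quoted literals differ from the reference.
        if extract_string_literals(gen) != extract_string_literals(ref):
            labels.append("STRING_MISMATCH")
        # Residual cases fall into OTHER and are reviewed manually (second stage).
        return labels or ["OTHER"]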

4. Results and Discussion

This section presents the study’s findings, focusing on the performance and cost-efficiency of the nine evaluated LLMs across three defined tasks: NL2SQL, RAG-QA, and SQL-EC. The results are organized into three subsections, each dedicated to one task.

4.1. NL2SQL Results

The evaluated models demonstrated varying levels of execution accuracy (EX) and cost-effectiveness. These results highlight EX across 3000 queries per model and the associated computational costs. Additionally, we conducted pairwise t-tests across all LLMs to assess the statistical significance of performance differences and performed a Pareto front analysis to identify models that optimally balance accuracy and cost.
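As a point of reference, EX can be computed by executing the generated and ground-truth queries against the same SQLite copy of MIMICSQL and comparing their result sets, as in the minimal sketch below; the function name is ours, and comparing results as unordered sets is a simplification of the exact matching rule.
    import sqlite3

    def execution_match(db_path: str, generated_sql: str, reference_sql: str) -> bool:
        """True when both queries execute and return identical result sets."""
        con = sqlite3.connect(db_path)
        try:
            gen_rows = set(con.execute(generated_sql).fetchall())
            ref_rows = set(con.execute(reference_sql).fetchall())
        except sqlite3.Error:
            return False  # non-executable generations count as failures
        finally:
            con.close()
        return gen_rows == ref_rows

    # EX = matches / total questions, averaged over the three runs.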
Among the open-source models, DeepSeek achieved the highest performance, with DeepSeek V3 averaging 58.8% and V3.1 slightly higher at 59.8%. Llama 3.3-70B also performed strongly at 54.5%, which is notable given its relatively smaller parameter count. The remaining open-source models, Qwen2.5-32B-Instruct (51.0%) and Mixtral-8x22B-Instruct (52.4%), performed moderately, while GPT-OSS-20B ranked lowest at 45.6%. BioMistral-7B, despite fine-tuning on biomedical corpora, scored only 11.8%, suggesting insufficient parameter-level knowledge of SQL for accurate query generation. Among proprietary models, OpenAI’s GPT-4o achieved the highest score (66.1%), outperforming GPT-5 (64.6%); the reason for GPT-5’s relative decline remains unclear. In summary, DeepSeek models lead among open-source systems, whereas GPT-4o retains the top position overall, consistent with the proprietary models we tested in our previous work [106].
Figure 5 illustrates the three-run mean execution accuracy (EX) for the nine models tested.
Table 7 summarizes execution statistics based on 3000 observations per LLM, including the total number of correctly generated SQL queries. Across all models, 27,000 observations were conducted, of which 13,938 were valid (51.6%) and 13,062 were invalid (48.3%). The average input length (system prompt plus natural language question) ranged from 2134.92 tokens for GPT-4o to 2987.24 tokens for BioMistral-7B, with an overall mean of 2407.91 tokens. The average output length ranged from 58.11 tokens for GPT-4o to 126.12 tokens for GPT-OSS-20B, with an overall mean of 76.92 tokens. These measures inform the subsequent cost analysis, which examines the financial implications of deploying each model at scale. GPT-4o demonstrated the highest accuracy and lowest average token usage, although it remains the most expensive among the eight models included in the cost analysis.
We excluded BioMistral-7B from the cost analysis because it was deployed independently using an on-demand GPU configuration rather than through Fireworks AI’s managed infrastructure. Consequently, its costs are measured in GPU hours rather than standardized API pricing, making it incomparable to the other models. Among the remaining eight LLMs, GPT-4o is the most expensive, averaging USD 0.005918 per query, with API pricing set at USD 2.50 per million input tokens and USD 10.00 per million output tokens. GPT-5, while exhibiting slightly lower performance, is more cost-efficient at USD 0.003294 per query. Among open-source models, GPT-OSS-20B is the least expensive at only USD 0.000285 per query. The other open-source models fall within a broader range, from USD 0.001413 per query for DeepSeek V3.1 to USD 0.003687 per query for Mixtral-8x22B-Instruct. These results are summarized in Table 8, and Figure 6 illustrates the average cost per 1000 generated queries.
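The per-query figures follow directly from the measured token averages and the published per-million-token prices; the short calculation below reproduces the GPT-4o value (the helper name is ours).
    def cost_per_query(avg_in_tokens: float, avg_out_tokens: float,
                       usd_per_m_input: float, usd_per_m_output: float) -> float:
        """Average cost of one query given mean token counts and API prices."""
        return (avg_in_tokens * usd_per_m_input + avg_out_tokens * usd_per_m_output) / 1_000_000

    # GPT-4o: 2134.92 input and 58.11 output tokens on average (Table 7)
    print(round(cost_per_query(2134.92, 58.11, 2.50, 10.00), 6))  # 0.005918 USD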
Figure 7 presents a Pareto front analysis illustrating the trade-off between model accuracy and the cost per query. The results show that GPT-4o achieves the highest accuracy (66.1%), albeit at a substantially higher cost. In comparison, GPT-5 attains a comparable accuracy score (64.6%) at a markedly lower cost, making it the more economically favorable option. Among open-source models, DeepSeek V3.1 demonstrates the best balance between cost and accuracy, whereas GPT-OSS-20B, despite being the most cost-efficient, performs poorly in terms of accuracy.
Using pairwise t-tests (α = 0.05), Figure 8 presents a comparison matrix where each cell shows whether the row model significantly outperforms (+1, blue), underperforms (−1, red), or is not significantly different (0, white) from the column model. The results reveal a clear hierarchy: GPT-4o significantly outperforms all models except GPT-5, with which it shows no significant difference, and GPT-5 in turn significantly outperforms the remaining models. Among open-source models, DeepSeek V3 significantly outperforms all others except DeepSeek V3.1, from which its performance is statistically indistinguishable. The remaining pairwise comparisons are detailed in Figure 8.
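A minimal sketch of how such a comparison matrix can be produced is given below, assuming per-run accuracy scores for each model and an independent two-sample t-test from SciPy; the study's exact test configuration may differ.
    import numpy as np
    from scipy import stats

    def comparison_matrix(run_scores: dict, alpha: float = 0.05) -> np.ndarray:
        """+1 if the row model significantly outperforms the column model,
        -1 if it significantly underperforms, 0 otherwise."""
        names = list(run_scores)
        mat = np.zeros((len(names), len(names)), dtype=int)
        for i, a in enumerate(names):
            for j, b in enumerate(names):
                if i == j:
                    continue
                _, p = stats.ttest_ind(run_scores[a], run_scores[b])
                if p < alpha:
                    mat[i, j] = 1 if np.mean(run_scores[a]) > np.mean(run_scores[b]) else -1
        return mat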
Based on the NL2SQL evaluation, GPT-5 offers the best balance of accuracy and cost-efficiency. Among open-source models, DeepSeek V3.1 provides the most favorable trade-off, delivering competitive accuracy at a fraction of the proprietary costs. GPT-OSS-20B, while highly economical, is suitable only when cost constraints outweigh accuracy requirements. However, NL2SQL execution accuracy remains far too low for real-world use, particularly in high-stakes clinical settings. Further details are provided in the SQL-EC error analysis (Section 4.3).

4.2. RAG-QA Results

RAG-QA results show substantial variability across evaluation metrics. We assessed nine LLMs using BLEU, ROUGE-L, and BERTScore. As expected, BLEU yielded the lowest scores and the highest standard deviation, reflecting the lexical diversity of medical terminology. ROUGE-L produced higher values, while BERTScore reached the highest results (up to 0.957) with the smallest variations, indicating strong semantic alignment. Overall QA performance was consistently high, with DeepSeek V3.1 and GPT-4o achieving the best BERTScores. Open-source models performed comparably to proprietary ones, suggesting that RAG-based context enrichment effectively supports clinical summarization and QA tasks, where semantic understanding is crucial. In contrast, NL2SQL performance requires improvement. BioMistral-7B, while underperforming on NL2SQL, achieved QA scores comparable to proprietary models. Comparative performance of the evaluated LLMs is shown in Figure 9, Figure 10 and Figure 11.
To assess the statistical significance of model differences, we conducted pairwise t-tests across BLEU, ROUGE-L, and BERTScore (α = 0.05). Figure 12, Figure 13 and Figure 14 present the results as pairwise comparison matrices, where each cell indicates whether the row model significantly outperforms (+1, blue), underperforms (−1, red), or shows no significant difference (0, white) relative to the column model. Across all three metrics, GPT-4o and DeepSeek V3.1 consistently rank as top performers, while DeepSeek V3 and Llama 3.3-70B remain competitive but are occasionally outperformed. Importantly, BioMistral-7B, despite its weak NL2SQL results, shows no significant disadvantage in RAG-QA, highlighting the stabilizing effect of the retrieval-augmented context. These results indicate that while performance differences exist, semantic similarity remains consistently high, and the leading models exhibit no statistically significant performance gaps.
RAG-QA results show stable, high performance across models, with pairwise t-tests indicating that differences among top performers (GPT-4o and DeepSeek V3.1) are often not statistically significant. Even models like BioMistral-7B, which are weak in NL2SQL, perform competitively once the retrieval context is provided. This contrasts with NL2SQL, where accuracy gaps are large and statistically robust, making model choice critical. For clinical QA, open-source models such as DeepSeek V3.1 offer a cost-effective alternative to proprietary systems without sacrificing reliability, while NL2SQL still demands higher-performing, more expensive models.

4.3. SQL-EC Results

The SQL-EC error classification framework employs a multi-label approach to categorize the queries identified as invalid under the EX metric. Of the 13,062 invalid SQL queries from the NL2SQL task (48.3% of all generations), our classifier identified at least one error category for 12,563 cases; the remaining 499 queries were assigned to the OTHER category and subsequently reviewed manually. Given the multi-label nature of the framework, individual queries may belong to multiple error categories simultaneously.
Error quantification follows two complementary approaches: (1) averaged percentage rates across models and (2) raw error counts per model. The averaged percentage method calculates each model’s error rate for a given category (errors of that type divided by the total errors for that model) and averages these percentages across all models. This approach prevents bias toward models with higher total error volumes and provides a normalized comparison of error propensities. Conversely, the raw count method simply tallies absolute occurrences of each error type per model, offering insights into absolute error volumes.
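The averaged-percentage computation can be summarized in a few lines, as sketched below; per_model_labels maps each model to the list of category labels assigned to each of its erroneous queries, and all names are ours.
    from collections import Counter

    def averaged_error_rates(per_model_labels: dict) -> dict:
        """Average each category's share of a model's erroneous queries across models."""
        rates = {}
        for model, label_lists in per_model_labels.items():
            counts = Counter(lbl for labels in label_lists for lbl in labels)
            total_errors = len(label_lists)  # erroneous queries for this model
            for cat, n in counts.items():
                rates.setdefault(cat, []).append(n / total_errors)
        return {cat: sum(vals) / len(vals) for cat, vals in rates.items()}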
As illustrated in Figure 15, the averaged analysis reveals String Mismatch as the predominant error category at 86.3%, followed by Join Misinterpretation (49.7%), Schema Mismatch (37.5%), Join Errors (27.3%), and Incorrect Aggregation (6.7%). These results indicate that semantic grounding challenges, particularly in string matching within WHERE clauses, represent the most pervasive issue across all evaluated models. The high prevalence of String Mismatch and Join Misinterpretation errors indicates fundamental limitations in domain adaptation and entity recognition when querying medical databases such as MIMICSQL.
Figure 16 presents the raw error counts across individual models, revealing consistent patterns that largely correlate with overall NL2SQL performance metrics. Models with lower accuracy rates exhibit proportionally higher absolute error counts while maintaining similar error type distributions. The dominance of String Mismatch across all models underscores the need for enhanced semantic understanding and context-aware entity resolution in medical NLP applications.
The SQL-EC error analysis reveals that only a small proportion of errors stem from Incorrect Aggregation (6.7% on average), suggesting that models generally handle basic aggregation operators correctly. Similarly, Schema Mismatch (37.5%) and Join Errors (27.3%) account for a moderate share of errors, which can likely be mitigated through prompting strategies that explicitly incorporate database schemas and provide few-shot exemplars of valid query patterns. By contrast, Join Misinterpretation (49.7%) and especially String Mismatch (86.3%) account for the majority of errors. The latter often arises from failures to recognize semantically equivalent terms (e.g., “myocardial infarction” vs. “heart attack”) or from inconsistent string representations across tables, which we also noted in our literature review prior to conducting these experiments.
We therefore recommend ontology-grounded normalization pipelines that map clinical concepts, diagnoses, procedures, and medications to controlled vocabularies (e.g., SNOMED CT, RxNorm, and LOINC) prior to query generation. Schema-aware prompting and adaptive few-shot exemplars may alleviate join-related errors by guiding models toward more consistent table-linking strategies. It is also evident from our results that fine-tuning on a medical corpus alone, as demonstrated by the BioMistral-7B model, does not inherently resolve these issues.
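As a toy illustration of such normalization, the sketch below rewrites lay terms to canonical clinical concepts before query generation; the dictionary is a stand-in for a proper SNOMED CT/RxNorm/LOINC lookup, and all names are ours.
    # Toy stand-in for ontology-grounded normalization prior to NL2SQL.
    SYNONYM_MAP = {
        "heart attack": "myocardial infarction",
        "high blood pressure": "hypertension",
    }

    def normalize_question(question: str) -> str:
        """Replace lay terms with canonical clinical concepts."""
        q = question.lower()
        for surface, canonical in SYNONYM_MAP.items():
            q = q.replace(surface, canonical)
        return q

    print(normalize_question("How many patients had a heart attack?"))
    # how many patients had a myocardial infarction?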

5. Limitations and Future Directions

5.1. Limitations

This study offers a comprehensive evaluation of nine large language models, seven of them open-source alongside two proprietary systems, on two clinically relevant NLP tasks: (1) translating natural language queries into SQL for clinical database retrieval and (2) question answering on synthetic patient records enhanced with RAG. In addition, we examine lexical challenges both through a focused literature review and an error classification analysis, providing context and groundwork for future research in this domain. Still, we note several limitations of this study.
First, this study was limited to nine language models, seven of which were open-source. While this selection captures a diverse set of state-of-the-art systems, the broader and rapidly evolving LLM landscape could yield different performance characteristics. The scope was chosen for feasibility, but future work should extend to a wider array of models. Moreover, while open models are attractive for their transparency and adaptability, their deployment in real-world clinical environments remains an open challenge. This work should therefore be viewed as an entry point for evaluation, cost modeling, and potential issue identification that may arise in real clinical scenarios.
Second, synthetic patient records, though necessary for privacy compliance, may not fully capture the variability and noise inherent in real-world clinical documentation. Consequently, question-answering LLM performance in practical deployments may diverge from our reported results.
Third, the RAG-QA task was evaluated using a fixed set of 20 clinical questions across 100 patient records, potentially limiting the diversity of linguistic constructions and medical concepts tested. Broader query sets may reveal additional model vulnerabilities, particularly in long-form reasoning or domain-specific terminology resolution. Moreover, the questions evaluated were relatively simple compared to those in other medical reasoning datasets or the complex scenarios encountered in real-world clinical practice.
Fourth, despite evaluating models across multiple metrics in the RAG-QA task (BLEU, ROUGE-L, and BERTScore), these primarily assess surface-level and semantic similarity without fully capturing clinical correctness or decision-making relevance. Manual validation by medical professionals was beyond the scope of this study, but it remains critical for assessing model utility in practice.
Fifth, the NL2SQL evaluation was restricted to the MIMICSQL dataset. Although this resource is widely used and representative, it is based on a constrained schema and may not generalize well to more heterogeneous hospital information systems, particularly those involving multilingual or ontologically complex databases.
Finally, model behavior was analyzed in a single-turn interaction setting. While this provides initial insights, many real-world healthcare applications require multi-turn, context-aware dialog systems capable of maintaining coherence and adapting over extended interactions. For all tasks, this study focused on static model responses rather than agentic approaches involving models actively reasoning, planning, or interacting with external tools across multiple steps.

5.2. Future Directions

Future research should prioritize evaluation in real clinical environments, where models interact with patient records and healthcare professionals. This would allow systematic assessment of clinical validity, interpretability, and integration within EHR systems, addressing the current lack of case studies in the literature.
Frequent string mismatches and schema-related errors in LLM-generated SQL highlight the need for domain adaptation. Possible approaches include domain-specific models, ontology-guided entity normalization, and schema-aware prompt designs. Unifying heterogeneous medical terminologies with resources such as the UMLS can improve entity alignment and reduce ambiguity. Metrics like METEOR [134], which account for synonymy and paraphrase, may also be useful for evaluating lexical similarity between generated and reference outputs, for example, in diagnosis-to-prescription mapping.
Coupling generative models with lightweight validation modules is another direction. Such modules can verify schema compliance, enforce type constraints, and align outputs with ontologies, thereby reducing syntactic and semantic errors while incurring minimal computational costs.
Agent-based frameworks for multi-step reasoning should also be explored. These systems could resolve ambiguities, reformulate failed queries, and incorporate user feedback, which are critical in clinical settings.
Given the costs and privacy concerns of proprietary models, improving open-source LLMs is also important. Techniques such as targeted fine-tuning, prompt calibration, adapter learning, and retrieval optimization may narrow the gap with commercial models while offering transparency, lower costs, and local deployment advantages.
Finally, more standardized evaluation frameworks and benchmarks specific to clinical tasks are needed. These should cover safety, factuality, and usability to support responsible adoption of LLMs in healthcare.

6. Conclusions

This study conducted a systematic evaluation of nine large language models, including both open-source and proprietary systems, across two clinically relevant NLP tasks: natural language to SQL generation (NL2SQL) and retrieval-augmented question answering (RAG-QA). Additionally, we performed an analysis of erroneous NL2SQL outputs by categorizing errors into five classes: schema mismatches, aggregation issues, join errors, condition misinterpretations, and string mismatches.
In the NL2SQL task, execution accuracy served as the primary evaluation metric. Models were evaluated on the MIMICSQL benchmark. Proprietary models such as GPT-4o (66.1%) and GPT-5 (64.6%) achieved the highest overall accuracies, while DeepSeek-V3.1 led among open-source models with 59.8%. Despite these advances, overall accuracy remains insufficient for high-stakes clinical deployment, underscoring the challenges of reliable structured query generation in healthcare settings.
For the RAG-QA task, performance was measured using BLEU, ROUGE-L, and BERTScore. Across 20 clinical questions over 100 patient records, models such as GPT-4o and DeepSeek-V3.1 achieved consistently high semantic similarity, with BERTScores approaching 0.957. These findings confirm the promise of retrieval-augmented strategies for clinical summarization and question answering, where semantic fidelity outweighs strict lexical matching.
To further assess statistical significance, pairwise t-tests were conducted across models for both tasks. The results indicated that differences in performance between top-performing proprietary and open-source models were statistically significant in NL2SQL but less pronounced in RAG-QA, where semantic similarity scores were more evenly distributed.
In the SQL-EC task, errors were categorized into schema mismatches, aggregation issues, join errors, condition misinterpretations, and string mismatches. Analysis revealed that string mismatches (86.3%) and join misinterpretations (49.7%) were the most frequent failure modes. This aligns with the literature review, which highlighted persistent lexical and terminological inconsistencies in clinical texts as a critical barrier to robust query generation.
Taken together, these results provide several key insights. First, while retrieval augmentation can stabilize performance across diverse models, reliable NL2SQL remains a bottleneck. Second, proprietary models currently lead in accuracy, but open-source systems such as DeepSeek-V3.1 offer more favorable cost-efficiency trade-offs, highlighting practical considerations for real-world deployments. Third, lexical normalization and ontology grounding are essential to mitigate persistent string-level and schema-related errors.
Future work in clinical NLP should focus on testing LLMs with real EHR data, multi-turn dialogues, and a broader range of models. Evaluation should include richer metrics, and performance could be improved through ontology-guided normalization, validation modules, and agent-based reasoning. Ultimately, standardized clinical benchmarks are necessary to support the safe and responsible adoption of LLMs in healthcare.
In conclusion, while current models demonstrate strong potential for supporting clinical reasoning tasks, especially in QA, significant challenges remain in structured query reliability and domain adaptation. Addressing these challenges will be critical for enabling LLMs to transition from research prototypes into trustworthy tools that enhance clinical decision support and reduce the burden on healthcare professionals.

Author Contributions

Conceptualization, N.T., I.L., and L.B.; methodology, N.T. and I.L.; software, N.T. and I.L.; validation, N.T., I.L., S.B.Š. and L.B.; formal analysis, N.T., I.L. and S.B.Š.; investigation, N.T., I.L. and L.B.; resources, N.T.; data curation, N.T., I.L. and L.B.; writing—original draft preparation, N.T., I.L. and L.B.; writing—review and editing, N.T., L.B., S.B.Š. and I.L.; visualization, N.T., L.B., and I.L.; supervision, N.T. and I.L.; project administration, I.L.; funding acquisition, I.L. and S.B.Š. All authors have read and agreed to the published version of the manuscript.

Funding

This research did not receive external funding.

Data Availability Statement

All code and evaluation materials associated with this study are publicly available in a GitHub repository. The repository contains the complete RAG pipeline implementation, a synthetic patient record generator, SQL-based evaluation scripts, and the complete set of evaluation prompts. These resources can be accessed at https://github.com/fipu-lab/med-rag (accessed on 2 October 2025).

Acknowledgments

This research is (partly) supported by “European Digital Innovation Hub Adriatic Croatia (EDIH Adria) (project no. 101083838)” under the European Commission’s Digital Europe Programme, SPIN project “INFOBIP Konverzacijski Order Management (IP.1.1.03.0120)”, SPIN project “Projektiranje i razvoj nove generacije laboratorijskog informacijskog sustava (iLIS)” (IP.1.1.03.0158), SPIN project “Istraživanje i razvoj inovativnog sustava preporuka za napredno gostoprimstvo u turizmu (InnovateStay)” (IP.1.1.03.0039), and FIPU project “Primjena generativne umjetne inteligencije u modeliranju i izvođenju raspodijeljenih informacijskih sustava i procesa (GENESIS)”.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Appendix A. NL2SQL—Prompt Structure (Excerpt)

This appendix provides a representative example of the structured prompt used in the NL2SQL experiment. The complete set of prompt templates, full schema definitions, and data previews is available in a public repository. See the Data and Code Availability Statement.

Appendix A.1. NL2SQL Job Description and Example

  • Job Description:
    You are an advanced language model specialized in generating SQL queries for electronic health record (EHR) databases. Given a natural language question and the relevant schema, return a syntactically correct SQL query using SQLite syntax. All table and column names must be double-quoted. No table aliases are allowed, and comments should be excluded from the output.
  • Example Query:
    User: List the number of female patients above 60 diagnosed with hypertension.
  • Expected Output:
    SELECT COUNT(DISTINCT "demographic"."SUBJECT_ID")
    FROM "demographic"
    JOIN "diagnoses" ON "demographic"."HADM_ID" = "diagnoses"."HADM_ID"
    WHERE LOWER("demographic"."GENDER") = 'female'
    AND "demographic"."AGE" > 60
    AND LOWER("diagnoses"."SHORT_TITLE") LIKE '%hypertension%';

Appendix A.2. NL2SQL—SQL Schema (Excerpt)

The following schema excerpt represents a subset of the whole database schema used in the prompt.
    -- demographic: patient-level information
    CREATE TABLE demographic (
      SUBJECT_ID INTEGER,
      HADM_ID INTEGER,
      AGE INTEGER,
      GENDER TEXT,
      ADMISSION_TYPE TEXT,
      INSURANCE TEXT
    );
    -- diagnoses: diagnostic codes per admission
    CREATE TABLE diagnoses (
      SUBJECT_ID INTEGER,
      HADM_ID INTEGER,
      ICD9_CODE TEXT,
      SHORT_TITLE TEXT,
      LONG_TITLE TEXT
    );

Appendix A.3. NL2SQL—Data Preview (Excerpt)

The following excerpt presents a small, representative sample of the database records included in the prompt.
    TABLE: DEMOGRAPHIC
    SUBJECT_ID,HADM_ID,AGE,GENDER,ADMISSION_TYPE,INSURANCE
    10001,123456,65,FEMALE,EMERGENCY,Medicare
    10002,123457,58,MALE,URGENT,Private
    TABLE: DIAGNOSES
    SUBJECT_ID,HADM_ID,ICD9_CODE,SHORT_TITLE,LONG_TITLE
    10001,123456,4019,Hypertension,Essential hypertension, unspecified
    10002,123457,2500,Diabetes Mellitus,Type 2 diabetes without complications

Appendix A.4. NL2SQL—Example of Prompted Query

  • Case number: 05cbfa2dbd0e6f14a3b99380a97d21e0
    Question: What number of patients had the lab test named fibrin degradation products and self paid the costs?
    Ground Truth SQL:
    SELECT Count(DISTINCT demographic."subject_id")
    FROM demographic
    INNER JOIN lab ON demographic.hadm_id = lab.hadm_id
    WHERE demographic."insurance" = "self pay"
    AND lab."label" = "Fibrin Degradation Products"
  • Model Responses:
    DeepSeek-V3:
    SELECT Count(DISTINCT demographic."subject_id")
    FROM demographic
    INNER JOIN lab ON demographic.hadm_id = lab.hadm_id
    WHERE Lower(lab."label") = "fibrin degradation products"
    AND demographic."insurance" = "self pay"
  • Llama3.3-70b:
    SELECT Count(DISTINCT demographic."subject_id")
    FROM demographic
    INNER JOIN lab ON demographic."subject_id" = lab."subject_id"
    WHERE Lower(lab."label") = 'fibrin degradation products'
    AND demographic."insurance" = 'Self Pay'
  • GPT-4o:
    SELECT Count(DISTINCT demographic."subject_id")
    FROM demographic
    INNER JOIN lab ON demographic.hadm_id = lab.hadm_id
    WHERE Lower(lab."label") = 'fibrin degradation products'
    AND Lower(demographic."insurance") = 'self pay'

Appendix B. RAG-QA—Prompt Structure

This appendix outlines the structured prompt used during the RAG-QA experiment. The prompt consists of four primary components: a job description, an example question–answer pair, a documentation section dynamically populated during retrieval, and an output formatting instruction.

Appendix B.1. RAG-QA—Job Description and Example

  • Job Description:
    You are a health administrator that will respond to user question about a patient in a concise and informative way. For that purpose you will be provided an excerpt from the documentation on the patient.
    Example Question-Answer Pair:
  • Q: What is the patient’s chief complaint?
  • A: The patient’s chief complaint is chronic pelvic pain that has worsened over the past 12 months, with increased menstrual flow duration, fatigue, bloating, and nausea.

Appendix B.2. RAG-QA—Documentation

This section is dynamically generated at query time through a vector similarity search conducted on a Qdrant database. Further details are provided in Section 3.3, and the process is illustrated in Figure 1 of the main text.
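A minimal retrieval sketch is shown below, assuming a local Qdrant instance and a sentence-transformers embedder; the collection name, payload key, and embedding model are illustrative and need not match the released pipeline.
    from qdrant_client import QdrantClient
    from sentence_transformers import SentenceTransformer

    client = QdrantClient(url="http://localhost:6333")
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def retrieve_documentation(question: str, top_k: int = 3) -> list:
        """Return the top-k patient-record chunks most similar to the question."""
        vector = embedder.encode(question).tolist()
        hits = client.search(collection_name="patient_records",
                             query_vector=vector, limit=top_k)
        return [hit.payload["text"] for hit in hits]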

Appendix B.3. RAG-QA—Output

Respond in normal text, exclude all formatting symbols, markdown, or HTML.

Appendix C. RAG Metric Computation Details

This appendix contains mathematical formulations of the RAG evaluation metrics utilized in this study. While these metrics are widely adopted, their implementations can vary substantially; therefore, we provide explicit computations based on the example question–answer pair discussed in Section 3.3.1.

Appendix C.1. BLEU Computation Details

The BLEU score combines modified n-gram precisions with a brevity penalty (BP), defined as
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{4} w_n \cdot \log p_n \right), \quad \text{where } w_n = \tfrac{1}{4} \text{ and } p_n \text{ is the modified } n\text{-gram precision.}$$
The brevity penalty is calculated as
$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,(1 - r/c)} & \text{if } c \le r \end{cases}$$
For our example with reference length $r = 10$ and candidate length $c = 8$, this yields
$$\mathrm{BP} = e^{\,1 - \frac{10}{8}} = e^{-0.25} \approx 0.7788$$
Given n-gram precisions from Table 5,
$$p_1 = 0.625, \quad p_2 = 0.286, \quad p_3 = 0.167, \quad p_4 = \varepsilon \ \text{(small smoothing term)}$$
The BLEU score becomes
$$\mathrm{BLEU} \approx 0.7788 \times \exp\left( \tfrac{1}{4}\left[ \log(0.625) + \log(0.286) + \log(0.167) + \log(\varepsilon) \right] \right)$$
This formulation ensures numerical stability even in the presence of zero counts for higher-order n-grams, preserving the continuity of the scoring metric.
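The same computation can be reproduced with NLTK's sentence-level BLEU, as in the sketch below; the sentence pair is illustrative rather than the example of Table 5, and method1 smoothing plays the role of the small ε term.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "the patient was admitted with severe shortness of breath".split()
    candidate = "the patient was admitted with breathing difficulty".split()

    bleu = sentence_bleu(
        [reference], candidate,
        weights=(0.25, 0.25, 0.25, 0.25),                # uniform weights w_n = 1/4
        smoothing_function=SmoothingFunction().method1,  # epsilon smoothing for zero counts
    )
    print(round(bleu, 4))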

Appendix C.2. ROUGE-L Computation Details

Let $\mathrm{LCS}$ denote the length of the longest common subsequence between the reference and the generated text, and let $|R|$ and $|G|$ be the number of tokens in the reference and generated sequences, respectively.
$$\mathrm{Precision}_{\mathrm{LCS}} = \frac{\mathrm{LCS}}{|G|}, \qquad \mathrm{Recall}_{\mathrm{LCS}} = \frac{\mathrm{LCS}}{|R|}$$
The ROUGE-L F-score ( F LCS ) is the harmonic mean of these values:
$$F_{\mathrm{LCS}} = \frac{2 \cdot \mathrm{Precision}_{\mathrm{LCS}} \cdot \mathrm{Recall}_{\mathrm{LCS}}}{\mathrm{Precision}_{\mathrm{LCS}} + \mathrm{Recall}_{\mathrm{LCS}}}$$
Assume $\mathrm{LCS} = 5$, $|R| = 10$, and $|G| = 8$. Then
$$\mathrm{Precision}_{\mathrm{LCS}} = \frac{5}{8} = 0.625, \qquad \mathrm{Recall}_{\mathrm{LCS}} = \frac{5}{10} = 0.50,$$
$$F_{\mathrm{LCS}} = \frac{2 \cdot 0.625 \cdot 0.50}{0.625 + 0.50} = \frac{0.625}{1.125} \approx 0.556$$
This final F-score balances the degree of the shared sequence (recall) and conciseness of generation (precision), making ROUGE-L a robust metric for sequence alignment evaluation.
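For completeness, a from-scratch ROUGE-L F1 that mirrors the formulas above is sketched below (the experiments instead use Google's rouge-score package); whitespace tokenization is a simplification.
    def lcs_length(ref: list, gen: list) -> int:
        """Length of the longest common subsequence via dynamic programming."""
        m, n = len(ref), len(gen)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if ref[i - 1] == gen[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def rouge_l_f1(reference: str, generated: str) -> float:
        """Harmonic mean of LCS-based precision and recall."""
        ref, gen = reference.split(), generated.split()
        lcs = lcs_length(ref, gen)
        if lcs == 0:
            return 0.0
        precision, recall = lcs / len(gen), lcs / len(ref)
        return 2 * precision * recall / (precision + recall)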

Appendix C.3. BERTScore Computation Details

Let $x = \{x_1, \dots, x_m\}$ be the reference tokens and $\hat{x} = \{\hat{x}_1, \dots, \hat{x}_n\}$ the candidate tokens. Let $E(x_i)$ and $E(\hat{x}_j)$ denote the contextual embeddings obtained via a pre-trained transformer model (e.g., BERT). The cosine similarity between embeddings is denoted by $\cos(\cdot, \cdot)$.
We define the precision and recall of BERTScore as follows:
$$\mathrm{Precision} = \frac{1}{n} \sum_{j=1}^{n} \max_{i} \cos\!\big(E(\hat{x}_j), E(x_i)\big), \qquad \mathrm{Recall} = \frac{1}{m} \sum_{i=1}^{m} \max_{j} \cos\!\big(E(x_i), E(\hat{x}_j)\big).$$
These equations reflect the average maximum similarity of candidate tokens to reference tokens (precision), and vice versa (recall). The harmonic mean of precision and recall yields the final BERTScore, where
$$F1_{\mathrm{BERTScore}} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
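The score can be computed with the bert-score package as sketched below; the default English model selection and the example sentence pair are illustrative and may differ from the study's exact configuration.
    from bert_score import score

    cands = ["The patient reports breathing difficulty."]
    refs = ["The patient reports shortness of breath."]

    # Returns per-pair precision, recall, and F1 tensors.
    P, R, F1 = score(cands, refs, lang="en", rescale_with_baseline=False)
    print(round(F1.item(), 3))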

References

  1. Garets, D.; Davis, M. Electronic Medical Records vs. Electronic Health Records: Yes, There is a Difference; Policy White Paper; HIMSS Analytics: Chicago, IL, USA, 2006; Volume 1. [Google Scholar]
  2. Kim, M.K.; Rouphael, C.; McMichael, J.; Welch, N.; Dasarathy, S. Challenges in and Opportunities for Electronic Health Record-Based Data Analysis and Interpretation. Gut Liver 2024, 18, 201–208. [Google Scholar] [CrossRef]
  3. Shabestari, O.; Roudsari, A. Challenges in data quality assurance for electronic health records. In Studies in Health Technology and Informatics; IOS Press: Amsterdam, The Netherlands, 2013. [Google Scholar]
  4. Lewis, A.E.; Weiskopf, N.; Abrams, Z.B.; Foraker, R.; Lai, A.M.; Payne, P.R.; Gupta, A. Electronic health record data quality assessment and tools: A systematic review. J. Am. Med. Inform. Assoc. 2023, 30, 1730–1740. [Google Scholar] [CrossRef]
  5. Upadhyay, S.; Hu, H.F. A qualitative analysis of the impact of electronic health records (EHR) on healthcare quality and safety: Clinicians’ lived experiences. Health Serv. Insights 2022, 15, 11786329211070722. [Google Scholar] [CrossRef]
  6. Campanella, P.; Lovato, E.; Marone, C.; Fallacara, L.; Mancuso, A.; Ricciardi, W.; Specchia, M.L. The impact of electronic health records on healthcare quality: A systematic review and meta-analysis. Eur. J. Public Health 2016, 26, 60–64. [Google Scholar] [CrossRef]
  7. Tapuria, A.; Porat, T.; Kalra, D.; Dsouza, G.; Xiaohui, S.; Curcin, V. Impact of patient access to their electronic health record: Systematic review. Inform. Health Soc. Care 2021, 46, 194–206. [Google Scholar] [CrossRef] [PubMed]
  8. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  9. Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar]
  10. Bolton, E.; Hall, D.; Yasunaga, M.; Lee, T.; Manning, C.; Liang, P. Stanford CRFM Introduces PubMedGPT 2.7B. 2022. Available online: https://hai.stanford.edu/news/stanford-crfm-introduces-pubmedgpt-27b (accessed on 31 March 2025).
  11. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
  12. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.E.A. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950. [Google Scholar] [CrossRef] [PubMed]
  13. Kim, J.; Leonte, K.G.; Chen, M.L.; Torous, J.B.; Linos, E.; Pinto, A.; Rodriguez, C.I. Large language models outperform mental and medical health care professionals in identifying obsessive-compulsive disorder. npj Digit. Med. 2024, 7, 193. [Google Scholar] [CrossRef]
  14. Hou, Y.; Bert, C.; Gomaa, A.; Lahmer, G.; Höfler, D.; Weissmann, T.; Voigt, R.; Schubert, P.; Schmitter, C.; Depardon, A.; et al. Fine-tuning a local LLaMA-3 large language model for automated privacy-preserving physician letter generation in radiation oncology. Front. Artif. Intell. 2025, 7, 1493716. [Google Scholar] [CrossRef]
  15. Kuckelman, I.J.; Wetley, K.; Yi, P.H.; Ross, A.B. Translating musculoskeletal radiology reports into patient-friendly summaries using ChatGPT-4. Skelet. Radiol. 2024, 53, 1621–1624. [Google Scholar] [CrossRef]
  16. Mustafa, A.; Naseem, U.; Azghadi, M.R. Large language models vs human for classifying clinical documents. Int. J. Med. Inform. 2025, 195, 105800. [Google Scholar] [CrossRef]
  17. Ong, J.C.L.; Jin, L.; Elangovan, K.; Lim, G.Y.S.; Lim, D.Y.Z.; Sng, G.G.R.; Ke, Y.; Tung, J.Y.M.; Zhong, R.J.; Koh, C.M.Y.; et al. Development and testing of a novel large language model-based clinical decision support systems for medication safety in 12 clinical specialties. arXiv 2024, arXiv:2402.01741. [Google Scholar] [CrossRef]
  18. Liu, S.; McCoy, A.B.; Wright, A. Improving large language model applications in biomedicine with retrieval-augmented generation: A systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inform. Assoc. 2025, 32, 605–615. [Google Scholar] [CrossRef] [PubMed]
  19. Dennstädt, F.; Hastings, J.; Putora, P.M.; Schmerder, M.; Cihoric, N. Implementing large language models in healthcare while balancing control, collaboration, costs and security. npj Digit. Med. 2025, 8, 143. [Google Scholar] [CrossRef]
  20. Wada, A.; Tanaka, Y.; Nishizawa, M.; Yamamoto, A.; Akashi, T.; Hagiwara, A.; Hayakawa, Y.; Kikuta, J.; Shimoji, K.; Sano, K.; et al. Retrieval-augmented generation elevates local LLM quality in radiology contrast media consultation. npj Digit. Med. 2025, 8, 395. [Google Scholar] [CrossRef]
  21. Mortensen, G.A.; Zhu, R. Early Alzheimer’s Detection Through Voice Analysis: Harnessing Locally Deployable LLMs via ADetectoLocum, a privacy-preserving diagnostic system. AMIA Summits Transl. Sci. Proc. 2025, 2025, 365. [Google Scholar] [PubMed]
  22. Lorencin, I.; Tankovic, N.; Etinger, D. Optimizing Healthcare Efficiency with Local Large Language Models. In Intelligent Human Systems Integration (IHSI 2025): Integrating People and Intelligent Systems; AHFE International Open Access: Rome, Italy, 2025; Volume 160. [Google Scholar]
  23. Wiest, I.C.; Leßmann, M.E.; Wolf, F.; Ferber, D.; Van Treeck, M.; Zhu, J.; Ebert, M.P.; Westphalen, C.B.; Wermke, M.; Kather, J.N. Anonymizing medical documents with local, privacy preserving large language models: The LLM-Anonymizer. medRxiv 2024. [Google Scholar] [CrossRef]
  24. Hong, Z.; Yuan, Z.; Zhang, Q.; Chen, H.; Dong, J.; Huang, F.; Huang, X. Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv 2025, arXiv:2406.08426. [Google Scholar] [CrossRef]
  25. Malešević, A.; Čartolovni, A. Healthcare digitalization: Insights from Croatia’s experience and lessons from the COVID-19 pandemic. In Digital Healthcare, Digital Transformation and Citizen Empowerment in Asia-Pacific and Europe for a Healthier Society; Academic Press (Elsevier): Cambridge, MA, USA, 2025; pp. 459–473. [Google Scholar]
  26. Mastilica, M.; Kušec, S. Croatian healthcare system in transition, from the perspective of users. BMJ 2005, 331, 223–226. [Google Scholar] [CrossRef]
  27. Björnberg, A.; Phang, A.Y. Euro Health Consumer Index 2018 Report. 2018. Available online: https://santesecu.public.lu/dam-assets/fr/publications/e/euro-health-consumer-index-2018/euro-health-consumer-index-2018.pdf (accessed on 30 March 2025).
  28. OECD. Croatia Country Health Profile 2023; OECD Publishing: Paris, France, 2023. [Google Scholar]
  29. Džakula, A.; Vočanec, D.; Banadinović, M.; Vajagić, M.; Lončarek, K.; Lovrenčić, I.L.; Radin, D.; Rechel, B. Croatia: Health System Summary, 2024; European Observatory on Health Systems and Policies, WHO Regional Office for Europe: Copenhagen, Denmark, 2024. [Google Scholar]
  30. Zhong, V.; Xiong, C.; Socher, R. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv 2017, arXiv:1709.00103. [Google Scholar] [CrossRef]
  31. Xu, X.; Liu, C.; Song, D. SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv 2017, arXiv:1711.04436. [Google Scholar] [CrossRef]
  32. Roberts, K.; Patra, B.G. A semantic parsing method for mapping clinical questions to logical forms. In Proceedings of the AMIA 2018 Annual Symposium Proceedings, San Francisco, CA, USA, 3–7 November 2018; American Medical Informatics Association (AMIA): Washington, DC, USA, 2018; Volume 2017, p. 1478. [Google Scholar]
  33. Zhu, X.; Li, Q.; Cui, L.; Liu, Y. Large language model enhanced text-to-sql generation: A survey. arXiv 2024, arXiv:2410.06011. [Google Scholar]
  34. Wang, P.; Shi, T.; Reddy, C.K. Text-to-SQL Generation for Question Answering on Electronic Medical Records. In Proceedings of the WWW ’20: The Web Conference 2020, Taipei Taiwan, 20–24 April 2020; Association for Computing Machinery (ACM): New York, NY, USA, 2020; pp. 350–361. [Google Scholar]
  35. Pan, Y.; Wang, C.; Hu, B.; Xiang, Y.; Wang, X.; Chen, Q.; Chen, J.; Du, J. A BERT-based generation model to transform medical texts to SQL queries for electronic medical records: Model development and validation. JMIR Med. Inform. 2021, 9, e32698. [Google Scholar] [CrossRef]
  36. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  37. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
  38. Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; et al. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv 2019, arXiv:1809.08887. [Google Scholar]
  39. Li, J.; Hui, B.; Qu, G.; Yang, J.; Li, B.; Li, B.; Wang, B.; Qin, B.; Geng, R.; Huo, N.; et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Adv. Neural Inf. Process. Syst. 2023, 36, 42330–42357. [Google Scholar]
  40. Liu, X.; Shen, S.; Li, B.; Ma, P.; Jiang, R.; Zhang, Y.; Fan, J.; Li, G.; Tang, N.; Luo, Y. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? arXiv 2024, arXiv:2408.05109. [Google Scholar] [CrossRef]
  41. Guo, J.; Zhan, Z.; Gao, Y.; Xiao, Y.; Lou, J.G.; Liu, T.; Zhang, D. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. arXiv 2019, arXiv:1905.08205. [Google Scholar]
  42. Hwang, W.; Yim, J.; Park, S.; Seo, M. A comprehensive exploration on wikisql with table-aware word contextualization. arXiv 2019, arXiv:1902.01069. [Google Scholar]
  43. OpenAI. GPT-3.5. 2023. Available online: https://platform.openai.com/docs/models/gpt-3.5-turbo (accessed on 6 September 2025).
  44. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  45. Anthropic. Claude 2. 2023. Available online: https://www.anthropic.com/index/claude-2 (accessed on 6 September 2025).
  46. DeepMind, G. Gemini 1.5: Unlocking Multimodal Long Context Understanding. 2024. Available online: https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/ (accessed on 6 September 2025).
  47. OpenAI. Introducing GPT-5 for Developers. 2025. Available online: https://openai.com/index/introducing-gpt-5-for-developers/ (accessed on 6 September 2025).
  48. Attrach, R.A.; Moreira, P.; Fani, R.; Umeton, R.; Celi, L.A. Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis. arXiv 2025, arXiv:2507.01053. [Google Scholar]
  49. Anthropic. Introducing the Model Context Protocol. 2024. Available online: https://www.anthropic.com/news/model-context-protocol (accessed on 10 September 2025).
  50. Chadha, I.K.; Gupta, A.; Sarkar, S.; Tomer, M.; Rathee, T. Performance Evaluation of Open-Source LLMs for Text-to-SQL Conversion in Healthcare Data. In Proceedings of the 2025 International Conference on Pervasive Computational Technologies (ICPCT), Greater Noida, India, 8–9 February 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 606–609. [Google Scholar]
  51. OpenAI. gpt-oss-120b & gpt-oss-20b Model Card. arXiv 2025, arXiv:2508.10925. [Google Scholar]
  52. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. DeepSeek-V3 Technical Report. arXiv 2025, arXiv:2412.19437. [Google Scholar]
  53. Tarbell, R.; Choo, K.K.R.; Dietrich, G.; Rios, A. Towards understanding the generalization of medical text-to-sql models and datasets. In Proceedings of the AMIA Annual Symposium Proceedings, San Francisco, CA, USA, 9–13 November 2024; American Medical Informatics Association (AMIA): Washington, DC, USA, 2024; Volume 2023, p. 669. [Google Scholar]
  54. Rahman, S.; Jiang, L.Y.; Gabriel, S.; Aphinyanaphongs, Y.; Oermann, E.K.; Chunara, R. Generalization in healthcare ai: Evaluation of a clinical large language model. arXiv 2024, arXiv:2402.10965. [Google Scholar] [CrossRef]
  55. Chen, C.; Yu, J.; Chen, S.; Liu, C.; Wan, Z.; Bitterman, D.; Wang, F.; Shu, K. ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction? arXiv 2024, arXiv:2411.06469. [Google Scholar] [CrossRef]
  56. Zhang, G.; Xu, Z.; Jin, Q.; Chen, F.; Fang, Y.; Liu, Y.; Rousseau, J.F.; Xu, Z.; Lu, Z.; Weng, C.; et al. Leveraging long context in retrieval augmented language models for medical question answering. npj Digit. Med. 2025, 8, 239. [Google Scholar] [CrossRef]
  57. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  58. Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health 2025, 4, e0000877. [Google Scholar] [CrossRef]
  59. Alkhalaf, M.; Yu, P.; Yin, M.; Deng, C. Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. J. Biomed. Inform. 2024, 156, 104662. [Google Scholar] [CrossRef]
  60. Myers, S.; Dligach, D.; Miller, T.A.; Barr, S.; Gao, Y.; Churpek, M.; Mayampurath, A.; Afshar, M. Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs. arXiv 2025, arXiv:2508.14817. [Google Scholar]
  61. Long, C.; Subburam, D.; Lowe, K.; Dos Santos, A.; Zhang, J.; Hwang, S.; Saduka, N.; Horev, Y.; Su, T.; Côté, D.W.; et al. ChatENT: Augmented large language model for expert knowledge retrieval in otolaryngology–head and neck surgery. Otolaryngol.–Head Neck Surg. 2024, 171, 1042–1051. [Google Scholar] [CrossRef]
  62. Ge, J.; Sun, S.; Owens, J.; Galvez, V.; Gologorskaya, O.; Lai, J.C.; Pletcher, M.J.; Lai, K. Development of a liver disease–specific large language model chat interface using retrieval-augmented generation. Hepatology 2024, 80, 1158–1168. [Google Scholar] [CrossRef]
  63. Rouhollahi, A.; Homaei, A.; Sahu, A.; Harari, R.E.; Nezami, F.R. RAGnosis: Retrieval-Augmented Generation for Enhanced Medical Decision Making. medRxiv 2025. [Google Scholar] [CrossRef]
  64. Zhao, X.; Liu, S.; Yang, S.Y.; Miao, C. Medrag: Enhancing retrieval-augmented generation with knowledge graph-elicited reasoning for healthcare copilot. In Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; Association for Computing Machinery (ACM): New York, NY, USA, 2025; pp. 4442–4457. [Google Scholar]
  65. Soni, S.; Gayen, S.; Demner-Fushman, D. Overview of the archehr-qa 2025 shared task on grounded question answering from electronic health records. In Proceedings of the 24th Workshop on Biomedical Language Processing, Vienna, Austria, 1 August 2025; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2025; pp. 396–405. [Google Scholar]
  66. Yu, H.; Gan, A.; Zhang, K.; Tong, S.; Liu, Q.; Liu, Z. Evaluation of retrieval-augmented generation: A survey. In Proceedings of the CCF Conference on Big Data, Shenyang, China, 28–30 August 2025; Springer: Singapore, 2025; pp. 102–120. [Google Scholar]
  67. Gargari, O.K.; Habibi, G. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digit. Health 2025, 11, 20552076251337177. [Google Scholar] [CrossRef]
  68. Xiong, G.; Jin, Q.; Lu, Z.; Zhang, A. Benchmarking retrieval-augmented generation for medicine. In Proceedings of the Findings of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2024; pp. 6233–6251. [Google Scholar]
  69. Wu, J.; Zhu, J.; Qi, Y. Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. arXiv 2024, arXiv:2408.04187. [Google Scholar] [CrossRef]
  70. Jadhav, S.; Shanbhag, A.G.; Joshi, S.; Date, A.; Sonawane, S. Maven at MEDIQA-CORR 2024: Leveraging RAG and Medical LLM for Error Detection and Correction in Medical Notes. In Proceedings of the Clinical Natural Language Processing Workshop, Mexico City, Mexico, 24 April 2024; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2024. [Google Scholar]
  71. Jayatilake, D.C.; Oyibo, S.O.; Jayatilake, D. Interpretation and misinterpretation of medical abbreviations found in patient medical records: A cross-sectional survey. Cureus 2023, 15, e44735. [Google Scholar] [CrossRef] [PubMed]
  72. SNOMED CT: Myocardial infarction (Concept ID 22298006). 2025. Available online: https://bioportal.bioontology.org/ontologies/SNOMEDCT?p=classes&conceptid=22298006 (accessed on 16 January 2025).
  73. Cheung, S.; Hoi, S.; Fernandes, O.; Huh, J.; Kynicos, S.; Murphy, L.; Lowe, D. Audit on the use of dangerous abbreviations, symbols, and dose designations in paper compared to electronic medication orders: A multicenter study. Ann. Pharmacother. 2018, 52, 332–337. [Google Scholar] [CrossRef] [PubMed]
  74. National Institutes of Health (NIH) Exhibit in NIH Grants Policy Statement (NIH GPS). 2024. Available online: https://grants.nih.gov/grants/policy/nihgps/html5/section_1/1.1_abbreviations.htm (accessed on 6 September 2025).
  75. Do Not Use: Dangerous Abbreviations, Symbols, and Dose Designations—2025 Update. 2025. Available online: https://ismpcanada.ca/resource/do-not-use-list/ (accessed on 6 September 2025).
  76. Marecková, E.; Cervenỳ, L. Latin as the language of medical terminology: Some remarks on its role and prospects. Swiss Med. Wkly. 2002, 132, 581. [Google Scholar] [CrossRef]
  77. Lysanets, Y.V.; Bieliaieva, O.M. The use of Latin terminology in medical case reports: Quantitative, structural, and thematic analysis. J. Med. Case Rep. 2018, 12, 45. [Google Scholar] [CrossRef]
  78. Bodenreider, O. The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 2004, 32, D267–D270. [Google Scholar] [CrossRef]
  79. Liu, H.; Lussier, Y.A.; Friedman, C. A study of abbreviations in the UMLS. In Proceedings of the AMIA Symposium, Portland, OR, USA, 6–10 November 2001; American Medical Informatics Association (AMIA): Washington, DC, USA, 2001; p. 393. [Google Scholar]
  80. Xu, R.; Jiang, P.; Luo, L.; Xiao, C.; Cross, A.; Pan, S.; Sun, J.; Yang, C. A Survey on Unifying Large Language Models and Knowledge Graphs for Biomedicine and Healthcare. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, Toronto, ON, Canada, 3–7 August 2025; Association for Computing Machinery (ACM): New York, NY, USA, 2025; pp. 6195–6205. [Google Scholar]
  81. Nazi, Z.A.; Hristidis, V.; McLean, A.L.; Meem, J.A.; Chowdhury, M.T.A. Ontology-Guided Query Expansion for Biomedical Document Retrieval using Large Language Models. arXiv 2025, arXiv:2508.11784. [Google Scholar] [CrossRef]
  82. Fan, Y.; Xue, K.; Li, Z.; Zhang, X.; Ruan, T. An LLM-based Framework for Biomedical Terminology Normalization in Social Media via Multi-Agent Collaboration. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2025; pp. 10712–10726. [Google Scholar]
  83. Borchert, F.; Llorca, I.; Schapranow, M.P. Improving biomedical entity linking for complex entity mentions with LLM-based text simplification. Database 2024, 2024, baae067. [Google Scholar] [CrossRef]
  84. Remy, F.; Demuynck, K.; Demeester, T. BioLORD-2023: Semantic textual representations fusing llm and clinical knowledge graph insights. arXiv 2023, arXiv:2311.16075. [Google Scholar] [CrossRef]
  85. Tertulino, R.; Antunes, N.; Morais, H. Privacy in electronic health records: A systematic mapping study. J. Public Health 2024, 32, 435–454. [Google Scholar] [CrossRef]
  86. Mirzaei, T.; Amini, L.; Esmaeilzadeh, P. Clinician voices on ethics of LLM integration in healthcare: A thematic analysis of ethical concerns and implications. BMC Med. Inform. Decis. Mak. 2024, 24, 250. [Google Scholar] [CrossRef] [PubMed]
  87. Rathod, V.; Nabavirazavi, S.; Zad, S.; Iyengar, S.S. Privacy and security challenges in large language models. In Proceedings of the 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 January 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 00746–00752. [Google Scholar]
  88. Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confid. Comput. 2024, 4, 100211. [Google Scholar] [CrossRef]
  89. Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295. [Google Scholar] [CrossRef]
  90. Gemma Team. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  91. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  92. Kukreja, S.; Kumar, T.; Purohit, A.; Dasgupta, A.; Guha, D. A literature survey on open source large language models. In Proceedings of the 2024 7th International Conference on Computers in Management and Business, Singapore, 12–14 January 2024; Association for Computing Machinery (ACM): New York, NY, USA, 2024; pp. 133–143. [Google Scholar]
  93. Plaat, A.; Wong, A.; Verberne, S.; Broekens, J.; van Stein, N.; Bäck, T. Reasoning with large language models, a survey. CoRR 2024. [Google Scholar] [CrossRef]
  94. Floratou, A.; Psallidas, F.; Zhao, F.; Deep, S.; Hagleither, G.; Tan, W.; Cahoon, J.; Alotaibi, R.; Henkel, J.; Singla, A.; et al. Nl2sql is a solved problem… not! In Proceedings of the Conference of Innovative Data Systems Research (CIDR), Chaminade, HI, USA, 14–17 January 2024; CIDR Foundation: San Mateo, CA, USA, 2024. [Google Scholar]
  95. Tai, C.Y.; Chen, Z.; Zhang, T.; Deng, X.; Sun, H. Exploring chain-of-thought style prompting for text-to-sql. arXiv 2023, arXiv:2305.14215. [Google Scholar]
  96. Ali Alkamel, S. DeepSeek and the Power of Mixture of Experts (MoE). Available online: https://dev.to/sayed_ali_alkamel/deepseek-and-the-power-of-mixture-of-experts-moe-ham (accessed on 1 April 2025).
  97. DeepSeek-V3.1 Release. 2025. Available online: https://api-docs.deepseek.com/news/news250821 (accessed on 10 September 2025).
  98. Open Source Initiative. MIT License. 1988. Available online: https://opensource.org/licenses/MIT (accessed on 10 September 2025).
  99. Meta Platforms, Inc. Meta LLaMA 3 Community License. 2024. Available online: https://www.llama.com/llama3/license/ (accessed on 10 September 2025).
  100. Team Qwen. Qwen2.5 Technical Report. arXiv 2025, arXiv:2412.15115. [Google Scholar]
  101. Apache Software Foundation. Apache License, Version 2.0. 2004. Available online: http://www.apache.org/licenses/LICENSE-2.0 (accessed on 10 September 2025).
  102. Mistral AI. Cheaper, Better, Faster, Stronger: Mixtral 8x22B. Release Blog Post. 2024. Available online: https://mistral.ai/news/mixtral-8x22b (accessed on 10 September 2025).
  103. Labrak, Y.; Bazoge, A.; Morin, E.; Gourraud, P.A.; Rouvier, M.; Dufour, R. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv 2024, arXiv:2402.10373. [Google Scholar]
  104. OpenAI. GPT-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 22 April 2025).
  105. OpenAI. Introducing GPT-5. 2025. Available online: https://openai.com/index/introducing-gpt-5/ (accessed on 7 August 2025).
  106. Tanković, N.; Šajina, R.; Lorencin, I. Transforming Medical Data Access: The Role and Challenges of Recent Language Models in SQL Query Automation. Algorithms 2025, 18, 124. [Google Scholar] [CrossRef]
  107. Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; Han, S. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv 2024, arXiv:2211.10438. [Google Scholar]
  108. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2023, arXiv:2210.17323. [Google Scholar]
  109. Fireworks AI. How Fireworks Evaluates Quantization Precisely and Interpretably. Blog Post. 2024. Available online: https://fireworks.ai/blog/fireworks-quantization (accessed on 11 September 2025).
  110. Fireworks AI. LLM Inference Performance Benchmarking (Part 1). Blog Post. 2023. Available online: https://fireworks.ai/blog/llm-inference-performance-benchmarking-part-1 (accessed on 11 September 2025).
  111. Sandrini, P. Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators. arXiv 2025, arXiv:2507.23399. [Google Scholar] [CrossRef]
  112. Chen, K.; Zhou, X.; Lin, Y.; Feng, S.; Shen, L.; Wu, P. A survey on privacy risks and protection in large language models. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 163. [Google Scholar] [CrossRef]
  113. Hamburg Commissioner for Data Protection and Freedom of Information. Discussion Paper: Large Language Models and Personal Data. Discussion paper, Hamburg Commissioner for Data Protection and Freedom of Information (HmbBfDI). 2024. Available online: https://datenschutz-hamburg.de/fileadmin/user_upload/HmbBfDI/Datenschutz/Informationen/240715_Discussion_Paper_Hamburg_DPA_KI_Models.pdf (accessed on 11 September 2025).
  114. Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. arXiv 2020, arXiv:1904.09751. [Google Scholar] [CrossRef]
  115. Hipp, R.D. SQLite. 2020. Available online: https://sqlite.org/ (accessed on 11 September 2025).
  116. Qin, B.; Hui, B.; Wang, L.; Yang, M.; Li, J.; Li, B.; Geng, R.; Cao, R.; Sun, J.; Si, L.; et al. A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions. arXiv 2022, arXiv:2208.13629. [Google Scholar]
  117. Noor, H. What Do You Mean? Using Large Language Models for Semantic Evaluation of NL2SQL Queries. 2025. Available online: https://uwspace.uwaterloo.ca/bitstreams/73520bc6-13dd-4586-a9c4-5b8ced8ddfc1/download (accessed on 9 September 2025).
  118. OpenAI. API Pricing. 2025. Available online: https://openai.com/api/pricing/ (accessed on 12 September 2025).
  119. Fireworks AI. Pricing. 2025. Available online: https://fireworks.ai/pricing (accessed on 12 September 2025).
  120. Qdrant Team. Qdrant—Vector Database. 2025. Available online: https://qdrant.tech/ (accessed on 11 September 2025).
  121. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2002; pp. 311–318. [Google Scholar]
  122. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2004; pp. 74–81. [Google Scholar]
  123. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  124. OpenAI. Text Embedding 3 Small. 2023. Available online: https://platform.openai.com/docs/models/text-embedding-3-small (accessed on 12 September 2025).
  125. Malkov, Y.A.; Yashunin, D.A. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv 2018, arXiv:1603.09320. [Google Scholar] [CrossRef] [PubMed]
  126. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009. [Google Scholar]
  127. Liu, X.; Shen, S.; Li, B.; Tang, N.; Luo, Y. NL2SQL-BUGs: A Benchmark for Detecting Semantic Errors in NL2SQL Translation. arXiv 2025, arXiv:2503.11984. [Google Scholar]
  128. Shen, J.; Wan, C.; Qiao, R.; Zou, J.; Xu, H.; Shao, Y.; Zhang, Y.; Miao, W.; Pu, G. A Study of In-Context-Learning-Based Text-to-SQL Errors. arXiv 2025, arXiv:2501.09310. [Google Scholar]
  129. Ning, Z.; Tian, Y.; Zhang, Z.; Zhang, T.; Li, T.J.J. Insights into natural language database query errors: From attention misalignment to user handling strategies. arXiv 2024, arXiv:2402.07304. [Google Scholar] [CrossRef]
  130. Ren, T.; Fan, Y.; He, Z.; Huang, R.; Dai, J.; Huang, C.; Jing, Y.; Zhang, K.; Yang, Y.; Wang, X.S. Purple: Making a large language model a better sql writer. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 15–28. [Google Scholar]
  131. Zhu, C.; Lin, Y.; Cai, Y.; Li, Y. SCoT2S: Self-Correcting Text-to-SQL Parsing by Leveraging LLMs. Comput. Speech Lang. 2026, 95, 101865. [Google Scholar] [CrossRef]
  132. Mitsopoulou, A.V. Towards More Robust Text-to-SQL Translation. 2024. Available online: https://pergamos.lib.uoa.gr/uoa/dl/object/3401060/file.pdf (accessed on 10 April 2025).
  133. Kroll, H.; Kreutz, C.K.; Sackhoff, P.; Balke, W.T. Enriching Simple Keyword Queries for Domain-Aware Narrative Retrieval. In Proceedings of the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Santa Fe, NM, USA, 26–30 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 143–154. [Google Scholar] [CrossRef]
  134. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 1 June 2005; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2005; pp. 65–72. [Google Scholar]
Figure 1. Illustration of the QA-RAG pipeline in three steps: (1) chunking and indexing patient records, (2) retrieving relevant context based on question–record similarity, and (3) generating and evaluating answers to medical questions.
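To make the three steps concrete, the following minimal Python sketch indexes a record's sections in the Qdrant vector database [120], retrieves the chunks most similar to a question using text-embedding-3-small embeddings [124], and generates an answer with a chat model. The collection name, section splitting, prompt wording, and choice of chat model are illustrative assumptions and do not reproduce the exact configuration used in this study.

# Minimal QA-RAG sketch (illustrative only): chunk a patient record by section,
# index the chunks in Qdrant, retrieve the top-k chunks for a question, and
# ask an LLM to answer using only the retrieved context.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

oai = OpenAI()                     # assumes OPENAI_API_KEY is set
qdrant = QdrantClient(":memory:")  # in-memory index for the example

EMBED_MODEL = "text-embedding-3-small"   # 1536-dimensional embeddings
COLLECTION = "patient_records"           # illustrative collection name

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model=EMBED_MODEL, input=text).data[0].embedding

def index_record(record_id: int, sections: dict[str, str]) -> None:
    """Step 1: chunk the record into sections and index each chunk."""
    qdrant.recreate_collection(
        COLLECTION, vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
    )
    points = [
        PointStruct(id=i, vector=embed(text),
                    payload={"record_id": record_id, "section": name, "text": text})
        for i, (name, text) in enumerate(sections.items())
    ]
    qdrant.upsert(COLLECTION, points=points)

def answer(question: str, k: int = 3, model: str = "gpt-4o") -> str:
    """Steps 2-3: retrieve the k most similar chunks and generate an answer."""
    hits = qdrant.search(COLLECTION, query_vector=embed(question), limit=k)
    context = "\n".join(h.payload["text"] for h in hits)
    prompt = (f"Answer the question using only the patient record excerpts below.\n"
              f"Excerpts:\n{context}\n\nQuestion: {question}")
    resp = oai.chat.completions.create(model=model,
                                       messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

# Example with a hypothetical two-section record:
index_record(4, {"Chief Complaint": "Chronic pelvic pain worsening over 12 months.",
                 "Medications": "Losartan 50 mg daily; ferrous sulfate 325 mg daily."})
print(answer("What is the patient's chief complaint?"))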
Figure 2. Illustration of token-level n-gram overlap for BLEU scoring on an example reference–generated answer pair. Green tokens represent unigram matches, blue tokens represent bigram matches, and red tokens represent trigram matches between the reference and generated answers.
Figure 3. An illustration of the LCS used in ROUGE-L scoring, based on the preceding example. Tokens highlighted in pink occur in both texts and maintain the same sequential order.
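As a concrete companion to Figure 3, the sketch below computes an LCS-based ROUGE-L F-measure; the reference and generated answers in the example are hypothetical and are not drawn from the evaluation data.

# Illustrative ROUGE-L computation: the longest common subsequence (LCS) keeps
# tokens that appear in both texts in the same order, as highlighted in Figure 3.
def lcs_length(ref: list[str], gen: list[str]) -> int:
    # classic dynamic-programming LCS table
    dp = [[0] * (len(gen) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, g in enumerate(gen, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == g else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(ref)][len(gen)]

def rouge_l_f1(reference: str, generated: str, beta: float = 1.0) -> float:
    ref, gen = reference.lower().split(), generated.lower().split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(gen)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

# Hypothetical reference/generation pair, used only to illustrate the computation
print(rouge_l_f1("the patient reports shortness of breath",
                 "patient reports breathing difficulty"))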
Figure 4. Illustration of BERTScore computation using the reference answer “shortness of breath” and the generated answer “breathing difficulty”, showing token embeddings and pairwise cosine similarities.
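For the pair shown in Figure 4, BERTScore can be reproduced with the reference bert-score package; note that the default English scorer loaded by the package is not necessarily the configuration used in this study.

# Illustrative BERTScore computation for the example pair in Figure 4, using the
# reference implementation from the bert-score package (pip install bert-score).
from bert_score import score

refs = ["shortness of breath"]
cands = ["breathing difficulty"]

# Token embeddings are compared via pairwise cosine similarity and greedily
# matched; P, R, and F1 are returned as tensors (one value per sentence pair).
P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.item():.3f}")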
Figure 5. Comparison of model performance on the NL2SQL task in terms of execution accuracy.
Figure 6. Average cost per 1000 generated queries in the NL2SQL evaluation.
Figure 7. NL2SQL task: Cost vs. performance comparison.
Figure 8. NL2SQL: pairwise statistical comparison of LLM performance.
Figure 9. RAG-QA: Comparative BLEU scores of evaluated models.
Figure 10. RAG-QA: Comparative ROUGE-L scores of evaluated models.
Figure 11. RAG-QA: Comparative BERTScores of evaluated models.
Figure 12. RAG-QA pairwise LLM comparison matrices for BLEU.
Figure 13. RAG-QA pairwise LLM comparison matrices for ROUGE-L.
Figure 14. RAG-QA pairwise LLM comparison matrices for BERTScore.
Figure 15. SQL-EC: Average error rates per error category across all models.
Figure 16. SQL-EC: Stacked distribution of total error counts by error type across different LLMs.
Table 1. Summary of evaluated LLMs. Nine models (seven open-source and two proprietary) spanning dense and sparse MoE architectures. Params/Active (B) reports total parameters in billions and, for MoE, per-token active parameters; Max. ctx. denotes the maximum context length in tokens; Availability denotes license or API access (OSS = Open-source, P = Proprietary). MLA = Multi-head Latent Attention. Parameter counts denoted with n/a are not disclosed.
Model | Architecture | Params/Active (B) | Max. Ctx. (Tokens) | Availability
DeepSeek V3 | MoE (MLA) | 671/∼37 | 128k | OSS (MIT)
DeepSeek V3.1 | MoE (MLA) | 671/∼37 | 128k | OSS (MIT)
Llama 3.3-70B | Dense decoder-only | 70 | 128k | OSS (Llama 3 Comm.)
Qwen2.5-32B | Dense decoder-only (RoPE, GQA) | 32.5 | 131k | OSS (Apache-2.0)
Mixtral-8x22B | MoE (8 × 22B) | 141/39 | 64k | OSS (Apache-2.0)
BioMistral-7B | Dense decoder-only | 7 | 32k | OSS (Apache-2.0)
GPT-4o | Proprietary multimodal | n/a | 128k * | P (API)
GPT-5 | Unified proprietary | n/a | 400k | P (API)
GPT-OSS-20B | MoE | 20.9/∼3.6 | 131k | OSS (Apache-2.0)
* GPT-4o is treated at the 128k context scale in OpenAI materials. GPT-5 routes between a fast “main” model and a deeper “thinking” variant; we evaluate the standard text-only mode. GPT-5 supports 272k input + 128k output tokens (∼400k total).
Table 2. Examples of natural language question–SQL query pairs from the MIMICSQL dataset.
Natural Language Question: Tell me the number of patients who were diagnosed with unspecified cerebral artery occlusion with cerebral infarction.
Reference SQL Query:
SELECT COUNT(DISTINCT DEMOGRAPHIC."SUBJECT_ID")
FROM DEMOGRAPHIC
INNER JOIN DIAGNOSES ON DEMOGRAPHIC.HADM_ID = DIAGNOSES.HADM_ID
WHERE DIAGNOSES."SHORT_TITLE" = "Crbl art ocl NOS w infrc"

Natural Language Question: Find out the category of lab test for patient Stephanie Suchan.
Reference SQL Query:
SELECT LAB."CATEGORY"
FROM DEMOGRAPHIC
INNER JOIN LAB ON DEMOGRAPHIC.HADM_ID = LAB.HADM_ID
WHERE DEMOGRAPHIC."NAME" = "Stephanie Suchan"

Natural Language Question: What is the number of patients with abdominal pain as their primary disease who were discharged to SNF?
Reference SQL Query:
SELECT COUNT(DISTINCT DEMOGRAPHIC."SUBJECT_ID")
FROM DEMOGRAPHIC
WHERE DEMOGRAPHIC."DISCHARGE_LOCATION" = "SNF"
AND DEMOGRAPHIC."DIAGNOSIS" = "ABDOMINAL PAIN"
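Execution accuracy treats a generated query as correct only if it executes and returns the same result set as the reference query. The sketch below illustrates such a check against a SQLite [115] copy of the MIMICSQL tables; the database path is hypothetical, and the deliberately lowercased generated query is constructed only to illustrate a string-mismatch failure.

# Illustrative execution-accuracy (EX) check: a generated query counts as correct
# if it runs and returns the same result set as the reference query. The database
# path is hypothetical; the MIMICSQL tables (DEMOGRAPHIC, DIAGNOSES, LAB, ...) are
# assumed to be loaded into SQLite.
import sqlite3

def execute(cur: sqlite3.Cursor, sql: str):
    try:
        cur.execute(sql)
        # order-insensitive comparison of result rows
        return sorted(map(tuple, cur.fetchall()))
    except sqlite3.Error:
        return None  # syntax or schema errors count as failures

def execution_match(db_path: str, reference_sql: str, generated_sql: str) -> bool:
    cur = sqlite3.connect(db_path).cursor()
    ref_rows = execute(cur, reference_sql)
    gen_rows = execute(cur, generated_sql)
    return gen_rows is not None and gen_rows == ref_rows

ref = ('SELECT COUNT(DISTINCT DEMOGRAPHIC."SUBJECT_ID") FROM DEMOGRAPHIC '
       'WHERE DEMOGRAPHIC."DISCHARGE_LOCATION" = "SNF" '
       'AND DEMOGRAPHIC."DIAGNOSIS" = "ABDOMINAL PAIN"')
gen = ('SELECT COUNT(DISTINCT DEMOGRAPHIC."SUBJECT_ID") FROM DEMOGRAPHIC '
       'WHERE DEMOGRAPHIC."DIAGNOSIS" = "abdominal pain"')  # string mismatch
print(execution_match("mimicsql.db", ref, gen))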
Table 3. Structured overview of sections and example snippets from synthetic patient record ID 4.
Category | Section | Example Snippet (Record ID: 4)
Generic Information | Patient Information | “Jane Doe, 40-year-old female nurse, divorced, registered April 1, 2024.”
Generic Information | Insurance Information | “HealthSecure, full coverage, deductible $1000, prior authorization obtained for pelvic MRI.”
Generic Information | Emergency Contacts | “Primary contact: (555) 472-6833; Secondary contact: sister, (555) 987-6543.”
Generic Information | Billing Information | “Electronic billing via HealthSecure portal, outstanding balance: none.”
Medical Record | Chief Complaint | “Chronic pelvic pain worsening over 12 months; prolonged menstruation up to 10 days.”
Medical Record | History of Present Illness | “Progressively heavier bleeding, pain worsens with activity, occasional dizziness.”
Medical Record | Past Medical History | “Anemia, hypertension (Losartan 50mg), appendectomy, 2 cesareans.”
Medical Record | Gynecological History | “Irregular periods, fibroids diagnosed age 38, G2P2, last Pap smear normal.”
Medical Record | Family History | “Mother with fibroids; sister with endometriosis; father with hypertension and diabetes.”
Medical Record | Social History | “Non-smoker, occasional alcohol, moderate stress, 2 kg weight loss in 6 months.”
Medical Record | Review of Systems | “Fatigue, dizziness, bloating, urinary frequency, mild lower back pain.”
Medical Record | Physical Examination | “Mild pallor, BP 132/85, HR 87, enlarged uterus with irregular contours.”
Medical Record | Diagnostic Studies | “Ultrasound: multiple fibroids (largest 4.8 cm); CBC: hemoglobin 10.2 g/dL.”
Assessment and Treatment Plan | Assessment | “Likely symptomatic uterine fibroids; differential: adenomyosis, endometrial hyperplasia.”
Assessment and Treatment Plan | Medical Management | “Ferrous sulfate 325 mg daily; Ibuprofen 400 mg as needed.”
Assessment and Treatment Plan | Surgical Consultation | “Referral for possible myomectomy or uterine artery embolization.”
Assessment and Treatment Plan | Lifestyle Modifications | “Increase iron-rich diet, encourage moderate exercise (e.g., yoga).”
Assessment and Treatment Plan | Follow-Up Plan | “Biopsy review in 1 week; monitor hemoglobin every 3 months.”
Assessment and Treatment Plan | Final Summary | “Chronic pelvic pain and menorrhagia due to fibroids, initial medical management plus surgical consult planned.”
Table 4. Twenty predefined questions utilized in the RAG-QA Task.
Evaluation Question
(1) What is the patient’s chief complaint?
(2) Who referred the patient for admission?
(3) What medications is the patient currently taking?
(4) What is the patient’s gynecological history?
(5) Has the patient experienced any recent surgical procedures?
(6) What is the patient’s preferred method of billing?
(7) What diagnostic studies have been performed?
(8) What is the attending physician’s name?
(9) What is the patient’s age?
(10) What is included in the patient’s treatment plan?
(11) What is the patient’s marital status?
(12) What is the patient’s occupation?
(13) Who are the patient’s emergency contacts and their contact information?
(14) What is the patient’s family medical history?
(15) What are the patient’s social habits including smoking, alcohol use, and exercise?
(16) What are the patient’s vital signs?
(17) What is the patient’s BMI and general physical appearance?
(18) What does the physical examination reveal?
(19) What is the patient’s assessment or diagnosis?
(20) What is the patient’s follow-up plan?
Table 5. N-gram overlap results for reference vs. generated texts.
Unigrams (1-g) | Bigrams (2-g) | Trigrams (3-g) | 4-grams (4-g)
Matches | 5 | 2 | 1 | 0
Total generated | 8 | 7 | 6 | 5
Precision | 5/8 = 0.625 | 2/7 ≈ 0.286 | 1/6 ≈ 0.167 | 0/5 = 0
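The clipped (modified) n-gram precisions summarized in Table 5 can be computed as follows; the reference and generated sentences in the snippet are hypothetical and chosen only to illustrate the counting, so their counts differ from those in the table.

# Illustrative computation of the modified n-gram precisions that underlie BLEU
# (cf. Table 5): for each n, count generated n-grams that also occur in the
# reference, clipped by the reference counts.
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(reference: str, generated: str, n: int) -> tuple[int, int]:
    ref, gen = ngrams(reference.split(), n), ngrams(generated.split(), n)
    matches = sum(min(count, ref[g]) for g, count in gen.items())
    return matches, sum(gen.values())   # (clipped matches, total generated n-grams)

# Hypothetical reference/generation pair, used only to illustrate the counting
reference = "the patient was admitted with severe chest pain"
generated = "patient was admitted due to severe pain in chest"
for n in range(1, 5):
    m, total = modified_precision(reference, generated, n)
    print(f"{n}-gram precision: {m}/{total}")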
Table 6. Proposed error categories and heuristic rules used to classify LLM-generated SQL failures.
Error Category | Heuristic Rules for Classification
(1) Schema mismatch: Errors due to inconsistencies between a query and the database schema are often caused by ambiguous or imprecise language. Heuristic rules are as follows:
  • Incorrect table reference—Valid column selected from the wrong table (e.g., VALUE_UNIT from demographic instead of lab).
  • Non-existent column selection—Querying a column not present in any table (e.g., HEIGHT from demographic).
  • Attribute misuse—Correct column used inappropriately (e.g., interpreting free-text DIAGNOSIS as structured codes).
(2) Incorrect aggregation usage: Errors in the application of aggregation operations are often due to misunderstandings of how to group or summarize data. Heuristic rules are as follows:
  • Inappropriate function—Misuse of an aggregation function (e.g., using COUNT instead of SUM).
  • Missing DISTINCT—Omitting DISTINCT when only unique values should be aggregated.
  • No GROUP BY clause—Applying aggregates without necessary grouping.
  • Wrong granularity—Aggregating at too coarse or too fine a level for the intended analysis.
(3) Join errors: Errors are due to omitted joins, missing tables, or misconfigured join types, resulting in incomplete or misleading query results across tables. Heuristic rules are as follows:
  • Omitted joins—Missing necessary table joins (e.g., diagnoses without joining to demographic).
  • Incorrect join keys—Using unrelated or inappropriate fields for joins (e.g., HADM_ID instead of SUBJECT_ID).
  • Wrong join type—Using INNER JOIN where OUTER JOIN is needed to preserve unmatched records.
(4) Condition misinterpretation: Errors in logical expressions within WHERE or HAVING clauses lead to incorrect or overly broad results. Heuristic rules are as follows:
  • Operator misuse—Incorrect comparison (e.g., AGE = 65 instead of AGE > 65).
  • Missing conditions—Omitting necessary filters (e.g., not restricting to abnormal lab results).
  • Logical errors—Incorrect use of AND/OR (e.g., selecting patients under 40 or over 80 when not intended).
(5) String mismatch: Errors in handling string values are often due to vocabulary differences, improper column use, or flawed comparison. Heuristic rules are as follows:
  • Vocabulary variation—Mismatched terms (e.g., “heart attack” vs. LONG_TITLE using “myocardial infarction”).
  • Wrong column—Using SHORT_TITLE when LONG_TITLE is more appropriate.
  • Comparison misuse—Ineffective use of LIKE vs. = for matching strings.
  • Unnormalized input—Case and whitespace not ignored (e.g., ‘SEPSIS’ vs. ‘Sepsis’).
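The heuristics in Table 6 can be operationalized as deterministic checks. The following simplified Python sketch illustrates the spirit of such rules over a reduced, assumed MIMICSQL schema; it is not the adjudicated SQL-EC implementation used in this study, and category (4) is omitted because comparing condition semantics requires deeper analysis than simple pattern matching.

# Simplified sketch of rule-based checks in the spirit of Table 6. The schema
# mapping is a reduced, assumed version of MIMICSQL, and the regexes are crude
# illustrations rather than the production SQL-EC rules.
import re

SCHEMA = {
    "DEMOGRAPHIC": {"SUBJECT_ID", "HADM_ID", "NAME", "DIAGNOSIS", "DISCHARGE_LOCATION"},
    "DIAGNOSES": {"HADM_ID", "SHORT_TITLE", "LONG_TITLE"},
    "LAB": {"HADM_ID", "CATEGORY", "VALUE_UNIT"},
}

def classify_errors(generated_sql: str, reference_sql: str) -> set[str]:
    gen, ref = generated_sql.upper(), reference_sql.upper()
    errors = set()

    # (1) Schema mismatch: column referenced on a table that does not define it
    for table, column in re.findall(r'\b(\w+)\."?(\w+)"?', gen):
        if table in SCHEMA and column not in SCHEMA[table]:
            errors.add("schema_mismatch")

    # (2) Incorrect aggregation usage: aggregate functions or DISTINCT differ
    agg = lambda s: re.findall(r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(", s)
    if agg(gen) != agg(ref) or ("DISTINCT" in ref) != ("DISTINCT" in gen):
        errors.add("incorrect_aggregation")

    # (3) Join errors: the set of joined tables differs from the reference
    tables = lambda s: set(re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", s))
    if tables(gen) != tables(ref):
        errors.add("join_error")

    # (4) Condition misinterpretation requires comparing WHERE-clause semantics
    #     and is omitted from this sketch.

    # (5) String mismatch: quoted literals differ after case normalization
    literals = lambda s: sorted(re.findall(r"[\"'](.*?)[\"']", s))
    if literals(gen) != literals(ref):
        errors.add("string_mismatch")

    return errors

# Hypothetical example: missing DISTINCT and a lexical mismatch in the literal
print(classify_errors(
    generated_sql='SELECT COUNT(*) FROM DEMOGRAPHIC '
                  'WHERE DEMOGRAPHIC."DIAGNOSIS" = "heart attack"',
    reference_sql='SELECT COUNT(DISTINCT DEMOGRAPHIC."SUBJECT_ID") FROM DEMOGRAPHIC '
                  'WHERE DEMOGRAPHIC."DIAGNOSIS" = "MYOCARDIAL INFARCTION"',
))  # contains 'incorrect_aggregation' and 'string_mismatch'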
Table 7. Execution statistics and performance variability of models on the NL2SQL task.
Model | Total Obs. | True Obs. | Avg. Tokens In | Avg. Tokens Out
DeepSeek V3 | 3000 | 1764 | 2281.84 | 68.17
DeepSeek V3.1 | 3000 | 1794 | 2307.84 | 71.53
Llama 3.3-70B | 3000 | 1635 | 2174.07 | 60.68
Qwen2.5-32B-instruct | 3000 | 1530 | 2425.20 | 63.29
Mixtral-8x22B-instruct | 3000 | 1572 | 2979.29 | 92.97
GPT-4o | 3000 | 1983 | 2134.92 | 58.11
GPT-5 | 3000 | 1938 | 2158.91 | 59.58
GPT-OSS-20B | 3000 | 1368 | 2221.92 | 126.12
BioMistral-7B | 3000 | 354 | 2987.24 | 91.84
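Execution accuracy (EX) follows directly from Table 7 as the ratio of correct generations (True Obs.) to total generations (Total Obs.), as the short snippet below verifies.

# EX = True Obs. / Total Obs. for each model in Table 7 (3000 observations each)
true_obs = {"GPT-4o": 1983, "GPT-5": 1938, "DeepSeek V3.1": 1794, "DeepSeek V3": 1764,
            "Llama 3.3-70B": 1635, "Mixtral-8x22B": 1572, "Qwen2.5-32B": 1530,
            "GPT-OSS-20B": 1368, "BioMistral-7B": 354}
for model, correct in true_obs.items():
    print(f"{model}: EX = {correct / 3000:.1%}")   # e.g., GPT-4o -> 66.1%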
Table 8. Cost analysis of models on the NL2SQL task based on token counts recorded during inference and official provider pricing.
Model | Price/1M Input Tokens ($) | Price/1M Output Tokens ($) | Avg. Price per Query ($)
DeepSeek V3 | 0.90 | 0.90 | 0.002115
DeepSeek V3.1 | 0.56 | 1.68 | 0.001413
Llama 3.3-70B | 0.90 | 0.90 | 0.002011
Mixtral-8x22B-Instruct | 1.20 | 1.20 | 0.003687
Qwen2.5-VL-32B-Instruct | 0.90 | 0.90 | 0.002240
GPT-4o | 2.50 | 10.00 | 0.005918
GPT-5 | 1.25 | 10.00 | 0.003294
GPT-OSS-20B | 0.10 | 0.50 | 0.000285
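The average per-query cost in Table 8 combines the measured token counts from Table 7 with the per-million-token prices, i.e., cost = (tokens_in × price_in + tokens_out × price_out) / 10^6. The snippet below reproduces the GPT-4o and GPT-OSS-20B entries.

# Per-query cost from measured token usage (Table 7) and provider pricing (Table 8)
def cost_per_query(tokens_in: float, tokens_out: float,
                   price_in: float, price_out: float) -> float:
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# GPT-4o: 2134.92 avg. input tokens, 58.11 avg. output tokens, $2.50/$10.00 per 1M
print(f"{cost_per_query(2134.92, 58.11, 2.50, 10.00):.6f}")   # ~0.005918 USD
# GPT-OSS-20B: 2221.92 / 126.12 avg. tokens at $0.10/$0.50 per 1M
print(f"{cost_per_query(2221.92, 126.12, 0.10, 0.50):.6f}")   # ~0.000285 USD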