Article

Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs

by Luka Blašković 1, Nikola Tanković 1,*, Ivan Lorencin 1,* and Sandi Baressi Šegota 2,*
1 Faculty of Informatics, Juraj Dobrila University of Pula, 52100 Pula, Croatia
2 Department of Automation and Electronics, Faculty of Engineering, University of Rijeka, 51000 Rijeka, Croatia
* Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(10), 256; https://doi.org/10.3390/bdcc9100256
Submission received: 16 July 2025 / Revised: 12 September 2025 / Accepted: 3 October 2025 / Published: 11 October 2025

Abstract

Electronic health records (EHRs) are typically stored in relational databases, making them difficult to query for nontechnical users, especially under privacy constraints. We evaluate two practical clinical NLP workflows, natural language to SQL (NL2SQL) for EHR querying and retrieval-augmented generation for clinical question answering (RAG-QA), with a focus on privacy-preserving deployment. We benchmark nine large language models, spanning open-weight options (DeepSeek V3/V3.1, Llama-3.3-70B, Qwen2.5-32B, Mixtral-8×22B, BioMistral-7B, and GPT-OSS-20B) and proprietary APIs (GPT-4o and GPT-5). The models were chosen to represent a diverse cross-section of sparse MoE, dense general-purpose, domain-adapted, and proprietary LLMs. On MIMICSQL (27,000 generations; nine models × three runs), the best NL2SQL execution accuracy (EX) is 66.1% (GPT-4o), followed by 64.6% (GPT-5). Among open-weight models, DeepSeek V3.1 reaches 59.8% EX and DeepSeek V3 58.8%, with Llama-3.3-70B at 54.5% and BioMistral-7B achieving only 11.8%, underscoring a persistent gap relative to general-domain benchmarks. We introduce SQL-EC, a deterministic SQL error-classification framework with adjudication, revealing string mismatches as the dominant failure (86.3%), followed by query-join misinterpretations (49.7%), while incorrect aggregation-function usage accounts for only 6.7%. This highlights lexical/ontology grounding as the key bottleneck for NL2SQL in the biomedical domain. For RAG-QA, evaluated on 100 synthetic patient records across 20 questions (54,000 reference–generation pairs; three runs), BLEU and ROUGE-L fluctuate more strongly across models, whereas BERTScore remains high for most, with DeepSeek V3.1 and GPT-4o among the top performers; pairwise t-tests confirm significant differences among the LLMs. Cost–performance analysis based on measured token usage shows per-query costs ranging from USD 0.000285 (GPT-OSS-20B) to USD 0.005918 (GPT-4o); DeepSeek V3.1 offers the best open-weight cost–accuracy trade-off, and GPT-5 provides a balanced API alternative. Overall, privacy-conscious RAG-QA attains strong semantic fidelity, whereas clinical NL2SQL remains brittle under lexical variation. SQL-EC pinpoints actionable failure modes, motivating ontology-aware normalization and schema-linked prompting for robust clinical querying.
Keywords: healthcare; large language model; EHR; NL2SQL; RAG; open source; ontology grounding
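
The headline NL2SQL metric above, execution accuracy (EX), scores a prediction as correct when it returns the same results as the gold query on the benchmark database. The following is a minimal sketch of that comparison in Python, assuming gold/predicted SQL pairs and a local SQLite copy of the MIMICSQL data; the function name and the mimicsql.db path are illustrative assumptions, not the authors' evaluation or SQL-EC code.

import sqlite3
from collections import Counter

def execution_accuracy(pairs, db_path):
    """Fraction of (gold_sql, predicted_sql) pairs whose result multisets match."""
    conn = sqlite3.connect(db_path)
    correct = 0
    for gold_sql, pred_sql in pairs:
        try:
            # Execute both queries and compare results as unordered multisets.
            gold_rows = Counter(conn.execute(gold_sql).fetchall())
            pred_rows = Counter(conn.execute(pred_sql).fetchall())
            correct += int(gold_rows == pred_rows)
        except sqlite3.Error:
            # A query that fails to execute is counted as incorrect.
            pass
    conn.close()
    return correct / len(pairs) if pairs else 0.0

# Hypothetical usage against a local MIMICSQL-style database file:
# ex = execution_accuracy(model_output_pairs, "mimicsql.db")

Comparing results as unordered multisets (Counter) rather than ordered lists avoids penalizing row-order differences when the question does not impose an ordering.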

Share and Cite

MDPI and ACS Style

Blašković, L.; Tanković, N.; Lorencin, I.; Baressi Šegota, S. Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs. Big Data Cogn. Comput. 2025, 9, 256. https://doi.org/10.3390/bdcc9100256

AMA Style

Blašković L, Tanković N, Lorencin I, Baressi Šegota S. Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs. Big Data and Cognitive Computing. 2025; 9(10):256. https://doi.org/10.3390/bdcc9100256

Chicago/Turabian Style

Blašković, Luka, Nikola Tanković, Ivan Lorencin, and Sandi Baressi Šegota. 2025. "Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs" Big Data and Cognitive Computing 9, no. 10: 256. https://doi.org/10.3390/bdcc9100256

APA Style

Blašković, L., Tanković, N., Lorencin, I., & Baressi Šegota, S. (2025). Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs. Big Data and Cognitive Computing, 9(10), 256. https://doi.org/10.3390/bdcc9100256
