Open Access Article
Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs
1 Faculty of Informatics, Juraj Dobrila University of Pula, 52100 Pula, Croatia
2 Department of Automation and Electronics, Faculty of Engineering, University of Rijeka, 51000 Rijeka, Croatia
* Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(10), 256; https://doi.org/10.3390/bdcc9100256
Submission received: 16 July 2025 / Revised: 12 September 2025 / Accepted: 3 October 2025 / Published: 11 October 2025
Abstract
Electronic health records (EHRs) are typically stored in relational databases, making them difficult for nontechnical users to query, especially under privacy constraints. We evaluate two practical clinical NLP workflows, natural language to SQL (NL2SQL) for EHR querying and retrieval-augmented generation for clinical question answering (RAG-QA), with a focus on privacy-preserving deployment. We benchmark nine large language models, spanning open-weight options (DeepSeek V3/V3.1, Llama-3.3-70B, Qwen2.5-32B, Mixtral-8×22B, BioMistral-7B, and GPT-OSS-20B) and proprietary APIs (GPT-4o and GPT-5). These models represent a diverse cross-section of sparse mixture-of-experts (MoE), dense general-purpose, domain-adapted, and proprietary LLMs. On MIMICSQL (27,000 generations; nine models × three runs), the best NL2SQL execution accuracy (EX) is 66.1% (GPT-4o), followed by 64.6% (GPT-5). Among open-weight models, DeepSeek V3.1 reaches 59.8% EX, while DeepSeek V3 reaches 58.8%, with Llama-3.3-70B at 54.5% and BioMistral-7B achieving only 11.8%, underscoring a persistent gap relative to general-domain benchmarks. We introduce SQL-EC, a deterministic SQL error-classification framework with adjudication, revealing string mismatches as the dominant failure mode (86.3%), followed by query-join misinterpretations (49.7%), while incorrect aggregation-function usage accounts for only 6.7%. This highlights lexical/ontology grounding as the key bottleneck for NL2SQL in the biomedical domain. For RAG-QA, evaluated on 100 synthetic patient records across 20 questions (54,000 reference–generation pairs; three runs), BLEU and ROUGE-L fluctuate more strongly across models, whereas BERTScore remains high for most, with DeepSeek V3.1 and GPT-4o among the top performers; pairwise t-tests confirm significant differences among the LLMs.
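To make the dominant SQL-EC failure mode concrete, the following is a minimal sketch of how a deterministic check for string-literal mismatches between a gold and a predicted query might look. The function names and the regex-based literal extraction are illustrative assumptions, not the paper's actual SQL-EC implementation:

```python
import re

def extract_string_literals(sql: str) -> set[str]:
    # Pull quoted string literals (e.g., diagnosis names) out of a SQL query,
    # lowercased so that pure case differences do not count as mismatches.
    return {m.lower() for m in re.findall(r"['\"]([^'\"]+)['\"]", sql)}

def has_string_mismatch(gold_sql: str, pred_sql: str) -> bool:
    # Flag the failure class SQL-EC reports as dominant (86.3%):
    # the predicted query's literals differ from the gold query's literals.
    return extract_string_literals(gold_sql) != extract_string_literals(pred_sql)

gold = "SELECT COUNT(*) FROM diagnoses WHERE short_title = 'Hypertension NOS'"
pred = "SELECT COUNT(*) FROM diagnoses WHERE short_title = 'hypertension'"
print(has_string_mismatch(gold, pred))  # → True: 'hypertension nos' vs 'hypertension'
```

A model that paraphrases a clinical term ("hypertension" instead of the exact MIMIC value "Hypertension NOS") produces a syntactically valid query that returns the wrong rows, which is why ontology-aware normalization is proposed as a remedy.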
Cost–performance analysis based on measured token usage shows per-query costs ranging from USD 0.000285 (GPT-OSS-20B) to USD 0.005918 (GPT-4o); DeepSeek V3.1 offers the best open-weight cost–accuracy trade-off, and GPT-5 provides a balanced API alternative. Overall, privacy-conscious RAG-QA attains strong semantic fidelity, whereas clinical NL2SQL remains brittle under lexical variation. SQL-EC pinpoints actionable failure modes, motivating ontology-aware normalization and schema-linked prompting for robust clinical querying.
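The per-query cost figures follow from straightforward token accounting. The sketch below shows the arithmetic under assumed token counts and per-million-token rates (both hypothetical; the paper's reported range of USD 0.000285 to USD 0.005918 comes from its measured usage):

```python
def per_query_cost(in_tokens: int, out_tokens: int,
                   in_price_per_m: float, out_price_per_m: float) -> float:
    # Cost = input and output tokens, each scaled by the provider's
    # per-million-token rate for that direction.
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# Assumed example: 1500 prompt tokens (schema + question), 120 completion
# tokens, at USD 2.50 / 10.00 per million input/output tokens.
cost = per_query_cost(1500, 120, 2.50, 10.00)
print(f"USD {cost:.6f}")  # → USD 0.004950
```

Because prompt tokens typically dominate in NL2SQL (the serialized schema is resent with every question), input pricing drives most of the per-query spread between models.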
Share and Cite
MDPI and ACS Style
Blašković, L.; Tanković, N.; Lorencin, I.; Baressi Šegota, S.
Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs. Big Data Cogn. Comput. 2025, 9, 256.
https://doi.org/10.3390/bdcc9100256