Search Results (34)

Search Parameters:
Keywords = answer-augmented prompting

15 pages, 967 KB  
Article
A Retrieval-Augmented Generation with Dual-Similarity Monitoring for Nuclear Energy Knowledge Q&A
by Cheng-Hsing Chiang and Kun-Chou Lee
Appl. Sci. 2026, 16(7), 3182; https://doi.org/10.3390/app16073182 - 26 Mar 2026
Viewed by 289
Abstract
We present a Retrieval-Augmented Generation (RAG)-based question-answering system for nuclear energy science communication, characterizing retrieval quality in generated responses. The system introduces a dual-similarity analysis that jointly measures (i) question-to-context (Q→C) and (ii) answer-to-context (A→C) semantic consistency, serving as a “retrieval-side semantic alignment signal” and a “post-generation semantic alignment indicator”, respectively. Built with LangChain, FAISS retrieval, and a large language model, our pipeline separates offline indexing from online inference and is grounded on authoritative Taiwanese Nuclear Safety Commission documents. We evaluate two settings: (a) in-domain prompts derived from the corpus and (b) out-of-domain, randomly generated nuclear energy questions. Results show that generated answers are, on average, more semantically similar to retrieved contexts than the original questions under the present setup, while the overall association between retrieval-side and answer-side signals remains stronger in the in-domain setting. Out-of-domain questions show weaker but still observable answer-to-context alignment patterns, contingent on corpus overlap. These findings suggest that combining RAG with dual-similarity analysis offers a practical and audit-oriented approach for educational Q&A, and we discuss potential improvements in versioned regulations, re-ranking, and abstention strategies. In this study, the RAG technique and dual-similarity analysis are combined to promote nuclear energy knowledge. The research flow chart of this study can be applied to many other fields of scientific knowledge. Full article
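The article does not include code, but the dual-similarity idea reduces to two cosine-similarity measurements over embeddings. The sketch below is an assumed shape, not the authors' implementation: the `dual_similarity` name and the mean-over-chunks aggregation are hypothetical choices, and embeddings are taken as precomputed vectors.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_similarity(q_vec, ctx_vecs, a_vec):
    """Q->C is the retrieval-side alignment signal; A->C is the
    post-generation alignment indicator. Each is taken here as the mean
    cosine similarity against the retrieved context chunks (an
    illustrative aggregation, not necessarily the paper's formula)."""
    q_to_c = float(np.mean([cosine(q_vec, c) for c in ctx_vecs]))
    a_to_c = float(np.mean([cosine(a_vec, c) for c in ctx_vecs]))
    return q_to_c, a_to_c
```

An answer that closely paraphrases the retrieved context will score higher on A→C than the original question does on Q→C, which is consistent with the average pattern the abstract reports.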

36 pages, 1552 KB  
Article
RO-FIN-LLM: A Benchmark with LLM-as-a-Judge and Human Evaluators for Romanian Tax and Accounting
by Maria-Ecaterina Olariu, Vlad-Gabriel Buinceanu, Cristian Simionescu, Octavian Dospinescu, Răzvan Georgescu, Cezar Tudor, Adrian Iftene and Ana-Maria Bores
Systems 2026, 14(3), 244; https://doi.org/10.3390/systems14030244 - 27 Feb 2026
Viewed by 520
Abstract
Large Language Models (LLMs) are increasingly being adopted in business settings; however, there remains a shortage of evaluation tools that account for country-specific regulations, particularly for Romania’s taxation and financial accounting requirements. RO-FIN-LLM is a benchmark designed to test how well LLMs handle Romania-specific regulatory question answering in taxation (including VAT regimes, income/profit tax, microenterprise rules, and other obligations) and financial accounting (including journal entries/monographs, amortization, provisions, and foreign exchange transactions). The benchmark contains questions curated by experts, each including the applicable regulatory time frames and the legal sources for the answers. Evaluation is performed under two protocols: closed-book and open-book with Retrieval-Augmented Generation (RAG), using the Tavily Search API. Answers are scored against rubrics for correctness, legal citation quality, and clarity/structure. A subset of answers produced by three models was additionally evaluated by 12 specialists in the financial-accounting domain. In this revision, we also describe a public release plan for the question schema, prompts, and evaluation scripts to support independent reproducibility. Full article
(This article belongs to the Special Issue Business Intelligence and Data Analytics in Enterprise Systems)

21 pages, 551 KB  
Article
Agentic RAG for Maritime AIoT: Natural Language Access to Structured Data
by Oxana Sachenkova, Melker Andreasson, Dongzhu Tan and Alisa Lincke
Sensors 2026, 26(4), 1227; https://doi.org/10.3390/s26041227 - 13 Feb 2026
Viewed by 587
Abstract
Maritime operations are increasingly reliant on sensor data to drive efficiency and enhance decision-making. However, despite rapid advances in large language models, including expanded context windows and stronger generative capabilities, critical industrial settings still require secure, role-constrained access to enterprise data and explicit limitation of model context. Retrieval-Augmented Generation (RAG) remains essential to enforce data minimization, preserve privacy, support verifiability, and meet regulatory obligations by retrieving only permissioned, provenance-tracked slices of information at query time. However, current RAG solutions lack robust validation protocols for numerical accuracy for high-stakes industrial applications. This paper introduces Lighthouse Bot, a novel Agentic RAG system specifically designed to provide natural-language access to complex maritime sensor data, including time-series and relational sensor data. The system addresses a critical need for verifiable autonomous data analysis within the Artificial Intelligence of Things (AIoT) domain, which we explore through a case study on optimizing ferry operations. We present a detailed architecture that integrates a Large Language Model with a specialized database and coding agents to transform natural language into executable tasks, enabling core AIoT capabilities such as generating Python code for time-series analysis, executing complex SQL queries on relational sensor databases, and automating workflows, while keeping sensitive data outside the prompt and ensuring auditable, policy-aligned tool use. To evaluate performance, we designed a test suite of 24 questions with ground-truth answers, categorized by query complexity (simple, moderate, complex) and data interaction type (retrieval, aggregation, analysis). 
Our results show robust, controlled data access with high factual fidelity: the proprietary Claude 3.7 achieved close to 90% overall factual correctness, while the open-source Qwen 72B achieved 66% overall and 99% on simple retrieval and aggregation queries. These findings underscore the need for a secure limited-context RAG in maritime AIoT and the potential for cost-effective automation of routine exploratory analyses. Full article
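The evaluation above splits 24 questions by complexity and data-interaction type and reports correctness rates per slice. That bookkeeping reduces to a small aggregation; `accuracy_by_category` below is a hypothetical helper illustrating the idea, not code from the paper.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, correct) pairs, where category
    might be a complexity label ("simple", "moderate", "complex") or an
    interaction type ("retrieval", "aggregation", "analysis").
    Returns the factual-correctness rate per category."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}
```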

22 pages, 1655 KB  
Article
Engineering Trustworthy Retrieval-Augmented Generation for EU Electricity Market Regulation
by Șener Ali, Simona-Vasilica Oprea and Adela Bâra
Electronics 2026, 15(4), 749; https://doi.org/10.3390/electronics15040749 - 10 Feb 2026
Viewed by 474
Abstract
The regulatory framework governing EU electricity markets is highly complex, fragmented across multiple normative acts and sensitive to citation accuracy and contextual completeness. While Large Language Models (LLMs) offer promising capabilities for regulatory question answering (QA), their tendency to hallucinate legal references and omit critical conditions makes them unreliable for compliance-sensitive domains. This paper presents the design of a domain-specific Retrieval-Augmented Generation (RAG) system for EU electricity market regulations, explicitly engineered to deliver source-grounded, traceable and low-hallucination answers. The answering component is based on Google’s gemini-2.5-flash model. OpenAI’s gpt-4o-mini model is responsible both for selecting relevant documents before building the RAG prompt and for serving as the judge LLM for Retrieval-Augmented Generation Assessment (RAGAS) evaluation. We build a legal corpus comprising multiple core EU regulatory acts related to REMIT and market operation and propose a regulatory QA architecture that integrates: (i) three chunking strategies (article-based, structure-aware, sliding window), (ii) two embedding models and (iii) a novel LLM-based document selection agent that restricts retrieval to the most relevant normative acts before vector search, improving contextual focus and retrieval precision. Using a fixed benchmark of regulatory questions and a reproducible evaluation protocol, we quantitatively assess system performance with RAGAS metrics and classical information-retrieval measures. While all configurations achieve strong faithfulness (up to 0.96), answer relevancy varies substantially with embedding and chunking choices. The findings confirm that retrieval engineering, particularly embedding selection, chunking strategy and pre-retrieval document filtering, has a high impact on building reliable regulatory AI systems.
The sliding window strategy combined with bge-small-en-v1.5 delivered the strongest rank-sensitive retrieval performance, achieving the highest Precision@10 and NDCG@10. In contrast, article-level chunking with the same model yielded a modest improvement in Recall@10, indicating a clear trade-off between recall and precision-oriented ranking quality in legal corpora. Full article
(This article belongs to the Special Issue Generative AI and Its Transformative Potential, 2nd Edition)
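Of the three chunking strategies compared above, the sliding-window one is the easiest to make concrete: each chunk advances by a fixed stride so that consecutive chunks overlap. The function below is a minimal sketch over a pre-tokenized document; the parameter names and defaults are assumptions, not the paper's configuration.

```python
def sliding_window_chunks(tokens, window=256, stride=128):
    """Split a token sequence into overlapping chunks.

    Each chunk is `window` tokens long and starts `stride` tokens after
    the previous one, so consecutive chunks share (window - stride)
    tokens of context -- useful when a legal condition straddles an
    article boundary."""
    chunks = []
    for start in range(0, max(len(tokens) - window, 0) + 1, stride):
        chunks.append(tokens[start:start + window])
    return chunks
```

With `window=256, stride=128` every token (except at the edges) appears in two chunks, trading index size for a better chance that a retrieved chunk contains the full relevant passage.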

31 pages, 2850 KB  
Article
Context-Aware Multi-Agent Architecture for Wildfire Insights
by Ashen Sandeep, Sithum Jayarathna, Sunera Sandaruwan, Venura Samarappuli, Dulani Meedeniya and Charith Perera
Sensors 2026, 26(3), 1070; https://doi.org/10.3390/s26031070 - 6 Feb 2026
Viewed by 892
Abstract
Wildfires are environmental hazards with severe ecological, social, and economic impacts, devastating ecosystems, communities, and economies worldwide, with rising frequency and intensity driven by climate change, human activity, and environmental shifts. Analyzing wildfire insights such as detection, predictive patterns, and risk assessment enables proactive response and long-term prevention. However, most existing approaches focus on isolated processing of data, making it challenging to orchestrate cross-modal reasoning and transparency. This study proposes a novel orchestrator-based multi-agent system (MAS) that transforms multimodal environmental data into actionable intelligence for decision making. We designed a framework that utilizes Large Multimodal Models (LMMs) augmented by structured prompt engineering and specialized Retrieval-Augmented Generation (RAG) pipelines to enable transparent and context-aware reasoning, providing a cutting-edge Visual Question Answering (VQA) system. It ingests diverse inputs such as satellite imagery, sensor readings, weather data, and ground footage, and then answers user queries. Validated on several public datasets, the system achieved a precision of 0.797 and an F1-score of 0.736. Thus, powered by Agentic AI, the proposed human-centric solution for wildfire management empowers firefighters, governments, and researchers to mitigate threats effectively. Full article
(This article belongs to the Section Internet of Things)

21 pages, 4001 KB  
Article
Designing an Architecture of a Multi-Agentic AI-Powered Virtual Assistant Using LLMs and RAG for a Medical Clinic
by Andreea-Maria Tanasă, Simona-Vasilica Oprea and Adela Bâra
Electronics 2026, 15(2), 334; https://doi.org/10.3390/electronics15020334 - 12 Jan 2026
Viewed by 1561
Abstract
This paper presents the design, implementation and evaluation of an agentic virtual assistant (VA) for a medical clinic, combining large language models (LLMs) with retrieval-augmented generation (RAG) technology and multi-agent artificial intelligence (AI) frameworks to enhance reliability, clinical accuracy and explainability. The assistant has multiple functionalities and is built around an orchestrator architecture in which a central agent dynamically routes user queries to specialized tools for retrieval-augmented question answering (Q&A), document interpretation and appointment scheduling. The implementation combines LangChain and LangGraph with interactive visualizations to track reasoning steps; prompting with Gemini 2.5 Flash defines tool usage and strict formatting rules, maintaining reliability and mitigating hallucinations. Prompt engineering plays an important role in the implementation and is designed to assist the patient in human–computer interaction. Evaluation through qualitative and quantitative metrics, including ROUGE, BLEU, LLM-as-a-judge and sentiment analysis, confirmed that the multi-agent architecture enhances interpretability, accuracy, context-aware performance and alignment with medical requirements, supporting diverse clinical tasks. Furthermore, the evaluation shows that Gemini 2.5 Flash combined with clinic-specific RAG significantly improves response quality, grounding and coherence compared with earlier models. SBERT analyses confirm strong semantic alignment across configurations, while LLM-as-a-judge scores highlight the superior relevance and completeness of the 2.5 RAG setup. Although some limitations remain, the updated system provides a more reliable and context-aware solution for clinical question answering. Full article

22 pages, 820 KB  
Article
CBR2: A Case-Based Reasoning Framework with Dual Retrieval Guidance for Few-Shot KBQA
by Xinyu Hu, Tong Li, Lingtao Xue, Zhipeng Du, Kai Huang, Gang Xiao and He Tang
Big Data Cogn. Comput. 2026, 10(1), 17; https://doi.org/10.3390/bdcc10010017 - 4 Jan 2026
Viewed by 783
Abstract
Recent advances in large language models (LLMs) have driven substantial progress in knowledge base question answering (KBQA), particularly under few-shot settings. However, symbolic program generation remains challenging due to its strict structural constraints and high sensitivity to generation errors. Existing few-shot methods often rely on multi-turn strategies, such as rule-based step-by-step reasoning or iterative self-correction, which introduce additional latency and exacerbate error propagation. We present CBR2, a case-based reasoning framework with dual retrieval guidance for single-pass symbolic program generation. Instead of generating programs interactively, CBR2 constructs a unified structure-aware prompt that integrates two complementary types of retrieval: (1) structured knowledge from ontologies and factual triples, and (2) reasoning exemplars retrieved via semantic and function-level similarity. A lightweight similarity model is trained to retrieve structurally aligned programs, enabling effective transfer of abstract reasoning patterns. Experiments on KQA Pro and MetaQA demonstrate that CBR2 achieves significant improvements in both accuracy and syntactic robustness. Specifically on KQA Pro, it boosts Hits@1 from 72.70% to 82.13% and reduces syntax errors by 25%, surpassing the previous few-shot state-of-the-art. Full article
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))
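The exemplar-retrieval half of the dual guidance described above (retrieving reasoning exemplars by similarity to build a single prompt) can be sketched as a top-k ranking over precomputed embeddings. The function name and the plain cosine ranking are illustrative assumptions; the paper trains a dedicated similarity model rather than using raw cosine scores.

```python
import numpy as np

def top_k_exemplars(query_vec, exemplars, k=2):
    """exemplars: list of (program_text, embedding) pairs.

    Rank stored reasoning exemplars by cosine similarity to the query
    embedding and return the k most similar program texts, which would
    then be pasted into one structure-aware few-shot prompt."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(exemplars, key=lambda e: cos(query_vec, e[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```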

28 pages, 1457 KB  
Article
LoopRAG: A Closed-Loop Multi-Agent RAG Framework for Interactive Semantic Question Answering in Smart Buildings
by Junqi Bai, Dejun Ning, Yuxuan You and Jiyan Chen
Buildings 2026, 16(1), 196; https://doi.org/10.3390/buildings16010196 - 1 Jan 2026
Viewed by 1322
Abstract
With smart buildings being widely adopted in urban digital transformation, interactive semantic question answering (QA) systems serve as a crucial bridge between user intent and environmental response. However, they still face substantial challenges in semantic understanding and dynamic reasoning. Most existing systems rely on static frameworks built upon Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), which suffer from rigid prompt design, breakdowns in multi-step reasoning, and inaccurate generation. To tackle these issues, we propose LoopRAG, a multi-agent RAG architecture that incorporates a Plan–Do–Check–Act (PDCA) closed-loop optimization mechanism. The architecture formulates a dynamic QA pipeline across four stages: task parsing, knowledge extraction, quality evaluation, and policy feedback, and further introduces a semantics-driven prompt reconfiguration algorithm and a heterogeneous knowledge fusion module. These components strengthen multi-source information handling and adaptive reasoning. Experiments on HotpotQA, MultiHop-RAG, and an in-house building QA dataset demonstrate that LoopRAG significantly outperforms conventional RAG systems in key metrics, including context recall of 90%, response relevance of 72%, and answer accuracy of 88%. The results indicate strong robustness and cross-task generalization. This work offers both theoretical foundations and an engineering pathway for constructing trustworthy and scalable semantic QA interaction systems in smart building settings. Full article
(This article belongs to the Special Issue AI in Construction: Automation, Optimization, and Safety)

43 pages, 6411 KB  
Article
The Art Nouveau Path: Valuing Urban Heritage Through Mobile Augmented Reality and Sustainability Education
by João Ferreira-Santos and Lúcia Pombo
Heritage 2026, 9(1), 4; https://doi.org/10.3390/heritage9010004 - 23 Dec 2025
Cited by 2 | Viewed by 755
Abstract
Cultural heritage is framed as a living resource for citizenship and education, although evidence on how in situ augmented reality can cultivate sustainability competences remains limited. This study examines the Art Nouveau Path, a location-based mobile augmented reality game across eight points of interest in Aveiro, Portugal, aligned with the GreenComp framework. Within a design-based research case study, the analysis integrates repeated cross-sectional student questionnaires (S1-PRE N = 221; S2-POST N = 439; S3-FU N = 434), anonymized gameplay logs from 118 collaborative groups, and 24 teacher field observations (T2-OBS), using quantitative summaries with reflexive thematic analysis. References to heritage preservation in students’ sustainability conceptions rose from 28.96% at baseline to 61.05% immediately after gameplay, remaining above baseline at follow-up (47.93%). Augmented reality items were answered more accurately than non-augmented-reality items (81% vs. 73%) and involved longer on-site exploration (+10.17 min). Triangulated evidence indicates that augmented reality and multimodality amplified attention to architectural details and prompted debates about authenticity. Built heritage, mobilized through lightweight augmented reality within a digital teaching and learning ecosystem, can serve as an effective context for Education for Sustainable Development, strengthening preservation literacy and civic responsibility and generating interoperable cultural traces for future reuse. Full article
(This article belongs to the Special Issue Applications of Digital Technologies in the Heritage Preservation)

19 pages, 1267 KB  
Article
Implementing a Knowledge Management System with GraphRAG: A Physical Internet Example
by Hisatoshi Naganawa, Enna Hirata and Akira Yamada
Electronics 2025, 14(24), 4948; https://doi.org/10.3390/electronics14244948 - 17 Dec 2025
Cited by 1 | Viewed by 1016
Abstract
The rapid expansion and interdisciplinary nature of Physical Internet (PI) research have resulted in fragmented knowledge, limiting the ability of stakeholders to identify emerging trends, actionable insights and genuine research gaps. This study introduces a novel knowledge management approach that uses Graph Retrieval-Augmented Generation (GraphRAG) to systematically organize and integrate PI-related literature. A comprehensive knowledge graph was constructed by extracting and semantically modeling entities and relationships from 2835 academic papers, conference proceedings and international roadmaps. The developed system incorporates fuzzy semantic search and multiple retrieval strategies, including local, global and hybrid approaches, enabling nuanced, context-aware access to information. Stakeholder-specific prompts, tailored to the needs of industry, government and academia, demonstrate how GraphRAG can support the discovery of business model innovations, policy design and underexplored research areas. A comparative evaluation using cosine similarity and BERTScore confirms that graph-based strategies outperform standard LLM retrieval in providing relevant and comprehensive answers while also revealing connections that would be missed in manual reviews. The results demonstrate that the proposed GraphRAG model is a scalable and extensible framework for addressing knowledge gaps and promoting collaboration in PI research synthesis for sustainable logistics. The model also shows promise for application in other complex domains. Full article
(This article belongs to the Special Issue Feature Papers in Artificial Intelligence)

35 pages, 2974 KB  
Article
Multi-Agent Coordination Strategies vs. Retrieval-Augmented Generation in LLMs: A Comparative Evaluation
by Irina Radeva, Ivan Popchev, Lyubka Doukovska and Miroslava Dimitrova
Electronics 2025, 14(24), 4883; https://doi.org/10.3390/electronics14244883 - 11 Dec 2025
Viewed by 2233
Abstract
This paper evaluates multi-agent coordination strategies against single-agent retrieval-augmented generation (RAG) for open-source language models. Four coordination strategies (collaborative, sequential, competitive, hierarchical) were tested across Mistral 7B, Llama 3.1 8B, and Granite 3.2 8B using 100 domain-specific question–answer pairs (3100 total evaluations). Performance was assessed using Composite Performance Score (CPS) and Threshold-aware CPS (T-CPS), aggregating nine metrics spanning lexical, semantic, and linguistic dimensions. Under the tested conditions, all 28 multi-agent configurations showed degradation relative to single-agent baselines, ranging from −4.4% to −35.3%. Coordination overhead was identified as a primary contributing factor. Llama 3.1 8B tolerated Sequential and Hierarchical coordination with minimal degradation (−4.9% to −5.3%). Mistral 7B with shared context retrieval achieved comparable results. Granite 3.2 8B showed degradation of 14–35% across all strategies. Collaborative coordination exhibited the largest degradation across all models. Study limitations include evaluation on a single domain (agriculture), use of 7–8B parameter models, and homogeneous agent architectures. These findings suggest that single-agent RAG may be preferable for factual question-answering tasks in local deployment scenarios with computational constraints. Future research should explore larger models, heterogeneous agent teams, role-specific prompting, and advanced consensus mechanisms. Full article
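The Composite Performance Score mentioned above aggregates nine metrics into one number. A plausible minimal form is a weighted mean, with the threshold-aware variant gating out metrics below a floor; the exact aggregation and threshold semantics in the paper may differ, so treat the functions below as an assumed sketch.

```python
def composite_performance_score(metrics, weights=None):
    """metrics: dict of metric name -> score in [0, 1] (e.g. nine
    lexical, semantic, and linguistic metrics). CPS here is a weighted
    mean with uniform weights by default (an illustrative choice)."""
    names = sorted(metrics)
    if weights is None:
        weights = {n: 1.0 for n in names}
    total_w = sum(weights[n] for n in names)
    return sum(metrics[n] * weights[n] for n in names) / total_w

def threshold_aware_cps(metrics, floor=0.5, weights=None):
    """Assumed T-CPS form: zero out any metric below `floor` before
    averaging, penalizing configurations with weak individual scores."""
    gated = {n: (v if v >= floor else 0.0) for n, v in metrics.items()}
    return composite_performance_score(gated, weights)
```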

20 pages, 645 KB  
Article
Enhancing Chatbot Performance in a SaaS Platform Through Retrieval-Augmented Generation and Prompt Engineering: A Case Study in Behavioral Safety Analysis
by Jorge Rivera, Scarlett Zapata, Ricardo Pizarro and Brian Keith
Knowledge 2025, 5(4), 25; https://doi.org/10.3390/knowledge5040025 - 5 Nov 2025
Viewed by 1823
Abstract
This article presents a case study of the development of a chatbot, named Selene, in a Software-as-a-Service platform for behavioral analysis, using Retrieval-Augmented Generation (RAG) to integrate domain-specific knowledge and enforce adherence to organizational rules, thereby improving response quality. Selene is designed to provide deep analyses and practical recommendations that help users optimize organizational behavioral development. To ensure that the RAG pipeline had updated information, we implemented an Extract, Transform, and Load process that updated the knowledge base of the pipeline daily, and applied prompt engineering to ensure compliance with organizational rules and directives, using GPT-4 as the underlying language model of the chatbot, which was the state-of-the-art model at the time of deployment. We followed the Generative AI Project Life Cycle Framework as the basic methodology to develop this system. To evaluate Selene, we used the DeepEval library, showing that it provides appropriate responses aligned with organizational rules. Our results show that the system achieves high answer relevancy in 78% of the test cases and a complete absence of bias and toxicity issues. This work provides practical insights for organizations deploying similar knowledge-based chatbot systems. Full article

40 pages, 2077 KB  
Article
Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs
by Luka Blašković, Nikola Tanković, Ivan Lorencin and Sandi Baressi Šegota
Big Data Cogn. Comput. 2025, 9(10), 256; https://doi.org/10.3390/bdcc9100256 - 11 Oct 2025
Viewed by 3187
Abstract
Electronic health records (EHRs) are typically stored in relational databases, making them difficult to query for nontechnical users, especially under privacy constraints. We evaluate two practical clinical NLP workflows, natural language to SQL (NL2SQL) for EHR querying and retrieval-augmented generation for clinical question answering (RAG-QA), with a focus on privacy-preserving deployment. We benchmark nine large language models, spanning open-weight options (DeepSeek V3/V3.1, Llama-3.3-70B, Qwen2.5-32B, Mixtral-8×22B, BioMistral-7B, and GPT-OSS-20B) and proprietary APIs (GPT-4o and GPT-5). The models were chosen to represent a diverse cross-section spanning sparse MoE, dense general-purpose, domain-adapted, and proprietary LLMs. On MIMICSQL (27,000 generations; nine models × three runs), the best NL2SQL execution accuracy (EX) is 66.1% (GPT-4o), followed by 64.6% (GPT-5). Among open-weight models, DeepSeek V3.1 reaches 59.8% EX, while DeepSeek V3 reaches 58.8%, with Llama-3.3-70B at 54.5% and BioMistral-7B achieving only 11.8%, underscoring a persistent gap relative to general-domain benchmarks. We introduce SQL-EC, a deterministic SQL error-classification framework with adjudication, revealing string mismatches as the dominant failure (86.3%), followed by query-join misinterpretations (49.7%), while incorrect aggregation-function usage accounts for only 6.7%. This highlights lexical/ontology grounding as the key bottleneck for NL2SQL in the biomedical domain. For RAG-QA, evaluated on 100 synthetic patient records across 20 questions (54,000 reference–generation pairs; three runs), BLEU and ROUGE-L fluctuate more strongly across models, whereas BERTScore remains high on most, with DeepSeek V3.1 and GPT-4o among the top performers; pairwise t-tests confirm that significant differences were observed among the LLMs.
Cost–performance analysis based on measured token usage shows per-query costs ranging from USD 0.000285 (GPT-OSS-20B) to USD 0.005918 (GPT-4o); DeepSeek V3.1 offers the best open-weight cost–accuracy trade-off, and GPT-5 provides a balanced API alternative. Overall, the privacy-conscious RAG-QA attains strong semantic fidelity, whereas the clinical NL2SQL remains brittle under lexical variation. SQL-EC pinpoints actionable failure modes, motivating ontology-aware normalization and schema-linked prompting for robust clinical querying. Full article
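The abstract's three dominant SQL-EC failure modes (string mismatches, join misinterpretations, aggregation misuse) suggest a deterministic triage over predicted-versus-gold SQL. The rules below are purely illustrative stand-ins for the real framework, which also uses adjudication; every rule and name here is an assumption.

```python
import re

def classify_sql_error(predicted_sql: str, gold_sql: str):
    """Minimal deterministic triage in the spirit of SQL-EC.
    Returns a list of suspected error classes (possibly empty)."""
    errors = []
    pred, gold = predicted_sql.lower(), gold_sql.lower()

    # String-literal mismatch: identical query skeleton, different
    # quoted values (the dominant failure class reported above).
    normalize = lambda s: re.sub(r"'[^']*'", "'?'", s)
    if normalize(pred) == normalize(gold) and pred != gold:
        errors.append("string_mismatch")

    # Join misinterpretation: differing number of JOIN clauses.
    if pred.count(" join ") != gold.count(" join "):
        errors.append("join_misinterpretation")

    # Aggregation-function misuse: differing sets of aggregates.
    aggs = lambda s: {f for f in ("count(", "sum(", "avg(", "min(", "max(") if f in s}
    if aggs(pred) != aggs(gold):
        errors.append("aggregation_misuse")
    return errors
```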

18 pages, 3371 KB  
Article
Fusing Geoscience Large Language Models and Lightweight RAG for Enhanced Geological Question Answering
by Bo Zhou and Ke Li
Geosciences 2025, 15(10), 382; https://doi.org/10.3390/geosciences15100382 - 2 Oct 2025
Cited by 4 | Viewed by 2561
Abstract
Mineral prospecting from vast geological text corpora is impeded by challenges in domain-specific semantic interpretation and knowledge synthesis. General-purpose Large Language Models (LLMs) struggle to parse the complex lexicon and relational semantics of geological texts, limiting their utility for constructing precise knowledge graphs (KGs). We address this gap with a framework that integrates a domain-specific LLM, GeoGPT, with a lightweight retrieval-augmented generation architecture, LightRAG. Within this framework, GeoGPT automates the construction of a high-quality mineral-prospecting KG by performing ontology definition, entity recognition, and relation extraction. The LightRAG component then leverages this KG to power a specialized geological question-answering (Q&A) system featuring a dual-layer retrieval mechanism for enhanced precision and an incremental update capability for dynamic knowledge incorporation. The proposed method achieves a mean F1-score of 0.835 for entity extraction, a 17% to 25% improvement over general-purpose large models using generic prompts. Furthermore, the geological Q&A model, built on the LightRAG framework with GeoGPT at its core, achieves win rates 8–29% higher than the general-purpose DeepSeek-V3 and Qwen2.5-72B models in the geochemistry domain and 53–78% higher in the remote-sensing geology domain. This study establishes an effective and scalable methodology for intelligent geological text analysis, enabling lightweight, high-performance Q&A systems that accelerate knowledge discovery in mineral exploration. Full article
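The dual-layer retrieval idea can be sketched in miniature: a low-level pass ranks KG entities against the query, and a high-level pass pulls relations touching the retrieved entities for broader context. The toy KG, the keyword-overlap scoring, and all names below are illustrative assumptions, not the paper's implementation.

```python
import re

# Toy knowledge graph: entity -> description, plus (head, relation, tail) triples.
KG_ENTITIES = {
    "magnetite": "Iron oxide mineral; common indicator in skarn deposits.",
    "skarn": "Contact-metasomatic rock hosting Fe, Cu, and W mineralization.",
    "porphyry": "Intrusion-related deposit type rich in Cu and Mo.",
}
KG_RELATIONS = [
    ("magnetite", "indicator_of", "skarn"),
    ("skarn", "hosts", "tungsten mineralization"),
]

def overlap(query: str, text: str) -> int:
    """Count shared keywords between query and text (crude relevance score)."""
    q = set(re.findall(r"\w+", query.lower()))
    return len(q & set(re.findall(r"\w+", text.lower())))

def dual_layer_retrieve(query: str, k: int = 2):
    # Low level: rank entities by overlap between query and name + description.
    entities = sorted(KG_ENTITIES,
                      key=lambda e: overlap(query, e + " " + KG_ENTITIES[e]),
                      reverse=True)[:k]
    # High level: gather relations touching any retrieved entity.
    relations = [r for r in KG_RELATIONS if r[0] in entities or r[2] in entities]
    return entities, relations

ents, rels = dual_layer_retrieve("which deposits host tungsten skarn mineralization")
```

In a real LightRAG-style system the overlap score would be replaced by dense embeddings and the relations by graph traversal, but the two-pass structure is the same.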

15 pages, 770 KB  
Article
Analysis of Large Language Models for Company Annual Reports Based on Retrieval-Augmented Generation
by Abhijit Mokashi, Bennet Puthuparambil, Chaissy Daniel and Thomas Hanne
Information 2025, 16(9), 786; https://doi.org/10.3390/info16090786 - 10 Sep 2025
Cited by 1 | Viewed by 3997
Abstract
Large language models (LLMs) like ChatGPT-4 and Gemini 1.0 demonstrate significant text-generation capabilities but often struggle with outdated knowledge, domain specificity, and hallucinations. Retrieval-Augmented Generation (RAG) offers a promising solution by integrating external knowledge sources to produce more accurate and informed responses. This research investigates RAG’s effectiveness in enhancing LLM performance for financial report analysis. We examine how RAG and specific prompt design improve the accuracy, relevance, and verifiability of qualitative and quantitative financial information. Employing a design science research approach, we compare ChatGPT-4 responses before and after RAG integration, using annual reports from ten selected technology companies. Our findings demonstrate that RAG improves the relevance and verifiability of LLM outputs (by 0.66 and 0.71 points, respectively, on a scale from 1 to 5) while also reducing irrelevant or incorrect answers. Prompt specificity critically impacts response quality. This study indicates RAG’s potential to mitigate LLM biases and inaccuracies, offering a practical solution for generating reliable and contextually rich financial insights. Full article
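The RAG step applied to an annual report can be sketched as: split the report into chunks, score each chunk against the question, and prepend the best chunk to the prompt. The report snippet, chunk size, and keyword-overlap scoring are toy assumptions; the study itself used ChatGPT-4 over retrieved report passages.

```python
import re

def chunk(text: str, size: int = 40) -> list:
    """Split text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def top_chunks(question: str, chunks: list, k: int = 2) -> list:
    """Rank chunks by keyword overlap with the question."""
    q = set(re.findall(r"\w+", question.lower()))
    score = lambda c: len(q & set(re.findall(r"\w+", c.lower())))
    return sorted(chunks, key=score, reverse=True)[:k]

report = ("Revenue for fiscal 2023 grew 12 percent to 4.1 billion dollars. "
          "Research and development expenses increased due to AI investments. "
          "The board declared a quarterly dividend of 0.25 per share.")
question = "How much did revenue grow in fiscal 2023?"
context = top_chunks(question, chunk(report, size=12), k=1)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
```

Grounding the prompt in a retrieved passage is what lets the answer be verified against the source report, which is the verifiability gain the study measures.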
