Article

Reducing Hallucinations in Medical AI Through Citation Enforced Prompting in RAG Systems

by Lukasz Pawlik 1,2,* and Stanislaw Deniziak 1
1 Department of Information Systems, Kielce University of Technology, 7 Tysiąclecia Państwa Polskiego Ave., 25-314 Kielce, Poland
2 Altar Sp. z o.o., 5 Różana St., 25-729 Kielce, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 3013; https://doi.org/10.3390/app16063013
Submission received: 5 March 2026 / Revised: 16 March 2026 / Accepted: 17 March 2026 / Published: 20 March 2026

Abstract

The safe integration of Large Language Models in clinical environments requires strict adherence to verified medical evidence. As part of the PARROT AI project, this study provides a systematic evaluation of how prompting strategies affect the reliability of Retrieval-Augmented Generation (RAG) pipelines using the MedQA USMLE benchmark (N = 500). Four prompting strategies were examined: Baseline (zero-shot), Neutral, Expert Chain-of-Thought (Expert-CoT) with structured clinical reasoning, and StrictCitations with mandatory evidence grounding. The experiments covered six modern model architectures: Command R (35B), Gemma 2 (9B and 27B), Llama 3.1 (8B), Mistral Nemo (12B), and Qwen 2.5 (14B). Evaluation was conducted using the Deterministic RAG Evaluator, providing an objective assessment of grounding through the Unsupported Sentence Ratio (USR) based on TF-IDF and cosine similarity. The results indicate that structured reasoning in the Expert-CoT strategy significantly increases USR values (reaching 95–100%), as models prioritize internal diagnostic logic over verbatim context. In contrast, the StrictCitations strategy, while maintaining high USR due to the conservative evaluation threshold, achieves the highest level of verifiable grounding and source adherence. The analysis identifies a statistically significant Verbosity Signal (r = 0.81, p < 0.001), where increased response length serves as a proxy for model uncertainty and parametric leakage, a pattern particularly prominent in Llama 3.1 and Gemma 2. Overall, the findings demonstrate that prompting strategy selection is as critical for clinical reliability as model architecture. This work delivers a reproducible framework for the development of trustworthy medical AI assistants and highlights citation-enforced prompting as a vital mechanism for improving clinical safety.

1. Introduction

Modern healthcare infrastructures globally are enduring a period of intense pressure, characterized by a growing discrepancy between the increasing complexity of patient care and the available bandwidth of medical personnel. One of the most significant contributors to this systemic strain is the administrative burden associated with clinical documentation and data entry, which often consumes a disproportionate amount of a physician’s workday [1]. Empirical studies have demonstrated that clinicians frequently spend more time interacting with Electronic Health Records (EHRs) than with patients, a phenomenon that acts as a primary catalyst for professional burnout, reduced job satisfaction, and a heightened risk of medical errors due to cognitive fatigue [1]. This documentation-centric task load not only disrupts the patient–clinician relationship but also introduces significant operational costs to healthcare providers, estimated to be in the billions annually due to turnover and lost productivity [2].
Within the specific context of the Polish medical sector, the PARROT AI project, conducted under the national INFOSTRATEG strategic program, represents a high-stakes initiative to mitigate these challenges through the deployment of advanced artificial intelligence [3]. The project aims to develop a comprehensive digital assistant ecosystem capable of automating registration, transcribing medical interviews with high precision, and providing expert-level clinical decision support [4]. By leveraging speech processing and sophisticated natural language models, the system seeks to “de-burden” physicians, allowing technology to fade into the background while clinical care returns to the foreground [4]. The successful commercialization of such a system, planned for 2026, hinges on its ability to provide reliable, high-accuracy suggestions that medical professionals can trust in time-sensitive and high-stakes environments [3].
However, the integration of Large Language Models (LLMs) into clinical workflows is fundamentally limited by the phenomenon of hallucination, where models generate factually incorrect or ungrounded assertions with high linguistic confidence [5]. Hallucinations are not mere artifacts of model size; they are intrinsic properties of probabilistic language models that emerge from the trade-offs between creative generation and factual accuracy [6]. In clinical diagnosis and treatment planning, even minor hallucinations can lead to catastrophic diagnostic errors, undermining public trust in AI. As these systems transition from general-purpose assistants to specialized clinical tools, there is an urgent need to shift model behavior from “stochastic prediction” to “verifiable information extraction” based on non-parametric knowledge sources [5].
Retrieval-Augmented Generation (RAG) has been proposed as the most effective framework for grounding LLMs in authoritative evidence [7]. By conditioning the model’s response on a set of retrieved documents, such as peer-reviewed textbooks, RAG significantly reduces factual errors associated with outdated training data [7]. Nevertheless, providing external context does not guarantee its correct application. Models often experience Knowledge Conflict, where extensive pre-training overrides the specific evidence provided in the retrieval window [8]. This results in unfaithful outputs, where the model disregards retrieved facts in favor of internal parametric priors. Furthermore, complex reasoning protocols like Chain-of-Thought (CoT) can paradoxically degrade performance in smaller models, leading to analytical paralysis rather than improved accuracy.
This study addresses these critical limitations by evaluating four distinct prompting strategies across six state-of-the-art LLM architectures, ranging from 8B to 35B parameters. Using a sample of 500 questions from the MedQA USMLE dataset, we provide a rigorous assessment of model reliability through the D-RAG Evaluator (Deterministic RAG Evaluator). This framework utilizes TF-IDF and cosine similarity to ensure objective assessment of grounding and accuracy, specifically through the introduction of the Unsupported Sentence Ratio (USR) as a conservative proxy for hallucination detection [9]. We quantify the impact of specific instructions on Faithfulness and Citation Consistency, moving beyond raw accuracy to identify the root causes of model failure [10]. Our findings demonstrate that while the StrictCitations strategy effectively suppresses unsupported content, architectures such as Qwen 2.5 14B and Gemma 2 27B exhibit a superior balance between reasoning and grounding. Furthermore, we introduce the “Verbosity Signal” as a statistically significant diagnostic indicator (r = 0.81, p < 0.001) of model uncertainty, where excessive response length serves as a proxy for potential lack of contextual grounding [6]. Together, these contributions provide a technical roadmap for designing safe, evidence-grounded AI assistants within the PARROT AI project and for advancing robust RAG methodologies in the broader medical informatics community [4].

2. Related Works

2.1. The Landscape of Medical Question Answering and Benchmarking

The development of specialized medical Large Language Models (LLMs) has been heavily reliant on the emergence of high-quality, large-scale benchmarks that simulate professional clinical assessments. Among these, the MedQA dataset is widely recognized as the gold standard for evaluating open-domain medical question answering. Introduced by Jin et al. [9], MedQA comprises multiple-choice questions derived from professional medical board examinations in the United States (USMLE), Mainland China, and Taiwan. These exams are designed to test not only factual recall but also complex clinical reasoning, requiring the synthesis of findings from case vignettes, the identification of pathophysiology, and the systematic elimination of incorrect diagnoses. Prior to the rise of modern LLMs, state-of-the-art performance on MedQA-USMLE was approximately 36.7%, illustrating the immense difficulty of the task for traditional neural methods [9]. The evolution of medical benchmarks has recently expanded to include more diverse modalities and language contexts. For instance, the Vietnamese Multiple-Choice Question Answering (VMHQA) dataset focuses on mental health topics, addressing resource gaps in low-income regions [11]. Similarly, researchers have developed the Swedish Medical LLM Benchmark (SMLB) to assess model capabilities in local clinical contexts, highlighting significant performance variations between high-resource languages like English and other national languages [12]. These benchmarks underscore a fundamental drawback of multiple-choice evaluations: the risk of data leakage and “memorization” rather than genuine reasoning [12]. Consequently, there is a growing consensus that for benchmarks to truly measure progress, they must reflect the complexities of real-world clinical practice, which often involves open-ended inquiries and unstructured patient data [13].

2.2. Retrieval-Augmented Generation (RAG) in Healthcare

The primary strategy for improving model reliability in the medical domain is Retrieval-Augmented Generation (RAG). RAG addresses the limitations of parametric knowledge storage by coupling a pretrained LLM with a non-parametric retrieval module that fetches external evidence during inference [7]. This architecture is particularly suited for medical applications because it allows for the integration of up-to-date information, enhances transparency through source attribution, and reduces the frequency of factual hallucinations [7]. Studies have shown that RAG pipelines typically result in accuracy gains of 10% to 15% over zero-shot LLM baselines in medical Q&A tasks [14]. In clinical settings, the preference for local or on-premise RAG deployments is growing due to stringent data privacy regulations such as GDPR and HIPAA, which often preclude the use of cloud-based APIs like GPT-4 [15]. However, standard RAG implementations face significant challenges; retrieval noise can lead to “grounded hallucinations” where the model synthesizes claims based on superficially related but technically incorrect context [7]. Advanced frameworks like Med-RISE have addressed these issues by expanding retrieval databases to include more authoritative sources and employing query rewriting to improve retrieval precision [16]. Furthermore, research into self-reflective RAG variants has demonstrated that instructing the model to list lacking citations and refine its answer based on uncertainty can lower hallucination rates to as low as 5.8% [14]. Modular retrieval and editing stages, as seen in MEGA-RAG, provide greater transparency and accountability, supporting factual correctness in public health scenarios [17].

2.3. Prompt Engineering and Clinical Reasoning Protocols

Prompt engineering has emerged as a lightweight yet powerful technique for guiding LLMs toward more structured and transparent reasoning processes [18]. Chain-of-Thought (CoT) prompting demonstrated that by instructing a model to generate intermediate reasoning steps, its latent inferential capabilities could be elicited, significantly improving performance on logical tasks [19]. In the clinical domain, this has been adapted into structured reasoning protocols that simulate a physician’s diagnostic workflow, including evidence gathering, pathophysiology analysis, and differential diagnosis [20]. Research has shown that these reasoning-based approaches can enhance diagnostic accuracy and provide valuable insights into the model’s decision-making process [21].
However, the efficacy of CoT in medicine is often constrained by the “faithfulness-plausibility gap” [20]. While models may generate fluent reasoning traces, they often fall victim to “Knowledge Override”, where strong parametric medical priors override the specific evidence provided in the retrieval window [8]. Recent investigations suggest that “safety-first” prompting strategies, such as strict citation enforcement, prioritize contextual adherence over elaborate reasoning, which is crucial for smaller architectures that may otherwise suffer from reasoning degradation or “analytical paralysis” when faced with complex instructions [6].

2.4. Hallucination Mechanisms and Deterministic Evaluation

Hallucination in LLMs arises from fundamental computational limits, noise in training data, and the inherent trade-off between creative generation and factual accuracy [6]. In medical contexts, hallucinations can take many forms, including the invention of symptoms or the generation of fake citations. Mitigation strategies span the entire model lifecycle, but inference-stage interventions like strict grounding are particularly promising [5]. Modern approaches move away from probabilistic “black-box” evaluations toward deterministic frameworks, such as the D-RAG Evaluator, which utilizes mathematical linguistics, including TF-IDF character-level n-grams and cosine similarity, to quantify the alignment between generated claims and source documents [14]. By measuring the Unsupported Sentence Ratio (USR), these frameworks provide a conservative assessment of model grounding, ensuring that any content extending beyond the provided evidence is transparently flagged.
Furthermore, research has identified that excessive response length, or a “Verbosity Signal”, can serve as a measurable proxy for model uncertainty and potential lack of grounding, especially when models are forced into reasoning chains they cannot fully verify [6]. These advancements suggest that reducing ungrounded content is not merely about increasing model size, but about architecting systems capable of self-verification and strict adherence to verified external knowledge [5]. Recent advances in medical AI have further explored hallucination reduction in specialized domains such as oncology [22] and general clinical information systems [23].

3. Materials and Methods

This section details the experimental framework designed to evaluate the influence of prompting strategies on Retrieval-Augmented Generation (RAG) performance in the medical domain, specifically within the context of the PARROT AI project. The integrity of clinical AI assistants depends on their ability to retrieve the most relevant evidence from technical corpora and synthesize this information without introducing unverified parametric bias [7].

3.1. Data Source: MedQA USMLE

The study utilizes the English subset of the MedQA dataset [9], consisting of multiple-choice questions (MCQs) from the United States Medical Licensing Examination (USMLE). This dataset is recognized as the premier benchmark for testing professional medical knowledge because it requires deep understanding and complex multi-hop reasoning over clinical vignettes.
To ensure a rigorous evaluation and statistical robustness, we utilized N = 500 consecutive questions from the “4 options” test set. This sample size was selected to provide narrow confidence intervals and ensure the reproducibility of the comparative findings. The MedQA textbooks, containing approximately 12.7 million tokens across 18 authoritative medical textbooks, serve as the primary knowledge source for the RAG pipeline.

3.2. System Architecture and Pipeline

The RAG technical pipeline was implemented as a modular local system to ensure clinical data privacy. The complete workflow, from query processing to algorithmic evaluation, is illustrated in Figure 1.

3.3. Knowledge Base Construction and Indexing

The retrieval corpus was constructed based on the professional medical textbooks associated with the MedQA dataset [9]. The indexing pipeline began with text segmentation, where raw text was processed into fixed-size chunks of 250 tokens with a 50-token overlap to ensure that semantic context is preserved across segment boundaries [24]. These chunks were subsequently converted into dense 768-dimensional vector representations using the bge-base-en-v1.5 embedding model [25]. Finally, for the indexing stage, we employed the Facebook AI Similarity Search (FAISS) library with the IndexFlatIP metric. To ensure mathematical consistency with cosine similarity, all embeddings were L 2 normalized prior to indexing [26].
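The indexing pipeline above can be condensed into a short sketch. This is a minimal NumPy stand-in, not the project code: random vectors replace the bge-base-en-v1.5 embedder, and a normalized matrix with inner-product search replaces the FAISS IndexFlatIP (after L2 normalization, inner product equals cosine similarity).

```python
import numpy as np

CHUNK_SIZE, OVERLAP, DIM = 250, 50, 768  # parameters reported in the paper

def chunk_tokens(tokens, size=CHUNK_SIZE, overlap=OVERLAP):
    """Split a token list into fixed-size chunks with overlap to preserve context."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def embed(chunks, dim=DIM, seed=0):
    """Placeholder embedder: random vectors stand in for bge-base-en-v1.5."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((len(chunks), dim)).astype("float32")

def l2_normalize(x):
    """L2-normalize rows so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
index = l2_normalize(embed(chunks))            # analogous to IndexFlatIP over normalized vectors
query = l2_normalize(embed(["query"], seed=1))
scores = index @ query.T                       # inner product == cosine after normalization
top_k = np.argsort(-scores.ravel())[:15]
```

In a production pipeline the `index` matrix would be replaced by a FAISS index, but the normalization step is the same: without it, inner-product search and cosine similarity diverge.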

3.4. Retrieval and Multi-Stage Reranking

For each query, the system executes a two-stage retrieval pipeline:
  • Initial Retrieval: The top k = 15 candidate chunks are retrieved from the FAISS index, prioritizing high recall.
  • Reranking: Candidates are processed by the FlashRank reranker using the ms-marco-TinyBERT-L-2-v2 model. The top 5 chunks are selected as the final context for the generation phase [27].
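The retrieve-then-rerank flow can be sketched as follows. This is an illustrative toy, not the study's implementation: a lexical-overlap score stands in for the FlashRank ms-marco-TinyBERT-L-2-v2 cross-encoder, and the corpus is synthetic.

```python
import numpy as np

def first_stage(query_vec, index, chunks, k=15):
    """Stage 1: high-recall dense retrieval (top-k by inner product)."""
    scores = index @ query_vec
    order = np.argsort(-scores)[:k]
    return [chunks[i] for i in order]

def rerank(query, candidates, k=5):
    """Stage 2: precision-oriented reranking; word-overlap score stands in
    for the FlashRank cross-encoder used in the paper."""
    def overlap(doc):
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / (len(q) or 1)
    return sorted(candidates, key=overlap, reverse=True)[:k]

# Toy corpus of 20 chunks with unit-norm embeddings.
rng = np.random.default_rng(0)
chunks = [f"chunk {i} about chest pain management" for i in range(20)]
index = rng.standard_normal((20, 8))
index /= np.linalg.norm(index, axis=1, keepdims=True)
qvec = index[3] + 0.01 * rng.standard_normal(8)   # query vector close to chunk 3
candidates = first_stage(qvec, index, chunks, k=15)
context = rerank("chest pain in a 55-year-old", candidates, k=5)
```

The design rationale is the usual recall/precision split: the cheap dense stage casts a wide net, and the more expensive reranker narrows the result to the five chunks passed to the generator.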

3.5. Evaluated Large Language Models and Infrastructure

We evaluated six state-of-the-art Large Language Models (LLMs) hosted locally via the Ollama framework. The models represent a diverse range of architectures and parameter scales (Table 1). All evaluated models are based on the Transformer architecture, which remains the state-of-the-art for clinical NLP. Proprietary models such as OpenAI’s GPT-4 were excluded to focus on local, open-weight deployments (8B–35B) that comply with strict healthcare data privacy regulations (GDPR/HIPAA) and allow for on-premise execution within the PARROT AI infrastructure.
Command R and Qwen 2.5 14B were selected for their reported high precision in RAG and tool-use tasks [28,29]. Llama 3.1 and Gemma 2 represent the cutting edge of open-weight decoder-only architectures, known for their strong general reasoning capabilities [30,31]. Furthermore, Mistral Nemo was included as it represents a state-of-the-art 12B parameter model, co-developed by Mistral AI and NVIDIA to deliver optimized performance and expanded context handling within a compact footprint [32].
All tasks were performed on an NVIDIA Quadro RTX 6000 GPU (24 GB VRAM). Inference performance was monitored to ensure clinical viability; for the expanded N = 500 dataset, total execution times per model ranged from 03:57:30 (Mistral Nemo, mean 28.5 s/query) to 07:02:20 (Gemma 2 27B, mean 50.7 s/query). The entire benchmarking suite for all six models required approximately 32.5 h of continuous GPU compute. These results demonstrate that even the most parameter-heavy local models (27B–35B) are feasible for high-throughput clinical decision support, particularly in asynchronous workflows or point-of-care settings with moderate query volumes.

3.6. Prompting Strategies

Four prompting strategies were implemented to test model resistance to “Knowledge Override” and adherence to external evidence (Table 2). These strategies represent a spectrum of constraints, from unassisted generation to highly structured clinical reasoning protocols.
The Baseline (No-RAG) strategy serves as the control group, specifically testing the models’ internal parametric knowledge without access to external medical documents [21]. This provides a benchmark for identifying the volume of factual information derived from pre-training versus the provided context.
The Neutral variant represents a standard implementation where models are instructed to integrate provided context into their decision-making process without specific formatting constraints [33]. It reflects the most common deployment scenario for clinical assistants.
To evaluate the impact of explicit grounding, the StrictCitations strategy requires models to justify every claim by citing specific source identifiers. This effectively transforms the LLM into a verifiable information extractor, significantly reducing the likelihood of ungrounded assertions [14].
Finally, the Expert Chain-of-Thought (Expert-CoT) strategy implements a complex clinical reasoning protocol. This multi-stage process involves Evidence Gathering, Pathophysiology analysis, Elimination of alternative options, and a Final Double Check, reflecting the cognitive steps taken by physicians [19].
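The four strategies can be illustrated as prompt templates. The wordings below are paraphrases for illustration only; the exact prompts used in the study may differ.

```python
# Illustrative paraphrases of the four strategies; the study's exact
# wordings are not reproduced here (these templates are assumptions).
PROMPTS = {
    "baseline": (
        "Answer the following USMLE question. "
        "Respond with 'Answer: [Letter]'.\n\n{question}"
    ),
    "neutral": (
        "Use the context below to inform your answer. "
        "Respond with 'Answer: [Letter]'.\n\nContext:\n{context}\n\n{question}"
    ),
    "strict_citations": (
        "Answer using ONLY the context below. Justify every claim with a "
        "citation such as [chunk-3]; if the context is insufficient, say so. "
        "End with 'Answer: [Letter]'.\n\nContext:\n{context}\n\n{question}"
    ),
    "expert_cot": (
        "You are an attending physician. Reason in four stages: "
        "1) Evidence Gathering, 2) Pathophysiology, 3) Elimination of "
        "alternative options, 4) Final Double Check. "
        "End with 'Answer: [Letter]'.\n\nContext:\n{context}\n\n{question}"
    ),
}

filled = PROMPTS["strict_citations"].format(context="...", question="...")
```

Note that only the Baseline template omits the `{context}` slot, matching its role as the No-RAG control.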

3.7. The D-RAG Evaluator: Deterministic Grounding Assessment

To eliminate the interpretive bias inherent in “LLM-as-a-Judge” systems, this study utilizes the D-RAG Evaluator (Deterministic RAG Evaluator). This framework relies on algorithmic verification and vector-space comparisons across four primary dimensions:
  • Accuracy is determined by extracting the “Answer: [Letter]” string via regular expressions and comparing it against MedQA ground truth.
  • Faithfulness is assessed by calculating the mean cosine similarity between answer sentences and the specifically cited text chunks. To ensure robustness against OCR noise, we employed a TF-IDF vectorizer with character-level n-grams.
  • Unsupported Sentence Ratio (USR) is calculated by cross-referencing every generated sentence against the entire retrieved context, flagging sentences as unsupported if their maximum cosine similarity falls below a conservative clinical threshold (τ = 0.22). This threshold was established through empirical cross-validation to distinguish between semantic alignment and tangential clinical associations, ensuring the evaluator remains robust against minor character-level discrepancies in complex medical terminology.
  • Knowledge Override (OVR) serves as a diagnostic metric to identify cases where the model provides a correct answer, but its reasoning lacks contextual grounding (Faithfulness < 25%).
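The USR computation can be sketched in pure Python. For brevity this sketch uses raw character-trigram counts rather than the TF-IDF weighting of the actual D-RAG Evaluator, so the similarity values are only analogous; the threshold τ = 0.22 is taken from the paper.

```python
import math
import re
from collections import Counter

TAU = 0.22  # conservative clinical threshold reported in the paper

def char_ngrams(text, n=3):
    """Character n-gram counts; raw counts stand in for TF-IDF weighting."""
    text = re.sub(r"\s+", " ", text.lower())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def usr(answer_sentences, context_chunks, tau=TAU):
    """Fraction of sentences whose best cosine match in the context is below tau."""
    chunk_vecs = [char_ngrams(c) for c in context_chunks]
    unsupported = sum(
        1 for s in answer_sentences
        if max((cosine(char_ngrams(s), cv) for cv in chunk_vecs), default=0.0) < tau
    )
    return unsupported / len(answer_sentences) if answer_sentences else 0.0
```

Because every sentence is compared against the whole retrieved context and only the maximum similarity counts, the metric is deliberately conservative: a sentence is supported if any chunk covers it.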

3.8. Statistical Analysis Methodology

Statistical significance for performance differences was assessed using the McNemar test for binary accuracy outcomes. For continuous metrics, such as Faithfulness and Unsupported Sentence Ratios, the Wilcoxon signed-rank test was employed. Correlation between response length (Verbosity) and USR was measured using the Pearson correlation coefficient (r). All metrics were normalized to a 0–100% scale, and 95% confidence intervals were calculated to ensure robustness [14].
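The two headline tests can be reproduced from first principles. The sketch below implements the McNemar statistic (with continuity correction) and the Pearson coefficient directly; all counts and data points are made up for illustration and are not the study's results.

```python
import math

def mcnemar_chi2(b, c):
    """McNemar test statistic with continuity correction.
    b, c: discordant pair counts (strategy A correct / B wrong, and vice versa)."""
    return (abs(b - c) - 1) ** 2 / (b + c) if (b + c) else 0.0

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical discordant pairs between two strategies:
stat = mcnemar_chi2(b=41, c=18)
significant = stat > 3.84          # chi-square(1) critical value at p = 0.05

# Verbosity vs. USR at the model-strategy aggregate level (made-up points):
r = pearson_r([150, 200, 260, 320, 370], [0.62, 0.70, 0.85, 0.93, 0.97])
```

Computing the correlation on model-strategy aggregates, as the paper does, avoids inflating r through repeated per-sample observations.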

3.9. Computational Efficiency and Hardware Resource Management

To evaluate the feasibility of real-time clinical deployment, we measured the inference latency and hardware resource requirements for each model. The experiments were conducted on a dedicated server environment using the NVIDIA Quadro RTX 6000 GPU (24 GB VRAM) as described in the technical specifications.
Latency was calculated as the average time to generate a complete response, including knowledge retrieval, reranking, and the generation of reasoning steps. Based on these measurements, we estimated the total workload for the expanded N = 500 dataset. The results, summarized in Table 3, reflect the performance of models executed locally via the Ollama framework.
The data indicates a clear trade-off between model scale and responsiveness. Models in the 8B–12B parameter range (notably Mistral Nemo) achieved sub-35-second response times, making them highly suitable for interactive clinical use cases. While larger models like Gemma 2 27B and Command R utilize nearly the full VRAM capacity of the professional-grade GPU (see Appendix A), their increased latency suggests they are better suited for asynchronous batch processing of medical records rather than immediate point-of-care decision support.

4. Results

The performance of the evaluated Large Language Models (LLMs) was assessed across four prompting strategies using the MedQA dataset (N = 500). Before presenting the quantitative findings, it is critical to establish a formal distinction between two types of errors: (1) factual inaccuracies, which stem from outdated or incorrect internal parametric knowledge acquired during pre-training, and (2) unfaithful hallucinations, defined as logical or semantic deviations from the provided retrieval context. While the former relates to the model’s static knowledge base, our study—and the D-RAG Evaluator—specifically targets the latter to assess the reliability of the grounding mechanism. Table 4 summarizes the primary metrics.

4.1. Comparative Accuracy and Strategy Effectiveness

As illustrated in Figure 2, RAG-based strategies (Neutral and StrictCitations) generally improved performance over the Baseline, with the most significant gains observed in the Gemma 2 27B model (+41.2 p.p.). Interestingly, the Expert-CoT strategy led to a substantial decrease in raw accuracy across all models, suggesting a “lost in reasoning” effect where complex clinical logic chains introduce multiple points of failure in the zero-shot clinical setting.
To determine the statistical significance of these performance shifts, we conducted a pairwise McNemar test. As shown in Figure 3, values satisfying the p < 0.05 criterion indicate significant differences, providing a rigorous differentiation between stochastic noise and true algorithmic improvement.
Intra-model analysis confirms that while the performance leap for Gemma 2 models was highly significant (p < 0.001), the transition for Llama 3.1 from Baseline to StrictCitations did not reach the significance threshold (p = 0.093). This lack of significance should not be interpreted as a failure of the strategy, but rather as evidence of Llama 3.1’s high parametric robustness. It indicates that the model’s internal medical knowledge is more resilient to external prompting constraints, maintaining a stable performance floor regardless of the RAG enforcement level, whereas Gemma 2’s performance is more heavily dependent on the provided context structure.

4.2. Grounding and Hallucination Analysis

The D-RAG Evaluator revealed a positive correlation between Faithfulness and Accuracy, as visualized in the regression analysis in Figure 4. However, many models achieved high faithfulness without corresponding accuracy gains, pointing to reasoning bottlenecks rather than retrieval failures.
The distribution of non-contextual content, measured as the Unsupported Sentence Ratio (USR), is presented in Figure 5. Under the conservative clinical threshold (τ = 0.22), most models exhibited high USR values (75–100%), particularly within the Expert-CoT strategy. This does not necessarily indicate factual clinical errors, but rather reflects the models’ generation of intermediate reasoning steps and diagnostic logic, which, while potentially accurate, extend beyond the verbatim evidence provided in the retrieved textbook chunks.
Interestingly, even under the StrictCitations strategy, models like Qwen 2.5 and Llama 3.1 maintained high USR levels (87–88%). This suggests that while these models successfully provide citations, they frequently interleave cited evidence with internal parametric knowledge to form a cohesive clinical narrative. In contrast, the high USR in Expert-CoT confirms that the structured reasoning process inherently relies on the model’s internal capabilities, making verbatim grounding nearly impossible to achieve for the entire response.

4.2.1. Knowledge Override and the Verbosity Signal

A primary objective was to quantify “Knowledge Override” (OVR), defined as cases where the model provides a correct answer but its reasoning lacks contextual grounding. Table 5 details these occurrences across strategies.
We identified a strong “Verbosity Signal”, which is clearly depicted in the correlation plot in Figure 6. Statistical analysis confirmed a Pearson correlation of r = 0.81 (p < 0.001), where increased response length in Expert-CoT (mean 261–367 words) directly correlated with higher unsupported sentence ratios (94.9–98.3%). This suggests that forcing long-form reasoning in clinical RAG increases the likelihood of parametric “leakage” from the model’s training data. The correlation was computed at the model–strategy aggregate level to avoid coefficient inflation due to per-sample repetition.

4.2.2. Root Cause of Clinical Hallucinations

To better understand the mechanisms of failure in a clinical setting, identified errors were categorized into two primary types. The first type is Contextual Misinterpretation (Faithful Error), observed mainly in Command R. In these instances, the model adheres to the provided context, but due to the incompleteness of the supplied chunks, it draws incorrect conclusions. The second type is Instruction-Knowledge Conflict (Unfaithful USR), dominant in Llama 3.1 and Gemma 2. In the StrictCitations variant, these models often provided medically correct answers based on internal weights while ignoring the lack of support in the context or generating phantom citations. This phenomenon, known as knowledge override, suggests that models with strong medical training bases are more difficult to ground when a conflict arises between the instructions and their internal knowledge. Final results indicate that for critical medical applications, the StrictCitations approach is superior in terms of safety, as it enforces transparency and effectively discourages the introduction of unverified internal knowledge.

5. Discussion

The experimental results reveal complex interactions between model architecture and prompting strategies. As established in our results, the clear distinction between factual inaccuracies and unfaithful hallucinations is critical, as the D-RAG evaluator specifically quantifies the latter through the Unsupported Sentence Ratio (USR). The analysis reveals distinct clinical “personalities” among the tested architectures, as summarized in the radar comparison (Figure 7). Qwen 2.5 and Command R emerged as the most balanced models for medical RAG, showing high citation consistency and a lower propensity for knowledge-driven overrides compared to larger architectures.
A crucial observation is the Knowledge Override phenomenon, which is particularly evident in models with high parametric capacity, such as Llama 3.1 and Gemma 2 27B. These models, due to their extensive pre-training on medical corpora, occasionally prioritized their internal weights over the provided retrieved chunks. While this behavior occasionally maintained accuracy in the Baseline variant, it compromised system auditability in RAG scenarios. As shown in Table 5, the OVR metric (Knowledge Override) serves as a critical diagnostic of this over-reliance. Auditability is a primary requirement for clinical safety, as it ensures that every medical recommendation can be traced back to a verified source. Command R, by contrast, exhibited a safety-first profile where its performance was more strictly bounded by the evidence available in the retrieval context.

5.1. The Verbosity Signal and Clinical Reliability

Our analysis of the Verbosity Ratio, visualized in Figure 8, indicates that several models tend to produce substantially longer responses when they fail to identify the correct answer or when forced into complex reasoning chains. In the Expert-CoT variant, mean word counts increased significantly (e.g., reaching 367 words for Command R), which strongly correlated (r = 0.81) with a higher Unsupported Sentence Ratio (94.9–98.3%). This form of “hallucinatory verbosity” may mislead clinicians by masking uncertainty behind unnecessarily elaborate explanations.
Data from the D-RAG Evaluator show that models such as Gemma 2 27B and Llama 3.1 reached Verbosity Ratios above 1.06 in the Expert-CoT variant, meaning that incorrect responses were systematically longer than correct ones. In practice, this Verbosity Signal can be surfaced as a real-time safety element in the user interface: if a response significantly exceeds the average length of grounded answers, the system can flag it with a “High Uncertainty” warning for the clinician.
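Such a length-based flag can be sketched as a simple outlier test against the word counts of historically grounded answers. The mean-plus-two-standard-deviations rule and the sample history below are illustrative assumptions, not a validated clinical threshold.

```python
import statistics

def verbosity_flag(response, grounded_lengths, k=2.0):
    """Flag a response whose word count exceeds mean + k*std of previously
    grounded answers (k=2.0 is an assumed, tunable setting)."""
    mu = statistics.mean(grounded_lengths)
    sigma = statistics.stdev(grounded_lengths)
    n_words = len(response.split())
    return "High Uncertainty" if n_words > mu + k * sigma else "OK"

history = [120, 135, 110, 128, 140, 118]   # word counts of grounded answers
long_answer = " ".join(["word"] * 367)     # Expert-CoT-style 367-word reply
print(verbosity_flag(long_answer, history))  # → High Uncertainty
```

A production system would recompute the reference distribution per model and per prompting variant, since the observed baseline lengths differ substantially between strategies.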

5.2. Retrieval Impact and Detailed Performance Analysis

The impact of successful retrieval on model performance is quantified in Table 6. The data confirms that for models with high Citation Consistency, the presence of a citation serves as a crucial grounding signal, though it is not a perfect guarantee of accuracy.
The majority of incorrect but faithful responses (Faithful Errors) were traced back to sub-optimal retrieval, where the correct diagnostic information was missing from the top-k chunks. This forced models to make logical but incorrect deductions based on incomplete data. As shown in Table 7, faithfulness scores are significantly higher when models successfully provide citations, compared to instances where citations are missing. A representative selection of models (from 14B to 35B parameters) was used to illustrate this trend, as they exhibited the most distinct correlation between citation presence and contextual grounding.
These findings highlight that clinical assistants must prioritize safety-first configurations that decline to answer when evidence is missing rather than guessing from internal knowledge. Strict citation enforcement serves as a vital tool for exposing a model’s reliance on context versus its own internal biases. Furthermore, the drop in accuracy observed in Expert-CoT is not merely due to override; it also reflects a “reasoning collapse,” or analytical paralysis, which must be carefully managed in high-stakes medical decision support systems.
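The citation-adherence side of this enforcement can be checked deterministically. Assuming the [docX#chunkY] citation format defined for the StrictCitations prompt, a CiteRate-style metric reduces to a regular-expression scan over the batch of responses:

```python
import re

# Citation format from the StrictCitations prompt, e.g. [doc3#chunk12]
CITATION_RE = re.compile(r"\[doc\d+#chunk\d+\]")

def cite_rate(responses):
    """Percentage of responses containing at least one [docX#chunkY] citation."""
    cited = sum(1 for r in responses if CITATION_RE.search(r))
    return 100.0 * cited / len(responses)

batch = [
    "The correct option is B [doc3#chunk12], consistent with the context.",
    "Answer: C.",  # no citation, counts against CiteRate
]
print(cite_rate(batch))  # → 50.0
```

Beyond presence checking, the extracted IDs can be resolved back to the retrieved chunks, which is what makes every recommendation traceable to a verified source.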

6. Conclusions

This study demonstrates that the reliability of Large Language Models in medical RAG systems is shaped not only by model scale but, critically, by prompting strategies and architectural optimizations that support transparent grounding. Across six modern LLM architectures evaluated on a dataset of N = 500 medical cases, our findings show that the StrictCitations strategy provides the strongest safeguards against unfaithful outputs, as it enforces explicit provenance and constrains models to verifiable, retrieved evidence.
Several key conclusions emerge from this work regarding the nature of clinical reliability. Command R and Qwen 2.5 achieved the most balanced performance, showing that models specifically optimized for RAG and long-context tasks can maintain higher faithfulness even under rigorous constraints. Furthermore, we empirically identified that verbosity serves as a practical uncertainty signal; models that substantially lengthen incorrect responses, particularly in the Expert-CoT variant (where lengths exceeded 350 words), exhibit a higher Unsupported Sentence Ratio (USR), masking uncertainty with elaborate but ungrounded explanations.
However, larger architectures such as Llama 3.1 and Gemma 2 27B remain vulnerable to Knowledge Override, where strong parametric priors dominate retrieved evidence despite explicit instructions. The strong correlation between citation consistency and answer correctness confirms that adherence to citation protocols, as measured by the D-RAG Evaluator (Deterministic RAG Evaluator), functions as an effective proxy for trustworthiness. Overall, these results indicate that controlling inference-time behavior through specialized prompting is as impactful as model selection. These findings support the PARROT AI project’s ambition to reduce administrative burden and improve patient service economics by implementing rigorously grounded RAG pipelines within the Polish healthcare system.

7. Limitations

Despite the promising results, several limitations must be acknowledged. While the study utilized a robust sample of 500 questions providing statistical significance for benchmarking, larger-scale evaluation across diverse medical specialties (e.g., oncology, rare diseases) is required for full clinical generalizability. Another constraint is the current focus on English-language benchmarks; performance in local languages like Polish remains to be fully explored, especially given the lower density of high-quality medical corpora compared to English.
Additionally, the RAG pipeline relied on a fixed set of medical textbooks, whereas real-world clinical settings require handling dynamic, unstructured, and often contradictory patient records. Finally, model faithfulness remains highly dependent on retrieval quality; the observed “Faithful Errors” indicate that even a perfectly grounded model will fail if the retrieval stage fails to deliver the specific diagnostic evidence required for multi-hop reasoning. Furthermore, the generation of unsupported content may be exacerbated by algorithmic bias and inequities in the pre-training data, which often under-represent certain demographics or rare clinical conditions. While our RAG approach mitigates this by enforcing external grounding, the underlying model priors remain a factor in Knowledge Override.

8. Future Work

Future research will focus on several key areas to enhance the PARROT AI ecosystem and its clinical utility [34]. A primary objective is multilingual adaptation, involving the development of Polish-specific medical embedding models and fine-tuning LLMs on localized clinical guidelines and legal frameworks. To address the observed “lost-in-reasoning” effect in complex prompts, we plan to explore hybrid architectures combining LLMs with Symbolic AI or Knowledge Graphs to enforce logical consistency.
The transition from benchmarking to User Acceptance Testing (UAT) in clinical environments will be crucial to evaluate the system’s impact on reducing the administrative burden in real-time. Furthermore, we intend to refine the D-RAG Evaluator to include semantic entailment analysis alongside TF-IDF n-grams, providing even more granular detection of subtle clinical deviations from source context. Lastly, implementing a multi-agent consensus framework could mitigate the Knowledge Override effect by allowing specialized models to cross-verify generated citations before they reach the medical professional.

Author Contributions

Conceptualization, L.P.; methodology, L.P.; software, L.P.; validation, L.P. and S.D.; formal analysis, L.P.; investigation, L.P.; resources, L.P. and S.D.; data curation, L.P.; writing—original draft preparation, L.P.; writing—review and editing, L.P. and S.D.; visualization, L.P.; supervision, L.P. and S.D.; project administration, L.P. and S.D.; funding acquisition, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Centre for Research and Development, Poland, as part of the INFOSTRATEG Strategic Program (Project: PARROT AI), Grant No. INFOSTRATEG4/0012/2022.

Data Availability Statement

The data analyzed in this study are publicly available. The MedQA dataset can be accessed via the official repository at https://github.com/jind11/MedQA (accessed on 25 January 2026).

Conflicts of Interest

Author Lukasz Pawlik was employed by Altar Sp. z o.o., 5 Różana St., 25-729 Kielce, Poland. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Hardware Specifications

The experiments were executed on a dedicated server environment with the following specifications:
  • Operating System: Ubuntu 24.04.3 LTS Server
  • Kernel: Linux 6.14.0-1015-nvidia
  • Graphics Card (GPU): NVIDIA Quadro RTX 6000 (24 GB GDDR6)
  • System Memory (RAM): 32 GB

Appendix B. Computational Environment and Library Versions

All experiments and analyses were conducted in a Python 3.12.3 environment to ensure consistency and reproducibility. To maintain transparency and allow for full replication of the workflow, particularly for the RAG (Retrieval-Augmented Generation) pipeline and LLM integration, the versions of the key scientific, machine learning, and natural language processing packages are summarized in Table A1.
Table A1. Versions of Python libraries used in the computational environment.
Package | Description | Version
numpy | Core numerical computations and array operations | v2.4.1
pandas | Data manipulation and tabular processing | v3.0.0
scikit-learn | Machine learning algorithms and evaluation metrics (TF-IDF, Cosine Similarity) | v1.8.0
torch | Deep learning framework | v2.10.0+cu128
faiss-cpu | Efficient similarity search and vector indexing | v1.13.2
sentence-transformers | Text embeddings for retrieval and semantic similarity | v5.2.0
flashrank | Lightweight reranker for retrieved document chunks | v0.2.10
langchain | Framework for LLM application orchestration | v1.2.6
langchain-ollama | Connector for locally hosted LLMs via Ollama | v1.0.1
loguru | Structured logging and runtime monitoring | v0.7.3

References

  1. Olson, K.D.; Meeker, D.; Troup, M.; Barker, T.D.; Nguyen, V.H.; Manders, J.B.; Stults, C.D.; Jones, V.G.; Shah, S.D.; Shah, T.; et al. Use of Ambient AI Scribes to Reduce Administrative Burden and Professional Burnout. JAMA Netw. Open 2025, 8, e2534976. [Google Scholar] [CrossRef] [PubMed]
  2. Samraik, M. AI Scribes Reduce Physician Burnout and Return Focus to the Patient; Yale School of Medicine: New Haven, CT, USA, 2025; Available online: https://medicine.yale.edu/news-article/ai-scribes-reduce-physician-burnout-return-focus-to-the-patient/ (accessed on 26 January 2026).
  3. Czerepak, M. AI Assistant for Doctors. Nicolaus Copernicus Superior School. 2025. Available online: https://www.sgmk.edu.pl/ai-assistant-for-doctors/ (accessed on 26 January 2026).
  4. Narodowa Agencja Wymiany Akademickiej (NAWA). Polish Scientists Develop AI-Powered Medical Assistant. Research in Poland. 2025. Available online: https://researchinpoland.org/news/polish-scientists-develop-ai-powered-medical-assistant/ (accessed on 26 January 2026).
  5. Li, Y.; Fu, X.; Verma, G.; Buitelaar, P.; Liu, M. Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems. arXiv 2025, arXiv:2510.24476. [Google Scholar] [CrossRef]
  6. Mohsin, M.A.; Umer, M.; Bilal, A.; Memon, Z.; Qadir, M.I.; Bhattacharya, S.; Rizwan, H.; Gorle, A.R.; Kazmi, M.Z.; Amir, N.; et al. On the Fundamental Limits of LLMs at Scale. arXiv 2025, arXiv:2511.12869. [Google Scholar] [CrossRef]
  7. Sharma, C. Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers. arXiv 2025, arXiv:2506.00054. [Google Scholar] [CrossRef]
  8. Zhu, H.; Fiondella, L.; Yuan, J.; Zeng, K.; Jiao, L. NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation of LLM via Genetic Optimization of External Knowledge. arXiv 2025, arXiv:2510.21144. [Google Scholar] [CrossRef]
  9. Jin, D.; Pan, E.; Oufattole, N.; Weng, W.H.; Fang, H.; Szolovits, P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Appl. Sci. 2021, 11, 6421. [Google Scholar] [CrossRef]
  10. Evaluating RAG with LLM as a Judge. Mistral AI Blog. Available online: https://mistral.ai/news/llm-as-rag-judge (accessed on 26 January 2026).
  11. Nguyen, T.A.H.; Nguyen, Q.D.; Nguyen, H.M.; Nguyen, A.H.; Nguyen, L.O.A.N. VMHQA: A Vietnamese Multi-choice Dataset for Mental Health Domain Question Answering. EAI Endorsed Trans. Scalable Inf. Syst. 2025, 12, 1–17. [Google Scholar] [CrossRef]
  12. Moëll, B.; Farestam, F.; Beskow, J. Swedish Medical LLM Benchmark. Front. Artif. Intell. 2025, 8, 1557920. [Google Scholar] [CrossRef]
  13. Alaa, A.; Hartvigsen, T.; Golchini, N.; Dutta, S.; Dean, F.; Raji, I.D.; Zack, T. Medical Large Language Model Benchmarks Should Prioritize Construct Validity. arXiv 2025, arXiv:2503.10694. [Google Scholar] [CrossRef]
  14. Wołk, K. Evaluating Retrieval-Augmented Generation Variants for Clinical Decision Support. Electronics 2025, 14, 4227. [Google Scholar] [CrossRef]
  15. Pawlik, L. LLM Selection and Vector Database Tuning: A Methodology for Enhancing RAG Systems. Appl. Sci. 2025, 15, 10886. [Google Scholar] [CrossRef]
  16. Wang, D.; Ye, J.; Li, J.; Liang, J.; Zhang, Q.; Hu, Q.; Pan, C.; Wang, D.; Liu, Z.; Shi, W.; et al. Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question Answering. JMIR Med. Educ. 2025, 11, e70190. [Google Scholar] [CrossRef] [PubMed]
  17. Xu, S.; Yan, Z.; Dai, C.; Wu, F. MEGA-RAG: A retrieval-augmented generation framework with multi-evidence guided answer refinement for mitigating hallucinations of LLMs in public health. Front. Public Health 2025, 13, 1635381. [Google Scholar] [CrossRef] [PubMed]
  18. Pawlik, L. How the Choice of LLM and Prompt Engineering Affects Chatbot Effectiveness. Electronics 2025, 14, 888. [Google Scholar] [CrossRef]
  19. Sim, S.Z.Y.; Chen, T. Critique of impure reason: Unveiling the reasoning behaviour of medical large language models. eLife 2025, 14, e106187. [Google Scholar] [CrossRef]
  20. Wang, W.; Ma, Z.; Ding, M.; Zheng, S.; Liu, S.; Liu, J.; Ji, J.; Chen, W.; Li, X.; Shen, L.; et al. Medical Reasoning in the Era of LLMs. arXiv 2025, arXiv:2508.00669. [Google Scholar] [CrossRef]
  21. Zhen, H.; Shi, Y.; Huang, Y.; Yang, J.J.; Liu, N. Leveraging Large Language Models with Chain-of-Thought and Prompt Engineering. Computers 2024, 13, 232. [Google Scholar] [CrossRef]
  22. Chow, J.C.L.; Li, K. Developing Effective Frameworks for Large Language Model–Based Medical Chatbots: Insights from Radiotherapy Education with ChatGPT. JMIR Cancer 2025, 11, e66633. [Google Scholar] [CrossRef]
  23. Chow, J.C.L.; Li, K. Large Language Models in Medical Chatbots: Opportunities, Challenges, and the Need to Address AI Risks. Information 2025, 16, 549. [Google Scholar] [CrossRef]
  24. Hota, S. Query Large PDF with Multi-LLMs, (RAG + FAISS) with Tuning (Top-K) for Semantic Results. Medium. Available online: https://medium.com/@sobhan.hota/query-large-pdf-with-multi-llms-rag-faiss-with-tuning-top-k-for-semantic-results-b6b1f80dcd6a (accessed on 26 January 2026).
  25. Teradata/Bge-Base-en-v1.5. Hugging Face. Available online: https://huggingface.co/Teradata/bge-base-en-v1.5 (accessed on 26 January 2026).
  26. Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2025, arXiv:2401.08281. [Google Scholar] [CrossRef]
  27. Gupta, P. AI Agent at Work. Medium. Available online: https://medium.com/@impuneetgupta/jira-as-a-knowledge-base-ai-agent-advanced-integration-for-enhanced-search-and-q-a-capabilities-6a811c272335 (accessed on 26 January 2026).
  28. Command R Model. Cohere Documentation. Available online: https://docs.cohere.com/docs/command-r (accessed on 26 January 2026).
  29. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2025, arXiv:2412.15115. [Google Scholar] [CrossRef]
  30. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  31. Gemma 2: Improving Open Language Models at a Practical Size. Google DeepMind Technical Report. Available online: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf (accessed on 26 January 2026).
  32. Mistral NeMo | Mistral AI. Mistral AI Blog. Available online: https://mistral.ai/news/mistral-nemo (accessed on 26 January 2026).
  33. Jiang, P.; Ouyang, S.; Jiao, Y.; Zhong, M.; Tian, R.; Han, J. A Survey on Retrieval And Structuring Augmented Generation with Large Language Models. In Proceedings of the 31st ACM SIGKDD Conference, Toronto, ON, Canada, 3–7 August 2025; pp. 6032–6042. [Google Scholar] [CrossRef]
  34. Projekt PARROT AI—Inteligentny Asystent Lekarza [PARROT AI Project—Intelligent Physician Assistant]. Politechnika Świętokrzyska. Available online: https://tu.kielce.pl/projekt-parrot-ai-inteligentny-asystent-lekarza/ (accessed on 26 January 2026).
Figure 1. The proposed clinical RAG pipeline: from MedQA query to deterministic D-RAG Evaluator assessment.
Figure 2. Comparison of Accuracy across models and prompting variants.
Figure 3. Pairwise comparison of model accuracy using McNemar’s test. Values represent p-values; results where p < 0.05 are considered statistically significant. The label n.s. denotes comparisons that did not reach this significance threshold, indicating high performance stability between those specific variants.
Figure 4. Correlation between Faithfulness and Accuracy across all runs.
Figure 5. Heatmap of the Unsupported Sentence Ratio (USR) across models and strategies.
Figure 6. Correlation between response verbosity (word count) and the Unsupported Sentence Ratio (USR). The shadow represents the 95% confidence interval for the regression line.
Figure 7. Multi-dimensional comparison of model capabilities (StrictCitations variant).
Figure 8. Verbosity Ratio: Word count comparison between incorrect and correct answers.
Table 1. Technical specifications of the evaluated Large Language Models.
Model Name | Developer | Architecture | Params (B)
Command R | Cohere | Transformer (Auto-regressive) | 35
Gemma 2 9B | Google | Transformer (Decoder-only) | 9
Gemma 2 27B | Google | Transformer (Decoder-only) | 27
Llama 3.1 | Meta | Transformer (Decoder-only) | 8
Mistral Nemo | Mistral AI/NVIDIA | Transformer (Decoder-only) | 12
Qwen 2.5 | Alibaba Cloud | Transformer (Decoder-only) | 14
Table 2. Comparison of System and User instructions for the evaluated strategies.
Strategy | System Role | Core User Instruction
Baseline | Medical Expert | Provide the correct letter and end with “Answer: X”.
Neutral | Medical Assistant | Use the provided context to select the best option.
StrictCitations | Clinical Assistant | Answer only using context. Cite source IDs as [docX#chunkY].
Expert-CoT | Expert Physician Specialist | Solve via steps: Evidence, Pathophysiology, Elimination, and Double-check.
Table 3. Hardware requirements and inference latency for clinical RAG execution (estimated for N = 500 ).
Model | Params | VRAM (est.) | Avg. Latency | Total Time (est.) | Suitability
Mistral Nemo | 12B | ∼8.2 GB | 28.50 s | 03:57:30 | High
Llama 3.1 | 8B | ∼5.5 GB | 31.57 s | 04:23:05 | High
Gemma 2 9B | 9B | ∼6.8 GB | 32.26 s | 04:28:50 | High
Qwen 2.5 | 14B | ∼10.5 GB | 43.45 s | 06:02:05 | Moderate
Command R | 35B | ∼21.5 GB | 47.06 s | 06:32:10 | Moderate
Gemma 2 27B | 27B | ∼18.5 GB | 50.68 s | 07:02:20 | Moderate
Note: Total execution time estimated for N = 500 queries based on measured per-query averages on an NVIDIA Quadro RTX 6000 (24 GB).
Table 4. Comparison of Accuracy and Faithfulness metrics ( N = 500 ).
Model | Acc. Baseline | Acc. Neutral | Acc. StrictCit. | Acc. Expert-CoT | Faith. Baseline | Faith. Neutral | Faith. StrictCit. | Faith. Expert-CoT
Command R | 50.8 | 53.6 | 56.6 | 40.6 | 0.0 | 18.2 | 14.3 | 21.6
Gemma 2 9B | 48.6 | 57.2 | 55.0 | 27.2 | 0.0 | 16.1 | 21.0 | 17.6
Gemma 2 27B | 22.8 | 64.0 | 59.4 | 38.0 | 0.0 | 16.8 | 23.8 | 20.4
Llama 3.1 | 61.6 | 53.8 | 57.6 | 22.6 | 0.0 | 18.7 | 20.8 | 21.4
Mistral Nemo | 54.0 | 50.4 | 53.6 | 36.8 | 0.0 | 12.7 | 7.5 | 20.4
Qwen 2.5 | 67.8 | 65.2 | 64.2 | 60.0 | 0.0 | 20.0 | 23.5 | 22.1
Note: Acc. = Accuracy (%); Faith. = Faithfulness (%), each reported per prompting variant.
Table 5. Knowledge Override (OVR) counts per strategy (Correct answer with low Faithfulness, N = 500 ).
Model | Neutral | StrictCitations | Expert-CoT
Command R | 214 | 253 | 130
Gemma 2 9B | 249 | 207 | 113
Gemma 2 27B | 272 | 198 | 131
Llama 3.1 | 213 | 209 | 83
Mistral Nemo | 223 | 268 | 138
Qwen 2.5 | 258 | 240 | 206
Table 6. Detailed performance metrics for evaluated models and prompting variants ( N = 500 ).
Model | Variant | Acc. (%) | Faith. (%) | CiteRate (%) | USR (%) | OVR (N)
Command R | StrictCit. | 56.6 | 14.3 | 61.2 | 91.0 | 253
Command R | Expert-CoT | 40.6 | 21.6 | 0.0 | 94.9 | 130
Gemma 2 9B | StrictCit. | 55.0 | 21.0 | 96.2 | 82.0 | 207
Gemma 2 9B | Expert-CoT | 27.2 | 17.6 | 0.0 | 98.3 | 113
Gemma 2 27B | StrictCit. | 59.4 | 23.8 | 100.0 | 75.5 | 198
Gemma 2 27B | Expert-CoT | 38.0 | 20.4 | 0.0 | 97.0 | 131
Llama 3.1 | StrictCit. | 57.6 | 20.8 | 93.6 | 88.3 | 209
Llama 3.1 | Expert-CoT | 22.6 | 21.4 | 0.0 | 95.6 | 83
Mistral Nemo | StrictCit. | 53.6 | 7.5 | 13.0 | 100.0 | 268
Mistral Nemo | Expert-CoT | 36.8 | 20.4 | 0.0 | 96.4 | 138
Qwen 2.5 | StrictCit. | 64.2 | 20.0 | 73.8 | 87.1 | 240
Qwen 2.5 | Expert-CoT | 60.0 | 22.1 | 0.0 | 95.0 | 206
Table 7. Impact of retrieval and citation presence on faithfulness scores (StrictCitations variant).
Model | Avg Faithfulness (%) | With Citation (%) | Without Citation (%)
Command R | 14.3 | 21.1 | 3.6
Gemma 2 27B | 23.8 | 23.8 | 0.0
Qwen 2.5 | 20.0 | 23.9 | 1.8