1. Introduction
Clinical decision support systems are motivated not only by the potential to improve care quality, but also by the increasing workload pressures on healthcare providers [
1]. For example, a national survey found that 62.8% of U.S. physicians reported experiencing burnout in 2021, up from 38.2% in 2020, with documentation burden and excessive clerical tasks cited as key contributors [
2]. Similarly, physicians spend nearly 50% of their workday interacting with the electronic health record (EHR), including both direct use and clerical work, compared with only 27% of their time spent on direct patient care [
3]. This imbalance contributes to inefficiency, dissatisfaction, and medical errors. Other studies have estimated that U.S. physicians spend an average of 1.84 h per day outside clinic hours on EHR tasks ("pajama time"), further compounding fatigue [
4].
While non-AI-based approaches (e.g., rule-based decision trees, keyword retrieval, or static knowledge bases) can address certain narrow clinical support tasks, they are brittle and difficult to scale for the highly variable, context-dependent nature of clinical dialogue. Clinical conversations involve unstructured language, diverse terminology, and context-dependent reasoning, which traditional systems often fail to capture [
5,
6]. Prior reviews have shown that such approaches are prone to errors when applied to complex medical narratives and are insufficient for robust decision support [
7,
8]. Moreover, the complexity and heterogeneity of real-world clinical data—including free-text notes, structured fields, and event logs—further limit the utility of static or rule-based solutions [
9]. Given this context, AI-driven tools that can reduce the burden of information retrieval, documentation, and knowledge synthesis hold promise for alleviating workload pressures and supporting more efficient, patient-centered care.
AI-driven approaches enable systems to dynamically retrieve relevant biomedical evidence and generate contextually appropriate responses. This approach has been shown to improve factual grounding and performance on knowledge-intensive tasks compared to generative-only models [
10]. Such adaptability is particularly important in healthcare, where the knowledge base is vast and continuously evolving—for example, during the COVID-19 pandemic, the volume of biomedical publications grew at an unprecedented pace, underscoring the need for retrieval-based systems that can keep up with rapidly changing evidence [
11]. Together, these factors suggest that AI is not only beneficial but necessary for building flexible, scalable, and clinically meaningful dialogue assistants, since non-AI methods remain too rigid for the complexity of real-world medical consultation.
Natural language processing (NLP) has a long history in the clinical domain, with early work focusing on information extraction from EHR text and its use in clinical decision support [
7,
8,
12]. More recently, transformer-based models have been adapted to healthcare, including ClinicalBERT and BioBERT variants trained on EHR notes and biomedical corpora, enabling contextual embeddings tailored to medical applications [
13,
14]. Large-scale resources such as MIMIC-III [
9] have played a key role in enabling such advances. At the same time, the adoption of AI in healthcare has raised significant concerns about privacy and security, leading to work on frameworks such as k-anonymity [
15], regulatory perspectives on medical big data [
16], and technical solutions such as federated learning for privacy-preserving clinical AI [
17].
The emergence of large language models (LLMs) has transformed NLP, particularly in domains requiring nuanced understanding, such as healthcare. However, despite their power, LLMs suffer from hallucinations and lack of grounding in up-to-date domain-specific information. Furthermore, healthcare professionals are increasingly confronted with complex clinical and administrative demands that require timely access to accurate, contextually relevant information. The proliferation of unstructured medical data, largely driven by the widespread adoption of Electronic Health Records (EHRs), has intensified the cognitive load on clinicians. As a result, physicians often face challenges in efficiently retrieving actionable insights, leading to increased documentation workload, diminished patient interaction, and elevated rates of burnout.
To alleviate these challenges, artificial intelligence (AI) has emerged as a transformative force in healthcare, offering innovative solutions to streamline clinical workflows and enhance decision making. Among various AI paradigms, Retrieval-Augmented Generation (RAG) stands out as a particularly promising approach. RAG addresses these challenges by integrating an external knowledge retriever with a generative model, enabling factually accurate and context-sensitive responses, an especially valuable capability in clinical environments. For instance, early applications of RAG in healthcare have focused on clinical question answering and decision support, demonstrating that RAG-based models can outperform traditional neural QA systems on biomedical benchmarks such as BioASQ, particularly when dealing with rare diseases and complex treatments.
RAG systems combine two core components: a retrieval mechanism that identifies semantically relevant information from a database, and a large language model (LLM) that generates context-aware responses based on the retrieved data. This architecture enables the generation of fluent, factually grounded answers that are both informative and contextually appropriate. In this work, we present a RAG-based medical AI assistant called ‘RAGMed’, designed to support healthcare workflows by automating routine yet critical administrative tasks, including the following:
Responding to frequently asked patient medical questions;
Facilitating patient appointment scheduling;
Summarizing clinical notes for healthcare providers.
Through the integration of semantic search with generative reasoning, RAGMed offers a scalable and interpretable solution for reducing the administrative workload in healthcare. This integration supports the development of more efficient and trustworthy AI-driven tools, enhancing provider workflows and patient engagement in digital health ecosystems. In addition to improving the accessibility and clarity of clinical information, the system fosters greater trust in AI-assisted decision making by delivering responsive and interpretable healthcare solutions. The AI-driven tools employed in our proposed system, RAGMed, are discussed in detail below.
1.1. Objective
The primary objective of our study is to examine how the choice of embedding model influences the informativeness, reliability, and clinical value of RAG-generated responses. To address this objective, the RAGMed system leverages the Pinecone vector database to store and retrieve dense medical embeddings and utilizes the LLaMA3-8B-8192 large language model to generate high-quality responses. To investigate the impact of retrieval quality on system performance, we compare two embedding models—gte-large and all-MiniLM-L6-v2—using 18 queries. In pursuing this objective, our work advances the field through several key contributions, outlined in the next subsection.
1.2. Contributions
This work makes the following key contributions:
Application of RAG in healthcare dialogue and workflows: We demonstrate a retrieval-augmented assistant in a safety-critical setting, focusing on factual grounding for patient FAQs, administrative support, and clinical documentation.
Prototype implementation: We present a working system that integrates medical question answering, natural-language appointment scheduling, and clinical note summarization, illustrating potential to reduce documentation and administrative burden.
Clinically meaningful evaluation framework: We adopt the RAG-Triad metrics (answer relevance, context relevance, groundedness), moving beyond surface-overlap metrics like BLEU/ROUGE to assess grounding quality.
The remainder of this paper is organized as follows.
Section 2 provides background on large language models, embedding models, vector databases, and similarity search.
Section 3 introduces the design of the proposed RAGMed system, including its novel architecture, datasets, and RAG-Triad evaluation framework.
Section 4 outlines the methodology, while
Section 5 presents experimental results and a comparison of embedding models using the RAG-Triad metrics.
Section 6 reviews related work, and
Section 7 discusses the contributions and limitations of our approach. Finally,
Section 8 concludes the paper and suggests directions for future research.
2. Background
2.1. Large Language Model
To support high-quality response generation within our RAG-based system, we selected LLaMA3-8B-8192, a state-of-the-art large language model developed by Meta. This model offers a balanced combination of advanced reasoning capabilities, low inference latency, and a significantly extended context window, up to 8192 tokens. These characteristics make it highly suitable for clinical applications that demand both real-time responsiveness and the ability to process complex, long-form medical text. LLaMA3-8B-8192 is particularly well-suited for tasks that require integrating and synthesizing detailed contextual information, such as interpreting nuanced patient histories or explaining complex treatment options. Its relatively efficient model size, when compared to larger-scale LLMs, also supports deployment in latency-sensitive environments like digital health assistants, where rapid turnaround and cost-efficiency are essential. Consequently, LLaMA3-8B serves as an ideal backbone for generating contextually grounded, medically accurate responses in our RAG architecture.
2.2. Embedding Models
A fundamental component of the RAG pipeline is the generation of high-quality vector embeddings, which translate unstructured textual data into a numerical format amenable to semantic similarity search. This step allows our system to bridge the linguistic gap between user-submitted patient queries and physician-authored clinical observations by encoding both as dense vectors in the same semantic space. We evaluated two pretrained sentence embedding models for this purpose:
all-MiniLM-L6-v2: A lightweight and high-speed model that produces 384-dimensional embeddings. It is optimized for fast inference and performs well on tasks such as semantic search, clustering, and sentence similarity. Its compact size and efficiency make it particularly advantageous for real-time clinical applications and shorter user queries.
GTE-Large: A more expressive and higher-capacity model that generates 1024-dimensional embeddings. It is designed for general-purpose semantic tasks that demand a deeper understanding of linguistic nuance. GTE-Large excels in capturing subtle similarities across longer and more complex input texts, making it well-suited for detailed medical questions that require advanced contextual reasoning.
Both models were implemented using industry-standard libraries: the SentenceTransformers package for MiniLM and HuggingFace Transformers for GTE-Large. A unified preprocessing pipeline was applied to both models to ensure consistency in comparative evaluation. Each patient query and clinical observation was converted into a fixed-length vector, regardless of input size, thus enabling scalable semantic indexing and retrieval.
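For illustration, a minimal encoding sketch under these assumptions is shown below. For brevity, both checkpoints are loaded through the SentenceTransformers interface (the Hugging Face hub identifiers and the normalization setting are assumptions; the exact preprocessing pipeline is not reproduced here).

```python
# Minimal sketch of the dual-encoder setup described above (illustrative, not the
# exact production pipeline).
from sentence_transformers import SentenceTransformer

minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim embeddings
gte_large = SentenceTransformer("thenlper/gte-large")                   # 1024-dim embeddings

texts = [
    "Patient reports shortness of breath when climbing stairs.",
    "How should I manage my asthma during allergy season?",
]

# Each text becomes a fixed-length dense vector regardless of its original length.
vecs_minilm = minilm.encode(texts, normalize_embeddings=True)
vecs_gte = gte_large.encode(texts, normalize_embeddings=True)

print(vecs_minilm.shape, vecs_gte.shape)  # (2, 384) (2, 1024)
```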
2.3. Vector Database and Indexing
Following embedding generation, we implemented semantic indexing to efficiently organize and access the vector representations of physician observations. Given the diversity and complexity of the language used in medical documentation, a robust and flexible indexing system is critical for high-performance retrieval. To this end, we integrated Pinecone, a cloud-native vector database optimized for large-scale, real-time similarity search. Pinecone enables the efficient storage and retrieval of high-dimensional embeddings using Hierarchical Navigable Small World (HNSW) indexing, an approximate nearest neighbor (ANN) algorithm known for its high recall and low latency. This architecture supports scalable retrieval without the need to scan the entire dataset, significantly improving search efficiency. Upon receiving a patient query, the system uses the corresponding embedding to retrieve the top-k most semantically relevant physician records in milliseconds. These documents are then supplied to the large language model as contextual input for grounded response generation. Pinecone’s elastic architecture also supports the seamless scaling of our system to accommodate increasing volumes of medical data over time. Each embedding model maintains a dedicated Pinecone index, enabling independent performance evaluation and retrieval pipelines (e.g., PC-index1 for MiniLM and PC-index2 for GTE-Large).
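A minimal sketch of this indexing and retrieval flow, using the Pinecone Python client, is given below. The serverless cloud/region settings, the lowercase index name (mirroring PC-index2 for GTE-Large), and the metadata layout are illustrative assumptions rather than the exact production configuration.

```python
# Sketch of index creation, upsert, and top-k retrieval with Pinecone (illustrative).
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key="YOUR_API_KEY")
encoder = SentenceTransformer("thenlper/gte-large")  # 1024-dim embeddings

# Create the GTE-Large index once; the dimension must match the embedding size.
pc.create_index(
    name="pc-index2",
    dimension=1024,
    metric="cosine",  # cosine similarity, as described in Section 2.4
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("pc-index2")

# Upsert a physician observation, keeping the raw text as metadata for later grounding.
obs_text = "Patient with persistent cough; advised inhaled corticosteroid and follow-up in 2 weeks."
index.upsert(vectors=[{
    "id": "obs-001",
    "values": encoder.encode(obs_text).tolist(),
    "metadata": {"text": obs_text},
}])

# Retrieve the top-5 most similar observations for a patient query.
query_vec = encoder.encode("How do I manage a lingering cough with asthma?").tolist()
results = index.query(vector=query_vec, top_k=5, include_metadata=True)
for match in results.matches:
    print(match.score, match.metadata["text"])
```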
2.4. Similarity Search
Semantic similarity search lies at the core of our system’s retrieval process. This mechanism ensures that the most contextually relevant physician observations are retrieved from a large and heterogeneous dataset, enabling accurate and grounded response generation. To perform similarity matching, we employed cosine similarity, a widely used metric that calculates the cosine of the angle between two vectors to quantify their semantic alignment. Unlike traditional keyword-based retrieval methods, cosine similarity assesses conceptual proximity rather than literal term overlap, which is an essential advantage in clinical settings where users may phrase the same problem in diverse and often non-standardized ways.
When a patient submits a question, the system encodes it into a dense vector using the selected embedding model. This query vector is then compared with precomputed physician observation embeddings in the vector database. Cosine similarity scores are calculated for each embedding pair, and the system retrieves the top-k results with the highest scores. This method ensures that the retrieved context is semantically aligned with the intent of the query, even when the terminology, phrasing, or syntax differ significantly. By focusing on the meaning behind the words rather than their exact matches, cosine similarity improves the precision and relevance of the input provided to the LLM. This, in turn, results in higher-quality, trustworthy, and informative responses that align with both clinical expectations and user needs.
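As a concrete illustration, the brute-force version of this scoring step can be written in a few lines (a sketch only; the deployed system delegates this search to Pinecone's approximate HNSW index rather than scanning every stored vector):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Indices of the k stored observation embeddings most similar to the query."""
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-scores)[:k].tolist()
```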
2.5. Model Evaluation
To evaluate retrieval quality and the correctness of system outputs, we adopted the RAG-Triad framework. Unlike surface-level metrics such as BLEU, ROUGE, or F1, RAG-Triad emphasizes clinically meaningful aspects of performance, including retrieval utility, factual grounding, and answer relevance. This makes it well suited for healthcare dialogue, where reliability and safety are paramount. A detailed description of the framework and its three evaluation dimensions is provided in
Section 3.3.
5. Experiments and Results
For the evaluation, we use a benchmark of 18 real-world medical queries representative of typical patient and clinician interactions. Each query is processed using both embedding models—gte-large and all-MiniLM-L6-v2—to investigate how embedding dimensionality and semantic richness affect system performance across the three dimensions. The comparative results provide insight into how embedding model selection influences the relevance, accuracy, factual grounding, informativeness, reliability, and clinical value of RAG-generated responses. Each embedding model is associated with its own vector index, which contains approximately 45,000 doctor observations encoded by that model. Every input patient query is converted into a numerical embedding by both models, and the top-five most relevant observations are retrieved using the cosine similarity metric. Based on prior research findings, we selected cosine similarity because it has performed better than alternatives such as Euclidean distance and dot product, specifically in healthcare-related tasks.
These retrieved top-five observations are then passed as context to the LLM (llama3-8b-8192). Using prompt engineering, the LLM is instructed to generate a summarized response with a fixed temperature of 0. Using the RAG-Triad evaluation framework, we evaluated each model’s pipeline on three metrics: context relevance, answer relevance, and groundedness. This configuration enabled us to examine the impact of retrieval quality on the accuracy, clarity, and reliability of the final responses. To assess the performance and efficiency of the proposed system, we conducted experiments across three representative healthcare tasks—(1) medical question answering, (2) appointment scheduling, and (3) clinical note summarization—as detailed in
Section 5.3,
Section 5.4 and
Section 5.5. To further clarify our experimental design, we next describe the rationale behind our choice of the temperature parameter (
Section 5.1) and outline the evaluation metrics used in our study (
Section 5.2).
5.1. Temperature Parameter Choice
For all the experiments, the temperature parameter of the language model was set to 0, with the goal of minimizing randomness and encouraging more deterministic, reproducible outputs. While temperature = 0 does not guarantee strict determinism (since factors such as non-deterministic GPU operations can still introduce variation), it significantly reduces variability compared to higher values.
We selected this setting to ensure consistency in the comparative evaluation across embedding models, so that differences in output quality could be attributed primarily to retrieval and grounding rather than sampling noise. Although other temperature values (e.g., 0.3, 0.7) were briefly tested during exploratory runs and yielded more diverse but less stable responses, they were not adopted for formal evaluation. Future work may explore how controlled diversity in generation (via non-zero temperatures or nucleus sampling) could influence answer quality, especially in patient-facing scenarios where nuanced expression is beneficial.
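For reference, a minimal sketch of the grounded generation call with temperature fixed at 0 is shown below. It assumes an OpenAI-compatible endpoint serving llama3-8b-8192 (e.g., Groq); the prompt wording and variable names are illustrative rather than the exact system prompt used by RAGMed.

```python
# Sketch of the grounded generation step with temperature fixed at 0 (illustrative).
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_API_KEY")

retrieved_chunks = ["<top-5 physician observations from the vector index>"]
patient_query = "How should I manage my asthma during allergy season?"

prompt = (
    "You are a medical assistant. Answer the patient's question using ONLY the "
    "context below. If the context is insufficient, say so.\n\n"
    "Context:\n" + "\n---\n".join(retrieved_chunks) +
    f"\n\nQuestion: {patient_query}\nAnswer:"
)

response = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # minimizes sampling randomness for reproducible comparisons
)
print(response.choices[0].message.content)
```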
5.2. Evaluation Metrics
We evaluated the system using the RAG-Triad framework, which provides three complementary dimensions essential for assessing retrieval-augmented systems in healthcare: Answer Relevance (whether the generated response directly addresses the clinical query), Context Relevance (whether the retrieved passages are genuinely useful for answering the query), and Groundedness (whether the claims in the generated answer are supported by the retrieved context). Unlike n-gram–based measures such as BLEU, ROUGE, and F1, which emphasize surface text overlap, RAG-Triad captures whether the system is both retrieving the right medical knowledge and grounding its responses appropriately, an essential consideration for clinical decision support. Because this study was designed as a proof of concept whose aim was to highlight the ability of the system to retrieve relevant medical knowledge and generate grounded, clinically appropriate responses, we did not compute additional statistical measures (e.g., confidence intervals, significance testing) and instead focused on retrieval and grounding quality as clinically meaningful indicators of system performance.
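To make the three dimensions concrete, the sketch below shows one simple way to score them with an LLM judge. The prompt wording, the 0-1 scale, and the `judge` callable are illustrative assumptions; the actual RAG-Triad implementation may phrase and aggregate these checks differently.

```python
# Illustrative LLM-as-judge scoring of the three RAG-Triad dimensions (a sketch,
# not the exact evaluation prompts used in this study).
JUDGE_PROMPTS = {
    "context_relevance": "Rate from 0 to 1 how relevant the retrieved context is to the question.",
    "groundedness": "Rate from 0 to 1 how well every claim in the answer is supported by the context.",
    "answer_relevance": "Rate from 0 to 1 how directly the answer addresses the question.",
}

def rag_triad_scores(question: str, context: str, answer: str, judge) -> dict[str, float]:
    """Ask a judge LLM for one score per dimension; `judge` returns a float in [0, 1]."""
    scores = {}
    for name, instruction in JUDGE_PROMPTS.items():
        prompt = (
            f"{instruction}\n\nQuestion: {question}\n\n"
            f"Context: {context}\n\nAnswer: {answer}\n\nScore:"
        )
        scores[name] = judge(prompt)
    return scores
```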
5.3. Medical Question Answering
We tested both embedding models across 18 real-world medical queries. Below, we present a few of those examples to demonstrate the key differences.
Figure 4 and
Table 1 illustrate the output generated by the RAGMed system in response to Query-1. gte-large retrieved chunks more specific to asthma and allergies, such as advice on inhaler use, keeping indoor air clean, and getting the appropriate vaccines. Meanwhile, all-MiniLM-L6-v2 mostly repeated general COVID-19 precautions without taking the user’s medical history into account. As a result, gte-large produced a more tailored and helpful response, offering detailed instructions that someone with asthma could follow.
The results for Query-2 on the RAGMed system are presented in
Figure 5 and
Table 2. The final response of gte-large was more thorough and reliable since it extracted chunks that included specific lifestyle changes (such as the Mediterranean diet, sleep hygiene, and stress reduction) and referenced an actual medication (Norvasc). Conversely, all-MiniLM-L6-v2 provided broad, unstructured guidance and cited questionable blood pressure limits without providing context. As a result, gte-large generated a more structured and medically appropriate answer, providing the user with clear guidance to follow.
Table 2. RAG-TRIAD scores for Query-2.
| Embedding Model | Answer Relevance | Context Relevance | Groundedness |
|---|---|---|---|
| GTE-Large | 0.67 | 0.67 | 0.43 |
| all-MiniLM-L6-v2 | 0.58 | 0.63 | 0.25 |
Figure 5. RAGMed output for Query-2.
Figure 6 and
Table 3 showcase the RAGMed system’s response to Query-3. all-MiniLM-L6-v2 retrieved more chunks that mentioned specific medication names, but it did not clarify whether those drugs are safe during pregnancy, making it risky in this context. In contrast, the gte-large chunks were more cautious, mentioning antibiotics recommended by doctors and advising against using ibuprofen without a prescription. The final response from gte-large was less detailed but safer, better grounded, and more suitable for a medically sensitive query.
Across all 18 queries, the gte-large embedding model consistently produced better results than the all-MiniLM-L6-v2 model. On average, GTE-Large scored 0.72 in answer relevance, 0.70 in context relevance, and 0.47 in groundedness, reflecting its ability to retrieve meaningful and clinically useful content. In comparison, all-MiniLM-L6-v2 scored lower in each category, with 0.53 for answer relevance, 0.61 for context relevance, and 0.31 for groundedness, indicating that it captured the general topic but often lacked the depth and clarity needed for high-quality responses. These results highlight the benefit of using semantically richer embeddings such as GTE-Large for retrieval-augmented generation tasks in clinical applications.
Table 4 summarizes the average scores across all 18 queries:
Table 3. RAG-TRIAD scores for Query-3.
| Embedding Model | Answer Relevance | Context Relevance | Groundedness |
|---|---|---|---|
| GTE-Large | 0.78 | 0.67 | 0.5 |
| all-MiniLM-L6-v2 | 0.56 | 0.5 | 0.19 |
Figure 6. RAGMed output for Query-3.
Table 4. Average RAG-TRIAD scores across all 18 queries.
| Embedding Model | Answer Relevance | Context Relevance | Groundedness |
|---|---|---|---|
| GTE-Large | 0.72 | 0.70 | 0.47 |
| all-MiniLM-L6-v2 | 0.53 | 0.61 | 0.31 |
5.3.1. Quantitative Evaluation
GTE-Large consistently outperformed all-MiniLM-L6-v2 across all dimensions. It demonstrated a stronger capacity to retrieve semantically rich and highly aligned content, which in turn enabled the language model to produce more informative and clinically sound responses. In contrast, while all-MiniLM-L6-v2 offered lower latency and faster inference, it occasionally retrieved context that was overly generic or tangential to the query’s intent.
5.3.2. Qualitative Observations
In addition to quantitative scoring, qualitative analysis revealed that
GTE-Large captured subtle semantic nuances better, particularly in queries involving multi-step reasoning or complex medical terminology.
MiniLM, while efficient, often missed specific contextual cues in longer or compound queries, leading to less precise responses.
Responses generated with GTE-Large context were more likely to cite specific clinical actions, findings, or terminology from the retrieved documents, thereby enhancing traceability and user trust.
5.3.3. Implications
These results underscore the critical role that embedding model selection plays in the performance of RAG-based systems for healthcare. While lightweight models like MiniLM are attractive for real-time systems due to their speed, high-dimensional embeddings from models like GTE-Large offer superior retrieval quality, which directly translates into better downstream response generation. The findings also validate the utility of the RAG-Triad framework as a multidimensional evaluation approach that captures not only linguistic fluency but also factual fidelity and contextual alignment—key concerns in clinical AI systems.
5.4. Appointment Scheduling
While traditional software systems can monitor and manage a physician’s calendar, they typically require structured inputs through predefined forms or drop-down menus. In contrast, our LLM-enabled assistant allows patients (and clinicians) to book or inquire about appointments by using simple English sentences. The system interprets this free-text request, extracts relevant details (physician, timeframe, reason), checks availability, and presents scheduling options. This offers several benefits over traditional systems, such as (1) improved accessibility, since patients who may not be tech-savvy can interact naturally without learning rigid interfaces, and (2) reduced administrative burden, as natural language scheduling decreases the need for staff to translate patient requests into structured calendar entries. By enabling natural, context-aware interaction, RAGMed goes beyond static rule-based tools and provides a more flexible, user-friendly scheduling experience.
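As an illustration of how such a free-text request can be mapped to structured scheduling fields before the calendar check, a hedged sketch is given below. The JSON schema, prompt wording, and the `llm` callable are illustrative assumptions, not the exact production prompt.

```python
# Sketch: turn a free-text scheduling request into structured fields (illustrative).
import json

EXTRACTION_PROMPT = """Extract the scheduling details from the patient's request.
Return JSON with keys: physician, timeframe, reason. Use null when a field is missing.

Request: {request}
JSON:"""

def parse_scheduling_request(request: str, llm) -> dict:
    """`llm` is any callable that sends the prompt to llama3-8b-8192 and returns its text output."""
    raw = llm(EXTRACTION_PROMPT.format(request=request))
    return json.loads(raw)

# Example call (hypothetical request and output):
# parse_scheduling_request("Can I see Dr. Rao next week about my knee pain?", llm)
# -> {"physician": "Dr. Rao", "timeframe": "next week", "reason": "knee pain follow-up"}
```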
To validate the appointment scheduling functionality, we simulated a 3-month dataset of structured appointment records. The system was tested on several practical queries, and we demonstrate the results for seven of them below.
Figure 7,
Figure 8,
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13 present the RAGMed output for those queries.
The RAGMed system was able to analyze existing appointments, identify open slots during business hours (Monday–Friday, 9 a.m.–5 p.m.), and respond accordingly. It accurately distinguished between booked and available time slots, and it correctly interpreted temporal expressions such as “this week” or “next week” using prompt-based logic. Although the system consistently generated accurate scheduling responses across most tested scenarios, demonstrating its potential for real-time deployment, occasional hallucinations were observed, indicating a need for further refinement.
Figure 7. RAGMed output for Query-4.
Figure 8. RAGMed output for Query-5.
Figure 9. RAGMed output for Query-6.
Figure 10. RAGMed output for Query-7.
5.5. Clinical Note Summarization
To evaluate RAGMed’s ability to summarize structured clinical case studies, we implemented a functionality that allows physicians to upload a clinical case file and receive a clear bullet-point summary. The content is extracted from the file and passed to llama3-8b-8192 with a structured prompt designed for clinical summarization. The prompt instructs the model to organize the summary using bullet points and section headings, mimicking physician-style documentation.
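An illustrative version of such a structured prompt is shown below; the exact headings, wording, and the `llm` callable are assumptions, and the prompt actually used by RAGMed may differ.

```python
# Sketch of a structured clinical summarization prompt (illustrative).
SUMMARY_PROMPT = """You are assisting a physician. Summarize the clinical case below as
concise bullet points under these headings: Presenting Complaint, Medical History,
Examination Findings, Assessment, Plan. Use only information present in the case.

Clinical case:
{case_text}

Summary:"""

def summarize_case(case_text: str, llm) -> str:
    """`llm` sends the prompt to llama3-8b-8192 at temperature 0 and returns its text output."""
    return llm(SUMMARY_PROMPT.format(case_text=case_text))
```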
Figure 14,
Figure 15 and
Figure 16 illustrate the system’s output for three selected clinical cases.
Figure 11. RAGMed output for Query-8.
Figure 12. RAGMed output for Query-9.
Figure 13. RAGMed output for Query-10.
The RAGMed system consistently produced well-structured, clinically trustworthy summaries that aligned with standard clinical documentation practices. It captured key sections such as presenting complaints, medical history, and examination findings, resulting in outputs that physicians can review quickly and integrate into electronic health records. Importantly, the model demonstrated the ability to process long-form clinical input and distill it into concise, physician-friendly summaries. By grounding its outputs in retrieved context, RAGMed reduces the risk of unsupported statements or omissions, highlighting its potential as a practical tool for documentation assistance within real-world clinical workflows.
Figure 14. RAGMed output for Clinical Case-1.
Figure 15. RAGMed output for Clinical Case-2.
Figure 16. RAGMed output for Clinical Case-3.
6. Related Work
Retrieval-Augmented Generation (RAG) has recently emerged as a promising approach to enhance the performance and reliability of generative models, particularly in domains requiring access to large and dynamic knowledge bases, such as healthcare. Unlike conventional large language models (LLMs) that rely solely on their internal parameters, RAG combines generative models with external retrieval systems, allowing the model to ground its outputs in up-to-date and domain-specific information [
10]. This feature is particularly critical in healthcare, where accuracy, explainability, and current knowledge are essential for clinical reasoning, patient education, and biomedical research. RAG helps overcome key limitations of LLMs in the medical domain—such as outdated clinical knowledge, hallucinations, and lack of transparency—by grounding responses in evidence-based external sources. The study in [
18] highlights how RAG improves the reliability of LLMs in healthcare tasks and surveys various RAG techniques (Naive, Advanced, Modular) applied to medical datasets.
Ref. [
19] introduced i-MedRAG, an iterative RAG approach where LLMs generate and refine follow-up queries across multiple rounds to build deeper understanding. This method significantly improves performance on challenging medical benchmarks like USMLE and MedQA, outperforming existing prompt engineering and fine-tuning strategies, and demonstrates strong potential for advancing medical question answering. The manuscript [
20] introduces MedSummRAG, a RAG framework tailored for medical text summarization, addressing limitations of large language models in domain-specific understanding. By integrating a fine-tuned dense retriever trained with contrastive learning, it enhances summary quality using relevant external knowledge. Experiments show notable ROUGE improvements across various settings.
Article [
21] presents CLI-RAG, a clinically informed retrieval-augmented generation framework for structured clinical text generation from unstructured EHR data. It introduces hierarchical chunking and dual-stage retrieval to handle the complexity and heterogeneity of clinical documentation. When evaluated on MIMIC-III, CLI-RAG outperforms baselines in semantic and temporal alignment, demonstrating potential for reliable and consistent clinical documentation. Ref. [
22] introduces MIRAGE, a comprehensive benchmark for evaluating medical retrieval-augmented generation (RAG) systems, and MEDRAG, a toolkit enabling large-scale experimentation across various LLMs, retrievers, and corpora. Through extensive testing, the study shows that optimal RAG configurations boost QA accuracy by up to 18% and offers practical guidelines for deploying RAG in medical applications.
Manuscript [
23] presents SMARThealth GPT, a RAG-based system designed to support community health workers in low-resource settings with guideline-based maternal care education. Developed using Indian pregnancy guidelines, the model emphasizes traceability, scalability, and adaptability. The case study demonstrates the practical value of RAG and LLMs in improving healthcare education and offers a blueprint for similar applications in resource-limited contexts. Ref. [
24] introduces MedRAG, a RAG framework enhanced with knowledge graph-based reasoning to improve diagnostic accuracy and specificity from EHRs. By integrating hierarchical diagnostic KGs and dynamically retrieving similar cases, MedRAG supports precise, patient-centered recommendations and proactive diagnostic questioning. Evaluations on public and private datasets show that it outperforms existing RAG models, particularly in reducing misdiagnosis for clinically similar conditions.
The study in [
25] evaluates the effectiveness of combining fine-tuning and Retrieval-Augmented Generation (RAG) in open-source LLMs for medical question answering, particularly in resource-constrained settings. Among the tested models, Mistral-7B with fine-tuning and RAG showed the best performance, achieving strong accuracy and alignment between confidence and correctness. The work introduces a novel MCQ evaluation methodology and highlights the potential of such models for clinical reasoning and patient education. Article [
26] explores the use of zero-shot prompting with LLaMA 2 (13B) and RAG to extract and summarize malnutrition-related data from aged care EHRs. The combined approach achieved high accuracy (up to 99.25%) in generating structured summaries and extracting clinical risk factors. RAG improved summarization performance and reduced hallucinations, highlighting its value in enhancing data accessibility and improving quality of care in healthcare settings.
The scoping review in [
27] maps the current applications and challenges of retrieval-augmented generation (RAG) in healthcare, highlighting its use in clinical reasoning and clinical judgment, education, and pharmacovigilance. While RAG enhances accuracy and transparency over traditional LLMs, key issues such as data privacy, bias, and lack of standardized validation remain. The study emphasizes the need for ethical implementation and interdisciplinary collaboration to ensure safe and effective RAG deployment in clinical settings. Ref. [
28] evaluated LLM-RAG models for surgical fitness assessment and preoperative education generation using local and international guidelines. Among ten models tested, GPT-4 with RAG achieved the highest accuracy (96.4%), outperforming clinicians and maintaining low hallucination rates. The findings highlight its potential as a reliable, consistent, and scalable support tool in preoperative and broader clinical workflows.
Article [
29] introduces SelfRewardRAG, a RAG-based framework that dynamically integrates real-time medical data with LLMs to address knowledge obsolescence in healthcare AI. Demonstrating strong performance across benchmarks like PubMedQA and MedQA, it delivers accurate, timely medical responses, surpassing some state-of-the-art models. While promising, its reliance on external data quality and high computational demands highlight the need for further optimization and ethical integration into clinical workflows. Study [
30] presents a Retrieval-Augmented Generation (RAG) pipeline tailored for preoperative medicine, integrating guideline-based knowledge into LLMs to enhance clinical accuracy. Using 35 preoperative guidelines, the GPT-4.0-RAG model achieved 91.4% accuracy, outperforming base LLMs and showing non-inferiority to junior doctors (86.3%). The system delivered rapid, safe, and guideline-concordant responses with lower hallucination rates. These findings highlight the potential of LLM-RAG systems as scalable and upgradeable tools in clinical reasoning and clinical judgment.
Manuscript [
31]’s work introduces Self-BioRAG, a domain-specific RAG framework designed for biomedical and clinical tasks, integrating retrieval, generation, and self-reflection modules. Trained on 84k biomedical instructions, it outperforms prior open-source models (≤
7B) by 7.2% on average across major biomedical QA benchmarks and achieves an 8% improvement in Rouge-1 for long-form QA. The authors emphasize the need for tailored retrievers, corpora, and instruction tuning to enhance domain adherence and factual accuracy in medical NLP tasks. Article [
32] evaluates Almanac, a retrieval-augmented LLM designed for clinical applications, using a dataset of 314 clinical questions. Compared to standard LLMs like ChatGPT-4, Bing, and Bard, Almanac demonstrated superior performance in factual accuracy, completeness, user preference, and safety. The study highlights the promise of domain-specific LLMs in medical decision making while emphasizing the need for rigorous validation before deployment.
Ref. [
33] presents ReinRAG, a reinforced reasoning-augmented generation method that leverages medical knowledge graphs to guide LLMs in generating long-form clinical discharge instructions from limited pre-admission data. By optimizing retrieval quality with group-normalized rewards, ReinRAG improves reasoning depth and reduces clinical misinterpretation. Experiments on real-world data demonstrate its superior performance in clinical accuracy and language generation compared to baseline models. Paper [
34] presents MEDGPT, a healthcare chatbot built on the Retrieval-Augmented Generation (RAG) framework, integrating external data sources like PDFs, CSVs, and PubMed to improve response accuracy and user satisfaction. By combining specialized retrieval tools and reasoning agents, MEDGPT delivers contextually relevant and personalized medical information.
Collectively, these research studies underscore the effectiveness of Retrieval-Augmented Generation (RAG) in enhancing the performance and reliability of large language models (LLMs) in the medical domain. By grounding their responses with current evidence-based practices and domain-specific information, RAG allows LLMs to dynamically integrate medical data and standards, enabling more accurate and customized outputs for tasks like diagnoses, drug development, and personalized care. The RAGMed system proposed in this manuscript advances the state of the art by comparing two embedding models and evaluating their impact on the performance and output quality of the RAG framework.
7. Discussion and Limitations
7.1. Novelty and Contributions
Beyond the comparison of embedding models, this work makes several contributions. First, it demonstrates the application of retrieval-augmented generation in healthcare dialogue and workflow support, a safety-critical domain where factual grounding is essential. Second, by adopting the RAG-Triad framework, we provide an evaluation that goes beyond surface-level overlap metrics and directly measures answer relevance, context relevance, and groundedness—dimensions critical for clinical decision support. Third, we present a working prototype that integrates question answering, appointment scheduling through natural language, and clinical note summarization, thereby illustrating how such assistants can reduce the documentation burden and improve workflow efficiency. Finally, we explicitly address compliance and responsible use considerations, outlining the safeguards required for deployment in healthcare environments. Collectively, these contributions extend the state of the art by positioning RAG not only as a research technique but as a practically oriented, domain-grounded assistant for digital healthcare. In addition to these contributions, our system also has the following limitations.
7.2. Sample Size
The evaluation was based on a relatively small sample size, which restricts the statistical power and may limit the generalizability of the findings. A larger and more diverse set of dialogues would be needed to better capture the variability of real-world patient–provider interactions. A further limitation is that we did not include well-known medical QA benchmarks such as BioASQ, MedQA, PubMedQA, i2b2/n2c2, or MIMIC. While these datasets are highly valuable for benchmarking, many are task-mismatched with our setting. For example, BioASQ, MedQA, and PubMedQA focus on factoid question answering from biomedical literature rather than dialogue-based consultation, whereas i2b2/n2c2 and MIMIC primarily contain EHRs, clinical notes, or structured data rather than conversational exchanges. In addition, benchmarks such as MIMIC and i2b2/n2c2 involve restricted-access clinical records that require IRB approval and data use agreements. As this study was an early-stage, proof-of-concept prototype, we focused on fully public, reproducible dialogue datasets. Future work will extend evaluation to these benchmarks to enable broader comparability and stronger validation.
7.3. Lack of Physician Validation
This research is currently in an early-stage, proof-of-concept phase, focused on developing and refining a workable prototype of a retrieval-augmented medical AI assistant. As such, no practicing physicians, clinical experts, or medical professionals were directly involved in validating the system’s responses at this stage. Our goal in this phase was to establish the technical feasibility of retrieval-augmented generation using publicly available datasets. Validation with physicians and other healthcare professionals is a critical next step to ensure clinical accuracy, safety, and usability, and will be incorporated into subsequent phases of this research once the system has matured beyond the prototype stage. While public datasets and automatic metrics (RAG-Triad) provide a useful initial benchmark, expert review is essential to confirm the clinical accuracy, safety, and usability of the assistant in practice. Future work will include qualitative evaluation with physicians and domain specialists.
7.4. Dataset Bias
The use of public datasets introduces a potential risk of bias. Many of these corpora (e.g., MedDialog) are curated or translated, and may not fully reflect authentic clinical language, workflows, or cultural context. This could lead to mismatches between system behavior and real clinical communication. Additionally, because rare conditions and complex multimorbidity are underrepresented in public datasets, performance may not generalize evenly across all patient scenarios.
Finally, as a proof of concept, our system did not incorporate statistical significance testing, confidence intervals, or comparative baselines with standard NLP metrics, which limits the rigor of the reported results. Future work will address these limitations by (i) expanding the evaluation dataset, (ii) involving medical professionals in validation, (iii) incorporating safeguards against dataset bias, and (iv) conducting more comprehensive statistical and comparative analyses.
7.5. Evaluation Rigor
We acknowledge that our evaluation did not include commonly reported text generation metrics such as BLEU, ROUGE, and F1. While these measures are widely used and enable comparability across studies, they may not fully reflect correctness in the medical domain, where multiple semantically valid answers may differ in wording. Our decision to prioritize RAG-Triad was guided by the need for clinically meaningful evaluation. Nonetheless, future work will incorporate both RAG-specific and general NLP metrics to balance domain faithfulness with broader benchmarking. Another limitation is the absence of statistical significance testing and confidence intervals, which are valuable for quantifying uncertainty and enabling more rigorous comparisons. As this study was intended as a proof-of-concept study, we focused on domain-specific evaluation, but future work will expand the analysis to include confidence intervals, hypothesis testing, and other quantitative measures alongside RAG-Triad scores.
7.6. Dataset Limitations
While public datasets such as MedDialog provide large-scale patient–doctor dialogues, they may not fully reflect the complexity of real-world clinical communication. For example, MedDialog may differ in linguistic style, cultural context, and clinical workflow compared to typical healthcare encounters. Similarly, other publicly available corpora often rely on curated or synthetic interactions rather than transcripts from authentic patient visits.
Despite these limitations, such datasets remain valuable for proof-of-concept development because they offer diverse coverage of medical topics, are ethically shareable, and enable reproducibility. To ensure eventual clinical applicability, however, future work should validate the system using de-identified real-world clinical dialogues obtained under Institutional Review Board (IRB) approval and HIPAA-compliant data-sharing agreements.
7.7. Data Privacy and Compliance
This study used only publicly available, de-identified datasets (e.g., MedDialog, ACI-Bench, Augmented Medical Dialogue, MTS Dialog, Patient–Doctor Conversations). As such, no protected health information (PHI) was accessed, and therefore HIPAA and GDPR compliance concerns were not directly applicable at this stage. Because the work is an early-stage, proof-of-concept prototype, all the experiments were conducted exclusively on open datasets that are widely used for research and do not contain identifiable patient data.
For future phases that involve real clinical data or physician validation, strict compliance with data protection regulations will be followed, including Institutional Review Board (IRB) approval, HIPAA adherence in the U.S., and GDPR alignment for international collaborations. All patient data would be de-identified, securely stored, and accessed only under approved data use agreements.
7.8. Responsible Use and Clinical Safety Considerations
Healthcare is a safety-critical domain, and the outputs of large language models (LLMs) cannot be relied upon in isolation for clinical decision making. Our prototype is intended as an early-stage, proof-of-concept research system to explore the feasibility of retrieval-augmented generation for medical dialogue, not as a tool for direct patient use. We acknowledge that LLMs can produce hallucinations, incomplete reasoning, or contextually inappropriate responses, which poses risks if applied without oversight.
For this reason, we emphasize that such systems should be deployed only as decision support aids under the supervision of qualified healthcare professionals, not as autonomous decision makers. Future work will incorporate expert physician validation, human-in-the-loop safeguards, and reliability mechanisms (e.g., confidence estimation, grounding verification) to ensure safety, correctness, and clinical usability before any real-world application.
7.9. Evaluation Limitations
We recognize that the question test set used in this proof-of-concept study is not sufficient for formal validation. This small set was intended only as an illustrative demonstration of system capabilities. The primary aim of including these questions was to show the baseline behavior of the system and to demonstrate how responses remain grounded in retrieved evidence rather than relying solely on model priors.
More clinically meaningful benefits of the RAG approach will emerge when the system is applied to patient-specific scenarios in which the vector database contains individualized health information (e.g., allergies, family history, comorbidities). In such cases, the assistant can tailor responses based on structured context, which a non-retrieval LLM would not reliably achieve. Future work will therefore involve evaluation on larger and more diverse question sets, including patient-specific queries linked to synthetic or de-identified health records, and will incorporate both automatic metrics (e.g., RAG-Triad, standard NLP measures) and domain expert validation to assess correctness, completeness, and safety at scale.
8. Conclusions and Future Work
8.1. Conclusions
In this study, we presented RAGMed, a Retrieval-Augmented Generation (RAG)-based AI assistant designed to support clinical workflows. A central focus was the comparative evaluation of two sentence embedding models—all-MiniLM-L6-v2 and GTE-Large—using the RAG-Triad framework. While prior studies have compared embeddings for general NLP or biomedical retrieval tasks, our work specifically evaluates their impact within a retrieval-augmented clinical assistant. The contribution lies not merely in confirming that GTE-Large outperforms MiniLM, but in showing how embedding dimensionality directly affects grounding quality in a safety-critical domain. By assessing answer relevance, context relevance, and groundedness, we demonstrate that embedding choice has measurable consequences for retrieving clinically useful passages and generating factually grounded responses. To our knowledge, this is among the first evaluations to link embedding dimensionality with RAG performance in medical dialogue and workflow support, extending state-of-the-art retrieval analyses into a practical healthcare context.
Beyond question answering, RAGMed can also handle appointment scheduling via natural language and the summarization of lengthy clinical narratives, illustrating its potential to reduce the administrative burden and enhance workflow efficiency. Collectively, these contributions highlight the promise of retrieval-augmented assistants to improve the accessibility, responsiveness, and reliability of digital healthcare services.
8.2. Future Work
Looking ahead, we plan to extend the assistant’s capabilities in several directions. First, we will explore voice-based querying to support more natural interactions, integration with Electronic Health Record (EHR) systems for seamless clinical workflows, and advanced privacy-preserving mechanisms to safeguard sensitive health data. These developments are intended to ensure that RAG-based systems are not only technically robust but also operationally viable for real-world healthcare deployment. Furthermore, we will address the current prototype’s limitation in handling negative findings (e.g., absence of symptoms, normal results, non-diagnostic statements). Negative findings carry significant clinical value, yet our proof-of-concept system focused only on retrieval-grounded positive evidence. Future work will therefore incorporate clinical NLP methods for negation detection (e.g., rule-based tools such as NegEx or more recent neural approaches) into the RAG pipeline. This will enable the assistant to capture and highlight both positive and negative evidence, resulting in more balanced, clinically comprehensive outputs that better reflect real-world reasoning.
Equally important, future phases will directly address the limitations identified in this study. We will involve physicians in iterative validation and co-design; expand evaluation to widely used biomedical benchmarks such as BioASQ, PubMedQA, i2b2/n2c2, and MIMIC; and incorporate statistical rigor through confidence intervals, hypothesis testing, and effect size reporting. These steps, together with safeguards for privacy, dataset bias, and responsible deployment, will establish a more clinically credible foundation for RAGMed and ensure its readiness for real-world integration.