1. Introduction
The high volume and variety of data in medical research present both opportunities and challenges. Alzheimer’s disease (AD) is a particularly compelling case study because it is multicausal, involving genetic, biochemical, and environmental factors, and presents with complex clinical manifestations. Despite tremendous progress, few effective methods exist for diagnosing, treating, and preventing AD. This knowledge gap is further exacerbated by the growing volume and fragmentation of evidence across data modalities, including textual descriptions, clinical trial data, imaging studies, and molecular data. Traditional methods of synthesizing such a large volume of knowledge are ineffective; most take a single-modality approach, which may miss insights that only emerge when data are integrated. This methodological gap underscores the need for a robust, unified framework that leverages multiple modalities to make retrieval more context-aware and to reduce the retrieval of irrelevant or less pertinent information.
In this research, we describe a novel multimodal retrieval-augmented generation (RAG) application, AlzheimerRAG (video demonstration), which integrates textual and visual modalities to improve contextual understanding and information synthesis from the biomedical literature. Our primary research objective in implementing multimodal RAG is to enhance context-aware retrieval by integrating heterogeneous data types, including textual data, images, and clinical trial information from PubMed articles. Existing methods [1,2,3,4] typically focus on textual or visual data separately, leaving a gap for integrated multimodal solutions. Recent research in context-aware retrieval [5,6,7,8,9,10] provides the foundation for RAG models by demonstrating how retrieval can enhance the generation capabilities of language models, particularly in knowledge-intensive tasks. Integrating RAG methodologies with multimodal inputs is a burgeoning area of research, as highlighted by Xia et al. [11], who proposed a multimodal RAG system that enhanced data synthesis across text and image modalities. In light of these advancements, the novelty of our approach lies in the seamless integration and alignment of multimodal data during the cross-modal attention fusion process. The AlzheimerRAG framework combines rapid, accurate retrieval via object stores with specialized language models, enhancing its capability to address the nuances of multimodal information pertinent to Alzheimer’s disease. We optimize fine-tuning by implementing Parameter-Efficient Fine-Tuning (PEFT) [12] and induce cross-modal attention fusion to facilitate the synergistic information flow between the text and image models. The fine-tuned models are then incorporated into a multimodal RAG workflow, developed as a web application with a user interface that allows end-users to retrieve context-aware answers to their queries. The target audience of this application includes biomedical researchers synthesizing the Alzheimer’s disease literature and identifying disease trends, clinicians seeking support for AD diagnosis and treatment planning, and healthcare institutions designing and supporting clinical trials.
Benchmark datasets such as BioASQ [13] and PubMedQA [14] have been instrumental in measuring the effectiveness of multimodal RAG systems [15]. BioASQ, a large-scale biomedical semantic indexing and question-answering (QA) dataset, provides a robust framework for assessing models’ retrieval and QA capabilities. Similarly, PubMedQA offers insights into the accuracy of models in handling biomedical queries, making it an essential tool for evaluating AlzheimerRAG’s performance against existing benchmarks. In comparative studies, models that integrate multimodal data have been shown to outperform traditional single-modality systems. For instance, models like T5 [16] have been evaluated in the context of biomedical question answering, demonstrating significant gains when multimodal inputs are utilized. This trend reinforces the need for AlzheimerRAG’s multimodal framework to enhance the understanding and treatment of AD.
In summary, our research contributions advance the multimodal RAG domain in AD in the following aspects:
Context-aware retrieval-augmented generation—Our framework enhances traditional RAG models by prioritizing the context relevance of domain-specific information, thereby increasing accuracy and utility in biomedical applications.
Advanced cross-modal attention fusion—AlzheimerRAG integrates multimodal data more effectively using transformer architectures and cross-modal attention mechanisms tailored to handle heterogeneous data types.
RAG user interface—Our system implements multimodal RAG as a web-based application using the latest state-of-the-art technologies like LangChain, FastAPI, Jinja2, and FaissDB to provide users with a robust interface for performing biomedical information retrieval tasks through the context-aware question-answering paradigm.
Comparable framework with state-of-the-art benchmarks—We evaluate the multimodal RAG application on benchmark datasets such as BioASQ and PubMedQA, along with other comparable LLM RAG models. We also compare AlzheimerRAG against human-generated responses for different clinical scenarios in Alzheimer’s disease to gauge the accuracy and hallucination rates of the retrieved answers.
Although this research is rooted in the Alzheimer’s domain, the use of cross-modal attention fusion in multimodal RAG makes the approach adaptable to any domain that requires aligning textual data with visual, audio, or structured data. The modular design of AlzheimerRAG supports adapting it to question answering in related medical domains linked to Alzheimer’s disease. Additionally, technical scalability is enhanced by parameter-efficient fine-tuning techniques such as QLoRA, which enable efficient fine-tuning of LLMs (e.g., LLaMA and LLaVA) on niche datasets without full retraining.
2. Related Work
The AlzheimerRAG framework was developed within the rapidly evolving landscape of multimodal data integration and retrieval-augmented generation techniques, which are becoming increasingly crucial in biomedical research. Recent studies have demonstrated the importance of leveraging multiple data modalities to enhance diagnosis, treatment, and understanding of complex diseases, such as Alzheimer’s.
Existing research [17,18,19,20,21] has highlighted the efficacy of attention mechanisms that span multiple modalities, which are instrumental in synthesizing heterogeneous information sources in medical contexts. For example, multimodal token fusion for vision transformers [22] has been shown to significantly improve the integration of visual and textual data in medical imaging [23]. Similarly, cross-modal translation and alignment techniques [24,25] have been shown to facilitate survival analysis, emphasizing the benefits of integrating diverse data types to yield richer insights. Additionally, recent developments in knowledge distillation have further enhanced model efficiency in healthcare applications, as demonstrated in the work by Hinton et al. and Gupta et al. [26,27], which involves transferring knowledge from a larger model (teacher) to a smaller model (student), thereby retaining performance while reducing computational costs. Various studies have adopted this methodology; notably, one found that integrating imaging and genetic data improved predictive outputs in Alzheimer’s models [2].
The application of AI in Alzheimer’s research has been underscored by studies [28] that leverage multimodal inputs to improve early diagnosis and patient stratification. Other research, such as [29,30], has focused on using AI to manage Alzheimer’s disease symptoms, demonstrating that AI-driven solutions can provide valuable insights and recommendations for patient care. The BioBERT model [31] represents a significant advancement in biomedical text mining, emphasizing the utility of transformer models fine-tuned for biomedical applications. This model has been foundational in developing various biomedical applications, including those focused on Alzheimer’s disease, where precision in information retrieval is critical. RAG methodologies [32] have gained traction in biomedical research for efficiently synthesizing information from large datasets. The works [5,33] laid the groundwork for RAG models by demonstrating how retrieval can enhance the generation capabilities of language models, particularly in knowledge-intensive tasks. This has profound implications for healthcare, where accurate and timely information retrieval can guide clinical decisions.
Compared to these advancements, the AlzheimerRAG framework combines rapid, accurate retrieval via FaissDB with specialized language models, enhancing its capability to address the nuances of multimodal information pertinent to Alzheimer’s Disease.
3. Materials and Methods
The overall architecture of AlzheimerRAG is shown in Figure 1. The diagram illustrates a multimodal RAG pipeline for the biomedical literature (e.g., PubMed) in which text, tables, and images are extracted through parsing and processed separately: text is parsed, tables are summarized, and images are captioned using a visual language model. These processed elements are then converted into embeddings through a cross-modal embedding fusion method and stored in an object store and a vector database. Upon receiving a user query, the system retrieves relevant information using similarity search and passes it to a large language model, which generates a context-aware answer by reasoning over the retrieved multimodal content.
In the subsequent sections, we describe each step in the architecture flow in more detail, followed by a demonstration of the application with the technical components.
3.1. Data Collection and Preprocessing
The first step of our process involved collecting relevant articles from PubMed. We accomplished this by writing a Python script [34] that called the National Center for Biotechnology Information (NCBI) Entrez Programming Utilities (E-utilities) API to fetch the top 2000 articles from the PubMed repository [35] related to the “Alzheimer’s Disease” search term [36]. The articles were fetched in batches per API request, adhering to NCBI API rate limits and sorted by relevance during retrieval. We parsed each document, collecting the full texts, abstracts, tables, and figures for textual and image retrieval. We then cleaned and normalized the data in the preprocessing step to ensure consistency and usability. This involved removing hyperlinks, references, and footnotes. We also standardized the figure/diagram format by converting all images to a consistent format for uniform processing.
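For illustration, the snippet below sketches how such a script might call the E-utilities esearch/efetch endpoints for the “Alzheimer’s Disease” term; the batch size, file naming, and sleep interval are illustrative assumptions rather than the exact settings of our collection script.

```python
import time
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
TERM = "Alzheimer's Disease"
TOTAL, BATCH = 2000, 200  # illustrative batch size

# 1) esearch: collect the PMIDs of the top-ranked articles, sorted by relevance.
search = requests.get(f"{EUTILS}/esearch.fcgi", params={
    "db": "pubmed", "term": TERM, "retmax": TOTAL,
    "sort": "relevance", "retmode": "json",
}).json()
pmids = search["esearchresult"]["idlist"]

# 2) efetch: download article records in batches, respecting the NCBI rate limit.
for start in range(0, len(pmids), BATCH):
    batch = pmids[start:start + BATCH]
    xml = requests.get(f"{EUTILS}/efetch.fcgi", params={
        "db": "pubmed", "id": ",".join(batch), "retmode": "xml",
    }).text
    with open(f"pubmed_batch_{start // BATCH:02d}.xml", "w", encoding="utf-8") as fh:
        fh.write(xml)
    time.sleep(0.4)  # stay under the unauthenticated request rate limit
```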
3.2. Textual Data Retrieval
This step retrieves the clinical text data related to Alzheimer’s disease (AD) for textual and tabular data processing. In our workflow, for generating the text embedding, we fine-tuned the “Llama-2-7b-pubmed” [
37,
38] model by training it with the PubMedQA [
14] dataset from HuggingFace. The fine-tuning used parameter-efficient fine-tuning (PEFT) techniques like QLoRA [
39].
Table 1 outlines the QLoRA parameters and the training argument parameters used for fine-tuning.
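The following is a minimal sketch of such a QLoRA setup built on the Hugging Face transformers, peft, bitsandbytes, and datasets libraries; the base-model identifier, LoRA rank/alpha/dropout, and training arguments are placeholders to be read against Table 1, not a verbatim copy of our training configuration.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder backbone; see Table 1 for the actual setup

# 4-bit NF4 quantization so the 7B backbone fits in limited GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections; rank, alpha, and dropout are placeholders.
lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# PubMedQA from HuggingFace; a standard supervised fine-tuning loop (e.g., trl's SFTTrainer)
# then trains only the LoRA adapters on prompt-formatted question/answer pairs.
dataset = load_dataset("pubmed_qa", "pqa_labeled", split="train")
args = TrainingArguments(output_dir="llama2-pubmed-qlora", per_device_train_batch_size=4,
                         num_train_epochs=1, learning_rate=2e-4, bf16=True)
```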
Textual and Tabular Data Processing
The extracted data were chunked into structured text and table summaries. A layout model (for tables) and detected titles were then used to identify candidate sub-sections of the document (e.g., Introduction, Methods). Post-processing aggregated the text under each title, and further chunking into text blocks was performed for downstream processing based on user-specific flags for each block. The text embedding model then converted these smaller blocks into embedding vectors, which were used for cross-modal attention fusion.
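As a rough illustration of this processing step, the sketch below partitions a PDF with the unstructured library, groups elements under detected titles, and splits each section into embedding-sized blocks; the element handling and chunk sizes are illustrative assumptions, not our exact pipeline.

```python
from unstructured.partition.pdf import partition_pdf
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Partition the article: the layout model labels titles, narrative text, and tables.
elements = partition_pdf(filename="article.pdf", strategy="hi_res", infer_table_structure=True)

# Aggregate text under each detected title (candidate sub-section).
sections, current_title = {}, "Preamble"
for el in elements:
    if el.category == "Title":
        current_title = el.text
    elif el.category == "Table":
        sections.setdefault(current_title, []).append(f"[TABLE] {el.text}")  # table summary placeholder
    else:
        sections.setdefault(current_title, []).append(el.text)

# Split each section into smaller blocks for embedding (size/overlap are illustrative).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = []
for title, parts in sections.items():
    for block in splitter.split_text("\n".join(parts)):
        chunks.append({"section": title, "text": block})
```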
3.3. Image Retrieval
To generate feature embeddings that capture image details from the PubMed articles, we fine-tuned the LLaVA (Large Language and Vision Assistant, version 2) model [40] using the official LLaVA repository with the Llama-2 7B backbone language model [41]. LLaVA combines pre-trained language models (such as Vicuna or LLaMA [42,43]) with visual models (such as CLIP’s visual encoder [44]) by converting visual features into embeddings compatible with the language model. Its training has two stages: a pre-training stage, in which image–text pairs align the visual and language embeddings and only the projection matrix is trained [45], and a fine-tuning stage, in which the visual encoder remains frozen while the projection layer and language model are updated [46]. This fine-tuning approach preserves the strengths of the large language model while lowering computational requirements, making it well suited to resource-limited environments and quick adaptation to new data. The hyperparameters used for fine-tuning are presented in Table 2. QLoRA uses the 4-bit NormalFloat data type, which is explicitly designed for normally distributed weights, thereby further reducing memory usage.
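To illustrate how the fine-tuned visual model is used downstream (captioning article figures so they can be embedded alongside text), the following sketch loads a LLaVA checkpoint through transformers; the public model identifier and prompt template are stand-ins, as our pipeline loads the QLoRA-fine-tuned weights instead.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Placeholder checkpoint; in our pipeline the fine-tuned LLaVA weights would be loaded here.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("figure_1.png").convert("RGB")
prompt = "USER: <image>\nDescribe this figure from an Alzheimer's disease article. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
caption = processor.batch_decode(output, skip_special_tokens=True)[0]
# The caption (and/or the visual features) is then embedded alongside the text chunks.
```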
3.4. Cross-Modal Attention Fusion
Cross-modal attention fusion is a mechanism that facilitates interaction between different modalities, in our case between text and images. It allows the model to selectively focus on relevant parts of both modalities by computing attention weights. These weights modulate the embeddings from each modality, enabling a richer and more comprehensive representation. In our context, cross-modal attention fusion ensures that the integrated textual and visual data contribute meaningfully to medical information retrieval. The process steps of cross-modal attention fusion are detailed as a sequence diagram in Figure 2.
The three steps associated with this process are described below:
Generate query, key, and value vectors from the text and image embeddings from Section 3.2 and Section 3.3, respectively.
Compute the attention scores using the scaled dot-product attention mechanism:

$$\text{scores} = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)$$

where
- Q (queries) and K (keys) are matrices of size $n \times d_k$, with n being the number of tokens and $d_k$ the dimension of each key;
- $d_k$ is the dimensionality of the keys used for scaling;
- $1/\sqrt{d_k}$ scales the dot-product, helping to stabilize gradients in deeper networks.

Aggregate contributions from both modalities based on the attention weights:

$$\text{output} = \text{scores} \cdot V$$

where
- scores is the matrix of attention weights, with dimensions $n \times n$, where n is the number of tokens;
- V (values) is the matrix of value vectors corresponding to the tokens, with dimensions $n \times m$, where m is the dimensionality of each value (typically the embedding size d).

The resulting output is a combination of the values weighted by attention.
Finally, the combined feature embeddings are indexed as vectors in an object store, which allows quicker retrieval of multimodal data.
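The mechanism described above can be expressed compactly in PyTorch. The module below is a minimal, single-head sketch of scaled dot-product cross-attention in which text tokens attend over image tokens, followed by residual fusion; it illustrates the computation rather than reproducing the exact fusion layer used in AlzheimerRAG.

```python
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text tokens attend over image tokens (a symmetric image-to-text block can be added)."""

    def __init__(self, d: int):
        super().__init__()
        self.q_proj = nn.Linear(d, d)  # queries from the text embeddings
        self.k_proj = nn.Linear(d, d)  # keys from the image embeddings
        self.v_proj = nn.Linear(d, d)  # values from the image embeddings

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (n, d) text-token embeddings; image_emb: (m, d) image-patch embeddings
        Q, K, V = self.q_proj(text_emb), self.k_proj(image_emb), self.v_proj(image_emb)
        scores = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1)), dim=-1)  # (n, m)
        fused = scores @ V          # attention-weighted image context, (n, d)
        return text_emb + fused     # residual fusion of the two modalities

# Example: fuse 32 text tokens with 16 image patches in a 512-dimensional embedding space.
fusion = CrossModalAttention(d=512)
fused = fusion(torch.randn(32, 512), torch.randn(16, 512))  # -> shape (32, 512)
```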
3.5. AlzheimerRAG Demonstration
3.5.1. System Walk-Through
AlzheimerRAG is implemented as a Python (version 3.8.x) web application using FastAPI and Jinja2 templates with LangChain integration. It provides a simple user interface (Web Application) for leveraging efficient multimodal RAG capabilities related to AD. The application (Source Code) is deployed on Heroku, a cloud-based Platform-as-a-Service (PaaS) solution that manages seamless continuous integration and deployment. The application answers user queries through information retrieval, and the multimodal RAG component extracts context-aware relevant images as part of the output response. The demo video can be accessed from this link (Video Demonstration).
A sample response from the AlzheimerRAG user interface can be observed in Figure 3, where relevant text and images are fetched for a particular user query related to Alzheimer’s disease from the embedded PubMed articles.
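As a simplified illustration of this walk-through, the snippet below wires a query form to a retrieval endpoint with FastAPI and Jinja2 (the components detailed in Section 3.5.2); the routes, template name, and the answer_query stub are hypothetical stand-ins for the actual application and its multimodal RAG chain.

```python
from fastapi import FastAPI, Request, Form
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")  # holds index.html (query form + results)

def answer_query(question: str) -> dict:
    # Hypothetical stand-in for the multimodal RAG chain (retrieval + generation).
    return {"answer": f"(placeholder answer for: {question})", "images": []}

@app.get("/")
async def home(request: Request):
    # Render the query form; Jinja2 fills the template with backend data.
    return templates.TemplateResponse("index.html", {"request": request})

@app.post("/ask")
async def ask(request: Request, question: str = Form(...)):
    result = answer_query(question)
    return templates.TemplateResponse("index.html",
                                      {"request": request, "question": question, **result})
```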
3.5.2. Key Technical Components
The key technical components are summarized below.
FastAPI for API development: FastAPI is a high-performance Python web framework that provides an intuitive interface for API development and integrates seamlessly with Python’s async capabilities.
Jinja2 for template rendering: Jinja2 is a templating engine for Python that offers dynamic template rendering. It serves HTML content from backend data, enabling a seamless and interactive user experience.
FaissDB for embedding multimodal data: FaissDB [47], a vector database, is widely used for storing embeddings of multimodal data. Embedding is the process of converting content into a numerical representation (i.e., vectors) for large language models and is crucial for transforming preprocessed healthcare knowledge into individual vectors. The text and image embeddings are encoded into uniform, high-dimensional vectors and indexed for efficient similarity searches. When a query is made, the reasoning and retrieval component searches the vector space to extract relevant information. The benefit is that it uses an approximate nearest neighbor (ANN) search to quickly locate embeddings in high-dimensional space, which is essential for large-scale applications. The generation component uses the retrieved multimodal representations to produce outputs in various formats, such as text or images (a combined indexing-and-retrieval sketch follows this list).
LangChain as a Retrieval Agent for multimodal RAG: The Retrieval Agent pinpoints the most relevant knowledge in response to user queries. This process uses the embedding model to convert the user query text into vectors, which are then matched against the vector storage to identify the closest vectors. The effectiveness of a Retrieval Agent is closely tied to the underlying framework upon which it is built. We therefore used the LangChain framework [48,49], a leading open-source framework, along with LlamaIndex [41], because of their significant advantages in (i) preserving table data integrity, (ii) streamlining the handling of multimodal data, and (iii) enhancing semantic embedding. Together, LlamaIndex and LangChain enhance the context-awareness of the extracted content, enabling efficient retrieval and synthesis of information and producing nuanced outputs.
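A compact sketch of this Faiss-backed retrieval-and-generation pattern is given below, using classic LangChain components (a FAISS vector store, a Hugging Face embedding model, and a RetrievalQA chain); the embedding and generator model identifiers, the toy chunks, and the exact import paths are illustrative assumptions that vary across LangChain versions, rather than an excerpt of our application code.

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Index the pre-processed chunks (text blocks, table summaries, image captions) in a FAISS store.
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")  # illustrative model
chunks = [
    "Amyloid-beta plaque accumulation is a pathological hallmark of Alzheimer's disease.",  # toy text chunk
    "[IMAGE CAPTION] PET scan showing amyloid deposition in the temporal lobes.",           # toy caption
]
store = FAISS.from_texts(chunks, embedder)

# Wrap a generator model and let the retriever supply the top-k chunks as context for each query.
llm = HuggingFacePipeline.from_model_id(model_id="gpt2", task="text-generation")  # placeholder backbone
qa = RetrievalQA.from_chain_type(llm=llm, retriever=store.as_retriever(search_kwargs={"k": 4}))

print(qa.run("What role do amyloid-beta plaques play in Alzheimer's disease?"))
```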
6. Conclusions and Future Work
The AlzheimerRAG application represents a significant advancement in biomedical research, particularly in understanding and managing Alzheimer’s disease. By integrating multimodal data, including textual information from PubMed articles, imaging studies, and clinical trial scenarios, this retrieval-augmented generation (RAG) tool provides a comprehensive platform for analyzing complex biomedical data. The use of cross-modal attention fusion enhances the alignment and processing of diverse data types, leading to improved accuracy in generating insights relevant to diagnosis, treatment planning, and understanding the pathophysiology of Alzheimer’s disease. The experimental results indicate that AlzheimerRAG outperforms existing methodologies in terms of accuracy and robustness, demonstrating the value of a multimodal approach in addressing the complexities inherent in Alzheimer’s disease research. Although the system exhibits low hallucination rates, the risk of generating misleading information in nuanced clinical scenarios remains; as discussed in Appendix A and Appendix B, this necessitates further research and clinical validation for real-world safety and applicability. Future enhancements could expand its scope to other neurodegenerative disorders, such as Parkinson’s disease, incorporate additional data sources (e.g., wearable devices and electronic health records (EHRs)), refine the user interface for improved interpretability, and optimize clinical trial support through enhanced patient recruitment and monitoring. Continuous improvements informed by user feedback will further enhance its utility for researchers and clinicians. In summary, while AlzheimerRAG shows great promise in enhancing Alzheimer’s disease research, pursuing these future directions will be essential for maximizing its impact in clinical settings.