Article

MERA: Medical Electronic Records Assistant

Ahmed Ibrahim, Abdullah Khalili, Maryam Arabi, Aamenah Sattar, Abdullah Hosseini and Ahmed Serag
1 AI Innovation Lab, Weill Cornell Medicine—Qatar, Doha P.O. Box 24144, Qatar
2 Department of Medicine, New Vision University, 0159 Tbilisi, Georgia
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(3), 73; https://doi.org/10.3390/make7030073
Submission received: 20 May 2025 / Revised: 11 July 2025 / Accepted: 17 July 2025 / Published: 30 July 2025
(This article belongs to the Special Issue Advances in Machine and Deep Learning)

Abstract

The increasing complexity and scale of electronic health records (EHRs) demand advanced tools for efficient data retrieval, summarization, and comparative analysis in clinical practice. MERA (Medical Electronic Records Assistant) is a Retrieval-Augmented Generation (RAG)-based AI system that addresses these needs by integrating domain-specific retrieval with large language models (LLMs) to deliver robust question answering, similarity search, and report summarization functionalities. MERA is designed to overcome key limitations of conventional LLMs in healthcare, such as hallucinations, outdated knowledge, and limited explainability. To ensure both privacy compliance and model robustness, we constructed a large synthetic dataset using state-of-the-art LLMs, including Mistral v0.3, Qwen 2.5, and Llama 3, and further validated MERA on de-identified real-world EHRs from the MIMIC-IV-Note dataset. Comprehensive evaluation demonstrates MERA’s high accuracy in medical question answering (correctness: 0.91; relevance: 0.98; groundedness: 0.89; retrieval relevance: 0.92), strong summarization performance (ROUGE-1 F1-score: 0.70; Jaccard similarity: 0.73), and effective similarity search (METEOR: 0.7–1.0 across diagnoses), with consistent results on real EHRs. The similarity search module empowers clinicians to efficiently identify and compare analogous patient cases, supporting differential diagnosis and personalized treatment planning. By generating concise, contextually relevant, and explainable insights, MERA reduces clinician workload and enhances decision-making. To our knowledge, this is the first system to integrate clinical question answering, summarization, and similarity search within a unified RAG-based framework.

1. Introduction

In recent years, large language models (LLMs) have redefined the landscape of Artificial Intelligence (AI) by demonstrating extraordinary capabilities in natural language understanding, reasoning, and knowledge representation [1]. Their success in applications such as customer service automation and decision-making underscores their broad potential. While LLMs have shown impressive capabilities, their limitations become more pronounced in specialized areas such as healthcare [2,3,4]. Issues such as model hallucinations (where LLMs generate linguistically fluent but factually inaccurate or nonsensical responses [5]), gaps in domain-specific medical knowledge, and difficulties in incorporating private or proprietary healthcare data [6] are not merely inconvenient; they can also be risky.
Moreover, LLMs are often trained on data that does not include the most recent medical research, clinical trial outcomes, or patient-specific records, so their outputs risk being outdated or incomplete. For example, a model trained on data up to 2021 might lack awareness of innovative cancer therapies or new drug approvals from 2023 [7]. This underscores the need for methods that reliably incorporate up-to-date, domain-specific information.
Retrieval-Augmented Generation (RAG) has emerged as a promising solution to these challenges. By integrating domain-specific data, RAG systems anchor their outputs in verifiable, current information, thereby reducing hallucinations and improving accuracy [8,9,10]. This approach enables LLMs to access the latest clinical guidelines, research findings, and even patient-specific data from electronic health records (EHRs) [11]. However, integrating such diverse data is not without its challenges. Medical data is inherently complex, ranging from structured EHR entries and unstructured clinical notes to multimodal data such as medical imaging [12,13,14,15]. Developing robust pipelines to process and harmonize these sources remains a significant technical hurdle, further compounded by the need for LLMs to reason about intricate, long-term interdependencies within the data.
To address these challenges, our work introduces MERA, a chatbot powered by RAG, designed to meet three essential needs: (1) question answering, (2) report summarization, and (3) similarity search. By streamlining data management and enhancing the precision and timeliness of clinical decisions, MERA aims to improve patient outcomes while reducing the administrative burden on healthcare professionals. Additionally, we contribute a synthetic dataset of medical reports generated using state-of-the-art LLMs, further facilitating the development and evaluation of clinically relevant systems. Importantly, we also validate MERA on real-world, de-identified EHRs from the MIMIC-IV dataset, demonstrating robust performance in question answering, summarization, and similarity search tasks. To the best of our knowledge, MERA is the first RAG-based system to unify question answering, report summarization, and similarity search in a single, LLM-powered framework tailored for healthcare applications.

2. Related Work

The integration of conversational AI into healthcare is already transforming clinical workflows [16,17,18,19,20]. Early applications have demonstrated effectiveness in summarizing clinical data [21,22], but often lack context-aware retrieval mechanisms and interpretable generation logic that facilitate traceable and clinically relevant summaries. In addition, most research has treated chatbot functionality and clinical summarization as separate challenges [23,24,25,26]. Chatbots have primarily been used for patient engagement, symptom assessment, and appointment scheduling [25], while summarization systems focus on condensing patient data to support clinical decision-making. Hence, there is a need for integrated systems that combine question answering and summarization to better support clinical workflows.
Recent studies have begun to highlight the benefits of such integration. For example, a healthcare chatbot was developed to generate comprehensive patient summaries aligned with European patient summary standards, enhancing both patient engagement and navigation [27]. Neupane et al. introduced CLINICSUM, a framework that extracts and abstracts key insights from patient–doctor dialogues to produce succinct clinical summaries [28]. While these approaches address critical needs, a gap remains in fully combining document-interaction question answering with clinical summarization tools.
In addition to bridging question answering and summarization, there is a critical need to integrate a robust similarity search component into clinical AI systems [29]. This functionality would enable clinicians to efficiently retrieve and compare analogous patient cases from large, heterogeneous datasets. By surfacing relevant precedents, such a component can enhance diagnostic accuracy and support personalized care decisions. Despite its promise, similarity search remains underexplored in current conversational healthcare AI tools, representing a key gap in the development of comprehensive clinical support systems.
Gupta et al. [30] developed a deep similarity learning framework using Siamese networks to predict diseases based on pairwise similarity between patients. While effective, their approach is constrained by fixed supervised similarity learning, which limits its flexibility and scalability in broader, real-world clinical scenarios. It does not support dynamic, query-based exploration of patient records or generalize easily to unseen clinical tasks.
The following Methods section details the design and implementation of MERA, which integrates three core components—clinical question answering, report summarization, and similarity search—into a unified RAG-based framework. We describe the strategies used for synthetic dataset generation and report standardization, and the evaluation metrics developed to assess MERA’s performance across these functionalities.

3. Methods

3.1. Dataset

Developing and evaluating AI systems in healthcare is often hindered by restricted access to clinical data and rigorous privacy regulations [31,32,33,34]. To address these challenges, we created a synthetic dataset comprising 6000 medical reports generated via structured prompting with LLMs. This strategy supports robust experimentation while ensuring complete compliance with patient privacy standards.
To promote diversity and linguistic richness, we employed three LLMs of varying sizes and architectures: Mistral v0.3 7B [35], Qwen 2.5 32B [36], and Llama 3 70B [37]. Each model generated 2000 reports, following the same predefined clinical prompts and adhering to a consistent medical documentation format. Specifically, the generation process was guided by a structured template modeled after real-world hospital discharge summaries (see Figure 1). To enable consistent analysis and evaluation, these records required further processing, as detailed in the next section.
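To illustrate how such template-guided generation can be reproduced, the sketch below prompts an instruction-tuned model through the Hugging Face transformers pipeline. The checkpoint name, prompt wording, and sampling settings are illustrative assumptions, not the exact configuration used in this study.

```python
# Illustrative sketch of template-guided synthetic report generation.
# The checkpoint, prompt wording, and sampling settings are assumptions;
# the study used Mistral v0.3 7B, Qwen 2.5 32B, and Llama 3 70B with a
# shared discharge-summary template (Figure 1).
from transformers import pipeline

TEMPLATE_PROMPT = (
    "You are a clinical documentation assistant. Generate a fictional hospital "
    "discharge summary with the sections: Patient Information, Reason for "
    "Admission, Medical History, Diagnostic Findings, Treatment Plan, and "
    "Discharge Instructions. Use placeholder names only."
)

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",  # assumed checkpoint
    device_map="auto",                            # requires accelerate
)

def generate_report(seed_condition: str) -> str:
    """Generate one synthetic report anchored on a given primary diagnosis."""
    prompt = f"{TEMPLATE_PROMPT}\nPrimary diagnosis: {seed_condition}\n"
    out = generator(prompt, max_new_tokens=800, do_sample=True, temperature=0.8)
    return out[0]["generated_text"]

if __name__ == "__main__":
    print(generate_report("community-acquired pneumonia")[:500])
```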

3.2. Report Formatting and Standardization

To ensure the synthetic reports could be reliably compared and analyzed, we applied a formatting and standardization process using a custom script that performs the following steps (a simplified sketch is shown after the list).
  • Extraction of Key Sections: The script extracted specific sections from the reports, such as Reason for Admission, Medical History, Diagnostic Findings, and Treatment Plan, using predefined key phrases.
  • Normalization of Dates: Dates in various formats (e.g., “January 1, 1985,” “01/01/1985”) were normalized to a consistent format (MM/DD/YYYY) to ensure uniformity across the dataset.
  • Replacement of Placeholder Names: Placeholder names (e.g., “John Doe,” “Jane Smith”) were replaced with realistic patient and physician names sourced from external CSV files. It is important to note that these names were synthetically generated and do not correspond to real individuals. This step ensured that the dataset reflected real-world naming conventions while maintaining privacy and ethical standards.
  • Unified Report Structure: The script transformed the reports into a unified structure, ensuring that each report followed the same format for consistency.
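The sketch below is a simplified illustration of these standardization steps. The section key phrases match those listed above, while the regular expressions, date formats handled, and CSV field names are assumptions for demonstration rather than the exact script used.

```python
# Simplified sketch of the report standardization steps; the regexes and the
# replacement-name CSV layout are illustrative assumptions.
import csv
import random
import re
from datetime import datetime

SECTION_KEYS = ["Reason for Admission", "Medical History",
                "Diagnostic Findings", "Treatment Plan"]

def extract_sections(report: str) -> dict:
    """Split a report into sections keyed by the predefined headings."""
    pattern = "|".join(re.escape(k) for k in SECTION_KEYS)
    parts = re.split(f"({pattern})", report)
    sections, current = {}, None
    for part in parts:
        if part in SECTION_KEYS:
            current = part
        elif current:
            sections[current] = part.strip().lstrip(":").strip()
    return sections

def normalize_dates(text: str) -> str:
    """Convert dates such as 'January 1, 1985' to MM/DD/YYYY."""
    def repl(match):
        try:
            return datetime.strptime(match.group(0), "%B %d, %Y").strftime("%m/%d/%Y")
        except ValueError:
            return match.group(0)  # leave anything that is not a real date untouched
    return re.sub(r"[A-Z][a-z]+ \d{1,2}, \d{4}", repl, text)

def replace_placeholders(text: str, names_csv: str) -> str:
    """Swap placeholder names for synthetic names drawn from a CSV file."""
    with open(names_csv, newline="") as f:
        names = [row["name"] for row in csv.DictReader(f)]  # assumed column name
    for placeholder in ("John Doe", "Jane Smith"):
        text = text.replace(placeholder, random.choice(names))
    return text
```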
Figure 1. Structured template in markdown used to guide clinical report generation. All three language models—Mistral v0.3 7B, Qwen 2.5 32B, and Llama 3 70B—generated reports based on this predefined format.

3.3. MERA Architecture

MERA is designed to handle medical queries by seamlessly integrating retrieval and generation processes, as illustrated in Figure 2. This system is equipped with several core functionalities, including question answering, similarity search, and report summarization, each of which plays a role in ensuring that the responses provided are accurate, contextually relevant, and concise.

3.4. Question Answering

For question answering, MERA is designed to handle queries related to individual patients as well as two or more patients. When a question is received, such as “What is the diagnosis of patient 10?”, the system begins by extracting the patient’s medical record number and retrieving the corresponding medical report from a predefined directory. To manage large documents effectively, the report is divided into smaller, manageable chunks using double newline characters as delimiters.
For queries involving multiple patients (e.g., “What is the diagnosis of patients 10 and 20?”), the system first separates the query into distinct sub-questions for each patient. This approach reduces context length and enhances the model’s understanding of each individual query. The system then retrieves the medical reports for each patient and processes them independently using the same chunking and re-ranking methodology, eventually generating tailored answers for each sub-question. These individual responses are then integrated into a cohesive final answer.
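A minimal sketch of this retrieval, chunking, and query-decomposition logic is shown below. The report directory layout, file naming, and the patient-ID regular expression are assumptions made for illustration.

```python
# Sketch of report loading, double-newline chunking, and multi-patient query
# decomposition; the directory layout and file naming are assumed.
import re
from pathlib import Path

REPORT_DIR = Path("reports")  # assumed directory of per-patient reports

def load_report(patient_id: int) -> str:
    """Load the medical report for a given patient number."""
    return (REPORT_DIR / f"patient_{patient_id}.md").read_text()

def chunk_report(report: str) -> list[str]:
    """Split a report into chunks using double newlines as delimiters."""
    return [c.strip() for c in report.split("\n\n") if c.strip()]

def split_multi_patient_query(query: str) -> dict[int, str]:
    """Turn 'What is the diagnosis of patients 10 and 20?' into one
    sub-question per patient ID mentioned in the query."""
    ids = [int(n) for n in re.findall(r"\d+", query)]
    return {
        pid: re.sub(r"patients?\s[\d,\sand]+", f"patient {pid}", query, flags=re.I)
        for pid in ids
    }
```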
The re-ranking process is a critical component of MERA’s question-answering pipeline, designed to ensure the most clinically relevant information is prioritized for response generation. After initial retrieval of document chunks, we employ a cross-encoder model (MiniLM-L-6-v2) to perform fine-grained semantic matching between the user’s query and each retrieved chunk [38,39]. Unlike initial retrieval methods that rely on vector similarity alone, this re-ranker evaluates query-chunk pairs through joint processing, where both the query and candidate text are fed simultaneously into the model. This allows the cross-encoder to capture intricate contextual relationships, such as the following:
  • Semantic nuances in medical terminology (e.g., distinguishing “metastatic carcinoma” from “localized carcinoma”).
  • Query-specific emphasis (e.g., prioritizing treatment-related chunks for “therapy options” queries).
  • Negation and conditionality (e.g., recognizing “non-responsive to chemotherapy” as a key exclusion criterion).
The cross-encoder assigns a relevance score to each query-chunk pair through its transformer architecture, which analyzes token-level interactions across both texts. These scores are then sorted to select the top-matching chunk. Unlike bi-encoders, which encode queries and documents independently—potentially missing nuanced interactions—cross-encoders evaluate both jointly, reducing false positives caused by superficial term overlap (e.g., retrieving “stage III complications” for a “stage II treatment” query).
As illustrated in Figure 3, bi-encoders compute similarity between separately produced embeddings using cosine distance, whereas cross-encoders jointly process the input pair through a unified model to produce a classification score. This joint processing also helps resolve medical ambiguities, such as the polysemous term “remission,” by interpreting its meaning in context. While bi-encoders offer scalability for large datasets, MERA’s design limits the number of candidate chunks per query, enabling cross-encoder re-ranking with minimal latency and thus favoring clinical accuracy over speed.
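The two-stage pattern can be sketched with the sentence-transformers library as follows. The bi-encoder checkpoint and the exact cross-encoder checkpoint (here the public ms-marco MiniLM-L-6-v2 re-ranker, consistent with the MiniLM-L-6-v2 model named above) are assumptions, not a confirmed configuration.

```python
# Sketch of first-stage bi-encoder retrieval followed by cross-encoder re-ranking.
# Both checkpoints are assumptions consistent with the MiniLM-L-6-v2 re-ranker
# named in the text.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                    # assumed retriever
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")    # re-ranker

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Retrieve candidate chunks by embedding similarity, then re-rank them
    with the cross-encoder's joint query-chunk scoring."""
    # First stage: cosine similarity over independently computed embeddings.
    chunk_emb = bi_encoder.encode(chunks, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, chunk_emb,
                                top_k=min(top_k * 4, len(chunks)))[0]
    candidates = [chunks[h["corpus_id"]] for h in hits]

    # Second stage: the cross-encoder scores each (query, chunk) pair jointly.
    scores = cross_encoder.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```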

3.5. Report Summarization

The report summarization functionality is designed to deliver concise, informative summaries of entire medical reports or specific sections upon request. When a summarization query is received (e.g., “Summarize the report for patient 10”), the system first retrieves the relevant medical document from the database. This retrieved report then triggers a structured summarization prompt to the LLM, which processes the content and generates a standardized, section-based summary (e.g., Clinical History, Findings, Diagnosis, Recommendations), regardless of the original report’s format.
This structured approach ensures consistency and clarity, even when source documents vary in layout or style. Depending on the query, the summary may cover the full report or focus on specific sections—always maintaining a logical and clinician-friendly format. For requests involving multiple patients, the system retrieves and processes each report individually, triggering separate LLM summarization prompts before compiling the results into a single, cohesive response.
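A sketch of how the structured summarization prompt might be assembled is given below. The target section names are taken from the text above; the prompt wording and the downstream `generate()` call are illustrative assumptions.

```python
# Sketch of the structured summarization prompt described above. The wording
# and the `generate` helper are assumptions; only the section names come from
# the text.
SUMMARY_SECTIONS = ["Clinical History", "Findings", "Diagnosis", "Recommendations"]

def build_summary_prompt(report: str, sections: list[str] = SUMMARY_SECTIONS) -> str:
    """Wrap a retrieved report in a prompt that forces a section-based summary."""
    headings = "\n".join(f"## {s}" for s in sections)
    return (
        "Summarize the following medical report. Use exactly these sections, "
        "in this order, and keep each section concise and factual:\n"
        f"{headings}\n\n"
        "Report:\n"
        f"{report}"
    )

# Usage (generate() stands in for whatever LLM call the deployment uses):
# summary = generate(build_summary_prompt(load_report(10)))
```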

3.6. Similarity Search

MERA is designed to support clinical decision-making by identifying patients with similar medical histories. When prompted with a query such as “Find patients with cases similar to patient 10,” the system begins by retrieving the document-level embedding for the target patient’s record. It then searches a vector database to find the top K most similar embeddings, representing patients with semantically comparable clinical profiles.
To refine these initial results, MERA applies a cross-encoder that jointly processes the query and each retrieved case to re-rank them based on contextual relevance. This additional step ensures that the most clinically meaningful matches are prioritized. The top-ranked cases are then passed to the language model, which uses them to generate an informed, tailored response. This similarity search workflow allows clinicians to compare relevant patient cases efficiently, uncover patterns, and inform diagnosis and treatment planning.
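The workflow can be sketched as below. Using FAISS as the vector store and the listed embedding and re-ranking checkpoints are assumptions about how the vector database described above could be realized, not a confirmed implementation.

```python
# Sketch of document-level similarity search with cross-encoder re-ranking.
# FAISS and the checkpoints are assumptions about the vector database described
# in the text.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")                  # assumed
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")     # assumed

def build_index(reports: list[str]) -> faiss.IndexFlatIP:
    """Index one normalized embedding per patient report (inner product = cosine)."""
    emb = embedder.encode(reports, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(np.asarray(emb, dtype="float32"))
    return index

def find_similar(target_report: str, reports: list[str],
                 index: faiss.IndexFlatIP, k: int = 5) -> list[str]:
    """Return the K reports most similar to the target, re-ranked jointly."""
    q = embedder.encode([target_report], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k * 2)  # over-retrieve
    # In practice the target patient's own record would be dropped here.
    candidates = [reports[i] for i in ids[0] if i != -1]
    scores = reranker.predict([(target_report, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:k]]
```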

3.7. Question-Answering Evaluation Metrics

3.7.1. Answer Correctness Metric

The correctness metric evaluates the accuracy, completeness, and consistency of a generated response against a ground truth answer, ensuring that the output is both factually sound and comprehensive. This is especially critical in healthcare, where precision and trustworthiness are essential [40]. By leveraging GPT-4o [41], the metric checks whether the generated answer aligns with the ground truth, detects factual inaccuracies, and determines whether all key aspects are addressed. It also verifies internal consistency, ensuring the response is free from contradictions. The evaluation yields a binary correctness score (True/False) along with a detailed explanation, offering a transparent assessment of the model’s performance. The prompt used for this evaluation is shown in Figure 4.

3.7.2. Answer Relevancy Metric

The answer relevancy metric assesses how well the generated response addresses the input query, focusing on whether the information provided is directly applicable and useful. This is particularly important in high-stakes domains such as healthcare, where clarity and focus are vital. The metric evaluates the relevance and helpfulness of each statement in the response. The LLM determines whether the answer directly responds to the question and provides meaningful insight, assigning a relevance score along with a justification. The prompt used for this evaluation is illustrated in Figure 5.

3.7.3. Groundedness Metric

The groundedness metric assesses how closely a generated response adheres to the provided factual information, ensuring that the answer is accurate and well-supported. This is particularly crucial in fields such as healthcare, where maintaining trust and data fidelity requires responses to be anchored in verified evidence. The metric determines whether the content of the response faithfully reflects the given facts, assigning a groundedness score (True/False) along with a detailed explanation of the evaluation. The prompt used for this assessment is shown in Figure 6.

3.7.4. Retrieval Relevance Metric

The retrieval-relevance metric evaluates the semantic alignment between the retrieved facts and the input question, ensuring that the information presented is directly applicable and meaningful. Using an LLM-based assessment, the metric determines whether the retrieved content is contextually related to the query, flagging any irrelevant or off-topic material. The evaluation produces a binary relevance score (True/False) along with a detailed explanation, offering transparency into the rationale behind the judgment. The prompt used for this evaluation is shown in Figure 7.
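The four metrics above share the same LLM-as-judge pattern: a prompt, the model output, and the supporting facts are sent to GPT-4o, which returns a binary verdict with an explanation. The sketch below shows a generic version of this judge using the OpenAI Python client; the abbreviated prompt wording stands in for the fuller prompts shown in Figures 4–7 and is an assumption.

```python
# Generic LLM-as-judge sketch for the binary metrics above (correctness,
# relevancy, groundedness, retrieval relevance). The abbreviated prompt is an
# assumption; the full prompts appear in Figures 4-7.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(metric_instructions: str, question: str, answer: str, context: str) -> dict:
    """Ask GPT-4o for a True/False verdict plus an explanation."""
    prompt = (
        f"{metric_instructions}\n\n"
        f"Question: {question}\nAnswer: {answer}\nContext: {context}\n\n"
        'Respond as JSON: {"score": true or false, "explanation": "..."}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# Example: groundedness check
# verdict = judge("Decide whether the answer is fully supported by the context.",
#                 question, answer, retrieved_facts)
```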

3.8. Summarization Metrics

3.8.1. Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

The ROUGE metric is widely used to assess the quality of LLM-generated summaries by comparing them to reference summaries [42]. Reference summaries are authoritative, LLM-generated or expert-validated summaries of medical reports that serve as the gold standard for evaluating the quality of automatically generated summaries. In the context of our research, the ROUGE metric was employed to evaluate the lexical and semantic similarity between the summaries generated by our MERA system and the reference summaries.
The ROUGE metric operates by calculating the overlap of n-grams (contiguous sequences of n words) between the generated summary and the reference summary. In our evaluation, we focused on two key variants of the ROUGE metric: ROUGE-1 and ROUGE-2. ROUGE-1 measures the overlap of unigrams (single words), while ROUGE-2 measures the overlap of bigrams (pairs of words). These metrics provide insights into both the lexical precision and recall of the generated summaries. Precision indicates the proportion of overlapping n-grams in the generated summary that are also present in the reference summary, while recall measures the proportion of n-grams in the reference summary that are captured in the generated summary. The F1-score, which is the harmonic mean of precision and recall, offers a balanced measure of the summary’s overall quality.
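The ROUGE-1 and ROUGE-2 precision, recall, and F1 computation can be reproduced with the rouge-score package, as in the sketch below; the example summary strings are placeholders, not data from the study.

```python
# ROUGE-1 / ROUGE-2 precision, recall, and F1 with the rouge-score package.
# The example strings are placeholders, not study data.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

generated = "Patient admitted with chest pain; treated with aspirin and discharged."
reference = "The patient was admitted for chest pain, treated with aspirin, and discharged home."

scores = scorer.score(reference, generated)  # arguments: (target, prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```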

3.8.2. Jaccard Similarity Metric

The Jaccard similarity metric is another essential tool used in our evaluation framework to assess the quality of the summaries [43]. Unlike the ROUGE metric, which focuses on n-gram overlaps, the Jaccard similarity metric measures the overlap between two sets of words, providing a straightforward measure of how much the generated summary and the reference summary share in common. This metric is particularly useful for evaluating the overall content overlap and ensuring that the generated summaries capture the key points from the reference summaries.
The Jaccard similarity is calculated as the ratio of the intersection of the word sets from the generated summary and the reference summary to the union of these sets. Mathematically, it is expressed as
\text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}
where A represents the set of words in the generated summary, and B represents the set of words in the reference summary. The resulting score ranges from 0 to 1, where a score of 1 indicates perfect overlap, and a score of 0 indicates no overlap.
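A minimal implementation of this set-based measure follows; the tokenization choice (lowercased whitespace split) is an assumption, and the example strings are placeholders.

```python
# Set-based Jaccard similarity between a generated and a reference summary,
# following the formula above. Lowercased whitespace tokenization is assumed.
def jaccard_similarity(generated: str, reference: str) -> float:
    a = set(generated.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard_similarity(
    "Patient admitted with chest pain and discharged on aspirin.",
    "The patient was admitted with chest pain and discharged on aspirin."))
```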

3.8.3. Human Evaluation

The human evaluation of the summarized reports was conducted by two Subject Matter Experts (SMEs). They assessed the quality of the summaries based on criteria relevant to a physician’s needs, including clarity, conciseness, completeness, and logical structure. The SMEs evaluated 20 summaries (10 from each of the two models) and scored each on a scale of 1 to 10.

3.9. Similarity Search Evaluation

To evaluate the effectiveness of the similarity search, we used the top 20 most common diagnoses (see Figure 8) and tested the model’s ability to retrieve documents with similar diagnoses. The retrieval performance was assessed using the METEOR (Metric for Evaluation of Translation with Explicit Ordering) score, a widely used metric that considers precision, recall, and synonym matching, making it well-suited for evaluating text similarity [44].
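The METEOR score can be computed with NLTK as sketched below; the WordNet data is required for synonym matching, recent NLTK versions expect pre-tokenized inputs, and the example diagnosis strings are placeholders rather than study data.

```python
# METEOR score via NLTK (WordNet data needed for synonym matching). Recent NLTK
# versions require pre-tokenized reference and hypothesis. Example strings are
# placeholders, not study data.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

retrieved = "stage III bladder cancer".split()
query_diagnosis = "bladder cancer".split()

score = meteor_score([query_diagnosis], retrieved)  # list of references, one hypothesis
print(f"METEOR: {score:.2f}")
```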

4. Results

To comprehensively assess the performance of our model, we conducted evaluations across multiple dimensions. The evaluation framework was structured into three main components—question answering, summarization, and similarity search—ensuring a holistic assessment of the system’s effectiveness. See Figure 9 for an overview of the different chat functionalities.

4.1. Question-Answering Evaluation

To evaluate the model’s ability to answer questions, we categorized the task into two scenarios: single-patient and multiple-patient contexts. Each scenario was further subdivided based on the expected answer type: short answers, long answers, and numerical answers. The QA evaluation was conducted on a test dataset of 600 samples, evenly distributed among reports generated by Mistral, Qwen, and Llama (200 samples each).
In the single-patient scenario, the model demonstrated strong performance across all evaluation metrics. The F1-scores further reinforced this robustness, with 0.94 for correctness, 0.99 for relevance, 0.92 for groundedness, and 0.93 for retrieval relevance (Table 1). Notably, precision exceeded 0.95 for all metrics except retrieval relevance (0.94), while recall values remained consistently high (≥0.89).
Figure 9. Different chat functionalities, where (A) represents a chat for question answering, providing precise answers based on input queries, (B) represents a chat for similarity search, retrieving relevant information based on semantic similarity, and (C) represents a chat for summarization, condensing key information into a concise summary, demonstrating the system’s diverse capabilities in processing and generating meaningful textual responses.
In the multiple-patient scenario, the model was required to handle questions involving two or more patients simultaneously. Despite the added complexity, the model maintained high performance across all metrics, with an accuracy of 0.91 for correctness, 0.97 for relevance, 0.87 for groundedness, and 0.93 for retrieval relevance. The F1-scores were similarly strong, reaching 0.94 for correctness, 0.98 for relevance, 0.91 for groundedness, and 0.93 for retrieval relevance (Table 2). These results demonstrate the model’s robust capability to effectively process and integrate information from multiple medical reports, even in complex multi-patient query scenarios.

4.2. Summarization Evaluation

For the summarization task, summaries were generated for 600 test samples, evenly distributed across reports from Mistral, Qwen, and Llama (200 each). GPT-4o summaries served as the reference standards for automatic evaluation, and a T5 model served as the baseline for human evaluation; a medical expert reviewed the reference summaries to confirm they were suitable standards. Summarization performance was assessed using multiple metrics: ROUGE, to evaluate lexical and semantic similarity between the generated and reference summaries, and Jaccard similarity, to measure content overlap. To ensure unbiased evaluation, Mistral, Qwen, and Llama were each used to assess summarization quality relative to the GPT-4o references, with results averaged across the three models.
In our experiments, ROUGE-1 Precision, Recall, and F1-scores were 0.68, 0.73, and 0.70, respectively. Similarly, ROUGE-2 Precision, Recall, and F1-scores were 0.54, 0.60, and 0.56, respectively. These results highlight MERA’s ability to generate summaries that align well with the lexical and structural patterns of the reference summaries, particularly in terms of word and phrase-level overlaps. Complementing the ROUGE metrics, the Jaccard similarity score was 0.73, indicating that the generated summaries capture a significant portion of the key information present in the reference summaries.
Our human evaluation revealed that SMEs consistently rated MERA higher than T5 for the quality of its summarized medical reports. As depicted in Figure 10A, MERA received superior scores across all evaluated cases, with SMEs highlighting its structured format, comprehensive coverage, and readability. MERA’s summaries effectively included critical medical details such as hospital course, treatment plans, and discharge instructions, making them more valuable for clinical decision-making. In contrast, T5, while often more concise, tended to omit key information and exhibited less optimal organization and consistency.

4.3. Similarity Search Evaluation

We conducted our similarity search experiments using different values of K (the number of retrieved documents), specifically K = 3, K = 5, and K = 10. The results are visualized in Figure 11, where the x-axis represents the 20 diagnoses and the y-axis shows the METEOR scores. The lowest recorded METEOR score was approximately 0.7, while the highest approached 1, indicating strong performance in retrieving relevant documents across different diagnoses. The variation in scores suggests that some diagnoses had more well-matched documents than others, potentially due to differences in the availability and specificity of textual descriptions. This variation arises because some diagnoses are repeated with slight wording differences or are general categories, such as “bladder cancer” rather than the more specific “stage III bladder cancer,” leading the system to retrieve cases at different stages of disease.
Figure 10. (A) Comparison of summarization scores between MERA and T5, evaluated by SMEs. Scores range from 1 (lowest) to 10 (highest). (B) Example of a summarized report generated by MERA. (C) Example of a summarized report generated by the baseline T5 model.
Figure 11. METEOR scores for the top 20 most common diagnoses with K = 3, K = 5, and K = 10.

4.4. Validation on Real-World EHRs

To address concerns about generalizability, we conducted a pilot validation of the MERA system using real-world, de-identified EHRs from the MIMIC-IV dataset [45]. A subset of 100 discharge reports was selected, and the system’s outputs were benchmarked against 800 ground-truth answers derived from these reports.
The average performance scores for question answering were 0.92 for correctness, 0.95 for relevance, 0.96 for groundedness, and 0.92 for retrieval relevance. Most answers were factually accurate; however, minor inconsistencies, such as differences in medication naming conventions (for example, “Aldactone” versus “Spironolactone”) and omissions of contextual details, such as explanations for medication adherence, contributed to slightly lower correctness scores.
Summarization was evaluated on the same subset of 100 reports. The ROUGE score was 0.45 and the Jaccard similarity was 0.51. These scores were lower than those observed with synthetic data, primarily due to the de-identification process in MIMIC-IV, which replaces sensitive information with placeholders such as “___”. While this impacts lexical overlap metrics, it does not result in the loss of essential clinical information.
For similarity search, the experiment used the same 100-report dataset, which included 10 reports with the discharge condition “nausea and vomiting,” 10 with “asthma,” and 80 randomly selected reports. The system was tasked with retrieving cases involving either “nausea and vomiting” or “asthma,” using K values of 3, 5, and 10 for the number of retrieved reports. The METEOR metric was used to assess relevance, and in all cases, the score was 1, indicating perfect retrieval of relevant cases.

5. Discussion

MERA is a RAG-based assistant that advances medical record analysis by combining domain-specific retrieval with LLMs. This approach enables MERA to deliver accurate, context-aware, and explainable clinical insights while addressing core limitations of traditional LLMs—such as hallucinations, outdated knowledge, and lack of transparency. A key advantage of the RAG architecture is its ability to ground responses in verifiable, context-specific data retrieved at inference time, thereby enhancing factual accuracy—especially in high-stakes clinical settings [46]. In MERA, all retrieved content is explicitly referenced, allowing clinicians to independently verify the accuracy and relevance of generated outputs. To our knowledge, MERA is the first RAG-based system to unify clinical question answering, report summarization, and similarity search within a single framework tailored for healthcare.
Across all evaluation metrics, MERA demonstrated robust performance in single-patient scenarios, achieving correctness of 0.91, relevance of 0.98, groundedness of 0.89, and retrieval relevance of 0.92. These results highlight the system’s ability to generate precise, contextually relevant answers, even for complex clinical queries. In more challenging multiple-patient scenarios, MERA maintained strong performance, with correctness at 0.91 and retrieval relevance at 0.93, underscoring its effectiveness in synthesizing information from diverse clinical sources.
Beyond question answering, MERA streamlines clinical workflows by automating both medical inquiry resolution and report summarization. This advanced functionality reduces clinicians’ cognitive and administrative burdens, while minimizing the risk of human error during data retrieval and interpretation. The summarization module, evaluated using ROUGE and Jaccard similarity metrics, showed strong alignment with expert-validated references, confirming MERA’s ability to distill large volumes of clinical data into concise insights [47].
The system’s similarity search capability is particularly valuable for clinicians, supporting differential diagnosis and personalized treatment planning. By quickly retrieving and summarizing similar patient cases, MERA enables doctors to review comparable clinical journeys, identify patterns in diagnoses, and explore potential treatment strategies. It can also reveal differences in how patients with similar profiles have responded to treatments—providing valuable insight into what approaches may be more effective or should be avoided. These insights help clinicians tailor care plans, replicate successful interventions, and make more informed, patient-specific decisions, all while reducing cognitive burden and saving time.
Despite these strengths, the initial development and benchmarking of MERA relied primarily on synthetic data to ensure privacy compliance. However, we have now evaluated MERA using de-identified real patient records from the MIMIC-IV dataset. This validation demonstrates that MERA maintains robust performance in question answering, summarization, and similarity search tasks when applied to authentic clinical documentation. The results confirm that MERA’s capabilities generalize beyond synthetic datasets, capturing the complexity and variability inherent in real-world EHRs.
MERA’s robust performance with medical data positions it as a promising tool for clinical decision support. Real clinical impact, however, depends on successful integration with live EHRs and compliance with established healthcare data standards. The path to deployment begins with pilot studies on de-identified EHRs to benchmark and refine the system. Adopting HL7 and FHIR standards enables structured and modular data exchange, while secure, compliant data pipelines ensure privacy and regulatory adherence.
Prospective validation with clinical partners is essential for assessing outputs and refining usability. Scalable deployment should include continuous monitoring and regular updates for evolving clinical guidelines and regulatory requirements. Future development will focus on full HL7/FHIR integration, privacy-preserving learning, automated concept mapping, and ongoing interface improvements based on clinician feedback, ensuring MERA becomes a clinically integrated, standards-compliant solution that delivers actionable insights while upholding data privacy and security.
The computational cost and scalability of MERA are shaped by its retrieval-augmented design, which substantially reduces inference time and token usage compared to full-text LLM approaches. By restricting context to only the most relevant retrieved passages, MERA achieves significant cost savings and faster response times. Notably, recent studies in clinical information retrieval have demonstrated that similar RAG-based systems can achieve over 70% reduction in token usage and inference time compared to full-document LLM inference, while maintaining or improving performance in medical applications.
Finally, the modular architecture of MERA allows distributed deployment and elastic scaling, enabling the system to handle large volumes of clinical data and user queries. Ongoing optimizations in retrieval, model architecture, and system engineering will further enhance MERA’s scalability and affordability for real-world clinical applications.

6. Conclusions

This work demonstrates that MERA effectively handles medical queries and generates accurate, relevant, and well-supported responses in both single- and multiple-patient scenarios, with strong summarization performance closely aligning with expert-validated references. The system’s similarity search capability enables clinicians to efficiently identify and compare analogous patient cases across large, heterogeneous datasets, supporting differential diagnosis and personalized treatment planning. Leveraging the RAG framework, MERA minimizes hallucinations and enhances explainability, as confirmed by robust results on both synthetic and de-identified real-world EHRs from the MIMIC-IV dataset. By grounding responses in verifiable data, synthesizing information from diverse sources, and delivering contextually relevant case matches, MERA supports clinical decision-making and streamlines data management, paving the way for real-world integration. Future efforts will focus on expanding real patient data integration, optimizing scalability, and extending capabilities to multimodal data, ensuring MERA continues to advance as a safe, effective, and explainable AI solution for healthcare delivery.

Author Contributions

Conceptualization, A.I. and A.S. (Ahmed Serag); methodology, A.I. and A.K.; software, A.I. and A.H.; validation, A.I., M.A., A.S. (Aamenah Sattar), A.H. and A.S. (Ahmed Serag); formal analysis, A.I., M.A., A.S. (Aamenah Sattar), A.H. and A.S. (Ahmed Serag); investigation, A.I., A.H. and A.S. (Ahmed Serag); resources, A.S. (Ahmed Serag); data curation, A.I., A.K. and A.S. (Ahmed Serag); writing—original draft preparation, A.I., A.K. and A.S. (Ahmed Serag); writing—review and editing, A.I., A.K., M.A., A.S. (Aamenah Sattar), A.H. and A.S. (Ahmed Serag); visualization, A.I. and A.H.; supervision, A.S. (Ahmed Serag); project administration, A.S. (Ahmed Serag). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, D.; Zhang, S. Large Language Models in Medical and Healthcare Fields: Applications, Advances, and Challenges. Artif. Intell. Rev. 2024, 57, 299. [Google Scholar] [CrossRef]
  2. Sallam, M. The utility of ChatGPT as an example of large language models in healthcare education, research and practice: Systematic review on the future perspectives and potential limitations. MedRxiv 2023. medRxiv:2023.02.19.23286155. [Google Scholar] [CrossRef]
  3. Ibrahim, A.; Hosseini, A.; Ibrahim, S.; Sattar, A.; Serag, A. D3: A Small Language Model for Drug-Drug Interaction prediction and comparison with Large Language Models. Mach. Learn. Appl. 2025, 20, 100658. [Google Scholar] [CrossRef]
  4. Ali, H.; Qadir, J.; Shah, Z. Chatgpt and large language models (llms) in healthcare: Opportunities and risks. In Proceedings of the 2023 IEEE International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings), Mount Pleasant, MI, USA, 16–17 September 2023; Volume 36227. [Google Scholar]
  5. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  6. Alber, D.A.; Yang, Z.; Alyakin, A.; Yang, E.; Rai, S.; Valliani, A.A.; Zhang, J.; Rosenbaum, G.R.; Amend-Thomas, A.K.; Kurland, D.B.; et al. Medical large language models are vulnerable to data-poisoning attacks. Nat. Med. 2025, 31, 618–626. [Google Scholar] [CrossRef]
  7. Han, T.; Nebelung, S.; Khader, F.; Wang, T.; Müller-Franzes, G.; Kuhl, C.; Försch, S.; Kleesiek, J.; Haarburger, C.; Bressem, K.K.; et al. Medical large language models are susceptible to targeted misinformation attacks. NPJ Digit. Med. 2024, 7, 288. [Google Scholar] [CrossRef] [PubMed]
  8. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  9. Ng, K.K.Y.; Matsuba, I.; Zhang, P.C. RAG in Health Care: A Novel Framework for Improving Communication and Decision-Making by Addressing LLM Limitations. NEJM AI 2025, 2, AIra2400380. [Google Scholar] [CrossRef]
  10. Anandavally, B.B. Improving Clinical Support Through Retrieval-Augmented Generation Powered Virtual Health Assistants. J. Comput. Commun. 2024, 12, 86–94. [Google Scholar] [CrossRef]
  11. Nashwan, A.J.; AbuJaber, A.A. Harnessing the power of large language models (LLMs) for electronic health records (EHRs) optimization. Cureus 2023, 15, e42634. [Google Scholar] [CrossRef]
  12. Wang, Y.; Yin, C.; Zhang, P. Multimodal risk prediction with physiological signals, medical images and clinical notes. Heliyon 2024, 10, e26772. [Google Scholar] [CrossRef] [PubMed]
  13. Ben Rabah, C.; Sattar, A.; Ibrahim, A.; Serag, A. A Multimodal Deep Learning Model for the Classification of Breast Cancer Subtypes. Diagnostics 2025, 15, 995. [Google Scholar] [CrossRef]
  14. Ben Rabah, C.; Petropoulos, I.N.; Malik, R.A.; Serag, A. Vision transformers for automated detection of diabetic peripheral neuropathy in corneal confocal microscopy images. Front. Imaging 2025, 4, 1542128. [Google Scholar] [CrossRef]
  15. Huang, S.C.; Pareek, A.; Seyyedi, S.; Banerjee, I.; Lungren, M.P. Fusion of medical imaging and electronic health records using deep learning: A systematic review and implementation guidelines. NPJ Digit. Med. 2020, 3, 136. [Google Scholar] [CrossRef]
  16. Ayers, J.W.; Poliak, A.; Dredze, M.; Leas, E.C.; Zhu, Z.; Kelley, J.B.; Faix, D.J.; Goodman, A.M.; Longhurst, C.A.; Hogarth, M.; et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern. Med. 2023, 183, 589–596. [Google Scholar] [CrossRef]
  17. Shi, X.; Liu, Z.; Du, L.; Wang, Y.; Wang, H.; Guo, Y.; Ruan, T.; Xu, J.; Zhang, X.; Zhang, S. Medical Dialogue System: A Survey of Categories, Methods, Evaluation and Challenges. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; pp. 2840–2861. [Google Scholar] [CrossRef]
  18. Helmy, H.; Rabah, C.B.; Ali, N.; Ibrahim, A.; Hoseiny, A.; Serag, A. Optimizing ICU Readmission Prediction: A Comparative Evaluation of AI Tools. In Proceedings of the International Workshop on Applications of Medical AI, Amsterdam, The Netherlands, 13–17 August 2024; Wu, S., Shabestari, B., Xing, L., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 95–104. [Google Scholar]
  19. Zhang, W.; Zhang, J. Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review. Mathematics 2025, 13, 856. [Google Scholar] [CrossRef]
  20. Ayala, O.; Bechard, P. Reducing hallucination in structured outputs via Retrieval-Augmented Generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), Mexico City, Mexico, 16–21 June 2024; Yang, Y., Davani, A., Sil, A., Kumar, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 228–238. [Google Scholar] [CrossRef]
  21. Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 2024, 30, 1134–1142. [Google Scholar] [CrossRef] [PubMed]
  22. Aali, A.; Van Veen, D.; Arefeen, Y.I.; Hom, J.; Bluethgen, C.; Reis, E.P.; Gatidis, S.; Clifford, N.; Daws, J.; Tehrani, A.S.; et al. A dataset and benchmark for hospital course summarization with adapted large language models. J. Am. Med. Inform. Assoc. 2024, 32, 470–479. [Google Scholar] [CrossRef]
  23. Nazary, F.; Deldjoo, Y.; Di Noia, T. Chatgpt-healthprompt. harnessing the power of xai in prompt-based healthcare decision support using chatgpt. In Proceedings of the European Conference on Artificial Intelligence, Kraków, Poland, 30 September–4 October 2023; Nowaczyk, S., Biecek, P., Chung, N.C., Vallati, M., Skruch, P., Jaworek-Korjakowska, J., Parkinson, S., Nikitas, A., Atzmüller, M., Kliegr, T., et al., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 382–397. [Google Scholar]
  24. Bora, A.; Cuayáhuitl, H. Systematic Analysis of Retrieval-Augmented Generation-Based LLMs for Medical Chatbot Applications. Mach. Learn. Knowl. Extr. 2024, 6, 2355–2374. [Google Scholar] [CrossRef]
  25. Quidwai, M.A.; Lagana, A. A RAG Chatbot for Precision Medicine of Multiple Myeloma. MedRxiv 2024. medRxiv:2024.03.14.24304293. [Google Scholar] [CrossRef]
  26. Sanna, L.; Bellan, P.; Magnolini, S.; Segala, M.; Haez, S.G.; Consolandi, M.; Dragoni, M. Building Certified Medical Chatbots: Overcoming Unstructured Data Limitations with Modular RAG. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, Torino, Italia, 20–25 May 2024; Demner-Fushman, D., Ananiadou, S., Thompson, P., Ondov, B., Eds.; ELRA and ICCL: Paris, France, 2024; pp. 124–130. Available online: https://aclanthology.org/2024.cl4health-1.15/ (accessed on 15 July 2025).
  27. Vasili, A.; Schiza, E.; Schizas, C.N.; Pattichis, C.S. Integrating Chatbot Functionality in a Patient Summary Based Healthcare System. Stud. Health Technol. Inform. 2024, 316, 296–300. [Google Scholar]
  28. Neupane, S.; Tripathi, H.; Mitra, S.; Bozorgzad, S.; Mittal, S.; Rahimi, S.; Amirlatifi, A. CLINICSUM: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 5050–5059. [Google Scholar]
  29. Sivarajkumar, S.; Mohammad, H.A.; Oniani, D.; Roberts, K.; Hersh, W.; Liu, H.; He, D.; Visweswaran, S.; Wang, Y. Clinical information retrieval: A literature review. J. Healthc. Inform. Res. 2024, 8, 313–352. [Google Scholar] [CrossRef] [PubMed]
  30. Gupta, V.; Sachdeva, S.; Bhalla, S. A novel deep similarity learning approach to electronic health records data. IEEE Access 2020, 8, 209278–209295. [Google Scholar] [CrossRef]
  31. Nalela, P. Leveraging Generative AI Through Prompt Engineering and Rigorous Validation to Create Comprehensive Synthetic Datasets for AI Training in Healthcare. arXiv 2025, arXiv:2504.20921. [Google Scholar] [CrossRef]
  32. Hosseini, A.; Serag, A. Is synthetic data generation effective in maintaining clinical biomarkers? Investigating diffusion models across diverse imaging modalities. Front. Artif. Intell. 2025, 7, 1454441. [Google Scholar] [CrossRef]
  33. Hosseini, A.; Serag, A. Self-Supervised Learning Powered by Synthetic Data From Diffusion Models: Application to X-Ray Images. IEEE Access 2025, 13, 59074–59084. [Google Scholar] [CrossRef]
  34. Jadon, A.; Kumar, S. Leveraging generative ai models for synthetic data generation in healthcare: Balancing research and privacy. In Proceedings of the 2023 International Conference on Smart Applications, Communications and Networking (SmartNets), Istanbul, Turkey, 25–27 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4. [Google Scholar]
  35. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.D.L.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  36. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2025, arXiv:2412.15115. [Google Scholar]
  37. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  38. Déjean, H.; Clinchant, S.; Formal, T. A thorough comparison of cross-encoders and llms for reranking splade. arXiv 2024, arXiv:2403.10407. [Google Scholar] [CrossRef]
  39. Pradeep, R.; Liu, Y.; Zhang, X.; Li, Y.; Yates, A.; Lin, J. Squeezing Water from a Stone: A Bag of Tricks for Further Improving Cross-Encoder Effectiveness for Reranking. In Proceedings of the European Conference on Information Retrieval, Stavanger, Norway, 10–14 April 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 655–670, ISBN 978-3-030-99736-6. [Google Scholar]
  40. Shool, S.; Adimi, S.; Saboori Amleshi, R.; Bitaraf, E.; Golpira, R.; Tara, M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025, 25, 117. [Google Scholar] [CrossRef]
  41. Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. GPT-4o System Card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
  42. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  43. Niwattanakul, S.; Singthongchai, J.; Naenudorn, E.; Wanapu, S. Using of Jaccard coefficient for keywords similarity. In Proceedings of the International Multiconference Of Engineers and Computer Scientists, Hong Kong, China, 13–15 March 2013; Volume 1, pp. 380–384. [Google Scholar]
  44. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; Goldstein, J., Lavie, A., Lin, C.Y., Voss, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72. Available online: https://aclanthology.org/W05-0909/ (accessed on 15 July 2025).
  45. Johnson, A.E.; Pollard, T.J.; Horng, S.; Celi, L.A.; Mark, R. MIMIC-IV-Note: Deidentified Free-Text Clinical Notes (Version 2.2). [Dataset] PhysioNet. RRID:SCR_007345. 2023. Available online: https://physionet.org/content/mimic-iv-note/2.2/ (accessed on 1 July 2025). [CrossRef]
  46. Gargari, O.K.; Habibi, G. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digit. Health 2025, 11, 20552076251337177. [Google Scholar] [CrossRef] [PubMed]
  47. Pu, X.; Gao, M.; Wan, X. Summarization is (almost) dead. arXiv 2023, arXiv:2309.09558. [Google Scholar] [CrossRef]
Figure 2. Graphical illustration of the MERA architecture. It comprises three modules: indexing of the medical records, retrieval based on the user’s query, and response generation by the LLM using the retrieved document and the prompt.
Figure 3. (A) shows the independent encoding of the bi-encoder, and (B) demonstrates the joint processing of the cross-encoder.
Figure 4. Prompt in markdown for correctness metric.
Figure 5. Prompt in markdown for relevancy metric.
Figure 6. Prompt in markdown for groundedness metric.
Figure 7. Prompt in markdown for retrieval-relevance metric.
Figure 8. Distribution of the Top 20 primary diagnoses across Llama, Mistral, and Qwen Datasets.
Table 1. Performance metrics for single-patient question-answering evaluation.

Metric    | Correctness | Relevance | Groundedness | Retrieval Relevance
Accuracy  | 0.91        | 0.98      | 0.89         | 0.92
Precision | 0.97        | 0.99      | 0.95         | 0.94
Recall    | 0.91        | 0.98      | 0.89         | 0.92
F1-Score  | 0.94        | 0.99      | 0.92         | 0.93
Table 2. Performance metrics for multiple-patient question-answering evaluation.

Metric    | Correctness | Relevance | Groundedness | Retrieval Relevance
Accuracy  | 0.91        | 0.97      | 0.87         | 0.93
Precision | 0.97        | 0.99      | 0.94         | 0.94
Recall    | 0.91        | 0.97      | 0.88         | 0.93
F1-Score  | 0.94        | 0.98      | 0.91         | 0.93

