Search Results (10)

Search Parameters:
Keywords = synthetic electronic health records (EHR)

19 pages, 6095 KiB  
Article
MERA: Medical Electronic Records Assistant
by Ahmed Ibrahim, Abdullah Khalili, Maryam Arabi, Aamenah Sattar, Abdullah Hosseini and Ahmed Serag
Mach. Learn. Knowl. Extr. 2025, 7(3), 73; https://doi.org/10.3390/make7030073 - 30 Jul 2025
Viewed by 414
Abstract
The increasing complexity and scale of electronic health records (EHRs) demand advanced tools for efficient data retrieval, summarization, and comparative analysis in clinical practice. MERA (Medical Electronic Records Assistant) is a Retrieval-Augmented Generation (RAG)-based AI system that addresses these needs by integrating domain-specific retrieval with large language models (LLMs) to deliver robust question answering, similarity search, and report summarization functionalities. MERA is designed to overcome key limitations of conventional LLMs in healthcare, such as hallucinations, outdated knowledge, and limited explainability. To ensure both privacy compliance and model robustness, we constructed a large synthetic dataset using state-of-the-art LLMs, including Mistral v0.3, Qwen 2.5, and Llama 3, and further validated MERA on de-identified real-world EHRs from the MIMIC-IV-Note dataset. Comprehensive evaluation demonstrates MERA’s high accuracy in medical question answering (correctness: 0.91; relevance: 0.98; groundedness: 0.89; retrieval relevance: 0.92), strong summarization performance (ROUGE-1 F1-score: 0.70; Jaccard similarity: 0.73), and effective similarity search (METEOR: 0.7–1.0 across diagnoses), with consistent results on real EHRs. The similarity search module empowers clinicians to efficiently identify and compare analogous patient cases, supporting differential diagnosis and personalized treatment planning. By generating concise, contextually relevant, and explainable insights, MERA reduces clinician workload and enhances decision-making. To our knowledge, this is the first system to integrate clinical question answering, summarization, and similarity search within a unified RAG-based framework.
(This article belongs to the Special Issue Advances in Machine and Deep Learning)
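
The abstract describes the RAG pipeline only at a high level; as a rough sketch of that general pattern (not MERA's actual implementation), the following Python snippet retrieves the EHR snippets most similar to a question and hands them to a language model. The embedding model, sample records, and the generate() placeholder are all assumptions.

```python
# Minimal retrieval-augmented generation (RAG) sketch over EHR snippets.
# Illustrative only: MERA's real pipeline is not published in this listing.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

records = [
    "Patient A: type 2 diabetes, metformin 500 mg, HbA1c 7.9%.",
    "Patient B: heart failure with reduced ejection fraction, on beta-blockers.",
    "Patient C: community-acquired pneumonia, treated with amoxicillin.",
]
doc_vecs = encoder.encode(records, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k records most similar to the question (cosine similarity)."""
    q = encoder.encode([question], normalize_embeddings=True)[0]
    return [records[i] for i in np.argsort(-(doc_vecs @ q))[:k]]

def answer(question: str) -> str:
    """Ground the LLM's answer in retrieved records to curb hallucination."""
    context = "\n".join(retrieve(question))
    prompt = f"Answer using ONLY this context:\n{context}\n\nQ: {question}\nA:"
    return generate(prompt)  # placeholder for an LLM call (e.g., Llama 3)
```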

18 pages, 706 KiB  
Article
A Design Architecture for Decentralized and Provenance-Assisted eHealth Systems for Enhanced Personalized Medicine
by Wagno Leão Sergio, Victor Ströele and Regina Braga
J. Pers. Med. 2025, 15(7), 325; https://doi.org/10.3390/jpm15070325 - 19 Jul 2025
Viewed by 313
Abstract
Background/Objectives: Electronic medical record systems play a crucial role in the operation of modern healthcare institutions, providing the foundational data necessary for advancements in personalized medicine. Despite their importance, the software supporting these systems frequently experiences data availability and integrity issues, particularly concerning patients’ personal information. This study presents a decentralized architecture that integrates both clinical and personal patient data, with a provenance mechanism to enable data tracing and auditing, ultimately supporting more precise and personalized healthcare decisions. Methods: A system implementation based on the solution was developed, and a feasibility study was conducted with synthetic medical record data. Results: The system correctly received data for 190 instances of the designed entities, covering different types of medical records, and generated 573 provenance entries that captured the context of the associated medical information in detail. Conclusions: In this first research cycle, the developed system validated the main features of the solution, supporting the feasibility of a decentralized EHR and PHR health system with formal provenance data tracking. Such a system lays a robust foundation for secure and reliable data management, which is essential for the effective implementation and future development of personalized medicine initiatives.
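
The abstract does not specify the provenance schema; purely as an illustration of what a per-record provenance entry might look like in such a system, here is a minimal W3C PROV-inspired sketch in Python (all field names are hypothetical):

```python
# Hypothetical provenance entry (entity / activity / agent), loosely
# following the W3C PROV model; not the paper's actual schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceEntry:
    entity_id: str   # the medical record being tracked
    activity: str    # e.g., "created", "updated", "accessed"
    agent_id: str    # clinician or service responsible
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

audit_log: list[ProvenanceEntry] = []

def record_provenance(entity_id: str, activity: str, agent_id: str) -> None:
    """Append an entry so a record's full history can be traced and audited."""
    audit_log.append(ProvenanceEntry(entity_id, activity, agent_id))

record_provenance("ehr:patient-42/note-7", "created", "agent:dr-smith")
print(audit_log[0])
```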

21 pages, 817 KiB  
Article
C3-VULMAP: A Dataset for Privacy-Aware Vulnerability Detection in Healthcare Systems
by Jude Enenche Ameh, Abayomi Otebolaku, Alex Shenfield and Augustine Ikpehai
Electronics 2025, 14(13), 2703; https://doi.org/10.3390/electronics14132703 - 4 Jul 2025
Viewed by 424
Abstract
The increasing integration of digital technologies in healthcare has expanded the attack surface for privacy violations in critical systems such as electronic health records (EHRs), telehealth platforms, and medical device software. However, current vulnerability detection datasets lack domain-specific privacy annotations essential for compliance with healthcare regulations like HIPAA and GDPR. This study presents C3-VULMAP, a novel and large-scale dataset explicitly designed for privacy-aware vulnerability detection in healthcare software. The dataset comprises over 30,000 vulnerable and 7.8 million non-vulnerable C/C++ functions, annotated with CWE categories and systematically mapped to LINDDUN privacy threat types. The objective is to support the development of automated, privacy-focused detection systems that can identify fine-grained software vulnerabilities in healthcare environments. To achieve this, we developed a hybrid construction methodology combining manual threat modeling, LLM-assisted synthetic generation, and multi-source aggregation. We then conducted comprehensive evaluations using traditional machine learning algorithms (Support Vector Machines, XGBoost), graph neural networks (Devign, Reveal), and transformer-based models (CodeBERT, RoBERTa, CodeT5). The results demonstrate that transformer models, such as RoBERTa, achieve high detection performance (F1 = 0.987), while Reveal leads GNN-based methods (F1 = 0.993), with different models excelling across specific privacy threat categories. These findings validate C3-VULMAP as a powerful benchmarking resource and show its potential to guide the development of privacy-preserving, secure-by-design software in embedded and electronic healthcare systems. The dataset fills a critical gap in privacy threat modeling and vulnerability detection and is positioned to support future research in cybersecurity and intelligent electronic systems for healthcare.
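
As a hedged illustration of the transformer-based detection the paper benchmarks, the sketch below scores a C function with a CodeBERT-style classifier via Hugging Face transformers; the model name is one of the evaluated families, but the inference setup shown is an assumption.

```python
# Sketch: classifying a C/C++ function as vulnerable vs. non-vulnerable
# with a transformer encoder. Illustrative; in practice the classification
# head is fine-tuned on labeled C3-VULMAP functions first.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "microsoft/codebert-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

func = 'void copy(char *dst, char *src) { strcpy(dst, src); }  /* CWE-120 */'
inputs = tok(func, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print({"non-vulnerable": probs[0, 0].item(), "vulnerable": probs[0, 1].item()})
```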

11 pages, 817 KiB  
Article
Investigating De-Identification Methodologies in Dutch Medical Texts: A Replication Study of Deduce and Deidentify
by Pablo Mosteiro, Ruilin Wang, Floortje Scheepers and Marco Spruit
Electronics 2025, 14(8), 1636; https://doi.org/10.3390/electronics14081636 - 18 Apr 2025
Viewed by 499
Abstract
De-identifying sensitive information in electronic health records (EHRs) is increasingly important as legal obligations to data privacy evolve along with the need to protect patient and institutional confidentiality. This study comparatively evaluates the performance of two state-of-the-art de-identification systems, Deduce and Deidentify, on both real-world and synthetic Dutch medical texts, providing insights into their relative strengths and limitations in preserving privacy while maintaining data utility. We employ a replication-extension research design, utilizing two distinct datasets: (1) the Annotation-Based Dataset from the Utrecht University Medical Center (UMC Utrecht), comprising manually annotated patient records spanning 1987 to 2021, and (2) the Synthetic Dataset, generated using a two-step process involving OpenAI’s GPT-4 model. Using precision, recall, and F1 scores as evaluation metrics, we uncover the relative strengths and limitations of the two methods. Our findings indicate that both techniques show variable performance across the different entity types being de-identified. Deduce outperforms Deidentify in overall accuracy by a margin of 0.42 on the synthetic dataset. On the real-world annotation-based dataset, the generalization ability of Deidentify is 0.2 lower than that of Deduce. However, the performance of both techniques is affected by the limitations of the dataset. In conclusion, this study provides valuable insights into the comparative performance of Deduce and Deidentify for de-identifying Dutch EHRs, contributing to the development of more effective privacy-preservation techniques in the healthcare domain.
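
Precision, recall, and F1 for de-identification are typically computed over predicted vs. gold entity spans; a minimal sketch of that calculation (the span format is an assumption, not taken from the paper):

```python
# Entity-level precision/recall/F1 over (start, end, label) spans.
def prf(gold: set, pred: set) -> tuple[float, float, float]:
    tp = len(gold & pred)                      # exact span-and-label matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 12, "PERSON"), (20, 30, "DATE")}
pred = {(0, 12, "PERSON"), (35, 40, "LOCATION")}
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```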

30 pages, 2184 KiB  
Article
Synthetic Data as a Proxy for Real-World Electronic Health Records in the Patient Length of Stay Prediction
by Dominik Bietsch, Robert Stahlbock and Stefan Voß
Sustainability 2023, 15(18), 13690; https://doi.org/10.3390/su151813690 - 13 Sep 2023
Cited by 9 | Viewed by 2511
Abstract
While generative artificial intelligence has gained popularity, e.g., for the creation of images, it can also be used to create synthetic tabular data. This holds great potential, especially for the healthcare industry, where data are often scarce and subject to privacy restrictions. For instance, the creation of synthetic electronic health records (EHRs) promises to improve the use of machine learning algorithms, which typically require large amounts of data. This also applies to the prediction of the patient length of stay (LOS), a key measure for hospitals and one of the core tools decision makers use to plan the allocation of resources. This paper therefore adds to the still-young research on applying generative adversarial networks (GANs) to tabular EHRs, with the intention of leveraging the advantages of synthetic data for LOS prediction and thereby contributing to the efficiency-enhancing and cost-saving aspirations of hospitals and insurance companies. It examines the applicability of GAN-generated synthetic data as a proxy for scarce real-world EHRs in the multi-class patient LOS classification task. In this context, the Conditional Tabular GAN (CTGAN) and the Copula GAN are selected as the underlying models, as they are state-of-the-art GAN architectures designed for generating synthetic tabular data. The CTGAN is found to be the superior model for the underlying use case. Nevertheless, the paper shows that there is still room for improvement when applying state-of-the-art GAN architectures to clinical healthcare data.
(This article belongs to the Section Health, Well-Being and Sustainability)
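
For readers unfamiliar with the workflow, here is a minimal sketch of the CTGAN approach the paper evaluates, assuming the open-source ctgan package and invented column names (categoricals are taken to be integer-encoded so the downstream classifier runs as-is):

```python
# Fit a CTGAN on tabular EHR data, sample synthetic rows, then train a
# LOS classifier on them and test on held-out real records.
# Assumes: pip install ctgan scikit-learn pandas
import pandas as pd
from ctgan import CTGAN
from sklearn.ensemble import RandomForestClassifier

real = pd.read_csv("ehr.csv")  # hypothetical, integer-encoded EHR extract
discrete = ["gender", "admission_type", "los_class"]  # multi-class target

gan = CTGAN(epochs=300)
gan.fit(real, discrete_columns=discrete)
synthetic = gan.sample(10_000)  # synthetic proxy for scarce real data

clf = RandomForestClassifier(random_state=0).fit(
    synthetic.drop(columns="los_class"), synthetic["los_class"]
)
# The key test: does a model trained purely on synthetic data generalize
# to real patients?
print(clf.score(real.drop(columns="los_class"), real["los_class"]))
```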

13 pages, 691 KiB  
Article
Synthesizing Electronic Health Records for Predictive Models in Low-Middle-Income Countries (LMICs)
by Ghadeer O. Ghosheh, C. Louise Thwaites and Tingting Zhu
Biomedicines 2023, 11(6), 1749; https://doi.org/10.3390/biomedicines11061749 - 18 Jun 2023
Cited by 7 | Viewed by 2592
Abstract
The spread of machine learning models, coupled with the growing adoption of electronic health records (EHRs), has opened the door for developing clinical decision support systems. However, despite the great promise of machine learning for healthcare in low-middle-income countries (LMICs), many data-specific limitations, such as small dataset size and irregular sampling, hinder progress in such applications. Recently, deep generative models have been proposed to generate realistic-looking synthetic data, including EHRs, by learning the underlying data distribution without compromising patient privacy. In this study, we first use a deep generative model to generate synthetic data based on a small dataset (364 patients) from an LMIC setting. Next, we use the synthetic data to build models that predict the onset of hospital-acquired infections based on minimal information collected at patient ICU admission. The diagnostic model trained on the synthetic data outperformed models trained on the original data and on data oversampled with techniques such as SMOTE. We also experiment with varying the size of the synthetic data and observe the impact on the performance and interpretability of the models. Our results show the promise of using deep generative models to enable healthcare data owners to develop and validate models that serve their needs and applications, despite limitations in dataset size.
(This article belongs to the Section Biomedical Engineering and Materials)
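
The abstract's key comparison, training on generated synthetic data vs. oversampling the small real dataset, can be mimicked on toy data. In this runnable sketch only the SMOTE arm is shown, with a comment marking where the generative model's samples would slot in (imblearn and scikit-learn assumed):

```python
# Toy stand-in for a small, imbalanced LMIC dataset: oversample with
# SMOTE and fit a classifier. In the paper, samples from a deep
# generative model replace (and outperform) the SMOTE-resampled data.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=364, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
# X_bal, y_bal = generative_model.sample(...)  # the paper's alternative

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print("held-out AUROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```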

21 pages, 9663 KiB  
Article
The “Coherent Data Set”: Combining Patient Data and Imaging in a Comprehensive, Synthetic Health Record
by Jason Walonoski, Dylan Hall, Karen M. Bates, M. Heath Farris, Joseph Dagher, Matthew E. Downs, Ryan T. Sivek, Ben Wellner, Andrew Gregorowicz, Marc Hadley, Francis X. Campion, Lauren Levine, Kevin Wacome, Geoff Emmer, Aaron Kemmer, Maha Malik, Jonah Hughes, Eldesia Granger and Sybil Russell
Electronics 2022, 11(8), 1199; https://doi.org/10.3390/electronics11081199 - 9 Apr 2022
Cited by 8 | Viewed by 13504
Abstract
The “Coherent Data Set” is a novel synthetic data set that leverages structured data from Synthea™ to create a longitudinal, “coherent” patient-level electronic health record (EHR). Composed of synthetic patients, the Coherent Data Set is publicly available, reproducible using Synthea™, and free of the privacy risks that arise from using real patient data. It provides complex and representative health records that can be leveraged by health IT professionals without the risks associated with de-identified patient data. It includes familial genomes created through a simulation of the genetic reproduction process; magnetic resonance imaging (MRI) DICOM files created with a voxel-based computational model; clinical notes in the style of traditional subjective, objective, assessment, and plan (SOAP) notes; and physiological data that leverage existing Systems Biology Markup Language (SBML) models to capture non-linear changes in patient health metrics. HL7 Fast Healthcare Interoperability Resources (FHIR®) links the data together. The models can generate clinically logical health data, but ensuring clinical validity remains a challenge without comparable data to substantiate results. We believe this data set is the first of its kind and a novel contribution to practical health interoperability efforts.
(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)
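
Synthea exports standard FHIR JSON Bundles; as a small, hedged example of how such linked records can be traversed (the file name and the specific resources are illustrative):

```python
# Index a FHIR Bundle's resources by "ResourceType/id" and resolve the
# references that tie the Coherent Data Set's records together.
import json

with open("patient_bundle.fhir.json") as f:  # hypothetical Synthea output
    bundle = json.load(f)

by_ref = {}
for entry in bundle.get("entry", []):
    res = entry["resource"]
    by_ref[f'{res["resourceType"]}/{res["id"]}'] = res

# e.g., follow each Observation back to its Patient resource:
for ref, res in by_ref.items():
    if res["resourceType"] == "Observation":
        subject = res["subject"]["reference"]  # e.g., "Patient/123"
        if subject in by_ref:
            print(ref, "->", subject)
```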

21 pages, 1400 KiB  
Article
The Problem of Fairness in Synthetic Healthcare Data
by Karan Bhanot, Miao Qi, John S. Erickson, Isabelle Guyon and Kristin P. Bennett
Entropy 2021, 23(9), 1165; https://doi.org/10.3390/e23091165 - 4 Sep 2021
Cited by 63 | Viewed by 7618
Abstract
Access to healthcare data such as electronic health records (EHRs) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to drive decisions and methods based on realistic data. Healthcare data can include information about multiple inpatient and outpatient visits, making it a time-series dataset that is often influenced by protected attributes such as age, gender, and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must “fairly” represent diverse minority subgroups so that the conclusions drawn on synthetic data are correct and the results can be generalized to real data. In this article, we develop two fairness metrics for synthetic data and apply them to all subgroups defined by protected attributes to assess the bias in three published synthetic research datasets. These covariate-level disparity metrics reveal that synthetic data may not be representative at the univariate and multivariate subgroup levels; thus, fairness should be addressed when developing data generation methods. We discuss the need to measure fairness in synthetic healthcare data so that robust machine learning models and more equitable synthetic healthcare datasets can be developed.
(This article belongs to the Special Issue Representation Learning: Theory, Applications and Ethical Issues)
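
The abstract does not define the two metrics; as a generic illustration of a covariate-level representation check in the same spirit, one can compare subgroup frequencies between real and synthetic data (this sketch is an assumption, not the paper's formulation):

```python
# Absolute difference in subgroup proportions between real and synthetic
# data: 0 means the subgroup is represented identically in both.
import pandas as pd

def subgroup_disparity(real: pd.Series, synth: pd.Series) -> pd.Series:
    p_real = real.value_counts(normalize=True)
    p_synth = synth.value_counts(normalize=True)
    return p_real.subtract(p_synth, fill_value=0).abs().sort_values(
        ascending=False
    )

real = pd.Series(["F", "F", "M", "F", "M", "F"])   # toy protected attribute
synth = pd.Series(["M", "M", "M", "F", "M", "M"])
print(subgroup_disparity(real, synth))  # F: 0.5, M: 0.5 -> F under-represented
```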

12 pages, 1027 KiB  
Article
Supervised Analysis for Phenotype Identification: The Case of Heart Failure Ejection Fraction Class
by Cristina Lopez, Jose Luis Holgado, Raquel Cortes, Inma Sauri, Antonio Fernandez, Jose Miguel Calderon, Julio Nuñez and Josep Redon
Bioengineering 2021, 8(6), 85; https://doi.org/10.3390/bioengineering8060085 - 21 Jun 2021
Cited by 2 | Viewed by 3544
Abstract
Artificial intelligence is creating a paradigm shift in health care, with phenotyping patients through clustering techniques being one of the areas of interest. Objective: To develop a predictive model to classify heart failure (HF) patients according to their left ventricular ejection fraction (LVEF), using available data from electronic health records (EHRs). Subjects and methods: 2854 subjects over 25 years old with a diagnosis of HF and an LVEF measured by echocardiography were selected to develop an algorithm to predict patients with reduced EF using supervised analysis. The performance of the developed algorithm was tested on heart failure patients from primary care. To select the most influential variables, the LASSO algorithm was used, and to address the strong class imbalance, we used the Synthetic Minority Oversampling Technique (SMOTE). Finally, Random Forest (RF) and XGBoost models were constructed. Results: The full XGBoost model obtained the maximum accuracy, a high negative predictive value, and the highest positive predictive value. Gender, age, unstable angina, atrial fibrillation, and acute myocardial infarction are the variables that most influence the EF value. Applied to the EHR dataset, with a total of 25,594 patients with an ICD code for HF and no regular follow-up in cardiology clinics, 6170 (21.1%) were identified as belonging to the reduced-EF group. Conclusion: The obtained algorithm was able to identify a number of HF patients with reduced ejection fraction who could benefit from a protocol with a strong possibility of success. Furthermore, the methodology can be used for studies using data extracted from electronic health records.
(This article belongs to the Special Issue Machine Learning-Based Heart, Brain and Nerve Tissue Engineering)
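
The modeling pipeline named in the abstract (LASSO selection, SMOTE balancing, XGBoost) maps onto standard libraries; a hedged, toy-data sketch:

```python
# LASSO-style (L1) feature selection -> SMOTE class balancing -> XGBoost,
# on toy data standing in for the EHR-derived variables.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2854, n_features=30, weights=[0.8],
                           random_state=0)

# Keep features with non-zero L1 coefficients (the "most influential").
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)
X_sel = selector.transform(X)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_sel, y)
model = XGBClassifier(eval_metric="logloss").fit(X_bal, y_bal)
print("features kept:", X_sel.shape[1])
```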

24 pages, 893 KiB  
Article
Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records
by Claudia Alessandra Libbi, Jan Trienes, Dolf Trieschnigg and Christin Seifert
Future Internet 2021, 13(5), 136; https://doi.org/10.3390/fi13050136 - 20 May 2021
Cited by 23 | Viewed by 6617
Abstract
A major hurdle in the development of natural language processing (NLP) methods for Electronic Health Records (EHRs) is the lack of large, annotated datasets. Privacy concerns prevent the distribution of EHRs, and the annotation of data is known to be costly and cumbersome. Synthetic data presents a promising solution to the privacy concern, provided it has comparable utility to real data and preserves the privacy of patients. However, the generation of synthetic text alone is not useful for NLP because of the lack of annotations. In this work, we propose the use of neural language models (LSTM and GPT-2) for generating artificial EHR text jointly with annotations for named-entity recognition. Our experiments show that artificial documents can be used to train a supervised named-entity recognition model for de-identification, which outperforms a state-of-the-art rule-based baseline. Moreover, we show that combining real data with synthetic data improves the recall of the method without manual annotation effort. We conduct a user study to gain insights into the privacy of artificial text. We highlight privacy risks associated with language models to inform future research on privacy-preserving automated text generation and metrics for evaluating privacy preservation during text generation.
(This article belongs to the Special Issue Natural Language Engineering: Methods, Tasks and Applications)
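
One way to realize "text generated jointly with annotations" is to have the language model emit inline entity tags that are then stripped into (text, spans) pairs; the tagging scheme below is an assumption for illustration, not the paper's exact format:

```python
# Generate tagged text with a small LM, then convert the inline tags into
# plain text plus NER spans suitable for training a de-identification model.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in LM
seed = "Patient <PER>John Doe</PER> was admitted on <DATE>"
sample = generator(seed, max_new_tokens=40, do_sample=True)[0]["generated_text"]

def strip_tags(tagged: str):
    """'<PER>John Doe</PER>' markup -> ('John Doe', [(0, 8, 'PER')])."""
    spans, plain, cursor = [], "", 0
    for m in re.finditer(r"<(\w+)>(.*?)</\1>", tagged):
        plain += tagged[cursor:m.start()]
        start = len(plain)
        plain += m.group(2)
        spans.append((start, len(plain), m.group(1)))
        cursor = m.end()
    return plain + tagged[cursor:], spans

print(strip_tags("Patient <PER>John Doe</PER> seen on <DATE>12 May</DATE>."))
# ('Patient John Doe seen on 12 May.', [(8, 16, 'PER'), (25, 31, 'DATE')])
```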
