MDPI - Publisher of Open Access Journals

30 pages, 667 KiB

Open Access Article

Large Language Models for Electronic Health Record De-Identification in English and German

by Samuel Sousa, Michael Jantscher, Mark Kröll and Roman Kern

Information 2025, 16(2), 112; https://doi.org/10.3390/info16020112 - 6 Feb 2025

Cited by 1 | Viewed by 2299

Electronic health record (EHR) de-identification is crucial for publishing or sharing medical data without violating the patient’s privacy. Protected health information (PHI) is abundant in EHRs, and privacy regulations worldwide mandate de-identification before downstream tasks are performed. The ever-growing data generation in healthcare and the advent of generative artificial intelligence have increased the demand for de-identified EHRs and highlighted privacy issues with large language models (LLMs), especially data transmission to cloud-based LLMs. In this study, we benchmark ten LLMs for de-identifying EHRs in English and German. We then compare de-identification performance for in-context learning and full model fine-tuning and analyze the limitations of LLMs for this task. Our experimental evaluation shows that LLMs effectively de-identify EHRs in both languages. Moreover, in-context learning with a one-shot setting boosts de-identification performance without the costly full fine-tuning of the LLMs.

(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

24 pages, 654 KiB

Open Access Article

Deep Learning Framework for Advanced De-Identification of Protected Health Information

by Ahmad Aloqaily, Emad E. Abdallah, Rahaf Al-Zyoud, Esraa Abu Elsoud, Malak Al-Hassan and Alaa E. Abdallah

Future Internet 2025, 17(1), 47; https://doi.org/10.3390/fi17010047 - 20 Jan 2025

Cited by 3 | Viewed by 1616

Electronic health records (EHRs) are widely used in healthcare institutions worldwide, containing vast amounts of unstructured textual data. However, the sensitive nature of Protected Health Information (PHI) embedded within these records presents significant privacy challenges, necessitating robust de-identification techniques. This paper introduces a novel approach, leveraging a Bi-LSTM-CRF model to achieve accurate and reliable PHI de-identification, using the i2b2 dataset sourced from Harvard University. Unlike prior studies that often unify Bi-LSTM and CRF layers, our approach focuses on the individual design, optimization, and hyperparameter tuning of both the Bi-LSTM and CRF components, allowing for precise model performance improvements. This rigorous approach to architectural design and hyperparameter tuning, often underexplored in the existing literature, significantly enhances the model’s capacity for accurate PHI tag detection while preserving the essential clinical context. Comprehensive evaluations are conducted across 23 PHI categories, as defined by HIPAA, ensuring thorough security across critical domains. The optimized model achieves exceptional performance metrics, with a precision of 99%, recall of 98%, and F1-score of 98%, underscoring its effectiveness in balancing recall and precision. By enabling the de-identification of medical records, this research strengthens patient confidentiality, promotes compliance with privacy regulations, and facilitates safe data sharing for research and analysis.

(This article belongs to the Special Issue eHealth and mHealth)

18 pages, 7572 KiB

Open Access Communication

Discovery Viewer (DV): Web-Based Medical AI Model Development Platform and Deployment Hub

by Valentin Fauveau, Sean Sun, Zelong Liu, Xueyan Mei, James Grant, Mikey Sullivan, Hayit Greenspan, Li Feng and Zahi A. Fayad

Bioengineering 2023, 10(12), 1396; https://doi.org/10.3390/bioengineering10121396 - 6 Dec 2023

Cited by 1 | Viewed by 2488

The rapid rise of artificial intelligence (AI) in medicine in the last few years highlights the importance of developing bigger and better systems for data and model sharing. However, the presence of Protected Health Information (PHI) in medical data poses a challenge when it comes to sharing. One potential solution to mitigate the risk of PHI breaches is to exclusively share pre-trained models developed using private datasets. Despite the availability of these pre-trained networks, there remains a need for an adaptable environment to test and fine-tune specific models tailored for clinical tasks. This environment should be open for peer testing, feedback, and continuous model refinement, allowing dynamic model updates that are especially important in the medical field, where diseases and scanning techniques evolve rapidly. In this context, the Discovery Viewer (DV) platform was developed in-house at the Biomedical Engineering and Imaging Institute at Mount Sinai (BMEII) to facilitate the creation and distribution of cutting-edge medical AI models that remain accessible after their development. The all-in-one platform offers a unique environment for non-AI experts to learn, develop, and share their own deep learning (DL) concepts. This paper presents various use cases of the platform, with its primary goal being to demonstrate how DV holds the potential to empower individuals without expertise in AI to create high-performing DL models. We tasked three non-AI experts to develop different musculoskeletal AI projects that encompassed segmentation, regression, and classification tasks. In each project, 80% of the samples were provided with a subset of these samples annotated to aid the volunteers in understanding the expected annotation task. Subsequently, they were responsible for annotating the remaining samples and training their models through the platform’s “Training Module”. The resulting models were then tested on the separate 20% hold-off dataset to assess their performance. The classification model achieved an accuracy of 0.94, a sensitivity of 0.92, and a specificity of 1. The regression model yielded a mean absolute error of 14.27 pixels. And the segmentation model attained a Dice Score of 0.93, with a sensitivity of 0.9 and a specificity of 0.99. This initiative seeks to broaden the community of medical AI model developers and democratize the access of this technology to all stakeholders. The ultimate goal is to facilitate the transition of medical AI models from research to clinical settings.

(This article belongs to the Section Biosignal Processing)

17 pages, 494 KiB

Open Access Article

Classification of Severe Maternal Morbidity from Electronic Health Records Written in Spanish Using Natural Language Processing

by Ever A. Torres-Silva, Santiago Rúa, Andrés F. Giraldo-Forero, Maria C. Durango, José F. Flórez-Arango and Andrés Orozco-Duque

Appl. Sci. 2023, 13(19), 10725; https://doi.org/10.3390/app131910725 - 27 Sep 2023

Cited by 6 | Viewed by 2376

One stepping stone for reducing maternal mortality is to identify severe maternal morbidity (SMM) using Electronic Health Records (EHRs). We aim to develop a pipeline to represent and classify the unstructured text of maternal progress notes in eight classes according to the silver labels defined by the ICD-10 codes associated with SMM. We preprocessed the text, removing protected health information (PHI) and reducing stop words. We built different pipelines to classify the SMM by the combination of six word-embedding schemes, three different approaches for the representation of the documents (average, clustering, and principal component analysis), and five well-known machine learning classifiers. Additionally, we implemented an algorithm for typos and misspelling adjustment based on the Levenshtein distance to the Spanish Billion Word Corpus dictionary. We analyzed 43,529 documents constructed by an average of 4.15 progress notes from 22,937 patients. The pipeline with the best performance was the one that included Word2Vec, typos and spelling adjustment, document representation by PCA, and an SVM classifier. We found that it is possible to identify conditions such as miscarriage complication or hypertensive disorders from clinical notes written in Spanish, with a true positive rate higher than 0.85. This is the first approach to classify SMM from the unstructured text contained in the maternal EHRs, which can contribute to the solution of one of the most important public health problems in the world. Future work must test other representation and classification approaches to detect the risk of SMM.

(This article belongs to the Special Issue Natural Language Processing in Healthcare)

20 pages, 2339 KiB

Open Access Article

GDPR Compliant Data Storage and Sharing in Smart Healthcare System: A Blockchain-Based Solution

by Pinky Bai, Sushil Kumar, Kirshna Kumar, Omprakash Kaiwartya, Mufti Mahmud and Jaime Lloret

Electronics 2022, 11(20), 3311; https://doi.org/10.3390/electronics11203311 - 14 Oct 2022

Cited by 13 | Viewed by 4150

Smart healthcare systems provide user-centric medical services to patients based on collected information of patients including personal health information (PHI) and personally identifiable information (PII). The information (PII and PHI) flows into the smart healthcare system with or without any regulation and patient concern with the help of new information and communication technologies (ICT). The use of ICT comes with the security and privacy issues of collected PII and PHI data. The European Union has published the General Data Protection Regulation (GDPR) to regulate the flow of personal information. Towards this end, this paper proposes a blockchain-based data storage and sharing framework for a smart healthcare system that complies with the “Privacy by Design” rule of the GDPR. The personal information collected from patients is stored on off-chain storage (IPFS), and other information is stored on the blockchain ledger, which is visible to all participants. The smart contracts are designed to share the PII data with another participant based on prior permission of the data owner. The proposed framework also includes the deletion of PII and PHI in the system as per the “Right to be Forgotten” GDPR rule. Security and privacy analyses are performed for the framework to demonstrate the security and privacy of data while sharing and at rest. The comparative performance analysis demonstrates the benefit of the proposed GDPR-compliant data storage and sharing framework using blockchain. It is evident from the reported results that the proposed framework outperforms the state-of-the-art techniques in terms of performance metrics in a smart healthcare system.

(This article belongs to the Special Issue Wireless Sensors Networks in the IoT Era: Advanced Technologies, Recent Challenges, Smart Applications & Future Prospects)

10 pages, 2979 KiB

Open Access Article

Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients

by Tanmoy Paul, Humayera Islam, Nitesh Singh, Yaswitha Jampani, Teja Venkat Pavan Kotapati, Preethi Aishwarya Tautam, Md Kamruz Zaman Rana, Vasanthi Mandhadi, Vishakha Sharma, Michael Barnes, Richard D. Hammer and Abu Saleh Mohammad Mosa

Appl. Sci. 2022, 12(19), 9976; https://doi.org/10.3390/app12199976 - 4 Oct 2022

Cited by 1 | Viewed by 1996

The de-identification of clinical reports is essential to protect the confidentiality of patients. The natural-language-processing-based named entity recognition (NER) model is a widely used technique of automatic clinical de-identification. The performance of such a machine learning model relies largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) from a total of 10,239 radiology reports that were divided into seven types. Multiple features were extracted by the toolkit and the NER models were built using these features and their combinations. A total of 10 features were extracted and the performance of the models was evaluated based on their precision, recall, and F₁-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed others across all types of reports. The dataset we used was large in volume and divided into multiple types of reports. Such a diverse dataset made sure that the results were not subject to a small number of structured texts from where a machine learning model can easily learn the features. The manual de-identification of large-scale clinical reports is impractical. This study helps to identify the best-performing features for building an NER model for automatic de-identification from a wide array of features mentioned in the literature.

(This article belongs to the Special Issue Application of Data Analytics in Smart Healthcare)

17 pages, 969 KiB

Open Access Article

Analysis of Insider Threats in the Healthcare Industry: A Text Mining Approach

by In Lee

Information 2022, 13(9), 404; https://doi.org/10.3390/info13090404 - 27 Aug 2022

Cited by 17 | Viewed by 7838

To address rapidly growing data breach incidents effectively, healthcare providers need to identify various insider and outsider threats, analyze the vulnerabilities of their internal security systems, and develop more appropriate data security measures against the threats. While there have been studies on trends of data breach incidents, there is a lack of research on the analysis of descriptive contents posted on the data breach reporting website of the U.S. Department of Health and Human Services (HHS) Office for Civil Rights (OCR). Hence, this study develops a novel approach to the analysis of descriptive data breach information with the use of text mining and visualization. Insider threats, vulnerabilities, breach incidents, impacts, and responses to the breaches are analyzed for three data breach types.

(This article belongs to the Special Issue Techniques and Frameworks to Detect and Mitigate Insider Attacks)

10 pages, 778 KiB

Open Access Article

Common and Unique Barriers to the Exchange of Administrative Healthcare Data in Environmental Public Health Tracking Program

by Mikyong Shin, Charles Hawley and Heather Strosnider

Int. J. Environ. Res. Public Health 2021, 18(8), 4356; https://doi.org/10.3390/ijerph18084356 - 20 Apr 2021

Viewed by 2142

CDC’s National Environmental Public Health Tracking Program (Tracking Program) receives administrative data annually from 25–30 states to track potential environmental exposures and to make data available for public access. In 2019, the CDC Tracking Program conducted a cross-sectional survey among principal investigators or program managers of the 26 funded programs to improve access to timely, accurate, and local data. All 26 funding recipients reported having access to hospital inpatient data, and most states (69.2%) regularly update data user agreements to receive the data. Among the respondents, 15 receive record-level data with protected health information (PHI) and seven receive record-level data without PHI. Regarding geospatial resolution, approximately 50.0% of recipients have access to the street address or census tract information, 34.6% have access to ZIP code, and 11.5% have other sub-county geographies (e.g., town). Only three states receive administrative data for their residents from all border states. The survey results will help the Tracking Program to identify knowledge gaps and perceived barriers to the use and accessibility of administrative data for the CDC Tracking Program. The information collected will inform the development of resources that can provide solutions for more efficient and timely data exchange.

9 pages, 1381 KiB

Open Access Article

Filtered BERT: Similarity Filter-Based Augmentation with Bidirectional Transfer Learning for Protected Health Information Prediction in Clinical Documents

by Min Kang, Kye Hwa Lee and Youngho Lee

Appl. Sci. 2021, 11(8), 3668; https://doi.org/10.3390/app11083668 - 19 Apr 2021

Cited by 8 | Viewed by 4437

For the secondary use of clinical documents, it is necessary to de-identify protected health information (PHI) in documents. However, the difficulty lies in the fact that there are few publicly annotated PHI documents. To solve this problem, in this study, we propose a filtered bidirectional encoder representation from transformers (BERT)-based method that predicts a masked word and validates the word again through a similarity filter to construct augmented sentences. The proposed method effectively performs data augmentation. The results show that the augmentation method based on filtered BERT improved the performance of the model. This suggests that our method can effectively improve the performance of the model in the limited data environment.

(This article belongs to the Special Issue New Trends in Medical Informatics)

24 pages, 5279 KiB

Open Access Article

Proof-of-Familiarity: A Privacy-Preserved Blockchain Scheme for Collaborative Medical Decision-Making

by Jinhong Yang, Md Mehedi Hassan Onik, Nam-Yong Lee, Mohiuddin Ahmed and Chul-Soo Kim

Appl. Sci. 2019, 9(7), 1370; https://doi.org/10.3390/app9071370 - 1 Apr 2019

Cited by 78 | Viewed by 8676

The current healthcare sector is facing difficulty in satisfying the growing issues, expenses, and heavy regulation of quality treatment. Surely, electronic medical records (EMRs) and protected health information (PHI) are highly sensitive, personally identifiable information (PII). However, the sharing of EMRs enhances overall treatment quality. A distributed ledger (blockchain) technology, embedded with privacy and security by architecture, provides a transparent application developing platform. Privacy, security, and lack of confidence among stakeholders are the main downsides of extensive medical collaboration. This study, therefore, utilizes the transparency, security, and efficiency of blockchain technology to establish a collaborative medical decision-making scheme. This study considers the experience, skill, and collaborative success rate of four key stakeholders (patient, cured patient, doctor, and insurance company) in the healthcare domain to propose a local reference-based consortium blockchain scheme, and an associated consensus gathering algorithm, proof-of-familiarity (PoF). Stakeholders create a transparent and tenable medical decision to increase the interoperability among collaborators through PoF. A prototype of PoF is tested with MultiChain 2.0, a blockchain implementing framework. Moreover, the privacy of identities, EMRs, and decisions is preserved by two-layer storage, encryption, and a timestamp storing mechanism. Finally, superiority over existing schemes is identified to improve personal data (PII) privacy and patient-centric outcomes research (PCOR).

(This article belongs to the Special Issue Advances in Blockchain Technology and Applications)

12 pages, 2331 KiB

Open Access Article

Conversion of Legal Text to a Logical Rules Set from Medical Law Using the Medical Relational Model and the World Rule Model for a Medical Decision Support System

by Imran Khan, Muhammad Sher, Javed I. Khan, Syed M. Saqlain, Anwar Ghani, Husnain A. Naqvi and Muhammad Usman Ashraf

Informatics 2016, 3(1), 2; https://doi.org/10.3390/informatics3010002 - 26 Feb 2016

Cited by 11 | Viewed by 7897

Automated formalization of legal text is a time- and effort-consuming task, but human-based validation consumes even more of both. The exchange of healthcare data in compliance with the medical privacy law requires experts with deep familiarity of its intricate provisions for verification. The article presents a medical relational model (MRM) for the extraction of logical rules from medical law, required to design a medical decision support system (MDSS) that facilitates the process of exchanging data electronically with minimum human intervention. The division of medical law into small concept classes makes it easier to formalize the legal text of medical law into logical rules. These logical rules are then used to make a precise decision in compliance with the law, after evaluating requests from different entities for different purposes in MDSS. Our methodology is to analyze the legal text and release records in compliance with the medical law. For developing countries where medical laws are not as mature as HIPAA (Health Insurance Portability and Accountability Act) in the USA, the proposed methodology can be adapted to build their MDSS based on MRM.

16 pages, 1079 KiB

Open Access Article

Freshness-Preserving Non-Interactive Hierarchical Key Agreement Protocol over WHMS

by Hyunsung Kim

Sensors 2014, 14(12), 23742-23757; https://doi.org/10.3390/s141223742 - 10 Dec 2014

Cited by 4 | Viewed by 5402

The digitization of patient health information (PHI) for wireless health monitoring systems (WHMSs) has brought many benefits and challenges for both patients and physicians. However, security, privacy and robustness have remained important challenges for WHMSs. Since the patient’s PHI is sensitive and the communication channel, i.e., the Internet, is insecure, it is important to protect them against unauthorized entities, i.e., attackers. Otherwise, failure to do so will not only lead to the compromise of a patient’s privacy, but will also put his/her life at risk. This paper proposes a freshness-preserving non-interactive hierarchical key agreement protocol (FNKAP) for WHMSs. The FNKAP is based on the concept of the non-interactive identity-based key agreement for communication efficiency. It achieves patient anonymity between a patient and physician, session key secrecy and resistance against various security attacks, especially including replay attacks.

(This article belongs to the Special Issue Sensor Computing for Mobile Security and Big Data Analytics)

Search Results (12)