Article

Weakly-Supervised Multilingual Medical NER for Symptom Extraction for Low-Resource Languages

Human-Centric Explorations and Research in AI, Technology, Medicine and Enhanced Data (HUMADEX) Group, Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška Cesta 46, 2000 Maribor, Slovenia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5585; https://doi.org/10.3390/app15105585
Submission received: 14 April 2025 / Revised: 1 May 2025 / Accepted: 13 May 2025 / Published: 16 May 2025

Abstract

Patient-reported health data, especially patient-reported outcome measures, are vital for improving clinical care but are often limited by memory bias, cognitive load, and inflexible questionnaires. Patients prefer conversational symptom reporting, highlighting the need for robust methods in symptom extraction and conversational intelligence. This study presents a weakly-supervised pipeline for training and evaluating medical Named Entity Recognition (NER) models across eight languages, with a focus on low-resource settings. A merged English medical corpus, annotated using the Stanza i2b2 model, was translated into German, Greek, Spanish, Italian, Portuguese, Polish, and Slovenian, preserving the entity annotations for medical problems, diagnostic tests, and treatments. Data augmentation addressed the class imbalance, and the fine-tuned BERT-based models consistently outperformed the baselines. The English model achieved the highest F1 score (80.07%), followed by German (78.70%), Spanish (77.61%), Portuguese (77.21%), Slovenian (75.72%), Italian (75.60%), Polish (75.56%), and Greek (69.10%). Compared to the existing baselines, our models demonstrated notable performance gains, particularly in English, Spanish, and Italian. This research underscores the feasibility and effectiveness of weakly-supervised multilingual approaches for medical entity extraction, contributing to improved information access in clinical narratives, especially in under-resourced languages.
Keywords: low-resource languages; machine translation; medical entity extraction; NER; NLP; patient-reported outcomes; weakly-supervised learning
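
The F1 scores reported in the abstract can be made concrete with a small sketch of how an F1 score for NER is commonly computed. The abstract does not state whether the reported scores are entity-level or token-level, or how they are averaged, so the following is an illustration under stated assumptions rather than the authors' evaluation code: it assumes BIO-tagged token sequences, exact-match entity spans, and micro-averaging. The label names `PROBLEM`, `TEST`, and `TREATMENT` and the helper names `bio_to_spans` and `entity_f1` are hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index, exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    Assumption: an I- tag that does not continue a span of the same
    type is ignored instead of opening a new span.
    """
    spans: Set[Span] = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged, exact-match entity-level F1 over a list of sentences."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)   # spans with matching type and boundaries
        fp += len(p - g)   # predicted spans absent from the gold data
        fn += len(g - p)   # gold spans the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the problem span is correct, the test is mislabelled.
gold = [["B-PROBLEM", "I-PROBLEM", "O", "B-TEST"]]
pred = [["B-PROBLEM", "I-PROBLEM", "O", "B-TREATMENT"]]
score = entity_f1(gold, pred)  # precision 0.5, recall 0.5, F1 0.5
```

Under exact matching, a prediction with the right boundaries but the wrong entity type counts as both a false positive and a false negative, which is why the example above scores 0.5 rather than something higher.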

Share and Cite

MDPI and ACS Style

Sallauka, R.; Arioz, U.; Rojc, M.; Mlakar, I. Weakly-Supervised Multilingual Medical NER for Symptom Extraction for Low-Resource Languages. Appl. Sci. 2025, 15, 5585. https://doi.org/10.3390/app15105585

