1. Introduction
With the growing emphasis on patient-centered care, patient-reported outcome measures (PROMs) have emerged as a critical tool for understanding and addressing patients’ health needs, especially in the management of chronic and complex conditions [1,2]. PROMs allow the systematic collection of patients’ health experiences, symptoms, and quality of life, often using structured instruments such as questionnaires [3,4]. While effective in many cases, traditional PROMs face notable limitations. These include their reliance on structured, predefined questions that may not fully capture patients’ experiences, a lack of adaptability for underrepresented and resource-poor languages, and the significant burden they can place on patients to complete standardized forms [5,6]. Moreover, current PROMs often fail to accommodate natural language expression, which could offer richer and more nuanced insights into patients’ symptoms and health concerns [7]. This issue is particularly pronounced in diverse linguistic contexts, where a lack of tools for resource-poor languages limits inclusivity and equity in data collection and analysis [8,9].
In light of the limitations of traditional PROMs, there is a growing need for approaches that can capture patient experiences in a more natural and inclusive manner, accommodating diverse linguistic contexts and minimizing patient burden [10]. To address these challenges, advancements in Natural Language Processing (NLP), such as Named Entity Recognition (NER), offer promising tools for extracting patient-reported information directly from unstructured text, enabling a richer and more equitable understanding of health concerns across languages. NER is a fundamental NLP task that involves extracting entities from text and classifying them into predefined categories, such as locations, people, or organizations [11]. In the medical domain, clinical NER models have been developed to identify specific information like diseases, drugs, dosages, and frequencies, with recent advancements leveraging deep learning algorithms to enhance their performance [11]. While progress has been made in medical NER for resource-abundant languages, there remains a notable research gap concerning resource-poor languages [12]. Addressing this gap is important to ensure equitable healthcare advancements across various language groups. Moreover, beyond structuring medical notes, there is a need to extract symptoms or problems as expressed by patients in their own words, facilitating a more patient-centered approach to healthcare.
This paper describes a weakly-supervised pipeline for training and evaluating medical NER models across eight languages: English, German, Greek, Spanish, Italian, Polish, Portuguese, and Slovenian. We refer to the approach as “weakly-supervised” because it relies on automatically annotated data rather than manually labeled gold-standard datasets, using state-of-the-art models to generate training labels. By fine-tuning BERT-based models, we enabled the extraction of medical entities like symptoms, treatments, and tests from medical and everyday texts. The models will be used in a multicentric feasibility study of the SMILE project, as part of digital health interventions to support the (mental) well-being of children across multiple sites in Germany, Italy, the UK, Spain, Cyprus, Poland, and Slovenia. By focusing on underrepresented languages and patient-expressed symptoms, this research seeks to enhance the accessibility and quality of healthcare information extraction.
Clinical Named Entity Recognition (NER) models have made significant progress in structuring and summarizing clinical notes, predominantly in English. However, the existing models face several challenges, including a monolingual focus, inconsistent performance reporting, poor handling of non-clinical semantics, and limited adaptability to low-resource languages [13,14,15]. The bert-medical-ner model (https://huggingface.co/silpakanneganti/bert-medical-ner, accessed on 14 April 2025) on the HuggingFace (https://huggingface.co/, accessed on 14 April 2025) platform extracts entities such as age, sex, clinical event, non-biological location, duration, severity, biological structure, sign/symptom, and more, achieving a self-reported F1 score of 67.5%. Similarly, bert-base-uncased_clinical-ner (https://huggingface.co/samrawal/bert-base-uncased_clinical-ner, accessed on 14 April 2025) is another NER model, which extracts problem, treatment, and test entities from medical texts, but it does not report any performance evaluation, leaving its effectiveness unclear. Other models, such as the one proposed in [16] and Stanza’s clinical i2b2 model [17,18], report F1 scores of 81.2% and 88.1%, respectively, yet their focus remains restricted to English medical texts. Domain-specific models like biobert_ncbi_disease_ner (https://huggingface.co/ugaray96/biobert_ncbi_disease_ner, accessed on 14 April 2025) and en_ner_bc5cdr_md (https://huggingface.co/Kaelan/en_ner_bc5cdr_md, accessed on 14 April 2025) are tailored for tasks like disease or chemical recognition, reporting F1 scores of 85.7% for chemicals and diseases. Moreover, they are well tuned to medical semantics but not to the way individuals express or describe their symptoms. Efforts in non-English languages include GERNERMED [19] for German, which extracts entities such as drug dosage and frequency (F1: 81.5%), MedPsyNIT [20] for Italian, which extracts symptoms and comorbidities (F1: 89.5%), and the model in [21] for Spanish, which recognizes anatomical, chemical, and pathological entities (F1: 86.4%). A model for Portuguese achieves a high F1 score of 92.6% [22] for extracting entities like disorders, medical procedures, and pharmacologic substances. Despite these successes, multilingual models for medical NER remain scarce, and no models currently exist for Greek, Polish, or Slovenian.
In summary, while existing clinical NER models demonstrate strong performance in specific settings, they are constrained by their focus on a single language or a narrow range of languages, their reliance on extensive annotated datasets that are expensive and time-intensive to create, and their limited ability to adapt across different types of texts, such as formal medical documents and informal everyday language.
Our aim is to make the following contributions:
- we introduce a flexible, multilingual, and adaptable pipeline for medical NER;
- our pipeline enables effective extraction of medical entities (problem, test, and treatment) from a given body of text;
- our pipeline is effective across diverse languages, with a focus on low-resource languages;
- our pipeline is designed to handle both formal and informal textual contexts.
2. Materials and Methods
This section describes the data sources, preprocessing pipeline, and experimental setup used in this study. Section 2.1 outlines the overall methodology of the pipeline. Section 2.2 describes the data sources and corpus creation, while Section 2.3 presents the data preprocessing steps. Section 2.4 presents the annotation process, followed by the translation and multilingual adaptation process in Section 2.5. Word alignment is described in Section 2.6, and Section 2.7 presents the data augmentation techniques that were used. Preparation for training, including data splitting, model configuration, and evaluation metrics, is described in Section 2.8, Section 2.9 and Section 2.10, respectively.
2.1. Overall Methodology
In this paper, we propose a weakly-supervised multilingual Named Entity Recognition (NER) pipeline for symptom extraction from medical texts and patient-reported data. The pipeline consists of several key stages: data preparation, annotation, translation for multilingual adaptation, data augmentation, and model fine-tuning. Each step contributes to building a robust NER model capable of identifying symptom-related entities across eight languages (English, German, Greek, Italian, Spanish, Polish, Portuguese, and Slovenian) in unstructured medical texts.
Figure 1 shows an overview of the pipeline flow. We begin with the creation of a corpus of medical texts in English, which is then preprocessed and annotated by a state-of-the-art NER model, producing an annotated corpus. This annotated corpus is translated into a target language using machine translation models, and the translated corpus is then used to fine-tune a BERT-based NER model so that it learns to identify the predefined categories in different languages.
2.2. Data Sources and Corpus Creation
The initial stage involved combining multiple datasets of medical texts to form a comprehensive corpus. This corpus was created by merging two key datasets: autotrain-data-1w6s-u4vt-i7yo (https://huggingface.co/datasets/Kabatubare/autotrain-data-1w6s-u4vt-i7yo, accessed on 14 April 2025) and medical_qa_meds (https://huggingface.co/datasets/s200862/medical_qa_meds/discussions, accessed on 14 April 2025). The former consists of clinical notes, discharge summaries, and other unstructured medical records from various healthcare providers, providing a rich source of terminology related to patient symptoms, diagnoses, and treatments. The latter contains a diverse collection of question-and-answer pairs focused on medical topics, including descriptions of symptoms, diagnoses, and treatment options. By merging these datasets, we ensured a diverse and comprehensive representation of medical terminology and symptom descriptions in English, capturing different styles of clinical language, patient narratives, and medical abbreviations. Together, the two datasets contain 29,379 text entries.
Table 1 describes the details of the two datasets.
2.3. Preprocessing
The data underwent a thorough preprocessing phase. Initially, the texts were cleaned to remove unnecessary content while retaining relevant information. Since our dataset consisted of question–answer pairs between a user and an assistant, we identified certain textual elements that could be removed. In the autotrain-data-1w6s-u4vt-i7yo dataset, we removed strings such as “Human:”, “Assistant:”, “\n”, and “\t”, and replaced hyphens (“-”) between words with a single space. In the s200862/medical_qa_meds dataset, we eliminated tags like “[INST]”, “[/INST]”, “<s>”, and “</s>”, along with newline (“\n”) and tab (“\t”) characters. Next, all punctuation was removed from the text. Finally, the data were converted to lowercase for consistency. As stated above, the rows of the datasets are medical texts made up of multiple sentences. We split all the texts so that each row contains one sentence; thus, the number of examples rose from around 29,000 to 261,936 sentences.
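The following Python sketch illustrates these cleaning steps. It is a minimal illustration of the procedure described above, not the pipeline’s actual code: the helper names are ours, and sentences are segmented before punctuation removal, since the naive split relies on sentence-final punctuation.

```python
import re
import string

# Chat/markup fragments to strip, as listed above.
TAGS = ["Human:", "Assistant:", "[INST]", "[/INST]", "<s>", "</s>"]

def split_sentences(text: str) -> list[str]:
    # Naive segmentation on sentence-final punctuation; a proper
    # segmenter (e.g., Stanza's tokenizer) could be used instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean_sentence(text: str) -> str:
    for tag in TAGS:
        text = text.replace(tag, " ")
    text = text.replace("\n", " ").replace("\t", " ")
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # hyphens between words
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).lower().strip()

rows = ["Human: I have a headache.\nAssistant: Try an over-the-counter painkiller."]
sentences = [clean_sentence(s) for row in rows for s in split_sentences(row)]
print(sentences)  # ['i have a headache', 'try an over the counter painkiller']
```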
2.4. Annotation Process
To annotate the corpus of medical texts, we utilized Stanza [18], a Python NLP package designed for linguistic analysis across multiple human languages. Stanza offers a suite of efficient tools that perform a variety of linguistic processing tasks, including sentence segmentation, word tokenization, part-of-speech tagging, and entity recognition. Starting from raw text, Stanza structures the data, identifying syntactic and semantic elements to support further analysis.
In particular, Stanza provides an i2b2 model specialized for NER tasks in the medical domain. This model is designed to extract key medical entities: PROBLEM (symptom), TEST, and TREATMENT. Trained on the i2b2-2010 corpus [23], which includes manually annotated clinical notes from institutions such as Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center, the i2b2 model is highly accurate and well suited for medical text analysis. With an F1 score of 88.13%, the i2b2 model identifies and classifies medical entities effectively, making it a robust tool for NER tasks in clinical settings.
Table 2 presents the i2b2 model from Stanza.
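For illustration, the snippet below shows how such annotation can be run with Stanza’s English clinical package; this is a minimal sketch assuming the publicly documented mimic/i2b2 configuration, not the exact code used in our pipeline.

```python
import stanza

# Download the English clinical (mimic) package with the i2b2 NER model
# (assumes the publicly documented Stanza configuration).
stanza.download("en", package="mimic", processors={"ner": "i2b2"})
nlp = stanza.Pipeline("en", package="mimic", processors={"ner": "i2b2"})

doc = nlp("the patient was prescribed paracetamol for severe headaches")
for ent in doc.entities:
    print(ent.text, ent.type)  # e.g., "severe headaches" -> PROBLEM
```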
2.5. Translation and Multilingual Adaptation
To enable multilingual capabilities in our NER pipeline, we employed machine translation models to convert the annotated English medical texts into several target languages: German, Greek, Spanish, Italian, Polish, Portuguese, and Slovenian. These translations allowed us to develop a dataset that supports multilingual symptom extraction, enhancing the model’s utility across diverse linguistic contexts.
For the translations from English to German, Greek, Spanish, Italian, Polish, and Portuguese, we utilized models developed by the Language Technology Research Group at the University of Helsinki [24]. These models, licensed under CC-BY-4.0, are optimized for direct translations from English to each target language and are available on the Hugging Face platform [17]. Each model was selected specifically for its accuracy and reliability in medical contexts, given the specific terminology present in clinical texts.
For English to Slovenian translations, we used the mBART model [25], a fine-tuned version of the mBART-large-50 model capable of translating between any pair of 50 languages. We chose this model for its superior performance in handling low-resource languages like Slovenian. It enables direct translation by setting the target language ID as the first generated token in the output sequence. This flexibility and multilingual adaptability were crucial in creating a reliable Slovenian translation.
The Helsinki-NLP collection does not include a model for Slovenian, which is why the mBART model was chosen for this language; conversely, Greek is not supported by the mBART model, which is why the Helsinki-NLP model was chosen for Greek. The remaining languages are supported by both. For these, we chose the Helsinki-NLP models because of their higher BLEU scores compared to the mBART model.
Table 3 shows the BLEU scores of both models for German, Spanish, Italian, Polish, and Portuguese.
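Both translation routes can be driven through the transformers library, as in the sketch below. It assumes the public Helsinki-NLP opus-mt checkpoints and a standard mBART-50 translation checkpoint, which may differ from the exact fine-tuned versions used here.

```python
from transformers import (MarianMTModel, MarianTokenizer,
                          MBartForConditionalGeneration, MBart50TokenizerFast)

text = ["the patient reported severe headaches and nausea"]

# English -> German via a Helsinki-NLP Marian model (analogous checkpoints
# exist for el, es, it, pl, and pt).
marian_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
marian = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
batch = marian_tok(text, return_tensors="pt", padding=True)
print(marian_tok.batch_decode(marian.generate(**batch), skip_special_tokens=True))

# English -> Slovenian via mBART-50: the target language ID is forced as
# the first generated token, as described above.
mbart_tok = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX")
mbart = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt")
batch = mbart_tok(text, return_tensors="pt")
out = mbart.generate(**batch,
                     forced_bos_token_id=mbart_tok.lang_code_to_id["sl_SI"])
print(mbart_tok.batch_decode(out, skip_special_tokens=True))
```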
2.6. Word Alignment
To ensure accurate word alignment in the translated texts, we used the awesome-align tool (https://huggingface.co/aneuraz/awesome-align-with-co, accessed on 14 April 2025), which leverages contextual word embeddings from models like BERT to map words between languages effectively [26]. The alignment process involved several key steps. Contextual word embeddings were generated for each word in a sentence to capture the word’s meaning based on its context within the sentence. Two techniques were employed to calculate alignment scores. First, probability thresholding involved generating a similarity matrix using the dot product of word embeddings, which was then converted into probabilities through a function like softmax. This approach identified high-probability word pairs as aligned. Second, optimal transport treated alignment as a transportation problem, minimizing the “cost” of transferring probability mass between words in different languages and producing a matrix of probable alignments. The alignments were calculated in both directions (source-to-target and target-to-source), with the final alignment derived from the intersection of these bidirectional results. For words split into subwords by the model, alignment was ensured through any matching subwords, guaranteeing complete word coverage in the alignment process.
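The probability-thresholding variant can be sketched as follows. This is a simplified illustration assuming the awesome-align checkpoint named above, with an illustrative threshold and mean-pooled subword embeddings rather than the tool’s full matching logic.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "aneuraz/awesome-align-with-co"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(words):
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    # Average subword vectors so each word gets one embedding.
    ids = enc.word_ids()
    vecs = [hidden[[i for i, w in enumerate(ids) if w == k]].mean(0)
            for k in range(len(words))]
    return torch.stack(vecs)

src, tgt = ["severe", "headaches"], ["starke", "Kopfschmerzen"]
sim = embed(src) @ embed(tgt).T
# Softmax in both directions; keep pairs probable in both (the intersection).
fwd, bwd = sim.softmax(dim=-1), sim.softmax(dim=0)
aligned = (fwd > 0.4) & (bwd > 0.4)   # illustrative threshold
for i, j in aligned.nonzero().tolist():
    print(src[i], "->", tgt[j])
```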
2.7. Data Augmentation
Before splitting the dataset into train/validation/test sets, we performed data augmentation to enhance the diversity and robustness of the training data. Augmentation was used because the dataset contained a large number of O tags, namely, tokens that do not belong to any category (PROBLEM, TEST, or TREATMENT); this leads to class imbalance, which impacts the model’s performance. Furthermore, during model development, we observed signs of overfitting when training without data augmentation, as the evaluation loss began to rise after several epochs. This further motivated the introduction of augmentation to improve generalization. The English dataset served as the basis for tuning these strategies, which were then applied consistently across all the languages.
The augmentation process involved two main strategies, sketched in code after this paragraph. First, sentence reordering was applied, where the words within each sentence were rearranged to create new variations of the same sentence structure. This method increased the variability of the dataset, enabling the model to generalize better to different sentence formations. Second, entity extraction was performed by identifying all the words within each sentence annotated with non-“O” labels (i.e., labeled as PROBLEM, TEST, or TREATMENT). These extracted words were used to generate new sentences, which were then added back into the dataset. This ensured that the model would encounter more examples of key medical entities during training.
After the data augmentation, each dataset consisted of approximately 440,000 rows.
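A minimal sketch of the two strategies on (tokens, tags) examples is given below; the helper names are ours, not the pipeline’s actual code.

```python
import random

def reorder_sentence(tokens, tags, seed=0):
    # Jointly shuffle word/tag pairs to create a new word-order variant.
    # (A fuller implementation might move multi-token entities as units
    # to preserve B-/I-/E- contiguity.)
    pairs = list(zip(tokens, tags))
    random.Random(seed).shuffle(pairs)
    new_tokens, new_tags = zip(*pairs)
    return list(new_tokens), list(new_tags)

def extract_entities(tokens, tags):
    # Keep only tokens with non-"O" labels and form a new, entity-dense
    # example to add back into the dataset.
    kept = [(w, t) for w, t in zip(tokens, tags) if t != "O"]
    return ([w for w, _ in kept], [t for _, t in kept]) if kept else None

tokens = ["he", "was", "prescribed", "paracetamol", "for", "headaches"]
tags = ["O", "O", "O", "S-TREATMENT", "O", "S-PROBLEM"]
print(reorder_sentence(tokens, tags))
print(extract_entities(tokens, tags))
# (['paracetamol', 'headaches'], ['S-TREATMENT', 'S-PROBLEM'])
```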
2.8. Data Splitting
Following the augmentation, the dataset was split into three distinct sets. The training set comprised 80% of the dataset, the validation set accounted for 10% of the dataset, and the remaining 10% was designated as the test set.
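For illustration, an 80/10/10 split can be obtained with two successive splits; `examples` stands in for the augmented list of (tokens, tags) pairs and the random seed is illustrative.

```python
from sklearn.model_selection import train_test_split

examples = [(["headaches"], ["S-PROBLEM"])] * 10  # placeholder data
train, rest = train_test_split(examples, test_size=0.20, random_state=42)
val, test = train_test_split(rest, test_size=0.50, random_state=42)
print(len(train), len(val), len(test))  # 8 1 1
```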
2.9. Model Configuration
For our symptom extraction pipeline, we fine-tuned the BERT base (cased) model, a widely used transformer model pretrained on English text using a masked language modeling (MLM) objective [27]. This model, developed by Google, is case sensitive, differentiating between lowercase and uppercase words (e.g., “english” vs. “English”). BERT was pretrained on large English corpora, including BookCorpus and English Wikipedia, using two main objectives: MLM, where 15% of the words in each sentence are masked and predicted, and Next Sentence Prediction (NSP), which trains the model to predict sentence order [27]. The model configuration includes a vocabulary size of 30,000, a maximum token length of 512, and an Adam optimizer with learning rate warmup and linear decay [27]. BERT achieved high performance on a range of NLP tasks, including an average score of 79.6 on the GLUE benchmark, and is well suited for token classification tasks, such as named entity recognition, when fine-tuned on specific labeled datasets [27]. This makes it an ideal base for developing a specialized model for medical named entity recognition in our pipeline. The architecture of the BERT model specialized for NER tasks is displayed in Figure 2.
The model was fine-tuned to classify entities into specific categories (PROBLEM, TEST, and TREATMENT) using the IOB (Inside-Outside-Beginning) tagging scheme [28], a widely used standard in NER, extended here with additional tags (B-, I-, E-, S-) to capture entity boundaries. The key training parameters were set as follows: the architecture utilized was BERT-base-cased, a pretrained transformer model with 12 layers, 768 hidden units, and a vocabulary of 30,000 tokens.
For the training parameters, the model was trained for 200 epochs with a batch size of 64; however, early stopping based on validation loss was applied, and in practice, training typically converged within 20–30 epochs. An AdamW optimizer was used, featuring a learning rate of 3 × 10⁻⁵ and a weight decay of 0.01 to prevent overfitting. The sequence length was capped at 128 tokens to ensure efficient processing. To address the class imbalance, a focal loss function [29] was applied, emphasizing harder-to-classify samples.
Table 4 displays more details on the parameters used for model training.
These configurations were selected to maximize the model’s ability to extract medical entities accurately from medical texts while maintaining efficiency and generalizability.
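A sketch of this configuration using the Hugging Face Trainer is shown below; the label list and the focal-loss focusing parameter (gamma = 2) are our illustrative assumptions, and the subclass name is ours rather than the pipeline’s actual code.

```python
import torch
from transformers import BertForTokenClassification, Trainer, TrainingArguments

# BIOES-style tag set over the three entity categories (assumed layout).
labels = ["O"] + [f"{p}-{e}" for e in ("PROBLEM", "TEST", "TREATMENT")
                  for p in ("B", "I", "E", "S")]
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

class FocalLossTrainer(Trainer):
    # Replace the default cross-entropy with focal loss to emphasize
    # harder-to-classify tokens, as described above.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        gold = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits.view(-1, model.config.num_labels)
        targets = gold.view(-1)
        ce = torch.nn.functional.cross_entropy(
            logits, targets, reduction="none", ignore_index=-100)
        pt = torch.exp(-ce)                     # prob. of the true class
        mask = targets != -100                  # skip padded positions
        loss = ((1.0 - pt) ** 2.0 * ce)[mask].mean()  # focal loss, gamma = 2
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="medical-ner",
    num_train_epochs=200,               # early stopping halts much earlier
    per_device_train_batch_size=64,
    learning_rate=3e-5,                 # AdamW is the Trainer default
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=2000,
    save_steps=2000,
    load_best_model_at_end=True,
)
```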
2.10. Evaluation
The fine-tuned BERT-based NER model was evaluated using the standard metrics of precision, recall, and F1 score (Table 5), which provide insights into the model’s ability to identify and classify medical entities accurately. Early stopping was employed to prevent overfitting, halting training if the validation loss did not improve after 30 evaluation steps.
The performance was monitored on a validation set every 2000 steps, with the best-performing model checkpoint saved based on the lowest validation loss. These metrics, along with confusion matrices, offered a detailed view of the model’s performance across the entity classes. The model configuration, training hyperparameters, and evaluation metrics were consistent across all eight training processes, one per language. The best model for each language and the respective datasets are uploaded to the Hugging Face platform on the HUMADEX page.
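Entity-level scores of this kind can be computed with, for example, the seqeval library; the snippet below is an illustrative sketch on toy tag sequences, not the project’s actual evaluation script.

```python
from seqeval.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

# Toy gold and predicted tag sequences in the BIOES-style scheme used above.
y_true = [["O", "B-PROBLEM", "E-PROBLEM", "O", "S-TREATMENT"]]
y_pred = [["O", "B-PROBLEM", "E-PROBLEM", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```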
3. Results
This section presents the evaluation results of the eight fine-tuned multilingual NER models, each trained to extract medical entities (PROBLEM, TEST, TREATMENT) in different languages. The performance of each model was analyzed using the standard metrics, including precision, recall, and F1 score.
Table 6 provides a comparative overview of the performance metrics for the eight language-specific NER models. The English model demonstrated the highest overall performance, with a precision of 80.85%, recall of 79.30%, and F1 score of 80.07%, alongside the lowest evaluation loss at 0.24. This suggests that the English model is both accurate and consistent in identifying entities, likely due to the abundance of high-quality English training data. The German and Spanish models also showed strong results, with F1 scores of 78.70% and 77.61%, respectively, and evaluation losses below 0.35, indicating reliable performance in these languages. Portuguese followed closely with an F1 score of 77.21%, showing its capability in medical NER despite slight variations in recall and precision.
In contrast, the Greek model exhibited the lowest F1 score at 69.10% and a higher evaluation loss of 0.41, suggesting greater challenges in recognizing entities in Greek accurately. The models for Italian, Polish, and Slovenian fell within a mid-range performance level, with F1 scores between 75.56% and 75.72% and evaluation losses ranging from 0.34 to 0.40. These results indicate that while the models performed reasonably well, there may be language-specific nuances or variations in translation quality that affect performance.
The Translation BLEU score column represents the BLEU score for translations from English to each respective language, measuring the quality of the machine translation used to generate training data in different languages.
3.1. Models’ Comparison with Existing Models
To evaluate the effectiveness of our multilingual NER models, we compared their performance with three existing baseline models trained for English, Italian, and Spanish NER tasks. These existing models were tested on the same dataset used for our models to ensure a fair and consistent evaluation. The baseline models used for comparison were domain-specific NER systems tailored for each language. The English baseline was the Stanza i2b2 model [18], which combines word embeddings with forward and backward LSTM-based character-level language models and a CRF decoder, trained on clinical notes to extract entities such as problems, tests, and treatments. The Italian baseline, MedPsyNIT [20], was built on BioBIT and fine-tuned on clinical data from four Italian hospitals, targeting psychiatric and medical entities through low-resource fine-tuning. The Spanish baseline, lcampillos/roberta-es-clinical-trials-ner [21], adapted a biomedical RoBERTa model to identify anatomical, chemical, pathological, and procedural entities from clinical trial texts using the CT-EBM-SP corpus.
Table 7 displays the self-reported F1 score of the models.
In our test set, our models outperformed the existing ones across all three languages, demonstrating superior precision, recall, and F1 scores, as shown in Table 8. For English, our model achieved an F1 score of 80.07%, surpassing the baseline model’s performance of 67.69% by a significant margin. Similarly, in Italian, our model attained an F1 score of 75.60%, compared to the baseline model’s 57.06%. In Spanish, our model’s F1 score of 77.61% showed a clear improvement in handling complex linguistic nuances over the existing model, which achieved an F1 score of 62.60%.
The Italian and Spanish baselines use label sets different from ours. To perform the comparison, we carried out label mapping; Table 9 displays the mapping for the labels of the Spanish model, and Table 10 shows the mapping for the labels of the Italian model.
3.2. Case Studies
To illustrate the performance of our models, we present examples of sentences with the extracted entities categorized into PROBLEM, TEST, and TREATMENT. These examples showcase the practical application of our models in identifying relevant medical entities within unstructured text, emphasizing their ability to handle diverse linguistic constructs across multiple languages. The fine-tuned models were BERT-based models that we retrained using our datasets.
English sentences
Fine-tuned model predictions:
The patient complained of severe (B-PROBLEM) headaches (E-PROBLEM) and nausea (S-PROBLEM) that had persisted for two days. To alleviate the (B-PROBLEM) symptoms (E-PROBLEM), he was prescribed paracetamol (S-TREATMENT) and advised to rest and drink plenty of fluids.
The patient exhibited symptoms (S-PROBLEM) of fever (S-PROBLEM), cough (S-PROBLEM), and body (B-PROBLEM) aches (E-PROBLEM). A (B-TEST) chest (I-TEST) X-ray (I-TEST) was taken to rule out pneumonia (S-PROBLEM). He was prescribed an (B-TREATMENT) antibiotic (E-TREATMENT) and advised to rest.
The patient complained of dizziness (S-PROBLEM), vision (B-PROBLEM) disturbances (E-PROBLEM), and numbness (B-PROBLEM) in (I-PROBLEM) her (I-PROBLEM) hands (E-PROBLEM). An (B-TEST) MRI (I-TEST) of (I-TEST) the (I-TEST) brain (E-TEST) was ordered to rule out a (B-PROBLEM) neurological (I-PROBLEM) cause (E-PROBLEM). A (B-TREATMENT) beta-blocker (I-TREATMENT) was prescribed to stabilize her (B-TEST) blood (I-TEST) pressure (E-TEST).
Existing model predictions:
The patient complained of severe (B-PROBLEM) headaches (E-PROBLEM) and nausea (S-PROBLEM) that had persisted for two days. To alleviate the (B-PROBLEM) symptoms (E-PROBLEM), he was prescribed paracetamol (S-TREATMENT) and advised to rest and drink plenty of fluids.
The patient exhibited symptoms (S-PROBLEM) of fever (S-PROBLEM), cough (S-PROBLEM), and body (B-PROBLEM) aches (E-PROBLEM). A (B-TEST) chest (I-TEST) X-ray (I-TEST) was taken to rule out pneumonia (S-PROBLEM). He was prescribed an (B-TREATMENT) antibiotic (E-TREATMENT) and advised to rest.
The patient complained of dizziness (S-PROBLEM), vision (B-PROBLEM) disturbances (E-PROBLEM), and numbness (B-PROBLEM) in (I-PROBLEM) her (I-PROBLEM) hands (E-PROBLEM). An (B-TEST) MRI (I-TEST) of (I-TEST) the (I-TEST) brain (E-TEST) was ordered to rule out a (B-PROBLEM) neurological (I-PROBLEM) cause (E-PROBLEM). A (B-TREATMENT) beta (I-TREATMENT)-(I-TREATMENT) blocker (E-TREATMENT) was prescribed to stabilize her (B-TEST) blood (I-TEST) pressure (E-TEST).
The fine-tuned model demonstrates superior handling of “colloquial” symptom descriptions. When processing clinical terminology like “neurological cause” and “beta-blocker”, both models maintained precise entity boundaries and correct classification. For non-clinical expressions, such as “body aches” instead of “myalgia”, the model successfully identifies these as PROBLEM entities, showing adaptability to patient-level language. In contrast, the existing model has issues with compound terms and informal expressions, particularly in maintaining consistent entity boundaries.
Spanish sentences
Fine-tuned model predictions:
El paciente se quejó de fuertes (B-PROBLEM) dolores (E-PROBLEM) de cabeza y náuseas (S-PROBLEM) que habían persistido durante dos días. Para aliviar los síntomas, se le recetó paracetamol (S-TREATMENT) y se le aconsejó descansar y beber muchos líquidos.
El paciente presentó síntomas (S-PROBLEM) de fiebre (S-PROBLEM), tos y dolores (E-PROBLEM) corporales (E-PROBLEM). Se le realizó una (B-TEST) radiografía (E-TEST) de tórax para descartar una (B-PROBLEM) neumonía (E-PROBLEM). Se le recetó un (B-TREATMENT) antibiótico (E-TREATMENT) y se le aconsejó descansar.
La paciente se quejó de mareos (S-PROBLEM), alteraciones (E-PROBLEM) de la visión (B-PROBLEM) y entumecimiento (B-PROBLEM) en (I-PROBLEM) las manos (E-PROBLEM). Se ordenó una (B-TEST) resonancia (I-TEST) magnética (E-TEST) del (I-TEST) cerebro (E-TEST) para descartar una causa (E-PROBLEM) neurológica (I-PROBLEM). Se le recetó un (B-TREATMENT) betabloqueante (E-TREATMENT) para estabilizar su (B-TEST) presión (E-TEST) arterial (I-TEST).
Existing model predictions:
El paciente se quejó de fuertes dolores (B-DISO) de (I-DISO) cabeza (I-PROBLEM) y náuseas (B-DISO) que habían persistido durante dos días. Para aliviar los síntomas (B-PROBLEM), se le recetó paracetamol (B-CHEM) y se le aconsejó descansar (B-PROC) y beber muchos líquidos.
El paciente presentó síntomas (B-DISO) de (I-DISO) fiebre (I-DISO), tos (I-DISO) y dolores (B-DISO) corporales (I-DISO). Se le realizó una radiografía (B-PROC) de (I-PROC) tórax (I-PROC) para descartar una neumonía (B-DISO). Se le recetó (B-PROC) un antibiótico (B-CHEM) y se le aconsejó descansar (B-PROC).
La paciente se quejó de mareos (B-DISO), alteraciones (B-DISO) de (I-DISO) la (I-DISO) visión (I-DISO) y entumecimiento (B-DISO) en (I-DISO) las (I-DISO) manos (I-DISO). Se ordenó una resonancia (B-PROC) magnética (I-PROC) del (I-PROC) cerebro (I-PROC) para descartar una causa neurológica. Se le recetó un betabloqueante (B-CHEM) para estabilizar (B-PROC) su presión (B-PROC) arterial (I-PROC).
English translation of the Spanish sentences:
The patient complained of severe headaches and nausea that had persisted for two days. To relieve the symptoms, paracetamol was prescribed, and the patient was advised to rest and drink plenty of fluids.
The patient presented symptoms of fever, cough, and body aches. A chest X-ray was performed to rule out pneumonia. An antibiotic was prescribed, and the patient was advised to rest.
The patient complained of dizziness, vision disturbances, and numbness in the hands. An MRI of the brain was ordered to rule out a neurological cause. A beta-blocker was prescribed to stabilize her blood pressure.
The distinction between clinical and colloquial language is especially evident in the Spanish examples. The model proposed in this paper processed both formal medical terms (“resonancia magnética”—MRI) and everyday expressions (“dolores de cabeza”—headaches) effectively as coherent entities. The existing model showed a bias toward clinical terminology, using a rigid DISO/PROC/CHEM classification that accommodates natural patient expressions poorly. For instance, “mareos” (dizziness) was identified correctly as a PROBLEM by our model but received an overly clinical DISO tag in the existing model.
Italian sentences
Fine-tuned model predictions:
Il paziente ha lamentato forti (B-PROBLEM) mal (E-PROBLEM) di testa (E-PROBLEM) e nausea (S-PROBLEM) che persistevano da due giorni. Per alleviare i sintomi (E-PROBLEM), gli è stato prescritto il paracetamolo (S-TREATMENT) e gli è stato consigliato di riposare e bere molti liquidi.
Il paziente ha manifestato sintomi (S-PROBLEM) di febbre (S-PROBLEM), tosse (S-PROBLEM) e dolori (E-PROBLEM) muscolari. È stata eseguita una (B-TEST) radiografia (E-TEST) del torace (E-TEST) per escludere una (B-PROBLEM) polmonite (E-PROBLEM). Gli è stato prescritto un (B-TREATMENT) antibiotico (E-TREATMENT) e gli è stato consigliato di riposare.
La paziente ha lamentato vertigini (S-PROBLEM), disturbi (E-PROBLEM) visivi (B-PROBLEM) e intorpidimento (B-PROBLEM) delle (I-PROBLEM) mani (E-PROBLEM). È stata ordinata una (B-TREATMENT) risonanza (I-TEST) magnetica (I-TEST) del (I-TEST) cervello per escludere una (B-PROBLEM) causa (E-PROBLEM) neurologica (I-PROBLEM). È stato prescritto un (B-TREATMENT) betabloccante (E-TREATMENT) per stabilizzare la pressione (E-TEST) sanguigna.
Existing model predictions:
Il paziente ha lamentato forti mal di testa e nausea che persistevano da due giorni. Per alleviare i sintomi, gli è stato prescritto il paracetamolo (TRATTAMENTO FARMACOLOGICO (B)) e gli è stato consigliato di riposare e bere molti liquidi.
Il paziente ha manifestato sintomi di febbre, tosse e dolori muscolari. È stata eseguita una radiografia (TEST (B)) del torace per escludere una polmonite. Gli è stato prescritto un antibiotico e gli è stato consigliato di riposare.
La paziente ha lamentato vertigini, disturbi visivi e intorpidimento delle mani. È stata ordinata una risonanza (TEST (B)) magnetica (TEST (B)) del (TEST (I)) cervello (TEST (I)) per escludere una causa neurologica. È stato prescritto un betabloccante (TRATTAMENTO FARMACOLOGICO (B)) per stabilizzare la pressione sanguigna.
English translation of the Italian sentences:
The patient complained of severe headaches and nausea that had persisted for two days. To relieve the symptoms, paracetamol was prescribed, and he was advised to rest and drink plenty of fluids.
The patient presented symptoms of fever, cough, and muscle aches. A chest X-ray was performed to rule out pneumonia. An antibiotic was prescribed, and he was advised to rest.
The patient complained of dizziness, visual disturbances, and numbness in the hands. An MRI of the brain was ordered to rule out a neurological cause. A beta-blocker was prescribed to stabilize blood pressure.
Both the formal and informal medical expressions in Italian revealed key differences between the models. The model proposed in this paper identified colloquial symptom descriptions like “mal di testa” (headache) successfully as PROBLEM entities while maintaining accuracy with clinical terms like “betabloccante” (beta-blocker). The existing model, however, showed a clear preference for formal medical terminology, often missing informal symptom descriptions entirely.
4. Discussion
The presented weakly-supervised multilingual NER pipeline addresses the key limitations identified in the existing approaches, including a single-language focus [20,25], reliance on large annotated datasets, the inability to handle informal patient language [5,6,7], and limited adaptation to resource-poor languages [12,13,14]. Namely, traditional medical NER models like Stanza [17] and GERNERMED [19] demonstrate effectiveness in single languages but struggle with multilingual adaptability [30]. Moreover, the existing multilingual models often depend on extensive annotated datasets, making them impractical for resource-poor languages [12,13]. The proposed approach overcomes these constraints through weak supervision and efficient cross-language translation pipelines, enabling robust performance across multiple languages, including underrepresented ones such as Slovenian and Polish.
A critical challenge in current medical NER models is their limited adaptability across different types of language [10,15]. Traditional medical NER systems, trained primarily on clinical documentation, often struggle to recognize symptoms when patients use colloquial expressions or metaphorical language to describe their conditions [5,8]. Namely, while models like Stanza [17] and MedPsyNIT [20] achieve high performance on formal clinical notes (F1 scores of 88.1% and 89.5%, respectively), their effectiveness diminishes significantly when processing informal patient descriptions. This limitation is particularly evident when trying to exploit the medical NER concept in the context of patient-reported outcomes, where individuals describe their symptoms in natural, conversational language rather than clinical terminology [6,7]. For instance, while a clinician might document “pyrexia”, patients typically describe “feeling hot” or “burning up”. We demonstrate this with a series of case studies comparing existing medical NERs in English, Italian, and Spanish, where our model significantly narrowed the gap between clinical precision and patient expression, making it more suitable for processing real-world patient-reported health data across languages. Moreover, the improved performance over the existing models, especially on non-medical terminology (English: 67.69% to 80.07%, Italian: 57.06% to 75.60%, Spanish: 62.60% to 77.61%), further demonstrates the effectiveness of our approach. The results clearly show that our models (and the pipeline) can efficiently complement traditional PROMs by extracting symptoms from natural language descriptions, supporting more patient-centered data collection. This aligns well with recent research highlighting the importance of natural language processing in healthcare [10], while extending its applicability to multiple languages.
Several limitations remain. First, the current hyperparameter configuration, while effective, still has room for optimization through more extensive tuning. Second, the quality of the machine translation affects model performance, particularly for low-resource languages with less reliable translation tools. This limitation connects to broader challenges in multilingual NLP noted by Zhu et al. [14]. Translation errors, particularly in the case of ambiguous or domain-specific terms, can lead to misinterpretations that may degrade model performance. These challenges are especially evident when handling medical terminology, where incorrect translation may result in inaccurate annotations and entity recognition. Addressing these errors through improved translation models or post-translation correction could mitigate their impact. Third, while our automated annotation process using Stanza [17] enabled efficient dataset creation, it may propagate biases from the base model. A fully manual annotation process would provide more reliable training data but would require prohibitive time and cost investments. Moreover, the reliance on the pretrained BERT architecture means that inherent biases in the original training data may impact performance, especially for rare medical terms or linguistic nuances in low-resource languages. The choice of a basic BERT model rather than multilingual models such as mBERT or XLM-RoBERTa may limit the model’s transferability across languages and reduce its ability to leverage cross-lingual knowledge effectively. Future work includes using more specialized multilingual models, which could improve performance in diverse linguistic settings. Finally, evaluating the model on real patient data, rather than just synthetic examples, would provide a more comprehensive understanding of its effectiveness in real-world applications. The use of synthetic data limits the model’s generalizability and may not capture the full complexity of clinical language used by actual patients. A follow-up clinical study in collaboration with healthcare institutions to evaluate the model on clinical notes is underway; ethical approval has already been issued, and the process is in progress.
Class imbalance remains one of the main challenges, as less-frequent entity types like TEST and TREATMENT may be underrepresented, particularly in languages with fewer examples. This echoes similar challenges noted in recent clinical NLP research [15]. While the current augmentation helps balance the O tags against the medical entity tags, more advanced techniques could better balance the PROBLEM, TEST, and TREATMENT categories. Moreover, sophisticated augmentation could introduce controlled semantic variations in how symptoms are described, helping models better handle the gap between clinical and patient language [5,7]. Furthermore, advanced augmentation techniques could generate language-specific variations that account for cultural and linguistic differences in symptom description [13,14], which is particularly important for the SMILE project’s multicenter implementation.
5. Conclusions
This research demonstrates the potential of fine-tuned BERT-based NER models for multilingual medical text analysis, achieving strong and consistent performance across eight languages, including underrepresented ones such as Slovenian, Polish, and Greek. By outperforming the existing models in English, Italian, and Spanish, our approach highlights the importance of fine-tuning, custom annotations, and multilingual training processes in addressing the challenges of Named Entity Recognition across diverse linguistic contexts. By facilitating the extraction of detailed patient-reported symptoms and medical entities from patient-reported outcomes (PROs), this research addresses the limitations of traditional structured approaches, offering a more inclusive and natural language-driven method for healthcare data collection. Despite challenges such as reliance on pretrained models, potential biases in annotation processes, translation inaccuracies, and class imbalances, the models developed in this research demonstrated strong adaptability and practicality for real-world healthcare applications. Future research should focus on refining the translation pipelines, enhancing the quality of the training data, and conducting extensive hyperparameter tuning to improve performance further. These efforts could expand the applicability of the pipeline to additional languages and tasks, contributing to more equitable and efficient multilingual healthcare information extraction worldwide. Finally, future work should consider developing more sophisticated data augmentation techniques. While the current augmentation focuses on sentence reordering and entity extraction, more advanced techniques could synthesize natural patient language patterns, further improving the model’s ability to handle informal symptom descriptions. Moreover, this could significantly reduce reliance on translation and improve cross-lingual robustness.