1. Introduction
With the growing emphasis on patient-centered care, patient-reported outcome measures (PROMs) have emerged as a critical tool for understanding and addressing patients’ health needs, especially in the management of chronic and complex conditions [1,2]. PROMs allow the systematic collection of patients’ health experiences, symptoms, and quality of life, often using structured instruments such as questionnaires [3,4]. While effective in many cases, traditional PROMs face notable limitations. These include their reliance on structured, predefined questions that may not fully capture patients’ experiences, a lack of adaptability for underrepresented and resource-poor languages, and the significant burden they can place on patients to complete standardized forms [5,6]. Moreover, current PROMs often fail to accommodate natural language expression, which could offer richer and more nuanced insights into patients’ symptoms and health concerns [7]. This issue is particularly pronounced in diverse linguistic contexts, where a lack of tools for resource-poor languages limits inclusivity and equity in data collection and analysis [8,9].
In light of the limitations of traditional PROMs, there is a growing need for approaches that can capture patient experiences in a more natural and inclusive manner, accommodating diverse linguistic contexts and minimizing patient burden [10]. To address these challenges, advancements in Natural Language Processing (NLP), such as Named Entity Recognition (NER), offer promising tools for extracting patient-reported information directly from unstructured text, enabling a richer and more equitable understanding of health concerns across languages. NER is a fundamental NLP task that involves extracting entities from text and classifying them into predefined categories, such as locations, people, or organizations [11]. In the medical domain, clinical NER models have been developed to identify specific information like diseases, drugs, dosages, and frequencies, with recent advancements leveraging deep learning algorithms to enhance their performance [11]. While progress has been made in medical NER for resource-abundant languages, there remains a notable research gap concerning resource-poor languages [12]. Addressing this gap is important to ensure equitable healthcare advancements across various language groups. Moreover, beyond structuring medical notes, there is a need to extract symptoms or problems as expressed by patients in their own words, facilitating a more patient-centered approach to healthcare.
This paper describes a weakly-supervised pipeline for training and evaluating medical NER models across eight languages: English, German, Greek, Spanish, Italian, Polish, Portuguese, and Slovenian. We refer to the approach as “weakly-supervised” because it relies on automatically annotated data rather than manually labeled gold-standard datasets, using state-of-the-art models to generate training labels. By fine-tuning BERT-based models, we enabled the extraction of medical entities like symptoms, treatments, and tests from medical and everyday texts. The models will be used in a multicentric feasibility study of the SMILE project, as part of digital health interventions to support the (mental) well-being of children across multiple sites in Germany, Italy, the UK, Spain, Cyprus, Poland, and Slovenia. By focusing on underrepresented languages and patient-expressed symptoms, this research seeks to enhance the accessibility and quality of healthcare information extraction.
Clinical Named Entity Recognition (NER) models have made significant progress in structuring and summarizing clinical notes, predominantly in English. However, the existing models face several challenges, including a monolingual focus, inconsistent performance reporting, poor handling of non-clinical semantics, and limited adaptability to low-resource languages [13,14,15]. The bert-medical-ner model (https://huggingface.co/silpakanneganti/bert-medical-ner, accessed on 14 April 2025) on the HuggingFace (https://huggingface.co/, accessed on 14 April 2025) platform extracts entities such as age, sex, clinical event, non-biological location, duration, severity, biological structure, sign/symptom, and more, achieving a self-reported F1 score of 67.5%. Similarly, bert-base-uncased_clinical-ner (https://huggingface.co/samrawal/bert-base-uncased_clinical-ner, accessed on 14 April 2025) is another NER model, which extracts problem, treatment, and test entities from medical texts, but it does not report any performance evaluation, leaving its effectiveness unclear. Other models, such as the one proposed in [16] and Stanza’s clinical i2b2 model [17,18], report F1 scores of 81.2% and 88.1%, respectively, yet their focus remains restricted to English medical texts. Domain-specific models like biobert_ncbi_disease_ner (https://huggingface.co/ugaray96/biobert_ncbi_disease_ner, accessed on 14 April 2025) and en_ner_bc5cdr_md (https://huggingface.co/Kaelan/en_ner_bc5cdr_md, accessed on 14 April 2025) are tailored for tasks like disease or chemical recognition, reporting F1 scores of 85.7% for chemicals and diseases. Moreover, they are well tuned to medical semantics but not to the way individuals express or describe their symptoms. Efforts in non-English languages include GERNERMED [19] for German, which extracts entities such as drug dosage and frequency (F1: 81.5%), MedPsyNIT [20] for Italian, which extracts symptoms and comorbidities (F1: 89.5%), and the model in [21] for Spanish, which recognizes anatomical, chemical, and pathological entities (F1: 86.4%). A model for Portuguese achieves a high F1 score of 92.6% [22] for extracting entities like disorders, medical procedures, and pharmacologic substances. Despite these successes, multilingual models for medical NER remain scarce, and no models currently exist for Greek, Polish, or Slovenian.
In summary, while existing clinical NER models demonstrate strong performance in specific settings, they are constrained by their focus on a single language or a narrow range of languages, their reliance on extensive annotated datasets that are expensive and time-intensive to create, and their limited ability to adapt across different types of texts, such as formal medical documents and informal everyday language.
Our aim is to make the following contributions:
- we introduce a flexible, multilingual, and adaptable pipeline for medical NER;
- our pipeline enables effective extraction of medical entities (problem, test, and treatment) from a given body of text;
- our pipeline is effective across diverse languages, with a focus on low-resource languages;
- our pipeline is designed to handle both formal and informal textual contexts.
2. Materials and Methods
This section describes the data sources, preprocessing pipeline, and experimental setup used in this study. Section 2.1 outlines the overall methodology of the pipeline. Section 2.2 describes the data sources and corpus creation, while Section 2.3 presents the data preprocessing steps. Section 2.4 presents the annotation process, followed by the translation and multilingual adaptation process in Section 2.5. Word alignment is described in Section 2.6, and Section 2.7 presents the data augmentation techniques that were used. Preparation for training, including data splitting, model configuration, and evaluation metrics, is described in Section 2.8, Section 2.9 and Section 2.10, respectively.
2.1. Overall Methodology
In this paper, we propose a weakly-supervised multilingual Named Entity Recognition (NER) pipeline for symptom extraction from medical texts and patient-reported data. The pipeline consists of several key stages: data preparation, annotation, translation for multilingual adaptation, data augmentation, and model fine-tuning. Each step contributes to building a robust NER model capable of identifying symptom-related entities across eight languages (English, German, Greek, Italian, Spanish, Polish, Portuguese, and Slovenian) in unstructured medical texts.
Figure 1 shows an overview of the pipeline flow. We begin with the creation of a corpus of medical texts in English, which is then preprocessed and annotated by a state-of-the-art NER model, producing an annotated corpus. This annotated corpus is translated into a target language using machine translation models, and the translated corpus is then used to fine-tune a BERT-based NER model so that it learns to identify the predefined categories in different languages.
2.2. Data Sources and Corpus Creation
The initial stage involved combining multiple datasets of medical texts to form a comprehensive corpus. This corpus was created by merging two key datasets: autotrain-data-1w6s-u4vt-i7yo (https://huggingface.co/datasets/Kabatubare/autotrain-data-1w6s-u4vt-i7yo, accessed on 14 April 2025) and medical_qa_meds (https://huggingface.co/datasets/s200862/medical_qa_meds/discussions, accessed on 14 April 2025). The former consists of clinical notes, discharge summaries, and other unstructured medical records from various healthcare providers, providing a rich source of terminology related to patient symptoms, diagnoses, and treatments. The latter contains a diverse collection of question-and-answer pairs focused on medical topics, including descriptions of symptoms, diagnoses, and treatment options. By merging these datasets, we ensured a diverse and comprehensive representation of medical terminology and symptom descriptions in English, capturing different styles of clinical language, patient narratives, and medical abbreviations. Together, the two datasets contain 29,379 text entries.
Table 1 describes the details of the two datasets.
2.3. Preprocessing
The data underwent a thorough preprocessing phase. Initially, the texts were cleaned to remove unnecessary content while retaining relevant information. Since our dataset consisted of question–answer pairs between a user and an assistant, we identified certain textual elements that could be removed. In the autotrain-data-1w6s-u4vt-i7yo dataset, we removed strings such as “Human:”, “Assistant:”, “\n”, and “\t”, and replaced hyphens (“-”) between words with a single space. In the s200862/medical_qa_meds dataset, we eliminated tags like “[INST]”, “[/INST]”, “<s>”, and “</s>”, along with newline (“\n”) and tab (“\t”) characters. Next, all punctuation was removed from the text. Finally, the data were converted to lowercase for consistency. As stated above, the rows of the datasets are medical texts made up of multiple sentences. We split all the texts so that each row contains one sentence; thus, the number of examples rose from around 29,000 to 261,936 sentences.
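The following Python sketch illustrates these cleaning steps. It is a minimal illustration of the procedure described above, not the pipeline’s actual code: the helper names are ours, and sentences are segmented before punctuation removal, since the naive split relies on sentence-final punctuation.

```python
import re
import string

# Chat/markup fragments to strip, as listed above.
TAGS = ["Human:", "Assistant:", "[INST]", "[/INST]", "<s>", "</s>"]

def split_sentences(text: str) -> list[str]:
    # Naive segmentation on sentence-final punctuation; a proper
    # segmenter (e.g., Stanza's tokenizer) could be used instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean_sentence(text: str) -> str:
    for tag in TAGS:
        text = text.replace(tag, " ")
    text = text.replace("\n", " ").replace("\t", " ")
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # hyphens between words
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).lower().strip()

rows = ["Human: I have a headache.\nAssistant: Try an over-the-counter painkiller."]
sentences = [clean_sentence(s) for row in rows for s in split_sentences(row)]
print(sentences)  # ['i have a headache', 'try an over the counter painkiller']
```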
2.4. Annotation Process
To annotate the corpus of medical texts, we utilized Stanza [18], a Python NLP package designed for linguistic analysis across multiple human languages. Stanza offers a suite of efficient tools that perform a variety of linguistic processing tasks, including sentence segmentation, word tokenization, part-of-speech tagging, and entity recognition. Starting from raw text, Stanza structures the data, identifying syntactic and semantic elements to support further analysis.
In particular, Stanza provides an i2b2 model specialized for NER tasks in the medical domain. This model is designed to extract key medical entities: PROBLEM (symptom), TEST, and TREATMENT. Trained on the i2b2-2010 corpus [23], which includes manually annotated clinical notes from institutions such as Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center, the i2b2 model is highly accurate and well suited for medical text analysis. With an F1 score of 88.13%, the i2b2 model identifies and classifies medical entities effectively, making it a robust tool for NER tasks in clinical settings.
Table 2 presents the i2b2 model from Stanza.
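For illustration, the snippet below shows how such annotation can be run with Stanza’s English clinical package; this is a minimal sketch assuming the publicly documented mimic/i2b2 configuration, not the exact code used in our pipeline.

```python
import stanza

# Download the English clinical (mimic) package with the i2b2 NER model
# (assumes the publicly documented Stanza configuration).
stanza.download("en", package="mimic", processors={"ner": "i2b2"})
nlp = stanza.Pipeline("en", package="mimic", processors={"ner": "i2b2"})

doc = nlp("the patient was prescribed paracetamol for severe headaches")
for ent in doc.entities:
    print(ent.text, ent.type)  # e.g., "severe headaches" -> PROBLEM
```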
2.5. Translation and Multilingual Adaptation
To enable multilingual capabilities in our NER pipeline, we employed machine translation models to convert the annotated English medical texts into several target languages: German, Greek, Spanish, Italian, Polish, Portuguese, and Slovenian. These translations allowed us to develop a dataset that supports multilingual symptom extraction, enhancing the model’s utility across diverse linguistic contexts.
For the translations from English to German, Greek, Spanish, Italian, Polish, and Portuguese, we utilized models developed by the Language Technology Research Group at the University of Helsinki [24]. These models, licensed under CC-BY-4.0, are optimized for direct translations from English to each target language and are available on the Hugging Face platform [17]. Each model was selected specifically for its accuracy and reliability in medical contexts, given the specific terminology present in clinical texts.
For English to Slovenian translations, we used the mBART model [25], a fine-tuned version of the mBART-large-50 model capable of translating between any pair of 50 languages. We chose this model for its superior performance in handling low-resource languages like Slovenian. It enables direct translation by setting the target language ID as the first generated token in the output sequence. This flexibility and multilingual adaptability were crucial in creating a reliable Slovenian translation.
The Helsinki-NLP collection does not include a model for Slovenian, which is why the mBART model was chosen for this language; conversely, Greek is not supported by the mBART model, which is why the Helsinki-NLP model was chosen for Greek. The remaining languages are supported by both. For these, we chose the Helsinki-NLP models because of their higher BLEU scores compared to the mBART model.
Table 3 shows the BLEU scores of both models for German, Spanish, Italian, Polish, and Portuguese.
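Both translation routes can be driven through the transformers library, as in the sketch below. It assumes the public Helsinki-NLP opus-mt checkpoints and a standard mBART-50 translation checkpoint, which may differ from the exact fine-tuned versions used here.

```python
from transformers import (MarianMTModel, MarianTokenizer,
                          MBartForConditionalGeneration, MBart50TokenizerFast)

text = ["the patient reported severe headaches and nausea"]

# English -> German via a Helsinki-NLP Marian model (analogous checkpoints
# exist for el, es, it, pl, and pt).
marian_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
marian = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
batch = marian_tok(text, return_tensors="pt", padding=True)
print(marian_tok.batch_decode(marian.generate(**batch), skip_special_tokens=True))

# English -> Slovenian via mBART-50: the target language ID is forced as
# the first generated token, as described above.
mbart_tok = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX")
mbart = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt")
batch = mbart_tok(text, return_tensors="pt")
out = mbart.generate(**batch,
                     forced_bos_token_id=mbart_tok.lang_code_to_id["sl_SI"])
print(mbart_tok.batch_decode(out, skip_special_tokens=True))
```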
2.6. Word Alignment
To ensure accurate word alignment in the translated texts, we used the awesome-align tool (https://huggingface.co/aneuraz/awesome-align-with-co, accessed on 14 April 2025), which leverages contextual word embeddings from models like BERT to map words between languages effectively [26]. The alignment process involved several key steps. Contextual word embeddings were generated for each word in a sentence to capture the word’s meaning based on its context within the sentence. Two techniques were employed to calculate alignment scores. First, probability thresholding involved generating a similarity matrix using the dot product of word embeddings, which was then converted into probabilities through a function like softmax. This approach identified high-probability word pairs as aligned. Second, optimal transport treated alignment as a transportation problem, minimizing the “cost” of transferring probability mass between words in different languages and producing a matrix of probable alignments. The alignments were calculated in both directions (source-to-target and target-to-source), with the final alignment derived from the intersection of these bidirectional results. For words split into subwords by the model, alignment was ensured through any matching subwords, guaranteeing complete word coverage in the alignment process.
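The probability-thresholding variant can be sketched as follows. This is a simplified illustration assuming the awesome-align checkpoint named above, with an illustrative threshold and mean-pooled subword embeddings rather than the tool’s full matching logic.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "aneuraz/awesome-align-with-co"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(words):
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    # Average subword vectors so each word gets one embedding.
    ids = enc.word_ids()
    vecs = [hidden[[i for i, w in enumerate(ids) if w == k]].mean(0)
            for k in range(len(words))]
    return torch.stack(vecs)

src, tgt = ["severe", "headaches"], ["starke", "Kopfschmerzen"]
sim = embed(src) @ embed(tgt).T
# Softmax in both directions; keep pairs probable in both (the intersection).
fwd, bwd = sim.softmax(dim=-1), sim.softmax(dim=0)
aligned = (fwd > 0.4) & (bwd > 0.4)   # illustrative threshold
for i, j in aligned.nonzero().tolist():
    print(src[i], "->", tgt[j])
```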
2.7. Data Augmentation
Before splitting the dataset into train/validation/test sets, we performed data augmentation to enhance the diversity and robustness of the training data. Augmentation was used because the dataset contained a large number of O tags, namely, tokens that do not belong to any category (PROBLEM, TEST, or TREATMENT); this leads to class imbalance, which impacts the model’s performance. Furthermore, during model development, we observed signs of overfitting when training without data augmentation, as the evaluation loss began to rise after several epochs. This further motivated the introduction of augmentation to improve generalization. The English dataset served as the basis for tuning these strategies, which were then applied consistently across all the languages.
The augmentation process involved two main strategies, sketched in code after this paragraph. First, sentence reordering was applied, where the words within each sentence were rearranged to create new variations of the same sentence structure. This method increased the variability of the dataset, enabling the model to generalize better to different sentence formations. Second, entity extraction was performed by identifying all the words within each sentence annotated with non-“O” labels (i.e., labeled as PROBLEM, TEST, or TREATMENT). These extracted words were used to generate new sentences, which were then added back into the dataset. This ensured that the model would encounter more examples of key medical entities during training.
After the data augmentation, each dataset consisted of approximately 440,000 rows.
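A minimal sketch of the two strategies on (tokens, tags) examples is given below; the helper names are ours, not the pipeline’s actual code.

```python
import random

def reorder_sentence(tokens, tags, seed=0):
    # Jointly shuffle word/tag pairs to create a new word-order variant.
    # (A fuller implementation might move multi-token entities as units
    # to preserve B-/I-/E- contiguity.)
    pairs = list(zip(tokens, tags))
    random.Random(seed).shuffle(pairs)
    new_tokens, new_tags = zip(*pairs)
    return list(new_tokens), list(new_tags)

def extract_entities(tokens, tags):
    # Keep only tokens with non-"O" labels and form a new, entity-dense
    # example to add back into the dataset.
    kept = [(w, t) for w, t in zip(tokens, tags) if t != "O"]
    return ([w for w, _ in kept], [t for _, t in kept]) if kept else None

tokens = ["he", "was", "prescribed", "paracetamol", "for", "headaches"]
tags = ["O", "O", "O", "S-TREATMENT", "O", "S-PROBLEM"]
print(reorder_sentence(tokens, tags))
print(extract_entities(tokens, tags))
# (['paracetamol', 'headaches'], ['S-TREATMENT', 'S-PROBLEM'])
```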
2.8. Data Splitting
Following the augmentation, the dataset was split into three distinct sets. The training set comprised 80% of the dataset, the validation set accounted for 10% of the dataset, and the remaining 10% was designated as the test set.
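For illustration, an 80/10/10 split can be obtained with two successive splits; `examples` stands in for the augmented list of (tokens, tags) pairs and the random seed is illustrative.

```python
from sklearn.model_selection import train_test_split

examples = [(["headaches"], ["S-PROBLEM"])] * 10  # placeholder data
train, rest = train_test_split(examples, test_size=0.20, random_state=42)
val, test = train_test_split(rest, test_size=0.50, random_state=42)
print(len(train), len(val), len(test))  # 8 1 1
```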
2.9. Model Configuration
For our symptom extraction pipeline, we fine-tuned the BERT base (cased) model, a widely used transformer model pretrained on English text using a masked language modeling (MLM) objective [27]. This model, developed by Google, is case sensitive, differentiating between lowercase and uppercase words (e.g., “english” vs. “English”). BERT was pretrained on large English corpora, including BookCorpus and English Wikipedia, using two main objectives: MLM, where 15% of the words in each sentence are masked and predicted, and Next Sentence Prediction (NSP), which trains the model to predict sentence order [27]. The model configuration includes a vocabulary size of 30,000, a maximum token length of 512, and an Adam optimizer with learning rate warmup and linear decay [27]. BERT achieved high performance on a range of NLP tasks, including an average score of 79.6 on the GLUE benchmark, and is well suited for token classification tasks, such as named entity recognition, when fine-tuned on specific labeled datasets [27]. This makes it an ideal base for developing a specialized model for medical named entity recognition in our pipeline. The architecture of the BERT model specialized for NER tasks is displayed in Figure 2.
The model was fine-tuned to classify entities into specific categories (PROBLEM, TEST, and TREATMENT) using the IOB (Inside-Outside-Beginning) tagging scheme [28], a widely used standard in NER, extended here with additional tags (B-, I-, E-, S-) to capture entity boundaries. The key training parameters were set as follows: the architecture utilized was BERT-base-cased, a pretrained transformer model with 12 layers, 768 hidden units, and a vocabulary of 30,000 tokens.
For the training parameters, the model was trained for 200 epochs with a batch size of 64; however, early stopping based on validation loss was applied, and in practice, training typically converged within 20–30 epochs. An AdamW optimizer was used, featuring a learning rate of 3 × 10⁻⁵ and a weight decay of 0.01 to prevent overfitting. The sequence length was capped at 128 tokens to ensure efficient processing. To address the class imbalance, a focal loss function [29] was applied, emphasizing harder-to-classify samples.
Table 4 displays more details on the parameters used for model training.
These configurations were selected to maximize the model’s ability to extract medical entities accurately from medical texts while maintaining efficiency and generalizability.
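A sketch of this configuration using the Hugging Face Trainer is shown below; the label list and the focal-loss focusing parameter (gamma = 2) are our illustrative assumptions, and the subclass name is ours rather than the pipeline’s actual code.

```python
import torch
from transformers import BertForTokenClassification, Trainer, TrainingArguments

# BIOES-style tag set over the three entity categories (assumed layout).
labels = ["O"] + [f"{p}-{e}" for e in ("PROBLEM", "TEST", "TREATMENT")
                  for p in ("B", "I", "E", "S")]
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

class FocalLossTrainer(Trainer):
    # Replace the default cross-entropy with focal loss to emphasize
    # harder-to-classify tokens, as described above.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        gold = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits.view(-1, model.config.num_labels)
        targets = gold.view(-1)
        ce = torch.nn.functional.cross_entropy(
            logits, targets, reduction="none", ignore_index=-100)
        pt = torch.exp(-ce)                     # prob. of the true class
        mask = targets != -100                  # skip padded positions
        loss = ((1.0 - pt) ** 2.0 * ce)[mask].mean()  # focal loss, gamma = 2
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="medical-ner",
    num_train_epochs=200,               # early stopping halts much earlier
    per_device_train_batch_size=64,
    learning_rate=3e-5,                 # AdamW is the Trainer default
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=2000,
    save_steps=2000,
    load_best_model_at_end=True,
)
```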
2.10. Evaluation
The fine-tuned BERT-based NER model was evaluated using the standard metrics of precision, recall, and F1 score (Table 5), which provide insights into the model’s ability to identify and classify medical entities accurately. Early stopping was employed to prevent overfitting, halting training if the validation loss did not improve after 30 evaluation steps.
The performance was monitored on a validation set every 2000 steps, with the best-performing model checkpoint saved based on the lowest validation loss. These metrics, along with confusion matrices, offered a detailed view of the model’s performance across the entity classes. The model configuration, training hyperparameters, and evaluation metrics were consistent across all eight training processes, one per language. The best model for each language and the respective datasets are uploaded to the Hugging Face platform on the HUMADEX page.
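Entity-level scores of this kind can be computed with, for example, the seqeval library; the snippet below is an illustrative sketch on toy tag sequences, not the project’s actual evaluation script.

```python
from seqeval.metrics import (classification_report, f1_score,
                             precision_score, recall_score)

# Toy gold and predicted tag sequences in the BIOES-style scheme used above.
y_true = [["O", "B-PROBLEM", "E-PROBLEM", "O", "S-TREATMENT"]]
y_pred = [["O", "B-PROBLEM", "E-PROBLEM", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```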
3. Results
This section presents the evaluation results of the eight fine-tuned multilingual NER models, each trained to extract medical entities (PROBLEM, TEST, TREATMENT) in different languages. The performance of each model was analyzed using the standard metrics, including precision, recall, and F1 score.
Table 6 provides a comparative overview of the performance metrics for the eight language-specific NER models. The English model demonstrated the highest overall performance, with a precision of 80.85%, recall of 79.30%, and F1 score of 80.07%, alongside the lowest evaluation loss at 0.24. This suggests that the English model is both accurate and consistent in identifying entities, likely due to the abundance of high-quality English training data. The German and Spanish models also showed strong results, with F1 scores of 78.70% and 77.61%, respectively, and evaluation losses below 0.35, indicating reliable performance in these languages. Portuguese followed closely with an F1 score of 77.21%, showing its capability in medical NER despite slight variations in recall and precision.
In contrast, the Greek model exhibited the lowest F1 score at 69.10% and a higher evaluation loss of 0.41, suggesting greater challenges in recognizing entities in Greek accurately. The models for Italian, Polish, and Slovenian fell within a mid-range performance level, with F1 scores between 75.56% and 75.72% and evaluation losses ranging from 0.34 to 0.40. These results indicate that while the models performed reasonably well, there may be language-specific nuances or variations in translation quality that affect performance.
The Translation BLEU score column represents the BLEU score for translations from English to each respective language, measuring the quality of the machine translation used to generate training data in different languages.
3.1. Models’ Comparison with Existing Models
To evaluate the effectiveness of our multilingual NER models, we compared their performance with three existing baseline models trained for English, Italian, and Spanish NER tasks. These existing models were tested on the same dataset used for our models to ensure a fair and consistent evaluation. The baseline models used for comparison were domain-specific NER systems tailored for each language. The English baseline was the Stanza i2b2 model [18], which combines word embeddings with forward and backward LSTM-based character-level language models and a CRF decoder, trained on clinical notes to extract entities such as problems, tests, and treatments. The Italian baseline, MedPsyNIT [20], was built on BioBIT and fine-tuned on clinical data from four Italian hospitals, targeting psychiatric and medical entities through low-resource fine-tuning. The Spanish baseline, lcampillos/roberta-es-clinical-trials-ner [21], adapted a biomedical RoBERTa model to identify anatomical, chemical, pathological, and procedural entities from clinical trial texts using the CT-EBM-SP corpus.
Table 7 displays the self-reported F1 score of the models.
In our test set, our models outperformed the existing ones across all three languages, demonstrating superior precision, recall, and F1 scores, as shown in Table 8. For English, our model achieved an F1 score of 80.07%, surpassing the baseline model’s performance of 67.69% by a significant margin. Similarly, in Italian, our model attained an F1 score of 75.60%, compared to the baseline model’s 57.06%. In Spanish, our model’s F1 score of 77.61% showed a clear improvement in handling complex linguistic nuances over the existing model, which achieved an F1 score of 62.60%.
The Italian and Spanish baselines use label sets different from ours. To perform the comparison, we carried out label mapping; Table 9 displays the mapping for the labels of the Spanish model, and Table 10 shows the mapping for the labels of the Italian model.
3.2. Case Studies
To illustrate the performance of our models, we present examples of sentences with the extracted entities categorized into PROBLEM, TEST, and TREATMENT. These examples showcase the practical application of our models in identifying relevant medical entities within unstructured text, emphasizing their ability to handle diverse linguistic constructs across multiple languages. The fine-tuned models were BERT-based models that we retrained using our datasets.
English sentences
Fine-tuned model predictions:
The patient complained of severe (B-PROBLEM) headaches (E-PROBLEM) and nausea (S-PROBLEM) that had persisted for two days. To alleviate the (B-PROBLEM) symptoms (E-PROBLEM), he was prescribed paracetamol (S-TREATMENT) and advised to rest and drink plenty of fluids.
The patient exhibited symptoms (S-PROBLEM) of fever (S-PROBLEM), cough (S-PROBLEM), and body (B-PROBLEM) aches (E-PROBLEM). A (B-TEST) chest (I-TEST) X-ray (I-TEST) was taken to rule out pneumonia (S-PROBLEM). He was prescribed an (B-TREATMENT) antibiotic (E-TREATMENT) and advised to rest.
The patient complained of dizziness (S-PROBLEM), vision (B-PROBLEM) disturbances (E-PROBLEM), and numbness (B-PROBLEM) in (I-PROBLEM) her (I-PROBLEM) hands (E-PROBLEM). An (B-TEST) MRI (I-TEST) of (I-TEST) the (I-TEST) brain (E-TEST) was ordered to rule out a (B-PROBLEM) neurological (I-PROBLEM) cause (E-PROBLEM). A (B-TREATMENT) beta-blocker (I-TREATMENT) was prescribed to stabilize her (B-TEST) blood (I-TEST) pressure (E-TEST).
Existing model predictions:
The patient complained of severe (B-PROBLEM) headaches (E-PROBLEM) and nausea (S-PROBLEM) that had persisted for two days. To alleviate the (B-PROBLEM) symptoms (E-PROBLEM), he was prescribed paracetamol (S-TREATMENT) and advised to rest and drink plenty of fluids.
The patient exhibited symptoms (S-PROBLEM) of fever (S-PROBLEM), cough (S-PROBLEM), and body (B-PROBLEM) aches (E-PROBLEM). A (B-TEST) chest (I-TEST) X-ray (I-TEST) was taken to rule out pneumonia (S-PROBLEM). He was prescribed an (B-TREATMENT) antibiotic (E-TREATMENT) and advised to rest.
The patient complained of dizziness (S-PROBLEM), vision (B-PROBLEM) disturbances (E-PROBLEM), and numbness (B-PROBLEM) in (I-PROBLEM) her (I-PROBLEM) hands (E-PROBLEM). An (B-TEST) MRI (I-TEST) of (I-TEST) the (I-TEST) brain (E-TEST) was ordered to rule out a (B-PROBLEM) neurological (I-PROBLEM) cause (E-PROBLEM). A (B-TREATMENT) beta (I-TREATMENT)-(I-TREATMENT) blocker (E-TREATMENT) was prescribed to stabilize her (B-TEST) blood (I-TEST) pressure (E-TEST).
The fine-tuned model demonstrates superior handling of “colloquial” symptom descriptions. When processing clinical terminology like “neurological cause” and “beta-blocker”, both models maintained precise entity boundaries and correct classification. For non-clinical expressions, such as “body aches” instead of “myalgia”, the model successfully identifies these as PROBLEM entities, showing adaptability to patient-level language. In contrast, the existing model has issues with compound terms and informal expressions, particularly in maintaining consistent entity boundaries.
Spanish sentences
Fine-tuned model predictions:
El paciente se quejó de fuertes (B-PROBLEM) dolores (E-PROBLEM) de cabeza y náuseas (S-PROBLEM) que habían persistido durante dos días. Para aliviar los síntomas, se le recetó paracetamol (S-TREATMENT) y se le aconsejó descansar y beber muchos líquidos.
El paciente presentó síntomas (S-PROBLEM) de fiebre (S-PROBLEM), tos y dolores (E-PROBLEM) corporales (E-PROBLEM). Se le realizó una (B-TEST) radiografía (E-TEST) de tórax para descartar una (B-PROBLEM) neumonía (E-PROBLEM). Se le recetó un (B-TREATMENT) antibiótico (E-TREATMENT) y se le aconsejó descansar.
La paciente se quejó de mareos (S-PROBLEM), alteraciones (E-PROBLEM) de la visión (B-PROBLEM) y entumecimiento (B-PROBLEM) en (I-PROBLEM) las manos (E-PROBLEM). Se ordenó una (B-TEST) resonancia (I-TEST) magnética (E-TEST) del (I-TEST) cerebro (E-TEST) para descartar una causa (E-PROBLEM) neurológica (I-PROBLEM). Se le recetó un (B-TREATMENT) betabloqueante (E-TREATMENT) para estabilizar su (B-TEST) presión (E-TEST) arterial (I-TEST).
Existing model predictions:
El paciente se quejó de fuertes dolores (B-DISO) de (I-DISO) cabeza (I-PROBLEM) y náuseas (B-DISO) que habían persistido durante dos días. Para aliviar los síntomas (B-PROBLEM), se le recetó paracetamol (B-CHEM) y se le aconsejó descansar (B-PROC) y beber muchos líquidos.
El paciente presentó síntomas (B-DISO) de (I-DISO) fiebre (I-DISO), tos (I-DISO) y dolores (B-DISO) corporales (I-DISO). Se le realizó una radiografía (B-PROC) de (I-PROC) tórax (I-PROC) para descartar una neumonía (B-DISO). Se le recetó (B-PROC) un antibiótico (B-CHEM) y se le aconsejó descansar (B-PROC).
La paciente se quejó de mareos (B-DISO), alteraciones (B-DISO) de (I-DISO) la (I-DISO) visión (I-DISO) y entumecimiento (B-DISO) en (I-DISO) las (I-DISO) manos (I-DISO). Se ordenó una resonancia (B-PROC) magnética (I-PROC) del (I-PROC) cerebro (I-PROC) para descartar una causa neurológica. Se le recetó un betabloqueante (B-CHEM) para estabilizar (B-PROC) su presión (B-PROC) arterial (I-PROC).
English translation of the Spanish sentences:
The patient complained of severe headaches and nausea that had persisted for two days. To relieve the symptoms, paracetamol was prescribed, and the patient was advised to rest and drink plenty of fluids.
The patient presented symptoms of fever, cough, and body aches. A chest X-ray was performed to rule out pneumonia. An antibiotic was prescribed, and the patient was advised to rest.
The patient complained of dizziness, vision disturbances, and numbness in the hands. An MRI of the brain was ordered to rule out a neurological cause. A beta-blocker was prescribed to stabilize her blood pressure.
The distinction between clinical and colloquial language is especially evident in the Spanish examples. The model proposed in this paper processed both formal medical terms (“resonancia magnética”—MRI) and everyday expressions (“dolores de cabeza”—headaches) effectively as coherent entities. The existing model showed a bias toward clinical terminology, using a rigid DISO/PROC/CHEM classification that accommodates natural patient expressions poorly. For instance, “mareos” (dizziness) was identified correctly as a PROBLEM by our model but received an overly clinical DISO tag in the existing model.
Italian sentences
Fine-tuned model predictions:
Il paziente ha lamentato forti (B-PROBLEM) mal (E-PROBLEM) di testa (E-PROBLEM) e nausea (S-PROBLEM) che persistevano da due giorni. Per alleviare i sintomi (E-PROBLEM), gli è stato prescritto il paracetamolo (S-TREATMENT) e gli è stato consigliato di riposare e bere molti liquidi.
Il paziente ha manifestato sintomi (S-PROBLEM) di febbre (S-PROBLEM), tosse (S-PROBLEM) e dolori (E-PROBLEM) muscolari. È stata eseguita una (B-TEST) radiografia (E-TEST) del torace (E-TEST) per escludere una (B-PROBLEM) polmonite (E-PROBLEM). Gli è stato prescritto un (B-TREATMENT) antibiotico (E-TREATMENT) e gli è stato consigliato di riposare.
La paziente ha lamentato vertigini (S-PROBLEM), disturbi (E-PROBLEM) visivi (B-PROBLEM) e intorpidimento (B-PROBLEM) delle (I-PROBLEM) mani (E-PROBLEM). È stata ordinata una (B-TREATMENT) risonanza (I-TEST) magnetica (I-TEST) del (I-TEST) cervello per escludere una (B-PROBLEM) causa (E-PROBLEM) neurologica (I-PROBLEM). È stato prescritto un (B-TREATMENT) betabloccante (E-TREATMENT) per stabilizzare la pressione (E-TEST) sanguigna.
Existing model predictions:
Il paziente ha lamentato forti mal di testa e nausea che persistevano da due giorni. Per alleviare i sintomi, gli è stato prescritto il paracetamolo (TRATTAMENTO FARMACOLOGICO (B)) e gli è stato consigliato di riposare e bere molti liquidi.
Il paziente ha manifestato sintomi di febbre, tosse e dolori muscolari. È stata eseguita una radiografia (TEST (B)) del torace per escludere una polmonite. Gli è stato prescritto un antibiotico e gli è stato consigliato di riposare.
La paziente ha lamentato vertigini, disturbi visivi e intorpidimento delle mani. È stata ordinata una risonanza (TEST (B)) magnetica (TEST (B)) del (TEST (I)) cervello (TEST (I)) per escludere una causa neurologica. È stato prescritto un betabloccante (TRATTAMENTO FARMACOLOGICO (B)) per stabilizzare la pressione sanguigna.
English translation of the Italian sentences:
The patient complained of severe headaches and nausea that had persisted for two days. To relieve the symptoms, paracetamol was prescribed, and he was advised to rest and drink plenty of fluids.
The patient presented symptoms of fever, cough, and muscle aches. A chest X-ray was performed to rule out pneumonia. An antibiotic was prescribed, and he was advised to rest.
The patient complained of dizziness, visual disturbances, and numbness in the hands. An MRI of the brain was ordered to rule out a neurological cause. A beta-blocker was prescribed to stabilize blood pressure.
Both the formal and informal medical expressions in Italian revealed key differences between the models. The model proposed in this paper identified colloquial symptom descriptions like “mal di testa” (headache) successfully as PROBLEM entities while maintaining accuracy with clinical terms like “betabloccante” (beta-blocker). The existing model, however, showed a clear preference for formal medical terminology, often missing informal symptom descriptions entirely.
4. Discussion
The presented weakly-supervised multilingual NER pipeline addresses the key limitations identified in the existing approaches, including a single-language focus [20,25], reliance on large annotated datasets, the inability to handle informal patient language [5,6,7], and limited adaptation to resource-poor languages [12,13,14]. Namely, traditional medical NER models like Stanza [17] and GERNERMED [19] demonstrate effectiveness in single languages but struggle with multilingual adaptability [30]. Moreover, the existing multilingual models often depend on extensive annotated datasets, making them impractical for resource-poor languages [12,13]. The proposed approach overcomes these constraints through weak supervision and efficient cross-language translation pipelines, enabling robust performance across multiple languages, including underrepresented ones such as Slovenian and Polish.
A critical challenge in current medical NER models is their limited adaptability across different types of language [10,15]. Traditional medical NER systems, trained primarily on clinical documentation, often struggle to recognize symptoms when patients use colloquial expressions or metaphorical language to describe their conditions [5,8]. Namely, while models like Stanza [17] and MedPsyNIT [20] achieve high performance on formal clinical notes (F1 scores of 88.1% and 89.5%, respectively), their effectiveness diminishes significantly when processing informal patient descriptions. This limitation is particularly evident when trying to exploit the medical NER concept in the context of patient-reported outcomes, where individuals describe their symptoms in natural, conversational language rather than clinical terminology [6,7]. For instance, while a clinician might document “pyrexia”, patients typically describe “feeling hot” or “burning up”. We demonstrate this with a series of case studies comparing existing medical NERs in English, Italian, and Spanish, where our model significantly narrowed the gap between clinical precision and patient expression, making it more suitable for processing real-world patient-reported health data across languages. Moreover, the improved performance over the existing models, especially on non-medical terminology (English: 67.69% to 80.07%, Italian: 57.06% to 75.60%, Spanish: 62.60% to 77.61%), further demonstrates the effectiveness of our approach. The results clearly show that our models (and the pipeline) can efficiently complement traditional PROMs by extracting symptoms from natural language descriptions, supporting more patient-centered data collection. This aligns well with recent research highlighting the importance of natural language processing in healthcare [10], while extending its applicability to multiple languages.
Several limitations remain. First, the current hyperparameter configuration, while effective, still has room for optimization through more extensive tuning. Second, the quality of the machine translation affects model performance, particularly for low-resource languages with less reliable translation tools. This limitation connects to broader challenges in multilingual NLP noted by Zhu et al. [14]. Translation errors, particularly in the case of ambiguous or domain-specific terms, can lead to misinterpretations that may degrade model performance. These challenges are especially evident when handling medical terminology, where incorrect translation may result in inaccurate annotations and entity recognition. Addressing these errors through improved translation models or post-translation correction could mitigate their impact. Third, while our automated annotation process using Stanza [17] enabled efficient dataset creation, it may propagate biases from the base model. A fully manual annotation process would provide more reliable training data but would require prohibitive time and cost investments. Moreover, the reliance on the pretrained BERT architecture means that inherent biases in the original training data may impact performance, especially for rare medical terms or linguistic nuances in low-resource languages. The choice of a basic BERT model rather than multilingual models such as mBERT or XLM-RoBERTa may limit the model’s transferability across languages and reduce its ability to leverage cross-lingual knowledge effectively. Future work includes using more specialized multilingual models, which could improve performance in diverse linguistic settings. Finally, evaluating the model on real patient data, rather than just synthetic examples, would provide a more comprehensive understanding of its effectiveness in real-world applications. The use of synthetic data limits the model’s generalizability and may not capture the full complexity of clinical language used by actual patients. A follow-up clinical study in collaboration with healthcare institutions to evaluate the model on clinical notes is underway; ethical approval has already been issued, and the process is in progress.
Class imbalance remains one of the main challenges, as less-frequent entity types like TEST and TREATMENT may be underrepresented, particularly in languages with fewer examples. This echoes similar challenges noted in recent clinical NLP research [15]. While the current augmentation helps balance the O tags against the medical entity tags, more advanced techniques could better balance the PROBLEM, TEST, and TREATMENT categories. Moreover, sophisticated augmentation could introduce controlled semantic variations in how symptoms are described, helping models better handle the gap between clinical and patient language [5,7]. Furthermore, advanced augmentation techniques could generate language-specific variations that account for cultural and linguistic differences in symptom description [13,14], which is particularly important for the SMILE project’s multicenter implementation.
5. Conclusions
This research demonstrates the potential of fine-tuned BERT-based NER models for multilingual medical text analysis, achieving strong and consistent performance across eight languages, including underrepresented ones such as Slovenian, Polish, and Greek. By outperforming the existing models in English, Italian, and Spanish, our approach highlights the importance of fine-tuning, custom annotations, and multilingual training processes in addressing the challenges of Named Entity Recognition across diverse linguistic contexts. By facilitating the extraction of detailed patient-reported symptoms and medical entities from patient-reported outcomes (PROs), this research addresses the limitations of traditional structured approaches, offering a more inclusive and natural language-driven method for healthcare data collection. Despite challenges such as reliance on pretrained models, potential biases in annotation processes, translation inaccuracies, and class imbalances, the models developed in this research demonstrated strong adaptability and practicality for real-world healthcare applications. Future research should focus on refining the translation pipelines, enhancing the quality of the training data, and conducting extensive hyperparameter tuning to improve performance further. These efforts could expand the applicability of the pipeline to additional languages and tasks, contributing to more equitable and efficient multilingual healthcare information extraction worldwide. Finally, future work should consider developing more sophisticated data augmentation techniques. While the current augmentation focuses on sentence reordering and entity extraction, more advanced techniques could synthesize natural patient language patterns, further improving the model’s ability to handle informal symptom descriptions. Moreover, this could significantly reduce reliance on translation and improve cross-lingual robustness.