Article

Extracting Information from Unstructured Medical Reports Written in Minority Languages: A Case Study of Finnish

Elisa Myllylä, Pekka Siirtola, Antti Isosalo, Jarmo Reponen, Satu Tamminen and Outi Laatikainen
1 Biomimetics and Intelligent Systems Group, University of Oulu, FI-90014 Oulu, Finland
2 Research Unit of Health Sciences and Technology, University of Oulu, FI-90014 Oulu, Finland
3 Medical Research Center Oulu and Oulu University Hospital, FI-90014 Oulu, Finland
* Author to whom correspondence should be addressed.
Data 2025, 10(7), 104; https://doi.org/10.3390/data10070104
Submission received: 9 June 2025 / Revised: 24 June 2025 / Accepted: 26 June 2025 / Published: 1 July 2025

Abstract

In the era of digital healthcare, electronic health records generate vast amounts of data, much of which is unstructured and therefore not in a usable format for conventional machine learning and artificial intelligence applications. This study investigates how to extract meaningful insights from unstructured radiology reports written in Finnish, a minority language, using machine learning techniques for text analysis. With this approach, unstructured information could be transformed into a structured format. The results show that relevant information can be effectively extracted from Finnish medical reports using classification algorithms with default parameter values. For the detection of breast tumour mentions in medical texts, classifiers achieved high accuracy, almost 90%. Detection of metastasis mentions, however, proved more challenging, with the best-performing models, Support Vector Machine (SVM) and logistic regression, achieving an F1-score of 81%. The lower performance in metastasis detection is likely due to the greater complexity of the problem, ambiguous labelling, and the smaller dataset size. The classical classifiers were also compared with FinBERT, a domain-adapted Finnish BERT model; however, the classical classifiers outperformed FinBERT. This highlights the challenge of medical language processing when working with minority languages. Moreover, parameter tuning based on translated English reports did not significantly improve the detection rates, likely due to linguistic differences between the datasets. The larger translated dataset used for tuning comes from a different clinical domain and employs noticeably simpler, less nuanced language than the Finnish breast cancer reports, which were written by native Finnish-speaking medical experts. This underscores the need for localised datasets and models, particularly for minority languages with unique grammatical structures.

1. Introduction

Healthcare institutions worldwide have transitioned from traditional paper-based health records to electronic patient record systems [1]. In this era of digital healthcare, vast amounts of health data are generated every day. These electronic health records (EHRs) contain large amounts of structured medical data, such as treatments, symptoms, and diagnoses. Structured data can be used for data-driven decision making and therefore offers many opportunities for artificial intelligence (AI) and machine learning (ML). In the future, these data, combined with AI/ML, are expected to play a major role in personalising medicine and optimising the use of limited healthcare resources [2,3]. They also play a central role in shifting healthcare ideology from reactive medicine towards the prevention of illnesses.
While part of the data obtained from patients' treatment is structured, a substantial portion of health data is in an unstructured format, mainly text in appointment documents and medical case summaries. According to some estimates, as much as 80% of the information is unstructured [4]. Therefore, this information often remains unutilised in the machine learning model training process. Automatically transforming it into a structured format could unlock its potential, providing new opportunities for cost-effectiveness, more reliable models, and better healthcare. This paper focuses on how to extract valuable new insights from unstructured health data using machine learning. This would lead to larger structured health datasets, which usually result in better ML models and, eventually, safer and more efficient healthcare.
This is not the first article to study this topic. For instance, a locally operating Llama 2 language model was used to extract structured information from clinical texts in a work by Wiest et al. [5]. The study demonstrated that Llama 2 accurately identified clinical features related to liver cirrhosis from patient data, showcasing the model's potential for analysing healthcare information locally without relying on remote servers. A number of studies with similar research questions have been published [6,7,8,9]. The common denominator has been the analysis of medical text in the languages of large linguistic regions, such as English, Spanish, Chinese, or German.
While large linguistic areas potentially enable the analysis of big datasets, it is important to extend studies of medical texts beyond these areas as well. In fact, if the aim is to unleash the potential of AI to secure high-quality healthcare services regardless of country, models should be developed to enable local analyses, as models trained in one geographical area are very likely to lose performance when moved to another [10]. The underlying reasons include, for example, differences in genetics and environmental factors between areas. Thus, models need to be adapted to new locations or retrained. Both cases require local data sources, highlighting the importance of this study.
What makes this research special is its focus on the Finnish language, spoken mainly in Finland, a country of 5.6 million people, of whom approximately 85% are native Finnish speakers. Because the language community is so small, the datasets available for language model training are small compared to, for example, English. In addition, Finnish differs grammatically from all major languages. It has 15 grammatical cases [11], far exceeding the number in Indo-European languages; for instance, English has only three cases [12]. These cases are expressed by adding specific suffixes to the noun, which can change depending on whether the noun is singular or plural. This system allows Finnish to convey a lot of information through word endings, making it quite different from languages that rely more on prepositions and word order. Finnish words can take numerous forms due to the extensive use of cases and suffixes. This rich morphology means that a single word can have many different variations, making it difficult for machine learning models to recognise and process all possible forms and their meanings. Some language models have been trained for Finnish (such as [13,14,15]), and one article studies how well toxic language can be detected in Finnish texts [16]. However, how well models can extract relevant information from Finnish medical reports, specifically radiology report texts, has not been studied.
The aim of this article is to study how well machine learning models can extract relevant information from medical texts written in a minority language. In more detail, this article has the following objectives:
  • To evaluate and compare the performance of different machine learning models in classifying unstructured Finnish radiology reports, specifically for detecting mentions of breast tumours and metastases.
  • To investigate whether model performance on minority language medical texts can be improved by leveraging larger translated publicly available datasets for parameter tuning.

2. Datasets

In this study, two datasets were used. The first dataset contains unstructured medical texts from radiology reports of patients diagnosed with breast cancer. The reports are in Finnish and were written by expert radiologists as part of normal healthcare functions. The second dataset is a large open dataset, MIMIC, containing radiology reports written in English. These reports were collected at Beth Israel Deaconess Medical Center in Boston, Massachusetts, and comprise chest radiographs and their related free-text radiology reports.

2.1. Breast Cancer Data

Breast cancer data from Northern Ostrobothnia, Finland, includes 8074 patients with a first diagnosis of breast cancer between 2005 and 2022 (Oulu breast cancer dataset for precision medicine, OBC-PM). It is a private dataset that was collected from electronic health records for this study. The dataset contains all data related to the treatment of cohort patients, including but not limited to demographics, diagnoses, patient texts (radiology, pathology, and imaging reports), medications, and procedures. This study focuses on radiology report texts, and the aim is to identify whether these texts mention breast tumour or metastasis-related findings. The degree of certainty of the reported findings varied from suspected tumour/metastasis to certain tumour/metastasis.
For supervised machine learning, radiology reports need to be manually labelled. The original dataset contains a total of 172,543 radiology reports written in Finnish. However, manually labelling all of these would have been too resource-intensive. In addition, the documents included unrelated reports. Therefore, most irrelevant reports were filtered out, and only a small subset of the data was used for this study. The filtering was performed based on an expert-defined list of specific words, word fragments, and phrases related to breast cancer. The expressions used in the filter are listed in Table 1. It should be noted that the reports selected for further analysis based on this list contained these expressions in either a negative or a positive context. In this way, the number of reports was reduced to around 10,000, and from these, a randomly selected subset was labelled by the expert.
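The filtering step can be illustrated with a short sketch. The snippet below is a minimal, hypothetical reconstruction: the expression list is an abbreviated subset of Table 1, the example reports are invented, and a simple case-insensitive substring match stands in for whatever matching logic was actually used.

```python
# Hypothetical reconstruction of the expert-keyword pre-filter.
# FILTER_EXPRESSIONS is an abbreviated subset of Table 1; matching is a
# case-insensitive substring test, so word fragments such as "metast"
# also match inflected forms like "metastaaseista".
FILTER_EXPRESSIONS = [
    "metast",     # metastasis (word fragment)
    "maligni",    # malignant
    "suspekti",   # suspect
    "pesäk",      # lesion (word fragment)
    "resid",      # residual
]

def matches_filter(report: str) -> bool:
    """Return True if the report contains any filter expression."""
    text = report.lower()
    return any(expr in text for expr in FILTER_EXPRESSIONS)

all_reports = [
    "Ei viitettä metastaaseista.",       # "No sign of metastases." (kept: negative context)
    "Rintarauhasessa suspekti pesäke.",  # "A suspect lesion in the breast." (kept)
    "Normaali löydös.",                  # "Normal finding." (filtered out)
]
candidates = [r for r in all_reports if matches_filter(r)]
print(len(candidates))  # -> 2
```

Note that the filter keeps reports regardless of whether the expression appears in a positive or a negative context; the class label is decided only later, during manual annotation.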
Radiology reports were labelled for two cases: (1) to detect whether the report mentions that the person has a breast tumour, and (2) to detect whether the report mentions that the person has metastasis, that is, detected distant metastases or axillary lymph nodes considered metastatic. The dataset sizes for the two case studies differ slightly, as the selection process aimed to reduce class imbalance. The original annotated data contained more negative than positive samples. To address this, all available positive cases were included, and an equal number of negative cases were randomly selected for each case. As a result, case 1 included 278 samples (138 positive and 138 negative), while case 2 included 214 samples (107 positive and 107 negative).
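This balancing amounts to random undersampling of the majority class. A minimal sketch, assuming each sample is a dict with a binary label (the seed and the data structure are illustrative assumptions):

```python
import random

def balance_binary(samples, seed=42):
    """Keep all positive samples and an equal number of randomly chosen
    negatives, as described for cases 1 and 2 above."""
    rng = random.Random(seed)  # fixed seed is an assumption
    positives = [s for s in samples if s["label"] == 1]
    negatives = [s for s in samples if s["label"] == 0]
    return positives + rng.sample(negatives, k=len(positives))

# Toy example: 2 positives, 5 negatives -> 2 positives + 2 negatives.
data = [{"text": f"report {i}", "label": int(i < 2)} for i in range(7)]
print(len(balance_binary(data)))  # -> 4
```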
During the labelling process, the reports were noted to be highly hierarchical and diverse, with varying structures and conflicting information. To determine whether a person has metastasis, the radiology reports sometimes needed to combine information from several separate imaging procedures (e.g., bone structure imaging, brain and head imaging, and thorax imaging), so several X-rays and their related radiology reports had to be studied, whereas in the case of a breast tumour, the information came from a single imaging procedure. Thus, radiology reports on localised breast tumours were significantly less complex than the metastasis imaging reports. This made the annotation process highly challenging, non-trivial, and time-consuming. For instance, a report may state "no bone metastases" and "metastatic finding in the abdomen" within the same text, presenting both positive and negative classes at once. Such cases can be highly confusing for ML models. Moreover, some findings were more certain than others, and based on the reports alone it was not always clear whether a positive finding was present. This was especially true when determining whether the texts mention metastasis. However, as binary classification is used in this study, the person labelling the texts had to decide whether each finding was positive or negative. In addition, in Finnish the same thing can be expressed in multiple ways, making labelling even more difficult. Table 2 presents examples of phrases (translated to English) that were used to classify cases as "Metastasis YES", "Metastasis NO", or "Unclear". As demonstrated, numerous phrases convey meaning indirectly or exhibit contextual ambiguity, requiring expert interpretation to resolve their inherent uncertainty during annotation. In the end, "Unclear" cases were also labelled as "Metastasis YES" or "Metastasis NO", and since these phrases are not certain, their interpretation can lead to different outcomes depending on the labeller. This is a typical form of bias, where expert-dependent error can arise. However, in this study, all labels were assigned by a single expert, so no formal inter-annotator agreement could be measured. This is acknowledged as a limitation of this study and will be addressed in future work.

2.2. MIMIC Dataset

In the second phase of the article, the MIMIC dataset (see [17,18,19]) is used. MIMIC contains labels for 14 medical findings. None of these are related to cancer, but with the help of an expert, two of them were selected for this study. MIMIC was used to train models to classify radiology reports based on mentions of the following: (1) consolidation, a certain type of lung disease, and (2) cardiomegaly, an enlarged heart. MIMIC data plays a supporting role in this article, as there are no publicly open medical datasets available in Finnish; it is only used to tune classifier parameters to better understand medical text. A limitation of the MIMIC dataset for this study is that it does not contain information about breast cancer. Nevertheless, since it consists of medical texts, it can be used to examine whether the terminology and phrasing in these reports can improve the analysis of Finnish medical texts.
The data comprise two parts: MIMIC-CXR and MIMIC-CXR-JPG. These provide valuable resources for analysing and diagnosing chest X-rays and their accompanying reports. In this study, the radiology reports were taken from the MIMIC-CXR database and the related labels from the MIMIC-CXR-JPG database.
MIMIC-CXR is a large, publicly available database of chest X-rays collected at Beth Israel Deaconess Medical Center. The database includes 371,920 DICOM-format chest X-rays and 227,835 associated free-text radiology reports from 65,379 patients, all collected between 2011 and 2016. To protect patient privacy, all images and reports have been de-identified.
As mentioned, reports were selected based on the presence of (1) consolidation and (2) cardiomegaly. In both cases, processing the whole dataset was found to be computationally too demanding, so only part of the data was used to reduce the number of analysed texts and the demand for processing power and memory. For the consolidation case, 3000 randomly selected imaging studies with a consolidation label of 1 were chosen, along with 2000 labelled 0 and 1000 labelled "missing". The corresponding reports were then extracted from the MIMIC-CXR database based on the selected study IDs. A binary classification approach was used to detect whether consolidation was positively mentioned as a finding, and therefore the reports labelled "missing" were relabelled as 0. As a result, the negative class included reports where consolidation was either negatively mentioned or not mentioned at all. This resulted in a balanced dataset with 3000 samples in each class, or 6000 samples in total. The same procedure was applied to the reports with cardiomegaly mentions. However, not all of the 6000 studies had labels for cardiomegaly, so this dataset is smaller, containing 2530 samples (1469 positive and 1062 negative cases).
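A sketch of this subsampling step is shown below, assuming a CheXpert-style label table as distributed with MIMIC-CXR-JPG, in which a "Consolidation" column holds 1 for positive, 0 for negative, and an empty value for "missing"; the file name, column names, seed, and value handling are assumptions for illustration only.

```python
import pandas as pd

# Assumed file and column names (CheXpert-style labels shipped with
# MIMIC-CXR-JPG): one row per study; "Consolidation" is 1 (positive),
# 0 (negative), or empty/NaN ("missing"); other values are ignored here.
labels = pd.read_csv("mimic-cxr-2.0.0-chexpert.csv")

seed = 42  # any fixed seed; an illustrative assumption
pos = labels[labels["Consolidation"] == 1].sample(n=3000, random_state=seed)
neg = labels[labels["Consolidation"] == 0].sample(n=2000, random_state=seed)
missing = labels[labels["Consolidation"].isna()].sample(n=1000, random_state=seed)

# "Missing" is relabelled as 0, so the negative class covers reports where
# consolidation is either negated or not mentioned at all (3000 + 3000 samples).
subset = pd.concat([pos.assign(y=1), neg.assign(y=0), missing.assign(y=0)])
study_ids = subset["study_id"].tolist()  # used to pull report texts from MIMIC-CXR
```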
In the MIMIC data, the reports are written in English. However, because the aim was to classify Finnish clinical texts, the English MIMIC reports were translated into Finnish using the DeepL translation tool [20], which was chosen because it ensures data protection, an essential condition for the analysis of this dataset. The quality of the translation was assessed with BERTScore [21], a metric that compares two texts by computing the cosine similarity of contextual word embeddings produced by a pretrained Transformer model. BERTScore is widely used to estimate semantic alignment between a candidate sentence (here, the Finnish translation) and a reference sentence (the original English report). A BERTScore was calculated for every translated Finnish radiology report that mentioned consolidation against its English counterpart; the average F1 value was 0.76, indicating moderately high semantic similarity. Some variation remains, most likely reflecting differences in sentence structure and specialised medical terminology between the two languages. By using these translated texts, the parameter optimisation for the classification models is based on Finnish reports, just like the model testing, and it is therefore assumed that the MIMIC data can be used to improve the detection of tumour and metastasis mentions in the OBC-PM data.
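As an illustration of this quality check, the sketch below computes BERTScore with the open-source bert-score package; the example sentence pair is invented, and because the candidate and reference are in different languages, the package's multilingual fallback encoder (selected via lang="fi") is assumed to be an acceptable choice.

```python
from bert_score import score  # pip package: bert-score

candidates = ["Keuhkoissa ei näy konsolidaatiota."]      # Finnish translation (invented example)
references = ["No consolidation is seen in the lungs."]  # original English report (invented example)

# lang="fi" makes bert-score fall back to a multilingual BERT encoder,
# which is needed because candidate and reference differ in language.
P, R, F1 = score(candidates, references, lang="fi")
print(f"BERTScore F1: {F1.mean().item():.2f}")  # the paper reports an average of 0.76
```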

2.3. Comparing Datasets

In Table 3, the datasets used in the experiments are compared using metrics that are commonly used to measure the richness of written text [22]. The Type-Token Ratio (TTR) measures the ratio of unique words to the total number of words; a higher TTR indicates a more diverse vocabulary. The Measure of Textual Lexical Diversity (MTLD) calculates how often new vocabulary is introduced as the text progresses, providing a length-independent measure of diversity. Hypergeometric Distribution Diversity (HD-D) estimates the probability of encountering unique words within a random sample and is robust to corpus size. From the results in Table 3, it can be noted that the translated MIMIC data is much larger, in terms of word count and instances, than the Finnish OBC-PM dataset. However, the lexical diversity of the MIMIC data is much smaller than that of OBC-PM, as several of the calculated metrics (TTR, MTLD, HD-D, etc.) show. From the machine learning point of view, higher lexical diversity means greater challenges: the models encounter more unique word forms with fewer repetitions, making it harder to learn consistent patterns or features that generalise across examples. In addition, according to MTLD, the OBC-PM metastasis dataset appears to be the most linguistically diverse, making it more challenging for ML models than the OBC-PM tumour dataset. The size difference may partially explain the differences between the MIMIC and OBC-PM datasets, but it is still evident that OBC-PM introduces new words at a higher rate than MIMIC. One explanation may be that the automated translation of the MIMIC dataset produced word repetition and reduced the richness of the language, which in turn makes the translated data easier for ML models to analyse than the OBC-PM dataset.
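To make the metrics concrete, the sketch below computes TTR and a simplified MTLD in pure Python, following the definitions of McCarthy and Jarvis [22]; it is an illustrative reimplementation, not the exact script behind the Table 3 numbers.

```python
def ttr(tokens):
    """Type-Token Ratio: unique words / total words."""
    return len(set(tokens)) / len(tokens)

def _mtld_pass(tokens, threshold=0.72):
    """One directional MTLD pass: count 'factors', i.e., stretches of
    text over which the running TTR stays above the threshold."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:  # factor complete: reset
            factors += 1.0
            types, count = set(), 0
    if count:                                # partial factor for the remainder
        factors += (1.0 - len(types) / count) / (1.0 - threshold)
    return len(tokens) / factors if factors else float("inf")

def mtld(tokens, threshold=0.72):
    """MTLD is the mean of a forward and a backward pass [22]."""
    return (_mtld_pass(tokens, threshold) + _mtld_pass(tokens[::-1], threshold)) / 2

tokens = "ei pesäkettä ei metastaaseja rintarauhasessa suspekti pesäke".split()
print(f"TTR={ttr(tokens):.3f}, MTLD={mtld(tokens):.1f}")
```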

3. Methodology

Two approaches were studied in this article (see Figure 1): (1) using OBC-PM data and classifiers with default parameter values, and (2) using OBC-PM data and classifiers with parameters defined using MIMIC data. In both cases, the same set of five classical classifiers was used for comparison: Support Vector Machine (SVM), Logistic Regression, Naive Bayes, Random Forest, and XGBoost, implemented with the scikit-learn 1.7.0 Python package and the companion xgboost package. These were selected because many of them have been found to perform well in tasks related to medical texts [23]. The classifiers used in this study are general-purpose machine learning models, not specifically designed for text analysis; therefore, the textual data had to be transformed into a numerical format before model training. To provide a modern language-model point of comparison, small experiments were also run with the Finnish-NLP/bert-base-finnish-cased-v1 model (hereafter FinBERT) using the same training folds [13]. Unlike the other models, FinBERT consumes tokenised sub-word sequences directly and learns its own dense representations, so for FinBERT the texts were not transformed into a numerical format before training.
As a preprocessing step (see the workflow in Figure 1), all text was converted to lowercase and punctuation was stripped; common Finnish stop-words were removed with the NLTK Snowball Finnish list [24,25], but negation particles (e.g., "ei", "en") were retained, as explicit negation is critical for this study. Due to the agglutinative nature of Finnish, where words are formed through extensive suffixation and compounding, stemming is inadequate for text processing; therefore, it was not applied in this study, and compound words were left intact to preserve semantics. The numerical representation of the text was then created, except for FinBERT, using the Term Frequency–Inverse Document Frequency (TF-IDF) vectorisation method implemented in scikit-learn's feature extraction library [26]. TF-IDF assigns a weighted value to each term based on its frequency in the document and its rarity in the entire dataset, which helps distinguish common words from rarer, more meaningful ones.
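The preprocessing and vectorisation pipeline can be sketched as follows; the example reports are invented, and the exact list of retained negation particles is an assumption beyond the "ei"/"en" examples given above.

```python
import string

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)

# Keep negation particles even though they appear in the stop-word list
# (the full set of retained particles is an assumption for illustration).
NEGATIONS = {"ei", "en", "et", "emme", "ette", "eivät"}
STOPWORDS = set(stopwords.words("finnish")) - NEGATIONS

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, drop stop-words (negations retained)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOPWORDS)

reports = [
    "Ei viitettä metastaaseista.",       # invented example reports
    "Rintarauhasessa suspekti pesäke.",
]
docs = [preprocess(r) for r in reports]

vectorizer = TfidfVectorizer()           # scikit-learn TF-IDF [26]
X = vectorizer.fit_transform(docs)       # sparse document-term matrix
print(X.shape)
```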
All the text analysis algorithms used in this study can run locally. This is required by Finnish law, which states that sensitive medical data, like the data in this study, must be analysed in secure environments without internet access.

4. Experiments

The experiments were conducted in two parts. In the first experiment, the OBC-PM data were classified using classification models with default parameters, because the small dataset size made a separate validation set for parameter tuning infeasible. In the second experiment, the MIMIC data were used to tune the parameters of the classification models, which were then applied to the OBC-PM data.

4.1. Experiment 1: Classification with Default Parameters

To test how well the selected models can classify the Finnish OBC-PM data, the data were randomly divided into training (75%) and test (25%) sets. Due to the small data size, it was not possible to use a separate validation dataset to tune the model parameters, so the default values were used instead. Model training and testing were performed 1000 times. On each run, the data were randomly divided into training and test sets, and the results reported in Table 4 and Table 5 show the averages and standard deviations for both types of findings. However, as FinBERT requires much more computational capacity and running time than the other methods, it was run only once.
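This evaluation protocol can be summarised in a short sketch: repeated random 75/25 splits with default-parameter classifiers, reporting the mean and standard deviation of the F1-score. Synthetic data stands in for the TF-IDF features, and SVM represents any of the five classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF matrix of the 278 tumour reports.
X, y = make_classification(n_samples=278, n_features=300, random_state=0)

scores = []
for seed in range(1000):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)  # fresh random split per run
    clf = SVC()                                   # default parameters throughout
    clf.fit(X_tr, y_tr)
    scores.append(f1_score(y_te, clf.predict(X_te)))

print(f"F1: {np.mean(scores):.3f} ({np.std(scores):.3f})")  # mean (std), as in Table 4
```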

4.2. Experiment 2: Classification with Tuned Parameters

In this experiment, both the OBC-PM and MIMIC datasets were used. Again, the OBC-PM data were randomly divided into training (75%) and test (25%) sets. To find suitable parameters for each classification algorithm, the significantly larger MIMIC data were randomly divided into training (75%), validation (12.5%), and test (12.5%) sets. These same model parameters were then used with the OBC-PM data to determine whether it is possible to improve the OBC-PM classification results by using a larger English dataset.
Two scenarios were tested: (1) English-to-Finnish translated MIMIC texts with consolidation status as the target, and (2) English-to-Finnish translated MIMIC texts with cardiomegaly status as the target. It was then tested how the model parameters defined in these scenarios perform on the OBC-PM data, with tumour and metastasis mentions as the targets.
To identify optimal hyperparameters for each classifier, a grid search was used, exhaustively exploring predefined parameter ranges to find the best-performing combination. The search was conducted using the MIMIC training and validation sets, and the best parameters were selected based on their performance on the validation set. These selected hyperparameters were then applied directly to models trained and tested on the OBC-PM dataset to assess whether tuning on the larger, publicly available dataset could improve classification performance in a minority-language setting.
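The sketch below illustrates this parameter-transfer procedure for one classifier (SVM, using its grid from Table 6): a grid search on the MIMIC training/validation portion via a PredefinedSplit, followed by reuse of the selected parameters on the OBC-PM task. Synthetic data stands in for both datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import (GridSearchCV, PredefinedSplit,
                                     train_test_split)
from sklearn.svm import SVC

# Synthetic stand-ins for the translated MIMIC data and the OBC-PM data.
X_mimic, y_mimic = make_classification(n_samples=6000, n_features=300, random_state=0)
X_obc, y_obc = make_classification(n_samples=278, n_features=300, random_state=1)

# 75% train / 12.5% validation; the last 12.5% is held out as the MIMIC
# test set. In PredefinedSplit, -1 marks rows always used for training
# and 0 marks the single validation fold.
n = len(X_mimic)
train_end, val_end = int(0.75 * n), int(0.875 * n)
fold = np.full(val_end, -1)
fold[train_end:] = 0

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1],
                "kernel": ["linear", "rbf"]},   # SVM grid from Table 6
    cv=PredefinedSplit(fold),
    scoring="f1",
)
grid.fit(X_mimic[:val_end], y_mimic[:val_end])

# Reuse the MIMIC-tuned parameters unchanged on the OBC-PM task.
X_tr, X_te, y_tr, y_te = train_test_split(X_obc, y_obc, test_size=0.25, random_state=0)
clf = SVC(**grid.best_params_).fit(X_tr, y_tr)
print(grid.best_params_, f1_score(y_te, clf.predict(X_te)))
```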
The exact hyperparameters tuned for each classifier are listed in Table 6. For example, five parameters were optimised for the Random Forest model, while two were adjusted for Naive Bayes. The results for detecting positive breast tumour reports using parameters defined based on consolidation mentions in the MIMIC data are shown in Table 7, and those based on cardiomegaly mentions in Table 8. Similarly, the results for detecting positive metastasis reports using parameters defined based on consolidation mentions are shown in Table 9, and those based on cardiomegaly mentions in Table 10. Model training and testing were conducted 1000 times; on each run, the data were randomly divided into training and test sets, and the tables report the averages and standard deviations over these runs.

5. Discussion

In the first experiment, the purpose was to classify radiology reports written in Finnish. Two cases were studied: one detecting whether a report mentions that the patient has a breast tumour, and the other detecting whether it mentions that the patient has metastasis. As these datasets were small, separate validation datasets were not used; instead, the classification algorithms were used with default parameters.
The results presented in Table 4 and Table 5 show that information can be extracted quite well from medical reports written in a minority language, in this case Finnish. When classical classifiers with default parameters are used, the recognition rate for detecting whether texts mention a patient having a tumour (Table 4) ranges between 86 and 91%, depending on the performance metric (F1-score, accuracy, recall, and precision). However, the performance of FinBERT is much lower, with an F1-score of only 70%. This is probably due to the limited dataset size, as FinBERT is much more complex than the other models and therefore requires much more data for training. On the other hand, these results show that the classification rate is high no matter which classical classification algorithm is used; only XGBoost performs somewhat worse.
On the other hand, detecting whether a person has metastasis based on radiology reports seems to be much more difficult (see Table 5). In this case, SVM and logistic regression perform the best, providing a detection rate of 81% based on the F1-score. XGBoost has the worst performance, with an F1-score of around 70%; in fact, in this case it performs worse than FinBERT. However, it can again be noted that FinBERT does not perform well in this classification task. It was not surprising that the detection of metastasis mentions gives lower detection rates than the detection of tumour status mentions, as it was already noted in the labelling phase that reports quite seldom clearly state whether the person has metastasis, and a single report can contain both negative and positive information regarding metastasis findings. Moreover, the dataset containing metastasis mentions is smaller than the one for detecting tumour mentions, which affects the model performance.
In the second experiment, the idea was to test whether the detection rates of tumour and metastasis mentions could be improved by defining better parameters for the classification algorithms based on translated English radiology reports. Only classical classifiers were used in these experiments; FinBERT was excluded because, based on the results in Table 4 and Table 5, it performs much worse than most of the classical classifiers. Table 7 and Table 9 show how the parameters optimised using the consolidation texts affect the detection of tumour and metastasis mentions, and Table 8 and Table 10 show the same for the parameters optimised using the cardiomegaly texts. It can be noted that this approach does not have a positive effect on the recognition rates. Although some averaged metrics (e.g., for SVM) appear marginally higher after MIMIC-based tuning, a paired t-test conducted on the 1000 resamples indicated that this difference was not statistically significant (p > 0.05). The reason can most likely be explained by the differences between the datasets. According to Table 3, the MIMIC dataset is quite different from the OBC-PM dataset. In fact, the language in the translated MIMIC dataset is notably simpler and less nuanced than in the OBC-PM dataset. This simplification is likely due to the translation process itself, which reduces linguistic complexity. The MIMIC reports were generated by a single translation tool, whereas the OBC-PM reports were written by multiple native Finnish-speaking medical experts, resulting in greater diversity in writing styles and expressions. Consequently, the two datasets are not directly comparable in terms of linguistic richness and variation, and therefore the parameters defined using the translated MIMIC data do not improve the classification of tumour and metastasis reports. This underscores the need for localised datasets and models, particularly for minority languages with unique grammatical structures. In addition, the OBC-PM data are related to breast cancer, whereas the MIMIC reports are not related to cancer. This may also affect the language used, rendering the datasets different. These findings illustrate a general challenge in cross-lingual transfer for specialised medical domains, underscoring the need for locally curated data.
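The significance check can be sketched as follows, assuming two aligned lists of F1-scores from the same 1000 random splits (the values shown here are illustrative only):

```python
from scipy.stats import ttest_rel

# f1_default[i] and f1_tuned[i] must come from the same random split i;
# the four values below are illustrative only (the study used 1000 runs).
f1_default = [0.880, 0.902, 0.874, 0.891]
f1_tuned   = [0.889, 0.901, 0.880, 0.893]

t_stat, p_value = ttest_rel(f1_tuned, f1_default)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # p > 0.05 -> no significant improvement
```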
Differences between the languages of the datasets are not necessarily the only reason the recognition rates did not improve. Another reason may be the labels of the tumour and metastasis datasets. Labelling for binary classes was found to be problematic and ambiguous. It is therefore possible that the recognition rates obtained using default values are already close to the highest achievable given the errors in the labels, and thus cannot be improved with parameter tuning.

6. Conclusions

This study explored the classification of radiology reports written in Finnish to detect mentions of breast tumours and metastasis. The results indicate that it is possible to extract relevant information from minority-language medical reports using classification algorithms with default parameters. For the detection of breast tumour mentions, the classical classifiers performed well, achieving an accuracy of 89% regardless of the algorithm used, with XGBoost being the only exception, performing slightly worse. However, detecting mentions of metastasis was found to be more challenging, with the best-performing algorithms (SVM and logistic regression) achieving an F1-score of 81%. The lower performance for detecting metastasis mentions is likely due to ambiguous labelling and the smaller dataset size. The experiments were also performed with FinBERT, a Finnish BERT language model designed for text analysis. However, it performed much worse than most of the other classifiers (F1-score: 70% for detecting tumour mentions and 72% for detecting metastasis mentions).
In the second experiment, parameter tuning using translated English reports did not significantly improve the detection rates. This lack of improvement is most likely due to the linguistic differences between the datasets: (1) the English texts were not about breast cancer but other diseases, which means their language was not similar to that of the Finnish texts, and (2) the MIMIC reports contained simpler and less nuanced language, due to the automated translation, than the more varied Finnish breast cancer reports. Therefore, there is clearly a pressing need for localised datasets and models, particularly for minority languages with unique grammatical structures. Additionally, challenges in the labelling process likely limited the possibility of improving on the default classifiers through parameter tuning, suggesting that the performance achieved with default values may already be near the maximum possible given the dataset limitations. In fact, data annotation was found to be one of the key challenges in this work because of its complexity and burdensomeness. There are findings showing that assistance software can make the labelling process easier and lighter [27], and this approach could potentially also improve the quality of the OBC-PM data labels. In addition, semi-supervised methods should be tested to reduce the need for labelling.
The results thus showed that it is possible to extract usable information from unstructured radiology reports, and in future work this information can be used as an extra feature in machine learning model training. Future work could also include experimenting with more powerful machine learning and NLP (natural language processing) models that can be run locally. In this study, FinBERT was tested, but the results were not promising, probably due to the small training dataset. This highlights how difficult clinical NLP becomes when working with minority languages that lack large datasets. Some related articles have used, for instance, GPT-4 for medical text analysis, with results quite similar to ours [28]. However, as the field is evolving rapidly, experiments should be conducted with the next generation of LLMs. Likewise, it is equally important to perform experiments using larger datasets.

Author Contributions

Conceptualization, E.M., O.L., S.T., and P.S.; Data Curation, O.L., J.R., and A.I.; Funding acquisition, O.L. and P.S.; Methodology, E.M.; Software, E.M.; Supervision, O.L., S.T., and P.S.; Validation, E.M., O.L., S.T., J.R., A.I., and P.S.; Writing—original draft, E.M., O.L., and P.S.; Writing—review and editing, O.L., J.R., A.I., S.T., and P.S. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank the Profi6 theme 6GESS for funding this project. Moreover, this research was supported by the Research Council of Finland (formerly the Academy of Finland) 6G Flagship Programme (Grant Number: 346208). A. Isosalo received funding from the Jenny and Antti Wihuri Foundation, Helsinki, Finland (no. 220106).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the Wellbeing Services County of Northern Ostrobothnia (protocol code 215/2023). All research follows good scientific practice, research ethics, and GDPR guidelines. Partners adhere to the guidelines of the Finnish National Board on Research Integrity (TENK). All patient data were pseudonymized to ensure that no individuals are identifiable. Only authorized researchers had access to the data, which were securely stored in a validated environment (e.g., HUSAcamedic). Data use complied with the Act on the Secondary Use of Health and Social Data.

Data Availability Statement

The OBC-PM dataset is private and not publicly available. The MIMIC dataset is available online (https://physionet.org/content/mimic-cxr/2.0.0/, accessed on 25 June 2025).

Acknowledgments

The authors are grateful to Infotech Oulu.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vehko, T.; Ruotsalainen, S.; Hyppönen, H. E-Health and E-Welfare of Finland: Check Point 2018; Finnish Institute for Health and Welfare: Helsinki, Finland, 2019; Report 7.
  2. Ihalapathirana, A.; Chalkou, K.; Siirtola, P.; Tamminen, S.; Chandra, G.; Benkert, P.; Kuhle, J.; Salanti, G.; Röning, J. Explainable Artificial Intelligence to predict clinical outcomes in Type 1 diabetes and relapsing-remitting multiple sclerosis adult patients. Inform. Med. Unlocked 2023, 42, 101349.
  3. Lavikainen, P.; Chandra, G.; Siirtola, P.; Tamminen, S.; Ihalapathirana, A.T.; Röning, J.; Laatikainen, T.; Martikainen, J. Data-Driven Identification of Long-Term Glycemia Clusters and Their Individualized Predictors in Finnish Patients with Type 2 Diabetes. Clin. Epidemiol. 2023, 15, 13–29.
  4. Murdoch, T.B.; Detsky, A.S. The inevitable application of big data to health care. JAMA 2013, 309, 1351–1352.
  5. Wiest, I.C.; Ferber, D.; Zhu, J.; van Treeck, M.; Meyer, S.K.; Juglan, R.; Carrero, Z.I.; Paech, D.; Kleesiek, J.; Ebert, M.P.; et al. Privacy-preserving large language models for structured medical information retrieval. NPJ Digit. Med. 2024, 7, 257.
  6. Wiest, I.C.; Wolf, F.; Lessmann, M.E.; van Treeck, M.; Ferber, D.; Zhu, J.; Boehme, H.; Bressem, K.K.; Ulrich, H.; Ebert, M.P.; et al. LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy-preserving Large Language Models. medRxiv 2024.
  7. Wiest, I.C.; Verhees, F.G.; Ferber, D.; Zhu, J.; Bauer, M.; Lewitzka, U.; Pfennig, A.; Mikolas, P.; Kather, J.N. Detection of suicidality from medical text using privacy-preserving large language models. Br. J. Psychiatry 2024, 225, 532–537.
  8. Luo, X.; Gandhi, P.; Storey, S.; Huang, K. A deep language model for symptom extraction from clinical text and its application to extract COVID-19 symptoms from social media. IEEE J. Biomed. Health Inform. 2021, 26, 1737–1748.
  9. Shaaban, M.A.; Akkasi, A.; Khan, A.; Komeili, M.; Yaqub, M. Fine-Tuned Large Language Models for Symptom Recognition from Spanish Clinical Text. arXiv 2024, arXiv:2401.15780.
  10. Ye, Y.; Wagner, M.M.; Cooper, G.F.; Ferraro, J.P.; Su, H.; Gesteland, P.H.; Haug, P.J.; Millett, N.E.; Aronis, J.M.; Nowalk, A.J.; et al. A study of the transferability of influenza case detection systems between two large healthcare systems. PLoS ONE 2017, 12, e0174970.
  11. Helasvuo, M.L. Aspects of the Structure of Finnish. In Research in Logopedics; Klippi, A., Launonen, K., Eds.; Multilingual Matters: Bristol, UK, 2008; pp. 9–18.
  12. Marr, V.; Aldus, V.; Brookes, I. The Chambers Dictionary; Chambers: London, UK, 2008.
  13. Virtanen, A.; Kanerva, J.; Ilo, R.; Luoma, J.; Luotolahti, J.; Salakoski, T.; Ginter, F.; Pyysalo, S. Multilingual is not enough: BERT for Finnish. arXiv 2019, arXiv:1912.07076.
  14. Tanskanen, A.; Toivanen, R.; Vehviläinen, T. RoBERTa Large Model for Finnish. Hugging Face Model Repository, 2022. Available online: https://huggingface.co/Finnish-NLP/roberta-large-finnish (accessed on 25 June 2025).
  15. Tanskanen, A.; Toivanen, R. ConvBERT for Finnish. Hugging Face Model Repository, 2022. Available online: https://huggingface.co/Finnish-NLP/convbert-base-finnish (accessed on 25 June 2025).
  16. Eskelinen, A.; Silvala, L.; Ginter, F.; Pyysalo, S.; Laippala, V. Toxicity detection in Finnish using machine translation. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), Reykjavik, Iceland, 28–30 August 2023; pp. 685–697.
  17. Johnson, A.; Pollard, T.; Mark, R.; Berkowitz, S.; Horng, S. MIMIC-CXR Database (Version 2.0.0). PhysioNet, 2019. RRID:SCR_007345. Available online: https://doi.org/10.13026/C2JT1Q (accessed on 25 June 2025).
  18. Johnson, A.E.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.Y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317.
  19. Johnson, A.; Lungren, M.; Peng, Y.; Lu, Z.; Mark, R.; Berkowitz, S.; Horng, S. MIMIC-CXR-JPG: Chest Radiographs with Structured Labels (Version 2.0.0). PhysioNet, 2019. RRID:SCR_007345. Available online: https://doi.org/10.13026/8360-t248 (accessed on 25 June 2025).
  20. DeepL. DeepL Translator; DeepL SE: Cologne, Germany, 2024. Available online: https://www.deepl.com/translator (accessed on 25 June 2025).
  21. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 25–29 April 2020.
  22. McCarthy, P.M.; Jarvis, S. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 2010, 42, 381–392.
  23. Wang, Y.; Wang, L.; Rastegar-Mojarad, M.; Moon, S.; Shen, F.; Afzal, N.; Liu, S.; Zeng, Y.; Mehrabi, S.; Sohn, S.; et al. Clinical information extraction applications: A literature review. J. Biomed. Inform. 2018, 77, 34–49.
  24. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O'Reilly Media: Sebastopol, CA, USA, 2009.
  25. Porter, M. Snowball: A Language for Stemming Algorithms. 2001. Available online: http://snowball.tartarus.org/texts/introduction.html (accessed on 25 June 2025).
  26. Scikit-learn. Text Feature Extraction. Available online: https://scikit-learn.org/stable/modules/feature_extraction.html (accessed on 25 June 2025).
  27. Martynov, P.; Mitropolskii, N.; Kukkola, K.; Gretsch, M.; Koivisto, V.M.; Lindgren, I.; Saunavaara, J.; Reponen, J.; Mäkynen, A. Testing of the assisting software for radiologists analysing head CT images: Lessons learned. BMC Med. Imaging 2017, 17, 59.
  28. Kanzawa, J.; Kurokawa, R.; Kaiume, M.; Nakamura, Y.; Kurokawa, M.; Sonoda, Y.; Gonoi, W.; Abe, O. Evaluating the Role of GPT-4 and GPT-4o in the Detectability of Chest Radiography Reports Requiring Further Assessment. Cureus 2024, 16, e75532.
Figure 1. Overview of the two case workflows used in the study.
Table 1. A list of clinical expressions related to tumours and metastasis in Finnish, with English translations.

Expression (Finnish) | Translation (English)
aktuelliin patologiaan | related to current pathology
keuhkovarjostum | lung shadow
maligni | malignant
merkittävän näköinen imusolmuke | significantly appearing lymph node
metast | metastasis
patologinen imusolmuke | pathological lymph node
patologinen mikrokalkki | pathological microcalcification
patologisia imusolmukkeita | pathological lymph nodes
patologista mikrokalkkia | pathological microcalcification
pesäk | lesion
poikkeava kertymä | abnormal accumulation
resid | residual
signaalilisä | signal enhancement
signaalimuutos | signal change
suspekti | suspect
Table 2. Examples of phrases (translated to English) indicating the presence, absence, or ambiguity of metastasis in radiology reports.

Metastasis YES | Metastasis NO | Unclear
Changes suitable for metastasis | No significant findings | e.g., "No signs of clear sentinel node but sidewise a faint build-up suitable for nodus can be seen"
Changes that fit metastatic profile | no suspect findings | "Sentinel lymph nodes are not visualized."
Changes that fit malignant profile | no focal findings |
Abnormal signal | no radiological shadows |
Malignant finding | no malignant consolidations |
Metastatic finding | no pathological microcalcifications |
Suspect finding | no metastatic signs |
Suspect change | no abnormal findings |
Suspect change/build-up for malignancy | no findings of active malignancies |
Pathological microcalcification | no potentially malignant tumor corpuscles or microcalcium clusters |
Nodus that is seemingly significant | no abnormal bone/skeletal edema |
Etäpesäkkeitä | no clearly active changes |
Distant metastases | no pathological axillary noduses |
Change that fits to recurrence of the disease | no residual findings |
Pathological signal change | no suspect focuses |
Abnormal build-up | no signs of skeletal metastases |
Bone/skeletal metastasis | |
Table 3. Comparison of lexical diversity metrics for MIMIC data (consolidation) and OBC-PM datasets.

Metric | MIMIC | OBC-PM (Metastasis) | OBC-PM (Tumour)
Number of instances | 6000 | 214 | 278
Word Count | 441,412 | 17,149 | 18,437
Unique Words | 18,701 | 4220 | 4146
Type-Token Ratio (TTR) | 0.0424 | 0.2460 | 0.2250
Root Type-Token Ratio (RTTR) | 28.1477 | 32.2250 | 30.5340
Corrected Type-Token Ratio (CTTR) | 19.9034 | 22.7860 | 21.5910
Mean Segmental Type-Token Ratio (MSTTR) | 0.9285 | 0.9127 | 0.9154
Moving Average Type-Token Ratio (MATTR) | 0.9285 | 0.9130 | 0.9154
Measure of Textual Lexical Diversity (MTLD) | 180.2231 | 201.0060 | 167.7770
Hypergeometric Distribution Diversity (HD-D) | 0.9038 | 0.9156 | 0.9117
Table 4. Comparison of different models to detect mentions of breast tumour from OBC-PM data using default model parameters. Values are mean (standard deviation) over 1000 runs, except for FinBERT, which was run only once (no variation available).

Model | F1 Score | Accuracy | Recall | Precision
Naive Bayes | 0.880 (0.038) | 0.877 (0.037) | 0.909 (0.056) | 0.857 (0.066)
Random Forest | 0.885 (0.038) | 0.888 (0.036) | 0.874 (0.059) | 0.901 (0.051)
Logistic Regression | 0.883 (0.039) | 0.888 (0.036) | 0.880 (0.059) | 0.896 (0.057)
SVM | 0.890 (0.039) | 0.892 (0.037) | 0.882 (0.061) | 0.902 (0.055)
XGBoost | 0.858 (0.040) | 0.859 (0.037) | 0.861 (0.057) | 0.858 (0.057)
FinBERT | 0.695 (–) | 0.700 (–) | 0.701 (–) | 0.694 (–)
Table 5. Comparison of different models to detect mentions of metastasis from OBC-PM data using default model parameters. Values are mean (standard deviation) over 1000 runs, except for FinBERT, which was run only once (no variation available).

Model | F1 Score | Accuracy | Recall | Precision
Naive Bayes | 0.793 (0.056) | 0.787 (0.057) | 0.827 (0.098) | 0.778 (0.098)
Random Forest | 0.718 (0.075) | 0.756 (0.061) | 0.633 (0.110) | 0.851 (0.080)
Logistic Regression | 0.809 (0.054) | 0.797 (0.056) | 0.867 (0.088) | 0.772 (0.094)
SVM | 0.806 (0.055) | 0.789 (0.059) | 0.884 (0.083) | 0.754 (0.097)
XGBoost | 0.685 (0.071) | 0.709 (0.059) | 0.643 (0.099) | 0.747 (0.091)
FinBERT | 0.720 (–) | 0.722 (–) | 0.720 (–) | 0.750 (–)
Table 6. Parameter value ranges used in grid search for each classifier.

Classifier | Parameter Value Ranges
Naive Bayes | alpha: [0.01, 0.1, 1.0], fit_prior: [True, False]
Random Forest | criterion: ["gini", "entropy"], max_depth: [5, 10, 20, None], max_features: ["sqrt", "log2"], min_samples_split: [2, 5, 10], n_estimators: [50, 100, 200]
SVM | C: [0.1, 1, 10], gamma: [0.001, 0.01, 0.1], kernel: ["linear", "rbf"]
Logistic Regression | C: [0.1, 1, 10], penalty: ["l2"], solver: ["lbfgs", "saga"]
XGBoost | gamma: [0, 0.1, 0.5], learning_rate: [0.01, 0.1, 0.3], max_depth: [3, 6, 10], n_estimators: [50, 100, 200]
Table 7. Comparison of different models to detect mentions of breast tumour using model parameters tuned with consolidation data.

Model | F1 Score | Accuracy | Recall | Precision
Naive Bayes | 0.871 (0.042) | 0.868 (0.041) | 0.899 (0.060) | 0.853 (0.076)
Random Forest | 0.879 (0.039) | 0.882 (0.036) | 0.861 (0.061) | 0.902 (0.051)
Logistic Regression | 0.842 (0.043) | 0.847 (0.040) | 0.823 (0.065) | 0.868 (0.061)
SVM | 0.896 (0.037) | 0.897 (0.034) | 0.889 (0.059) | 0.906 (0.048)
XGBoost | 0.862 (0.041) | 0.865 (0.038) | 0.847 (0.061) | 0.882 (0.057)
Table 8. Comparison of different models to detect mentions of breast tumour using model parameters tuned with cardiomegaly data.

Model | F1 Score | Accuracy | Recall | Precision
Naive Bayes | 0.874 (0.038) | 0.872 (0.036) | 0.898 (0.054) | 0.856 (0.063)
Random Forest | 0.884 (0.037) | 0.886 (0.035) | 0.875 (0.056) | 0.896 (0.052)
Logistic Regression | 0.860 (0.043) | 0.864 (0.040) | 0.843 (0.065) | 0.883 (0.059)
SVM | 0.896 (0.034) | 0.897 (0.033) | 0.897 (0.051) | 0.899 (0.053)
XGBoost | 0.841 (0.045) | 0.846 (0.041) | 0.824 (0.066) | 0.865 (0.065)
Table 9. Comparison of different models to detect mentions of metastasis using model parameters tuned with consolidation data.

Model | F1 Score | Accuracy | Recall | Precision
Naive Bayes | 0.787 (0.066) | 0.777 (0.064) | 0.843 (0.136) | 0.766 (0.107)
Random Forest | 0.704 (0.073) | 0.739 (0.061) | 0.637 (0.114) | 0.811 (0.087)
Logistic Regression | 0.743 (0.057) | 0.736 (0.054) | 0.776 (0.092) | 0.724 (0.084)
SVM | 0.829 (0.047) | 0.817 (0.049) | 0.900 (0.065) | 0.775 (0.078)
XGBoost | 0.658 (0.066) | 0.681 (0.056) | 0.625 (0.096) | 0.709 (0.092)
Table 10. Comparison of different models to detect mentions of metastasis using model parameters tuned with cardiomegaly data.

Model | F1 Score | Accuracy | Recall | Precision
Naive Bayes | 0.786 (0.055) | 0.780 (0.054) | 0.819 (0.094) | 0.769 (0.092)
Random Forest | 0.686 (0.081) | 0.735 (0.064) | 0.595 (0.115) | 0.836 (0.084)
Logistic Regression | 0.729 (0.062) | 0.723 (0.057) | 0.758 (0.095) | 0.713 (0.085)
SVM | 0.806 (0.053) | 0.795 (0.053) | 0.864 (0.076) | 0.764 (0.086)
XGBoost | 0.670 (0.070) | 0.696 (0.058) | 0.629 (0.097) | 0.733 (0.095)