Ensemble-Based Multi-Class and Multi-Label Text Classification for Noisy Clinical Dialogues

Lucińska, Małgorzata; Płaza, Małgorzata; Kęczkowska, Justyna; Kurek, Kacper; Wykrota, Karol; Deniziak, Stanisław; Twardowski, Karol; Koruba, Zbigniew; Płaza, Mirosław

doi:10.3390/app16062645

by Małgorzata Lucińska ¹, Małgorzata Płaza ¹, Justyna Kęczkowska ¹, Kacper Kurek ¹, Karol Wykrota ¹, Stanisław Deniziak ¹, Karol Twardowski ², Zbigniew Koruba ¹ and Mirosław Płaza ¹,*

¹ Faculty of Electrical Engineering, Automatic Control and Computer Science, Kielce University of Technology, 25-314 Kielce, Poland
² Altar Sp. z o.o., ul. Różana 5, 25-729 Kielce, Poland
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(6), 2645; https://doi.org/10.3390/app16062645

Submission received: 20 January 2026 / Revised: 26 February 2026 / Accepted: 4 March 2026 / Published: 10 March 2026

(This article belongs to the Special Issue AI for Medical Systems: Algorithms, Applications, and Challenges)

Abstract

Multi-class and multi-label classification of medical dialogues remains a challenging task due to high linguistic variability and transcription noise. This study proposes an ensemble approach based on three fine-tuned Polish T5 (Text-to-Text Transfer Transformer) models trained on partially overlapping clinical dialogue datasets. The models are evaluated exclusively on low-quality, highly noisy, automatically transcribed conversations to assess real-world robustness. The results demonstrate that the ensemble of models improves classification stability and outperforms the best single model, increasing the F1-score by 21.8% for internal medicine dialogues and by 44.9% for paediatric interviews. The proposed method shows potential for practical deployment in clinical decision support and automated medical documentation systems.

Keywords: machine learning; text classification; ensemble learning; medical NLP; Polish T5

1. Introduction

Present-day healthcare systems generate ever-increasing amounts of data, particularly from conversations between patients and doctors. Medical dialogues are a rich source of clinical information, such as descriptions of symptoms, disease context, interview history, and physician responses and hypotheses. However, these data remain largely unstructured, which hinders their direct use in medical records, population analyses, and Clinical Decision Support Systems (CDSSs). In the context of the rapid advancement of artificial intelligence methods, an increasing number of smart solutions dedicated specifically to the medical domain have emerged. Among these, systems designed for the creation and storage of medical documentation, commonly referred to as Electronic Health Record (EHR) systems, play a prominent role [1]. Such systems enable the archiving of essential information, including medical history, diagnoses, treatment processes, and test results [2]. Their adoption substantially improves the quality of patient care, facilitates access to information for physicians, enhances communication among healthcare professionals, increases data security, and saves time [3]. Consequently, there is growing interest in natural language processing (NLP) methods capable of automatically extracting structured clinical knowledge from dialogues while preserving their contextual and dynamic characteristics.

The main challenge in this area is the complexity of multi-label and multi-class classification of spontaneous speech. A single fragment of a patient’s speech may refer simultaneously to multiple symptoms with different semantic states (present, absent or uncertain). This complexity is compounded by the inherently chaotic nature of natural dialogues, which are characterised by incomplete utterances, paraphrasing, interruptions and subject changes. Unlike the language of structured medical notes, patient speech is often illogical or highly descriptive and differs markedly from the doctor’s technical language. Therefore, effective processing requires models capable of managing both the multi-layered semantic representations and the linguistic inconsistencies of actual interactions. Although large language models (LLMs), such as Bidirectional Encoder Representations from Transformers (BERT) [4], Robustly Optimised BERT Pretraining Approach (RoBERTa) [5], Text-to-Text Transfer Transformer (T5) [6] and Generative Pre-trained Transformer (GPT) [7], have demonstrated the ability to capture complex semantic relations, the classification of heterogeneous clinical dialogues remains an open challenge. Models trained on a single corpus of data tend to overfit to domain-specific patterns, leading to reduced performance when applied to data that differs from the training distribution. In addition, medical data show a wide variation in noise levels, ranging from informal telephone conversations to transcripts generated by automatic speech recognition (ASR) systems. These factors motivated the authors to address the present research problem. Existing LLM-based solutions for medical dialogue analysis also employ ensemble learning [8,9]. Research indicates that the use of single models is not highly effective [10], whereas various ensemble models achieve significantly better results [11]. However, a drawback of currently developed solutions is that they are typically tested on public text repositories; there is a lack of solutions tailored to real-world dialogue transcriptions, including those containing errors. Noise simulation does not accurately reflect real-world data, particularly the transcription errors typical of dialogue analysis in the Polish language.

In this paper, we propose an approach aimed at improving the quality of medical term extraction in scenarios involving relatively small amounts of noisy real-world data. The proposed method is based on an ensemble of Polish T5 models, each trained on a distinct dialogue dataset. The combined application of multiple models enhances generalisation, reduces classification errors, and increases robustness to noise and ambiguity. This is particularly important for conversations recorded under real-life conditions during outpatient visits. In the proposed system, multi-class classification is performed in a generative manner, whereby Polish T5 models produce textual outputs containing both the medical term class and its associated status. Predictions from individual models are subsequently combined through a result fusion process that accounts for prediction consistency, model confidence, and external validation rules. This approach improves labelling accuracy and minimises errors arising from incomplete or inconsistent utterances.

The main scientific contributions of this article are as follows:

A multi-class and multi-label classification method for Polish-language medical dialogues was developed based on fine-tuning three variants of the Polish T5 model.
An ensemble strategy combining model predictions through majority voting and qualitative validation was proposed, enhancing robustness to noise and transcription errors.
It was demonstrated that the ensemble approach outperforms the best individual model, particularly for low-quality dialogues, confirming its suitability for practical clinical applications.

The proposed solution forms part of a larger system being deployed as a secure alternative to widely used cloud-based online chat services. Such services frequently process data outside the European Union, which is particularly problematic for sensitive information such as patient health data. Pursuant to Article 9 of the General Data Protection Regulation (GDPR) (EU Regulation 2016/679), health-related data are subject to special protection and may be processed or transferred outside the European Economic Area only under strict legal conditions and with appropriate safeguards. In practice, this implies that transmitting such data to non-EU cloud services may constitute a violation of data protection regulations. For this reason, the proposed system was designed to ensure that data processing is conducted locally or within infrastructure that guarantees full compliance with European Union regulations.

The remainder of this article is structured as follows: Section 2 presents the related work, Section 3 describes the datasets and data preparation process, Section 4 details the methodology of the proposed approach, Section 5 reports the experimental results, discussion is presented in Section 6, and Section 7 provides the concluding remarks.

2. Related Work

2.1. Multi-Label and Multi-Class Text Classification with Transformer-Based Language Models

In recent years, there has been intensive development of text classification methods in multi-label and multi-class settings, driven by transformer models and their ability to capture context at the level of entire sequences. The BERT model introduced by Devlin et al. initiated a wave of architectures based on masked pre-training, leading to substantial improvements in standard classification tasks [4]. Subsequent models, including RoBERTa [5], Decoding-enhanced BERT with Disentangled Attention (DeBERTa) [12], and Multilingual Text-to-Text Transfer Transformer (mT5) [13], further expanded these capabilities by providing better representations of inter-label dependencies and increased robustness to low-quality data. Considerable attention in the literature has been devoted to classification scenarios involving a large number of labels. Chalkidis et al. [14] proposed effective multi-label classification methods for open-label and large-label settings, employing transformer-based models enhanced with mechanisms that explicitly account for relationships between labels. Paolini et al. [15], in turn, demonstrated a generative approach to multi-label classification using sequence-to-sequence models (such as T5), in which labels are generated as textual sequences. This paradigm improves interpretability and enables more complex, multi-step prediction processes.

In addition, research on the use of LLMs in multi-label classification focuses mainly on prompt-based techniques [16]. However, the analyses presented in [17] point to significant challenges and limitations in correctly mapping complex correlations between labels. Despite the introduction of advanced techniques such as prompt-tuned embedding or reasoning-enhanced prompting, standard generative approaches remain highly sensitive to the quality of the input data. While these models achieve satisfactory results in zero-shot and few-shot scenarios on clean English-language collections, their effectiveness drops dramatically for highly noisy and domain-specific data [18]. Transcriptions of Polish medical dialogues fall into this category: they are subject to speech recognition errors and language-specific phenomena, a barrier that individual LLMs are often unable to overcome on their own. Limitations such as low noise immunity and a tendency to lose precision under specific medical conditions are key justifications for using the ensemble approach proposed in this paper. Combining fine-tuning techniques with an ensemble architecture allows individual model errors to be compensated for and ensures classification stability in the face of real-world transcription noise, which standard prompting methods do not adequately handle.

2.2. NLP for Medical Dialogues

Research on clinical dialogue processing has accelerated with the emergence of language models specialised for medical tasks. Models such as ClinicalBERT [19] and BioBERT [20] have shown that domain-specific adaptation of transformer architectures significantly improves performance in classification, information extraction, and clinical entity recognition tasks. In the context of physician–patient interactions, studies focusing on the processing of automatically transcribed conversations are particularly relevant. Zhang et al. (2021) [21] presented an approach to modelling the structure of clinical dialogues using contextual representations for symptom extraction tasks. Agrawal et al. [22] demonstrated that generative models (mT5 and T5) can effectively predict both clinical categories and their attributes, even when applied to noisy data.

2.3. Ensemble Methods with Transformer Models

The use of transformer-based model ensembles has recently emerged as an effective strategy for improving prediction quality. Previous studies have shown that combining models trained on different domains or data variants increases stability and reduces errors [23]. T5 models and their variants have been successfully employed in ensemble settings for both generative classification and information extraction tasks [24,25].

2.4. Polish Language Models

Research on Polish language models faces several challenges related to the linguistic characteristics of Polish, which often limit the effectiveness of models trained primarily on English-language data [26]. One important development direction involves the creation of models specialised for particular domains. The medical domain represents one of the most critical and at the same time most challenging areas for NLP applications. Language used in clinical documentation, scientific publications, and medicinal product characteristics is highly formalised and contains extensive specialised terminology, abbreviations, and ambiguities that are difficult for general-purpose models to process. Medical NLP research for the Polish language encounters additional challenges, primarily due to the lack of large, publicly available, and annotated medical corpora, which are essential for effective deep learning model training [27].

Nevertheless, several general-purpose domain models have been developed. PolBERT [28] is a Polish adaptation of the BERT language model, compatible with the BERT-base-uncased architecture [4]. It produces general-purpose sentence and token representations that can be applied across a wide range of NLP tasks and fine-tuned for specific applications without retraining from scratch. Polish RoBERTa [29] and PolitBERT [30] are models built upon the RoBERTa architecture. The Polish equivalent of the T5 model is plT5 [31], which is based on the mT5 parameters [13] and subsequently adapted to tasks specific to the Polish language. PapuGaPT2 [32] is a Polish language model based on the GPT-2 architecture and training paradigm. It employs causal language modelling to predict subsequent tokens in a sequence, enabling the generation of coherent and logically consistent text. The development of pre-trained Polish language models, such as HerBERT [33], constitutes a significant advancement. These models are pre-trained on large Polish-language corpora, including Polish Wikipedia and web-based datasets, and achieve better performance on general NLP tasks in Polish compared to multilingual models.

Unlike the situation for English-language resources [19,20], there is currently no widely available Polish language model pre-trained from scratch on large biomedical corpora. This limitation primarily results from the restricted availability of large, public, anonymised medical datasets in Polish. The Polish NLP community actively contributes to the development of models and methods for the medical domain by pursuing several key strategies: (a) fine-tuning general-purpose Polish language models [33], (b) applying multilingual transformer models to medical tasks [4,5] and (c) launching new initiatives aimed at creating Polish medical language models. One such initiative is the Eskulap project, an open language model dedicated to the medical domain, trained on a large collection of Polish-language medical sources [34].

Table 1 provides an overview of pre-trained Polish language models along with the data corpora used for their training.

3. Data Description

3.1. Dataset Construction

The task of information extraction from medical dialogues is formulated as follows. The analysed data consist of medical dialogues composed of multiple utterances produced by two speakers: a physician and a patient.

Medical dialogue D = {phrase₁, phrase₂, …, phraseₙ}    (1)

Each phrase is annotated with medical labels, including category, term and status. The set of medical terms T is predefined based on established medical knowledge. Individual terms may be expressed in medical dialogues in various forms, including formal expressions, colloquial formulations, or expressions distributed across a single sentence or multiple sentences. Each term is assigned to a predefined category within the dataset. The status label provides additional, fine-grained information about a given term and is therefore essential. Furthermore, the status of a term may change over the course of a medical dialogue. Given a medical dialogue D, the goal of the proposed system is to extract a set of term–status pairs {…, (tᵢ, sᵢ), …}, where tᵢ ∈ T denotes a term mentioned in the dialogue and sᵢ ∈ S denotes the corresponding status of term tᵢ. Generative approaches formulate this task as a single-stage sequence generation problem. In this setting, the dialogue D serves as input to the model, and the term–status pairs {…, (tᵢ, sᵢ), …} are generated sequentially.
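
To make the generative formulation concrete, the following minimal Python sketch shows one possible way of turning a set of term–status pairs into a single target string and back. The separator characters and the names `TermStatus`, `linearise` and `parse` are illustrative assumptions; the article does not specify the exact output format used by the models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermStatus:
    term: str    # t_i, an element of the predefined term set T
    status: str  # s_i, e.g. positive, negative, unknown, or a numeric value


def linearise(pairs):
    """Turn term-status pairs into one target string (assumed format)."""
    return "; ".join(f"{p.term}: {p.status}" for p in pairs)


def parse(text):
    """Invert linearise(); fragments without a 'term: status' shape are skipped."""
    pairs = []
    for chunk in text.split(";"):
        term, sep, status = chunk.partition(":")
        if sep and term.strip() and status.strip():
            pairs.append(TermStatus(term.strip(), status.strip()))
    return pairs
```

Skipping malformed fragments in `parse` is a design choice of this sketch: a generative model may emit incomplete text, and discarding such fragments keeps the downstream label set well-formed.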

3.2. Data Quality and Preprocessing

The dialogue dataset used in this study is highly heterogeneous with respect to both acoustic and linguistic quality. Recording sources encompassed diverse clinical environments and varying interview practices, resulting in differing levels of noise, interference, and speech intelligibility. In some recordings, patients articulated their speech unclearly or used non-standard linguistic constructions, which substantially complicated subsequent text processing. In many cases, patients were positioned at a considerable distance from the microphone or spoke in environments with background noise, such as conversations of third parties, waiting room sounds, or medical equipment noise. Consequently, the recorded audio often exhibited a low signal-to-noise ratio, leading to fragmented or distorted speech segments.

From a linguistic perspective, the data displayed considerable variability. Medical dialogues are inherently dynamic and frequently inconsistent: patients interrupt themselves, shift narrative focus, and employ paraphrases, generalisations, or ambiguous expressions. In some recordings, overlapping speech between patients and physicians was observed, which impeded accurate segmentation of dialogue turns. Moreover, patient utterances were often semantically incomplete, required contextual inference, or contained colloquial descriptions of symptoms that do not always have direct clinical equivalents. Table 2 presents examples of dialogue excerpts exhibiting different types of noise.

Given this complexity and variability, two distinct transcription processes were applied: manual and automatic. Manual transcription, carried out by experienced transcribers, significantly improved the quality of many recordings by accurately capturing utterances, correcting phonetic errors, and standardising transcription in acoustically challenging segments. By contrast, automatic transcriptions generated using automatic speech recognition systems were more susceptible to acoustic noise, phonetic ambiguity, and non-standard speech patterns. As a result, some automatically generated transcripts contained errors, necessitating additional cleaning and validation. To reduce the various types of transcription error, the method described in [39] was employed, with a primary focus on preprocessing mechanisms. These mechanisms effectively reduced factors that negatively impact the quality of automatic transcription. A key element involved the removal of inarticulate sounds (e.g., ‘mhm’, ‘eee’, ‘yyy’), which, despite lacking semantic meaning, frequently generated errors in ASR systems. Concurrently, normalisation of numbers, dates, and times was performed, alongside the standardisation of capitalisation and punctuation. A significant component of the system was the creation of a dedicated dictionary designed to correct phonetically similar words, such as medication names often mispronounced by patients. The data processing pipeline also integrated lemmatisation, which is particularly beneficial given the complex inflectional structure of the Polish language, although its application remains optional depending on specific analytical requirements. Furthermore, mechanisms were implemented to optimise the rendering of foreign medical terms and phonetic transcriptions of abbreviations. For example, the system converts phonetic Polish spellings such as ‘e ka gie’ into the standard medical abbreviation for electrocardiogram (ECG). This data processing flow, grounded in comparative analysis with manual transcriptions, significantly enhanced transcription quality prior to further analysis.
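
Three of the preprocessing steps described above (filler removal, mapping of phonetically spelled abbreviations, and dictionary-based correction) can be illustrated with a short Python sketch. It is not the method of [39]: the word lists are tiny hypothetical examples, `clean_transcript` is an illustrative name, and number normalisation and lemmatisation are omitted.

```python
import re

# Hypothetical resources; the actual filler list and dictionaries are not given in the text.
FILLERS = {"mhm", "eee", "yyy"}
# 'e ka gie' spells the letters E-K-G, the Polish abbreviation of electrocardiogram (ECG).
PHONETIC_ABBREVIATIONS = {"e ka gie": "EKG"}
# Made-up example of a misrecognised medication name and its correction.
CORRECTIONS = {"paracetamon": "paracetamol"}


def clean_transcript(text):
    """Lower-case the text, map spelled-out abbreviations, drop fillers, fix known words."""
    text = text.lower()
    for spoken, abbreviation in PHONETIC_ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(spoken)}\b", abbreviation, text)
    tokens = []
    for token in text.split():
        core = token.strip(",.?!")  # compare without surrounding punctuation
        if core in FILLERS:
            continue
        if core in CORRECTIONS:
            token = token.replace(core, CORRECTIONS[core])
        tokens.append(token)
    return " ".join(tokens)
```

Applying the abbreviation mapping before tokenisation matters in this sketch, because a spelled-out abbreviation spans several whitespace-separated tokens.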

3.3. General Characteristics of the Data

The analysed unannotated corpus consists of 2000 transcribed audio recordings of medical conversations between physicians and patients, with an average duration of approximately 15 min. Of these, 900 samples correspond to interviews conducted by internal medicine physicians, while 1100 samples correspond to interviews conducted by paediatricians. Each utterance was assigned to one of five categories: symptoms, referrals, medications, data, and other. Utterances assigned to the symptoms category were annotated with medical terms representing specific disease symptoms, such as vomiting, nausea, or diarrhoea. Approximately 400 distinct symptoms were used; however, some symptoms mentioned in the dialogues also correspond to disease or condition names, for example cold (rhinitis). More than 50 different medical terms were assigned to utterances in the referrals category, including, for example, referrals for COMBO testing or referrals to a surgeon. Utterances in the medications category were annotated with the names of the mentioned drugs. The drug dictionary used in this study contains several hundred prescription and over-the-counter medications. Utterances labelled as data contain information related to patient age, height, and weight. The ‘other’ category comprises utterances in which medical expressions occur but are not relevant from the perspective of the patient’s health status or diagnosis.

A status label was assigned to all utterances except those in the ‘other’ category. The status may take the values positive, negative, or unknown, or be expressed numerically. Positive and negative symptom statuses indicate whether the patient does or does not experience a given condition. For the unknown status, it is not possible to infer from the dialogue whether a given symptom is present, for example when a symptom is mentioned in a question that remains unanswered. The status is expressed as a numerical value when it is associated with a term for which a quantitative value is provided, such as temperature, blood pressure, or body weight.

For example, the utterance “My back hurts” would be labelled with the tuple: (category: symptoms, term: back pain, status: positive).

3.4. Data Preparation

Following transcription, the subsequent data preparation stage involved removing utterances that did not contain any medical terms. The medical terminology dictionary used for this purpose was compiled on the basis of medical literature and medical-focused websites. An exception was made for short affirmative or negative responses (for example yes, no, or rather not). These were assumed to constitute responses to questions appearing earlier in the dialogue and were therefore considered potentially relevant for resolving unknown statuses.
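
The filtering rule can be sketched in Python as follows. The dictionary excerpt, the list of short answers, and the function names are hypothetical, and English examples are used for readability; for Polish text, matching inflected forms would presumably require lemmatisation first.

```python
import re

# Hypothetical excerpts of the term dictionary and of the retained short answers.
MEDICAL_TERMS = {"fever", "cough", "back pain"}
SHORT_ANSWERS = {"yes", "no", "rather not"}


def keep_utterance(utterance):
    """Keep utterances that mention a dictionary term or are a short yes/no answer."""
    text = re.sub(r"[^\w\s]", "", utterance.lower()).strip()
    if text in SHORT_ANSWERS:
        return True
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in MEDICAL_TERMS)


def filter_dialogue(utterances):
    """Return the utterances of a dialogue that survive the filter, in order."""
    return [u for u in utterances if keep_utterance(u)]
```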

The next step involved manual annotation of the remaining utterances to assign the predefined labels: category, medical term, and status. A total of three people were involved in the annotation process: two algorithm design engineers and, depending on the speciality, an internist or a paediatrician. In problematic cases, questionable labels were discussed by the three-person panel, and the final decision was taken by vote. A detailed description of the annotation process is given in [40]. The assignment process proved challenging due to frequent violations of grammatical conventions in colloquial speech. In some instances, utterances were so ambiguous that it was difficult to distinguish between a question and an affirmation or negation. Occasionally, mutually contradictory statements occurred within a single dialogue, such as “he had a fever” followed by “no, actually he did not have a fever.” Similarly, accurately matching symptoms posed substantial difficulties. In certain cases, the identification of a symptom depended on the physician’s interpretation, for example “catarrhal pharynx” versus “reddened throat.” Table 3 presents an example excerpt from an interview after annotation.

3.5. Segmentation of Data into Windows

To enable the plT5 language model to assign labels while leveraging utterance context, the pre-trained model must be fine-tuned. For this purpose, each dialogue is segmented using a sliding window of size n (n = 2–4 utterances and step = 1).

Medical dialogue D = {window₁, window₂, … window_m} comprises a set of m windows. Each window contains a sequence of n consecutive dialogue utterances: window_i = {phrase₁, phrase₂, … phrase_n}. Windows containing fewer than n utterances are padded at the beginning with empty strings.
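
A minimal Python sketch of this segmentation is given below. It assumes that one window is produced per utterance and that each window ends at that utterance, which is one reading consistent with padding at the beginning; the function name `sliding_windows` is illustrative.

```python
def sliding_windows(utterances, n=3):
    """Return one window of n utterances per position (step = 1).

    Windows near the start of the dialogue are left-padded with empty strings.
    """
    if n < 1:
        raise ValueError("window size must be positive")
    padded = [""] * (n - 1) + list(utterances)
    return [padded[i:i + n] for i in range(len(utterances))]
```

For a dialogue of four utterances and n = 3, this yields four windows, the first of which contains two empty strings followed by the first utterance.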

Appropriate labels are assigned to each window, containing specific terms and their statuses. These labels serve as cues for generating the correct medical terms and their statuses, improving the textual semantics for better understanding of medical dialogue. The target outputs assigned to each window provide additional contextual knowledge and extend the model’s ability to exploit utterance-level context. In particular, windows without assigned labels correspond to the ‘other’ category. The ‘other’ category encompasses a wide range of terms.

Figure 1 illustrates the segmentation for a dialogue excerpt consisting of five utterances, from phrase 1 to phrase 5, with the corresponding labels shown beneath each window.

4. Methodology

4.1. T5 Model

The T5 language model [6] converts all natural language processing problems into a textual format, where both inputs and outputs are treated as text sequences. It uses an encoder–decoder transformer architecture as its structural backbone. The encoder processes the input text, while the decoder generates the output text.

The fundamental components of T5 are the attention mechanism and a feedforward neural network. More specifically, the architecture consists of an embedding layer, encoder block, encoder layer, decoder block, decoder layer, and output layer. The input text is tokenised and embedded into multidimensional vectors in the embedding layer. These embeddings represent the meaning of each token in the input sequence. The encoder block consists of a multi-head self-attention mechanism and a feedforward neural network. The multi-head self-attention mechanism allows the model to weight different parts of the input sequence when processing a given token, capturing dependencies between words and helping the model understand context. After the self-attention mechanism is applied, the output passes through a feedforward neural network, which introduces nonlinearity and enables the model to capture complex relationships in the data. The encoder is composed of multiple layers of such blocks, with each layer refining the understanding of the input sequence. Similarly, the decoder block consists of a multi-head self-attention mechanism, an encoder–decoder attention mechanism, and a feedforward neural network. The encoder–decoder attention mechanism allows the decoder to focus on relevant parts of the input sequence, helping it generate the output sequence. The final layer is the output layer, which generates the output sequence; in the task considered here, this corresponds to the labels assigned to the windows.

T5 is trained on a large corpus using a variety of tasks such as summarisation, question answering, classification, translation, and others. This broad training scope makes it versatile and effective, enabling it to understand and reproduce different writing styles and content types.

The Polish language model plT5 was pre-trained with a denoising objective on a dataset composed of the following five corpora: Polish Wikipedia, the National Corpus of Polish, Wolne Lektury, Polish Open Subtitles, and the Common Crawl of Polish websites. Each dataset is described in [31].

The pre-trained plT5 language model, used as a backbone, is further trained to generate sequences of terms and statuses. Each training sample is represented as follows:

d = (D_t, y)    (2)

where t ∈ {term generation, status generation} denotes the subtask to which sample d belongs. D_t represents the input data for the subtask and consists of the medical dialogue and subtask prompts (assigned labels), while y denotes the output text of the subtask (labels predicted by the system). The task of the system is to achieve the best possible alignment between the predicted labels and the assigned ground-truth labels.
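
The structure of a training sample d = (D_t, y) can be sketched in Python as follows. The prompt layout (subtask name as a prefix, utterances joined with a separator) and the names `Sample` and `make_sample` are assumptions made for illustration; the article does not give the exact input format.

```python
from dataclasses import dataclass

SUBTASKS = ("term generation", "status generation")


@dataclass(frozen=True)
class Sample:
    source: str  # D_t: subtask prompt plus the dialogue window
    target: str  # y: the text the model is trained to generate


def make_sample(subtask, window, target):
    """Build one training sample for the given subtask (assumed prompt layout)."""
    if subtask not in SUBTASKS:
        raise ValueError(f"unknown subtask: {subtask}")
    dialogue = " | ".join(u for u in window if u)  # drop padding strings
    return Sample(source=f"{subtask}: {dialogue}", target=target)
```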

4.2. Proposed Method

The aim of the proposed method was to design a noise- and transcription-error-resistant multi-class and multi-label classification system for medical dialogues. A key design assumption was the use of Polish-adapted T5 family language models (Polish T5) and the enhancement of their stability through an ensemble of several model variants fine-tuned on different data subsets. Due to the limited number of manually verified, semantically correct dialogue transcriptions, the fine-tuning process and subsequent ensemble construction had to account for both data scarcity and the high variability in the quality of recordings used for later testing.

Figure 2 presents a concise overview of the proposed solution architecture. In the first stage, three independent training datasets were prepared based on manual transcription of selected dialogues. These dialogues were additionally subjected to data cleaning procedures, including normalisation of notation, error correction, terminology standardisation, and utterance segmentation. Given the limited number of available examples, the individual datasets were not mutually exclusive—in practice, they shared a substantial portion of samples (dataset overlap of about 60%), although each also contained unique examples.

For each dataset, a separate fine-tuning of the Polish T5 base model was performed, treating the classification task as a generative problem: the model received an input dialogue fragment and was trained to generate symptom labels and status categories assigned by annotators. As a result of this process, three variants of the Polish T5 model were obtained, hereafter referred to as T5-A, T5-B, and T5-C, differing in the nature of their training data and in the distribution of labels.

To evaluate robustness to disturbances, the models were tested exclusively on low-quality dialogues—noisy, incomplete, ambiguous, and automatically transcribed. These data were characterised by numerous errors caused by poor acoustic recording quality, including phonetic distortions, signal dropouts, utterance fragmentation, and errors introduced by automatic speech recognition systems. Testing on significantly degraded data enabled a reliable assessment of model stability under conditions similar to real-world applications, where patient–physician interactions occur in diverse technical environments.

Single models trained on a limited number of examples exhibited satisfactory performance on clean data, but their stability decreased substantially on noisy data. Preliminary analysis indicated that each of the models T5-A, T5-B, and T5-C exhibited different systematic errors: some were more sensitive to lexical errors, while others were more affected by lack of context or fragmented dialogues. It was also observed that the models often correctly detected different subsets of labels, indicating potential complementarity, which can be seen in Figure 3.

Although the datasets used for fine-tuning the models overlap by approximately 60%, ensemble diversity can still emerge due to stochastic optimisation, architectural differences, and heterogeneous fine-tuning strategies. Prior work demonstrates that deep neural networks trained on identical data often converge to different regions of the loss landscape, yielding complementary predictions [41]. Ensemble theory further indicates that performance gains depend primarily on error decorrelation rather than dataset disjointness [42,43].

For this reason, an ensemble method based on the decisions of the three models was adopted. The goal of the ensemble was to reduce individual model errors through aggregation of predictions and reinforcement of shared signals.

4.3. Construction of the Multi-Model Ensemble

The ensemble was designed using rule-based voting mechanisms combined with prediction quality evaluation. For each (symptom, status) pair, all three models produced a classification decision along with associated quality metrics—precision, recall, and F1-score—calculated on the validation set. On this basis, two decision aggregation strategies were defined.

Ensemble 1: agreement of at least two models.
When the same symptom with the same status was identified by at least two of the three models (T5-A, T5-B, or T5-C), the (symptom, status) pair was deemed to be present in the analysed dialogue, regardless of individual performance metric values.
This majority voting approach enhanced robustness by reducing the impact of false positives produced by individual models.
Ensemble 2: agreement of at least two models and single-model prediction (without consensus).
When a symptom–status pair was identified by only one model, an additional validation procedure was applied. The prediction was accepted only if the precision value computed on the validation set for that symptom exceeded the corresponding recall value. This rule is an empirical heuristic rather than a theoretically optimal criterion. This criterion ensured that unique predictions were retained only when the model demonstrated reliable performance for detecting the specific symptom. Predictions that did not satisfy this condition were discarded. As a result, the approach limited error propagation caused by automatic transcription inaccuracies or excessive dialogue fragmentation.
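
The two aggregation rules can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation. It assumes that validation metrics are stored per model and per symptom, and that a single-model prediction with no stored metrics is rejected. The names `ensemble_1`, `ensemble_2`, `predictions`, and `metrics` are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (symptom, status)
# model name -> symptom -> (precision, recall) measured on the validation set
Metrics = Dict[str, Dict[str, Tuple[float, float]]]


def count_votes(predictions: Dict[str, Set[Pair]]) -> Counter:
    """Count how many models predicted each (symptom, status) pair."""
    return Counter(pair for pairs in predictions.values() for pair in pairs)


def ensemble_1(predictions: Dict[str, Set[Pair]]) -> Set[Pair]:
    """Ensemble 1: keep a pair when at least two of the models agree on it."""
    return {pair for pair, n in count_votes(predictions).items() if n >= 2}


def ensemble_2(predictions: Dict[str, Set[Pair]], metrics: Metrics) -> Set[Pair]:
    """Ensemble 2: Ensemble 1 plus single-model pairs where precision > recall."""
    votes = count_votes(predictions)
    accepted = ensemble_1(predictions)
    for model, pairs in predictions.items():
        for pair in pairs:
            if votes[pair] != 1:
                continue  # already handled by the consensus rule
            # Assumption: missing metrics default to (0, 0), so the pair is discarded.
            precision, recall = metrics.get(model, {}).get(pair[0], (0.0, 0.0))
            if precision > recall:
                accepted.add(pair)
    return accepted
```

Because Ensemble 2 starts from the output of Ensemble 1 and can only add pairs, it can never return fewer labels than Ensemble 1. It therefore trades some precision for recall.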

5. Experiments and Results

The plT5 model was implemented in its base configuration using the HuggingFace Transformers library [44], with both the T5Tokenizer and T5ForConditionalGeneration classes initialised from the same pre-trained local checkpoint, and was subsequently fine-tuned for medical text classification. Input sequences were tokenised with a maximum length of 512 tokens and formatted as structured prompts for generative multi-label medical text classification. All experiments were conducted on five Nvidia Quadro RTX 6000 graphics processing units (GPUs). The AdamW optimiser [45] was employed with a weight decay of 0.01 together with a linear learning-rate scheduler. The initial learning rate was set to 2 × 10⁻⁵, the batch size to 4 per GPU, the number of gradient accumulation steps to 1, and the total number of training epochs to 50. Training followed a standard fine-tuning procedure with sequence-to-sequence loss optimisation. Greedy decoding was used during inference, and the model checkpoints achieving the best performance on the evaluation set were selected for testing.
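
To make the reported hyperparameters more concrete, the following sketch computes the effective batch size and a linear learning-rate schedule in plain Python. It rests on several assumptions not stated in the text: that all five GPUs were used in data-parallel mode, that the schedule decays linearly to zero, and that no warm-up steps were used. The example count of 600 is purely illustrative, because the number of training sequences per dataset is not given here.

```python
import math

# Values reported in the text.
LEARNING_RATE = 2e-5
BATCH_PER_GPU = 4
GRAD_ACCUM_STEPS = 1
EPOCHS = 50
# Assumption: every one of the five GPUs takes part in data-parallel training.
NUM_GPUS = 5


def effective_batch_size() -> int:
    """Number of sequences contributing to one optimiser step."""
    return BATCH_PER_GPU * NUM_GPUS * GRAD_ACCUM_STEPS


def total_steps(num_examples: int, epochs: int = EPOCHS) -> int:
    """Optimiser steps over the whole run for a given number of training sequences."""
    return math.ceil(num_examples / effective_batch_size()) * epochs


def linear_lr(step: int, total: int, base_lr: float = LEARNING_RATE, warmup: int = 0) -> float:
    """Linear warm-up (optional) followed by linear decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))
```

Under these assumptions, 600 training sequences would give 30 optimiser steps per epoch and 1500 steps in total, with the learning rate halved at the midpoint of training.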

To further justify the choice of plT5, additional experiments were conducted using other Polish language models, including fine-tuned BERT [35] and BART [36] variants, as well as a traditional Support Vector Machine (SVM) classifier [46]. All baseline models were trained and evaluated on exactly the same datasets as the plT5 models, with separate training and testing sets for internal medicine and paediatrics, respectively, ensuring a fair and consistent comparison across architectures. The results of these baselines are reported in Table 4 alongside the best-performing T5 configuration. The plT5 model achieved the strongest performance in terms of F1-score and recall, which is particularly important in medical symptom detection where missed labels may have clinical consequences. Although the SVM classifier obtained higher precision, it suffered from substantially lower recall, indicating limited ability to capture diverse and implicitly expressed symptoms in noisy conversational data. This behaviour is consistent with the generative nature of T5, which better handles contextual inference and paraphrased symptom descriptions, thereby supporting its selection as the primary architecture for the proposed approach.

Experiments were carried out on two sets of medical dialogues. Accordingly, two groups of language models were created: one dedicated to extracting medical terms and their statuses from internal medicine (primary care) interviews, and the other to extracting them from paediatric interviews. This separation reflects differences in medical terminology and in the structure of documentation used in each specialisation.

Independent experiments were conducted for the two medical specialisations (internal medicine and paediatrics), covering data preparation, model fine-tuning, and evaluation. For each specialisation, three model variants (T5-A, T5-B, and T5-C) were fine-tuned, each on a dedicated dataset consisting of 600 manually transcribed medical dialogues. Due to limited data availability, the training sets for individual models within a specialisation overlapped by about 60%, while still preserving dataset-specific differences. In addition, evaluation sets of approximately 200 dialogues were used in building each model.

For each specialisation, a separate test set comprising 100 dialogues was prepared. These test sets consisted of dialogues of substantially reduced quality—noisy, fragmented, and frequently affected by errors introduced by automatic speech recognition. They were used exclusively for evaluation and were not included in the training process of any model, enabling a reliable assessment of robustness to disturbances and adherence to the adopted validation protocol.

The tests were performed at the level of the ensemble system rather than on individual classifiers. The inference process was executed sequentially: for each dialogue, the three constituent models generated candidate (symptom, status) pairs, which were then aggregated according to the predefined rules of Ensemble 1 and Ensemble 2. Ensemble 1 relied on agreement between at least two of the three models, whereas Ensemble 2 incorporated both consensus decisions and single-model predictions validated against performance metrics for individual symptoms.

The prediction results, reported in terms of mean precision, recall, and F1-score, are presented in Table 5. For the internal medicine dataset, model ensembling led to a substantial improvement in performance. The F1-score of Ensemble 1 increased by 0.1505 (21.76%) compared to Model A, which achieved the highest individual F1-score, and by 0.1722 (25.70%) relative to the mean F1-score across all three models. For Ensemble 2, the F1-score improved by 0.1251 (18.08%) compared to the best individual model and by 0.1468 (21.91%) relative to the average F1-score of the three models.

For the paediatrics dataset, the benefits of ensembling were even more pronounced. The F1-score of Ensemble 1 increased by 0.2919 (49.63%) relative to the mean F1-score of the component models and exceeded the best individual Model A by 0.2729 (44.93%). Corresponding improvements for Ensemble 2 were 0.2953 (50.20%) and 0.2762 (45.48%), respectively. Both ensemble strategies achieved comparable overall performance; however, the voting-based ensemble exhibited higher precision values than the ensemble approach relying on symptom-specific performance metrics.
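
The relationship between the absolute and relative gains quoted above can be checked with a few lines of arithmetic. The sketch assumes that each percentage is the absolute gain divided by the baseline F1-score. The baseline values it prints are back-calculated from the rounded figures in the text and are therefore approximate; they are not a substitute for Table 5. The function names are hypothetical.

```python
def relative_gain_pct(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in per cent."""
    return 100.0 * (improved - baseline) / baseline


def implied_baseline(abs_gain: float, rel_gain_pct: float) -> float:
    """Baseline F1-score implied by an absolute gain and its relative gain."""
    return abs_gain / (rel_gain_pct / 100.0)


# (absolute gain, relative gain in %) quoted in the text, versus the best single model.
REPORTED_VS_BEST = {
    ("internal medicine", "Ensemble 1"): (0.1505, 21.76),
    ("internal medicine", "Ensemble 2"): (0.1251, 18.08),
    ("paediatrics", "Ensemble 1"): (0.2729, 44.93),
    ("paediatrics", "Ensemble 2"): (0.2762, 45.48),
}

for (domain, ensemble), (abs_gain, rel_gain) in REPORTED_VS_BEST.items():
    baseline = implied_baseline(abs_gain, rel_gain)
    print(f"{domain}, {ensemble}: baseline ~{baseline:.3f}, ensemble ~{baseline + abs_gain:.3f}")
```

Within each specialisation, the two rows should yield nearly the same implied baseline, since both ensembles are compared against the same single model. Small differences are expected from rounding.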

Table 6 summarises the number of test examples for which the ensemble approach achieved better, equal, or worse results compared to the best single component model, taking into account the F1-score. The improvement in classification after applying model ensembling is evident for both datasets, and particularly significant in the case of the paediatric dataset.
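
A per-dialogue comparison of this kind can be sketched as follows. The sketch assumes that the F1-score of a dialogue is computed over sets of (symptom, status) pairs, that a dialogue with no gold and no predicted pairs scores 1.0, and that two scores within a small tolerance count as equal. None of these conventions is stated in the text, and the function names are hypothetical.

```python
from typing import Dict, Sequence, Set, Tuple

Pair = Tuple[str, str]  # (symptom, status)


def dialogue_f1(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Set-based F1-score for one dialogue."""
    if not predicted and not gold:
        return 1.0  # assumption: nothing to find and nothing predicted counts as correct
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def tally(ensemble_f1: Sequence[float], baseline_f1: Sequence[float],
          tol: float = 1e-9) -> Dict[str, int]:
    """Count dialogues where the ensemble is better than, equal to, or worse than the baseline."""
    counts = {"better": 0, "equal": 0, "worse": 0}
    for ens, base in zip(ensemble_f1, baseline_f1):
        if abs(ens - base) <= tol:
            counts["equal"] += 1
        elif ens > base:
            counts["better"] += 1
        else:
            counts["worse"] += 1
    return counts
```

Whether the baseline is the single best model overall or the best model for each dialogue changes the counts, so that choice should be fixed before such a table is compared across studies.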

Figure 4 presents the F1-score values obtained for a sample of 100 medical dialogues. The plots compare two ensemble configurations (Ensemble 1 and Ensemble 2) with the maximum and mean F1-scores achieved by the three individual component models. Separate plots are shown for internal medicine (Figure 4a) and paediatrics (Figure 4b). For the internal medicine dataset, ensemble-based approaches outperform individual models in the majority of cases, with particularly pronounced improvements observed when the average F1-score of the component models is low, in some instances nearly doubling the performance. As the average performance of the individual models increases, the relative advantage of the ensemble diminishes. A similar trend is observed for the paediatric dataset; however, the performance gains obtained through ensembling are generally more substantial. Notably, when the maximum F1-score of the component models is around 0.5, the F1-score of the ensemble models exceeds 0.8, demonstrating the robustness and effectiveness of the proposed ensemble approach across both medical domains.

Given the multi-label nature of the task and the dominance of missed detections over label confusions, a per-class recall analysis provides more insight into model behaviour. Figure 5 presents a recall matrix illustrating per-class detection performance. The visualisation reveals substantial variability in recall across symptoms and demonstrates that the ensemble consistently improves recall, especially for symptoms that are difficult to detect in noisy medical dialogues. For several symptom classes, model ensembling resulted in a substantial increase in recall, from values of approximately 0.5 observed for individual component models to values close to 1.0 achieved by the ensemble.
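
Per-class recall of the kind shown in such a matrix can be computed as in the sketch below. It assumes that a gold (symptom, status) pair counts as detected only when the same symptom is predicted with the same status, and that recall is grouped by symptom. The function name and the example data in the test are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple

Pair = Tuple[str, str]  # (symptom, status)


def per_symptom_recall(gold_by_dialogue: Sequence[Set[Pair]],
                       pred_by_dialogue: Sequence[Set[Pair]]) -> Dict[str, float]:
    """Fraction of gold occurrences of each symptom that were predicted with the right status."""
    found: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for gold, predicted in zip(gold_by_dialogue, pred_by_dialogue):
        for pair in gold:
            symptom = pair[0]
            total[symptom] += 1
            if pair in predicted:
                found[symptom] += 1
    return {symptom: found[symptom] / total[symptom] for symptom in total}
```

Running the function once per component model and once per ensemble gives one row of the matrix per system, with one column per symptom.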

Additional experiments were conducted on a subset of dialogues characterised by relatively lower levels of transcription degradation. The results of these experiments are presented in Table 7. The findings indicate that model ensembling still yields performance improvements under these conditions; however, the gains are more moderate—typically ranging from several to a dozen percentage points—compared to the substantially larger improvements observed for heavily noisy data. This pattern further supports the claim that the proposed ensemble strategy is particularly advantageous in scenarios with increased transcription uncertainty and linguistic distortion.

6. Discussion

The experimental results presented in Section 5 confirm that the proposed ensemble strategy substantially outperforms single fine-tuned models, especially in the presence of significant acoustic and linguistic noise. The observed relative increase in F1-score over the best single model, by 21.8 per cent for internal medicine and 44.9 per cent for paediatrics, indicates that the majority voting mechanism effectively suppresses stochastic errors generated by individual models. Automatic transcription makes individual models prone to errors caused by signal noise, and the ensemble mechanism filters out many of these mistakes by discarding poorly supported predictions. The results presented here demonstrate that for lower-resource languages, such as Polish, a strategy of combining several models trained on diverse subsets of data provides higher classification accuracy and greater resistance to overfitting than a single model. A key advantage of the proposed solution, which differentiates it from large commercial LLMs, is its suitability for local deployment. While larger models may offer a broader semantic understanding, their cloud-based architecture is often at odds with strict data protection regulations such as the GDPR in the European Union. The approach proposed in the paper is based entirely on local infrastructure, which keeps patient data on site and avoids transferring them to external providers. Consequently, this approach provides a practical alternative for healthcare facilities that cannot legally use LLMs offered through external APIs.

The main innovative feature of the proposed solution is an ensemble of generative T5 models with quality validation defined by the number of correctly recognised symptoms, designed for robustness against highly noisy, automatically transcribed clinical dialogues in Polish. This was achieved by fine-tuning three plT5 models, treating multi-class and multi-label classification as a generative (text-to-text) task, and combining the models into a rule-based ensemble robust to noise and transcription errors. Our approach deliberately tests models only on low-quality data, including ASR transcriptions, phonetic errors, speech fragmentation, speaker overlap, and environmental noise. This distinguishes the work from many publications that test models on clean, manually transcribed corpora. Solutions such as ClinicalBERT and BioBERT exist, but they target the English language; in our work, we have created a large, public, purely medical collection for models dedicated to the Polish language. The assumptions made define the problem as simultaneously multi-class and multi-label, with a symptom status (positive, negative, or unknown) that depends on the context of the entire sliding window. In addition, segmentation of dialogues (n = 2–4), generation of (symptom, status) pairs, and status determination during the conversation were used. This is a more complex formalisation than the classic ‘one utterance, one label’ setting. Finally, the solution is a system-level innovation because it is designed with the GDPR in mind: we propose a system that runs locally, within the infrastructure of the medical facility, as an alternative to cloud-based LLMs and to large LLMs that require huge hardware resources for data processing.

Despite the promising results, the limitations of the present study should be pointed out. The following analysis not only defines the limits of the current solution, but also sets the direction for further development work:

In order to properly assess the performance of the system, the quality of the input data produced by ASR must be taken into account. Unlike written texts, spontaneous speech is subject to errors that directly affect classification [47]. The most serious problem is the omission of a negation word (‘no’ or ‘not’). This reverses the meaning of the statement (a change of status from Negative to Positive). While modern language models can correct simple typos in disease names, a missing negation is difficult for them to detect, as the resulting sentence (e.g., ‘I have a fever’ instead of ‘I do not have a fever’) sounds logical and grammatically correct. Moreover, generative models are prone to so-called hallucinations [48]. In situations of severe noise, the model can generate a medical label that is highly plausible in the context, but was not actually mentioned by the patient.
In medicine, metrics such as the F1-score do not fully capture the consequences of errors. It is important to distinguish between missing a symptom (False Negative), which can lead to an incomplete medical history and delayed diagnosis, and a false detection (False Positive), which can lead to unnecessary diagnostic tests being ordered, generating costs and stress for the patient [49]. Given these risks, the proposed solution cannot function as a stand-alone diagnostic system. Its role is strictly defined as a support system, operating according to the human-in-the-loop principle [50]. The system is only intended to streamline the documentation process, relieving the burden on medical staff, while the final assessment of the patient’s condition always needs to be verified by a specialist.
Language coverage is also a limitation of the present study. The results are based on data in Polish, which means that the current system cannot analyse conversations in other languages. To change this, it would be necessary to collect new data and re-train the models. In addition, the collection of 2000 dialogues is still relatively small compared to the databases available for English, making it difficult for the model to learn rarer diseases.
Another limitation is the geographical scope of data collection. The dialogues were collected in medical facilities located quite close to each other, which means that the models may have learned a specific local way of speaking. Regional differences, such as dialects or accents, can make the system work less well in other parts of the country. The same risk applies to patient groups that were less frequently represented in the study. These limitations illustrate the wider problem of so-called limited generalisation. This is one of the main challenges in introducing artificial intelligence into medicine, requiring care to ensure that the system works fairly for all patient groups [51].
The processing of medical data requires strict compliance with the law. In addition to the GDPR, our system must comply with the new European Artificial Intelligence Act (AI Act) [52], which classifies diagnostic tools as high-risk systems where human supervision is crucial. In our approach, it is always the doctor who makes the final decision and the AI only has an advisory role. In terms of privacy, data are processed locally and patient-identifiable information is automatically deleted before analysis [53].

7. Conclusions and Future Work

This paper proposes an approach to multi-class and multi-label text classification in natural medical dialogues, based on the combination of several fine-tuned Polish T5 language models. The experiments confirmed that the use of ensemble strategies is an effective way to increase the system’s resistance to acoustic disturbances, transcription errors, and the speech inconsistencies characteristic of real patient-doctor conversations. The models complemented each other in situations where individual predictions were subject to errors resulting from low-quality input data, which directly translated into improved classification stability.

The consensus mechanism between models and selective acceptance of predictions from individual classifiers based on quality measures significantly reduced the number of false detections. This was of particular importance in the context of multi-label classification, where each dialogue can contain multiple symptoms and different statuses of their occurrence at the same time. The results indicate that the ensemble approach allows for a higher consistency and reliability of dialogue content interpretation than solutions based on a single model.

Another significant advantage of the proposed method is that it mitigates the problem of limited high-quality training data. The use of several variants of models trained on partially overlapping data sets allowed for the effective use of available resources and enabled increased generalisation of the system. Subtle differences in the distributions of training data resulted in the complementarity of models, allowing the ensemble to analyse a broader spectrum of linguistic structures and speech variants.

From a practical point of view, the ultimate goal is to implement the system in a real clinical environment to reduce the administrative burden on medical staff. The implementation strategy is strictly based on the human-in-the-loop principle [50], which is essential for building trust in the medical community. The system provides clear diagnostic suggestions that require explicit authorisation from the doctor before being recorded in the medical record. This ensures that the solution functions only as a support tool rather than as an autonomous decision-making system, and that the ultimate diagnostic responsibility always rests with the expert. In addition, successful implementation requires solving operational and acoustic problems in the doctor’s surgery. The correct selection and placement of specialised directional microphones (e.g., positioning them away from fans, computers or open windows) is crucial to maintaining high-precision ASR transcription in dynamic conditions, which directly supports the effectiveness of the ensemble model.

Future work plans include expanding the research to include additional medical specialties and increasing the diversity of training sets by incorporating additional dialogue data sources. Another interesting direction for further research is the integration of acoustic information with textual data, which could improve the detection of uncertainty, negation, and semantic nuances present in patients’ speech. In addition, the use of dynamic weights in the ensemble, depending on the context of the dialogue or the level of data noise, is being considered, as well as the exploration of newer generative architectures and semi-supervised learning methods. The ultimate goal of further work is to implement the system in a clinical environment and validate it in the real-world conditions under which medical personnel work.

Author Contributions

Conceptualisation, M.L., S.D., Z.K., M.P. (Małgorzata Płaza) and M.P. (Mirosław Płaza); methodology, M.L., S.D., Z.K. and M.P. (Mirosław Płaza); software, K.K. and K.W.; validation, M.L., M.P. (Małgorzata Płaza) and J.K.; formal analysis, M.P. (Małgorzata Płaza), M.L., M.P. (Mirosław Płaza) and J.K.; investigation, M.L., M.P. (Małgorzata Płaza), M.P. (Mirosław Płaza) and J.K.; resources, M.L., M.P. (Małgorzata Płaza) and J.K.; data curation, M.L., M.P. (Małgorzata Płaza), J.K., K.K., K.W. and K.T.; writing—original draft preparation, M.L., M.P. (Małgorzata Płaza), M.P. (Mirosław Płaza) and J.K.; writing—review and editing, M.L., M.P. (Małgorzata Płaza), M.P. (Mirosław Płaza) and S.D.; visualisation, M.P. (Małgorzata Płaza), M.L., K.W. and K.K.; supervision, M.P. (Mirosław Płaza) and S.D.; project administration, M.P. (Mirosław Płaza) and S.D.; funding acquisition, M.P. (Mirosław Płaza) and S.D. All authors have read and agreed to the published version of the manuscript.

Funding

Research co-funded by the National Centre for Research and Development under the INFOSTRATEG Strategic Program, competition: INFOSTRATEG IV, funding agreement number INFOSTRATEG4/0012/2022.

Institutional Review Board Statement

All procedures performed in this study were in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. The whole study was approved by the local research ethics committee of the Faculty of Electrical Engineering, Automatic Control and Computer Science (Kielce University of Technology).

Data Availability Statement

The data for the described research, which form the basis for the results of this article, are provided directly by the leader of the PARROT AI project—the company Altar Sp. z o.o. The data are intended to be published as a standalone data article in the near future. Until then, the data are available from Altar Sp. z o.o. upon reasonable request.

Conflicts of Interest

Author Karol Twardowski was employed by the company Altar Sp. z o.o. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Costa, A.B.; Flores, M.G.; et al. A large language model for electronic health records. npj Digit. Med. 2022, 5, 194.
MacKay, C.; Klement, W.; Vanberkel, P.; Lamond, N.; Urquhart, R.; Rigby, M. A framework for implementing machine learning in healthcare based on the concepts of preconditions and postconditions. Healthc. Anal. 2023, 3, 100155.
Tortorella, G.L.; Fogliatto, F.S.; Tlapa Mendoza, D.; Pepper, M.; Capurro, D. Digital transformation of health services: A value stream-oriented approach. Int. J. Prod. Res. 2023, 61, 1814–1828.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS ‘20), Vancouver, BC, Canada, 6–12 December 2020; pp. 1877–1901.
Yang, H.; Li, M.; Zhou, H.; Xiao, Y.; Fang, Q.; Zhou, S.; Zhang, R. Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study. J. Med. Internet Res. 2025, 27, e70080.
Mienye, I.D.; Swart, T.G. Ensemble Large Language Models: A Survey. Information 2025, 16, 688.
Rane, N.; Choudhary, S.P.; Rane, J. Ensemble deep learning and machine learning: Applications, opportunities, challenges, and future directions. Stud. Med. Health Sci. 2024, 1, 18–41.
Chen, Z.; Li, J.; Chen, P.; Li, Z.; Sun, K.; Luo, Y.; Mao, Q.; Li, M.; Xiao, L.; Yang, D.; et al. Harnessing multiple large language models: A survey on llm ensemble. arXiv 2025, arXiv:2502.18036.
He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv 2020, arXiv:2006.03654.
Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498.
Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LEGAL-BERT: The Mueller Report and Beyond. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 39–48.
Paolini, G.; Ma, A.; Ghezzi, F.; Lakomkin, E.; Wieser, J.; Peris, C.; Homan, S.; Tsvetkov, Y. Structured Prediction as Translation via Alignment. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
Ma, M.; Chochlakis, G.; Pandiyan, N.M.; Thomason, J.; Narayanan, S.S. Large Language Models Do Multi-Label Classification Differently. arXiv 2025, arXiv:2505.17510.
Sakai, H.; Lam, S.S. QUAD-LLM-MLTC: Large language models ensemble learning for healthcare text multi-label classification. arXiv 2025, arXiv:2502.14189.
Alqahtani, A.; Al-Makhadmeh, Z.; Tolba, A. Large Language Models for Health Care Text Classification: Systematic Review. JMIR AI 2026, 5, e79202.
Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.-H.; Jin, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA, 7 June 2019; pp. 72–78.
Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
Zhang, S.; Zhao, J.; Wang, P.; Xu, N.; Yang, Y.; Liu, Y.; Huang, Y.; Feng, J. Learning to Check Contract Inconsistencies. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 14446–14453.
Agrawal, M.; Hegselmann, S.; Lang, H.; Kim, Y.; Sontag, D. Large Language Models are Few-Shot Clinical Information Extractors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 1998–2022.
Kyritsis, K.; Liapis, C.M.; Perikos, I.; Paraskevas, M.; Kapoulas, V. From Transformers to Voting Ensembles for Interpretable Sentiment Classification: A Comprehensive Comparison. Computers 2025, 14, 167.
Hwang, M.-H.; Shin, J.; Seo, H.; Im, J.-S.; Cho, H.; Lee, C.-K. Ensemble Neural Question Generation Model Based on Text-to-Text Transfer Transformer. Appl. Sci. 2023, 13, 903.
Adams, V.; Shin, H.-C.; Anderson, C.; Liu, B.; Abidin, A. Text Mining Drug/Chemical-Protein Interactions using an Ensemble of BERT and T5 Based Models. arXiv 2021, arXiv:2111.15617.
Rybak, P.; Mroczkowski, R.; Tracz, J.; Gawlik, I. KLEJ: Comprehensive Benchmark for Polish Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1191–1201.
Czyżewski, A.; Szplit, D.; Budzisz, M.; Narkiewicz, K. A Comprehensive Polish Medical Speech Dataset for Enhancing Automatic Medical Dictation. Sci. Data 2025, 12, 1436.
Kłeczek, D. Polbert: Attacking Polish NLP Tasks with Transformers. In Proceedings of the PolEval 2020 Workshop, Warsaw, Poland, 26 October 2020; pp. 79–88.
Dadas, S.; Perełkiewicz, M.; Poświata, R. Pre-training Polish Transformer-based Language Models at Scale. In Artificial Intelligence and Soft Computing; Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 110–119.
Sopyła, K.; Sawaniewski, Ł. Ermlab/Politbert: Polish RoBERTa Model Trained on Polish Literature, Wikipedia, and Oscar. Available online: https://github.com/Ermlab/PoLitBert (accessed on 5 January 2024).
Chrabrowa, A.; Dragan, Ł.; Grzegorczyk, K.; Kajtoch, D.; Koszowski, M.; Mroczkowski, R.; Rybak, P. Evaluation of Transfer Learning for Polish with a Text-to-Text Transformer. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC), Marseille, France, 20–25 June 2022; pp. 4318–4326.
Wojczulis, M.; Kłeczek, D. papuGaPT2—Polish GPT2 Language Model. Available online: https://huggingface.co/flax-community/papuGaPT2 (accessed on 5 January 2024).
Mroczkowski, R.; Rybak, P.; Wróblewska, A.; Gawlik, I. HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, Kyiv, Ukraine, 19–20 April 2021; pp. 1–10.
Obuchowski, A. Eskulap—Polish Medical Language Model. Available online: https://github.com/AleksanderObuchowski/AleksanderObuchowski (accessed on 5 January 2024).
Kłeczek, D. Polbert Repository. Available online: https://github.com/kldarek/polbert (accessed on 5 January 2024).
Dadas, S. Polish RoBERTa Repository. Available online: https://github.com/sdadas/polish-roberta (accessed on 5 January 2024).
Allegro. plT5-Small Model. Available online: https://huggingface.co/allegro/plt5-small (accessed on 5 January 2024).
Dadas, S. Polish NLP Resources. Available online: https://github.com/sdadas/polish-nlp-resources (accessed on 5 January 2024).
Płaza, M.; Pawlik, Ł.; Deniziak, S. Call Transcription Methodology for Contact Center Systems. IEEE Access 2021, 9, 110975–110988. [Google Scholar] [CrossRef]
Płaza, M.; Płaza, M.; Lucińska, M.; Kęczkowska, J.; Deniziak, S.; Murawska, T.; Murawski, K.; Wykrota, K.; Jaszczyk, D.; Zawadzki, M.; et al. Parrot AI—Intelligent medical assistant. Sci. Rep. 2026, in review. [Google Scholar]
Fort, S.; Hu, H.; Lakshminarayanan, B. Deep ensembles: A loss landscape perspective. arXiv 2019, arXiv:1912.02757. [Google Scholar]
Kuncheva, L.I.; Whitaker, C.J. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Mach. Learn. 2003, 51, 181–207. [Google Scholar] [CrossRef]
Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms, 1st ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2012. [Google Scholar] [CrossRef]
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar] [CrossRef]
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
Cui, T.; Xiao, J.; Li, L.; Jiang, X.; Liu, Q. An approach to improve robustness of nlp systems against asr errors. arXiv 2021, arXiv:2103.13610. [Google Scholar] [CrossRef]
Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Chen, D.; Dai, W.; et al. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2022, 55, 1–38. [Google Scholar] [CrossRef]
Amann, J.; Blasimme, A.; Vayena, E.; Frey, D.; Madai, V.I.; Precise4Q Consortium. Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Med. Inform. Decis. Mak. 2020, 20, 310. [Google Scholar] [CrossRef]
Holzinger, A. Interactive machine learning for health informatics: When do we need the human-in-the-loop? Brain Inform. 2016, 3, 119–131. [Google Scholar] [CrossRef]
Rogers, T.W.; Jaccard, N.; Carbonaro, F.; Lemij, H.G.; Vermeer, K.A.; Reus, N.J.; Trikha, S. Evaluation of an AI system for the automated detection of glaucoma from stereoscopic optic disc photographs: The European Optic Disc Assessment Study. arXiv 2019, arXiv:1906.01272. [Google Scholar] [CrossRef]
van Kolfschooten, H.; van Oirschot, J. The EU Artificial Intelligence Act (2024): Implications for healthcare. Health Policy 2024, 149, 105152. [Google Scholar] [CrossRef]
Faustini, P.; McIver, A.; Sullivan, R.; Dras, M. De-identification of clinical data: A systematic review of free text, image and tabular data approaches. Int. J. Med. Inform. 2026, 208, 106225. [Google Scholar] [CrossRef]

Figure 1. Medical dialogue window.

Figure 2. Architecture of the proposed approach.

Figure 3. Per-class prediction variability illustrating functional diversity between models (a) for internal medicine dialogues, (b) for paediatric dialogues.

Figure 4. F1-score comparison between ensemble models and individual T5-based models (a) for internal medicine dialogues, (b) for paediatric dialogues.

Figure 5. Recall matrix for 20 selected symptoms across individual T5 models and their ensemble (a) for internal medicine dialogues, (b) for paediatric dialogues.

Table 1. Overview of Polish pre-trained language models and their training data.

Model	Architecture	Training Corpora Used	Training Corpus Size *	References
PolBERT	BERT	Polish Subset of Open Subtitles; Polish Subset of ParaCrawl; Polish Parliamentary Corpus; Polish Wikipedia	2.2 billion tokens	[35]
Polish RoBERTa	RoBERTa	CommonCrawl; Polish Wikipedia; Open Subtitles	1 billion Polish sentences	[36]
plT5	mT5	CCNet; National Corpus of Polish; Open Subtitles; Polish Wikipedia; Wolne Lektury	8.5 billion tokens	[37]
PapuGaPT2	GPT-2	Polish Oscar Corpus	Several hundred million to billions of tokens, depending on the variant	[32]
HerBERT	BERT	National Corpus of Polish; Polish Wikipedia; Wolne Lektury; CCNet; Open Subtitles	1.1 billion words	[33]
PolishBART	BART	Common Crawl	200+ GB	[38]

Legend: *—The data reflect the authors’ original documentation, BART—Bidirectional and Auto-Regressive Transformers.

Table 2. Sample excerpts from interviews with various types of noise.

Phrase	Person	Symptom1	S1	C1	Symptom2	S2	C2	Symptom3	S3	C3
Yes it hurts me dear I rub it from the neck from the neck it hurts me and the granddaughter she rubs me and this mine he rubs me nothing helps nothing works	P	Neck pain	Y	O
I was at the orthopaedist I had X-ray there of the let nothing there shows that it would have influence on this because I had there this leg was breaked	P	Orthopaedic treatment	Y	O
OK. So we have here is iterative colidis	D	Colitis	Y	O
I couldn’t feel my hip and I couldn’t feel my leg now I can’t feel my mine that’s exactly this pain and this pain.	P	Leg pain	Y	O	Hip pain	Y	O	Back pain	Y	O

Legend: D—Doctor, P—Patient, S1, S2, S3—status, C1, C2, C3—category, O—symptoms category, Y—positive status.

Table 3. An example excerpt from the interview after the annotation.

Phrase	Person	Symptom1	S1	C1	Symptom2	S2	C2	Symptom3	S3	C3
And in the morning it was around 37, so I didn’t give her anything else, because of her condition. For example, I can see now that she’s warm.	P	Mild fever	Y	O
Then it came back again, this morning she had a mild fever, but the child looks a bit weak, with dark circles under the eyes, mopey, less energetic.	D	Weakness	Y	O	Mild fever	Y	O	Deterioration of mood	Y	O
And yesterday, wasn’t there some diarrhoea yesterday, or the day before yesterday?	D	Diarrhoea	Q	O
Is the nose clear?	P	Stuffy nose	Q	O
It wasn’t like there was a lot of coughing	D	Cough	Y	O
Any worrying changes on the skin? Fresh rash?	D	Skin lesions	Q	O	Rash	Q	O
Just like with an infection	D	Infection	Y	O
With the fever, yes	P	Fever	Y	O
3, 3 nights we assume, once the infection has started, that these increases will happen	D	Infection	Y	O
On auscultation it’s clear, nothing is wrong there	D	Normal respiratory sound	Y	O
Ok, eardrum not bulging, the other side	D	Eardrum reddened	N	O
Don’t be afraid, don’t be afraid, it’s ok, your ear is fine.	D	Otoscopically, no changes to the ears	Y	O
The entire eardrum on the right side was not visible, only half of it.	D			I
Okay, now the tummy,	D			I
Yes, liver, spleen under the ribs not enlarged, abdomen without rigidity, soft, beautiful, just right.	D	Liver not enlarged	Y	O	Spleen not enlarged	Y	O	Abdomen soft	Y	O

Legend: D—Doctor; P—Patient; S1, S2, S3—status; C1, C2, C3—category; I—other category; O—symptoms category; Y—positive status; N—negative status; Q—unknown status.
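
The row layout of Tables 2 and 3 (a phrase, a speaker code, then up to three symptom/status/category triples) can be read into a small data structure. The following is a minimal sketch, assuming the rows are stored as tab-separated text exactly as printed above; the names `Annotation`, `Utterance`, and `parse_row` are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    symptom: str   # e.g. "Mild fever"; may be empty for category I rows
    status: str    # Y (positive), N (negative) or Q (unknown), per the legend
    category: str  # O (symptoms category) or I (other category)


@dataclass
class Utterance:
    phrase: str
    person: str    # D (doctor) or P (patient)
    annotations: List[Annotation]


def parse_row(row: str) -> Utterance:
    """Parse one tab-separated annotation row (hypothetical storage format)."""
    cells = row.split("\t")
    phrase, person = cells[0], cells[1]
    annotations = []
    # The remaining cells come in (symptom, status, category) triples.
    for i in range(2, len(cells), 3):
        triple = cells[i:i + 3]
        triple += [""] * (3 - len(triple))  # pad a truncated final triple
        symptom, status, category = (c.strip() for c in triple)
        if symptom or status or category:
            annotations.append(Annotation(symptom, status, category))
    return Utterance(phrase.strip(), person.strip(), annotations)


example = parse_row(
    "Any worrying changes on the skin? Fresh rash?\tD\tSkin lesions\tQ\tO\tRash\tQ\tO"
)
other = parse_row("Okay, now the tummy,\tD\t\t\tI")
```

In this sketch a row such as "Okay, now the tummy," yields a single annotation with an empty symptom and status and category I, which mirrors how the table marks utterances that carry no symptom.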

Table 4. Comparison of classification performance between the plT5 model and baseline models (PolBERT, plBART, SVM) on the internal medicine and paediatric datasets.

Internal Medicine	Metric	plT5	PolBERT	plBART	SVM
	precision	0.7141	0.6805	0.7002	0.9208
	recall	0.6969	0.7059	0.6899	0.5643
	F1-score	0.6917	0.6734	0.6759	0.6723
Paediatrics	Metric	plT5	PolBERT	plBART	SVM
	precision	0.5507	0.4516	0.4731	0.6809
	recall	0.7031	0.7544	0.6909	0.5302
	F1-score	0.6073	0.5499	0.5508	0.5732
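
The reported F1-scores are not the harmonic mean of the reported precision and recall (for plT5 on paediatrics, for instance, the harmonic mean of 0.5507 and 0.7031 is about 0.618, not 0.6073). One possible explanation is that each metric is computed per test example and then averaged. The sketch below illustrates that averaging scheme under this assumption; the function names and the convention of scoring empty label sets as 0 are illustrative choices, not the authors' implementation.

```python
from typing import Iterable, Set, Tuple


def sample_prf(gold: Set[str], pred: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the label sets of one test example."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0  # assumed convention for empty sets
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def averaged_prf(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> Tuple[float, ...]:
    """Average each metric over examples (example-based averaging)."""
    scores = [sample_prf(gold, pred) for gold, pred in pairs]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Two toy examples: one under-predicts, the other over-predicts.
pairs = [
    ({"fever", "cough"}, {"fever"}),
    ({"rash"}, {"rash", "diarrhoea"}),
]
mean_p, mean_r, mean_f1 = averaged_prf(pairs)
harmonic_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)
```

With these toy inputs the averaged precision and recall are both 0.75, so their harmonic mean is 0.75, while the averaged per-example F1 is 2/3. The same effect would account for the gap seen in the table.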

Table 5. Average precision, recall, and F1-score values on a test set of 100 samples for the individual models and the ensembles.

Internal Medicine	Metric	Model A	Model B	Model C	Mean	Max	Ensemble 1	Ensemble 2
	precision	0.7474	0.7141	0.6015	0.6877	0.7474	0.9364	0.8447
	recall	0.6642	0.6969	0.7088	0.6900	0.7088	0.7906	0.8255
	F1-score	0.6861	0.6917	0.6322	0.6700	0.6917	0.8422	0.8168
Paediatrics	Metric	Model A	Model B	Model C	Mean	Max	Ensemble 1	Ensemble 2
	precision	0.5507	0.5352	0.5451	0.5437	0.5507	0.9546	0.9238
	recall	0.7031	0.6744	0.6350	0.6708	0.7031	0.8274	0.8587
	F1-score	0.6073	0.5860	0.5715	0.5882	0.6073	0.8802	0.8835
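
The relative F1-score gains quoted in the abstract (21.8% and 44.9%) can be recomputed from this table. The short sketch below does so, assuming the gain is measured relative to the best single model (the Max column); under that assumption the quoted figures appear to correspond to Ensemble 1. The helper name `relative_gain` is illustrative.

```python
def relative_gain(ensemble: float, best_single: float) -> float:
    """Relative improvement of the ensemble over the best single model, in percent."""
    return 100.0 * (ensemble - best_single) / best_single


# F1-scores copied from Table 5 (Max, Ensemble 1, Ensemble 2).
f1 = {
    "internal medicine": {"best_single": 0.6917, "ensemble_1": 0.8422, "ensemble_2": 0.8168},
    "paediatrics": {"best_single": 0.6073, "ensemble_1": 0.8802, "ensemble_2": 0.8835},
}

gains = {
    dataset: {
        name: round(relative_gain(value, scores["best_single"]), 1)
        for name, value in scores.items()
        if name != "best_single"
    }
    for dataset, scores in f1.items()
}
```

For Ensemble 1 this gives 21.8% for internal medicine and 44.9% for paediatrics; for Ensemble 2 the same calculation gives 18.1% and 45.5%.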

Table 6. Number of test examples on which each ensemble achieved a larger, equal, or smaller F1-score.

Data Set	Ensemble 1			Ensemble 2
	Larger F1-Score	Equal F1-Score	Smaller F1-Score	Larger F1-Score	Equal F1-Score	Smaller F1-Score
Internal medicine	68	5	27	65	4	31
Paediatrics	85	4	11	85	5	10

Table 7. Ensemble strategy performance for low-noise dialogue data.

Metric	Model A	Model B	Model C	Mean	Max	Ensemble 1	Ensemble 2
precision	0.8384	0.8501	0.8492	0.8459	0.8501	0.912	0.8925
recall	0.6762	0.7008	0.6049	0.6606	0.7008	0.7591	0.7971
F1-score	0.7447	0.7653	0.7029	0.7376	0.7653	0.8246	0.8384


