Ensemble-Based Multi-Class and Multi-Label Text Classification for Noisy Clinical Dialogues
Abstract
1. Introduction
- A multi-class and multi-label classification method for Polish-language medical dialogues was developed based on fine-tuning three variants of the Polish T5 model.
- An ensemble strategy combining model predictions through majority voting and qualitative validation was proposed, enhancing robustness to noise and transcription errors.
- It was demonstrated that the ensemble approach outperforms the best individual model, particularly for low-quality dialogues, confirming its suitability for practical clinical applications.
2. Related Work
2.1. Multi-Label and Multi-Class Text Classification with Transformer-Based Language Models
2.2. NLP for Medical Dialogues
2.3. Ensemble Methods with Transformer Models
2.4. Polish Language Models
3. Data Description
3.1. Dataset Construction
3.2. Data Quality and Preprocessing
3.3. General Characteristics of the Data
3.4. Data Preparation
3.5. Segmentation of Data into Windows
4. Methodology
4.1. T5 Model
4.2. Proposed Method
4.3. Construction of the Multi-Model Ensemble
- Ensemble 1: agreement of at least two models. When the same symptom with the same status was identified by at least two of the three models (T5-A, T5-B, or T5-C), the (symptom, status) pair was deemed present in the analysed dialogue, regardless of the models’ individual performance metrics. This majority-voting approach enhanced robustness by reducing the impact of false positives produced by individual models.
- Ensemble 2: agreement of at least two models, plus validated single-model predictions (without consensus). When a symptom–status pair was identified by only one model, an additional validation procedure was applied: the prediction was accepted only if the precision computed on the validation set for that symptom exceeded the corresponding recall. This rule is an empirical heuristic rather than a theoretically optimal criterion; it ensured that unique predictions were retained only when the model had demonstrated reliable performance in detecting the specific symptom. Predictions that did not satisfy this condition were discarded. As a result, the approach limited error propagation caused by automatic transcription inaccuracies or excessive dialogue fragmentation.
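The two ensemble rules described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the representation of each model's output as a set of (symptom, status) tuples, and the per-model, per-symptom dictionaries of validation precision and recall are all assumptions made for illustration. In particular, the sketch assumes that the precision/recall check uses the metrics of the model that produced the single-model prediction, and that a symptom with no validation metrics is rejected.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (symptom, status), e.g. ("fever", "Y")


def count_votes(predictions: List[Set[Pair]]) -> Counter:
    """Count how many models predicted each (symptom, status) pair."""
    return Counter(pair for model_pairs in predictions for pair in model_pairs)


def ensemble_1(predictions: List[Set[Pair]], min_votes: int = 2) -> Set[Pair]:
    """Keep pairs on which at least `min_votes` models agree."""
    votes = count_votes(predictions)
    return {pair for pair, n in votes.items() if n >= min_votes}


def ensemble_2(
    predictions: List[Set[Pair]],
    val_precision: List[Dict[str, float]],
    val_recall: List[Dict[str, float]],
) -> Set[Pair]:
    """Ensemble 1 plus single-model pairs whose validation precision > recall.

    val_precision[i][symptom] and val_recall[i][symptom] are hypothetical
    lookups holding model i's validation metrics for that symptom.
    """
    votes = count_votes(predictions)
    accepted = {pair for pair, n in votes.items() if n >= 2}
    for model_idx, model_pairs in enumerate(predictions):
        for symptom, status in model_pairs:
            if votes[(symptom, status)] != 1:
                continue  # consensus pairs are already handled above
            precision = val_precision[model_idx].get(symptom, 0.0)
            recall = val_recall[model_idx].get(symptom, 0.0)
            if precision > recall:  # heuristic acceptance rule
                accepted.add((symptom, status))
    return accepted


# Toy outputs of three models for one dialogue (invented data).
t5_a = {("fever", "Y"), ("cough", "N")}
t5_b = {("fever", "Y"), ("rash", "Y")}
t5_c = {("fever", "N"), ("cough", "N")}
models = [t5_a, t5_b, t5_c]

# Invented validation metrics, one dictionary per model.
val_p = [{}, {"rash": 0.9}, {"fever": 0.5}]
val_r = [{}, {"rash": 0.6}, {"fever": 0.7}]

result_1 = ensemble_1(models)
result_2 = ensemble_2(models, val_p, val_r)
print(sorted(result_1))
print(sorted(result_2))
```

In this toy case, ("rash", "Y") is predicted only by the second model and is kept by Ensemble 2 because its assumed validation precision (0.9) exceeds its recall (0.6), whereas ("fever", "N") from the third model is discarded because its precision (0.5) is below its recall (0.7).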
5. Experiments and Results
6. Discussion
- To assess the performance of the system properly, the quality of the ASR output used as input must be taken into account. Unlike written text, spontaneous speech is prone to recognition errors that directly affect classification [47]. The most serious problem is the omission of a negation word such as ‘no’ or ‘not’, which reverses the meaning of the statement (the status changes from Negative to Positive). While modern language models can correct simple typos in disease names, a dropped negation is difficult for them to detect, because the resulting sentence (e.g., ‘I have a fever’ instead of ‘I do not have a fever’) remains logical and grammatically correct. Moreover, generative models can hallucinate [48]: under severe noise, the model may generate a medical label that is highly plausible in context but was never actually spoken by the patient.
- In medicine, metrics such as the F1-score do not fully capture the clinical cost of errors. The two error types have different consequences: a missed symptom (false negative) can lead to an incomplete medical history and a delayed diagnosis, whereas a false detection (false positive) can lead to unnecessary diagnostic tests, generating costs and stress for the patient [49]. Given these risks, the proposed solution cannot function as a stand-alone diagnostic system. Its role is strictly that of a support system operating according to the human-in-the-loop principle [50]. The system is intended only to streamline the documentation process and relieve the burden on medical staff, while the final assessment of the patient’s condition must always be verified by a specialist.
- Language coverage is also a limitation of the present study. The results are based on Polish-language data, so the current system cannot analyse conversations in other languages; supporting them would require collecting new data and re-training the models. In addition, the collection of 2000 dialogues is still relatively small compared with the resources available for English, making it difficult for the models to learn rarer diseases.
- Another limitation is the geographical scope of data collection. The data were collected in medical facilities located fairly close to one another, so the models may have learned a specific local way of speaking. Regional differences, such as dialects or accents, may make the system perform worse in other parts of the country. The same risk applies to patient groups that were under-represented in the study. These limitations illustrate the wider problem of limited generalisation, one of the main challenges in introducing artificial intelligence into medicine, which requires care to ensure that the system works fairly for all patient groups [51].
- The processing of medical data requires strict compliance with the law. In addition to the GDPR, our system must comply with the new European Artificial Intelligence Act (AI Act) [52], which classifies diagnostic tools as high-risk systems where human supervision is crucial. In our approach, it is always the doctor who makes the final decision and the AI only has an advisory role. In terms of privacy, data are processed locally and patient-identifiable information is automatically deleted before analysis [53].
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
1. Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Costa, A.B.; Flores, M.G.; et al. A large language model for electronic health records. npj Digit. Med. 2022, 5, 194.
2. MacKay, C.; Klement, W.; Vanberkel, P.; Lamond, N.; Urquhart, R.; Rigby, M. A framework for implementing machine learning in healthcare based on the concepts of preconditions and postconditions. Healthc. Anal. 2023, 3, 100155.
3. Tortorella, G.L.; Fogliatto, F.S.; Tlapa Mendoza, D.; Pepper, M.; Capurro, D. Digital transformation of health services: A value stream-oriented approach. Int. J. Prod. Res. 2023, 61, 1814–1828.
4. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
5. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
6. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
7. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS ’20), Vancouver, BC, Canada, 6–12 December 2020; pp. 1877–1901.
8. Yang, H.; Li, M.; Zhou, H.; Xiao, Y.; Fang, Q.; Zhou, S.; Zhang, R. Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study. J. Med. Internet Res. 2025, 27, e70080.
9. Mienye, I.D.; Swart, T.G. Ensemble Large Language Models: A Survey. Information 2025, 16, 688.
10. Rane, N.; Choudhary, S.P.; Rane, J. Ensemble deep learning and machine learning: Applications, opportunities, challenges, and future directions. Stud. Med. Health Sci. 2024, 1, 18–41.
11. Chen, Z.; Li, J.; Chen, P.; Li, Z.; Sun, K.; Luo, Y.; Mao, Q.; Li, M.; Xiao, L.; Yang, D.; et al. Harnessing multiple large language models: A survey on llm ensemble. arXiv 2025, arXiv:2502.18036.
12. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv 2020, arXiv:2006.03654.
13. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498.
14. Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LEGAL-BERT: The Mueller Report and Beyond. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 39–48.
15. Paolini, G.; Ma, A.; Ghezzi, F.; Lakomkin, E.; Wieser, J.; Peris, C.; Homan, S.; Tsvetkov, Y. Structured Prediction as Translation via Alignment. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
16. Ma, M.; Chochlakis, G.; Pandiyan, N.M.; Thomason, J.; Narayanan, S.S. Large Language Models Do Multi-Label Classification Differently. arXiv 2025, arXiv:2505.17510.
17. Sakai, H.; Lam, S.S. QUAD-LLM-MLTC: Large language models ensemble learning for healthcare text multi-label classification. arXiv 2025, arXiv:2502.14189.
18. Alqahtani, A.; Al-Makhadmeh, Z.; Tolba, A. Large Language Models for Health Care Text Classification: Systematic Review. JMIR AI 2026, 5, e79202.
19. Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.-H.; Jin, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, MN, USA, 7 June 2019; pp. 72–78.
20. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
21. Zhang, S.; Zhao, J.; Wang, P.; Xu, N.; Yang, Y.; Liu, Y.; Huang, Y.; Feng, J. Learning to Check Contract Inconsistencies. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 14446–14453.
22. Agrawal, M.; Hegselmann, S.; Lang, H.; Kim, Y.; Sontag, D. Large Language Models are Few-Shot Clinical Information Extractors. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 1998–2022.
23. Kyritsis, K.; Liapis, C.M.; Perikos, I.; Paraskevas, M.; Kapoulas, V. From Transformers to Voting Ensembles for Interpretable Sentiment Classification: A Comprehensive Comparison. Computers 2025, 14, 167.
24. Hwang, M.-H.; Shin, J.; Seo, H.; Im, J.-S.; Cho, H.; Lee, C.-K. Ensemble Neural Question Generation Model Based on Text-to-Text Transfer Transformer. Appl. Sci. 2023, 13, 903.
25. Adams, V.; Shin, H.-C.; Anderson, C.; Liu, B.; Abidin, A. Text Mining Drug/Chemical-Protein Interactions using an Ensemble of BERT and T5 Based Models. arXiv 2021, arXiv:2111.15617.
26. Rybak, P.; Mroczkowski, R.; Tracz, J.; Gawlik, I. KLEJ: Comprehensive Benchmark for Polish Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1191–1201.
27. Czyżewski, A.; Szplit, D.; Budzisz, M.; Narkiewicz, K. A Comprehensive Polish Medical Speech Dataset for Enhancing Automatic Medical Dictation. Sci. Data 2025, 12, 1436.
28. Kłeczek, D. Polbert: Attacking Polish NLP Tasks with Transformers. In Proceedings of the PolEval 2020 Workshop, Warsaw, Poland, 26 October 2020; pp. 79–88.
29. Dadas, S.; Perełkiewicz, M.; Poświata, R. Pre-training Polish Transformer-based Language Models at Scale. In Artificial Intelligence and Soft Computing; Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 110–119.
30. Sopyła, K.; Sawaniewski, Ł. Ermlab/Politbert: Polish RoBERTa Model Trained on Polish Literature, Wikipedia, and Oscar. Available online: https://github.com/Ermlab/PoLitBert (accessed on 5 January 2024).
31. Chrabrowa, A.; Dragan, Ł.; Grzegorczyk, K.; Kajtoch, D.; Koszowski, M.; Mroczkowski, R.; Rybak, P. Evaluation of Transfer Learning for Polish with a Text-to-Text Transformer. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC), Marseille, France, 20–25 June 2022; pp. 4318–4326.
32. Wojczulis, M.; Kłeczek, D. papuGaPT2 - Polish GPT2 Language Model. Available online: https://huggingface.co/flax-community/papuGaPT2 (accessed on 5 January 2024).
33. Mroczkowski, R.; Rybak, P.; Wróblewska, A.; Gawlik, I. HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, Kyiv, Ukraine, 19–20 April 2021; pp. 1–10.
34. Obuchowski, A. Eskulap - Polish Medical Language Model. Available online: https://github.com/AleksanderObuchowski/AleksanderObuchowski (accessed on 5 January 2024).
35. Kłeczek, D. Polbert Repository. Available online: https://github.com/kldarek/polbert (accessed on 5 January 2024).
36. Dadas, S. Polish RoBERTa Repository. Available online: https://github.com/sdadas/polish-roberta (accessed on 5 January 2024).
37. Allegro. plT5-Small Model. Available online: https://huggingface.co/allegro/plt5-small (accessed on 5 January 2024).
38. Dadas, S. Polish NLP Resources. Available online: https://github.com/sdadas/polish-nlp-resources (accessed on 5 January 2024).
39. Płaza, M.; Pawlik, Ł.; Deniziak, S. Call Transcription Methodology for Contact Center Systems. IEEE Access 2021, 9, 110975–110988.
40. Płaza, M.; Płaza, M.; Lucińska, M.; Kęczkowska, J.; Deniziak, S.; Murawska, T.; Murawski, K.; Wykrota, K.; Jaszczyk, D.; Zawadzki, M.; et al. Parrot AI—Intelligent medical assistant. Sci. Rep. 2026, in review.
41. Fort, S.; Hu, H.; Lakshminarayanan, B. Deep ensembles: A loss landscape perspective. arXiv 2019, arXiv:1912.02757.
42. Kuncheva, L.I.; Whitaker, C.J. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Mach. Learn. 2003, 51, 181–207.
43. Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms, 1st ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2012.
44. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv 2019, arXiv:1910.03771.
45. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
46. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
47. Cui, T.; Xiao, J.; Li, L.; Jiang, X.; Liu, Q. An approach to improve robustness of nlp systems against asr errors. arXiv 2021, arXiv:2103.13610.
48. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Chen, D.; Dai, W.; et al. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2022, 55, 1–38.
49. Amann, J.; Blasimme, A.; Vayena, E.; Frey, D.; Madai, V.I.; Precise4Q Consortium. Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Med. Inform. Decis. Mak. 2020, 20, 310.
50. Holzinger, A. Interactive machine learning for health informatics: When do we need the human-in-the-loop? Brain Inform. 2016, 3, 119–131.
51. Rogers, T.W.; Jaccard, N.; Carbonaro, F.; Lemij, H.G.; Vermeer, K.A.; Reus, N.J.; Trikha, S. Evaluation of an AI system for the automated detection of glaucoma from stereoscopic optic disc photographs: The European Optic Disc Assessment Study. arXiv 2019, arXiv:1906.01272.
52. van Kolfschooten, H.; van Oirschot, J. The EU Artificial Intelligence Act (2024): Implications for healthcare. Health Policy 2024, 149, 105152.
53. Faustini, P.; McIver, A.; Sullivan, R.; Dras, M. De-identification of clinical data: A systematic review of free text, image and tabular data approaches. Int. J. Med. Inform. 2026, 208, 106225.





| Model | Architecture | Training Corpora Used | Training Corpus Size * | References |
|---|---|---|---|---|
| PolBERT | BERT | Polish Subset of Open Subtitles; Polish Subset of ParaCrawl; Polish Parliamentary Corpus; Polish Wikipedia | 2.2 billion tokens | [35] |
| Polish RoBERTa | RoBERTa | CommonCrawl; Polish Wikipedia; Open Subtitles | 1 billion Polish sentences | [36] |
| plT5 | mT5 | CCNet; National Corpus of Polish; Open Subtitles; Polish Wikipedia; Wolne Lektury | 8.5 billion tokens | [37] |
| PapuGaPT2 | GPT-2 | Polish Oscar Corpus | Several hundred million to billions of tokens, depending on the variant | [32] |
| HerBERT | BERT | National Corpus of Polish; Polish Wikipedia; Wolne Lektury; CCNet; Open Subtitles | 1.1 billion words | [33] |
| PolishBART | BART | Common Crawl | 200+ GB | [38] |

| Phrase | Person | Symptom1 | S1 | C1 | Symptom2 | S2 | C2 | Symptom3 | S3 | C3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Yes it hurts me dear I rub it from the neck from the neck it hurts me and the granddaughter she rubs me and this mine he rubs me nothing helps nothing works | P | Neck pain | Y | O | ||||||
| I was at the orthopaedist I had X-ray there of the let nothing there shows that it would have influence on this because I had there this leg was breaked | P | Orthopaedic treatment | Y | O | ||||||
| OK. So we have here is iterative colidis | D | Colitis | Y | O | ||||||
| I couldn’t feel my hip and I couldn’t feel my leg now I can’t feel my mine that’s exactly this pain and this pain. | P | Leg pain | Y | O | Hip pain | Y | O | Back pain | Y | O |
| Phrase | Person | Symptom1 | S1 | C1 | Symptom2 | S2 | C2 | Symptom3 | S3 | C3 |
|---|---|---|---|---|---|---|---|---|---|---|
| And in the morning it was around 37, so I didn’t give her anything else, because of her condition. For example, I can see now that she’s warm. | P | Mild fever | Y | O | ||||||
| Then it came back again, this morning she had a mild fever, but the child looks a bit weak, with dark circles under the eyes, mopey, less energetic. | D | Weakness | Y | O | Mild fever | Y | O | Deterioration of mood | Y | O |
| And yesterday, wasn’t there some diarrhoea yesterday, or the day before yesterday? | D | Diarrhoea | Q | O | ||||||
| Is the nose clear? | P | Stuffy nose | Q | O | ||||||
| It wasn’t like there was a lot of coughing | D | Cough | Y | O | ||||||
| Any worrying changes on the skin? Fresh rash? | D | Skin lesions | Q | O | Rash | Q | O | |||
| Just like with an infection | D | Infection | Y | O | ||||||
| With the fever, yes | P | Fever | Y | O | ||||||
| 3, 3 nights we assume, once the infection has started, that these increases will happen | D | Infection | Y | O | ||||||
| On auscultation it’s clear, nothing is wrong there | D | Normal respiratory sound | Y | O | ||||||
| Ok, eardrum not bulging, the other side | D | Eardrum reddened | N | O | ||||||
| Don’t be afraid, don’t be afraid, it’s ok, your ear is fine. | D | Otoscopically, no changes to the ears | Y | O | ||||||
| The entire eardrum on the right side was not visible, only half of it. | D | I | ||||||||
| Okay, now the tummy, | D | I | ||||||||
| Yes, liver, spleen under the ribs not enlarged, abdomen without rigidity, soft, beautiful, just right. | D | Liver not enlarged | Y | O | Spleen not enlarged | Y | O | Abdomen soft | Y | O |
| Data Set | Metric | plT5 | PolBERT | plBART | SVM |
|---|---|---|---|---|---|
| Internal Medicine | precision | 0.7141 | 0.6805 | 0.7002 | 0.9208 |
| Internal Medicine | recall | 0.6969 | 0.7059 | 0.6899 | 0.5643 |
| Internal Medicine | F1-score | 0.6917 | 0.6734 | 0.6759 | 0.6723 |
| Paediatrics | precision | 0.5507 | 0.4516 | 0.4731 | 0.6809 |
| Paediatrics | recall | 0.7031 | 0.7544 | 0.6909 | 0.5302 |
| Paediatrics | F1-score | 0.6073 | 0.5499 | 0.5508 | 0.5732 |

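Several F1 values in the table above are lower than both the corresponding precision and recall (for example, plT5 on internal medicine: 0.7141, 0.6969 and 0.6917). This cannot happen when F1 is computed from a single precision/recall pair, because the harmonic mean always lies between the two, but it is expected when each metric is averaged separately over dialogues or labels. The averaging scheme is not stated in this section, so the following Python sketch assumes per-dialogue averaging over sets of (symptom, status) pairs; the function names, the toy data, and the convention of scoring an empty set as 0.0 are illustrative assumptions.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (symptom, status)


def prf(gold: Set[Pair], pred: Set[Pair]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for a single dialogue."""
    tp = len(gold & pred)  # pairs that are both annotated and predicted
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def averaged_prf(
    golds: List[Set[Pair]], preds: List[Set[Pair]]
) -> Tuple[float, float, float]:
    """Average each metric over dialogues independently."""
    scores = [prf(g, p) for g, p in zip(golds, preds)]
    n = len(scores)
    mean_p = sum(s[0] for s in scores) / n
    mean_r = sum(s[1] for s in scores) / n
    mean_f1 = sum(s[2] for s in scores) / n
    return mean_p, mean_r, mean_f1


# Two invented dialogues with opposite error profiles.
a, b, c, d = ("fever", "Y"), ("cough", "Y"), ("rash", "N"), ("nausea", "Y")
golds = [{a, b, c, d}, {a}]
preds = [{a}, {a, b, c, d}]  # dialogue 1 under-predicts, dialogue 2 over-predicts

mean_p, mean_r, mean_f1 = averaged_prf(golds, preds)
print(mean_p, mean_r, mean_f1)
```

Here each dialogue has an F1 of 0.4 while precision and recall are 1.0 and 0.25 in one dialogue and the reverse in the other, so the averaged F1 ends up below both the averaged precision and the averaged recall.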
| Data Set | Metric | Model A | Model B | Model C | Mean | Max | Ensemble 1 | Ensemble 2 |
|---|---|---|---|---|---|---|---|---|
| Internal Medicine | precision | 0.7474 | 0.7141 | 0.6015 | 0.6877 | 0.7474 | 0.9364 | 0.8447 |
| Internal Medicine | recall | 0.6642 | 0.6969 | 0.7088 | 0.6900 | 0.7088 | 0.7906 | 0.8255 |
| Internal Medicine | F1-score | 0.6861 | 0.6917 | 0.6322 | 0.6700 | 0.6917 | 0.8422 | 0.8168 |
| Paediatrics | precision | 0.5507 | 0.5352 | 0.5451 | 0.5437 | 0.5507 | 0.9546 | 0.9238 |
| Paediatrics | recall | 0.7031 | 0.6744 | 0.6350 | 0.6708 | 0.7031 | 0.8274 | 0.8587 |
| Paediatrics | F1-score | 0.6073 | 0.5860 | 0.5715 | 0.5882 | 0.6073 | 0.8802 | 0.8835 |

| Data Set | Ensemble 1: Larger F1-Score | Ensemble 1: Equal F1-Score | Ensemble 1: Smaller F1-Score | Ensemble 2: Larger F1-Score | Ensemble 2: Equal F1-Score | Ensemble 2: Smaller F1-Score |
|---|---|---|---|---|---|---|
| Internal medicine | 68 | 5 | 27 | 65 | 4 | 31 |
| Paediatrics | 85 | 4 | 11 | 85 | 5 | 10 |

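A larger/equal/smaller breakdown of this kind can be produced by comparing F1-scores item by item. The Python sketch below assumes that the comparison is made per dialogue between an ensemble and a single reference model; the unit of comparison and the reference model are not stated in this section, and the function name, tolerance, and scores are invented for illustration.

```python
from typing import List, Tuple


def compare_f1(
    ensemble_f1: List[float], reference_f1: List[float], tol: float = 1e-9
) -> Tuple[int, int, int]:
    """Count items where the ensemble F1 is larger, equal, or smaller.

    Two scores within `tol` of each other are treated as equal.
    """
    larger = equal = smaller = 0
    for ens, ref in zip(ensemble_f1, reference_f1):
        if abs(ens - ref) <= tol:
            equal += 1
        elif ens > ref:
            larger += 1
        else:
            smaller += 1
    return larger, equal, smaller


# Invented per-dialogue F1-scores for four dialogues.
ensemble_scores = [0.80, 0.50, 0.60, 0.90]
reference_scores = [0.70, 0.50, 0.65, 0.40]

counts = compare_f1(ensemble_scores, reference_scores)
print(counts)
```

Such counts complement the averaged metrics, since an average can hide how often the ensemble actually loses to the reference model on individual dialogues.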
| Noise Level | Metric | Model A | Model B | Model C | Mean | Max | Ensemble 1 | Ensemble 2 |
|---|---|---|---|---|---|---|---|---|
| Medium Noise | precision | 0.8384 | 0.8501 | 0.8492 | 0.8459 | 0.8501 | 0.912 | 0.8925 |
| Medium Noise | recall | 0.6762 | 0.7008 | 0.6049 | 0.6606 | 0.7008 | 0.7591 | 0.7971 |
| Medium Noise | F1-score | 0.7447 | 0.7653 | 0.7029 | 0.7376 | 0.7653 | 0.8246 | 0.8384 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lucińska, M.; Płaza, M.; Kęczkowska, J.; Kurek, K.; Wykrota, K.; Deniziak, S.; Twardowski, K.; Koruba, Z.; Płaza, M. Ensemble-Based Multi-Class and Multi-Label Text Classification for Noisy Clinical Dialogues. Appl. Sci. 2026, 16, 2645. https://doi.org/10.3390/app16062645

