Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare
Abstract
1. Introduction
- Do we need to finetune a model, or can we use it in its zero-shot capacity? (A minimal code sketch contrasting the two modes follows this list.)
- Are domain-adjacent pretrained models better than generic pretrained models?
- Is further pretraining on domain-specific data helpful?
- With the rise of LLMs, are bidirectional language models (BiLMs) such as BERT still relevant?
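To make the first question concrete, the sketch below contrasts the two usage modes with Hugging Face transformers: prompting an instruction-tuned LLM zero-shot versus finetuning a small bidirectional encoder on labelled reports. This is an illustrative outline only, not the authors' pipeline; the checkpoints, prompt wording, toy reports, and labels are placeholder assumptions.

```python
# Illustrative sketch only (not the authors' pipeline): zero-shot prompting of an
# instruction-tuned LLM vs. finetuning a small bidirectional encoder on labelled data.
# Checkpoints, prompt wording, and the toy examples below are placeholders.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, pipeline)

report = "Specimen: skin, left forearm. Diagnosis: basal cell carcinoma, margins clear."

# --- Zero-shot: ask a generative LLM for the label directly, no task-specific training ---
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.3")
prompt = ("Is the following pathology report reportable to a cancer registry? "
          f"Answer yes or no.\n\n{report}\nAnswer:")
print(llm(prompt, max_new_tokens=3)[0]["generated_text"])

# --- Finetuned: train a small encoder (RoBERTa-sized) on labelled reports ---
enc_name = "roberta-base"
tok = AutoTokenizer.from_pretrained(enc_name)
model = AutoModelForSequenceClassification.from_pretrained(enc_name, num_labels=2)

class ToyReports(torch.utils.data.Dataset):
    """Tiny in-memory stand-in for a labelled training set of pathology reports."""
    def __init__(self, texts, labels):
        self.enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
        self.labels = torch.tensor(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

train_ds = ToyReports([report, "Specimen: colon biopsy. Diagnosis: benign polyp."], [1, 0])
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="reportability-clf",
                           num_train_epochs=3, per_device_train_batch_size=2),
    train_dataset=train_ds,
)
trainer.train()  # the finetuned encoder is then used as a conventional classifier
```

Zero-shot use needs no labelled data but leans on prompt quality and model scale, whereas the finetuned encoder needs labels yet is small enough to deploy locally; that trade-off is what the comparisons in this paper quantify.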
2. Models and Metrics
2.1. Off-the-Shelf Models: Finetuning vs. Zero-Shot
2.2. The Value of Further Domain-Specific Pretraining
2.3. Revisiting the Questions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Qin, L.; Chen, Q.; Feng, X.; Wu, Y.; Zhang, Y.; Li, Y.; Li, M.; Che, W.; Yu, P.S. Large language models meet NLP: A survey. arXiv 2024, arXiv:2405.12819.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (long and short papers), pp. 4171–4186.
- Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. A systematic review of testing and evaluation of healthcare applications of large language models (LLMs). medRxiv 2024.
- Wang, D.; Zhang, S. Large language models in medical and healthcare fields: Applications, advances, and challenges. Artif. Intell. Rev. 2024, 57, 299.
- Gondara, L.; Simkin, J.; Arbour, G.; Devji, S.; Ng, R. Classifying Tumor Reportability Status From Unstructured Electronic Pathology Reports Using Language Models in a Population-Based Cancer Registry Setting. JCO Clin. Cancer Inform. 2024, 8, e2400110.
- Gondara, L.; Simkin, J.; Devji, S.; Arbour, G.; Ng, R. ELM: Ensemble of Language Models for Predicting Tumor Group from Pathology Reports. arXiv 2025, arXiv:2503.21800.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
- Santos, T.; Tariq, A.; Das, S.; Vayalpati, K.; Smith, G.H.; Trivedi, H.; Banerjee, I. PathologyBERT: Pre-trained vs. a new transformer language model for pathology domain. In Proceedings of the AMIA Annual Symposium Proceedings, San Francisco, CA, USA, 29 April 2023; Volume 2022, p. 962.
- Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Flores, M.G.; Zhang, Y.; et al. Gatortron: A large clinical language model to unlock patient information from unstructured electronic health records. arXiv 2022, arXiv:2203.03540.
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv 2018, arXiv:1801.06146.
- Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv 2020, arXiv:2004.10964.
- Kraišniković, C.; Harb, R.; Plass, M.; Al Zoughbi, W.; Holzinger, A.; Müller, H. Fine-tuning language model embeddings to reveal domain knowledge: An explainable artificial intelligence perspective on medical decision making. Eng. Appl. Artif. Intell. 2025, 139, 109561.
- Kokalj, E.; Škrlj, B.; Lavrač, N.; Pollak, S.; Robnik-Šikonja, M. BERT meets Shapley: Extending SHAP explanations to transformer-based classifiers. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Online, 19 April 2021; pp. 16–21.
| Models | Task 1 (Reportability Classification) | Task 2 (Tumor Group Classification) | Task 3 (Histology Classification) |
|---|---|---|---|
| RoBERTa (zero-shot) | 0.34 | 0.01 | 0.02 |
| PathologyBERT (zero-shot) | 0.40 | 0.01 | 0.04 |
| Gatortron (zero-shot) | 0.34 | 0.01 | 0.13 |
| Mistral (zero-shot) | 0.76 | 0.54 | 0.65 |
| RoBERTa (finetuned) | 0.96 | 0.78 | 0.61 |
| PathologyBERT (finetuned) | 0.95 | 0.81 | 0.60 |
| Gatortron (finetuned) | 0.97 | 0.85 | 0.78 |
| BCCRoBERTa (finetuned) | 0.97 | 0.84 | 0.71 |
| BCCRTron (finetuned) | 0.97 | 0.85 | 0.89 |
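Models such as BCCRoBERTa and BCCRTron in the table appear to be variants that received further domain-specific pretraining (Section 2.2) before task finetuning. The sketch below illustrates what such a continued-pretraining step could look like with Hugging Face tooling, using masked language modelling on a plain-text domain corpus; the corpus file, starting checkpoint, and hyperparameters are placeholder assumptions, not the configuration behind the results above.

```python
# Minimal sketch of further (continued) domain-specific pretraining via masked
# language modelling, the step that can turn a generic encoder into a
# registry-specific one. File name, checkpoint, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "roberta-base"                          # starting checkpoint (generic encoder)
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical plain-text corpus of de-identified pathology reports, one per line.
corpus = load_dataset("text", data_files={"train": "pathology_reports.txt"})["train"]
corpus = corpus.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

# Random 15% token masking; the model learns to reconstruct domain vocabulary.
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-encoder",
                           num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
```

The adapted checkpoint is then finetuned per task in the same way as any generic encoder.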
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).