Review Reports
Lovedeep Gondara, Jonathan Simkin, Graham Sayle, et al.
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Algirdas Laukaitis
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors focus on a very important topic. The paper is well written and has great potential to guide other researchers working in this domain. However, I strongly recommend the following changes:
- The introduction is very short. SLMs and LLMs should be described in detail, and the use of LLMs in healthcare should be explained.
- The methods are not clearly explained. Where did you run the compute? In the cloud? On a local machine?
- There is no clear information regarding the selection of these models. Please explain this in more detail.
Author Response
We thank the reviewer for their feedback, which will help us make the paper stronger. Please see below for our point-by-point response.
- The introduction is very short. SLMs and LLMs should be described in detail, and the use of LLMs in healthcare should be explained.
The introduction, and the paper as a whole, is short because the paper is written as a brief report. We have added content on the use of LLMs and SLMs in healthcare.
- The methods are not clearly explained. Where did you run the compute? In the cloud? On a local machine?
We have added the compute details in Section 2.
- There is no clear information regarding the selection of these models. Please explain this in more detail.
We have explained the rationale behind model selection in Section 2, where we state that we chose models such as PathologyBERT because they are pretrained on clinical text, and the LLM because it is an open-source, top-performing model that fits on a commodity GPU.
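As an illustration of this kind of setup, below is a minimal sketch of finetuning a small bidirectional LM for classification, assuming the HuggingFace transformers library; the checkpoint and toy batch are generic stand-ins, not the paper's exact models or data.

```python
# Sketch: finetuning a small bidirectional LM (an SLM) as a report classifier.
# "bert-base-uncased" stands in for a domain-pretrained model such as
# PathologyBERT; the two-example batch stands in for labeled reports.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["No evidence of malignancy.", "Invasive ductal carcinoma, grade 2."]
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps on the toy batch
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```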
Reviewer 2 Report
Comments and Suggestions for Authors
The paper evaluates model selection choices for clinical text classification across three pathology-report tasks from the British Columbia Cancer Registry. It compares small language models (SLMs) and one 12B-parameter LLM under zero-shot versus finetuning, and examines the value of domain-adjacent pretraining and additional institution-specific pretraining. The work is very interesting; however, the following points must be addressed before it is accepted for publication:
1. Baselines and fairness of comparisons: Only one LLM is evaluated, and only in zero-shot mode. Claims that finetuned SLMs “often surpass zero-shot LLMs” are plausible but may not generalize across LLM families or prompting strategies.
2. Definition and implementation of “zero-shot” for SLMs: Masked LMs are not inherently zero-shot classifiers. Please specify prompt templates, verbalizers, and any calibration used. Without this, the very low macro-F1 numbers may reflect suboptimal prompting rather than inherent limits.
3. Statistical rigour and variance: The paper uses terms like “significantly outperformed,” but no confidence intervals, multiple random seeds, or statistical tests are reported. Provide the mean and standard deviation across seeds, along with statistical tests, for key comparisons.
4. Compute and efficiency claims: The conclusion highlights the performance and resource balance of SLMs, but there are no training or inference cost measurements. Report hardware, GPU hours, peak memory, tokens processed for additional pretraining, and throughput or latency.
5. Technical Issue: Fix “Error! Reference source not found.” and ensure the table is referenced correctly.
Author Response
We thank the reviewer for the feedback. Please see below for our point-by-point response. We would like to emphasize that the paper is submitted as a brief report, not a full-length research article, hence its length.
- Baselines and fairness of comparisons: Only one LLM is evaluated, and only in zero-shot mode. Claims that finetuned SLMs “often surpass zero-shot LLMs” are plausible but may not generalize across LLM families or prompting strategies.
This observation is true, but given the paper's length and format, we were not able to evaluate multiple LLMs. We have added this as a limitation.
- Definition and implementation of “zero-shot” for SLMs: Masked LMs are not inherently zero-shot classifiers. Please specify prompt templates, verbalizers, and any calibration used. Without this, the very low macro-F1 numbers may reflect suboptimal prompting rather than inherent limits.
We have clarified that we mean bidirectional language models (BiLMs) and LLMs, and that for BiLMs, “zero-shot” means using them out of the box as classifiers, without finetuning. As readers might conflate the zero-shot capabilities of BiLMs and LLMs, our intent is to show that BiLMs do not work in a zero-shot capacity, specifically in specialized domains. We have added clarifying sentences.
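To make the distinction concrete, the following is a minimal sketch of using a masked BiLM “out of the box” as a zero-shot classifier via a cloze template and verbalizer, assuming the HuggingFace transformers fill-mask pipeline; the template, label words, and checkpoint are illustrative, not our exact setup.

```python
# Sketch: zero-shot classification with a masked LM via a cloze template.
# Each class is mapped to a single label word (a verbalizer), and the
# model's fill-mask scores over those words serve as class scores.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

label_words = ["benign", "malignant"]  # illustrative verbalizer
report = "Sections show invasive ductal carcinoma, grade 2."
template = f"{report} This pathology report describes a [MASK] finding."

# Restrict predictions to the label words and take the highest-scoring one.
preds = fill(template, targets=label_words)
best = max(preds, key=lambda p: p["score"])
print(best["token_str"], round(best["score"], 4))
```

This is one common way to elicit zero-shot predictions from a masked LM; whatever template is used, such out-of-the-box scores in a specialized domain are the behaviour our results are meant to probe.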
- Statistical rigour and variance: The paper uses terms like “significantly outperformed,” but no confidence intervals, multiple random seeds, or statistical tests are reported. Provide the mean and standard deviation across seeds, along with statistical tests, for key comparisons.
Thank you for pointing this out. We have removed language that might imply statistical significance.
- Compute and efficiency claims: The conclusion highlights the performance and resource balance of SLMs, but there are no training or inference cost measurements. Report hardware, GPU hours, peak memory, tokens processed for additional pretraining, and throughput or latency.
We have added the compute details in Section 2.
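For readers who want to reproduce such measurements, here is a minimal sketch of how peak GPU memory and inference throughput can be recorded with PyTorch; the numbers it produces are machine-dependent and are not the values reported in the paper.

```python
# Sketch: measuring peak GPU memory and inference throughput.
# Assumes a CUDA device is available, and that `model` and `batch` (the
# classifier and tokenized inputs, e.g. from a setup like the one above)
# have already been moved to the GPU with .cuda() / .to("cuda").
import time
import torch

def measure(model, batch, n_iters=20):
    model.eval()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_iters):
            model(**batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    reports_per_sec = n_iters * batch["input_ids"].shape[0] / elapsed
    return peak_gb, reports_per_sec
```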
5. Technical Issue: Fix “Error! Reference source not found.” and ensure the table is referenced correctly.
It has been fixed.
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript addresses a question of interest within the NLP community: how to select the appropriate language model architecture and training strategy for specialized healthcare tasks. The authors' comparison of SLMs versus LLMs, fine-tuning versus zero-shot prompting, and general versus domain-specific pre-training is an interesting one. The choice of three distinct clinical classification scenarios with varying difficulty and data availability provides an empirical foundation for their conclusions.
However, the paper in its current form suffers from several significant issues that undermine its potential impact and scientific validity. The most critical flaw is the complete lack of reproducibility, which renders the central results in Table 1 unverifiable. Furthermore, the manuscript lacks crucial details about the experimental setup and data, and contains clear signs of insufficient proofreading, which raises concerns about the overall value of the work.
Major Issues
Complete Lack of Reproducibility:
This is the most pressing concern. The authors state that the sensitive nature of the BCCR patient data prevents public release and suggest contacting them directly for access. This is an inadequate solution for scientific verification.
Code Unavailability: The argument that data sensitivity precludes code sharing is not convincing. The code for data preprocessing, model fine-tuning, prompting strategies, and evaluation metric calculation is independent of the sensitive data itself. Releasing the code would allow other researchers to understand the exact methodology and apply it to their own (or public) datasets. Without a public repository (e.g., on GitHub), the work is a "black box."
Lack of Validation on Public Data:
Given the private nature of the BCCR dataset, it is imperative that the authors validate their central claims on a publicly available clinical dataset. This would demonstrate that their conclusions—such as fine-tuned SLMs outperforming zero-shot LLMs—are generalizable and not an artifact of their specific dataset.
Synthetic Data as an Alternative:
As the authors cannot share their data, they should generate and provide a small, structurally similar, synthetic dataset. This would allow the community to run the provided code, verify that the experimental pipeline works, and understand the nature of the data being modeled without compromising patient privacy.
Impact on Results:
Without a path to reproducibility, Table 1, which contains the core findings of the paper, is essentially a list of unsubstantiated claims. The scientific method relies on verification, and in its current state, this work cannot be verified.
Insufficient Experimental Detail and Hypothesis Formulation:
The introduction and methods sections lack the necessary detail for a reader to fully grasp the experimental design.
Vague Task Descriptions:
The authors describe the scenarios in terms of difficulty (easy, medium, hard) and data size, but the nature of the classification tasks remains opaque. Understanding the classes is crucial for interpreting the results. Is the difficulty due to subtle linguistic differences, severe class imbalance, or long-tail distributions? This context is missing.
Missing Hypothesis:
The introduction effectively poses several research questions but fails to formulate a clear, testable hypothesis before presenting the experiments.
Evidence of Careless Preparation: The manuscript contains errors that suggest a rushed submission process and a lack of thorough proofreading. One example is the sentence: “We provide detailed results in Error! Reference source not found.” This placeholder error is unacceptable in a final manuscript and significantly erodes the reader's confidence in the quality and accuracy of the research presented.
Implications: What are the practical implications of these findings for hospitals, research institutions, and clinical NLP practitioners?
Future Work: What are the logical next steps for this research?
Author Response
Thank you for the detailed review. We really appreciate it. We acknowledge the reviewer's main concern regarding the lack of reproducibility. Given that the paper is submitted as a brief report, not a full research article, we want to keep the paper short. We agree that it is important to have data and code available for reproducibility, but, as we have mentioned, we use sensitive data with private patient information that cannot be made public. As for the code, our experiments are simple enough that they can be replicated using generic code available online. However, to ensure that any researcher who wants our code or data can access it after completing our privacy assessment, we have provided contact information in two places in the paper (at the top, with our names, and at the end, in the data and code availability section).
Synthetic data: Providing a small, curated synthetic dataset that preserves the properties of the real data would, in our case, be a significant undertaking and is out of scope for this brief report.
Task descriptions: The papers cited in the respective sections have complete details on the tasks and classes.
Table reference: Thank you for pointing out the reference error; it occurred when we were converting the manuscript from LaTeX to Word. We have fixed it.
Practical implications: Section 2.3 explicitly addresses this question.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have adequately addressed the concerns I raised in my review report.
Author Response
Thank you for your time and for helping us improve the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript addresses a practical question in clinical NLP: how to select between small and large language models, zero-shot prompting, fine-tuning, and domain-specific pretraining for healthcare tasks. The revised version is clearer than the original submission, with improved task descriptions and corrected references.
That said, I continue to have reservations about the scientific soundness of the work, primarily due to the absence of fully reproducible resources. However, given the scope of this brief report, I will not press this issue further.
Author Response
Thank you for your time and for helping us improve the manuscript.