Optimizing a Multimodal Large Language Model for Ultrasound-Based Thyroid Nodule Malignancy Classification: A Comparative Study of Few-Shot Learning, Prompt Engineering, and Fine-Tuning
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design
2.2. Datasets
2.2.1. Evaluation Datasets
- DDTI: For model evaluation, we used a curated subset of the publicly available DDTI “https://www.kaggle.com/datasets/eiraoi/thyroidultrasound (accessed on 20 January 2025)” [20]. The reference label was the benign/malignant designation provided in the original dataset. Two board-certified endocrinologists independently selected images from the original 347 images based on predefined quality criteria: (i) grayscale B-mode thyroid ultrasound with adequate resolution; (ii) a clearly visible dominant nodule with margins and internal echotexture not obscured by shadowing, motion, or major artifacts; (iii) minimal burned-in annotations or calipers that could conceal key features; and (iv) a clear benign/malignant label provided by the dataset. Of the 347 original images, 125 were excluded as duplicate views of the same nodule, 78 for shadowing, motion, or major artifacts obscuring the nodule, 13 for burned-in calipers or annotations concealing key features, and 51 for the absence of a clear benign/malignant label. Exclusions targeted image quality and labeling adequacy rather than diagnostic difficulty. Only images meeting all criteria and with reviewer consensus were included, yielding 80 images (28 malignant, 52 benign).
- TN5000 Test Split: To evaluate generalizability under a higher malignancy prevalence, we used the publicly described 1000-image test split of the TN5000 thyroid ultrasound dataset (731 malignant, 269 benign; malignancy prevalence 73.1%). TN5000 is a large grayscale B-mode thyroid ultrasound collection in which benign and malignant labels were assigned based on fine-needle aspiration and surgical pathology. The standard 1000-image test split was used as released, without additional curation, to provide an independent external test set whose case mix differs markedly from DDTI and the AUS cohort. Malignant nodules were treated as the positive class [19].
- AUS Dataset: To assess performance on a clinically challenging cohort, a private dataset of 84 thyroid ultrasound images was retrospectively collected from Taichung Veterans General Hospital. These images were from patients with at least two AUS cytology results on fine-needle aspiration (FNA) who subsequently underwent surgical resection for definitive histopathological diagnosis, yielding 43 malignant and 41 benign nodules. An experienced endocrinologist confirmed the image quality prior to inclusion. This dataset was also used for the human expert comparator arm. The endocrinologist consensus and ATA comparators were applied only to the AUS cohort with definitive surgical histopathology; this represents the targeted clinical challenge. The DDTI and TN5000 datasets were used for external technical validation of the optimization strategies.
2.2.2. Development Dataset (For Fine-Tuning and Few-Shot Examples)
- PubMed Central Open Access (PMC-OA) Dataset: A distinct set of 60 de-identified thyroid ultrasound images was curated from the PMC-OA repository (available at https://huggingface.co/datasets/axiong/pmc_oa, access on 23 January 2025) [21] to serve as the source for fine-tuning data and few-shot learning examples. Images were selected based on clear nodule depiction and caption-derived diagnostic information, validated by two endocrinologists with 7 and 15 years of experience in thyroid ultrasonography. These were the same two board-certified endocrinologists who curated the DDTI subset in Section 2.2.1; the two datasets were annotated in separate sessions to limit fatigue. These endocrinologists provided structured annotations based on ATA guidelines [1], including binary benign/malignant labels. The development set was divided into two independent subsets: a 50-image fine-tuning set (40 training and 10 validation images) and a separate 10-image few-shot set. Across the fine-tuning set, there were 21 malignant and 29 benign images (training: 26 benign, 14 malignant; validation: 3 benign, 7 malignant), and the 10 few-shot exemplars comprised 5 malignant and 5 benign images. The same 40-image training set and 10-image validation set were used to fine-tune both GPT-4o and Gemini 2.5 Flash-Lite. The few-shot, fine-tuning training, and fine-tuning validation subsets were mutually exclusive, and all were independent of the DDTI, TN5000, and AUS evaluation sets.
2.2.3. Optimization Strategies
- Text Prompting: The prompt directed each model to perform a structured analysis based on key elements of the ATA ultrasound risk stratification guidelines [1]. The model was instructed to (i) describe key sonographic features (composition, echogenicity, shape, margins, and echogenic foci), (ii) map these features to ATA risk patterns, and (iii) output a final binary classification (benign vs. malignant) along with the corresponding ATA suspicion category (high/intermediate/low/very low/benign). During text prompting, no images from the PMC-OA development set were used; it relied solely on the ATA-anchored system prompt. Requesting structured ATA features provides an ordered suspicion score for ROC analysis and renders the model’s reasoning interpretable, which is relevant for clinical transparency even though the primary endpoint is binary classification.
- Few-Shot Learning: For few-shot learning, we used 10 examples (5 malignant, 5 benign) from the PMC-OA development dataset, accompanied by their labels and concise feature descriptions. These 10 examples were selected by consensus of two endocrinologists as images that (i) displayed textbook ATA sonographic features for their category, (ii) had unambiguous benign/malignant labels supported by the source caption, and (iii) were of high quality without obscuring artifacts. Images with mixed or borderline features were not used as exemplars, so that each example presented a clear prototype of its class. The two endocrinologists reviewed each candidate image independently; because selection was consensus-based rather than independent parallel scoring, a formal inter-rater agreement coefficient (e.g., Cohen’s kappa) was not calculated. The few infrequent disagreements, which concerned borderline images, were resolved by discussion, and any image on which consensus could not be reached was excluded rather than assigned a forced label.
- Fine-Tuning: From the PMC-OA development dataset, 40 images (distinct from the few-shot exemplars) were allocated for fine-tuning, with a further 10 images held out for validation. These strategies use different numbers of labeled development images by design—text prompting uses none, few-shot learning uses 10 in-context exemplars, and fine-tuning uses 40 training images and 10 images for validation—so the comparison reflects each strategy as it is realistically deployed under a single fixed development budget rather than a controlled equal-N comparison. Images were paired with expert annotations in JSONL format. For GPT-4o, fine-tuning was executed via the OpenAI API using the default learning rate multiplier of 1.0 with 3 epochs and a batch size of 1, chosen to limit overfitting risk with the small training set based on preliminary testing. For Gemini 2.5 Flash-Lite, supervised tuning was carried out using the same 40 annotated images. The base model Gemini-2.5 Flash-Lite was used, with 3 epochs, a learning-rate multiplier of 0.5, and an ADAPTER_SIZE_ONE adapter size, on a training set of 40 images (26 benign, 14 malignant) and a validation set of 10 images (3 benign, 7 malignant). The same 40-image training set and 10-image validation set were used for both GPT-4o and Gemini 2.5 Flash-Lite. The validation subset was weighted toward malignant cases relative to the training set and was not rebalanced owing to the limited size of the development set. The OpenAI fine-tuning interface exposes only step-level training loss and periodic validation loss for GPT-4o, which are shown together with the Gemini 2.5 Flash-Lite training and validation loss in Supplementary Figure S1. Training and validation loss curves were monitored for convergence and signs of overfitting, and JSONL files underwent programmatic validation for structural integrity.
2.3. Human Expert Comparator
2.4. ATA Guideline Comparator
2.5. Statistical Analysis
3. Results
3.1. Cohort Characteristics
3.2. Diagnostic Performance on the DDTI Evaluation Set
3.3. Diagnostic Performance on the TN5000 Test Split
3.4. Diagnostic Performance on the AUS Cohort
3.5. Performance of Gemini 2.5 Flash-Lite Across Datasets
3.6. Comparison with Human Experts and Clinical Guidelines on the AUS Cohort
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| ΔAUC | difference in area under the receiver operating characteristic curve |
| ATA | American Thyroid Association |
| AUC | area under the receiver operating characteristic curve |
| AUS | atypia of undetermined significance |
| CI | confidence interval |
| DDTI | Digital Database of Thyroid Images |
| FN | false negative |
| FNA | fine-needle aspiration |
| FP | false positive |
| MLLM | multimodal large language model |
| OR | odds ratio |
| PMC OA | PubMed Central Open Access |
| TIRADS | Thyroid Imaging Reporting and Data System |
| TN | true negative |
| TP | true positive |
References
- Haugen, B.R.; Alexander, E.K.; Bible, K.C.; Doherty, G.M.; Mandel, S.J.; Nikiforov, Y.E.; Pacini, F.; Randolph, G.W.; Sawka, A.M.; Schlumberger, M.; et al. 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer: The American Thyroid Association Guidelines Task Force on Thyroid Nodules and Differentiated Thyroid Cancer. Thyroid 2016, 26, 1–133. [Google Scholar] [CrossRef] [PubMed]
- Grani, G.; Lamartina, L.; Ascoli, V.; Bosco, D.; Biffoni, M.; Giacomelli, L.; Maranghi, M.; Falcone, R.; Ramundo, V.; Cantisani, V.; et al. Reducing the Number of Unnecessary Thyroid Biopsies While Improving Di agnostic Accuracy: Toward the “Right” TIRADS. J. Clin. Endocrinol. Metab. 2019, 104, 95–102. [Google Scholar] [CrossRef] [PubMed]
- Arosemena, M.; Thekkumkattil, A.; Valderrama, M.L.; Kuker, R.; Castillo, R.P.; Sidani, C.; Gonzalez, M.L.; Casula, S.; Kargi, A.Y. American Thyroid Association Sonographic Risk and Afirma Gene Expressi on Classifier Alone and in Combination for the Diagnosis of Thyroid No dules with Bethesda Category III Cytology. Thyroid 2020, 30, 1613–1619. [Google Scholar] [CrossRef] [PubMed]
- Kim, D.H.; Kim, S.W.; Basurrah, M.A.; Lee, J.; Hwang, S.H. Diagnostic Performance of Six Ultrasound Risk Stratification Systems f or Thyroid Nodules: A Systematic Review and Network Meta-Analysis. AJR Am. J. Roentgenol. 2023, 220, 791–803. [Google Scholar] [CrossRef] [PubMed]
- Tessler, F.N.; Middleton, W.D.; Grant, E.G.; Hoang, J.K.; Berland, L.L.; Teefey, S.A.; Cronan, J.J.; Beland, M.D.; Desser, T.S.; Frates, M.C.; et al. ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White Paper of the ACR TI-RADS Committee. J. Am. Coll. Radiol. 2017, 14, 587–595. [Google Scholar] [CrossRef] [PubMed]
- Ali, S.Z.; Baloch, Z.W.; Cochand-Priollet, B.; Schmitt, F.C.; Vielh, P.; VanderLaan, P.A. The 2023 Bethesda System for Reporting Thyroid Cytopathology. Thyroid 2023, 33, 1039–1044. [Google Scholar] [CrossRef] [PubMed]
- Vaccarella, S.; Franceschi, S.; Bray, F.; Wild, C.P.; Plummer, M.; Dal Maso, L. Worldwide Thyroid-Cancer Epidemic? The Increasing Impact of Overdiagno sis. N. Engl. J. Med. 2016, 375, 614–617. [Google Scholar] [CrossRef] [PubMed]
- Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural net works. Nature 2017, 542, 115–118. [Google Scholar] [CrossRef] [PubMed]
- Topol, E.J. High-performance medicine: The convergence of human and artificial int elligence. Nat. Med. 2019, 25, 44–56. [Google Scholar] [CrossRef] [PubMed]
- Moor, M.; Banerjee, O.; Abad, Z.S.H.; Krumholz, H.M.; Leskovec, J.; Topol, E.J.; Rajpurkar, P. Foundation models for generalist medical artificial intelligence. Nature 2023, 616, 259–265. [Google Scholar] [CrossRef] [PubMed]
- Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A survey on multimodal large language models. Natl. Sci. Rev. 2024, 11, nwae403. [Google Scholar] [CrossRef] [PubMed]
- Ferber, D.; Wölflein, G.; Wiest, I.C.; Ligero, M.; Sainath, S.; Ghaffari Laleh, N.; El Nahhas, O.S.M.; Müller-Franzes, G.; Jäger, D.; Truhn, D.; et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nat. Commun. 2024, 15, 10104. [Google Scholar] [CrossRef] [PubMed]
- Zhou, Y.; Ong, H.; Kennedy, P.; Wu, C.C.; Kazam, J.; Hentel, K.; Flanders, A.; Shih, G.; Peng, Y. Evaluating GPT-4V (GPT-4 with Vision) on Detection of Radiologic Findings on Chest Radiographs. Radiology 2024, 311, e233270. [Google Scholar] [CrossRef] [PubMed]
- Chen, Z.; Chambara, N.; Wu, C.; Lo, X.; Liu, S.Y.W.; Gunda, S.T.; Han, X.; Qu, J.; Chen, F.; Ying, M.T.C. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine 2025, 87, 1041–1049. [Google Scholar] [CrossRef] [PubMed]
- Payne, D.L.; Purohit, K.; Borrero, W.M.; Chung, K.; Hao, M.; Mpoy, M.; Jin, M.; Prasanna, P.; Hill, V. Performance of GPT-4 on the American College of Radiology In-training Examination: Evaluating Accuracy, Model Drift, and Fine-tuning. Acad. Radiol. 2024, 31, 3046–3054. [Google Scholar] [CrossRef] [PubMed]
- Aliyeva, A.; Nevzati, E.; Grassia, F.; Riaz, M.; Mann, S.; Nasirov, R. AI at the Sella Turcica: Multi-Model Large Language Model Evaluation in Pituitary Adenomas. Brain Spine 2026, 6, 105997. [Google Scholar] [CrossRef] [PubMed]
- Aliyeva, A.; Muradova, A.; Hashimli, R.; Müderris, T. Multi-model Artificial Intelligence Evaluation in Sudden Sensorineural Hearing Loss. Otolaryngol. Head Neck Surg. 2026, 174, 980–988. [Google Scholar] [CrossRef] [PubMed]
- Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.-M.; Chen, W.; et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mach. Intell. 2023, 5, 220–235. [Google Scholar] [CrossRef]
- Zhang, H.; Liu, Q.; Han, X.; Niu, L.; Sun, W. TN5000: An Ultrasound Image Dataset for Thyroid Nodule Detection and Classification. Sci. Data 2025, 12, 1431. [Google Scholar] [CrossRef] [PubMed]
- Pedraza, L.; Vargas, C.; Narváez, F.; Durán, O.; Muñoz, E.; Romero, E. An open access thyroid ultrasound image database. In 10th International Symposium on Medical Information Processing and Analysis, 2015; SPIE: Bellingham, WA, USA, 2015; pp. 188–193. [Google Scholar] [CrossRef]
- Lin, W.; Zhao, Z.; Zhang, X.; Wu, C.; Zhang, Y.; Wang, Y.; Xie, W. PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2023; pp. 525–536. [Google Scholar] [CrossRef]
- Fagerland, M.W.; Lydersen, S.; Laake, P. The McNemar test for binary matched-pairs data: Mid-p and asymptotic are better than exact conditional. BMC Med. Res. Methodol. 2013, 13, 91. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Hu, G. Boosting GPT-4V’s accuracy in dermoscopic classification with few-shot learning. Comment on “can ChatGPT vision diagnose melanoma? An explor atory diagnostic accuracy study”. J. Am. Acad. Dermatol. 2024, 91, e165–e166. [Google Scholar] [CrossRef] [PubMed]
- Schramm, S.; Preis, S.; Metz, M.-C.; Jung, K.; Schmitz-Koep, B.; Zimmer, C.; Wiestler, B.; Hedderich, D.M.; Kim, S.H. Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases. Radiology 2025, 314, e240689. [Google Scholar] [CrossRef] [PubMed]
- Vaira, L.A.; Lechien, J.R.; Maniaci, A.; Tanda, G.; Abbate, V.; Allevi, F.; Arena, A.; Bonali, M.; Carraturo, E.; Fakhry, N.; et al. Validation of the Quality Analysis of Medical Artificial Intelligence (QAMAI) tool: A new tool to assess the quality of health information provided by AI platforms. Eur. Arch. Otorhinolaryngol. 2024, 281, 6123–6131. [Google Scholar] [CrossRef] [PubMed]
- Lechien, J.R.; Maniaci, A.; Gengler, I.; Hans, S.; Chiesa-Estomba, C.M.; Vaira, L.A. Validity and reliability of an instrument evaluating the performance of intelligent chatbot: The Artificial Intelligence Performance Instrument (AIPI). Eur. Arch. Otorhinolaryngol. 2024, 281, 2063–2079. [Google Scholar] [CrossRef] [PubMed]




| Dataset | Intended Use | N (Total) | Malignant, n (%) | Benign, n (%) |
|---|---|---|---|---|
| DDTI a | External evaluation | 80 | 28 (35.0) | 52 (65.0) |
| TN5000 b | External evaluation (high prevalence) | 1000 | 731 (73.1) | 269 (26.9) |
| AUS c | Clinical evaluation | 84 | 43 (51.2) | 41 (48.8) |
| PMC-OA d | Fine-tuning & few-shot examples (development only) | 60 | 26 (43.3) | 34 (56.7) |
| Model/Strategy | AUC (95% CI) | Sens % (95% CI) | Spec % (95% CI) | Accuracy % (95% CI; p *) | Precision % (95% CI) | F1 (95% CI) | ΔAUC vs. Baseline | ΔAUC vs. Few-Shot | ΔAUC vs. Fine-Tuning |
|---|---|---|---|---|---|---|---|---|---|
| Text prompting | 0.704 (0.594–0.807) | 67.9 (50.0–84.4) | 73.1 (60.8–84.6) | 71.3 (60.5–80.0; p = 0.003) | 57.6 (40.7–75.0) | 0.623 (0.459–0.754) | — | — | — |
| Few-shot | 0.647 (0.543–0.750) | 42.9 (24.1–61.9) | 86.5 (76.4–94.9) | 71.3 (60.5–80.0; p < 0.001) | 63.2 (39.1–84.2) | 0.511 (0.308–0.667) | −8.1% (p = 0.281) | — | — |
| Fine-tuning | 0.791 (0.690–0.885) | 71.4 (53.8–87.9) | 86.5 (76.6–95.6) | 81.3 (71.3–88.3; p = 0.096) | 74.1 (56.2–90.6) | 0.727 (0.571–0.849) | +12.4% (p = 0.183) | +22.3% (p = 0.012) | — |
| Hybrid (fine-tuning + few-shot) | 0.866 (0.774–0.946) | 75.0 (56.7–90.6) | 98.1 (93.8–99.8) | 90.0 (81.5–94.8; ref) | 95.5 (85.0–100) | 0.840 (0.706–0.938) | +23.0% (p = 0.015) | +33.9% (p < 0.001) | +9.5% (p = 0.211) |
| Model/Strategy | AUC (95% CI) | Sens % (95% CI) | Spec % (95% CI) | Accuracy % (95% CI; p *) | Precision % (95% CI) | F1 (95% CI) | ΔAUC vs. Baseline | ΔAUC vs. Few-Shot | ΔAUC vs. Fine-Tuning |
|---|---|---|---|---|---|---|---|---|---|
| Text prompting | 0.770 (0.659–0.869) | 100.0 (87.9–100.0) | 21.2 (12.2–34.0) | 48.8 (38.1–59.5; p = 0.332) | 40.6 (29.8–52.4) | 0.577 (0.452–0.686) | — | — | — |
| Few-shot | 0.702 (0.586–0.808) | 100.0 (87.9–100.0) | 28.8 (18.3–42.3) | 53.8 (42.9–64.3; p = 1.000) | 43.1 (31.8–55.2) | 0.602 (0.476–0.708) | −8.8% (p = 0.278) | — | — |
| Fine-tuning | 0.726 (0.615–0.829) | 100.0 (87.9–100.0) | 23.1 (13.7–36.1) | 50.0 (39.3–60.7; p = 0.481) | 41.2 (30.3–53.0) | 0.583 (0.460–0.691) | −5.7% (p = 0.454) | +3.4% (p = 0.696) | — |
| Hybrid (fine-tuning + few-shot) | 0.754 (0.643–0.852) | 96.4 (82.3–99.4) | 30.8 (19.9–44.3) | 53.8 (42.9–64.3; ref) | 42.9 (31.4–55.1) | 0.593 (0.463–0.704) | −2.1% (p = 0.813) | +7.4% (p = 0.396) | +3.9% (p = 0.637) |
| Model/Strategy | AUC (95% CI) | Sens % (95% CI) | Spec % (95% CI) | Accuracy % (95% CI; p *) | Precision % (95% CI) | F1 (95% CI) | ΔAUC vs. Baseline | ΔAUC vs. Few-Shot | ΔAUC vs. Fine-Tuning |
|---|---|---|---|---|---|---|---|---|---|
| Text prompting | 0.573 (0.536–0.607) | 28.5 (25.3–31.8) | 89.2 (84.9–92.4) | 44.8 (41.8–47.9; p < 0.001) | 87.8 (82.9–91.3) | 0.429 (0.390–0.468) | — | — | — |
| Few-shot | 0.613 (0.574–0.651) | 69.5 (66.1–72.7) | 52.8 (46.8–58.7) | 65.0 (62.0–67.9; p = 0.229) | 80.0 (76.7–82.9) | 0.744 (0.719–0.768) | +7.0% (p = 0.076) | — | — |
| Fine-tuning | 0.615 (0.576–0.655) | 74.1 (70.9–77.3) | 48.0 (41.9–53.7) | 67.1 (64.1–69.9; ref) | 79.5 (76.2–82.3) | 0.767 (0.743–0.793) | +7.3% (p = 0.092) | +0.3% (p = 0.955) | — |
| Hybrid (fine-tuning + few-shot) | 0.689 (0.653–0.724) | 68.5 (64.8–71.9) | 63.9 (58.0–69.8) | 67.3 (64.2–70.2; p = 0.798) | 83.8 (80.3–86.5) | 0.753 (0.726–0.778) | +20.2% (p < 0.001) | +12.4% (p < 0.001) | +12.0% (p < 0.001) |
| Model/Strategy | AUC (95% CI) | Sens % (95% CI) | Spec % (95% CI) | Accuracy % (95% CI; p *) | Precision % (95% CI) | F1 (95% CI) | ΔAUC vs. Baseline | ΔAUC vs. Few-Shot | ΔAUC vs. Fine-Tuning |
|---|---|---|---|---|---|---|---|---|---|
| Text prompting | 0.653 (0.616–0.690) | 98.1 (96.8–98.9) | 4.8 (2.8–8.1) | 73.0 (70.2–75.7; p = 0.009) | 73.7 (70.8–76.4) | 0.842 (0.822–0.860) | — | — | — |
| Few-shot | 0.628 (0.591–0.667) | 94.1 (92.2–95.6) | 17.1 (13.1–22.1) | 73.4 (70.5–76.0; p = 0.019) | 75.5 (72.6–78.2) | 0.838 (0.818–0.856) | −3.8% (p = 0.288) | — | — |
| Fine-tuning | 0.619 (0.584–0.656) | 98.8 (97.7–99.4) | 5.2 (3.1–8.5) | 73.6 (70.8–76.2; p = 0.050) | 73.9 (71.1–76.6) | 0.845 (0.826–0.863) | −5.2% (p = 0.075) | −1.4% (p = 0.695) | — |
| Hybrid (fine-tuning + few-shot) | 0.658 (0.621–0.698) | 96.6 (95.0–97.7) | 17.5 (13.4–22.5) | 75.3 (72.5–77.9; ref) | 76.1 (73.2–78.7) | 0.851 (0.832–0.870) | +0.8% (p = 0.800) | +4.8% (p = 0.146) | +6.3% (p = 0.072) |
| Model/Strategy | AUC (95% CI) | Sens % (95% CI) | Spec % (95% CI) | Accuracy % (95% CI; p *) | Precision % (95% CI) | F1 (95% CI) | ΔAUC vs. Baseline | ΔAUC vs. Few-Shot | ΔAUC vs. Fine-Tuning |
|---|---|---|---|---|---|---|---|---|---|
| Text prompting | 0.668 (0.563–0.768) | 58.1 (42.5–73.2) | 75.6 (61.1–88.6) | 66.7 (56.1–75.8; p < 0.001) | 71.4 (55.3–86.7) | 0.641 (0.507–0.756) | — | — | — |
| Few-shot | 0.629 (0.523–0.734) | 72.1 (57.8–85.7) | 53.7 (37.8–69.4) | 63.1 (52.4–72.6; p < 0.001) | 62.0 (47.9–75.0) | 0.667 (0.548–0.772) | −5.8% (p = 0.481) | — | — |
| Fine-tuning | 0.717 (0.618–0.812) | 60.5 (45.0–75.6) | 82.9 (70.0–94.3) | 71.4 (61.0–80.0; p = 0.019) | 78.8 (64.3–92.3) | 0.684 (0.556–0.789) | +7.3% (p = 0.362) | +13.9% (p = 0.157) | — |
| Hybrid (fine-tuning + few-shot) | 0.836 (0.756–0.911) | 72.1 (57.9–85.4) | 95.1 (87.2–99.0) | 83.3 (73.9–89.8; ref) | 93.9 (85.2–100) | 0.816 (0.708–0.902) | +25.1% (p < 0.001) | +32.9% (p < 0.001) | +16.6% (p = 0.014) |
| Model/Strategy | AUC (95% CI) | Sens % (95% CI) | Spec % (95% CI) | Accuracy % (95% CI; p *) | Precision % (95% CI) | F1 (95% CI) | ΔAUC vs. Baseline | ΔAUC vs. Few-Shot | ΔAUC vs. Fine-Tuning |
|---|---|---|---|---|---|---|---|---|---|
| Text prompting | 0.617 (0.499–0.729) | 97.7 (87.9–99.6) | 7.3 (2.5–19.4) | 53.6 (43.0–63.8; p = 0.754) | 52.5 (41.7–63.1) | 0.683 (0.579–0.773) | — | — | — |
| Few-shot | 0.555 (0.429–0.677) | 88.4 (75.5–94.9) | 9.8 (3.9–22.5) | 50.0 (39.5–60.5; p = 0.508) | 50.7 (39.6–61.7) | 0.644 (0.532–0.740) | −10.0% (p = 0.304) | — | — |
| Fine-tuning | 0.567 (0.448–0.682) | 97.7 (87.9–99.6) | 7.3 (2.5–19.4) | 53.6 (43.0–63.8; p = 0.754) | 52.5 (41.7–63.1) | 0.683 (0.584–0.773) | −8.1% (p = 0.368) | +2.2% (p = 0.818) | — |
| Hybrid (fine-tuning + few-shot) | 0.489 (0.366–0.610) | 90.7 (78.4–96.3) | 12.2 (5.3–25.5) | 52.4 (41.8–62.7; ref) | 52.0 (40.9–62.9) | 0.661 (0.559–0.754) | −20.7% (p = 0.040) | −11.9% (p = 0.315) | −13.8% (p = 0.141) |
| Model/Reader Consensus | AUC (95% CI) | Sens % (95% CI) | Spec % (95% CI) | ΔAUC vs. Hybrid (p) | Δ Sens vs. Hybrid (p) | Δ Spec vs. Hybrid (p) |
|---|---|---|---|---|---|---|
| Hybrid (fine-tuning + few-shot) | 0.836 (0.756–0.911) | 72.1 (57.9–85.4) | 95.1 (87.2–99.0) | — | — | — |
| Endocrinologist consensus | 0.722 (0.623–0.822) | 74.4 (60.5–87.2) | 70.7 (55.6–84.4) | −13.6% (p = 0.072) | +3.2% (p = 0.999) | −25.6% (p = 0.002) |
| ATA—high suspicion = malignant | 0.749 (0.655–0.842) | 79.1 (64.0–90.0) | 70.7 (54.5–83.9) | −10.3% (p = 0.123) | +9.7% (p = 0.607) | −25.6% (p = 0.001) |
| ATA—high + intermediate = malignant | 0.699 (0.599–0.792) | 83.7 (70.0–91.9) | 56.1 (41.0–70.1) | −16.4% (p = 0.017) | +16.1% (p = 0.180) | −41.0% (p < 0.001) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Li, Y.-H.; Cheng, Y.-C.; Chang, C.-Y.; Lee, I.-T. Optimizing a Multimodal Large Language Model for Ultrasound-Based Thyroid Nodule Malignancy Classification: A Comparative Study of Few-Shot Learning, Prompt Engineering, and Fine-Tuning. Diagnostics 2026, 16, 1931. https://doi.org/10.3390/diagnostics16121931
Li Y-H, Cheng Y-C, Chang C-Y, Lee I-T. Optimizing a Multimodal Large Language Model for Ultrasound-Based Thyroid Nodule Malignancy Classification: A Comparative Study of Few-Shot Learning, Prompt Engineering, and Fine-Tuning. Diagnostics. 2026; 16(12):1931. https://doi.org/10.3390/diagnostics16121931
Chicago/Turabian StyleLi, Yu-Hsuan, Yu-Cheng Cheng, Chih-Yun Chang, and I-Te Lee. 2026. "Optimizing a Multimodal Large Language Model for Ultrasound-Based Thyroid Nodule Malignancy Classification: A Comparative Study of Few-Shot Learning, Prompt Engineering, and Fine-Tuning" Diagnostics 16, no. 12: 1931. https://doi.org/10.3390/diagnostics16121931
APA StyleLi, Y.-H., Cheng, Y.-C., Chang, C.-Y., & Lee, I.-T. (2026). Optimizing a Multimodal Large Language Model for Ultrasound-Based Thyroid Nodule Malignancy Classification: A Comparative Study of Few-Shot Learning, Prompt Engineering, and Fine-Tuning. Diagnostics, 16(12), 1931. https://doi.org/10.3390/diagnostics16121931

