Comparative Analysis of General-Purpose vs. Domain-Specific Multimodal Models for Diabetic Retinopathy Classification
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset Description
2.2. Experimental Setup
2.3. Zero-Shot and Few-Shot Prompting
2.4. Fine-Tuning of Domain-Specific Models
2.5. Data Augmentation
2.6. Linear Probing
2.7. Evaluation Metrics
3. Results
3.1. Performance of General-Purpose Multimodal Models
3.2. Fine-Tuning Performance of Domain-Specific Models
3.3. Linear Probing Performance
3.4. Statistical Comparisons
3.5. External Validation on DeepDRiD
4. Discussion
4.1. Comparison with Prior In-Context Learning Approaches
4.2. Comparison with RETFound: Vision-Only Domain-Specific Foundation Models
4.3. Comparison with EyeCLIP: Domain-Specific Vision–Language Models
4.4. Explainability and Clinical Implications
4.5. Limitations and Future Work
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Model Name(s) | Model Description | Modality | Knowledge Cutoff |
|---|---|---|---|
| GPT-5.2-2025-12-11 | The GPT-5.2 generation provides state-of-the-art reasoning and multimodal processing for text and image inputs. | Conversational, Image + Text | 31 August 2025 |
| Gemini-3-Flash-Preview | Gemini 3 Flash preview model optimized for fast inference and deployment. | Conversational, Image + Text | January 2025 |
| Pixtral-Large-2411 | Large multimodal model integrating vision and language understanding. | Conversational, Image + Text | Not available |
| MedGemma-1.5-4B-IT | Google’s medical multimodal model specialized for clinical reasoning and instruction tuning. | Conversational, Image + Text | Not available |
| MedSigLIP v1.0.0 (ViT-So400m/14) | Google’s vision model designed for broad medical image analysis. | Image + Text | Not available |
| EyeCLIP (ViT-L/14, ViT-L/14@336px) | Model trained specifically on ophthalmic (fundus) imaging tasks. | Image + Text | Not available |
| RETFound_mae_natureCFP (ViT-L/16) | ViT-based model pretrained on retinal imaging datasets. | Image-only | Not available |
| ViT-L/16 | Standard Vision Transformer baseline without domain specialization. | Image-only | Not available |

| Model | Input Resolution | Normalization | Preprocessing |
|---|---|---|---|
| RETFound | 224 × 224 | ImageNet (mean = [0.485, 0.456, 0.406]; std = [0.229, 0.224, 0.225]) | Bicubic resize to 224 × 224 |
| EyeCLIP | 224 × 224 | CLIP-specific (mean = [0.4815, 0.4578, 0.4082]; std = [0.2686, 0.2613, 0.2758]) | Native CLIP preprocessing |
| MedSigLIP | 448 × 448 | Native model processor | Resizing disabled; native processor |
| ViT-L/16 | 224 × 224 | Native model processor | Resizing disabled; native processor |
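
The preprocessing settings in the table above can be written down as a small configuration map. The following is a minimal sketch, not the authors' pipeline: `PreprocessConfig`, `CONFIGS`, and `normalize_pixel` are hypothetical names, and scaling pixel values to [0, 1] before subtracting the mean is an assumption (the usual convention for ImageNet- and CLIP-style statistics). For MedSigLIP and ViT-L/16 the table only says that the native model processor is used, so no statistics are filled in for them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Triple = Tuple[float, float, float]


@dataclass(frozen=True)
class PreprocessConfig:
    """Input settings for one model, copied from the preprocessing table."""
    size: int                # square input resolution in pixels
    mean: Optional[Triple]   # per-channel mean, None if left to the native processor
    std: Optional[Triple]    # per-channel std, None if left to the native processor


# Values taken from the table; the dictionary layout itself is illustrative.
CONFIGS = {
    "RETFound": PreprocessConfig(224, (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    "EyeCLIP": PreprocessConfig(224, (0.4815, 0.4578, 0.4082), (0.2686, 0.2613, 0.2758)),
    "MedSigLIP": PreprocessConfig(448, None, None),
    "ViT-L/16": PreprocessConfig(224, None, None),
}


def normalize_pixel(rgb_uint8: Sequence[int], cfg: PreprocessConfig) -> Triple:
    """Normalize one 8-bit RGB pixel as (value / 255 - mean) / std per channel.

    Assumption: pixels are scaled to [0, 1] before the statistics are applied.
    """
    if cfg.mean is None or cfg.std is None:
        raise ValueError("normalization is delegated to the model's native processor")
    r, g, b = (
        (value / 255.0 - m) / s
        for value, m, s in zip(rgb_uint8, cfg.mean, cfg.std)
    )
    return (r, g, b)
```

For example, `normalize_pixel((255, 255, 255), CONFIGS["RETFound"])` maps a white pixel to `(1 - mean) / std` in each channel, while calling it with `CONFIGS["MedSigLIP"]` raises an error, because the table does not list explicit statistics for that model.
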

| Model | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|
| GPT-5.2 (0-shot) | 77.9 ± 0.4 [77.2, 78.6] | 67.4 ± 0.6 [66.3, 68.5] | 99.6 ± 0.3 [99.0, 100] |
| GPT-5.2 (5-shot) | 81.5 ± 0.1 [81.3, 81.7] | 73.7 ± 0.3 [73.0, 74.3] | 97.8 ± 0.3 [97.2, 98.4] |
| Gemini 3 Flash (0-shot) | 88.5 ± 0.0 [88.5, 88.5] | 84.8 ± 0.0 [84.8, 84.8] | 96.4 ± 0.0 [96.4, 96.4] |
| Gemini 3 Flash (5-shot) | 85.1 ± 0.4 [84.4, 85.8] | 80.0 ± 0.4 [79.2, 80.8] | 95.6 ± 0.3 [95.0, 96.2] |
| Pixtral-Large (0-shot) | 70.7 ± 0.7 [69.5, 72.0] | 58.0 ± 1.0 [56.1, 59.9] | 97.0 ± 0.0 [97.0, 97.0] |
| Pixtral-Large (5-shot) | 78.1 ± 1.0 [76.3, 79.9] | 74.2 ± 0.9 [72.6, 75.8] | 86.1 ± 1.2 [83.8, 88.4] |
| MedGemma-1.5-4B-IT (0-shot) | 88.2 ± 0.1 [88.0, 88.4] | 92.6 ± 0.2 [92.3, 92.9] | 79.2 ± 0.0 [79.2, 79.2] |
| MedGemma-1.5-4B-IT (5-shot) | 87.2 ± 1.1 [85.2, 89.2] | 91.2 ± 0.6 [90.1, 92.3] | 79.0 ± 2.4 [74.5, 83.4] |
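
The three columns reported above follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions, with label 1 as the positive class; which clinical grouping counts as positive is not stated in this excerpt, and the function name `binary_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity in percent for 0/1 labels.

    Label 1 is treated as the positive class (an assumption of this sketch).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def percent(numerator: int, denominator: int) -> float:
        # A class with no samples leaves the metric undefined.
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "accuracy": percent(tp + tn, len(pairs)),
        "sensitivity": percent(tp, tp + fn),  # recall on the positive class
        "specificity": percent(tn, tn + fp),  # recall on the negative class
    }
```

A model that rarely predicts the positive class can reach near-perfect specificity while missing many positive cases, which is why the three columns are best read together rather than ranked by accuracy alone.
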

| Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | ROC-AUC (%) | Avg. Precision (%) | F1 Score (%) |
|---|---|---|---|---|---|---|
| RETFound | 77.1 ± 6.2 [72.7, 81.6] | 75.3 ± 7.2 [70.1, 80.5] | 80.9 ± 8.0 [75.2, 86.6] | 86.9 ± 5.9 [82.7, 91.1] | 92.9 ± 4.4 [89.7, 96.0] | 75.7 ± 6.4 [71.1, 80.3] |
| MedSigLIP | 94.8 ± 2.4 [93.0, 96.5] | 93.1 ± 2.7 [91.1, 95.0] | 98.2 ± 2.9 [96.1, 100] | 98.1 ± 2.2 [96.6, 99.7] | 99.2 ± 0.8 [98.7, 99.8] | 96.0 ± 1.9 [94.6, 97.4] |
| EyeCLIP | 85.8 ± 3.9 [83.1, 88.6] | 81.3 ± 5.2 [77.6, 85.1] | 95.2 ± 3.8 [92.5, 97.9] | 94.4 ± 2.5 [92.6, 96.2] | 97.7 ± 1.0 [97.0, 98.3] | 88.5 ± 3.4 [86.1, 90.9] |
| ViT-L/16 | 85.7 ± 4.6 [82.3, 89.0] | 82.2 ± 5.5 [78.3, 86.1] | 92.9 ± 5.4 [89.0, 96.7] | 94.2 ± 2.8 [92.2, 96.2] | 97.5 ± 1.2 [96.6, 98.3] | 88.5 ± 4.1 [85.6, 91.4] |
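
Each cell in these tables has the form mean ± spread [lower, upper]. This excerpt does not state how the spread or the interval was computed, so the following is only a sketch under explicit assumptions: the spread is the sample standard deviation over repeated runs or folds, and the interval is a two-sided 95% Student-t confidence interval for the mean. The names `T_CRIT_95`, `summarize`, and `format_summary` are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical values of Student's t for 1 to 10 degrees of freedom.
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def summarize(values: Sequence[float]) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, sample SD, 95% t-interval) for a set of per-run scores."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two runs are needed to estimate a spread")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch only tabulates t values for 2 to 11 runs")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)                  # sample SD (n - 1 denominator)
    half_width = T_CRIT_95[df] * sd / math.sqrt(n)  # t * standard error
    return mean, sd, (mean - half_width, mean + half_width)


def format_summary(values: Sequence[float]) -> str:
    """Format per-run scores in the 'mean ± SD [lower, upper]' style of the tables."""
    mean, sd, (low, high) = summarize(values)
    return f"{mean:.1f} ± {sd:.1f} [{low:.1f}, {high:.1f}]"
```

The sketch does not clip the interval to the valid range, whereas some cells in the tables report an upper bound of exactly 100; any such clipping would be an additional step.
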

| Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | ROC-AUC (%) | Avg. Precision (%) | F1 Score (%) |
|---|---|---|---|---|---|---|
| RETFound | 86.8 ± 2.1 [85.3, 88.4] | 91.9 ± 3.0 [89.8, 94.1] | 76.2 ± 6.7 [71.4, 81.0] | 93.3 ± 2.7 [91.3, 95.3] | 96.7 ± 1.6 [95.6, 97.8] | 90.4 ± 1.5 [89.3, 91.5] |
| MedSigLIP | 94.4 ± 3.2 [92.1, 96.7] | 95.1 ± 3.3 [92.7, 97.5] | 92.8 ± 5.6 [88.8, 96.8] | 97.9 ± 2.0 [96.4, 99.4] | 99.1 ± 0.8 [98.6, 99.7] | 95.8 ± 2.4 [94.0, 97.5] |
| EyeCLIP | 88.2 ± 3.8 [85.5, 90.9] | 91.4 ± 2.7 [89.4, 93.3] | 81.6 ± 9.7 [74.7, 88.5] | 95.6 ± 1.9 [94.2, 97.0] | 98.2 ± 0.7 [97.6, 98.7] | 91.3 ± 2.6 [89.4, 93.2] |
| ViT-L/16 | 84.9 ± 4.2 [81.9, 87.9] | 89.4 ± 4.4 [86.2, 92.6] | 75.6 ± 7.5 [70.3, 81.0] | 92.5 ± 4.8 [89.1, 95.9] | 96.6 ± 2.3 [94.9, 98.2] | 88.9 ± 3.2 [86.6, 91.1] |

| Model | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|
| GPT-5.2 (0-shot) | 61.0 ± 0.7 [59.8, 62.3] | 45.6 ± 0.6 [44.5, 46.7] | 93.1 ± 0.9 [91.4, 94.7] |
| GPT-5.2 (5-shot) | 67.6 ± 0.7 [66.4, 68.9] | 55.4 ± 0.7 [54.1, 56.6] | 93.1 ± 0.9 [91.4, 94.7] |
| Gemini 3 Flash (0-shot) | 73.8 ± 0.5 [72.9, 74.6] | 73.2 ± 0.5 [72.2, 74.2] | 74.9 ± 1.1 [72.9, 76.9] |
| Gemini 3 Flash (5-shot) | 75.4 ± 0.8 [74.0, 76.9] | 73.9 ± 0.6 [72.8, 75.0] | 78.6 ± 3.0 [73.1, 84.1] |
| Pixtral-Large (0-shot) | 54.8 ± 0.4 [54.0, 55.5] | 43.3 ± 0.2 [43.0, 43.6] | 78.6 ± 1.0 [76.7, 80.5] |
| Pixtral-Large (5-shot) | 63.3 ± 0.8 [61.9, 64.7] | 60.7 ± 1.1 [58.7, 62.7] | 68.7 ± 0.3 [68.0, 69.3] |
| MedGemma-1.5-4B-IT (0-shot) | 81.2 ± 0.0 [81.2, 81.2] | 80.5 ± 0.0 [80.5, 80.5] | 82.7 ± 0.0 [82.7, 82.7] |
| MedGemma-1.5-4B-IT (5-shot) | 74.8 ± 0.7 [73.5, 76.1] | 66.7 ± 1.3 [64.4, 69.0] | 91.7 ± 1.6 [88.8, 94.6] |

| Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | ROC-AUC (%) | Avg. Precision (%) | F1 Score (%) |
|---|---|---|---|---|---|---|
| RETFound | 54.9 | 98.4 | 0.8 | 59.5 | 64.6 | 36.2 |
| MedSigLIP | 80.2 | 81.6 | 78.4 | 88.2 | 91.0 | 82.0 |
| EyeCLIP | 72.8 | 65.0 | 82.5 | 82.2 | 86.0 | 72.6 |
| ViT-L/16 | 65.5 | 66.1 | 64.7 | 72.4 | 79.8 | 68.0 |

| Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | ROC-AUC (%) | Avg. Precision (%) | F1 Score (%) |
|---|---|---|---|---|---|---|
| RETFound | 65.0 | 42.4 | 93.0 | 79.9 | 83.6 | 57.3 |
| MedSigLIP | 83.7 | 81.9 | 85.9 | 91.2 | 93.1 | 84.8 |
| EyeCLIP | 55.4 | 21.1 | 97.9 | 77.1 | 80.9 | 34.4 |
| ViT-L/16 | 54.4 | 57.3 | 50.8 | 57.3 | 62.9 | 58.2 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Nouyed, M.I.; Al-Mamun, M.; Adjeroh, D.A.; Hu, G. Comparative Analysis of General-Purpose vs. Domain-Specific Multimodal Models for Diabetic Retinopathy Classification. Diagnostics 2026, 16, 1504. https://doi.org/10.3390/diagnostics16101504

