Comparative Analysis of General-Purpose vs. Domain-Specific Multimodal Models for Diabetic Retinopathy Classification
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The proposed work appears to be an extension of the existing work by Ayhan et al. [13], aimed at demonstrating the credibility of prompt-based or general-purpose large multimodal models (LMMs) in comparison with domain-specific models.
- Details of data augmentation should be consolidated and presented in a single section, as many of the parameters are common across models.
- The term “accum_steps,” used in the context of learning rate computation, should be clearly defined.
- The fine-tuning section in Figure 1 should be updated to improve clarity and provide a better understanding of the proposed methodology.
- The dataset size after data augmentation and the details of the test dataset should be stated clearly. This is important for proper interpretation of the results. In previous works, a held-out test set (typically 10% of the data) is also considered alongside 10-fold cross-validation.
- Uniform evaluation metrics, including specificity, should be reported across all results (Tables 3 and 4) to enable fair and comprehensive comparison among the presented models.
- Although EyeCLIP and MedSigLIP are designed for multimodal (image and text) inputs, the evaluation has been performed only on image data. The authors are requested to justify this choice.
- Confidence intervals should be included along with the reported results (mean ± std) to provide better insight into the statistical significance of the results.
- Explainability analysis should be incorporated using techniques such as Grad-CAM to provide better insights into the model’s decision-making process.
- The generalization capability of the proposed approach can be further evaluated using external datasets such as the DeepDRiD dataset (ISBI Challenge) [Liu et al., 2022].
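As an illustration of two of the reporting requests above (specificity in every table, and confidence intervals alongside mean and std), the following is a minimal sketch, not the authors' evaluation code. The function names, the label convention (1 = DR, 0 = no DR), and the choice of a t-based 95% interval over 10 folds are assumptions made only for this example.

```python
import math
import statistics


def specificity(y_true, y_pred):
    """Specificity = TN / (TN + FP) for binary labels (assumed: 1 = DR, 0 = no DR)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("no negative samples; specificity is undefined")
    return tn / (tn + fp)


def mean_std_ci(fold_scores, t_crit=2.262):
    """Return mean, sample std, and a t-based confidence interval across folds.

    t_crit defaults to 2.262, the two-sided 95% critical value of Student's t
    with 9 degrees of freedom (i.e. 10 folds). Pass another value for a
    different number of folds or confidence level.
    """
    n = len(fold_scores)
    if n < 2:
        raise ValueError("need at least two fold scores")
    mean = statistics.mean(fold_scores)
    std = statistics.stdev(fold_scores)  # sample standard deviation (n - 1)
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)
```

Fold scores from cross-validation share training data and are therefore not independent, so an interval computed this way is only approximate; a bootstrap over test images is a common alternative.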
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
- The small dataset (IDRiD, 516 images) may limit the generalizability and robustness of the conclusions. Other DR datasets should also be used.
- The study focuses only on binary classification, ignoring multi-grade DR severity, which reduces clinical relevance.
- No statistical significance testing was provided to support performance differences between models.
- Prompt-based evaluation lacks reproducibility due to variability in LMM responses and sampling strategies.
- Explainability analysis is discussed but not quantitatively evaluated, limiting clinical interpretability.
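One way the significance-testing request above could be met, when two models are evaluated on the same test images, is an exact McNemar test on their paired predictions. The sketch below is an assumption about a suitable test, not something taken from the manuscript; the function name and argument layout are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact (binomial) McNemar test on paired predictions.

    b counts images model A classifies correctly and model B does not;
    c counts the reverse. Under the null hypothesis of equal error rates,
    the discordant pairs are split as Binomial(b + c, 0.5).
    Returns (b, c, two-sided p-value).
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

The test uses only the images on which the two models disagree, so it needs per-image predictions rather than the aggregate accuracy of each model.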
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The reviewer appreciates the efforts put forth by the authors to address the suggestions and comments raised.
The authors are further advised to review the entire paper for any possible grammatical errors, such as the one in line 157:
“Although EyeCLIP and MedSigLIP are designed for multimodal (image + text) inputs, we deliberately used their visual encoders as the focus of the study in on (should be "is on") image classification.”
Author Response
Thank you for the comments on the writing. We have used a professional manuscript editing service from the author services to further improve the readability of the manuscript. All changes have been highlighted in red.
Reviewer 2 Report
Comments and Suggestions for Authors
I'm satisfied with the revised manuscript.
Author Response
We are glad to see that you are satisfied with our revision.

