Fine-Tuning a Small Vision Language Model Using Synthetic Data for Explaining Bacterial Skin Disease Images
Abstract
1. Introduction
2. Methods
2.1. Data Curation
2.1.1. Domain-Specific Dataset Preparation
2.1.2. Category Labels Annotation
2.1.3. QA Pairs Generation
1. General, knowledge-based questions that are independent of specific metadata (Appendix B).
2. Dataset-specific questions that are grounded in the image’s metadata context (Appendix B).
2.2. Model Fine-Tuning
2.3. Evaluation Benchmarks
2.3.1. Text Generation Metrics
2.3.2. Classification Metrics
- Distractor generation: Using the gemini-2.0-flash-exp model, we generated three semantically relevant distractor options for each ground-truth answer in the test dataset. These distractors were designed to mimic common misdiagnoses or plausible but incorrect clinical interpretations, for instance, confusing folliculitis with acne vulgaris in the context of bacterial skin diseases.
- Free-form response generation: We prompted the fine-tuned model to generate an open-ended response based on the input image and question, aiming to demonstrate its ability to synthesize information without being explicitly constrained to predefined answer choices.
- Semantic matching: A semantic encoder was used to compute the semantic similarity score between the generated response and each of the four candidate options (the ground-truth answer and its three distractors). The option with the highest score was taken as the model’s prediction, and classification accuracy was calculated as the proportion of correct predictions.
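The matching and scoring steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the semantic encoder is not named here, so `embed` is a hypothetical bag-of-words stand-in, and the function names and example strings are assumptions made for the sketch. In practice, `embed` would be replaced by a sentence-embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for a semantic encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_option(response: str, options: list) -> int:
    """Return the index of the candidate option most similar to the response."""
    response_vec = embed(response)
    scores = [cosine(response_vec, embed(option)) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


def accuracy(predictions: list, answers: list) -> float:
    """Proportion of predicted option indices that equal the ground-truth index."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Illustrative example: one ground-truth answer (index 0) and three distractors.
options = ["folliculitis", "acne vulgaris", "impetigo", "cellulitis"]
response = "The pustules are centred on hair follicles, consistent with folliculitis."
predicted = predict_option(response, options)
```

Because the prediction is the arg-max over similarity scores, the model is never shown the candidate options; they are used only to turn a free-form answer into a classification decision.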
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| LLM | large language model |
| VLM | vision language model |
| SOTA | state-of-the-art |
| PMC-OA | PubMed Central Open Access |
| QA | question–answer |
| AI | artificial intelligence |
| CLM | causal language modelling |
| LoRA | low-rank adaptation |
| VQA | visual question answering |
| QApairs | dataset of generated synthetic question–answer pairs |
| Caption | dataset of image captions from the BIOMEDICA dataset |
| QaCaption | dataset of synthetic QA pairs and image captions |
Appendix A. Prompts
Appendix A.1. Distractor Option Generation Prompt
***** INSTRUCTIONS *****
You are a medical assistant. Given the following question and its correct answer, generate 3 plausible but incorrect options.
**** CONTEXT *****
Question: {question}
Correct Answer: {correct_answer}
**** RESPONSE FORMAT *****
- Option 1: …
- Option 2: …
- Option 3: …
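As a rough sketch of how this prompt could be used, the template can be filled per test item and the reply parsed back into three distractors. The helper names (`DISTRACTOR_PROMPT`, `build_prompt`, `parse_distractors`) and the sample reply are hypothetical; the call to the generation model itself is omitted, and the parser assumes the reply follows the response format requested in the prompt.

```python
import re

# Template text taken from the prompt above; the constant name is hypothetical.
DISTRACTOR_PROMPT = (
    "***** INSTRUCTIONS *****\n"
    "You are a medical assistant. Given the following question and its correct "
    "answer, generate 3 plausible but incorrect options.\n"
    "**** CONTEXT *****\n"
    "Question: {question}\n"
    "Correct Answer: {correct_answer}\n"
    "**** RESPONSE FORMAT *****\n"
    "- Option 1: …\n"
    "- Option 2: …\n"
    "- Option 3: …"
)

# Matches lines such as "- Option 2: Rosacea" and captures the option text.
OPTION_LINE = re.compile(r"^\s*-?\s*Option\s+\d+\s*:\s*(.+)$", flags=re.MULTILINE)


def build_prompt(question: str, correct_answer: str) -> str:
    """Fill the two placeholders of the distractor prompt."""
    return DISTRACTOR_PROMPT.format(question=question, correct_answer=correct_answer)


def parse_distractors(reply: str) -> list:
    """Extract the option texts from a reply that follows the response format."""
    return [text.strip() for text in OPTION_LINE.findall(reply)]


# Illustrative reply; real replies come from the generation model.
sample_reply = (
    "- Option 1: Acne vulgaris\n"
    "- Option 2: Rosacea\n"
    "- Option 3: Contact dermatitis"
)
distractors = parse_distractors(sample_reply)
```

A caller would typically check that exactly three distractors were parsed and regenerate otherwise, since a model reply is not guaranteed to follow the requested format.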
Appendix A.2. Label Generation Prompt
***** INSTRUCTIONS *****
You are a dermatology expert. For each of the following image metadata entries, determine the most appropriate skin disease category from the list below. Provide your prediction along with a concise and well-reasoned explanation.
Skin Disease Categories:
{label_block}
**** CONTEXT *****
{metadata_context}
**** QUESTION *****
Please determine which of the following skin disease categories this image most likely belongs to.
**** RESPONSE FORMAT *****
Predicted Category (Number + Name): <Number>. <Name>
Brief Reasoning: <Explanation>
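A reply in this response format can be turned into a structured label with a small parser. The sketch below is an assumption about how such post-processing might look, not the authors' code: `parse_label` and the sample reply (including the category name) are hypothetical, and the range check uses the 29 categories listed in Appendix C.

```python
import re

# Pattern mirrors the response format requested by the prompt above.
LABEL_PATTERN = re.compile(
    r"Predicted Category \(Number \+ Name\):\s*(\d+)\.\s*(.+?)\s*"
    r"Brief Reasoning:\s*(.+)",
    flags=re.DOTALL,
)


def parse_label(reply: str, n_categories: int = 29):
    """Return (number, name, reasoning), or None if the reply is malformed
    or the category number falls outside 1..n_categories."""
    match = LABEL_PATTERN.search(reply)
    if match is None:
        return None
    number = int(match.group(1))
    if not 1 <= number <= n_categories:
        return None
    return number, match.group(2).strip(), match.group(3).strip()


# Illustrative reply; the category name here is made up for the example.
sample_reply = (
    "Predicted Category (Number + Name): 3. Bacterial infections\n"
    "Brief Reasoning: The caption describes honey-coloured crusts typical of impetigo."
)
label = parse_label(sample_reply)
```

Returning `None` for malformed or out-of-range replies makes it easy to flag entries for manual review instead of silently assigning a wrong category.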
Appendix B. Example Questions
Appendix B.1. Example General Questions
Appendix B.2. Example Dataset Specific Questions
Appendix C. Skin Disease Categories
List of 29 Dermatological Categories Used for Classification




| Model-Data | Accuracy | BLEU | ROUGE-1 | ROUGE-L | BERT Score |
|---|---|---|---|---|---|
| QApairs-500 | 61.96% | 11.46% | 40.78% | 35.12% | 90.19% |
| QApairs-1000 | 66.41% | 10.52% | 39.30% | 33.28% | 89.86% |
| QApairs-2286 | 68.63% | 10.94% | 40.60% | 34.17% | 90.17% |
| Caption-500 | 61.96% | 5.23% | 29.13% | 24.24% | 87.58% |
| Caption-1000 | 63.40% | 5.60% | 30.65% | 25.05% | 87.68% |
| Caption-2286 | 65.36% | 6.18% | 31.69% | 26.08% | 87.64% |
| QaCaption-500 | 64.71% | 8.20% | 34.71% | 29.32% | 88.74% |
| QaCaption-1000 | 67.45% | 7.69% | 34.16% | 28.54% | 88.75% |
| QaCaption-2286 | 70.20% | 7.87% | 34.45% | 28.56% | 88.79% |
| SmolVLM-Instruct | 68.24% | 3.59% | 22.93% | 19.10% | 87.69% |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhang, S.; Yilmaz, A.; Gencoglan, G.; Temelkuran, B. Fine-Tuning a Small Vision Language Model Using Synthetic Data for Explaining Bacterial Skin Disease Images. Diagnostics 2026, 16, 603. https://doi.org/10.3390/diagnostics16040603

