Validation of a Dermatology-Focused Multimodal Large Language Model in Classification of Pigmented Skin Lesions
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Setting
2.2. Participants

2.3. Data Collection
2.4. AI Model Evaluation
2.5. Outcome Measures
2.6. Institutional Review Board Statement
2.7. Informed Consent Statement
2.8. Statistical Analysis
3. Results
3.1. Study Population and Clinical Context
3.2. Diagnostic Performance



3.3. Inter-Rater Agreement Analysis
3.4. Subgroup Analysis
3.4.1. Sex-Based Analysis
3.4.2. Anatomical Location Analysis
3.4.3. Clinical Marker Analysis
3.4.4. Consistency Across Subgroups
4. Discussion
4.1. Key Findings
4.2. Clinical Context Integration Is a Fundamental Advance
4.3. Implications for Clinical Practice and Diagnostic Workflows
4.3.1. Systematic History-Taking Enhancement
4.3.2. Addressing Geographic and Expertise Disparities
4.4. Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations

| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| LLM | Large Language Model |
| Claude | Claude Sonnet 4 |
| CNN | Convolutional Neural Network |


| Characteristic | n (%) |
|---|---|
| Study Population | |
| Clinically Concerning Lesions | 68 (100) |
| Histopathologic Diagnoses | |
| Malignant melanoma | 49 (72.1) |
| Atypical nevus/melanocytic proliferation | 15 (22.1) |
| Benign (no atypia mentioned) | 4 (5.9) |
| Demographics | |
| Male | 29 (42.6) |
| Female | 39 (57.4) |
| Lesion Location | |
| Head (scalp, face, ears) | 14 (20.6) |
| Upper Extremity | 12 (17.6) |
| Trunk | 16 (23.5) |
| Lower Extremity | 19 (27.9) |
| Neck | 5 (7.4) |
| Other | 2 (2.9) |
| Presence of Clinical Marker | |
| Present | 39 (57.4) |
| Not Present | 29 (42.6) |
| Race/Ethnicity | |
| White | 67 (98.5) |
| Non-White | 1 (1.5) |

| Metric | Claude | DermFlow | Clinician |
|---|---|---|---|
| Sensitivity | 81.6% (68.6–90.0%) | 93.9% (83.5–97.9%) | 67.3% (53.4–78.8%) |
| Specificity | 52.6% (31.7–72.7%) | 89.5% (68.6–97.1%) | 84.2% (62.4–94.5%) |
| PPV | 81.6% (68.6–90.0%) | 95.8% (86.0–98.8%) | 91.7% (78.2–97.1%) |
| NPV | 52.6% (31.7–72.7%) | 85.0% (64.0–94.8%) | 50.0% (33.6–66.4%) |
| F1 Score | 0.816 (0.732–0.890) | 0.948 (0.901–0.980) | 0.776 (0.691–0.851) |
| Cohen’s κ | 0.295 (0.030–0.534) | 0.057 (−0.126–0.270) | −0.169 (−0.352–0.076) |
| Accuracy | 73.5% (62.0–82.6%) | 92.6% (83.9–96.8%) | 72.1% (60.4–81.3%) |
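
The rows of the table above all follow from a single 2×2 confusion matrix per rater. The sketch below shows that arithmetic in Python. The counts are not given in the source: they are back-calculated from the reported percentages, assuming 49 melanomas as the positive class and 19 non-melanoma lesions as the negative class. The Wilson score interval is likewise an assumption, since the interval method is not named here, although it does reproduce the sensitivity and specificity bounds listed for Claude. The names `Confusion`, `wilson`, `metrics` and `INFERRED` are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 counts against histopathology, melanoma taken as the positive class."""
    tp: int
    fn: int
    tn: int
    fp: int


def wilson(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes in n trials (assumed CI method)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def metrics(c: Confusion) -> dict[str, float]:
    """Standard binary classification metrics from one confusion matrix."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
        "accuracy": (c.tp + c.tn) / total,
    }


# ASSUMPTION: counts inferred from the reported percentages, not source data.
INFERRED = {
    "Claude": Confusion(tp=40, fn=9, tn=10, fp=9),
    "DermFlow": Confusion(tp=46, fn=3, tn=17, fp=2),
    "Clinician": Confusion(tp=33, fn=16, tn=16, fp=3),
}

for name, c in INFERRED.items():
    m = metrics(c)
    lo, hi = wilson(c.tp, c.tp + c.fn)
    summary = ", ".join(f"{key}={value:.3f}" for key, value in m.items())
    print(f"{name}: {summary}, sensitivity 95% CI=({lo:.3f}, {hi:.3f})")
```

Because PPV and NPV depend on prevalence, the values computed this way only hold for a cohort like this one, in which 49 of 68 biopsied lesions were melanoma.
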

| Diagnostic Measure | Comparison | Cohen’s κ | Observed Agreement (%) | Agreement Level |
|---|---|---|---|---|
| Top Diagnosis | DermFlow vs. Clinician | 0.046 ± 0.119 | 52.9 | Slight |
| Top Diagnosis | DermFlow vs. Claude | −0.051 ± 0.071 | 50.0 | Poor |
| Top Diagnosis | Clinician vs. Claude | 0.197 ± 0.093 | 67.6 | Slight |
| Decision-to-Biopsy | DermFlow vs. Claude | −0.076 ± 0.037 | 77.9 | Poor |
| Decision-to-Biopsy | Clinician vs. Others | N/A † | N/A † | N/A † |
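
The table above pairs high observed agreement with near-zero or negative κ (for example 77.9% agreement with κ = −0.076). A minimal Python sketch of Cohen's κ shows how that can happen when both raters choose the same label for most cases. The labels below are hypothetical and are not the study data. The function names and the Landis and Koch style cut-offs in `agreement_level` are assumptions, chosen because they are consistent with the "Poor" and "Slight" labels in the table.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> tuple[float, float]:
    """Return (kappa, observed agreement) for two raters' paired labels."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports 0.0 by convention.
        return 0.0, observed
    return (observed - expected) / (1 - expected), observed


def agreement_level(kappa: float) -> str:
    """Map kappa to a descriptive band (assumed Landis and Koch style cut-offs)."""
    if kappa < 0:
        return "Poor"
    for upper, name in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                        (0.80, "Substantial")]:
        if kappa <= upper:
            return name
    return "Almost perfect"


# HYPOTHETICAL biopsy decisions for 20 lesions: both raters favour "biopsy".
rater_a = ["biopsy"] * 18 + ["monitor"] * 2
rater_b = ["biopsy"] * 16 + ["monitor"] * 2 + ["biopsy"] * 2

kappa, observed = cohen_kappa(rater_a, rater_b)
print(f"observed agreement={observed:.1%}, kappa={kappa:.3f}, "
      f"level={agreement_level(kappa)}")
```

In this hypothetical case the raters agree on 16 of 20 lesions, yet chance agreement is even higher because each recommends biopsy 90% of the time, so κ falls below zero.
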

| Subgroup | n | Claude | DermFlow | Clinician |
|---|---|---|---|---|
| Sex | ||||
| Male | 29 | 6.9 (0.0–16.7) | 51.7 (32.4–71.1) | 34.5 (16.1–52.9) |
| Female | 39 | 10.3 (0.0–20.2) | 43.6 (27.3–59.9) | 41.0 (24.9–57.2) |
| Lesion Location | ||||
| Head (scalp, face, ears) | 14 | 14.3 (0.0–35.3) | 64.3 (35.6–93.0) | 71.4 (44.4–98.5) |
| Upper extremity | 12 | 8.3 (0.0–26.7) | 58.3 (25.6–91.1) | 25.0 (0.0–53.7) |
| Trunk | 16 | 12.5 (0.0–30.7) | 50.0 (22.5–77.5) | 25.0 (0.0–48.8) |
| Lower extremity | 19 | 5.3 (0.0–16.3) | 21.1 (0.0–41.2) | 36.8 (13.0–60.7) |
| Neck | 5 | 0.0 (0.0–0.0) | 40.0 (0.0–100.0) | 20.0 (0.0–75.5) |
| Clinical Marker Presence | ||||
| Present | 39 | 12.8 (0.0–23.8) | 53.8 (37.5–70.2) | 38.5 (22.5–54.4) |
| Absent | 29 | 3.4 (0.0–10.5) | 37.9 (19.2–56.7) | 37.9 (19.2–56.7) |

| Subgroup | n | Claude | DermFlow | Clinician |
|---|---|---|---|---|
| Sex | ||||
| Male | 29 | 82.8 (68.1–97.4) | 96.6 (89.5–100.0) | 65.5 (47.1–83.9) |
| Female | 39 | 66.7 (51.2–82.2) | 89.7 (79.8–99.7) | 76.9 (63.1–90.8) |
| Lesion Location | ||||
| Head (scalp, face, ears) | 14 | 85.7 (64.5–100.0) | 100.0 (100.0–100.0) | 92.9 (77.4–100.0) |
| Upper extremity | 12 | 91.7 (73.3–100.0) | 100.0 (100.0–100.0) | 50.0 (16.8–83.2) |
| Trunk | 16 | 87.5 (69.3–100.0) | 93.8 (80.4–100.0) | 56.3 (29.0–83.6) |
| Lower extremity | 19 | 42.1 (17.7–66.6) | 79.0 (58.8–100.0) | 79.0 (58.8–99.1) |
| Neck | 5 | 60.0 (0.0–100.0) | 100.0 (100.0–100.0) | 80.0 (24.5–100.0) |
| Clinical Marker Presence | ||||
| Present | 39 | 71.8 (57.0–86.6) | 97.4 (92.3–100.0) | 76.9 (63.1–90.8) |
| Absent | 29 | 75.9 (59.3–92.4) | 86.2 (72.9–99.6) | 65.5 (47.1–83.9) |
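
One consistency check a reader can run on a subgroup table like the one above is to pool the subgroup percentages, weighted by n, back into a single overall rate. The sketch below does this for the Sex rows. It assumes each percentage is a proportion of that subgroup's n, so that rounding `n * percent / 100` recovers a whole-number count; `pooled_rate` and `SEX_ROWS` are illustrative names.

```python
def pooled_rate(rows: list[tuple[int, float]]) -> float:
    """Pool (n, percent) subgroup rows into one overall percentage.

    ASSUMPTION: each percent is a proportion of its subgroup's n, so the
    underlying count can be recovered by rounding n * percent / 100.
    """
    total = sum(n for n, _ in rows)
    hits = sum(round(n * pct / 100) for n, pct in rows)
    return 100 * hits / total


# (n, percent) for the Male and Female rows of the table above.
SEX_ROWS = {
    "Claude": [(29, 82.8), (39, 66.7)],
    "DermFlow": [(29, 96.6), (39, 89.7)],
    "Clinician": [(29, 65.5), (39, 76.9)],
}

for rater, rows in SEX_ROWS.items():
    print(f"{rater}: pooled {pooled_rate(rows):.1f}% over {sum(n for n, _ in rows)} lesions")
```

Under that assumption the pooled values come to 73.5%, 92.6% and 72.1%, which agree with the overall Accuracy row reported for Claude, DermFlow and the clinician.
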

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mijares, J.; Jairath, N.; Zhang, A.; Que, S.K.T. Validation of a Dermatology-Focused Multimodal Large Language Model in Classification of Pigmented Skin Lesions. Diagnostics 2025, 15, 2808. https://doi.org/10.3390/diagnostics15212808

