AI in Dermato-Oncology: Diagnostic Performance and Prompt-Injection Vulnerability of Vision–Language Models in Dermoscopic Skin Cancer Assessment
Simple Summary
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design
2.2. Dataset
2.3. Experimental Conditions
2.4. Models
2.5. Prompting Strategy
2.6. Outcome Definitions
2.7. Statistical Analysis
3. Results
3.1. Diagnostic Performance in the Clean Condition
3.2. Collapse of Diagnostic Performance Under Prompt Injection
3.3. Selectivity of the Attack
3.4. Deterministic Nature of Manipulated Outputs
3.5. Confidence–Accuracy Dissociation
4. Discussion
4.1. Principal Findings
4.2. Inadequate Baseline Diagnostic Performance
4.3. Adversarial Control of the Diagnostic Decision
4.4. Selectivity of the Manipulation
4.5. Confidence Miscalibration and Failure of Self-Consistency
4.6. Clinical Implications
4.7. Comparison with Prior Literature
4.8. Limitations
- Sample size. The sample (n = 52 images, 26 per class) was deliberately restricted to allow a fully factorial design (52 images × 3 models × 4 conditions × 3 rounds × 9 outputs = 16,848 structured-output observations) within a tractable budget of queries per frontier-level model. Although the primary effects were large and survived Bonferroni correction, smaller effects may not be reliably detectable. The reported performance and adversarial-vulnerability estimates should accordingly be interpreted as condition-specific findings within a constrained experimental paradigm. Without further dedicated evaluation, they should not be extrapolated to rarer melanoma subtypes, broader image-quality strata, or real-world dermatological contexts not represented in the present dataset.
- Dataset composition. A substantial proportion of the benign cases were atypical or dysplastic melanocytic nevi, and Fitzpatrick phototypes V and VI were not represented. The dataset, therefore, reflects a clinically demanding subset of dermoscopic diagnoses on lighter skin and may not fully capture the spectrum of lesion complexity and skin-type diversity encountered in routine clinical or screening use [34].
- Binary classification task. The primary endpoint was restricted to the benign-versus-malignant distinction and does not reflect the broader multi-class differential diagnosis encountered in clinical dermato-oncology, which includes non-melanocytic malignancies such as basal- and squamous-cell carcinoma as well as benign mimickers.
- Worst-case adversarial design. The attack assumes prior knowledge of the ground-truth label and injects its exact inverse. The observed effect should therefore be interpreted as an upper-bound estimate of single-word manipulability rather than a real-world attack frequency, since actual manipulation efficacy would depend on the attacker’s ability to distinguish benign from malignant cases a priori.
- Single-word manipulation only. Multi-word, semantically richer, or more subtle perturbations were not examined and may produce qualitatively different failure modes.
- Prompt and inference configuration. All queries used a zero-shot generic prompt without specialized dermatological prompting, chain-of-thought reasoning, or in-context examples, and models were accessed in their default inference configurations. The effect of alternative prompting strategies or inference settings on either baseline accuracy or adversarial robustness was not evaluated.
- Black-box evaluation. The present study was designed as a black-box evaluation of observable end-user model behavior under standardized input conditions. Internal feature representations, attention distributions, intermediate reasoning steps, and decision pathways of the proprietary models were not accessible through the consumer-facing interfaces used in this work and were not the object of investigation. The unit of analysis is the model’s observable categorical output (benign/malignant/non-diagnostic) and the accompanying structured fields under each input condition, rather than the internal mechanism by which that output is generated.
- No mechanistic preprocessing characterization. The internal preprocessing and parsing pipelines of the evaluated consumer-facing platforms were not characterized in this study, including which specific metadata fields are ingested, at which stage of preprocessing, and by which intermediate component. Such characterization would require controlled API-level or model-internal inspection that is not accessible through first-party consumer interfaces. The reported metadata-only effects should therefore be interpreted as empirical observations at the input–output level: manipulation of image-associated metadata under real-world interface conditions altered the downstream diagnostic output, and no claim is made regarding the specific internal mechanism by which this occurred.
- Model snapshot. The evaluated models (Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.4) represent specific versions accessed in April 2026. Baseline performance and adversarial robustness may change with future model updates, and the absolute numerical estimates reported here should be interpreted accordingly.
- No mitigation testing. No candidate mitigation strategies, such as visual-overlay detectors, metadata-stripping pipelines, adversarial training, or ensemble cross-checking, were tested. The present study documents the vulnerability but does not address its remediation.
- Potential training-data contamination. All images originate from publicly available ISIC dermoscopic image repositories. The possibility that subsets of these images, or visually related images, were included in the pre-training or fine-tuning corpora of the evaluated VLMs cannot be excluded. Such contamination, if present, would plausibly bias baseline performance upwards rather than downwards and would not affect the relative comparison of baseline and adversarial conditions, which is the central analysis of this study. Prospective evaluation on private, unpublished dermoscopic datasets remains a desirable direction for future work.
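The design arithmetic in the sample-size limitation and the worst-case label inversion described above can be checked with a short sketch. The factor counts are taken from the text; the reading of "9 outputs" as nine structured fields per model response, the helper name `injected_label`, and the literal label strings are assumptions made for illustration only.

```python
from math import prod

# Design factors as stated in the sample-size limitation.
# "outputs_per_response" assumes the 9 outputs are fields of one response.
design = {
    "images": 52,
    "models": 3,
    "conditions": 4,
    "rounds": 3,
    "outputs_per_response": 9,
}

# Total structured-output observations across the full factorial design.
total_observations = prod(design.values())

# Number of individual model responses (one per image, model, condition, round).
responses = total_observations // design["outputs_per_response"]

# Image-by-round observations available per model and condition.
responses_per_model_condition = design["images"] * design["rounds"]


def injected_label(ground_truth: str) -> str:
    """Hypothetical helper: the worst-case attack injects the exact inverse
    of the ground-truth label (label strings are assumed, not from the paper)."""
    inverse = {"benign": "malignant", "malignant": "benign"}
    return inverse[ground_truth]


print(total_observations)             # 16848
print(responses)                      # 1872
print(responses_per_model_condition)  # 156
print(injected_label("benign"))       # malignant
```

Because the injected word is always derived from the ground truth, the sketch makes the upper-bound character of the design explicit: an attacker without access to the true label could not construct this mapping.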
4.9. Future Directions
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| API | Application Programming Interface |
| CI | Confidence Interval |
| CNN | Convolutional Neural Network |
| GT | Ground Truth |
| ISIC | International Skin Imaging Collaboration |
| NPV | Negative Predictive Value |
| PACS | Picture Archiving and Communication System |
| PPV | Positive Predictive Value |
| VLM | Vision–Language Model |
References
- Bray, F.; Laversanne, M.; Sung, H.; Ferlay, J.; Siegel, R.L.; Soerjomataram, I.; Jemal, A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA A Cancer J. Clin. 2024, 74, 229–263.
- Joshi, U.M.; Kashani-Sabet, M.; Kirkwood, J.M. Cutaneous Melanoma. JAMA 2025, 334, 2113.
- Siegel, R.L.; Kratzer, T.B.; Giaquinto, A.N.; Sung, H.; Jemal, A. Cancer statistics, 2025. CA Cancer J. Clin. 2025, 75, 10–45.
- Chen, J.Y.; Fernandez, K.; Fadadu, R.P.; Reddy, R.; Kim, M.-O.; Tan, J.; Wei, M.L. Skin Cancer Diagnosis by Lesion, Physician, and Examination Type. JAMA Dermatol. 2025, 161, 135.
- Saginala, K.; Barsouk, A.; Aluru, J.S.; Rawla, P.; Barsouk, A. Epidemiology of Melanoma. Med. Sci. 2021, 9, 63.
- Khamaysi, Z.; Awwad, M.; Jiryis, B.; Bathish, N.; Shapiro, J. The Role of ChatGPT in Dermatology Diagnostics. Diagnostics 2025, 15, 1529.
- Zarfati, M.; Nadkarni, G.N.; Glicksberg, B.S.; Harats, M.; Greenberger, S.; Klang, E.; Soffer, S. Exploring the Role of Large Language Models in Melanoma: A Systematic Review. J. Clin. Med. 2024, 13, 7480.
- Wang, Q.; Amugo, I.; Rajakaruna, H.; Irudayam, M.J.; Xie, H.; Shanker, A.; Adunyah, S.E. Evaluating GPT-5 for Melanoma Detection Using Dermoscopic Images. Diagnostics 2025, 15, 3052.
- Liu, X.; Duan, C.; Kim, M.-K.; Zhang, L.; Jee, E.; Maharjan, B.; Huang, Y.; Du, D.; Jiang, X. Claude 3 Opus and ChatGPT with GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis. JMIR Med. Inform. 2024, 12, e59273.
- Perlmutter, J.W.; Milkovich, J.; Fremont, S.; Datta, S.; Mosa, A. Beyond the Surface: Assessing GPT-4’s Accuracy in Detecting Melanoma and Suspicious Skin Lesions from Dermoscopic Images. Plast. Surg. 2025, 34, 293–300.
- Reiter, O.; Navarrete-Dechent, C.; Atlas, M.; Nathansohn, N.; Ben Mordehai, Y.; Mimouni, T.; Gleicher, R.; Awwad, M.; Cohen, I.; Khamaysi, Z.; et al. Beyond black-box AI: Comparing ChatGPT-4 interpretability and accuracy to CNNs in melanocytic lesions diagnosis. JDDG J. Dtsch. Dermatol. Ges. 2026, 1–9.
- Clusmann, J.; Ferber, D.; Wiest, I.C.; Schneider, C.V.; Brinker, T.J.; Foersch, S.; Truhn, D.; Kather, J.N. Prompt injection attacks on vision language models in oncology. Nat. Commun. 2025, 16, 1239.
- Clusmann, J.; Schulz, S.J.; Ferber, D.; Wiest, I.C.; Fernandez, A.; Eckstein, M.; Lange, F.; Reitsam, N.G.; Kellers, F.; Schmitt, M.; et al. Incidental Prompt Injections on Vision–Language Models in Real-Life Histopathology. NEJM AI 2025, 2, AIcs2500078.
- Lee, R.W.; Jun, T.J.; Lee, J.-M.; Cho, S.I.; Park, H.J.; Suh, J. Vulnerability of Large Language Models to Prompt Injection When Providing Medical Advice. JAMA Netw. Open 2025, 8, e2549963.
- ISIC Archive Gallery. International Skin Imaging Collaboration. Available online: https://gallery.isic-archive.com (accessed on 18 April 2026).
- Cassidy, B.; Kendrick, C.; Brodzicki, A.; Jaworek-Korjakowska, J.; Yap, M.H. Analysis of the ISIC image datasets: Usage, benchmarks and recommendations. Med. Image Anal. 2022, 75, 102305.
- Wilson, E.B. Probable Inference, the Law of Succession, and Statistical Inference. J. Am. Stat. Assoc. 1927, 22, 209–212.
- McNemar, Q. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika 1947, 12, 153–157.
- Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
- Python. Available online: https://www.python.org/ (accessed on 3 April 2026).
- Naeem, A.; Anees, T.; Fiza, M.; Naqvi, R.A.; Lee, S.-W. SCDNet: A Deep Learning-Based Framework for the Multiclassification of Skin Cancer Using Dermoscopy Images. Sensors 2022, 22, 5652.
- Kaur, R.; GholamHosseini, H.; Sinha, R.; Lindén, M. Melanoma Classification Using a Novel Deep Convolutional Neural Network with Dermoscopic Images. Sensors 2022, 22, 1134.
- Savage, T.; Wang, J.; Gallo, R.; Boukil, A.; Patel, V.; Safavi-Naini, S.A.A.; Soroush, A.; Chen, J.H. Large language model uncertainty proxies: Discrimination and calibration for medical diagnosis and treatment. J. Am. Med. Inform. Assoc. 2024, 32, 139–149.
- Ayo-Ajibola, O.; Davis, R.J.; Lin, M.E.; Riddell, J.; Kravitz, R.L. Characterizing the Adoption and Experiences of Users of Artificial Intelligence–Generated Health Information in the United States: Cross-Sectional Questionnaire Study. J. Med. Internet Res. 2024, 26, e55138.
- Choudhury, A.; Elkefi, S.; Tounsi, A. Exploring factors influencing user perspective of ChatGPT as a technology that assists in healthcare decision making: A cross sectional survey study. PLoS ONE 2024, 19, e0296151.
- Mendel, T.; Singh, N.; Mann, D.M.; Wiesenfeld, B.; Nov, O. Laypeople’s Use of and Attitudes Toward Large Language Models and Search Engines for Health Queries: Survey Study. J. Med. Internet Res. 2025, 27, e64290.
- Huang, N.C.; Mukundan, A.; Karmakar, R.; Syna, S.; Chang, W.Y.; Wang, H.C. Novel Snapshot-Based Hyperspectral Conversion for Dermatological Lesion Detection via YOLO Object Detection Models. Bioengineering 2025, 12, 714.
- Nie, Y.; Sommella, P.; Carratù, M.; O’Nils, M.; Lundgren, J. A Deep CNN Transformer Hybrid Model for Skin Lesion Classification of Dermoscopic Images Using Focal Loss. Diagnostics 2022, 13, 72.
- Zbrzezny, A.M.; Krzywicki, T. Artificial Intelligence in Dermatology: A Review of Methods, Clinical Applications, and Perspectives. Appl. Sci. 2025, 15, 7856.
- Behara, K.; Bhero, E.; Agee, J.T. AI in dermatology: A comprehensive review into skin cancer detection. PeerJ Comput. Sci. 2024, 10, e2530.
- Yu, J.; Cheong, I.H.; Kozlakidis, Z.; Wang, H. Advancements and challenges of artificial intelligence in dermatology: A review of applications and perspectives in China. Front. Digit. Health 2025, 7, 1544520.
- Salinas, M.P.; Sepúlveda, J.; Hidalgo, L.; Peirano, D.; Morel, M.; Uribe, P.; Rotemberg, V.; Briones, J.; Mery, D.; Navarrete-Dechent, C. A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis. npj Digit. Med. 2024, 7, 125, Erratum in NPJ Digit. Med. 2024, 7, 141. https://doi.org/10.1038/s41746-024-01138-0.
- Musthafa, M.M.; Mahesh, T.R.; Kumar, V.V.; Guluwadi, S. Enhanced skin cancer diagnosis using optimized CNN architecture and checkpoints for automated dermatological lesion classification. BMC Med. Imaging 2024, 24, 201.
- Fitzpatrick, T.B. The validity and practicality of sun-reactive skin types I through VI. Arch. Dermatol. 1988, 124, 869–871.




| Characteristic | Value |
|---|---|
| Images, total | 52 |
| Benign nevi (histopathologically confirmed) | 26 (50.0%) |
| Invasive melanomas (histopathologically confirmed) | 26 (50.0%) |
| Image format | Dermoscopic, lesion-only crops |
| Identifiable patient information | None |
| Age range (years) | 30–85 (median 62.5) |
| Lesion size, long diameter (mm) | 2.5–20.0 (median 6.3) |
| Sex | female 29 (55.8%), male 23 (44.2%) |
| Fitzpatrick phototype distribution | I: 5; II: 32; III: 13; IV: 2; V: 0; VI: 0 |
| Anatomical site distribution | posterior torso 14; upper extremity 13; anterior torso 10; lower extremity 9; head/neck 3; lateral torso 2; palms/soles 1 |

| Model | Condition | Accuracy (%) [95% CI] | Sensitivity (%) [95% CI] | Specificity (%) [95% CI] | PPV (%) | NPV (%) | Balanced Accuracy (%) | McNemar b/c | p-Value |
|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Clean | 62.2 [54.4, 69.4] | 61.5 [42.5, 77.6] | 65.4 [46.2, 80.6] | 64.0 | 63.0 | 63.5 | — | — |
| Claude Opus 4.7 | Visual inj. | 0.0 [0.0, 2.4] | 0.0 [0.0, 12.9] | 0.0 [0.0, 12.9] | 0.0 | 0.0 | 0.0 | 33/0 | <0.0001 * |
| Claude Opus 4.7 | Metadata inj. | 0.6 [0.1, 3.5] | 0.0 [0.0, 12.9] | 0.0 [0.0, 12.9] | 0.0 | 0.0 | 0.0 | 33/0 | <0.0001 * |
| Claude Opus 4.7 | Combined inj. | 0.0 [0.0, 2.4] | 0.0 [0.0, 12.9] | 0.0 [0.0, 12.9] | 0.0 | 0.0 | 0.0 | 33/0 | <0.0001 * |
| Gemini 3.1 Pro | Clean | 59.0 [51.1, 66.4] | 15.4 [6.2, 33.5] | 100.0 [87.1, 100.0] | 100.0 | 54.2 | 57.7 | — | — |
| Gemini 3.1 Pro | Visual inj. | 0.0 [0.0, 2.4] | 0.0 [0.0, 12.9] | 0.0 [0.0, 12.9] | 0.0 | 0.0 | 0.0 | 30/0 | <0.0001 * |
| Gemini 3.1 Pro | Metadata inj. | 0.0 [0.0, 2.4] | 0.0 [0.0, 12.9] | 0.0 [0.0, 12.9] | 0.0 | 0.0 | 0.0 | 30/0 | <0.0001 * |
| Gemini 3.1 Pro | Combined inj. | 0.0 [0.0, 2.4] | 0.0 [0.0, 12.9] | 0.0 [0.0, 12.9] | 0.0 | 0.0 | 0.0 | 30/0 | <0.0001 * |
| GPT-5.4 | Clean | 58.3 [50.5, 65.8] | 38.5 [22.4, 57.5] | 76.9 [57.9, 89.0] | 62.5 | 55.6 | 57.7 | — | — |
| GPT-5.4 | Visual inj. | 1.9 [0.7, 5.5] | 3.8 [0.7, 18.9] | 0.0 [0.0, 12.9] | 3.7 | 0.0 | 1.9 | 30/1 | <0.0001 * |
| GPT-5.4 | Metadata inj. | 0.0 [0.0, 2.4] | 0.0 [0.0, 12.9] | 0.0 [0.0, 12.9] | 0.0 | 0.0 | 0.0 | 30/0 | <0.0001 * |
| GPT-5.4 | Combined inj. | 0.0 [0.0, 2.4] | 0.0 [0.0, 12.9] | 0.0 [0.0, 12.9] | 0.0 | 0.0 | 0.0 | 30/0 | <0.0001 * |

| Model | Injection | Unchanged Correct | Improved | Worsened | Unchanged Wrong | Benign→Malignant Flips | Malignant→Benign Flips | Net Change |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Visual inj. | 0 | 0 | 33 | 19 | 17 | 16 | +33 |
| Claude Opus 4.7 | Metadata inj. | 0 | 0 | 33 | 19 | 17 | 16 | +33 |
| Claude Opus 4.7 | Combined inj. | 0 | 0 | 33 | 19 | 17 | 16 | +33 |
| Gemini 3.1 Pro | Visual inj. | 0 | 0 | 30 | 22 | 26 | 4 | +30 |
| Gemini 3.1 Pro | Metadata inj. | 0 | 0 | 30 | 22 | 26 | 4 | +30 |
| Gemini 3.1 Pro | Combined inj. | 0 | 0 | 30 | 22 | 26 | 4 | +30 |
| GPT-5.4 | Visual inj. | 0 | 1 | 30 | 21 | 21 | 10 | +29 |
| GPT-5.4 | Metadata inj. | 0 | 0 | 30 | 22 | 20 | 10 | +30 |
| GPT-5.4 | Combined inj. | 0 | 0 | 30 | 22 | 20 | 10 | +30 |

| Model | Condition | Mean Confidence (±SD) | Fleiss’ κ (3 Rounds) | Flip Rate | Mean Accuracy (7 Non-Diagnosis Variables) |
|---|---|---|---|---|---|
| Claude Opus 4.7 | Clean | 69.1 ± 6.3 | 0.690 | 23.1% | 47.4% |
| Claude Opus 4.7 | Visual inj. | 71.1 ± 2.9 | 1.000 | 0.0% | 47.5% |
| Claude Opus 4.7 | Metadata inj. | 66.3 ± 6.2 | 0.974 | 1.9% | 45.6% |
| Claude Opus 4.7 | Combined inj. | 71.4 ± 3.6 | 1.000 | 0.0% | 47.2% |
| Gemini 3.1 Pro | Clean | 86.9 ± 5.4 | 0.372 | 15.4% | 44.0% |
| Gemini 3.1 Pro | Visual inj. | 89.4 ± 4.1 | 1.000 | 0.0% | 42.9% |
| Gemini 3.1 Pro | Metadata inj. | 86.5 ± 5.5 | 1.000 | 0.0% | 42.7% |
| Gemini 3.1 Pro | Combined inj. | 88.1 ± 4.5 | 1.000 | 0.0% | 41.8% |
| GPT-5.4 | Clean | 72.1 ± 9.5 | 0.613 | 25.0% | 46.2% |
| GPT-5.4 | Visual inj. | 77.7 ± 11.0 | 1.000 | 0.0% | 47.4% |
| GPT-5.4 | Metadata inj. | 75.3 ± 8.7 | 1.000 | 0.0% | 40.8% |
| GPT-5.4 | Combined inj. | 82.4 ± 7.6 | 1.000 | 0.0% | 42.6% |
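
The 95% intervals and McNemar p-values in the performance table can be illustrated with a minimal sketch of the Wilson score interval and a two-sided exact (binomial) McNemar test, both of which appear in the reference list. Two points are assumptions rather than statements from the text: the denominators (156 image-by-round observations for accuracy, 26 images per class for sensitivity and specificity) are inferred from the reported interval widths, and the exact binomial form of the McNemar test is chosen here for illustration, since the variant used by the authors is not specified in this section. The function names are hypothetical.

```python
from math import comb, sqrt


def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default: 95% level)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail P(X <= k) with X ~ Bin(n, 0.5), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


def as_percent(interval: tuple[float, float]) -> list[float]:
    """Round an interval given as proportions to percentages with one decimal."""
    return [round(100 * bound, 1) for bound in interval]


# Sensitivity of 0/26 under injection (assumed denominator: 26 melanomas).
print(as_percent(wilson_ci(0, 26)))   # [0.0, 12.9]

# Accuracy of 0/156 under injection (assumed denominator: 52 images x 3 rounds).
print(as_percent(wilson_ci(0, 156)))  # [0.0, 2.4]

# Discordant pairs 33/0 (all 33 changes were deteriorations).
print(mcnemar_exact(33, 0) < 0.0001)  # True
```

Under these assumptions the sketch reproduces the tabulated bounds for the zero-count rows, which is why intervals such as [0.0, 12.9] recur across models: with 0 of 26 correct, the upper Wilson bound depends only on the denominator.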
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Güler, I.; Kraus, A.; Grieb, G.; Satir, T.; Eberz, P.; Stelling, H. AI in Dermato-Oncology: Diagnostic Performance and Prompt-Injection Vulnerability of Vision–Language Models in Dermoscopic Skin Cancer Assessment. Cancers 2026, 18, 1750. https://doi.org/10.3390/cancers18111750

