Systematic Review

Exploring the Role of Large Language Models in Melanoma: A Systematic Review

Mor Zarfati, Girish N. Nadkarni, Benjamin S. Glicksberg, Moti Harats, Shoshana Greenberger, Eyal Klang and Shelly Soffer

1 Department of Internal Medicine, Soroka University Medical Center, Beer-Sheva 84101, Israel
2 Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
3 Department of Plastic and Reconstructive Surgery, Sheba Medical Center, Ramat-Gan 52621, Israel
4 Faculty of Medicine, Tel Aviv University, Tel Aviv 52621, Israel
5 Pediatric Dermatology Unit, Department of Dermatology, Sheba Medical Center, Ramat Gan 52621, Israel
6 Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center, Petah-Tikva 49100, Israel
* Author to whom correspondence should be addressed.
J. Clin. Med. 2024, 13(23), 7480; https://doi.org/10.3390/jcm13237480
Submission received: 19 October 2024 / Revised: 30 November 2024 / Accepted: 2 December 2024 / Published: 9 December 2024
(This article belongs to the Section Dermatology)

Abstract
Objective: This systematic review evaluates the current applications, advantages, and challenges of large language models (LLMs) in melanoma care. Methods: A systematic search was conducted in PubMed and Scopus databases for studies published up to 23 July 2024, focusing on the application of LLMs in melanoma. The review adhered to PRISMA guidelines, and the risk of bias was assessed using the modified QUADAS-2 tool. Results: Nine studies were included, categorized into subgroups: patient education, diagnosis, and clinical management. In patient education, LLMs demonstrated high accuracy, though readability often exceeded recommended levels. For diagnosis, multimodal LLMs like GPT-4V showed capabilities in distinguishing melanoma from benign lesions, but accuracy varied, influenced by factors such as image quality and integration of clinical context. Regarding management advice, ChatGPT provided more reliable recommendations compared to other LLMs, but all models lacked depth for individualized decision-making. Conclusions: LLMs, particularly multimodal models, show potential in improving melanoma care. However, current applications require further refinement and validation. Future studies should explore fine-tuning these models on large, diverse dermatological databases and incorporate expert knowledge to address limitations such as generalizability across different populations and skin types.

1. Introduction

Large language models (LLMs), including ChatGPT, Gemini, and Llama, are artificial intelligence (AI) models designed to understand and generate human-like text [1]. These models are gaining recognition across various medical specialties for their potential to assist with clinical tasks [2,3,4,5,6,7]. However, their specific role in dermatology, particularly in melanoma care, remains under investigation [8]. Multimodal LLMs, such as GPT-4 Vision (GPT-4V), further expand this potential by combining visual and textual data. This capability could improve applications in medical imaging and diagnosis [9].
Melanoma, the most aggressive form of skin cancer, is responsible for more than 80% of skin cancer mortality [10]. Early melanoma detection enables successful surgical treatment and significantly better survival rates. Once metastasis occurs, prognosis worsens considerably. Therefore, timely and accurate diagnosis is critical for patient outcomes [11]. Clinically, melanoma is often identified by the ABCDE rule, which evaluates asymmetry, border irregularity, color variation, diameter, and evolving characteristics of skin lesions [10]. Current treatment approaches include surgical excision, targeted therapy, immunotherapy, and radiation therapy, with treatment selection based on stage and molecular profile [11].
Previous studies have shown mixed results regarding the use of LLMs in dermatology, leading to caution among dermatologists [12]. However, these tools, when properly optimized, may enhance melanoma diagnosis, patient communication, and treatment outcomes.
This systematic review aims to evaluate the current applications, advantages, and challenges associated with the use of LLMs in melanoma care.

2. Foundational Concepts

Below, we outline the key concepts underlying LLMs and their applications in healthcare. Figure 1 presents a hierarchy diagram of AI terms.

2.1. Artificial Intelligence and Deep Learning

AI refers to the development of algorithms capable of performing tasks that typically require human intelligence. Machine learning (ML), a key driver of recent AI advances, enables systems to learn from data rather than following fixed rules. Deep learning is a subset of ML that employs artificial neural networks to analyze different types of data and learn from them. Examples include language comprehension and image pattern recognition [13,14].

2.2. Artificial Neural Networks

Artificial neural networks form the foundation of deep learning. Inspired by biological neural networks, they consist of interconnected nodes, or “neurons”, organized in layers. Each neuron receives inputs, processes them, and passes an output to the subsequent layer. Each neuron is a simple computational unit, similar to a single logistic regression function. By adjusting the connections between neurons based on the input data, neural networks can learn to recognize patterns and generate predictions [13].
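To make this concrete, here is a minimal sketch of a two-layer network in Python (using NumPy); the layer sizes and random weights are purely illustrative and not tied to any model discussed in this review.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: each neuron behaves like a logistic regression unit.
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, weights, bias):
    # Each neuron computes a weighted sum of its inputs plus a bias,
    # then applies a non-linear activation before passing the result on.
    return sigmoid(weights @ x + bias)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # 4 input features
w1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # hidden layer: 3 neurons
w2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output layer: 1 neuron

hidden = dense_layer(x, w1, b1)
output = dense_layer(hidden, w2, b2)
print(output)  # a prediction in (0, 1); training would adjust w1, b1, w2, b2
```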

2.3. Large Language Models

LLMs are large deep learning models that process and generate human-like text. Composed of multiple transformer layers, these models employ an attention mechanism to selectively focus on different parts of the input data. This structure allows them to excel in tasks such as text recognition, language translation, and content generation [15]. Notable examples of LLMs include ChatGPT by OpenAI and LLaMA by Meta [16,17,18].
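As an illustration of the attention mechanism described above, the following is a minimal NumPy sketch of scaled dot-product attention, the core operation inside each transformer layer; the token count and embedding width are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each token's query is compared with every
    # token's key; the resulting weights mix the value vectors, letting the
    # model selectively focus on different parts of the input.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
tokens, d_model = 5, 8                   # 5 input tokens, 8-dim embeddings
X = rng.normal(size=(tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (5, 8): one context-aware vector per token
```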

2.4. Convolutional Neural Networks

Convolutional neural networks (CNNs) are a specialized deep learning architecture optimized for visual analysis. These self-learning algorithms process images through multiple stacked layers, each applying mathematical filters (convolutions) to extract specific visual elements such as edges, textures, and patterns, which together enable image recognition [19].
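The sketch below illustrates the convolution operation itself: a hand-written 3x3 vertical-edge filter slid over a toy image. In a trained CNN the filter values are learned from data rather than hand-written.

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the filter over the image; each output pixel is the dot product
    # of the kernel with the underlying image patch.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter: responds where intensity changes left-to-right,
# the kind of low-level feature an early CNN layer learns.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

image = np.zeros((6, 6))
image[:, 3:] = 1.0          # bright right half, dark left half
print(convolve2d(image, edge_kernel))  # strong responses along the edge
```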

2.5. Multimodal Large Language Models

Multimodal LLMs (MLLMs) extend the capabilities of traditional LLMs by processing additional data modalities alongside text, such as images, audio, and video. MLLMs are typically built as foundation models: large-scale, general-purpose models pre-trained on massive multimodal datasets that provide a base for fine-tuning on specific tasks.
Vision language models (VLMs) are a subset of multimodal models focused specifically on image–text interactions. MLLMs are more versatile, handling broader multimodal reasoning tasks that involve multiple data types such as audio, video, and structured data.
The architecture of MLLMs typically consists of three key components: modality-specific encoders, which process individual data types; a central LLM backbone, which processes text and handles core reasoning; and modality interfaces, which unify these inputs into a shared representation [20].
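A schematic sketch of this three-component design is shown below; all function names, dimensions, and the pooling "backbone" are placeholders standing in for real encoder and transformer stacks, not the architecture of any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared embedding width of the (toy) LLM backbone

def image_encoder(pixels):
    # Stand-in for a vision encoder (e.g., a CNN or vision transformer):
    # maps raw pixels to a sequence of feature vectors.
    return rng.normal(size=(4, 32))  # 4 image "tokens", 32-dim features

def modality_interface(features, proj):
    # Projects encoder features into the LLM's embedding space so that
    # image tokens and text tokens share one representation.
    return features @ proj

def llm_backbone(token_embeddings):
    # Stand-in for the transformer stack that does the core reasoning;
    # here it simply pools the sequence into one summary vector.
    return token_embeddings.mean(axis=0)

text_tokens = rng.normal(size=(6, D))               # already-embedded prompt
image_feats = image_encoder(np.zeros((224, 224, 3)))
proj = rng.normal(size=(32, D))
image_tokens = modality_interface(image_feats, proj)

sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(llm_backbone(sequence).shape)  # (16,): fused multimodal representation
```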
The field of MLLMs has evolved rapidly, from early models like Flamingo (2022) [21] to advanced systems like GPT-4V (2023) [22] and the latest generation, including Claude 3, Gemini 1.5, and GPT-4o (2024) [10].

3. Materials and Methods

3.1. Search Strategy

A systematic review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement (PRISMA) guidelines and the recommendations for systematic reviews of prediction models (CHARMS checklist) [23,24]. The study is registered with PROSPERO (CRD42024575859, link: https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=575859, accessed on 3 October 2024) [25].
We searched the literature for applications of LLMs in melanoma using PubMed and Scopus. A systematic search of the published literature was conducted on 23 July 2024. Our search query was “((“Melanoma”) AND ((“ChatGPT”) OR (“large language models”) OR (“OpenAI”) OR (“Microsoft Bing”) OR (“google bard”) OR (“google gemini”)))”. We included Microsoft Bing in our search strategy, although it is not an LLM itself, as it incorporates LLM technology. The term “MLLM” (multimodal large language model) was not included in our initial search strategy. References to multimodal language models emerged from papers identified through our original search terms related to LLMs in melanoma care. To ensure thoroughness, we also reviewed the reference lists of relevant articles, but this did not yield any additional studies that met the inclusion criteria.
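For readers who wish to reproduce the PubMed arm of this search programmatically, a sketch using Biopython's Entrez wrapper is shown below. The query string is copied from the review; the e-mail address is a placeholder required by NCBI, Scopus has a separate API not covered here, and result counts will drift as new articles are indexed.

```python
from Bio import Entrez  # pip install biopython

Entrez.email = "your.name@example.org"  # placeholder; NCBI requires one

query = ('("Melanoma") AND (("ChatGPT") OR ("large language models") '
         'OR ("OpenAI") OR ("Microsoft Bing") OR ("google bard") '
         'OR ("google gemini"))')

# esearch returns matching record IDs; screening against the inclusion
# criteria is still a manual step.
handle = Entrez.esearch(db="pubmed", term=query, retmax=100)
record = Entrez.read(handle)
handle.close()

print(record["Count"])   # number of matching PubMed records
print(record["IdList"])  # PMIDs to screen
```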
We excluded articles that did not specifically evaluate the application of LLMs in melanoma, non-original articles, and conference abstracts.

3.2. Study Selection

The titles and abstracts of the identified studies were screened to determine their eligibility based on the inclusion and exclusion criteria. Any uncertainty was resolved through discussion between two reviewers, with a third reviewer consulted when necessary. The full texts of the selected articles were then independently assessed by two reviewers (MZ, SS). Discrepancies were resolved through consensus or consultation with a third reviewer (EK).

3.3. Data Extraction

Data extraction was conducted using a standardized form to ensure consistency. Key information extracted included the first author’s name, year of publication, sample size, LLM model types, objectives, and main findings.
To investigate the specific applications and effectiveness of LLMs in different aspects of melanoma care, we divided the articles into three subgroups: patient education, clinical management, and diagnosis.

3.4. Quality Assessment and Risk of Bias

To evaluate the risk of bias, we used the adapted version of the Quality Assessment of Diagnostic Accuracy Studies criteria (QUADAS-2) [26].

4. Results

Our literature search yielded a total of 45 articles from PubMed and Scopus. After the removal of nine duplicates, screening identified nine studies that met our inclusion criteria [27,28,29,30,31,32,33,34,35]; reference screening did not identify any additional studies. The study selection process and screening methodology are detailed in the PRISMA flow chart (Figure 2).
According to the QUADAS-2 tool, most papers scored as having a low to moderate risk of bias for the interpretation of the index test. A detailed assessment of the risk of bias is provided in Table 1.
The characteristics of the studies are presented in Table 2. A summary of the objectives, sample sizes, reference standards, main findings, and conclusions is presented in Table 3. The main advantages and challenges of the included studies are presented in Table 4.
Of the nine studies, five were comparative, evaluating and comparing various LLM models, such as ChatGPT, BARD, and BingAI [27,29,30,33,34]. The remaining four studies focused on a single LLM, specifically different versions of ChatGPT [28,31,32,35]. Three studies specifically examined multimodal LLMs, such as GPT-4V and LLaVA, highlighting their unique capabilities and associated challenges [29,31,33].
The included studies were diverse in their objectives, methodologies, and evaluation metrics. The studies focused on the application of LLMs in melanoma diagnosis, patient education, and clinical decision-making.

4.1. Patient Education

Four studies evaluated the use of LLMs in patient education, focusing on the accuracy of responses to common patient questions [28,30,34,35]. ChatGPT 4.0 and ChatGPT 3.5 were noted for their relatively high accuracy.
Deliyannis et al. found that while both ChatGPT and BARD can generate accurate educational responses, both ChatGPT 4.0 and 3.5 outperformed BARD [30]. Anguita et al. focused on choroidal melanoma and found no significant accuracy differences between ChatGPT 3.5, Bing AI, and DocsGPT beta [34].
Young et al. reported that ChatGPT 4.0 generates mostly accurate responses, scoring 4.9/5. However, only 64% of these responses were considered suitable for patient use, indicating that ChatGPT may be more effective as a supplemental tool in clinical practice. The study also found that the average readability score corresponded to a college-level comprehension, suggesting that the content might be too advanced for public use [35].
Roster et al. addressed this readability issue by evaluating ChatGPT’s responses to questions about sunscreen and melanoma from the American Academy of Dermatology’s (AAD) website. They investigated whether prompt engineering techniques (strategic prompting) could improve readability. The study compared ChatGPT’s responses after two rounds of strategic prompting with the original answers from the AAD website. The findings showed that the initial prompt did not lower the reading level compared to the AAD content. However, with additional prompting, the reading level was reduced to 7th grade, compared to the AAD’s 9th-grade level. This suggests that with proper prompt engineering, LLMs could improve the readability of medical information for melanoma patients [28].
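As a side note on how such readability levels are quantified, the sketch below computes the Flesch Reading Ease score (FRES) and Flesch–Kincaid grade level with the textstat package; the two sample answers are invented for illustration and are not taken from the study.

```python
import textstat  # pip install textstat

initial = ("Melanoma is a malignant neoplasm of melanocytes whose "
           "prognosis correlates with Breslow thickness at excision.")
reworded = ("Melanoma is a serious skin cancer. Finding it early, while "
            "it is still thin, makes it much easier to treat.")

for label, text in [("initial", initial), ("reworded", reworded)]:
    fres = textstat.flesch_reading_ease(text)    # higher = easier to read
    grade = textstat.flesch_kincaid_grade(text)  # approximate US grade level
    print(f"{label}: FRES={fres:.1f}, grade={grade:.1f}")
```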

4.2. Melanoma Diagnosis

Four studies examined the use of LLMs in melanoma diagnosis, focusing on their ability to identify and classify melanoma using clinical and dermoscopic data [29,31,32,33]. Multimodal LLMs, such as GPT-4V and LLaVA, played a key role in the majority of these evaluations.
Cirone et al. assessed GPT-4V and LLaVA, emphasizing their ability to integrate visual and textual data. The study used macroscopic images of melanoma and melanocytic nevi obtained from the MClass-D dataset. The prompts varied in specificity: some were general, asking for descriptions of the images, while others addressed the ABCDE features of melanoma; some also assessed the effect of background skin color on predictions. GPT-4V demonstrated superior performance, with an overall accuracy of 85%, compared to 45% for LLaVA. Notably, GPT-4V consistently described relevant ABCDE features and accurately identified melanoma. Unlike GPT-4V, LLaVA also had difficulty recognizing melanoma in darker skin tones [33]. This finding is consistent with that of Akrout et al., who likewise showed that GPT-4V outperformed LLaVA across all assessed features, though both models require further refinement to enhance diagnostic accuracy [29].
Shifai et al. evaluated ChatGPT Vision’s diagnostic accuracy in identifying melanoma using dermoscopic images from the ISIC archives. The model provided three ranked differential diagnoses for 100 melanocytic lesions. The sensitivity, specificity, and diagnostic accuracy varied depending on whether the top diagnosis or the top three diagnoses were considered [31].
These findings suggest that ChatGPT Vision may not yet be suitable for independent clinical use without additional refinement.
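To clarify how top-1 versus top-3 metrics of this kind are derived, the sketch below computes sensitivity, specificity, and diagnostic accuracy from ranked differential diagnoses; the four cases are invented toy data, not the study's 100 ISIC images.

```python
cases = [
    # (true label, model's ranked differential diagnoses)
    ("melanoma", ["melanoma", "dysplastic nevus", "seborrheic keratosis"]),
    ("melanoma", ["benign nevus", "melanoma", "lentigo"]),
    ("benign",   ["benign nevus", "lentigo", "dermatofibroma"]),
    ("benign",   ["melanoma", "benign nevus", "lentigo"]),
]

def metrics(cases, k):
    # A case counts as "positive" if melanoma appears in the top-k list.
    tp = sum(t == "melanoma" and "melanoma" in d[:k] for t, d in cases)
    fn = sum(t == "melanoma" and "melanoma" not in d[:k] for t, d in cases)
    tn = sum(t == "benign" and "melanoma" not in d[:k] for t, d in cases)
    fp = sum(t == "benign" and "melanoma" in d[:k] for t, d in cases)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(cases)

# Widening from top-1 to top-3 raises sensitivity but lowers specificity,
# the same trade-off pattern reported by Shifai et al.
for k in (1, 3):
    sens, spec, acc = metrics(cases, k)
    print(f"top-{k}: sensitivity={sens:.2f}, specificity={spec:.2f}, accuracy={acc:.2f}")
```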

4.3. Management Advice

Only one study specifically evaluated the use of LLMs in providing melanoma management advice. Mu et al. conducted a comparative analysis of several LLMs (ChatGPT 4.0, BARD, and BingAI) to assess their performance in this context. The study used five prompts related to melanoma management. ChatGPT 4.0 consistently provided more reliable, evidence-based clinical advice, outperforming the other models, with significant differences noted compared to BARD and marginal differences compared to BingAI. However, none of the models evaluated the risks and benefits associated with their recommendations. The limited number of questions restricts the generalizability of the findings [27].

5. Discussion

This review’s findings underscore the potential of LLMs across various domains in melanoma care, including patient education, disease diagnosis, and management advice. Of particular interest is the emergence of multimodal LLMs, which integrate visual and textual data to address the complexities of medical imaging and clinical decision-making.
In patient education, LLMs demonstrated an ability to generate accurate and readable responses to common melanoma-related queries. For example, Roster et al. showed that strategic prompting can enhance the readability of ChatGPT’s outputs [28]. This finding suggests that with appropriate fine-tuning, LLMs could become valuable tools for creating accessible patient education materials, enabling individuals to make informed decisions.
In melanoma diagnosis, multimodal LLMs such as GPT-4V and LLaVA exhibited capabilities in distinguishing melanoma from benign lesions. Cirone et al. and Akrout et al. demonstrated GPT-4V's superior performance [29,33], particularly in handling variations in skin tone and image manipulations [33]. Zhou et al. presented SkinGPT-4, a multimodal LLM trained on a large collection of skin disease images and clinical notes. SkinGPT-4 demonstrated the ability to accurately diagnose various skin conditions and provide interactive treatment recommendations [36]. Beyond LLMs, AI-based methods, particularly those utilizing dermoscopic images, have shown promising results in assisting with melanoma detection. A systematic review by Patel et al. found that AI-based algorithms achieved a higher area under the ROC curve (>80%) than dermatologists in detecting melanoma from dermoscopic images [37]. However, it is important to recognize that multimodal LLMs are not yet reliable for independent clinical use. Their performance may be influenced by factors such as dataset limitations, image quality, and the lack of clinical context.
Despite these limitations, multimodal LLMs may hold promise for applications in medical education. Sorin et al. explored the potential of multimodal LLMs in ophthalmology education, suggesting that they could significantly impact this field by providing detailed explanations of ocular examination and imaging findings [38]. Similarly, in the context of melanoma and dermatology, multimodal LLMs could assist students in identifying and describing lesion characteristics, considering differential diagnoses, and developing their clinical reasoning skills.
Mu et al. investigated the use of LLMs for management advice and found that ChatGPT provided more reliable and evidence-based recommendations compared to BARD and BingAI. However, all models were limited by a lack of depth and specificity, reducing their utility in individualized clinical decision-making [27]. This finding emphasizes the need for further refinement and validation of LLMs to ensure that their recommendations align with clinical guidelines.
The limitations of this review include the small number of studies, heterogeneity in methodologies, and variations in evaluation metrics. Additionally, most studies had small sample sizes and did not involve patients in the question selection process. Furthermore, most studies focused on general melanoma questions rather than specific clinical scenarios.

6. Conclusions

This review highlights the potential of LLMs, particularly multimodal models, in improving melanoma care through patient education, diagnosis, and management advice. While these technologies show promise, they remain assistive tools that complement, rather than substitute, medical expertise. Despite promising results, current LLM applications require further refinement to ensure clinical utility, and their use should always be under physician supervision. Future studies should explore fine-tuning these models on large dermatological databases and incorporate expert knowledge.

Author Contributions

Conceptualization, M.Z., E.K. and S.S.; methodology, M.Z., E.K. and S.S.; validation, E.K. and S.S.; data curation, M.Z.; writing—original draft preparation, M.Z.; writing—review and editing, E.K., S.S., G.N.N., B.S.G., M.H. and S.G.; visualization, M.Z.; supervision, E.K. and S.S. equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hadi, M.U.; Al Tashi, Q.; Shah, A.; Qureshi, R.; Muneer, A.; Irfan, M.; Zafar, A.; Shaikh, M.B.; Akhtar, N.; Wu, J.; et al. Large Language Models: A Comprehensive Survey of Applications, Challenges, Limitations, and Future Prospects. TechRxiv 2024.
2. Clusmann, J.; Kolbinger, F.R.; Muti, H.S.; Carrero, Z.I.; Eckardt, J.N.; Laleh, N.G.; Löffler, C.M.L.; Schwarzkopf, S.C.; Unger, M.; Veldhuizen, G.P.; et al. The future landscape of large language models in medicine. Commun. Med. 2023, 3, 141.
3. Mudrik, A.; Nadkarni, G.N.; Efros, O.; Glicksberg, B.S.; Klang, E.; Soffer, S. Exploring the role of Large Language Models (LLMs) in hematology: A systematic review of applications, benefits, and limitations. medRxiv 2024, 2024.04.26.24306358.
4. Preiksaitis, C.; Ashenburg, N.; Bunney, G.; Chu, A.; Kabeer, R.; Riley, F.; Ribeira, R.; Rose, C. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review. JMIR Med. Inform. 2024, 12, e53787.
5. Pressman, S.M.; Borna, S.; Gomez-Cabello, C.A.; Haider, S.A.; Haider, C.R.; Forte, A.J. Clinical and Surgical Applications of Large Language Models: A Systematic Review. J. Clin. Med. 2024, 13, 3041.
6. Klang, E.; Sourosh, A.; Nadkarni, G.N. Evaluating the role of ChatGPT in gastroenterology: A comprehensive systematic review of applications, benefits, and limitations. Ther. Adv. Gastroenterol. 2023, 16, 17562848231218618.
7. Glicksberg, B.S.; Timsina, P.; Patel, D.; Sawant, A.; Vaid, A.; Raut, G.; Charney, A.W.; Apakama, D.; Carr, B.G.; Freeman, R.; et al. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J. Am. Med. Inform. Assoc. 2024, 31, 1921–1928.
8. Sallam, M.; Salim, N.A.; Barakat, M.; Al-Tammemi, A.B. ChatGPT applications in medical, dental, pharmacy, and public health education: A descriptive study highlighting the advantages and limitations. Narra J. 2023, 3, e103.
9. Deng, J.; Heybati, K.; Shammas-Toma, M. When vision meets reality: Exploring the clinical applicability of GPT-4 with vision. Clin. Imaging 2024, 108, 110101.
10. Duarte, A.F.; Sousa-Pinto, B.; Azevedo, L.F.; Barros, A.M.; Puig, S.; Malvehy, J.; Haneke, E.; Correia, O. Clinical ABCDE rule for early melanoma detection. Eur. J. Dermatol. 2021, 31, 771–778.
11. Davis, L.E.; Shalin, S.C.; Tackett, A.J. Current state of melanoma diagnosis and treatment. Cancer Biol. Ther. 2019, 20, 1366–1379.
12. Zhang, Z.; Zhang, J.; Duan, L.; Tan, C. ChatGPT in dermatology: Exploring the limited utility amidst the tech hype. Front. Med. 2024, 10, 1308229.
13. Sarker, I.H. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions. SN Comput. Sci. 2021, 2, 420.
14. Sheikh, H.; Prins, C.; Schrijvers, E. Artificial Intelligence: Definition and Background. In Mission AI: The New System Technology; Springer International Publishing: Cham, Switzerland, 2023; pp. 15–41.
15. Almarie, B.; Teixeira, P.E.P.; Pacheco-Barrios, K.; Rossetti, C.A.; Fregni, F. Editorial—The Use of Large Language Models in Science: Opportunities and Challenges. Princ. Pract. Clin. Res. J. 2023, 9, 1.
16. ChatGPT. Available online: https://chatgpt.com/?oai-dm=1 (accessed on 29 November 2024).
17. Llama 3.2. Available online: https://www.llama.com/ (accessed on 29 November 2024).
18. Open Archives Initiative. Available online: https://www.openarchives.org/ (accessed on 29 November 2024).
19. Greenspan, H.; van Ginneken, B.; Summers, R.M. Guest Editorial Deep Learning in Medical Imaging: Overview and Future Promise of an Exciting New Technique. IEEE Trans. Med. Imaging 2016, 35, 1153–1159.
20. Nazi, Z.A.; Peng, W. Large Language Models in Healthcare and Medical Domain: A Review. Informatics 2024, 11, 57.
21. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736.
22. Zheng, B.; Gou, B.; Kil, J.; Sun, H.; Su, Y. GPT-4V(ision) is a Generalist Web Agent, if Grounded. Proc. Mach. Learn. Res. 2024, 235, 61349–61385.
23. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
24. Moons, K.G.M.; de Groot, J.A.H.; Bouwmeester, W.; Vergouwe, Y.; Mallett, S.; Altman, D.G.; Reitsma, J.B.; Collins, G.S. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: The CHARMS checklist. PLoS Med. 2014, 11, e1001744.
25. Schiavo, J.H. PROSPERO: An International Register of Systematic Review Protocols. Med. Ref. Serv. Q. 2019, 38, 171–180.
26. Whiting, P.F.; Rutjes, A.W.; Westwood, M.E.; Mallett, S.; Deeks, J.J.; Reitsma, J.B.; Leeflang, M.M.; Sterne, J.A.; Bossuyt, P.M.; QUADAS-2 Group. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 2011, 155, 529–536.
27. Mu, X.; Lim, B.; Seth, I.; Xie, Y.; Cevik, J.; Sofiadellis, F.; Hunter-Smith, D.J.; Rozen, W.M. Comparison of large language models in management advice for melanoma: Google's AI BARD, BingAI and ChatGPT. Skin Health Dis. 2023, 4, ski2-313.
28. Roster, K.; Kann, R.B.; Farabi, B.; Gronbeck, C.; Brownstone, N.; Lipner, S.R. Readability and Health Literacy Scores for ChatGPT-Generated Dermatology Public Education Materials: Cross-Sectional Analysis of Sunscreen and Melanoma Questions. JMIR Dermatol. 2024, 7, e50163.
29. Akrout, M.; Cirone, K.D.; Vender, R. Evaluation of Vision LLMs GTP-4V and LLaVA for the Recognition of Features Characteristic of Melanoma. J. Cutan. Med. Surg. 2024, 28, 98–99.
30. Deliyannis, E.P.; Paul, N.; Patel, P.U.; Papanikolaou, M. Comparative performance analysis of ChatGPT 3.5, ChatGPT 4.0 and Bard in answering common patient questions on melanoma. Clin. Exp. Dermatol. 2024, 49, 743–746.
31. Shifai, N.; van Doorn, R.; Malvehy, J.; Sangers, T.E. Can ChatGPT vision diagnose melanoma? An exploratory diagnostic accuracy study. J. Am. Acad. Dermatol. 2024, 90, 1057–1059.
32. Karampinis, E.; Toli, O.; Georgopoulou, K.-E.; Kampra, E.; Spyridonidou, C.; Schulze, A.-V.R.; Zafiriou, E. Can Artificial Intelligence "Hold" a Dermoscope?—The Evaluation of an Artificial Intelligence Chatbot to Translate the Dermoscopic Language. Diagnostics 2024, 14, 1165.
33. Cirone, K.; Akrout, M.; Abid, L.; Oakley, A. Assessing the Utility of Multimodal Large Language Models (GPT-4 Vision and Large Language and Vision Assistant) in Identifying Melanoma Across Different Skin Tones. JMIR Dermatol. 2024, 7, e55508.
34. Anguita, R.; Downie, C.; Ferro Desideri, L.; Sagoo, M.S. Assessing large language models' accuracy in providing patient support for choroidal melanoma. Eye 2024, 38, 3113–3117.
35. Young, J.N.; O'Hagan, R.; Poplausky, D.; Levoska, M.A.; Gulati, N.; Ungar, B.; Ungar, J. The utility of ChatGPT in generating patient-facing and clinical responses for melanoma. J. Am. Acad. Dermatol. 2023, 89, 602–604.
36. Zhou, J.; He, X.; Sun, L.; Xu, J.; Chen, X.; Chu, Y.; Zhou, L.; Liao, X.; Zhang, B.; Afvari, S.; et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat. Commun. 2024, 15, 5649.
37. Patel, R.H.; Foltz, E.A.; Witkowski, A.; Ludzik, J. Analysis of Artificial Intelligence-Based Approaches Applied to Non-Invasive Imaging for Early Detection of Melanoma: A Systematic Review. Cancers 2023, 15, 4694.
38. Sorin, V.; Kapelushnik, N.; Hecht, I.; Zloto, O.; Glicksberg, B.S.; Bufman, H.; Barash, Y.; Nadkarni, G.N.; Klang, E. GPT-4 Multimodal Analysis on Ophthalmology Clinical Cases Including Text and Images. medRxiv 2023, 2023.11.24.23298953.
Figure 1. Hierarchy diagram of artificial intelligence (AI) terms.
Figure 2. Flow diagram of the search and inclusion process based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
Table 1. Risk of bias according to the QUADAS-2.

First Author | Patient Selection | Index Test | Reference Standard | Flow and Timing | Overall Bias
Young J. N. [35] | Low | Moderate | Moderate | Low | Low to moderate
Anguita R. [34] | Low | Moderate | Moderate | Low | Low to moderate
Deliyannis E. P. [30] | Low | Moderate | Moderate | Low | Low to moderate
Roster K. [28] | Moderate | Low | Low | Low | Low to moderate
Cirone K. [33] | Moderate | Moderate | Moderate | Low | Moderate
Shifai N. [31] | Moderate | Moderate | Low | Low | Low to moderate
Akrout M. [29] | Moderate | Moderate | Moderate | Low | Moderate
Karampinis E. [32] | Moderate | Moderate | Moderate | Low | Moderate
Mu X. [27] | Moderate | Moderate | Moderate | Low | Moderate

Judgement: cell entries give the risk of bias (low, moderate, or high) for each QUADAS-2 domain.
Table 2. Details about reviewed articles.

Group | Title | First Author | Journal | Year
Patient education | The utility of ChatGPT in generating patient-facing and clinical responses for melanoma | Young J. N. [35] | Journal of the American Academy of Dermatology | 2023
Patient education | Assessing large language models' accuracy in providing patient support for choroidal melanoma | Anguita R. [34] | Eye (Lond) | 2024
Patient education | Comparative performance analysis of ChatGPT 3.5, ChatGPT 4.0 and Bard in answering common patient questions on melanoma | Deliyannis E. P. [30] | Clinical and Experimental Dermatology | 2024
Patient education | Readability and Health Literacy Scores for ChatGPT-Generated Dermatology Public Education Materials: Cross-Sectional Analysis of Sunscreen and Melanoma Questions | Roster K. [28] | JMIR Dermatology | 2024
Melanoma diagnosis | Assessing the Utility of Multimodal Large Language Models (GPT-4 Vision and Large Language and Vision Assistant) in Identifying Melanoma Across Different Skin Tones | Cirone K. [33] | JMIR Dermatology | 2024
Melanoma diagnosis | Can ChatGPT Vision Diagnose Melanoma? An Exploratory Diagnostic Accuracy Study | Shifai N. [31] | Journal of the American Academy of Dermatology | 2024
Melanoma diagnosis | Evaluation of Vision LLMs GTP-4V and LLaVA for the Recognition of Features Characteristic of Melanoma | Akrout M. [29] | Journal of Cutaneous Medicine and Surgery | 2024
Diagnosis of melanoma and medical education | Can Artificial Intelligence "Hold" a Dermoscope? The Evaluation of an Artificial Intelligence Chatbot to Translate the Dermoscopic Language | Karampinis E. [32] | Diagnostics (Basel) | 2024
Management advice | Comparison of large language models in management advice for melanoma: Google's AI BARD, BingAI and ChatGPT | Mu X. [27] | Skin Health and Disease | 2023
Table 3. A summary of the reviewed articles.

Young J. N. [35]
Model used: ChatGPT 4.0
Objective: Assess the appropriateness, clinical applicability, accuracy, and readability of ChatGPT 4.0 responses to melanoma-related questions.
Reference standard: Three board-certified dermatologists.
Sample size: 25 melanoma-related patient questions.
Main findings: Accuracy 4.88/5 with 80% agreement (Fleiss kappa 0.808, p < 0.001); appropriateness 92%; sufficiency 64%; readability: average FRES 42.67 (college-level readability).
Conclusion: ChatGPT 4.0 generates mostly accurate, but not sufficient, responses to melanoma patient questions, and presents them at a level too advanced for public use.

Anguita R. [34]
Model used: ChatGPT 3.5, Bing AI, DocsGPT beta
Objective: Evaluate the accuracy of information provided by LLMs in response to common questions about choroidal melanoma.
Reference standard: Three ocular oncology experts.
Sample size: 27 questions: 12 on medical advice and 15 on pre- and post-operative advice.
Main findings: Medical advice accuracy: GPT 3.5 92%, Bing AI 58%, DocsGPT 58%. Pre- and post-operative advice accuracy: GPT 3.5 86%, Bing AI 86%, DocsGPT 73%. 57% of responses varied across triplicated queries (Cohen's kappa = 0.43, p < 0.05).
Conclusion: The three models answer most patient questions accurately, with no significant differences between the models.

Deliyannis E. P. [30]
Model used: ChatGPT 3.5, ChatGPT 4.0, Google Bard
Objective: Evaluate and compare the accuracy, readability, comprehensiveness, and reproducibility of responses provided by ChatGPT 3.5, ChatGPT 4.0, and Google Bard to common melanoma patient questions.
Reference standard: A consultant dermatologist and a senior dermatology trainee.
Sample size: 205 questions identified; 22 questions selected.
Main findings: Total score across all four parameters, readability, comprehensiveness, and reproducibility (each out of 5): ChatGPT 3.5: 4.51, 4.68, 4.38, 4.41; ChatGPT 4.0: 4.43, 4.65, 4.4, 4.2; Bard: 4.14, 4.35, 4.09, 3.89. ChatGPT 3.5 and 4.0 consistently scored higher than Bard on all parameters.
Conclusion: ChatGPT and BARD may generate educational responses to common patient queries; both versions of ChatGPT outperform BARD.

Roster K. [28]
Model used: ChatGPT
Objective: Evaluate the readability of ChatGPT-generated public education dermatology materials on sunscreen and melanoma, and determine whether strategic prompting can improve readability to meet the American Medical Association (AMA) guidelines (6th-grade reading level or less).
Reference standard: Readability was compared to the AAD; accuracy was evaluated by three dermatology residents. The study evaluated initial ChatGPT responses and responses after two rounds of strategic prompting.
Sample size: 42 prompts, sourced from the American Academy of Dermatology (AAD) website's frequently asked questions (FAQs).
Main findings: Melanoma FAQ readability (FRES score, average grade): AAD 56.2, 9th grade; initial ChatGPT 46.5, 10th grade; ChatGPT with 2 prompts 58.9, 8th grade; ChatGPT with 3 prompts 59.3, 7th grade. Prompting lowered the reading level vs. the AAD (p = 0.007 for 3 prompts). Melanoma FAQ accuracy (scale 1 to 3): AAD 2.82; initial ChatGPT 2.89; ChatGPT with 2 prompts 2.63; ChatGPT with 3 prompts 2.62.
Conclusion: With strategic prompting, ChatGPT could enhance the readability of medical information for melanoma patients, though prompting may reduce accuracy.

Cirone K. [33]
Model used: GPT-4V, LLaVA
Objective: Assess the ability of LLMs, specifically GPT-4 Vision and LLaVA, to accurately recognize and differentiate between melanoma and benign melanocytic nevi across different skin tones.
Reference standard: Macroscopic images of melanoma and melanocytic nevi obtained from the MClass-D dataset.
Sample size: 20 text-based prompts, each tested on 3 images, resulting in 60 unique image–prompt combinations.
Main findings: GPT-4V: overall accuracy 85%; consistently provided descriptions of relevant ABCDE features; accurately identified melanoma across different skin tones and recognized alterations in images. LLaVA: overall accuracy 45%; unable to confidently identify melanoma in individuals with darker skin tones; vulnerable to visual prompt injection and manipulation, leading to diagnostic errors.
Conclusion: GPT-4V and LLaVA show potential in identifying melanoma across different skin tones, but further refinement is needed; GPT-4V outperforms LLaVA in overall accuracy.

Shifai N. [31]
Model used: ChatGPT Vision
Objective: Assess the diagnostic accuracy of ChatGPT Vision in identifying melanoma using dermoscopic images.
Reference standard: Dermoscopic images from the ISIC archives.
Sample size: 100 melanocytic lesions (50 melanomas and 50 benign nevi); the model provided 3 ranked differential diagnoses per lesion.
Main findings: Top diagnosis: sensitivity 32%, specificity 40%, diagnostic accuracy 36%. Top-3 differential diagnoses: sensitivity 56%, specificity 53.3%, diagnostic accuracy 54.7%. Malignant vs. benign (top diagnosis): sensitivity 46%, specificity 78%, diagnostic accuracy 62%. Malignant vs. benign (top-3 diagnoses): sensitivity 78%, specificity 46.7%, diagnostic accuracy 62.3%.
Conclusion: ChatGPT Vision's current capabilities are inadequate for reliable melanoma diagnosis.

Akrout M. [29]
Model used: GPT-4V, LLaVA
Objective: Assess the ability of vision LLMs to recognize, classify, and appropriately comment on the ABCDE features of melanoma lesions.
Reference standard: Macroscopic images obtained from the publicly available MD-class dataset and DermNet NZ.
Sample size: 55 unique text-based prompts consisting of questions and instructions, and image-based prompts highlighting areas of focus.
Main findings: GPT-4V: accurately described asymmetry, border, color, diameter, and evolution; inconsistently identified melanoma subtypes; vulnerable to visual prompt injections. LLaVA: accurately described asymmetry, border, and color; inaccurately assessed diameter and evolution; inconsistently identified melanoma subtypes; less vulnerable to visual prompt injections. GPT-4V outperformed LLaVA.
Conclusion: While GPT-4V and LLaVA show promise in recognizing features characteristic of melanoma, both models require further refinement to improve diagnostic accuracy and consistency.

Karampinis E. [32]
Model used: ChatGPT 3.5
Objective: Assess the clarity of dermoscopic language translated by an AI chatbot and its role in facilitating accurate diagnoses and educational opportunities for novice dermatologists.
Reference standard: 30 participants with a certification in dermoscopy.
Sample size: The survey comprised instances of dermoscopic descriptions, including 3 pigmented lesions (1 melanoma and 2 nevi).
Main findings: Pigmented lesion scores (scale 1 to 3): completeness 2.4 ± 0.88; helpful in diagnosis 2.8 ± 0.48; teaching tool 2.7 ± 0.59. For pigmented lesions, incorporating clinical patient data did not significantly change the results.
Conclusion: The AI chatbot demonstrates potential in translating dermoscopic language but requires further development to improve its accuracy and reliability for clinical use.

Mu X. [27]
Model used: ChatGPT-4, BingAI, Google's AI BARD
Objective: Compare the performance of Google's AI BARD, BingAI, and ChatGPT-4 in providing melanoma management advice based on current clinical guidelines and the literature.
Reference standard: 2 plastic surgery residents, 1 registrar, and 3 specialist plastic surgeons.
Sample size: 5 questions on melanoma management.
Main findings: Readability (Flesch Reading Ease Score, Flesch–Kincaid Grade Level): ChatGPT 35.42, 11.98; BARD 32.1, 15.03; BingAI 29.88, 13.58; mean readability was similar across models. Reliability (DISCERN score): ChatGPT 58 (±6.44); BARD 36.2 (±34.06); BingAI 49.8 (±22.28). The only statistically significant comparison was ChatGPT vs. BARD on the DISCERN score (p = 0.04).
Conclusion: ChatGPT provides more reliable, evidence-based clinical advice than BARD and BingAI; however, all models lack depth and specificity, limiting their use in individualized clinical decision-making.
Table 4. Advantages and challenges of the reviewed articles.

Young J. N. [35]
Advantages:
  • The responses were evaluated by three board-certified dermatologists, ensuring that the assessment of the AI's performance was thorough and conducted by knowledgeable professionals.
  • The agreement between the evaluators was statistically significant.
Challenges:
  • Patients were not involved in the question selection process, potentially missing patient perspectives.

Anguita R. [34]
Advantages:
  • The study compares three different LLMs, offering a broad perspective on their performance.
  • The study relies on the assessment of three experts who were blinded to the LLM they were evaluating.
Challenges:
  • The study is limited to a subtype of melanoma.
  • The study focused only on accuracy and did not evaluate other aspects.

Deliyannis E. P. [30]
Advantages:
  • Questions were identified from online sources such as Facebook groups, national foundations, and charity websites, increasing the relevance and practical importance of the questions evaluated.
  • The study compares three different LLMs, offering a broad perspective on their performance.
  • The responses were assessed for accuracy, readability, comprehensiveness, and reproducibility, providing a thorough evaluation.
Challenges:
  • Only two assessors were involved in scoring the responses, which might limit the robustness of the evaluation.
  • Readability was not assessed using the FRES score.

Roster K. [28]
Advantages:
  • The use of multiple readability and health literacy tools provides a thorough evaluation of text readability.
  • Accuracy was assessed by three dermatology residents, ensuring the reliability of the content evaluation.
  • The use of multiple prompts on the same FAQ demonstrates the model's strength in improving readability.
Challenges:
  • The study only evaluates ChatGPT, limiting comparison with other LLMs.
  • It is unclear how many prompts specifically addressed melanoma.

Cirone K. [33]
Advantages:
  • The use of multiple LLMs offers a broad perspective on their performance.
  • Evaluation of the models' ability to handle image manipulations and consider skin tone variations demonstrates their effectiveness across different diagnostic factors.
Challenges:
  • Absence of statistical significance tests.
  • The number of benign nevi vs. melanomas that were recognized or unrecognized is not specified, so the reader cannot interpret the sensitivity and specificity of the diagnosis.
  • The study does not specify the number of evaluators who assessed the accuracy of the results, nor their proficiency.

Shifai N. [31]
Advantages:
  • The study uses a balanced dataset with an equal number of melanomas and benign nevi, improving the credibility of the study.
  • The evaluation uses sensitivity and specificity metrics to assess the model's diagnostic performance for both positive and negative cases.
Challenges:
  • The absence of intermediate melanocytic lesions, such as dysplastic nevi, oversimplifies the evaluation compared to routine clinical settings.
  • Factors such as anatomic site, skin type, nevi subtype, melanoma subtype, and tumor thickness were not considered in the analyses.

Akrout M. [29]
Advantages:
  • The study utilized a balanced dataset covering various melanoma stages, which enhances the robustness of the evaluation.
  • The evaluation included metrics for describing ABCDE features, identifying melanoma subtypes, and handling visual prompt injections, offering a detailed assessment of model performance.
  • The use of multiple LLMs offers a broad perspective on their performance.
Challenges:
  • No statistical tools were used.
  • The study utilized "textbook" or idealized images of melanoma, which may not accurately represent the diverse range of lesions encountered in real-world clinical settings.
  • The evaluators' identities and their proficiency in interpreting the model outcomes are unknown.

Karampinis E. [32]
Advantages:
  • The results are based on feedback from 30 participants, providing diverse insights into the chatbot's performance.
  • The prompts were evaluated both with and without incorporating additional clinical patient data.
Challenges:
  • Only three descriptions of pigmented lesions were used.
  • The study did not focus specifically on melanotic lesions.

Mu X. [27]
Advantages:
  • The study involves a panel of experienced board-certified plastic surgeons to assess the responses.
  • The use of multiple readability metrics provides a thorough evaluation of text readability.
  • The comparison of multiple LLMs offers a broad perspective on their performance.
Challenges:
  • The small number of questions limits the generalizability of the results.
  • The questions examined were mostly general and did not address a patient's clinical background.
  • The study evaluates LLMs' responses based solely on existing guidelines, without considering newer research that may provide more up-to-date information.
