Dermatology “AI Babylon”: Cross-Language Evaluation of AI-Crafted Dermatology Descriptions
Abstract
1. Introduction
1.1. The Linguistic Properties of a Chatbot-Crafted Text
1.2. Chatbot-Crafted Text in Different Languages
2. Materials and Methods
3. Results
3.1. Readability Scores Amongst Languages
3.2. Medical Terminology-Based Assessment of Chatbot-Created Dermatology Descriptions and Their Comparison in Multiple Languages
3.3. Chatbot-by-Chatbot Assessment of Multiple-Language Dermatology Descriptions
3.4. CLEAR Tool-Based Assessment of the Chatbot’s Dermatology Descriptions
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Bhuyan, S.S.; Sateesh, V.; Mukul, N.; Galvankar, A.; Mahmood, A.; Nauman, M.; Rai, A.; Bordoloi, K.; Basu, U.; Samuel, J. Generative Artificial Intelligence Use in Healthcare: Opportunities for Clinical Excellence and Administrative Efficiency. J. Med. Syst. 2025, 49, 10.
- Al Nazi, Z.; Peng, W. Large Language Models in Healthcare and Medical Domain: A Review. Informatics 2024, 11, 57.
- Vartiainen, H.; Tedre, M. How Text-to-Image Generative AI Is Transforming Mediated Action. IEEE Comput. Graph. Appl. 2024, 44, 12–22.
- Boit, S.; Patil, R. A Prompt Engineering Framework for Large Language Model–Based Mental Health Chatbots: Conceptual Framework. JMIR Ment. Health 2025, 12, e75078.
- Kalyan, K.S. A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4. Nat. Lang. Process. J. 2024, 6, 100048.
- Karampinis, E.; Bozi Tzetzi, D.A.; Pappa, G.; Koumaki, D.; Sgouros, D.; Vakirlis, E.; Liakou, A.; Papakonstantis, M.; Papadakis, M.; Mantzaris, D.; et al. Use of a Large Language Model as a Dermatology Case Narrator: Exploring the Dynamics of a Chatbot as an Educational Tool in Dermatology. JMIR Dermatol. 2025, 8, e72058.
- Rahman, M.M.; Watanobe, Y. ChatGPT for Education and Research: Opportunities, Threats, and Strategies. Appl. Sci. 2023, 13, 5783.
- Tengler, K.; Brandhofer, G. Exploring the Difference and Quality of AI-Generated versus Human-Written Texts. Discov. Educ. 2025, 4, 113.
- Hakam, H.T.; Prill, R.; Korte, L.; Lovreković, B.; Ostojić, M.; Ramadanov, N.; Muehlensiepen, F. Human-Written vs AI-Generated Texts in Orthopedic Academic Literature: Comparative Qualitative Analysis. JMIR Form. Res. 2024, 8, e52164.
- Kar, S.K.; Bansal, T.; Modi, S.; Singh, A. How Sensitive Are the Free AI-Detector Tools in Detecting AI-Generated Texts? A Comparison of Popular AI-Detector Tools. Indian J. Psychol. Med. 2025, 47, 275–278.
- Herbold, S.; Hautli-Janisz, A.; Heuer, U.; Kikteva, Z.; Trautsch, A. A Large-Scale Comparison of Human-Written versus ChatGPT-Generated Essays. Sci. Rep. 2023, 13, 18617.
- Guo, B.; Zhang, X.; Wang, Z.; Jiang, M.; Nie, J.; Ding, Y.; Yue, J.; Wu, Y. How Close Is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv 2023, arXiv:2301.07597.
- Zhou, J.; Zhang, Y.; Luo, Q.; Parker, A.G.; De Choudhury, M. Synthetic Lies: Understanding AI-Generated Misinformation and Evaluating Algorithmic and Human Solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, Hamburg, Germany, 23–28 April 2023; ACM: New York, NY, USA, 2023; pp. 1–20.
- Georgiou, G.P. Differentiating Between Human-Written and AI-Generated Texts Using Automatically Extracted Linguistic Features. Information 2025, 16, 979.
- Tanrıverdi, S.; Söylemez, N. Use of Artificial Intelligence in Planning Postoperative Nursing Care in Laparoscopic Cholecystectomy Patients: Comparison of ChatGPT and Student Practice. Nurse Educ. Pract. 2025, 87, 104515.
- Garcia, P.; Ma, S.P.; Shah, S.; Smith, M.; Jeong, Y.; Devon-Sand, A.; Tai-Seale, M.; Takazawa, K.; Clutter, D.; Vogt, K.; et al. Artificial Intelligence–Generated Draft Replies to Patient Inbox Messages. JAMA Netw. Open 2024, 7, e243201.
- Xie, Y.; Seth, I.; Rozen, W.M.; Hunter-Smith, D.J. Evaluation of the Artificial Intelligence Chatbot on Breast Reconstruction and Its Efficacy in Surgical Research: A Case Study. Aesthetic Plast. Surg. 2023, 47, 2360–2369.
- Liang, C.X.; Tian, P.; Yin, C.H.; Yua, Y.; An-Hou, W.; Ming, L.; Song, X.; Wang, T.; Bi, Z.; Liu, M. A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks. arXiv 2025, arXiv:2411.06284.
- Karampinis, E.; Toli, O.; Georgopoulou, K.-E.; Kampra, E.; Spyridonidou, C.; Roussaki Schulze, A.-V.; Zafiriou, E. Can Artificial Intelligence “Hold” a Dermoscope?—The Evaluation of an Artificial Intelligence Chatbot to Translate the Dermoscopic Language. Diagnostics 2024, 14, 1165.
- Zhang, Z.; Huang, X. The Impact of Chatbots Based on Large Language Models on Second Language Vocabulary Acquisition. Heliyon 2024, 10, e25370.
- Terzis, R.; Salam, B.; Nowak, S.; Mueller, P.-T.; Mesropyan, N.; Oberlinkels, L.; Efferoth, A.F.; Kravchenko, D.; Voigt, M.; Ginzburg, D.; et al. Evaluation of GPT-4o for Multilingual Translation of Radiology Reports across Imaging Modalities. Eur. J. Radiol. 2025, 191, 112341.
- Al Rousan, R.; Jaradat, R.; Malkawi, M. ChatGPT Translation vs. Human Translation: An Examination of a Literary Text. Cogent Soc. Sci. 2025, 11, 2472916.
- Martínez, G.; Conde, J.; Reviriego, P.; Merino-Gómez, E.; Hernández, J.A.; Lombardi, F. How Many Words Does ChatGPT Know? The Answer Is ChatWords. arXiv 2023, arXiv:2309.16777.
- Harigai, A.; Toyama, Y.; Nagano, M.; Abe, M.; Kawabata, M.; Li, L.; Yamamura, J.; Takase, K. Response Accuracy of GPT-4 across Languages: Insights from an Expert-Level Diagnostic Radiology Examination in Japan. Jpn. J. Radiol. 2025, 43, 319–329.
- Zheng, C.; Ye, H.; Guo, J.; Yang, J.; Fei, P.; Yuan, Y.; Huang, D.; Huang, Y.; Peng, J.; Xie, X.; et al. Development and Evaluation of a Large Language Model of Ophthalmology in Chinese. Br. J. Ophthalmol. 2024, 108, 1390–1397.
- Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med. Educ. 2023, 9, e45312.
- Yao, Z.; Duan, L.; Xu, S.; Chi, L.; Sheng, D. Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations. JMIR Med. Inform. 2025, 13, e69485.
- Kelloniemi, M.; Koljonen, V. AI Did Not Pass Finnish Plastic Surgery Written Board Examination. J. Plast. Reconstr. Aesthetic Surg. 2023, 87, 172–179.
- Wu, J.; Wu, X.; Qiu, Z.; Li, M.; Lin, S.; Zhang, Y.; Zheng, Y.; Yuan, C.; Yang, J. Large Language Models Leverage External Knowledge to Extend Clinical Insight beyond Language Boundaries. J. Am. Med. Inform. Assoc. 2024, 31, 2054–2064.
- Toyama, Y.; Harigai, A.; Abe, M.; Nagano, M.; Kawabata, M.; Seki, Y.; Takase, K. Performance Evaluation of ChatGPT, GPT-4, and Bard on the Official Board Examination of the Japan Radiology Society. Jpn. J. Radiol. 2024, 42, 201–207.
- Seghier, M.L. ChatGPT: Not All Languages Are Equal. Nature 2023, 615, 216.
- Sallam, M.; Al-Mahzoum, K.; Alshuaib, O.; Alhajri, H.; Alotaibi, F.; Alkhurainej, D.; Al-Balwah, M.Y.; Barakat, M.; Egger, J. Language Discrepancies in the Performance of Generative Artificial Intelligence Models: An Examination of Infectious Disease Queries in English and Arabic. BMC Infect. Dis. 2024, 24, 799.
- Samaan, J.S.; Yeo, Y.H.; Ng, W.H.; Ting, P.-S.; Trivedi, H.; Vipani, A.; Yang, J.D.; Liran, O.; Spiegel, B.; Kuo, A.; et al. ChatGPT’s Ability to Comprehend and Answer Cirrhosis Related Questions in Arabic. Arab. J. Gastroenterol. 2023, 24, 145–148.
- Menezes, M.C.S.; Hoffmann, A.F.; Tan, A.L.M.; Nalbandyan, M.; Omenn, G.S.; Mazzotti, D.R.; Hernández-Arango, A.; Visweswaran, S.; Venkatesh, S.; Mandl, K.D.; et al. The Potential of Generative Pre-Trained Transformer 4 (GPT-4) to Analyse Medical Notes in Three Different Languages: A Retrospective Model-Evaluation Study. Lancet Digit. Health 2025, 7, e35–e43.
- Cheng, E.; Anampa, J.D.; Bernabe-Ramirez, C.; Lin, J.; Xue, X.; Isasi, C.R.; Moadel-Robblee, A.B.; Chu, E. Artificial Intelligence Chatbots and Their Responses to Most Searched Spanish Cancer Questions. Cancer Med. 2025, 14, e71364.
- Gimeno, A.; Krause, K.; D’Souza, S.; Walsh, C.G. Completeness and Readability of GPT-4-Generated Multilingual Discharge Instructions in the Pediatric Emergency Department. JAMIA Open 2024, 7, ooae050.
- Gonzalez Fiol, A.; Mootz, A.A.; He, Z.; Delgado, C.; Ortiz, V.; Reale, S.C. Accuracy of Spanish and English-Generated ChatGPT Responses to Commonly Asked Patient Questions about Labor Epidurals: A Survey-Based Study among Bilingual Obstetric Anesthesia Experts. Int. J. Obstet. Anesth. 2025, 61, 104290.
- Pugliese, N.; Polverini, D.; Lombardi, R.; Pennisi, G.; Ravaioli, F.; Armandi, A.; Buzzetti, E.; Dalbeni, A.; Liguori, A.; Mantovani, A.; et al. Evaluation of ChatGPT as a Counselling Tool for Italian-Speaking MASLD Patients: Assessment of Accuracy, Completeness and Comprehensibility. J. Pers. Med. 2024, 14, 568.
- Mikhail, D.; Mihalache, A.; Huang, R.S.; Khairy, T.; Popovic, M.M.; Milad, D.; Shor, R.; Pereira, A.; Kwok, J.; Yan, P.; et al. Performance of ChatGPT in French Language Analysis of Multimodal Retinal Cases. J. Fr. Ophtalmol. 2025, 48, 104391.
- Menz, B.D.; Modi, N.D.; Abuhelwa, A.Y.; Ruanglertboon, W.; Vitry, A.; Gao, Y.; Li, L.X.; Chhetri, R.; Chu, B.; Bacchi, S.; et al. Generative AI Chatbots for Reliable Cancer Information: Evaluating Web-Search, Multilingual, and Reference Capabilities of Emerging Large Language Models. Eur. J. Cancer 2025, 218, 115274.
- Singla, R.; Lodhi, S.; Kibret, T.; Jegatheswaran, J.; Glavinovic, T.; Massicotte-Azarniouch, D.; Karpinski, J.; Powell, R.; Burns, K.; Sood, M.M.; et al. Accuracy, Clarity, and Comprehensiveness of ChatGPT Outputs for Commonly Asked Questions About Living Kidney Donation. Clin. Transplant. 2025, 39, e70303.
- Sallam, M.; Alasfoor, I.M.; Khalid, S.W.; Al-Mulla, R.I.; Al-Farajat, A.; Mijwil, M.M.; Zahrawi, R.; Sallam, M.; Egger, J.; Al-Adwan, A.S. Chinese Generative AI Models (DeepSeek and Qwen) Rival ChatGPT-4 in Ophthalmology Queries with Excellent Performance in Arabic and English. Narra J. 2025, 5, e2371.
- Sallam, M.; Al-Mahzoum, K.; Almutawaa, R.A.; Alhashash, J.A.; Dashti, R.A.; AlSafy, D.R.; Almutairi, R.A.; Barakat, M. The Performance of OpenAI ChatGPT-4 and Google Gemini in Virology Multiple-Choice Questions: A Comparative Analysis of English and Arabic Responses. BMC Res. Notes 2024, 17, 247.
- Sallam, M.; Barakat, M.; Sallam, M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence–Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interact. J. Med. Res. 2024, 13, e54704.
- Skrzypczak, T.; Mamak, M. Assessing the Readability of Online Health Information for Colonoscopy—Analysis of Articles in 22 European Languages. J. Cancer Educ. 2023, 38, 1865–1870.
- Calafato, R.; Gudim, F. Literature in Contemporary Foreign Language School Textbooks in Russia: Content, Approaches, and Readability. Lang. Teach. Res. 2022, 26, 826–846.
- Skrzypczak, T.; Skrzypczak, A.; Szepietowski, J.C. The Importance of Readability: A Guide to Understanding Alopecia Areata through Multilingual Online Resources. Acta Derm. Venereol. 2024, 104, adv41046.
- Sebo, P.; de Lucia, S. Performance of Machine Translators in Translating French Medical Research Abstracts to English: A Comparative Study of DeepL, Google Translate, and CUBBITT. PLoS ONE 2024, 19, e0297183.
- Balk, E.M.; Chung, M.; Chen, M.L.; Chang, L.K.W.; Trikalinos, T.A. Data Extraction from Machine-Translated versus Original Language Randomized Trial Reports: A Comparative Study. Syst. Rev. 2013, 2, 97.
- Das, A.; Madke, B.; Jakhar, D.; Neema, S.; Kaur, I.; Kumar, P.; Pradhan, S. Named Signs and Metaphoric Terminologies in Dermoscopy: A Compilation. Indian J. Dermatol. Venereol. Leprol. 2022, 88, 855.
- Paganelli, A.; Spadafora, M.; Navarrete-Dechent, C.; Guida, S.; Pellacani, G.; Longo, C. Natural Language Processing in Dermatology: A Systematic Literature Review and State of the Art. J. Eur. Acad. Dermatol. Venereol. 2024, 38, 2225–2234.
- Roster, K.; Kann, R.B.; Farabi, B.; Gronbeck, C.; Brownstone, N.; Lipner, S.R. Readability and Health Literacy Scores for ChatGPT-Generated Dermatology Public Education Materials: Cross-Sectional Analysis of Sunscreen and Melanoma Questions. JMIR Dermatol. 2024, 7, e50163.
- Ünal, M.; Balevi Akkese, İ. Assessment of the Readability Levels of Turkish-Language Websites on Hybrid Prostheses: A Methodological Study. KTO Karatay Üniversitesi Sağlık Bilim. Derg. 2025, 6, 308–318.
- Chang, E.; Sung, S. Use of SNOMED CT in Large Language Models: Scoping Review. JMIR Med. Inform. 2024, 12, e62924.
- Cazzaniga, G.; Eccher, A.; Munari, E.; Marletta, S.; Bonoldi, E.; Della Mea, V.; Cadei, M.; Sbaraglia, M.; Guerriero, A.; Dei Tos, A.P.; et al. Natural Language Processing to Extract SNOMED-CT Codes from Pathological Reports. Pathologica 2023, 115, 318–324.


| Study | Foreign Language | Comparison | Result |
|---|---|---|---|
| [27] | Chinese | Comparison between LLMs trained on different language corpora, using the Chinese National Medical Licensing Examination as the reference point. | LLMs trained mostly on English texts and those trained primarily on Chinese texts both perform well on the examination, although the Chinese-trained models achieve slightly better results. |
| [28] | Finnish | Assessment of ChatGPT and Microsoft Bing exam performance on Finnish medical exams. | The LLMs did not pass the test. |
| [29] | Chinese | Evaluation of the performance of ChatGPT on the China National Medical Licensing Examination. | When used without additional training, ChatGPT did not achieve a passing score on the exam. However, when combined with external medical knowledge frameworks such as Knowledge and Few-Shot Enhancement In-Context Learning, LLMs of different sizes showed consistent and substantial performance improvements. |
| [30] | Japanese | Assessment of the accuracy of large language models (LLMs) in generating responses to questions in clinical radiology practice, based on the Japan Radiology Board Examination. | GPT-4 scored 65% when answering Japanese questions. The performance of GPT-4 was also domain-dependent, as the model performed significantly better in nuclear medicine than in diagnostic radiology. GPT-4 also performed better on lower-order questions than on higher-order questions. |
| [31] | French and Arabic | Testing of ChatGPT in a set of neuroscience questions. | The richness of ChatGPT’s response and the intelligibility of its writing in Arabic and French languages were notably inferior to that in English. |
| [32] | Arabic | Comparison of AI models’ efficiency for infectious disease queries in English and Arabic. | Queries in English consistently achieved the highest performance, with Bard performing best, followed by Bing, ChatGPT-4, and ChatGPT-3.5. A similar pattern appeared in Arabic queries, though the differences were not statistically significant. A generally inferior performance of the tested generative AI models was observed for Arabic compared to English, despite being rated “above average”. |
| [33] | Arabic | Comparison of AI model efficiency for cirrhosis in English and Arabic. | The performance of ChatGPT in Arabic was less accurate than that in English. |
| [34] | French, Italian, and Spanish | Medical note assessment in French, Italian, English, and Spanish. | The results of the model evaluation study showed that ChatGPT-4 is accurate when analyzing medical notes in three different languages. |
| [35] | Spanish | Evaluation of chatbots Spanish responses on cancer questions. | AI chatbots generated good-quality Spanish responses. |
| [36] | Spanish | Completeness and readability of ChatGPT-4 discharge instructions for common pediatric emergency room complaints in Spanish. | GPT-4 generated discharge instructions in both English and Spanish that were easy to read and adjusted the reading level appropriately. The English instructions generally demonstrated greater completeness compared to the Spanish versions. |
| [37] | Spanish | Evaluation of ChatGPT’s answers to ten questions on labor epidurals in Spanish and English. | ChatGPT’s responses in Spanish were less accurate than its answers in English, especially concerning the impact of labor epidurals on the progression of labor and delivery method. |
| [38] | Italian | Comparison of the accuracy and completeness of English vs. Italian answers on metabolic dysfunction-associated steatotic liver disease (MASLD). | While language does not seem to impact ChatGPT’s ability to deliver clear and thorough counseling to MASLD patients, its accuracy is still inadequate in some areas. |
| [39] | French | Comparison of differential diagnoses based on English and French prompts including retina cases. | GPT-4 performed similarly in English and French while ophthalmic images were identified in both languages as critical for correct diagnosis. |
| [40] | English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, and Arabic | Comparison of seven AI chatbot LLMs on three simple cancer-related queries across eight languages. | Hallucinations were less frequent in English. |
| [41] | French | Assessment of the effectiveness of ChatGPT responses on kidney donation in different languages. | Nephrologists showed moderate agreement for English responses and poor agreement for French responses. Kidney donors exhibited high agreement for English but low for French. |
| [42] | Arabic | Comparison of ophthalmology queries in Arabic vs. English based on multiple AI models, including ChatGPT and DeepSeek. | English remains the higher-performing language overall. Arabic outputs were slightly lower across all models. |
| [43] | Arabic | Comparison of the performance of ChatGPT-4 and Gemini in answering virology multiple-choice questions in both English and Arabic, and evaluation of the quality of the content they generate. | ChatGPT-4 and Gemini showed stronger performance in English than in Arabic, with ChatGPT-4 consistently outperforming Gemini in both accuracy and completeness. |
| Language | Prompt Requesting a Macroscopic Description of the Dermatology Image | Prompt Requesting a Dermoscopic Description of the Dermatology Image |
|---|---|---|
| English | Describe what you observe in this clinical image in one short paragraph, focusing on the visible dermatological features (color, texture, distribution, and possible diagnosis). | Describe what you observe in this dermoscopic image in one short paragraph, focusing on the visible dermoscopic structures (colors, patterns, borders, and possible diagnosis). |
| French | Décrivez en un court paragraphe ce que vous observez sur cette image clinique, en vous concentrant sur les caractéristiques dermatologiques visibles (couleur, texture, distribution et diagnostic possible). | Décrivez en un court paragraphe ce que vous observez sur cette image dermoscopique, en vous concentrant sur les structures dermoscopiques visibles (couleurs, motifs, contours et diagnostic possible). |
| German | Beschreiben Sie in einem kurzen Absatz, was Sie auf diesem klinischen Bild beobachten, und konzentrieren Sie sich dabei auf die sichtbaren dermatologischen Merkmale (Farbe, Textur, Verteilung und mögliche Diagnose). | Beschreiben Sie in einem kurzen Absatz, was Sie auf diesem dermoskopischen Bild beobachten, und konzentrieren Sie sich dabei auf die sichtbaren dermoskopischen Strukturen (Farben, Muster, Grenzen und mögliche Diagnose). |
| Greek | Περιγράψτε τι παρατηρείτε σε αυτήν την κλινική εικόνα σε μία σύντομη παράγραφο, εστιάζοντας στα ορατά δερματολογικά χαρακτηριστικά (χρώμα, υφή, κατανομή και πιθανή διάγνωση). | Περιγράψτε τι παρατηρείτε σε αυτήν την δερματοσκοπική εικόνα σε μία σύντομη παράγραφο, εστιάζοντας στις ορατές δερματοσκοπικές δομές (χρώματα, μοτίβα, όρια και πιθανή διάγνωση). |
| METRICS Parameters | METRICS Results |
|---|---|
| Model | Gemini 2.0 |
| Evaluation | Language differences in creating content in dermatology. |
| Timing | The survey was performed from 1 September 2025 to 15 December 2025. |
| Range/Randomization | Diversity of images provided and randomness in test cases or prompts. |
| Individual factors | Dermatology and dermoscopy knowledge of the participants; personal opinions. |
| Count | 60 prompts for each language and 30 prompts for each image. |
| Specificity of prompts | Reading difficulty assessment, use of similar clinical terms, CLEAR criteria amongst the prompts. |
| Language | English, German, French, Greek. |
| Language | Macroscopy Readability Index | Dermoscopy Readability Index |
|---|---|---|
| English | 61.67 ± 7.16 | 69.13 ± 6.63 |
| French | 66.23 ± 6.44 | 65.2 ± 6.81 |
| Greek | 62.3 ± 6.12 | 60.53 ± 4.77 |
| German | 60.5 ± 5.38 | 61.53 ± 5.31 |
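The indices above sit on a 0–100-style readability scale. As an illustration only (this excerpt does not state which formula was applied to each language), a Flesch Reading Ease computation can be sketched in Python; the vowel-group syllable counter is a crude English approximation, not a real syllabifier:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels (including y). Real readability
    # tools use language-specific syllabification rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease:
    #   206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

sample = "The lesion shows a red scaly plaque. Borders are sharp."
print(round(flesch_reading_ease(sample), 1))
```

Higher scores indicate easier text, which matches the direction of the indices tabulated above.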
| | Dermoscopy Language—Post Hoc p Value Comparison | Macroscopy Language—Post Hoc p Value Comparison |
|---|---|---|
| Greek vs. French | 0.02 | 0.08 |
| Greek vs. English | <0.01 | 0.98 |
| Greek vs. German | 0.9 | 0.69 |
| French vs. English | 0.06 | 0.03 |
| French vs. German | 0.09 | <0.01 |
| English vs. German | <0.01 | 0.89 |
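As a hedged sketch of how pairwise post hoc p values such as those above might be obtained (the study's exact statistical test is not stated in this excerpt), a two-sided permutation test on the difference of mean readability scores can be written in pure Python:

```python
import random

def permutation_p(a, b, n_iter=5000, seed=0):
    # Two-sided permutation test on the difference of group means.
    # This is a generic stand-in for a pairwise post hoc comparison,
    # not the study's reported procedure.
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical per-description readability scores for two languages:
print(permutation_p([61, 63, 60, 62, 64], [69, 70, 68, 71, 67]))
```

With clearly separated samples the returned p value is small; identical samples return 1.0 by construction.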
| Category | Clinical Term | SNOMED CT Concept ID |
|---|---|---|
| Diagnosis | Plaque psoriasis | 200965009 |
| Morphology | Plaque (skin lesion) | 1522000 |
| | Erythematous plaque | 72768000 |
| | Papule | 25694009 |
| | Indurated/raised lesion | 260399008 |
| Surface/Texture | Hyperkeratosis | 26996000 |
| | Scaly skin | 271761007 |
| | Thickened skin | 263899003 |
| Color/Inflammation | Erythema | 247441003 |
| Anatomical Location | Extensor surface of limb | 249973009 |
| | Elbow | 127949000 |
| Distribution Pattern | Localized lesion | 255471002 |
| | Grouped/clustered lesions | 255504006 |
| | Mismatch Percentage Range in French Language | Mismatch Percentage Range in Greek Language | Mismatch Percentage Range in German Language |
|---|---|---|---|
| Macroscopy | 0–15% | 0–40% | 0–40% |
| Dermoscopy | 12.5–55.6% | 16.7–50% | 5–60% |
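A terminology table like the SNOMED CT list above can anchor a simple cross-language comparison: map each chatbot description onto a shared English term vocabulary, then count how many reference terms fail to reappear. The sketch below uses hypothetical term lists (`english` and `french_mapped` are illustrative, not study data):

```python
def term_mismatch(reference_terms, candidate_terms):
    # Percentage of reference terms absent from the candidate description,
    # after case-folding both sides.
    reference = {t.lower() for t in reference_terms}
    candidate = {t.lower() for t in candidate_terms}
    missing = reference - candidate
    return 100.0 * len(missing) / len(reference)

english = ["erythema", "plaque", "scaly skin", "hyperkeratosis"]
# Hypothetical terms recovered after mapping a French description back
# to the shared English vocabulary:
french_mapped = ["erythema", "plaque", "scaly skin"]
print(term_mismatch(english, french_mapped))
```

Exact-string matching is a deliberate simplification; a real pipeline would normalize synonyms (e.g. via SNOMED CT concept IDs) before comparing.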
| Macroscopy | Yes | Maybe | No |
|---|---|---|---|
| English–French comparisons | 148 | 32 | 0 |
| English–Greek comparisons | 159 | 21 | 0 |
| English–German comparisons | 174 | 6 | 0 |
| French–Greek comparisons | 174 | 6 | 0 |
| Greek–German comparisons | 169 | 11 | 0 |
| French–German comparisons | 123 | 57 | 0 |
| Total | 947 | 133 | 0 |

| Dermoscopy | Yes | Maybe | No |
|---|---|---|---|
| English–French comparisons | 129 | 45 | 6 |
| English–Greek comparisons | 144 | 36 | 0 |
| English–German comparisons | 139 | 41 | 0 |
| French–Greek comparisons | 118 | 62 | 0 |
| Greek–German comparisons | 119 | 47 | 14 |
| French–German comparisons | 140 | 40 | 0 |
| Total | 789 | 271 | 20 |
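The Yes/Maybe/No counts above convert directly into agreement percentages. A minimal sketch, using the totals reported in the tables (947/133/0 for macroscopy and 789/271/20 for dermoscopy):

```python
def agreement_share(counts):
    # counts: dict mapping a rating label to its number of pairwise
    # comparisons; returns each label's share as a percentage.
    total = sum(counts.values())
    return {k: round(100.0 * v / total, 1) for k, v in counts.items()}

macroscopy = {"Yes": 947, "Maybe": 133, "No": 0}
dermoscopy = {"Yes": 789, "Maybe": 271, "No": 20}
print(agreement_share(macroscopy))
print(agreement_share(dermoscopy))
```

The shares make the macroscopy/dermoscopy contrast explicit: "Yes" agreement is noticeably higher for macroscopic than for dermoscopic descriptions.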
| | German Prompt 1 | German Prompt 2 | German Prompt 3 | German Prompt 4 | German Prompt 5 | German Prompt 6 |
|---|---|---|---|---|---|---|
| English Prompt 1 | Yes | Maybe | Maybe | Maybe | Maybe | Maybe |
| English Prompt 2 | Maybe | Yes | Maybe | Maybe | Maybe | Maybe |
| English Prompt 3 | Yes | Yes | Yes | Yes | Maybe | Yes |
| English Prompt 4 | Maybe | Maybe | Maybe | Maybe | Maybe | Yes |
| English Prompt 5 | Yes | Yes | Yes | Yes | Maybe | Maybe |
| English Prompt 6 | Yes | Yes | Yes | Yes | Maybe | Maybe |
| | | Completeness | Lack of False Information (Accuracy) | Evidence-Based Content | Appropriateness | Relevance |
|---|---|---|---|---|---|---|
| English | Macroscopy | 2.5 ± 0.70 | 2.49 ± 0.76 | 2.47 ± 0.72 | 2.51 ± 0.70 | 2.53 ± 0.71 |
| | Dermoscopy | 2.37 ± 0.76 | 2.22 ± 0.81 | 2.16 ± 0.83 | 2.21 ± 0.81 | 2.16 ± 0.84 |
| | Sum | 2.44 ± 0.72 | 2.35 ± 0.74 | 2.31 ± 0.76 | 2.36 ± 0.74 | 2.34 ± 0.76 |
| French | Macroscopy | 2.33 ± 0.70 | 2.22 ± 0.78 | 2.34 ± 0.74 | 2.23 ± 0.79 | 2.11 ± 0.82 |
| | Dermoscopy | 2.18 ± 0.22 | 2.03 ± 0.84 | 2.09 ± 0.8 | 2.01 ± 0.78 | 2.01 ± 0.79 |
| | Sum | 2.23 ± 0.77 | 2.19 ± 0.79 | 2.17 ± 0.78 | 2.12 ± 0.79 | 2.06 ± 0.81 |
| Greek | Macroscopy | 2.38 ± 0.75 | 2.2 ± 0.70 | 2.31 ± 0.81 | 2.2 ± 0.7 | 2.15 ± 0.85 |
| | Dermoscopy | 2.13 ± 0.87 | 1.92 ± 0.87 | 2.1 ± 0.81 | 2.04 ± 0.80 | 2.06 ± 0.82 |
| | Sum | 2.26 ± 0.77 | 2.06 ± 0.84 | 2.21 ± 0.82 | 2.12 ± 0.81 | 2.10 ± 0.84 |
| German | Macroscopy | 2.35 ± 0.70 | 2.25 ± 0.79 | 2.30 ± 0.76 | 2.2 ± 0.79 | 2.2 ± 0.79 |
| | Dermoscopy | 2.2 ± 0.77 | 2.11 ± 0.77 | 2.08 ± 0.79 | 2.15 ± 0.96 | 2.16 ± 0.79 |
| | Sum | 2.28 ± 0.79 | 2.18 ± 0.78 | 2.19 ± 0.78 | 2.18 ± 0.87 | 2.18 ± 0.78 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Published by MDPI on behalf of the Lithuanian University of Health Sciences. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Karampinis, E.; Zoumpourli, C.-M.; Kontogianni, C.; Arkoumanis, T.; Koumaki, D.; Mantzaris, D.; Filippakis, K.; Papadopoulou, M.-M.; Theofili, M.; Enechukwu, N.A.; et al. Dermatology “AI Babylon”: Cross-Language Evaluation of AI-Crafted Dermatology Descriptions. Medicina 2026, 62, 227. https://doi.org/10.3390/medicina62010227

