Search Results (33)

Search Parameters:
Keywords = Gemini 1.5 Pro

14 pages, 3600 KiB  
Article
Performance of Large Language Models in Recognizing Brain MRI Sequences: A Comparative Analysis of ChatGPT-4o, Claude 4 Opus, and Gemini 2.5 Pro
by Ali Salbas and Rasit Eren Buyuktoka
Diagnostics 2025, 15(15), 1919; https://doi.org/10.3390/diagnostics15151919 - 30 Jul 2025
Abstract
Background/Objectives: Multimodal large language models (LLMs) are increasingly used in radiology. However, their ability to recognize fundamental imaging features, including modality, anatomical region, imaging plane, contrast-enhancement status, and particularly specific magnetic resonance imaging (MRI) sequences, remains underexplored. This study aims to evaluate and compare the performance of three advanced multimodal LLMs (ChatGPT-4o, Claude 4 Opus, and Gemini 2.5 Pro) in classifying brain MRI sequences. Methods: A total of 130 brain MRI images from adult patients without pathological findings were used, representing 13 standard MRI series. Models were tested using zero-shot prompts for identifying modality, anatomical region, imaging plane, contrast-enhancement status, and MRI sequence. Accuracy was calculated, and differences among models were analyzed using Cochran’s Q test and McNemar test with Bonferroni correction. Results: ChatGPT-4o and Gemini 2.5 Pro achieved 100% accuracy in identifying the imaging plane and 98.46% in identifying contrast-enhancement status. MRI sequence classification accuracy was 97.7% for ChatGPT-4o, 93.1% for Gemini 2.5 Pro, and 73.1% for Claude 4 Opus (p < 0.001). The most frequent misclassifications involved fluid-attenuated inversion recovery (FLAIR) sequences, often misclassified as T1-weighted or diffusion-weighted sequences. Claude 4 Opus showed lower accuracy in susceptibility-weighted imaging (SWI) and apparent diffusion coefficient (ADC) sequences. Gemini 2.5 Pro exhibited occasional hallucinations, including irrelevant clinical details such as “hypoglycemia” and “Susac syndrome.” Conclusions: Multimodal LLMs demonstrate high accuracy in basic MRI recognition tasks but vary significantly in specific sequence classification tasks. Hallucinations emphasize caution in clinical use, underlining the need for validation, transparency, and expert oversight. Full article
(This article belongs to the Section Medical Imaging and Theranostics)
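
The pairwise model comparisons described in this abstract follow a standard paired-accuracy design. A minimal sketch (not the authors' code) of how McNemar's test and Cochran's Q could be applied to per-image correctness, assuming the 0/1 outcomes are available as arrays (the accuracy values below are placeholders drawn from the reported figures):

# Sketch: compare two models' per-image correctness with McNemar's test
# (hypothetical data; the article's actual 130-image results are not reproduced here).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q

# 1 = sequence identified correctly, 0 = misclassified, one entry per image
chatgpt_correct = np.random.binomial(1, 0.977, size=130)   # placeholder accuracies
claude_correct  = np.random.binomial(1, 0.731, size=130)

# 2x2 table of agreements/disagreements between the two models
table = np.array([
    [np.sum((chatgpt_correct == 1) & (claude_correct == 1)),
     np.sum((chatgpt_correct == 1) & (claude_correct == 0))],
    [np.sum((chatgpt_correct == 0) & (claude_correct == 1)),
     np.sum((chatgpt_correct == 0) & (claude_correct == 0))],
])
print(mcnemar(table, exact=True))          # paired comparison of two models

# Cochran's Q generalizes the comparison to all three models at once
gemini_correct = np.random.binomial(1, 0.931, size=130)
x = np.column_stack([chatgpt_correct, gemini_correct, claude_correct])
print(cochrans_q(x))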

35 pages, 7934 KiB  
Article
Analyzing Diagnostic Reasoning of Vision–Language Models via Zero-Shot Chain-of-Thought Prompting in Medical Visual Question Answering
by Fatema Tuj Johora Faria, Laith H. Baniata, Ahyoung Choi and Sangwoo Kang
Mathematics 2025, 13(14), 2322; https://doi.org/10.3390/math13142322 - 21 Jul 2025
Abstract
Medical Visual Question Answering (MedVQA) lies at the intersection of computer vision, natural language processing, and clinical decision-making, aiming to generate accurate responses from medical images paired with complex inquiries. Despite recent advances in vision–language models (VLMs), their use in healthcare remains limited by a lack of interpretability and a tendency to produce direct, unexplainable outputs. This opacity undermines their reliability in medical settings, where transparency and justification are critically important. To address this limitation, we propose a zero-shot chain-of-thought prompting framework that guides VLMs to perform multi-step reasoning before arriving at an answer. By encouraging the model to break down the problem, analyze both visual and contextual cues, and construct a stepwise explanation, the approach makes the reasoning process explicit and clinically meaningful. We evaluate the framework on the PMC-VQA benchmark, which includes authentic radiological images and expert-level prompts. In a comparative analysis of three leading VLMs, Gemini 2.5 Pro achieved the highest accuracy (72.48%), followed by Claude 3.5 Sonnet (69.00%) and GPT-4o Mini (67.33%). The results demonstrate that chain-of-thought prompting significantly improves both reasoning transparency and performance in MedVQA tasks. Full article
(This article belongs to the Special Issue Mathematical Foundations in NLP: Applications and Challenges)
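
As an illustration of the prompting strategy only, a zero-shot chain-of-thought prompt for a MedVQA item might be assembled as in the sketch below; the wording, option format, and answer-parsing convention are hypothetical, not the paper's actual prompt:

# Sketch of a zero-shot chain-of-thought prompt for a MedVQA item.
# The exact wording used by the authors is not reproduced here; this only
# illustrates the "reason step by step, then answer" structure.
def build_cot_prompt(question: str, options: list[str]) -> str:
    option_block = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        "You are given a medical image and a question about it.\n"
        f"Question: {question}\n"
        f"Options:\n{option_block}\n\n"
        "Let's think step by step. First describe the relevant visual findings, "
        "then relate them to each option, and finally state your answer as a "
        "single letter on the last line in the form 'Answer: <letter>'."
    )

def parse_answer(model_output: str) -> str | None:
    # Take the last line of the form "Answer: X"
    for line in reversed(model_output.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()[:1].upper()
    return None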

11 pages, 386 KiB  
Article
Benchmarking AI Chatbots for Maternal Lactation Support: A Cross-Platform Evaluation of Quality, Readability, and Clinical Accuracy
by İlke Özer Aslan and Mustafa Törehan Aslan
Healthcare 2025, 13(14), 1756; https://doi.org/10.3390/healthcare13141756 - 20 Jul 2025
Abstract
Background and Objective: Large language model (LLM)–based chatbots are increasingly utilized by postpartum individuals seeking guidance on breastfeeding. However, the chatbots’ content quality, readability, and alignment with clinical guidelines remain uncertain. This study was conducted to evaluate and compare the quality, readability, and factual accuracy of responses generated by three publicly accessible AI chatbots—ChatGPT-4o Pro, Gemini 2.5 Pro, and Copilot Pro—when prompted with common maternal questions related to breast-milk supply. Methods: Twenty frequently asked breastfeeding-related questions were submitted to each chatbot in separate sessions. The responses were paraphrased to enable standardized scoring and were then evaluated using three validated tools: ensuring quality information for patients (EQIP), the simple measure of gobbledygook (SMOG), and the global quality scale (GQS). Factual accuracy was benchmarked against WHO, ACOG, CDC, and NICE guidelines using a three-point rubric. Additional user experience metrics included response time, character count, content density, and structural formatting. Statistical comparisons were performed using the Kruskal–Wallis and Wilcoxon rank-sum tests with Bonferroni correction. Results: ChatGPT-4o Pro achieved the highest overall performance across all primary outcomes: EQIP score (85.7 ± 2.4%), SMOG score (9.78 ± 0.22), and GQS rating (4.55 ± 0.50), followed by Gemini 2.5 Pro and Copilot Pro (p < 0.001 for all comparisons). ChatGPT-4o Pro also demonstrated the highest factual alignment with clinical guidelines (95%), while Copilot showed more frequent omissions or simplifications. Differences in response time and formatting quality were statistically significant, although not always clinically meaningful. Conclusions: ChatGPT-4o Pro outperforms other chatbots in delivering structured, readable, and guideline-concordant breastfeeding information. However, substantial variability persists across the platforms, and none should be considered a substitute for professional guidance. Importantly, the phenomenon of AI hallucinations—where chatbots may generate factually incorrect or fabricated information—remains a critical risk that must be addressed to ensure safe integration into maternal health communication. Future efforts should focus on improving the transparency, accuracy, and multilingual reliability of AI chatbots to ensure their safe integration into maternal health communications. Full article
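
The SMOG readability metric used here is computed from the number of polysyllabic words, normalized to a 30-sentence sample. A rough sketch follows, using a crude vowel-group syllable heuristic rather than the validated tooling the authors presumably used:

# Sketch: SMOG readability grade for a chatbot response.
# Syllables are estimated with a crude vowel-group heuristic; published
# SMOG scores use proper syllabification, so treat this as illustrative only.
import math
import re

def estimate_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if estimate_syllables(w) >= 3)
    # Standard SMOG formula, normalized to 30 sentences
    return 1.0430 * math.sqrt(polysyllables * (30 / max(len(sentences), 1))) + 3.1291

print(smog_grade("Your milk supply usually adjusts to how often your baby feeds. "
                 "Frequent nursing or pumping signals your body to produce more milk."))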

14 pages, 679 KiB  
Article
Enhancing Patient Outcomes in Head and Neck Cancer Radiotherapy: Integration of Electronic Patient-Reported Outcomes and Artificial Intelligence-Driven Oncology Care Using Large Language Models
by Chih-Ying Liao, Chin-Nan Chu, Ting-Chun Lin, Tzu-Yao Chou and Meng-Hsiun Tsai
Cancers 2025, 17(14), 2345; https://doi.org/10.3390/cancers17142345 - 15 Jul 2025
Abstract
Background: Electronic patient-reported outcomes (ePROs) enable real-time symptom monitoring and early intervention in oncology. Large language models (LLMs), when combined with retrieval-augmented generation (RAG), offer scalable Artificial Intelligence (AI)-driven education tailored to individual patient needs. However, few studies have examined the feasibility and clinical impact of integrating ePRO with LLM-RAG feedback during radiotherapy in high-toxicity settings such as head and neck cancer. Methods: This prospective observational study enrolled 42 patients with head and neck cancer undergoing radiotherapy from January to December 2024. Patients completed ePRO entries twice weekly using a web-based platform. Following each entry, an LLM-RAG system (Gemini 1.5-based) generated real-time educational feedback using National Comprehensive Cancer Network (NCCN) guidelines and institutional resources. Primary outcomes included percentage weight loss and treatment interruption days. Statistical analyses included t-tests, linear regression, and receiver operating characteristic (ROC) analysis. A threshold of ≥6 ePRO entries was used for subgroup analysis. Results: Patients had a mean age of 53.6 years and submitted an average of 8.0 ePRO entries. Frequent ePRO users (≥6 entries) had significantly less weight loss (4.45% vs. 7.57%, p = 0.021) and fewer treatment interruptions (0.67 vs. 2.50 days, p = 0.002). Chemotherapy, moderate-to-severe pain, and lower ePRO submission frequency were associated with greater weight loss. ePRO submission frequency was negatively correlated with both weight loss and treatment interruption days. The most commonly reported symptoms were appetite loss, fatigue, and nausea. Conclusions: Integrating LLM-RAG feedback with ePRO systems is feasible and may enhance symptom control, treatment continuity, and patient engagement in head and neck cancer radiotherapy. Further studies are warranted to validate the clinical benefits of AI-supported ePRO platforms in routine care. Full article
(This article belongs to the Special Issue Personalized Radiotherapy in Cancer Care (2nd Edition))
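
The subgroup comparison and ROC analysis reported above follow a routine pattern; a minimal sketch with invented placeholder data (not the study's patient records):

# Sketch: compare weight loss between frequent (>=6 entries) and infrequent
# ePRO users, plus an ROC analysis of entry count vs. an interruption outcome.
# All numbers below are placeholders, not the study's data.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
epro_entries = rng.integers(1, 15, size=42)
weight_loss_pct = 9 - 0.5 * epro_entries + rng.normal(0, 2, size=42)
# placeholder outcome: above-median weight loss treated as "interrupted"
interrupted = (weight_loss_pct > np.median(weight_loss_pct)).astype(int)

frequent = weight_loss_pct[epro_entries >= 6]
infrequent = weight_loss_pct[epro_entries < 6]
t, p = stats.ttest_ind(frequent, infrequent)
print(f"mean loss {frequent.mean():.2f}% vs {infrequent.mean():.2f}%, p = {p:.3f}")

# ROC: how well does (fewer) ePRO entries discriminate the interruption outcome?
auc = roc_auc_score(interrupted, -epro_entries)
fpr, tpr, thresholds = roc_curve(interrupted, -epro_entries)
print(f"AUC = {auc:.2f}")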

19 pages, 1186 KiB  
Article
Synthetic Patient–Physician Conversations Simulated by Large Language Models: A Multi-Dimensional Evaluation
by Syed Ali Haider, Srinivasagam Prabha, Cesar Abraham Gomez-Cabello, Sahar Borna, Ariana Genovese, Maissa Trabilsy, Bernardo G. Collaco, Nadia G. Wood, Sanjay Bagaria, Cui Tao and Antonio Jorge Forte
Sensors 2025, 25(14), 4305; https://doi.org/10.3390/s25144305 - 10 Jul 2025
Abstract
Background: Data accessibility remains a significant barrier in healthcare AI due to privacy constraints and logistical challenges. Synthetic data, which mimics real patient information while remaining both realistic and non-identifiable, offers a promising solution. Large Language Models (LLMs) create new opportunities to generate high-fidelity clinical conversations between patients and physicians. However, the value of this synthetic data depends on careful evaluation of its realism, accuracy, and practical relevance. Objective: To assess the performance of four leading LLMs: ChatGPT 4.5, ChatGPT 4o, Claude 3.7 Sonnet, and Gemini Pro 2.5 in generating synthetic transcripts of patient–physician interactions in plastic surgery scenarios. Methods: Each model generated transcripts for ten plastic surgery scenarios. Transcripts were independently evaluated by three clinically trained raters using a seven-criterion rubric: Medical Accuracy, Realism, Persona Consistency, Fidelity, Empathy, Relevancy, and Usability. Raters were blinded to the model identity to reduce bias. Each was rated on a 5-point Likert scale, yielding 840 total evaluations. Descriptive statistics were computed, and a two-way repeated measures ANOVA was used to test for differences across models and metrics. In addition, transcripts were analyzed using automated linguistic and content-based metrics. Results: All models achieved strong performance, with mean ratings exceeding 4.5 across all criteria. Gemini 2.5 Pro received mean scores (5.00 ± 0.00) in Medical Accuracy, Realism, Persona Consistency, Relevancy, and Usability. Claude 3.7 Sonnet matched the scores in Persona Consistency and Relevancy and led in Empathy (4.96 ± 0.18). ChatGPT 4.5 also achieved perfect scores in Relevancy, with high scores in Empathy (4.93 ± 0.25) and Usability (4.96 ± 0.18). ChatGPT 4o demonstrated consistently strong but slightly lower performance across most dimensions. ANOVA revealed no statistically significant differences across models (F(3, 6) = 0.85, p = 0.52). Automated analysis showed substantial variation in transcript length, style, and content richness: Gemini 2.5 Pro generated the longest and most emotionally expressive dialogues, while ChatGPT 4o produced the shortest and most concise outputs. Conclusions: Leading LLMs can generate medically accurate, emotionally appropriate synthetic dialogues suitable for educational and research use. Despite high performance, demographic homogeneity in generated patients highlights the need for improved diversity and bias mitigation in model outputs. These findings support the cautious, context-aware integration of LLM-generated dialogues into medical training, simulation, and research. Full article
(This article belongs to the Special Issue Feature Papers in Smart Sensing and Intelligent Sensors 2025)
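
The two-way repeated-measures ANOVA over models and metrics can be sketched as below, assuming the ratings sit in a long-format table with rater, model, metric, and score columns (the layout, column names, and scores are assumptions, not the study's files):

# Sketch: two-way repeated-measures ANOVA over model x metric,
# with the rater as the repeated-measures subject.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
models = ["ChatGPT 4.5", "ChatGPT 4o", "Claude 3.7 Sonnet", "Gemini 2.5 Pro"]
metrics = ["Accuracy", "Realism", "Consistency", "Fidelity", "Empathy", "Relevancy", "Usability"]

rows = [
    {"rater": r, "model": m, "metric": k,
     "score": np.clip(rng.normal(4.8, 0.3), 1, 5)}   # placeholder Likert scores
    for r in range(3) for m in models for k in metrics for _ in range(10)
]
# Average the ten scenario ratings so each rater has one value per cell
df = pd.DataFrame(rows).groupby(["rater", "model", "metric"], as_index=False)["score"].mean()

print(AnovaRM(df, depvar="score", subject="rater", within=["model", "metric"]).fit())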

23 pages, 439 KiB  
Article
Evaluating Proprietary and Open-Weight Large Language Models as Universal Decimal Classification Recommender Systems
by Mladen Borovič, Eftimije Tomovski, Tom Li Dobnik and Sandi Majninger
Appl. Sci. 2025, 15(14), 7666; https://doi.org/10.3390/app15147666 - 8 Jul 2025
Abstract
Manual assignment of Universal Decimal Classification (UDC) codes is time-consuming and inconsistent as digital library collections expand. This study evaluates 17 large language models (LLMs) as UDC classification recommender systems, including ChatGPT variants (GPT-3.5, GPT-4o, and o1-mini), Claude models (3-Haiku and 3.5-Haiku), Gemini series (1.0-Pro, 1.5-Flash, and 2.0-Flash), and Llama, Gemma, Mixtral, and DeepSeek architectures. Models were evaluated zero-shot on 900 English and Slovenian academic theses manually classified by professional librarians. Classification prompts utilized the RISEN framework, with evaluation using Levenshtein and Jaro–Winkler similarity, and a novel adjusted hierarchical similarity metric capturing UDC’s faceted structure. Proprietary systems consistently outperformed open-weight alternatives by 5–10% across metrics. GPT-4o achieved the highest hierarchical alignment, while open-weight models showed progressive improvements but remained behind commercial systems. Performance was comparable between languages, demonstrating robust multilingual capabilities. The results indicate that LLM-powered recommender systems can enhance library classification workflows. Future research incorporating fine-tuning and retrieval-augmented approaches may enable fully automated, high-precision UDC assignment systems. Full article
(This article belongs to the Special Issue Advanced Models and Algorithms for Recommender Systems)
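
The string-similarity side of the evaluation can be illustrated with a small sketch; the "hierarchical_similarity" below is a simplified prefix-depth stand-in for the paper's adjusted hierarchical similarity metric, not its exact definition:

# Sketch: compare a predicted UDC code against the librarian-assigned one.
# Levenshtein similarity plus a simplified hierarchical (prefix-depth) score.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

def hierarchical_similarity(pred: str, gold: str) -> float:
    # Fraction of the gold code's leading digits matched before the first mismatch.
    digits_pred = [c for c in pred if c.isdigit()]
    digits_gold = [c for c in gold if c.isdigit()]
    match = 0
    for p, g in zip(digits_pred, digits_gold):
        if p != g:
            break
        match += 1
    return match / max(len(digits_gold), 1)

print(levenshtein_similarity("004.85", "004.8"))
print(hierarchical_similarity("004.85", "004.8"))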

19 pages, 821 KiB  
Article
Adaptive RAG-Assisted MRI Platform (ARAMP) for Brain Metastasis Detection and Reporting: A Retrospective Evaluation Using Post-Contrast T1-Weighted Imaging
by Kuo-Chen Wu, Fatt Yang Chew, Kang-Lun Cheng, Wu-Chung Shen, Pei-Chun Yeh, Chia-Hung Kao, Wan-Yuo Guo and Shih-Sheng Chang
Bioengineering 2025, 12(7), 698; https://doi.org/10.3390/bioengineering12070698 - 26 Jun 2025
Abstract
This study aimed to develop and evaluate an AI-driven platform, the Adaptive RAG Assistant MRI Platform (ARAMP), for assisting in the diagnosis and reporting of brain metastases using post-contrast axial T1-weighted (AX_T1+C) MRI. In this retrospective study, 2447 cancer patients who underwent MRI between 2010 and 2022 were screened. A subset of 100 randomized patients with confirmed brain metastases and 100 matched non-cancer controls were selected for evaluation. ARAMP integrates quantitative radiomic feature extraction with an adaptive Retrieval-Augmented Generation (RAG) framework based on a large language model (LLM, GPT-4o), incorporating five authoritative medical references. Three board-certified neuroradiologists and an independent LLM (Gemini 2.0 Pro) assessed ARAMP performance. Metrics of the assessment included Pre-/Post-Trained Inference Difference, Inter-Inference Agreement, and Sensitivity. Post-training, ARAMP achieved a mean Inference Similarity score of 67.45%. Inter-Inference Agreement among radiologists averaged 30.20% (p = 0.01). Sensitivity for brain metastasis detection improved from 0.84 (pre-training) to 0.98 (post-training). ARAMP also showed improved reliability in identifying brain metastases as the primary diagnosis post-RAG integration. This adaptive RAG-based framework may improve diagnostic efficiency and standardization in radiological workflows. Full article
(This article belongs to the Section Biosignal Processing)

14 pages, 804 KiB  
Article
Using Large Language Models to Infer Problematic Instagram Use from User Engagement Metrics: Agreement Across Models and Validation with Self-Reports
by Davide Marengo and Michele Settanni
Electronics 2025, 14(13), 2548; https://doi.org/10.3390/electronics14132548 - 24 Jun 2025
Abstract
This study investigated the feasibility of using large language models (LLMs) to infer problematic Instagram use, which refers to excessive or compulsive engagement with the platform that negatively impacts users’ daily functioning, productivity, or well-being, from a limited set of metrics of user engagement in the platform. Specifically, we explored whether OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro could accurately predict self-reported problematic use tendencies based solely on readily available user engagement metrics like daily time spent on the platform, weekly posts and stories, and follower/following counts. Our sample comprised 775 Italian Instagram users (61.6% female; aged 18–63), who were recruited through a snowball sampling method. Item-level and total scores derived by querying the LLMs’ application programming interfaces were correlated with self-report items and the total score measured via an adapted Bergen Social Media Addiction Scale. LLM-inferred scores showed positive correlations with both item-level and total scores for problematic Instagram use. The strongest correlations were observed for the total scores, with GPT-4o achieving a correlation of r = 0.414 and Gemini 1.5 Pro achieving a correlation of r = 0.319. In cross-validated regression analyses, adding LLM-generated scores, especially from GPT-4o, significantly improved the prediction of problematic Instagram use compared to using usage metrics alone. GPT-4o’s performance in random forest models was comparable to models trained directly on Instagram metrics, demonstrating its ability to capture complex, non-linear relationships indicative of addiction without needing extensive model training. This study provides compelling preliminary evidence for the use of LLMs in inferring problematic Instagram use from limited data points, opening exciting new avenues for research and intervention. Full article
(This article belongs to the Special Issue Application of Data Mining in Social Media)
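
The validation step amounts to correlating LLM-inferred scores with the self-report total and checking whether they add predictive value over raw usage metrics; a sketch with placeholder arrays (not the study's data, and not its exact modelling pipeline):

# Sketch: correlate LLM-inferred problematic-use scores with self-reports,
# then test whether they improve a cross-validated regression over usage
# metrics alone.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 775
usage = rng.normal(size=(n, 4))                       # time spent, posts, stories, followers
self_report = usage @ [0.4, 0.1, 0.1, 0.05] + rng.normal(size=n)
llm_score = 0.5 * self_report + rng.normal(size=n)    # stand-in for GPT-4o output

r, p = pearsonr(llm_score, self_report)
print(f"r = {r:.3f}, p = {p:.3g}")

base = cross_val_score(LinearRegression(), usage, self_report, cv=5, scoring="r2")
augmented = cross_val_score(LinearRegression(), np.column_stack([usage, llm_score]),
                            self_report, cv=5, scoring="r2")
print(f"R2 usage only: {base.mean():.3f}, usage + LLM score: {augmented.mean():.3f}")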

25 pages, 898 KiB  
Article
GenAI-Powered Text Personalization: Natural Language Processing Validation of Adaptation Capabilities
by Linh Huynh and Danielle S. McNamara
Appl. Sci. 2025, 15(12), 6791; https://doi.org/10.3390/app15126791 - 17 Jun 2025
Abstract
The authors conducted two experiments to assess the alignment between Generative AI (GenAI) text personalization and hypothetical readers’ profiles. In Experiment 1, four LLMs (i.e., Claude 3.5 Sonnet, Llama, Gemini Pro 1.5, and ChatGPT 4) were prompted to tailor 10 science texts (i.e., biology, chemistry, and physics) to accommodate four different profiles varying in knowledge, reading skills, and learning goals. Natural Language Processing (NLP) was leveraged to evaluate the GenAI-adapted texts using an array of linguistic and semantic features empirically associated with text readability. NLP analyses revealed variations in the degree to which the LLMs successfully adjusted linguistic features to suit reader profiles. Most notably, NLP highlighted inconsistent alignment between potential reader abilities and text complexity. The results pointed toward the need to augment the AI prompts using personification, chain-of-thought, and documents regarding text comprehension, text readability, and individual differences (i.e., leveraging RAG). The resulting text modifications in Experiment 2 were better aligned with readers’ profiles. Augmented prompts resulted in LLM modifications with more appropriate cohesion features tailored to high- and low-knowledge readers for optimal comprehension. This study demonstrates how LLMs can be prompted to modify text and uniquely demonstrates the application of NLP to evaluate theory-driven content personalization using GenAI. NLP offers an efficient, real-time solution to validate personalized content across multiple domains and contexts. Full article
(This article belongs to the Special Issue Applied Intelligence in Natural Language Processing)

16 pages, 1193 KiB  
Article
From Data to Decisions: Leveraging Retrieval-Augmented Generation to Balance Citation Bias in Burn Management Literature
by Ariana Genovese, Srinivasagam Prabha, Sahar Borna, Cesar A. Gomez-Cabello, Syed Ali Haider, Maissa Trabilsy, Cui Tao and Antonio Jorge Forte
Eur. Burn J. 2025, 6(2), 28; https://doi.org/10.3390/ebj6020028 - 2 Jun 2025
Abstract
(1) Burn injuries demand multidisciplinary, evidence-based care, yet the extensive literature complicates timely decision making. Retrieval-augmented generation (RAG) synthesizes research while addressing inaccuracies in pretrained models. However, citation bias in sourcing for RAG often prioritizes highly cited studies, overlooking less-cited but valuable research. This study examines RAG’s performance in burn management, comparing citation levels to enhance evidence synthesis, reduce selection bias, and guide decisions. (2) Two burn management datasets were assembled: 30 highly cited (mean: 303) and 30 less-cited (mean: 21). The Gemini-1.0-Pro-002 RAG model addressed 30 questions, ranging from foundational principles to advanced surgical approaches. Responses were evaluated for accuracy (5-point scale), readability (Flesch–Kincaid metrics), and response time with Wilcoxon rank sum tests (p < 0.05). (3) RAG achieved comparable accuracy (4.6 vs. 4.2, p = 0.49), readability (Flesch Reading Ease: 42.8 vs. 46.5, p = 0.26; Grade Level: 9.9 vs. 9.5, p = 0.29), and response time (2.8 vs. 2.5 s, p = 0.39) for the highly and less-cited datasets. (4) Less-cited research performed similarly to highly cited sources. This equivalence broadens clinicians’ access to novel, diverse insights without sacrificing quality. As plastic surgery evolves, RAG’s inclusive approach fosters innovation, improves patient care, and reduces cognitive burden by integrating underutilized studies. Embracing RAG could propel the field toward dynamic, forward-thinking care. Full article
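
A minimal sketch of the accuracy comparison between the two citation tiers, assuming the per-question 5-point rubric scores are in two Python lists (the values below are illustrative, not the study's ratings):

# Sketch: Wilcoxon rank-sum test between answers grounded in highly cited
# vs. less-cited sources.
from scipy.stats import ranksums

highly_cited_accuracy = [5, 5, 4, 5, 4, 5, 5, 4, 5, 5]   # 1-5 rubric scores (placeholders)
less_cited_accuracy   = [4, 5, 4, 4, 5, 4, 4, 5, 4, 3]

stat, p = ranksums(highly_cited_accuracy, less_cited_accuracy)
print(f"statistic = {stat:.2f}, p = {p:.3f}")
# p > 0.05 would be consistent with the paper's finding of comparable accuracy.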

13 pages, 1763 KiB  
Proceeding Paper
Transforming Petrochemical Safety Using a Multimodal AI Visual Analyzer
by Uzair Bhatti, Qamar Jaleel, Umair Aslam, Ahrad bin Riaz, Najam Saeed and Khurram Kamal
Eng. Proc. 2024, 78(1), 12; https://doi.org/10.3390/engproc2024078012 - 29 May 2025
Abstract
The petrochemical industry faces significant safety challenges, necessitating stringent protocols and advanced monitoring systems. Traditional methods rely on manual inspections and fixed sensors, often reacting to hazards only after they occur. Multimodal AI, integrating visual, sensor, and textual data, offers a transformative solution for real-time, proactive safety management. This paper evaluates AI models—Gemini 1.5 Pro, OPENAI GPT-4, and Copilot—in detecting workplace hazards, ensuring compliance with Process Safety Management (PSM) and DuPont safety frameworks. The study highlights the models’ potential in improving safety outcomes, reducing human error, and supporting continuous, data-driven risk management in petrochemical plants. This paper is the first of its kind to use the latest multimodal tech to identify the safety hazard; a similar model could be deployed in other manufacturing industries, especially the oil and gas (both upstream and downstream) industry, fertilizer industries, and production facilities. Full article

19 pages, 1840 KiB  
Article
Facial Analysis for Plastic Surgery in the Era of Artificial Intelligence: A Comparative Evaluation of Multimodal Large Language Models
by Syed Ali Haider, Srinivasagam Prabha, Cesar A. Gomez-Cabello, Sahar Borna, Ariana Genovese, Maissa Trabilsy, Adekunle Elegbede, Jenny Fei Yang, Andrea Galvao, Cui Tao and Antonio Jorge Forte
J. Clin. Med. 2025, 14(10), 3484; https://doi.org/10.3390/jcm14103484 - 16 May 2025
Abstract
Background/Objectives: Facial analysis is critical for preoperative planning in facial plastic surgery, but traditional methods can be time consuming and subjective. This study investigated the potential of Artificial Intelligence (AI) for objective and efficient facial analysis in plastic surgery, with a specific focus on Multimodal Large Language Models (MLLMs). We evaluated their ability to analyze facial skin quality, volume, symmetry, and adherence to aesthetic standards such as neoclassical facial canons and the golden ratio. Methods: We evaluated four MLLMs—ChatGPT-4o, ChatGPT-4, Gemini 1.5 Pro, and Claude 3.5 Sonnet—using two evaluation forms and 15 diverse facial images generated by a Generative Adversarial Network (GAN). The general analysis form evaluated qualitative skin features (texture, type, thickness, wrinkling, photoaging, and overall symmetry). The facial ratios form assessed quantitative structural proportions, including division into equal fifths, adherence to the rule of thirds, and compatibility with the golden ratio. MLLM assessments were compared with evaluations from a plastic surgeon and manual measurements of facial ratios. Results: The MLLMs showed promise in analyzing qualitative features, but they struggled with precise quantitative measurements of facial ratios. Mean accuracy for general analysis were ChatGPT-4o (0.61 ± 0.49), Gemini 1.5 Pro (0.60 ± 0.49), ChatGPT-4 (0.57 ± 0.50), and Claude 3.5 Sonnet (0.52 ± 0.50). In facial ratio assessments, scores were lower, with Gemini 1.5 Pro achieving the highest mean accuracy (0.39 ± 0.49). Inter-rater reliability, based on Cohen’s Kappa values, ranged from poor to high for qualitative assessments (κ > 0.7 for some questions) but was generally poor (near or below zero) for quantitative assessments. Conclusions: Current general purpose MLLMs are not yet ready to replace manual clinical assessments but may assist in general facial feature analysis. These findings are based on testing models not specifically trained for facial analysis and serve to raise awareness among clinicians regarding the current capabilities and inherent limitations of readily available MLLMs in this specialized domain. This limitation may stem from challenges with spatial reasoning and fine-grained detail extraction, which are inherent limitations of current MLLMs. Future research should focus on enhancing the numerical accuracy and reliability of MLLMs for broader application in plastic surgery, potentially through improved training methods and integration with other AI technologies such as specialized computer vision algorithms for precise landmark detection and measurement. Full article
(This article belongs to the Special Issue Innovation in Hand Surgery)
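
Inter-rater reliability of the kind reported here is typically Cohen's kappa between a model's categorical answers and the surgeon's assessment for the same questions; a brief sketch with made-up labels:

# Sketch: Cohen's kappa between one MLLM's qualitative answers and the
# plastic surgeon's ratings (labels are invented for illustration).
from sklearn.metrics import cohen_kappa_score

surgeon = ["oily", "dry", "normal", "oily", "dry", "normal", "oily", "dry"]
mllm    = ["oily", "dry", "oily",   "oily", "dry", "normal", "dry",  "dry"]

kappa = cohen_kappa_score(surgeon, mllm)
print(f"Cohen's kappa = {kappa:.2f}")   # > 0.7 would indicate high agreement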

22 pages, 25979 KiB  
Article
Advancing Early Wildfire Detection: Integration of Vision Language Models with Unmanned Aerial Vehicle Remote Sensing for Enhanced Situational Awareness
by Leon Seidel, Simon Gehringer, Tobias Raczok, Sven-Nicolas Ivens, Bernd Eckardt and Martin Maerz
Drones 2025, 9(5), 347; https://doi.org/10.3390/drones9050347 - 3 May 2025
Abstract
Early wildfire detection is critical for effective suppression efforts, necessitating rapid alerts and precise localization. While computer vision techniques offer reliable fire detection, they often lack contextual understanding. This paper addresses this limitation by utilizing Vision Language Models (VLMs) to generate structured scene descriptions from Unmanned Aerial Vehicle (UAV) imagery. UAV-based remote sensing provides diverse perspectives for potential wildfires, and state-of-the-art VLMs enable rapid and detailed scene characterization. We evaluated both cloud-based (OpenAI, Google DeepMind) and open-weight, locally deployed VLMs on a novel evaluation dataset specifically curated for understanding forest fire scenes. Our results demonstrate that relatively compact, fine-tuned VLMs can provide rich contextual information, including forest type, fire state, and fire type. Specifically, our best-performing model, ForestFireVLM-7B (fine-tuned from Qwen2-5-VL-7B), achieved a 76.6% average accuracy across all categories, surpassing the strongest closed-weight baseline (Gemini 2.0 Pro at 65.5%). Furthermore, zero-shot evaluation on the publicly available FIgLib dataset demonstrated state-of-the-art smoke detection accuracy using VLMs. Our findings highlight the potential of fine-tuned, open-weight VLMs for enhanced wildfire situational awareness via detailed scene interpretation. Full article

19 pages, 670 KiB  
Article
Quantifying Gender Bias in Large Language Models Using Information-Theoretic and Statistical Analysis
by Imran Mirza, Akbar Anbar Jafari, Cagri Ozcinar and Gholamreza Anbarjafari
Information 2025, 16(5), 358; https://doi.org/10.3390/info16050358 - 29 Apr 2025
Cited by 1
Abstract
Large language models (LLMs) have revolutionized natural language processing across diverse domains, yet they also raise critical fairness and ethical concerns, particularly regarding gender bias. In this study, we conduct a systematic, mathematically grounded investigation of gender bias in four leading LLMs—GPT-4o, Gemini 1.5 Pro, Sonnet 3.5, and LLaMA 3.1:8b—by evaluating the gender distributions produced when generating “perfect personas” for a wide range of occupational roles spanning healthcare, engineering, and professional services. Leveraging standardized prompts, controlled experimental settings, and repeated trials, our methodology quantifies bias against an ideal uniform distribution using rigorous statistical measures and information-theoretic metrics. Our results reveal marked discrepancies: GPT-4o exhibits pronounced occupational gender segregation, disproportionately linking healthcare roles to female identities while assigning male labels to engineering and physically demanding positions. In contrast, Gemini 1.5 Pro, Sonnet 3.5, and LLaMA 3.1:8b predominantly favor female assignments, albeit with less job-specific precision. These findings demonstrate how architectural decisions, training data composition, and token embedding strategies critically influence gender representation. The study underscores the urgent need for inclusive datasets, advanced bias-mitigation techniques, and continuous model audits to develop AI systems that are not only free from stereotype perpetuation but actively promote equitable and representative information processing. Full article
(This article belongs to the Special Issue Fundamental Problems of Information Studies)
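
The information-theoretic comparison against an ideal uniform gender distribution can be sketched as a KL divergence per occupation; the counts below are invented, and this is only one of the possible metrics the authors may have combined:

# Sketch: KL divergence of a model's gender assignments from a uniform
# 50/50 target, per occupation.
import numpy as np
from scipy.stats import entropy

def kl_from_uniform(counts: dict[str, int]) -> float:
    p = np.array(list(counts.values()), dtype=float)
    p = p / p.sum()
    q = np.full_like(p, 1 / len(p))          # ideal uniform distribution
    return entropy(p, q)                      # KL(p || q) in nats

print(kl_from_uniform({"female": 46, "male": 4}))   # e.g. "nurse" personas (invented counts)
print(kl_from_uniform({"female": 25, "male": 25}))  # unbiased assignment -> 0.0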

15 pages, 3491 KiB  
Article
Generative Artificial Intelligence Models in Clinical Infectious Disease Consultations: A Cross-Sectional Analysis Among Specialists and Resident Trainees
by Edwin Kwan-Yeung Chiu, Siddharth Sridhar, Samson Sai-Yin Wong, Anthony Raymond Tam, Ming-Hong Choi, Alicia Wing-Tung Lau, Wai-Ching Wong, Kelvin Hei-Yeung Chiu, Yuey-Zhun Ng, Kwok-Yung Yuen and Tom Wai-Hin Chung
Healthcare 2025, 13(7), 744; https://doi.org/10.3390/healthcare13070744 - 27 Mar 2025
Abstract
Background/Objectives: The potential of generative artificial intelligence (GenAI) to augment clinical consultation services in clinical microbiology and infectious diseases (ID) is being evaluated. Methods: This cross-sectional study evaluated the performance of four GenAI chatbots (GPT-4.0, a Custom Chatbot based on GPT-4.0, Gemini Pro, and Claude 2) by analysing 40 unique clinical scenarios. Six specialists and resident trainees from clinical microbiology or ID units conducted randomised, blinded evaluations across factual consistency, comprehensiveness, coherence, and medical harmfulness. Results: Analysis showed that GPT-4.0 achieved significantly higher composite scores compared to Gemini Pro (p = 0.001) and Claude 2 (p = 0.006). GPT-4.0 outperformed Gemini Pro and Claude 2 in factual consistency (Gemini Pro, p = 0.02; Claude 2, p = 0.02), comprehensiveness (Gemini Pro, p = 0.04; Claude 2, p = 0.03), and the absence of medical harm (Gemini Pro, p = 0.02; Claude 2, p = 0.04). Within-group comparisons showed that specialists consistently awarded higher ratings than resident trainees across all assessed domains (p < 0.001) and overall composite scores (p < 0.001). Specialists were five times more likely to consider responses as “harmless”. Overall, fewer than two-fifths of AI-generated responses were deemed “harmless”. Post hoc analysis revealed that specialists may inadvertently disregard conflicting or inaccurate information in their assessments. Conclusions: Clinical experience and domain expertise of individual clinicians significantly shaped the interpretation of AI-generated responses. In our analysis, we have demonstrated disconcerting human vulnerabilities in safeguarding against potentially harmful outputs, which seemed to be most apparent among experienced specialists. At the current stage, none of the tested AI models should be considered safe for direct clinical deployment in the absence of human supervision. Full article
