Search Results (1,390)

Search Parameters:
Keywords = ChatGPT-3.5

33 pages, 4433 KB  
Systematic Review
How Can Large Language Models Drive Environmental Sustainability? A Systematic Scoping Review
by Xiaotong Su, Ting Liu, Patrick Pang, Yiming Taclis Luo and Dennis Wong
Sustainability 2026, 18(9), 4327; https://doi.org/10.3390/su18094327 (registering DOI) - 27 Apr 2026
Abstract
Currently, Large Language Models (LLMs), exemplified by ChatGPT, are accelerating technological development across various domains, including the environmental domain, owing to their powerful text-generation and information-processing capabilities. With changes in global climate and environmental conditions, environmental sustainability has emerged as a major global challenge. Leveraging LLMs to advance environmental sustainability and mitigate current environmental problems is considered a valuable and effective approach. This study aims to systematically synthesize research progress and core challenges in current LLMs for promoting sustainability-related fields, and to comprehensively analyze the application contexts, impacts, and development potential of various LLMs within the environmental sector. Following the PRISMA-ScR guidelines, a comprehensive search was conducted across six databases: Web of Science (WOS), Scopus, ACM Digital Library, IEEE Xplore, ScienceDirect, and Google Scholar. A total of 20 articles were ultimately included for analysis. The findings indicate that LLMs play a positive role in maintaining environmental sustainability and promoting the low-carbon energy transition. The applications of LLMs span six core domains: the green transition, carbon emission management, air quality assessment, smart city operations, map analysis, and human cognition and behavioral observation. However, the training and operation of current LLMs consume considerable resources, which creates an inherent conflict with the goals of sustainable development. Future efforts must focus on developing a secure, equitable, and scalable LLM support system to advance environmental sustainability. This requires optimizing model energy efficiency and ensuring a balance between performance, reliability, and environmental impact. 
These endeavors are crucial for addressing environmental problems and guaranteeing the sustainable progression of LLMs across diverse environmental contexts. Full article

13 pages, 293 KB  
Article
A Comparison of a Customized Peripheral Artery Disease (PAD)-Specific Generative AI Chatbot and General-Purpose AI Chatbots for PAD Patient Education
by Aboubacar Cherif, Megan E. Alagna, Margaret A. Reilly, Lara Lopes, Madison Crutcher, Jennifer Schroeder, Kathryn A. Carey, Anand Brahmandam, David Liebovitz and Karen J. Ho
J. Clin. Med. 2026, 15(9), 3317; https://doi.org/10.3390/jcm15093317 (registering DOI) - 27 Apr 2026
Abstract
Background/Objective: Patients with peripheral artery disease (PAD) are known to have poor awareness and understanding of the diagnosis. The role of generative AI chatbots in improving PAD patient education is unknown. Our goal is to compare a generative AI chatbot customized for PAD patient education to publicly available AI chatbots. Methods: This is a cross-sectional comparative evaluation of the responses of four AI chatbots to ten prompts that are commonly asked questions about PAD. The three publicly available AI chatbots were ChatGPT-5, Gemini 2.5 Flash, and Claude Sonnet 4.5. We created a customized, voice AI chatbot for PAD education grounded on curated and prompt-injected guidance called Vascular Education and Resources using Artificial Intelligence, or “VERA.” De-identified chatbot-generated responses to inputs were assessed for readability (Flesch–Kincaid Grade Level, Flesch Reading Ease, Gunning Fog Index, Simple Measure of Gobbledygook Index, and Average Reading Level Consensus Score), accuracy, comprehensiveness, and patient education quality (Patient Education Materials Assessment Tool; PEMAT) using validated instruments and expert scoring rubrics. Nonparametric statistical testing was used to compare chatbot performance across all evaluation domains. Results: VERA generated the most accessible text compared to the other chatbots and produced responses at a median grade level of 6.6, which was lower than responses from the other chatbots. PAD expert-rated accuracy scores were high across all the chatbots without significant differences between them. Comprehensiveness scores were more varied and demonstrated that VERA was less comprehensive than the other chatbots. PEMAT understandability scores were uniformly high. PEMAT actionability scores were low overall but did not differ significantly across chatbots on post hoc analysis. 
Conclusions: A generative AI chatbot research tool customized for PAD patient education generates textual information about PAD that is more accessible (mean grade level 6.6) than publicly available AI chatbots without loss of accuracy, albeit with modestly reduced comprehensiveness that reflects intentional simplification for patient-centered communication. Future research will assess the acceptability and feasibility of this research tool to be adopted as part of PAD patient education. Full article
(This article belongs to the Special Issue Machine Learning in Vascular Surgery)
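The readability formulas named in this abstract (Flesch–Kincaid Grade Level, Flesch Reading Ease, Gunning Fog, SMOG) are all simple functions of sentence, word, and syllable counts. A minimal sketch of how such scores can be computed, assuming a crude vowel-group syllable counter; validated tools use dictionaries and exception lists, so their values will differ:

```python
import re
from math import sqrt

def count_syllables(word):
    # Crude approximation: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    poly = sum(1 for w in words if count_syllables(w) >= 3)  # "complex" words
    wps = len(words) / sentences   # mean words per sentence
    spw = syllables / len(words)   # mean syllables per word
    return {
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "gunning_fog": 0.4 * (wps + 100 * poly / len(words)),
        "smog": 1.0430 * sqrt(poly * 30 / sentences) + 3.1291,
    }
```

Note that SMOG is defined for samples of at least 30 sentences, and the Gunning Fog complex-word rule normally excludes proper nouns and familiar suffixes, so this sketch is indicative only.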

17 pages, 1087 KB  
Article
The Role of ChatGPT in Job Crafting: A Study of IT Professionals in Pakistan
by Seema Gul, Sajeela Rabbani and Aqsa Jaleel
Behav. Sci. 2026, 16(5), 655; https://doi.org/10.3390/bs16050655 (registering DOI) - 26 Apr 2026
Abstract
The rise of artificial intelligence (AI) tools has brought many changes to workplaces. Job crafting (JC), too, has been shaped by the use of AI tools such as ChatGPT. Drawing on Conservation of Resources theory, this study was conducted in an effort to understand the role that ChatGPT plays in job crafting by enhancing work engagement (WE) in the presence of work-related curiosity (WRC). Time-lagged data from 314 employees in the information technology (IT) sector were used to test the relationships using partial least squares structural equation modeling. The results showed that ChatGPT use and job crafting are linked in the presence of work engagement: WE mediated, and work-related curiosity moderated, the relationship between ChatGPT and job crafting. These results are instrumental in understanding the significance of AI adoption in business, and AI can serve as a potential tool for crafting jobs toward other work-related outcomes. The research holds significance for managers and policymakers in the IT sector in terms of leveraging AI adoption to predict positive behaviors in employees, and it also highlights future research avenues. Full article
(This article belongs to the Section Social Psychology)

14 pages, 3388 KB  
Article
Biological Cardiovascular Age Derived from Coronary CTA Reports Using a Large Language Model: A Novel Predictor of Major Adverse Cardiovascular Events?
by Gudrun M. Feuchtner, Yannick Scharll, Johannes Deeg, Valentin Bilgeri, Philipp Spitaler, Malik Galijasevic, Michael Swoboda, Leonhard Gruber, Gerlig Widmann and Pietro G. Lacaita
Diagnostics 2026, 16(9), 1298; https://doi.org/10.3390/diagnostics16091298 - 26 Apr 2026
Abstract
Background/Objectives: Coronary artery disease (CAD) remains the leading cause of death worldwide. Traditional cardiovascular risk assessment is based on chronological age and other clinical factors, with inherent limitations and poor accuracy. The objective was to evaluate artificial intelligence (AI)-enhanced biological cardiovascular age, calculated from coronary computed tomography angiography (CTA) reports using a large language model (LLM), as a predictor of major adverse cardiovascular events (MACE). Methods: Coronary CTA reports from symptomatic patients with suspected CAD who underwent coronary CTA for clinical indications were analyzed using an LLM (ChatGPT-4.0v, OpenAI). Patients were included if the LLM successfully extracted the key metrics: (1) the coronary artery calcium (CAC) score and (2) coronary CTA report findings (coronary stenosis severity (CAD-RADS), high-risk anatomy, non-calcified plaque, and cardiac function (LVEF and others)). Results: 386 CTA reports were uploaded, and 346 (89.6%) were included. The mean biological age (bioAGE) was 57.2 ± 10.9 years and the mean chronological age 58.5 ± 10.8 years; 137 patients (39.6%) were women. The intra-individual deviation in bioAGE was high (median: 8.8; IQR 9.98). BioAGE exceeded chronological age in 45.4% of patients and was lower or equal in 54.6%. The MACE rate was 8.7%, comprising 2 deaths, 5 myocardial infarctions, and 22 late revascularizations. The accuracy for prediction of MACE was higher for bioAGE (c = 0.768; 95% CI: 0.681–0.855, p < 0.001) than for chronological age (c = 0.590; 95% CI: 0.492–0.689, p = 0.102). Conclusions: Biological age calculation from coronary CTA reports using an LLM is feasible, yet intra-individual deviations are high. The accuracy of MACE prediction is improved by bioAGE compared with chronological age. Full article
(This article belongs to the Special Issue Advances in Cardiovascular and Vascular Imaging)
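The c-statistic used to compare bioAGE and chronological age is the probability that a randomly chosen patient with MACE is assigned a higher value than a randomly chosen patient without. A minimal sketch with made-up numbers, counting ties as half-concordant:

```python
def c_statistic(scores, events):
    """Concordance index: fraction of (event, non-event) pairs in which
    the event case has the higher score; ties count one half."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    pairs = len(pos) * len(neg)
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / pairs
```

A value of 0.5 means the score is no better than chance at separating cases from non-cases, which is why the paper's c = 0.768 versus c = 0.590 is the headline comparison.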

22 pages, 876 KB  
Article
“In ChatGPT-Powered Virtual Influencers We (Dis)Trust?”: The Privacy Paradox and the Double-Edged Sword of Ubiquitous Large Language Model (LLM) Generative AI as a General Purpose Technology (GPT) in a Human-Centered AI Ecosystem
by Seunga Venus Jin
Behav. Sci. 2026, 16(5), 651; https://doi.org/10.3390/bs16050651 (registering DOI) - 26 Apr 2026
Abstract
“Can ChatGPT become a general purpose technology?” “How does the “privacy paradox” play a role in adopting ubiquitous AI technologies in a humane AI ecosystem?” To answer these research questions, this study examined the roles of AI equality, trust in the large language model (LLM) ChatGPT, the need to belong, perceived benefits of ubiquitous AI, and privacy concerns about potentially ubiquitous generative artificial intelligence (GenAI) in a human-centered AI ecosystem. Drawing from the emerging literature on the AI divide (vs. AI equality) and AI-powered digital transformation, cross-sectional survey data were collected from current ChatGPT users. The results of testing PROCESS macro models with 5000 bootstrap samples showed the relationship between AI equality and purchase intention is mediated by trust in ChatGPT and is moderated by the need to belong. Privacy concerns about ChatGPT moderate the relationship between AI equality and perceived benefits of ubiquitous GenAI, which, in turn, mediates the relationship between AI equality and purchase intention. Ethical dilemmas in developing an equitable AI ecosystem, practical implications of the “privacy paradox” for designing trustworthy and ubiquitous AI interfaces in the dynamically evolving AI-powered digital transformation landscape and electronic marketplaces, and theoretical implications of the ChatGPT epidemic in a humane AI ecosystem for the literature on general purpose technology (GPT) are discussed. Full article
(This article belongs to the Special Issue Advanced Studies in Human-Centred AI—2nd Edition)
35 pages, 5864 KB  
Review
The State of Practice in Application of Natural Language Processing in Transportation Safety Analysis
by Mohammadjavad Bazdar, Hyun Kim, Branislav Dimitrijevic and Joyoung Lee
Appl. Sci. 2026, 16(9), 4223; https://doi.org/10.3390/app16094223 (registering DOI) - 25 Apr 2026
Abstract
This paper provides a systematic review of recent applications of NLP methods for analyzing traffic crash reports, with a focus on estimating crash severity, crash duration, and crash causation. The review covers prior research using probabilistic topic modeling methods such as LDA, STM, and hierarchical Dirichlet processes in addition to research using transformer-based language models, which include encoder-based models like BERT and PubMedBERT as well as decoder-based models like GPT, GPT2, ChatGPT, GPT-3, and LLaMA. The review starts with a systematic literature selection process with predefined inclusion criteria. We categorize the reviewed studies into the following application areas: crash severity prediction, risk factor identification in crashes, and road safety analysis. The results show several complementary advantages of using different NLP techniques to achieve different analytical goals. Topic models allow for interpretable and exploratory pattern discovery, while encoder models are well-suited for structured prediction problems. Decoder models have the additional flexibility to perform zero-shot and few-shot reasoning, which makes them useful for reasoning about under-sampled or under-reported data. Across the literature, hybrid methods that combine text and structured data outperform individual methods in terms of prediction accuracy and broad applicability. Challenges across the literature include class imbalance, lack of standardization in preprocessing and evaluation methods, and the tradeoff between prediction accuracy and interpretability of prediction models. These findings highlight the importance of aligning model selection with data availability and operational constraints, pointing toward future research directions in hybrid modeling frameworks, standardized evaluation protocols, and real-world deployment of NLP-driven traffic safety systems. Full article
(This article belongs to the Special Issue Traffic Safety Measures and Assessment: 2nd Edition)
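Of the topic-modeling methods surveyed, LDA is the usual entry point for exploratory pattern discovery in crash narratives. A minimal sketch with invented report snippets, using scikit-learn; the surveyed studies use far larger corpora and tuned hyperparameters:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical crash-report narratives (invented for illustration):
reports = [
    "driver ran red light and struck vehicle at intersection",
    "vehicle ran off road in heavy rain and hit guardrail",
    "rear end collision at signalized intersection during congestion",
    "wet pavement caused loss of control and guardrail impact",
]

# Bag-of-words counts, then a two-topic LDA fit.
X = CountVectorizer(stop_words="english").fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)  # per-report topic mixture, rows sum to 1
```

Inspecting `lda.components_` against the vectorizer's vocabulary gives the interpretable per-topic word weights that make topic models attractive for exploratory safety analysis.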

31 pages, 1741 KB  
Article
AI-Driven Approaches to System Requirements and Test Case Generation: A New Paradigm in Software Engineering
by Ziad Salem, Luay Tahat, Yasmeen Humaidan and Noor Tahat
Technologies 2026, 14(5), 260; https://doi.org/10.3390/technologies14050260 (registering DOI) - 25 Apr 2026
Abstract
Artificial intelligence (AI) is a new paradigm in software engineering that automates key phases of the development cycle. The processes of designing requirements and creating test cases are still mostly manual and prone to error. Unclear requirements can result in expensive rework and undiscovered defects in the development process, and scalability and dependability are crucial concerns in complex systems. These shortcomings highlight the need for improved methods that enhance accuracy and consistency throughout these critical phases. To generate well-organized system requirements, this article outlines a clear strategy that leverages Extended Finite State Machine (EFSM) models as formal inputs for large language models (LLMs). Five system models are used to assess the suggested framework. The comparison analysis evaluates the accuracy, completeness, test coverage, and runtime efficiency of the generated artifacts. Along with a comparison against a human-made reference standard, the study evaluates the performance of LLMs such as ChatGPT-5, Claude Sonnet 4.5, and DeepSeek V3.2. The findings demonstrate that AI models can achieve human-comparable accuracy, exceeding 90%, with EFSM-based prompting. Claude Sonnet generated the most reliable results, ChatGPT demonstrated exceptional flexibility, and DeepSeek demonstrated exceptional runtime efficiency. These findings show that human–AI workflows provide a new paradigm for scalable, traceable, and reproducible systems engineering. Full article
(This article belongs to the Section Information and Communication Technologies)
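One reason a formal state-machine input helps is that it yields an objective coverage target against which generated test cases can be checked. A toy sketch with a hypothetical door-controller model; guards and context variables, which distinguish EFSMs from plain FSMs, are omitted for brevity:

```python
from collections import defaultdict, deque

# Hypothetical miniature model: (source state, event, target state).
transitions = [
    ("closed", "open_cmd", "open"),
    ("open", "close_cmd", "closed"),
    ("open", "lock_cmd", "locked"),
    ("locked", "unlock_cmd", "open"),
]

def transition_cover(start="closed"):
    """For each transition, BFS a shortest event sequence from the start
    state that exercises it -- one abstract test case per transition."""
    adj = defaultdict(list)
    for src, ev, dst in transitions:
        adj[src].append((ev, dst))
    tests = []
    for src, ev, dst in transitions:
        queue, seen = deque([(start, [])]), {start}
        while queue:
            state, path = queue.popleft()
            if state == src:
                tests.append(path + [ev])   # reach src, then fire ev
                break
            for e, nxt in adj[state]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [e]))
    return tests
```

LLM-generated suites can then be scored by how many of these transitions their event sequences actually traverse, which is one way to make "test coverage" comparable across models.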
24 pages, 750 KB  
Article
Adversarial Evaluation of Large Language Models for Building Robust Offensive Language Detection in Moroccan Arabic
by Soufiyan Ouali, Kanza Raisi, Asmaa Mourhir, El Habib Nfaoui and Said El Garouani
Big Data Cogn. Comput. 2026, 10(5), 132; https://doi.org/10.3390/bdcc10050132 - 24 Apr 2026
Abstract
Offensive language detection is crucial for ensuring safe and inclusive digital environments. Identifying harmful content protects users and supports healthier online interactions. Despite advances in transformer-based models, particularly Large Language Models (LLMs), their application to this task remains underexplored for low-resource languages such as Moroccan Arabic, especially compared with high-resource languages. This study evaluates the performance of various open- and closed-source LLMs for offensive language detection in Moroccan Darija. The evaluated models include general-purpose LLMs such as LLaMA, Mistral, and Gemma, as well as Arabic-focused models such as ArabianGPT, Falcon Arabic, and Atlas-Chat. We also experiment with reasoning models such as DeepSeek and GPT-4. Beyond traditional evaluation metrics, we investigate the robustness of these LLMs and examine the impact of adversarial training on their performance. Moreover, we contribute to the field by creating a large, high-quality dataset. Our evaluation revealed that GPT-4o Mini achieved the best overall performance, reaching an F1-score of 88%. However, robustness testing under black-box and white-box adversarial attacks exposed notable vulnerabilities, with attack success rates reaching 30%, thereby highlighting the need for enhancement. Despite the complex morphology and linguistic variability of Moroccan Darija, adversarial training resulted in a notable improvement in both overall model performance and robustness against adversarial attacks, yielding an average increase of 20.89% in resistance to attacks. Furthermore, this approach enabled GPT-4o Mini to achieve an F1-score of 91%, surpassing the current state-of-the-art performance by 6%. These results highlight the importance of incorporating adversarial approaches in low-resource dialectal settings to effectively address linguistic variability and data scarcity. Full article
(This article belongs to the Special Issue Natural Language Processing Applications in Big Data)
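The two headline metrics here, F1-score and attack success rate, can be pinned down precisely. A minimal sketch with toy label vectors; the attack-success definition below (flips among examples the model classified correctly before the attack) is one common convention in adversarial NLP:

```python
def f1_score(y_true, y_pred, positive=1):
    # Harmonic mean of precision and recall for the positive class.
    tp = sum(t == positive == p for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def attack_success_rate(clean_preds, adv_preds, y_true):
    """Fraction of initially correct predictions flipped by the attack."""
    correct = [(c, a) for c, a, t in zip(clean_preds, adv_preds, y_true)
               if c == t]
    return sum(c != a for c, a in correct) / len(correct)
```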

8 pages, 197 KB  
Article
The Role of Large Language Models in the Promotion of Minimally Invasive Interventional Radiologic Methods in Gynecology and Obstetrics
by Iason Psilopatis, Julius Emons, Kleio Vrettou and Tibor A. Zwimpfer
J. Clin. Med. 2026, 15(9), 3234; https://doi.org/10.3390/jcm15093234 - 23 Apr 2026
Abstract
Background: Minimally invasive interventional radiology (IR) offers effective, uterus-preserving treatments for several gynecologic and obstetric conditions such as uterine fibroids, adenomyosis, and postpartum hemorrhage. Despite their efficacy, these methods remain underused, partly due to limited awareness among clinicians and patients. Large language models (LLMs) may help bridge this gap by providing accessible, reliable information. Objective: To evaluate how current LLMs address knowledge gaps and promote awareness of minimally invasive IR methods in gynecology and obstetrics. Methods: A structured ten-question instrument was used to query three publicly available LLMs (OpenEvidence, ChatGPT, and Google Gemini). Responses were analyzed for accuracy, completeness, safety considerations, and patient-centered communication. Results: All three models accurately identified a range of medical, minimally invasive, and surgical treatments for uterine fibroids, adenomyosis, and postpartum hemorrhage, with OpenEvidence and ChatGPT providing more detailed and clinically nuanced responses. OpenEvidence achieved the highest scores overall, closely followed by ChatGPT, while Google Gemini scored lower, particularly in completeness and patient-centered communication. In more complex scenarios, performance differences became more pronounced, with OpenEvidence again leading, ChatGPT performing strongly, and Google Gemini lagging behind. Overall, OpenEvidence and ChatGPT demonstrated higher accuracy, completeness, and safety considerations, whereas Google Gemini showed comparatively weaker and less consistent performance. Conclusions: LLMs can support the promotion of minimally invasive IR methods in gynecology and obstetrics, but their outputs vary considerably in quality. Ongoing refinement and integration of evidence-based sources are essential before routine use in clinical practice.
Therefore, effective collaboration between artificial intelligence (AI) developers and medical professionals is essential to harness this technology’s full potential. Full article
(This article belongs to the Special Issue Artificial Intelligence and Machine Learning in Clinical Practice)
16 pages, 758 KB  
Article
Large Language Models in Medical and Dental Education: A Cross-Sectional Comparison of AI-Generated and Faculty-Authored Prosthodontic Materials
by Alexia-Ecaterina Cârstea, Lucian-Toma Ciocan, Vlad-Gabriel Vasilescu, Ana-Maria Cristina Țâncu, Marina Imre, Andreea-Cristiana Didilescu and Silviu-Mirel Pițuru
Dent. J. 2026, 14(5), 249; https://doi.org/10.3390/dj14050249 - 23 Apr 2026
Abstract
Background/Objectives: This study aimed to compare AI-generated educational material with faculty-authored content in Dental Prostheses Technology, evaluating perceived clarity, accuracy, structure, usefulness, and overall instructional quality across different age and professional groups. Methods: An analytical cross-sectional study was conducted using two versions of the first three chapters of a prosthodontics textbook: the original faculty-authored text and a reformulated version generated by ChatGPT 5.2 (OpenAI). Images were removed and formatting standardized to ensure a text-only comparison. An anonymized online questionnaire based on a five-point Likert scale assessed clarity, accuracy, readability, usefulness and structure. To reduce potential bias, participants were unaware of the authorship of the evaluated materials (human-authored vs. AI-generated). A total of 130 participants independently reviewed both documents. Data were analyzed using Wilcoxon signed-rank, Mann–Whitney U, and Friedman tests. Results: Both materials received favorable evaluations across all dimensions. The AI-generated version demonstrated a statistically significant advantage in clarity (Z = −2.107, p = 0.035; r = 0.19), while no significant differences were observed for structure, accuracy, readability, or usefulness. Generational differences emerged: younger participants valued improved clarity but reported reduced usefulness, mid-career participants showed the greatest improvement in perceived accuracy, and senior professionals reported substantial gains in usefulness and readability. Conclusions: AI-generated educational material demonstrates pedagogical equivalence to faculty-authored content, with clarity representing its principal advantage. Large language models may serve as effective complementary tools in dental education, particularly for restructuring complex content. Full article
(This article belongs to the Special Issue Dental Education: Innovation and Challenge)
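The Wilcoxon signed-rank test used in this study compares paired Likert ratings of the two text versions from the same participants. A minimal sketch with hypothetical ratings, using SciPy; zero differences are dropped by default, and tied ranks among the remaining differences force a normal approximation of the p-value:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired five-point Likert ratings of the same items
# (invented; the study used 130 participants):
faculty_text = np.array([4, 5, 3, 4, 2, 5, 3, 4, 4, 3, 5, 2])
ai_text      = np.array([5, 5, 4, 4, 3, 5, 4, 5, 4, 4, 5, 3])

# Two-sided test of the null that the paired differences are
# symmetric about zero.
stat, p = wilcoxon(faculty_text, ai_text)
```

For ordinal Likert data this nonparametric paired test is the standard choice over a paired t-test, since it needs no interval-scale or normality assumption.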
18 pages, 880 KB  
Article
Comparative Evaluation of Five Multimodal Large Language Models for Medical Laboratory Image Recognition: Impact of Prompting Strategies on Diagnostic Accuracy
by Hui-Ru Yang, Kuei-Ying Lin, Ping-Chang Lin, Jih-Jin Tsai and Po-Chih Chen
Diagnostics 2026, 16(9), 1258; https://doi.org/10.3390/diagnostics16091258 - 22 Apr 2026
Abstract
Background: Multimodal large language models (MLLMs) show promise in medical imaging, but their performance is highly dependent on prompt engineering. This study systematically evaluates how different prompting strategies affect diagnostic accuracy in clinical laboratory image interpretation. Methods: We evaluated five MLLMs (ChatGPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet, Grok-2, and Perplexity Pro (Claude 3.5 Sonnet)) using 177 proficiency testing images across three domains: blood smears (n = 78), urinalysis (n = 50), and parasitology (n = 49). Three prompting approaches were compared: (1) complex multi-choice prompts with 20 diagnostic options, (2) zero-shot open-ended prompts, and (3) two-step descriptive-reasoning prompts. Images were sourced from the Taiwan Society of Laboratory Medicine external quality assurance archives with expert consensus diagnoses. Results: Zero-shot prompting significantly outperformed complex multi-choice prompts across all models and domains (p < 0.001). With zero-shot prompts, Gemini achieved 78.5% overall accuracy (urinalysis: 92.0%; parasitology: 75.5%; blood smears: 64.1%), representing a 17% improvement over complex prompts. Two-step descriptive-reasoning prompts further improved blood smear accuracy by 8–12% for top-performing models, but showed minimal benefit in urinalysis and parasitology. The re-query mechanism (“please reconsider”) improved urinalysis accuracy by 7.6% but had a negligible effect on blood smears and parasitology. Conclusions: Prompting strategy critically determines MLLM diagnostic performance. Zero-shot approaches with minimal constraints consistently outperform complex multi-choice formats. The remarkable performance of general-purpose models in structured domains like urinalysis (>90% accuracy) demonstrates the considerable progress of multimodal AI. However, complex morphological tasks like blood smear interpretation require either specialized prompting techniques or domain-specific fine-tuning. 
These findings provide evidence-based guidance for optimizing AI integration in clinical laboratories. Full article
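Accuracy differences between prompting strategies, like the p < 0.001 comparison above, can be tested with a chi-square test on a 2×2 table of correct/incorrect counts. A sketch with hypothetical counts chosen to mirror the quoted percentages (78.5% of 177 images, and a 17-point drop for complex prompts; the exact split is invented, and for paired per-image outcomes McNemar's test would be the stricter choice):

```python
from scipy.stats import chi2_contingency

#                 correct  incorrect   (hypothetical counts)
zero_shot    = [139, 38]   # ~78.5% of 177
multi_choice = [109, 68]   # ~17 points lower

# Chi-square test of independence (Yates-corrected for 2x2 by default).
chi2, p, dof, expected = chi2_contingency([zero_shot, multi_choice])
```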
23 pages, 1954 KB  
Article
Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization
by Constantinos Djouvas, Christiana Andreou, Maria C. Voutsa and Nicolas Tsapatsoulis
Computers 2026, 15(5), 262; https://doi.org/10.3390/computers15050262 - 22 Apr 2026
Abstract
Large Language Models (LLMs) are increasingly deployed as automated annotators in semantic multimedia systems, yet their reliability varies significantly across architectures. This study extends prior cross-model evaluations by benchmarking ChatGPT-5, Qwen-3, and Gemini-3-flash against human expert annotations using the HRAST hotel review dataset. We adopt a bias-by-design framework to analyze systematic divergences in sentiment, topic, and aspect labeling across real and synthetic data, while investigating the moderating effects of annotation mode. Findings reveal model-contingent polarity bias: ChatGPT-5 exhibits a pronounced neutrality bias, while Qwen-3 and Gemini-3-flash align more closely with human polarization. Agreement is substantial for concrete topics but diverges on abstract evaluative dimensions. Synthetic data consistently inflates reliability metrics while masking ambiguity. These findings highlight that annotation bias is structurally embedded in model design choices and operational conditions. Cross-architectural triangulation and mode-aware deployment strategies are recommended for robust semantic multimedia system development. Full article
(This article belongs to the Special Issue Advances in Semantic Multimedia and Personalized Digital Content)
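The neutrality bias reported above for ChatGPT-5 can be quantified as a simple rate difference between a model's and the human annotators' use of the neutral label. The sketch below is a hypothetical illustration of that measure; the function name and the `"neutral"` label scheme are assumptions for this example, not taken from the study:

```python
from collections import Counter

def neutrality_bias(model_labels, human_labels):
    """Model's neutral-label rate minus the human annotators' rate.

    A positive value means the model over-assigns "neutral" relative to
    humans, i.e. the kind of neutrality bias described in the abstract.
    """
    model_rate = Counter(model_labels)["neutral"] / len(model_labels)
    human_rate = Counter(human_labels)["neutral"] / len(human_labels)
    return model_rate - human_rate
```

For instance, a model that labels half of a set of human-polarized reviews as neutral yields a clearly positive score, flagging the bias before the annotations are used downstream.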
12 pages, 791 KB  
Article
Exploratory Evaluation of Diagnostic Accuracy and Temporal Reproducibility of Multimodal Large Language Models in the Image-Based Assessment of Oral Mucosal Lesions
by Lovro Dumančić, Marko Antonio Cug, Danica Vidović Juras, Luís Monteiro, Rui Albuquerque and Vlaho Brailo
Appl. Sci. 2026, 16(8), 4046; https://doi.org/10.3390/app16084046 - 21 Apr 2026
Abstract
Objective: The aim was to evaluate the diagnostic accuracy and temporal reproducibility of multimodal large language models (LLMs) in the image-based diagnosis of oral mucosal lesions. Materials and Methods: The study included 100 anonymized clinical photographs of oral mucosal conditions obtained from the archive of the Department of Oral Medicine, School of Dental Medicine, University of Zagreb. Images were categorized into four subgroups: physiological variations, benign mucosal lesions, oral potentially malignant disorders, and oral cancer (25 images each). Three multimodal LLMs (ChatGPT-5.1 Plus, Gemini 3 Pro, and Perplexity Pro) analyzed each image using an identical prompt and were required to provide a single most probable diagnosis based solely on visual features. To evaluate temporal reproducibility, the entire evaluation was repeated in three independent testing cycles conducted at one-month intervals. Diagnostic accuracy was compared using chi-square tests, while intra-model agreement across cycles was assessed using Fleiss’ kappa. Results: Gemini demonstrated the highest diagnostic accuracy, reaching 78% correct responses in cycles 2 and 3, significantly outperforming ChatGPT (55–57%) and Perplexity (28–31%) (p < 0.00001). Subgroup analyses showed similar trends, with Gemini achieving the highest accuracy across most lesion categories. Intra-model agreement across cycles was moderate for ChatGPT (κ = 0.525) and fair for both Gemini (κ = 0.338) and Perplexity (κ = 0.409). Gemini also showed the highest proportion of responses that remained correct across all three cycles (51%). Conclusions: Multimodal LLMs demonstrate promising diagnostic capabilities in the image-based assessment of oral mucosal lesions; however, variability in reproducibility highlights the need for cautious clinical implementation and further validation.
(This article belongs to the Special Issue Recent Advances in Biomedical Data Analysis)
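Fleiss' kappa, used above to measure each model's agreement with itself across the three monthly cycles, can be computed directly from a table of per-item category counts. This is a minimal self-contained sketch of the standard formula, not the study's own code; here each item would be one image and the three "ratings" its three cycle diagnoses:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[item][category], where each row
    gives how many of the n ratings per item fell in each category
    (n must be the same for every item)."""
    N = len(counts)                  # number of items
    n = sum(counts[0])               # ratings per item
    k = len(counts[0])               # number of categories
    # mean per-item agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # chance agreement P_e from marginal category proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement on every item gives κ = 1 regardless of how the categories are distributed, while κ near 0 indicates agreement no better than chance — the scale behind the "moderate"/"fair" labels quoted in the abstract.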
13 pages, 711 KB  
Article
The Potential Role of Large Language Models in Assisting Patients and Guiding Emergency Care Visits
by Kristina Gerhardinger, Josina Straub, Julia Lenz, Siegmund Lang, Volker Alt, Borys Frankewycz, Maximilian Kerschbaum and Lisa Klute
J. Clin. Med. 2026, 15(8), 3170; https://doi.org/10.3390/jcm15083170 - 21 Apr 2026
Abstract
Background/Objectives: Overcrowding in emergency departments (EDs) remains a critical challenge in modern healthcare systems, driven in part by patient uncertainty regarding symptom urgency and a lack of accessible medical guidance. Recent advances in artificial intelligence, particularly large language models (LLMs), present a novel opportunity to support patient navigation and relieve pressure on ED infrastructures. Methods: A total of 238 unique patient questions were identified through a structured web search. Following deduplication and thematic clustering, 15 representative questions were selected. Each question was submitted to three LLMs—ChatGPT (v3.5), DeepSeek, and Gemini—using a standardized prompt. Responses were assessed by clinical experts (N = 8) who were blinded to the model source. For each question, reviewers selected the best overall response and also rated the individual responses of the three LLMs. Results: ChatGPT was selected as the best-performing model in 60% of cases, with DeepSeek and Gemini selected in 23% and 17%, respectively. ChatGPT responses also achieved the highest proportion of “excellent” quality ratings and the lowest proportion of “unsatisfactory” outputs. Across all models, clarity was the most positively rated domain (79% agreement), followed by empathy (72%), length/detail appropriateness (71%), and completeness (65%). Over two-thirds of raters expressed willingness to integrate LLM-based tools into clinical practice for patient education and pre-triage counseling. Conclusions: Large language models demonstrate promising capabilities in responding to emergency care-related patient queries. Their ability to deliver medically sound and communicatively effective answers positions them as potential digital adjuncts in the management of low-acuity ED presentations and prehospital triage.
(This article belongs to the Special Issue Novel Technologies to Assist Emergency Medical Care)
43 pages, 5546 KB  
Article
Exploring Cross-Debate Between LLMs to Improve the Forecasting of Financial Market Indicators
by Shuchih Ernest Chang and Kai-Chun Chung
Mathematics 2026, 14(8), 1393; https://doi.org/10.3390/math14081393 - 21 Apr 2026
Abstract
In the context of political and financial market turmoil, effectively forecasting financial market trends is crucial for investment decisions. Large language models (LLMs) have been applied in extant research to predict market trends, analyze investor sentiments and interpret financial news, all aiming to help investment decision making. However, LLMs face limitations due to training data heterogeneity, restricting multidimensional perspectives and hindering comparative analysis for optimization. This study proposes a “Dual-Agent LLM Debate Mechanism” framework using a Proponent (LLM1: Gemini Pro 3) and an Opponent (LLM2: ChatGPT 5.2) to address single-LLM forecasting gaps: The Proponent generates a baseline forecast (F1) from an Integrated Context, while the Opponent validates and resolves conflicts with the Proponent via up to three rounds of cross-debate to produce a consensus forecast (F2). A controlled experiment was conducted to analyze 75 financial market indicators (FMIs) across five asset categories, revealing that F2 outperforms F1 in accuracy and directional stability, particularly in highly volatile assets like Cryptocurrencies and 10-Year Government Bonds. Paired-sample t-tests confirmed statistical significance, validating the mechanism’s effectiveness. Our study results demonstrate how cross-debate between LLMs enhances forecasting accuracy through structured optimization.
(This article belongs to the Special Issue Artificial Intelligence Techniques in the Financial Services Industry)
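The debate mechanism described above (a Proponent baseline F1, up to three rounds of Opponent critique, and a consensus F2) can be sketched as a simple control loop. The function names below are illustrative assumptions, and the two model calls are stubs standing in for real LLM invocations; this is a sketch of the described protocol, not the paper's implementation:

```python
def debate_forecast(proponent, opponent, context, max_rounds=3):
    """Dual-agent debate sketch: the proponent drafts a baseline
    forecast (F1); the opponent critiques it; the proponent revises
    with each critique until the opponent agrees or max_rounds is
    reached, yielding the consensus forecast (F2)."""
    f1 = proponent(context)              # baseline forecast F1
    forecast = f1
    for _ in range(max_rounds):
        critique, agrees = opponent(context, forecast)
        if agrees:                       # conflict resolved: consensus
            break
        # revise the forecast with the opponent's critique appended
        forecast = proponent(context + "\n" + critique)
    return f1, forecast                  # (F1, consensus F2)
```

Capping the loop at three rounds mirrors the paper's setup and bounds the cost per indicator; returning both F1 and F2 is what makes the paired comparison (e.g. the paired-sample t-tests above) possible.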