Search Results (76)

Search Parameters:
Keywords = multimodal large language models (MLLMs)

19 pages, 1101 KB  
Article
SR-VLN: Implicit Spatial Reasoning Vision-and-Language Navigation
by Ruolin Zhu, Shaobin Li and Min Yang
Sensors 2026, 26(12), 3809; https://doi.org/10.3390/s26123809 - 15 Jun 2026
Viewed by 208
Abstract
Vision-and-language navigation (VLN) traditionally relies on explicit reasoning chains, which, despite being interpretable, impose severe constraints on inference efficiency and scalability in long-range environments. Existing multimodal large language models (MLLMs) frequently encounter latency bottlenecks due to the generation of verbose textual narratives during decision-making. To address these limitations, we propose spatial reasoning vision-and-language navigation (SR-VLN), a novel framework that shifts the paradigm from explicit chain-of-thought (CoT) to an implicit spatial representation space. SR-VLN introduces a pyramidal hierarchical history framework integrated with perceptual compression to condense historical trajectories into multi-scale representations, effectively minimizing token overhead while preserving critical spatial semantics. Rather than generating verbose textual reasoning steps, SR-VLN employs compact, learnable spatial tokens (S-Tokens) to perform agile inference directly within the latent feature space. To establish robust causal mappings between these implicit states and navigational actions, we employ a hybrid training strategy that combines sparse reward supervision with reinforcement learning via GRPO. Extensive evaluations on the R2R, REVERIE, and SOON datasets demonstrate that SR-VLN achieves state-of-the-art overall navigation performance, while maintaining a comparable balance between accuracy and efficiency. Compared to explicit reasoning baselines, our method reduces token consumption by 68% and achieves a 4.1× speedup in inference while reaching a 76.02% success rate and a 73.80% SPL on the R2R unseen split, thereby facilitating near-real-time action prediction in long-range navigation environments. Full article
(This article belongs to the Section Navigation and Positioning)

13 pages, 536 KB  
Article
Diagnostic Performance of Multimodal Large Language Models for Central Venous Catheter Assessment Chest Radiographs in the Intensive Care Unit
by Christina-Chrysanthi Theocharidou, Zafeiris Tsinaris, Christos Karachristos, Anastasia Theocharidou, Michail Kourtidis, Kiriaki Papadopoulou, Athanasia-Marina Peristeri, Athanasios Astreinidis, Anna Simichanidou, Chrysavgi Giannaki, Myrto Tzimou, Evangelos Kaimakamis, Vasileios Voutsas, Vasiliki Soulountsi and Athina Lavrentieva
Med. Sci. 2026, 14(2), 315; https://doi.org/10.3390/medsci14020315 - 14 Jun 2026
Viewed by 186
Abstract
Background: Chest radiography remains central to post-procedural assessment of central venous catheter (CVC) placement in intensive care units. Multimodal large language models (MLLMs) can process medical images, but their reliability for practical radiography tasks remains uncertain. This study assessed the diagnostic performance of MLLMs and intensivists for CVC access classification, CVC tip assessment, and pneumothorax-related radiographic findings. Methods: In this retrospective diagnostic performance study, consecutive portable anteroposterior chest radiographs obtained after CVC placement in adult critically ill patients were independently evaluated by four intensivists and five MLLMs. A radiologist consensus served as the reference standard. Interobserver agreement and diagnostic performance were assessed using Fleiss’ kappa, Gwet AC1, Cohen’s kappa, accuracy, sensitivity, specificity, precision, F1 score, balanced accuracy, and Matthews correlation coefficient. Results: The final cohort included 183 unique radiographs. Intensivist reviewers showed high performance for CVC access classification but lower and more heterogeneous performance for CVC tip-position assessment. Among MLLMs, CVC access accuracy ranged from 0.339 to 0.874, whereas CVC tip assessment was dominated by almost universal classification of tips as appropriate, with near-zero specificity and chance-level balanced accuracy. For pneumothorax-related findings, all MLLMs classified every case as negative. Intensivist reviewers had higher balanced accuracy than MLLMs for CVC access classification (difference, 0.420; 95% CI, 0.349–0.490; p < 0.001) and CVC tip assessment (difference, 0.247; 95% CI, 0.205–0.290; p < 0.001). Pneumothorax analyses were exploratory because only five positive cases were present. Conclusions: The evaluated MLLMs showed unreliable diagnostic performance compared with experienced intensivists. Apparent performance was influenced by class imbalance and dominant-response behavior, supporting cautious task-specific validation and complete diagnostic performance reporting. Full article
(This article belongs to the Section Critical Care Medicine)
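The class-imbalance effect reported in the abstract above can be illustrated with a minimal Python sketch. It uses only figures stated there (183 radiographs, five pneumothorax-positive cases, every case labelled negative) and assumes that all 183 radiographs entered the pneumothorax analysis. The function name `binary_metrics` and the choice to report the Matthews correlation coefficient as 0.0 when its denominator is zero are illustrative assumptions, not the authors' code.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, sensitivity, specificity, balanced accuracy and MCC
    from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as 0.0 when it is undefined (zero denominator).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": mcc,
    }


# Hypothetical confusion matrix for an "always negative" reader:
# 5 true positives in the data are all missed, 178 negatives are all correct.
always_negative = binary_metrics(tp=0, fp=0, tn=178, fn=5)
for name, value in always_negative.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions raw accuracy is about 0.97 while balanced accuracy is 0.5 and sensitivity is 0, which is one way to see why the abstract calls for complete diagnostic performance reporting rather than accuracy alone.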

91 pages, 6222 KB  
Review
A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision–Language Tasks
by Chia Xin Liang, Pu Tian, Caitlyn Heqi Yin, Yao Yua, An-Hou Wei, Ming Li, Xinyuan Song, Tianyang Wang, Ziqian Bi, Ming Liu, Riyang Bao and Pengbin Feng
Computation 2026, 14(6), 125; https://doi.org/10.3390/computation14060125 - 29 May 2026
Cited by 1 | Viewed by 1026
Abstract
This survey provides a comprehensive guide to Multimodal Large Language Models (MLLMs) with a focus on vision–language tasks, including image captioning, visual question answering, cross-modal retrieval, visual grounding, multi-image reasoning, long-video understanding, and embodied AI. We examine architectures, training pipelines, and practical applications, covering visual encoders, language model backbones, connector modules, contrastive pre-training, instruction tuning, and preference alignment. We also foreground first-principles constraints—information bottlenecks, data-processing limits, and statistical co-occurrence bias—that shape architecture, robustness, and evaluation. This survey centers on vision–language systems and does not cover audio-only models or code-generation tools without visual inputs. Through task-level analysis and system-level case studies, we examine prominent MLLM implementations while addressing key challenges in scalability, memory, energy use, inference cost, robustness, and cross-modal learning. We present a unified taxonomy of the MLLM design space, a comparative overview of representative models and evaluation benchmarks, and a discussion of open problems. Concluding with ethical considerations and responsible AI development, this survey offers theoretical frameworks and practical insights for researchers, practitioners, and students working at the intersection of natural language processing and computer vision. Full article

19 pages, 1764 KB  
Article
Automated Dataset Construction for Composed Video Retrieval in Soccer
by Riku Yoshida, Ryota Goka, Keisuke Maeda, Takahiro Ogawa and Miki Haseyama
Appl. Sci. 2026, 16(11), 5360; https://doi.org/10.3390/app16115360 - 27 May 2026
Viewed by 249
Abstract
Composed Video Retrieval (CoVR) enables flexible video search by retrieving a target video that reflects a specified modification to a query video. The triplet datasets—consisting of query videos, query text, and target videos—required for model training have been collected manually. Recent studies have explored automatic construction of training triplets for CoVR; however, most existing approaches rely heavily on caption similarity. This limitation is particularly problematic in soccer videos, where identical or highly similar captions can correspond to visually distinct situations, making it difficult to construct triplets with appropriate relationships. To address this issue, this paper proposes a multimodal triplet construction framework specialized for soccer videos. The key idea is to explicitly incorporate visual similarity alongside textual similarity. Specifically, candidate target videos are selected by combining visual similarity with commentary caption filtering, enabling the identification of videos that are visually similar yet semantically different. The semantic difference between videos is then generated as query text using a large language model (LLM) without manual annotation. Furthermore, a multimodal large language model (MLLM) is introduced to estimate whether the generated modification is visually and semantically consistent with the video pair. Rather than replacing human verification, this step provides an automated screening signal to identify potentially unreliable triplets. The experiments show that the proposed framework automatically constructs triplets with reasonable validity under limited human validation. These results demonstrate the potential of scalable triplet construction for CoVR in soccer videos. Full article
(This article belongs to the Collection Computer Science in Sport)

61 pages, 2270 KB  
Article
Multimodal Large Language Model-Based Shapley Interaction Quantification Analysis for Interpretation of Battery State-of-Charge Prediction in Electric Vehicles
by Jaehyeok Lee, Jaeseung Lee and Jehyeok Rew
Appl. Sci. 2026, 16(10), 4812; https://doi.org/10.3390/app16104812 - 12 May 2026
Viewed by 308
Abstract
Accurate state-of-charge (SOC) prediction is critical for estimating driving range and ensuring the reliability of electric vehicle (EV) battery management systems. Although machine learning-based SOC prediction models achieve high accuracy, their complex nonlinear structures limit interpretability and hinder practical deployment. This study proposes an automated interpretation framework that integrates a multimodal large language model (MLLM) with Shapley interaction quantification (SHAP-IQ) to explain SOC prediction results. An XGBoost-based SOC prediction model is developed, and SHAP-IQ is employed to analyze both main effects of individual input variables (order 1) and pairwise feature interactions (order 2). SHAP-IQ visualizations and attribution values are provided as inputs to MLLM, which generates instance-level natural language explanations, while cross-validation and aggregation procedures ensure consistency. Experiments using real-world driving data collected from a BMW i3 show that XGBoost outperforms benchmark models in SOC prediction accuracy. The results indicate that, for the analyzed instances, SOC predictions are primarily governed by electrical variables such as battery voltage and current, whereas driving and environmental variables mainly affect the prediction through interaction effects. The proposed framework demonstrates the potential to improve the interpretability of SOC prediction models and can be extended to other energy systems in EVs employing complex machine learning models. Full article

19 pages, 1520 KB  
Article
Artificial Intelligence in Cancer Research: Modality Dependence and Limited Visual–Spatial Integration in Multimodal Large Language Models for Breast Cancer Histopathology
by Ibrahim Güler, Armin Kraus, Gerrit Grieb, Tevfik Satir and Henrik Stelling
Life 2026, 16(5), 763; https://doi.org/10.3390/life16050763 - 2 May 2026
Viewed by 439
Abstract
Multimodal large language models (MLLMs) are increasingly considered for cancer diagnostic support, yet their suitability for histopathological image interpretation remains inadequately characterized. We evaluated six contemporary general-purpose MLLMs (Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, ChatGPT 5.3, Grok 4.2, Gemini 3.1 Pro) on 58 paired hematoxylin and eosin (H&E)-stained breast cancer histopathology images (26 malignant, 32 benign) and corresponding nuclei segmentation masks. Each case was classified five times per model under three conditions, image only (IMAGE), mask only (MASK), and both combined (BOTH), yielding 5220 observations. Mean accuracy dropped from 69.4% (IMAGE) to 49.6% (MASK), below the majority-class baseline of 55.2%. Providing the mask together with the image did not improve classification (68.0%), and for ChatGPT 5.3 produced a net loss of 31 correct predictions. Models maintained elevated mean confidence (67.6) under MASK despite near-random accuracy, and reasoning categories shifted in 67.5% of matched case–run pairs between modalities. Under the conditions tested, current general-purpose MLLMs exhibit strong dependence on visual surface features, fail to effectively integrate spatial structural information, and maintain confidence independent of accuracy. These behavioral limitations are directly relevant to the safe deployment of MLLMs in cancer diagnostic workflows. Full article
(This article belongs to the Section Biochemistry, Biophysics and Computational Biology)
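The study-design arithmetic in the abstract above can be checked with a short Python sketch. All numbers are taken from the abstract; the variable names are illustrative, and reading the majority-class baseline as "always predict the larger class" is an assumption.

```python
# Design figures as stated in the abstract.
models, cases, runs_per_case, conditions = 6, 58, 5, 3
malignant, benign = 26, 32

# Total observations: every model classifies every case five times per condition.
observations = models * cases * runs_per_case * conditions

# Majority-class baseline: accuracy of always predicting the larger class (assumed reading).
majority_baseline = max(malignant, benign) / (malignant + benign)

# Reported mean accuracy under the mask-only condition.
mask_accuracy = 0.496

print(observations)                        # 5220
print(round(100 * majority_baseline, 1))   # 55.2
print(mask_accuracy < majority_baseline)   # True
```

The product reproduces the 5220 observations and the 55.2% baseline quoted in the abstract, and confirms that the mask-only accuracy of 49.6% falls below that baseline.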

29 pages, 2266 KB  
Article
Test-Time Candidate-Aware Dual Refinement for Remote Sensing Image–Text Retrieval
by Bofan Zhang and Hao Wu
Remote Sens. 2026, 18(9), 1389; https://doi.org/10.3390/rs18091389 - 30 Apr 2026
Viewed by 509
Abstract
Remote sensing image–text retrieval (RSITR) is a pivotal task aimed at achieving efficient bidirectional matching between visual content and textual descriptions in large-scale remote sensing databases. Nevertheless, it faces a fundamental challenge: the severe information asymmetry between sparse, abstract captions and dense, multi-scale overhead imagery. Prior works predominantly focus on learning static cross-modal representations during training; however, this frozen inference process is fundamentally limited in bridging the asymmetry due to its inability to dynamically compensate for missing details or resolve visual ambiguities in heterogeneous scenes. To overcome this limitation, we propose CADRE (Test-Time Candidate-Aware Dual Refinement), a retrieval-backbone-agnostic framework exploiting retrieved candidates as feedback for bidirectional alignment. Operating on a novel Inject-and-Suppress paradigm, CADRE comprises two complementary modules. First, the Visual-Context Injection (VCI) module addresses textual sparsity by incorporating an adaptive filtering mechanism to efficiently mine hierarchical visual evidence from high-confidence candidates and inject it into the query via a domain-adapted Multimodal Large Language Model (MLLM). Second, the Query-Guided Disambiguation (QGD) module targets visual ambiguity by generating multi-view visual hypotheses and utilizing the query as a semantic probe to suppress background noise. Extensive experiments on three standard benchmarks (RSICD, RSITMD, and UCM) demonstrate good transferability across several strong RSITR backbones. Full article

17 pages, 933 KB  
Article
Comparative Evaluation of Five Multimodal Large Language Models for Medical Laboratory Image Recognition: Impact of Prompting Strategies on Diagnostic Accuracy
by Hui-Ru Yang, Kuei-Ying Lin, Ping-Chang Lin, Jih-Jin Tsai and Po-Chih Chen
Diagnostics 2026, 16(9), 1258; https://doi.org/10.3390/diagnostics16091258 - 22 Apr 2026
Viewed by 423
Abstract
Background: Multimodal large language models (MLLMs) show promise in medical imaging, but their performance is highly dependent on prompt engineering. This study systematically evaluates how different prompting strategies affect diagnostic accuracy in clinical laboratory image interpretation. Methods: We evaluated five MLLMs (ChatGPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet, Grok-2, and Perplexity Pro (Claude 3.5 Sonnet)) using 177 proficiency testing images across three domains: blood smears (n = 78), urinalysis (n = 50), and parasitology (n = 49). Three prompting approaches were compared: (1) complex multi-choice prompts with 20 diagnostic options, (2) zero-shot open-ended prompts, and (3) two-step descriptive-reasoning prompts. Images were sourced from the Taiwan Society of Laboratory Medicine external quality assurance archives with expert consensus diagnoses. Results: Zero-shot prompting significantly outperformed complex multi-choice prompts across all models and domains (p < 0.001). With zero-shot prompts, Gemini achieved 78.5% overall accuracy (urinalysis: 92.0%; parasitology: 75.5%; blood smears: 64.1%), representing a 17% improvement over complex prompts. Two-step descriptive-reasoning prompts further improved blood smear accuracy by 8–12% for top-performing models, but showed minimal benefit in urinalysis and parasitology. The re-query mechanism (“please reconsider”) improved urinalysis accuracy by 7.6% but had a negligible effect on blood smears and parasitology. Conclusions: Prompting strategy critically determines MLLM diagnostic performance. Zero-shot approaches with minimal constraints consistently outperform complex multi-choice formats. The remarkable performance of general-purpose models in structured domains like urinalysis (>90% accuracy) demonstrates the considerable progress of multimodal AI. However, complex morphological tasks like blood smear interpretation require either specialized prompting techniques or domain-specific fine-tuning. These findings provide evidence-based guidance for optimizing AI integration in clinical laboratories. Full article

20 pages, 621 KB  
Review
Conditional Generative AI in Oncology Diagnostics
by Chiara Frascarelli, Alberto Concardi, Elisa Mangione, Mariachiara Negrelli, Francesca Maria Porta, Michela Tulino, Joana Sorino, Antonio Marra, Nicola Fusco, Elena Guerini-Rocco and Konstantinos Venetis
Appl. Sci. 2026, 16(8), 4015; https://doi.org/10.3390/app16084015 - 21 Apr 2026
Viewed by 688
Abstract
The increasing complexity of oncology diagnostics requires advanced Clinical Decision Support Systems (CDSS) capable of integrating multimodal data. Traditional discriminative models often struggle with missing data and cross-modal dependencies. This review provides a novel, systematic analysis of conditional generative artificial intelligence (AI), including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), diffusion models and Multimodal Large Language Models (MLLMs), specifically tailored for oncological CDSS. We examine how these architectures move beyond simple prediction to learn joint data distributions, enabling robust data imputation, virtual staining, and automated clinical reporting. A central focus of this work is the assessment of translational application, identifying the gaps between experimental proof-of-concepts and clinical deployment. We address critical hurdles such as model hallucinations, domain shift, and demographic bias, providing a roadmap for biological consistency and regulatory compliance. This review highlights the transition from task-specific generators to multimodal reasoning systems. Ultimately, we argue that the integration of generative AI into diagnostic workflows is essential for precision oncology, provided that human-in-the-loop validation and uncertainty-aware inference remain central to their implementation. Full article

33 pages, 5543 KB  
Article
The New Frontier of Quality Evaluation for Visual Sensors: A Survey of Large Multimodal Model-Based Methods
by Qihang Ge, Xiongkuo Min, Sijing Wu, Yunhao Li and Guangtao Zhai
Sensors 2026, 26(8), 2530; https://doi.org/10.3390/s26082530 - 20 Apr 2026
Viewed by 844
Abstract
Visual quality assessment is entering a new frontier as media evolve from static images to temporally dynamic videos and 3D content. These visual signals are typically captured by sensing devices such as cameras and depth sensors, whose acquisition characteristics significantly influence perceptual quality. Traditional quality models, including distortion-centric and regression-based approaches, perform well on conventional degradations but struggle to evaluate higher-level attributes such as semantic plausibility and structural coherence in modern AI-generated and multimodal scenarios. The emergence of large multimodal models (LMMs), including vision–language models (VLMs) and multimodal large language models (MLLMs), reshapes the evaluation paradigm by enabling semantic grounding, instruction-driven assessment, and explainable reasoning. This survey presents a unified perspective on visual quality assessment for sensor-captured visual data across image, video, and 3D modalities. We review conventional deep learning approaches and recent LMM-based methods, highlighting how multimodal fusion and language-conditioned reasoning transform quality assessment from scalar prediction to perceptual intelligence. Finally, we discuss key challenges and future opportunities for building efficient, robust, and sensor-aware visual quality assessment systems. Full article
(This article belongs to the Special Issue Perspectives in Intelligent Sensors and Sensing Systems)

55 pages, 4195 KB  
Article
Multimodal Large Language Model-Based Explainable Boosting Machine Analysis for Interpretation of State-of-Health Prediction of Lithium-Ion Batteries
by Jaehyeok Lee, Jaeseung Lee and Jehyeok Rew
Electronics 2026, 15(8), 1675; https://doi.org/10.3390/electronics15081675 - 16 Apr 2026
Viewed by 489
Abstract
Accurate prediction of the state of health (SOH) of lithium-ion batteries is essential for ensuring the safety and reliability of electric vehicles and energy storage systems. While machine learning (ML)-based models have demonstrated strong predictive performance, their limited interpretability remains a major challenge for deployment in safety-critical applications. Although explainable boosting machines (EBMs) provide an interpretable alternative through their additive structure, existing studies still rely on manual analysis of model outputs, which restricts scalability and reproducibility. To address this limitation, this study proposes a structured interpretation framework that integrates EBMs with multimodal large language models (MLLMs). The proposed framework employs EBMs to generate SOH predictions along with global feature importance and variable-level score-density visualizations. These outputs are subsequently processed by an MLLM to perform automated interpretation at both global and variable levels, followed by aggregation, cross-validation, and generation of a unified interpretation report. Experiments were conducted on a lithium-ion battery degradation dataset and the EBM achieved competitive predictive performance compared to baseline ML models. In addition, the quality of the generated interpretations was evaluated using both an MLLM-as-a-Judge and a user study. The evaluation results show that the generated interpretations consistently achieved high scores, with average ratings exceeding 4.5 out of 5 across key criteria such as interpretation accuracy and faithfulness, as assessed by both independent MLLMs and domain experts. These results demonstrate that the proposed framework enables reliable and scalable interpretation of battery SOH prediction models, providing a practical solution for explainable artificial intelligence in battery health management. Full article

17 pages, 892 KB  
Article
Artificial Intelligence for Biomedical Diagnostics: Diagnostic Accuracy and Reliability of Multimodal Large Language Models in Electrocardiogram Interpretation
by Henrik Stelling, Armin Kraus, Gerrit Grieb, David Breidung and Ibrahim Güler
Life 2026, 16(4), 681; https://doi.org/10.3390/life16040681 - 16 Apr 2026
Viewed by 845
Abstract
The electrocardiogram (ECG) is a central tool in cardiovascular diagnostics, yet interpretation requires expertise and remains subject to variability. Multimodal large language models (MLLMs) have shown emerging capabilities in medical image analysis, but their performance in ECG interpretation remains insufficiently characterized. This study evaluated the diagnostic accuracy and inter-run reliability of five MLLMs across ECG interpretation tasks. Thirteen standard 12-lead ECGs were presented to five models (ChatGPT-5.3, Gemini 3.1 Pro, Claude Opus 4.6, Grok 4.1, and ERNIE 5.0) across five independent runs per case, yielding 2275 task-level assessments. Six categorical interpretation tasks (rhythm, electrical axis, PR/P-wave morphology, QRS duration, ST/T-wave morphology, and QTc interval) were compared with expert-consensus ground truth, while heart rate estimation was evaluated using mean absolute error (MAE). Overall categorical accuracy ranged from 52.3% to 64.9%. QRS duration classification achieved the highest accuracy (66.2–90.8%), whereas ST/T-wave assessment showed the lowest performance (20.0–41.5%). Heart rate MAE ranged from 14.8 to 46.7 bpm. A dissociation between diagnostic accuracy and inter-run reliability was observed across models. These findings indicate that current MLLMs do not achieve clinically reliable ECG interpretation performance and highlight the importance of assessing diagnostic accuracy and inter-run reliability when evaluating artificial intelligence systems in biomedical diagnostics. Full article

16 pages, 962 KB  
Article
AI in Hand and Wrist Radiography: Multimodal Large Language Models for Distal Radius Fracture Detection and Characterization
by Ibrahim Güler, Armin Kraus, Gerrit Grieb, David Breidung, Martin Lautenbach and Henrik Stelling
Diagnostics 2026, 16(8), 1171; https://doi.org/10.3390/diagnostics16081171 - 15 Apr 2026
Viewed by 683
Abstract
Background/Objectives: Multimodal large language models (MLLMs) are increasingly evaluated for diagnostic tasks in medical imaging, including radiographic interpretation. However, most studies focus primarily on binary fracture detection and rarely assess clinically relevant fracture characteristics such as displacement or intra-articular extension, which influence treatment decisions. In addition, most evaluations rely on single-run inference designs that do not assess response reproducibility. This study evaluated the diagnostic performance and inter-run reliability of five MLLMs for radiographic assessment of distal radius fractures. Methods: Fifty fracture-positive distal radius radiographs were evaluated by five MLLMs (ChatGPT 5.3, Gemini 3.1 Pro, Claude Opus 4.6, Grok 4.1, and ERNIE 5.0) across five independent zero-shot inference runs (n = 1250 observations). Diagnostic tasks included fracture detection, intra-articular extension, and displacement. Sex and age were exploratory endpoints. Performance was summarized using sensitivity (fracture detection) and accuracy (other tasks), with inter-run reliability assessed via Fleiss’ κ. Results: Performance varied across tasks and models. Fracture detection sensitivity ranged from 39.6% to 99.6%, with two models exceeding 90%. Intra-articular extension accuracy ranged from 51.6% to 55.6%, consistent with chance-level performance. Displacement classification ranged from 34.8% to 70.4%. One model achieved substantial inter-run agreement across binary tasks (κ > 0.60), whereas two models showed slight agreement (κ < 0.20). Conclusions: Only two models exceeded 90% sensitivity for fracture detection, while intra-articular extension remained at chance level (≤55.6%). Substantial inter-run reliability (κ > 0.60) was observed in only one model. These findings indicate that current MLLMs do not reliably support multidimensional fracture assessment and that single-run evaluations overestimate robustness. Full article
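The inter-run reliability measure named in the abstract above, Fleiss' κ, can be sketched in plain Python as follows. This is a minimal illustration of the standard formula, treating the five inference runs of one model as the "raters" (an assumed reading of the design). The function name, the toy count table, and the choice to return NaN when only one category is ever used are hypothetical and are not taken from the study.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of shape (items, categories).

    counts[i][j] is the number of runs that assigned item i to category j;
    every item must have been rated by the same number of runs.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        # Assumption: kappa is reported as NaN when a single category is always used.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 radiographs, 5 runs each, columns = [displaced, not displaced].
toy_counts = [[5, 0], [4, 1], [0, 5], [3, 2]]
print(round(fleiss_kappa(toy_counts), 3))
```

With the study's design of 50 radiographs, five models, and five runs each (50 × 5 × 5 = 1250 observations), a table like this would presumably be built once per model and task, with 50 rows and five ratings per row.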

37 pages, 10609 KB  
Article
A Scalable Framework for Street Interface Morphology Assessment via Automated Multimodal Large Language Model Agents
by Yuchen Wang, Yu Ye and Chao Weng
Land 2026, 15(4), 610; https://doi.org/10.3390/land15040610 - 8 Apr 2026
Viewed by 570
Abstract
Evaluating street interface morphology is essential for urban design, yet existing approaches often struggle to combine large-scale applicability with higher-level morphological interpretation. This study proposes a scalable framework for assessing street interface morphology using an automated multimodal large language model (MLLM) agent. Using street view imagery (SVI), the framework evaluates four core morphological dimensions (enclosure, continuity, transparency, and roughness) through two complementary analytical streams: objective geometric measurement and subjective morphological assessment. To support reliable evaluation, the framework incorporates a dual-benchmark strategy consisting of manually derived geometric measurements and expert-consensus ratings for calibration and validation. Applied in Shanghai, the framework demonstrated reliable performance across the evaluated dimensions. The optimized agent was further extended to continuous street-segment analysis, demonstrating its applicability to large-scale urban assessment. By integrating objective and subjective evaluation within a scalable and interpretable workflow, the proposed methodology provides a practical tool for street interface morphology analysis and urban design assessment.
(This article belongs to the Section Land Planning and Landscape Architecture)

26 pages, 977 KB  
Article
KE-MLLM: A Knowledge-Enhanced Multi-Sensor Learning Framework for Explainable Fake Review Detection
by Jiaying Chen, Jingyi Liu, Yiwen Liang and Mengjie Zhou
Appl. Sci. 2026, 16(6), 2909; https://doi.org/10.3390/app16062909 - 18 Mar 2026
Cited by 1 | Viewed by 737
Abstract
The proliferation of fake reviews on e-commerce and social platforms has severely undermined consumer trust and market integrity, necessitating robust and interpretable real-time detection mechanisms with multi-sensor data fusion capabilities. While traditional machine learning approaches have shown promise in identifying fraudulent reviews, they often lack transparency and fail to leverage the rich contextual knowledge embedded in large-scale datasets. In this paper, we propose KE-MLLM (Knowledge-Enhanced Multimodal Large Language Model), a unified framework that integrates knowledge-enhanced prompting with parameter-efficient fine-tuning for explainable fake review detection. Our approach employs LoRA (Low-Rank Adaptation) to fine-tune lightweight large language models (LLaMA-3-8B) on review text, while incorporating multimodal behavioral sensor signals including temporal patterns, user metadata, and social network characteristics for comprehensive anomaly sensing. To address the critical need for interpretability in fraud detection systems, we implement a Chain-of-Thought (CoT) reasoning module that generates human-understandable explanations for classification decisions, highlighting linguistic anomalies, sentiment inconsistencies, and behavioral red flags. We enhance the model’s discriminative capability through a knowledge distillation strategy that transfers domain-specific expertise from larger teacher models while maintaining computational efficiency suitable for edge sensing devices. Extensive experiments on two benchmark datasets—YelpChi and Amazon Reviews from the DGL Fraud Dataset—show that KE-MLLM achieves strong performance, reaching an F1-score of 94.3% and an AUC-ROC of 96.7% on YelpChi and outperforming the strongest baseline in our comparison by 5.8 and 4.2 percentage points, respectively. 
Furthermore, human evaluation indicates that the generated explanations achieve 89.5% consistency with expert annotations, suggesting that the framework can improve the interpretability and practical usefulness of automated fraud detection systems. The proposed framework provides a useful step toward more accurate and interpretable fake review detection and offers a practical reference for building more transparent and accountable AI systems in high-stakes applications.
