Search Results (684)

Search Parameters:
Keywords = vision-language model

16 pages, 6152 KB  
Article
DisasterReliefGPT: Multimodal AI for Autonomous Disaster Impact Assessment and Crisis Communication
by Lekshmi Chandrika Reghunath, Athikkal Sudhir Abhishek, Arjun Changat, Arjun Unnikrishnan, Ayush Kumar Rai, Christian Napoli and Cristian Randieri
Technologies 2026, 14(3), 179; https://doi.org/10.3390/technologies14030179 - 16 Mar 2026
Abstract
The work presented herein proposes DisasterReliefGPT, a multimodal AI system for automating crisis communication and post-disaster assessment. The system integrates three tightly coupled components: a vision module called DisasterOCS for structural damage detection in satellite images, a Large Vision–Language Model (LVLM) for enhanced visual understanding and contextual reasoning, and a Large Language Model (LLM) to produce detailed, clear assessment reports. DisasterOCS relies on a ResNet34-based encoder with partial weight sharing and event-specific decoders, coupled with a custom MultiCrossEntropyDiceLoss function for multi-class segmentation on pre- and post-disaster image pairs. On the benchmark xBD dataset, the developed system reaches a damage F1 score of 78.8%, identifies destroyed buildings with 81.3% precision, and detects undamaged structures with a score of 90.7%. By combining these components, emergency responders can immediately obtain reliable, readable damage assessments that directly support urgent decision-making. Full article
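
The MultiCrossEntropyDiceLoss is only named in the abstract, not specified; a minimal sketch of a combined multi-class cross-entropy and soft Dice loss of the kind commonly used for such segmentation (the class weighting and smoothing values are assumptions, not the authors' exact formulation) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossEntropyDiceLoss(nn.Module):
    """Combined multi-class cross-entropy + soft Dice loss (illustrative sketch)."""
    def __init__(self, num_classes: int, ce_weight: float = 0.5, smooth: float = 1.0):
        super().__init__()
        self.num_classes = num_classes
        self.ce_weight = ce_weight
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, C, H, W); target: (B, H, W) with integer class labels
        ce = F.cross_entropy(logits, target)

        probs = F.softmax(logits, dim=1)                       # (B, C, H, W)
        one_hot = F.one_hot(target, self.num_classes)          # (B, H, W, C)
        one_hot = one_hot.permute(0, 3, 1, 2).float()          # (B, C, H, W)

        dims = (0, 2, 3)                                       # sum over batch and pixels
        intersection = (probs * one_hot).sum(dims)
        cardinality = probs.sum(dims) + one_hot.sum(dims)
        dice_per_class = (2 * intersection + self.smooth) / (cardinality + self.smooth)
        dice_loss = 1.0 - dice_per_class.mean()

        return self.ce_weight * ce + (1.0 - self.ce_weight) * dice_loss
```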

25 pages, 3799 KB  
Article
DR-CLIP: A Deformable Vision–Language Model for Scale-Invariant Object Counting in Remote Sensing Images
by Jingzhe Nie, Qun Liu, Tianze Li, Xu Lu and Liang Zhang
Sensors 2026, 26(6), 1863; https://doi.org/10.3390/s26061863 - 16 Mar 2026
Abstract
Object counting in remote sensing images is valuable for applications such as urban planning and environmental monitoring. However, it remains challenging due to heterogeneous annotations, semantic ambiguity in open-vocabulary queries, and performance degradation of small targets. To address these limitations, we propose DR-CLIP (Deformable Remote CLIP), a vision–language model for remote sensing image counting that incorporates deformable visual feature extraction with text-guided prediction. DR-CLIP includes a (1) Region-to-Instruction (R2I) mechanism to convert points, bounding boxes, and polygons into a unified image–text training representation, a (2) Multi-scale Deformable Attention (MSDA) to enhance discriminative feature extraction across extreme scale variations and cluttered backgrounds, and a (3) Text-Guided Counting Head that establishes robust cross-modal alignment through contrastive learning, achieving open-vocabulary counting capability without category-specific retraining. On DOTA-v2.0, DR-CLIP achieves a Mean Absolute Error (MAE) of 2.34 and a Root Mean Squared Error (RMSE) of 3.89, outperforming baselines by 19.0% in MAE. The MSDA module significantly increases Small-Object Recall (SOR) to 0.824, which is especially effective in situations involving dense and small object counting. In cross-modal retrieval, DR-CLIP attains R@1 scores of 68.3% (image-to-text) and 72.1% (text-to-image) on the Remote Sensing Image Captioning Dataset (RSICD). The framework generalizes robustly, with only 8.7% performance degradation in cross-domain tests, which is significantly lower than the 23.4% drop observed in baseline methods. Full article
(This article belongs to the Section Remote Sensors)
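
For reference, the counting metrics reported above (MAE and RMSE over per-image object counts) can be computed as follows; the Small-Object Recall definition is paper-specific and omitted here:

```python
import numpy as np

def counting_errors(pred_counts, true_counts):
    """Mean Absolute Error and Root Mean Squared Error over per-image object counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    mae = np.abs(pred - true).mean()
    rmse = np.sqrt(((pred - true) ** 2).mean())
    return mae, rmse

# Example with made-up counts for four images
mae, rmse = counting_errors([12, 7, 31, 4], [10, 8, 35, 4])
print(f"MAE={mae:.2f}, RMSE={rmse:.2f}")
```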

16 pages, 310 KB  
Article
A Regularized Backbone-Level Cross-Modal Interaction Framework for Stable Temporal Reasoning in Video-Language Models
by Geon-Woo Kim and Ho-Young Jung
Mathematics 2026, 14(6), 996; https://doi.org/10.3390/math14060996 - 15 Mar 2026
Abstract
Deep learning approaches for egocentric video understanding often lack a principled theoretical treatment of stability, particularly when dealing with the sparse, noisy, and temporally ambiguous observations characteristic of first-person imaging. In this work, we frame egocentric video question answering not merely as a classification task, but as an ill-posed inverse problem aimed at reconstructing latent semantic intent from stochastically perturbed visual signals. To address the instability inherent in standard dual-encoder architectures, we present a framework with a mathematical interpretation that incorporates gated cross-modal interaction within the transformer backbone. Formally, the video-side update analyzed in this work is defined as a learnable convex combination of unimodal feature representations and cross-modal attention residuals; the full implementation applies analogous gated cross-modal updates bidirectionally. From a regularization perspective, the gating mechanism can be interpreted as an adaptive parameter that balances data fidelity against language-conditioned structural constraints during feature reconstruction. We provide the Bounded Update Property (Lemma 1) and an analytical layer-wise sensitivity bound and empirically demonstrate that the proposed framework achieves measurable improvements in both accuracy and stability on the EgoTaskQA and MSR-VTT benchmarks. On EgoTaskQA, our model improves accuracy from 27.0% to 31.7% (+4.7 pp) and reduces the accuracy drop under 50% frame drop from 3.93 pp to 0.94 pp. On MSR-VTT, our model improves accuracy by 13.0 pp over the dual-encoder baseline. Under severe perturbation (50% frame drop) on MSR-VTT, our model retains 97.7% of its clean performance, whereas the baseline exhibits near-zero drop accompanied by majority-class behavior. These results provide empirical evidence that the proposed interaction induces stable behavior under perturbations in an ill-posed multimodal inference setting, mitigating sensitivity to sampling variability while preserving query-relevant temporal structure. Furthermore, an entropy-based analysis indicates that the gating mechanism prevents excessive diffusion of attention, promoting coherent temporal reasoning. Overall, this work offers a mathematically informed perspective on designing interaction mechanisms for stable multimodal systems, with a focus on robust reasoning under temporal ambiguity. Full article
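
The video-side update described above, a learnable convex combination of unimodal features and cross-modal attention residuals, can be sketched roughly as follows; the dimensions, gating parameterization, and attention configuration are assumptions rather than the authors' exact design:

```python
import torch
import torch.nn as nn

class GatedCrossModalUpdate(nn.Module):
    """Convex combination of unimodal video features and cross-modal attention residuals."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate_logit = nn.Parameter(torch.zeros(1))   # learnable scalar gate

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, T, D) frame features; text: (B, L, D) token features
        residual, _ = self.cross_attn(query=video, key=text, value=text)
        g = torch.sigmoid(self.gate_logit)               # g in (0, 1) keeps the update convex
        return (1.0 - g) * video + g * residual
```

Because g stays in (0, 1), each output is a bounded interpolation between the unimodal features and the language-conditioned residual, which loosely mirrors the bounded-update and regularization reading given in the abstract.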

23 pages, 2679 KB  
Article
Morphology-Aware Deep Features and Frozen Filters for Surgical Instrument Segmentation with LLM-Based Scene Summarization
by Adnan Haider, Muhammad Arsalan and Kyungeun Cho
J. Clin. Med. 2026, 15(6), 2227; https://doi.org/10.3390/jcm15062227 - 15 Mar 2026
Abstract
Background/Objectives: The rise of artificial intelligence is injecting intelligence into the healthcare sector, including surgery. Vision-based intelligent systems that assist surgical procedures can significantly increase productivity, safety, and effectiveness during surgery. Surgical instruments are central components of any surgical intervention, yet detecting and locating them during live surgeries remains challenging due to adverse imaging conditions such as blood occlusion, smoke, blur, glare, low-contrast, instrument scale variation, and other artifacts. Methods: To address these challenges, we developed an advanced segmentation architecture termed the frozen-filters-based morphology-aware segmentation network (FFMS-Net). Accurate surgical instrument segmentation strongly depends on edge and morphology information; however, in conventional neural networks, this spatial information is progressively degraded during spatial processing. FFMS-Net introduces a frozen and learnable feature pipeline (FLFP) that simultaneously exploits frozen edge representations and learnable features. Within FLFP, Sobel and Laplacian filters are frozen to preserve edge and orientation information, which is subsequently fused with learnable initial spatial features. Moreover, a tri-atrous blending (TAB) block is employed at the end of the encoder to fuse multi-receptive-field-based contextual information, preserving instrument morphology and improving robustness under challenging conditions such as blur, blood occlusion, and smoke. Datasets focused on surgical instruments often suffer from severe class imbalance and poor instrument visibility. To mitigate these issues, FFMS-Net incorporates a progressively structure-preserving decoder (PSPD) that aggregates dilated and standard spatial information after each upsampling stage to maintain class structure. Multi-scale spatial features from different encoder levels are further fused using light skip paths (LSPs) to project channels with task-relevant patterns. Results/Conclusions: FFMS-Net is extensively evaluated on three challenging datasets: UW-Sinus-surgery-live, UW-Sinus-cadaveric, and CholecSeg8k. The proposed method demonstrates promising performance compared with state-of-the-art approaches while requiring only 1.5 million trainable parameters. In addition, an open-source large language model is integrated for non-clinical summarization of the surgical scene based on the predicted mask and deterministic descriptors derived from it. Full article
(This article belongs to the Special Issue Artificial Intelligence and Machine Learning in Clinical Practice)
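
One possible reading of the frozen-filter idea is a convolution whose kernels are fixed Sobel and Laplacian operators, concatenated with a learnable branch; the kernel choices and channel counts below are assumptions for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEdgeBranch(nn.Module):
    """Fixed Sobel/Laplacian filters fused with a learnable convolution (illustrative)."""
    def __init__(self, out_channels: int = 16):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        laplacian = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        kernels = torch.stack([sobel_x, sobel_y, laplacian]).unsqueeze(1)   # (3, 1, 3, 3)
        self.register_buffer("kernels", kernels)                            # frozen, no gradients
        self.learnable = nn.Conv2d(1, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) grayscale input
        edges = F.conv2d(x, self.kernels, padding=1)    # (B, 3, H, W) edge/orientation responses
        feats = self.learnable(x)                       # (B, out_channels, H, W) learnable features
        return torch.cat([edges, feats], dim=1)         # fused frozen + learnable representation
```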

17 pages, 602 KB  
Review
Artificial Intelligence Applications in Gastric Cancer Surgery: Bridging Early Diagnosis and Responsible Precision Medicine
by Silvia Malerba, Miljana Vladimirov, Aman Goyal, Audrius Dulskas, Augustinas Baušys, Tomasz Cwalinski, Sergii Girnyi, Jaroslaw Skokowski, Ruslan Duka, Robert Molchanov, Bojan Jovanovic, Francesco Antonio Ciarleglio, Alberto Brolese, Kebebe Bekele Gonfa, Abdi Tesemma Demmo, Zilvinas Dambrauskas, Adolfo Pérez Bonet, Mario Testini, Francesco Paolo Prete, Valentin Calu, Natale Calomino, Vikas Jain, Aleksandar Karamarkovic, Karol Polom, Adel Abou-Mrad, Rodolfo J. Oviedo, Yogesh Vashist and Luigi Marano
J. Clin. Med. 2026, 15(6), 2208; https://doi.org/10.3390/jcm15062208 - 13 Mar 2026
Abstract
Background: Artificial intelligence is emerging as a promising tool in surgical oncology, with growing evidence suggesting potential applications in diagnostic support, intraoperative guidance, and perioperative risk assessment. In gastric cancer surgery, emerging applications range from AI-assisted endoscopic detection to data-driven perioperative risk prediction, while some technological developments, particularly in robotic autonomy, derive from broader surgical or experimental models that may inform future gastric procedures. Methods: A narrative review was conducted following established methodological standards, including the Scale for the Assessment of Narrative Review Articles (SANRA) and the Search–Appraisal–Synthesis–Analysis (SALSA) framework. English-language studies indexed in PubMed, Scopus, Embase, and Web of Science up to October 2025 were included. Evidence was synthesized thematically across five domains: AI-assisted anatomical recognition and lymphadenectomy support, autonomous robotic systems, early cancer detection, perioperative predictive and frailty models, and ethical and regulatory considerations. Results: AI-based computer vision and deep learning algorithms have demonstrated promising capabilities for real-time anatomical recognition, surgical phase classification, and intraoperative guidance, although evidence of direct patient-level benefit remains limited. In diagnostic settings, AI-assisted endoscopy and Raman spectroscopy have been shown to improve early lesion detection and reduce dependence on operator experience. Predictive models, including MySurgeryRisk and AI-driven frailty assessments, may support individualized prehabilitation planning and perioperative risk stratification. Persistent limitations include small and heterogeneous datasets, insufficient external validation, and unresolved concerns related to data privacy, algorithmic interpretability, and medico-legal responsibility. Conclusions: Artificial intelligence is progressively emerging as a promising tool in gastric cancer surgery, integrating automation, advanced analytics, and human clinical reasoning. Its safe and ethical adoption requires robust validation, transparent governance, and continuous surgeon oversight. When developed within human-centered and ethically grounded frameworks, AI can augment, rather than replace, surgical expertise, potentially advancing precision, safety, and equity in oncologic care. Full article

26 pages, 6182 KB  
Article
VL-OrdinalFormer: Vision–Language-Guided Ordinal Transformers for Interpretable Knee Osteoarthritis Grading
by Zahid Ullah and Jihie Kim
Mathematics 2026, 14(6), 963; https://doi.org/10.3390/math14060963 - 12 Mar 2026
Abstract
Knee osteoarthritis (KOA) severity assessment using the Kellgren–Lawrence (KL) grading system is essential for clinical decision-making, yet reliable discrimination between adjacent early stages, particularly KL1 and KL2, remains challenging due to subtle radiographic differences and inter-observer variability. This study investigates whether integrating ordinal regression with vision–language semantic alignment can improve fine-grained automated KOA grading. We propose VL-OrdinalFormer, a transformer-based framework that models KL severity as an ordered process and aligns visual features with clinically grounded textual descriptions. The model is evaluated using stratified five-fold cross-validation on the publicly available OAI kneeKL224 dataset (1656 test radiographs). The proposed approach achieves 70.29% accuracy, 70.19% macro F1-score, and 81.61% macro AUROC, outperforming both CNN and standard ViT baselines. Notably, class-wise analysis shows consistent improvements for clinically ambiguous intermediate grades, with gains of +6.6% for KL1 and +19.4% for KL2 compared to the VGG19 baseline. Robustness experiments further demonstrate stable performance under simulated acquisition and projection variability. These results indicate that combining ordinal modeling with vision–language alignment enhances discrimination of subtle disease stages while maintaining interpretability, supporting the potential of the proposed framework for reliable and clinically meaningful KOA grading. Full article
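
Ordinal modeling of KL grades is often implemented as K−1 cumulative binary decisions ("grade > k"); the sketch below illustrates that general idea, not the authors' specific head:

```python
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    """K-1 cumulative binary logits for K ordered grades (e.g., KL0-KL4)."""
    def __init__(self, in_dim: int, num_grades: int = 5):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_grades - 1)   # logits for P(grade > k), k = 0..K-2

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.fc(features)                      # (B, K-1)

def ordinal_targets(grades: torch.Tensor, num_grades: int = 5) -> torch.Tensor:
    # grade g -> binary target vector [g>0, g>1, ..., g>K-2]
    ks = torch.arange(num_grades - 1, device=grades.device)
    return (grades.unsqueeze(1) > ks).float()

def predict_grade(logits: torch.Tensor) -> torch.Tensor:
    # predicted grade = number of thresholds passed
    return (torch.sigmoid(logits) > 0.5).sum(dim=1)

# Training would use binary cross-entropy between logits and ordinal_targets:
criterion = nn.BCEWithLogitsLoss()
```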

32 pages, 2223 KB  
Article
From Large Language Models to Agentic AI in Industry 5.0 and the Post-ChatGPT Era: A Socio-Technical Framework and Review on Human–Robot Collaboration
by Enrique Coronado
Robotics 2026, 15(3), 58; https://doi.org/10.3390/robotics15030058 - 12 Mar 2026
Abstract
Generative Artificial Intelligence (GenAI), particularly Foundation Models (FMs), has recently become a key component of Industry 5.0. Despite growing interest in integrating these technologies into industrial environments, comprehensive analyses of the socio-technical opportunities and challenges of deploying these emerging AI systems in real-world settings remain limited. This article proposes a socio-technical conceptual perspective, termed Responsible Agentic Robotics (RAR), which structures the lifecycle deployment of agentic AI-enabled robotic systems around three core layers: context, design, and value. Additionally, this article presents a brief review of 21 peer-reviewed studies published between 2023 and 2025 (post-ChatGPT era) on FMs and agentic AI-enabled Human–Robot Collaboration (HRC) in industrial assembly/disassembly environments. The results indicate that existing research remains predominantly technology-centric, with a strong emphasis on enhancing robot autonomy, while comparatively limited attention is devoted to human-centered and responsible practices. Moreover, empirical evaluations of human, social, and sustainability dimensions, such as worker empowerment, human factors, well-being, inclusivity, resource utilization, and environmental impact, are rarely conducted and poorly discussed. This article concludes by identifying key socio-technical gaps, outlining future research directions. Full article
(This article belongs to the Special Issue Human-Centered Robotics: The Transition to Industry 5.0)

28 pages, 5635 KB  
Article
Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
by Kaiqing Yuan, Haotian Lan, Yao Gao and Kun Wang
Land 2026, 15(3), 449; https://doi.org/10.3390/land15030449 - 12 Mar 2026
Abstract
While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using Low-Rank Adaptation (LoRA) and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.863 on objective features and 89.3% agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns—such as the divergent perceptual effects of architectural transparency across residential and commercial zones—revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with Sustainable Development Goal 11 (SDG 11). This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience. Full article
(This article belongs to the Special Issue Big Data-Driven Urban Spatial Perception)
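
The parameter-efficient adaptation mentioned above (LoRA) wraps a frozen linear layer with a trainable low-rank update; the minimal sketch below shows the mechanism in isolation, with the rank, scaling, and choice of layers to wrap in VisualGLM-6B left as assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())
```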

17 pages, 1881 KB  
Communication
HSG-ON: Hierarchical Scene Graph-Based Object Navigation
by Seokjoon Kwon, Hee-Deok Jang and Dong Eui Chang
Sensors 2026, 26(6), 1755; https://doi.org/10.3390/s26061755 - 10 Mar 2026
Abstract
For a robot to operate effectively in human-centric environments, finding objects based on natural language is essential. Zero-shot object goal navigation is a significant challenge where robots must find unseen objects in new environments without prior knowledge. Existing methods often struggle with strategic exploration, leading to inefficient searches. In this study, we propose a hierarchical scene graph-based navigation system to address this challenge. Our core innovations are twofold: dynamically constructing a three-layer “room–workspace–object” hierarchical scene graph without manually pre-tuned parameters, and introducing a novel workspace-based searching strategy. By evaluating semantic relevance at the workspace level rather than the object level, the robot infers probable containers for a target, enabling focused, human-like exploration. Simulation results demonstrate that our system significantly outperforms existing state-of-the-art methods. Quantitatively, our approach improves the Success Rate (SR) by 26.8% (SR 0.4859) under distance-constrained settings and by 20.2% (SR 0.7360) under unconstrained settings, compared to the best baselines. These results validate that our framework offers a robust solution for zero-shot object goal navigation. Full article
(This article belongs to the Section Sensors and Robotics)
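
The workspace-based search strategy can be illustrated by scoring each workspace node against the target query in a shared embedding space and visiting the most relevant one first; the graph layout and embedding source below are assumptions, not the authors' implementation:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_workspaces(scene_graph: dict, target_emb: np.ndarray) -> list:
    """Rank workspace nodes (e.g., 'kitchen counter', 'desk') by semantic relevance to the target."""
    scored = []
    for room, workspaces in scene_graph.items():
        for ws_name, ws in workspaces.items():
            # relevance is evaluated at the workspace level, not per observed object
            scored.append((cosine(ws["embedding"], target_emb), room, ws_name))
    return sorted(scored, reverse=True)

# Hypothetical usage: embeddings would come from a text/image encoder such as CLIP
rng = np.random.default_rng(0)
graph = {"kitchen": {"counter": {"embedding": rng.normal(size=512)}},
         "office": {"desk": {"embedding": rng.normal(size=512)}}}
print(rank_workspaces(graph, rng.normal(size=512))[0])
```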

25 pages, 6915 KB  
Article
EXAONE-VLA: A Unified Vision–Language Framework for Mobile Manipulation via Semantic Topology and Hierarchical LLM Reasoning
by Jeong-Seop Park, Yong-Jun Lee, Jong-Chan Park, Sung-Gil Park, Jong-Jin Woo and Myo-Taeg Lim
Appl. Sci. 2026, 16(5), 2600; https://doi.org/10.3390/app16052600 - 9 Mar 2026
Abstract
This paper proposes a unified vision–language framework that translates user instructions into navigation for the mobile base and actions for the manipulator in indoor environments. In general, occupancy grid maps constructed via SLAM capture solely the geometric layout of the environment. This renders the robot incapable of leveraging the semantic information required for object distinction. The proposed method encodes semantic information from vision–language models and the robot’s pose in a textual format, referred to as a semantic topological graph. Specifically, the models including GroundingDINO, LG EXAONE, and SAM2 extract object-level semantic information, which is subsequently used to identify room characteristics. A large language model then interprets user instructions to identify the final destination for navigation within the semantic topological graph, followed by reasoning to determine the suitable action network. Notably, the proposed text-based representation facilitates a substantial reduction in inference time, and its effectiveness is validated through real-world experiments. Full article
(This article belongs to the Special Issue Deep Reinforcement Learning for Multiagent Systems)
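
One way to read "encoding semantic information and the robot's pose in a textual format" is to serialize the graph into a compact prompt the LLM can reason over; the node schema and wording below are illustrative assumptions:

```python
def graph_to_prompt(nodes: list, robot_pose: tuple) -> str:
    """Serialize a semantic topological graph into text for LLM-based goal selection."""
    lines = [f"Robot pose: x={robot_pose[0]:.1f}, y={robot_pose[1]:.1f}, yaw={robot_pose[2]:.2f}"]
    for n in nodes:
        objs = ", ".join(n["objects"]) if n["objects"] else "none"
        lines.append(f"Node {n['id']} ({n['room']}) at ({n['x']:.1f}, {n['y']:.1f}): objects = {objs}")
    lines.append("Instruction: given a user command, answer with the id of the goal node.")
    return "\n".join(lines)

nodes = [
    {"id": 0, "room": "kitchen", "x": 1.2, "y": 0.4, "objects": ["mug", "sink"]},
    {"id": 1, "room": "living room", "x": 4.8, "y": 2.1, "objects": ["sofa", "tv"]},
]
print(graph_to_prompt(nodes, robot_pose=(0.0, 0.0, 1.57)))
```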

23 pages, 15691 KB  
Article
ProM-Pose: Language-Guided Zero-Shot 9-DoF Object Pose Estimation from RGB-D with Generative 3D Priors
by Yuchen Li, Kai Qin, Haitao Wu and Xiangjun Qu
Electronics 2026, 15(5), 1111; https://doi.org/10.3390/electronics15051111 - 7 Mar 2026
Abstract
Object pose estimation is fundamental for robotic manipulation, autonomous driving, and augmented reality, yet recovering the full 9-DoF state (rotation, translation, and anisotropic 3D scale) from RGB-D observations remains challenging for previously unseen objects. Existing methods either rely on instance-specific CAD models, predefined category boundaries, or suffer from scale ambiguity under sparse observations. We propose ProM-Pose, a unified cross-modal temporal perception framework for zero-shot 9-DoF object pose estimation. By integrating language-conditioned generative 3D shape priors as canonical geometric references, an asymmetric cross-modal attention mechanism for spatially aware fusion, and a decoupled pose decoding strategy with temporal refinement, ProM-Pose constructs metrically consistent and semantically grounded representations without relying on category-specific pose priors or instance-level CAD supervision. Extensive experiments on CAMERA25 and REAL275 benchmarks demonstrate that ProM-Pose achieves competitive or superior performance compared to category-level methods, with mAP of 75.0% at 5°,2cm and 90.5% at 10°,5cm on CAMERA25, and 42.2% at 5°,2cm and 76.0% at 10°,5cm on REAL275 under zero-shot cross-domain evaluation. Qualitative results on real-world logistics scenarios further validate temporal stability and robustness under occlusion and lighting variations. ProM-Pose effectively bridges semantic grounding and metric geometric reasoning within a unified formulation, enabling stable and scale-aware 9-DoF pose estimation for previously unseen objects under open-vocabulary conditions. Full article
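
The "mAP at 5°, 2 cm" style of metric counts a prediction as correct when both the geodesic rotation error and the translation error fall under the thresholds; a sketch of that per-instance check (ignoring the scale component and any symmetry handling) is:

```python
import numpy as np

def rotation_error_deg(r_pred: np.ndarray, r_gt: np.ndarray) -> float:
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    cos_angle = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def pose_within_threshold(r_pred, t_pred, r_gt, t_gt, deg_thr=5.0, cm_thr=2.0) -> bool:
    rot_err = rotation_error_deg(r_pred, r_gt)
    trans_err_cm = 100.0 * np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt))  # meters -> cm
    return rot_err <= deg_thr and trans_err_cm <= cm_thr

# Example: identical rotation, 1.5 cm translation offset -> counts as correct at 5 deg / 2 cm
eye = np.eye(3)
print(pose_within_threshold(eye, [0.0, 0.0, 0.015], eye, [0.0, 0.0, 0.0]))
```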

23 pages, 5448 KB  
Article
Evidence-Guided Diagnostic Reasoning for Pediatric Chest Radiology Based on Multimodal Large Language Models
by Yuze Zhao, Qing Wang, Yingwen Wang, Ruiwei Zhao, Rui Feng and Xiaobo Zhang
J. Imaging 2026, 12(3), 111; https://doi.org/10.3390/jimaging12030111 - 6 Mar 2026
Abstract
Pediatric respiratory diseases are a leading cause of hospital admissions and childhood mortality worldwide, highlighting the critical need for accurate and timely diagnosis to support effective treatment and long-term care. Chest radiography remains the most widely used imaging modality for pediatric pulmonary assessment. Consequently, reliable AI-assisted diagnostic methods are essential for alleviating the workload of clinical radiologists. However, most existing deep learning-based approaches are data-driven and formulate diagnosis as a black-box image classification task, resulting in limited interpretability and reduced clinical trustworthiness. To address these challenges, we propose a trustworthy two-stage diagnostic paradigm for pediatric chest X-ray diagnosis that closely aligns with the radiological workflow in clinical practice, in which the diagnosis procedure is constrained by evidence. In the first stage, a vision–language model fine-tuned on pediatric data identifies radiological findings from chest radiographs, producing structured and interpretable diagnostic evidence. In the second stage, a multimodal large language model integrates the radiograph, extracted findings, patient demographic information, and external medical domain knowledge with RAG mechanism to generate the final diagnosis. Experiments conducted on the VinDr-PCXR dataset demonstrate that our method achieves 90.1% diagnostic accuracy, 70.9% F1-score, and 82.5% AUC, representing up to a 13.1% increase in diagnosis accuracy over the state-of-the-art baselines. These results validate the effectiveness of combining multimodal reasoning with explicit medical evidence and domain knowledge, and indicate the strong potential of the proposed approach for trustworthy pediatric radiology diagnosis. Full article
(This article belongs to the Section AI in Imaging)
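
The two-stage, evidence-constrained flow described above can be sketched as plain orchestration code; extract_findings, retrieve_knowledge, and diagnose are hypothetical placeholders standing in for the fine-tuned vision–language model, the RAG retriever, and the multimodal LLM:

```python
from dataclasses import dataclass

@dataclass
class CaseInput:
    radiograph_path: str
    age_months: int
    sex: str

def run_two_stage_diagnosis(case: CaseInput, extract_findings, retrieve_knowledge, diagnose) -> dict:
    """Stage 1: structured findings from the image. Stage 2: diagnosis constrained by that evidence."""
    findings = extract_findings(case.radiograph_path)     # e.g. ["consolidation", "pleural effusion"]
    knowledge = retrieve_knowledge(findings)              # RAG lookup keyed on the findings
    prompt = (
        f"Patient: {case.age_months} months, {case.sex}.\n"
        f"Radiological findings: {', '.join(findings)}.\n"
        f"Relevant domain knowledge: {knowledge}\n"
        "Provide the most likely diagnosis and cite which findings support it."
    )
    return {"findings": findings, "diagnosis": diagnose(case.radiograph_path, prompt)}
```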

53 pages, 5533 KB  
Systematic Review
Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
by Matthew Lisondra, Beno Benhabib and Goldie Nejat
Robotics 2026, 15(3), 55; https://doi.org/10.3390/robotics15030055 - 4 Mar 2026
Abstract
Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action models, have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interaction, mobile service robots can achieve more flexible understanding, adaptive behavior, and robust task execution in dynamic real-world environments. Despite this progress, embodied AI for mobile service robots continues to face fundamental challenges related to the translation of natural language instructions into executable robot actions, multimodal perception in human-centered environments, uncertainty estimation for safe decision-making, and computational constraints for real-time onboard deployment. In this paper, we present the first systematic review of foundation models in mobile service robotics, following the preferred reporting items for systematic reviews and meta-analysis (PRISMA) guidelines. Using an OpenAlex literature search, we considered 7506 papers for the years spanning 1968–2025. Our detailed analysis identified four main challenges and how recent advances in foundation models, related to the translation of natural language instructions into executable robot actions, multimodal perception in human-centered environments, uncertainty estimation for safe decision-making, and computational constraints for real-time onboard deployment, have addressed these challenges. We further examine real-world applications in domestic assistance, healthcare, and service automation, highlighting how foundation models enable context-aware, socially responsive, and generalizable robot behaviors. Beyond technical considerations, we discuss ethical, societal, human-interaction, and physical design and ergonomic implications associated with deploying foundation-model-enabled service robots in human environments. Finally, we outline future research directions emphasizing reliability and lifelong adaptation, privacy-aware and resource-constrained deployment, as well as the governance and human-in-the-loop frameworks required for safe, scalable, and trustworthy mobile service robotics. Full article
(This article belongs to the Special Issue Embodied Intelligence: Physical Human–Robot Interaction)

20 pages, 2574 KB  
Article
Quantitative Evaluation and Domain Adaptation of Vision–Language Models for Mixed-Reality Interpretation of Indoor Environmental Computational Fluid Dynamics Visualizations
by Soushi Futamura and Tomohiro Fukuda
Technologies 2026, 14(3), 157; https://doi.org/10.3390/technologies14030157 - 4 Mar 2026
Abstract
In built environmental design, incorporating building user participation and verifying indoor thermal performance at early design stages have become increasingly important. Although Computational Fluid Dynamics (CFD) analysis is widely used to predict indoor thermal environments, its results are difficult for non-expert stakeholders to interpret, even when visualized using Mixed Reality (MR). Interpreting CFD visualizations in MR requires quantitative reasoning that explicitly cross-references visual features with legend information, rather than relying on prior color–value associations learned from natural images. This study investigates the capability of Vision–Language Models (VLMs) to interpret MR visualizations of CFD results and respond to user queries. We focus on indoor temperature distributions and airflow velocities visualized in MR. A novel dataset was constructed, consisting of MR images with CFD results superimposed onto real indoor spaces, paired with domain-specific question–answer annotations requiring legend-based reasoning. Using this dataset, a general-purpose VLM (Qwen2.5-VL) was fine-tuned. Experimental results show that the baseline model achieved less than 30% accuracy, whereas fine-tuning improved accuracy to over 60% across all categories while largely preserving general reasoning performance. These results demonstrate that domain adaptation enables VLMs to quantitatively interpret physical information embedded in MR visualizations, supporting non-experts’ understanding of built environmental design. Full article
(This article belongs to the Section Construction Technologies)
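
The legend-based quantitative reasoning the dataset targets amounts to mapping a sampled overlay colour back to a physical value via the colorbar; a toy version of that lookup (the colormap entries and value range are assumptions) is:

```python
import numpy as np

def color_to_value(rgb, colorbar_rgbs, vmin: float, vmax: float) -> float:
    """Map a sampled overlay colour to a physical value by nearest-colour lookup on the legend."""
    rgb = np.asarray(rgb, dtype=float)
    bar = np.asarray(colorbar_rgbs, dtype=float)            # (N, 3), ordered from vmin to vmax
    idx = int(np.argmin(np.linalg.norm(bar - rgb, axis=1)))
    return vmin + (vmax - vmin) * idx / (len(bar) - 1)

# Toy legend: blue (cold) to red (hot), spanning 20 C to 30 C
legend = [(0, 0, 255), (0, 255, 255), (0, 255, 0), (255, 255, 0), (255, 0, 0)]
print(color_to_value((250, 250, 10), legend, vmin=20.0, vmax=30.0))   # ~27.5 C
```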

26 pages, 3226 KB  
Article
Assessing Street-Level Emotional Perception in Urban Regeneration Contexts Using Domain-Adapted CLIP
by Liyang Chu and Keting Zhou
Buildings 2026, 16(5), 980; https://doi.org/10.3390/buildings16050980 - 2 Mar 2026
Abstract
As urban regeneration goals shift from physical improvement to pedestrian-level experience and emotional perception, existing assessment methods struggle to describe the emotional responses associated with renewed street environments. This paper proposes a framework for street-level emotional perception inference and analysis within the context of urban regeneration, enabling automatic semantic recognition based on Street View Images (SVIs) and a Vision-Language Model (VLM). The paper constructs a six-dimensional emotion perceptual framework encompassing Comfort, Vitality, Safety, Oppressiveness, Nostalgia, and Alienation, and uses a lightweight domain-adapted Contrastive Language-Image Pre-training (CLIP) model to infer emotional perceptions from SVIs. Building upon this, a dual-axis evaluation framework is introduced to structure and interpret basic spatial experience and regeneration-related perception. Using the Yuyuan Road and Wuding Road areas in Shanghai as a case study, the paper combines emotional perception results with street-level spatial analysis, proposing a scalable and interpretable analytical method for diagnosing urban regeneration outcomes and supporting emotion-informed spatial interventions. Full article
(This article belongs to the Section Architectural Design, Urban Science, and Real Estate)
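
Before any domain adaptation, scoring the six emotion dimensions reduces to zero-shot CLIP similarity between a street-view image and one prompt per emotion; the prompts and checkpoint below are illustrative choices, not the paper's:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

emotions = ["comfort", "vitality", "safety", "oppressiveness", "nostalgia", "alienation"]
prompts = [f"a street that evokes a sense of {e}" for e in emotions]

image = Image.open("street_view.jpg")                       # hypothetical SVI file
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)    # (1, 6) relative emotion scores

for emotion, p in zip(emotions, probs[0].tolist()):
    print(f"{emotion}: {p:.3f}")
```

Domain adaptation, as described in the abstract, would then fine-tune this scoring against locally collected perception labels rather than relying on the generic prompts alone.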
