Search Results (298)

Search Parameters:
Keywords = interactive multimodal environment

25 pages, 1183 KB  
Article
A Federated Digital Twin Framework for Consumer Wellbeing Systems
by Matti Rachamim and Jacob Hornik
Systems 2026, 14(4), 417; https://doi.org/10.3390/systems14040417 - 9 Apr 2026
Abstract
Consumer wellbeing systems are characterized by conceptual fragmentation, heterogeneous data sources, and multilevel interactions across economic, psychological, social, and environmental domains. Existing monitoring approaches remain largely unidimensional and lack integrative system architectures capable of supporting real-time, adaptive analysis. This paper proposes a Federated Digital Twin (FDT) framework for Consumer Wellbeing Systems, designed to integrate decentralized, multimodal data while preserving autonomy and privacy. The proposed architecture builds on a five-dimensional digital twin model and extends it through federated interoperability, data fusion, adaptive learning, simulation capabilities, and human-in-the-loop mechanisms. The framework enables the synchronization of observed, self-reported, contextual, and synthetic data across distributed environments, supporting system-level modeling, prediction, and optimization. As an illustrative application, the paper examines Shopping Wellbeing and Shopping–Life Balance as sub-systems within broader wellbeing ecosystems, demonstrating how federated digital twins can unify fragmented theoretical constructs into a coherent, dynamic monitoring structure. The study contributes a system-oriented conceptual architecture for modeling complex human-centric wellbeing ecosystems and outlines implications for systems design, governance, and future interdisciplinary research. Full article
(This article belongs to the Section Complex Systems and Cybernetics)
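As a hedged illustration only (not taken from the paper), the federated aggregation step implied by such a framework can be sketched as weighted parameter averaging across sites; all function names, shapes, and data below are hypothetical.

```python
# Minimal FedAvg-style aggregation sketch: each wellbeing-monitoring site trains a
# local model and shares only parameter vectors; a coordinator averages them,
# weighted by local sample counts. Illustrative, not the paper's implementation.
import numpy as np

def federated_average(local_params: list[np.ndarray], sample_counts: list[int]) -> np.ndarray:
    """Weighted average of per-site parameter vectors."""
    weights = np.asarray(sample_counts, dtype=float)
    weights /= weights.sum()
    stacked = np.stack(local_params)          # shape: (n_sites, n_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Three hypothetical sites with differently sized local datasets.
site_params = [np.random.randn(8) for _ in range(3)]
site_sizes = [120, 450, 90]
global_params = federated_average(site_params, site_sizes)
print(global_params.shape)                    # (8,)
```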

25 pages, 11063 KB  
Article
Tac-Mamba: A Pose-Guided Cross-Modal State Space Model with Trust-Aware Gating for mmWave Radar Human Activity Recognition
by Haiyi Wu, Kai Zhao, Wei Yao and Yong Xiong
Electronics 2026, 15(7), 1535; https://doi.org/10.3390/electronics15071535 - 7 Apr 2026
Abstract
Millimeter-wave (mmWave) radar point clouds offer a privacy-preserving solution for Human Activity Recognition (HAR), but their inherent sparsity and noise limit single-modal performance. While multimodal fusion mitigates this issue, existing methods often suffer from severe negative transfer during visual degradation and incur high computational costs, unsuitable for edge devices. To address these challenges, we propose Tac-Mamba, a lightweight cross-modal state space model. First, we introduce a topology-guided distillation scheme that uses a Spatial Mamba teacher to extract structural priors from visual skeletons. These priors are then explicitly distilled into a Point Transformer v3 (PTv3) radar student with a modality dropout strategy. We also developed a Trust-Aware Cross-Modal Attention (TACMA) module to prevent negative transfer. It evaluates the reliability of visual features through a SiLU-activated cross-modal bilinear interaction, smoothly degrading to a pure radar-driven fallback projection when visual inputs are corrupted. Finally, a Lightweight Temporal Mamba Block (LTMB) with a Zero-Parameter Cross-Gating (ZPCG) mechanism captures long-range kinematic dependencies with linear complexity. Experiments on the public MM-Fi dataset under strict cross-environment protocols demonstrate that Tac-Mamba achieves competitive accuracies of 95.37% (multimodal) and 87.54% (radar-only) with only 0.86M parameters and 1.89 ms inference latency. These results highlight the model’s exceptional robustness to modality missingness and its feasibility for edge deployment. Full article
(This article belongs to the Section Artificial Intelligence)
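A minimal PyTorch sketch of the trust-aware gating idea the abstract describes: a SiLU-activated bilinear interaction scores the reliability of visual features and the fusion degrades toward a radar-only projection when that score is low. Module names and dimensions are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class TrustAwareGate(nn.Module):
    """Sketch of trust-aware cross-modal gating: weight visual features by a learned
    reliability score; fall back toward a radar-driven projection otherwise."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)   # radar x visual -> scalar trust logit
        self.radar_proj = nn.Linear(dim, dim)      # radar-only fallback projection
        self.act = nn.SiLU()

    def forward(self, radar_feat, visual_feat):
        trust = torch.sigmoid(self.act(self.bilinear(radar_feat, visual_feat)))  # (B, 1)
        fused = trust * visual_feat + (1.0 - trust) * self.radar_proj(radar_feat)
        return fused, trust

gate = TrustAwareGate(dim=128)
radar = torch.randn(4, 128)
visual = torch.zeros(4, 128)        # e.g., corrupted or missing visual input
out, trust = gate(radar, visual)
print(out.shape, trust.squeeze(-1))
```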

27 pages, 24041 KB  
Article
PMDet: Patch-Aware Enhancement and Fusion for Multispectral Object Detection
by Jie Li, Chenhong Sui, Jing Wang and Jun Zhou
Remote Sens. 2026, 18(7), 1068; https://doi.org/10.3390/rs18071068 - 2 Apr 2026
Viewed by 182
Abstract
Multispectral object detection addresses the limitations of single-modal approaches by fusing complementary information from visible and infrared images, thereby improving robustness in complex environments. However, the inter-modal representations are inherently misaligned due to sensing discrepancies, and the complementary cues they provide are often imbalanced, making it difficult to exploit modality-specific information effectively. Moreover, directly merging features from different modalities can introduce noise and artifacts that deteriorate the detection performance. To this end, this paper proposes a patch-aware enhancement and fusion network for multispectral object detection (PMDet). This method employs a dual-stream backbone equipped with the patch-aware Feature Enhancer (FE) module for cross-modal features alignment and enhancement. FE not only reinforces the feature representation of key regions but also helps to suppress local noise and enhance the model’s perception of fine textures and differences. Building on these enriched features, the patch-based Feature Aggregator (FA) module allows for efficient inter-modal feature interaction and semantic fusion with noise resistance. Specifically, both FE and FA modules leverage the shifted-patch design to preserve computational efficiency while enabling long-range modeling. In this regard, PMDet couples multi-scale cross-modal semantic enhancement with deep semantic fusion to form a stable and discriminative multimodal representation pipeline. Experiments on FLIR, LLVIP, and VEDAI demonstrate that the method outperforms mainstream approaches in detection accuracy and robustness, and ablation studies further verify the effectiveness of each module. Full article

39 pages, 96608 KB  
Article
Multi-Modal Feature Fusion and Hierarchical Classification for Automated Equine–Human Interaction Behavior Recognition
by Samierra Arora, Emily Kieson, Christine Rudd and Peter A. Gloor
Sensors 2026, 26(7), 2202; https://doi.org/10.3390/s26072202 - 2 Apr 2026
Viewed by 727
Abstract
Automated recognition of equine–human interaction behaviors from video represents a significant challenge in computational ethology, with critical applications spanning animal welfare assessment, equine-assisted services evaluation, and safety monitoring in equestrian environments. Existing approaches to animal behavior recognition typically focus on single species in isolation, rely solely on facial expression analysis while ignoring full-body posture, or employ flat classification architectures that fail under the severe class imbalances characteristic of naturalistic behavioral datasets. Furthermore, no prior framework integrates simultaneous analysis of both human and equine body language for cross-species interaction classification. This paper presents a novel hierarchical classification framework integrating multi-modal computer vision features to distinguish behavioral states during horse–human encounters. Our methodology employs three complementary feature extraction pipelines: YOLOv8 for spatial relationship modeling, MediaPipe for human postural analysis, and AP-10K for equine body language interpretation. From 28 annotated interaction videos comprising 50,270 temporal samples across five horse breeds, we extract 35 discriminative features capturing proximity dynamics, body orientation, and species-specific behavioral indicators. To address severe class imbalance (18.3:1 ratio between affiliative and avoidant categories), we implement cost-sensitive gradient boosting with automatic class weight optimization within a two-stage hierarchical architecture. The first stage classifies interactions into three parent categories (affiliative, neutral, avoidant) achieving 73.2% balanced accuracy, while stage two discriminates six fine-grained sub-behaviors achieving 88.5% balanced accuracy (under oracle parent-category routing; cascaded end-to-end performance is 62.9% balanced accuracy due to Stage 1 error propagation, identifying parent classification as the primary bottleneck). Notably, our system achieves 85.0% recall on safety-critical avoidant behaviors despite their representation of only 3.8% of the dataset. Extensive ablation studies demonstrate that equine pose features contribute most critically to classification performance, while comprehensive cross-validation analysis confirms model robustness across diverse interaction contexts. The proposed framework establishes the first systematic multimodal cross-species behavioral assessment pipeline in human–animal interaction research, with direct implications for improving equine welfare monitoring and rider safety protocols. Full article
(This article belongs to the Special Issue Innovative Sensing Methods for Motion and Behavior Analysis)
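A hedged sketch of the two-stage, cost-sensitive gradient-boosting setup the abstract describes, using scikit-learn with balanced sample weights; the feature matrix, labels, and routing below are illustrative placeholders, not the study's data or code.

```python
# Stage 1 predicts a parent category (affiliative/neutral/avoidant); stage 2 predicts a
# fine-grained sub-behavior within the predicted parent. Class imbalance is handled with
# 'balanced' sample weights. Illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 35))                      # 35 pose/proximity features (toy data)
parent = rng.choice(["affiliative", "neutral", "avoidant"], size=1000, p=[0.75, 0.21, 0.04])
sub = np.array([p + "_sub" + str(rng.integers(2)) for p in parent])   # toy sub-behaviors

stage1 = GradientBoostingClassifier()
stage1.fit(X, parent, sample_weight=compute_sample_weight("balanced", parent))

stage2 = {}                                           # one sub-classifier per parent class
for cls in np.unique(parent):
    mask = parent == cls
    clf = GradientBoostingClassifier()
    clf.fit(X[mask], sub[mask], sample_weight=compute_sample_weight("balanced", sub[mask]))
    stage2[cls] = clf

pred_parent = stage1.predict(X[:5])
pred_sub = [stage2[p].predict(x.reshape(1, -1))[0] for p, x in zip(pred_parent, X[:5])]
print(list(zip(pred_parent, pred_sub)))
```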

45 pages, 8329 KB  
Article
HRV-Based Multimodal Physiological Signal Monitoring Using Wearable Biosensors in Human–Computer Interaction: Cognitive Load in Real-Time Strategy Games
by Yunlong Shi, Muyesaier Kuerban, Yiyang Jin, Chaoyue Wang and Lu Chen
Sensors 2026, 26(7), 2181; https://doi.org/10.3390/s26072181 - 1 Apr 2026
Viewed by 457
Abstract
Real-time strategy (RTS) games provide a cognitively demanding and ecologically valid context for investigating workload dynamics in human–computer interaction (HCI). This multimodal study (HRV, NASA-TLX, behavior, interviews) examined multitasking, visual complexity, and decision pressure in 36 novice RTS players. High multitasking significantly increased subjective workload (total raw-TLX: from 22.50 ± 14.65 to 36.47 ± 20.19, p < 0.001) and prolonged completion time (from 317.17 ± 37.26 s to 354.92 ± 50.70 s, p < 0.001). Decision pressure elevated subjective workload (total raw-TLX: from 20 to 28, p = 0.008) without affecting performance. Although HRV did not consistently differentiate experimental conditions at the group level, it showed stable individual-level associations with perceived workload—both in expected directions (e.g., LF power positively correlated with total raw-TLX across four experiments, r = 0.28–0.53, all p < 0.05) and in inverse relationships that deviate from conventional stress models (e.g., stress index negatively correlated with total raw-TLX, r = −0.34 to −0.40, all p < 0.01). These findings suggest that autonomic responses in complex interactive environments may reflect dynamic engagement processes rather than uniform stress activation, supporting multimodal cognitive load assessment and offering transferable insights for interface design and workload evaluation in demanding HCI contexts. Full article
(This article belongs to the Special Issue Human–Computer Interaction in Sensor Systems)
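A minimal sketch of the kind of individual-level association analysis reported above (Pearson correlation between an HRV index and total raw-TLX scores); the data are synthetic placeholders, not the study's measurements.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
lf_power = rng.gamma(shape=2.0, scale=300.0, size=36)        # hypothetical LF power values, 36 players
raw_tlx = 20 + 0.02 * lf_power + rng.normal(0, 5, size=36)   # hypothetical total raw-TLX scores

r, p = pearsonr(lf_power, raw_tlx)
print(f"LF power vs. raw-TLX: r = {r:.2f}, p = {p:.3f}")
```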

32 pages, 21661 KB  
Article
Robust Human-to-Robot Handover System Under Adverse Lighting
by Yifei Wang, Baoguo Xu, Huijun Li and Aiguo Song
Biomimetics 2026, 11(4), 231; https://doi.org/10.3390/biomimetics11040231 - 1 Apr 2026
Viewed by 286
Abstract
Human-to-robot (H2R) handovers are critical in human–robot interaction but are challenged by complex environments that impact robot perception. Traditional RGB-based perception methods exhibit severe performance degradation under harsh lighting (e.g., glare and darkness). Furthermore, H2R handovers occur in unstructured environments populated with fine-grained visual details, such as multi-angle hand configurations and novel object geometries, where conventional semantic segmentation and grasp generation approaches struggle to generalize. To overcome lighting disturbances, we present an H2R handover system with a dual-path perception pipeline. The system fuses perception data from a stereo RGB-D camera (eye-in-hand) and a time-of-flight (ToF) camera (fixed scene) under normal lighting, and switches to the ToF camera for reliable perception under glare and darkness. In parallel, to address the complex spatial and geometric features, we augment the Point Transformer v3 (PTv3) architecture by integrating a T-Net module and a self-attention mechanism to fuse the relative positional angle features between human and robot, enabling efficient real-time 3D semantic segmentation of both the object and the human hand. For grasp generation, we extend GraspNet with a grasp selection module optimized for H2R scenarios. We validate our approach through extensive experiments: (1) a semantic segmentation dataset with 7500 annotated point clouds covering 15 objects and 5 relative angles and tested on 750 point clouds from 15 unseen objects, where our method achieves 84.4% mIoU, outperforming Swin3D-L by 3.26 percentage points with 3.2× faster inference; (2) 250 real-world handover trials comparing our method with the baseline across 5 objects, 5 hand postures, and 5 angles, showing an improvement of 18.4 percentage points in success rate; (3) 450 trials under controlled adverse lighting (darkness and glare), where our dual-path perception method achieves 82.7% overall success, surpassing single-camera baselines by up to 39.4 percentage points; and (4) a comparative experiment against a state-of-the-art multimodal H2R handover method under identical adverse lighting, where our system achieves 75.0% success (15/20) versus the baseline’s 15.0% (3/20), further confirming the lighting robustness of our design. These results demonstrate the system’s robustness and generalization in challenging H2R handover scenarios. Full article
(This article belongs to the Special Issue Human-Inspired Grasp Control in Robotics 2025)
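A hedged sketch of the dual-path perception logic described in the abstract: fuse RGB-D and ToF point clouds under normal lighting, and fall back to the ToF-only path when the RGB frame indicates darkness or glare. The brightness thresholds and cloud-merging step are illustrative assumptions.

```python
import numpy as np

def lighting_ok(rgb: np.ndarray, dark_thresh: float = 30.0, glare_thresh: float = 225.0) -> bool:
    """Crude lighting check on mean pixel intensity of an 8-bit RGB frame."""
    mean_intensity = rgb.mean()
    return dark_thresh < mean_intensity < glare_thresh

def select_perception_input(rgb: np.ndarray, rgbd_cloud: np.ndarray, tof_cloud: np.ndarray) -> np.ndarray:
    if lighting_ok(rgb):
        return np.vstack([rgbd_cloud, tof_cloud])   # fuse both sources: (N1 + N2, 3)
    return tof_cloud                                # ToF-only fallback under glare/darkness

rgb = np.full((480, 640, 3), 10, dtype=np.uint8)    # simulated dark frame
cloud = select_perception_input(rgb, np.random.rand(1000, 3), np.random.rand(800, 3))
print(cloud.shape)                                  # (800, 3) -> ToF-only path selected
```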

34 pages, 181429 KB  
Article
SENSASEA: Fostering Positive Behavioral Manifestations and Social Collaboration in Children Through an Interactive Multimodal Environment
by Yanjun Lyu, Ripon Kumar Saha, Assegid Kidane, Lauren Hayes and Xin Wei Sha
Multimedia 2026, 2(2), 5; https://doi.org/10.3390/multimedia2020005 - 31 Mar 2026
Viewed by 260
Abstract
The SensaSea System is a responsive multisensory environment, specifically, a room-sized interactive installation that incorporates wearable devices, interactive visual floor projections and auditory and tactile modalities. SensaSea is designed as a physical environment for embodied interaction and free play suitable for multiple players; the system uses social proximity as the primary mechanism. Our objective is to promote active peer interaction and social connectedness among elementary school children through sensory-guided approaches which include digitized and projected interactive sea creatures. The multi-modal system also features an interactive soundscape and innovative real-time haptic feedback. We conducted eight group user studies (24 children in total). Our usability and feasibility tests demonstrated that the system results in positive emotions and elicits multiple pro-social behaviors. Full article

18 pages, 833 KB  
Article
A Federated FHIR-Based Interoperability Framework for Multi-Site Heart Failure Monitoring: The RETENTION Project
by Nikolaos Vasileiou, Olympia Giannakopoulou, Ourania Manta, Konstantinos Bromis, Theodoros P. Vagenas, Ioannis Kouris, Maria Roumpi, Lefteris Koumakis, Yorgos Goletsis, Maria Haritou, George K. Matsopoulos, Dimitris Fotiadis and Dimitris D. Koutsouris
Computers 2026, 15(4), 212; https://doi.org/10.3390/computers15040212 - 31 Mar 2026
Viewed by 261
Abstract
Heart failure management increasingly relies on heterogeneous clinical and real-world data generated through remote monitoring technologies. However, transforming these multimodal data streams into actionable insights requires robust interoperability infrastructures. This study presents the RETENTION interoperability framework, a federated HL7 Fast Healthcare Interoperability Resources (FHIR)-based architecture designed to support multi-site heart failure monitoring across five independent clinical environments. A semantic reference model comprising 444 clinical and contextual variables was developed and aligned with FHIR R4 resources and internationally recognised terminology systems. The platform adopts a selective profiling strategy, extending only the Patient resource while standardising the remaining variables through example-driven Implementation Guide documentation. Identifiable data are retained locally within Clinical Site Backends, whereas anonymised datasets are periodically aggregated into a Global Insights Cloud to enable centralised analytics and controlled third-party interactions. The framework was deployed across six hospitals (with two Spanish hospitals sharing the same deployment), supporting 390 patients and over 130,000 patient-days of monitoring, with more than 3.6 million remote device data points harmonised without schema conflicts. The results demonstrate that large-scale semantic harmonisation and privacy-preserving aggregation can be achieved using a lightweight profiling approach, providing a scalable and reproducible interoperability model for multi-centre digital health research infrastructures. Full article
(This article belongs to the Section Cloud Continuum and Enabled Applications)
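As a hedged illustration of the selective-profiling idea (an extended FHIR R4 Patient resource with standard fields plus a project-specific extension), the JSON below uses placeholder URLs and values, not the RETENTION Implementation Guide.

```python
import json

# Hypothetical FHIR R4 Patient resource with one project-specific extension, as plain JSON.
# The identifier system and extension URL are placeholders, not the RETENTION profile.
patient = {
    "resourceType": "Patient",
    "id": "hf-patient-001",
    "identifier": [{"system": "https://example.org/retention/ids", "value": "SITE3-0042"}],
    "gender": "female",
    "birthDate": "1954-07-21",
    "extension": [{
        "url": "https://example.org/fhir/StructureDefinition/nyha-class",
        "valueCode": "II"
    }]
}
print(json.dumps(patient, indent=2))
```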

27 pages, 4046 KB  
Article
A Deep Learning Framework for Predicting Psycho-Physiological States in Urban Underground Systems: Automating Human-Centric Environmental Perception
by Guanjie Huang and Hongzan Jiao
Buildings 2026, 16(7), 1328; https://doi.org/10.3390/buildings16071328 - 27 Mar 2026
Viewed by 296
Abstract
Traditional Post-Occupancy Evaluation (POE) is static and incompatible with dynamic systems like Digital Twins, creating a digital gap in managing health-oriented urban environments, especially in Urban Underground Spaces (UUS). This paper bridges this gap with a deep learning framework that automates the continuous prediction of human physiological arousal. We created a novel multimodal dataset from in situ experiments, synchronizing first-person video, environmental data, and Galvanic Skin Response (GSR) as a real-time physiological arousal proxy. Our dual-branch spatial–temporal model fuses these data streams to predict GSR with high accuracy (Pearson’s r = 0.72), effectively mapping objective environmental inputs to continuous human physiological dynamics. This framework provides an automated, human-centric analysis engine for urban planning, design validation, and real-time building management. It establishes a foundational ‘human dynamics layer’ for urban Digital Twins, evolving them into predictive tools for simulating human-environment interactions and embedding physiological perception into intelligent urban systems. Full article

19 pages, 4749 KB  
Article
A Human-Centred Extended Reality (XR) System for Safe Human–Robot Collaboration (HRC) in Smart Logistics
by Adamos Daios and Ioannis Kostavelis
Systems 2026, 14(4), 348; https://doi.org/10.3390/systems14040348 - 25 Mar 2026
Viewed by 359
Abstract
HRC is increasingly adopted in industrial and logistics environments, while workforce preparation often remains constrained by instructional approaches that provide limited embodied understanding of safety and ergonomics. This study examines the architectural design and system integration of a modular, human-centred XR platform intended to support safe and ergonomics-aware collaboration within smart logistics contexts. The proposed system integrates XR training scenarios deployed on consumer-grade hardware and follows a structured pedagogical progression from conceptual familiarisation through experiential task execution to reflective ergonomic evaluation. Multimodal feedback mechanisms based on posture-oriented guidance, attention-aware interaction design, and context-sensitive safety cues are incorporated without reliance on intrusive sensing technologies. A structured evaluation framework is defined to examine usability, task performance, and ergonomics-aligned posture indicators using standardised instruments and system-generated telemetry. The architectural design indicates that the framework supports scalable deployment, consistent interaction fidelity, and privacy-conscious data handling across educational and vocational settings. The proposed framework suggests that human-centred XR architectures can strengthen safety-oriented and ergonomically informed HRC within Industry 4.0 logistics environments. Full article

24 pages, 1460 KB  
Perspective
From Sensing to Sense-Making: A Framework for On-Person Intelligence with Wearable Biosensors and Edge LLMs
by Tad T. Brunyé, Mitchell V. Petrimoulx and Julie A. Cantelon
Sensors 2026, 26(7), 2034; https://doi.org/10.3390/s26072034 - 25 Mar 2026
Viewed by 527
Abstract
Wearable biosensors increasingly stream multi-channel physiological and behavioral data outside the laboratory, yet most deployments still end in dashboards or threshold alarms that leave interpretation open to the user. In high-stakes domains, such as military, emergency response, aviation, industry, and elite sport, the constraint is rarely data availability but the cognitive effort required to convert noisy signals into timely, actionable decisions. We argue for on-person cognitive co-pilots: systems that integrate multimodal sensing, compute probabilistic state estimates on devices, synthesize those states with task and environmental context using locally hosted large language models (LLMs), and deliver recommendations through attention-appropriate cues that preserve autonomy. Enabling conditions include mature wearable sensing, edge artificial intelligence (AI) accelerators, tiny machine learning (TinyML) pipelines, privacy-preserving learning, and open-weight LLMs capable of local deployment with retrieval and guardrails. However, critical research gaps remain across layers: sensor validity under real-world conditions, uncertainty calibration and fusion under distribution shift, verification of LLM-mediated reasoning, interaction design that avoids alarm fatigue and automation bias, and governance models that protect privacy and consent in constrained settings. We propose a layered technical framework and research agenda grounded in cognitive engineering and human–automation interaction. Our core claim is that local, uncertainty-aware reasoning is an architectural prerequisite for trustworthy, low-latency augmentation in isolated, confined, and extreme environments. Full article
(This article belongs to the Special Issue Sensors in 2026)
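A hedged sketch of the sense-making layer the authors argue for: on-device, uncertainty-aware state estimates are combined with task context into a prompt for a locally hosted LLM. The state names, thresholds, and prompt wording are illustrative assumptions, not the paper's framework.

```python
from dataclasses import dataclass

@dataclass
class StateEstimate:
    name: str            # e.g., "thermal_strain"
    probability: float   # calibrated probability that the state is present
    uncertainty: float   # e.g., half-width of a credible interval

def build_copilot_prompt(states: list[StateEstimate], task_context: str) -> str:
    """Format uncertainty-aware state estimates and context for a locally hosted LLM."""
    lines = [f"- {s.name}: p={s.probability:.2f} (+/- {s.uncertainty:.2f})" for s in states]
    return (
        "You are an on-person decision support assistant.\n"
        f"Task context: {task_context}\n"
        "Current physiological state estimates:\n" + "\n".join(lines) +
        "\nRecommend at most one action, or say 'no action' if uncertainty is high."
    )

prompt = build_copilot_prompt(
    [StateEstimate("thermal_strain", 0.81, 0.10), StateEstimate("cognitive_overload", 0.35, 0.25)],
    "dismounted patrol, 90 minutes elapsed, ambient 38 C",
)
print(prompt)  # this string would then be passed to a local, open-weight LLM runtime
```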

15 pages, 287 KB  
Proceeding Paper
Computer Vision for Collaborative Robots in Industry 5.0: A Survey of Techniques, Gaps, and Future Directions
by Himani Varolia, César M. A. Vasques and Adélio M. S. Cavadas
Eng. Proc. 2026, 124(1), 99; https://doi.org/10.3390/engproc2026124099 - 24 Mar 2026
Viewed by 173
Abstract
Collaborative robots are increasingly deployed in human-shared industrial workspaces, where perception is a key enabler for safe interaction, flexible manipulation, and human-aware task execution. In the context of Industry 5.0, computer vision for cobots must meet not only accuracy requirements but also human-centered constraints such as safety, transparency, robustness, and practical deployability. This paper surveys computer-vision approaches used in collaborative robotics and organizes them through a task-driven taxonomy covering detection, segmentation, tracking, pose estimation, action/gesture recognition, and safety monitoring. Beyond a descriptive literature review, the paper provides a task-driven qualitative analytical perspective that relates families of computer vision methods to key industrial constraints, including occlusion, lighting variability, clutter, domain shift, real-time latency, and annotation cost, and summarizes comparative strengths and failure modes using unified criteria. We further discuss challenges related to data availability and evaluation practices, highlighting gaps in reproducibility, standardized metrics, and real-world validation in shared human–robot environments. Finally, we outline implementation and deployment considerations across common software stacks (e.g., Python-based pipelines and MATLAB-based prototyping), emphasizing ROS2 integration, edge inference, and lifecycle maintenance. The survey concludes with research directions toward robust multimodal perception, explainable human-aware vision, and benchmarkable safety-critical perception for next-generation collaborative robotic systems. Full article
(This article belongs to the Proceedings of The 6th International Electronic Conference on Applied Sciences)
51 pages, 2633 KB  
Review
Large-Scale Model-Enhanced Vision-Language Navigation: Recent Advances, Practical Applications, and Future Challenges
by Zecheng Li, Xiaolin Meng, Xu He, Youdong Zhang and Wenxuan Yin
Sensors 2026, 26(7), 2022; https://doi.org/10.3390/s26072022 - 24 Mar 2026
Viewed by 599
Abstract
The ability to autonomously navigate and explore complex 3D environments in a purposeful manner, while integrating visual perception with natural language interaction in a human-like way, represents a longstanding research objective in Artificial Intelligence (AI) and embodied cognition. Vision-Language Navigation (VLN) has evolved from geometry-driven to semantics-driven and, more recently, knowledge-driven approaches. With the introduction of Large Language Models (LLMs) and Vision-Language Models (VLMs), recent methods have achieved substantial improvements in instruction interpretation, cross-modal alignment, and reasoning-based planning. However, existing surveys primarily focus on traditional VLN settings and offer limited coverage of LLM-based VLN, particularly in relation to Sim2Real transfer and edge-oriented deployment. This paper presents a structured review of LLM-enabled VLN, covering four core components: instruction understanding, environment perception, high-level planning, and low-level control. Edge deployment and implementation requirements, datasets, and evaluation protocols are summarized, along with an analysis of task evolution from path-following to goal-oriented and demand-driven navigation. Key challenges, including reasoning complexity, spatial cognition, real-time efficiency, robustness, and Sim2Real adaptation, are examined. Future research directions, such as knowledge-enhanced navigation, multimodal integration, and world-model-based frameworks, are discussed. Overall, LLM-driven VLN is progressing toward deeper cognitive integration, supporting the development of more explainable, generalizable, and deployable embodied navigation systems. Full article

20 pages, 4508 KB  
Article
IAF-RTDETR: Illumination Evaluation-Driven Multimodal Object Detection Network for Infrared–Visible Dual-Source Fusion
by Qi Hu, Haiyan Yu, Zhiquan Zhou and Simiao Li
Electronics 2026, 15(6), 1332; https://doi.org/10.3390/electronics15061332 - 23 Mar 2026
Viewed by 296
Abstract
Infrared–visible multimodal object detection has attracted increasing attention for its robustness under challenging conditions such as low illumination, occlusion, and complex backgrounds. However, existing fusion methods often suffer from coarse illumination modeling and insufficient cross-modal semantic alignment, leading to performance degradation in scenes with strong illumination variations or modality imbalance. To address these issues, this paper proposes IAF-RTDETR (Illumination-Aware Fusion RT-DETR), an illumination-aware fusion real-time detection network built upon the RT-DETR framework. The proposed method introduces a progressive fusion pipeline composed of four key modules: (1) a Modality-Specific Feature Enhancer to recalibrate modality-dependent representations and suppress low-quality feature interference; (2) a lightweight Global Light Estimator that learns a continuous illumination score via self-supervised proxy supervision derived from RGB image statistics; (3) a Light-Aware Fusion module that dynamically adjusts multi-scale fusion weights of infrared and visible features according to the estimated illumination; and (4) a Cross-Layer Dual-Branch Interaction Module that alleviates cross-modal semantic shift through bidirectional attention-guided interaction and channel reweighting. Extensive experiments on the M3FD dataset demonstrate that the proposed method achieves consistent performance improvements under diverse lighting conditions, outperforming RGB-only and IR-only baselines by 7.4% and 16.1% in mAP@50, respectively, while maintaining real-time inference speed (≈17.3 ms). Further evaluations on the LLVIP dataset validate the robustness and generalization ability of IAF-RTDETR in real low-illumination scenarios. Moreover, compared with representative multimodal fusion methods such as TFDet and TarDAL, the proposed method achieves superior detection accuracy. Visualization and quantitative semantic consistency analyses further confirm the effectiveness of the proposed illumination-aware fusion and cross-layer interaction mechanisms. These results indicate that IAF-RTDETR provides an effective and practical solution for real-time infrared–visible object detection under complex lighting environments. Full article
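A minimal PyTorch sketch of the light-aware fusion idea described above: a predicted illumination score in [0, 1] sets the mixing weights between visible and infrared feature maps before a refinement convolution. Class names, channel counts, and shapes are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LightAwareFusion(nn.Module):
    """Sketch: an illumination score (1 = bright) weights visible vs. infrared features;
    low illumination shifts the mix toward the infrared branch."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)  # post-mix refinement

    def forward(self, rgb_feat, ir_feat, illumination):              # illumination: (B, 1)
        w = illumination.view(-1, 1, 1, 1)
        weighted = torch.cat([w * rgb_feat, (1.0 - w) * ir_feat], dim=1)
        return self.mix(weighted)

fusion = LightAwareFusion(channels=256)
rgb, ir = torch.randn(2, 256, 40, 40), torch.randn(2, 256, 40, 40)
illum = torch.tensor([[0.9], [0.1]])                                 # bright scene vs. night scene
print(fusion(rgb, ir, illum).shape)                                  # torch.Size([2, 256, 40, 40])
```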

17 pages, 1035 KB  
Perspective
Reconstructing Multilingual Development Research: Shifting from a Monolingual Bias and Toward a Developmental Systems Framework
by Marissa A. Castellana and Viridiana L. Benitez
Behav. Sci. 2026, 16(3), 473; https://doi.org/10.3390/bs16030473 - 22 Mar 2026
Viewed by 467
Abstract
Multilingual research offers a unique window into the diverse developmental trajectories of language and cognition; yet this research has largely been built on a monolingual framework. Here, we first describe how a monolingual bias has limited theory construction and research on the multilingual experience. We then apply a developmental systems framework to understand the multilingual experience, shifting the field away from a monolingual bias toward centering the lived language experiences of multilingual children. At the center of our framework are the moment-to-moment, multimodal, and dynamic interactions between children, their social partners, and environment. Contributing to interaction dynamics are child and social partner characteristics (cognition, motivation, and experiences), as well as contextual factors (activities, places, and policies) that can shape multilingual exposure. Cultural practices, values, and beliefs, as well as developmental time at the micro level (seconds, hours, days) and the macro level (weeks, months, and years), permeate all levels of the framework. Our proposal reveals important avenues of future research, including (1) understanding the dynamic coordination of multimodal behaviors and languages within interactions, (2) how experiences specific to minoritized communities (e.g., language discrimination) shape interaction dynamics, (3) how the temporal patterns of language experience at the micro level contribute to long-term multilingual exposure, and (4) understanding experiences of different multilingual communities within and across communities. Use of this framework can advance knowledge of the contexts enriching multilingual experiences and reconstruct multilingual development research for the benefit of multilingual learners. Full article
(This article belongs to the Special Issue Language and Cognitive Development in Bilingual Children)
