Multimodal Sensing, Fusion, and VLMs for Scene Understanding and Robot Vision

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensors and Robotics".

Deadline for manuscript submissions: 30 September 2026 | Viewed by 2427

Special Issue Editors

Guest Editor
Department of Computer Engineering, Keimyung University, Shindang-Dong, Dalseo-Gu, Daegu 704-701, Republic of Korea
Interests: computer vision; pattern recognition; object detection and tracking; deep learning

Guest Editor
Department of Robot Engineering, Keimyung University, Dalseo-gu, Daegu 42601, Republic of Korea
Interests: robotic platform design; optimal mechanism; reinforcement learning

Special Issue Information

Dear Colleagues,

Recent advances in multimodal sensing, vision–language models (VLMs), and multimodal large language models (MLLMs) are reshaping scene understanding and robot vision. This Special Issue explicitly welcomes work centered on sensors and their applications, including the following: (i) sensor design and selection (RGB, depth, thermal, event, LiDAR, radar, IMU, microphones), placement, and synchronization; (ii) data acquisition pipelines (ground-truthing, calibration, time/pose alignment, distortion compensation, uncertainty/quality assessment); (iii) sensor-aware fusion architectures (modality reliability weighting, failure detection, cross-modal registration); and (iv) deployment on real systems (embedded/edge platforms, power/latency trade-offs, safety).

We encourage submissions that report new datasets and benchmarks built from real sensor suites, application-driven case studies (e.g., mobile/industrial robots, assistive and field robotics), and evaluation protocols that quantify calibration quality, robustness under domain shift, and reliability in the wild. Algorithmic contributions (e.g., open-vocabulary perception, grounded VQA, vision-language-action policies) are welcome when they are grounded in sensor usage, for example through realistic data collection, sensor-specific calibration, or hardware-in-the-loop experiments. Our goal is to advance scalable, safe, and explainable robot vision by tightly coupling sensing, fusion, and VLM-driven perception in practical environments.

Prof. Dr. ByoungChul Ko
Dr. Sungkeun Yoo
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, authors can proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for the submission of manuscripts are available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is CHF 2600 (Swiss francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multimodal sensing and fusion (RGB/Depth/Thermal/Event/LiDAR/Radar/IMU/Audio)
  • sensor calibration, synchronization, and uncertainty modeling
  • Vision–Language Models (VLMs)/Multimodal LLMs (MLLMs) for sensor-grounded perception
  • open-vocabulary detection/segmentation/tracking with real sensor data
  • grounded VQA, referring expressions, spatio-temporal/3D grounding
  • 3D perception and SLAM with language guidance
  • cross-modal registration
  • edge/embedded deployment, real-time inference, power–latency tradeoffs
  • robustness, safety, OOD detection, and trustworthy evaluation
  • retrieval-augmented and knowledge-aware perception with sensor priors
  • datasets, data acquisition protocols, and reproducible benchmarks

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad-scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies is available on the MDPI website.

Published Papers (2 papers)

Research

17 pages, 4640 KB  
Article
Multimodal Navigation System for Visually Impaired Users Using Environmental Perception and Vision-Language Models
by Huei-Yung Lin, Yu-Hsiang Fan and Chin-Chen Chang
Sensors 2026, 26(10), 3045; https://doi.org/10.3390/s26103045 - 12 May 2026
Viewed by 373
Abstract
Visually impaired users face significant challenges in navigating complex indoor environments due to limited spatial awareness and lack of real-time semantic guidance. This paper proposes a multimodal navigation system integrating environmental perception with vision-language models (VLMs). It provides context-aware and explainable guidance without requiring additional infrastructure. The proposed system combines RTAB-Map for localization, YOLO-World for open-vocabulary object detection, and a lightweight language model for semantic reasoning and natural language interaction. To evaluate our system, experiments are conducted using the RePOPE benchmark to assess hallucination in vision-language understanding. Real-world indoor navigation experiments are also performed. The results show that integrating perception with language-based reasoning improves precision by up to 2.29% and consistently enhances F1-score compared to baseline VLM approaches. Real-world experiments further demonstrate reliable navigation performance, including multi-floor path planning and obstacle-aware guidance. Hence, the proposed system effectively enhances spatial understanding and reduces hallucination, providing a practical and scalable solution for assistive navigation.

20 pages, 7030 KB  
Article
Latency-Aware Benchmarking of Large Language Models for Natural-Language Robot Navigation in ROS 2
by Murat Das, Zawar Hussain and Muhammad Nawaz
Sensors 2026, 26(2), 608; https://doi.org/10.3390/s26020608 - 16 Jan 2026
Viewed by 1714
Abstract
A growing challenge in mobile robotics is the reliance on complex graphical interfaces and rigid control pipelines, which limit accessibility for non-expert users. This work introduces a latency-aware benchmarking framework that enables natural-language robot navigation by integrating multiple Large Language Models (LLMs) with the Robot Operating System 2 (ROS 2) Navigation 2 (Nav2) stack. The system allows robots to interpret and act upon free-form text instructions, replacing traditional Human–Machine Interfaces (HMIs) with conversational interaction. Using a simulated TurtleBot4 platform in Gazebo Fortress, we benchmarked a diverse set of contemporary LLMs, including GPT-3.5, GPT-4, GPT-5, Claude 3.7, Gemini 2.5, Mistral-7B Instruct, DeepSeek-R1, and LLaMA-3.3-70B, across three local planners, namely Dynamic Window Approach (DWB), Timed Elastic Band (TEB), and Regulated Pure Pursuit (RPP). The framework measures end-to-end response latency, instruction-parsing accuracy, path quality, and task success rate in standardised indoor scenarios. The results show that there are clear trade-offs between latency and accuracy, where smaller models respond quickly but have less spatial reasoning, while larger models have more consistent navigation intent but take longer to respond. The proposed framework is the first reproducible multi-LLM system with multi-planner evaluations within ROS 2, supporting the development of intuitive and latency-efficient natural-language interfaces for robot navigation.