Sensing the Action: Rethinking Sensor Modalities and Multi-Modal Fusion in Vision–Language–Action Models for Robotic Manipulation
Abstract
1. Introduction
1.1. 1st Generation
1.2. 2nd Generation
1.3. 3rd Generation
1.4. Related Surveys and Positioning of This Survey
1.5. Contributions and Organization of This Survey
- This survey provides a systematic taxonomy of the sensor modalities used in robotic manipulation: RGB and multi-view cameras, depth sensors, tactile sensing, force and torque sensing, proprioception and inertial measurement units (IMUs), and multi-spectral or thermal sensing. Rather than merely listing these modalities, it examines for each one both the physical nature of the information it provides and the representative failure modes and hardware constraints that arise in real deployments, such as occlusion, low illumination, transparent or reflective objects, and contact uncertainty. On this basis, the survey argues that sensor selection is a fundamental starting point in VLA design.
- This survey provides analytical formulations for calibration-induced error propagation, temporal misalignment, modality information contribution, and latency effects. These formulations connect sensor-level design decisions to measurable outcomes in VLA-based manipulation.
- This survey systematizes the sensor-to-action pipeline through which sensor signals are transformed into policy inputs and, ultimately, actions. To this end, it reviews calibration and synchronization; data collection pipelines based on teleoperation, human video, and simulation; and methods for obtaining ground-truth labels tailored to manipulation tasks. This analysis shows that sensor design is not merely a matter of choosing input modalities but a practical pipeline problem that directly determines data quality, observation-to-action alignment, and ultimately the performance of the learned policy.
- This survey organizes the fusion design space, that is, how heterogeneous sensor signals are aligned and integrated across time and space, from a sensor-centered rather than model-centered perspective. Representative strategies, including early fusion, late fusion, cross-attention, and token-level fusion, are compared with particular attention to their trade-offs under latency, synchronization error, noise, and missing modalities. In this way, the survey moves beyond asking which fusion method is best in general and instead provides guidelines on which fusion strategy suits which conditions.
- This survey proposes an evaluation framework suited to sensor-fusion-based VLA research. Most existing manipulation benchmarks assume a single RGB input and rely primarily on one metric, task success rate. This metric does not capture the performance differences that arise when the sensor configuration changes. For example, depth sensors often fail on transparent objects, the absence of tactile sensing can lead to unstable grasps, and noise in IMU signals may delay actions. To address these issues directly, this survey organizes a multidimensional set of evaluation criteria covering safety, robustness, real-time performance, and reproducibility, and derives modality-specific evaluation principles by asking which metrics properly reveal the contribution and failure characteristics of each sensor. Through this analysis, the survey connects the question of which sensors to use with the question of how to evaluate them under a coherent set of design principles.
- Building on the three preceding axes of analysis, this survey diagnoses the structural limitations caused by RGB-centric input bias and identifies open problems and future research directions for achieving deployment-level reliability. These include sensor-agnostic VLA architectures, language grounding of tactile signals, theoretical frameworks for asynchronous multi-sensor fusion, and standardized benchmarks built on real sensor inputs, which are currently absent.
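To give a concrete sense of the temporal-misalignment and latency effects named in the contributions above, a common first-order estimate treats the spatial error introduced by a timestamp offset as speed multiplied by offset. The sketch below is illustrative only and is not the formulation developed in this survey; the helper name `misalignment_error_m` and the example speed and offset are assumptions chosen for the illustration.

```python
def misalignment_error_m(speed_m_per_s: float, offset_s: float) -> float:
    """First-order position error caused by a sensor timestamp offset.

    Assumes the observed point (for example, the end effector) moves at
    roughly constant speed over the offset interval. Hypothetical helper,
    not taken from the survey.
    """
    return abs(speed_m_per_s) * abs(offset_s)


# Assumed example values: end effector moving at 0.5 m/s,
# camera frame stamped 33 ms away from the matching joint reading.
error_m = misalignment_error_m(0.5, 0.033)
print(f"{error_m * 1000:.1f} mm")  # 16.5 mm
```

Under these assumed values, an offset of roughly one frame at 30 Hz already yields an error of more than a centimeter, which is the kind of sensor-level effect the analytical formulations are meant to connect to manipulation outcomes.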
2. Sensor Modalities in VLA Systems
2.1. Vision Sensors
2.2. Tactile and Force Sensors
2.3. Proprioceptive Sensors (IMU; Joint Encoders)
2.4. Multi-Spectral/Thermal Sensors
3. Sensor-to-Action Pipeline
3.1. Calibration and Synchronization
3.2. Data Collection Pipelines for VLA Training
3.3. Ground-Truth Annotation for Manipulation
4. Multi-Modal Fusion Architectures in VLA
4.1. Fusion Strategies for Heterogeneous Sensors
4.2. Reliability-Aware and Robust Fusion
4.3. Action Representation and Temporal Alignment with Sensor Resolution
5. Benchmarks and Evaluation Protocols
5.1. Limitations of Sensor Coverage in Existing Benchmarks
- First, the set of observations provided to the policy should be specified, including modalities such as RGB, RGB-D, wrist cameras, tactile sensing, F/T sensing, and IMUs.
- Second, the sampling rate of each sensor and the synchronization protocol among sensors should be clearly defined.
- Third, the benchmark should state whether it is intended to evaluate primarily vision-centric problems, contact-centric problems, or robustness under perturbation and recovery.
- Fourth, the evaluation design should include mechanisms such as ablation studies or sensor dropout conditions that make it possible to quantitatively isolate and interpret the contribution of each sensor channel.
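The four requirements above can be read as a checklist that a benchmark description either satisfies or does not. The following is a minimal sketch of such a checklist, assuming a hypothetical `BenchmarkSpec` record; the class names, field names, allowed focus labels, and example sampling rates are illustrative assumptions, not part of any existing benchmark.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical labels for the third requirement (intended evaluation focus).
ALLOWED_FOCUS = {"vision-centric", "contact-centric", "robustness"}


@dataclass
class SensorSpec:
    name: str        # e.g. "rgb", "wrist_rgb", "tactile", "ft", "imu"
    rate_hz: float   # nominal sampling rate of this sensor


@dataclass
class BenchmarkSpec:
    sensors: List[SensorSpec]          # 1) observations given to the policy
    sync_protocol: str                 # 2) how sensor streams are synchronized
    focus: str                         # 3) intended evaluation focus
    dropout_conditions: List[List[str]] = field(default_factory=list)  # 4)

    def problems(self) -> List[str]:
        """Return a list of unmet requirements (empty if all four are met)."""
        found = []
        names = {s.name for s in self.sensors}
        if not self.sensors:
            found.append("no observation modalities specified")
        found += [f"{s.name}: sampling rate must be positive"
                  for s in self.sensors if s.rate_hz <= 0]
        if not self.sync_protocol.strip():
            found.append("synchronization protocol not defined")
        if self.focus not in ALLOWED_FOCUS:
            found.append(f"unknown evaluation focus: {self.focus!r}")
        if not self.dropout_conditions:
            found.append("no ablation or sensor dropout condition")
        for cond in self.dropout_conditions:
            unknown = [n for n in cond if n not in names]
            if unknown:
                found.append(f"dropout condition names unknown sensors: {unknown}")
        return found


# Assumed example: RGB at 30 Hz plus F/T at 500 Hz, with one F/T dropout run.
spec = BenchmarkSpec(
    sensors=[SensorSpec("rgb", 30.0), SensorSpec("ft", 500.0)],
    sync_protocol="software timestamps",
    focus="contact-centric",
    dropout_conditions=[["ft"]],
)
print(spec.problems())  # []
```

Keeping the dropout conditions in the same record as the sensor list makes it possible to check that every ablation refers to a sensor the benchmark actually provides.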
5.2. Limitations of Current Evaluation Metrics and a Proposal for a Multidimensional Evaluation Framework
- Task Completion: SR, sub-goal completion rate, and partial progress score;
- Robustness: RSR, WSR, sensor dropout sensitivity, and recovery after perturbation;
- Safety and Execution Quality: peak force, collision count, slip count, smoothness, and jerk;
- Efficiency: inference latency, control frequency, replanning count, and energy or compute cost.
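Several of the quantities listed above can be computed from logged rollouts with very little code. The sketch below shows one plausible way to do so using only the Python standard library. The function names and the specific definitions are assumptions made for illustration: dropout sensitivity is taken as the absolute drop in success rate, and jerk as the RMS of the third finite difference of a 1-D position trace. The survey's own definitions may differ.

```python
import math
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful episodes (SR)."""
    return sum(outcomes) / len(outcomes)


def dropout_sensitivity(sr_full: float, sr_dropped: float) -> float:
    """Assumed definition: absolute SR drop when one sensor channel is removed."""
    return sr_full - sr_dropped


def peak_force(forces_n: Sequence[float]) -> float:
    """Largest force magnitude (in newtons) observed during an episode."""
    return max(abs(f) for f in forces_n)


def rms_jerk(positions: Sequence[float], dt: float) -> float:
    """RMS jerk from the third finite difference of a 1-D position trace.

    Requires at least four samples taken at a fixed time step dt (seconds).
    """
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2]
         + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return math.sqrt(sum(j * j for j in jerks) / len(jerks))


# Assumed toy data: four episodes with all sensors, four with tactile removed.
sr_full = success_rate([True, True, True, False])        # 0.75
sr_no_tactile = success_rate([True, False, False, False])  # 0.25
print(dropout_sensitivity(sr_full, sr_no_tactile))       # 0.5
print(peak_force([1.2, -7.5, 3.0]))                      # 7.5
```

Reporting such per-dimension values side by side, rather than success rate alone, is what allows the contribution of an individual sensor channel to become visible.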
6. Open Problems and Future Directions
6.1. Sensor-Agnostic VLA Architectures
6.2. Tactile and Force Grounding in Language-Conditioned Manipulation
6.3. Asynchronous and Multi-Rate Sensor Fusion Frameworks
6.4. Real-Sensor-Grounded Benchmarks and Evaluation Protocols
6.5. Safe and Reliable Deployment of Sensor-Rich VLA Systems
7. Methodological Limitations of This Survey
8. Conclusions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. arXiv 2016, arXiv:1603.02199. [Google Scholar] [CrossRef]
- Eric, J.; Irpan, A.; Khansari, M.; Kappler, D.; Ebert, F.; Lynch, C.; Levine, S.; Finn, C. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. In Proceedings of the 5th Conference on Robot Learning; PMLR: Cambridge, MA, USA, 2022; Volume 164, pp. 991–1002. [Google Scholar]
- Zhang, T.; McCarthy, Z.; Jow, O.; Lee, D.; Chen, X.; Goldberg, K.; Abbeel, P. Deep Imitation Learning for Complex Manipulation Tasks from Virtual Reality Teleoperation. arXiv 2017, arXiv:1710.04615. [Google Scholar]
- Calandra, R.; Owens, A.; Jayaraman, D.; Lin, J.; Yuan, W.; Malik, J.; Adelson, E.H.; Levine, S. More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch. IEEE Robot. Autom. Lett. 2018, 4, 3300–3307. [Google Scholar] [CrossRef]
- Pinto, L.; Gupta, A. Supersizing Self-Supervision: Learning to Grasp from 50K Tries and 700 Robot Hours. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 3406–3413. [Google Scholar] [CrossRef]
- Tirado-Garin, A.; Civera, J. AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2025; pp. 8044–8055. [Google Scholar]
- Yuan, W.; Dong, S.; Adelson, E.H. GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force. Sensors 2017, 17, 2762. [Google Scholar] [CrossRef]
- Mason, M.T. Toward Robotic Manipulation. Annu. Rev. Control Robot. Auton. Syst. 2018, 1, 1–28. [Google Scholar] [CrossRef]
- Anthony, B.; Brown, N.; Carbajal, J.; Chebotar, Y.; Dabis, J.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; et al. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv 2022, arXiv:2212.06817. [Google Scholar]
- Brianna, Z.; Yu, T.; Xu, S.; Xu, P.; Xiao, T.; Xia, F.; Wu, J.; Wohlhart, P.; Welker, S.; Wahid, A.; et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Conference on Robot Learning; PMLR: Cambridge, MA, USA, 2023; Volume 229, pp. 2165–2183. [Google Scholar]
- Kim, M.J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P.; et al. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv 2024, arXiv:2406.09246. [Google Scholar]
- Reed, S.; Zolna, K.; Parisotto, E.; Colmenarejo, S.G.; Novikov, A.; Barth-Maron, G.; Gimenez, M.; Sulsky, Y.; Kay, J.; Springenberg, J.T.; et al. A Generalist Agent (Gato). arXiv 2022, arXiv:2205.06175. [Google Scholar] [CrossRef]
- Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. Int. J. Robot. Res. 2024, 44, 10–11. [Google Scholar] [CrossRef]
- Lipman, Y.; Chen, R.T.Q.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow Matching for Generative Modeling. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Ghosh, D.; Walke, H.; Pertsch, K.; Black, K.; Mees, O.; Dasari, S.; Hejna, J.; Kreiman, T.; Xu, C.; Luo, J.; et al. Octo: An Open-Source Generalist Robot Policy. In Proceedings of the Robotics: Science and Systems (RSS), Delft, The Netherlands, 15–19 July 2024. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- O’Neill, A.; Rehman, A.; Maddukuri, A.; Gupta, A.; Padalkar, A.; Lee, A.; Pooley, A.; Gupta, A.; Mandlekar, A.; Jain, A.; et al. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024. [Google Scholar]
- Kevin, B.; Brown, N.; Driess, D.; Esmail, A.; Equi, M.; Finn, C.; Fusai, N.; Groom, L.; Hausman, K.; Ichter, B.; et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv 2024, arXiv:2410.24164. [Google Scholar]
- Li, X.; Liu, M.; Zhang, H.; Yu, C.; Xu, J.; Wu, H.; Kong, T. Vision-Language Foundation Models as Effective Robot Imitators (RoboFlamingo). arXiv 2023, arXiv:2311.01378. [Google Scholar]
- Awadalla, A.; Gao, I.; Gardner, J.; Hessel, J.; Hanafy, Y.; Zhu, W.; Marathe, K.; Bitton, Y.; Gadre, S.; Sagawa, S.; et al. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. arXiv 2023, arXiv:2308.01390. [Google Scholar]
- Ud Din, M.; Akram, W.; Saad Saoud, L.; Rosell, J.; Hussain, I. Vision Language Action Models in Robotic Manipulation: A Systematic Review. arXiv 2025, arXiv:2507.10672. [Google Scholar] [CrossRef]
- Ma, Y.; Song, Z.; Zhuang, Y.; Hao, J.; King, I. A Survey on Vision-Language-Action Models for Embodied AI. arXiv 2024, arXiv:2405.14093. [Google Scholar] [CrossRef]
- Zhong, Y.; Bai, F.; Cai, S.; Huang, X.; Chen, Z.; Zhang, X.; Wang, Y.; Guo, S.; Guan, T.; Lui, K.N.; et al. A Survey on Vision-Language-Action Models: An Action Tokenization Perspective. arXiv 2025, arXiv:2507.01925. [Google Scholar] [CrossRef]
- Shao, R.; Li, W.; Zhang, L.; Zhang, R.; Liu, Z.; Chen, R.; Nie, L. Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey. arXiv 2025, arXiv:2508.13073. [Google Scholar] [CrossRef]
- Li, H.; Chen, Y.; Cui, W.; Liu, W.; Liu, K.; Zhou, M.; Zhang, Z.; Zhao, D. Survey of Vision-Language-Action Models for Embodied Manipulation. arXiv 2025, arXiv:2508.15201. [Google Scholar] [CrossRef]
- Wang, Z.; Wang, B.; Zhang, H.; Du, T.; Chen, T.; Sun, G.; He, Y.; Shen, Z.; Ye, W.; Li, A. Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines. arXiv 2026, arXiv:2604.23001. [Google Scholar] [CrossRef]
- Han, X.; Chen, S.; Fu, Z.; Feng, Z.; Fan, L.; An, D.; Wang, C.; Guo, L.; Meng, W.; Zhang, X.; et al. Multimodal Fusion and Vision–Language Models: A Survey for Robot Vision. Inf. Fusion 2026, 126, 103652. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012); Curran Associates: Red Hook, NY, USA, 2012; Volume 25, pp. 1097–1105. [Google Scholar]
- Palazzo, L.; Suglia, V.; Grieco, S.; Buongiorno, D.; Pagano, G.; Bevilacqua, V.; D’Addio, G. Optimized Deep Learning-Based Pathological Gait Recognition Explored Through Network Analysis of Inertial Data. In Proceedings of the 2025 IEEE Medical Measurements & Applications (MeMeA), Chania, Greece, 28–30 May 2025; pp. 1–5. [Google Scholar] [CrossRef]
- Buongiorno, D.; Prunella, M.; Grossi, S.; Hussain, S.M.; Rennola, A.; Longo, N.; Di Stefano, G.; Bevilacqua, V.; Brunetti, A. Inline Defective Laser Weld Identification by Processing Thermal Image Sequences with Machine and Deep Learning Techniques. Appl. Sci. 2022, 12, 6455. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
- Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. In Proceedings of the 6th Conference on Robot Learning (CoRL); PMLR: Cambridge, MA, USA, 2023; Volume 205, pp. 287–318. [Google Scholar]
- Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An Embodied Multimodal Language Model. arXiv 2023, arXiv:2303.03378. [Google Scholar] [CrossRef]
- Wu, T.; Jing, Y.; Cheang, C.; Chen, G.; Xu, J.; Li, X.; Liu, M.; Li, H.; Kong, T. GR-1: Unleashing Large-Scale Video Generative Pre-Training for Visual Robot Manipulation. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025; Volume 2024, pp. 10641–10662. [Google Scholar]
- Bharadhwaj, H.; Vakil, J.; Sharma, M.; Gupta, A.; Tulsiani, S.; Kumar, V. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentation and Action Chunking. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 4788–4795. [Google Scholar] [CrossRef]
- Zhao, H.; Jiang, L.; Fu, C.-W.; Jia, J. Point Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar] [CrossRef]
- Yu, X.; Tang, L.; Rao, Y.; Huang, T.; Zhou, J.; Lu, J. Point-BERT: Pre-Training 3D Point Cloud Transformers with Masked Point Modeling. arXiv 2021, arXiv:2111.14819. [Google Scholar] [CrossRef]
- Shridhar, M.; Manuelli, L.; Fox, D. PerAct: Perceiver-Actor for Robotics Manipulation. arXiv 2022, arXiv:2209.05451. [Google Scholar] [CrossRef]
- Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-Based Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 154–180. [Google Scholar] [CrossRef]
- Lichtsteiner, P.; Posch, C.; Delbrück, T. A 128×128 120 dB 15 μs Latency Asynchronous Temporal Contrast Vision Sensor. IEEE J. Solid-State Circuits 2008, 43, 566–576. [Google Scholar] [CrossRef]
- Rebecq, H.; Ranftl, R.; Koltun, V.; Scaramuzza, D. Events-to-Video: Bringing Modern Computer Vision to Event Cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2019; pp. 3857–3866. [Google Scholar]
- Sun, Z.; Messikommer, N.; Gehrig, D.; Scaramuzza, D. ESS: Learning Event-Based Semantic Segmentation from Still Images. In Computer Vision—ECCV 2022, Lecture Notes in Computer Science; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; Volume 13694, pp. 341–357. [Google Scholar] [CrossRef]
- Gehrig, M.; Aarents, W.; Gehrig, D.; Scaramuzza, D. DSEC: A Stereo Event Camera Dataset for Driving Scenarios. IEEE Robot. Autom. Lett. 2021, 6, 4947–4954. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Wu, Z.; Liu, X.; Gilitschenski, I. EventCLIP: Adapting CLIP for Event-Based Vision. arXiv 2023, arXiv:2306.06354. [Google Scholar]
- Khazatsky, A.; Pertsch, K.; Nair, S.; Balakrishna, A.; Dasari, S.; Karamcheti, S.; Nasiriany, S.; Srirama, M.K.; Chen, L.Y.; Ellis, K.; et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv 2024, arXiv:2403.12945. [Google Scholar] [CrossRef]
- Shafiullah, N.M.M.; Rai, A.; Etukuru, H.; Liu, Y.; Misra, I.; Chintala, S.; Pinto, L. On Bringing Robots Home. arXiv 2023, arXiv:2311.16098. [Google Scholar] [CrossRef]
- Dal Cin, A.; Dikov, G.; Ju, J.; Ghafoorian, M. AnyMap: Learning a General Camera Model for Structure-from-Motion with Unknown Cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2025; pp. 16674–16684. [Google Scholar]
- Frisoli, A.; Leonardis, D. Wearable Haptics for Virtual Reality and Beyond. Nat. Rev. Electr. Eng. 2024, 1, 666–679. [Google Scholar] [CrossRef]
- Bortone, I.; Barsotti, M.; Leonardis, D.; Crecchi, A.; Tozzini, A.; Bonfiglio, L.; Frisoli, A. Immersive Virtual Environments and Wearable Haptic Devices in Rehabilitation of Children with Neuromotor Impairments: A Single-Blind Randomized Controlled Crossover Pilot Study. J. NeuroEng. Rehabil. 2020, 17, 144. [Google Scholar] [CrossRef]
- Calandra, R.; Owens, A.; Upadhyaya, M.; Yuan, W.; Lin, J.; Adelson, E.H.; Levine, S. The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes? arXiv 2017, arXiv:1710.05512. [Google Scholar] [CrossRef]
- Dong, S.; Yuan, W.; Adelson, E.H. Improved GelSight Tactile Sensor for Measuring Geometry and Slip. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2017; pp. 137–144. [Google Scholar]
- Lambeta, M.; Chou, P.-W.; Tian, S.; Yang, B.; Maloon, B.; Most, V.R.; Stroud, D.; Santos, R.; Byagowi, A.; Kammerer, G.; et al. DIGIT: A Novel Design for a Low-Cost Compact High-Resolution Tactile Sensor with Application to In-Hand Manipulation. IEEE Robot. Autom. Lett. 2020, 5, 3838–3845. [Google Scholar] [CrossRef]
- Hao, P.; Zhang, C.; Li, D.; Cao, X.; Hao, X.; Cui, S.; Wang, S. TLA: Tactile-Language-Action Model. arXiv 2025, arXiv:2503.08548. [Google Scholar]
- Bi, J.; Ma, K.Y.; Hao, C.; Zheng, M.S.; Soh, H. VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback. arXiv 2025, arXiv:2507.17294. [Google Scholar] [CrossRef]
- Liu, W.; Wang, J.; Wang, Y.; Wang, W.; Lu, C. ForceMimic: Force-Centric Imitation Learning with Force-Motion Capture System for Contact-Rich Manipulation. arXiv 2024, arXiv:2410.07554. [Google Scholar]
- Suglia, V.; Palazzo, L.; Bevilacqua, V.; Passantino, A.; Pagano, G.; D’Addio, G. A Novel Framework Based on Deep Learning Architecture for Continuous Human Activity Recognition with Inertial Sensors. Sensors 2024, 24, 2199. [Google Scholar] [CrossRef]
- Jaramillo, I.E.; Jeong, J.G.; Lopez, P.R.; Lee, C.-H.; Kang, D.-Y.; Ha, T.-J.; Oh, J.-H.; Jung, H.; Lee, J.H.; Lee, W.H.; et al. Real-Time Human Activity Recognition with IMU and Encoder Sensors in Wearable Exoskeleton Robot via Deep Learning Networks. Sensors 2022, 22, 9690. [Google Scholar] [CrossRef]
- Zhao, T.Z.; Kumar, V.; Levine, S.; Finn, C. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. In Proceedings of the Robotics: Science and Systems (RSS) XIX, Daegu, Republic of Korea, 10–14 July 2023. [Google Scholar]
- Høeg, S.H.; Du, Y.; Egeland, O. Streaming diffusion policy: Fast policy synthesis with variable noise diffusion models. arXiv 2024, arXiv:2406.04806. [Google Scholar] [CrossRef]
- Ahn, D.; Jung, C.; Baek, J.; Yoo, S.; Ko, B.C. Shifted flow policy: Uncertainty-aware time reparameterization for visuomotor learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA); 2026. to appear. [Google Scholar]
- Shrivastava, A.; Gangani, K.; Jain, L.; Goel, M.; Batra, N. ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery. arXiv 2026, arXiv:2602.14989. [Google Scholar]
- Signoroni, A.; Savardi, M.; Baronio, A.; Benini, S. Deep Learning Meets Hyperspectral Image Analysis: A Multidisciplinary Review. J. Imaging 2019, 5, 52. [Google Scholar] [CrossRef]
- Murray, R.M.; Li, Z.; Sastry, S.S. A Mathematical Introduction to Robotic Manipulation; CRC Press: Boca Raton, FL, USA, 1994. [Google Scholar]
- Lynch, K.M.; Park, F.C. Modern Robotics: Mechanics, Planning, and Control; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
- Furgale, P.; Rehder, J.; Siegwart, R. Unified Temporal and Spatial Calibration for Multi-Sensor Systems. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems; IEEE: New York, NY, USA, 2013; pp. 1280–1286. [Google Scholar] [CrossRef]
- Taylor, Z.; Nieto, J. Motion-Based Calibration of Multimodal Sensor Extrinsics and Timing Offset Estimation. IEEE Trans. Robot. 2016, 32, 1215–1229. [Google Scholar] [CrossRef]
- Koide, K.; Menegatti, E. General Robot-Camera Synchronization Based on Reprojection Error Minimization. In Proceedings of the ARW & OAGM Workshop 2019; Verlag der TU Graz: Steyr, Austria, 2019; Volume 2019, pp. 119–122. [Google Scholar] [CrossRef]
- Koide, K.; Menegatti, E. General Hand–Eye Calibration Based on Reprojection Error Minimization. IEEE Robot. Autom. Lett. 2019, 4, 1021–1028. [Google Scholar] [CrossRef]
- Ha, J. Probabilistic Framework for Hand-Eye and Robot-World Calibration AX=YB. IEEE Trans. Robot. 2016, 39, 1196–1211. [Google Scholar] [CrossRef]
- Pachtrachai, K.; Vasconcelos, F.; Edwards, P.; Stoyanov, D. Learning to Calibrate—Estimating the Hand-eye Transformation without Calibration Objects. IEEE Robot. Autom. Lett. 2021, 6, 7309–7316. [Google Scholar] [CrossRef]
- Walke, H.R.; Black, K.; Zhao, T.Z.; Vuong, Q.; Zheng, C.; Hansen-Estruch, P.; He, A.W.; Myers, V.; Kim, M.J.; Du, M.; et al. BridgeData V2: A Dataset for Robot Learning at Scale. In Proceedings of The 7th Conference on Robot Learning; PMLR: Cambridge, MA, USA, 2023; Volume 229, pp. 1723–1736. [Google Scholar]
- Porcini, F.; Chiaradia, D.; Marcheschi, S.; Solazzi, M.; Frisoli, A. Evaluation of an Exoskeleton-Based Bimanual Teleoperation Architecture with Independently Passivated Slave Devices. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10205–10211. [Google Scholar] [CrossRef]
- Palagi, M.; Santamato, G.; Chiaradia, D.; Gabardi, M.; Marcheschi, S.; Solazzi, M.; Frisoli, A.; Leonardis, D. A Mechanical Hand-Tracking System with Tactile Feedback Designed for Telemanipulation. IEEE Trans. Haptics 2023, 16, 594–601. [Google Scholar] [CrossRef]
- Nair, S.; Rajeswaran, A.; Kumar, V.; Finn, C.; Gupta, A. R3M: A Universal Visual Representation for Robot Manipulation. In Proceedings of the 6th Conference on Robot Learning; PMLR: Cambridge, MA, USA, 2023; Volume 205, pp. 892–909. [Google Scholar]
- Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. Ego4D: Around the World in 3000 Hours of Egocentric Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18995–19012. [Google Scholar]
- Chen, A.S.; Nair, S.; Finn, C. Learning Generalizable Robotic Reward Functions from In-The-Wild Human Videos. arXiv 2021, arXiv:2103.16817. [Google Scholar]
- Makoviychuk, V.; Wawrzyniak, L.; Guo, Y.; Lu, M.; Storey, K.; Macklin, M.; Hoeller, D.; Rudin, N.; Allshire, A.; Handa, A.; et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv 2021, arXiv:2108.10470. [Google Scholar] [CrossRef]
- Mayank, M.; Roth, P.; Tigue, J.; Richard, A.; Zhang, O.; Du, P.; Serrano-Muñoz, A.; Yao, X.; Zurbrügg, R.; Rudin, N. Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning. arXiv 2025, arXiv:2511.04831. [Google Scholar] [CrossRef]
- Mandlekar, A.; Nasiriany, S.; Wen, B.; Akinola, I.; Narang, Y.; Fan, L.; Zhu, Y.; Fox, D. MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations. arXiv 2023, arXiv:2310.17596. [Google Scholar] [CrossRef]
- Fang, H.-S.; Fang, H.; Tang, Z.; Liu, J.; Wang, C.; Wang, J.; Zhu, H.; Lu, C. RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 653–660. [Google Scholar]
- James, S.; Ma, Z.; Arrojo, D.R.; Davison, A.J. RLBench: The Robot Learning Benchmark & Learning Environment. IEEE Robot. Autom. Lett. 2020, 5, 3019–3026. [Google Scholar] [CrossRef]
- Fu, Z.; Zhao, T.Z.; Finn, C. Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. In Proceedings of the 7th Annual Conference on Robot Learning (CoRL 2023), Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
- Merriaux, P.; Dupuis, Y.; Boutteau, R.; Vasseur, P.; Savatier, X. A Study of Vicon System Positioning Performance. Sensors 2017, 17, 1591. [Google Scholar] [CrossRef]
- Wang, J.; Olson, E. AprilTag 2: Efficient and Robust Fiducial Detection. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 4193–4198. [Google Scholar] [CrossRef]
- Garrido-Jurado, S.; Muñoz-Salinas, R.; Madrid-Cuevas, F.J.; Marín-Jiménez, M.J. Automatic Generation and Detection of Highly Reliable Fiducial Markers under Occlusion. Pattern Recognit. 2014, 47, 2280–2292. [Google Scholar] [CrossRef]
- Wen, B.; Yang, W.; Kautz, J.; Birchfield, S. FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 17868–17879. [Google Scholar] [CrossRef]
- Wen, B.; Tremblay, J.; Blukis, V.; Tyree, S.; Müller, T.; Evans, A.; Fox, D.; Kautz, J.; Birchfield, S. BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 606–617. [Google Scholar]
- Moya Rueda, F.; Grzeszick, R.; Fink, G.A.; Feldhorst, S.; Ten Hompel, M. Convolutional Neural Networks for Human Activity Recognition Using Body-Worn Sensors. Informatics 2018, 5, 26. [Google Scholar] [CrossRef]
- Sheng, H.; Xuanqi, W.; Chang, Z.; Jiacheng, W.; Pingxia, D.; Yuwei, W. AIGC video detection based on the fusion of spatial-frequency-optical flow multimodal features. J. Syst. Eng. Electron. 2026, 1–15. [Google Scholar] [CrossRef]
- Dang, J.; Zheng, H.; Xu, X.; Wang, L.; Hu, Q.; Guo, Y. Adaptive sparse memory networks for efficient and robust video object segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 3820–3833. [Google Scholar] [CrossRef]
- Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. arXiv 2022, arXiv:2204.14198. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Savarese, S.; Hoi, S.C.H. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Cheang, C.-L.; Chen, G.; Jing, Y.; Kong, T.; Li, H.; Li, Y.; Liu, Y.; Wu, H.; Xu, J.; Yang, Y.; et al. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv 2024, arXiv:2410.06158. [Google Scholar]
- Ma, M.; Ren, J.; Zhao, L.; Testuggine, D.; Peng, X. Are Multimodal Transformers Robust to Missing Modality? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 18177–18186. [Google Scholar] [CrossRef]
- Bednarek, P.; Kicki, P.; Walas, K. On Robustness of Multi-Modal Fusion—Robotics Perspective. Electronics 2020, 9, 1152. [Google Scholar] [CrossRef]
- Huang, H.; Du, B. Deep Evidential Fusion with Uncertainty Quantification and Reliability Learning for Multimodal Medical Image Segmentation. Inf. Fusion 2025, 113, 102648. [Google Scholar] [CrossRef]
- Yu, S.; Wang, J.; Hussein, W.; Hung, P.C.K. Robust Multimodal Federated Learning for Incomplete Modalities. Comput. Commun. 2024, 214, 234–243. [Google Scholar] [CrossRef]
- Li, Z.; Gao, Y.; Xing, J.; Cui, L.; Wang, X. Adaptive Multi-Scale Attention for Robust Visual–Tactile Feature Fusion. IFAC-Pap. 2025, 58, 523–528. [Google Scholar] [CrossRef]
- Ma, X.; Patidar, S.; Haughton, I.; James, S. Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16 June–22 June 2024. [Google Scholar]
- Nowzari, C.; Garcia, E.; Cortés, J. Event-Triggered Communication and Control of Networked Systems for Multi-Agent Consensus. Automatica 2019, 105, 1–27. [Google Scholar] [CrossRef]
- Dong, A.; Starr, A.; Zhao, Y. End-to-End Identification of Autoregressive with Exogenous Input (ARX) Models Using Neural Networks. Mach. Intell. Res. 2025, 22, 117–130. [Google Scholar] [CrossRef]
- Liu, B.; Zhu, Y.; Gao, C.; Feng, Y.; Liu, Q.; Zhu, Y.; Stone, P. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track; Curran Associates: Red Hook, NY, USA, 2023. [Google Scholar]
- Srivastava, S.; Li, C.; Lingelbach, M.; Martín-Martín, R.; Xia, F.; Vainio, K.; Lian, Z.; Gokmen, C.; Buch, S.; Liu, K.; et al. BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments. In Proceedings of the 5th Conference on Robot Learning (CoRL 2021), London, UK, 8–11 November 2021; Volume 164, pp. 477–490. [Google Scholar]
- Li, C.; Xia, F.; Martín-Martín, R.; Lingelbach, M.; Srivastava, S.; Shen, B.; Vainio, K.; Gokmen, C.; Dharan, G.; Jain, T.; et al. BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation. In Proceedings of the 6th Annual Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 14–18 December 2022; Volume 205, pp. 80–93. [Google Scholar]
- Zhang, S.; Xu, Z.; Liu, P.; Yu, X.; Li, Y.; Gao, Q.; Fei, Z.; Yin, Z.; Wu, Z.; Jiang, Y.-G.; et al. VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–20 October 2025. [Google Scholar]
- Li, X.; Hsu, K.; Gu, J.; Pertsch, K.; Mees, O.; Walke, H.R.; Fu, C.; Lunawat, I.; Sieh, I.; Kirmani, S.; et al. Evaluating Real-World Robot Manipulation Policies in Simulation. In Proceedings of the 8th Annual Conference on Robot Learning (CoRL 2024), Munich, Germany, 6–9 November 2024; Volume 270, pp. 4114–4134. [Google Scholar]
- Liu, M.; Sheng, J.; Li, P.; Wang, Z.; Xu, T.; Xu, T.; Liu, H. Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods. arXiv 2026, arXiv:2601.18723. [Google Scholar] [CrossRef]
- Yang, F.; Feng, C.; Chen, Z.; Park, H.; Wang, D.; Dou, Y.; Zeng, Z.; Chen, X.; Gangopadhyay, R.; Owens, A.; et al. Binding Touch to Everything: Learning Unified Multimodal Tactile Representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 26340–26353. [Google Scholar]
- Zhang, K.; Zhang, H.; Xu, Z.; Zhang, Z.; Prince, M.R.I.; Li, X.; Han, X.; Zhou, Y.; Ajoudani, A.; She, Y. TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation. arXiv 2026, arXiv:2603.12665. [Google Scholar]
- Huang, Y.; Lin, P.; Li, W.; Li, D.; Li, J.; Jiang, J.; Xiao, C.; Jiao, Z. TaF-VLA: Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation. arXiv 2026, arXiv:2601.20321. [Google Scholar]
- Huang, J.; Wang, S.; Lin, F.; Hu, Y.; Wen, C.; Gao, Y. Tactile-VLA: Unlocking Vision-Language-Action Model’s Physical Knowledge for Tactile Generalization. arXiv 2025, arXiv:2507.09160. [Google Scholar]
- Visinsky, M.L.; Cavallaro, J.R.; Walker, I.D. Robotic Fault Detection and Fault Tolerance: A Survey. Reliab. Eng. Syst. Saf. 1994, 46, 139–158. [Google Scholar] [CrossRef]
- Khan, Z.; Nasir, A.; Mekid, S. Fault-Tolerant Control Strategies for Industrial Robots: State of the Art and Future Perspective on AI-Based Fault Management. Artif. Intell. Rev. 2025, 58, 362. [Google Scholar] [CrossRef]
- Samarathunga, S.M.B.P.B.; Valori, M.; Legnani, G.; Fassi, I. Assessing Safety in Physical Human–Robot Interaction in Industrial Settings: A Systematic Review of Contact Modelling and Impact Measuring Methods. Robotics 2025, 14, 27. [Google Scholar] [CrossRef]
- ISO 10218-1:2011; Robots and Robotic Devices—Safety Requirements for Industrial Robots—Part 1: Robots. ISO: Geneva, Switzerland, 2011.
- ISO 10218-2:2011; Robots and Robotic Devices—Safety Requirements for Industrial Robots—Part 2: Robot Systems and Integration. ISO: Geneva, Switzerland, 2011.
- IEC 61508-1:2010; Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems—Part 1: General Requirements. IEC: Geneva, Switzerland, 2010.




| Survey | Main Focus | Sensor Modality Taxonomy | Failure Mode Analysis | Calibration and Synchronization | Data Collection/Data Engine | Fusion Strategy Analysis | Evaluation Framework | Sensor-Centric Perspective |
|---|---|---|---|---|---|---|---|---|
| Ma et al. (2024) [22] | Broad VLA/embodied AI overview | ✗ | ✗ | ✗ | △ | ✗ | △ | ✗ |
| Zhong et al. (2025) [23] | Action token taxonomy | ✗ | ✗ | ✗ | △ | ✗ | △ | ✗ |
| Din et al. (2025) [21] | Architectures, datasets, and simulators | ✗ | ✗ | ✗ | ✓ | △ | ✓ | ✗ |
| Shao et al. (2025) [24] | Large-VLM-based VLA architectures | ✗ | ✗ | ✗ | △ | ✗ | △ | ✗ |
| Li et al. (2025) [25] | Structures, datasets, pre/post-training, and evaluation | ✗ | ✗ | ✗ | △ | ✗ | ✓ | ✗ |
| Wang et al. (2026) [26] | Data-centric VLA survey | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ |
| Han et al. (2026) [27] | Robotic vision, multi-modal fusion, and VLMs | △ | ✗ | ✗ | ✓ | ✓ | △ | △ |
| Ours | Sensor–fusion–action for robotic manipulation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

| Sensor | Sensing Role | Typical Use in VLA | Key Limitations |
|---|---|---|---|
| RGB Camera [2,10,11,32,33] | Rich semantic and appearance information for object recognition, scene understanding, and language grounding | Primary visual input for most VLA models; encoded by vision backbones and fused with language tokens | No direct depth information; sensitive to occlusion, illumination changes, reflections, and transparent objects |
| Depth/RGB-D Sensor [34,35,36,37] | Geometric structure, distance, and surface shape for grasping and contact-aware planning | Combined with RGB to support 3D scene understanding, grasp pose estimation, and manipulation planning | Failure on transparent/reflective surfaces; reduced reliability under strong ambient light; alignment with pre-trained VLM tokens remains nontrivial |
| Event Camera [39,40,41,42] | High temporal resolution and high dynamic range for fast motion and dynamic scenes | Potential input for motion-sensitive manipulation and high-speed perception, often after conversion to event frames or related representations | Sparse asynchronous output; limited large-scale pre-training data; compatibility with standard RGB-based VLM pipelines remains an open problem |
| Fisheye/Wide-angle Camera [6,46,47,48] | Wide field of view for workspace coverage and reduced blind spots | Useful for mobile manipulation or long-horizon tasks requiring large scene coverage | Severe radial distortion; weaker compatibility with perspective-camera-pre-trained encoders; rectification may still leave residual distortion |

| Task/Condition | Added Sensing Source | Expected Benefit | Main Overhead | Tolerability Judgment |
|---|---|---|---|---|
| Simple rigid-object pick-and-place under good visibility | Depth or tactile | Limited improvement over RGB and proprioception | Added preprocessing and synchronization | Often not necessary unless the failure rate is high |
| Transparent or reflective object manipulation | Depth, tactile, or F/T | Compensates for unreliable RGB or depth cues and improves contact estimation | Calibration and fusion latency | Often tolerable when visual uncertainty causes failures |
| Contact-rich insertion or assembly | Tactile or F/T | Detects contact, jamming, slip, and excessive force | High-rate sensing and control-loop integration | Usually tolerable, especially for safety and robustness |
| Deformable-object manipulation | Tactile, F/T, or multi-view vision | Improves deformation and contact-state estimation | Increased sensor bandwidth and fusion complexity | Tolerable when deformation state cannot be inferred visually |
| Fast motion or mobile manipulation | IMU or event camera | Provides high-rate motion cues and reduces visual latency | Multi-rate alignment and asynchronous fusion | Tolerable if summarized or event-triggered efficiently |
| Industrial monitoring or diagnostic operation | Thermal, vibration, acoustic, or IMU sensing | Provides failure, anomaly, or quality-control information | Additional inference or monitoring cost | Tolerable if used as supervisory feedback rather than high-frequency policy input |

| Dataset | RGB | Depth | Tactile | F/T | Proprio | Collection Method | Scale/Note |
|---|---|---|---|---|---|---|---|
| Open X-Embodiment [17] | ✓ | Varies | Varies | Varies | ✓ | Multi-institution real-world demonstrations | 1M+ trajectories; 22 robot embodiments |
| DROID [46] | ✓ | ✓ | ✗ | ✗ | ✓ | In-the-wild teleoperation/human demonstrations | ~76k trajectories; 350 h; language annotations included |
| ALOHA [59] | ✓ | ✗ | ✗ | ✗ | ✓ | Bimanual teleoperation | ~50 demonstrations per task in the reported setup |
| BridgeData V2 [72] | ✓ | ✓ | ✗ | ✗ | ✓ | VR teleoperation + scripted rollouts | 60k trajectories |
| MimicGen [80] | Sim dependent | Sim dependent | Sim dependent | Sim dependent | ✓ | Simulation-based generation from a small number of human demonstrations | 50k+ generated demos from <200 human demos across 18 tasks |
| Mobile ALOHA [83] | ✓ | ✗ | ✗ | ✗ | ✓ | Whole-body mobile manipulation teleoperation | ~50 demonstrations per task; includes mobile base velocity in proprioception |
| RH20T [81] | ✓ | ✓ | (limited) | ✓ | ✓ | Contact-rich teleoperation | ~110k sequences; some sequences include fingertip tactile sensing |
| RLBench [82] | ✓ | ✓ | ✗ | ✗ | ✓ | Scripted/motion-planned simulation | 100 tasks; effectively unlimited demonstrations |

| Strategy | Main Operation | Cost Trend | Strength | Limitation |
|---|---|---|---|---|
| Early fusion | Joint self-attention over all modality tokens | Quadratic in total token count | Dense cross-modal interaction | High memory and latency |
| Late fusion | Independent encoding followed by vector-level fusion | after summarization | Modular and efficient | Possible loss of low-level interactions |
| Hybrid fusion | Cross-attention between selected modality pairs | per interaction | Controlled cross-modal reasoning | Design complexity and pair selection |
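
The cost trends in the table above can be illustrated by counting query-key pairs, the term that dominates attention cost. The following is a minimal sketch, not the survey's own formulation: the function names, the token counts, and the assumption that every late-fusion encoder is a self-attention transformer are all hypothetical and chosen only for illustration.

```python
def early_fusion_pairs(token_counts):
    """Joint self-attention: every token attends to every token of every modality."""
    total = sum(token_counts.values())
    return total * total


def late_fusion_pairs(token_counts):
    """Independent encoders (assumed here to be self-attention): no cross-modal pairs.

    The final vector-level fusion over one summary per modality is ignored,
    since its cost is negligible next to the encoders.
    """
    return sum(n * n for n in token_counts.values())


def hybrid_fusion_pairs(token_counts, pairs):
    """Cross-attention only between selected (query, key) modality pairs.

    Counts the cross-attention term alone; per-modality encoding is extra.
    """
    return sum(token_counts[q] * token_counts[k] for q, k in pairs)


if __name__ == "__main__":
    # Hypothetical token budget per modality.
    tokens = {"rgb": 196, "depth": 196, "tactile": 64}
    print("early :", early_fusion_pairs(tokens))
    print("late  :", late_fusion_pairs(tokens))
    print("hybrid:", hybrid_fusion_pairs(tokens, [("tactile", "rgb")]))
```

Under these assumed counts, early fusion pays for all cross-modal pairs, late fusion pays only within each modality, and the hybrid cost depends on which pairs are selected, which is where its design complexity comes from.
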

| Category | Metric/Formula | Meaning | Pros | Cons/Limitation | Occurrence |
|---|---|---|---|---|---|
| Task completion | SR = successful trials/total trials | Whether the task is completed | Simple, intuitive, and easy to compare | Hides failure cause, safety, and sensor contribution | Common |
| Task progress | Sub-goal completion = completed sub-goals/total sub-goals | Partial progress in long-horizon tasks | More informative than binary SR | Needs sub-goal annotation | Partial |
| Robustness | RSR = mean SR under perturbation/OOD/sensor noise | Average performance under domain shift | Captures stability beyond clean settings | Depends on perturbation protocol | Partial |
| Worst-case robustness | WSR = min SR across perturbation conditions | Lower-bound performance in the hardest condition | Reveals edge-case failures | Sensitive to perturbation-set design | Emerging |
| Safety | Collision count, force-limit violation, unsafe-contact events | Whether the policy avoids hazardous behavior | Critical for real deployment | Needs safety sensors or manual labels | Emerging |
| Contact stability | Slip count, regrasp frequency, excessive-force events | Grasp and contact reliability | Useful for tactile/F/T manipulation | Needs tactile/F/T sensing | Emerging |
| Execution quality | Smoothness, jerk, corrective-action frequency | Physical stability and control quality | Captures quality beyond success/failure | Hardware and controller dependent | Partial |
| Real-time feasibility | AIL = mean inference time per step or episode | Inference-time burden | Important for real-time deployment | Hardware/implementation dependent | Partial |
| End-to-end latency | Policy-loop latency = sensing + preprocessing + fusion + inference + actuation | Full perception–action delay | More deployment-relevant than inference only | Requires system-level profiling | Emerging |
| Efficiency/cost | Compute cost, memory, energy; CpS = cost/successful trial | Resource consumption per success | Links performance with deployment cost | Platform and definition dependent | Emerging |
| Sensor contribution | Ablation drop = SR_full − SR_without modality | Contribution of a sensor modality | Directly supports sensor selection | Needs controlled ablation | Emerging |
| Failure diagnosis | Failure-type distribution (perception, calibration, contact, planning, and execution) | Dominant failure sources | Useful for system improvement | Needs detailed failure labeling | Emerging |
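
Several of the metrics in the table above reduce to simple arithmetic over trial logs. The sketch below shows one possible way to compute SR, RSR, WSR, ablation drop, CpS, and policy-loop latency. The function names, the input layout, and all numbers are hypothetical and serve only to make the definitions concrete.

```python
from statistics import mean


def success_rate(outcomes):
    """SR = successful trials / total trials (outcomes are truthy on success)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def robustness_metrics(sr_by_condition):
    """Return (RSR, WSR): mean and minimum SR across perturbation conditions."""
    values = list(sr_by_condition.values())
    return mean(values), min(values)


def ablation_drop(sr_full, sr_without_modality):
    """Ablation drop = SR_full - SR_without modality."""
    return sr_full - sr_without_modality


def cost_per_success(total_cost, outcomes):
    """CpS = cost / successful trials; infinite if nothing succeeded."""
    successes = sum(bool(o) for o in outcomes)
    return float("inf") if successes == 0 else total_cost / successes


def policy_loop_latency(stage_ms):
    """Sum of sensing, preprocessing, fusion, inference, and actuation delays."""
    return sum(stage_ms.values())


if __name__ == "__main__":
    clean_trials = [1, 1, 1, 0]  # hypothetical outcomes
    sr_full = success_rate(clean_trials)
    rsr, wsr = robustness_metrics(
        {"low_light": 0.5, "occlusion": 0.25, "sensor_noise": 0.75}
    )
    print(sr_full, rsr, wsr)
    print(ablation_drop(sr_full, 0.5))
    print(cost_per_success(12.0, clean_trials))
    print(policy_loop_latency(
        {"sensing": 10, "preprocessing": 5, "fusion": 5, "inference": 50, "actuation": 10}
    ))
```

Reporting WSR next to RSR matters because a mean can hide a single condition in which the policy fails almost every trial.
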

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ko, B.C. Sensing the Action: Rethinking Sensor Modalities and Multi-Modal Fusion in Vision–Language–Action Models for Robotic Manipulation. Sensors 2026, 26, 3541. https://doi.org/10.3390/s26113541
