Review

Fusion of Computer Vision and AI in Collaborative Robotics: A Review and Future Prospects

1 Department of Industrial Engineering and Management, Tel Aviv Afeka Academic College of Engineering, Tel Aviv 6998812, Israel
2 Department of Industrial Engineering and Management, Ariel University, Ariel 40700, Israel
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7905; https://doi.org/10.3390/app15147905
Submission received: 20 June 2025 / Revised: 11 July 2025 / Accepted: 13 July 2025 / Published: 15 July 2025

Abstract

The integration of advanced computer vision and artificial intelligence (AI) techniques into collaborative robotic systems holds the potential to revolutionize human–robot interaction, productivity, and safety. Despite substantial research activity, a systematic synthesis of how vision and AI are jointly enabling context-aware, adaptive cobot capabilities across perception, planning, and decision-making remains lacking (especially in recent years). Addressing this gap, our review unifies the latest advances in visual recognition, deep learning, and semantic mapping within a structured taxonomy tailored to collaborative robotics. We examine foundational technologies such as object detection, human pose estimation, and environmental modeling, as well as emerging trends including multimodal sensor fusion, explainable AI, and ethically guided autonomy. Unlike prior surveys that focus narrowly on either vision or AI, this review uniquely analyzes their integrated use for real-world human–robot collaboration. Highlighting industrial and service applications, we distill the best practices, identify critical challenges, and present key performance metrics to guide future research. We conclude by proposing strategic directions—from scalable training methods to interoperability standards—to foster safe, robust, and proactive human–robot partnerships in the years ahead.

1. Introduction

1.1. Motivation and Scope

The rapid evolution of collaborative robots, or cobots, is transforming the landscape of industrial and service-oriented automation [1,2,3]. Unlike traditional industrial robots, cobots are designed to operate safely and interactively alongside humans, fostering enhanced productivity, safety, and flexibility in dynamic environments [4]. By bridging the gap between manual labor and full automation [5], cobots improve cost-effectiveness, safety, quality, and flexibility: they reduce labor costs while avoiding the rigidity of full automation, enhance safety by handling hazardous tasks, and enable seamless human–robot synergy for adaptable, efficient production. Central to this transformation is the fusion of computer vision and artificial intelligence (AI), which enables perceptive and context-aware robotic systems [6]. Computer vision enables cobots to interpret complex scenes, detect and classify objects, perceive human gestures and activities, and model their surroundings in real time [7]. When integrated with AI techniques such as deep learning and cognitive reasoning, these sensory capabilities are elevated to intelligent decision-making and adaptive behavior. This fusion empowers cobots to move beyond passive operation toward proactive collaboration, in which robots anticipate and respond to human intentions and environmental changes [8,9].
The scope of this review is to explore and systematize the most recent and impactful advancements in vision–AI integration within collaborative robotic systems. In particular, we highlight the foundational technologies, current applications, emerging trends, and open research challenges that collectively shape the path toward next-generation human–robot collaboration (HRC).

1.2. Definitions and Interdisciplinary Scope

The interdisciplinary nature of collaborative robotics necessitates precise terminology. Artificial Intelligence (AI) encompasses computational techniques that emulate human-like intelligence [7]. Machine Learning (ML), a subdomain of AI, focuses on the development of algorithms that learn from data without explicit programming [10]. Deep Learning (DL) further specializes ML through deep neural architectures capable of hierarchical feature extraction [8,9]. Computer Vision (CV) lies at the confluence of AI and ML and is tasked with enabling machines to acquire, process, and interpret visual information [10,11]. Understanding these definitions is pivotal, as contemporary collaborative robotic systems draw upon all these domains to achieve perceptual awareness, semantic reasoning, and contextually grounded action execution.

1.3. Contributions of the Review

This review offers the following core contributions:
  • Comprehensive Survey of Vision–AI Synergy: We synthesize the literature across various domains—robotic vision, AI planning, human activity recognition, and multimodal perception—into a unified taxonomy tailored to collaborative robotics.
  • Technology Landscape Mapping: We provide a structured overview of key methods, including visual object detection, human pose estimation, scene understanding, Simultaneous Localization and Mapping (SLAM), and deep learning architectures for HRC [12].
  • Evaluation and Benchmarking: We discuss performance metrics, benchmark datasets, and simulation environments for vision–AI-enabled cobots, emphasizing the need for reproducibility and standardization [13].
  • Identification of Challenges and Future Directions: We highlight critical issues such as real-time performance, uncertainty handling, ethical decision-making, and human trust in autonomous systems [4,14,15].

1.4. Methodology and Literature Selection Criteria

This review employed a systematic and thematic approach to identify, evaluate, and synthesize relevant studies on the integration of computer vision (CV) and artificial intelligence (AI) in collaborative robotics. The initial search was conducted across major academic databases, including Scopus, Web of Science, IEEE Xplore, and MDPI’s Applied Sciences, using keywords such as human–robot interaction, collaborative robots, robotic vision, semantic scene understanding, deep learning in robotics, and sensor fusion for HRC.
This process retrieved approximately 1800 records published from 2015 through May 2025. Exclusion criteria were applied in several stages: (i) removing non-English papers and non-peer-reviewed sources; (ii) filtering out works unrelated to collaborative robotics (e.g., purely industrial automation or general computer vision without HRC context); and (iii) excluding duplicate entries and low-relevance conference abstracts. This screening narrowed the corpus to around 250 articles for full-text assessment.
After detailed evaluation of their technical depth, real-world applicability, and relevance to the cobot paradigm, 138 studies were selected for inclusion in this review. These covered diverse topics such as object detection, human pose estimation, scene understanding, visual SLAM, reinforcement learning, and explainable AI as applied to collaborative robotics. A bibliometric analysis revealed notable trends. Publication volume has grown markedly since 2018, reflecting accelerating interest in vision–AI fusion for human–robot collaboration. The included studies span contributions from Europe, North America, and Asia, with strong research clusters in Germany, China, the United States, and Japan. Topical trends indicate increasing emphasis on multimodal sensor fusion, proactive and anticipatory human–robot collaboration, explainable AI, and real-time adaptive planning.
This multi-stage selection strategy ensures that the review synthesizes a representative, high-quality, and globally relevant body of literature, enabling a comprehensive assessment of current capabilities, challenges, and future directions in vision–AI-enabled collaborative robotics.

1.5. Outline of the Paper

The remainder of this paper is structured as follows: Section 2 provides an overview of the foundations and evolution of collaborative robotics, computer vision, and AI. Section 3 examines the core technologies enabling vision-based cobots, including perception, pose estimation, and spatial modeling. Section 4 explores AI-driven decision-making and autonomy, with emphasis on learning frameworks and explainability. Section 5 investigates technological trends and challenges. Section 6 highlights key cobot considerations in system architecture, AI, and algorithms. Section 7 discusses human-related challenges and trends. Section 8 discusses the literature review and outlines strategic future directions for vision–AI integration in cloud–edge platforms and proactive HRC. Section 9 concludes with a synthesis of relevant insights and a vision for the next generation of vision-enabled collaborative robotic systems.

2. Foundations and Evolution

2.1. Overview of Collaborative Robotics (Cobots)

Collaborative robots (cobots) are robotic systems explicitly designed for direct physical interaction with humans within a shared workspace, without the need for conventional safety barriers [16,17]. Unlike traditional industrial robots, which operate in isolated environments, cobots integrate intrinsic safety mechanisms—such as compliant actuators, force–torque sensing, and intelligent motion planning—to enable close and safe collaboration [18]. Their flexibility, rapid deployment, and capacity for programming by demonstration have driven their adoption across manufacturing, logistics, healthcare, and education sectors [19]. Cobots represent a paradigmatic shift towards hybrid workspaces where human creativity is synergistically augmented by robotic precision, fostering higher productivity, enhanced ergonomics, and agile production systems.

2.2. Historical Evolution of Computer Vision in Robotics

The field of computer vision (CV) has undergone a profound evolution, transitioning from the early heuristic-based image processing in the 1980s to the sophisticated deep learning-driven perception systems today. Initial robotic vision focused on basic tasks such as edge detection, object localization, and obstacle avoidance using monocular cameras [20]. Advances in sensors, notably RGB-D cameras and LiDAR, enabled three-dimensional scene understanding and real-time spatial mapping [21,22]. Simultaneously, breakthroughs in algorithmic foundations, such as convolutional neural networks (CNNs) and simultaneous localization and mapping (SLAM), transformed perception capabilities [21,23,24]. In collaborative robotics, CV has evolved from a passive sensing modality to an active, predictive component that is essential for real-time scene understanding, human activity recognition, and adaptive task execution [25].

2.3. Progression of Artificial Intelligence (AI) in Robotic Applications

Artificial Intelligence (AI) has expanded the functional capabilities of robotic systems from scripted task execution to dynamic, goal-driven behaviors [7,26]. Early AI integration involved rule-based systems and expert-driven programming. The introduction of machine learning (ML), particularly supervised and reinforcement learning approaches, enabled robots to generalize from data rather than relying solely on pre-specified rules [27,28]. More recently, deep learning (DL) architectures, such as recurrent neural networks (RNNs) and transformers, have empowered robots to perform complex perception and decision-making tasks in highly dynamic environments [29,30,31,32]. Within collaborative robotics, AI facilitates context-aware action planning, predictive human intent recognition, and proactive collaboration strategies [7,24,33,34]. As robots increasingly learn from interaction histories, multimodal sensory data, and demonstration-based training, the boundaries of adaptive autonomy continue to expand.

3. Core Technologies in Vision-Based Collaborative Robotics

3.1. Visual Perception and Object Detection

Visual perception forms the foundation of collaborative robotic systems [35], enabling machines to interpret their environments and interact meaningfully with objects and humans [34,35]. Object detection—locating and classifying objects within an image or 3D space—is a critical competency, achieved through advanced deep learning models such as YOLO (You Only Look Once), Faster R-CNN (Region-based Convolutional Neural Networks), and transformer-based detectors [36,37,38,39].
In cobot applications, perception systems must operate under strict real-time constraints [40], offering high precision under variable lighting, occlusion, and dynamic backgrounds [41]. Recent advancements leverage multimodal data (e.g., RGB-D and thermal imaging) to enhance robustness, enabling cobots to identify workpieces, tools, and obstacles with human-like perceptual fluency [20,42].
Table 1 quantifies the trade-off between detector speed and accuracy in cobot-relevant use cases.
Single-stage detectors—including state-of-the-art YOLOv6/v7—deliver real-time performance (hundreds to over 1000 FPS) with moderate accuracy (35–57% mAP), which is ideal for fast-paced tasks like collision avoidance and bin picking. Faster R-CNN, on the other hand, achieves a higher accuracy (~55% mAP) but is limited to ≈5 FPS, making it less suited for real-time edge deployment.
Domain-specific variants demonstrate the impact of customization: AT-LI-YOLO reaches ~34 FPS and improves small-object detection by ~3% over YOLOv3, while Parallel YOLO–GG achieves ~94% accuracy and a 14% higher speed vs. YOLOv3 in grasping scenarios. Additionally, YOLOv4 has been validated for robotic arms, showing speed and accuracy gains over YOLOv3.
In cobot systems, the real-time processing benchmark is typically ≥15–30 FPS on edge devices. Thus, quantized YOLOv6-N and YOLOv7 are highly suitable, especially when combined with hardware accelerators. For tasks requiring precise localization or small-object handling, customized lightweight versions (AT-LI-YOLO, YOLO–GG) offer compelling alternatives. Two-stage detectors like Faster R-CNN may still be used in contexts where model latency is acceptable or where high precision outweighs speed.
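To make the real-time constraint concrete, the following sketch shows a frame-by-frame detection loop of the kind a cobot perception module might run. It assumes the third-party ultralytics package and an off-the-shelf YOLO checkpoint, which are illustrative choices rather than components prescribed by the studies reviewed here.

```python
# Minimal sketch of a real-time detection loop for a cobot workcell camera.
# Assumes the third-party `ultralytics` package and an off-the-shelf YOLO
# checkpoint ("yolov8n.pt"); neither is prescribed by this review.
import time

import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # small single-stage detector suitable for edge devices
cap = cv2.VideoCapture(0)           # workcell RGB camera

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    results = model(frame, verbose=False)[0]   # one forward pass per frame
    fps = 1.0 / (time.perf_counter() - t0)

    # Keep only confident detections; a cobot controller would consume these boxes.
    for box in results.boxes:
        if float(box.conf) < 0.5:
            continue
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        label = results.names[int(box.cls)]
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{label} {float(box.conf):.2f}", (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

    # Flag frames that miss the 15-30 FPS real-time envelope discussed above.
    colour = (0, 255, 0) if fps >= 15 else (0, 0, 255)
    cv2.putText(frame, f"{fps:.1f} FPS", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, colour, 2)
    cv2.imshow("cobot-detector", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```

In practice, the same loop would feed detections to a planner rather than a display window, and the FPS check would trigger model downscaling or hardware offloading when the real-time budget is violated.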

3.2. Human Pose Estimation and Activity Recognition

Effective human–robot collaboration requires robots to not only detect humans but also understand their poses, gestures, and activities. Human pose estimation, the task of localizing body keypoints, is pivotal for recognizing human intent, predicting actions, and ensuring safe interaction [43,44]. Pose estimation accuracy significantly affects downstream tasks such as intention prediction and cooperative planning in shared workspaces [44,45]. Pose detection errors greater than 10–15 pixels at the wrist or shoulder joints substantially degrade collaborative task efficiency [46].
State-of-the-art techniques, including OpenPose, HRNet, and vision transformers, have enabled real-time multi-person tracking in cluttered environments [47]. Coupled with activity recognition models, such as 3D convolutional neural networks (3D-CNNs) and LSTM (Long Short-Term Memory)-based architectures, cobots can proactively adjust their behaviors in response to human movements [48,49].
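As an illustration of how pose estimates feed safety logic, the sketch below tracks a single wrist keypoint and checks it against a keep-out region. MediaPipe Pose is used here only as a convenient stand-in for the OpenPose/HRNet estimators cited above, and the zone coordinates are hypothetical values.

```python
# Minimal sketch of wrist tracking for safety monitoring in a shared workspace.
# MediaPipe Pose stands in for the OpenPose/HRNet estimators cited above; the
# "robot zone" rectangle is a hypothetical keep-out region, not a value from
# the review.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
ROBOT_ZONE = (400, 150, 640, 480)   # hypothetical keep-out box (x1, y1, x2, y2) in pixels

cap = cv2.VideoCapture(0)
with mp_pose.Pose(model_complexity=1, min_detection_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.pose_landmarks:
            wrist = results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST]
            # Discard low-confidence joints: noisy wrist estimates degrade
            # downstream intention prediction, as noted above.
            if wrist.visibility > 0.6:
                px, py = int(wrist.x * w), int(wrist.y * h)
                x1, y1, x2, y2 = ROBOT_ZONE
                inside = x1 <= px <= x2 and y1 <= py <= y2
                cv2.circle(frame, (px, py), 6, (0, 0, 255) if inside else (0, 255, 0), -1)
                if inside:
                    # A real system would slow or pause the cobot here.
                    print("Wrist inside robot zone: request reduced-speed mode")

        cv2.imshow("pose-monitor", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```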

3.3. Scene Understanding and Environmental Modeling

Beyond object and human recognition, holistic scene understanding is essential for cobots operating in complex environments [6,24]. Scene understanding involves semantic segmentation, spatial reasoning, and affordance detection [50].
By constructing rich semantic maps of their surroundings, cobots can identify interaction zones, dynamically allocate resources, and anticipate environmental changes. Techniques such as DenseFusion for object pose estimation and deep learning-enhanced SLAM algorithms have propelled significant advancements in this domain [51].

3.4. SLAM and Spatial Awareness for Shared Workspaces

Collaborative tasks often unfold in partially structured or dynamically changing spaces, necessitating robust spatial awareness. Simultaneous Localization and Mapping (SLAM) allows cobots to build and update maps while localizing themselves within them, representing a cornerstone capability for safe and reliable autonomy [52,53].
Modern vision-based SLAM approaches, such as ORB-SLAM3 and Deep-SLAM, integrate monocular, stereo, and depth camera inputs to maintain situational awareness even in human-populated environments [54,55,56]. Semantic SLAM extends traditional SLAM by recognizing and excluding dynamic objects, which is crucial in HRC settings where humans move unpredictably. ORB-SLAM3, while not semantic by default, improves robustness via IMU integration and multi-map management. Experience reported in [57] shows that ORB-SLAM3 handles dynamic occlusions by spawning sub-maps and maintaining inertial consistency, but it cannot explicitly exclude humans without external semantic masks. Methods like Dynamic-SLAM use segmentation (e.g., Mask R-CNN) to filter out moving people, reducing tracking errors and drift. For reliable HRC, best practice combines semantic segmentation to exclude moving humans with IMU–visual fusion for static mapping. Examples of benchmarks on SLAM and drift correction can be found in [58,59,60,61,62].
Additionally, semantic SLAM variants enrich maps with object-level annotations. These variants enhance the cobot’s ability to reason about its environment in a task-relevant manner. SLAM Drift Correction, incorporating loop closure techniques, is vital for minimizing drift in long-duration SLAM deployments in dynamic, human-shared environments [63,64]. Future directions include human-aware SLAM that not only filters but predicts human motion for safer, more adaptive collaboration.
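The masking practice described above can be illustrated with a short sketch: person pixels from an instance-segmentation model are removed before ORB features are extracted, so the SLAM front end does not track moving humans. The `person_mask` hook is a hypothetical placeholder for Mask R-CNN or any comparable segmenter.

```python
# Minimal sketch of semantic masking before feature extraction for a
# visual-SLAM front end. `person_mask(frame)` is a hypothetical hook standing
# in for Mask R-CNN or another instance-segmentation model.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)

def person_mask(frame: np.ndarray) -> np.ndarray:
    """Return a binary mask (255 = person pixels) from a segmentation model."""
    raise NotImplementedError("plug in Mask R-CNN or any instance segmenter here")

def static_features(frame: np.ndarray):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mask = person_mask(frame)
    # Dilate the mask so features are not detected on the silhouette boundary,
    # then invert it: ORB only searches where the mask is non-zero.
    mask = cv2.dilate(mask, np.ones((15, 15), np.uint8))
    keep = cv2.bitwise_not(mask)
    keypoints, descriptors = orb.detectAndCompute(gray, keep)
    return keypoints, descriptors
```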
Figure 1 and Table 2 illustrate and summarize the core technologies in vision-based collaborative robotics.

4. AI-Driven Autonomy and Decision-Making

4.1. Deep Learning Architectures for Perception and Planning

Deep learning (DL) has emerged as a transformative force in robotic autonomy [65,66]. DL enables end-to-end learning pipelines that span perception, decision-making, and motion planning [67,68]. The schematic stages of this pipeline are described and illustrated in Figure 2, leading from sensory inputs to cobot actions.
Convolutional Neural Networks (CNNs) have been widely adopted for vision-based feature extraction [69,70] and form the basis for more recent architectures; transformers and graph neural networks (GNNs) facilitate spatial reasoning and relational modeling within human–robot environments [71,72,73]. Moreover, encoder–decoder models and attention mechanisms enhance generalization across diverse tasks and environments, allowing cobots to operate under uncertainty while maintaining interpretability [74].

4.2. Reinforcement Learning for Adaptive Behavior

Reinforcement Learning (RL) endows cobots with the capacity to learn optimal behaviors through interaction with the environment, receiving feedback in the form of rewards [75]. In industrial HRC, sample-efficient RL algorithms are critical, as collecting large-scale real-world interaction data is time-consuming and may pose safety risks [76]. To improve sample efficiency, methods such as meta-RL, curriculum learning, and simulation-to-reality (Sim2Real) transfer have been employed to mitigate data inefficiencies in robotic RL training pipelines [77,78,79].
Recent advances in Deep Reinforcement Learning (DRL)—notably Deep Q-Networks (DQNs), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO)—have enabled cobots to acquire complex skills such as manipulation, cooperative assembly, and adaptive trajectory planning [80,81,82,83].
In shared workspaces, RL facilitates rapid online adaptation to novel human behaviors and dynamic task constraints [84,85]. Model-based approaches (e.g., World Models and MuZero) further accelerate learning by simulating outcomes, enabling proactive planning with fewer real-world samples [86].
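As a concrete illustration of such a training pipeline, the sketch below trains a PPO policy in simulation using the third-party stable-baselines3 and gymnasium packages, which are assumptions of this example rather than tools cited in the review; the environment id is hypothetical and stands in for a Sim2Real setup that would add domain randomization before hardware transfer.

```python
# Minimal sketch of training an adaptive reaching policy with PPO.
# "CobotReach-v0" is a hypothetical simulated environment; stable-baselines3
# and gymnasium are illustrative third-party choices.
import gymnasium as gym
from stable_baselines3 import PPO

def make_env():
    # Hypothetical environment id; replace with your own simulated cobot task.
    return gym.make("CobotReach-v0")

env = make_env()

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,          # rollout length per update; larger = more stable gradients
    batch_size=64,
    gamma=0.99,
    verbose=1,
)
model.learn(total_timesteps=200_000)   # sample budget: the practical bottleneck discussed above
model.save("ppo_cobot_reach")

# Evaluate the learned policy in simulation before any real-world deployment.
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```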

4.3. Explainable AI (XAI) and Ethical Autonomy

As cobots assume decision-making roles in human-centric domains, explainability becomes vital for user trust, transparency, and regulatory compliance [87]. Explainable AI (XAI) techniques, such as attention heatmaps, saliency mapping, and symbolic rule extraction, offer insights into how and why a cobot selects specific actions [88,89]. The EU AI Act and ISO/IEC TR 24028 provide guidelines on trustworthy AI, urging integration of explainability and risk management in autonomous systems [90]. Ethical AI Standards: The IEEE P7000 family and EU’s Trustworthy AI guidelines outline ethical design requirements, which are especially critical in safety-sensitive cobot deployments [91].
In addition to technical transparency, ethical autonomy frameworks are increasingly emphasized in collaborative robotics. These frameworks incorporate principles such as non-maleficence, privacy preservation, and accountability into learning objectives or decision policies. Hybrid approaches—combining neural and symbolic reasoning—offer a promising path toward ethically aware cobots that are verifiable and audit-friendly.
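To ground the saliency-mapping technique mentioned above, the sketch below computes a simple gradient-based saliency map for an arbitrary PyTorch image classifier. It is a generic illustration of the idea, not a method taken from the cited cobot systems; the heatmap it produces could be overlaid on the camera feed shown to an operator.

```python
# Minimal sketch of gradient-based saliency, one of the XAI techniques above.
# Works with any differentiable PyTorch image classifier.
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """image: (1, C, H, W) normalized tensor. Returns an (H, W) saliency map in [0, 1]."""
    model.eval()
    image = image.clone().requires_grad_(True)

    logits = model(image)
    target = int(logits.argmax(dim=1))      # explain the model's own top prediction
    score = logits[0, target]
    model.zero_grad()
    score.backward()                        # gradient of the class score w.r.t. input pixels

    sal = image.grad.detach().abs().amax(dim=1)[0]   # max over channels -> (H, W)
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
    return sal

# Example usage with a hypothetical pretrained backbone:
# import torchvision.models as models
# net = models.resnet18(weights="IMAGENET1K_V1")
# heat = saliency_map(net, preprocessed_frame)   # visualize with matplotlib or cv2
```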

4.4. Cognitive Architectures for Human-like Interaction

To engage in meaningful collaboration, cobots must go beyond reactive control and embody cognitive traits such as memory, prediction, and attention. Cognitive architectures—like SOAR, ACT-R, [92], and hybrid neuro-symbolic systems—model these capabilities by integrating perception, reasoning, and learning within a unifying framework [92,93].
In collaborative settings, these architectures support task allocation, turn-taking, joint attention, and mental state inference. For instance, integrating theory-of-mind models enables cobots to reason about human beliefs and goals, improving their fluency and safety in joint action scenarios. Vision plays a key role here, as visual cues (e.g., gaze, gesture, and facial expression) inform internal state models that modulate robot behavior in real time [94]. Table 3 summarizes this section.

5. Technological Trends and Challenges

The fusion of computer vision (CV) and artificial intelligence (AI) in collaborative robotics has already demonstrated considerable advancements across perception, decision-making, and interaction modalities [95,96]. Research momentum is also being built around multimodal sensor fusion, embodied AI agents, and open-source robotic ecosystems [97]. These trends point toward the development of proactive cobots that are capable of understanding intent, adapting behavior in real time, and collaborating seamlessly with humans in complex, dynamic environments [98].
However, while the reviewed literature showcases significant progress in the perception, reasoning, and adaptive behavior of cobots, a deeper analysis reveals that, despite notable advances in perception and control architectures, significant challenges remain in achieving seamless, real-time, and trustworthy human–robot collaboration. Several key thematic insights and limitations warrant critical discussion to inform the next generation of human-centered collaborative systems.

5.1. Vision Technologies Integration and Deployment Challenges

Vision-based perception modules such as object detection, semantic segmentation, and human pose estimation have matured significantly [99], especially when combined with deep learning models. However, their integration into end-to-end cobot systems remains nontrivial. Real-world collaborative environments are unstructured, dynamic, and often noisy [100], resulting in performance degradation when transitioning from controlled lab settings to production scenarios [101]. Moreover, real-time constraints on inference latency, sensor fusion, and action planning introduce system-level bottlenecks, especially in safety-critical or time-sensitive applications.
In particular, while deep models like YOLOv11 or OpenPose offer state-of-the-art results in isolated benchmarks [102], their deployment in active control loops often requires model compression, hardware acceleration, and resilience to occlusion or adversarial perturbations. This underscores the need for modular, adaptive pipelines that can maintain their robust performance in diverse operational contexts.
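One plausible route through this deployment step is sketched below: a placeholder PyTorch perception model (untrained here, standing in for a trained network) is exported to ONNX and executed with ONNX Runtime, which can delegate inference to hardware accelerators. The specific model and provider list are illustrative assumptions.

```python
# Minimal sketch of edge deployment via ONNX export and ONNX Runtime.
# The torchvision model is a placeholder for a trained perception network.
import numpy as np
import torch
import torchvision.models as models
import onnxruntime as ort

# 1. Export a placeholder model to ONNX.
net = models.mobilenet_v3_small(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(net, dummy, "perception.onnx",
                  input_names=["image"], output_names=["logits"],
                  opset_version=17)

# 2. Run inference through ONNX Runtime; the provider list controls which
#    accelerator is used (falls back to CPU if nothing else is available).
session = ort.InferenceSession(
    "perception.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)   # stand-in for a preprocessed camera frame
logits = session.run(["logits"], {"image": frame})[0]
print("predicted class:", int(logits.argmax()))
```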

5.2. Real-Time Adaptation and Learning

AI-enhanced collaborative robots, equipped with advanced computer vision, are increasingly capable of real-time adaptation to changes in human behavior and task environments. Unlike traditional systems constrained by pre-programmed rules, these cobots utilize reinforcement learning to dynamically adjust their behavior. For instance, Hou et al. [75] present a deep reinforcement learning framework for mobile cobots in collaborative assembly, demonstrating improvements in timing and efficiency by learning human task patterns on the fly. Such systems adapt their support strategies based on human actions, leading to a reduction of over 50% in idle time in controlled scenarios [75].
Computer vision is instrumental in these systems, capturing real-time data on human gestures, tool positions, and workspace changes. This perceptual input informs AI-based decision-making, allowing robots to modify their assistance level, timing, or task sequence. Qiu et al. [76] further enhance reinforcement learning with techniques that increase sample efficiency, enabling faster adaptation with fewer training episodes.
The combination of real-time sensor fusion and anticipatory AI also supports proactive behavior. Robots can infer user intent and adjust their pace or engagement accordingly. Salvato et al. [77] provide a review of sim-to-real transfer methods that bridge the gap between training environments and real-world deployment, improving reliability and generalization. Vision-based scene understanding, as explored by Fan et al. [24], supports the contextual awareness that is necessary for these dynamic adaptations.
Recent advances have demonstrated that bridging the sim-to-real gap is feasible even for complex, collaborative tasks. Two real-world case studies illustrate this: (1) NVIDIA's gear assembly [103] and (2) robot-assisted surgery at Heidelberg University Hospital [104], detailed below.
  • NVIDIA leveraged its Isaac Lab and ROS frameworks to develop a contact-rich gear assembly task on the UR10e robot. This system achieved zero-shot transfer—the model was trained entirely in a simulation and directly deployed on real hardware without additional tuning [103]. The robot successfully picked, transported, and inserted multiple gears in random poses—demonstrating remarkable robustness and adaptability directly from a simulation.
  • Researchers from Karlsruhe Institute of Technology and Heidelberg University Hospital trained in simulation for robotic tissue manipulation in surgery. They applied pixel-level domain adaptation (unpaired image translation) to bridge visual differences, enabling zero-shot deployment in a real surgical environment. This is one of the first successful sim-to-real transfers involving deformable tissues in a surgical robotic context, marking a significant milestone toward clinical-world deployment.
However, these case studies required an exceptional investment of time and effort, along with extremely fine-grained control of real-time motion; they are therefore likely to remain rare exceptions rather than a template for a general solution to the sim-to-real gap.
Despite these advances, challenges persist—particularly in maintaining learning stability and ensuring user transparency. Nonetheless, the integration of real-time perception and AI learning marks a pivotal step toward cobots that function as adaptive, intelligent teammates across domains where flexibility, precision, and responsiveness are essential.

5.3. Trends in Cobots’ Hardware-Related Capabilities

In recent years, significant hardware innovations have enhanced the safety, adaptability, and operational efficiency of cobots, accelerating their adoption in industrial and human-centric environments [75]. A primary development has been the widespread integration of high-resolution torque sensors and series elastic actuators (SEA), both of which contribute to improved compliance and force control at the joint level, allowing safer physical interactions with human workers [105]. These systems provide fine-grained feedback that supports adaptive responses during unstructured tasks.
Moreover, advances in lightweight composite structures—notably the use of carbon fiber-reinforced polymers and 3D-printed aluminum alloys—have yielded arms with lower inertia and greater payload-to-weight ratios [106], enabling higher-speed operation while maintaining ISO/TS 15066-compliant safety profiles [107]. Complementing these are compact, backlash-free harmonic drives [108] and miniaturized servo systems [106], which allow for precise motion in tight workspaces and reduce the physical footprint of cobots [109].
The development of back-drivable actuators and soft robotic joints has further increased cobot usability in kinesthetic teaching and physical guidance tasks [110]. These allow the human operator to intuitively manipulate the cobot without motor resistance. To facilitate broader task versatility, many systems now include modular end-effector interfaces with quick-change couplers [111] and embedded multi-axis force–torque sensors at the wrist, enhancing manipulation sensitivity during delicate assembly operations [112].
Also notable are the soft grippers made from pneumatic or tendon-driven designs [113]. Soft grippers enable the safe handling of deformable or irregular objects, expanding the range of applications [114]. In parallel, energy-efficient and regenerative actuator modules have emerged, capable of harvesting energy from deceleration phases, contributing to improved energy consumption metrics over long operational cycles [115].
Lastly, the deployment of redundant joint configurations (e.g., 7-DOF arms) has improved task flexibility and enhanced cobot maneuverability [116]. This is especially important for avoiding collisions in cluttered or collaborative spaces [117] without depending on external sensors for collision avoidance. These hardware advances are laying the foundation for the next generation of scalable, safe, and human-friendly cobots.

5.4. Trends Based on Technological Convergence

Recent trends indicate a clear trajectory towards holistic scene understanding, where robots do not merely react to stimuli but proactively interpret complex contexts. Holistic approaches—such as those detailed in [24]—emphasize multi-level visual cognition (object, human, and environment) to achieve anticipatory planning. These trends align with the rise of proactive human–robot collaboration (HRC), demanding cognitive empathy and contextual awareness beyond conventional reactive systems [118].
Simultaneously, the synergistic use of deep learning for visual tasks—such as segmentation, pose estimation, and trajectory prediction—is enabling more robust interpretation of dynamic, unstructured environments [119]. LSTM-based and transformer-based models are particularly promising for temporal prediction [120]. These models are crucial in shared workspaces where robot actions must be synchronized with human behaviors. Table 4 summarizes this section.

6. Cobot Considerations in System Architecture, AI, and Algorithms

6.1. Advanced System Architectures and Integration

Modern collaborative robots necessitate integrated system architectures that seamlessly combine perception, planning, and control. Departing from traditional “sense–plan–act” pipelines, contemporary designs employ middleware solutions like ROS-Industrial to facilitate real-time sensor fusion and decision-making processes [121]. For instance, the implementation of an autonomous industrial cell utilizing eye-in-hand stereo cameras and visual servoing demonstrates the efficacy of such integrations in enhancing flexibility and accuracy in production lines [19].
Specialized cognitive stacks, such as DAC-HRC, and edge computing configurations are optimized for latency-sensitive visual tasks, enabling cobots to process complex data efficiently [122]. These advancements are particularly beneficial in manufacturing environments, where intelligent cobot systems have been shown to improve human–robot collaboration and overall productivity [16].
The transition towards lightweight, domain-specific frameworks is replacing monolithic solutions, thereby enhancing both scalability and adaptability across various sectors. In the realm of smart manufacturing, incorporating human-in-the-loop robot learning approaches ensures that cobots can adapt to dynamic tasks while maintaining safety and efficiency standards [29,123].

6.2. Evolving AI Paradigms: From Deep Learning to Foundation Models

As mentioned, AI in collaborative robotics is shifting from task-specific models to foundation models—large-scale pre-trained models that adapt to multiple tasks. Vision-language models (VLMs) and large language models (LLMs) now support scene understanding and task reasoning without retraining, enabling generalization to new settings [67]. These models offer zero-shot recognition [124] and language-driven planning [125], advancing cobot capabilities. Vision transformers (ViTs) and multimodal transformers further improve perception, integrating text and image inputs in real time [47]. This shift also brings challenges in explainability and computational demand, prompting hybrid approaches combining deep learning and classical planning [26,70,126].

6.3. Role of Neural Policy Architectures

One of the most transformative developments in collaborative robotics is the introduction of neural policy architectures, including deep reinforcement learning (DRL), imitation learning, and end-to-end visuomotor control [127]. These methods have enabled cobots to learn flexible task policies from experience rather than relying solely on rule-based programming.
However, DRL approaches are often data-intensive, difficult to interpret, and prone to overfitting in narrowly defined task environments [128]. The sim-to-real gap—where policies trained in simulation fail to generalize in the physical world—remains a central challenge [129,130,131]. Hybrid architectures [132] combining symbolic reasoning, probabilistic planning, and learned policies are a promising direction, offering greater robustness and generalization across task domains.

6.4. Algorithmic Frameworks for Dynamic Environments

Cobots operating in unstructured environments must plan under uncertainty. Recent frameworks apply probabilistic models—such as partially observable Markov decision processes (POMDPs)—and intention recognition via motion analysis to adapt to human actions [133]. For example, deep learning-based motion detection algorithms improve recognition accuracy and support flexible response behaviors [30,134].
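A minimal illustration of such probabilistic intent recognition is a discrete Bayes filter over candidate human intentions, as sketched below; the intent set, cue labels, and observation model are illustrative assumptions rather than parameters from the cited frameworks.

```python
# Minimal sketch of a discrete Bayesian belief update over hypothetical human
# intentions, driven by observed motion cues. All values are illustrative.
import numpy as np

INTENTS = ["reach_part_A", "reach_part_B", "idle"]

# P(observation | intent): rows = intents, columns = observed cues
# ("hand_near_A", "hand_near_B", "hand_at_rest").
OBS_MODEL = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
])
CUES = {"hand_near_A": 0, "hand_near_B": 1, "hand_at_rest": 2}

def update_belief(belief: np.ndarray, cue: str) -> np.ndarray:
    """One Bayes-filter step: posterior is proportional to likelihood times prior."""
    posterior = OBS_MODEL[:, CUES[cue]] * belief
    return posterior / posterior.sum()

belief = np.ones(len(INTENTS)) / len(INTENTS)   # uniform prior
for cue in ["hand_near_A", "hand_near_A", "hand_at_rest"]:
    belief = update_belief(belief, cue)
    print(dict(zip(INTENTS, belief.round(3))))

# A planner would act only once the belief in one intent exceeds a threshold,
# e.g. pre-fetching part A when P(reach_part_A) > 0.8.
```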
Advanced semantic mapping approaches enable cobots to interpret dynamic environments and update operational plans accordingly [19,53]. Additionally, digital twins provide a virtual testbed for pre-validating robot strategies, enhancing safety and robustness during real deployment [12,135]. Table 5 summarizes this section.

7. Human Related Challenges and Trends

Human–robot collaboration (HRC) is inherently related to human interaction and human factors. While many collaborative capabilities have been added to modern cobots, some human-related challenges remain. For example, trust, transparency, and intention prediction are often qualitatively addressed, but lack rigorous and quantitative integration into system models. Robinson et al. [136] underscore this gap by highlighting how robotic vision could more effectively support intuitive communication, affective understanding, and non-verbal cues for seamless co-adaptation.
Explainable AI (XAI) frameworks remain in their infancy in cobotic systems [126,137]. As collaborative autonomy grows, explainability becomes essential not just for debugging, but for human confidence and safety. Ethically guided autonomy—i.e., capable of incorporating normative constraints—must be embedded within the planning and actuation pipelines [138].

7.1. Trust Calibration in Collaborative Robotics

Effective human–robot collaboration hinges on maintaining appropriately calibrated trust, avoiding both over-reliance and under-utilization. Recent literature underscores the efficacy of systems where robots monitor and adapt to human trust in real time [39,84,139]. For instance, internal performance self-assessment mechanisms and confidence signaling strategies have demonstrated measurable improvements in both perceived trust and team coordination [140]. These enhancements occur without altering the robot’s functional capabilities but by adjusting its behavioral transparency and responsiveness [141,142].
Trust calibration in human–robot collaboration relies on both subjective and objective metrics. Common subjective methods include Likert-scale surveys (e.g., perceived safety, reliability, and predictability) and structured instruments like the Trust in Automation Scale. Objective measures often involve physiological monitoring (heart rate variability, galvanic skin response, and gaze tracking) to detect stress or vigilance changes.
Vision and AI systems can support trust calibration by modeling human states in real time—detecting signs of hesitation or disengagement via facial expressions, posture, or gaze aversion. Adaptive systems can then modulate robot behavior by slowing movements, increasing transparency (e.g., visual cues or explanations), or requesting confirmation before acting.
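The following sketch illustrates one way such modulation could be wired together: a heuristic trust/engagement estimate fused from vision-derived cues drives speed scaling, transparency, and confirmation requests. All cue names, weights, and thresholds are illustrative assumptions, not values from the cited studies.

```python
# Minimal sketch of trust-aware behavior modulation from vision-derived cues.
# Cue definitions, weights, and thresholds are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class HumanState:
    gaze_on_task: float      # 0..1, fraction of recent frames with gaze on the shared task
    hesitation: float        # 0..1, e.g. frequency of aborted reach motions
    stress_proxy: float      # 0..1, e.g. normalized physiological stress signal

def trust_estimate(s: HumanState) -> float:
    """Heuristic fusion of cues into a single calibration score in [0, 1]."""
    score = 0.5 * s.gaze_on_task + 0.3 * (1 - s.hesitation) + 0.2 * (1 - s.stress_proxy)
    return max(0.0, min(1.0, score))

def modulate_behavior(s: HumanState) -> dict:
    t = trust_estimate(s)
    return {
        "speed_scale": 0.4 + 0.6 * t,          # slow down when trust/engagement is low
        "explain_actions": t < 0.6,            # increase transparency via cues or explanations
        "ask_confirmation": t < 0.4,           # require confirmation before acting
    }

print(modulate_behavior(HumanState(gaze_on_task=0.3, hesitation=0.7, stress_proxy=0.5)))
```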
Computer vision and AI integration are central to these systems: vision modules capture behavioral cues such as gaze aversion or hesitation, while AI algorithms determine when and how to intervene to preserve trust levels [143]. Trust repair is also emerging as a critical component, with evidence showing that timely explanations or apologies—especially when synchronized with human expectations—can restore trust after errors [144,145]. Such mechanisms are particularly vital in high-stakes domains like healthcare and manufacturing, where a misalignment of trust can compromise safety and performance.

7.2. Socio-Cognitive Models and HRI

Collaborative robots are increasingly expected to operate with socio-cognitive awareness, enabling them to interpret human intentions, gestures, and mental states [146]. Cognitive architectures such as DAC-HRC integrate modules for memory, perception, and decision-making to support mechanisms like joint attention, turn-taking, and social responsiveness [6,122]. These systems fuse visual inputs—such as gaze tracking, body pose estimation, and facial expression detection—with AI-based reasoning to infer user goals and adapt their assistance strategies accordingly [147,148].
Studies demonstrate that incorporating Theory of Mind capabilities—where the robot models human beliefs, emotions, or expectations—significantly enhances user trust and task fluency, especially when robots apply trust–repair behaviors (e.g., explanations and apologies) tailored to situational and cultural contexts [25,27,149,150].
For example, in surgical robotics, cognitive architectures enable assistance robots to interpret eye gaze and verbal cues from surgeons, thereby anticipating required instruments or imaging views without explicit commands [151]. This contributes to a smoother workflow, reduced cognitive load, and enhanced sterility compliance [152]. In collaborative manufacturing, socio-cognitive models support dynamic negotiation and role allocation [153]: a robot can infer that a human worker is uncertain based on hesitations or gaze aversion, prompting the robot to take initiative or offer guidance in real time [154].
Furthermore, systems capable of turn-taking and shared intention modeling have shown measurable gains in cooperative task performance [122]. These are especially relevant in education or training environments, where cobots serve as instructional assistants that modulate their responses based on learner engagement or confusion signals [155].
Future developments will likely integrate multimodal signals—such as voice tone, pupil dilation, and micro-gestures—into unified user models. These models will drive more fluent, context-sensitive interaction styles, contributing to safe, efficient, and personalized human–robot collaboration in high-stakes environments.
Figure 3 illustrates the integration of real-time adaptation and socio-cognitive interaction in vision-based collaborative robotics.

7.3. Ethical and Societal Considerations

The integration of AI and vision into collaborative robots also raises broader ethical and societal questions. The use of visual data in sensitive environments (e.g., healthcare, education, public spaces, and defense and national security systems) must adhere to principles of privacy, consent, and data security. Moreover, biased datasets and opaque decision-making processes risk reinforcing systemic inequities or causing harm.
Ethics in the digital realm, and especially “Ethics in AI,” is a well-established domain [156]. While the ethical field is broad, the deployment of vision in robotic activity stands to benefit from an “Ethics-by-Design” approach [157]: Ethics-by-Design for vision systems helps ensure fairness, privacy, and transparency [103]. Therefore, in Section 8.4 we suggest a framework for Vision-Enabled Cobots.
To mitigate ethical risks, future cobot systems must incorporate ethical safeguards such as privacy-preserving computer vision, value-aligned learning objectives, and auditability of actions. Furthermore, interdisciplinary collaboration across engineering, ethics, and human factors is essential for developing inclusive and accountable HRC systems. Table 6 summarizes this section.

8. Discussion

So far, the paper has presented a comprehensive and integrative review of the recent advances in fusing computer vision and artificial intelligence (AI) within collaborative robotic systems. By systematically analyzing key technologies—ranging from object detection, scene understanding, and human pose estimation to reinforcement learning and explainable AI—we have outlined how the convergence of perception and cognition is transforming cobots into perceptive, adaptive, and context-aware collaborators.
Cobots are deployed in a range of application domains; the three most popular are the following:
  • Manufacturing and Industry: Cobots are central to the evolution of flexible, human-centric production lines. By combining foundation models with multimodal sensing, industrial robots are executing complex tasks in real time. The focus on safety, sustainability, and seamless human–machine collaboration aligns with the principles of Industry 5.0 [1,4,81,109,124,126,137].
  • Healthcare and Assistive Robotics: Cobots are increasingly supporting hospital operations—from delivery and cleaning to patient monitoring [89,133]. Emotion-aware and trust-sensitive designs are improving safety and acceptance. In surgical settings, vision-guided robots are enhancing tool precision, while rehabilitation robots are using pose estimation for adaptive therapy delivery [151,152].
  • Education and Training: Educational robots are leveraging vision and affective computing to create adaptive, emotionally intelligent learning environments. These systems promote student engagement, personalize instruction, and help educators monitor cognitive and emotional states to improve learning outcomes [148].
Current trends in collaborative robotics demonstrate a shift from reactive, rule-based systems to data-driven, proactive robots capable of reasoning about complex visual environments and anticipating human intent. Deep learning architectures have dramatically improved performance in visual recognition, while multimodal sensor fusion is enabling richer, more nuanced human–robot interaction (HRI). The following subsections present the major trends in this domain.

8.1. Perceptual Awareness to Intent Understanding

The integration of computer vision and AI has advanced cobots from basic perception tasks—like object detection and human pose estimation—to richer understanding of scenes. Despite this, moving from perception to understanding intention remains an underdeveloped frontier that is crucial for proactive collaboration.
Intent prediction requires not only recognizing the current scene but anticipating human goals and future actions—enabling robots to assist before being explicitly commanded. Several barriers explain its limited progress:
  • Sparse Contextual Data: Existing datasets rarely capture labeled human intentions over time, making supervised learning of intent inference difficult.
  • Ambiguous Visual Cues: Similar gestures may signal different intentions depending on task context, making disambiguation hard from vision alone.
  • Limited Multimodal Integration: Audio, gaze, force sensing, and verbal cues are critical for intent inference but remain challenging to fuse robustly in real time.
  • Poor Generalization: Models often overfit to specific tasks or environments, failing to adapt to new users, layouts, or cultural practices.
  • Real-Time and Safety Constraints: Predicting intent must happen quickly and reliably to avoid errors that compromise trust and safety.
Addressing these challenges requires developing multimodal, context-aware learning frameworks, data-efficient training strategies, and probabilistic models that can reason under uncertainty. Advancing intent understanding is essential to enable cobots to transition from reactive tools to proactive, intuitive collaborators in shared workspaces.

8.2. Adaptive Interaction in Shared Workspaces

In dynamic human environments, adaptability is paramount. Vision-based SLAM and semantic scene modeling enable environmental awareness, but current systems often lack real-time reactivity and predictive responsiveness. Achieving fluid, intuitive interaction necessitates integrating sensor data with deep reinforcement learning and multimodal reasoning under uncertainty.

8.3. Challenges in Real-Time and Scalable Machine Learning

Real-world deployment confronts various practical constraints: latency [13,158], data heterogeneity [159,160], and domain shift [161]. Techniques such as few-shot and continual learning are promising but immature in collaborative robotics. Scaling machine-learning architectures without compromising safety or interpretability is a major bottleneck [162], especially in cloud–edge hybrid architectures [163].
A promising strategy to mitigate data inefficiency in cobot vision systems is the adoption of self-supervised learning (SSL) approaches, which reduce reliance on expensive, manually labeled datasets [164]. SSL leverages large quantities of raw, unlabeled sensor data—such as RGB-D images, force–torque signals, or robot trajectories—to learn useful representations via pretext tasks [165], for example predicting future frames, reconstructing occluded regions, or aligning multimodal inputs. These learned representations can then be fine-tuned on smaller, task-specific labeled datasets, improving data efficiency while retaining high-level performance. In the context of collaborative robotics, SSL can enable robust object detection, pose estimation, and scene understanding even in dynamic, cluttered environments with limited annotation. Moreover, SSL frameworks can facilitate continual learning on the shop floor, adapting to novel objects and layouts without costly data curation. Integrating self-supervised learning thus offers a scalable pathway toward more adaptable, data-efficient vision systems for safe and intuitive human–robot collaboration.
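A minimal sketch of one such pretext task is given below: a small convolutional autoencoder reconstructs randomly masked patches of unlabeled workcell images, after which its encoder could be fine-tuned on a small labeled set. The architecture and masking scheme are illustrative assumptions, not a method taken from the cited works.

```python
# Minimal sketch of a masked-reconstruction pretext task for self-supervised
# pretraining on unlabeled workcell images. Architecture and masking scheme
# are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def random_patch_mask(x: torch.Tensor, patch: int = 16, drop: float = 0.5) -> torch.Tensor:
    """Zero out a random fraction of patch-sized blocks in each image."""
    b, _, h, w = x.shape
    mask = (torch.rand(b, 1, h // patch, w // patch) > drop).float()
    mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask

model = MaskedAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(10):                                    # toy loop over random "unlabeled frames"
    frames = torch.rand(8, 3, 64, 64)                  # stand-in for unlabeled RGB crops
    recon = model(random_patch_mask(frames))
    loss = loss_fn(recon, frames)                      # reconstruct the occluded regions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final pretext loss: {loss.item():.4f}")
```

After pretraining, the encoder weights would be reused to initialize a detection or pose-estimation head, which is where the labeled-data savings discussed above are realized.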

8.4. Human Trust, Transparency, and Ethical AI

Trust is foundational to human–cobot collaboration. We limit the discussion to non-technical end-users (rather than developers) to avoid confusion and scope creep, while addressing the concerns of a much larger population. Trust shapes whether humans feel safe, confident, and willing to share tasks or space with robots. Well-calibrated trust improves teamwork, reduces errors, and boosts productivity, while mistrust or over-reliance can cause accidents or underuse. Building trust therefore requires explainable AI-driven decisions, transparency in perception–action loops, and ethical design that ensures cobot behavior is predictable, interpretable, and aligned with human values. However, current systems often treat explainability as an afterthought rather than a design imperative. Based on the principles of “Ethics-by-Design” presented in Section 7.3, we propose the following framework for Vision-Enabled Cobots:
Ethics-by-Design Framework for Vision-Enabled Cobots
To ensure fairness, privacy, and transparency in collaborative robotics, vision systems should follow an ethics-by-design approach with these practical pillars:
  • Stakeholder Analysis:
    • Engage operators, designers, and ethicists early.
    • Identify potential biases, vulnerabilities, and social impacts.
  • Data Privacy and Minimization:
    • Collect only necessary visual data.
    • Apply on-device processing when possible to reduce cloud exposure.
    • Anonymize human data (e.g., blur faces, remove identifiers).
  • Bias and Fairness Auditing:
    • Evaluate training datasets for representation gaps.
    • Test models across diverse user groups to avoid discriminatory performance.
    • Retrain with balanced data if disparities emerge.
  • Explainable AI (XAI) Integration:
    • Provide human-understandable reasons for vision-driven decisions.
    • Use visual overlays or verbal cues to explain robot intentions.
  • Safety and Consent:
    • Implement opt-in/opt-out mechanisms for visual monitoring.
    • Ensure clear signaling when cameras are active.
    • Provide manual overrides for emergency stops.
  • Continuous Monitoring and Feedback:
    • Track system performance and ethical compliance post-deployment.
    • Gather operator feedback to refine policies.
    • Establish accountability lines for failures or breaches.
By embedding these steps throughout their design and deployment, vision-based cobots can promote trust, protect user rights, and align with broader societal values.
Future systems must embed ethical reasoning and user-centered transparency at all levels of decision-making.

8.5. Toward Seamless and Proactive Collaboration

Ultimately, the vision of collaborative robots as proactive teammates requires the seamless integration of perceptual and cognitive capabilities. This entails synchronized visual perception, semantic scene understanding, and real-time adaptive planning—executed with a degree of autonomy that remains accountable and transparent to human partners. Bridging these gaps is the next grand challenge in collaborative robotics.
Still, challenges remain: real-time inference under computational constraints, the lack of generalizability across tasks and environments, data inefficiency, and the need for ethical, explainable decision-making all persist as major barriers. Some of the major challenges are presented in the following subsections.

8.6. Interoperability Challenges and Open Robotic Architectures

Interoperability requires transparent independence between the hardware of the cobot and its vision system and the behavior of that hardware across different software platforms. Middleware frameworks allow different platforms to deliver comparable control, manipulation, and vision-based scene understanding; relevant examples include ROS 2, OPC UA, RobMoSys, OPIL, SmartSoft/BRIDE, and ROS-Industrial. Standard middleware frameworks such as ROS 2 offer advanced control interfaces and AI-oriented perception extensions (e.g., ros2_control, MoveIt 2, and the ROS 2 perception stacks). These extensions have greatly improved basic interoperability for cobot systems, but they remain insufficient to achieve true interoperability in control, manipulation, and vision-based scene understanding.
Middleware solves the problem of message exchange but does not address deeper semantic alignment across heterogeneous systems. For cobots to seamlessly share tasks such as collaborative assembly, dynamic object handover, or shared workspace navigation, they must agree not only on control interfaces but also on representations of objects, scenes, and human intents. This requires standardizing semantic models for scene understanding, including object taxonomies, affordances, human pose, and activity definitions, as well as shared ontologies for describing manipulation tasks and constraints. Without such semantic interoperability, perception modules (e.g., object detectors or 3D scene parsers) remain vendor- or task-specific, limiting reusability and coordination across platforms.
Moreover, achieving safe, context-aware control in mixed-robot teams demands harmonized protocols for exchanging intent, negotiating plans, and enforcing safety constraints (e.g., force limits and dynamic collision avoidance) in real time. Advancing interoperability in cobotics will therefore require not only robust middleware for communication but also standardized abstractions for perception, control policies, and collaborative planning, ultimately enabling the modular, plug-and-play integration of heterogeneous robots with a shared understanding of their environment and tasks.
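The gap between message exchange and semantic alignment can be illustrated with a short ROS 2 (rclpy) sketch: the node below publishes a JSON scene description that any middleware peer can receive, yet its field names and object taxonomy are an ad hoc assumption rather than a shared ontology, which is precisely the missing layer discussed above.

```python
# Minimal sketch of a ROS 2 node publishing an ad hoc semantic scene description.
# rclpy handles transport; the JSON schema itself is an illustrative assumption,
# not a standardized ontology.
import json

import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class SemanticScenePublisher(Node):
    def __init__(self):
        super().__init__("semantic_scene_publisher")
        self.pub = self.create_publisher(String, "semantic_scene", 10)
        self.timer = self.create_timer(0.5, self.publish_scene)

    def publish_scene(self):
        # Hypothetical output of a perception stack; field names are illustrative.
        scene = {
            "objects": [{"class": "gear", "pose_xyz": [0.42, 0.10, 0.05], "affordance": "graspable"}],
            "humans": [{"id": 1, "activity": "reaching", "predicted_intent": "handover"}],
        }
        msg = String()
        msg.data = json.dumps(scene)
        self.pub.publish(msg)

def main():
    rclpy.init()
    node = SemanticScenePublisher()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```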
Table 7 summarizes the key differences between controlled laboratory demonstrations and real-world deployment environments for vision–AI-enabled collaborative robots, illustrating why solutions validated under ideal conditions often struggle to achieve robust, safe, and trustworthy performance in dynamic human-centered settings.

8.7. Limitations in Real-World Deployability

Despite these innovations, a notable gap remains between laboratory demonstrations and real-world applicability. A common challenge is generalizability: vision–AI models often overfit to controlled environments and fail to maintain their performance under real-world conditions, such as noise, lighting variability, occlusions, or sensor degradation. Additionally, multimodal fusion systems—though promising—introduce complexity in calibration, latency, and data synchronization across heterogeneous sensors (e.g., vision, tactile, force, and auditory).
Furthermore, the reliance on data-intensive deep learning methods remains a bottleneck, especially in applications where annotated datasets are scarce or ethically sensitive (e.g., healthcare HRC). The absence of standardization in benchmarking and evaluation frameworks across domains exacerbates this issue, complicating cross-comparative validation.
To overcome these barriers, future research should explore self-supervised learning from human–robot shared experiences, enabling cobots to continuously learn from their regular operation without a need for large datasets. Neuro-symbolic vision systems, which combine data-driven learning with logic-based reasoning, offer a path toward explainable yet robust visual understanding. Another promising direction is spatial–temporal scene imagination—robots capable of simulating future scenarios using predictive models grounded in visual data to proactively reconfigure tasks and avert risks.
Furthermore, collaborative visual grounding—in which humans and cobots co-construct shared semantic maps—can enhance mutual understanding and task alignment. Finally, embedding affective computing into vision–AI pipelines could enrich collaboration through empathy-driven responses to human emotions and behaviors. This could make cobots contextually intelligent and socially attuned, in addition to their functional competency.

9. Conclusions and Future Directions

This review has presented a comprehensive and integrative synthesis of recent advances in the integration of computer vision and artificial intelligence (AI) within collaborative robotic systems. By systematically examining foundational technologies—such as object detection, scene understanding, human pose estimation, reinforcement learning, and explainable AI—we have illustrated how the convergence of perception and cognition is transforming collaborative robots (cobots) into perceptive, adaptive, and context-aware partners.
Recent developments indicate a clear evolution from reactive, rule-based systems to data-driven, proactive robots capable of interpreting complex visual environments and anticipating human intentions. Deep learning has substantially improved visual recognition, while multimodal sensor fusion has enriched human–robot interaction (HRI) with greater nuance and responsiveness.
However, several critical challenges persist. A major gap remains between laboratory performance and real-world deployment. Generalizability remains a significant hurdle: vision–AI models often overfit to controlled environments and degrade in performance under real-world conditions such as variable lighting, occlusions, sensor degradation, and environmental noise. Multimodal fusion systems, while promising, bring their own challenges—including calibration complexities, latency, and synchronization issues across heterogeneous sensors like vision, tactile, auditory, and force feedback. Moreover, the reliance on data-intensive deep learning methods poses limitations, particularly in domains where annotated data is scarce or ethically constrained, such as healthcare and public security. The lack of standardized benchmarks and evaluation protocols across application domains further complicates both reproducibility and cross-comparative validation. Ethical concerns, data efficiency, real-time inference under hardware constraints, and the need for explainable decision-making remain pressing issues.
To address these barriers, future research must prioritize several strategic directions.

9.1. Proactive Human–Robot Collaboration

Current systems largely operate reactively, responding to human inputs after the fact. The next frontier involves equipping cobots with proactive collaboration capabilities—anticipating human actions, predicting intent, and dynamically adapting their behavior. This necessitates advancements in holistic scene understanding, temporal reasoning, and multimodal data integration to enable predictive modelling and real-time coordination. To achieve this, we highlight specific technical directions:
  • Spatiotemporal Transformers for predicting human motion and intent from video sequences, enabling anticipatory planning.
  • Graph Neural Networks to fuse multimodal inputs (vision, force, and audio) for relational reasoning about humans, objects, and tasks.
  • Real-Time 3D Semantic Mapping using neural implicit representations for a detailed, dynamic workspace understanding.
  • Uncertainty-Aware Trajectory Prediction with probabilistic models to ensure safe and cautious planning around humans.
  • Explainable Reinforcement Learning to make robot policies interpretable and trustworthy.
  • Few-Shot and Meta-Learning for rapid adaptation to new tasks and collaborators with minimal data.
  • Edge AI Deployment to deliver low-latency perception and planning for responsive and safe interaction.
These targeted approaches can operationalize the promise of proactive, fluent, and human-centered collaboration in next-generation cobot systems.
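To make the uncertainty-aware trajectory prediction item concrete, the sketch below uses a simplified constant-velocity model whose positional uncertainty grows with the prediction horizon; the noise parameter and clearance rule are illustrative assumptions, not validated values.

```python
import numpy as np

def predict_with_uncertainty(pos, vel, horizon_s, dt=0.1, sigma_accel=0.8):
    """Propagate a tracked human position with a constant-velocity model.
    Positional uncertainty grows with the prediction horizon, so a planner
    can keep a larger clearance the further ahead it looks."""
    steps = int(horizon_s / dt)
    predictions = []
    var = 0.0                                       # positional variance [m^2]
    p = np.asarray(pos, dtype=float)
    v = np.asarray(vel, dtype=float)
    for k in range(1, steps + 1):
        p = p + v * dt                              # constant-velocity propagation
        var += (sigma_accel * (k * dt) * dt) ** 2   # accumulated process noise
        predictions.append((p.copy(), np.sqrt(var)))  # (mean position, std radius)
    return predictions

# A planner might require: clearance >= base_margin + 3 * std at each step.
for mean, std in predict_with_uncertainty(pos=[1.0, 0.5], vel=[0.3, 0.0],
                                           horizon_s=1.0):
    required_clearance = 0.2 + 3.0 * std
    # ...check the cobot's planned trajectory against this inflated radius...
```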

9.2. Few-Shot and Continual Learning at the Edge

Cobots must generalize across diverse users, tasks, and environments while operating under computational constraints. Leveraging few-shot, zero-shot, and continual learning techniques—optimized for edge devices—will be critical. Lightweight transformer architectures and efficient convolutional neural networks (CNNs) designed for on-device inference can support real-time learning and reduce the risk of catastrophic forgetting.
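As a minimal sketch of on-device few-shot adaptation (assuming a frozen, pre-trained embedding backbone; all names and sizes below are illustrative), a nearest-prototype classifier can be fitted from a handful of labeled examples per new object class without any gradient updates:

```python
import numpy as np

def build_prototypes(embeddings, labels):
    """Average the few available embeddings per class into one prototype each."""
    prototypes = {}
    for label in set(labels):
        members = embeddings[[i for i, l in enumerate(labels) if l == label]]
        prototypes[label] = members.mean(axis=0)
    return prototypes

def classify(query_embedding, prototypes):
    """Assign the query to the nearest class prototype (Euclidean distance)."""
    return min(prototypes,
               key=lambda c: np.linalg.norm(query_embedding - prototypes[c]))

# Support set: e.g., three embeddings per new part produced by a frozen backbone.
support = np.random.rand(6, 128)                  # stand-in for real embeddings
labels = ["bracket", "bracket", "bracket", "clip", "clip", "clip"]
protos = build_prototypes(support, labels)
print(classify(np.random.rand(128), protos))
```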

9.3. Interoperability and Open Robotic Architectures

The fragmented landscape of robotic hardware and software hampers scalability. Developing standardized, open-source middleware solutions for vision–AI integration will support plug-and-play compatibility, shared benchmarking, and broader collaboration between academia and industry. Modular and open robotic operating systems will be instrumental in driving interoperability.

9.4. Explainability, Safety, and Human Trust

As cobots move into safety-critical domains such as manufacturing, healthcare, and public spaces, explainability, safety, and human trust become essential system requirements [87,90,91,126,137]. Without user trust, collaborative robots risk underutilization or unsafe operation, undermining their value in human-centered workspaces. Explainable AI (XAI) methods like LIME and SHAP provide model-agnostic interpretability, helping engineers and operators understand predictions of complex perception models [126,137]. In robotics contexts, saliency mapping and attention visualization have been applied to visuomotor policies to clarify which visual inputs drive cobot actions [126].
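Beyond library-specific tools such as LIME and SHAP, a simple model-agnostic form of saliency mapping can be sketched directly; the predictor interface below is an assumption rather than a particular framework's API. Patches of the input are occluded one at a time, and the drop in the detector's confidence indicates how much each region contributed to the decision.

```python
import numpy as np

def occlusion_saliency(image, predict_fn, patch=16):
    """Model-agnostic saliency: the more the score drops when a patch is hidden,
    the more that region contributed to the prediction.
    `predict_fn(image) -> float` is assumed to return a confidence score."""
    h, w = image.shape[:2]
    baseline = predict_fn(image)
    saliency = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0   # zero out one patch
            saliency[i // patch, j // patch] = baseline - predict_fn(occluded)
    return saliency  # upsample and overlay on the frame for operator feedback
```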
For safety assurance, frameworks such as SafeROS and VerifAI enable the formal verification of perception and planning components against safety requirements [90,91]. These tools help identify policy violations or unsafe behaviors before deployment.
International standards reinforce these needs. ISO/TS 15066 specifies safety limits and risk assessments for human–robot collaboration, while the EU AI Act introduces risk-based requirements for transparency, human oversight, and robustness in AI systems [90]. The IEEE P7000 series provides structured guidance for embedding ethical considerations and explainability throughout system design [91].
Integrating these tools and frameworks into cobot development pipelines is critical for ensuring systems are not only effective and adaptive, but also safe, interpretable, and trustworthy for human partners.
Future systems must incorporate transparency, fail-safes, and ethically guided autonomy to ensure safety, trustworthiness, and regulatory compliance, with explainability tightly coupled to real-time monitoring and adaptive control within the decision-making pipeline.
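A minimal sketch of such a runtime monitor is given below; the thresholds follow the spirit of speed-and-separation monitoring under ISO/TS 15066, but the specific values and interface are illustrative assumptions rather than normative limits.

```python
from dataclasses import dataclass

@dataclass
class SafetyLimits:
    max_contact_force_n: float = 140.0     # illustrative, application-specific
    min_separation_m: float = 0.5
    reduced_speed_separation_m: float = 1.0

def safety_action(separation_m: float, contact_force_n: float,
                  limits: SafetyLimits = SafetyLimits()) -> str:
    """Map the latest perception/force readings to a discrete safety response.
    Runs every control cycle, independently of the learned policy."""
    if contact_force_n > limits.max_contact_force_n:
        return "PROTECTIVE_STOP"        # force limit exceeded: stop immediately
    if separation_m < limits.min_separation_m:
        return "PROTECTIVE_STOP"        # human too close: stop and hold
    if separation_m < limits.reduced_speed_separation_m:
        return "REDUCED_SPEED"          # slow down and log the event
    return "NORMAL"

# Example: vision says the operator is 0.8 m away, wrist F/T sensor reads 12 N.
print(safety_action(separation_m=0.8, contact_force_n=12.0))  # -> REDUCED_SPEED
```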

9.5. Multimodal and Semantic Sensor Fusion

To move beyond traditional RGB-based vision, future cobots should harness a spectrum of sensory inputs—including LiDAR, tactile feedback, audio, and thermal imaging. Developing semantic sensor fusion frameworks that contextually align these diverse signals will enrich scene understanding and enable more accurate modeling of human intent and environmental dynamics. While multimodal sensor fusion promises richer perception for cobots, practical deployments often encounter conflicting or inconsistent data between modalities—such as discrepancies between LiDAR and RGB depth maps under reflective surfaces or poor lighting. Addressing these conflicts in real time requires robust fusion strategies that move beyond naive averaging. Probabilistic methods, such as Bayesian filtering and Kalman or particle filters, explicitly model uncertainty in each sensor, enabling cross-validation and dynamic weighting based on confidence. Hierarchical decision-making can prioritize sensors or levels of perceptual fidelity depending on task criticality (e.g., coarse mapping vs. fine manipulation) and context (e.g., LiDAR in low light). Learning-based fusion architectures can further resolve conflicts adaptively when trained on realistic sensor variations and failure modes. Ultimately, reliable fusion demands context-aware arbitration mechanisms that combine uncertainty modeling, cross-sensor validation, and action hierarchies to ensure safe and consistent perception for real-time manipulation and human–robot interaction.
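As a minimal sketch of the uncertainty-weighted fusion described above (the sensor noise values are illustrative assumptions), two depth estimates can be combined by inverse-variance weighting so that whichever sensor is currently less reliable, for example RGB depth in low light or LiDAR on a reflective surface, contributes less to the fused value:

```python
def fuse_depth(d_lidar, var_lidar, d_rgb, var_rgb):
    """Inverse-variance (precision-weighted) fusion of two depth measurements.
    The fused variance is always smaller than either input variance."""
    w_lidar = 1.0 / var_lidar
    w_rgb = 1.0 / var_rgb
    fused = (w_lidar * d_lidar + w_rgb * d_rgb) / (w_lidar + w_rgb)
    fused_var = 1.0 / (w_lidar + w_rgb)
    return fused, fused_var

# Low light: the RGB-derived depth is assigned a much larger variance,
# so the fused estimate leans towards the LiDAR reading.
print(fuse_depth(d_lidar=1.02, var_lidar=0.0004, d_rgb=1.20, var_rgb=0.01))
```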

9.6. Need for Standardization and Unified Evaluation Frameworks

Progress is currently hampered by the lack of unified benchmarking standards, which leads to inconsistent evaluation and siloed datasets that distort assessments of system performance and obscure genuine progress. To address this gap, the field is actively pursuing standardization and benchmarking efforts. Initiatives such as the EU RoboticsBench project aim to define shared tasks, metrics, and datasets for evaluating human–robot collaboration, including vision-based perception and manipulation in realistic industrial scenarios [90].
Efforts like RoMaIn (Robotics Manipulation Intelligence Benchmark) at ETH Zurich and the CLEAR Benchmark (Collaborative Learning and Adaptation in Robotics) [126] promote open-access, standardized protocols for assessing adaptation and learning in dynamic environments. For perception modules, datasets such as EPIC-KITCHENS (for object and action recognition in real-world kitchens) and HRI-COVID (capturing collaborative gestures under distancing constraints) offer annotated data that is relevant to human–robot interaction [24,99].
Software frameworks including ROS-Industrial, OpenDR, and the ROS 2 Quality Assurance Working Group support standard interfaces, middleware interoperability, and testing practices that enable the plug-and-play integration of perception, planning, and control modules across heterogeneous platforms [19,121].
However, the collaborative robotics community has not yet fully converged on unified evaluation protocols, open simulation environments, and real-world datasets for vision–AI-enabled cobots. The field must adopt shared simulation environments, open datasets, and evaluation metrics that reflect real-world complexities. Benchmarks should capture robustness, latency, adaptability, user trust, and overall system reliability across domains.
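As a small illustration of what a shared evaluation harness could report (the per-episode record format below is an assumption, not an existing benchmark schema), raw run logs can be reduced to comparable robustness and latency figures:

```python
import statistics

def success_rate(subset):
    """Fraction of successful episodes in a subset of runs."""
    return sum(r["success"] for r in subset) / len(subset) if subset else float("nan")

def summarize_runs(runs):
    """Reduce per-episode logs to the benchmark dimensions discussed above.
    Each run is assumed to be a dict with 'success' (bool), 'latency_ms' (float),
    and 'condition' ('nominal' or 'perturbed')."""
    latencies = sorted(r["latency_ms"] for r in runs)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    nominal = [r for r in runs if r["condition"] == "nominal"]
    perturbed = [r for r in runs if r["condition"] == "perturbed"]
    return {
        "success_rate_nominal": success_rate(nominal),
        "success_rate_perturbed": success_rate(perturbed),   # robustness proxy
        "robustness_gap": success_rate(nominal) - success_rate(perturbed),
        "latency_ms_median": statistics.median(latencies),
        "latency_ms_p95": p95,
    }
```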
Figure 4 provides a conceptual map of the most pressing and promising research directions for the fusion of computer vision and AI in collaborative robotics. Each branch highlights a distinct challenge or opportunity area that, if addressed, will help enable safe, reliable, and human-centered cobot systems.
The integration of computer vision and AI into collaborative robotics is at a pivotal inflection point. As these systems transition from experimental prototypes to widespread deployment, the coming decade will demand comprehensive rethinking across algorithmic, architectural, and ethical dimensions. Through targeted research in learning efficiency, semantic understanding, trust-building, and real-world applicability, the field is poised to unlock a new era of socially intelligent and contextually aware robotic collaborators.

Author Contributions

Personal contributions: Conceptualization—Y.C.; Writing the first draft—Y.C.; Figures and Tables—A.B.; Introduction and Core Technologies—S.S.; Technological Trends and Challenges—A.B.; Human-Related Challenges and Trends—All authors; Discussion and Conclusions—Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors hereby declare that they used GPT 4.5 (AI language model) throughout the paper to improve the writing style, clarity, and consistency, and to enrich the content. The authors exercised special care to ensure that no copyrights will be violated by this use.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Patil, Y.H.; Patil, R.Y.; Gurale, M.A.; Karati, A. Industry 5.0: Empowering Collaboration through Advanced Technological Approaches. In Intelligent Systems and Industrial Internet of Things for Sustainable Development; Chapman and Hall/CRC: New York, NY, USA, 2024; pp. 1–23. [Google Scholar]
  2. George, A.S.; George, A.H. The cobot chronicles: Evaluating the emergence, evolution, and impact of collaborative robots in next-generation manufacturing. Partn. Univers. Int. Res. J. 2023, 2, 89–116. [Google Scholar]
  3. Palanisamy, C.; Perumal, L.; Chin, C.W. A comprehensive review of collaborative robotics in manufacturing. Eng. Technol. Appl. Sci. Res. 2025, 15, 21970–21975. [Google Scholar] [CrossRef]
  4. Rahman, M.M.; Khatun, F.; Jahan, I.; Devnath, R.; Bhuiyan, M.A.A. Cobotics: The Evolving Roles and Prospects of Next-Generation Collaborative Robots in Industry 5.0. J. Robot. 2024, 2024, 2918089. [Google Scholar] [CrossRef]
  5. Weidemann, C.; Mandischer, N.; van Kerkom, F.; Corves, B.; Hüsing, M.; Kraus, T.; Garus, C. Literature review on recent trends and perspectives of collaborative robotics in work 4.0. Robotics 2023, 12, 84. [Google Scholar] [CrossRef]
  6. De Magistris, G.; Caprari, R.; Castro, G.; Russo, S.; Iocchi, L.; Nardi, D.; Napoli, C. Vision-based holistic scene understanding for context-aware human-robot interaction. In International Conference of the Italian Association for Artificial Intelligence; Springer International Publishing: Cham, Switzerland, 2021; pp. 310–325. [Google Scholar]
  7. Borboni, A.; Reddy, K.V.V.; Elamvazuthi, I.; AL-Quraishi, M.S.; Natarajan, E.; Ali, S.S.A. The expanding role of artificial intelligence in collaborative robots for industrial applications: A systematic review of recent works. Machines 2023, 11, 111. [Google Scholar] [CrossRef]
  8. Scheutz, C.; Law, T.; Scheutz, M. Envirobots: How human–robot interaction can facilitate sustainable behavior. Sustainability 2021, 13, 12283. [Google Scholar] [CrossRef]
  9. Buyukgoz, S.; Grosinger, J.; Chetouani, M.; Saffiotti, A. Two ways to make your robot proactive: Reasoning about human intentions or reasoning about possible futures. Front. Robot. AI 2022, 9, 929267. [Google Scholar] [CrossRef] [PubMed]
  10. Semeraro, F.; Griffiths, A.; Cangelosi, A. Human–robot collaboration and machine learning: A systematic review of recent research. Robot. Comput.-Integr. Manuf. 2023, 79, 102432. [Google Scholar] [CrossRef]
  11. Mendez, E.; Ochoa, O.; Olivera-Guzman, D.; Soto-Herrera, V.H.; Luna-Sánchez, J.A.; Lucas-Dophe, C.; Lugo-del-Real, E.; Ayala-Garcia, I.N.; Alvarado Perez, M.; González, A. Integration of deep learning and collaborative robot for assembly tasks. Appl. Sci. 2024, 14, 839. [Google Scholar] [CrossRef]
  12. Riedelbauch, D.; Höllerich, N.; Henrich, D. Benchmarking teamwork of humans and cobots—An overview of metrics, strategies, and tasks. IEEE Access 2023, 11, 43648–43674. [Google Scholar] [CrossRef]
  13. Addula, S.R.; Tyagi, A.K. Future of Computer Vision and Industrial Robotics in Smart Manufacturing. Artif. Intell.-Enabled Digit. Twin Smart Manuf. 2024, 22, 505–539. [Google Scholar]
  14. Soori, M.; Dastres, R.; Arezoo, B.; Jough, F.K.G. Intelligent robotic systems in Industry 4.0: A review. J. Adv. Manuf. Sci. Technol. 2024, 4, 2024007. [Google Scholar] [CrossRef]
  15. Cohen, Y.; Shoval, S.; Faccio, M.; Minto, R. Deploying cobots in collaborative systems: Major considerations and productivity analysis. Int. J. Prod. Res. 2022, 60, 1815–1831. [Google Scholar] [CrossRef]
  16. Faccio, M.; Cohen, Y. Intelligent cobot systems: Human-cobot collaboration in manufacturing. J. Intell. Manuf. 2024, 35, 1905–1907. [Google Scholar] [CrossRef]
  17. Cohen, Y.; Faccio, M.; Rozenes, S. Vocal Communication Between Cobots and Humans to Enhance Productivity and Safety: Review and Discussion. Appl. Sci. 2025, 15, 726. [Google Scholar] [CrossRef]
  18. Cohen, Y.; Gal, H.C.B. Digital, Technological and AI Skills for Smart Production Work Environment. IFAC-Pap. 2024, 58, 545–550. [Google Scholar] [CrossRef]
  19. D’Avella, S.; Avizzano, C.A.; Tripicchio, P. ROS-Industrial based robotic cell for Industry 4.0: Eye-in-hand stereo camera and visual servoing for flexible, fast, and accurate picking and hooking in the production line. Robot. Comput.-Integr. Manuf. 2023, 80, 102453. [Google Scholar] [CrossRef]
  20. Wang, J.; Li, L.; Xu, P. Visual sensing and depth perception for welding robots and their industrial applications. Sensors 2023, 23, 9700. [Google Scholar] [CrossRef] [PubMed]
  21. Malhan, R.; Jomy Joseph, R.; Bhatt, P.M.; Shah, B.; Gupta, S.K. Algorithms for improving speed and accuracy of automated three-dimensional reconstruction with a depth camera mounted on an industrial robot. J. Comput. Inf. Sci. Eng. 2022, 22, 031012. [Google Scholar] [CrossRef]
  22. Thakur, U.; Singh, S.K.; Kumar, S.; Singh, A.; Arya, V.; Chui, K.T. Multi-Modal Sensor Fusion With CRNNs for Robust Object Detection and Simultaneous Localization and Mapping (SLAM) in Agile Industrial Drones. In AI Developments for Industrial Robotics and Intelligent Drones; IGI Global Scientific Publishing: New York, NY, USA, 2025; pp. 285–304. [Google Scholar]
  23. Raj, R.; Kos, A. An Extensive Study of Convolutional Neural Networks: Applications in Computer Vision for Improved Robotics Perceptions. Sensors 2025, 25, 1033. [Google Scholar] [CrossRef] [PubMed]
  24. Fan, J.; Zheng, P.; Li, S. Vision-based holistic scene understanding towards proactive human–robot collaboration. Robot. Comput.-Integr. Manuf. 2022, 75, 102304. [Google Scholar] [CrossRef]
  25. Sado, F.; Loo, C.K.; Liew, W.S.; Kerzel, M.; Wermter, S. Explainable goal-driven agents and robots-a comprehensive review. ACM Comput. Surv. 2023, 55, 1–41. [Google Scholar] [CrossRef]
  26. Milani, S.; Topin, N.; Veloso, M.; Fang, F. Explainable reinforcement learning: A survey and comparative review. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
  27. Brawer, J. Fusing Symbolic and Subsymbolic Approaches for Natural and Effective Human-Robot Collaboration. Ph.D. Dissertation, Yale University, New Haven, CT, USA, 2023. [Google Scholar]
  28. Baduge, S.K.; Thilakarathna, S.; Perera, J.S.; Arashpour, M.; Sharafi, P.; Teodosio, B.; Shringi, A.; Mendis, P. Artificial intelligence and smart vision for building and construction 4.0: Machine and deep learning methods and applications. Autom. Constr. 2022, 141, 104440. [Google Scholar] [CrossRef]
  29. Chen, H.; Li, S.; Fan, J.; Duan, A.; Yang, C.; Navarro-Alarcon, D.; Zheng, P. Human-in-the-Loop Robot Learning for Smart Manufacturing: A Human-Centric Perspective. IEEE Trans. Autom. Sci. Eng. 2025, 22, 11062–11086. [Google Scholar] [CrossRef]
  30. Mahajan, H.B.; Uke, N.; Pise, P.; Shahade, M.; Dixit, V.G.; Bhavsar, S.; Deshpande, S.D. Automatic robot manoeuvres detection using computer vision and deep learning techniques: A perspective of Internet of Robotics Things (IoRT). Multimed. Tools Appl. 2023, 82, 23251–23276. [Google Scholar] [CrossRef]
  31. Manakitsa, N.; Maraslidis, G.S.; Moysis, L.; Fragulis, G.F. A review of machine learning and deep learning for object detection, semantic segmentation, and human action recognition in machine and robotic vision. Technologies 2024, 12, 15. [Google Scholar] [CrossRef]
  32. Adebayo, R.A.; Obiuto, N.C.; Olajiga, O.K.; Festus-Ikhuoria, I.C. AI-enhanced manufacturing robotics: A review of applications and trends. World J. Adv. Res. Rev. 2024, 21, 2060–2072. [Google Scholar] [CrossRef]
  33. Angulo, C.; Chacón, A.; Ponsa, P. Towards a cognitive assistant supporting human operators in the Artificial Intelligence of Things. Internet Things 2023, 21, 100673. [Google Scholar] [CrossRef]
  34. Jiang, N.; Liu, X.; Liu, H.; Lim, E.T.K.; Tan, C.W.; Gu, J. Beyond AI-powered context-aware services: The role of human–AI collaboration. Ind. Manag. Data Syst. 2023, 123, 2771–2802. [Google Scholar] [CrossRef]
  35. Gallagher, J.E.; Oughton, E.J. Surveying You Only Look Once (YOLO) Multispectral Object Detection Advancements, Applications And Challenges. IEEE Access 2025, 13, 7366–7395. [Google Scholar] [CrossRef]
  36. Aboyomi, D.D.; Daniel, C. A Comparative Analysis of Modern Object Detection Algorithms: YOLO vs. SSD vs. Faster R-CNN. ITEJ Inf. Technol. Eng. J. 2023, 8, 96–106. [Google Scholar] [CrossRef]
  37. Amjoud, A.B.; Amrouch, M. Object detection using deep learning, CNNs and vision transformers: A review. IEEE Access 2023, 11, 35479–35516. [Google Scholar] [CrossRef]
  38. Shah, S.; Tembhurne, J. Object detection using convolutional neural networks and transformer-based models: A review. J. Electr. Syst. Inf. Technol. 2023, 10, 54. [Google Scholar] [CrossRef]
  39. Liu, S.; Yao, S.; Fu, X.; Shao, H.; Tabish, R.; Yu, S.; Yun, H.; Sha, L.; Abdelzaher, T.; Bansal, A.; et al. Real-time task scheduling for machine perception in intelligent cyber-physical systems. IEEE Trans. Comput. 2021, 71, 1770–1783. [Google Scholar] [CrossRef]
  40. Hussain, M.; Ali, N.; Hong, J.E. Vision beyond the field-of-view: A collaborative perception system to improve safety of intelligent cyber-physical systems. Sensors 2022, 22, 6610. [Google Scholar] [CrossRef] [PubMed]
  41. Yang, B.; Li, J.; Zeng, T. A Review of Environmental Perception Technology Based on Multi-Sensor Information Fusion in Autonomous Driving. World Electr. Veh. J. 2025, 16, 20. [Google Scholar] [CrossRef]
  42. Duan, J.; Zhuang, L.; Zhang, Q.; Zhou, Y.; Qin, J. Multimodal perception-fusion-control and human–robot collaboration in manufacturing: A review. Int. J. Adv. Manuf. Technol. 2024, 132, 1071–1093. [Google Scholar] [CrossRef]
  43. Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep learning-based human pose estimation: A survey. ACM Comput. Surv. 2023, 56, 1–37. [Google Scholar] [CrossRef]
  44. Dubey, S.; Dixit, M. A comprehensive survey on human pose estimation approaches. Multimed. Syst. 2023, 29, 167–195. [Google Scholar] [CrossRef]
  45. Wang, T.; Liu, Z.; Wang, L.; Li, M.; Wang, X.V. Data-efficient multimodal human action recognition for proactive human–robot collaborative assembly: A cross-domain few-shot learning approach. Robot. Comput.-Integr. Manuf. 2024, 89, 102785. [Google Scholar] [CrossRef]
  46. Kwon, H.; Wang, B.; Abowd, G.D.; Plötz, T. Approaching the real-world: Supporting activity recognition training with virtual imu data. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2021, 5, 1–32. [Google Scholar] [CrossRef]
  47. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose++: Vision transformer for generic body pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1212–1230. [Google Scholar] [CrossRef] [PubMed]
  48. Papanagiotou, D.; Senteri, G.; Manitsaris, S. Egocentric gesture recognition using 3D convolutional neural networks for the spatiotemporal adaptation of collaborative robots. Front. Neurorobot. 2021, 15, 703545. [Google Scholar] [CrossRef] [PubMed]
  49. Matin, A.; Islam, M.R.; Wang, X.; Huo, H. Robust Multimodal Approach for Assembly Action Recognition. Procedia Comput. Sci. 2024, 246, 4916–4925. [Google Scholar] [CrossRef]
  50. Delitzas, A.; Takmaz, A.; Tombari, F.; Sumner, R.; Pollefeys, M.; Engelmann, F. SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14531–14542. [Google Scholar]
  51. Hoque, S.; Arafat, M.Y.; Xu, S.; Maiti, A.; Wei, Y. A comprehensive review on 3D object detection and 6D pose estimation with deep learning. IEEE Access 2021, 9, 143746–143770. [Google Scholar] [CrossRef]
  52. Yarovoi, A.; Cho, Y.K. Review of simultaneous localization and mapping (SLAM) for construction robotics applications. Autom. Constr. 2024, 162, 105344. [Google Scholar] [CrossRef]
  53. Zheng, C.; Du, Y.; Xiao, J.; Sun, T.; Wang, Z.; Eynard, B.; Zhang, Y. Semantic map construction approach for human-robot collaborative manufacturing. Robot. Comput.-Integr. Manuf. 2025, 91, 102845. [Google Scholar] [CrossRef]
  54. Zhang, Y.; Wu, Y.; Tong, K.; Chen, H.; Yuan, Y. Review of visual simultaneous localization and mapping based on deep learning. Remote Sens. 2023, 15, 2740. [Google Scholar] [CrossRef]
  55. Pu, H.; Luo, J.; Wang, G.; Huang, T.; Liu, H. Visual SLAM integration with semantic segmentation and deep learning: A review. IEEE Sens. J. 2023, 23, 22119–22138. [Google Scholar] [CrossRef]
  56. Merveille, F.F.R.; Jia, B.; Xu, Z.; Fred, B. Enhancing Underwater SLAM Navigation and Perception: A Comprehensive Review of Deep Learning Integration. Sensors 2024, 24, 7034. [Google Scholar] [CrossRef] [PubMed]
  57. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  58. Servières, M.; Renaudin, V.; Dupuis, A.; Antigny, N. Visual and Visual-Inertial SLAM: State of the Art, Classification, and Experimental Benchmarking. J. Sens. 2021, 2021, 2054828. [Google Scholar] [CrossRef]
  59. Schmidt, F.; Blessing, C.; Enzweiler, M.; Valada, A. Visual-Inertial SLAM for Unstructured Outdoor Environments: Benchmarking the Benefits and Computational Costs of Loop Closing. J. Field Robot. 2025, 1–22. [Google Scholar] [CrossRef]
  60. Tao, Y.; Liu, X.; Spasojevic, I.; Agarwal, S.; Kumar, V. 3d active metric-semantic slam. IEEE Robot. Autom. Lett. 2024, 9, 2989–2996. [Google Scholar] [CrossRef]
  61. Rahman, S.; DiPietro, R.; Kedarisetti, D.; Kulathumani, V. Large-scale Indoor Mapping with Failure Detection and Recovery in SLAM. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 12294–12301. [Google Scholar]
  62. Peng, H.; Zhao, Z.; Wang, L. A review of dynamic object filtering in SLAM based on 3D LiDAR. Sensors 2024, 24, 645. [Google Scholar] [CrossRef] [PubMed]
  63. Arshad, S.; Kim, G.W. Role of deep learning in loop closure detection for visual and lidar slam: A survey. Sensors 2021, 21, 1243. [Google Scholar] [CrossRef] [PubMed]
  64. Ebadi, K.; Palieri, M.; Wood, S.; Padgett, C.; Agha-mohammadi, A.A. DARE-SLAM: Degeneracy-aware and resilient loop closing in perceptually-degraded environments. J. Intell. Robot. Syst. 2021, 102, 1–25. [Google Scholar] [CrossRef]
  65. Ni, J.; Chen, Y.; Tang, G.; Shi, J.; Cao, W.; Shi, P. Deep learning-based scene understanding for autonomous robots: A survey. Intell. Robot. 2023, 3, 374–401. [Google Scholar] [CrossRef]
  66. Farkh, R.; Alhuwaimel, S.; Alzahrani, S.; Al Jaloud, K.; Quasim, M.T. Deep Learning Control for Autonomous Robot. Comput. Mater. Contin. 2022, 72. [Google Scholar] [CrossRef]
  67. Firoozi, R.; Tucker, J.; Tian, S.; Majumdar, A.; Sun, J.; Liu, W.; Zhu, Y.; Song, S.; Kapoor, A.; Hausman, K.; et al. Foundation models in robotics: Applications, challenges, and the future. Int. J. Robot. Res. 2024, 44, 701–739. [Google Scholar] [CrossRef]
  68. Huang, C.I.; Huang, Y.Y.; Liu, J.X.; Ko, Y.T.; Wang, H.C.; Chiang, K.H.; Yu, L.F. Fed-HANet: Federated visual grasping learning for human robot handovers. IEEE Robot. Autom. Lett. 2023, 8, 3772–3779. [Google Scholar] [CrossRef]
  69. Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A review of convolutional neural networks in computer vision. Artif. Intell. Rev. 2024, 57, 99. [Google Scholar] [CrossRef]
  70. Wu, F.; Wu, J.; Kong, Y.; Yang, C.; Yang, G.; Shu, H.; Carrault, G.; Senhadji, L. Multiscale low-frequency memory network for improved feature extraction in convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, WC, Canada, 20–27 February 2024; Volume 38, pp. 5967–5975. [Google Scholar]
  71. Ma, R.; Liu, Y.; Graf, E.W.; Oyekan, J. Applying vision-guided graph neural networks for adaptive task planning in dynamic human robot collaborative scenarios. Adv. Robot. 2024, 38, 1690–1709. [Google Scholar] [CrossRef]
  72. Ding, P.; Zhang, J.; Zhang, P.; Lv, Y. A Spatial-Temporal Graph Neural Network with Hawkes Process for Temporal Hypergraph Reasoning towards Robotic Decision-Making in Proactive Human-Robot Collaboration. In Proceedings of the 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE), Bari, Italy, 28 August–1 September 2024; pp. 514–519. [Google Scholar]
  73. Ding, P.; Zhang, J.; Zhang, P.; Lv, Y.; Wang, D. A stacked graph neural network with self-exciting process for robotic cognitive strategy reasoning in proactive human-robot collaborative assembly. Adv. Eng. Inform. 2025, 63, 102957. [Google Scholar] [CrossRef]
  74. Ding, P.; Zhang, J.; Zheng, P.; Fei, B.; Xu, Z. Dynamic scenario-enhanced diverse human motion prediction network for proactive human–robot collaboration in customized assembly tasks. J. Intell. Manuf. 2024. [Google Scholar] [CrossRef]
  75. Hou, W.; Xiong, Z.; Yue, M.; Chen, H. Human-robot collaborative assembly task planning for mobile cobots based on deep reinforcement learning. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2024, 238, 11097–11114. [Google Scholar] [CrossRef]
  76. Qiu, Y.; Jin, Y.; Yu, L.; Wang, J.; Wang, Y.; Zhang, X. Improving sample efficiency of multiagent reinforcement learning with nonexpert policy for flocking control. IEEE Internet Things J. 2023, 10, 14014–14027. [Google Scholar] [CrossRef]
  77. Salvato, E.; Fenu, G.; Medvet, E.; Pellegrino, F.A. Crossing the reality gap: A survey on sim-to-real transferability of robot controllers in reinforcement learning. IEEE Access 2021, 9, 153171–153187. [Google Scholar] [CrossRef]
  78. Ju, H.; Juan, R.; Gomez, R.; Nakamura, K.; Li, G. Transferring policy of deep reinforcement learning from simulation to reality for robotics. Nat. Mach. Intell. 2022, 4, 1077–1087. [Google Scholar] [CrossRef]
  79. Zhu, X.; Zheng, X.; Zhang, Q.; Chen, Z.; Liu, Y.; Liang, B. Sim-to-real transfer with action mapping and state prediction for robot motion control. In Proceedings of the 2021 6th Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), Tokyo, Japan, 16–18 July 2021; pp. 1–6. [Google Scholar]
  80. Amirnia, A.; Keivanpour, S. Real-time sustainable cobotic disassembly planning using fuzzy reinforcement learning. Int. J. Prod. Res. 2025, 63, 3798–3821. [Google Scholar] [CrossRef]
  81. Langås, E.F.; Zafar, M.H.; Sanfilippo, F. Exploring the synergy of human-robot teaming, digital twins, and machine learning in Industry 5.0: A step towards sustainable manufacturing. J. Intell. Manuf. 2025, 1–24. [Google Scholar] [CrossRef]
  82. Xu, Y.; Bao, R.; Zhang, L.; Wang, J.; Wang, S. Embodied intelligence in RO/RO logistic terminal: Autonomous intelligent transportation robot architecture. Sci. China Inf. Sci. 2025, 68, 1–17. [Google Scholar] [CrossRef]
  83. Laukaitis, A.; Šareiko, A.; Mažeika, D. Facilitating Robot Learning in Virtual Environments: A Deep Reinforcement Learning Framework. Appl. Sci. 2025, 15, 5016. [Google Scholar] [CrossRef]
  84. Li, C.; Zheng, P.; Zhou, P.; Yin, Y.; Lee, C.K.; Wang, L. Unleashing mixed-reality capability in Deep Reinforcement Learning-based robot motion generation towards safe human–robot collaboration. J. Manuf. Syst. 2024, 74, 411–421. [Google Scholar] [CrossRef]
  85. Gonzalez-Santocildes, A.; Vazquez, J.I.; Eguiluz, A. Adaptive Robot Behavior Based on Human Comfort Using Reinforcement Learning. IEEE Access 2024, 12, 122289–122299. [Google Scholar] [CrossRef]
  86. Walker, J.C.; Vértes, E.; Li, Y.; Dulac-Arnold, G.; Anand, A.; Weber, T.; Hamrick, J.B. Investigating the role of model-based learning in exploration and transfer. In Proceedings of the ICML’23: 40th International Conference on Machine Learning, Honolulu, HA, USA, 23–29 July 2023; pp. 35368–35383. [Google Scholar]
  87. Thalpage, N. Unlocking the black box: Explainable artificial intelligence (XAI) for trust and transparency in ai systems. J. Digit. Art Humanit 2023, 4, 31–36. [Google Scholar] [CrossRef] [PubMed]
  88. Gunning, D.; Aha, D. DARPA’s explainable artificial intelligence (XAI) program. AI Mag. 2019, 40, 44–58. [Google Scholar]
  89. Saraswat, D.; Bhattacharya, P.; Verma, A.; Prasad, V.K.; Tanwar, S.; Sharma, G.; Bokoro, P.N.; Sharma, R. Explainable AI for healthcare 5.0: Opportunities and challenges. IEEE Access 2022, 10, 84486–84517. [Google Scholar] [CrossRef]
  90. Oviedo, J.; Rodriguez, M.; Trenta, A.; Cannas, D.; Natale, D.; Piattini, M. ISO/IEC quality standards for AI engineering. Comput. Sci. Rev. 2024, 54, 100681. [Google Scholar] [CrossRef]
  91. Lewis, D.; Hogan, L.; Filip, D.; Wall, P.J. Global challenges in the standardization of ethics for trustworthy AI. J. ICT Stand. 2020, 8, 123–150. [Google Scholar] [CrossRef]
  92. Ali, J.A.H.; Lezoche, M.; Panetto, H.; Naudet, Y.; Gaffinet, B. Cognitive architecture for cognitive cyber-physical systems. IFAC-Pap. 2024, 58, 1180–1185. [Google Scholar]
  93. Ogunsina, M.; Efunniyi, C.P.; Osundare, O.S.; Folorunsho, S.O.; Akwawa, L.A. Cognitive architectures for autonomous robots: Towards human-level autonomy and beyond. Int. J. Frontline Res. Eng. Technol. 2024, 2, 41–50. [Google Scholar] [CrossRef]
  94. Gurney, N.; Pynadath, D.V. Robots with Theory of Mind for Humans: A Survey. In Proceedings of the 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Napoli, Italy, 29 August–1 September 2022. [Google Scholar]
  95. Taesi, C.; Aggogeri, F.; Pellegrini, N. COBOT applications—Recent advances and challenges. Robotics 2023, 12, 79. [Google Scholar] [CrossRef]
  96. Liu, Y.; Caldwell, G.; Rittenbruch, M.; Belek Fialho Teixeira, M.; Burden, A.; Guertler, M. What affects human decision making in human–robot collaboration?: A scoping review. Robotics 2024, 13, 30. [Google Scholar] [CrossRef]
  97. Sun, J.; Mao, P.; Kong, L.; Wang, J. A Review of Embodied Grasping. Sensors 2025, 25, 852. [Google Scholar] [CrossRef] [PubMed]
  98. Karbouj, B.; Al Rashwany, K.; Alshamaa, O.; Krüger, J. Adaptive Behavior of Collaborative Robots: Review and Investigation of Human Predictive Ability. Procedia CIRP 2024, 130, 952–958. [Google Scholar] [CrossRef]
  99. Ebert, N.; Mangat, P.; Wasenmuller, O. Multitask network for joint object detection, semantic segmentation and human pose estimation in vehicle occupancy monitoring. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 4–9 June 2022; pp. 637–643. [Google Scholar]
  100. Yalcinkaya, B.; Couceiro, M.S.; Pina, L.; Soares, S.; Valente, A.; Remondino, F. Towards Enhanced Human Activity Recognition for Real-World Human-Robot Collaboration. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 7909–7915. [Google Scholar]
  101. Carissoli, C.; Negri, L.; Bassi, M.; Storm, F.A.; Delle Fave, A. Mental workload and human-robot interaction in collaborative tasks: A scoping review. Int. J. Hum.-Comput. Interact. 2024, 40, 6458–6477. [Google Scholar] [CrossRef]
  102. Huang, S.; Chen, Z.; Zhang, Y. An Algorithm for Standing Long Jump Distance Measurement Based on Improved YOLOv11 and Lightweight Pose Estimation. In Proceedings of the 2025 4th International Symposium on Computer Applications and Information Technology (ISCAIT), Xi’an, China, 21–23 March 2025; pp. 914–918. [Google Scholar]
  103. Salimpour, S.; Peña-Queralta, J.; Paez-Granados, D.; Heikkonen, J.; Westerlund, T. Sim-to-Real Transfer for Mobile Robots with Reinforcement Learning: From NVIDIA Isaac Sim to Gazebo and Real ROS 2 Robots. arXiv 2025, arXiv:2501.02902. [Google Scholar]
  104. Scheikl, P.M.; Tagliabue, E.; Gyenes, B.; Wagner, M.; Dall’Alba, D.; Fiorini, P.; Mathis-Ullrich, F. Sim-to-real transfer for visual reinforcement learning of deformable object manipulation for robot-assisted surgery. IEEE Robot. Autom. Lett. 2022, 8, 560–567. [Google Scholar] [CrossRef]
  105. Véronneau, C.; Denis, J.; Lhommeau, P.; St-Jean, A.; Girard, A.; Plante, J.S.; Bigué, J.P.L. Modular magnetorheological actuator with high torque density and transparency for the collaborative robot industry. IEEE Robot. Autom. Lett. 2022, 8, 896–903. [Google Scholar] [CrossRef]
  106. Feng, H.; Zhang, J.; Kang, L. Key Technologies of Cobots with High Payload-Reach to Weight Ratio: A Review. In International Conference on Social Robotics; Springer Nature: Singapore, 2024; pp. 29–40. [Google Scholar]
  107. Rojas, R.A.; Garcia, M.A.R.; Gualtieri, L.; Rauch, E. Combining safety and speed in collaborative assembly systems–An approach to time optimal trajectories for collaborative robots. Procedia CIRP 2021, 97, 308–312. [Google Scholar] [CrossRef]
  108. Guida, R.; Bertolino, A.C.; De Martin, A.; Sorli, M. Comprehensive Analysis of Major Fault-to-Failure Mechanisms in Harmonic Drives. Machines 2024, 12, 776. [Google Scholar] [CrossRef]
  109. Zafar, M.H.; Langås, E.F.; Sanfilippo, F. Exploring the synergies between collaborative robotics, digital twins, augmentation, and industry 5.0 for smart manufacturing: A state-of-the-art review. Robot. Comput.-Integr. Manuf. 2024, 89, 102769. [Google Scholar] [CrossRef]
  110. Hua, H.; Liao, Z.; Wu, X.; Chen, Y.; Feng, C. A back-drivable linear force actuator for adaptive grasping. J. Mech. Sci. Technol. 2022, 36, 4213–4220. [Google Scholar] [CrossRef]
  111. Pantano, M.; Blumberg, A.; Regulin, D.; Hauser, T.; Saenz, J.; Lee, D. Design of a collaborative modular end effector considering human values and safety requirements for industrial use cases. In Human-Friendly Robotics 2021: HFR: 14th International Workshop on Human-Friendly Robotics; Springer International Publishing: Cham, Switzerland, 2022; pp. 45–60. [Google Scholar]
  112. Li, S.; Xu, J. Multi-Axis Force/Torque Sensor Technologies: Design Principles and Robotic Force Control Applications: A Review. IEEE Sens. J. 2024, 25, 4055–4069. [Google Scholar] [CrossRef]
  113. Elfferich, J.F.; Dodou, D.; Della Santina, C. Soft robotic grippers for crop handling or harvesting: A review. IEEE Access 2022, 10, 75428–75443. [Google Scholar] [CrossRef]
  114. Zaidi, S.; Maselli, M.; Laschi, C.; Cianchetti, M. Actuation technologies for soft robot grippers and manipulators: A review. Curr. Robot. Rep. 2021, 2, 355–369. [Google Scholar] [CrossRef]
  115. Fernandez-Vega, M.; Alfaro-Viquez, D.; Zamora-Hernandez, M.; Garcia-Rodriguez, J.; Azorin-Lopez, J. Transforming Robots into Cobots: A Sustainable Approach to Industrial Automation. Electronics 2025, 14, 2275. [Google Scholar] [CrossRef]
  116. Chen, Y.; Zhang, X.; Huang, Y.; Wu, Y.; Ota, J. Kinematics optimization of a novel 7-DOF redundant manipulator. Robot. Auton. Syst. 2023, 163, 104377. [Google Scholar] [CrossRef]
  117. Zheng, P.; Wieber, P.B.; Baber, J.; Aycard, O. Human arm motion prediction for collision avoidance in a shared workspace. Sensors 2022, 22, 6951. [Google Scholar] [CrossRef] [PubMed]
  118. Li, S.; Wang, R.; Zheng, P.; Wang, L. Towards proactive human–robot collaboration: A foreseeable cognitive manufacturing paradigm. J. Manuf. Syst. 2021, 60, 547–552. [Google Scholar] [CrossRef]
  119. Sampieri, A.; di Melendugno, G.M.D.A.; Avogaro, A.; Cunico, F.; Setti, F.; Skenderi, G.; Cristani, M.; Galasso, F. Pose forecasting in industrial human-robot collaboration. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 51–69. [Google Scholar]
  120. Hao, Z.; Zhang, D.; Honarvar Shakibaei Asli, B. Motion Prediction and Object Detection for Image-Based Visual Servoing Systems Using Deep Learning. Electronics 2024, 13, 3487. [Google Scholar] [CrossRef]
  121. Vosniakos, G.C.; Stathas, E. Exploring collaboration of humans with industrial robots using ROS-based simulation. Proc. Manuf. Syst. 2023, 18, 33–38. [Google Scholar]
  122. Freire, I.T.; Guerrero-Rosado, O.; Amil, A.F.; Verschure, P.F. Socially adaptive cognitive architecture for human-robot collaboration in industrial settings. Front. Robot. AI 2024, 11, 1248646. [Google Scholar] [CrossRef] [PubMed]
  123. Ciccarelli, M.; Forlini, M.; Papetti, A.; Palmieri, G.; Germani, M. Advancing human–robot collaboration in handcrafted manufacturing: Cobot-assisted polishing design boosted by virtual reality and human-in-the-loop. Int. J. Adv. Manuf. Technol. 2024, 132, 4489–4504. [Google Scholar] [CrossRef]
  124. Jabrane, K.; Bousmah, M. A new approach for training cobots from small amount of data in industry 5.0. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 634–646. [Google Scholar] [CrossRef]
  125. Ranasinghe, N.; Mohammed, W.M.; Stefanidis, K.; Lastra, J.L.M. Large Language Models in Human-Robot Collaboration with Cognitive Validation Against Context-induced Hallucinations. IEEE Access 2025, 13, 77418–77430. [Google Scholar] [CrossRef]
  126. Trivedi, C.; Bhattacharya, P.; Prasad, V.K.; Patel, V.; Singh, A.; Tanwar, S.; Sharma, R.; Aluvala, S.; Pau, G.; Sharma, G. Explainable AI for Industry 5.0: Vision, architecture, and potential directions. IEEE Open J. Ind. Appl. 2024, 5, 177–208. [Google Scholar] [CrossRef]
  127. Świetlicka, A. A Survey on Artificial Neural Networks in Human-Robot Interaction. Neural Comput. 2025, 37, 1–63. [Google Scholar] [CrossRef] [PubMed]
  128. Moezzi, M. Towards Sample-Efficient Reinforcement Learning Methods for Robotic Manipulation Tasks. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2024. [Google Scholar]
  129. Liu, Y.; Xu, H.; Liu, D.; Wang, L. A digital twin-based sim-to-real transfer for deep reinforcement learning-enabled industrial robot grasping. Robot. Comput.-Integr. Manuf. 2022, 78, 102365. [Google Scholar] [CrossRef]
  130. Trentsios, P.; Wolf, M.; Gerhard, D. Overcoming the sim-to-real gap in autonomous robots. Procedia CIRP 2022, 109, 287–292. [Google Scholar] [CrossRef]
  131. Rothert, J.J.; Lang, S.; Seidel, M.; Hanses, M. Sim-to-Real Transfer for a Robotics Task: Challenges and Lessons Learned. In Proceedings of the 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA), Padova, Italy, 10–13 September 2024; pp. 1–8. [Google Scholar]
  132. Lettera, G.; Costa, D.; Callegari, M. A Hybrid Architecture for Safe Human–Robot Industrial Tasks. Appl. Sci. 2025, 15, 1158. [Google Scholar] [CrossRef]
  133. Tulk Jesso, S.; Greene, C.; Zhang, S.; Booth, A.; DiFabio, M.; Babalola, G.; Adegbemijo, A.; Sarkar, S. On the potential for human-centered, cognitively inspired AI to bridge the gap between optimism and reality for autonomous robotics in healthcare: A respectful critique. In Proceedings of the International Symposium on Human Factors and Ergonomics in Health Care, Chicago, IL, USA, 24–27 March 2024; SAGE Publications; Sage CA: Los Angeles, CA, USA, 2024; Volume 13, pp. 106–112. [Google Scholar]
  134. Swarnkar, N.; Rawal, A.; Patel, G. A paradigm shift for computational excellence from traditional machine learning to modern deep learning-based image steganalysis. In Data Science and Innovations for Intelligent Systems; CRC Press: Boca Raton, FL, USA, 2021; pp. 209–240. [Google Scholar]
  135. Wang, S.; Zhang, J.; Wang, P.; Law, J.; Calinescu, R.; Mihaylova, L. A deep learning-enhanced Digital Twin framework for improving safety and reliability in human–robot collaborative manufacturing. Robot. Comput.-Integr. Manuf. 2024, 85, 102608. [Google Scholar] [CrossRef]
  136. Robinson, N.; Tidd, B.; Campbell, D.; Kulić, D.; Corke, P. Robotic vision for human-robot interaction and collaboration: A survey and systematic review. ACM Trans. Hum.-Robot Interact. 2023, 12, 1–66. [Google Scholar] [CrossRef]
  137. Gadekallu, T.R.; Maddikunta, P.K.R.; Boopathy, P.; Deepa, N.; Chengoden, R.; Victor, N.; Wang, W.; Wang, W.; Zhu, Y.; Dev, K. Xai for industry 5.0-concepts, opportunities, challenges and future directions. IEEE Open J. Commun. Soc. 2024, 6, 2706–2729. [Google Scholar] [CrossRef]
  138. Li, J.; Cai, M.; Xiao, S. Reinforcement learning-based motion planning in partially observable environments under ethical constraints. AI Ethics 2024, 5, 1047–1067. [Google Scholar] [CrossRef]
  139. Hostettler, D.; Mayer, S.; Albert, J.L.; Jenss, K.E.; Hildebrand, C. Real-time adaptive industrial robots: Improving safety and comfort in human-robot collaboration. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–16. [Google Scholar]
  140. Conlon, N.J. Robot Competency Self-Assessments to Improve Human Decision-Making in Uncertain Environments. Ph.D. Dissertation, University of Colorado at Boulder, Boulder, CO, USA, 2024. [Google Scholar]
  141. Kluy, L.; Roesler, E. Working with industrial cobots: The influence of reliability and transparency on perception and trust. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting; SAGE Publications: Los Angeles, CA, USA, 2021; Volume 65, pp. 77–81. [Google Scholar]
  142. Pinto, A.; Duarte, I.; Carvalho, C.; Rocha, L.; Santos, J. Enhancing cobot design through user experience goals: An investigation of human–robot collaboration in picking tasks. Hum. Behav. Emerg. Technol. 2024, 2024, 7058933. [Google Scholar] [CrossRef]
  143. Hancock, P.A.; Billings, D.R.; Schaefer, K.E.; Chen, J.Y.C.; De Visser, E.J.; Parasuraman, R. A meta-analysis of factors affecting trust in human-robot interaction. Hum. Factors 2011, 53, 517–527. [Google Scholar] [CrossRef] [PubMed]
  144. Desai, M.; Stubbs, K.; Steinfeld, A.; Yanco, H.A. Creating trustworthy robots: Lessons and inspirations from automated systems. In Proceedings of the 2013 ACM/IEEE International Conference on Human-Robot Interaction, Tokyo, Japan, 3–6 March 2013; pp. 409–416. [Google Scholar] [CrossRef]
  145. Robinette, P.; Howard, A.M.; Wagner, A.R. Timing is key for robot trust repair. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, Portland, OR, USA, 2–5 March 2015; pp. 205–212. [Google Scholar]
  146. Gervasi, R.; Barravecchia, F.; Mastrogiacomo, L.; Franceschini, F. Applications of affective computing in human-robot interaction: State-of-art and challenges for manufacturing. Proc. Inst. Mech. Eng. Part B J. Eng. Manuf. 2023, 237, 815–832. [Google Scholar] [CrossRef]
  147. Toaiari, A.; Murino, V.; Cristani, M.; Beyan, C. Upper-Body pose-based gaze estimation for privacy-preserving 3D gaze target detection. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2025; pp. 359–376. [Google Scholar]
  148. Pourmirzaei, M.; Montazer, G.A.; Mousavi, E. ATTENDEE: An AffecTive Tutoring system based on facial EmotioN recognition and heaD posE Estimation to personalize e-learning environment. J. Comput. Educ. 2025, 12, 65–92. [Google Scholar] [CrossRef]
  149. Tsumura, T.; Yamada, S. Making a human’s trust repair for an agent in a series of tasks through the agent’s empathic behavior. Front. Comput. Sci. 2024, 6, 1461131. [Google Scholar] [CrossRef]
  150. Esterwood, C.; Robert, L.P. Repairing Trust in Robots?: A Meta-analysis of HRI Trust Repair Studies with A No-Repair Condition. In Proceedings of the 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Melbourne, Australia, 4–6 March 2025; pp. 410–419. [Google Scholar]
  151. Wong, S.W.; Crowe, P. Cognitive ergonomics and robotic surgery. J. Robot. Surg. 2024, 18, 110. [Google Scholar] [CrossRef] [PubMed]
  152. Min, Z.; Lai, J.; Ren, H. Innovating robot-assisted surgery through large vision models. Nat. Rev. Electr. Eng. 2025, 2, 350–363. [Google Scholar] [CrossRef]
  153. Chen, H.; Alghowinem, S.; Breazeal, C.; Park, H.W. Integrating flow theory and adaptive robot roles: A conceptual model of dynamic robot role adaptation for the enhanced flow experience in long-term multi-person human-robot interactions. In Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Boulder, CO, USA, 11–15 March 2024; pp. 116–126. [Google Scholar]
  154. Pelikan, H.; Hofstetter, E. Managing delays in human-robot interaction. ACM Trans. Comput.-Hum. Interact. 2023, 30, 1–42. [Google Scholar] [CrossRef]
  155. Tuncer, S.; Gillet, S.; Leite, I. Robot-mediated inclusive processes in groups of children: From gaze aversion to mutual smiling gaze. Front. Robot. AI 2022, 9, 729146. [Google Scholar] [CrossRef] [PubMed]
  156. Ricciardi Celsi, L.; Zomaya, A.Y. Perspectives on Managing AI Ethics in the Digital Age. Information 2025, 16, 318. [Google Scholar] [CrossRef]
  157. Bourgais, A.; Ibnouhsein, I. Ethics-by-design: The next frontier of industrialization. AI Ethics 2022, 2, 317–324. [Google Scholar] [CrossRef]
  158. Kolvig-Raun, E.S.; Hviid, J.; Kjærgaard, M.B.; Brorsen, R.; Jacob, P. Balancing Cobot Productivity and Longevity Through Pre-Runtime Developer Feedback. IEEE Robot. Autom. Lett. 2024, 10, 1617–1624. [Google Scholar] [CrossRef]
  159. Zia, A.; Haleem, M. Bridging Research Gaps in Industry 5.0: Synergizing Federated Learning, Collaborative Robotics, and Autonomous Systems for Enhanced Operational Efficiency and Sustainability. IEEE Access 2025, 13, 40456–40479. [Google Scholar] [CrossRef]
  160. Ramírez, T.; Mora, H.; Pujol, F.A.; Maciá-Lillo, A.; Jimeno-Morenilla, A. Management of heterogeneous AI-based industrial environments by means of federated adaptive-robot learning. Eur. J. Innov. Manag. 2025, 28, 50–64. [Google Scholar] [CrossRef]
  161. Govi, E.; Sapienza, D.; Toscani, S.; Cotti, I.; Franchini, G.; Bertogna, M. Addressing challenges in industrial pick and place: A deep learning-based 6 Degrees-of-Freedom pose estimation solution. Comput. Ind. 2024, 161, 104130. [Google Scholar] [CrossRef]
  162. Pan, Z.; Zhuang, B.; Liu, J.; He, H.; Cai, J. Scalable vision transformers with hierarchical pooling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 377–386. [Google Scholar]
  163. Liu, R.; Wang, L.; Yu, Z.; Zhang, H.; Liu, X.; Sun, B.; Huo, X.; Zhang, J. SCTNet-NAS: Efficient semantic segmentation via neural architecture search for cloud-edge collaborative perception. Complex Intell. Syst. 2025, 11, 365. [Google Scholar] [CrossRef]
  164. Chen, Z.; Hu, B.; Chen, Z.; Zhang, J. Progress and Thinking on Self-Supervised Learning Methods in Computer Vision: A Review. IEEE Sens. J. 2024, 24, 29524–29544. [Google Scholar] [CrossRef]
  165. Shaw, A. Self-Supervised Learning For Robust Robotic Grasping In Dynamic Environment. arXiv 2024, arXiv:2410.11229. [Google Scholar] [CrossRef]
Figure 1. Core vision-based technologies enabling collaborative robotics.
Figure 2. Overview of a deep learning pipeline for cobot perception and action—from input sensors to action commands via a neural policy architecture.
Figure 3. Integration of real-time adaptation and socio-cognitive interaction in vision-based collaborative robotics.
Figure 4. Summary of future research directions.
Table 1. Comparative performance of object detectors in collaborative robot applications.

Detector | Accuracy (mAP) | Inference Speed | Real-Time Suitability | Strengths | Limitations
YOLOv6-N/S/S-Quant | 35.9–43.5% | 1234–495 FPS (T4 GPU) | Very High | Very fast, ideal for edge deployment | Moderate accuracy, dataset-dependent
YOLOv7 | ~56.8% @30 FPS (V100 GPU) | 30–160 FPS | High | Best-in-class real-time balance | GPU-dependent, less mobile-friendly
Faster R-CNN (with VGG16) | ~55% mAP | ~5 FPS (GPU) | Moderate | High precision, good for small objects | Too slow for real-time use with light hardware
AT-LI-YOLO (home service robots) | +3.2% over YOLOv3 (≈40%+ mAP) | ~34 FPS (29 ms) | High | Great for small/blurred/occluded indoor objects | Still task-specific, needs deblurring add-ons
YOLOv3 vs. YOLOv4 (robot arm) | YOLOv4 better than YOLOv3 (no data) | YOLOv4 has best speed | High | YOLOv4 has faster speed + accuracy over v3 | Context-specific performance
Table 2. Key technologies in vision-based collaborative robotics.

Technology | Purpose | Representative Methods | Application Examples
Object Detection [36,37,38,39] | Identify and locate task-relevant objects | YOLOv8, Faster R-CNN, DETR | Tool identification, obstacle recognition
Human Pose Estimation [43,44,45,46,47,48,49] | Capture human body landmarks and movement | OpenPose, HRNet, Vision Transformers | Gesture recognition, safety monitoring
Scene Understanding [6,24,50,51] | Semantic labeling and spatial reasoning | DeepLabv3+, DenseFusion, Semantic SLAM | Workspace mapping, affordance detection
Visual SLAM [52,53,54,55,56,63,64] | Localization and environment reconstruction | ORB-SLAM3, DeepFactors, SemanticFusion | Navigation, shared workspace adaptation
Table 3. Summary of AI-driven decision-making and autonomy methods.

Method | Description | Key Advantages | Example Applications
Deep Learning Architectures [65,66,67,68,69,70,71,72,73,74] | End-to-end models for perception and planning | High accuracy; handles complex scenes | Visual part detection; semantic mapping
Reinforcement Learning (RL) [75,76,77,78,79,80,81,82,83,84,85,86] | Learning adaptive policies via trial-and-error | Handles dynamic, changing tasks | Adaptive path planning; force-sensitive assembly
Explainable AI (XAI) and Ethical Autonomy [87,88,89,90] | Interpretable, safe decision-making | Improves trust and safety | Visual feedback; regulatory compliance
Cognitive Architectures [91,92,93,94] | Human-like integration of perception and planning | Enables proactive, intuitive interaction | Intention prediction; shared task handover
Table 4. Summary of technological trends and challenges.

Topic | Description | Key Challenges/Notes
General Trends [95,96,97,98] | Advances in CV–AI fusion for perception, reasoning, and multimodal interaction | Need for seamless, real-time, and trustworthy HRC
Vision Integration Challenges [99,100,101,102] | Deploying object detection, segmentation, and pose estimation in real-world cobots | Performance drops in unstructured, dynamic settings; real-time latency issues
Real-Time Adaptation and Learning [24,75,76,77] | Reinforcement learning and anticipatory AI for dynamic task adjustment | Stability, transparency, and sample efficiency
Hardware Capabilities Trends [105,106,107,108,109,110,111,112,113,114,115,116,117] | Advanced sensors, compliant actuators, soft robotics, and lightweight structures | Complexity, cost, and safety compliance
Technological Convergence [24,118,119,120] | Holistic scene understanding and proactive HRC via multi-level visual cognition | Need for cognitive empathy and context awareness
Table 5. Summary of system architecture, AI, and algorithmic considerations.

Topic | Description | Key Challenges/Notes
Advanced System Architectures [19,121,122] | Integrated perception, planning, control via middleware (e.g., ROS-Industrial) | Real-time sensor fusion; latency-sensitive tasks
Evolving AI Paradigms [67,124,125,126] | From task-specific models to foundation models and vision transformers | Explainability, computational demands
Neural Policy Architectures [127,128,129,130,131,132] | DRL, imitation learning, end-to-end visuomotor control | Data intensity; sim-to-real gap; overfitting
Algorithmic Frameworks for Dynamic Environments [53,133,134,135] | Probabilistic planning (POMDPs), semantic mapping, digital twins | Adapting under uncertainty; safe pre-validation
Table 6. Summary of human-related challenges and trends.

Topic | Description | Key Challenges/Notes
Trust Calibration in HRC [39,84,139,140,141,142,143,144,145] | Systems that monitor and adapt to human trust levels | Avoiding over-reliance or under-utilization; trust repair mechanisms
Socio-Cognitive Models and HRI [6,122,146,147,148,149,150,151,152,153,154,155] | Cognitive architectures for intention, emotion, and social cues | Theory of Mind; joint attention; real-time adaptation
Ethical and Societal Considerations [126,137,138] | Ensuring privacy, fairness, and transparency in vision–AI | Ethical safeguards; bias mitigation; regulatory compliance
Table 7. Differences between laboratory demonstrations and real-world deployment for vision–AI-enabled cobots.

Aspect | Laboratory Demonstrations | Real-World Deployment | Representative References
Environment | Controlled, static, well-lit | Dynamic, cluttered, variable lighting and occlusion | [24,100,101]
Data Distribution | Limited variation; curated datasets | Diverse, unpredictable, domain shifts | [24,77,99,101]
Task Definition | Predefined, repetitive, scripted | Variable tasks, unplanned changes | [16,19,75,124]
Sensor Conditions | Calibrated, noise-free | Sensor degradation, misalignment, interference | [99,100,101]
Human Behavior | Cooperative, predictable | Variable, ambiguous, culturally diverse | [24,139,146,147]
Latency Requirements | Often relaxed, offline processing possible | Strict real-time constraints for safety and trust | [19,100,102,121]
Generalization Need | Low (task-specific tuning acceptable) | High (must adapt to new users, objects, settings) | [24,75,76,124,126]
Safety Concerns | Simulated or with fail-safe barriers | Shared workspace with humans; legal compliance needed | [84,90,91,138,140]
Performance Metrics | Accuracy-focused benchmarks | Robustness, reliability, human acceptance | [77,99,126,137]
Explainability | Often not emphasized | Essential for user trust and regulatory approval | [87,90,126,137]
