Computer Vision for Collaborative Robots in Industry 5.0: A Survey of Techniques, Gaps, and Future Directions

Varolia, Himani; Vasques, César M. A.; Cavadas, Adélio M. S.

doi:10.3390/engproc2026124099

Open Access Proceeding Paper

by Himani Varolia ^1,2,*, César M. A. Vasques ^1,2,* and Adélio M. S. Cavadas

¹ proMetheus, Higher School of Technology and Management, Polytechnic Institute of Viana do Castelo (IPVC), 4900-347 Viana do Castelo, Portugal

² Centre for Mechanical Technology and Automation (TEMA), Department of Mechanical Engineering, University of Aveiro, 3810-193 Aveiro, Portugal

^* Authors to whom correspondence should be addressed.

^† Presented at the 6th International Electronic Conference on Applied Sciences, 9–11 December 2025; Available online: https://sciforum.net/event/ASEC2025.

Eng. Proc. 2026, 124(1), 99; https://doi.org/10.3390/engproc2026124099

Published: 24 March 2026

(This article belongs to the Proceedings of The 6th International Electronic Conference on Applied Sciences)

Abstract

Collaborative robots are increasingly deployed in human-shared industrial workspaces, where perception is a key enabler for safe interaction, flexible manipulation, and human-aware task execution. In the context of Industry 5.0, computer vision for cobots must meet not only accuracy requirements but also human-centered constraints such as safety, transparency, robustness, and practical deployability. This paper surveys computer-vision approaches used in collaborative robotics and organizes them through a task-driven taxonomy covering detection, segmentation, tracking, pose estimation, action/gesture recognition, and safety monitoring. Beyond a descriptive literature review, the paper provides a task-driven qualitative analytical perspective that relates families of computer vision methods to key industrial constraints, including occlusion, lighting variability, clutter, domain shift, real-time latency, and annotation cost, and summarizes comparative strengths and failure modes using unified criteria. We further discuss challenges related to data availability and evaluation practices, highlighting gaps in reproducibility, standardized metrics, and real-world validation in shared human–robot environments. Finally, we outline implementation and deployment considerations across common software stacks (e.g., Python-based pipelines and MATLAB-based prototyping), emphasizing ROS2 integration, edge inference, and lifecycle maintenance. The survey concludes with research directions toward robust multimodal perception, explainable human-aware vision, and benchmarkable safety-critical perception for next-generation collaborative robotic systems.

Keywords:

computer vision; collaborative robots; artificial intelligence; Industry 5.0; human–robot interaction

1. Introduction

Industrial manufacturing is shifting from the automation-centered focus of Industry 4.0 to Industry 5.0, which incorporates human-centeredness, sustainability, and resilience into manufacturing processes [1]. While Industry 4.0 emphasized interconnected systems and data-driven decisions, with enabling technologies such as AI and IoT, Industry 5.0 is concerned with how advanced technological systems integrate humans into symbiotic working relationships. This paradigm shift redefines industrial objectives, prioritizing human well-being, skill augmentation, and the creation of more adaptable and resilient production environments [2]. This evolution requires sophisticated human–robot collaboration systems in which robots cooperate safely and effectively with humans in close proximity [3]. Within this evolving framework, collaborative robots (cobots) emerge as pivotal components, designed to work alongside humans and thus enabling closer interaction and shared task execution [4]. Unlike traditional industrial robots confined to cages, cobots incorporate intrinsic safety features such as compliant actuators and force-torque sensing to facilitate secure and efficient human–robot collaboration [5]. Consequently, the ability of cobots to perceive and interpret their surroundings becomes paramount for effective interaction and task execution within Industry 5.0 environments [6]. This necessitates advanced computer vision techniques that enable cobots to understand human intentions, recognize objects, and navigate dynamic, unstructured industrial settings.

This paper provides a comprehensive survey of computer vision techniques for empowering collaborative robots within the Industry 5.0 paradigm, exploring how vision facilitates real-time object detection, pick-and-place, gesture recognition, and adaptive navigation essential for seamless human–robot interaction and safety in shared workspaces [3,6,7]. The computer vision methods considered in this survey range from classical image processing and machine learning algorithms to deep learning architectures capable of extracting high-level semantic information, enabling improved adaptability, safety monitoring, and interaction awareness in complex industrial environments [3,7].

Although several surveys have investigated computer vision and artificial intelligence for human–robot collaboration [5,8,9], most of these studies were developed within the Industry 4.0 context and therefore primarily evaluate perception methods in terms of algorithmic accuracy or task-specific performance. Industry 5.0 introduces additional design requirements centered on human-centric manufacturing, safe physical interaction, and resilient production ecosystems. As a result, computer vision systems for collaborative robots must address not only perception accuracy but also robustness in dynamic human-shared workspaces, transparency of AI-driven decision making, and practical deployment constraints such as real-time inference, edge computing, and integration with industrial robotic middleware. The present survey addresses these gaps by integrating computer vision methods, AI frameworks, and deployment pipelines within a unified Industry 5.0 analysis, treating safety, robustness, deployability, and human-centered design as cross-cutting evaluation criteria.

The main contributions of this survey are summarized as follows: (1) a structured taxonomy of computer vision techniques for collaborative robotics based on perception tasks relevant to Industry 5.0 manufacturing environments; (2) a comparative analysis of classical computer vision, machine learning, and deep learning approaches used in cobot perception systems, highlighting their strengths, limitations, and application contexts; (3) a practical comparison of software frameworks and development environments used for implementing computer vision in collaborative robotics, including Python-based ecosystems and MATLAB-based prototyping environments; (4) an overview of key computer-vision-driven applications in collaborative robotics such as gesture recognition, human detection, vision-guided manipulation, and safety monitoring; and (5) identification of open research challenges and future directions for developing robust, deployable, and human-centered computer vision systems for Industry 5.0 collaborative manufacturing environments.

2. Methodology

This review adopts a narrative literature review approach to provide a structured analysis of computer vision techniques for collaborative robots within the Industry 5.0 framework. Relevant studies published between 2018 and 2025 were identified using keyword searches across major digital libraries, including IEEE Xplore, Scopus-indexed MDPI journals, SpringerLink, Elsevier ScienceDirect, arXiv, and Google Scholar. The initial search returned approximately 502 publications, which were reduced to 145 unique papers after merging and deduplication. From these, 97 papers were selected for detailed synthesis. The selected studies were organized according to perception tasks (e.g., detection, segmentation, tracking, pose estimation, action recognition, and safety monitoring) and analyzed qualitatively using Industry 5.0-related evaluation criteria, including robustness, safety relevance, real-time constraints, and deployability. Data from each selected article were extracted using a common analysis template, including publication details, computer vision techniques, collaborative robot applications, key findings, contributions, limitations, evaluation metrics, and performance characteristics. This information was synthesized to identify recurring themes, emerging trends, comparative advantages, and critical research gaps.

3. Computer Vision Techniques and Architectures

This section outlines the principal computer vision approaches used in collaborative robotics, including classical computer vision techniques, machine learning-based methods, and deep learning architectures that enable perception capabilities in Industry 5.0 environments. Table 1 provides a qualitative comparison of these approaches with respect to typical perception tasks, data requirements, robustness, real-time capability, and deployment considerations in collaborative robotic systems.

3.1. Classical Computer Vision Methods

Classical computer vision approaches form the bedrock of robotic perception, primarily addressing fundamental tasks such as edge detection, object localization, and obstacle avoidance, often using monocular cameras [5]. These traditional algorithms rely on hand-crafted features and rule-based systems to interpret visual data. Their efficacy is often contingent upon controlled environments with predictable lighting and object geometries, making them suitable for repetitive industrial tasks where variations are minimal [10]. Structured and well-defined manufacturing settings have therefore historically favored such methods due to their interpretability, low computational requirements, and deterministic behavior. For example, techniques such as Canny edge detection [11,12,13], SURF [14,15], and SIFT [16,17] were widely employed for feature extraction and matching, enabling basic object recognition and pose estimation in constrained scenarios. These approaches continue to play crucial roles in quality inspection and part localization tasks, especially where computational resources are limited or where interpretable algorithms are required for regulatory compliance. Classical object detection in manufacturing processes likewise builds on techniques such as the sliding window paradigm [18].
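As an illustration of the sliding window paradigm mentioned above, the following minimal sketch scans a small template over a toy grayscale image and keeps the window with the lowest sum of squared differences. The function name `sliding_window_match`, the toy image, and the scoring rule are assumptions chosen for illustration; they are not taken from the surveyed works, and a practical system would typically rely on an optimized vision library.

```python
def sliding_window_match(image, template, stride=1):
    """Return (row, col, ssd) of the window that best matches the template.

    image and template are lists of rows of grayscale intensities.
    The score is the sum of squared differences (lower is better).
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best = None
    # Slide the template over every admissible top-left position.
    for r in range(0, img_h - tpl_h + 1, stride):
        for c in range(0, img_w - tpl_w + 1, stride):
            ssd = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(tpl_h)
                for j in range(tpl_w)
            )
            if best is None or ssd < best[2]:
                best = (r, c, ssd)
    return best


# Toy 5x6 image with a bright 2x2 "part" whose top-left corner is at (2, 3).
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 9, 9, 0],
    [0, 0, 0, 9, 9, 0],
    [0, 0, 0, 0, 0, 0],
]
template = [[9, 9], [9, 9]]
match = sliding_window_match(image, template)
print(match)  # (2, 3, 0)
```

The exhaustive scan is deterministic and easy to inspect, which reflects the interpretability argument made above, but its cost grows with image size, template size, and the number of scales searched.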

3.2. Machine Learning-Based Methods

Machine learning methods play an important role in computer vision by learning patterns from handcrafted visual features such as edges, corners, and texture descriptors.

Supervised learning algorithms such as Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Decision Trees, and Random Forests are extensively used for classification. SVM and k-NN algorithms enable accurate object classification, an essential step in pick-and-place operations where the robot needs to classify and handle different industrial parts [19,20]. Decision Trees and Random Forests are also used to classify spatial and temporal features extracted from human skeleton data, enabling cobots to recognize human commands and adjust their actions in real time, thereby improving human–robot interaction [21]. Gaussian Mixture Models are probabilistic approaches to image classification that can also be trained from demonstrations for task learning, allowing the robot to classify observations according to the distribution learned from the demonstrated tasks [2,22,23]. Linear Discriminant Analysis (LDA) is used for dimensionality reduction of extracted features to improve classification, including motion classification tasks [24,25,26]. Quadratic Discriminant Analysis (QDA) is used in image classification and industrial process fault detection, especially when the class covariances differ [27,28]. Naive Bayes classifiers, which assume feature independence, are also used for gesture recognition in robotics [25,29,30,31].
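To make the classification step concrete, the sketch below implements a plain k-NN vote over handcrafted shape features. The two features (aspect ratio and normalized area), the part labels, and all numeric values are hypothetical and serve only to illustrate the idea; they do not come from the cited studies.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical handcrafted features per part: (aspect ratio, normalized area).
train_features = [
    (1.0, 0.20), (1.1, 0.22), (0.9, 0.18),   # roughly square parts
    (3.0, 0.05), (3.2, 0.06), (2.8, 0.04),   # elongated parts
]
train_labels = ["washer", "washer", "washer", "bolt", "bolt", "bolt"]

label = knn_predict(train_features, train_labels, query=(2.9, 0.05))
print(label)  # bolt
```

Because the decision can be traced back to the specific neighbors that voted, this kind of classifier keeps the decision process inspectable, at the cost of depending heavily on the quality of the handcrafted features.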

Unsupervised learning algorithms such as k-Means clustering, DBSCAN, and Expectation Maximization enable robots to discover patterns in visual data without labeled datasets [2,32]. These algorithms allow robots to segment scenes, group objects, or detect anomalies in unstructured industrial environments. For example, DBSCAN improves the clustering of objects in bin-picking tasks [32].
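The following is a minimal, unoptimized sketch of density-based clustering in the spirit of DBSCAN, applied to hypothetical 2D points such as object centroids in a bin. The point coordinates and the `eps` and `min_pts` values are assumptions for illustration; production code would normally use a tested library implementation with spatial indexing.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Hypothetical centroids: two groups of parts and one stray detection.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (9.0, 0.0),
]
labels = dbscan(points, eps=0.3, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike k-Means, this approach does not need the number of clusters in advance and marks isolated points as noise, which is convenient when the number of parts in a bin is unknown.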

In dynamic settings, probabilistic filtering methods such as Kalman Filters and Particle Filters, while distinct from ML classifiers, complement machine learning-based perception by enabling reliable human pose estimation and trajectory prediction, allowing cobots to maintain safe operating distances in shared workspaces [8,33,34,35,36,37].
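As a sketch of the filtering idea, the code below implements a one-dimensional constant-velocity Kalman filter that tracks a position (for example, a person's distance along one axis) from position measurements and extrapolates it a short horizon ahead. The class name, the noise variances, the 10 Hz sampling interval, and the noise-free synthetic measurements are all assumptions made for illustration.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state [position, velocity] and position-only measurements."""

    def __init__(self, dt, process_var=1e-4, meas_var=1e-2):
        self.dt, self.q, self.r = dt, process_var, meas_var
        self.pos, self.vel = 0.0, 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance

    def step(self, z):
        dt, q, r = self.dt, self.q, self.r
        (p00, p01), (p10, p11) = self.P
        # Predict: x' = F x and P' = F P F^T + Q, with F = [[1, dt], [0, 1]]
        # and Q = q * I (a deliberately simple process-noise model).
        pos = self.pos + dt * self.vel
        vel = self.vel
        a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
        a01 = p01 + dt * p11
        a10 = p10 + dt * p11
        a11 = p11 + q
        # Update with measurement z of the position (H = [1, 0]).
        s = a00 + r
        k0, k1 = a00 / s, a10 / s
        residual = z - pos
        self.pos = pos + k0 * residual
        self.vel = vel + k1 * residual
        self.P = [
            [(1 - k0) * a00, (1 - k0) * a01],
            [a10 - k1 * a00, a11 - k1 * a01],
        ]
        return self.pos, self.vel


# Synthetic target moving at 0.5 m/s, observed at 10 Hz for 5 s.
kf = ConstantVelocityKalman1D(dt=0.1)
for k in range(1, 51):
    kf.step(0.5 * k * 0.1)

# Extrapolate the position 0.5 s ahead for a safety check.
predicted_pos = kf.pos + kf.vel * 0.5
```

In a real pipeline the measurements would come from a noisy detector or pose estimator, and the predicted position would feed a separation-distance check rather than being used directly.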

By incorporating these traditional machine learning techniques, cobots can achieve improved perception capabilities while maintaining interpretable decision-making processes, an essential tenet of Industry 5.0 [38,39].

3.3. Deep Learning-Based Methods

Deep learning has emerged as the dominant paradigm for computer vision in collaborative robotics due to its ability to learn hierarchical visual representations directly from large datasets.

Convolutional Neural Networks, such as ResNet or MobileNet, are capable of accurate image classification and object detection for industrial applications in quality inspection and component recognition [9,40,41]. Object detection networks (Faster R-CNN, YOLO, or SSD) enable the real-time detection and localization of objects, which is essential for efficiency optimization, logistics automation, and the protection of human workers through collision detection in dynamic environments [5,42,43].
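Detectors of this kind typically emit many overlapping candidate boxes that are pruned by non-maximum suppression (NMS) as a post-processing step. The sketch below shows a greedy NMS based on intersection over union (IoU). The box format `(x1, y1, x2, y2)`, the example boxes, and the 0.5 threshold are assumptions for illustration and are not tied to any specific detector.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus a separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The IoU threshold is a hand-tuned parameter: a value that is too low can merge distinct nearby objects, while a value that is too high leaves duplicate detections.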

Semantic and instance segmentation networks (U-Net, DeepLab, or Mask R-CNN) perform pixel-wise classification and instance segmentation, enabling precise scene understanding. This pixel-level comprehension is vital for accurate manipulation tasks, robotic bin picking, and digital twin prototyping in collaborative robotic environments [32,44,45,46].

Human pose estimation networks, including OpenPose, HRNet, and SCC-HRNet, are capable of accurate human keypoint and pose detection. This allows cobots to infer human intentions, predict actions, and proactively adjust behavior [5,47,48,49,50]. Depth and 3D perception networks further support 6D pose estimation and scene reconstruction using RGB-D data, enabling reliable grasping and navigation in complex industrial environments [5,51].

Vision Transformers, including DETR and Swin Transformer, support attention-based scene understanding by capturing long-range relationships and contextual cues [44]. Vision Transformers learn global feature representations by representing images as sequences of patches, bypassing the strong local inductive biases of traditional convolutional networks. For robotic perception, architectures such as DETR enable end-to-end object detection by eliminating hand-crafted components such as anchor boxes or non-maximum suppression [52]. The Swin Transformer introduces a hierarchical attention mechanism that supports multi-scale feature extraction, which is particularly useful for fine-grained inspection and defect detection in industrial settings [44]. Vision–language models such as CLIP further extend collaborative robotics by grounding natural language commands in visual observations, enabling multimodal interaction and zero-shot adaptability in dynamic manufacturing environments [53,54].
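The idea of representing an image as a sequence of patches can be sketched in a few lines. The code below only performs the patch splitting and flattening; the learned linear projection, positional embeddings, and attention layers of an actual Vision Transformer are omitted. The function name and the 4x4 toy image are assumptions for illustration.

```python
def image_to_patch_sequence(image, patch_size):
    """Split a 2D image into non-overlapping square patches, each flattened.

    Returns a list of patches in row-major order; each patch is a flat list
    of patch_size * patch_size pixel values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for r in range(0, height, patch_size):
        for c in range(0, width, patch_size):
            patches.append([
                image[r + i][c + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ])
    return patches


# Toy 4x4 single-channel image whose pixel values equal their row-major index.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patch_sequence(image, patch_size=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

Each flattened patch plays the role of one input token, so attention can relate any patch to any other regardless of their distance in the image.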

Recurrent Neural Networks (RNN) and their variants, such as Long Short-Term Memory networks (LSTM), are critical for processing sequential data, enabling predictive human motion analysis and complex gesture recognition. For instance, LSTM-based architectures and related temporal models such as LDTrack, which uses conditional latent diffusion, have been used to continuously observe and predict human and robot movements [5,9,55,56]. This enables cobots to adapt to human actions, anticipate the demands of human–robot collaboration (HRC), and mitigate risks by identifying anomalous behaviors and optimizing motion planning.

Generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), contribute to predictive modeling and data augmentation [18,44,57]. In collaborative robotics, GANs have been applied to synthetic defect image generation and domain randomization, where synthetically generated factory-floor imagery reduces annotation effort and helps bridge the visual gap between simulation and real deployment environments [58,59]. However, GAN-based approaches present limitations including training instability, mode collapse, and difficulties in validating the realism and diversity of generated samples, which may reduce their reliability in safety-critical industrial applications. VAEs are employed in learning low-dimensional representations of behavior demonstrations [60,61] and for predictive modeling of human motion [62,63,64,65]. In human–robot collaboration scenarios, VAEs are particularly suited to behavioral modeling tasks where compact latent representations of demonstrated motions can be conditioned on human state or contextual information to enable adaptive and generalizable robot behavior, reducing dependence on large labeled training datasets [55,57].

4. Software Frameworks and Implementation Platforms

Vision-enabled collaborative robotic systems rely on a diverse software ecosystem that supports perception, learning, system integration, and deployment. This ecosystem includes robot middleware, vision libraries, deep learning frameworks, simulation environments, and edge AI platforms that enable the development and deployment of computer vision pipelines for collaborative robots [40,66,67,68,69,70]. Table 2 summarizes representative frameworks and platforms used in vision-enabled collaborative robotics, highlighting their typical computer vision tasks, strengths, and deployment limitations.

Among the available programming platforms for computer vision development in collaborative robotics, MATLAB and Python represent two of the most widely adopted environments in both academic research and industrial practice [8,68]. While other languages such as C++ may offer superior runtime performance, they generally require greater development effort and are therefore more commonly used for system deployment rather than for rapid prototyping or computer vision pipeline development. The comparison presented in Table 3 therefore focuses on MATLAB and Python as the primary tools used by engineers and researchers during the design, training, and integration phases of cobot vision systems. The comparison is structured around eight practical criteria: development environment, ease of use, available libraries, computational performance, cost, technical support, robotics integration, and deployment flexibility. These dimensions reflect the typical factors considered when selecting a development platform for industrial computer vision applications in collaborative robotics, allowing a practical comparison of trade-offs in performance, accessibility, and deployment readiness. The characteristics summarized in Table 3 should be interpreted as context-dependent trade-offs rather than absolute advantages, as the suitability of each platform depends on the requirements of the collaborative robotic application and deployment environment.

To illustrate these trade-offs in practice, three representative deployment workflows are briefly outlined. (1) A ROS2 + PyTorch + NVIDIA Jetson pipeline enables real-time edge inference for gesture recognition and human detection on embedded hardware, leveraging ROS2 middleware for perception–control integration and GPU-accelerated deep learning frameworks without reliance on cloud connectivity [70,71]. (2) A MATLAB/Simulink rapid prototyping workflow supports the design and validation of vision-based control logic, with subsequent C++ code generation via MATLAB Coder for deployment on embedded cobot controllers [68]. (3) A simulation-to-real pipeline using NVIDIA Isaac Sim enables the generation of synthetic annotated datasets at scale for training object detection and pose estimation models, which are then fine-tuned on real factory-floor data to mitigate the sim-to-real gap [69]. These workflows demonstrate that platform selection is inherently context-dependent and driven by deployment targets, latency requirements, and available computational infrastructure.

5. AI-Driven Computer Vision Tasks and Applications in Collaborative Robotics

AI-driven computer vision enables safe, effective, and intuitive human–robot collaboration across a variety of domains by facilitating collaborative robots’ perception, interpretation, and environmental responses [3,6,7].

Human Detection, Tracking, and Safety Monitoring are crucial in shared workspaces. Vision systems continuously detect human presence, estimate distances, and monitor motion for safety functions including speed and separation monitoring, dynamic workspace zoning, and collision avoidance [43,50,73,74].
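Speed and separation monitoring can be illustrated with a simplified protective separation distance check. The sketch below assumes constant human and robot speeds and sums the human travel during the reaction and stopping time, the robot travel during the reaction time, the robot stopping distance, and additive margins for intrusion and measurement uncertainty. This is a commonly used simplification in the spirit of ISO/TS 15066, not the normative formulation; all parameter names and numeric values are assumptions for illustration, and the standard itself should be consulted for any real safety design.

```python
def protective_separation_distance(
    v_human,        # human approach speed towards the robot [m/s]
    v_robot,        # robot speed towards the human [m/s]
    t_reaction,     # sensing and processing reaction time [s]
    t_stop,         # robot stopping time [s]
    stop_distance,  # distance the robot travels while stopping [m]
    intrusion=0.0,  # intrusion distance margin [m]
    z_human=0.0,    # human position measurement uncertainty [m]
    z_robot=0.0,    # robot position uncertainty [m]
):
    """Simplified constant-speed protective separation distance in meters."""
    s_human = v_human * (t_reaction + t_stop)  # human keeps moving throughout
    s_robot = v_robot * t_reaction             # robot moves until it reacts
    return s_human + s_robot + stop_distance + intrusion + z_human + z_robot


def must_stop(measured_distance, required_distance):
    """True if the measured human-robot distance violates the required distance."""
    return measured_distance < required_distance


# Illustrative values only (not taken from the survey or from the standard).
required = protective_separation_distance(
    v_human=1.6, v_robot=0.5, t_reaction=0.1, t_stop=0.3,
    stop_distance=0.075, intrusion=0.1, z_human=0.05, z_robot=0.02,
)
stop_now = must_stop(measured_distance=0.8, required_distance=required)
```

With these illustrative numbers the required distance is about 0.935 m, so a vision-estimated distance of 0.8 m would trigger a protective stop. The vision pipeline affects this check twice: through its reaction time and through the position uncertainty of the human estimate.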

Pose Estimation, Gesture, and Action Recognition help robots infer intent and anticipate actions by providing insights into human posture and movement [5,47]. Action recognition helps with task coordination and human action prediction, while gesture recognition makes natural human commands possible. Assistance with assembly, tool handover, and training through demonstration are typical uses [53,75].

Object Perception and Manipulation Support relies on robust object detection, classification, and pose estimation. The vision system can detect and locate tools, components, and workpieces in cluttered environments [9]. In shared human–robot workspaces, this capability facilitates essential tasks such as bin picking, kitting, cooperative assembly, and quality inspection [32,40,76,77,78].

Human–Robot Interaction and Intention Understanding are significantly improved by computer vision, allowing robots to analyze facial expressions, gaze, and body language to interpret human actions and social cues. This boosts user acceptance and trust, particularly in service, healthcare, and social robotics, where intuitive interaction is paramount.

These AI-driven vision capabilities have extensive applications in industrial settings for collaborative assembly, inspection, and material handling [40,78]; in healthcare for rehabilitation, patient monitoring, and surgical support; and in service robotics for customer service, home care, and social interaction in public spaces.

Sustainability and Resource Efficiency. Beyond safety and interaction capabilities, computer vision also plays an important role in supporting sustainability objectives within Industry 5.0 manufacturing systems. Vision-based inspection and defect detection systems enable early identification of production faults, reducing material waste and minimizing energy consumption associated with defective product processing [41,79]. In addition, computer vision techniques can support automated waste sorting and recycling classification, where deep learning models identify material categories such as plastics, metals, and electronic components to enable robotic separation of recyclable waste streams [80,81]. Such applications contribute to more resource-efficient manufacturing processes and highlight the expanding role of collaborative robot perception beyond productivity and safety toward circular manufacturing and sustainability goals envisioned by Industry 5.0.

6. Current Challenges and Research Gaps

Despite significant progress, several challenges still hinder the widespread adoption of AI-driven computer vision in collaborative robotics. One key issue is limited multimodal integration: advanced human–robot interaction requires the seamless fusion of tactile data, speech, and contextual information with visual perception [82].

Industry 5.0 Perception Challenges. Industrial environments often involve severe occlusion, dynamic lighting conditions, and unstructured workspaces that degrade detection accuracy and compromise safety-critical tasks such as human tracking and collision avoidance [43,74]. Real-time constraints impose strict latency requirements on perception pipelines, while domain shifts between laboratory datasets and real factory environments require costly model adaptation to ensure robust deployment [5,70,79].
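One practical way to reason about such latency requirements is to measure per-frame processing time and compare a high percentile, rather than the mean, against the budget. The sketch below is a minimal harness under stated assumptions: the function names, the nearest-rank percentile definition, the 33.3 ms budget (roughly one frame at 30 frames per second), and the sample latencies are all illustrative and not taken from the surveyed works.

```python
import math
import time


def percentile(values, q):
    """Nearest-rank percentile of `values` for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def measure_latencies(process_frame, frames):
    """Run `process_frame` on each frame and return latencies in milliseconds."""
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def meets_budget(latencies_ms, budget_ms, q=95):
    """True if the q-th percentile latency stays within the budget."""
    return percentile(latencies_ms, q) <= budget_ms


# Hypothetical per-frame latencies with one slow outlier frame.
latencies_ms = [10, 12, 11, 50, 13, 12, 11, 10, 12, 11]
p95 = percentile(latencies_ms, 95)
within_budget = meets_budget(latencies_ms, budget_ms=33.3)
print(p95, within_budget)  # 50 False
```

The mean of these samples is well under the budget, yet the 95th percentile is not, which is why tail latency is the more relevant figure for safety-related perception.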

Another challenge is the scarcity of benchmarks and standardized datasets. Deep learning models require large volumes of annotated data that are often expensive and time-consuming to acquire, which reduces robustness when models are deployed outside curated laboratory environments [5,53,82]. While datasets such as COVERED provide 3D semantic annotations tailored to collaborative robot environments [45], they do not fully capture dynamic interaction scenarios or safety-critical edge cases typical of real industrial deployments. Moreover, current datasets and evaluation protocols rarely include speed and separation monitoring scenarios required by ISO/TS 15066 [83], leaving limited benchmarking resources for evaluating perception systems in safety-critical collaborative robotic environments. The sim-to-real gap between synthetic training data and real factory deployment further limits model generalization [53,79], highlighting the need for domain-specific benchmarks and standardized evaluation protocols.

Explainability of deep learning models is another crucial challenge for fostering trust and transparency in human–robot collaboration, aligning with Industry 5.0’s human-centric principles; however, consensus on methodologies is still evolving [44,84,85]. Relatedly, safety compliance [74,86] and hardware constraints pose significant hurdles, as cobots often have limited memory, processing units, and power due to size, weight, area, and power limitations [87,88]. This creates a real-time perception bottleneck for powerful models, impacting edge AI deployment and low-latency real-time processing [40,70,89].

Privacy and Data Protection. Camera-based perception systems in shared workspaces may capture identifiable worker information, requiring compliance with data protection frameworks such as the General Data Protection Regulation (GDPR) [86]. Future systems should therefore incorporate privacy-by-design principles including data minimization, on-device processing, and controlled visual data storage [90,91]. Ensuring regulatory compliance without compromising perceptual accuracy remains an underexplored challenge in collaborative robotics.

Finally, maintenance and lifecycle management present additional challenges, as model performance may degrade over time, requiring periodic retraining and robust calibration procedures to ensure long-term reliability and adaptability to evolving operational conditions [40,92,93].

Addressing these multifaceted challenges is essential for advancing secure, efficient, and human-centric Industry 5.0 applications.

7. Future Directions

Future research in computer vision for collaborative robots in Industry 5.0 should focus on advanced multimodal perception frameworks that integrate visual, tactile, and linguistic inputs to facilitate intuitive human–robot interaction and enhance comprehension of human intent [94,95]. To improve model robustness and generalization, future work should address current dataset limitations through the development of standardized industrial benchmarks and the use of synthetic data generation techniques. In addition, inherently explainable AI architectures are needed to improve transparency, facilitate debugging, and support certification in safety-critical applications [44,84]. For optimal deployment of edge AI, hardware–algorithm co-design should be considered, together with in-sensor computing approaches that target ultra-low-latency processing [40,96].

Privacy-preserving perception represents an emerging research direction for vision systems deployed in human-shared industrial environments. Promising approaches include real-time facial anonymization applied to industrial video streams [90], skeleton-based human representation as a privacy-preserving alternative to raw RGB data [91], and edge inference architectures that process identifiable visual data locally without transmission to centralized servers [70,89]. The development of proactive safety solutions and lifelong learning capabilities will further promote continuously adaptive, reliable, and low-maintenance collaborative robotic systems [43].

The integration of soft robotic systems with computer vision represents a further research direction for Industry 5.0 cobot applications. Soft robots introduce perception challenges related to deformable morphology that standard rigid-body vision pipelines do not address. Future work should explore vision systems for real-time soft manipulator state estimation and closed-loop visual control in human-shared workspaces [97,98].

These research directions are critical for the development of human-centered, reliable, and scalable collaborative robotic systems within Industry 5.0 manufacturing environments.

8. Conclusions

The key role of computer vision in developing safe, adaptive, and human-centered collaborative robots within the Industry 5.0 paradigm has been discussed in this study. Starting from classical vision techniques, which remain essential for simple tasks, we traced the development to advanced deep learning methods that enable complex object recognition, human pose estimation, and real-time understanding of the environment. The importance of robust software frameworks and programming environments was highlighted, noting the complementary roles of platforms like MATLAB for rapid prototyping and Python-based ecosystems for scalable, flexible deployment.

Despite this progress, substantial research gaps and practical challenges remain in the field, including improving model robustness, enhancing explainability, establishing reliable benchmarking practices, and optimizing edge AI deployment for low-latency perception. This survey additionally identifies emerging directions, including vision-based sustainability applications such as defect detection and intelligent waste sorting for circular manufacturing, privacy-preserving perception mechanisms to support GDPR-compliant human–robot collaboration, and the integration of soft robotic systems with computer vision as an underexplored frontier for Industry 5.0 cobot deployments.

Addressing these challenges through focused future research will be instrumental in fostering the reliable, sustainable, and ethically aligned adoption of collaborative robotic systems, ultimately strengthening human–robot synergy in Industry 5.0.

Author Contributions

Conceptualization, H.V., C.M.A.V. and A.M.S.C.; formal analysis, H.V., C.M.A.V. and A.M.S.C.; investigation, H.V., C.M.A.V. and A.M.S.C.; writing—original draft preparation, H.V.; writing—review and editing, H.V., C.M.A.V. and A.M.S.C.; supervision, C.M.A.V. and A.M.S.C.; funding acquisition, C.M.A.V. and A.M.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

The authors gratefully acknowledge the funding provided by the Foundation for Science and Technology (FCT) of Portugal, within the scope of the project of the Research Unit on Materials, Energy and Environment for Sustainability (proMetheus, https://tech.ipvc.pt/unidades.php?-u=PROMETHEUS, 24 January 2026), Ref. UID/05975/2020, financed by national funds through the FCT/MCTES; and the funding provided within the scope of the “Agenda DRIVOLUTION: Transition to the Factory of the Future”, project no. C632394276-0046698 with operation code 02/C05-i01.02/2022.PC644913740-00000022, within the framework of the Agendas/Mobilizing Alliances for Reindustrialization, Notice no. 2022-C05i0101-02, project 23, of the Recovery and Resilience Plan (PRR) of Portugal.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Aheleroff, S.; Huang, H.; Xu, X.; Zhong, R.Y. Toward sustainability and resilience with Industry 4.0 and Industry 5.0. Front. Manuf. Technol. 2022, 2, 951643.
2. Langås, E.F.; Zafar, M.H.; Sanfilippo, F. Exploring the synergy of human-robot teaming, digital twins, and machine learning in Industry 5.0: A step towards sustainable manufacturing. J. Intell. Manuf. 2025, 37, 999–1022.
3. Shah, R.; Doss, A.S.A.; Lakshmaiya, N. Advancements in AI-Enhanced Collaborative Robotics: Towards Safer, Smarter, and Human-Centric Industrial Automation. Results Eng. 2025, 27, 105704.
4. Puttero, S.; Verna, E.; Genta, G.; Galetto, M. Collaborative robots for quality control: An overview of recent studies and emerging trends. J. Intell. Manuf. 2025.
5. Cohen, Y.; Biton, A.; Shoval, S. Fusion of Computer Vision and AI in Collaborative Robotics: A Review and Future Prospects. Appl. Sci. 2025, 15, 7905.
6. Patil, S.; Vasu, V.; Srinadh, K.V.S. Advances and perspectives in collaborative robotics: A review of key technologies and emerging trends. Discov. Mech. Eng. 2023, 2, 13.
7. Rahman, M.M.; Khatun, F.; Jahan, I.; Devnath, R.; Bhuiyan, M.A.A. Cobotics: The Evolving Roles and Prospects of Next-Generation Collaborative Robots in Industry 5.0. J. Robot. 2024, 2024, 2918089.
8. Robinson, N.; Tidd, B.; Campbell, D.; Kulić, D.; Corke, P. Robotic Vision for Human-Robot Interaction and Collaboration: A Survey and Systematic Review. ACM Trans. Hum.-Robot Interact. 2022, 12, 12.
9. Borboni, A.; Reddy, K.V.V.; Elamvazuthi, I.; Al-Quraishi, M.S.; Natarajan, E.; Ali, S.S.A. The Expanding Role of Artificial Intelligence in Collaborative Robots for Industrial Applications: A Systematic Review of Recent Works. Machines 2023, 11, 111.
10. Santos, A.A.; Schreurs, C.; da Silva, A.F.; Pereira, F.; Felgueiras, C.; Lopes, A.M.; Machado, J. Integration of Artificial Vision and Image Processing into a Pick and Place Collaborative Robotic System. J. Intell. Robot. Syst. 2024, 110, 159.
11. Gaboardi, C. Use of the Wavelet Transform for Digital Terrain Model Edge Detection (Special Issue—Wavelet Analysis). J. Appl. Math. Phys. 2018, 6, 1997–2005.
12. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698.
13. Agrawal, H.; Desai, K. Canny Edge Detection: A Comprehensive Review. Int. J. Technol. Res. Sci. 2024, 9, 27–35.
14. Kumar, A. SURF feature descriptor for image analysis. Imaging Radiat. Res. 2024, 6, 5643.
15. Bay, H.; Tuytelaars, T.; Gool, L.V. SURF: Speeded Up Robust Features. In Computer Vision—ECCV 2006; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3951, pp. 404–417.
16. Karami, E.; Shehata, M.; Smith, A. Image Identification Using SIFT Algorithm: Performance Analysis against Different Image Deformations. arXiv 2022, arXiv:1710.02728.
17. Otero, I.R.; Delbracio, M. Anatomy of the SIFT Method. Image Process. Line 2014, 4, 370–396.
18. Mangat, A.S.; Mangler, J.; Rinderle-Ma, S. Interactive Process Automation based on lightweight object detection in manufacturing processes. Comput. Ind. 2021, 130, 103482.
19. Molaei, A.; Kolu, A.; Lahtinen, K.; Geimer, M. Automatic recognition of excavator working cycles using supervised learning and motion data obtained from inertial measurement units (IMUs). Constr. Robot. 2024, 8, 14.
20. Hussain, S.; Saeed, K.; Baimagambetov, A.; Rab, S.; Saad, M. Advancements in Gesture Recognition Techniques and Machine Learning for Enhanced Human-Robot Interaction: A Comprehensive Review. arXiv 2024, arXiv:2409.06503.
21. Roitberg, A.; Perzylo, A.; Somani, N.; Giuliani, M.; Rickert, M.; Knoll, A. Human activity recognition in the context of industrial human-robot interaction. In Proceedings of the 2014 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA); IEEE: Piscataway, NJ, USA, 2014; pp. 1–10.
22. Jiang, Y.; Leung, F. Gaussian Mixture Model and Gaussian Supervector for Image Classification. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (ICDSP); IEEE: Piscataway, NJ, USA, 2018; pp. 1–5.
23. Chernova, S.; Veloso, M. Confidence-based policy learning from demonstration using Gaussian mixture models. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS); ACM: New York, NY, USA, 2007; pp. 1–8.
24. Singh, A.K. On effective human robot interaction based on recognition and association. arXiv 2018, arXiv:1812.07100.
25. Qi, J.; Ma, L.; Cui, Z.; Yu, Y. Computer vision-based hand gesture recognition for human-robot interaction: A review. Complex Intell. Syst. 2023, 10, 1581–1606.
26. Hong, Y.; Yang, Y.; Park, J. Linear Discriminant Analysis-Based Motion Classification Using Distributed Micro-Doppler Radars with Limited Backhaul. Sensors 2021, 21, 2924.
27. Shahin, M.; Chen, F.F.; Hosseinzadeh, A.; Zand, N. Using machine learning and deep learning algorithms for downtime minimization in manufacturing systems: An early failure detection diagnostic service. Int. J. Adv. Manuf. Technol. 2023, 128, 3857–3883.
28. Li, H.; Jia, M.; Mao, Z. Dynamic Feature Extraction-Based Quadratic Discriminant Analysis for Industrial Process Fault Classification and Diagnosis. Entropy 2023, 25, 1664.
29. Ashfaq, T.; Khurshid, K. Classification of Hand Gestures Using Gabor Filter with Bayesian and Naïve Bayes Classifier. Int. J. Adv. Comput. Sci. Appl. 2016, 7.
30. Trovato, G.; Chrupała, G.; Takanishi, A. Application of the Naive Bayes Classifier for Representation and Use of Heterogeneous and Incomplete Knowledge in Social Robotics. Robotics 2016, 5, 6.
31. Escalante, H.J.; Morales, E.F.; Sucar, L.E. A naïve Bayes baseline for early gesture recognition. Pattern Recognit. Lett. 2016, 73, 91–99.
32. Zhuang, C.; Li, S.; Ding, H. Instance segmentation based 6D pose estimation of industrial objects using point clouds for robotic bin-picking. Robot. Comput.-Integr. Manuf. 2023, 82, 102541.
33. Saleem, Z.; Gustafsson, F.; Furey, E.; McAfee, M.; Huq, S. A review of external sensors for human detection in a human robot collaborative environment. J. Intell. Manuf. 2024, 36, 2255–2279.
34. Wang, S.; Zhang, J.; Wang, P.; Law, J.; Călinescu, R.; Mihaylova, L. A deep learning-enhanced Digital Twin framework for improving safety and reliability in human–robot collaborative manufacturing. Robot. Comput.-Integr. Manuf. 2023, 85, 102608.
35. Amorim, A.; Guimarães, D.; Mendonça, T.; Neto, P.; Costa, P.; Moreira, A.P. Robust human position estimation in cooperative robotic cells. Robot. Comput.-Integr. Manuf. 2020, 67, 102035.
36. Bellotto, N.; Hu, H. Computationally efficient solutions for tracking people with a mobile robot: An experimental evaluation of Bayesian filters. Auton. Robot. 2009, 28, 425–438.
37. Islam, M.J.; Hong, J.; Sattar, J. Person-following by autonomous robots: A categorical overview. Int. J. Robot. Res. 2019, 38, 1581–1618.
38. Sridharan, M.; Meadows, B. Towards a Theory of Explanations for Human–Robot Collaboration. KI Künstliche Intell. 2019, 33, 331–342.
39. Rožanec, J.M.; Montini, E.; Cutrona, V.; Papamartzivanos, D.; Klemenčič, T.; Fortuna, B.; Mladenić, D.; Veliou, E.; Giannetsos, T.; Emmanouilidis, C. Human in the AI Loop via xAI and Active Learning for Visual Inspection. In Explainable Artificial Intelligence for Industry 5.0; Springer: Berlin/Heidelberg, Germany, 2023; pp. 381–406.
40. Terras, N.; Pereira, F.; Silva, A.R.; Santos, A.A.; Lopes, A.M.; da Silva, A.F.; Cartal, L.A.; Apostolescu, T.C.; Badea, F.; Machado, J. Integration of Deep Learning Vision Systems in Collaborative Robotics for Real-Time Applications. Appl. Sci. 2025, 15, 1336.
41. Govi, E.; Sapienza, D.; Toscani, S.; Cotti, I.; Franchini, G.; Bertogna, M. Addressing challenges in industrial pick and place: A deep learning-based 6 Degrees-of-Freedom pose estimation solution. Comput. Ind. 2024, 161, 104130.
42. Tavakoli, H.; Suh, S.; Walunj, S.; Pahlevannejad, P.; Plociennik, C.; Ruskowski, M. Object Detection for Human–Robot Interaction and Worker Assistance Systems. In Explainable Artificial Intelligence for Industry 5.0; Springer: Berlin/Heidelberg, Germany, 2023; pp. 319–332.
43. Patalas-Maliszewska, J.; Łosyk, H.; Dudek, A. Improving safety in human–robot collaboration towards sustainable production in Industry 5.0. J. Intell. Manuf. 2025.
44. Maham, A.; Tashfa, D.E.N. Deep Learning Perspective of Scene Understanding in Autonomous Robots. arXiv 2025, arXiv:2512.14020.
45. Munasinghe, C.; Amin, F.M.; Scaramuzza, D.; van de Venn, H.W. COVERED, CollabOratiVE Robot Environment Dataset for 3D Semantic segmentation. In Proceedings of the 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), Stuttgart, Germany, 6–9 September 2022; pp. 1–4.
46. Monsone, C.R.; Csapó, Á. Instance Segmentation in Industry 5.0 Applications Based on the Automated Generation of Point Clouds. Acta Polytech. Hung. 2025, 22, 25–46.
47. Ma, Z.; Jiao, W.; Li, L.; Yang, S.; Xu, X. Application of Keypoint Recognition for Industrial Human-Robot Safe Collaboration Scenarios. In Proceedings of the 2024 IEEE International Symposium on Assembly and Manufacturing (ISAM); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
48. Forlini, M.; Neri, F.; Ciccarelli, M.; Palmieri, G.; Callegari, M. Experimental implementation of skeleton tracking for collision avoidance in collaborative robotics. Int. J. Adv. Manuf. Technol. 2024, 134, 57–73.
49. Svarny, P.; Tesar, M.; Behrens, J.K.; Hoffmann, M. Safe physical HRI: Toward a unified treatment of speed and separation monitoring together with power and force limiting. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: Piscataway, NJ, USA, 2019; pp. 7580–7587.
50. Amaya-Mejía, L.M.; Duque-Suárez, N.; Jaramillo-Ramírez, D.; Martínez, C. Vision-Based Safety System for Barrierless Human-Robot Collaboration. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 7331–7336.
51. Liang, G.; Chen, F.; Liang, Y.; Feng, Y.; Wang, C.; Wu, X. A Manufacturing-Oriented Intelligent Vision System Based on Deep Neural Network for Object Recognition and 6D Pose Estimation. Front. Neurorobot. 2021, 14, 616775.
52. Sanghai, N.; Brown, N. Advances in Transformers for Robotic Applications: A Review. arXiv 2024, arXiv:2412.10599.
53. Xia, W.; Zheng, H.; Xu, W.; Xu, X. Large vision-language models enabled novel objects 6D pose estimation for human-robot collaboration. Robot. Comput.-Integr. Manuf. 2025, 95, 103030.
54. Zhang, X.; Tian, S.; Liang, X.; Zheng, M.; Behdad, S. Early Prediction of Human Intention for Human–Robot Collaboration Using Transformer Network. J. Comput. Inf. Sci. Eng. 2024, 24, 051003.
55. Laplaza, J.; Moreno, F.; Sanfeliu, A. Enhancing Robotic Collaborative Tasks Through Contextual Human Motion Prediction and Intention Inference. Int. J. Soc. Robot. 2024, 17, 2077–2096.
56. Fung, A.; Benhabib, B.; Nejat, G. LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models. Int. J. Comput. Vis. 2025, 133, 3392–3412.
57. Osa, T.; Ikemoto, S. Goal-Conditioned Variational Autoencoder Trajectory Primitives with Continuous and Discrete Latent Codes. SN Comput. Sci. 2020, 1, 303.
58. Deng, F.; Luo, J.; Fu, L.; Huang, Y.; Chen, J.; Li, N.; Zhong, J.; Lam, T.L. DG2GAN: Improving defect recognition performance with generated defect image sample. Sci. Rep. 2024, 14, 14787.
59. Hong, J.; Fulton, M.; Sattar, J. A Generative Approach Towards Improved Robotic Detection of Marine Litter. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2020; pp. 10525–10531.
60. Hafez, M.B.; Wermter, S. Behavior Self-Organization Supports Task Inference for Continual Robot Learning. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 6739–6746.
61. Hsiao, F.I.; Kuo, J.H.; Sun, M. Learning a Multi-Modal Policy via Imitating Demonstrations with Mixed Behaviors. arXiv 2022, arXiv:1903.10304.
62. Mao, W.; Liu, M.; Salzmann, M. Generating Smooth Pose Sequences for Diverse Human Motion Prediction. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; p. 13289.
63. Ling, H.Y.; Zinno, F.; Cheng, G.G.; van de Panne, M. Character controllers using motion VAEs. ACM Trans. Graph. 2020, 39, 40.
64. Aliakbarian, S.; Saleh, F.S.; Petersson, L.; Gould, S.J.; Salzmann, M. Contextually Plausible and Diverse 3D Human Motion Prediction. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; p. 11313.
65. Li, J.; Villegas, R.; Ceylan, D.; Yang, J.; Kuang, Z.; Li, H.; Zhao, Y. Task-Generic Hierarchical Human Motion Prior using VAEs. In Proceedings of the 2021 International Conference on 3D Vision (3DV); IEEE: Piscataway, NJ, USA, 2021; pp. 771–781.
66. Riba, E.; Mishkin, D.; Ponsa, D.; Rublee, E.; Bradski, G. Kornia: An Open Source Differentiable Computer Vision Library for PyTorch. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2020; pp. 3663–3672.
67. Wang, Z.; Liu, K.; Li, J.; Zhu, Y.; Zhang, Y. Various Frameworks and Libraries of Machine Learning and Deep Learning: A Survey. Arch. Comput. Methods Eng. 2019, 31, 1–24.
68. Bonci, A.; Gaudeni, F.; Giannini, M.; Longhi, S. Robot Operating System 2 (ROS2)-Based Frameworks for Increasing Robot Autonomy: A Survey. Appl. Sci. 2023, 13, 12796.
69. de Melo, M.S.P.; da Silva Neto, J.G.; da Silva, P.J.L.; Teixeira, J.M.; Teichrieb, V. Analysis and Comparison of Robotics 3D Simulators. In Proceedings of the 2019 21st Symposium on Virtual and Augmented Reality (SVR); IEEE: Piscataway, NJ, USA, 2019; pp. 242–251.
70. Kim, D.; Kwon, J.I.; Kim, Y.; Kim, D.; Choi, C. AI-native robotic vision systems enabled by in-sensor computing. npj Unconv. Comput. 2026, 3, 2.
71. Erős, E.; Dahl, M.; Bengtsson, K.; Hanna, A.; Falkman, P. A ROS2 based communication architecture for control in collaborative and intelligent automation systems. Procedia Manuf. 2019, 38, 349–357.
72. Zhang, J.; Keramat, F.; Yu, X.; Hernández, D.M.; Queralta, J.P.; Westerlund, T. Distributed Robotic Systems in the Edge-Cloud Continuum with ROS 2: A Review on Novel Architectures and Technology Readiness. In Proceedings of the 2022 Seventh International Conference on Fog and Mobile Edge Computing (FMEC); IEEE: Piscataway, NJ, USA, 2022; pp. 1–8.
73. Bonci, A.; Cheng, P.D.C.; Indri, M.; Nabissi, G.; Sibona, F. Human-Robot Perception in Industrial Environments: A Survey. Sensors 2021, 21, 1571.
74. Alenjareghi, M.J.; Keivanpour, S.; Chinniah, Y.; Jocelyn, S. Computer vision-enabled real-time job hazard analysis for safe human–robot collaboration in disassembly tasks. J. Intell. Manuf. 2024, 36, 5563–5591.
75. Liu, S.; Zhang, J.; Wang, L.; Gao, R.X. Vision AI-based human-robot collaborative assembly driven by autonomous robots. CIRP Ann. 2024, 73, 13–16.
76. Shahria, M.T.; Sunny, M.S.H.; Zarif, M.I.I.; Ghommam, J.; Ahamed, S.I.; Rahman, M.H. A Comprehensive Review of Vision-Based Robotic Applications: Current State, Components, Approaches, Barriers, and Potential Solutions. Robotics 2022, 11, 139.
77. Harada, K.; Wan, W.; Tsuji, T.; Kikuchi, K.; Nagata, K.; Onda, H. Experiments on Learning Based Industrial Bin-picking with Iterative Visual Recognition. arXiv 2018, arXiv:1805.08449.
78. Villalonga, A.; Cruz, Y.J.; Alfaro, D.D.; Haber, R.E.; Lastra, J.L.M.; Castaño, F. Enhancing Quality Inspection in Zero-Defect Manufacturing Through Robotic-Machine Collaboration. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
79. Lema, D.G.; Sánchez-González, L.; Usamentiaga, R.; de la Calle, F.J. Benchmarking Deep Learning Models for Surface Defect Detection: A Reproducible and Statistically-Rigorous Approach. J. Intell. Manuf. 2025.
80. Koskinopoulou, M.; Raptopoulos, F.; Papadopoulos, G.; Maniadakis, M.; Partsinevelos, P. Robotic Waste Sorting Technology: Toward a Vision-Based Categorization System for the Industrial Robotic Separation of Recyclable Waste. IEEE Robot. Autom. Mag. 2021, 28, 50–60.
81. Vukićević, A.M.; Petrović, M.; Jurišević, N.; Djapan, M.; Knezevic, N.; Novakovic, A.; Jovanovic, K. Versatile Waste Sorting in Small Batch and Flexible Manufacturing Industries Using Deep Learning Techniques. Sci. Rep. 2025, 15, 3756.
82. Jabrane, K.; Bousmah, M. A New Approach for Training Cobots from Small Amount of Data in Industry 5.0. Int. J. Adv. Comput. Sci. Appl. 2021, 12.
83. ISO/TS 15066; Robots and Robotic Devices—Collaborative Robots. International Organization for Standardization: Geneva, Switzerland, 2016. Available online: https://www.iso.org/standard/62996.html (accessed on 11 March 2026).
84. Weber, T.; Wermter, S. Integrating Intrinsic and Extrinsic Explainability: The Relevance of Understanding Neural Networks for Human-Robot Interaction. arXiv 2020, arXiv:2010.04602.
85. Ambsdorf, J.; Munir, A.; Wei, Y.; Degkwitz, K.; Harms, H.M.; Stannek, S.; Ahrens, K.; Becker, D.; Strahl, E.; Weber, T.; et al. Explain yourself! Effects of Explanations in Human-Robot Interaction. In Proceedings of the 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Napoli, Italy, 29 August–2 September 2022; pp. 393–400.
86. Callari, T.C.; Segate, R.V.; Hubbard, E.; Daly, A.; Lohse, N. An Ethical Framework for Human-Robot Collaboration for the Future People-Centric Manufacturing: A Collaborative Endeavour with European subject-matter experts in Ethics. Technol. Soc. 2024, 78, 102680.
87. Đorđević, M.; Albonico, M.; Lewis, G.A.; Malavolta, I.; Lago, P. Computation offloading for ground robotic systems communicating over WiFi—An empirical exploration on performance and energy trade-offs. Empir. Softw. Eng. 2023, 28, 140.
88. Neuman, S.M.; Plancher, B.; Duisterhof, B.P.; Krishnan, S.; Banbury, C.; Mazumder, M.; Prakash, S.; Jabbour, J.; Faust, A.; de Croon, G.; et al. Tiny Robot Learning: Challenges and Directions for Machine Learning in Resource-Constrained Robots. In Proceedings of the 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS); IEEE: Piscataway, NJ, USA, 2022; pp. 296–299.
89. Park, J.; Kim, P.; Ko, D. Real-time open-vocabulary perception for mobile robots on edge devices: A systematic analysis of the accuracy-latency trade-off. Front. Robot. AI 2025, 12, 1693988.
90. Triess, S.C.; Leitritz, T.; Jauch, C. Exploring AI-based Anonymization of Industrial Image and Video Data in the Context of Feature Preservation. In Proceedings of the 2024 32nd European Signal Processing Conference (EUSIPCO); IEEE: Piscataway, NJ, USA, 2024; pp. 471–475.
91. Moon, S.; Kim, M.; Qin, Z.; Liu, Y.; Kim, D. Anonymization for Skeleton Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2023; Volume 37, pp. 15028–15036.
92. Cramariuc, A.; Petrov, A.; Suri, R.; Mittal, M.; Siegwart, R.; Cadena, C. Learning Camera Miscalibration Detection. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2020; pp. 4997–5003.
93. Qiao, G.; Li, G. Auto-Calibration for Vision-Based 6-D Sensing System to Support Monitoring and Health Management for Industrial Robots. In Proceedings of the ASME 2021 16th International Manufacturing Science and Engineering Conference. Volume 2: Manufacturing Processes; Manufacturing Systems; Nano/Micro/Meso Manufacturing; Quality and Reliability, Online, 21–25 June 2021.
94. Zhao, W.; Gangaraju, K.; Yuan, F. Multimodal perception-driven decision-making for human-robot interaction: A survey. Front. Robot. AI 2025, 12, 1604472.
95. Xue, T.; Wang, W.; Ma, J.; Liu, W.; Pan, Z.; Han, M. Progress and Prospects of Multimodal Fusion Methods in Physical Human–Robot Interaction: A Review. IEEE Sens. J. 2020, 20, 10355–10370.
96. Martinez-Gil, J.; Pichler, M.; Bountouni, N.; Koussouris, S.; Barreiro, M.M.; Gusmeroli, S. An Agentic Framework for Rapid Deployment of Edge AI Solutions in Industry 5.0. arXiv 2025, arXiv:2510.25813.
97. Shao, X.; Xu, L.; Zheng, T.; Sun, G.; Zhu, Y. Practical Finite-Time Motion Planning for Spacecraft-Mounted Soft Manipulators Under Dynamic Obstacles. IEEE Trans. Aerosp. Electron. Syst. 2026, 62, 2603–2620.
98. Shao, X.; Xu, L.; Sun, G.; Yao, W.; Wu, L.; Santina, C.D. Self-Attention Enhanced Dynamics Learning and Adaptive Fractional-Order Control for Continuum Soft Robots with System Uncertainties. IEEE Trans. Autom. Sci. Eng. 2025, 22, 18694–18708.

Table 1. Qualitative overview of CV methods used in collaborative robotics.

Approach/Method	Typical CV Tasks	Data	Robustness	Real-Time	Deployment	Key Limitations
Classical computer vision approaches
Canny, SURF, SIFT, sliding window	Edge detection; localization; obstacle avoidance; feature matching; basic recognition; pose; inspection	Low	Low	High	Low–Med	Sensitive to lighting/occlusion; limited generalization; feature engineering required
Machine-learning-based methods
SVM, KNN, DT, RF	Classification; gesture/command recognition; fault detection; motion classification; task learning	Med	Med	High	Med	Depends on engineered features; limited end-to-end perception
GMM	Image classification; task learning	Med	Med	High	Med	Distribution/feature assumptions
LDA/QDA	Dimensionality reduction; classification; fault tasks	Low–Med	Med	High	Med	Separability/distribution assumptions
Naive Bayes	Gesture recognition; knowledge representation	Low–Med	Med	High	Med	Feature-independence assumption
Unsupervised (k-Means, DBSCAN, EM)	Segmentation; clustering; anomaly detection; bin-picking support	Low labels	–	–	–	Sensitive to feature space and hyperparameters
Kalman/Particle Filters	Pose tracking; trajectory prediction	–	Med–High	High	Med	Model mismatch can degrade tracking
Deep-learning-based methods
CNNs/Detectors (YOLO, SSD, Faster R-CNN)	Detection; classification; inspection; recognition; collision-aware perception	High	High	High	High	Compute-heavy; limited interpretability
Segmentation (U-Net, DeepLab, Mask R-CNN)	Pixel-wise segmentation; manipulation; bin picking; digital twins	High	High	High	High	Annotation cost; compute; limited interpretability
Pose nets (OpenPose, HRNet, SCC-HRNet)	Keypoints/pose; intention inference; action prediction	High	High	High	High	Occlusion + latency constraints
Depth/3D nets	6D pose; reconstruction; grasping; navigation	High (RGB-D)	High	High	High	Sensor dependence; domain shift; compute
ViTs (DETR, Swin)	Long-range context; scene understanding; language-guided tasks	High	High	High	High	Training/data cost; compute
RNN/LSTM	Sequential modeling; motion prediction; gesture/anomaly detection	High	High	High	High	Latency/training complexity
GAN/VAE	Augmentation; synthetic data; representation learning	Synthetic	Med–High	High	High	Stability + realism validation issues

– indicates that the characteristic is context-dependent or not directly applicable to this method category.
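
To make the Kalman/Particle Filters row of Table 1 concrete, the following is a minimal sketch, not taken from any of the surveyed works, of a one-dimensional constant-velocity Kalman filter in plain Python. The class name `ConstantVelocityKalman1D`, the helper `track`, the noise values `q` and `r`, and the simplified diagonal process noise are all assumptions chosen for illustration; a deployed tracker would typically run one such filter per keypoint coordinate or use a full multi-dimensional state.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state (position, velocity)."""

    p: float = 0.0    # position estimate
    v: float = 0.0    # velocity estimate
    p00: float = 1.0  # entries of the symmetric 2x2 covariance matrix P
    p01: float = 0.0
    p11: float = 1.0
    q: float = 1e-3   # process-noise variance added per step (assumed value)
    r: float = 1e-2   # measurement-noise variance (assumed value)

    def predict(self, dt: float) -> None:
        """Propagate with F = [[1, dt], [0, 1]] and P <- F P F^T + q I."""
        self.p += dt * self.v
        p00 = self.p00 + 2.0 * dt * self.p01 + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p11 = p00, p01, p11

    def update(self, z: float) -> None:
        """Correct the state with a position measurement z (H = [1, 0])."""
        s = self.p00 + self.r                # innovation variance
        k0, k1 = self.p00 / s, self.p01 / s  # Kalman gain
        y = z - self.p                       # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P, written out for the 2x2 case
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


def track(measurements, dt):
    """Run predict/update over position measurements; return (p, v) per step."""
    kf = ConstantVelocityKalman1D()
    estimates = []
    for z in measurements:
        kf.predict(dt)
        kf.update(z)
        estimates.append((kf.p, kf.v))
    return estimates


if __name__ == "__main__":
    dt = 0.1
    # Synthetic, noise-free coordinate moving at 0.5 units per second.
    zs = [0.5 * dt * k for k in range(1, 201)]
    p, v = track(zs, dt)[-1]
    print(f"final position {p:.3f}, velocity {v:.3f}")
```

When the tracked hand or tool accelerates, the constant-velocity assumption no longer holds and the estimate lags the true motion, which is the model-mismatch limitation noted in the table.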

Table 2. Software frameworks and implementation platforms for vision-enabled collaborative robots.

Category	Framework/Platform	Primary Role	Typical CV Tasks	Strengths	Limitations
Robot middleware	ROS/ROS2	Perception–control integration	Image streaming; sensor fusion; vision-to-motion pipelines	Modular; open-source; widely adopted [68,71]	Real-time constraints; setup complexity
Robot middleware	Industrial robot SDKs	Robot–vision interfacing	Vision-guided manipulation; calibration-to-actuation integration	Industrial reliability; vendor support	Vendor lock-in; limited flexibility
Vision libraries	OpenCV	Classical & hybrid vision	Detection; tracking; calibration; preprocessing	Lightweight; real-time capable [66]	Limited native deep learning support
Vision libraries	PCL	3D perception	Point-cloud processing; workspace modeling; registration	Strong 3D tooling; mature ecosystem	Computationally intensive
Deep learning frameworks	PyTorch	Model development & training	CNNs; Transformers; multimodal learning	Flexible; research-friendly	Requires deployment optimization
Deep learning frameworks	TensorFlow/TF Lite	Training & deployment	Industrial inspection; edge inference	Strong deployment ecosystem	Less flexible for rapid research iteration
Prototyping platforms	MATLAB/Simulink	Rapid prototyping & validation	Vision algorithm prototyping; control testing	Fast development; robust toolboxes	Proprietary licensing and toolbox costs
Simulation environments	Gazebo; Isaac Sim; CoppeliaSim	Virtual testing	Vision-based navigation; HRC evaluation; synthetic data generation	Safe testing; reproducibility [69,72]	Sim-to-real gap
Edge AI platforms	NVIDIA Jetson	On-device inference	Real-time detection; gesture/action recognition	High performance; GPU acceleration	Power and thermal constraints relative to server-grade hardware
Edge AI platforms	Intel OpenVINO	Optimized inference	Industrial inspection; CPU-optimized pipelines	Efficient CPU usage; deployment tooling	Model/operator constraints; optimized primarily for Intel hardware
Integration ecosystem	Python stack	Pipeline integration	Data handling; inference orchestration; ROS integration	Flexible; easy to prototype	Performance tuning needed
Integration ecosystem	Docker/MLOps tools	Deployment & scaling	Model lifecycle; reproducibility; CI/CD	Reproducibility; portability	Industrial adoption challenges; requires DevOps expertise

Table 3. Comparison of MATLAB and Python programming environments for computer vision development in collaborative robotics based on key practical criteria.

Feature	MATLAB	Python
Development Environment	Proprietary integrated environment with specialized toolboxes	Open-source and flexible ecosystem relying on diverse libraries and frameworks
Ease of Use	High, particularly for deep learning development and debugging	High, due to readability and extensive community support
Libraries/Frameworks	Comprehensive proprietary toolboxes	Extensive open-source libraries (e.g., PyTorch, Scikit-learn, Scikit-image)
Performance	Interpreted environment; performance depends on toolbox implementation and available hardware acceleration (e.g., GPU)	Typically high when using GPU-accelerated frameworks such as PyTorch
Cost	Commercial software requiring licenses	Free and open-source
Support	Dedicated vendor support and extensive documentation	Primarily community-driven support with extensive documentation and a large user ecosystem; no dedicated vendor service-level agreements by default
Robotics Integration	Dedicated Robotics System Toolbox available with ROS and ROS2 support	Strong native integration with ROS, benefiting from extensive community-developed packages and tighter middleware coupling; adaptable for industrial applications
Deployment	Embedded deployment via MATLAB Coder/Simulink; edge/cloud deployment requires additional setup	Flexible deployment across edge, cloud, and embedded systems with widely available tooling

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
