Review

A Comprehensive Review of Human-Robot Collaborative Manufacturing Systems: Technologies, Applications, and Future Trends

College of Mechanical and Electrical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
*
Author to whom correspondence should be addressed.
Sustainability 2026, 18(1), 515; https://doi.org/10.3390/su18010515
Submission received: 24 November 2025 / Revised: 16 December 2025 / Accepted: 31 December 2025 / Published: 4 January 2026
(This article belongs to the Special Issue Sustainable Manufacturing Systems in the Context of Industry 4.0)

Abstract

Amid the dual trends of Industry 5.0 and smart manufacturing integration, as well as the global imperative for manufacturing sustainability to address resource constraints, carbon neutrality goals, and circular economy demands, human–robot collaborative (HRC) manufacturing has emerged as a core direction for reshaping production modes while aligning with sustainable development principles. This paper comprehensively reviews HRC manufacturing systems, summarizing their technical framework, practical applications, and development trends with a focus on the synergistic realization of operational efficiency and sustainability. Addressing the rigidity of traditional automated lines, the inefficiency of manual production, and the unsustainable drawbacks of high energy consumption and resource waste in conventional manufacturing, HRC integrates humans’ flexible decision-making and environmental adaptability with robots’ high-precision and continuous operation. It thereby not only improves production efficiency, quality, and safety but also optimizes resource allocation, reduces energy consumption, and minimizes production waste to bolster manufacturing sustainability. Its core technologies include task allocation, multimodal perception, augmented interaction (AR/VR/MR), digital twin-driven integration, adaptive motion control, and real-time decision-making, all of which can be tailored to support sustainable production scenarios such as energy-efficient process scheduling and circular material utilization. These technologies have been applied in the automotive, aeronautical, astronautical, and shipping industries, boosting high-end equipment manufacturing innovation while advancing the sector’s sustainability performance. Finally, challenges and future directions of HRC are discussed, emphasizing its pivotal role in driving manufacturing toward a balanced development of efficiency, intelligence, flexibility, and sustainability.

1. Introduction

In the current global manufacturing landscape, sustainability has evolved from a peripheral environmental initiative to a core strategic imperative, driven by escalating concerns over climate change, dwindling global resources, and stringent carbon neutrality mandates across regions. Amid the in-depth integration of Industry 5.0 [1] and smart manufacturing [2], human–robot collaborative (HRC) manufacturing [3] has emerged as a core direction reshaping the production mode of the manufacturing industry. In traditional manufacturing, the rigid limitations of automated production lines, the efficiency bottlenecks of manual production, and unsustainable drawbacks (including excessive energy consumption, low resource utilization rates, and high production waste) have become increasingly prominent, prompting the industrial sector to accelerate the exploration of collaborative operation modes between humans and robots that balance technical advancement and environmental responsibility. By integrating humans’ flexible decision-making capabilities and adaptability to complex environments with robots’ high-precision execution and continuous operation advantages, HRC realizes all-round improvements in production efficiency, product quality, and operational safety, serving as a key support for the transformation of the manufacturing industry towards intelligence and flexibility [4].
As shown in Figure 1, the efficient and sustainable implementation of HRC manufacturing hinges on the synergistic support of its core technology system, with each module inherently contributing to the circularity, resource efficiency, and long-term viability of industrial production. HRC task allocation technology [5] establishes a reasonable division of labor through task decomposition, personnel skill matching, and precise task assignment, laying the foundation for collaborative operations. This approach minimizes resource idle time, optimizes human–robot resource allocation, and curtails unnecessary energy consumption, thus bolstering the sustainability of production workflows. Multimodal perception technology [6] achieves comprehensive capture of the production environment, operation instructions, and personnel status by virtue of text, image, and video perception capabilities, ensuring the integrity and timeliness of information interaction. HRC augmented interaction [7] breaks the interaction barrier between humans and robots with the help of augmented reality, virtual reality, and mixed reality technologies, improving operational convenience and immersion. Digital twin-driven human–robot integration [8] further constructs a linked virtual–real operation scenario through digital modeling, data association, and virtual–real mapping, providing full-process visualization and traceability support for the collaborative process and forming a “perception–interaction–modeling” technical closed loop.
The in-depth coupling and practical application of core technologies promote HRC manufacturing to form a complete technical chain and industrial value closed loop. Adaptive motion control [9] ensures the safe and precise execution of robot tasks in dynamic operating environments through path planning, collision detection and code generation. HRC decision-making [10] realizes real-time adaptation and optimization adjustment to production disturbances relying on dynamic decision, self-learning, and rapid response capabilities, improving system flexibility. These technologies are ultimately applied in key fields, such as the automotive industry, aeronautical industry, astronautical industry, and shipping industry, transforming technological advantages into industrial competitiveness, promoting the innovation of production modes in scenarios such as high-end equipment manufacturing and complex component processing, and accelerating the evolution of the manufacturing industry towards an efficient, intelligent, and flexible future form, which highlights the core driving role of HRC manufacturing technology in industrial upgrading.
The remainder of this paper is organized as follows: Section 2 introduces human–robot collaborative task allocation, including deep learning-based human–robot collaborative task allocation and large language model-based human–agent collaborative task allocation methods; Section 3 presents the concept, significance, and development of multimodal perception, covering multimodal perception based on deep learning architectures and on pre-trained large models; Section 4 elaborates on the concept and core value of human–robot hybrid augmented interaction, including traditional hybrid augmented interaction methods based on AR/VR/MR and hybrid augmented interaction methods based on visual language models; Section 5 outlines the basic concepts of human–robot fusion enabled by digital twin, encompassing environmental perception and scene modeling, human state perception and modeling, as well as system organization and collaborative logic for complex collaborative scenarios; Section 6 focuses on adaptive motion control, covering its basic concept and adaptive motion control based on path planning, collision detection, and code generation; Section 7 discusses human–robot collaborative decision-making, covering human–robot collaborative decision-making based on large language models and human–AI collaborative decision-making based on reinforcement learning; Section 8 describes the applications of human–robot collaborative manufacturing technology; Section 9 presents conclusions and future work.

2. Human–Robot Collaborative Task Allocation

An overview of human–robot collaborative task allocation methods can be seen in Table 1. Human–robot collaborative task allocation refers to the dynamic, evidence-based distribution of assembly tasks between human operators and cobots within smart manufacturing ecosystems, with an explicit emphasis on embedding sustainability into core operational workflows. Beyond its foundational goal of leveraging the complementary competencies of human workers (such as adaptive problem-solving, contextual flexibility, and cognitive judgment) and cobots (including micron-level precision, uninterrupted endurance, and repeatable task execution), it further advances sustainable manufacturing objectives by optimizing resource utilization, minimizing energy consumption, and reducing operational waste. By strategically assigning high-energy, high-repetition tasks to cobots (to curtail unnecessary power expenditure and equipment wear) and reserving complex, context-dependent tasks for human operators (to avoid costly errors and rework), this framework simultaneously enhances assembly efficiency, safeguards workplace ergonomics and safety, and extends the lifecycle of production assets, creating a synergistic balance between operational performance, worker well-being, and the long-term environmental and economic sustainability of manufacturing systems [11,12]. As manufacturing evolves from Industry 4.0 to Industry 5.0, task allocation methodologies have transitioned from static planning toward dynamic optimization [13], with modern frameworks now integrating advanced technologies to enable autonomous collaboration and real-time adaptability [14]. Furthermore, contemporary strategies prioritize not only economic efficiency but also the cultivation of human–robot trust and worker well-being. Ultimately, this holistic approach realizes human-centric smart manufacturing, balancing productivity with critical considerations of safety and ergonomics.

2.1. Deep Learning-Based Human–Robot Collaborative Task Allocation

Deep learning, a pivotal branch of machine learning, constructs multi-layer neural networks to simulate human brain structures, automatically extracting deep features from raw data without manual feature engineering. It has been widely applied in image recognition, natural language processing, and other fields. Its core advantage lies in processing high-dimensional and complex data, optimizing network parameters via backpropagation algorithms to achieve precise predictions and decisions, thereby providing robust technical support for intelligent optimization of complex systems. Applying deep learning to human–robot collaborative task allocation enhances task allocation intelligence through real-time perception, intent recognition, and adaptive decision-making. Real-time perception relies on multimodal data processing capabilities to accurately capture the status of operational resources, the dynamics of task scenarios, and the details of human operations. Intention recognition leverages temporal modeling techniques to deeply interpret the potential demands of humans, while incorporating key factors such as ergonomics and task complexity. Joo, T. et al. [15] proposed a deep learning-based dynamic allocation method, which uses LSTM networks to capture unobservable features such as the fatigue level and task capability of human operators. When tested in a manufacturing system consisting of five operators and ten machines, the method achieved shorter average flow time with smaller fluctuations than the FIFO and SPT rules, while effectively balancing the workload between humans and machines.
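To make this pattern concrete, the minimal PyTorch sketch below illustrates the general idea behind such LSTM-based dynamic allocation: a recurrent network summarizes a sequence of recent operator observations into a latent state, and a linear head scores candidate assignees. The feature set, dimensions, and the human/robot label encoding are illustrative assumptions for this sketch, not details of the method in [15].

```python
import torch
import torch.nn as nn

class OperatorStateLSTM(nn.Module):
    """Encodes a sequence of operator observations (e.g., recent cycle times,
    error counts, idle periods) into a latent state and scores candidate
    assignees. Feature names and sizes are illustrative, not taken from [15]."""
    def __init__(self, obs_dim=6, hidden_dim=32, num_assignees=2):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_assignees)  # scores: human vs. robot

    def forward(self, obs_seq):
        # obs_seq: (batch, time, obs_dim)
        _, (h_n, _) = self.lstm(obs_seq)
        return self.head(h_n[-1])            # (batch, num_assignees) allocation logits

# Toy usage: 4 pending tasks, each with a 10-step observation history of 6 features.
model = OperatorStateLSTM()
logits = model(torch.randn(4, 10, 6))
assignment = logits.argmax(dim=-1)            # 0 -> human operator, 1 -> cobot
print(assignment)
```

In practice the observation features would come from the shop-floor data streams discussed above, and the logits would feed a scheduler rather than a hard argmax.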
Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are widely used to recognize human operators’ intentions and actions in real time. CNNs are deep learning models proficient in processing grid-structured data such as images. They automatically extract spatial features through operations like convolution and pooling, and are widely applied in vision-related recognition and analysis tasks. RNNs are a class of deep learning models suitable for temporal or sequential data. By retaining historical information via internal recurrent structures, they can capture temporal dependencies in data, making them commonly used in scenarios such as natural language processing (NLP) and action sequence recognition. Gao, Z. et al. [16] proposed a hybrid CNN architecture combining ST-GCN and 1DCNN, collecting skeletal sequences via Azure Kinect. This architecture achieved an accuracy of 91.7% in human–robot collaborative assembly action recognition, significantly outperforming single-model approaches. Mavsar, M. et al. [17] designed two RNN models—OptiNet (based on hand trajectories) and HandNet (based on RGB-D videos)—for target intention recognition in industrial assembly. As the action observation ratio increases, the accuracy gradually approaches 100%. In collaborative assembly tasks, robots continuously perceive human behaviors and dynamically reallocate tasks. Wang, P. et al. [18] employed dual-stream CNNs to simultaneously analyze visual data and object states, enabling proactive robotic assistance and dynamic task adjustment. Bandi, C. et al. [19] developed a skeleton-based deep learning classifier for personalized motion recognition, improving collaboration accuracy. Gao, X. et al. [20] proposed an improved hybrid RNN architecture integrating Bi-LSTM with LSTM. Based on spatiotemporal data captured by Kinect, this architecture accurately recognizes the operator’s intention in the collaborative assembly tasks of the UR5 robot, effectively reducing prediction errors.
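As a simplified illustration of skeleton-based action recognition in the spirit of these hybrid architectures, the sketch below applies temporal 1D convolutions to flattened joint coordinates. It is a minimal stand-in under assumed joint counts and action classes, not a reproduction of the ST-GCN or Bi-LSTM models cited above.

```python
import torch
import torch.nn as nn

class SkeletonActionNet(nn.Module):
    """Toy action classifier over skeleton sequences: 1D convolutions over time
    on flattened joint coordinates, followed by global pooling. A stand-in for
    the ST-GCN + 1DCNN hybrids discussed above, not a reproduction of [16]."""
    def __init__(self, num_joints=32, coords=3, num_actions=8):
        super().__init__()
        in_ch = num_joints * coords          # flatten all joints of one frame
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool over the time axis
        )
        self.fc = nn.Linear(128, num_actions)

    def forward(self, skel):                  # skel: (batch, time, joints, 3)
        b, t, j, c = skel.shape
        x = skel.view(b, t, j * c).transpose(1, 2)   # -> (batch, channels, time)
        return self.fc(self.conv(x).squeeze(-1))     # action logits

# Example: a 2 s clip at 30 fps from an Azure-Kinect-style 32-joint skeleton.
logits = SkeletonActionNet()(torch.randn(1, 60, 32, 3))
print(logits.argmax(dim=-1))   # predicted action class, e.g., "grasp" or "place"
```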
Task complexity assessment is another critical area. Deep learning optimizes allocation decisions through neural network-based scoring systems. Malik, A.A. et al. [21] proposed a dual-dimensional complexity metric integrating operational difficulty (e.g., required precision) and cognitive load, assigning high-cognition tasks to humans and repetitive tasks to robots, reducing fatigue and increasing productivity. Barathwaj, N. et al. [22] incorporated complexity metrics into genetic algorithms for assembly line balancing, considering energy consumption and skill levels to minimize costs and improve ergonomics.
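A minimal sketch of such neural scoring is given below, assuming hypothetical task descriptors (e.g., required precision, part count, decision branches); the thresholding rule mirrors the high-cognition-to-human, repetitive-to-robot policy described above but is not the metric of [21] or [22].

```python
import torch
import torch.nn as nn

class ComplexityScorer(nn.Module):
    """Illustrative dual-dimensional scorer: maps hypothetical task descriptors
    to an operational-difficulty score and a cognitive-load score."""
    def __init__(self, feat_dim=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 16), nn.ReLU(),
            nn.Linear(16, 2),        # [operational difficulty, cognitive load]
        )

    def forward(self, task_feats):
        return self.net(task_feats)

def allocate(task_feats, scorer, cog_threshold=0.5):
    """Assign high-cognition tasks to the human, repetitive/low-cognition to the robot."""
    difficulty, cognition = scorer(task_feats).sigmoid().unbind(dim=-1)
    return ["human" if c > cog_threshold else "robot" for c in cognition.tolist()]

scorer = ComplexityScorer()
print(allocate(torch.rand(3, 5), scorer))    # e.g., ['robot', 'human', 'robot']
```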
Digital twin technology further enhances these methods by creating virtual replicas of physical systems for real-time synchronization and iterative optimization. Sun X. et al. [23] proposed a digital twin-driven human–robot collaborative assembly and commissioning method, integrating part feature recognition, task knowledge graph, and the DDPG algorithm. Validated through an automotive generator case study, this method established an efficient and accurate collaborative assembly and commissioning technology system, significantly improving the assembly and commissioning performance of complex products. Wang, J. et al. [24] pre-trained reinforcement learning policies in virtual twins and migrated them to physical systems, reducing experimental costs and improving robustness.
Despite significant progress in human–robot collaborative task allocation driven by deep learning, challenges persist in data efficiency, real-time performance, and security. Deep learning models require large amounts of training data, which incur high collection costs in real-world environments, while complex models may introduce computational latency that compromises real-time control, as shown in Figure 2.

2.2. Large Language Model-Based Human–Agent Collaborative Task Allocation Methods

As a core breakthrough in artificial intelligence, Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing, knowledge reasoning, and multimodal interaction in recent years, holding vast promise in human–agent collaboration. Through precise parsing of natural language instructions, multimodal environmental data, and human intent expressions via semantic understanding, as well as dynamic adaptation of task sequences, language variants, and real-time errors through contextual reasoning, Large Language Models (LLMs) can independently accomplish task decomposition, priority ranking, and human–robot division of labor optimization. This enables dynamically adaptive task planning and decision-making, thereby significantly improving the efficiency and flexibility of human–robot collaboration. Dimitropoulos, N. et al. [25] adopted an integration framework combining Large Multimodal Models (LMMs) and digital twins, processing video and audio multimodal data to automatically generate task models and allocation schemes. In the case study of white goods assembly, this framework reduced manual planning effort, improved resource utilization, and achieved task execution efficiency comparable to traditional heuristic methods. Lim, J. et al. [26] leveraged GPT-4.0 to integrate voice instructions and visual sensor data, adapting to language variants with different levels of specificity and dynamically addressing issues such as component overlap and assembly errors. In the cable shark assembly task, the success rate of understanding specific instructions reached 98%, effectively ensuring the continuity and flexibility of the collaborative process, which fully validates the core value of LLMs. Chen, J. et al. [27] designed a “Perception–Decision–Execution” brain-inspired architecture and corresponding coordination mechanism based on Multimodal Large Language Models (MLLMs), enabling the Human-Like Collaborative Robot (HLCobot) to achieve dynamic and autonomous human–robot collaboration. Validated through engine assembly experiments, this approach successfully established an end-to-end collaborative workflow, with the robot collaboration success rate reaching 90–95%. This significantly enhances adaptability to dynamic scenarios and the overall intelligence level of the system. Sihan, H. et al. [28] highlighted LLMs’ strength in fusing visual, tactile, and semantic information, achieving high-level decision-making via natural language interaction and significantly enhancing the intelligence level of task allocation. This indicates that LLMs can not only handle static task sequences but also adapt to dynamic environmental changes, offering a new paradigm for human–agent collaboration.
LLM-based task allocation methods typically construct a closed-loop system encompassing perception, decision-making, and execution. The core role of LLMs in task allocation lies in integrating multi-source heterogeneous information such as language instructions, environmental perception data, equipment operating status, and human intentions. Through semantic understanding, contextual reasoning, and dynamic planning, LLMs conduct intelligent decision-making, flexibly achieving task decomposition, priority ranking, and human–robot division of labor adaptation, thereby realizing efficient and adaptive human–robot collaboration. Liu, Z. et al. [29] constructed a human–robot collaborative decision-making system by integrating natural language processing (NLP), multimodal fusion, and machine learning technologies. Leveraging LLMs to integrate multi-source information and perform logical reasoning, the system achieved dynamic human–robot division of labor and efficient collaboration in scenarios such as intelligent production and business decision-making. This validates the core value of LLMs in enhancing the flexibility, adaptability, and scientific rigor of decision-making in task allocation. Kong, F. et al. [30] proposed a task complexity-driven allocation model that uses LLMs for task decomposition and priority sorting, minimizing assembly time and maximizing resource utilization. Additionally, Bilberg, A. et al. [31] adopted LLMs to generate simulation instructions based on digital twin technology for human–robot collaborative assembly, achieving dynamic task allocation, load balancing, trajectory optimization, and robot program generation. Validated through linear actuator assembly experiments, this approach improved the flexibility and automation level of high-mix production.
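The closed-loop pattern described above can be sketched as a prompt-and-parse routine: the LLM receives the task description and cell state, and returns a machine-readable allocation plan. The prompt wording, the JSON schema, and the call_llm stub below are assumptions for illustration; they do not reproduce the prompts or interfaces of the cited systems.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM service (e.g., a GPT-4-class model).
    Swap in your provider's client here; this stub only illustrates the flow."""
    raise NotImplementedError

def allocate_with_llm(task_description: str, station_state: dict) -> list[dict]:
    """Ask the LLM to decompose an assembly task and assign each subtask to the
    human or the cobot, returning a machine-readable plan. The schema is
    illustrative, not taken from the cited systems."""
    prompt = (
        "You are a planner for a human-robot assembly cell.\n"
        f"Task: {task_description}\n"
        f"Cell state: {json.dumps(station_state)}\n"
        "Decompose the task into subtasks. Assign repetitive, high-precision steps "
        "to the robot and judgment-heavy steps to the human. Respond as JSON: "
        '[{"subtask": str, "assignee": "human"|"robot", "priority": int}]'
    )
    plan = json.loads(call_llm(prompt))
    return sorted(plan, key=lambda s: s["priority"])

# Example call (requires a concrete call_llm implementation):
# plan = allocate_with_llm("Assemble the speed reducer housing",
#                          {"parts_in_tray": ["housing", "gear", "bolts"]})
```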
In practical applications, LLM-driven task allocation methods have been validated across multiple industrial scenarios (as shown in Figure 3). Cai, M. et al. [32] applied LLMs to complexity assessment and dynamic allocation in speed reducer assembly, achieving autonomous coordination between human operators and robots and reducing assembly time by over 20%. Xuquan, J.I. et al. [33] developed a vision-LLM-integrated intelligent system for satellite payload assembly, completing high-precision tasks through pose recognition and path planning. Wang, Y. et al. [34] leveraged LLMs for multi-robot coordination in electronics manufacturing, scheduling resources via natural language instructions to enhance production line flexibility. These cases highlight LLMs’ adaptability in complex environments, not only improving production efficiency but also strengthening system robustness and flexibility.

3. Concept, Significance, and Development of Multimodal Perception

To effectively execute the human–robot collaborative task allocation described in Chapter 2, a robot must first be able to accurately perceive the dynamic work environment and comprehend its human partner’s intent. This requires the system not only to process visual information but also to interpret auditory commands and sense physical interactions. Consequently, multimodal perception technology constitutes the foundation for efficient and safe collaboration, serving as the critical information bridge that connects task planning with physical execution.
An overview of multimodal perception methods can be seen in Table 2. In the context of industrial sustainability (a core imperative for modern manufacturing sectors such as aerospace and high-end precision equipment production), multimodal perception denotes a system’s capability to concurrently acquire, process, and synergistically fuse heterogeneous data streams from diverse sensor modalities (e.g., vibration sensors, thermal imagers, acoustic detectors, and vision systems). By integrating these complementary data sources, the system generates a holistic, resilient representation of operational environments or interacting components, which in turn underpins sustainability-centric objectives: from minimizing resource waste via predictive fault diagnosis to optimizing energy consumption through real-time process monitoring, and enhancing lifecycle efficiency of critical manufacturing assets [35]. These modalities may include visual (e.g., RGB images, depth maps, skeleton data), auditory (e.g., speech, ambient sounds), tactile (e.g., force, torque, and surface electromyography (sEMG) signals), and textual information [36]. Data from a single modality is often insufficient to fully describe complex real-world scenarios. Relying solely on visual information may fail to accurately infer an operator’s intent or a machine’s internal state; however, integrating auditory and tactile information can significantly enhance the system’s depth of situational understanding and decision-making accuracy [37]. The advancement of multimodal perception technologies enables human–robot collaboration systems to operate more naturally, safely, and efficiently. In collaborative assembly, robots must interpret human posture, intent, and applied forces to dynamically adjust their own behavior, avoid collisions, and cooperatively complete tasks [38]. This capability (i.e., real-time adaptive human–robot collaborative decision-making and safety-aware task reconfiguration) is critical for intelligent manufacturing and human–robot integration across Industry 4.0 and Industry 5.0 [39]. In Industry 4.0, it enables seamless cyber–physical interaction and flexible production scheduling via data-driven human–robot coordination, breaking fixed task allocation constraints. For Industry 5.0, which prioritizes human-centricity and resilience, the capability facilitates real-time adaptation of robotic behaviors to human cognitive/physiological states and intuitive human intervention in robot decision-making, underpinning human–machine symbiotic manufacturing. Its absence would impede the transition from Industry 4.0 mass customization to Industry 5.0’s personalized, human–machine collaborative production paradigm.
Multimodal perception technology has evolved from early single-sensor processing to multi-sensor fusion, and more recently to deep learning and pre-trained large model-driven approaches. Early methods primarily relied on traditional signal processing and machine learning techniques for feature extraction and fusion, but exhibited limited robustness and generalization capabilities in complex and dynamic environments. With the rise of deep learning, its powerful capacity to handle high-dimensional, nonlinear data has brought breakthroughs to multimodal perception. Particularly in recent years, the emergence of pre-trained large models (PLMs) [40] has further propelled multimodal perception toward higher-level cognitive understanding. Accordingly, this chapter reviews multimodal perception methods driven by these advanced technologies. Specifically, Section 3.1 explores perception techniques based on deep learning architectures such as CNNs, RNNs, and BERT, detailing how they extract and fuse features from individual modalities like vision, audio, and text. Subsequently, Section 3.2 focuses on cutting-edge methods based on PLMs, analyzing how they achieve higher-level cognitive understanding and situational awareness through cross-modal learning, thereby providing robust information input for HRC systems.

3.1. Multimodal Perception Based on Deep Learning Architectures

Deep learning architectures have achieved significant progress in the field of multimodal perception, enabling automatic learning of complex feature representations from raw multimodal data and facilitating effective fusion and decision-making. Typical deep learning architectures include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers [41].
CNNs are the core technology in the field of visual perception, particularly excelling in processing visual information [42]. ResNet [43], as an efficient CNN architecture, addresses the vanishing gradient problem in training deep networks by introducing residual connections, allowing networks to be constructed deeper and thereby learn richer image features. In human–robot collaboration scenarios, CNNs can identify operators’ skeletal keypoints, body postures, and ongoing actions—such as grasping, placing, or screwing—by analyzing RGB or depth images. This capability is crucial for robots to understand human intentions, predict human motion trajectories, and adjust their own behaviors to ensure safe collaboration. CNNs [44] can recognize tools, parts, and obstacles in the workspace and accurately estimate their positions and orientations, enabling robots to precisely grasp target objects, plan collision-free paths, and respond to environmental changes. In a human–robot collaboration system [45], an RGB-D camera is used to perceive human posture and actions, while a USB camera mounted on the end-effector enables visual grasping. The cognitive module extracts 3D human models and actions from RGB-D data through preprocessing, human mesh recovery, and action recognition steps using an encoder–decoder architecture, thereby generating a digital human in the digital twin space to inform robot planning and responses.
In speech perception (as shown in Figure 4), traditional natural language processing (NLP) techniques [46], particularly models based on RNNs and their variants such as LSTM and GRU, possess inherent advantages in processing sequential data. These models can capture temporal dependencies in speech signals for tasks including speech recognition, intent recognition, and sentiment analysis. In human–robot collaboration, operators can issue commands to robots via voice; NLP models convert speech into text and extract key action and target information for robot execution [47]. Robots can understand natural language questions from humans and generate appropriate responses, enabling smoother and more intuitive interaction [48]. By analyzing speech intonation and content, operators’ emotional states or fatigue levels can be inferred, allowing robots to adjust work pace or trigger alerts when necessary, thereby enhancing collaboration safety and comfort [49].
Bidirectional Encoder Representations from Transformers (BERT) [50] is a landmark model in the field of NLP. Leveraging the encoder architecture of Transformers, BERT learns bidirectional semantic information of words and sentences across diverse contexts through pre-training on large-scale corpora, demonstrating exceptional capability in understanding textual content and achieving state-of-the-art performance across various NLP tasks such as question answering, text classification, and named entity recognition. In human–robot collaborative systems, BERT can deeply comprehend the semantics of complex textual instructions or workflow descriptions, extract key information, and assist robots in executing more sophisticated tasks. As illustrated in the multimodal sentiment analysis system architecture [51], BERT is employed for feature extraction from the text modality. Text input is first processed through the BERT model, then passed through a self-attention layer to generate deep representations of the text, which are subsequently used for unimodal and multimodal tasks.
Under deep learning frameworks, multimodal information fusion typically occurs at the feature level, the decision level, or intermediate layers of the network [52]. Feature-level fusion involves concatenating or combining raw data or low-level features from different modalities prior to inputting them into a deep learning model. Decision-level fusion, by contrast, processes data from each modality through separate deep learning models to generate individual predictions, which are then fused at the decision level. Intermediate-level fusion performs feature fusion at different layers within the deep learning network. For instance, image features are extracted using ResNet, text features using BERT, and then fused at higher layers via attention mechanisms or multilayer perceptrons [53]. Wang, T. et al. [54] proposed a data-efficient multimodal human action recognition method that enables robots to learn and execute similar operations by observing human demonstrations, leveraging multimodal feature extraction, few-shot learning, and domain adaptation. In this approach, skeleton graphs are processed via an improved ST-GCN, RGB frames via early-fusion ResNet, and features from different sources are ultimately combined through a late-fusion block.
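The following sketch illustrates the ResNet-plus-BERT fusion pattern mentioned above: image and text features are extracted by the two encoders, concatenated, and passed to a small classifier. Checkpoint names, feature dimensions, and the downstream task are illustrative; the example assumes the publicly available torchvision and Hugging Face Transformers libraries.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import AutoTokenizer, AutoModel

class LateFusionClassifier(nn.Module):
    """Minimal fusion sketch: ResNet image features and BERT text features are
    concatenated and classified by an MLP. The task (e.g., classifying the next
    assembly step from a frame plus a spoken/typed instruction) is illustrative."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.img_enc = resnet18(weights=None)
        self.img_enc.fc = nn.Identity()                                 # 512-d image features
        self.txt_enc = AutoModel.from_pretrained("bert-base-uncased")   # 768-d text features
        self.fuse = nn.Sequential(
            nn.Linear(512 + 768, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, images, input_ids, attention_mask):
        img_feat = self.img_enc(images)                                 # (B, 512)
        txt_feat = self.txt_enc(input_ids=input_ids,
                                attention_mask=attention_mask
                                ).last_hidden_state[:, 0]               # [CLS] token, (B, 768)
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(["insert the bearing into the left housing"],
                   return_tensors="pt", padding=True)
model = LateFusionClassifier()
logits = model(torch.randn(1, 3, 224, 224), tokens["input_ids"], tokens["attention_mask"])
```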

3.2. Multimodal Perception Based on Pre-Trained Large Models

PLMs [55], such as BERT and the GPT series, have achieved remarkable success in the field of NLP by learning rich language representations through unsupervised pre-training on massive datasets. In recent years, this paradigm has been extended to the multimodal domain, giving rise to multimodal PLMs capable of simultaneously processing and understanding data across multiple modalities, including text, images, video, and speech. The core idea of multimodal PLMs is to leverage large-scale multimodal datasets for pre-training to learn general cross-modal representations and alignment relationships; through pre-training, these models can capture complex associations among different modalities, understand semantic relationships between modalities, and perform tasks such as image caption generation, visual question answering, and cross-modal retrieval. Owing to the broad knowledge acquired during pre-training, multimodal PLMs also demonstrate strong generalization capabilities when faced with new tasks or limited-sample data.
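As a concrete example of such cross-modal alignment, the sketch below performs zero-shot image–text matching with a public CLIP checkpoint via Hugging Face Transformers; the workstation-related label set and the dummy image are placeholders for illustration only.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot cross-modal matching with a public CLIP checkpoint. The part/defect
# labels below are illustrative workstation categories, not a benchmark.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a scratched gear", "an intact gear", "a missing bolt", "an empty fixture"]
image = Image.new("RGB", (224, 224))          # stand-in for a workstation camera frame

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs.squeeze().tolist())))   # similarity of the frame to each label
```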
In human–robot collaboration scenarios (as shown in Figure 5), robots can utilize multimodal PLMs to comprehend complex tasks involving textual descriptions, image examples, and spoken instructions, decomposing them into executable subtasks [56]. By integrating visual perception of human actions and facial expressions with auditory perception of vocal intonation and textual instructions, multimodal PLMs can more accurately predict operators’ real-time intentions and generate coordinated robotic behaviors. For instance, the system proposed by Laplaza, J. et al. [57] employs multi-head attention mechanisms and graph convolutional networks to fuse information on human motion, robot end-effector positions, and obstacle data to predict human intentions and future movements. By analyzing multimodal data streams, the model can identify unusual operational patterns, potential safety risks, or equipment failures, and issue timely warnings or initiate protective measures. Multimodal PLMs can serve as components within robotic learning frameworks, continuously learning and optimizing their perception and decision-making capabilities through sustained human interaction and accumulation of environmental data [58]. The cyber-system layer of the Human–Machine Work System (HMWS) [59] includes a data analytics module encompassing descriptive, diagnostic, predictive, and prescriptive analytics, aligning well with the capabilities of multimodal PLMs in understanding and analyzing complex data. The intelligent workstation gateway enables real-time data transmission via multiple protocols, connecting the hyper-human and hyper-machine components within the physical system and facilitating interaction through perception and action.
In recent years, this paradigm based on pre-trained large models has achieved breakthrough progress in applications within specific industrial sectors, especially in scenarios requiring fine-grained perception. Taking smart grid inspection as an example, traditional vision-language models like CLIP struggle to accurately identify subtle cracks on insulators or to distinguish between visually similar but semantically distinct situations, such as “corrosion” versus “shadows.” To address this challenge, Yan, H. et al. [60] proposed VisPower, a curriculum-guided framework for the power domain. This framework first performs global semantic grounding using a corpus of 100 K image–text pairs with long-text descriptions, followed by contrastive refinement using 24 K region-level data, including 12 K synthesized “hard-negative” samples, to significantly enhance the model’s ability to discriminate fine-grained defects. Experimental data shows that on their self-curated PowerAnomalyVL dataset, the model achieved an 18.4% absolute gain on the Recall@1 metric in the zero-shot image-text retrieval task, proving that domain-specific multimodal alignment strategies are key to promoting the adoption of large models in vertical industrial fields. Additionally, to address the prevalent issue of anomaly sample scarcity in industrial scenarios, researchers have begun to explore the frontier trend of combining generative large models with multimodal perception technology. The Zoom-Anomaly framework proposed by Li, J. et al. [61] provides a case in point. The framework utilizes a Denoising Diffusion Probabilistic Model (DDPM) to embed potential anomaly patterns into the high-frequency regions of normal samples, thereby generating a large volume of high-quality synthetic anomaly data for model training. This “synthetic data + multimodal learning” model not only effectively solves the “cold-start” problem in industrial applications but also achieved an image-level AUROC of 89.9% with only a few-shot samples on a real-world production line photovoltaic dataset, PV_actual AD, demonstrating its strong cross-domain adaptability and practical utility.

4. The Concept and Core Value of Human–Robot Hybrid Augmented Interaction

Against the backdrop of global efforts to advance sustainable manufacturing, the core technology system of human–robot collaborative production frameworks relies on targeted technical enablers to balance operational efficiency, human–machine synergy, and environmental stewardship. Among these enablers, Human–Robot Collaborative Augmented Interaction (Augmented Interaction) serves as a critical technical bridge connecting human operators and robotic systems. Beyond facilitating real-time information exchange, task coordination, and safety-critical interaction, this augmented interface directly contributes to the sustainability of HRC manufacturing by optimizing resource allocation (e.g., reducing redundant robotic movements and energy consumption), extending the lifecycle of both human workforce capabilities (via ergonomic task augmentation) and robotic assets (through predictive maintenance triggers enabled by interaction data), and minimizing production waste through enhanced process accuracy and adaptive task reconfiguration. The result is that it transforms a traditional human–robot connection into a sustainability-centric nexus that aligns HRC operations with circular economy principles and low-carbon manufacturing mandates [62]. The preceding sections have systematically elaborated on fundamental technologies such as task allocation, multimodal perception, and digital twin-driven integration. As the core hub of the “perception–interaction–execution” closed loop, augmented interaction aims to break down human–robot interaction barriers and enhance operational convenience and immersion—by integrating next-generation interaction technologies [63] including Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR), coupled with multimodal information analysis capabilities, it constructs an interaction paradigm of deep collaboration among “human–robot–environment” [64], laying a critical foundation for the subsequent implementation of adaptive control and collaborative decision-making. An overview of human–robot hybrid augmented interaction methods can be seen in Table 3.
In manufacturing scenarios, traditional human–robot interaction is plagued by inherent drawbacks such as non-intuitive information transmission and delayed operational feedback, which are particularly pronounced in tasks like high-precision assembly and complex component docking. HRC Augmented Interaction effectively addresses these shortcomings by precisely fusing virtual information with the physical production environment and creating immersive operational spaces [65]. Its core value is concentrated in three dimensions: first, it significantly lowers the operational threshold, enabling non-professionals to quickly adapt to collaborative work with robots and supporting flexible production scenarios with multiple varieties and small batches; second, it greatly improves interaction efficiency, achieving precise transmission and real-time response of complex instructions to meet the demands of high-precision manufacturing such as automotive parts assembly and aerospace component docking [66]; third, it comprehensively enhances operational safety—during close-range human–robot collaboration in dynamic production environments, it avoids collision and other safety risks through virtual guidance and hazard warnings [67], consolidating a solid safety foundation for efficient collaborative production.
Building on this, this chapter will further analyze traditional AR/VR/MR-based hybrid augmented interaction methods and emerging augmented interaction technologies driven by visual language models (VLMs). It will delve into their technical pathways, application scenarios, and optimization directions, providing a systematic perspective for understanding the technological evolution and industrial value of HRC Augmented Interaction.

4.1. Traditional Hybrid Augmented Interaction Methods Based on AR/VR/MR

The traditional hybrid augmented interaction method based on AR/VR/MR is the early core implementation path of human–robot hybrid augmented interaction technology, which meets the demands of diverse manufacturing scenarios through virtual–real fusion technologies in different dimensions (as shown in Figure 6) [68].
AR technology focuses on “virtual and real superimposition”. Through devices such as smart glasses and head-mounted displays, virtual information such as robot movement trajectories, part assembly steps, and equipment parameters is superimposed in real time onto the physical production scene [69]. In scenarios such as automotive parts assembly and aerospace component docking [70], operators can complete high-precision collaborative operations based on superimposed virtual guide lines, reducing reliance on experience [71]. Some advanced AR systems also integrate haptic feedback modules, which enhance operational accuracy through force perception simulation in micro-component assembly scenarios [72].
VR technology focuses on “virtual immersion”. By building a virtual simulation system consistent [73] with the physical production environment, it enables the preview [74], operation training [75], and remote control of human–machine collaborative processes. In hazardous working conditions (such as high-temperature and high-pressure operations), operators can remotely control robots in a virtual space, which not only ensures personal safety but also does not affect the production progress. The VR system combined with digital twin technology [76] can also achieve real-time synchronization of production data and process optimization.
MR technology achieves “virtual–real integration”, combining the reality superposition of AR with the immersive experience of VR, and supports two-way interaction between humans and virtual robots as well as physical robots. In complex equipment maintenance scenarios, maintenance personnel can call up virtual maintenance manuals through MR devices and synchronize robot-assisted operation actions in real time, thereby enhancing the efficiency of maintenance collaboration [77]. For precision scenarios [78] such as semiconductor manufacturing, MR systems have achieved sub-millimeter-level virtual–real alignment accuracy.
The core advantage of this type of method lies in its high technical maturity and strong scene adaptability. However, its interactive flexibility is limited by hardware devices, and it struggles to parse complex natural language instructions in depth.

4.2. Hybrid Augmented Interaction Methods Based on Visual Language Models

With the development of artificial intelligence technology, the hybrid augmented interaction method based on visual language models (VLMs) has become a breakthrough direction for the new generation of interaction technology, achieving full-chain intelligence of “language understanding–visual perception–interaction execution”, as shown in Figure 7.
This method takes the visual language model as the core, integrates natural language processing, computer vision, and augmented interaction technology, and possesses a closed-loop “understanding–perception–feedback” capability. Its core logic is that physical information in the production scene (such as part positions and personnel movements) is captured by the visual module, while the natural language instructions of operators (such as “Grasp the red part on the left”) are parsed by the language module. After fusion by the model, the corresponding virtual guidance information or robot control instructions are generated, and visual interaction and execution feedback are then delivered through AR/MR devices. For Chinese-language scenarios, specially optimized visual language models such as Chinese CLIP [79] have significantly improved the alignment accuracy between Chinese instructions and visual information through a two-stage pre-training method.
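The understanding–perception–feedback loop described above can be sketched as follows; the vlm_query stub, the prompt wording, and the JSON schema are assumptions for illustration rather than the interface of any cited model, and the downstream robot/AR helpers in the usage comment are hypothetical.

```python
import json

def vlm_query(image_bytes: bytes, prompt: str) -> str:
    """Placeholder for a vision-language model call (e.g., a Qwen-VL-class model).
    Returns the model's text answer; substitute a concrete client in practice."""
    raise NotImplementedError

def instruction_to_command(image_bytes: bytes, instruction: str) -> dict:
    """Illustrative loop: ground a natural-language instruction in the current
    camera frame and emit a structured robot command plus AR guidance text."""
    prompt = (
        f'Operator instruction: "{instruction}"\n'
        "Locate the referenced part in the image and answer as JSON: "
        '{"part": str, "pixel_bbox": [x1, y1, x2, y2], '
        '"robot_action": "grasp"|"hold"|"hand_over", "ar_hint": str}'
    )
    return json.loads(vlm_query(image_bytes, prompt))

# Example (requires a concrete vlm_query implementation; the two helpers below
# are hypothetical names for the robot and AR interfaces):
# cmd = instruction_to_command(frame, "Grasp the red part on the left")
# move_robot_to(cmd["pixel_bbox"]); display_ar_overlay(cmd["ar_hint"])
```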
In practical applications, this method has addressed the pain point of “single form of instruction transmission” in traditional technologies. For instance, in the scenario of customized parts assembly, operators only need to describe their assembly requirements through natural language. The system can then identify the type of parts and plan the assembly path through a visual language model, and superimpose dynamic guidance information through AR devices. At the same time, it can adjust the robot’s movements in real time to adapt to the operation rhythm of the personnel. The Qwen-VL [80] series models proposed by Alibaba, with their fine-grained visual understanding capabilities, can achieve real-time identification of part defects and linguistic feedback of assembly errors. Moreover, the multi-layer semantic fusion algorithm based on BERT [81] further improves the parsing accuracy of complex assembly instructions.
Studies on benchmark datasets such as VisTW [82] have shown that VLMs optimized for specific languages perform better in manufacturing scenario interactions, particularly in specialized settings such as traditional Chinese-language contexts. The core advantage of this type of method lies in its more natural interaction and stronger adaptability, which can quickly meet the flexible production demands of multiple varieties and small batches. However, it places higher requirements on computing resources and data quality.

5. Basic Concepts of Human–Robot Fusion Enabled by Digital Twin

As an emerging paradigm under the background of Industry 5.0, human–machine fusion aims to break the boundaries of traditional human–machine interaction and achieve in-depth integration of human intelligence and machine intelligence. This collaborative model not only involves physical coexistence but also emphasizes cognitive interconnection and emotional resonance. In this context, human factors are no longer regarded as constraints in system design but as key variables driving collaborative efficiency, covering multi-dimensional attributes such as physiological states (e.g., fatigue level, work intensity), psychological characteristics (e.g., emotion, trust), cognitive abilities (e.g., decision-making preferences, skill level), and behavioral uncertainty. These factors are highly individualized and context-dependent, making it difficult to model through static rules. There is an urgent need for a technical framework capable of dynamic perception, real-time response, and continuous evolution to provide support. An overview of human–robot fusion enabled by digital twin can be seen in Table 4.
Digital twin (DT) serves as an ideal carrier for this purpose and has become a core enabling technology to support the design, reconstruction, and verification of human–machine collaborative systems [83]. By establishing a bidirectional closed-loop mapping between physical entities and virtual models, DT not only achieves holographic perception of the “human–machine–environment” system but also enables task pre-simulation, conflict detection, and strategy optimization in the virtual space, acting as a “front-runner” throughout the system’s entire life cycle [84]. It is worth noting that current research mainly focuses on two aspects: one is core perception and modeling for the operational layer (Human-in-the-Loop, HitL), emphasizing real-time state capture and interactive control; the other is system organization and collaborative logic for the management layer (Human-in-the-Mesh, HitM), involving high-level capabilities such as collaborative planning, resource scheduling, and organizational resilience. This constitutes a key leap toward truly “human-centric” intelligent manufacturing [85]. Therefore, digital twin is not only a technical tool but also an enabling architecture that reconstructs human–machine power relations and collaborative logic. Its value lies not only in “mirroring reality” but also in “guiding the future”.

5.1. Environmental Perception and Scene Modeling

In terms of environmental perception and scene modeling, vision and other sensors are widely used to improve the safety and naturalness of human–machine interaction. For example, one system adopted an Intel RealSense camera with Kalman filtering to precisely localize the operator’s hand and dynamically adjust the robot’s speed to ensure safety. Choi, S.H. et al. [86] proposed a safety-aware system based on mixed reality (MR), which uses deep learning and digital twin technology to calculate, in real time, the minimum safe distance between humans and machines and visually presents it to the operator through MR glasses. Tao, F. et al. [87] systematically sorted out the digital twin modeling process, including model construction, assembly, fusion, verification, modification, and management, providing a theoretical basis for building effective human–machine collaborative digital twin models. Although the above methods perform well in structured environments, they still face robustness challenges under complex conditions such as occlusion and illumination changes commonly encountered in industrial sites. More importantly, most systems only focus on geometric positional relationships and lack modeling of high-level semantic information such as tool functions and material states, limiting their contextual understanding capabilities. Ji, Y. et al. [88] used large language models (LLMs) for task reasoning and vision foundation models (VFMs) for scene semantic perception, enabling the system to perceive new objects and handle undefined tasks without additional training, which provides a new idea for breaking the semantic bottleneck.
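A minimal sketch of the hand-tracking-plus-speed-scaling idea is given below, assuming a constant-velocity Kalman filter over 3D hand positions from a depth camera and a linear speed-scaling rule; the noise covariances, frame rate, and distance thresholds are illustrative, not the settings of the cited systems.

```python
import numpy as np

# Constant-velocity Kalman filter for 3D hand tracking, plus a simple
# distance-based speed-scaling rule. All parameters are illustrative.
dt = 1.0 / 30.0                                  # camera frame period (30 fps)
F = np.block([[np.eye(3), dt * np.eye(3)],       # state: [x, y, z, vx, vy, vz]
              [np.zeros((3, 3)), np.eye(3)]])
H = np.hstack([np.eye(3), np.zeros((3, 3))])     # only position is measured
Q = 1e-3 * np.eye(6)                             # process noise
R = 5e-3 * np.eye(3)                             # measurement noise (depth camera)

x = np.zeros(6)
P = np.eye(6)

def kalman_step(z):
    """One predict/update cycle with a 3D hand-position measurement z (meters)."""
    global x, P
    x = F @ x
    P = F @ P @ F.T + Q
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(6) - K @ H) @ P
    return x[:3]                                 # filtered hand position

def speed_scale(hand_pos, robot_pos, slow_at=0.8, stop_at=0.3):
    """Scale robot speed down linearly as the hand approaches; stop when too close."""
    d = np.linalg.norm(hand_pos - robot_pos)
    return float(np.clip((d - stop_at) / (slow_at - stop_at), 0.0, 1.0))

hand = kalman_step(np.array([0.55, 0.10, 0.40]))          # one camera measurement
print(speed_scale(hand, robot_pos=np.array([0.9, 0.0, 0.4])))
```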

5.2. Human State Perception and Modeling

In the field of human state perception and modeling, research focuses on the fusion analysis of physiological and behavioral signals (as shown in Figure 8). Zhang, T. et al. [89] proposed a robust 3D arm force estimation model (R3DNet) based on a hybrid deep learning network, using surface electromyography (sEMG) signals as the interaction medium to enable robots to adjust collaborative behaviors according to changes in human arm strength. You, Y. et al. [90] fused IMU-measured data with OpenSim simulation and used the IK-BiLSTM-AM network to synchronously estimate muscle force and fatigue accumulation status, providing support for real-time fatigue assessment in human–machine collaboration. Malik, A.A. et al. [91] built a worker digital model using the “human package” in Tecnomatix Process Simulate to conduct human factors engineering and ergonomics evaluation. Notably, the human mesh recovery algorithm was improved to achieve accurate posture inference of occluded humans without relying on any wearable devices, providing support for assembly intention recognition. However, current human digital twins still rely heavily on individual characteristics, resulting in high calibration costs, which severely restricts the flexibility and large-scale deployment of such systems.
Overall, the existing perception-modeling system is characterized by “valuing hardware over semantics, individuals over groups, and states over intentions”. In the future, it is necessary to develop a semantic-enhanced multi-modal fusion framework and introduce generative AI or causal reasoning mechanisms to achieve a leap from “state reproduction” to “intention prediction”.

5.3. System Organization and Collaborative Logic for Complex Collaborative Scenarios

When human–machine fusion expands from unit-level collaboration to production line or even factory-level systems, perception and modeling alone are no longer sufficient to ensure the stability and efficiency of overall collaboration. At this time, the system-level organizational structure and collaborative logic become the key to success. Although current research has proposed various mechanisms, most are limited to specific application scenarios, lacking a general framework, and insufficiently considering the adaptability of the “human” role evolution in dynamic task flows (as shown in Figure 9).
In terms of spatial organization, digital twins have been used to conduct collision, accessibility, layout, and visual tests to optimize workstation layout, ensuring safety and efficiency. However, such methods have limited adaptability to frequently changing layouts in flexible manufacturing. Dröder, K. et al. [92] proposed a digital twin path planning scheme integrating 3D obstacle detection and dynamic safety envelopes. Through cluster analysis and artificial neural networks, obstacles such as humans and workbenches in the factory are identified, and human-centric dynamic safety zones are generated to achieve real-time obstacle avoidance by robots, adapting to flexible production scenarios of human–machine collaboration.
In terms of resource organization, existing research exhibits a hierarchical characteristic of “macro optimization + micro execution”: the macro level focuses on the dynamic allocation and collaborative coordination of production resources, while the micro level focuses on the efficient scheduling and adaptation verification of resources in specific interactive scenarios. Kousi, N. et al. [93] dynamically allocated operational tasks based on task difficulty and human–machine capabilities using optimization algorithms, resulting in a 25% increase in annual production throughput and a 30% rise in the proportion of time allocated to assembly work; Tchane Djogdom, G.V. et al. [94] proposed a robust dynamic robot scheduling method for human–robot collaborative manufacturing operations, aiming to address the variability in human availability and intervention time, which achieved optimized production time while minimizing human and robot idle time. Oyekan, J.O. et al. [95] constructed a simulation test platform integrating digital twin and virtual reality, which reproduces physical robot units and interacts with human operators in real time, quantitatively verifying the effectiveness and safety of collaborative strategies based on indicators such as reaction time and contact force.
In terms of authority organization, Liu, X. et al. [96] established a hierarchical authority management mechanism in a web-based system, where managers have scheduling rights and operators only have execution and feedback rights. These explorations reveal a deep-seated contradiction: industrial systems pursue certainty and safety, while human–machine fusion requires flexibility and autonomy. There is no mature paradigm for achieving a balance between the two.

6. Adaptive Motion Control

6.1. Concept of Adaptive Motion Control

Human–robot collaborative manufacturing systems are designed to enhance productivity, ensure quality, and perform tasks beyond the capability of humans or robots alone. Such collaborative scenarios inherently require close proximity between human and robotic agents working on separate sub-tasks. A primary concern in such systems is ensuring human safety [97]. An overview of adaptive motion control methods can be seen in Table 5.
In addition to the multimodal perception and digital twin technologies mentioned earlier, adaptive motion control serves as the core technology for safe decision-making and execution in human–robot collaborative systems [98]. Compared with traditional control methods, adaptive control can be regarded as a feedback system integrating both parameter estimation and real-time correction [99], which can effectively suppress disturbances caused by dynamic environmental changes and unexpected contacts during human–robot collaborative manufacturing [100]. It performs particularly well in addressing parameter fluctuations of the robot itself as well as dynamic changes in external tasks and the environment [101]. By integrating robot dynamic models [102], multi-sensor perception information [103], and real-time control algorithms [104], this technology effectively ensures the achievement of safe, efficient, and flexible collaboration in variable environments. Specifically, dynamic models provide necessary inertial and impedance parameters; multi-sensor fusion achieves millisecond-level precise perception of humans and obstacles; and real-time algorithms based on impedance control effectively solve the problem of hybrid force/position control in physical human–robot interaction.
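To illustrate the impedance-control idea underpinning hybrid force/position behavior, the sketch below simulates a single Cartesian axis whose commanded motion obeys a virtual mass–damper–spring law around the planner’s reference; the gains and the simulated human push are illustrative assumptions, not values from any cited controller.

```python
import numpy as np

# Minimal 1-DoF Cartesian impedance control sketch: the commanded motion behaves
# like a mass-damper-spring around the reference, yielding compliant responses
# to contact forces. Gains are illustrative, not tuned for a specific robot.
M, D, K = 2.0, 40.0, 400.0          # virtual mass [kg], damping [Ns/m], stiffness [N/m]
dt = 0.001                          # 1 kHz control cycle

x, xd = 0.0, 0.0                    # current position / velocity along one axis
x_ref = 0.10                        # reference position from the planner [m]

def impedance_step(f_ext):
    """One control cycle: external force f_ext [N] deflects the commanded motion."""
    global x, xd
    xdd = (f_ext - D * xd - K * (x - x_ref)) / M   # impedance law
    xd += xdd * dt
    x += xd * dt
    return x                                        # position setpoint sent to the robot

# Simulate 0.5 s of free motion followed by 0.5 s of a 10 N human push.
for t in range(1000):
    impedance_step(10.0 if t >= 500 else 0.0)
print(round(x, 4))   # settles near x_ref + f_ext / K = 0.10 + 0.025 m under the push
```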
Beyond path planning and collision detection, code generation serves as a key execution link for implementing adaptive motion control, and together these three components form a complete technical system. As shown in Figure 10, to systematically clarify the implementation path of adaptive motion control, this study proposes a dedicated technical architecture. We therefore review the state of adaptive motion control research through three lenses: path planning, collision detection, and code generation.

6.2. Adaptive Motion Control Based on Path Planning

Traditional fixed-path planning is typically performed offline before a task begins, assuming a fully known and static environment, rendering it unsuitable for human–robot collaborative scenarios. Adaptive motion control methods based on path planning address the poor adaptability and safety limitations of traditional approaches in dynamic interactive settings by fusing real-time environmental perception with trajectory optimization algorithms. This has become a prominent research focus in the field.
The highly uncertain and complexly constrained nature of human–robot collaborative manufacturing has prompted various solutions. For instance, Ding, S. et al. [105] proposed a dual closed-loop control structure, where an outer loop generates trajectories via Quadratic Programming (QP) optimization, and an inner loop employs an adaptive force controller combined with a neural network to compensate for system uncertainties, enabling automatic obstacle avoidance in robot trajectory planning. Lin, H.-I. et al. [106] enhanced the Artificial Potential Field (APF) method by integrating force sensors. By optimizing the direction of attractive and repulsive forces in the APF and introducing tool vector modeling and a path smoothing mechanism, they achieved efficient obstacle avoidance and smooth trajectory generation for robotic arms in complex 3D environments.
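As a simple illustration of the APF idea discussed above, the sketch below computes planning steps from the negative gradient of an attractive goal potential plus repulsive obstacle potentials in 2D; the gains, influence radius, and obstacle position are illustrative assumptions rather than values from [105,106].

```python
# Minimal artificial-potential-field (APF) planning sketch in 2D: each step
# follows the combined attractive (goal) and repulsive (obstacle) forces.
import numpy as np

def apf_step(q, goal, obstacles, k_att=1.0, k_rep=0.5, rho0=0.4, step=0.05):
    """Return the next waypoint from 2D configuration q toward goal."""
    f_att = k_att * (goal - q)                       # attractive force
    f_rep = np.zeros(2)
    for obs in obstacles:
        diff = q - obs
        rho = np.linalg.norm(diff)
        if 1e-6 < rho < rho0:                        # repulsion only near obstacles
            f_rep += k_rep * (1.0 / rho - 1.0 / rho0) / rho**2 * (diff / rho)
    force = f_att + f_rep
    return q + step * force / (np.linalg.norm(force) + 1e-9)

q = np.array([0.0, 0.0])
goal = np.array([2.0, 1.5])
obstacles = [np.array([1.0, 0.8])]                   # e.g., a tracked human hand
path = [q]
for _ in range(200):
    q = apf_step(q, goal, obstacles)
    path.append(q)
    if np.linalg.norm(q - goal) < 0.05:
        break
print(f"steps: {len(path) - 1}, final distance to goal: {np.linalg.norm(q - goal):.3f}")
```

The repulsive term vanishes outside the influence radius rho0, which is what lets the planner revert to straight-line attraction once the human has moved away.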
In recent years, the integration of artificial intelligence has invigorated adaptive path planning. To meet the trajectory planning needs of autonomous mobile robots, researchers have leveraged many algorithms to achieve efficient and safe path generation in complex environments. For example, Cui, J. et al. [107] proposed a Multi-strategy Adaptive Ant Colony Optimization (MsAACO) algorithm, which, after parameter optimization and validation in multiple environments, demonstrated planning results with shorter paths, fewer turns, faster convergence, and greater stability. Bai, Z. et al. [108] developed a deep reinforcement learning algorithm based on an Improved Double Deep Q-Network (IDDQN), exhibiting exceptional adaptability and safety in unknown, unstructured environments. Gao, Q. et al. [109] addressed the robotic arm scenario by proposing a path planning algorithm (BP-RRT) that integrates an improved Rapidly exploring Random Tree (RRT) with a Backpropagation (BP) neural network, successfully achieving faster, more efficient, and near-optimal trajectory planning in cluttered 3D spaces with narrow passages. Collectively, these AI-driven methods show significant potential for highly uncertain environments.
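For readers less familiar with sampling-based planners, the following minimal 2D RRT sketch (a plain RRT, not the improved BP-RRT of [109]) shows the core expand-toward-sample loop with a simple circular-obstacle check; all parameters, obstacle positions, and the goal bias are illustrative assumptions.

```python
# Minimal 2D RRT sketch: random samples are connected to the nearest tree node
# with a fixed expansion step; for brevity only new nodes (not edges) are
# checked against circular obstacle regions.
import random, math

obstacles = [((5.0, 5.0), 1.5)]                 # (centre, radius) pairs
start, goal = (1.0, 1.0), (9.0, 9.0)
step, goal_tol = 0.5, 0.6

def collision_free(p):
    return all(math.dist(p, c) > r for c, r in obstacles)

random.seed(0)
nodes, parent = [start], {start: None}
for _ in range(5000):
    sample = goal if random.random() < 0.1 else (random.uniform(0, 10), random.uniform(0, 10))
    nearest = min(nodes, key=lambda n: math.dist(n, sample))
    d = math.dist(nearest, sample)
    if d < 1e-9:
        continue
    new = (nearest[0] + step * (sample[0] - nearest[0]) / d,
           nearest[1] + step * (sample[1] - nearest[1]) / d)
    if not collision_free(new):
        continue
    nodes.append(new)
    parent[new] = nearest
    if math.dist(new, goal) < goal_tol:
        path = [new]
        while parent[path[-1]] is not None:     # backtrack to the start node
            path.append(parent[path[-1]])
        print(f"path found with {len(path)} waypoints")
        break
```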
In summary, the adaptive motion control method based on path planning focuses on addressing the trajectory generation challenge in dynamic environments. It endows robots with the ability to perceive environmental changes in real time and dynamically adjust their motion paths, and lays the first cornerstone for the safety and efficiency of human–robot collaboration through proactive obstacle avoidance and optimization.

6.3. Adaptive Motion Control Based on Collision Detection

In human–robot collaborative manufacturing, safety is paramount. Adaptive motion control based on collision detection ensures safe interaction between the robot, the human operator, and the environment by identifying potential collision risks in real-time and dynamically adjusting the robot’s motion strategy. The core of this approach lies in constructing a high-precision, low-latency collision detection mechanism and deeply integrating it with adaptive control algorithms to maximize operational efficiency without compromising safety.
As shown in Figure 11, collision detection techniques are primarily categorized into two types: geometry model-based detection and torque/moment observer-based detection.
Geometry model-based detection relies on high-precision external sensors, such as depth cameras and LiDAR, to construct a real-time 3D map of the environment. Collision risk is then determined by computing the geometric relationships between the robot’s links and the models of obstacles (including humans) in this map [110]. For example, Tang, X. et al. [111] proposed an integrated obstacle avoidance path planning method based on an improved A* algorithm and the Artificial Potential Field method. This approach simplifies the robotic arm model into a geometric model of nodes and links to determine collision status computationally, and then uses the APF to adjust the arm’s posture for avoidance. Cao, M. et al. [112] introduced a series of optimization strategies to enhance the original RRT-Connect algorithm for robotic arm collision avoidance. By detecting the intersection area between a path and obstacles and incorporating the obstacle intersection area into an improved cost function, their method efficiently determines collision states.
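A minimal version of such a geometric check is sketched below: each robot link is approximated by a capsule (a line segment with a safety radius) and a tracked human keypoint by a sphere, and a risk flag is raised when the segment-to-point distance falls below the summed radii plus a margin. The link and keypoint coordinates are illustrative placeholders, not data from [111,112].

```python
# Minimal geometry-based proximity check between one robot link (capsule) and
# one tracked human keypoint (sphere), e.g., a wrist point from a depth camera.
import numpy as np

def segment_point_distance(a, b, p):
    """Shortest distance from point p to segment ab (all 3D numpy arrays)."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def link_in_risk(link_start, link_end, link_radius, point, point_radius, margin=0.05):
    d = segment_point_distance(link_start, link_end, point)
    return d < link_radius + point_radius + margin

# Illustrative values: forearm link of the robot vs. a tracked human wrist.
link_a, link_b = np.array([0.2, 0.0, 0.5]), np.array([0.6, 0.1, 0.7])
wrist = np.array([0.45, 0.05, 0.62])
print("collision risk:", link_in_risk(link_a, link_b, 0.06, wrist, 0.05))
```

In practice this check is repeated over every link–keypoint pair at camera frame rate, and the minimum clearance is fed to the speed-scaling or avoidance layer.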
However, vision-based methods have blind spots and are affected by sensor noise. Consequently, torque observer-based collision detection, which requires no external sensors, serves as a crucial complementary solution. Huang, J. et al. [113] proposed an Adaptive Robust Interaction Control (ARIC) scheme with a dual-loop architecture. Through force feedback and adaptive robust control, it achieves real-time environmental contact force estimation and proactively generates the desired motion trajectory of the robot end-effector to mitigate the impacts of dynamic environmental uncertainties. Chen, Z. et al. [114] proposed an adaptive impedance control strategy for a Stewart parallel mechanism. This strategy utilizes end-effector contact force sensor feedback for “collision awareness” and employs adaptive impedance control to suppress collision force peaks, representing a passive collision suppression approach based on force feedback.
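The sensorless principle can be illustrated with a single-joint generalized-momentum observer, sketched below: the residual r converges to the unknown external torque using only the commanded torque and the measured joint velocity, and a threshold on r triggers the reaction. The dynamics, gain, and threshold are illustrative assumptions, not tuned for any specific robot or taken from [113,114].

```python
# Minimal 1-DOF momentum-observer sketch for sensorless collision detection:
# the residual r estimates the external torque without any force/torque sensor.
import numpy as np

I, dt, K_o, threshold = 0.8, 0.001, 50.0, 1.0   # inertia, step, observer gain, Nm
q_dot, r, integ = 0.0, 0.0, 0.0

for k in range(3000):
    t = k * dt
    tau_cmd = 2.0 * np.sin(2.0 * t)              # commanded joint torque
    tau_ext = 4.0 if 1.5 < t < 1.7 else 0.0      # unexpected contact torque
    q_ddot = (tau_cmd + tau_ext) / I             # true (unmeasured) dynamics
    q_dot += q_ddot * dt
    # Momentum observer: r converges to tau_ext with time constant 1/K_o.
    integ += (tau_cmd + r) * dt
    r = K_o * (I * q_dot - integ)
    if abs(r) > threshold:
        print(f"collision flagged at t = {t:.3f} s, residual = {r:.2f} Nm")
        break
```

Because the residual reacts within a few observer time constants, the contact starting at 1.5 s is flagged within a few milliseconds in this toy setting.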
In conclusion, adaptive motion control based on collision detection focuses on the “last line of defense” for human–robot interaction safety. By establishing a high-precision, low-latency loop for collision perception and response, it provides vital safety assurance for close-proximity collaboration. Geometry-based techniques aim to “prevent accidents before they happen,” while torque-based methods excel at “responding instantaneously.” These two approaches are complementary, working in tandem to minimize risk under unexpected circumstances and firmly uphold the safety bottom line of human–robot collaboration.

6.4. Adaptive Motion Control Based on Code Generation

In traditional human–robot collaborative systems, controller design is often tailored to specific tasks and environments, leading to insufficient flexibility and adaptability when facing dynamic task requirements or uncertainties. The adaptive motion control method based on code generation addresses this by automatically generating and dynamically optimizing control code, enabling real-time adaptation to system uncertainties. This approach effectively reduces the development complexity of control code for complex collaborative tasks and enhances system flexibility and scalability [115].
Recent breakthroughs in Large Language Models (LLMs) have infused code generation with new intelligence: LLMs can not only understand task intent described in natural language but also generate control code with complex logic, offering the potential for a higher level of “cognitive adaptation”. However, LLMs often struggle to generate accurate code for complex programming tasks. To enhance the accuracy and reliability of the generated code, staged generation strategies have become a research focus. A representative example is the recently proposed two-stage self-planning approach: in the planning stage, few-shot prompting guides the LLM to produce concise solution steps, which then steer incremental code generation in the implementation stage, providing an efficient solution for complex coding tasks. Han, Y. et al. [116] further advanced this with a Multi-Stage Guided (MSG) code generation strategy, which refines the technical pathway through three progressive stages (planning, design, and implementation), gradually bridging the gap between problem description and correct code. Concurrently, Liu, Z. et al. [117] systematically investigated the code generation capabilities of LLMs and the effects of multi-round repair, focusing on the core dimensions of correctness, readability, and security; their work provides critical insights for the targeted optimization of LLM-based code generation. Collectively, these studies lay a solid foundation for applying code generation-based adaptive motion control in human–robot collaborative manufacturing.
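The staged strategy can be sketched as two successive prompts, as below; the llm callable is a hypothetical placeholder for any chat-completion client, and the prompt wording is illustrative rather than the exact prompts used in [116].

```python
# Sketch of plan-then-implement code generation for robot control tasks.
# `llm` is a hypothetical placeholder: wrap any real client as a callable
# that maps a prompt string to a completion string.
from typing import Callable

def staged_codegen(task: str, llm: Callable[[str], str]) -> str:
    # Stage 1: planning -- ask only for concise solution steps, no code yet.
    plan = llm(
        "You are planning robot control code.\n"
        f"Task: {task}\n"
        "List the minimal numbered steps needed to solve it. Do not write code."
    )
    # Stage 2: implementation -- generate code incrementally against the plan.
    code = llm(
        "Implement the following plan as a single Python function for a "
        "collaborative robot controller. Follow the steps in order and keep "
        "each step as a commented block.\n"
        f"Task: {task}\nPlan:\n{plan}\nReturn only code."
    )
    return code

# Usage (with any real client wrapped as `lambda prompt: client.generate(prompt)`):
# code = staged_codegen("pick the bracket from the tray and hand it to the operator", my_llm)
```

Separating planning from implementation keeps each prompt short and lets the plan be reviewed (by a human or a verifier) before any executable code is produced.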
Building on this foundation, researchers are extending LLM-based code generation to robotic manipulation and automated programming scenarios. Burns, K. et al. [118] proposed the GenCHiP system, which leverages LLMs to automatically generate policy code for high-precision, contact-rich robotic manipulation tasks, automating the programming of complex operations. Macaluso, A. et al. [119] developed a ChatGPT-based automated programming system that dynamically generates and debugs assembly strategy code through task decomposition and iterative verification in a simulation environment, enabling robots to adapt to unknown assembly sequences.
Therefore, adaptive motion control based on code generation is evolving from “parameter adaptation” to the higher stage of “strategy adaptation,” offering a highly promising solution for human–robot collaborative systems to cope with intensely uncertain production scenarios.

7. Human–Robot Collaborative Decision-Making

Human–robot collaborative decision-making refers to the complementary relationship between human intelligence and artificial intelligence in the decision-making process: through human–robot collaboration, the two jointly tackle complex problems. The core objective is neither to replace humans with machines nor to have humans passively follow machine instructions, but to integrate human intuition, experience, and ethical judgment with machines’ computational power and tireless operation, producing a decision-making effect where ‘1 + 1 > 2’. An overview of human–robot collaborative decision-making can be seen in Table 6.
As shown in Figure 12, HRC decision-making relies on dynamic decision-making, self-learning, and rapid response capabilities to achieve real-time adaptation and optimal adjustment to production disturbances, thereby enhancing system flexibility. Its advantages are reflected in three aspects: first, improved decision accuracy, by integrating the technological strengths of machines in data processing and logical reasoning with human advanced cognitive abilities; second, enhanced interpretability of decision outcomes, as the decision-making process of human–robot collaborative intelligence helps bridge the cognitive gap between humans and machines, constructing transparent decision pathways to clarify responsibility boundaries; and third, increased system robustness, leveraging machines’ risk monitoring and adaptive capabilities in combination with human experience in complex situations, collaboratively forming a complementarity between technical rationality and human cognition [120].
In fields such as medical diagnosis, military command, and urban planning, the limitations of a single entity make it difficult to make optimal decisions. Therefore, developing effective human–robot collaborative systems has become an important research direction in both academia and industry [121].

7.1. Human–Robot Collaborative Decision-Making Based on Large Language Models

Traditional human–robot collaborative decision-making typically relies on specially trained models, which suffer from limitations such as unnatural interactions and difficulty in domain transfer. Recently, the emergence of Large Language Models (LLMs) has marked a milestone in human–AI collaborative decision-making, with landmark works greatly expanding the boundaries of human–AI collaboration (as shown in Figure 13). Among them, GPT-3, proposed by Brown, T.B. et al. [122], laid the foundation for humans to collaboratively complete tasks with models through natural language instructions by scaling parameters and data; Llama 2, released by Touvron, H. et al. [123], has promoted the development of open-source, reproducible dialog models, lowering the barrier to building high-quality human–AI dialog systems; meanwhile, Gemini, the multimodal large model introduced by the Gemini team [124], enables AI to perceive and reason over mixed text and image inputs, driving innovative cross-modal human–AI collaboration. The superior capabilities of LLMs in task understanding, planning, and reasoning [125] have spurred the development of LLM-based autonomous agents [126]. LLMs can thus play the role of an “intelligent collaborator” in human–robot collaborative decision-making, upgrading the core of HRC from “tool use” to “dialog and collaboration”.
The most direct method is the combination of prompting and chain-of-thought reasoning. Well-designed prompts can guide LLMs to emulate a decision-analysis process: the human provides the problem background and data, and the model returns decision suggestions together with supporting reasons. Chain-of-thought prompting requires LLMs to “think step by step” [127], which not only improves the reliability of decisions but also allows humans to check the logical chain for flaws [128]. The core idea of explainable artificial intelligence is that LLMs must accompany their decision suggestions with explanations of the reasoning process. Common forms of explanation include feature importance, counterfactual explanations, and similar cases. Researchers have found that providing LIME or SHAP explanations can significantly improve users’ trust in the system and the accuracy of decisions [129,130].
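As a concrete illustration of attaching an explanation to a decision suggestion, the sketch below trains a toy classifier and uses the open-source lime package [129] to report feature attributions next to the suggested action; the feature names, labels, and data are synthetic placeholders, not an implementation from the cited studies.

```python
# Minimal sketch: pair a model's decision suggestion with a LIME explanation
# so the human collaborator can inspect which features drove it ([129]).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
feature_names = ["torque_dev", "vibration_rms", "cycle_time", "temp_rise"]
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)        # toy "rework risk" label

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names, class_names=["accept", "rework"],
    discretize_continuous=True,
)
sample = X[0]
explanation = explainer.explain_instance(sample, model.predict_proba, num_features=4)

print("suggested decision:", ["accept", "rework"][int(model.predict([sample])[0])])
for feature, weight in explanation.as_list():        # human-readable attributions
    print(f"  {feature}: {weight:+.3f}")
```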
In recent years, multimodal Large Language Models (MLLMs) have emerged, such as GPT-4V and Gemini. By aligning visual, linguistic, and other modalities into a unified space, they have achieved a richer perception of the physical world. This has brought about a revolutionary change in human–robot collaborative decision-making: AI collaborators can now “see” and understand the specific situations humans are in, thereby providing more relevant and feasible decision support.

7.2. Human–AI Collaborative Decision-Making Based on Reinforcement Learning

Reinforcement learning, with its strength in handling sequential decision-making problems, provides a theoretical framework for modeling and optimizing HRC (as shown in Figure 14). AI acts as the primary decision learner, aiming to learn a policy that maximizes cumulative rewards when collaborating with humans.
In this process, the human–robot collaborative system can be modeled as a collaborative, partially observable Markov decision process [131]. The core idea is to enable the AI to learn autonomously, through interaction with the environment, how to become the best partner for humans, rather than pre-programming all behaviors. This allows the system to adapt to the unique styles, skill levels, and preferences of different human partners. One challenge is that human preferences and intentions are not directly observable. A common solution is inverse reinforcement learning [132]: the underlying reward function is inferred from human behavioral data, and RL then optimizes the AI policy accordingly [133] to align the AI’s goals with human intentions.
RL can also learn to take the right action at the right time: in driver-assistance systems, for example, the AI must learn when to issue an alert and when to monitor quietly, thereby avoiding unnecessary interference.
From a more general perspective, humans and AI are considered a multi-agent system. Multiple agents in a reinforcement learning team share a common goal but need to learn how to coordinate their actions [134]. During training, the system can access information from all agents to learn coordination strategies; however, during execution, each agent makes decisions based only on its own local observations. In this case, the team’s global value function can be decomposed into the individual value function of each agent [135] to promote efficient collaboration.
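The decomposition idea can be illustrated with a minimal tabular sketch in which the team value is modelled as the sum of per-agent utilities (in the spirit of value-decomposition methods [135]), trained with a shared temporal-difference error but executed greedily from each agent's local information; the two-agent coordination game and all constants are illustrative assumptions.

```python
# Minimal tabular value-decomposition sketch: Q_tot = Q_0 + Q_1 is trained
# centrally, while each agent acts on its own local utility at execution time.
import numpy as np

rng = np.random.default_rng(1)
n_actions, alpha, eps = 2, 0.1, 0.2
Q = [np.zeros(n_actions), np.zeros(n_actions)]   # stateless local utilities

def team_reward(a0, a1):
    return 1.0 if a0 == a1 else -0.1             # reward only coordinated actions

for episode in range(2000):
    # Decentralised execution: epsilon-greedy on each agent's own Q.
    acts = [rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[i]))
            for i in range(2)]
    r = team_reward(*acts)
    # Centralised training: TD error on the summed team value (one-step episode).
    td_error = r - (Q[0][acts[0]] + Q[1][acts[1]])
    for i in range(2):
        Q[i][acts[i]] += alpha * td_error
print("learned local utilities:", [q.round(2).tolist() for q in Q])
```

After training, both agents independently prefer the same action, which is exactly the coordination behavior the decomposition is meant to induce without sharing observations at execution time.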

8. Applications of Human–Robot Collaborative Manufacturing Technologies

On modern production lines, material identification, grasping, and handling are fundamental operations in human–robot collaborative manufacturing. In recent years, deep vision networks and multimodal models have been increasingly adopted to enhance the stability and efficiency of these processes. For instance, Sun, R. et al. [136] proposed the YOLO-GG parallel network, in which YOLOv3 performs object detection while the GG-CNN outputs grasping poses. Evaluated on the NEU-COCO dataset, this method achieved a 14.1% improvement in detection speed and approximately 94% recognition accuracy, providing a feasible solution for collaborative robots to achieve rapid perception and grasping in industrial scenarios. Building on this direction, Ji, Y. et al. [88] combined vision foundation models with large language models to support part-level perception and assembly planning. Their system extracts features based on SAM and CLIP, segments and recognizes parts and tools, maintains robust localization under multi-view and occlusion conditions, and simultaneously generates executable assembly steps from scene information and natural-language instructions. In such systems, operators can guide robots to complete grasping, alignment, and assembly via simple verbal instructions or instructive gestures, thereby reducing manual programming effort and significantly improving collaboration efficiency in multi-variety, small-batch manufacturing. An overview of applications of human–robot collaborative manufacturing technologies can be seen in Table 7.
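As a simplified illustration of the part-recognition step, the sketch below scores candidate part crops (e.g., obtained from any segmenter such as SAM) against text labels using the open-source CLIP package; the labels, model variant, and surrounding pipeline are illustrative assumptions and do not reproduce the system of [88].

```python
# Minimal open-vocabulary part recognition sketch with CLIP: segmented crops
# are matched against natural-language part names by image-text similarity.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["hex bolt", "bearing housing", "torque wrench", "human hand"]
text = clip.tokenize([f"a photo of a {name}" for name in labels]).to(device)

def classify_crop(crop: Image.Image) -> str:
    """Return the most likely label for one segmented part crop."""
    image = preprocess(crop).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)     # image-text similarity
        probs = logits_per_image.softmax(dim=-1)
    return labels[int(probs.argmax())]

# Usage: crops would come from a segmenter (e.g., SAM masks cut from the RGB frame).
# print(classify_crop(Image.open("part_crop.png")))
```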
Task planning in HRC is a key link that bridges human workers and robots to ensure a smooth collaborative operation. Petzoldt, C. et al. [137] conducted a comprehensive review of human–robot collaborative task allocation methods for assembly scenarios, summarizing typical division-of-labor principles and mechanisms for real-time adjustment under uncertainty. On this basis, Lamon, E. et al. [138] proposed a unified framework for human–robot teamwork that leverages behavior trees and mixed-integer linear programming (MILP) optimization to achieve dynamic role switching and task scheduling. These studies demonstrate that integrating rule-based reasoning, optimization algorithms, and flexible task models can provide practical solutions for task planning in complex, highly variable collaborative environments.
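A minimal MILP formulation of the allocation step is sketched below using the open-source PuLP solver: binary variables assign each task to the human or the robot, and the makespan of the more heavily loaded agent is minimised. The task set and durations are illustrative placeholders, and the sketch omits the behavior-tree execution layer of [138].

```python
# Minimal MILP task-allocation sketch for one human and one robot.
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary, value

tasks = ["insert_bearing", "fasten_bolts", "visual_check", "apply_sealant"]
agents = ["human", "robot"]
duration = {  # seconds, per (task, agent); illustrative values
    ("insert_bearing", "human"): 30, ("insert_bearing", "robot"): 18,
    ("fasten_bolts", "human"): 45,   ("fasten_bolts", "robot"): 20,
    ("visual_check", "human"): 10,   ("visual_check", "robot"): 40,
    ("apply_sealant", "human"): 25,  ("apply_sealant", "robot"): 35,
}

prob = LpProblem("hrc_task_allocation", LpMinimize)
x = {(t, a): LpVariable(f"x_{t}_{a}", cat=LpBinary) for t in tasks for a in agents}
makespan = LpVariable("makespan", lowBound=0)

prob += makespan                                          # minimise the makespan
for t in tasks:                                           # each task to exactly one agent
    prob += lpSum(x[t, a] for a in agents) == 1
for a in agents:                                          # each agent's workload bounds makespan
    prob += lpSum(duration[t, a] * x[t, a] for t in tasks) <= makespan

prob.solve()
for t in tasks:
    assigned = next(a for a in agents if value(x[t, a]) > 0.5)
    print(f"{t:>15} -> {assigned}")
print("makespan:", value(makespan), "s")
```

Precedence constraints, ergonomic limits, or capability restrictions can be added as further linear constraints without changing the overall structure.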
In terms of adaptive execution and advanced collaboration, a number of studies start from control and demonstration to enhance robots’ adaptability to environmental changes and human operation variations. For example, Jha, D.K. et al. [139] proposed a collaborative assembly strategy that combines force sensing with vision. When position deviations or abnormal contact forces are detected, the robot can adjust its motion trajectory or stiffness online, maintaining fault tolerance in situations such as inaccurate placement or temporary human intervention. Complementary to these control-oriented approaches, Fan, J. et al. [140] introduced visual language models and deep reinforcement learning to decompose complex human–robot collaborative tasks into specific executable subtasks, and to continuously adapt strategies based on visual and linguistic feedback during execution. This approach is particularly suitable for unstructured environments and multi-variety, small-batch production, where task requirements and human behaviors are highly dynamic. As shown in Figure 15, the depth camera captures the RGB-D image, and the parallel network structure improves processing speed by performing the target detection and pose estimation tasks simultaneously. In addition, the workflow of the HRC system with the embedded algorithms studied by Ji, Y. et al. [88] is shown in Figure 16.
Overall, existing application studies demonstrate that by integrating deep vision networks, vision–language models, reinforcement learning, and optimization-based task planning into HRC manufacturing systems, it is possible to achieve more robust perception, more flexible task allocation, and more adaptive execution. These advances are gradually transforming human–robot collaboration from isolated, task-specific deployments into integrated, intelligent production systems capable of coping with complex industrial scenarios.

9. Conclusions and Future Work

9.1. Summary

Against the backdrop of Industry 5.0 and the deep integration of smart manufacturing, human–robot collaboration manufacturing systems have emerged as a pivotal enabler—not only for driving the industry’s intelligent and flexible transformation but also for advancing core sustainability imperatives in modern production ecosystems. This review systematically organizes the core technical framework of HRC manufacturing, which forms a “perception–interaction–modeling–execution–decision” integrated system: HRC task allocation lays the foundation for rational labor division through task decomposition and skill matching; multimodal perception realizes comprehensive information capture of production scenarios; augmented interaction (AR/VR/MR) enhances human–robot operational convenience and immersion; digital twin-driven integration constructs virtual–real linked scenarios for full-process visualization and traceability; adaptive motion control ensures safe and precise robot execution in dynamic environments; and HRC decision-making achieves real-time optimization of production processes. Existing HRC technologies exhibit significant advantages. Firstly, they effectively integrate humans’ flexible decision-making and adaptability to complex environments with robots’ high-precision and continuous-operation capabilities, breaking the rigidity of traditional automated lines and the efficiency bottleneck of manual production. Secondly, the resulting technical closed loop realizes seamless connection among perception, interaction, and execution, significantly improving production efficiency, product quality, and operational safety. Thirdly, practical applications in the automotive, aeronautical, astronautical, and shipping industries have verified the technical feasibility and industrial value, promoting innovation in high-end equipment manufacturing and complex component processing scenarios.
However, current technologies still face prominent limitations. In terms of HRC task allocation, most methods rely on pre-defined rules and lack dynamic adaptability to real-time changes in personnel status, task urgency, and equipment performance. Multimodal perception technologies have insufficient robustness in complex environments (e.g., strong light, noise interference), leading to incomplete or delayed information acquisition. Augmented interaction systems often have high hardware costs and limited naturalness in human–robot interaction (e.g., gesture recognition accuracy and voice command response speed need improvement). Digital twin-driven integration has challenges in real-time data synchronization and high-fidelity modeling, especially for large-scale and multi-agent collaborative scenarios. Additionally, HRC decision-making systems are weak in handling complex, uncertain, and multi-objective production disturbances, and their self-learning capabilities still depend on large-scale labeled data.

9.2. Outlook

Future development of HRC manufacturing systems will focus on addressing current technical bottlenecks and adapting to the evolving needs of smart manufacturing. Guided by technological hotspots such as LLMs, embodied intelligence, and digital twins, as well as industry demands for flexibility, safety, and sustainability, the key development directions are subdivided into the following specific areas:
(1) Intelligent Adaptive Optimization Driven by Multi-Modal Perception and Large Models
Leveraging advanced AI technologies including deep learning, reinforcement learning, cognitive computing, and visual language models (VLMs) to enhance the autonomous decision-making and adaptive capabilities of HRC systems. Specifically:
  • Develop dynamic task allocation algorithms fused with real-time multi-source data (personnel fatigue status, equipment health indicators, task priority) and LLM-based intent understanding, enabling self-optimization of human–robot division of labor in complex production environments;
  • Improve the environmental robustness and few-shot learning capabilities of multi-modal perception through cross-modal fusion (vision, audio, tactile, physiological signals) and pre-trained model fine-tuning, empowering robots to proactively perceive human operation intentions and adjust collaborative strategies in real time.
(2) Lightweight Modular Integration for Flexible Production Scenarios
Reduce the deployment cost and system complexity of HRC through lightweight hardware design and modular software architecture, adapting to the trend of small-batch, customized production in modern manufacturing:
  • Promote the development of lightweight collaborative robots with high flexibility and low load, and optimize hardware integration of multi-modal sensors (e.g., miniaturized vision cameras, wearable physiological monitors);
  • Construct a modular software framework based on microservices, realizing plug-and-play of core functional modules (perception, interaction, control, decision-making) and supporting rapid adaptation to diverse production scenarios (e.g., aerospace component assembly, electronic product customization) through digital twin-driven module configuration.
(3) Multi-Dimensional Safety Guarantee and Ethical Norms Construction
Focus on the core demand of “human-centric” collaborative safety, and establish a comprehensive safety and ethical system for HRC systems:
  • Build a multi-level safety protection system integrating active collision avoidance (based on real-time trajectory prediction), human physiological state monitoring (e.g., EEG-based fatigue detection, eye-tracking attention recognition), and rapid emergency stop mechanisms, ensuring human safety in close-range human–robot collaboration;
  • Formulate industry-wide ethical norms and data security standards, including privacy protection of human operation data (e.g., desensitization of physiological signals and operation behavior), accountability definition for collaborative decision-making (distinguishing human/robot responsibilities in accident scenarios), and AI ethics review mechanisms for autonomous decision-making modules.
(4) Standardization System Improvement and Industrial Ecosystem Synergy
Accelerate the standardization process of HRC technologies to break through technical barriers and promote large-scale application:
  • Promote the formulation of international standards for HRC system interfaces (e.g., robot-sensor communication protocols), data formats (e.g., multi-modal data exchange specifications), and performance evaluation (e.g., collaborative efficiency, safety indicators), realizing interoperability between different brands of robots, perception devices, and digital twin platforms;
  • Build an industrial ecosystem integrating technology research (universities and research institutes), product development (equipment manufacturers), application promotion (end users), and talent training (vocational education), promoting the iterative upgrade of HRC technologies through industry–university–research cooperation and the commercialization of innovative achievements.
(5) Green Low-Carbon Transformation Based on Intelligent Energy Management
Respond to the global demand for carbon neutrality in the manufacturing industry, and realize the green development of HRC systems:
  • Optimize energy consumption through intelligent energy management algorithms, such as scheduling robot working hours based on production peaks and valleys, selecting energy-efficient motion paths via path planning, and enabling standby energy-saving modes for idle equipment;
  • Integrate HRC technologies with circular economy concepts, such as collaborative disassembly and remanufacturing of waste products through human–robot collaboration, and real-time monitoring of carbon emissions in the production process based on digital twins, contributing to the green transformation of the manufacturing industry.
In conclusion, HRC manufacturing systems will continue to play a core driving role in industrial upgrading. With the breakthrough of key technologies and the improvement of industrial ecosystems, they will move towards a more intelligent, safe, flexible, and sustainable future, further promoting the manufacturing industry to achieve high-quality development.

Author Contributions

Conceptualization, Q.C.; methodology, Q.C., J.H., X.Z., S.Z., L.L., H.L., C.X. and J.C.; software, J.H.; validation, J.H.; formal analysis, J.H.; investigation, C.L.; resources, J.H., X.Z., S.Z., L.L., H.L., C.X. and J.C.; writing—original draft preparation, Q.C.; writing—review and editing, C.L. and H.Z.; project administration, H.Z.; funding acquisition, Q.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by National Natural Science Foundation of China (U25B6012), Jiangsu Province Academic Degree and Postgraduate Education Teaching Reform Project (JGKT25_C013), the Young Elite Scientists Sponsorship Program by JSTJ (JSTJ-2025-183), and Research Topics for 2025 of the Jiangsu Institution of Engineers (JSIE2025KT11).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Leng, J.; Sha, W.; Wang, B.; Zheng, P.; Zhuang, C.; Liu, Q.; Wuest, T.; Mourtzis, D.; Wang, L. Industry 5.0: Prospect and retrospect. J. Manuf. Syst. 2022, 65, 279–295. [Google Scholar] [CrossRef]
  2. Wang, B.; Tao, F.; Fang, X.; Liu, C.; Liu, Y.; Freiheit, T. Smart Manufacturing and Intelligent Manufacturing: A Comparative Review. Engineering 2021, 7, 738–757. [Google Scholar] [CrossRef]
  3. Li, S.; Wang, R.; Zheng, P.; Wang, L. Towards proactive human–robot collaboration: A foreseeable cognitive manufacturing paradigm. J. Manuf. Syst. 2021, 60, 547–552. [Google Scholar] [CrossRef]
  4. Hietanen, A.; Pieters, R.; Lanz, M.; Latokartano, J.; Kämäräinen, J.-K. AR-based interaction for human-robot collaborative manufacturing. Robot. Comput.-Integr. Manuf. 2020, 63, 101891. [Google Scholar] [CrossRef]
  5. Liau, Y.Y.; Ryu, K. Task Allocation in Human-Robot Collaboration (HRC) Based on Task Characteristics and Agent Capability for Mold Assembly. Procedia Manuf. 2020, 51, 179–186. [Google Scholar] [CrossRef]
  6. Duan, J.; Fang, Y.; Zhang, Q.; Qin, J. HRC for dual-robot intelligent assembly system based on multimodal perception. Proc. Inst. Mech. Eng. Part B J. Eng. Manuf. 2024, 238, 562–576. [Google Scholar] [CrossRef]
  7. Blankemeyer, S.; Wendorff, D.; Raatz, A. A hand-interaction model for augmented reality enhanced human-robot collaboration. CIRP Ann. 2024, 73, 17–20. [Google Scholar] [CrossRef]
  8. Baratta, A.; Cimino, A.; Longo, F.; Nicoletti, L. Digital twin for human-robot collaboration enhancement in manufacturing systems: Literature review and direction for future developments. Comput. Ind. Eng. 2024, 187, 109764. [Google Scholar] [CrossRef]
  9. Ding, P.; Zhang, J.; Zheng, P.; Zhang, P.; Fei, B.; Xu, Z. Dynamic scenario-enhanced diverse human motion prediction network for proactive human–robot collaboration in customized assembly tasks. J. Intell. Manuf. 2025, 36, 4593–4612. [Google Scholar] [CrossRef]
  10. Liu, C.; Tang, D.; Zhu, H.; Zhang, Z.; Wang, L.; Zhang, Y. Vision language model-enhanced embodied intelligence for digital twin-assisted human-robot collaborative assembly. J. Ind. Inf. Integr. 2025, 48, 100943. [Google Scholar] [CrossRef]
  11. Liu, L.; Guo, F.; Zou, Z.; Duffy, V.G. Application, Development and Future Opportunities of Collaborative Robots (Cobots) in Manufacturing: A Literature Review. Int. J. Hum.-Comput. Interact. 2024, 40, 915–932. [Google Scholar] [CrossRef]
  12. Petzoldt, C.; Niermann, D.; Maack, E.; Sontopski, M.; Vur, B.; Freitag, M. Implementation and Evaluation of Dynamic Task Allocation for Human–Robot Collaboration in Assembly. Appl. Sci. 2022, 12, 12645. [Google Scholar] [CrossRef]
  13. Zanchettin, A.M.; Casalino, A.; Piroddi, L.; Rocco, P. Prediction of human activity patterns for human–robot collaborative assembly tasks. IEEE Trans. Ind. Inform. 2018, 15, 3934–3942. [Google Scholar] [CrossRef]
  14. Bruno, G.; Antonelli, D. Dynamic task classification and assignment for the management of human-robot collaborative teams in workcells. Int. J. Adv. Manuf. Technol. 2018, 98, 2415–2427. [Google Scholar] [CrossRef]
  15. Joo, T.; Jun, H.; Shin, D. Task Allocation in Human–Machine Manufacturing Systems Using Deep Reinforcement Learning. Sustainability 2022, 14, 2245. [Google Scholar] [CrossRef]
  16. Gao, Z.; Yang, R.; Zhao, K.; Yu, W.; Liu, Z.; Liu, L. Hybrid Convolutional Neural Network Approaches for Recognizing Collaborative Actions in Human–Robot Assembly Tasks. Sustainability 2024, 16, 139. [Google Scholar] [CrossRef]
  17. Mavsar, M.; Deni, M.; Nemec, B.; Ude, A. Intention Recognition with Recurrent Neural Networks for Dynamic Human-Robot Collaboration. In Proceedings of the 2021 20th International Conference on Advanced Robotics (ICAR), Ljubljana, Slovenia, 6–10 December 2021; pp. 208–215. [Google Scholar]
  18. Wang, P.; Liu, H.; Wang, L.; Gao, R.X. Deep learning-based human motion recognition for predictive context-aware human-robot collaboration. CIRP Ann. 2018, 67, 17–20. [Google Scholar] [CrossRef]
  19. Bandi, C.; Thomas, U. Skeleton-based Action Recognition for Human-Robot Interaction using Self-Attention Mechanism. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, India, 15–18 December 2021; pp. 1–8. [Google Scholar]
  20. Gao, X.; Yan, L.; Wang, G.; Gerada, C. Hybrid Recurrent Neural Network Architecture-Based Intention Recognition for Human–Robot Collaboration. IEEE Trans. Cybern. 2023, 53, 1578–1586. [Google Scholar] [CrossRef]
  21. Malik, A.A.; Bilberg, A. Complexity-based task allocation in human-robot collaborative assembly. Ind. Robot. Int. J. Robot. Res. Appl. 2019, 46, 471–480. [Google Scholar] [CrossRef]
  22. Barathwaj, N.; Raja, P.; Gokulraj, S. Optimization of assembly line balancing using genetic algorithm. J. Cent. South Univ. 2015, 22, 3957–3969. [Google Scholar] [CrossRef]
  23. Sun, X.; Zhang, R.; Liu, S.; Lv, Q.; Bao, J.; Li, J. A digital twin-driven human–robot collaborative assembly-commissioning method for complex products. Int. J. Adv. Manuf. Technol. 2022, 118, 3389–3402. [Google Scholar] [CrossRef]
  24. Wang, J.; Yan, Y.; Hu, Y.; Yang, X.; Zhang, L. A transfer reinforcement learning and digital-twin based task allocation method for human-robot collaboration assembly. Eng. Appl. Artif. Intell. 2025, 144, 110064. [Google Scholar] [CrossRef]
  25. Dimitropoulos, N.; Kaipis, M.; Giartzas, S.; Michalos, G. Generative AI for automated task modelling and task allocation in human robot collaborative applications. CIRP Ann. 2025, 74, 7–11. [Google Scholar] [CrossRef]
  26. Lim, J.; Patel, S.; Evans, A.; Pimley, J.; Li, Y.; Kovalenko, I. Enhancing Human-Robot Collaborative Assembly in Manufacturing Systems Using Large Language Models. In Proceedings of the 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE), Bari, Italy, 28 August–1 September 2024; pp. 2581–2587. [Google Scholar]
  27. Chen, J.; Huang, S.; Wang, X.; Wang, P.; Zhu, J.; Xu, Z.; Wang, G.; Yan, Y.; Wang, L. Perception-decision-execution coordination mechanism driven dynamic autonomous collaboration method for human-like collaborative robot based on multimodal large language model. Robot. Comput.-Integr. Manuf. 2026, 98, 103167. [Google Scholar] [CrossRef]
  28. Chen, J.; Huang, S.; Xu, Z.; Yan, Y.; Wang, G. Human-robot Autonomous Collaboration Method of Smart Manufacturing Systems Based on Large Language Model and Machine Vision. J. Mech. Eng. 2025, 61, 130–141. [Google Scholar] [CrossRef]
  29. Liu, Z.; Peng, Y. Human-Machine Interactive Collaborative Decision-Making Based on Large Model Technology: Application Scenarios and Future Developments. In Proceedings of the 2025 2nd International Conference on Artificial Intelligence and Digital Technology (ICAIDT), Guangzhou, China, 28–30 April 2025; pp. 106–110. [Google Scholar]
  30. Kong, F.; Gao, T.; Li, H.; Lu, Z. Research on Human-robot Joint Task Assignment Considering Task Complexity. J. Mech. Eng. 2021, 57, 204–214. [Google Scholar]
  31. Bilberg, A.; Malik, A.A. Digital twin driven human–robot collaborative assembly. CIRP Ann. 2019, 68, 499–502. [Google Scholar] [CrossRef]
  32. Cai, M.; Wang, G.; Luo, X.; Xu, X. Task allocation of human-robot collaborative assembly line considering assembly complexity and workload balance. Int. J. Prod. Res. 2025, 63, 4749–4775. [Google Scholar] [CrossRef]
  33. Ji, X.; Wang, J.; Zhao, J.; Zhang, X.; Sun, Z. Intelligent Robotic Assembly Method of Spaceborne Equipment Based on Visual Guidance. J. Mech. Eng. 2018, 54, 63–72. [Google Scholar] [CrossRef]
  34. Wang, Y.; Feng, J.; Liu, J.; Liu, X.; Wang, J. Digital Twin-based Design and Operation of Human-Robot Collaborative Assembly. IFAC-Pap. 2022, 55, 295–300. [Google Scholar] [CrossRef]
  35. Sleeman, W.C.; Kapoor, R.; Ghosh, P. Multimodal Classification: Current Landscape, Taxonomy and Future Directions. ACM Comput. Surv. 2022, 55, 150. [Google Scholar] [CrossRef]
  36. Cao, Y.; Xu, B.; Li, B.; Fu, H. Advanced Design of Soft Robots with Artificial Intelligence. Nano-Micro Lett. 2024, 16, 214. [Google Scholar] [CrossRef] [PubMed]
  37. Hussain, A.; Khan, S.U.; Rida, I.; Khan, N.; Baik, S.W. Human centric attention with deep multiscale feature fusion framework for activity recognition in Internet of Medical Things. Inf. Fusion 2024, 106, 102211. [Google Scholar] [CrossRef]
  38. Zhang, Y.; Ding, K.; Hui, J.; Liu, S.; Guo, W.; Wang, L. Skeleton-RGB integrated highly similar human action prediction in human–robot collaborative assembly. Robot. Comput.-Integr. Manuf. 2024, 86, 102659. [Google Scholar] [CrossRef]
  39. Piardi, L.; Leitão, P.; Queiroz, J.; Pontes, J. Role of digital technologies to enhance the human integration in industrial cyber–physical systems. Annu. Rev. Control. 2024, 57, 100934. [Google Scholar] [CrossRef]
  40. Nadeem, M.; Sohail, S.S.; Javed, L.; Anwer, F.; Saudagar, A.K.J.; Muhammad, K. Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition. Cogn. Comput. 2024, 16, 2566–2579. [Google Scholar] [CrossRef]
  41. Hazmoune, S.; Bougamouza, F. Using transformers for multimodal emotion recognition: Taxonomies and state of the art review. Eng. Appl. Artif. Intell. 2024, 133, 108339. [Google Scholar] [CrossRef]
  42. Bayoudh, K. A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges. Inf. Fusion 2024, 105, 102217. [Google Scholar] [CrossRef]
  43. Heydari, M.; Alinezhad, A.; Vahdani, B. A deep learning framework for quality control process in the motor oil industry. Eng. Appl. Artif. Intell. 2024, 133, 108554. [Google Scholar] [CrossRef]
  44. Liu, B.; Rocco, P.; Zanchettin, A.M.; Zhao, F.; Jiang, G.; Mei, X. A real-time hierarchical control method for safe human–robot coexistence. Robot. Comput.-Integr. Manuf. 2024, 86, 102666. [Google Scholar] [CrossRef]
  45. Xia, L.Q.; Li, C.X.; Zhang, C.B.; Liu, S.M.; Zheng, P. Leveraging error-assisted fine-tuning large language models for manufacturing excellence. Robot. Comput.-Integr. Manuf. 2024, 88, 102728. [Google Scholar] [CrossRef]
  46. Liu, C.; Wang, Y.; Yang, J. A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis. Appl. Intell. 2024, 54, 8415–8441. [Google Scholar] [CrossRef]
  47. Salichs, M.A.; Castro-González, Á.; Salichs, E.; Fernández-Rodicio, E.; Maroto-Gómez, M.; Gamboa-Montero, J.J.; Marques-Villarroya, S.; Castillo, J.C.; Alonso-Martín, F.; Malfaz, M. Mini: A New Social Robot for the Elderly. Int. J. Soc. Robot. 2020, 12, 1231–1249. [Google Scholar] [CrossRef]
  48. Min, B.; Ross, H.; Sulem, E.; Veyseh, A.P.B.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey. ACM Comput. Surv. 2023, 56, 30. [Google Scholar] [CrossRef]
  49. Sun, T.; Feng, B.; Huo, J.; Xiao, Y.; Wang, W.; Peng, J.; Li, Z.; Du, C.; Wang, W.; Zou, G.; et al. Artificial Intelligence Meets Flexible Sensors: Emerging Smart Flexible Sensing Systems Driven by Machine Learning and Artificial Synapses. Nano-Micro Lett. 2023, 16, 14. [Google Scholar] [CrossRef]
  50. Xue, Z.; He, G.; Liu, J.; Jiang, Z.; Zhao, S.; Lu, W. Re-examining lexical and semantic attention: Dual-view graph convolutions enhanced BERT for academic paper rating. Inf. Process. Manag. 2023, 60, 103216. [Google Scholar] [CrossRef]
  51. Luo, Y.; Wu, R.; Liu, J.; Tang, X. A text guided multi-task learning network for multimodal sentiment analysis. Neurocomputing 2023, 560, 126836. [Google Scholar] [CrossRef]
  52. Shafizadegan, F.; Naghsh-Nilchi, A.R.; Shabaninia, E. Multimodal vision-based human action recognition using deep learning: A review. Artif. Intell. Rev. 2024, 57, 178. [Google Scholar] [CrossRef]
  53. Li, Z.; Guo, Q.; Pan, Y.; Ding, W.; Yu, J.; Zhang, Y.; Liu, W.; Chen, H.; Wang, H.; Xie, Y. Multi-level correlation mining framework with self-supervised label generation for multimodal sentiment analysis. Inf. Fusion 2023, 99, 101891. [Google Scholar] [CrossRef]
  54. Wang, T.; Liu, Z.; Wang, L.; Li, M.; Wang, X.V. Data-efficient multimodal human action recognition for proactive human–robot collaborative assembly: A cross-domain few-shot learning approach. Robot. Comput.-Integr. Manuf. 2024, 89, 102785. [Google Scholar] [CrossRef]
  55. Wang, B.; Xie, Q.; Pei, J.; Chen, Z.; Tiwari, P.; Li, Z.; Fu, J. Pre-trained Language Models in Biomedical Domain: A Systematic Survey. ACM Comput. Surv. 2023, 56, 55. [Google Scholar] [CrossRef]
  56. Yang, C.; Liu, Y.; Yin, C. More than a framework: Sketching out technical enablers for natural language-based source code generation. Comput. Sci. Rev. 2024, 53, 100637. [Google Scholar] [CrossRef]
  57. Laplaza, J.; Moreno, F.; Sanfeliu, A. Enhancing Robotic Collaborative Tasks Through Contextual Human Motion Prediction and Intention Inference. Int. J. Soc. Robot. 2025, 17, 2077–2096. [Google Scholar] [CrossRef] [PubMed]
  58. Wang, X.; Zhang, T. Reinforcement learning with imitative behaviors for humanoid robots navigation: Synchronous planning and control. Auton. Robot. 2024, 48, 5. [Google Scholar] [CrossRef]
  59. Ling, S.; Yuan, Y.; Yan, D.; Leng, Y.; Rong, Y.; Huang, G.Q. RHYTHMS: Real-time Data-driven Human-machine Synchronization for Proactive Ergonomic Risk Mitigation in the Context of Industry 4.0 and Beyond. Robot. Comput.-Integr. Manuf. 2024, 87, 102709. [Google Scholar] [CrossRef]
  60. Yan, H.; Chen, Z.; Du, J.; Yan, Y.; Zhao, S. VisPower: Curriculum-Guided Multimodal Alignment for Fine-Grained Anomaly Perception in Power Systems. Electronics 2025, 14, 4747. [Google Scholar] [CrossRef]
  61. Li, J.; Wen, S.; Karimi, H.R. Zoom-Anomaly: Multimodal vision-Language fusion industrial anomaly detection with synthetic data. Inf. Fusion 2026, 127, 103910. [Google Scholar] [CrossRef]
  62. Xue, J.; Hu, B.; Li, L.; Zhang, J. Human—Machine augmented intelligence: Research and applications. Front. Inf. Technol. Electron. Eng. 2022, 23, 1139–1141. [Google Scholar] [CrossRef]
  63. Balamurugan, K.; Sudhakar, G.; Xavier, K.F.; Bharathiraja, N.; Kaur, G. Human-machine interaction in mechanical systems through sensor enabled wearable augmented reality interfaces. Meas. Sens. 2025, 39, 101880. [Google Scholar] [CrossRef]
  64. Huang, J.; Han, D.; Chen, Y.; Tian, F.; Wang, H.; Dai, G. A Survey on Human-Computer Interaction in Mixed Reality. J. Comput.-Aided Des. Comput. Graph. 2016, 28, 869–880. [Google Scholar]
  65. Ma, W.; Sun, B.; Zhao, S.; Dai, K.; Zhao, H.; Wu, J. Research on the Human-machine Hybrid Decision-making Strategy Basing on the Hybrid-augmented Intelligence. J. Mech. Eng. 2025, 61, 288–304. [Google Scholar] [CrossRef]
  66. Bao, Y.; Liu, J.; Jia, X.; Qie, J. An assisted assembly method based on augmented reality. Int. J. Adv. Manuf. Technol. 2024, 135, 1035–1050. [Google Scholar] [CrossRef]
  67. Calderón-Sesmero, R.; Lozano-Hernández, A.; Frontela-Encinas, F.; Cabezas-López, G.; De-Diego-Moro, M. Human–Robot Interaction and Tracking System Based on Mixed Reality Disassembly Tasks. Robotics 2025, 14, 106. [Google Scholar] [CrossRef]
  68. Dong, H.; Zhou, X.; Li, J.; Liu, S.; Sun, J.; Gu, C. An Aircraft Part Assembly Based on Virtual Reality Technology and Mixed Reality Technology. In Proceedings of the 2021 International Conference on Big Data Analytics for Cyber-Physical System in Smart City, Singapore, 10 December 2021; pp. 1251–1263. [Google Scholar]
  69. Yuan, L.; Hongli, S.; Qingmiao, W. Research on AR assisted aircraft maintenance technology. J. Phys. Conf. Ser. 2021, 1738, 012107. [Google Scholar] [CrossRef]
  70. Yan, X.; Bai, G.; Tang, C. An Augmented Reality Tracking Registration Method Based on Deep Learning. In Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition, Xiamen, China, 23–25 September 2022; pp. 360–365. [Google Scholar]
  71. Yang, H.; Li, S.; Zhang, X.; Shen, Q. Research on Satellite Cable Laying and Assembly Guidance Technology Based on Augmented Reality. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; pp. 6550–6555. [Google Scholar]
  72. Hamad, J.; Bianchi, M.; Ferrari, V. Integrated Haptic Feedback with Augmented Reality to Improve Pinching and Fine Moving of Objects. Appl. Sci. 2025, 15, 7619. [Google Scholar] [CrossRef]
  73. Kalkan, Ö.K.; Karabulut, Ş.; Höke, G. Effect of Virtual Reality-Based Training on Complex Industrial Assembly Task Performance. Arab. J. Sci. Eng. 2021, 46, 12697–12708. [Google Scholar] [CrossRef]
  74. Masehian, E.; Ghandi, S. Assembly sequence and path planning for monotone and nonmonotone assemblies with rigid and flexible parts. Robot. Comput.-Integr. Manuf. 2021, 72, 102180. [Google Scholar] [CrossRef]
  75. Yan, Y.; Jiang, K.; Shi, X.; Zhang, J.; Ming, S.; He, Z. Application of Embodied Interaction Technology in Virtual Assembly Training. Packag. Eng. 2025, 46, 84–95. [Google Scholar] [CrossRef]
  76. Havard, V.; Jeanne, B.; Lacomblez, M.; Baudry, D. Digital twin and virtual reality: A co-simulation environment for design and assessment of industrial workstations. Prod. Manuf. Res. 2019, 7, 472–489. [Google Scholar] [CrossRef]
  77. Zhang, X.; Bai, X.; Zhang, S.; He, W.; Wang, S.; Yan, Y.; Wang, P.; Liu, L. A novel mixed reality remote collaboration system with adaptive generation of instructions. Comput. Ind. Eng. 2024, 194, 110353. [Google Scholar] [CrossRef]
  78. Seetohul, J.; Shafiee, M.; Sirlantzis, K. Augmented Reality (AR) for Surgical Robotic and Autonomous Systems: State of the Art, Challenges, and Solutions. Sensors 2023, 23, 6202. [Google Scholar] [CrossRef]
  79. Gu, J.; Meng, X.; Lu, G.; Hou, L.; Minzhe, N.; Liang, X.; Yao, L.; Huang, R.; Zhang, W.; Jiang, X.; et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Adv. Neural Inf. Process. Syst. 2022, 35, 26418–26431. [Google Scholar]
  80. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
  81. Chen, Z.; Liu, L.; Wan, Y.; Chen, Y.; Dong, C.; Li, W.; Lin, Y. Improving BERT with local context comprehension for multi-turn response selection in retrieval-based dialogue systems. Comput. Speech Lang. 2023, 82, 101525. [Google Scholar] [CrossRef]
  82. Tam, Z.R.; Pai, Y.-T.; Lee, Y.-W. VisTW: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan. arXiv 2025, arXiv:2503.10427. [Google Scholar]
  83. Krupas, M.; Kajati, E.; Liu, C.; Zolotova, I. Towards a Human-Centric Digital Twin for Human–Machine Collaboration: A Review on Enabling Technologies and Methods. Sensors 2024, 24, 2232. [Google Scholar] [CrossRef]
  84. Zafar, M.H.; Langås, E.F.; Sanfilippo, F. Exploring the synergies between collaborative robotics, digital twins, augmentation, and industry 5.0 for smart manufacturing: A state-of-the-art review. Robot. Comput.-Integr. Manuf. 2024, 89, 102769. [Google Scholar] [CrossRef]
  85. Piardi, L.; Queiroz, J.; Pontes, J.; Leitão, P. Digital Technologies to Empower Human Activities in Cyber-Physical Systems. IFAC-Pap. 2023, 56, 8203–8208. [Google Scholar] [CrossRef]
  86. Choi, S.H.; Park, K.-B.; Roh, D.H.; Lee, J.Y.; Mohammed, M.; Ghasemi, Y.; Jeong, H. An integrated mixed reality system for safety-aware human-robot collaboration using deep learning and digital twin generation. Robot. Comput.-Integr. Manuf. 2022, 73, 102258. [Google Scholar] [CrossRef]
  87. Tao, F.; Xiao, B.; Qi, Q.; Cheng, J.; Ji, P. Digital twin modeling. J. Manuf. Syst. 2022, 64, 372–389. [Google Scholar] [CrossRef]
  88. Ji, Y.; Zhang, Z.; Tang, D.; Zheng, Y.; Liu, C.; Zhao, Z.; Li, X. Foundation models assist in human–robot collaboration assembly. Sci. Rep. 2024, 14, 24828. [Google Scholar] [CrossRef]
  89. Zhang, T.; Chu, H.; Zou, Y.; Sun, H. A robust electromyography signals-based interaction interface for human-robot collaboration in 3D operation scenarios. Expert Syst. Appl. 2024, 238, 122003. [Google Scholar] [CrossRef]
  90. You, Y.; Liu, Y.; Ji, Z. Human Digital Twin for Real-Time Physical Fatigue Estimation in Human-Robot Collaboration. In Proceedings of the 2024 IEEE International Conference on Industrial Technology (ICIT), Bristol, UK, 25–27 March 2024; pp. 1–6. [Google Scholar]
  91. Malik, A.A.; Bilberg, A. Digital twins of human robot collaboration in a production setting. Procedia Manuf. 2018, 17, 278–285. [Google Scholar] [CrossRef]
  92. Dröder, K.; Bobka, P.; Germann, T.; Gabriel, F.; Dietrich, F. A Machine Learning-Enhanced Digital Twin Approach for Human-Robot-Collaboration. Procedia CIRP 2018, 76, 187–192. [Google Scholar] [CrossRef]
  93. Kousi, N.; Gkournelos, C.; Aivaliotis, S.; Lotsaris, K.; Bavelos, A.C.; Baris, P.; Michalos, G.; Makris, S. Digital Twin for Designing and Reconfiguring Human–Robot Collaborative Assembly Lines. Appl. Sci. 2021, 11, 4620. [Google Scholar] [CrossRef]
  94. Tchane Djogdom, G.V.; Meziane, R.; Otis, M.J.D. Robust dynamic robot scheduling for collaborating with humans in manufacturing operations. Robot. Comput.-Integr. Manuf. 2024, 88, 102734. [Google Scholar] [CrossRef]
  95. Oyekan, J.O.; Hutabarat, W.; Tiwari, A.; Grech, R.; Aung, M.H.; Mariani, M.P.; López-Dávalos, L.; Ricaud, T.; Singh, S.; Dupuis, C. The effectiveness of virtual environments in developing collaborative strategies between industrial robots and humans. Robot. Comput.-Integr. Manuf. 2019, 55, 41–54. [Google Scholar] [CrossRef]
  96. Peng, Y.J.; Han, J.H.; Zhang, Z.L.; Fan, L.F.; Liu, T.Y.; Qi, S.Y.; Feng, X.; Ma, Y.X.; Wang, Y.Z.; Zhu, S.C. The Tong Test: Evaluating Artificial General Intelligence Through Dynamic Embodied Physical and Social Interactions. Engineering 2024, 34, 12–22. [Google Scholar] [CrossRef]
  97. Cao, Y.; Zhou, Q.; Yuan, W.; Ye, Q.; Popa, D.; Zhang, Y. Human-robot collaborative assembly and welding: A review and analysis of the state of the art. J. Manuf. Process. 2024, 131, 1388–1403. [Google Scholar] [CrossRef]
  98. Qu, W.B.; Li, J.; Zhang, R.; Liu, S.M.; Bao, J.S. Adaptive planning of human-robot collaborative disassembly for end-of-life lithium-ion batteries based on digital twin. J. Intell. Manuf. 2024, 35, 2021–2043. [Google Scholar] [CrossRef]
  99. Pasik-Duncan, B. Adaptive Control [Second edition, by Karl J. Astrom and Bjorn Wittenmark, Addison Wesley (1995)]. IEEE Control. Syst. Mag. 2002, 16, 87. [Google Scholar] [CrossRef]
  100. Zhang, D.; Wei, B. A review on model reference adaptive control of robotic manipulators. Annu. Rev. Control. 2017, 43, 188–198. [Google Scholar] [CrossRef]
  101. Duan, J.; Zhuang, L.; Zhang, Q.; Zhou, Y.; Qin, J. Multimodal perception-fusion-control and human–robot collaboration in manufacturing: A review. Int. J. Adv. Manuf. Technol. 2024, 132, 1071–1093. [Google Scholar] [CrossRef]
  102. Jiao, C.; Yu, L.; Su, X.; Wen, Y.; Dai, X. Adaptive hybrid impedance control for dual-arm cooperative manipulation with object uncertainties. Automatica 2022, 140, 110232. [Google Scholar] [CrossRef]
  103. Yu, X.; Li, B.; He, W.; Feng, Y.; Cheng, L.; Silvestre, C. Adaptive-Constrained Impedance Control for Human–Robot Co-Transportation. IEEE Trans. Cybern. 2022, 52, 13237–13249. [Google Scholar] [CrossRef] [PubMed]
  104. Hameed, A.; Ordys, A.; Możaryn, J.; Sibilska-Mroziewicz, A. Control System Design and Methods for Collaborative Robots: Review. Appl. Sci. 2023, 13, 675. [Google Scholar] [CrossRef]
  105. Ding, S.; Peng, J.; Xin, J.; Zhang, H.; Wang, Y. Task-Oriented Adaptive Position/Force Control for Robotic Systems Under Hybrid Constraints. IEEE Trans. Ind. Electron. 2024, 71, 12612–12622. [Google Scholar] [CrossRef]
  106. Lin, H.-I.; Shodiq, M.A.F.; Hsieh, M.F. Robot path planning based on three-dimensional artificial potential field. Eng. Appl. Artif. Intell. 2025, 144, 110127. [Google Scholar] [CrossRef]
  107. Cui, J.; Wu, L.; Huang, X.; Xu, D.; Liu, C.; Xiao, W. Multi-strategy adaptable ant colony optimization algorithm and its application in robot path planning. Knowl.-Based Syst. 2024, 288, 111459. [Google Scholar] [CrossRef]
  108. Bai, Z.; Pang, H.; He, Z.; Zhao, B.; Wang, T. Path Planning of Autonomous Mobile Robot in Comprehensive Unknown Environment Using Deep Reinforcement Learning. IEEE Internet Things J. 2024, 11, 22153–22166. [Google Scholar] [CrossRef]
  109. Gao, Q.; Yuan, Q.; Sun, Y.; Xu, L. Path planning algorithm of robot arm based on improved RRT* and BP neural network algorithm. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101650. [Google Scholar] [CrossRef]
  110. Mohanan, M.G.; Salgoankar, A. A survey of robotic motion planning in dynamic environments. Robot. Auton. Syst. 2018, 100, 171–185. [Google Scholar] [CrossRef]
  111. Tang, X.; Zhou, H.; Xu, T. Obstacle avoidance path planning of 6-DOF robotic arm based on improved A* algorithm and artificial potential field method. Robotica 2024, 42, 457–481. [Google Scholar] [CrossRef]
  112. Cao, M.; Mao, H.; Tang, X.; Sun, Y.; Chen, T. A novel RRT*-Connect algorithm for path planning on robotic arm collision avoidance. Sci. Rep. 2025, 15, 2836. [Google Scholar] [CrossRef] [PubMed]
  113. Huang, J.; Tateo, D.; Liu, P.; Peters, J. Adaptive Control Based Friction Estimation for Tracking Control of Robot Manipulators. IEEE Robot. Autom. Lett. 2025, 10, 2454–2461. [Google Scholar] [CrossRef]
  114. Chen, Z.; Zhan, G.; Jiang, Z.; Zhang, W.; Rao, Z.; Wang, H.; Li, J. Adaptive impedance control for docking robot via Stewart parallel mechanism. ISA Trans. 2024, 155, 361–372. [Google Scholar] [CrossRef]
  115. Frigerio, M.; Buchli, J.; Caldwell, D.G. Code generation of algebraic quantities for robot controllers. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Algarve, Portugal, 7–12 October 2012; pp. 2346–2351. [Google Scholar]
  116. Han, Y.; Lyu, C. Multi-stage guided code generation for Large Language Models. Eng. Appl. Artif. Intell. 2025, 139, 109491. [Google Scholar] [CrossRef]
  117. Liu, Z.; Tang, Y.; Luo, X.; Zhou, Y.; Zhang, L.F. No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT. IEEE Trans. Softw. Eng. 2024, 50, 1548–1584. [Google Scholar] [CrossRef]
  118. Burns, K.; Jain, A.; Go, K.; Xia, F.; Stark, M.; Schaal, S.; Hausman, K. GenCHiP: Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation Tasks. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 9596–9603. [Google Scholar]
  119. Macaluso, A.; Cote, N.; Chitta, S. Toward Automated Programming for Robotic Assembly Using ChatGPT. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 17687–17693. [Google Scholar]
  121. Tocchetti, A.; Corti, L.; Balayn, A.; Yurrita, M.; Lippmann, P.; Brambilla, M.; Yang, J. AI Robustness: A Human-Centered Perspective on Technological Challenges and Opportunities. ACM Comput. Surv. 2025, 57, 141. [Google Scholar] [CrossRef]
  121. Zhe, L.I.; Ke, W.; Biao, W.; Ziqi, Z.; Yafei, L.I.; Yibo, G.U.O.; Yazhou, H.U.; Hua, W.; Pei, L.V.; Mingliang, X.U. Human-Machine Fusion Intelligent Decision-Making: Concepts, Framework, and Applications. J. Electron. Inf. Technol. 2025, 47, 3439–3464. [Google Scholar] [CrossRef]
  122. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Amodei, D. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  123. Touvron, H.; Martin, L.; Stone, K.R.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.-l.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  124. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805. [Google Scholar] [CrossRef]
  125. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar] [CrossRef]
  126. Wang, C.; Hasler, S.; Tanneberg, D.; Ocker, F.; Joublin, F. LaMI: Large Language Models Driven Multi-Modal Interface for Human-Robot Communication. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’24), Honolulu, HI, USA, 11–16 May 2024. [Google Scholar]
  127. Xiong, R.; Chen, L.; Feng, Z.; Liu, J.; Feng, S. Fine-tuned Multimodal Large Language Models are Zero-shot Learners in Image Quality Assessment. In Proceedings of the 2025 IEEE International Conference on Multimedia and Expo (ICME), Nantes, France, 30 June–4 July 2025; pp. 1–6. [Google Scholar]
  128. Che, X.; Chu, M.; Chen, Y.; Gu, H.; Li, Q. Chain-of-thought driven dynamic prompting and computation method. Appl. Soft Comput. 2026, 186, 114204. [Google Scholar] [CrossRef]
  129. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  130. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4768–4777. [Google Scholar]
  131. Sadigh, D.; Sastry, S.; Seshia, S.; Dragan, A. Planning for Autonomous Cars that Leverage Effects on Human Actions. In Robotics: Science and Systems; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  132. Ng, A.Y.; Russell, S.J. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), Stanford, CA, USA, 29 June–2 July 2000; pp. 663–670. [Google Scholar]
  133. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; p. 2011. [Google Scholar]
  134. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I.J.A. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv 2017, arXiv:1706.02275. [Google Scholar]
  135. Rashid, T.; Samvelyan, M.; Witt, C.S.D.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 178. [Google Scholar]
  136. Sun, R.; Wu, C.; Zhao, X.; Zhao, B.; Jiang, Y. Object Recognition and Grasping for Collaborative Robots Based on Vision. Sensors 2024, 24, 195. [Google Scholar] [CrossRef] [PubMed]
  137. Petzoldt, C.; Harms, M.; Freitag, M. Review of task allocation for human-robot collaboration in assembly. Int. J. Comput. Integr. Manuf. 2023, 36, 1675–1715. [Google Scholar] [CrossRef]
  138. Lamon, E.; Fusaro, F.; De Momi, E.; Ajoudani, A. A Unified Architecture for Dynamic Role Allocation and Collaborative Task Planning in Mixed Human-Robot Teams. arXiv 2023, arXiv:2301.08038. [Google Scholar] [CrossRef]
  139. Jha, D.K.; Jain, S.; Romeres, D.; Yerazunis, W.; Nikovski, D. Generalizable Human-Robot Collaborative Assembly Using Imitation Learning and Force Control. In Proceedings of the 2023 European Control Conference (ECC), Bucharest, Romania, 13–16 June 2023; pp. 1–8. [Google Scholar]
  140. Fan, J.; Yin, Y.; Wang, T.; Dong, W.; Zheng, P.; Wang, L. Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey. Front. Eng. Manag. 2025, 12, 177–200. [Google Scholar] [CrossRef]
Figure 1. Comprehensive review and key methods of HRC in manufacturing.
Figure 2. Schematic diagram of autonomous task assignment based on deep learning.
Figure 3. Schematic diagram of task assignment by Large Language Models.
Figure 4. Architecture diagram of the robot–human interaction system.
Figure 5. Human–robot work system.
Figure 6. Comparison of traditional hybrid enhanced interaction technologies based on AR/VR/MR.
Figure 7. Closed-loop framework of VLM-based hybrid enhanced interaction technology.
Figure 8. Core perception and modeling.
Figure 9. Development stages of digital twin.
Figure 10. Technical architecture of adaptive motion control.
Figure 11. Collision detection mechanism.
Figure 12. Human–robot collaborative decision-making framework.
Figure 13. Comparison of decision-making techniques based on Large Language Models.
Figure 14. Comparison of reinforcement learning-based decision-making techniques.
Figure 15. Parallel network grasping architecture. Reprinted from [136].
Figure 16. HRC workflow with embedded LLMs and VFMs. Reprinted from [88].
Table 1. Overview of human–robot collaborative task allocation methods.

No. | Authors (Year) | Method
1 | Liu, L. et al. (2024) [11] | PRISMA
2 | Petzoldt, C. et al. (2022) [12] | HRC
3 | Malik, A.A. et al. (2019) [13] | complexity-based task classification
4 | Bruno, G. et al. (2018) [14] | task classification
5 | Joo, T. et al. (2022) [15] | deep reinforcement learning
6 | Gao, Z. et al. (2024) [16] | CNN
7 | Mavsar, M. et al. (2021) [17] | RNN
8 | Wang, P. et al. (2018) [18] | deep learning
9 | Bandi, C. et al. (2021) [19] | RNN, CNN
10 | Malik, A.A. et al. (2019) [20] | complexity-based task classification
11 | Gao, X. et al. (2021) [21] | RNN
12 | Barathwaj, N. et al. (2015) [22] | RULA, problem-based genetic algorithm (GA)
13 | Sun, X. et al. (2022) [23] | RULA, digital twin
14 | Wang, J. et al. (2025) [24] | transfer reinforcement learning, augmented reality
15 | Dimitropoulos, N. et al. (2025) [25] | LLM, digital twin
16 | Lim, J. et al. (2024) [26] | LLM
17 | Chen, J. et al. (2026) [27] | MLLM
18 | Sihan, H. et al. (2025) [28] | LLM
19 | Liu, Z. et al. (2025) [29] | LLM
20 | Kong, F. et al. (2021) [30] | LLM
21 | Bilberg, A. et al. (2019) [31] | LLM, digital twin
22 | Cai, M. et al. (2025) [32] | LLM
23 | Xuquan, J.I. et al. (2018) [33] | LLM
24 | Wang, Y. et al. (2022) [34] | LLM
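Rows 15–24 above summarize allocation pipelines driven by large language models, in the spirit of Figure 3. As a purely illustrative aid (our sketch, not taken from any cited work), the snippet below shows one minimal way such a pipeline can be wired: task and agent descriptions are serialized into a prompt, and the model's JSON reply is parsed and validated before use. The function call_llm is a hypothetical placeholder for whatever chat-completion backend is available; here it returns a canned reply so the sketch runs end to end.

```python
import json

def build_allocation_prompt(tasks, humans, robots):
    """Serialize tasks and available agents into a plain-text allocation request."""
    lines = [
        "Assign each task to exactly one agent.",
        'Answer with JSON only, e.g. {"t1": "robot_1"}.',
        "",
        "Tasks:",
    ]
    lines += [f"- {t['id']}: {t['desc']} (dexterity={t['dexterity']}, precision={t['precision']})"
              for t in tasks]
    lines.append("Human agents: " + ", ".join(humans))
    lines.append("Robot agents: " + ", ".join(robots))
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM call; returns a canned reply so the sketch runs."""
    return '{"t1": "robot_1", "t2": "operator_1"}'

def allocate(tasks, humans, robots):
    """Ask the (placeholder) LLM for an allocation and validate it against the known agents."""
    reply = call_llm(build_allocation_prompt(tasks, humans, robots))
    allocation = json.loads(reply)
    known_agents = set(humans) | set(robots)
    if not all(agent in known_agents for agent in allocation.values()):
        raise ValueError("LLM proposed an unknown agent; fall back to a rule-based allocation")
    return allocation

if __name__ == "__main__":
    tasks = [
        {"id": "t1", "desc": "torque bolts on gearbox housing", "dexterity": "low", "precision": "high"},
        {"id": "t2", "desc": "route and clip wiring harness", "dexterity": "high", "precision": "medium"},
    ]
    print(allocate(tasks, humans=["operator_1"], robots=["robot_1"]))
```

In practice the validation step matters at least as much as the prompt wording: a malformed or infeasible reply must be caught before it reaches the shop floor.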
Table 2. Overview of multimodal perception methods.

No. | Authors (Year) | Method
1 | Sleeman et al. (2022) [35] | Multimodal classification taxonomy
2 | Cao et al. (2024) [36] | Multimodal soft sensors
3 | Hussain et al. (2024) [37] | Deep multiscale feature fusion
4 | Zhang et al. (2024) [38] | Skeleton–RGB integrated action prediction
5 | Piardi et al. (2024) [39] | Human-in-the-Mesh (HitM) integration
6 | Nadeem et al. (2024) [40] | Vision-enabled Large Language Models (LLMs)
7 | Hazmoune et al. (2024) [41] | Transformers
8 | Bayoudh, K. (2024) [42] | Convolutional Neural Networks (CNNs)
9 | Heydari et al. (2024) [43] | Residual Networks (ResNet)
10 | Liu et al. (2024) [44] | Hierarchical control method
11 | Zhang et al. (2024) [45] | Human mesh recovery algorithm
12 | Liu et al. (2024) [46] | Transformer encoder
13 | Salichs et al. (2020) [47] | Social robot platform
14 | Min et al. (2023) [48] | Large Pre-trained Language Models (PLMs)
15 | Sun et al. (2023) [49] | Flexible sensors
16 | Xue et al. (2023) [50] | Bidirectional Encoder Representations from Transformers (BERT)
17 | Luo et al. (2023) [51] | Text-guided multi-task learning network
18 | Shafizadegan et al. (2024) [52] | Feature-level fusion
19 | Li et al. (2023) [53] | Self-supervised label generation (Self-MM)
20 | Wang et al. (2024) [54] | Cross-domain few-shot learning (CDMFL)
21 | Wang et al. (2023) [55] | Multimodal Pre-trained Language Models (PLMs)
22 | Yang et al. (2024) [56] | Natural language-based code generation
23 | Laplaza et al. (2025) [57] | Contextual human motion prediction
24 | Wang et al. (2024) [58] | Reinforcement learning with imitative behaviors
25 | Ling et al. (2024) [59] | Real-time data-driven human–machine synchronization (RHYTHMS)
26 | Yan, H. et al. (2025) [60] | Curriculum-guided multimodal alignment
27 | Li, J. et al. (2026) [61] | Synthetic anomaly generation (Zoom-Anomaly)
Table 3. Overview of human–robot hybrid augmented interaction methods.

No. | Authors (Year) | Method
1 | Xue et al. (2022) [62] | Review
2 | Balamurugan et al. (2025) [63] | Wearable sensor-based AR interfaces
3 | Huang et al. (2016) [64] | Review
4 | Ma et al. (2025) [65] | Human–machine hybrid decision-making strategy
5 | Bao et al. (2024) [66] | LK optical flow registration
6 | Calderón-Sesmero et al. (2025) [67] | Deep learning
7 | Dong et al. (2022) [68] | VR, MR
8 | Yuan et al. (2021) [69] | AR
9 | Yan et al. (2022) [70] | Deep learning in AR
10 | Yang et al. (2021) [71] | AR
11 | Hamad et al. (2025) [72] | AR, VR
12 | Kalkan et al. (2021) [73] | VR
13 | Masehian et al. (2021) [74] | SPP-Flex
14 | Yan et al. (2025) [75] | Review
15 | Havard et al. (2019) [76] | Co-simulation and communication architecture between digital twin and virtual reality software
16 | Zhang et al. (2024) [77] | MR
17 | Seetohul et al. (2023) [78] | AR
18 | Gu et al. (2022) [79] | Review
19 | Bai et al. (2023) [80] | Vision-language model
20 | Chen et al. (2023) [81] | BERT-LCC
21 | Tam et al. (2025) [82] | VisTW
Table 4. Overview of human–robot fusion enabled by digital twin.

No. | Authors (Year) | Method
1 | Krupas et al. (2024) [83] | Technology and method review
2 | Zafar et al. (2024) [84] | State-of-the-art review
3 | Piardi et al. (2023) [85] | Digital technologies
4 | Choi et al. (2022) [86] | Deep learning
5 | Tao et al. (2022) [87] | Digital twin modeling
6 | Ji et al. (2024) [88] | LLMs, VFMs
7 | Tie et al. (2024) [89] | R3DNet
8 | You et al. (2024) [90] | IK-BiLSTM-AM
9 | Malik et al. (2018) [91] | Tecnomatix Process Simulate
10 | Dröder et al. (2018) [92] | ANN, obstacle detection
11 | Kousi et al. (2021) [93] | Optimization algorithms
12 | Tchane Djogdom et al. (2024) [94] | Robust dynamic scheduling
13 | Oyekan et al. (2019) [95] | Digital twin
14 | Liu et al. (2024) [96] | Web-based digital twin
Table 5. Overview of adaptive motion control.

No. | Authors (Year) | Method
1 | Cao et al. (2024) [97] | review
2 | Li et al. (2024) [98] | review
3 | Astrom et al. (1994) [99] | adaptive control
4 | Zhang et al. (2017) [100] | review
5 | Duan et al. (2024) [101] | MMI
6 | Jiao et al. (2022) [102] | AHIC
7 | Yu et al. (2022) [103] | ACIC
8 | Hameed et al. (2023) [104] | review
9 | Ding et al. (2024) [105] | TOAPFC
10 | Lin et al. (2025) [106] | improved 3D APF
11 | Cui et al. (2024) [107] | MsAACO
12 | Bai et al. (2024) [108] | IDDQN
13 | Gao et al. (2023) [109] | BP-RRT
14 | Mohanan et al. (2018) [110] | review
15 | Tang et al. (2024) [111] | improved A*
16 | Cao et al. (2025) [112] | RRT*-Connect
17 | Huang et al. (2025) [113] | ARIC
18 | Chen et al. (2024) [114] | Stewart parallel mechanism
19 | Frigerio et al. (2012) [115] | DSLs
20 | Han et al. (2025) [116] | LLMs
21 | Liu et al. (2024) [117] | LLMs
22 | Burns et al. (2024) [118] | LLMs
23 | Macaluso et al. (2024) [119] | LLMs
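Rows 10 and 15 build on potential-field ideas: [106] proposes an improved three-dimensional artificial potential field, and [111] combines an improved A* algorithm with an artificial potential field. For readers unfamiliar with the baseline these works improve upon, the following is a minimal 2D sketch of the classical artificial potential field planner (our illustration, with arbitrarily chosen gains): an attractive gradient toward the goal, a repulsive gradient inside each obstacle's influence radius, and fixed-length gradient descent.

```python
import numpy as np

def apf_gradient(q, goal, obstacles, k_att=1.0, k_rep=5.0, rho0=1.0):
    """Gradient of the classical attractive + repulsive potential at configuration q."""
    grad = k_att * (q - goal)                        # attractive term: 0.5*k_att*||q - goal||^2
    for obs in obstacles:
        d = np.linalg.norm(q - obs)
        if 1e-9 < d < rho0:                          # repulsion acts only inside the influence radius
            grad += k_rep * (1.0 / rho0 - 1.0 / d) / d**3 * (q - obs)
    return grad

def plan(start, goal, obstacles, step=0.05, tol=0.1, max_iters=2000):
    """Fixed-step gradient descent on the potential field; returns the visited waypoints."""
    q = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    obstacles = [np.asarray(o, dtype=float) for o in obstacles]
    path = [q.copy()]
    for _ in range(max_iters):
        g = apf_gradient(q, goal, obstacles)
        q = q - step * g / (np.linalg.norm(g) + 1e-9)   # normalized step avoids huge jumps near obstacles
        path.append(q.copy())
        if np.linalg.norm(q - goal) < tol:
            break
    return np.array(path)

if __name__ == "__main__":
    waypoints = plan(start=[0.0, 0.0], goal=[5.0, 5.0], obstacles=[[2.5, 3.5]])
    print(f"{len(waypoints)} waypoints, end point {waypoints[-1].round(2)}")
```

The sketch inherits the method's well-known weaknesses, such as local minima and oscillation near obstacles, which is precisely what the improved variants listed in the table aim to mitigate.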
Table 6. Overview of human–robot collaborative decision-making.

No. | Authors (Year) | Method
1 | Tocchetti et al. (2025) [120] | ML
2 | Zhe et al. (2025) [121] | method review
3 | Brown et al. (2020) [122] | LLM
4 | Touvron et al. (2023) [123] | LLM
5 | Team et al. (2023) [124] | LLM
6 | Wang et al. (2024) [125] | LLM
7 | Wang et al. (2024) [126] | LLM
8 | Xiong et al. (2025) [127] | MLLM
9 | Che et al. (2026) [128] | DPC-CoT
10 | Ribeiro et al. (2016) [129] | LIME
11 | Lundberg et al. (2017) [130] | SHAP
12 | Sadigh et al. (2016) [131] | IRL
13 | Ng et al. (2000) [132] | IRL
14 | Ouyang et al. (2022) [133] | LLM
15 | Lowe et al. (2017) [134] | RL
16 | Rashid et al. (2020) [135] | QMIX
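Rows 10–11 cite LIME [129] and SHAP [130] as post hoc explanation tools that help an operator inspect why a learned model recommended a given decision. The toy below is not LIME or SHAP; it is a simple occlusion-style attribution written from scratch (all names, features, and weights are hypothetical) that conveys the same idea: perturb one feature at a time toward a baseline value and report how much the decision score drops.

```python
import numpy as np

def decision_score(x):
    """Hypothetical scoring model: how suitable is this subtask for the robot?
    Features: x = [part_weight_kg, required_precision_mm, cycle_time_s]."""
    w = np.array([0.8, -2.0, 0.1])        # hand-picked weights for the toy model
    return float(1.0 / (1.0 + np.exp(-(w @ x - 1.0))))

def occlusion_attribution(score_fn, x, baseline):
    """Per-feature attribution: score drop when each feature is reset to its baseline value."""
    full = score_fn(x)
    contribs = []
    for i in range(len(x)):
        x_masked = x.copy()
        x_masked[i] = baseline[i]
        contribs.append(full - score_fn(x_masked))
    return full, np.array(contribs)

if __name__ == "__main__":
    x = np.array([6.0, 0.2, 12.0])        # a heavy, high-precision, slow subtask
    baseline = np.array([1.0, 1.0, 5.0])  # a "typical" subtask used as the reference point
    score, contribs = occlusion_attribution(decision_score, x, baseline)
    for name, c in zip(["part_weight_kg", "required_precision_mm", "cycle_time_s"], contribs):
        print(f"{name:>22}: {c:+.3f}")
    print(f"robot-suitability score: {score:.3f}")
```

LIME and SHAP refine this basic idea with local surrogate models and Shapley-value weighting, respectively, which is why they are the tools of choice in the cited decision-support settings.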
Table 7. Overview of applications of human–robot collaborative manufacturing technologies.

No. | Authors (Year) | Method
1 | Sun et al. (2024) [136] | YOLO-GG
2 | Petzoldt et al. (2023) [137] | review
3 | Lamon et al. (2023) [138] | behavior trees + MILP
4 | Jha et al. (2023) [139] | imitation learning + force control
5 | Fan et al. (2025) [140] | VLM + deep RL
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
