Table 5 presents an overview of the motivations and enabled tasks reported in the reviewed articles. Across these articles, assembly tasks dominate as the primary application context for FM-enabled HRC. Representative examples include multi-step product and component assembly such as cable shark assembly with sequential subtasks [56], satellite and computing module installation [12, 57, 58], light switch and gearbox assembly [59, 60], and complex electronic and wire harness assembly [61, 62, 63].
Overall, these works aim to improve how robots understand human intent and natural language (e.g., voice-driven task management and interpretation of ambiguous or incomplete instructions), and to translate that understanding into actionable, constraint-compliant robot behaviors and control code that remain effective under uncertainty and changing shop-floor conditions. Examples in this category are [12, 55, 56, 63, 68, 69]. A second shared objective is to strengthen multimodal grounding to support robust perception, planning, and manipulation, including handling unseen objects and estimating 6D poses without repeatedly training task-specific models. Some examples in this category include [12, 57, 64, 70, 71]. In parallel, some studies frame FMs as the core of assistive “co-worker” systems that provide real-time error detection, adaptive guidance, and personalized operator support to reduce cognitive load and improve trust and collaborative efficiency [59, 67, 72, 73]. Finally, a subset of works explicitly targets dynamic multi-agent settings, such as multi-human–multi-robot collaboration, uncertain disassembly, and task rescheduling. Some of them combine FMs with graph-based reasoning (KG/GNN) and unified cognitive architectures to enable safer, more autonomous coordination and robust task allocation under real-world variability [60, 65, 66].
Table 5.
Overview of identified HRC studies using FMs, year available online, core motivations for adopting FMs in HRC, and the corresponding collaborative task scenarios addressed.
| Article | Year | Need/Objective of Using FMs in HRC | HRC Task Description |
|---|---|---|---|
| [56] | 2024 | Reduce voice communication barriers. | Assembly of a cable shark product using voice-based natural language commands. |
| [12] | 2024 | Understand ambiguous human instructions; reasoning about new objects. | Assembly of a satellite component model with flexible, human-guided sequencing. |
| [73] | 2025 | Improve HRC efficiency by adapting robot strategies based on human trust inferred from robot performance. | Pick-and-place on a conveyor belt with adaptive grasp or non-grasp decisions. |
| [68] | 2024 | Enable natural-language-based error correction and robust intention understanding in HRC. | Human–robot collaborative assembly where robots select, correct, and hand over tools based on human language. |
| [72] | 2025 | Provide real-time, non-intrusive, adaptive error detection and multimodal guidance. | Assembly of a cast iron horizontal bare-shaft centrifugal pump. |
| [64] | 2025 | Improve accuracy of understanding human intentions using an attention-based multimodal fusion model. | Drone disassembly use case. The human guides the robot using voice and gesture commands. |
| [57] | 2025 | Eliminate the need for repeatedly training multiple vision models and reduce reliance on pre-programmed robot scripts. | Multi-step HRC assembly of aerospace electronic modules, including pick-and-place, handling tools, screwing, and inspection tasks. |
| [59] | 2025 | Reduce operator cognitive load by providing real-time, contextual assistance and act as an intermediary to control physical agents via natural language commands. | Assembly of a light switch, including pick-and-place of components and specific assembly subtasks supported by a cobot and smart projector. |
| [65] | 2025 | Enable autonomous reasoning and adaptation to dynamic, unscheduled multi-human multi-robot disassembly tasks. | Disassembly of automotive lithium-ion batteries under dynamic and uncertain conditions. |
| [71] | 2024 | Enable adaptive task planning and human-guided execution in unstructured HRC manufacturing. | Three predefined tasks for an HRC process: fetching a specific part from storage, placing a gear into a case, and picking and mounting a case cover onto the module. |
| [66] | 2025 | Improve robust perception and task rescheduling in dynamic disassembly tasks with changing conditions, such as corrosion or damage of disassembly objects. | Dynamic disassembly of end-of-life automotive lithium-ion batteries. |
| [67] | 2025 | Effective integration of historical and real-time information for question answering and robot manipulation. | AR-assisted human–robot collaborative disassembly with historical Visual Question Answering (VQA) and tool guidance. |
| [69] | 2025 | Enable few-shot human intent recognition and semantic understanding in data-scarce industrial scenarios. | Recognition of human actions and intentions during industrial tasks involving tools and parts. |
| [55] | 2024 | Mitigate human–robot communication ambiguity prevalent in HRC manufacturing scenarios. | Assembly of a gear pump module involving language-guided pick-and-place actions. |
| [61] | 2025 | Improve generality on task planning, avoiding execution conflicts while balancing operator experience and efficiency. | Assembly of ten types of electronic assembly tasks (e.g., chassis connectors, heat sink, printed circuit boards, and fan assembly). |
| [70] | 2025 | Enable accurate 6D pose estimation of novel objects without retraining for HRC. | The robot follows voice commands, analyses the available objects, identifies and grasps the target, and hands it over to the human. |
| [58] | 2025 | Enable adaptive and proactive robotic manipulation in dynamic HRC assembly environments. | Assembly of a small satellite: humans assemble components and the robot proactively installs designated parts. |
| [62] | 2025 | Enable continuous instruction understanding and long-term reasoning in HRC by integrating a reflection-based contextual memory into LLM agents. | Wire harness assembly for aviation electronic equipment requiring accurate robotic manipulation. |
| [60] | 2025 | Enable proactive, autonomous, and generalizable HRC in dynamic manufacturing assembly and disassembly. | Reducer/gearbox multi-step assembly composed of diverse tasks such as component positioning, shaft installation, bearing assembly, and housing alignment. |
| [63] | 2025 | Achieve human-like collaborative intelligence by structuring perception, decision-making, and execution in a unified architecture. | Dynamic engine assembly collaboration involving tool handover and component delivery based on human assembly progress. |
| [28] | 2025 | Improve accuracy and consistency of LLM-based HRC, which suffers from interference caused by accumulated, irrelevant historical context in long-span assembly tasks. | Assembly of a complex computer mainframe, with multi-variety, small-batch characteristics, repetitive fastening, and fine installation operations. |
Table 6 provides an overview of the models and modalities used in the reviewed articles. The extracted data reveal that all studies rely on voice or text input as the primary interaction modality, reflecting the central role of LLMs in interpreting human instructions, clarifying intent, and supporting task planning or error handling. Moreover, while many works combine language with visual input (images or video), fewer studies exploit richer multimodal signals such as motion data or physiological signals (e.g., EEG). Furthermore, the table shows a strong dependence on large proprietary models, particularly GPT-4 and its variants, often combined with well-established perception models such as CLIP, SAM, DINOv2, or BLIP. These models are, in some cases, fine-tuned or augmented with task-specific components to adapt them to industrial HRC scenarios.
5.1. Use of LLMs in HRC
Across the reviewed studies, LLMs are mainly used as a semantic–cognitive layer to (i) translate natural operator input into structured robot-relevant representations, (ii) support task decomposition and allocation decisions, and (iii) maintain task context through memory/knowledge mechanisms. Some systems demonstrate instruction grounding and action generation under flexible language input (e.g., [56, 68]), while others position the LLM as a planner that can integrate production objectives with human-related constraints such as fatigue, comfort, or collaboration dynamics (e.g., [61]). Other articles address context persistence for disassembly by augmenting LLMs with historical state reasoning and memory structures (e.g., [28, 67]) and by pairing semantic reasoning with simulation/digital-twin validation for safer strategy exploration (e.g., [65]). However, these benefits come with consistent boundary conditions: LLM-driven collaboration is sensitive to underspecified instructions and context loss (often requiring clarification loops), and practical deployment is constrained by latency/cost/privacy trade-offs that motivate cloud–edge splitting (e.g., [58]); moreover, higher autonomy via modular orchestration can reduce human burden but may increase token/compute overhead and amplify failure consequences without robust oversight mechanisms (e.g., [60, 62]). A more detailed description of these articles, with a primary focus on the use of LLMs, is presented below.
Gao et al. [68] presented one of the earliest applications of LLMs in HRC by highlighting a core limitation mentioned across many works reviewed in this article: existing HRC systems typically rely on rigid, predefined language syntax, which severely restricts their ability to interpret natural, ambiguous, or incomplete human instructions. Accordingly, in their proposed HRC system, they demonstrated that fine-tuned LLMs can act as adaptive cognitive engines, capable of converting flexible, ambiguous, or underspecified natural-language commands into structured robotic action configurations, thereby enabling more robust intention recognition in collaborative assembly. They further integrated LLMs with a control module that fuses language outputs with the robot’s internal state to automatically correct tool-selection errors.
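The tool-correction idea can be illustrated with a minimal sketch. It assumes that the LLM replies with a JSON action configuration and that the robot knows which tool the current assembly step requires. The field names (`action`, `tool`, `target`), the `TOOL_FOR_STEP` table, and both functions are hypothetical and are not taken from [68].

```python
import json

# Hypothetical mapping from assembly step to the tool that the robot's
# internal state says is required (an assumption, not data from [68]).
TOOL_FOR_STEP = {"fasten_cover": "screwdriver", "tighten_nut": "wrench"}


def parse_action(llm_output: str) -> dict:
    """Parse an LLM reply into a structured action configuration."""
    config = json.loads(llm_output)
    missing = {"action", "tool", "target"} - set(config)
    if missing:
        raise ValueError(f"incomplete action configuration: {sorted(missing)}")
    return config


def correct_tool(config: dict, current_step: str) -> dict:
    """Override the requested tool when it conflicts with the robot state."""
    expected = TOOL_FOR_STEP.get(current_step)
    if expected is not None and config["tool"] != expected:
        return {**config, "tool": expected, "corrected": True}
    return {**config, "corrected": False}


# The operator asks for a wrench, but the current step needs a screwdriver.
reply = '{"action": "handover", "tool": "wrench", "target": "operator"}'
result = correct_tool(parse_action(reply), current_step="fasten_cover")
print(result)
```

In this sketch the language output never drives the robot directly: it is first validated as a complete configuration and then reconciled with the robot-side state, which is one plausible way to read the fusion step described above.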
Lim et al. [
56] argued that, despite advances in HRC, manufacturing assembly systems continue to suffer from communication gaps that limit effective coordination. These gaps stem primarily from language barriers and the need for extensive robotics training. To address this issue, they proposed a framework that leverages LLMs to interpret operators’ voice commands and coordinate a robotic arm during a cable shark assembly task. Rather than evaluating the overall system performance, the authors focused specifically on assessing the LLMs’ ability to understand commands expressed in varied and less structured language. Their results showed that as instructions become less specific, due to missing context or explicit task references, robot performance degrades significantly, highlighting the importance of well-defined commands for reliable human–robot communication.
Wang et al. [61] highlighted that traditional task-planning methods for HRC lack generality and insufficiently consider operator experience. To address this, they proposed an LLM-based multi-agent task planning (MATP) framework that avoids execution conflicts while balancing operator experience and production efficiency. Their proposed method decomposes assembly tasks into action-level subtasks, evaluates operator and robot states including fatigue, posture comfort, and human–robot trust, and performs task allocation through a hybrid optimization combining LLM reasoning with a genetic algorithm. This approach was validated in a laboratory electronic assembly scenario, where it outperformed single-agent and traditional methods by effectively balancing operator experience with assembly efficiency and dynamic adaptability.
Lv et al. [
67] proposed a historical Visual Question Answering (VQA) framework for AR-assisted HRC. This framework integrates structured visual representations, a temporally organized Memory Graph, and LLMs to support reasoning over both current perception and historical experience. The proposed solution is presented as an alternative to traditional human intention recognition systems that rely on hand gestures, body posture, or gaze direction captured by visual sensors. This system adopts a client–server architecture in which voice commands are interpreted by LLMs and transmitted to the robot for execution. The resulting robot state, including joint positions and rotation angles, is then returned to the client and visualized through the AR interface. The system also generates image captions and VQA-based responses to guide robotic actions, enabling the robot to grasp tools and deliver them to the operator during battery disassembly. AR is used to capture unstructured environmental data and to visualize the LLM’s reasoning outputs. Moreover, human instructions are continuously fed back to the LLM, enabling bidirectional communication between the human and the robot.
Tong et al. [
65] demonstrated that, beyond serving merely as natural-language communication interfaces, LLMs can operate as powerful cognitive engines capable of addressing dynamic operational challenges in HRC. For this, they proposed a Hybrid Cognitive Digital Twin (HCDT) that integrates GNN-based rule learning with LLM-driven semantic reasoning to enable generative and adaptive decision-making in multi-human multi-robot collaborative (MHMRC) disassembly. Integrated within a Digital Twin, the cognitive engine conducts continuous reasoning, task allocation, real-time transmission and monitoring of multi-sensor data, and strategy validation, supporting safe in-simulation evaluation before deployment in the physical environment. In this article, human operators remain central to the workflow, performing dexterous or safety-critical actions, while the cognitive Digital Twin dynamically restructures robot behaviors and task sequences around their capabilities. The pilot study reported in this article indicates that the proposed HCDT system may improve subjective worker experience and operational safety relative to conventional HRC approaches.
Ma et al. [
58] highlighted that low computational efficiency, high deployment costs, and data leakage risks are major obstacles to the large-scale industrial adoption of cloud-based LLMs. To address these challenges, they proposed a fusion-driven framework that combines cloud-based large-scale LLMs for cognitive reasoning and dynamic manipulation planning with edge-based small-scale LLMs for efficient perception of robotic control demands and verification of control constraints. In their framework, the robot proactively perceives ongoing human assembly actions, generates adaptive manipulation constraints through the large-scale LLMs, and assists humans by executing assigned assembly subtasks, thereby reducing the need for continuous human instruction and enhancing flexibility in dynamic assembly processes. The authors report that this increased adaptability and flexibility of LLMs, even with the proposed approach, comes at the cost of longer task execution times compared to traditional fixed-code execution models.
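The division of labor between cloud and edge can be sketched as follows. This is only an illustration under strong assumptions: `cloud_plan` is a fixed stub standing in for the large cloud-hosted LLM, `edge_verify` is a rule-based clamp standing in for the small edge model, and the `Constraint` fields and numeric limits are hypothetical rather than values from [58].

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    max_speed: float      # m/s, hypothetical speed limit near the operator
    min_distance: float   # m, hypothetical clearance to the human


def cloud_plan(observation: str) -> Constraint:
    """Stand-in for the large cloud LLM: propose manipulation constraints."""
    # A real system would call a remote model here; this stub is fixed.
    if "human reaching" in observation:
        return Constraint(max_speed=0.8, min_distance=0.2)
    return Constraint(max_speed=0.5, min_distance=0.5)


def edge_verify(proposal: Constraint, hard_limits: Constraint) -> Constraint:
    """Stand-in for the small edge model: clamp proposals to hard limits."""
    return Constraint(
        max_speed=min(proposal.max_speed, hard_limits.max_speed),
        min_distance=max(proposal.min_distance, hard_limits.min_distance),
    )


limits = Constraint(max_speed=0.6, min_distance=0.3)
safe = edge_verify(cloud_plan("human reaching for bracket"), limits)
print(safe)
```

The point of the split in this sketch is that the expensive, slower reasoning step only proposes constraints, while a cheap local check has the final say before anything reaches the controller.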
Verhelst et al. [
59] proposed a Digital Colleague: a modular, human-centric architecture combining LLMs, a skill-based robot framework, and a hierarchical knowledge base to support high-mix, low-volume (HMLV) manufacturing. In such environments, operators frequently switch between products and processes, increasing cognitive load and complexity. The proposed Digital Colleague aims to address this issue by providing task-specific, on-demand guidance to reduce cognitive strain while maintaining efficiency and quality. Instead of using an AR headset, a smart projector offers physical support and context-aware digital guidance (e.g., displaying relevant information and instructional text onto the workspace). Unlike most studies identified in this review, this work adopted a more human-centric design approach. Consequently, the authors employed qualitative and subjective metrics to evaluate the user experience of the proposed system. Their study found that the proposed conversational interaction and on-demand support improved perceived clarity and reduced cognitive load. However, perceived efficiency was rated lower due to slow LLM response times and occasional hallucinations. Another notable human-centric element proposed in [
59] is the use of facial animations embedded in a tablet mounted on the collaborative robot, which serve as an embodiment of the Digital Colleague. This design choice adds a friendly and approachable visual presence, further supporting intuitive interaction between the operator and the system.
Hua et al. [
28] explained that although LLMs capable of processing extended context windows can incorporate large volumes of historical contextual knowledge as input prompts, they remain susceptible to interference from irrelevant or subordinate historical information during the generation of collaborative strategies. To overcome this limitation, they introduced a dynamic knowledge evolution mechanism capable of handling large-scale textual information and reasoning over long-span problems, thereby improving the accuracy and consistency in complex collaborative assembly scenarios. The main idea is to continuously capture, refine, and summarize evolving situational demands while incorporating a forgetting mechanism to update historical scene states, ensuring that LLM prompts retain only critical information related to both global objectives and local states and mitigating secondary-information interference during knowledge accumulation.
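A forgetting mechanism of this kind can be sketched in a few lines. The version below is a deliberately simple stand-in, not the mechanism of [28]: relevance scores, the exponential `decay`, the `threshold`, and the class names are all assumptions made for illustration, whereas the original work refines and summarizes knowledge with an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    relevance: float  # hypothetical score in [0, 1]


@dataclass
class SceneMemory:
    decay: float = 0.5       # assumed per-update decay of older items
    threshold: float = 0.2   # assumed cut-off below which items are forgotten
    items: list = field(default_factory=list)

    def update(self, text: str, relevance: float = 1.0) -> None:
        """Decay existing items, forget stale ones, then store the new one."""
        for item in self.items:
            item.relevance *= self.decay
        self.items = [i for i in self.items if i.relevance >= self.threshold]
        self.items.append(MemoryItem(text, relevance))

    def prompt(self, goal: str) -> str:
        """Build a prompt from the global goal and the retained local states."""
        lines = [f"Goal: {goal}"] + [f"- {i.text}" for i in self.items]
        return "\n".join(lines)


memory = SceneMemory()
for event in ["fan mounted", "screw M3 dropped", "heat sink aligned", "PCB seated"]:
    memory.update(event)
print(memory.prompt("assemble mainframe"))
```

With these settings the oldest event has decayed below the threshold by the fourth update, so the prompt carries the global goal plus only the three most recent states.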
Wang et al. [
62] argued that the field is entering a “post-LLM” era, aiming to reformulate how LLM-based applications are designed and deployed. Therefore, they propose a modular, agent-based architecture in which an unfine-tuned LLM acts as the cognitive core for natural language understanding, chain-of-thought reasoning, and task decomposition. The proposed HRC agent consists of four main modules: a configuration module that defines the agent’s role, goals, capabilities, knowledge, and behavioral rules; a task planning module that decomposes complex instructions into executable subtasks and corresponding strategies; a task execution module that interfaces with robotic tools and hardware; and a memory module that stores interaction and environmental information to support future reasoning and instruction understanding. Experimental results indicated that the HRC agent can effectively interpret natural language instructions, generate correct reasoning chains, and drive robots to execute assembly tasks via tool calling. However, the authors reported a small number of failure cases caused by LLM-induced hallucinations during task planning, noting that completely eliminating such hallucinations remains costly and challenging.
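The four-module layout can be pictured with a small skeleton. It is a sketch under stated assumptions: the planning step is a fixed string split standing in for LLM-based decomposition, the tools are plain Python callables standing in for robot skills, and all class, method, and tool names are hypothetical rather than taken from [62].

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    """Configuration module: the agent's role and the tools it may call."""
    role: str
    tools: dict  # tool name -> callable (hypothetical robot skills)


@dataclass
class Agent:
    config: Config
    memory: list = field(default_factory=list)  # memory module

    def plan(self, instruction: str) -> list:
        """Planning module: a fixed split standing in for LLM decomposition."""
        return [step.strip() for step in instruction.split(" then ")]

    def execute(self, instruction: str) -> list:
        """Execution module: dispatch each subtask to a registered tool."""
        results = []
        for step in self.plan(instruction):
            name, _, arg = step.partition(" ")
            tool = self.config.tools.get(name)
            outcome = tool(arg) if tool else f"no tool for '{name}'"
            self.memory.append((step, outcome))  # kept for later reasoning
            results.append(outcome)
        return results


tools = {"pick": lambda x: f"picked {x}", "insert": lambda x: f"inserted {x}"}
agent = Agent(Config(role="wire-harness assistant", tools=tools))
out = agent.execute("pick connector J1 then insert connector J1")
print(out)
```

An unknown subtask falls through to an explicit "no tool" outcome instead of raising, which mirrors the need to surface planning errors rather than execute them silently.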
Ding et al. [
60] proposed an LLM-powered, cognition-centered AI agent framework to support proactive HRC in assembly and disassembly tasks. To improve the generalization capability of AI agents and mitigate the lack of domain-specific knowledge in pre-trained LLMs, the authors introduced a semantic Chain-of-Thought prompt learning method that integrates task semantics with structured reasoning. Within this framework, the robot adaptively adjusts its autonomy level based on task complexity and operator state to reduce human intervention and workload. Experimental results demonstrated improved autonomous orchestration, higher execution accuracy, and reduced human intervention compared with LLM-only and existing AI-agent baselines. However, these advantages come at the cost of increased token consumption.
5.2. Use of VLMs in HRC
Across reviewed studies, VLMs primarily act as the grounding and perception layer that connects language and robot reasoning to the physical workspace, enabling collaboration under dynamic conditions without fully task-specific retraining. Many systems combine a language/planning component with VLM-based segmentation/localization to ground operator intent in object-level state representations, often mediated through AR/MR interfaces where humans verify plans or provide corrective input (e.g., [
55,
71]). Another recurrent use is intention recognition under data scarcity (e.g., [
69]). VLMs are also leveraged as enabling primitives (e.g., unseen-object pose estimation, scene reasoning, or monitoring signals used to adapt collaboration) (e.g., [
70,
73]). A key trade-off of VLM-driven grounding is that while it offers greater flexibility, it faces limitations in terms of real-time feasibility and, in some cases, interface intrusiveness. The latter issue is particularly evident when using AR headsets. For example, in the pilot study conducted by [
12], participants reported that AR glasses were not a convenient interaction modality and expressed a preference for a screen-based interface positioned in front of the robotic platform. This finding is reinforced by [
72], who propose a more human-centered and inclusive approach that adapts assistance to the operator’s cognitive needs by projecting visual cues directly onto the workspace. A more detailed description of these articles, with a primary focus on the use of VLMs, is presented below.
Zheng et al. [71] proposed a framework for HRC that integrates a mixed-reality head-mounted display (MR-HMD) for data collection, communication, and state representation, together with a vision–language-guided task planning approach and a deep reinforcement learning-based motion control policy for a mobile manipulator. In this framework, LLMs generate zero-shot robotic task plans by parsing natural language instructions, decomposing them into sequential action steps, and producing executable code through predefined robotic primitives, while VLMs support object segmentation and localization using the same language specifications. For this, human operators need to issue task prompts and wear MR-HMDs to provide first-person environmental state information, evaluate the correctness of LLM-generated plans, and iteratively refine prompts as needed. Experimental results in collaborative assembly tasks demonstrated improved segmentation accuracy, higher task success rates, and more efficient motion planning compared to baseline approaches. However, the authors acknowledge several limitations, including latency introduced by cloud-based LLM inference, the high computational cost of deep reinforcement learning-based control, and scalability challenges associated with the reliance on MR-HMDs. Subsequently, they proposed in [55] a vision–language-guided robotic action planning approach that combines referred object retrieval with an LLM-based planner to mitigate ambiguity in collaborative manufacturing tasks. In their work, the role of humans is to provide natural language instructions and to refine perception results through manual clicking when model confidence is low, enabling error correction without disrupting production.
Wu et al. [
69] argued that, despite substantial progress in HRC, research in both industrial and academic domains largely prioritizes adaptive robotic planning while insufficiently modeling human operator intentions. They also identified a key limitation in action recognition research: the scarcity of representative industrial datasets, as most benchmarks target generic scenarios and fail to reflect real industrial constraints. To address these gaps, they proposed H2R Bridge, a vision–language–temporal framework for human intention recognition in industrial HRC. The framework combines pre-trained VLMs with temporal encoding and few-shot learning to achieve transferable intent recognition under data-scarce conditions, while diverse LLMs (including T5-small, GPT-2, and Qwen-turbo) are used to translate recognized actions into natural-language intention instructions to guide robot command generation.
Liu et al. [
57] explained that current embodied intelligence approaches for HRC require repeated training of multiple perception and reasoning models, thereby limiting adaptability in dynamic manufacturing environments. To deal with this issue, they argued that VLMs can serve as generalizable cognitive engines, enabling multimodal perception, reasoning, and autonomous execution without retraining specialized modules. Consequently, they proposed a VLM-enhanced embodied intelligence framework for digital-twin-assisted human–robot collaborative assembly. The proposed framework comprises four core modules: VLM-enhanced embodied perception of the HRC environment, VLM-enhanced embodied reasoning, DT-supported embodied decision-making, and embodied autonomous execution through automatic code generation. In this framework, the Digital Twin is used to train and optimize robot path-planning by simulating motion, identifying potential collisions, and refining strategies through reinforcement learning. Meanwhile, a HoloLens 2 headset enables natural-language input and AR-based guidance for the operator during the assembly of aerospace electronic bay components. In this approach, humans remain central in the assembly workflow: operators specify tasks (e.g., verbally via AR glasses), perform skills requiring dexterity, and intervene in reasoning and knowledge updates through human-in-the-loop mechanisms.
According to Simeone et al. [72], existing HRC and operator-support systems often rely on intrusive interfaces, like VR and AR headsets, which inadvertently increase cognitive load and disrupt assembly operations. Therefore, they proposed an alternative approach focusing on enhancing human-centricity and inclusion by tailoring assistance to the operator’s cognitive needs. For this, they introduced a non-intrusive, multimodal, generative AI-based system that provides real-time error detection, adaptive guidance, and personalized support. They achieved this by using a projector to display visual cues directly onto the workspace, offering performance comparable to VR/AR head-mounted displays with minimal cognitive load. Their system is enhanced by a Generative AI Layer (using ChatGPT-4.0 and Claude 3.5), which interprets images and text for subsequent error detection and instruction generation. A key feature of the framework proposed in [72] is the human-in-the-loop learning cycle: when the system misclassifies an error, the human provides a corrective prompt that instantly updates the system’s knowledge base, thereby improving future error detection without the need for time-consuming model retraining.
Ji et al. [12] stated that existing HRC systems lack transferability and generalization because they rely on specialized perception models and predefined workflows that require retraining or refactoring when facing unseen objects or undefined tasks. To address these limitations, they proposed a foundation-model-based HRC framework that includes LLMs and VLMs for enhancing perception and reasoning in an assembly scenario. In the proposed system, LLMs act as the reasoning “brain” that interprets human language instructions and environmental descriptions to generate robot control code via prompt engineering, while Vision Foundation Models serve as the perceptual “eyes” enabling transferable scene semantic perception without task-specific training. In their approach, humans remain central to the collaboration by issuing free-form language instructions, evaluating and correcting LLM-generated robot code, and performing dexterous assembly actions, while the system follows human instructions without enforcing a fixed assembly sequence. Human–robot communication is mediated through natural language speech input, visual feedback via AR glasses, and simple pointing gestures that allow humans to specify assembly locations without verbally encoding complex spatial descriptions. Performance evaluations indicated that the proposed framework improves generalization and enables reasoning about undefined tasks compared to traditional HRC methods. Additionally, the authors conducted user studies using a focus-group format in which participants were shown demonstration videos of the system. The goal was to gather feedback and identify directions for future improvement. Participants reported that AR glasses were not a convenient interaction modality and expressed a preference for a screen-based interface positioned in front of the robotic platform. This claim is also supported by [72].
Xia et al. [
70] leverage VLMs to enable 6D pose estimation of previously unseen objects in shared human–robot workspaces, addressing the reliance of existing learning-based methods on extensive retraining and large datasets. For this, they proposed a three-stage pipeline comprising vision–language-based object detection and segmentation, CAD-template-based mask selection, and pose refinement. This approach was validated through real-world experiments using a UR5e collaborative robot in assisted picking and collaborative assembly scenarios. Their results demonstrated improved pose estimation performance for novel objects and effective support for HRC, while the authors acknowledged that the current processing speed of the full pipeline does not yet meet real-time industrial requirements.
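The middle stage of such a pipeline, choosing among candidate segmentation masks by comparing them with a template, can be sketched with a plain intersection-over-union score. This is an assumption-laden simplification: masks are represented as sets of pixel coordinates, the template is taken as already rendered and aligned, and IoU is used as the similarity measure; none of these details are confirmed by [70].

```python
def iou(mask_a: set, mask_b: set) -> float:
    """Intersection over union of two masks given as sets of (row, col)."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


def select_mask(candidates: dict, template: set) -> str:
    """Return the name of the candidate mask that best matches the template."""
    return max(candidates, key=lambda name: iou(candidates[name], template))


def rect(r0: int, c0: int, r1: int, c1: int) -> set:
    """Helper: a filled rectangle of pixels, rows r0..r1-1 and cols c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


template = rect(0, 0, 4, 4)            # hypothetical rendered CAD silhouette
candidates = {
    "mask_a": rect(0, 0, 4, 3),        # overlaps most of the template
    "mask_b": rect(2, 2, 6, 6),        # overlaps only one corner
}
best = select_mask(candidates, template)
print(best, iou(candidates[best], template))
```

A practical version would first normalize scale and position and compare against templates rendered from several viewpoints, which is part of why the full pipeline is costly at run time.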
Guo et al. [
73] argued that a lack of trust in HRC can reduce users’ willingness to adopt such systems. However, existing trust-computing approaches face several practical limitations. Subjective trust metrics can interrupt collaboration, negatively affecting naturalness and efficiency in real-world HRC scenarios, while many objective trust metrics rely on intrusive sensors that compromise user comfort. To address these challenges, the authors proposed a robot performance evaluation method based on a VLM to support trust estimation. This trust metric is directly linked to collaboration efficiency, measured by the number of steps required to complete a task collaboratively. In addition, they introduced an active interaction strategy generation framework for HRC that leverages this trust metric to improve the predictability of human actions and reduce interruptions caused by redundant interventions and delayed decision-making. In their approach, the VLM reasons with visual observations of collaborative sub-scenes to evaluate robot performance, which is then used to update trust estimates and select optimal robot actions. Human operators remain central to the collaboration by intervening when appropriate, while the robot dynamically adjusts its level of autonomy based on whether the human intervenes promptly or refrains from intervening during the task. Experimental results from a collaborative object transportation task demonstrated that their proposed strategy can reduce the number of steps required to complete the task compared to a random strategy, in which the robot arbitrarily selects between the actions grasp and non-grasp at each step without accounting for the human’s trust level.
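The loop between performance evaluation, trust estimation, and action selection can be sketched as follows. The update rule here is an assumption chosen for brevity: trust is exponentially smoothed toward a performance score in [0, 1] (which in [73] would come from the VLM), a human intervention is treated as a score of zero, and the robot grasps only above a fixed threshold. The function names, `rate`, and `threshold` are hypothetical and do not reproduce the trust model of [73].

```python
def update_trust(trust: float, performance: float, human_intervened: bool,
                 rate: float = 0.3) -> float:
    """Exponentially smooth trust toward the observed robot performance."""
    # Assumption: an intervention signals low trust, so the target is 0.
    target = 0.0 if human_intervened else performance
    return (1 - rate) * trust + rate * target


def choose_action(trust: float, threshold: float = 0.5) -> str:
    """Act autonomously only when the estimated trust is high enough."""
    return "grasp" if trust >= threshold else "non-grasp"


trust = 0.4
# Step 1: the robot performs well and the human does not intervene.
trust = update_trust(trust, performance=0.9, human_intervened=False)
first_action = choose_action(trust)
# Step 2: the human intervenes, so trust drops and the robot defers.
trust = update_trust(trust, performance=0.9, human_intervened=True)
second_action = choose_action(trust)
print(first_action, second_action, round(trust, 3))
```

Compared with choosing between grasp and non-grasp at random, a rule of this shape avoids autonomous attempts that the operator is likely to interrupt.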
5.3. Use of MLLMs in HRC
Across the reviewed set, MLLMs are used when single-modality pipelines are fragile, aiming to improve robustness and situational awareness by fusing multiple signals (generally vision + language and, in some cases, additional human/scene cues) for collaboration in noisy, dynamic shop-floor conditions. A common pattern is multimodal intention recognition, where combining modalities reduces failure under occlusion, acoustic noise, or ambiguous gestures compared with vision-only or speech-only interfaces (e.g., [
64]). In disassembly, MLLMs are often paired with structured knowledge (e.g., affordance knowledge/graphs) to support dynamic scene understanding and rescheduling when part conditions or availability change, enabling more responsive task allocation (e.g., [
66]). Some architectures push toward integrated perception–decision–execution loops with feedback and shared memory to sustain continuous adaptation during collaboration (e.g., [
63]). A more detailed description of these articles, with a primary focus on the use of MLLMs, is presented below.
Li et al. [
64] argued that current HRC systems relying on single-modality perception (e.g., only vision or only speech) fail to provide robust and accurate intention understanding in complex manufacturing environments. To overcome these limitations, they introduced a multimodal large model that integrates synchronized vision, audio, and EEG signals, enabling significantly more reliable robot intention recognition compared with traditional single-modality approaches. The framework consists of four layers: a physical layer for data acquisition, a multimodal fusion layer for processing and integration, a virtual layer using Digital Twins for prediction and optimization, and a service layer supporting interaction and safety. Validated in a drone-disassembly task, the system demonstrated improved robustness to noise. Their results show that multimodal fusion reduces sensitivity to visual occlusions, audio noise, and EEG instability, leading to more reliable operator-intention prediction.
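Why fusion reduces sensitivity to a single degraded channel can be seen in a small numerical sketch. It uses reliability-weighted late fusion, a much simpler stand-in for the attention-based fusion model of [64]: each modality outputs class probabilities, and a softmax over hypothetical reliability scores sets its weight. The class labels, probabilities, and reliability values are invented for illustration.

```python
import math


def softmax(scores: list) -> list:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(predictions: dict, reliability: dict) -> list:
    """Weight each modality's class probabilities by softmax(reliability)."""
    names = list(predictions)
    weights = softmax([reliability[n] for n in names])
    n_classes = len(predictions[names[0]])
    return [
        sum(w * predictions[n][k] for w, n in zip(weights, names))
        for k in range(n_classes)
    ]


# Classes: ["unscrew", "hand over tool"]. Vision is occluded and uninformative.
predictions = {
    "vision": [0.5, 0.5],
    "audio": [0.2, 0.8],
    "eeg": [0.4, 0.6],
}
reliability = {"vision": -1.0, "audio": 2.0, "eeg": 0.0}  # hypothetical scores
fused = fuse(predictions, reliability)
print([round(p, 3) for p in fused])
```

Because the occluded vision channel receives a small weight, the fused estimate follows the audio and EEG channels, which is the qualitative behavior the drone-disassembly results describe.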
Yu et al. [66] argued that improving robotic perception and leveraging historical experience are critical for human–robot collaborative disassembly (HRCD). To address this challenge, they proposed a dynamic task rescheduling method for HRCD that is enhanced by an Affordance Knowledge Graph (AFKG) and an MLLM to enable dynamic scene perception, semantic reasoning, and task reallocation. This approach is validated in an automotive lithium-ion battery disassembly scenario, where uncertainties such as component degradation, corrosion, damage, and tool availability frequently disrupt predefined workflows. Within this framework, the MLLM supports the scene understanding process by processing RGB-D images and gaze information obtained from mixed-reality head-mounted displays to construct semantic scene graphs. The AFKG complements this capability by enabling the recognition of previously unseen components and changing conditions of components in disassembly scenarios through querying similar affordance-based cases.
Chen et al. [
63] presented the concept of a human-like collaborative robot (HLCobot) which emphasizes human-like intelligence, achieved through a tightly integrated perception–decision–execution coordination loop. This approach aims to enable industrial robots to continuously and autonomously collaborate with human operators even in dynamic, unstructured environments where uncertainties and unexpected events frequently occur. To materialize this vision, the authors proposed a brain-inspired perception–decision–execution coordination framework for HRC, driven by MLLMs and organized into three tightly coupled functional hubs: (i) an active perception hub that enables dynamic and adaptive scene understanding under occlusion and uncertainty; (ii) an intelligent decision hub that performs knowledge-enhanced reasoning to infer task states and collaboration needs as well as mitigate bias and hallucinations; and (iii) an execution hub that decomposes high-level collaborative intentions into sequences of reusable low-level motion primitives. Additionally, inter-hub coordination with feedback communication and a shared memory module (to persistently store perceptual information and historical task outcomes) are integrated. Tested in an engine assembly scenario, the approach achieved a high success rate but struggled to distinguish fine-grained operational states when components or actions were visually similar, occasionally leading to incorrect collaborative instructions.