Next Article in Journal
A Shallow-Torque Haptic Device for Wrist Postural Guidance: Design and System Evaluation in a Virtual Rehabilitation Task
Previous Article in Journal
Correction: Greve, D.; Kreischer, C. Methodology for Integrated Design Optimization of Actuation Systems for Exoskeletons. Robotics 2024, 13, 158
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

From Large Language Models to Agentic AI in Industry 5.0 and the Post-ChatGPT Era: A Socio-Technical Framework and Review on Human–Robot Collaboration

National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
Robotics 2026, 15(3), 58; https://doi.org/10.3390/robotics15030058
Submission received: 29 January 2026 / Revised: 2 March 2026 / Accepted: 10 March 2026 / Published: 12 March 2026
(This article belongs to the Special Issue Human-Centered Robotics: The Transition to Industry 5.0)

Abstract

Generative Artificial Intelligence (GenAI), particularly Foundation Models (FMs), has recently become a key component of Industry 5.0. Despite growing interest in integrating these technologies into industrial environments, comprehensive analyses of the socio-technical opportunities and challenges of deploying these emerging AI systems in real-world settings remain limited. This article proposes a socio-technical conceptual perspective, termed Responsible Agentic Robotics (RAR), which structures the lifecycle deployment of agentic AI-enabled robotic systems around three core layers: context, design, and value. Additionally, this article presents a brief review of 21 peer-reviewed studies published between 2023 and 2025 (post-ChatGPT era) on FMs and agentic AI-enabled Human–Robot Collaboration (HRC) in industrial assembly/disassembly environments. The results indicate that existing research remains predominantly technology-centric, with a strong emphasis on enhancing robot autonomy, while comparatively limited attention is devoted to human-centered and responsible practices. Moreover, empirical evaluations of human, social, and sustainability dimensions, such as worker empowerment, human factors, well-being, inclusivity, resource utilization, and environmental impact, are rarely conducted and poorly discussed. This article concludes by identifying key socio-technical gaps, outlining future research directions.

1. Introduction

Generative Artificial Intelligence (GenAI) has emerged as a disruptive paradigm in technological development across diverse research and industrial areas. While the use of GenAI technologies has been explored over the past few years, it was not until the release of ChatGPT, in November 2022, that these technologies, particularly Large Language Models (LLMs), gained exponential visibility and widespread adoption.
LLMs are part of a broader class of intelligent models known as Foundation Models (FMs). Depending on their input modalities and functional capabilities, FMs encompass several model families, including Vision–Language Models (VLMs) and Multimodal Large Language Models (MLLMs), Visual Foundation Models (VFMs), Visual Generative Models (VGMs) and Robotic-specific Foundation Models (RFMs) [1]. Moreover, when these models are implemented in industrial settings, they can be referred to as Industrial Foundation Models (IFMs) [2]. In manufacturing, IFMs represent a recent addition to the landscape of emergent technologies associated with Industry 4.0 and Industry 5.0. This landscape also encompasses diverse areas such as Cyber–Physical–Human Systems (CPHSs) [3], the Internet of Things (IoT), Digital Twins (DTs) [4], Extended Reality (XR) [5], and Human–Robot Collaboration (HRC) [6,7]. Within this set of emergent technologies, HRC plays a key human-centric role in helping to mitigate risks often associated with the rapid development of automation technologies, such as job displacement, reduced well-being, and fear of AI [8]. In the robotics domain, recent works suggest that FMs have the potential to improve generalization in robot perception, task planning, and control. In this context, approaches such as SayCan [9], Code-as-Policy (CaP) [10], and PaLM-E [11] leverage Large Language Models (LLMs), VLMs, and MLLMs to enable task planning, ground decisions in real-world constraints, perform multimodal reasoning, and generate executable robot policies from natural language. However, these systems are not designed for industrial HRC; instead, they primarily focus on simplified and controlled settings with limited human involvement and interaction [12].
This article argues that the social acceptance and successful integration of robotic systems empowered by FMs in real-world and industrial settings introduce challenges that extend beyond technical considerations. Additionally, these challenges also require the identification and mitigation of human-centered issues and sustainability issues [13]. Addressing such issues calls for more holistic, responsible, and socio-technical research perspectives in the design, deployment, and evaluation of FM-enabled robotic systems. Consequently, this article aims to guide practitioners, researchers, and managers toward a broader socio-technical understanding of the opportunities and challenges associated with integrating FM-empowered robotic systems into real-world environments, specifically industrial settings. To this end, this article introduces a socio-technical conceptual framework that structures the lifecycle deployment of agentic AI and robotic systems in smart manufacturing. In addition, this article reviews current real-world and realistic industrial applications of FMs in HRC, with a particular focus on assembly and disassembly tasks.

2. Related Work and Contributions

2.1. FM-Based Conceptual and Industry 5.0 Assessment Frameworks

A limited number of recent studies have begun to systematically identify and classify the key dimensions that characterize the opportunities and challenges associated with integrating FMs into industrial environments. In particular, Refs. [2,14] propose system-oriented conceptual frameworks that organize the challenges and opportunities of IFMs across multiple levels of abstraction. Such conceptual frameworks are commonly used to inspire new ideas and to guide practitioners, researchers, and managers in advancing the development of specific research areas, rather than to deliver immediately actionable systems [6]. On the one hand, Ren et al. [14] introduce a four-layer conceptual framework for IFMs in the process industry composed of a resource layer, a base layer, an adaptation layer, and an application layer. However, the role of humans is only marginally addressed in this framework. On the other hand, Zhao et al. [2] propose a closely related framework for intelligent manufacturing structured around three analogous levels: data-level, model-level, and application-level. Although this framework explicitly includes humans as part of the system, it omits several key human-centered and responsible AI considerations and methodologies that are essential for the design of effective and socially acceptable HRC applications. These limitations underscore the need for complementary perspectives that explicitly integrate human-centered and responsible AI principles into the development of Industry 5.0 applications.

2.2. Recent Surveys of FMs in HRC

As FMs continue to increase in popularity, an increasing number of related literature review articles have been published in recent years. Many of these surveys take a broad, cross-domain perspective (e.g., [15,16]), which often results in limited discussion of how these technologies impact robotics, particularly in human-centered settings. Moreover, these surveys often restrict their analysis to a small number of widely cited frameworks, such as PaLM-E and SayCan. A smaller subset of surveys addresses industrial contexts more explicitly, as illustrated by [2,17]; however, most of these works offer only limited discussion of the challenges related to HRC settings. In the context of FMs for HRC and/or Industry 5.0, Table 1 provides a comparative summary of recent and related survey articles. A brief discussion of each survey is presented below, highlighting its main focus and limitations in relation to the scope of this article.
Fan et al. [18] presented a comprehensive survey of VLMs for HRC covering advances in task planning, navigation, manipulation, and human-guided skill transfer. They provided a detailed analysis of diverse architectures and performance factors, with a focus on household and industrial scenarios. Moreover, they identified common limitations such as latency, precision constraints, and challenges in real-time 3D perception. However, their review covered the literature only up to mid-2024 and relies heavily on arXiv preprints, reflecting the early stage of the field at that time. These limitations highlight the need for an updated, peer-reviewed, and human-centered review focused on realistic industrial HRC scenarios.
Ma et al. [19] proposed a comprehensive review of LLM-based applications within the Industry 5.0 paradigm, synthesizing 197 studies across areas such as DTs, robotics, XR, IIoT, and blockchain. However, robotics is only marginally covered, with a very limited number of studies reporting LLM-based approaches evaluated in real industrial HRC scenarios, specifically [22,23]. Similarly, Chen et al. [20] present a review of studies on the integration of LLMs and DTs in Industry 5.0, identifying only three robotics studies related to HRC [22,23,24].
Most recently, Wu et al. [21] provided a comprehensive review of how MLLMs and spatial intelligence can support natural HRC. Their analysis focuses on the technical mechanisms that enable robots to perceive, reason, and act in unstructured environments, synthesizing developments in perception–cognition–action pipelines and spatial representations. While their review offers valuable insights into progress in robot autonomy and interaction capabilities, it remains largely robot-centered. Moreover, similar to Fan et al. [18], the review relies heavily on arXiv preprints.
In fact, most recent surveys on the application of FMs to HRC remain predominantly technology-centric, identify only a limited number of peer-reviewed studies, and consequently rely heavily on gray literature. Given the rapid evolution of FMs in robotics following the release of OpenAI’s ChatGPT, the field is now entering a post-ChatGPT and post-LLMs phase in which agentic architectures are expected to play a central role. In this context, an updated review that explicitly adopts a human-centered and Industry 5.0-oriented perspective is both timely and necessary.

2.3. Contributions and Research Questions

Building on the research gaps identified above, and grounded in socio-technical systems theory, human-centered design, and responsible AI principles, this article makes the following contributions:
(1) This article introduces a conceptual socio-technical perspective, termed Responsible Agentic Robotics (RAR), which structures FM-enabled HRC across three layers—context, design, and value—spanning both technical and social dimensions. The framework aims to guide practitioners, researchers, and managers in developing feasible and desirable HRC systems that integrate agentic AI architectures in an ethically acceptable and sustainable manner.
(2) This article reviews 21 peer-reviewed studies to examine how FMs have been used to support HRC and what human-centered and responsible research and innovation factors are evaluated or considered.
(3) Based on observed patterns and gaps, this article proposes research directions and makes a call for an earlier adoption of socio-technical and human-centered methodologies in FM-enabled HRC design and development to improve desirability, social acceptability, and deployment feasibility in manufacturing contexts.

3. Towards a Responsible Agentic AI Conceptual Framework for Human–Robot Collaboration

This article proposes the RAR perspective for HRC, explicitly grounded in a socio-technical systems theory [25]. This conceptual framework is composed of two interdependent systems: a technical system and a social system, which together aim to guide and support the responsible design, integration, and evaluation of FM-enabled HRC systems in alignment with Industry 5.0 principles. On the one hand, the technical system extends existing IFM conceptual frameworks proposed in [2,14]. While these previous works primarily structure FM integration around data, model, and application layers, the proposed RAR framework evolves this view toward a post-LLM paradigm centered on agentic AI and autonomous collaborators. In this system, FMs are not treated solely as isolated inference engines, but as components of agentic architectures capable of reasoning, adaptation, coordination, and real-time interaction within human-centered industrial environments. On the other hand, the social system is grounded in human-centered, future-oriented, and responsible innovation methodologies, which can be used to address widely recognized ethical and human-related challenges associated with GenAI and automation technologies, including privacy, dignity of work, beneficence and non-maleficence, fairness and non-discrimination, human–AI calibration, and value creation, among others [26,27].
Both the technical and social systems are structured using three overarching and conceptually aligned layers: context, design, and value. Figure 1 provides an overview of the proposed RAR conceptual framework. The context layer captures the conditions, constraints, and states that shape system behavior, including human, environmental, organizational, and social factors. The design layer comprises the mechanisms and processes that translate technical and social information and real-world conditions into advanced intelligent systems, as well as the methodologies that promote socially acceptable and responsible innovation. The value layer represents the outcomes generated by the system. These outcomes may encompass both technical value (e.g., usefulness, reliability, and feasibility) and social value (e.g., human well-being, trust, fairness, bias reduction, and human empowerment). The relationship between the design and value layers is not linear but iterative and adaptive. The outcomes generated in the value layer, including both technical and social value, feed back into the design layer and inform refinements, adjustments, and continuous improvement in design methodologies and system implementations. This iterative process helps ensure that system development remains aligned with human needs and contextual realities, while striving to be feasible, viable, desirable, and ethically aligned.

3.1. Technical System

The technical system of the proposed RAR framework structures the infrastructure, integration, and functional components required to design, deploy, and operate FM-enabled smart manufacturing systems. Figure 2 shows the main elements of the proposed technical system. In this system, the context layer provides the foundational information required for situational awareness and informed decision-making in FM-enabled HRC systems. For this, it integrates heterogeneous contextual information related to humans, robots, environment, process and historical data. Human data are primarily acquired through exteroceptive sensors and physiological measurements, enabling the perception of human-related signals such as gestures, body poses, facial expressions, heart rate, and voice or text commands. Robot data are obtained from proprioceptive and internal sensors that provide self-awareness of the robot’s state, including joint angles, force–torque measurements, and end-effector position and orientation information. Environment data capture the state of the workspace, including object position and orientation, distances to obstacles, and relevant safety-related elements. Interaction data describe the dynamics of human–robot interaction, including physical contact, relative distance, coordination timing, and shared workspace usage. Process data encompass task- and domain-specific information such as assembly and disassembly sequences, product structures, tooling requirements, and constraints derived from engineering artifacts, such as CAD models. Moreover, historical data can capture prior interaction or system states that can support the adaptation of HRC systems [28].
The agentic AI layer enables FMs to gather information from the context layer and act effectively in real-world environments. At its core, this layer incorporates different types of FMs that support advanced perception, cognition, and actuation capabilities in robotic systems. These capabilities are further enhanced by agentic AI techniques and tools, including chain-of-thought prompting for structured reasoning, retrieval-augmented generation for knowledge grounding, fine-tuning for domain adaptation, and Model Context Protocol (MCP) for connecting AI applications and models to external systems. Finally, the value layer supports the agentic architecture through a set of enabling technologies, infrastructure, and systems. It provides the resources required to build, train, and deploy Foundation Model (FM)-based architectures, including GPUs, containerization tools (e.g., Docker), model APIs, and cloud computing platforms. Integration connectors enable data transfer and transformation across system components, supporting modularity and scalability. System enablers incorporate complementary Industry 5.0 technologies—such as XR, CPHS, digital twins, and cloud–edge computing—to enhance agentic capabilities, safety, and performance. On top of these sub-layers, applications are developed to address specific industrial needs, including assembly, disassembly, quality inspection, maintenance, and training.

3.2. Social System

AI technology, especially GenAI, stands at a pivotal point, where it may either exacerbate existing social risk and inequalities or evolve into a truly responsible and human-centered technology [29]. One of the key obstacles to achieving this latter outcome is the insufficient consideration of human values, needs, and contextual factors during the early design, evaluation, and decision-making phases, along with the limited engagement of external stakeholders [27,29]. Accordingly, one of the central objectives of the social system within the proposed RAR perspective is to promote responsible practices and methodologies that can ensure agentic AI systems uphold the legitimate needs, concerns, perspectives, interests, values, and fundamental rights of workers and other stakeholders. Figure 3 presents an overview of the proposed social system. Within this framework, the context layer encompasses a set of human-centered and sustainability-related concerns and goals, together with relevant political, economic, social, technological, legal, and environmental (PESTLE) factors that must be identified for the intended application domain and target users [30]. These contextual elements are complemented by well-being and sustainability metrics, which should be systematically defined and applied to assess the development and deployment of human–AI and human–robot interaction systems [6]. The design layer consists of a set of well-established responsible and human-centered methodologies (discussed in the following subsections) that operationalize these contextual requirements into concrete development practices. Finally, the value layer comprises a set of intended positive outcomes, including enhanced human well-being, trust, privacy, safety, accessibility, bias reduction, and social acceptance, which collectively define the desired impact of the innovation process.

3.2.1. Responsible Research and Innovation as Methodological Base

As highlighted by [29], human-centered considerations should be explicitly defined and systematically integrated into the design process of AI systems through inclusive and transdisciplinary approaches, rather than being left to the discretion or interpretation of AI and robotics developers. This perspective ensures that design decisions are grounded in the values, needs, and concerns of the people who use or are affected by these systems. In response to this challenge, a wide range of partially overlapping methodologies and paradigms has emerged with the shared goal of achieving truly responsible AI and Human–Machine Interaction systems. Figure 4 illustrates the relationships among four such complementary approaches that practitioners, researchers, and managers can operationalize to support the development of responsible and agentic systems. At the center of these approaches lies the Responsible Research and Innovation (RRI) framework, defined by [31] “as a higher-level responsibility or meta-responsibility aimed at shaping, maintaining, coordinating, and aligning research and innovation processes, actors, and responsibilities to ensure desirable and acceptable outcomes”. As further elaborated by Burget et al. [32], RRI places primary emphasis on research and innovation processes rather than outcomes alone, and is grounded in four interconnected dimensions: inclusion, anticipation, reflexivity and responsiveness. Inclusion refers to the active engagement of diverse stakeholders (i.e., users affected by technologies) throughout the research and innovation process, ensuring that multiple perspectives are considered. Anticipation involves exploring plausible and desirable innovation outcomes by assessing potential impacts, as well as near- and long-term risks and the benefits associated with emerging technologies. Reflexivity emphasizes critical self-examination by researchers, managers, and innovators regarding their assumptions, values, and potential biases. Responsiveness refers to the ability to adapt research and innovation trajectories in light of new insights, stakeholder feedback, and evolving societal needs, and to address the anticipatory and reflective questions that arise through inclusive deliberation [33,34,35]. Figure 4 presents four related methodologies that can be operationalized to support the responsible design of agentic AI systems. A brief explanation of each approach is presented below.

3.2.2. Human-Centered Design of Agentic AI Applications

Figure 5 presents the Double Diamond model, a widely used framework for describing human-centered design processes. This model consists of four main phases: Empathize, Define, Develop, and Evaluate [36,37]. In this framework, the design process begins with data collection aimed at empathizing with and understanding workers’ realities. The Define phase requires clearly articulating these challenges, typically through participatory methods such as focus groups, workshops, and co-design activities involving diverse stakeholders, including workers, managers, engineers, and social and robotics researchers. The goal is to formulate a set of human-centered problem statements that guide the subsequent development of agentic AI and HRC systems. In the Develop phase, design teams generate and prototype multiple solutions aimed at addressing identified human-centered challenges, such as reducing cognitive load, increasing trust in collaborative assembly tasks, or improving the acceptance of HRC systems. Finally, the Evaluate phase requires assessing the proposed solutions, not only in terms of technical performance, but also in terms of their impact on human well-being and interaction quality. This includes usability, user experience, ergonomics, and human performance factors, as described in [6]. In this step, evaluation should go beyond Proof-of-Concept (PoC) testing conducted solely by developers or researchers with shared technical backgrounds and assumptions. Instead, solutions should be assessed by real or representative users. This requires well-designed experimental settings employing qualitative, quantitative, or mixed-method approaches, and involving a sufficient number of participants to ensure statistical validity [38]. Finally, this process is inherently iterative and may require multiple cycles before achieving truly responsible, robust, and socially valuable solutions.

3.2.3. Participatory Design in HRC

In Human–Robot Interaction, Participatory Design is often mentioned interchangeably with related approaches such as user-centered design or co-design; however, inclusion represents only one of its defining principles [39]. As articulated by Kensing and Greenbaum [40] and further discussed by Bødker [41], Participatory Design also includes the equalization of power relations by empowering users to influence technologies that shape their lives; democratic practices that involve stakeholders collaboratively in design and decision-making; situation-based action, which values users’ situated knowledge and real-world practices; and mutual learning, in which designers gain insight into users’ contexts while users develop an understanding of technological possibilities. While the application of Participatory Design to the development of responsible technologies in industrial HRC contexts remains underexplored, few examples have emerged in recent years. In this context, Ron et al. [42] combine Participatory Design with Feminist Technoscience [43] to address industrial needs while fostering inclusivity in HRC design. Their study shows that many challenges attributed to automation do not stem from technical limitations, but rather from misalignments between technological design choices and workers’ lived realities. They also identified several gaps in contemporary HRC design, including insufficient representation of workforce diversity, environments, and routines; technology-driven and overly specified solutions that respond to assumed rather than articulated needs; limited reflection on inclusivity; and design-induced biases. Another example is presented by Cao et al. [44], who conducted a Participatory Design study in which factory workers co-created a HRC interface in an industrial setting. To achieve this, they involved real operators in a Double Diamond-based design process consisting of co-creation sessions and feedback interviews. This resulted in a customized interface capable of satisfying operators’ needs and preferences. Moreover, their results suggest that involving operators in the design process can increase the acceptance and adoption of HRC systems in industrial settings.

3.2.4. Speculative Design Applied to Emergent Technologies

Speculative design provides a complementary approach for critically examining the long-term societal implications of emerging technologies by envisioning potential risks and exploring alternative futures that challenge deeply ingrained assumptions [45]. Rather than aiming to produce immediately deployable solutions, speculative design seeks to stimulate cross-disciplinary reflection and negotiation around shared norms and values by constructing fictional yet plausible futures. These speculative scenarios provoke debate about which technological trajectories are desirable, undesirable, or ethically problematic [46]. Recent research demonstrates the value of speculative design for responsible innovation. For example, Hohendanner et al. [46] employed participatory speculative design to explore how citizens in Japan envision future metaverse societies, revealing concerns around identity, power, governance, and human–AI relations that are often overlooked in industry-led development processes. Similarly, Shwartz Altshuler et al. [47] applied speculative design to the governance of phygital spaces, demonstrating how speculative scenarios and role-play can expose ethical tensions around privacy and social interactions.

3.2.5. Value-Sensitivity Design for an Industry 5.0

Many system designers adopt a function-driven design approach, focusing primarily on what a system should do and how efficiently it performs its tasks. While this approach supports usability and technical reliability, it often treats human values as implicit or secondary considerations. However, human values, particularly moral and ethical ones, do not naturally emerge from functional design alone. These human factors require explicit and systematic attention throughout the design process [48].
In this context, Value-Sensitive Design (VSD) has emerged as a suitable methodology. VSD is defined as “a methodology capable of tipping AI technologies in the right direction” by explicitly foregrounding the values of those who use or are affected by technological systems [29]. Rather than prioritizing functionality alone, VSD advocates designing for values. In this context, values refer to what people consider important, desirable, and ethically appropriate, such as privacy, autonomy, dignity, or fairness [29]. These values differ from norms, which are socially enforced and context-dependent, and from needs, which often lead to narrow functional solutions that may overlook broader social and ethical consequences [29,49,50]. Examples of applying VSD to embed human values into AI-driven Industry 5.0 and smart factory systems (aimed at ensuring ethical outcomes and improving work–life balance) are presented in [48,51,52].

4. Review Protocol and Methodology

This article structures the review around two research questions (shown in Table 2). This review prioritizes studies that report the integration of FMs in HRC tasks, particularly assembly and disassembly, and are deployed in real or realistically simulated industrial environments. Accordingly, works that apply or evaluate FMs in non-industrial contexts (e.g., household tasks), or that lack explicit assessment through real or realistic HRC experiments, are excluded from the scope of this survey.
To answer these research questions, a literature search was conducted across three major academic databases relevant to smart manufacturing and HRC: namely ScienceDirect, Springer, and IEEE Xplore. The general search string employed was “language model” AND “human–robot collaboration”. Eligibility criteria were defined to ensure alignment with the research questions and with the human-centered principles of Industry 5.0. In particular, selected studies were required to conceptualize HRC as an active and ongoing paradigm in which humans retain a meaningful role after training, adapting, and integrating FMs into industrial settings. Table 3 summarizes the inclusion criteria applied during screening, while Table 4 presents the exclusion criteria used during article selection. The screening and article selection process was conducted in three stages: (i) title screening, (ii) abstract screening based on the inclusion criteria, and (iii) full-text review using the exclusion criteria.
It is important to note that search results on these platforms are ranked according to relevance, which can make it difficult to reproduce identical results across different users and search sessions. Moreover, as one progresses through successive pages of search results, the retrieved articles tend to become increasingly less aligned with the specific focus of the study. As discussed in [53,54], searches on the same topic may therefore yield partially different sets of articles. Consequently, some relevant studies may have been inadvertently overlooked during the search process.
After applying the inclusion and exclusion criteria, the final corpus comprised 21 articles published or available online between 2023 and October 2025. Prior studies report an exponential increase in research on FMs following the release of ChatGPT in November 2022, with an exponential growth during 2023 [2]. Consequently, this survey focuses on articles published from 2023 onward. Within this period, 2023 and 2024 correspond to the initial “ChatGPT era”, marked by rapid adoption of LLM-based approaches, while 2025 reflects the early emergence of an agentic AI era. Figure 6 shows the paper selection process. Notably, no articles published in 2023 and only five relevant studies were identified in 2024. The number of eligible articles increases substantially in 2025, reaching sixteen, indicating the growing maturity and research interest in agentic FM-enabled HRC systems. This trend is consistent with findings from previous surveys [18,19,55], which highlight that most FM-based robotics studies before mid-2024 focus on tabletop or household settings, with only a limited subset extending these approaches to industrial HRC scenarios. Moreover, selection process results reveal a clear concentration of publications within a small number of journals, with the majority of studies appearing in well-established manufacturing and automation venues. In particular, Robotics and Computer-Integrated Manufacturing emerges as the most frequent venue, accounting for six of the selected publications. The Journal of Manufacturing Systems represents the second most frequent venue, contributing five publications. All remaining venues appear only once in the dataset totaling ten publications. These include a diverse range of venues such as CIRP Annals, Procedia CIRP, Advanced Engineering Informatics, Scientific Reports, and other specialized conference proceedings.

5. RQ1: How Are FMs (LLMs, VLMs, and MLLMs) Currently Used to Support Human–Robot Collaborative Assembly/Disassembly in Industrial Contexts?

Table 5 presents an overview of the motivations and enabled tasks reported in the reviewed articles. Across these articles, assembly tasks dominate as the primary application context for FM-enabled HRC. Representative examples include multi-step product and component assembly such as cable shark assembly with sequential subtasks [56], satellite and computing module installation [12,57,58], light switch and gearbox assembly [59,60], and complex electronic and wire harness assembly [61,62,63].
Disassembly tasks appear less frequently and are mainly explored in sustainability- and maintenance-oriented scenarios, often characterized by higher uncertainty and dynamic conditions. Examples include drone disassembly [64] and power battery disassembly under variable constraints such as corrosion, damage, or missing tools [65,66,67].
Overall, these works aim to improve how robots understand human intent and natural language (e.g., voice-driven task management and interpretation of ambiguous or incomplete instructions), and to translate that understanding into actionable, constraint-compliant robot behaviors and control code that remain effective under uncertainty and changing shop-floor conditions. Examples in this category are [12,55,56,63,68,69]. A second shared objective is to strengthen multimodal grounding to support robust perception, planning, and manipulation, including handling unseen objects and estimating 6D poses without repeatedly training task-specific models. Some examples in this category include [12,57,64,70,71]. In parallel, some studies frame FMs as the core of assistive “co-worker” systems that provide real-time error detection, adaptive guidance, and personalized operator support to reduce cognitive load and improve trust and collaborative efficiency [59,67,72,73]. Finally, a subset of works explicitly targets dynamic multi-agent settings—such as multi-human–multi-robot collaboration, uncertain disassembly, and task rescheduling. Some of them combine FMs with graph-based reasoning (KG/GNN) and unified cognitive architectures to enable safer, more autonomous coordination and robust task allocation under real-world variability [60,65,66].
Table 5. Overview of identified HRC studies using FMs, year available online, core motivations for adopting FMs in HRC, and the corresponding collaborative task scenarios addressed.
Table 5. Overview of identified HRC studies using FMs, year available online, core motivations for adopting FMs in HRC, and the corresponding collaborative task scenarios addressed.
ArticleYearNeed/Objective of Using FMs in HRCHRC Task Description
[56]2024Reduce voice communication barriers.Assembly of a cable shark product using voice-based natural language commands.
[12]2024Understand ambiguous human instructions; reasoning about new objects.Assembly of a satellite component model with flexible, human-guided sequencing.
[73]2025Improve HRC efficiency by adapting robot strategies based on human trust inferred from robot performance.Pick and Place on a conveyor belt with adaptive grasp or non-grasp decisions.
[68]2024Enable natural-language-based error correction and robust intention understanding in HRC.Human–robot collaborative assembly where robots select, correct, and hand over tools based on human language.
[72]2025Provide real-time, non-intrusive, adaptive error detection and multimodal guidance.Assembly of a cast iron horizontal bare-shaft centrifugal pump.
[64]2025Improve accuracy of understanding human intentions using an attention-based multimodal fusion model.Drone disassembly use case. The human guides the robot using voice and gesture commands.
[57]2025Eliminate the need for repeatedly training multiple vision models and reduce reliance on pre-programmed robot scripts.Multi-step HRC assembly of aerospace electronic modules, including pick-and-place, handling tools, screwing, and inspection tasks.
[59]2025Reduce operator cognitive load by providing real-time, contextual assistance and act as an intermediary to control physical agents via natural language commands.Assembly of a light switch, including pick-and-place of components and specific assembly subtasks supported by a cobot and smart projector.
[65]2025Enable autonomous reasoning and adaptation to dynamic, unscheduled multi-human multi-robot disassembly tasks.Disassembly of automotive lithium-ion batteries under dynamic and uncertain conditions.
[71]2024Enable adaptive task planning and human-guided execution in unstructured HRC manufacturing.Three predefined tasks for a HRC processes: fetching a specific part from storage, placing a gear into a case, and picking and mounting a case cover onto the module.
[66]2025Improve robust perception and task rescheduling in dynamic disassembly tasks with changing conditions, such as corrosion or damage of disassembly objects.Dynamic disassembly of end-of-life automotive lithium-ion batteries.
[67]2025Effective integration of historical and real-time information for question answering and robot manipulation.AR-assisted human–robot collaborative disassembly with historical Visual Question Answering (VQA) and tool guidance.
[69]2025Enable few-shot human intent recognition and semantic understanding in data-scarce industrial scenarios.Recognition of human actions and intentions during industrial tasks involving tools and parts.
[55]2024Mitigate human–robot communication ambiguity prevalent in HRC manufacturing scenarios.Assembly of a gear pump module involving language-guided pick-and-place actions.
[61]2025Improve generality on task planning, avoiding execution conflicts while balancing operator experience and efficiency.Assembly of ten types of electronic assembly tasks (e.g., chassis connectors, heat sink, printed circuit boards, and fan assembly).
[70]2025Enable accurate 6D pose estimation of novel objects without retraining for HRC.The robot follows voice commands, analyses the available objects, identifies and grasp the target, and hand over to the human.
[58]2025Enable adaptive and proactive robotic manipulation in dynamic HRC assembly environments.Assembly of a small satellite, humans assemble components and the robot proactively installs designated parts.
[62]2025Enable continuous instruction understanding and long-term reasoning in HRC by integrating a reflection-based contextual memory into LLM agents.Wire harness assembly for aviation electronic equipment requiring accurate robotic manipulation.
[60]2025Enable proactive, autonomous, and generalizable HRC in dynamic manufacturing assembly and disassembly.Reducer/gearbox multi-step assembly composed of diverse tasks such as component positioning, shaft installation, bearing assembly, and housing alignment.
[63]2025Achieve human-like collaborative intelligence by structuring perception, decision-making, and execution in a unified architecture.Dynamic engine assembly collaboration involving tool handover and component delivery based on human assembly progress.
[28]2025Improve accuracy and consistency of LLM-based HRC, which suffer interference from accumulated, irrelevant historical context in long-span assembly tasks.Assembly of a complex computer mainframe, with multi-species, small batch characteristics, repetitive fastening and fine installation operations.
Table 6 provides an overview of the used models and modalities of reviewed articles. Extracted data reveals that all studies rely on voice or text input as the primary interaction modality, reflecting the central role of LLMs in interpreting human instructions, clarifying intent, and supporting task planning or error handling. Moreover, while many works combine language with visual input (images or video), fewer studies exploit richer multimodal signals such as motion data or physiological signals (e.g., EEG). Furthermore, the table shows a strong dependence on large proprietary models, particularly GPT-4 and its variants, often combined with well-established perception models such as CLIP, SAM, DINOv2, or BLIP. These models are, in some cases, fine-tuned or augmented with task-specific components to adapt them to industrial HRC scenarios.

5.1. Use of LLMs in HRC

Across the reviewed studies, LLMs are mainly used as a semantic–cognitive layer to (i) translate natural operator input into structured robot-relevant representations, (ii) support task decomposition and allocation decisions, and (iii) maintain task context through memory/knowledge mechanisms. Some systems demonstrate instruction grounding and action generation under flexible language input (e.g., [56,68]), while others position the LLM as a planner that can integrate production objectives with human-related constraints such as fatigue, comfort, or collaboration dynamics (e.g., [61]). Other articles address context persistence for disassembly by augmenting LLMs with historical state reasoning and memory structures (e.g., [28,67]) and by pairing semantic reasoning with simulation/digital-twin validation for safer strategy exploration (e.g., [65]). However, these benefits come with consistent boundary conditions: LLM-driven collaboration is sensitive to underspecified instructions and context loss (often requiring clarification loops), and practical deployment is constrained by latency/cost/privacy trade-offs that motivate cloud–edge splitting (e.g., [58]); moreover, higher autonomy via modular orchestration can reduce human burden but may increase token/compute overhead and amplify failure consequences without robust oversight mechanisms (e.g., [60,62]). A more detailed description of these articles, with a primary focus on the use of LLMs, is presented below.
Gao et al. [68] presented one of the earliest applications of LLMs in HRC by highlighting a core limitation mentioned across many works reviewed in this article: existing HRC systems typically rely on rigid, predefined language syntax, which severely restricts their ability to interpret natural, ambiguous, or incomplete human instructions. Consecutively, in their proposed HRC system, they demonstrate that fine-tuned LLMs can act as adaptive cognitive engines, capable of converting flexible natural-language ambiguous or underspecified commands into structured robotic action configurations, thereby enabling more robust intention recognition in collaborative assembly. They further integrated LLMs with a control module that fuses language outputs with the robot’s internal state to automatically correct tool-selection errors.
Lim et al. [56] argued that, despite advances in HRC, manufacturing assembly systems continue to suffer from communication gaps that limit effective coordination. These gaps stem primarily from language barriers and the need for extensive robotics training. To address this issue, they proposed a framework that leverages LLMs to interpret operators’ voice commands and coordinate a robotic arm during a cable shark assembly task. Rather than evaluating the overall system performance, the authors focused specifically on assessing the LLMs’ ability to understand commands expressed in varied and less structured language. Their results showed that as instructions become less specific, due to missing context or explicit task references, robot performance degrades significantly, highlighting the importance of well-defined commands for reliable human–robot communication.
Wang et al. [61] highlighted that traditional task-planning methods for HRC lack generality and insufficiently consider operator experience. To address this, they proposed an LLM-based multi-agent task planning (MATP) framework that avoids execution conflicts while balancing operator experience and production efficiency. Their proposed method decomposes assembly tasks into action-level subtasks, evaluates operator and robot states including fatigue, posture comfort, and human–robot trust, and performs task allocation through a hybrid optimization combining LLM reasoning with a genetic algorithm. This approach is validated in a laboratory electronic assembly scenario outperforming single-agent and traditional methods by effectively balancing operator experience with assembly efficiency and dynamic adaptability.
Lv et al. [67] proposed a historical Visual Question Answering (VQA) framework for AR-assisted HRC. This framework integrates structured visual representations, a temporally organized Memory Graph, and LLMs to support reasoning over both current perception and historical experience. The proposed solution is presented as an alternative to traditional human intention recognition systems that rely on hand gestures, body posture, or gaze direction captured by visual sensors. This system adopts a client–server architecture in which voice commands are interpreted by LLMs and transmitted to the robot for execution. The resulting robot state, including joint positions and rotation angles, is then returned to the client and visualized through the AR interface. The system also generates image captions and VQA-based responses to guide robotic actions, enabling the robot to grasp tools and deliver them to the operator during battery disassembly. AR is used to capture unstructured environmental data and to visualize the LLM’s reasoning outputs. Moreover, human instructions are continuously fed back to the LLM, enabling bidirectional communication between the human and the robot.
Tong et al. [65] demonstrated that, beyond serving merely as natural-language communication interfaces, LLMs can operate as powerful cognitive engines capable of addressing dynamic operational challenges in HRC. For this, they proposed a Hybrid Cognitive Digital Twin (HCDT) that integrates GNN-based rule learning with LLM-driven semantic reasoning to enable generative and adaptive decision-making in multi-human multi-robot collaborative (MHMRC) disassembly. Integrated within a Digital Twin, the cognitive engine conducts continuous reasoning, task allocation, real-time transmission and monitoring of multi-sensor data, and strategy validation, supporting safe in-simulation evaluation before deployment in the physical environment. In this article, human operators remain central to the workflow, performing dexterous or safety-critical actions, while the cognitive Digital Twin dynamically restructures robot behaviors and task sequences around their capabilities. The pilot study reported in this article indicates that the proposed HCDT system may improve subjective worker experience and operational safety relative to conventional HRC approaches.
Ma et al. [58] highlighted that low computational efficiency, high deployment costs, and data leakage risks are major obstacles to the large-scale industrial adoption of cloud-based LLMs. To address these challenges, they proposed a fusion-driven framework that combines cloud-based large-scale LLMs for cognitive reasoning and dynamic manipulation planning with edge-based small-scale LLMs for efficient perception of robotic control demands and verification of control constraints. In their framework, the robot proactively perceives ongoing human assembly actions, generates adaptive manipulation constraints through the large-scale LLMs, and assists humans by executing assigned assembly subtasks, thereby reducing the need for continuous human instruction and enhancing flexibility in dynamic assembly processes. The authors report that this increased adaptability and flexibility of LLMs, even with the proposed approach, comes at the cost of longer task execution times compared to traditional fixed-code execution models.
Verhelst et al. [59] proposed a Digital Colleague: a modular, human-centric architecture combining LLMs, a skill-based robot framework, and a hierarchical knowledge base to support high-mix, low-volume (HMLV) manufacturing. In such environments, operators frequently switch between products and processes, increasing cognitive load and complexity. The proposed Digital Colleague aims to address this issue by providing task-specific, on-demand guidance to reduce cognitive strain while maintaining efficiency and quality. Instead of using an AR headset, a smart projector offers physical support and context-aware digital guidance (e.g., displaying relevant information and instructional text onto the workspace). Unlike most studies identified in this review, this work adopted a more human-centric design approach. Consequently, the authors employed qualitative and subjective metrics to evaluate the user experience of the proposed system. Their study found that the proposed conversational interaction and on-demand support improved perceived clarity and reduced cognitive load. However, perceived efficiency was rated lower due to slow LLM response times and occasional hallucinations. Another notable human-centric element proposed in [59] is the use of facial animations embedded in a tablet mounted on the collaborative robot, which serve as an embodiment of the Digital Colleague. This design choice adds a friendly and approachable visual presence, further supporting intuitive interaction between the operator and the system.
Hua et al. [28] explained that although LLMs capable of processing extended context windows can incorporate large volumes of historical contextual knowledge as input prompts, they remain susceptible to interference from irrelevant or subordinate historical information during the generation of collaborative strategies. To overcome this limitation, they introduced a dynamic knowledge evolution mechanism capable of handling large-scale textual information and reasoning over long-span problems, thereby improving the accuracy and consistency in complex collaborative assembly scenarios. The main idea is to continuously capture, refine, and summarize evolving situational demands while incorporating a forgetting mechanism to update historical scene states, ensuring that LLM prompts retain only critical information related to both global objectives and local states and mitigating secondary-information interference during knowledge accumulation.
Wang et al. [62] argued that the field is entering a “post-LLM” era, aiming to reformulate how LLM-based applications are designed and deployed. Therefore, they propose a modular, agent-based architecture in which an unfine-tuned LLM acts as the cognitive core for natural language understanding, chain-of-thought reasoning, and task decomposition. The proposed HRC agent consists of four main modules: a configuration module that defines the agent’s role, goals, capabilities, knowledge, and behavioral rules; a task planning module that decomposes complex instructions into executable subtasks and corresponding strategies; a task execution module that interfaces with robotic tools and hardware; and a memory module that stores interaction and environmental information to support future reasoning and instruction understanding. Experimental results indicated that the HRC agent can effectively interpret natural language instructions, generate correct reasoning chains, and drive robots to execute assembly tasks via tool calling. However, the authors reported a small number of failure cases caused by LLM-induced hallucinations during task planning, noting that completely eliminating such hallucinations remains costly and challenging.
Ding et al. [60] proposed an LLM-powered, cognition-centered AI agent framework to support proactive HRC in assembly and disassembly tasks. To improve the generalization capability of AI agents and mitigate the lack of domain-specific knowledge in pre-trained LLMs, the authors introduced a semantic Chain-of-Thought prompt learning method that integrates task semantics with structured reasoning. Within this framework, the robot adaptively adjusts its autonomy level based on task complexity and operator state to reduce human intervention and workload. Experimental results demonstrated improved autonomous orchestration, higher execution accuracy, and reduced human intervention compared with LLM-only and existing AI-agent baselines. However, these advantages come at the cost of increased token consumption.

5.2. Use of VLMs in HRC

Across reviewed studies, VLMs primarily act as the grounding and perception layer that connects language and robot reasoning to the physical workspace, enabling collaboration under dynamic conditions without fully task-specific retraining. Many systems combine a language/planning component with VLM-based segmentation/localization to ground operator intent in object-level state representations, often mediated through AR/MR interfaces where humans verify plans or provide corrective input (e.g., [55,71]). Another recurrent use is intention recognition under data scarcity (e.g., [69]). VLMs are also leveraged as enabling primitives (e.g., unseen-object pose estimation, scene reasoning, or monitoring signals used to adapt collaboration) (e.g., [70,73]). A key trade-off of VLM-driven grounding is that while it offers greater flexibility, it faces limitations in terms of real-time feasibility and, in some cases, interface intrusiveness. The latter issue is particularly evident when using AR headsets. For example, in the pilot study conducted by [12], participants reported that AR glasses were not a convenient interaction modality and expressed a preference for a screen-based interface positioned in front of the robotic platform. This finding is reinforced by [72], who propose a more human-centered and inclusive approach that adapts assistance to the operator’s cognitive needs by projecting visual cues directly onto the workspace. A more detailed description of these articles, with a primary focus on the use of LLMs, is presented below.
Zheng et al. [71] proposed a framework for HRC that integrates a mixed-reality head-mounted display (MR-HMD) for data collection, communication, and state representation, together with a vision–language-guided task planning approach and a deep reinforcement learning-based motion control policy for a mobile manipulator. In this framework, LLMs generate zero-shot robotic task plans by parsing natural language instructions, decomposing them into sequential action steps, and producing executable code through predefined robotic primitives, while VLMs support object segmentation and localization using the same language specifications. For this, human operators need to issue task prompts and wear MR-HMDs to provide first-person environmental state information, evaluate the correctness of LLM-generated plans, and iteratively refine prompts as needed. Experimental results in collaborative assembly tasks demonstrated improved segmentation accuracy, higher task success rates, and more efficient motion planning compared to baseline approaches. However, the authors acknowledge several limitations, including latency introduced by cloud-based LLM inference, the high computational cost of deep reinforcement learning-based control, and scalability challenges associated with the reliance on MR-HMDs. Posteriorly, they proposed in [55] a vision–language-guided robotic action planning approach that combines referred object retrieval with an LLM-based planner to mitigate ambiguity in collaborative manufacturing tasks. In their work, the role of humans is to provide natural language instructions and refining perception results through manual clicking when model confidence is low, enabling error correction without disrupting production.
Wu et al. [69] argued that, despite substantial progress in HRC, research in both industrial and academic domains largely prioritizes adaptive robotic planning while insufficiently modeling human operator intentions. They also identified a key limitation in action recognition research: the scarcity of representative industrial datasets, as most benchmarks target generic scenarios and fail to reflect real industrial constraints. To address these gaps, they proposed H2R Bridge, a vision–language–temporal framework for human intention recognition in industrial HRC. The framework combines pre-trained VLMs with temporal encoding and few-shot learning to achieve transferable intent recognition under data-scarce conditions, while diverse LLMs (including T5-small, GPT-2, and Qwen-turbo) are used to translate recognized actions into natural-language intention instructions to guide robot command generation.
Liu et al. [57] explained that current embodied intelligence approaches for HRC require repeated training of multiple perception and reasoning models, thereby limiting adaptability in dynamic manufacturing environments. To deal with this issue, they argued that VLMs can serve as generalizable cognitive engines, enabling multimodal perception, reasoning, and autonomous execution without retraining specialized modules. Consequently, they proposed a VLM-enhanced embodied intelligence framework for digital-twin-assisted human–robot collaborative assembly. The proposed framework comprises four core modules: VLM-enhanced embodied perception of the HRC environment, VLM-enhanced embodied reasoning, DT-supported embodied decision-making, and embodied autonomous execution through automatic code generation. In this framework, the Digital Twin is used to train and optimize robot path-planning by simulating motion, identifying potential collisions, and refining strategies through reinforcement learning. Meanwhile, a HoloLens 2 headset enables natural-language input and AR-based guidance for the operator during the assembly of aerospace electronic bay components. In this approach, humans remain central in the assembly workflow: operators specify tasks (e.g., verbally via AR glasses), perform skills requiring dexterity, and intervene in reasoning and knowledge updates through human-in-the-loop mechanisms.
According to Simeone et al. [72], existing HRC and operator-support systems often rely on intrusive interfaces, like VR and AR headsets, which inadvertently increase cognitive load and disrupt assembly operations. Therefore, they proposed an alternative approach focusing on enhancing human-centricity and inclusion by tailoring assistance to the operator’s cognitive needs. For this, they introduced a non-intrusive, multimodal, generative AI-based system that provides real-time error detection, adaptive guidance, and personalized support. They achieved this by using a projector to display visual cues directly onto the workspace, offering performance comparable to VR/AR head-mounted displays with minimal cognitive load. Their system is enhanced by a Generative AI Layer (using ChatGPT-4.0 and Claude 3.5), which is used to interpret images and text for posterior error detection and instruction generation. A key feature in the framework proposed by [72] is the human-in-the-loop learning cycle: when the system misclassifies an error, the human provides a corrective prompt that instantly updates the system’s knowledge base, thereby improving future error detection without the need for time-consuming model retraining.
Ji et al. [12] stated that existing HRC systems lack transferability and generalization because they rely on specialized perception models and predefined workflows that require retraining or refactoring when facing unseen objects or undefined tasks. To address these limitations, they proposed a foundation-model-based HRC framework that includes LLMs and VLMs for enhancing perception and reasoning in a assembly scenario. In the proposed system, LLMs act as the reasoning “brain” that interprets human language instructions and environmental descriptions to generate robot control code via prompt engineering, while Vision Foundation Models serve as the perceptual “eyes” enabling transferable scene semantic perception without task-specific training. In their approach, humans remain central to the collaboration by issuing free-form language instructions, evaluating and correcting LLM-generated robot code, and performing dexterous assembly actions, while the system follows human instructions without enforcing a fixed assembly sequence. Human–robot communication is mediated through natural language speech input, visual feedback via AR glasses, and simple pointing gestures that allow humans to specify assembly locations without verbally encoding complex spatial descriptions. Performance evaluations indicated that the proposed framework improves generalization and enables reasoning about undefined tasks compared to traditional HRC methods. Additionally, the authors conducted user studies using a focus-group format in which participants were shown demonstration videos of the system. The goal was to gather feedback and identify directions for future improvement. Participants reported that AR glasses were not a convenient interaction modality and expressed a preference for a screen-based interface positioned in front of the robotic platform. This claim is also supported by [72].
Xia et al. [70] leverage VLMs to enable 6D pose estimation of previously unseen objects in shared human–robot workspaces, addressing the reliance of existing learning-based methods on extensive retraining and large datasets. For this, they proposed a three-stage pipeline comprising vision–language-based object detection and segmentation, CAD-template-based mask selection, and pose refinement. This approach was validated through real-world experiments using a UR5e collaborative robot in assisted picking and collaborative assembly scenarios. Their results demonstrated improved pose estimation performance for novel objects and effective support for HRC, while the authors acknowledged that the current processing speed of the full pipeline does not yet meet real-time industrial requirements.
Guo et al. [73] argued that a lack of trust in HRC can reduce users’ willingness to adopt such systems. However, existing trust-computing approaches face several practical limitations. Subjective trust metrics can interrupt collaboration, negatively affecting naturalness and efficiency in real-world HRC scenarios, while many objective trust metrics rely on intrusive sensors that compromise user comfort. To address these challenges, the authors proposed a robot performance evaluation method based on a VLM to support trust estimation. This trust metric is directly linked to collaboration efficiency, measured by the number of steps required to complete a task collaboratively. In addition, they introduced an active interaction strategy generation framework for HRC that leverages this trust metric to improve the predictability of human actions and reduce interruptions caused by redundant interventions and delayed decision-making. In their approach, the VLM reasons with visual observations of collaborative sub-scenes to evaluate robot performance, which is then used to update trust estimates and select optimal robot actions. Human operators remain central to the collaboration by intervening when appropriate, while the robot dynamically adjusts its level of autonomy based on whether the human intervenes promptly or refrains from intervening during the task. Experimental results from a collaborative object transportation task demonstrated that their proposed strategy can reduce the number of steps required to complete the task compared to a random strategy, in which the robot arbitrarily selects between the actions grasp and non-grasp at each step without accounting for the human’s trust level.

5.3. Use of MLLMs in HRC

Across the reviewed set, MLLMs are used when single-modality pipelines are fragile, aiming to improve robustness and situational awareness by fusing multiple signals (generally vision + language and, in some cases, additional human/scene cues) for collaboration in noisy, dynamic shop-floor conditions. A common pattern is multimodal intention recognition, where combining modalities reduces failure under occlusion, acoustic noise, or ambiguous gestures compared with vision-only or speech-only interfaces (e.g., [64]). In disassembly, MLLMs are often paired with structured knowledge (e.g., affordance knowledge/graphs) to support dynamic scene understanding and rescheduling when part conditions or availability change, enabling more responsive task allocation (e.g., [66]). Some architectures push toward integrated perception–decision–execution loops with feedback and shared memory to sustain continuous adaptation during collaboration (e.g., [63]). A more detailed description of these articles, with a primary focus on the use of LLMs, is presented below.
Li et al. [64] argued that current HRC systems relying on single-modality perception (e.g., only vision or only speech) fail to provide robust and accurate intention understanding in complex manufacturing environments. To overcome these limitations, they introduced a multimodal large model that integrates synchronized vision, audio, and EEG signals, enabling significantly more reliable robot intention recognition compared with traditional single-modality approaches. The framework consists of four layers: a physical layer for data acquisition, a multimodal fusion layer for processing and integration, a virtual layer using Digital Twins for prediction and optimization, and a service layer supporting interaction and safety. Validated in a drone-disassembly task, the system demonstrated improved robustness to noise. Their results show that multimodal fusion reduces sensitivity to visual occlusions, audio noise, and EEG instability, leading to more reliable operator-intention prediction.
Yu et al. [66] argued that improving robotic perception and leveraging historical experience are critical for human–robot collaborative disassembly (HRCD). To address this challenge, they proposed a dynamic task rescheduling method for HRCD that is enhanced by an Affordance Knowledge Graph (AFKG) and a MLLM to enable dynamic scene perception, semantic reasoning, and task reallocation. This approach is validated in an automotive lithium-ion battery disassembly scenario, where uncertainties such as component degradation, corrosion, damage, and tool availability frequently disrupt predefined workflows. Within this framework, the MLLM supports the scene understanding process by processing RGB-D images and gaze information obtained from mixed-reality head-mounted displays to construct semantic scene graphs. The AFKG complements this capability by enabling the recognition of previously unseen components and changing conditions of components in disassembly scenarios through querying similar affordance-based cases.
Chen et al. [63] presented the concept of a human-like collaborative robot (HLCobot) which emphasizes human-like intelligence, achieved through a tightly integrated perception–decision–execution coordination loop. This approach aims to enable industrial robots to continuously and autonomously collaborate with human operators even in dynamic, unstructured environments where uncertainties and unexpected events frequently occur. To materialize this vision, the authors proposed a brain-inspired perception–decision–execution coordination framework for HRC, driven by MLLMs and organized into three tightly coupled functional hubs: (i) an active perception hub that enables dynamic and adaptive scene understanding under occlusion and uncertainty; (ii) an intelligent decision hub that performs knowledge-enhanced reasoning to infer task states and collaboration needs as well as mitigate bias and hallucinations; and (iii) an execution hub that decomposes high-level collaborative intentions into sequences of reusable low-level motion primitives. Additionally, inter-hub coordination with feedback communication and a shared memory module (to persistently store perceptual information and historical task outcomes) are integrated. Tested in an engine assembly scenario, the approach achieved a high success rate but struggled to distinguish fine-grained operational states when components or actions were visually similar, occasionally leading to incorrect collaborative instructions.

6. RQ2: What Human-Centered and Responsible-Deployment Considerations Are Reported, and What Is the Role of Humans?

6.1. Consideration of Human Factors and Responsible AI Design

The results indicate that, although all reviewed studies demonstrate substantial technical sophistication and potential technical value, human factors (when acknowledged) are often treated as secondary outcomes. These factors are typically assumed to improve indirectly through enhanced system performance or increased autonomy, rather than being explicitly framed as design requirements grounded in human-centered and responsible AI principles. Inclusion is briefly discussed in only two of the reviewed articles. In [59], inclusivity is supported through language-agnostic natural language interaction, which allows operators to communicate with robots in their preferred language without the need for on-site interpreters. In [72], the authors suggest that the system’s ability to adapt to diverse cognitive profiles and communication preferences could potentially support the inclusion of neurodiverse operators; however, this is presented as a direction for future research rather than being empirically evaluated. As illustrated by these representative studies, inclusion is primarily framed in terms of accessibility at the interaction level, while the involvement of diverse stakeholders during the design and deployment stages of HRC systems is rarely addressed.
Trust is also mentioned in a few of the reviewed articles. In this context, Ref. [59] emphasizes the importance of addressing hallucination-related issues in FM-enabled HRC systems, particularly in manufacturing settings, as a prerequisite for ensuring system dependability and trustworthiness. In [73], a trust-based active interaction strategy for HRC is proposed. In this system, a tree-structured model evaluates different robot strategies to select the optimal action sequence. Robot performance is assessed using a VLM-based method, and the resulting outcomes are used to estimate human trust. This trust value is used to guide the robot’s decision-making during collaboration.
Ethical concerns related to operator privacy and autonomy are briefly noted in [72], as proactive approaches that continuously monitor operators’ actions and cognitive states may create feelings of surveillance and reduce their sense of control over the work environment.
Only a small subset of the reviewed studies complements AI model and HRC system performance metrics with human-centered evaluation, for example, by employing standardized measures of user experience, usability, or ergonomics. One example is [59], which uses the User Experience Questionnaire (UEQ) [74] to assess participants’ perceived user experience of the proposed HRC system in a study involving 21 participants. However, as noted by the authors of [59], the participant sample consisted primarily of researchers rather than actual operators or novice users, which may bias feedback and insights toward a technical perspective. This limitation is also observed across most of the reviewed articles. In [65], the physical and cognitive workload of the HRC system is evaluated using the NASA–TLX Task Load Index [75] in a pilot study to provide preliminary POV validation. The authors, however, acknowledge the need for larger-scale user studies with a more diverse participant pool to further validate the results and establish statistical significance. In [67] an exploratory qualitative study was conducted to assess the usability of the proposed AR-assisted HRI system by comparing interaction methods across six usability dimensions, using a 3-point Likert scale. In this study, nine non-expert participants (graduate students) evaluated the system to reflect general user perspectives.
As highlighted in prior human-centered and responsible AI research, these observed patterns are not unexpected. One of the main obstacles to achieving genuinely human-centered technologies lies in the limited consideration of human-centered priorities during early design and evaluation stages, as well as the lack of external stakeholder participation in system development [29]. As described by [76], “this oversight can lead to technically sound systems that fail to optimize human–machine interaction (HCI), resulting in reduced system effectiveness and potential safety hazards”.
Consequently, most of the reviewed works provide little or no information about stakeholder inclusion and basic human-centered experimental validation, even when they evaluate their system in real-world settings. As a result, key details such as participant numbers, demographics, background, study duration, human-centered experimental methodologies, and ethical approval statements are rarely reported. This reveals that, while the technical development of agentic and FM-based systems is reaching a high level of maturity, the social and human-centered dimensions of this research remain at a very early stage. Addressing this gap requires developers, managers, and robotics researchers to move beyond purely technical perspectives and adopt transdisciplinary and responsible approaches. In this context, Section 3.2.2 provides an overview of key methodologies that strengthen the human-centered and responsible AI dimensions of technological projects, enabling the development of solutions that are not only functional, but also acceptable, desirable, socially valuable, and ethically grounded.

6.2. The Role of Humans

From a role-based perspective, humans can assume different functions in FM-enabled HRC systems depending on the level of robot autonomy and responsibility allocation. This work classifies these roles as: Instructor/Controller, Error Handler/Exception Manager and Collaborator/Co-worker. In the Instructor or Controller role, humans provide high-level instructions, prompts, or commands, typically through natural language or multimodal inputs, while the system interprets human intent and translates it into task execution via planning, perception, or reasoning modules. This role is the most commonly identified and can be clearly reflected in [67], where the operator interacts through AR glasses and voice input to specify task objectives, ask visual questions, and issue high-level robot commands. In the Error Handler or Exception Manager role, humans intervene reactively when the robotic system encounters unexpected failures during perception, planning, or execution, such as incorrect grasps, or planning inconsistencies. This role is event-driven and typically involves direct corrective actions to restore system functionality and ensure task continuity. This role is reflected in [58], where, upon detecting an error, the robot translates it into a natural language message for the human operator, who then manually resolves the issue and commands the robot to resume the task. A similar dynamic is observed in [66], where the human oversees the process through a MR interface and intervenes only when expert judgment is required. Finally, the Collaborator or Co-worker role represents the highest level of human involvement. In this configuration, humans and robots work concurrently on the same object, requiring shared manipulation, physical interaction, and continuous real-time coordination. This role places strong demands on safety, ergonomics, adaptability, and responsiveness, and it most closely aligns with the human-centered and empowerment-oriented vision of Industry 5.0. This dynamic is exemplified in [63], where the human operator remains the primary executor of the task, and the robot operates as an autonomous assistant that perceives the environment, reasons about the operator’s needs, and delivers adaptive support.
Table 7 summarizes the distribution of human roles identified across the reviewed studies. This table indicates that most existing FM-based HRC systems emphasize instructional and error handler roles (with most cases mixing these two roles), while fewer studies support deep, physically coupled collaboration.

6.3. Foundation Model Adaptation and Enhancement Strategies

Additionally, this article identifies common interaction paradigms and adaptation strategies used to enhance the capabilities of the reported FMs. This review identified three main prompt engineering and adaptation techniques used in the reviewed articles: chain-of-thought, fine-tuning and retrieval-augmented generation.

6.3.1. Chain-of-Thought

Some studies adopt chain-of-thought prompting to decompose complex tasks into intermediate reasoning steps. In this context, Ding et al. [60] propose a semantic chain-of-thought prompt learning approach to enhance the generalization capability of AI agents and address the high cost, low efficiency, and limited accuracy associated with manually designed prompt templates in more traditional methods. Chen et al. [63] proposed a chain-of-thought approach that enables the MLLMs to perform step-by-step parsing and reasoning over complex task instructions. Their approach is used to guide the robot to decompose high-level commands into structured execution sequences composed of multiple low-level action primitives, which are then translated into motion control commands. Wang et al. [62] adopt a Reason–Act chain-of-thought framework, referred to as “think while doing,” to address the high flexibility and uncertainty inherent in HRC assembly scenarios. This framework generates a series of intermediate reasoning steps, thereby improving the ability of LLMs to perform complex reasoning and execute multi-step tasks.

6.3.2. Fine-Tuning

Fine-tuning has been employed in several reviewed studies to adapt pre-trained, general-purpose models to specialized tasks within industrial domains. For instance, Gao et al. [68] fine-tuned a generic GPT-3.5 model to interpret and translate diverse human instructions into structured assembly task attributes. They argue that fine-tuning enables the system to accurately identify corrective actions across a wide range of human inputs, thereby strengthening the robustness and reliability of their HRC system. Additional applications of fine-tuning are reported by Wu et al. [69] for VLMs and by Ma et al. [58] for lightweight local models. However, Wang et al. [62] argue that the heavy reliance on fine-tuning in many existing LLM-based HRC approaches presents several limitations, including potential degradation of the model’s inherent general capabilities, reduced adaptability, difficulty in handling long-horizon or continuous tasks, rigid response strategies, and increased computational cost.

6.3.3. Retrieval-Augmented Generation

Retrieval-augmented generation has been used to enhance FMs by integrating external knowledge into the generation process, thereby improving factual accuracy and contextual grounding. However, the tight coupling between retrieval and generation in conventional RAG pipelines often limits robustness, particularly when retrieved information is noisy, incomplete, or weakly relevant. To mitigate this issue, Tong et al. [65] introduce Actor–Retriever Retrieval-Augmented Generation (AR-RAG), a text-attributed graph-based framework that explicitly evaluates and regulates the interaction between retrieval and generation, rather than assuming uniform reliability across retrieved documents. By incorporating confidence-aware reasoning over agent- and task-specific information, AR-RAG identifies and filters inaccurate or irrelevant content prior to response synthesis, thereby improving precision, stability, and overall reliability. Similarly, Chen et al. [63] propose a structured RAG-based pipeline tailored to MLLMs in an assembly domain. Their approach guides the model to focus on task-relevant contextual cues, reinforcing semantic understanding while reducing hallucination occurrences.

7. Technical Challenges and Pathways

A set of identified and recurrent open challenges from a technical perspective is presented below.

7.1. Robustness in Human–Robot Communication

Effective human–robot communication represents a recurrent challenge in reviewed articles. In particular, environmental noise can complicate language-based interaction in manufacturing settings. Electrical interference, machinery operation, and robot motion can significantly degrade speech recognition accuracy, leading to misinterpretation of user input and incorrect task execution. While mitigation strategies such as fuzzy wake-word matching or louder speech have been proposed [12], these approaches introduce new usability concerns, including increased uncertainty and additional noise that may disturb other workers. As a result, reliance on voice-only interaction remains challenging in real-world industrial environments. To address these limitations, researchers must explore multimodal communication strategies capable of operating in complex manufacturing environments, which are often characterized by noise, occlusions, and dynamic disturbances [64].

7.1.1. Hallucinations

Hallucinations represent one of the most critical technical challenges often reported by reviewed works. Multiple studies report that integrated FMs occasionally generate incorrect instructions [58,59,62]. This issue can be exacerbated in long-horizon planning where semantic ambiguity accumulates over time [61]. Even when mitigation strategies such as rule-based verification, retrieval augmentation, or multi-agent cross-checking are introduced in different works, hallucination risks are often reported to be reduced but not completely eliminated [65]. As a result, some authors, such as [58], explicitly caution against direct execution of LLM-generated plans or constraints without external validation mechanisms.

7.1.2. Computational Efficiency and Industrial Deployability

Another recurrent limitation in the reviewed articles concerns computational efficiency and scalability. Consequently, several authors report response times incompatible with real-time interaction, particularly with perception pipelines. Moreover, achieving real-time performance often requires more computational power, which can, in many cases, make such systems economically impractical. Furthermore, reliance on cloud-based models raises concerns regarding deployment cost, data leakage, network reliability, and industrial scalability, especially for safety-critical or privacy-sensitive environments [57,58]. Therefore, future work should focus on adopting local and cost-efficient strategies to improve the security and usability of FM-based applications.

8. Perspectives Towards Socio-Technical Systems

Based on the survey results, future directions from a socio-technical perspective are outlined below:

8.1. Preserving Human Agency and Meaningful Roles Under Increasing AI Capability

Over the past two decades, Industry 4.0 has strongly shaped global priorities, guiding industrial innovation. While earlier industrial transitions often balanced job displacement with the creation of new roles, this balance is increasingly eroding as machines nowadays replace tasks faster than newly created human roles [77]. In this context, techno-centric solutions—such as reallocating workers to AI-facing activities (e.g., training, fine-tuning, maintenance, or supervision of AI systems) or increasing their AI literacy—cannot be assumed to be universally applicable or time- and cost-effective. Such roles may be inaccessible to many workers and difficult to implement at scale, particularly in low-skilled occupations. As described in [77], increasing autonomy inadvertently reduces humans to passive roles, where engagement, perceived skill use, and control decline while stress and well-being risks increase. Moreover, monitoring and fault-recovery work can impose high cognitive load and vigilance fatigue without offering corresponding opportunities for mastery or value creation [78,79]. This trend highlights a critical future research direction for FM-enabled and agentic HRC: the design of automation strategies that preserve human agency, dignity, and meaningful participation, rather than defaulting to diminished roles focused on system supervision, AI training, or exception handling. In this context, the co-creation of applications that explicitly verify the preservation of human agency and meaningful work remains an open and urgent research challenge.

8.2. From Human-as-Data to Human-as-Co-Designer

The inclusion of affected stakeholders is one of the main principles of RRI. It is a powerful approach for developing products, services, and applications that are not only technologically advanced but also socially desirable and viable. However, many of the reviewed contributions focus on PoC demonstrations, offering few or no opportunities to involve workers in the design loop. Therefore, research must move beyond treating humans primarily as data providers and instead recognize their importance and value when developing human–robot interactive systems. Within the proposed socio-technical perspective, this requires a shift toward participatory methodologies. Moreover, incorporating anticipatory discussions on the societal, organizational, and ethical implications of FM-based innovations can better support decision-making and improve risk mitigation in industrial contexts.

8.3. Human Factors and Human-Centric Evaluation

Although most reviewed articles present relevant technical advances, human factors remain underexplored or insufficiently addressed in both design and evaluation. From a human-centered perspective, one recurring limitation is the lack of ecological validity due to small samples and frequent reliance on researcher participants rather than industrial operators. This pattern reflects broader concerns in human-in-the-loop industrial cyber–physical systems, where human factors are often treated non-comprehensively and are not adequately incorporated into system requirements or evaluation protocols [76]. As a result, critical dimensions such as trust, user experience, cognitive workload, stress, fatigue, and affective state are rarely modeled or systematically measured, even though they strongly influence collaboration quality and operator well-being.
Future FM-enabled HRC systems should integrate explicit models of human cognitive, physical, and affective states not only for monitoring purposes but also to enable adaptive collaboration strategies that dynamically adjust autonomy, interaction modalities, and task allocation. Moreover, the success of agentic AI systems in HRC should be evaluated not only in terms of robotic efficiency and task performance, but also through human-centered outcomes—such as usability, comfort, trust, and long-term acceptance—assessed through longitudinal, in situ studies using qualitative, quantitative, and mixed methods with diverse novice users or real industrial workers [6,38].

9. Limitations of This Review

Similar to any other review, this work has certain limitations that should be considered when interpreting its findings. First, although the literature search followed a transparent protocol, the final corpus is limited to peer-reviewed publications published between 2023 and 2025. As a result, relevant early-stage works (in the pre-ChatGPT era) and non-peer-reviewed studies have been excluded. This choice was made to improve scientific rigor and provide an updated overview of recent practices towards agentic systems.
Second, data extraction tasks, such as in the classification of human roles, require interpretative judgment. In many reviewed articles, these aspects were not explicitly defined by the authors and had to be inferred from system descriptions, images, experimental setups, and reported interactions. This introduces a threat to theoretical validity, which depends on the researchers’ ability to accurately capture the intended meaning of the original studies [80]. Although classification criteria were applied consistently, some degree of misinterpretation or oversimplification may have occurred. Relatedly, interpretive validity may be affected by researcher bias when synthesizing and generalizing findings across heterogeneous studies [54]. Differences in terminology, abstraction levels, and experimental rigor across publications complicate direct comparison.

10. Conclusions

This article presents a novel socio-technical perspective aimed at guiding new researchers, practitioners, and managers in the implementation of FMs and HRC in industrial settings. The results confirm findings from previous research, highlighting the predominantly technology-centric orientation of most AI-related work. While this rapid technological advancement represents important progress, greater effort is needed on the human-centered and responsible aspects, particularly through more rigorous and reproducible human-centered experimental evaluations. As observed in the analysis, most studies do not explicitly report the number of participants, nor their demographics or backgrounds, implicitly suggesting that evaluations are often conducted by the same developers, with few notable exceptions.
While FM-enabled technologies are reaching a level of technical maturity, several challenges remain, including hallucinations, cost efficiency, and privacy concerns. However, the results of this article indicate that, from a human-centered and responsible perspective, the field is still in its early stages. Therefore, this article calls for new researchers in the area to adopt a transdisciplinary and socio-technical approach. In this perspective, humans should not be treated merely as additional entities within a human-in-the-loop configuration; instead, they must be positioned as the central priority, following a human-at-the-center-of-the-loop paradigm. Therefore, future research should emphasize and highlight how human capabilities can be augmented, how working conditions can be improved, and how job satisfaction can be enhanced through the effective use of FMs-based and HRC systems. In this context, the proposed framework provides an overview of key human-centered and responsible design principles and methodologies that can be explored by future researchers and practitioners to develop FM-enabled HRC systems that are not only technically effective but also socially acceptable and ethically grounded.
The proposed RAR conceptual framework is task-agnostic and can be adapted across different collaboration settings. Beyond industrial contexts, the proposed framework provides a structured foundation for guiding the development of agentic AI systems in other human–robot interaction domains, such as rehabilitation, assistive robotics, healthcare, and service robotics. For example, in rehabilitation robotics, the Context layer would incorporate patient conditions, physical and cognitive limitations, and therapeutic goals; the Design layer would emphasize adaptive control, personalization, and clinician-in-the-loop decision-making; and the Value layer would prioritize safety, dignity, and long-term well-being outcomes. Future research should investigate how the RAR perspective can be applied in these settings to ensure that increasing levels of autonomy remain aligned with human well-being, empowerment, and societal values.
Although this review synthesizes key socio-technical dimensions and design principles, the limited reporting of structured human-centered development processes and documentation practices in the analyzed literature makes it difficult to derive a standardized, evidence-based step-by-step implementation pipeline at this stage. Therefore, an important open challenge for future research is the development of a fully operationalized implementation protocol derived from the RAR framework. Such efforts should include detailed toolkits, recommended documentation practices, domain-specific guidelines, and systematic validation procedures to support researchers and practitioners in structuring, tracking, and evaluating the development lifecycle of agentic AI systems across different contexts.

Funding

This research received no external funding.

Data Availability Statement

No new datasets or experimental data were generated or analyzed in this study. This manuscript is grounded exclusively in a theoretical framework and a brief review of previously published literature.

Acknowledgments

During the preparation of this manuscript, the author used ChatGPT (version 5.2) and Microsoft Copilot 365 to assist with grammar refinement, correction of typographical errors, and improvement in clarity and readability in selected paragraphs and sentences. All AI-generated suggestions were carefully reviewed and edited by the authors. The author has reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RARResponsible Agentic Artificial Intelligence
HRCHuman–Robot Collaboration
FMsFoundation Models
IFMsIndustrial Foundation Models
LLMsLarge Language Models
VLMsVisual-Language Models
MLLMsMultimodal Large Language Models
CPHSCyber–Physical–Human Systems
IoTInternet of Things
DTDigital Twin
XRExtended Reality
RRIResponsible Research and Innovation
PoCProof of Concept
VSDValue-Sensitive Design

References

  1. Li, D.; Jin, Y.; Sun, Y.; A, Y.; Yu, H.; Shi, J.; Hao, X.; Hao, P.; Liu, H.; Li, X.; et al. What foundation models can bring for robot learning in manipulation: A survey. Int. J. Robot. Res. 2024, 0, 02783649251390579. [Google Scholar] [CrossRef]
  2. Zhao, S.; Liu, S.; Jiang, Y.; Zhao, B.; Lv, Y.; Zhang, J.; Wang, L.; Zhong, R.Y. Industrial foundation models (IFMs) for intelligent manufacturing: A systematic review. J. Manuf. Syst. 2025, 82, 420–448. [Google Scholar] [CrossRef]
  3. Itadera, S.; Ueshiba, T.; Coronado, E.; Domae, Y. Cyber-Physical-Human Systems for Error Recovery in a Bin-Picking Task. J. Robot. Mechatron. 2025, 37, 456–465. [Google Scholar] [CrossRef]
  4. Coronado, E.; Ueshiba, T.; Ramirez-Alpizar, I.G. A path to industry 5.0 digital twins for human–robot collaboration by bridging NEP+ and ROS. Robotics 2024, 13, 28. [Google Scholar] [CrossRef]
  5. Coronado, E.; Itadera, S.; Ramirez-Alpizar, I.G. Integrating virtual, mixed, and augmented reality to human–robot interaction applications using game engines: A brief review of accessible software tools and frameworks. Appl. Sci. 2023, 13, 1292. [Google Scholar] [CrossRef]
  6. Coronado, E.; Kiyokawa, T.; Ricardez, G.A.G.; Ramirez-Alpizar, I.G.; Venture, G.; Yamanobe, N. Evaluating quality in human-robot interaction: A systematic search and classification of performance and human-centered factors, measures and metrics towards an industry 5.0. J. Manuf. Syst. 2022, 63, 392–410. [Google Scholar] [CrossRef]
  7. Flores Gonzalez, J.M.; Coronado, E.; Yamanobe, N. ROS-Compatible Robotics Simulators for Industry 4.0 and Industry 5.0: A Systematic Review of Trends and Technologies. Appl. Sci. 2025, 15, 8637. [Google Scholar] [CrossRef]
  8. Khogali, H.O.; Mekid, S. The blended future of automation and AI: Examining some long-term societal and ethical impact features. Technol. Soc. 2023, 73, 102232. [Google Scholar] [CrossRef]
  9. Brohan, A.; Chebotar, Y.; Finn, C.; Hausman, K.; Herzog, A.; Ho, D.; Ibarz, J.; Irpan, A.; Jang, E.; Julian, R.; et al. Do as i can, not as i say: Grounding language in robotic affordances. In Proceedings of the Conference on Robot Learning. PMLR; JMLR: Cambridge, MA, USA, 2023; pp. 287–318. [Google Scholar]
  10. Liang, J.; Huang, W.; Xia, F.; Xu, P.; Hausman, K.; Ichter, B.; Zeng, A. Code as policies: Language model programs for embodied control. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2023; pp. 9493–9500. [Google Scholar]
  11. Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning; JMLR: Cambridge, MA, USA, 2023; ICML’23. [Google Scholar]
  12. Ji, Y.; Zhang, Z.; Tang, D.; Zheng, Y.; Liu, C.; Zhao, Z.; Li, X. Foundation models assist in human–robot collaboration assembly. Sci. Rep. 2024, 14, 24828. [Google Scholar] [CrossRef]
  13. Di Fede, G.; Alrabie, L.; Andolina, S. Human-Centered LLM. In Handbook of Human-Centered Artificial Intelligence; Xu, W., Ed.; Springer Nature: Singapore, 2025; pp. 1–35. [Google Scholar] [CrossRef]
  14. Ren, L.; Wang, H.; Wang, Y.; Huang, K.; Wang, L.; Li, B. Foundation Models for the Process Industry: Challenges and Opportunities. Engineering 2025, 52, 53–59. [Google Scholar] [CrossRef]
  15. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A comprehensive overview of large language models. ACM Trans. Intell. Syst. Technol. 2025, 16, 106. [Google Scholar] [CrossRef]
  16. Patil, R.; Gudivada, V. A review of current trends, techniques, and challenges in large language models (llms). Appl. Sci. 2024, 14, 2074. [Google Scholar] [CrossRef]
  17. Ayyat, M.; Osman, M.; Nadeem, T. Opportunities and challenges of foundation models in industrial manufacturing. IEEE Access 2025, 13, 138745–138775. [Google Scholar] [CrossRef]
  18. Fan, J.; Yin, Y.; Wang, T.; Dong, W.; Zheng, P.; Wang, L. Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey. Front. Eng. Manag. 2025, 12, 177–200. [Google Scholar] [CrossRef]
  19. Ma, Y.; Zheng, S.; Yang, Z.; Zheng, P.; Leng, J.; Hong, J. Leveraging large language models in next generation intelligent manufacturing: Retrospect and prospect. J. Manuf. Syst. 2025, 82, 809–840. [Google Scholar] [CrossRef]
  20. Chen, C.; Zhao, K.; Leng, J.; Liu, C.; Fan, J.; Zheng, P. Integrating large language model and digital twins in the context of industry 5.0: Framework, challenges and opportunities. Robot. Comput.-Integr. Manuf. 2025, 94, 102982. [Google Scholar] [CrossRef]
  21. Wu, D.; Zheng, P.; Zhao, Q.; Zhang, S.; Qi, J.; Hu, J.; Zhu, G.N.; Wang, L. Empowering natural human–robot collaboration through multimodal language models and spatial intelligence: Pathways and perspectives. Robot. Comput.-Integr. Manuf. 2026, 97, 103064. [Google Scholar] [CrossRef]
  22. Wang, T.; Fan, J.; Zheng, P. An LLM-based vision and language cobot navigation approach for Human-centric Smart Manufacturing. J. Manuf. Syst. 2024, 75, 299–305. [Google Scholar] [CrossRef]
  23. Gkournelos, C.; Konstantinou, C.; Makris, S. An LLM-based approach for enabling seamless Human-Robot collaboration in assembly. CIRP Ann. 2024, 73, 9–12. [Google Scholar] [CrossRef]
  24. Xia, L.; Li, C.; Zhang, C.; Liu, S.; Zheng, P. Leveraging error-assisted fine-tuning large language models for manufacturing excellence. Robot. Comput.-Integr. Manuf. 2024, 88, 102728. [Google Scholar] [CrossRef]
  25. Sony, M.; Naik, S. Industry 4.0 integration with socio-technical systems theory: A systematic review and proposed theoretical model. Technol. Soc. 2020, 61, 101248. [Google Scholar] [CrossRef]
  26. Akbarighatar, P. Operationalizing responsible AI principles through responsible AI capabilities. AI Ethics 2025, 5, 1787–1801. [Google Scholar]
  27. Bach, T.A.; Kaarstad, M.; Solberg, E.; Babic, A. Insights into suggested Responsible AI (RAI) practices in real-world settings: A systematic literature review. AI Ethics 2025, 5, 3185–3232. [Google Scholar] [CrossRef]
  28. Hua, Y.; Li, K.; Wang, R.; Li, Y.; Wang, G.; Yan, Y. Integration of dynamic knowledge and LLM for adaptive human-robot collaborative assembly solution generation. Adv. Eng. Inform. 2025, 68, 103613. [Google Scholar] [CrossRef]
  29. Sadek, M.; Calvo, R.A.; Mougenot, C. Designing value-sensitive AI: A critical review and recommendations for socio-technical design processes. AI Ethics 2024, 4, 949–967. [Google Scholar]
  30. Morita, P.P.; Abhari, S.; Kaur, J.; Lotto, M.; Miranda, P.A.D.S.E.S.; Oetomo, A. Applying ChatGPT in public health: A SWOT and PESTLE analysis. Front. Public Health 2023, 11, 1225861. [Google Scholar] [CrossRef] [PubMed]
  31. Stahl, B.C.; Eden, G.; Jirotka, M.; Coeckelbergh, M. From computer ethics to responsible research and innovation in ICT: The transition of reference discourses informing ethics-related research in information systems. Inf. Manag. 2014, 51, 810–818. [Google Scholar] [CrossRef]
  32. Burget, M.; Bardone, E.; Pedaste, M. Definitions and conceptual dimensions of responsible research and innovation: A literature review. Sci. Eng. Ethics 2017, 23, 1–19. [Google Scholar] [CrossRef] [PubMed]
  33. Li, W.; Yigitcanlar, T.; Browne, W.; Nili, A. The making of responsible innovation and technology: An overview and framework. Smart Cities 2023, 6, 1996–2034. [Google Scholar] [CrossRef]
  34. Baldassarre, B.; Calabretta, G.; Karpen, I.O.; Bocken, N.; Hultink, E.J. Responsible design thinking for sustainable development: Critical literature review, new conceptual framework, and research agenda. J. Bus. Ethics 2024, 195, 25. [Google Scholar] [CrossRef]
  35. Setälä, M. Inclusive deliberation for future-regarding governance: Potentials and pitfalls. Policy Stud. 2025; in press. [Google Scholar]
  36. Romero, V.; Rivera, E. Human-centred design thinking as a co-creation process: A commentary. Prev. Med. 2025, 199, 108375. [Google Scholar] [PubMed]
  37. Coronado, E.; Reyes, M.; Ramirez, Y.; Pedraza, I. Robots for Well-being: Design and Integration of a Low-Cost Social Robot Prototype for Promoting Healthy Habits. In Proceedings of the 2025 IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO); IEEE: New York, NY, USA, 2025; pp. 164–170. [Google Scholar]
  38. Tian, L.; Wu, T.L.; Robinson, N.L.; Carreno-Medrano, P.; Chan, W.P.; Sakr, M.; Abdi, E.; Croft, E.A.; Kulić, D. Experimental Methodology for Human–Robot Interaction: Guidelines and Case Studies for Human-Centred and Ethical Robotics Research; CRC Press: Boca Raton, FL, USA, 2025. [Google Scholar]
  39. van der Velden, M.; Mörtberg, C. Participatory Design and Design for Values. In Handbook of Ethics, Values, and Technological Design: Sources, Theory, Values and Application Domains; van den Hoven, J., Vermaas, P.E., van de Poel, I., Eds.; Springer: Dordrecht, The Netherlands, 2015; pp. 1–22. [Google Scholar] [CrossRef]
  40. Kensing, F.; Greenbaum, J. Heritage: Having a say. In Routledge International Handbook of Participatory Design; Simonsen, J., Robertson, T., Eds.; Routledge: London, UK, 2013; pp. 21–36. [Google Scholar]
  41. Bødker, S.; Dindler, C.; Iversen, O.S.; Smith, R.C. What is participatory design? In Participatory Design; Springer International Publishing: Cham, Switzerland, 2022; pp. 5–14. [Google Scholar] [CrossRef]
  42. Ron, G.; Menges, A.; Wortmann, T. Critical Collaboration: Reflecting on Power and Agency in Human-Robot-Collaboration in Architecture and Construction, for a Diverse and Democratic Practice. In Proceedings of the Design Modelling Symposium Berlin; Springer: Berlin/Heidelberg, Germany, 2024; pp. 191–204. [Google Scholar]
  43. Åsberg, C.; Lykke, N. Feminist technoscience studies. Eur. J. Women Stud. 2010, 17, 299–305. [Google Scholar] [CrossRef]
  44. Cao, H.L.; Elprama, S.A.; Scholz, C.; Siahaya, P.; El Makrini, I.; Jacobs, A.; Ajoudani, A.; Vanderborght, B. Designing interaction interface for supportive human-robot collaboration: A co-creation study involving factory employees. Comput. Ind. Eng. 2024, 192, 110208. [Google Scholar] [CrossRef]
  45. Cordova, D.C.; Kelly, N.; Rezayan, L. A systematic literature review of the speculative design process and a proposed framework for speculative design. Des. Sci. 2025, 11, e38. [Google Scholar] [CrossRef]
  46. Hohendanner, M.; Ullstein, C.; Miyamoto, D.; Huffman, E.F.; Socher, G.; Grossklags, J.; Osawa, H. Metaverse perspectives from Japan: A participatory speculative design case study. Proc. ACM Hum.-Comput. Interact. 2024, 8, 400. [Google Scholar] [CrossRef]
  47. Altshuler, S.; Hershkovitz, A.; Mikulinsky, R.; Müller, B. Governing phygital spaces: Human rights by design meets speculative design. Internet Policy Rev. 2025, 14, 1–26. [Google Scholar] [CrossRef]
  48. Gazzaneo, L.; Padovano, A.; Umbrello, S. Designing smart operator 4.0 for human values: A value sensitive design approach. Procedia Manuf. 2020, 42, 219–226. [Google Scholar] [CrossRef]
  49. Iversen, O.S.; Halskov, K.; Leong, T.W. Rekindling values in participatory design. In Proceedings of the 11th Biennial Participatory Design Conference; Association for Computing Machinery: New York, NY, USA, 2010; pp. 91–100. [Google Scholar]
  50. Friedman, B.; Kahn, P.H., Jr. Human values, ethics, and design. In The Human-Computer Interaction Handbook; CRC Press: Boca Raton, FL, USA, 2007; pp. 1267–1292. [Google Scholar]
  51. Vernim, S.; Bauer, H.; Rauch, E.; Ziegler, M.T.; Umbrello, S. A value sensitive design approach for designing AI-based worker assistance systems in manufacturing. Procedia Comput. Sci. 2022, 200, 505–516. [Google Scholar] [CrossRef]
  52. Longo, F.; Padovano, A.; Umbrello, S. Value-oriented and ethical technology engineering in industry 5.0: A human-centric perspective for the design of the factory of the future. Appl. Sci. 2020, 10, 4182. [Google Scholar]
  53. Wohlin, C.; Runeson, P.; Neto, P.A.d.M.S.; Engström, E.; do Carmo Machado, I.; De Almeida, E.S. On the reliability of mapping studies in software engineering. J. Syst. Softw. 2013, 86, 2594–2610. [Google Scholar] [CrossRef]
  54. Petersen, K.; Vakkalanka, S.; Kuzniarz, L. Guidelines for conducting systematic mapping studies in software engineering: An update. Inf. Softw. Technol. 2015, 64, 1–18. [Google Scholar] [CrossRef]
  55. Fan, J.; Zheng, P. A vision-language-guided robotic action planning approach for ambiguity mitigation in human–robot collaborative manufacturing. J. Manuf. Syst. 2024, 74, 1009–1018. [Google Scholar] [CrossRef]
  56. Lim, J.; Patel, S.; Evans, A.; Pimley, J.; Li, Y.; Kovalenko, I. Enhancing human-robot collaborative assembly in manufacturing systems using large language models. In Proceedings of the 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE); IEEE: New York, NY, USA, 2024; pp. 2581–2587. [Google Scholar]
  57. Liu, C.; Tang, D.; Zhu, H.; Zhang, Z.; Wang, L.; Zhang, Y. Vision language model-enhanced embodied intelligence for digital twin-assisted human-robot collaborative assembly. J. Ind. Inf. Integr. 2025, 48, 100943. [Google Scholar] [CrossRef]
  58. Ma, D.; Zhang, C.; Xu, Q.; Zhou, G. Large and small-scale models’ fusion-driven proactive robotic manipulation control for human-robot collaborative assembly in industry 5.0. Robot. Comput.-Integr. Manuf. 2026, 97, 103078. [Google Scholar] [CrossRef]
  59. Verhelst, E.; Harley, N.; Van Doninck, B.; Bey-Temsamani, A. A Digital Colleague as intuitive operator support system in a HMLV Production Environment. Procedia CIRP 2025, 136, 242–247. [Google Scholar] [CrossRef]
  60. Ding, P.; Zhang, J.; Zhang, P.; Li, H.; Wang, D. CCM-FCC: LLM-powered cognition-centered AI agent framework for proactive human-robot collaboration. Robot. Comput.-Integr. Manuf. 2026, 98, 103145. [Google Scholar] [CrossRef]
  61. Wang, B.; Zheng, L.; Wang, Y.; Qi, Z. LLM-based multi-agent task planning for human-robot collaborative assembly balancing operator experience and efficiency. J. Manuf. Syst. 2025, 82, 1020–1045. [Google Scholar] [CrossRef]
  62. Wang, Y.; Guo, Q.; Zheng, L.; Wang, B.; Zheng, P.; Qi, Z. LLM based autonomous agent of human-robot collaboration for aerospace wire harnessing assembly. Robot. Comput.-Integr. Manuf. 2026, 97, 103120. [Google Scholar] [CrossRef]
  63. Chen, J.; Huang, S.; Wang, X.; Wang, P.; Zhu, J.; Xu, Z.; Wang, G.; Yan, Y.; Wang, L. Perception-decision-execution coordination mechanism driven dynamic autonomous collaboration method for human-like collaborative robot based on multimodal large language model. Robot. Comput.-Integr. Manuf. 2026, 98, 103167. [Google Scholar] [CrossRef]
  64. Li, J.; Xiong, J.; Zhang, Z.; Guo, D. A Multimodal Large Model to Enhance Robot Understanding of Human Intentions for Accurate Human Robot Collaborative Manufacturing. IFAC-PapersOnLine 2025, 59, 2820–2825. [Google Scholar] [CrossRef]
  65. Tong, X.; Li, K.; Bao, J. GNN-LLM hybrid cognitive architectures for generative task adaptation in multi-human multi-robot collaborative disassembly. Robot. Comput.-Integr. Manuf. 2026, 98, 103169. [Google Scholar] [CrossRef]
  66. Yu, W.; Lv, J.; Zhuang, W.; Pan, X.; Wen, S.; Bao, J.; Li, X. Rescheduling human-robot collaboration tasks under dynamic disassembly scenarios: An MLLM-KG collaboratively enabled approach. J. Manuf. Syst. 2025, 80, 20–37. [Google Scholar] [CrossRef]
  67. Lv, J.; Si, J.; Gao, D.; Bao, J. Historical visual question answering with large language model for Augmented Reality-assisted Human–Robot Collaboration. J. Manuf. Syst. 2025, 83, 546–556. [Google Scholar] [CrossRef]
  68. Gao, F.; Xia, L.; Zhang, J.; Liu, S.; Wang, L.; Gao, R.X. Integrating large language model for natural language-based instruction toward robust human-robot collaboration. Procedia CIRP 2024, 130, 313–318. [Google Scholar] [CrossRef]
  69. Wu, D.; Zhao, Q.; Fan, J.; Qi, J.; Zheng, P.; Hu, J. H2R Bridge: Transferring vision-language models to few-shot intention meta-perception in human robot collaboration. J. Manuf. Syst. 2025, 80, 524–535. [Google Scholar] [CrossRef]
  70. Xia, W.; Zheng, H.; Xu, W.; Xu, X. Large vision-language models enabled novel objects 6D pose estimation for human-robot collaboration. Robot. Comput.-Integr. Manuf. 2025, 95, 103030. [Google Scholar] [CrossRef]
  71. Zheng, P.; Li, C.; Fan, J.; Wang, L. A vision-language-guided and deep reinforcement learning-enabled approach for unstructured human-robot collaborative manufacturing task fulfilment. CIRP Ann. 2024, 73, 341–344. [Google Scholar] [CrossRef]
  72. Simeone, A.; Fan, Y.; Antonelli, D.; Priarone, P.C.; Settineri, L. Conceptualisation of a multimodal, non-intrusive, generative AI-based assistive system for assembly. CIRP Annals 2025, 74, 37–41. [Google Scholar] [CrossRef]
  73. Guo, Y.; Yi, P.; Wei, X.; Zhou, D. Active interaction strategy generation for human-robot collaboration based on trust. Vis. Comput. Ind. Biomed. Art 2025, 8, 16. [Google Scholar] [CrossRef]
  74. Laugwitz, B.; Held, T.; Schrepp, M. Construction and evaluation of a user experience questionnaire. In Proceedings of the Symposium of the Austrian HCI and Usability Engineering Group; Springer: Berlin/Heidelberg, Germany, 2008; pp. 63–76. [Google Scholar]
  75. Hart, S.G. NASA-task load index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting; Sage Publications Sage CA: Los Angeles, CA, USA, 2006; Volume 50, pp. 904–908. [Google Scholar]
  76. Clemmensen, T.; Moghaddam, M.T.; Nørbjerg, J. Cyber-physical systems with Human-in-the-Loop: A systematic review of socio-technical perspectives. J. Syst. Softw. 2025, 226, 112348. [Google Scholar] [CrossRef]
  77. Lee, H.R.; Fox, S.; Cheon, E.; Shorey, S. Minding the Stop-Gap: Attending to the “Temporary,” Unplanned, and Added Labor of Human-Robot Collaboration in Context. In Proceedings of the 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI); IEEE: New York, NY, USA, 2025; pp. 34–44. [Google Scholar]
  78. Turkkan Zencirli, B.; Altin, M. The impact of artificial intelligence on workplace dehumanization: A critical review. J. Hosp. Tour. Horizons 2025, 1, 128–149. [Google Scholar] [CrossRef]
  79. Kayyali, M. Ethical AI and Automation in the Workplace. In Leading Inclusive Workplaces Through Digital Transformation and Organizational Change; IGI Global Scientific Publishing: Hershey, PA, USA, 2026; pp. 103–138. [Google Scholar]
  80. Coronado, E.; Mastrogiovanni, F.; Indurkhya, B.; Venture, G. Visual programming environments for end-user development of intelligent and social robots, a systematic review. J. Comput. Lang. 2020, 58, 100970. [Google Scholar] [CrossRef]
Figure 1. General overview of the proposed socio-technical framework for the design and development of responsible agentic AI and robotic systems.
Figure 1. General overview of the proposed socio-technical framework for the design and development of responsible agentic AI and robotic systems.
Robotics 15 00058 g001
Figure 2. Technical system of the proposed RAR framework. The context layer provides information for situational awareness, the agentic AI design layer provides advanced perception, cognition and action, and the value layer supports integration and deployment through enabling technologies, infrastructure, and connectors to build diverse industrial applications.
Figure 2. Technical system of the proposed RAR framework. The context layer provides information for situational awareness, the agentic AI design layer provides advanced perception, cognition and action, and the value layer supports integration and deployment through enabling technologies, infrastructure, and connectors to build diverse industrial applications.
Robotics 15 00058 g002
Figure 3. Social system of the proposed RAR framework. The context layer includes a set of relevant human concerns, stakeholder expectations, political, economic, social, technological, legal, and environmental (PESTLE) factors, as well as sustainability goals and metrics. The design layer integrates responsible and human-centered methodologies that operationalize these contextual considerations into the design process. The value layer is composed of a set of societal, ethical, and sustainability outcomes.
Figure 3. Social system of the proposed RAR framework. The context layer includes a set of relevant human concerns, stakeholder expectations, political, economic, social, technological, legal, and environmental (PESTLE) factors, as well as sustainability goals and metrics. The design layer integrates responsible and human-centered methodologies that operationalize these contextual considerations into the design process. The value layer is composed of a set of societal, ethical, and sustainability outcomes.
Robotics 15 00058 g003
Figure 4. Responsible principles guiding the design, development, and deployment of agentic AI and robotic systems. Participatory Design, Human-Centered Design, Value-Sensitive Design, and Speculative Design methodologies systematically operationalize reflexivity, inclusion, responsiveness, and anticipation, helping to align industrial requirements with human-centered, sustainable, and societal values.
Figure 4. Responsible principles guiding the design, development, and deployment of agentic AI and robotic systems. Participatory Design, Human-Centered Design, Value-Sensitive Design, and Speculative Design methodologies systematically operationalize reflexivity, inclusion, responsiveness, and anticipation, helping to align industrial requirements with human-centered, sustainable, and societal values.
Robotics 15 00058 g004
Figure 5. Double Diamond model for human-centered design of technological solutions.
Figure 5. Double Diamond model for human-centered design of technological solutions.
Robotics 15 00058 g005
Figure 6. The paper selection process.
Figure 6. The paper selection process.
Robotics 15 00058 g006
Table 1. Existing survey articles related to Foundation Models and human–robot collaboration.
Table 1. Existing survey articles related to Foundation Models and human–robot collaboration.
No.ReferenceYearMain Focus
1Fan et al. [18]2025Technical survey of Vision–Language Models for HRC, emphasizing robotic autonomy and system performance.
2Ma et al. [19]2025Broad review of LLM applications in Industry 5.0, with limited coverage of practical and human-centered HRC deployments.
3Chen et al. [20]2025Overview of LLM and Digital Twin integration in Industry 5.0, with minimal focus on empirical HRC systems and human factors.
4Wu et al. [21]2026Robot-centered survey of multimodal language models and spatial intelligence for HRC, focusing on perception and autonomy mechanisms.
Table 2. Research questions and data extraction dimensions guiding the review.
Table 2. Research questions and data extraction dimensions guiding the review.
RQResearch QuestionData Extraction Dimension
RQ1How are FMs (LLMs, VLMs, and MLLMs) currently used to support human–robot collaborative assembly/disassembly in industrial contexts?FMs need/objective; Type of FM (LLM, VLM, MLLM); input modality.
RQ2What human-centered and responsible-deployment considerations are reported, and what is the role of humans?Human-centered and responsible AI dimensions (trust, inclusion, personalization, ergonomics); human-role in HRC tasks.
Table 3. Inclusion criteria.
Table 3. Inclusion criteria.
I1Uses or discusses FMs (e.g., LLMs, VLMs, MLLMs) systems applied to robotics systems.
I2Uses or discusses FMs (e.g., LLMs, VLMs, MLLMs) systems applied to manufacturing systems.
Table 4. Exclusion criteria.
Table 4. Exclusion criteria.
E1Studies that do not demonstrate real HRC in assembly or disassembly manufacturing tasks, instead they focus on isolated algorithms for perception, cognition, or control without actual human–robot interaction during task execution and evaluation.
E2Studies that primarily target system automation or robotic skill enhancement, where human involvement is limited to data collection, annotation, or model training phases, and where no explicit role for humans is described or demonstrated during system operation after training.
E3Gray literature, short-papers (e.g., extended abstracts), book chapters, and non-peer-reviewed materials including articles available exclusively as preprints (e.g., arXiv).
Table 6. Technical comparison of HRC studies using FMs, summarizing input modalities, FM types, and main representative models.
Table 6. Technical comparison of HRC studies using FMs, summarizing input modalities, FM types, and main representative models.
ArticleInput DataFMs TypeMain Models
[56]voice/textLLMsGPT-4
[12]voice/textLLMs, VFMsGPT-4, SAM+CLIP
[73]voice/text, imageVLMsGPT-4 VLM
[68]voice/textLLMsGPT-3.5-turbo-1106 (fine-tuned)
[72]voice/text, imageLLMs; VLMsChatGPT-4.0
[64]EEG, voice/text, imageLLMs, MLLMsOpenAI GPT API, Attention-Based Multimodal Fusion
[57]voice/text, imageLLMs; VLMsChatGPT-4, CLIP + BERT (fine-tuned)
[59]voice/textLLMsN.A.
[65]voice/textLLMsQwq-32B (fine-tuned)
[71]voice/text, imageLLMs; VLMsGPT-3.5, ResNet-50+BERT
[66]voice/text, motionMLLMsLLaVA-v1.6
[67]voice/textLLMs, MLLMsGPT-3.5, BLIP
[69]voice/text, imageVLMsCLIP (fine-tuned), GPT-2 encoder
[55]voice/text, imageLLMs; VLMsGPT-4, CLIP (ResNet50 image encoder and Transformer text encoder)
[61]voice/text, imageLLMsGPT-4
[70]voice/text, imageLLMs, VLMsDetic, SAM, DINOv2, GPT-4o
[58]voice/text, imageLLMs, VFMsChatGPT-4o, DINOv2, Qwen2–1.5B (fine-tuned)
[62]voice/textLLMsGPT-4, Qwen, ChatGLM
[60]voice/textLLMsGPT-3.5 API
[63]voice/textLLMs, MLLMsGPT-4o
[28]voice/text, imageLLMsGPT-4
Table 7. Distribution of human roles identified in the reviewed FM-based HRC studies. Some articles exhibit more than one role.
Table 7. Distribution of human roles identified in the reviewed FM-based HRC studies. Some articles exhibit more than one role.
Human RoleNumber of Articles
Instructor/Controller17
Error Handler/Exception Manager11
Collaborator/Co-worker8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Coronado, E. From Large Language Models to Agentic AI in Industry 5.0 and the Post-ChatGPT Era: A Socio-Technical Framework and Review on Human–Robot Collaboration. Robotics 2026, 15, 58. https://doi.org/10.3390/robotics15030058

AMA Style

Coronado E. From Large Language Models to Agentic AI in Industry 5.0 and the Post-ChatGPT Era: A Socio-Technical Framework and Review on Human–Robot Collaboration. Robotics. 2026; 15(3):58. https://doi.org/10.3390/robotics15030058

Chicago/Turabian Style

Coronado, Enrique. 2026. "From Large Language Models to Agentic AI in Industry 5.0 and the Post-ChatGPT Era: A Socio-Technical Framework and Review on Human–Robot Collaboration" Robotics 15, no. 3: 58. https://doi.org/10.3390/robotics15030058

APA Style

Coronado, E. (2026). From Large Language Models to Agentic AI in Industry 5.0 and the Post-ChatGPT Era: A Socio-Technical Framework and Review on Human–Robot Collaboration. Robotics, 15(3), 58. https://doi.org/10.3390/robotics15030058

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop