1. Introduction
Human-Robot Collaboration (HRC) in Industry 5.0 faces challenges like variable human states (fatigue, skills), leading to inefficient task allocation. I therefore construct synthetic datasets reproducing heart-rate, EMG, blink-rate and posture metrics [1,2]. My solution integrates a Convolutional Neural Network (CNN) for state perception and a Double Deep Q-Network (DDQN) for allocation, fused in a closed-loop pipeline. HRC is central to Industry 4.0 smart factories, enabling flexible, high-throughput manufacturing by combining human cognition with robotic precision [3,4]. In HRC, humans and robots share workspaces and collaborate on tasks, leveraging their complementary strengths to achieve superior performance in complex manufacturing processes such as assembly, quality control, and maintenance [5,6]. Industry 4.0 focuses on efficiency-centred automation [3,4], while Industry 5.0 prioritises human well-being and sustainability [7,8,9], with this framework emphasising the latter through fatigue-aware allocation. As the industry transitions toward Industry 5.0, which emphasises human-centric approaches and sustainable manufacturing practices, HRC is increasingly vital for empowering workers rather than replacing them [7,8,9]. However, Dynamic Task Allocation (DTA)—the real-time assignment of tasks to human or robot agents based on factors such as operator fatigue, skill level, and robot availability—remains a significant challenge; traditional rule-based schemes and early Machine-Learning (ML) methods cannot adapt adequately to changing human conditions, leading to inefficiencies in HRC [10,11]. Fatigue-induced errors in assembly [11] motivate the adaptive allocation proposed here. Real-life pain points in Industry 5.0 HRC include technological integration challenges, ethical AI concerns, regulatory barriers, and cybersecurity risks [8], as well as balancing technical and human skills [5] and ensuring cost-effective and accessible applications.
Recent advances in Artificial Intelligence (AI) offer promising solutions to these challenges. Deep Learning (DL) excels at inferring human states, such as fatigue and expertise, from multimodal sensor data [1,2], building on recent works in multimodal HRI for enhanced environmental awareness and response capabilities [12]. For instance, DL has been applied to monitor workers’ psychological and physiological states during HRC [1] and to capture mental workload through physiological sensors [2]. Meanwhile, Reinforcement Learning (RL) provides principled strategies for sequential decision-making, making it well-suited for optimising task allocation in dynamic environments [13,14], extending recent innovations in multi-agent RL for heterogeneous multi-robot systems [15]. RL has been utilised for learning optimal policies for human-robot coordination [13] and safe motion planning in industrial settings [14]. However, standalone DL or RL solutions struggle to balance human variability with system constraints. Although hybrid DL-RL frameworks have shown promise in broader robotics tasks, their deployment for adaptive HRC task assignment is still under-explored [16,17]. Progress in international research includes early DL for state monitoring [1,2] and RL for coordination [14,16], but gaps persist in integrating perception and decision-making for human-centric adaptability [16,17]. This view is reinforced by recent reviews such as Lorenzini et al. [18], who synthesise ergonomic factors in collaborative robotics for industrial use, and Maruyama et al. [19], who propose a digital twin approach enhanced by digital humans for workload analysis in HRC scenarios. This motivates the proposed hybrid framework, which addresses these gaps by fusing CNN perception and DDQN decision-making in a closed-loop pipeline. Unlike existing hybrid DL-RL methods such as Ma et al. [20]—which focus on reactive planning without fatigue integration—Peng et al. [21]—which target disassembly without skill-level perception—and Lee et al. [22]—which employ Bayesian RL without DDQN stability—my framework uniquely combines CNN-based real-time human-state classification (fatigue + skill) with DDQN multi-objective optimisation (throughput, workload, safety), yielding closed-loop adaptability in Industry 5.0 simulations.
A key construct in effective HRC is collaborative fluency, defined as “the coordinated meshing of joint activities between members of a well-synchronised team” [23]. Hoffman [23] proposes objective metrics (e.g., human/robot idle time, concurrent activity, functional delay) and subjective scales (e.g., perceived team commitment and fluency) evaluated in assembly tasks with human subjects. These metrics highlight the need for adaptive systems that minimise delays and enhance synchronisation, yet current hybrid DL-RL approaches [16,17] often overlook such fluency in dynamic task allocation under variable human states. My framework tackles this oversight by explicitly fusing perception and decision-making to enhance fluency, as demonstrated in Section 4.
Given the complexity and safety considerations in HRC, simulation-based approaches are critical for developing and validating adaptive systems before real-world deployment. Digital-twin technologies, which create virtual replicas of physical systems, enable realistic simulations of HRC scenarios [24,25]. These digital twins can be enhanced with deep learning to improve safety and reliability [25] and used for safe motion planning through co-evolution approaches [24]. Platforms like RoboDK facilitate high-fidelity simulations of human-robot interactions, allowing researchers to test algorithms in controlled environments [26]. In this work, I present a simulation-based hybrid DL–RL architecture that integrates a CNN for real-time human-state recognition with a DDQN for dynamic, context-aware task allocation. This approach enables continuous monitoring of human conditions and adaptive adjustment of task assignments, enhancing both efficiency and worker well-being in smart-factory settings [20,21]. Implemented in MATLAB R2025a and RoboDK v5.9 using a synthetic dataset of 1000 episodes, the framework achieves a 21% throughput boost, a 7% fatigue reduction and maintains a 99.9% collision-free rate (one-way ANOVA + Tukey HSD, p < 0.05). A set of graphical proof-of-concept figures demonstrates online task reallocation as simulated fatigue and skill metrics drift during a production cycle.
Unlike Ma et al. [20] (reactive planning without fatigue integration), Peng et al. [21] (disassembly-oriented with no skill-level perception), and Lee et al. [22] (Bayesian RL without double-estimation), my framework integrates real-time human-state perception with double-estimating DDQN allocation, reducing Q-value overestimation and improving safety by 1 pp.
This manuscript reviews the literature chronologically to clarify progress and gaps on HRC challenges and articulates three principal contributions:
Design and implementation of a high-fidelity DL-RL pipeline with closed-loop CNN–DDQN integration.
Quantitative evaluation across 1000 runs showing gains over SOTA baselines such as PPO and constraint programming [8].
Visual demonstrations that highlight the framework’s responsiveness and the innovative use of digital twins for Industry 5.0 adaptability.
The remainder of this paper is organised as follows. Section 2 details Materials and Methods; Section 3 describes the experimental setup and simulation flow; Section 4 presents quantitative results; Section 5 discusses findings and limitations; Section 6 concludes.
1.1. Background and Significance
Human–Robot Collaboration fosters flexible automation by integrating the complementary strengths of humans and robots, enabling efficient handling of complex tasks in smart manufacturing [3,5]. DTA is essential for maintaining efficiency and safety by assigning tasks in real time based on operator fatigue, skill level and robot status. Traditional rule-based schemes deteriorate when human states fluctuate, and single-model ML approaches lack the agility demanded by modern production lines [10,11]. Current debates centre on whether lightweight heuristics, such as those used in assembly-line balancing [10], suffice, or whether adaptive, data-driven frameworks, like learning-based coordination [13], are required to meet Industry 4.0 targets. With the advent of Industry 5.0, the discourse shifts toward human-centric design and sustainability, introducing challenges such as technological integration, ethical bias, regulation and cybersecurity in HRC [8].
1.2. Research Gap and Objectives
To overcome the limitations of static heuristics and single-agent RL, this paper proposes a hybrid DL–RL framework that:
Infers human states online via a CNN trained on multimodal synthetic data.
Allocates tasks through a DDQN that jointly optimises throughput, workload and collision avoidance.
Unlike previous studies, the approach fuses perception and decision-making in a closed loop, yielding a scalable blueprint for smart-factory deployments. Graphical proof-of-concept figures (Section 3) illustrate reallocations for a single operator under varying fatigue and skill conditions.
Objectives:
Design and implement the CNN+DDQN architecture for adaptive HRC task allocation.
Benchmark its performance against rule-based and SARSA baselines in 1000 high-fidelity episodes.
Visualise adaptability through real-time task-reassignment plots.
Hypotheses:
H1: The hybrid DL-RL framework surpasses rule-based and single-model RL baselines in allocation efficiency.
H2: Online human-state recognition improves adaptability relative to perception-free methods.
H3: The framework scales toward multi-robot Industry 5.0 cells.
2. Materials and Methods
The proposed hybrid DL–RL framework was evaluated entirely in a high-fidelity simulation of a collaborative industrial assembly line, constructed to ensure controlled, repeatable experimentation and rigorous evaluation of adaptive task allocation in HRC. This simulation environment was developed using MATLAB R2025a (MathWorks, Natick, MA, USA) in conjunction with RoboDK 5.9 (RoboDK Inc., Montreal, QC, Canada) [26], providing a realistic yet safe platform for testing the integrated algorithms. A simulation-based approach was selected to alleviate real-world resource constraints, enabling safe and repeatable trials under representative conditions, aligning with recent trends in using digital twins for HRC research [24,27]. The dataset and selected source code are available on FigShare (https://doi.org/10.6084/m9.figshare.29323520) and GitHub (https://github.com/ClaudioUrrea/doosan) to support validation of the results, in line with MDPI’s open-science guidelines.
The research contributions and proposed methods are summarised as follows: the design of a simulation-based hybrid DL–RL pipeline, quantitative benchmarking against advanced baselines, and visual demonstrations of adaptability. The methods integrate a CNN for human-state perception and a DDQN for task allocation, as detailed below (Figure 1).
Description:
Sensor Data: Input from simulated sensors (e.g., 128 × 128 × 3 feature maps).
CNN: Processes data to classify human state (9 classes: 3 fatigue levels × 3 skill levels).
DDQN: Uses state vector (human state+task queue+robot status) to select actions via Q-values.
Simulation Environment: Executes actions and returns rewards/feedback according to the reward function R = 0.5 × Throughput − 0.3 × Workload + 0.2 × Safety (Section 2.3.2).
Feedback Loop: Closes the system, updating states for continuous adaptation.
2.1. Simulation Environment
The simulation replicates a collaborative assembly station featuring a six-degree-of-freedom (6-DoF) Doosan Robotics H2017 robot [28] and a virtual human worker. The H2017 model (20 kg payload, 1700 mm reach, ±0.1 mm repeatability) was integrated into RoboDK for accurate kinematic simulation [26]. The virtual human agent was modeled with dynamic state parameters, such as fatigue and skill level, generated from synthetic sensor data, inspired by methods used to capture human states in HRC [1,2]. MATLAB R2025a orchestrates simulation logic and algorithms, interfacing with RoboDK 5.9 via the RoboDK API on a Windows 11 host PC. The Reinforcement Learning Toolbox, Deep Learning Toolbox, and Robotics System Toolbox implement learning algorithms and robot control. This setup is consistent with simulation frameworks used in recent HRC studies for assembly planning and coordination [29]. Key components of the simulated cell included:
Robotic Arm: A six-degree-of-freedom (6-DoF) Doosan Robotics H2017 collaborative robot was modeled and programmed in RoboDK to perform cyclic pick-and-place tasks. It offers a maximum 20 kg payload, 1700 mm reach, and ±0.1 mm positional repeatability. The robot’s motion profile was empirically tuned to minimise residual oscillations; for example, the shoulder-joint angle remained stabilised at ≈0.017 rad (≈1°), as confirmed by RoboDK simulation logs, indicating smooth operation.
Human Worker: A virtual human operator was represented with dynamic state parameters—fatigue, skill level, and task-completion time—generated from synthetic sensor data to mimic wearable-device readings (e.g., heart-rate, motion tracking) and updated at 1 Hz by a MATLAB script. Fatigue values were drawn from a truncated normal distribution N(0.5, 0.12) clipped to [0, 1] to simulate varying physical tiredness; this distribution is grounded in empirical data from industrial operators (e.g., heart-rate variability in Orlando et al. [1] and muscle activity in Pereira et al. [2]), where fatigue accumulates non-linearly. Skill levels (novice/intermediate/expert) align with expertise tiers in HRC studies (e.g., Zhang & Cavuoto [30]). Task-completion times were sampled from a normal distribution N(10 s, 2 s²) to introduce variability in how quickly the human finishes tasks.
Tasks: A set of assembly tasks was defined, encompassing those better suited for human dexterity (e.g., inserting delicate connectors) versus tasks ideal for robotic precision and endurance (e.g., repetitive torque-controlled bolting). Task-complexity levels were varied to cover a spectrum from simple to challenging scenarios. Definitions and parameters were informed by industrial benchmarks and studies on task allocation in HRC [10,11] to ensure realism. Each task could thus be allocated to either the human or the robot, depending on the algorithm’s decisions and the current state of the human operator.
The environment was modelled as a 5 m × 3 m workspace at an ambient temperature of 22 °C with noise below 70 dB; dynamic factors such as temperature variations are left for future work. The assembly environment includes mechanical structures such as a URDF-driven conveyor (advancing at 100 mm·s−1), task stations with fixtures for pick-and-place (e.g., tolerance <0.2 mm for connectors), and collision envelopes around robot/human meshes to simulate real industrial constraints [24,29].
2.2. Synthetic Dataset
A large synthetic dataset was created to drive and evaluate the learning framework, comprising 1000 simulated HRC episodes (task-allocation scenarios) [31]. The use of synthetic datasets in HRC simulations has been validated in studies such as [32], ensuring realistic modeling of human-robot interaction. Each episode encapsulated a sequence of tasks and the corresponding human–robot interaction data, recorded as follows:
Human-state data: Operator fatigue level, skill level, and task-completion time at each decision point (generated by the human_state_generator.m script).
Robot-performance data: Robotic task success/failure, execution times for each task, and any safety-related events (e.g., collision flags).
Task attributes: Details of each task in the sequence, including task type (human-centric, robot-centric, or collaborative) and complexity level.
The synthetic dataset was generated via a MATLAB script (generate_synthetic_dataset.m), ensuring consistency with the simulation environment, and was saved for offline analysis. The dataset, covering 1000 HRC episodes, is available on FigShare (https://doi.org/10.6084/m9.figshare.29323520) in CSV format (HRC_Simulation_Results.csv), with an optional Parquet version (HRC_Synthetic_Dataset.parquet), to support result validation.
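For illustration, the following minimal MATLAB sketch samples the human-state variables from the distributions described in Section 2.1 and writes them to a CSV log. It is a simplified stand-in for generate_synthetic_dataset.m: the number of decision points per episode, the reading of the distribution parameters as standard deviation/variance, and the column names are assumptions rather than the released implementation.

% Hedged sketch of synthetic human-state sampling (illustrative only).
rng(42, 'twister');                                    % deterministic seed, as used in the study
nEpisodes = 1000;  nSteps = 25;                        % reduced from 2500 allocations per episode for brevity
skillTiers = ["novice"; "intermediate"; "expert"];
nRows   = nEpisodes * nSteps;
episode = repelem((1:nEpisodes)', nSteps);
fatigue = min(max(0.5 + 0.12*randn(nRows, 1), 0), 1);  % truncated N(0.5, 0.12), clipped to [0, 1]
skill   = skillTiers(randi(3, nRows, 1));  skill = skill(:);   % categorical skill tier
tTask   = max(0, 10 + sqrt(2)*randn(nRows, 1));        % completion time, N(10 s, 2 s^2) read as variance
writetable(table(episode, fatigue, skill, tTask), 'HRC_Synthetic_Dataset_sketch.csv');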
2.3. Hybrid DL-RL Framework
The proposed control architecture constitutes a fully synchronous, closed-loop hybrid pipeline, in which a deep-learning perception module and a reinforcement-learning decision module exchange information at 10 Hz to guarantee sub-100 ms end-to-end latency—well within the response requirements of collaborative assembly stations. The framework, therefore, comprises two tightly coupled components—a CNN that interprets the human operator’s state in real time and a DDQN that allocates tasks on the basis of the CNN’s output. Both modules were implemented in MATLAB R2025a, utilising the Deep Learning Toolbox for the CNN and the Reinforcement Learning Toolbox for the DDQN; the modules communicate bidirectionally with the RoboDK 5.9 simulation via TCP/IP sockets for real-time execution of actions. This integrated approach builds on recent advancements in combining perception and decision-making for HRC [20,21,22]. All code is structured in object-oriented MATLAB classes (PerceptionCNN, AllocationDDQN, and RoboDKInterface) to promote reusability, modular testing, and rapid hyper-parameter tuning through configuration files.
The CNN–DDQN pairing was chosen over transformer-based models (high computational cost [33]) and a simple DQN (training instability [34]) because double Q-learning mitigates Q-value over-estimation, yielding stable training, and the perception module offers low inference latency (4.8 ms).
2.3.1. Deep Learning Component
The deep-learning module employs a three-stage convolutional neural network similar to approaches used for monitoring human states in HRC [1,2], and optimized with advanced regularization and training techniques to classify the operator’s fatigue and skill level from simulated sensor inputs. Specifically, the CNN processes 2-D feature maps derived from 128 × 128 px three-channel sensor inputs—channel 1 encodes heart-rate variability trends, channel 2 captures skeletal-motion energy maps, and channel 3 aggregates cumulative workload indices computed from simulated IMU data, aligning with external sensor reviews for human detection in collaborative environments [35]. The network architecture consists of three convolutional layers with 3 × 3 kernels (32, 64, 128 filters, respectively), each followed by batch normalisation, ReLU activation, and 2 × 2 max-pooling to ensure stable gradients and progressive spatial-dimension reduction. Spatial dropout (rate = 0.2) is applied after the second pooling layer to mitigate over-fitting. The convolutional stack feeds two fully connected layers of 256 and 128 neurons, respectively, each regularized with L2 weight decay (10−4), culminating in a softmax layer that yields a probability distribution over nine classes (3 fatigue levels × 3 skill levels). The “low,” “medium,” and “high” fatigue labels correspond to normalized fatigue values <0.25, 0.25–0.40, and ≥0.40, respectively, derived from the synthetic dataset’s fatigue distribution clipped to [0, 1] (Section 2.2).
The model was trained on an 80/20 train–test split of the synthetic dataset using the categorical cross-entropy loss, the Adam optimiser (learning rate = 0.001, β1 = 0.9, β2 = 0.999), batch size = 32, and 50 epochs. Early stopping with a patience of five epochs was employed, and the best-performing weights on the validation set were restored for inference. Training was executed on an NVIDIA RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA), reaching a throughput of ≈2100 images·s−1 and converging in ≈11 min. The trained CNN attained 92% accuracy, an F1-score of 0.90, and an ROC-AUC of 0.94 on the test set. Inference latency averaged 4.8 ms per image on the target workstation, meeting the real-time requirements of the closed-loop system.
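A compact Deep Learning Toolbox sketch of the layer stack and training options described above is given below. It is illustrative rather than the released PerceptionCNN class: the standard dropoutLayer stands in for spatial dropout, the ReLU activations between the fully connected layers are assumed, and L2 regularisation is applied globally through trainingOptions rather than per layer.

% Hedged sketch of the perception CNN (illustrative; not the released implementation).
layers = [
    imageInputLayer([128 128 3])                  % HRV, motion-energy and workload channels
    convolution2dLayer(3, 32, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 64, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    dropoutLayer(0.2)                             % stands in for the spatial dropout described above
    convolution2dLayer(3, 128, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(256)
    reluLayer                                     % ReLU between FC layers is an assumption
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(9)                        % 3 fatigue levels x 3 skill levels
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam', 'InitialLearnRate', 1e-3, 'MiniBatchSize', 32, ...
    'MaxEpochs', 50, 'L2Regularization', 1e-4, 'ValidationPatience', 5, 'Shuffle', 'every-epoch');
% net = trainNetwork(XTrain, YTrain, layers, options);   % 80/20 split prepared beforehand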
2.3.2. Reinforcement Learning Component
The reinforcement learning module is responsible for learning an optimal task allocation policy under varying human conditions, by leveraging a DDQN trained via interaction with the simulation environment, building upon similar RL approaches in HRC [20,21,36]. The agent determines, at each decision step, whether the next task should be assigned to the human operator, the robot, or executed collaboratively. The design of the agent incorporates a carefully constructed state representation, discrete action space, and a domain-specific reward function tailored to human–robot collaboration.
State space: At each timestep, the environment is represented by a vector combining three elements:
Human state, as classified by the CNN (i.e., a categorical encoding of fatigue level and skill level).
Task queue status, including the attributes of up to 10 pending tasks (task type and complexity), encoded as one-hot vectors concatenated into a fixed-length representation.
Robot status, represented as a binary indicator (0 for idle, 1 for busy).
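To make the encoding concrete, the sketch below assembles one such state vector; the one-hot sizes (nine human-state classes, three task types, three complexity levels) and the zero-padding of empty queue slots are assumptions consistent with the description above, not the released code.

% Hedged sketch of assembling the DDQN input state (encoding sizes are assumptions).
humanClass = 5;                                   % CNN output class, 1..9 (3 fatigue x 3 skill)
humanOneHot = zeros(1, 9);  humanOneHot(humanClass) = 1;
maxQueue = 10;  nTypes = 3;  nLevels = 3;         % task type and complexity per queue slot
queue = zeros(1, maxQueue * (nTypes + nLevels));  % fixed length, zero-padded for empty slots
pending = [1 2; 3 1];                             % e.g., two pending tasks: [type, complexity]
for i = 1:size(pending, 1)
    base = (i - 1) * (nTypes + nLevels);
    queue(base + pending(i, 1)) = 1;              % one-hot task type
    queue(base + nTypes + pending(i, 2)) = 1;     % one-hot complexity level
end
robotBusy = 0;                                    % 0 = idle, 1 = busy
state = [humanOneHot, queue, robotBusy];          % vector passed to the DDQN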
Action space: The agent selects one of three discrete actions:
Assign the task to the human worker.
Assign the task to the robotic arm.
Assign the task to be completed collaboratively (by both agents).
Reward function: The agent receives a scalar reward R at each decision point, defined by R = 0.5 × Throughput − 0.3 × Workload + 0.2 × Safety, where:
Throughput denotes the number of tasks completed per minute, normalised to a fixed scale.
Workload quantifies the cumulative exertion of the human operator, estimated from their fatigue state after task execution (a higher workload reduces the reward).
Safety is a binary indicator representing the absence (1) or presence (0) of a safety-critical event, such as a collision.
The weighting coefficients w1 = 0.5, w2 = 0.3, and w3 = 0.2 were chosen empirically through sensitivity testing to prioritise production efficiency, while mitigating excessive human fatigue and ensuring task execution remains within safe operational limits. The design of such reward functions is critical in HRC to balance productivity and human well-being, as discussed in [37]. The detailed calculations for Throughput, Workload, and Safety metrics, including physiological signal fusion and regression coefficients, are provided in Appendix A.
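As a minimal numerical illustration of this reward, the sketch below evaluates R for one decision point; the throughput normalisation scale is an assumption, and the full metric definitions remain those of Appendix A.

% Hedged sketch of the scalar reward at one decision point (illustrative values).
w1 = 0.5;  w2 = 0.3;  w3 = 0.2;                   % weighting coefficients from Section 2.3.2
tasksPerMin = 6.2;  maxRate = 10;                 % maxRate is an assumed normalisation scale
throughput  = min(tasksPerMin / maxRate, 1);      % normalised to [0, 1]
workload    = 0.42;                               % cumulative operator fatigue after the task
safety      = 1;                                  % 1 = no safety-critical event, 0 = collision
R = w1*throughput - w2*workload + w3*safety;      % R = 0.5*Throughput - 0.3*Workload + 0.2*Safety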
Network architecture: The DDQN is implemented as a fully connected MultiLayer Perceptron (MLP) comprising:
Three hidden layers with 128, 64, and 32 neurons, respectively, each followed by ReLU activation functions to enable non-linear approximation.
A final output layer that returns Q-values for each of the three possible task assignment actions.
Training protocol: The DDQN was trained across 1000 simulated episodes, each containing a varying number of task allocations depending on simulation dynamics. The training employed:
An ε-greedy exploration policy, with initial ε0 = 0.1, decaying by a factor of 0.995 per episode to promote exploration at early stages and exploitation later in learning.
Experience replay, implemented via a buffer of 10,000 transition tuples (s, a, r, s′) sampled in mini-batches of 64 to decorrelate consecutive observations and stabilise learning.
A target network, updated every 100 steps, to compute stable target Q-values, thereby reducing oscillations and divergence.
A discount factor γ = 0.99, to prioritise long-term rewards, and a learning rate α = 0.001, adjusted via grid search during pilot experiments to ensure convergence across multiple random seeds.
The complete training procedure was executed in MATLAB using the Reinforcement Learning Toolbox, with performance monitored through loss convergence, reward trends, and Q-value stabilisation. All hyperparameters were tuned during preliminary experimentation to ensure policy convergence and consistency across simulation runs.
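To make the double-estimation mechanism explicit, the plain-MATLAB sketch below computes the DDQN learning targets for a mini-batch; the study itself relies on the Reinforcement Learning Toolbox agent rather than this hand-rolled function.

% Hedged sketch of the double-DQN target computation (illustrative only).
% Example call: y = ddqnTargets([0.8; 0.4], rand(2,3), rand(2,3), [0; 1], 0.99);
function y = ddqnTargets(r, qOnlineNext, qTargetNext, done, gamma)
    % qOnlineNext / qTargetNext: [batch x 3] Q-values of s' from the online and target networks.
    [~, aStar] = max(qOnlineNext, [], 2);                      % online network selects the greedy action ...
    idx = sub2ind(size(qTargetNext), (1:numel(r))', aStar);    % ... and the target network evaluates it,
    y = r + gamma .* (1 - done) .* qTargetNext(idx);           % curbing Q-value over-estimation
end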
2.3.3. Integration
The deep-learning and reinforcement-learning components were integrated into a rigorously time-deterministic, closed-loop control stack operating at Δt = 100 ms (10 Hz), ensuring real-time responsiveness essential for collaborative stations, as highlighted by [38]. At every control cycle:
Perception step—The CNN ingests the newest 128 × 128 × 3 sensor frame, executes inference in ≈4.8 ms on an RTX 4090, and updates the operator’s fatigue–skill class.
Decision step—The DDQN receives a state vector 〈human-state, task-queue, robot-status〉 and produces Q-values; the greedy action is selected after ε-greedy exploration noise is applied. Inference latency averages 2.1 ms.
Execution step—MATLAB transmits the chosen action over a persistent TCP/IP socket (≤0.3 ms latency) to RoboDK’s runtime, where a high-level motion script dispatches either:
robot_execute (taskID) → parametrised pick-and-place path on the Doosan H2017;
human_simulate (taskID, t_est) → updates human-state buffer and visual avatar timeline;
collab_sequence (taskID) → synchronised dual-agent macro including hand-over checkpoints. A manual-override mode in the GUI lets the operator bypass the DDQN whenever safety or an emergency requires it.
Feedback step—Upon task completion, RoboDK emits a JSON packet 〈taskID, successFlag, execTime, safetyFlag〉 back to MATLAB. A MATLAB script simultaneously increments cumulative fatigue by ΔF = k · execTime (k = 5 × 10−3 units·s−1). These feedback tuples populate both the experience-replay buffer and a dedicated Parquet log for offline analytics.
A master scheduler (closedLoopController.m) orchestrates perception → decision → actuation with a hard real-time deadline of 90 ms, leaving a 10 ms guard band for OS jitter. Synchronisation is managed via MATLAB’s parallel.pool.Constant objects, ensuring thread-safe access to the DDQN target network and RoboDK API handles. Five hundred fully independent simulation runs (random seeds 1–500) were conducted; each run processed 2500 task-allocation cycles, producing >1.2 million state–action–reward samples for subsequent statistical evaluation.
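The timing skeleton of a single control cycle can be sketched as follows; the anonymous functions are hypothetical stand-ins for the PerceptionCNN, AllocationDDQN and RoboDKInterface classes and are not the released code.

% Hedged sketch of the 10 Hz perception-decision-execution-feedback cycle.
classifyFrame = @(frame) randi(9);                            % stand-in for PerceptionCNN: class 1..9
selectAction  = @(state) randi(3);                            % stand-in for AllocationDDQN: 1 human, 2 robot, 3 collaborative
executeAction = @(a) struct('execTime', 8 + 4*rand, 'safetyFlag', 1);   % stand-in for RoboDKInterface
fatigue = 0.5;  k = 5e-3;  dt = 0.10;                         % DeltaF gain and 100 ms control period
for cycle = 1:50                                              % 5 s of simulated operation
    tCycle = tic;
    frame  = rand(128, 128, 3);                               % placeholder sensor frame
    state  = [classifyFrame(frame), zeros(1, 10), 0];         % human state + task queue + robot status
    fb     = executeAction(selectAction(state));              % execution step
    fatigue = min(1, fatigue + k * fb.execTime);              % feedback step: DeltaF = k * execTime
    pause(max(0, dt - toc(tCycle)));                          % guard band keeps the loop at 10 Hz
end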
2.4. Integration of MATLAB and RoboDK for Simulation and Visualization
To streamline experimentation and generate publication-quality visuals, MATLAB and RoboDK were linked through a bidirectional, event-driven architecture similar to the integration method presented in [
39], that spans preparation, execution, monitoring and post-processing phases:
Environment setup in RoboDK. A parametric cell template scripted in RoboDK 5.9 automatically loads (i) a STEP model of the Doosan H2017 [
28], (ii) a URDF-compatible conveyor, (iii) a rigged human avatar with 22 DoF, and (iv) five task stations corresponding to Interface, Sorting, Picking, Replenishing and Transport actions. Before each run, workspace envelopes and joint limits are verified with Doosan’s Safety Configuration Wizard to pre-empt out-of-range motions.
MATLAB ↔ RoboDK bridge. A MATLAB interface script opens a persistent socket on port 20500. Commands follow a lightweight text protocol—CMD, <timestamp>, <payload>—and replies include execution metrics for subsequent logging. Network benchmarking with 1000 pings showed an average round-trip latency of 0.28 ± 0.04 ms.
Online state mirroring. The human avatar’s fatigue drives a real-time semaphore displayed above the helmet: green (≤0.25), orange (0.25–0.40) or red (≥0.40). When fatigue reaches orange or red, the controller throttles robot speed to 80% or 60%, respectively, via RoboDK’s GUI API, ensuring situational safety (a minimal sketch of this rule is given at the end of this subsection).
Proof-of-concept figures. Five simulation instances (t1–t5 = 15 s, 45 s, 90 s, 135 s, 180 s) are captured automatically using a MATLAB visualization script. MATLAB superimposes blue arrows (CNN inference flow) and yellow arrows (DDQN decision flow) through RoboDK’s annotation layer, exports each scene at 3000 × 1600 px and 600 dpi, and compiles the sequence as Figure 2.
Closed-loop telemetry. A publish/subscribe bus built with MATLAB’s parallel.pool.PollableDataQueue streams CNN logits, chosen actions, and safety flags, which are plotted to visualize throughput, cumulative fatigue, and safety incidents in real time—facilitating rapid regression testing between runs.
This tightly coupled integration keeps perception, decision, actuation and visual feedback phase-locked, allowing the virtual human–robot team to adapt continuously while providing rich evidence of the framework’s behaviour.
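A minimal sketch of the fatigue semaphore and speed-throttling rule from the state-mirroring step above is given below; the actual RoboDK GUI and speed-override calls are omitted, and only the threshold logic is shown.

% Hedged sketch of the fatigue semaphore thresholds (illustrative helper).
function [colour, speedFactor] = fatigueSemaphore(fatigue)
    if fatigue <= 0.25
        colour = "green";   speedFactor = 1.00;   % full robot speed
    elseif fatigue < 0.40
        colour = "orange";  speedFactor = 0.80;   % throttle robot speed to 80%
    else
        colour = "red";     speedFactor = 0.60;   % throttle robot speed to 60%
    end
end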
2.5. Benchmarking (Baseline Comparisons)
The hybrid CNN+DDQN framework was benchmarked against five baselines under identical simulation seeds, task queues and random initialisations:
Rule-Based Allocation. A non-learning proxy for common shop-floor heuristics [37]. Tasks are dispatched according to fixed priorities:
Complexity rule—tasks requiring fine dexterity (tolerance < 0.2 mm) go to the human.
Repetition rule—repetitive/force-intensive tasks are delegated to the robot.
Time-out override—if the human queue stalls >15 s, the next task is reassigned to the robot. This static policy ignores real-time fatigue.
SARSA RL. On-policy State–Action–Reward–State–Action algorithm with the same reward in Equation (1), 1000-episode training, α = 0.002, γ = 0.95, λ = 0.7, ε-greedy (ε0 = 0.1, decay = 0.995). No experience replay, mirroring classic on-policy formulation. Input state omits CNN-derived fatigue/skill signals.
Dueling DQN. Value-advantage architecture [40] with identical hyperparameters to DDQN but single-estimation targets. Highlights how advantage decoupling alone compares to double estimation.
PPO. Proximal Policy Optimisation [34] with actor-critic loss clipping (ϵ = 0.2) and GAE (λ = 0.95). Trained for 3 × 10⁵ steps; baseline chosen for its strong performance in robotics.
A3C. Asynchronous Advantage Actor–Critic with 8 worker threads, mirroring settings in the comparative study of Huang et al. [34].
Experimental protocol. CNN+DDQN, Dueling DQN, PPO, A3C, rule-based and SARSA agents were each executed over the same 1000-episode evaluation set (2500 task allocations·episode−1), yielding >2.5 million allocation decisions per method. The following metrics were logged:
Throughput—tasks·min−1, averaged per episode and normalised to [0, 1] for statistical analyses.
Human workload—cumulative fatigue units accrued by the operator across the episode.
Safety—proportion of tasks completed without any collision flag (binary indicator). These metrics were computed as formalized in Appendix A, ensuring traceability to physiological signals and simulation parameters.
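For traceability, the sketch below extracts the three logged metrics from a per-episode task table; the column names and the throughput normalisation scale are assumptions, with the formal definitions given in Appendix A.

% Hedged sketch of per-episode metric extraction (illustrative column names).
E = table([1;2;3;4], [12.1;9.8;11.3;10.4], [0;0;0;0], [0.06;0.05;0.07;0.06], ...
          'VariableNames', {'taskID', 'execTime', 'collisionFlag', 'fatigueDelta'});
episodeMinutes = sum(E.execTime) / 60;
throughput = height(E) / episodeMinutes;          % tasks per minute
workload   = sum(E.fatigueDelta);                 % cumulative fatigue units
safety     = mean(E.collisionFlag == 0);          % fraction of collision-free tasks
throughputNorm = min(throughput / 10, 1);         % normalisation scale is an assumption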
DDQN outperformed Dueling DQN by 8% in throughput, PPO by 5% in safety and A3C by 3% in safety (one-way ANOVA, p < 0.01). The advantage stems from double Q-learning, which mitigates over-estimation and yields stabler policies under stochastic human states, whereas PPO/A3C may converge faster only in near-deterministic settings [34].
Power analysis. Post-hoc power analysis (MATLAB sampsizepwr) showed 1 − β = 0.87 (α = 0.05) for the main effects, confirming adequate statistical power.
2.6. Statistical Analysis
Performance data were analysed in MATLAB R2025a using functions from the Statistics and Machine Learning Toolbox. For each metric, the following pipeline was applied:
Assumption checks. Normality was tested with the Shapiro–Wilk test (swtest.m), and homogeneity of variances with Levene’s test (vartestn, option ‘levene’). No violations were detected at α = 0.05.
One-way ANOVA. A fixed-effects ANOVA (anova1) compared mean throughput, workload and safety across the six strategies (Hybrid, Rule-Based, SARSA, Dueling DQN, PPO, A3C). Significance threshold α = 0.05.
Post-hoc contrasts. Where ANOVA yielded p < 0.05, Tukey’s Honest Significant Difference (HSD) test (multcompare with ‘CType’,‘hsd’) isolated pairwise differences. Adjusted p-values and 95% confidence intervals are reported.
Effect sizes. Partial η2 values were computed (mes, ‘eta2’) to quantify the magnitude of strategy effects, with thresholds 0.01 = small, 0.06 = medium, 0.14 = large.
All analyses were reproducible under deterministic random seeds (set via rng(42,’twister’)) and executed on the same workstation used for simulation to avoid platform-induced variability.
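The per-metric pipeline above can be reproduced with the Statistics and Machine Learning Toolbox as sketched below; the grouped values are random placeholders rather than the study’s data.

% Hedged sketch of the ANOVA + Tukey HSD pipeline (placeholder data).
rng(42, 'twister');                                            % deterministic seed, as in the study
groups = {'Hybrid', 'RuleBased', 'SARSA', 'DuelingDQN', 'PPO', 'A3C'};
nEp = 1000;  y = [];  g = {};
for i = 1:numel(groups)                                        % placeholder per-strategy samples
    y = [y; 0.5 + 0.05*i + 0.02*randn(nEp, 1)];                %#ok<AGROW>
    g = [g; repmat(groups(i), nEp, 1)];                        %#ok<AGROW>
end
[p, ~, stats] = anova1(y, g, 'off');                           % one-way fixed-effects ANOVA
if p < 0.05
    c = multcompare(stats, 'CType', 'hsd', 'Display', 'off');  % Tukey HSD pairwise contrasts
end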
2.7. Data and Code Availability
Consistent with MDPI Systems’ open-science guidelines, the dataset and selected scripts are available to support result validation:
Synthetic dataset—1000 HRC episodes in CSV format (HRC_Simulation_Results.csv), with an optional Parquet version (HRC_Synthetic_Dataset.parquet).
Source code—MATLAB scripts (hrc_simulation.m, generate_synthetic_dataset.m, Table_2.m, Table_3.m, Figure_2.m to Figure_6.m) for simulation and visualization.
5. Discussion
The experimental results robustly validate Hypotheses H1–H3: my hybrid CNN–DDQN controller outperforms rule-based, SARSA, Dueling DQN, PPO and A3C baselines while remaining scalable to heavy-duty cobots. Over 1000 episodes, the framework lifted throughput from 49.92 to 60.48 tasks·min−1 (+21%), held safety at 99.90%, and cut workload by 7%—gains that equal or surpass recent AI-enhanced HRC studies [16,37,52]. These gains were achieved without sacrificing fairness, as Table 3 shows balanced allocations across fatigue–skill states.
The key differentiator is the tight 10 ms perception ↔ 90 ms decision loop: a CNN infers fatigue-skill states in <10 ms, and a double-estimating DDQN refines task allocations inside a 100 ms control cycle. This latency envelope ensures ISO 10218 compliance for collaborative stations, demonstrating industrial feasibility. The real-time fusion overcomes the rigidity of earlier “perception-or-control” pipelines [1], confirming Hypothesis 2 on adaptability. Comparable integrated schemes include Franceschi et al. [53] and Zhong et al. [54]; however, neither incorporates a workload-aware reward function nor validates against five competitive baselines, as I do here.
It is acknowledged that the current validation is strictly quantitative; a subjective validation using NASA-TLX, SUS, and Hoffman fluency scores is planned in future trials to substantiate the framework’s human-centric value.
In transitioning to physical HRC deployments, future efforts must address sensor noise, latency in robot actuation, and domain shifts between simulation and real-world conditions. Techniques such as domain adaptation, transfer learning, and lightweight models for edge deployment will be explored to mitigate these challenges. Moreover, multi-agent scalability could benefit from hierarchical or federated reinforcement learning frameworks, as discussed in [55].
5.1. Comparative Insights
Why does DDQN beat PPO/A3C? DDQN’s twin-network update mitigates Q-value over-estimation, yielding stable policies under stochastic human states. Huang et al. [34] showed that PPO’s critic inflates Q-values when reward variance is high; in my experiment, DDQN still outperformed PPO by +5% throughput and +4.8% safety, and surpassed A3C by +3% safety. Dueling DQN narrowed the gap (–8% throughput vs. DDQN) but remained vulnerable to noisy rewards because it lacks double estimation.
Transformer-based models were excluded due to higher inference latencies that violate the 100-ms control loop; DQN was also avoided due to unstable Q-value estimation under high-reward variance, which DDQN corrects via twin-network updates.
Vision-Only vs. RL-Only. Vision-only policies (Alenjareghi et al. [44]) achieve 95% safety but negligible throughput gains; RL-only schemes (Huang et al. [34]) raise throughput yet over-exert operators. My CNN+DDQN harmonises all three KPIs (+21% throughput, −7% workload, 99.90% safety), offering the ergonomic synergy highlighted by Keshvarparast et al. [56].
Hardware Realism. Deploying a Doosan H2017 (20 kg payload, 1700 mm reach) demonstrates scalability beyond the ≤5 kg robots common in prior research [50], supporting Hypothesis 3.
Unexpected Performance Dips. Under medium-fatigue, novice conditions, throughput rose only 2%. Replay-buffer inspection revealed hesitation between collaborative and robot-only actions; a variance-weighted reward penalty is planned to stabilise selections, following Itadera & Domae [57].
Furthermore, my gains in throughput (21%) and safety (99.90%) align with fluency metrics in HRC, where reduced functional delay and idle time correlate with better team performance [23]. Unlike RL-only baselines that may inflate throughput at the cost of over-exertion, the CNN+DDQN integration minimises workload while sustaining synchronisation, echoing Hoffman’s findings on perceived collaboration quality [23].
5.2. Limitations
Simulation Fidelity. Results rely on a high-fidelity digital twin; physical noise, lighting and biomechanics remain unmodelled. Sensor noise, latency in robot actuation, and domain shifts between simulation and real-world conditions must also be accounted for. Mitigation strategies such as domain adaptation, transfer learning and lightweight models suitable for edge deployment will be explored in future work.
Single-Agent Scope. One human–one robot; multi-actor setups will need decentralised or hierarchical policies [52,58]. I therefore propose federated hierarchical RL [55] for scalable coordination.
Latency Envelope. A 10 Hz loop met simulation needs; embedded platforms may need edge inference and model pruning [59].
Synthetic Perception. Fatigue is vision-based; wearables (EMG, eye-tracking) [60,61] could enhance accuracy.
Generalisation. Training used a single task taxonomy; meta-RL and workload-fairness constraints [54] will guard against bias across diverse products or workers.
Lack of Fluency Metrics. Objective KPIs omit subjective fluency; a future user study will collect NASA-TLX, SUS, and Hoffman’s fluency scale with approximately 12 volunteer operators, ensuring fairness constraints prevent operator bias.
Ablation Analysis. No ablation study was performed to isolate the individual contributions of the CNN and DDQN modules; future work will incorporate such analyses to quantify their respective impacts on safety, throughput, and workload.
5.3. Future Research Directions
Live trials. A pilot with the physical H2017 and volunteer operators will test sim-to-real transfer, sensor fusion [43] and edge deployment [59].
Multi-objective RL. Adding torque, energy and takt-time terms [33] will create holistic policies.
Federated fleet learning. Sharing anonymised gradients across cells will enable continuous learning [55].
Explainability. Integrating SHAP or attention maps will audit feature reliance [62,63].
Industry 5.0 alignment. Embedding worker preferences and carbon-aware scheduling [25,64] will ensure sustainability.
Qualitative evaluations. Future trials will combine fluency and fairness questionnaires [23,54] to validate human-centric claims.
Task Complexity and Transfer Learning. To accommodate more diverse workflows, future work will apply transfer learning techniques that pre-train the CNN–DDQN pipeline on synthetic tasks and fine-tune on real multi-step assemblies, facilitating generalisation across complexity tiers.
5.4. Additional Discussion
The workspace (5 m × 3 m, 22 °C, <70 dB) omits dynamic environmental fluctuations that influence fatigue [2]. The current environment model does not yet capture auxiliary mechanical elements or fluctuating physical conditions such as temperature and acoustic noise, which can impact fatigue and task accuracy in real deployments. Future extensions will enrich the simulation with environmental stochasticity and structural complexity to bridge this realism gap. Strengths of the study include the realistic digital twin and the open dataset and code; weaknesses include limited task diversity, a simplified operator model and the absence of physical trials. Future work will add real experiments, a manual-override GUI and transfer learning (pre-training on synthetic tasks, then fine-tuning on real data [65]).
Additionally, physical deployment will require robustness against domain transfer gaps, sensor imprecision, and robot latency. Techniques such as domain adaptation, sensor fusion, and lightweight inference models for embedded platforms are planned. Human states follow truncated normal distributions (as specified in Appendix A, Table A5) and categorical skill tiers (Appendix A, Table A6); the absence of completion-time improvement stems from the multi-objective reward prioritisation: safety (weight 0.2) and workload (weight 0.3) are deliberately favoured over speed (weight 0.1). This conservative tuning avoids unsafe shortcuts and overexertion, ensuring ergonomic safety over maximal throughput—a deliberate trade-off aligned with Industry 5.0 principles.
Additionally, a fairness constraint will be introduced to prevent allocation bias among multiple operators.
The system can scale to more complex scenarios via transfer learning, while manual-override safeguards preserve human authority in critical situations: a graphical override interface will allow operators to veto or adjust system decisions in real time, ensuring adaptability in unpredictable or safety-critical scenarios.
In summary, this study establishes a reproducible benchmark showing that integrated deep perception and double-estimating decision-making delivers human-aware, safe and productive HRC, paving the way for multi-robot Industry 5.0 cells.
6. Conclusions
This study has introduced and rigorously evaluated a simulation-driven hybrid CNN+DDQN framework that performs truly adaptive task allocation in human-robot collaboration (HRC). By fusing millisecond-level human-state perception with double-estimating reinforcement learning, the controller continuously reshapes work distribution in response to fluctuating fatigue and skill—thereby operationalising the worker-centric ideals of Industry 5.0.
Key quantitative achievements across 1000 validation episodes include:
Throughput: 60.48 ± 0.08 tasks·min−1 (+21% vs. rule-based, +12% vs. SARSA, +9% vs. A3C, +8% vs. Dueling DQN, +5% vs. PPO).
Workload: 4.25 ± 0.10 fatigue units (–7% vs. rule-based) while retaining higher safety than all RL baselines that posted lower workloads.
Safety: 99.90 ± 0.10% collision-free execution, a margin ≥ 1 pp better than every comparator.
Fatigue regulation: avoidance of over-utilisation as seen in PPO/A3C, due to a 0.3 fatigue penalty that balances efficiency and well-being (see Table 3).
One-way ANOVA (p < 0.001) followed by Tukey-HSD (α = 0.05) confirmed that these improvements are statistically significant; partial η2 values between 0.29 and 0.46 denote large practical effects, thereby validating Hypotheses H1–H3.
Industrial and societal relevance. Implementing the controller inside a RoboDK digital twin of a Doosan H2017 (20 kg payload, 1700 mm reach) proves that the algorithm scales to heavy-duty cobots while satisfying ISO/TS 15066 safety envelopes. The modular, object-oriented MATLAB codebase—communicating with RoboDK over TCP/IP—can be repurposed for multi-robot, multi-operator lines with minimal engineering effort. From an energy-and-well-being standpoint, the 7% drop in operator workload implies fewer micro-pauses and reduced ergonomic strain, while automated fatigue-based throttling cuts robot power draw by up to 40% during red-semaphore intervals, aligning with the sustainability pillar of Industry 5.0.
To enable deployment in real-world industrial settings, future research will address the challenges of simulation-to-reality transfer. This includes compensating for sensor noise, robot actuation latency, and domain gaps through domain adaptation, transfer learning, and sensor fusion. The inference pipeline will be redesigned for embedded execution by applying structured pruning to the CNN and quantisation of the DDQN. These techniques aim to maintain performance while enabling edge deployment on platforms like NVIDIA Jetson, thereby reducing infrastructure cost and power consumption.
Limitations. Despite the high-fidelity twin, simulation cannot capture lighting noise, real biomechanics, or human unpredictability. Vision-only fatigue proxies lack the granularity of multi-sensor wearables; the task taxonomy is limited to five assembly types; and the single-operator scenario does not expose fairness tensions in team settings.
Road-Mapped Future Work:
Hardware pilot: deploy the pipeline on a physical H2017 equipped with eye-tracking and EMG to verify latency (<100 ms), safety and ergonomic benefits in a furniture-manufacturing use-case.
Edge optimisation: compress the CNN (structured pruning) and quantise the DDQN to INT8, enabling inference <25 ms on NVIDIA Jetson AGX Orin.
Hierarchical and federated RL: extend to multi-agent cells where local managers learn sub-policies and share gradients via federated averaging, preserving operator privacy.
Multi-objective rewards: embed energy, carbon footprint and stated worker preferences so that the controller co-optimises productivity, sustainability and satisfaction.
Explainability and fairness: integrate SHAP heat-maps to expose which physiological cues drive decisions and enforce Jain-fairness constraints to avoid workload bias across heterogeneous operators.
Subjective fluency studies: run controlled user trials capturing NASA-TLX and Hoffman fluency scores to correlate objective KPIs with perceived collaboration quality.
Final statement:
In sum, this work delivers a reproducible benchmark and an open-source tool-chain demonstrating that the marriage of deep perception and double-estimating decision-making can yield human-aware, safe and energy-efficient HRC. The evidence presented here charts a concrete pathway from high-fidelity digital twins to real-world smart factories, where multiple robots collaborate with empowered, less-fatigued workers—fulfilling the promise of Industry 5.0.