1. Introduction
Human-Robot Collaboration (HRC) in Industry 5.0 faces challenges like variable human states (fatigue, skills), leading to inefficient task allocation. I therefore construct synthetic datasets reproducing heart-rate, EMG, blink-rate and posture metrics [1,2]. My solution integrates a Convolutional Neural Network (CNN) for state perception and a Double Deep Q-Network (DDQN) for allocation, fused in a closed-loop pipeline. HRC is central to Industry 4.0 smart factories, enabling flexible, high-throughput manufacturing by combining human cognition with robotic precision [3,4]. In HRC, humans and robots share workspaces and collaborate on tasks, leveraging their complementary strengths to achieve superior performance in complex manufacturing processes such as assembly, quality control, and maintenance [5,6]. Industry 4.0 focuses on efficiency-centred automation [3,4], while Industry 5.0 prioritises human well-being and sustainability [7,8,9], with this framework emphasising the latter through fatigue-aware allocation. As the industry transitions toward Industry 5.0, which emphasises human-centric approaches and sustainable manufacturing practices, HRC is increasingly vital for empowering workers rather than replacing them [7,8,9]. However, Dynamic Task Allocation (DTA)—the real-time assignment of tasks to human or robot agents based on factors such as operator fatigue, skill level, and robot availability—remains a significant challenge; traditional rule-based schemes and early Machine-Learning (ML) methods cannot adapt adequately to changing human conditions, leading to inefficiencies in HRC [10,11]. Fatigue-induced errors in assembly [11] motivate the adaptive allocation proposed here. Real-life pain points in Industry 5.0 HRC include technological integration challenges, ethical AI concerns, regulatory barriers, and cybersecurity risks [8], as well as balancing technical and human skills [5] and ensuring cost-effective and accessible applications.
Recent advances in Artificial Intelligence (AI) offer promising solutions to these challenges. Deep Learning (DL) excels at inferring human states, such as fatigue and expertise, from multimodal sensor data [1,2], building on recent works in multimodal HRI for enhanced environmental awareness and response capabilities [12]. For instance, DL has been applied to monitor workers’ psychological and physiological states during HRC [1] and to capture mental workload through physiological sensors [2]. Meanwhile, Reinforcement Learning (RL) provides principled strategies for sequential decision-making, making it well-suited for optimising task allocation in dynamic environments [13,14], extending recent innovations in multi-agent RL for heterogeneous multi-robot systems [15]. RL has been utilised for learning optimal policies for human-robot coordination [13] and safe motion planning in industrial settings [14]. However, standalone DL or RL solutions struggle to balance human variability with system constraints. Although hybrid DL-RL frameworks have shown promise in broader robotics tasks, their deployment for adaptive HRC task assignment is still under-explored [16,17]. Progress in international research includes early DL for state monitoring [1,2] and RL for coordination [14,16], but gaps persist in integrating perception and decision-making for human-centric adaptability [16,17]. This view is reinforced by recent reviews such as Lorenzini et al. [18], who synthesise ergonomic factors in collaborative robotics for industrial use, and Maruyama et al. [19], who propose a digital twin approach enhanced by digital humans for workload analysis in HRC scenarios. This motivates the proposed hybrid framework, which addresses these gaps by fusing CNN perception and DDQN decision-making in a closed-loop pipeline. Unlike existing hybrid DL-RL methods such as Ma et al. [20]—which focus on reactive planning without fatigue integration—Peng et al. [21]—which target disassembly without skill-level perception—and Lee et al. [22]—which employ Bayesian RL without DDQN stability—my framework uniquely combines CNN-based real-time human-state classification (fatigue + skill) with DDQN multi-objective optimisation (throughput, workload, safety), yielding closed-loop adaptability in Industry 5.0 simulations.
A key construct in effective HRC is collaborative fluency, defined as “the coordinated meshing of joint activities between members of a well-synchronised team” [23]. Hoffman [23] proposes objective metrics (e.g., human/robot idle time, concurrent activity, functional delay) and subjective scales (e.g., perceived team commitment and fluency) evaluated in assembly tasks with human subjects. These metrics highlight the need for adaptive systems that minimise delays and enhance synchronisation, yet current hybrid DL-RL approaches [16,17] often overlook such fluency in dynamic task allocation under variable human states. My framework tackles this oversight by explicitly fusing perception and decision-making to enhance fluency, as demonstrated in Section 4.
Given the complexity and safety considerations in HRC, simulation-based approaches are critical for developing and validating adaptive systems before real-world deployment. Digital-twin technologies, which create virtual replicas of physical systems, enable realistic simulations of HRC scenarios [24,25]. These digital twins can be enhanced with deep learning to improve safety and reliability [25] and used for safe motion planning through co-evolution approaches [24]. Platforms like RoboDK facilitate high-fidelity simulations of human-robot interactions, allowing researchers to test algorithms in controlled environments [26]. In this work, I present a simulation-based hybrid DL–RL architecture that integrates a CNN for real-time human-state recognition with a DDQN for dynamic, context-aware task allocation. This approach enables continuous monitoring of human conditions and adaptive adjustment of task assignments, enhancing both efficiency and worker well-being in smart-factory settings [20,21]. Implemented in MATLAB R2025a and RoboDK v5.9 using a synthetic dataset of 1000 episodes, the framework achieves a 21% throughput boost, a 7% fatigue reduction and maintains a 99.9% collision-free rate (one-way ANOVA + Tukey HSD, p < 0.05). A set of graphical proof-of-concept figures demonstrates online task reallocation as simulated fatigue and skill metrics drift during a production cycle.
Unlike Ma et al. [20] (reactive planning without fatigue integration), Peng et al. [21] (disassembly-oriented with no skill-level perception), and Lee et al. [22] (Bayesian RL without double-estimation), my framework integrates real-time human-state perception with double-estimating DDQN allocation, reducing Q-value overestimation and improving safety by 1 pp.
This manuscript reviews the literature chronologically to clarify progress and gaps on HRC challenges and articulates three principal contributions:
Design and implementation of a high-fidelity DL-RL pipeline with closed-loop CNN–DDQN integration.
Quantitative evaluation across 1000 runs showing gains over SOTA baselines such as PPO and constraint programming [8].
Visual demonstrations that highlight the framework’s responsiveness and the innovative use of digital twins for Industry 5.0 adaptability.
The remainder of this paper is organised as follows. Section 2 details Materials and Methods; Section 3 describes the experimental setup and simulation flow; Section 4 presents quantitative results; Section 5 discusses findings and limitations; Section 6 concludes.
1.1. Background and Significance
Human–Robot Collaboration fosters flexible automation by integrating the complementary strengths of humans and robots, enabling efficient handling of complex tasks in smart manufacturing [3,5]. DTA is essential for maintaining efficiency and safety by assigning tasks in real time based on operator fatigue, skill level and robot status. Traditional rule-based schemes deteriorate when human states fluctuate, and single-model ML approaches lack the agility demanded by modern production lines [10,11]. Current debates centre on whether lightweight heuristics, such as those used in assembly-line balancing [10], suffice, or whether adaptive, data-driven frameworks, like learning-based coordination [13], are required to meet Industry 4.0 targets. With the advent of Industry 5.0, the discourse shifts toward human-centric design and sustainability, introducing challenges such as technological integration, ethical bias, regulation and cybersecurity in HRC [8].
1.2. Research Gap and Objectives
To overcome the limitations of static heuristics and single-agent RL, this paper proposes a hybrid DL–RL framework that:
Infers human states online via a CNN trained on multimodal synthetic data.
Allocates tasks through a DDQN that jointly optimises throughput, workload and collision avoidance.
Unlike previous studies, the approach fuses perception and decision-making in a closed loop, yielding a scalable blueprint for smart-factory deployments. Graphical proof-of-concept figures (Section 3) illustrate reallocations for a single operator under varying fatigue and skill conditions.
Objectives:
Design and implement the CNN+DDQN architecture for adaptive HRC task allocation.
Benchmark its performance against rule-based and SARSA baselines in 1000 high-fidelity episodes.
Visualise adaptability through real-time task-reassignment plots.
Hypotheses:
H1: The hybrid DL-RL framework surpasses rule-based and single-model RL baselines in allocation efficiency.
H2: Online human-state recognition improves adaptability relative to perception-free methods.
H3: The framework scales toward multi-robot Industry 5.0 cells.
2. Materials and Methods
The proposed hybrid DL–RL framework was evaluated entirely in a high-fidelity simulation of a collaborative industrial assembly line, constructed to ensure controlled, repeatable experimentation and rigorous evaluation of adaptive task allocation in HRC. This simulation environment was developed using MATLAB R2025a (MathWorks, Natick, MA, USA) in conjunction with RoboDK 5.9 (RoboDK Inc., Montreal, QC, Canada) [26], providing a realistic yet safe platform for testing the integrated algorithms. A simulation-based approach was selected to alleviate real-world resource constraints, enabling safe and repeatable trials under representative conditions, aligning with recent trends in using digital twins for HRC research [24,27]. The dataset and selected source code are available on FigShare (https://doi.org/10.6084/m9.figshare.29323520) and GitHub (https://github.com/ClaudioUrrea/doosan) to support validation of the results, in line with MDPI’s open-science guidelines.
The research contributions and proposed methods are summarised as follows: the design of a simulation-based hybrid DL–RL pipeline, quantitative benchmarking against advanced baselines, and visual demonstrations of adaptability. The methods integrate a CNN for human-state perception and a DDQN for task allocation, as detailed below (Figure 1).
Description:
Sensor Data: Input from simulated sensors (e.g., 128 × 128 × 3 feature maps).
CNN: Processes data to classify human state (9 classes: 3 fatigue levels × 3 skill levels).
DDQN: Uses state vector (human state+task queue+robot status) to select actions via Q-values.
Simulation Environment: Executes actions and returns rewards/feedback according to the reward function R = 0.5 × Throughput − 0.3 × Workload + 0.2 × Safety (Section 2.3.2).
Feedback Loop: Closes the system, updating states for continuous adaptation.
2.1. Simulation Environment
The simulation replicates a collaborative assembly station featuring a six-degree-of-freedom (6-DoF) Doosan Robotics H2017 robot [28] and a virtual human worker. The H2017 model (20 kg payload, 1700 mm reach, ±0.1 mm repeatability) was integrated into RoboDK for accurate kinematic simulation [26]. The virtual human agent was modeled with dynamic state parameters, such as fatigue and skill level, generated from synthetic sensor data, inspired by methods used to capture human states in HRC [1,2]. MATLAB R2025a orchestrates simulation logic and algorithms, interfacing with RoboDK 5.9 via the RoboDK API on a Windows 11 host PC. The Reinforcement Learning Toolbox, Deep Learning Toolbox, and Robotics System Toolbox implement learning algorithms and robot control. This setup is consistent with simulation frameworks used in recent HRC studies for assembly planning and coordination [29]. Key components of the simulated cell included:
Robotic Arm: A six-degree-of-freedom (6-DoF) Doosan Robotics H2017 collaborative robot was modeled and programmed in RoboDK to perform cyclic pick-and-place tasks. It offers a maximum 20 kg payload, 1700 mm reach, and ±0.1 mm positional repeatability. The robot’s motion profile was empirically tuned to minimise residual oscillations; for example, the shoulder-joint angle remained stabilised at ≈0.017 rad (≈1°), as confirmed by RoboDK simulation logs, indicating smooth operation.
Human Worker: A virtual human operator was represented with dynamic state parameters—fatigue, skill level, and task-completion time—generated from synthetic sensor data to mimic wearable-device readings (e.g., heart-rate, motion tracking) and updated at 1 Hz by a MATLAB script. Fatigue values were drawn from a truncated normal distribution N(0.5, 0.12) clipped to [0, 1] to simulate varying physical tiredness; this distribution is grounded in empirical data from industrial operators (e.g., heart-rate variability in Orlando et al. [1] and muscle activity in Pereira et al. [2]), where fatigue accumulates non-linearly. Skill levels (novice/intermediate/expert) align with expertise tiers in HRC studies (e.g., Zhang & Cavuoto [30]). Task-completion times were sampled from a normal distribution N(10 s, 2 s²) to introduce variability in how quickly the human finishes tasks.
Tasks: A set of assembly tasks was defined, encompassing those better suited for human dexterity (e.g., inserting delicate connectors) versus tasks ideal for robotic precision and endurance (e.g., repetitive torque-controlled bolting). Task-complexity levels were varied to cover a spectrum from simple to challenging scenarios. Definitions and parameters were informed by industrial benchmarks and studies on task allocation in HRC [10,11] to ensure realism. Each task could thus be allocated to either the human or the robot, depending on the algorithm’s decisions and the current state of the human operator.
The environment was modelled as a 5 m × 3 m workspace at an ambient temperature of 22 °C with noise below 70 dB; dynamic factors such as temperature variations are left for future work. The assembly environment includes mechanical structures such as a URDF-driven conveyor (advancing at 100 mm·s−1), task stations with fixtures for pick-and-place (e.g., tolerance <0.2 mm for connectors), and collision envelopes around robot/human meshes to simulate real industrial constraints [24,29].
2.2. Synthetic Dataset
A large synthetic dataset was created to drive and evaluate the learning framework, comprising 1000 simulated HRC episodes (task-allocation scenarios) [31]. The use of synthetic datasets in HRC simulations has been validated in studies such as [32], ensuring realistic modeling of human-robot interaction. Each episode encapsulated a sequence of tasks and the corresponding human–robot interaction data, recorded as follows:
Human-state data: Operator fatigue level, skill level, and task-completion time at each decision point (generated by the human_state_generator.m script).
Robot-performance data: Robotic task success/failure, execution times for each task, and any safety-related events (e.g., collision flags).
Task attributes: Details of each task in the sequence, including task type (human-centric, robot-centric, or collaborative) and complexity level.
The synthetic dataset was generated via a MATLAB script (generate_synthetic_dataset.m), ensuring consistency with the simulation environment, and was saved for offline analysis. The dataset, covering 1000 HRC episodes, is available on FigShare (https://doi.org/10.6084/m9.figshare.29323520) in CSV format (HRC_Simulation_Results.csv), with an optional Parquet version (HRC_Synthetic_Dataset.parquet), to support result validation.
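For illustration, the following minimal MATLAB sketch samples the human-state variables from the distributions described in Section 2.1 and writes them to a CSV log. It is a simplified stand-in for generate_synthetic_dataset.m: the number of decision points per episode, the reading of the distribution parameters as standard deviation/variance, and the column names are assumptions rather than the released implementation.

% Hedged sketch of synthetic human-state sampling (illustrative only).
rng(42, 'twister');                                    % deterministic seed, as used in the study
nEpisodes = 1000;  nSteps = 25;                        % reduced from 2500 allocations per episode for brevity
skillTiers = ["novice"; "intermediate"; "expert"];
nRows   = nEpisodes * nSteps;
episode = repelem((1:nEpisodes)', nSteps);
fatigue = min(max(0.5 + 0.12*randn(nRows, 1), 0), 1);  % truncated N(0.5, 0.12), clipped to [0, 1]
skill   = skillTiers(randi(3, nRows, 1));  skill = skill(:);   % categorical skill tier
tTask   = max(0, 10 + sqrt(2)*randn(nRows, 1));        % completion time, N(10 s, 2 s^2) read as variance
writetable(table(episode, fatigue, skill, tTask), 'HRC_Synthetic_Dataset_sketch.csv');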
2.3. Hybrid DL-RL Framework
The proposed control architecture constitutes a fully synchronous, closed-loop hybrid pipeline, in which a deep-learning perception module and a reinforcement-learning decision module exchange information at 10 Hz to guarantee sub-100 ms end-to-end latency—well within the response requirements of collaborative assembly stations. The framework, therefore, comprises two tightly coupled components—a CNN that interprets the human operator’s state in real time and a DDQN that allocates tasks on the basis of the CNN’s output. Both modules were implemented in MATLAB R2025a, utilising the Deep Learning Toolbox for the CNN and the Reinforcement Learning Toolbox for the DDQN; the modules communicate bidirectionally with the RoboDK 5.9 simulation via TCP/IP sockets for real-time execution of actions. This integrated approach builds on recent advancements in combining perception and decision-making for HRC [20,21,22]. All code is structured in object-oriented MATLAB classes (PerceptionCNN, AllocationDDQN, and RoboDKInterface) to promote reusability, modular testing, and rapid hyper-parameter tuning through configuration files.
The CNN–DDQN pairing was chosen over transformer-based models (high computational cost [33]) and a simple DQN (training instability [34]) because double Q-learning mitigates Q-value over-estimation, yielding stable training, and the perception module offers low inference latency (4.8 ms).
2.3.1. Deep Learning Component
The deep-learning module employs a three-stage convolutional neural network similar to approaches used for monitoring human states in HRC [1,2], and optimized with advanced regularization and training techniques to classify the operator’s fatigue and skill level from simulated sensor inputs. Specifically, the CNN processes 2-D feature maps derived from 128 × 128 px three-channel sensor inputs—channel 1 encodes heart-rate variability trends, channel 2 captures skeletal-motion energy maps, and channel 3 aggregates cumulative workload indices computed from simulated IMU data, aligning with external sensor reviews for human detection in collaborative environments [35]. The network architecture consists of three convolutional layers with 3 × 3 kernels (32, 64, 128 filters, respectively), each followed by batch normalisation, ReLU activation, and 2 × 2 max-pooling to ensure stable gradients and progressive spatial-dimension reduction. Spatial dropout (rate = 0.2) is applied after the second pooling layer to mitigate over-fitting. The convolutional stack feeds two fully connected layers of 256 and 128 neurons, respectively, each regularized with L2 weight decay (10−4), culminating in a softmax layer that yields a probability distribution over nine classes (3 fatigue levels × 3 skill levels). The “low,” “medium,” and “high” fatigue labels correspond to normalized fatigue values <0.25, 0.25–0.40, and ≥0.40, respectively, derived from the synthetic dataset’s fatigue distribution clipped to [0, 1] (Section 2.2).
The model was trained on an 80/20 train–test split of the synthetic dataset using the categorical cross-entropy loss, the Adam optimiser (learning rate = 0.001, β1 = 0.9, β2 = 0.999), batch size = 32, and 50 epochs. Early stopping with a patience of five epochs was employed, and the best-performing weights on the validation set were restored for inference. Training was executed on an NVIDIA RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA), reaching a throughput of ≈2100 images·s−1 and converging in ≈11 min. The trained CNN attained 92% accuracy, an F1-score of 0.90, and an ROC-AUC of 0.94 on the test set. Inference latency averaged 4.8 ms per image on the target workstation, meeting the real-time requirements of the closed-loop system.
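A compact Deep Learning Toolbox sketch of the layer stack and training options described above is given below. It is illustrative rather than the released PerceptionCNN class: the standard dropoutLayer stands in for spatial dropout, the ReLU activations between the fully connected layers are assumed, and L2 regularisation is applied globally through trainingOptions rather than per layer.

% Hedged sketch of the perception CNN (illustrative; not the released implementation).
layers = [
    imageInputLayer([128 128 3])                  % HRV, motion-energy and workload channels
    convolution2dLayer(3, 32, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 64, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    dropoutLayer(0.2)                             % stands in for the spatial dropout described above
    convolution2dLayer(3, 128, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(256)
    reluLayer                                     % ReLU between FC layers is an assumption
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(9)                        % 3 fatigue levels x 3 skill levels
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam', 'InitialLearnRate', 1e-3, 'MiniBatchSize', 32, ...
    'MaxEpochs', 50, 'L2Regularization', 1e-4, 'ValidationPatience', 5, 'Shuffle', 'every-epoch');
% net = trainNetwork(XTrain, YTrain, layers, options);   % 80/20 split prepared beforehand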
2.3.2. Reinforcement Learning Component
The reinforcement learning module is responsible for learning an optimal task allocation policy under varying human conditions, by leveraging a DDQN trained via interaction with the simulation environment, building upon similar RL approaches in HRC [20,21,36]. The agent determines, at each decision step, whether the next task should be assigned to the human operator, the robot, or executed collaboratively. The design of the agent incorporates a carefully constructed state representation, discrete action space, and a domain-specific reward function tailored to human–robot collaboration.
State space: At each timestep, the environment is represented by a vector combining three elements:
Human state, as classified by the CNN (i.e., a categorical encoding of fatigue level and skill level).
Task queue status, including the attributes of up to 10 pending tasks (task type and complexity), encoded as one-hot vectors concatenated into a fixed-length representation.
Robot status, represented as a binary indicator (0 for idle, 1 for busy).
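To make the encoding concrete, the sketch below assembles one such state vector; the one-hot sizes (nine human-state classes, three task types, three complexity levels) and the zero-padding of empty queue slots are assumptions consistent with the description above, not the released code.

% Hedged sketch of assembling the DDQN input state (encoding sizes are assumptions).
humanClass = 5;                                   % CNN output class, 1..9 (3 fatigue x 3 skill)
humanOneHot = zeros(1, 9);  humanOneHot(humanClass) = 1;
maxQueue = 10;  nTypes = 3;  nLevels = 3;         % task type and complexity per queue slot
queue = zeros(1, maxQueue * (nTypes + nLevels));  % fixed length, zero-padded for empty slots
pending = [1 2; 3 1];                             % e.g., two pending tasks: [type, complexity]
for i = 1:size(pending, 1)
    base = (i - 1) * (nTypes + nLevels);
    queue(base + pending(i, 1)) = 1;              % one-hot task type
    queue(base + nTypes + pending(i, 2)) = 1;     % one-hot complexity level
end
robotBusy = 0;                                    % 0 = idle, 1 = busy
state = [humanOneHot, queue, robotBusy];          % vector passed to the DDQN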
Action space: The agent selects one of three discrete actions:
Assign the task to the human worker.
Assign the task to the robotic arm.
Assign the task to be completed collaboratively (by both agents).
Reward function: The agent receives a scalar reward R at each decision point, defined by R = 0.5 × Throughput − 0.3 × Workload + 0.2 × Safety, where:
Throughput denotes the number of tasks completed per minute, normalised to a fixed scale.
Workload quantifies the cumulative exertion of the human operator, estimated from their fatigue state after task execution (a higher workload reduces the reward).
Safety is a binary indicator representing the absence (1) or presence (0) of a safety-critical event, such as a collision.
The weighting coefficients w1 = 0.5, w2 = 0.3, and w3 = 0.2 were chosen empirically through sensitivity testing to prioritise production efficiency, while mitigating excessive human fatigue and ensuring task execution remains within safe operational limits. The design of such reward functions is critical in HRC to balance productivity and human well-being, as discussed in [37]. The detailed calculations for Throughput, Workload, and Safety metrics, including physiological signal fusion and regression coefficients, are provided in Appendix A.
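As a minimal numerical illustration of this reward, the sketch below evaluates R for one decision point; the throughput normalisation scale is an assumption, and the full metric definitions remain those of Appendix A.

% Hedged sketch of the scalar reward at one decision point (illustrative values).
w1 = 0.5;  w2 = 0.3;  w3 = 0.2;                   % weighting coefficients from Section 2.3.2
tasksPerMin = 6.2;  maxRate = 10;                 % maxRate is an assumed normalisation scale
throughput  = min(tasksPerMin / maxRate, 1);      % normalised to [0, 1]
workload    = 0.42;                               % cumulative operator fatigue after the task
safety      = 1;                                  % 1 = no safety-critical event, 0 = collision
R = w1*throughput - w2*workload + w3*safety;      % R = 0.5*Throughput - 0.3*Workload + 0.2*Safety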
Network architecture: The DDQN is implemented as a fully connected MultiLayer Perceptron (MLP) comprising:
Three hidden layers with 128, 64, and 32 neurons, respectively, each followed by ReLU activation functions to enable non-linear approximation.
A final output layer that returns Q-values for each of the three possible task assignment actions.
Training protocol: The DDQN was trained across 1000 simulated episodes, each containing a varying number of task allocations depending on simulation dynamics. The training employed:
An ε-greedy exploration policy, with initial ε0 = 0.1, decaying by a factor of 0.995 per episode to promote exploration at early stages and exploitation later in learning.
Experience replay, implemented via a buffer of 10,000 transition tuples (s, a, r, s′) sampled in mini-batches of 64 to decorrelate consecutive observations and stabilise learning.
A target network, updated every 100 steps, to compute stable target Q-values, thereby reducing oscillations and divergence.
A discount factor γ = 0.99, to prioritise long-term rewards, and a learning rate α = 0.001, adjusted via grid search during pilot experiments to ensure convergence across multiple random seeds.
The complete training procedure was executed in MATLAB using the Reinforcement Learning Toolbox, with performance monitored through loss convergence, reward trends, and Q-value stabilisation. All hyperparameters were tuned during preliminary experimentation to ensure policy convergence and consistency across simulation runs.
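To make the double-estimation mechanism explicit, the plain-MATLAB sketch below computes the DDQN learning targets for a mini-batch; the study itself relies on the Reinforcement Learning Toolbox agent rather than this hand-rolled function.

% Hedged sketch of the double-DQN target computation (illustrative only).
% Example call: y = ddqnTargets([0.8; 0.4], rand(2,3), rand(2,3), [0; 1], 0.99);
function y = ddqnTargets(r, qOnlineNext, qTargetNext, done, gamma)
    % qOnlineNext / qTargetNext: [batch x 3] Q-values of s' from the online and target networks.
    [~, aStar] = max(qOnlineNext, [], 2);                      % online network selects the greedy action ...
    idx = sub2ind(size(qTargetNext), (1:numel(r))', aStar);    % ... and the target network evaluates it,
    y = r + gamma .* (1 - done) .* qTargetNext(idx);           % curbing Q-value over-estimation
end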
2.3.3. Integration
The deep-learning and reinforcement-learning components were integrated into a rigorously time-deterministic, closed-loop control stack operating at Δt = 100 ms (10 Hz), ensuring real-time responsiveness essential for collaborative stations, as highlighted by [38]. At every control cycle:
Perception step—The CNN ingests the newest 128 × 128 × 3 sensor frame, executes inference in ≈4.8 ms on an RTX 4090, and updates the operator’s fatigue–skill class.
Decision step—The DDQN receives a state vector 〈human-state, task-queue, robot-status〉 and produces Q-values; the greedy action is selected after ε-greedy exploration noise is applied. Inference latency averages 2.1 ms.
Execution step—MATLAB transmits the chosen action over a persistent TCP/IP socket (≤0.3 ms latency) to RoboDK’s runtime, where a high-level motion script dispatches either:
robot_execute (taskID) → parametrised pick-and-place path on the Doosan H2017;
human_simulate (taskID, t_est) → updates human-state buffer and visual avatar timeline;
collab_sequence (taskID) → synchronised dual-agent macro including hand-over checkpoints. A manual-override mode in the GUI lets the operator bypass the DDQN whenever safety or an emergency requires it.
Feedback step—Upon task completion, RoboDK emits a JSON packet 〈taskID, successFlag, execTime, safetyFlag〉 back to MATLAB. A MATLAB script simultaneously increments cumulative fatigue by ΔF = k · execTime (k = 5 × 10−3 units·s−1). These feedback tuples populate both the experience-replay buffer and a dedicated Parquet log for offline analytics.
A master scheduler (closedLoopController.m) orchestrates perception → decision → actuation with a hard real-time deadline of 90 ms, leaving a 10 ms guard band for OS jitter. Synchronisation is managed via MATLAB’s parallel.pool.Constant objects, ensuring thread-safe access to the DDQN target network and RoboDK API handles. Five hundred fully independent simulation runs (random seeds 1–500) were conducted; each run processed 2500 task-allocation cycles, producing >1.2 million state–action–reward samples for subsequent statistical evaluation.
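The timing skeleton of a single control cycle can be sketched as follows; the anonymous functions are hypothetical stand-ins for the PerceptionCNN, AllocationDDQN and RoboDKInterface classes and are not the released code.

% Hedged sketch of the 10 Hz perception-decision-execution-feedback cycle.
classifyFrame = @(frame) randi(9);                            % stand-in for PerceptionCNN: class 1..9
selectAction  = @(state) randi(3);                            % stand-in for AllocationDDQN: 1 human, 2 robot, 3 collaborative
executeAction = @(a) struct('execTime', 8 + 4*rand, 'safetyFlag', 1);   % stand-in for RoboDKInterface
fatigue = 0.5;  k = 5e-3;  dt = 0.10;                         % DeltaF gain and 100 ms control period
for cycle = 1:50                                              % 5 s of simulated operation
    tCycle = tic;
    frame  = rand(128, 128, 3);                               % placeholder sensor frame
    state  = [classifyFrame(frame), zeros(1, 10), 0];         % human state + task queue + robot status
    fb     = executeAction(selectAction(state));              % execution step
    fatigue = min(1, fatigue + k * fb.execTime);              % feedback step: DeltaF = k * execTime
    pause(max(0, dt - toc(tCycle)));                          % guard band keeps the loop at 10 Hz
end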
2.4. Integration of MATLAB and RoboDK for Simulation and Visualization
To streamline experimentation and generate publication-quality visuals, MATLAB and RoboDK were linked through a bidirectional, event-driven architecture similar to the integration method presented in [
39], that spans preparation, execution, monitoring and post-processing phases:
Environment setup in RoboDK. A parametric cell template scripted in RoboDK 5.9 automatically loads (i) a STEP model of the Doosan H2017 [
28], (ii) a URDF-compatible conveyor, (iii) a rigged human avatar with 22 DoF, and (iv) five task stations corresponding to Interface, Sorting, Picking, Replenishing and Transport actions. Before each run, workspace envelopes and joint limits are verified with Doosan’s Safety Configuration Wizard to pre-empt out-of-range motions.
MATLAB ↔ RoboDK bridge. A MATLAB interface script opens a persistent socket on port 20500. Commands follow a lightweight text protocol—CMD, <timestamp>, <payload>—and replies include execution metrics for subsequent logging. Network benchmarking with 1000 pings showed an average round-trip latency of 0.28 ± 0.04 ms.
Online state mirroring. The human avatar’s fatigue drives a real-time semaphore displayed above the helmet: green (≤0.25), orange (0.25–0.40) or red (≥0.40). When fatigue reaches orange or red, the controller throttles robot speed to 80% or 60%, respectively, via RoboDK’s GUI API, ensuring situational safety (a minimal sketch of this rule is given at the end of this subsection).
Proof-of-concept figures. Five simulation instances (t1–t5 = 15 s, 45 s, 90 s, 135 s, 180 s) are captured automatically using a MATLAB visualization script. MATLAB superimposes blue arrows (CNN inference flow) and yellow arrows (DDQN decision flow) through RoboDK’s annotation layer, exports each scene at 3000 × 1600 px and 600 dpi, and compiles the sequence as Figure 2.
Closed-loop telemetry. A publish/subscribe bus built with MATLAB’s parallel.pool.PollableDataQueue streams CNN logits, chosen actions, and safety flags, which are plotted to visualize throughput, cumulative fatigue, and safety incidents in real time—facilitating rapid regression testing between runs.
This tightly coupled integration keeps perception, decision, actuation and visual feedback phase-locked, allowing the virtual human–robot team to adapt continuously while providing rich evidence of the framework’s behaviour.
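A minimal sketch of the fatigue semaphore and speed-throttling rule from the state-mirroring step above is given below; the actual RoboDK GUI and speed-override calls are omitted, and only the threshold logic is shown.

% Hedged sketch of the fatigue semaphore thresholds (illustrative helper).
function [colour, speedFactor] = fatigueSemaphore(fatigue)
    if fatigue <= 0.25
        colour = "green";   speedFactor = 1.00;   % full robot speed
    elseif fatigue < 0.40
        colour = "orange";  speedFactor = 0.80;   % throttle robot speed to 80%
    else
        colour = "red";     speedFactor = 0.60;   % throttle robot speed to 60%
    end
end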
2.5. Benchmarking (Baseline Comparisons)
The hybrid CNN+DDQN framework was benchmarked against five baselines under identical simulation seeds, task queues and random initialisations:
Rule-Based Allocation. A non-learning proxy for common shop-floor heuristics [37]. Tasks are dispatched according to fixed priorities:
Complexity rule—tasks requiring fine dexterity (tolerance < 0.2 mm) go to the human.
Repetition rule—repetitive/force-intensive tasks are delegated to the robot.
Time-out override—if the human queue stalls >15 s, the next task is reassigned to the robot. This static policy ignores real-time fatigue.
SARSA RL. On-policy State–Action–Reward–State–Action algorithm with the same reward in Equation (1), 1000-episode training, α = 0.002, γ = 0.95, λ = 0.7, ε-greedy (ε0 = 0.1, decay = 0.995). No experience replay, mirroring classic on-policy formulation. Input state omits CNN-derived fatigue/skill signals.
Dueling DQN. Value-advantage architecture [40] with identical hyperparameters to DDQN but single-estimation targets. Highlights how advantage decoupling alone compares to double estimation.
PPO. Proximal Policy Optimisation [34] with actor-critic loss clipping (ϵ = 0.2) and GAE (λ = 0.95). Trained for 3 × 10⁵ steps; baseline chosen for its strong performance in robotics.
A3C. Asynchronous Advantage Actor–Critic with 8 worker threads, mirroring settings in the comparative study of Huang et al. [34].
Experimental protocol. CNN+DDQN, Dueling DQN, PPO, A3C, rule-based and SARSA agents were each executed over the same 1000-episode evaluation set (2500 task allocations·episode−1), yielding >2.5 million allocation decisions per method. The following metrics were logged:
Throughput—tasks·min−1, averaged per episode and normalised to [0, 1] for statistical analyses.
Human workload—cumulative fatigue units accrued by the operator across the episode.
Safety—proportion of tasks completed without any collision flag (binary indicator). These metrics were computed as formalized in Appendix A, ensuring traceability to physiological signals and simulation parameters.
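For traceability, the sketch below extracts the three logged metrics from a per-episode task table; the column names and the throughput normalisation scale are assumptions, with the formal definitions given in Appendix A.

% Hedged sketch of per-episode metric extraction (illustrative column names).
E = table([1;2;3;4], [12.1;9.8;11.3;10.4], [0;0;0;0], [0.06;0.05;0.07;0.06], ...
          'VariableNames', {'taskID', 'execTime', 'collisionFlag', 'fatigueDelta'});
episodeMinutes = sum(E.execTime) / 60;
throughput = height(E) / episodeMinutes;          % tasks per minute
workload   = sum(E.fatigueDelta);                 % cumulative fatigue units
safety     = mean(E.collisionFlag == 0);          % fraction of collision-free tasks
throughputNorm = min(throughput / 10, 1);         % normalisation scale is an assumption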
DDQN outperformed Dueling DQN by 8% in throughput, PPO by 5% in safety and A3C by 3% in safety (one-way ANOVA, p < 0.01). The advantage stems from double Q-learning, which mitigates over-estimation and yields stabler policies under stochastic human states, whereas PPO/A3C may converge faster only in near-deterministic settings [34].
Power analysis. Post-hoc power analysis (MATLAB sampsizepwr) showed 1 − β = 0.87 (α = 0.05) for the main effects, confirming adequate statistical power.
2.6. Statistical Analysis
Performance data were analysed in MATLAB R2025a using functions from the Statistics and Machine Learning Toolbox. For each metric, the following pipeline was applied:
Assumption checks. Normality was tested with the Shapiro–Wilk test (swtest.m), and homogeneity of variances with Levene’s test (vartestn, option ‘levene’). No violations were detected at α = 0.05.
One-way ANOVA. A fixed-effects ANOVA (anova1) compared mean throughput, workload and safety across the six strategies (Hybrid, Rule-Based, SARSA, Dueling DQN, PPO, A3C). Significance threshold α = 0.05.
Post-hoc contrasts. Where ANOVA yielded p < 0.05, Tukey’s Honest Significant Difference (HSD) test (multcompare with ‘CType’,‘hsd’) isolated pairwise differences. Adjusted p-values and 95% confidence intervals are reported.
Effect sizes. Partial η2 values were computed (mes, ‘eta2’) to quantify the magnitude of strategy effects, with thresholds 0.01 = small, 0.06 = medium, 0.14 = large.
All analyses were reproducible under deterministic random seeds (set via rng(42,’twister’)) and executed on the same workstation used for simulation to avoid platform-induced variability.
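The per-metric pipeline above can be reproduced with the Statistics and Machine Learning Toolbox as sketched below; the grouped values are random placeholders rather than the study’s data.

% Hedged sketch of the ANOVA + Tukey HSD pipeline (placeholder data).
rng(42, 'twister');                                            % deterministic seed, as in the study
groups = {'Hybrid', 'RuleBased', 'SARSA', 'DuelingDQN', 'PPO', 'A3C'};
nEp = 1000;  y = [];  g = {};
for i = 1:numel(groups)                                        % placeholder per-strategy samples
    y = [y; 0.5 + 0.05*i + 0.02*randn(nEp, 1)];                %#ok<AGROW>
    g = [g; repmat(groups(i), nEp, 1)];                        %#ok<AGROW>
end
[p, ~, stats] = anova1(y, g, 'off');                           % one-way fixed-effects ANOVA
if p < 0.05
    c = multcompare(stats, 'CType', 'hsd', 'Display', 'off');  % Tukey HSD pairwise contrasts
end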
2.7. Data and Code Availability
Consistent with MDPI Systems’ open-science guidelines, the dataset and selected scripts are available to support result validation:
Synthetic dataset—1000 HRC episodes in CSV format (HRC_Simulation_Results.csv), with an optional Parquet version (HRC_Synthetic_Dataset.parquet).
Source code—MATLAB scripts (hrc_simulation.m, generate_synthetic_dataset.m, Table_2.m, Table_3.m, Figure_2.m to Figure_6.m) for simulation and visualization.
5. Discussion
The experimental results robustly validate Hypotheses H1–H3: my hybrid CNN–DDQN controller outperforms rule-based, SARSA, Dueling DQN, PPO and A3C baselines while remaining scalable to heavy-duty cobots. Over 1000 episodes, the framework lifted throughput from 49.92 to 60.48 tasks·min−1 (+21%), held safety at 99.90%, and cut workload by 7%—gains that equal or surpass recent AI-enhanced HRC studies [16,37,52]. These gains were achieved without sacrificing fairness, as Table 3 shows balanced allocations across fatigue–skill states.
The key differentiator is the tight 10 ms perception ↔ 90 ms decision loop: a CNN infers fatigue-skill states in <10 ms, and a double-estimating DDQN refines task allocations inside a 100 ms control cycle. This latency envelope ensures ISO 10218 compliance for collaborative stations, demonstrating industrial feasibility. The real-time fusion overcomes the rigidity of earlier “perception-or-control” pipelines [1], confirming Hypothesis 2 on adaptability. Comparable integrated schemes include Franceschi et al. [53] and Zhong et al. [54]; however, neither incorporates a workload-aware reward function nor validates against five competitive baselines, as I do here.
It is acknowledged that the current validation is strictly quantitative; a subjective validation using NASA-TLX, SUS, and Hoffman fluency scores is planned in future trials to substantiate the framework’s human-centric value.
In transitioning to physical HRC deployments, future efforts must address sensor noise, latency in robot actuation, and domain shifts between simulation and real-world conditions. Techniques such as domain adaptation, transfer learning, and lightweight models for edge deployment will be explored to mitigate these challenges. Moreover, multi-agent scalability could benefit from hierarchical or federated reinforcement learning frameworks, as discussed in [55].
5.1. Comparative Insights
Why does DDQN beat PPO/A3C? DDQN’s twin-network update mitigates Q-value over-estimation, yielding stable policies under stochastic human states. Huang et al. [34] showed that PPO’s critic inflates Q-values when reward variance is high; in my experiment, DDQN still outperformed PPO by +5% throughput and +4.8% safety, and surpassed A3C by +3% safety. Dueling DQN narrowed the gap (–8% throughput vs. DDQN) but remained vulnerable to noisy rewards because it lacks double estimation.
Transformer-based models were excluded due to higher inference latencies that violate the 100-ms control loop; DQN was also avoided due to unstable Q-value estimation under high-reward variance, which DDQN corrects via twin-network updates.
Vision-Only vs. RL-Only. Vision-only policies (Alenjareghi et al. [44]) achieve 95% safety but negligible throughput gains; RL-only schemes (Huang et al. [34]) raise throughput yet over-exert operators. My CNN+DDQN harmonises all three KPIs (+21% throughput, −7% workload, 99.90% safety), offering the ergonomic synergy highlighted by Keshvarparast et al. [56].
Hardware Realism. Deploying a Doosan H2017 (20 kg payload, 1700 mm reach) demonstrates scalability beyond the ≤5 kg robots common in prior research [50], supporting Hypothesis 3.
Unexpected Performance Dips. Under medium-fatigue, novice conditions, throughput rose only 2%. Replay-buffer inspection revealed hesitation between collaborative and robot-only actions; a variance-weighted reward penalty is planned to stabilise selections, following Itadera & Domae [57].
Furthermore, my gains in throughput (21%) and safety (99.90%) align with fluency metrics in HRC, where reduced functional delay and idle time correlate with better team performance [23]. Unlike RL-only baselines that may inflate throughput at the cost of over-exertion, the CNN+DDQN integration minimises workload while sustaining synchronisation, echoing Hoffman’s findings on perceived collaboration quality [23].
5.2. Limitations
Simulation Fidelity. Results rely on a high-fidelity digital twin; physical noise, lighting and biomechanics remain unmodelled. Sensor noise, latency in robot actuation, and domain shifts between simulation and real-world conditions must also be accounted for. Mitigation strategies such as domain adaptation, transfer learning and lightweight models suitable for edge deployment will be explored in future work.
Single-Agent Scope. One human–one robot; multi-actor setups will need decentralised or hierarchical policies [52,58]. I therefore propose federated hierarchical RL [55] for scalable coordination.
Latency Envelope. A 10 Hz loop met simulation needs; embedded platforms may need edge inference and model pruning [59].
Synthetic Perception. Fatigue is vision-based; wearables (EMG, eye-tracking) [60,61] could enhance accuracy.
Generalisation. Training used a single task taxonomy; meta-RL and workload-fairness constraints [54] will guard against bias across diverse products or workers.
Lack of Fluency Metrics. Objective KPIs omit subjective fluency; a future user study will collect NASA-TLX, SUS, and Hoffman’s fluency scale with approximately 12 volunteer operators, ensuring fairness constraints prevent operator bias.
Ablation Analysis. No ablation study was performed to isolate the individual contributions of the CNN and DDQN modules; future work will incorporate such analyses to quantify their respective impacts on safety, throughput, and workload.
5.3. Future Research Directions
Live trials. A pilot with the physical H2017 and volunteer operators will test sim-to-real transfer, sensor fusion [43] and edge deployment [59].
Multi-objective RL. Adding torque, energy and takt-time terms [33] will create holistic policies.
Federated fleet learning. Sharing anonymised gradients across cells will enable continuous learning [55].
Explainability. Integrating SHAP or attention maps will audit feature reliance [62,63].
Industry 5.0 alignment. Embedding worker preferences and carbon-aware scheduling [25,64] will ensure sustainability.
Qualitative evaluations. Future trials will combine fluency and fairness questionnaires [23,54] to validate human-centric claims.
Task Complexity and Transfer Learning. To accommodate more diverse workflows, future work will apply transfer learning techniques that pre-train the CNN–DDQN pipeline on synthetic tasks and fine-tune on real multi-step assemblies, facilitating generalisation across complexity tiers.
5.4. Additional Discussion
The workspace (5 m × 3 m, 22 °C, <70 dB) omits dynamic environmental fluctuations that influence fatigue [2]. The current environment model does not yet capture auxiliary mechanical elements or fluctuating physical conditions such as temperature and acoustic noise, which can impact fatigue and task accuracy in real deployments. Future extensions will enrich the simulation with environmental stochasticity and structural complexity to bridge this realism gap. Strengths of the study include the realistic digital twin and the open dataset and code; weaknesses include limited task diversity, a simplified operator model and the absence of physical trials. Future work will add real experiments, a manual-override GUI and transfer learning (pre-training on synthetic tasks, then fine-tuning on real data [65]).
Additionally, physical deployment will require robustness against domain transfer gaps, sensor imprecision, and robot latency. Techniques such as domain adaptation, sensor fusion, and lightweight inference models for embedded platforms are planned. Human states follow truncated normal distributions (as specified in Appendix A, Table A5) and categorical skill tiers (Appendix A, Table A6); the absence of completion-time improvement stems from the multi-objective reward prioritisation: safety (weight 0.2) and workload (weight 0.3) are deliberately favoured over speed (weight 0.1). This conservative tuning avoids unsafe shortcuts and overexertion, ensuring ergonomic safety over maximal throughput—a deliberate trade-off aligned with Industry 5.0 principles.
Additionally, a fairness constraint will be introduced to prevent allocation bias among multiple operators.
The system can scale to more complex scenarios via transfer learning, while manual-override safeguards preserve human authority in critical situations: a graphical override interface will allow operators to veto or adjust system decisions in real time, ensuring adaptability in unpredictable or safety-critical scenarios.
In summary, this study establishes a reproducible benchmark showing that integrated deep perception and double-estimating decision-making delivers human-aware, safe and productive HRC, paving the way for multi-robot Industry 5.0 cells.
6. Conclusions
This study has introduced and rigorously evaluated a simulation-driven hybrid CNN+DDQN framework that performs truly adaptive task allocation in human-robot collaboration (HRC). By fusing millisecond-level human-state perception with double-estimating reinforcement learning, the controller continuously reshapes work distribution in response to fluctuating fatigue and skill—thereby operationalising the worker-centric ideals of Industry 5.0.
Key quantitative achievements across 1000 validation episodes include:
Throughput: 60.48 ± 0.08 tasks·min−1 (+21% vs. rule-based, +12% vs. SARSA, +9% vs. A3C, +8% vs. Dueling DQN, +5% vs. PPO).
Workload: 4.25 ± 0.10 fatigue units (–7% vs. rule-based) while retaining higher safety than all RL baselines that posted lower workloads.
Safety: 99.90 ± 0.10% collision-free execution, a margin ≥ 1 pp better than every comparator.
Fatigue regulation: avoidance of over-utilisation as seen in PPO/A3C, due to a 0.3 fatigue penalty that balances efficiency and well-being (see Table 3).
One-way ANOVA (p < 0.001) followed by Tukey-HSD (α = 0.05) confirmed that these improvements are statistically significant; partial η2 values between 0.29 and 0.46 denote large practical effects, thereby validating Hypotheses H1–H3.
Industrial and societal relevance. Implementing the controller inside a RoboDK digital twin of a Doosan H2017 (20 kg payload, 1700 mm reach) proves that the algorithm scales to heavy-duty cobots while satisfying ISO/TS 15066 safety envelopes. The modular, object-oriented MATLAB codebase—communicating with RoboDK over TCP/IP—can be repurposed for multi-robot, multi-operator lines with minimal engineering effort. From an energy-and-well-being standpoint, the 7% drop in operator workload implies fewer micro-pauses and reduced ergonomic strain, while automated fatigue-based throttling cuts robot power draw by up to 40% during red-semaphore intervals, aligning with the sustainability pillar of Industry 5.0.
To enable deployment in real-world industrial settings, future research will address the challenges of simulation-to-reality transfer. This includes compensating for sensor noise, robot actuation latency, and domain gaps through domain adaptation, transfer learning, and sensor fusion. The inference pipeline will be redesigned for embedded execution by applying structured pruning to the CNN and quantisation of the DDQN. These techniques aim to maintain performance while enabling edge deployment on platforms like NVIDIA Jetson, thereby reducing infrastructure cost and power consumption.
Limitations. Despite the high-fidelity twin, simulation cannot capture lighting noise, real biomechanics, or human unpredictability. Vision-only fatigue proxies lack the granularity of multi-sensor wearables; the task taxonomy is limited to five assembly types; and the single-operator scenario does not expose fairness tensions in team settings.
Road-Mapped Future Work:
Hardware pilot: deploy the pipeline on a physical H2017 equipped with eye-tracking and EMG to verify latency (<100 ms), safety and ergonomic benefits in a furniture-manufacturing use-case.
Edge optimisation: compress the CNN (structured pruning) and quantise the DDQN to INT8, enabling inference <25 ms on NVIDIA Jetson AGX Orin.
Hierarchical and federated RL: extend to multi-agent cells where local managers learn sub-policies and share gradients via federated averaging, preserving operator privacy.
Multi-objective rewards: embed energy, carbon footprint and stated worker preferences so that the controller co-optimises productivity, sustainability and satisfaction.
Explainability and fairness: integrate SHAP heat-maps to expose which physiological cues drive decisions and enforce Jain-fairness constraints to avoid workload bias across heterogeneous operators.
Subjective fluency studies: run controlled user trials capturing NASA-TLX and Hoffman fluency scores to correlate objective KPIs with perceived collaboration quality.
Final statement:
In sum, this work delivers a reproducible benchmark and an open-source tool-chain demonstrating that the marriage of deep perception and double-estimating decision-making can yield human-aware, safe and energy-efficient HRC. The evidence presented here charts a concrete pathway from high-fidelity digital twins to real-world smart factories, where multiple robots collaborate with empowered, less-fatigued workers—fulfilling the promise of Industry 5.0.