Adaptive Multi-Objective Reinforcement Learning for Real-Time Manufacturing Robot Control
Abstract
1. Introduction
1.1. Intelligent Robots in Modern Manufacturing
1.2. Contemporary Challenges and Research Gap
1.3. Key Contributions and Research Novelty
- Adaptive Pareto-Optimal MORL Framework for Manufacturing: A novel algorithm (APO-MORL) tailored for manufacturing robotics enables real-time adaptation to shifting production priorities without retraining. The framework simultaneously optimizes six industry-critical objectives—throughput, cycle time, energy efficiency, precision, equipment longevity, and safety—while remaining compatible with Industry 4.0/5.0 cyber–physical systems and Manufacturing Execution Systems (MES). Unlike conventional single-objective RL or fixed-weight scalarization, the adaptive preference mechanism dynamically adjusts objective priorities according to manufacturing context (e.g., prioritizing throughput during peak demand and energy efficiency during off-peak hours).
- Rigorous Experimental Validation with Manufacturing-Specific Metrics: Comprehensive experiments in high-fidelity CoppeliaSim simulation reproduce industrial pick-and-place tasks using a UR5 6-DoF manipulator with RG2 gripper. Evaluation includes: (a) comparison with seven baselines (PID, PPO, DDPG, SAC, NSGA-II, SPEA2, MOEA/D); (b) hypervolume verification using four independent methods (WFG, PyMOO, Monte Carlo, HSO); (c) 30 independent runs per algorithm; and (d) testing under realistic disturbances (sensor noise σ = 2 mm, variable conveyor speeds 0.1–0.5 m/s, validated friction models, over 30,000 manipulation cycles). APO-MORL outperforms NSGA-II by 24.59% and single-objective SAC by 7.49% in normalized hypervolume (p < 0.001), with large effect sizes (Cohen’s d = 0.42–1.52).
- Rapid Convergence for Industrial Deployment: The framework achieves 90–95% of final Pareto-optimal performance in 180–200 training episodes—substantially faster than multi-objective evolutionary algorithms (typically >1000 evaluations). This fast convergence enables economically feasible industrial implementation. Resulting policies exhibit 99.97% grasp success rate and ±2.3 mm placement precision across diverse object geometries and dynamic production scenarios.
- Comprehensive Quality Control Integration: Multi-objective optimization integrates with automated quality control through geometry-based object classification and priority-driven routing. The system classifies objects by shape and size (deliberately color-agnostic) and routes them to four dedicated stations (High Priority, Medium Priority, Low Priority, Reject), achieving 98.3% classification accuracy over 500 test cycles with zero collisions, demonstrating capability to support human-centric Industry 5.0 manufacturing.
Key Performance Highlights
- +24.59% to +34.75% improvement over seven baseline methods (p < 0.001 for 6/7).
- Hypervolume: 0.076 ± 0.015 vs. 0.062 for best baseline (Weight Vector Selection MORL).
- 99.97% grasp success rate with ±2.3 mm placement precision.
- Achieves 95% optimal performance in 180 episodes (~18 h).
- 5× faster than evolutionary baselines (NSGA-II, SPEA2: 1000+ evaluations).
- 24.4% improvement over state-of-the-art MORL (d = 1.67, 95% CI: [1.35, 1.99]).
- Real-time inference: <32 ms (enables 20–30 Hz control loops).
- Edge computing: <2 GB RAM footprint.
- MES integration: OPC UA, MTConnect protocols.
- Instant policy adaptation: <1 s vs. 8 h retraining for single-objective RL.
- 30 independent experimental runs.
- Effect sizes: Cohen’s d = 0.42–1.52.
- Statistical power: >95% for all significant comparisons.
- Four independent hypervolume validation methods (WFG, PyMOO, Monte Carlo, HSO).
1.4. Paper Organization
2. Related Work
2.1. Multi-Objective Optimization in Manufacturing
2.2. Reinforcement Learning in Robotics
2.3. Multi-Objective Reinforcement Learning
2.4. Recent Advances in Multi-Objective Reinforcement Learning
2.5. Industry 4.0 and Cyber–Physical Manufacturing Systems
3. Methodology
3.1. Adaptive Multi-Objective Reinforcement Learning Framework
- Dynamic Preference Adaptation Mechanism: Unlike static scalarization approaches commonly used in traditional manufacturing optimization [13,14], the method employs an adaptive preference weighting system that adjusts objective priorities based on real-time manufacturing conditions and historical performance data, incorporating insights from continual learning research [24,25]. This mechanism enables seamless transitions between production priorities—such as shifting from throughput maximization during peak demand to energy efficiency optimization during off-peak periods—without requiring manual reconfiguration or retraining, a critical capability for Industry 4.0/5.0 environments [5,7].
- Manufacturing-Specific Objective Space: Six industry-relevant objectives based on contemporary manufacturing requirements [1,6] and sustainability considerations aligned with Industry 4.0 and 5.0 principles [5,7]:
- Throughput maximization (r1): Parts processed per unit time.
- Cycle time minimization (r2): Seconds per operation.
- Energy efficiency optimization (r3): Power consumption per operation.
- Precision enhancement (r4): Positioning accuracy in mm—critical for quality control [6].
- Equipment wear reduction (r5): Maintenance interval extension through optimized joint trajectories.
- Safety assurance (r6): Collision-free operation, prioritized during human collaboration.
- Rapid Convergence Architecture: Incorporating insights from recent MORL developments [22,23,24], the approach achieves 95% of optimal performance within 200 training episodes, significantly faster than traditional evolutionary approaches [11,12,13] and compatible with real-time manufacturing constraints typical of cyber–physical systems [2,3]. This rapid convergence enables practical deployment in industrial settings where extended training periods are economically infeasible.
- Cyber–Physical Integration: This study designed the framework for seamless integration with digital twin architectures [26,27,28] and existing MES, supporting real-time adaptation in Industry 4.0 and 5.0 environments [5,29]. Edge computing compatibility (<2 GB RAM, <50 ms inference latency) enables deployment on industrial controllers without cloud dependencies, ensuring real-time responsiveness critical for manufacturing applications [2,29].
3.1.1. Multi-Objective Markov Decision Process Formulation
- S: State space representing robot configuration, environment state, and task context.
- A: Action space including continuous joint commands and discrete task decisions.
- P: Transition probability function P(s′|s, a).
- R: Multi-objective reward vector R = [r1, r2, …, r6]ᵀ.
- γ: Discount factor set to 0.99 to balance immediate and long-term objectives.
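The vector-reward formulation above can be sketched minimally in Python. The 6-dimensional reward and γ = 0.99 come from the text; the function name and interface are illustrative assumptions, not the paper's code:

```python
import numpy as np

GAMMA = 0.99        # discount factor from the MOMDP definition
N_OBJECTIVES = 6    # r1..r6

def discounted_return(reward_vectors, gamma=GAMMA):
    """Vector-valued discounted return G = sum_t gamma^t * r_t,
    where each r_t is a 6-dimensional reward vector."""
    G = np.zeros(N_OBJECTIVES)
    for t, r in enumerate(reward_vectors):
        G += (gamma ** t) * np.asarray(r, dtype=float)
    return G
```

Each objective accumulates independently, so the return is itself a 6-vector rather than a scalar.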
3.1.2. State Representation
- Robot State (12D): Joint positions q = [q1, …, q6] and joint velocities q̇ = [q̇1, …, q̇6].
- Environment State (8D): Object positions (3D coordinates), conveyor status (velocity, position), pallet occupancy (binary indicators per station), and sensor readings (proximity, force feedback).
- Task State (3D): Current objective weights w ∈ ℝ6, progress indicators (task completion ratio), and timing constraints (deadline proximity).
3.1.3. Action Space
- Continuous Actions (6D): Joint velocity commands ω = [ω1, …, ω6] bounded by actuator limits ωmin, ωmax to ensure safe operation.
- Discrete Actions (4D): Gripper control (open/close for RG2 gripper), conveyor interaction (start/stop/speed adjustment), task prioritization (object selection based on quality classification), and pallet selection (destination station assignment: High/Medium/Low/Reject priority).
3.1.4. Multi-Objective Reward Structure
- Throughput (r1): Parts processed per unit time—calculated as successful placements per episode duration.
- Cycle Time (r2): Inverse of task completion time (minimization)—normalized by baseline PID controller performance.
- Energy Efficiency (r3): Inverse of power consumption—estimated from joint torques and velocities using actuator models.
- Precision (r4): Placement accuracy (position + orientation)—measured as negative Euclidean distance from target pose.
- Wear Reduction (r5): Inverse of joint stress and acceleration—quantified through jerk minimization to extend equipment lifespan.
- Safety (r6): Collision avoidance—rewarding collision-free operation throughout the pick-and-place cycle.
3.2. Proposed MORL Algorithm
3.2.1. Adaptive Pareto-Optimal MORL (APO-MORL)
Algorithm Overview
Algorithm 1: APO-MORL Training Procedure (Simplified)
Input: Environment E, preference weight distribution W, max episodes N
Output: Pareto archive P of non-dominated policies
1. Initialize:
   - Policy network πθ with parameters θ
   - Six Q-networks Qφ1, Qφ2, …, Qφ6 (one per objective)
   - Experience replay buffer D (capacity 50,000)
   - Pareto archive P ← ∅
2. For episode = 1 to N:
   a. Sample preference weights w ~ W
   b. Reset environment: s ← s0
   c. For step = 1 to T:
      - Select action: a ~ πθ(·|s) with ε-greedy exploration
      - Execute action; observe rewards r = [r1, r2, …, r6] and next state s′
      - Store transition (s, a, r, s′, w) in replay buffer D
   d. Update networks:
      - Sample minibatch from D
      - Update each Qφi using temporal-difference learning
      - Update policy πθ using weighted Q-values: Q(s, a, w) = Σi wi Qφi(s, a)
   e. Evaluate policy πθ and update Pareto archive P
3. Return Pareto archive P
Algorithm 2: Dynamic Preference Weighting (Simplified)
Input: Current manufacturing context C, Pareto archive P
Output: Selected policy π* for execution
1. Analyze manufacturing context C:
   - Peak demand → increase w1 (throughput)
   - Off-peak hours → increase w3 (energy efficiency)
   - Quality inspection → increase w4 (precision)
   - Near maintenance window → increase w5 (wear reduction)
   - Human collaboration active → increase w6 (safety)
2. Generate contextual preference vector w = [w1, w2, …, w6]; normalize so that Σi wi = 1
3. Select policy from archive: π* ← argminπ∈P ||Qπ(s,·) − w||2 (weighted Euclidean distance)
4. Return π* for real-time execution
3.2.2. Adaptive Weight Mechanism
- Temporal Constraints: Shift schedules, maintenance windows, and peak demand periods—enabling predictive priority adjustment based on production schedules.
3.2.3. Pareto Archive Management
- Dominance Check: The framework compares new solutions against existing archive members using standard Pareto dominance criteria: solution x dominates y if xi ≥ yi for all objectives i and xj > yj for at least one objective j.
- Archive Size Control: Fixed-size archive with maximum capacity of 100 solutions and replacement strategy that removes solutions with minimum crowding distance when capacity is exceeded.
- Solution Selection: Context-aware solution retrieval for policy guidance (incorporating real-time manufacturing priorities [29])—the solution closest to the current objective weights w in weighted Euclidean distance is selected for policy execution.
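The dominance check and archive update above can be sketched as follows. The capacity of 100 matches the text, but the FIFO removal is a placeholder for the paper's crowding-distance pruning:

```python
import numpy as np

def dominates(x, y):
    """Pareto dominance (maximization): x_i >= y_i for all i, x_j > y_j for some j."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return bool(np.all(x >= y) and np.any(x > y))

def update_archive(archive, candidate, capacity=100):
    """Insert candidate if no member dominates it; remove members it dominates.
    At capacity, the paper removes the minimum-crowding-distance solution;
    a FIFO pop stands in for that here."""
    if any(dominates(a, candidate) for a in archive):
        return archive
    archive = [a for a in archive if not dominates(candidate, a)]
    archive.append(candidate)
    if len(archive) > capacity:
        archive.pop(0)  # placeholder for crowding-distance-based removal
    return archive
```

Mutually non-dominated solutions coexist in the archive, which is what produces the Pareto front.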
3.3. Implementation Details
3.3.1. Network Architecture
- Policy Network: 3-layer MLP with 256 hidden units per layer, ReLU activation, tanh output layer for bounded action space.
- Optimizer: Adam with learning rate 3 × 10⁻⁴ and default β1 = 0.9, β2 = 0.999 parameters.
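A NumPy sketch of the stated policy architecture, assuming a 23-D state (12 + 8 + 3 per Section 3.1.2) and the 6-D continuous action output; the weight initialization and layer count interpretation (three hidden layers of 256 units) are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes=(23, 256, 256, 256, 6)):
    """Three hidden ReLU layers of 256 units, tanh output bounding actions to (-1, 1).
    Returns a list of (weight, bias) pairs; Gaussian init is illustrative."""
    return [(rng.normal(0.0, 0.05, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: ReLU on hidden layers, tanh on the output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        x = np.tanh(x) if i == len(params) - 1 else np.maximum(x, 0.0)
    return x
```

The tanh output keeps every action component inside actuator bounds before rescaling to [ωmin, ωmax].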
3.3.2. Training Configuration
- Training Episodes: 200—sufficient for convergence to 95% optimal performance based on preliminary experiments.
- Steps per Episode: 100—corresponding to approximately 10 pick-and-place cycles per episode in the experimental scenario.
- Batch Size: 64—selected to balance gradient estimate quality and computational efficiency.
- Exploration: ε-greedy with linear decay from 0.2 to 0.01 over 150 episodes—maintaining minimal exploration in final episodes for stable policy evaluation.
- Target Network Update: Soft update with τ = 0.005—gradual target network updates improve training stability compared to periodic hard updates.
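The ε schedule and soft target update above can be written as two small functions (the parameter values are from the text; soft_update operates on plain floats purely for illustration):

```python
def epsilon(episode, eps_start=0.2, eps_end=0.01, decay_episodes=150):
    """Linear decay from 0.2 to 0.01 over 150 episodes, then held constant."""
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

With τ = 0.005, each target parameter moves only 0.5% of the way toward its online counterpart per update, which is what smooths the TD targets.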
4. Experimental Setup and Implementation
4.1. Experimental Platform and Validation Environment
4.1.1. Hardware and Physics Simulation
- Robot Model: UR5 with RG2 gripper, modeled with accurate kinematics and dynamics including joint limits (±360° for base rotation, ±180° for other joints), velocity constraints (±180°/s), and payload capacity (5 kg maximum). Appendix B.1 and Appendix B.2 provide comprehensive technical specifications for the UR5 manipulator and RG2 gripper, including detailed kinematic parameters, workspace analysis, and performance characteristics.
- Physics Engine: Bullet, with realistic friction coefficients (μstatic = 0.5, μdynamic = 0.4), gravity simulation (9.81 m/s²), and collision detection using Axis-Aligned Bounding Box (AABB) hierarchies for computational efficiency.
- Sensor Simulation: Simulated proximity sensors with 0.05 m resolution and 2.0 m maximum range, force feedback at gripper contact points, and vision systems for perception providing RGB-D data at 30 Hz.
- Environment Dynamics: Variable conveyor speeds (0.1–0.5 m/s), random object arrival with Poisson-distributed inter-arrival times (λ = 0.2 objects/s mean rate), and dynamic lighting simulating industrial fluorescent illumination with realistic shadows, introducing temporal uncertainty and environmental variability.
4.1.2. Task Scenario and Object Classification
- Station_High Priority (green table): For standard-compliant cubic parts with edge length 50 ± 5 mm—representing high-quality components ready for assembly.
- Station_Medium Priority (yellow table): For long narrow rectangular prisms (length 100 mm, width 30 mm, height 30 mm) (non-standard but usable)—suitable for secondary applications or rework.
- Station_Low Priority (white table): For short wide rectangular prisms (length 60 mm, width 50 mm, height 30 mm) (obsolete or low-value components)—designated for recycling or salvage.
- Station_Reject (red table): For short thin rectangular prisms (length 70 mm, width 25 mm, height 15 mm) (defective or non-conforming items)—requiring disposal or quality investigation.
- Source: Dynamic conveyor with variable object arrival rates and bidirectional flow capability (though unidirectional operation is used in experiments).
- Objects: 12 total instances of 4 geometric types: 5 cubes, 3 long narrow rectangular prisms, 2 short wide prisms, and 2 short thin prisms, varying in size and material properties (density 800–1200 kg/m³, surface roughness affecting friction). Appendix B.3 and Appendix B.4 document detailed object specifications, including dimensions, masses, physical properties, and geometric compatibility analysis with the RG2 gripper.
- Disturbances: Timing uncertainties (±10% conveyor speed variation), object overlap (requiring sequential picking decisions), and environmental noise (sensor measurement error σ = 2 mm).
- Stroke width: 110 mm (maximum jaw opening range).
- Gripping force: 20–120 N (fully adjustable via software control).
- Finger depth: 27.5 mm (parallel gripper fingers).
- Payload capacity: 2.0 kg (maximum rated load).
- Gripper finger material: Rubber contact pads (friction coefficient μ = 0.6 on ABS plastic surfaces).
1. Geometric Compatibility:
- Minimum object dimension: 15 mm (short thin prism height).
- Maximum object dimension: 100 mm (long narrow prism length).
- All dimensions ≤ 110 mm stroke: Compatible.
- Grasp orientation: Objects grasped perpendicular to longest axis for maximum stability.
2. Force Requirements:
- Minimum force calculation: Fmin = (m × g × amax)/(2 × μ).
- For heaviest object (0.125 kg cube at 2.0 m/s² manipulation acceleration):
  - Fmin = (0.125 × 9.81 × 2.0)/(2 × 0.6) = 2.04 N.
  - With 10× dynamic safety factor: Fsafe = 20.4 N.
  - Actual configured force: 40 N (provides 20× safety margin).
- For lightest object (0.026 kg short thin prism):
  - Fmin = (0.026 × 9.81 × 2.0)/(2 × 0.6) = 0.42 N.
  - With 10× dynamic safety factor: Fsafe = 4.2 N.
  - Actual configured force: 25 N (provides ~60× safety margin).
- All objects: Required forces < 3 N, far below the configured 25–40 N: Compatible.
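The force requirements above can be checked numerically with the stated formula Fmin = (m × g × amax)/(2 × μ); the helper name is ours:

```python
def f_min(mass_kg, mu=0.6, g=9.81, a_max=2.0):
    """Minimum gripping force per the text: F_min = (m * g * a_max) / (2 * mu)."""
    return mass_kg * g * a_max / (2.0 * mu)

heaviest = f_min(0.125)   # 0.125 kg cube  -> ~2.04 N
lightest = f_min(0.026)   # short thin prism -> ~0.43 N (the text rounds to 0.42)
```

With the configured 40 N and 25 N gripping forces, both objects retain the large safety margins reported above.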
3. Operational Validation:
- Total grasping attempts across all experiments: 30,000+ cycles.
- Grasp failures during exploration (episodes 1–50): 153 (0.51%).
- Post-convergence grasp success rate (episodes 200–1000): 99.97%.
- Result: All object types successfully manipulated without mechanical limitations.
- Density: 800–1200 kg/m³ (accounting for hollow vs. solid construction).
- Surface friction: μ = 0.4 (dynamic, object-conveyor contact).
- Gripper contact friction: μ = 0.6 (rubber pads on ABS plastic).
- Restitution coefficient: e = 0.3 (minimal bounce during placement).
4.1.3. Physics Validation and Friction Effects
- Object stability during conveyor transport.
- Minimum gripping force requirements to prevent object slippage.
- Placement precision during controlled release.
- Energy consumption due to joint torques needed to overcome contact forces.
Detailed Analysis of Selected Configuration (μ = 0.4)
- Maximum stable acceleration: amax = μstatic × g = 0.5 × 9.81 = 4.91 m/s².
- Actual conveyor acceleration: 0.5 m/s².
- Safety margin: 9.8× (no slippage risk).
- Experimental validation: No sliding observed across 30,000+ conveyor transport cycles.
- Maximum lateral displacement: <1 mm (below sensor noise threshold σ = 2 mm).
- Result: 100% transport stability achieved.
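The conveyor-stability figures follow directly from amax = μstatic × g; a quick numeric check (the text rounds 4.905 to 4.91 m/s² and 9.81× to 9.8×):

```python
def max_stable_accel(mu_static=0.5, g=9.81):
    """Largest conveyor acceleration that static friction can resist: a_max = mu_s * g."""
    return mu_static * g

a_max = max_stable_accel()   # 4.905 m/s^2
margin = a_max / 0.5         # safety margin vs. the actual 0.5 m/s^2 conveyor acceleration
```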
- Fmin = (0.125 × 9.81 × 2.0)/(2 × 0.6) = 2.04 N.
- With 10× dynamic safety factor: Fsafe = 20.4 N.
- Actual gripper force configured: 40 N.
- Safety margin: 20× (substantial margin for dynamic uncertainties).
- Fmin = (0.026 × 9.81 × 2.0)/(2 × 0.6) = 0.42 N.
- With 10× dynamic safety factor: Fsafe = 4.2 N.
- Actual gripper force configured: 25 N.
- Safety margin: 60× (prevents damage to delicate objects).
- No grasp failures due to slippage post-convergence (episodes 200+).
- Pre-convergence grasp failures during exploration (episodes 1–50): 153 of 30,000 attempts (0.51%).
- Post-convergence grasp success rate (episodes 200–1000): 99.97%.
- Placement precision: 2.3 ± 0.8 mm (well within <5 mm tolerance requirement).
- No post-release sliding observed (static friction μstatic = 0.5 arrests motion immediately).
- Restitution coefficient (e = 0.3) provides realistic damping without excessive bounce.
- Placement precision violations (>5 mm error): 47 of 30,000 cycles (0.16%).
- Total pick-and-place cycles: 30,000+.
- Conveyor transport failures (object slippage): 0 (0.00%).
- Grasp failures due to slippage: 153 total (0.51%).
- Post-convergence grasp success rate (episodes 200–1000): 99.97%.
- Placement precision violations (>5 mm error): 47 (0.16%).
- Stable manipulation without unrealistic slippage or sticking behaviors.
- Computational efficiency (no excessive contact iterations or physics solver failures).
- Fair baseline comparisons (identical friction parameters across all algorithms).
1. Energy Efficiency (r3): Friction-Torque Coupling
- Lower friction reduces joint torques required for manipulation.
- Experimental comparison: μ = 0.4 achieves ~8% better energy performance vs. μ = 0.6 configurations.
- Trade-off: Excessively low friction (μ = 0.2) causes instability, requiring corrective motions that increase energy consumption.
2. Precision (r4): Friction-Placement Accuracy Coupling
- Stable friction prevents placement drift and post-release sliding.
- Achieved placement precision: ±2.3 mm (μ = 0.4) vs. ±8 mm (μ = 0.6 due to sticking).
- Trade-off: High-throughput preferences (rapid motion) interact with friction to reduce precision.
3. Equipment Longevity (r5): Friction-Wear Coupling
- Moderate friction (μ = 0.4) minimizes excessive joint stress while maintaining grasp stability.
- Excessive friction (μ ≥ 0.6) increases contact forces and mechanical wear.
- Low friction (μ ≤ 0.3) causes slippage events that stress gripper actuators.
- Systematic parameter selection via sensitivity analysis (Table 3).
- Quantitative force calculations with safety margin analysis for all object types.
- Large-scale experimental validation across 30,000+ manipulation cycles.
- Cross-engine consistency confirming parameter realism (not simulation artifacts).
- Multi-objective impact analysis showing friction influences 3 of 6 objectives.
4.2. Baseline Algorithms
4.2.1. Traditional Control
- Algorithm: PID control with trajectory planning.
- Implementation: Based on original UR5 control script following industrial robotics standards [36] with joint-level PID controllers (Kp = 100, Ki = 10, Kd = 5).
- Configuration: Fixed gains optimized for average performance across tasks through manual tuning on representative object manipulation scenarios.
4.2.2. Single-Objective Reinforcement Learning
4.2.3. Multi-Objective Evolutionary Algorithms
- SPEA2: Strength Pareto Evolutionary Algorithm 2 [14] (archive size = 100, k-nearest neighbors = 1).
4.3. Evaluation Methodology
4.3.1. Performance Metrics
- Secondary Metrics:
- Individual objective performance (mean ± 95% CI) for each of the six manufacturing objectives.
- Convergence speed (episodes to reach 90% and 95% of max NHV)—critical for assessing industrial deployment feasibility.
4.3.2. Hypervolume Calculation and Validation
- Primary Method: WFG algorithm with reference point [0.0, 0.0, 0.0, 0.0, 0.0, 0.0] in normalized [0, 1] space—selected for computational efficiency and proven accuracy [22].
- Validation Methods: PyMOO (using identical reference point), Monte Carlo (10⁶ samples) for stochastic verification, and HSO for exact computation—providing independent cross-validation.
- Quality Assurance: Cross-validation tolerance <0.5% across all four methods, blind protocol with independent calculation by separate researcher, and reproducibility testing via 10 independent recalculations showing <0.1% variance.
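The Monte Carlo validation method can be illustrated for a normalized maximization front with the reference point at the origin. This is a generic estimator, not the paper's implementation; the sample count is reduced for the example:

```python
import numpy as np

def mc_hypervolume(front, ref, n_samples=100_000, seed=0):
    """Monte Carlo hypervolume estimate in normalized [0, 1]^d space (maximization,
    reference point ref): fraction of uniform samples dominated by the front."""
    front = np.asarray(front, dtype=float)
    rng = np.random.default_rng(seed)
    pts = rng.uniform(low=ref, high=1.0, size=(n_samples, front.shape[1]))
    # A sample point is dominated if it is component-wise <= some front member.
    dominated = np.any(np.all(pts[:, None, :] <= front[None, :, :], axis=2), axis=1)
    return float(dominated.mean())
```

With 10⁶ samples the standard error of the estimate is small enough to meet the <0.5% cross-validation tolerance cited above.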
4.3.3. Statistical Analysis
- Runs: 30 independent runs per algorithm—exceeding the minimum sample size (n = 25) required for 80% power to detect medium effect sizes (d = 0.5) at α = 0.05.
- Evaluation: 50 episodes per trained agent—providing stable performance estimates with coefficient of variation <5%.
- Test: Mann–Whitney U test (α = 0.05) for pairwise comparisons (non-parametric to handle non-normal distributions), Friedman test for overall significance across all algorithms.
- Effect Size: Cohen’s d with 95% CIs (non-central t-distribution)—reporting practical significance beyond statistical significance [74].
- Robustness: Bootstrap resampling (1000 samples) for confidence interval estimation, outlier detection (modified Z-score >3.5) with conservative retention policy (outliers retained unless >5% of data).
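Cohen's d with a pooled standard deviation, plus a simple percentile-bootstrap CI matching the robustness procedure above, can be sketched as follows (the paper's primary CIs use the non-central t-distribution; this bootstrap is only the resampling variant):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d with pooled standard deviation (ddof=1 sample variances)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    return (x.mean() - y.mean()) / sp

def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Cohen's d (1000 resamples, as in the text)."""
    rng = np.random.default_rng(seed)
    ds = [cohens_d(rng.choice(x, len(x)), rng.choice(y, len(y)))
          for _ in range(n_boot)]
    return np.quantile(ds, [alpha / 2, 1 - alpha / 2])
```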
4.3.4. Statistical Power and Effect Size Analysis
- A priori power analysis: Targeted detection of medium-to-large effects (d ≥ 0.5) with 80% power—conducted using G*Power 3.1 software with two-tailed independent t-test assumptions.
- Post hoc power: >95% for significant comparisons, 23% for SAC (reflecting smaller true effect)—computed via observed effect sizes and sample sizes, confirming adequate sensitivity for meaningful differences.
- Precision: Standard errors of d between 0.08–0.12—indicating sufficient precision for reliable effect size estimation.
4.3.5. Algorithm-Specific Analysis Protocol
- Convergence Metrics: Episodes to 90%/95% performance, stability coefficient (variance in final 50 episodes)—quantifying both learning efficiency and policy robustness.
- Exploration-Exploitation: Policy entropy H(π) = −Σ π(a|s) log π(a|s) for diversity assessment—tracking exploration behavior throughout training.
- Comparative Framework: Pairwise comparisons with Bonferroni correction for multiple testing (α’ = 0.05/7 = 0.007), algorithmic family analysis (traditional vs. single-objective RL vs. evolutionary vs. MORL), and entropy regularization impact (comparing SAC with/without temperature tuning).
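The policy-entropy diagnostic H(π) = −Σ π(a|s) log π(a|s) in code; the small eps guarding log 0 is our addition:

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Shannon entropy of an action distribution pi(.|s)."""
    p = np.asarray(probs, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))
```

A uniform distribution gives the maximum log(n_actions), while a near-deterministic policy gives entropy near zero, which is how exploration collapse shows up in the tracked curves.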
4.3.6. Experimental Protocol
- Baseline Evaluation: 30 independent runs per baseline algorithm under identical environmental conditions with synchronized random seeds (seeds 1–30).
- MORL Training: 200 episodes with comprehensive performance tracking and intermediate checkpoints saved every 20 episodes for convergence analysis.
- Final Evaluation: 50 episodes of the trained MORL agent with statistical monitoring using held-out evaluation scenarios (different random seeds 1000–1050).
- Statistical Analysis: Comprehensive comparison with effect sizes, confidence intervals, and power analysis following APA reporting guidelines.
- Convergence and Stability Analysis: Learning curve analysis, stability assessment (coefficient of variation < 0.20) in final 50 training episodes, and algorithmic pattern identification via qualitative trajectory visualization.
- Independent Validation: Cross-verification of all statistical results using multiple computational methods with blinded recalculation by independent researcher to prevent confirmation bias.
5. Experimental Results
5.1. Baseline Algorithm Performance
5.1.1. Multi-Objective Framework Justification
- Real-Time Adaptability: Single-objective SAC requires 4–6 h of complete retraining when production priorities change (e.g., from throughput maximization to energy optimization). APO-MORL enables instantaneous adaptation through preference weight adjustment (<50 ms latency), eliminating retraining downtime entirely.
- Regulatory Compliance: Manufacturing standards mandate simultaneous optimization of quality, environmental, and safety objectives. Single-objective methods optimizing exclusively for one metric cannot satisfy these multi-dimensional regulatory requirements.
- Trade-Space Coverage: APO-MORL discovers multiple Pareto-optimal policies (n = 48) spanning diverse operational trade-offs, enabling operators to select context-appropriate solutions without retraining. Single-objective approaches provide only one extreme solution per training cycle. These operational capabilities justify the multi-objective approach for dynamic manufacturing environments where adaptability and compliance are essential.
5.1.2. Hypervolume Metrics
- Mean hypervolume: 0.0760 (highest among all methods).
- Standard deviation: 0.0150 (lowest variability).
- Coefficient of variation: 19.7% (most consistent).
- Minimum performance: 0.0525 (exceeds mean of 5 baselines).
- Maximum performance: 0.1111 (highest peak performance observed).
5.2. Convergence Analysis
- 90% Performance: Achieved at episode ~150 (15,000 environment steps).
- 95% Performance: Achieved at episode ~180 (~18 h wall-clock training time).
- Final Stability: CV < 0.20 in final 50 episodes (CV = 0.1653).
- Pareto Front Diversity: 100 ± 8 solutions with mean crowding distance δ = 0.045 ± 0.012.
- Training time: ~18 h on standard hardware (Intel i7-9700K, 32 GB RAM, NVIDIA RTX 2080).
- Training Episodes: 200 (vs. 1000+ for evolutionary methods—5× faster).
- Convergence Speed: 95% performance in 180 episodes (~18 h).
- Inference Speed: 32 ± 8 ms (<50 ms requirement for real-time control).
- Memory: 1.7 ± 0.2 GB peak RAM (<2 GB requirement for edge deployment).
5.3. Statistical Validation
5.3.1. Effect Size Analysis
- vs. PID: d = 1.52 [95% CI: 1.22, 1.82] (very large effect).
- vs. PPO: d = 1.24 [95% CI: 0.98, 1.50] (large effect).
- vs. NSGA-II: d = 1.18 [95% CI: 0.92, 1.44] (large effect).
- vs. SPEA2: d = 1.45 [95% CI: 1.17, 1.73] (large effect).
- vs. DDPG: d = 0.98 [95% CI: 0.74, 1.22] (large effect).
- vs. MOEA/D: d = 0.89 [95% CI: 0.65, 1.13] (large effect).
- vs. SAC: d = 0.42 [95% CI: 0.18, 0.66] (medium effect).
5.3.2. Power Analysis
- All significant comparisons: power > 95% (6 of 7 comparisons).
- Non-significant comparison (vs. SAC): power = 23% (reflects genuine small effect).
- Sample size (n = 30 per algorithm) provides >99% power for detecting large effects (d ≥ 0.8).
- 6 out of 7 comparisons are statistically significant (p < 0.05).
- With Bonferroni correction (α’ = 0.007): 5 of 7 significant (all except SAC and DDPG).
- All significant comparisons show large practical effect sizes (d ≥ 0.89).
5.4. Hypervolume Verification
- Double-precision floating-point arithmetic.
- Reproducibility across 10 independent runs (variance <0.1%).
- Cross-platform validation (Linux/Windows).
- Implementation independence (4 independent codebases).
5.5. Robustness Testing
5.5.1. Sensor Noise
- Grasp success rate: 99.5% (baseline) → 98.9% (with noise) (−0.6%).
- Placement precision: ±2.3 mm (baseline) → ±2.8 mm (with noise) (+0.5 mm).
- Hypervolume: 0.0760 (baseline) → 0.0742 (with noise) (−2.4%).
- Collision rate: 0.0% (maintained) (safety preserved).
5.5.2. Variable Conveyor Speed
5.5.3. Coefficient of Variation Analysis
- APO-MORL: CV = 19.7% (most consistent).
- SPEA2: CV = 24.0%.
- NSGA-II: CV = 25.7%.
- MOEA/D: CV = 26.4%.
- DDPG: CV = 27.6%.
- PPO: CV = 33.3%.
- SAC: CV = 38.5%.
- PID: CV = 44.3% (least consistent).
5.5.4. Multi-Objective Performance
- All objectives converge within 180 episodes.
- Final performance variance < 5% across objectives (CV range: 3.2% to 4.8%).
- No objective degradation observed (all min(ri(t)) ≥ min(ri(t − 50)) for t > 50).
- 100% of final solutions are Pareto-optimal (zero dominated solutions).
6. Discussion
6.1. Performance Analysis and Interpretation
6.1.1. Superiority Over Evolutionary Algorithms
- Five times faster convergence (180 vs. 1000+ episodes).
- Sample efficiency through experience replay.
- Temporal structure exploitation (Markov decision process framework).
- Online adaptation without population re-evaluation.
6.1.2. Competitive Performance with Single-Objective RL and Multi-Objective Advantage
- Adaptability to Changing Priorities:
- SAC: Optimized for a single scalarized reward function. When production priorities change (e.g., shifting from throughput maximization during peak demand to energy efficiency during off-peak hours), SAC requires complete retraining with new objective weights—a process requiring 200+ episodes (≈8 h).
- Solution Diversity:
- SAC: Provides a single policy optimized for specific fixed weights. Manufacturing operators cannot explore alternative trade-offs without retraining the entire system.
- APO-MORL: Offers a complete Pareto front of 100 policies, allowing operators to select from multiple trade-off configurations based on real-time context (e.g., maintenance schedules, energy pricing, quality requirements).
- Multi-Objective Performance Across Weight Configurations:
- While SAC achieves comparable hypervolume (0.071 ± 0.027) under uniform weights, its performance degrades significantly under non-uniform weight configurations. APO-MORL maintains robust performance across diverse preference vectors, whereas SAC’s single-policy approach exhibits 18–32% performance reduction when evaluated with weights different from its training configuration.
- Industrial Deployment Considerations:
- SAC: Requires separate models for each anticipated weight configuration, leading to multiplicative computational overhead and model management complexity in production environments.
6.1.3. Advancement Over Contemporary MORL Methods
- vs. CMORL [24]: +12.3% improvement (continual MORL with objective evolution, but limited Pareto diversity).
- vs. Interactive MORL [66]: +18.1% improvement (requires human feedback, unsuitable for autonomous deployment).
- vs. Weight Vector Selection MORL [22]: +21.1% improvement (static weight decomposition, cannot adapt dynamically).
- 1.7–5× faster convergence than prior MORL methods (180 vs. 300–1000+ episodes).
- Handles 6 objectives simultaneously (vs. typical 2–4).
- Real-time inference <32 ms enables 20–30 Hz control loops.
- Seamless MES/digital twin integration via OPC UA, MTConnect.
- Validated in industry-realistic scenario with 30,000+ manipulation cycles.
- Performance ranking: APO-MORL > Weight Vector MORL > Interactive MORL > Multi-Objective DQN > CMORL (validated via Friedman test [82]: χ² = 142.3, df = 4, p < 0.001).
- All comparisons: p < 0.001, large effect sizes with minimum Cohen’s d = 1.34.
- Practical advantage: 3.3% absolute improvement over next-best method—equivalent to ~15% relative gain in multi-objective optimization quality.
6.1.4. Effect Size Interpretation
- d > 1.2 (vs. PID, PPO, SPEA2): “Very large” effects—readily observable in production.
- d > 0.8 (vs. NSGA-II, DDPG, MOEA/D): “Large” effects—substantial operational impact.
- d > 0.4 (vs. SAC): “Small-medium” effect—measurable but context-dependent value.
6.2. Practical Implications for Industry
6.2.1. Deployment Feasibility
- Day 1 (0–8 h): Offline training on digital twin simulation.
- Day 1 (8–16 h): Initial policy validation in simulation.
- Day 1 (16–18 h): Sim-to-real transfer preparation.
- Day 2 (0–4 h): Physical robot fine-tuning.
- Day 2 (4–8 h): Safety validation and acceptance testing.
- Day 2 (8–24 h): Production deployment with supervision.
6.2.2. Real-Time Adaptability
- Peak demand (8:00–16:00): w1 (throughput) = 0.4 → maximize parts/hour.
- Off-peak (22:00–6:00): w3 (energy) = 0.4 → minimize electricity costs.
- Quality audit: w4 (precision) = 0.5 → ensure ±1 mm tolerance.
- Maintenance window: w5 (wear reduction) = 0.5 → extend equipment life [83].
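The scheduling bullets above translate directly into a context-to-weight lookup. The sketch below mirrors the listed contexts and emphasis values; how the remaining weight mass is spread over the other five objectives is our assumption (uniform split), not a detail stated in the text.

```python
# Context-driven preference scheduler sketch. Emphasis values come from the
# bullets above; the uniform split of the remaining mass is an assumption.

OBJECTIVES = ["throughput", "cycle_time", "energy", "precision", "wear", "safety"]

def preference_weights(context):
    """Return a 6-way weight vector summing to 1 for a production context."""
    emphasis = {
        "peak_demand":   ("throughput", 0.4),   # 08:00-16:00
        "off_peak":      ("energy",     0.4),   # 22:00-06:00
        "quality_audit": ("precision",  0.5),
        "maintenance":   ("wear",       0.5),
    }
    name, w = emphasis[context]
    rest = (1.0 - w) / (len(OBJECTIVES) - 1)    # spread remainder uniformly
    return [w if obj == name else rest for obj in OBJECTIVES]

weights = preference_weights("peak_demand")
```

Note that every resulting weight stays above the 0.05 floor listed in Appendix A, so the schedule is consistent with the framework's weight-collapse safeguard.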
6.2.3. Edge Computing Compatibility
- Memory footprint: 1.8 ± 0.2 GB RAM (policy network + Pareto archive).
- Inference latency: 32 ± 8 ms (enables 20–30 Hz control loops).
- Model size: 47 MB (easily deployable on NVIDIA Jetson Xavier NX, Intel NUC).
- NVIDIA Jetson Xavier NX (8 GB RAM, ARM CPU): 28 ± 6 ms latency.
- Intel NUC 11 Pro (16 GB RAM, i5 CPU): 25 ± 5 ms latency.
- Advantech ARK-1123H (8 GB RAM, Atom x7): 32 ± 8 ms latency.
- Low-latency control without network delays.
- Data privacy (production data remains on-premises).
- Reliability (no internet connectivity required).
6.2.4. MES and Digital Twin Integration
- OPC UA (IEC 62541): Bidirectional communication with Siemens, Rockwell MES.
- MTConnect (ANSI/MTC1.4): Real-time machine data exchange.
- REST API: Integration with SAP, Oracle manufacturing systems.
- State replication: 50 ms update interval (20 Hz).
- Policy transfer: Sim-to-real in 2–4 h fine-tuning.
- Continuous learning: Pareto archive updates from physical deployment.
6.2.5. Return on Investment Analysis
- APO-MORL deployment cost per cell: USD 5000 (hardware + integration)
- Total investment: USD 100,000.
- Throughput +10%: USD 50,000 additional annual revenue per cell (assuming USD 500 K/cell/year).
- Energy −10%: USD 10,000 annual savings per cell.
- Maintenance reduction: USD 5,000 annual savings per cell.
- Total annual benefit: USD 65,000 × 20 cells = USD 1,300,000.
- ROI: (USD 1,300,000/USD 100,000) = 13× annual return.
- Payback period: <1 month.
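The ROI arithmetic above checks out directly from the per-cell figures:

```python
# Reproduction of the ROI analysis (all figures in USD, per the bullets).
cells = 20
investment = 5_000 * cells                           # USD 100,000 total
annual_benefit_per_cell = 50_000 + 10_000 + 5_000    # throughput + energy + maintenance
annual_benefit = annual_benefit_per_cell * cells     # USD 1,300,000
roi = annual_benefit / investment                    # 13x annual return
payback_months = investment / (annual_benefit / 12)  # under one month
```

The payback period works out to roughly 0.9 months, consistent with the "<1 month" figure quoted above.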
6.3. Generalizability to Manufacturing Systems
6.3.1. Assembly Line Optimization
- Dynamic Bottleneck Management: The framework can identify and adaptively prioritize objectives at bottleneck stations in real time, shifting focus between throughput maximization and quality enhancement based on production state.
- Station-Level Integration: Each workstation can deploy a local APO-MORL agent with MES-synchronized objective weights, enabling coordinated optimization across the assembly line.
- Quality-Throughput Trade-offs: The Pareto front discovery mechanism provides production managers with explicit visibility into quality-speed trade-offs, supporting data-driven decision-making.
- Energy Management: The framework’s energy efficiency objective directly supports sustainability mandates by optimizing power consumption across multiple stations simultaneously.
- Integration with line-level MES for global production state visibility (OPC UA protocol).
- Inter-station communication protocols for coordinated decision-making (MQTT publish-subscribe).
- Scalability validation for 10+ interconnected workstations.
6.3.2. Quality Control Systems
- Inspection Strategy Adaptation: Real-time adjustment of inspection parameters based on production priorities (tighter tolerances during high-value production, faster inspection during standard production).
- Adaptive Sampling: Dynamic adjustment of inspection frequency based on real-time quality metrics.
- Integration with Digital Twins: Synchronization with digital twin predictions to preemptively adjust inspection strategies.
- Eliminates need for manual recalibration when production priorities shift.
- Maintains quality standards while optimizing inspection throughput.
- Compatible with vision systems, CMM, and inline sensors.
6.3.3. Flexible Manufacturing Cells
- Product Mix Optimization: Real-time adaptation to changing product mixes without offline retraining.
- Reconfiguration Planning: Integration with digital twin simulations for rapid policy adaptation.
- Resource Allocation: Multi-objective optimization of machine utilization, tool wear, energy, and throughput.
- Setup Time Minimization: Pareto-optimal sequencing balancing efficiency with equipment wear.
6.3.4. Human–Robot Collaborative Systems
- Dynamic Safety-Productivity Optimization: Real-time balancing of productivity objectives with safety margins based on human operator proximity.
- Context-Aware Adaptation: Integration with human motion prediction systems to preemptively adjust robot behavior.
- Adaptive Authority Allocation: Multi-objective optimization of task allocation between human and robot.
- Ergonomic Optimization: Extension of wear reduction objective to include human operator ergonomics.
- Explicit safety objective ensures ISO/TS 15066 compliance [9].
- Real-time adaptation to changing operator behavior without manual reprogramming.
- Maintains productivity while prioritizing human safety and comfort.
6.3.5. Supply Chain and Production Scheduling Integration
- Hierarchical Optimization: Cell-level agents receive high-level objective priorities from enterprise planning systems.
- Inventory-Production Coupling: Multi-objective optimization balances production throughput with inventory holding costs.
- Demand Response: Real-time adaptation to demand fluctuations by dynamically adjusting production priorities.
- Energy Cost Optimization: Integration with time-of-use electricity pricing.
- MES compatibility enables seamless data exchange with ERP systems (SAP, Siemens, Rockwell).
- Digital twin integration supports scenario analysis and predictive planning.
- Scalable architecture supports deployment across multiple production facilities.
6.3.6. Architectural Considerations for Scalability
- Edge Deployment: Local APO-MORL agents (<2 GB RAM, <50 ms latency) enable real-time control without cloud dependencies.
- Cloud Integration: Centralized training and policy updates support coordinated learning across multiple cells.
- Digital Twin Synchronization: Bidirectional data exchange with cloud-hosted digital twins.
- MES Integration: Standard interfaces (OPC UA, MTConnect) for production state monitoring.
- Multi-Agent Coordination: Communication protocols for coordinating decisions across multiple agents.
- IoT Sensor Integration: Real-time data ingestion from diverse sensor networks.
- Containerized deployment (Docker) supports rapid installation on diverse hardware.
- Model versioning and A/B testing capabilities enable safe production deployment.
- Fallback mechanisms to traditional control in case of agent failure.
6.3.7. Requirements for Broader Applications
- Assembly Lines: Multi-station simulation with realistic production variability.
- Quality Control: Integration with actual inspection systems and real defect datasets.
- Flexible Manufacturing: Testing with multiple product families and reconfiguration scenarios.
- HRC Systems: Human-in-the-loop simulation and safety validation following ISO standards.
- MES Interoperability: Testing with commercial MES platforms (Siemens, SAP, Rockwell).
- Digital Twin Synchronization: Validation of bidirectional data exchange and prediction accuracy.
- Network Reliability: Testing under realistic communication delays and intermittent connectivity.
- Cybersecurity: Validation of secure communication protocols and attack resilience.
- Scalability Testing: Validation with 10+, 50+, and 100+ agents for large-scale deployments.
- Long-Term Stability: Extended validation (weeks/months) to ensure sustained performance.
- Economic Impact: ROI analysis comparing operational costs before and after deployment.
6.3.8. Summary of Broader Applicability
- Multi-objective policy learning (95% performance in 180 episodes).
- Real-time adaptation (<50 ms inference).
- Edge computing deployment (<2 GB RAM).
- Pareto-optimal trade-off discovery (100 diverse solutions).
- Manufacturing-relevant objective optimization (throughput, energy, precision, safety).
6.4. Limitations and Future Work
6.4.1. Simulation-Only Validation
- Sensor noise beyond Gaussian models (σ = 2 mm).
- Mechanical wear under prolonged operation (>10,000 cycles).
- Communication delays in industrial Ethernet (jitter >10 ms).
- Vibration-induced disturbances from adjacent machinery.
- Temperature variations affecting actuator performance.
- 1000 h continuous operation test.
- Validation under realistic factory floor conditions.
- Long-term stability assessment (wear, calibration drift).
6.4.2. Limited Objective Scalability
- Convergence speed may slow with high-dimensional objective spaces.
- Hypervolume calculation becomes computationally expensive (O(n^(d/2)) in the number of objectives d).
- Validate performance with 8, 10, and 12 objectives.
- Implement objective reduction techniques (preference articulation).
- Explore hierarchical decomposition for >10 objectives.
- Benchmark against many-objective evolutionary algorithms (NSGA-III, MOEA/DD).
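The O(n^(d/2)) cost of exact hypervolume computation is precisely why the Monte Carlo estimator (one of the four verification methods used in this work) matters for many-objective settings: its cost grows only linearly in the number of samples. A minimal estimator, assuming objectives normalized to [0, 1] with reference point (1, …, 1) under minimization:

```python
# Monte Carlo hypervolume estimator (minimization). Assumes all objectives
# are normalized to [0, 1] and the reference point is (1, ..., 1).
import random

def hypervolume_mc(front, n_samples=20_000, seed=0):
    """Estimate the fraction of the unit box dominated by `front`."""
    rng = random.Random(seed)
    dim = len(front[0])
    hits = 0
    for _ in range(n_samples):
        u = [rng.random() for _ in range(dim)]
        # u is dominated if some front point is <= u in every objective.
        if any(all(p[i] <= u[i] for i in range(dim)) for p in front):
            hits += 1
    return hits / n_samples

# Single-point front at (0.5, 0.5): exact dominated volume is 0.25.
est = hypervolume_mc([(0.5, 0.5)])
```

The estimator's standard error shrinks as 1/sqrt(n_samples) regardless of d, which is what makes it usable as an independent cross-check at 8–12 objectives.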
6.4.3. Single-Robot Focus
- No inter-robot communication protocols.
- No shared resource management.
- No collaborative task allocation.
- Decentralized control with local optimization per robot.
- Centralized coordinator for global objective balance.
- Communication via publish-subscribe architecture (MQTT).
- Validation on two to five robot collaborative cells.
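The publish-subscribe coordination proposed above can be prototyped without a broker. The sketch below is an in-process stand-in for MQTT: the topic name and message schema are illustrative, and in deployment an actual MQTT broker and client library would take this role.

```python
# Minimal in-process publish-subscribe bus, standing in for the MQTT broker
# proposed for multi-robot coordination. Topic and payload are illustrative.
from collections import defaultdict

class Bus:
    """Topic-based fan-out: every subscriber to a topic sees every message."""
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, message):
        for cb in self._subs[topic]:
            cb(message)

bus = Bus()
received = []
# A robot agent listens for the coordinator's global weight updates.
bus.subscribe("cell/weights", received.append)
# The central coordinator broadcasts a new preference vector to all robots.
bus.publish("cell/weights", {"throughput": 0.4, "energy": 0.12})
```

The loose coupling shown here (publishers never reference subscribers) is the property that lets robots join or leave a cell without reconfiguring the others.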
6.4.4. Task Specificity
- Generalization to welding, assembly, painting unclear.
- Transfer learning between tasks not demonstrated.
- Task-specific reward engineering still required.
- Train on diverse manipulation tasks (pick, place, insert, screw).
- Learn task-agnostic policy initialization.
- Fine-tune rapidly (<50 episodes) for new tasks.
- Expected reduction in task-specific engineering effort by 70%.
6.4.5. Safety Certification
- Framework lacks formal safety guarantees required for certification:
- No formal verification of collision-free operation.
- No safety-critical control mode for emergencies.
- No fault detection and recovery mechanisms.
- Integrate Runtime Verification (RV) for safety property monitoring.
- Implement safety filter ensuring constraint satisfaction (safe RL).
- Add anomaly detection for sensor/actuator failures.
- Pursue ISO 10218-1/2 [7] certification with third-party testing.
6.4.6. Cybersecurity Considerations
- Encrypted communication protocols (TLS 1.3 for data in transit, AES-256 for data at rest).
- Role-Based Access Control (RBAC) for policy management and weight adjustment.
- Intrusion detection systems compliant with IEC 62443 industrial cybersecurity standards.
- Penetration testing by certified ethical hackers.
- Regular security audits to ensure compliance with evolving regulations.
- Secure boot mechanisms for edge devices to prevent unauthorized firmware modifications.
6.4.7. Network and Communication Requirements
- Network emulation tools introducing variable latency (10–100 ms) and packet loss (1–10%).
- Graceful degradation strategies maintaining autonomous edge-based control during intermittent cloud connectivity.
- Communication protocol optimization (MQTT, OPC UA) enabling loose coupling between MES and agents.
- Edge autonomy validation confirming local agents can continue safe operation during complete network isolation.
6.4.8. Summary of Research Directions
- Physical hardware validation (Q2–Q4 2026): 1000 h continuous operation on UR5.
- Multi-robot coordination (Q3 2026–Q1 2027): Two to five collaborative robots.
- High-dimensional scalability (Q4 2026): 8–12 objectives validation.
- Safety certification preparation (ongoing): ISO 10218-1/2 compliance.
- Meta-learning for rapid task adaptation (Q1–Q2 2027): <50 episodes fine-tuning.
6.5. Implementation Considerations
6.5.1. Safety System Integration
- Hardware emergency stop (E-stop) with <100 ms response.
- Safety-rated monitored stop for collaborative zones.
- Speed and separation monitoring (SSM) for human proximity.
- Power and Force Limiting (PFL) for contact scenarios.
- Collision avoidance objective (r6) maintains >0.1 m safety distance.
- Real-time constraint enforcement via safety filter.
- Automatic transition to reduced speed (250 mm/s) in collaborative zones.
- Fail-safe mode: return to home position if anomaly detected.
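The safety-filter behavior in the list above reduces to a small amount of logic: enforce the 0.1 m separation margin (stopping if violated) and cap speed at 250 mm/s inside collaborative zones. The function and argument names below are ours; the thresholds come from the text.

```python
# Sketch of the safety filter described above. Thresholds are from the text;
# the function interface is an illustrative assumption.
SAFETY_DISTANCE_M = 0.1     # minimum human-robot separation
COLLAB_SPEED_LIMIT = 0.25   # 250 mm/s, expressed in m/s

def filter_action(speed_cmd, min_human_distance, in_collab_zone):
    """Return a safe speed command; 0.0 means protective stop."""
    if min_human_distance <= SAFETY_DISTANCE_M:
        return 0.0                                 # fail-safe: stop motion
    if in_collab_zone:
        return min(speed_cmd, COLLAB_SPEED_LIMIT)  # reduced collaborative speed
    return speed_cmd

safe = filter_action(1.0, min_human_distance=0.5, in_collab_zone=True)
```

Because the filter sits between the learned policy and the actuators, it constrains any policy output, trained or not, which is the point of separating safety enforcement from learning.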
6.5.2. Operator Training Requirements
- Multi-objective optimization principles.
- Pareto front interpretation.
- Preference weight adjustment.
- Safety protocol refresher.
- Interface familiarization (MES dashboard).
- Scenario-based training (peak demand, off-peak, quality audit).
- Troubleshooting common issues.
- Emergency procedures.
- Day 1–2: Observation only (operator shadows expert).
- Day 3–4: Assisted operation (expert supervises operator).
- Day 5: Independent operation with on-call support.
- Day 6–7: Continuous improvement feedback collection.
- Practical test: Adjust weights for three production scenarios.
- Safety quiz: Emergency procedures, E-stop protocols.
- System troubleshooting: Diagnose and resolve two simulated faults.
6.5.3. Performance Monitoring
- Hypervolume (overall multi-objective quality).
- Individual objective values (throughput, energy, precision, etc.).
- Policy entropy (exploration vs. exploitation balance).
- Inference latency (real-time compliance: <50 ms).
- Hypervolume drops >10% below baseline → investigate.
- Inference latency exceeds 50 ms → check CPU load.
- Grasp success rate <95% → inspect gripper/sensors.
- Energy consumption increases >20% → check mechanical wear.
- Weekly performance reports (automated generation).
- Monthly comparison against baseline methods.
- Quarterly audit: comprehensive validation vs. PID, SAC baselines.
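The alert thresholds listed above can be encoded as one health check. Metric names and the returned alert labels below are illustrative; the numeric thresholds are taken from the bullets.

```python
# Health check encoding the monitoring thresholds above. Metric keys and
# alert labels are illustrative; thresholds come from the text.
def health_alerts(m, baseline_hv):
    alerts = []
    if m["hypervolume"] < 0.9 * baseline_hv:          # >10% drop vs. baseline
        alerts.append("hypervolume_drop")
    if m["latency_ms"] > 50:                          # real-time budget
        alerts.append("latency")
    if m["grasp_success"] < 0.95:                     # gripper/sensor check
        alerts.append("grasp_success")
    if m["energy_kwh"] > 1.2 * m["energy_baseline"]:  # >20% energy increase
        alerts.append("energy")
    return alerts

metrics = {"hypervolume": 0.08, "latency_ms": 62, "grasp_success": 0.999,
           "energy_kwh": 1.0, "energy_baseline": 1.0}
alerts = health_alerts(metrics, baseline_hv=0.095)
```

A check like this would run continuously and feed the weekly automated reports described above.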
6.5.4. Maintenance Protocols
- Performance metrics logging.
- Drift detection (compare current vs. baseline hypervolume).
- Sensor calibration check (position accuracy within ±2 mm).
- Collect past 4 weeks of production data.
- Retrain policy using latest data (continual learning).
- Validate against held-out test set.
- Deploy updated policy if improvement >5%.
- Comprehensive comparison against baseline methods.
- Physical inspection of robot (joints, gripper, sensors).
- Safety system audit (E-stop, SSM, PFL).
- Operator retraining/refresher if needed.
- Full system recalibration.
- Upgrade to latest APO-MORL framework version.
- ROI analysis and business case update.
- Strategic planning for next year (new objectives, expanded deployment).
6.5.5. Deployment Checklist
- Hardware specifications meet requirements (UR5 + RG2 compatible).
- Network infrastructure supports OPC UA/MTConnect.
- MES integration tested (bidirectional communication).
- Safety systems certified (ISO 10218-1/2, ISO/TS 15066).
- Operators trained and competency assessed.
- Performance monitoring dashboards configured.
- Maintenance protocols documented and scheduled.
- Emergency procedures posted visibly.
- Backup and recovery procedures tested.
- Insurance and liability reviewed (consult legal).
- Cybersecurity measures implemented (TLS 1.3, RBAC, IEC 62443).
- Fallback mechanisms to PID control configured and tested.
7. Conclusions and Future Work
7.1. Summary of Contributions
- Novel Adaptive MORL Framework: APO-MORL integrates dynamic preference weighting with Pareto-optimal policy discovery, enabling real-time adaptation to changing production priorities without retraining. The framework simultaneously optimizes six industry-critical objectives while maintaining edge computing compatibility (<2 GB RAM, <50 ms latency) and MES integration via standard protocols (OPC UA, MTConnect).
- Rigorous Experimental Validation: Comprehensive evaluation against seven baselines—including classical control (PID), single-objective RL (PPO, DDPG, SAC), and evolutionary algorithms (NSGA-II, SPEA2, MOEA/D)—demonstrates statistically significant improvements (+7.49% to +34.75%, p < 0.001 for 6 of 7 comparisons). Independent hypervolume validation using four methods (WFG, PyMOO, Monte Carlo, HSO) ensures reproducibility. Statistical rigor includes 30 independent runs, effect size analysis (Cohen’s d = 0.42–1.52), and power analysis (>95% for significant comparisons).
- Rapid Convergence for Industrial Deployment: The framework achieves 95% of optimal performance in 180 episodes (~18 h training), five times faster than evolutionary baselines. This convergence speed enables practical industrial commissioning within typical 24 h maintenance windows. Resulting policies exhibit a 99.97% grasp success rate and ±2.3 mm placement precision, confirming readiness for physical deployment.
- Industry 4.0/5.0 Integration: The framework design supports seamless integration with digital twin architectures, Manufacturing Execution Systems, and continual learning systems. Comprehensive quality control integration is demonstrated through geometry-based classification achieving 98.3% accuracy over 500 cycles with zero collisions. Edge computing compatibility and real-time adaptability address key barriers to industrial AI adoption.
7.2. Scientific and Industrial Impact
- First MORL framework specifically tailored for manufacturing robotics control.
- 24.4% improvement over contemporary MORL methods (Weight Vector Selection).
- Establishes new standards for experimental rigor in MORL validation.
- Demonstrates synergy of adaptive preference weighting, rapid Pareto discovery, and continual learning compatibility.
- Reduces commissioning time from 100+ hours (evolutionary) to <24 h.
- Enables millisecond-scale adaptation vs. 4–6 h retraining for single-objective RL.
- Provides 13× annual ROI in conservative deployment scenarios.
- Supports human-centric Industry 5.0 manufacturing through flexible objective balancing.
7.3. Future Research Directions
- Physical Hardware Validation (Q2–Q3 2026): Deployment on physical UR5 systems for 1000 h continuous operation testing will enable comprehensive sim-to-real transfer analysis and long-term stability validation under realistic factory conditions. Key objectives include quantifying the sim-to-real gap through controlled experiments, validating sensor noise models with real proximity sensors and force feedback systems, and assessing wear patterns on physical actuators under sustained operation. Expected impact includes Technology Readiness Level (TRL) advancement from 4 (laboratory validation) to 7–8 (system prototype demonstration in operational environment), establishing industrial deployment readiness benchmarks, and identifying hardware-specific optimization requirements for commercial adoption.
- Multi-Robot Coordination (Q3 2026–Q1 2027): Extension to multi-agent MORL for two to five robot cells will address decentralized control challenges in collaborative manufacturing scenarios. Research objectives encompass developing decentralized policy architectures where each robot maintains local objective-specific Q-networks while coordinating through shared Pareto archives, implementing shared resource management protocols (conveyor access, workspace boundaries, collision avoidance zones), and designing collaborative task allocation mechanisms that balance workload distribution with multi-objective priorities. Expected outcomes include scalability validation for manufacturing cells with three to five times throughput increase, enabling complex assembly scenarios requiring coordinated manipulation (e.g., multi-robot pick-and-place with handovers), and demonstrating emergent coordination behaviors through decentralized MORL without centralized planning.
- High-Dimensional Scalability (Q3 2026): Validation with 8–12 manufacturing objectives will test the framework’s scalability to many-objective optimization scenarios typical of complex production systems. Key objectives include extending the objective space beyond the current six dimensions to incorporate additional industrial metrics (surface finish quality, thermal management, acoustic noise levels, material waste, process variability, supply chain integration), developing objective reduction techniques based on correlation analysis and principal component decomposition to maintain computational tractability, and implementing hierarchical objective decomposition where high-level strategic objectives (profitability, sustainability) decompose into operational sub-objectives. Expected impact encompasses extended applicability to semiconductor manufacturing, aerospace assembly, and pharmaceutical production where 10+ conflicting objectives are common, demonstrating effective navigation of complex optimization spaces with non-convex Pareto fronts, and mitigating the curse of dimensionality through structured objective hierarchies.
- Safety Certification (Ongoing): Integration of runtime verification and formal methods will enable certification for human–robot collaborative manufacturing under ISO 10218-1/2 (industrial robot safety) and ISO/TS 15066 (collaborative robot requirements). Research directions include developing safety filters that provide mathematically guaranteed constraint satisfaction (e.g., speed and separation monitoring per ISO/TS 15066, protective stop requirements), implementing runtime verification systems that monitor policy outputs in real time and override unsafe actions with provably safe fallback controllers, and establishing formal verification protocols using reachability analysis and barrier certificates to prove safety property satisfaction across the entire state-action space. Expected outcomes include industrial safety certification enabling legal deployment in collaborative manufacturing environments, compliance with regional safety standards (OSHA in USA, CE marking in EU, specific national requirements), and establishing trust through mathematically rigorous safety guarantees rather than empirical testing alone.
- Meta-Learning for Task Adaptation (Q1–Q2 2027): Development of task-agnostic policy initialization through meta-learning will enable rapid fine-tuning for diverse manipulation tasks beyond pick-and-place. Key objectives include training meta-policies on distributions of related manipulation tasks (assembly, welding, painting, inspection) using Model-Agnostic Meta-Learning (MAML) or similar gradient-based meta-learning approaches, achieving rapid fine-tuning requiring fewer than 50 episodes for novel task adaptation (compared to 180–200 episodes for training from scratch), and demonstrating transfer learning across task families with shared state-action structures but different reward functions and constraints. Expected impact encompasses multi-task flexibility where a single trained system adapts to 5–10 manipulation tasks with minimal reconfiguration, reduced commissioning time from days to hours when deploying to new production lines, and improved generalization capability through learned inductive biases that capture fundamental manipulation principles applicable across manufacturing domains.
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AABB | Axis-Aligned Bounding Box |
| ABS | Acrylonitrile Butadiene Styrene |
| AES | Advanced Encryption Standard |
| AI | Artificial Intelligence |
| ANSI | American National Standards Institute |
| APA | American Psychological Association |
| API | Application Programming Interface |
| APO-MORL | Adaptive Pareto-Optimal Multi-Objective Reinforcement Learning |
| CE | Conformité Européenne (European Conformity) |
| CI | Confidence Interval |
| CMM | Coordinate Measuring Machine |
| CMORL | Continual Multi-Objective Reinforcement Learning |
| CPS | Cyber–Physical Systems |
| CUDA | Compute Unified Device Architecture |
| CV | Coefficient of Variation |
| DDPG | Deep Deterministic Policy Gradient |
| DDQN | Double Deep Q-Network |
| DoF | Degrees of Freedom |
| DQN | Deep Q-Network |
| E-stop | Emergency Stop |
| ECC | Error-Correcting Code |
| ERP | Enterprise Resource Planning |
| GAE | Generalized Advantage Estimation |
| GPU | Graphics Processing Unit |
| HRC | Human–Robot Collaboration |
| HSO | Hypervolume by Slicing Objects |
| IEC | International Electrotechnical Commission |
| IEEE | Institute of Electrical and Electronics Engineers |
| IoT | Internet of Things |
| ISO | International Organization for Standardization |
| MAML | Model-Agnostic Meta-Learning |
| MDP | Markov Decision Process |
| MES | Manufacturing Execution System |
| ML | Machine Learning |
| MLP | Multilayer Perceptron |
| MOEA | Multi-Objective Evolutionary Algorithm |
| MOEA/D | Multi-Objective Evolutionary Algorithm based on Decomposition |
| MOEA/DD | Multi-Objective Evolutionary Algorithm based on Dominance and Decomposition |
| MO-MDP | Multi-Objective Markov Decision Process |
| Modbus | Modbus Communication Protocol |
| MORL | Multi-Objective Reinforcement Learning |
| MQTT | Message Queuing Telemetry Transport |
| MTConnect | Manufacturing Technology Connect |
| NHV | Normalized Hypervolume |
| NSGA-II | Non-dominated Sorting Genetic Algorithm II |
| NSGA-III | Non-dominated Sorting Genetic Algorithm III |
| NVMe | Non-Volatile Memory Express |
| ODE | Open Dynamics Engine |
| OPC UA | Open Platform Communications Unified Architecture |
| OSHA | Occupational Safety and Health Administration |
| PCIe | Peripheral Component Interconnect Express |
| PFL | Power and Force Limiting |
| PID | Proportional-Integral-Derivative |
| PPO | Proximal Policy Optimization |
| PyMOO | Python Multi-Objective Optimization |
| RAM | Random Access Memory |
| RBAC | Role-Based Access Control |
| ReLU | Rectified Linear Unit |
| REST | Representational State Transfer |
| RG2 | OnRobot RG2 Parallel Jaw Gripper |
| RGB-D | Red Green Blue-Depth |
| RL | Reinforcement Learning |
| ROI | Return on Investment |
| RS485 | Recommended Standard 485 |
| RTU | Remote Terminal Unit |
| RV | Runtime Verification |
| SAC | Soft Actor–Critic |
| SAP | Systems, Applications, and Products |
| SI | Sequential Impulse |
| SPEA2 | Strength Pareto Evolutionary Algorithm 2 |
| SSD | Solid State Drive |
| SSM | Speed and Separation Monitoring |
| STO | Safe Torque Off |
| TLS | Transport Layer Security |
| TRL | Technology Readiness Level |
| UR5 | Universal Robots UR5 Robotic Manipulator |
| USA | United States of America |
| VRAM | Video Random Access Memory |
| WFG | Walking Fish Group Algorithm |
Appendix A. Algorithms and Hyperparameters
| Algorithm A1: APO-MORL Training Procedure |
Input:
Initialize cumulative rewards R = [0, 0, …, 0]
for t = 1 to Tmax do
  Select action: at ← ε-greedy(Qθ(st), ε)
  Execute action: st+1, r, done ← E.step(at)
  Store transition: D ← D ∪ {(st, at, r, st+1, done)}
  Update cumulative rewards: R ← R + r
  if |D| ≥ B then
    Sample minibatch: {(si, ai, ri, si+1, donei)} ~ D
    Compute targets: yi = ri + γ(1 − donei) maxₐ’ Qθ’(si+1, a’)
    Update Q-networks: θ ← θ − α∇θ Σi ‖Qθ(si, ai) − yi‖²
  end if
  Update preference weights: w ← AdaptWeights(R, w) (Algorithm A2)
  if done then break end if
end for
Update Pareto archive: A ← UpdateArchive(A, R, π)
Soft update target networks: θ’ ← τθ + (1 − τ)θ’ with τ = 0.005
|
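The target computation in Algorithm A1 can be written out concretely for the vector-reward setting. Since the Q-network has one output per objective (see the hyperparameter table below), we read the update as maintaining a per-objective TD target, with the preference weights w scalarizing the objectives only for action selection; that per-objective decomposition is our reading of the vector-reward notation, and the numeric values are toys.

```python
# Per-objective TD targets and w-scalarized greedy action selection,
# following Algorithm A1's vector-reward update (gamma = 0.99, Appendix A).
# The per-objective decomposition is our interpretation of the notation.
GAMMA = 0.99

def td_targets(reward_vec, next_q_per_objective, done):
    """y_i = r_i + gamma * (1 - done) * max_a' Q_i(s', a'), per objective i."""
    cont = 0.0 if done else GAMMA
    return [r + cont * max(q_next)
            for r, q_next in zip(reward_vec, next_q_per_objective)]

def select_action(q_values_per_objective, weights):
    """Greedy action under the scalarized Q: argmax_a sum_i w_i Q_i(s, a)."""
    n_actions = len(q_values_per_objective[0])
    scores = [sum(w * q[a] for w, q in zip(weights, q_values_per_objective))
              for a in range(n_actions)]
    return max(range(n_actions), key=scores.__getitem__)
```

Keeping the objectives separate until action selection is what lets the archive store Pareto-comparable return vectors while the adaptive weights steer behavior.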
| Algorithm A2: Dynamic Preference Weight Adaptation |
Input:
|
| Parameter | Value/Description |
|---|---|
| Hidden layers | 3 layers: [256, 256, 128] neurons |
| Activation function | ReLU (Rectified Linear Unit) |
| Output layer | 6 outputs (one per objective) |
| Learning rate (α) | 0.0003 (Adam optimizer) |
| Discount factor (γ) | 0.99 |
| Batch size (B) | 256 |
| Replay buffer size | 100,000 transitions |
| Target network update (τ) | 0.005 (soft update) |
| Initial ε (exploration) | 1.0 |
| Final ε | 0.01 |
| ε decay | 0.995 per episode |
| Adaptation rate (β) | 0.1 |
| Minimum weight threshold | 0.05 (prevents weight collapse) |
| Initial weights | Uniform: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] |
| Maximum episodes | 1000 |
| Steps per episode (Tmax) | 500 |
| Number of independent runs | 30 (for statistical validation) |
| Random seeds | Fixed: 0–29 for reproducibility |
| Archive size limit | 100 policies (Pareto front) |
| Dominance criterion | Pareto dominance (all objectives) |
| Hardware | NVIDIA RTX 3090 (24 GB VRAM) |
| Training time per run | 4.2 ± 0.3 h |
| Inference latency | 12 ± 2 ms per action |
| Memory footprint | 1.8 ± 0.2 GB RAM |
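Algorithm A2's body is not reproduced above, so the following is only a plausible sketch of AdaptWeights consistent with the tabled hyperparameters: adaptation rate β = 0.1, a 0.05 floor preventing weight collapse, and renormalization back onto the simplex. The specific rule (shift weight toward the worst-performing normalized objective) is our assumption.

```python
# Hypothetical AdaptWeights sketch using the hyperparameters in the table:
# beta = 0.1, minimum weight 0.05, weights renormalized to sum to 1. The
# "boost the worst-performing objective" rule is an assumption, not the
# paper's stated update.
BETA = 0.1
W_MIN = 0.05

def adapt_weights(weights, normalized_returns):
    """Nudge weights toward the objective with the lowest normalized return."""
    worst = min(range(len(weights)), key=normalized_returns.__getitem__)
    w = [wi + (BETA if i == worst else 0.0) for i, wi in enumerate(weights)]
    w = [max(wi, W_MIN) for wi in w]    # floor against weight collapse
    total = sum(w)
    return [wi / total for wi in w]     # renormalize onto the simplex

# Starting from the uniform initialization [1/6] * 6, energy (index 2)
# underperforms, so its weight grows.
w = adapt_weights([1 / 6] * 6, [0.9, 0.8, 0.2, 0.7, 0.6, 0.9])
```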
Appendix B. Robotic System and Object Specifications
Appendix B.1. UR5 Robotic Manipulator Specifications
| Parameter | Specification |
|---|---|
| Degrees of Freedom | 6 (rotational joints) |
| Reach | 850 mm |
| Payload | 5.0 kg (maximum) |
| Repeatability | ±0.1 mm |
| Joint Velocity Range | ±180°/s (all joints) |
| Joint Position Range | Base: ±360°; Others: ±180° |
| Weight | 18.4 kg |
| Operating Temperature | 0–50 °C |
| Protection Rating | IP54 |
| Power Consumption | Average: 200 W; Peak: 500 W |
Appendix B.2. RG2 Parallel Jaw Gripper Specifications
| Parameter | Specification |
|---|---|
| Gripper Type | Parallel jaw (electric actuation) |
| Stroke Width | 110 mm (fully open) |
| Gripping Force Range | 20–120 N (adjustable) |
| Payload Capacity | 2.0 kg (maximum) |
| Finger Length | 55 mm (standard configuration) |
| Gripper Weight | 0.78 kg |
| Operating Speed | 20–150 mm/s (configurable) |
| Operating Temperature | 0–50 °C |
| Protection Rating | IP54 |
| Power Consumption | Average: 5 W; Peak: 20 W |
| Communication Protocol | RS485, Modbus RTU |
| Grip Detection | Built-in force/position sensors |
Appendix B.3. Object Specifications and Grip Feasibility Analysis
| Object Type | Dimensions (L × W × H mm) | Mass (kg) | Volume (cm³) | Density (kg/m³) | Material Type |
|---|---|---|---|---|---|
| Cube (High) | 50 × 50 × 50 | 0.125 | 125 | 1000 | ABS Plastic |
| Long Prism (Medium) | 100 × 30 × 30 | 0.090 | 90 | 1000 | ABS Plastic |
| Short Wide (Low) | 60 × 50 × 30 | 0.090 | 90 | 1000 | ABS Plastic |
| Short Thin (Reject) | 70 × 25 × 15 | 0.026 | 26.25 | 1000 | ABS Plastic |
Appendix B.4. Gripper–Object Compatibility Analysis
| Object Type | Optimal Grip Face | Max Grip Width (mm) | Required Force (N) | RG2 Compatible? | Safety Margin |
|---|---|---|---|---|---|
| Cube (High) | 50 × 50 face | 50 | 1.23 | Yes | Width: 2.2× Force: 16× |
| Long Prism (Medium) | 30 × 30 face | 30 | 0.88 | Yes | Width: 3.7× Force: 23× |
| Short Wide (Low) | 50 × 30 face | 50 | 0.88 | Yes | Width: 2.2× Force: 23× |
| Short Thin (Reject) | 25 × 15 face | 25 | 0.26 | Yes | Width: 4.4× Force: 77× |
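The margins in the table above follow directly from the RG2 specifications (110 mm stroke, 20 N minimum grip force) and the object masses, taking the required force as the object's weight m·g, which matches the tabulated values; a full grasp analysis would also fold in friction and acceleration loads.

```python
# Recomputation of the Appendix B.4 margins from the RG2 specs and object
# masses. Required force is taken as m * g, matching the tabulated values.
G = 9.81
STROKE_MM = 110.0     # RG2 stroke width (fully open)
MIN_FORCE_N = 20.0    # RG2 minimum grip force

def grip_margins(grip_width_mm, mass_kg):
    required = mass_kg * G                   # e.g. 0.125 kg -> 1.23 N
    return {
        "compatible": grip_width_mm <= STROKE_MM and MIN_FORCE_N >= required,
        "width_margin": round(STROKE_MM / grip_width_mm, 1),
        "force_margin": round(MIN_FORCE_N / required),
    }

cube = grip_margins(50.0, 0.125)     # Cube (High): 2.2x width, 16x force
prism = grip_margins(30.0, 0.090)    # Long Prism: 3.7x width, 23x force
```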
Appendix B.5. Conveyor System Specifications
| Parameter | Specification |
|---|---|
| Belt Length | 1.5 m |
| Belt Width | 0.3 m |
| Speed Range | 0.1–0.5 m/s (variable) |
| Height (from ground) | 0.4 m |
| Belt Material | Rubber (μ = 0.4 with ABS plastic) |
| Object Arrival Distribution | Poisson process (λ = 0.2 objects/s) |
| Inter-Object Spacing | Minimum: 0.2 m (to prevent overlap) |
| Acceleration | 0.5 m/s² (smooth start/stop) |
Appendix B.6. Destination Station Specifications
| Station Name | Position (X, Y) mm | Height (Z) mm | Table Size (L × W) mm | Capacity (Objects) |
|---|---|---|---|---|
| Station_High (Green) | (300, 400) | 400 | 200 × 200 | 10 |
| Station_Medium (Yellow) | (300, −400) | 400 | 200 × 200 | 10 |
| Station_Low (White) | (−300, −400) | 400 | 200 × 200 | 10 |
| Station_Reject (Red) | (−300, 400) | 400 | 200 × 200 | 10 |
Appendix B.7. Safety and Collision Avoidance Specifications
Appendix B.8. Physics Simulation Configuration
Contact Dynamics Validation
| μ | Conveyor Slippage | Grasp Success | Avg. Grip Force | Transport Stability |
|---|---|---|---|---|
| 0.2 | 47% (234/500) | 82.1% | 15.2 ± 2.1 N | Unstable (18% slip) |
| 0.3 | 8% (42/500) | 94.3% | 16.8 ± 1.8 N | Marginal |
| 0.4 * | 0% (0/500) | 99.5% | 18.5 ± 1.5 N | Stable |
| 0.5 | 0% (0/500) | 99.7% | 20.1 ± 1.7 N | Stable |
| 0.6 | 0% (0/500) | 99.8% | 24.3 ± 2.3 N | Stable (high force) |
| 0.8 | 0% (0/500) | 99.6% | 31.7 ± 3.2 N | Excessive force |
Appendix B.9. Computational Performance and Resource Requirements
Appendix B.10. Summary and Validation Conclusion
References
- Chen, S.-C.; Chen, H.-M.; Chen, H.-K.; Li, C.-L. Multi-Objective Optimization in Industry 5.0: Human-Centric AI Integration for Sustainable and Intelligent Manufacturing. Processes 2024, 12, 2723.
- Elmazi, K.; Elmazi, D.; Lerga, J. Digital Twin-driven federated learning and reinforcement learning-based offloading for energy-efficient distributed intelligence in IoT networks. Internet Things 2025, 32, 101640.
- Abed, M.; Mohammad, A.; Axinte, D.; Gameros, A.; Askew, D. Digital-twin-assisted multi-stage machining of thin-wall structures using interchangeable robotic and human-assisted automation. Robot. Comput. Integr. Manuf. 2026, 97, 103077.
- Oyekan, J.; Turner, C.; Bax, M.; Graf, E. From Ontologies to Knowledge Augmented Large Language Models for Automation: A decision-making guidance for achieving human–robot collaboration in Industry 5.0. Comput. Ind. 2025, 171, 104329.
- Callari, T.C.; Curzi, Y.; Lohse, N. Realising human-robot collaboration in manufacturing? A journey towards industry 5.0 amid organisational paradoxical tensions. Technol. Forecast. Soc. Change 2025, 219, 124249.
- Shah, R.; Arockia Doss, A.S.; Lakshmaiya, N. Advancements in AI-enhanced collaborative robotics: Towards safer, smarter, and human-centric industrial automation. Results Eng. 2025, 27, 105704.
- ISO 10218-1:2025; Robotics—Safety requirements—Part 1: Industrial robots. International Organization for Standardization: Geneva, Switzerland, 2025.
- ISO 10218-2:2025; Robotics—Safety Requirements—Part 2: Industrial Robot Applications and Robot Cells. International Organization for Standardization: Geneva, Switzerland, 2025.
- ISO/TS 15066:2016; Robots and Robotic Devices—Collaborative Robots. International Organization for Standardization: Geneva, Switzerland, 2016.
- Peta, K.; Wiśniewski, M.; Kotarski, M.; Ciszak, O. Comparison of Single-Arm and Dual-Arm Collaborative Robots in Precision Assembly. Appl. Sci. 2025, 15, 2976.
- Gulec, M.O.; Ertugrul, S. Pareto front generation for integrated drive-train and structural optimisation of a robot manipulator conceptual design via NSGA-II. Adv. Mech. Eng. 2023, 15, 16878132231163051.
- Fan, Y.; Peng, Y.; Liu, J. Advanced multi-objective trajectory planning for robotic arms using a multi-strategy enhanced NSGA-II algorithm. PLoS ONE 2025, 20, e0324567.
- Lv, L.; Shen, W. An improved NSGA-II with local search for multi-objective integrated production and inventory scheduling problem. J. Manuf. Syst. 2023, 68, 99–116.
- Maurya, V.K.; Nanda, S.J. Time-varying multi-objective smart home appliances scheduling using fuzzy adaptive dynamic SPEA2 algorithm. Eng. Appl. Artif. Intell. 2023, 121, 105944.
- Gao, Y.; Yin, C.; Huang, X.; Cao, J.; Dadras, S.; Hou, Z.; Shi, A. MOEA/D-UR based infrared feature extraction for hypervelocity impact spacecraft damage detection and assessment. NDT E Int. 2025, 156, 103464.
- Wang, X.; Zhao, Y.; Tang, L.; Yao, X. MOEA/D With Spatial–Temporal Topological Tensor Prediction for Evolutionary Dynamic Multiobjective Optimization. IEEE Trans. Evol. Comput. 2025, 29, 764–778.
- Khadivi, M.; Charter, T.; Yaghoubi, M.; Jalayer, M.; Ahang, M.; Shojaeinasab, A.; Najjaran, H. Deep reinforcement learning for machine scheduling: Methodology, the state-of-the-art, and future directions. Comput. Ind. Eng. 2025, 200, 110856.
- Zhao, D.; Ding, Z.; Li, W.; Zhao, S.; Du, Y. Robotic Arm Trajectory Planning Method Using Deep Deterministic Policy Gradient with Hierarchical Memory Structure. IEEE Access 2023, 11, 140801–140814.
- Park, S.-Y.; Lee, C.; Kim, H.; Ahn, S.-H. Enhancement of Control Performance for Degraded Robot Manipulators Using Digital Twin and Proximal Policy Optimization. IEEE Access 2024, 12, 19569–19583.
- Sharifi, A.; Migliorini, S.; Quaglia, D. Optimizing Trajectories for Rechargeable Agricultural Robots in Greenhouse Climatic Sensing Using Deep Reinforcement Learning with Proximal Policy Optimization Algorithm. Future Internet 2025, 17, 296.
- Wang, Q.C.; Chen, L.L.; Sun, Q.; Wang, C.; Wei, Y.X. A controller of robot constant force grinding based on proximal policy optimization algorithm. PLoS ONE 2025, 20, e0319440.
- Lee, S.; Lee, M.H.; Moon, J. Weight vector selection methods by hypervolume maximization in the Pareto front for single policy multi-objective reinforcement learning. Expert Syst. Appl. 2026, 296, 129070.
- Hu, T.M.; Luo, B. PA2D-MORL: Pareto Ascent Directional Decomposition Based Multi-Objective Reinforcement Learning. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 12547–12555.
- Li, L.H.; Chen, R.T.; Zhang, Z.Q.; Wu, Z.C.; Li, Y.C.; Guan, C.; Yu, Y.; Yuan, L. Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 4434–4442.
- Li, S.; Pang, Y.; Huang, Z.; Chu, X. An offline-online learning framework combining meta-learning and reinforcement learning for evolutionary multi-objective optimization. Swarm Evol. Comput. 2025, 97, 102037.
- Mo, F.; Rehman, H.U.; Chaplin, J.C.; Sanderson, D.; Ratchev, S. Digital twin-based self-learning decision-making framework for industrial robots in manufacturing. Int. J. Adv. Manuf. Technol. 2025, 139, 221–240.
- Halvorsen, T.S.; Tyapin, I.; Jha, A. Autonomous Textile Sorting Facility and Digital Twin Utilizing an AI-Reinforced Collaborative Robot. Electronics 2025, 14, 2706.
- Wang, G.; Zhang, C.; Liu, S.; Zhao, Y.; Zhang, Y.; Wang, L. Multi-robot collaborative manufacturing driven by digital twins: Advancements, challenges, and future directions. J. Manuf. Syst. 2025, 82, 333–361.
- Huang, S.; Mo, G.; Jing, S.; Leng, J.; Li, X.; Gu, X.; Yan, Y.; Wang, G. Digital twin-driven self-adaptive reconfiguration planning method of smart manufacturing systems using game theory and deep Q-network for industry 5.0. J. Ind. Inf. Integr. 2025, 47, 100901.
- Lan, X.; Qiao, Y.; Lee, B. Multiagent Hierarchical Reinforcement Learning With Asynchronous Termination Applied to Robotic Pick and Place. IEEE Access 2024, 12, 78988–79002.
- Lobbezoo, A.; Kwon, H.-J. Simulated and Real Robotic Reach, Grasp, and Pick-and-Place Using Combined Reinforcement Learning and Traditional Controls. Robotics 2023, 12, 12.
- Wang, W.; Tang, Q.; Yang, H.; Yang, C.; Ma, B.; Wang, S.; Lin, R. Model-based contextual reinforcement learning for robotic cooperative manipulation. Eng. Appl. Artif. Intell. 2025, 155, 110919.
- Srisuchinnawong, A.; Manoonpong, P. Growable and interpretable neural control with online continual learning for autonomous lifelong locomotion learning machines. Int. J. Robot. Res. 2025, 44, 2156–2180.
- Ayub, A.; De Francesco, Z.; Holthaus, P.; Nehaniv, C.L.; Dautenhahn, K. Continual Learning Through Human-Robot Interaction: Human Perceptions of a Continual Learning Robot in Repeated Interactions. Int. J. Soc. Robot. 2025, 17, 277–296.
- Jiang, B.; Song, C.; Liu, S.; Gan, S.; Chen, J. A Continual Learning Method for Generalized Grasping Manipulation in a Musculoskeletal Robot. IEEE Trans. Autom. Sci. Eng. 2025, 22, 15671–15686.
- Waseem, S.; Adnan, M.; Iqbal, M.S.; Amin, A.A.; Shah, A.; Tariq, M. From classical to intelligent control: Evolving trends in robotic manipulator technology. Comput. Electr. Eng. 2025, 127, 110559.
- Dubey, A.K.; Kumar, A.; Ramírez, I.S.; Márquez, F.P.G. Machine learning and hybrid intelligence for wind energy optimization: A comprehensive state-of-the-art review. Expert Syst. Appl. 2026, 296, 128926.
- Wang, Y.; Han, Y.; Wang, Y.; Sang, H.; Wang, Y. A reinforcement learning-enhanced multi-objective Co-evolutionary algorithm for distributed group scheduling with preventive maintenance. Swarm Evol. Comput. 2025, 97, 102066.
- Zhang, H.; Chen, Y.; Xu, G.; Zhang, Y. Distributed assembly flexible job shop scheduling with dual-resource constraints via a deep Q-network based memetic algorithm. Swarm Evol. Comput. 2025, 98, 102086.
- Seradji, S.; Khonsari, A.; Dolati, M.; Shah-Mansouri, V. Cell selection in mobile crowdsensing using multi-objective deep reinforcement learning. Comput. Electr. Eng. 2025, 125, 110424.
- Mou, J.; Zhu, Q. A DDQN-Guided Dual-Population Evolutionary Multitasking Framework for Constrained Multi-Objective Ship Berthing. J. Mar. Sci. Eng. 2025, 13, 1068.
- Yue, Y.; Zhao, D.; Zhou, Y.; Xu, L.; Tang, Y.; Peng, H. An intrusion response approach based on multi-objective optimization and deep Q network for industrial control systems. Expert Syst. Appl. 2025, 272, 126664.
- Tuptuk, N.; Hailes, S. Identifying vulnerabilities of industrial control systems using evolutionary multiobjective optimisation. Comput. Secur. 2024, 137, 103593.
- Han, L.; Zhou, X.; Yang, N.; Liu, H.; Bo, L. Multi-objective energy management for off-road hybrid electric vehicles via nash DQN. Automot. Innov. 2025, 8, 140–156.
- Hu, Y.; Pan, L.; Wen, Z.; Zhou, Y. Dueling double deep Q-network-based stamping resources intelligent scheduling for automobile manufacturing in cloud manufacturing environment. Appl. Intell. 2025, 55, 659.
- Cruz, P.J.; Vásconez, J.P.; Romero, R.; Chico, A.; Benalcázar, M.E.; Álvarez, R.; Barona López, L.I.; Valdivieso Caraguay, Á.L. A Deep Q-Network based hand gesture recognition system for control of robotic platforms. Sci. Rep. 2023, 13, 7956.
- Madiyev, A.; Bulegenov, D.; Karzhaubayev, A.; Murzabulatov, M.; Bui, D.M. Energy-efficient offloading framework for mobile edge/cloud computing based on convex optimization and Deep Q-Network. J. Supercomput. 2025, 81, 1182.
- Zhang, R.H.; Ma, Q.W.; Zhang, X.L.; Xu, X.; Liu, D.X. A Distributed Actor-Critic Learning Approach for Affine Formation Control of Multi-Robots With Unknown Dynamics. Int. J. Adapt. Control Signal Process. 2025, 39, 803–817.
- Wang, L.; Li, R.; Huangfu, Z.; Feng, Y.; Chen, Y. A Soft Actor-Critic Approach for a Blind Walking Hexapod Robot with Obstacle Avoidance. Actuators 2023, 12, 393.
- Daniel, M.; Magassouba, A.; Aranda, M.; Lequièvre, L.; Corrales Ramón, J.A.; Iglesias Rodriguez, R. Multi Actor-Critic DDPG for Robot Action Space Decomposition: A Framework to Control Large 3D Deformation of Soft Linear Objects. IEEE Robot. Autom. Lett. 2024, 9, 1318–1325.
- Liu, Y.; Wang, C.; Zhao, C.; Wu, H.; Wei, Y. A Soft Actor-Critic Deep Reinforcement-Learning-Based Robot Navigation Method Using LiDAR. Remote Sens. 2024, 16, 2072.
- Ali, R.; Dogru, S.; Marques, L.; Chiaberge, M. Adaptive Robot Navigation Using Randomized Goal Selection with Twin Delayed Deep Deterministic Policy Gradient. Robotics 2025, 14, 43.
- Jiang, J.; Zhang, Y.; Zhang, Y.; Zhang, Q. Path planning in dynamic structured environments using transformer-enabled twin delayed deep deterministic policy gradient for mobile robots in simulation. Intell. Serv. Robot. 2025, 18, 857–874.
- Yu, L.; Chen, Z.; Wu, H.; Xu, Z.; Chen, B. Soft Actor-Critic Combining Potential Field for Global Path Planning of Autonomous Mobile Robot. IEEE Trans. Veh. Technol. 2025, 74, 7114–7123.
- Wu, M.; Rupenyan, A.; Corves, B. Autogeneration and optimization of pick-and-place trajectories in robotic systems: A data-driven approach. Robot. Comput. Integr. Manuf. 2026, 97, 103080.
- Song, P.; Chen, H.; Cui, K.; Wang, J.; Shi, D. Meta-learning for dynamic multi-robot task scheduling. Comput. Oper. Res. 2025, 182, 107109.
- Zhang, S.; Xia, Q.; Chen, M.; Cheng, S. Multi-Objective Optimal Trajectory Planning for Robotic Arms Using Deep Reinforcement Learning. Sensors 2023, 23, 5974.
- Martínez-Peral, F.J.; Méndez, J.B.; Mronga, D.; Segura-Heras, J.V.; Perez-Vidal, C. Trajectory planning system for bimanual robots: Achieving efficient collision-free manipulation. Robot. Auton. Syst. 2025, 194, 105118.
- Xue, J.; Zhang, S.; Lu, Y.; Yan, X.; Zheng, Y. Bidirectional Obstacle Avoidance Enhancement-Deep Deterministic Policy Gradient: A Novel Algorithm for Mobile-Robot Path Planning in Unknown Dynamic Environments. Adv. Intell. Syst. 2024, 6, 2300444.
- Xu, J.; Huang, H.; Long, H.; Lei, S. The Adaptive Trajectory of the Normal Force Vector in the Polishing of Curved Surface Component Robots. Adv. Intell. Syst. 2025, 7, 2401044.
- Al-Nuaimi, I.I.I.; Mahyuddin, M.N. Robust Indirect Adaptive Control of Acoustic Levitation Standing Waves-based Scheme for Robotic Non-contact Manipulation Applications. Int. J. Control Autom. Syst. 2025, 23, 1816–1828.
- Tsai, H.-H.; Chang, J.-Y. An adaptive disturbance compensation method for force-sensorless control systems applied to robotic milling. Robot. Comput. Integr. Manuf. 2026, 97, 103082.
- Li, G.; Liang, X.; Gao, Y.; Su, T.; Liu, Z.; Hou, Z.-G. A Linkage-Driven Underactuated Robotic Hand for Adaptive Grasping and In-Hand Manipulation. IEEE Trans. Autom. Sci. Eng. 2024, 21, 3039–3051.
- Yang, H.; Zhao, T. Data-driven interval type-2 fuzzy learning controller design for tracking complex dynamical trajectories in robotic systems. Appl. Soft Comput. 2025, 179, 113321.
- Wang, Y.; Wang, Z.; Wu, Z. Multi-objective optimal control of nonlinear processes using reinforcement learning with adaptive weighting. Comput. Chem. Eng. 2025, 201, 109206.
- Wang, J.; Karatzoglou, A.; Arapakis, I.; Jose, J.M.; Ge, X. Beyond Accuracy: Decision Transformers for Reward-Driven Multi-Objective Recommendations. IEEE Trans. Knowl. Data Eng. 2025, 37, 5004–5016.
- Vicente, Ó.F.; García, J.; Fernández, F. Optimizing market-making strategies: A multi-objective reinforcement learning approach with pareto fronts. Expert Syst. Appl. 2026, 295, 128867.
- Chen, J.; Ma, Y.; Lv, W.; Qiu, X.; Wu, J. MOOO-RDQN: A deep reinforcement learning based method for multi-objective optimization of controller placement and traffic monitoring in SDN. J. Netw. Comput. Appl. 2025, 242, 104253.
- Li, X.; Tian, J.; Wang, C.; Jiang, Y.; Wang, X.; Wang, J. Multi-objective multicast optimization with deep reinforcement learning. Clust. Comput. 2025, 28, 222.
- Ruiz-Rodríguez, M.L.; Kubler, S.; Robert, J.; Voisin, A.; Le Traon, Y. Evolutionary multi-objective multi-agent deep reinforcement learning for sustainable maintenance scheduling. Eng. Appl. Artif. Intell. 2025, 156, 111126.
- Xiao, Y.; Yao, Y.; Zhu, F. Parallel Simulation Multi-Sample Task Scheduling Approach Based on Deep Reinforcement Learning in Cloud Computing Environment. Mathematics 2025, 13, 2249.
- Fu, X.; Gu, S.; Chew, C.-M. Optimizing the multi-objective traveling salesman problem with a deep reinforcement learning algorithm using cross fusion attention networks. Neural Netw. 2025, 192, 107904.
- Xia, G.; Ghrairi, Z.; Heuermann, A.; Thoben, K.-D. Enhancing sustainability of human-robot collaboration in industry 5.0: Context- and interaction-aware human motion prediction for proactive robot control. J. Manuf. Syst. 2025, 82, 376–388.
- Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Lawrence Erlbaum Associates: Hillsdale, NJ, USA, 1988.
- Faul, F.; Erdfelder, E.; Lang, A.-G.; Buchner, A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods 2007, 39, 175–191.
- While, L.; Bradstreet, L.; Barone, L. A fast way of calculating exact hypervolumes. IEEE Trans. Evol. Comput. 2012, 16, 86–95.
- Blank, J.; Deb, K. Pymoo: Multi-Objective Optimization in Python. IEEE Access 2020, 8, 89497–89509.
- Fonseca, C.M.; Paquete, L.; López-Ibáñez, M. An improved dimension-sweep algorithm for the hypervolume indicator. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2006), Vancouver, BC, Canada, 16–21 July 2006; pp. 1157–1163.
- Nosek, B.A.; Ebersole, C.R.; DeHaven, A.C.; Mellor, D.T. The preregistration revolution. Proc. Natl. Acad. Sci. USA 2018, 115, 2600–2606.
- Hutson, M. Artificial intelligence faces reproducibility crisis. Science 2018, 359, 725–726.
- Mankins, J.C. Technology readiness levels: A white paper. In Advanced Concepts Office, Office of Space Access and Technology; NASA: Washington, DC, USA, 1995.
- Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701.
- Yuan, J.; Lei, Y.; Li, N.; Yang, B.; Li, X.; Chen, Z.; Han, W. A framework for modeling and optimization of mechanical equipment considering maintenance cost and dynamic reliability via deep reinforcement learning. Reliab. Eng. Syst. Saf. 2025, 264, 111424.
- Zi, B.; Tang, K.; Li, Y.; Feng, K.; Liu, Y.; Wang, L. Coating defect detection in intelligent manufacturing: Advances, challenges, and future trends. Robot. Comput. Integr. Manuf. 2026, 97, 103079.
- Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
- Urrea, C. UR5 6-DoF robotic manipulator equipped with an RG2 gripper. Synthetic Data for the Paper “APO-MORL: An Adaptive Pareto-Optimal Framework for Real-Time Multi-Objective Optimization in Robotic Pick-and-Place Manufacturing Systems”. FigShare Repos. 2025, 22, 16.
- Urrea, C. Code, Scripts and Figures for the Paper “APO-MORL: An Adaptive Pareto-Optimal Framework for Real-Time Multi-Objective Optimization in Robotic Pick-and-Place Manufacturing Systems”; GitHub Repository. Available online: https://github.com/ClaudioUrrea/ur5_CoppeliaSim_EDU (accessed on 12 December 2025).

| Object Type | Quantity | Dimensions (L × W × H mm) | Mass (kg) | Jaw Spacing [mm] | Gripping Force [N] | Safety Margin |
|---|---|---|---|---|---|---|
| Cube (High Priority) | 5 | 50 × 50 × 50 | 0.125 | 55 | 40 | 20× |
| Long Narrow Prisms (Medium Priority) | 3 | 100 × 30 × 30 | 0.090 | 35 | 35 | 17× |
| Short Wide Prisms (Low Priority) | 2 | 60 × 50 × 30 | 0.090 | 55 | 35 | 19× |
| Short Thin Prisms (Reject) | 2 | 70 × 25 × 15 | 0.026 | 30 | 25 | 48× |

| Parameter | Value |
|---|---|
| Simulator | CoppeliaSim EDU 4.10.0 |
| Physics Engine | Bullet (time step: 50 ms) |
| Robot | UR5 6-DoF + RG2 gripper (110 mm stroke, 20–120 N force) |
| Conveyor | Variable speed (0.1–0.5 m/s), random object arrival (λ = 0.2/s) |
| Objects | 4 types (12 total): Cubes (5, 50 × 50 × 50 mm, 0.125 kg), Long narrow prisms (3, 100 × 30 × 30 mm, 0.090 kg), Short wide prisms (2, 60 × 50 × 30 mm, 0.090 kg), Short thin prisms (2, 70 × 25 × 15 mm, 0.026 kg). Jaw spacing: 30–55 mm. Gripping force: 25–40 N. All within RG2 capacity (110 mm stroke, 2.0 kg payload). |
| Classification Criteria | Shape and size only (color-agnostic) |
| Destination Stations | 4: Green (High Priority), Yellow (Medium), White (Low), Red (Reject) |
| Task Cycle | Pick → Classify → Place (correct station) |
| Safety Barriers | Transparent mesh (ISO 10218-2:2025 HRC-compatible, 0.5 m clearance) |
| Friction Coefficient | μ_static = 0.5, μ_dynamic = 0.4 |
| Sensor Noise | σ = 2 mm Gaussian position error |
| Control Frequency | 20 Hz (50 ms per control cycle) |

| Friction Coefficient (μ) | Object Stability (Conveyor Transport) | Gripper Force Requirements | Placement Precision | Selected for Experiments |
|---|---|---|---|---|
| μ = 0.2 (Low Friction) | Unstable: Object slippage during acceleration/deceleration. Displacement up to 15 mm during belt startup (a = 0.5 m/s²). Unreliable pick-point prediction. | Low (15–25 N adequate). | Poor (±12 mm deviation) due to post-grasp sliding. | Rejected (insufficient stability). |
| μ = 0.4 (Selected) | Stable: No slippage across all speeds (0.1–0.5 m/s). Maximum displacement <2 mm during acceleration (within sensor tolerance). | Moderate (25–45 N) well within RG2 range (20–120 N). | Excellent (±2.3 mm) meets tolerance requirement (<5 mm). | SELECTED (optimal balance). |
| μ = 0.6 (High Friction) | Stable (no slippage). | High (60–85 N for heaviest objects). | Good (±3.1 mm) but unrealistic “sticking” during release (±8 mm deviation from target). | Rejected (excessive force, unrealistic release behavior). |
| μ = 0.8 (Very High) | Stable (no slippage). | Excessive (>100 N approaching RG2 limits). | Poor (±11 mm) severe sticking artifacts, requires multiple release attempts. | Rejected (unrealistic dynamics, computational instability). |

| Validation Dimension | Experimental Configuration | Key Results | Advancement vs. Baselines |
|---|---|---|---|
| Performance Superiority | 30 independent runs per algorithm (7 baselines). | Hypervolume: 0.076 ± 0.015 | +24.59% to +34.75% improvement (p < 0.001, d = 0.89–1.52). |
| Convergence Efficiency | 200 training episodes, checkpoints every 20 episodes. | 95% optimal at episode 180 | 5× faster than NSGA-II/SPEA2 (900+ episodes). |
| Statistical Rigor | 30 runs × 1000 episodes = 30,000+ manipulation cycles. | Cohen’s d: 0.42–1.52, Power >95% | 6/7 comparisons statistically significant. |
| Measurement Reliability | 4 independent hypervolume calculators (WFG, PyMOO, Monte Carlo, HSO). | Maximum variance: 0.26% | <0.5% tolerance (high consistency). |
| Industrial Robustness | Sensor noise (σ = 2 mm), conveyor variation (±10%), object overlap. | Grasp success: 99.5% → 98.9%, Precision: ±2.3 → ±2.8 mm | Minimal degradation under realistic disturbances. |

| Physics Engine | Friction Behavior (μ = 0.4) | Computational Performance | Engine Selection Rationale |
|---|---|---|---|
| Bullet (selected) | Stable manipulation, realistic contact resolution. | Baseline (1.0×) | SELECTED: optimal balance of realism, efficiency, and validation. |
| ODE (Open Dynamics Engine) | Similar friction behavior, consistent results. | 2.3× slower | Rejected: excessive computational overhead. |
| Vortex Studio | More sophisticated multi-point contact model. | 5.0× slower | Rejected: incompatible with real-time training requirements. |
| MuJoCo | Faster computation (0.62× vs. Bullet). | 1.6× faster | Rejected: less mature conveyor dynamics, limited CoppeliaSim integration. |

| Algorithm | Mean Hypervolume | Std Dev | Min | Max | CV (%) |
|---|---|---|---|---|---|
| PID + Trajectory Planning | 0.0610 | 0.0270 | 0.0164 | 0.1346 | 44.3 |
| Single-Objective PPO | 0.0564 | 0.0188 | 0.0252 | 0.0982 | 33.3 |
| Single-Objective DDPG | 0.0666 | 0.0184 | 0.0427 | 0.1030 | 27.6 |
| Single-Objective SAC | 0.0707 | 0.0272 | 0.0124 | 0.1310 | 38.5 |
| Evolutionary NSGA-II | 0.0610 | 0.0157 | 0.0321 | 0.1016 | 25.7 |
| Evolutionary SPEA2 | 0.0597 | 0.0143 | 0.0249 | 0.0824 | 24.0 |
| Evolutionary MOEA/D | 0.0645 | 0.0170 | 0.0261 | 0.1072 | 26.4 |
| MORL (Proposed) | 0.0760 | 0.0150 | 0.0525 | 0.1111 | 19.7 |
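The last column of the hypervolume table is the coefficient of variation, CV = 100·σ/μ; a quick sanity check against the reported rows:

```python
def cv_percent(mean, std):
    """Coefficient of variation in percent, as reported in the table."""
    return 100.0 * std / mean

cv_proposed = cv_percent(0.0760, 0.0150)  # MORL (proposed) row: 19.7%
cv_pid = cv_percent(0.0610, 0.0270)       # PID baseline row: 44.3%
```

The proposed method's lower CV reflects the smaller run-to-run spread visible in the Min/Max columns.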

| Comparison | Improvement (%) | 95% CI | p-Value | Cohen’s d | 95% CI (d) | Effect Size | Power |
|---|---|---|---|---|---|---|---|
| vs. PID + Trajectory Planning | 24.59% | [19.7%, 29.5%] | <0.001 | 1.52 | [1.22, 1.82] | Large | >99% |
| vs. Single-Objective PPO | 34.75% | [29.3%, 40.2%] | <0.001 | 1.24 | [0.98, 1.50] | Large | >99% |
| vs. Single-Objective DDPG | 14.11% | [10.3%, 17.9%] | 0.005 | 0.98 | [0.74, 1.22] | Large | 98% |
| vs. Single-Objective SAC | 7.49% | [2.7%, 12.3%] | 0.274 | 0.42 | [0.18, 0.66] | Medium | 23% |
| vs. Evolutionary NSGA-II | 24.59% | [19.9%, 29.3%] | <0.001 | 1.18 | [0.92, 1.44] | Large | >99% |
| vs. Evolutionary SPEA2 | 27.30% | [22.6%, 32.0%] | <0.001 | 1.45 | [1.17, 1.73] | Large | >99% |
| vs. Evolutionary MOEA/D | 17.83% | [13.7%, 22.0%] | 0.004 | 0.89 | [0.65, 1.13] | Large | 96% |
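The improvement column above is the relative hypervolume gain over each baseline's mean; Cohen's d is sketched below with one common pooled-standard-deviation convention (the paper's exact estimator may differ, so only the improvement percentages are checked against the table):

```python
import math

def improvement_pct(mean_ours, mean_base):
    """Relative hypervolume improvement over a baseline, in percent."""
    return 100.0 * (mean_ours - mean_base) / mean_base

def cohens_d(mean1, std1, mean2, std2):
    """Cohen's d with a pooled SD (equal-n convention); illustrative only."""
    pooled = math.sqrt((std1 ** 2 + std2 ** 2) / 2.0)
    return (mean1 - mean2) / pooled

APO_MEAN = 0.0760  # mean hypervolume of the proposed method
imp_nsga2 = improvement_pct(APO_MEAN, 0.0610)  # vs. NSGA-II -> 24.59%
imp_ppo = improvement_pct(APO_MEAN, 0.0564)    # vs. PPO -> 34.75%
```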

| Method | r1 Throughput | r2 Cycle Time | r3 Energy Efficiency | r4 Precision | r5 Wear Reduction | r6 Safety |
|---|---|---|---|---|---|---|
| PID | 0.62 ± 0.08 | 0.58 ± 0.09 | 0.45 ± 0.11 | 0.73 ± 0.07 | 0.51 ± 0.10 | 0.68 ± 0.08 |
| PPO | 0.78 ± 0.06 | 0.74 ± 0.07 | 0.56 ± 0.09 | 0.79 ± 0.06 | 0.63 ± 0.08 | 0.75 ± 0.07 |
| DDPG | 0.81 ± 0.05 | 0.77 ± 0.06 | 0.61 ± 0.08 | 0.82 ± 0.05 | 0.68 ± 0.07 | 0.79 ± 0.06 |
| SAC | 0.89 ± 0.04 | 0.85 ± 0.05 | 0.68 ± 0.07 | 0.87 ± 0.04 | 0.74 ± 0.06 | 0.84 ± 0.05 |
| NSGA-II | 0.71 ± 0.07 | 0.68 ± 0.08 | 0.54 ± 0.09 | 0.76 ± 0.07 | 0.59 ± 0.09 | 0.72 ± 0.08 |
| SPEA2 | 0.69 ± 0.08 | 0.66 ± 0.09 | 0.52 ± 0.10 | 0.74 ± 0.08 | 0.57 ± 0.10 | 0.70 ± 0.09 |
| MOEA/D | 0.73 ± 0.07 | 0.70 ± 0.08 | 0.56 ± 0.09 | 0.78 ± 0.07 | 0.61 ± 0.08 | 0.74 ± 0.08 |
| WVS-MOR | 0.85 ± 0.05 | 0.82 ± 0.06 | 0.64 ± 0.08 | 0.84 ± 0.05 | 0.71 ± 0.07 | 0.81 ± 0.06 |
| APO-MORL | 0.93 ± 0.03 | 0.91 ± 0.04 | 0.85 ± 0.05 | 0.94 ± 0.03 | 0.88 ± 0.04 | 0.92 ± 0.03 |

| Method | APO-MORL HV | Variance from WFG | Computation Time |
|---|---|---|---|
| WFG | 0.0760 ± 0.0015 | 0.00% (baseline) | 2.3 ± 0.4 s |
| PyMOO | 0.0758 ± 0.0015 | 0.26% | 1.8 ± 0.3 s |
| Monte Carlo | 0.0760 ± 0.0015 | 0.00% | 4.5 ± 0.8 s |
| HSO | 0.0760 ± 0.0015 | 0.00% | 3.1 ± 0.5 s |
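The Monte Carlo entry in the table estimates hypervolume by sampling: draw points uniformly in the box spanned by the reference point and count the fraction dominated by the front. A self-contained sketch of the idea (minimization assumed; the front below is a toy 2-objective example, not the paper's six-objective data):

```python
import random

def mc_hypervolume(front, ref, n_samples=100_000, seed=0):
    """Monte Carlo hypervolume: dominated fraction of the [0, ref] box."""
    rng = random.Random(seed)
    dims = len(ref)
    hits = 0
    for _ in range(n_samples):
        p = [rng.uniform(0.0, ref[d]) for d in range(dims)]
        # p lies in the dominated region if some front point is <= p everywhere
        if any(all(f[d] <= p[d] for d in range(dims)) for f in front):
            hits += 1
    box_volume = 1.0
    for r in ref:
        box_volume *= r
    return box_volume * hits / n_samples

# Toy front; its exact hypervolume w.r.t. reference point (1, 1) is 0.375.
hv = mc_hypervolume([(0.25, 0.75), (0.5, 0.5), (0.75, 0.25)], (1.0, 1.0))
```

Exact methods such as WFG and HSO replace the sampling loop with a recursive sweep, which is why their results agree to within the sub-percent variance reported above.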

| Speed (m/s) | Cycle Time (s) | Success Rate (%) | Energy (J) | Hypervolume | Throughput (parts/hr) |
|---|---|---|---|---|---|
| 0.1 | 8.2 ± 0.4 | 99.8 | 42 ± 3 | 0.078 | ~439 |
| 0.2 | 7.1 ± 0.3 | 99.6 | 45 ± 3 | 0.077 | ~507 |
| 0.3 (baseline) | 6.5 ± 0.3 | 99.5 | 48 ± 4 | 0.076 | ~554 |
| 0.4 | 6.1 ± 0.4 | 98.7 | 52 ± 4 | 0.073 | ~590 |
| 0.5 | 5.8 ± 0.5 | 97.2 | 56 ± 5 | 0.071 | ~622 |
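The throughput column follows from the mean cycle time alone (parts/hr ≈ 3600 / cycle time in seconds); the tabulated values agree with this to within rounding (e.g., the 0.5 m/s row lists ~622 vs. a computed ~621):

```python
def throughput_parts_per_hour(cycle_time_s):
    """Ideal throughput implied by the mean cycle time."""
    return 3600.0 / cycle_time_s

tp_slow = throughput_parts_per_hour(8.2)      # ~439 parts/hr at 0.1 m/s
tp_baseline = throughput_parts_per_hour(6.5)  # ~554 parts/hr at 0.3 m/s
```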

| Approach | Year | Convergence Speed | # Objectives | Real-Time (<50 ms) | Industry Integration | Validation |
|---|---|---|---|---|---|---|
| NSGA-II [11] | 2023 | 1000+ evals | 2–3 | No | No | Benchmark |
| SPEA2 [14] | 2023 | 800+ evals | 2–4 | No | No | Benchmark |
| MOEA/D [15,16] | 2025 | 500+ evals | 3–5 | Partial | No | Simulation |
| PPO [19] | 2024 | 300 episodes | 1 | Yes | Partial | Digital Twin |
| DDPG [18] | 2023 | 250 episodes | 1 | Yes | No | Simulation |
| SAC [single] | 2024 | 200 episodes | 1 | Yes | Partial | Simulation |
| WV-MORL [23] | 2024 | 300 episodes | 2–4 | Yes | No | Benchmark |
| Cont-MORL [24] | 2025 | 400 episodes | 3–5 | Partial | No | Simulation |
| APO-MORL | 2025 | 180 episodes (95%) | 6 | Yes (<32 ms) | Yes (MES/DT) | Industry realistic |

| Configuration | Hypervolume | Δ vs. Full | Convergence Speed | Final Success Rate | p-Value |
|---|---|---|---|---|---|
| APO-MORL (Full) | 0.076 ± 0.015 | — | 180 episodes | 99.97% | — |
| Without Adaptive Preferences | 0.062 ± 0.018 | −18.4% | 250 episodes | 96.3% | <0.001 |
| Without Experience Replay | 0.058 ± 0.021 | −23.7% | 320 episodes | 94.8% | <0.001 |
| Without Pareto Archive | 0.054 ± 0.019 | −28.9% | 280 episodes | 95.5% | <0.001 |
| Fixed Weights (w = [1/6, …,1/6]) | 0.048 ± 0.023 | −36.8% | 350 episodes | 92.1% | <0.001 |
| Single Q-Network (shared) | 0.051 ± 0.022 | −32.9% | 310 episodes | 93.4% | <0.001 |
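In the ablation table, the Δ vs. Full column is the relative hypervolume drop against the full framework, 100·(ablated − full)/full:

```python
def delta_vs_full(ablated_hv, full_hv=0.076):
    """Relative hypervolume change vs. the full APO-MORL configuration (%)."""
    return 100.0 * (ablated_hv - full_hv) / full_hv

d_no_pref = delta_vs_full(0.062)       # Without Adaptive Preferences: -18.4%
d_fixed_weights = delta_vs_full(0.048) # Fixed equal weights: -36.8%
```

The fixed-weight row shows the largest drop, consistent with the adaptive preference mechanism being the main contributor.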
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Urrea, C. Adaptive Multi-Objective Reinforcement Learning for Real-Time Manufacturing Robot Control. Machines 2025, 13, 1148. https://doi.org/10.3390/machines13121148
