To comprehensively evaluate the performance of our proposed reinforcement learning-based scheduling framework, we conducted a series of comparative experiments on a simulated dynamic flexible job shop environment. Specifically, we randomly generated a set of 1000 tasks, each characterized by randomly sampled arrival times, deadlines, and task types. The arrival times were generated using two different distributions: a uniform distribution (used for training) and a normal distribution (used to evaluate generalization). A total of 30 machines with heterogeneous processing capabilities were simulated. Each task was assigned to one of the machines by different scheduling strategies. To ensure a fair comparison, all algorithms were evaluated on the exact same task and machine configuration across repeated trials. All experiments were executed on on a Windows 11 platform equipped with an NVIDIA GeForce RTX 3090 Founders Edition GPU (NVIDIA Corporation, Santa Clara, CA, USA, manufactured in China).
5.3. Experiment Results
After extensive comparative evaluation, the performance of our proposed reinforcement learning-based scheduling algorithm was assessed against several heuristic-based methods, including Random Selection, Half Min-Max, multiple Deadline-Aware heuristics with thresholds of 4.25, 4.5, and 4.75, the Suitable heuristic, and the Shortest Processing Time (SPT) baseline. The results are summarized in
Table 1. Our algorithm consistently outperformed all baselines under both uniform and normal task arrival distributions across multiple performance metrics. In terms of cumulative reward, the proposed method achieved 3114.8 under the uniform distribution and 6564.1 under the normal distribution. These values represent improvements of 0.7% and 0.2%, respectively, compared to the second-best baseline (SPT), which recorded 3090.2 and 6545.5. This clearly demonstrates the ability of our reinforcement learning approach to improve scheduling efficiency beyond traditional methods. For the task completion success rate, our method achieved a perfect 100% under the uniform distribution and 99.9% under the normal distribution. These results surpass all heuristics, and are marginally better than the SPT baseline, which attained 99% and 99.8%. Most importantly, our method excels in minimizing the response time, a key indicator for real-time planning in dynamic industrial environments. As shown in
Table 1, the proposed approach achieved average response times of 1.38 s (uniform) and 1.42 s (normal), corresponding to reductions of 0.7% and 1.4% compared with SPT (1.39 s and 1.44 s). These reductions, though numerically small, are critical for time-sensitive industrial scheduling, where even minor improvements in latency can lead to significant practical benefits. The negative rewards in
Table 1 mean the certain algorithm failed in finding a feasible solution in the constrained time.
In addition to the simulated task distribution, we evaluated the performance of our algorithm on several well-established public benchmark instances, including Brandimarte, Dauzère, Taillard, Demirkol, and Lawrence. These benchmarks cover both flexible job shop scheduling (FJSP) and job shop scheduling problems (JSSP) and allow for a direct comparison of our method with other state-of-the-art scheduling algorithms such as PPO, DQN, and DDQN. As shown in
Table 2, our algorithm consistently performed well across all benchmark instances. Specifically, our approach achieved a cumulative reward of 199.50 for Brandimarte (FJSP), 2521.17 for Dauzère (FJSP), and 2781.56 for Taillard (JSSP), with success rates of 25.00%, 12.30%, and 19.00%, respectively. This performance is competitive with, and in many cases superior to, other methods like PPO and DQN, demonstrating the robustness of our approach in real-world job shop scheduling tasks. Moreover, our algorithm consistently maintained a lower response time, indicating its potential for real-time scheduling applications in dynamic industrial settings.
Our algorithm achieved strong performance across all benchmarks. For the Brandimarte (FJSP) benchmark, our method obtained a of 199.50 with a gap of 25.00%, which is only 0.6% higher than the best PPO result (199.10, 24.75%), and 0.9% better than Rainbow (198.30, 24.39%). This shows that our method can match or slightly exceed the best-performing baselines on this dataset.
For the Dauzère (FJSP) benchmark, our approach achieved with a gap of 12.30%. Compared to PPO’s 2442.14 ( gap) and DDQN’s 2440.33 ( gap), our method performed within a narrow margin, maintaining competitiveness across multiple runs.
On the Taillard (JSSP) benchmark, our method recorded with a gap of 19.00%. While PPO achieved the lowest of 2478.95 (18.97%), our method was only 0.03 percentage points higher in relative gap, effectively matching PPO’s performance.
For the Demirkol (JSSP) benchmark, our algorithm reached and a gap of 29.90%. Compared with PER (5877.50, 26.30%), our method was about 2.6% worse in terms of gap, but still maintained competitive results when compared to DQN and DDQN, which had larger deviations.
Finally, on the Lawrence (JSSP) benchmark, our approach achieved with a gap of 10.80%. This result is within +0.8% of DDQN’s best gap (9.99%) and significantly better than PPO (15.45%) and Noisy (14.24%).
Taken together, these results demonstrate that our algorithm maintains high robustness across diverse benchmark instances, often matching or closely trailing the best specialized methods (e.g., PPO or DDQN), while achieving notable improvements over weaker baselines (e.g., DQN, Noisy, Multi-step). Moreover, our algorithm consistently sustains competitive response times across all benchmarks, underscoring its potential for real-time scheduling applications in dynamic industrial environments.
We further render the training details of our approach with respect to the accumulated rewards in
Figure 3. From the figure, the Full Model began to outperform the other baselines very early in training, underscoring that our approach does not merely match traditional methods but rapidly surpasses them by learning a dynamic, adaptive policy. The graph provides empirical validation that our hierarchical structure mitigates the classic RL challenge of slow convergence in high-dimensional spaces, successfully enabling the agent to efficiently credit actions to long-term outcomes and learn a robust scheduling strategy directly from raw state representations.
In addition to these comparisons with heuristic baselines, we further examined the robustness of our approach by combining the hierarchical reinforcement learning framework with different optimization backbones, as summarized in
Table 3. The results show that our HRL-based variant with policy gradient (Ours+PG) achieved consistent improvements over PPO and DQN alone, with cumulative rewards of 3114.8 and 6564.1 under uniform and normal task distributions, respectively, together with near-optimal response times of 1.38 s and 1.42 s. When integrated with DQN (Ours+DQN), the performance remained competitive, achieving higher rewards and success rates than vanilla DQN while maintaining comparable response times. Most importantly, the combination with PPO (Ours+PPO) yielded the best overall performance across all metrics, improving cumulative reward by approximately +1.0% compared to Ours+PG (3146.9 vs. 3114.8 under uniform distribution and 6629.7 vs. 6564.1 under normal distribution), while also further reducing response times to 1.36 s and 1.40 s, respectively. These findings confirm that our framework is not only effective as a standalone HRL method but can also be seamlessly hybridized with state-of-the-art reinforcement learning algorithms to deliver superior scheduling performance in dynamic environments.
5.4. Ablation Study
To evaluate the contributions of various components of our proposed reinforcement learning-based scheduling algorithm, we conducted an ablation study. The goal was to analyze the impact of each component (e.g., the policy gradient, semi-supervised pre-training, and mask mechanism) on the overall performance.
We performed the ablation experiments using the same 1000 tasks and 30 machines, with both uniform and normal task arrival distributions. The following configurations were tested:
Baseline (No RL): A heuristic-based algorithm using the Shortest Processing Time (SPT) rule without reinforcement learning.
Policy Gradient Only: Our proposed method with the policy gradient but without the semi-supervised pre-training or the mask mechanism.
Semi-Supervised Pre-training: The agent was pre-trained using heuristic methods before applying reinforcement learning with the mask mechanism.
Mask Mechanism Only: The agent uses the mask mechanism with the policy gradient but without semi-supervised pre-training.
Full Model: Our complete reinforcement learning-based model with all components (policy gradient, semi-supervised pre-training, and mask mechanism).
As shown in
Table 4, the full model consistently outperformed all ablated variants across all evaluation metrics. The Full Model (before lightweighting) already achieved excellent performance with a cumulative reward of 3098.5, which is +44.7 higher than the semi-supervised pre-training configuration (3053.8), while also attaining a perfect success rate of 100.0% compared to 97.6% for the second-best variant. Its response time (2.85 s) is nearly identical to that of semi-supervised pre-training (2.87 s), but with reduced average tardiness of 3.15 s, improving deadline adherence by 1.06 s. After applying lightweight optimization, the Full Model (after lightweighting) further boosted the cumulative reward to 3114.8 (+16.3 compared to the unoptimized full model) while maintaining a 100.0% success rate. More importantly, it dramatically reduced the response time to 1.38 s, achieving a 1.49 s reduction relative to the best non-lightweight variant. This highlights that the lightweighting step not only preserves the superior scheduling quality of the full model but also significantly improves its efficiency in real-time decision making. Although both versions of the full model required the longest training time of 2500 s, this additional cost is justified by their substantial improvements in reward, success rate, and latency compared to all other configurations.