Dual Resource Scheduling Method of Production Equipment and Rail-Guided Vehicles Based on Proximal Policy Optimization Algorithm

Nengqi Zhang; Bo Liu; Jian Zhang

doi:10.3390/technologies13120573

,

and

School of Mechanical Engineering, Tongji University, Shanghai 201800, China

^*

Author to whom correspondence should be addressed.

Technologies2025, 13(12), 573;https://doi.org/10.3390/technologies13120573
(registering DOI)

This article belongs to the Section Manufacturing Technology

Version Notes

Order Reprints

Abstract

In the context of intelligent manufacturing, the integrated scheduling problem of dual rail-guided vehicles (RGVs) and multiple parallel processing equipment in flexible manufacturing systems has gained increasing importance. This problem exhibits spatiotemporal coupling and dynamic constraint characteristics, making traditional optimization methods ineffective at finding optimal solutions. At the problem formulation level, the dual resource scheduling task is modeled as a mixed-integer optimization problem. An intelligent scheduling framework based on action mask-constrained Proximal Policy Optimization (PPO) deep reinforcement learning is proposed to achieve integrated decision-making for production equipment allocation and RGV path planning. The approach models the scheduling problem as a Markov Decision Process, designing a high-dimensional state space, along with a multi-discrete action space that integrates machine selection and RGV motion control. The framework employs a shared feature extraction layer and dual-head Actor-Critic network architecture, combined with parallel experience collection and synchronous parameter update mechanisms. In computational experiments across different scales, the proposed method achieves an average makespan reduction of 15–20% compared with numerical methods, while exhibiting excellent robustness under uncertain conditions including processing time fluctuations.

Keywords:

proximal policy optimization; RGV schedule; two-sided operations; conflict avoidance

1. Introduction

In the era of Industry 4.0 and intelligent manufacturing, flexible manufacturing systems are undergoing a fundamental transformation from large-batch standardized production to small-batch, multi-variety manufacturing with personalized customization of individual orders [1,2,3]. This transformation poses unprecedented challenges to resource scheduling and optimization in manufacturing systems. Particularly in automated material handling systems, rail-guided vehicles (RGVs) serve as critical links connecting processing equipment and material buffer zones, and their scheduling efficiency directly determines the production capacity and response speed of the entire manufacturing system [4,5]. Dual RGV systems have become widely adopted configuration schemes in modern smart factories due to their unique advantages in improving system throughput and enhancing scheduling flexibility [6].

From an optimization perspective, the dual RGV integrated scheduling problem considered in this study refers to jointly determining workpiece-to-machine assignments and dual RGV motion trajectories on a shared single track, such that the utilization of processing equipment and RGVs is coordinated in time and space. While this study focuses on a specific manufacturing scenario, the underlying logic aligns with the general framework of integrated design and operation management, which emphasizes the necessity of joint decision-making for optimizing enterprise systems [7]. Consequently, this integrated treatment of machine allocation and material-handling control exhibits extremely high computational complexity and has been proven to belong to the NP-hard problem category [8]. Its complexity is primarily manifested in three interconnected dimensions: (1) Discrete-continuous hybrid decision space: The system needs to simultaneously optimize discrete workpiece–machine allocation decisions and continuous RGV path planning decisions. The coupling of this heterogeneous decision space makes it difficult for traditional decomposition methods to obtain globally optimal solutions [9]; (2) Dynamic spatiotemporal constraints: The movement of two RGVs on a shared track must satisfy strict collision avoidance constraints, where the position of RGVL must be less than RGVR at any time. These dynamic constraints change in real time with system states, increasing the difficulty of searching for the feasible solution space [10]; and (3) Multi-objective trade-offs: Practical production requires simultaneous consideration of multiple conflicting objectives such as minimizing makespan and reducing transportation costs. Finding balance among these objectives is the core challenge of scheduling decisions [11].

Traditional scheduling methods face fundamental limitations when dealing with such complex problems. Exact algorithms such as mixed-integer programming can guarantee solution optimality, but their computational complexity grows exponentially with problem scale, making them difficult to solve within a reasonable time even for medium-scale instances [12]. Classical methods such as branch-and-bound and dynamic programming are constrained by the curse of dimensionality, exhibiting extremely low computational efficiency when managing practical problems involving multiple machines and large numbers of workpieces [13].

To overcome the computational bottlenecks of exact algorithms, researchers have proposed various metaheuristic algorithms. Genetic algorithms simulate biological evolutionary processes to search for near-optimal solutions and have achieved widespread application in shop scheduling [14,15]. Swarm intelligence algorithms such as Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) have also been successfully applied to RGV scheduling problems [16,17,18]. Li et al. proposed an improved harmony search algorithm targeting the standard deviation of machine waiting time and total travel distance of automated guided vehicles as metrics, conducting scheduling optimization at a certain scale [19]. Zhang et al. employed a hybrid multi-population genetic algorithm with embedded constraint programming to solve multi-objective scheduling planning problems considering makespan, maximum machine workload, and total tardiness [20].

However, these metaheuristic methods commonly suffer from the following shortcomings: (1) high dependence on problem-specific heuristic rules and parameter tuning, lacking generalizability and transferability [21]; (2) complex conflict resolution rules must be preset when handling RGV path conflicts, which not only increases algorithm design complexity but may also lead to suboptimal scheduling decisions [22,23]; and (3) difficulty in effectively responding to dynamic changes in production environments, such as machine failures and urgent order insertions [24,25].

In recent years, the rapid development of deep reinforcement learning (DRL) has provided a new paradigm for solving complex scheduling problems [26,27,28]. Unlike traditional optimization methods, DRL enables autonomous learning of optimal or near-optimal scheduling strategies through continuous interaction between agents and environments without relying on predefined rules or heuristic knowledge [29].

In the field of manufacturing system scheduling, DRL has demonstrated enormous application potential. Agrawal et al. modeled task scheduling problems as MDPs, incentivizing efficient task scheduling through rewards for completing machine job assignments, thereby minimizing delays and optimizing throughput [30]. In the Internet of Vehicles scenario, a deep reinforcement learning method was proposed that simultaneously considers mobility and seamless service migration to optimize overall delay and reliability under spatiotemporal dynamic and service continuity constraints [31]. Chen et al. [32] used deep reinforcement learning for joint computation offloading and computing resource allocation to adapt to heterogeneous user constraints. Gui et al. [33] proposed a deep reinforcement learning method for dynamic flexible job shops that can generate high-quality feasible solutions in disturbance scenarios. Zou et al. [34] proposed an effective multi-AGV scheduling framework in matrix-style manufacturing workshops by employing multi-objective modeling and metaheuristic fusion, considering practical constraints such as unloading and safety inspections. Chen et al. [35] achieved excellent throughput and resource utilization under uncertain workloads using an Actor-Critic-based adaptive resource scheduling method.

However, existing research mostly focuses on single RGV systems or ignores path conflicts between RGVs. For integrated scheduling problems of dual RGVs on shared tracks, particularly how to autonomously learn conflict avoidance strategies through DRL, in-depth research remains lacking [36,37].

Among various DRL algorithms, Proximal Policy Optimization (PPO) has attracted significant attention due to its excellent stability and sample efficiency [38,39,40]. PPO effectively addresses training instability issues in traditional policy gradient methods by limiting the magnitude of policy updates, making it particularly suitable for industrial scheduling applications with complex constraints [41,42].

This paper proposes an intelligent scheduling method based on PPO targeting path conflicts and resource competition problems in dual RGV integrated scheduling. This method treats production equipment and RGVs as a unified resource pool, achieving integrated decision-making through intelligent agents. The core contribution lies in solving RGV operational path conflicts through action mask constraints and learning near-optimal strategies to avoid machine blocking through training, thereby dynamically responding to uncertainties in manufacturing processes. Experimental results demonstrate that compared with existing methods, the proposed scheduling method has significant advantages in improving production efficiency and reducing total system delays, providing innovative solutions for real-time scheduling in intelligent manufacturing systems.

2. Problem and Mathematical Model

2.1. Problem Description and Assumptions

This research focuses on a class of dual resource scheduling problems that are widely present in modern flexible manufacturing systems. First, it constructs a standardized production unit model as shown in Figure 1, which includes a group of homogeneous or heterogeneous processing equipment configured in parallel, along with a material handling system consisting of two RGVs. These two RGVs operate on a shared unidirectional track, responsible for efficiently transferring workpieces between the system’s loading bay, processing equipment, and unloading bay. Compared with single RGV systems, while dual RGV configurations can theoretically significantly enhance the system’s material handling capacity and production takt time, they also introduce exponentially growing decision complexity and system coordination challenges. To effectively decouple the task space and reduce decision dimensionality, this paper adopts a functional specialization strategy: defining RGV_L as dedicated to feeding materials from the loading area to upstream equipment, while RGV_R is dedicated to retrieving materials from downstream equipment and transporting them to the unloading area. Although this strategy simplifies high-level task allocation, it focuses the core scheduling challenge on the spatiotemporal coupling and dynamic conflict avoidance of the two RGVs on the shared track.

Figure 1. Schematic diagram of dual RGVs operation.

The system’s operational process follows a precisely defined workflow. Workpieces enter the system loading area in a predetermined sequence, awaiting scheduling by RGV_L. Based on the scheduling strategy, RGV_L picks up workpieces at appropriate times and transports them to pre-assigned processing equipment. After workpieces complete processing on the equipment, their status changes to awaiting pickup, and they wait for RGV_R’s response. RGV_R then moves to the equipment to pick up the completed workpieces and transports them to the system unloading area, thus completing a full workpiece lifecycle. Throughout this entire process, the movement of both RGVs must comply with the physical constraints of the shared track to prevent conflicts, as illustrated in Figure 2.

Figure 2. Typical RGV operation diagram (a) Operational conflict occurrence between dual RGVs; (b) operational conflict avoidance between dual RGVs. A and B represent the points at which conflicts occur between RGV1 and RGV2.

This dual RGV scheduling problem is an NP-hard combinatorial optimization problem, with complexity far exceeding that of traditional flexible job shop scheduling. The core challenges of the problem stem from tight coupling at multiple levels: First is the mixed and high-dimensional nature of the decision space. At each decision moment, the system faces a compound decision space that requires simultaneously determining discrete workpiece-to-machine assignment strategies and RGV movement control strategies. This decision coupling results in an enormous solution space, making it difficult for traditional decomposition methods to find globally optimal solutions. Second is the high degree of spatiotemporal coupling of resources. Processing equipment and RGVs, as two types of critical resources, have state evolutions that are interdependent and mutually constraining in both temporal and spatial dimensions. Any movement decision by one RGV affects the time windows for tasks it can execute in the future, while equipment states define the transportation demands that RGVs must respond to. Finally, the physical layout of the shared single track introduces strong constraints, requiring the two RGVs to always avoid collisions.

From the perspective of demand arrivals, scheduling problems are commonly classified into on-line and off-line categories: in on-line scheduling, customer demands arrive over time and the scheduler must make decisions without full knowledge of future jobs, whereas in off-line scheduling all jobs are known before the schedule is executed [43]. In this study, all workpieces to be processed by the dual RGV system are released at time zero and no new orders arrive during the planning horizon, so the considered problem belongs to the off-line batch scheduling class.

To ensure model rigor and solvability while preserving the core characteristics of the problem, the following basic assumptions are established based on common industrial practices. First, all fundamental time parameters of the system, including processing times for each workpiece on different equipment, RGV loading and unloading times, and unit distance movement times, are treated as known deterministic constants. Second, the processing follows a non-preemptive principle, meaning that once a workpiece begins processing on a machine, it must be completed continuously without interruption. Finally, while workpieces are treated as homogeneous regarding their physical handling properties (meaning any machine can process them), heterogeneity is introduced in terms of processing duration. The mean and fluctuation range of each workpiece’s processing time are used to characterize the duration differences arising from varying sizes or processing requirements.

2.2. Mathematical Formulation

The mathematical modeling of the dual RGV scheduling problem requires comprehensive characterization of the system’s discrete decision variables and time variables, as well as the constraint system that controls the complex interactions between production resources and material handling equipment. This section first defines the notation system, then constructs the objective function and constraints to form a complete mixed-integer programming model.

Set definitions include:

\begin{array}{l} I = \{1, 2, \dots, N_{W}\} : Workpiece Set \\ M = \{1, 2, \dots, N_{M}\} : Machine Set \\ B = \{0, 1, \dots, N_{M} + 1\} : Bay Set (loading position, machine positions, unloading position) \\ R = \{L, R\} : RGV Set, where L denotes loading and R denotes unloading \end{array}

The physical parameters of the system include:

\begin{array}{l} T_{i}^{p r o c} : Processing time of workpiece i \\ T^{l o a d} : RGV workpiece loading time \\ T^{u n l o a d} : RGV workpiece unloading time \\ T^{m o v e} : RGV travel time between adjacent bays \end{array}

State variables include:

\begin{array}{l} x_{L} (t) {: Position coordinates of RGV}_{L} at t \\ x_{R} (t) {: Position coordinates of RGV}_{R} at t \\ S_{j}^{m a c h i n e} (t) : State of machine j at time t \\ S_{L / R}^{R G V} (t) : State of RGV at time t \end{array}

The decision variables in this model encompass discrete allocation decisions and time variables. For workpiece allocation decisions, binary variable

x_{i j} \in \{0, 1\}

is introduced to represent whether workpiece i is allocated to machine j for processing. For RGV control decision variables,

u (t) \in \{- 1, 0, 1\}

represents the movement decision of RGV at t. To describe the processing sequence,

y_{i j k} \in \{0, 1\}

is defined to represent whether workpiece i is processed before workpiece j on machine k. Time-dimensional decisions include the following:

S_{i}^{P}

and

C_{i}^{P}

represent the processing start and completion times of workpiece i, respectively.

S_{i}^{L}

represents the time when RGV_L starts loading workpiece i.

C_{i}^{L}

represents the time when workpiece i arrives at the machine unloading position.

The objective of the model is to minimize the maximum completion time (makespan):

\min C_{\max} = \max_{i \in I} \{C_{i}\}

(1)

The makespan minimization objective reflects the fundamental goal of maximizing system throughput. Additionally, by introducing movement penalties for RGV, a scheduling scheme that minimizes movement energy consumption is achieved while satisfying the makespan minimization requirement. Therefore, the objective of this project balances multiple competing factors: efficient allocation of workpieces to machines, coordination of RGV movements to minimize idle time, and avoidance of resource conflicts. By introducing auxiliary constraints

C_{\max} \geq C_{i}, \forall i \in I

, this nonlinear objective function can be linearized. It could transform the problem into a linear programming framework suitable for standard optimization techniques.

Furthermore, the feasibility of the scheduling scheme is guaranteed by a series of constraints, which reflect both physical limitations and operational requirements.

Constraint (2) is the anti-collision constraint between RGVs:

x_{L} (t) < x_{R} (t), \forall t \geq 0

(2)

Constraints (3) and (4) are the motion boundary constraints for RGV, ensuring that the RGV does not move outside the track range:

0 \leq x_{L} (t) \leq N_{M}, \forall t \geq 0

(3)

1 \leq x_{R} (t) \leq N_{M} + 1, \forall t \geq 0

(4)

Constraint (5) is the workpiece allocation constraint ensuring that each workpiece must and can only be allocated to one machine:

\sum_{j \in M} x_{i j} = 1, \forall i \in I

(5)

Constraint (6) is the workpiece process sequencing constraint that defines the complete process for workpiece handling. The temporal relationship from loading to transportation is:

S_{i}^{P} \geq S_{i}^{L} + T^{l o a d} + T^{m o v e} \cdot j + T^{u n l o a d}, \forall i \in I, j \in M : x_{i j} = 1

(6)

Constraint (7) is the temporal relationship from transportation to processing:

C_{i}^{R} \geq S_{i}^{R} + T^{u n l o a d} + T^{m o v e} \cdot |N_{M} + 1 - j| + T^{l o a d}, \forall i \in I, j \in M : x_{i j} = 1

(7)

Constraint (8) is the processing sequence constraint:

\sum_{k = 1}^{i} y_{i j k} = 1, \forall i \in I

(8)

Constraints (9) and (10) are the machine resource constraints ensuring non-overlapping processing of different workpieces on the same machine:

S_{j}^{P} \geq C_{i}^{R} - M (1 - z_{i j k}), \forall i, j \in I, i \neq j, k \in M

(9)

S_{i}^{P} \geq C_{j}^{R} - M \cdot z_{i j k}, \forall i, j \in I, i \neq j, k \in M

(10)

Constraint (11) is the pickup timing constraint ensuring that workpieces can only be picked up after completing processing:

S_{i}^{R} \geq S_{i}^{P} + T_{i}^{p r o c}

(11)

Furthermore, each RGV can carry at most one workpiece at any given time, which implicitly constrains the non-overlapping nature of loading and unloading operations.

Considering all the above constraints, this model exhibits characteristics such as mixed decision space, strongly coupled constraints, non-convex feasible domain, and time-varying constraints. Its computational complexity is

O (N_{W}^{N_{W}} \cdot 2^{T \cdot |A|})

, belonging to the NP-hard problem class. Based on the computational challenges of this problem, this paper proposes the development of an intelligent solution method based on an action-masked PPO algorithm.

3. Design of Algorithms

3.1. Markov Decision Process Formulation

Based on the model in Section 2, the dual RGV scheduling problem can be viewed as a sequential decision-making process, where each decision influences the future system state and the feasible action space. Modeling the dual RGV scheduling problem as a Markov Decision Process (MDP) requires precise definition of the state space, action space, transition probabilities, and reward function. The theoretical foundation of this modeling approach lies in the fact that the future evolution of the scheduling system at any given moment depends solely on the current state and is independent of historical trajectories, thereby satisfying the Markov property. We formalize this problem as a five-tuple

M = (S, A, P, R, γ)

, where each component is defined as follows. The design of the state space constitutes the core of MDP modeling. It employs a high-dimensional real-valued vector to represent the system state. This vector comprehensively encodes the complete system information, including RGV status vectors, task progress vectors, temporal states, machine status vectors, RGV distance vectors, and augmented observation features. Specifically, the state vector comprises the following components.

The RGV status vector (4-dimensional) encompasses the positional coordinates and carrying status of both RGVs. The task progress vector (3-dimensional) includes normalized values recording the number of workpieces to be loaded, completed workpieces, and workpieces to be retrieved within the system. The temporal state (1-dimensional) represents the ratio of the current time step to the maximum allowable time steps, aligning with the per-step time penalty and episode truncation to promote temporal awareness in the policy. The machine status vector (N_M-dimensional) provides a normalized representation of each machine’s current state and remaining processing time, supporting the selection head in making state-contingent machine selections. The RGV distance vector (N_2xM-dimensional) captures the normalized distances from RGVs to each machine, biasing decisions towards lower travel distances. The augmented observation features (3-dimensional) include the distance between the two RGVs and their target positions, enabling proactive traffic deconfliction and smoother motion planning. The total dimensionality of the state vector is

8 + 3 N_{M} + 3

, where all continuous values are normalized to the [0, 1] interval, ensuring that different state components possess similar numerical scales.

The action space A is designed as a multi-discrete space

A = A_{m a c h i n e} \times A_{R G V}

. Each action consists of two sub-actions: machine selection and RGV control, where

a_{m a c h i n e} \in \{1, 2, \dots, N_{M}\}

determines the next target machine and

a_{R G V} \in \{0, 1, \dots, 8\}

corresponds to 9 types of movement combinations:

(v_{L}, v_{R}) \in {\{- 1, 0, 1\}}^{2}

. Under the layout scale and rhythm settings, a unit stride can achieve long-distance displacement through multi-step concatenation, making it equivalent in optimal reachability to macro actions merged from multiple steps.

The machine selection sub-action reflects the decision-making requirements for production resource allocation, while the RGV control sub-action embodies the real-time requirements for coordination.

The transition probability

P (s'| s, a)

possesses a distinct characteristic in deterministic environments: given the current state s and action a, the next state is deterministic. This deterministic transition simplifies the learning process, enabling the agent to focus on policy optimization without the need to manage environmental stochasticity.

The reward function

R (s, a, s')

adopts a hierarchical reward structure:

R (s_{t}, a_{t}, s_{t + 1}) = R_{t a s k} + R_{m o v e} + R_{t i m e}

(12)

where

R_{t a s k}

represents the task completion reward,

R_{m o v e}

denotes the movement penalty, and

R_{t i m e}

corresponds to the time penalty. By establishing a hierarchical reward structure, the algorithm can simultaneously optimize both makespan and RGV energy consumption.

Algorithm 1 demonstrates the complete logic of environmental state transition, encompassing core steps including machine allocation, RGV movement control, task execution, and state update.

Algorithm 1. Environment State Transition.
Input:		Current state s, Action a = (a_machine, a_rgv)
Output:		Next state s’, Reward r, Done flag
1	function EnvironmentStep(s, a):
2	if workpieces_to_load ≠ ∅:
3	AssignWorkpiece(next_workpiece, a_machine)
4	v_L, v_R ← DecodeRGVAction(a_rgv)
5	p′_L ← p_L + v_L; p′_R ← p_R + v_R
6	UpdateRGVPositions(p′_L, p′_R)
7	r_move ← −\|v_L\| − \|v_R\|
8	r_task ← 0
9	for each RGV ∈ {RGV_L, RGV_R}:
10	if CheckLoadingCondition(RGV):
11	ExecuteLoading(RGV); r_task += r_success
12	if CheckUnloadingCondition(RGV):
13	ExecuteUnloading(RGV); r_task += r_success
14	UpdateMachineStates()
15	if LoadingComplete(RGV_L): RGV_L.state ← CARRYING
16	if UnloadingComplete(RGV_L): RGV_L.state ← IDLE
17	Similar state updates for RGV_R
18	UpdateWorkpieceQueues()
19	r ← r_task + r_move + r_time
20	done ← (completed_workpieces ≥ N_W) ∨ (t ≥ T_max)
21	s′ ← ConstructNewState()
22	return s′, r, done
23	end function

3.2. The Action Mask-Based PPO Solution Framework

The core principle of the PPO algorithm lies in constraining the magnitude of policy updates to ensure training stability, with its theoretical foundation rooted in trust region optimization methods. In conventional policy gradient approaches, excessive policy updates may lead to catastrophic performance degradation. To address this challenge, PPO introduces a clipping mechanism that maintains the Kullback–Leibler (KL) divergence between new and old policies within a controllable range, thereby achieving stable policy improvement.

In the dual RGVs scheduling problem, the adaptation of the PPO algorithm faces challenges in handling composite action spaces. MultiDiscrete action space modeling is employed to decompose joint decisions into two independent yet coupled sub-decisions:

π_{θ} (a| s) = π_{θ}^{m a c h i n e} (a_{m a c h i n e}| s) \cdot π_{θ}^{R G V} (a_{R G V}| s)

(13)

Traditional reinforcement learning methods typically manage constraints through reward–penalty approaches. However, such soft constraint methods are prone to training instability and solutions under complex constraints. This paper proposes a hard constraint-handling mechanism based on action masking. Let

M_{t}^{m a c h i n e} \in {\{0, 1\}}^{N_{M}}

and

M_{t}^{R G V} \in {\{0, 1\}}^{9}

represent the machine availability mask and RGV legal action mask, respectively. The masked policy is defined as:

{\tilde{π}}_{θ} (a| s, M) = \frac{π_{θ} (a| s) ⊙ M}{\sum_{a^{'}} π_{θ} (a| s) ⊙ M}

(14)

This approach excludes infeasible actions, not only accelerating the learning process but also ensuring that all generated scheduling schemes satisfy physical constraints.

The calculation of constraint masks requires consideration of multi-level constraints. The machine’s availability mask examines whether machines are in an idle state and unoccupied. The calculation of RGV legal action masks is more complex, necessitating simultaneous consideration of: (1) State constraints: RGVs must remain stationary during loading or unloading processes; (2) Boundary constraints: ensuring that RGVs do not move outside the track range; and (3) Collision avoidance constraints: guaranteeing that the position of RGVL is strictly less than RGVR at any given moment. Algorithm 2 provides a detailed description of this constraint calculation process.

Algorithm 2. Constraint Mask Computation.
Input:		Current state s
Output:		M_machine, M_rgv
1	function ComputeConstraintMasks(s):
2	M_machine ← [true] × N_M; M_rgv ← [false] × 9
3	//Machine availability mask
4	for i = 1 to N_M:
5	if machine_state [i] > 0 ∨ machine_occupied [i]:
6	M_machine [i] ← false
7	//RGV legal action mask
8	for k = 0 to 8:
9	v_L, v_R ← action_combinations [k]
10	//Check state constraints
11	if (RGV_L.state ∈ {LOADING, UNLOADING} ∧ v_L ≠ 0):
12	continue
13	if (RGV_R.state ∈ {LOADING, UNLOADING} ∧ v_R ≠ 0):
14	continue
15	//Check forced return constraints
16	if RGV_L.state = IDLE ∧ ¬RGV_L.carrying ∧ p_L ≠ 0:
17	if v_L ≠ sign(0 − p_L): continue
18	if RGV_R.state = CARRYING ∧ p_R ≠ UNLOADING_POS:
19	if v_R ≠ 0 ∧ v_R ≠ sign(UNLOADING_POS − p_R): continue
20	//Check boundary and collision constraints
21	p′_L ← p_L + v_L; p′_R ← p_R + v_R
22	if p′_L < 0 ∨p′_L > N_M ∨ p′_R < 1 ∨ p′_R > N_M + 1 ∨ p′_L ≥ p′_R:
23	continue
24	M_rgv [k] ← true
25	if ¬any(M_rgv): M_rgv [4] ← true // Ensure at least one valid action
26	return M_machine, M_rgv
27	end function

The PPO training framework designed in this paper adopts a strategy of parallel experience collection and synchronous parameter updates. This framework manages atomic updates of model parameters through the SharedModelState class, ensuring that all worker processes utilize consistent policy versions. Worker processes determine whether synchronization with the latest model is necessary through version number checking, thereby avoiding unnecessary communication overhead.

Specifically, in each iteration, all worker processes use identical processing time seeds to generate problem instances, ensuring that experiences are collected under the same problem configurations. Algorithm 3 demonstrates the complete PPO training workflow, encompassing key steps including parallel experience collection, advantage computation, policy optimization, and parameter synchronization.

Algorithm 3. PPO Training Framework.
Input:		Environment Env, Actor π_θ, Critic
Parameters:		T updates, W workers, K epochs, ε clip ratio
1	Initialize SharedModel(π_θ, V_φ)
2	Launch W persistent worker processes
3	for update = 1 to T:
4	experiences ← ParallelCollect(workers, episodes_per_worker)
5	states, actions, rewards, masks ← PrepareData(experiences)
6	with no_grad():
7	old_log_probs ← π_θ.log_prob(states, actions, masks)
8	values ← V_φ(states)
9	advantages, returns ← ComputeGAE(rewards, values, γ, λ)
10	advantages ← (advantages − μ_A)/σ_A//Normalize
11	for epoch = 1 to K:
12	for batch ∈ SampleBatches(data, batch_size):
13	//Compute policy ratio
14	log_probs ← π_θ.log_prob(batch.states, batch.actions, batch.masks)
15	ratio ← exp(log_probs − batch.old_log_probs)
16	//Clipped surrogate objective
17	surr1 ← ratio × batch.advantages
18	surr2 ← clip(ratio, 1 − ε, 1 + ε) × batch.advantages
19	L_actor ← −min(surr1, surr2).mean()
20	//Value function loss
21	values_pred ← V_φ(batch.states)
22	L_critic ← MSE(values_pred, batch.returns)
23	//Entropy regularization
24	entropy ← π_θ.entropy(batch.states, batch.masks)
25	L_total ← L_actor + c_value × L_critic − c_entropy × entropy
26	//Update parameters
27	∇_total ← Backprop(L_total); ClipGrad(∇_total, max_norm)
28	OptimizerStep(θ, φ, ∇_total)
29	SharedModel.sync(π_θ, V_φ)//Update shared model
30	SchedulerStep(η_t)//Update learning rate
31	end for

3.3. Network Architecture and Training Strategy

Addressing the unique structural characteristics of the dual RGV scheduling problem, a hierarchical Actor-Critic network architecture is designed. This architecture employs a shared feature extraction layer to process high-dimensional heterogeneous state inputs, followed by dedicated decision heads that output machine selection and RGV control strategies, respectively. Figure 3 illustrates the complete network architecture.

Figure 3. The action mask-based PPO network architecture for dual RGV scheduling.

As illustrated in Figure 3, the network input layer receives a d-dimensional state vector, where

d = 8 + N_{M} \times 3 + 3

. The state vector comprehensively encodes the complete system information, including RGV state vectors, task progress, temporal states, machine states, RGV-to-machine distance matrices, and augmented features. These heterogeneous inputs are unified into a single state vector through concatenation operations.

The shared feature extraction comprises three fully connected layers. The first layer performs dimensional expansion to provide sufficient representation space for heterogeneous inputs. To address scale differences between different state components, LayerNorm is applied after linear transformation but prior to activation functions. The subsequent two layers maintain dimensional consistency while extracting abstract features through increased network depth. The first two layers employ ReLU activation and dropout regularization (p = 0.1), whereas the third layer utilizes only ReLU activation to ensure feature stability approaching the output stage.

The Actor network adopts a dual-head output architecture to handle the multi-discrete action space. The machine selection head compresses the shared features through a two-layer neural network and outputs NM-dimensional logits corresponding to machine selection probabilities. The RGV control head employs an identical compression architecture, generating 9-dimensional logits corresponding to motion combinations

(v_{L}, v_{R}) \in {\{- 1, 0, 1\}}^{2}

. Both decision heads integrate masked softmax mechanisms that apply large negative masks (−10⁴) directly in the logit space, ensuring that probabilities for infeasible actions are strictly zero while preserving gradient flow for valid actions. This hard constraint approach enables the network to focus computational resources on learning preferences among feasible actions rather than learning constraint satisfaction.

The Critic network employs a deeper architecture comprising five fully connected layers, providing more accurate value estimations through additional nonlinear transformations. Compared with the Actor network, the Critic network requires enhanced representational capacity to precisely evaluate the long-term value of complex states. The final network output

V (s)

is a scalar value representing the expected cumulative reward from the current state.

Furthermore, the training strategy incorporates multiple optimization techniques to enhance learning efficiency. First, adaptive learning rate scheduling is employed, utilizing cosine annealing to gradually decay the learning rate to 10% of its initial value throughout the training process. Second, advantage normalization is performed during PPO objective computation, where the advantage function is calculated using Generalized Advantage Estimation (GAE) with λ = 0.95, followed by mean variance normalization. Third, gradient clipping (max_norm = 0.5) is applied to prevent training instability. Finally, persistent worker process pools and shared memory mechanisms are utilized to enable parallel experience collection.

4. Computational Experiments

4.1. Case Analysis

The study constructs a dual RGV scheduling instance comprising 7 processing equipment and 20 workpieces requiring fabrication. From a computational complexity perspective, the solution space of this problem reaches astronomical proportions of approximately

7^{20} \approx 8.0 \times 10^{16}

feasible combinations, considering exclusively the discrete workpiece-to-machine allocation decisions. When incorporating the continuous motion control decisions for both RGVs, the dimensionality escalates exponentially, rendering the problem intractable for conventional optimization approaches.

The experimental configuration adopts several simplifying assumptions to maintain feasibility while preserving the essential characteristics of real-world manufacturing systems. RGV acceleration and deceleration dynamics are neglected, with a fixed movement velocity of 1 bay/minute to ensure consistent temporal analysis. Loading and unloading operations are standardized at 1 min duration, reflecting typical industrial automated handling systems. Crucially, workpiece processing times follow the uniform distribution U (56, 66) minutes, where the 60 min average represents the manufacturing tasks characteristic of precision machine operations, while the 10 min variation range captures the inherent stochastic fluctuations observed in actual production environments. The selection of uniform distribution over alternative probability distributions provides a distribution-agnostic testing environment.

The PPO algorithm hyperparameters presented in Table 1 were determined through preliminary experimentation. Specifically, Figure 4 presents comparative experiments for the three main indicators. By comparing different value ranges, the current parameter configuration is confirmed.

Table 1. PPO algorithm parameters.

Figure 4. Determination of key parameters in the PPO algorithm, (a) different learn rate, (b) different clip epsilon and (c) different entropy coef.

Figure 5 provides a demonstration of the iterative convergence characteristics between the action mask-based PPO algorithm and the conventional penalty-based PPO approach (PPO_basic). The results reveal several performance distinctions that underscore the advantage of the action masking methodology. The convergence velocity superiority of the action mask-based PPO algorithm is pronounced during the initial training phases. Within the first 300 episodes, the action mask-enabled agent achieved makespan reductions to below 400 min, indicating successful acquisition of fundamental scheduling strategies. In contrast, the PPO_basic approach required significantly more episodes to reach comparable performance levels, with makespan values exceeding 1000 min during the same amount of time. This performance differential can be attributed to the fundamental architecture of action masking. By constraining the agent’s decision space to feasible actions, the learning process circumvents the expensive phase of learning constraint violations and associated penalties.

Figure 5. Iterative convergence curves of two PPO algorithms.

Training stability analysis also showed advantages of the action mask approach. Beyond the 1000-episode threshold, the PPO_basic convergence curve exhibits persistent oscillations, indicating potential instability in reward signal processing. The action mask-based algorithm demonstrates superior consistency, maintaining steady improvement trajectories with minimal performance volatility. Also, at the end of the training, the scheduling competency of the action mask-trained agent exceeds that of the PPO_basic approach. Therefore, the action mask-based PPO algorithm exhibits improvements over conventional PPO approaches in both training efficiency and final performance outcomes.

Figure 6 and Figure 7 provide visualizations of the scheduling schemes generated by the trained PPO agent when confronted with stochastic evaluation tasks. Figure 6 presents the Gantt chart of the scheduling solution. From a load balancing perspective, the scheduling scheme demonstrates superior machine resource allocation, with 20 workpieces distributed across 7 machines exhibiting a balanced distribution pattern. Each machine is allocated 2–3 workpieces for processing tasks, effectively preventing resource wastage scenarios where certain machines experience overload while others remain idle.

Figure 6. Gantt chart of the task planned by the agent for the stochastic evaluation study.

Figure 7. RGV operation path diagram planned by agents for stochastic evaluation studies.

Figure 7 illustrates the RGV trajectory diagram planned by the algorithm. From a collision avoidance strategy perspective, both RGVs maintain strict adherence to the constraint that RGV_L position remains consistently less than RGV_R position throughout the entire scheduling execution process. Notably, at the 70 min mark, when RGV_L requires navigation to higher-numbered machines for unloading operations, RGV_R exhibits predictive behavior by remaining at the unloading position, thereby providing adequate maneuvering space for RGV_L.

Furthermore, the movement strategy reveals that during multiple time intervals, both RGVs demonstrate coordinated unidirectional movement patterns. This behavior can maximize transportation efficiency while maintaining safe inter-vehicle distances.

From an energy efficiency optimization perspective, throughout the entire scheduling execution process, neither RGV exhibits ineffective retreat movements generated for conflict avoidance purposes. All movement trajectories demonstrate clear task-oriented directionality, reflecting the algorithm’s ability to minimize unnecessary movements while maintaining operational safety. Therefore, the action mask-based PPO algorithm successfully achieves the dual objectives of scheduling efficiency optimization and energy consumption control.

4.2. Scheduling Performance Evaluation

To assess the performance advantages of the action mask-based PPO algorithm, this study designed systematic comparative experiments with five representative baseline algorithms:

PPO_basic: Employs basic reward–penalty design and enables the agent to learn correct scheduling methods through punishment and completion rewards, differing from the proposed PPO algorithm.
Greedy Algorithm: Selects the apparently optimal solution at each step, aiming to achieve global optimality through locally optimal choices.
Earliest Due Date (EDD) Algorithm: Prioritizes tasks with the earliest deadlines, designed to minimize task delays or timeouts.
Genetic algorithm: Continuously improves candidate solution quality through selection, crossover, and mutation operations. The algorithm maintains a population of solutions, evaluates individual fitness in each generation, selects superior individuals for reproduction, and generates new generations.
MODI-GWO: A modified metaheuristic algorithm based on Grey Wolf Optimization (GWO). Zhang et al. demonstrated that this represents one of the most effective metaheuristic algorithms for this operational context [44].

The experimental design encompasses nine different problem scales, with workpiece quantities ranging from 20 to 60 and machine quantities from 5 to 9, forming a test suite spanning small- to large-scale problems. Different workpiece quantities examine algorithmic adaptability across varying processing scales, while different machine quantities assess algorithmic adaptation to dual resource scheduling. According to preliminary experiments, under the system parameters established in this study, when machine quantity is less than 7, the primary constraint is processing machine capacity; when machine quantity exceeds 7, the primary constraint becomes RGV transport capacity limitations due to limited carrying capacity and excessive transport path lengths. Therefore, this test suite reflects and compares scheduling scheme performance.

For the sake of simplicity, all PPO algorithms in table and picture below refer to the action mask-based PPO algorithms. Table 2 presents makespan comparisons for various algorithms with an average workpiece processing time of 60 min and processing time range of 10 min. Experimental results demonstrate that across all scales, the action mask-based PPO algorithm consistently outperforms comparative algorithms. In small-scale problems (20 workpieces), the proposed PPO achieves average improvement rates of 8.4% (20-5), 2.8% (20-7), and 2.6% (20-9) compared with the best baseline algorithms. As problem scale increases, performance advantages become increasingly pronounced: in medium-scale problems (40 workpieces), improvement rates reach 15.0% (40-5), 15.5% (40-7), and 10.0% (40-9); in large-scale problems (60 workpieces), improvement rates further increase to 19.3% (60-5), 22.5% (60-7), and 15.1% (60-9).

Table 2. The average makespan of different scale studies with an average processing time of 60 min (55–65 random).

Taking the most complex instance 60-9 from Table 2 as an example, the action mask-based PPO achieves an average makespan of 868.3 min, representing a 15.1% improvement over the best baseline algorithm, the genetic algorithm (1022.5 min), and a 17.1% improvement over the PPO_basic algorithm (1047.9 min). Furthermore, the action mask-based PPO algorithm advantages exhibit growth trends with problem complexity, indicating that the learned strategies effectively manage high-dimensional state spaces and complex constraint coupling. Conversely, traditional algorithms experience varying degrees of performance degradation when confronting exponentially growing solution spaces.

Beyond the average makespan, algorithmic stability across multiple stochastic realizations is crucial for evaluating the practical usefulness of a scheduling policy. In this study, robustness is understood as the insensitivity of the policy performance to moderate processing time fluctuations, whereas resilience characterizes the capability of the algorithm to maintain an acceptable service level and avoid catastrophic performance degradation when the disturbance level increases [45]. Figure 8 illustrates standard deviation comparisons across different algorithms over ten evaluations. The data reveals that as workpiece quantities increase, all algorithms exhibit varying degrees of standard deviation growth. Among these, MODI-GWO demonstrates the most pronounced standard deviation increases, growing from an average of 13.2 for a 20-workpiece scale to 76.4 for a 60-workpiece scale—a 477% increase—indicating algorithmic inability to manage such high-dimensional computational requirements as workpiece scale expands.

Figure 8. The standard deviation of each algorithm after the evaluations example in Table 2, (a) SD of Work(20)-Machine(5), (b) SD of Work(40)-Machine(5), (c) SD of Work(60)-Machine(5), (d) SD of Work(20)-Machine(7), (e) SD of Work(40)-Machine(7), (f) SD of Work(60)-Machine(7), (g) SD of Work(20)-Machine(9), (h) SD of Work(50)-Machine(9) and (i) SD of Work(60)-Machine(9).

In contrast, the action mask-based PPO algorithm maintains standard deviations below 20 across different problem scales: specifically, 5.8 for 20-workpiece scales, 11.3 for a 40-workpiece scale, and 15.7 for a 60-workpiece scale. Across all tests except Figure 8d, the proposed algorithm ensures the lowest standard deviation. While PPO_basic demonstrates overall higher stability than other algorithms, it remains inferior to our PPO algorithm. This occurs because PPO_basic employs complex reward functions including conflict penalties. Practical experience reveals that different problem scales often require different reward function designs; otherwise, convergence failures may occur. This creates optimization difficulties for PPO_basic, hindering optimal agent development. Therefore, action masking mechanisms constrain search spaces, avoiding the training instability encountered in traditional reinforcement learning approaches.

To further validate algorithmic adaptability to more complex processing conditions, this study increased the average workpiece processing time from 60 to 100 min for additional testing. Table 3 results demonstrate that processing time parameter changes highlight the robustness of the action mask-based PPO algorithm. When the average processing time increases from 60 to 100 min, the action mask-based PPO algorithm maintains optimal scheduling performance across all instances while exhibiting minimal performance scaling ratios in most cases. Specifically, in the most complex instance 60-9, the action mask-based PPO makespan increases from 868.3 to 992.3 min (14.3% growth rate), while the genetic algorithm increases from 1022.5 to 1214.6 min (18.8% growth rate), and MODI-GWO increases from 1165.1 to 1522.4 min (30.7% growth rate). This differential adaptability stems from PPO agents experiencing diverse processing time distributions during training, enabling value networks to evaluate long-term cumulative rewards across different time scales and adapt decision strategies to processing time variations.

Table 3. The average makespan of different scale studies with an average processing time of 100 min (95–105 random).

Figure 9 further demonstrates algorithmic adaptability to processing time variability. When the fluctuation range of workpiece processing time is adjusted from 10 to 20 min, this comparative experiment can be regarded as a scenario with stronger fluctuations in raw materials and processes. A comparative analysis of the average makespan and standard deviation across ten evaluations further reveals the robustness characteristics of the PPO algorithm. As shown in Figure 9a, increased processing time fluctuation ranges produce moderate makespan growth for 5 and 9 machine configurations (3.2% and 4.1%, respectively), while 7-machine makespan remains unchanged (variation less than 1%). This differential performance reflects the system bottleneck characteristics’ impact on algorithmic performance mechanisms. When obvious bottlenecks exist in either production equipment or RGV resources, processing time fluctuations affect overall scheduling efficiency. Conversely, when both resource types are configured, agents can respond to workpiece processing time fluctuations through dynamic scheduling strategy adjustments.

Figure 9. The change in the (a) average makespan and (b) standard deviation of the machining time span from 10 to 20 in the example in Table 2.

From stability perspectives, Figure 9b shows that despite increased fluctuation ranges causing standard deviation growth in most instances, overall levels remain low, with a maximum standard deviation of 15.8 appearing in the 40-7 instances lower than traditional algorithms. This behavior indicates that the learned policy can absorb the impact of stronger processing time disturbances and still maintain stable performance.

Comprehensive analysis reveals that the action mask-based PPO algorithm demonstrates three distinct advantages in dual RGV scheduling problems. First, performance advantages, particularly achieving 15–20% makespan improvements over traditional methods in large-scale problems, with advantages becoming more pronounced as problem complexity increases. Second, stability advantages, with multi-run result consistency far exceeding comparative algorithms and excellent standard deviation control within 20, providing reliable guarantees for practical applications. Third, adaptability advantages, achieving superior scheduling results across broad task duration variation ranges and demonstrating strong adaptive capabilities for manufacturing process uncertainties. These technical innovations ensure algorithmic performance in handling complex manufacturing system scheduling problems, providing feasibility for practical industrial applications.

Future research can be extended from this framework to address complex operational conditions involving different priorities, sizes, or machine compatibility. This could involve adding priority weights and machine-specific processability mask features to the state representation, or encoding machine processability and handling constraints into the action mask to ensure the probability of illegal assignments is strictly zero. Since the PPO framework remains unchanged, such extensions would not alter the learning mechanism.

5. Conclusions

This study addresses the integrated scheduling problem of dual RGVs and multiple processing equipment in intelligent manufacturing systems. First, a mixed-integer optimization model is constructed that captures machine assignment decisions, dual RGV motion decisions, and collision avoidance and capacity constraints. Second, this optimization problem is reformulated as a Markov Decision Process (MDP) and solved by an action mask-based PPO algorithm with a dual-head Actor-Critic network architecture.

In the MDP formulation, a high-dimensional state space of 8 + 3NM dimensions comprehensively considers RGV positional states, task progress, machine workloads, and spatiotemporal distances, while a multi-discrete action space integrates discrete machine allocation and RGV motion control decisions. Unlike traditional soft constraint approaches based on penalty functions, the proposed action masking strategy that prevents invalid decision-making demonstrates superior training stability and convergence speed.

Through systematic experiments encompassing various problem scales, the designed PPO algorithm achieves 15–20% makespan reduction compared with conventional methods. Furthermore, the algorithm maintains stable scheduling performance when confronted with uncertain factors such as processing time variations (60–100 min) and expanded fluctuation ranges (10–20 min), with standard deviations controlled within 20.

These results demonstrate not only the effectiveness of the learned dual RGV policy, but also its robustness and resilience; the policy maintains stable performance across repeated realizations of random processing times and avoids catastrophic degradation when the disturbance level and problem scale increase.

Author Contributions

Conceptualization, B.L. and J.Z.; Methodology, N.Z.; Software, N.Z.; Validation, N.Z.; Formal analysis, N.Z.; Investigation, N.Z.; Resources, N.Z.; Data curation, N.Z.; Writing—original draft, N.Z.; Writing—review and editing, J.Z.; Visualization, N.Z.; Supervision, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Javaid, M.; Haleem, A.; Singh, R.P.; Suman, R. Enabling flexible manufacturing system (FMS) through the applications of industry 4.0 technologies. Internet Things Cyber-Phys. Syst. 2022, 2, 49–62. [Google Scholar] [CrossRef]
Turner, C.; Oyekan, J. Manufacturing in the Age of Human-Centric and Sustainable Industry 5.0: Application to Holonic, Flexible, Reconfigurable and Smart Manufacturing Systems. Sustainability 2023, 15, 10169. [Google Scholar] [CrossRef]
Aversa, P.; Formentini, M.; Iubatti, D.; Lorenzoni, G. Digital Machines, Space, and Time: Towards a Behavioral Perspective of Flexible Manufacturing. J. Prod. Innov. Manag. 2021, 38, 114–141. [Google Scholar] [CrossRef]
De Ryck, M.; Pissoort, D.; Holvoet, T.; Demeester, E. Decentral task allocation for industrial AGV-systems with resource constraints. J. Manuf. Syst. 2021, 59, 310–319. [Google Scholar] [CrossRef]
Wang, J.; Qi, X. Intelligent RGV Dynamic Scheduling Virtual Simulation Technology Based on Machine Learning. Procedia Comput. Sci. 2023, 228, 1077–1085. [Google Scholar] [CrossRef]
Ma, C.; Zhou, B. Dual-rail–guided vehicle scheduling in an automated storage and retrieval system with loading and collision-avoidance constraints. Eng. Comput. 2021, 38, 3290–3324. [Google Scholar] [CrossRef]
Zhang, W.; Wang, J.; Lin, Y. Integrated design and operation management for enterprise systems. Enterp. Inf. Syst. 2019, 13, 424–429. [Google Scholar] [CrossRef]
Guan, D. Routing a vehicle of capacity greater than one. Discret. Appl. Math. 1997, 81, 41–57. [Google Scholar] [CrossRef]
Goli, A.; Ala, A.; Hajiaghaei-Keshteli, M. Efficient multi-objective meta-heuristic algorithms for energy-aware non-permutation flow-shop scheduling problem. Expert Syst. Appl. 2023, 213, 119077. [Google Scholar] [CrossRef]
Sun, J.; Xu, Z.; Yan, Z.; Liu, L.; Zhang, Y. An Approach to Integrated Scheduling of Flexible Job-Shop Considering Conflict-Free Routing Problems. Sensors 2023, 23, 4526. [Google Scholar] [CrossRef]
Xu, G.; Bao, Q.; Zhang, H. Multi-objective green scheduling of integrated flexible job shop and automated guided vehicles. Eng. Appl. Artif. Intell. 2023, 126, 106864. [Google Scholar] [CrossRef]
Xin, B.; Lu, S.; Wang, Q.; Deng, F. A review of flexible job shop scheduling problems considering transportation vehicles. Front. Inf. Technol. Electron. Eng. 2025, 26, 332–353. [Google Scholar] [CrossRef]
Tan, L.; Kong, T.L.; Zhang, Z.; Metwally, A.S.M.; Sharma, S.; Sharma, K.P.; Eldin, S.M.; Zimon, D. Scheduling and Controlling Production in an Internet of Things Environment for Industry 4.0: An Analysis and Systematic Review of Scientific Metrological Data. Sustainability 2023, 15, 7600. [Google Scholar] [CrossRef]
Mousavi, M.; Yap, H.J.; Musa, S.N.; Tahriri, F.; Md Dawal, S.Z. Multi-objective AGV scheduling in an FMS using a hybrid of genetic algorithm and particle swarm optimization. PLoS ONE 2017, 12, e0169817. [Google Scholar] [CrossRef]
Zhang, Q.; Manier, H.; Manier, M.-A. A genetic algorithm with tabu search procedure for flexible job shop scheduling with transportation constraints and bounded processing times. Comput. Oper. Res. 2012, 39, 1713–1723. [Google Scholar] [CrossRef]
Saidi-Mehrabad, M.; Dehnavi-Arani, S.; Evazabadian, F.; Mahmoodian, V. An Ant Colony Algorithm (ACA) for solving the new integrated model of job shop scheduling and conflict-free routing of AGVs. Comput. Ind. Eng. 2015, 86, 2–13. [Google Scholar] [CrossRef]
Nessari, S.; Tavakkoli-Moghaddam, R.; Bakhshi-Khaniki, H.; Bozorgi-Amiri, A. A hybrid simheuristic algorithm for solving bi-objective stochastic flexible job shop scheduling problems. Decis. Anal. J. 2024, 11, 100485. [Google Scholar] [CrossRef]
Li, X.; Wu, X.; Wang, P.; Xu, Y.; Gao, Y.; Chen, Y. Bi-Objective Circular Multi-Rail-Guided Vehicle Scheduling Optimization Considering Multi-Type Entry and Delivery Tasks: A Combined Genetic Algorithm and Symmetry Algorithm. Symmetry 2024, 16, 1205. [Google Scholar] [CrossRef]
Li, G.; Zeng, B.; Liao, W.; Li, X.; Gao, L. A new AGV scheduling algorithm based on harmony search for material transfer in a real-world manufacturing system. Adv. Mech. Eng. 2018, 10, 1687814018765560. [Google Scholar] [CrossRef]
Zhang, S.; Tang, F.; Li, X.; Liu, J.; Zhang, B. A hybrid multi-objective approach for real-time flexible production scheduling and rescheduling under dynamic environment in Industry 4.0 context. Comput. Oper. Res. 2021, 132, 105267. [Google Scholar] [CrossRef]
Jimenez, S.H.; Trabelsi, W.; Sauvey, C. Multi-Objective Production Rescheduling: A Systematic Literature Review. Mathematics 2024, 12, 3176. [Google Scholar] [CrossRef]
Li, W.; Du, S.; Zhong, L.; He, L. Multiobjective Scheduling for Cooperative Operation of Multiple Gantry Cranes in Railway Area of Container Terminal. IEEE Access 2022, 10, 46772–46781. [Google Scholar] [CrossRef]
Wang, Y.-Z.; Hu, Z.-H. An Iterative Re-Optimization Framework for the Dynamic Scheduling of Crossover Yard Cranes with Uncertain Delivery Sequences. J. Mar. Sci. Eng. 2023, 11, 892. [Google Scholar] [CrossRef]
Fattahi, P.; Fallahi, A. Dynamic scheduling in flexible job shop systems by considering simultaneously efficiency and stability. CIRP J. Manuf. Sci. Technol. 2010, 2, 114–123. [Google Scholar] [CrossRef]
Wang, J.; Li, Y.; Zhang, Z.; Wu, Z.; Wu, L.; Jia, S.; Peng, T. Dynamic Integrated Scheduling of Production Equipment and Automated Guided Vehicles in a Flexible Job Shop Based on Deep Reinforcement Learning. Processes 2024, 12, 2423. [Google Scholar] [CrossRef]
Ngwu, C.; Liu, Y.; Wu, R. Reinforcement learning in dynamic job shop scheduling: A comprehensive review of AI-driven approaches in modern manufacturing. J. Intell. Manuf. 2025, 1–16. [Google Scholar] [CrossRef]
Zhang, X.; Zhu, G.-Y. A literature review of reinforcement learning methods applied to job-shop scheduling problems. Comput. Oper. Res. 2025, 175, 106929. [Google Scholar] [CrossRef]
Modrak, V.; Sudhakarapandian, R.; Balamurugan, A.; Soltysova, Z. A Review on Reinforcement Learning in Production Scheduling: An Inferential Perspective. Algorithms 2024, 17, 343. [Google Scholar] [CrossRef]
Esteso, A.; Peidro, D.; Mula, J.; Díaz-Madroñero, M. Reinforcement learning applied to production planning and control. Int. J. Prod. Res. 2023, 61, 5772–5789. [Google Scholar] [CrossRef]
Agrawal, A.; Won, S.J.; Sharma, T.; Deshpande, M.; McComb, C. A multi-agent reinforcement learning framework for intelligent manufacturing with autonomous mobile robots. Int. Conf. Eng. Des. 2021, 1, 161–170. [Google Scholar] [CrossRef]
Chen, Z.; Huang, S.; Min, G.; Ning, Z.; Li, J.; Zhang, Y. Mobility-aware seamless service migration and resource allocation in multi-edge IoV systems. IEEE Trans. Mob. Comput. 2025, 24, 6315–6332. [Google Scholar] [CrossRef]
Chen, Z.; Xiong, B.; Chen, X.; Min, G.; Li, J. Joint computation offloading and Resource allocation in multi-edge smart communities with personalized federated deep reinforcement learning. IEEE Trans. Mob. Comput. 2024, 23, 11604–11619. [Google Scholar] [CrossRef]
Gui, Y.; Tang, D.; Zhu, H.; Zhang, Y.; Zhang, Z. Dynamic scheduling for flexible job shop using a deep reinforcement learning approach. Comput. Ind. Eng. 2023, 180, 109255. [Google Scholar] [CrossRef]
Zou, W.; Zou, J.; Sang, H.; Meng, L.; Pan, Q. An effective population-based iterated greedy algorithm for solving the multi-AGV scheduling problem with unloading safety detection. Inf. Sci. 2024, 657, 119949. [Google Scholar] [CrossRef]
Chen, Z.; Hu, J.; Min, G.; Luo, C.; El-Ghazawi, T. Adaptive and efficient resource allocation in cloud datacenters using actor-critic deep reinforcement learning. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 1911–1923. [Google Scholar] [CrossRef]
Chol, J.; Gun, C.R. Multi-agent based scheduling method for tandem automated guided vehicle systems. Eng. Appl. Artif. Intell. 2023, 123, 106229. [Google Scholar] [CrossRef]
Guo, T.; Sun, Y.; Liu, Y.; Liu, L. Circular Multiple RGV Path Planning based on Improved A* Algorithm. Front. Comput. Intell. Syst. 2023, 5, 66–72. [Google Scholar] [CrossRef]
Hu, B.; Zhang, K.; Li, N.; Mesbahi, M.; Fazel, M.; Başar, T. Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies. Annu. Rev. Control Robot. Auton. Syst. 2023, 6, 123–158. [Google Scholar] [CrossRef]
Ravindran, B. Advancements in Deep Reinforcement Learning: A Comparative Analysis of Policy Optimization Techniques. Int. J. Emerg. Trends Comput. Sci. Inf. Technol. 2020, 1, 18–28. [Google Scholar] [CrossRef]
Ogunfowora, O.; Najjaran, H. Reinforcement and deep reinforcement learning-based solutions for machine maintenance planning, scheduling policies, and optimization. J. Manuf. Syst. 2023, 70, 244–263. [Google Scholar] [CrossRef]
Zhang, M.; Lu, Y.; Hu, Y.; Amaitik, N.; Xu, Y. Dynamic Scheduling Method for Job-Shop Manufacturing Systems by Deep Reinforcement Learning with Proximal Policy Optimization. Sustainability 2022, 14, 5177. [Google Scholar] [CrossRef]
Yan, J.; Wang, L.; Zha, J.; Zhang, T.; Zhang, Y.; Liu, Z. Research on the dynamic scheduling problem of flexible job batch shop based on parallel proximal policy optimization algorithm. Comput. Ind. Eng. 2025, 207, 111362. [Google Scholar] [CrossRef]
Han, B.; Zhang, W.; Lu, X.; Lin, Y. On-line supply chain scheduling for single-machine and parallel-machine configurations with a single customer: Minimizing the makespan and delivery cost. Eur. J. Oper. Res. 2015, 244, 704–714. [Google Scholar] [CrossRef]
Zhang, Q.; Hu, J.; Liu, Z.; Duan, J. Multi-objective optimization of dual resource integrated scheduling problem of production equipment and RGV sconsidering conflict-free routing. PLoS ONE 2024, 19, e0297139. [Google Scholar] [CrossRef]
Han, B.; Liu, C.; Zhang, W. A method to measure the resilience of algorithm for operation management. IFAC-PapersOnLine 2016, 49, 1442–1447. [Google Scholar] [CrossRef]

Figure 1. Schematic diagram of dual RGVs operation.

Figure 2. Typical RGV operation diagram (a) Operational conflict occurrence between dual RGVs; (b) operational conflict avoidance between dual RGVs. A and B represent the points at which conflicts occur between RGV1 and RGV2.

Figure 3. The action mask-based PPO network architecture for dual RGV scheduling.

Figure 4. Determination of key parameters in the PPO algorithm, (a) different learn rate, (b) different clip epsilon and (c) different entropy coef.

Figure 5. Iterative convergence curves of two PPO algorithms.

Figure 6. Gantt chart of the task planned by the agent for the stochastic evaluation study.

Figure 7. RGV operation path diagram planned by agents for stochastic evaluation studies.

Figure 8. The standard deviation of each algorithm after the evaluations example in Table 2, (a) SD of Work(20)-Machine(5), (b) SD of Work(40)-Machine(5), (c) SD of Work(60)-Machine(5), (d) SD of Work(20)-Machine(7), (e) SD of Work(40)-Machine(7), (f) SD of Work(60)-Machine(7), (g) SD of Work(20)-Machine(9), (h) SD of Work(50)-Machine(9) and (i) SD of Work(60)-Machine(9).

Figure 9. The change in the (a) average makespan and (b) standard deviation of the machining time span from 10 to 20 in the example in Table 2.

Table 1. PPO algorithm parameters.

Parameter	Values
Learning rate	1 × 10⁻⁴
Discount factor	0.99
Clip epsilion	0.1
Value coef	0.5
Entropy coef	0.01
Epochs	8
Episode	1500

Table 2. The average makespan of different scale studies with an average processing time of 60 min (55–65 random).

Work-	Makespan
Machine	PPO	PPO_Basic	Greedy	EDD	Genetic	MODI-GWO
20-5	317.3	369.1	361.4	352.3	344.1	366
40-5	588.3	677.3	669	646.3	692	915
60-5	862	994.3	972.04	951.9	1068.5	1284.4
20-7	301.3	354.8	391.3	383	310	351
40-7	533	636.3	685.9	695	630.9	728.8
60-7	777.7	928.7	954.5	1003	1002.6	1156.9
20-9	327.5	379.9	425.1	407	336.3	369.9
40-9	605.8	707.5	758.4	751.1	673.3	763.7
60-9	868.3	1047.9	1114.8	1137.1	1022.5	1165.1

Table 3. The average makespan of different scale studies with an average processing time of 100 min (95–105 random).

Work-	Makespan
Machine	PPO	PPO_Basic	Greedy	EDD	Genetic	MODI-GWO
20-5	317.3	369.1	361.4	352.3	344.1	366
40-5	588.3	677.3	669	646.3	692	915
60-5	862	994.3	972.04	951.9	1068.5	1284.4
20-7	301.3	354.8	391.3	383	310	351
40-7	533	636.3	685.9	695	630.9	728.8
60-7	777.7	928.7	954.5	1003	1002.6	1156.9
20-9	327.5	379.9	425.1	407	336.3	369.9
40-9	605.8	707.5	758.4	751.1	673.3	763.7
60-9	868.3	1047.9	1114.8	1137.1	1022.5	1165.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Article metric data becomes available approximately 24 hours after publication online.