A Reinforcement Learning Method for Automated Guided Vehicle Dispatching and Path Planning Considering Charging and Path Conflicts at an Automated Container Terminal

Zuo, Tianli; Liu, Huakun; Yang, Shichun; Wang, Wenyuan; Peng, Yun; Wang, Ruchong

doi:10.3390/jmse14010055

Open AccessArticle

A Reinforcement Learning Method for Automated Guided Vehicle Dispatching and Path Planning Considering Charging and Path Conflicts at an Automated Container Terminal

by

Tianli Zuo

¹,

Huakun Liu

^2,*,

Shichun Yang

¹,

Wenyuan Wang

²

,

Yun Peng

² and

Ruchong Wang

³

¹

School of Transportation Science and Engineering, Beihang University, Beijing 100191, China

²

State Key Laboratory of Coastal and Offshore Engineering, Dalian University of Technology, Dalian 116024, China

³

Dalian Port North Shore Container Terminal Co., Ltd., Dalian 116610, China

^*

Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(1), 55; https://doi.org/10.3390/jmse14010055 (registering DOI)

Submission received: 3 December 2025 / Revised: 22 December 2025 / Accepted: 26 December 2025 / Published: 28 December 2025

(This article belongs to the Section Ocean Engineering)

Download

Browse Figures

Versions Notes

Abstract

The continued growth of international maritime trade has driven automated container terminals (ACTs) to pursue more efficient operational management strategies. In practice, the horizontal yard layout in ACTs significantly enhances transshipment efficiency. However, the more complex horizontal transporting system calls for an effective approach to enhance automated guided vehicle (AGV) scheduling. Considering AGV charging and path conflicts, this paper proposes a multi-agent reinforcement learning (MARL) approach to address the AGV dispatching and path planning (VD2P) problem under a horizontal layout. The VD2P problem is formulated as a Markov decision process model. To mitigate the challenges of high-dimensional state-action space, a multi-agent framework is developed to control the AGV dispatching and path planning separately. A mixed global–individual reward mechanism is tailored to enhance both exploration and corporation. A proximal policy optimization method is used to train the scheduling policies. Experiments indicate that the proposed MARL approach can provide high-quality solutions for a real-world-sized scenario within tens of seconds. Compared with benchmark methods, the proposed approach achieves an improvement of 8.4% to 53.8%. Moreover, sensitivity analyses are conducted to explore the impact of different AGV configurations and charging strategies on scheduling. Managerial insights are obtained to support more efficient terminal operations.

Keywords:

automated container terminal; AGV dispatching and path planning; charging; path conflict; reinforcement learning

1. Introduction

With the booming in maritime trade, advanced multimodal transportation systems have been widely integrated into automated container terminals (ACTs) to enhance transshipment effectiveness of maritime logistics [1,2]. In many global mega ports, water-to-water transshipment has become the dominant transport pattern. At Shanghai Yangshan, Guangzhou, Singapore, and Shenzhen ports, the ratios of water-to-water transshipment are approximately 50%, 90%, 85%, and 35%, respectively. Compared with import and export container flows, water-to-water transshipment containers are associated with shorter dwell times, more frequent and synchronized stacking and retrieval operations, and greater fluctuations in yard workload distribution. In large-scale ACTs with high transshipment ratios, the conventional vertical yard layout with transfer area often leads to imbalanced seaside and landside workloads, extended travel distances for yard cranes (YCs), and restricted handling space for horizontal transport devices, i.e., automated guided vehicles (AGVs) [3]. These factors collectively reduce operational efficiency, increase energy consumption, and accelerate equipment wear.

Driven by the rapid development of ACTs, the vertical yard layout with transfer area has gradually upgraded to various alternatives. Shanghai Yangshan port adopts a side interaction mode at transshipment blocks. Singapore port, Shenzhen port, Guangzhou port, and Tianjin port have implemented the horizontal yard with side handling mode, which significantly enhances flexibility and handling efficiency for transshipment operations. Nevertheless, the horizontal yard with side handling mode introduces substantial challenges to horizontal transport operations, specifically, AGV dispatching and path planning (VD2P).

Figure 1 and Figure 2 show an AGV transporting system under vertical and horizontal yard layouts at ACTs, respectively. Under the traditional vertical layout, AGV transporting is carried out along the circular paths between blocks and berths [4]. The transporting rules are simple, and AGVs are easily controlled to avoid conflict. However, the AGV transporting paths under the horizontal layout form a complex network structure with multiple crossroads, multiple lanes, various transporting rules, and overlapping transporting and handling space aside the block. Furthermore, battery-powered AGVs must charge before their state of charge (SoC) falls below a critical threshold, and the AGV is temporarily unavailable during charging. Consequently, the above complexities and charging requirements make the AGV scheduling problem more difficult under a horizontal layout.

Given this background, the AGV scheduling problem under a horizontal layout is an integrated optimization problem that simultaneously addresses AGV dispatching and path planning problems. On the one hand, idle AGVs must be dispatched to either a charge station or to a transport container task, with the aim to improve transporting efficiency. On the other hand, the transporting path of the dispatched AGV, including the specific path segments and lane selection, must be carefully planned to avoid potential conflicts.

However, existing studies on the joint optimization of AGV dispatching and path planning under a horizontal yard layout remain limited. To address this gap, this study proposes a multi-AGV scheduling framework that explicitly incorporates charging requirements and conflict avoidance under a horizontal yard layout. A reinforcement learning approach is developed to derive real-time AGV scheduling policy, enabling integrated decision-making for both AGV dispatching and path planning.

The main contributions of this study are summarized as follows:

(1): The AGV dispatching and path planning problem under a horizontal yard layout is formulated as a Markov decision process (MDP), aiming to improve the transporting efficiency at transshipment container terminals. A general simulator for an AGV real-time scheduling system is developed to capture the dynamic and diverse operational states of AGVs. The framework explicitly incorporates charging constraints as well as multiple types of path conflicts.
(2): A multi-agent learning framework is proposed to separately control the AGV dispatching and AGV path planning according to different state representations. The dispatching and path planning policies are shared among all homogeneous AGVs to ensure consistency and scalability. Furthermore, an integrated global–local reward mechanism is tailored to balance individual exploration with cooperative behavior, thereby enhancing both the efficiency of single-agent decisions and the coordination among multiple agents.
(3): The proposed approach is trained using the multi-agent proximal policy optimization (MAPPO) algorithm and evaluated across comprehensive numerical experiments. The results demonstrate that the MAPPO-based RL approach outperforms conventional rule-based benchmarks while achieving real-time solution efficiency at the operational level. In addition, application analyses based on a large-scale ACT scenario are further conducted to provide managerial insights into system configurations and operational strategies.

The remainder of this study is organized as follows. Section 2 comprehensively summarizes the existing work related to the AGV scheduling problem and the application of the reinforcement learning method. Section 3 describes the AGV scheduling problem considering charging and different types of path conflicts. Section 4 develops a multi-agent reinforcement learning approach to solve the integrated AGV real-time scheduling problem. In Section 5, computational experiments and application analyses are conducted. Key findings and potential future studies are summarized in Section 6.

2. Related Work

In this section, previous studies that relate to this study are comprehensively summarized into two aspects, i.e., research on AGV scheduling and the application of reinforcement learning methods.

2.1. AGV Scheduling Problem

Existing studies related to AGV scheduling mainly focus on AGV dispatching, AGV path planning, and their integration. An overview of studies on the AGV scheduling problem is summarized in Table 1.

Regarding AGV dispatching, Li et al. [5] investigated AGV scheduling with random tasks and battery swapping in ACTs using a two-stage stochastic programming model and a simulation-based ant colony algorithm. Gao et al. [6] proposed a QL-CNP algorithm integrating Q-learning and a contract net protocol to solve the multi-AGV dispatching problem, effectively balancing traffic load and reducing delays and congestion in ACTs. Duan et al. [7] investigated the integrated scheduling of QCs and AGVs in container terminals, considering both vessel makespan and AGV unladen time. Xing et al. [8] developed a three-phase algorithm incorporating a directed graph model and branch-and-bound methods, which effectively reduced the makespan and energy consumption by optimizing AGV speed adjustment and task assignment. Zhang et al. [9] developed an integrated scheduling model for AGVs and YCs, minimizing the maximum completion time using a hybrid algorithm combining genetic and dynamic programming approaches. Xing et al. [10] investigated an AGV charging scheduling with capacitated charging stations in ACTs, considering interactions among quay cranes (QCs), yard cranes (YCs), and AGVs. They developed a path-based mixed-integer programming model and a metaheuristic framework to minimize AGV operation time. Wang et al. [11] investigated the integrated optimization of AGV logistics scheduling and clean energy self-consistency in automated ports, developing a graph-based model and an immune algorithm to minimize scheduling time and maximize clean energy self-sufficiency. Yang et al. [12] studied the AGV dispatching problem integrated with handling facility scheduling at an ACT with a vertical yard layout. Charging and AGV mate capacity are considered in the developed MIP model. An improved variable neighborhood search algorithm is designed to solve large-scale instances. The proposed method achieves a high-quality solution compared with other methods. However, the computing time for large-scale instances still exceeds 40 min due to the time-consuming iteration.

With respect to AGV path planning, Hu et al. [13] employed the Multi-Agent Deep Deterministic Policy Gradient algorithm to address anti-conflict path planning for AGVs, effectively resolving opposite- and same-point occupation conflicts while minimizing path length. Wu et al. [14] developed a Petri-net-based scheduling framework with object-oriented timed colored stochastic Petri networks and heuristic rules to optimize AGV control in container terminals, incorporating three vehicle operation modes and four behavior patterns. In ACTs featuring a perpendicular yard layout and grouped loading/unloading tasks, Yue and Fan [15] proposed a hybrid algorithm combining rule-based heuristics, Dijkstra, and Q-Learning, along with a graph-based conflict avoidance strategy, which effectively reduced QC waiting times and path conflicts in numerical experiments. Tang et al. [16] introduced filter functions to eliminate irregular paths and applied cubic B-spline curves to smooth turns, which significantly reduced the number of nodes, turns, and total distance in planned paths compared to traditional methods. Xu et al. [17] proposed a load-in-load-out AGV scheduling model utilizing buffer zones to enable dual-container transport. Chen et al. [18] proposed an enhanced graph search method for global path planning of AGVs in port environments, addressing issues of paths too close to obstacles. Feng et al. [19] integrated obstacle, lane, and velocity potential fields into the MPC cost function and employed fuzzy logic to dynamically adjust weights.

Furthermore, some scholars jointly addressed AGV dispatching and path planning. Wang and Zeng [20] investigated the AGV dispatching and routing problem in ACTs with bidirectional paths, considering multiple parallel travel paths and shortcuts, developing a mixed-integer programming model to minimize the makespan and proposed a tailored branch-and-bound algorithm integrated with conflict-free routing heuristics. Li et al. [21] developed a hierarchical framework integrating reinforcement learning for dynamic task assignment and a tailored path generation algorithm to achieve conflict-free AGV scheduling, effectively reducing task delays and mitigating path conflicts in ACTs. Liang et al. [22] proposed a three-stage algorithm integrating Dijkstra-based route planning, GA-based task assignment, and speed control for collision avoidance. Lou et al. [23] addressed uncertainties through real-time virtual–physical interaction, predicted conflicts via twin-data comparison, and resolved them using Yen’s algorithm. Liu et al. [24] designed a spatiotemporal greedy-strategy-based genetic algorithm to minimize makespan and waiting time, which effectively resolved path conflicts and maintained efficient coordination among AGVs, trucks, and cranes without disrupting intermodal container operations.

While the existing literature has extensively explored AGV dispatching, path planning, and their integration, several notable limitations persist. Current studies often focus on vertical yard layouts, with insufficient attention paid to AGV system optimization in horizontal layout environments. Many approaches address subproblems in isolation, lack real-time decision-making capabilities, or fail to co-optimize scheduling and routing under dynamic conditions. Heavy reliance on traditional optimization or heuristic methods restricts scalability and real-time adaptability. Furthermore, few studies adequately incorporate charging constraints or comprehensively model diverse types of path conflicts, particularly lacking systematic analysis of typical issues in horizontal layouts such as heterogeneous traffic rules and varying lane capacities. These shortcomings considerably limit the practical applicability of existing approaches in large-scale, dynamic port terminal operations.

Table 1. Overview of studies on the AGV scheduling problem.

Citation	Yard Layout		Decisions		Constraints		Solution Method
Citation	H	V	D	P	C1	C2	Solution Method
Xu et al. [17]		√	-	√	-	-	MIP + HBM
Yue and Fan [15]		√	-	√	-	-	MIP + HBM
Wang and Zeng [20]		√	√	√	-	√	MIP + EM
Gao et al. [6]		√	√	-	-	√	MDP + RL
Xing et al. [8]		√	√	-	-	-	MIP + HBM
Duan et al. [7]		√	√	-	-	-	MIP + HBM
Li et al. [5]		√	√	-	-	-	MIP + HBM
Hu et al. [13]		√	-	√	-	√	MDP + RL
Zhang et al. [9]		√	√	-	√	-	MIP + HBM
Xing et al. [10]		√	√	-	√	-	MIP + HBM
Liu et al. [24]		√	√	√	-	√	MIP + HBM
Lou et al. [23]		√	√	√	-	√	MIP + RBM
Liang et al. [22]		√	√	√	-	√	MIP + HBM
Li et al. [21]		√	√	√	√	√	MDP + RL
Tang et al. [16]	√		-	√	-	√	MIP + HBM
Wu et al. [14]	√		-	√	√	√	MIP + HBM
Chen et al. [18]	√		-	√	-	√	MIP + HBM
Wang et al. [11]	√		√	-	√	-	MIP + HBM
This paper	√		√	√	√	√	MDP + RL

Note: “H” represents a horizontal yard layout, “V” represents a vertical yard layout, “D” represents AGV dispatching, “P” represents AGV path planning, “C1” represents charging, “C2” represents conflict, “MIP” represents a mixed-integer programming model, “MDP” represents the Markov decision process model, “HBM” represents a heuristic-based method, “RBM” represents a rule-based method, “EM” represents an exact method, “RL” represents a reinforcement learning approach.

2.2. Application of Reinforcement Learning Approach on AGV Scheduling Problem

The reinforcement learning approach demonstrates exceptional capabilities in handling dynamic uncertainties and multi-objective optimization [25,26,27], making it particularly suitable for complex AGV scheduling scenarios requiring real-time adaptation.

Wang et al. [28] proposed a DQN-based dynamic scheduling algorithm for AGVs in ACTs, incorporating a mask mechanism to handle reward sparsity. They designed both single and multi-AGV scheduling scenarios, optimizing task completion time and equipment utilization, respectively. The method effectively integrated terminal operational characteristics, yard layout constraints, and container handling tasks, demonstrating strong adaptability in simulated environments. However, this study did not account for the impact of charging on AGV scheduling, thereby limiting its practical applicability in real-world scenarios. Che et al. [29] developed a multi-agent deep reinforcement learning framework integrating heterogeneous graph networks and proximal policy optimization to optimize both job assignment and charging decisions under dynamic yard layouts and stochastic travel speeds. However, the lack of an explicit solution for AGV collision avoidance represents a limitation in its path planning framework, which may compromise operational safety and efficiency in practical applications.

Characterized by vertical and parallel yard layouts with container transportation tasks, ACTs present unique challenges for AGV path planning. Chen et al. [30] developed an APF-TD3 reinforcement learning framework that integrated artificial potential fields with deep deterministic policy gradient methods, generating the shortest and smoothest collision-free paths across small-, medium-, and large-scale port scenarios, significantly enhancing scheduling efficiency and operational safety. Zheng et al. [31] modeled the dynamic scheduling of multiple AGVs as a Markov decision process and developed a deep Q-network-based algorithm. The approach integrated real-time system states and mixed dispatching rules, effectively reducing QC waiting time and total completion time, particularly in large-scale scenarios with complex yard layouts and loading/unloading task dynamics.

In the study of AGV scheduling and energy management in ACTs, multiple innovative approaches have been proposed to enhance operational efficiency under dynamic conditions. Zhou et al. [32] developed a reinforcement-learning-based AGV scheduling model with a resilient charging strategy for ACTs, addressing dynamic energy constraints and task variability. Using a PPO algorithm within an actor–critic framework, the method enabled adaptive charging decisions and transportation task coordination. Experimental results from a large-scale terminal in the Pearl River Delta demonstrated significant improvements in completion time and revealed a U-shaped relationship between AGV quantity and operational efficiency, highlighting the importance of optimal fleet sizing and flexible charging under real-world stacking and loading conditions. Gong et al. [33] modeled the dynamic yard environment with magnetic nail networks and offline charging stations, focusing on real-time task allocation and conflict-free path planning under complex unloading processes. Drungilas et al. [34] developed a DRL-based speed control algorithm that dynamically adjusted AGV driving behavior. This approach reduced energy consumption by 4.6% in real-world simulations, demonstrating effective trade-offs between energy use and operation time within terminal environments.

Wang et al. [35] proposed a multi-agent C-DQN algorithm and addressed sparse rewards via masking, enhanced convergence with fixed-step ε-greedy exploration, and reduced conflicts through CTDE. Hau et al. [36] optimized AGV movement, waiting time, and container-handling actions within a grid-based terminal layout featuring quay and yard crane zones. The hybrid ACO-A2C approach generated collision-free action sequences for AGVs performing loading and unloading tasks, significantly reducing operation steps and improving equipment utilization. Zhang et al. [37] incorporated graph metrics into state representation and heuristic rules into action space, adopting group training–validation strategies. Chen et al. [38] developed a deep reinforcement learning model for AGV path planning using RGB image-based environmental representation and a multi-component reward function. Wei et al. [39] proposed a self-attention-based DRL method for AGV dispatching under a vehicle-initiated rule. They incorporated multimodal state features and invalid action masking, which reduced congestion and adapted to heuristic strategies like shortest queue dispatch under varying cost settings.

Beyond container terminals, reinforcement learning methods have also been extensively applied to AGV scheduling in diverse domains such as manufacturing systems [40,41], e-commerce warehousing [42,43], flexible production environments [44,45,46], indoor logistics [47], and airport cargo handling [48]. In addition, the Multi Agent Path Finding (MAPF) literature provides a canonical abstraction for computing collision-free routes for multiple agents moving concurrently on a shared network. In standard MAPF, each agent is assigned a start and a goal on a graph, and the objective is to find a set of time-indexed paths that avoid vertex conflicts and edge swap conflicts [49]. Representative optimal search-based methods include Conflict-Based Search (CBS) and improved variants such as ICBS, which explicitly detect conflicts and resolve them by adding constraints and replanning at a higher level [50]. In addition, compilation-based approaches transform MAPF into alternative formalisms such as SAT, CSP, or MILP, enabling the use of highly optimized solvers [51]. MAPF research also includes decoupled and prioritized planning methods that are computationally lightweight and more suitable for time-critical decision-making, although they may sacrifice optimality or completeness [52]. While MAPF primarily targets the routing layer, our problem setting further couples routing with real-time operational-level decisions, including task assignment and charging decisions under battery constraints and lane-based transport restrictions.

In summary, existing reinforcement-learning-based AGV scheduling studies mainly emphasize dispatching or task assignment and often treat AGV routing as a predefined shortest path procedure or as a simplified congestion penalty, without explicitly modeling multiple AGV path conflicts on shared lanes. Most studies focus primarily on vertical yard layouts, with insufficient attention to the distinctive challenges of horizontal layouts. In addition, charging decisions are frequently ignored or handled by fixed heuristics, which limits the ability to balance transport capacity and energy feasibility under battery constraints. In addition, the frequent reliance on homogeneous AGV assumptions and simplified conflict avoidance mechanisms further limits their applicability in large-scale, complex terminal environments. In contrast, this study targets an integrated real-time operational-level scheduling setting by jointly considering AGV task assignment, conflict-aware path planning, and charging decisions. Moreover, the proposed framework explicitly accounts for representative conflict types along shared path segments under lane-based movement restrictions, and it is evaluated on real-scale instances that cover all transport tasks within a one-hour planning horizon, demonstrating second-level online decision-making efficiency. These features collectively differentiate our work from prior MARL-based AGV scheduling studies and provide practical value for deploying learning-based scheduling policies in production terminal operations.

3. Development of the AGV Scheduling System

3.1. Problem Statement

Figure 3 shows the AGV transporting process under a horizontal yard layout at a transshipment ACT. AGVs run between berths and the yard to conduct loading and unloading transporting tasks. After receiving unloaded containers at a berth, an AGV first selects a loaded transporting path to deliver the container to its assigned storage block. After stacking, the AGV becomes idle and will be dispatched to another loading or unloading task according to current position and the distribution of tasks left in the yard. Then, the AGV chooses an empty transporting path to conduct the next task. Since AGVs are typically battery-powered, AGVs must recharge to avoid interruption caused by power exhaustion. During charging, AGVs are unavailable, and the charging duration depends on the remaining power and target charge threshold. A lower threshold results in a higher charging frequency, while a higher threshold will extend the charging duration, reducing transporting efficiency.

With the aim of improving the transporting efficiency and avoiding path conflicts, the AGV scheduling decisions mainly include AGV dispatching, path planning, charge time and target charge threshold selection. First, transport efficiency is reflected by typical metrics including total completion time of all tasks, average handling waiting time, average transport time, and average task delay, etc. In practice, these metrics are strongly associated with the service capability of horizontal transport, which further determines the waiting time of handling facilities and the overall production rhythm. Second, the path conflict is quantified by the frequency of different types of conflicts. Even if collisions are physically avoided by control systems, frequent blocking and deadlocks can still lead to excessive waiting and unstable operations. Therefore, conflict-related metrics are introduced to capture the interference among AGVs, including the number of overtaking conflicts, crossing conflicts, and parking conflicts. These conflicts are particularly important in real-world terminal scheduling because they directly affect the transporting time and potentially increase collision risks. In addition, since AGVs are subject to battery constraints, inappropriate charging decisions may lead to infeasible task execution due to insufficient energy or reduced transport capacity caused by excessive charging time. Therefore, identifying appropriate charging opportunities is also essential to enhance the continuity and efficiency of horizontal transportation in container terminal yards.

Unlike terminals primarily handling import or export containers, transshipment terminals frequently manage the concurrent loading and unloading operations of multiple vessels. The spatial distribution of loading and unloading workloads varies and fluctuates significantly across the yard.

Figure 4 and Figure 5 show real operational data from a transshipment ACT configured with 57 blocks in southern China. The statistics indicate that, within a 1 h period, 16 to 28 blocks simultaneously undertake both stacking and retrieval tasks, representing 28% to 49% of all blocks. In the same period, about 20 to 35 blocks are simultaneously engaged in retrieval tasks, accounting for 35% to 61% of the total. These results highlight the dispersed distribution of workloads in the yard, which creates considerable challenges for effective real-time AGV scheduling. In addition, when a large proportion of blocks are simultaneously engaged in stacking and retrieval operations, the transporting paths of multiple AGVs inevitably become intertwined and overlapping. This also underscores the need for rational path planning to reduce conflict risks and enhance overall transporting efficiency.

Under the horizontal yard layout in ACTs, AGV transport paths exhibit distinctive structural characteristics. As shown in Figure 6, paths can be classified into one-way and two-way configurations. Indicated by blue single-headed solid arrows, one-way paths allow for only unidirectional transport and are confined to container blocks for loading and unloading operations. Each one-way path consists of two lanes. Two-way paths outside the storage area are represented by red double-headed solid arrows, each configured with six lanes. Red double-headed dashed arrows denote two-way paths with four lanes located between blocks. Additionally, yellow double-headed solid arrows represent AGV turning areas, which are also equipped with six lanes.

To accurately capture feasible AGV paths, physical constraints, and resource conflicts while enabling efficient path planning, a discrete and structured transport path network is constructed in Figure 7. In this illustration, berths, blocks, and turning intersections are abstracted as individual nodes. The red lines indicate two-way connections between adjacent nodes, whereas blue lines represent one-way connections. As shown in Figure 7, nodes (0, 1), (0, 3), (0, 5), and (0, 7) correspond to four berths used for container loading and unloading. Block nodes such as (2, 1), (3, 3), and (4, 3) represent container stacking and retrieval positions, while nodes (10, 2) and (11, 2) denote charge stations.

The notations used for developing the AGV assignment and path planning system are summarized in Table 2.

Several assumptions are adopted in this study:

(1): Over a short rolling horizon, the quantities, operational sequences, and estimated handling durations of container tasks at each berth and block are known in advance by the terminal operating system. This is consistent with the actual operating situation. This assumption enables the proposed framework to focus on real-time operational decisions.
(2): The destination blocks for unloading containers and target berths for loading containers are predetermined in a short time window. This assumption is also consistent with real-world container terminal operations. For problems with dynamic yard space allocation plans, the proposed decision-making framework may need redesign to jointly optimize block assignment and AGV scheduling or to incorporate destination updates as uncertain events in the rolling horizon execution.
(3): At the real-time operational level, each AGV is assumed to run with a constant average speed on each path segment, so that the transport time between two adjacent nodes is pre-specified. Detailed acceleration/deceleration and turning effects at the vehicle control level are omitted.
(4): AGVs are all powered by Li-ion batteries. To support real-time operational-level scheduling, the battery state-of-charge (SOC) dynamics are approximated by linear charging and discharging characteristics within the operational SOC range considered in this study (40–100%).
(5): Quay cranes, yard cranes, and AGVs can each handle only one container at a time. This assumption is widely accepted in the area of container terminal operations.

3.2. AGV Scheduling State Transition

As stated in Section 3.1, there are two kinds of AGV states in the AGV dispatching and path planning system, i.e., a transporting state and charging state. Both of the two kinds of state start at the moment when AGV

k_{p}

ends its previous transporting or charging task and becomes free without load, i.e.,

t_{k_{p}}^{0, s}

. Figure 8 illustrates the two kinds of AGV state transition processes.

At

t_{k_{p}}^{0, s}

, AGV

k_{p}

can be assigned to a berth and block for loading or unloading transporting or be assigned to a charge station for charging. For example, at

t_{k_{p}}^{0, s}

, AGV

k_{p}

becomes free, and the system decision period proceeds to

p

. It is assumed that AGV

k_{p}

is located at block node

n_{k_{p}}^{s t a r t}

, and it is assigned to berth node

n_{k_{p}}^{t a r g e t}

. At this time, AGV

k_{p}

starts the unloading transporting state. Once the destination is determined, the empty transporting path

{p a t h}_{k_{p}}^{0}

and running time

τ_{k_{p}}^{t, 0}

from the current position to the destination will be generated following the shortest path approach.

{p a t h}_{k_{p}}^{0}

is composed of several segments

\{g_{0}, g_{1}, \dots\}

connected end to end. The arrival time to berth

n_{d}

can be obtained as follows.

t_{k_{p}}^{0, a} = t_{k_{p}}^{0, s} + τ_{k_{p}}^{t, 0}

(1)

{p a t h}_{k_{p}}^{0}, τ_{k_{p}}^{t, 0} = ξ (n_{k_{p}}^{s t a r t}, d_{k_{p}}^{s t a r t}, n_{k_{p}}^{t a r g e t})

(2)

After arriving at berth

n_{k_{p}}^{t a r g e t}

, AGV

k_{p}

is then assigned to the first queuing unloading container

u_{k_{p}}

. The start time for unloading cannot be earlier than the available time of the corresponding quay crane

t^{f} (n_{k_{p}}^{t a r g e t})

and the planned start time

t_{u_{k_{p}}}^{p}

. Under this consideration, the start and end time for handling can be determined as follows.

τ_{k_{p}}^{w, 0} = m a x \{0, t^{f} (n_{k_{p}}^{t a r g e t}) - t_{k_{p}}^{0, a}, t_{u_{k_{p}}}^{p} - t_{k_{p}}^{0, a}\}

(3)

t_{k_{p}}^{0, h} = t_{k_{p}}^{0, a} + τ_{k_{p}}^{w, 0}

(4)

t_{k_{p}}^{1, s} = t_{k_{p}}^{0, e} = t_{k_{p}}^{0, h} + t_{u_{k_{p}}}^{h}

(5)

After the unloading process is performed, AGV

k_{p}

starts loaded transporting from berth node

n_{k_{p}}^{t a r g e t}

to the stacking block node

n_{u_{k_{p}}}^{t a r g e t}

. The loaded transporting path and transporting time are determined as follows.

t_{k_{p}}^{1, a} = t_{k_{p}}^{1, s} + τ_{k_{p}}^{t, 1}

(6)

{p a t h}_{k_{p}}^{1}, τ_{k_{p}}^{t, 1} = ξ (n_{k_{p}}^{t a r g e t}, d_{k_{p}}^{t a r g e t}, n_{u_{k_{p}}}^{t a r g e t})

(7)

After arriving at the storage block, container

u_{k_{p}}

will be stacked into the block once the yard crane ends the former task. The start and end time for stacking can be obtained as follows.

τ_{k_{p}}^{w, 1} = m a x \{0, t^{f} (n_{k_{p}}^{t a r g e t}) - t_{k_{p}}^{0, a}\}

(8)

t_{k_{p}}^{1, h} = t_{k_{p}}^{1, a} + τ_{k_{p}}^{w, 1}

(9)

t_{k_{p}}^{1, e} = t_{k_{p}}^{1, h} + t_{u_{k_{p}}}^{h}

(10)

After the stacking process is finished, AGV

k_{p}

becomes free again and the current unloading transporting state is over. The loading transporting state is the reverse of the unloading transporting state. The charging state is similar to the unloading transporting state without the loaded transporting process. The required charging time

τ_{k_{p}}^{c h a r g e}

is determined according to current remaining power as follows.

τ_{k_{p}}^{c} = \frac{e^{t a r g e t} - e_{k_{p}}}{q^{c h a r g e}}

(11)

3.3. Path Conflict Detection

Three kinds of AGV path conflicts are considered in this study, i.e., overtaking conflicts, crossing conflicts, and parking conflicts. Figure 9 shows different kinds of conflicts between two AGVs.

n_{1} ~ n_{3}

are the consecutive nodes on the same lane

m

. Two AGVs,

k

and

l

, are running on this lane. The time increases from left to right. The vertical lines represent the same lane

m

at different times.

In Figure 9a, AGVs

k

and

l

run into lane

m

from node

n_{1}

at

t_{k}^{s}

and

t_{l}^{s}

, and run out of lane

m

from node

n_{3}

at

t_{k}^{e}

and

t_{l}^{e}

. If

t_{k}^{s} \geq t_{l}^{s}

and

t_{k}^{e} \leq t_{l}^{e}

, an overtaking conflict will happen between

n_{1}

and

n_{3}

. In Figure 9b, AGV

k

runs into lane

m

from node

n_{1}

at

t_{k}^{s}

and runs out of lane

m

from node

n_{3}

at

t_{k}^{e}

, and AGV

l

runs into lane

m

from node

n_{3}

at

t_{l}^{s}

and runs out of lane

m

from node

n_{1}

at

t_{l}^{e}

. If

t_{k}^{s} \leq t_{l}^{e}

and

t_{l}^{s} \leq t_{k}^{e}

, a crossing conflict will happen between

n_{1}

and

n_{3}

. In Figure 9c, AGV

k

runs into lane

m

from node

n_{1}

at

t_{k}^{s}

and runs out of lane

m

from node

n_{3}

at

t_{k}^{e}

, and AGV

l

is parking at node

n_{2}

for handling or waiting from

t_{l}^{s}

to

t_{l}^{e}

. If

t_{l}^{s} \leq t_{k}^{n_{2}} \leq t_{l}^{e}

, a parking conflict will happen at

n_{2}

.

3.4. Simulator for AGV Scheduling System

In this study, a simulator for an AGV scheduling system is developed to model the system dynamics of real-time AGV operations. The simulator is designed to be easily adaptable to various AGV scheduling systems with different terminal layouts, facility configurations, and transportation rules. The system simulation process is outlined in Algorithm 1. Algorithm 1 adopts a discrete time simulation framework. Given a set of loading and unloading tasks, each AGV is initialized at a designated position with a randomly generated power level. The simulator advances the global clock in discrete time and updates the positions and remaining transporting times of AGVs accordingly. Importantly, scheduling decisions are not made at every time step. Before the initialized scenario terminates, the simulator advances the system time to the point at which an AGV is required to make a decision. Specifically, there are two categories of decision epochs. (i) When an AGV starts a loaded transport, the decision epoch is triggered by the event that the AGV completes container loading at an operation node, and a path planning decision is made for the subsequent loaded transport. (ii) When an AGV becomes empty, the decision epoch is triggered by either the event that the AGV completes container unloading at the destination of a loaded transport or the event that the AGV finishes charging, after which task assignment and path planning decisions are made. After task assignment and path planning decisions are made, the simulator executes transportation, waiting, charging, loading, and unloading processes and updates the operational status of each AGV, block, berth, and charge station. The simulation of the current operational scenario terminates when all loading and unloading tasks have been completed.

Algorithm 1 Simulation process of AGV scheduling system
Initialization: loading tasks in blocks, unloading tasks at berths, AGV position and power level, decision period $p = 0$ , $d o n e \leftarrow F a l s e$ .
1:	While not $d o n e$ :
2:	If any loaded AGV $k$ arrives at its destination: # No decision is needed.
3:	Waiting for loading or stacking operation.
4:	Execute loading or stacking operation.
5:	Update the state of AGV $k$ to idle.
6:	Else if any empty AGV $k_{p}$ arrives at its destination: # Path plan decision is needed.
7:	Waiting for unloading or retrieving operation.
8:	Execute unloading or retrieving operation.
9:	Update the state of AGV $k_{p}$ to transporting, update arrive time and destination.
10:	Generate transporting path, and select conflict-free path lane.
11:	$p \leftarrow p + 1$
12:	Else: # Dispatch and path plan decisions are needed.
13:	Generate assignment decision (transporting or charging) for AGV $k_{p}$ .
14:	Update arrive time and destination of AGV $k_{p}$ .
15:	Generate transporting path, and select conflict-free path lane.
16:	If the AGV $k_{p}$ is assigned to transport:
17:	Update the state of AGV $k_{p}$ to transporting.
18:	Else:
19:	Update the state of AGV $k_{p}$ to charging, update the charging time and charging end time.
20:	End if
21:	$p \leftarrow p + 1$
22:	End if
23:	If all loading and unloading tasks are finished:
24:	$d o n e \leftarrow T r u e$
25:	End if
26:	End while

4. Multi-Agent Reinforcement Learning Approach

In this section, a reinforcement learning (RL)-based approach is proposed to address the integrated AGV assignment and path planning problem. The RL-based method enables real-time and reliable decision-making in real-world scenarios. Firstly, the problem is formulated as a Markov decision process (MDP), and a multi-agent learning framework is established. Then, an MDP-driven decision-making framework is designed to achieve joint control of AGV dispatching and path planning in an efficient and stable manner. Finally, a multi-agent proximal policy optimization (MAPPO)-based algorithm is tailored to train the multi-agent operational policy.

4.1. Multi-Agent MDP for AGV Scheduling System

The multi-agent MDP for the AGV scheduling system consists of several core elements such as agents, states, actions, and reward functions. The detailed design is introduced as follows.

4.1.1. Agents

A multi-agent structure is adopted to enable functional decomposition and coordination, consisting of an AGV dispatching agent and a path planning agent. The dispatching agent is responsible for assigning either transporting or charging tasks to idle AGVs. Once a task is assigned, the path planning agent determines the specific lanes for each segment of the shortest transporting path.

4.1.2. States

Corresponding to the multi-agent structure, states are distinguished by AGV dispatching state

s_{k_{p}}^{D}

and path planning state

s_{k_{p}, g}^{P}

.

(1): AGV dispatching state $s_{k_{p}}^{D}$

The AGV dispatching state is composed of global state

s_{p}^{D G}

and individual state

s_{k_{p}}^{D L}

.

s_{p}^{D G}

characterizes the overall workload and AGV distribution, including the number of left tasks

u_{p}^{L}

, queuing AGVs

v_{p}^{Q}

, assigned AGVs

v_{p}^{A}

at each handling node, and AGVs at each charging node.

s_{k_{p}}^{D L}

captures the characteristics of the current AGV, including its location

q_{k_{p}}

, heading direction

d_{k_{p}}

, and remaining power

e_{k_{p}}

.

s_{k_{p}}^{D} = [s_{p}^{D G}, s_{k_{p}}^{D L}]

(12)

s_{p}^{D G} = [u_{p}^{L}, v_{p}^{Q}, v_{p}^{A}, v_{p}^{C}]

(13)

s_{k_{p}}^{D L} = [q_{k_{p}}, d_{k_{p}}, e_{k_{p}}]

(14)

(2): Path planning state $s_{k_{p}, g}^{P}$

The path planning state incorporates both spatial information and congestion conditions for each segment along the shortest transporting path. Specifically, as defined in Equation (15), the state of

g

th segment

s_{k_{p}, g}^{P}

consists of origin node

x_{k_{p}, g}^{i n}

, destination node

x_{k_{p}, g}^{o u t}

, segment entry time

t_{k_{p}, g}^{i n}

, segment exit time

t_{k_{p}, g}^{o u t}

, and running direction

d_{k_{p}, g}

. In addition, the state also captures traffic conditions on the segment, including the number of parking AGVs

v_{p, g}^{P}

, the number of AGVs running in the same direction

v_{p, g}^{O}

, and those running in the opposite direction

v_{p, g}^{C}

on each lane of the current segment.

s_{k_{p}, g}^{P} = [x_{k_{p}, g}^{i n}, x_{k_{p}, g}^{o u t}, t_{k_{p}, g}^{i n}, t_{k_{p}, g}^{o u t}, d_{k_{p}, g}, v_{p, g}^{P}, v_{p, g}^{O}, v_{p, g}^{C}]

(15)

4.1.3. Actions

The action space is designed in a hierarchical discrete manner. AGV dispatching action

a_{k_{p}}^{D}

includes three types: unloading (denoted by 0), loading (denoted by 1), and charging (denoted by 2). Once the task type is determined, the specific handling node

n_{k_{p}}^{t a r g e t}

is automatically selected using a nearest distance function

f^{n} (a_{k_{p}}^{D})

.

a_{k_{p}}^{D} = [0,1, 2]

(16)

n_{k_{p}}^{t a r g e t} = f^{n} (a_{k_{p}}^{D})

(17)

The path planning action

a_{k_{p}}^{P}

consists of lane selection decisions for each segment

a_{k_{p}, g}^{P}

of the shortest transporting path. Once the origin node, origin direction, and destination node are determined, the shortest transporting path is generated using the Dijkstra algorithm. For each segment

g

of the path, the specific lane

a_{k_{p}, g}^{P}

is selected from all available lanes, as defined in Equation (19). Here,

n^{l a n e}

denotes the maximum available lane number per segment.

a_{k_{p}}^{P} = [a_{k_{p}, 1}^{P}, a_{k_{p}, 2}^{P}, \dots, a_{k_{p}, g}^{P}]

(18)

a_{k_{p}, g}^{P} = [0,1, \dots, n^{l a n e} - 1]

(19)

4.1.4. Reward Functions

(1): Global reward

The objectives of AGV dispatching and path planning are to minimize task makespan, improve transporting efficiency, reduce handling delay, and mitigate path conflicts. Accordingly, a global reward

r^{G}

is designed, comprising five components: completion time reward

r^{m}

, task delay penalty

\sum_{p} r_{k_{p}}^{t d}

, path conflict penalty

\sum_{p} r_{k_{p}}^{P}

, empty transporting penalty

\sum_{p} r_{k_{p}}^{e t}

, AGV handling wait penalty

\sum_{p} r_{k_{p}}^{h w}

, and AGV charging wait penalty

\sum_{p} r_{k_{p}}^{c w}

. These components are calculated according to Equations (21)–(26). Here,

ω^{m}

,

ω^{t d}

,

ω^{e t}

,

ω^{h w}

, and

ω^{c w}

are weighted factors for each corresponding reward.

φ^{P}, φ^{O}, φ^{C}

are weighted factors for three kinds of path conflicts. Notably, the completion time reward

r^{m}

is defined as the negative value of the maximum completion time among all tasks. In addition,

v_{p}^{Q} (n_{k_{p}}^{t a r g e t})

and

v_{p}^{C}

represent the number of AGVs queuing at the target handling node

n_{k_{p}}^{t a r g e t}

and charging AGVs in all stations, respectively.

r^{G} = r^{m} + \sum_{p} r_{k_{p}}^{t d} + \sum_{p} r_{k_{p}}^{P} + \sum_{p} r_{k_{p}}^{e t} + \sum_{p} r_{k_{p}}^{h w} + \sum_{p} r_{k_{p}}^{c w}

(20)

r^{m} = - ω^{m} \times m a x {\{t_{k_{p}}^{1, e}\}}_{p \in P}

(21)

r_{k_{p}}^{t d} = \{\begin{matrix} - ω^{t d} \times (t_{k_{p}}^{0, h} - t_{u_{k_{p}}}^{p}), i f a_{k_{p}}^{D} \neq 2 \\ 0, e l s e \end{matrix}

(22)

r_{k_{p}}^{P} = - φ^{P} \times x_{k_{p}}^{P} - φ^{O} \times x_{k_{p}}^{O} - φ^{C} \times x_{k_{p}}^{C}

(23)

r_{k_{p}}^{e t} = - ω^{e t} \times τ_{k_{p}}^{e}

(24)

r_{k_{p}}^{h w} = \{\begin{matrix} - ω^{h w} \times v_{p}^{Q} (n_{k_{p}}^{t a r g e t}), i f a_{k_{p}}^{D} \neq 2 \\ 0, e l s e \end{matrix}

(25)

r_{k_{p}}^{c w} = \{\begin{matrix} - ω^{c w} \times v_{p}^{C}, i f a_{k_{p}}^{D} = 2 a n d v_{p}^{C} \geq n^{c f} \\ 0, e l s e \end{matrix}

(26)

(2): Individual reward

To enhance the exploration ability of the multi-agent system, two kinds of individual rewards are further customized for the dispatching and path planning agents, respectively. Specifically, the dispatching agent mainly focuses on transporting efficiency and charging opportunities, while the path planning agent emphasizes the avoidance of path conflicts. This design provides each agent with feedback that is both task-relevant and hierarchically aligned, thereby enhancing the effectiveness of policy exploration and improving overall learning stability and performance.

The individual reward

r_{k_{p}}^{D}

for the dispatching agent is designed as Equations (27)–(29).

r_{k_{p}}^{e t}

,

r_{k_{p}}^{h w}

, and

r_{k_{p}}^{c w}

represent the penalties for empty transporting, AGV handling waiting, and AGV charging waiting, respectively.

r_{k_{p}}^{s p}

denotes the penalty for the charging decision of AGVs with sufficient power. As shown in Equation (28),

r_{k_{p}}^{s p}

is inversely proportional to the square of the remaining power

e_{k_{p}}

of AGV

k_{p}

. In addition,

r_{k_{p}}^{t n}

provides a positive reward for dispatching the AGV to handling nodes with a greater number of remaining tasks. As defined in Equation (29), it is proportional to the number of remaining tasks. Specifically,

u_{p}^{L} (n_{k_{p}}^{t a r g e t})

and

v_{p}^{A} (n_{k_{p}}^{t a r g e t})

denote the number of unassigned tasks and the number of AGVs already dispatched to the target node

n_{k_{p}}^{t a r g e t}

, respectively.

ω^{t n}

is the weighted factor of

r_{k_{p}}^{t n}

.

r_{k_{p}}^{D} = r_{k_{p}}^{e t} + r_{k_{p}}^{s p} + r_{k_{p}}^{h w} + r_{k_{p}}^{c w} + r_{k_{p}}^{t n}

(27)

r_{k_{p}}^{s p} = \{\begin{matrix} - μ_{k_{p}}^{s p} \times {e_{k_{p}}}^{2}, i f a_{k_{p}}^{D} = 2 \\ 0, e l s e \end{matrix}

(28)

r_{k_{p}}^{t n} = \{\begin{matrix} μ_{k_{p}}^{t n} \times (u_{p}^{L} (n_{k_{p}}^{t a r g e t}) - v_{p}^{A} (n_{k_{p}}^{t a r g e t})), i f a_{k_{p}}^{D} \neq 2 \\ 0, e l s e \end{matrix}

(29)

Since the path planning agent primarily focuses on path conflicts, its individual reward

r_{k_{p}}^{P}

is directly computed according to Equation (23).

4.1.5. Multi-Agent Learning Framework

Based on the above MDP formulation, a multi-agent learning framework is developed, as illustrated in Figure 10. At each decision step

p

, the simulator first generates the dispatching state

s_{k_{p}}^{D}

and path planning state

s_{k_{p}}^{P}

. The dispatching agent and path planning agent then provide corresponding actions

a_{k_{p}}^{D}

and

a_{k_{p}}^{P}

according to the input states, respectively. The above scheduling actions are executed by the simulator, which subsequently calculates the individual rewards and updates the system states to

s_{\tilde{k_{p}}}^{D}

and

s_{\tilde{k_{p}}}^{P}

. After all tasks are completed, the global reward is calculated and added to each individual reward. The dispatching transition

\{s_{k_{p}}^{D}, a_{k_{p}}^{D}, r_{k_{p}}^{D}, s_{\tilde{k_{p}}}^{D}\}

and path planning transition

\{s_{k_{p}}^{P}, a_{k_{p}}^{P}, r_{k_{p}}^{P}, s_{\tilde{k_{p}}}^{P}\}

are stored in their respective experience replay buffers. Once the number of stored transitions reaches the buffer’s capacity, mini-batches

B^{D}

and

B^{D}

are sampled from buffers to update the dispatching and path planning policies.

4.2. The MDP-Driven Decision-Making Framework of the AGV Scheduling System

Figure 11 illustrates the MDP-driven decision-making process of the AGV scheduling system under the multi-agent learning framework. Multiple loading and unloading tasks occur at blocks and berths, and AGVs are sequentially dispatched to transport containers between blocks and berths. To avoid the curse of dimensionality caused by joint optimization in an overly large action space, the AGV scheduling MDP involves two types of decision points:

t = t_{k}^{0, s t a r t}

, which denotes the moment when AGV

k

becomes idle after completing the current transporting task, and

t = t_{k}^{1, s t a r t}

, which denotes the moment when AGV

k

starts transporting after receiving a loading or unloading container. As shown by the red arrow, when the system advances to decision step

p

and the system clock time

t_{p} = t_{k_{p}}^{0, s t a r t}

, AGV

k_{p}

completes the current task and becomes idle. Then, the simulator generates the dispatching state

s_{k_{p}}^{D}

and path planning state

s_{k_{p}}^{P}

. The dispatching agent is first called to generate

a_{k_{p}}^{D}

, and then the path planning agent is called to generate

a_{k_{p}}^{P}

with the known

a_{k_{p}}^{D}

. The blue arrow in Figure 10 denotes the scheduling process at another kind of decision point

t_{p} = t_{k_{p}}^{1, s t a r t}

. At this time, the loaded transporting path planning action is determined by the path planning agent with the known destination. The scheduling decisions are then executed within the simulator, and the system status is subsequently updated. This process is repeated iteratively until all tasks are completed.

The above decision-making framework separates AGV dispatching and lane selection because they are activated at different event-triggered decision epochs and correspond to different operational roles in real-time terminal scheduling. When an AGV becomes empty, the controller must determine its next assignment, such as selecting the next transport node. This is a dispatching decision that primarily affects system throughput and workload distribution. In contrast, when an AGV completes loading and starts a loaded trip, the destination is already determined by the assigned task, and the key remaining decision is how to traverse the shared traffic network under lane-based movement restrictions. This is a lane selection decision that primarily affects travel efficiency and conflict-induced delays. Therefore, the separation is not an arbitrary modeling choice but is consistent with the operational semantics of state transitions and decision triggers in the simulation and execution process.

From an MDP perspective, the composite action at a decision epoch can be expressed as a pair

a = (a_{k_{p}}^{D}, a_{k_{p}}^{P})

. Accordingly, the policy can be written in a factorized form:

π (a | s) = π (a_{k_{p}}^{D}, a_{k_{p}}^{P} | s_{k_{p}}^{D}, s_{k_{p}}^{P}) = π_{θ}^{D} (a_{k_{p}}^{D} | s_{k_{p}}^{D}) π_{φ}^{P} (a_{k_{p}}^{P} | s_{k_{p}}^{P}, a_{k_{p}}^{D})

(30)

This factorization does not reduce the expressiveness of the policy class. For any joint policy

π (a_{k_{p}}^{D}, a_{k_{p}}^{P} | s_{k_{p}}^{D}, s_{k_{p}}^{P})

, one can construct an equivalent factorized policy by defining

π_{θ}^{D} (a_{k_{p}}^{D} | s_{k_{p}}^{D}) = \sum_{a_{k_{p}}^{P}} π (a_{k_{p}}^{D}, a_{k_{p}}^{P} | s_{k_{p}}^{D}, s_{k_{p}}^{P})

and

π_{φ}^{P} (a_{k_{p}}^{P} | s_{k_{p}}^{P}, a_{k_{p}}^{D}) = π (a_{k_{p}}^{D}, a_{k_{p}}^{P} | s_{k_{p}}^{D}, s_{k_{p}}^{P}) / π_{θ}^{D} (a_{k_{p}}^{D} | s_{k_{p}}^{D})

whenever

π_{θ}^{D} (a_{k_{p}}^{D} | s_{k_{p}}^{D}) > 0

. Therefore, the sequential decision representation is theoretically consistent with the original joint action formulation while providing a clearer operational interpretation.

Practically, the separation also improves scalability for real-time operational-level scheduling. Dispatching typically involves selecting among many tasks or nodes, while lane selection involves routing choices on a dense network with conflict interactions. Modeling them as a single flat action would lead to a prohibitively large composite action space and unstable learning. The factorized representation reduces action complexity by exploiting conditional structure, improves sample efficiency, and enables fast online inference, which is essential for generating decisions in real time under realistic traffic density and battery constraints.

4.3. Training Algorithm with Proximal Policy Optimization

AGV dispatching and path planning is a discrete action problem. Numerous studies have demonstrated that the proximal policy optimization (PPO) algorithm performs well in discrete action spaces and complex scheduling and control problems that involve large state spaces and delayed rewards [53,54]. Given the integrated decision scope in our setting, which couples dispatching, conflict-aware lane selection, and charging decisions under lane-based constraints, a stable learning algorithm is essential to ensure reliable training and fast online inference. In this paper, the PPO algorithm is employed to train the AGV dispatching and path planning policies under the designed multi-agent learning framework. In this study, the dispatching agent and path planning agent are equipped with two separate policy networks,

π_{θ^{D}} (a_{k}^{D} | s_{k}^{D})

and

π_{θ^{P}} (a_{k}^{P} | s_{k}^{P})

, and two value networks,

V_{φ^{D}} (s_{k}^{D})

and

V_{φ^{P}} (s_{k}^{P})

, respectively. The training process consists of several key stages: environment interaction and sampling, return and advantage estimation, loss function construction and optimization, network parameter updating, and policy improvement.

The parameters of both the policy network and value network are updated based on samples generated under the most recent policy. During environment interaction and sampling, the dispatching transitions and path planning transitions are generated and stored in experience replay buffers. Mini-batches

B^{D}

and

B^{D}

are sampled for policy training once the buffers are full. In sampled mini-batches, each transition

\{s_{k_{p}}, a_{k_{p}}, r_{k_{p}}, π_{θ} (a_{k_{p}} | s_{k_{p}}), V_{φ} (s_{k_{p}})\}

reflects the behavior of the current policy at decision step

p

. The cumulative discounted return

R_{p}

is calculated according to Equation (31).

r_{k_{p}}^{D} = r_{k_{p}}^{e t} + r_{k_{p}}^{s p} + r_{k_{p}}^{h w} + r_{k_{p}}^{c w} + r_{k_{p}}^{t n}

(31)

To avoid high variance and enhance training stability, the temporal difference (TD) error and the generalized advantage estimation (GAE)

{\hat{A}}_{p}

are further introduced to measure the quality of action

a_{k_{p}}

. The GAE

{\hat{A}}_{p}

is defined in Equations (32) and (33). Here,

γ

is the discount factor, and

λ

is the GAE smoothing parameter. By striking a balance between bias and variance, the estimation process of the advantage function can be smoothed to some extent, thereby improving the directionality and learning efficiency of the policy gradient.

δ_{p} = r_{k_{p}} + γ V (s_{k_{p + 1}}) - V (s_{k_{p}})

(32)

{\hat{A}}_{p} = δ_{p} + (γ λ) δ_{p + 1} + \dots + {(γ λ)}^{Q - p + 1} δ_{Q - 1}

(33)

Based on this, the loss function is constructed. To prevent training instability caused by overly rapid policy updates, the probability ratio

r_{p} (θ) = \frac{π_{θ^{n e w}} (a_{k_{p}} | s_{k_{p}})}{π_{θ^{o l d}} (a_{k_{p}} | s_{k_{p}})}

is further introduced and clipped within a certain range, ensuring that the policy is updated only within a “trust region”. This clipped policy function loss is calculated though the following equation.

L^{C L I P} (θ) = E_{p} [m i n (r_{p} (θ) {\hat{A}}_{p}, c l i p (r_{p} (θ), 1 - ε, 1 + ε) {\hat{A}}_{p})]

(34)

Meanwhile, to enable the value network to accurately evaluate state values, a value function loss

L^{V F} (φ)

is developed to reduce value estimation errors and enhance the stability of the advantage function:

L^{V F} (φ) = E_{p} [{(V_{φ} (s_{k_{p}}) - R_{k_{p}})}^{2}]

(35)

Additionally, to encourage exploration and improve decision diversity and robustness, a policy entropy

L^{E} (θ)

is designed in Equation (36).

H [π_{θ} (s_{k_{p}})]

represents the entropy of policy

π_{θ}

and represents the uncertainty of the policy distribution.

L^{E} (θ) = E_{p} [H [π_{θ} (s_{k_{p}})]]

(36)

Finally, as presented in Equation (37), the total loss function integrates three parts: the clipped policy function loss

L^{C L I P}

, the value function loss

L^{V F}

, and the policy entropy

L^{E}

. Here,

ω_{1}

and

ω_{2}

are weighted factors to balance the importance of

L^{V F}

and

L^{E}

.

L = L^{C L I P} - ω_{1} L^{V F} + ω_{2} L^{E}

(37)

The Adam optimizer is employed to perform mini-batch gradient updates on both policy network parameters

θ

and value network parameters

φ

. The iterative training process continues until the scheduling policy converges or predefined termination condition is met.

5. Numerical Experiments and Application Analyses

In this section, numerical experiments and application analyses are organized to evaluate the proposed MAPPO-based method for AGV dispatching and path planning in large-scale ACT scenarios. The effectiveness of the mixed global and local reward design is tested. Due to the complexity of the integration of dispatching and path planning under the horizontal yard layout, it is difficult to develop a mathematical programming model and solve it by exact methods or heuristic algorithms at a real-time scheduling level. In this study, the proposed method is compared with several commonly used artificial rules. In addition, the impact of AGV configuration and target charge threshold are also analyzed. The following experiments are conducted on a Windows server with an Intel Xeon E5-2650 v4 CPU, NVIDIA L40-48Q GPU, and 96 GB of RAM. The simulator and algorithms are coded in Python 3.7.1 and supported by the implementation of PyTorch 1.12.0.

5.1. Scenarios Setting

The proposed method is validated on a real-world-sized container terminal scenario with the same layout in Figure 3. The tested container yard consists of 40 horizontally arranged blocks and four berths, with 40 AGVs configured for transporting tasks. Six charge stations are located at the ends of blocks A4, A5, A9, A10, D6, and D8, each equipped with two charging facilities. Detailed configurations are derived from a real ACT in southern China and are provided in Table 3.

The reward function aggregates multiple operational objectives, including task completion time, task delay, path conflict, empty transport, AGV handling waiting, and AGV charging waiting. To justify reward weights, we first analyze the dominance relations among these objectives at the real-time operational level.

The reward weights are selected to reflect managerial priorities at the real-time operational level, with higher penalties assigned to conflict-induced delays and task delays to ensure stability and service reliability. Therefore, penalizing path conflicts should dominate other objectives to ensure stable and feasible operations. Task delay is the next priority because it is directly linked to service level and production scheduling reliability. Waiting-related objectives are then prioritized, where charging waiting affects effective AGV fleet availability and may lead to future infeasibility, while handling waiting reflects block area congestion and AGV idle time. Finally, task completion time and empty transport are treated as lower-priority refinement objectives because their improvements are meaningful only after stability and feasibility are maintained.

We further distinguish conflict types because their operational consequences differ. Crossing conflicts typically require more restrictive coordination and can cause severe blocking in narrow or lane-constrained segments. Parking conflicts often create prolonged occupancy and queue spillback, while overtaking conflicts are relatively easier to resolve through local speed adjustment or short waiting. The above analysis derives the following qualitative ordering for reward weights:

φ^{C} > φ^{P} > φ^{O} ≫ ω^{t d} > ω^{c w} > ω^{h w} > ω^{m} a n d ω^{e t}

. After fixing these priority relations, the detailed values of weighted factors are set according to the preference of port managers and are summarized in Table 4.

To validate the effectiveness of the proposed method under different workloads, three scheduling scenarios with varying workload levels are designed in Table 5. In S1, S2, and S3, the numbers of tasks in each block are randomly generated from [0, 2], [0, 6], and [0, 10], respectively, while the numbers of tasks at each berth are randomly generated from [0, 8], [0, 24], and [0, 40], respectively.

5.2. Parameter Tuning Test

In order to identify the best hyperparameters configuration of the MAPPO-based method, 16 distinct combinations of batch size, mini-batch size, learning rate of policy and value networks, and epsilon are designed for both dispatching and path planning agents, as detailed in Table 6.

A total of 16 sets of scheduling policies are trained with the above various hyperparameters settings and are then employed to solve 10 randomly generated cases under scenario S1. The average rewards obtained from different policies are reported in Figure 12. The results show that the fifteenth hyperparameter set achieves the highest average reward with notably stable performance. Consequently, this set of hyperparameters is used for training in the following experiments.

5.3. Experiment on Different Reward Function

The design of the reward function plays a critical role in MDP formulation. To evaluate the effectiveness of the proposed mixed global-and-individual reward mechanism, different reward function configurations are compared in this section.

Specifically, the first baseline variant only uses the global reward

\frac{r^{G}}{Q}

for both dispatching and path planning agents, which is denoted as MAPPO_G. The second baseline, referred to as MAPPO_L, employs only specific individual rewards for the two agents, as designed in Section 4.1.2. In contrast, the proposed method in this study, which is labeled MAPPO_GL, incorporates a mixed global-and-individual reward structure. Here, the reward of the dispatching agent combines the global reward

\frac{r^{G}}{Q}

and individual reward

r_{k_{p}}^{D}

, while the reward of path the planning agent integrates the global reward

\frac{r^{G}}{Q}

with the individual reward

r_{k_{p}}^{P}

.

The training processes under different rewards designed in the three scenarios are illustrated in Figure 13. The results show that MAPPO_G is able to optimize the initial policy and converge rapidly in S1. However, it fails to improve the initial policy effectively in S2 and S3. Across all three experimental scales, both MAPPO_GL and MAPPO_L significantly outperform MAPPO_G, indicating that tailored individual rewards for the two agents facilitate the exploration of better scheduling policies. In addition, both MAPPO_GL and MAPPO_L exhibit significant improvements in the early training stage under S1, S2, and S3. The benefit from introducing a global reward is that the cooperation between two independent agents is enhanced, and thus, the proposed MAPPO_GL shows better exploration ability at the later stage of training.

5.4. Comparison with Benchmark Scheduling Strategies

From the aspect of second-level operations, VD2P is an extremely dynamic and complex problem. It is quite hard to establish an MIP model considering detailed operational indices including task delay, empty transporting distance, charging, various path conflicts, etc. In addition, it is time-consuming to solve such a complex model by exact or heuristic methods. In addition, the above traditional solution methods can only solve a static moment of the dynamic AGV scheduling system and thus cannot meet the real-time operational requirements. Therefore, to evaluate the scheduling performance of the proposed MAPPO-based approach, three charging strategies and two dispatching strategies are employed for the benchmark.

The first charging strategy involves charging only when the remaining power level is lower than the minimum power for transporting (Cl), the second one involves randomly charging when the remaining power level is lower than 40% (Cr0.4), and the third one involves randomly charging when the AGV’s remaining power level is lower than 60% (Cr0.6). The two dispatching strategies are to choose the nearest task (Dn), and the nearest task within a loaded-cycle strategy (Dl). The loaded cycle restricts the AGV to transporting containers following an alternating loading and unloading manner, which is able to reduce empty transporting between block and berth. The transporting path is determined by the Dijkstra algorithm, and the specific lanes with the fewest transporting AGVs are selected. By combining the above dispatching and charging rules, six rule-based benchmark scheduling strategies are generated, i.e., Cl_Dn, Cl_Dl, Cr0.4_Dn, Cr0.4_Dl, Cr0.6_Dn, and Cr0.6_Dl.

A comparison of the results is presented in Table 7. The proposed MAPPO-based approach outperforms all rule-based methods, achieving an improvement in total weighted reward ranging from 8.4% to 53.8%. Specifically, it reduces path conflict penalties and average task delays by 12.2–41.4% and 22.9–272.3%, respectively, compared to the rule-based strategies. Among the rule-based methods, the loaded-cycle handling strategy (Dl) consistently performs better than the nearest handling strategy (Dn) under all three charging strategies. This improvement is mainly attributed to the significant lower task delay achieved by the Dl rule.

Since AGVs cannot execute transport tasks during charging, the timing of charging decisions has a substantial impact on overall scheduling performance. Compared to the fixed-threshold charging rule (Cl), the flexible rule Cl0.4 enhances decision flexibility by allowing AGVs to probabilistically decide whether to charge when the remaining power level drops below 40%. As a result, the total weighted rewards under the Cl0.4 rule are 0.7% and 7.3% higher than those under Cl when combined with the Dn and Dl rules, respectively. The decision flexibility is further improved under Cl0.6. However, the overall transporting efficiency is reduced with the increase in the total number of charging AGVs. Consequently, the total weighted rewards under Cl0.6 are lower than those achieved by both the Cl and Cl0.4 rules. These findings emphasize the critical role of charging opportunity selection in AGV scheduling, and further validate the superior flexibility and performance of the proposed MAPPO-based method.

5.5. Application Analysis on AGV Configuration and Target Charge Threshold

To further investigate the impact of AGV configurations and target charge thresholds on AGV scheduling, application analyses are conducted based on a large-scale ACT scenario. Ten scheduling instances are randomly generated based on the parameter settings of scenario S3 and solved under different AGV configurations and target charge thresholds. The average results are summarized and analyzed as follows.

Figure 14 presents the scheduling performance under different AGV configurations. When the fleet size is limited to 20 AGVs, the transport capacity is insufficient to meet operational demands. Each AGV is assigned a relatively high workload, resulting in longer average empty transport time and a significantly lower total weighted reward. As the number of AGVs increases from 20 to 50, the transport capacity gradually becomes adequate for the required loading and unloading operations. Consequently, the total weighted reward gradually increases, while both the total task complete time and average empty transport time decrease. However, when the fleet size exceeds 50, the total complete time remains nearly unchanged. The addition of more AGVs leads to higher path conflict penalties and longer AGV wait times for handling, which offset any potential gains in transport efficiency. As a result, the total weighted reward declines sharply beyond this point.

Figure 15 presents the comparison results under different target charge thresholds, ranging from 60% to 100%. The results demonstrate that the target charge threshold significantly influences AGV scheduling performance. As shown in the top figure, the penalty of path conflicts increases steadily as the target charge threshold rises from 60% to 90%, peaking at 90% and slightly declining at 100%. This trend suggests that higher charging thresholds lead to more congestion in the AGV transport network. The bottom part shows that at target charge thresholds of 60% and 70%, the average charging wait time remains very low. In this range, AGVs are allowed to recharge more frequently with shorter cycles, and the charging infrastructure is sufficient to meet demand. As a result, queuing at charging stations is rare, enabling smoother charging scheduling.

Once the target charge threshold exceeds 80%, the total weighted reward (shown in red in the top figure) drops sharply. This is accompanied by a substantial increase in charging wait time (light blue bars in the bottom figure), which becomes a major bottleneck. Since AGVs are required to recharge to higher levels, each charging session becomes longer, reducing charger turnover. The fixed number of charging stations is insufficient to support this increased demand, causing extensive queuing and scheduling delays. In addition, the task delay (dark blue bars) reaches its lowest point at a target charge threshold of 80%, indicating that this setting achieves the best balance between transport efficiency and charging frequency. This also aligns with the peak in total reward and moderate path conflict penalty, confirming that 80% is the most favorable setting under the current scenario conditions. In addition, across all target charge thresholds, the AGV wait time for handling (yellow bars) remains relatively stable. These findings highlight the importance of setting a moderate charge threshold to optimize the trade-off between energy replenishment and task execution efficiency.

The experimental results highlight that simply increasing the AGV fleet size or setting overly high charging thresholds does not guarantee better performance. Instead, optimal operational efficiency is achieved when the fleet size is maintained within a reasonable range (e.g., 40–50 AGVs) and the target charge threshold is set around 80%. This configuration balances transport capacity, charging efficiency, and path conflicts. These findings provide practical guidance for terminal operators in determining an economically and operationally optimal AGV fleet size and a reasonable target charge threshold under given workload and infrastructure constraints.

6. Conclusions

This study investigates the VD2P problem with explicit consideration of charging requirements and multiple path conflicts in horizontal-layout ACTs. The VD2P problem is formulated as an MDP model, aiming to minimize the weighted cost of completion time, empty transporting, wait time, and path conflicts. A real-time dynamic simulator for VD2P is tailored to capture AGV status, implement dispatching and path planning action, calculate global and individual rewards, and advance system processes. A multi-agent learning framework is proposed to achieve independent control for dispatching and path planning decision-making. A mixed global–individual reward mechanism is designed to improve the multi-agent system performance. Both the AGV dispatching agent and path planning agent are trained through a multi-agent proximal policy optimization (MAPPO) approach.

The proposed MAPPO-based approach is evaluated on a real-world-sized ACT with different workloads. Experimental results on different reward functions indicate that the mixed global–individual reward structure can efficiently enhance the exploration of both the two agents and further improve the cooperative ability of the multi-agent system. Due to the extremely dynamic nature and complexity of VD2P under the second-level operational scenario, it is hard to formulate an MIP model and solve it by exact or heuristic methods. Therefore, several artificial rule-based methods are used as benchmark methods for comparison. The results show that the proposed MAPPO-based approach achieves an improvement of 6.5% to 88.2% compared to rule-based methods from the aspect of weighted reward and shows a more balance consideration among multiple objectives. The minute-level solution efficiency under the real-world-sized scenario with busy workloads shows the potential of practical application. Moreover, application analyses based on a large-scale ACT operational scenario are further conducted to explore the impacts of AGV configurations and the target charge threshold. The above experimental results demonstrate that the proposed reinforcement-learning-based scheduling framework can significantly improve solution quality and operational stability compared with practical rule-based approaches while meeting real-time decision requirements. These findings also provide practical guidance for terminal operators to develop reliable facility configurations and real-time operational strategies.

Inspired by the application analysis results, we find that the target charge threshold significantly affects the AGV scheduling. As VD2P is a highly dynamic problem, an adaptive target charge threshold setting may further improve the transport efficiency. Integrated with target charge threshold decisions, solving the VD2P problem at a real-time operational level is a valuable research direction. In addition, future work will extend the proposed framework by incorporating a more detailed energy and motion model that captures acceleration and deceleration dynamics and turning-related penalties, enabling route selection that directly accounts for maneuver-induced energy costs. Another promising direction is to develop flexible policies that can adapt to scenario changes by dynamically adjusting reward weights or incorporating more structured multi-objective learning and constrained formulations. This would enable the scheduler to generate a set of Pareto-efficient policies and select appropriate trade-offs based on real-time operating conditions and management requirements. Furthermore, it is also interesting to incorporate a more realistic charging-power characteristic, for example, a piecewise linear approximation consistent with CC–CV behavior, to explicitly capture the slower charging near high SOC and investigate its impacts on charging opportunity selection and overall scheduling performance.

Author Contributions

Conceptualization, T.Z. and H.L.; methodology, H.L., W.W., and Y.P.; software, H.L.; investigation, S.Y. and R.W.; writing—original draft, T.Z. and H.L.; supervision, S.Y., W.W., and Y.P.; visualization, Y.P. and R.W.; funding acquisition, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Key Research and Development Program of China (Grant No. 2023YFB2504100), National Natural Science Foundation of China (Grant No. 52476172), and the Fundamental Research Funds for the Central Universities (Grant No. 501QYJC2025146001).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Conflicts of Interest

Ruchong Wang is an employee of Dalian Port North Shore Container Terminal Co., Ltd.; the other authors declare no conflicts of interest.

References

Chen, J.; Ye, J.; Zhuang, C.; Qin, Q.; Shu, Y. Liner shipping alliance management: Overview and future research directions. Ocean Coast. Manag. 2022, 219, 106039. [Google Scholar] [CrossRef]
Chen, J.; Liu, X.; Zhou, S.; Kang, J. Knowledge mapping analysis of resilient shipping network using CiteSpace. Ocean Coast. Manag. 2023, 244, 106775. [Google Scholar] [CrossRef]
Chen, J.; Zhuang, C.; Xu, H.; Xu, L.; Ye, S.; Rangel-Buitrago, N. Collaborative management evaluation of container shipping alliance in maritime logistics industry: CKYHE case analysis. Ocean Coast. Manag. 2022, 225, 106176. [Google Scholar] [CrossRef]
Chen, J.; Xu, J.; Zhou, S.; Liu, A. Slot co-chartering and capacity deployment optimization of liner alliances in containerized maritime logistics industry. Adv. Eng. Inform. 2023, 56, 101986. [Google Scholar] [CrossRef]
Li, L.; Li, Y.; Liu, R.; Zhou, Y.; Pan, E. A Two-stage Stochastic Programming for AGV scheduling with random tasks and battery swapping in automated container terminals. Transp. Res. Part E Logist. Transp. Rev. 2023, 174, 103110. [Google Scholar] [CrossRef]
Gao, Y.; Chen, C.-H.; Chang, D. A machine learning-based approach for multi-agv dispatching at automated container terminals. J. Mar. Sci. Eng. 2023, 11, 1407. [Google Scholar] [CrossRef]
Duan, Y.; Ren, H.; Xu, F.; Yang, X.; Meng, Y. Bi-objective integrated scheduling of quay cranes and automated guided vehicles. J. Mar. Sci. Eng. 2023, 11, 1492. [Google Scholar] [CrossRef]
Xing, Z.; Liu, H.; Wang, T.; Chew, E.P.; Lee, L.H.; Tan, K.C. Integrated automated guided vehicle dispatching and equipment scheduling with speed optimization. Transp. Res. Part E Logist. Transp. Rev. 2023, 169, 102993. [Google Scholar] [CrossRef]
Zhang, X.; Gu, Y.; Tian, Y. Integrated optimization of automated guided vehicles and yard cranes considering charging constraints. Eng. Optim. 2024, 56, 1748–1766. [Google Scholar] [CrossRef]
Xing, Z.; Liu, H.; Wang, T.; Lin, Y.H.; Chew, E.P.; Tan, K.C.; Li, H. AGV charging scheduling with capacitated charging stations at automated ports. Transp. Res. Part E Logist. Transp. Rev. 2025, 197, 104080. [Google Scholar] [CrossRef]
Wang, J.; Li, Y.; Liu, Z.; Yuan, M. Clean Energy Self-Consistent Systems for Automated Guided Vehicle (AGV) Logistics Scheduling in Automated Ports. Sustainability 2025, 17, 3411. [Google Scholar] [CrossRef]
Yang, X.; Hu, H.; Cheng, C. Collaborative scheduling of handling equipment in automated container terminals with limited AGV-mates considering energy consumption. Adv. Eng. Inform. 2025, 65, 103133. [Google Scholar] [CrossRef]
Hu, H.; Yang, X.; Xiao, S.; Wang, F. Anti-conflict AGV path planning in automated container terminals based on multi-agent reinforcement learning. Int. J. Prod. Res. 2023, 61, 65–80. [Google Scholar] [CrossRef]
Wu, M.; Gao, J.; Li, L.; Wang, Y. Control optimisation of automated guided vehicles in container terminal based on Petri network and dynamic path planning. Comput. Electr. Eng. 2022, 104, 108471. [Google Scholar] [CrossRef]
Yue, L.; Fan, H. Dynamic scheduling and path planning of automated guided vehicles in automatic container terminal. IEEECAA J. Autom. Sin. 2022, 9, 2005–2019. [Google Scholar] [CrossRef]
Tang, G.; Tang, C.; Claramunt, C.; Hu, X.; Zhou, P. Geometric A-star algorithm: An improved A-star algorithm for AGV path planning in a port environment. IEEE Access 2021, 9, 59196–59210. [Google Scholar] [CrossRef]
Xu, Y.; Qi, L.; Luan, W.; Guo, X.; Ma, H. Load-in-load-out AGV route planning in automatic container terminal. IEEE Access 2020, 8, 157081–157088. [Google Scholar] [CrossRef]
Chen, Y.; Shi, S.; Chen, Z.; Wang, T.; Miao, L.; Song, H. Optimizing Port Multi-AGV Trajectory Planning through Priority Coordination: Enhancing Efficiency and Safety. Axioms 2023, 12, 900. [Google Scholar] [CrossRef]
Feng, J.; Yang, Y.; Zhang, H.; Sun, S.; Xu, B. Path planning and trajectory tracking for autonomous obstacle avoidance in automated guided vehicles at automated terminals. Axioms 2023, 13, 27. [Google Scholar] [CrossRef]
Wang, Z.; Zeng, Q. A branch-and-bound approach for AGV dispatching and routing problems in automated container terminals. Comput. Ind. Eng. 2022, 166, 107968. [Google Scholar] [CrossRef]
Li, S.; Fan, L.; Jia, S. A hierarchical solution framework for dynamic and conflict-free AGV scheduling in an automated container terminal. Transp. Res. Part C Emerg. Technol. 2024, 165, 104724. [Google Scholar] [CrossRef]
Liang, C.; Zhang, Y.; Dong, L. A three stage optimal scheduling algorithm for AGV route planning considering collision avoidance under speed control strategy. Mathematics 2022, 11, 138. [Google Scholar] [CrossRef]
Lou, P.; Zhong, Y.; Hu, J.; Fan, C.; Chen, X. Digital-twin-driven AGV scheduling and routing in automated container terminals. Mathematics 2023, 11, 2678. [Google Scholar] [CrossRef]
Liu, W.; Zhu, X.; Wang, L.; Wang, S. Multiple equipment scheduling and AGV trajectory generation in U-shaped sea-rail intermodal automated container terminal. Measurement 2023, 206, 112262. [Google Scholar] [CrossRef]
Marugan, A.P. Applications of Reinforcement Learning for maintenance of engineering systems: A review. Adv. Eng. Softw. 2023, 183, 103487. [Google Scholar] [CrossRef]
Seyyedabbasi, A. A reinforcement learning-based metaheuristic algorithm for solving global optimization problems. Adv. Eng. Softw. 2023, 178, 103411. [Google Scholar] [CrossRef]
Syavasya, C.; Muddana, A.L. Optimization of autonomous vehicle speed control mechanisms using hybrid DDPG-SHAP-DRL-stochastic algorithm. Adv. Eng. Softw. 2022, 173, 103245. [Google Scholar] [CrossRef]
Wang, F.; Lu, Z.; Zhang, Y. Research on intelligent dynamic scheduling algorithm for automated guided vehicles in container terminal based on deep reinforcement learning. In Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA), Harbin, China, 6–9 August 2023; pp. 2406–2412. [Google Scholar]
Che, A.; Wang, Z.; Zhou, C. Multi-agent deep reinforcement learning for recharging-considered vehicle scheduling problem in container terminals. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16855–16868. [Google Scholar] [CrossRef]
Chen, X.; Liu, S.; Zhao, J.; Wu, H.; Xian, J.; Montewka, J. Autonomous port management based AGV path planning and optimization via an ensemble reinforcement learning framework. Ocean Coast. Manag. 2024, 251, 107087. [Google Scholar] [CrossRef]
Zheng, X.; Liang, C.; Wang, Y.; Shi, J.; Lim, G. Multi-AGV dynamic scheduling in an automated container terminal: A deep reinforcement learning approach. Mathematics 2022, 10, 4575. [Google Scholar] [CrossRef]
Zhou, S.; Yu, Y.; Zhao, M.; Zhuo, X.; Lian, Z.; Zhou, X. A Reinforcement Learning—Based AGV Scheduling for Automated Container Terminals With Resilient Charging Strategies. IET Intell. Transp. Syst. 2025, 19, e70027. [Google Scholar] [CrossRef]
Gong, L.; Huang, Z.; Xiang, X.; Liu, X. Real-time AGV scheduling optimisation method with deep reinforcement learning for energy-efficiency in the container terminal yard. Int. J. Prod. Res. 2024, 62, 7722–7742. [Google Scholar] [CrossRef]
Drungilas, D.; Kurmis, M.; Senulis, A.; Lukosius, Z.; Andziulis, A.; Januteniene, J.; Bogdevičius, M.; Jankunas, V.; Vozňák, M. Deep reinforcement learning based optimization of automated guided vehicle time and energy consumption in a container terminal. Alex. Eng. J. 2023, 67, 397–407. [Google Scholar] [CrossRef]
Wang, F.; Ding, P.; Bi, Y.; Qiu, J. A Multi-Agent Deep Reinforcement Learning Approach for Multiple AGVs Scheduling in Automated Container Terminals. In Proceedings of the 2024 China Automation Congress (CAC), Qingdao, China, 1–3 November 2024; pp. 3564–3569. [Google Scholar]
Hau, B.M.; You, S.-S.; Kim, H.-S. Efficient routing for multiple AGVs in container terminals using hybrid deep learning and metaheuristic algorithm. Ain. Shams. Eng. J. 2025, 16, 103468. [Google Scholar] [CrossRef]
Zhang, Z.; Zhuang, Z.; Qin, W.; Fang, H.; Lan, S.; Yang, C.; Tian, Y. A reinforcement learning approach for integrated scheduling in automated container terminals. In Proceedings of the 2022 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), Kuala Lumpur, Malaysia, 7–10 December 2022; pp. 1182–1186. [Google Scholar]
Chen, X.; Liu, S.; Li, C.; Han, B.; Zhou, Y.; Zhao, J. AGV path planning and optimization with deep reinforcement learning model. In Proceedings of the 2023 7th International Conference on Transportation Information and Safety (ICTIS), Xi’an, China, 4–6 August 2023; pp. 1859–1863. [Google Scholar]
Wei, Q.; Yan, Y.; Zhang, J.; Xiao, J.; Wang, C. A self-attention-based deep reinforcement learning approach for AGV dispatching systems. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 7911–7922. [Google Scholar] [CrossRef]
Zhang, F.; Li, R.; Gong, W. Deep reinforcement learning-based memetic algorithm for energy-aware flexible job shop scheduling with multi-AGV. Comput. Ind. Eng. 2024, 189, 109917. [Google Scholar] [CrossRef]
Mohammadpour, M.; Allani, B.; Kelouwani, S.; Ouameur, M.A.; Zeghmi, L.; Amamou, A.; Bahmanabadi, H. Optimal charging scheduling for Indoor Autonomous Vehicles in manufacturing operations. Adv. Eng. Inform. 2024, 62, 102804. [Google Scholar] [CrossRef]
Li, K.; Liu, T.; Kumar, P.R.; Han, X. A reinforcement learning-based hyper-heuristic for AGV task assignment and route planning in parts-to-picker warehouses. Transp. Res. Part E Logist. Transp. Rev. 2024, 185, 103518. [Google Scholar] [CrossRef]
Li, C.; Zhang, L.; Zhang, L. A route and speed optimization model to find conflict-free routes for automated guided vehicles in large warehouses based on quick response code technology. Adv. Eng. Inform. 2022, 52, 101604. [Google Scholar] [CrossRef]
Nie, J.; Zhang, G.; Lu, X.; Wang, H.; Sheng, C.; Sun, L. Reinforcement learning method based on sample regularization and adaptive learning rate for AGV path planning. Neurocomputing 2025, 614, 128820. [Google Scholar] [CrossRef]
Kuo, P.-H.; Chen, S.-Y.; Feng, P.-H.; Chang, C.-W.; Huang, C.-J.; Peng, C.-C. Reinforcement learning-based fuzzy controller for autonomous guided vehicle path tracking. Adv. Eng. Inform. 2025, 65, 103180. [Google Scholar] [CrossRef]
Li, Y.; Li, X.; Gao, L. Real-time scheduling for production-logistics collaborative environment using multi-agent deep reinforcement learning. Adv. Eng. Inform. 2025, 65, 103216. [Google Scholar] [CrossRef]
Zhu, Q.; Zheng, Z.; Wang, C.; Lu, Y. Research on AGV path tracking method based on global vision and reinforcement learning. Sci. Prog. 2023, 106, 00368504231188854. [Google Scholar] [CrossRef] [PubMed]
Ho, G.; Tang, Y.M.; Leung, E.K.; Tong, P. Integrated reinforcement learning of automated guided vehicles dynamic path planning for smart logistics and operations. Transp. Res. Part E Logist. Transp. Rev. 2025, 196, 104008. [Google Scholar] [CrossRef]
Stern, R.; Sturtevant, N.; Felner, A.; Koenig, S.; Ma, H.; Walker, T.; Li, J.; Atzmon, D.; Cohen, L.; Kumar, T.K.; et al. Multi-agent pathfinding: Definitions, variants, and benchmarks. In Proceedings of the International Symposium on Combinatorial Search, Napa, CA, USA, 16–17 July 2019; pp. 151–158. [Google Scholar]
Sharon, G.; Stern, R.; Felner, A.; Sturtevant, N.R. Conflict-based search for optimal multi-agent pathfinding. Artif. Intel. 2015, 219, 40–66. [Google Scholar] [CrossRef]
Boyarski, E.; Felner, A.; Stern, R.; Sharon, G.; Betzalel, O.; Tolpin, D.; Shimony, E. Icbs: The improved conflict-based search algorithm for multi-agent pathfinding. In Proceedings of the International Symposium on Combinatorial Search, Ein Gedi, Israel, 11–13 June 2015; pp. 223–225. [Google Scholar]
Surynek, P. Problem Compilation for Multi-Agent Path Finding: A Survey. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 5615–5622. [Google Scholar]
Yang, T.; Fan, W. Enhancing robustness of deep reinforcement learning based adaptive traffic signal controllers in mixed traffic environments through data fusion and multi-discrete actions. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14196–14208. [Google Scholar] [CrossRef]
Tang, Y.; Agrawal, S. Discretizing continuous action space for on-policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 5981–5988. [Google Scholar]

Figure 1. Illustration of the AGV transporting system under a vertical yard layout.

Figure 2. Illustration of the AGV transporting system under a horizontal yard layout.

Figure 3. The AGV transporting process under a horizontal yard layout at a transshipment ACT.

Figure 4. The distribution of the number of blocks with both stacking and retrieval tasks within 1 h.

Figure 5. The distribution of the number of blocks with retrieval tasks within 1 h.

Figure 6. Illustration of AGV transporting paths under a horizontal yard layout.

Figure 7. Schematic of AGV transporting path network structure.

Figure 8. Illustration of AGV states transition. (a) Transition of transporting state (loading or unloading transporting); (b) transition of charging state.

Figure 9. Three kinds of AGV path conflicts. (a) Overtaking conflict; (b) crossing conflict; (c) parking conflict.

Figure 10. Multi-agent learning framework.

Figure 11. Illustration of the MDP-driven decision-making framework of the AGV scheduling system.

Figure 12. Comparison of the obtained reward under different hyperparameter settings.

Figure 13. Comparison of different reward function designs under various scenarios. (a) Training process in S1; (b) training process in S2; (c) training process in S3.

Figure 14. Comparison of the scheduling results under different AGV configurations.

Figure 15. Comparison of the scheduling results under different target charge threshold settings.

Table 2. Notations used for developing the AGV assignment and path planning system.

Notation	Definition
$k$	Index of AGV.
$p$	Index of decision period.
$u$	Index of task.
$k_{p}$	The AGV that needs to generate decisions in decision period $p$ .
$n_{i}$	Index of the $i$ -th node on a given path.
$t_{k}^{n_{i}}$	Arrival time at node $n_{i}$ of AGV $k$ .
$t_{k}^{0, s}$	Empty transporting start time of AGV $k$ .
$t_{k, b l o c k}^{0, a}, t_{k, b e r t h}^{0, a}$	Empty arrival time at block and berth of AGV $k$ .
$t_{k, b l o c k}^{0, h}, t_{k, b e r t h}^{0, h}$	Empty handling start time at block and berth of AGV $k$ .
$t_{k, b l o c k}^{0, e}, t_{k, b e r t h}^{0, e}$	Empty handling end time at block and berth of AGV $k$ .
$t_{k, s t a t i o n}^{a}, t_{k, s t a t i o n}^{c}, t_{k, s t a t i o n}^{e}$	Arrival, charging start, and charging end time of AGV $k$ .
$t_{k, b l o c k}^{1, s}, t_{k, b e r t h}^{1, s}$	Loaded transporting start time at block and berth of AGV $k$ .
$t_{k, b l o c k}^{1, a}, t_{k, b e r t h}^{1, a}$	Loaded arrival time at block and berth of AGV $k$ .
$t_{k, b l o c k}^{1, h}, t_{k, b e r t h}^{1, h}$	Loaded handling start time at block and berth of AGV $k$ .
$t_{k, b l o c k}^{1, e}, t_{k, b e r t h}^{1, e}$	Loaded handling end time at block and berth of AGV $k$ .
$τ_{k}^{t, 0}, τ_{k}^{w, 0}$	Required empty transporting and waiting time for AGV $k$ .
$τ_{k}^{t, 1}, τ_{k}^{w, 1}$	Required loaded transporting and waiting time for AGV $k$ .
$τ_{k}^{l}, τ_{k}^{u}$	Required loading and unloading time for AGV $k$ .
$τ_{k}^{s}, τ_{k}^{r}$	Required stacking and retrieving time for AGV $k$ .
$τ_{k}^{c}$	Required charging time for AGV $k$ .
$e_{k}$	Remaining power of AGV $k$ .
$e^{t a r g e t}$	Target battery level to be reached during charging.
$q^{c h a r g e}$	The amount of energy charged per unit time.
$n_{k_{p}}^{s t a r t}$	The start node of $k_{p}$ .
$d_{k_{p}}^{s t a r t}$	The direction of $k_{p}$ at start node.
$n_{k_{p}}^{t a r g e t}$	The target node of $k_{p}$ .
$d_{k_{p}}^{t a r g e t}$	The direction of $k_{p}$ at target node.
$u_{k_{p}}$	The first queuing task that is assigned to $k_{p}$ at $n_{k_{p}}^{t a r g e t}$ when $k_{p}$ arrives.
$t^{f} (n_{k_{p}}^{t a r g e t})$	Time when handling facility becomes idle at $n_{k_{p}}^{t a r g e t}$ .
$t_{u}^{p}$	The planned handling time of task $u$ .
$ξ (n^{s t a r t}, d^{s t a r t}, n^{t a r g e t})$	The shortest path algorithm with known start node $n^{s t a r t}$ , start direction $d^{s t a r t}$ , and target node $n^{t a r g e t}$ .
${p a t h}_{k}^{0}, {p a t h}_{k}^{1}$	The shortest path of AGV $k$ for empty transporting and loaded transporting.

Table 3. Configuration of the tested container yard and AGV.

Parameter	Value
Number of blocks	40
Number of berths	4
Number of charge stations	6
Number of charge facilities in each charge station	2
Number of lanes in handling path aside block	2
Number of lanes in handling path at berth	6
Number of lanes in transporting path between block and berth	6
Number of lanes in transporting path inner block	4
Number of AGVs	40
Horizontal transporting time of AGV between adjacent blocks (s)	40
Vertical transporting time of AGV between adjacent blocks (s)	4
Power consumption of AGV per second (%)	0.023
Charging power of AGV per second (%)	0.042
Minimum remaining power of AGV for transporting (%)	18

Table 4. Parameter settings for the MDP.

Parameter	Value	Parameter	Value
$ω^{m}$	1	$φ^{P}$	4000
$ω^{t d}$	15	$φ^{O}$	1000
$ω^{e t}$	1	$φ^{C}$	10,000
$ω^{h w}$	2	$μ_{k_{p}}^{s p}$	1000
$ω^{c w}$	5	$μ_{k_{p}}^{t n}$	10

Table 5. Parameters in different scenario settings.

Scenario	Number of Tasks in Each Block	Number of Tasks at Each Berth
S1	[0, 2]	[0, 8]
S2	[0, 6]	[0, 24]
S3	[0, 10]	[0, 40]

Table 6. Hyperparameter settings in tuning test.

Set	Batch Size		Mini-Batch Size		Learning Rate of Policy Network		Learning Rate of Value Network		Epsilon
Set	D	P	D	P	D	P	D	P	Epsilon
1	1024	4096	32	128	1 × 10⁻³	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻⁴	0.1
2	1024	4096	32	128	1 × 10⁻³	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻⁴	0.2
3	1024	4096	32	128	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁵	0.1
4	1024	4096	32	128	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁵	0.2
5	1024	4096	32	128	1 × 10⁻⁴	1 × 10⁻³	1 × 10⁻⁵	1 × 10⁻⁴	0.1
6	1024	4096	32	128	1 × 10⁻⁴	1 × 10⁻³	1 × 10⁻⁵	1 × 10⁻⁴	0.2
7	1024	4096	32	128	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁵	1 × 10⁻⁵	0.1
8	1024	4096	32	128	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁵	1 × 10⁻⁵	0.2
9	2048	8192	64	256	1 × 10⁻³	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻⁴	0.1
10	2048	8192	64	256	1 × 10⁻³	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻⁴	0.2
11	2048	8192	64	256	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁵	0.1
12	2048	8192	64	256	1 × 10⁻³	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁵	0.2
13	2048	8192	64	256	1 × 10⁻⁴	1 × 10⁻³	1 × 10⁻⁵	1 × 10⁻⁴	0.1
14	2048	8192	64	256	1 × 10⁻⁴	1 × 10⁻³	1 × 10⁻⁵	1 × 10⁻⁴	0.2
15	2048	8192	64	256	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁵	1 × 10⁻⁵	0.1
16	2048	8192	64	256	1 × 10⁻⁴	1 × 10⁻⁴	1 × 10⁻⁵	1 × 10⁻⁵	0.2

Table 7. Results of MAPPO compared with benchmark scheduling strategies.

	MAPPO	Cl_Dn	Cl_Dl	Cr0.4_Dn	Cr0.4_Dl	Cr0.6_Dn	Cr0.6_Dl
Reward	−16,568.9	−23,182.4	−19,389.9	−22,999.7	−17,965.5	−25,483.6	−20,322.9
Gap	\	39.9%	17.0%	38.8%	8.4%	53.8%	22.7%
CT	5894.0	5457.5	5820.9	5613	5816.5	5843.7	6108.7
CP	5177.5	6165.0	6785.0	5810.0	6502.5	6212.5	7320.0
ET	980.5	747.1	916.4	821.8	944.6	997.5	1138.4
CW	223.8	201.9	350.3	95.8	89.7	113.1	124.0
HW	126.4	126.7	125.5	101.7	104.0	76.2	77.5
TD	209.7	636.7	257.7	671.5	269.7	780.8	332.1
CN	22.4	18.1	22.8	31.2	32.5	58.2	64.2
ST	85.9	11.5	6.9	10.6	7.2	12.1	8.3

Note: “CT” denotes the total complete time of all tasks, “CP” represents the total penalty of path conflicts, “ET” represents the average empty transporting time of each AGV, “CW” represents the average charging wait time of each AGV, “HW” represents the average handling wait time of each AGV, “TD” represents the average delay per task, “CN” denotes the total number of charging AGVs, and “ST” represents the solving time required for a trained policy to generate scheduling decisions during execution.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zuo, T.; Liu, H.; Yang, S.; Wang, W.; Peng, Y.; Wang, R. A Reinforcement Learning Method for Automated Guided Vehicle Dispatching and Path Planning Considering Charging and Path Conflicts at an Automated Container Terminal. J. Mar. Sci. Eng. 2026, 14, 55. https://doi.org/10.3390/jmse14010055

AMA Style

Zuo T, Liu H, Yang S, Wang W, Peng Y, Wang R. A Reinforcement Learning Method for Automated Guided Vehicle Dispatching and Path Planning Considering Charging and Path Conflicts at an Automated Container Terminal. Journal of Marine Science and Engineering. 2026; 14(1):55. https://doi.org/10.3390/jmse14010055

Chicago/Turabian Style

Zuo, Tianli, Huakun Liu, Shichun Yang, Wenyuan Wang, Yun Peng, and Ruchong Wang. 2026. "A Reinforcement Learning Method for Automated Guided Vehicle Dispatching and Path Planning Considering Charging and Path Conflicts at an Automated Container Terminal" Journal of Marine Science and Engineering 14, no. 1: 55. https://doi.org/10.3390/jmse14010055

APA Style

Zuo, T., Liu, H., Yang, S., Wang, W., Peng, Y., & Wang, R. (2026). A Reinforcement Learning Method for Automated Guided Vehicle Dispatching and Path Planning Considering Charging and Path Conflicts at an Automated Container Terminal. Journal of Marine Science and Engineering, 14(1), 55. https://doi.org/10.3390/jmse14010055

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

A Reinforcement Learning Method for Automated Guided Vehicle Dispatching and Path Planning Considering Charging and Path Conflicts at an Automated Container Terminal

Abstract

1. Introduction

2. Related Work

2.1. AGV Scheduling Problem

2.2. Application of Reinforcement Learning Approach on AGV Scheduling Problem

3. Development of the AGV Scheduling System

3.1. Problem Statement

3.2. AGV Scheduling State Transition

3.3. Path Conflict Detection

3.4. Simulator for AGV Scheduling System

4. Multi-Agent Reinforcement Learning Approach

4.1. Multi-Agent MDP for AGV Scheduling System

4.1.1. Agents

4.1.2. States

4.1.3. Actions

4.1.4. Reward Functions

4.1.5. Multi-Agent Learning Framework

4.2. The MDP-Driven Decision-Making Framework of the AGV Scheduling System

4.3. Training Algorithm with Proximal Policy Optimization

5. Numerical Experiments and Application Analyses

5.1. Scenarios Setting

5.2. Parameter Tuning Test

5.3. Experiment on Different Reward Function

5.4. Comparison with Benchmark Scheduling Strategies

5.5. Application Analysis on AGV Configuration and Target Charge Threshold

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI