Scheduling of AGVs in Automated Container Terminal Based on the Deep Deterministic Policy Gradient (DDPG) Using the Convolutional Neural Network (CNN)

To improve the horizontal transportation efficiency of terminal Automated Guided Vehicles (AGVs), it is necessary to coordinate the time and space synchronization of the loading and unloading equipment and the transportation equipment during operation, and to reduce the completion time of tasks. Traditional scheduling methods have limited dynamic response capabilities and are not suitable for dynamic terminal operating environments. Therefore, this paper discusses how to use delivery task information and AGVs spatiotemporal information to schedule AGVs dynamically so as to minimize the delay time of tasks and the AGVs travel time, and proposes a deep reinforcement learning algorithm framework. The framework combines the real-time response and flexibility of the Convolutional Neural Network (CNN) and the Deep Deterministic Policy Gradient (DDPG) algorithm, and can dynamically adjust AGVs scheduling strategies according to the input spatiotemporal state information. In the framework, the AGVs scheduling process is first defined as a Markov decision process: the system's spatiotemporal state information is analyzed in detail, and heuristic assignment rules and a reward reshaping mechanism are introduced in order to decouple the model from the AGVs dynamic scheduling problem. Then, a multi-channel matrix is built to characterize the space-time state information, the CNN is used to generalize and approximate the action value functions of different state information, and the DDPG algorithm is used to achieve the best AGV and container matching in the decision stage. The proposed model and algorithm framework are applied to experiments with different cases, and their scheduling performance is compared with that of the adaptive genetic algorithm and the rolling horizon approach.
The results show that, compared with a single scheduling rule, the proposed algorithm improves the average performance of task completion time, task delay time, AGVs travel time and task delay rate by 15.63%, 56.16%, 16.36% and 30.22%, respectively; compared with AGA and RHPA, it reduces the tasks completion time by approximately 3.10% and 2.40%.


Introduction
With the development of the international logistics industry, about 70% of the world's total trade volume is borne by ocean shipping. Over the past decade, the rapid growth of ship size and the rapid development of automated terminals maximized the throughput of container terminals in order to reduce the turnover time of container ships [1], which posed many new challenges to the planning of automated container terminals, and new methods are urgently needed to improve the service level and efficiency of automated terminals.
The automated terminal is a multi-level logistics operation process, which mainly includes Quay Cranes (QCs) operations, Yard Cranes (YCs) operations and the horizontal transportation of AGVs. AGVs are the key equipment connecting QCs operations and YCs operations, and run through the entire port's import and export container trans-shipment operations. The layout of the automated container terminal is shown in Figure 1. AGVs scheduling is one of the main parts of the AGVs control system in the automated terminal; it is responsible for matching AGVs to tasks in the most efficient way. However, the work procedures of AGVs and other equipment are coupled and their decisions are interdependent, which makes the logistics operations extremely complex [2]. Improving the throughput of automated terminals requires interaction between AGVs and crane equipment, which depends to a large extent on the synergy of equipment resources [3][4][5][6], so it is necessary to determine the time at which a container is transferred from one piece of equipment to another. More specifically, when an AGV and a QC hand over a container, the two must cooperate with each other; if the AGV fails to reach the designated delivery location within the scheduled time, the QC cannot work, which delays the completion of the task. At present, the front of the block area of the automated terminal is equipped with a sufficient number of AGV mates as auxiliary equipment for the horizontal transportation of AGVs, helping to reduce the waiting time related to YCs and AGVs, while the seaside QCs are not equipped with auxiliary equipment such as AGV mates.
QCs need to consider the delivery delay of each task while loading and unloading containers. Since QCs are the bottleneck resource of the automated terminal, it is necessary to fully consider the resource utilization of QCs [7]. In order to realize the autonomous learning and sustainable development capabilities of the AGVs scheduling system in the automated terminal, and considering the task delivery delay time and the AGVs travel time, we combined CNN and DDPG based on the Actor-Critic (AC) framework proposed by Degris T. et al. [8] to build a Deep Convolution Deterministic Policy Gradient AGVs dynamic scheduling (CDA) algorithm framework for AGVs dynamic scheduling problems. Experimental cases verify the reliability and effectiveness of the method.
The main contributions of this paper are summarized as follows: (1) The AGVs scheduling problem is defined as a sequential decision-making problem and modeled as a Markov decision process. The scheduling system and the scheduling environment interact dynamically to make decisions, and the scheduling strategies are then updated through the deep reinforcement learning algorithm to realize the autonomous learning of the scheduling system. (2) The complex environment information of the terminal system is stored in the multi-channel, two-dimensional matrix of the class diagram, and the complex CNN is created by multi-layer convolution, which effectively extracts the key information from the state information, and through this flexible combination ensures that the deep

Literature Review
Over the past few decades, many scholars have made great contributions to the research and development of port dispatching, including many studies on the dispatching of terminal AGVs. The core of dispatching lies mainly in the problem model and the solution algorithm, both of which are modeled and optimized based on the assumptions determined by the port operating environment. Rashidi H. et al. [9] established an AGVs scheduling model that minimizes the cost flow of automated terminals, and compared and analyzed the extended network simplex algorithm (NSA+) and the incomplete greedy vehicle search algorithm (GVS). The results show that these two algorithms complement each other in problem-solving effects, and the NSA algorithm can be used to solve large-scale problems. Grunow M. et al. [10] studied the multi-load AGVs scheduling problem, proposed a complex offline heuristic strategy, and used the extended simulation model to evaluate the performance of the scheduling strategy. The simulation results showed that, compared with the online heuristic algorithm, the offline heuristic algorithm greatly improves the utilization of terminal AGVs. Taking the weighted sum minimization of the QCs delay time and the AGVs no-load travel distance as the joint optimization goal of AGVs scheduling, Kim K H. et al. [5] established a mixed-integer programming model for AGVs optimal task allocation, and proposed a heuristic algorithm that solves the problem, analyzing and comparing the performance of the heuristic algorithm with other scheduling rules. In both a deterministic environment and a random environment, the algorithm achieves the joint optimization goal of terminal AGVs scheduling. Pjevčević D. et al.
[11] analyzed in depth the influence of the number of AGVs and their dispatching rules on the process of container unloading at the terminal, established an AGVs simulation model and efficiency evaluation system, and proposed an efficient data envelopment analysis (DEA) for processing the strategic decision-making method, by setting a reasonable number of AGVs and selecting appropriate AGVs dispatching rules, which improved the efficiency of the terminal's operation in the unloading mode. Skinner B. et al. [12] considered the automated terminal container transportation scheduling problem, established a modified mathematical model, and proposed a two-part chromosome that improved the genetic algorithm to solve the problem. In simulation experiments of different scheduling scenarios, the improved genetic algorithm and the sequential operation scheduling method are evaluated and compared, and the method is applied to the Brisbane container terminal in Australia. The above problem only considers a single planning problem, which is only a partial optimization of the automated terminal. The automated terminal is a whole in which multiple loading and unloading equipment influences and restricts each other, the operation process of the automated terminal needs to be optimized from a global perspective. Iris, Ç. et al. [6] proposed a new flexible ship loading problem (FSLP), which comprehensively considered the management of loading operations and scheduling of transport vehicles. Various modelling enhancements and a mathematical model to obtain a strong lower bound were designed and a heuristic algorithm was used to solve the problem. The results show that the proposed model and heuristic algorithm can effectively generate high-quality solutions, and can greatly save costs. Luo J. et al. 
[13] considered the integrated problem of AGVs scheduling and container storage allocation in the automated terminal in the mode of loading and unloading, and established a mixed-integer programming model to minimize ship berth time. The effectiveness of the model and evolutionary genetic algorithm is verified through a large number of numerical experiments, and the experimental results show that the method can effectively solve problems of different scales, and provides a good solution for terminal AGVs scheduling and container storage allocation problems. The results show that the improved genetic algorithm can provide effective solutions and improve the overall operating efficiency of the terminal. There is a strong correlation between modeling optimization and problem definition in this particular environment. Different terminal environments have different problem definitions, and the established models are different. The algorithm lacks generalization ability.
In the actual port operation environment, the AGVs scheduling problem usually faces diversified automation equipment and unexpected working conditions, such as machine failures and network delays. Because of the complexity of the port environment and the way system information changes over time, AGVs scheduling is easily affected by random disturbances. Therefore, it is necessary to increase the solution speed of the algorithm to meet the timeliness requirements of the complex and changeable port environment. General heuristic search algorithms easily fall into local optima and are time-consuming to solve. In order to adapt to the highly dynamic and complex port environment and improve the flexibility of AGVs scheduling, many scholars have studied AGVs dynamic scheduling problems, and most of them put forward the concept of rescheduling. Angeloudis P. et al. [14] proposed a rolling horizon optimization algorithm based on the concept of cost or benefit. The results show that this method can minimize the maximum completion time and improve the efficiency of the terminal AGVs horizontal transportation operation. Klerides E. et al. [15] established a model to minimize the cost of AGVs scheduling and proposed a rolling time method, which was verified through the analysis of port cases of different scales. A rescheduling strategy that makes dynamic decisions according to the different states of the system can make AGVs quickly adapt to a complex and dynamic working environment. Cai B. et al. [16] proposed two rescheduling strategies for the Autonomous Straddle Carriers (ASC) scheduling problem, which take the weighted sum of the ASC transportation cost, the waiting time cost, and the delay cost of high-priority tasks as the optimization goal.
The branch-and-bound algorithm with column generation is used to solve the newly arrived task rescheduling strategy and the unexecuted task rescheduling combination strategy, and the usage scenarios of the two strategies are compared through simulation experiments. In addition, some scholars made improvements based on static AGVs scheduling algorithms and studied a variety of dynamic scheduling algorithms. Xin J. et al. [17] studied the dynamic scheduling problem of the automated collision-free path of terminal AGVs; introduced the concept of hierarchical control architecture to human-computer interaction scheduling and automatic guided vehicle path planning; and proposed a variable neighborhood search meta-heuristic algorithm based on the hierarchical control structure, which could be applied to simulated static obstacle and dynamic obstacle scheduling scenarios. The results show that the hierarchical control system algorithm ensures the collision-free horizontal transportation of AGVs and improves the operation efficiency of the automated terminal. Kim J. et al. [18] proposed a multi-standard AGVs scheduling strategy in which the minimization of the QCs delay time and the minimization of the AGVs empty travel were both considered as objective functions. A mixed-integer programming model was established and solved step by step with a multi-objective evolutionary derivative algorithm, thereby achieving the goal of AGVs dynamic scheduling. The application of machine learning algorithms to dynamic scheduling problems has become a popular topic in the study of combinatorial optimization problems. Machine learning algorithms learn the optimal scheduling strategy through simulation experiments or real data, so that they can adapt to different job conditions and scheduling environments and can resist interference.
In recent years, with the development of Artificial Intelligence, the Internet of Things, Big Data, and other technologies, many scholars have begun to pay attention to how to improve the dynamic response speed of automated terminal scheduling and promote the autonomous learning of the scheduling system to solve the dynamic problems of the automated terminal operating environment. Han B A. et al. [19] pointed out that adaptive scheduling can select the optimal scheduling strategy according to the optimization goal and system status information in the dynamic production environment, and the scheduling strategy can be regarded as a function of system state information to scheduling operation, which reduces the calculation time of the scheduling process and ensures the speed of a dynamic response. In the acquisition of scheduling strategy, some scholars conducted corresponding work. Based on the inventory management model, Briskorn D. et al. [20] converted the AGVs dynamic scheduling problem into a dynamic allocation transportation task problem; the task allocation strategy was obtained with the greedy limited rule heuristic algorithm and the precise algorithm. A large number of comparative experiments shows that the model and algorithm have a robust performance, which can reduce the empty travel time of the AGVs, thereby improving the horizontal transportation efficiency of the automated terminal. Focusing on the dynamic AGVs scheduling problem against the background of the uncertainty of the automated terminal, and to minimize the QCs operating time and the AGVs empty travel distance, Choe R. et al. [21] proposed a paired preference function and an online preference learning algorithm combined with a deep neural network. A large number of simulation experiments verify that the method can dynamically adjust the AGVs scheduling strategy according to the actual operating conditions of the automated terminal. 
The task of reinforcement learning is to dynamically learn the optimal strategy by observing the rewards of the action feedback after performing a series of actions. To reduce the average waiting time of trucks, Fotuhi F. et al. [22] proposed an agent-based YCs scheduling model and q-learning algorithm, and the optimal YCs scheduling strategy is learned through a large number of simulation experiments. The experiment further verified the applicability and robustness of the model and q-learning algorithm to the YCs scheduling problem. An increasing number of scholars focused on Deep Reinforcement Learning (DRL) to solve practical problems, Hu H. et al. [23] proposed an adaptive, deep reinforcement learning, hybrid-rule, AGVs dynamic scheduling method to address the dynamics and uncertainties in material handling in flexible workshops. Experiments verified the feasibility and effectiveness of the method. For the dynamic and stochastic nature of order dispatching in ride-sharing platforms, Tang X. et al. [24] proposed an order dispatching solution based on deep reinforcement learning, and verified the effectiveness of the algorithm through large-scale online tests. In addition, the application of DRL to network flow control problems [25], financial market intraday trading [26], subway train dispatching [27], etc. proved the superiority and effectiveness of DRL in solving sequence decision-making.
In summary, AGVs dynamic scheduling optimization methods for the automated terminal mainly focus on periodic static dispatch or rescheduling, and dynamics are not fully considered. Therefore, these scheduling methods have certain limitations in the complex port environment. A few studies of AGVs dynamic scheduling problems are combined with machine learning algorithms, but the performance and learning efficiency of the algorithms used are limited by the scale of the tasks to be solved. As of yet, no scholars have applied a deep reinforcement learning algorithm to solve the dynamic scheduling problem in the automated terminal. The DRL algorithm framework is applied to AGVs dynamic scheduling for the first time in this paper.

Problem Description
Horizontal transportation is an important part of the terminal operating system, which is the link connecting QCs and YCs. In order to improve terminal operating efficiency and reduce terminal operating costs, this paper adopts the dual-cycle mode of simultaneous loading and unloading [28], in which the AGVs dynamic scheduling process is divided into the following: the scheduling system assigns the container to AGV, AGV travels to the loading and unloading point at the QC, AGV picks up or unloads the container, transports the container, and the container is then unloaded or picked at the front buffer area of the yard. As an example of import container unloading operations, first, the scheduling system obtains the current information of all containers and the AGVs space-time information, and allocates the container to be executed to the AGV. Then the AGV arrives at the designated QC location for container handover. Finally, the container is transported by the AGV to the block of yard, where the empty AGV can accept the container assignment of the dispatch system again. The dynamic scheduling process of a single AGV is shown in Figure 2. AGV's operation time at the QC location should be as close as possible to QC's loading and unloading time, reducing the delay time of container handover. In order to better describe whether there is a delay in the handover process of containers, the earliest possible event time of each container task is set [5]. If the containers fail to be executed by the AGVs before the earliest possible event time, the handover of the container is delayed.

Notation of the Global Parameters
We introduce the following AGVs scheduling-related model notations as shown in Tables 1 and 2.

Parameters | Notations
$V$ | Set of AGVs, indexed by $v \in V$.
$Q$ | Set of QCs, indexed by $q \in Q$.
$Y$ | Set of YCs, indexed by $b \in Y$.
$\mathrm{Dim}$ | Representation of the dimensions of the matrix.
 | Learning rate of Actor network and Critic network.
 | Capacity of experience replay memory.
 | Batch size.
 | Maximum number of episodes.
 | Target network parameter update frequency.
 | Algorithm training time step.
 | Target network parameter soft update coefficient.
 | The discount coefficient of accumulated reward.
$\beta$ | The scaling factor in the reward function.

$u_{ik}$ | The urgency of container $i$ in the decision stage $k$.
$\pi(\Delta \mid s_k)$ | The decision strategy to determine the probability of each action based on $s_k$.
$\Delta_k$ | The action in the decision stage $k$.
$s'_k$ | The state variable of the next decision stage.
$t^{qi}_{vk}$ | The actual time that container $i$ is handed over between AGV $v$ and QC $q$.
$D^{qi}_{vk}$ | The delay time of container $i$ transported by AGV $v$.
$C_{ivk}$ | The travel time of container $i$ transported by AGV $v$.
$r_k$ | The reward value of the state variable transition from $s_k$ to $s'_k$.
$\pi^*$ | The optimal strategy.
$D_{av}$ | Average delay time of all containers.
$C_{av}$ | Average travel time per container for AGVs transportation.
$N_r$ | Number of containers delayed.

AGVs Dynamic Scheduling Model
The dynamic scheduling of AGVs in the automated terminal can be described as a staged container-AGV matching problem. The entire container-AGV matching process is discretely partitioned in time, thus transforming the scheduling problem into a finite stochastic dynamic decision process, which is then modeled as a Markov decision process (MDP). The scheduling system interacts intermittently with the environment in all decision stages, and dynamically assigns the highest priority container to an available AGV for execution until all tasks are completed. Such a complete process is called an episode. At each decision stage $k$, the scheduling system senses the environment state variable $s_k$ and takes a corresponding action $\Delta_k$ based on the policy $\pi(\Delta_k \mid s_k)$, at which point the container and the available AGV are successfully matched. After the current container transportation is over, the environment generates a numerical reward $r_k$ as feedback for the decision. According to the AGVs scheduling process described above, we define the state variables, action rules, reward functions, and policy.
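The interaction loop of one episode can be sketched as follows. This is only an illustration: `ToyEnv`, its placeholder reward, and `run_episode` are hypothetical names that do not come from the paper, and the toy environment returns the reward immediately, whereas in the terminal the reward is observed only after the container transport ends.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for the terminal simulator (not the paper's model)."""
    n_containers: int = 5
    remaining: list = field(default_factory=list)

    def reset(self):
        # All containers are waiting at the start of an episode.
        self.remaining = list(range(self.n_containers))
        return tuple(self.remaining)  # state s_k

    def step(self, action):
        # The action selects one of the remaining containers for an AGV.
        container = self.remaining.pop(action % len(self.remaining))
        reward = -float(container)  # placeholder reward r_k (assumption)
        done = not self.remaining   # episode ends when all tasks are done
        return tuple(self.remaining), reward, done


def run_episode(env, policy):
    """Run one episode and record (s_k, action, r_k, next state) transitions."""
    state, done, trajectory = env.reset(), False, []
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


if __name__ == "__main__":
    for transition in run_episode(ToyEnv(n_containers=3), lambda s: 0):
        print(transition)
```

In a learning setting, the recorded transitions would typically be stored in an experience replay memory and sampled in batches to update the networks.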

System State Information
The local and global characteristic information of the system at decision stage $k$ is described by the state variable $s_k$. State variables are the key information for the scheduling system to make a decision, and this decision directly affects the efficiency of the entire system. Therefore, extracting system state information is vital in optimizing the entire scheduling process. AGVs scheduling in the automated terminal involves multiple pieces of equipment, which determines the amount of information needed to characterize the system. Data are obtained through various data collection equipment, and multiple sources of information are then combined to accurately determine the current system state information. This article characterizes the system state information from two perspectives, the sequence information of QCs containers and the AGVs information, which are represented by $s^1_k$ and $s^2_k$, respectively; the final state variable of the decision stage is $s_k = (s^1_k, s^2_k)$. The QCs container sequence information $s^1_k$ is represented by a four-channel matrix, which describes the global features of the environment and stores different categories of feature information in different channels. The dimension of $s^1_k$ can be expressed as $\mathrm{Dim}(s^1_k) = 4 \times |Q| \times \max_{q \in Q} N_q$; the height and width of each channel are set to the number of QCs and the maximum number of tasks of a single QC, respectively. Drawing on the work of Kim J. et al. [18] on the AGVs scheduling problem, we select the characteristic attributes used to describe the state component $s^1_k$. The details of $s^1_k$ are as follows:
(1) The container type channel indicates the types of containers. In the dual-cycle mode of loading and unloading, the execution of different types of containers by the AGVs affects the synchronization of the operation of the QCs and the AGVs. For example, when the AGVs transport import containers, they travel directly to the QCs to pick up the container, and this process only includes the empty travel time of the AGVs. When the AGVs transport export containers, they first travel to the front of the YCs to pick up the container and then travel to the QCs to unload it; this process includes two stages, so it takes longer for the AGVs to reach the QCs. The values 0 and 1 denote import and export containers, respectively.
(2) The container urgency channel indicates the urgency of each container. In order to ensure the synchronization of the QCs in loading and unloading each container with the AGVs, and to reduce the delay time of container handover, the urgency of a container is expressed as the remaining time until its earliest possible event time. Once the container is transported, its urgency is set to 0; otherwise, the urgency $u_{ik}$ of the unexecuted container is this remaining time, as expressed in Equation (1).
(3) The estimated load travel time channel records the estimated load travel time for each container not yet transported by an AGV. The load travel time for the AGV to execute the container is estimated using the automated terminal horizontal transportation map of path network $G$, according to the container loading and unloading locations. The estimated load travel time is set to 0 if the container is assigned to an AGV for execution.
(4) The completion time channel records the actual time at which each container is completed, and is initialized to an all-0 matrix.
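The four channels described above can be assembled as in the following minimal sketch. The task dictionary keys, the zero padding for QCs with fewer tasks, the use of one `assigned` flag to zero both the urgency and the load travel time, and the urgency formula (earliest possible event time minus current time) are all illustrative assumptions rather than the paper's implementation.

```python
def build_s1(tasks_per_qc, now):
    """Build a 4 x |Q| x max_q N_q nested list for the container state s^1_k.

    tasks_per_qc: one list per QC; each task is a dict with the hypothetical
    keys 'export', 'earliest', 'load_time', 'assigned', 'completed_at'.
    """
    n_qc = len(tasks_per_qc)
    width = max(len(tasks) for tasks in tasks_per_qc)
    # Channels: 0 = type, 1 = urgency, 2 = load travel time, 3 = completion.
    s1 = [[[0.0] * width for _ in range(n_qc)] for _ in range(4)]
    for q, tasks in enumerate(tasks_per_qc):
        for i, task in enumerate(tasks):
            s1[0][q][i] = 1.0 if task["export"] else 0.0
            if not task["assigned"]:
                # Assumed urgency: time left until the earliest event time.
                s1[1][q][i] = float(task["earliest"] - now)
                s1[2][q][i] = float(task["load_time"])
            if task["completed_at"] is not None:
                s1[3][q][i] = float(task["completed_at"])
    return s1


if __name__ == "__main__":
    demo = [
        [{"export": False, "earliest": 100, "load_time": 30,
          "assigned": False, "completed_at": None}],
        [{"export": True, "earliest": 80, "load_time": 25,
          "assigned": True, "completed_at": 60}],
    ]
    for channel in build_s1(demo, now=10):
        print(channel)
```

One consequence of zero padding is that an empty cell is indistinguishable from an import container in the type channel; a real implementation might add a mask or use a distinct padding value.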
The AGVs scheduling process is driven not only by the containers' information but also by the spatio-temporal information of the AGVs themselves. Since the information of the AGVs is constantly changing in time and space, only the local state information of the AGVs can be captured. The AGVs spatio-temporal state information $s^2_k$ is mapped to a two-dimensional matrix with dimensions $\mathrm{Dim}(s^2_k) = |V| \times (1 + |Q| + |Y|)$, initialized to 0. The first column of the matrix specifies the working status of the AGV: "0" means the AGV is idle and "1" means the AGV is assigned a container. The remaining columns indicate the travel time from the current position of the AGV to each QC $q$ position and to each YC $b$ position, respectively; the specific values can be obtained by querying the automated terminal horizontal transportation map of path network $G$.
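A minimal sketch of how such a matrix might be filled is given below. It assumes that the path network $G$ is available as a weighted adjacency dictionary and that travel times are shortest-path times computed with Dijkstra's algorithm; the source only states that the values are queried from $G$, so both choices, and the names `travel_times` and `build_s2`, are assumptions.

```python
import heapq


def travel_times(graph, source):
    """Shortest travel times from `source` using Dijkstra's algorithm.

    graph: {node: {neighbor: travel_time}}, a hypothetical encoding of G.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def build_s2(graph, agvs, qc_nodes, yc_nodes):
    """Return a |V| x (1 + |Q| + |Y|) nested list for the AGV state s^2_k.

    agvs: list of (position_node, is_assigned) pairs, one per AGV.
    """
    rows = []
    for position, is_assigned in agvs:
        dist = travel_times(graph, position)
        row = [1.0 if is_assigned else 0.0]  # column 0: working status
        row += [dist.get(n, float("inf")) for n in list(qc_nodes) + list(yc_nodes)]
        rows.append(row)
    return rows


if __name__ == "__main__":
    g = {"a": {"qc1": 4.0, "b": 1.0}, "b": {"qc1": 2.0, "yc1": 5.0},
         "qc1": {"yc1": 1.0}, "yc1": {}}
    print(build_s2(g, [("a", False), ("b", True)], ["qc1"], ["yc1"]))
```

Unreachable locations are reported as infinity here; before feeding the matrix to a network they would need to be replaced by a finite value.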

Action Space Expression
In the process of AGVs dynamic scheduling, the scheduling system needs to select a suitable container and assign a suitable AGV to execute it. As ships become larger and the number of containers increases dramatically, it is difficult to accurately determine the containers that need to be transported within the limited decision-making time in dynamic scheduling. A heuristic assignment rule is a scheduling behavior by which the system schedules AGVs to transport containers; it speeds up the dynamic response of the scheduling system and determines the order of container assignment in the scheduling process. The principles of validity and sufficiency are followed when determining the assignment rules. Validity is reflected in the influence of assignment rules on the convergence of the objective function: different assignment rules determine the order of container execution and directly affect the degree of convergence of the objective function. Sufficiency refers to the diversity of the assignment rules, which not only overcomes the short-sightedness of a single assignment rule but also allows appropriate rules to be selected based on different state information in different decision-making stages. Based on these two principles, 18 heuristic assignment rules are designed in this paper with reference to the criterion proposed by Choe R. et al. [21], where rules 7-18 are hybrid assignment rules, each combining two single assignment rules. The specific details of the 18 heuristic assignment rules are shown in Table 3.
Table 3. Specific details of the heuristic assignment rules.

Symbols | Rules | Description
$mr_1$ | LTT | The container with the longest transport time from origin to destination is assigned to an AGV.
$mr_2$ | STT | The container with the shortest transport time from origin to destination is assigned to an AGV.
$mr_3$ | GUT | The container with the greatest urgency is assigned to an AGV.
$mr_4$ | LUT | The container with the least urgency is assigned to an AGV.
$mr_5$ | LPT | The container with the longest processing time is assigned to an AGV; the processing time includes the time for loading and unloading the container at QCs and YCs, as well as the transportation time from the origin to the destination of the container.
$mr_6$ | SPT | The container with the shortest processing time is assigned to an AGV.
$mr_7$ | LQ-LTT | Selecting the QC $q$ with the most remaining containers, from which the container with the longest transport time is selected.
$mr_8$ | LQ-STT | Selecting the QC $q$ with the most remaining containers, from which the container with the shortest transport time is selected.
$mr_9$ | SQ-LTT | Selecting the QC $q$ with the fewest remaining containers, from which the container with the longest transport time is selected.
$mr_{10}$ | SQ-STT | Selecting the QC $q$ with the fewest remaining containers, from which the container with the shortest transport time is selected.
$mr_{11}$ | LQ-GUT | Selecting the QC $q$ with the most remaining containers, from which the container with the greatest urgency is assigned to an AGV.
$mr_{12}$ | LQ-LUT | Selecting the QC $q$ with the most remaining containers, from which the container with the least urgency is assigned to an AGV.
$mr_{13}$ | SQ-GUT | Selecting the QC $q$ with the fewest remaining containers, from which the container with the greatest urgency is assigned to an AGV.
$mr_{14}$ | SQ-LUT | Selecting the QC $q$ with the fewest remaining containers, from which the container with the least urgency is assigned to an AGV.
$mr_{15}$ | LQ-LPT | Selecting the QC $q$ with the most remaining containers, from which the container with the longest processing time is assigned to an AGV.
$mr_{16}$ | LQ-SPT | Selecting the QC $q$ with the most remaining containers, from which the container with the shortest processing time is assigned to an AGV.
$mr_{17}$ | SQ-LPT | Selecting the QC $q$ with the fewest remaining containers, from which the container with the longest processing time is assigned to an AGV.
$mr_{18}$ | SQ-SPT | Selecting the QC $q$ with the fewest remaining containers, from which the container with the shortest processing time is assigned to an AGV.
The type of AGV determines which AGV is assigned to the selected container, and each AGV has a unique ID. In this paper, the action space consists of combinations of heuristic assignment rules and types of AGV, and can be represented as ∆ = {(mr, v) | mr ∈ MR, v ∈ V}.
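The action space ∆ is the Cartesian product of the rule set MR and the AGV set V, so it can be enumerated directly. The following is a minimal sketch, assuming three AGVs with made-up IDs; the function names `build_action_space` and `action_index` are hypothetical and not taken from the paper.

```python
from itertools import product

def build_action_space(num_rules, agv_ids):
    """Enumerate every (assignment rule, AGV) pair as one discrete action."""
    rules = [f"mr{i}" for i in range(1, num_rules + 1)]
    # product() iterates rules in the outer loop and AGVs in the inner loop
    return list(product(rules, agv_ids))

def action_index(rule_number, agv_position, num_agvs):
    """Flat index of an action: rule_number counts from 1, agv_position from 0."""
    return (rule_number - 1) * num_agvs + agv_position

# Illustrative AGV IDs (assumption): 18 rules x 3 AGVs = 54 actions
actions = build_action_space(18, ["AGV1", "AGV2", "AGV3"])
print(len(actions))
print(actions[action_index(11, 1, 3)])
```

With 18 rules, the size of the action space grows linearly with the number of AGVs, which is why the text refers to a large-scale discrete action space.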

Reward Design and Reshaping
At each decision stage k, the scheduling system obtains the state variable s k and makes a decision X π (s k ), and the corresponding container is assigned to the AGV. After the AGV completes the container task, a measure of how well the task was completed is needed. Therefore, the two optimization goals of minimizing the delay time of the tasks and the travel time of the AGVs are both considered, and a reward function is designed to calculate the feedback reward, which is then used for action evaluation and strategy optimization. When an AGV finishes transporting a container task, the system records the actual time t qi vk at which container i is handed over between AGV v and QC q, and the actual completion time t k of each container. The concepts of individual container task delay time cost and AGV travel time cost are introduced based on the optimization objectives of task delay time and single AGV travel time, as shown in Equations (2) and (3). Generally, a simple numerical summation of the costs defined above can be used as the feedback reward. However, when an AGV transports a container task, there is a time difference between the start of the task and the end of the task, when the reward is observed. Within this time difference, if other tasks are matched with AGVs, they change the successor state information s′ k , causing the entire system environment to become excessively unstable. Therefore, the reward reshaping mechanism proposed in this paper calculates the average delay cost D av of all tasks and the average travel cost C av of all AGVs over the whole process. Difference terms D qi vk − D av and C ivk − C av between the cost of a single container task and the average cost of the entire process are introduced, together with a scaling factor for these difference terms, to reshape the reward function. The process of reward function reshaping is shown in Formulas (4)-(8).
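The idea of comparing each task's cost with the episode average can be illustrated in a few lines. Formulas (4)-(8) are not reproduced in this text, so the combination used below (a negated, β-scaled sum of the two difference terms) is an assumption made only for illustration, as are the function name `reshape_rewards` and the sample costs.

```python
def reshape_rewards(delay_costs, travel_costs, beta):
    """Rewards for one finished episode from per-task costs.

    Assumed form (NOT the paper's Formulas (4)-(8)):
        r_k = -beta * ((D_k - D_av) + (C_k - C_av))
    A task that is cheaper than average receives a positive reward.
    """
    d_av = sum(delay_costs) / len(delay_costs)    # average delay cost D_av
    c_av = sum(travel_costs) / len(travel_costs)  # average travel cost C_av
    return [
        -beta * ((d - d_av) + (c - c_av))
        for d, c in zip(delay_costs, travel_costs)
    ]

# Illustrative costs in seconds for three tasks (made-up numbers)
rewards = reshape_rewards([15.0, 0.0, 30.0], [20.0, 25.0, 30.0], beta=0.5)
print(rewards)
```

Because the averages are only known once all tasks have been executed, rewards of this kind can only be computed at the end of an episode, not at the moment a single task finishes.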

Optimal Scheduling Strategy
The strategy π(∆|s k ) gives the probability of each action in the action space given the state variable s k at decision stage k. The scheduling system can select the container task to pair with an AGV based on these action probabilities. Associated with the policy π is the action value function Q π (s k , ∆ k ), which represents the expected cumulative discounted reward after executing action ∆ k in state s k and following the policy π thereafter, as shown in Formula (9). From the action value function of Formula (9), the Bellman equation under a general policy can be written, as shown in Formula (10). The basic idea of RL is to iteratively update the Bellman equation to learn an optimal strategy π * that maximizes the expected cumulative discounted reward. The cumulative reward under the optimal strategy is shown in Formula (11).
The AGV dynamic scheduling problem is modeled as an MDP. The ultimate goal is to find the optimal scheduling strategy π * that obtains the maximum expected cumulative discounted reward.
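For a single observed episode, the cumulative discounted reward can be computed with the same recursion that underlies the Bellman equation, G k = r k + γ G k+1 . The sketch below is a generic illustration with made-up rewards; the function name `returns_to_go` is hypothetical.

```python
def returns_to_go(rewards, gamma):
    """Discounted return from every decision stage to the end of the episode.

    Works backwards through the episode using G_k = r_k + gamma * G_{k+1}.
    """
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out

# Three decision stages with rewards 1, 2, 3 and discount factor 0.5
print(returns_to_go([1.0, 2.0, 3.0], gamma=0.5))
```

The action value function is the expectation of this quantity over the trajectories generated by the policy, which is what the Critic network approximates.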

CDA Scheduling Algorithm
In this paper, the DDPG algorithm is used to achieve AGV dynamic scheduling. In the previous section, we defined a large-scale discrete action space consisting of combinations of heuristic assignment rules and AGV types. Based on the original DDPG algorithm, the discrete action space reparameterization trick [29] is introduced and the DDPG algorithm is slightly modified so that it can better search for the optimal scheduling strategy π * in the discrete action space. This method combines the deterministic policy gradient (DPG) with the deep CNN and can be learned robustly. The DDPG algorithm builds on the network structure of the DQN algorithm and uses the fixed network technique, designing an estimation network structure and a target network structure to mitigate the instability of the target network update. The DDPG algorithm is based on the AC framework, and both of the above network structures contain two neural network approximators, the Actor network and the Critic network. In the estimation network, the Actor estimation network µ(s k |θ µ ) is the policy function that maps state variables to actions; the Critic estimation network Q(s k , ∆ k |θ Q ) is the parameterized value function that approximates the value Q(s k , ∆ k ) of the state variable s k and the action ∆ k given by the Actor estimation network. This value is then used to evaluate the quality of the action ∆ k given by the Actor estimation network and to guide the direction of the update of strategy π [28]; θ µ and θ Q are the parameters of the Actor estimation network and the Critic estimation network, respectively.
During the updating of the estimation network parameters, the Actor target network µ′(s k |θ µ′ ) temporarily holds fixed a copy of the Actor estimation network parameters, and the Critic target network Q′(s k , ∆ k |θ Q′ ) temporarily holds fixed a copy of the Critic estimation network parameters, which improves the stability and convergence of algorithm training. The target networks are structurally identical to the Actor and Critic estimation networks; θ µ′ and θ Q′ are the parameters of the Actor target network and the Critic target network, respectively.

CDA Algorithm Network Structure
The Actor network and the Critic network are function approximators, and the design of their network structure is very important for the nonlinear approximate estimation of the value function. Because the state information represents the port scheduling environment from multiple perspectives, the dimensions and distributions of the data differ across data sources. In this paper, the state variable s k , which represents the system information, consists of component s 1 k and component s 2 k , whose data structures are a 4-channel 2D matrix and a single-channel 2D matrix, respectively. The 2D matrices of the two components differ in dimension, so the component information cannot be fused directly along the channel axis. The CNN is widely applied in image classification, image recognition, video processing, and similar tasks. Therefore, this paper uses a multi-layer CNN, which combines simple patterns into complex patterns, when designing the network structure. This flexible combination can extract the correlations in the data and ensures that the deep CNN has sufficient expressive ability and generalization. A hierarchical processing idea is introduced to handle the components s 1 k and s 2 k of the state variable s k separately: two deep CNN structures are used to extract the key feature information of each state component, and the multi-dimensional key features are then flattened into one dimension to achieve multi-source information fusion. A deep CNN usually consists of three types of layers: convolution layers, pooling layers, and fully connected layers, with the convolution and pooling layers being structurally contiguous. In the convolution layers, the local features of the matrix data are extracted by convolution operations. The main role of the pooling layers is to further reduce the number of parameters by subsampling the features produced by the convolution operation and discarding the less important ones.
The convolution layers and the pooling layers are equivalent to feature engineering, while the fully connected layers are equivalent to feature weighting and act as a "classifier" in the whole neural network. The details of the Actor network and Critic network structures are shown in Tables 4 and 5, respectively.

Algorithm Update Process
The goal of the CDA algorithm is to optimize the policy π → π * of AGV scheduling through the nonlinear approximate estimation of action value functions by deep neural networks. The algorithm update process contains two sub-processes: the scheduling process of the AGVs and the training process of the algorithm. The update process of the algorithm is shown in Algorithm 1. The purpose of the AGV scheduling process is to obtain transfer sequence tuples consisting of dynamic state information, action information, and reward information by interacting with the automated terminal environment, which provide data support for the training process of the CDA algorithm. The scheduling process of the AGVs can be described as follows: the scheduling system obtains the state variable s k from the AGV scheduling environment. Based on the output of the Actor estimation network for the state variable s k , the scheduling system determines the assignment rule and the AGV with high priority. According to this assignment rule, the AGV is assigned to the high-priority container task and transports the container to the designated handover point (at a QC or a YC). After the scheduling decision has been executed, the scheduling system receives the feedback reward r k and obtains the new state variable s′ k , and the transfer sequence tuple [s k , ∆ k , r k , s′ k ] of the interaction is stored in the experience replay memory. The above process corresponds to lines 3-6 of Algorithm 1 and the blue solid arrows in Figure 3.
The training process of the algorithm is performed every fixed time step l s : BS samples [s j , ∆ j , r j , s′ j ] are drawn uniformly at random from the experience replay memory, and the state variables s j and s′ j are input to the Actor estimation network and the Actor target network, respectively. The actions ∆ j and ∆′ j are given based on the predicted values of the networks. The Critic target network takes the state variable s′ j and the action ∆′ j given by the Actor target network as input to predict the state action value Q′(s′ j , ∆′ j |θ Q′ ), from which the state action target value is calculated through Equation (12). The loss function L(θ Q ) is equal to the mean square error between the action value Q(s j , ∆ j |θ Q ) predicted by the Critic estimation network and the state action target value calculated by Equation (12); L(θ Q ) is shown in Equation (13). The Critic estimation network parameters θ Q are updated by gradient backpropagation. The policy gradient ∇J(θ µ ) is obtained from the gradient of the state action value Q(s j , ∆ j |θ Q ) of the Critic estimation network with respect to the action ∆ j , and is used to update the strategy parameters θ µ of the Actor estimation network, so that the network becomes more likely to select actions with a higher payoff. The policy gradient ∇J(θ µ ) is calculated by Equation (14). In order to ensure the stable convergence of the training process, whenever the target network parameter update frequency C u is met, the "soft" target update method is used to update the parameters of the Actor target network and the Critic target network, as shown in Equations (15) and (16). Lines 7-16 of Algorithm 1 describe the learning process of the algorithm and the process of updating the network parameters in detail; these processes correspond to the orange dashed arrows in Figure 3. Figure 4 illustrates the entire algorithm process more clearly in the form of a flowchart.
Input: Hyperparameters M, BS, α, β, γ, τ, lr, it max , C u , l s ; the path network G of the automated terminal horizontal transportation map; the transport time function Dis(., .)
Output: The optimal strategy π *, i.e., the optimal Actor estimation network parameters θ µ
1: Initialize the estimation network and target network parameters θ µ , θ Q , θ µ′ , θ Q′
2: For episode := 1 to it max
3:   The scheduling system obtains state information from the AGV scheduling terminal environment
4:   The Actor estimation network gives the decision action ∆ k = π θ µ (s k ) based on the state variable s k
5:   After the AGV completes the container transportation, the scheduling system gets the new state variable s′ k and reward r k , and determines whether the task is completed
6:   Store the state transfer sequence [s j , ∆ j , s′ j , r j ] in the experience replay memory
7:   If l s meets the conditions, start training the network
8:     For j := 1 to BS
9:       Sample [s j , ∆ j , s′ j , r j ] from the experience replay memory, and calculate the state action target value
10:      Calculate the mean square error loss function L(θ Q ) and update the Critic estimation network parameters θ Q by the gradient descent algorithm
11:      Use the backpropagation of the policy gradient ∇J(θ µ ) to update the Actor estimation network parameters θ µ
12:    End for
13:  End if
14:  If the condition of C u is met
15:    Update the target network parameters θ µ′ and θ Q′ using the "soft" target update method
16:  End if
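Equations (12), (13), (15), and (16) are not reproduced in this text. The sketch below uses the standard DDPG forms of the target value, the Critic loss, and the "soft" target update, which are assumed here rather than taken from the paper. Plain Python lists stand in for network parameters, and all function names are hypothetical.

```python
def td_target(reward, next_q, gamma):
    """State action target value: r_j + gamma * Q'(s'_j, mu'(s'_j))."""
    return reward + gamma * next_q

def critic_loss(targets, predictions):
    """Mean square error between target values and Critic predictions."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / n

def soft_update(target_params, online_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [
        tau * w + (1.0 - tau) * w_t
        for w, w_t in zip(online_params, target_params)
    ]

# A batch of two samples with made-up rewards and target-network values
targets = [td_target(1.0, 2.0, gamma=0.5), td_target(0.0, 4.0, gamma=0.5)]
loss = critic_loss(targets, [1.0, 2.0])

# Target parameters move a fraction tau towards the estimation parameters
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.25)
print(targets, loss, new_target)
```

A small τ makes the target networks change slowly, which is the stabilizing effect the text attributes to the "soft" update.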

Implementation of AGVs Dynamic Scheduling
The implementation of AGV dynamic scheduling using the CDA algorithm is basically the same as the algorithm update process. However, at this stage only the trained optimal policy is needed to guide the scheduling system in choosing the optimal heuristic assignment rule and the best AGV in each state; the CDA algorithm no longer needs to be trained.
As shown in Figure 5, one decision-making stage in an episode is taken as an example to illustrate the decision-making process and the state variable transition. The task situation in the current state is set as shown in Table 6, and a loading and unloading time of 10 s per task is assumed on both the sea side and the shore side. First, the current state variable s k is calculated according to the definition of the state variable (Part 3 Section 3) and is marked with a light-yellow rectangular box. At this time, AGV 2 is idle, indicated as "0" and marked with a green filled rectangle; AGV 1 and AGV 3 are busy, indicated as "1" and marked with red filled rectangles. Then, the state variable s k is used as the input to the optimal Actor estimation network. Provided that tasks remain, the idle AGV is specified and the task assignment rule is predicted. In this example, the action is ∆ k = (mr 11 , AGV 2). According to the definition of assignment rule mr 11 , there are three tasks in the task sequence of QC 3, of which the task numbered 3-1 is the most urgent; therefore, AGV 2 is assigned the task numbered 3-1 first. Finally, after AGV 2 completes the task numbered 3-1, the delay time and the travel time of this task are calculated by Formulas (2) and (3) as 15 s and 20 s, respectively. The state variable s′ k at the end of the task numbered 3-1 is then calculated and is marked by a light-blue rectangular box in the figure. The above task assignment process is repeated until all tasks are executed; the reshaped reward r k is then calculated based on Equations (4)-(8), and the transfer sequence tuple [s k , ∆ k , r k , s′ k ] of each decision stage is stored in the experience replay memory for the training of the algorithm. The algorithm training process is described in the previous section.
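The selection made by rule mr 11 (LQ-GUT) in this example can be sketched as a two-step filter: first pick the QC with the most remaining containers, then pick the most urgent container of that QC. The queue contents, the field names `task_id` and `urgency`, and the urgency scores below are illustrative assumptions; the paper's own definition of urgency is not reproduced in this text.

```python
def select_lq_gut(queues):
    """Pick a task by the LQ-GUT rule (mr 11).

    queues maps a QC id to its list of waiting tasks; each task is a dict
    with a "task_id" and an "urgency" score (larger means more urgent).
    Returns (qc_id, task_id), or None when no task is waiting.
    """
    waiting = {qc: tasks for qc, tasks in queues.items() if tasks}
    if not waiting:
        return None
    # LQ: the QC with the most remaining containers
    qc = max(waiting, key=lambda q: len(waiting[q]))
    # GUT: the most urgent container of that QC
    task = max(waiting[qc], key=lambda t: t["urgency"])
    return qc, task["task_id"]

# Made-up task queues in which QC3 holds three waiting tasks
queues = {
    "QC1": [{"task_id": "1-1", "urgency": 0.4}],
    "QC2": [
        {"task_id": "2-1", "urgency": 0.9},
        {"task_id": "2-2", "urgency": 0.1},
    ],
    "QC3": [
        {"task_id": "3-1", "urgency": 0.8},
        {"task_id": "3-2", "urgency": 0.5},
        {"task_id": "3-3", "urgency": 0.2},
    ],
}
choice = select_lq_gut(queues)
print(choice)
```

Note that task 2-1 has the highest urgency overall in this made-up data but is not chosen, because the hybrid rule restricts the urgency comparison to the QC with the longest queue.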

Note: Y means the task has been assigned or executed; N means the task has not yet been assigned.

Numerical Experiments
This section details the factors affecting the scheduling performance of AGVs. First, the optimal value of the parameter β in Equations (6) and (7) is determined empirically to ensure that the CDA algorithm converges reliably and schedules accurately. Second, the scheduling results of the single assignment rules and of other scheduling algorithms are compared and analyzed on different instances to verify the reliability and validity of the model and algorithm proposed in this paper. Finally, the simulation experiment is carried out under the double-cycle operation mode. The implementation of the AGV dynamic scheduling simulation experiment consists of two parts: building a simulated terminal environment for AGV scheduling, and implementing the CDA algorithm architecture with the deep learning framework TensorFlow. All experiments were conducted on Windows 10 with an Intel(R) Core(TM) i5-10200H CPU @ 2.40 GHz, 16 GB RAM, an NVIDIA GeForce GTX 1650 Ti, and python2019 Professorial. Each test case result is the average of five runs.

Experimental Parameters Setting
There are complexities and uncertainties in the container handling and transportation environments of the automated terminal. In order to reflect the real automated terminal environment as accurately as possible, some important experimental parameters and constraints are set in this paper, as described below.
(1) The number of container tasks considered in each episode is |N| ∈ [50, 500], where 50-100 containers are considered a small-scale problem and 100-500 containers a large-scale problem; the number of QCs on the sea side is |Q| ∈ [2, 8], the number of YCs on the land side is |Y| ∈ [4, 10], and the number of AGVs is |V| ∈ [5, 15].
In general, the hyperparameters of a DRL algorithm play a significant role in its convergence and training effect. However, because of the large number of hyperparameters and the large parameter search space of the DRL algorithm, it is difficult to find the optimal values. In this paper, we refer to the hyperparameters given by Liu D. et al. [31] and obtain the final CDA algorithm-related parameters through preliminary experiments, as shown in Table 7.

Parameter Experiment
In order to alleviate the difficulty of convergence of the CDA algorithm caused by the unstable transitions of the scheduling environment, a scaling factor β for the difference terms is introduced in the reward reshaping, and the setting of β affects the accuracy and convergence speed of the algorithm. To determine a suitable value, a test case is designed with the number of containers |N| = 100, the number of QCs |Q| = 4, the number of YCs |Y| = 6, and the number of AGVs |V| = 20, and the experiment is carried out with β taking the values 0.1, 0.3, 0.5, 0.7, and 0.9. The results are shown in Table 8. Figures 6-8 show the cumulative rewards of the algorithm, the delay time of container tasks, and the total travel time of the AGVs, respectively. It can be seen that every tested value of β drives the optimization objectives towards minimization, but values of β that are too high or too low make it difficult for the algorithm to converge, and an unsuitable value of β leads to unsatisfactory objective values after convergence. Taking these results together, the scaling factor β in this paper is set to 0.5.

Comparison of Experimental Results
When the CDA algorithm proposed in this paper is used for AGV dynamic scheduling, the scheduling system assigns tasks to the specified AGVs based on the designed heuristic assignment rules, and the optimal policy learned by the algorithm is the choice of task assignment rule in each state. The results obtained in this way are better than those of a single rule [19]. It is therefore necessary to compare the scheduling results of the CDA algorithm with those of the single assignment rules to verify the effectiveness of the model and algorithm. In addition to the task delay time and AGV travel time proposed above, the container task delay rate and the task completion time are also introduced as evaluation metrics for the comprehensive evaluation of the CDA algorithm and the single assignment rules. The container task delay rate is calculated as shown in Equation (17). Experiments are designed for small-scale container tasks and numbers of AGVs in cases 1-10. Tables A1-A4 in Appendix A show the scheduling results of the 18 single heuristic rules and of the CDA algorithm on the different metrics; the best results on each metric are marked in bold. The experimental results show that the algorithm performs well in all 10 cases. Overall, the CDA algorithm improves the average performance on task completion time, task delay time, AGV travel time, and task delay rate by 15.63%, 56.16%, 16.36%, and 30.22%, respectively.
The training process of the algorithm in case 8 is shown in Figure 8. As the training proceeds, the cumulative reward quickly converges to a maximum value. Figure 9a-d shows the trends of task completion time, task delay time, and AGV travel time during training, which are exactly the opposite of the trend of the cumulative reward. Figures A1-A4 in Appendix A clearly show the scheduling results of the CDA algorithm and the single assignment rules in case 8. The CDA algorithm outperforms the single task assignment rules in terms of both task completion time and AGV travel time, while it is slightly inferior to the GUT rule and the LQ-GUT rule on the two metrics of task delay rate and task delay time. Because these two rules always assign the most urgent task to an AGV first, they increase the AGV travel time, and thus the overall operational efficiency of terminal AGV scheduling decreases. Comparing Figures A2 and A3 in Appendix A, we find that CDA is better than the GUT rule in reducing the delay rate of tasks but slightly inferior to the LQ-GUT rule. In conclusion, the key to improving the overall efficiency of horizontal transportation at the terminal is that the optimal task assignment rule can be selected according to the situation during AGV scheduling.
To further verify the superiority of the algorithm in solving large-scale problems, this paper sets up a comparison experiment with other algorithms, including the adaptive genetic algorithm (AGA) [1], which is commonly used for dock scheduling, and the rolling horizon approach (RHPA) [32], which is used for dynamic scheduling. To measure the difference in scheduling results between the CDA algorithm and the other algorithms, the GAP value of the maximum completion time is used; the GAP value between the CDA algorithm and AGA is calculated as shown in Equation (18), where t CDA m and t AGA m are the maximum completion times of the CDA algorithm and AGA, respectively. For the evaluation criterion of completion time, a positive GAP value means that the CDA algorithm is superior; otherwise, AGA is superior. The GAP value between the CDA algorithm and RHPA is calculated analogously to Equation (18).
In this experiment, the three algorithms are used to solve cases 11-30, and the results are shown in Table A5 in Appendix A. The experimental results show that the proposed CDA algorithm can obtain approximately optimal solutions for test cases of different scales. The curves of GAP CDA−AGA and GAP CDA−RHPA are plotted in Figure 10. Both curves show a slowly rising trend, and the performance of the CDA algorithm is close to that of AGA and RHPA when the test cases are relatively small. As shown in Table A5 in Appendix A, in Case 13, for example, GAP CDA−AGA is −1.50%, so the CDA algorithm has slightly worse scheduling results than AGA in this case; in Case 16 and Case 17, the GAP CDA−RHPA values are −1.92% and −1.34%, respectively, so RHPA has better results than the CDA algorithm in these two cases. The scheduling performance of the CDA algorithm becomes significantly better as the size of the cases increases (number of container tasks ≥ 300). In conclusion, the CDA algorithm is slightly less capable on small-scale problems than on large-scale problems, since the state space of large-scale problems provides a larger optimization space for the network to learn in and to reduce training errors. Across cases 11-30, the CDA algorithm improves on AGA and RHPA by 3.10% and 2.40% on average, respectively.
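The sign convention of the GAP value can be checked with a short calculation. Equation (18) is not reproduced in this text, so the relative-difference form below, normalized by the completion time of the compared algorithm, is an assumption chosen to match the stated convention that a positive GAP means the CDA algorithm is superior. The function name `gap_percent` and the sample completion times are made up.

```python
def gap_percent(t_cda, t_other):
    """Relative gap in maximum completion time, in percent.

    Assumed form (NOT necessarily the paper's Equation (18)):
        GAP = 100 * (t_other - t_cda) / t_other
    Positive when the CDA algorithm finishes earlier than the other one.
    """
    return 100.0 * (t_other - t_cda) / t_other

# Made-up completion times in seconds
print(gap_percent(t_cda=97.0, t_other=100.0))   # CDA finishes earlier
print(gap_percent(t_cda=203.0, t_other=200.0))  # CDA finishes later
```

Under this form, the negative values reported for Cases 13, 16, and 17 correspond to instances where the CDA algorithm's maximum completion time is longer than that of the compared algorithm.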
For different numbers of AGVs, the scheduling results of the CDA algorithm are compared with those of RHPA and AGA. The number of pre-considered tasks is set to 300; the numbers of QCs and YCs are set to 4 and 8, respectively; and the number of AGVs is set to 10, 12, 14, 16, 18, and 20, in that order. The results of these six groups of experiments are shown in Table 9. Figures 11-13 depict the trends of task delay time, AGV travel time, and task completion time, respectively. The following conclusions can be drawn: (1) the travel time of the AGVs remains almost constant as the number of AGVs increases; (2) for all three algorithms, the task delay time is sensitive to the number of AGVs and decreases as the number of AGVs increases; (3) AGA is the most sensitive with respect to this indicator, followed by RHPA, and the CDA algorithm is the least sensitive; (4) the sensitivity of the three algorithms with respect to task completion time is similar to conclusion (2), which reflects the fact that the optimization of task completion time depends to a greater extent on the degree of equipment synergy; (5) in terms of task delay time and task completion time, the difference between the CDA algorithm and both AGA and RHPA is significant when the number of AGVs is small; when the number of AGVs is greater than 16, this difference gradually decreases as the number of AGVs increases. Overall, the scheduling performance of the CDA algorithm is significantly better than that of AGA and RHPA.
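The six experiment groups of the sensitivity test can be written down as a small configuration sketch. Only the numeric settings (300 tasks, 4 QCs, 8 YCs, and 10 to 20 AGVs in steps of 2) come from the text; the `ScenarioConfig` class, its field names, and the `sensitivity_scenarios` helper are hypothetical and do not describe the authors' code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ScenarioConfig:
    """One experiment group of the AGV-count sensitivity test (hypothetical type)."""
    n_tasks: int
    n_qcs: int
    n_ycs: int
    n_agvs: int


def sensitivity_scenarios(
    n_tasks: int = 300,
    n_qcs: int = 4,
    n_ycs: int = 8,
    agv_counts: Iterable[int] = range(10, 21, 2),
) -> List[ScenarioConfig]:
    """Build one scenario per AGV count while holding all other settings fixed."""
    return [ScenarioConfig(n_tasks, n_qcs, n_ycs, n) for n in agv_counts]


# Six groups: 10, 12, 14, 16, 18, and 20 AGVs.
scenarios = sensitivity_scenarios()
```

Each scenario would then be solved by every algorithm under comparison, with only the AGV count varying between groups.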

Conclusions and Future Research Direction
This article discusses how to choose the AGV and container assignment method to improve the synchronization of handling equipment and transportation equipment in the automated terminal, and transforms the dynamic scheduling problem into a sequential decision problem. The scheduling state, represented as a multi-channel matrix, together with heuristic assignment rules and reward functions, is introduced to simplify the complex AGV dynamic scheduling process. A reinforcement learning algorithm using deep convolutional networks and hybrid heuristic rules is proposed to optimize the mapping from the state-action space to the optimal policy. A real AGV horizontal transportation scenario is simulated, and uncertain task loading and unloading times are considered. In order to obtain the optimal hybrid scheduling rules for AGVs in different states, a large number of experimental cases are designed, and the CDA algorithm is trained on these cases. Comparing the scheduling results of the CDA algorithm with each single scheduling rule defined in this paper, and with other solution algorithms including AGA and RHPA, verifies the effectiveness and superiority of the proposed algorithm: the scheduling performance of the CDA algorithm improves by 29.59% on average over a single scheduling rule, and the algorithm reduces the task completion time by about 3.10% and 2.40% relative to AGA and RHPA, respectively. A sensitivity test on the number of AGVs further demonstrates the performance of the CDA algorithm. The results show that, as the number of containers and AGVs increases, the advantages of the CDA algorithm become more apparent.
In future research, in addition to the dynamic scheduling of AGVs, the dynamic path planning problem of AGVs can also be studied. Multiple AGVs share a road network in the automated terminal, so it is necessary not only to ensure the shortest path for AGVs transporting containers, but also to consider whether AGV trajectories cross or overlap, which leads to collisions, congestion, and other conflicts. If AGV path conflicts are not handled properly, they will not only prolong the travel time of the AGVs, but also increase the waiting time of QCs or YCs, resulting in a decrease in operational efficiency and a significant increase in operating cost. Therefore, AGV dynamic scheduling combined with AGV dynamic path planning is one of the main directions for future research. At present, the application of DRL algorithms to AGV dynamic scheduling and AGV dynamic path planning in actual automated terminals requires further research and exploration. In addition to the synchronous operation of QCs and AGVs, the task delivery efficiency between AGVs and YCs also urgently needs to be addressed. In the future, the coupling constraints between AGVs and YCs can be considered, such as a shortage of AGV mates in the yard area that delays tasks. In the new U-shaped automated terminal [33], both the internal AGVs and the external trucks deliver tasks directly to the YCs. There is no buffer between the different pieces of material handling equipment, so the synchronized operation of AGVs and YCs needs to be considered. Therefore, DRL-based algorithms need to be improved to adapt to the more complex operating environment of the automated terminal.

Table A4. Tasks delay rate of CDA algorithm and assignment rules under different cases.

Cases | Tasks Delay Rate

Figure A4. Curve of tasks delay rate of CDA algorithm and assignment rules under different cases.