Multiple Container Terminal Berth Allocation and Joint Operation Based on Dueling Double Deep Q-Network

Abstract: In response to the evolving challenges of integrating multiple container terminal operations under berth water depth constraints, the multi-terminal dynamic and continuous berth allocation problem (MDC-BAP) emerges as a critical issue. Based on computational logistics, the MDC-BAP is formulated as a unique variant of the classical resource-constrained project scheduling problem and modeled as a mixed-integer programming model, with the objective of minimizing the total dwelling time of liner ships in ports. To address this, a Dueling Double DQN-based reinforcement learning algorithm is designed for the MDC-BAP. A series of computational experiments is executed to validate the algorithm's effectiveness and its aptitude for multi-terminal joint operation. Specifically, the Dueling Double DQN algorithm boosts the average solution quality by nearly 3.7% compared to classical algorithms such as Proximal Policy Optimization, Deep Q-Network, and Dueling Deep Q-Network, and also achieves better solution quality when benchmarked against the commercial solver CPLEX. Moreover, the performance advantage escalates as the number of ships increases. In addition, the approach enhances the service level at the terminals and slashes operation costs. On the whole, the Dueling Double DQN algorithm shows marked superiority in tackling complicated, large-scale scheduling problems and provides an efficient, practical solution to the MDC-BAP for port operators.


Introduction
With the steady expansion of global trade and the widespread adoption of container transportation [1][2][3][4], the rise and subsequent evolution of container technology has profoundly impacted the overarching framework of global logistics. The Production Plan, Task Scheduling, and Resource Allocation (PPTSRA) within Container Terminal Handling Systems (CTHS) have emerged as focal points and challenges globally. To address these challenges, a myriad of algorithms has been extensively applied to PPTSRA issues in CTHS, especially the Berth Allocation Problem (BAP) [5][6][7] and the related allocation and scheduling of dock operational space. In the collaborative body of the global supply chain and logistics, the BAP occupies a prominent position at the tactical decision-making and implementation levels [8,9]. The implications of the BAP are not limited to ships' port dwelling times and operational costs but also involve the terminal's total throughput capacity, loading and unloading efficiency, equipment scheduling strategies, and its synergistic effects with the broader supply chain [10]. Hence, from various perspectives, such as improving port operational efficiency and reducing port dwelling times and operational costs, the BAP holds profound academic and practical value [11,12].
However, the BAP is a combinatorial optimization challenge with multiple constraints and numerous objectives, involving various interconnected decision elements. These include ship arrival and departure times, dynamic berth resource allocation, and various other operational constraints [13]. Notably, the BAP has been proven to be a Non-deterministic Polynomial Complete (NPC) problem [14], making its related planning, scheduling, and decision-making strategies a continued research focus and challenge within the academic community.
For the single-berth allocation problem, Kim et al. [15] developed an integer programming model to schedule quay berths and quayside cranes, taking into account a multitude of constraints. The approach is divided into two stages: the first employs subgradient optimization to determine vessel berth positions, timing, and crane allocations, achieving near-optimal solutions; the second, building upon the results of the first, uses dynamic programming to devise detailed crane schedules. The performance of the algorithm was validated through numerical experiments. Yang et al. investigated the continuous berth allocation challenge under dynamic ship arrival scenarios and formulated an integer linear programming model aimed at minimizing the total dwelling time of ships at the port [16]. Lin et al. took into account both dwelling time and associated penalty costs, delving into optimization studies of berth allocation strategies at container terminals [17]. Sheikholeslami et al. [18] aimed to minimize vessels' departure delays and proposed a model that accounts for tide effects at terminals with discrete berths. Additionally, metaheuristic-based approaches have garnered attention. For instance, Park et al. employed a modified particle swarm optimization algorithm to analyze a two-stage stochastic planning model, thereby achieving robust berth allocation strategies [7]. Zeng and his team conducted an in-depth discussion of integrated operational strategies at container terminals against the backdrop of time-of-use electricity pricing, employing a tailored genetic algorithm for solutions [19]. Song et al. [20] systematically studied the berth allocation problem under varying water depth conditions; they adjusted the priority of each ship using a weighting strategy, basing their research on a mathematical model that minimizes the total weighted service time.
However, with the continuous flourishing of the global shipping industry and the increasing mergers and acquisitions of terminals by large shipping companies and groups, the berth allocation mechanism of a single terminal evidently struggles to adapt to the increasingly complex and highly integrated logistics landscape [21][22][23][24]. Consequently, research on the Multi-terminal Dynamic and Continuous Berth Allocation Problem (MDC-BAP) is becoming increasingly significant. Compared to the traditional single-terminal BAP, the MDC-BAP exhibits a noticeable escalation in structural complexity and difficulty of resolution. Given the distinct strong coupling characteristics and intrinsic high complexity of the MDC-BAP, research on this topic remains relatively nascent. Hendriks [25] and Xu et al. [26] delved into joint berth allocation within container hub ports, offering invaluable theoretical insights and practical strategies for this domain. Li Bin et al. [27], from a novel perspective, transformed the MDC-BAP into a heterogeneous multi-knapsack problem for optimization modeling. They subsequently constructed a mixed-integer programming model aimed at minimizing the total costs for both ports and shipping entities. Following this, they introduced a two-stage imperialist competitive algorithm, fusing computational logistics with swarm intelligence to address the problem.
Although in recent years many scholars have employed various intelligent optimization algorithms for specific BAP instances and achieved notable algorithmic performance, when facing large-scale, highly dynamic, and highly uncertain actual production environments, their optimization efficiency and accuracy often face significant challenges. Especially in the PPTSRA scenarios of the CTHS, the application of computational intelligence faces evident bottlenecks in terms of universality, robustness, agility, portability, and scalability (GRAPE). This limitation is even more pronounced in Heterogeneous Container Terminal Cluster Logistics Generalized Computation Systems (HCTC-LGCS) [27]. However, with the rapid development of artificial intelligence and machine learning, these techniques have been widely applied to the PPTSRA of container terminals [28][29][30][31]. Notably, Reinforcement Learning (RL) provides a novel computational approach to NPC and resource-constrained scheduling dilemmas [32][33][34][35][36]. Compared to traditional optimization algorithms and heuristic strategies, reinforcement learning does not rely on problem-specific rules or predefined objective functions. Instead, it learns decision policies adaptively through continuous interaction with the environment, achieving optimal or near-optimal solutions. This characteristic endows RL with clear superiority when dealing with dynamism, uncertainty, and numerous complex constraints [37][38][39]. For instance, when confronted with resource-constrained scheduling issues, reinforcement learning possesses online learning and adaptive capabilities, allowing it to flexibly balance potentially conflicting objectives and formulate efficient scheduling strategies while adhering to operational constraints.
Moreover, the computational complexity of the MDC-BAP rises sharply with the number of terminals and vessels. Specifically, for a single terminal with n vessels, p berth points, and a time scale of t, the size of the decision space is on the order of O(n · p · t); when extended to m quays, it becomes O(m · n · p · t), and the number of feasible schedules to search grows combinatorially as ships and terminals are added. Traditional heuristic algorithms struggle to find quality solutions for such high-dimensional problems. Additionally, benefiting from its ability to learn from accumulated experience in historical data, RL exhibits excellent generalization performance when dealing with unseen but structurally similar problems [40][41][42][43]. Therefore, integrating reinforcement learning into research on continuous, linked, and optimized berth allocation under the HCTC-LGCS not only offers efficient optimization strategies for the MDC-BAP but also presents a potential pathway to address the challenges of universality, robustness, agility, portability, and scalability in computational intelligence applied to PPTSRA issues.
In this paper, we conduct an in-depth modeling of the MDC-BAP, taking into account the water depth constraints of the terminals. To precisely address its inherent high coupling, dynamic characteristics, and complexity, we choose the Dueling Double DQN (D3QN) reinforcement learning algorithm as our solution strategy. A series of numerical experiments verifies the effectiveness and feasibility of this algorithm for the problem. The remainder of the article is organized as follows: Section 2 elaborates on the operations research abstraction and mathematical modeling of multi-terminal berth allocation; Section 3 delves into the D3QN algorithm and its network structure; Section 4 empirically demonstrates the advantages of the D3QN algorithm through comparative analysis of experimental data; Section 5 summarizes the article with in-depth discussions and looks forward to future research directions.

Multi-Container Terminal Berth Allocation for Computational Logistics
Under the current computational logistics theoretical framework, multiple container terminals located along the coast and belonging to the same organizational entity can be considered a high-level computational node cluster in the global supply-chain environment, referred to as Heterogeneous Container Terminal Cluster Logistics Generalized Computation Systems (HCTC-LGCS). This concept further interprets terminal operations as a generalized decision-making and scheduling challenge involving Computation, Memory, and Switching (abbreviated as CMS) [17]. Within this framework, the MDC-BAP is a pivotal component. Specifically, from the perspective of computational logistics, the Multi-terminal Dynamic and Continuous Berth Allocation Problem (MDC-BAP) can be seen as a specific variant of the Resource-Constrained Project Scheduling Problem (RCPSP). The problem possesses pronounced non-linear characteristics and deep coupling, leading to extremely high computational complexity.
The Resource-Constrained Project Scheduling Problem (RCPSP) is a classical optimization problem in project management, which focuses on how to allocate resources to project activities under limited resource constraints in order to achieve specific objectives, such as the shortest project completion time [44,45]. At any given point in time in the RCPSP, the amount of each resource available is limited, and each activity has a predetermined duration and a need for one or more resources. These resources may include manpower, machinery, materials, etc. The heart of the problem is how to schedule the start of activities so that, for each resource, demand never exceeds availability throughout the duration of the project, while respecting the precedence relationships between activities. This means that some activities must be completed before others can begin. The RCPSP is a typical NPC problem, which means that as the number of activities and resources increases, the search for an optimal solution in the problem space becomes exponentially more difficult [46][47][48][49].
To describe the RCPSP, researchers often use Activity-On-Arrow (AOA) directed graphs. Figure 1 shows a simple example describing the whole scheduling process. In this case, consider a renewable resource with availability 4. Each activity is represented by a circle, where the upper number indicates the most probable duration of the activity and the lower number indicates the activity's demand for the resource. The arrows indicate the precedence relationships between activities. For example, Activity 1 most likely has a duration of 3 and a resource requirement of 2; its immediate predecessor is Activity 0, and its immediate successor is Activity 6. There are eight activities in the project, of which Activity 0 and Activity 7 are dummy activities representing the beginning and end of the project, respectively; they require no resources and consume no time. With this representation, the sequence of execution between tasks or activities becomes intuitive.
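The scheduling mechanics behind the AOA example can be sketched with a serial schedule generation scheme for a single renewable resource. Only Activity 1's data (duration 3, demand 2) comes from the text; the rest of Figure 1's network is not reproduced here, so the instance in the usage note below is illustrative:

```python
def serial_sgs(durations, demands, preds, capacity):
    """Serial schedule generation scheme: schedule activities in index order,
    placing each at the earliest start that respects precedence and keeps
    total demand on the single renewable resource within `capacity`."""
    n = len(durations)
    start = {}
    horizon = sum(durations) + 1
    usage = [0] * horizon  # resource consumption per time unit
    for j in range(n):
        # earliest precedence-feasible start
        t = max((start[p] + durations[p] for p in preds[j]), default=0)
        # shift right until the resource profile admits the activity
        while any(usage[u] + demands[j] > capacity
                  for u in range(t, t + durations[j])):
            t += 1
        start[j] = t
        for u in range(t, t + durations[j]):
            usage[u] += demands[j]
    return start
```

For instance, with capacity 4 and activities 0 (dummy), 1 (duration 3, demand 2), 2 (duration 2, demand 3, also after the dummy), and 3 (duration 1, demand 2, after 1 and 2), activity 2 cannot run alongside activity 1 (2 + 3 > 4) and is shifted to start at time 3.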
Further research found that the MDC-BAP and the RCPSP share similarities in several aspects. Specifically, every vessel awaiting assignment can be considered as an activity, and every available berth point at a container terminal is regarded as a limited resource. The operation time of a vessel at a container terminal equates to the execution duration of an activity, and the waiting relationships between vessels due to constraints form dependencies between tasks.
Based on computational logistics, the MDC-BAP can be abstracted as a special kind of RCPSP, called the Heterogeneous and Reconfigurable Resource Pool Constrained Project Scheduling Problem (HRRP-CPSP). Compared with the classical RCPSP, the MDC-BAP not only needs to consider the arrival order of vessels and the quantity constraints of container terminal resources, but also must synthesize multi-dimensional factors such as vessel size, berthing requirements, and operation duration. Vessels in the MDC-BAP arrive dynamically, so berth allocation decisions are made dynamically; this presents a stark contrast to the static task scheduling in the RCPSP. Meanwhile, the MDC-BAP involves not only scheduling in the time dimension but also spatial interaction between vessels and container terminals, which goes beyond the single resource type considered in the RCPSP. Furthermore, no two terminals in the world are identical, and thus multiple terminal frontal shorelines can be abstracted as a distributed pool of heterogeneous and reconfigurable resources.
The MDC-BAP defined in this paper is characterized by the following assumptions:

1. In the considered model, there are multiple terminals managed by the same operating entity. Each vessel can berth at only one of these terminals. Within each terminal, a continuous berth allocation strategy is implemented along its unique shoreline. Once a container vessel's berthing position has been determined, its operations are continuous and cannot be interrupted until the vessel has completed its operations and departed from its berth.

2. Under the dynamic arrival mechanism, each container vessel has a predetermined expected docking terminal before the berth allocation algorithm is executed, and the containers reserved for that vessel are prepositioned in the yard. If the vessel's actual docking terminal differs from its expected one, its containers must undergo a transfer operation.

3. The transfer of containers must be completed before the vessel arrives at the port, so that no additional time cost is added to the vessel's loading and unloading operations. The cost of container transfer is reflected in the transportation cost generated during the transfer process; other costs are not taken into account.

4. Each vessel is given a latest departure time limit. The berth allocation strategy must ensure that the actual departure time of each vessel does not exceed this limit.

5. Ships have differentiated priorities. Berth allocations are sorted according to these priorities to ensure that ships are served in their order of priority.

6. Along the same terminal shoreline, container vessels operating at the same time must not overlap in either the time or the spatial dimension. Adjacent vessels must maintain a prescribed safety interval, set at fifteen percent of the vessel's length; for the purpose of calculation, this safety distance is included in the ship length data in this study.

7. Regarding physical constraints, the water depth at the terminal where each vessel berths must exceed the vessel's draft, and the vessel's length must not exceed the physical length of the quay.

8. The potential impact of force majeure and contingencies on the efficiency of port operations is not considered.
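The physical feasibility conditions above (the 15% safety margin folded into the effective ship length, the shoreline-length limit, and the draft-versus-depth requirement) can be sketched as a simple check; the function name and argument layout are illustrative, not the paper's implementation:

```python
def feasible_berth(vessel_len, vessel_draft, berth_pos, quay_len, quay_depth):
    """Check the physical berthing conditions: the vessel (with the 15% safety
    margin folded into its effective length) must fit within the quay
    shoreline, and the quay must be deeper than the vessel's draft."""
    eff_len = 1.15 * vessel_len          # safety interval included in length
    fits = 0 <= berth_pos and berth_pos + eff_len <= quay_len
    deep_enough = quay_depth > vessel_draft
    return fits and deep_enough
```

A berth allocation algorithm would call such a check before committing a (quay, position) decision, forcing a wait action whenever it fails.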

Notation
All notation in this paper is listed in alphabetical order.

Q: Set of all quays, Q = {1, . . . , |Q|}, where |Q| represents the total number of quays;
V: Set of incoming vessels during a planning period, where |V| represents the total number of vessels;
D_i: Latest departure time for vessel i, ∀i ∈ V;
L_q: Length of the shoreline of quay q, ∀q ∈ Q;
N_q: Number of berthing points (available berths) in quay q, ∀q ∈ Q;
M: A sufficiently large positive integer;
Pr_i: Time required for vessel i to pre-prepare containers, ∀i ∈ V;
P_iq: Equal to 1 if vessel i is berthed at its predetermined quay q, and 0 otherwise;
Weight considering the priority of vessel i, ∀i ∈ V;
b_i: Loading and unloading time of vessel i, ∀i ∈ V;
c0_q: Number of gantry cranes at quay q, ∀q ∈ Q;
cmin_i: Minimum number of gantry cranes that can be allocated to vessel i, ∀i ∈ V;
cmax_i: Maximum number of gantry cranes that can be allocated to vessel i, ∀i ∈ V;
The quay where vessel i plans to berth, ∀i ∈ V;
x_i: Time taken by vessel i from anchorage to berth, ∀i ∈ V;
x0_i: Time taken by vessel i from berth to anchorage, ∀i ∈ V;
y_i: Preparation operation time required for vessel i, ∀i ∈ V;
y0_i: Time required for vessel i to clean up after loading and unloading, ∀i ∈ V;

Decision variables
Departure time of vessel i from the berthing point, ∀i ∈ V;
E_i: Departure moment of vessel i from the port, ∀i ∈ V;
S_i: Moment when vessel i starts berthing, ∀i ∈ V;
Tw_i: Waiting time of vessel i, ∀i ∈ V;
T_qq′: Duration needed to move from the expected quay q to the actual quay q′; if q = q′, then T_qq′ = 0, ∀q, q′ ∈ Q;
Z_ijq: Binary variable equal to 1 if vessel i and vessel j dock at the same quay q and vessel j starts berthing after vessel i departs the berthing point;
Number of quay cranes required by vessel i at moment t;
e_ij: Binary variable equal to 1 if vessel i and vessel j dock at the same quay and vessel j is berthed to the right of vessel i;
Binary variable equal to 1 if vessel j starts berthing after vessel i departs the berth, and both vessels dock at the same quay.

Mathematical Model
The objective function (1) aims to minimize the total time cost of all ships in port, which is the sum of their port stay duration, transit time, and pre-storage time. It is noteworthy that in previous research [27], to obtain feasible solutions, researchers often allowed vessels to exceed their latest departure time and only computed corresponding extension costs. In this study, however, it is explicitly stipulated that the departure time of a vessel must not exceed its predetermined latest departure time, which undoubtedly increases the difficulty of solving the problem.
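The equation itself is not reproduced in the surviving text; under the notation above, with A_i denoting vessel i's arrival time at the port and q̂_i its expected quay (both assumed symbols not fixed by the surviving notation), the objective plausibly takes a form such as:

```latex
\min \; Z \;=\; \sum_{i \in V} \Big[ \underbrace{(E_i - A_i)}_{\text{port stay}} \;+\; \underbrace{T_{\hat{q}_i q_i}}_{\text{transit (transfer)}} \;+\; \underbrace{Pr_i}_{\text{pre-storage}} \Big]
```

This is a reconstruction consistent with the verbal description of objective (1), not the paper's exact formula.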
In modeling the vessel berthing process, a crucial premise is that the start of a ship's berthing must strictly follow its arrival at the port. Based on this premise, Constraint (2) ensures that this condition is adhered to. Furthermore, Constraint (3) quantifies the waiting time of ships: the waiting time of a ship is defined as the moment the ship begins berthing minus the moment of the ship's arrival at the port.
Constraint (4) defines the time at which vessel i arrives at the berth point. According to this constraint, the arrival time at the berth point is the sum of the moment the ship begins berthing, the time taken to travel from the anchorage to the berth point, and the time required for preparatory operations. This expression ensures a comprehensive consideration of the ship's transit time. Furthermore, Constraint (5) stipulates the uniqueness of ship berthing: each ship can be docked at only one designated quay at any given time. This not only reflects the physical limitations of actual operations but also ensures the logical consistency of the model and its feasibility in practical application.
In the mathematical model for vessel berth allocation, the allocation of quay crane resources is a key factor. Each quay possesses a specific number of quay cranes, and this resource limitation is precisely reflected in Constraints (6)-(8). These constraints dictate the number of quay cranes that can be allocated to a ship, ensuring that at any given time the total number of quay cranes allocated does not exceed the total number available at the quay where the ship is berthed. In the multi-quay berth allocation problem, ensuring that the allocation between two vessels adheres to physical constraints is crucial. Specifically, Constraints (9)-(17) ensure that two vessels berthed at the same quay do not overlap in time and space: when allocating a berth point to any vessel, the berth point must not be occupied by another vessel during the allocated time period. The implementation of these constraints ensures the logical and practical feasibility of berth allocation, providing a robust and efficient decision-support tool for port berth management.
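The non-overlap requirement behind Constraints (9)-(17) amounts to rejecting any pair of vessels at the same quay whose berthing intervals and shoreline segments intersect simultaneously. A minimal sketch of that pairwise test (names illustrative, assuming ship lengths already include the safety margin):

```python
def vessels_conflict(s1, e1, p1, l1, s2, e2, p2, l2):
    """Two vessels at the same quay conflict iff their berthing intervals
    [s, e) and shoreline segments [p, p + l) overlap at the same time.
    s/e: start/end moments at the berth; p: berth position; l: ship length."""
    time_overlap = s1 < e2 and s2 < e1
    space_overlap = p1 < p2 + l2 and p2 < p1 + l1
    return time_overlap and space_overlap
```

Geometrically, each vessel occupies a rectangle in the time-space (berth) diagram, and the constraints forbid rectangle intersection; the MIP model enforces the same condition with binary variables and big-M terms.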
Constraint (18) explicitly stipulates that the docking position of a vessel must lie within the quayside length of the quay. This ensures that the berthing position does not exceed the actual physical boundaries of the quay, thereby maintaining the safety and effectiveness of port operations. Simultaneously, Constraint (19) emphasizes that the quay where a vessel docks must have sufficient water depth to meet the vessel's draft requirements. This constraint ensures the safe berthing of the vessel and avoids safety incidents that could be caused by insufficient water depth.
In the study of vessel berth allocation and subsequent operations, Constraints (20)-(22) define the time frame for vessel operations at the berth. Specifically, Constraint (20) dictates that the moment a vessel leaves the berth point is the sum of its arrival time at the berth point plus the time required for loading or unloading containers and cleaning up. Furthermore, Constraint (21) defines the departure time of the vessel as the moment it leaves the berth point plus the time taken to travel from the berth point to the anchorage; this reflects the entire time process of a vessel from completing loading/unloading operations to actually leaving the port. Finally, Constraint (22) imposes a limit on the departure time of the vessel, ensuring it does not exceed the agreed latest departure time. This constraint maintains the orderliness of port operations and the reliability of vessel operational schedules, while also honoring the contractual obligations between shipping companies and port administrators.
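Since the numbered equations are missing from the surviving text, these three conditions can be written out as a reconstruction. With S′_i denoting vessel i's arrival moment at the berth point and F_i its departure moment from the berth point (assumed symbols), and using b_i, y0_i, x0_i, E_i, and D_i from the notation above:

```latex
F_i = S'_i + b_i + y^{0}_i, \quad \forall i \in V \qquad \text{(cf. Constraint 20)}
```
```latex
E_i = F_i + x^{0}_i, \quad \forall i \in V \qquad \text{(cf. Constraint 21)}
```
```latex
E_i \le D_i, \quad \forall i \in V \qquad \text{(cf. Constraint 22)}
```

These forms follow the verbal descriptions term by term and are not copied from the original model.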

Solutions of the Formulated Problem
In this section, the MDC-BAP is discussed from the perspective of reinforcement learning, and a solution method based on D3QN is introduced. To evaluate this method, four other solution strategies are used as references: CPLEX, PPO, DQN, and Dueling DQN.

Reinforcement Learning
Reinforcement learning focuses on strategies by which an agent interacts with the environment to optimize long-term rewards; its theoretical framework is based on Markov Decision Processes (MDP). Unlike supervised learning, which is based on input-output pairings, reinforcement learning determines the best strategy through exploration and exploitation in unknown environments. Key components include states S, actions A, and rewards R. The goal is to determine an action for each state so as to maximize cumulative rewards while balancing immediate and long-term rewards. The state value function V(s_t) describes the expected cumulative reward obtained from a given state s_t under a particular strategy. The action value function Q(s_t, a_t) indicates the anticipated reward for executing an action in a specific state and then following the strategy.
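The surviving text omits the two defining formulas; in their standard textbook form (a reconstruction, not copied from the original), the value functions under a policy π with discount factor γ ∈ [0, 1) are:

```latex
V^{\pi}(s_t) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t \right],
\qquad
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t, a_t \right]
```

The two are related by V^π(s) = E_{a∼π}[Q^π(s, a)], which is the basis of the dueling decomposition used later.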

Construction of Partially Observable Markov Decision Models
To determine the optimal berthing strategy, this study formulates the MDC-BAP as a Markov Decision Process (MDP). Since decisions are based only on partial observations of the vessels awaiting berthing and the status of the quays, the problem can be viewed as a Partially Observable Markov Decision Process (POMDP). In this construction, the three core elements are the state space, action space, and reward function. When making berth assignments, the agent receives the current state s_t and produces the corresponding berthing decision a_t. Subsequently, the environment provides feedback in the form of a reward r_t. By continuously iterating this process and accumulating a large amount of experiential data, a data-driven approach is used to update the model, leading to gradual optimization of the strategy.

Constructions of State
In deep reinforcement learning applications, constructing the state space is crucial: an accurate and efficient state space can enhance model performance. When formulating state features, the two key factors are the representativeness of the features and their correlation. To ensure the effectiveness of decision making, using the latest sequence of N consecutive observations helps the neural network learn strategies more accurately. Although increasing the number of observations in the sequence provides more comprehensive environmental information, excessive information may slow computation and thereby hurt the timeliness of decision making. For the MDC-BAP, we choose N = 10, i.e., the last 10 consecutive observations form the observation sequence; the observed information at time t can be divided into three main parts, as shown in Table 1, and the state can be represented as S_t = (B_q, E, V_s).
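The stacking of the last N = 10 observations into S_t = (B_q, E, V_s) might be sketched as follows; array shapes and argument names are assumptions for illustration, not the paper's exact encoding:

```python
import numpy as np

N_OBS = 10  # length of the observation window

def build_state(berth_tables, vessel_dyn, vessel_static):
    """Flatten the last N_OBS berth-status snapshots of every quay (B_q),
    the vessels' dynamic variables (E), and their static attributes (V_s)
    into one state vector S_t = (B_q, E, V_s)."""
    b = np.concatenate([np.asarray(t[-N_OBS:]).ravel() for t in berth_tables])
    e = np.asarray(vessel_dyn[-N_OBS:]).ravel()
    v = np.asarray(vessel_static).ravel()
    return np.concatenate([b, e, v])
```

With one quay tracked at 4 berth points, 3 dynamic variables per step, and 2 static attributes, the resulting vector has 10·4 + 10·3 + 2 = 72 entries.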

State Information | Container | Explanation
Berth status of the quay B_q | B_q | Berth status table of quay q at the current and past nine moments, q ∈ Q.
Vessel's dynamic variables E | P0 | Berth status of all vessels at the current and past nine moments.
Vessel's static variables V_s | V_s | Static information of the vessels, providing basic constraints for allocation.

These three parts capture the core information of the state: the berth status of the quay B_q presents the usage status of the quay, providing a foundation for berth allocation decisions; the vessel's dynamic variables E, as a key dynamic factor in the decision-making process, occupy a central position in the berth allocation strategy; and the vessel's static information V_s provides basic constraints for berth allocation decisions, ensuring the feasibility and rationality of the allocation.

Constructions of Action
The design of the action space in reinforcement learning is crucial: it directly affects the efficiency and effectiveness of the learning algorithm. An appropriate action space not only accelerates learning but also ensures that the final strategy is more reasonable and practical. For complex decision-making problems, especially those involving multiple independent but interrelated choices, combined actions are a design strategy worth exploring. For example, let "Time t, Quay 1, Berth Point 1" be represented as a_t = 1. With this notation, each action corresponds to a specific quay and berth point combination, so the strategy can consider both the quay and the berth point when making decisions. The specific definitions are: (1) Berth Decision 1: choose berth point i at quay q; (2) Berth Decision 2: wait. The action space in berth allocation problems encompasses all potential actions a vessel might take, such as choosing a specific available berth for docking or deciding not to act for the moment. When constructing the action space, environmental constraints must be considered in depth, including the berth capacity of each quay, water depth, and other factors, along with vessel attributes such as draft and length. An inappropriate action space design might hinder the agent from learning effective strategies; therefore, the action space should be designed meticulously based on the actual characteristics of the problem and kept compatible with the state space. If the chosen berth does not meet certain constraints, the vessel must adopt the waiting behavior.
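The combined-action encoding described above (one index per quay/berth-point pair, plus a wait action) can be sketched as follows; the flat indexing scheme and function name are illustrative assumptions:

```python
def decode_action(a, berth_points_per_quay):
    """Map a flat action index to (quay, berth point); the last index means 'wait'.
    berth_points_per_quay lists N_q for quays 1..|Q|, e.g. [3, 2]."""
    total = sum(berth_points_per_quay)
    if a == total:
        return None  # wait action
    for q, n in enumerate(berth_points_per_quay, start=1):
        if a < n:
            return (q, a + 1)  # 1-based berth point within quay q
        a -= n
    raise ValueError("action index out of range")
```

With two quays of 3 and 2 berth points, indices 0-4 name the five (quay, berth point) pairs and index 5 is the wait action, giving a discrete action space of size ΣN_q + 1.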

Constructions of Reward
The reward function plays a crucial role in evaluating policy performance and guiding policy optimization in reinforcement learning. This study aims to minimize the time vessels spend in port through this function. Although the design of the reward function can integrate various evaluation criteria, such as economic losses due to waiting times, dock usage efficiency, and transportation costs, the objective function is mainly influenced by waiting times and vessel transfer times, while other time losses can be treated as constants. In this study, we selected the negative sum of the waiting time and the transfer time caused by berthing as the immediate reward for each berth allocation. To keep each single-step reward non-negative, a constant G was added to it. The specific definition is as follows:

R_t = G − (waiting time + transfer time).
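A direct sketch of this reward; the function name, argument names, and the particular value of G are ours, chosen only for illustration:

```python
def immediate_reward(wait_time, transfer_time, G=100.0):
    """Reward for one berth allocation: negative time losses shifted by G.

    wait_time:     hours the vessel waited before berthing
    transfer_time: extra hours caused by berthing away from the desired quay
    G:             constant chosen large enough to keep the reward non-negative
    """
    return G - (wait_time + transfer_time)
```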

Dueling Double DQN
The Dueling Double DQN (D3QN) algorithm integrates both the Double DQN and Dueling DQN methods. Since both Double DQN and Dueling DQN address intrinsic limitations of DQN, it is worth briefly reviewing the basic principles of the DQN algorithm first, to gain a deeper understanding of D3QN's characteristics and contributions.
The DQN algorithm adopts an off-policy learning mechanism and approximates the Q-value function of Q-learning through neural networks [50][51][52]. This strategy falls under the category of value-based reinforcement learning and has shown significant advantages in dealing with high-dimensional computation and decision-making problems. Notably, DQN comprises an evaluation Q-network and a target Q-network. The target Q-value of the target Q-network can be calculated as

y = r + γ (1 − done) max_{a'} Q(s', a'; θ'),

and the loss function is defined as the squared error between y and Q(s, a; θ), where γ denotes the discount factor, θ indicates the parameters of the evaluation network, θ' those of the target network, and done indicates whether the next state is a terminal state (set to 1 if it is, and 0 otherwise). While DQN excels in various aspects, it still faces the problem of overestimation during the Q-value learning process. To address this, Double DQN (DDQN) was introduced. Its core concept utilizes two Q-networks: one dedicated to the selection of the next action, while the other is responsible for estimating the Q-value of that action. This separation significantly reduces the overestimation of Q-values. The DDQN target Q-value can be calculated as

y = r + γ (1 − done) Q(s', argmax_{a'} Q(s', a'; θ); θ').

To estimate Q-values more accurately in deep network structures, Dueling DQN was proposed. Its design philosophy rests on separating state values from action advantage values, so that the state-action function can be represented as

Q(s, a; θ) = V(s; θ) + [A(s, a; θ) − (1/|A|) ∑_{a'} A(s, a'; θ)].

The state value function V estimates the expected return in a given state, while the action advantage function A measures the relative advantage of taking a certain action in a specific state compared to the average action. The term (1/|A|) ∑_{a'} A(s, a'; θ) is the mean of the action advantage function A, which provides a benchmark for assessing the relative utility of each action. The Dueling DQN algorithm uses the same method as DQN to calculate the target Q-value.
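The three target computations described above can be sketched side by side in NumPy; the batch-shaped arguments and function names are ours, but each line follows the corresponding definition directly:

```python
import numpy as np

def dqn_target(r, q_next_target, gamma, done):
    # y = r + gamma * (1 - done) * max_a' Q_target(s', a')
    return r + gamma * (1 - done) * q_next_target.max(axis=1)

def ddqn_target(r, q_next_online, q_next_target, gamma, done):
    # the next action is chosen by the online network but
    # evaluated by the target network, reducing overestimation
    a_star = q_next_online.argmax(axis=1)
    return r + gamma * (1 - done) * q_next_target[np.arange(len(a_star)), a_star]

def dueling_q(v, adv):
    # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
    return v[:, None] + adv - adv.mean(axis=1, keepdims=True)
```

Subtracting the mean advantage makes the V/A decomposition identifiable: adding a constant to A and subtracting it from V would otherwise leave Q unchanged.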
The D3QN algorithm integrates the previously mentioned techniques, drawing on their strengths to enhance algorithmic performance. Combining the robustness of DQN, the overestimation correction strategy of DDQN, and the estimation precision of Dueling DQN, D3QN constructs an efficient and stable solution framework for complex reinforcement learning tasks. The basic framework of D3QN is shown in Figure 2, and the target Q-value of the D3QN algorithm can be expressed as

y = r + γ (1 − done) Q(s', argmax_{a'} Q(s', a'; θ); θ'),

where each Q-value is computed with the dueling decomposition of V and A.


Network Infrastructure
Data from the current observation time and its preceding nine times were chosen in this research, constituting an observation sequence as the model's input. The observation sequence at each moment can be divided into three main parts: (1) the vessel's static information variables V_s; (2) the berth status of the quay B_q; and (3) the vessel's dynamic variables E. Considering that both B_q and E are time-series data, an LSTM (Long Short-Term Memory) network was selected for feature extraction, while V_s is transformed through a fully connected layer. After individual processing, the three sets of features are integrated and fed into a Transformer network for further feature integration. The output part of the network is a fully connected layer that decides on the appropriate action. The specific parameters and settings of the model are detailed in Table 2, while the structural details of the model are illustrated in Figure 3.
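A compact PyTorch sketch of this pipeline: an LSTM over the time-series parts, a fully connected layer for V_s, a Transformer encoder for fusion, and a fully connected output head. All layer sizes here (feature widths, number of heads, action count) are illustrative assumptions, not the settings of Table 2, and the dueling V/A output split is omitted for brevity.

```python
import torch
import torch.nn as nn

class BerthQNet(nn.Module):
    def __init__(self, static_dim=14, seq_dim=32, lstm_hidden=64,
                 d_model=64, n_actions=561):
        super().__init__()
        # LSTM extracts features from the 10-step time series (B_q and E)
        self.lstm = nn.LSTM(seq_dim, lstm_hidden, batch_first=True)
        # fully connected layer transforms the static variables V_s
        self.static_fc = nn.Linear(static_dim, d_model)
        self.proj = nn.Linear(lstm_hidden, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, static_vs, seq):            # seq: (B, 10, seq_dim)
        h, _ = self.lstm(seq)                     # (B, 10, lstm_hidden)
        tokens = torch.cat([self.static_fc(static_vs).unsqueeze(1),
                            self.proj(h)], dim=1)  # (B, 11, d_model)
        z = self.encoder(tokens).mean(dim=1)      # Transformer fusion + pooling
        return self.head(z)                       # one Q-value per action
```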


D3QN Algorithm for MDC-BAP
In studying the multi-dock berth allocation problem (MDC-BAP), this research incorporates more stringent and realistic constraints than previous works. Unlike earlier studies, which often overlooked the requirement for vessels to depart within a specified time or only considered scenarios with a limited number of vessels, this research simulates actual port operations in which many vessels must depart within an agreed timeframe. In real-world port operations, the large number of vessels and the strict departure-time requirements add considerable complexity to the problem. Under such conditions, traditional heuristic algorithms often struggle to generate effective solutions.
To address this issue, the D3QN algorithm is designed to solve MDC-BAP. The algorithm targets the complexity of high-dimensional vessel problems and the dynamism of vessel arrivals, which are often difficult to manage with traditional methods. The core strength of the D3QN algorithm lies in its ability to process and analyze large-scale datasets through deep learning, thereby adapting to changing conditions and emerging patterns. In the context of high vessel counts and strong constraints, the D3QN algorithm not only improves the efficiency of berth allocation but also optimizes berth utilization and minimizes waiting times by learning from historical data. Additionally, the algorithm accounts for the port's capacity to respond rapidly to dynamic vessel arrival patterns. Through continuous learning and strategy adjustment, the D3QN algorithm is well equipped to handle the uncertainties and complexities of port operations, providing a flexible and efficient solution for vessel berth allocation.
The core steps of the multi-dock berth allocation method based on the D3QN algorithm are as follows (Algorithm 1).

Algorithm 1: D3QN-based multi-dock berth allocation
Input: training iteration times K, training sample length T, vessel number V_n
Initialize: current Q-network parameters θ; target Q-network parameters θ' ← θ; observation sequence S_t; decay factor γ; exploration rate ε; target Q-network update frequency c; replay buffer R_0 with capacity N; mini-batch size M_batch; training steps total_steps = 0; arriving-vessel list V_arrive; berthing-vessel list V_berth
Output: final vessel berth allocation table
1. For k ← 1 to K do
2.   Reset the environment and get the observation sequence S_0;
3.   For t ← 1 to T do
4.     For i ← 1 to length(V_berth) do
5.       If vessel i finishes berthing then
6.         Update S_t; vessel i leaves its berth point;
7.       End if
8.     End for
9.     Sort V_arrive by priority W_i;
10.    For i ← 1 to length(V_n) do
11.      Input S_t into the current Q-network and calculate the Q-value of each action in A_t;
12.      Use ε-greedy to select vessel i's berth action a_t from A_t;
13.      Update the next observation sequence S_{t+1}; get the reward R_t and the completion flag is_terminal;
14.      Store <S_t, a_t, S_{t+1}, R_t, is_terminal> in the replay buffer R_0; total_steps ← total_steps + 1;
15.      Take a mini-batch of size M_batch from the replay buffer R_0 for learning;
16.      Use Formula (34) to calculate the squared-error loss and perform backward gradient propagation to update parameter θ;
17.      If total_steps mod c = 0 then update the target network, θ' ← θ;
18.      End if
19.    End for
20.  End for
21. End for

The algorithm parameter details are as follows (Table 3).
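Two components of the procedure above, the ε-greedy selection and the replay buffer, can be sketched directly; the class and function names, capacity, and batch size are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity pool R_0 of <S_t, a_t, S_t+1, R_t, is_terminal> tuples."""
    def __init__(self, capacity=10000):
        # deque with maxlen discards the oldest transition when full
        self.buf = deque(maxlen=capacity)

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        # uniformly sample a mini-batch for learning
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```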

Simulation Results and Discussion
In this section, we provide a detailed description of the test dataset constructed for MDC-BAP and a comparative experimental analysis between the D3QN algorithm and other mainstream algorithms applicable to this problem. All experiments were coded and tested in the PyCharm integrated environment. The running environment consisted of an AMD Ryzen 5 3600 6-core processor @ 3.6 GHz (AMD, Santa Clara, CA, USA), 128 GB RAM, and a 64-bit Windows 10 operating system. The programming language was Python, specifically Python 3.8.

Introduction to Port Production Services
The research context of this paper is the joint operation of Quay A and Quay B, managed and operated by a certain port group limited company on the southeast coast (hereinafter referred to as the port group). Test instances are generated based on the actual arrival of vessels to analyze the performance of the models and algorithms designed in this paper. The port group centrally manages and coordinates both Quay A and Quay B. Quay B has a shoreline length of 2800 m with a maximum natural water depth of 17 m, while Quay A has a shoreline length of 1500 m with a water depth of 14 m. Containers can be transshipped between these two terminals. Based on berth water depth, the terminals are classified into two types: those with a depth of up to 14 m are Type 1 terminals, and those deeper than 14 m are Type 2 terminals. Hence, Quay A belongs to Type 1, while Quay B belongs to Type 2. Table 4 displays the relevant data for the terminals, where the container handling efficiency of a single gantry crane is uniformly set at 72 standard containers (Twenty-foot Equivalent Units, TEU) per hour.

Introduction to Container Vessel Information
Quay A and Quay B mainly serve medium to large container mainline and feeder vessels. Based on reference [53] on the core attributes of container vessels, 12 types of container-vessel information were randomly generated for the calling ports. The information for these vessels includes the vessel's length, draft, cargo load, container loading and unloading ratio, vessel contribution at the port, operation time, anchorage-to-berth time (berthing time), operation preparation time, clearance operation time, berth-to-channel time (departure time), the maximum time a vessel can stay in port, minimum quay crane allocation, maximum quay crane allocation, and priority.
These container vessels arrive at the port according to the container liner schedule set by the shipping company, and the estimated arrival time matches the actual arrival time of the vessel. A vessel's berthing operation time is directly related to its import and export container volume and the number of quay cranes currently allocated to it. According to the principle of quay crane operations, unloading is performed before loading. Each vessel has a desired berthing terminal; the containers waiting for loading are stored in the container yard of the desired terminal, while containers needing transshipment are quickly transported to the actual berthing terminal after the vessel's berthing plan is determined. Given the terminals' depth requirements, the first six vessel types can berth at both Terminal A and Terminal B and are transferable container vessels with a maximum port stay of 45~48 h. The latter six types can only berth at Terminal B, with a maximum stay of 48~50 h. The latest departure time of a vessel is its expected arrival time plus its maximum port stay time. The maximum safety distance required along a container terminal shoreline is 5% of the vessel's length; thus, the berthing distance between two vessels at a quay should be no less than the larger of the safety distances required by the two vessels (for calculation convenience, the vessel lengths in this study already include the safety distance).
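The two timing and spacing rules described above can be written directly; times are in hours, lengths in metres, and the function names and example values are ours:

```python
def latest_departure(eta, max_stay):
    # latest departure time = expected arrival time + maximum port stay
    return eta + max_stay

def min_berthing_gap(length_a, length_b, ratio=0.05):
    # each vessel requires a safety distance of 5% of its length;
    # the gap between two neighbours must cover the larger requirement
    return max(ratio * length_a, ratio * length_b)
```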
Based on the actual conditions of the dock and the definitions of the berthing vessels, test cases were generated according to the following rules: the dock's front shoreline was divided into units of 10 m; a planning cycle of one week (168 h in total) was selected, with a time scale of 15 min; and within the planning period, four instances of different scales (LE1~LE4) were produced, each leading to five test cases, for a total of 20 test cases. The detailed description of the cases is shown in Table 5.

Experimental Verification and Result Analysis
To validate the efficacy of the D3QN algorithm for MDC-BAP, experiments were conducted on all instances presented in Table 5. Table 6 and Figures 4-7 compare CPLEX, Proximal Policy Optimization (PPO), DQN, Dueling DQN, and D3QN on MDC-BAP across the different instances. In these experiments, the time cost of vessels in port was used as the evaluation metric. For CPLEX, the maximum solution time was set to 54,000 s (equivalent to 15 h), with a minimum gap value of 5%. Through a meticulous analysis of Table 6, we systematically evaluated the performance of the D3QN algorithm in multi-dock berth allocation tasks against the other mainstream methods. As a commonly used deterministic optimization method, CPLEX is often regarded as the gold standard for such problems, capable of outputting optimal or near-optimal solutions. Hence, this paper used the CPLEX solution as the benchmark for evaluation.
When the number of vessels was 85, the D3QN algorithm improved on CPLEX by approximately 0.45%. When the vessel count reached 90, the improvement of the D3QN algorithm over CPLEX rose to 2.98%, suggesting that D3QN has a significant performance advantage on medium-sized problems. In the scenarios with 95 and 100 vessels, the optimization benefits of D3QN over CPLEX became even more pronounced, reaching 5.63% and 5.89%, respectively, further validating the exceptional performance of D3QN on large-scale problems. From a micro-perspective, D3QN not only surpasses the other deep learning methods in most scenarios; notably, as the number of vessels increased, its improvement over the DQN and Dueling DQN algorithms became even more apparent. This further indicates that the D3QN algorithm compensates for the shortcomings of DQN and Dueling DQN, laying a solid theoretical foundation for its practical application in multi-container dock berth allocation tasks.
As shown in Figure 12, when evaluating the average computational time cost, the computation time of the reinforcement learning algorithms was significantly less than that of CPLEX. The difference in computation time among D3QN, DQN, and Dueling DQN was not significant. The PPO algorithm stood out due to its online learning update strategy: it updates the model only after all the vessel berths are completed. However, this online learning strategy showed certain limitations in the multi-dock berth allocation problem. Specifically, it might lead to vessels exceeding the maximum allowed time in port, which in turn may result in an inability to find feasible solutions that satisfy all constraints in certain scenarios.
To delve deeper into the convergence properties of the D3QN algorithm on a specific model, this study conducted deep learning experiments on a representative set of instances from the LE4 case. The corresponding convergence curves are shown in Figure 13. It can be observed that, as the agent continues to learn and iterate, the berth allocation strategy approaches a stable state. While slight fluctuations can be observed in the later stages of learning, most of these are due to the randomness in task selection. This further validates the algorithm's adaptability and robustness in addressing complex problems.

Generalization Experience
The core objective of reinforcement learning is to train an agent capable of making efficient decisions in diverse environments. Complementary to this, rigorously evaluating the model's generalization ability is a crucial step in assessing its true value. This is distinctly different from traditional supervised learning: generalization in reinforcement learning implies that the agent should not only excel in the environments it has been trained in but also maintain its decision-making efficiency in unseen, perhaps slightly altered or more complex environments. To validate this core capability, this study designed 10 different test cases for each vessel count, aiming to examine the generalization performance of the D3QN algorithm in varying environments. Detailed experimental results and data can be found in Table 7. When applying the trained network model to the test instances, we recorded the performance metrics in detail and made an in-depth comparison with the CPLEX method to assess the generalization ability of the algorithm. Table 7 reveals that when the number of ships is relatively limited (e.g., 85 ships), the performance of the algorithm is not very satisfactory, with only two sets of solutions surpassing the output of CPLEX and one set failing to obtain a valid solution. However, as the number of ships increased, the reinforcement learning algorithm demonstrated excellent berth allocation capabilities across multiple test instances. Notably, since the D3QN algorithm does not require retraining of the model, its computational cost is significantly lower than that of CPLEX. In contrast, CPLEX requires a complete recomputation when faced with new cases.

Conclusions
This study conducted a comprehensive and systematic investigation into the Multi-Dock Berth Allocation Problem (MDC-BAP), with the objective of optimizing berth allocation strategies by minimizing the dwell time of vessels at the port. Given the limitations of traditional heuristic algorithms in handling high-dimensional vessel issues and under stringent constraints, this research adopted a deep reinforcement learning approach, the Dueling Double DQN (D3QN) algorithm. The adoption of this algorithm primarily aimed to address the high-dimensional challenges posed by the dynamic arrival of vessels, particularly in complex scenarios where traditional methods struggle. Through comparative experiments with the commercial optimization tool CPLEX and other reinforcement learning algorithms (such as DQN and Dueling DQN), the D3QN algorithm demonstrated significant advantages in handling MDC-BAP. Compared to CPLEX, the D3QN algorithm achieved notable success in reducing the dwell time of vessels in port and also displayed superiority in terms of computational time costs. This finding holds significant practical implications for enhancing port operational efficiency and alleviating port congestion. Compared to the DQN and Dueling DQN algorithms, the D3QN algorithm exhibited superior performance, especially in addressing issues of overestimated Q-values and insufficient precision in Q-value estimation. Through its unique dual learning structure and optimized strategies, the D3QN algorithm effectively avoided overestimation of Q-values, thereby enhancing the accuracy and reliability of decision making, which also confirms the effectiveness of the D3QN algorithm in solving the high-dimensional challenges associated with MDC-BAP.
Regarding future research directions, these include further optimization of the reward function, exploration of more strategic vessel scheduling schemes, and implementation of joint optimization of multiple dock berth allocations and scheduling.These measures will provide more comprehensive and integrated solutions for MDC-BAP, further enhancing the efficiency and effectiveness of port operations.
The berth allocation scheduling charts generated by the D3QN algorithm for the four different vessel-quantity instances intuitively present the final berth states of the MDC-BAP.

Figure 12 .
Figure 12.The average computational time cost of all algorithms.


Table 1. Status information table.

Algorithm 1 input and initialization. Input: training iteration times K, training sample length T, vessel number V_n. Initialize the current Q-network parameters θ and the target Q-network parameters θ'; copy the Q-network parameters to the target network, θ' ← θ; initialize the observation sequence S_t, decay factor γ, exploration rate ε, target Q-network update frequency c, replay buffer R_0 with capacity N, mini-batch size M_batch, and training steps total_steps = 0; initialize the arriving-vessel list V_arrive and the berthing-vessel list V_berth. Output: final vessel berth allocation table.

Table 4 .
Basic data of joint operational terminals.