Rolling Cargo Management Using a Deep Reinforcement Learning Approach

Loading and unloading rolling cargo in roll-on/roll-off ships are important and very recurrent operations in maritime logistics. In this paper, we apply state-of-the-art deep reinforcement learning algorithms to automate these operations in a complex and realistic environment. The objective is to teach an autonomous tug master to manage rolling cargo and perform loading and unloading operations while avoiding collisions with static and dynamic obstacles along the way. The artificial intelligence agent, representing the tug master, is trained and evaluated in a challenging environment based on the Unity3D learning framework, called ML-Agents, using proximal policy optimization. The agent is equipped with sensors for obstacle detection and is provided with real-time feedback from the environment through its own reward function, allowing it to dynamically adapt its policies and navigation strategy. The performance evaluation shows that, by choosing appropriate hyperparameters, the agents can successfully learn all required operations, including lane-following, obstacle avoidance, and rolling cargo placement. This study also demonstrates the potential of intelligent autonomous systems to improve the performance and service quality of maritime transport.


Introduction
Intelligent transportation systems implement advanced technologies, such as sensing, awareness, data transmission, and intelligent control, in mobility and logistics systems. For mobility, the objective is to increase transport performance and achieve traffic efficiency by minimizing traffic problems such as congestion, accidents, and resource consumption. The same goals are targeted for logistics, but the focus is on the optimization of resource consumption, cost, and time, as well as on the efficiency, safety, and reliability of logistics operations.
Smart ports have emerged as a new form of port that utilizes the Internet of Things, artificial intelligence, blockchain, and big data in order to increase the level of automation and supply chain visibility and efficiency. Digital transformation of ports, terminals, and vessels has the potential to redefine cross-border trade and increase shipping operation efficiency [1]. Orive et al. [2] gave a thorough analysis of the conversion to digital, intelligent, and green ports and proposed a new methodology, called the business observation tool, that allows successfully undertaking the automation of terminals considering the specific constraints of the port. Furthermore, various logistics operations have been studied using new techniques such as cargo management, traffic control, detection, recognition and tracking of traffic-related objects, and optimization of different resources such as energy, time, and materials [3,4]. The state-of-the-art of smart ports and intelligent maritime transportation has been described highlighting trends, challenges, new technologies, and their implementation status [5][6][7].
At the same time, the past few years have witnessed breakthroughs in reinforcement learning (RL). Some important successes were recorded, namely learning a policy from pixel input in 2013 [8], beating the world champion with AlphaGo Master [9], and the OpenAI Dexterity program in 2019 [10]. Reinforcement learning refers to goal-oriented algorithms that learn how to attain a complex objective (goal) or how to maximize cumulative reward over time. The agent interacts with its environment in discrete time steps. At each time step, the agent receives an observation, which typically includes the reward. Based on this feedback, the agent chooses an action from the set of available actions, which is subsequently sent to the environment. The environment state then changes, and the goal of the agent is to maximize its collected reward.
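This interaction loop can be sketched in a few lines of code; the toy environment and fixed policy below are illustrative placeholders, not the tug master system studied in this paper:

```python
# Schematic agent-environment interaction loop. ToyEnv and the policy are
# illustrative placeholders, not the tug-master system studied in this paper.
class ToyEnv:
    """A trivial chain environment: move right from state 0 to reach state 5."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):              # action: 0 = left, 1 = right
        self.state = max(0, self.state + (1 if action == 1 else -1))
        done = self.state == 5
        reward = 1.0 if done else -0.1   # small step penalty, goal bonus
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    state, total = env.reset(), 0.0
    for _ in range(max_steps):
        action = policy(state)           # agent picks an action from the observation
        state, reward, done = env.step(action)
        total += reward                  # accumulate the collected reward
        if done:
            break
    return total

ret = run_episode(ToyEnv(), lambda s: 1)  # a fixed "always move right" policy
```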
The objective of this paper is to assess the feasibility and efficiency of deep RL techniques (DRL) in teaching a multiagent system to accomplish complex missions in a dynamic constrained environment. Handling of rolling cargo in ports is an appropriate field to achieve this purpose, in which the agents can be autonomous tug masters and the environment reflects the real world. Our particular application is the automation of the loading and unloading of rolling cargo from ships. These two operations are essential and recurrent in maritime shipping and aim to move rolling cargo from the quay or defined points in the terminal to onboard ships. This kind of ship is called a roll-on/roll-off (RORO) ship. The optimization of these services will be of great benefit in terms of time, energy, space, service quality, etc. This paper focuses on the loading operation since it is the most challenging and has more constraints to meet.
This paper tries to provide approaches to practical challenges and answer the following research questions:

1. How is the performance of the agents controlled by RL? We evaluate this by looking in particular at convergence and its speed, but also at other metrics such as cumulative reward, episode length, and entropy.
2. To what degree are the learned policies optimal? This is analyzed by identifying whether the tug masters have learned any emergent strategies that were not explicitly intended by the developer.
3. What is the impact of tuning the learning hyperparameters on the quality of the achieved solution?
Our contribution consists of a DRL solution that simulates the handling of loading and unloading cargo using autonomous tug masters for RORO ships in smart ports. We evaluate the solution both regarding performance and policy optimality and can show that our system is able to solve the tasks included satisfactorily. We further assess the impact of different learning model hyperparameters on the quality of the solutions achieved. Our solution is also trained using incremental learning, adding additional challenges as the system learns to solve previously included tasks.
The remainder of this paper is organized as follows: The second section presents the related works and background theories of DRL. The third section provides the specifications of the autonomous system ensuring loading/unloading operations and describes the experimental setup. The obtained results are presented and analyzed in Section 4. Finally, conclusions are drawn in Section 5.

Related Work
The use of machine learning techniques for cargo management has received considerable attention from researchers due to their achieved performance. Shen et al. [11] used a deep Q-learning network to solve the stowage planning problem and showed the possibility of applying reinforcement learning to stowage planning. To perform automated stowage planning for container ships, Lee et al. [12] provided a model to extract features and predict lashing forces using deep learning, without the explicit calculation of lashing forces. The container relocation problem has been solved using a random forest based branch pruner [13] and deep learning heuristic tree search [14]. This direction of research deals only with the management of cargo, particularly automated stowage planning and container relocation, taking into consideration some logistics related constraints. Almost none of these studies considered the movement and operations required by cranes or autonomous vehicles in ports.
Another complementary track of research is the automation of cargo management using control and communication technologies. Price et al. [15] proposed a method to dynamically update the 3D crane workspace for collision avoidance during blind lift operations. The position and orientation of the crane load and its surrounding obstacles and workers were continuously tracked by 3D laser scanning. Then, the load swing and load rotation were corrected using a vision based detection algorithm to enhance the crane's situational awareness and collision avoidance. Atak et al. [16] modeled quay crane handling time in container terminals using regression models and based on a big dataset of port operations and crane movements in Turkish ports.
Recently, attention has also been directed toward rolling cargo. M'hand et al. [17] proposed a real-time tracking and monitoring architecture for logistics and transport in RORO terminals. The system aims to manage dynamic logistics processes and optimize cost functions such as traffic flow and check-in time. The specific mission of the system is the identification, monitoring, and tracking of rolling cargo. Reference [18] discussed the potential evolution in the RORO industry that can be achieved by process mining combined with complex event processing. It gave an overview of process mining and complex event processing and also introduced the RORO definition, subsystems, and associated logistics processes. To take advantage of autonomy capabilities, You et al. [19] proposed an automated freight transport system based on autonomous container trucks. The designed system operates in a dual mode, mixing autonomous driving, used in the terminal and connecting tunnels, with manned driving, adopted within the operation area. One of the problems related to rolling cargo handling is the stowage problem, in which the aim is to load as much cargo as possible, maximizing the space utilization ratio on the decks, while respecting time constraints and reducing the cost of shifting cargo [20][21][22].
Machine learning based techniques are increasingly used to implement and automate the management of smart port services. The goal is to provide autonomy and increase the performance of the logistics operations. L. Barua et al. [7] reviewed the machine learning applications in international freight transportation management. In [23], the authors used RL for selecting and sequencing containers to load onto ships. The goal of this assignment task is to minimize the number of crane movements required to load a given number of containers with a specific order. Fotuhi et al. [24] proposed an agent based approach to solve the yard crane scheduling problem, which means determining the sequence of drayage trailers that minimizes their waiting time. The yard crane operators were modeled as agents using the Q-learning technique for automatic trailer selection process. In the same context, Reference [3] used the Q-learning algorithm to determine the movement sequence of the containers so that they were loaded onto a ship in the desired order. Their ultimate goal was to reduce the run time for shipping. As an extension of their work, they tried to optimize three processes simultaneously: the rearrangement order of containers, the layout of containers, assuring explicit transfer of the container to the desired position, and the removal plan for preparing the rearrangement operation [25].
In the unloading process, it is essential to provide customers with precise information on when rolling cargo is available to be picked up at the terminal. To fulfil this need, Reference [26] developed and tested a module based framework using statistical analysis for estimating the discharge time per cargo unit. The result achieved was the estimation of the earliest pick-up time of each individual truck or trailer within 1 h accuracy for up to 70% of all cargo. In [27], the authors proposed a simulation model of RORO terminal operations, in particular the entry and exit of vehicles, built with the Arena software. The model also served to determine the scale of the land parking lots to be assigned to RORO.
To the best of our knowledge, there is no work dealing with the autonomous management of these types of loading and unloading of rolling cargo using reinforcement learning. In addition, in our study, we consider both the stowage planning problem and the automation of tug master operations.

Deep Reinforcement Learning
Reinforcement learning (RL) is a branch of machine learning encompassing a set of techniques that try to learn the best agent strategy, called the policy, maximizing the reward function of the agent. The policy, denoted by $\pi$, is the way in which the agent chooses a specific action $a$ in state $s$ (called the observation). Formally, the policy is a mapping $\pi : S \to A$ from the state space to the action space, and it is generally stochastic. The environment is stochastic as well and can be modeled as a Markov decision process with state space $S$, action space $A$, transition dynamics $p(s_{t+1} \mid s_t, a_t)$, reward function $r(s_t, a_t)$, and a distribution of initial states $p(s_0)$ [28]. Trajectories can be extracted from the process by repeatedly sampling the agent's action $a_t$ from its policy $\pi(s_t)$ and the next state $s_{t+1}$ from $p(s_{t+1} \mid s_t, a_t)$. The total reward is the sum of each step's reward, $R_t := \sum_{i=t}^{\infty} r(s_i, a_i)$. To maximize the expected sum of rewards, a state-value function $V^{\pi}(s_t) := \mathbb{E}_{s_{i \geq t}, a_{i \geq t} \sim \pi}[R_t \mid s_t]$ is defined as the expected return from time $t$ onwards with respect to $\pi$, starting from state $s_t$. It measures the quality of being in a specific state. Another important function is the action-value function $Q^{\pi}(s_t, a_t) := \mathbb{E}_{s_{i \geq t}, a_{i \geq t} \sim \pi}[R_t \mid s_t, a_t]$, which represents the expected return from time $t$ onwards if action $a_t$ is performed in state $s_t$. It measures the profitability of taking a specific action in a state. The techniques that use these two functions are called value based RL; their objective is to estimate the value functions and then infer the optimal policy. In contrast, policy gradient methods directly optimize the policy with respect to the expected return (long-term cumulative reward) by performing gradient ascent on the policy parameters.
Policy based methods are considered more efficient, especially for high-dimensional or continuous action spaces.
To optimize the policy using gradient based methods, the policy is assumed to be defined by some differentiable function $\pi_\theta$, parameterized by $\theta$. These methods gradually adjust the policy parameters to optimize the long-term return $J(\theta) := \mathbb{E}_{s_i, a_i \sim \pi_\theta}[R_0]$ via gradient ascent, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\nabla_\theta J(\theta)$ is the policy gradient, giving the step size and direction for updating the policy. It is approximated by $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n) \, A^{\pi}(s_t^n, a_t^n)$, where the term $\nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)$ measures how likely the trajectory is under the current policy. The term $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ is the advantage function, which represents the expected improvement obtained by an action compared to the default behavior. Since $Q^{\pi}(s, a)$ and $V^{\pi}(s)$ are unknown, $A^{\pi}(s, a)$ is also unknown; it is therefore replaced by an advantage estimator computed with methods such as generalized advantage estimation (GAE) [29].
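The advantage estimator can be sketched as follows; the GAE recursion below is a minimal numpy implementation, and the values chosen for the discount $\gamma$ and the GAE parameter $\lambda$ are illustrative, not the settings of our experiments:

```python
# Generalized advantage estimation (GAE) for a single trajectory -- a minimal
# numpy sketch of the advantage estimator mentioned above. The discount gamma
# and GAE parameter lam are illustrative, not the settings of our experiments.
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards[t] for t = 0..T-1; values[t] = V(s_t) for t = 0..T (bootstrap)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last    # discounted sum of future residuals
        adv[t] = last
    return adv

rewards = [0.0, 0.0, 1.0]                    # reward only at the final step
values = [0.1, 0.2, 0.5, 0.0]                # V(s_0)..V(s_3); s_3 is terminal
adv = gae_advantages(rewards, values)
```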
This vanilla policy gradient algorithm has some limitations, mainly the high variance of the gradient estimate and the large number of samples required to estimate its direction accurately, which results in slow convergence. In addition, it suffers from practical issues such as the difficulty of choosing a step size that works throughout the entire course of the optimization and the risk of converging to a sub-optimum of the ultimately desired behavior.
Recent policy gradient methods such as trust region policy optimization (TRPO) [30] and proximal policy optimization (PPO) [31] address these issues through several techniques. Importance sampling is used to avoid the huge number of samples and the collection of completely new trajectory data whenever the policy is updated: samples from the old policy are reused to calculate the policy gradient. To choose the step size efficiently, the notion of a trust region is employed. The maximum step to be explored is bounded by the radius of this trust region, and the objective is to locate the optimal point within that radius; the process is repeated iteratively until the peak is reached. To guarantee monotonic improvement, i.e., that any newly generated policy improves the expected reward and performs better than the old policy, a minorization-maximization algorithm is used to optimize a lower bound function that locally approximates the expected reward.
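PPO implements this trust-region idea by clipping the importance-sampling ratio; a minimal numpy sketch of the clipped surrogate objective follows (the example inputs are illustrative):

```python
# PPO's clipped surrogate objective -- a minimal numpy sketch of the
# "trust region via clipping" idea described above. epsilon = 0.2 matches
# the commonly recommended default; the example inputs are illustrative.
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    # Importance-sampling ratio between the new and old policies
    ratio = np.exp(log_probs_new - log_probs_old)
    # Clip the ratio so a single update cannot move the policy too far
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Pessimistic (lower) bound: take the minimum of the two surrogates
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

adv = np.array([1.0, -1.0])
lp_old = np.log(np.array([0.5, 0.5]))
lp_new = np.log(np.array([0.9, 0.1]))   # ratios 1.8 and 0.2, both outside the clip range
obj = ppo_clip_objective(lp_new, lp_old, adv)
```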
Deep RL is currently the cutting-edge learning technique in control systems [32]. It extends reinforcement learning by using a deep neural network (DNN) without explicitly designing the state space. It therefore combines RL and deep learning (DL) and often uses Q-learning [33] as its base. A neural network can be used to approximate a value function or a policy function; that is, neural networks learn to map states to values, or state-action pairs to Q-values. Rather than using a lookup table to store, index, and update all possible states and their values, which is impossible for very large problems, the neural network is trained on samples from the state or action space to learn the optimal values and policies dynamically. In deep Q-learning [34], a DNN is used to approximate the Q-value function: it receives the state as input and outputs the Q-value of every possible action. For policy based methods, the policy and advantage function are learned by the DNN.
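The two core ingredients of deep Q-learning, a network mapping states to per-action Q-values and a temporal-difference target that bootstraps from the maximum next-state Q-value, can be sketched as follows; the tiny random-weight "network" is purely illustrative:

```python
# Sketch of deep Q-learning's core components: a tiny numpy "network" maps a
# state vector to one Q-value per action, and the TD target bootstraps from the
# max next-state Q-value. The random weights are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))              # 4-dimensional state, 8 hidden units
W2 = rng.normal(size=(8, 3))              # 3 discrete actions

def q_values(state):
    h = np.maximum(0.0, state @ W1)       # hidden ReLU layer
    return h @ W2                         # one Q-value per action

def td_target(reward, next_state, gamma=0.99, done=False):
    # y = r if terminal, else y = r + gamma * max_a Q(s', a)
    return reward if done else reward + gamma * np.max(q_values(next_state))

s = np.ones(4)
q = q_values(s)                           # Q(s, a) for all three actions
y = td_target(1.0, s)                     # target for an illustrative transition
```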

Methodological Approach
We apply a state-of-the-art deep reinforcement learning algorithm, namely PPO, for loading and unloading rolling cargo from ships.
DRL algorithms require a large number of experiences to learn complex tasks, and it is therefore infeasible to train them in the real world. Moreover, the agents start by exploring the environment and taking random actions, which can be unsafe and damaging to the agent and surrounding objects. Thus, it is necessary to develop a simulation environment with characteristics similar to reality in order to train the agents, speed up their learning, and avoid material costs. Our study used a simulation based approach performed with Unity3D. The environment in which we expect the agent to operate is the harbor quay and the ship, filled with obstacles. The agent is the tug master itself, although it is more accurate to think of the agent as the guiding component that controls the tug master, since its function is limited to generating the control signals that steer the tug master's actuators. It is equipped with sensors for obstacle detection and is trained and evaluated in a challenging environment making use of the Unity3D learning framework (ML-Agents). The next subsections first describe the system and the interactions between its components, then the algorithm used.

System Specifications
The structure of the system is depicted in Figure 1. It consists of a dynamic environment and two agents. The environment is the space encompassing the RORO ship, terminal, lanes, etc. It is considered unknown and dynamic: its structure may change over time, new rolling cargo arrives at the terminal, and other objects, including staff, move around the scene. The agents are autonomous tug masters that should learn how to load cargo inside the vessel under the following constraints:
• Rolling cargo is distributed randomly over the terminal, and the arrival time is unknown to the agent.
• The goal space in which the agent should place the cargo is defined; however, the agent should determine the exact position according to other specific constraints such as size, weight, and customer priority. In this paper, we limit ourselves to the size of the trailer.
• Collision avoidance: The agents should avoid collisions with static obstacles, such as walls and cargo, and with dynamic obstacles, such as staff and other agents.
• Lane-following: Inside the vessel, the agents must learn to follow the lanes drawn on the ground.
Unity3D was used to create the environment and simulate the training process. It enables realistic visuals and accurate physics, as well as low task complexity. The visualization was used to help evaluate the credibility of the system and as an illustration when communicating the solution to employees at Stena Line. Task complexity could be reduced through the task handling mechanism inside Unity3D. We also made use of Unity ML-Agents [35] to train the agents, since it contains a low-level Python interface that allows interacting with and manipulating the learning environment. In fact, the ML-Agents Toolkit contains five high-level components (Figure 2); herein, we explain the four components that we used and their role in our application:
• Learning Environment: It contains the scene and all game characters.
The Unity scene provides the environment in which agents observe, act, and learn. In our experiments, the same scene was used for both training and testing the trained agents. It represents the internal decks of the RORO ship where trailers are placed and the quay, or wharf, which is a reinforced bank where trailers stand or line up. It also includes linkspans, trailers, tug masters, and various dynamic and static objects. The internal decks can be of various sizes and may be on different floors. From a pure learning perspective, this environment contains two sub-components: the Agent and the Behavior. The Agent is attached to the Unity game object of the tug master and represents the learner. The Behavior can be thought of as a function that receives observations and rewards from the Agent and returns actions. There are three types of behaviors: learning, heuristic, and inference. A learning behavior is dynamic and not yet defined, but about to be trained. It is used to explore the environment and learn the optimal policy, and it represents the intermediate step before having the trained model and using it for inference. That is to say, the inference behavior is carried out by the trained model, which is supposed to be optimal; this behavior is derived from the saved model file. A heuristic behavior is one that is explicitly coded as a set of rules; we used it to verify that our code is correct and that the agent is able to perform the intended actions.

Algorithm
The agent should perform a set of tasks, where each task consists of moving trucks from different initial positions to target states under certain constraints such as the distribution of static obstacles, dynamic obstacles, lane-following, and truck size. We consider a task as a tuple $T = (L_T, S_T(s), S_T(s_{t+1} \mid s_t, a_t), H)$. The goal of our policy based approach is to optimize a stochastic policy $\pi_\theta : S \times A \to \mathbb{R}^+$, parameterized by $\theta$. In other words, the objective is to find a strategy that allows the agents to accomplish the tasks from various initial states with limited experience. Given the state $s_t \in S_T(s)$ of the environment at time $t$ for task $T$, the learning model of our agent predicts a distribution over actions according to the policy, from which an action $a_t$ is sampled. The agent then interacts with the environment by performing the sampled action $a_t$ and receives an immediate reward $R_t$ according to the reward function. Afterwards, it perceives the next state $s_{t+1} \sim S_T(s_{t+1} \mid s_t, a_t)$. Iteratively, the learning model optimizes the loss function $L_T$ that maps a trajectory $\tau = (s_0, a_0, R_0, \ldots, s_H, a_H, R_H)$ followed by the policy from an initial state to a finite horizon $H$. The loss of a trajectory is simply the negative cumulative reward, $L_T(\tau) = -\sum_{t=0}^{H} R_t$. The policy learning is shown in Algorithm 1. In each episode, we collect $N$ trajectories under the policy $\pi_\theta$. Afterwards, the gradient of the loss function is computed with respect to the parameter $\theta$, which is updated accordingly.
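The structure of this loop can be sketched as follows; the one-dimensional toy objective and the finite-difference gradient are stand-ins for the real environment and the policy gradient, so this is a schematic of Algorithm 1 rather than our actual implementation:

```python
# Schematic of the policy-learning loop (Algorithm 1): collect N trajectories,
# score them by the negative cumulative reward L_T, and update theta. The 1-D
# quadratic "environment" and the finite-difference gradient are illustrative
# stand-ins for the real environment and the policy gradient.
def rollout(theta, env_step, horizon=10):
    """Trajectory loss L_T(tau) = -(sum of rewards over the horizon)."""
    return -sum(env_step(theta) for _ in range(horizon))

def train(env_step, theta=0.0, episodes=50, n_traj=4, lr=0.02, eps=1e-3):
    for _ in range(episodes):
        # Average loss over N trajectories under the current parameters
        loss = sum(rollout(theta, env_step) for _ in range(n_traj)) / n_traj
        loss_p = sum(rollout(theta + eps, env_step) for _ in range(n_traj)) / n_traj
        grad = (loss_p - loss) / eps      # crude finite-difference gradient
        theta -= lr * grad                # gradient descent on the loss
    return theta

# Toy deterministic reward that peaks at theta = 2
best = train(lambda th: -(th - 2.0) ** 2)
```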

DRL Implementation Workflow
The first task is to develop a realistic and complex environment with goal-oriented scenarios. The environment includes the ship, walls, walking humans, rolling cargo, lanes, and tug masters. The cargo is uniformly distributed in random positions inside an area representing the quay where cargo should be queued. The maximum linear velocity in the x and y axes is 5 m/s, and the maximum angular velocity in the z axis is 2 rad/s. The agents observe the changes in the environment and collect the data, then send them along with reward signals to the behavior component, which decides the proper actions. They then execute those actions within their environment and receive rewards. This cycle is repeated continuously until the convergence of the solution. The behavior component controlling the agents is responsible for learning the optimal policy. The learning process was divided into episodes with a length of T = 7000 time steps in which the agent must complete the task. A discrete time step, i.e., a state, is terminal if:
• The agents collide with a dynamic obstacle.
• The maximal number of steps in the episode is reached.
• The agents place all cargo at the destinations, which means that all tasks are completed.
Data collection: the set of observations that the agent perceives about the environment. Observations can be numeric or visual. Numeric observations measure the attributes of the environment using sensors, and the visual inputs are images generated from the cameras attached to the agent and represent what the agent is seeing at that point in time.
In our application, we use ray perception sensors, which measure the distance to surrounding objects. The inputs of the value/policy network are collected instantly by the agents and consist of their local state information. The agents navigate according to this local information, which represents the relative position of the agents to surrounding obstacles, using nine raycast sensors with 30° between them. The agents receive no global state data, such as absolute position or orientation in a reference coordinate frame. In addition, we feed other data, such as the attributes of the sensed cargo, to the input of the neural network.
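The local observation built from the nine ray sensors can be sketched as follows; the simple ray-marching loop and circular obstacles below are illustrative stand-ins for Unity's ray perception sensors, not the actual sensor API:

```python
# Sketch of the local observation built from nine ray sensors spaced 30 degrees
# apart. The coarse ray-marching and circular obstacles are illustrative
# stand-ins for Unity's ray perception sensors, not the actual sensor API.
import math

def ray_observations(agent_xy, heading_deg, obstacles, n_rays=9, spacing=30.0,
                     max_range=20.0, step=0.1):
    """Each ray reports a normalized distance in [0, 1] to the nearest obstacle."""
    obs = []
    start = -spacing * (n_rays - 1) / 2.0        # rays from -120 to +120 degrees
    for i in range(n_rays):
        ang = math.radians(heading_deg + start + spacing * i)
        dx, dy = math.cos(ang), math.sin(ang)
        dist, d = max_range, 0.0
        while d < max_range:                      # march along the ray
            x, y = agent_xy[0] + dx * d, agent_xy[1] + dy * d
            if any(math.hypot(x - ox, y - oy) < r for ox, oy, r in obstacles):
                dist = d                          # first hit along this ray
                break
            d += step
        obs.append(dist / max_range)              # normalize to [0, 1]
    return obs

# One circular obstacle of radius 1 straight ahead at (5, 0)
obs = ray_observations((0.0, 0.0), 0.0, [(5.0, 0.0, 1.0)])
```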
The agents share the same behavior component. This means that they will learn the same policy and share experience data during the training. If one of the agents discovers an area or tries some actions, then the others will learn from that automatically. This helps to speed up the training.
Action space: the set of actions the agent can take. Actions can either be continuous or discrete depending on the complexity of the environment and agent. In the simulation, the agent can move, rotate, pull, and release the cargo and select the destination where to release the dragged cargo.
Reward shaping: Reward shaping is one of the most important and delicate operations in RL. Since the agent is reward motivated and is going to learn how to act by trial experiences in the environment, simple and goal-oriented functions of rewards need to be defined.
According to this multi-task setting, we implement a reward scheme for each task, since the task objectives are relatively decoupled from each other, in particular collision avoidance, path planning maneuvers, and placement constraints. We train the agent with deep RL to acquire the skill of moving from different positions with various orientations to a target pose while considering time and collision avoidance, as well as the constraints from both the agent and the placement space. Thus, each constraint stated in the specifications has its own reward and/or penalty function. The reward can be immediate, given after each action or state transition, or measured as the cumulative reward at the end of the episode. In addition, sparse reward functions are used instead of continuous rewards since the relationship between the current state and the goal is nonlinear.

• During the episode:
  - The task must be completed on time. Therefore, after each action, a negative reward of −1/MaxStep is given to encourage the agent to finish the mission quickly, where MaxStep is the maximum number of steps allowed in an episode.
  - If an object is transported to the right destination, a positive reward (+1) is given.
• After episode completion: If all tasks are finished, a positive reward (+1) is given.
• Static collision avoidance: Agents receive a negative reward (−1) if they collide with a static obstacle. Box colliders are used to detect collisions with objects in Unity3D.
• Dynamic collision avoidance: There are two possible reward schemes: either the episode is terminated when a dynamic obstacle is hit, meaning that the tug master fails to accomplish the mission, or a negative reward is assigned to the agent, which is allowed to finish the episode. A successful avoidance is rewarded as well. To enhance safety, and since dynamic obstacles are usually humans, the second option is adopted.
• Lane-following: No map or explicit rules are provided to the agent. The agent relies solely on the reward system to navigate, which makes this task different from path following, in which a series of waypoints is given to the agent. The agent is rewarded for staying on the lane continuously, i.e., in each step; since the agent often enters and leaves the lanes during training, the rewards for lane-following have to be very small so that they do not dominate the learning of the main objective. If we suppose that the lanes are obstacle-free, the accumulated reward is the sum of the rewards obtained during the episode. Otherwise, obstacle avoidance and lane-following are competing objectives, and a coefficient β ∈ [0, 1] is introduced to make a trade-off between them, as described by Equation (1), with r_f(t) and r_a(t) being the lane-following and obstacle avoidance rewards at time t, respectively.
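The reward scheme above can be collected into a single sketch; the −1/MaxStep time penalty and the ±1 rewards follow the text, while β and the magnitudes of the lane-following and avoidance terms are illustrative assumptions:

```python
# Sketch of the reward scheme described above, in one function. The -1/MaxStep
# time penalty and the +/-1 rewards follow the text; beta and the magnitudes of
# the lane-following and avoidance terms are illustrative assumptions.
def step_reward(event, on_lane, beta=0.5, max_step=7000):
    r = -1.0 / max_step                     # time penalty on every step
    if event == "cargo_delivered":
        r += 1.0                            # cargo placed at the right destination
    elif event == "all_tasks_done":
        r += 1.0                            # all tasks finished
    elif event in ("static_collision", "dynamic_collision"):
        r -= 1.0                            # collision penalty (episode continues)
    # Competing objectives traded off by beta, in the spirit of Equation (1):
    r_f = 0.001 if on_lane else 0.0         # very small lane-following reward
    r_a = 0.001 if event == "dynamic_avoided" else 0.0
    r += beta * r_f + (1.0 - beta) * r_a
    return r

r = step_reward("cargo_delivered", on_lane=True)
```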
Configuration parameters: The most important parameters are explained below, and the others are summarized in Table 1.
• Learning rate: controls the amount of change made when adjusting the weights of the network with respect to the loss gradient, also called the step size. It can simply be defined as how much newly acquired information overrides the old. The learning rate is the most important hyperparameter to tune to achieve good performance on the problem.
• Epsilon decay: controls the decrease of the learning rate so that exploration and exploitation are balanced throughout the learning. Exploration means that the agent searches over the whole space in the hope of finding new states that potentially yield higher rewards in the future. Exploitation, on the other hand, means the agent's tendency to exploit the promising states that return the highest reward based on existing knowledge. The value of epsilon decay is generally set around 0.2.
The parameter values exposed herein are the default values recommended for general tasks. For our specific task, optimal hyperparameters were obtained using tuning techniques. In fact, one of the big challenges of DRL is the selection of hyperparameter values. DRL methods include parameters not only for the deep learning model, which learns the policy, but also for the environment and the exploration strategy. The parameters related to model design make it possible to construct a deep learning model capable of effectively learning latent features of the sampled observations, whereas a proper choice of parameters related to the training process allows the built model to speed up learning and converge towards the objective.
The hyperparameter optimization (HPO) problem is a challenging task since the hyperparameters interact with each other and do not act independently, which makes the search space very large. Different types of methods have been used to solve the HPO problem. Basic search methods sample the search space according to a very simple rule and without a guiding strategy, such as the grid search method and the random search method [37]. More advanced methods are sample based and use a policy to guide the sampling process, updating the policy based on the evaluation of each new sample. Well-known algorithms of this category are Bayesian optimizers [38] and population based techniques such as the genetic algorithm, which we opt to use in this paper [39]. The challenge here is to find a policy that reaches the optimal configuration in few iterations and avoids local optima. The last category includes gradient based HPO methods, which perform the optimization by directly computing the partial derivative of the loss function on the validation set [40].
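A minimal genetic algorithm for hyperparameter search, of the population based kind we opt for, can be sketched as follows; the synthetic fitness function stands in for an expensive training run (which is what would actually be evaluated in practice), and the bounds, population size, and mutation rate are illustrative:

```python
# Minimal genetic-algorithm sketch for hyperparameter search. The synthetic
# fitness function stands in for "train the agent with cfg and return the mean
# reward"; the bounds, population size, and mutation rate are illustrative.
import random

BOUNDS = {"learning_rate": (1e-5, 1e-2), "epsilon": (0.1, 0.3)}

def fitness(cfg):
    # Synthetic stand-in for a training run; peaks at lr = 3e-4, epsilon = 0.2.
    return -((cfg["learning_rate"] - 3e-4) ** 2 * 1e6
             + (cfg["epsilon"] - 0.2) ** 2)

def random_cfg():
    return {k: random.uniform(*b) for k, b in BOUNDS.items()}

def mutate(cfg, rate=0.3):
    child = dict(cfg)
    for k, (lo, hi) in BOUNDS.items():
        if random.random() < rate:       # perturb each gene with probability rate
            child[k] = min(hi, max(lo, child[k] + random.gauss(0, (hi - lo) / 10)))
    return child

def evolve(pop_size=20, generations=30, elite=5):
    pop = [random_cfg() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:elite]            # elitism: keep the best configurations
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(pop_size - elite)]
    return max(pop, key=fitness)

random.seed(0)
best = evolve()
```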
All the experiments were done using Unity3D Version 2018.4 and ML-Agents v.1.1.0, developed by Unity Technologies, San Francisco, CA, USA. C# was used to create the environment and the scripts controlling the agents, while Python was used to implement the learning techniques and train the agents. Unity's external communicator allows the agents to be trained from the Python code. The agents use the same brain, which means that their observations are sent to the same model to learn a shared policy. A machine with an Intel i7 microprocessor and 16 GB of RAM was used to run the implemented code.

Agent Learning Evaluation
The simulations were performed using a multiagent framework. Two agents, representing tug masters, collaborated to place the rolling cargo in the appropriate locations. We analyzed the agents' performance based on quantitative metrics, and Table 1 summarizes the most important parameters used in the simulation.
Four scenarios were considered. In the first scenario, the surface (100 m × 100 m) where the agents learned to move the rolling cargo was obstacle-free. The agents were not constrained to follow lanes, and the only requirement was to complete the task as quickly as possible. About 2.7 M training steps were enough to converge, after which the agents could complete all tasks in an average of 300 steps, with a mean cumulative reward of 16.4 over 7000 steps.
The second scenario consisted of lane-following and static obstacle avoidance (walls). About 3.6 M steps were required to achieve convergence; the success rate of avoiding static obstacles was 100%, and the average time needed to complete the task was 410 steps. The mean accumulated reward was 15.7, lower than in the first scenario due to the increase in path length caused mainly by the presence of obstacles. The third scenario added dynamic obstacles, i.e., pedestrians. The success rate of the agent in avoiding those dynamic obstacles was about 91%, while static obstacles were still avoided with a 100% success rate. In the fourth scenario, the agent learned all tasks. Table 2 summarizes the overall training progress for the different scenarios: Converge is the number of steps it took to converge; Reward is the mean cumulative reward, and Reward std is its standard deviation; Length is the episode length after convergence. To compare the learning behavior, the mean cumulative reward over 7000 steps was computed for all agents. As expected, it consistently increased over time during the training phase (Figure 3). The fluctuation of the curve reflects successive unsuccessful episodes; after convergence, the curve should be stationary with reduced fluctuations. The standard deviation of the cumulative episode reward lay in the interval [1.8, 3.2], which calls for the stabilization technique mentioned in Section 5. Between 300 and 450 steps were required for each agent to accomplish the mission collaboratively.
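The evaluation metric used above, the mean cumulative episode reward and its standard deviation, amounts to simple statistics over the rewards collected per episode within a window of training steps. The reward values below are made-up numbers for illustration, not data from the experiments:

```python
import statistics

# Made-up cumulative rewards for episodes in one evaluation window
# (illustrative values, not experimental data).
episode_rewards = [15.1, 16.8, 14.9, 17.2, 16.0]

mean_reward = statistics.mean(episode_rewards)  # mean cumulative reward
reward_std = statistics.stdev(episode_rewards)  # its standard deviation
```

A large standard deviation signals run-to-run variability in the trajectories, which is exactly what motivates the stabilization discussion later in the paper.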
The lane-following was qualitatively evaluated. The agents followed the lanes almost perfectly, but in some sequences, they still tended to choose the shortest path to save time and obtain a greater reward. Figure 4 also shows how much reward the agent predicted to receive in the short term through the value estimate metric, i.e., the mean value estimate over all states visited by the agent during the episode. It increased during successful training sessions and grew throughout the learning. After some time, the mean value estimate became almost proportional to the cumulative reward, which means that the agent was accurately learning to predict the rewards. The same figure depicts the linear evolution of the learning rate with a decay of 0.2. To evaluate the learning progress of the agents, various metrics were used. In Figure 5, the entropy measures the randomness and uncertainty of the model's decisions. It reflects the expected information content and quantifies the surprise in the action outcome. As shown in the figure, it steadily decreased during a successful training episode and consistently decreased over the entire training process. The increases in entropy, for instance between steps 1.1 M and 2 M, show that the agents were exploring new regions of the state space for which they did not have much information. To evaluate how well the state space was learned by the agents, the value loss metric was used. It computes the mean loss of the value function update at each iteration and thus reflects the ability of the model to predict the value of each state. As shown in Figure 6, it increased during training and then decreased once the learning converged, which shows that the agents learned the state space accurately. The policy loss, on the other hand, is the mean magnitude of the policy loss function; it reflects how the policy, i.e., the strategy of action decision, evolved.
The results show that, as expected, it oscillated during training, and its magnitude decreased for each successful session when the agents chose the best action sequences.
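The entropy metric discussed above is the Shannon entropy of the policy's action distribution. A minimal sketch, assuming a discrete action space for illustration:

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution.

    High entropy means an exploratory, uncertain policy; the value should
    fall as training converges, as seen in Figure 5.
    """
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)

# A uniform policy over four actions is maximally uncertain ...
uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])
# ... while a near-deterministic policy has entropy close to zero.
peaked = policy_entropy([0.97, 0.01, 0.01, 0.01])
```

The decreasing entropy curve thus directly tracks the shift from exploration (flat action distribution) to exploitation (peaked action distribution).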

Incremental Learning
In the case of complex environments, the agent may take a long time to figure out the desired tasks. By simplifying the environment at the beginning of training, we allowed the agent to quickly update the random policy to a more meaningful one that was successively improved as the environment gradually increased in complexity. We proceeded in an incremental way as follows: we trained the agents in a simplified environment, then used the trained agents in the next stage with more complex tasks, and so forth. There was no need to retrain them from scratch. Table 3 depicts the different stages of the incremental learning. Converge is the number of steps required for convergence; Reward is the mean cumulative reward; and Length is the episode length after convergence. After training the agent in the basic scenario without obstacles and lanes, it took only 1.5 M steps to teach the agent to follow lanes and avoid static obstacles, instead of spending 3.6 M steps to learn these tasks from scratch. The same held for learning dynamic obstacle avoidance, which required only an extra 1.9 M steps. The same experiment was repeated for five iterations and showed that the agent learned a sub-optimal policy with an average reward of 13 that never exceeded 13.2.
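The incremental procedure above can be sketched as a warm-started training loop. Here `train` is a placeholder for a real RL training call; only the stage step budgets are taken from the experiments, and the stage names are illustrative labels:

```python
# Sketch of incremental learning: the policy trained in a simpler stage
# seeds the next, more complex stage instead of restarting from scratch.
def train(policy, stage, budget_steps):
    """Placeholder for a real RL training call: refine `policy` on
    `stage` for `budget_steps` steps (here it just records the history)."""
    return policy + [(stage, budget_steps)]

# Stage budgets from the experiments; names are illustrative labels.
stages = [
    ("no-obstacles", 2_700_000),
    ("lanes+static-obstacles", 1_500_000),  # vs. 3.6 M from scratch
    ("dynamic-obstacles", 1_900_000),
]

policy = []  # stands in for a randomly initialized policy
for stage, budget in stages:
    policy = train(policy, stage, budget)  # warm-start from previous stage

total_steps = sum(budget for _, budget in policy)
```

The warm start saves training time overall, but as noted above, it also risks locking the agent into a sub-optimal policy inherited from the simpler stages.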

Hyperparameter Tuning
The final experiment aimed to improve on the results obtained with the default hyperparameters. The genetic algorithm (GA) was used as an optimizer over the hyperparameter search spaces to find the optimal structure of the model, as well as to tune the hyperparameters of the model training process. The optimizer was run for 100 generations with a population of size 20 in a parallel computation scheme, such that the maximum number of iterations was one million steps. Each individual represented a single combination of the hyperparameters. The fitness function was evaluated once the stop condition was met for all individuals and equaled the mean of the accumulated rewards over the last 100 episodes. Table 4 shows the search space considered for the model design hyperparameters and the obtained optimal values. Discrete sets are denoted by parentheses, while continuous intervals are denoted by brackets and sampled uniformly by the GA. The optimal architecture of the deep learning model that learned the policy and the advantage function consisted of three hidden layers of 256 units each. The optimal parameters for the training process are listed in Table 5. Using this value combination sped up the learning process and increased the mean cumulative reward. A full simulation of the fourth scenario shows that the agents achieved a mean cumulative reward of 14.83 in 4.6 M steps. To show the importance of hyperparameter tuning, Figure 7 illustrates the learning progress of the agents for different hyperparameter values in the fourth scenario. It is clear that choosing the values of configuration C4 degraded the performance of the agents, which never learned the tasks. In fact, we noticed through visualization that they could not explore the environment and were stuck in a sub-region of the space.
Using the value combinations C2 and C3 allowed the agents to learn only a suboptimal policy: they did not complete all tasks in an optimized time, and their risk of hitting obstacles was high. Table 6 shows the corresponding combination values.
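A toy version of this genetic-algorithm tuner can be sketched as follows. The genes, ranges, and synthetic fitness function are illustrative stand-ins, since the real fitness is the mean cumulative reward over the last 100 episodes of a full training run:

```python
import random

# Illustrative genes: a discrete set (parentheses in Table 4) and a
# continuous interval (brackets in Table 4), sampled uniformly.
HIDDEN_UNITS = [64, 128, 256, 512]
LR_RANGE = (1e-5, 1e-3)

def random_individual(rng):
    """One individual = one hyperparameter combination."""
    return {"hidden_units": rng.choice(HIDDEN_UNITS),
            "learning_rate": rng.uniform(*LR_RANGE)}

def mutate(ind, rng):
    """Perturb one gene of a copy of the parent."""
    child = dict(ind)
    if rng.random() < 0.5:
        child["hidden_units"] = rng.choice(HIDDEN_UNITS)
    else:
        lr = child["learning_rate"] * rng.uniform(0.5, 2.0)
        child["learning_rate"] = min(max(lr, LR_RANGE[0]), LR_RANGE[1])
    return child

def ga(fitness, generations=20, pop_size=10, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)  # rank by fitness
        elite = pop[: pop_size // 2]         # truncation selection
        pop = elite + [mutate(rng.choice(elite), rng) for _ in elite]
    return max(pop, key=fitness)

# Synthetic fitness peaking at 256 hidden units and a learning rate of 3e-4
# (a cheap stand-in for the mean reward of a real training run).
best = ga(lambda c: -abs(c["hidden_units"] - 256) / 64
                    - abs(c["learning_rate"] - 3e-4) * 1e3)
```

Keeping an elite fraction makes the best fitness non-decreasing across generations, which helps the GA converge in few iterations while mutation preserves some exploration against local optima.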

Discussion
Looking back at the research questions formulated in the Introduction, we can affirm that the agents are able to learn good strategies and fulfil the expected tasks. Regarding the first research question, we can say that with appropriate model hyperparameters, we converge to a solution where the agents learn how to accomplish their tasks. The convergence also ensures a certain stability and allows the agents to exploit their knowledge instead of endlessly exploring the environment or choosing random actions. However, the reward standard deviation reflects the variation of the trajectories and also the imperfection of the learning stability. Regarding the second research question, we can see that the learned policy is sub-optimal with the incremental learning approach, but near-optimal otherwise. Regarding the third research question, it is evident that the tuning of hyperparameters improved the quality of the achieved solution.
The obtained results are very promising and encourage tackling more advanced logistics constraints using reinforcement learning. Our study was limited to collision avoidance, lane-following, and cargo moving. In addition, only one cargo feature, namely the size, was used as the constraint to place the cargo at its appropriate destination. Thus, the formulation of the multi-objective stowage problem including all the logistics constraints is one of the first priorities. The aim will mainly be the optimization of the deck utilization ratio on board the ship with respect to specific constraints such as customer priorities, placing vehicles that belong to the same customer close to each other, ship weight balance, etc. Another perspective of this study, which currently presents a challenge to the proposed autonomous system, is stability. In fact, the reward standard deviation lies between 1.83 in the first scenario without obstacles and 3.2 in the fourth scenario. Lyapunov function based approaches [41] can be applied to the value function to ensure that the RL process is bounded by a maximum reward region. In addition, collaboration among the agents can be further enhanced. In this paper, a parameter-sharing strategy is used to ensure collaboration and implicit knowledge sharing, meaning that the policy parameters are shared among the agents. However, cooperation remains limited without explicit information sharing.

Concluding Remarks
In this paper, deep reinforcement learning is used to perform recurrent and important logistics operations in maritime transport, namely rolling cargo handling. The agents, representing the tug masters, successfully learn to move all cargo from the initial positions to the appropriate destinations. Avoiding static and dynamic obstacles is included in the objective function of the agents. The agents learn to avoid static obstacles with a 100% success rate and avoid dynamic obstacles with a 91% success rate. We use hyperparameter tuning to improve the results, and we show the great impact that deep learning model hyperparameters have on the solution convergence, policy learning, and advantage function. In the tuned model, the agents need 4.6 M steps to converge towards the optimal policy and achieve 88.3% of the maximum value of the mean cumulative reward. Finally, incremental learning has the advantage of saving training time and building a complex environment incrementally, but it risks falling into a local optimal trap.
The proposed approach shows potential to be extended to real-world problems where full automation of the loading/unloading operations would be feasible. The impact would concern both environmental sustainability and a potential increase in revenue. The work should also be of interest to manufacturers of systems and vehicles used in these settings. However, as pointed out, this is an initial study, and as such the main focus has been on studying the feasibility of applying the presented algorithms and seeing whether they can solve a simplified problem of unloading and loading cargo.
We see several possible extensions of the proposed solution as directions for future work. As the end goal is to implement it in physical tug masters loading and unloading ROROs, incorporating more realistic constraints is one extension that we will pursue. This includes making the agents operate under realistic tug master constraints, like having to reverse in order to attach a trailer, having to drag the trailer, and being realistically affected by dragging a heavy trailer regarding, e.g., acceleration, braking, turning, etc. This also includes making the interior of the ROROs more realistic, with multiple decks, narrow passages, etc. Another aspect that might need consideration is load balancing on deck, in order to optimize stability and minimize fuel consumption. Several of the mentioned improvements will affect the RL environment, but will also require reshaping the reward function. Optimizing the reward function to ensure that the agents learn to avoid dynamic obstacles perfectly is another important direction for future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: