COLREGs-Compliant Multi-Ship Collision Avoidance Based on Multi-Agent Reinforcement Learning Technique

Abstract: The congestion of waterways can easily lead to traffic hazards. Moreover, according to the data, the majority of sea collisions are caused by human error and failure to comply with the Convention on the International Regulations for Preventing Collisions at Sea (COLREGs). To avoid this situation, automatic ship collision avoidance has become one of the most important research issues in the field of marine engineering. In this study, an efficient method is proposed to solve multi-ship collision avoidance problems based on a multi-agent reinforcement learning (MARL) algorithm. Firstly, the COLREGs and ship maneuverability are considered for achieving multi-ship collision avoidance. Subsequently, the Optimal Reciprocal Collision Avoidance (ORCA) algorithm is utilized to detect and reduce the risk of collision; ships can operate at the safe velocity computed by the ORCA algorithm to avoid collisions. Finally, the Nomoto three-degrees-of-freedom (3-DOF) model is used to simulate the maneuvers of ships. Based on the above information and algorithms, this study designs and improves the state space, action space and reward function. To validate the effectiveness of the method, this study designs various simulation scenarios with thorough performance evaluations. The simulation results indicate that the proposed method is flexible and scalable in solving multi-ship collision avoidance, complying with COLREGs in various scenarios.


Introduction
The continuous increase in the number of maritime transportation vessels is making waterways more congested. Obviously, this situation can cause serious traffic hazards. When there are many ships around, it is easy to make wrong decisions by relying only on people to control the ship. According to the data, about 89-96% of sea collisions are caused by human error [1]. To avoid this situation, automatic ship collision avoidance has become one of the most important research issues in the field of marine engineering. However, due to the complex motion model of ships and the low control accuracy, most algorithms cannot meet the requirements. Artificial intelligence (AI) is currently the most applicable technology for solving this problem [2]. Deep reinforcement learning (DRL) is a new research hotspot in the field of artificial intelligence and has made great progress in both theory and applications; a notable example is the Go agent 'AlphaGo', created by the Google DeepMind team, which beat top Go players [3]. It has also made substantial breakthroughs in decision-making control [4]. Deep reinforcement learning consists of two parts: deep learning and reinforcement learning. Deep learning has a strong perceptual ability and is widely used in image analysis [5], speech recognition [6] and other fields. Reinforcement learning is known for its decision-making ability, first proposed by Sutton in 1984 [7]. It uses a reward and punishment system, gains experience from the environment, adjusts strategies through repeated training to adapt to the environment, and ultimately achieves the desired results. The task of autonomous ship collision avoidance likewise involves continual interactions with the environment, which makes it well suited to reinforcement learning.

Literature Review and Motivation
The automatic collision avoidance technology of ships is key to guaranteeing the safety of navigation. Recently, the related theories and technologies have gradually improved. Miele et al. [8] proposed a method based on the multi-subarc sequential gradient-restoration algorithm to solve two cases of the collision avoidance problem: two ships moving along the same rectilinear course, and along orthogonal courses. Phanthong et al. [9] described path replanning techniques and proposed an algorithm based on the A-star algorithm to avoid stationary and dynamic obstacles with an optimal trajectory. Cheng et al. [10] proposed an optimization method based on a genetic algorithm, which was applied to avoid collision and seek a trajectory. However, these methods do not comply with the COLREGs, which should not be ignored for ocean-going ships.
Methods for COLREGs-compliant collision avoidance have been proposed for multiple ships in the open sea. Wilson et al. [11] proposed a new navigation method called the line-of-sight counteraction navigation algorithm (LOSCAN). The algorithm aided maneuver decision-making for two-ship collision avoidance complying with COLREGs. However, this method is not capable of dealing with multi-ship collision avoidance. Liang et al. [12] proposed the minimum course alteration (MCA) algorithm to avoid moving ships or obstacles under the constraints of COLREGs; the simulation results showed that the algorithm was credible in collision avoidance. Chen et al. [13] designed an intelligent collision avoidance control system which integrated collision avoidance navigation and nonlinear optimal control methods. To avoid collision, two fuzzy indicators, collision risk and collision avoidance action timing, were developed. Johansen et al. [14] described a concept for a ship collision avoidance system based on model predictive control; COLREGs and the collision hazards associated with each of the alternative control behaviors are evaluated over a finite prediction horizon. Hu et al. [15] designed a multi-objective optimization algorithm which incorporated a hierarchical sorting rule to prioritize the objective of course or speed change preference over other objectives such as path length and path smoothness. All of these methods can complete the two-ship collision avoidance task and comply with COLREGs. However, when a ship encounters more complex scenarios, such as four ships at risk of colliding at the same time, these methods cannot achieve collision avoidance navigation.
With the development of artificial intelligence, a number of collision avoidance methods based on deep reinforcement learning (DRL) have been developed. Shen et al. [16] presented a training method based on DRL for ship collision avoidance which incorporated ship maneuverability, human experience and COLREGs. Experimental validation with three self-propelled ships demonstrated that the DRL-based method has great potential to realize automatic collision avoidance. Sawada et al. [17] proposed a multi-ship automatic collision avoidance method based on DRL in a continuous action space, with the obstacle zone by target used to compute the risk of collision; the trained agent passed a large number of simulation scenarios. Li et al. [18] utilized the artificial potential field (APF) algorithm to improve the action space and reward function of DRL; the method trained agents to avoid collision while complying with COLREGs, and the simulation results showed that the improved DRL could realize automatic collision avoidance. Zhao et al. [19] proposed a method which used a deep neural network (DNN) to map the states of encountered ships to the own ship's steering commands in terms of rudder angle; a policy-gradient-based DRL algorithm was used to train the DNN for collision avoidance complying with COLREGs, and the simulation results indicated that the multi-ship model was able to avoid collision. Xu et al. [20] formulated the collision avoidance strategy and designed the state, action, reward function and network structure to improve the DDPG algorithm; the results showed that the method can give reasonable collision avoidance actions and realize effective collision avoidance. The advantages and disadvantages of these methods are summarized in Table 1.
In this study, a novel intelligent method based on multi-agent reinforcement learning, named the CA-QMIX algorithm, is proposed. The COLREGs and ship maneuverability are considered for achieving multi-ship automatic collision avoidance. The Optimal Reciprocal Collision Avoidance (ORCA) algorithm is used to detect and reduce the risk of collision, and the safe velocity computed by ORCA is adopted to avoid collision. This study also utilizes the three-degrees-of-freedom (3-DOF) Nomoto ship motion mathematical model to simulate the maneuvers of a ship. Finally, the state space, action space and reward functions are designed to improve the convergence rate of training. The simulation results indicate that the proposed method has excellent flexibility and scalability for solving multi-ship collision avoidance complying with COLREGs in various scenarios.
The organization of this paper is as follows: Section 3 first gives the method for detecting the risk of collision, then illustrates the ship motion model and the COLREGs, which are considered the basis of ship collision avoidance. Section 4 describes the principles and applications of the multi-agent reinforcement learning algorithm. Section 5 presents the simulation results for multi-ship collision avoidance. Section 6 summarizes this paper.


Problem Definition
The solution to the multi-ship collision avoidance problem can be roughly divided into the following two categories:
1. Single-agent collision avoidance: the own-ship (OS) is considered an agent, and the target-ship (TS) is treated as a dynamic obstacle;
2. Multi-agent collision avoidance: each ship is an agent, and there are partnerships between them.
In this study, we aim to successfully complete multi-ship collision avoidance and reach the target point. The quality of collision avoidance behavior is related not only to the own-ship (OS) actions, but also to the target-ship (TS) actions. Under the trend that communication networks (5G or 6G) will gradually cover the world, interaction between ships will become more convenient, and the advantages of the multi-agent method will become more obvious. Hence, this paper defines multi-ship collision avoidance as a multi-agent problem and uses a multi-agent reinforcement learning algorithm to solve it.

The Ship Motion Model and Collision Detection
Establishing a suitable ship motion mathematical model is necessary before using the algorithm. This paper uses the Nomoto three-degrees-of-freedom (3-DOF) model [21] and the principal dimensions of the ship form [22]. The coordinate systems are shown in Figure 1, and the principal dimensions of the ship are given in Table 2. The ship's velocity set includes the surge velocity u, the sway velocity v and the yaw rate r. ψ denotes the heading angle and ψ_d the desired heading angle; hence the heading-angle error is ψ_e = ψ − ψ_d. The rudder characteristics are expressed as a first-order lag, T_E (dδ/dt) + δ = δ_E, where δ and δ_E are the real rudder angle and the commanded rudder angle, and T_E is the time constant of the steering gear.
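As an illustrative sketch (not the paper's full 3-DOF implementation), the steering-gear lag above and a first-order Nomoto heading response can be integrated with simple Euler steps; the gain K and the time constants T and T_E below are assumed values chosen only for demonstration:

```python
def rudder_step(delta, delta_cmd, T_E, dt):
    # first-order steering gear: T_E * d(delta)/dt + delta = delta_E
    return delta + dt * (delta_cmd - delta) / T_E

def nomoto_step(psi, r, delta, K, T, dt):
    # first-order Nomoto heading response: T * dr/dt + r = K * delta
    r = r + dt * (K * delta - r) / T
    psi = psi + dt * r          # the heading integrates the yaw rate
    return psi, r

# illustrative run: the rudder angle converges to the commanded angle
delta, delta_cmd = 0.0, 0.3
for _ in range(2000):
    delta = rudder_step(delta, delta_cmd, T_E=2.0, dt=0.1)
```

With a constant positive rudder angle, `nomoto_step` produces a steadily increasing heading, matching the qualitative turning behavior described above.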
To reduce collision risk, this paper performs collision detection using the ORCA (Optimal Reciprocal Collision Avoidance) method [23]. The schematic diagram of ORCA is illustrated in Figure 2. Similarly, the ship domain concept is used to calculate the collision risk area and to define a safe area. To increase security, the ship domain is further expanded into a circle (taking d_5 as diameter [24]). As shown in Figure 2a, a hexagonal collision risk area is created; then, taking the smallest circumscribed circle of the hexagon, the final collision risk area is formed. The multi-ship collision avoidance problem can thus be simplified into collision avoidance between circular areas with radii R_OS and R_TS. As Figure 2 shows, P_OS and P_TS stand for the locations of the OS and TS, respectively. However, the movement of the areas is still constrained by the ship model.
In the velocity coordinate system of Figure 3a, assuming that the TS is stationary, if the OS is not to collide with the TS during movement, then the velocity of the OS cannot be selected from the velocity obstacle V^{t_s}_{OS|TS} (the gray region in Figure 3a). The definition of the velocity obstacle implies that if V_OS − V_TS ∈ V^{t_s}_{OS|TS}, or equivalently if V_TS − V_OS ∈ V^{t_s}_{TS|OS}, then the OS and TS will collide at some moment before time t_s (one time step).
Formally, V^{t_s}_{OS|TS} = { V | ∃ t ∈ [0, t_s] : tV ∈ D }, where D is a disc with center P_TS − P_OS and radius R_TS + R_OS. Geometrically, V^{t_s}_{OS|TS} is a truncated cone with its apex at the origin and its two sides tangent to the circle with center P_TS − P_OS and radius R_TS + R_OS; the cone is truncated by the arc with center (P_TS − P_OS)/t_s and radius (R_TS + R_OS)/t_s. Generally speaking, the OS cannot sail for time t_s at its current velocity, otherwise it will collide with the TS.
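The membership test implied by this definition can be sketched directly: minimize ||t·V − (P_TS − P_OS)|| over t ∈ (0, t_s] and compare the minimum against R_OS + R_TS. This is an illustrative stand-in, not the paper's implementation:

```python
import math

def in_velocity_obstacle(v_rel, p_rel, r_sum, t_s):
    """True if some t in (0, t_s] brings the relative position t*v_rel
    within r_sum of p_rel, i.e. v_rel lies in the truncated cone VO^{t_s}."""
    vx, vy = v_rel
    px, py = p_rel
    v2 = vx * vx + vy * vy
    # closed-form minimiser of ||t*v - p||^2, clamped into (0, t_s]
    t_star = t_s if v2 == 0.0 else max(1e-9, min(t_s, (vx * px + vy * py) / v2))
    return math.hypot(t_star * vx - px, t_star * vy - py) < r_sum
```

For example, heading straight at a target 10 units away at speed 2 triggers the test within t_s = 5, while a perpendicular velocity does not.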
When V_TS is considered, the set of velocities which make the OS collide with the TS is V^{t_s}_{OS|TS} ⊕ V_TS (the gray region in Figure 3b). Finally, the complement of V^{t_s}_{OS|TS} ⊕ V_TS is the set of safe velocities, recorded as V_safe. If the ship sails with V ∈ V_safe, the collision will be avoided; but selecting the optimal velocity from the set V_safe is a difficult problem.
To solve the above problem, the Optimal Reciprocal Collision Avoidance (ORCA) method is presented. Firstly, a vector u describes the minimal change required to move the OS's relative velocity out of the velocity obstacle, u = (argmin_{v ∈ ∂V^{t_s}_{OS|TS}} ||v − (V_OS − V_TS)||) − (V_OS − V_TS), where ∂V^{t_s}_{OS|TS} is the boundary of V^{t_s}_{OS|TS}. The vector n is a normal vector of the V^{t_s}_{OS|TS} boundary; n points to the inside of the collision area, and its starting point is at the intersection of the vector u and the boundary. Combining these variables, ORCV^{t_s}_{OS|TS} is defined as the optimal reciprocal collision avoidance velocity set of the OS.
where ORCV^{t_s}_{OS|TS} is the optimal reciprocal collision avoidance velocity set for the OS avoiding collision with the TS within time t_s. When multiple ships avoid collision, each ship computes the optimal velocity set by ORCA, and the intersection of these velocity sets forms a polyhedron.
For achieving multi-ship collision avoidance, we define the velocity set ORCV^{t_s}_{OS} as the intersection of ORCV^{t_s}_{OS|TS} over all TSs; an OS adopting any V ∈ ORCV^{t_s}_{OS} can avoid colliding with all TSs.
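Since each ORCV set is a half-plane in velocity space, their intersection can be searched for the admissible velocity closest to the preferred one. Full ORCA solves a small linear program; the brute-force polar-grid sampler below is only a hedged sketch of that step, and the `(point, normal)` half-plane representation is an assumption of this sketch:

```python
import math

def pick_safe_velocity(v_pref, half_planes, v_max, n_samples=3600):
    """Pick the admissible velocity closest to the preferred one by sampling.
    Each half-plane is (point, normal): v is admissible if (v - point).normal >= 0.
    A brute-force stand-in for the linear program used by full ORCA."""
    best, best_d = None, float("inf")
    n_speeds = n_samples // 360
    for i in range(n_samples):
        ang = 2.0 * math.pi * (i % 360) / 360.0
        spd = v_max * ((i // 360) + 1) / n_speeds
        v = (spd * math.cos(ang), spd * math.sin(ang))
        if all((v[0] - p[0]) * n[0] + (v[1] - p[1]) * n[1] >= 0.0
               for p, n in half_planes):
            d = math.hypot(v[0] - v_pref[0], v[1] - v_pref[1])
            if d < best_d:
                best, best_d = v, d
    return best
```

With a single half-plane requiring a non-negative x-velocity and a preferred velocity pointing the other way, the sampler returns the closest admissible compromise.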

COLREGs
Before using the method to solve the multi-ship collision avoidance problem, the COLREGs need to be considered. The OS must react to avoid the TSs while complying with COLREGs, and subsequently return to its predefined path once safety is confirmed. As illustrated in Figure 4, a diagram centered on the OS is divided into four parts. The collision avoidance behaviors conforming to COLREGs are shown in Figure 5 (encounter situations defined by COLREGs).


COLREGs-Based Multi-Ship Collision Avoidance
The COLREGs can be extended to scenarios where the OS encounters multiple TSs. Multi-ship collision avoidance under the COLREGs can be summarized as in Figure 6.

In Figure 6, the OS encounters two TSs from different directions; they should all comply with COLREGs and alter course to starboard to avoid collision. In the same way, when three ships encounter a similar situation, they should all alter course to starboard. In summary, when a multi-ship (≥3) encounter occurs, each ship should follow the COLREGs and alter course to starboard to avoid collision.
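As a rough sketch, the encounter situations above can be classified from the TS's relative bearing as seen from the OS; the sector limits below follow the values quoted in this paper's case studies (a 10° reciprocal head-on band, the 135° overtaking sector astern, port crossing within (247.5°, 355°)) and are otherwise illustrative assumptions:

```python
def classify_encounter(rel_bearing_deg):
    """Rough COLREGs encounter classification by the TS's relative bearing
    from the OS (degrees, clockwise from the bow). Illustrative sketch only."""
    b = rel_bearing_deg % 360.0
    if b <= 5.0 or b >= 355.0:          # reciprocal courses within ~10 degrees
        return "head-on"
    if 112.5 <= b <= 247.5:             # the 135-degree sector astern
        return "overtaking (TS astern)"
    if b < 112.5:
        return "starboard crossing"
    return "port crossing"              # 247.5 to 355 degrees
```

Whatever the classification, the starboard-turn rule summarized above then applies to each ship in the encounter.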
The process of collision avoidance combined with COLREGs is shown in Figure 7. Firstly, the instantaneous ship domain is calculated and expanded into a safe round area. Then, the OS detects whether TSs enter the OS's safe area and judges the encounter scenario from the TSs' instantaneous positions. To avoid collision, the OS must select V ∈ ORCV^{t_s}_{OS}, which is calculated by ORCA. After one time step, the OS checks whether the TSs have left the OS's safe area. If any TSs remain in the safe area, the above steps are repeated.

CTDE
Multi-agent reinforcement learning is a recent research topic. Centralized Training Decentralized Execution (CTDE) [25] is arguably the simplest method to train and execute. However, multi-agent reinforcement learning has two difficulties:

• Observational limitations: when an agent interacts with the environment, it cannot obtain the global state s of the environment, and can only see the local observation information within its own observation range o;
• Instability: when multiple agents learn together, their changing strategies and the resulting changes in actions mean that the value function of agent i cannot update stably.
Therefore, to solve the above problems, this study adopts the Centralized Training Decentralized Execution (CTDE) framework, which relaxes these limitations by allowing agents to access global information during training.

IQL and VDN
The IQL (Independent Q-Learning) [27] algorithm treats the rest of the agents directly as part of the environment; that is, each agent solves a single-agent task. The value function of agent i is Q_i(τ_i, u_i). Relying only on Q_i(τ_i, u_i) for decision-making is unstable: because of the other learning agents, the environment is non-stationary, so convergence cannot be guaranteed, and an agent can easily get caught up in endless exploration.
It is necessary to use Q_total(τ, u) to learn with a global view. Sunehag [28] proposed the VDN algorithm, which uses the individual Q_i(τ_i, u_i) to factorize Q_total(τ, u) as Q_total(τ, u) = Σ_i Q_i(τ_i, u_i). VDN simply accumulates the local action-value functions of each agent to obtain the joint action-value function, which satisfies the additivity relation between Q_total(τ, u) and the Q_i(τ_i, u_i). However, it does not integrate the agents' local value functions further during learning.
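A toy illustration of the VDN factorization (plain Python with illustrative per-agent Q-tables): because Q_total is the sum of the per-agent values, each agent greedily maximizing its own Q_i reproduces the centralized argmax over joint actions.

```python
from itertools import product

def q_total(per_agent_q, joint_action):
    # VDN factorization: Q_total(tau, u) = sum_i Q_i(tau_i, u_i)
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))

def centralized_argmax(per_agent_q):
    # exhaustive search over all joint actions
    n_actions = len(per_agent_q[0])
    joints = product(range(n_actions), repeat=len(per_agent_q))
    return max(joints, key=lambda u: q_total(per_agent_q, u))

def decentralized_argmax(per_agent_q):
    # each agent independently maximizes its own Q_i
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)
```

That the two procedures agree for any Q-tables is exactly the IGM property that motivates the condition discussed next.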

IGM Condition and Constraint
To retain the advantages of VDN, centralized learning is used to obtain distributed strategies. QMIX [29] first defines a condition called IGM (Individual-Global-Max): performing argmax on Q_total(τ, u) must be equivalent to performing argmax on each Q_i(τ_i, u_i) individually. Generally speaking, if this condition is satisfied, then obtaining the optimal actions from the local Q_i(τ_i, u_i) is trivially tractable.
To achieve this effect, the QMIX algorithm imposes a sufficient condition: if Q_total(τ, u) is monotonic in each Q_i(τ_i, u_i), i.e., ∂Q_total/∂Q_i ≥ 0, then the IGM condition holds. To enforce this constraint, QMIX uses a dedicated network architecture.

Overall Framework
The overall framework of the QMIX algorithm is shown in Figure 8. The network structure mainly consists of three parts:
• Agent network (Figure 8a): represented by a DRQN network. In the partially observable setting, agents using an RNN can use their whole action-observation history to infer the current state. Its inputs at each time step are the agent's current individual observation o_i^t and its previous action u_i^{t−1};
• Hypernetwork: the hypernetwork [30] is used to calculate the weights and biases of the mixing network. Its input is the global state. The outputs are the weights and the biases, where the weights must be non-negative (W ≥ 0), so an absolute-value activation function is used; the biases use the common ReLU activation function, because they have no constraint on their value range;
• Mixing network (Figure 8c): its weights and biases are generated by the hypernetwork, and its role is to mix the Q_i(τ_i, u_i) of each agent into a monotonic Q_total(τ, u) for the whole system, and also to make training more stable by incorporating system-level information.
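The monotonicity constraint can be made concrete with a tiny mixing-network sketch: the hypernetwork-generated weights are passed through an absolute value before use, so Q_total is non-decreasing in every Q_i regardless of the learned weight signs. Shapes and values here are illustrative assumptions, not the paper's architecture details:

```python
import math

def elu(x):
    # ELU activation: identity for x > 0, exp(x) - 1 otherwise
    return x if x > 0 else math.expm1(x)

def qmix_mix(agent_qs, w1, b1, w2, b2):
    """Two-layer monotonic mixer: |W1| and |w2| guarantee dQ_total/dQ_i >= 0."""
    hidden = [elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
              for row, b in zip(w1, b1)]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2
```

Even with negative raw weights, increasing any agent's Q-value can never decrease the mixed Q_total, which is what makes decentralized greedy action selection consistent with the centralized value.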

Algorithm Implementation
During multi-ship collision avoidance, each ship is an agent participating in the training; the multi-agent reinforcement learning algorithm QMIX is employed to solve the problem. The iterative updating process of the algorithm is shown in Figure 9, and the training parameters are shown in Table 3. QMIX is trained to minimize the loss L(θ) = Σ_{i=1}^{b} (y_i^total − Q_total(τ, u, s; θ))², where b is the batch size of transitions sampled from the replay buffer, and the target y^total = r + γ max_{u'} Q_total(τ', u', s'; θ−) [31] is used to update the networks. θ− are the parameters of a target network, as in DQN.
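The TD loss just described can be sketched in a few lines; the transition layout and the function signatures here are assumptions made for illustration, with `q_total_target` standing in for the θ− network:

```python
def qmix_loss(batch, gamma, q_total, q_total_target, joint_actions):
    """Mean squared TD error over a sampled batch:
    y_total = r + gamma * max_{u'} Q_total(s', u'; theta-)."""
    loss = 0.0
    for s, u, r, s_next, done in batch:
        y = r if done else r + gamma * max(
            q_total_target(s_next, u2) for u2 in joint_actions)
        td = y - q_total(s, u)
        loss += td * td
    return loss / len(batch)
```

In practice the gradient of this loss flows through the mixing network into each agent network, while θ− is only periodically copied from θ.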


Action Space
In the process of ship collision avoidance, the crew changes heading and speed to ensure navigational safety. Likewise, during automatic collision avoidance, the performance of turning is considered in designing the action u, where u ∈ [−ψ, ψ] and ψ is the change in course angle. The rudder angle command is obtained from the ship motion mathematical model. At each time step, each ship i chooses an action u_i, giving rise to a joint action vector [u_1, …, u_N], where N is the number of ships.
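A minimal sketch of this action interface (the discretization into `n_actions` course-change bins is an assumption of the sketch; the text only specifies u ∈ [−ψ, ψ]):

```python
def to_course_change(idx, psi_max=30.0, n_actions=7):
    # map a discrete action index 0..n_actions-1 onto [-psi_max, +psi_max]
    return -psi_max + 2.0 * psi_max * idx / (n_actions - 1)

def joint_action(policies, observations):
    # each ship i picks u_i; together they form the joint action [u_1, ..., u_N]
    return [to_course_change(pi(o)) for pi, o in zip(policies, observations)]
```

Each selected course change is then passed through the ship motion model to obtain the actual rudder command.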

State Space
The state space is defined as the set of information about the environment that the ship receives at a given time step, s_t. The observed state includes each ship's location P_location, the goal location P_goal, the heading angle ψ, the desired heading ψ_d, the velocity V and the ship length L.

Reward Function
The reward function is an evaluation of the ship movements, calculated as the sum of the rewards accumulated from each ship: R_total = Σ_{i=1}^{N} R_i, where the reward of each ship R_i is the sum of the rewards accumulated in each episode. The objective of the study is to avoid collision between the OS and TSs, and to maneuver the OS in compliance with COLREGs. Consequently, the reward function rewards the agent for reaching the destination and for avoiding collision while complying with COLREGs.
The goal reward function R_goal guides the ship to reach the destination. It is expressed as a function of P_t, the ship's current location at time step t, and a hyperparameter λ_goal. As the distance between the ship and the destination gets shorter, the agent obtains a larger reward value; the reward reaches its maximum when the distance becomes less than d_5/4. For collision avoidance and fulfilling COLREGs, this paper designs the reward functions R_collision and R_COLREGs.
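A hedged sketch of such a goal reward (the exponential shaping and the saturation value are assumptions of the sketch; `lam` plays the role of λ_goal and `d5` is the ship-domain diameter from Section 3):

```python
import math

def goal_reward(p_t, p_goal, d5, lam=0.01):
    # distance from the ship's current position to its destination
    dist = math.hypot(p_goal[0] - p_t[0], p_goal[1] - p_t[1])
    if dist < d5 / 4.0:
        return 1.0                   # saturate once close to the goal
    return math.exp(-lam * dist)     # shorter distance -> larger reward
```

The monotone shaping means every step toward the goal is reinforced, while the saturation plateau marks arrival.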
However, the goal reward function is sometimes contradictory to the collision avoidance reward. Thereby, the whole process is divided into two stages, normal sailing and collision avoidance, as shown in Figure 10.


Method for Path Planning and Collision Avoidance Based on CA-QMIX
According to the previous section, the design of the ship collision avoidance system has been presented with an explanation of important parts in detail. In this section, we trained the agent to avoid collision using the CA-QMIX algorithm. The proposed CA-QMIX algorithm has been evaluated with simulation tests for diverse environments. The setting of the ship collision avoidance simulation scenarios included two-ship encounter situations and a multi-ship encounter situation. The collision avoidance scenarios of two ships were used to evaluate whether the algorithm conforms to COLREGs. The scalability of the algorithm was then verified by multi-ship encountering scenarios.

Two Ships Collision Avoidance in Four Scenarios
To guarantee the performance of the algorithm, each ship making the decision must comply with COLREGs, and reach a destination after successfully avoiding collision. In the training phase, the state input of each ship consists of its state, as observed by itself. The output of the algorithm is the rudder angle. At each training iteration, each ship selects an optimal action based on state and observation to generate trajectories. But an episode will end if ships collide with others or ships all reach their destinations.
The average reward is computed as the sum of the rewards accumulated by each ship, and the reward functions follow the rules designed in Section 3. When the average reward value becomes stable, as shown in Figure 11, the training process is complete and the optimal agent is obtained: all ships can automatically avoid collision while strictly following the COLREGs. To inspect the trained agent, cases (a), (b), (c) and (d) were set up for simulation. The origin and destination parameters of the simulations are shown in Table 4, and the simulation scenarios are set up as shown in Figure 12.
In case (a), ship Ⅰ and ship Ⅱ encountered a head-on situation (reciprocal courses within a range of 10°). On detecting collision risk, both ships quickly altered course to starboard. After completing collision avoidance, they continued to their destinations. In this process, the ship movement not only complied with the COLREGs but also achieved collision-free navigation.
In case (b), a port crossing situation occurred when ship Ⅳ appeared within an azimuth angle of (247.5°, 355°) relative to ship Ⅲ. When the two ships entered a dangerous ship domain, they altered course to starboard and completed the collision avoidance task by selecting a safe speed.
In case (c), ship Ⅴ detected ship Ⅵ approaching from its starboard side; the two ships were in a starboard crossing situation. Ship Ⅴ altered course to starboard to avoid collision.
In case (d), ship Ⅶ was overtaking ship Ⅷ within the 135° sector astern. To avoid collision, ship Ⅶ altered course to starboard and completed the overtaking maneuver.
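The four encounter situations above can be distinguished by the target ship's bearing relative to own heading. A minimal classifier is sketched below: the (247.5°, 355°) port-crossing sector and the 135° overtaking sector follow the text, while the ±5° head-on tolerance and the (5°, 112.5°) starboard-crossing sector are illustrative assumptions, not values taken from the paper.

```python
def classify_encounter(relative_bearing_deg: float) -> str:
    """Classify an encounter by the target ship's relative bearing,
    normalized to [0, 360). Sector boundaries are partly assumed."""
    b = relative_bearing_deg % 360.0
    if b < 5.0 or b >= 355.0:
        return "head-on"            # assumed +/-5 degree tolerance
    if b < 112.5:
        return "starboard crossing" # assumed (5, 112.5) sector
    if b < 247.5:
        return "overtaking"         # 135-degree sector astern
    return "port crossing"          # (247.5, 355) per the text
```

Under COLREGs, the classification then determines the give-way obligation, e.g. a starboard-crossing give-way ship alters course to starboard.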
From Figure 12, the ships' motion trajectories and collision avoidance behaviors can be observed. The results show that each ship followed the COLREGs and reached its destination, indicating that the trained agent can complete the collision avoidance task. According to Figure 13, the detailed information of each ship's navigation can be obtained. Since the rudder angle and speed of ship Ⅰ and ship Ⅱ are the same, only the information of ship Ⅰ is shown. The rudder angle of each ship is limited to [−30°, 30°], and the velocity is limited to [0, 7.5 m/s]. After detecting collision risk, each ship adopts a different rudder angle and velocity to resolve the danger.
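Collision-risk detection of the kind described above is commonly based on closest-point-of-approach geometry. The sketch below is a simplified stand-in for illustration only; it is not the paper's full ORCA half-plane construction, and the `safe_dist` and `horizon` thresholds are arbitrary example values.

```python
import math

def cpa(own_pos, own_vel, tgt_pos, tgt_vel):
    """Return (DCPA, TCPA): distance and time at the closest point
    of approach, from relative position and velocity."""
    rx, ry = tgt_pos[0] - own_pos[0], tgt_pos[1] - own_pos[1]
    vx, vy = tgt_vel[0] - own_vel[0], tgt_vel[1] - own_vel[1]
    v2 = vx * vx + vy * vy
    if v2 == 0.0:                      # identical velocities: range is constant
        return math.hypot(rx, ry), 0.0
    tcpa = max(0.0, -(rx * vx + ry * vy) / v2)
    dcpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return dcpa, tcpa

def at_risk(own_pos, own_vel, tgt_pos, tgt_vel,
            safe_dist=100.0, horizon=60.0):
    """Flag a collision risk when the closest approach is both
    too near and too soon (example thresholds)."""
    dcpa, tcpa = cpa(own_pos, own_vel, tgt_pos, tgt_vel)
    return dcpa < safe_dist and tcpa < horizon
```

For two head-on ships closing at 10 m/s from 500 m apart, `at_risk` fires (DCPA 0, TCPA 50 s); ORCA would then restrict each ship's velocity to a half-plane of safe choices.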

Simulation for Multi-Ship Collision Avoidance
To verify the scalability of the algorithm, scenarios with three and four encountering ships are set up. The origin and destination of each ship are shown in Table 5.

Three Ships Collision Avoidance Scenarios
According to Figure 6, the three-ship encounter scenario was created. The performance of the algorithm is evaluated from two aspects: whether each ship can avoid collision while complying with COLREGs, and whether each ship can reach its destination. In Figure 14, the three ships were at risk of collision in the central area. To avoid collision, all three ships altered course to starboard while complying with COLREGs, and their trajectories were safe and smooth. Figure 15 illustrates the change in their speed and rudder angle. In conclusion, the CA-QMIX algorithm can ensure compliance with the COLREGs under the premise of successful collision avoidance. In addition, the proposed method demonstrated excellent collision avoidance performance, flexibility across application scenarios and scalability potential.

Four Ships Collision Avoidance Scenarios
To prove the scalability of the algorithm, we set up this complex simulation scenario, as shown in Figure 16; the origin and destination coordinates of each ship are shown in Table 5. The four ships all navigate toward a central point, so if they do not adopt appropriate collision avoidance behaviors, they will collide with each other in the central area. Figure 17 illustrates the simulation process of collision avoidance in the four-ship encounter situation. Initially, the four ships sailed toward their destinations along straight lines. When they reached the positions in the first panel, they detected the collision risk, and the command to alter course to starboard was issued. After arriving at the positions in the second panel, the four ships moved along circular arcs to pass through the hazardous area. By the positions in the third panel, the risk of collision had been resolved, and the ships changed course to return to their destination headings more quickly. The rudder angle and velocity of the four ships during this process are also shown in Figure 17.
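A symmetric scenario of this kind, where every route crosses the center, can be generated by placing each ship's destination diametrically opposite its origin. The coordinates below are illustrative only and do not reproduce the values in Table 5.

```python
import math

def symmetric_scenario(n_ships=4, radius=300.0):
    """Place n ships evenly on a circle of the given radius, each
    bound for the diametrically opposite point, so that all straight-line
    routes intersect at the center and force mutual collision avoidance."""
    ships = []
    for k in range(n_ships):
        a = 2.0 * math.pi * k / n_ships
        origin = (radius * math.cos(a), radius * math.sin(a))
        destination = (-origin[0], -origin[1])  # opposite side of the circle
        ships.append((origin, destination))
    return ships
```

The same generator extends directly to the three-ship case (`n_ships=3`) or larger fleets, which is what makes such scenarios convenient for scalability tests.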

Complementary Simulation for Multi-Ship Collision Avoidance
In this section, further multi-ship encounter scenarios were set up to increase the validity of the proposed CA-QMIX method. To confirm that the collision avoidance of two ships complies with COLREGs in all scenarios, further simulations are shown in Figure 18. Building on the previous simulations, Figure 18 illustrates the collision avoidance process under various conditions.
Figure 19 illustrates the setup of the multi-ship simulation. Based on this section, we confirmed the successful performance of the CA-QMIX algorithm in various simulation scenarios through thorough performance evaluations. Therefore, it can be concluded that the proposed CA-QMIX algorithm enables ships to avoid collision and reach their target destinations in different scenarios. It also indicates that this study provides a more versatile decision-making model for intelligent ship behavior.

Conclusions
In this study, an intelligent ship behavior decision-making method is proposed for multi-ship collision avoidance based on the multi-agent reinforcement learning algorithm, which could ensure the safety of a ship's voyage in different multi-ship encounter scenarios.
To reduce collision risk, this paper performs collision detection using the ORCA (Optimal Reciprocal Collision Avoidance) method. The proposed algorithm adopts the CTDE framework and the DEC-POMDP model to train the agents to avoid collision. The multi-ship model was then trained on rich encounter situations based on the CA-QMIX algorithm. Furthermore, multi-ship collision avoidance also needs to comply with COLREGs; hence, we designed a novel reward function for solving the collision problem so that course changes comply with COLREGs. To improve the efficiency of training, the procedure of the reward functions was defined. Subsequently, multiple ships were trained on different scenarios defined by COLREGs.
In the simulations, the proposed algorithm was validated in various simulated scenarios, and its performance was evaluated by the navigational trajectories, the rudder angle and the speed. The simulation results on the various scenarios indicated that the proposed algorithm could implement multi-ship collision avoidance while complying with COLREGs and incorporating rudder characteristics. The algorithm demonstrated its flexibility and scalability and can therefore be applied to a wide range of tasks.
For future work, we will focus on improving the mathematical model of ship motion. Model-based multi-agent reinforcement learning can achieve good sample efficiency and stable performance. For use in the real world, a more accurate model will be integrated to enhance the maneuverability of a ship. In addition, the collision-free path should be optimized to improve navigational efficiency. Finally, a hardware simulation will be implemented to verify the feasibility of multi-ship collision avoidance, and the simulation results will be compared with those of other relevant methods.
Funding: The paper is partially supported by the National Natural Science Foundation of China (Nos. 51409033 and 52171342) and the Fundamental Research Funds for the Central Universities (No. 3132019343). The authors would like to thank the anonymous reviewers for their valuable comments.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.