Dynamic Navigation and Area Assignment of Multiple USVs Based on Multi-Agent Deep Reinforcement Learning

The unmanned surface vehicle (USV) has attracted more and more attention because of its basic ability to perform complex maritime tasks autonomously in constrained environments. However, the level of autonomy of one single USV is still limited, especially when deployed in a dynamic environment to perform multiple tasks simultaneously. Thus, a multi-USV cooperative approach can be adopted to obtain the desired success rate in the presence of multi-mission objectives. In this paper, we propose a cooperative navigating approach by enabling multiple USVs to automatically avoid dynamic obstacles and allocate target areas. To be specific, we propose a multi-agent deep reinforcement learning (MADRL) approach, i.e., a multi-agent deep deterministic policy gradient (MADDPG), to maximize the autonomy level by jointly optimizing the trajectory of USVs, as well as obstacle avoidance and coordination, which is a complex optimization problem usually solved separately. In contrast to other works, we combined dynamic navigation and area assignment to design a task management system based on the MADDPG learning framework. Finally, the experiments were carried out on the Gym platform to verify the effectiveness of the proposed method.


Introduction
Nowadays, various kinds of unmanned robots are developing rapidly with the arrival of the 5G era. Typical robots include the autonomous underwater vehicle (AUV), unmanned surface vessel (USV), unmanned ground vehicle (UGV), and unmanned aerial vehicle (UAV) [1][2][3][4], which play important roles in the artificial intelligence (AI)-enabled nextgeneration (6G) network. As an important unmanned marine tool, the USV is also a promising technique to provide wireless communication due to its low cost and high flexibility [5,6]. Especially when it comes to complex tasks, such as maritime joint search and rescue, maritime multi-target search, the construction of marine information networks, and other missions. However, it is difficult to rely on one single USV to complete the task, while multi-USV cooperation can be a good solution to the problem. This paper studies the multi-USV dynamic path planning and area assignment problem, which will effectively improve the autonomy and reliability of the USV in the increasing field.
As a generic technology among different robotics systems, the optimization objectives mainly include path length, time or energy consumption, risk measure, maneuverability, etc. They are classified into four kinds based on the model of consumption space, i.e., the grid-based methods [7][8][9][10] solved by the heuristic strategies, such as the A* algorithm, D* algorithm, Dijkstra algorithm, and Q-learning algorithm in reinforcement learning theory, or intelligent strategies represented by particle swarm optimization and evolutionary algorithms; the sampling-based methods such as probabilistic roadmap, rapidly-exploring random tree (RRT) [11][12][13], or deep reinforcement learning method represented by experience replay based DQN algorithm; the mathematical optimization methods such as mixed integer linear programming (MILP) and model predictive control (MPC) [14][15][16]; and the potential field methods such as the artificial potential field (APF) and the interfered fluid dynamical system (IFDS) [17][18][19].
In addition to the classification mentioned above, when the purpose of agents is taken as the dividing boundaries, the path planning problem is divided into the single agent path planning problem (SAPP) and the multi-agent path planning problem (MAPP). For the requirement of accessing a series of intermediate points during the monitoring mission in SAPP, Ning Wang et al. [20] proposed a successive waypoints tracking method using a BT-guided model-free solution to solve the SWT problem. Yong et al. [21] proposed the dynamic augmented multi-objective particle swarm optimization algorithm to solve the problem that obstacles and water flows exist at the same time to find the shortest and safest route to the target, which is subject to collision avoidance. The rapidly random-exploring tree (RRT) algorithm and its variants are some of the more popular path planning methods in SAPP. However, they suffer sensitivity to the initial solution, which requires a lot of memory and time to converge to the optimal solution. In order to solve this problem, Wang et al. [22] propose the NRRT* to achieve nonuniform sampling in the path planning process by learning quantities of successful planning cases from the A* algorithm. Thus, the sampling process is guided, and the efficiency of the algorithm is improved. The sampling mechanism is also used in reinforcement learning which has been prevalent in recent years. In [23], Tom Schaul et al. proposed the prioritized experience replay method to speed up the convergence of the training process, and the effectiveness was verified on the Gym platform.
For the multi-objective optimization problem in SAPP, Hongqiang Sang et al. [24] proposed a novel deterministic algorithm named the multiple sub-target artificial potential field (MTAPF) based on the heuristic A* algorithm. The optimal path is divided by this algorithm into multiple sub-target points to form a sub-target point sequence. Ning Wang et al. [25] proposed a multilayer path planner (MPP) with global path-planning (GPP), collision avoidance (CA), and routine correction (RC) for an unmanned surface vehicle (USV). In addition, some methods inspired by these problems or traditional search strategies are suggested, such as the bridge access path-planning method [26,27].
However, it is difficult to complete tasks only by relying on one single agent in certain scenarios, such as maritime joint search and rescue, maritime multi-target search, the construction of the marine information network, etc. Thus, the MAPP approach becomes another effective method to deal with complex multi-objective problems. Yu Wu et al. [28] proposed a new cooperative path planning algorithm based on an improved particle swarm optimization (IPSO) algorithm aimed at maximizing the search space, minimizing the terminal error, and generating paths in a centralized or distributed mode. In the same way, Pradhan B. et al. [29] realized multi-robot navigation tasks by combining particle swarm optimization (PSO) with the feed forward neural network (FFNN). To optimize search capability, Yu Wu et al. [30] proposed a clustering improved ant colony optimization (CIACO) algorithm, which strengthens the global and local search ability in the early and later phases of iterations. Xinghai Guo et al. [31] proposed a chaotic and sharing-learning particle swarm optimization (CSPSO) algorithm. The path planning problem is divided into two stages: global path planning and path control, to solve the extended TSP, and the nonlinear multi-objective model. For the multi-objective joint optimization problem, Milad Nazarahari et al. [32] proposed an enhanced genetic algorithm (EGA) to improve the initial paths in continuous space and jointly optimize the path length, smoothness, and safety of the agent.
However, the alternating algorithm may not converge as the number of optimized variables of USV is increased [33]. In addition, the optimized results can only be used for the current environment, while, when the environment changes, the proposed optimization algorithms can become invalid.
Deep reinforcement learning (DRL), as a branch of artificial intelligence (AI), provides an alternative solution for such complex optimization problems, such as resource allocation for V2V communications [34], the stochastic shortest path (SSP) problem [35], and mode selection and resource allocation for the fog radio access networks [36]. The DRL method can deal with a large state space and time-varying environments [37].
However, in the multi-agent reinforcement learning scenario, each agent is unstable and the environment is changing. In this paper, we propose to use the multi-agent deep deterministic policy gradient (MADDPG) algorithm to solve the area assignment and dynamic trajectory design problem in a multi-USV cooperation task. The algorithm utilized centralized training within the decentralized execution framework [38]. Each USV can be regarded as an agent. The contributions of this paper can be summarized as follows:

•
We propose a multi-agent DRL method for the joint optimization problem in the multi-USV cooperation scenario, where they share the common reward function to achieve the maximum success rate of the system. The number of UAVs can be arbitrary in the proposed MADRL algorithm, while the conventional methods can only deal with the simple case, i.e., no more than two USVs; • We consider a scenario in which obstacles change position at fixed intervals and the USV needs to adjust its actions in real-time to avoid collisions with obstacles and other USVs, ultimately achieving the task of dynamic trajectory design and area assignment. Our algorithm is designed for 2D space, in which the trajectory of the UAV can be shown in 2D; • We develop an efficient task management framework, which adopts centralized training and the decentralized execution method. In order to improve the learning ability of the agent and better adapt to the environment, we add the "soft update" mechanism in MADDPG, so that the target network can better track the learning policy. All USVs work in a cooperative way to achieve the reasonable allocation of target areas. In order to maximize cooperative rewards, all experiences are trained together, while all behaviors are at the disposal of the USV itself. The output of the actor network is the action of the USV, which is based on the USV's own observation.
The rest of the paper is organized as follows: In Section 2, we present the system model and problem formulation. Section 3 demonstrates the MADDPG method for the cooperative multi-USV network. Section 4 shows the simulation results and discussions. Finally, Section 5 elaborates on the conclusion of this paper.

System Model
This paper considered a multi-USV cooperative dynamic navigation and area allocation scenario with random obstacles on the sea surface, K USVs, N obstacles, and K target areas, as shown in Figure 1. Each USV departs from the same point and adjusts the target area to be reached according to the safety of the surrounding environment and the distance required to reach the target area. Specifically, each USV should keep a certain distance from other USVs during navigation to ensure communication as well as safety. At the same time, when encountering obstacles, it can smoothly avoid obstacles under the premise of keeping the maximum communication distance with other USVs. In order to simulate the real dynamic port situation, obstacles involved dynamic obstacles and static obstacles, among which dynamic obstacles would change their positions randomly after each training episode.
Next, we will describe the multi-USV system and the problem formulation for the proposed multi-USV cooperative dynamic navigation and area allocation scenario.

Multi-USV System
We considered a two-dimensional environment in which K USVs were represented by U , each USV's position at time t is denoted by ( ) and the path length by Thus, the discrete dynamics model of each USV can be expressed as:

Problem Formulation
In the given system model, our goal is to enable all USVs to autonomously navigate to their respective target areas by optimizing all trajectories, and each target is only assigned to one USV. Then, we have where T represent all assignable regions. All USVs can navigate by adjusting their positions dynamically to avoid static and dynamic obstacles, which can be expressed as

Multi-USV System
We considered a two-dimensional environment in which K USVs were represented by U, each USV's position at time t is denoted by x U i (t), y U i (t) , and the matching target is represented by T i , among which the obstacle is denoted by O i , and the positions of targets and obstacles are represented by For an arbitrary USV, the path taken can be denoted by where v t i and p t i , respectively, represent the speed and position of the i th USV at time t, and ∆t represents the sampling period where F is a force vector. The distance between the USV and the obstacle is represented by d O i , and the distance between USVs is represented by d U i , while the distance from the USV to the matching target is represented by d T i .

Problem Formulation
In the given system model, our goal is to enable all USVs to autonomously navigate to their respective target areas by optimizing all trajectories, and each target is only assigned to one USV. Then, we have where T represent all assignable regions. All USVs can navigate by adjusting their positions dynamically to avoid static and dynamic obstacles, which can be expressed as During the mission, each USV needs to maintain a certain distance to stay within communication distance. We utilized Z to describe the communication graph of K USVs, where E is the edge set, defined as where ρ c is the maximum communication distance. In other words, when the distance between two USVs is less than the maximum communication distance, there will be an undirected edge between them, and they can communicate with each other. Accordingly, each USV collaboratively adjusts its position and selects the target area based on the hazards of its environment. The collision-free path is then formulated as: Thus, in this article, a multi-agent deep reinforcement learning-based trajectory design and area assignment algorithm are proposed.

MADDPG Approach for Cooperative Multi-USV Network
The multi-USV collaboration optimization issue is challenging as it requires a joint kinetic USV trajectory and area assignment. In fact, this is NP-hard, and the approach of exhaustion is usually invalid for such a multiple USVs scenario. As the quantity of USVs increases, the calculation intricacy will elevate remarkably. As far as we know, there is research utilizing the traditional optimization algorithm to tackle this problem in these intricate settings. The MARL can solve this dynamic multi-objective optimization problem effectively and accurately. Therefore, in this section, we will delineate how to utilize the MARL approach to tackle the multi-USV collaboration issue. In Section 3.1, we formulate the proposed problem as a Markov process and define the agent, state, action, and reward separately. Section 3.2 presents the MADDPG algorithm for the multi-USV joint optimization scenario.

Markov Game for Multi-USV Cooperation
To study the best behavior mode in given states, we employ a Markov decision process (MDP) [39] to formulate the AUV motion planning, as the Markov decision process (MDP) offers a mathematical framework for simulating the randomized strategies that can be implemented with Markov states in a given scene. The method is modeled as tuples (S, A, R, P, γ), and in this case, the tuples are action set A, state set S, state transition probability P, discount factor γ, and reward function R of the system. The action selection of the agent, a functional mapping from each state s t i ∈ S to action a t i ∈ A, is modeled as policy π. The value function v π (s) of the process is defined as the expected sum of discount rewards that act continuously from and along with state s. In case a policy π can attain the best from any initial state s t i ∈ S, we detect it as an optimal policy π * . This is the target of this training of the deep neural network with a deep reinforcement learning module.
Agent: Each USV can be considered an agent. USVs maintain a certain distance from each other during navigation to facilitate communication, and each USV obtains a state containing its own information and the information of the surrounding environment in the process of interaction with the environment. The environment in this problem is completely observed; therefore, the observations are equivalent to the state. Each actor has its own actor and critic network which act as the executor and evaluator of the policy, respectively. Each agent observes its own state, takes actions according to its own strategy, and then obtains rewards from the environment to reach the next state, as shown in Figure 2. Action: The action for each agent is the output of its actor network, which is expressed by This is a Gaussian distribution where i v is composited by the velocities of the x and y axes. An action is executed by the formula a is a displacement vector and t is a random noise. The action is determined by the position of the target area and the risk factors of the surrounding environment.
Reward: Each USV obtains its own reward, which depends on the current state, current actions, and the next state ( ) ,, s a s  at each time slot t. In our proposed multi-USV scenario, the reward consists of four parts: communication distance max l limitation penalty, distance penalty, threat area penalty, and collision penalty. We utilized l R to represent the communication limitation reward, which depended on the next state of the agent. There will be a large minus penalty l R − for the agent if the agent is out of the communication space in the next state. The space limitation is set as follows  State: The state of any agent is a one-dimensional vector with five components, The first two elements represent the current 2D position of the agent, the third entry d U i represents the distance between agents, and the fourth entry d O i represents the distance between agent i and the surrounding obstacles. The fifth element d T i represents the distance from agent i to the current matching target which can be changed with the changes in the USV's selection.
Action: The action for each agent is the output of its actor network, which is expressed by ν i = u i (o i ). This is a Gaussian distribution where v i is composited by the velocities of the x and y axes. An action is executed by the formula where → a is a displacement vector and N t is a random noise. The action is determined by the position of the target area and the risk factors of the surrounding environment.
Reward: Each USV obtains its own reward, which depends on the current state, current actions, and the next state (s, a, s ) at each time slot t. In our proposed multi-USV scenario, the reward consists of four parts: communication distance l max limitation penalty, distance penalty, threat area penalty, and collision penalty. We utilized R l to represent the communication limitation reward, which depended on the next state of the agent. There will be a large minus penalty −R l for the agent if the agent is out of the communication space in the next state. The space limitation is set as follows 0 ≤ X k,t ≤ l max , ∀k ∈ K, t ∈ T and 0 ≤ Y k,t ≤ l max , ∀k ∈ K, t ∈ T where X k,t and Y k,t represent the horizontal and vertical coordinates of USV, respectively. In the same way, we denote R k,o as the collision penalty, which can be described as where z k,o,t = 1 means there has been a collision, and R k,o (t) is the collision penalty of USV. Additionally, we set up a threat area penalty R k,r to maintain a certain communication distance between the USVs and prevent collisions. Then, we have where d min(u i ,u j ) = agent.size + agent.size, σ represents the width of critical area of agents, and d u t i ,u t j represents the distance between agents. In order to guide the USV closer to the target, we give the USVs a reward dependent on the distance between the USV and the target as where R k,t is the distance reward for USV, and (x i − x t ), (y i − y t ) are the coordinates of the USV. Moreover, k dis is the hyperparameters set manually. Thus, the whole reward R w,t is

Multi-Agent DDPG Approach
In the collaborative MARL scenario, each agent engages in interactions with the environment and acquires team rewards to facilitate collaboration. Single-agent reinforcement learning algorithms, such as proximal policy optimization (PPO) and the deep Q network (DQN), can be straightly utilized to deal with collaborative scenarios by letting each agent learn its optimum Q function in an independent manner. Nevertheless, the environment can be unstable from the perspective of any agent. To tackle the unstable issue in MARL, we utilized the MADDPG algorithm [38] learning a centralized Q-function for each agent as per the global information. When all the actions of the agents are known, the environment is stable. This is due to P s | s, a 1 , . . . , a L , π 1 , . . . , π L = P s | s, a 1 , . . . , a L = P s | s, a 1 , . . . , a L , π 1 , . . . , π L (12) for any π i = π i , in which π i is the policy of agent i.
The aim of each agent is to select the beneficial policy that maximizes the accumulated reward J(θ) = E s∼p π ,a∼π θ [R] prior to the description of the MADDPG arithmetic. Let us start with the policy gradient (PG) algorithm utilized for the continuous control issue. The crucial goal of the PG is to straightly adjust the parameter θ of policy π at the orientation of ∇ θ J(θ), which is expressed as where the policy π θ is stochastic. At each step, the action is sampled as per the conditional probability density π θ (a|s). When the action space dimension is very large, the PG algorithm may need more samples, which will induce a remarkable computational challenge. Unlike the random policy requiring the exploration of the full state and action space in (12), the deterministic policy gradient (DPG) takes a deterministic policy into consideration µ θ : S → A , which merely needs to integrate over the state space, as It is obvious from (14) that the DPG arithmetic prevents the integral over the entire action space, which can decrease the computational intricacy and ameliorate the training efficiency. In contrast to PG, DPG can tackle the difficult enhancement issue with highdimensional actions.
As the extension of DPG, DDPG adopts the deep neural network (DNN) to approximate the policy µ and critic Q µ (s, a). Nevertheless, the updated network Q µ s, a θ Q cannot be straightly utilized for estimating the target value, which will induce the update Q divergent. Hence, the DDPG algorithm utilizes the soft target updates rather than straightly copying the weights of the update Q.
Specifically, DDPG utilized a copy of the actor and critic networks to calculate the target value, which is denoted as Q (s, a) and µ θ (s), respectively. For the sake of suiting the environment and ameliorating the exploration efficiency, another trick of DDPG is to add stochastic noise into the actor policy, as presented in Algorithm 1. Initialize Initialize: Actor network µ, critic network Q with weights θ µ , θ Q . Initialize: Target actor, critic network: µ , Q , with weights θ µ ← θ µ , θ Q ← θ Q . Initialize: Replay buffer D with capacity C, exploration counter. Counter = 0 3: for step t = 1 to T do 4: if Counter < C then 5: each USV i randomly chooses a i ; 6: 7:  a 1 , a 2 , . . . , a N )

16:
Updating actor networks Updating critic networks Update target networks with updating rate τ:

17: end for 18: end for
Now we resort to the MADDPG algorithm that extends DDPG to the multi-agent scenario. Specifically, a multi-agent game with L agents, µ θ = µ θ 1 , . . . , µ θ L , which is  a 1 , a 2 , . . . , a L ) a 1 , a 2 , . . . , a L ) is a centralized function with the inputs of all the agents' action a = {a 1 , . . . , a L }, as well as the observation information o. G = {o 1 , o 2 , . . . , o L } represents the information that the environment feeds back to the agent, including the distance between agents, the distance between agent i with surrounding obstacles, and the distance from agent i to each targets, all of which constitute the observation of all agents that are denoted by x, y, In our networks, the inputs of the critic network are the sates of all the USVs, the distance between USVs, as well as distance from the USV to the target area. The output of the critic network, i.e., Q µ i (o, a) is returned to the actor network for evaluating the action, which can be seen in Figure 2. The actor i executes the action based on its own policy µ θ i , and the critic uses the Q µ i function to evaluate the action of the actor. In order to remove the correlation of the samples generating from the environment, the framework adopts the experience replay mechanism to store the experience of each agent, which is a finite buffer with capacity M expressed as the tuple (o, o , a 1 , . . . , a L , r 1 , . . . , r L ). When the replay buffer is full, the oldest samples will be discarded. In order to better adapt to the environment and improve the stability of exploration, we updated the target network in a "soft update" way, so that it can slowly track the learning strategy, which can be expressed as θ ← τθ + (1 − τ)θ . In each time step, a batch size of tuples was selected to train the network, and the critic-network was updated by minimizing the loss function as where The update policy of the actor-network is expressed as where m and k represent the minibatch size and the experience index, respectively.

Simulation Results
In this section, we present the simulation results as well as the performance analysis of the proposed algorithm. In Section 4.1, the settings for the simulation are developed. Section 4.2 presents the algorithm training configurations. In Section 4.3, we present the detailed figures for the reward, collision rate, and trajectories.

Environment Settings
We consider a 400 * 400 square training environment based on the open AI platform, which consists of USVs, targets, and threat areas. The geometric coordinate system is established with the environmental center as the origin of the coordinates. As the target area and static obstacle are regarded as parts of the environment, their settings would not be changed during the training process. In contrast, the dynamic obstacle changes position randomly after each training episode. In our settings, the maximum communication distance ρ c is set to 0.9 and the width of critical area σ is set to 0.1. In this coordinate system, the size of the threat area is set to 0.25, while the size of the agent and target area are set to 0.04 and 0.20, respectively. We present two indicators to measure the effectiveness of training: the collision rate of USVs' collisions during the current episode and the collision rate between the agent and the critical area during the total episode.

Training Configuration
In the given scenario, we set 60 episodes with 1000 steps each during the training process. As the mission is continuous, we manually aborted the training process in each training, and the environment was reset automatically after each episode. Since the capacity of the memory buffer we set was 3000, the training process did not start until the buffer was full, thus each USV adopted a random policy in the first 3000 steps, and then the whole network started to be trained in each step afterward.
The whole training networks of the mission planning for the AUV are constructed based on the PyTorch framework, where the training process is implemented on the NVIDIA GeForce RTX 3060 GPU. The critic network consists of five fully connected layers, including three hidden layers, of which the hidden units are set as 64, 52, and 30, respectively. The ReLU function is adopted as the non-linearity function in the networks. Specifically, the first hidden layer with 64 neurons obtains the concatenated vector of observations o and actions a, and the data then proceed through the following second and third hidden layers, with 52 and 30 neurons, respectively. The whole process is offline and can be divided into two parts: off-line centralized training and on-line decentralized execution. The advantage of using this method is that the system will have competitive execution speeds compared to the traditional methods during the testing stage, and the calculation could be gathered in the training stage. Figure 3 shows the episode rewards obtained by USVs under different algorithms. The RL algorithm based on the policy gradient is used as the benchmark comparison algorithm. The experimental setting is for two USVs to perform navigation and assignment tasks with two target areas under the condition of no obstacles. It can be seen that the MADDPG has a significant advantage over the DDPG and REINFORCE algorithms, as it can get a higher reward, which proves the effectiveness of the proposed algorithm in solving multiple USVs' navigation and area allocation tasks.   Figure 4 shows the collision rate of USVs in different task environments. We observe the experimental effects when the number of USVs is 4, 3, and 2, and the number of target areas is 1, 2, 3, and 4, respectively. It can be seen from the figure that with the increase in training times, the collision rate of USVs in each turn will gradually decrease and tend to converge. At the same time, the total number of collisions will gradually decrease and stabilize, as can be seen in Figure 5.  Figure 4 shows the collision rate of USVs in different task environments. We observe the experimental effects when the number of USVs is 4, 3, and 2, and the number of target areas is 1, 2, 3, and 4, respectively. It can be seen from the figure that with the increase in training times, the collision rate of USVs in each turn will gradually decrease and tend to converge. At the same time, the total number of collisions will gradually decrease and stabilize, as can be seen in Figure 5. Figure 4 shows the collision rate of USVs in different task environments. We observe the experimental effects when the number of USVs is 4, 3, and 2, and the number of target areas is 1, 2, 3, and 4, respectively. It can be seen from the figure that with the increase in training times, the collision rate of USVs in each turn will gradually decrease and tend to converge. At the same time, the total number of collisions will gradually decrease and stabilize, as can be seen in Figure 5.     Figure 4 shows the collision rate of USVs in different task environments. We observe the experimental effects when the number of USVs is 4, 3, and 2, and the number of target areas is 1, 2, 3, and 4, respectively. It can be seen from the figure that with the increase in training times, the collision rate of USVs in each turn will gradually decrease and tend to converge. At the same time, the total number of collisions will gradually decrease and stabilize, as can be seen in Figure 5.   Figure 6 shows the experimental conditions of the USV reward. It can be seen that in the early stage of the training, because of the blindness of the exploration stage, the reward  Figure 6 shows the experimental conditions of the USV reward. It can be seen that in the early stage of the training, because of the blindness of the exploration stage, the reward for each USV will undergo a dramatic decline in the lowest (about 500 rounds), but with ongoing exploration, the neural network with gradually learn the best strategy. for each USV will undergo a dramatic decline in the lowest (about 500 rounds), but with ongoing exploration, the neural network with gradually learn the best strategy.  Figure 7 shows the dynamic trajectory of USVs in different test environments. Obstacles will move randomly after each training round, while the position of the target area remains unchanged. The number of USVs and target areas can be set manually. When encountering obstacles, it will adjust the direction according to the security of the surrounding environment and the position of the target area. When approaching the target area, it will automatically coordinate the area to be reached according to the position of each target area to complete the task assigned by the area.  Figure 7 shows the dynamic trajectory of USVs in different test environments. Obstacles will move randomly after each training round, while the position of the target area remains unchanged. The number of USVs and target areas can be set manually. When encountering obstacles, it will adjust the direction according to the security of the surrounding environment and the position of the target area. When approaching the target area, it will automatically coordinate the area to be reached according to the position of each target area to complete the task assigned by the area. encountering obstacles, it will adjust the direction according to the security of the surrounding environment and the position of the target area. When approaching the target area, it will automatically coordinate the area to be reached according to the position of each target area to complete the task assigned by the area.

Conclusions
In this paper, a multi-agent reinforcement learning framework is proposed to solve the trajectory design and area assignment for the multi-USV cooperation scenario. In this framework, each USV needs to select the closest target area so as to obtain the shortest route, during which each USV should keep a certain distance from others to keep within the communication distance and avoid collisions. All UAVs use the MADDPG to try to find the optimal policy to obtain a better reward, while traditional optimization methods can hardly handle the scenario with a large number of UAVs. Thus, the proposed method can be applied to the trajectory design and area assignment for multi-UAV scenarios. The simulation results show the effectiveness of the proposed framework.

Conclusions
In this paper, a multi-agent reinforcement learning framework is proposed to solve the trajectory design and area assignment for the multi-USV cooperation scenario. In this framework, each USV needs to select the closest target area so as to obtain the shortest route, during which each USV should keep a certain distance from others to keep within the communication distance and avoid collisions. All UAVs use the MADDPG to try to find the optimal policy to obtain a better reward, while traditional optimization methods can hardly handle the scenario with a large number of UAVs. Thus, the proposed method can be applied to the trajectory design and area assignment for multi-UAV scenarios. The simulation results show the effectiveness of the proposed framework.