Reinforcement-Learning-Based Asynchronous Formation Control Scheme for Multiple Unmanned Surface Vehicles

Abstract: The high performance and efficiency of multiple unmanned surface vehicles (multi-USV) promote further civilian and military applications of coordinated USVs. As the basis of multiple USVs' cooperative work, considerable attention has been paid to developing decentralized formation control for USV swarms. Formation control of multiple USVs is a geometric problem of multi-robot systems; the main challenge is how to generate and maintain the formation of a multi-robot system. The rapid development of reinforcement learning provides a new way to address these problems. In this paper, we introduce a decentralized structure for the multi-USV system and employ reinforcement learning to address the formation control of a multi-USV system in a leader-follower topology. Specifically, we propose an asynchronous decentralized formation control scheme based on reinforcement learning for multiple USVs. First, a simplified USV model is established. Simultaneously, the formation shape model is built to provide formation parameters and to describe the physical relationship between USVs. Second, the advantage deep deterministic policy gradient algorithm (ADDPG) is proposed. Third, formation generation policies and formation maintenance policies based on the ADDPG are proposed to form and maintain the given geometric structure of the team of USVs during movement. Moreover, three new reward functions are designed and used to promote policy learning. Finally, various experiments are conducted to validate the performance of the proposed formation control scheme. Simulation results and contrast experiments demonstrate the efficiency and stability of the scheme.


Introduction
Due to the rapid development of communication, navigation, and computer technology related to ship motion control, cooperative ship control has an extensive range of application prospects in military and production fields, including fleet cooperative combat, ocean-going replenishment, environmental monitoring, oil and gas detection, etc. Because of their higher operational security, lower cost, and greater autonomy and flexibility, unmanned surface vehicles (USVs) are applied to perform extensive missions in hazardous maritime environments instead of manned vehicles [1]. Compared with a single USV, the cooperation of multiple USVs has the advantages of strong adaptability and fault tolerance. The fleet can form a dynamic network during navigation. Through division of labor and cooperation, each USV can perceive environmental information about the area quickly and accurately, accelerating the completion of missions and improving the efficiency of the system. Formation control is the most fundamental problem of multi-USV cooperative control and has therefore become one of the hot issues in research on USV motion control. A collective scheme is necessary to ensure that the USVs work together to complete a common task and coordinate in time and space. To date, many scholars have studied the formation control problem of multi-USV systems. Formation control problems are summarized into two fundamental problems: (1) formation generation, which refers to how to form a designated formation [2,3], and (2) formation maintenance, which refers to how to keep the formation unchanged during movement [4].
USV formation control originated from the study of biological cluster dynamics, which can be traced back to the Boid model proposed by Reynolds [5]. Based on this model, Olfati-Saber [6] extended multi-agent consistency work to the general swarm formation control field, introduced obstacle avoidance and tracking agents, and designed a distributed control framework including a gradient-based term, a velocity consensus term, and navigational feedback. Su et al. [7] further developed the collective formation control strategy based on the work of Olfati-Saber, using a virtual leader to replace the actual leader. Ponomarev et al. [8] proposed a consistency control method based on a predictive mechanism to accelerate the convergence of multi-agent consistency. Chen et al. [9] proposed collective circular motion behavior control of heterogeneous multi-agents on arbitrary closed curves. In research on multi-agent formation control, most researchers treat the model as a first-order or higher-order linear system [6][7][8]. Taking into consideration the dynamic characteristics (nonlinearity, coupling, underactuation, etc.) of the robots in a multi-robot formation, it is often difficult to analyze the formation directly as such a system. To better achieve stability and efficiency, and at the same time support the theoretical analysis of multi-robot formations, researchers have proposed the following formation methods: the virtual structure method, the behavior-based method, and the leader-follower method [10]. The virtual structure method [11] is not flexible, and it is difficult to achieve obstacle avoidance with it. With the behavior-based method [12], it is hard to express the entire system in mathematical form and difficult to prove and guarantee the stability of the system. In contrast, the leader-follower method [13] is easy to design and implement and makes it easy to ensure stability. Its advantages are that the moving goal is only assigned to the leader to guide the movement of the agent swarm, and each member only needs to collect information about its immediate leader instead of the whole swarm. For example, when performing seafloor terrain scanning, hydrological sampling, target search, or resource detection, a team of robots performs tasks in formation, the path trajectory is assigned to the leader of the formation, and the other robots keep a certain distance (for example, the sonar detection radius) from their associates. The swarm can thus perform tasks with a fixed geometric structure, which improves work efficiency and safety.
The current fully actuated leaderless formation control algorithms and leader-follower formation control algorithms are all based on back-stepping [14][15][16]. The repeated derivation of the virtual control law in practical back-stepping designs brings a sharp increase in partial derivative calculations as the order of the system increases, which obviously increases the complexity of nonlinear system design. By interacting with the environment through a trial-and-error mechanism, reinforcement learning optimizes the policy by maximizing cumulative rewards and finally reaches the optimal policy. Other existing works only use the current optimal sample for updates, whereas reinforcement learning makes full use of historical samples to obtain the gradient descent direction based on the cumulative discounted reward. Since the cumulative discounted reward is based on all the existing samples, the sample information is used more fully, and the efficiency of policy learning is significantly improved. The combination of reinforcement learning and deep learning can provide optimal decision-making strategies for complex high-dimensional multi-agent systems and can lead to efficient task performance in challenging environments. The policy gradient method adopted and improved in this paper directly parameterizes the policy and optimizes trajectories from the initial state to obtain an optimal policy; it is a continuous function of the state-action value and is therefore well suited to continuous problems such as formation control. The advantage of the deterministic policy gradient is that fewer samples need to be drawn and the algorithm efficiency is high. One way of optimizing policies is to adjust the gradient toward the direction of "good" actions, and the advantage function is usually used to measure the quality of each action in each state. Therefore, we propose an improved DDPG based on the advantage function to train policies for formation control of a multi-USV system.
In this paper, to solve the above problems, an asynchronous formation control scheme based on reinforcement learning and leader-follower structure is proposed for multiple USVs. First, a USV model and a novel formation shape model are established. Second, the advantage deep deterministic policy gradient algorithm (ADDPG) is proposed and used to learn a formation generation policy, which is used to generate the formation according to the control requirements. Finally, a formation maintenance policy based on the ADDPG and the designed reward function is utilized to maintain the given geometry structure of the team of USVs during movement.
To summarize, the main contributions of this paper are threefold.
• Modeling for maritime formation control: We introduce a model for an underactuated USV, considering only its kinematics. Moreover, we propose a formation shape model to describe the physical relationship between USV members, including the relationship within the formation, the relative distance offsets, and the scaling coefficients.
• Formation control scheme: We propose an asynchronous decentralized formation control scheme for multiple cooperative USVs, in which we propose the ADDPG algorithm and design the reward functions for the formation control problem. Then, based on the required geometric shape of the USV team, decentralized formation generation policies and decentralized formation maintenance policies are trained with the ADDPG to generate the formation and keep the geometric shape, respectively.
• Performance validation: Evaluation criteria are designed to evaluate the performance of the proposed scheme. Extensive simulations verify the effectiveness of the proposed formation control scheme. The simulation results show that the proposed scheme achieves effective formation generation and stable formation maintenance.
The remainder of this paper is arranged as follows. In Section 2, we review the relevant research studies. In Section 3, we describe the system model. In Section 4, we present the formation control scheme. In Section 5, we verify the performance of the proposed formation control scheme by simulation. The paper is concluded in Section 6.

Related Works
In this section, we review the related works, including the formation control algorithms of the multi-robot system.
Fahimi [17] studied nonlinear model predictive control for the formation problem of USVs in environments with obstacles. Considering the decentralized geometric control strategy in the leader-follower structure, the underactuation of USVs, and environmental obstacles, a formation controller based on a real-time optimized nonlinear model predictive control method was designed to realize formation and obstacle avoidance. Do [18] presented a design of cooperative controllers for several agents under the constraints of sensing ranges and collision avoidance. Then, Do [19] discussed the formation control of underactuated ships subject to collision avoidance and communication constraints. An elliptic collision avoidance method was proposed, and a nonlinear coordinate transformation and additional control were introduced to control the underactuated ships. Simultaneously, a potential energy function was used in the controller design for collision avoidance between the ships.
Peng et al. [20] proposed a neural-network-based leader-follower formation controller for underactuated USVs subject to uncertain leader dynamics and environmental interference. The uncertain dynamics of the leader were approximated using only the line-of-sight range and angle measured by local sensors, and a control law that does not rely on an accurate model was designed. After that, an observer-based distributed formation controller was proposed; the formation controller based on neighbor information was designed using a neural network, back-stepping, and graph theory and was used to estimate the speed information [14]. Since then, to overcome problems such as model uncertainty, ocean noise, and the unpredictable speeds of leaders and followers, adaptive control, neural networks, high-gain observers, and minimum-learning-parameter algorithms have been combined into the back-stepping design; a new adaptive output feedback control scheme has been proposed, which realizes leader-follower formation based only on position and heading angle, with only two parameters to be learned online [21,22]. Ding et al. [23] proposed a distributed adaptive cooperative formation control strategy based on a virtual leader, designed a navigation system and an adaptive neural network synchronization controller that can compute a specified trajectory, and solved the problem of model uncertainty to achieve stable formation. Sun et al. [24] studied the formation control of USVs in a leader-follower structure, considering model uncertainty and dynamic disturbances of the environment.
Shojaei [25] proposed a leader-follower formation tracking controller for USVs affected by torque limitation and environmental noise. The saturation function was used to reduce the risk of driver saturation. The radial basis function and adaptive robust control technology were used to improve the robustness of the controller in a disturbance environment. On this basis, the formation method was developed into the three-dimensional formation control of underactuated underwater vehicles based on the neural network, and the nonlinear saturation observer was introduced to estimate the speed of the follower [26]. Sun et al. [27] considered the autonomous navigation of the leader-follower USV formation in a complex environment, and the predictive control based on the limited control set realized the USV team to reach the destination in a certain formation with internal collision avoidance under the condition of no prior knowledge of the environment and predefined trajectory.
In other respects, Breivik et al. [28] studied the leader-follower formation control problem of fully actuated ships and proposed a guided formation control method, which uses control, guidance, and synchronization algorithms to ensure that each individual can converge to and stay in its assigned formation position. Cui et al. [29] proposed an approximation-based control method to address the unknown uncertainty in the leader-follower formation control model of multi-AUV systems. Fan et al. [30] proposed a formation control strategy based on two-layer predictive control: one layer guarantees the leader-follower cooperative formation between USVs, and the other layer realizes the USVs' tracking of the optimal command. Park [31] addressed the asymmetry of the mass and damping matrices of underwater vehicles and the uncertainty of the hydrodynamic damping term, introduced an additional control input to handle the underactuation, and realized leader-follower control using only position information. Liu et al. [32] designed two different algorithms for formation forming and path planning based on a heading-guided fast marching algorithm, which solved the heading constraint problem of the USV and enabled the USVs to follow the planned trajectory and formation from any initial state. Sui et al. [33] presented a novel formation control with collision avoidance policy using imitation learning and reinforcement learning, but only one leader and one follower were considered.
The failure of the leader may affect the robustness of the whole swarm, so selecting the best leader from the swarm is an important issue in research on leader-follower formation control. Hou et al. [34] proposed a leader-follower formation with multiple changeable leaders and a switching distributed saturated control law, which enables the formation to work even if a leader fails. Based on the status of the multi-robot system evaluated by a fuzzy inference system, Li et al. [35] proposed an affection-based dynamic leader selection model to switch leaders autonomously. To handle the failure of the current leader, Li et al. [36] proposed a neuroendocrine system to evaluate and switch the leader autonomously. Considering the time-varying and fully decentralized structure of the leader-follower multi-agent system, Franchi et al. [4] proposed an online leader selection strategy to periodically select the best leader for the team during movement.
Xue et al. [37] introduced a supermodular optimization approach for fixed-size and minimum-size leader sets to select the optimal leaders for minimizing convergence error in leader-follower formation control.
Formation control of a multi-robot system controls multiple robots mainly based on preset inter-robot parameters, which determine the distance and orientation displacements among the robots during task execution [38][39][40][41][42][43]. Many formation strategies have been proposed to obtain and keep formation stability under different formation shapes. Aranda et al. [44] achieved local and global stability of formation, Oh et al. [45] obtained local stability of formation, and Lin et al. [46] ensured global stability of formation. Lin et al. [47] presented a formation method based on the complex Laplacian to achieve global stability via inter-agent relative displacements.

System Models
In this section, we analyze the system model of maritime cooperative formation for a team of USVs, as shown in Figure 1, including the USV model, the formation shape model, and the control objective.


USV Model
We consider a group of N USVs for formation control in the leader-follower structure, described as $U = \{u_1, u_2, \ldots, u_N\}$, with geometric shapes such as the V-shape shown in Figure 1. Because this paper focuses on how to design and train formation generation policies and formation maintenance policies for the USV team, we simplify each USV to a particle, mainly considering the kinematics model $\dot{p}_{u_i} = v_{u_i}$ of the particle and temporarily ignoring the impact of the kinetics model of the particle. Each USV $u_i$ has coordinates $p_{u_i}$ in the $O\text{-}XY$ coordinate frame in the two-dimensional maritime plane. At each time step, the action taken by USV $u_i$ is the change of velocity, $a_{u_i} = (\Delta v^x_{u_i}, \Delta v^y_{u_i})$. In a leader-follower structure, the followers should follow the leader in a geometrically balanced manner and keep a specific direction and distance from the leader. Thus, we establish a new relative coordinate frame $O\text{-}X'Y'$ whose origin is the original origin and whose $Y'$-axis points along the heading of the leader. Each USV $u_i$ has new coordinates $p'_{u_i} = (x'_{u_i}, y'_{u_i})$ in the relative coordinate frame, calculated by the coordinate transformation matrix in Equation (1), where $\theta$ is the leader's heading angle. This transforms the formation control problem of a multi-robot system into the problem of followers tracking the leader's position and direction. In the classical leader-follower mode, all the followers individually track a single leader with different distance offsets, while a chain leader-follower formation is used in this paper. The chain leader-follower mode is inspired by the three flocking rules of Reynolds [5]: agents in a group should stay close to their neighbor agents, avoid collisions with their neighbor agents, and match speed with their neighbor agents instead of the leader of the group. In short, each agent aligns with its neighbors. In a cooperative USV team, sharing information by communication improves team efficiency more than collecting information about team members with sensors. USVs are equipped with industrial control computers, GPS, and other sensors, so the followers obtain the coordinates of their leaders through wireless communication. For instance, USV $u_2$ tracks USV $u_1$ while USV $u_4$ tracks USV $u_2$ according to the required chain formation; USV $u_2$ is regarded as the immediate leader of its follower USV $u_4$. For each USV $u_i$ other than the leader, $H_i = (d_{X_i}, d_{Y_i})$ is the predefined offset vector, i.e., the relative positional relationship of USV $u_i$ with respect to its immediate leader in the $O\text{-}X'Y'$ coordinate frame.
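For concreteness, the rotation into the leader-aligned frame can be sketched as follows. This is a minimal Python illustration of the transformation in Equation (1), assuming a compass-style heading measured clockwise from the global Y-axis; other conventions only change the signs inside the rotation matrix, which Equation (1) fixes.

```python
import numpy as np

def to_leader_frame(p: np.ndarray, theta: float) -> np.ndarray:
    """Rotate a global O-XY position into the leader-aligned O-X'Y' frame.

    The frame keeps the original origin and points its Y'-axis along the
    leader's heading angle theta (Equation (1) in the text). Here theta is
    assumed to be a compass-style heading measured clockwise from the
    global Y-axis.
    """
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s],
                    [s,  c]])
    return rot @ p

# A point one unit due east, with the leader also heading east
# (theta = pi/2), lies straight ahead on the Y'-axis: about (0, 1).
print(to_leader_frame(np.array([1.0, 0.0]), np.pi / 2))
```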
Each USV takes actions following its formation generation policy $\mu_{\theta_i}$ to make the USV team generate the expected formation shape (see Section 4.2.1). Each USV takes actions following its formation maintenance policy $\mu^{fm}_{\theta_i}$ to keep the formation structure stable with the predefined teammate spacing (see Section 4.2.2).

Formation Shape Model
The USV team needs a geometric configuration of the formation in the collective formation mission. Reasonable formation shapes of the USV team can increase the efficiency of formation task execution. To establish or maintain a specific geometric formation, a representation method for the geometric formation is necessary. At present, there is no unified representation method to designate the formation of a USV team. In this paper, combining the formation description mode of chain guidance, we define a $4 \times N$ formation shape matrix $F_s$, as shown in Equation (2). The matrix represents the geometric relationship of the USVs, where $N$ is the number of USVs in the formation. In the matrix $F_s$, the first row denotes the numbers of the geometric nodes in the formation. The second and third rows represent the distance offsets between a USV and its immediate leader USV in the x-axis and y-axis directions. The fourth row denotes the node number of the immediate leader of the USV at each formation node. The $N$ USVs form a geometric shape: the central USV is the leader of the team, while the other USVs form the chain, tracking one another. Follower USVs keep the distance displacement $(d_{X_i}, d_{Y_i})$ from their immediate leader. If one or several immediate leaders in the formation fail during movement, a formation is reconstructed according to the number $N_t$ of remaining functional USVs and the first $N_t$ columns of the formation shape matrix $F_s$.
where $d_{X_i}$ and $d_{Y_i}$ are the horizontal and vertical distances of each follower-leader pair, and $\alpha \in (0, 1]$ and $\beta \in (0, 1]$ are the expansion coefficients in the horizontal and vertical directions, respectively. By adjusting the values of $\alpha$ and $\beta$, the horizontal and vertical expansion and contraction of the same formation can be realized. $Il(n_i)$ is the immediate leader of the USV located at formation node $n_i$.
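As an illustration, the following sketch builds a $4 \times N$ matrix of this form for a V-shaped chain formation. The offsets dx, dy and the alternating wing layout are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def v_shape_matrix(n: int, dx: float, dy: float,
                   alpha: float = 1.0, beta: float = 1.0) -> np.ndarray:
    """Build a 4 x N formation shape matrix F_s for a V-shape: row 0 holds
    node numbers, rows 1-2 hold the x/y offsets from each node's immediate
    leader (scaled by alpha and beta), and row 3 holds the node number of
    that immediate leader (0 marks the team leader, which has none)."""
    fs = np.zeros((4, n))
    fs[0] = np.arange(1, n + 1)            # node numbers; node 1 is the leader
    for i in range(1, n):                  # followers alternate left/right wings
        side = 1.0 if i % 2 else -1.0
        fs[1, i] = side * alpha * dx       # horizontal offset from immediate leader
        fs[2, i] = -beta * dy              # vertical offset (behind the leader)
        fs[3, i] = 1 if i <= 2 else i - 1  # nodes 2,3 follow node 1; node k follows k-2
    return fs

print(v_shape_matrix(5, dx=0.5, dy=0.5))
```

With this layout, node 4 follows node 2 and node 5 follows node 3, matching the chain tracking described in Section 3.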

Control Objective
For formation generation, the USV team starts from the respective current locations and generates a predefined formation shape at the target location. Therefore, the first control goal is to minimize the sum of the lengths of the movement paths of the team of USVs:

$$\min \sum_{i=1}^{N} len(s(u_i), e(u_i)), \tag{3}$$

where $s(u_i)$ is the initial position of USV $u_i$, $e(u_i)$ is the final position of USV $u_i$ in the formation, and $len(s(u_i), e(u_i))$ is the length of the moving path of USV $u_i$ from position $s(u_i)$ to position $e(u_i)$ during formation generation. For formation maintenance, we aim at training the USVs to learn policies to move in a predetermined formation and maintain the shape of the formation. All the followers keep a certain distance from their respective leaders, which reduces the stability error (i.e., the difference between the current relative distance and the predefined distance). Thus, the other control goal is defined as follows:

$$\min \sum_{u_i \in U_{follower}} \left\| H_i' - H_i \right\|, \tag{4}$$

where $Il(u_i)$ denotes the immediate leader of USV $u_i$, $U_{follower}$ is the set of all the followers except the single leader USV, and $p_{Il(u_i)}$ is the position of the immediate leader $Il(u_i)$ of USV $u_i$. $H_i$ is the required distance offset between USV $u_i$ and its immediate leader, and $H_i'$ is the current relative distance between USV $u_i$ and its immediate leader.
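Both objectives can be computed directly from logged trajectories. The following is a minimal sketch under the definitions above, treating the offsets $H_i$ as 2-D vectors; the function names are illustrative.

```python
import numpy as np

def path_length(traj: np.ndarray) -> float:
    """len(s(u_i), e(u_i)): the length of one USV's movement path,
    computed from a (T+1) x 2 array of logged positions."""
    return float(np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1)))

def generation_cost(trajs: list) -> float:
    """First control objective (Equation (3)): the total path length of
    the team during formation generation, to be minimized."""
    return sum(path_length(t) for t in trajs)

def step_stability_error(p_follower, p_leader, h_required) -> float:
    """Per-follower term of the second objective (Equation (4)): the error
    between the current relative distance H_i' and the required offset H_i."""
    h_current = np.asarray(p_follower) - np.asarray(p_leader)
    return float(np.linalg.norm(h_current - np.asarray(h_required)))
```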

Proposed Scheme
In this section, we introduce the problem formulation for a cooperative formation control followed by a formation control algorithm based on reinforcement learning in which policies for formation generation and formation maintenance are presented.

Problem Formulation
For the formation control of USVs in the leader-follower structure, the main problem is how to form the predetermined formation shape with collision-free movement and how to maintain the global formation shape.
An asynchronous formation control scheme based on reinforcement learning for multiple USVs is proposed to address the problems mentioned above, making the USV team generate a formation with a minimum total movement path length and maintain the formation. The proposed scheme contains two parts: the formation generation policy and the decentralized formation maintenance policy. In formation generation, each USV observes the positions and velocities of all USVs via sensors and communication, the positions of the points in the formation shape are given to all USVs, and each USV then learns a formation generation policy $\mu^*_{\theta_i}$ based on the cost function $J_i$ to make the team form the formation quickly with a series of optimal actions $a^*$, as shown in Figure 2. In formation maintenance, however, only the leader collects the information about the team's goal; each follower obtains its immediate leader's position and velocity by communication and follows the formation maintenance policy $\mu^{fm}_{\theta_i}$ to track the leader, as shown in Figure 3.


Formation Generation Policy
In a sophisticated maritime environment, the formation control problem can be regarded as a Markov decision process (MDP), described as $\langle S, A, P, r \rangle$. $S$ describes the set of possible states $s$ of each USV. $A$ is the set of actions $a$ that a USV can take. The transition probability distribution for each pair of state $s$ and action $a$ is expressed as $P : S \times A \times S \to [0, 1]$. The expected reward for each state-action pair is computed as $r : S \times A \to \mathbb{R}$. Moreover, a deterministic policy $\mu : S \to \mathbb{R}^{|A|}$ is defined to output a deterministic action $a$ in state $s$ that obtains a reward $r(s, a)$ and makes the environment change to a new state $s'$ with transition probability $P(s' \mid s, a)$. Policy optimization is realized by maximizing the cumulative return $R_i = \sum_{t=0}^{T} \gamma^t r_i^t$ of each USV, where $\gamma \in [0, 1]$ is the discount factor.
We propose an improved deep deterministic policy gradient based on the advantage function and the deep deterministic policy gradient (DDPG) proposed in [48], named the advantage deep deterministic policy gradient (ADDPG). The multi-USV system for formation control considered in this paper is decentralized, and each USV runs its policy independently. We use the ADDPG to train the decentralized formation generation policy set $\mu = \{\mu_{\theta_1}, \mu_{\theta_2}, \ldots, \mu_{\theta_N}\}$ for the team of USVs, as described in Algorithm 1. We use the advantage function of the state-action value instead of the state-action value itself to calculate the policy gradient, which can make the policy update toward the direction of the larger action value and accelerate policy learning. For each USV, its observation includes information about its velocity and position, the relative distances to the other USVs, and the predefined parameters of the formation. Its action is the change of velocity. At each time step $t$ of formation generation, each USV $u_i$ obtains its observation $s_t^i$, uses its policy $\mu_{\theta_i}$ to generate an action $a_t^i$, and receives a reward $r_{t+1}^i$ from the environment. After the USV executes the action $a_t^i$, the environment state $s_t^i$ transfers to the next state $s_{t+1}^i$. The transition experience $e_t^i = (s_t^i, a_t^i, r_{t+1}^i, s_{t+1}^i)$ of all USVs is collected and stored in the shared experience replay buffer $D_s$ and used to train the formation generation policy.
The network structure of the ADDPG is illustrated in Figure 4. Inspired by the target network in DQN, we introduce target networks and the actor-critic paradigm into the proposed scheme to handle continuous actions and action-value estimation and improve the stability of learning. Thus, there are four neural networks in the ADDPG. The critic trains the state value network to approximate the value of the state-action, including the current critic network and the target critic network, which are three-layer multilayer perceptrons (MLPs) with parameters $\theta^V$ and $\theta^{V'}$, respectively. The actor trains the formation policy to output the action that should be taken in the current state, containing the current actor network and the target actor network, with parameters $\theta^\mu$ and $\theta^{\mu'}$, respectively. The target networks' parameters $\theta^{V'}$ and $\theta^{\mu'}$ use the parameters from some previous iteration of $\theta^V$ and $\theta^\mu$. We use an advantage function to evaluate the relative advantage of each action in a state and accelerate the learning of policies. The actor part, based on the advantage function, uses the DPG method; the critic part uses the TD error method to update its parameters. During training, each USV in the team has independent networks with different parameters and independent optimization for its formation control policy.
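To fix ideas, a minimal PyTorch sketch of the four networks and the "soft" target update is given below. The hidden layer width, the input/output sizes, and the value of tau are illustrative assumptions; the exact architecture is specified in the paper's Figure 4, and the update rules are Equations (7)-(10).

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 64) -> nn.Sequential:
    """Three-layer multilayer perceptron, used here for both actor and critic."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

obs_dim, act_dim = 22, 2                  # illustrative sizes (see Section 5)
actor         = mlp(obs_dim, act_dim)     # current actor,  parameters theta^mu
actor_target  = mlp(obs_dim, act_dim)     # target actor,   parameters theta^mu'
critic        = mlp(obs_dim, 1)           # current critic (state value), theta^V
critic_target = mlp(obs_dim, 1)           # target critic,  parameters theta^V'

# Target networks start as copies of the current networks.
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.01) -> None:
    """"Soft" target update: the target slowly tracks the current network,
    theta' <- (1 - tau) * theta' + tau * theta."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)
```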
The advantage function describes the advantage of selecting an action $a$ in the state $s$, compared with the other actions available in that state:

$$A_i(s, a) = Q_i(s, a) - V_i(s \mid \theta^V_i).$$

For the critic part, the loss function is denoted as follows:

$$L(\theta^V_i) = \mathbb{E}\left[\left(R_i - V_i(s \mid \theta^V_i)\right)^2\right], \tag{7}$$

where

$$R_i = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^{n-1} r_{n-1} + \gamma^n V_i(s_n \mid \theta^V_i). \tag{8}$$

The critic is updated by minimizing the loss:

$$\theta^V_i \leftarrow \theta^V_i - \alpha_V \nabla_{\theta^V_i} L(\theta^V_i). \tag{9}$$

The actor is updated by the gradient of the Q-value and the advantage function:

$$\nabla_{\theta^\mu_i} J \approx \mathbb{E}\left[\nabla_a A_i(s, a)\big|_{a = \mu_{\theta_i}(s)} \nabla_{\theta^\mu_i} \mu_{\theta_i}(s)\right]. \tag{10}$$

The interactions between the USV team and the ocean environment can be divided into separate episodes. An episode starts in a random state of the team and ends at a terminal state or after a specified number of time steps. The number of episodes is set to 20,000, and the maximum episode length is 100; that is, each episode has up to 100 time steps. At each time step of every episode, every USV interacts with the environment, selects its action $a_t$ according to the current state $s_t$, and receives in real time the reward $r_{t+1}$ computed from the reward functions $r_1$ and $r_2$. The reward function $r_1$ is defined by the minimum distance between the USVs and the geometric points of the formation. If a collision occurs between members during the formation control process, a collision penalty is given, namely the negative reward value $r_2$.
where $p_{n_j}$ is the position of the formation node $n_j$ in the coordinate frame $O\text{-}X'Y'$, which can be calculated from the predefined location of the leader and the parameters in the formation shape matrix $F_s$. $Dis(p_{u_i}, p_{n_j})$ is the Euclidean distance between the USV $u_i$ and the formation node $n_j$. $c_1$ is a positive constant.
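One plausible reading of $r_1$ and $r_2$ is sketched below; the sign convention for $r_1$, the collision radius, and the value of $c_1$ are assumptions, since the paper gives only the verbal description above.

```python
import numpy as np

def reward_generation(p_usv: np.ndarray, formation_nodes: np.ndarray,
                      p_others: np.ndarray, collide_radius: float = 0.05,
                      c1: float = 10.0) -> float:
    """Formation generation reward: r1 rewards closing the minimum distance
    to the formation nodes (here as a negative distance, an assumed sign
    convention), and r2 adds the collision penalty -c1 whenever another
    USV is within collide_radius. The radius and c1 value are illustrative."""
    r1 = -np.min(np.linalg.norm(formation_nodes - p_usv, axis=1))
    collided = np.any(np.linalg.norm(p_others - p_usv, axis=1) < collide_radius)
    r2 = -c1 if collided else 0.0
    return r1 + r2
```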

Algorithm 1: Formation Generation Policy Based on the Advantage Deep Deterministic Policy Gradient (ADDPG)

Input: reward functions $r_1$, $r_2$ for the formation generation scenario
Input: the predefined formation shape $F_s$
Output: formation generation policies $\mu = \{\mu_{\theta_1}, \mu_{\theta_2}, \ldots, \mu_{\theta_N}\}$ for the USV team
Initialize the experience replay buffer $D_s$
1: for episode = 1 : M do
2:   Initialize a random process $N_{rp}$ for action exploration
3:   Receive the initial environment state
4:   for t = 1 : T do
5:     for i = 1 : N do
6:       Select action $a_t^i = \mu_i(s_t^i \mid \theta^{\mu_i})$ with exploration noise from $N_{rp}$
7:       Execute action $a_t^i$ and observe reward $r_t^i$ and new state $s_{t+1}^i$
8:       Add $e_t^i = (s_t^i, a_t^i, r_t^i, s_{t+1}^i)$ to the replay buffer $D_s$
9:       Sample a random minibatch of transitions $e_j$ from $D_s$
10-12:   Estimate the value of each sampled state-action pair with the cumulative discounted reward $R_i$ in (8) and compute the advantage of each action
13-14:   Update the critic for USV $u_i$ by minimizing the loss in (9)
15-16:   Update the formation generation policy for USV $u_i$ by (10)
17-19:   Update the "soft" target networks for the actor and critic
20:     end for
21:   end for
22: end for

The details of the proposed formation generation policies are shown in Algorithm 1, where M is the number of episodes, T is the maximum episode length, and N is the number of USVs. The inputs are the designed rewards $r_1$, $r_2$ for the formation generation scenario and the formation shape $F_s$. The outputs are the formation generation policies for the USV team. In each episode, every USV adopts a random process $N_{rp}$ to achieve sufficient exploration and collects experiences $e_t^i = (s_t^i, a_t^i, r_t^i, s_{t+1}^i)$ that are stored in the replay buffer $D_s$ (lines 8-9) and sampled randomly to update the policies. The value of each state-action pair is estimated by the cumulative discounted reward $R_i$ (lines 10 and 12), which is used to calculate the advantage of each action. A random minibatch of samples from the replay buffer is drawn by each USV to improve its formation generation policy (lines 13-16). The "soft" updates make the target networks change slowly and improve the stability of learning (lines 18-19).

Decentralized Formation Maintenance Policy
The leader-follower method requires the followers to maintain a specific position and direction offset from the leader, so the structure is simple and robust in engineering practice. We consider a chain mode in the leader-follower structure, in which each follower tracks its immediate leader, instead of the mode in which all followers track the same leader. The advantage of this method is that the communication pressure on the leader is reduced, and the stability of the formation structure is realized by minimizing the tracking error of each USV.
We adopt the proposed ADDPG to train the decentralized formation maintenance policies for the decentralized multi-USV system, similarly to the training of the formation generation policies. For each USV, the state $s$ contains information about its velocity and position and the relative distance from its immediate leader. The action $a$ is the change of velocity. At each time step $j$ of formation maintenance, each USV $u_i$ obtains its observation $s_j^i$, uses its policy $\mu^{fm}_{\theta_i}$ to generate an action $a_j^i$, and receives a reward $r_{j+1}^i$ from the environment. After the USV executes the action $a_j^i$, the environment state $s_j^i$ transfers to the next state $s_{j+1}^i$. The reward function $r_3$ is defined by the distance difference, i.e., the error between the current distance $H_i'$ and the expected distance $H_i$ between the USV and its immediate leader. The transition experience $e_j^i = (s_j^i, a_j^i, r_{j+1}^i, s_{j+1}^i)$ of USV $u_i$ is stored in the experience replay buffer $D_i$ and sampled randomly to update the policy. The reward $r_3$ measures the performance of the formation maintenance policy, aiming to reduce the difference between the current formation and the expected formation. All USVs choose and execute actions asynchronously and are not limited to synchronous operation.
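A plausible sketch of the maintenance reward $r_3$ follows; the negation is an assumed sign convention so that a smaller tracking error yields a larger reward.

```python
import numpy as np

def reward_maintenance(p_follower: np.ndarray, p_leader: np.ndarray,
                       h_required: np.ndarray) -> float:
    """Formation maintenance reward r3: the (negated) error between the
    current relative distance H_i' to the immediate leader and the expected
    offset H_i, so that smaller tracking error means higher reward."""
    h_current = p_follower - p_leader
    return -float(np.linalg.norm(h_current - h_required))
```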

Experiment and Analysis
In this section, we design comparative experiments to evaluate the proposed scheme and analyze the simulation results.

Experimental Setting
We design a formation generation scenario based on the simulation platform presented in [49]. A V-shaped formation is used in the experiments. In the formation generation scenario, N USVs move on a two-dimensional maritime surface, which is modeled as a square with a side length of 2. Only the kinematics model of the USVs is considered. Formation generation controls the team of USVs to form a predefined formation $F_s$ by following the formation generation policies $\mu = \{\mu_{\theta_1}, \mu_{\theta_2}, \ldots, \mu_{\theta_N}\}$. The input of the formation generation policy of each USV $u_i$ is a row vector $s_{u_i} = (p_{u_i}, v_{u_i}, re_{point}, re_{oth})$, including the $1 \times 2$ position vector $p_{u_i}$, the $1 \times 2$ velocity vector $v_{u_i}$, the $1 \times 2N$ relative position vector $re_{point}$ between USV $u_i$ and the N formation points, and the $1 \times 2(N-1)$ relative position vector $re_{oth}$ between USV $u_i$ and the $N-1$ other USVs. The output $a_{u_i} = \Delta v_{u_i} = (\Delta v^x_{u_i}, \Delta v^y_{u_i})$ is the velocity change of USV $u_i$. The main hyperparameters of the generation policies are shown in Table 1. The cumulative discounted return and the average length of the movement paths of the team are used to evaluate the performance of the policies. For the formation maintenance task, in contrast, all the followers follow the leader's movement while maintaining the whole formation geometry. In the formation maintenance scenario, the goal is to minimize the error between the current formation shape and the expected formation shape by following the decentralized formation maintenance policies $\mu^{fm} = \{\mu^{fm}_{\theta_1}, \mu^{fm}_{\theta_2}, \ldots, \mu^{fm}_{\theta_N}\}$. The input of the formation maintenance policy of each USV $u_i$ is a row vector $s_{u_i} = (p_{u_i}, v_{u_i}, re_{ileader})$, including the $1 \times 2$ position vector $p_{u_i}$, the $1 \times 2$ velocity vector $v_{u_i}$, and the $1 \times 2$ relative position vector $re_{ileader}$ between USV $u_i$ and its immediate leader. The output is the change of velocity $a_{u_i} = \Delta v_{u_i} = (\Delta v^x_{u_i}, \Delta v^y_{u_i})$. The cumulative discounted return and the average error are used to measure the performance of the maintenance policies.
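For clarity, the generation observation vector can be assembled as follows. This sketch assumes the team state is available as N x 2 arrays, matching the dimensions stated above; the function name is illustrative.

```python
import numpy as np

def generation_observation(i: int, positions: np.ndarray,
                           velocities: np.ndarray,
                           nodes: np.ndarray) -> np.ndarray:
    """Assemble the observation s_ui = (p, v, re_point, re_oth) for USV i
    from the team's N x 2 position/velocity arrays and the N x 2 formation
    node positions, giving a row vector of length 2 + 2 + 2N + 2(N - 1)."""
    p, v = positions[i], velocities[i]
    re_point = (nodes - p).ravel()            # offsets to all N formation points
    others = np.delete(positions, i, axis=0)  # the other N - 1 USVs
    re_oth = (others - p).ravel()
    return np.concatenate([p, v, re_point, re_oth])

# For N = 5 the observation has length 2 + 2 + 10 + 8 = 22.
```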

Formation Generation
We compare the performance of the proposed scheme with the following schemes through simulation results and analysis:

• The deep deterministic policy gradient (DDPG) scheme: each USV learns its formation generation policy based on the deep deterministic policy gradient.

• The deep Q-learning (DQN) scheme: each USV learns its formation generation policy based on deep Q-learning.
We train the decentralized formation generation policies based on the control objective in Equation (3). We evaluate the proposed scheme by averaging the episode reward over every 100 episodes. Figure 5 shows the mean episode reward of the USV team with different formation configurations and team sizes over 20,000 episodes when using the proposed formation generation policies. Figure 5a-c show the performance of teams of 3, 5, and 7 USVs, respectively. The proposed scheme can learn effective formation generation policies for different USV team sizes. As shown in Figure 5, as policy training proceeds, the formation generation policies based on deep reinforcement learning are continuously optimized, and the cumulative discounted return increases until the policies converge. Figure 5 shows that the proposed formation control scheme can perform formation generation effectively with different team sizes.
Next, we compare the performance of the proposed scheme with the other existing schemes. Figure 6 shows the total length $len_{fg} = \sum_{i=1}^{N} len(s(u_i), e(u_i))$ and the average length $len_{fg}/N$ of the moving paths of the team with different team sizes. It illustrates that, as the team size changes, the proposed scheme can still form the expected formation shape with the shortest path length and obtains better performance than the other schemes. Figure 6a shows that the total length of the USV team's moving path increases as the size of the team increases, and the proposed scheme has the shortest total length during formation generation. Figure 6b shows that the proposed scheme has the shortest average length of the moving path during formation generation. The reasons are as follows: first, the proposed scheme directly parameterizes the whole policy and finds the optimal policy, which removes the limitation of a discrete action space. Second, the reward function is designed to drive all USVs to reach the nodes of the predefined formation as soon as possible to maximize the cumulative discounted return of the team. Third, the advantage function accelerates the learning of policies. As for the DDPG and DQN, the target locations in the generated formation for each USV may not be optimal, and the average moving path is not the shortest. Consequently, the proposed scheme performs better than the other schemes, as shown in Figure 6.

Formation Maintenance
Aiming at the formation maintenance problem of a dynamically moving USV team, we train the path planning algorithm for the leader and design the formation maintenance algorithm for the followers so that the whole team can move toward the target with a relatively stable geometric structure. Thus, we evaluate the performance of the proposed scheme according to the following evaluation criteria:
• the cumulative discounted reward during training;
• the final distance between the leader and the team goal in each episode;
• the stability error of the whole team.
We train the decentralized formation maintenance policies based on the reward function in Equations (12) and (13) and the control objective in Equation (4). Figure 7 shows the reward of the USV team with decentralized formation maintenance policies for different numbers of USVs over 10,000 episodes. Figure 7a-c show the mean episode reward for teams of 3, 5, and 7 USVs, respectively. We can see that the policies quickly converge to stable optimized policies.

In the formation maintenance scenario, the leader of the formation guides the movement of the USV team. Hence, the path planning algorithm of the leader directly determines the success of the formation task. As shown in Figure 8a-c, we test the performance of the formation maintenance policy of the leader in teams of 3, 5, and 7 USVs. The mean episode distance between the leader and the team goal approaches zero after about 2000 episodes in all cases, which means the leader in each team can reach the mission target quickly.
We adopt a stability error function $SF(U) = \frac{1}{T \times N} \sum_{j=0}^{T} \sum_{i=1}^{N} \left| H_i - H_i' \right|$ to measure the formation stability when following the leader. Comparison simulations are conducted among the proposed decentralized formation maintenance policies, DDPG, and DQN. The stability error of the formation with different team sizes is selected for analysis, as shown in Figure 9. We can see that the average stability error of the proposed scheme is lower than that of the other schemes. The results show that the followers can track their respective immediate leaders and maintain the predefined distance and direction from them. The proposed scheme obtains the best performance for maintaining the formation compared with the other schemes.
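The stability error metric can be computed directly from logged relative distances. This sketch treats $H_i$ and $H_i'$ as scalar distances per time step, which matches the absolute-value form of SF(U); the data in the usage example are synthetic.

```python
import numpy as np

def sf_metric(h_current: np.ndarray, h_required: np.ndarray) -> float:
    """Stability error SF(U) = 1/(T x N) * sum_j sum_i |H_i - H_i'|,
    computed from a T x N array of per-step follower-leader distances
    h_current and the required distances h_required (broadcast over time)."""
    steps, n = h_current.shape
    return float(np.abs(h_current - h_required).sum() / (steps * n))

# Usage: 100 logged steps for a team with 4 followers.
rng = np.random.default_rng(0)
logged = 0.5 + 0.01 * rng.standard_normal((100, 4))
print(sf_metric(logged, np.full(4, 0.5)))  # small value => stable formation
```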

Conclusions
In this paper, we have proposed an asynchronous formation control scheme based on reinforcement learning and a leader-follower structure for multiple USVs in a complex maritime environment. First, a specific USV model and a novel formation shape model have been established, where the formation shape model provides the parameters for formation generation and maintenance. Second, formation generation policies have been proposed for the cooperative USVs to generate the predefined formation shape with a minimum moving path, while decentralized formation maintenance policies have been presented to maintain the stability of the geometric formation structure by minimizing the stability error between the real-time relative distances and the expected relative distances for all the USVs. Finally, simulation results have demonstrated that the proposed scheme can generate the required formation shape and maintain the geometric structure of the formation effectively compared with other schemes. In future work, we will take communication interruption and the disturbances of wind, waves, and currents into consideration.
Author Contributions: Conceptualization, J.L. and S.X.; methodology, Y.P. and J.X.; validation, J.X.; formal analysis, H.P.; investigation, Y.L. and R.Z.; writing and editing, J.X.; funding acquisition, S.X. All authors have read and agreed to the published version of the manuscript.