Multi-USV Dynamic Navigation and Target Capture: A Guided Multi-Agent Reinforcement Learning Approach

: Autonomous unmanned systems have become an attractive vehicle for a myriad of military and civilian applications. This can be partly attributed to their ability to bring payloads for utility, sensing, and other uses for various applications autonomously. However, a key challenge in realizing autonomous unmanned systems is the ability to perform complex group missions, which require coordination and collaboration among multiple platforms. This paper presents a cooperative navigating task approach that enables multiple unmanned surface vehicles (multi-USV) to autonomously capture a maneuvering target while avoiding both static and dynamic obstacles. The approach adopts a hybrid multi-agent deep reinforcement learning framework that leverages heuristic mechanisms to guide the group mission learning of the vehicles. Speciﬁcally, the proposed framework consists of two stages. In the ﬁrst stage, navigation subgoal sets are generated based on expert knowledge, and a goal selection heuristic model based on the immune network model is used to select navigation targets during training. Next, the selected goals’ executions are learned using actor-critic proximal policy optimization. The simulation results with multi-USV target capture show that the proposed approach is capable of abstracting and guiding the unmanned vehicle group coordination learning and achieving a generally optimized mission execution.


Introduction
The important technological developments in unmanned vehicles have greatly influenced the number of real-world applications that are suitable for unmanned vehicle deployment.Hence, in the last decade, the application of autonomous unmanned vehicles in military and civilian operations, such as search and rescue, patrolling, monitoring, and surveillance tasks, has witnessed a tremendous increase [1][2][3][4][5][6].One key challenge in the effective deployment of autonomous unmanned vehicles is the ability to perform complex group missions, which require coordination and collaboration among multiple vehicles.On the other hand, many real-world tasks require a group of agents to work together.Compared with a single agent, multi-agent systems have the advantages of strong adaptability and fault tolerance.In the robotics and autonomous unmanned vehicles contexts, a single platform or controller of an unmanned vehicle can be regarded as an agent.Through division and cooperation, each platform can perceive the environment and acquire information about the task area quickly and more accurately to accelerate the completion of missions and improve the efficiency of the group.
Unmanned vehicles can be categorized into four types based on their mode of operation.These include the unmanned aerial vehicle (UAV), unmanned ground vehicle (UGV), unmanned surface vehicle (USV), and autonomous underwater vehicle (AUV) [7].
In the last few decades, the robotics community has extended its interest to the study of cooperative decision making and controlling unmanned vehicles.A group of unmanned vehicles working together in a shared environment can be regarded as a multi-agent system.In cooperative multi-agent systems, a collective scheme is required to enable the group of agents to work together to complete a common task and coordinate in time and space.
Traditionally, group mission strategies for multi-robot systems were created using predefined rules and algorithms [8][9][10], but these methods can be rigid and not easily adaptable to changes in the environment or mission objectives.Consequently, there has been a growing interest in using reinforcement learning techniques that enable multiple robots to learn group mission strategies in a more flexible and adaptable way.To achieve coordination in autonomous multi-robot systems across different application domains, various control architectures have been proposed [11,12].Deep reinforcement learning (DRL) has emerged as a promising technique in this area [13,14].DRL has been integrated with multi-agent systems (MAS) to create multi-agent DRL, which has been extensively studied in the literature.For example, the multi-agent deep deterministic policy gradient (MADDPG) [15] has been developed as a centralized training framework with decentralized execution to enable multi-agent teams to cooperate or compete during task execution.The multi-agent hierarchical deep deterministic policy gradient has been introduced as an extension of the MADDPG to address search and exploration problems with numerous robotic agents [16].Additionally, proximal policy optimization (PPO) and its variants have been proposed and applied to various group missions to generate sophisticated behaviors [17][18][19][20][21].These and similar frameworks and approaches have been utilized to tackle the challenge of cooperative mission execution with multi-agent systems, such as the multi-agent pursue evasion (MPE) problems, using unmanned vehicles [17,[22][23][24].
The MPE problem is commonly known as target encirclement and capture.When dealing with MPE environments, the targets are evasive and try to avoid capture.They can be faster than the pursuers, and multiple agents are employed to capture an intruder or evader(s) who attempt to evade capture.The main goal of the MPE is to minimize the time required to capture the evasive intruder or target.Due to its practical significance in military applications, such as missile avoidance and interception, surveillance, criminal pursuit, reconnaissance, and rescue operations, several variations of MPE games have been extensively studied [25][26][27][28][29].
Several approaches in the literature assume unstructured environments.For example, the authors in [30] proposed a region-based relay pursuit scheme for pursuers to capture a single evader that adopts the pure evasion strategy.An RL-based method that employs the MADDPG algorithm to realize a pursuit strategy for the MPE game was considered in [31].However, this approach is not suitable with an increased number of agents.A classical single pursuer and single evader pursuit-evasion game based on Apollonius partitions with Q-learning was studied in [32].A decentralized learning approach [33] was explored where the pursuers were modeled as learning automata, while a new fuzzy reinforcement learning algorithm was proposed in [23] to solve the MPE with a single superior evader.Additionally, the authors in [34] studied the MPE from the view of the behavior-based control method and adopted a motor schema-based reactive control architecture based on the Apollonius circle to solve the MPE differential game with a superior evader.
In contrast, while some works tackle the MPE problem in a more general form, other studies consider specific applications using unmanned systems with DRL and other frameworks.For instance, a cooperative pursuit of unauthorized UAVs in urban airspace via MARL was presented in [35], Zhang et al. [36] focused on the pursuit-evasion game of a multiquadcopter in an obstacle environment with online motion planning by deep reinforcement learning, an approach using multiple autonomous underwater vehicles in 3D water volumes with and without obstacles was studied in [37], and Wei et al. [29] designed an unmanned target hunting system for a swarm of unmanned underwater vehicles (UUVs) to hunt a target with high maneuverability, while a collaborative strategy of a UAV/UGV heterogeneous system was proposed to derive a pursuit-evasion game in a complex three-dimensional environment in [38].In addition, a distributed, partially observable multi-target hunting proximal policy optimization (DPOMH-PPO) algorithm for multi-unmanned surface vehicle (USV) target hunting was proposed in [17], and a DRL approach for multi-vehicle pursuit in urban environments was presented in [24].The authors in [39] considered a decentralized DRL method for pursuing an omnidirectional target with multiple agents, while a method for enabling a multi-robot system to encircle a target and avoid collisions was studied in [40].Another proposed approach is a distributed transferable policy network framework based on DRL which employs a graph communication mechanism for robotic agents' interaction to tackle the problem of single-target capture with collision avoidance [41].
However, some of these approaches may not be effective in cases where the target has a strong escape strategy, as they do not consider predicting the target's movement trajectories.Additionally, in many DRL approaches, each agent receives information in a lumped vector without leveraging human knowledge of the target or the natural spatial structures of the environments.This is important because while multi-agent RL can be a very powerful tool, especially for achieving autonomous cooperative behavior of unmanned systems, several challenges exist when utilizing RL in the multi-vehicle cooperative capture of mobile targets in complicated multi-intersection and restricted moving spaces.
The first issue has to do with the state of agents and its effects on learning convergence.It is usually unrealistic to assume a completely observable and noise-free state.This means that aside from the fact that a learning agent might not be able to determine accurately its state, convergence becomes an issue.Additionally, a significantly different state might appear to be very similar.Hence, to facilitate the training and learning of cooperation for multi-robot group missions, designers can use filters of human guidance to help agents estimate the true state of the vehicle or to guide the learning process of the group.
This paper presents a human-guided environment transition model approach for coordination training in an environment where the adversaries or task transition model can be observed and modeled.The key idea in this approach is to utilize human knowledge of the observed mission environment to generate a transition model that can be initialized to guide the overall coordination of the multi-agent system in different situations by generating goals that will lead to a successful mission completion of unmanned vehicle groups.More specifically, the contributions of this paper include the following:

•
A novel formulation of the MPE problem as a multi-USV operation in a floating city environment that can be applied in target pursuit in multi-intersection urban areas.

•
We develop a hybrid deep reinforcement learning framework that leverages heuristic mechanisms to guide group mission learning of multiple unmanned surface vehicles.The proposed framework consists of subgoal generation in the form of navigation targets based on the knowledge of the environment, and a goal selection heuristic model based on immune network models is used to select subgoals for execution by an actor-critic proximal policy optimization model.

•
A simulation experiment is conducted to analyze the performance of the proposed framework in realizing the target-capturing behavior of a multi-USV group.
The rest of the paper is organized as follows.Section 2 introduces some key preliminaries related to the MARL model and algorithm.Section 3 describes the multi-agent USV cooperative problem and the target decision model.Next, the proposed approach is presented in Section 4, followed by the experiment and results in Section 5.

Multi-Agent Observation Models
The observation and states of agents in decentralized multi-agent systems are localized.In this case, the interaction between agents is defined by a graph G = (V, E), where V = {v 1 , v 2 , ..., v N } is the set of agents and E is the set of edges representing agent interactions.This interaction E ⊂ V × V is represented by undirected edges in the form of {v i , v j }, where agents i and j are neighbors.Hence, the set of neighbors of agent i within which agent i can sense local information about others is given by If the set of nodes or edges changes over time, then G can be regarded as a dynamic graph.The information shared by agent i to agent j is a function of the states of agent i about agent j, namely o ij = f (s i , s j ), where o ij ∈ O i and s i , s j ∈ S i .Here, O i is the complete information set agent i receives from all its neighbors, and S i is the set of states of all neighbor agents of i.The observation o ij is only available to agent i if agent j is in the neighborhood of agent i.

Partially Observable Stochastic Games
Multi-agent decision making can generally be classified as partially observable stochastic games (POSGs) since the agents have incomplete or noisy observations.Thus, MARL in multi-robot cooperative tasks, such as target capture, can be modeled as partially observable.The POSG is an extension of the stochastic Markov decision process (MDP), where rewards or punishments and transitions between agents' states are determined by the agents' joint actions, regardless of the possible conflicting goals of agents.In POSGs, partial observation of agents' states is assumed.
where the following apply: • n is the number of agents (controller agents or USVs); • S is the finite set of states;

., n are the finite set of actions available to the agents which form the joint action set
.., n are the finite set of observations of the agents.The joint observation is denoted as o = {o 1 , ..., 'o n }; is the Markovian state transition function, where f (s |s, a) denotes the probability of reaching state s after the joint action a in state s.The joint action of all agents a at a stage t is denoted as .., n are the reward functions of the agents.
A reward of agent i after taking action a i,t ∈ A i in stage t toward completing task c i is denoted as r i,t+1 .The individual controller h i : S × A i → [0, 1] of the agents forms the joint coordination controller (policy) h.Since the reward of each agent in a team depends on the joint action, the respective agents' rewards depend on the joint controller: where γ is the discounting factor and T is the length of the horizon.Because each agent has its own reward function that shows the agent's preferences, POSGs model multi-agent systems that are self-interested.The multi-agent system becomes cooperative when all agents have the same reward function preferences (∀ i,j R i = R j ).In this case, the multiagent system is modeled as a decentralized, partially observable Markov decision process (Dec-POMDP) [42].

Proximal Policy Optimization
In recent years, research on policy gradient methods has produced algorithms such as trust region policy optimization [43] methods that show state-of-the-art performance over Q-learning.Among the core algorithms in the policy gradient and actor-critic field is the proximal policy optimization (PPO) algorithm [44,45].PPO is a minimal version of the trust region policy optimization algorithm that aims to solve the problem of low data utilization of traditional gradient policy algorithms with simpler implementation, execution, and sampling methods.Both trust region policy optimization and PPO use the actor-critic structure, which combines the pros of traditional value-based and policy-based approaches.The actor-critic model employs two deep neural networks: the actor for taking the action and the critic for handling rewards.Specifically, to promote small policy updates, PPO proposes a clipped surrogate loss function and combines the policy surrogate and a value function error term as shown in Equation (3) [44].
where L CLIP is a clipped surrogate objective put forward in [44] as shown in Equation ( 4).Moreover, χ 1 and χ 2 are coefficients, L V constitutes the squared error loss of the value function, and S denotes an entropy loss: In Equation ( 4), and r t (θ) denote a hyperparameter and the probability ratio, respectively, where r t (θ) = π θ (a t |s t )/π θ old (a t |s t ).This clips the probability ratio r at 1 + or 1 − for a positive or negative advantage, respectively, forming the clipped objective after multiplying the advantage approximator Ât [46].

The Artificial Immune Network
The immune network theory is a critical theory of the artificial immune system that exhibits characteristics of adaptivity.The immune network hypothesis, proposed by Jerne [47], is based on Burnet's clonal selection theory.The hypothesis suggests that different types of antibodies and lymphocytes communicate with each other and maintain a stable network equilibrium through an interaction mechanism between network cells.This network is self-regulating through stimulatory and suppressive interactions among B cells, antigen epitopes, antibody idiotope, and antibody paratope.In the robotics domain, a computational model of Jerne's idiotypic network theory was proposed in [48] as a means of inducing adaptive behavior mediation, and it demonstrated some encouraging results.

Description of the Multi-USV Target Capture
We consider a problem where a team of unmanned surface vehicles is deployed in an floating environment to provide internal security by monitoring the activities of other boats within the area.The multi-USV team detects and finds intruder boats and captures them by entrapment or chasing them out of the environment.This scenario is an extension of multi-agent region surveillance and pursuit-evasion, which have been studied in the literature.The environment consists of multiple intersection pathways for easy navigation of USVs in the floating environment.
We assume dual pathways that are narrow and that the USVs cannot make a U-turn on a pathway.We further assume that there are a few exits and entrance points into and out of the environment.Additionally, the environment contains civilian boats that go about their business.We consider the case where an intruder is to be detected and chased by the multi-USV team with the objective to capture or chase the intruder out of the boundary as fast as possible.Here, capture means the pursuing USVs cross the pathway of the intruder boat.
In this case, the intruder is assumed to know the positions and detection ranges of the multi-USV team.The intruder boat is also assumed to be faster and have higher maneuverability compared with the multi-USV team.As an important requirement, we develop a competitive intruder's decision-making strategy, which is further used to provide the initial policy for the multi-USV team.Moreover, the intruder boat becomes visible to some of the USVs in the multi-USV team at some time (i.e., some of the USVs detect and obtain the planar coordinates of the intruder boat after a certain moment in time).Figure 1 shows an example mission scenario of USVs (blue) performing the monitoring mission with an intruder boat (red) and civilian boats in the environment.This problem can be directly extended to multiple unmanned ground vehicle target captures in urban city settings.Furthermore, the USVs are equipped with various sensors, including GPS, radar, and communication equipment for intruder and other sensory detection, as well as communication capabilities.However, USVs have limited observation ability and can only detect or communicate within their sensing and communication ranges, denoted as R s and C r , respectively.The multi-USVs are trained to make their own navigation and maneuvering decisions in a decentralized way (i.e., each USV navigates independently and has information about the current coordinates of the USVs in its neighborhood and the intruder boat through sensing and communication).The exact position of the intruder boat is only known to the USVs that have it within their detection range R s .
However, the other USVs in the team can acquire the last location through communication from other USVs.Here, the goal for the multi-USV team is to coordinate their movement to constrain the intruder boat such that its movement can be directed according to the direction the multi-USV team wants, leading to the multi-USV team catching the intruder or the intruder escaping through one of the exit points within the environment.In doing so, the multi-USV team must also avoid collisions with each other and other static structures within the environment.The intruder is said to be captured if it is trapped between two or more of the multi-USV teams.The objective is to train the multi-USV system to perform cooperative intruder boat detection and entrapment within the environment.

The Multi-USV Learning Formulation
Formally, we consider a multi-agent assignment Markov decision process for a team of agents (multi-USV) that must learn to cooperatively perform a given navigation task  in an adversarial environment D ⊂ R n .In executing the task, the agent team selects a navigation destination or goal from a set ( C() = {c 1 , c 2 , c 3 , ..., c n }) that is generated as subtasks or goals toward executing , based on an expert-or designer-modeled transition function.In this case, agents select and navigate to a location in a dynamic environment with collision avoidance and the minimum time cost.The team MDP can be defined as < N, O, A, P, R >, where N is the number of agents, A is the set of actions, O is the global environment observation made up of the local states S N of the agents and a goal map, and is the transition probability model that enables the transition from S N to S N , driven by the joint actions taken by the team A N .The navigation targets can be dynamic at different times during the mission due to the dynamics of the environment and task.In this formulation, the actions toward achieving a goal are of the same speed and direction such that a t = [v t , v r ], where a t is the action taken at time t.
Each agent uses the same policy µ : H → A defined for each subgoal (in this case, a navigation policy), which maps local observations' histories H and actions of agents to a new action a t to be executed.Hence, agents can share their policy during training.If η is denoted as a horizon of an agent history space, then H = S η n × A η n , where A n is the local actions of agent n.The reward function R of the team depends on the global state and all actions of the team (i.e., R : S N × A N → R).Even though this formulation focuses on navigation target selection, the proposed framework can be customized for any scenario where a set of goals can be defined to achieve the cooperative mission.

Reward Function
The reward function provides guidance for the multi-USV system to achieve optimal task performance policies, which are implemented as the training objective.In order to optimize the performance of the multi-USV system, we consider the various situations that USVs will encounter in the mission environment when designing a reward function.Equation (5) shows the shared reward R USV i to be maximized by the multi-USV system at time t.Here, r t dist,i is the negative displacements between agent i and its selected task, t t elapsed i is the time elapsed, γ is a bound constant, and c t collide i is a discounted penalty if agent i collides, with a penalty factor of −0.5:

Intruder Boat Policy Strategy
To design a human-competitive policy for the intruder, we developed a probability map model that the intruder used to make its maneuvering decisions.To apply the probability map model for intruder decision making, we divided the monitoring area into equally sized cells and associated each cell with the probability or likelihood of the intruder being captured if it were in that particular cell, thus forming the probability map of the entire monitoring area.Based on this, we designed navigation rules for the intruder boat with which it made its movement decisions.In addition to the strategy for exiting, the direction of movements of the intruder was biased toward less probable cells of capture, as shown in Figure 2. In this strategy, there is a 30% chance of the intruder trying to move out of the community through the closest exit point, with a 70% aim at staying in the environment and evading the multi-USV team.Cells closer to and between structures are riskier, depending on the location and direction of the multi-USV team.The cells size depends on the speed specified for the intruder at the beginning of the simulation.

Approach
This study proposes a hybrid multi-agent training and reinforcement learning approach based on a human-guided transition model for cooperative task execution in environments where the adversaries or task transition model can be observed and modeled.The key idea in this approach is to utilize human knowledge of the observed mission environment to generate a transition model that can be initialized to guide the overall coordination of the multi-agent system.This is achieved by generating goals or subtasks that will lead to successful completion of the mission by the multi-agent team.

Framework Design
The backbone of this framework consists of a goal or subtask generation model designed to provide a goal map to guide the learning model, reducing the coordination problem to goal selection and execution.This methodology is based on the premise that high-level subtasks that lead to the ultimate mission objective at any time during the mission's execution can be observed and abstracted, and a subgoal generation model of such processes can be developed by mapping abstract situations to appropriate subtasks.This idea corresponds well with multi-agent operations in certain environments, such as target pursuit or tracking in urban road networks with multiple intersections.
As shown in Figure 3, the methodology consists of two main stages.In the first stage, observation and domain analysis are performed to identify all potential situations that are disadvantageous to the adversaries, and a transition model for these situations is developed.This is performed based on the knowledge and constraints of the opponents or based on the limitations placed by the structure of the environment.In order to complete the first stage, we need a goal generation model that outputs abstract goals or subtasks whose optimal execution will facilitate successful mission execution for the training agents.In this study, the subtask generation model is represented by a graph of vertices G C() = {v i ∈ V}, with v i representing an intersection and the {v i , v j } ∈ E pathway representing the edges.Thus, depending on the direction, node, or edge where the intruder finds itself, the goal map consisting of the set of nodes connected to the current node or edge can be generated to guide the multi-USV system.During training, the strategy generation model is executed at certain time steps to produce a goal map or set of subtasks for the decision system to execute or learn to execute.The strategy generator is an intermediate module that encodes the goal generation model for subgoal generation to guide the decision-making sequence among the multi-agent system to achieve a high degree of team coordination.This module can be replaced by other algorithms or learning algorithms in the absence of an expert-guided task generation model, making it plugable.
An agent then selects a task from the set of subtasks to execute or learn.In this work, we propose a heuristic-based approach as the transition model encoding for realizing goal selection during training.Our ultimate goal is to learn a coordinated policy for the multi-agent system to perform a cooperative mission, taking advantage of our knowledge of the adversaries or environment constraints to generate abstract goals and leading to a successful mission.The framework implements an actor-critic PPO for learning the selected subtasks from the goal map generation model.The learning objective guides and measures the collective performance of the multi-agent system on the mission.The preference defines constraints or priorities that can be specified for tasks, behaviors, goals, or low-level actions during training.

Goal Selection Algorithm
In order to realize optimal subgoal selection, we adopted a heuristic immune network model.Prior to the immune process, the state of the agent is used by a strategy generator to generate subgoals.The generated subgoals are then evaluated, and an appropriate subgoal is selected by each agent based on the heuristic immune process.Algorithm 1 presents the dynamic navigation point selection of agents which can be customized to fit other mission scenarios by employing different models to compute the affinity and activation of subgoals.At every decision step, the goal selection algorithm updates its navigation position preference based on Equation ( 6) and selects the most attractive goal position.Each agent reevaluates all the available goal positions when the goal map changes and selects the strongest, most suitable subgoal.This process continues until the system terminates.Hence, after receiving the goal map generated from the transition model, the algorithm computes the suitability of each navigation position in the goal map to select the optimal subgoal point: where p a represents the relative location of vehicle a, whose activation is being determined, and p i is the positions of other vehicles, while p a − p i is the orientation from a to i and ∈ [0.2, 1] is a factor that indicates if obstacles are located in the direction of the goal position: The decision of an agent to move and select a particular goal is determined using Equation (7).In this equation, the first term represents the stimulation effect of the current agent with respect to the target location being considered, while the second term represents the suppressive effects of its neighbors.Here, A f f a and A f f j,a are the affinity of the current agent with respect to its neighbors' (N(a)) states, respectively, and vice versa.Additionally, ε and ρ are both scaling factors, where { , ρ ∈ [0.2, 1]} and ρ < ε.The navigation location with the highest activation value is selected, and control is passed to the task execution policy learning.

Task Execution Policy Learning
After the agent selects a target location, the appropriate policy should be executed to achieve the agent-selected goal.In this case, a navigation policy is implemented, which uses proximal policy optimization [44,45] with the actor-critic method to optimize the navigation goal execution based on the collective experiences of all controller agents.After the target locations are selected by controller agents, each controller agent obtains its local state S n and selects an action from its action space A n , following the stochastic policy π θ .For the navigation policy, the input of the training algorithm consists of the poses and velocities of the agent's physical model.The inputs include some noise to account for the various sensor noises present in the real world.These additional noises are incorporated using a set of dynamic parameters Λ within the range of Γ ∈ [0.02, 0.2] for the position and orientation noises.Finally, the actor-critic model implements neural networks for learning.Figure 2 shows the summarized algorithm for training the navigation policy.At the beginning of training, the network is initialized.With a given number of episodes, the environment is reset at beginning of an episode.For each episode, the dynamics parameters Λ from a given range Γ are sampled.Next, the current state s i t of the vehicle in the training environment is collected.The target location is obtained by executing Algorithm 1.Based on the current state, the agent's action a i t ← π θ (a i t |s i t ) is obtained.Next, to compute the estimated advantage, the state s i t , reward r i t , and a i t are collected.The advantage estimates are performed in a manner similar to the work in [46].Finally, the surrogate loss and value loss with respect to the vector of parameters θ and are constructed and updated, respectively [44].

Experiment and Results
To evaluate this approach, we implemented a simulation system of the multi-USV target capture in a novel floating community.The simulation was a modification of the one presented in [49].This section describes the details of the training environment set's set-up and experimental results.Figure 4 shows the simulation training set-up of the multi-USV target capture.

USV Physical Model
The basic USV model has a diverse set of onboard sensors.Hence, the USV physical model in the simulation environment was equipped with sensors (radar) that were used to map the surrounding environment while executing missions.The basic radar model had a 360 degree field of view with a radius r which determined its range.Additionally, an onboard inertial measurement unit (IMU) model was used to provide the linear acceleration, orientation, and rotational velocity data and estimate the poses of the USVs.To find the exact positions and orientations of the USVs with respect to the global frame, the USVs each had an onboard global positioning system (GPS).The radar was partitioned into virtual sensors of the same range by assigning a reach for each virtual sensor based on the configuration provided.These virtual sensors returned the relative positions and normalized distance of other objects in the scene.A relational virtual sensor returns information about how a USV is situated to and from other objects, while a velocity sensor returns the velocity of objects within the detection range of the USV.The basic inputs and controls include the heading and velocity controls for motion.The communication of USVs was modeled based on the reach of the communication range.

USV Dynamic Model
Considering the USV movement dimension in six-DOF, the simulated USV model was based on Fossen's six-DOF model for marine vehicles [50].Figure 5 shows the USV motions in six DOF.This model expresses the resultant movement of a USV as the combined effect of five main forces as shown in Equation ( 8): where τ hs f is the hydrostatic forces, τ hyd f is the hydrodynamic forces, τ is the control and propulsion forces, and τ wind and τ waves are the wind and wave forces, respectively.The kinematic and kinetic model with wind and wave disturbances is defined in Equation (9).
where the following definitions apply:

Experimental Settings
The experiment was conducted with teams of two, three, and five USVs in a simulated floating city environment.The initial positions of the teams were spread apart from each other to cover a large area.At the beginning of each episode, the intruder was invisible to the USV group, so the task generation module generated random target points near the intruder.The idea was to detect the intruder if the USVs were to move to these locations.Subsequently, the goal map was generated based on the location of the intruder, as mentioned in Section 4.1, which predicted the possible navigation direction.In the experiment, the USVs could communicate with each other, and the detection range of the USVs was set to 300 m.The length of an episode was set to 1000 simulation steps, and 5000 episodes were run.
The observation space of the agents included the general goal map, the local state of the USV, all the other USVs of the team within the USV communication range, and the distance and headings to the collision points.The action space consisted of a probability distribution over the action space of speed in [0, 1] and a direction of [−1, 0, 1].The actor implemented a three-hidden-layer network.The input state used 3 fully connected hidden layers of 256, 128, and 64 rectifier units.The output layer was a fully connected layer of five units, and different activation functions were used for different parts of the action space.
The critic part was the state value function V(s t ) and was designed as a two-hiddenlayer neural network, including two fully connected layers with 128 and 64 rectifier units.The output layer had only one unit to generate the value with linear activation.
The k update epochs were set to 20 for the surrogate loss update and 15 for the value loss.Both the surrogate loss and value losses were optimized with the Adam optimizer, with learning rates of 0.0001 and 0.0015, respectively.The parameter for clipping was 0.2.

Experimental Results
To evaluate the effectiveness of the approach, the experimental results with a multi-USV team against a high-speed intruder on one hand and a slow intruder with the same speed as the multi-USV team on the other hand are presented.The USVs used a fixed speed of 3 m•s −1 , while the intruder's speed was set to 3 m•s −1 and 4 m•s −1 for the slow and fast intruder, respectively.We plotted the average rewards for the different scenario settings.
Figures 6-8 show the average episode rewards for two, three, and five USVs and one intruder during training.The performance of the algorithm clearly improved over the course of training within a short period of time, as shown in the training graphs.From these graphs, it is evident that the approach can easily guide the multi-USV team not only to learn a navigation policy but also to dynamically assign the goals generated by the strategy generator to the USVs.Moreover, the results also show the influence of the intruder type, since there was a different average reward return with different speeds for the intruder.In order to test the performance of the final controllers, 100 simulations with the best-performing controllers for two scenarios with different navigation target speeds were run.In this experiment, the USVs used a fixed speed of 5 m•s −1 , while the intruder speed was set to 5 m•s −1 and 7.5 m•s −1 for the slow and fast intruder, respectively.The detection radius for the former was 200 m, while 350 m was used for the latter.Here, the higher speed made it more difficult for the USVs to make turns at intersections, even if the evaluation and training environments were similar.Figures 9 and 10 show the success rate of the USVs capturing the intruder as opposed to the intruder exiting the environment.The intruder was said to exit if it was not captured within the simulation episode length.Despite the higher speed of the USVs, the system still achieved about 86% and 77% successful capture rates with 5 and 2 USV teams, respectively, as can be seen in Figure 9. On the other hand, there is a decline in capture performance in Figure 10, with 80% and 60% success rates for a similar number of USVs, which can be attributed to the higher speed of the intruder boat.From these graphs, it is evident that more USVs led to more successful captures.In addition, the speed of the intruder affected the success rate of the coordination team.Figure 11 shows a sample capture process of three USVs at increasing time steps.Here, the speed was 3 m•s −1 for both the USVs and intruder with a 200 m detection radius.Figure 11a shows the configuration of USVs after 50 ts with a USV to the front (USV1), back (USV2) and right (USV3) of the intruder boat.In Figure 11b, USV1 executes a direct chase, while USV2 and USV3 execute left and right turns, respectively.In Figure 11c, USV3 is now performing the chase, USV2 makes a left turn a the next intersection to cover one of the possible exits of the intruder while USV2 makes a left turn at the second intersection ahead it.Finally, after 350 ts, the target is capture since all the exit points are covered by the team as shown in Figure 11d.From the trajectories of the USVs, it can be found that the USVs were able to capture the target within 350 simulation time steps.

Conclusions
This paper presents a training mechanism for a multi-agent controller that utilizes a human-guided task generation model for generating abstract subgoals based on the observed situations of an environment or opponents.This approach is applied in a novel floating multi-intersection environment for multiple unmanned surface vehicles (USVs), dynamically selecting a navigation goal generated based on the knowledge of the environment layout and learning navigation control for target capturing.To provide a realistic and competitive intruder policy, a probability model used by the intruder was adopted to make its maneuvering decisions.The multi-USV system employs an immune network and dynamics to achieve two main objectives: collaborative task assignment and proximal policy optimization (PPO) for navigation policy learning of the USV team.The performance of the system during training showed an improvement in the rewards received within a few episodes of the simulation.The evaluation results with 2, 3, and 5 USVs show that the approach can be applied for multi-agent navigation and target capture in multi-intersection environments and is capable of achieving a success rate of over 80% with 5 USVs and over 60% with 2 USVs.Thus, the proposed method has the potential for practical trajectory design and application in a multi-vehicle environment with multiple intersections and, by extension, in multi-vehicle scenarios where the environment and task structure can be leveraged to guide the training and execution process.For instance, this approach can be directly applied to target pursuit in an urban city environment by simply replacing the USVs with UGVs or UAVs.
However, this approach assumes independent subgoals that are performed concurrently by the multi-agent system.Additionally, the experiments conducted were limited to a single intruder that adopted a somewhat regularized escape strategy.Hence, the proposed approach needs further investigation and experiments in different and more practical applications with other types of unmanned vehicles where the generated tasks are coupled and need to be performed sequentially.

Figure 1 .
Figure 1.Example floating market or community environment map.The rectangular shapes represent floating structures.

Figure 2 .
Figure 2. Demonstration of intruder strategy model.The darker the grid, the higher the capture probability.

Figure 3 .
Figure 3. Structure diagram of the proposed framework.

Algorithm 1 : 5 A 6 7 8
Subgoal (navigation point) selection algorithm Data: Goal map Result: Optimal target 1 Act c ← ∅ { initialize empty list of activation values}; 2 for For every navigation position j ∈ goalmap do 3 S n ← S {get current local state of neighbors } ; 4 s t ← S m {get own state } ; f f N(a) ← {the set of current neighbors' (N(a)) affinity } ; Determine own affinity A f f a using Equation (6) ; Compute the suitability of selecting j, i.e, B act value of the agent using Equation (7) ; Act c ← B act { add to list of activation values}; 9 max B act ← argmax(Act c , random(Act c )) ; 10 Return the corresponding subgoal of max B act as the suitable navigation position.
, y, z, φ, θ, ψ] T is a vector of the position and Euler angles in the m-frame;• ν = [u, v, w, p, q, r] T is a vector of the linear and angular velocities in the d-frame; • v r is the hydrodynamic terms of the relative velocities vector (i.e., the difference between the vessel velocity relative to the fluid velocity and the marine currents expressed in the reference frame); • τ E represents the forces and moments of environmental disturbances of superimposed wind, currents, and waves;•The parameters J, M, D, and C are the rotational transformation, inertia, damping, and coriolis and centric fugal matrices, respectively.

Figure 6 .
Figure 6.Average episode rewards for two USVs and one intruder.

Figure 7 .
Figure 7. Average episode rewards for three USVs and one intruder.

Figure 8 .
Figure 8.Average episode rewards for five USVs and one intruder.

Figure 11 .
Figure 11.Sample capture process of three-USV team at different time steps (ts) with speed of 3 m•s −1 for both USVs and intruder and a 200 meter detection radius.

do 7
10 Compute the advantage estimates Âi 1 , ..., Âi T ; 11 Optimize surrogate loss with respect to θ for k epochs ; 12 Set θ old ← θ ; 13 Optimize value loss with respect to for number of episodes; 14 Set old ← ;