Multi-AUVs Cooperative Target Search Based on Autonomous Cooperative Search Learning Algorithm

Abstract: As a new type of marine unmanned intelligent equipment, the autonomous underwater vehicle (AUV) has been widely used in ocean observation, maritime rescue, mine countermeasures, intelligence reconnaissance, and other fields. Its technical advantages are particularly obvious in underwater search missions. However, limited operational capability and sophisticated mission environments remain difficulties for the AUV. To make better use of AUVs in search missions, we establish the DMACSS (distributed multi-AUVs collaborative search system) and propose the ACSLA (autonomous collaborative search learning algorithm), which is integrated into the DMACSS. Compared with previous systems, the DMACSS adopts a distributed control structure to improve robustness and combines an information fusion mechanism with a time stamp mechanism, so that each AUV in the system can exchange and fuse information during the mission. The ACSLA is an adaptive learning algorithm trained by the RL (reinforcement learning) method with a tailored design of state information, reward function, and training framework, which gives the system an optimal search path in real time according to the environment. We test the DMACSS and the ACSLA in simulation. The results demonstrate that the DMACSS runs stably and that the ACSLA outperforms other search methods in search accuracy and efficiency, thus better realizing the cooperation between AUVs and enabling the DMACSS to find targets more accurately and faster.


Introduction
With the development of technology and oceanic applications, the AUV (autonomous underwater vehicle) has come to play an important role in marine applications. Compared with the HOV (human occupied vehicle) and the ROV (remotely operated vehicle), the AUV is an unmanned, untethered agent that can accomplish its work independently, safely, and efficiently [1]. Relying on these advantages, AUVs have been widely used in minefield search, reconnaissance, anti-submarine warfare, marine exploration, marine rescue, marine observation [2][3][4][5], etc. However, limited operational capability and sophisticated mission environments make a single AUV unable to meet the requirements of high-efficiency, large-scale missions. The MAS (multi-AUVs system) provides a new way to overcome these difficulties thanks to the efficiency and reliability brought by its space-time distribution and redundant configuration [6]. Compared with a single AUV, the MAS has the following characteristics: (1) Distribution, including spatial distribution and functional distribution. Spatial distribution is reflected in the fact that AUVs can be deployed in different areas to improve operational efficiency; functional distribution means that AUVs can carry different sensors and actuators to complete sophisticated missions through cooperation [7]. (2) Redundancy: the MAS is redundant in quantity. When one AUV can no longer work, other AUVs in the MAS can replace it, ensuring that the mission is not interrupted [8]. These characteristics give the MAS great advantages in applicability, economy, robustness, and scalability, and make it particularly suited to large missions such as underwater target search. The underwater target search mission is characterized by large mission areas with randomly distributed targets, which requires the AUVs to search the whole area in the shortest time to determine the locations of the targets [9].
Most MAS adopt a centralized control structure. Although this control structure is relatively simple, once the central node fails, the entire system is paralyzed [10,11]. Moreover, most search methods at this stage are pre-planning algorithms. This kind of algorithm first rasterizes the mission area and then makes the AUVs scan all the grids along a pre-planned path [12]. Although such search algorithms can accurately locate targets, they cannot produce effective cooperative behavior between AUVs, which reduces mission efficiency. This shortcoming is unacceptable for urgent search missions such as locating shipwrecks, mine countermeasures, and anti-submarine warfare. In addition, limited detection, communication, and endurance capability, together with changing water depth, cause unexpected situations for AUVs during the mission, and a pre-planning algorithm cannot adjust the search strategy based on the real-time situation [13]. Hence, to address these problems, we establish the DMACSS (distributed multi-AUVs collaborative search system) and propose the ACSLA (autonomous collaborative search learning algorithm), which is integrated into the DMACSS. This article makes the following contributions: 1. DMACSS and system modeling: we establish the DMACSS and build its dynamic detection and update model for a sophisticated environment (changing water depth, randomly distributed targets, sensor uncertainty, limited communication capability) based on the probability map model. Using this model, the search environment and the entire search process of the DMACSS are reasonably abstracted, and the rationality of the modeling is justified.
2. Information communication and fusion: a local information fusion mechanism is proposed, which uses the information collected by every AUV in the DMACSS to make up for the limited perception ability of a single AUV. The information fusion mechanism is combined with a specially designed time stamp mechanism, which makes the fusion process more effective under limited communication conditions, improves search efficiency, and reduces the error rate.
3. ACSLA: we propose the ACSLA and integrate it into the DMACSS so that the AUVs in the system can obtain cooperative search strategies. The ACSLA is trained with RL algorithms [14,15]; it has specially designed state information, reward functions, and a new distributed training framework called SASF (single asynchronous sharing framework) that makes the training process more stable and easier to converge, which is essential for the search performance of the DMACSS.

Background
The problem of multi-agent cooperative search has become a research hotspot in academia and industry and has received widespread attention. Although some studies do not directly use the AUV as the research object, they still provide a basis and reference for research on multi-AUVs.
Rajnarayan, D.G., et al. [16] discuss the application of cooperation theory to multi-agent search missions and use Radner's decentralized cooperation theory to make the optimality of cooperation between agents coincide with global optimality; both the CS (centralized strategy) and the DCS (distributed collaboration strategy) are derived. Wang, X., et al. [17] assign multiple sensors to a set of discrete search units to find hidden targets; to handle the sensors' uncertainty in the detection process, interference is added to the traditional discrete search formulation to establish a new mathematical model, which is then optimized with a greedy algorithm. Hong, S.P., et al. [18] combine the Markov chain with the concept of minimizing the undiscovered probability and propose a fast hybrid heuristic intelligent algorithm; experimental results show that this heuristic can complete the search path decision in a short time. Singh et al. [19] propose a hybrid framework for guidance and navigation of a swarm of unmanned surface vehicles (USVs) by combining the key characteristics of formation control and cooperative motion planning; under this framework, a combination of offline and online planning is applied to the marine environment. Mina et al. [20] propose a general multi-USV navigation, guidance, and control framework that enables the USVs to avoid dynamic obstacles, based on the USV maneuvering response time and an A* algorithm with offline optimal path planning and safe distance constraints. Thi, H.A.L., et al. [21] study a hierarchical search planning model, which divides the search area into several subspaces and then conducts a second round of search planning within each subspace; this secondary planning makes the entire search process efficient and precise.
Specific to the AUV, Healey et al. [22] studied the problem of complete coverage in the cooperative search process of multi-AUVs. In order to ensure complete coverage of the search area in the event of AUV loss in the formation, an effective cooperative strategy was designed and tested. Welling et al. [23] used a multi-AUVs system to perform cooperative search and target clearing missions; they discussed the mission assignment problem and compared two assignment strategies, based on closest distance and on fuzzy logic, from a time-consumption perspective. Shafer et al. [24] studied multi-AUVs cooperative adaptive search behavior in sophisticated environments and proposed a cooperative strategy that enables a multi-AUVs system to accomplish multiple missions in parallel. In addition to theoretical research, some multi-AUVs systems have already been applied in practical missions. Since REMUS (Remote Environmental Monitoring Underwater System) underwater robots played an important role in minefield detection operations during the Iraq War, the ONR (Office of Naval Research) has continuously funded several scientific research institutions to develop unmanned underwater systems. MIT (Massachusetts Institute of Technology) carried out a research project called GOATS (Generic Oceanographic Array Technology System), which used multi-AUVs equipped with underwater acoustic equipment to form a mobile underwater detection network for searching for mines in coastal waters [25][26][27][28]. Based on the GOATS project, a research team consisting of NURC (NATO Undersea Research Centre) and MIT launched a project called Generic Littoral Interoperable Network Technology (GLINT) in 2008. The multi-AUVs system in this project is equipped with various sensors to complete missions of automatic detection, positioning, and tracking of specific targets [29].
In Europe, from 2012 to 2015, research institutes in Italy, Estonia, the United Kingdom, Spain, and Turkey jointly launched a project called ARROWS (Archaeological Robot System for the World's Seas) [30], which aimed to use multi-AUVs systems to improve seabed scanning efficiency and to study mission allocation strategies [31] and underwater communication [32]. At this stage, most multi-AUVs collaborative search systems adopt a centralized control structure and use a pre-planned method that does not consider the real-time motion characteristics of the AUV. To this end, we establish the DMACSS based on a distributed control structure and propose the ACSLA, which realizes real-time planning of search paths to adapt to complex environments.

Modeling
In order to establish the DMACSS, we must reasonably simplify and abstract the mission environment according to the actual mission situation, and effectively model the mission environment, each part of the system, and the system's working process.

Environment Model
In order to effectively search for targets in the mission environment, the DMACSS must keep updating the environment state from limited target information. To this end, the probability map model, which is commonly used to model the uncertainty of a mission environment, is adopted. We assume that the targets lie on the seafloor. Since the AUVs navigate in the water column, the complex terrain of the seabed has no effect on their navigation, so we project each AUV onto a flat region A, as shown in Figure 1. First, we establish an inertial coordinate system with origin O, whose X axis points east. The mission area A ⊂ R^2 is divided into M × N grids; each grid is called a target area g_(m,n), m ∈ {1, 2, …, M}, n ∈ {1, 2, …, N}. Let θ_(m,n) = 1 denote that there is a target in g_(m,n), and θ_(m,n) = 0 that there is none. Similarly, p_(m,n)(t) ∈ [0, 1] denotes the probability that a target exists in the grid at time t: p_(m,n)(t) = 1 indicates that there must be a target in g_(m,n), and p_(m,n)(t) = 0 indicates that there must be none. Before the start of the mission, each grid has the a priori initial probability p_(m,n)(0) = 0.5. Thresholds P_u and P_l are set as the upper and lower limits of p_(m,n)(t): p_(m,n)(t) > P_u means that there is a target in g_(m,n), and p_(m,n)(t) < P_l means that there is none. The coordinates of each AUV are expressed as x_i(t) = [ξ_i(t), h_i(t)]^T ∈ R^3 (i = 1, 2, …, N_A), where ξ_i(t) is the coordinate of AUV_i's projection on A, h_i(t) is the depth of AUV_i, [·]^T is the transpose operation, and N_A is the total number of AUVs. The kinematics model of the AUV is shown in Appendix A. Figure 1b shows the search area model.
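As a minimal sketch of this grid model (the function names, the list-of-lists representation, and the threshold values 0.9/0.1 are illustrative assumptions, not the paper's implementation):

```python
def init_probability_map(M, N, p0=0.5):
    """Each grid g_(m,n) starts at the a priori target probability p0 = 0.5."""
    return [[p0 for _ in range(N)] for _ in range(M)]

def classify_grid(p, p_upper=0.9, p_lower=0.1):
    """Decide a grid's status from the upper/lower probability thresholds."""
    if p > p_upper:
        return "target"
    if p < p_lower:
        return "no target"
    return "undetermined"
```

At mission start every grid is undetermined; detection updates then push each probability toward one of the two thresholds.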

Sensor Model
At this stage, sonar is the most commonly used sensor for an AUV to obtain information on underwater targets and the environment. Active and passive sonar systems are the two common types; for the target search mission, the active sonar system is the more suitable sensor. Since the detection accuracy of the sonar directly affects the update of the probability map model, in this article we take the sonar as the modeling object and focus on the relationship between sonar accuracy and the mission environment.
The working principle of the active sonar system makes it susceptible to nonlinear interference caused by the water medium or other external factors. In addition, obstacles between the sonar and the target also affect sonar accuracy. Formula (1) shows the forward-looking sonar detection model after adding the constraints of nonlinear noise and obstacles:

b_i(t) = h(d_i(t)) + v, if d_min ≤ d_i(t) ≤ d_max and no obstacle lies between the sonar and the target; otherwise no target information is obtained. (1)

Here b_i(t) represents the target information collected by AUV_i, d_max and d_min are the maximum and minimum detection distances of the sonar, h(·) is the sonar detection function under noise-free conditions, d_i(t) is the distance between the target and AUV_i, and v is the nonlinear interference. Formula (1) indicates that when d_i(t) is not in [d_min, d_max], or when an obstacle lies between the sonar and the target, no target information can be obtained. The interference v has an important impact on the correctness of b_i(t): the larger v is, the lower the correctness of b_i(t). We denote the probability of a correct detection by p_d and the probability of a false detection by p_f:

p_d = P(b_(m,n)^i(t) = 1 | θ_(m,n) = 1), (2)
p_f = P(b_(m,n)^i(t) = 1 | θ_(m,n) = 0), (3)

where b_(m,n)^i(t) is the observation value of g_(m,n) by AUV_i at time t. Thus p_d is the probability that AUV_i detects a target when there is a target in grid g_(m,n), and p_f is the probability that AUV_i detects a target when there is no target in g_(m,n). Since, by Formula (1), the value of v is related to d_i(t), we obtain Formulas (4) and (5):

p_d = f(d_i(t)), where f'(d) < 0 for d ∈ (d_min, d_max) and 1 > f(d_min) = p_d > f(d_max) = 0.5; (4)
p_f = g(d_i(t)), where g'(d) > 0 for d ∈ (d_min, d_max) and 0 < g(d_min) = p_f < g(d_max) = 0.5. (5)

Remark 1. Due to the sophisticated marine environment (for example, seawater temperature, salinity, and ocean currents), the model may change with location. We do not consider this situation in this article, so the sensor model is the same over the whole mission area.
When the targets are located on the seabed plane, from Figure 2 we can obtain the relationship between d_i(t), h_i(t), and D (the water depth at the AUV's location), as shown in Formula (6):

d_i(t) = (D − h_i(t)) / cos(φ/2), (6)

where φ is the opening angle of the sonar.
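The sensor model above can be sketched as follows. The linear shapes of f and g, the cone geometry, and all parameter values are illustrative assumptions; they merely respect the stated monotonicity and boundary conditions (f decreasing from its maximum to 0.5, g increasing from its minimum to 0.5, no information outside the detection window):

```python
import math

def detection_radius(water_depth, auv_depth, opening_angle):
    """Seabed footprint radius of a downward-looking sonar cone
    (assumed geometry): r = (D - h) * tan(phi / 2)."""
    return (water_depth - auv_depth) * math.tan(opening_angle / 2.0)

def p_detect(d, d_min, d_max, p_max=0.95):
    """Correct-detection probability f(d): decreases with distance,
    from p_max at d_min down to 0.5 at d_max; zero outside the window."""
    if d < d_min or d > d_max:
        return 0.0
    frac = (d - d_min) / (d_max - d_min)
    return p_max - (p_max - 0.5) * frac

def p_false(d, d_min, d_max, p_min=0.05):
    """False-alarm probability g(d): increases with distance,
    from p_min at d_min up to 0.5 at d_max; zero outside the window."""
    if d < d_min or d > d_max:
        return 0.0
    frac = (d - d_min) / (d_max - d_min)
    return p_min + (0.5 - p_min) * frac
```

At the maximum detection distance both probabilities meet at 0.5, i.e., the observation carries no information, which is consistent with Formulas (4) and (5).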

Environment Update Model
As the mission continues, the system's perception of the environment keeps changing, so the environment model must be updated based on the latest detection results. Let p_(m,n)^i(t) represent the probability, as estimated by AUV_i, that grid g_(m,n) contains a target at time t, and let k represent the number of times g_(m,n) has been searched. Then we can build Formula (7):

lim_(k→∞) p_(m,n)^i(t) = 1 if θ_(m,n) = 1; lim_(k→∞) p_(m,n)^i(t) = 0 if θ_(m,n) = 0. (7)

Formula (7) indicates that when there is a target in g_(m,n) and the number of times the grid is searched tends to infinity, p_(m,n)^i(t) → 1, that is, there must be a target in g_(m,n); when there is no target and the number of searches tends to infinity, p_(m,n)^i(t) → 0, that is, there must be no target in g_(m,n). Next, we update p_(m,n)^i(t). According to the Bayesian formula, the update rule of p_(m,n)^i(t) is obtained:

p_(m,n)^i(t+1) = p_d p_(m,n)^i(t) / [p_d p_(m,n)^i(t) + p_f (1 − p_(m,n)^i(t))] when b_(m,n)^i(t) = 1,
p_(m,n)^i(t+1) = (1 − p_d) p_(m,n)^i(t) / [(1 − p_d) p_(m,n)^i(t) + (1 − p_f)(1 − p_(m,n)^i(t))] when b_(m,n)^i(t) = 0, (8)

where p_(m,n)^i(t) is the a priori probability. To make the calculation more efficient, Formula (8) is converted from nonlinear to linear form through the transform of Formula (9):

L_(m,n)^i(t) = ln[p_(m,n)^i(t) / (1 − p_(m,n)^i(t))]. (9)

The simplified update formula of the probability map is shown in Formula (10):

L_(m,n)^i(t+1) = L_(m,n)^i(t) + b_(m,n)^i(t) ln(p_d/p_f) + (1 − b_(m,n)^i(t)) ln[(1 − p_d)/(1 − p_f)]. (10)

According to Formula (10), it can be proved that Formula (7) still holds; the proof is given in Appendix B. In Appendix C, we derive the relationship between p_d, p_f, and the convergence rate of the probability map.
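As a concrete illustration of this update, the sketch below applies a Bayesian update to one grid and shows that a log-odds transform (one common linearization, assumed here) turns the multiplicative update into a simple addition; the detection probabilities p_d = 0.9 and p_f = 0.1 are illustrative values:

```python
import math

def bayes_update(p, observed, p_d=0.9, p_f=0.1):
    """One Bayesian update of a grid's target probability.
    observed=True: the sonar reported a target; observed=False: it did not."""
    if observed:
        num = p_d * p
        den = p_d * p + p_f * (1.0 - p)
    else:
        num = (1.0 - p_d) * p
        den = (1.0 - p_d) * p + (1.0 - p_f) * (1.0 - p)
    return num / den

def logodds(p):
    """Log-odds transform that linearizes the update."""
    return math.log(p / (1.0 - p))

def logodds_update(q, observed, p_d=0.9, p_f=0.1):
    """Additive (linear) form of the same update, applied to q = logodds(p)."""
    if observed:
        return q + math.log(p_d / p_f)
    return q + math.log((1.0 - p_d) / (1.0 - p_f))
```

Repeated consistent observations drive the probability toward 1 (target) or 0 (no target), which is exactly the limit behavior stated by Formula (7).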

Search Information Fusions
Due to limited sensor performance, each AUV can only observe a limited area, so it can only update the probability map within its observation radius; and because the basic probability map model does not integrate the state information of other AUVs, no single AUV can grasp the global probability map. This negatively affects the search speed of the system. Therefore, to compensate for the limited detection capability of each AUV, the probability map model must be improved by building an information fusion mechanism that makes every AUV in the system converge faster to the same probability map reflecting the target locations.
Communication between AUVs is the basis of the information fusion mechanism. Like its observation capability, the communication capability of an AUV is restricted by equipment and environment: AUV_i can only communicate with AUVs within a communication radius R_c. Therefore, we use N_i(t) to represent the neighbor set of AUV_i at time t:

N_i(t) = { j : ||ξ_i(t) − ξ_j(t)|| ≤ R_c },

which includes AUV_i itself. According to the number of AUVs in N_i(t), we can divide N_i(t) into different levels as K_i(t) = |N_i(t)|. Using K_i(t), we can calculate the fusion matrix W(t):

w_(i,i)(t) = 1 − (|N_i(t)| − 1)/K_i(t), w_(i,j)(t) = 1/K_i(t) for j ∈ N_i(t) (j ≠ i), and w_(i,j)(t) = 0 for j ∉ N_i(t).
AUV_i searches at time t and stores its local probability map L_(m,n)^i(t), then transmits it to its neighbors and uses Formula (13) to update it with its own observation:

L̂_(m,n)^i(t) = L_(m,n)^i(t−1) + b_(m,n)^i(t) ln(p_d/p_f) + (1 − b_(m,n)^i(t)) ln[(1 − p_d)/(1 − p_f)]. (13)

Next, through the fusion matrix W(t), the updated map is merged with those of the other AUVs, as shown in Formula (14):

L_(m,n)^i(t) = Σ_(j ∈ N_i(t)) w_(i,j)(t) L̂_(m,n)^j(t). (14)

For the entire system, we first stack the maps of all AUVs:

L_(m,n)(t) ≜ [L_(m,n)^1(t), L_(m,n)^2(t), …, L_(m,n)^(N_A)(t)]^T.

Then we obtain the update rule of the information fusion mechanism, as shown in Formula (18):

L_(m,n)(t) = W(t) L̂_(m,n)(t). (18)
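A sketch of one fusion step, under the assumption of Metropolis-style consensus weights (each row of the fusion matrix sums to one, so the fused map is a convex combination of the neighbors' maps); the weight definition here is an illustration, not necessarily the paper's exact formula:

```python
def fusion_weights(i, neighbors, K):
    """Row i of the fusion matrix W(t). `neighbors` is the set of AUVs
    within communication radius, including i itself; K >= len(neighbors)."""
    w = {j: 1.0 / K for j in neighbors}          # off-diagonal weights
    w[i] = 1.0 - (len(neighbors) - 1) / K        # self weight keeps row sum 1
    return w

def fuse_maps(i, neighbors, maps, K):
    """Convex combination of the neighbors' (log-odds) probability maps,
    stored here as flat lists of per-grid values."""
    w = fusion_weights(i, neighbors, K)
    n_cells = len(maps[i])
    fused = [0.0] * n_cells
    for j in neighbors:
        for c in range(n_cells):
            fused[c] += w[j] * maps[j][c]
    return fused
```

Because the weights are nonnegative and sum to one, repeated fusion drives all AUVs toward a common map, which is the stated goal of the mechanism.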

Time Stamp
In order to improve the efficiency of information fusion, we also propose a time stamp mechanism in addition to the information fusion mechanism. Specifically, when each AUV fuses the probability map, it must also transmit a timestamp map. The timestamp prevents an AUV from repeatedly fusing information about grids whose search result is already determined, and thus improves search efficiency. Let the timestamp T_(m,n)^i denote the latest time at which AUV_i updated the probability search map of grid g_(m,n). We establish three rules for the time stamp mechanism: (1) When AUV_i detects grid g_(m,n) at the current time, the update of its probability map comes from its own detection behavior. In this case, the timestamp T_(m,n)^i of the current grid is updated to the current time.
(2) When the observation areas of AUVs in the system overlap, an AUV's probability map update comes from information fusion within its communication range; the timestamp T_(m,n)^i is then updated to the timestamp from the AUV closest to the current AUV's position.
(3) When an AUV fuses information, it transmits not only the probability map information but also the timestamp information. Information fusion happens only when the AUV encounters different timestamps.
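A minimal sketch of rule (3) for a single grid, assuming a newest-wins resolution when timestamps differ (the paper's full rule set also covers fusion within communication range; this only illustrates the gating idea):

```python
def fuse_with_timestamp(own_val, own_ts, other_val, other_ts):
    """Timestamp-gated fusion for one grid.
    Equal stamps: the neighbor carries nothing new, so skip fusion.
    Different stamps: keep the entry with the fresher timestamp."""
    if own_ts == other_ts:
        return own_val, own_ts       # same stamp: no fusion needed
    if other_ts > own_ts:
        return other_val, other_ts   # neighbor holds fresher information
    return own_val, own_ts           # we hold the fresher information
```

Skipping grids with identical timestamps is what saves the redundant fusion work described above.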

Search Method
In this section, we introduce the search method integrated with the DMACSS, which gives each AUV in the DMACSS a cooperative control strategy that yields the best search trajectory based on real-time detection values and maximizes the system's search capability.
Through the tailored design of state information, reward function, and training framework, the ACSLA is proposed for multi-AUVs cooperative target search. In order to compare the effects of different RL algorithms when training the ACSLA (the RL algorithm is introduced in Appendix D), we used the value-iteration-based deep Q-network (DQN) algorithm and the policy-gradient-based deep deterministic policy gradient (DDPG) algorithm to train the ACSLA, respectively. Below we describe the state information, reward function, and training framework in detail.

State Information
For an agent, a strategy is essentially a mapping from states to actions; therefore, the state information must fully reflect the agent's situation at each step so that the agent can choose the correct action. Using the model built in the previous chapter, the state information of each AUV in the DMACSS includes two parts: the target information H_i(t) and the cooperation information C_i(t). H_i(t) is the target information, namely the probability map of the whole area held by AUV_i. C_i(t) represents the coordinate information of the other AUVs within a 2R_c range of AUV_i; its role is to give the AUVs the ability to cooperate. However, we cannot input the AUVs' coordinate information directly into the ACSLA as cooperation information, because the coordinate information would be interfered with by the probability map information, and the ACSLA could not extract the corresponding features. To this end, we process the coordinate information by converting it into a cooperation map of size 2R_c × 2R_c, in which each entry is the superposition of Gaussian distributions centered on the coordinates of the neighboring AUVs of AUV_i. Using a branch structure of the convolutional neural network, the features of the two parts of state information can be extracted separately for the agent to learn.
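The cooperation map can be sketched as below: each neighbor is rendered as a Gaussian bump on a square grid. The grid size, the standard deviation σ, and the discretization are illustrative assumptions:

```python
import math

def cooperation_map(size, neighbor_coords, sigma=1.0):
    """Render neighbor positions as a size x size cooperation map:
    each neighbor contributes an (unnormalized) Gaussian bump centered
    on its grid cell, so nearby AUVs produce large map values."""
    cmap = [[0.0] * size for _ in range(size)]
    for (cx, cy) in neighbor_coords:
        for x in range(size):
            for y in range(size):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                cmap[x][y] += math.exp(-d2 / (2.0 * sigma ** 2))
    return cmap
```

Unlike raw coordinates, this image-like representation can be fed to a convolutional branch alongside the probability map, which is the motivation given above.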

Reward Function
The reward function is the core part of the ACSLA. An appropriate reward function can guide the agents to learn suitable strategies without making the strategy fail to converge or fall into a local optimum. To design a reasonable reward function, we must first clarify the goals of the mission: (1) accurately locate all the targets in the mission area; (2) under the premise of accurately finding all the targets, reduce the search time as much as possible. The former indicator reflects the accuracy of the system, the latter its efficiency. Our reward functions are designed according to these two goals.

Target Reward
We stipulate that when AUV_i finds a target during the search process (only when p_(m,n)^i(t) > P_u is AUV_i considered to have accurately found a target at grid g_(m,n); otherwise, no target is considered found there), AUV_i gets a reward. The initial target reward is shown in Formula (22). When there are multiple targets in the search area and the system finds the locations of all of them, each AUV in the DMACSS receives a final target reward, as in Formula (23). If the system does not find all the targets, it is punished accordingly, that is, the system receives a certain negative reward. The penalty value is related to the number of missed targets: the more targets are missed, the greater the penalty. The penalty is shown in Formula (24). Finally, the target reward of AUV_i at time t is composed of these terms, as in Formula (25), where the coefficients are weights.

Dispersed Reward
In order to expand the search range, the AUVs should be dispersed as much as possible during the search, avoiding situations where multiple AUVs concentrate in a small area and waste search resources. To this end, we set up a dispersed reward: the reward depends on the number of AUVs within AUV_i's communication radius, as in Formula (26). The more widely the AUVs are spread over the search area, the larger the dispersed reward each AUV gets. The distribution information of the AUVs is obtained from the neighbor set N_i(t). In addition, in order to avoid collisions between AUVs, when the distance between AUV_i and another AUV is less than the obstacle avoidance radius, AUV_i receives a large negative reward, as in Formula (27), where the coefficients are weights.
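A sketch of the dispersion and collision terms combined; the radii and weight values here are illustrative, not the paper's coefficients:

```python
def dispersed_reward(dists_to_others, r_comm=8.0, r_avoid=2.0,
                     w_disp=0.1, w_coll=5.0):
    """Dispersion term: fewer AUVs inside the communication radius means a
    larger reward. Any AUV inside the avoidance radius triggers a large
    negative collision penalty."""
    n_close = sum(1 for d in dists_to_others if d <= r_comm)
    reward = w_disp * (len(dists_to_others) - n_close)
    if any(d < r_avoid for d in dists_to_others):
        reward -= w_coll   # collision risk dominates the dispersion bonus
    return reward
```

The penalty weight is chosen much larger than the dispersion weight so that avoiding collisions always outweighs spreading out, matching the priority described above.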

Time Consumption Reward
The time consumption reward evaluates the strategy from the perspective of time. It is designed as a piecewise function so that the AUV obtains corresponding time consumption rewards at different stages of an episode, as shown in Formula (28). We divide the reward obtained during the search process into three parts: a reward for the entire episode, a reward for the first segment of the episode, and a reward for the second segment, where γ1 and γ2 are the weight coefficients of the different segments, the maximum number of episode steps is preset, and the segmentation points of the step are preset as well. The time consumption reward is an overall reward, so it is the same for every AUV in the system.

Sparse Reward
In the initial stage of training, the AUV cannot obtain sufficient rewards to learn the search strategy because it cannot find targets immediately. Therefore, to give the AUV better learning ability in the early stage of training and make the learned strategy converge faster, we set up a sparse reward, shown in Formula (29), where the coefficient is a weight. Formula (29) rewards the AUV according to the uncertainty of each grid in the search area, defined as a constant-scaled sum over the fused probability map values. The sparse reward keeps decreasing as the search progresses, so it does not affect the later stage of training.
Finally, the reward function of AUV_i at time t is composed of the four rewards above: the target reward, the dispersed reward, the time consumption reward, and the sparse reward.
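The composition can be sketched as a weighted sum of the four components; the equal default weights are an assumption for illustration, not the paper's coefficients:

```python
def total_reward(r_target, r_dispersed, r_time, r_sparse,
                 w=(1.0, 1.0, 1.0, 1.0)):
    """Per-step reward for one AUV: weighted sum of the four components
    (target, dispersion, time consumption, sparse)."""
    return (w[0] * r_target + w[1] * r_dispersed
            + w[2] * r_time + w[3] * r_sparse)
```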

Training Framework
An unstable training process is a common problem in MARL (multi-agent reinforcement learning), because each agent is part of the environment, so the training environment is always changing from any single agent's point of view. To overcome this problem, we design a new distributed training framework for the DMACSS called SASF (single asynchronous sharing framework); its schematic diagram is shown in Figure 3. When we train the search strategy of one AUV, the other AUVs in the system sample the environment using the same strategy to provide it with information fusion data, but only that AUV's strategy is updated in real time; the strategies used by the other AUVs remain unchanged over a period of steps, so that the training environment stays relatively stable for a while. After a certain number of training steps, the learning AUV shares the updated strategy with the others, so that the entire system improves its search capability together. Compared with traditional training frameworks, ours makes the training process converge stably without falling into local optima during training. Both value-iteration-based and policy-gradient-based RL methods can use this framework to train the ACSLA. The training process of the ACSLA is given in Algorithm 1 (trained by DQN) and Algorithm 2 (trained by DDPG). However, this training structure also has a significant shortcoming: because the experience of the other AUVs is not fully utilized, it takes a long training time to achieve satisfactory results.
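The update/share schedule of SASF can be illustrated with the toy skeleton below; the `Policy` class is a stand-in (a version counter, not an actual network), and `sasf_train` only demonstrates the "one learner, periodically shared frozen copies" schedule:

```python
class Policy:
    """Toy stand-in for an AUV's search policy."""
    def __init__(self):
        self.version = 0
    def update(self):
        self.version += 1          # one gradient step, abstracted away
    def clone_from(self, other):
        self.version = other.version

def sasf_train(n_auvs, total_steps, share_every):
    """SASF schedule: only the learning AUV's policy is updated every step;
    the other AUVs keep a frozen copy and receive the new policy every
    `share_every` steps, keeping the training environment quasi-stationary."""
    learner = Policy()
    frozen = [Policy() for _ in range(n_auvs - 1)]
    for step in range(1, total_steps + 1):
        learner.update()                   # real-time update of the learner
        if step % share_every == 0:
            for p in frozen:               # periodic policy sharing
                p.clone_from(learner)
    return learner, frozen
```

Between sharing points the frozen policies lag behind the learner, which is exactly the quasi-stationarity that stabilizes training, at the cost of not exploiting the other AUVs' experience, as noted above.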

Simulation Test
In all simulations, the size of the entire surveillance region is set to [0,25] × [0,25]. The initial positions of the AUVs are relatively close, ensuring that all AUVs start in a state where they can communicate with each other. The speed of each AUV is uniform, and each AUV has a fixed maximum steering angle. We set the observation radius to 6, the communication radius to 8, and the obstacle avoidance radius to 2. The depth change of the search area is shown in Figure 4.
We use Google's TensorFlow [33] to build a simulation environment in Python. In order to improve the calculation accuracy of the ACSLA, we use a dueling-network structure, shown in Figure 5a-c, which can reduce the calculation error of the algorithm. For the parameter updates of the neural network, the optimizer is the Adam optimization method [34] and the learning rate is set to 2.5 × 10. The replay buffer stores only 10,000 samples, and the batch size is set to 40, that is, 40 samples are used for one training step. The exploration strategy in training adopts the ε-greedy method, with the exploration probability set to 0.2. Figure 5d shows the reward curves of the ACSLA trained by the different RL methods during the training process. Environmental information entropy is used as an evaluation index of algorithm performance; it reflects the convergence speed of the probability map, and its calculation formula is as follows:

E(t) = −Σ_(m,n) [p_(m,n)(t) ln p_(m,n)(t) + (1 − p_(m,n)(t)) ln(1 − p_(m,n)(t))].

Scenario 1: This scenario compares the effects of training the ACSLA with different RL methods (the DQN and DDPG algorithms). To better show the changes of the probability map and the target distribution in the test, we use a three-dimensional histogram to display the probability map. It can be seen from Figure 6a,b that the final probability maps obtained by the ACSLA (trained by DQN and by DDPG) both accurately reflect the target distribution. Figure 6c is the change graph of the environmental information entropy; it shows that the ACSLA trained by the DDPG algorithm makes the probability map converge faster. This is because DDPG adopts the actor-critic architecture, in which an actor network fits the policy function and directly outputs actions, whereas DQN is a value-function-based algorithm that outputs the Q-value of each action instead of the action itself.
The agent then selects the corresponding action according to the Q-values. Therefore, in the same mission environment, the convergence speed of DDPG is faster than that of DQN, and this is more obvious when the action space is large.
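The environmental information entropy used as the evaluation index can be computed as follows, assuming the binary-entropy form (summing each grid's Bernoulli entropy); the clamping constant `eps` is an implementation detail added to avoid log(0):

```python
import math

def environment_entropy(prob_map, eps=1e-12):
    """Environmental information entropy of a probability map:
    E = -sum over grids of [p*ln(p) + (1-p)*ln(1-p)].
    Maximal when every grid sits at the prior 0.5; falls toward zero
    as the map converges to 0/1 values."""
    e = 0.0
    for row in prob_map:
        for p in row:
            p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
            e -= p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return e
```

A faster-falling entropy curve therefore corresponds directly to a faster-converging probability map, which is how Figures 6c and 7 are read.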
Scenario 2: This test compares the effect of different numbers of AUVs on the ACSLA. We keep the other parameters of the DMACSS unchanged and set the number of AUVs in the system to 3, 4, 5, 6, and 7, respectively; the ACSLA is trained by DDPG. As the test results in Figure 7 show, when the number of AUVs increases from 3 to 4, the convergence speed of the probability map improves significantly. When the number increases from 4 to 6, the convergence speed does not change significantly. It is worth noting, however, that when the number of AUVs increases from 6 to 7, the convergence speed drops noticeably. This is because, although more AUVs enhance the system's ability to search the area, they also significantly increase the system's computation load, which affects the convergence speed of the algorithm.
Scenario 3: In this test, we verify the performance of the ACSLA through comparative tests. We select classic search methods, namely a random algorithm and a coverage control algorithm, as comparison algorithms for the ACSLA. In the test, the other parameters remain unchanged, the number of AUVs is 4, and the number of targets increases sequentially; the ACSLA is trained by DDPG. Each algorithm performs 500 complete episodes, and average search accuracy and average search steps are used to evaluate the performance of the different search methods. The test results are shown in Table 1 and Figure 8: the system using the ACSLA has the best search performance regardless of the number of targets. The average convergence time of the ACSLA in the test is 145 steps, and the average search accuracy reaches 99.77%. Compared with the other two algorithms, it has clear advantages in both search accuracy and convergence speed.

Conclusions
In this paper, we establish the DMACSS and propose the ACSLA that is integrated into it. Compared with previous systems, DMACSS adopts a distributed control structure to improve system robustness, and combines an information fusion mechanism with a time stamp mechanism so that each AUV in the system can exchange and fuse information during the mission, improving the operating efficiency of the system. ACSLA is an adaptive learning algorithm trained by the RL method with a tailored design of state information, reward function, and training framework. We tested DMACSS in simulation experiments and compared ACSLA with other cooperative search methods. The test results show that DMACSS runs stably and that ACSLA outperforms the other search methods in search accuracy and efficiency, better realizing the cooperation between AUVs and enabling DMACSS to find the targets more accurately and faster. At the same time, our research still has some shortcomings. DMACSS lacks the ability to search for dynamic targets, and ACSLA requires a long training time to achieve stable performance. In addition, the system cannot yet perform missions in sophisticated environments (with obstacles and other interference). We hope to gradually resolve these problems in future research.

Conflicts of Interest:
The authors declare there are no conflicts of interest regarding the publication of this paper.

Appendix A
The kinematics model determines the movement ability of each AUV during the mission. When establishing the AUV's kinematics model, an inertial coordinate system and a hull coordinate system are usually used to analyze the movement of the AUV. As shown in Figure A1, the pose variables of the AUV are described in the inertial coordinate system, and the speed variables of the AUV are described in the hull coordinate system. With the help of the conversion relationship between the two coordinate systems, the position variables are calculated from the speed variables.
When the origin of the hull coordinate system coincides with the origin of the inertial coordinate system, the rotation is expanded in the order ψ → θ → φ (φ, θ, ψ are the transverse inclination (roll), longitudinal inclination (pitch), and bow (yaw) angle of the hull frame relative to the inertial frame) according to Euler's theorem. After three rotations, the coordinate axes of the hull frame coincide with those of the inertial frame. Then a position vector of the AUV in the inertial frame is recorded as η₁ = (x, y, z)ᵀ, and its attitude vector as η₂ = (φ, θ, ψ)ᵀ. According to the principle of coordinate system conversion, the following conversion relationship can be obtained:

η̇₁ = J₁(η₂)ν₁

where J₁(η₂) is the conversion (rotation) matrix and ν₁ = (u, v, w)ᵀ is the linear velocity in the hull frame. Similarly, if the angular velocities in the hull frame are denoted as ν₂ = (p, q, r)ᵀ, and the attitude rates in the inertial frame as η̇₂ = (φ̇, θ̇, ψ̇)ᵀ, the resulting conversion relationship is as follows:

η̇₂ = J₂(η₂)ν₂

where J₂(η₂) is the corresponding conversion matrix. In combination with the above, the AUV's pose vector is recorded as η = [x, y, z, φ, θ, ψ]ᵀ, and the velocity vector is recorded as ν = [u, v, w, p, q, r]ᵀ (u, v, w are the surge, sway, and heave velocities of the AUV in the hull frame); therefore, the vector form of the AUV kinematics model can be obtained:

η̇ = J(η)ν

where J(η) is the block-diagonal matrix diag(J₁(η₂), J₂(η₂)). Expanding this vector equation gives the component-wise conversion equations. Since this article aims to propose an effective multi-AUVs target collaborative search method, we ignore the influence of roll and pitch on the AUV while it executes the target search. In addition, the AUVs in the system all operate at a fixed depth that is not changed during the mission, so movement in the vertical direction is ignored. The final kinematics model of the AUV can therefore be simplified as:

ẋ = u cos ψ − v sin ψ
ẏ = u sin ψ + v cos ψ
ψ̇ = r

Figure A1. AUV's hull coordinate system and inertial coordinate system.
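The simplified planar model can be stepped forward in time with a basic Euler integration. The function name and step size below are our own choices for illustration, not code from the paper:

```python
import math

def step_kinematics(x, y, psi, u, v, r, dt=0.1):
    """One Euler step of the simplified planar AUV kinematics.

    (x, y, psi): position and heading in the inertial frame
    (u, v):      surge and sway velocities in the hull frame
    r:           yaw rate
    dt:          integration step (s)
    """
    x += (u * math.cos(psi) - v * math.sin(psi)) * dt
    y += (u * math.sin(psi) + v * math.cos(psi)) * dt
    psi += r * dt
    return x, y, psi
```

With pure surge motion (v = r = 0) the AUV advances along its current heading, as the model predicts.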
Through the above derivation, we can find that the convergence of the target probability map is related to the sensor's detection probability and false-alarm probability: a grid cell is decided when its probability reaches the upper decision threshold (target present) or the lower decision threshold (target absent). The larger the detection probability and the smaller the false-alarm probability, the fewer observations of a cell are needed, that is, the smaller the number of search times needed to determine the state of the grid. These results show that the performance of the sensor directly affects the search speed of the system.
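To illustrate why a better sensor shrinks the number of visits per cell, here is a minimal sketch of a standard Bayesian grid-cell update; the symbols p_d (detection probability), p_f (false-alarm probability), and the function name are our assumptions, not necessarily the paper's exact update rule:

```python
def update_cell(p, detected, p_d=0.9, p_f=0.1):
    """Bayesian update of one cell's target-existence probability.

    p        : prior probability that the cell contains a target
    detected : whether the sensor reported a detection on this pass
    p_d      : sensor detection probability (assumed value)
    p_f      : sensor false-alarm probability (assumed value)
    """
    if detected:
        num = p_d * p
        den = p_d * p + p_f * (1.0 - p)
    else:
        num = (1.0 - p_d) * p
        den = (1.0 - p_d) * p + (1.0 - p_f) * (1.0 - p)
    return num / den
```

With p_d = 0.9 and p_f = 0.1, a single detection already lifts a 0.5 prior to 0.9; a weaker sensor (p_d closer to p_f) moves the probability less per observation, so more passes are needed before a decision threshold is crossed, matching the conclusion above.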

Appendix D
ACSLA is trained by the RL method. When using the RL method, the agent needs to continuously interact with the environment and obtain corresponding rewards from the exploration process to learn the optimal strategy. The sample obtained by the agent interacting with the environment is (s, a, s′, r) (s: state, a: action, s′: next state, r: reward), which is called an experience fragment. These experience fragments are stored in the experience pool of the RL method and are randomly selected during learning to train the agent to learn the optimal strategy.
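The experience pool described above can be sketched as a fixed-size buffer; this is an illustrative implementation with names of our own choosing, not the paper's code:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience pool storing (s, a, s_next, r) fragments."""

    def __init__(self, capacity=10000):
        # Oldest fragments are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation between fragments.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The random draw in `sample` is what decorrelates consecutive interactions before they are fed to the learner.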

Value Iteration RL Method
RL methods are generally divided into two types: those based on value iteration and those based on the policy gradient. The core idea of value iteration is that of dynamic programming: the sub-problems of the optimal problem are solved first, and the optimal solution is then obtained through iteration. Deep Q-network (DQN) is a classic value-iteration RL method based on the Q-learning algorithm. To solve the dimensionality explosion of the Q function, the Q function is fitted with a deep neural network [35]: the input of the network is the state s and the output is Q(s, a). After computing the value function through the network, DQN uses the ε-greedy strategy to output the action a. DQN takes the optimal action to satisfy a* = argmax_a Q(s, a; θ), and the loss function is set as follows:

L(θ) = E[(r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ))²]

where θ is the parameter of the Q-network and θ⁻ is the parameter of the target network. DQN has two characteristics. The first is experience replay: the algorithm stores the agent's experiences (s, a, s′, r) in the replay buffer, and the samples used to train the Q-network are obtained by random sampling from this buffer. The second is training with a fixed network: two neural networks are used in the training update. One network, called the target network, is not updated directly; the other, called the evaluation network, is updated normally, and its parameters are copied to the target network after a certain amount of training [36].
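A minimal sketch of the two DQN ingredients above, ε-greedy action selection and the fixed-network TD target; the function names and defaults are ours, and the Q-network itself is abstracted away as a plain list of Q-values:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def td_target(r, q_next, gamma=0.99, done=False):
    """Fixed-network target: r + gamma * max_a' Q_target(s', a')."""
    return r if done else r + gamma * max(q_next)
```

The squared difference between `td_target` and the evaluation network's Q(s, a; θ) is the loss L(θ) above; `q_next` would come from the target network, whose parameters are copied over only periodically.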

Policy Gradient RL Method
Unlike value-iteration RL methods, policy gradient RL methods use a strategy π_θ to sample the environment and obtain a sequence τ = {s₁, a₁, r₁, s₂, a₂, r₂, ⋯}, where θ is the strategy parameter [37]. The probability of generating the sequence is:

p_θ(τ) = p(s₁) ∏ₜ π_θ(aₜ | sₜ) p(sₜ₊₁ | sₜ, aₜ)

Therefore, the reward generated by τ is reflected in the distribution induced by the strategy parameters. By maximizing the expected reward, the optimal θ can be calculated:

θ* = argmax_θ E_{τ∼p_θ(τ)}[R(τ)]

The gradient of the objective function with respect to θ is:

∇_θ J(θ) = E_{τ∼p_θ(τ)}[R(τ) ∇_θ log p_θ(τ)]

Since the goal of RL methods is to maximize rewards, the gradient ascent algorithm is used to update θ. The update formula is as follows:

θ ← θ + α ∇_θ J(θ)

The deep deterministic policy gradient algorithm (DDPG) [38] is the most commonly used gradient-based reinforcement learning algorithm. It uses an actor-critic framework based on a deterministic action strategy, and the deterministic policy gradient (DPG) method is adopted in the actor part.
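To make the gradient-ascent update concrete, here is a toy single-state REINFORCE sketch with a softmax policy; this is our own illustrative example, not the DDPG used in the paper. For a softmax over action preferences, the analytic gradient of log π_θ(a) with respect to preference k is 1{k = a} − π_θ(k):

```python
import math
import random

def softmax(prefs):
    """Convert action preferences (the policy parameters theta) to probabilities."""
    m = max(prefs)
    e = [math.exp(p - m) for p in prefs]
    s = sum(e)
    return [x / s for x in e]

def reinforce_step(prefs, rewards, alpha=0.1, episodes=500):
    """Single-state REINFORCE: gradient ascent on the expected reward.

    prefs   : action preferences, updated in place and returned
    rewards : true reward of each action (stands in for the environment)
    """
    for _ in range(episodes):
        probs = softmax(prefs)
        a = random.choices(range(len(prefs)), weights=probs)[0]
        r = rewards[a]
        for k in range(len(prefs)):
            # grad of log pi(a) w.r.t. pref_k is 1{k == a} - pi(k)
            grad = (1.0 if k == a else 0.0) - probs[k]
            prefs[k] += alpha * r * grad
    return prefs
```

After training on rewards [0, 1], the policy concentrates nearly all probability on the rewarding action, which is the gradient-ascent behavior described by the update formula above.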