Article

Multi-UAV Cooperative Searching and Tracking for Moving Targets Based on Multi-Agent Reinforcement Learning

Department of Management Engineering and Equipment Economics, Naval University of Engineering, Wuhan 430033, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(21), 11905; https://doi.org/10.3390/app132111905
Submission received: 24 August 2023 / Revised: 27 October 2023 / Accepted: 27 October 2023 / Published: 31 October 2023
(This article belongs to the Special Issue Advances in Unmanned Aerial Vehicle (UAV) System)

Abstract

In this paper, we propose a distributed multi-agent reinforcement learning (MARL) method to learn cooperative searching and tracking policies for multiple unmanned aerial vehicles (UAVs) with limited sensing range and communication ability. Firstly, we describe the system model for multi-UAV cooperative searching and tracking of moving targets and adopt the average observation rate and the average exploration rate as the metrics. Moreover, we propose information update and fusion mechanisms to enhance the environment perception ability of the multi-UAV system. Then, the details of our method are presented, including observation and action space representation, reward function design and a training framework based on multi-agent proximal policy optimization (MAPPO). The simulation results show that our method has good convergence performance and outperforms other baseline algorithms in terms of average observation rate and average exploration rate.

1. Introduction

Recently, UAV technology has advanced remarkably, and the cooperation of multiple UAVs has become a research hotspot [1,2]. By working closely together, multiple UAVs can demonstrate superior intelligence, coordination, and autonomy [3,4]. At the same time, cooperative target searching and tracking has become a significant application field for UAVs [5]. The aim of a multi-UAV system in such a mission is twofold: searching for undetected targets and monitoring detected targets. However, cooperation among multiple UAVs is a very challenging problem. Firstly, during the searching and tracking process, the situation changes continuously, such as the locations of the moving targets and the communication links among UAVs. It is essential to achieve real-time path planning in an unpredictable and ever-changing environment. Secondly, as the number of UAVs rises, computing the joint action of the UAVs becomes increasingly complex.
Traditional methods for target searching and tracking can be categorized into two groups. The first is based on planning theory, such as the geometric approach [6] and the region-based approach [7]. These methods primarily address zoning search or formation search, both of which are typically suited to uncomplicated settings, and the search paths of the UAVs need to be redesigned once the environment changes. The second is rooted in optimization theory [8,9]. These methods optimize target searching and tracking for a maximum coverage rate or maximum observation rate. Nevertheless, their computational complexity escalates significantly as the problem scale expands. In addition, their reliance on a centralized architecture restricts their use under severe conditions.
Reinforcement learning (RL) trains agents to make intelligent decisions through their interactions with the environment [10]. It has found extensive application in a variety of fields, such as robotics [11] and game playing [12]. Furthermore, multi-agent reinforcement learning (MARL) has developed rapidly in recent years as an extension of reinforcement learning. It focuses on learning in environments where multiple agents interact with each other. In MARL, the agents can have different objectives and may either collaborate or compete with each other. MARL is expected to be a key approach to solving the problem of multi-UAV collaboration.
This paper proposes a distributed MARL method to learn target searching and tracking policies for a multi-UAV system. Firstly, we describe the system model of multi-UAV cooperative searching and tracking for moving targets. Moreover, information update and fusion mechanisms are proposed to enhance the environment perception ability of the UAVs. Then, the details of our method are presented, including observation and action space representation, reward function design and a training framework based on MAPPO. The main contributions of our paper can be summarized as follows:
  • The system model of multi-UAV cooperative searching and tracking for moving targets is constructed, extended from the decentralized partially observable Markov decision process. By optimizing both the average observation rate of the moving targets and the average exploration rate of the mission area, the UAVs can maintain constant observation of perceived targets while continuing to explore the unknown environment;
  • A novel information update and fusion mechanism is proposed to enhance the environment perception ability of the multi-UAV system. In our model, each UAV keeps its individual cognitive information about the mission region to guide its action in a fully distributed decision-making approach. UAVs can achieve a better understanding of the environment and better cooperation via information update and fusion;
  • A distributed MARL method is proposed to learn cooperative searching and tracking policies for the multi-UAV system. The reward function and observation space are newly designed, considering both target searching and region exploration. The method is also proven effective through simulation analysis.
The remainder of the paper is organized as follows. Section 2 introduces related works about target searching and MARL. Section 3 describes the mission model of multi-UAV cooperative searching for moving targets. Section 4 details the information update and fusion mechanisms for multi-UAV system. In Section 5, the proposed method based on MARL is demonstrated. In Section 6, simulation results and discussion are presented. Finally, Section 7 concludes the paper.

2. Related Works

2.1. Target Searching and Tracking

Search theory originated from Koopman’s research during World War II [13]. As a branch of search theory, target searching and tracking aims to find more unknown targets and track the perceived targets with limited resources through cooperation among multiple agents [14]. A widely used approach in the search problem is to divide the mission area into cells and link each cell with a target existence probability to construct a probability map of the entire area [15]. The force vector method is also commonly used to solve the problem: control of multiple agents is achieved by summing force vectors that are attractive to nearby targets and repulsive to nearby agents [16,17]. In [18], the authors formulated the issue as a Multi-Objective Optimization (MOO) problem and proposed a MOO algorithm to solve it. In [19], the authors proposed constraint programming (CP) to address the target search problem for multiple UAVs.
However, it is hard for traditional planning and optimization methods to find near-optimal solutions when the problem scale is large. To reduce computational complexity, swarm intelligence algorithms have been proposed to address the target searching and tracking issue. In [20], the authors proposed a method called motion-encoded particle swarm optimization (MPSO) to locate mobile targets using UAVs. Zhen et al. proposed an intelligent collaborative mission planning method for multiple UAVs to search and locate time-sensitive moving targets in an unpredictable dynamic setting, combining an artificial potential field with ant colony optimization (HAPF-ACO) [21]. In [22], the authors presented a dynamic discrete pigeon-inspired optimization (PIO) algorithm to solve the search-attack mission planning problem for UAVs. Hayat et al. presented a multi-objective optimization approach based on a genetic algorithm (GA) to allocate tasks for UAVs [23].
In light of recent advancements in RL, some studies have also employed RL techniques to address the target searching and tracking issue. Wang et al. presented a centralized RL method for UAV target searching and tracking, assuming that the global state and the actions of other UAVs are accessible to each individual UAV [24]. Yan et al. formulated the multi-object tracking issue as a motion planning problem and used deep reinforcement learning (DRL) to train a shared policy across all agents [25]. In [26], the authors presented the experience-sharing Reciprocal Reward Multi-Agent Actor-Critic (MAAC-R) algorithm to learn a cooperative tracking policy for a UAV swarm. Shen et al. considered cooperative search for stationary targets by UAVs and proposed the DNQMIX algorithm to solve the problem [27]. A comparison of the existing works is given in Table 1.

2.2. Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) combines multi-agent systems (MAS) and reinforcement learning. It focuses on scenarios where multiple agents, each with their own set of actions and objectives, interact in a shared environment and learn how to make optimal decisions through reinforcement learning techniques [28]. An intuitive way to solve the MARL problem is to treat every agent separately and regard the other agents as part of the environment, as in the independent Q-learning (IQL) [29], independent actor-critic (IAC) [30] and independent proximal policy optimization (IPPO) [31] algorithms. However, the nonstationarity and partial observability of the multi-agent system in independent learning (IL) may result in learning instabilities or the suboptimal performance observed in practice. A common approach to address the nonstationarity of the environment is to use a fully observable critic, as in multi-agent deep deterministic policy gradient (MADDPG) [32]. It utilizes the "centralized training with decentralized execution" (CTDE) architecture [33] and uses a joint critic to train the policy network. In [34], the authors presented the Multiple Actor Attention-Critic (MAAC) algorithm, which scales efficiently with the number of agents. As a variant of proximal policy optimization (PPO), MAPPO is one of the state-of-the-art MARL algorithms [35]. It also adopts the CTDE architecture and achieves high learning efficiency with limited computing resources.
Another MARL approach is to use value function factorization. In [36], the authors proposed value decomposition networks (VDN) to break down the value function and measure the effect of each agent on the joint reward. Rashid et al. examined the same problem as VDN and introduced the QMIX algorithm [37], which surpasses VDN; QMIX incorporates additional information from the global state to improve the solution quality. In [38], the authors proposed another value decomposition method, value-decomposition actor-critic (VDAC), which can use the advantage actor-critic training framework to gather samples efficiently.

3. Problem Formulation

The system model of the cooperative searching and tracking problem for multiple UAVs is constructed in this section. We first present several assumptions to simplify the scenario. Then the formulation of the problem is given.

3.1. Scenario Description

We concentrate on the problem of multi-UAV cooperative searching and tracking for moving targets, as shown in Figure 1. The number and locations of the targets are unknown, so the UAVs must adopt an appropriate strategy to optimize the search and tracking process. To simplify the scenario and focus on the main problem, several assumptions are made as follows:
  • We assume that multiple homogeneous fixed-wing UAVs are used to search for multiple unknown moving targets. Each UAV flies in a two-dimensional plane at a specific altitude, and the UAVs prevent collisions by flying at stratified altitudes. Targets wander randomly within the mission region;
  • As shown in Figure 1, each UAV utilizes its sensor to detect the targets below it. When a target falls within the detection range of the sensor, the UAV can detect it, but UAVs cannot distinguish the identities or indices of targets;
  • Each UAV in our model possesses adequate memory and processing capacity, and the UAVs cooperate in a distributed framework. Each UAV can share information with neighboring UAVs within communication range and decides independently based on its own cognitive information.

3.2. System Model

Environment model: The mission region $O$ is regarded as a bounded two-dimensional rectangular plane and uniformly discretized into $L_x \times L_y$ square cells, which can be denoted as:
O = \left\{ G_{x,y} \mid x \in \{1, 2, \dots, L_x\},\ y \in \{1, 2, \dots, L_y\} \right\}, \quad (1)
where $L_x$ and $L_y$ represent the number of cells in the rows and columns of the mission region, respectively. In our model, we use $G_{x,y}$ to represent each cell, and the coordinate of cell $G_{x,y}$ is denoted as $g_{x,y} = (x, y)$, where $x \in \{1, 2, \dots, L_x\}$ and $y \in \{1, 2, \dots, L_y\}$.
Target model: Let $V = \{V_1, V_2, V_3, \dots, V_{N_V}\}$ be the set of $N_V$ moving targets, and let $\nu_{k,t} = (x, y)$ denote the coordinate of target $V_k$ $(1 \le k \le N_V)$ at timestep $t$ $(1 \le t \le T)$. The mission period is divided into $T$ discrete timesteps in our model, and UAVs and targets make movement decisions only at the beginning of each timestep. At each timestep, each target occupies a single cell, and the target presence state of each cell is denoted as $\theta_{x,y}$, which takes two possible values: $\theta_{x,y} = 1$ means that a target is located in cell $G_{x,y}$ and $\theta_{x,y} = 0$ means that there is no target. It is possible for more than one target to occupy a single cell.
UAV model: The multi-UAV system $U = \{U_1, U_2, U_3, \dots, U_{N_U}\}$ consists of $N_U$ homogeneous fixed-wing UAVs. The location of a UAV is determined by its projection onto the mission region $O$, and the coordinate of UAV $U_i$ $(1 \le i \le N_U)$ at timestep $t$ $(1 \le t \le T)$ is represented as $\mu_{i,t} = (x, y)$. At each timestep, a UAV can move into one of its eight neighboring cells or stay in its current cell.
Each UAV is equipped with a sensor, and UAV $U_i$ can only detect cells within its sensing area $S_{i,t}$ at timestep $t$ with sensing radius $R_s$ (a cell is regarded as completely covered if its center is covered), where
S_{i,t} = \left\{ G_{x,y} \in O \mid \left\| \mu_{i,t} - g_{x,y} \right\| \le R_s \right\}. \quad (2)
The detection result of UAV $U_i$ at timestep $t$ for cell $G_{x,y}$ is represented as $Z_{i,t}^{x,y}$, which has two possible values: $Z_{i,t}^{x,y} = 1$ means that a target is detected in the cell and $Z_{i,t}^{x,y} = 0$ means that no target is detected. The sensor can only sense the presence of a target but cannot distinguish its identity. Due to the imperfect detection ability of the sensors, the conditional probabilities of positive and negative detections are modeled as follows:
P\left(Z_{i,t}^{x,y} \mid \theta_{x,y}\right) = \begin{cases} P\left(Z_{i,t}^{x,y}=1 \mid \theta_{x,y}=1\right) = P_D \\ P\left(Z_{i,t}^{x,y}=0 \mid \theta_{x,y}=1\right) = 1 - P_D \\ P\left(Z_{i,t}^{x,y}=1 \mid \theta_{x,y}=0\right) = P_F \\ P\left(Z_{i,t}^{x,y}=0 \mid \theta_{x,y}=0\right) = 1 - P_F \end{cases} \quad (3)
where $P_D$ and $P_F$ denote the true detection probability and the false alarm probability, respectively.
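As a concrete illustration of this sensor model, the following Python sketch samples a detection outcome for one cell and enumerates the cells inside a UAV's sensing area; the function and variable names (e.g., sense_cell) are illustrative assumptions, not the authors' code.

```python
import random

def sense_cell(theta_xy: int, p_d: float = 0.9, p_f: float = 0.1) -> int:
    """Imperfect binary sensor: returns 1 (target detected) or 0 (not detected).

    theta_xy is the true presence state of the cell (1 if a target occupies it).
    A true target is detected with probability p_d; an empty cell raises a
    false alarm with probability p_f.
    """
    if theta_xy == 1:
        return 1 if random.random() < p_d else 0
    return 1 if random.random() < p_f else 0

def cells_in_sensing_area(uav_pos, grid_size, r_s=3):
    """All grid cells whose centers lie within Euclidean distance r_s of the UAV."""
    ux, uy = uav_pos
    lx, ly = grid_size
    return [(x, y) for x in range(1, lx + 1) for y in range(1, ly + 1)
            if (x - ux) ** 2 + (y - uy) ** 2 <= r_s ** 2]
```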
Each UAV is limited to interacting with its neighboring UAVs within the communication range. In our model, we only take the restricted communication range into account, disregarding communication delays, bandwidth restrictions or interruptions. The set of neighbors of UAV $U_i$ at timestep $t$ is denoted as
N_{i,t} = \left\{ U_j \in U \mid \left\| \mu_{i,t} - \mu_{j,t} \right\| \le R_c \right\}, \quad (4)
where $R_c$ represents the communication range, and we use $d_{i,t} = |N_{i,t}|$ to denote the number of neighbors of UAV $U_i$ at timestep $t$ (a UAV is regarded as a neighbor of itself).
Metrics: Target $V_k$ is observed when it is within the sensing area of at least one UAV, which can be denoted as
O_{k,t} = \begin{cases} 1, & \text{if } \exists\, U_i,\ \left\| \nu_{k,t} - \mu_{i,t} \right\| \le R_s \\ 0, & \text{otherwise} \end{cases} \quad (5)
where $O_{k,t}$ is the observation state of target $V_k$ at timestep $t$. The observation rate $\alpha_k$ of target $V_k$ can be denoted as
\alpha_k = \frac{1}{T} \sum_{t=1}^{T} O_{k,t}, \quad (6)
which is the proportion of the time under surveillance to the total mission time. Therefore, one objective of the UAVs is to maximize the average observation rate $\bar{\alpha}$ over the $N_V$ targets, which can be represented as follows:
\bar{\alpha} = \frac{1}{N_V} \sum_{k=1}^{N_V} \alpha_k. \quad (7)
In addition, since the number of targets is unknown, the multi-UAV system must persistently search the mission area to detect new targets. Therefore, another objective of the multi-UAV system is to maximize the exploration rate $\beta$ of the mission area, which can be represented as
\beta = \frac{1}{T} \cdot \frac{1}{L_x L_y} \sum_{x=1}^{L_x} \sum_{y=1}^{L_y} F_{x,y}, \quad (8)
where $F_{x,y}$ denotes the latest observed time of cell $G_{x,y}$. The maximum value of $\beta$ is 1, which would mean that every cell of the mission area is under observation throughout the whole mission. However, this cannot be achieved, since the combined sensing area of the UAVs is smaller than the mission area, i.e., $\left| \bigcup_{U_i \in U} S_{i,t} \right| < |O|$. In the scenario of cooperative searching and tracking for moving targets, the multi-UAV system needs to balance monitoring perceived targets against exploring the mission area [25].
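For clarity, the two metrics can be computed from a simulation log as in the following sketch; the array layouts and names are assumptions for illustration.

```python
import numpy as np

def average_observation_rate(obs_states: np.ndarray) -> float:
    """obs_states has shape (T, N_V); entry [t, k] is O_{k,t} in {0, 1}.

    alpha_k is the fraction of timesteps during which target k lies inside at
    least one UAV's sensing area; the metric is the mean of alpha_k over targets.
    """
    alpha_k = obs_states.mean(axis=0)      # per-target observation rate
    return float(alpha_k.mean())

def exploration_rate(latest_obs_time: np.ndarray, T: int) -> float:
    """latest_obs_time has shape (L_x, L_y); entry [x, y] is F_{x,y}, the last
    timestep at which the cell was covered (0 if never covered).

    beta = 1 only if every cell is under observation at the final timestep.
    """
    return float(latest_obs_time.mean() / T)
```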

4. Information Update and Fusion

In our model, each UAV keeps its individual cognitive information about the mission region to guide its action in a fully distributed decision-making approach. There are three individual cognitive information maps, i.e., target probability map, environment search status map and UAV location map. In this section, we will introduce these three maps and their information update and fusion mechanisms.

4.1. Target Probability Map

In the multi-UAV system, each UAV $U_i$ keeps a target probability map $P_{i,t}$ of the whole mission region, which can be represented as:
P_{i,t} = \left[ P_{i,t}^{x,y} \right]_{L_x \times L_y}. \quad (9)
$P_{i,t}^{x,y} \in [0, 1]$ indicates the target existence probability in cell $G_{x,y}$ at timestep $t$ as understood by UAV $U_i$. Initially, $P_{i,0}^{x,y} = 0.5$, which means that there is no prior information about the distribution of targets.
Fundamentally, every UAV should be able to update its target probability map relying solely on its own observations, without exchanging information with neighbors. This ensures that the multi-UAV system remains robust to any disruption in communication. Therefore, we first introduce the target probability update from the observations of each individual UAV without cooperation. The Bayesian rule [39,40] is the most popular technique to update the probability map based on observations, which is represented as
P_{i,t}^{x,y} = P\left(\theta_{x,y}=1 \mid Z_{i,t}^{x,y}\right) = \frac{P\left(Z_{i,t}^{x,y} \mid \theta_{x,y}=1\right) P_{i,t-1}^{x,y}}{P\left(Z_{i,t}^{x,y} \mid \theta_{x,y}=1\right) P_{i,t-1}^{x,y} + P\left(Z_{i,t}^{x,y} \mid \theta_{x,y}=0\right) \left(1 - P_{i,t-1}^{x,y}\right)} = \begin{cases} \dfrac{P_D P_{i,t-1}^{x,y}}{P_D P_{i,t-1}^{x,y} + P_F \left(1 - P_{i,t-1}^{x,y}\right)}, & \text{if } Z_{i,t}^{x,y} = 1 \\ \dfrac{(1-P_D)\, P_{i,t-1}^{x,y}}{(1-P_D)\, P_{i,t-1}^{x,y} + (1-P_F) \left(1 - P_{i,t-1}^{x,y}\right)}, & \text{if } Z_{i,t}^{x,y} = 0 \\ P_{i,t-1}^{x,y}, & \text{otherwise} \end{cases} \quad (10)
It is evident that this probability map update is not linear. To reduce computational complexity, a linear update method, initially suggested in [41], is introduced. Firstly, Equation (10) is equivalent to
\frac{1}{P_{i,t}^{x,y}} - 1 = \begin{cases} \dfrac{P_F}{P_D} \left( \dfrac{1}{P_{i,t-1}^{x,y}} - 1 \right), & \text{if } Z_{i,t}^{x,y} = 1 \\ \dfrac{1-P_F}{1-P_D} \left( \dfrac{1}{P_{i,t-1}^{x,y}} - 1 \right), & \text{if } Z_{i,t}^{x,y} = 0 \\ \dfrac{1}{P_{i,t-1}^{x,y}} - 1, & \text{otherwise} \end{cases} \quad (11)
A nonlinear transformation of $P_{i,t}^{x,y}$ is adopted:
Q_{i,t}^{x,y} \triangleq \ln \left( \frac{1}{P_{i,t}^{x,y}} - 1 \right). \quad (12)
Then, Equation (11) can be rewritten as:
Q_{i,t}^{x,y} = Q_{i,t-1}^{x,y} + \Delta_{i,t}^{x,y}, \quad (13)
where
\Delta_{i,t}^{x,y} = \begin{cases} \ln \dfrac{P_F}{P_D}, & \text{if } Z_{i,t}^{x,y} = 1 \\ \ln \dfrac{1-P_F}{1-P_D}, & \text{if } Z_{i,t}^{x,y} = 0 \\ 0, & \text{otherwise} \end{cases} \quad (14)
Compared to the Bayesian method in Equation (10), the update in Equation (13), which is linear in $Q_{i,t}^{x,y}$, is more efficient to compute.
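The following single-cell sketch contrasts the Bayesian update of Equation (10) with the linear log-odds update of Equations (12)-(14); it is a minimal illustration rather than the authors' implementation, and the assertion at the end checks that the two updates agree.

```python
import math

P_D, P_F = 0.9, 0.1  # detection and false-alarm probabilities (Table 2)

def bayes_update(p_prev: float, z: int) -> float:
    """Nonlinear Bayesian update of the target existence probability (Eq. 10)."""
    if z == 1:
        return P_D * p_prev / (P_D * p_prev + P_F * (1 - p_prev))
    return (1 - P_D) * p_prev / ((1 - P_D) * p_prev + (1 - P_F) * (1 - p_prev))

def to_log_odds(p: float) -> float:
    """Q = ln(1/P - 1), the transform of Eq. (12)."""
    return math.log(1.0 / p - 1.0)

def from_log_odds(q: float) -> float:
    """Inverse transform: P = 1 / (1 + exp(Q))."""
    return 1.0 / (1.0 + math.exp(q))

def linear_update(q_prev: float, z: int) -> float:
    """Additive update in log-odds space (Eqs. 13-14)."""
    delta = math.log(P_F / P_D) if z == 1 else math.log((1 - P_F) / (1 - P_D))
    return q_prev + delta

# The two updates agree: starting from P = 0.5 with a positive detection,
# both yield P = 0.9.
p = bayes_update(0.5, 1)
q = linear_update(to_log_odds(0.5), 1)
assert abs(p - from_log_odds(q)) < 1e-12
```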
When multiple UAVs are deployed, each UAV can broadcast its cognitive information to its neighbors for information fusion and cooperation, which results in better performance. Each UAV first examines the cells in its sensing area and then passes its observations to its neighboring UAVs. After receiving the observations of the other UAVs, $Q_{i,t}^{x,y}$ is updated as
R_{i,t}^{x,y} = \eta\, Q_{i,t-1}^{x,y} + \sum_{j \in N_{i,t}} \Delta_{j,t}^{x,y}, \quad (15)
where $\eta \in (0, 1)$ is the information decaying factor. Then, each UAV $U_i$ transmits $R_{i,t}^{x,y}$ of the whole area to its neighbors for information fusion, which is represented by:
Q_{i,t}^{x,y} = \sum_{j \in N_{i,t}} w_{i,j,t}\, R_{j,t}^{x,y}, \quad (16)
where
w_{i,j,t} = \begin{cases} 1 - \dfrac{d_{i,t} - 1}{N_U}, & \text{if } j = i \\ \dfrac{1}{N_U}, & \text{if } j \ne i \text{ and } j \in N_{i,t} \\ 0, & \text{otherwise} \end{cases} \quad (17)
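A minimal sketch of this two-stage exchange for one UAV, assuming the log-odds maps and observation increments are already available as NumPy arrays; the function name and argument layout are assumptions.

```python
import numpy as np

def fuse_probability_maps(q_prev, deltas, other_r_maps, n_uavs, eta=0.1):
    """Two-stage update of UAV i's log-odds map Q_i (Eqs. 15-17).

    q_prev       : (L_x, L_y) array, Q_{i,t-1}
    deltas       : list of (L_x, L_y) arrays, Delta_{j,t} received from every
                   neighbor j in N_{i,t} (including UAV i's own observations)
    other_r_maps : list of (L_x, L_y) arrays, R_{j,t} received from the
                   neighbors j != i after their own first-stage updates
    n_uavs       : total number of UAVs N_U, used by the fusion weights
    eta          : information decaying factor in (0, 1)
    """
    # Stage 1 (Eq. 15): decay the old map and add all received increments.
    r_i = eta * q_prev + sum(deltas)

    # Stage 2 (Eqs. 16-17): weighted consensus over the neighbors' R maps;
    # UAV i keeps weight 1 - (d-1)/N_U and each other neighbor gets 1/N_U.
    d = len(other_r_maps) + 1                 # d_{i,t}, the UAV counts itself
    q_new = (1.0 - (d - 1) / n_uavs) * r_i
    for r_j in other_r_maps:
        q_new += r_j / n_uavs
    return q_new
```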

4.2. Environment Search Status Map

The environment search status map, which records the latest detected time of each cell, can be represented as follows:
E_{i,t} = \left[ E_{i,t}^{x,y} \right]_{L_x \times L_y}, \quad (18)
where $E_{i,t}^{x,y}$ denotes the latest observed time of cell $G_{x,y}$ at timestep $t$ as understood by UAV $U_i$. The initial value is set to 0, i.e., $E_{i,0}^{x,y} = 0$.
The environment search status map $E_{i,t}$ is updated in two stages. The map $E_{i,t}$ is first updated by the individual detection of UAV $U_i$ at each timestep $t$, which can be represented as
E_{i,t}^{x,y} = t, \quad \text{if } G_{x,y} \in S_{i,t}. \quad (19)
Then, the environment search status maps received from the neighboring UAVs are integrated. The information fusion mechanism of the environment search status map can be denoted as follows:
E_{i,t}^{x,y} = E_{j,t}^{x,y}, \quad \text{if } E_{j,t}^{x,y} > E_{i,t}^{x,y} \text{ and } U_j \in N_{i,t}. \quad (20)

4.3. UAV Location Map

The UAV location map records the positions of UAVs and can be represented as follows:
L_{i,t} = \left[ L_{i,t}^{x,y} \right]_{L_x \times L_y}, \quad (21)
where $L_{i,t}^{x,y}$ indicates whether a UAV is in cell $G_{x,y}$ at timestep $t$ as understood by UAV $U_i$. $L_{i,t}^{x,y}$ is defined as a binary variable: $L_{i,t}^{x,y} = 1$ means that there is a UAV in the cell and $L_{i,t}^{x,y} = 0$ means that there is none.
The UAV location map $L_{i,t}$ is updated in two stages. The map $L_{i,t}$ of UAV $U_i$ is first updated with the location information of its neighbors within communication range at each timestep $t$, which can be represented as
L_{i,t}^{x,y} = \begin{cases} 1, & \text{if } \mu_{j,t} = (x, y) \text{ and } U_j \in N_{i,t} \\ 0, & \text{otherwise} \end{cases} \quad (22)
Then, the UAV location maps received from the neighboring UAVs are integrated. The information fusion mechanism of the UAV location map can be represented as follows:
L_{i,t}^{x,y} = L_{j,t}^{x,y}, \quad \text{if } L_{j,t}^{x,y} > L_{i,t}^{x,y} \text{ and } U_j \in N_{i,t}. \quad (23)
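Both fusion rules (Equations (20) and (23)) amount to an element-wise maximum over the neighbors' maps, which the following sketch makes explicit; the names and 0-based array indexing are illustrative assumptions.

```python
import numpy as np

def update_search_status(e_map, sensing_cells, t):
    """Stage 1 (Eq. 19): stamp the current timestep on every cell inside the
    sensing area. Grid coordinates are 1-based in the model, arrays 0-based."""
    for (x, y) in sensing_cells:
        e_map[x - 1, y - 1] = t
    return e_map

def fuse_by_maximum(own_map, neighbor_maps):
    """Stage 2 for both the search status map (Eq. 20) and the UAV location
    map (Eq. 23): keep the larger entry, i.e., the more recent timestamp or
    any reported UAV presence."""
    fused = own_map.copy()
    for m in neighbor_maps:
        fused = np.maximum(fused, m)
    return fused
```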

5. Proposed Method

5.1. Preliminaries

5.1.1. Decentralized Partially Observable Markov Decision Process

The cooperative searching and tracking problem for multiple UAVs can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP). At every timestep $t$, each agent in the system makes its decision based on its local observation (including perception and communication). Then the environment executes the joint action of all agents and returns the immediate rewards and the next global state.
Assuming there are n agents, the Dec-POMDP can be described as follows (the time subscript t is omitted for convenience and the joint variables over all agents are bold) [42]:
\left( N, S, A, T, R, O, Z, \gamma \right), \quad (24)
where:
  • $N$ denotes the set of $n$ agents;
  • $S$ denotes the global state space, and $s \in S$ denotes the current state;
  • $A: \{A^{(1)}, A^{(2)}, \dots, A^{(n)}\}$ denotes the joint action space of all agents, and $a^{(i)} \in A^{(i)}$ is the action of the $i$-th agent;
  • $T: P(s' \mid s, \mathbf{a}) \in [0, 1]$ is the state transition probability function from state $s$ to the next state $s'$ given the joint action $\mathbf{a}: \{a^{(1)}, a^{(2)}, \dots, a^{(n)}\}$;
  • $R(s, \mathbf{a}): \{r^{(1)}, r^{(2)}, \dots, r^{(n)}\}$ denotes the joint reward obtained by executing the joint action $\mathbf{a}$ in state $s$;
  • $O: \{O^{(1)}, O^{(2)}, \dots, O^{(n)}\}$ represents the joint observation space of all agents, and $o^{(i)} \in O^{(i)}$ denotes the local observation of the $i$-th agent;
  • $Z: o^{(i)} = Z(s, i)$ is the local observation function of the $i$-th agent given the global state $s$;
  • $\gamma \in [0, 1]$ denotes the constant discount factor.

5.1.2. Multi-Agent Proximal Policy Optimization

The performance of single-agent reinforcement learning algorithms is usually unsatisfactory in multi-agent systems, because each agent is influenced by the environment and the other agents at the same time. Therefore, it is necessary to use MARL to train the agents in a Dec-POMDP. As a variant of proximal policy optimization (PPO) specialized for multi-agent settings, Multi-Agent Proximal Policy Optimization (MAPPO) is one of the state-of-the-art MARL algorithms [35]. The algorithm adopts the centralized training with decentralized execution (CTDE) architecture and achieves high learning efficiency with limited computing resources. In this paper, we adopt MAPPO as the learning method for the multi-UAV cooperative searching and tracking problem.
PPO is an on-policy reinforcement learning algorithm based on the actor-critic framework. It originates from Trust Region Policy Optimization (TRPO) and addresses the sensitivity of policy gradient algorithms to the step size and the difficulty of determining an appropriate step size. The objective of RL is to learn the optimal policy $\pi^*$ that maximizes the cumulative discounted reward $R_t$:
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \quad (25)
where $\gamma \in [0, 1]$ is the discount factor. The policy $\pi$ can be evaluated by the state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s, a)$. The state-value function $V^{\pi}(s)$ is the expected cumulative discounted reward obtained by following the policy $\pi$ from state $s$:
V^{\pi}(s_t) = \mathbb{E}_{\pi} \left[ R_t \mid s_t \right] = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t \right]. \quad (26)
The action-value function $Q^{\pi}(s, a)$ is the expected return obtained by following the policy $\pi$ after taking action $a$ in state $s$:
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi} \left[ R_t \mid s_t, a_t \right] = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t, a_t \right]. \quad (27)
To improve the learning efficiency and enhance the stability of the training process, the advantage function $A^{\pi}(s, a)$ is introduced:
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t). \quad (28)
The advantage function $A^{\pi}(s, a)$ measures how much better a given action is than following the policy. Define $\hat{A}_t$ as the estimate of the advantage function $A^{\pi}(s, a)$ at timestep $t$:
\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T), \quad (29)
where $T$ is the trajectory length. The policy gradient method solves the reinforcement learning problem by directly optimizing the policy. The clipped loss function of the policy gradient method used in PPO is denoted as
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}\left( r_t(\theta), 1-\varepsilon, 1+\varepsilon \right) \hat{A}_t \right) \right], \quad (30)
where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio. The clip function in Equation (30) restricts $r_t(\theta)$ to the interval $[1-\varepsilon, 1+\varepsilon]$.
Yu et al. [35] proposed the MAPPO algorithm for POMDPs based on the PPO method. The method proposed in this paper uses MAPPO to train two separate networks, a policy network $\pi_{\theta}$ with parameters $\theta$ and a value network $V_{\phi}$ with parameters $\phi$, which are shared among all homogeneous agents. The policy network $\pi_{\theta}$ maps the observation $o_t$ of an agent to the discrete action space, and the value network $V_{\phi}$ maps the state $s_t$ of the environment to the current value. For multiple agents, the loss function $L^{CLIP}(\theta)$ is rewritten as
L^{CLIP}(\theta) = \frac{1}{BN} \sum_{i=1}^{B} \sum_{k=1}^{N} \min \left( r_{\theta,i}^{(k)} A_i^{(k)},\ \mathrm{clip}\left( r_{\theta,i}^{(k)}, 1-\varepsilon, 1+\varepsilon \right) A_i^{(k)} \right), \quad (31)
where $B$ is the batch size and $N$ is the number of agents. In Equation (31), $r_{\theta,i}^{(k)} = \frac{\pi_{\theta}\left(a_i^{(k)} \mid s_i^{(k)}\right)}{\pi_{\theta_{\mathrm{old}}}\left(a_i^{(k)} \mid s_i^{(k)}\right)}$ is the probability ratio in the multi-agent system and $A_i^{(k)}$ is the advantage function. Correspondingly, the loss function $L^{CLIP}(\phi)$ of the value network $V_{\phi}$ is rewritten as
L^{CLIP}(\phi) = \frac{1}{BN} \sum_{i=1}^{B} \sum_{k=1}^{N} \max \left[ \left( V_{\phi}\left(s_i^{(k)}\right) - \hat{R}_i \right)^2,\ \left( \mathrm{clip}\left( V_{\phi}\left(s_i^{(k)}\right),\ V_{\phi_{\mathrm{old}}}\left(s_i^{(k)}\right) - \varepsilon,\ V_{\phi_{\mathrm{old}}}\left(s_i^{(k)}\right) + \varepsilon \right) - \hat{R}_i \right)^2 \right]. \quad (32)
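A minimal PyTorch sketch of the two clipped losses, with the B × N agent-samples flattened into one-dimensional tensors; the tensor names are assumptions, and this is an illustration of Equations (31) and (32) rather than the authors' code.

```python
import torch

def mappo_policy_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective averaged over the B*N agent-samples (Eq. 31).

    log_probs, old_log_probs, advantages: 1-D tensors of length B*N.
    Returns the loss to minimize (the negative of the surrogate objective).
    """
    ratio = torch.exp(log_probs - old_log_probs)              # r_{theta,i}^{(k)}
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def mappo_value_loss(values, old_values, returns, eps=0.2):
    """Clipped value loss (Eq. 32): the larger of the clipped and unclipped
    squared errors, which keeps the new value estimate close to the old one."""
    clipped_values = old_values + torch.clamp(values - old_values, -eps, eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (clipped_values - returns) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()
```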

5.2. Observation and Action Space Representation

For the cooperative searching and tracking problem in this paper, each UAV in the multi-UAV system keeps its individual cognitive information about the mission region to guide its action in a fully distributed decision-making approach. However, if a UAV used the cognitive information of the entire mission area, the large amount of outdated or incorrect information could confuse its decision-making. Thus, we assume that each UAV has a limited field of observation and only utilizes the cognitive information within this restricted scope. Decreasing the input dimension of the policy network speeds up training, and the fixed input dimension also enhances generalization to mission regions of any size.
The observation field is defined as a square area (with side length $L_o$) centered on the UAV's current cell. At each timestep, the UAV extracts this local portion of its cognitive information about the mission region as the input of the policy network. The observation of UAV $U_i$ at timestep $t$ is composed of three parts (a sketch of how this observation is assembled follows the list):
o_{i,t} = \left[ p_{i,t},\ e_{i,t},\ l_{i,t} \right], \quad (33)
  • Target probability information $p_{i,t}$: extracted from the target probability map $P_{i,t}$; the target existence probability outside the mission region is assumed to be 0;
  • Environment search status information $e_{i,t}$: extracted from the environment search status map $E_{i,t}$, where $E_{i,t}^{x,y}$ is divided by $t$ for normalization; the environment search status outside the mission region is assumed to be 1, which means that there is nothing left to explore;
  • UAV location information $l_{i,t}$: extracted from the UAV location map $L_{i,t}$; the value of $L_{i,t}^{x,y}$ outside the mission region is assumed to be 0, which means that there is no UAV.
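The three map crops can be assembled into the local observation as in the following sketch, with the out-of-region padding values described above; the names and the channel-first layout are assumptions for illustration.

```python
import numpy as np

def local_observation(p_map, e_map, l_map, uav_cell, t, l_o=7):
    """Assemble the 3 x L_o x L_o local observation of one UAV.

    p_map, e_map, l_map : full-region maps of shape (L_x, L_y) holding target
                          probability, latest observed time and UAV locations.
    uav_cell            : (x, y) cell of the UAV, 1-based as in the model.
    t                   : current timestep, used to normalize the search status.
    Cells outside the mission region are padded with 0 (no target), 1 (nothing
    left to explore) and 0 (no UAV), respectively.
    """
    half = l_o // 2
    x0, y0 = uav_cell[0] - 1, uav_cell[1] - 1        # convert to 0-based indices

    def crop(full_map, pad_value):
        padded = np.pad(full_map, half, mode="constant", constant_values=pad_value)
        return padded[x0:x0 + l_o, y0:y0 + l_o]

    p = crop(p_map, 0.0)
    e = crop(e_map / max(t, 1), 1.0)                 # normalize timestamps by t
    l = crop(l_map, 0.0)
    return np.stack([p, e, l], axis=0)               # shape (3, L_o, L_o)
```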
At each timestep, each UAV has at most nine available actions based on its current position. The action space comprises the cells surrounding the UAV, meaning that each UAV may remain in its present cell or move into any of its eight adjacent cells. Any action that would cause the UAV to exit the mission region at the next timestep is excluded from the list of available actions, as illustrated in the sketch below.
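A sketch of this action masking under an assumed action encoding (stay plus the eight neighboring cells):

```python
# The nine candidate actions: stay plus the eight neighboring cells.
MOVES = [(0, 0), (-1, -1), (-1, 0), (-1, 1), (0, -1),
         (0, 1), (1, -1), (1, 0), (1, 1)]

def available_actions(uav_cell, grid_size):
    """Boolean mask over MOVES; an action is invalid if it exits the region."""
    x, y = uav_cell
    lx, ly = grid_size
    return [1 <= x + dx <= lx and 1 <= y + dy <= ly for dx, dy in MOVES]

# Example: a UAV in the corner cell (1, 1) of a 20 x 20 region can only stay
# or move toward the interior.
print(available_actions((1, 1), (20, 20)))
```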
In our paper, the policy network $\pi_{\theta}$ and the value network $V_{\phi}$ are constructed to process observation or state inputs. Their outputs are the probability distribution over the available actions and the estimate of the expected value, respectively. The state input to the value network $V_{\phi}$ contains the same three parts as the observation $o_{i,t}$ mentioned above: target probability information, environment search status information and UAV location information. However, the state input represents the global information of the environment rather than the local observation of a specific agent.
There are two kinds of network architectures used in our method: one utilizes a CNN (convolutional neural network) and fully connected layers to process the information, while the other additionally incorporates an RNN (recurrent neural network). The network architecture of recurrent-MAPPO is shown in Figure 2. The policy and value networks first use a CNN to process the input information. The output is then flattened into a one-dimensional feature vector, followed by two fully connected layers and a gated recurrent unit (GRU) network. The RNN is intended to mine useful information hidden in historical observations. Finally, the policy network uses a softmax function to obtain the selection probability of each action, and the value network outputs the estimate of the expected value. We also construct a network architecture without the RNN to train the agents; it omits the GRU layer while keeping the rest of the architecture unchanged.
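A PyTorch sketch of the recurrent actor described above; the channel counts and hidden sizes are illustrative assumptions, since the paper does not list exact layer dimensions, and the value network would share the same trunk but end in a single scalar output.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """CNN feature extractor -> two fully connected layers -> GRU -> action probabilities."""

    def __init__(self, obs_channels=3, obs_size=7, n_actions=9, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(obs_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * obs_size * obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, h):
        # obs: (batch, 3, L_o, L_o); h: (1, batch, hidden) recurrent state
        x = self.fc(self.cnn(obs)).unsqueeze(1)       # (batch, 1, hidden)
        x, h = self.gru(x, h)
        logits = self.head(x.squeeze(1))
        return torch.softmax(logits, dim=-1), h
```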

5.3. Reward Function Design

To search for and track more targets, the UAVs need to balance tracking the already perceived targets against exploring the mission region. Besides, they should avoid overlapping their sensing ranges in order to maximize the total detection area. Rewards shape the agents' behavior in reinforcement learning, so it is critical to devise reasonable rewards for the multi-UAV system. The reward function proposed in this paper consists of three terms:
r_{i,t} = r_{i,t}^1 + r_{i,t}^2 + r_{i,t}^3, \quad (34)
where $r_{i,t}$ is the reward of UAV $U_i$ at timestep $t$.
Tracking reward: We use $r_{i,t}^1$ to encourage UAVs to track the discovered targets. It is commonly believed that the closer the UAV is to a target, the better it can track the target. Thus, the reward for UAV $U_i$ tracking target $V_k$ at timestep $t$ is as follows:
r_{i,t}^1(V_k) = \begin{cases} 1 + \dfrac{R_s - \left\| \mu_{i,t} - \nu_{k,t} \right\|}{R_s}, & \text{if } \left\| \mu_{i,t} - \nu_{k,t} \right\| \le R_s \\ 0, & \text{otherwise} \end{cases} \quad (35)
The reward of UAV $U_i$ at timestep $t$ for tracking multiple targets is calculated as $r_{i,t}^1 = \omega_1 \sum_{k=1}^{N_V} r_{i,t}^1(V_k)$, where $\omega_1$ is a positive coefficient.
Exploring reward: The reward $r_{i,t}^2$ encourages UAVs to explore the mission region and can be represented as follows:
r_{i,t}^2 = \omega_2 \left( \beta_{i,t} - \beta_{i,t-1} \right), \quad (36)
where $\omega_2$ is a positive coefficient. In Equation (36), $\beta_{i,t}$ denotes the local exploration rate of the mission area of UAV $U_i$ at timestep $t$, which can be denoted as follows:
\beta_{i,t} = \frac{1}{t} \cdot \frac{1}{L_x L_y} \sum_{x=1}^{L_x} \sum_{y=1}^{L_y} E_{i,t}^{x,y}, \quad (37)
which can be calculated using the UAV's local environment search status map $E_{i,t}$.
Overlapping punishment: The reward $r_{i,t}^3$ is designed as a punishment applied to UAV $U_i$ for approaching another UAV $U_j$ too closely, which can be represented as follows:
r_{i,t}^3(U_j) = \begin{cases} -\exp\!\left( \dfrac{2R_s - \left\| \mu_{i,t} - \mu_{j,t} \right\|}{2R_s} \right), & \text{if } \left\| \mu_{i,t} - \mu_{j,t} \right\| \le 2R_s \text{ and } i \ne j \\ 0, & \text{otherwise} \end{cases} \quad (38)
Therefore, the punishment of UAV $U_i$ at timestep $t$ for approaching multiple UAVs too closely can be represented as $r_{i,t}^3 = \omega_3 \sum_{j=1}^{N_U} r_{i,t}^3(U_j)$, where $\omega_3$ is a positive coefficient.
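The three reward terms can be computed as in the following sketch; the negative sign on the overlap term reflects its role as a punishment, the coefficient defaults follow Table 2, and the function names are assumptions rather than the authors' code.

```python
import math

def tracking_reward(uav_pos, target_positions, r_s=3, w1=2.0):
    """Eq. (35): the reward grows as the UAV gets closer to each target it covers."""
    total = 0.0
    for tx, ty in target_positions:
        dist = math.dist(uav_pos, (tx, ty))
        if dist <= r_s:
            total += 1.0 + (r_s - dist) / r_s
    return w1 * total

def exploring_reward(beta_now, beta_prev, w2=1.0):
    """Eq. (36): reward for increasing the UAV's local exploration rate."""
    return w2 * (beta_now - beta_prev)

def overlap_punishment(uav_pos, other_uav_positions, r_s=3, w3=0.5):
    """Eq. (38): penalty for flying too close to another UAV (sensing overlap)."""
    total = 0.0
    for ox, oy in other_uav_positions:
        dist = math.dist(uav_pos, (ox, oy))
        if dist <= 2 * r_s:
            total -= math.exp((2 * r_s - dist) / (2 * r_s))
    return w3 * total

def step_reward(uav_pos, targets, others, beta_now, beta_prev):
    """Eq. (34): the per-UAV reward is the sum of the three terms."""
    return (tracking_reward(uav_pos, targets)
            + exploring_reward(beta_now, beta_prev)
            + overlap_punishment(uav_pos, others))
```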

5.4. Training Framework Based on MAPPO

The MAPPO algorithm, which adopts the CTDE architecture, is used to train the agents in our paper. The framework of MAPPO is shown in Figure 3. In the training stage, parallel training environments generate the global state $s_t$. Then the observation model $O$ and the reward function $R$ generate the global observation $o_t$ and the global reward $r_t$, respectively. The policy networks of the UAVs generate the joint action $a_t$ based on the observation $o_t$. Finally, the updated global state $s_{t+1}$ is obtained by feeding the joint action $a_t$ into the training environment, completing one interaction loop. During each interaction, the intermediate data $[s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}]$ are stored in the replay buffer. Once the accumulated experience reaches a predetermined amount, the gradients are calculated and the weights of the policy and value networks are updated.
In the testing stage, the observation model $O$ takes the global state $s_t$ as input and outputs the observation $o_t$ of each UAV. The UAVs then generate the joint action $a_t$ based on $o_t$. Finally, the testing environment updates the global state based on $a_t$, completing the interaction. The pseudocode of the MAPPO algorithm is shown in Algorithm 1.
Algorithm 1: MAPPO
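The pseudocode figure itself is not reproduced here. The following is a minimal sketch of the CTDE rollout-and-update loop described above; the env, actor and critic interfaces, the simple Monte Carlo return estimate, and the helper names are all assumptions for illustration, and the loss functions are the ones sketched in Section 5.1.2.

```python
import torch

def train_mappo(env, actor, critic, actor_opt, critic_opt,
                n_iterations=1000, rollout_len=200, ppo_epochs=4, gamma=0.99):
    """Skeleton of the MAPPO training loop: decentralized rollouts with the
    shared policy, a centralized critic on the global state, then clipped
    PPO updates. `env`, `actor` and `critic` are assumed interfaces."""
    for _ in range(n_iterations):
        # 1. Rollout: each UAV acts on its own local observation,
        #    while the critic evaluates the shared global state.
        states, obs, actions, logps, rewards, values = [], [], [], [], [], []
        s, o = env.reset()
        for _ in range(rollout_len):
            a, logp = actor.act(o)
            v = critic.value(s)
            s_next, o_next, r, done = env.step(a)
            states.append(s); obs.append(o); actions.append(a)
            logps.append(logp); rewards.append(r); values.append(v)
            s, o = (env.reset() if done else (s_next, o_next))

        # 2. Discounted returns and advantages (a simple Monte Carlo stand-in
        #    for the estimator of Eq. (29)).
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.as_tensor(returns)
        old_values = torch.stack(values).detach()
        old_logps = torch.stack(logps).detach()
        advantages = returns - old_values

        # 3. Centralized updates with the clipped losses of Eqs. (31)-(32).
        for _ in range(ppo_epochs):
            new_logps = actor.evaluate(obs, actions)
            new_values = critic.evaluate(states)
            p_loss = mappo_policy_loss(new_logps, old_logps, advantages)
            v_loss = mappo_value_loss(new_values, old_values, returns)
            actor_opt.zero_grad(); p_loss.backward(); actor_opt.step()
            critic_opt.zero_grad(); v_loss.backward(); critic_opt.step()
```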

6. Simulation Results

6.1. Setting Up

The simulation settings are introduced first. In our simulation environment, the multi-UAV system searches for and tracks moving targets according to a given strategy, and the performance of the algorithms is measured on a variety of problem scenarios of different scales. All experiments are carried out on a PC with the Windows 10 operating system, an AMD 3.8 GHz CPU, 16 GB of memory, and a GeForce RTX 3070 GPU. The simulation programs are developed in Python 3.10 with PyTorch.
It is assumed that each cell has a side length of 1, which serves as the basis for all distance metrics. The positions of the UAVs and targets are randomly initialized for each problem instance. The maximum speed of a UAV is set to 1 cell per timestep and the maximum speed of a target is set to 0.5 cells per timestep. We set different sizes of the mission region and different numbers of UAVs and targets for problem instances of different scales. The other simulation parameters are listed in Table 2.

6.2. Convergence Analysis

The convergence performance of MAPPO and IPPO with different network architectures is compared in this subsection. The learning curves of the episode rewards, the average observation rate $\alpha$ and the average exploration rate $\beta$ with respect to training timesteps are depicted in Figure 4. The algorithms using the basic network architecture are referred to as MAPPO and IPPO, while those using the network with the GRU module are labeled rMAPPO and rIPPO. We train each method five times for $10^5$ timesteps. The mean is represented by dark lines and the standard deviation over five random seeds is represented by the shaded area. The results in Figure 4 are obtained from the cooperative search mission of 5 UAVs for 7 moving targets in an area of 20 × 20 cells. All hyperparameters used in the training process are listed in Table 3.
As depicted in Figure 4a, MAPPO and IPPO show better performance: the episode rewards obtained by MAPPO and IPPO converge to about 110, while rMAPPO and rIPPO converge to about 80. Moreover, the convergence of MAPPO and rMAPPO is faster and more stable than that of IPPO and rIPPO. As depicted in Figure 4b, the average observation rate $\alpha$ of MAPPO and IPPO converges to about 0.28, while that of rMAPPO and rIPPO converges to about 0.27. In Figure 4c, the average exploration rate $\beta$ of MAPPO and IPPO converges to about 0.81, while that of rMAPPO and rIPPO converges to about 0.75.

6.3. Performance Analysis

Figure 4 indicates that the convergence performance of the basic network architecture is better than that of the recurrent architecture. On one hand, the RNN is too complex for the problem, which leads to overfitting. On the other hand, the information update and fusion mechanisms proposed in our method already preserve historical information, making the recurrent neural network redundant. We can also see that the convergence of the MAPPO training algorithm is faster and more stable than that of IPPO. The MAPPO algorithm, using the CTDE architecture, can effectively utilize global information to achieve better and more stable convergence during training. By contrast, as a fully decentralized method, IPPO directly applies a single-agent RL algorithm to each individual agent without considering the changes of the other agents. Thus, the environment is non-stationary from each agent's perspective and convergence cannot be guaranteed.
In this subsection, we use heatmap to visualize the cognitive information maps that incorporate all the detection information at different timesteps and analyze the effectiveness of our method. Furthermore, the performance of MAPPO with different sensing and communication capabilities is analyzed. The testing environment contains 5 UAVs and 7 targets in an area of 20 × 20 cells.
Figure 5 shows the target probability map and the environment search status map at different timesteps. In the target probability map, the more yellow a cell, the higher the estimated likelihood of a target's presence. In the environment search status map, the closer a cell's color is to dark red, the more recently the cell was detected. We do not present the UAV location map here but directly mark the locations of UAVs and targets on the other two maps. From Figure 5a,c, it can be seen that only two targets are under observation at timestep 30, while five targets are already being tracked at timestep 180. In the target probability map, the target existence probability of observed cells containing targets is 1, the probability of cells that have been detected but contain no targets approaches 0, and the probability of undetected cells remains 0.5, which means that no information is available about the presence of targets. The target probability map proposed in our paper therefore contains information about the distribution of targets, which can effectively guide UAVs to track and monitor the moving targets. In Figure 5b,d, we can see that the multi-UAV system has detected only part of the mission region at timestep 30, while almost the entire area has been explored at timestep 180. The environment search status map contains the search status of the mission region. It can not only guide the UAVs to explore undetected areas but also promote revisiting areas that have not been explored for a while.
Next, as shown in Figure 6, the average episode rewards, the average observation rate $\alpha$ and the average exploration rate $\beta$ versus the sensing range $R_s$ are tested under different communication conditions. The results with communication are depicted by blue lines, denoted "Com", while the results without communication are represented by red lines, denoted "NoCom". The testing environment contains 5 UAVs and 7 targets in an area of 20 × 20 cells, and each data point is an average of 10 tests. As the sensing range $R_s$ increases, the average episode rewards, the average observation rate $\alpha$ and the average exploration rate $\beta$ all increase: a larger sensing range enables the multi-UAV system to detect a larger area and acquire more information, leading to better performance. Compared with the results without communication, the performance of the multi-UAV system is better when the UAVs communicate with each other. On one hand, UAVs achieve a better understanding of the environment through information sharing and fusion; on the other hand, they achieve better cooperation via communication.

6.4. Comparison with Other Methods

This subsection compares MAPPO with the A-CMOMMT, ACO and Random algorithms to examine the effectiveness of our method. The comparison results are obtained from scenarios of different scales: 20 × 20 cells with 10 targets, 30 × 30 cells with 15 targets and 40 × 40 cells with 20 targets. Each data point is an average of 10 tests. Moreover, the results of MAPPO are obtained using the model trained with 5 UAVs, 7 targets and 20 × 20 cells. The comparison algorithms are introduced as follows:
  • A-CMOMMT: A-CMOMMT [16] is a traditional method to solve the target searching and tracking problem. In this method, control of multiple agents is based on a combination of force vectors that are attractive for nearby targets and repulsive for nearby agents;
  • ACO: As a swarm intelligence algorithm, ant colony optimization (ACO) is also applied in the target searching and tracking problem [21,43]. The pheromone used to initialize all the cells in this approach is identical. The pheromone of cells encountered by UAVs will undergo vaporization at every time step. The pheromone map, target existence probability and UAVs’ locations will be included in the heuristic information to guide UAVs’ decisions;
  • Random: At each timestep, the agents randomly select an action from the potential candidates.
The performance comparison of the MAPPO, A-CMOMMT, ACO and Random algorithms under different numbers of UAVs in scenarios of different scales is depicted in Figure 7. Figure 7a–c show the average observation rate $\alpha$ versus the number of UAVs for the different scenarios. With the increase in the number of UAVs, the average observation rate $\alpha$ increases for all algorithms, and the performance of MAPPO, A-CMOMMT and ACO far surpasses that of Random. Our MAPPO algorithm achieves the highest average observation rate $\alpha$ in most cases. A-CMOMMT outperforms ACO by a small margin in the environments with 10 targets and 20 × 20 cells and with 15 targets and 30 × 30 cells, whereas ACO performs better in the environment with 20 targets and 40 × 40 cells; ACO fares better when the problem scale is large, but the difference between A-CMOMMT and ACO is small overall. Figure 7d–f display the average exploration rate $\beta$ versus the number of UAVs in the different scenarios. The Random algorithm has the highest average exploration rate $\beta$, and the performance of MAPPO is second only to Random, while the average exploration rate $\beta$ of A-CMOMMT and ACO is far inferior to that of Random and MAPPO. The average exploration rate $\beta$ is used to examine the effectiveness of the algorithms in exploring the mission region, and a higher value also means that the targets are observed in a relatively uniform manner. The standard deviation of the observation rate across all targets in the environment with 10 targets and 20 × 20 cells is listed in Table 4. As shown in Table 4, with better exploring performance, MAPPO observes the targets relatively uniformly. In summary, the comparison illustrates that the proposed MAPPO method performs better on the target searching and tracking problem than the traditional algorithm A-CMOMMT and the swarm intelligence algorithm ACO.

7. Conclusions

We focus on the target searching and tracking problem and propose a distributed MARL method to learn cooperative policies for the multi-UAV system. We first construct the system model of the multi-UAV cooperative searching and tracking problem and adopt the average observation rate and the average exploration rate as the metrics. Moreover, we propose information update and fusion mechanisms to enhance the environment perception ability of the multi-UAV system. Then, the details of our method are presented, including observation and action space representation, reward function design and a training framework based on MAPPO. The simulation results show that our method has good convergence performance and outperforms the comparison algorithms in terms of average observation rate and average exploration rate. In the future, we will delve deeper into collaborative policies of heterogeneous UAVs for target searching and tracking, and assess the practicality of the proposed approach on real-world UAVs.

Author Contributions

Conceptualization, K.S.; methodology, F.Q.; writing—original draft preparation, F.Q.; writing—review and editing, K.S.; supervision, K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Military Theoretical Science Research Fund of PLA, National Natural Science Foundation of China (NSFC) under grant No. 61802425 and Independent Science Research Fund of Naval University of Engineering.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Michael, N.; Mellinger, D.; Lindsey, Q.; Kumar, V. The grasp multiple micro-uav testbed. IEEE Robot. Autom. Mag. 2010, 17, 56–65. [Google Scholar] [CrossRef]
  2. Kumar, V.; Michael, N. Opportunities and challenges with autonomous micro aerial vehicles. Int. J. Robot. Res. 2012, 31, 1279–1291. [Google Scholar] [CrossRef]
  3. How, J.P.; Fraser, C.; Kulling, K.C.; Bertuccelli, L.F.; Toupet, O.; Brunet, L.; Bachrach, A.; Roy, N. Increasing autonomy of UAVs. IEEE Robot. Autom. Mag. 2009, 16, 43–51. [Google Scholar] [CrossRef]
  4. Raj, J.; Raghuwaiya, K.; Vanualailai, J. Collision avoidance of 3D rectangular planes by multiple cooperating autonomous agents. J. Adv. Transp. 2020, 2020, 4723687. [Google Scholar] [CrossRef]
  5. Qi, J.; Song, D.; Shang, H.; Wang, N.; Hua, C.; Wu, C.; Qi, X.; Han, J. Search and rescue rotary-wing UAV and its application to the Lushan Ms 7.0 earthquake. J. Field Robot. 2016, 33, 290–321. [Google Scholar] [CrossRef]
  6. Ablavsky, V.; Snorrason, M. Optimal search for a moving target—A geometric approach. In Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, Denver, CO, USA, 14–17 August 2000; p. 4060. [Google Scholar]
  7. Jung, B.; Sukhatme, G.S. A region-based approach for cooperative multi-target tracking in a structured environment. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Lausanne, Switzerland, 30 September–4 October 2002; Volume 3, pp. 2764–2769. [Google Scholar]
  8. Tang, Z.; Ozguner, U. Motion planning for multitarget surveillance with mobile sensor agents. IEEE Trans. Robot. 2005, 21, 898–908. [Google Scholar] [CrossRef]
  9. Lanillos, P.; Gan, S.K.; Besada-Portas, E.; Pajares, G.; Sukkarieh, S. Multi-UAV target search using decentralized gradient-based negotiation with expected observation. Inf. Sci. 2014, 282, 92–110. [Google Scholar] [CrossRef]
  10. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  11. Garaffa, L.C.; Basso, M.; Konzen, A.A.; de Freitas, E.P. Reinforcement learning for mobile robotics exploration: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 3796–3810. [Google Scholar] [CrossRef]
  12. Crandall, J.W.; Goodrich, M.A. Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Mach. Learn. 2011, 82, 281–314. [Google Scholar] [CrossRef]
  13. Koopman, B.O. The theory of search. I. Kinematic bases. Oper. Res. 1956, 4, 324–346. [Google Scholar] [CrossRef]
  14. Raap, M.; Preuß, M.; Meyer-Nieberg, S. Moving target search optimization—A literature review. Comput. Oper. Res. 2019, 105, 132–140. [Google Scholar] [CrossRef]
  15. Bertuccelli, L.F.; How, J.P. Robust UAV search for environments with imprecise probability maps. In Proceedings of the 44th IEEE Conference on Decision and Control, Seville, Spain, 15 December 2005; pp. 5680–5685. [Google Scholar]
  16. Parker, L.E. Distributed algorithms for multi-robot observation of multiple moving targets. Auton. Robot. 2002, 12, 231–255. [Google Scholar] [CrossRef]
  17. Ding, Y.; Zhu, M.; He, Y.; Jiang, J. P-CMOMMT algorithm for the cooperative multi-robot observation of multiple moving targets. In Proceedings of the 2006 6th World Congress on Intelligent Control and Automation, Dalian, China, 21–23 June 2006; Volume 2, pp. 9267–9271. [Google Scholar]
  18. Jilkov, V.P.; Li, X.R. On fusion of multiple objectives for UAV search & track path optimization. J. Adv. Inf. Fusion 2009, 4, 27–39. [Google Scholar]
  19. Booth, K.E.C.; Piacentini, C.; Bernardini, S.; Beck, J.C. Target Search on Road Networks with Range-Constrained UAVs and Ground-Based Mobile Recharging Vehicles. IEEE Robot. Autom. Lett. 2020, 5, 6702–6709. [Google Scholar] [CrossRef]
  20. Phung, M.D.; Ha, Q.P. Motion-encoded particle swarm optimization for moving target search using UAVs. Appl. Soft Comput. 2020, 97, 106705. [Google Scholar] [CrossRef]
  21. Zhen, Z.; Chen, Y.; Wen, L.; Han, B. An intelligent cooperative mission planning scheme of UAV swarm in uncertain dynamic environment. Aerosp. Sci. Technol. 2020, 100, 105826. [Google Scholar] [CrossRef]
  22. Duan, H.; Zhao, J.; Deng, Y.; Shi, Y.; Ding, X. Dynamic discrete pigeon-inspired optimization for multi-UAV cooperative search-attack mission planning. IEEE Trans. Aerosp. Electron. Syst. 2020, 57, 706–720. [Google Scholar] [CrossRef]
  23. Hayat, S.; Yanmaz, E.; Brown, T.X.; Bettstetter, C. Multi-objective UAV path planning for search and rescue. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 5569–5574. [Google Scholar]
  24. Wang, T.; Qin, R.; Chen, Y.; Snoussi, H.; Choi, C. A reinforcement learning approach for UAV target searching and tracking. Multimed. Tools Appl. 2019, 78, 4347–4364. [Google Scholar] [CrossRef]
  25. Yan, P.; Jia, T.; Bai, C. Searching and tracking an unknown number of targets: A learning-based method enhanced with maps merging. Sensors 2021, 21, 1076. [Google Scholar] [CrossRef]
  26. Zhou, W.; Li, J.; Liu, Z.; Shen, L. Improving multi-target cooperative tracking guidance for UAV swarms using multi-agent reinforcement learning. Chin. J. Aeronaut. 2022, 35, 100–112. [Google Scholar] [CrossRef]
  27. Shen, G.; Lei, L.; Zhang, X.; Li, Z.; Cai, S.; Zhang, L. Multi-UAV Cooperative Search Based on Reinforcement Learning with a Digital Twin Driven Training Framework. IEEE Trans. Veh. Technol. 2023, 72, 8354–8368. [Google Scholar] [CrossRef]
  28. Oroojlooy, A.; Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. Appl. Intell. 2023, 53, 13677–13722. [Google Scholar] [CrossRef]
  29. Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative learning. In Readings in Agents; Morgan Kaufmann: Burlington, MA, USA, 1997; pp. 487–494. [Google Scholar]
  30. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  31. De Witt, C.S.; Gupta, T.; Makoviichuk, D.; Makoviychuk, V.; Torr, P.H.; Sun, M.; Whiteson, S. Is independent learning all you need in the starcraft multi-agent challenge? arXiv 2020, arXiv:2011.09533. [Google Scholar]
  32. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30, 1–12. [Google Scholar]
  33. Oliehoek, F.A.; Spaan, M.T.; Vlassis, N. Optimal and approximate Q-value functions for decentralized POMDPs. J. Artif. Intell. Res. 2008, 32, 289–353. [Google Scholar] [CrossRef]
  34. Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2961–2970. [Google Scholar]
  35. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar]
  36. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-decomposition networks for cooperative multi-agent learning. arXiv 2017, arXiv:1706.05296. [Google Scholar]
  37. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 7234–7284. [Google Scholar]
  38. Su, J.; Adams, S.; Beling, P. Value-decomposition multi-agent actor-critics. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 11352–11360. [Google Scholar]
  39. Millet, T.; Casbeer, D.; Mercker, T.; Bishop, J. Multi-agent decentralized search of a probability map with communication constraints. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, Toronto, ON, Canada, 2–5 August 2010; p. 8424. [Google Scholar]
  40. Zhong, M.; Cassandras, C.G. Distributed coverage control and data collection with mobile sensor networks. IEEE Trans. Autom. Control 2011, 56, 2445–2455. [Google Scholar] [CrossRef]
  41. Hu, J.; Xie, L.; Lum, K.Y.; Xu, J. Multiagent information fusion and cooperative control in target search. IEEE Trans. Control Syst. Technol. 2012, 21, 1223–1235. [Google Scholar] [CrossRef]
  42. Dibangoye, J.S.; Amato, C.; Buffet, O.; Charpillet, F. Optimally solving Dec-POMDPs as continuous-state MDPs. J. Artif. Intell. Res. 2016, 55, 443–497. [Google Scholar] [CrossRef]
  43. Zhen, Z.; Xing, D.; Gao, C. Cooperative search-attack mission planning for multi-UAV based on intelligent self-organized algorithm. Aerosp. Sci. Technol. 2018, 76, 402–411. [Google Scholar] [CrossRef]
Figure 1. The scenario of multi-UAV cooperative searching and tracking for moving targets.
Figure 2. Network architecture of recurrent-MAPPO. (a) Policy network. (b) Value network.
Figure 3. Framework of MAPPO. The joint variables over all agents are bold.
Figure 4. Comparison of convergence performance of MAPPO and IPPO with different network architectures. (a) Average episode rewards. (b) Average observation rate α . (c) Average exploration rate β .
Figure 5. Target probability map and environment search status map at different time steps, where pink pentacles represent targets and blue airplane icons represent UAVs. (a) Target probability map at timestep 30. (b) Environment search status map at timestep 30. (c) Target probability map at timestep 180. (d) Environment search status map at timestep 180.
Figure 6. Comparison of results with and without communication when the sensing range is increasing. (a) Average episode rewards with and without communication. (b) Average observation rate α with and without communication. (c) Average exploration rate β with and without communication.
Figure 7. Performance comparison of MAPPO, A-CMOMMT, ACO and Random algorithms under different number of UAVs. (a) Average observation rate α in the environment with 10 targets and 20 × 20 cells. (b) Average observation rate α in the environment with 15 targets and 30 × 30 cells. (c) Average observation rate α in the environment with 20 targets and 40 × 40 cells. (d) Average exploration rate β in the environment with 10 targets and 20 × 20 cells. (e) Average exploration rate β in the environment with 15 targets and 30 × 30 cells. (f) Average exploration rate β in the environment with 20 targets and 40 × 40 cells.
Table 1. Comparison between the existing works.

Reference | Method    | Sensor Model  | Communication Range | Target
[16]      | A-CMOMMT  | Deterministic | Limited             | Moving
[17]      | P-CMOMMT  | Deterministic | Limited             | Moving
[18]      | MOO       | Probabilistic | Unlimited           | Moving
[19]      | CP        | Deterministic | Unlimited           | Moving
[20]      | PSO       | Probabilistic | Unlimited           | Moving
[21]      | ACO       | Probabilistic | Limited             | Moving
[22]      | PIO       | Probabilistic | Limited             | Static
[23]      | GA        | Deterministic | Limited             | Static
[24]      | RL        | Probabilistic | Unlimited           | Moving
[25]      | DRL       | Deterministic | Limited             | Moving
[26]      | MARL      | Deterministic | Limited             | Moving
[27]      | MARL      | Probabilistic | Limited             | Static
Table 2. Simulation parameters.

Parameter                           | Value
Total mission timesteps (T)         | 200
Detection probability (P_D)         | 0.9
False alarm probability (P_F)       | 0.1
Sensing range (R_s)                 | 3
Communication range (R_c)           | 6
Range of observation field (L_o)    | 7
Information decaying factor (η)     | 0.1
ω_1                                 | 2
ω_2                                 | 1
ω_3                                 | 0.5
Table 3. Hyperparameters used in the training process.

Parameter                       | Value
Number of steps to execute (E)  | 10^5
Batch size (B)                  | 16
Learning rate (L_r)             | 5 × 10^-4
Discount factor (γ)             | 0.99
Clip factor (ε)                 | 0.2
Optimizer                       | Adam
Table 4. Standard deviation of the observation rate of targets.

Algorithm  | 4 UAVs | 6 UAVs | 8 UAVs | 10 UAVs
MAPPO      | 0.143  | 0.174  | 0.176  | 0.185
A-CMOMMT   | 0.228  | 0.237  | 0.231  | 0.239
ACO        | 0.226  | 0.254  | 0.220  | 0.193
Random     | 0.119  | 0.152  | 0.148  | 0.168

