Article

Multi-UAV Cooperative Search in Partially Observable Low-Altitude Environments Based on Deep Reinforcement Learning

Naval Aviation University, Yantai 264001, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(12), 825; https://doi.org/10.3390/drones9120825
Submission received: 30 September 2025 / Revised: 19 November 2025 / Accepted: 25 November 2025 / Published: 27 November 2025

Highlights

What are the main findings?
  • We propose a multi-agent deep reinforcement learning algorithm, Normalizing Graph Attention Soft Actor-Critic (NGASAC), to solve the problem of multi-UAV cooperative search in low-altitude partially observable environments.
  • To address the multi-UAV collaborative search problem in real scenarios, we propose a phased search strategy.
What are the implications of the main findings?
  • NGASAC integrates a normalizing flow (NF) layer and a multi-head graph attention network (MHGAT). The normalizing flow technique maps traditional Gaussian sampling to a more complex action distribution, thereby enhancing the expressiveness and flexibility of the policy. Simultaneously, by constructing a multi-head graph attention network that captures “obstacle–target” relationships, the algorithm improves the UAVs’ ability to learn and reason about complex spatial topologies, leading to significantly better performance in cooperative search and stable surveillance of hidden targets.
  • The phased search strategy achieves adaptive allocation of search resources by dynamically responding to changes in the target state, thereby enhancing search efficiency and task success rate.

Abstract

Multi-Unmanned Aerial Vehicle (Multi-UAV) cooperative search represents a cutting-edge research direction in the field of unmanned aerial vehicle applications. The use of multi-UAV systems for low-altitude target search and area surveillance has become an effective means of enhancing security capabilities. In practical scenarios, UAVs rely on onboard sensors to acquire environmental information; however, due to the limited perceptual range of these sensors, their observation capabilities are inherently local and constrained. This paper investigates the problem of multi-UAV cooperative search in partially observable low-altitude environments, where each UAV possesses a circular sensing range with a finite radius. Target location information is only obtained when a target enters the field of view of any UAV. The objective is to achieve cooperative search and sustain continuous surveillance while ensuring safety among UAVs and with the environment. To address this challenge, we propose a novel multi-agent deep reinforcement learning (MADRL) algorithm named Normalizing Graph Attention Soft Actor-Critic (NGASAC). This algorithm integrates a normalizing flow (NF) layer and a multi-head graph attention network (MHGAT). The normalizing flow technique maps traditional Gaussian sampling to a more complex action distribution, thereby enhancing the expressiveness and flexibility of the policy. Simultaneously, by constructing a multi-head graph attention network that captures “obstacle–target” relationships, the algorithm improves the UAVs’ ability to learn and reason about complex spatial topologies, leading to significantly better performance in cooperative search and stable surveillance of hidden targets.
Simulation results demonstrate that the NGASAC algorithm markedly outperforms baseline methods such as Multi-Agent Soft Actor-Critic (MASAC), Multi-Agent Proximal Policy Optimization (MAPPO), and Multi-Agent Deep Deterministic Policy Gradient (MADDPG) across multiple evaluation metrics, including success rate, task time, and obstacle avoidance capability. Furthermore, it exhibits strong generalization performance and robustness.

1. Introduction

In recent years, UAV technology has undergone rapid development, driving significant technological advancements in fields such as surveying, aerial photography, and emergency response [1]. With continuous improvements in visual sensors, embedded processors, and data processing techniques, UAV systems are increasingly deployed for target search tasks in high-risk or even hostile environments [2]. Multi-UAV cooperative search [3,4] involves the systematic coordination of multiple drones to explore unknown areas and locate targets. Through strategic collaboration among UAVs, this approach significantly enhances the efficiency and robustness of search and surveillance operations. As a result, it has been widely adopted in critical applications including disaster relief, power line inspection, and forest fire monitoring [5,6,7].
This paper focuses on the problem of cooperative dynamic target search in low-altitude overhead surveillance scenarios. Its theoretical foundation can be traced to the predator–prey interaction model—a classic problem extensively studied in robotics and control theory [8]. Within this framework, the searching UAVs act as searchers, dynamically formulating strategies based on environmental information to approach and continuously monitor the target as quickly as possible, while the target acts as an evader, striving to maintain maximum distance to avoid detection [9]. However, most existing studies assume full observability—that both pursuers and evaders have complete access to each other’s information [10]—which significantly diverges from real-world applications characterized by information asymmetry and partial observations limited by sensor capabilities. Therefore, investigating cooperative UAV search strategies under partial observability is both more practical and challenging.
The search evasion problem can be regarded as a pursuit evasion game. Traditional approaches often employ differential game theory, solving the Hamilton–Jacobi–Isaacs (HJI) equation to derive optimal strategies. Alternatively, some studies use expert rules or geometry-based cooperative control methods for multi-UAV coordination [11]. However, in real-world environments with complex obstacle distributions and highly uncertain target behavior, it is difficult to establish accurate dynamic models [12]. Moreover, targets may repeatedly disappear and reappear, further complicating the design of conventional search policies [13].
With the rapid development of artificial intelligence, DRL—a major branch of machine learning—has demonstrated great potential in agent decision-making. By learning through trial-and-error interactions with the environment, DRL satisfies the requirement for real-time online decision-making and is an ideal approach for control problems in unknown environments [14,15]. It does not rely on precise environmental dynamics and can directly approximate control policies or optimize control parameters via neural networks [16]. Furthermore, learning-based methods exhibit superior generalization and adaptability in unfamiliar environments compared to traditional control approaches [17].
MADRL has gradually emerged as an effective way to address pursuit evasion problems due to its capacity to handle cooperative and competitive interactions among agents. After a detailed analysis of the specific challenges in search evasion scenarios, we propose the Normalizing Graph Attention Soft Actor-Critic (NGASAC) algorithm. This approach builds upon the Soft Actor-Critic (SAC) architecture [18] and extends it into a Centralized Training with Decentralized Execution (CTDE) framework [19] suitable for multi-agent decision-making under partial observability. To overcome the limitations of conventional action sampling methods that often rely on simplistic Gaussian distributions—which restrict policy expressiveness [20,21]—we introduce a normalizing flow (NF) layer inspired by [22]. This technique uses invertible transformations to map the original sampling distribution into a more complex action distribution. Updated in an end-to-end manner within the MADRL framework, it enhances the flexibility and diversity of policy representations. Additionally, while many MADRL algorithms based on the actor-critic framework use Multi-Layer Perceptron (MLP) to construct centralized critics [21,23], we incorporate a Multi-Head Graph Attention Network (MHGAT) into the critic network, drawing inspiration from [24]. This enables explicit modeling of the spatial relationships among UAVs, obstacles, and targets, thereby improving situational reasoning in complex environments. Our main contributions are as follows:
(1) We developed a practical multi-UAV cooperative search model under partial observability in low-altitude environments, incorporating overhead perception, partial observability, dynamic target behavior, and multiple safety constraints.
(2) We proposed the NGASAC algorithm, which integrates NF and MHGAT to significantly enhance policy expressiveness and spatial relational reasoning, with ablation studies validating the effectiveness of each module.
(3) Extensive simulations demonstrate that the proposed algorithm outperforms existing state-of-the-art methods in success rate, task efficiency, and robustness, and exhibits superior generalization across various obstacle configurations, numbers of UAVs, and unseen task scenarios.
The remainder of this paper is organized as follows: Section 2 reviews related work in cooperative search and MADRL; Section 3 provides a formal model of the multi-UAV cooperative search problem in partially observable low-altitude environments; Section 4 formulates the corresponding Partially Observable Markov Decision Process (POMDP) and elaborates on the proposed NGASAC algorithm; Section 5 presents extensive experimental evaluations and results analysis; and Section 6 concludes the paper and suggests future research directions.

2. Related Work

2.1. Cooperative Search

In recent years, multi-UAV cooperative search strategies have attracted widespread attention, with numerous scholars conducting in-depth research from various perspectives. Fei et al. [25] proposed a robust cooperative flight strategy for multi-UAV systems operating under uncertain communication conditions, enhancing task performance in unreliable communication environments. Shen et al. [26] designed a digital twin-based distributed training framework that supports cooperative decision-making and dynamic adaptation of multiple UAVs in complex scenarios through deep integration of virtual simulation and physical systems. Meng et al. [27] introduced an evolutionary state estimation-based multi-strategy jellyfish search algorithm, significantly improving the path planning quality and cooperative efficiency of multi-UAV systems in unknown environments. Zhang et al. [28] focused on the cooperative reconnaissance of static targets using multiple UAVs, formulating the task as a multi-objective optimization model to achieve an effective balance between broad coverage and precise positioning, thereby enhancing the reliability of search outcomes. Yu et al. [29] addressed the cooperative search for dynamic targets with multiple UAVs and developed an improved optimization algorithm incorporating multiple gene types; experiments demonstrated that the approach increases the search efficiency of the UAV swarm. Hentou et al. [30] proposed a complete navigation and control system for efficient navigation in dynamic environments; by emulating human reasoning capabilities, the system enables intelligent agents to navigate autonomously.
Although these studies have made significant progress in path planning, cooperative coverage, and target localization, most are based on the ideal assumption of global observability and do not adequately address the issue of partial observability caused by sensor limitations in real-world environments. Particularly in complex low-altitude urban settings, factors such as constrained UAV mobility, obstacle occlusion, and the high maneuverability of dynamic targets pose substantial challenges to cooperative search. It is evident that current methods still exhibit notable limitations in handling partial observations and dynamic decision-making, highlighting the urgent need for more adaptive and reasoning-capable cooperative decision-making frameworks.

2.2. MADRL

MADRL has demonstrated remarkable perception and decision-making capabilities in complex environments, achieving significant progress in various domains such as path planning [14], algorithm configuration [31], and intelligent decision-making [32]. Without relying heavily on prior knowledge, MADRL enables agents to autonomously learn cooperative strategies through continuous interaction with the environment, maximizing long-term cumulative rewards even in highly uncertain settings [33,34]. In UAV-related applications, Xing et al. [14] proposed an improved deep reinforcement learning algorithm for multi-UAV trajectory planning in environments with unknown obstacles. Their experiments show that the algorithm effectively identifies optimal policies that maximize expected rewards. Ming et al. [32] developed a deep reinforcement learning-based online operator selection framework, significantly enhancing the performance of multi-objective evolutionary algorithms in complex optimization problems. Wu et al. [35] investigated cooperative search in multi-layer aerial computing networks and introduced a MADRL method combining parameter sharing and action masking. This approach maximizes target discoveries and area coverage while minimizing search uncertainty, thereby improving overall system utility. Fu et al. [17] addressed multi-UAV cooperative pursuit using a decomposed MADDPG algorithm, alleviating credit assignment challenges in collaboration through value decomposition. Zhang et al. [10] proposed a multi-agent collaborative bidirectional coordination and target detection algorithm that maintains strong recognition and tracking performance under partial observability. Du et al. [36] studied multi-UAV pursuit evasion in 2D urban airspace under communication constraints, offering valuable insights for cooperative control in low-bandwidth environments. Mateescu et al. [37] utilized a simulation testing environment incorporating the digital twin of the NAO robot and a parallel learning process to accelerate the progress of machine learning.
Based on the above analysis, although MADRL shows considerable potential in multi-UAV cooperative applications, several limitations remain. Many algorithms rely on simplistic Gaussian distributions for policy sampling, which struggle to represent complex multi-modal decision behaviors. Furthermore, conventional MLP architectures are often inadequate for capturing intricate spatial–topological relationships among agents, obstacles, and targets, thereby limiting the generalization and reasoning capabilities of these methods in partially observable settings. In light of the distinctive challenges inherent in cooperative search tasks, we propose an improved MADRL algorithm designed to enhance decision-making effectiveness and system adaptability in low-altitude partially observable environments for multi-UAV cooperative search.

3. System Model and Problem Formulation

In this section, we describe the background of the research problem and the kinematic model of the UAVs, which serves as the control model for this paper, followed by an emphasis on the core challenge: the limited circular field-of-view constraint under low-altitude environments and cooperative control. On this basis, we provide a mathematical formulation of the task objectives and various constraints.

3.1. Scenario Description

Considering the altitude restrictions in urban low-altitude airspace, it is reasonable to assume that all UAVs operate at the same flight level. Accordingly, this study models the problem in a bounded two-dimensional urban airspace containing multiple static obstacles and N homogeneous Search UAVs (hereinafter referred to as “searchers”), whose task is to cooperatively search for and surveil a moving evading vehicle (hereinafter referred to as the “evader”). Each searcher is equipped with a limited circular field of view—the evader’s position is perceived only when it enters the field of view of any searcher. Once the evader moves out of all searchers’ visual ranges, its position information is immediately lost. Furthermore, all UAVs are required to possess obstacle avoidance capabilities and must operate strictly within a predefined task area. Figure 1 illustrates the described scenario. In this setting, the evader’s velocity $v_e$ is greater than the velocity $v_i$ of the searchers, and its evasion strategy is non-deterministic, which collectively heightens the complexity of the search task. To simplify system modeling, the following assumptions are made:
(1) Communication among searchers is always reliable. Once the evader is detected by any searcher, its position is instantaneously shared with all other searchers, with no communication delay or localization error considered.
(2) Each searcher can accurately obtain its own state information (including position, velocity, and heading angle).
(3) The task is considered successful if the searchers continuously surveil the evader for a predefined duration $T_s$.
(4) The influence of UAV attitude variation on the field of view is negligible during surveillance; therefore, the sensing area is simplified as a fixed circular region, ignoring any shape deformation.

3.2. UAV Kinematic and Circular Sensing Range

In the two-dimensional plane, the positions of the $i$-th searcher and the evader at time $t$ are denoted as $p_i^t = (x_i^t, y_i^t)$ and $p_e^t = (x_e^t, y_e^t)$, respectively, where $i \in \{1, 2, \dots, N\}$ and $(x^t, y^t)$ represents the two-dimensional coordinates of the agent at time $t$. Meanwhile, the velocity and heading angle of the UAV at time $t$ are denoted as $v^t$ and $\theta^t$, respectively, with the specific kinematic relationship illustrated in Figure 2. The kinematic behavior of the UAV can thus be described by the following dynamic equations:
$\dot{v}^t = a_v^t, \quad \dot{\theta}^t = a_\theta^t, \quad \dot{x}^t = v^t \cos\theta^t, \quad \dot{y}^t = v^t \sin\theta^t$
where $a_v^t$ and $a_\theta^t$ represent the acceleration and angular velocity of the UAV at time $t$, respectively, both of which are to be determined via MADRL. Due to the maneuvering constraints of the UAV, $a_v^t$ and $a_\theta^t$ are subject to the following constraints:
$-a_v^{\max} < a_v^t < a_v^{\max}$
$-a_\theta^{\max} < a_\theta^t < a_\theta^{\max}$
where $a_v^{\max}$ and $a_\theta^{\max}$ denote the maximum allowable linear acceleration and angular velocity, respectively.
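As a concrete illustration, the kinematic equations above can be advanced with a simple Euler step once the policy outputs are clipped to the maneuvering constraints. The limits, speed cap, and time step below are illustrative placeholders, not values taken from the paper:

```python
import math

def step_uav(x, y, v, theta, a_v, a_theta, dt,
             a_v_max=2.0, a_theta_max=1.0, v_max=15.0):
    """One Euler-integration step of the UAV kinematic model.

    a_v (linear acceleration) and a_theta (angular velocity) are the two
    policy outputs; a_v_max, a_theta_max, v_max are assumed example limits.
    """
    # Enforce the maneuvering constraints |a_v| < a_v_max, |a_theta| < a_theta_max
    a_v = max(-a_v_max, min(a_v_max, a_v))
    a_theta = max(-a_theta_max, min(a_theta_max, a_theta))

    v_new = max(0.0, min(v_max, v + a_v * dt))          # speed update
    theta_new = (theta + a_theta * dt) % (2 * math.pi)  # heading update
    x_new = x + v_new * math.cos(theta_new) * dt        # position update
    y_new = y + v_new * math.sin(theta_new) * dt
    return x_new, y_new, v_new, theta_new
```

In the full system this update would run once per discrete time step for every searcher, driven by the actions sampled from the NGASAC policy.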
Each searcher is equipped with an onboard visual sensor, the perception range of which is modeled as a circular region with a fixed radius. All searchers are assumed to have identical sensing capabilities. Therefore, the field of view of a searcher is defined as follows:
$V = \{\, p \mid \| p - p_i^t \|_2 \le R \,\}$
where $V$ denotes the field of view of a searcher, $p$ represents a coordinate within the UAV’s visual range, and $R$ is the radius of the sensing area. The visible region changes dynamically as the searcher moves, thereby enabling cooperative search of the target. This scenario is transformed into a two-dimensional layout as shown in Figure 3. To facilitate environmental modeling and reward mechanism design, the task area is discretized into a uniform grid, with records maintained for both explored and repeatedly explored regions.
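The field-of-view membership test and the grid-based coverage bookkeeping described above can be sketched as follows; the radius, cell size, and grid dimensions are illustrative assumptions:

```python
import math

def in_fov(p, p_i, R=5.0):
    """True if point p lies inside searcher i's circular field of view,
    i.e. the Euclidean distance to p_i is at most R (R is illustrative)."""
    return math.hypot(p[0] - p_i[0], p[1] - p_i[1]) <= R

def covered_cells(p_i, R=5.0, cell=1.0, width=20, height=20):
    """Grid cells (indexed by their centers) of a discretized task area
    that fall inside the field of view -- used for coverage bookkeeping."""
    cells = set()
    for gx in range(width):
        for gy in range(height):
            center = ((gx + 0.5) * cell, (gy + 0.5) * cell)
            if in_fov(center, p_i, R):
                cells.add((gx, gy))
    return cells
```

Accumulating `covered_cells` over time gives the explored-region record used later by the exploration rewards.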

3.3. Task Constraints

It is assumed that all searchers operate within a confined airspace denoted as $R$. The set of $m$ obstacles located above the flight altitude of the searchers is defined as $B$. Given that high-rise buildings dominate in realistic environments, all obstacles are modeled as cuboids, which project as rectangles in the top-view 2D plane, with their longer and shorter edges parallel to the x and y axes. Any obstacle $b$ is represented by the coordinates of its two diagonal vertices, $p_b = [(x_b^1, y_b^1), (x_b^2, y_b^2)]$. Thus, $B = \{p_{b1}, p_{b2}, \dots, p_{bm}\}$. The total area occupied by all obstacles is denoted as $R_B$, which is a subset of $R$. All obstacles are fixed within the task area and impede UAV motion.
It is assumed that all searchers start from fixed initial positions, while the evader’s initial position is randomized, and none of them may be located within any obstacle region. The objective of the searchers is to detect the evader within the shortest possible time and maintain continuous surveillance of the evader for a specified duration, denoted as $T_s$. Task success is defined by the following expressions:
$p_e^t \in V$
$T_c = T_t + T_s$
where $T_c$ represents the total task completion time, and $T_t$ denotes the time elapsed before continuous observation is achieved.
Each searcher $i$ selects an appropriate action $u_i^t = [a_v^t, a_\theta^t]$ based on its current local observation $o_i^t$ to search for the evader. Thus, the policy of searcher $i$ at time $t$ can be expressed as $\pi_i^t(u_i^t \mid o_i^t)$. We further define the joint policy of all searchers at time $t$ as $\Pi_j^t = [\pi_1^t, \pi_2^t, \dots, \pi_N^t]$. Ultimately, this cooperative search problem can be formulated as the following constrained optimization problem:
$\arg\min_{\Pi_j} \; \mathbb{E}\left[ T_c \mid \Pi_j^t = \pi_j^t, \; t = 1, 2, \dots, T_c \right]$ (7)
$\text{s.t.} \quad |x_i^t - x_b| \ge l_b / 2, \quad \forall i \in N, \; b \in B, \; t \in \{1, 2, \dots, T_c\}$ (8)
$|y_i^t - y_b| \ge w_b / 2, \quad \forall i \in N, \; b \in B, \; t \in \{1, 2, \dots, T_c\}$ (9)
$\| p_i^t - p_j^t \|_2 \ge \delta_s, \quad \forall i, j \in N, \; i \ne j, \; t \in \{1, 2, \dots, T_c\}$ (10)
$p_i^t = p_i^{t-1} + \Delta t \cdot u_i^{t-1}, \quad \forall i \in N, \; t \in \{1, 2, \dots, T_c\}$ (11)
$p_i^t \in R, \quad \forall i \in N, \; t \in \{1, 2, \dots, T_c\}$ (12)
$0 \le T_c \le T_{\max}$ (13)
where Equation (7) formulates the minimization of the total task time as the optimization objective for the searchers. In Equations (8) and (9), $l_b$ and $w_b$ denote the length and width of the obstacles, respectively, constraining the searchers to avoid collisions with all obstacles during flight. Equation (10) specifies the minimum safe distance that must be maintained among multiple agents to prevent collisions during cooperative operations. Equation (11) describes the state transition process in which each searcher, based on the kinematic model, executes an action $u_i^t$ at discrete time steps to update its position. Equation (12) restricts all agents to operate within the predefined task area $R$, ensuring no agent exceeds these bounds. Equation (13) defines the maximum allowable time steps $T_{\max}$; if the task is not completed within this limit, it is considered a failure.
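The safety constraints (8)–(10) can be checked programmatically at every time step. A minimal sketch is given below, where each obstacle is assumed (for illustration) to be passed as a center-size tuple `(x_b, y_b, l_b, w_b)`:

```python
import math

def violates_obstacle(p, obstacles, margin=0.0):
    """Constraints (8)-(9): position p = (x, y) must stay outside every
    axis-aligned rectangular obstacle, given as (x_b, y_b, l_b, w_b)."""
    x, y = p
    for (xb, yb, lb, wb) in obstacles:
        if abs(x - xb) < lb / 2 + margin and abs(y - yb) < wb / 2 + margin:
            return True
    return False

def violates_separation(positions, delta_s):
    """Constraint (10): pairwise distance between searchers >= delta_s."""
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            if math.hypot(dx, dy) < delta_s:
                return True
    return False
```

In a training environment these checks would terminate an episode (or trigger the failure penalty) whenever either function returns `True`.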

4. Methodology

In this section, we elaborate on the strategy employed for multi-UAV cooperative search and formulate the corresponding POMDP. Building upon this formulation, we propose the NGASAC algorithm and provide a detailed description of its structural design and training procedure.

4.1. Introduction to Phased Search Strategy

Multi-UAV cooperative search for maneuvering targets presents several challenges. In real-world scenarios, the searching party often lacks prior information about the target, necessitating extensive cooperative exploration of the task area by multiple UAVs. Obstacle occlusion and the high maneuverability of the target further complicate sustained tracking. Even if a UAV detects the target, its typically lower speed compared to the target makes stable tracking difficult by a single agent; effective continuous surveillance requires collaboration among multiple UAVs.
Based on whether the target is within the field of view of any UAV and whether it has been detected previously, the search strategy is divided into three distinct operational modes: Cooperative Search, Tracking and Surveillance, and Focused Search.
(a) Cooperative Search Strategy: When the target has not been detected by any UAV, all searchers execute a cooperative search strategy aimed at maximizing area coverage and reducing uncertain regions.
(b) Tracking and Surveillance Strategy: Once any UAV detects the target, all searchers share this information and enter the tracking and surveillance state. In this phase, each UAV maneuvers to approach the target and maintain it within their collective field of view.
(c) Focused Search Strategy: If the target moves outside the field of view of all searchers, the system estimates a region of interest based on its last known position $p_{lost} = (x_{lost}, y_{lost})$ and time $t_{lost}$. This region is a circular area centered at $p_{lost}$, with a radius that expands dynamically according to the maximum target speed $v_e^{\max}$ and the time elapsed $(t - t_{lost})$. In this mode, the multi-UAV team concentrates its efforts on this suspected area for intensive search. If the target is re-detected, the strategy reverts to the tracking and surveillance mode; if the target is not found within a predetermined number of steps, the strategy switches back to the global cooperative search mode. The logic of the above “Cooperative Search—Tracking and Surveillance—Focused Search” strategy is illustrated in Figure 4. By dynamically responding to changes in target status, this approach enables adaptive allocation of search resources, thereby improving both search efficiency and task success rate.
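The mode-switching logic just described behaves as a small state machine. The sketch below captures it; the focused-search timeout `max_focused_steps` is an illustrative assumption, not a value from the paper:

```python
from enum import Enum, auto

class Mode(Enum):
    COOPERATIVE_SEARCH = auto()
    TRACKING = auto()
    FOCUSED_SEARCH = auto()

def next_mode(mode, target_visible, steps_since_lost, max_focused_steps=50):
    """Phased-strategy transitions (cf. Figure 4): any detection switches to
    tracking; losing the target switches to focused search; exceeding the
    (assumed) timeout falls back to global cooperative search."""
    if target_visible:
        return Mode.TRACKING
    if mode == Mode.TRACKING:
        return Mode.FOCUSED_SEARCH        # target just left all fields of view
    if mode == Mode.FOCUSED_SEARCH and steps_since_lost > max_focused_steps:
        return Mode.COOPERATIVE_SEARCH    # give up the suspected region
    return mode

def suspected_radius(t, t_lost, v_e_max):
    """Radius of the region of interest, expanding with elapsed time."""
    return v_e_max * (t - t_lost)
```

Keeping the mode explicit like this also makes it easy to select the matching phase-specific reward during training.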

4.2. POMDP

POMDP provides a formal framework for sequential decision-making in uncertain environments and is widely used in modeling multi-agent systems. It is defined by the tuple $M = \{S, U, P, r, O, N, \gamma\}$, which includes the state space $S$, action space $U$, state transition function $P$, reward function $r$, observation space $O$, number of agents $N$, and discount factor $\gamma$. Each agent is regarded as an independent entity that makes deterministic or stochastic decisions at each time step based on current information. The components are detailed below as applied to this problem.

4.2.1. State Space

A state encompasses the status of all searchers at time $t$, expressed as:
$s_t = [p_1^t, \theta_1^t, v_1^t, \eta_1^t, \dots, p_N^t, \theta_N^t, v_N^t, \eta_N^t]$ (14)
where $s_t$ includes the coordinates, heading angle, velocity, and obstacle perception information of each agent. Obstacle information is acquired via onboard radar with a maximum detection range of $\lambda_{\max}$. When approaching an obstacle, radar waves in that direction are partially blocked, reducing the effective return. Each searcher calculates the ratio of the actual radar return length to the maximum range as its obstacle perception information:
$\eta_i^t = \lambda_i^t / \lambda_{\max}$ (15)

4.2.2. Action Space

This paper adopts a continuous action space. The action of a single agent at time $t$ is defined as $u_i^t = [a_v^t, a_\theta^t] \in U$. This two-dimensional action comprises the linear acceleration and angular velocity of the agent. The joint action of all agents at time $t$ is defined as $u_j^t = [u_1^t, u_2^t, \dots, u_N^t]$, which is the output of the searchers’ policy networks.

4.2.3. Local Observation Space

The local observation of a searcher at time $t$ is denoted as $o_i^t \in O_j^t$, where $O_j^t$ represents the joint observation space. Due to the limited sensing range, the environment is partially observable. The observation space of each searcher consists of the following three components:
(a) Self-Information: Includes the searcher’s own coordinates $p_i^t$, velocity $v_i^t$, heading angle $\theta_i^t$, and obstacle perception information $\eta_i^t$.
(b) Teammate Information: Includes relative coordinates $\Delta p_{ij}^t$, distance $d_{ij}^t$, relative heading angle $\theta_{ij}^t$, and relative bearing $\rho_{ij}^t$, where $j \in \{1, 2, \dots, N\}$ and $i \ne j$.
(c) Evader Information: The information regarding the evader varies depending on the search phase, as detailed below:
Cooperative Search Phase: The target is not within the field of view of any searcher, and its state information is unobservable.
Tracking and Surveillance Phase: The target remains within the field of view of at least one searcher. Thus, the evader information includes relative coordinates $\Delta p_{ie}^t$, distance $d_{ie}^t$, and relative bearing $\rho_{ie}^t$. These values are computed based on the evader’s position $p_e^t = (x_e^t, y_e^t)$.
Focused Search Phase: The target was previously detected but is currently lost. Searchers navigate toward the last known position of the evader, denoted as $\beta$. The information includes relative coordinates $\Delta p_{i\beta}^t$, distance $d_{i\beta}^t$, and relative bearing $\rho_{i\beta}^t$, all computed from the last known position $p_\beta^t = (x_\beta^t, y_\beta^t)$.
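Assembling these three components into a flat observation vector might look like the sketch below. The exact feature layout, ordering, and zero-padding of unobserved target slots are our illustrative assumptions, not the paper's published encoding:

```python
import math

def build_observation(self_state, teammates, evader, lambda_i, lambda_max):
    """Assemble a searcher's local observation vector.

    self_state: (x, y, v, theta); teammates: list of (x, y, theta);
    evader: (x, y) or None when unobserved (cooperative-search phase).
    """
    x, y, v, theta = self_state
    eta = lambda_i / lambda_max           # obstacle perception ratio (Eq. 15)
    obs = [x, y, v, theta, eta]           # (a) self-information

    for (xj, yj, thj) in teammates:       # (b) relative teammate features
        dx, dy = xj - x, yj - y
        obs += [dx, dy, math.hypot(dx, dy), thj - theta, math.atan2(dy, dx)]

    if evader is None:                    # (c) target unobservable: zero-pad
        obs += [0.0, 0.0, 0.0, 0.0]
    else:                                 # relative target (or beta) features
        dx, dy = evader[0] - x, evader[1] - y
        obs += [dx, dy, math.hypot(dx, dy), math.atan2(dy, dx)]
    return obs
```

In the focused-search phase, the last known position $\beta$ would simply be passed in place of the evader's true position.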

4.2.4. Reward Function

Given the inherently cooperative nature of the multi-agent search task, this paper adopts a team-based reward mechanism. Relying solely on sparse rewards upon target discovery would lead to training difficulties and slow policy convergence. Therefore, we design corresponding reward functions for different search phases to effectively guide the agents’ behaviors.
During the Cooperative Search Phase, the reward function contains no target-related information and is designed to encourage exploration of unknown areas. It is defined as follows:
$r_{s1}^t = S_n^t / C_1$ (16)
where $r_{s1}^t$ is the reward for exploring new areas, $S_n^t$ is the area newly discovered at time $t$, and $C_1$ is a constant. Equation (16) implies that the searcher receives a reward proportional to the area of newly explored regions.
Concurrently, to reduce resource waste caused by overlapping fields of view and to encourage dispersed search efforts, the following reward is defined:
$r_{s2}^t = \begin{cases} S_c^t / C_2, & S_c^t \ge S_c^{t-1} \\ -S_c^t / C_2, & S_c^t < S_c^{t-1} \end{cases}$ (17)
where $r_{s2}^t$ is the field-of-view reward, $S_c^t$ is the total area within the field of view at time $t$, and $C_2$ is a constant. Equation (17) rewards the agent when the total visible area increases and penalizes it otherwise.
Thus, the total reward function during the cooperative search phase is:
$r_s^t = r_{s1}^t + r_{s2}^t$ (18)
During the Tracking and Surveillance Phase, since the target has been detected, the searchers should approach the evader as closely as possible to maintain surveillance. Hence, the distance between the searchers and the target becomes a key guiding factor. The distance reward is designed as:
$r_{p1}^t = \begin{cases} \sum_{i=1}^{N} C_3 / d_{ie}^t, & d_{ie}^t < d_{ie}^{t-1} \\ -\sum_{i=1}^{N} C_3 / d_{ie}^t, & d_{ie}^t \ge d_{ie}^{t-1} \end{cases}$ (19)
where $r_{p1}^t$ is the distance reward, $d_{ie}^t$ is the distance between searcher $i$ and the evader at time $t$, and $C_3$ is a constant. Equation (19) encourages the searchers to continuously reduce their distance to the target.
Additionally, to promote more effective movement toward the target, a directional reward term is introduced, based on the consistency between the agent’s current heading angle θ i t and the relative bearing angle ρ i e t :
$r_{p2}^t = \sum_{i=1}^{N} C_4 \cos(\theta_i^t - \rho_{ie}^t)$ (20)
where $r_{p2}^t$ is the directional reward and $C_4$ is a constant. Equation (20) is maximized when the agent is heading directly toward the target.
Therefore, the total reward function during the Tracking and Surveillance Phase is defined as:
$r_p^t = r_{p1}^t + r_{p2}^t$ (21)
During the Focused Search Phase, searchers are required to quickly reach the last known position of the target, denoted as β . The reward design in this phase is similar to that of the Tracking and Surveillance Phase. The distance reward is defined as:
$r_{c1}^t = \begin{cases} -\sum_{i=1}^{N} C_4 / d_{i\beta}^t, & d_{i\beta}^t \ge d_{i\beta}^{t-1} \\ \sum_{i=1}^{N} C_4 / d_{i\beta}^t, & d_{i\beta}^t < d_{i\beta}^{t-1} \end{cases}$ (22)
where $r_{c1}^t$ denotes the distance reward, $C_4$ is a constant, and $d_{i\beta}^t$ represents the distance between searcher $i$ and the last known target position $\beta$ at time $t$.
Additionally, to encourage intensive search within the suspected region, an area coverage reward is introduced:
$r_{c2}^t = \begin{cases} C_5 \cdot S_s^t / [\pi (t_{lost} \cdot v_e^{\max})^2], & S_s^t \ge S_s^{t-1} \\ -C_5 \cdot S_s^t / [\pi (t_{lost} \cdot v_e^{\max})^2], & S_s^t < S_s^{t-1} \end{cases}$ (23)
where $r_{c2}^t$ is the reward for covering the suspected region, $C_5$ is a constant, $S_s^t$ is the area within the suspected region that has been covered by the searchers up to time $t$, $t_{lost}$ is the time elapsed since the target was lost, and $v_e^{\max}$ is the maximum speed of the evader. Equation (23) motivates the agents to expand the searched area within the region where the target is likely to be located.
Thus, the total reward function during the Focused Search Phase is:
$$r_{c}^{t} = r_{c1}^{t} + r_{c2}^{t} \tag{24}$$
Another critical consideration is that searchers must avoid collisions with other agents and obstacles in the dynamic environment in real time. To this end, a collision warning penalty term is introduced:
$$r_{w}^{t} = \sum_{i=1}^{N} \min\left(C_{6} \times \left(\lambda_{i}^{t} - \delta_{w}\right), 0\right) \tag{25}$$
where $r_{w}^{t}$ denotes the collision warning penalty, $C_{6}$ is a constant, $\lambda_{i}^{t}$ represents the length of the radar return at time $t$, and $\delta_{w}$ is the safety distance. Equation (25) takes effect when $\lambda_{i}^{t} < \delta_{w}$, with the penalty increasing as the agent approaches the obstacle, thereby encouraging the maintenance of a safe distance.
Thus, the total reward for all searchers is the sum of the phase-specific rewards and the collision warning penalty. To effectively guide the behavior of the agents, a high positive reward $r_{success}$ is provided upon successful completion of the task, while a large penalty $r_{fall}$ is imposed in cases of failure such as collisions, boundary violations, or timeout. This design incentivizes the agents to complete the search task efficiently while ensuring safety.
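As a minimal illustration of how the phase rewards and the collision warning penalty combine (function names, default constants, and the list-based radar representation are our assumptions, not the paper's implementation), Equations (19), (20), and (25) can be sketched as:

```python
import math

def phase_reward(dists, prev_dists, headings, bearings, C3=1.0, C4=0.5):
    """Tracking/surveillance reward: distance term (Eq. 19) + heading term (Eq. 20).

    dists/prev_dists: searcher-to-evader distances at t and t-1.
    headings/bearings: current heading angles and relative bearings to the evader.
    """
    # Eq. (19): reward shrinking distances, penalize growing ones
    r_p1 = sum((C3 if d < d_prev else -C3) / d
               for d, d_prev in zip(dists, prev_dists))
    # Eq. (20): maximal when a searcher heads straight at the target
    r_p2 = sum(C4 * math.cos(theta - rho)
               for theta, rho in zip(headings, bearings))
    return r_p1 + r_p2

def collision_penalty(radar_returns, delta_w=30.0, C6=0.1):
    """Eq. (25): penalty activates only when a radar return drops below delta_w."""
    return sum(min(C6 * (lam - delta_w), 0.0) for lam in radar_returns)
```

The focused-search reward of Equations (22) and (23) follows the same pattern, with the last known position $\beta$ replacing the evader.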

4.3. NGASAC

The NGASAC algorithm is built upon the Soft Actor-Critic (SAC) framework and incorporates several key enhancements tailored for multi-agent partially observable environments. First, we integrate NL into the Actor network to enhance the expressiveness of the policy and improve the flexibility of agent actions. Second, an MHGAT is incorporated into the Critic network to better capture spatial relationships among key entities in the environment, thereby accelerating convergence and improving final performance. Additionally, batch normalization (BN) and learning rate decay are employed to standardize input distributions, accelerate training, and enhance convergence stability. The algorithm adopts a CTDE architecture, and its core components are described below.

4.3.1. Normalizing Flow-Based Actor Network

In this work, the Actor network serves as the policy network for each agent, responsible for outputting a probability distribution over actions and generating continuous actions through sampling. Compared to networks that output deterministic actions, probabilistic policies enhance exploratory behavior. However, conventional policy networks typically output simple Gaussian distributions. Due to the unimodal nature of Gaussian distributions, it is difficult to accurately approximate complex optimal action distributions, thereby limiting the expressiveness of the policy. To overcome this limitation, we introduce NL, which enables the Actor network to learn more complex, multimodal action distributions while maintaining re-parameterizable sampling and differentiability. NL uses a series of invertible transformations to map a base Gaussian distribution to a more complex distribution, achieving a “flow” of probability density via the change of variables theorem, thereby producing more expressive distributional forms.
The structure of the normalizing flow-based Actor network is illustrated in Figure 5. The input is the local observation $o_i$ of each agent. This is passed through an MLP to extract high-dimensional features, followed by two fully connected layers $FC_{i}^{u}$ and $FC_{i}^{\sigma}$ that output the mean $u_i$ and log standard deviation $\log\sigma_i$ of the Gaussian distribution, respectively. The activation function used in the MLP is the rectified linear unit (ReLU).
The above process can be expressed by Equation (26):
$$u_{i} = FC_{i}^{u}\left(MLP_{i}(o_{i})\right), \qquad \log\sigma_{i} = FC_{i}^{\sigma}\left(MLP_{i}(o_{i})\right) \tag{26}$$
Given $u_i$ and $\log\sigma_i$, a Gaussian distribution is obtained: $u_{i,0} \sim p_{i,0} = \mathcal{N}(u_i, \sigma_i^2)$, where $u_{i,0}$ is an action sampled from this initial Gaussian distribution. Then $u_{i,0}$ is fed into the first flow transformation layer, generating a new action $u_{i,1} = f_1(u_{i,0})$, which can be regarded as sampled from a new probability distribution $p_{i,1}$. This distribution $p_{i,1}$ is derived from the original distribution $p_{i,0}$ through an invertible transformation. We employ a transformation known as Planar Flow [22], which offers the advantage of low computational cost:
$$f(u) = u + M_{z}\, G\left(M_{w}^{T} u + M_{b}\right) \tag{27}$$
where $M_{z}, M_{w} \in \mathbb{R}^{D_u}$ and $M_{b} \in \mathbb{R}$ are learnable parameters, $D_u$ is the dimensionality of action $u$, and $G(\cdot)$ is a differentiable smooth non-linear activation function; in this paper, the Tanh function is used. For the above transformation, the Jacobian determinant is computed as follows:
$$\varepsilon(u) = G'\left(M_{w}^{T} u + M_{b}\right) M_{w}, \qquad \det\left(\frac{\partial f}{\partial u}\right) = \det\left(I + M_{z}\,\varepsilon(u)^{T}\right) = 1 + M_{z}^{T}\varepsilon(u) \tag{28}$$
Subsequently, $u_{i,1}$ serves as the input to the next flow layer. Following the same procedure through $F$ flow layers, the final action $u_{i,\mathrm{end}}$ is obtained, which can be expressed as:
$$u_{i,\mathrm{end}} = f_{F} \circ f_{F-1} \circ \cdots \circ f_{2} \circ f_{1}\left(u_{i,0}\right) \tag{29}$$
The parameters $M_z$, $M_w$, $M_b$ of each flow layer are distinct. These parameters are generated by three fully connected layers $FC_{i}^{z}$, $FC_{i}^{w}$, $FC_{i}^{b}$, which take as input the observation $o_i$ processed through $MLP_i$. This can be expressed as follows:
$$[M_{z,1}, M_{z,2}, \ldots, M_{z,F}]_{i} = FC_{i}^{z}\left(MLP_{i}(o_{i})\right) \tag{30}$$
$$[M_{w,1}, M_{w,2}, \ldots, M_{w,F}]_{i} = FC_{i}^{w}\left(MLP_{i}(o_{i})\right) \tag{31}$$
$$[M_{b,1}, M_{b,2}, \ldots, M_{b,F}]_{i} = FC_{i}^{b}\left(MLP_{i}(o_{i})\right) \tag{32}$$
After $F$ invertible mappings, $u_{i,\mathrm{end}}$ is sampled from a complex and potentially multimodal probability distribution. The fully connected layers $FC_{i}^{z}$, $FC_{i}^{w}$, $FC_{i}^{b}$ are trained to learn flow transformation parameters adapted to the task requirements, enabling the agent to flexibly generate action policies suitable for complex environments.
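A single flow layer of Equations (27) and (28) can be sketched in PyTorch as follows (batch layout and variable names are our assumptions; in the actual network $M_z$, $M_w$, $M_b$ are conditioned on the observation as in Equations (30)–(32)):

```python
import torch

def planar_flow(u, M_z, M_w, M_b):
    """One Planar Flow step, Eq. (27): f(u) = u + M_z * tanh(M_w^T u + M_b).

    u, M_z, M_w: shape (batch, D_u);  M_b: shape (batch, 1).
    Returns the transformed action and log|det Jacobian| from Eq. (28).
    """
    inner = (M_w * u).sum(dim=-1, keepdim=True) + M_b       # M_w^T u + M_b
    f_u = u + M_z * torch.tanh(inner)
    # eps(u) = tanh'(inner) * M_w;  log|det| = log|1 + M_z^T eps(u)|
    eps = (1.0 - torch.tanh(inner) ** 2) * M_w
    log_det = torch.log(torch.abs(1.0 + (M_z * eps).sum(dim=-1)) + 1e-8)
    return f_u, log_det
```

Stacking $F$ such layers and summing the per-layer log-determinants yields the log-density correction needed for the entropy terms of the SAC objective.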

4.3.2. Multi-Head Graph Attention Critic Network

In this study, the spatial topological relationships among searchers and between searchers and the environment are crucial for cooperative decision-making. For instance, during the cooperative search phase, UAVs must disperse to maximize coverage while maintaining a safe distance from obstacles. In the tracking and surveillance phase, they must rapidly approach the evader, with decisions heavily dependent on relative spatial relationships. To effectively capture such structural information, this paper introduces MHGAT, specifically constructing an Obstacle-oriented MHGAT (MHGAT-O) and a Target-oriented MHGAT (MHGAT-T) to model the spatial dependencies between searchers–obstacles and searchers–targets, respectively.
Specifically, during the cooperative search phase, we construct a spatial topology graph comprising each searcher and its K nearest obstacles. During the tracking and surveillance phase, a graph is built with all searchers and the evader as nodes. MHGAT-O and MHGAT-T process these different types of graph structures, and their input features also differ.
In MHGAT-O, the node set includes the searchers and their nearest obstacles. The input node features are denoted as $H_{O} = [h_{o1}, h_{o2}, \ldots, h_{o(N+NK)}]$. When $h_{oi}$ represents a searcher, it includes the searcher's coordinates $p_i$ and heading angle $\theta_i$. When $h_{oi}$ represents an obstacle, it contains the coordinates of the obstacle's center, with the remaining dimensions padded with 0.
In MHGAT-T, the node set includes all searchers and the evader. The input node features are denoted as $H_{T} = [h_{t1}, h_{t2}, \ldots, h_{t(N+1)}]$, where each node feature has the same dimensionality. For example, $h_{ti}$ includes the current agent's coordinates $p_i$, heading angle, velocity $v_i$, and obstacle information $\eta_i$.
The architecture of the Multi-Head Graph Attention Critic network is illustrated in Figure 6. During the graph attention (GAT) operation, let the input feature of a node be $h \in \mathbb{R}^{D_h}$ and the output feature be $h' \in \mathbb{R}^{D_{h'}}$. The GAT employs a learnable linear transformation weight matrix $W \in \mathbb{R}^{D_{h'} \times D_{h}}$. Node features entering the network are transformed by $W$ into higher-dimensional features, which are then concatenated pairwise. The attention coefficients are computed using a weight vector $a$ and the LeakyReLU activation function. Taking nodes $i$ and $j$ as an example, the attention coefficient $e_{ij}$ is calculated as follows:
$$e_{ij} = \mathrm{LeakyReLU}\left(a^{T}\left[W h_{i} \,\|\, W h_{j}\right]\right) \tag{33}$$
where $a \in \mathbb{R}^{2D_{h'}}$ is a trainable vector and $\|$ denotes the concatenation operation. The attention coefficient $e_{ij}$ indicates the importance of node $j$'s features to node $i$.
To facilitate comparison of the importance of all neighboring nodes to node $i$, the attention coefficients are normalized using the softmax function. For nodes $i$ and $j$, the normalized attention coefficient $A_{ij}$ is calculated as follows:
$$A_{ij} = \mathrm{softmax}\left(e_{ij}\right) = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{A}_{i}} \exp\left(e_{ik}\right)} \tag{34}$$
where $\mathcal{A}_{i}$ denotes the set of all neighboring nodes of node $i$. After obtaining the normalized attention coefficients, a linear combination of the corresponding features is computed to form the output feature for each node. To capture diverse features and enhance spatial perception, a multi-head graph attention mechanism is employed. Each attention head uses a distinct parameter vector $a$, and the outputs of multiple attention heads are concatenated to produce the final output of the multi-head graph attention layer:
$$h_{i}' = \Big\Vert_{m=1}^{M} \sum_{j \in \mathcal{A}_{i}} A_{ij}^{m} W^{m} h_{j} \tag{35}$$
where $\Vert$ denotes the concatenation operation, $A_{ij}^{m}$ represents the normalized attention coefficient between nodes $j$ and $i$ computed by the $m$-th attention head, and $W^{m}$ is the learnable weight matrix of the $m$-th head. Thus, the output node features of MHGAT-O are denoted as $H_{O}' = [h_{o1}', h_{o2}', \ldots, h_{o(N+NK)}']$, and those of MHGAT-T as $H_{T}' = [h_{t1}', h_{t2}', \ldots, h_{t(N+1)}']$. Subsequently, $H_{O}'$ and $H_{T}'$ are concatenated and passed through a Dropout layer, which randomly sets a small proportion of values to zero to prevent overfitting. An ELU activation function is then applied to obtain the final feature representation $H$:
$$H = \mathrm{ELU}\left(\mathrm{Dropout}\left(H_{O}' \,\|\, H_{T}'\right)\right) \tag{36}$$
The multi-head graph attention Critic network evaluates the decisions of each agent based on the joint observations and joint actions of all agents. Therefore, the concatenated vector of the joint observation $O_{j}^{t}$ and joint action $u_{j}^{t}$ is first processed by $MLP_{q1}$, a two-layer fully connected network with ReLU activation, to produce an output feature vector $o_{u}$. This process can be expressed as:
$$o_{u} = MLP_{q1}\left(\left[O_{j}^{t} \,\|\, u_{j}^{t}\right]\right) \tag{37}$$
Subsequently, the concatenated vector of $o_{u}$ and $H$ is fed into $MLP_{q2}$, which consists of two fully connected layers with ReLU activation. The final output is the action value $q$. This process can be expressed as:
$$q = MLP_{q2}\left(\left[o_{u} \,\|\, H\right]\right) \tag{38}$$
The multi-head graph attention Critic network relies not only on the joint observations and actions of all agents but also emphasizes the graph-structured spatial features among agents. This enhances the Critic network’s ability to accurately assess the current state, provide more reasonable evaluations, and guide all agents toward more effective actions.
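For concreteness, the attention computation of Equations (33)–(35) for a single head can be sketched as follows (a dense-adjacency PyTorch sketch; the masking scheme and tensor layout are our assumptions, and the multi-head version concatenates $M$ such outputs):

```python
import torch
import torch.nn.functional as F

def gat_layer(H, W, a, adj):
    """Single-head graph attention, Eqs. (33)-(34), plus aggregation from Eq. (35).

    H:   (J, D_in) node features;  W: (D_in, D_out);  a: (2*D_out,)
    adj: (J, J) boolean adjacency (True where j is a neighbour of i).
    """
    Wh = H @ W                                            # transformed features (J, D_out)
    J = Wh.shape[0]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every ordered pair (i, j)
    pairs = torch.cat([Wh.unsqueeze(1).expand(J, J, -1),
                       Wh.unsqueeze(0).expand(J, J, -1)], dim=-1)
    e = F.leaky_relu(pairs @ a)                           # raw coefficients (J, J)
    e = e.masked_fill(~adj, float('-inf'))                # restrict attention to neighbours
    A = torch.softmax(e, dim=-1)                          # normalised coefficients, Eq. (34)
    return A @ Wh                                         # h_i' = sum_j A_ij W h_j
```

Running MHGAT-O and MHGAT-T amounts to calling such layers on the searcher-obstacle and searcher-evader graphs, respectively, before the Dropout and ELU steps of Equation (36).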

4.3.3. Training Algorithm

The NGASAC algorithm is built upon the maximum entropy reinforcement learning framework. Compared to traditional reinforcement learning methods, it encourages the policy network to maintain a certain level of exploration during training. The ultimate goal is for each agent to learn a policy $\pi_{i}$ such that the joint policy $\pi_{jt}$ maximizes the sum of expected rewards and entropy, formally defined as:
$$\pi_{jt}^{*} = \arg\max_{\pi_{jt}} \mathbb{E}_{(s_{t}, u_{jt}) \sim T_{\pi_{jt}}}\left[\sum_{t=0}^{+\infty} \gamma^{t}\left(r\left(S_{t}=s_{t}, U_{t}=u_{jt}\right) + \alpha \mathcal{H}\left(\pi_{jt}\left(\cdot \mid S_{t}=s_{t}\right)\right)\right)\right] \tag{39}$$
$$\mathcal{H}\left(\pi_{jt}\left(\cdot \mid S_{t}=s_{t}\right)\right) \triangleq \mathbb{E}_{u_{jt} \sim \pi_{jt}}\left[-\log \pi_{jt}\left(U_{t}=u_{jt} \mid S_{t}=s_{t}\right)\right] \tag{40}$$
where $T_{\pi_{jt}}$ denotes the trajectory distribution under the joint policy $\pi_{jt}$, $\mathcal{H}(\pi_{jt}(\cdot \mid S_{t}=s_{t}))$ represents the entropy of the joint policy $\pi_{jt}$ at state $s_{t}$, and $\alpha$ is a temperature coefficient that balances the importance of reward and entropy.
In this framework, the state-value function is defined as:
$$V\left(S_{t}=s_{t}\right) = \mathbb{E}_{u_{jt} \sim \pi_{jt}}\left[Q\left(S_{t}=s_{t}, U_{t}=u_{jt}\right) - \alpha \log \pi_{jt}\left(U_{t}=u_{jt} \mid S_{t}=s_{t}\right)\right] \tag{41}$$
and the soft Q-function is defined as:
$$Q\left(S_{t}=s_{t}, U_{t}=u_{jt}\right) = r\left(S_{t}=s_{t}, U_{t}=u_{jt}\right) + \gamma\, \mathbb{E}_{s_{t+1} \sim T}\left[V\left(S_{t+1}=s_{t+1}\right)\right] \tag{42}$$
The soft Q-network, parameterized by $\omega$, serves as the Critic network in this algorithm. It is trained by minimizing the following Bellman error:
$$J_{Q}(\omega) = \mathbb{E}_{(s_{t}, u_{jt}) \sim D}\left[\left(Q_{\omega}\left(S_{t}=s_{t}, U_{t}=u_{jt}\right) - r\left(S_{t}=s_{t}, U_{t}=u_{jt}\right) - \gamma\, \mathbb{E}_{s_{t+1} \sim T}\left[V_{\bar{\omega}}\left(S_{t+1}=s_{t+1}\right)\right]\right)^{2}\right] \tag{43}$$
where D denotes the experience replay buffer, and ω ¯ represents the parameters of the target Q-network. The use of a target network helps mitigate overestimation issues. At the beginning of training, the target Q-network is a mirror of the original soft Q-network, sharing the same architecture and parameters. The target Q-network is updated according to the following rule:
$$\bar{\omega} \leftarrow \tau \omega + (1 - \tau)\, \bar{\omega} \tag{44}$$
where τ is the mixing coefficient. The update of ω ¯ follows a soft update strategy. Equation (44) indicates that at each step, the target network parameters ω ¯ are updated by blending the parameters of the soft Q-network and the current target parameters. The policy of each agent is determined by the policy network parameters ϕ , which are optimized by minimizing the following loss:
$$J_{\pi}(\phi) = \mathbb{E}_{(s_{t}, o_{t}) \sim D}\left[\mathbb{E}_{u_{t} \sim \pi_{\phi}}\left[\alpha \log \pi_{\phi}\left(U_{t}=u_{t} \mid O_{t}=o_{t}\right) - Q_{\omega}\left(S_{t}=s_{t}, U_{t}=u_{jt}\right)\right]\right] \tag{45}$$
The temperature coefficient $\alpha$ is automatically adjusted during training by minimizing the following loss:
$$J(\alpha) = \mathbb{E}_{s_{t} \sim D}\left[-\alpha \log \pi_{jt}\left(U_{t}=u_{jt} \mid S_{t}=s_{t}\right) - \alpha \bar{\mathcal{H}}\right] \tag{46}$$
where $\bar{\mathcal{H}}$ is a predefined target entropy.
Thus, the Critic network, Actor network, and temperature parameter $\alpha$ are optimized at each training step according to Equations (43), (45) and (46), with learning rates $lr_{Q}$, $lr_{\pi}$, and $lr_{\alpha}$, respectively. The pseudo-code of the NGASAC algorithm is provided in Algorithm 1.
Algorithm 1: NGASAC Algorithm
Input: Search space and initial positions of all agents
Output: Policy networks for all agents
1: Initialize policy networks $\pi_{\phi_1}, \pi_{\phi_2}, \ldots, \pi_{\phi_N}$, Q-value networks $Q_{\omega_1}, Q_{\omega_2}$, and target Q-value networks $Q_{\bar{\omega}_1}, Q_{\bar{\omega}_2}$ with $\bar{\omega}_1 \leftarrow \omega_1$, $\bar{\omega}_2 \leftarrow \omega_2$
2: Initialize replay buffer $D$ with capacity $Z$: $D \leftarrow \varnothing$
3: for episode = 1 : $E$ do
4:   for $t$ = 1 : $T_{\max}$ do
5:     Obtain global state $s_t$ and joint observation $o_{jt}$
6:     for $i$ = 1 : $N$ do
7:       Extract individual observation $o_{it}$ from $o_{jt}$
8:       Select action $u_{it}$ according to policy network $\pi_{\phi_i}(\cdot \mid o_{it})$
9:     end for
10:    Execute joint action $u_{jt}$
11:    Receive reward $r_t$, next global state $s_{t+1}$, and next joint observation $o_{j(t+1)}$
12:    Store transition $(o_{jt}, s_t, u_{jt}, r_t, o_{j(t+1)}, s_{t+1})$ in $D$
13:    Sample a random batch from $D$
14:    Update Q-value networks by minimizing the loss in Equation (43): $\omega_k \leftarrow \omega_k - lr_Q \nabla_{\omega_k} J_Q(\omega_k)$, for $k \in \{1, 2\}$
15:    Update each policy network by minimizing the loss in Equation (45): $\phi_i \leftarrow \phi_i - lr_\pi \nabla_{\phi_i} J_\pi(\phi_i)$, for $i \in \{1, \ldots, N\}$
16:    Update the temperature coefficient by minimizing the loss in Equation (46): $\alpha \leftarrow \alpha - lr_\alpha \nabla_{\alpha} J(\alpha)$
17:    Update target networks via soft update (Equation (44))
18:  end for
19: end for
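To make steps 14–17 of Algorithm 1 concrete, the losses of Equations (43) and (46) and the soft update of Equation (44) can be sketched as follows (a minimal PyTorch sketch; tensor shapes, function names, and default hyperparameters are our illustrative assumptions, not the authors' implementation):

```python
import torch

def soft_update(target_params, source_params, tau=0.005):
    """Eq. (44): blend each target Q-network parameter toward the online network."""
    with torch.no_grad():
        for w_bar, w in zip(target_params, source_params):
            w_bar.mul_(1.0 - tau).add_(tau * w)

def critic_loss(q1, q2, reward, next_v_target, gamma=0.99):
    """Bellman error of Eq. (43), applied to both twin critics.

    q1, q2:        Q_w(s_t, u_jt) from the two online critics, shape (batch,)
    next_v_target: V from the target network at s_{t+1}, shape (batch,)
    """
    target = reward + gamma * next_v_target          # soft Bellman target
    return ((q1 - target) ** 2).mean() + ((q2 - target) ** 2).mean()

def temperature_loss(log_pi, alpha, target_entropy):
    """Eq. (46): J(alpha) = E[-alpha * log pi - alpha * H_bar]."""
    return (-alpha * (log_pi + target_entropy)).mean()
```

In a full implementation these losses would be minimized with separate optimizers at the learning rates $lr_Q$, $lr_\pi$, and $lr_\alpha$, with `log_pi` detached from the policy graph in the temperature update.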

4.3.4. Algorithm Complexity Analysis

Actor Network: Let the input dimension of the Actor network be $M_{in}$, the output dimension be 2, the hidden layer dimension be $M_{h}$, and the number of hidden layers be $L$. The computational complexity of the Actor network is $O(M_{in} M_{h} + L M_{h}^{2} + 2 M_{h})$. Additionally, the normalizing flow transformation introduces extra computation. Three separate single-layer networks are used to generate the flow parameters, with an input dimension of $M_{h}$, number of flows $F$, and UAV action dimension $M_{a}$. The output layer dimensions are $M_{a}F$, $M_{a}F$, and $F$, respectively. Thus, the computational complexity for generating the flow parameters is $O(2 M_{h} M_{a} F + M_{h} F)$. The final action is computed via matrix multiplication, with a complexity of $O(2 M_{a} F)$. Ignoring constant and lower-order terms, the total computational complexity of the normalizing flow-based Actor network is $O_{actor}(M_{in} M_{h} + L M_{h}^{2} + M_{h} M_{a} F)$.
Critic Network: Assume the input dimension of the Critic's $MLP_{q1}$ is $M_{ia}$, the hidden layer dimension is $M_{h}$, and the output dimension is $M_{q1}$. The computational complexity of $MLP_{q1}$ is $O(M_{ia} M_{h} + M_{h} M_{q1})$. Suppose the GAT network has $J$ nodes, input node feature dimension $M_{ni}$, and output node feature dimension $M_{no}$, and let the number of edges in the graph be $\varepsilon$. The complexity for node feature mapping is $O(J M_{ni} M_{no})$, and the complexity for computing attention coefficients via edge mapping is $O(2 \varepsilon M_{no})$. Combining both, the computational complexity of a single-head GAT is $O(J M_{ni} M_{no} + 2 \varepsilon M_{no})$. Ignoring differences in input dimensions between MHGAT-O and MHGAT-T and omitting constants, the complexity of a $K$-head MHGAT is $O(K M_{no}(J M_{ni} + \varepsilon))$. The outputs of the two parts are concatenated and used as input to $MLP_{q2}$, which has an input dimension of $M_{q1} + K J M_{no}$, an output dimension of 1, a hidden layer dimension of $M_{h}$, and $L$ hidden layers. The computational complexity of $MLP_{q2}$ is $O((M_{q1} + K J M_{no}) M_{h} + L M_{h}^{2} + M_{h})$. Neglecting constants and lower-order terms, the overall computational complexity of the multi-head graph attention Critic network is $O_{critic}(M_{h}(M_{ia} + L M_{h} + M_{q1}) + K M_{no}(J(M_{ni} + M_{h}) + \varepsilon))$.
Distributed Execution: During distributed execution, only the Actor network is used for computation. Let the total number of iterations be $T$ and the number of UAVs be $N$. The total computational complexity in the distributed execution phase is $O_{de} = T N O_{actor}$.
Centralized Training: During the centralized training phase, both the Actor and Critic networks are utilized. Let the total number of iterations be $T$, the number of UAVs be $N$, and the batch size be $D$. Due to the use of twin Q-networks, each training step involves $2N$ forward passes through the Actor network and $4 + 2N$ through the Critic network. Ignoring constant and lower-order terms, the simplified computational complexity is $O_{ct} = T D N (O_{actor} + O_{critic})$.

5. Simulations

In this section, we conduct simulations to validate the effectiveness of the proposed NGASAC algorithm for multi-UAV cooperative search in partially observable low-altitude environments. To thoroughly evaluate its performance, we compare it against three classical MADRL algorithms using multiple representative metrics. Furthermore, we examine the generalization capability of NGASAC across different environmental settings and perform ablation studies to analyze the contribution of each key module.

5.1. Simulation Environment Setup

(a) Airspace Environment: We simulate a low-altitude urban environment, as illustrated in Figure 7. The flight space is confined to a region of [0, 1000] m × [0, 1000] m. The environment contains 10 square obstacles of identical size (70 m side length), with edges aligned parallel to the x- and y-axes.
(b) UAV Configuration: The system comprises three search UAVs. The initial positions of the searchers are fixed at [80,80], [80,100] and [100,80], while the initial position of the evader is randomly generated within the flight space outside obstacle areas. UAV parameters are set with reference to existing literature [38,39,40]; specific values are listed in Table 1. The maximum task time is set to 140 s. The task terminates early if a searcher exits the flight space, collides with an obstacle or another agent, or successfully detects the evader.

5.2. Evasion Strategies

In the simulations, we design two distinct evasion strategies for the evader, as follows:
Random Evasion Strategy: The evader does not perceive the searchers’ information and moves randomly within the environment at maximum speed.
Evasive Escape Strategy: When the evader is within the field of view of any searcher, it obtains the searchers’ positions and actively moves in the direction opposite to the nearest searcher to maximize its escape probability.
In both strategies, the evader is equipped with obstacle avoidance capability. When the distance to an obstacle is less than 25 m, the evader proactively maneuvers to avoid collision. The same rule is applied to prevent it from escaping the task airspace boundaries.
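The evasive escape rule described above can be sketched as follows (the vector arithmetic, function name, and single-obstacle interface are our illustrative simplifications of the strategy, not the authors' code):

```python
import math

def evasive_heading(evader_pos, searcher_positions, obstacle_pos=None, avoid_dist=25.0):
    """Head opposite the nearest visible searcher; obstacle avoidance overrides below 25 m."""
    if obstacle_pos is not None:
        dx = evader_pos[0] - obstacle_pos[0]
        dy = evader_pos[1] - obstacle_pos[1]
        if math.hypot(dx, dy) < avoid_dist:
            return math.atan2(dy, dx)            # steer directly away from the obstacle
    # otherwise flee the nearest searcher whose position is known
    nearest = min(searcher_positions,
                  key=lambda p: math.hypot(evader_pos[0] - p[0], evader_pos[1] - p[1]))
    return math.atan2(evader_pos[1] - nearest[1], evader_pos[0] - nearest[0])
```

The random evasion strategy simply replaces the fleeing branch with a uniformly sampled heading while keeping the same obstacle and boundary avoidance rule.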

5.3. Comparative Simulations

The hyperparameters of the NGASAC algorithm and the reward function are summarized in Table 2. The compared algorithms include MASAC, MAPPO, and MADDPG. All simulations were conducted on a platform equipped with an i7-14700K processor, an RTX-4070Ti GPU, and 32 GB of RAM. During training, when the number of experiences in the replay buffer is less than the batch size, the searchers execute random actions. Experiences collected during this random exploration phase are retained. Once the buffer reaches the batch size, the model begins updating. Under both evasion strategies, the algorithm is evaluated every $1 \times 10^{4}$ training steps. Each evaluation consists of 50 episodes. In addition to cumulative reward, five evaluation metrics are defined based on the problem characteristics to compare the performance of different algorithms:
Success Rate: The proportion of tasks in which the searchers successfully maintain continuous surveillance of the target. A higher value indicates better policy performance.
Task Time: The average time required to complete a task. If a task fails (due to collision or exiting the airspace), the time for that episode is recorded as the maximum task time (140 s). A shorter task time reflects higher algorithmic efficiency.
Collision Departure Rate: The proportion of tasks in which a searcher collides or exits the airspace. A higher value indicates poorer safety and stability of the algorithm.
Search Time: The average time from the start of the task until the evader is first detected. A lower value indicates stronger cooperative search capability of the UAV group.
Disappearance Time: The average time required to re-detect the evader after it leaves the field of view of all searchers. If the target is lost multiple times, the durations are accumulated. This metric reflects the algorithm’s focused search capability after target loss. A shorter disappearance time indicates a more effective search strategy.
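The five metrics can be aggregated from per-episode logs roughly as follows (the record format and field names are our assumptions, not the authors' evaluation code):

```python
def summarize(episodes, t_max=140.0):
    """Aggregate evaluation metrics from per-episode records (format assumed).

    Each episode dict: success (bool), failed (bool, collision/boundary exit),
    task_time, search_time, and lost_time (accumulated re-detection time).
    """
    n = len(episodes)
    return {
        "success_rate": sum(e["success"] for e in episodes) / n,
        "collision_departure_rate": sum(e["failed"] for e in episodes) / n,
        # failed episodes are recorded as the maximum task time
        "task_time": sum(t_max if e["failed"] else e["task_time"] for e in episodes) / n,
        "search_time": sum(e["search_time"] for e in episodes) / n,
        "disappearance_time": sum(e["lost_time"] for e in episodes) / n,
    }
```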
To further analyze the training dynamics of the algorithms, we tracked the changes in reward and success rate during training for all four algorithms, as shown in Figure 8 and Figure 9.
As observed from the reward curves (Figure 8), NGASAC consistently achieves higher reward values across different evaluation steps, with an overall trend of steady increase as training progresses, indicating strong reward optimization capability and policy stability. In contrast, the reward growth of MADDPG and MAPPO is relatively slow or exhibits greater volatility, suggesting lower learning efficiency or stability in this complex environment. As for MASAC, while it demonstrates better initial learning dynamics than MADDPG and MAPPO, its final reward and convergence stability are still surpassed by the proposed NGASAC. From the success rate curves (Figure 9), it can be seen that NGASAC generally attains a higher success rate than the other algorithms and remains at a high level after stabilization, demonstrating excellent task completion performance. Although MASAC shows some improvement in the early stages, its final success rate remains lower than that of NGASAC. The success rates of MAPPO and MADDPG improve only marginally, indicating limitations in policy exploration or generalization. Taken together, the two figures show that NGASAC significantly outperforms the comparison algorithms in both reward acquisition and task success rate, demonstrating superior optimization performance and convergence stability. Through the NL and MHGAT modules, it generates fewer low-quality experiences and therefore explores and learns more effectively, gains a deeper understanding of spatial relationships that generalizes across scenarios, and produces more efficient and sophisticated actions that move beyond simple strategies, making it well suited to multi-UAV cooperative search tasks in low-altitude partially observable environments.
The trained policy network was evaluated over 1000 episodes. The performance metrics are summarized in Table 3 (best results highlighted in bold). For intuitive comparison, line charts of the reward, success rate, task time, and collision departure rate under both evasion strategies are shown in Figure 10. Under the random evasion strategy, NGASAC outperformed all baseline algorithms in reward, success rate, task time, collision departure rate, and search time. It achieved an average reward of 50.38, significantly higher (by at least 2 points) than other algorithms, with a small standard deviation indicating high stability. The success rate reached 84.8%, and the collision departure rate was only 1.90%, clearly superior to even the next best algorithm, MAPPO (success rate: 82.00%, collision departure rate: 3.10%). NGASAC also achieved the shortest task time (70.39 s) and search time (49.04 s), demonstrating its ability to locate and monitor the target more quickly. In terms of disappearance time, NGASAC (8.03 s) performed on par with MASAC (7.50 s), indicating comparable capability in focused search after target loss. Under the evasive escape strategy, NGASAC maintained a significant advantage. It achieved a reward of 52.03, substantially higher than other algorithms (around 48). Although overall performance declined due to the target’s evasive behavior, NGASAC still achieved the highest success rate (77.40%) and the lowest collision departure rate (6.00%). While its task time (95.19 s) and search time (50.92 s) increased compared to the random evasion scenario, they remained significantly better than those of other algorithms. The disappearance time for NGASAC (14.82 s) showed no significant difference from MAPPO (14.38 s), further confirming its strong focused search capability. This indirectly indicates that the NGASAC algorithm has learned not to rush blindly to the next location, but rather a more balanced and strategic approach. 
When the target is lost, the intelligent system will comprehensively consider more factors, such as maintaining a safe distance from obstacles and teammates, and strategically positioning itself to block the potential escape routes of the evader, rather than merely pursuing its last known location. This kind of intelligent and collaborative behavior, although it may take a little more time to execute, can lead to a more robust long-term monitoring effect and a significant reduction in collision rates. Based on the above analysis, NGASAC demonstrates excellent stability, search efficiency, and obstacle avoidance performance across different target strategies, validating its effectiveness and superiority for multi-UAV cooperative search tasks in partially observable low-altitude environments.
Rapid search and continuous surveillance of the evader represent the core challenges of this task. To more clearly compare the performance of each algorithm in terms of task time, search time, and disappearance time, the average metrics under both evasion strategies are shown in Figure 11. The results indicate that under both strategies, the NGASAC algorithm consistently achieves the shortest task time and search time, demonstrating its ability to efficiently locate the evader and complete the monitoring task. The MASAC algorithm performs second best, while MAPPO and MADDPG exhibit the longest task times. Under the random evasion strategy, NGASAC’s disappearance time is only 0.53 s longer than that of MASAC. Under the evasive strategy, its disappearance time is only 0.44 s longer than that of MAPPO. This indicates that NGASAC can effectively learn and execute a focused search strategy after target loss, promptly re-examining suspicious regions to reacquire the evader. In summary, the analysis demonstrates that the NGASAC algorithm effectively learns strategies such as cooperative search, tracking and surveillance, and focused search, and achieves flexible switching among these strategies, thereby accomplishing the search and surveilling mission with high efficiency.
Figure 12 presents the simulation results under two evasion strategies: (a) random evasion and (b) evasive escape. The searcher policies in both experimental groups were obtained after $5 \times 10^{5}$ training steps.
Under the random evasion strategy, the evader starts at [600, 870], and its movement is independent of the searchers’ positions. The three searchers depart from their predefined initial positions. During the cooperative search phase, they spread out to expand perceptual coverage, reduce field-of-view overlap, and efficiently explore unknown regions. The trajectories indicate that the searchers possess robust obstacle avoidance capabilities and can navigate autonomously within the task area. Searcher 2 first detects the evader, and this positional information is shared with other agents via a communication mechanism. Subsequently, Searcher 1 and Searcher 2 adjust their courses toward the target and initiate tracking. The green trajectories indicate that the searchers enter a tracking and surveillance state, maintaining continuous surveillance of the target until the preset threshold T s is reached, marking successful task completion. Under the evasive escape strategy, the evader starts at [700, 900] and can access the searchers’ positional information within its field of view, enabling active evasion. During the cooperative search phase, the searchers still adopt a dispersed strategy, effectively avoiding obstacles and covering the task area. In this experiment, green trajectories represent periods when the target is detected by the searchers, while yellow trajectories indicate paths taken during focused search operations after the target is lost. Searcher 3 first detects the evader but loses visual contact due to the target’s evasive maneuvering. All searchers then converge to search the suspicious region around the target’s last known position. Eventually, Searcher 2 successfully reacquires the target. Although the evader attempts to escape again, it fails to break surveillance within 10 s, and the task is completed successfully. 
It is important to note that the present study, as a foundational investigation, does not account for external disturbances such as wind or internal uncertainties like sensor noise and actuator dynamics. These factors, while critical for real-world application, are left for future investigation to first conclusively establish the baseline performance of the proposed framework in a controlled setting.

5.4. Generalization Simulations

To validate the generalization capability of the NGASAC algorithm, we constructed five types of simulation environments with distinct characteristics. In all environments, the evader employs the evasive escape strategy. The neural network model with the highest success rate during training was selected for generalization testing. The specific environmental configurations are as follows:
(1) Environments with varying initial searcher positions: In the original training, the searchers’ starting positions were fixed. To evaluate the impact of initial position distribution on algorithm performance, we designed a set of environments with different initial configuration layouts, as illustrated in Figure 13a.
(2) Environments with varying obstacle quantity and layout: The original environment has a fixed number and shape of obstacles. To investigate the impact of obstacle density and geometric layout on the adaptability of the algorithm, two new obstacle configurations were designed: one is a high-density obstacle environment, and the other contains rectangular obstacles of random sizes, as shown in Figure 13b.
(3) Environments with different numbers of searchers: To analyze the influence of the number of agents on the cooperative strategy, an additional searcher was introduced, forming an experimental environment with four searchers, as illustrated in Figure 13d.
(4) New task environment (S5): To examine the generalization ability of the policy network under changes in task structure, a new task was designed in which three static targets are deployed. A target is considered successfully detected once it is found by any searcher, and the objective is to discover as many targets as possible, as shown in Figure 13e.
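Under the assumption that each test environment is derived from the baseline by a small set of overrides, the five scenarios can be summarized as configuration deltas. The dictionary keys and values below are illustrative labels, not the authors’ actual environment parameters.

```python
# Hypothetical summary of generalization scenarios S1-S5 as overrides of
# the baseline environment S0; all names are illustrative.
BASELINE = {
    "n_searchers": 3,
    "obstacles": "fixed",
    "start": "fixed",
    "task": "search_and_track",
}

SCENARIOS = {
    "S1": {"start": "dispersed"},                          # new initial layout
    "S2": {"obstacles": "high_density"},                   # denser obstacles
    "S3": {"obstacles": "random_rectangles"},              # new obstacle shapes
    "S4": {"n_searchers": 4},                              # extra searcher
    "S5": {"task": "find_static_targets", "n_targets": 3}, # new task structure
}

def make_env_config(name):
    """Build a full environment config by applying a scenario's overrides."""
    cfg = dict(BASELINE)
    cfg.update(SCENARIOS.get(name, {}))
    return cfg
```

Expressing scenarios as deltas makes explicit that the policy network itself is unchanged across S1–S5; only the environment factory varies.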
To comprehensively evaluate the generalization performance of the NGASAC algorithm, 100 repeated simulations were conducted for each new scenario. The evaluation metrics for Scenarios S0 to S5 are summarized in Table 4, where S0 denotes the original baseline scenario. Note that for S5, only three metrics apply: success rate, task time, and collision departure rate.
From the results, the distribution of initial positions (S1) has no significant impact on policy performance; moreover, a more dispersed initial layout helps improve search efficiency, increasing the success rate to 81% and reducing the task time to 88.40 s, while the collision departure rate remains largely stable. Regarding obstacle variations, the dense obstacle environment (S2) increases search difficulty to some extent, leading to longer task times and a higher collision rate, though the algorithm still maintains relatively good overall performance. The environment with differently shaped obstacles (S3) recovers part of the success rate lost in S2, reaching 75%, but task time increases to 101.57 s, indicating that NGASAC retains a degree of adaptability to unfamiliar obstacle configurations. When the number of searchers increases to four (S4), cooperative efficiency improves significantly: the success rate rises to 88%, and task time decreases substantially to 74.60 s, demonstrating the algorithm’s ability to effectively utilize additional agents. However, the collision departure rate increases slightly, which may be related to the increased complexity of cooperative maneuvering. In the new task scenario (S5), the algorithm exhibits strong transfer capability, achieving a success rate as high as 90%, indicating that the trained policy possesses good generalization and task extensibility.
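For reference, the per-scenario metrics in Table 4 can be computed from raw episode logs along these lines. The field names (`success`, `task_time`, `collision_or_departure`) are hypothetical, and averaging task time over successful episodes only is an assumption, not a detail stated in the paper.

```python
from statistics import mean

def aggregate(episodes):
    """Aggregate per-episode logs into the metrics reported per scenario.

    Each episode is assumed to be a dict with keys 'success' (bool),
    'task_time' (s), and 'collision_or_departure' (bool); these names
    are illustrative.
    """
    n = len(episodes)
    succ = [e for e in episodes if e["success"]]
    return {
        "success_rate_pct": 100.0 * len(succ) / n,
        # assumption: task time is averaged over successful episodes only
        "task_time_s": mean(e["task_time"] for e in succ) if succ else float("nan"),
        "collision_departure_rate_pct":
            100.0 * sum(e["collision_or_departure"] for e in episodes) / n,
    }
```

Running 100 such episodes per scenario and calling `aggregate` once would reproduce the row structure of Table 4.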

5.5. Ablation Studies

To further validate the effectiveness of the key components of the NGASAC algorithm, we designed systematic ablation simulations with the following three variants:
Ablation 1: remove the NF layer; the policy network samples actions directly from a Gaussian distribution.
Ablation 2: remove the MHGAT; the Critic network estimates action values using only an MLP over the joint observations and actions.
MASAC variant: remove both the NF layer and the MHGAT, reducing the model to the standard MASAC algorithm.
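To make the role of the NF layer concrete, the following framework-free sketch contrasts direct Gaussian sampling (what Ablation 1 falls back to) with sampling through a chain of invertible maps, where the log-density is corrected by the log-determinant of each Jacobian. The fixed tanh layers stand in for learned flow layers; this illustrates the change-of-variables mechanism only and is not the authors’ implementation.

```python
import math
import random

def gaussian_sample(rng):
    """Baseline (Ablation 1): draw from a standard normal with its log-density."""
    z = rng.gauss(0.0, 1.0)
    log_p = -0.5 * (z * z + math.log(2.0 * math.pi))
    return z, log_p

def flow_sample(rng, n_layers=4):
    """Push a Gaussian base sample through n invertible layers.

    Each layer applies tanh (a stand-in for a learned flow layer) and
    corrects the log-density by log|d tanh/dz| = log(1 - tanh(z)^2).
    """
    z, log_p = gaussian_sample(rng)
    for _ in range(n_layers):
        a = math.tanh(z)                        # invertible squashing map
        log_p -= math.log(1.0 - a * a + 1e-12)  # change-of-variables correction
        z = a
    return z, log_p
```

Even with these fixed layers, the transformed distribution is no longer Gaussian; with learned, parameterized layers the policy can represent far more flexible action distributions, which is the expressiveness Ablation 1 gives up.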
Simulations were conducted in an environment where the evader adopts the evasive escape strategy. Each algorithm was evaluated on four metrics: reward, success rate, task time, and collision departure rate.
The experimental results are shown in Table 5 and Figure 14. NGASAC achieves the best performance across all four evaluation metrics. Comparing Ablation 1 and Ablation 2 shows that removing the MHGAT mechanism has the larger impact on performance, indicating that this mechanism plays a critical role in capturing the evader by enhancing spatial coordination among agents. Specifically, relative to Ablation 1, the incorporation of the NF layer yields a 0.59 increase in reward, a 2.5% improvement in success rate, a 9.43 s reduction in task time, and a 2.4% decrease in collision departure rate. Relative to Ablation 2, the MHGAT mechanism yields a 0.95 increase in reward, a 3.4% improvement in success rate, a 5.74 s reduction in task time, and a 2.9% decrease in collision departure rate. Relative to MASAC, integrating both modules yields a 3.83 increase in reward, a 5.8% improvement in success rate, a 14.55 s reduction in task time, and a 4.5% decrease in collision departure rate. In summary, both the NF layer and the MHGAT mechanism contribute significantly to the performance improvement; they operate in a complementary manner and collectively account for NGASAC’s strong performance in multi-agent cooperative search tasks.
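As rough intuition for why the graph-attention mechanism helps, the following sketch computes a single dot-product attention head: an agent’s query vector scores each neighbour node (other agents, obstacles, the target), and the softmax-normalized weights aggregate the neighbour features. All vectors and the scoring rule are illustrative; in a multi-head version, several such heads run in parallel and their outputs are concatenated.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_head(query, neighbours):
    """One graph-attention head: weight neighbour features by attention scores.

    query: feature vector of the attending node.
    neighbours: list of neighbour feature vectors (same dimension).
    """
    scores = [sum(q * k for q, k in zip(query, n)) for n in neighbours]  # dot products
    weights = softmax(scores)
    dim = len(query)
    # Convex combination of neighbour features, weighted by attention.
    return [sum(w * n[i] for w, n in zip(weights, neighbours)) for i in range(dim)]
```

Because the weights depend on the current spatial configuration, nodes such as a nearby obstacle or the target can dominate the aggregated feature, which is the kind of “obstacle–target” relational reasoning the critic exploits.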

6. Conclusions and Future Work

This paper addresses the problem of multi-UAV cooperative search from a top-down perspective in partially observable low-altitude environments. This scenario is highly relevant to real-world applications, such as UAVs equipped with visual sensors searching for unauthorized ground targets, and can be extended to other complex settings (e.g., cooperative monitoring of marine organisms by multiple underwater vehicles). To tackle this task, we propose NGASAC, a MADRL algorithm that incorporates NF and MHGAT mechanisms on top of the MASAC framework, enhancing both model expressiveness and cooperative decision-making performance. The NF layer overcomes the limited representational capacity of traditional Gaussian distributions, improving the flexibility of action sampling, while the MHGAT mechanism effectively captures spatial relationships and dependencies among agents, providing more accurate value estimation for joint actions and thereby facilitating policy optimization. Simulations demonstrate that NGASAC outperforms MASAC, MAPPO, and MADDPG across multiple evaluation metrics, including reward, success rate, task time, and collision departure rate, and exhibits strong generalization in new tasks, larger agent systems, and complex environments.

In future work, we plan to introduce more real-world constraints, such as communication limitations among multiple UAVs, localization errors, and individual agent failures, and to further explore the adaptability of the algorithm to three-dimensional search tasks. The NGASAC algorithm can also be extended to other complex multi-agent cooperative scenarios, such as UAV formation control and collaborative navigation in challenging environments.
At the same time, an important direction is to rigorously incorporate the influence of external disturbances and to study robust reinforcement learning methods or adaptive control techniques that can compensate online for persistent disturbances.

Author Contributions

X.-X.Y. was responsible for conceptualization, research methodology design, funding acquisition, and project supervision; W.-Q.Y. was responsible for software programming, data validation, formal analysis, and visualization; Y.Z. was responsible for experimental operations, investigation execution, and data collection and curation; H.Y. was responsible for writing the original draft, data visualization, and result analysis; C.W. was responsible for data curation, literature investigation, and manuscript editing and revision. All authors participated in reviewing and finalizing the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shandong Province Natural Science Foundation.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy restrictions and the approvals required for their release.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figure 1. A simple scenario of the cooperative search task.
Figure 2. Kinematic relationship.
Figure 3. Schematic of the cooperative search in a 2D plane.
Figure 4. Phased search strategy.
Figure 5. Normalizing Flow-based Actor Network.
Figure 6. Multi-Head Graph Attention Critic Network.
Figure 7. Simulation Environment.
Figure 8. Reward Curves During Training. (a) Random Evasion Strategy. (b) Evasive Escape Strategy.
Figure 9. Success Rate Curves During Training. (a) Random Evasion Strategy. (b) Evasive Escape Strategy.
Figure 10. Line Charts of Evaluation Metrics. (a) Random Evasion Strategy. (b) Evasive Escape Strategy.
Figure 11. Time comparison. (a) Random Evasion Strategy. (b) Evasive Escape Strategy.
Figure 12. Simulation experiment. (a) Random Evasion Strategy. (b) Evasive Escape Strategy.
Figure 13. Generalization test environments. (a) Different starting positions. (b) High-density obstacles. (c) Random-sized obstacles. (d) 4 searchers. (e) New task.
Figure 14. Bar Chart of Ablation Experiment Results.
Table 1. Agent Configuration Parameters.

| Name | Symbol | Value | Unit |
|---|---|---|---|
| Angular velocity | a_θ^t | −π/3 to π/3 | rad/s |
| Evader velocity | v_e | 12 | m/s |
| Searcher velocity | v_i | 9–11 | m/s |
| Searcher acceleration | a_v^t | −1 to 1 | m/s² |
| Searcher sensing radius | R | 200 | m |
| Minimum safe distance | δ_s | 1 | m |
| Maximum radar range | λ_max | 100 | m |
| Obstacle warning distance | δ_w | 25 | m |
| Required monitoring duration | T_s | 10 | s |
Table 2. NGASAC Hyperparameter Settings.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Training steps | 5 × 10^5 | Replay buffer size | 5 × 10^5 |
| Batch size | 256 | Soft update coefficient | 0.01 |
| Discount factor | 0.95 | Dropout rate | 0.01 |
| Optimizer | Adam | LeakyReLU parameter | 0.2 |
| Number of attention heads | 2 and 2 | Constant C_1 | 1 × 10^4 |
| Number of flow layers | 4 | Constant C_2 | 1.5 × 10^5 |
| Hidden layer dimension | 64 | Constant C_3 | 100 |
| Actor network learning rate | 1 × 10^−4 | Constant C_4 | 0.25 |
| Critic network learning rate | 1 × 10^−3 | Constant C_5 | 1.2 |
| Entropy learning rate | 1 × 10^−3 | Constant C_6 | 0.02 |
| Success reward r_success | 10 | Failure penalty r_fail | −10 |
Table 3. Simulation Results of Evaluation Metrics.

Random Evasion Strategy:

| Metric | NGASAC | MASAC | MAPPO | MADDPG |
|---|---|---|---|---|
| Reward | 50.38 ± 2.30 | 47.80 ± 4.14 | 47.22 ± 3.82 | 47.69 ± 4.35 |
| Success Rate (%) | 84.80 | 81.50 | 82.00 | 79.10 |
| Task Time (s) | 70.39 ± 6.30 | 75.91 ± 5.74 | 79.04 ± 5.05 | 83.38 ± 4.99 |
| Collision Departure Rate (%) | 1.90 | 4.20 | 3.10 | 5.30 |
| Search Time (s) | 49.04 ± 7.36 | 54.72 ± 6.70 | 55.31 ± 6.37 | 55.09 ± 7.60 |
| Disappearance (s) | 8.03 ± 2.36 | 7.50 ± 2.44 | 8.91 ± 2.09 | 10.12 ± 3.81 |

Evasive Escape Strategy:

| Metric | NGASAC | MASAC | MAPPO | MADDPG |
|---|---|---|---|---|
| Reward | 52.03 ± 4.93 | 48.20 ± 6.88 | 47.90 ± 4.06 | 48.04 ± 5.63 |
| Success Rate (%) | 77.40 | 71.60 | 69.50 | 69.30 |
| Task Time (s) | 95.19 ± 8.49 | 109.74 ± 9.03 | 115.37 ± 8.65 | 121.29 ± 7.37 |
| Collision Departure Rate (%) | 6.00 | 10.50 | 9.90 | 12.60 |
| Search Time (s) | 50.92 ± 8.58 | 57.18 ± 7.99 | 67.18 ± 6.22 | 63.28 ± 7.83 |
| Disappearance (s) | 14.82 ± 5.39 | 15.05 ± 4.40 | 14.38 ± 5.28 | 16.60 ± 6.35 |
Table 4. Generalization experiment results.

| Metric | S0 | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|---|
| Reward | 52.03 | 53.85 | 48.37 | 51.82 | 54.9 | N/A |
| Success Rate (%) | 77.40 | 81 | 69 | 75 | 88 | 90 |
| Task Time (s) | 95.19 | 88.40 | 112.74 | 101.57 | 74.60 | 92.49 |
| Collision Departure Rate (%) | 6 | 7 | 15 | 11 | 9 | 8 |
| Search Time (s) | 50.92 | 49.37 | 57.06 | 53.91 | 41.49 | N/A |
| Disappearance (s) | 14.82 | 18.26 | 30.11 | 26.51 | 2.88 | N/A |
Table 5. Ablation Experiment Results.

| Metric | NGASAC | Ablation 1 | Ablation 2 | MASAC |
|---|---|---|---|---|
| Reward | 52.03 | 51.44 | 51.08 | 48.2 |
| Success Rate (%) | 77.4 | 74.9 | 74 | 71.6 |
| Task Time (s) | 95.19 | 104.62 | 100.93 | 109.74 |
| Collision Departure Rate (%) | 6 | 8.4 | 8.9 | 10.5 |

Share and Cite

Yang, X.-X.; Yao, W.-Q.; Zhang, Y.; Yu, H.; Wang, C. Multi-UAV Cooperative Search in Partially Observable Low-Altitude Environments Based on Deep Reinforcement Learning. Drones 2025, 9, 825. https://doi.org/10.3390/drones9120825