1. Introduction
Unmanned aerial vehicles (UAVs), owing to their cost-effectiveness, flexible deployment, and adaptability to complex environments, have been extensively employed in tasks such as reconnaissance, mapping, patrolling, and target search, becoming indispensable components of intelligent perception systems [1,2]. In multi-UAV systems, these advantages are further amplified: cooperative operations not only enhance the efficiency and robustness of complex task execution but also expand mission coverage and system scalability [3]. In particular, for dynamic target search missions in unknown environments, multi-UAV systems have been widely applied in disaster rescue, agricultural inspection, and counter-terrorism security scenarios [4,5].
Nevertheless, multi-UAV systems still encounter considerable challenges in dynamic target search tasks. On the one hand, their cooperative mechanisms are inherently complex, requiring rational task allocation to avoid path conflicts, resource waste, and redundant search efforts. On the other hand, only limited prior information about target locations is available at the initial stage, and target positions and movements change over time, which makes cooperative decision-making even more challenging.
Existing studies on multi-UAV dynamic target search can be broadly categorized into four methodological classes: planning-based approaches, optimization-based approaches, heuristic methods, and reinforcement learning. Planning-based methods generate high-coverage search paths through area partitioning [6,7] and trajectory design [8,9,10,11], which are suitable for static or partially known environments but lack responsiveness under dynamic target scenarios. Optimization-based approaches rely on multi-objective function models to balance metrics such as path length [12,13], search time [14,15,16], and coverage rate [17]; although theoretically optimal, their computational complexity escalates rapidly with task scale, limiting real-time applicability. Heuristic methods, including particle swarm optimization [18,19], ant colony algorithms [20], and multi-population cooperative coevolution [21], exhibit strong global search capability and algorithmic flexibility [22,23], but often lack effective feedback control, making them prone to local optima and slow convergence.
Among the aforementioned approaches, reinforcement learning (RL) [24], particularly multi-agent reinforcement learning (MARL) [25], has emerged as a promising paradigm for addressing dynamic target search problems. Its key advantage lies in its independence from precise modeling: adaptive policies are learned through continuous interaction with the environment, which ensures strong generalization and robustness. In particular, multi-agent algorithms based on proximal policy optimization, such as MAPPO [26], have demonstrated remarkable performance in handling partial observability, high-dimensional state spaces, and cooperative decision-making, and have been widely applied to multi-UAV cooperative search tasks [27,28,29]. For instance, Refs. [30,31] leveraged MARL to optimize search path allocation and real-time response mechanisms, significantly improving target-tracking accuracy and system-level coordination. In scenarios with unknown target quantities or incomplete information, Ref. [32] integrated map construction with policy learning, enabling UAVs to iteratively update environmental cognition and dynamically adapt strategies during search. Furthermore, researchers have proposed various enhancements in algorithmic structures and optimization processes to further improve training stability and efficiency [33,34,35,36]. More recent studies have introduced MASAC [37] to enhance learning stability and generalization in UAV swarm decision-making under incomplete information, and HAPPO [38] to improve coordination and optimization efficiency in heterogeneous multi-agent environments.
Despite the great potential of RL in this domain, its training process remains constrained by the sparse-reward problem [39]. Before the discovery of targets, agents often fail to obtain effective reward signals, leading to inefficient policy optimization. To address this challenge, researchers have introduced reward shaping (RS) techniques [40], which leverage prior knowledge to design auxiliary rewards that improve training efficiency and policy quality. Among them, potential-based reward shaping (PBRS) [41] has been widely adopted due to its provable policy invariance. Various studies have attempted to design diverse potential functions to enhance learning across different tasks. For example, energy-aware shaping functions have been used to improve UAV emergency communication efficiency [42]; position-constrained functions have been applied to optimize multi-agent assembly tasks [43]; linearly weighted multi-potential fusion has been employed to enhance single-agent adaptability [44]; and dynamic adjustment of shaping reward magnitudes has been proposed to improve stage-wise training adaptability [45]. However, in multi-agent scenarios, particularly in multi-UAV dynamic target search, systematic studies on how to effectively integrate multiple potential functions for reward shaping are still lacking.
To address these challenges, we propose a Multi-Potential-Field Fusion Reward Shaping MAPPO (MPRS-MAPPO). Building upon sparse primary rewards, this method designs three semantically meaningful potential functions that serve as shaping signals. Moreover, an adaptive fusion weight mechanism is introduced to adjust the weights of the different potential functions according to their correlation with advantage values, thereby mitigating interference among the potentials.
Table 1 compares the proposed approach with existing methods across four dimensions: multi-UAV applicability, dynamic target handling, reward shaping, and potential-field fusion. It can be observed that this work is the first to integrate all four key characteristics into a unified algorithmic framework.
In summary, the main contributions of this work are as follows:
We designed three semantically distinct potential field functions: Probability Edge Potential Field, Maximum Probability Potential Field, and Coverage Probability Sum Potential Field, which provide shaping signals from the perspectives of local prior information, global optimal prediction, and swarm-level coordination.
We developed an Adaptive Fusion Weight Mechanism that adaptively adjusts the weights of potential functions based on their correlation with advantage values, reducing interference among multiple potentials and enabling stable and efficient training convergence.
We proposed the MPRS-MAPPO algorithmic framework, which introduces a warm-up phase followed by a multi-potential field fusion reward shaping mechanism to address the sparse-reward challenge, thereby improving the learning efficiency and cooperation of agents in dynamic target search tasks.
Finally, extensive experiments were conducted on a custom-built multi-UAV simulation platform to validate the superiority of the proposed method in terms of training efficiency, policy coordination, and search performance.
The remainder of this paper is organized as follows.
Section 2 formulates the mathematical model of the multi-UAV dynamic target search problem.
Section 3 presents the proposed MPRS-MAPPO algorithm in detail.
Section 4 describes the experimental design and evaluation results.
Section 5 concludes the paper and outlines future research directions.
2. System Modeling
This paper investigates the cooperative search for dynamic ground targets by multiple fixed-wing UAVs. As shown in Figure 1, the task area is a bounded ground region. Each UAV is equipped with a sensor that has a limited field of view, whose range is represented by a yellow cone. The ground target and its trajectory are indicated by green icons; the target's motion is non-deterministic, and it may maneuver out of the task area. Although the UAVs obtain the prior initial position of the target, the target's actual position continues to change due to its maneuverability. Consequently, a cooperative search by multiple UAVs is required to increase the target detection probability. The mission objective is to maximize the number of detected targets within the specified search duration.
Based on the task scenario, this section defines the environmental, target motion, UAV motion, and sensor models that provide the basis for the subsequent method research.
2.1. Environment Model
The search environment is modeled as a two-dimensional grid map, where the probability of a target existing at each grid position changes dynamically at each time step. This dynamic likelihood is represented by the target probability distribution map, which can be expressed as follows:
The sum of the existence probabilities of the target at each position in the area is defined as the overall existence probability of the target, and its calculation formula is as follows:
When this overall existence probability equals 1, the target is still within the search area; a value less than 1 indicates that the target may have escaped from the area.
The UAV moves within the area, and its onboard sensor senses a sub-area at each time step. Each sensing produces a binary output, where one value indicates that a target is detected and the other indicates that no target is detected. When the UAV position and the target position coincide, the probability of sensing the target is the detection probability; when they do not coincide, the probability of falsely reporting a target is the false alarm probability. The details are as follows:
According to Bayesian theory and the observation at the current time step, the posterior probability of each sub-area in the target probability distribution map can be updated. Since target motion changes the overall existence probability, this must be taken into account when computing the posterior. Based on the probability distribution at the previous time step, the distribution at the current time step is updated as:
If there are multiple targets in the area, each target corresponds to its own target probability map, which represents the probability distribution of that target's position in the area.
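Since the update equation itself is not reproduced above, the following minimal Python sketch illustrates a standard Bayesian measurement update consistent with this description; the function name, array layout, and the treatment of the out-of-area probability mass are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def bayes_update(prob_map, uav_cell, detected, p_d=0.95, p_f=0.05):
    """Measurement update of a single-target probability map.

    prob_map : 2-D array, prior P(x, y); its sum may be below 1 when the
               target may already have left the search area.
    uav_cell : (row, col) cell currently observed by the sensor.
    detected : bool, sensor output for this observation.
    """
    # Likelihood of the observation under every cell hypothesis.
    likelihood = np.full_like(prob_map, p_f if detected else 1.0 - p_f)
    likelihood[uav_cell] = p_d if detected else 1.0 - p_d

    posterior = prob_map * likelihood
    # Total evidence includes the hypothesis that the target is outside
    # the area, in which case only a false alarm can trigger a detection.
    outside_mass = max(0.0, 1.0 - prob_map.sum())
    evidence = posterior.sum() + outside_mass * (p_f if detected else 1.0 - p_f)
    return posterior / evidence
```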
2.2. Target and UAV Motion Models
2.2.1. Target Motion Model
In this paper, multiple dynamic targets are considered. For each target, only its initial position is known, while its subsequent positions evolve randomly over time. The behavior of an individual target can be modeled as a Markov process, and its probability distribution evolves accordingly. For a single target, the evolution is given by:
In this equation, the two position variables denote the target positions at consecutive time steps, the transition kernel gives the probability of moving from one position to another, and the reachable set contains the neighboring grid cells that can be reached from a given cell in a single time step. In the absence of prior information on the target's kinematics, the transition probability is assumed to follow a uniform distribution over this reachable set. As the complexity of the target motion model increases, the position probability distribution becomes more dispersed and irregular, which makes it more difficult to discover targets within a limited time. This increased complexity also affects the convergence and stability of the cooperative search strategy. Therefore, the selection of an appropriate motion model should comprehensively consider both the characteristics of targets in specific scenarios and the computational efficiency of probabilistic inference. This paper assumes the target has a tendency to move away from its initial position. Consequently, its probability distribution spreads outwards over time, forming an annular region of high probability.
2.2.2. UAV Motion Model
This paper considers a swarm of homogeneous fixed-wing UAVs, represented as:
Each UAV can autonomously adjust its heading and speed and fly at a different altitude to avoid collisions. For convenience of modeling and simulation, each UAV is simplified to a particle model with heading constraints in a two-dimensional plane, and its state is represented by a three-dimensional vector consisting of its planar position coordinates and its yaw angle. The continuous-time motion model of the UAV is:
Discretizing this model with a fixed sampling time yields the following discrete-time model:
To reflect the kinematic constraints of fixed-wing UAVs, this paper limits the yaw change within each time step. The UAV's speed is normalized such that, in one time step, it moves one grid unit for axial movements and √2 grid units for diagonal movements.
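A minimal sketch of the resulting discrete kinematic step is shown below; the specific yaw increments (multiples of 45°) and the action names are assumptions consistent with the five-action model described later in Section 3.2.

```python
import math

# Allowed per-step turns in degrees; the exact increments are assumptions.
TURNS_DEG = {"straight": 0, "left_small": 45, "left_large": 90,
             "right_small": -45, "right_large": -90}

def uav_step(x, y, yaw_deg, action):
    """Advance the UAV by one grid step after applying the selected turn."""
    yaw_deg = (yaw_deg + TURNS_DEG[action]) % 360
    dx = round(math.cos(math.radians(yaw_deg)))   # -1, 0 or +1 per axis
    dy = round(math.sin(math.radians(yaw_deg)))
    return x + dx, y + dy, yaw_deg
```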
2.3. Sensor Model
To search for ground targets, each UAV is equipped with a sensor with detection capability. Considering the errors of real sensors, detection probability and false alarm probability models are established, and a target recognition and decision method based on the sensor model is designed.
For the detection probability model, let the positions of the UAV and of the point to be detected be given; the detection probability then decays exponentially with the distance between them, as defined below:
where the quantities involved are the Euclidean distance between the target and the sensor, the maximum detection probability attained when the target lies at the center of the sensor's field of view, and the attenuation coefficient; the smaller the attenuation coefficient, the faster the detection probability decreases.
The farther a point is from the airborne sensor, the lower the detection probability; this trend is shown in Figure 2. In the simulation, a minimum effective detection probability is set to define the effective detection range of the sensor.
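Since the exact expression is not reproduced above, the sketch below shows one plausible exponential-decay form consistent with the description; the e^(-d/λ) form, the parameter names, and the default values are assumptions.

```python
import math

def detection_probability(d, p_max=0.95, lam=1.0, p_min=0.05):
    """Exponentially decaying detection probability (illustrative form).

    d     : Euclidean distance between the sensor and the evaluated point
    p_max : detection probability at the centre of the field of view
    lam   : attenuation coefficient (smaller lam -> faster decay)
    p_min : minimum effective detection probability defining the sensor range
    """
    p = p_max * math.exp(-d / lam)
    return p if p >= p_min else 0.0   # outside the effective detection range
```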
The false alarm probability is the probability that the sensor incorrectly reports a target at a position where none is present; that is, the sensor declares a target although no target exists there.
From the above analysis, the presence of a target at a specific position is a binary hypothesis: either no target is present or a target is present. The sensor output is correspondingly a Bernoulli variable whose value indicates whether a target is detected.
According to Bayes' theorem, given a prior probability of target presence, the posterior probability after receiving an observation can be computed from the detection probability, the false alarm probability, and the prior, as follows:
When the sensor reports a detection, the presence of a target at the point is judged by comparing the posterior probability with a preset threshold:
where the threshold is the posterior probability threshold, which determines the decision value. A decision value of 1 indicates that a target is present at the position, and a value of 0 indicates that no target is present.
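For reference, a minimal sketch of the Bayesian posterior update and threshold decision described above is given below; the function names and the threshold value are illustrative.

```python
def posterior_target_probability(prior, observed, p_d=0.95, p_f=0.05):
    """Posterior probability of target presence at the observed cell,
    obtained from Bayes' theorem for a Bernoulli detection model."""
    if observed:          # sensor reports a detection
        lik_target, lik_empty = p_d, p_f
    else:                 # sensor reports no detection
        lik_target, lik_empty = 1.0 - p_d, 1.0 - p_f
    evidence = lik_target * prior + lik_empty * (1.0 - prior)
    return lik_target * prior / evidence

def declare_target(prior, p_d=0.95, p_f=0.05, threshold=0.9):
    """Return 1 (target present) if the posterior after a positive
    observation exceeds the threshold, otherwise 0 (threshold assumed)."""
    return int(posterior_target_probability(prior, True, p_d, p_f) >= threshold)
```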
3. Proposed Methodology
The problem of multi-UAV cooperative search for dynamic targets must account for target mobility and a finite task time, during which targets may escape the search area. The objective for the UAV cluster is to maximize the number of targets detected before they escape. Accordingly, the task objective function in this paper is formulated with two components: maximizing the number of targets detected within the mission timeframe and minimizing the sum of the initial detection times for all targets. The objective function is expressed as follows:
where the indicator function outputs 1 when a target is detected within the mission time and 0 otherwise, the detection time denotes the moment at which each target is first detected, and the mission is subject to a maximum time limit.
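For reference, one possible formalization of this two-part objective is sketched below; the symbols (number of targets N_T, detection times t_j, time limit T_max, trade-off weight λ) and the weighted-sum form are illustrative assumptions rather than the paper's exact formulation.

```latex
\max \; J =
  \sum_{j=1}^{N_T} \mathbb{1}\!\left(t_j \le T_{\max}\right)
  \;-\; \lambda \sum_{j=1}^{N_T} \min\!\left(t_j,\, T_{\max}\right),
  \qquad \lambda > 0
```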
Throughout the mission, targets move randomly, their positions evolving over time, with the possibility of exiting the search area. To model this dynamic behavior, the target position constraint is defined as:
The mission duration of each UAV is constrained by the maximum task time, and can be expressed as follows:
Each UAV is also subject to its own kinematic constraints; the rate of change of its heading angle is limited, which can be expressed as follows:
Finally, the position of each UAV must remain within the map, which can be expressed as follows:
In summary, the mathematical model of the multi-UAV cooperative search problem for dynamic targets can be formally expressed as follows. This model captures the key aspects of the task, including the need to maximize the total number of targets detected within the limited mission time, as well as to minimize the cumulative initial detection times for all targets. Additionally, it incorporates several practical constraints that each UAV must satisfy during operation. These include kinematic constraints such as the maximum allowable change in heading angle, spatial constraints ensuring that UAV positions remain within the mission area, temporal constraints limiting the mission duration, and dynamic constraints reflecting the random movement of targets over time. By integrating the objective function with these constraints, the multi-UAV cooperative search problem is fully formulated as:
This optimization problem is characterized by multiple objectives, numerous constraints, and stochastic target motion, making it intractable for traditional methods. To address this challenge, this paper proposes the MPRS-MAPPO algorithm, which solves the problem within the decentralized partially observable Markov decision process (Dec-POMDP) framework on the basis of the multi-agent proximal policy optimization (MAPPO) algorithm.
3.1. Dec-POMDP Formulation
According to the characteristics of the task, the process of multi-UAV cooperative search for dynamic targets is modeled as a Dec-POMDP [30]. In this model, the global state cannot be perceived by any single UAV; each UAV takes actions based on its own local observation. Since the target motion is unknown and each UAV obtains only partial information, the system is partially observable. The corresponding Dec-POMDP can be described by the following tuple:
where
- (1) the agent set contains all UAV agents;
- (2) the global state space describes the environment, and the current global state is an element of it;
- (3) the joint action space is the product of the individual agents' action spaces;
- (4) the state transition function gives the probability that the environment transitions to a new state from the current state under the given joint action of the agents;
- (5) the joint reward function outputs the reward value of each agent according to the current state of the environment and the joint action of the agents;
- (6) the joint observation space collects the observation spaces of the individual agents;
- (7) the local observation function determines the local observation obtained by each agent in a given state;
- (8) the discount factor is that of the Markov decision process.
At each time step, each agent obtains its local observation by applying its observation function to the global state, which it cannot fully perceive. Based on this observation, each agent generates an action according to its policy, and the actions of all agents constitute the joint action. The environment then transitions to a new state according to the current state, the joint action, and the state transition function, and outputs the reward of each agent according to the joint reward function. The goal of the agents is to maximize the cumulative discounted reward by optimizing the joint policy; the optimized policy equips the UAVs with the ability to cooperatively search for dynamic targets.
3.2. State Space and Action Space
Based on the Dec-POMDP framework, the UAVs are modeled as a multi-agent system that cooperatively searches for potential targets. The global state contains the position and heading information of all agents, as well as the probability distribution information of all targets:
where each agent is described by its grid position and heading, and each target is described by the set of grid positions where its existence probability is non-zero together with the set of corresponding probability values.
Due to the partial observability of the environment, each agent's local observation comprises only its own kinematic state and the local target probability map it maintains:
where the components of this map represent the agent's estimates of each target's probability region positions and of the corresponding probability values, respectively. In this implementation, the global state is constructed by aggregating all agents' local observations.
The discrete action space of each agent is defined as:
This set consists of five discrete actions:
Maintain the current heading and move forward one grid step.
Turn left by the smaller yaw increment and move forward one grid step.
Turn left by the larger yaw increment and move forward one grid step.
Turn right by the smaller yaw increment and move forward one grid step.
Turn right by the larger yaw increment and move forward one grid step.
This kinematic model is illustrated in Figure 3.
3.3. Fusion Reward Shaping
Potential-based reward shaping is a technique that transforms prior knowledge into reward signals to alleviate the sparse reward problem and enhance exploration efficiency in reinforcement learning. This method achieves such shaping by defining a potential field, which provides additional intermediate reward signals on top of the sparse base rewards. Its mathematical form is:
where the potential function assigns a scalar potential to each state, the discount factor is that of the underlying decision process, and the shaped reward is obtained by adding the shaping term to the original reward. This method theoretically guarantees the consistency of the optimal policy before and after reward shaping. For reinforcement learning, designing a reasonable potential function is crucial for effective reward shaping.
Building on the above theory, in the dynamic target search scenario the total reward is defined as:
where the final reward after shaping is the sum of the base reward and the shaping reward.
Specifically, building on the base environmental reward, this paper introduces three types of potential field functions: a Probability Edge Potential Field, a Maximum Probability Potential Field, and a Coverage Probability Sum Potential Field. We also introduce an Adaptive Fusion Weight Mechanism to dynamically adjust their fusion weights. This mechanism allows the system to automatically identify and prioritize the most effective potential functions during training, thereby enhancing both learning efficiency and final policy quality.
The shaping reward is defined as a weighted combination of the potential-based shaping terms:
where each weight is the adaptive weight of the corresponding potential function.
In the following, we will elaborate on the base reward, the potential field functions, and the adaptive fusion weight mechanism in detail.
3.3.1. Base Reward
The base reward is composed of two components, as defined in the following equation:
The first component is a discovery reward, which incentivizes agents to find new targets. It provides a reward of +50 upon the discovery of a new target at a given timestep, and 0 otherwise, and can be expressed as follows:
The second component is a prior probability reward, designed to encourage exploration of high-probability areas. For visiting a grid cell, the agent receives a reward equal to 20 times the cell's prior probability, and can be expressed as follows:
The base reward encourages the agent to explore high-probability areas to find more targets, but the reward signal is relatively sparse due to the limited probability distribution.
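A minimal sketch of this base reward, using the +50 discovery bonus and the factor of 20 stated above (function and argument names are illustrative), is:

```python
def base_reward(new_target_found, cell_prior_prob,
                discovery_bonus=50.0, prior_scale=20.0):
    """Sparse base reward: a fixed bonus for each newly discovered target
    plus a term proportional to the prior probability of the visited cell."""
    r_discovery = discovery_bonus if new_target_found else 0.0
    r_prior = prior_scale * cell_prior_prob
    return r_discovery + r_prior
```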
3.3.2. Multiple Potential Fields
To realize the multi-potential field fusion reward shaping mechanism, this paper proposes three complementary potential field functions from both local and global perspectives: the Probability Edge Potential Field, the Maximum Probability Potential Field, and the Coverage Probability Sum Potential Field.
- (1) Probability Edge Potential Field
This potential field is defined from the perspective of local prior information. It is designed to guide agents toward the edge regions of the target probability distribution. In this way, agents obtain higher potential energy when approaching the probability boundary, thereby encouraging exploration around uncertain areas and improving search efficiency, as shown in Figure 4.
The potential field is defined as follows. Each agent occupies a grid position, and each target has a probability region consisting of all grid positions where its existence probability is non-zero, together with the corresponding probability values. The minimum Euclidean distance from an agent to the edge of a target's probability region is first computed, and the probability edge potential energy of that agent is then obtained from this distance, scaled by a factor that controls the range of the potential field values.
The global probability edge potential energy is obtained by averaging the potential energies of all agents and passing the result through a Sigmoid-like fuzzy function, whose parameters control the steepness of the curve and the location of its center point. According to this definition, positions near the edges of the probability region receive higher potential values, whereas positions farther from the edges receive lower values.
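A minimal Python sketch of this potential field is given below; the exponential distance shaping, the sigmoid parameters, and the function signature are assumptions, since the exact expressions are not reproduced above.

```python
import numpy as np

def edge_potential(agent_positions, target_prob_maps, kappa=1.0,
                   steepness=1.0, center=0.5):
    """Probability Edge Potential Field (illustrative sketch).

    agent_positions  : list of (row, col) agent cells
    target_prob_maps : list of 2-D probability maps, one per target
    kappa            : scale factor controlling the potential range
    """
    per_agent = []
    for (ar, ac) in agent_positions:
        dists = []
        for pmap in target_prob_maps:
            cells = np.argwhere(pmap > 0)        # target probability region
            if len(cells) == 0:
                continue
            # Distance to the nearest region cell; for an agent outside the
            # region this equals the distance to the region edge.
            d = np.min(np.linalg.norm(cells - np.array([ar, ac]), axis=1))
            dists.append(d)
        d_min = min(dists) if dists else 0.0
        per_agent.append(kappa * np.exp(-d_min))  # nearer the edge -> higher
    mean_phi = float(np.mean(per_agent))
    # Sigmoid-like fuzzy mapping of the swarm average into (0, 1).
    return 1.0 / (1.0 + np.exp(-steepness * (mean_phi - center)))
```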
- (2) Maximum Probability Potential Field
This potential field is constructed from the perspective of global optimal prediction.
It is designed to guide agents toward the local maxima of the probability distribution. Positions closer to the maximum probability values yield higher potential energy, directing agents to the most likely target locations and thus accelerating target detection, as shown in Figure 5.
The potential field is defined as follows. For each target, the grid position with the maximum existence probability is identified. The minimum distance from an agent to these maximum probability points is then computed, and the agent's local maximum probability potential energy is obtained from this distance.
Finally, the local maximum probability potential energy of all agents is first averaged to obtain a collective measure, and then this value is processed through a Sigmoid-like fuzzy mapping function to produce the global potential energy.
According to the definition, positions closer to the maximum probability point of each target have higher potential values, while positions farther away exhibit lower values.
- (3) Coverage Probability Sum Potential Field
This potential field is proposed from the perspective of swarm-level coordination. It is designed to guide agents toward regions with the highest global coverage probability. Positions with larger cumulative probabilities correspond to higher potential energy, as shown in
Figure 6.
The potential field is defined as follows. First, the sum of the existence probabilities of all targets at each grid position is calculated (using an indicator function over each target's probability region); this distribution is defined as the global coverage probability distribution. From this distribution, the grid position with the maximum coverage probability is identified. The distance from each agent to this maximum coverage position is then computed, giving the agent's local coverage probability potential energy. Finally, the global coverage probability sum potential field is obtained through the Sigmoid-like fuzzy mapping.
According to the definition, positions closer to the points with higher cumulative coverage probability have larger potential values.
3.3.3. Adaptive Fusion Weight Mechanism
To adaptively fuse the above potential fields, this paper designs an adaptive fusion weight mechanism based on the correlation between advantage values and potential field values. In this way, multiple potential fields are combined according to their contribution to the advantage, forming a weighted shaping reward. The shaping reward is calculated as:
where the sum runs over all potential fields, each weighted by its fusion weight. In this mechanism, the fusion weights of the potential field functions are modeled as random variables that obey a Dirichlet distribution and therefore satisfy the simplex constraints of non-negative weights summing to one [44]. By utilizing Bayesian updating, the parameter vector of the Dirichlet distribution is dynamically adjusted to achieve adaptive optimization of the weight distribution over the potential field functions. The parameter vector can be expressed as follows:
The update is based on the Pearson correlation coefficient between each potential energy value and the advantage value, so that potential field functions strongly and positively correlated with the advantage obtain higher weights and thus play a greater role in the reward shaping process. This mechanism is specifically designed for policy-optimization reinforcement learning methods.
During a warm-up phase, agents are trained using only the base reward. This promotes the initial convergence of the value network, which in turn improves the credibility of the advantage function estimates.
Following the warm-up, the weight adjustment phase begins. In this phase, agents interact with the environment to collect trajectory data during each episode. For each timestep in a trajectory, a data tuple is recorded containing the global state (constructed in this implementation by aggregating all agents' local observations), the joint action, the advantage value, and the potential energy value of each potential field function under that state; the trajectory has a fixed length.
For each potential field function, the corresponding pairs of advantage values and potential energy values are extracted, and the following steps are performed. The mean of the potential energy values is calculated as:
The mean of the advantage values is calculated as:
The Pearson correlation coefficient between the potential energy values and the advantage values is calculated as:
The correlation coefficients are then truncated and normalized, which can be expressed as follows:
The Dirichlet distribution parameters are updated using the normalized coefficients, which can be expressed as follows:
where the learning rate controls the update of the Dirichlet parameters. After the update, the mean of the Dirichlet distribution is used as the new weight of each potential field function, with the calculation formula:
In summary, the proposed mechanism enables adaptive adjustment of fusion weights, ensuring a balanced contribution of different potential fields to the overall shaping reward. The Adaptive Fusion Weight Mechanism is outlined in Algorithm 1.
| Algorithm 1. Adaptive Fusion Weight Mechanism. |
| 1: Input: number of potential fields, learning rate, trajectory length |
| 2: Initialize: Dirichlet parameters uniformly |
| 3: Warm-up Phase: |
| Agents are trained using only base reward to stabilize value estimation |
| 4: Weight Adjustment Phase: |
| 5: for each episode do |
| 6: Collect trajectory |
| 7: for each potential field do |
| 8: Compute mean potential energy: |
| 9: Compute mean advantage: |
| 10: Compute Pearson correlation coefficient: |
| 11: Truncate and normalize: |
| 12: end for |
| 13: Normalize correlations: |
| 14: for each potential field do |
| 15: Update Dirichlet parameter: |
| 16: end for |
| 17: for each potential field do |
| 18: Compute new weight: |
| 19: end for |
| 20: end for |
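The following minimal Python sketch illustrates the weight-adjustment step of Algorithm 1; the exact truncation and normalization rules and the function signature are assumptions.

```python
import numpy as np

def update_fusion_weights(alpha, advantages, potentials, lr=0.1, eps=1e-8):
    """One adaptive-fusion update: Pearson correlation -> Dirichlet update.

    alpha      : array [F], current Dirichlet parameters
    advantages : array [T], advantage values over one trajectory
    potentials : array [F, T], potential values of each field per timestep
    """
    F = alpha.shape[0]
    corr = np.zeros(F)
    a_c = advantages - advantages.mean()
    for f in range(F):
        p_c = potentials[f] - potentials[f].mean()
        denom = np.sqrt((p_c ** 2).sum() * (a_c ** 2).sum()) + eps
        corr[f] = (p_c * a_c).sum() / denom       # Pearson correlation
    corr = np.clip(corr, 0.0, None)               # truncate negative values
    norm = corr / (corr.sum() + eps)              # normalise
    alpha = alpha + lr * norm                     # Dirichlet parameter update
    weights = alpha / alpha.sum()                 # Dirichlet mean as new weights
    return alpha, weights
```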
3.4. MPRS-MAPPO Algorithm
Based on the above potential fields and fusion mechanism, we propose the MPRS-MAPPO algorithm. The algorithm introduces a multi-potential field fusion reward shaping mechanism on the basis of the standard MAPPO framework [26] and adopts a centralized training and decentralized execution (CTDE) architecture. During training, each agent collects experience tuples consisting of states, actions, and shaped rewards, which are stored in a shared replay buffer. During centralized training, the global state is used by the value network and for reward shaping to stabilize learning. During decentralized execution, each UAV makes decisions based only on its local observation. The fusion weights of the potential fields are updated adaptively based on their correlation with advantage values, ensuring that more informative potentials have a stronger influence on learning. This allows agents to efficiently explore the environment while maintaining coordinated behavior.
Its core innovation lies in using the multi-potential field fusion module to correct the original environment reward in real time and guide the agent to learn the cooperative policy. Meanwhile, the proposed mechanism is fully compatible with the existing MAPPO algorithm based on policy optimization. The framework of the algorithm is shown in
Figure 7.
The key steps of the centralized training of the algorithm are as follows:
- (1) Action and Environment Sampling
The MPRS-MAPPO framework inherits the policy optimization approach of MAPPO, in which each agent has an independent policy network and all agents share a value network. At each timestep, multiple parallel environments provide a set of local observations. Each agent then samples an action from its respective policy network, forming the joint action set. After executing this joint action, the environment returns the next observations and the base reward.
- (2) Reward Shaping
The multi-potential field fusion shaping module shapes the original reward to obtain the shaping reward, with the fusion weights computed using Algorithm 1:
The final reward is the sum of the base reward and the shaping reward. The experience data are stored in the experience replay buffer. Here, the three potential field functions are all computed from the global state to ensure that the shaping reflects the cooperative search context.
In the reward shaping stage, the global state, which aggregates all agents' local observations, is used to compute the potential field values. This allows the shaping module to consider the overall spatial distribution of UAVs and targets, thereby maintaining consistent cooperative guidance.
- (3) Value Network Update
The shared value network is updated by minimizing the following loss function, in which the global state is used as input to estimate the overall expected return of the multi-UAV system:
where the regression target is the cumulative discounted return.
During centralized training, the value network receives the global state as input, allowing it to estimate the overall expected return of the multi-agent system. This enables each agent’s policy to be optimized with respect to the joint environment dynamics.
- (4) Policy Network Update
Each agent’s policy network uses only its local observation to select actions, ensuring decentralized execution. However, during training, policy optimization is performed using advantage estimates derived from the global value function, thereby integrating centralized information into decentralized learning.
The objective function of the policy update is:
where the importance sampling ratio compares the new policy with the old one, the clipping operation limits the policy update range, and the advantage is computed via generalized advantage estimation (GAE), as follows:
where the discount factor is that of the reinforcement learning problem, the GAE parameter controls the bias-variance trade-off of the estimator, and the temporal difference (TD) error of each agent at each time step is computed from the value network.
In the distributed execution phase of the algorithm, each agent relies on its own local observation and the converged policy network to make independent decisions, realizing completely decentralized control, as expressed in the following:
The key to this architecture is the design of three complementary potential functions, each with a distinct focus. These are integrated via the Adaptive Fusion Weight Mechanism, which balances their respective contributions during training. The adaptive mechanism dynamically adjusts the fusion weights according to their correlation with the advantage values, ensuring that more informative potential fields have a stronger influence at each training stage. By assigning greater weight to functions that are more beneficial for learning, this method improves both training efficiency and the quality of the final converged policy. The complete MPRS-MAPPO algorithm is outlined in Algorithm 2.
| Algorithm 2. MPRS-MAPPO. |
| 1: Notations: |
| : policy network parameters (actor) |
| : value network parameters (critic) |
| : Dirichlet distribution parameters (from Algorithm 1) |
| : experience buffer |
| : warm-up stage length, : episode horizon, : update iterations |
| : learning rates (for the policy network, value network, and Dirichlet parameters) |
| : potential field functions |
| : adaptive fusion weight of field |
| : base, shaping, and total rewards |
| 2: Initialize: , , (uniform), , |
| 3: while policy not converged do |
| 4: |
| 5: for each timestep t in the episode do |
| 6: Each agent : sample |
| 7: (, ) ← |
| 8: if episode ≤ warm-up stage length then Warm-up |
| 9: |
| 10: else Fusion Reward Shaping |
| 11: |
| 12: |
| 13: end if |
| 14: |
| 15: end for |
| 16: if episode > warm-up stage length then Adaptive Fusion Weights |
| 17: Call Algorithm 1 to update and compute |
| 18: end if |
| 19: for each update iteration do Policy and Value Updates |
| 20: |
| 21: |
| 22: |
| 23: |
| 24: end for |
| 25: end while |
4. Results and Discussion
To evaluate the performance of the proposed MPRS-MAPPO algorithm in multi-UAV cooperative search tasks, we designed representative scenarios in which multiple UAVs search for multiple dynamic targets. Systematic experiments were conducted using baseline algorithms and multiple evaluation metrics, and ablation studies were further performed to analyze the performance contributions of individual components. The experiments include both simulation tests and physical flight experiments. The experimental setup and results are presented as follows.
4.1. Experimental Environment and Parameter Settings
The experiments were conducted on a high-performance workstation (Intel i7-13700KF, 32 GB RAM, NVIDIA RTX 3060Ti), where a multi-UAV cooperative search simulation environment was built using PyTorch (version 2.3.0) and OpenAI Gym (version 0.20.0).
As shown in
Figure 8, the schematic of the simulation environment is organized into three layers: the top layer displays the current positions and trajectories of the UAVs, the middle layer presents the probabilistic estimates of undiscovered target locations, and the bottom layer shows the positions of detected targets. This simulation environment integrates UAV agents, dynamic targets, and probability distributions derived from sensor observations.
The simulation area was defined as a 15 km × 15 km discrete grid (step size: 1 km). In the representative 4v6 scenario, four UAVs were initialized at random positions to search for six dynamic targets. Each UAV was modeled as a fixed-wing aircraft with a speed of 60 m/s, capable of performing forward, left-turn, and right-turn maneuvers. Collision avoidance was achieved by altitude separation (0.9–1.1 km). The onboard sensor had a coverage of 1 km × 1 km, a detection probability of 0.95, and a false alarm probability of 0.05. Targets moved randomly at approximately 20 m/s and could potentially leave the area.
The main experimental parameters for the UAV search task are summarized in
Table 2. The environment size, UAV/target numbers, and their speeds define task dynamics, while detection and false alarm probabilities reflect sensing reliability. For training, episode length, learning rates, discount factor, GAE parameter, PPO clipping, and warm-up rollouts are specified.
4.2. Algorithm Effectiveness Analysis
To verify the effectiveness of the proposed algorithm, detailed analyses were first conducted under the standard experimental configuration of 4 UAVs and 6 targets (4v6). Experiments were conducted with 10 different random seeds. The evaluation metrics include the convergence trends of return and target detection rate, the dynamic variations in the three potential field weights, and the UAV search trajectories after training.
During training, the global return and target detection rate of MPRS-MAPPO both exhibited stable convergence, as shown in
Figure 9. Specifically, the return increased from 232.47 to 451.43, while the target detection rate rose from 41.52% to 78.89%, demonstrating the algorithm’s strong optimization capability and convergence properties.
As illustrated in
Figure 10, during the warm-up phase, the weights of the three potential field functions were evenly initialized at 33.33%. After the warm-up, the weight of the Maximum Probability Potential Field
rapidly increased to 48.36%, then decreased and stabilized at 42.14%, showing a “rise-then-fall” trend. In contrast, the weights of the Probability Edge Potential Field
and the Coverage Probability Sum Potential Field
initially dropped to 23.86% and 27.61%, respectively, and later gradually increased, stabilizing at 27.38% and 29.47%. This dynamic adaptation validates the algorithm’s capability for multi-potential field fusion.
Figure 11 illustrates the execution process of the UAVs. Red aircraft represent the agents, and green dots denote the true target positions (visible only initially or upon discovery).
At the beginning, targets were randomly distributed. By step 8, the UAVs had detected three targets and gradually converged toward high-probability areas; by step 11, two additional targets were detected; and by step 26, all targets were successfully located. The trajectories demonstrate that the UAVs progressively shifted from concentrated search to distributed coverage, thereby improving overall efficiency.
For clarity, the discovery time of each target is annotated at its upper right corner, while the movement directions of the UAVs are indicated along their respective trajectories.
In terms of real-time performance, the average decision-making time per step was approximately 0.015 s, indicating that the proposed method can meet real-time requirements in online multi-UAV cooperative search scenarios.
To further evaluate the robustness and scalability of the proposed algorithm, comparative experiments were conducted under different task scales, including 2 UAVs vs. 3 targets (2v3), 4 UAVs vs. 6 targets (4v6), and 6 UAVs vs. 9 targets (6v9). The 6v9 scenario involved a more complex case where several targets were initialized in close proximity. The convergence curves of global return under different task scales are shown in
Figure 12. The results indicate that the algorithm achieved effective convergence across all configurations, with slightly slower convergence as task complexity increased. Since the number of targets differs among scenarios, the return curves are distributed at different height levels accordingly.
4.3. Comparison with Existing Methods
To comprehensively evaluate the overall performance of MPRS-MAPPO, four representative methods were selected as baselines: three multi-agent reinforcement learning algorithms, MAPPO, MASAC and QMIX, and a heuristic coverage-based search strategy, Scanline. The evaluation metrics included global return curve trends, converged return values, target detection rate, and the rolling standard deviation of the return curves.
MAPPO [
26] is a multi-agent policy optimization algorithm that performs centralized training with decentralized execution and uses a clipped surrogate objective to stabilize policy updates. It effectively balances policy improvement and variance control, making it a widely adopted baseline in cooperative MARL tasks.
MASAC [
37] is an off-policy, entropy-regularized algorithm that enhances the exploration and stability of multi-agent systems, particularly in environments with incomplete information.
QMIX [
25] is a value-based multi-agent reinforcement learning algorithm that decomposes the joint action-value function into individual agent contributions while ensuring monotonicity.
Scanline [
10] is a heuristic coverage-based search method that directs UAVs along pre-defined sweeping paths to maximize area coverage and target detection. Although simple and easy to implement, it lacks learning ability and adaptability in dynamic or partially observable environments.
As shown in
Figure 13, MPRS-MAPPO outperformed the other algorithms in terms of both convergence speed and stability. While MAPPO and QMIX improved policy performance to some extent, MAPPO suffered from slower convergence, and QMIX exhibited significant fluctuations. MASAC converged faster than MAPPO and reached a similar steady return, with variance larger than MAPPO but smaller than QMIX. As a heuristic method, Scanline oscillated around its mean return value without demonstrating learning or optimization capabilities.
To further analyze behavioral characteristics, the representative search trajectories were visualized, as shown in
Figure 14.
Only representative trajectory examples are shown here for clarity; the MASAC trajectories are not displayed due to their similarity in overall pattern to MAPPO. MPRS-MAPPO successfully detected all six targets within only 26 steps, demonstrating a “first concentration, then dispersion” cooperative pattern: initially focusing on high-probability areas, followed by dynamic task allocation to cover wider areas. This reflects advantages in shorter paths and balanced workload distribution. In contrast, MAPPO detected five targets within 36 steps but suffered from uneven workload allocation and missed detections in the later stage. QMIX required 45 steps to detect only four targets, with trajectories biased toward one side and limited utilization of probabilistic information. Scanline detected only three targets after 64 steps due to its static coverage strategy, resulting in the lowest efficiency.
The quantitative comparison is summarized in
Table 3. MPRS-MAPPO achieved significantly higher converged return values and detection rates than the baseline methods, while also maintaining the lowest return standard deviation, indicating superior search efficiency and more stable training performance. In this paper, “training uncertainty” refers to the variability of the cumulative returns during training, which is quantitatively measured by the standard deviation of the returns shown in
Table 3.
In summary, MPRS-MAPPO demonstrated clear advantages in multi-UAV cooperative search tasks. It outperformed the baseline methods in terms of convergence speed, target detection rate, and training stability (Return Std), thereby verifying the effectiveness of the adaptive fusion weight mechanism in complex cooperative scenarios. Compared to MAPPO, MASAC, QMIX, and Scanline, MPRS-MAPPO improved target detection rates by 7.87%, 12.06%, 17.35%, and 29.76%, respectively, while reducing training uncertainty by 7.43%, 47.13%, 53.36%, and 56.29%. Future work will explore integrating MHT [
46] or PHD [
47] tracking to enhance target continuity.
4.4. Ablation Studies
To comprehensively evaluate the MPRS-MAPPO algorithm, two groups of ablation experiments were conducted: one to assess the contribution of each algorithmic component and the other to analyze the cooperative behavior among agents. The first group examines the impact of reward shaping, multi-potential-field fusion, and the warm-up stage on learning performance. The second group explores how partial loss of cooperation affects system efficiency by fixing some UAVs to random policies.
To validate the effectiveness of each component in MPRS-MAPPO, we designed ablation experiments focusing on four aspects:
The effectiveness of reward shaping;
Whether multi-potential fields outperform single-potential fields;
Whether dynamic fusion is superior to fixed fusion;
The role of the warm-up stage.
The evaluation metrics included the convergence speed, converged return value, and training stability of the return curves. All methods were implemented on the MAPPO framework under the same environment as in the comparative experiments. As shown in
Table 4, the descriptions of the ablation study methods are summarized.
The training return curves of different methods are shown in
Figure 15. Overall, MPRS-MAPPO consistently achieved the highest returns with the fastest convergence speed, demonstrating that multi-potential-field fusion and the warm-up stage significantly facilitate policy learning. NoWarmup and Multi-PF-Fixed ranked second and third, respectively, indicating that the warm-up stage further improves training performance and that dynamic fusion is superior to fixed fusion. The single-potential-field methods and the baseline showed much lower returns.
Statistical results of the ablation study are summarized in
Table 5.
During convergence, MPRS-MAPPO achieved the highest mean return (an improvement of 11.58%) and the lowest return standard deviation (a reduction of 7.43%), demonstrating superior performance and stability. By contrast, single-potential-field methods provided limited improvements and, in some cases, even introduced instability.
In summary, the adaptive fusion weight mechanism contributed a 5.54% improvement in return; incorporating the multi-potential-field fusion structure raised the improvement to 9.69%; and, with the addition of the warm-up stage, the overall return improvement reached 11.58%. These results convincingly demonstrate the effectiveness and complementarity of reward shaping, multi-potential-field fusion, and the warm-up stage in complex cooperative search tasks.
To further evaluate the impact of inter-agent cooperation, additional experiments were conducted in which a portion of UAVs were fixed to perform random actions, while the remaining UAVs maintained cooperative policy execution.
The results are summarized in
Table 6. As the number of random (non-cooperative) UAVs increased, the overall performance gradually declined. The global return decreased from 451.43 (4 cooperative UAVs) to 303.47 (only 1 cooperative UAV), while the detection rate dropped from 78.89% to 54.09%. Meanwhile, the average episode length increased, reflecting slower mission completion and lower search efficiency due to weakened cooperation.
These results clearly indicate that multi-UAV cooperation plays a critical role in achieving efficient target search and high detection performance.
4.5. Physical Experiments
To validate the effectiveness of the proposed algorithm on a real UAV platform, a physical dynamic target search experiment was conducted under laboratory conditions. The setup involved a quadrotor UAV searching for an unknown dynamic target. The UAV was equipped with a visible-light sensor for visual target detection, while the dynamic target was a ground vehicle. Only the target’s initial position was known to the UAV before takeoff. After the experiment started, the vehicle began to move randomly, and the UAV autonomously navigated toward the probabilistic region to search for the target based on the proposed MPRS-MAPPO algorithm. The main experimental parameters are listed in
Table 7.
The experimental field setup is shown in
Figure 16, where the UAV and dynamic target were deployed in an open area. At the start of the experiment, the target vehicle began to move randomly within the designated region, while the UAV executed the proposed search policy.
The results of the physical test are illustrated in
Figure 17. From left to right, the purple time axis represents the progression of the experiment.
The experimental states at each moment are shown in the dashed boxes aligned with the timeline, including the Field View (real-world scene), Sensor View (onboard camera image), and Ground Station Interface (real-time monitoring).
Search Start (t1 = 37.23 s): The UAV takes off and begins its search. The target is still far away; thus, no vehicle appears in the Field View or Sensor View. The Ground Station Interface shows the UAV and target at their initial positions, separated by a large distance.
Mid Process (t2 = 39.81 s): Both the UAV and target have moved, and the UAV is approaching the high-probability region. However, the target has not yet entered the sensor’s field of view, and no detection is achieved at this stage.
Search Success (t3 = 41.62 s): The UAV maneuvers above the target, and the vehicle becomes visible in the Sensor View. The search is completed successfully, and the task is terminated.
The experimental results confirm that the proposed MPRS-MAPPO algorithm can operate on a real UAV platform and autonomously locate dynamic targets in an outdoor environment. The UAV successfully completed the search mission with accurate navigation and real-time target detection, demonstrating the algorithm’s robustness, real-world feasibility, and potential for practical deployment.
5. Conclusions
This paper addresses the challenges of sparse rewards, unstable training, and inefficient convergence of cooperative strategies in multi-UAV cooperative search for dynamic targets. We propose MPRS-MAPPO, a multi-agent reinforcement learning algorithm that incorporates a multi-potential field fusion reward shaping mechanism. This method is designed for dynamic target search scenarios. It integrates three complementary potential field functions: Probability Edge Potential Field, Maximum Probability Potential Field, and Coverage Probability Sum Potential Field. An Adaptive Fusion Weight Mechanism is adopted for dynamic weight adjustment. In addition, a warm-up stage is introduced to mitigate early-stage misguidance. These designs effectively improve both policy learning efficiency and multi-agent cooperation capability.
Compared with MAPPO, MASAC, QMIX, and the heuristic Scanline method, MPRS-MAPPO achieved notable improvements in convergence speed, global return, and detection rate (by 7.87–29.76%), while reducing training uncertainty by 7.43–56.29%. Ablation studies confirmed that reward shaping, multi-potential-field fusion, and the warm-up stage jointly enhance learning efficiency. Cooperation ablation showed performance degradation when some UAVs used random policies, highlighting the importance of collaboration. Multi-scale (2v3, 4v6, 6v9) and physical flight experiments further verified the algorithm's scalability, robustness, and real-world applicability.
Future work will focus on extending the framework to more complex and realistic environments to further improve system robustness and practical applicability.