1. Introduction
Target tracking in dynamic environments requires drones to efficiently locate and track moving targets while avoiding obstacles. Traditional methods, such as the artificial potential field (APF) [1], are effective for path planning and obstacle avoidance [2] but are prone to local optima, which limits their effectiveness in dynamic settings [3]. Multi-agent reinforcement learning (MARL) methods [4], on the other hand, can optimize collaborative strategies but suffer from slow convergence and high computational cost [5], especially as the number of agents increases [6].
In response to these challenges, this paper introduces a novel framework that integrates behavior cloning with temporal difference learning (BCTD) and multi-agent proximal policy optimization (MAPPO), enhancing both learning efficiency and stability through pre-training and fine-tuning strategies [7]. To address the problems encountered in the multi-drone tracking task [8], the proposed multistage collaborative tracking framework integrates expert guidance, imitation learning (IL) [9] pre-training, and reinforcement learning (RL) [10] fine-tuning. First, the policy network is pre-trained via behavior cloning on expert data generated by the artificial potential field, enabling rapid acquisition of fundamental path planning and obstacle avoidance capabilities. Subsequently, the potential field weights are dynamically adjusted during reinforcement learning fine-tuning using the MAPPO [11] algorithm. A global reward signal, such as successful collaborative encirclement, is introduced to overcome local force balance and drive the UAVs to bypass obstacles and continuously approach the target.
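To illustrate the expert-guidance idea, the following minimal sketch shows one common way an APF expert can be implemented, combining an attractive pull toward the target with repulsive pushes from nearby obstacles. The gain names (`k_att`, `k_rep`), the influence distance `d0`, and the clipping to `a_max` are illustrative assumptions, not the exact design used in this paper.

```python
import numpy as np

def apf_acceleration(p_i, p_target, obstacles, k_att=1.0, k_rep=0.5, d0=2.0, a_max=1.0):
    """Illustrative APF expert: attractive pull toward the target plus
    repulsive pushes from obstacles within the influence distance d0."""
    # Attractive component: proportional to the vector toward the target.
    force = k_att * (p_target - p_i)
    # Repulsive components: active only inside the influence radius d0.
    for p_obs, r_obs in obstacles:
        diff = p_i - p_obs
        d = np.linalg.norm(diff) - r_obs
        if 0.0 < d < d0:
            force += k_rep * (1.0 / d - 1.0 / d0) / (d ** 2) * (diff / np.linalg.norm(diff))
    # Clip to the maximum acceleration allowed by the platform.
    norm = np.linalg.norm(force)
    return force if norm <= a_max else force * (a_max / norm)
```

In the pre-training stage, the continuous APF force can then be mapped onto the agent's discrete action set to serve as the expert action label.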
The MARL training process is often plagued by significant time and resource consumption. Traditional RL methods rely on agents exploring strategies solely through environmental interactions, necessitating numerous trial-and-error iterations to generate interaction data. This approach results in extended training periods and low sample efficiency. To address these challenges, this study employs BCTD to enhance the value network. This method enables drones to acquire skills in path planning and obstacle avoidance within dynamic environments, thereby streamlining subsequent agent fine-tuning processes.
A novel method is proposed in this study to enhance the operational efficiency of collaborative target tracking and encirclement tasks [12]. The key contributions are as follows:
(1) The integration of APF with IL. Drones acquire path planning and obstacle avoidance skills within the artificial potential field framework by initially imitating a pre-trained strategy based on APF principles, followed by fine-tuning through MARL to improve task performance.
(2) A methodology employing BCTD is introduced to concurrently optimize the policy and value networks. During the behavior cloning phase, the value network is refined using the temporal difference method so that drones learn path planning and obstacle avoidance in a dynamic environment, facilitating subsequent agent fine-tuning. Generalized advantage estimation (GAE) is used to correct the bias of the value function, prevent policy oscillation, shorten policy exploration, and improve both evaluation accuracy and training efficiency.
3. Problem Formulation
3.1. Multi-Robot Pursuit Problem
In this work, we consider a cooperative target tracking problem involving multiple UAV agents operating in a 2D environment with obstacles. The set of pursuers is denoted by $\mathcal{P} = \{1, \ldots, N\}$, and the target (evader) by $E$. Each agent $i \in \mathcal{P}$ can observe the relative positions of nearby agents, obstacles, and the evader, but not the global state. Unlike traditional first-order models, all pursuers in this work are modeled as second-order point-mass systems, where the control input is acceleration and the state includes both position and velocity. The continuous-time kinematic equations are
$$\dot{\mathbf{p}}_i = \mathbf{v}_i, \qquad \dot{\mathbf{v}}_i = \mathbf{a}_i,$$
where $\mathbf{p}_i$, $\mathbf{v}_i$, and $\mathbf{a}_i$ represent the position, velocity, and acceleration of pursuer $i$, respectively. In discrete time, the state transition is
$$\mathbf{p}_i(t+1) = \mathbf{p}_i(t) + \mathbf{v}_i(t)\,\Delta t, \qquad \mathbf{v}_i(t+1) = \mathbf{v}_i(t) + \mathbf{a}_i(t)\,\Delta t.$$
In this paper, the tracked target selects its acceleration direction according to the APF rule, and thus also follows second-order point-mass dynamics.
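For concreteness, a minimal sketch of the discrete-time update implied by this second-order point-mass model is given below; the time step `dt` and the clipping against a maximum speed `v_max` are illustrative assumptions.

```python
import numpy as np

def step_pursuer(p, v, a, dt=0.1, v_max=1.5):
    """One discrete-time update of a second-order point-mass pursuer
    (explicit Euler integration of the kinematics above)."""
    p_next = p + v * dt          # position advances with the current velocity
    v_next = v + a * dt          # velocity changes with the commanded acceleration
    # Respect the platform's maximum speed (assumed constraint).
    speed = np.linalg.norm(v_next)
    if speed > v_max:
        v_next = v_next * (v_max / speed)
    return p_next, v_next
```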
The pursuit task is deemed successful when all of the following conditions are met at a predefined terminal time $T$:
Capture condition: each pursuer is within a capture radius $d_c$ of the target, i.e., $d_{i,E}(T) \le d_c$ for all $i$.
Collision avoidance: all inter-agent distances must exceed twice the safety radius $r_s$, and each agent must maintain a minimum safe distance $d_{\min}$ from any obstacle, i.e., $d_{i,j}(t) > 2r_s$ and $d_{i,\mathrm{obs}}(t) > d_{\min}$ for all $i \neq j$ and all $t$,
where $d_{i,E}$ is the Euclidean distance between pursuer $i$ and the evader, $d_{i,j}$ is the distance between pursuers $i$ and $j$, and $d_{i,\mathrm{obs}}$ is the distance between pursuer $i$ and the nearest obstacle.
Given that each agent relies exclusively on its local observation $o_i$ to make decisions, the aim is to develop a set of decentralized policies $\{\pi_i(a_i \mid o_i)\}_{i=1}^{N}$ that together meet the specified constraints and complete the pursuit task as efficiently as possible. The primary objective is to maximize the expected cumulative discounted reward:
$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, \mathbf{a}_t)\right],$$
where $\gamma \in [0, 1)$ is the discount factor and $\mathbf{a}_t$ is the joint action vector. This problem combines dynamic constraints, partial observability, and decentralized coordination. The incorporation of acceleration-based control and second-order dynamics increases the realism and complexity of the scenario, raising the requirements for policy stability and generalization.
3.2. Decentralized Partially Observable Markov Decision Process (Dec-POMDP)
The multi-UAV pursuit problem can be formulated as a partially observable Markov game (POMG) represented by the tuple $\langle S, A, O, P, R, \gamma \rangle$, where $S$, $A$, $O$, $P$, $R$, and $\gamma$ denote the global state space, joint action space, local observation space, state transition probability, reward function, and discount factor, respectively. Each agent chooses an action based on its local observation, which is sampled from the global state, receives a reward, and collectively affects the environment's state. The objective for each agent is to learn the optimal policy that maximizes the cumulative reward.
(1) State space $S$: the state $s \in S$ represents the global configuration of the environment, which includes the position and velocity of all drones, the position and velocity of the target, and the positions of static obstacles. The total three-drone system state is
$$s = \left[\, s_1, \ldots, s_N,\; s_1^{E}, \ldots, s_K^{E},\; s_1^{O}, \ldots, s_M^{O} \,\right],$$
where $N$, $K$, and $M$ represent the numbers of tracking drones, targets, and obstacles, respectively.
For each drone $i$, the state is $s_i = \left[\mathbf{p}_i, \mathbf{v}_i, \rho_i, v_i^{\max}\right]$, consisting of its 2D position $\mathbf{p}_i$, velocity $\mathbf{v}_i$, agent radius $\rho_i$, and maximum speed $v_i^{\max}$.
Similarly, the state of the target is $s^{E} = \left[\mathbf{p}^{E}, \mathbf{v}^{E}, \rho^{E}\right]$, where $\mathbf{p}^{E}$, $\mathbf{v}^{E}$, and $\rho^{E}$ are the position, velocity, and radius of the tracked entity, respectively. Additionally, the state of obstacle $k$ is $s_k^{O} = \left[\mathbf{p}_k^{O}, \rho_k^{O}\right]$, where $\mathbf{p}_k^{O}$ is its position and $\rho_k^{O}$ is the radius of obstacle $k$.
(2) Observation space $O$: due to limited sensing and communication, each agent $i$ receives only a partial observation $o_i$, composed of its observations of the target, the other drones, and nearby obstacles.
In the one-target scenario, the observation of the target is $o_i^{E} = \left[\Delta\mathbf{p}_i^{E}, \Delta\mathbf{v}_i^{E}, \rho^{E}\right]$, where $\Delta\mathbf{p}_i^{E}$ represents the relative position between the target and the agent, $\Delta\mathbf{v}_i^{E}$ the relative velocity, and $\rho^{E}$ the target radius. In the three-drone system, to account for cooperation between agents, all drones are included in the observation range, and zero padding is used when an agent is not visible. The observation of an obstacle contains its position relative to the agent and the obstacle radius, so the observation of each visible obstacle $j$ is $o_{i,j}^{O} = \left[\Delta\mathbf{p}_{i,j}^{O}, \rho_j^{O}\right]$. Similarly, all obstacle observations are composed into $o_i^{O}$.
Song et al. [29] demonstrated that sparse feature selection enhances decision-making efficiency in high-dimensional settings. In line with this principle, the observation space in this study includes only essential interaction information (such as the relative poses of agents and targets) to prevent irrelevant features from degrading policy learning.
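To make the observation layout concrete, the sketch below assembles a fixed-length local observation from relative target, teammate, and obstacle information with zero padding. The dictionary field names and the `max_teammates`/`max_obstacles` limits are hypothetical, included only to illustrate the structure.

```python
import numpy as np

def build_observation(agent, target, teammates, obstacles, max_teammates=2, max_obstacles=4):
    """Illustrative local observation: relative target state, relative teammate
    states (zero-padded when not visible), and relative obstacle positions."""
    obs = []
    # Target: relative position, relative velocity, and radius.
    obs += list(target["p"] - agent["p"]) + list(target["v"] - agent["v"]) + [target["radius"]]
    # Teammates: zero padding keeps the vector length fixed when some are invisible.
    for k in range(max_teammates):
        if k < len(teammates):
            mate = teammates[k]
            obs += list(mate["p"] - agent["p"]) + list(mate["v"] - agent["v"])
        else:
            obs += [0.0] * 4
    # Obstacles: relative position and radius for each visible obstacle.
    for k in range(max_obstacles):
        if k < len(obstacles):
            ob = obstacles[k]
            obs += list(ob["p"] - agent["p"]) + [ob["radius"]]
        else:
            obs += [0.0] * 3
    return np.asarray(obs, dtype=np.float32)
```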
(3) Action space $A$: the action of agent $i$ is denoted as $a_i \in A$, where $A$ is a discrete set of feasible acceleration commands subject to the fundamental kinematic equations and to the restrictions imposed by the map boundaries, maximum velocity, and maximum acceleration limits.
The values represent acceleration in nine directions: static, upward, downward, left, right, upper right, lower right, lower left, and upper left. This design provides directional precision for path adjustments and obstacle avoidance. The discrete action space simplifies policy learning compared to continuous spaces.
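A minimal sketch of such a nine-way discrete action set is shown below, mapping an action index to a unit acceleration direction scaled by an assumed maximum acceleration `A_MAX`.

```python
import numpy as np

# Nine discrete actions: zero acceleration plus the eight compass directions,
# scaled by an assumed maximum acceleration A_MAX.
A_MAX = 1.0
_DIRS = np.array([
    [0, 0],             # static
    [0, 1], [0, -1],    # upward, downward
    [-1, 0], [1, 0],    # left, right
    [1, 1], [1, -1],    # upper right, lower right
    [-1, -1], [-1, 1],  # lower left, upper left
], dtype=np.float64)

def action_to_acceleration(action_index: int) -> np.ndarray:
    """Map a discrete action index (0-8) to a 2D acceleration command."""
    d = _DIRS[action_index]
    norm = np.linalg.norm(d)
    return A_MAX * (d / norm) if norm > 0 else d
```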
(4) State transition model $P$: the transition model $P(s' \mid s, \mathbf{a})$ governs how the global state evolves based on the joint action $\mathbf{a}$ from all agents. Each UAV agent updates its velocity and position using Newtonian dynamics:
$$\mathbf{v}_i(t+1) = \mathbf{v}_i(t) + \mathbf{a}_i(t)\,\Delta t, \qquad \mathbf{p}_i(t+1) = \mathbf{p}_i(t) + \mathbf{v}_i(t)\,\Delta t.$$
In the context of this algorithmic study, it is reasonable to assume that the commanded action remains constant over the extremely short time interval $\Delta t$.
(5) Reward function $R$: the reward function guides agent $i$ by providing rewards for executed actions. It is a composite function combining dense and sparse components to promote effective and cooperative target tracking while discouraging unsafe or inefficient behavior:
$$r_i = w_1\, r_i^{\mathrm{dist}} + w_2\, r_i^{\mathrm{coll}} + w_3\, r_i^{\mathrm{step}} + w_4\, r_i^{\mathrm{capture}},$$
where $r_i^{\mathrm{dist}}$ measures the change in distance between an agent and the target over time, promoting movement towards the target; $r_i^{\mathrm{coll}}$ is a penalty for collisions between agents or with obstacles, encouraging safe navigation and cooperative spacing; $r_i^{\mathrm{step}}$ is a step penalty applied per timestep to discourage unnecessary wandering and incentivize task efficiency, typically linear in the elapsed step ratio $t/T$; and $r_i^{\mathrm{capture}}$ is a high-value sparse reward awarded only when all agents successfully surround the target within a predefined radius threshold, strongly incentivizing cooperation and convergence.
The reward structure combines short-term behavior shaping with long-term goal completion to encourage individual approach and effective coordination in completing the cooperative task, while minimizing collisions and delays. The weights of the reward components ($w_1$, $w_2$, $w_3$, and $w_4$) were manually tuned using a grid search: we iterated over a set of values for each weight and selected the combination that yielded the most efficient training and the best tracking performance. This manual tuning struck a balance between encouraging rapid target approach, minimizing collisions, and penalizing unnecessary movements, ensuring robust collaboration among agents.
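The sketch below illustrates how such a composite reward could be assembled from the four components; the weight values and per-component magnitudes are placeholders, not the grid-searched values used in the experiments.

```python
def composite_reward(d_prev, d_curr, collided, step, max_steps, captured,
                     w_dist=1.0, w_coll=1.0, w_step=0.1, w_capture=10.0):
    """Illustrative composite reward: dense shaping terms plus a sparse
    encirclement bonus. Weights are placeholders, not the tuned values."""
    r_dist = d_prev - d_curr          # positive when the agent moves closer to the target
    r_coll = -1.0 if collided else 0.0
    r_step = -step / max_steps        # linear per-step penalty on the elapsed step ratio
    r_capture = 1.0 if captured else 0.0
    return w_dist * r_dist + w_coll * r_coll + w_step * r_step + w_capture * r_capture
```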
4. Approach
In response to the susceptibility of traditional artificial potential field methods to local optima in complex scenarios [30], as well as the bottleneck of low training efficiency and slow convergence in MARL, this paper proposes an innovative collaborative tracking framework. The key innovation lies in a pre-training and fine-tuning architecture driven by expert experience [31], along with a combined optimization mechanism based on BCTD. Specifically, the framework integrates expert trajectories produced through APF, behavior cloning (BC) in IL, a value estimation mechanism optimized by temporal difference (TD) learning, and the policy optimization capability of MAPPO. The goal is to attain effective initialization and consistent iteration of the agents' strategies. The core process is structured as follows:
(1) Pre-training Guided by APF Expert Experience: Leverage expert trajectories from APF to pre-train the policy network, enabling rapid acquisition of fundamental tracking capabilities.
(2) BCTD Combined Optimization Mechanism: Employ BCTD to synchronize the optimization of both the policy and evaluation networks.
(3) MAPPO Dynamic Fine-tuning: Incorporate the MAPPO clipping mechanism along with a globally designed reward system to facilitate efficient multi-agent collaboration.
Figure 1 illustrates the algorithmic framework, which is systematically divided into three stages: “pre-training—combined optimization—dynamic adjustment,” ensuring a balance between path planning reliability and reinforcement learning efficiency.
In collaborative tracking and pursuit, the action space of the agent is usually represented as a discrete set $A = \{a_1, \ldots, a_{|A|}\}$. The reinforcement learning policy is denoted as $\pi_\theta(a \mid s)$, where the input is the current state $s$ and the output is a vector $f$ of dimension $|A|$, the size of the action space. Each element of $f$ corresponds to the probability of selecting the corresponding action.
Initially, the APF method generates expert demonstration trajectories, forming a high-quality dataset that captures interactions among agents, targets, and obstacles. At this stage, the policy network undergoes pre-training via BC based on rule-driven trajectories. This process allows the agent to swiftly acquire fundamental path planning and obstacle avoidance skills, thereby minimizing exploration costs and easing learning challenges in the subsequent pure reinforcement learning phase. The policy network, structured as a multi-classifier, outputs the probability distribution of each discrete action. Expert actions serve as the supervision signal, and the network is trained using the cross-entropy loss function.
$$\mathcal{L}_{\mathrm{BC}}(\theta) = -\sum_{a \in A} a^{\mathrm{expert}} \log \pi_\theta(a \mid s),$$
where $a^{\mathrm{expert}}$ denotes the expert action in one-hot encoding and $\pi_\theta$ is the policy network.
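A minimal behavior-cloning update of this kind is sketched below, assuming `policy_net` outputs unnormalized logits over the discrete actions; PyTorch's cross-entropy takes the expert actions as class indices, which is equivalent to the one-hot formulation above.

```python
import torch
import torch.nn as nn

def bc_update(policy_net, optimizer, states, expert_actions):
    """One behavior-cloning step: fit the policy's action distribution to the
    APF expert's discrete actions with a cross-entropy loss."""
    logits = policy_net(states)                                   # (batch, |A|) unnormalized scores
    loss = nn.functional.cross_entropy(logits, expert_actions)    # expert_actions: (batch,) class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```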
Then, to address the limitations of BC in strategy generalization and stability, we present the BCTD mechanism. Analogous to the attention mechanism for cross-modal alignment proposed by Xu et al. [32], the BCTD framework in this work accomplishes multiscale temporal feature integration by utilizing the GAE-weighted TD error, thereby improving the robustness of value estimation. This approach simultaneously optimizes the policy and value networks during the imitation learning phase. Expert trajectories provide state–action–reward sequences, which form pseudo-environment interaction paths, and GAE is employed to estimate and propagate value returns. The temporal difference error is defined as
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$
where $\delta_t$ is the TD error at time step $t$, indicating the difference between the current value estimate and the one-step return. The corresponding GAE estimate is
$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l},$$
where $\gamma$ is the discount factor (set to 0.95 in our experiments) and $\lambda$ is the GAE parameter. The GAE smooths the TD error across multiple timesteps to reduce variance, and the chosen $\lambda$ allowed us to effectively balance bias and variance. For example, at a given timestep the TD error for an agent is computed as $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, and the GAE estimate $\hat{A}_t$ sums multiple $\delta$ values weighted by powers of $\gamma\lambda$. Meanwhile, the value network aims to minimize the mean squared TD error. The primary advantage of BCTD is its ability to quantitatively evaluate and correct policy quality without adding to the environmental interaction burden. It mitigates training oscillations from early policy deviations, thereby enhancing convergence stability and the generalization capability of the policy during the imitation learning phase.
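The sketch below illustrates how TD errors and GAE advantages can be computed along an expert trajectory using the equations above; the value of `lam` is an illustrative choice, and `values` is assumed to hold the value network's estimates for states $s_0$ through $s_T$.

```python
import numpy as np

def gae_from_expert_trajectory(rewards, values, gamma=0.95, lam=0.95):
    """Compute TD errors and GAE advantages along an expert trajectory.
    `rewards` has length T and `values` has length T+1 (V(s_0)..V(s_T))."""
    T = len(rewards)
    deltas = np.array([rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)])
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # regression targets for the value network's MSE loss
    return deltas, advantages, returns
```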
Finally, building on pre-training, we utilize MAPPO for policy fine-tuning via RL. Employing a centralized training and distributed execution (CTDE) framework, MAPPO shares global value network information while maintaining policy independence, enabling efficient collaboration. Agents receive real-time feedback from the environment, including rewards for target approach, penalties for collisions and step counts, and rewards for successful encirclement. MAPPO mitigates policy update fluctuations through a clipping mechanism and, combined with the value baseline from BCTD pre-training, significantly improves the collaborative capacity and learning efficiency of the multi-agent system in dynamic settings.
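For reference, a minimal sketch of the clipped surrogate loss used by PPO-style updates such as MAPPO is given below; the clipping threshold `clip_eps = 0.2` is the common default, not necessarily the value used in this work.

```python
import torch

def mappo_clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO/MAPPO clipped surrogate loss for a batch of agent transitions."""
    ratio = torch.exp(new_logp - old_logp)            # pi_new(a|o) / pi_old(a|o)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # minimize the negative surrogate
```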
The evaluation metrics for the experiments in this chapter include the tracking success rate at convergence, the number of steps required to complete the task, the average reward tested after model convergence, and the number of training episodes required for the model to converge.
In the three-UAV target tracking scenario, analysis of the agent observation space shows that computational complexity grows linearly with the number of nearby drones and obstacles when each UAV independently executes this cooperative tracking algorithm. The MAPPO + BCTD algorithm converges at approximately 65 k episodes, significantly fewer than alternative approaches such as standard MAPPO, which requires 500 k episodes to converge. On an NVIDIA GeForce RTX 3090 GPU, each episode requires approximately 35 ms of processing time, resulting in a total training duration of roughly 6.3 h. These results highlight the computational efficiency of the proposed method, which benefits from the pre-training phase. The experiments were conducted with the parameters detailed in Table 1.
4.1. Three-UAV Target Tracking Test
The BCTD method is used to transfer the tracking ability of the APF to the tracking agents. By using the cross-entropy loss as the optimization objective, the agents' policy model successfully learns the tracking behavior of the APF model, and TD learning is used to achieve convergence of the value network.
The convergence curve of the policy network's cross-entropy loss indicates that the loss may initially be large due to the complexity of the environment during the IL process. However, as the episodes progress, the cross-entropy loss gradually converges towards 0, showing that the policy network has effectively learned the tracking ability of the APF model. Figure 2 shows the convergence of the value network via the TD mean squared error (MSE) loss curve.
The role of the APF is merely to generate expert behaviors for imitation learning to guide the agents' path planning; it does not rely on value-function updates, so there is no critic network and no critic loss. The loss curve of the value network trained via TD shows rapid convergence in the early stages, with the TD-based value loss decreasing quickly. However, because of the local-optimum limitations of the APF, there are cases where agents cannot quickly catch up with the tracked target, which causes some oscillation in the value estimates even after the tracking ability of the APF model has been learned. The average reward curve and the average number of steps to complete the task for the APF model were then tested, as shown in Figure 3.
The reward and step-completion curves show that, despite some local optima, the APF model still effectively completes the corresponding tracking tasks. Moreover, owing to its rule-based nature, its tracking performance remains relatively stable.
In the experiments in this chapter, the number of drones is 3 and the number of tracking targets is 1. MAPPO + BCTD was compared against other classic centralized training and distributed execution algorithms, including the policy-based MAPPO, the MADDPG algorithm [33], and the value-based QMIX algorithm [34]. First, we analyze the application of the BCTD algorithm within MAPPO, and then compare the performance of the different algorithms on the evaluation metrics in the collaborative tracking scenario.
The average reward curve and the average number of steps for the different algorithms during training are shown in Figure 4, which presents the episode reward curves for the different algorithms in the collaborative tracking scenario.
The curve with higher opacity and fluctuations represents the actual reward curve used, and the curve with lower opacity represents the curve after smoothing the raw rewards. The analysis of the episode reward convergence for different algorithms is as follows: In this experiment, MAPPO + BCTD demonstrated significant advantages. Compared with other algorithms, MAPPO + BCTD rapidly converged early in the training and stabilized at higher reward values, indicating that the pre-training via behavior cloning effectively accelerated the agents’ policy learning process. In contrast, MAPPO gradually improved rewards but had slower convergence and larger reward fluctuations during training, suggesting that without pre-training, the agent requires more time to adapt to the task. MADDPG showed unstable performance in the multi-agent environment with consistently low reward values, failing to effectively improve tracking ability. QMIX, though showing smoother reward changes, had lower final reward levels and could not overcome the influence of obstacles in the environment, leading to incomplete optimization of the policy. In complex tracking tasks, MAPPO + BCTD accelerated convergence and improved stability through pre-training in behavior cloning, making it the most effective algorithm.
Figure 4 also shows the episode completion step curves for the different algorithms in the collaborative tracking scenario; the curve with higher opacity and fluctuations represents the actual episode step curve, and the curve with lower opacity represents the episode step curve after the raw step counts have been smoothed.
In this experiment, MAPPO + BCTD performed the best overall, benefiting from the pre-training via behavior cloning, which effectively improved the agents’ learning efficiency and strategy collaboration capability. This led to a rapid decrease in the number of steps per episode, stabilizing around 20 steps, demonstrating excellent convergence speed and decision-making efficiency. MAPPO followed, with the number of steps stabilizing around 26, indicating its good multi-agent collaboration and environmental adaptation. In contrast, MADDPG performed poorly, with steps remaining around 50 and showing significant fluctuations, exposing its instability in learning for multi-agent cooperation tasks. QMIX, although better than MADDPG, still had a higher number of steps than MAPPO-like algorithms in complex environments, indicating limited policy flexibility and environmental adaptability. In summary, MAPPO + BCTD, with the advantage of pre-training, significantly improved policy convergence efficiency and collaboration capabilities, exhibiting optimal performance in multi-agent target tracking tasks.
The evaluation indicators of the MAPPO + BCTD, MAPPO, QMIX, MADDPG, and APF algorithms in the collaborative tracking scenario were quantified, and the comparison of the tracking performance of the different algorithms after convergence is shown in Table 2.
As shown in the table, MAPPO + BCTD demonstrated significant advantages in multi-agent target tracking tasks, particularly in convergence speed, task completion efficiency, and strategy learning ability. With pre-training via behavior cloning, MAPPO + BCTD can converge quickly, completing training in only 65 k episodes, significantly reducing training time compared to other algorithms. At the same time, this algorithm stands out in terms of tracking success rate and average reward, with a tracking success rate of 100% and the highest average reward, indicating its stability and efficiency in task execution. Moreover, MAPPO + BCTD has the fewest average steps, further demonstrating that its optimized policy model can efficiently complete target tracking tasks. In terms of model convergence, MAPPO + BCTD, leveraging the advantages of IL, required the fewest episodes to converge among reinforcement learning algorithms.
4.2. Experimental Visualization Comparison
After all models completed convergence, the effects of different algorithms were tested in the tracking task environment. As shown below, the episode ends when the tracking task is completed within the maximum number of steps; otherwise, the episode ends when the maximum action steps are reached.
Figure 5 shows the visual demonstration of the experimental effects of different algorithms. The tracking drone’s trajectory is represented by a red line, and the tracking target robot’s trajectory is represented by a blue line. By comparing various algorithms, it is clear that MAPPO + BCTD demonstrates smoother drone paths and significantly fewer steps to complete the task. In contrast, the QMIX algorithm shows poor obstacle avoidance performance when faced with dense obstacles, causing the drone swarm to be unable to effectively avoid obstacles, which impacts tracking performance. The MADDPG algorithm, on the other hand, has a noticeable local optimum issue, where the drone swarm ultimately converges to a stationary state, failing to effectively track the target, performing much worse than other algorithms. These results indicate that MAPPO + BCTD can achieve more efficient and smoother tracking paths in complex environments, with stronger capabilities for handling dynamic obstacles.