Joint Trajectory and Power Optimization for Loosely Coupled Tasks: A Decoupled-Critic MAPPO Approach

Wu, Xiangyu; Hou, Changbo; Meng, Guojing; Zhou, Zhichao; Liu, Qin

doi:10.3390/drones10020116

Open AccessArticle

Joint Trajectory and Power Optimization for Loosely Coupled Tasks: A Decoupled-Critic MAPPO Approach

by

Xiangyu Wu

¹

,

Changbo Hou

^1,2,*

,

Guojing Meng

¹

,

Zhichao Zhou

¹

and

Qin Liu

¹

College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China

²

Key Laboratory of Advanced Marine Communication and Information Technology, Ministry of Industry and Information Technology, Harbin Engineering University, Harbin 150001, China

^*

Author to whom correspondence should be addressed.

Drones 2026, 10(2), 116; https://doi.org/10.3390/drones10020116

Submission received: 19 December 2025 / Revised: 30 January 2026 / Accepted: 3 February 2026 / Published: 6 February 2026

(This article belongs to the Section Drone Communications)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

Proposed a novel MAPPO-DC algorithm for multi-UAV communication, introducing a hierarchical framework and a composite reward function to optimize resource allocation in loosely coupled tasks.
Demonstrated MAPPO-DC’s enhanced performance in simulation, achieving improved communication efficiency, user fairness, and crash avoidance compared to IPPO and standard MAPPO.

What are the implications of the main findings?

Offers a promising, simulation-validated algorithmic foundation for future real-world multi-UAV communication systems, promising improved network performance and task completion in complex environments.
Provides insights into designing advanced MARL algorithms capable of managing credit assignment and balancing individual agent performance with global objectives in decentralized systems.

Abstract

Multi-unmanned aerial vehicle (UAV) systems are crucial for establishing resilient communication networks in disaster-stricken areas, but their limited energy and dynamic characteristics pose significant challenges for sustained and reliable service provision. Optimizing resource allocation in this situation is a complex sequential decision-making problem, which is naturally suitable for multi-agent reinforcement learning (MARL). However, the most advanced MARL methods (e.g., multi-agent proximal policy optimization (MAPPO)) often encounter difficulties in the “loosely coupled” multi-UAV environment due to their overly centralized evaluation mechanism, resulting in unclear credit assignment and inhibiting personalized optimization. To overcome this, we propose a novel hierarchical framework supported by MAPPO with decoupled critics (MAPPO-DC). Our framework employs an efficient clustering algorithm for user association in the upper layer, while MAPPO-DC is used in the lower layer to enable each UAV to learn customized trajectories and power control strategies. MAPPO-DC achieves a complex balance between global coordination and personalized exploration by redesigning the update rules of the critic network, allowing for precise and personalized credit assignment in a loosely coupled environment. In addition, we designed a composite reward function to guide the learning process towards the goal of proportional fairness. The simulation results show that our proposed MAPPO-DC outperforms existing baselines, including independent proximal policy optimization (IPPO) and standard MAPPO, in terms of communication performance and sample efficiency, validating the effectiveness of our tailored MARL architecture for the task. Through model robustness experiments, we have verified that our proposed MAPPO-DC still has certain advantages in strongly coupled environments.

Keywords:

deep reinforcement learning; resource allocation; UAV; low-altitude network

1. Introduction

After sudden disasters such as earthquakes, floods, and building collapses, ground cellular networks often become paralyzed due to physical damage or power outages, forming “communication islands”. In this situation, rapidly deploying multiple unmanned aerial vehicles (UAVs) to form a temporary, self-organizing aerial communication network has become one of the most promising solutions [1,2,3,4].

UAVs have the advantages of flexible deployment and high maneuverability, making them very suitable for scenarios with sparse and unevenly distributed users [5]. The main challenge of UAV-assisted communication lies in its limited onboard energy, which limits flight endurance and signal transmission power. Therefore, it is crucial to effectively allocate communication resources among multiple UAVs [6,7]. Designing effective resource allocation strategies is crucial for providing sustainable and reliable services under strict energy constraints. However, in highly dynamic scenarios involving multiple UAVs and mobile users, solving resource allocation problems constitutes a complex sequential decision-making process. Traditional methods based on static optimization are not suitable for this task, which in turn highlights the enormous potential of learning-based adaptive strategies.

1.1. Related Works

The problem of resource allocation in UAV-assisted communication has been extensively studied in academia. These studies can be broadly categorized into methods based on traditional optimization and those based on learning methods. Our analysis begins with an overview of traditional optimization techniques, followed by the evolution and application of reinforcement learning paradigms in this domain.

1.1.1. Traditional Optimization Methods

Early research mainly relied on traditional optimization techniques to solve the resource allocation problem in multi-UAV communication networks. These works typically model resource allocation problems as precise mathematical formulas, utilizing methods such as convex optimization and successive convex approximation (SCA) to optimize UAV trajectories and transmission power under idealized channel models [4,8]. Other methods include using block coordinate descent (BCD) to decouple complex variables [8], or applying game theory to simulate interactions between multiple UAVs and users [9].

Although traditional optimization methods can theoretically achieve optimal or near-optimal solutions under static conditions with perfect information, they face significant limitations in practical multi-UAV communication scenarios serving multiple users as follows:

Computational complexity and real-time decision-making: Traditional methods in UAV trajectory and resource allocation typically transform optimization problems into finding optimal solutions for non-convex problems, which often require complex mathematical designs (e.g., iterative SCA or BCD methods) and are computationally intensive. This makes it impossible for them to make real-time decisions in highly dynamic environments where UAVs and users are constantly moving [10].
Prior information dependency and model robustness: The performance of traditional optimization methods largely relies on accurate and real-time system state information (e.g., precise channel models, user locations, interference levels) [4]. However, in highly dynamic UAV communication networks, it is difficult to obtain accurate information in real time, which makes traditional optimization methods susceptible to environmental uncertainty and subsequently affects communication performance.
Scalability: As the number of UAVs and users increases, the dimensionality of optimization problems grows rapidly, leading to combinatorial explosion and making traditional methods unsuitable for large-scale dynamic scenarios.
Local optimality of heuristics: Simpler heuristic approaches, such as greedy algorithms [11], while computationally less demanding, often suffer from inherent sub-optimality and local convergence, failing to achieve desirable global communication performance across the network.

These issues point out the limitations of traditional optimization methods in highly dynamic UAV communication systems and highlight the importance of feedback based learning algorithms.

1.1.2. Single-Agent Reinforcement Learning Approaches

In order to overcome the limitations of traditional optimization methods, some scholars use reinforcement learning (RL) for communication resource allocation. RL learns the optimal strategy through direct interaction with the environment, unlike traditional optimization methods that rely on fixed computational modes. This method is very suitable for solving complex sequential decision-making problems in multi-UAV communication scenarios.

To simplify the problem, initial deep reinforcement learning (DRL) research was typically based on a single-agent system framework, which can represent a UAV or centralized controller. For example, some studies focus on trajectory design using algorithms such as deep Q-network (DQN) [12,13], while assuming fixed power control or user association schemes [14]. However, this decoupling method ignores the strong coupling between these variables. After recognizing the inherent relationship between the position, transmission power, and communication performance of UAVs, the mainstream trend in academia has shifted towards joint optimization of continuous variables, especially the joint design of UAV trajectories and transmission power. Solutions based on advanced algorithms such as the deep deterministic policy gradient (DDPG), soft actor-critic (SAC), and proximal policy optimization (PPO) [15,16,17] have been shown to significantly outperform single-variable optimization. These studies consolidate the consensus that joint optimization is the key way to improve the efficiency of UAV communication.

However, in high dynamic scenarios where multiple UAVs provide communication services to multiple users, the DRL framework of a single agent, even as an independent learner (IL) applied to a single UAV (e.g., IPPO [18,19]), inherently faces severe challenges:

High-dimensional action space: In scenarios where multiple UAVs provide communication services to multiple users, the dimension of the action space increases sharply with the number of UAVs and users, making it difficult for single-agent methods to effectively explore and learn optimal strategies.
Non-stationarity and convergence difficulties: When multiple independent agents run in a shared environment, the dynamics of the environment become non-stationary from the perspective of each agent due to the constantly changing strategies of other agents. This non-stationarity leads to convergence difficulties, often causing algorithms to fall into local optima or requiring excessively long training time, severely limiting the upper limit of communication performance.
Lack of global coordination: When multiple independent agents run in a shared environment, due to their complete independence, the lack of coordination between agents leads to malicious interference and limited global communication performance, such as the following:
-
Resource contention and interference: Uncoordinated power allocation and trajectory planning can cause severe inter-UAV interference, degrading overall network capacity and user throughput.
-
Collision avoidance failures: Without explicit coordination, UAVs may collide, leading to mission failure and communication service disruption.
-
Suboptimal coverage and fairness: Individual UAVs optimizing solely for their local objectives may neglect underserved areas or users, compromising network-wide coverage and fairness in communication quality.
-
Credit assignment problem: When negative feedback (e.g., collision, network congestion) occurs, it is difficult for an independent learner to determine whether it is an error in one’s own strategy or the influence of other agent strategies, resulting in low learning efficiency.

These limitations emphasize the necessity of transitioning from single-agent DRL to multi-agent DRL, which can clearly handle coordination and cooperation in complex multi-UAV communication environments.

1.1.3. Multi-Agent Reinforcement Learning Approaches

Therefore, after recognizing the inherent need for coordination in multi-UAV systems, the focus of research has shifted from single-agent DRL to MARL. While simple MARL implementations, such as the ILs mentioned above, still struggle with non-stationarity and credit assignment, the centralized training with decentralized execution (CTDE) framework [20] has emerged as the gold standard. Within CTDE, a centralized critic leverages global information during training to guide the learning of individual, decentralized actors. This architecture is adopted by algorithms ranging from the pioneering multi-agent deep deterministic policy gradient (MADDPG) [21] to the high-performing MAPPO [22]. Due to its outstanding collaborative performance and training stability, MAPPO is widely regarded as one of the most advanced state-of-the-art (SOTA) methods [23,24], especially in complex multi-UAV communication service scenarios. These CTDE-based MARL algorithms directly solve the problem of dimension explosion in single-agent methods and the lack of global coordination in IL methods, enabling the following:

Enhance communication network performance: Through global information sharing and centralized evaluation, the system can collaboratively optimize the trajectories and resource allocation of multiple UAVs, effectively reduce interference, and improve network capacity and overall user throughput.
Robust collision avoidance: Centralized training enables agents to learn collaborative strategies to avoid collisions, ensuring the safe and stable operation of multi-UAV system.
Improving resource management and fairness: Coordinated methods help to allocate resources more fairly, optimize coverage, and thus improve the overall quality and user satisfaction of communication services.

However, while improving the performance of MARL in resource allocation tasks, some scholars have focused on the structural characteristics of task scenarios and introduced the concept of “task coupling structure” [25], defining it as the extent to which the optimal behavior of an agent depends on the behavior of other agents. For example, the classic QMIX algorithm [26] is specifically designed for tasks with strictly monotonic cooperative structures. These studies emphasize the necessity of designing algorithms that appropriately match the coupling characteristics of specific problems.

In the scenario where multiple UAVs provide communication services, the task scenario essentially presents a “loosely coupled” structure. Although UAVs must cooperate to avoid mutual interference and ensure continuous system operation, they are largely autonomous in providing services to users within their respective service areas. Therefore, mainstream MARL algorithms based on CTDE, such as standard MAPPO, may be limited due to their “over concentration” or “over coupling” properties as follows:

Suppression of individualized performance: The single centralized critic of standard MAPPO is designed specifically for strongly coupled tasks to evaluate global returns. This mechanism may inadvertently dilute the unique contributions of individual agents while promoting overall cooperation, inhibiting exploration of personalized and local optimal behaviors. In loosely coupled communication tasks, a single UAV may have different user groups and local optimization objectives, which may hinder achieving maximum individual communication performance.
Ambiguous credit assignment in loosely coupled settings: When receiving global rewards (or punishments), in a loosely coupled environment, a single centralized critic may find it difficult to accurately allocate credit or responsibility to various agents. The ambiguity of credit assignment may slow down learning speed or lead to the formation of suboptimal strategies, especially when the influence of agents is more localized, which can affect the overall efficiency and fairness of communication services.

Therefore, designing a MARL algorithm that not only includes the required communication fairness objectives, but also customizes it specifically for the loosely coupled task characteristics generated by complex multi-UAV communication tasks remains an important and challenging research direction, and current methods have not fully solved it.

1.2. Motivation

In MARL, advanced CTDE algorithms are crucial for coordinating multiple UAVs. Although pioneering methods like MADDPG [21] effectively extend DDPG to multi-agent systems and can handle continuous control tasks such as trajectory and power optimization, their dependence on deterministic strategies and inherent stability challenges limits their applicability to our specific problems. MAPPO [22,23,24] provides significant advantages for joint optimization of continuous variables such as UAV trajectory and power allocation in high dynamic and safety-critical multi-UAV communication networks. MAPPO is built on a stable and robust PPO algorithm, using a random strategy that inherently promotes better exploration in complex, high-dimensional continuous action spaces. This feature is crucial for discovering diverse and energy-efficient strategies in dynamic UAV environments, ensuring robust learning against uncertainty. In addition, MAPPO’s improved stability and ease of adjustment make it a more reliable choice for formulating policies that can ensure safe operation (such as collision avoidance) and continuous communication services under strict energy constraints. Therefore, while pursuing high communication performance and system robustness, MAPPO has become the preferred benchmark for complex multi-UAV communication tasks.

However, there are still two limitations to current research. Firstly, most current research mainly focuses on maximizing the total rate in optimizing objectives [13,22,23], often neglecting fairness among users. Although methods such as maximizing minimum user performance [8] or using the jain fairness [27] index consider fairness among users, they have the problem of ignoring overall system performance or easily falling into local optima in complex optimization environments. Proportional fairness (PF) is a classic indicator that balances performance and fairness, and its application has not been explored to a large extent within this advanced MARL framework.

Secondly, although joint optimization is the most ideal way to improve the overall performance of the system, it is computationally infeasible to directly apply the PF objective function to a fully joint MARL optimization framework. On the other hand, if hierarchical optimization is adopted to avoid this problem, the resulting loosely coupled task structure poses new challenges to the MARL algorithm itself. Simple methods like IPPO are plagued by non-stationarity and cannot capture subtle but crucial inter-agent coupling, such as collision avoidance. This is because sparse global negative feedback signals, such as collision penalties, are often attributed by each independent learner to the randomness of their own strategy or environmental noise, making it difficult to form effective collaborative avoidance strategies. On the contrary, CTDE algorithms designed for mainstream strongly coupled scenarios such as MAPPO may suppress the exploration of individual optimal performance strategies in loosely coupled scenarios due to their overly concentrated value estimation, resulting in poor overall system performance.

Therefore, designing a MARL algorithm that not only considers fairness objectives, but also tailors it to the loosely coupled task characteristics generated by hierarchical optimization has become an important research direction.

1.3. Contribution

To address the aforementioned challenges, this paper proposes a hierarchical framework centered on a novel MARL algorithm, termed MAPPO with decoupled critics (MAPPO-DC). In the upper layer, an efficient user-to-UAV assignment is performed using a lightweight clustering algorithm. In the lower layer, our proposed MAPPO-DC algorithm enables each agent to learn tailored policies for trajectory and power control. Specifically, by redesigning the critic’s update rule, MAPPO-DC allows each agent to distinguish shared global value from its local, task-specific advantages, achieving a sophisticated balance between global coordination and individualized exploration. Furthermore, a composite reward function, congruent with the MAPPO-DC architecture, is devised based on reward construction principles to guide the learning process toward the proportional fairness objective.

The main contributions of this paper are as follows:

We design an efficient hierarchical framework to mitigate the combinatorial explosion caused by an overly large joint optimization space.
Based on a detailed analysis of the scene’s task coupling, we propose the MAPPO-DC algorithm tailored for the loosely coupled task scenario inherent to the hierarchical framework. A composite reward is also designed for this architecture to ensure both inter-agent coordination and individual agent performance in loosely coupled settings.
The simulation results demonstrate that our proposed MAPPO-DC algorithm outperforms IPPO, MAPPO, and other baseline algorithms. In addition, we analyze the intrinsic behavioral mechanisms of different algorithms under energy constraints.

Theoretically, MAPPO-DC is positioned as a CTDE-based algorithm with a multi-critic architecture. Relative to IPPO, it mitigates environmental non-stationarity by incorporating global states into the critic. Relative to standard MAPPO with a single centralized critic, MAPPO-DC decouples the value estimation to prevent the ambiguity of credit assignment caused by global reward averaging in loosely coupled tasks. Furthermore, unlike value-decomposition methods (e.g., VDN, QMIX) that rely on the monotonicity assumption of the global value function, MAPPO-DC requires no such constraints. Consequently, it is particularly well-suited for the loosely coupled communication scenarios presented in this work, which require equal emphasis on inter-agent coordination (e.g., safety constraints) and individual agent performance (e.g., customized power and trajectory control).

2. System Model

We consider a scenario, as illustrated in Figure 1, where a fleet of N energy-constrained UAVs provides wireless communication services to K users distributed over a two-dimensional plane. The set of UAVs is denoted by

N = {1, \dots, N}

, and the set of users is denoted by

K = {1, \dots, K}

. We use the index

i, j \in N_{j \neq i}

for UAVs and

k \in K

for users. At any given time step t, the position of the ith UAV is represented by its 3D Cartesian coordinates

q_{i} (t) \in R^{3}

. Similarly, the position of the kth user is

w_{k} (t) \in R^{3}

. The users are randomly distributed within the operational area [4,5,8]. As shown in Figure 1, multiple UAVs fly within this area to provide communication services to users. The specific parameters of the scenario, adjusted based on [28,29], are detailed in Table A1.

2.1. Transmission Model

The communication link adopts a probabilistic line-of-sight (LoS) model [4], where the channel gain is not deterministic but is the expected value of the channel gains under LoS and non-loS (NLoS) conditions. The probability of LoS depends on the elevation angle between the UAV and the user.

The 3D distance between UAV i and user k is given by

d_{i, k} (t) = ∥q_{i} (t) - w_{k} (t)∥

(1)

The horizontal distance between UAV i and user k is calculated as

d_{i, k}^{2 D} (t) = \sqrt{{(x_{i} (t) - x_{k} (t))}^{2} + {(y_{i} (t) - y_{k} (t))}^{2}}

(2)

And the elevation angle between UAV i and user k is calculated as

θ_{i, k} (t) = \frac{180}{π} arctan (\frac{h_{i}}{d_{i, k}^{2 D} (t)})

(3)

The LoS probability is given by

P_{i, k}^{L o S} (t) = \frac{1}{1 + a \cdot exp (- b \cdot (θ_{i, k} (t) - a))}

(4)

where a and b are empirical parameters dependent on the environment, set to 9.61 and 0.16, respectively [28].

Building upon free-space path loss and differentiating between LoS and NLoS conditions, the path loss for LoS and NLoS links are, respectively,

P L_{i, k}^{L o S} (t) = 20 \log_{10} (d_{i, k} (t)) + ξ_{L o S}

(5)

and

P L_{i, k}^{N L o S} (t) = 20 \log_{10} (d_{i, k} (t)) + ξ_{N L o S}

(6)

To calculate the average channel gain, we first convert the path losses to linear scale,

{\bar{P L}}_{i, k} (t) = P_{i, k}^{L o S} (t) \cdot 10^{\frac{- ξ_{0} - P L_{i . k}^{L o S} (t)}{10}} + (1 - P_{i, k}^{L o S} (t)) \cdot 10^{\frac{- ξ_{0} - P L_{i, k}^{N L o S} (t)}{10}}

(7)

Consequently, the average channel gain (in linear scale) is obtained by converting the average path loss,

h_{i, k} (t) = β_{0} \cdot {\bar{P L}}_{i, k} (t)

(8)

where

ξ_{L o S}

and

ξ_{N L o S}

are the excessive path loss for LoS and NLoS links, respectively, and

β_{0}

is the channel power gain at a reference distance of 1 m [29].

We employ an orthogonal frequency division multiple access (OFDMA) scheme, allocating independent frequency bands to each UAV to avoid co-channel interference among them. Let

U_{i} (t) \subseteq K

be the set of users served by UAV i. According to the Shannon–Hartley theorem, for a user

k \in U_{i} (t)

, the received signal-to-interference-plus-noise ratio (SINR) can be simplified as the signal-to-noise ratio (SNR) and expressed as

S N R_{i, k} (t) = \frac{P_{i, k} (t) h_{i, k} (t)}{B_{k} (t) N_{0}}

(9)

where

P_{i, k} (t)

is the transmit power allocated by UAV i to user k at time t,

h_{i, k} (t)

is the channel gain between them,

N_{0}

is the noise power spectral density, and

B_{k} (t)

is the bandwidth allocated to the user. For analytical convenience, we assume an equal bandwidth allocation scheme, where each served user

k \in U_{i} (t)

receives

B_{k} (t) = \frac{B_{t o t a l} / N}{|U_{i} (t)|}

. The purpose of this assumption is to simplify the analysis of the problem and focus on key variables [18,20], ensuring fairness among users at the bandwidth level, while concentrating research on trajectory and power.

The achievable instantaneous throughput for user k at time t is

C_{k} (t) = B_{k} (t) \log_{2} (1 + S N R_{i, k} (t))

(10)

2.2. Energy Consumption Model

The energy consumption of a UAV is primarily generated by its propulsion system, which includes both hovering and motion components. Based on the quadrotor energy consumption model in [30] and incorporating common simplifications of rotary-wing aircraft aerodynamics from [31], factors such as climb, descent, and blade profile drag are neglected to focus the research on the long-term joint optimization strategy for trajectory and power. Thus, the total power consumption

P_{i} (t)

for UAV i at time t can be approximated [4] as

P_{i} (t) = P_{i}^{f l y} (t) + P^{c o m m}

(11)

where

P^{c o m m}

is the communication power. The propulsion power

P_{i}^{f l y} (t)

is calculated based on the UAV’s velocity vector,

P_{i}^{f l y} (t) = P^{h o v} + c_{m o v e} \cdot {∥v_{i} (t)∥}^{2}

(12)

where

P^{h o v}

is the hovering power,

v_{i} (t)

is the velocity vector of UAV i, and

c_{m o v e}

is the motion-related power coefficient [4]. The remaining energy of UAV i at the next time step can be calculated by subtracting the consumed energy from the current remaining energy,

E_{i} (t) = E_{i} (t - 1) - P_{i} (t) \cdot Δ t

(13)

where

E_{i} (t)

is the remaining energy of UAV i,

Δ t

is the duration of a time step.

2.3. Problem Formulation

This study aims to maximize the system’s PF throughput by jointly optimizing the UAVs’ trajectories

Q = \{q_{i} (t) | \forall i \in N, \forall t\}

and power allocation strategies

P = \{P_{i, k} (t) | \forall i, k, t\}

within a finite flight period and under energy constraints. This objective is achieved by maximizing the log-sum of the users’ long-term average throughputs, as formally expressed in the optimization problem (P1),

\begin{matrix} (P 1) : & \max_{Q, P} & \sum_{k \in K} \log ({\bar{R}}_{k}) \\ s . t . & (C 1) : & ∥q_{i} (t + 1) - q_{i} (t)∥ \leq v_{max} \cdot Δ t, & \forall i, t \\ (C 2) : & H_{min} \leq z_{i} (t) \leq H_{max}, & \forall i, t \\ (C 3) : & ∥q_{i} (t) - q_{j} (t)∥ \geq d_{s a f e}, & \forall i \neq j, t \\ (C 4) : & \sum_{k \in U_{i} (t)} P_{i, k} (t) \leq P^{c o m m}, & \forall i, t \\ (C 5) : & E_{i} (t) \geq 0, & \forall i, t \\ (C 6) : & K_{min} \leq |U_{i} (t)| \leq K_{max}, & \forall i, t \end{matrix}

(14)

where

{\bar{C}}_{k} = \frac{1}{T_{max}} \sum_{t = 0}^{T_{max} - 1} C_{k} (t)

is the long-term discounted cumulative throughput for user k, and

C_{k} (t)

is the instantaneous throughput defined in (10). Constraints C1 to C6 correspond to the UAV maximum velocity, flight altitude, collision avoidance, communication power, battery endurance, and user serving load, respectively.

Problem (P1) is a mixed-integer non-convex optimization problem, characterized by highly coupled variables, high-dimensional state-action spaces, and the highly dynamic nature of UAV communication environments. Due to these complexities, including the problem being NP-hard and the partial observability from each agent’s perspective, it is intractable for traditional optimization methods to obtain real-time optimal solutions. Therefore, we model this problem as a multi-agent partially observable Markov decision process (POMDP). We then employ a MARL approach, specifically our proposed MAPPO-DC framework, to learn an approximately optimal policy that addresses these challenges and maximizes the long-term cumulative reward, with a carefully designed reward function aligned to the objective of (P1).

3. Method

Although the end-to-end MARL method with a fully joint action space is theoretically superior, the explosion of fully joint action space combinations at our scene scale (N UAVs and K users) makes existing algorithms computationally unfeasible. More importantly, in most communication scenarios, the decision time scale of user clusters is much larger than the decision time of trajectory and power control. This natural separation of time scales provides a convincing theoretical basis for adopting a hierarchical optimization framework. The concept of decomposition has also been validated in traditional optimization fields. For example, reference [8] employs the BCD method to solve a similar problem, decomposing the complex joint optimization task into more tractable subproblems that are solved alternately. This inspires our approach, suggesting that problem decomposition, in traditional or learning-based methods, is a critical path toward finding feasible and efficient solutions.

Therefore, we adopted a hierarchical architecture. At the upper level, we adopted a user clustering algorithm tailored to our scenario, which combines nearest neighbor assignment with intelligent user offloading. At the bottom level, our proposed MAPPO-DC algorithm is used to jointly optimize UAV trajectory planning and power allocation. The hierarchical framework decouples user association from trajectory control, which inevitably sacrifices some global optimality. Specifically, the heuristic clustering based on nearest distance may lead to suboptimal load balancing in boundary cases where a user could be better served by a slightly more distant but less congested UAV (to maximize SINR via power headroom). However, finding the globally optimal assignment requires solving a combinatorial optimization problem (

O (N_{u a v}^{N_{u s e r}})

), which is prohibitive for real-time control. Our proposed intelligent user offloading mechanism mitigates this impact by reassigning users from overloaded UAVs to neighboring ones, ensuring that the physical capacity constraints are met. While not mathematically globally optimal, this approach provides a robust Pareto improvement over static assignments, balancing computational latency with system throughput.

The time complexity of our proposed heuristic clustering algorithm is

O (N_{u a v} \cdot N_{u s e r})

, and in the current scenario (30 users, 5 UAVs), it only requires microsecond-level computation. In contrast, finding the theoretical global optimal solution for user allocation is a combinatorial optimization problem with a complexity of

O (N_{u a v}^{N_{u s e r}})

. As the number of UAVs and users increases, the computational cost of this heuristic algorithm only increases linearly, while the computational cost of finding the global optimal solution increases exponentially. Therefore, our hierarchical architecture actually obtained a high-quality suboptimal solution through heuristic methods at an extremely low time cost, leaving a sufficient computational time window for real-time control of the lower-level MARL. This ensures the real-time capability of the system during large-scale expansion.

The specific meanings corresponding to the symbols are shown in Table 1.

3.1. User Clustering

As mentioned above, incorporating discrete user association decisions into the MARL action space leads to a combinatorial explosion. Furthermore, the decision frequency for user association is typically lower than that for trajectory and power control. Therefore, we employ an efficient heuristic-based clustering algorithm to periodically handle the user association problem, thereby constraining the action space of the lower-layer MARL task to continuous control variables. The procedure for hierarchical control, including periodic clustering, is detailed in Algorithm 1.

Algorithm 1 User Clustering via Nearest-Neighbor Assignment and Intelligent Offloading

1: Input: UAV set $N$ , user set $K$ , max users per UAV $N_{max}$ , UAV positions ${\{q_{i} (t)\}}_{i \in N}$ , user positions ${\{w_{k} (t)\}}_{k \in K}$ , total time steps T, clustering period $T_{c l u s t e r}$
2: Initialize: UAV positions $q_{i} (0)$ , user positions $w_{k} (0)$ , initial user-UAV association map $U_{i} (0)$
3: for time step $t = 0$ to $T - 1$ do
4: if $t (\mod T_{c l u s t e r}) = = 0$ then // Periodically update clusters
5: $U_{i} (t) \leftarrow UpdateUserClusters (N, K, N_{max}, {\{q_{i} (t)\}}_{i \in N}, {\{w_{k} (t)\}}_{k \in K})$
6: else
7: $U_{i} (t) \leftarrow U_{i} (t - 1)$ // Maintain previous association
8: end if
9: Execute lower-layer MAPPO-DC policy based on $U_{i} (t)$ to determine UAV velocities $v_{i} (t)$ and powers $P_{i, k} (t)$
10: Update UAV positions: $q_{i} (t + 1) \leftarrow q_{i} (t) + v_{i} (t) Δ t$
11: end for
12: Function $UpdateUserClusters (N, K, N_{max}, {\{q_{i} (t)\}}_{i \in N}, {\{w_{k} (t)\}}_{k \in K})$
13: Output: Final user-UAV association map $U (t) = \{U_{1} (t), \dots, U_{N} (t)\}$
Phase 1: Nearest-Neighbor Assignment
14: Initialize: clusters $U_{i} (t) \leftarrow \emptyset$ for all $i \in N$
15: for each user $k \in K$ do
16: Find the nearest UAV: $i^{*} \leftarrow arg min_{i \in N} w_{k} (t) - q_{i} (t)$
17: Assign user k to UAV $i^{*} : U_{i^{*}} (t) \leftarrow U_{i^{*}} (t) \cup {k}$
18: end for
Phase 2: Intelligent Offloading
19: Initialize: Overloaded UAV set $N_{over} \leftarrow \{i \in N | | U_{i} (t) > N_{max}\}$
20: while $N_{over} \neq \emptyset$ do
21: Select an overloaded UAV $i \in N_{over}$
22: Identify the user to offload: $k_{o f f} \leftarrow arg max_{k \in U_{i} (t)} w_{k} (t) - q_{i} (t)$
23: Define candidate receiver UAVs: $N_{cand} \leftarrow \{j \in N ∖ {i} | | U_{j} (t) < N_{max}\}$
24: if $N_{cand} \neq \emptyset$ then
25: Find the best receiver UAV: $j^{*} \leftarrow arg min_{j \in N_{cand}} w_{k_{off}} (t) - q_{j} (t)$
26: Update clusters: $U_{i} (t) \leftarrow U_{i} (t) ∖ \{k_{o f f}\}, U_{j^{*}} (t) \leftarrow U_{j^{*}} (t) \cup \{k_{o f f}\}$
27: else
28: Break // No available UAV to receive the user
29: end if
30: Update overloaded set: $N_{over} \leftarrow \{i \in N | | U_{i} (t) > N_{max}\}$
31: end while
Safety truncation for rare unsolvable cases
32: for each UAV i with $|U_{i} (t)| > N_{max}$ do
33: Sort users in $U_{i} (t)$ by distance to $q_{i} (t)$ in ascending order
34: Keep the nearest $N_{max}$ users: $U_{i} (t) \leftarrow$ first $N_{max}$ users in sorted list
35: end for
36: return Final clusters $U (t)$

3.2. State Space

The MAPPO-DC algorithm adheres to the CTDE paradigm, which requires aggregating the local observations of all agents to form a global state. State space is partitioned into two components: the local observation space and the global state space.

The self-centric local observation for agent i, denoted as

o_{i} (t)

, comprises

o_{i} (t) = [q_{i} (t), {\{q_{j} (t) - q_{i} (t)\}}_{i, j \in N, j \neq i}, {\{q_{i} (t) - w_{k} (t)\}}_{k \in U_{i} (t)}, {\{\bar{C_{k}} (t)\}}_{k \in U_{i} (t)}, E_{i} (t)]

(15)

It includes the agent’s absolute position

q_{i} (t)

, its relative positions to other UAVs

{\{q_{j} (t) - q_{i} (t)\}}_{i, j \in N, j \neq i}

, the relative positions

{\{q_{i} (t) - w_{k} (t)\}}_{k \in U_{i} (t)}

to its served users

U_{i} (t)

, the historical average throughputs of these users

{\{\bar{C_{k}} (t)\}}_{k \in U_{i} (t)}

, and its own remaining energy

E_{i} (t)

, where

\bar{C_{k}} (t) = \frac{1}{t} \sum_{τ = 1}^{t} C_{k} (τ)

. To maintain a fixed input dimension, information related to the served users is padded to the maximum users a UAV can serve (

K_{max}

).

The global state

s_{t}

is formed by concatenating the local observations of all agents, thereby containing complete information about all dynamic elements in the system. It is constructed as

s_{t} = [{\{q_{i} (t)\}}_{i \in N}, {\{w_{k} (t)\}}_{k \in K}, {\{\bar{C_{k}} (t)\}}_{k \in K}, {\{E_{i} (t)\}}_{i \in N}]

(16)

It consists of the set of all UAV positions

{\{q_{i} (t)\}}_{i \in N}

, the set of all user positions

{\{w_{k} (t)\}}_{k \in K}

, the historical average throughputs of all users

{\{\bar{C_{k}} (t)\}}_{k \in K}

, and the remaining energy of all UAVs

{\{E_{i} (t)\}}_{i \in N}

.

3.3. Action Space

At each time step t, the action of agent i, denoted as

a_{i} (t) \in R^{3 + K_{max}}

, is composed of two continuous vectors: a trajectory control action

a_{i}^{t r a j} (t) \in R^{3}

and a power allocation action

a_{i}^{p w r} (t) \in R^{K_{max}}

.

a_{i} (t) = \{a_{i}^{t r a j} (t), a_{i}^{p w r} (t)\}

(17)

3.3.1. Trajectory Planning

For the trajectory planning action

a_{i}^{t r a j} (t)

, the agent’s action aims to determine its velocity or displacement for the next time step. Specifically, the actor network outputs a mean value

μ_{i}^{t r a j} (t) \in R^{3}

for a displacement vector

Δ q_{i} (t) = {[Δ x, Δ y, Δ z]}_{i, t}

. The actual displacement increment is sampled from a normal distribution centered at this mean,

a_{i}^{t r a j} (t) = Δ q_{i} (t) \sim N (μ_{i}^{t r a j} (t), σ_{t r a j}^{2})

(18)

where

σ_{t r a j}

is a learnable or fixed standard deviation. The UAV’s next position is updated as

q_{i} (t + 1) = q_{i} (t) + Δ q_{i} (t)

. The actual velocity of UAV i at time t is then

v_{i} (t + 1) = {[Δ x, Δ y, Δ z]}_{i, t} / Δ t

, which effectively defines the desired velocity for the current time step.

3.3.2. Power Allocation

For the power allocation action

a_{i}^{p w r} (t)

, the agent’s action aims to decide how to distribute its total communication power

P_{max}

among its served users. We implement this through a Softmax-based mechanism. Specifically, the actor network outputs a vector of raw logit values,

l_{i} (t) = [l_{i, 1} (t), l_{i, 2} (t), \dots, l_{i, K_{max}} (t)] \in R^{K_{max}}

. For a user

k \in U_{i} (t)

, the allocated power

P_{i, k} (t)

is obtained by applying a Softmax function to the corresponding logit values and scaling by the total available power,

a_{i}^{p w r} (t) = P_{i, k} (t) = P_{i}^{c o m m} \cdot \frac{exp (l_{i, k} (t))}{\sum_{k \in U_{i} (t)} exp (l_{i, k} (t))}

(19)

The elements of the

K_{max}

-dimensional vector

l_{i} (t)

are the unnormalized raw logit values. These values are transformed into power allocation ratios via the Softmax function, thereby enabling continuous power control for the served users.

3.4. Reward

To maximize both overall system coordination and individual agent performance, we adapt the reward structure for MAPPO-DC by classifying each reward component based on its contribution to either coordination or individual performance. Specifically, the collision penalty

r^{c o l l} (t)

, PF reward

r^{p f} (t)

, and user satisfaction reward

r^{b o n u s} (t)

, which contribute to agent coordination, are categorized as shared rewards

r^{s h a r e} (t)

. Conversely, the position guidance reward

r^{p o s} (t)

, motion smoothness penalty

r^{s m o o t h} (t)

, boundary penalty

r^{b o u n d} (t)

, and crash penalty

r^{c r a s h} (t)

, which relate to an agent’s individual behavior, are categorized as individual rewards. Since the individual rewards differ for each agent, the total reward for each agent is also distinct

r_{i} (t) = w_{s h a r e} \cdot r^{s h a r e} (t) + w_{i n d} \cdot r_{i}^{i n d} (t)

(20)

where

r^{s h a r e} (t)

is the reward common to all agents,

r^{i n d} (t)

is the reward unique to agent i, and w are the weighting coefficients for each sub-reward,

r^{s h a r e} (t) = w_{c o l l} \cdot r^{c o l l} (t) + w_{p f} \cdot r^{p f} (t) + w_{b o n u s} \cdot r^{b o n u s} (t)

(21)

r_{i}^{i n d} (t) = w_{p o s} \cdot r_{i}^{p o s} (t) + w_{s m o o t h} \cdot r_{i}^{s m o o t h} (t) + w_{b o u n d} \cdot r_{i}^{b o u n d} (t) + w_{c r a s h} \cdot r_{i}^{c r a s h} (t)

(22)

To align with the primary objective in (P1), we design the main component of the reward function based on the instantaneous sum-log-rate, which serves as a dense signal to promote proportional fairness. The PF reward at time t is defined as

r^{p f} (t) = \sum_{k \in K} \log (C_{k} (t) + ε_{p f})

(23)

where

C_{k} (t)

is the instantaneous throughput of the user k, and

ε_{p f}

is a small positive constant to ensure numerical stability for logarithmic operation.

To encourage the fulfillment of a basic Quality of Service (QoS), we introduce an additional reward for exceeding a throughput threshold,

r^{b o n u s} (t) = \sum_{k \in K} I (C_{k} (t) > C_{t h r e s h})

(24)

where

I (\cdot)

is the indicator function, which equals 1 when the condition is true and 0 otherwise, and

C_{t h r e s h}

is the throughput threshold.

To guide exploration within a vast action space and under strict energy constraints, we introduce a position guidance reward

r_{i}^{p o s} (t)

through reward shaping. This reward penalizes the distance between a UAV and the geometric center of its user cluster served by

U_{i} (t)

, thus constraining the exploration space, preventing the model from converging to local optima and accelerating convergence.

r_{i}^{p o s} (t) = - ∥q_{i} (t) - U_{i} (t)∥

(25)

where the geometric center of the user cluster

U_{i} (t)

is calculated as

U_{i} (t) = \frac{1}{|U_{i} (t)|} \sum_{k \in U_{i} (t)} w_{k}^{2 D} (t), if |U_{i} (t)| > 0

(26)

where

U_{i} (t) \in R^{2}

is the 2D coordinate vector of the target cluster center for UAV i at time t,

U_{i} (t)

is the set of user indices assigned to UAV i,

|U_{i} (t)|

is the number of users in that cluster, and

w_{k}^{2 D} (t) \in R^{2}

is the 2D projection of user k’s position vector onto the horizontal plane.

\sum_{k \in U_{i} (t)} w_{k}^{2 D} (t)

is the sum of the position vectors of all users within the cluster.

To ensure that the UAVs generate physically feasible and energy-efficient smooth trajectories, we penalize abrupt changes in their actions,

r_{i}^{s m o o t h} (t) = - \sum_{i \in N} ∥a_{i}^{t r a j} (t) - a_{i}^{t r a j} (t - 1)∥

(27)

A collision penalty is incurred when the distance between any two UAVs is less than the safety distance

d_{s a f e}

,

r^{c o l l} (t) = \sum_{j \in N, j > i} r^{c o l l} \cdot I (∥q_{i} (t) - q_{j} (t)∥ < d_{s a f e})

(28)

where

r^{c o l l}

is a large negative constant.

A boundary penalty is applied when a UAV reaches the scenario boundaries,

r_{i}^{b o u n d} (t) = r^{b o u n d} \cdot I (q_{i} (t) \notin A)

(29)

where

A

represents the allowed 3D flight region and

r^{b o u n d}

is a negative constant. Unlike the collision and energy depletion penalties,

r_{i}^{b o u n d} (t)

is a relatively smaller negative value. This distinction is made because users are randomly distributed within the scenario; if a UAV’s cluster center is near the edge, the policy should be allowed to explore the boundary without incurring an excessive penalty, which would otherwise overly constrain the search space.

An energy depletion penalty is applied when the UAV’s remaining energy drops below zero,

r_{i}^{c r a s h} (t) = r^{c r a s h} \cdot I (E_{i} (t) \leq 0, E_{i} (t - 1) > 0)

(30)

where

r^{c r a s h}

is a large negative constant.

The following rationale behind this composite reward shaping is threefold:

Guidance: The dense signal $r_{i}^{p o s} (t)$ constrains the exploration space, guiding agents towards valid topological formations early in training.
Constraints: $r_{i}^{s m o o t h} (t)$ acts as a soft constraint to enforce physical feasibility, while $r_{i}^{c r a s h} (t)$ and $r_{i}^{b o u n d} (t)$ serve as hard safety constraints (sparse penalties) to prune unsafe state-action regions.
Optimization: Finally, $r^{p f} (t)$ and $r^{b o n u s}$ drive the policy towards the core objective of fair communication performance.

This decomposition ensures that agents learn how to fly safely before optimizing how to communicate efficiently, and the weights of each reward are shown in Table A2.

3.5. MAPPO-DC Algorithm

The MAPPO-DC algorithm proposed in this paper is built upon the MAPPO architecture and likewise adheres to the CTDE paradigm. Its workflow is detailed in Algorithm 2.

Algorithm 2 Multi-Agent Proximal Policy Optimization with Decoupled critics (MAPPO-DC)

1: Initialize: For each agent $i \in N$ :
2: actor network $π_{θ_{i}}$ with parameters $θ_{i}$
3: critic network $V_{ϕ_{i}}$ with parameters $ϕ_{i}$
4: Target critic network $V_{ϕ_{i}^{'}}$ with parameters $ϕ_{i}^{'} \leftarrow ϕ_{i}$
5: Initialize: Replay buffer $B$
6: for episode = 1, 2, ... do
7: Reset environment and get initial global state $s_{0}$ and local observations ${\{o_{i} (0)\}}_{i \in N}$
// Decentralized Execution & Data Collection
8: for t = 1, 2, ..., T do
9: for each agent $i \in N$ in parallel do
10: Select action $a_{i} (t) \sim π_{θ_{i}} (\cdot | o_{i} (t))$
11: end for
12: Execute joint action $a (t) = {\{a_{i} (t)\}}_{i \in N}$ in the environment
13: Receive next global state $s_{t + 1}$ , next local observations ${\{o_{i} (t + 1)\}}_{i \in N}$ , and individual rewards ${\{r_{i} (t)\}}_{i \in N}$
14: Store transition $(s_{t}, \{o_{i} (t)\}, \{a_{i} (t)\}, \{r_{i} (t)\}, s (t + 1), \{o_{i} (t + 1)\})$ in buffer $B$
15: end for
// Centralized Training with Decoupled critics
16: for training step = 1, 2, ..., $T_{e p o c h s}$ do
17: Sample a batch of trajectories from $B$
18: for each agent $i \in N$ in parallel do
19: // — critic Update —
20: Get global states from the batch
21: Calculate TD-target: $y_{i} (t) = r_{i} (t) + γ (1 - d (t)) V_{ϕ_{i}^{'}} (s (t + 1))$
22: Update critic $ϕ_{i}$ by minimizing the loss: $L (ϕ_{i}) = [{(V_{ϕ_{i}} (s (t)) - y_{i} (t))}^{2}]$
23: // — actor Update —
24: Get local observations $o_{i}$ from the batch
25: Compute advantage estimates ${\hat{A}}_{i} (t)$ using GAE with critic $V_{ϕ_{i}}$
26: Update actor $θ_{i}$ by maximizing the PPO-Clip objective:
27: $L_{clip} (θ_{i}) = [min (ρ_{t} (θ_{i}) {\hat{A}}_{i} (t), c l i p (ρ_{t} (θ_{i}), 1 - ε, 1 + ε) {\hat{A}}_{i} (t))]$
28: end for
29: end for
// Update Target Networks
30: for each agent $i \in N$ do
31: $ϕ_{i}^{'} \leftarrow τ ϕ_{i} + (1 - τ) ϕ_{i}^{'}$
32: end for
33: end for

The values of the network parameters in Algorithm 2 are shown in Table A2. The network architectures for the actor and critic are depicted in Figure 2 and Figure 3, respectively.

The actor network employs a shared backbone followed by two independent heads for trajectory and power control, which share weights. A Tanh activation function is used on the trajectory output to constrain the velocity, while a Softmax layer normalizes the logits for power allocation. In the input stage, the remaining energy, being a one-dimensional parameter, is expanded to an 8-dimensional vector via dimension expansion to balance its feature scale relative to other input variables.

Specifically, the global state space is dominated by high-dimensional vectors representing UAV and user positions. Directly concatenating the scalar energy state

E_{i} (t)

would cause a dimension imbalance problem [15], where the critical energy constraints are prone to be submerged by high-dimensional location features during gradient backpropagation. To address this, we employ a dimension spread technique (also known as feature embedding) to map the scalar energy to an

N_{e}

-dimensional vector. We set

N_{e} = 8

based on the following two considerations:

Feature Scale Alignment: This value brings the energy feature representation to a comparable order of magnitude as the user state vectors (e.g., historical throughput), preventing any single feature group from dominating the network’s initial learning phase. This setting is consistent with the findings in [15], where an expansion to $N_{e} = 10$ was shown to be essential for the agent to successfully learn energy-aware navigation tasks.
Convergence Stability: By increasing the dimensionality of the energy input, we effectively increase the number of synaptic weights connecting the energy state to the first hidden layer. This amplifies the gradient signal related to energy consumption, thereby avoiding convergence to local optima where agents maximize throughput but fail to satisfy flight endurance constraints.

In contrast to the single critic in standard MAPPO, our MAPPO-DC algorithm, inspired by the structure of IPPO, assigns a dedicated critic network with distinct parameters

ϕ_{i}

for each agent. Figure 3 illustrates the architecture for critic

V_{ϕ_{i}}

. Unlike IPPO, each critic in MAPPO-DC receives the global state as input. Although all critics share the same input of the global state

s (t)

, they learn personalized value functions through their unique interactions with their respective paired actors.

Specifically, each critic

V_{ϕ_{i}}

solves a regression problem. Its objective is to find a set of parameters

ϕ_{i}^{*}

that minimizes the prediction error for future cumulative rewards. This objective can be formally expressed as

ϕ_{i}^{*} = arg {min_{ϕ_{i}}}_{τ \sim P (τ | {\vec{π}}_{\bar{θ}})} [\sum_{t = 0}^{H} {(V_{ϕ_{i}} (s (t)) - G (t))}^{2}]

(31)

where

G (t) = \sum_{l = 0}^{\infty} γ^{l} r_{t + l + 1}

, the true value of the discounted return

G (t)

is only known at the end of an episode. In practice, we use the TD-target

y_{i} (t) = r_{i} (t) + γ (1 - d (t)) V_{ϕ_{i}^{'}} (s (t + 1))

as its unbiased estimator. Therefore, the gradient update for critic

V_{ϕ_{i}}

follows

\nabla_{ϕ_{i}} L (ϕ_{i}) \propto E_{{\vec{π}}_{\vec{θ}}} [(V_{ϕ_{i}} (s (t)) - y_{i} (t)) \nabla_{ϕ_{i}} V_{ϕ_{i}} (s (t))]

(32)

where expectation

E_{{\vec{π}}_{\vec{θ}}}

is taken over the state-action stream generated by the joint policy of the entire team.

Although the input is the global state, the critic’s objective is not simply to learn the unconditional value of the state. Instead, it implicitly learns a conditional value function: given the state, what is the team’s expected future return assuming that the agent will follow policy? This implicit conditioning is achieved through the statistical averaging effect of the stochastic gradient descent.

Decomposing the joint action into agent i’s action

a_{i} (t)

and the joint action of all other agents

a_{- i} (t)

. Both the reward

r_{i} (t)

and the state transition

p (s (t + 1) | s (t), a (t))

are functions of these two components. From the perspective of critic

V_{ϕ_{i}}

, we obtain the following:

$a_{i} (t) \sim π_{θ_{i}} (\cdot | o_{i} (t))$ : This is generated by its paired actor i, and its behavior pattern is tightly coupled with the critic’s learning process through the advantage function ${\hat{A}}_{i} (t)$ .
$a_{- i} (t) \sim π_{θ_{- i}} (\cdot | o_{- i} (t))$ : This is generated by all other actors. Since the output of critic $V_{ϕ_{i}}$ does not directly influence the updates of other agents, the behavior of $a_{- i} (t)$ can be treated as part of the environmental stochasticity from the perspective of critic $V_{ϕ_{i}}$ ’s optimization process, introducing high-variance noise to the value estimation of $r_{i} (t)$ and $s (t + 1)$ .

According to the law of total expectation, we can decompose the expected gradient update,

E_{s (t)} [\nabla V_{ϕ_{i}} (s (t))] = E_{s (t)} [E_{a_{i} \sim π_{θ_{i}} | s (t)} [E_{a_{- i} \sim π_{θ_{- i}} | s (t)} [\nabla V_{ϕ_{i}} (s (t))]]]

(33)

The update for critic

V_{ϕ_{i}}

is driven by the advantage signal

{\hat{A}}_{i} (t)

. From its point of view, the actions of other agents

a_{- i} (t)

introduce high-variance noise into the global state observation

s (t)

, as these actions lack a direct causal link to agent i’s policy update. By performing a stochastic gradient descent over a large number of samples, critic

V_{ϕ_{i}}

learns to average out this noise and focus on the stable correlation between the behavior of its paired actor and the resulting reward.

This statistical decoupling enables each critic to function as a specialized value estimator. It effectively learns a projection from the global state to a subspace most relevant to its actor’s role, enabling precise and personalized credit assignment even within a centralized training framework. In contrast, standard MAPPO employs a single, centralized critic that regresses towards a global return shared by all agents. While this mechanism promotes cooperation, in loosely coupled tasks, it dilutes agent i’s local reward signal within the global average, leading to credit assignment ambiguity for individual contributions and potentially stifling the exploration of individually optimal behaviors. MAPPO-DC addresses this issue by equipping each agent with its own dedicated critic.

4. Simulation Results

To ensure fairness in all hyperparameter adjustments, baseline comparisons, and ablation studies, all experiments were conducted using the same random seed. In addition, a dedicated scenario library was established before the training, consisting of a training scenario set and a testing scenario set. In the training set, only the initial positions of the UAV and the user are initialized. The test set not only initializes the location, but also fixes the user trajectory. This design ensures that the training process can explore as wide a range of strategies as possible, while ensuring fairness and reproducibility in the testing phase. The ratio of the training scenario set to the testing scenario set is set to 5:1.

In addition, in all MARL algorithms, the learning rates of critic and actor networks adopt linear decay. In order to continuously monitor the performance trend of the model, the current model is tested every 100 training cycles in the entire testing scenario set. The average performance of all test scenarios is recorded as the performance metric of the current model. To ensure that the test results do not affect the training dynamics, the testing process is completely independent of the training process.

To ensure a fair comparison, all experiments were implemented using Python 3.8 and PyTorch 1.12 on a platform with Intel i9-13900K CPU, GeForce RTX 4080 GPU. All reinforcement learning algorithms were trained for a fixed budget of 100,000 episodes. For the MARL baselines (IPPO, MAPPO, MAPPO-DC, MASAC, MADDPG), we maintained identical backbone network architectures (e.g., three hidden layers with 64 units each) and common hyperparameters, differing only in algorithm-specific parameters (e.g., entropy coefficients for MASAC). Additionally, non-learning baselines (greedy, APF, convex) were implemented using standard configurations from the literature to serve as rigorous benchmarks.

4.1. Comparative Study of Clustering Algorithms

We compared our proposed clustering method with the widely used K-means algorithm. Although K-means assigns a small number of users with low variance to each user cluster [32], its objective function (minimizing intra-cluster variance) is purely geometric and independent of communication constraints. This may result in users being assigned to geometrically optimal but physically distant UAVs, leading to service interruptions. Our method aims to avoid this critical issue by prioritizing geographical proximity.

Figure 4 illustrates that K-means achieves a more balanced user load distribution. Figure 5 demonstrates that in terms of individual user performance, the Nearest Neighbor Assignment + Smart Offloading scheme exhibits superior performance for each user, as users consistently experience optimal or suboptimal channel conditions. Figure 6 shows that in terms of overall system performance, the two schemes are comparable. However, in terms of the presence of unserved users, the Nearest Neighbor Assignment + Smart Offloading scheme ensures that all users are served across all episodes and all time steps, provided that the number of crashed UAVs is less than two.

We trained the MAPPO-DC algorithm architecture combining nearest neighbor assignment and intelligent offloading under 5 different random seeds, and counted the number of truncation operations mentioned in Algorithm 1, as well as the average and variance of other key indicators, the results are shown in Table 2. From the results, it can be observed that in the entire training episode (100,000 episodes), there were only a maximum of 5 truncation operations, all of which occurred before the crashed zeroing episode, indicating that this situation occurred very rarely; In addition, we observed that the variance of TP and PF under different random seeds is very small, while the variance of the zero episode of crashed is relatively large. This is because the performance limit that the model can reach is relatively high. It can also be said that the best solution explored by the model in the current scenario is relatively fixed. The plan to zero crashed was explored in the early stage of training, when the model is in the random exploration stage, so the variance is relatively large.

In summary, prioritizing the scenario requirements, we find that ensuring all users are served is preferred over strictly balancing the UAV load. Therefore, we select the Nearest Neighbor Assignment + Smart Offloading scheme as the final user clustering solution.

4.2. Clustering Frequency Experiment

The scenario studied involves mobile ground users that show high dynamism. Consequently, the clustering algorithm must dynamically update and iterate to ensure an effective follow-up service by the UAVs. This experiment seeks to determine the optimal dynamic clustering frequency.

In this experiment, the optimal clustering frequency is jointly determined by two parameters: the final test performance (average of the last 1000 episodes) and the episode count for stable crash zeroing (the episode after which the crash count consistently remains zero).

From the throughput (TP) and PF curves in Figure 7 and the numerical results in Table 3, we observe that a clustering frequency (CF) of 10 performs relatively poorly early in training compared to other frequencies. However, during the convergence stage, CF = 10 yields the highest TP and PF. In contrast, the Reward curve in Figure 7 and the Pareto frontier scatter plot in Figure 8 indicate that CF = 10 requires the longest time for convergence and the slowest stable crash zeroing among the four tested frequencies. The lower clustering frequency (less frequent user cluster reassignment) stabilizes the user groups faced by the agents during training, allowing them to learn superior long-term strategies. However, since the environment changes more slowly, it takes longer to accumulate experience and fully explore the strategy space, resulting in slower convergence.

While CF = 10 exhibits the slowest convergence speed compared to higher frequencies, it ultimately achieves the highest converged TP and PF scores. This presents a trade-off between training efficiency and final inference performance. From a practical deployment perspective, model training is a one-time offline cost, whereas the converged policy’s performance dictates the system’s long-term utility during online execution. Since CF = 10 provides a more stable interaction window that allows agents to learn superior long-term strategies—albeit requiring more episodes to generalize across varying topologies—we prioritize the final performance gains over training speed. Therefore, we select a clustering frequency of 10 as the final parameter setting.

4.3. Comparison with Non-Learning Algorithms

We selected the greedy algorithm [11,33] as the primary non-learning baseline for comparison. This algorithm is a typical method that maximizes the instantaneous reward currently available. This study demonstrates the superiority of learning-based algorithms over non-learning ones under energy-constrained conditions.

Since the Greedy algorithm is a non-learning method, it does not undergo a learning and convergence process. Therefore, we compare its average performance across 200 test scenarios. For MAPPO-DC, we load the converged model and evaluate its performance on the same 200 test scenarios. The “average crashed” metric here refers to the average number of crashed UAVs in the test scenarios.

It can be seen from Table 4 that our algorithm demonstrates a 5.826% improvement in TP, a 16.7 point increase in PF, and a 22.72% improvement in the Jain’s fairness index compared to the Greedy algorithm. Furthermore, as depicted in the performance curves from our experiments, the average crash count for MAPPO-DC decreases from an initial value of 0.26 to a stable convergence at 0 after approximately 20,000 training epochs. In contrast, the Greedy algorithm, which focuses on maximizing instantaneous rewards (where the position reward has the highest weight), leads to frequent and drastic maneuvers that rapidly deplete energy, ultimately failing to sustain continuous service. Consequently, its crash count remains at a persistently high level.

In conclusion, this experiment validates the need to use learning-based algorithms to effectively manage energy consumption in energy-constrained scenarios.

4.4. Comparison with Decoupled Optimization Methods

In the decoupled optimization experiment, we compare our method with two common approaches from existing research: trajectory optimization based on the artificial potential field (APF) method [34,35] and power optimization using a convex optimizer [8]. We denote the scheme combining MAPPO-DC for power optimization and APF for trajectory optimization as MAPPO-DC-P+APF-T. The scheme combining MAPPO-DC for trajectory optimization and convex optimization for power control is denoted as MAPPO-DC-T+Convex-P.

Figure 9 reveals that the advantage of joint optimization with MAPPO-DC over decoupled methods is substantial, primarily because the joint optimization scheme enables global observation and effective credit assignment. The results show that the MAPPO-DC joint optimization scheme outperforms the decoupled schemes in both PF and crash avoidance.

In the MAPPO-DC-P+APF-T scheme, the UAV trajectories are controlled by the non-learning APF algorithm. Since the APF method cannot adjust the trajectories based on the energy remaining of the UAV’s, the crash count fails to decrease. Although it achieves a high TP, its PF remains consistently low.

In the MAPPO-DC-T+Convex-P scheme, although MAPPO-DC can reduce the crash count to a low level by controlling flight energy consumption—which constitutes a major part of the total energy—it lacks control over communication power. While communication energy consumption is relatively small, the lack of direct control over this component prevents the critic from accurately assessing the risks associated with total energy expenditure, leading to a credit assignment failure. As shown in Figure 9, around the 19 k training epoch mark, the crash curve for the MAPPO-DC-T+Convex-P scheme begins to rise instead of fall. This occurs because the policy, after 19 k epochs, starts to push the energy limits in pursuit of higher PF, causing a temporary resurgence in crashes until the final policy stabilizes.

In summary, this experiment demonstrates that in energy-constrained scenarios, a learning algorithm must have control over the entire energy consumption profile to effectively mitigate the risk of UAV crashes and ensure continuous service to all users within the specified mission time.

4.5. Comparison with Other MARL Algorithms

The MAPPO-DC algorithm we propose is designed by integrating the multi-critic structure of IPPO [19,20] into the MAPPO framework. This design solves the problems in loosely coupled tasks by balancing the coordination between intelligent agents and the individual performance of individual agents. Therefore, in this experiment, we mainly compared MAPPO-DC with its basic algorithms, MAPPO and IPPO. In addition, we also compared the MASAC and MADDPG algorithms. Although we conducted extensive parameter tuning experiments, the MADDPG algorithm experienced gradient explosion after about 5500 episodes of training. Therefore, we did not add the training curve of MADDPG to Figure 10.

From the training time of each algorithm in Table 5, it can be seen that, compared to off policy IPPO, MAPPO, and MAPPO-DC algorithms, on policy MASAC and MADDPG algorithms require longer training time due to the need to maintain a larger relay buffer during the training process, while MASAC requires more computation time due to the inclusion of a double-Q network to eliminate high estimation bias. In addition, due to the gradient explosion that occurred during MADDPG training up to 5.5 k, we estimated the training time for 100 k episodes based on the time taken for the first 5.5 k episodes.

As shown in Figure 9 and Figure 10, both the MAPPO-DC-T+Convex-P scheme in Section 4.4 and the IPPO experiment in this section suffer from credit assignment failure, which introduces high-variance noise into the decision-making process. Specifically, for agent i in IPPO, its critic only receives observations from its corresponding actor. Consequently, when the actions of other agents result in negative feedback for agent i, it can neither identify the source of this negative feedback nor devise a solution, a phenomenon known as credit assignment failure. In the MAPPO-DC-T+Convex-P architecture, MAPPO-DC treats the uncontrollable communication power as noise from environmental non-stationarity. Similarly, in the IPPO architecture, each agent i treats all other agents

j (j \neq i)

as noise from environmental non-stationarity. After the 21,000 training epochs, IPPO’s crash curve also begins to rise, which, while increasing TP, leads to a decrease in PF. At this stage, IPPO treats crashes as a sacrificial parameter in exchange for higher TP and PF.

MADDPG failed to converge due to severe overestimation bias and the limitations of deterministic policies in stochastic environments, leading to gradient explosion early in training. Meanwhile, MASAC suffered from initial overfitting followed by a collapse into sub-optimal local safety policies. The failure of MADDPG and the instability of MASAC highlight a critical theoretical insight: in highly dynamic scenarios with random user mobility, off-policy methods (MASAC, MADDPG) significantly underperform on-policy methods (IPPO, MAPPO, MAPPO-DC). This is primarily because the historical experiences stored in the replay buffer become obsolete rapidly due to environmental non-stationarity (distributional shift), causing the critic to learn from inaccurate state distributions. In contrast, on-policy methods, which optimize based on fresh samples from the current policy, demonstrate superior adaptability. Consequently, MADDPG is excluded from Figure 10, and a detailed analysis of both algorithms is provided in Appendix A.

In scenarios where strong agent coordination is not required, MAPPO suffers from credit assignment ambiguity. That is, when agents i and j execute their respective actions and the critic receives positive feedback, it cannot precisely determine whether the positive feedback originated from agent i or agent j. As seen in the comparison between the MAPPO-DC and MAPPO curves in Figure 10, their reward, TP, collision, and out-of-bounds curves are nearly identical. However, the crash curve for MAPPO-DC decreases and converges to zero much faster than that of MAPPO. This indicates that the decoupled critics provide a clearer and more direct penalty signal for unsafe actions, enabling the agents to learn safe operational boundaries more quickly. The crash and PF curves further reveal that by rapidly pruning the vast unsafe regions of the state-action space, MAPPO-DC allows agents to focus their exploration on promising areas for performance optimization. This leads to the initial faster improvement in PF seen in Figure 10. Furthermore, MAPPO-DC achieves a higher final PF than MAPPO, confirming that the decoupled critic design successfully achieves our objective of maximizing individual agent performance while maintaining necessary coordination.

To quantitatively evaluate the efficacy of credit assignment mechanisms, we introduce the Variance of Advantage Estimates as a metric. In MARL, high advantage variance typically indicates ambiguous credit assignment, where an agent’s value estimation is heavily severely perturbed by the non-stationary behaviors of others (noise). Conversely, a lower variance implies that the critic has successfully isolated the agent’s specific contribution from the environmental noise, leading to more stable and precise policy updates.

As can be seen from Figure 11, the mean and standard deviation of the advantage variance of MAPPO-DC are the lowest, which confirms that the decoupled critic architecture can effectively isolate the specific contributions of each agent in loosely coupled scenarios, reducing the noise introduced by the dynamic behavior of other agents. The advantage variance of MAPPO is higher than that of MAPPO-DC, indicating that centralized critics introduce coupling noise when dealing with non-stationarity, that is, the value estimation of agents is sensitive to the action fluctuations of other agents; IPPO exhibits relatively low advantage variance, but its error bars show significant dispersion, indicating that although agents often adopt strategies that ignore interactions (resulting in low variance in steady states), they exhibit severe instability when inevitable conflicts occur, as evidenced by the large lower limit of the error bars; MASAC exhibits the highest advantage variance because it is a maximum entropy algorithm designed to encourage exploration, and compared to trust region-based methods such as PPO, its policy randomness and non-strategic nature of value estimation introduce higher variance in advantage computation.

In conclusion, the simulation experiment results validate our core hypothesis: the decoupling critic mechanism in MAPPO-DC effectively solves the problem of credit assignment failure caused by the introduction of high variance noise in IPPO, as well as the problem of credit assignment ambiguity in the MAPPO centralized critic, which suppresses individual performance. The MAPPO-DC algorithm achieves a more efficient and secure learning process without affecting the necessary coordination of multi-agent tasks. In addition, we compared the experimental results of MASAC and MADDPG and found that the off-policy-based algorithms (MASAC, MADDPG) performed worse than the on-policy-based algorithms (MAPPO-DC, MAPPO, IPPO) in highly dynamic random environments.

4.6. Model Robustness Experiment

To validate the robustness of the proposed algorithm under severe uncertainty and explore its applicability boundaries, we conducted additional experiments with enhanced stochasticity. Note that MADDPG is excluded from this analysis as it failed to converge even in the baseline scenario (Section 4.5); evaluating it in this more complex environment would yield no additional analytical value. We introduced the following three major environmental stressors:

Co-channel Interference: Switching from OFDMA to a full frequency reuse scheme, where all UAVs share the entire spectrum, thereby introducing strong inter- UAV interference.
Heterogeneous Bandwidth: Replacing the equal bandwidth division with an inverse-distance-weighted allocation scheme (an edge-user protection strategy) to simulate uneven resource distribution.
Energy Consumption Noise: Adding Gaussian perturbations to the energy consumption model to mimic unmodeled external disturbances (e.g., wind gusts).

As illustrated in Figure 12, despite the increased environmental uncertainty, both MAPPO and MAPPO-DC successfully converged, demonstrating inherent robustness. The converged system throughput is lower compared to the interference-free scenario (Section 4.5). From the user’s perspective, although full frequency reuse increases the available bandwidth per user, the SINR degrades significantly due to severe co-channel interference. This is particularly detrimental to users at the cluster edges, who suffer from high interference power from neighboring UAVs, leading to a net reduction in TP and PF.

A critical observation from the crash curves (comparing Figure 10 and Figure 12) is that with energy noise, the number of episodes required for crash zeroing increased by approximately 15 k. This indicates the model needs more time to learn a “safety margin”—effectively reserving extra battery to buffer against

3 σ

noise fluctuations. Comparing this to the MAPPO-DC-T+Convex-P scheme (Section 4.4) reveals a key insight: In this experiment, the noise follows a statistical distribution, which the critic network can approximate, allowing the actor to learn a robust margin. In contrast, the convex optimizer in Section 4.4 behaves as a non-stationary or adversarial “black box”—slight position changes cause drastic, non-linear jumps in power consumption. This structural unpredictability is harder to adapt to than statistical noise, explaining why the hybrid convex scheme failed to eliminate crashes completely, whereas the end-to-end MAPPO-DC succeeded here.

The trends for MASAC and IPPO remain consistent with Section 4.5. MASAC exhibits instability (overfitting then local optima), and IPPO sacrifices safety (high crashes) for short-term reward gains.

In this strongly coupled scenario, MAPPO and MAPPO-DC achieved similar converged performance (MAPPO-DC: TP ≈ 23.80 Mbps, PF ≈ 357.61; MAPPO: TP ≈ 23.98 Mbps, PF ≈ 356.34). This similarity arises because both algorithms utilize global states for value estimation, enabling global coordination. However, MAPPO-DC achieved crash zeroing significantly faster than MAPPO. This confirms that even under strong coupling, the decoupled critic architecture provides clearer credit assignment for individual safety constraints, accelerating the learning of safe flight policies.

Figure 13 presents the Advantage Variance in the strongly coupled scenario. Compared to the loosely coupled scenario (Figure 11), the advantage variance for all algorithms increased. This confirms that co-channel interference elevates the coupling strength, making the system performance more dependent on complex multi-agent coordination rather than individual actions, thereby blurring the credit assignment. IPPO exhibits the lowest variance. However, this does not indicate superior credit assignment. Instead, it reflects a trivial solution: to mitigate interference penalties without coordination capabilities, IPPO agents learned a “spatial repulsion” strategy—staying far apart to minimize interaction. This artificially reduces the variance of interaction terms. Consistent with previous results, MASAC shows the highest variance, confirming its poor robustness against uncertainty.

Unlike the loosely coupled case, MAPPO-DC shows slightly higher variance than MAPPO here. This reveals the following trade-off of decoupling:

Smoothing Effect of Centralization: MAPPO’s centralized critic predicts a unified team reward, effectively averaging out individual fluctuations caused by intense interference, resulting in a statistically smoother estimate.
Challenge of Decoupling: MAPPO-DC attempts to isolate individual contributions from this highly entangled signal. In strong coupling, this “forced decoupling” is computationally challenging and more sensitive to interference noise, leading to slightly higher variance.

This explains why MAPPO-DC’s PF is marginally lower (due to variance-induced gradient noise), but its safety learning remains superior (as safety constraints remain largely local).

Comparing the loosely coupled (Section 4.5) and strongly coupled (Section 4.6) scenarios allows us to delineate the following clear operational boundaries:

MAPPO (Centralized Critic): Optimal for Interference-Limited Scenarios where maximizing system capacity via complex coordination is the sole priority. If the mission tolerates slower safety convergence to squeeze out the absolute maximum throughput, standard MAPPO is preferable.
MAPPO-DC (Decoupled Critic): Dominates in Safety-Critical and Rapid-Response Missions. It offers comprehensive advantages in loosely coupled tasks and remains highly competitive in strongly coupled settings. Its ability to swiftly learn collision-avoidance policies (fast Crash Zeroing) makes it the robust choice for real-world deployments, offering the best trade-off between optimality, safety, and training efficiency.

5. Conclusions

This paper addresses the challenge of energy-constrained and fairness-aware resource allocation within multi-UAV emergency communication networks. We proposed a hierarchical framework centered around MAPPO-DC, a novel MARL algorithm specifically designed for loosely coupled task structures, which arise from decoupling user association from trajectory and power control. The framework employs a lightweight clustering algorithm for efficient user assignment, while MAPPO-DC enables UAV agents to learn customized control policies. The algorithm successfully maximizes the system’s PF throughput while ensuring the necessary coordination for the multi-agent task, providing a resource allocation strategy that balances both efficiency and fairness for energy-limited emergency communication networks.

Multiple simulation experiments have drawn the following conclusions: Firstly, the joint optimization of trajectory and power control is essential for reliable operation. When decoupling is introduced into the methodology, the uncontrolled subsystem acts as an irreducible “black box,” inevitably leading to credit assignment failures and persistent crashes. Secondly, our proposed MAPPO-DC algorithm significantly outperforms standard benchmarks such as IPPO and MAPPO. It successfully avoids the non-stationarity pitfalls of independent learning (IPPO) while demonstrating superior sample efficiency and performance compared to the overly centralized value learning of standard MAPPO. This validates the effectiveness of specialized credit assignment for loosely coupled tasks. Finally, our robustness analysis across varying coupling levels reveals a clear operational boundary: MAPPO-DC demonstrates superior efficacy and safety convergence in loosely coupled environments, while retaining high adaptability and competitive performance even when subjected to strong co-channel interference. This validates MAPPO-DC as a robust, safety-aware solution suitable for diverse dynamic communication missions.

Future work can extend this research in several directions. The energy consumption model could be refined to incorporate more complex factors such as wind and acceleration. Additionally, the inclusion of Vehicle-to-Vehicle (V2V) communication could be explored as a means to introduce more explicit coordination, further enhancing cooperative efficiency.

Author Contributions

Conceptualization, G.M.; formal analysis, Q.L.; methodology, Z.Z.; validation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, C.H.; supervision, C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant U23A20271, in part by the Fundamental Research Funds for the Central Universities under Grant 3072025ZX0804.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In standard MASAC implementations, soft updates are typically used to maintain training stability. However, due to the highly dynamic user random movement characteristics in this scenario, old experiences will quickly become ineffective. Therefore, we adopted a high-frequency gradient update strategy. This strategy aims to maximize the use of the latest samples, forcing the critic network to quickly keep up with changes in the environment. Although this leads to severe overfitting in the early stages of training, it also gives the algorithm the ability to quickly correct errors and restructure strategies after collapse.

As shown in Figure A1 and Figure A3, the MASAC learning curve exhibits a distinct three-stage pattern, driven by the interaction between its maximum entropy mechanism and the experience replay buffer:

Initial Overfitting (0–2 k episodes): Initially, agents benefit from the optimal positioning provided by the hierarchical clustering. Since the replay buffer is sparse and contains low-variance data, the high-frequency updates cause the critic to rapidly “overfit” to the initial static topology. The actor implicitly adopts a passive hovering strategy, resulting in artificially high throughput and near-zero crashes, but lacking generalization capability.
Exploration and Collapse (2–5 k episodes): As the entropy maximization term drives the agents to explore the state space and users begin to move, the environment dynamics shift drastically. The replay buffer becomes a mixture of initial stable samples and new exploratory failures. This introduces a severe distributional shift—where historical data no longer accurately reflects the current policy’s performance. Consequently, the critic’s value estimation degrades, causing the policy to collapse, as evidenced by the surge in crash counts and the sharp drop in total reward.
Convergence to Local Optima (5 k+ episodes): To recover from frequent crashes, the agents learn to prioritize the heavy crash penalties over the throughput rewards. Unlike MAPPO, which uses a trust region to safely fine-tune performance near the boundary of safety, MASAC converges to a conservative “local optimum.” It sacrifices a significant portion of theoretical throughput (TP drops and stabilizes at a lower level) to maintain a wide safety margin. This confirms that without on-policy stability, it is difficult to achieve the fine-grained coordination required for global optimality in this task.

Figure A1. Performance curve of MASAC algorithm in scenarios without introducing co-channel interference.

Although MASAC successfully converged to the security policy, its peak performance (about 50 Mbps) is still lower than MAPPO (about 58 Mbps). This gap highlights the sensitivity of the off-policy approach (MASAC) to the non-stationarity of multi-UAV environments, where outdated data in the replay buffer hinders fine optimization. In contrast, the same strategy feature of MAPPO ensures monotonic improvement towards higher global optimal solutions.

As shown in the Figure A2, the MADDPG algorithm exhibits significant instability and cannot converge. Specifically, after approximately 5500 episodes, critic loss experienced exponential growth (gradient explosion), leading to performance collapse.We believe that two inherent limitations of MADDPG in this highly dynamic scenario led to this failure as follows:

Overestimation bias: Unlike our proposed method and MASAC, standard MADDPG lacks mechanisms like Double-Q learning to mitigate value overestimation. In reward intensive environments, such errors can quickly accumulate.
Limitations of deterministic strategies: User behavior in the environment is stochastic. Deterministic strategies are difficult to adapt to the inherent randomness of environmental states, leading to severe oscillations in value estimation. Even with gradient clipping applied, the divergence phenomenon still indicates that deterministic non-strategic methods have poor robustness for this specific multi-UAV dynamic coverage task.

Figure A2. Performance curve of MADDPG algorithm in scenarios without introducing co-channel interference.

Figure A3. Performance curve of MASAC algorithm in introducing co-channel interference scenarios.

Table A1. Simulation parameters settings.

Parameter	Notion	Value
Spatial Area (m)	$A$	1000 m × 1000 m × 50 m
Number of Agents (UAVs)	N	5
Number of users	K	30
User Movement Speed	$v_{u}$	1 m/s
UAV Maximum Speed	$v_{m a x}$	25 m/s
Safe Distance Between UAVs	$d_{s a f e}$	10 m
Range of Served Users		[4, 8]
Flight Altitude Range		[20 m, 40 m]
Max. Communication Power	$P^{c o m m}$	0.1 W
Min. Allocated Power	$P_{m i n}$	1 × 10⁻⁶ W
Allocatable Bandwidth	$B_{t o t a l}$	10 Mbps
Hovering Power	$P^{h o v}$	150 W
Motion Power Range	$(P_{m i n}^{f l y}, P_{m a x}^{f l y})$	[150 W, 500 W]
Max. Battery Capacity	$E_{m a x}$	65,000 J
Number of Training Scenarios		1000
Number of Test Scenarios		200
Max. Training Episodes		100,000
Time Steps per Episode	$T_{m a x}$	200

Table A2. Network parameter table.

Parameter	Symbol	Value
Discount Factor	$γ$	0.99
GAE Lambda	$λ$	0.95
PPO Clip Range	$ε$	0.2
PPO Epochs	$K_{e p o c h s}$	5
Actor Learning Rate	$a_{a c t o r}$	1 × 10⁻⁵ (Linear Decay)
Critic Learning Rate	$a_{c r i t i c}$	5 × 10⁻⁵ (Linear Decay)
Entropy Coefficient	$c_{e n t}$	0.02
Batch Size	B	800 (4 × 200)
Max Gradient Norm		0.5
Optimizer		Adam (eps = 1 × 10⁻⁵)
Target critic soft update	$τ$	0.01
Shared reward weight	$w_{s h a r e}$	1.1
Individual reward weight	$w_{i n d}$	0.9
Collision penalty weight	$w_{c o l l}$	−80
PF reward weight	$w_{p f}$	1.0
Weight of user satisfaction reward	$w_{b o n u s}$	5.0
Position guided reward weight	$w_{p o s}$	0.8
Motion smoothing penalty weight	$w_{s m o o t h}$	−0.1
Boundary penalty weight	$w_{b o u n d}$	−10
Cracked penalty weight	$w_{c r a s h}$	−200

References

Zhao, N.; Lu, W.; Sheng, M.; Chen, Y.; Tang, J.; Yu, F.R.; Wong, K.K. UAV-assisted Emergency Networks in Disasters. IEEE Wirel. Commun. 2019, 26, 45–51. [Google Scholar] [CrossRef]
Wang, C.; Deng, D.; Xu, L.; Wang, W. Resource Scheduling Based on Deep Reinforcement Learning in UAV Assisted Emergency Communication Networks. IEEE Trans. Commun. 2022, 70, 3834–3848. [Google Scholar] [CrossRef]
Naqvi, S.A.R.; Hassan, S.A.; Pervaiz, H.; Ni, Q. Drone-Aided Communication as a Key Enabler for 5G and Resilient Public Safety Networks. IEEE Commun. Mag. 2018, 56, 36–42. [Google Scholar] [CrossRef]
Zeng, Y.; Zhang, R.; Lim, T.J. Wireless Communications with Unmanned Aerial Vehicles: Opportunities and Challenges. IEEE Commun. Mag. 2016, 54, 36–42. [Google Scholar] [CrossRef]
Zeng, Y.; Zhang, R.; Lim, T.J. Throughput Maximization for UAV-Enabled Mobile Relaying Systems. IEEE Commun. Mag. 2016, 64, 4983–4996. [Google Scholar] [CrossRef]
Cui, J.; Liu, Y.; Nallanathan, A. Multi-Agent Reinforcement Learning-Based Resource Allocation for UAV Networks. IEEE Trans. Wirel. Commun. 2020, 19, 729–743. [Google Scholar] [CrossRef]
Yu, Z.; Gong, Y.; Gong, S.; Guo, Y. Joint Task Offloading and Resource Allocation in UAV-Enabled Mobile Edge Computing. IEEE Internet Things J. 2020, 7, 3147–3159. [Google Scholar] [CrossRef]
Wu, Q.; Zeng, Y.; Zhang, R. Joint Trajectory and Communication Design for Multi-UAV Enabled Wireless Networks. IEEE Trans. Wirel. Commun. 2018, 17, 2109–2121. [Google Scholar] [CrossRef]
Mkiramweni, M.E.; Yang, C.; Li, J.; Zhang, W. A Survey of Game Theory in Unmanned Aerial Vehicles Communications. IEEE Commun. Surv. Tuts. 2019, 21, 3386–3416. [Google Scholar] [CrossRef]
Zeng, Y.; Zhang, R. Energy-Efficient UAV Communication with Trajectory Optimization. IEEE Trans. Wirel. Commun. 2017, 16, 3747–3760. [Google Scholar] [CrossRef]
Wu, X.Z.; Srikant, R.; Perkins, J.R. Scheduling Efficiency of Distributed Greedy Scheduling Algorithms in Wireless Networks. IEEE. Trans. Mob. Comput. 2007, 6, 595–605. [Google Scholar] [CrossRef]
Yang, X.; Qin, D.; Liu, J.; Li, Y.; Zhu, Y.; Ma, L. Deep Reinforcement Learning in NOMA-Assisted UAV Networks for Path Selection and Resource Offloading. Ad Hoc Netw. 2023, 151, 103285. [Google Scholar] [CrossRef]
Zhang, T.; Lei, J.; Liu, Y.; Feng, C.; Nallanathan, A. Trajectory Optimization for UAV Emergency Communication with Limited User Equipment Energy: A Safe-DQN Approach. IEEE Trans. Green Commun. Netw. 2021, 5, 1236–1247. [Google Scholar] [CrossRef]
Zhang, S.; Zeng, Y.; Zhang, R. Cellular-Enabled UAV Communication: A Connectivity-Constrained Trajectory Optimization Perspective. IEEE Trans. Commun. 2019, 67, 2580–2604. [Google Scholar] [CrossRef]
Ding, R.; Gao, F.; Shen, X.S. 3D UAV Trajectory Design and Frequency Band Allocation for Energy-Efficient and Fair Communication: A Deep Reinforcement Learning Approach. IEEE Trans. Wirel. Commun. 2020, 19, 7796–7809. [Google Scholar] [CrossRef]
Xie, H.; Zhang, T.; Xu, X.; Yang, D.; Liu, Y. Joint Sensing, Communication, and Computation in UAV-Assisted Systems. IEEE Internet Things J. 2024, 11, 29412–29426. [Google Scholar] [CrossRef]
Wang, X.; Wan, X.; Ji, H.; Hu, H.; Chen, Z.; Xiao, Y. Joint UAV Deployment and User Scheduling for Wireless Powered Wearable Networks. IEEE Internet Things J. 2024, 11, 21299–21311. [Google Scholar] [CrossRef]
de Witt, C.S.; Gupta, T.; Makoviichuk, D.; Makoviychuk, V.; Torr, P.H.S.; Sun, M.; Whiteson, S. Is independent learning all you need in the starcraft multi-agent challenge? arXiv 2020, arXiv:2011.09533. [Google Scholar]
Lee, W.; Kim, T. Multiagent Reinforcement Learning in Controlling Offloading Ratio and Trajectory for Multi-UAV Mobile-Edge Computing. IEEE Internet Things J. 2024, 11, 3417–3429. [Google Scholar] [CrossRef]
Omidshafiei, S.; Pazis, J.; Amato, C.; How, J.P.; Vian, J. Deep Decentralized Multi-Task Multi-Agent Reinforcement Learning Under Partial Observability. In Proceedings of the 34th International Conference on Machine Learning, PMLR 2017, Sydney, NSW, Australia, 6–11 August 2017; pp. 2681–2690. [Google Scholar]
Lowe, R.; WU, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent actor-critic for Mixed Cooperative-Competitive Environments. In Proceedings of the Advances in Neural Information Processing Systems 30, NeurIPS 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 6379–6390. [Google Scholar]
Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; WU, Y. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. In Proceedings of the Advances in Neural Information Processing Systems 35, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; pp. 24611–24624. [Google Scholar]
Kang, H.; Chang, X.; Mišić, J.; Mišić, V.B.; Fan, J.; Liu, Y. Cooperative UAV Resource Allocation and Task Offloading in Hierarchical Aerial Computing Systems: A MAPPO-Based Approach. IEEE Internet Things J. 2023, 10, 10497–10509. [Google Scholar] [CrossRef]
Qin, P.; Fu, Y.; Zhang, J.; Geng, S.; Liu, J.; Zhao, X. DRL-Based Resource Allocation and Trajectory Planning for NOMA-Enabled Multi-UAV Collaborative Caching 6G Network. IEEE Trans. Veh. Technol. 2024, 73, 8750–8764. [Google Scholar] [CrossRef]
Gronauer, S.; Diepold, K. Multi-Agent Deep Reinforcement Learning: A Survey. Artif. Intell. Rev. 2022, 55, 895–943. [Google Scholar] [CrossRef]
Rashid, T.; Samvelyan, M.; de Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar]
Liu, C.H.; Chen, Z.; Tang, J.; Xu, J.; Piao, C. Energy-Efficient UAV Control for Effective and Fair Communication Coverage: A Deep Reinforcement Learning Approach. IEEE J. Sel. Areas Commun. 2018, 36, 2059–2070. [Google Scholar] [CrossRef]
Liu, W.; Li, B.; Xie, W.; Dai, Y.; Fei, Z. Energy Efficient Computation Offloading in Aerial Edge Networks with Multi-Agent Cooperation. IEEE Trans. Wirel. Commun. 2023, 22, 5725–5739. [Google Scholar] [CrossRef]
El Haber, E.; Alameddine, H.A.; Assi, C.; Sharafeddine, S. UAV-Aided Ultra-Reliable Low-Latency Computation Offloading in Future IoT Networks. IEEE Trans. Commun. 2021, 69, 6838–6851. [Google Scholar] [CrossRef]
Zeng, Y.; Xu, J.; Zhang, R. Energy Minimization for Wireless Communication with Rotary-Wing UAV. IEEE Trans. Wirel. Commun. 2019, 18, 2329–2345. [Google Scholar] [CrossRef]
Wang, Y.Q.; Cui, Y.; Yang, Y.; Li, Z.D.; Cui, X. Multi-UAV Path Planning for Air-Ground Relay Communication Based on Mix-Greedy MAPPO Algorithm. Drones 2024, 8, 706. [Google Scholar] [CrossRef]
Lyu, J.; Zeng, Y.; Zhang, R.; Lin, T.J. Placement Optimization of UAV-Mounted Mobile Base Stations. IEEE Commun. Lett. 2016, 21, 604–607. [Google Scholar] [CrossRef]
Akhtar, J.; Rajawat, K.; Gupta, V.; Chaturvedi, A.K. Joint User and Antenna Selection in Massive-MIMO Systems with QoS-Constraints. IEEE Syst. J. 2020, 15, 497–508. [Google Scholar] [CrossRef]
Koren, Y.; Borenstein, J. Potential Field Methods and Their Inherent Limitations for Mobile Robot Navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Sacramento, CA, USA, 9–11 April 1991; pp. 1398–1404. [Google Scholar]
Fan, J.; Chen, X.; Liang, X. UAV Trajectory Planning Based on Bi-Directional APF-RRT* Algorithm with Goal-Biased. Expert Syst. Appl. 2023, 213, 119137. [Google Scholar] [CrossRef]

Figure 1. Scenario diagram.

Figure 2. Actor network architecture.

Figure 3. Critic

V_{ϕ_{i}}

network architecture.

Figure 3. Critic

V_{ϕ_{i}}

network architecture.

Figure 4. Average user load bar chart.

Figure 5. User performance distribution CDF curve.

Figure 6. Performance comparison between nearest neighbor assignment + smart offloading and K-means.

Figure 7. Performance curve comparison for clustering frequency tuning.

Figure 8. Pareto frontier scatter plot for clustering frequency tuning.

Figure 9. Performance comparison between joint and decoupled optimization.

Figure 10. Performance curves of each MARL algorithms (without introducing co-channel interference).

Figure 11. The advantage variance of each MARL algorithm (without introducing co-channel interference).

Figure 12. Performance curves of various MARL algorithms (introducing-co-channel interference).

Figure 13. The advantage variance of each MARL algorithm (introducing co-channel interference).

Table 1. Main notions.

Notion	Description
i	Agent (UAV) Index: The i-th agent (UAV)
k	User Index: The k-th user
t	Time Step: The current discrete time step
$q_{i} (t)$	UAV Position: Position coordinates of UAV i at time step t
$w_{k} (t)$	User Position: Position coordinates of user k at time step t
$h_{i, k} (t)$	UAV-to-User Channel Gain: Channel gain from UAV i to user k at time step t
$C_{k} (t)$	User Instantaneous Throughput: Instantaneous throughput of user k at time step t
$o_{i} (t)$	Local Observation: Local information observed by agent i at time step t
$s_{t}$	Global State: Aggregated global information from all agents’ observations
$\bar{C_{k}} (t)$	User Historical Throughput: Historical throughput of user k
$a_{i} (t)$	Action: Action executed by agent i at time step t
$π_{θ_{i}}$	actor Policy: The agent’s policy network
$θ_{i}$	actor Parameters: Parameters of the actor network
$P_{i, k} (t)$	UAV Allocated Power: Power allocated by UAV i to user k
$P_{i} (t)$	Total UAV Power Consumption: Total power consumption of UAV i at time step t
$E_{i} (t)$	Remaining UAV Energy: Remaining energy of UAV i at time step t
$U_{i} (t)$	User Cluster: The cluster of users served by UAV i
$V_{ϕ_{i}}$	critic Value Function: The agent’s value network
$ϕ_{i}$	critic Parameters: Parameters of the critic network
$V_{ϕ_{i}^{'}}$	Target critic: The target network for the critic, used to stabilize training, whose parameters are a soft update of $ϕ_{i}$
$B$	Experience Replay Buffer: A buffer used to store past experiences
$r_{i} (t)$	Reward: Reward for agent i at time step t
$γ$	Discount Factor: The discount rate for future rewards, set to 0.92
$d_{t}$	Done Flag: A flag indicating if the episode terminates at time step t (1 if done, 0 otherwise)
$y_{i, t}$	TD Target: The temporal-difference target used to train the critic
${\hat{A}}_{i, t}$	Advantage Function: Measures the advantage of an action compared to the average, used to train the actor
L	Loss Function: The objective function to be minimized (or maximized)
$ρ_{t} (θ_{i})$	Policy Ratio: The ratio comparing the new and old policies
$ε$	PPO Clipping Parameter: The hyperparameter in PPO used to limit the magnitude of policy updates
$T_{e p o c h s}$	Training Epochs: The total number of training epochs

Table 2. Statistical data of 5 random seeds.

	Number of Truncation Operations (Crashed ≥ 2)	TP	PF	Crashed Final Zeroing Episode
Statistical Value	3 ± 1.5810	58.0991 ± 0.1108	430.3395 ± 0.1489	17,480 ± 572.9317
Range	(1, 5)	(57.9731, 58.2774)	(430.0815, 431.0025)	(16,750, 18,200)

Table 3. Comparison table of performance of different clustering frequencies.

Clustering Frequency	Final Test Reward	Final Test TP (Mbps)	Final Test PF	Crashed Final Zeroing Episode
10	61,661.97	58.069	430.08	16,300
20	54,723.71	53.801	424.92	10,800
40	54,520.61	54.264	425.52	10,400
50	56,240.68	54.108	425.36	13,900

Table 4. Average performance record of MAPPO-DC and greedy algorithm.

Metric	Average TP	Average PF	Average Crashed	Jain’s Index	Jain’s Std
Greedy	55.252	411.03	2.5	0.625	0.0733
MAPPO-DC	58.069	430.08	0	0.767	0.052

Table 5. Training time for each MARL algorithm.

Algorithm	MASAC	MADDPG	IPPO	MAPPO	MAPPO-DC
Training time for 100 k episodes (h)	195	about 120	120	105	102

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wu, X.; Hou, C.; Meng, G.; Zhou, Z.; Liu, Q. Joint Trajectory and Power Optimization for Loosely Coupled Tasks: A Decoupled-Critic MAPPO Approach. Drones 2026, 10, 116. https://doi.org/10.3390/drones10020116

AMA Style

Wu X, Hou C, Meng G, Zhou Z, Liu Q. Joint Trajectory and Power Optimization for Loosely Coupled Tasks: A Decoupled-Critic MAPPO Approach. Drones. 2026; 10(2):116. https://doi.org/10.3390/drones10020116

Chicago/Turabian Style

Wu, Xiangyu, Changbo Hou, Guojing Meng, Zhichao Zhou, and Qin Liu. 2026. "Joint Trajectory and Power Optimization for Loosely Coupled Tasks: A Decoupled-Critic MAPPO Approach" Drones 10, no. 2: 116. https://doi.org/10.3390/drones10020116

APA Style

Wu, X., Hou, C., Meng, G., Zhou, Z., & Liu, Q. (2026). Joint Trajectory and Power Optimization for Loosely Coupled Tasks: A Decoupled-Critic MAPPO Approach. Drones, 10(2), 116. https://doi.org/10.3390/drones10020116

Article Menu

Joint Trajectory and Power Optimization for Loosely Coupled Tasks: A Decoupled-Critic MAPPO Approach

Highlights

Abstract

1. Introduction

1.1. Related Works

1.1.1. Traditional Optimization Methods

1.1.2. Single-Agent Reinforcement Learning Approaches

1.1.3. Multi-Agent Reinforcement Learning Approaches

1.2. Motivation

1.3. Contribution

2. System Model

2.1. Transmission Model

2.2. Energy Consumption Model

2.3. Problem Formulation

3. Method

3.1. User Clustering

3.2. State Space

3.3. Action Space

3.3.1. Trajectory Planning

3.3.2. Power Allocation

3.4. Reward

3.5. MAPPO-DC Algorithm

4. Simulation Results

4.1. Comparative Study of Clustering Algorithms

4.2. Clustering Frequency Experiment

4.3. Comparison with Non-Learning Algorithms

4.4. Comparison with Decoupled Optimization Methods

4.5. Comparison with Other MARL Algorithms

4.6. Model Robustness Experiment

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI