Article

Trajectory Optimization and Resource Allocation for UAV-Assisted Emergency Communication Networks

1
The School of Electronics and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
2
The Sixty-Third Research Institute, National University of Defense Technology, Nanjing 210007, China
3
The School of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
4
The School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen 518055, China
*
Author to whom correspondence should be addressed.
Chengxin Chu and Jiadong Zhang contributed equally to this work.
Drones 2026, 10(4), 233; https://doi.org/10.3390/drones10040233
Submission received: 3 February 2026 / Revised: 14 March 2026 / Accepted: 23 March 2026 / Published: 25 March 2026
(This article belongs to the Special Issue Intelligent Spectrum Management in UAV Communication)

Highlights

What are the main findings?
  • A service model based on a Poisson process is established, and a Maxwell–Boltzmann model is adopted to characterize user mobility. This setup provides a statistically motivated stochastic approximation for user mobility under disrupted emergency scenarios.
  • The paper proposes the shared feature extraction (SFE)-enhanced Proximal Policy Optimization (PPO) algorithm (SPOR) for joint UAV trajectory optimization and resource allocation, significantly improving the data transmission rate and service ratio while reducing the average service distance in real-time dynamic environments.
What are the implications of the main findings?
  • The proposed modeling framework enables a more accurate and dynamic representation of user mobility and service demands, which is crucial for designing flexible and efficient UAV-assisted communication networks in unpredictable emergency scenarios.
  • The proposed SPOR algorithm outperforms traditional methods in key metrics such as data transmission rate, service ratio, and average service distance. Moreover, it consistently achieves the highest service ratio even under challenging conditions, such as when the number of users increases or the user service demand probability becomes higher.

Abstract

In emergency communication networks, service demands and user mobility change dynamically. Low service rates and limited coverage are significant challenges that hinder the effectiveness of emergency services. Due to the flexibility, low deployment cost, and adjustable coverage range of unmanned aerial vehicles (UAVs), UAV-assisted emergency communication networks can serve as a viable method to address these challenges. Given the strong coupling between UAV trajectory optimization and resource allocation, joint optimization is crucial to meet dynamic service demands and user mobility. In this paper, we establish a user mobility model based on the Maxwell–Boltzmann distribution and a service model based on the Poisson process. We formulate an optimization problem to maximize the data transmission rate of emergency services. To address the challenges of high-dimensional continuous action spaces, we propose a shared feature extraction-enhanced PPO (SPOR) algorithm for joint trajectory optimization and resource allocation. Simulation results show that the proposed SPOR algorithm significantly outperforms benchmark methods. Specifically, it achieves at least a 20% improvement in data transmission rate, a 28% improvement in emergency communication service ratio, and a 12% reduction in average service distance.

1. Introduction

When natural disasters such as earthquakes and flash floods strike, the communication infrastructure is often compromised. Therefore, establishing rapid emergency communication in disaster-affected areas is critical to the success of post-disaster rescue missions [1]. Providing flexible and responsive emergency communication services presents a significant challenge, primarily due to the highly dispersed and dynamic distribution of affected users and their random and unpredictable service demands. Unmanned aerial vehicles (UAVs), renowned for their flexible deployment and low cost, have demonstrated considerable potential in emergency communication [2]. Unlike conventional emergency communication technologies, UAV-assisted methods enable rapid and flexible deployment, unrestricted by ground traffic conditions [3,4]. However, optimizing UAV trajectories to provide reliable emergency communication for mobile users with stochastic service demands remains a challenging problem.
Significant research efforts have been undertaken in response to the aforementioned challenges, yielding a variety of targeted solutions with substantial theoretical and practical outcomes. Existing approaches can be broadly categorised into traditional optimization theory and machine learning-based methods. Specifically, traditional UAV trajectory optimization and resource allocation methods typically formulate the problem as a mathematical optimization program with objectives and constraints. Early studies often employed convex optimization [5,6], integer programming [7,8], or heuristic algorithms [9,10], solving the problem by decoupling the intertwined subproblems. For instance, building upon large-scale system analysis and the Dinkelbach method, the authors in [11] transformed the nonlinear fractional objective function into a sequence of difference-of-convex problems. These subproblems are solved within a block coordinate descent framework to update the UAV trajectory and IoT communication resources iteratively. The work in [12] decouples the complex nonlinear completion time minimization problem into a series of tractable convex subproblems. Employing a block coordinate descent framework, the algorithm alternately optimizes the UAV trajectory and communication resources. Non-convexity is resolved via the successive convex approximation method, thereby driving convergence to a feasible and efficient solution that yields significant system performance gains. To maximize the number of connected users for post-disaster data collection, Reference [13] proposed a particle swarm optimization (PSO) algorithm incorporating an adaptive inertia weight factor to determine the UAV’s optimal velocity and bandwidth allocation. To address the challenges of task assignment and path planning for multiple UAVs in dynamic environments, Reference [14] developed a receding horizon optimization framework based on an adaptive disturbance PSO algorithm.
Furthermore, Reference [15] investigated the problem of deploying a minimum number of UAVs. It introduced a bio-inspired algorithm for UAV network link optimization, which significantly enhanced network transmission performance and service coverage. However, traditional optimization-based methods are often sensitive to initial conditions and algorithmic choices [16]. Inappropriate initial settings can severely degrade their performance. Furthermore, multi-dimensional coupled decision variables, such as user association and power control, render the joint UAV trajectory optimization and resource allocation problem highly nonlinear and non-convex [17]. This complexity makes it challenging to model the system accurately using mathematical formulations. Consequently, solutions derived from traditional optimization approaches generally struggle to guarantee a global optimum.
Traditional methods for UAV trajectory optimization and resource allocation primarily rely on static optimization or pre-defined trajectories, which struggle to adapt to the dynamic changes in user distribution and real-time fluctuations in service demands. In practical communication scenarios, the mobility of ground users, the randomness of service demands, and the time-varying nature of channel conditions collectively form a highly complex dynamic system. Although existing mathematically driven optimization methods can derive optimal solutions under specific constraints, they typically require complete system information and substantial computational resources, making it challenging to meet the demands of real-time decision-making. Furthermore, these conventional approaches face the dual challenges of the curse of dimensionality and local optima when dealing with high-dimensional continuous action spaces and non-convex optimization problems.
Machine learning algorithms enable systems to learn from accumulated data. Through iterative processes that imitate human learning, these systems continuously improve their knowledge and capabilities [18]. This capability allows UAVs to achieve autonomous learning within their operational environments, offering notable advantages such as a high degree of autonomy and superior real-time performance. Reference [19] introduced an optimized method based on a machine learning Q-learning algorithm for rapid UAV trajectory planning in unknown environments. This approach utilizes the received signal strength as a dynamic reward signal to guide the UAV toward the signal source. Reference [20] developed a trajectory optimization algorithm for UAVs based on the double deep Q-network (DDQN) to address the challenge of limited onboard computational power. However, in the context of UAV trajectory optimization, the dynamic and complex flight environment often involves high-dimensional raw data, which is difficult to process and interpret when used directly as the state input for learning, ultimately leading to the curse of dimensionality [21]. Deep reinforcement learning (DRL) [22] leverages deep neural networks’ feature extraction capability to process environmental state information layer by layer, resulting in superior processing and generalization power. By learning from interactions with the environment, DRL can master optimal strategies in complex scenarios without requiring an explicit system model. This positions DRL as a promising approach for achieving scalable, low-latency, and highly reliable spectrum decision-making [23]. Reference [24] proposed a reinforcement learning algorithm based on a competitive architecture for real-time path planning of uncrewed vehicles. The algorithm employs a DDQN structure that decomposes the Q-value function into a state-value function and an action advantage function.
Although this design improves value estimation accuracy, value-based methods are generally limited to discrete action spaces, making them less suitable for continuous UAV trajectory control problems. In UAV trajectory optimization and communication resource allocation, the control variables typically form a high-dimensional continuous action space. Deterministic policy gradient methods such as DDPG can address continuous control problems. However, they often suffer from training instability and sensitivity to hyperparameters, particularly in dynamic environments with stochastic user mobility.
In the UAV trajectory optimization and resource allocation field, Actor–Critic-based DRL has emerged as a research hotspot in recent years. The Actor–Critic algorithm [25] incorporates a value function to evaluate the policy function, enabling single-step updates for the policy learning method and thereby improving learning efficiency. Reference [26] applied trust region policy optimization (TRPO) to deep deterministic policy gradient to mitigate gradient instability. While this approach improves training stability, it requires second-order optimization and Hessian approximation, resulting in relatively high computational overhead. The proximal policy optimization (PPO) algorithm [27] simplifies the TRPO framework by introducing a clipped surrogate objective to constrain policy updates, thereby achieving improved training stability with lower computational complexity. However, conventional PPO implementations typically employ separate feature extraction networks for the actor and critic, which may lead to redundant representation learning and inefficient feature utilization when processing complex environmental states.
In addition to the algorithmic limitations discussed above, numerous technical challenges remain unresolved in the joint optimization of UAV trajectory and communication spectrum resource allocation for emergency communication services. Firstly, the complexity of Air-to-Ground channel modelling cannot be overlooked. The probabilistic switching between line-of-sight (LoS) and non-line-of-sight (NLoS) propagation conditions, distance-dependent path loss, and environmental factors influence accurate channel modelling, a foundational challenge for system design. Secondly, the random walk of user locations, the bursty nature of service demands, and the duration uncertainty demand a system capable of rapid response and adaptive adjustment. Finally, the continuous control of UAV trajectories requires efficient learning strategies. An algorithm design challenge is balancing sufficient exploration for optimal paths while avoiding policy instability.
Building upon the aforementioned research landscape and motivated by the need to address the multifaceted challenges in emergency communication, this paper capitalizes on the high training stability, strong sample efficiency, and implementation simplicity of the PPO algorithm. To this end, we propose the shared feature extraction (SFE)-enhanced PPO for trajectory optimization and resource allocation (SPOR) algorithm for the joint optimization of UAV trajectory and communication resources. In other words, SPOR is intended to address not only the stability–complexity trade-off in continuous-control learning, but also the representation redundancy caused by separately learning policy and value features from the same coupled emergency-network state. The main contributions of this work are summarized as follows:
  • A user service model is established where service demand arrivals follow a Poisson process and their durations obey a uniform distribution. This characterization effectively captures the bursty and time-varying nature of emergency communication traffic. For user mobility, a model based on the Maxwell–Boltzmann distribution is adopted. User speeds follow a two-dimensional Maxwell–Boltzmann distribution, with random movement directions incorporating a boundary reflection mechanism. This setup provides a tractable stochastic approximation for user mobility in emergency communication scenarios.
  • A UAV trajectory optimization and resource allocation algorithm is designed based on the Actor–Critic architecture. It leverages an SFE layer to reduce the computational complexity of the neural network. Furthermore, a comprehensive reward function integrating multi-dimensional performance metrics is constructed. This reward function combines data transmission rate rewards, service ratio rewards, and a distance penalty, thereby mitigating potential performance bias from single-objective optimization.
  • Comprehensive simulations are conducted to evaluate the performance of the proposed algorithm from the dimensions of user scale and service demand. The results demonstrate that our method outperforms the benchmark algorithms in key performance metrics, including training convergence speed, emergency communication service ratio, and average service distance.
The remainder of this paper is organized as follows. Section 2 introduces the system model: the user mobility and service model; the UAV mobility and communication model. Section 3 performs problem transformation for UAV trajectory optimization and resource allocation, and presents the SPOR algorithm. Section 4 presents the simulation results. Finally, Section 5 concludes the paper.

2. System Model

The wireless communication system assisted by a UAV in a disaster area is shown in Figure 1. One UAV equipped with a radio frequency module serves as an aerial base station, providing communication services to ground users. Let J = \{1, 2, \ldots, j\} represent the collection of mobile users, whose service demands follow the Poisson model. The UAV provides radio frequency communication services for the mobile users. The symbols and definitions of the relevant parameters are shown in Table 1.

2.1. User Mobility Model and Service Model

Consider the random motion characteristics of mobile users in disaster areas on a two-dimensional plane. In time slot t, the position coordinates and velocity vector of user j are recorded as (p_j(t), v_j(t)), where p_j(t) = (x_j(t), y_j(t)) is the position vector of user j and v_j(t) = (v_x(t), v_y(t)) is the velocity vector of user j. The user location update formula is
p_j(t+1) = p_j(t) + v_j(t) \cdot \Delta t,
where \Delta t denotes the time slot length. Due to the inherently random nature of user mobility, the velocity distribution of users shares similar statistical properties with the motion of gas molecules described by kinetic theory. This similarity to the Maxwell–Boltzmann velocity distribution was initially observed in experimental studies reported in [28]. Building on this observation, later works, such as [29], approximate individual movement speeds using a two-dimensional Maxwell–Boltzmann distribution to characterize mobility and contact behavior at the population level. Based on these insights, we adopt a probabilistic mobility model inspired by the Maxwell–Boltzmann distribution to describe user motion. The model provides a statistically grounded approximation of random mobility under disrupted emergency conditions. The probability density function for any velocity component v_\xi (\xi \in \{x, y\}) of a user is given by
f_{v_\xi}(v_\xi) = \frac{1}{\sqrt{2 \pi v_{\mathrm{rms}}^2}} \exp\left( -\frac{v_\xi^2}{2 v_{\mathrm{rms}}^2} \right),
where v_{\mathrm{rms}} is the user's root-mean-square speed, and the velocity components v_x and v_y are independent and identically distributed. The joint probability density function of the user velocity vector is given by
f_{\mathbf{v}}(v_x, v_y) = \frac{1}{2 \pi v_{\mathrm{rms}}^2} \exp\left( -\frac{v_x^2 + v_y^2}{2 v_{\mathrm{rms}}^2} \right).
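For concreteness, the mobility update above can be sketched in a few lines: each velocity component is drawn from a zero-mean Gaussian with standard deviation v_rms, and the boundary reflection mechanism keeps users inside the area. This is an illustrative sketch only; the function names, the specific reflection rule, and the parameter layout are our assumptions, not the paper's simulation code.

```python
import random

def sample_velocity(v_rms, rng=random):
    """Draw (v_x, v_y) from the two-dimensional Maxwell-Boltzmann model:
    each component is a zero-mean Gaussian with standard deviation v_rms."""
    return rng.gauss(0.0, v_rms), rng.gauss(0.0, v_rms)

def step_user(p, v, dt, x_bounds, y_bounds):
    """Update a user position p = (x, y) by v * dt, reflecting the velocity
    at the area boundary (assumed reflection mechanism)."""
    x, y = p[0] + v[0] * dt, p[1] + v[1] * dt
    vx, vy = v
    if x < x_bounds[0] or x > x_bounds[1]:
        vx = -vx                                  # reflect horizontal velocity
        x = min(max(x, x_bounds[0]), x_bounds[1]) # clamp back into the area
    if y < y_bounds[0] or y > y_bounds[1]:
        vy = -vy                                  # reflect vertical velocity
        y = min(max(y, y_bounds[0]), y_bounds[1])
    return (x, y), (vx, vy)
```

A user heading out of the area thus bounces back with the same speed, which preserves the Maxwell–Boltzmann speed statistics while keeping positions bounded.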
In addition, this paper considers a binary decision variable, which is used to characterize the dynamic service demands of ground users in UAV communication networks. The arrival process of total service demands in the system can be modeled as a discrete-time Poisson process [30]. All slots have the same length. Assuming a quasi-static environment, the conditions within each time slot t remain unchanged, and each user requests spectrum resources at the beginning of each time slot t. As shown in Figure 2, for each user j \in J in the system, the service demand status is defined as \rho_j(t). When \rho_j(t) = 1, user j has a service demand in time slot t; when \rho_j(t) = 0, user j has no service demand in slot t. In each time slot t, users currently being served do not generate new services while processing current services, and users without service demands generate task requests with probability P_i. The duration of a service follows a uniform distribution \tau_j \sim U[T_{min}, T_{max}], where T_{min} and T_{max} represent the minimum and maximum service duration in time slots, respectively.
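The per-slot demand dynamics described above can be sketched as follows. Idle users generate a new demand with a fixed per-slot probability (the discrete-time Poisson-like arrival), busy users consume one slot of their remaining duration, and durations are drawn uniformly from [T_min, T_max]. Function and parameter names are illustrative assumptions.

```python
import random

def step_demands(active, users, p_new, t_min, t_max, rng=random):
    """One-slot update of the service-demand state.
    active maps user id -> remaining service duration (in slots).
    Busy users do not generate new demands; idle users generate one with
    probability p_new, with a duration drawn uniformly from [t_min, t_max].
    (Names are illustrative, not from the paper.)"""
    nxt = {}
    for j in users:
        if j in active and active[j] > 1:
            nxt[j] = active[j] - 1                  # ongoing service, one slot consumed
        elif j not in active and rng.random() < p_new:
            nxt[j] = rng.randint(t_min, t_max)      # new bursty demand arrives
    return nxt
```

The demand indicator of the model is then simply rho_j(t) = 1 if j is a key of the active dictionary and 0 otherwise.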

2.2. UAV Mobility Model and Communication Model

The UAV operates as an aerial base station with controllable 3D mobility. In time slot t, the UAV position is denoted as p_{uav}(t) = (x_{uav}(t), y_{uav}(t), h_{uav}(t)), where x_{uav}(t) and y_{uav}(t) represent the horizontal coordinates, and h_{uav}(t) represents the flight altitude.
The motion of the UAV follows a first-order dynamic model. The velocity vector is defined as v_{uav}(t) = (v_x^{uav}(t), v_y^{uav}(t), v_z^{uav}(t)), where v_x^{uav}(t), v_y^{uav}(t), and v_z^{uav}(t) represent the UAV's velocity components along the three coordinate axes, respectively. The UAV's position is updated according to the first-order kinematic model
p_{uav}(t+1) = p_{uav}(t) + v_{uav}(t) \cdot \Delta t.
To ensure flight safety and mission area coverage, the UAV's trajectory is subject to spatial boundary constraints: horizontal position constraints x_{uav}(t) \in [x_{\min}, x_{\max}] and y_{uav}(t) \in [y_{\min}, y_{\max}], and an altitude constraint h_{uav}(t) \in [h_{\min}, h_{\max}]. These constraints enforce operation within the designated operational airspace while maintaining a suitable flight altitude for adequate communication coverage.
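The first-order kinematic update with airspace bounds can be sketched as below; clamping to the boundary is one simple way to enforce the constraints, and is an assumption of this sketch rather than the paper's exact mechanism.

```python
def step_uav(p, v, dt, x_b, y_b, h_b):
    """First-order kinematic update of the UAV position p = (x, y, h),
    clipped to the operational airspace bounds (illustrative clamping)."""
    clip = lambda val, lo, hi: min(max(val, lo), hi)
    return (clip(p[0] + v[0] * dt, *x_b),
            clip(p[1] + v[1] * dt, *y_b),
            clip(p[2] + v[2] * dt, *h_b))
```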
The radio frequency channel between the UAV and ground users adopts a probabilistic LoS channel model. The path loss between the UAV and user j under LoS and NLoS conditions can be respectively expressed as [31]
L_j^{\mathrm{LoS}} = \alpha_{\mathrm{LoS}} + \beta_{\mathrm{LoS}} \log d_j + G,
L_j^{\mathrm{NLoS}} = \alpha_{\mathrm{NLoS}} + \beta_{\mathrm{NLoS}} \log d_j + G,
where \alpha_{\mathrm{LoS}} and \alpha_{\mathrm{NLoS}} denote the path loss at the reference distance d_j = 1, d_j is the distance between the UAV and user j, and \beta_{\mathrm{LoS}} and \beta_{\mathrm{NLoS}} represent the path loss exponents for LoS and NLoS transmissions, respectively. G is a Gaussian random variable modeling the random shadowing effect, with its fluctuation measured by its standard deviation. The LoS probability depends on the elevation angle \phi_j = (180/\pi) \sin^{-1}(h/d_j), where h denotes the UAV's altitude. The LoS and NLoS probabilities are
P_j^{\mathrm{LoS}} = \frac{1}{1 + \eta \exp\left( -\varsigma \left( \phi_j - \eta \right) \right)},
P_j^{\mathrm{NLoS}} = 1 - P_j^{\mathrm{LoS}},
where \eta and \varsigma are environment-dependent parameters. Therefore, the average path loss is calculated as
L_j^{\mathrm{avg}} = L_j^{\mathrm{LoS}} P_j^{\mathrm{LoS}} + L_j^{\mathrm{NLoS}} P_j^{\mathrm{NLoS}}.
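The probability-weighted average path loss above translates directly into code. In this sketch we assume a base-10 logarithm in the path-loss law and the conventional sigmoid form of the LoS probability; all function and parameter names are illustrative.

```python
import math

def avg_path_loss(d, h, a_los, b_los, a_nlos, b_nlos, eta, varsigma, shadow=0.0):
    """Probability-weighted average path loss of the air-to-ground link.
    d: UAV-user distance, h: UAV altitude, shadow: shadowing term G in dB.
    Assumes log base 10 and the sigmoid LoS-probability model."""
    phi = math.degrees(math.asin(h / d))                 # elevation angle in degrees
    p_los = 1.0 / (1.0 + eta * math.exp(-varsigma * (phi - eta)))
    l_los = a_los + b_los * math.log10(d) + shadow       # LoS path loss (dB)
    l_nlos = a_nlos + b_nlos * math.log10(d) + shadow    # NLoS path loss (dB)
    return p_los * l_los + (1.0 - p_los) * l_nlos
```

When the UAV hovers directly above a user (h = d, i.e. a 90-degree elevation angle), the LoS probability approaches one and the average loss collapses to the LoS term.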
The system employs the signal-to-noise ratio (SNR) as a key physical-layer performance metric to quantify the transmission quality and reliability of communication links. The SNR is defined as the ratio of the received useful signal power to the noise power at the receiver. It is a crucial indicator for assessing signal transmission quality and system performance, characterizing the channel’s spectral efficiency and transmission reliability. The SNR of the radio frequency channel between the UAV and user j is given by
\Gamma_j(t) = \frac{P_t \cdot 10^{-L_j^{\mathrm{avg}}/10}}{\sigma^2},
where P_t is the UAV transmission power and \sigma^2 is the noise power. In time slot t, the coverage of a ground user by the UAV is determined based on the aforementioned SNR threshold criterion. The coverage indicator function I_j(t) is defined as
I_j(t) = \begin{cases} 1, & \Gamma_j(t) \ge \delta, \\ 0, & \Gamma_j(t) < \delta. \end{cases}
When the SNR of the signal received by the user exceeds the threshold \delta, the user is deemed to be within the UAV's effective coverage range. Conversely, if the user's SNR falls below the threshold \delta, the user is considered disconnected, indicating a failed communication connection [32]. Therefore, the SNR constraint is \Gamma_j(t) \ge \delta.
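The SNR definition and the threshold-based coverage indicator can be sketched as follows (linear-power convention, dB path loss; names are illustrative):

```python
def snr_linear(p_t, l_avg_db, noise):
    """Received SNR: transmit power p_t (W) attenuated by the average
    path loss l_avg_db (dB), divided by the noise power (W)."""
    return p_t * 10.0 ** (-l_avg_db / 10.0) / noise

def covered(gamma, delta):
    """Coverage indicator I_j(t): 1 if the SNR gamma meets threshold delta."""
    return 1 if gamma >= delta else 0
```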
Considering the dynamic characteristics of users' service demands, this paper defines the service user set L(t) of time slot t as the set of all users with service demands, represented as L(t) = \{ j \in J : \rho_j(t) = 1 \}; N(t) is the set of users who meet both the coverage condition and the service demand, represented as N(t) = \{ j \in J : \rho_j(t) = 1, I_j(t) = 1 \}. The set of effective users successfully connected to the UAV in time slot t, denoted M(t), is defined as
M(t) = \{ j \in J : \rho_j(t) = 1,\ I_j(t) = 1,\ \varpi_j(t) = 1 \},
where \varpi_j(t) represents whether the UAV serves user j in time slot t. Specifically, when \varpi_j(t) = 1, the UAV serves user j, and when \varpi_j(t) = 0, it does not. To quantify the effectiveness of UAV emergency communication services, this paper defines the service ratio as the key performance indicator. The service ratio reflects the UAV's ability to successfully provide communication services to users with service demands within a specific time slot. In time slot t, the service ratio of the system is defined as
R_s(t) = \frac{|M(t)|}{|L(t)|}.
In UAV-assisted emergency communication systems, the efficient utilization of spectrum resources is essential to service quality. This paper mainly considers the downlink data transmission from the UAV to ground users. Functioning as an aerial base station, the UAV is tasked with the efficient and reliable distribution of critical information, including high-definition images, real-time videos, and environmental sensor data gathered over disaster areas, to the ground users. This transmission scheme is based on orthogonal frequency division multiple access (OFDMA), allowing the UAV to allocate bandwidth to users with service demands in time slot t. We define the bandwidth allocation matrix B(t) \in \mathbb{R}^{N \times 1} as B(t) = [b_1(t), b_2(t), \ldots, b_N(t)]^T, where b_j(t) \in [0, 1] represents the normalized bandwidth proportion allocated to user j in time slot t. The bandwidth allocated to user j is
B_j(t) = b_j(t) \cdot B,
where B is the total channel bandwidth of the UAV. Therefore, according to the Shannon formula, the data transmission rate of user j can be expressed as
R_j(t) = B_j(t) \log_2 \left( 1 + \Gamma_j(t) \right) = b_j(t) B \log_2 \left( 1 + \frac{P_t \cdot 10^{-L_j^{\mathrm{avg}}/10}}{\sigma^2} \right).
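The OFDMA rate computation amounts to splitting the total bandwidth by the normalized fractions b_j(t) and applying the Shannon formula per user. A minimal sketch (names assumed):

```python
import math

def user_rates(b, bandwidth, gammas):
    """Per-user Shannon rates under OFDMA bandwidth sharing.
    b: normalized bandwidth fractions over served users (must sum to 1),
    bandwidth: total bandwidth B (Hz), gammas: per-user linear SNRs."""
    assert abs(sum(b) - 1.0) < 1e-9, "bandwidth fractions must sum to one"
    return [bj * bandwidth * math.log2(1.0 + g) for bj, g in zip(b, gammas)]
```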
To enable online decision-making, we assume that a lightweight feedback link is available between ground users and the UAV. Through this control link, each user reports its location update and service-demand indicator to the UAV, either periodically or in an event-triggered manner. Therefore, at each time slot, the UAV can obtain the user positions and demand states required for state construction and subsequently perform trajectory control, user association, and bandwidth allocation. Since this paper focuses on downlink emergency service performance, the uplink signaling overhead is assumed to be negligible and is not explicitly optimized. Moreover, in this paper, the communication process is modeled under an OFDMA-based downlink abstraction.

2.3. Problem Formulation

Considering the constraints of UAV trajectory control and communication system bandwidth, the system optimizes the UAV flight trajectory based on user locations and dynamic service demands, aiming to serve as many users with service demands as possible. The optimization problem is formulated to maximize the data transmission rate of UAV-assisted emergency communication services. A binary variable ρ j ( t ) indicates whether user j has a service demand in time slot t: ρ j ( t ) = 1 if user j has a service demand in time slot t, and ρ j ( t ) = 0 otherwise. Another binary variable ϖ j ( t ) indicates whether the UAV is connected to user j in time slot t: ϖ j ( t ) = 1 if the UAV serves user j, and ϖ j ( t ) = 0 otherwise. The variable b j ( t ) represents the normalized bandwidth proportion allocated to user j in time slot t. The optimization problem constructed in this paper is as follows
\begin{aligned}
\max_{\mathbf{v}_{uav},\, \varpi_j(t),\, b_j(t)} \quad & \sum_{j \in J} \varpi_j(t) R_j(t) \\
\mathrm{s.t.} \quad & C1{:}\ \rho_j(t) \in \{0, 1\}, \quad \forall j \in J,\ \forall t, \\
& C2{:}\ I_j(t) \in \{0, 1\}, \quad \forall j \in J,\ \forall t, \\
& C3{:}\ \varpi_j(t) \in \{0, 1\}, \quad \forall j \in J,\ \forall t, \\
& C4{:}\ \varpi_j(t) \le \rho_j(t), \quad \forall j \in J,\ \forall t, \\
& C5{:}\ \Gamma_j(t) \ge \delta \cdot \rho_j(t), \quad \forall j \in J,\ \forall t, \\
& C6{:}\ \sum_{j \in M(t)} b_j(t) = 1, \quad \forall t, \\
& C7{:}\ x_{uav}(t) \in [x_{\min}, x_{\max}], \quad \forall t, \\
& C8{:}\ y_{uav}(t) \in [y_{\min}, y_{\max}], \quad \forall t, \\
& C9{:}\ h_{uav}(t) \in [h_{\min}, h_{\max}], \quad \forall t,
\end{aligned}
where constraint C1 indicates whether user j has a service demand in time slot t. Constraint C2 indicates whether user j is covered in time slot t. Constraint C3 indicates whether the UAV serves user j in time slot t. Constraint C4 is a demand-aware service constraint, allowing the UAV to connect to a user only when \rho_j(t) = 1. C5 is the SNR constraint. C6 is the total bandwidth constraint, ensuring that the bandwidth allocated to all served users sums to the total system bandwidth. In addition, constraints C7, C8, and C9 are flight range constraints for the UAV.
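As a sanity check during simulation, the constraint set C1-C9 can be verified programmatically for a single time slot. This helper is an illustrative assumption (names and argument layout are ours), not part of the proposed algorithm.

```python
def feasible(rho, omega, i_cov, gamma, b, delta, uav_p, x_b, y_b, h_b, tol=1e-9):
    """Check one slot's decisions against constraints C1-C9.
    rho, omega, i_cov: per-user binary lists; gamma: per-user SNRs;
    b: bandwidth fractions of the served users; uav_p: UAV position (x, y, h)."""
    for r, w, i, g in zip(rho, omega, i_cov, gamma):
        if r not in (0, 1) or w not in (0, 1) or i not in (0, 1):  # C1-C3
            return False
        if w > r:                                                  # C4
            return False
        if g < delta * r:                                          # C5
            return False
    if abs(sum(b) - 1.0) > tol:                                    # C6
        return False
    x, y, h = uav_p                                                # C7-C9
    return x_b[0] <= x <= x_b[1] and y_b[0] <= y <= y_b[1] and h_b[0] <= h <= h_b[1]
```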

3. UAV Trajectory Optimization and Resource Allocation Algorithm

The UAV faces significant challenges in trajectory optimization and resource allocation in dynamic environments. In post-disaster communication scenarios, the random movement of ground users leads to a continuously changing spatial distribution, making it difficult for the UAV to maintain stable service coverage. Additionally, the bursty and unpredictable nature of user service demands necessitates resource allocation strategies with real-time adaptability. Consequently, the UAV must rapidly adjust its flight trajectory and optimize resource allocation during operation to adapt to environmental dynamics and evolving user demands. Limited in both response speed and adaptability, traditional optimization algorithms struggle to meet the demands of UAV trajectory optimization and resource allocation in such highly dynamic settings.

3.1. Problem Transformation

Considering the above challenges, this paper models the dynamic optimization problem as a Markov decision process (MDP). In the emergency communication scenario, the UAV is the agent that executes a flight policy based on local observations. The agent's objective is to maximize communication service quality by optimizing its trajectory, guided by user locations and their service demands. Specifically, the UAV inputs its observations into a policy function approximated by a neural network. This network then outputs an action, and a reward is generated upon interacting with the environment to complete this process. The state space, action space, state transition probability, and reward function of the Markov decision process in time slot t are denoted as S, A, P, and R, respectively. The detailed definitions of the MDP elements are as follows
  • State S ( t ) : The system status includes the position of the UAV, the position of mobile users, and the status of service demands. Therefore, the state space in a time slot can be defined as
    S(t) = \{ p_{uav}(t), p_1(t), p_2(t), \ldots, p_j(t), \rho_1(t), \rho_2(t), \ldots, \rho_j(t) \},
    where p u a v ( t ) represents the position of the UAV in time slot t. p j ( t ) indicates the geographical location of the user j in time slot t. In addition, ρ j ( t ) indicates the service demand status of user j.
  • Action A(t): This action corresponds to the decision made by the UAV agent and determines the UAV's behavior. The UAV agent needs to jointly make decisions on trajectory control, service connection, and bandwidth allocation. In time slot t, the action space of the agent contains three coupled decision variables: a_t = \{v_t^{norm}, \varpi_j(t), b_j(t)\}, where v_t^{norm} = [v_x^{norm}, v_y^{norm}, v_z^{norm}]^T \in [-1, 1]^3 is the normalized speed control vector. Each action component represents the normalized speed in the corresponding direction, and the actual flight speed is obtained through the linear mapping v_{uav} = v_{max} \cdot v_t^{norm}, where v_{max} represents the maximum flight speed of the UAV. Here, \varpi_j(t) represents whether the UAV serves user j in time slot t, and b_j(t) represents the bandwidth proportion allocated to user j in time slot t. In our implementation, the actor outputs a continuous latent action vector under a Gaussian policy, which is then mapped to the executable hybrid action. Specifically, the velocity-related outputs are linearly scaled to the physical UAV velocity range, the user-association-related outputs are converted into discrete service decisions through threshold-based binarization with feasibility constraints, and the bandwidth-allocation-related outputs are normalized to satisfy the bandwidth constraint.
    In the considered emergency communication scenario, these action components are strongly coupled. UAV trajectory control affects user accessibility, service distance, and channel conditions, while bandwidth allocation determines how the available communication resources are distributed among the currently served users.
  • State transition probability $P(s(t+1) \mid s(t), a(t))$: The state transition is determined by the deterministic UAV dynamics model, the stochastic user mobility model, and the probability distribution of user service demands. User $j$’s position is updated via Equation (1) and its velocity via Equation (3), while the UAV’s position is updated by Equation (4). The service demand state $\rho_j(t)$ evolves according to the Poisson process, generating new demands with probability $P_i$.
  • Reward R ( t ) : In the context of UAV communication in the disaster area environment, this paper aims to serve as many users as possible with service demands, and maximize the data transmission rate of UAV emergency communication services. Given the service demand, the reward of UAV agent in time slot t is expressed as
    $r_t = k_1 \cdot r_t^{dtr} + k_2 \cdot r_t^{ser} - k_3 \cdot r_t^{dis}$,
    in which $k_1$, $k_2$, and $k_3$ are weight coefficients satisfying $k_1, k_2, k_3 \in [0, 1]$ and $k_1 + k_2 + k_3 = 1$. These coefficients serve as task-preference parameters in the multi-objective reward, balancing the relative importance of transmission rate, service ratio, and distance penalty in the overall objective. Since the three reward components are normalized to comparable ranges, the coefficients are chosen empirically to reflect the desired trade-off among communication quality, service coverage, and spatial proximity to users with active demands. The data transmission rate reward $r_t^{dtr}$ is
    $r_t^{dtr} = \dfrac{\sum_{j \in M(t)} R_j(t)}{B \cdot \log_2(1 + SNR_{max})} \in [0, 1]$.
    The service ratio reward $r_t^{ser}$ is
    $r_t^{ser} = R_s(t) = \dfrac{|M(t)|}{L(t)} \in [0, 1]$.
    The distance penalty $r_t^{dis}$ is
    $r_t^{dis} = \dfrac{\bar{d} - d_{max}}{d_{max}} \in [-1, 0]$,
    where $\bar{d}$ is the average horizontal distance from the UAV to all users with service demands, and $d_{max}$ is the maximum communication distance of the UAV. In this paper, the average service distance serves as an efficiency-related metric characterizing how closely the UAV can approach users with active service demands during service provision. A smaller value indicates that the UAV serves demanding users with better spatial proximity, which generally reduces path loss and improves communication efficiency.
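As a concrete illustration, the three normalized reward terms and their weighted combination can be sketched as follows. This is a minimal sketch: the function name, argument layout, and default weight values are illustrative assumptions, not the authors' implementation.

```python
import math

def compute_reward(rates_served, num_served, num_demanding, avg_distance,
                   bandwidth=10e6, snr_max_db=15.0, d_max=1000.0,
                   k1=0.4, k2=0.4, k3=0.2):
    """Weighted multi-objective reward r_t = k1*r_dtr + k2*r_ser - k3*r_dis."""
    snr_max = 10 ** (snr_max_db / 10)  # dB -> linear scale
    # Transmission-rate term, normalized by the channel capacity bound B*log2(1+SNR_max).
    r_dtr = sum(rates_served) / (bandwidth * math.log2(1 + snr_max))
    # Service-ratio term: served users over users with active demands.
    r_ser = num_served / num_demanding if num_demanding else 0.0
    # Distance penalty in [-1, 0]: more negative when the UAV is closer to users.
    r_dis = (avg_distance - d_max) / d_max
    return k1 * r_dtr + k2 * r_ser - k3 * r_dis
```

Because $r_t^{dis}$ is negative whenever the average distance is below $d_{max}$, subtracting $k_3 \cdot r_t^{dis}$ effectively rewards the UAV for flying closer to demanding users.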

3.2. SPOR Algorithm for UAV Trajectory Optimization and Resource Allocation

To solve the MDP formulated above, we propose the SPOR algorithm, a policy-gradient-based reinforcement learning method that balances training stability and sample efficiency by constraining the magnitude of policy updates.
The objective of the policy gradient is to maximize the expected cumulative reward
$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right]$,
where θ denotes the parameters of the policy network, τ is a trajectory sampled from the policy π θ , and γ [ 0 , 1 ] is the discount factor. To reduce the variance of the gradient estimate, we introduce the advantage function
$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$,
where Q π ( s t , a t ) and V π ( s t ) are the state-action value function and the state-value function, respectively. To balance bias and variance, we adopt the generalized advantage estimation (GAE) technique
$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}^{V}$,
where $\delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the temporal difference (TD) error, and $\lambda \in [0, 1]$ is the GAE parameter controlling the bias–variance trade-off.
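Under these definitions, the GAE sum can be computed with a single backward recursion $\hat{A}_t = \delta_t^V + \gamma\lambda \hat{A}_{t+1}$. The sketch below assumes a finite episode with one bootstrap value appended to the value array; the function name is illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one episode.

    rewards: array of length T.
    values:  array of length T+1, whose last entry is the bootstrap
             value V(s_T) (0 for a terminal state).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                  # (γλ)-discounted sum
        adv[t] = running
    return adv
```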
The SPOR algorithm limits the magnitude of updates by using a clipping policy ratio, where the policy ratio is defined as
$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$,
where θ old denotes the parameters of the behavior policy used to collect the trajectories.
The SPOR algorithm adopts an Actor–Critic architecture with an SFE module. As illustrated in Figure 3, the Actor–Critic network consists of three core components: an SFE layer, an Actor network, and a Critic network. The SFE module in this work is introduced as a practical shared representation module for the considered joint UAV trajectory-control and resource-allocation task, where the system state contains strongly coupled mobility, service-demand, user-association, and bandwidth-related information. By allowing both networks to build upon the same encoded state representation, the model is intended to promote feature reuse between policy learning and value estimation, and may reduce redundant processing in practice.
Specifically, the SFE layer maps the raw state $s \in \mathbb{R}^{d_s}$ into higher-order feature representations used by both the actor and critic networks. Its structure comprises two fully connected layers, $h_1 = \mathrm{ReLU}(W_1 s + b_1)$ and $h_2 = \mathrm{ReLU}(W_2 h_1 + b_2)$, where $W_1 \in \mathbb{R}^{H \times d_s}$ and $W_2 \in \mathbb{R}^{H \times H}$ are weight matrices, $b_1, b_2 \in \mathbb{R}^{H}$ are bias vectors, and $H$ denotes the hidden layer dimension. By sharing the underlying feature extractor, the network reduces computational redundancy and ensures that the policy and value functions are based on the same state representation, improving training efficiency and consistency. The Actor network selects actions according to the policy, while the Critic network evaluates the value of the chosen actions.
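The shared extractor and the two heads can be sketched in a few lines. The class names, random initialization scale, and tanh squashing of the actor mean are illustrative assumptions; a real implementation would use a deep learning framework such as PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedFeatureExtractor:
    """Two shared fully connected ReLU layers: h2 = ReLU(W2 ReLU(W1 s + b1) + b2)."""
    def __init__(self, d_s, H):
        self.W1 = rng.standard_normal((H, d_s)) * 0.1
        self.b1 = np.zeros(H)
        self.W2 = rng.standard_normal((H, H)) * 0.1
        self.b2 = np.zeros(H)

    def __call__(self, s):
        h1 = np.maximum(0.0, self.W1 @ s + self.b1)
        return np.maximum(0.0, self.W2 @ h1 + self.b2)

class SPORNetwork:
    """Actor and critic heads consuming the same encoded state."""
    def __init__(self, d_s, H, d_a):
        self.sfe = SharedFeatureExtractor(d_s, H)
        self.W_a = rng.standard_normal((d_a, H)) * 0.1  # actor mean head
        self.w_c = rng.standard_normal(H) * 0.1         # critic value head

    def forward(self, s):
        f = self.sfe(s)              # one shared forward pass
        mu = np.tanh(self.W_a @ f)   # latent action mean in [-1, 1]^{d_a}
        v = float(self.w_c @ f)      # scalar state-value estimate
        return mu, v
```

Gradients from both the policy loss and the value loss flow back into the shared parameters, which is the feature-reuse effect the text describes.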
The policy of the Actor network is modeled as a Gaussian distribution over a continuous latent action space. The sampled latent action is then converted into the final hybrid action through deterministic post-processing, including velocity scaling, user-association binarization, and bandwidth normalization:
$\pi_\theta(a_t \mid s_t) = \mathcal{N}\!\left( \mu_\theta(s_t), \sigma^2 \right)$,
where $\mu_\theta(s_t) \in [-1, 1]^{d_a}$ is the mean of the latent action, $d_a$ is the dimensionality of the action space, and $\sigma$ is a state-independent standard deviation vector. The latent action vector is partitioned into three parts corresponding to UAV velocity control, user association, and bandwidth allocation: the velocity-related outputs are linearly scaled to the physical UAV velocity range, the user-association-related outputs are converted into discrete service decisions through threshold-based binarization with feasibility constraints, and the bandwidth-allocation-related outputs are normalized to satisfy the bandwidth allocation constraint.
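A minimal sketch of this deterministic post-processing follows, assuming the latent vector is laid out as [velocity (3), association (J), bandwidth (J)]. This layout, the zero threshold, and the exponential normalization are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def map_latent_action(z, n_users, v_max=15.0, threshold=0.0):
    """Map a latent action z in [-1, 1]^(3 + 2J) to the hybrid action."""
    v = v_max * z[:3]                                    # velocity scaling
    assoc = (z[3:3 + n_users] > threshold).astype(int)   # binarized association
    bw = np.exp(z[3 + n_users:])                         # positive raw shares
    bw = bw / bw.sum()                                   # normalize to sum to 1
    bw = bw * assoc                                      # no bandwidth to unserved users
    if bw.sum() > 0:
        bw = bw / bw.sum()                               # re-satisfy the bandwidth constraint
    return v, assoc, bw
```

For J = 2 users and z = [0.5, -1, 0, 0.8, -0.5, 0.2, 0.9], the mapped action flies the UAV at [7.5, -15, 0] m/s, serves only the first user, and assigns it the whole bandwidth.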
The Critic network, parameterized by ϕ , estimates the state-value function V ϕ ( s t ) R . It serves as an approximation of the value function V ( s t ) and is used to compute the TD error δ t V and the generalized advantage estimator A ^ t in (24) to guide the Actor in updating the policy. The Critic is trained by minimizing the TD-error-based value loss, providing a low-variance policy gradient signal that accelerates convergence and improves stability.
The clipped surrogate objective of the policy network is defined as
$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t(\theta) \hat{A}_t,\; \mathrm{clip}\!\left( r_t(\theta), 1 - \varepsilon, 1 + \varepsilon \right) \hat{A}_t \right) \right]$,
where $\varepsilon$ is the clipping factor. This objective ensures that the new policy does not deviate too far from the old policy. The value function loss $L^{VF}(\phi)$ is used to train the Critic network to approximate the true state-value function:
$L^{VF}(\phi) = \mathbb{E}_t \left[ \left( V_\phi(s_t) - \hat{V}_t \right)^2 \right]$,
where $\hat{V}_t = \hat{A}_t + V_\phi(s_t)$ is the value target. The total loss minimized by SPOR is
$L(\theta, \phi) = -L^{CLIP}(\theta) + c_1 L^{VF}(\phi) - c_2 S[\pi_\theta]$,
where c 1 and c 2 are loss coefficients, and S [ π θ ] is an entropy regularization term given by
$S[\pi_\theta] = -\mathbb{E}_{t,\, a_t \sim \pi_\theta(\cdot \mid s_t)} \left[ \log \pi_\theta(a_t \mid s_t) \right]$.
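Putting the three terms together, the total loss on a batch can be sketched as below, using the sign convention that the clipped surrogate is subtracted because the overall loss is minimized. The coefficient values are illustrative assumptions.

```python
import numpy as np

def spor_loss(logp_new, logp_old, adv, v_pred, v_target, entropy,
              eps=0.2, c1=0.5, c2=0.01):
    """Total minimized loss: -L_CLIP + c1 * L_VF - c2 * S[pi]."""
    ratio = np.exp(logp_new - logp_old)                  # policy ratio r_t(theta)
    l_clip = np.minimum(ratio * adv,
                        np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()
    l_vf = ((v_pred - v_target) ** 2).mean()             # value-function loss
    return -l_clip + c1 * l_vf - c2 * entropy
```

When the new and old policies coincide, the ratio is 1 and the surrogate reduces to the mean advantage; when the ratio drifts beyond $1 \pm \varepsilon$, the clip term caps its contribution.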
To improve the stability of training, the advantage function is standardized
$\hat{A}_{norm} = \dfrac{\hat{A} - \mu(\hat{A})}{\sigma(\hat{A}) + \epsilon}$,
where $\mu(\hat{A})$ and $\sigma(\hat{A})$ are, respectively, the mean and standard deviation of the advantage estimates within the batch, and $\epsilon = 10^{-8}$ is a small numerical stability constant. In addition, this paper employs the stochastic gradient descent with warm restarts (SGDR) policy to dynamically adjust the learning rate, thereby improving the training stability and convergence speed of the SPOR algorithm. The core idea is to periodically decrease the learning rate from its initial value to a minimum value following a cosine function, and then reset it to the initial value at the end of each cycle. The learning rate in cycle $i$ is updated as
$\eta_t = \eta_{min} + \frac{1}{2} \left( \eta_{max} - \eta_{min} \right) \left( 1 + \cos\!\left( \frac{T_{cur}}{T_i} \pi \right) \right)$,
where $\eta_t$ is the current learning rate, $\eta_{max}$ is the initial learning rate, $\eta_{min}$ is the minimum learning rate, $T_{cur}$ is the number of iterations completed in the current cycle, and $T_i$ is the total number of iterations in cycle $i$.
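The schedule itself is a one-liner; the sketch below uses illustrative $\eta_{max}$ and $\eta_{min}$ values (not the paper's settings).

```python
import math

def sgdr_lr(t_cur, t_i, eta_max=3e-4, eta_min=1e-5):
    """Cosine annealing within one warm-restart cycle of length t_i iterations."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

At the start of a cycle (t_cur = 0) the rate equals eta_max, it decays along the cosine curve, reaches eta_min at t_cur = t_i, and then restarts at eta_max for the next cycle.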
Algorithm 1 outlines the training procedure of the proposed SPOR algorithm. In each episode, the current policy interacts with the environment to generate on-policy experiences, which are stored in the replay buffer. These experiences are used to compute the GAE-based advantage estimates, which are normalized to ensure stable training. Subsequently, the shared feature extractor, actor, and critic networks are jointly updated for K epochs, guided by the SPOR clipped surrogate objective, the value function loss, and entropy regularization. During each update, gradients from both the actor and critic losses flow into the shared feature extractor, so its parameters $\psi$ are updated together with the actor parameters $\theta$ and critic parameters $\phi$ via gradient descent, ensuring that both networks benefit from the same feature representation. After the parameter updates, the old policy parameters are replaced with the new ones, and the learning rate is updated for the next episode.
Algorithm 1: SPOR Algorithm
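The per-episode flow of Algorithm 1 can be summarized in a runnable skeleton, where the environment rollout and the K network-update epochs are stubbed out; all numeric choices here are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_spor(num_episodes=3, horizon=8, K=4, gamma=0.99, lam=0.95):
    """Skeleton of the SPOR training loop: rollout, GAE, normalize, update."""
    normalized_advantages = []
    for episode in range(num_episodes):
        # 1. Roll out the current policy and store on-policy transitions
        #    (stubbed here with random rewards and critic values).
        rewards = rng.random(horizon)
        values = rng.random(horizon + 1)
        # 2. Backward recursion for GAE advantages.
        adv = np.zeros(horizon)
        running = 0.0
        for t in reversed(range(horizon)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            running = delta + gamma * lam * running
            adv[t] = running
        # 3. Batch-normalize the advantages for stable updates.
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        # 4. K epochs of joint SFE/actor/critic gradient updates (stubbed),
        #    then sync the old policy parameters and step the SGDR schedule.
        for _ in range(K):
            pass
        normalized_advantages.append(adv)
    return normalized_advantages
```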

4. Simulation and Results

In this section, we present the simulation scenario along with three baseline methods. The experimental results are then described and analyzed.

4.1. Parameter Setting

The simulation environment is a three-dimensional area of 2000 m × 2000 m × 200 m with 30 users and a single UAV. The UAV operates within an altitude range of 50 to 100 m, with a maximum speed of 15 m/s. Its transmission power $P_t$ is set to 0.5 W, and the total channel bandwidth $B$ is 10 MHz. The system noise power $\sigma^2$ is configured to −105 dBm, while the SNR threshold $\delta$ is set at 15 dB. The environment-dependent parameters are $\eta = 9.61$ and $\varsigma = 0.16$, and the service demand probability $P_i$ of each user is 0.5. The neural network employed in the SPOR algorithm comprises an SFE layer together with actor and critic networks: the SFE layer contains two fully connected layers of 256 neurons each, while the actor and critic networks each contain a single hidden layer with 128 neurons. The simulation parameters used in this work are summarized in Table 2, with some settings chosen based on [33,34].
The simulation environment was developed in Python 3.10 with the PyTorch 2.0 deep learning framework. All experiments were performed on a workstation running Windows 10, configured with an Intel i7 CPU, 32 GB RAM, and an NVIDIA RTX 3060 GPU. The UAV-enabled emergency communication scenario evolves in discrete time slots. In each slot, user locations are updated according to the Maxwell–Boltzmann mobility model, while service-demand states evolve based on the Poisson-arrival demand model. Given the observed system state, the UAV agent determines the trajectory control, user association, and bandwidth allocation actions, after which the environment computes the corresponding reward and transitions to the next state. Unless otherwise specified, each algorithm was trained for 3000 episodes, and all reported results were obtained under identical simulation settings and random initialization conditions to ensure fairness.

4.2. Performance Benchmark

To evaluate the proposed SPOR-based trajectory optimization and resource allocation algorithm, three benchmark methods are considered: the PPO algorithm without the SFE layer, the deep deterministic policy gradient (DDPG) algorithm, and the fixed-position (FP) strategy. PPO serves as a direct reference for evaluating the contribution of the proposed improvements over the standard PPO framework, especially the shared feature extraction mechanism. The FP strategy is a simple reference baseline that highlights the importance of UAV mobility and adaptive trajectory control. DDPG is a representative off-policy DRL baseline for continuous control, providing a useful contrast with the on-policy PPO-based SPOR framework. Together, these methods offer complementary comparisons from the perspectives of algorithmic improvement, mobility adaptation, and learning mechanism. We restrict attention to representative continuous-control baselines so that the proposed SFE-enhanced PPO framework can be evaluated under a unified task formulation in a clearly interpretable comparison setting.
(1) PPO: This baseline is implemented on the standard PPO framework with the SFE module removed: the actor and critic networks take the original observations directly as inputs and output the corresponding actions. The observation space, action space, reward design, and training procedure are kept identical to those of the proposed SPOR method, so that this controlled comparison isolates the practical effect of introducing SFE under a common PPO-based framework and matched task settings.
(2) DDPG: The DDPG algorithm is adopted as a deterministic policy gradient-based baseline. It consists of an Actor network that generates continuous actions and a Critic network that evaluates the action–value function. To ensure a fair comparison, DDPG uses the same observation space, action space, constraints, and reward function as the proposed SPOR method. Moreover, target networks and an experience replay buffer are employed following the standard DDPG training procedure.
For fairness, SPOR, PPO, and DDPG are implemented under the same task formulation, including the same observation space, action space, and reward design.
(3) FP: In the FP strategy, the UAV remains at a predetermined location throughout the entire episode, and no trajectory optimization is performed. This benchmark serves as a lower-bound reference, highlighting the performance gain achieved by enabling UAV mobility and trajectory control.

4.3. Performance Evaluation

Figure 4 illustrates a comparison of training convergence rates under four different algorithms. In the same scenario, the UAV provides communication services to an identical number of users. As shown in the figure, the proposed SPOR algorithm achieves the fastest and most stable convergence, reaching a high reward level of approximately 1.04 after about 1000 training episodes. The PPO algorithm converges to a slightly lower reward level of around 0.97, while the DDPG algorithm stabilizes at approximately 0.93 after about 1500 episodes. In contrast, the FP strategy exhibits significantly inferior performance, with its reward remaining around 0.8 throughout the training process. These results demonstrate the superiority of the proposed SPOR algorithm in both convergence speed and achievable reward.
Figure 5 compares the data transmission rates achieved by four different algorithms. As shown in the figure, the proposed SPOR algorithm consistently outperforms the other approaches, converging to a stable data transmission rate of approximately 0.7 after around 1000 training episodes. The PPO algorithm achieves a lower stable rate of about 0.58, while the DDPG algorithm converges to approximately 0.47. The proposed algorithm achieves a data transmission rate improvement of over 48% compared with the DDPG baseline, and about 20% compared with PPO. In contrast, the FP strategy exhibits significantly inferior performance, with its data transmission rate remaining at a low level of around 0.2 throughout the training process. These results demonstrate the effectiveness of the proposed algorithm in improving data transmission performance.
Figure 6 illustrates the evolution of the service ratio over training episodes under different algorithms. As shown in the figure, the proposed SPOR algorithm consistently achieves the highest service ratio, converging to a stable level of approximately 0.82 after about 1200 training episodes and demonstrating its effectiveness in resource allocation and task scheduling. The PPO algorithm stabilizes at a lower service ratio of about 0.64, while the DDPG algorithm converges to approximately 0.55. The proposed algorithm achieves a service ratio improvement of over 49% compared with the DDPG baseline, and about 28% compared with PPO. In contrast, the FP strategy exhibits significantly inferior performance, with its service ratio remaining at a low level of around 0.28 throughout the training process.
Figure 7 depicts the performance of each algorithm in optimizing the average service distance between the UAV and users with service demands. As shown in the figure, the proposed SPOR algorithm demonstrates superior distance optimization capability, reducing the initial average distance of approximately 1000 m to around 650 m after convergence. The PPO algorithm converges to an average distance of about 740 m, while the DDPG algorithm stabilizes at approximately 800 m. In contrast, the FP strategy fails to achieve effective distance optimization, with the average service distance remaining at a high level of around 1240 m throughout the training process. Compared with the DDPG and PPO baselines, the proposed algorithm reduces the average service distance by about 19% and 12%, respectively. It should be noted that the average service distance is not intended to measure coverage robustness in this paper. Instead, it reflects the spatial efficiency of the UAV in approaching demanding users during dynamic service provision.
Figure 8 demonstrates the service ratio performance of four algorithms under scenarios with varying numbers of users. As shown in the figure, the proposed SPOR algorithm achieves the highest service ratio across all user population sizes. It attains a service ratio of about 0.82 when serving 20 users and maintains a relatively high level of around 0.77 when the number of users increases to 80, indicating strong scalability under increasing user demand. The PPO and DDPG algorithms achieve lower service ratios and exhibit a gradual decline as the user population grows. In contrast, the FP strategy provides severely limited service capability, with its service ratio remaining at a low level of about 0.26–0.30 and slightly decreasing as the number of users increases.
Figure 9 illustrates the service ratio performance of four algorithms under different user service demand probabilities. As shown in the figure, the proposed SPOR algorithm consistently achieves the highest service ratio across all demand levels. When the service demand probability is 0.2, it attains a service ratio of about 0.85, and it remains around 0.8 as the demand probability increases to 0.8, indicating strong robustness under heavier service loads. The PPO and DDPG algorithms exhibit lower service ratios and suffer more pronounced degradation as the demand probability increases. In contrast, the FP strategy shows limited service capability, and its service ratio decreases significantly under high service demand probabilities. These results demonstrate that the proposed algorithm can effectively adapt to varying service loads and provide reliable service assurance for post-disaster emergency communication.
Figure 10 illustrates a scenario where a UAV provides services to users with service demands under the proposed method. It is generated from a representative snapshot of one simulation episode after policy convergence, where the UAV position, user locations, service-demand states, and established communication links are extracted from the simulation environment and then visualized. In the figure, the green triangle represents the position of the UAV, red squares denote users with service demands, and purple circles indicate users without service demands. The solid lines indicate communication links established between the UAV and the users. It can be observed that the proposed method enables the UAV to autonomously optimize its flight trajectory and dynamically adjust its position to achieve effective coverage for users with service demands. In this representative simulation snapshot, the system achieves a service ratio of 80 percent, indicating that the UAV can effectively respond to users with service demands under the learned policy.
The superior performance of SPOR over PPO and DDPG can be attributed to several factors. First, the SFE layer enables the actor and critic networks to learn from a unified high-level state representation, which reduces redundant computation and improves the consistency between policy learning and value estimation. This is particularly beneficial in our problem, where the state space jointly includes the UAV position, user positions, and dynamic service-demand states. Second, SPOR inherits the clipped surrogate objective from PPO, which effectively constrains policy updates and improves training stability in a dynamic and stochastic environment. Third, the joint reward design simultaneously considers transmission rate, service ratio, and service-distance penalty, enabling the UAV to better balance communication quality, service coverage, and spatial proximity. Since SPOR, PPO, and DDPG are evaluated under the same reward formulation, the comparison between SPOR and PPO more directly reflects the contribution of the SFE-based shared representation under the same learning framework, whereas the comparison with DDPG is additionally influenced by differences in optimization mechanism and learning paradigm. Finally, the use of advantage estimation, normalization, and learning rate scheduling further improves convergence efficiency and robustness. As a result, SPOR is able to learn a more effective and stable policy than PPO and DDPG, leading to better overall performance.

5. Conclusions

This paper investigates a UAV-assisted emergency communication network under post-disaster rescue scenarios. We consider a user mobility model based on the Maxwell–Boltzmann distribution and a service model based on the Poisson process. An optimization problem is formulated to maximize the data transmission rate of emergency services. To solve this optimization problem, we propose the SPOR algorithm for joint trajectory optimization and resource allocation. Simulation results demonstrate that the proposed SPOR algorithm outperforms benchmark methods, with at least 20% improvement in data transmission rate, 28% improvement in service ratio, and 12% reduction in average service distance. Future work will explore the application of collaborative multi-UAV systems to further enhance the scalability of emergency communication services.

Author Contributions

Methodology, C.C. and J.Z.; Validation, J.Z., P.H., Y.Z. and F.W.; Investigation, J.Z. and P.H.; Writing—original draft, C.C.; Writing—review and editing, C.C., J.Z., P.H., Y.Z., M.O., F.W., Q.L. and Y.C.; Conceptualization J.Z.; Software, C.C.; Funding acquisition Y.C.; Supervision, F.W. and Y.C.; Project administration, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation Research Foundation of National University of Defense Technology under grant number ZK24-61.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, R.; Zhang, J.; Zhang, Y.; He, P.; Du, Y.; Chen, Y.; Shi, W.; Ding, G.; Hu, F. Joint Task Offloading and Resource Allocation in UAV-Assisted MEC Networks for Disaster Rescue: A Large AI Model Enabled DRL Approach. IEEE Internet Things J. 2025, 12, 48336–48350.
  2. Wang, M.; Li, R.; Jing, F.; Gao, M. Multi-UAV Assisted Air–Ground Collaborative MEC System: DRL-Based Joint Task Offloading and Resource Allocation and 3D UAV Trajectory Optimization. Drones 2024, 8, 510.
  3. Zhang, B.; Yue, D.; Dou, C.; Yuan, D.; Xu, L.; Li, H. Co-Design of Active Power Control and UAVs-Assisted Two-Stage Emergency Communication for ADN: A Cyber-Physical Cross-Space Understanding and Cooperation Method. IEEE Trans. Smart Grid 2025, 17, 1457–1477.
  4. Ma, Y.; Qian, S. Ultra-Dense Uplink UAV Lossy Communications: Trajectory Optimization Based on Mean Field Game. Electronics 2025, 14, 2219.
  5. Wu, K.; Lan, J.; Lu, S.; Wu, C.; Liu, B.; Lu, Z. Integrative Path Planning for Multi-Rotor Logistics UAVs Considering UAV Dynamics, Energy Efficiency, and Obstacle Avoidance. Drones 2025, 9, 93.
  6. Wu, F.; Yang, D.; Xiao, L.; Cuthbert, L. Minimum-Throughput Maximization for Multi-UAV-Enabled Wireless-Powered Communication Networks. Sensors 2019, 19, 1491.
  7. Li, R.; Zhang, Q.; Ma, D.; Yu, K.; Huang, Y. Joint Target Assignment and Resource Allocation for Multi-Base Station Cooperative ISAC in AAV Detection. IEEE Trans. Veh. Technol. 2025, 74, 7700–7714.
  8. Saha, S.; Vasegaard, A.E.; Nielsen, I.; Hapka, A.; Budzisz, H. UAVs Path Planning under a Bi-Objective Optimization Framework for Smart Cities. Electronics 2021, 10, 1193.
  9. Hooshyar, M.; Huang, Y.M. Meta-heuristic Algorithms in UAV Path Planning Optimization: A Systematic Review (2018–2022). Drones 2023, 7, 687.
  10. Wang, Z.; Wei, T.; Sun, G.; Liu, X.; Yu, H.; Niyato, D. Multi-UAV Enabled MEC Networks: Optimizing Delay Through Intelligent 3-D Trajectory Planning and Resource Allocation. IEEE Trans. Intell. Transp. Syst. 2025, 26, 20897–20911.
  11. Tang, X.; Wang, W.; He, H.; Zhang, R. Energy-efficient data collection for UAV-assisted IoT: Joint trajectory and resource optimization. Chin. J. Aeronaut. 2022, 35, 95–105.
  12. Du, W.; Wang, T.; Zhang, H.; Dong, Y.; Li, Y. Joint Resource Allocation and Trajectory Optimization for Completion Time Minimization for Energy-Constrained UAV Communications. IEEE Trans. Veh. Technol. 2023, 72, 4568–4579.
  13. Fu, Y.; Li, D.; Tang, Q.; Zhou, S. Joint Speed and Bandwidth Optimized Strategy of UAV-Assisted Data Collection in Post-Disaster Areas. In Proceedings of the 2022 20th Mediterranean Communication and Computer Networking Conference (MedComNet), Pafos, Cyprus, 1–3 June 2022; pp. 39–42.
  14. Han, Z.; Guo, W. Dynamic UAV Task Allocation and Path Planning with Energy Management Using Adaptive PSO in Rolling Horizon Framework. Appl. Sci. 2025, 15, 4220.
  15. Hazra, K.; Shah, V.K.; Roy, S.; Deep, S.; Saha, S.; Nandi, S. Exploring Biological Robustness for Reliable Multi-UAV Networks. IEEE Trans. Netw. Serv. Manag. 2021, 18, 2776–2788.
  16. Shen, Y.; Qu, Y.; Dong, C.; Zhou, F.; Wu, Q. Joint Training and Resource Allocation Optimization for Federated Learning in UAV Swarm. IEEE Internet Things J. 2023, 10, 2272–2284.
  17. Wu, Q.; Ruan, T.; Zhou, F.; Huang, Y.; Xu, F.; Zhao, S.; Liu, Y.; Huang, X. A Unified Cognitive Learning Framework for Adapting to Dynamic Environments and Tasks. IEEE Wirel. Commun. 2021, 28, 208–216.
  18. Jin, H.; Liu, Q.; Li, C.; Hou, Y.T.; Lou, W.; Kompella, S. Hector: A Reinforcement Learning-based Scheduler for Minimizing Casualties of a Military Drone Swarm. In Proceedings of the MILCOM 2022—2022 IEEE Military Communications Conference (MILCOM), Rockville, MD, USA, 28 November–2 December 2022; pp. 887–894.
  19. Muzammul, M.; Assam, M.; Ghadi, Y.Y.; Innab, N.; Alajmi, M.; Alahmadi, T.J. IR-QLA: Machine Learning-Based Q-Learning Algorithm Optimization for UAVs Faster Trajectory Planning by Instructed-Reinforcement Learning. IEEE Access 2024, 12, 91300–91315.
  20. Wang, S.; Qi, N.; Jiang, H.; Xiao, M.; Liu, H.; Jia, L.; Zhao, D. Trajectory Planning for UAV-Assisted Data Collection in IoT Network: A Double Deep Q Network Approach. Electronics 2024, 13, 1592.
  21. Qin, P.; Wu, X.; Fu, M.; Ding, R.; Fu, Y. Latency Minimization Resource Allocation and Trajectory Optimization for UAV-Assisted Cache-Computing Network With Energy Recharging. IEEE Trans. Commun. 2025, 73, 5715–5728.
  22. Luo, R.; Tian, H.; Ni, W.; Cheng, J.; Chen, K.C. Deep Reinforcement Learning Enables Joint Trajectory and Communication in Internet of Robotic Things. IEEE Trans. Wirel. Commun. 2024, 23, 18154–18168.
  23. Zhao, L.; Liu, X.; Shang, T. Maximizing coverage in UAV-based emergency communication networks using deep reinforcement learning. Signal Process. 2025, 230, 109844.
  24. Wu, X.; Chen, H.; Chen, C.; Zhong, M.; Xie, S.; Guo, Y.; Fujita, H. The autonomous navigation and obstacle avoidance for USVs with ANOA deep reinforcement learning method. Knowl.-Based Syst. 2020, 196, 105201.
  25. Grondman, I.; Busoniu, L.; Lopes, G.A.D.; Babuska, R. A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2012, 42, 1291–1307.
  26. Ho, T.M.; Nguyen, K.K.; Cheriet, M. UAV Control for Wireless Service Provisioning in Critical Demand Areas: A Deep Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2021, 70, 7138–7152.
  27. Hu, J.; Yang, X.; Wang, W.; Wei, P.; Ying, L.; Liu, Y. Obstacle Avoidance for UAS in Continuous Action Space Using Deep Reinforcement Learning. IEEE Access 2022, 10, 90623–90634.
  28. Henderson, L. The statistics of crowd fluids. Nature 1971, 229, 381–383.
  29. Vanni, F.; Lambert, D. On the regularity of human mobility patterns at times of a pandemic. arXiv 2021, arXiv:2104.08975.
  30. Zhang, J.; Shi, W.; Zhang, R.; Liu, W. Computation Offloading and Shunting Scheme in Wireless Wireline Internetwork. IEEE Trans. Commun. 2021, 69, 6808–6821.
  31. Liu, C.; Ding, M.; Ma, C.; Li, Q.; Lin, Z.; Liang, Y.C. Performance Analysis for Practical Unmanned Aerial Vehicle Networks with LoS/NLoS Transmissions. In Proceedings of the 2018 IEEE International Conference on Communications Workshops (ICC Workshops), Kansas City, MO, USA, 20–24 May 2018; pp. 1–6.
  32. Alzenad, M.; El-Keyi, A.; Yanikomeroglu, H. 3-D Placement of an Unmanned Aerial Vehicle Base Station for Maximum Coverage of Users With Different QoS Requirements. IEEE Wirel. Commun. Lett. 2018, 7, 38–41.
  33. Sun, K.; Yang, J.; Li, J.; Yang, B.; Ding, S. Proximal Policy Optimization-Based Hierarchical Decision-Making Mechanism for Resource Allocation Optimization in UAV Networks. Electronics 2025, 14, 747.
  34. Ding, R.; Zhou, F.; Wu, Q.; Ng, D.W.K. From External Interaction to Internal Inference: An Intelligent Learning Framework for Spectrum Sharing and UAV Trajectory Optimization. IEEE Trans. Wirel. Commun. 2024, 23, 12099–12114.
Figure 1. UAV-assisted wireless communication network for emergency services.
Figure 2. Service model with dynamic user demand.
Figure 3. The training framework of SPOR algorithm for UAV-assisted emergency communication.
Figure 4. Training convergence rate of different algorithms.
Figure 5. Data transmission rate of different algorithms.
Figure 6. Service ratio of different algorithms.
Figure 7. Average service distance of different algorithms.
Figure 8. Impact of user population size on the service ratio.
Figure 9. Impact of user service demand probability on the service ratio.
Figure 10. Schematic diagram of UAV providing services to users with service demands.
Table 1. Important notations.

Symbol          | Definition
J               | Set of mobile users
p_j(t)          | Position vector of user j
v_j(t)          | Velocity vector of user j
f_v(v_x, v_y)   | Joint probability density function of the user's velocity
ρ_j(t)          | Service demand state of user j in slot t
P_i             | Probability of a user generating a service demand
T_min, T_max    | Minimum/maximum number of consecutive slots for a service
p_uav(t)        | UAV position
α_LoS, α_NLoS   | Path loss at the reference distance for LoS/NLoS paths
β_LoS, β_NLoS   | Path loss exponent for LoS/NLoS paths
ϕ_j             | Elevation angle between the UAV and user j
h               | UAV altitude
η, ς            | Environmental parameters
Γ_j             | SNR between the UAV and user j
P_t             | UAV transmission power
I_j(t)          | UAV coverage indicator function
δ               | SNR threshold
L(t)            | Set of all users with service demands in slot t
N(t)            | Set of users meeting both coverage conditions and service demands in slot t
M(t)            | Set of effective users successfully connected to the UAV in slot t
R_s(t)          | Service ratio
B_j(t)          | Bandwidth allocated to user j
R_j             | Achievable rate of user j
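As a minimal, hypothetical sketch of two quantities defined in Table 1, the snippet below computes an elevation-dependent LoS probability and the service ratio R_s(t) = |M(t)| / |L(t)|. It assumes the widely used sigmoid LoS model P_LoS(ϕ) = 1 / (1 + η·exp(−ς(ϕ − η))) with the environmental parameters η, ς from Table 2; the paper's exact channel model is not reproduced here, and the function names are illustrative only.

```python
import math

def p_los(phi_deg, eta=9.61, sigma=0.16):
    """Sigmoid LoS-probability model (assumed form): P_LoS as a
    function of the UAV-user elevation angle phi in degrees."""
    return 1.0 / (1.0 + eta * math.exp(-sigma * (phi_deg - eta)))

def service_ratio(demand_users, served_users):
    """R_s(t) = |M(t)| / |L(t)|: effective users successfully
    connected to the UAV over all users with service demands."""
    if not demand_users:
        return 1.0  # no demand in this slot: vacuously fully served
    return len(served_users) / len(demand_users)
```

A UAV directly overhead (ϕ close to 90°) yields a LoS probability near 1, while a low elevation angle yields a much smaller one, which is the qualitative behavior the α_LoS/α_NLoS split in Table 1 relies on.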
Table 2. Simulation parameters.

Parameter                                | Symbol       | Value
UAV altitude                             | h            | 50–100 m
Maximum UAV speed                        | v_max        | 15 m/s
UAV transmission power                   | P_t          | 0.5 W
Total channel bandwidth                  | B            | 10 MHz
Noise power                              | σ²           | −105 dBm
SNR threshold                            | δ            | 15 dB
Environment-dependent parameters         | η, ς         | 9.61, 0.16
Service demand probability for each user | P_i          | 0.5
Learning rate                            | α            | 3 × 10⁻⁴
Discount factor                          | γ            | 0.99
GAE parameter                            | λ            | 0.95
Clipping parameter                       | ϵ            | 0.2
Value loss coefficient                   | c_1          | 0.5
Entropy regularization coefficient       | c_2          | 0.01
Gradient norm threshold                  | g_max        | 0.5
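To illustrate how the PPO hyperparameters in Table 2 enter the training objective, the following is a minimal scalar sketch of the standard clipped surrogate loss with value and entropy terms, L = −min(rA, clip(r, 1−ϵ, 1+ϵ)A) + c_1·L_value − c_2·H. It uses ϵ = 0.2, c_1 = 0.5, c_2 = 0.01 from the table; the actual SPOR network and its shared feature extraction are not shown in this excerpt, and the function signature is an assumption.

```python
import math

EPS, C1, C2 = 0.2, 0.5, 0.01  # clipping, value-loss, entropy coefficients (Table 2)

def ppo_loss(logp_new, logp_old, advantage, value_pred, value_target, entropy):
    """Per-sample PPO loss: clipped policy surrogate plus weighted
    value error, minus weighted entropy bonus."""
    ratio = math.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1.0 + EPS), 1.0 - EPS)  # clip ratio to [1-eps, 1+eps]
    policy_loss = -min(ratio * advantage, clipped * advantage)
    value_loss = (value_pred - value_target) ** 2    # squared value error
    return policy_loss + C1 * value_loss - C2 * entropy
```

In a full implementation this loss would be averaged over a minibatch, gradients clipped to the norm threshold g_max = 0.5, and advantages estimated with GAE (λ = 0.95, γ = 0.99).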
Share and Cite

Chu, C.; Zhang, J.; He, P.; Zhang, Y.; Ouyang, M.; Wan, F.; Liu, Q.; Chen, Y. Trajectory Optimization and Resource Allocation for UAV-Assisted Emergency Communication Networks. Drones 2026, 10, 233. https://doi.org/10.3390/drones10040233