1. Introduction
With the advancement of 6G and the growing adoption of the Internet of Things (IoT), the number of mobile intelligent devices has rapidly increased, driving the development of applications including augmented reality, image recognition, video streaming, and online gaming [1]. Low latency and high computational efficiency constitute fundamental requirements for reliable operation of these applications. However, due to the constrained storage capacity, battery capacity, and computing frequency of mobile devices, it is challenging to maintain outstanding performance when processing these applications locally. While cloud computing enables the offloading of computation tasks to cloud servers, thus reducing the processing load on mobile devices, it can also lead to increased latency and network congestion due to the task-offloading process [2,3].
Mobile edge computing (MEC), as a novel distributed computing framework based on mobile communication networks, offers an effective and economical solution for latency-critical and computationally intensive tasks by providing resources at the network edge, closer to users [4,5]. Compared to traditional cloud computing, the distributed architecture of MEC deploys computing and storage resources close to users, enabling computing at the network edge. Consequently, computing tasks can be offloaded from user devices to distributed edge servers through the wireless network, which not only reduces the computational burden on terminal devices but also improves response speed and quality of service, especially in dynamic scenarios such as swarm robotics, intelligent transportation, and industrial automation [6]. In traditional terrestrial MEC scenarios, the base stations are typically immovable, which limits network coverage and flexibility and may lead to non-line-of-sight (NLoS) propagation conditions, thereby degrading the communication rate [7]. Additionally, deploying terrestrial MEC servers in specific scenarios, such as remote areas, rural regions, or locations requiring disaster response, presents significant challenges [8].
Fortunately, unmanned aerial vehicles (UAVs), with their exceptional mobility, flexibility of deployment, and line-of-sight (LoS) communication links, are becoming a crucial component of future wireless communications and network technologies. By integrating UAVs with MEC, a promising architecture known as UAV-assisted MEC has been introduced, which is capable of overcoming the challenges of traditional terrestrial MEC [9]. Compared with traditional MEC, UAV-assisted MEC provides distinct advantages in terms of on-demand deployment, enhanced communication coverage, network flexibility, reconfiguration capability, mobility, and cost-effectiveness. Additionally, UAVs equipped with MEC servers can adjust their positions based on environmental conditions such as user distribution and task volume, flying closer to users to establish better communication links for MEC services. As a result, a UAV-assisted MEC system can significantly reduce communication latency, task computation latency, and user energy consumption.
Building on these advantages, extensive research has been conducted on UAV-assisted MEC systems, leading to significant achievements. Below, we summarize the main studies in this field. Latency optimization has become a focal point in recent studies. For instance, the work [10] proposed a two-layer MEC framework enabled by UAVs and a high-altitude platform, and designed an algorithm to minimize latency. In complex environments, the work [11] tackled the system latency minimization problem by employing successive convex approximation (SCA) and block coordinate descent (BCD) methods. Meanwhile, the article [12] established a network model of maritime multi-UAV-assisted MEC and minimized the system latency by utilizing deep Q-network (DQN) and deep deterministic policy gradient (DDPG) algorithms. Beyond latency considerations, the article [13] focused on energy efficiency enhancement, applying the block sequential upper bound minimization (BSUM) approach to iteratively optimize a decomposed objective function. Similarly, the authors in [14] introduced a DDPG-based method to extend UAV battery life by optimizing a cost function comprising both latency and energy consumption. Addressing fairness in UAV task allocation, the work [15] proposed a DDPG-driven strategy to balance UAV loads and users' offloading tasks while minimizing energy consumption. Additionally, in [16], the authors developed a distributed offloading algorithm to optimize the energy efficiency and latency of a ground–air cooperative MEC system. These studies demonstrate that optimizing latency and energy efficiency is critical for UAV-assisted MEC systems.
To further improve system performance and effectively integrate UAVs into MEC, researchers have focused on jointly optimizing UAV trajectory design and MEC resource allocation. The UAV deployment problem, as a component of trajectory design, has attracted research attention. The work [17] developed a joint optimization framework integrating UAV deployment, task offloading, and resource allocation. By decomposing the problem into tractable subproblems and solving them via convex relaxation and iterative block updates, the proposed strategy reduced system energy consumption. The work [18] developed a decomposed optimization framework in which optimal transport theory was applied to user association analysis and a swarm intelligence algorithm was employed for UAV deployment, thereby minimizing task delay. In [19], the authors proposed a two-phase strategy for the UAV-assisted MEC system, comprising initial UAV deployment based on user trajectory prediction from historical data and subsequent joint optimization of trajectory design and offloading decisions, leading to minimized energy consumption. While deployment determines the initial positioning of UAVs, trajectory design leverages their mobility to improve QoS and resource utilization. In [20], a unified framework integrating UAV trajectory planning, CPU frequency control, and transmission power management was proposed for MEC networks. In [21], the authors proposed a metaheuristic-algorithm-based method which jointly optimizes trajectory planning and multi-stage task offloading under energy constraints to minimize the total system cost. The article [22] jointly optimized UAV path planning and 3D beamforming to reduce data transmission latency in MEC networks. The article [23] proposed an online joint optimization framework leveraging edge network resource scheduling and UAV trajectory design, which effectively minimized the weighted energy consumption of users while maintaining UAV energy constraints and data queue stability. Recognizing the dynamic nature of user task requirements, the work [24] analyzed their impact on offloading, trajectory planning, and resource allocation, introducing an iterative BCD strategy to reformulate the problem into two convex subproblems. In the presence of multiple eavesdroppers, the work [25] employed the BCD method to jointly optimize UAV trajectory, jamming beamforming, and task offloading to establish secure communication links and maximize the secure transmission rate under delay constraints. To enhance system throughput and reduce latency, the authors in [26] developed a comprehensive integrated framework for jointly optimizing UAV trajectory and task management, including offloading, caching, and migration, which is solved by the Lyapunov and BCD methods. The authors in [27] developed a deep reinforcement learning (DRL)-based framework, integrating DQN and convex optimization, to jointly optimize UAV trajectory and task offloading, aiming to minimize task completion time and propulsion energy.
In the aforementioned studies, researchers have extensively investigated latency minimization, energy efficiency, trajectory design, and resource allocation of UAV-assisted MEC. However, current research on UAV-assisted MEC remains insufficient. On the one hand, due to the constraints of flight dynamics, UAVs cannot arbitrarily adjust their acceleration and velocity in practice, making sudden acceleration, deceleration, and sharp turns infeasible. Existing research on UAV trajectory design typically does not consider this aspect, leading to significant discrepancies between the ideal planned trajectory and the actual trajectory. On the other hand, due to UAV flight and user mobility, the communication environment between the UAV and users changes dynamically. Meanwhile, users have different task-offloading requirements that vary over time. These dynamic issues require timely and frequent adjustments to the UAV flight trajectory and resource allocation [28,29,30]. However, existing studies typically derive the UAV trajectory for scenarios with relatively low dynamics, which limits their applicability in complex and highly dynamic scenarios. In summary, due to UAV flight dynamics constraints and the need for real-time adaptability in dynamic environments, effectively integrating UAVs into MEC and leveraging them to enhance MEC performance remains challenging.
To address the challenges discussed above, this paper concentrates on performance optimization of the UAV-assisted multi-user MEC system, taking into account UAV flight dynamics constraints and time-varying factors in dynamic scenarios. Specifically, UAVs operating in multi-user scenarios must dynamically control their flight trajectories and user association strategies based on real-time user task requirements and their own flight states. Such dynamic control is critical for performance optimization and operational feasibility. Notably, the system optimization involves both continuous action variables, such as UAV velocities, and discrete action variables, such as user associations, forming a hybrid action space in which these variables are interdependent. This study develops a novel proximal policy optimization-based dynamic control (PPO-DC) algorithm, a DRL-based approach, to minimize system consumption by jointly optimizing the dynamic UAV flight trajectory control and user association strategy. The main contributions of this study are outlined below:
A comprehensive system model for UAV-assisted MEC system is mathematically built by introducing the constraints of UAV flight dynamics, where velocity and acceleration cannot be arbitrarily adjusted, along with dynamic factors such as the dynamic communication environment, time-varying user task requirements, and user mobility.
In order to improve the system performance in terms of delay and energy consumption, the joint optimization of dynamic UAV trajectory control and user association is considered, which is formulated as a non-convex optimization problem with discrete–continuous hybrid variables. The formulation provides the foundation for developing an efficient solution, which is crucial for improving system performance in a dynamic environment.
A novel DRL-based dynamic control algorithm, named PPO-DC, is developed to solve the optimization problem. First, the problem is transformed into a Markov decision process (MDP) model. Then, the PPO-DC algorithm is designed to enable the UAV to efficiently and dynamically adjust its flight trajectory, which corresponds to a continuous variable, and its association decision, which corresponds to a discrete variable. This algorithm minimizes the weighted sum of system delay and energy consumption, while tackling the challenging non-convex joint optimization problem and overcoming the complexities introduced by hybrid-action space and dynamic factors in the scenario.
With respect to UAV trajectory control, the proposed PPO-DC algorithm demonstrates notable practical utility and adaptability to dynamic scenarios. Moreover, simulation results show that the system performance in terms of latency and energy consumption under PPO-DC outperforms that of other hybrid-action DRL algorithms and metaheuristics, and that PPO-DC exhibits satisfactory reward convergence during training.
The remaining sections are structured as follows. In Section 2, the system model, including the communication model, computation model, user mobility model, and UAV dynamics model, is presented, followed by the problem formulation. In Section 3, the problem is transformed into an MDP, followed by the design of the PPO-DC algorithm. In Section 4, simulation results are analyzed. Section 5 concludes this paper.
3. PPO-DC Algorithm Design
The problem presents significant complexity due to its non-convex objective function, the presence of both discrete and continuous variables, and the UAV dynamics constraint in Equation (23a), which is commonly overlooked in existing works. In reality, the acceleration and velocity of UAVs, subject to the system dynamics constraint, cannot be changed arbitrarily. To tackle this complexity, the problem is reformulated as an MDP, and then a DRL algorithm named PPO-DC is proposed.
3.1. Markov Decision Process
The problem can be transformed into an MDP, in which the transition to the next state depends only on the current state and the action executed by the UAV, satisfying the Markov property. By solving the MDP, the agent learns an optimal policy to improve system delay and energy efficiency through sequential decisions on UAV trajectory control and UD association. Formally, the MDP is characterized by the tuple (S, A, R, π), in which S, A, and R represent the sets of states, actions, and rewards, respectively, while π denotes the system's policy. The detailed description of the MDP is given below.
(1) State: The state encapsulates the critical information needed for decision-making. According to the system model formulated in Section 2, the state is designed to contain the environment information at time slot n, including the UAV's position, the UDs' positions, and the task requirements of all UDs.
(2) Action: The UAV flight trajectory and UD association strategy are expected to be adjusted by the agent in real time. By enabling the agent to output both continuous (i.e., UAV flight velocity) and discrete (i.e., UD association strategy) actions simultaneously, the inherent interdependence between UAV trajectory control and UD association is preserved, thereby enhancing the effectiveness of joint optimization. Moreover, it improves the dynamic responsiveness of the optimization. The action at time slot n consists of the continuous variables representing the UAV's velocity along the x- and y-axes, and the discrete variables indicating the association strategy, which take the value 1 if the UAV is associated with UD k and 0 otherwise.
By taking the action in the environment, the state transitions to the next state according to the UAV dynamics model, user mobility model, communication model, and computation model in Section 2.
(3) Reward: The goal of the algorithm is to optimize the actor network to maximize the long-term accumulative rewards, with the critic network providing an estimate of the advantage function for policy improvement. The reward function provides a basis for calculating the value and advantage functions, which serve as feedback for refining the policy during training. Through iterative updates of the policy using the clipped objective and advantage estimates, the agent learns an optimal strategy for UAV flight control and UD association. The reward of a state–action pair is defined accordingly and includes a penalty term, which is applied when the UAV violates the restricted airspace boundary or the task data amount exceeds the UD's cache capacity.
At each time step, the agent collects the current environment state and selects an action according to its learned policy. After the action is executed, the agent observes the next environment state and receives the corresponding reward. This iterative process enables the agent to refine its policy to maximize the expected cumulative reward.
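To make the MDP concrete, the following minimal sketch organizes the state, hybrid action, and reward described above into an environment skeleton. All names and numeric values (UAVMecEnv, w_energy, penalty, the placeholder _serve model, and the uniform task sampler) are illustrative assumptions rather than the authors' implementation; the reward form simply mirrors the stated objective of minimizing the weighted sum of delay and energy plus a violation penalty, and the communication, computation, and UAV dynamics models of Section 2 are reduced to placeholders.

```python
import numpy as np

# Illustrative MDP skeleton; class, attribute, and parameter names are assumptions.
class UAVMecEnv:
    def __init__(self, num_uds=8, area=160.0, w_energy=0.5, penalty=10.0):
        self.num_uds = num_uds      # number of user devices K
        self.area = area            # side length of the square service area (m)
        self.w_energy = w_energy    # assumed weight of energy consumption vs. delay
        self.penalty = penalty      # assumed penalty for boundary / cache violations
        self.reset()

    def reset(self):
        # State: UAV position, UD positions, and per-UD task data amounts.
        self.uav_pos = np.array([self.area / 2, self.area / 2])
        self.ud_pos = np.random.uniform(0, self.area, size=(self.num_uds, 2))
        # Task amounts (Mbits); placeholder for the truncated Poisson model.
        self.tasks = np.random.uniform(0.01, 0.1, size=self.num_uds)
        return self._state()

    def _state(self):
        return np.concatenate([self.uav_pos, self.ud_pos.ravel(), self.tasks])

    def step(self, velocity, association):
        # velocity: continuous 2-D action (vx, vy); association: discrete UD index.
        self.uav_pos = self.uav_pos + velocity * 0.1           # slot length 0.1 s
        delay, energy = self._serve(association)               # placeholder models
        violated = np.any(self.uav_pos < 0) or np.any(self.uav_pos > self.area)
        reward = -(delay + self.w_energy * energy) - (self.penalty if violated else 0.0)
        return self._state(), reward, violated

    def _serve(self, association):
        # Placeholder for the communication/computation models of Section 2.
        dist = np.linalg.norm(self.uav_pos - self.ud_pos[association])
        return 0.01 * dist, 0.001 * dist
```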
3.2. Algorithm Description
Traditional algorithms usually address UAV trajectory design through iterative methods, which have high computational complexity and low adaptability, rendering them unsuitable for dynamic scenarios with stringent real-time requirements. In comparison, DRL-based algorithms have demonstrated superior performance in solving dynamic decision-making problems. Specifically, while DQN [42] and A3C [43] achieve excellent performance, their applicability is limited to discrete action spaces. DDPG [44], being an off-policy and deterministic algorithm, is less suitable for real-time control. Within the realm of DRL, PPO [45] demonstrates satisfactory performance in handling both continuous and discrete action control problems and outperforms other methods in real-time control applications. Given these considerations, this paper adopts PPO as the fundamental framework for algorithm design, with the proposed algorithm enabling simultaneous control of discrete and continuous actions to dynamically optimize the trajectory and user association, thereby improving system performance. Specifically, PPO-clip [45] is utilized, which is one of the algorithms recommended by OpenAI [46]. By leveraging a simple clipping mechanism, PPO-clip constrains policy updates within a predefined range, preventing excessively large policy shifts that could lead to instability. In contrast to the PPO-penalty algorithm, which incorporates a KL penalty and was introduced by Google DeepMind [47], PPO-clip eliminates the need to manually adjust the KL divergence, making it more suitable for rapid experiments. The work [45] reported that the clipped surrogate objective outperforms the KL penalty version. Additionally, PPO-clip avoids the computational overhead imposed by the constraint in the TRPO algorithm [48], thereby enhancing data-sampling efficiency, improving robustness, and simplifying hyperparameter selection. This makes PPO-clip a widely adopted choice in reinforcement learning applications, particularly in scenarios where both stability and computational efficiency are critical.
The architecture of the PPO-DC algorithm is depicted in Figure 2. The PPO-DC model comprises an actor network and a critic network with separate parameter sets. Two convolutional neural networks (CNNs) with identical structures are separately assigned to the actor and critic networks as feature extractors. The CNNs process the current environment state into a compact feature representation. The input to the CNNs is a three-dimensional tensor, where the first channel records the spatial coordinates of UDs, the second channel stores the current task data amounts of each UD, and the third channel represents the UAV's position. This structured encoding allows the network to effectively capture spatial correlations and task dynamics. The CNN feature extractor consists of three convolutional layers, where the first, second, and third layers apply 32 filters with stride 4, 64 filters with stride 2, and 32 filters with stride 1, respectively. Each convolutional layer is followed by a ReLU activation and batch normalization. After convolutional processing, the feature maps are flattened and passed through a fully connected layer to produce a 512-dimensional feature vector, which is subsequently fed into both the actor and critic networks. The actor network branches into two parallel heads: a categorical distribution head, implemented by a linear layer mapping the 512-dimensional feature vector to 2 logits for discrete user association actions, and a diagonal Gaussian distribution head, implemented by a linear layer mapping the feature vector to a 2-dimensional vector representing the continuous UAV velocity control actions. In parallel, the critic network applies a single linear layer to map the 512-dimensional feature vector to a scalar state-value estimate. Through this feature extraction and parallelized actor–critic architecture, PPO-DC efficiently optimizes hybrid discrete–continuous policies.
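The following PyTorch sketch illustrates this architecture under stated and assumed settings: the filter counts, strides, 512-dimensional feature size, and head dimensions follow the description above, while the kernel sizes (8, 4, 3), the learned log-standard-deviation of the Gaussian head, and all class and attribute names are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

# Sketch of the CNN feature extractor and hybrid actor-critic heads; kernel
# sizes and the learned log-std are assumptions, other dimensions follow the text.
class FeatureExtractor(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(), nn.BatchNorm2d(32),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(), nn.BatchNorm2d(64),
            nn.Conv2d(64, 32, kernel_size=3, stride=1), nn.ReLU(), nn.BatchNorm2d(32),
        )
        self.fc = nn.LazyLinear(512)  # flattened conv features -> 512-d vector

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)
        return torch.relu(self.fc(h))

class HybridActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.actor_features = FeatureExtractor()
        self.critic_features = FeatureExtractor()    # separate extractor for the critic
        self.assoc_head = nn.Linear(512, 2)           # logits of the discrete association action
        self.vel_mean = nn.Linear(512, 2)             # mean of the 2-D velocity action
        self.vel_log_std = nn.Parameter(torch.zeros(2))
        self.value_head = nn.Linear(512, 1)           # scalar state-value estimate

    def forward(self, obs):
        fa = self.actor_features(obs)
        fc = self.critic_features(obs)
        assoc_dist = Categorical(logits=self.assoc_head(fa))
        vel_dist = Normal(self.vel_mean(fa), self.vel_log_std.exp())
        value = self.value_head(fc).squeeze(-1)
        return assoc_dist, vel_dist, value
```

During rollout, the discrete association action and the continuous velocity action can be drawn jointly (e.g., assoc_dist.sample() and vel_dist.sample()), which preserves the hybrid action structure described above.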
Following the notation in [45], the network parameters are updated through gradient ascent to maximize the clipped surrogate objective, which is determined by three key quantities: the clipping coefficient, the hybrid-action probability ratio, and the advantage estimation.
The clipping function constrains policy updates by keeping the probability ratio within a range determined by the clipping coefficient, thereby preventing excessive deviations between the new and old policies. When the ratio exceeds this range, the surrogate objective is clipped accordingly, thus stabilizing the training. The hybrid-action probability ratio combines the continuous action probability ratio and the discrete action probability ratio through a weight factor.
The advantage estimation is obtained via generalized advantage estimation (GAE), which is parameterized by the discount factor, the GAE weight, and the temporal difference error (TD-Error).
The TD-Error is computed from the immediate reward and the value function estimates of the current and next states, where the value function denotes the expected cumulative discounted reward estimated by the critic network.
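As a concrete illustration of these quantities, the sketch below computes the TD-Errors and GAE advantages for one collected trajectory; it assumes non-terminal transitions within an episode, and the GAE weight of 0.95 is an assumption, since only the discount factor of 0.99 is stated in the simulation settings.

```python
import numpy as np

# Sketch of the TD-error and GAE computation; `values` and `next_values` are
# the critic's estimates for the current and next states of each transition.
def compute_gae(rewards, values, next_values, gamma=0.99, lam=0.95):
    deltas = rewards + gamma * next_values - values       # TD-errors
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for n in reversed(range(len(rewards))):                # backward GAE recursion
        gae = deltas[n] + gamma * lam * gae
        advantages[n] = gae
    returns = advantages + values                          # targets for the value loss
    return advantages, returns
```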
The parameters of the policy network are updated after each episode by applying the stochastic gradient descent (SGD) method over mini-batches of time-step indices, using the learning rate for actor optimization.
The value loss function is defined as the squared error between the critic's value estimate and the corresponding TD target.
The critic network parameters are updated by the SGD algorithm in the same manner, using the learning rate for critic optimization.
To implement these updates, a complete trajectory is first collected during each episode. This trajectory is then divided randomly into M mini-batches. Within each mini-batch, the value estimates, TD-Errors, advantages, and loss terms are computed for each transition. The gradients are then derived over the mini-batch, and the actor and critic parameters are updated.
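The sketch below puts these steps together for one mini-batch update. It follows the standard PPO-clip surrogate of [45]; the weighted-sum combination of the continuous and discrete probability ratios via kappa, the omission of the entropy bonus, and the tensor names are simplifying assumptions rather than the authors' exact procedure.

```python
import torch

# Sketch of one PPO-clip mini-batch update with a hybrid-action probability
# ratio; the weighted combination via `kappa` is an assumption.
def ppo_dc_update(model, actor_opt, critic_opt, batch, eps=0.1, kappa=0.5):
    obs, assoc, vel, old_logp_assoc, old_logp_vel, adv, ret = batch
    assoc_dist, vel_dist, value = model(obs)

    # Continuous and discrete probability ratios between new and old policies.
    ratio_vel = torch.exp(vel_dist.log_prob(vel).sum(-1) - old_logp_vel)
    ratio_assoc = torch.exp(assoc_dist.log_prob(assoc) - old_logp_assoc)
    ratio = kappa * ratio_vel + (1.0 - kappa) * ratio_assoc

    # Clipped surrogate objective (maximized, so its negative is minimized).
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv)
    actor_loss = -surrogate.mean()

    # Value loss: squared error between the critic estimate and the TD target.
    critic_loss = (ret - value).pow(2).mean()

    actor_opt.zero_grad()
    critic_opt.zero_grad()
    actor_loss.backward(retain_graph=True)
    critic_loss.backward()
    actor_opt.step()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```

In practice, this update would be applied over several epochs of M mini-batches per collected trajectory, as described in Algorithm 1.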
As explained above, the PPO-DC algorithm optimizes the action decision strategy and thereby addresses the joint optimization of dynamic UAV trajectory control and the UD association strategy; the detailed steps are provided in Algorithm 1.
Algorithm 1 PPO-DC algorithm for dynamic UAV trajectory control and UD association

Initialize the Actor model parameters;
Initialize the Critic model parameters;
1: for Episode = 1, 2, ... do
2:   Reset the environment;
3:   Initialize the observed state;
4:   for each time slot do
5:     Gather the current state;
6:     Obtain the continuous action distribution from the Actor;
7:     Obtain the discrete action distribution from the Actor;
8:     Sample and output the hybrid action from the Actor;
9:     Execute the action;
10:    Obtain the reward and the next state after the action is executed;
11:    Store the transition into the trajectory;
12:   end for
13:   for each epoch do
14:     Divide the trajectory randomly into M mini-batches;
15:     for each mini-batch do
16:       for each transition do
17:         Obtain the value estimate from the Critic;
18:         Derive the temporal difference error and the advantage;
19:         Derive the objective function and value loss function;
20:       end for
21:       Compute the aggregated actor gradient over the mini-batch;
22:       Update the Actor parameters;
23:       Compute the aggregated critic gradient over the mini-batch;
24:       Update the Critic parameters;
25:     end for
26:   end for
27: end for
4. Simulation Results
In this section, a thorough evaluation of PPO-DC is presented. The convergence performance under different learning rates is examined first. The UAV trajectory under flight dynamics constraints is then evaluated, followed by delay and energy consumption comparisons against other state-of-the-art hybrid-action DRL algorithms and metaheuristics.
4.1. Simulation Parameter Setting
In the simulation, the UDs and the UAV move within a rectangular service area of 160 m × 160 m, and the UAV flies at an altitude of 10 m above ground level. The starting position of the UAV is the center of the area, and its initial velocity and acceleration are both set to 0. The maximum UAV flight velocity along the x- and y-axes is 15 m/s. Each UD moves at a velocity ranging from 0 to 2 m/s. The transmit power of each UD is set to 20 dBm. The uplink carrier frequency is set to 2 GHz, and the uplink bandwidth B is set to 1 MHz. The duration of one time slot is set to 0.1 s. At each time slot, each UD randomly generates a task data amount that follows a Poisson distribution with an expectation of 0.05 Mbits, truncated to the range [0.01, 0.1] Mbits. The main simulation parameter settings are presented in Table 1, unless otherwise indicated. The key hyperparameters of the PPO-DC algorithm are set as follows unless otherwise specified: the learning rate is 0.0001, the reward discount factor is 0.99, the clipping coefficient is 0.1, and the weight factor of the hybrid-action probability ratio is 0.5. Additionally, during agent training, we set up to 4000 time slots per update and utilized eight parallel environments to enhance training efficiency.
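For reference, the sketch below collects the stated settings into a single configuration object; the key names are illustrative and do not come from the original implementation.

```python
# Illustrative configuration collecting the stated simulation settings;
# key names are assumptions, values follow the text and Table 1.
SIM_PARAMS = {
    "area_size_m": 160.0,            # square service area, 160 m x 160 m
    "uav_altitude_m": 10.0,
    "uav_max_speed_mps": 15.0,       # per axis
    "ud_speed_range_mps": (0.0, 2.0),
    "ud_tx_power_dbm": 20.0,
    "carrier_freq_ghz": 2.0,
    "uplink_bandwidth_mhz": 1.0,
    "slot_duration_s": 0.1,
    "task_mean_mbits": 0.05,         # Poisson mean, truncated to [0.01, 0.1] Mbits
    "task_range_mbits": (0.01, 0.1),
    "learning_rate": 1e-4,
    "discount_factor": 0.99,
    "clip_coefficient": 0.1,
    "hybrid_ratio_weight": 0.5,
    "time_slots_per_update": 4000,
    "num_parallel_envs": 8,
}
```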
4.2. Convergence Performance
In Figure 3, the training cumulative reward curves of the PPO-DC algorithm are illustrated under learning rates of 0.001, 0.0001, and 0.00001. It is evident that the learning speed of the agent decreases as the learning rate is reduced. With a learning rate of 0.001, the agent learns quickly and achieves a higher cumulative reward than in the 0.00001 case. However, the convergence behavior is highly unstable, as evidenced by frequent fluctuations in the cumulative reward curve, where the reward often drops sharply after reaching a high value and then rapidly recovers. This instability renders it less practical for reliable training. In contrast, when the learning rate is 0.00001, the agent exhibits significantly more stable convergence, but the learning speed is extremely slow. The agent requires a large number of episodes and updates to achieve a relatively high reward, making it less efficient for practical training. With a learning rate of 0.0001, the agent achieves the highest cumulative reward among the three cases. The agent converges rapidly to a high reward level while maintaining intermediate stability compared to the other learning rates. Despite minor fluctuations in the reward curve, the convergence trajectory remains relatively steady, indicating a robust balance between learning efficiency and stability.
4.3. UAV Trajectory Evaluation
To evaluate the effectiveness of the PPO-DC algorithm in dynamic UAV trajectory control under flight dynamics constraints, the UAV trajectories in scenarios with varying numbers of UDs are first presented in Figure 4. Subsequently, comparisons with the PPO algorithm without dynamics constraints are depicted in Figure 5, Figure 6 and Figure 7, focusing on the UAV trajectory, x-axis velocities, and y-axis velocities, respectively.
As demonstrated in Figure 4, the UAV flight trajectories are effectively controlled by the agent in scenarios in which MEC services are provided to 4, 6, 8, and 10 UDs, respectively. Specifically, in these scenarios, the UAV flight trajectory is dynamically adjusted based on the distribution of UDs within the service area and the UD task requirements. As the density of UDs increases, the algorithm ensures that the UAV flight trajectory remains both feasible and efficient. This demonstrates the scalability of the PPO-DC algorithm in environments with varying UD densities.
Figure 5 compares the UAV flight trajectories controlled by the two algorithms over a time period of 0 to 100 s in the scenario with eight UDs. Trajectory 1, controlled by the PPO-DC algorithm, which takes the flight dynamics constraints into account, corresponds to the red line. Trajectory 2, controlled by the PPO algorithm without flight dynamics constraints, corresponds to the blue line. Although both algorithms adaptively respond to the spatial distribution and task requirements of UDs, their resulting trajectories differ significantly. In particular, trajectory 1 demonstrates a variety of flight actions, including sustained approximately straight movements, smooth turns (e.g., in the lower-right and upper-right areas), sharp turns (e.g., in the lower-left and upper-left areas), semicircular arcs, and U-turns (e.g., in the lower-left area). Each kind of flight action involves continuous changes in direction and velocity, achieved through gradual deceleration before turning and acceleration afterward, which reflects strict compliance with UAV dynamics principles. The thickened portion of trajectory 1, representing the flight trajectory from 30 to 50 s, exhibits gradual deceleration and acceleration behavior, which is further substantiated by the velocity variations shown in Figure 6 and Figure 7. These behaviors collectively illustrate the ability of PPO-DC to generate physically feasible UAV trajectories across diverse motion contexts in MEC scenarios. By contrast, trajectory 2 contains sudden direction changes without appropriate velocity adjustments, including many sharp turns executed instantaneously. Such actions violate fundamental flight dynamics and are infeasible in practical deployments. This contrast reveals the limitations of an idealized trajectory control policy and emphasizes the effectiveness of the PPO-DC approach.
The contrast between trajectory 1 and trajectory 2 is further illustrated in Figure 6 and Figure 7, which show the UAV's velocity changes along the x-axis and y-axis, respectively. In trajectory 2, the UAV experiences sudden jumps or drops in velocity at the beginning of some time slots, reflecting the nature of an idealized control policy that ignores flight dynamics constraints. Specifically, sharp velocity increases can be observed around 34 s and 36 s, where the UAV's velocity suddenly jumps within a single time slot. Similarly, significant velocity drops occur around 40 s and 48 s. These sudden changes are physically unrealistic for actual UAV platforms, as they violate practical acceleration limits and reveal the lack of dynamics modeling in the baseline PPO algorithm. A zoomed-in view of the interval between 30 and 50 s clearly illustrates these distinctions. By comparison, trajectory 1 exhibits smooth and continuous velocity transitions, which result directly from the PPO-DC algorithm's explicit consideration of UAV flight dynamics constraints. By embedding these constraints into the control framework, PPO-DC ensures that the trajectories adhere to realistic physical laws, avoiding abrupt changes in velocity and enabling physically feasible flight behavior.
In summary, these observations confirm that PPO-DC not only produces realistic and feasible trajectories but also adapts effectively to complex flight scenarios, including sharp turns, smooth turns, and the transitions between deceleration and acceleration. By considering the dynamics principle, the PPO-DC algorithm provides a more realistic transition of environment states, thereby enhancing the practical utility of the algorithm and its adaptability to dynamic UAV-assisted MEC scenarios.
4.4. Performance Comparisons
Figure 8 presents the system performance under PPO-DC in scenarios with different numbers of users, in terms of three metrics: total energy consumption, total delay, and total weighted system consumption. The value on the x-axis in Figure 8 indicates the average task amount per UD. The figure shows that all three metrics increase as the average task amount per UD becomes larger, which can be deduced from Equations (6)–(11). Additionally, system consumption increases with the number of UDs, as serving more UDs requires more energy and time resources for both computation and data transmission. It is noteworthy that the total weighted system consumption in Figure 8c is the sum of the total system delay and the energy consumption scaled by the weight parameter. The purpose of introducing this weight parameter is to modulate the contribution of energy consumption relative to delay, thereby prioritizing the optimization of delay during agent training.
To evaluate the impact of localization errors, zero-mean Gaussian noise with standard deviations of 0 m, 3 m, 6 m, and 9 m is added to the positions of UDs to simulate measurement uncertainties, and the simulation terminates when the processed task data reach 100 Mbits per UD. The simulation results depicted in Figure 9 show that as the magnitude of the localization errors increases, the system performance gradually deteriorates. This is because inaccurate position estimates disrupt the UAV's user association and trajectory planning, leading to increased flight distances, higher energy consumption, and longer service delays. The degradation becomes more pronounced under higher UD densities, where smaller distances between UDs amplify the impact of localization noise. These findings suggest that in future DRL-based UAV-assisted MEC systems, it would be valuable to explicitly incorporate localization uncertainties into the decision-making framework to further enhance system robustness.
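As a simple illustration of this error model, the sketch below perturbs the true UD coordinates with the stated zero-mean Gaussian noise; the function name and the random-generator handling are illustrative choices.

```python
import numpy as np

# Sketch of the localization-error model: zero-mean Gaussian noise with a
# given standard deviation (0, 3, 6, or 9 m) is added to the true UD positions.
def noisy_ud_positions(true_positions, sigma_m, rng=None):
    rng = rng or np.random.default_rng()
    noise = rng.normal(loc=0.0, scale=sigma_m, size=true_positions.shape)
    return true_positions + noise
```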
To evaluate the impact of computational load variations, we model the number of CPU cycles required to process one bit of task data as a uniformly distributed random variable centered around four nominal values of 500, 1000, 1500, and 2000 cycles/bit. To capture actual variations in task computing complexity, the computational load per bit of newly generated task data for each user in each time slot is randomly sampled from a range around the corresponding nominal value. The simulation terminates once all UDs have processed 100 Mbits of data. As shown in Figure 10, increasing the average computational load leads to significant rises in both system delay and energy consumption. Time-varying computational loads introduce more dynamic processing requirements, which challenge the system to maintain stable and efficient scheduling, as processing duration and energy usage fluctuate unpredictably over time. These effects become more pronounced in dense user environments, where the cumulative processing burden further exacerbates delay and energy overheads. These results highlight the necessity of accounting for dynamic computational requirements when evaluating the performance of UAV-assisted MEC systems.
Furthermore, to evaluate the system performance under the PPO-DC algorithm, a comparison is performed between PPO-DC and the following hybrid-action DRL algorithms and metaheuristics:
(1) Ant colony optimization (ACO) strategy: This strategy employs the ACO method to solve a traveling salesman problem (TSP) for UAV trajectory planning. The UAV plans its flight trajectory and UD association by considering both the task requirements and the location distribution of UDs. After completing the trajectory, the UAV replans the optimal trajectory and UD association based on its current position, UD locations, and task requirements.
(2) Around flight (AF) strategy: In this strategy, the UAV associates with the nearest UD based on the shortest distance and flies toward it. The UAV continues to follow this rule to visit all UDs in the service area.
(3) DDPG [44] strategy: This strategy employs the DDPG algorithm to optimize the UAV's trajectory control and UD association, and leverages discretization to convert the continuous outputs of the agent into discrete association actions.
The simulation in Figure 11 terminates when the total computed task data amount reaches 1000 Mbits. In Figure 11, the total system energy consumption, total system delay, and total weighted system consumption of the different strategies are presented. It can be observed that the PPO-DC strategy exhibits the lowest delay and energy consumption, followed by the DDPG strategy and the ACO strategy, with the AF strategy performing the worst. This is due to the time-varying task requirements and the random mobility of UDs in the scenario, which render the optimal trajectory planned by the ACO strategy quickly outdated. In addition, the AF strategy relies only on distance as the criterion for user association, which leads to poor adaptability to the time-varying task requirements of UDs. The DDPG strategy outperforms the ACO and AF strategies but demonstrates lower performance than PPO-DC. This is because DDPG is designed for continuous action spaces and applies a floor operation to map continuous outputs to discrete UD association actions. This discretization disrupts the smoothness and continuity of the action space, leading to suboptimal decisions. In contrast, PPO-DC directly models and optimizes discrete actions within a hybrid continuous–discrete action framework, and leverages mechanisms such as clipping and entropy regularization to enhance training stability. Benefiting from these advantages, the PPO-DC strategy can dynamically adjust the UAV's flight trajectory and UD association in real time by considering the UAV flight states, time-varying task requirements, and users' mobility, thereby achieving superior performance compared to the other strategies.
5. Conclusions
In this paper, the joint optimization problem of dynamic UAV trajectory control and user association is investigated. Considering the UAV flight dynamics constraints and time-varying factors in the scenario, the system model is mathematically constructed. Moreover, the joint optimization is formulated as a non-convex optimization problem with discrete–continuous hybrid variables. To address the challenges posed by the problem's complexity and the scenario's dynamics, a DRL-based algorithm named PPO-DC is developed. Comprehensive simulation results validate the effectiveness of PPO-DC in terms of convergence rate, dynamic trajectory control, and system overhead reduction, highlighting its significance in enhancing UAV-assisted MEC performance.
Despite these achievements, several research directions remain to be addressed. First, this study focuses on a single-UAV scenario, while multi-UAV coordination presents both challenges and opportunities. Future work could extend the proposed framework by considering multi-UAV cooperative user association, simultaneous multi-user computation task processing, and interference management under shared spectrum access, while exploring distributed or federated reinforcement learning methods to address these challenges. Second, the assumption that the UAV has full knowledge of environmental states may not hold in practical scenarios due to incomplete or delayed information. In practice, obtaining complete environmental knowledge requires UDs to transmit their states via wireless links, which incurs overhead, latency, and potential errors. Therefore, exploring frameworks for environment state information collection is a promising direction for enhancing practical applicability. Third, LoS channels expose UAVs to security threats such as eavesdropping and jamming. Developing robust security mechanisms is critical to ensure system reliability.