Article

Parallel Task Offloading and Trajectory Optimization for UAV-Assisted Mobile Edge Computing via Hierarchical Reinforcement Learning

School of Electronic and Information Engineering, Inner Mongolia University, Hohhot 010021, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(5), 358; https://doi.org/10.3390/drones9050358
Submission received: 25 March 2025 / Revised: 30 April 2025 / Accepted: 5 May 2025 / Published: 8 May 2025
(This article belongs to the Special Issue UAV-Assisted Mobile Wireless Networks and Applications)

Abstract

With the rapid growth of IoT data and increasing demands for low-latency computation, UAV-assisted mobile edge computing (MEC) offers a flexible solution to overcome the limitations of fixed MEC servers. To better reflect concurrent service scenarios, this paper innovatively develops a multi-channel task modeling method, enabling UAVs to simultaneously select multiple users and offload tasks via separate channels, thereby breaking the constraints of traditional sequential assumptions. To address task conflicts and resource waste caused by sequential service assumptions, this paper proposes a hierarchical reinforcement learning-based trajectory and offloading decision-making framework (H-TAOD) for UAVs with multi-channel parallel processing. The framework decouples UAV trajectory planning and task offloading into two sub-tasks optimized via appropriate reinforcement learning methods. An invalid action masking (IAM) mechanism is introduced to avoid channel conflicts. Simulation results verify the superiority of H-TAOD in reward, delay, and convergence.

1. Introduction

With the rapid development of mobile communication technologies and the widespread use of smart devices, diverse network services and applications have emerged, placing increasing demands on network service quality and latency. Mobile edge computing (MEC) has been proposed as a critical complement to cloud computing by deploying computing and storage resources at the edge of mobile networks to provide ultra-low-latency and high-bandwidth services.
In this context, computation-intensive and latency-sensitive mobile applications—such as augmented reality (AR), autonomous driving, and remote healthcare—are becoming increasingly prevalent, placing stringent requirements on response time and quality of service (QoS) [1]. MEC addresses these challenges by reducing communication delay and offloading burdens on the core network, while also enhancing user privacy protection [2].
To further improve MEC performance in dynamic or infrastructure-limited scenarios, unmanned aerial vehicles (UAVs) have been widely adopted due to their high mobility and deployment flexibility. UAVs can act as flying edge servers or relay nodes, capable of offloading computation tasks, collecting data, and facilitating communication [3,4]. As a result, UAV-assisted MEC systems have received growing attention in both academia and industry. For instance, Zhang Jie et al. [5] adopted Lyapunov optimization to jointly optimize UAV trajectories and offloading decisions under stochastic task arrivals and dynamic channel conditions. Lyapunov optimization, rooted in control theory, provides a powerful framework for stabilizing dynamic systems and has been widely applied to real-time decision-making in mobile edge computing. Wan Sheng et al. [6] proposed a three-layer data processing architecture using deep reinforcement learning and dynamic resource control. Additionally, Wang Liang et al. [7] developed a Q-learning-based algorithm for large-scale user association and task allocation in multi-UAV networks.
Recent works have also explored the integration of advanced learning methods into MEC optimization. For example, Asheralieva and Niyato [8] employed hierarchical game theory and reinforcement learning for multi-provider task sharing, and Wang Liang et al. [9] introduced a deep Q-network (DQN) to optimize UAV trajectories in dynamic environments. Other studies have employed deep deterministic policy gradient (DDPG) [10], double DQN (DDQN) [11,12], twin delayed DDPG (TD3) [13], and graph neural networks [14] to address various challenges in task offloading, energy efficiency, and fairness.
Moreover, hierarchical [15,16] and multi-agent reinforcement learning [17,18] have been employed to tackle collaborative optimization problems in UAV networks. Geng Yu et al. [19] and Zhou Hang et al. [20] proposed hierarchical and multi-agent RL algorithms to enhance relay selection and data transmission, respectively. Yao Ziqing et al. [21] incorporated graph attention mechanisms to improve task scheduling and service caching in edge digital twin systems.
As deep learning (DL) models become integral to modern MEC services, their increasing inference complexity and model sizes impose stringent demands on caching strategies and resource scheduling mechanisms. To address this, Xu Jing et al. [22] and Fang Cheng et al. [23] investigated task offloading and caching optimization under non-orthogonal multiple access (NOMA) scenarios, aiming to enhance throughput without compromising energy efficiency. Their work emphasized the importance of coordinated caching and transmission strategies in scenarios with heterogeneous quality-of-service (QoS) requirements. Building on this, Lin Ning et al. [24] introduced a dual-slot scheduling mechanism that jointly handles task offloading and caching placement. By decoupling the decision-making processes across two time slots, the proposed approach significantly improves system responsiveness and task completion latency. Xu Yuntao et al. [25] further extended the caching optimization paradigm by proposing a hybrid caching allocation strategy based on the soft actor–critic (SAC) algorithm, enabling the system to adaptively allocate caching resources in response to diverse user application demands and dynamic environmental conditions.
In parallel, several studies have explored the joint optimization of trajectory control and resource management from a multi-objective perspective. Wang Zhiheng et al. [26] proposed a UAV-assisted MEC architecture wherein UAV trajectories and resource allocation are jointly optimized through a dynamic programming framework. The proposed scheme enhances service efficiency by adapting UAV flight paths in accordance with computing and communication demands. Sun Guojie et al. [27] designed a comprehensive optimization framework that balances delay, energy consumption, and task completion rate using a multi-objective evolutionary algorithm. This approach ensures more reliable task execution across various operational constraints. Cui Jia et al. [28] focused on the integration of power-aware networks and UAV-MEC systems, presenting a joint trajectory and resource allocation strategy that dynamically adapts to variations in network topology and workload distribution. Hui Meng et al. [29] addressed the impact of UAV deployment height on system performance. By introducing a multivariable optimization model, their work simultaneously considers offloading ratio design and altitude control, achieving near-optimal performance bounds in diverse mission scenarios.
In addition, research in the fields of mobile edge computing and UAV modeling is increasingly moving toward more practical scenarios. Existing studies have advanced system modeling and optimization methods from various perspectives. For instance, Premsankar and Ghaddar [30] proposed an energy-efficient service placement strategy for latency-sensitive applications, Ko et al. [31] introduced a joint computation offloading and service caching mechanism considering personalized service preferences, and Hortelano et al. [32] provided a comprehensive survey on reinforcement learning-based computation offloading approaches in edge computing systems. Furthermore, Hayal et al. [33] focused on the practical modeling of hovering UAV-assisted free-space optical (FSO) communication systems, investigating performance under pointing errors and atmospheric turbulence, which offers valuable insights for enhancing the reliability of UAV communication and mobile edge computing systems.
Despite this significant progress, most existing studies assume that a UAV can serve only one user within a single time slot, overlooking its potential for parallel task processing. This assumption tends to cause service conflicts and resource wastage in multi-user, high-concurrency scenarios, thereby limiting the overall performance of UAV-assisted mobile edge computing (UAV-MEC) systems.
To fully exploit the UAVs’ capability for concurrent task execution, this paper proposes a novel parallel task offloading and trajectory optimization framework for multi-channel UAV-assisted MEC systems. A practical system model supporting concurrent task processing is developed, and a hierarchical reinforcement learning (HRL)-based optimization approach is introduced to jointly optimize task offloading and UAV trajectory control.
In the proposed framework, the optimization problem is decoupled into two sub-tasks corresponding to two hierarchical layers: the lower layer handles task offloading and resource allocation, while the upper layer focuses on trajectory control. At the lower layer, a discrete soft actor–critic algorithm with invalid action masking (discrete SAC-IAM) is proposed to dynamically filter infeasible task-channel assignments, thereby improving decision stability. Specifically, the lower layer minimizes user device waiting time by performing multi-channel task selection and resource scheduling, ensuring efficient task dispatching. Meanwhile, the upper layer adopts a continuous SAC algorithm to control UAV trajectories, aiming to further reduce system delay while maintaining high service efficiency.
Through the collaborative design of the two layers, the proposed method enhances the training efficiency and decision flexibility of the UAV-MEC system. Simulation results demonstrate that the proposed approach outperforms existing baseline methods in terms of task delay, total accumulated reward, and convergence speed, validating its effectiveness and practicality for parallel task offloading and trajectory optimization in complex scenarios.

2. System Model

To support parallel task offloading and trajectory scheduling in multi-user scenarios, this section builds a UAV-assisted mobile edge computing (MEC) system model with multi-channel concurrent service capability. The UAV serves as an aerial edge node that can simultaneously offload tasks for multiple user equipment (UE) in each time slot. The model considers key factors such as transmission rates, computation limits, task queuing, and UAV mobility and energy constraints, aiming to capture the coordination between offloading and trajectory planning in complex environments. The main variables and parameters involved in the modeling are summarized in Table 1.
First, an overview of the overall system process is presented. As illustrated in Figure 1, a schematic diagram of the UAV-assisted mobile edge computing system is shown. At the beginning, all users generate tasks and send offloading requests. The UAV receives these requests and determines the optimal decision. Subsequently, the system proceeds with decision execution, task transmission, offloading, and result feedback. If there are still remaining tasks, this cycle is repeated iteratively. The detailed system modeling is introduced as follows.

2.1. Basic Definitions

As illustrated in Figure 2, $N$ UEs are randomly distributed in a square region of side length $L_{\max}$, denoted by the set $\mathcal{N}$. A single UAV flies at a fixed altitude $H$ throughout the task execution period $T$, which is discretized into $I$ equal time slots of duration $\tau = T/I$, indexed by the set $\mathcal{I}$.
In each slot $i \in \mathcal{I}$, each UE $n \in \mathcal{N}$ may generate a task of size $D_{n,i}$ requiring $F_{n,i} = s \cdot D_{n,i}$ CPU cycles. The UAV can handle up to $K$ tasks and a total of $D_{\max}$ Mbits per slot.
The UAV's 2D position in slot $i$ is $(X_i, Y_i)$, with heading angle $\alpha_i$ and speed $V_i$. Its movement within the flight duration $t_{\text{fly}}$ of each slot follows:
$$V_{x,i} = V_i \cos(\alpha_i), \qquad V_{y,i} = V_i \sin(\alpha_i), \tag{1}$$
$$X_{i+1} = X_i + V_{x,i} \cdot t_{\text{fly}}, \qquad Y_{i+1} = Y_i + V_{y,i} \cdot t_{\text{fly}}. \tag{2}$$
The horizontal distance from the UAV to UE $n$ is given by:
$$R_{n,i} = \sqrt{(X_i - X_n)^2 + (Y_i - Y_n)^2}. \tag{3}$$
The transmission rate is modeled as:
$$r_{n,i} = B \log_2\!\left(1 + p_n f(R_{n,i}, H)\right). \tag{4}$$
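For concreteness, the following minimal NumPy sketch evaluates Eqs. (3) and (4) for a single UE. The concrete form of the path-loss term $f(R_{n,i}, H)$ is not specified in closed form here, so the gain-over-noise expression and the values of g0 and noise below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def transmission_rate(uav_xy, ue_xy, H, B, p_n, g0=1e-5, noise=1e-13):
    """Uplink rate of one UE, following Eqs. (3) and (4).

    g0 (reference channel gain) and noise (receiver noise power) are
    illustrative assumptions; f(R, H) is only specified abstractly in the text.
    """
    R = np.hypot(uav_xy[0] - ue_xy[0], uav_xy[1] - ue_xy[1])  # Eq. (3): horizontal distance
    f = g0 / ((R ** 2 + H ** 2) * noise)                      # assumed gain-over-noise form of f(R, H)
    return B * np.log2(1.0 + p_n * f)                         # Eq. (4)

# Example with values loosely following Table 2 (1 MHz bandwidth, 0.1 W transmit power):
# rate = transmission_rate((10.0, 10.0), (40.0, 70.0), H=50.0, B=1e6, p_n=0.1)
```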

Offloading Decision

The offloading indicator $z_{n,i} \in \{0, 1\}$ represents whether UE $n$ offloads its task at slot $i$. The UAV's per-slot resource constraints are:
$$\sum_{n=1}^{N} z_{n,i} D_{n,i} \le D_{\max}, \tag{5}$$
$$\sum_{n=1}^{N} z_{n,i} \le K. \tag{6}$$

2.2. Computation Model

If a task is offloaded, the transmission and computation delays are calculated as:
$$t_{n,i}^{\text{tx}} = \frac{z_{n,i} D_{n,i}}{r_{n,i}}, \tag{7}$$
$$t_{n,i}^{\text{uav}} = \frac{z_{n,i} F_{n,i}}{f_{\text{uav}}}. \tag{8}$$
If the task is processed locally by the UE, the local delay is:
$$t_{n,i}^{\text{loc}} = \frac{(1 - z_{n,i}) F_{n,i}}{f_{\text{UE}}}. \tag{9}$$
The total task execution delay for UE $n$ in slot $i$ is then:
$$t_{n,i} = \max\!\left( t_{n,i}^{\text{tx}} + t_{n,i}^{\text{uav}},\; t_{n,i}^{\text{loc}} \right). \tag{10}$$
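As a small worked illustration of Eqs. (7)–(10), the helper below computes the delay of a single UE's task in one slot. It is a sketch whose argument names mirror Table 1, not code from the paper.

```python
def task_delay(z, D, F, rate, f_uav, f_ue):
    """Delay of one UE's task in a slot, per Eqs. (7)-(10).

    z: binary offloading decision, D: task size (bits), F: required CPU cycles,
    rate: uplink rate from Eq. (4) in bit/s, f_uav / f_ue: CPU frequencies (cycles/s).
    """
    t_tx = z * D / rate              # Eq. (7): transmission delay (0 if not offloaded)
    t_uav = z * F / f_uav            # Eq. (8): computation delay on the UAV
    t_loc = (1 - z) * F / f_ue       # Eq. (9): local computation delay
    return max(t_tx + t_uav, t_loc)  # Eq. (10): total execution delay
```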

UAV Energy Consumption

The UAV's energy consumption in each slot comprises flight, hovering, and computation energy:
Flight energy: Based on an aerodynamic model, the UAV flight power is given by:
$$P_i^f = P_0 \left(1 + \frac{3 V_i^2}{U_{\text{tip}}^2}\right) + P_i \frac{V_0}{V_i} + \frac{1}{2} d_0 \rho S A V_i^3, \tag{11}$$
where $P_0$ is the blade profile power, $U_{\text{tip}}$ is the rotor tip speed, $P_i$ is the induced power in hover, $V_0$ is the mean rotor induced velocity in hover, $d_0$ is the fuselage drag ratio, $\rho$ is the air density, $S$ is the rotor disk area, and $A$ is the UAV's frame area.
Then the flight energy during slot $i$ is:
$$E_i^{\text{fly}} = P_i^f \cdot t_{\text{fly}}. \tag{12}$$
Hovering energy:
$$E_i^{\text{hov}} = p_{\text{hov}} \cdot (\tau - t_{\text{fly}}). \tag{13}$$
Computation energy:
$$E_i^{\text{cmp}} = \kappa \cdot (f_{\text{uav}})^{\beta} \cdot \delta_i. \tag{14}$$
Total energy consumption:
$$E_i^{\text{total}} = E_i^{\text{fly}} + E_i^{\text{hov}} + E_i^{\text{cmp}}. \tag{15}$$
The UAV battery level evolves as:
$$B_{i+1} = B_i - E_i^{\text{total}}, \qquad B_i \ge B_{\min}. \tag{16}$$
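To make the per-slot energy bookkeeping concrete, the sketch below evaluates Eqs. (11)–(15). Parameter names mirror Table 1; the small guard against $V_i = 0$ is an implementation assumption, since the induced-power term of Eq. (11) is undefined at zero speed.

```python
def slot_energy(V, t_fly, tau, P0, U_tip, P_ind, V0, d0, rho, S, A,
                p_hov, kappa, f_uav, beta, delta):
    """Per-slot UAV energy consumption, per Eqs. (11)-(15)."""
    V = max(V, 1e-6)  # guard: Eq. (11) is undefined at V = 0 (assumption)
    P_f = (P0 * (1.0 + 3.0 * V**2 / U_tip**2)    # blade profile power
           + P_ind * V0 / V                       # induced power term
           + 0.5 * d0 * rho * S * A * V**3)       # parasite (drag) power, Eq. (11)
    E_fly = P_f * t_fly                           # Eq. (12)
    E_hov = p_hov * (tau - t_fly)                 # Eq. (13)
    E_cmp = kappa * f_uav**beta * delta           # Eq. (14)
    return E_fly + E_hov + E_cmp                  # Eq. (15)

# Battery update, Eq. (16): B_next = B - slot_energy(...), required to stay above B_min.
```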

2.3. Objective Function

The overall objective is to minimize the total user delay over all time slots and users:
$$\min_{\{(X_i, Y_i, V_i, \alpha_i)\},\; \{z_{n,i}\}} \; \sum_{i=1}^{I} \sum_{n=1}^{N} t_{n,i}, \tag{17}$$
subject to Constraints (5) and (6) and the following:
$$X_i, Y_i \in [0, L_{\max}], \tag{18}$$
$$0 \le V_i \le V_{\max}, \qquad 0 \le \alpha_i < 2\pi, \tag{19}$$
$$B_{i+1} = B_i - E_i^{\text{total}}. \tag{20}$$
Battery Constraint (16) is enforced as well to ensure the UAV always operates above the safety level $B_{\min}$.
In summary, this section models a multi-channel UAV-assisted MEC system, where key practical constraints such as the maximum task load per slot and the parallel channel concurrency limit are jointly considered. The system formulation integrates the optimization of UAV trajectory, task offloading decisions, and energy consumption management, with the objective of minimizing total user delay under mobility, communication, and energy constraints.

3. Hierarchical Optimization via SAC and IAM-Enhanced Discrete SAC

To efficiently solve the above joint optimization problem with multiple constraints and strong coupling, conventional optimization methods often encounter challenges in modeling complexity, poor generalization, and high computational overhead, especially in high-dimensional and mixed-action spaces. Given that UAV control involves a continuous action space, while user task scheduling belongs to a discrete decision space, this section adopts a hierarchical reinforcement learning (HRL) framework to decouple and collaboratively optimize the two sub-tasks.
Specifically, in the high-level policy, a soft actor–critic (SAC) agent is employed to handle the continuous control of UAV trajectory and energy management, enabling dynamic planning of UAV speed and heading. At each decision point along the trajectory, a low-level policy is designed based on a discrete SAC algorithm enhanced with an invalid action masking (IAM) mechanism, which effectively filters out infeasible actions and improves the efficiency and quality of service selection decisions in large discrete action spaces.
This hierarchical learning structure significantly enhances policy convergence and training stability, while improving the system’s adaptability and scalability in highly dynamic and constrained environments.

3.1. Hierarchical Decomposition of the Optimization Problem

At the beginning of each time slot i, the UAV determines a new position to move to, aiming to achieve satisfactory user-latency performance during this slot. After moving to the new position, the UAV remains static for the remainder of the time slot and makes offloading decisions to further reduce the total user latency. These two decision steps alternate until i reaches I.
Given the difference in action characteristics (continuous movement vs. discrete offloading), the overall optimization is split into two sub-problems:
  • Location Optimization ($P_{\text{loc}}$): At the beginning of time slot $i$, determine the UAV's new position $(x_{i+1}, y_{i+1})$ under Constraints (16), (18)–(20). This affects the distances to users and thus the transmission rate, indirectly influencing the offloading performance within the slot. Formally:
$$\min_{\{(x_t, y_t)\}} \sum_{i=1}^{I} \sum_{n=1}^{N} t_{n,i}, \quad \text{s.t.}\ (16), (18)\text{–}(20).$$
  • Offloading Optimization ($P_{\text{off}}$): Given the UAV's position within a time slot, decide which users to serve (i.e., which tasks to offload) to minimize the total user latency:
$$\min_{\{a_{n,t}\}} \sum_{i=1}^{I} \sum_{n=1}^{N} t_{n,i}, \quad \text{s.t.}\ (5), (6).$$
By separating the problem into these two layers, we can effectively handle the high-dimensional joint optimization of continuous and discrete decisions.

3.2. Hierarchical Reinforcement Learning Framework

Before jointly training both decision layers, we first focus on the offloading decision layer in isolation. Specifically, in each training episode, we randomly generate user positions, task demands, and an initial UAV location. The UAV remains static during the episode, and the offloading layer is trained to learn optimal user selection strategies that minimize total latency under random spatial and task conditions.
Figure 3 illustrates the overall architecture of our hierarchical reinforcement learning framework, which decomposes the optimization into two coordinated decision layers:
  • Location Optimization Layer: Observes the environment state and outputs a continuous action representing the UAV's next position (or equivalently, its velocity direction $\theta_t$ and speed $V_t$). Once the UAV reaches the new position, control is passed to the lower layer.
  • Offloading Decision Layer: Given the UAV’s current location, it determines which users to serve within the time slot using discrete actions, while ensuring the parallel channel and capacity constraints are respected.
The two layers share portions of the environment state (e.g., battery levels, user locations, task demands), but are trained separately, each optimizing its respective sub-objective. This design reduces the overall complexity of joint optimization and improves training efficiency.
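A minimal sketch of this two-layer interaction is given below. The environment interface (env.reset, env.move_uav, env.offload, env.finish_slot) and the policy objects are hypothetical names introduced only to make the control flow explicit; they are not APIs from the paper.

```python
def run_episode(env, upper_policy, lower_policy, num_slots):
    """One episode of the hierarchical scheme: per slot, the location layer moves
    the UAV, then the offloading layer selects users under an invalid-action mask
    until resources run out (a sketch, not the paper's implementation)."""
    state = env.reset()
    for _ in range(num_slots):
        velocity = upper_policy.act(state)        # continuous action (Vx, Vy)
        state = env.move_uav(velocity)            # UAV flies, then hovers
        mask = env.valid_action_mask()            # index 0 = "serve no one"
        while mask[1:].any():                     # some user is still feasible
            user = lower_policy.act(state, mask)  # masked discrete selection
            state, mask = env.offload(user)       # serve user, shrink the mask
        state = env.finish_slot()                 # tasks execute, delays accrue
    return env.total_delay()
```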

3.3. Implementation of the Offloading Decision Layer

The offloading decision layer aims to select a subset of users to be served in each time slot, given the UAV’s current location. This layer is modeled as a Markov decision process (MDP), and we employ a discrete variant of the soft actor–critic (SAC) algorithm enhanced with an invalid action masking (IAM) mechanism, which explicitly filters infeasible actions to improve training stability and efficiency.

3.3.1. State Space, Action Space, and Reward Design

State space. Let $s_t^{\text{off}}$ denote the state at time slot $t$, including the following:
  • UAV position $(x_t, y_t)$;
  • Relative user positions: $\{(x_n - x_t), (y_n - y_t)\}_{n=1}^{N}$;
  • Task demand of each user $C_{n,t}$;
  • User availability indicator (selectable or not);
  • Current UAV battery level $B_t$, remaining channel count $K' \le K$, and remaining memory (if applicable).
Action space. At each step, the agent selects one user to serve. The action space is $\mathcal{A}^{\text{off}} = \{0, 1, \ldots, N\}$, where $a = 0$ indicates no user is served due to resource constraints.
Reward function. The reward is defined as the negative of the total latency:
$$r_t^{\text{off}} = -\sum_{n=1}^{N} t_{n,t}.$$

3.3.2. Discrete-SAC with Invalid Action Masking

The SAC algorithm maximizes a reward–entropy objective. In discrete action spaces, the policy $\pi_\phi(a|s)$ is modeled as a categorical distribution via the Softmax function. To enhance decision reliability under resource constraints, the invalid action masking (IAM) technique is incorporated into key SAC components.
Maximum entropy objective.
$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\!\left[ R(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot|s_t)\big) \right]$$
Q-function update.
$$y = r + \gamma \sum_{a'} \pi_\phi(a'|s')\!\left[ \min_i Q_{\theta_i}(s', a') - \alpha \log \pi_\phi(a'|s') \right]$$
Soft value function.
$$V_\psi(s) = \sum_{a} \pi_\phi(a|s)\!\left[ \min_i Q_{\theta_i}(s, a) - \alpha \log \pi_\phi(a|s) \right]$$
Policy update.
$$J(\pi_\phi) = \mathbb{E}_{s}\!\left[ \sum_{a} \pi_\phi(a|s)\big( \alpha \log \pi_\phi(a|s) - \min_i Q_{\theta_i}(s, a) \big) \right]$$
Temperature update.
$$J(\alpha) = \mathbb{E}_{s,\, a \sim \pi_\phi}\!\left[ -\alpha \big( \log \pi_\phi(a|s) + \mathcal{H}_{\text{target}} \big) \right]$$

3.3.3. Invalid Action Masking (IAM)

We define a binary mask $m(s)$:
$$m_a(s) = \begin{cases} 1, & \text{if action } a \text{ is valid}, \\ 0, & \text{otherwise}. \end{cases}$$
Masked soft value function:
$$V_\psi(s) = \sum_{a} m_a(s)\, \pi_\phi(a|s)\!\left[ \min_i Q_{\theta_i}(s, a) - \alpha \log \pi_\phi(a|s) \right]$$
Masked policy update:
$$J(\pi_\phi) = \mathbb{E}_{s}\!\left[ \sum_{a} m_a(s)\, \pi_\phi(a|s)\big( \alpha \log \pi_\phi(a|s) - \min_i Q_{\theta_i}(s, a) \big) \right]$$
Masked temperature update:
$$J(\alpha) = \mathbb{E}_{s,\, a \sim \pi_\phi}\!\left[ -\alpha \big( m_a(s) \log \pi_\phi(a|s) + \mathcal{H}_{\text{target}} \big) \right]$$
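In practice, these masked quantities can be realized by pushing the logits of invalid actions to a very large negative value before the Softmax, so that invalid actions receive (numerically) zero probability and contribute nothing to the expectations. The PyTorch sketch below is one illustrative realization under that assumption, not the paper's TensorFlow implementation; q1 and q2 denote the two critics' outputs over all actions, and the mask is a 0/1 float tensor.

```python
import torch
import torch.nn.functional as F

def masked_policy(logits, mask):
    """Categorical policy over valid actions only (mask: 1 = valid, 0 = invalid)."""
    masked_logits = logits.masked_fill(mask == 0, -1e9)  # effectively removes invalid actions
    return F.softmax(masked_logits, dim=-1), F.log_softmax(masked_logits, dim=-1)

def masked_soft_value(logits, mask, q1, q2, alpha):
    """Masked soft state value V_psi(s), mirroring the masked update above."""
    probs, log_probs = masked_policy(logits, mask)
    q_min = torch.min(q1, q2)                             # clipped double-Q estimate
    return (mask * probs * (q_min - alpha * log_probs)).sum(dim=-1)
```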

3.3.4. Iterative Offloading Decision Strategy

To avoid infeasible actions in large discrete spaces, the proposed method uses an iterative offloading strategy with IAM. Instead of generating all offloading decisions in one shot, the policy selects users one by one, updating the mask $m(s)$ after each step to exclude already-selected or invalid users. This process continues until UAV resources are fully utilized or capacity limits are reached.
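The sketch below illustrates one way the mask can be maintained during this iterative selection. The bookkeeping variables (used_data, used_channels) and the convention that action 0 means "serve no further user" are illustrative assumptions, chosen to be consistent with Constraints (5) and (6).

```python
def update_mask(mask, chosen, task_sizes, used_data, used_channels, D_max, K):
    """Invalidate actions after one selection step (illustrative sketch).

    mask[0] is the no-op action and stays valid; mask[n] (n >= 1) refers to UE n.
    """
    if chosen > 0:
        mask[chosen] = 0                      # a user cannot be selected twice
        used_data += task_sizes[chosen - 1]
        used_channels += 1
    for n in range(1, len(mask)):
        if mask[n] and (used_channels >= K                          # Constraint (6): channel limit
                        or used_data + task_sizes[n - 1] > D_max):  # Constraint (5): data limit
            mask[n] = 0
    return mask, used_data, used_channels
```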

3.3.5. Offloading Decision Layer Training Pseudocode

As illustrated in Figure 4, the diagram presents the training workflow of the lower-layer network. The UAV interacts with the environment to obtain the current state $s_t$, which is fed into the policy network along with the corresponding invalid action mask $M(s_t)$ to generate a valid action $a_t$. This process is repeated iteratively until the UAV's resources are fully utilized. The selected action is then executed, receiving an immediate reward, and the transition tuple $\{s_t, a_t, r_t, s_{t+1}, M(s_t)\}$ is stored in the replay buffer.
Once the buffer has accumulated sufficient experience samples, the training process is triggered. Specifically, a mini-batch of samples is drawn from the buffer to compute the target value using the target Q-networks. The temporal difference error between the target and current Q-values is used to calculate the loss for updating the Q-networks. Meanwhile, the policy network is optimized using policy gradients derived from the minimum Q-value and the entropy regularization term. The entire training process follows the standard discrete soft actor–critic (SAC) update procedure, enhanced with invalid action masking (IAM) to ensure learning occurs strictly within the valid action space. This integration significantly improves training efficiency and policy stability under discrete constraints.
The offloading decision layer is trained first. Its training procedure is given in Algorithm 1:
Algorithm 1 Offloading decision with IAM-SAC.
 1: Initialize: $Q_{\theta_1}$, $Q_{\theta_2}$, policy network $\pi_\phi$, value network $V_\psi$, temperature $\alpha$, replay buffer $\mathcal{D}$
 2: Set $m_a(s)$ to mask invalid actions
 3: for episode = 1 to max_episodes do
 4:     Randomly initialize UAV and user positions, task demands
 5:     Observe initial state $s_0^{\text{off}}$
 6:     for $t = 0$ to $T - 1$ do
 7:         Determine valid actions via $m_a(s_t)$
 8:         Sample action $a_t \sim \pi_\phi$ (excluding invalid actions)
 9:         Execute $a_t$, observe reward $r_t$ and next state $s_{t+1}$
10:         Store transition $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
11:         Update Q-networks and policy via the masked SAC update rules
12:     end for
13: end for

3.4. Implementation of the Location Optimization Layer

The goal of the location optimization layer is to determine the UAV’s continuous movement at the beginning of each time slot, in order to minimize the total task latency through better positioning. Once the movement decision is executed, the offloading layer operates for the remainder of the time slot.
To simplify the action representation and make it more interpretable, we parameterize the UAV's movement by two components: the speeds along the x- and y-axes, denoted as $V_{x,t}$ and $V_{y,t}$, respectively. The UAV then updates its position accordingly.

3.4.1. State Space, Action Space, and Reward Design

State space. Let $s_t^{\text{loc}}$ denote the state observed by the location optimization layer at time $t$. It includes:
  • UAV current position $(x_t, y_t)$;
  • Positions of all users relative to the UAV;
  • Task demand and status (waiting/served) of each user;
  • UAV status: battery level $B_t$, available channels, memory, etc.
Action space. The output of this layer is a continuous 2D velocity vector:
$$a_t = (V_{x,t}, V_{y,t}) \in \mathbb{R}^2.$$
This controls the UAV’s movement at the beginning of each time slot.
Reward function. After the UAV moves, the pre-trained offloading decision policy is invoked to determine the latency, which is used as the reward:
$$r_t^{\text{loc}} = -\sum_{n=1}^{N} t_{n,t} - \lambda_{\text{bound}} \cdot \mathbb{I}(\text{out-of-bound}),$$
where $\mathbb{I}(\cdot)$ is an indicator function for out-of-bound violations.

3.4.2. Training via Continuous Soft Actor–Critic (SAC)

The action space is continuous, so the standard SAC algorithm is applied with the reparameterization trick. Unlike discrete SAC, continuous SAC samples actions from a Gaussian distribution, which allows gradients to be backpropagated through the sampled action.
Soft Q-function update.
$$J(Q_{\theta_i}) = \mathbb{E}_{(s, a, r, s')}\!\left[ \big( Q_{\theta_i}(s, a) - y \big)^2 \right],$$
with target:
$$y = r + \gamma \left( \min_{j=1,2} Q_{\theta_j}(s', \tilde{a}') - \alpha \log \pi_\phi(\tilde{a}'|s') \right),$$
where $\tilde{a}' \sim \pi_\phi(\cdot|s')$ via reparameterization.
Policy update.
$$J(\pi_\phi) = \mathbb{E}_{s \sim \mathcal{D},\, \epsilon \sim \mathcal{N}(0,1)}\!\left[ \alpha \log \pi_\phi(\tilde{a}|s) - \min_{i=1,2} Q_{\theta_i}(s, \tilde{a}) \right],$$
where $\tilde{a} = \mu_\phi(s) + \sigma_\phi(s) \cdot \epsilon$.
Temperature update.
$$J(\alpha) = \mathbb{E}_{a \sim \pi_\phi}\!\left[ -\alpha \big( \log \pi_\phi(a|s) + \mathcal{H}_{\text{target}} \big) \right].$$
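A compact PyTorch sketch of the reparameterized, tanh-squashed Gaussian action for the location layer is shown below. It is illustrative only (the paper's code uses TensorFlow 1.15); the velocity bound v_max = 4 m/s is taken from Table 2, and the constant log-term from rescaling by that bound is omitted for brevity.

```python
import math
import torch

def sample_action(mu, log_std, v_max=4.0):
    """Reparameterized action a = tanh(mu + sigma * eps) * v_max and its log-probability."""
    std = log_std.exp()
    eps = torch.randn_like(mu)                  # eps ~ N(0, 1)
    pre_tanh = mu + std * eps                   # a_tilde = mu_phi(s) + sigma_phi(s) * eps
    action = torch.tanh(pre_tanh) * v_max       # bounded (V_x, V_y)
    # Gaussian log-density plus the tanh change-of-variables correction
    log_prob = (-0.5 * ((pre_tanh - mu) / std) ** 2
                - log_std - 0.5 * math.log(2 * math.pi)).sum(dim=-1)
    log_prob = log_prob - torch.log(1 - torch.tanh(pre_tanh) ** 2 + 1e-6).sum(dim=-1)
    return action, log_prob
```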

3.4.3. Location Layer Training Pseudocode

The SAC-based training process used in this section follows a similar procedure to the algorithm presented in the previous section, with the key difference being the removal of the invalid action masking (IAM) mechanism. As this adjustment simplifies the structure, a detailed description is omitted here. The training of the upper-layer network is built upon the foundation of the lower-layer network. Before initiating the upper-layer training, a brief pre-training phase is conducted for the lower layer to provide a reasonable initialization. This pre-training does not require full convergence. Subsequently, during the training of the upper-layer policy, the lower-layer network continues to be updated in parallel. This joint optimization strategy enables the coordination between both layers to better achieve the overall task objective.
The training pseudocode is given in Algorithm 2:
Algorithm 2 Location Optimization with SAC.
 1: Initialize: $Q_{\theta_1}$, $Q_{\theta_2}$, policy network $\pi_\phi$ (parameterized by $\mu_\phi$, $\sigma_\phi$), value network $V_\psi$, temperature $\alpha$, replay buffer $\mathcal{D}$
 2: for episode = 1 to max_episodes do
 3:     Randomly initialize UAV and user positions, task demands
 4:     Observe initial state $s_0^{\text{loc}}$
 5:     for $t = 0$ to $T - 1$ do
 6:         Sample action $a_t = (V_{x,t}, V_{y,t})$ from $\pi_\phi(s_t)$ via reparameterization
 7:         Update the UAV position based on the velocity
 8:         Use the pre-trained offloading model to compute the latency
 9:         Compute reward $r_t$, observe next state $s_{t+1}$
10:         Store transition $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
11:         Update Q-networks and policy via the SAC update rules
12:     end for
13: end for
Through this hierarchical structure, the location optimization layer and the offloading decision layer are trained to focus on distinct sub-objectives (continuous UAV movement vs. discrete offloading decisions). In particular, the offloading layer with invalid action masking (IAM) ensures that only feasible user selections are considered, while the upper layer learns optimal positioning to support such selections. Simulation results in the next section demonstrate the performance gains achieved by this coordinated learning scheme.

4. Simulation Results

To evaluate the performance of the proposed hierarchical soft actor–critic framework (H-SAC) for multi-channel UAV-assisted MEC systems, a series of comparative experiments were conducted against baseline methods, including standard SAC and DDPG. This section presents the experimental setup, performance metrics, and a detailed comparison in terms of convergence, user latency, UAV trajectory planning, and task scheduling.

4.1. Experimental Setup

All simulations were implemented in a Python 3.6 environment using TensorFlow 1.15. The UAV operates within a square area of 100 m × 100 m, with N = 10 user devices randomly distributed. The UAV starts from position (10, 10). All relevant parameters are summarized in Table 2.
In both the continuous SAC and discrete SAC implementations, the actor and critic networks are designed as multi-layer perceptrons (MLPs) with three fully connected hidden layers consisting of 256, 128, and 64 neurons, respectively. All hidden layers use the ReLU activation function.
For the continuous SAC module applied to UAV trajectory control, the actor network outputs the mean and log standard deviation of a Gaussian distribution, which are then passed through a Tanh function to produce bounded continuous actions. The critic network estimates the Q-value of continuous state–action pairs.
In the discrete SAC module used for offloading decision-making, the actor network outputs a categorical probability distribution over all candidate actions via a Softmax layer. The critic network outputs Q-values for each discrete action given the current state. To handle resource constraints and action feasibility, an invalid action masking (IAM) mechanism is integrated into both the policy and value learning processes, ensuring that only valid actions are considered during training.
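For reference, a PyTorch-style sketch of the discrete actor described above is given below. The paper's implementation is in TensorFlow 1.15, and state_dim / num_actions are placeholders, so this is only an architectural illustration; the Softmax (together with the IAM mask) is applied to the returned logits at sampling time.

```python
import torch.nn as nn

class DiscreteActor(nn.Module):
    """MLP actor with 256-128-64 ReLU hidden layers over the N + 1 candidate actions."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_actions),   # raw logits; masked Softmax applied downstream
        )

    def forward(self, state):
        return self.net(state)
```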
Three algorithms are evaluated:
  • H-SAC (Proposed): A hierarchical RL framework with a continuous SAC for trajectory planning and a discrete SAC with IAM for offloading.
  • SAC: A unified single-layer SAC that jointly outputs movement (continuous) and offloading decisions (discretized).
  • DDPG: A single-layer DDPG using continuous outputs for both movement and offloading.

4.2. Convergence Analysis

Figure 5 and Figure 6 show the reward convergence under two different parallel channel settings ( K = 2 and K = 4 ).
When K = 2 , all algorithms exhibit similar convergence trends. H-SAC achieves a slightly higher final reward but the improvement is modest due to the limited parallelism. In contrast, for K = 4 , H-SAC demonstrates faster convergence and higher final rewards, showing a clear advantage in leveraging multi-channel capacity via coordinated trajectory and offloading decisions.

4.3. Latency Performance

Figure 7 and Figure 8 illustrate the average latency across training episodes under different channel settings.
With K = 2 , all methods converge to comparable latency levels, with H-SAC maintaining a slight edge. When K = 4 , H-SAC substantially outperforms the baselines, achieving significantly lower average latency due to improved resource utilization and conflict avoidance.

4.4. Trajectory and Offloading Comparison

To further highlight the behavioral differences between the two methods, Figure 9 and Figure 10 visualize the UAV’s trajectory and its offloading behavior after training convergence under K = 4 .
Each figure consists of a sequence of six snapshots selected from the simulation window spanning time slots T = 0 to T = 24 , with one snapshot approximately every four time slots. Here, T denotes the index of the decision-making time slot. The UAV’s motion trajectory and the offloaded users at each time step are visualized to reflect the spatio-temporal decision pattern of the agents. (In the figures, the solid blue lines represent the UAV’s movement trajectory, while the dashed blue lines indicate the UAV’s user selection decisions. Red circles denote the positions of the users, and blue squares mark the UAV’s discrete movement points).
Trajectory Planning:
  • H-SAC: The UAV rapidly navigates toward task-dense regions and adapts its flight direction strategically at each time slot. The resulting path is smooth and purposeful, minimizing unnecessary deviations.
  • SAC: The UAV’s trajectory is less efficient, with more erratic and indirect movements. It takes longer to approach areas with high task demands, and the path exhibits more abrupt changes.
Offloading Efficiency:
  • H-SAC: The UAV consistently offloads tasks to the maximum allowed number of users per slot (K), with no task conflicts. The IAM-enhanced decision layer ensures valid and non-redundant selections, maximizing channel utilization.
  • SAC: Offloading behavior is occasionally suboptimal. Some users may be selected multiple times or channels may be left unused due to uncoordinated decisions, leading to resource underutilization and longer delays.
These results confirm that the hierarchical structure in H-SAC facilitates both more effective trajectory planning and more efficient multi-user offloading in high-concurrency scenarios.

5. Conclusions and Future Work

This paper proposes a parallel task offloading and trajectory optimization approach for multi-channel UAV-assisted mobile edge computing (MEC) systems. Focusing on the multi-channel task offloading and path planning problem in single-UAV MEC scenarios, a system model supporting concurrent task execution is developed (see Section 2), and a hierarchical reinforcement learning (HRL)-based optimization framework, namely H-SAC, is introduced (see Section 3). The overall optimization problem is decomposed into two sub-tasks: continuous action space (trajectory control) and discrete action space (task selection), which are optimized by a continuous soft actor–critic and a discrete SAC enhanced with invalid action masking (IAM), respectively. Extensive simulation results (see Section 4) validate that the proposed method achieves superior performance compared to conventional single-layer reinforcement learning algorithms in terms of reducing user latency, improving resource utilization, and enhancing training stability.
Future research could further extend the applicability of the proposed H-SAC framework to more realistic and complex environments. For instance, incorporating user mobility, non-ideal wireless channels, and environmental disturbances would enhance the system’s adaptability to real-world deployment conditions. In terms of optimization methods, adopting multi-objective reinforcement learning strategies could enable joint optimization of trajectory planning and task offloading. Additionally, introducing higher-fidelity UAV energy consumption models would contribute to improving the realism of system modeling. In large-scale user scenarios, it is necessary to design scalable and efficient scheduling mechanisms to effectively handle hundreds of concurrent task requests.

Author Contributions

Conceptualization, T.W. and X.N.; methodology, T.W.; software, T.W. and Y.N.; validation, T.W., Y.N. and J.L.; formal analysis, T.W.; investigation, T.W. and Z.M.; resources, T.W.; data curation, J.L. and W.W.; writing—original draft preparation, T.W.; writing—review and editing, T.W. and X.N.; visualization, T.W.; supervision, X.N.; project administration, X.N.; funding acquisition, X.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China, Project No. 62263025.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV    Unmanned Aerial Vehicle
MEC    Mobile Edge Computing
IoT    Internet of Things
QoS    Quality of Service
SAC    Soft Actor–Critic
DDPG    Deep Deterministic Policy Gradient
H-SAC    Hierarchical Soft Actor–Critic
RL    Reinforcement Learning
IAM    Invalid Action Masking
UE    User Equipment

References

  1. Liu, S.; Yu, Y.; Lian, X.; Feng, Y.; She, C.; Yeoh, P.L.; Guo, L.; Vucetic, B.; Li, Y. Dependent Task Scheduling and Offloading for Minimizing Deadline Violation Ratio in Mobile Edge Computing Networks. IEEE J. Sel. Areas Commun. 2023, 41, 538–554.
  2. Azfar, T.; Huang, K.; Ke, R. Enhancing disaster resilience with UAV-assisted edge computing: A reinforcement learning approach to managing heterogeneous edge devices. arXiv 2025, arXiv:2501.15305.
  3. Sun, G.; He, L.; Sun, Z.; Wu, Q.; Liang, S.; Li, J.; Niyato, D.; Leung, V.C. Joint task offloading and resource allocation in aerial-terrestrial UAV networks with edge and fog computing for post-disaster rescue. arXiv 2023, arXiv:2309.16709.
  4. Zhang, G.; He, Z.; Cui, M. Energy consumption optimization in UAV-assisted mobile edge computing systems based on deep reinforcement learning. J. Electron. Inf. Technol. 2023, 45, 1635–1643. (In Chinese)
  5. Zhang, J.; Zhou, L.; Tang, Q.; Ngai, E.C.-H.; Hu, X.; Zhao, H.; Wei, J. Stochastic Computation Offloading and Trajectory Scheduling for UAV-Assisted Mobile Edge Computing. IEEE Internet Things J. 2018, 6, 3688–3699.
  6. Wan, S.; Lu, J.; Fan, P.; Letaief, K.B. Toward Big Data Processing in IoT: Path Planning and Resource Management of UAV Base Stations in Mobile-Edge Computing System. IEEE Internet Things J. 2019, 7, 5995–6009.
  7. Wang, L.; Huang, P.; Wang, K.; Zhang, G.; Zhang, L.; Aslam, N.; Yang, K. RL-based user association and resource allocation for multi-UAV enabled MEC. In Proceedings of the 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), Tangier, Morocco, 24–28 June 2019; pp. 741–746.
  8. Asheralieva, A.; Niyato, D. Hierarchical Game-Theoretic and Reinforcement Learning Framework for Computational Offloading in UAV-Enabled Mobile Edge Computing Networks with Multiple Service Providers. IEEE Internet Things J. 2019, 6, 8753–8769.
  9. Wang, L.; Wang, K.; Pan, C.; Xu, W.; Aslam, N.; Nallanathan, A. Deep Reinforcement Learning Based Dynamic Trajectory Control for UAV-Assisted Mobile Edge Computing. IEEE Trans. Mob. Comput. 2021, 21, 3536–3550.
  10. Wang, Y.; Fang, W.; Ding, Y.; Xiong, N. Computation offloading optimization for UAV-assisted mobile edge computing: A deep deterministic policy gradient approach. Wirel. Netw. 2021, 27, 2991–3006.
  11. Peng, Y.; Liu, Y.; Li, D.; Zhang, H. Deep Reinforcement Learning Based Freshness-Aware Path Planning for UAV-Assisted Edge Computing Networks with Device Mobility. Remote Sens. 2022, 14, 4016.
  12. Zhang, L. Research on UAV Dynamic Deployment For Mobile Edge Computing. Master’s Thesis, University of Technology, Xi’an, China, 2023. (In Chinese)
  13. Wang, Y.; Gao, Z.; Zhang, J.; Cao, X.; Zheng, D.; Gao, Y.; Ng, D.; Di Renzo, M. Trajectory design for UAV-based Internet of Things data collection: A deep reinforcement learning approach. IEEE Internet Things J. 2021, 9, 3899–3912.
  14. Li, K.; Ni, W.; Yuan, X.; Noor, A.; Jamalipour, A. Exploring graph neural networks for joint cruise control and task offloading in UAV-enabled mobile edge computing. In Proceedings of the 2023 IEEE 97th Vehicular Technology Conference (VTC2023-Spring), Florence, Italy, 20–23 June 2023; pp. 1–6.
  15. Ren, T.; Niu, J.; Dai, B.; Liu, X.; Hu, Z.; Xu, M.; Guizani, M. Enabling Efficient Scheduling in Large-Scale UAV-Assisted Mobile-Edge Computing via Hierarchical Reinforcement Learning. IEEE Internet Things J. 2021, 9, 7095–7109.
  16. Birman, Y.; Ido, Z.; Katz, G.; Shabtai, A. Hierarchical deep reinforcement learning approach for multi-objective scheduling with varying queue sizes. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–10.
  17. Zhang, Y.; Mou, Z.; Gao, F.; Xing, L.; Jiang, J.; Han, Z. Hierarchical Deep Reinforcement Learning for Backscattering Data Collection With Multiple UAVs. IEEE Internet Things J. 2020, 8, 3786–3800.
  18. Shi, W.; Li, J.; Wu, H.; Zhou, C.; Cheng, N.; Shen, X. Drone-Cell Trajectory Planning and Resource Allocation for Highly Mobile Networks: A Hierarchical DRL Approach. IEEE Internet Things J. 2020, 8, 9800–9813.
  19. Geng, Y.; Liu, E.; Wang, R.; Liu, Y. Hierarchical Reinforcement Learning for Relay Selection and Power Optimization in Two-Hop Cooperative Relay Network. IEEE Trans. Commun. 2021, 70, 171–184.
  20. Zhou, H.; Long, Y.; Zhang, W.; Xu, J.; Gong, S. Hierarchical multi-agent deep reinforcement learning for backscatter-aided data offloading. In Proceedings of the 2022 IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, 10–13 April 2022; pp. 542–547.
  21. Yao, Z.; Xia, S.; Li, Y.; Wu, G. Cooperative Task Offloading and Service Caching for Digital Twin Edge Networks: A Graph Attention Multi-Agent Reinforcement Learning Approach. IEEE J. Sel. Areas Commun. 2023, 41, 3401–3413.
  22. Xu, J.; Chen, L.; Zhou, P. Joint service caching and task offloading for mobile edge computing in dense networks. In Proceedings of the IEEE INFOCOM 2018—IEEE Conference on Computer Communications, Honolulu, HI, USA, 16–19 April 2018; pp. 207–215.
  23. Fang, C.; Xu, H.; Zhang, T.; Li, Y.; Ni, W.; Han, Z.; Guo, S. Joint Task Offloading and Content Caching for NOMA-Aided Cloud-Edge-Terminal Cooperation Networks. IEEE Trans. Wirel. Commun. 2024, 23, 15586–15600.
  24. Lin, N.; Han, X.; Hawbani, A.; Sun, Y.; Guan, Y.; Zhao, L. Deep Reinforcement Learning-Based Dual-Timescale Service Caching and Computation Offloading for Multi-UAV Assisted MEC Systems. IEEE Trans. Netw. Serv. Manag. 2024, 22, 605–617.
  25. Xu, Y.; Peng, Z.; Song, N.; Qiu, Y.; Zhang, C.; Zhang, Y. Joint Optimization of Service Caching and Task Offloading for Customer Application in MEC: A Hybrid SAC Scheme. IEEE Trans. Consum. Electron. 2024.
  26. Wang, Z.; Zhao, W.; Hu, P.; Zhang, X.; Liu, L.; Fang, C.; Sun, Y. UAV-Assisted Mobile Edge Computing: Dynamic Trajectory Design and Resource Allocation. Sensors 2024, 24, 3948.
  27. Sun, G.; Wang, Y.; Sun, Z.; Wu, Q.; Kang, J.; Niyato, D.; Leung, V.C.M. Multi-Objective Optimization for Multi-UAV-Assisted Mobile Edge Computing. IEEE Trans. Mob. Comput. 2024, 23, 14803–14820.
  28. Cui, J.; Wei, Y.; Wang, J.; Shang, L.; Lin, P. Joint trajectory design and resource allocation for UAV-assisted mobile edge computing in power convergence network. EURASIP J. Wirel. Commun. Netw. 2025, 2025, 4.
  29. Hui, M.; Chen, J.; Yang, L.; Lv, L.; Jiang, H.; Al-Dhahir, N. UAV-Assisted Mobile Edge Computing: Optimal Design of UAV Altitude and Task Offloading. IEEE Trans. Wirel. Commun. 2024, 23, 13633–13647.
  30. Premsankar, G.; Ghaddar, B. Energy-Efficient Service Placement for Latency-Sensitive Applications in Edge Computing. IEEE Internet Things J. 2022, 9, 17926–17937.
  31. Ko, S.-W.; Kim, S.J.; Jung, H.; Choi, S.W. Computation Offloading and Service Caching for Mobile Edge Computing Under Personalized Service Preference. IEEE Trans. Wirel. Commun. 2022, 21, 6568–6583.
  32. Hortelano, D.; de Miguel, I.; Barroso, R.J.D.; Aguado, J.C.; Merayo, N.; Ruiz, L.; Asensio, A.; Masip-Bruin, X.; Fernández, P.; Lorenzo, R.M.; et al. A comprehensive survey on reinforcement-learning-based computation offloading techniques in Edge Computing Systems. J. Netw. Comput. Appl. 2023, 216, 103669.
  33. Hayal, M.R.; Elsayed, E.E.; Kakati, D.; Singh, M.; Elfikky, A.; Boghdady, A.I.; Grover, A.; Mehta, S.; Mohsan, S.A.H.; Nurhidayat, I. Modeling and investigation on the performance enhancement of hovering UAV-based FSO relay optical wireless communication systems under pointing errors and atmospheric turbulence effects. Opt. Quantum Electron. 2023, 55, 625.
Figure 1. Flow diagram of UAV-assisted mobile edge computing.
Figure 2. Illustration of UAV-assisted mobile edge computing.
Figure 3. Illustration of the hierarchical reinforcement learning framework.
Figure 4. Illustration of the discrete SAC framework with invalid action masking (IAM).
Figure 5. Convergence curves under K = 2.
Figure 6. Convergence curves under K = 4.
Figure 7. Latency curves under K = 2.
Figure 8. Latency curves under K = 4.
Figure 9. UAV trajectory and offloading decisions under H-SAC (K = 4).
Figure 10. UAV trajectory and offloading decisions under SAC (K = 4).
Table 1. List of main notations.

Notation | Description
System Model
$N$, $\mathcal{N}$ | Number and set of user equipments (UEs)
$T$, $I$, $\tau$, $\mathcal{I}$ | Total task duration, number of slots, slot length, and set of slots
$D_{n,i}$ | Task data size (Mbits) generated by UE $n$ at slot $i$
$F_{n,i}$ | Required CPU cycles for UE $n$'s task at slot $i$
$s$ | CPU cycles required per unit data
$K$ | Maximum number of tasks the UAV can process in parallel per slot
$D_{\max}$ | Maximum data volume (Mbits) the UAV can process per slot
$(X_i, Y_i)$ | 2D coordinate of the UAV in slot $i$
$H$ | Fixed altitude of the UAV
$V_i$, $\alpha_i$ | UAV speed and heading angle in slot $i$
$V_{x,i}$, $V_{y,i}$ | UAV velocity components in the x and y directions at slot $i$
$P_0$ | Blade profile power in the flight energy model
$U_{\text{tip}}$ | Rotor blade tip speed
$P_i$ | Induced power in hover
$V_0$ | Mean rotor induced velocity in hover
$d_0$ | Fuselage drag ratio
$\rho$ | Air density
$S$ | Rotor disk area
$A$ | UAV frame area
$t_{\text{fly}}$ | UAV flight duration within each slot
$(X_n, Y_n)$ | 2D position of UE $n$
$R_{n,i}$ | Horizontal distance between the UAV and UE $n$ at slot $i$
$r_{n,i}$ | Transmission rate from UE $n$ to the UAV at slot $i$
$B$, $p_n$ | Bandwidth and transmit power of UE $n$
$f(R_{n,i}, H)$ | Path loss function including altitude and distance effects
$z_{n,i}$ | Binary offloading decision variable for UE $n$ at slot $i$
$Z_i$ | Set of offloading decisions in slot $i$
$f_{\text{uav}}$, $f_{\text{UE}}$ | CPU frequencies of the UAV and UE, respectively
$t_{n,i}^{\text{tx}}$ | Transmission delay for UE $n$'s task at slot $i$
$t_{n,i}^{\text{uav}}$ | Computation delay on the UAV for UE $n$'s task at slot $i$
$t_{n,i}^{\text{loc}}$ | Local computation delay at UE $n$
$t_{n,i}$ | Total delay of UE $n$ at slot $i$
$P_i^f$ | UAV flight power at speed $V_i$
$E_i^{\text{fly}}$ | UAV flight energy in slot $i$
$p_{\text{hov}}$ | UAV hovering power
$E_i^{\text{hov}}$ | Hovering energy in slot $i$
$\kappa$, $\beta$ | Constants in the UAV computation energy model
$\delta_i$ | CPU usage indicator in slot $i$
$E_i^{\text{cmp}}$ | UAV computation energy in slot $i$
$E_i^{\text{total}}$ | Total UAV energy consumption in slot $i$
$B_i$ | Remaining UAV battery after slot $i$
$B_{\min}$ | Minimum allowed UAV battery level
Problem Formulation and Optimization
$(V_i, \alpha_i)$ | UAV control variables: speed and heading
$\{z_{n,i}\}$ | Offloading decision set across all slots
Objective | $\min \sum_{i=1}^{I} \sum_{n=1}^{N} t_{n,i}$ (minimize total user delay)
(C1) | Total offloaded data must not exceed $D_{\max}$ in any slot, Formula (5)
(C2) | Number of offloaded tasks $\le K$ in each slot, Formula (6)
(C3) | UAV position constrained within $[0, L_{\max}]$, Formula (18)
(C4) | UAV speed and heading constraints, Formula (19)
(C5) | UAV battery dynamics across slots, Formula (20)
(C6) | UAV battery level must stay above $B_{\min}$, Formula (16)
Table 2. Simulation parameters.

Symbol | Value | Description
$t_{\text{fly}}$ | 1 s | UAV flight duration per slot
$K$ | {2, 4} | Number of parallel channels
$D_{\max}$ | {4.6, 9.2} Mbits | UAV memory limit
$N$ | 10 | Number of UEs
$s$ | 1000 | CPU cycles per bit
$L_{\max}$ | 100 m | Area side length
$H$ | 50 m | UAV altitude
$B$ | 1 MHz | Channel bandwidth
$p_n$ | 0.1 W | UE transmit power
$\sigma^2$ | $10^{-13}$ W | Noise power
$f_n^{\text{UE}}$ | 0.6 GHz | UE CPU frequency
$f_m^{\text{UAV}}$ | 3 GHz | UAV CPU frequency
$k_n$ | 5 | Power exponent in the CPU energy model
$g_0$ | 50 dB | Reference channel gain
$v_{x,\max}$ | 4 m/s | Maximum x-axis velocity
$v_{y,\max}$ | 4 m/s | Maximum y-axis velocity
$v_n$ | 3 | Number of tasks per UE
$p_h$ | 100 W | UAV hovering power
$e_{\max}$ | 500 kJ | UAV battery capacity
$T$ | 8 | Time slot length