MA-JTATO: Multi-Agent Joint Task Association and Trajectory Optimization in UAV-Assisted Edge Computing System

Zhang, Yunxi; Wen, Zhigang

doi:10.3390/drones10040267


by Yunxi Zhang and Zhigang Wen *

School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(4), 267; https://doi.org/10.3390/drones10040267

Submission received: 10 February 2026 / Revised: 2 April 2026 / Accepted: 3 April 2026 / Published: 7 April 2026

(This article belongs to the Section Drone Communications)


Highlights

What are the main findings?

We propose an original multi-agent collaborative intelligent optimization framework for UAV-assisted edge computing and construct a multi-agent joint task association and trajectory optimization (MA-JTATO) algorithm that jointly optimizes task association, trajectory planning, and resource allocation in UAV-assisted edge computing systems.
An original decoupling-and-collaboration optimization strategy decomposes the complex coupled non-convex problem into solvable subproblems, namely a UAV-server task association model, a UAV flight trajectory control model, and an edge server computing resource allocation algorithm, which significantly improves system quality of service (QoS) and robustness in dynamic scenarios.

What are the implications of the main findings?

The results demonstrate that the proposed MA-JTATO algorithm significantly outperforms baseline algorithms in system QoS performance, validating its effectiveness and robustness in UAV-assisted edge computing systems.
These findings provide a scalable framework for future UAV-assisted edge computing, enabling efficient multi-agent coordination in dynamic environments for real-world applications requiring low latency and high QoS.

Abstract

With the rapid development of applications such as smart cities and the industrial internet, the computation-intensive tasks generated by massive sensing devices pose significant challenges to traditional cloud computing paradigms. Unmanned aerial vehicle (UAV)-assisted edge computing systems, leveraging their high mobility and wide-area coverage capabilities, offer an innovative architecture for low-latency and highly reliable edge services. However, the practical deployment of such systems faces a highly complex multi-objective optimization problem characterized by the tight coupling of task offloading decisions, UAV trajectory planning, and edge server resource allocation. Conventional optimization methods struggle to adapt to the dynamic and high-dimensional characteristics of this problem, leading to suboptimal system performance. To address this challenge, this paper constructs an intelligent collaborative optimization framework for UAV-assisted edge computing systems and formulates the system quality of service (QoS) optimization problem as a mixed-integer non-convex programming problem with the dual objectives of minimizing task processing latency and reducing overall system energy consumption. A multi-agent joint task association and trajectory optimization (MA-JTATO) algorithm based on hybrid reinforcement learning is proposed to solve this intractable problem. The algorithm decouples the original coupled optimization problem into three interrelated subproblems and solves them collaboratively.
Specifically, the Advantage Actor-Critic (A2C) algorithm is adopted to realize dynamic and optimal task association between UAVs and edge servers for discrete decision-making requirements; the multi-agent deep deterministic policy gradient (MADDPG) method is employed to achieve cooperative and energy-efficient trajectory planning for multiple UAVs to meet the needs of continuous control in dynamic environments; and convex optimization theory is applied to obtain a closed-form optimal solution for the efficient allocation of computational resources on edge servers. Simulation results demonstrate that the proposed MA-JTATO algorithm significantly outperforms traditional baseline algorithms in enhancing overall QoS, effectively validating the framework’s superior performance and robustness in dynamic and complex scenarios.

Keywords:

UAV assisted edge computing; task offloading; UAV trajectory optimization; resource allocation; multi-agent reinforcement learning

1. Introduction

With the rapid development of new-generation information technologies such as smart cities, the industrial internet, and digital twins, the widespread application of massive sensing devices and intelligent terminals has driven explosive growth in data volume. Traditional centralized cloud computing models suffer from significant task delays and low resource utilization efficiency in practical applications due to long communication distances, limited transmission bandwidth, and fixed service nodes [1]. In this context, unmanned aerial vehicle (UAV)-assisted edge computing systems, through their flexible “air-ground” collaborative architecture, provide an innovative solution for real-time data collection and efficient task processing [2]. Leveraging the mobility and broad coverage of UAV nodes, such a system enables real-time collection of environmental data in dynamic regions and ensures the security of data transmission and storage through security authentication and protection mechanisms. It thereby provides highly reliable, low-latency, scalable, and secure technical solutions for scenarios such as emergency response, environmental monitoring, and intelligent inspection [3,4].

Due to the limited computational resources of UAV nodes, it is challenging for them to independently handle large-scale computation-intensive tasks. Therefore, computational tasks need to be offloaded to edge servers via communication links to fully utilize their powerful computing capabilities [5]. However, the practical deployment of UAV-assisted edge computing systems faces the core challenge of multi-dimensional coupling involving intricate task offloading decisions, unstable communication links, and resource competition [6]. Achieving intelligent and adaptive dynamic association between UAVs and edge servers, as well as dynamic resource allocation for edge servers, has become a critical challenge to address [7]. Specifically, UAVs must plan energy-efficient optimal flight trajectories to approach edge servers, aiming to simultaneously minimize energy consumption during both flight and data transmission. Realizing the cross-dimensional joint optimization of task allocation, trajectory planning, and resource allocation in highly complex and dynamic UAV-assisted edge computing networks requires ensuring system performance while balancing multiple optimization objectives (e.g., latency, energy consumption) [8].

Nevertheless, traditional optimization methods exhibit significant limitations in addressing the above coupling optimization problem of UAV-assisted edge computing systems. Conventional numerical optimization methods struggle to adapt to dynamic system changes, heuristic algorithms are prone to falling into local optima, and distributed optimization suffers from slow convergence rates [9,10]. These methods are often inadequate for handling high-dimensional state spaces and long-term performance optimization requirements. Reinforcement learning offers an adaptive solution framework for such problems with environmental uncertainties. By modeling the problem as a Markov Decision Process (MDP), intelligent agents can autonomously learn optimal policies through system interactions [11]. Notably, deep multi-agent reinforcement learning enables decentralized collaborative decision-making among multiple UAVs while possessing the capability to process high-dimensional state information [12]. It enables long-term performance optimization in dynamic environments, providing adaptive and scalable intelligent decision-making solutions for UAV-assisted intelligent edge computing systems.

To this end, this paper constructs a pioneering intelligent optimization framework for UAV-assisted edge computing systems. This framework utilizes reinforcement learning to achieve synergistic and optimal decision-making for task association and flight trajectory planning, while incorporating a resource optimization algorithm to enable real-time dynamic resource allocation among edge servers. Ultimately, it aims to achieve comprehensive multi-dimensional optimization of overall system performance. In terms of framework design, the task association module utilizes the Advantage Actor-Critic (A2C) algorithm based on the Actor-Critic architecture to establish a dynamic matching mechanism between UAVs and edge servers for the discrete decision-making characteristics of task offloading [13]. The trajectory planning module employs the multi-agent deep deterministic policy gradient (MADDPG) algorithm to generate energy-efficient optimal flight paths for the continuous control requirements of multi-UAV coordination [14]. The resource allocation module applies optimization theory to derive a closed-form optimal solution, achieving a globally optimal and efficient configuration of computational resources for edge servers. The synergistic optimization of these three modules enables the system to adapt to dynamic environmental changes, ultimately realizing robust optimization of the system’s quality of service (QoS) performance. Based on the above analysis, the main contributions of this paper are summarized as follows:

To address large-scale computation-intensive tasks, this paper constructs a UAV-assisted edge computing system architecture tailored for multi-objective performance optimization and formulates the system QoS optimization goal as a mixed-integer non-convex programming problem with the objectives of jointly minimizing end-to-end task latency and global system energy consumption. Building upon this, an innovative intelligent collaborative optimization framework is proposed, which integrates task association, UAV trajectory control, and edge server computational resource allocation into a unified optimization paradigm. This framework realizes a systematic solution to the complex coupled optimization problem in UAV-assisted edge computing systems.
For this highly intricate coupled joint optimization problem, this paper innovatively adopts a decoupling-and-collaboration optimization strategy and designs the multi-agent joint task association and trajectory optimization (MA-JTATO) algorithm, which decomposes the original intractable coupled problem into three subproblems: UAV-server task association, UAV flight trajectory control, and edge server computing resource allocation. Specifically, the task association subproblem employs the A2C algorithm to establish matching; the trajectory control subproblem introduces the MADDPG method to achieve energy-efficient and collaborative path planning for multiple UAVs; and the resource allocation subproblem leverages optimization theory to achieve efficient and optimal configuration of computing resources.
Extensive simulation experiments demonstrate that the proposed MA-JTATO algorithm significantly outperforms baseline algorithms in terms of system QoS performance. This validates the effectiveness and robustness of the proposed framework in dynamic and complex scenarios, ultimately achieving performance optimization in dynamic UAV-assisted edge computing systems.

2. Related Work

2.1. Joint Optimization for UAV-Assisted Edge Computing

With the rapid development of low-altitude networks, researchers have proposed a series of innovative theories and methods focusing on UAV mobility, energy consumption constraints, and QoS requirements in air-ground collaborative scenarios. For Mobile Edge Computing-enabled Autonomous Aerial Vehicle (AAV) networks in emergency response scenarios, the authors in [15] addressed the joint trajectory-offloading-resource optimization problem for fixed-wing AAV mother-ships. A joint trajectory planning, task offloading, and resource allocation (TTR) optimization model is established, considering constraints such as minimum turning radius, speed, computing resource budgets, and task performance requirements. An offline Particle Swarm Optimization (PSO)-based solution and a Deep Reinforcement Learning (DRL)-based real-time decision-making method for dynamic environments are proposed. The authors in [16] addressed the scenario where multiple mobile users face random task arrivals and movements, and the energy consumption of mobile users needs to be minimized under the UAV energy constraint. They formulated a multi-stage mixed-integer nonlinear programming model and proposed the JTORA algorithm for joint trajectory optimization and resource allocation, which integrates DRL with Lyapunov optimization techniques. This algorithm transforms the original problem into a deterministic optimization one and decomposes it into two parallel subproblems, which are solved by DRL and convex optimization respectively. The authors in [17] constructed an enhanced Time-Expanded Graph (eTEG) model integrating transmission, storage, computation, and transceiver resource constraints. They proposed a joint trajectory planning and task offloading method (A2C-seTEG) based on A2C deep reinforcement learning and sliding windows, which divides the trajectory period into sliding windows and feeds back the offloading results of the previous window to optimize the trajectory planning of the next window.

The authors in [18] constructed a Markov Decision Process (MDP) model integrating communication, computation, and UAV trajectory constraints. They proposed a trajectory-aware task offloading and resource allocation method (TB-TOUAV) based on improved Proximal Policy Optimization (PPO), which dynamically optimizes UAV trajectories, task offloading ratios, and resource allocation schemes through state normalization, continuous action space exploration, and policy clipping update mechanisms. Addressing the scenario of fixed-wing UAV-to-ground communication, the authors in [19] constructed a theoretical propulsion energy consumption model with UAV flight speed and acceleration as core variables. They proposed a circular trajectory design with optimized flight radius and speed, as well as a general constrained trajectory optimization algorithm based on linear state-space approximation and sequential convex optimization. This approach achieves significantly improved energy efficiency while balancing communication throughput and UAV energy consumption. Addressing the scenario where UAV group communications need to deliver rich media, extend line-of-sight coverage, and achieve fast, efficient transitions, the authors in [20] proposed the Efficient Transition Formation (ETF) algorithm. This algorithm evaluates the seamlessness of straight-line trajectories (SLTs) through low-complexity computations or fast checks with controlled traffic overheads, and constructs a new trajectory consisting of a minimal number of seamless straight lines for non-seamless SLTs.

Although existing optimization methods have investigated task offloading, trajectory optimization, and resource allocation in UAV-assisted edge computing to varying degrees, most existing studies only focus on partial subproblems or optimize these components separately under specific architectures and constraints. In contrast, this paper realizes global collaborative optimization by jointly optimizing resource scheduling, dynamic task requirements, and heterogeneous network characteristics in a unified framework.

2.2. Reinforcement Learning for UAV Networks

In UAV networks, Reinforcement Learning (RL) has become a core methodology for addressing challenges posed by dynamic environments and complex decision-making, particularly suited for high-dimensional optimization problems lacking precise environmental models. Existing research has proposed various autonomous decision-making frameworks based on RL, focusing on key areas such as multi-agent collaboration, trajectory planning, and energy consumption optimization. By modeling UAVs as agents and utilizing local observations or centralized training for autonomous policy learning, these methods have achieved significant progress in improving network performance, reducing communication overhead, and enhancing system adaptability. Addressing the balance between communication performance, service fairness, and privacy protection in 3D trajectory optimization of multi-UAV base stations, the authors in [21] constructed a decentralized partially observable multi-agent Markov decision process (POMDP) model. They proposed a Federated Multi-Agent Deep Reinforcement Learning (FedMADRL) method integrated with gated recurrent unit (GRU)-based link quality estimation, where each UAV independently trains a DRL model and aggregates global parameters via Federated Averaging (FedAvg). Addressing the scenario where UAV networks have eavesdroppers, requiring joint optimization of 3D positions, power, and energy harvesting to maximize multi-objective utility, the authors in [22] constructed a multi-objective utility maximization model. They proposed a DRL method based on Proximal Policy Optimization (PPO), where the agent interacts with the environment to learn network dynamics and jointly controls relevant strategies to solve the non-convex problem. Addressing the need for optimizing trajectory planning and resource allocation in multi-AAV-assisted MEC systems, the authors in [23] constructed multi-AAV communication, computing, and system delay-energy joint optimization models. 
They proposed an enhanced Multi-Agent Reinforcement Learning (MARL) method based on the improved Double Delay Deep Deterministic Policy Gradient (D3PG) algorithm, which achieves collaborative learning among multiple agents to optimize relevant strategies.

Researchers have also proposed innovative RL methods for joint trajectory and resource optimization in UAV networks. Addressing the energy limitations, flight time constraints, high optimization complexity and insufficient decision-making information in the collaborative trajectory planning and power allocation for UAV aerial base stations, the authors in [24] constructed a multi-objective decision model for optimizing the number of served users and energy consumption, and proposed the Communication Actor Centralized Attention Critic Network (CATEN) multi-agent reinforcement learning algorithm. Addressing the trajectory optimization problem aiming at maximizing energy efficiency in UAV-aided cell-free Space-Air-Ground Integrated Networks (SAGIN), the authors in [25] constructed an energy efficiency analysis model considering the power consumption of fixed-wing UAVs and a closed-form expression model for uplink spectral efficiency. They proposed the Single-Critic-Multi-Actor Deep Deterministic Policy Gradient (SCMA-MADDPG) algorithm based on MARL, which adopts a centralized training and decentralized execution framework. Addressing the joint optimization of trajectory and user association for base station traffic offloading in UAV-assisted cellular communication networks, the authors in [26] constructed a finite-state MDP model and proposed a MARL-based distributed State-Action-Reward-State-Action (SARSA) algorithm. This algorithm regards each UAV as an independent agent, enables inter-UAV cooperation and invalid action elimination via a centralized controller, and iteratively updates the value function with the number of user associations as the reward to achieve distributed UAV trajectory optimization.

However, existing RL-based optimization methods for UAV networks still have limitations. Most existing research focuses on single-layer network architectures or specific communication modes. Furthermore, existing methods fail to systematically account for the trade-off between short-term rewards and long-term system robustness under practical constraints such as time-varying task demands. In contrast, this paper realizes a balanced optimization between short-term performance and long-term system robustness by comprehensively incorporating time-varying task characteristics.

3. System Model

3.1. UAV-Assisted Edge Computing System

As shown in Figure 1, this paper constructs a UAV-assisted edge computing system for communication-paralyzed environments. The system consists of a cluster of rotary-wing UAVs, denoted as $\mathcal{U} = \{u_{1}, u_{2}, \dots, u_{U}\}$, and ground edge servers, denoted as $\mathcal{S} = \{s_{1}, s_{2}, \dots, s_{S}\}$. Similar to [9], for analytical simplicity, this paper also assumes that UAVs maintain a constant flight altitude $H$, while the ground edge servers are deployed at fixed positions on the ground. The system aims to provide reliable support for the rapid collection, backhaul, and computational processing of disaster-area information under extreme scenarios where ground communication infrastructure is damaged or coverage is insufficient.

Specifically, this paper assumes that after completing information collection tasks in the disaster area, the UAV cluster synchronously returns to base from their respective initial positions. During the return process, each UAV, based on its own carried task load status and the computing resource availability of the ground edge servers, makes decisions regarding target server selection and task offloading. Once a UAV flies within the coverage range of the selected edge server, it immediately initiates the data offloading and computational processing procedure.

The system time is discretized into $T$ equal time slots, i.e., $t \in \mathcal{T} = \{0, \dots, T - 1\}$, where the duration of each time slot is $\Delta t$. All UAVs in the system fly at a fixed altitude $H$, and the ground edge servers are fixed at designated locations. The three-dimensional coordinates of UAV $u$ at time slot $t$ are given by $[x_{u}(t), y_{u}(t), H]^{\mathsf{T}}$. The set of task data volumes carried by the UAVs is denoted as $\mathcal{D} = \{d_{1}, d_{2}, \dots, d_{U}\}$, and the computing resources required by the tasks are defined as $\mathcal{C} = \{c_{1}, c_{2}, \dots, c_{U}\}$. The three-dimensional coordinates of ground edge server $s$ are denoted by $[x_{s}, y_{s}, 0]^{\mathsf{T}}$. The computing resources of the ground edge servers are represented as $\mathcal{F} = \{f_{1}, f_{2}, \dots, f_{S}\}$ [11].

For simplicity of analysis, this paper projects the positions of both UAVs and servers onto a two-dimensional plane. Thus, the position of UAV $u$ is denoted as $l_{u}(t) = [x_{u}(t), y_{u}(t)]^{\mathsf{T}}$ and the position of ground edge server $s$ is denoted as $l_{s} = [x_{s}, y_{s}]^{\mathsf{T}}$. The velocity vector of UAV $u$ in time slot $t$ is defined as $v_{u}(t) = [v_{u}^{x}(t), v_{u}^{y}(t)]^{\mathsf{T}}$, and its magnitude is constrained by the maximum flight speed, i.e., $\| v_{u}(t) \| \leq V_{\max}$, where $\| \cdot \|$ denotes the Euclidean norm. The relationship between $l_{u}(t)$ and $v_{u}(t)$ can be expressed as $l_{u}(t + 1) = l_{u}(t) + v_{u}(t) \Delta t$. Based on this, the Euclidean distance between UAV $u$ and edge server $s$ in time slot $t$ is defined as $d_{u,s}(t) = \| l_{u}(t) - l_{s} \|$, and the distance between two UAVs $u_{1}$ and $u_{2}$ is $d_{u_{1},u_{2}}(t) = \| l_{u_{1}}(t) - l_{u_{2}}(t) \|$. Furthermore, a binary decision variable $x_{u}^{s} \in \{0, 1\}$ is introduced, where $x_{u}^{s} = 1$ indicates that UAV $u$ selects edge server $s$ as the offloading target for its computing tasks. Since each UAV can only choose one server for offloading, its offloading decision must satisfy the constraint $\sum_{s=1}^{S} x_{u}^{s} = 1, \forall u \in \mathcal{U}$ [27].
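The kinematic update, the distance definition, and the one-server-per-UAV constraint above can be sketched in a few lines of Python. This is only an illustration under assumed names (`step_position`, `distance`, `is_valid_association` are hypothetical helpers, and the numeric values are made up), not the authors' implementation.

```python
import math

def step_position(pos, vel, dt, v_max):
    """Advance a 2D UAV position by one slot: l(t+1) = l(t) + v(t) * dt."""
    speed = math.hypot(vel[0], vel[1])
    if speed > v_max + 1e-9:
        raise ValueError("speed exceeds V_max")
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)

def distance(a, b):
    """Euclidean distance between two 2D points (UAV-server or UAV-UAV)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def is_valid_association(x):
    """Check x[u][s] in {0, 1} and that every UAV selects exactly one server."""
    return all(set(row) <= {0, 1} and sum(row) == 1 for row in x)

# Illustrative values only (assumed, not from the paper).
uav = step_position((0.0, 0.0), (3.0, 4.0), dt=1.0, v_max=5.0)
server = (3.0, 0.0)
print(uav, distance(uav, server))
print(is_valid_association([[1, 0], [0, 1]]))
```

The association matrix is stored row-per-UAV so that the constraint reduces to a row-sum check.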

3.2. Communication Model

Based on the assumption that UAVs have completed task information sharing before the return phase, this paper focuses primarily on the air-to-ground communication process between UAVs and ground edge servers. Since multiple UAVs may simultaneously offload data to the same edge server, the system adopts the Orthogonal Frequency Division Multiple Access (OFDMA) protocol to enhance spectrum utilization efficiency for multi-user access. Considering that there is typically a high probability of Line-of-Sight (LoS) communication between UAVs and ground edge servers, this paper employs the free-space path loss model to characterize the air-to-ground wireless channel. This idealized channel assumption has limitations; it is adopted to simplify the analysis so that the focus remains on the joint optimization of task association, trajectory control, and resource allocation. For any UAV-server pair, it is assumed that the server allocates independent sub-channel resources to its associated UAVs. Let the total communication bandwidth of the system be $B$. When a server simultaneously serves $N_{s}$ UAVs, the sub-channel bandwidth allocated to each UAV is $B_{0} = B / N_{s}$. Based on this, the communication rate between a UAV and a server can be expressed as:

$$R = B_{0} \log_{2} \left(1 + \frac{p h_{0}}{\sigma^{2} \left((d_{u,s})^{2} + H^{2}\right)}\right), \tag{1}$$

where $p$ denotes the transmit power of the UAV, $h_{0}$ represents the channel gain at a reference distance of 1 m, and $\sigma^{2}$ is the noise power [28].
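As a rough numerical illustration of Equation (1), the following sketch computes the per-UAV rate under equal bandwidth sharing. The function name `offload_rate` and all parameter values are assumptions chosen for readability, not values from the paper.

```python
import math

def offload_rate(bandwidth_hz, n_served, tx_power_w, h0, noise_w,
                 d_horizontal_m, altitude_m):
    """Rate of Eq. (1): B0 * log2(1 + p*h0 / (sigma^2 * (d^2 + H^2)))."""
    if n_served < 1:
        raise ValueError("a server must serve at least one UAV")
    b0 = bandwidth_hz / n_served  # equal OFDMA sub-channel share B0 = B / N_s
    snr = tx_power_w * h0 / (noise_w * (d_horizontal_m ** 2 + altitude_m ** 2))
    return b0 * math.log2(1.0 + snr)

# Assumed values: 1 MHz shared by 2 UAVs, horizontal distance 6 m, altitude 8 m.
rate = offload_rate(1e6, 2, tx_power_w=1.0, h0=1.0, noise_w=0.01,
                    d_horizontal_m=6.0, altitude_m=8.0)
print(rate)
```

Under this model the rate falls as either the horizontal distance or the number of UAVs sharing the server grows, which is what couples the association and trajectory decisions.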

3.3. Service Delay and Energy Consumption

In UAV-assisted edge computing systems, service latency and energy consumption primarily stem from three stages: the flight phase, the hovering offloading phase, and the server computation phase.

3.3.1. UAV Flight Phase

For rotary-wing UAVs, their flight energy consumption is closely related to flight speed. This paper adopts a rotary-wing UAV energy consumption model, expressing the instantaneous power consumption during flight as:

$$p_{f}(\| v_{u}(t) \|) = p_{0} \left(1 + \frac{3 \| v_{u}(t) \|^{2}}{U_{tip}^{2}}\right) + p_{i} \left(\sqrt{1 + \frac{\| v_{u}(t) \|^{4}}{4 v_{0}^{4}}} - \frac{\| v_{u}(t) \|^{2}}{2 v_{0}^{2}}\right)^{1/2} + \frac{1}{2} d_{0} \rho s A \| v_{u}(t) \|^{3}, \tag{2}$$

where $p_{0}$ and $p_{i}$ represent the blade profile power and induced power, respectively; $U_{tip}$ denotes the rotor tip speed; $v_{0}$ is the mean rotor-induced velocity during hover; $\rho$ is the air density; $s$ is the rotor solidity; $A$ is the rotor disc area; and $d_{0}$ is the fuselage drag ratio. Consequently, the flight energy consumption is [9]:

$$E_{fly} = \sum_{t=1}^{\tau} p_{f}(\| v_{u}(t) \|) \Delta t. \tag{3}$$
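Equations (2) and (3) can be evaluated directly, as in the sketch below. The parameter values in `PARAMS` are illustrative placeholders and are not taken from this paper; the function names are likewise hypothetical.

```python
import math

def flight_power(v, p0, p_i, u_tip, v0, d0, rho, s, area):
    """Instantaneous rotary-wing propulsion power of Eq. (2) at speed v (m/s)."""
    blade = p0 * (1.0 + 3.0 * v ** 2 / u_tip ** 2)
    inner = math.sqrt(1.0 + v ** 4 / (4.0 * v0 ** 4)) - v ** 2 / (2.0 * v0 ** 2)
    induced = p_i * math.sqrt(max(inner, 0.0))  # guard against rounding below zero
    parasite = 0.5 * d0 * rho * s * area * v ** 3
    return blade + induced + parasite

def flight_energy(speeds, dt, **params):
    """Flight energy of Eq. (3): sum of per-slot power times slot duration."""
    return sum(flight_power(v, **params) * dt for v in speeds)

# Placeholder parameters (assumed for illustration only).
PARAMS = dict(p0=79.86, p_i=88.63, u_tip=120.0, v0=4.03,
              d0=0.6, rho=1.225, s=0.05, area=0.503)

print(flight_power(0.0, **PARAMS))    # hover: reduces to p0 + p_i
print(flight_power(10.0, **PARAMS))   # moderate forward speed
print(flight_energy([5.0, 10.0, 10.0, 5.0], dt=1.0, **PARAMS))
```

At zero speed the expression reduces to the sum of the blade profile and induced power, which is the hovering power used in the offloading phase.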

3.3.2. Task Offloading Phase

Upon reaching the designated target server, the UAV enters a hovering state and initiates task data offloading. The offloading delay $T_{c}$ is determined by the volume of data to be offloaded $D$ carried by the UAV and the communication rate $R$ between the UAV and the server, expressed as [29]:

$$T_{c} = \frac{D}{R}. \tag{4}$$

In the hovering state, the UAV’s velocity satisfies $\| v_{u}(t) \| = 0$. The corresponding hovering power consumption can be derived by simplifying the flight energy consumption model:

$$p_{h} = p_{0} + p_{i}, \tag{5}$$

thus, the hovering energy consumption is:

$$E_{h} = \frac{p_{h} D}{R}. \tag{6}$$

Additionally, the UAV consumes communication energy during data offloading. The transmission energy consumption is expressed as:

$$E_{c} = \frac{p D}{R}. \tag{7}$$
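Since Equations (4), (6), and (7) all share the factor $D / R$, they can be computed together. The sketch below assumes a hypothetical helper `offloading_costs` and made-up input values.

```python
def offloading_costs(data_bits, rate_bps, hover_power_w, tx_power_w):
    """Return (T_c, E_h, E_c) from Eqs. (4), (6), and (7)."""
    if rate_bps <= 0:
        raise ValueError("rate must be positive")
    t_c = data_bits / rate_bps   # Eq. (4): offloading delay
    e_h = hover_power_w * t_c    # Eq. (6): hovering energy while offloading
    e_c = tx_power_w * t_c       # Eq. (7): transmission energy
    return t_c, e_h, e_c

# Assumed values: 8 Mbit of data, 2 Mbit/s link, 168 W hover power, 0.5 W transmit power.
print(offloading_costs(8e6, 2e6, hover_power_w=168.0, tx_power_w=0.5))
```

With hover power in the hundreds of watts and transmit power typically far smaller, the hovering term tends to dominate the energy spent during offloading in this kind of example.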

3.3.3. Server Computation Phase

The server’s computation delay $T_{s}(t)$ is determined by the computational demand $c_{u}(t)$ of the UAV’s task and the computational resource $f_{s}^{u}(t)$ allocated by the server for that task, expressed as:

$$T_{s} = \sum_{u=1}^{N_{s}} \frac{c_{u}}{f_{s}^{u}}. \tag{8}$$

Correspondingly, the computation energy consumption $E_{s}(t)$ of server $s$ in time slot $t$ can be expressed as:

$$E_{s} = \kappa (f_{s}^{u})^{3} T_{s}, \tag{9}$$

where $\kappa$ is the energy coefficient of the computing chip, which is determined by the physical characteristics of the chip’s architecture [30].
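The following sketch evaluates Equations (8) and (9) for one server. It rests on one interpretive assumption that the text does not state explicitly: the energy of Equation (9) is applied per task, as $\kappa (f_{s}^{u})^{3}$ times that task's own computation time, and then summed over the served UAVs. The function name and values are hypothetical.

```python
def server_costs(cycles, alloc_hz, kappa):
    """Return (T_s, E_s) for one server.

    cycles[u]   : computational demand c_u of UAV u's task (CPU cycles)
    alloc_hz[u] : resource f_s^u the server allocates to that task (cycles/s)
    Assumption: Eq. (9) is applied per task and summed over the served UAVs.
    """
    if len(cycles) != len(alloc_hz):
        raise ValueError("one allocation is needed per task")
    delay, energy = 0.0, 0.0
    for c, f in zip(cycles, alloc_hz):
        if f <= 0:
            raise ValueError("allocated resource must be positive")
        t = c / f                      # per-task term of Eq. (8)
        delay += t
        energy += kappa * f ** 3 * t   # per-task reading of Eq. (9)
    return delay, energy

# Assumed values: two tasks on a server that gives each of them 1 GHz.
print(server_costs([2e9, 1e9], [1e9, 1e9], kappa=1e-27))
```

Because the per-task energy grows with the square of the allocated frequency while the delay shrinks only linearly, allocating more resource trades energy for latency.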

4. Problem Formulation

In UAV-assisted edge computing systems, the overall system performance is jointly determined by service latency and energy consumption. Specifically, in each time slot $t$, the total system latency is composed of the data offloading delay and the server computation delay, which can be expressed as:

$$T_{total} = T_{c} + T_{s}, \tag{10}$$

where $T_{c}$ represents the communication delay incurred during data offloading from the UAV to the target edge server, and $T_{s}$ denotes the computation delay for processing the task at the server.

Correspondingly, the total system energy consumption mainly includes the UAV’s flight energy consumption, the hovering energy during task offloading, the transmission energy for task offloading, and the server’s computation energy for task processing. This can be expressed as:

$$E_{total} = E_{fly} + E_{h} + E_{c} + E_{s}, \tag{11}$$

where $E_{fly}$ denotes the energy consumption of the UAV during the return flight phase, $E_{h}$ is the energy required to maintain the UAV in a hovering state during task offloading, $E_{c}$ represents the communication energy consumed by the UAV during data transmission, and $E_{s}$ is the computation energy consumed by the edge server while processing the task.

The optimization objective of this paper is to maximize the system QoS over the entire task period by jointly optimizing the UAV’s offloading decisions, flight trajectories, and the computational resource allocation strategy of the edge servers. The system QoS primarily considers service latency and energy consumption. To flexibly characterize the system’s differing preferences for latency and energy consumption, weighting coefficients $\omega_{1}$ and $\omega_{2}$ are introduced to represent the emphasis on latency performance and energy efficiency, respectively. The system QoS is then defined as the reciprocal of the weighted cost:

$$QoS = \frac{1}{\omega_{1} T_{total} + \omega_{2} E_{total}}. \tag{12}$$
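Equations (10) to (12) compose into a single scalar score. A minimal sketch, with hypothetical function names and illustrative weights that are not taken from the paper:

```python
def total_latency(t_c, t_s):
    """Eq. (10): offloading delay plus server computation delay."""
    return t_c + t_s

def total_energy(e_fly, e_h, e_c, e_s):
    """Eq. (11): flight, hovering, transmission, and computation energy."""
    return e_fly + e_h + e_c + e_s

def qos(t_total, e_total, w1, w2):
    """Eq. (12): reciprocal of the weighted latency-energy cost."""
    cost = w1 * t_total + w2 * e_total
    if cost <= 0:
        raise ValueError("weighted cost must be positive")
    return 1.0 / cost

# Assumed weights and costs, for illustration only.
score = qos(total_latency(1.0, 1.0), total_energy(1.0, 2.0, 1.0, 2.0), w1=0.5, w2=0.5)
print(score)
```

Since latency (seconds) and energy (joules) have different units and magnitudes, the weights in practice also act as scaling factors between the two terms.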

Consequently, the system optimization problem can be formulated as the following weighted multi-objective optimization problem $\mathbf{P1}$:

$$\mathbf{P1}: \max_{x_{u}^{s},\, f_{s}^{u},\, l_{u}(t),\, v_{u}(t)} \sum_{t=1}^{T} QoS, \tag{13}$$

$$\text{s.t.} \quad x_{u}^{s} \in \{0, 1\}, \ \forall u \in \mathcal{U}, \forall s \in \mathcal{S}, \tag{14}$$

$$\sum_{s=1}^{S} x_{u}^{s} = 1, \ \forall u \in \mathcal{U}, \tag{15}$$

$$\sum_{u=1}^{U} x_{u}^{s} D_{u} \leq D_{\max}, \ \forall s \in \mathcal{S}, \tag{16}$$

$$l_{u}(t + 1) = l_{u}(t) + v_{u}(t) \Delta t, \tag{17}$$

$$\| v_{u}(t) \| \leq V_{\max}, \tag{18}$$

$$\sum_{u=1}^{U} f_{s}^{u} \leq f_{s}, \tag{19}$$

$$0 \leq f_{s}^{u} \leq f_{u}, \tag{20}$$

where constraints (14) and (15) ensure that each UAV can only select one ground edge server as its offloading target during the task period. Constraint (16) limits the maximum data processing capacity of each edge server. Constraint (17) describes the kinematic relationship between the UAV’s position and velocity, while constraint (18) ensures that the UAV’s flight speed does not exceed its maximum allowable speed $V_{\max}$. Constraints (19) and (20) guarantee the reasonable allocation of the server’s computational resources, preventing overload and ensuring that the computational resources allocated to each task remain within a reasonable range.
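A candidate solution can be screened against the constraints of P1 before it is scored. The sketch below is a hypothetical checker, not part of the proposed algorithm. It assumes constraint (17) holds by construction of the position update, and it checks only the lower bound of (20) because the upper bound symbol is not defined in the surrounding text.

```python
import math

def violated_constraints(x, data, d_max, alloc, capacity, velocities, v_max):
    """Return labels of violated constraints; an empty list means feasible.

    x[u][s]     : association matrix entries
    data[u]     : data volume carried by UAV u
    alloc[s][u] : computing resource that server s gives to UAV u
    capacity[s] : total computing resource f_s of server s
    velocities  : (vx, vy) pairs, one per UAV, for the current slot
    """
    eps = 1e-9
    bad = []
    if any(v not in (0, 1) for row in x for v in row):
        bad.append("(14)")
    if any(sum(row) != 1 for row in x):
        bad.append("(15)")
    loads = [sum(x[u][s] * data[u] for u in range(len(x)))
             for s in range(len(capacity))]
    if any(load > d_max + eps for load in loads):
        bad.append("(16)")
    if any(math.hypot(vx, vy) > v_max + eps for vx, vy in velocities):
        bad.append("(18)")
    if any(sum(row) > cap + eps for row, cap in zip(alloc, capacity)):
        bad.append("(19)")
    if any(f < 0 for row in alloc for f in row):
        bad.append("(20), lower bound")
    return bad

# Assumed toy instance with two UAVs and two servers.
x = [[1, 0], [0, 1]]
print(violated_constraints(x, data=[5.0, 3.0], d_max=6.0,
                           alloc=[[2.0, 0.0], [0.0, 1.5]], capacity=[2.0, 2.0],
                           velocities=[(3.0, 4.0), (0.0, 0.0)], v_max=5.0))
```

Returning labels rather than a single boolean makes it easier to see which constraint a learned policy tends to violate during training.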

5. Proposed Solution

To address the optimization problem, we propose an MA-JTATO algorithm that decomposes the original problem into three sub-problems: task offloading decision, UAV trajectory planning, and edge computing resource allocation.

5.1. Task Offloading Decision

To optimize the UAV task offloading decision problem, this paper models it as an MDP, which can be solved by the A2C algorithm [13]. Since this problem involves only a single decision step with deterministic state transitions, it is modeled as a one-step MDP, including the state space $\mathcal{S}^{A}$, the action space $\mathcal{A}^{A}$, and the reward function $\mathcal{R}^{A}$. Specifically, at the beginning of system operation, the UAV $u$ observes the environment state $s^{A} \in \mathcal{S}^{A}$, generates an action $a^{A} \in \mathcal{A}^{A}$, and receives a reward $r^{A}$. The state, action, and reward are defined as follows:

(1) State Space: The state set

S^{A}

comprises the following three components:

(i) Position Information: The 2D coordinates of UAV u at the current time t,

l_{u} (t) = {[x_{u} (t), y_{u} (t)]}^{T}

, and the coordinate set of servers is

l_{s} = {[x_{s}, y_{s}]}^{T}

.

(ii) Task Load: The set of task data volumes carried by the UAVs:

D = \{d_{1}, d_{2}, \dots, d_{U}\}

, and the computing resources required by the tasks:

C = \{c_{1}, c_{2}, \dots, c_{U}\}

.

(iii) Server Computational Resources: The computational resource set of all available servers in the system:

F = \{f_{1}, f_{2}, \dots, f_{S}\}

.

(2) Action Space: The UAV needs to allocate its carried tasks to S available servers. Therefore, the offloading decision action space

A^{A}

is inherently discrete. We construct a task allocation matrix

x_{u}^{s}

, where:

x_{u}^{s} \in \{0, 1\}, \forall u \in U, \forall s \in S,

(21)

\sum_{s = 1}^{S} x_{u}^{s} = 1, \forall u \in U,

(22)

where

x_{u}^{s} = 1

indicates that UAV u offloads its task to server s.

(3) Reward Function: The purpose of the reward function is to guide the algorithm to generate optimization decisions that maximize system QoS. In the UAV task offloading scenario, the system QoS evaluation metrics include the total delay and the total energy consumption. Both metrics are negatively correlated with QoS, meaning that lower total delay and lower total energy consumption correspond to higher system QoS. To achieve the optimization objective of QoS maximization, this paper defines the reward value

r^{A}

as the system QoS, constructing the reward function in the form of a weighted reciprocal of total delay and total energy consumption. The specific expression is as follows:

r^{A} (t) = \frac{1}{ω_{1} T_{total} + ω_{2} E_{total}} .

(23)
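As a small numerical illustration of (23), the reward can be sketched as below. The function name is hypothetical; the default weights 0.8 and 0.2 are the values reported for the simulations in Section 6.1, and the guard against a non-positive cost is an added assumption.

```python
def offloading_reward(t_total, e_total, w1=0.8, w2=0.2):
    """Reward of (23): reciprocal of the weighted delay/energy cost."""
    cost = w1 * t_total + w2 * e_total
    if cost <= 0.0:
        # Assumption: delay and energy are positive, so the cost must be too.
        raise ValueError("weighted cost must be positive")
    return 1.0 / cost
```

With this form, halving both the total delay and the total energy doubles the reward, so lower cost always maps to higher QoS.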

In this paper, we employ the A2C algorithm to solve the task offloading decision problem. Compared with supervised learning and combinatorial optimization methods, A2C shows superior performance in such single-step decision-making scenarios, as it avoids the dependence on labeled datasets and complex mathematical modeling, and adapts better to dynamic environmental changes. A2C is a Temporal Difference (TD) reinforcement learning algorithm based on the Actor-Critic framework. Its core idea is to achieve policy generation and value evaluation through the Actor network (

P (a^{A} | s^{A}; ϑ)

) and Critic network (

V (s^{A}; ψ)

), respectively, and to guide policy updates using the advantage value, thereby efficiently learning optimal task offloading decisions. Specifically, the Actor network is responsible for outputting action decisions for each UAV, while the Critic network is responsible for evaluating the value of the current state. However, the Actor network inherently outputs continuous actions, which conflict with the discrete offloading decision actions in this study. To resolve this incompatibility, we adopt the softmax activation function as the output layer, transforming the output of the Actor network into the probability of the UAV selecting each server, while ensuring that the sum of selection probabilities for each UAV across all servers equals 1. To obtain the final action decision, we select the server with the highest probability from this distribution as the UAV’s target server, set the corresponding offloading decision

x_{u}^{s}

to 1, and set the values of all other decisions to 0.
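The mapping from Actor outputs to a discrete association matrix can be sketched as follows. This is an illustrative sketch only: the function names and the representation of the Actor output as one list of logits per UAV are assumptions, not details given in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of real-valued logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_association(logits_per_uav):
    """Map per-UAV Actor logits to a one-hot association matrix x[u][s].

    Each row is turned into server-selection probabilities with softmax,
    and the most probable server receives x[u][s] = 1.
    """
    x = []
    for logits in logits_per_uav:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        x.append([1 if s == best else 0 for s in range(len(probs))])
    return x
```

By construction every row sums to 1, so (21) and (22) hold; the capacity constraint (16) is not enforced by this step.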

During each training step, the system records a sample triplet

(s_{t}^{A}, a_{t}^{A}, r_{t}^{A})

consisting of the current state, the executed action, and the immediate reward. Using this triplet, the parameters of the Critic network and the Actor network are updated separately and immediately in an online fashion.

(1) Critic Network Update: The objective of the Critic network is to learn the state value

V (s^{A}; ψ)

, making it as close as possible to the true state value. Therefore, the loss function is defined as the Mean Squared Error (MSE):

L_{V} (ψ) = {(V (s^{A}; ψ) - r^{A})}^{2},

(24)

where

V (s^{A}; ψ)

represents the Critic network’s estimated value for state

s^{A}

, and

r^{A}

denotes the true immediate reward received in that state. The parameters

ψ

of the Critic network are then updated using the gradient descent method:

ψ \leftarrow ψ - α \nabla_{ψ} L_{V} (ψ),

(25)

where

α

is the learning rate,

\nabla_{ψ} L_{V} (ψ)

denotes the gradient of the loss function

L_{V} (ψ)

with respect to the parameters

ψ

.

(2) Actor Network Update: The goal of the Actor network is to maximize the expected cumulative reward of the policy. The core mechanism involves using the advantage value to guide policy updates. In a one-step Markov decision process, the advantage value

A (s^{A}, a^{A}; ϑ, ψ)

is defined as the difference between the immediate reward and the estimated state value:

A (s^{A}, a^{A}; ϑ, ψ) = r^{A} - V (s^{A}; ψ),

(26)

if

A > 0

, the action yields a reward higher than the average state value, and its selection probability should be increased; if

A < 0

, the action yields a reward lower than the average state value, and its selection probability should be decreased. The objective function for the Actor network optimization is formulated as:

J_{P} (ϑ) = log P (a^{A} | s^{A}; ϑ) \cdot A (s^{A}, a^{A}; ϑ, ψ),

(27)

where

\log P (a^{A} | s^{A}; ϑ)

represents the log probability of action

a^{A}

, which quantifies the policy’s degree of preference for that action. Finally, the gradient ascent method is employed to maximize

J_{P} (ϑ)

and update the Actor network parameters:

ϑ \leftarrow ϑ + β \nabla_{ϑ} J_{P} (ϑ),

(28)

where

β

is the learning rate for the Actor network,

\nabla_{ϑ} J_{P} (ϑ)

denotes the gradient of the objective function

J_{P} (ϑ)

with respect to the parameters

ϑ

.
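The two updates in (24)-(28) can be traced by hand with a toy model. The sketch below replaces the neural networks with linear function approximators (a linear softmax policy and a linear value function), which is a simplifying assumption made only so that the gradients can be written explicitly; the function names and the learning-rate defaults are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    t = sum(e)
    return [v / t for v in e]

def a2c_one_step_update(theta, psi, state, action, reward, alpha=0.1, beta=0.1):
    """One online update following (24)-(28) with linear approximators.

    theta[k][j] : Actor weights, logit of action k is theta[k] . state
    psi[j]      : Critic weights, V(state) = psi . state
    Both parameter lists are modified in place; the advantage is returned.
    """
    value = sum(p * x for p, x in zip(psi, state))
    advantage = reward - value  # (26), treated as a constant for the Actor
    probs = softmax([sum(w * x for w, x in zip(row, state)) for row in theta])

    # Critic, (24)-(25): gradient of (V - r)^2 w.r.t. psi is 2 (V - r) * state.
    for j, x in enumerate(state):
        psi[j] -= alpha * 2.0 * (value - reward) * x

    # Actor, (27)-(28): gradient of log P(a|s) w.r.t. theta[k]
    # is (1[k == a] - P_k) * state for a softmax policy.
    for k, row in enumerate(theta):
        indicator = 1.0 if k == action else 0.0
        for j, x in enumerate(state):
            row[j] += beta * advantage * (indicator - probs[k]) * x
    return advantage
```

Starting from zero weights, a positive advantage raises the logit of the chosen action and lowers the others, which matches the interpretation of the sign of A given after (26).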

5.2. UAV Trajectory Control

Upon completion of the task allocation algorithm for UAVs to their target servers, the UAVs initiate the trajectory control algorithm. Trajectory optimization in UAV networks aims to plan optimal flight paths for each UAV under constraints such as latency, energy consumption, and collision avoidance. Considering that the transitions of flight states during UAV trajectory control exhibit temporal dependencies and uncertainties, this paper formulates the UAV trajectory control problem as an MDP and employs reinforcement learning to learn optimal trajectory decisions.

The MDP model for UAV trajectory control is formally described by the tuple

(S^{M}, A^{M}, R^{M}, P^{M})

, where

S^{M}

represents the state space,

A^{M}

denotes the action space,

R^{M}

is the reward function, and

P^{M}

signifies the state transition probability from

S_{t}^{M}

to

S_{t + 1}^{M}

upon executing action

a_{t}^{M}

. The state, action, and reward are defined as follows:

(1) State Space: The state space

S^{M}

comprises the following three components:

(i) UAV velocity:

v_{u} (t) = {[v_{u}^{x} (t), v_{u}^{y} (t)]}^{T}

.

(ii) Coordinates of adjacent UAVs: For each UAV u, the coordinates of the two nearest UAVs

(x_{j}, y_{j})

,

(x_{k}, y_{k})

in the system are recorded. When the number of UAVs increases, the state dimension grows sharply, leading to exponential growth in algorithm training complexity. Therefore, retaining only the coordinates of the two nearest UAVs can meet obstacle avoidance requirements while maintaining state dimension stability.

(iii) Coordinates of the target server: After the task allocation algorithm assigns a target server to each UAV, each UAV must perceive and obtain the coordinates of its corresponding target server

l_{s} = {[x_{s}, y_{s}]}^{T}

in real time to ensure flight toward the target server.

(2) Action Space: The action space

A^{M}

is continuous, corresponding to the four directional movements of the UAV. By adjusting the UAV’s acceleration, its speed and flight trajectory are modified. The action is defined as:

A^{M} = \{a_{u} (t) = (F_{x +} (t), F_{x -} (t), F_{y +} (t), F_{y -} (t))\},

(29)

where

F_{x +} (t), F_{x -} (t), F_{y +} (t), F_{y -} (t)

denote the external forces on the UAV at time t in the positive x, negative x, positive y, and negative y directions, respectively. These forces adjust the UAV’s acceleration and further change its velocity and flight trajectory.
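The chain from forces to acceleration, velocity, and position can be sketched with a simple explicit integration step. Several details below are assumptions that the text does not specify: the UAV mass, the point-mass model, the clipping used to respect the speed limit in (18), and the convention that the velocity resulting from the action at step t is the one used in (17). All names and default values are hypothetical.

```python
import math

def step_kinematics(pos, vel, action, mass=1.0, dt=1.0, v_max=20.0):
    """Advance one UAV by one time step under a four-component force action.

    pos, vel : (x, y) tuples
    action   : (F_x+, F_x-, F_y+, F_y-), non-negative force magnitudes
    """
    f_xp, f_xn, f_yp, f_yn = action
    # Net force per axis, then acceleration for an assumed point mass.
    ax = (f_xp - f_xn) / mass
    ay = (f_yp - f_yn) / mass
    vx = vel[0] + ax * dt
    vy = vel[1] + ay * dt
    # Rescale the velocity if it would violate the speed limit (18).
    speed = math.hypot(vx, vy)
    if speed > v_max:
        scale = v_max / speed
        vx, vy = vx * scale, vy * scale
    # Position update in the form of (17).
    new_pos = (pos[0] + vx * dt, pos[1] + vy * dt)
    return new_pos, (vx, vy)
```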

(3) Reward Function: Under the constraint that all UAVs stop flying and initiate transmission within the specified time step

T_{step}

, the reward function guides UAVs to accurately reach the target server locations while maximizing system QoS. The reward is composed of two parts:

(i) Instantaneous Reward

r_{step} (t)

:

r_{step} (t)

comprises distance optimization and QoS optimization rewards, which guide UAVs to adjust their flight states in real time toward fast and energy-efficient flight:

r_{step} (t) = μ_{1} r_{dist} (t) + μ_{2} r_{energy} (t),

(30)

where

r_{dist} (t) = - d_{u, s} (t)

aims to guide the UAV toward the target server, and

r_{energy} (t) = \frac{1}{p_{f} (∥ v_{u} (t) ∥) Δ t}

maximizes system QoS by minimizing UAV flight energy consumption,

μ_{1}

and

μ_{2}

are positive weighting coefficients that balance the two objectives.

(ii) Event-Triggered Reward

r_{event}

: Includes arrival rewards and collision penalties. If the distance

d_{u, s} (t)

between a UAV and its target server is less than a threshold

d_{th}^{arrival}

, a reward

K_{arrival}

is granted. If the distance

d_{u_{1}, u_{2}} (t)

between any two UAVs

u_{1}

and

u_{2}

is less than the threshold

d_{th}^{coll}

, a penalty

K_{coll}

is imposed:

r_{event} = r_{arrival} + r_{coll},

(31)

where

r_{arrival} = K_{arrival}, \text{ if } d_{u, s} (t) \leq d_{th}^{arrival},

(32)

r_{coll} = - K_{coll}, \text{ if } d_{u_{1}, u_{2}} (t) < d_{th}^{coll} .

(33)

Therefore, the overall reward is formulated as the linear combination of the instantaneous reward

r_{step} (t)

and the event-triggered reward

r_{event} (t)

:

\begin{matrix} r^{M} = r_{step} (t) + r_{event} (t) . \end{matrix}

(34)

This additive reward structure concurrently optimizes continuous flight performance and discrete critical events. The instantaneous reward guides energy-efficient navigation toward the target, while the event-triggered reward explicitly reinforces arrivals and penalizes collisions. By decoupling these components, the design ensures stable gradient propagation and effectively addresses sparse but critical operational constraints, enabling robust policy learning in complex multi-UAV scenarios.
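The composition of the reward in (30)-(34) can be sketched for a single UAV as below. This is a hedged illustration: the flight power model p_f is defined outside this section and is therefore passed in as a callable, the event rewards are assumed to be zero when their conditions are not met, and the values of the arrival reward, collision penalty, and both thresholds are hypothetical placeholders. The weights 0.9 and 0.1 are those reported in Section 6.1.

```python
import math

def trajectory_reward(pos, vel, target, others, power_fn,
                      mu1=0.9, mu2=0.1, dt=1.0,
                      k_arrival=10.0, k_coll=10.0,
                      d_th_arrival=5.0, d_th_coll=2.0):
    """Per-step reward r^M of one UAV, following (30)-(34).

    pos, vel, target : (x, y) tuples
    others           : positions of the other UAVs considered
    power_fn         : callable mapping speed to flight power (stands in for p_f)
    """
    d_us = math.dist(pos, target)
    r_dist = -d_us
    r_energy = 1.0 / (power_fn(math.hypot(*vel)) * dt)
    r_step = mu1 * r_dist + mu2 * r_energy                    # (30)
    r_arrival = k_arrival if d_us <= d_th_arrival else 0.0    # (32)
    collided = any(math.dist(pos, o) < d_th_coll for o in others)
    r_coll = -k_coll if collided else 0.0                     # (33)
    return r_step + r_arrival + r_coll                        # (31), (34)
```

Because the distance term is measured in meters while the energy term is a reciprocal of joules, the two weights also absorb a difference in scale; in practice the terms may need normalization before weighting.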

The UAV trajectory control problem exhibits significant characteristics of continuous action spaces and multi-agent collaboration. Traditional reinforcement learning algorithms struggle to simultaneously adapt to these features. The MADDPG algorithm addresses these challenges through a centralized training and decentralized execution architecture [14].

Specifically, each UAV is equipped with an independent Actor network (

π^{M} (s_{t}^{M}; θ)

) that outputs continuous actions based on its own state. Meanwhile, all UAVs share a centralized Critic network (

Q^{M} (s_{t}^{M}, a_{t}^{M}; ϕ)

), which evaluates values using global states and all UAVs’ actions during the training phase. To optimize the training process, the algorithm also establishes corresponding target Actor networks (

{π^{M}}^{'} (s_{t}^{M}; θ^{'})

) for each Actor network and a target Critic network (

{Q^{M}}^{'} (s_{t}^{M}, a_{t}^{M}; ϕ^{'})

) for the centralized Critic network. These target networks track the parameters of the main networks through a soft update mechanism. MADDPG additionally employs an experience replay buffer to store historical interaction data, thereby enhancing learning efficiency and stability. This design enables effective coordination via global information during training while allowing each UAV to make independent decisions based on local information during execution, thus effectively solving the multi-UAV collaborative trajectory planning problem in continuous action spaces.

During the training process, UAVs continuously interact with the environment, generating state-action-reward sequences

(s_{t}^{M}, a_{t}^{M}, r_{t}^{M}, s_{t + 1}^{M})

, and store the interaction data in the experience replay buffer

D

. Based on the experience replay buffer

D

, the parameters of the Critic network, Actor network, and their corresponding target networks are updated separately.

(1) Critic Network Update: First, B tuples

(s_{b}^{M}, a_{b}^{M}, r_{b}^{M}, s_{b + 1}^{M})

are sampled from the experience replay buffer

D

. Subsequently, the Critic network is updated with the objective of minimizing the temporal difference error. Its loss function is defined as the mean squared error between the predicted Q-value and the target Q-value:

L_{Q} (ϕ) = E_{(s_{b}^{M}, a_{b}^{M}, r_{b}^{M}, s_{b + 1}^{M})} [{(Q^{M} (s_{b}^{M}, a_{b}^{M}; ϕ) - y_{b}^{M})}^{2}],

(35)

and the target value is computed by the target networks according to the formula:

y_{b}^{M} = r_{b}^{M} + γ {Q^{M}}^{'} (s_{b + 1}^{M}, {π^{M}}^{'} (s_{b + 1}^{M}; θ^{'}); ϕ^{'}),

(36)

where

γ

denotes the discount factor. The parameters

ϕ

of the Critic network are updated via gradient descent:

ϕ \leftarrow ϕ - α_{Q} \nabla_{ϕ} L_{Q} (ϕ),

(37)

where

α_{Q}

is the learning rate.

(2) Actor Network Update: The objective of the Actor network is to maximize the Q-value evaluated by the Critic network. This is achieved by computing the deterministic policy gradient:

\nabla_{θ} J_{π} (θ) \approx E_{(s_{b}^{M}, a_{b}^{M}, r_{b}^{M}, s_{b + 1}^{M})} [\nabla_{a_{b}^{M}} Q^{M} (s_{b}^{M}, a_{b}^{M}; ϕ) |_{a_{b}^{M} = π^{M} (s_{b}^{M})} \nabla_{θ} π^{M} (s_{b}^{M}; θ)],

(38)

and the parameters are updated along the gradient ascent direction:

\begin{matrix} θ \leftarrow θ + α_{π} \nabla_{θ} J_{π} (θ), \end{matrix}

(39)

where

α_{π}

is the learning rate. This update ensures that the policy of each UAV improves in directions that yield higher system evaluation.

(3) Target Network Update: To stabilize training and break data correlations, the parameters of the target networks slowly track those of their corresponding main networks through soft updates. Specifically, the parameters of the target Critic network and target Actor networks are updated according to the following rules:

ϕ^{'} \leftarrow τ ϕ + (1 - τ) ϕ^{'},

(40)

θ^{'} \leftarrow τ θ + (1 - τ) θ^{'},

(41)

where

τ ≪ 1

is the soft update coefficient. The description of the proposed algorithm is shown in Algorithm 1.

Algorithm 1: Multi-Agent UAV Trajectory Control
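The numerical core of the Critic and target-network updates in (35), (36), (40), and (41) can be sketched without any deep learning library. In the sketch below, network outputs are assumed to be already evaluated and passed in as plain lists, parameters are flat lists of floats, and the default values of the discount factor and soft update coefficient are hypothetical; the function names are likewise illustrative.

```python
def critic_loss(q_values, rewards, next_target_q, gamma=0.95):
    """Mean squared TD error over a minibatch, following (35)-(36).

    q_values      : Q(s_b, a_b) from the main Critic
    rewards       : r_b
    next_target_q : target Critic evaluated at the next state with the
                    target Actors' actions for that next state
    """
    targets = [r + gamma * q_next for r, q_next in zip(rewards, next_target_q)]
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(targets)

def soft_update(target_params, main_params, tau=0.01):
    """Return new target parameters tau * main + (1 - tau) * target, as in (40)-(41)."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(main_params, target_params)]
```

With a small tau, repeated calls to `soft_update` move the target parameters toward the main parameters geometrically, which is what keeps the TD target in (36) slowly varying.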

5.3. Computational Resource Allocation

After UAVs arrive at their target edge servers, computational resources must be allocated at each server to improve the system QoS. This problem can be formulated as:

\begin{matrix} P 2 : min_{f_{s}^{u}} \sum_{s = 1}^{S} \sum_{u = 1}^{N_{s}} \frac{c_{u}}{f_{s}^{u}}, \end{matrix}

(42)

\begin{matrix} s.t. \quad \sum_{u = 1}^{N_{s}} f_{s}^{u} \leq f_{s}, \end{matrix}

(43)

\begin{matrix} 0 \leq f_{s}^{u} \leq f_{s} . \end{matrix}

(44)

Analysis of the objective function of

P 2

reveals a server-decoupling property: the resource allocation decision for each ground server s is independent of other servers. Therefore, the global optimal solution can be obtained by parallel solving of S structurally identical single-server resource allocation sub-problems. This is a typical convex optimization problem, and a closed-form optimal solution can be derived via the Karush-Kuhn-Tucker (KKT) conditions [28].

Theorem 1.

The optimal solution of problem P2 has the explicit expression:

{f_{s}^{u}}^{*} = f_{s} \frac{\sqrt{c_{u}}}{\sum_{k = 1}^{N_{s}} \sqrt{c_{k}}} .

(45)

Proof.

For a single server s, define its resource allocation sub-problem as:

\begin{matrix} P 2^{*} : min_{f_{s}^{u}} \sum_{u = 1}^{N_{s}} \frac{c_{u}}{f_{s}^{u}}, \end{matrix}

(46)

\begin{matrix} s . t . \sum_{u = 1}^{N_{s}} f_{s}^{u} \leq f_{s}, \end{matrix}

(47)

\begin{matrix} f_{s}^{u} \geq 0, u = 1, \dots, N_{s} . \end{matrix}

(48)

Construct the Lagrangian function:

L (f_{s}^{u}, λ) = \sum_{u = 1}^{N_{s}} \frac{c_{u}}{f_{s}^{u}} + λ (\sum_{u = 1}^{N_{s}} f_{s}^{u} - f_{s}),

(49)

where

λ \geq 0

is the Lagrange multiplier. Apply the KKT conditions:

\begin{matrix} \frac{\partial L}{\partial f_{s}^{u}} = - \frac{c_{u}}{{f_{s}^{u}}^{2}} + λ = 0, \forall u . \end{matrix}

(50)

Solve to obtain:

\begin{matrix} f_{s}^{u} = \sqrt{\frac{c_{u}}{λ}}, \forall u . \end{matrix}

(51)

Next, solve the complementary slackness condition based on the above equations and obtain the optimal Lagrange multiplier

λ

value:

λ^{*} = {(\frac{\sum_{u = 1}^{N_{s}} \sqrt{c_{u}}}{f_{s}})}^{2} .

(52)

Based on the above derivation, we can conclude that all optimal computational resource allocation values

f_{s}^{u}

,

\forall s \in S

and

\forall u \in U

are positive. Subsequently, by substituting

λ^{*}

into (51), we obtain the closed-form allocation in (45), which completes the proof of Theorem 1. □
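The closed-form allocation of Theorem 1 is easy to check numerically. The following is a small sketch with hypothetical function names: it evaluates (45) for one server and compares the resulting objective of the single-server sub-problem with that of an equal split.

```python
import math

def optimal_allocation(c, f_s):
    """Closed-form allocation of (45) for one server.

    c   : required CPU cycles c_u of the tasks associated with the server
    f_s : total computing resource of the server
    """
    roots = [math.sqrt(cu) for cu in c]
    total = sum(roots)
    return [f_s * r / total for r in roots]

def total_processing_delay(c, f):
    """Objective of the single-server sub-problem: sum of c_u / f_s^u."""
    return sum(cu / fu for cu, fu in zip(c, f))

# Example with three tasks on one server (values chosen for illustration).
c = [1.0, 4.0, 9.0]
f_opt = optimal_allocation(c, 12.0)           # proportional to sqrt(c_u)
delay_opt = total_processing_delay(c, f_opt)
delay_equal = total_processing_delay(c, [4.0, 4.0, 4.0])
```

In this example the allocation is proportional to 1 : 2 : 3 and uses the full capacity, and the resulting delay is lower than under the equal split, as the theorem predicts. At the optimum the objective equals the square of the sum of the square roots of the c_u divided by f_s.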

5.4. MA-JTATO Optimization Algorithm

The MA-JTATO algorithm integrates task offloading, UAV trajectory planning, and edge-server resource allocation into a unified, end-to-end learning framework. The algorithm proceeds in an episodic manner, where each episode corresponds to a complete decision cycle in the UAV-assisted edge computing system. The overall workflow is summarized in Algorithm 2 and described below.

Algorithm 2: Multi-agent Task Association and Trajectory Optimization (MA-JTATO)

The algorithm begins by initializing the neural networks for both the task-association and trajectory-control modules. For task association, an A2C structure is employed, consisting of an Actor network (

P (a^{A} | s^{A}; ϑ)

) and a Critic network (

V (s^{A}; ψ)

). For trajectory control, each UAV u is equipped with an Actor network (

π_{u}^{M} (s_{u}^{M}; θ_{u})

), while a centralized Critic network (

Q^{M} (S^{M}, A^{M}; ϕ)

) evaluates the joint actions of all UAVs. All target networks and replay buffers required by the MADDPG framework are also initialized.

Lines 5–22 define the episodic training loop. For each episode e = 1 to

E_{max}

, the global state

s^{A}

is built from the UAV and server coordinates, the computation loads

C

, and the server resources

F

. The A2C Actor then produces a probability distribution

p_{u}

over servers for every UAV. Based on this distribution, each UAV selects the server with the highest probability as its offloading target.

Once the task-association decisions are determined, the multi-UAV trajectory-planning sub-problem is solved by calling Algorithm 1, which yields the optimized flight paths

l_{u} (t)

for all UAVs. After the UAVs arrive at their designated servers, the computing resources are allocated according to the closed-form convex-optimization solution given in (45).

Lines 15–16 calculate the system reward

r^{A}

. The total delay

T_{total}

and total energy

E_{total}

are computed using the models from Section 3, and the immediate reward

r^{A}

is defined as the inverse of their weighted sum, i.e.,

r^{A}

=

1 / (ω_{1} T_{total} + ω_{2} E_{total})

.

Lines 17–21 perform the A2C network updates. The Critic loss is computed as the squared difference between the predicted state value

V (s^{A}; ψ)

and the observed reward

r^{A}

, and its parameters

ψ

are updated via gradient descent. The advantage function

A (s^{A}, a^{A})

=

r^{A} - V (s^{A}; ψ)

is then calculated. The Actor network is updated by maximizing the policy-gradient objective

J_{P} (ϑ)

=

log P (a^{A} | s^{A}; ϑ) \cdot A (s^{A}, a^{A}; ϑ, ψ)

using gradient ascent.

6. Simulation and Discussion

6.1. Parameter Settings and Baseline Algorithms

In this section, we verify the effectiveness of the proposed algorithm and present simulation results under different conditions. The simulation experiments are conducted in a Python 3.10 environment. The considered system area has a size of

400 \times 400 m^{2}

. The noise power spectral density is set to

- 169 dBm / Hz

, which is derived from the theoretical thermal noise

- 174 dBm / Hz

, and then

5 dB

is added for practical non-idealities in UAV communications. The weighting coefficients

ω_{1}

and

ω_{2}

are set to 0.8 and 0.2, respectively, aiming to prioritize latency performance while achieving a reasonable trade-off between latency and energy consumption. The weighting coefficients

μ_{1}

and

μ_{2}

are set to 0.9 and 0.1, respectively, aiming to prioritize ensuring the UAVs reach the target server accurately and quickly while saving energy consumption. Other simulation parameters and their default values are listed in Table 1 with reference to [11,27,30].
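The noise setting can be reproduced with a short calculation. The sketch below is illustrative: the function names are hypothetical, the margin is treated as an offset added in the logarithmic domain, and the 15 MHz bandwidth used in the example is the value that appears later in the bandwidth experiment, not necessarily the default of Table 1.

```python
import math

THERMAL_PSD_DBM_HZ = -174.0   # theoretical thermal noise density
MARGIN_DB = 5.0               # allowance for practical non-idealities
NOISE_PSD_DBM_HZ = THERMAL_PSD_DBM_HZ + MARGIN_DB   # -169 dBm/Hz

def noise_power_dbm(psd_dbm_per_hz, bandwidth_hz):
    """Total noise power in dBm over a given bandwidth."""
    return psd_dbm_per_hz + 10.0 * math.log10(bandwidth_hz)

def dbm_to_watt(p_dbm):
    """Convert a power level from dBm to watts."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)

# Noise power over an assumed 15 MHz channel, about -97.2 dBm.
noise_15mhz_dbm = noise_power_dbm(NOISE_PSD_DBM_HZ, 15e6)
```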

This paper adopts four baseline algorithms: Twin Delayed Deep Deterministic policy gradient algorithm (TD3), DQN, Ant Colony Optimization (ACO) algorithm, and Greedy algorithm:

(1) TD3 [31]: TD3 is an improved reinforcement learning algorithm for continuous action spaces based on DDPG. It suppresses Q-value overestimation by taking the minimum output of dual-value networks, reduces error sensitivity via delayed policy network updates, and enhances training stability by adding and clipping noise to target actions. TD3 optimizes value networks with experience replay and temporal difference, and updates policy networks via deterministic policy gradient.

(2) DQN [32]: DQN is a reinforcement learning algorithm based on value iteration, which approximates the Q-function through deep neural networks and can efficiently handle decision-making problems in high-dimensional state spaces. The introduction of DQN enables UAVs to autonomously learn optimal policies through reinforcement learning and adapt to dynamic environments.

(3) ACO [33]: ACO is a classical heuristic optimization algorithm that effectively handles multi-objective optimization problems and possesses strong global search capabilities. ACO demonstrates strong adaptability and flexibility in finding optimal solutions, achieving satisfactory results in complex environments.

(4) Greedy [34]: The Greedy algorithm approximates the global optimum by selecting the current locally optimal solution at each step.

6.2. Convergence Analysis

The training convergence characteristics of the MA-JTATO algorithm are reflected by the loss convergence curves of MADDPG and A2C, as well as the reward convergence curve of the overall algorithm. As illustrated in Figure 2 and Figure 3, the loss curves of both MADDPG for UAV trajectory control and A2C for task association can achieve rapid convergence. The loss values of the two algorithms decrease rapidly in the early training stage, and finally, the two curves converge to a low and stable level without obvious fluctuations.

The reward convergence curve of the MA-JTATO algorithm further reflects the robust convergence performance of this optimization framework, as illustrated in Figure 4. Over a total of 2000 training episodes, the model’s episodic reward steadily improves and eventually stabilizes, demonstrating significant overall performance enhancement with a smooth progression. During the initial training phase, the agent engages in random exploration for approximately the first 750 episodes. This is followed by a stable learning period lasting about 500 episodes, during which the reward shows an approximately linear upward trend. This stage represents the core process of systematic policy refinement, indicating that the value estimates provided by the Critic network effectively guide the optimization direction of the Actor’s policy. After approximately the 1250th episode, the reward curve enters a plateau, signifying that the policy has largely converged to a near-optimal solution. The entire training process remains free from severe oscillations or performance collapse, validating the stability and robustness of the proposed MA-JTATO algorithm.

6.3. Performance Analysis

In this subsection, we conduct a comparative evaluation of the proposed MA-JTATO algorithm against other baseline algorithms in terms of system latency and QoS. We evaluate these two metrics because they are core indicators for delay-sensitive and quality-guaranteed services in UAV-assisted edge computing systems. Latency directly determines the real-time performance of task offloading and service response. QoS reflects the comprehensive service quality from system requirements, including transmission reliability, service satisfaction, and resource utilization efficiency.

Figure 5 and Figure 6 illustrate the variation of system latency of the proposed MA-JTATO algorithm with respect to two key system parameters: the number of edge servers and the number of UAVs.

It can be observed that the proposed MA-JTATO algorithm consistently achieves the lowest system latency compared to four baseline algorithms across various scenarios. For instance, when the system is configured with

U = 10

and

S = 6

, the latency of MA-JTATO is 95.9%, 93.5%, 79.2%, and 62.8% of that attained by the TD3, DQN, ACO, and Greedy algorithms, respectively. Similarly, under a denser deployment with

U = 12

and

S = 5

, the latency of MA-JTATO remains superior, corresponding to 94.3%, 88.5%, 84.5%, and 70.6% of the latency observed under the TD3, DQN, ACO, and Greedy algorithms.
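The latency ratios quoted above can be restated as relative reductions, which is often the easier figure to compare. The snippet below only rearranges the percentages reported in the text; the variable and function names are hypothetical.

```python
# Latency of MA-JTATO as a percentage of each baseline (values from the text).
ratio_u10_s6 = {"TD3": 95.9, "DQN": 93.5, "ACO": 79.2, "Greedy": 62.8}
ratio_u12_s5 = {"TD3": 94.3, "DQN": 88.5, "ACO": 84.5, "Greedy": 70.6}

def reduction_pct(ratio_pct):
    """Relative latency reduction in percent, given latency as a percent of a baseline."""
    return round(100.0 - ratio_pct, 1)

reduction_u10_s6 = {name: reduction_pct(r) for name, r in ratio_u10_s6.items()}
reduction_u12_s5 = {name: reduction_pct(r) for name, r in ratio_u12_s5.items()}
```

For U = 10 and S = 6, for instance, this corresponds to latency reductions of 4.1%, 6.5%, 20.8%, and 37.2% relative to TD3, DQN, ACO, and Greedy, respectively.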

Furthermore, the results reveal a clear trend: the system latency of all evaluated algorithms increases with the number of UAVs and decreases with the number of edge servers. This behavior can be attributed to the underlying resource-competition dynamics. Specifically, a larger number of UAVs introduces a higher aggregated task load, which intensifies contention for both communication bandwidth and computational capacity, thereby elevating transmission and processing delays. Conversely, increasing the number of edge servers expands the available resource pool, offering more bandwidth and computational units for task offloading and execution, which in turn mitigates latency.

Figure 7 and Figure 8 depict the variation of the QoS performance of the proposed algorithm with respect to the number of edge servers and the number of UAVs. It can be observed that the MA-JTATO algorithm consistently achieves superior QoS performance compared to the four baseline algorithms. For example, in a configuration with

U = 10

and

S = 6

, the QoS attained by MA-JTATO is 1.08 times, 1.171 times, 1.287 times, and 1.67 times that of the TD3, DQN, ACO, and Greedy algorithms, respectively.

Similarly, under a more constrained setting with

U = 12

UAVs and

S = 5

edge servers, the MA-JTATO maintains its leading performance. Here, the achieved QoS is 1.09 times, 1.16 times, 1.308 times, and 1.52 times that of the TD3, DQN, ACO, and Greedy algorithms, respectively.

The observed trends further reveal that the QoS of all compared algorithms generally improves with an increase in the number of edge servers, while it tends to degrade as the number of UAVs grows. This behavior aligns with the intuitive trade-off between computational resource availability and task demand. The MA-JTATO algorithm effectively balances this trade-off through its multi-agent coordination mechanism, leading to sustained and significant QoS enhancements over a range of system scales.

Figure 9 presents the analysis of QoS performance for the proposed MA-JTATO algorithm with respect to the available computing resources of edge servers. The results clearly demonstrate that the MA-JTATO algorithm consistently surpasses all four baseline algorithms across the entire spectrum of computational capacities considered. This persistent superiority highlights the robustness and computational-aware adaptability of the proposed multi-agent coordination mechanism, regardless of the infrastructure’s processing capability. When the server computing resource is

30 \times 10^{9}

CPU-cycles/s, the QoS of the MA-JTATO algorithm is 1.139 times, 1.339 times, 1.667 times, and 1.94 times that of the TD3, DQN, ACO, and Greedy algorithms, respectively.

Moreover, a monotonic improvement in QoS is observed for all algorithms as the computing resources of the edge servers increase. This trend aligns with the expected system behavior: greater computational capacity reduces processing delays, enhances task completion rates, and thereby elevates the overall service quality.

Figure 10 presents the analysis of QoS performance for the proposed MA-JTATO algorithm with respect to the communication resources of edge servers. The results clearly demonstrate that the MA-JTATO algorithm consistently surpasses all four baseline algorithms across the entire spectrum of server communication bandwidth considered. When the communication bandwidth of edge servers is 15 MHz, the QoS of the MA-JTATO algorithm is 1.298 times, 1.526 times, 1.758 times, and 2.217 times that of the TD3, DQN, ACO, and Greedy algorithms, respectively.

Figure 11 presents the analysis of QoS performance for the proposed MA-JTATO algorithm with respect to the task size of UAVs. The results clearly demonstrate that the MA-JTATO algorithm consistently surpasses all four baseline algorithms across the entire spectrum of UAV task size considered. This persistent superiority highlights the load-aware optimization capability of the proposed multi-agent coordination mechanism. When the UAV task size is

6 \times 10^{6}

bits, the QoS of the MA-JTATO algorithm is 1.079 times, 1.289 times, 1.398 times, and 1.526 times that of the TD3, DQN, ACO, and Greedy algorithms, respectively.

7. Conclusions and Future Work

This paper addresses the challenges of high task latency and excessive energy consumption in UAV-assisted edge computing systems by conducting an in-depth study on the joint optimization of task association, UAV trajectory planning, and edge resource allocation. We propose an intelligent collaborative optimization framework based on multi-agent reinforcement learning. The presented MA-JTATO algorithm decomposes the complex joint optimization problem into three sub-problems: task association, trajectory control, and resource allocation. These sub-problems are then efficiently solved using deep reinforcement learning and optimization theory, respectively, ultimately leading to overall system-level performance enhancement. Specifically, the Actor-Critic-based task association method ensures dynamic optimality in matching, the MADDPG-driven multi-UAV collaborative trajectory planning optimizes communication links while reducing flight energy consumption, and the optimization-theory-derived resource allocation scheme maximizes the computational efficiency of edge servers. Extensive simulation results confirm that, compared to various baseline algorithms, the proposed framework significantly reduces system task processing latency and overall energy consumption, demonstrating strong adaptability and stability in dynamic network environments.

Although the proposed MA-JTATO framework has achieved favorable performance in extensive simulation experiments, there remain several valuable directions to be further explored for higher practicality and generalization. First, to simplify the calculation of data transmission delay, this paper assumes that UAVs maintain a fixed flight altitude. If this assumption is not satisfied, and the UAV’s flight altitude is allowed to be adjusted flexibly, it will increase the complexity of system modeling, introduce altitude variables, and require optimizing the UAV’s flight altitude through optimization algorithms. In future work, the trajectory planning module will be extended to three-dimensional dynamic planning, enabling UAVs to adjust their altitude autonomously according to obstacles, communication interference, and complex terrain, so as to further improve the environmental adaptability and robustness of the system. Second, the current air-to-ground channel modeling adopts idealized assumptions with high LoS probability and free-space path loss. In future research, we will incorporate more realistic air-to-ground channel characteristics, including channel fading, signal blockage, and multi-source communication interference, to enhance the model fidelity and generalization in complex communication environments. In addition, security and authentication issues in multi-UAV systems have not been fully considered in this work. In future research, we will introduce secure authentication mechanisms among UAVs to ensure reliable and secure data and model sharing. We will also draw on advanced security schemes such as cross-layer fingerprint fusion to enhance communication security and resist malicious attacks. Finally, the performance of the proposed algorithm is verified only through numerical simulations. In the future, the algorithm will be deployed on physical UAV platforms and edge computing testbeds, and field experiments will be carried out in practical scenarios such as emergency communications and disaster relief. The model will be further optimized based on real-world measurement data to promote its practical deployment and engineering applications.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z. and Z.W.; validation, Y.Z.; formal analysis, Y.Z.; investigation, Y.Z.; resources, Y.Z. and Z.W.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z. and Z.W.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The simulation codes and datasets that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Tao, F.; Cheng, Y.; Da Xu, L.; Zhang, L.; Li, B.H. CCIoT-CMfg: Cloud computing and internet of things-based cloud manufacturing service system. IEEE Trans. Ind. Inform. 2014, 10, 1435–1442.
Qin, P.; Li, J.; Zhang, J.; Fu, Y. Joint task allocation and trajectory optimization for multi-UAV collaborative air-ground edge computing. IEEE Trans. Netw. Sci. Eng. 2024, 11, 6231–6243.
Huang, Y.; Li, R.; Chen, M.; Zhao, F.; Zhang, D.; Tu, W. Securing UAV Communications by Fusing Cross-Layer Fingerprints. IEEE Internet Things J. 2025, 13, 2462–2475.
Cordill, B.; Fang, D.; Xu, S. A Comprehensive Survey of Security and Privacy in UAV Systems. IEEE Access 2025, 13, 117843–117866.
Liu, B.; Zhang, W.; Chen, W.; Huang, H.; Guo, S. Online computation offloading and traffic routing for UAV swarms in edge-cloud computing. IEEE Trans. Veh. Technol. 2020, 69, 8777–8791.
Dai, X.; Xiao, Z.; Jiang, H.; Lui, J.C. UAV-assisted task offloading in vehicular edge computing networks. IEEE Trans. Mob. Comput. 2023, 23, 2520–2534.
Seid, A.M.; Boateng, G.O.; Mareri, B.; Sun, G.; Jiang, W. Multi-agent DRL for task offloading and resource allocation in multi-UAV enabled IoT edge network. IEEE Trans. Netw. Serv. Manag. 2021, 18, 4531–4547.
Deng, C.; Fang, X.; Wang, X. UAV-enabled mobile-edge computing for AI applications: Joint model decision, resource allocation, and trajectory optimization. IEEE Internet Things J. 2022, 10, 5662–5675.
Liu, J.; Xu, Z.; Wen, Z. Joint data transmission and trajectory optimization in UAV-enabled wireless powered mobile edge learning systems. IEEE Trans. Veh. Technol. 2023, 72, 11617–11630.
Wang, Y.; Ru, Z.Y.; Wang, K.; Huang, P.Q. Joint deployment and task scheduling optimization for large-scale mobile users in multi-UAV-enabled mobile edge computing. IEEE Trans. Cybern. 2019, 50, 3984–3997.
Chen, J.; Cao, X.; Yang, P.; Xiao, M.; Ren, S.; Zhao, Z.; Wu, D.O. Deep reinforcement learning based resource allocation in multi-UAV-aided MEC networks. IEEE Trans. Commun. 2022, 71, 296–309.
Zhu, B.; Zhang, R.; Ma, F.; Yang, X. A Trajectory Planning and Task Offloading Collaborative Optimization Method for Multi-UAV Assisted MEC. In Proceedings of the 2025 IEEE International Conference on Unmanned Systems (ICUS); IEEE: New York, NY, USA, 2025; pp. 425–432.
Peng, B.; Li, X.; Gao, J.; Liu, J.; Chen, Y.N.; Wong, K.F. Adversarial advantage actor-critic model for task-completion dialogue policy learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2018; pp. 6149–6153.
Zheng, S.; Liu, H. Improved multi-agent deep deterministic policy gradient for path planning-based crowd simulation. IEEE Access 2019, 7, 147755–147770.
Akter, S.; Duong, D.V.A.; Yoon, S. Joint Optimization of UAV Trajectory, Task Offloading, and Resource Allocation in UAV-Aided Emergency Response Operations. IEEE Internet Things J. 2025, 12, 21944–21959.
Chen, Y.; Yang, Y.; Wu, Y.; Huang, J.; Zhao, L. Joint trajectory optimization and resource allocation in UAV-MEC systems: A Lyapunov-assisted DRL approach. IEEE Trans. Serv. Comput. 2025, 18, 854–867.
Zhao, K.; Peng, L.; Tak, B. Joint DRL-based UAV trajectory planning and TEG-based task offloading. IEEE Trans. Consum. Electron. 2025, 71, 3779–3789.
Ahmed, M.; Fatima, N.; Raza, S.; Ali, H.; Qayum, A.; Khan, W.U.; Sheraz, M.; Chuah, T.C. Optimizing resource allocation and task offloading in multi-UAV MEC networks. IEEE Access 2025, 13, 68710–68725.
Zeng, Y.; Zhang, R. Energy-efficient UAV communication with trajectory optimization. IEEE Trans. Wirel. Commun. 2017, 16, 3747–3760.
Tu, W. Resource-efficient seamless transitions for high-performance multi-hop UAV multicasting. Comput. Networks 2022, 213, 109051.
Tarekegn, G.B.; Tesfaw, B.A.; Juang, R.T.; Saha, D.; Tarekegn, R.B.; Lin, H.P.; Tai, L.C. Trajectory control and fair communications for multi-UAV networks: A federated multi-agent deep reinforcement learning approach. IEEE Trans. Wirel. Commun. 2025, 24, 7598–7611.
Alwarafy, A.; Melhem, S.B.; Abou Chahine, R.; Said, B.; Alharethi, M.; Almazrouei, L.; Alblooshi, S. Deep Reinforcement Learning-Based Joint Trajectory Design and Resource Allocation for Secure and Energy-Efficient UAV Networks. IEEE Open J. Commun. Soc. 2025, 6, 6491–6505.
Li, K.; Fan, H.; Yang, Y.; Wang, C.; Gao, Q. Multi-agent reinforcement learning-based UAV path and resource allocation for ground-to-air communication network. IEEE Internet Things J. 2025, 12, 44243–44254.
Yuan, Z.; Bi, Y.; Fan, Y.; Liu, Y.; Ma, L.; Zhao, L.; He, Q. Trajectory optimization and power allocation for multi-UAV wireless networks: A communication-based multi-agent deep reinforcement learning approach. IEEE Trans. Comput. 2025, 74, 3404–3418.
Liu, Z.; Zhang, J.; Zeng, Y.; Ai, B. Energy-efficient multi-agent reinforcement learning for UAV trajectory optimization in cell-free massive MIMO networks. IEEE Trans. Wirel. Commun. 2025, 24, 5917–5930.
Mondal, A.; Mishra, D.; Alexandropoulos, G.C.; Al-Nahari, A.; Jäntti, R. Multi-agent reinforcement learning for offloading cellular communications with cooperating UAVs. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 9344–9358.
Sun, G.; He, L.; Sun, Z.; Wu, Q.; Liang, S.; Li, J.; Niyato, D.; Leung, V.C. Joint task offloading and resource allocation in aerial-terrestrial UAV networks with edge and fog computing for post-disaster rescue. IEEE Trans. Mob. Comput. 2024, 23, 8582–8600.
Xu, J.; Yao, H.; Zhang, R.; Mai, T.; Guizani, M. Low latency and accuracy-guaranteed DNN inference for UAV-assisted IoT networks. IEEE Trans. Cogn. Commun. Netw. 2025, 11, 4050–4061.
Liu, B.; Ni, W.; Liu, R.P.; Guo, Y.J.; Zhu, H. Decentralized, privacy-preserving routing of cellular-connected unmanned aerial vehicles for joint goods delivery and sensing. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9627–9641.
Wang, L.; Wang, K.; Pan, C.; Xu, W.; Aslam, N.; Nallanathan, A. Deep reinforcement learning based dynamic trajectory control for UAV-assisted mobile edge computing. IEEE Trans. Mob. Comput. 2021, 21, 3536–3550.
Zhang, Y.; Zhang, Y.; Yu, Z.; Li, J.; Qin, Q.; Gao, C. A twin-delayed deep deterministic policy gradient approach for UAV formation control. In Proceedings of the 2024 American Control Conference (ACC); IEEE: New York, NY, USA, 2024; pp. 2393–2398.
Zhang, L.; Zhou, W.; Xia, J.; Gao, C.; Zhu, F.; Fan, C.; Ou, J. DQN-based mobile edge computing for smart Internet of vehicle. EURASIP J. Adv. Signal Process. 2022, 2022, 45.
Feng, J.; Liu, Z.; Wu, C.; Ji, Y. AVE: Autonomous vehicular edge computing framework with ACO-based scheduling. IEEE Trans. Veh. Technol. 2017, 66, 10660–10675.
Wei, F.; Chen, S.; Zou, W. A greedy algorithm for task offloading in mobile edge computing system. China Commun. 2018, 15, 149–157.

Figure 1. Multi-UAV-Assisted edge computing system.

Figure 2. MADDPG training loss versus episodes.

Figure 3. A2C training loss versus episodes.

Figure 4. Training reward versus episodes.

Figure 5. System latency versus the number of servers with U = 10.

Figure 6. System latency versus the number of UAVs with S = 5.

Figure 7. System QoS versus the number of servers with U = 10.

Figure 8. System QoS versus the number of UAVs with S = 5.

Figure 9. System QoS versus computing resources with U = 7 and S = 3.

Figure 10. System QoS versus communication resources with U = 6 and S = 3.

Figure 11. System QoS versus task size with U = 6 and S = 3.

Table 1. Simulation Parameters.

Parameter	Value
UAV altitude, $H$	100 m
Slot duration, $\Delta t$	1 s
Computation resources of servers, $f_{s}$	$[30, 50] \times 10^{9}$ CPU cycles/s
Task size, $d_{u}$	$[1, 10] \times 10^{6}$ bits
Task computation density, $c_{u}$	$[100, 1000] \times 10^{6}$ cycles
Channel gain, $h_{0}$	$1.42 \times 10^{-4}$
Transmit power, $p$	0.1 W
Bandwidth, $B$	10 MHz
Noise power spectral density, $\sigma^{2}$	$-169$ dBm/Hz
Weighting coefficient, $\omega_{1}$	0.8
Weighting coefficient, $\omega_{2}$	0.2
Weighting coefficient, $\mu_{1}$	0.9
Weighting coefficient, $\mu_{2}$	0.1
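
As a rough sanity check on the scale of the communication parameters in Table 1, the following sketch computes a Shannon-type achievable rate and the resulting offloading delay. It rests on assumptions not stated in this section: that $h_{0}$ is the channel power gain at a reference distance of 1 m, that the link is pure line-of-sight with free-space (distance-squared) path loss, and that the noise power equals the spectral density times the bandwidth. All function names are hypothetical, and the paper's own communication model may differ in detail.

```python
import math


def noise_power_w(psd_dbm_per_hz, bandwidth_hz):
    """Convert a noise PSD in dBm/Hz to total noise power in watts."""
    psd_w_per_hz = 10 ** ((psd_dbm_per_hz - 30) / 10)
    return psd_w_per_hz * bandwidth_hz


def achievable_rate_bps(horizontal_dist_m, altitude_m=100.0, h0=1.42e-4,
                        tx_power_w=0.1, bandwidth_hz=10e6,
                        noise_psd_dbm_per_hz=-169.0):
    """Shannon rate over an assumed LoS free-space UAV-to-server link."""
    # Squared 3D distance between the UAV and a ground server.
    dist_sq = horizontal_dist_m ** 2 + altitude_m ** 2
    snr = tx_power_w * h0 / (dist_sq * noise_power_w(noise_psd_dbm_per_hz,
                                                     bandwidth_hz))
    return bandwidth_hz * math.log2(1.0 + snr)


def offloading_delay_s(task_bits, rate_bps):
    """Time to transmit a task of task_bits at rate_bps."""
    return task_bits / rate_bps


if __name__ == "__main__":
    for dist in (0.0, 200.0, 500.0):
        rate = achievable_rate_bps(dist)
        delay = offloading_delay_s(10e6, rate)  # largest task size in Table 1
        print(f"d={dist:6.1f} m  rate={rate / 1e6:7.2f} Mbit/s  delay={delay:.3f} s")
```

Under these assumptions, the rate falls as the horizontal distance grows, which is the coupling between trajectory and offloading delay that the trajectory-control sub-problem exploits.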
