Article

MA-PF-AD3PG: A Multi-Agent DRL Algorithm for Latency Minimization and Fairness Optimization in 6G IoV-Oriented UAV-Assisted MEC Systems

1 School of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China
2 State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
3 Key Laboratory of Networked Control Systems, Chinese Academy of Sciences, Shenyang 110016, China
4 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Submission received: 13 November 2025 / Revised: 16 December 2025 / Accepted: 24 December 2025 / Published: 25 December 2025
(This article belongs to the Section Drone Communications)

Highlights

What are the main findings?
  • We develop a priority–fairness coupled optimization framework together with a multi-agent DRL algorithm (MA-PF-AD3PG) to jointly optimize latency, fairness, and task priority in UAV-assisted 6G IoV MEC systems.
  • The proposed algorithm incorporates an occlusion-aware dynamic deadline model, fairness-aware preprocessing, and an adaptive delayed update mechanism, achieving significantly improved convergence stability and scheduling performance.
What are the implications of the main findings?
  • The results demonstrate that fairness-driven multi-UAV cooperation can sustain near-perfect service fairness while reducing latency under dynamic vehicular environments.
  • The findings offer practical insights for designing next-generation UAV-assisted drone communication systems that require balanced QoS, priority awareness, and system-wide efficiency.

Abstract

The rapid proliferation of connected and autonomous vehicles in the 6G era demands ultra-reliable and low-latency computation with intelligent resource coordination. Unmanned Aerial Vehicle (UAV)-assisted Mobile Edge Computing (MEC) provides a flexible and scalable solution to extend coverage and enhance offloading efficiency for dynamic Internet of Vehicles (IoV) environments. However, jointly optimizing task latency, user fairness, and service priority under time-varying channel conditions remains a fundamental challenge. To address this issue, this paper proposes a novel Multi-Agent Priority-based Fairness Adaptive Delayed Deep Deterministic Policy Gradient (MA-PF-AD3PG) algorithm for UAV-assisted MEC systems. An occlusion-aware dynamic deadline model is first established to capture real-time link blockage and channel fading. Based on this model, a priority–fairness coupled optimization framework is formulated to jointly minimize overall latency and balance service fairness across heterogeneous vehicular tasks. To efficiently solve this NP-hard problem, the proposed MA-PF-AD3PG integrates fairness-aware service preprocessing and an adaptive delayed update mechanism within a multi-agent deep reinforcement learning structure, enabling decentralized yet coordinated UAV decision-making. Extensive simulations demonstrate that MA-PF-AD3PG achieves superior convergence stability, 13–57% higher total rewards, up to 46% lower delay, and nearly perfect fairness compared with state-of-the-art Deep Reinforcement Learning (DRL) and heuristic methods.

1. Introduction

The continuous emergence of new services driven by the Internet of Things (IoT), artificial intelligence, and the industrial Internet has placed unprecedented multi-dimensional demands on network communication, characterized by the requirements of high bandwidth, low latency, and high reliability [1]. These requirements are particularly critical in the Internet of Vehicles (IoV)—one of the most promising IoT application scenarios—where dynamic interactions among vehicles, infrastructure, and users impose stringent performance constraints on communication networks [2].
As the next-generation mobile communication technology, the Sixth Generation (6G) network is envisioned to provide ultra-high data rates, enhanced reliability, and extremely low latency [3]. These capabilities align naturally with the performance needs of emerging IoV services, making 6G a key enabler for real-time vehicular data transmission and edge-assisted computation [4]. In parallel, Mobile Edge Computing (MEC)—a foundational component of 6G—brings computation resources closer to end users, thereby alleviating the congestion and transmission delays inherent to conventional cloud computing architectures. This proximity reduces end-device energy consumption, prolongs battery life, and enhances the overall Quality of Service (QoS) [5]. In IoV environments, the synergy between 6G and MEC is particularly vital for supporting latency-sensitive applications such as autonomous driving, traffic monitoring, and safety-critical control [6,7].
However, conventional two-dimensional ground network infrastructures often suffer from inadequate coverage in complex urban environments due to physical obstructions and terrain irregularities [8,9]. To overcome these limitations, 6G is evolving toward three-dimensional network architectures by integrating Unmanned Aerial Vehicles (UAVs) as aerial base stations [10]. UAVs can effectively extend coverage in obstructed urban areas or post-disaster regions, ensuring reliable communication links for autonomous vehicles and rescue systems. Consequently, UAVs are increasingly regarded as indispensable components of IoV-oriented MEC systems, providing flexible and on-demand communication and computation support [11,12].
The Mixed Broadband Reliable Low-Latency Communication (MBRLLC) service introduced in 6G integrates the advantages of enhanced Mobile Broadband (eMBB) and Ultra-Reliable Low-Latency Communication (URLLC). This service model inherently reflects task prioritization in IoV applications: eMBB tasks (e.g., in-vehicle 4K video streaming) emphasize bandwidth; URLLC tasks (e.g., collision avoidance) demand millisecond-level latency and the highest priority; and MBRLLC tasks (e.g., augmented reality navigation) require balanced performance and occupy a medium priority level. Such heterogeneous priorities call for adaptive and priority-aware task offloading strategies in MEC systems. Neglecting task priorities may delay safety-critical services and compromise vehicular safety [13,14].
Nevertheless, achieving efficient and fair task offloading in UAV-assisted IoV-MEC systems remains highly challenging. The high mobility of vehicles results in rapidly changing network topology, rendering static offloading strategies ineffective. Meanwhile, UAV-assisted coverage is inherently vulnerable to dynamic occlusions, mobility-induced channel fluctuations, and strict energy constraints, which may lead to intermittent connectivity and degraded reliability compared with terrestrial infrastructures. Moreover, the use of MEC along vehicular routes introduces non-stationary performance, as vehicles continuously traverse heterogeneous coverage regions with time-varying resource availability and traffic demand. Such performance variability may cause fluctuating latency and throughput, potentially degrading QoS if not properly addressed. These challenges highlight the necessity of adaptive and robustness-aware resource allocation mechanisms that jointly optimize latency, fairness, and service priority in dynamic 6G IoV environments.

2. Related Work

UAV-assisted MEC has been extensively studied as a means to provide flexible computation support and reduce service latency in dynamic IoV environments.
In [15], the CLACMO framework was developed to jointly optimize caching, UAV trajectory, and offloading policies by integrating a variational autoencoder with Deep Reinforcement Learning (DRL). This approach achieved significant improvements in average reward and task completion rate through efficient latent-space exploration. Meanwhile, a DRL-based model for dynamic UAV trajectory control and user association was presented in [16], enabling adaptive decision-making under time-varying network conditions and improving task offloading efficiency in single-UAV scenarios. To further enhance fairness, Yang et al. [17] designed a Reward-Shaping DRL algorithm based on a max–min optimization framework, which simultaneously optimized UAV trajectories, task allocation, and service scheduling in post-disaster edge environments.
Building upon these efforts, a federated multi-agent reinforcement learning framework was introduced in [18] to minimize the age of information in UAV-assisted MEC systems modeled as partially observable Markov decision processes (POMDPs). The design enabled collaborative policy learning among UAVs to improve information freshness. In addition, ref. [19] proposed a hybrid algorithm that combines particle swarm optimization (PSO) diversity with a twin delayed deep deterministic policy gradient (TD3) model, effectively balancing offloading, energy consumption, and UAV trajectory planning. Similarly, Zheng et al. [20] introduced a priority-aware DRL scheme that jointly planned UAV access paths and user offloading strategies, thereby enhancing system utility under heterogeneous task priorities.
On another front, the authors in [21] examined multi-UAV cooperation and resource allocation, where a joint power–computation optimization algorithm was designed to minimize total energy consumption of IoT terminals while maintaining service reliability. Furthermore, Huang et al. [22] leveraged federated DRL to optimize UAV caching and vehicular offloading in IoV networks, improving cache hit rates and reducing redundant transmissions. Following a similar direction, ref. [23] proposed a hierarchical air–ground MEC architecture combined with dual-timescale DRL to optimize UAV trajectories and reduce information latency. Finally, Liu et al. [24] developed a QoS-aware multi-agent DDPG framework for vehicular edge computing, coordinating offloading and resource allocation among multiple vehicles to improve overall system throughput and QoS satisfaction.
While these studies have advanced UAV-assisted MEC from the perspectives of trajectory optimization, resource scheduling, and distributed decision-making, most existing approaches focus mainly on improving offloading efficiency and do not jointly address heterogeneous service priorities and fairness among vehicular users under dynamic network conditions. Furthermore, transient link variations and limited UAV energy and computation capacity make real-time coordination more complex in large-scale IoV scenarios. This motivates the development of a unified framework that adaptively allocates resources while balancing priority and fairness.
The main contributions of this paper are summarized as follows:
  • Occlusion-Aware Dynamic Deadline Model: A novel environment-aware model is developed to capture the time-varying impact of occlusion on channel conditions, transforming real-time link variations into dynamic noise parameters. The model incorporates system-wide hard deadlines based on total task volume to better reflect vehicular service constraints.
  • Priority–Fairness Coupled Optimization Framework: The latency minimization problem is reformulated into a two-level optimization structure, consisting of fairness-driven vehicular scheduling and UAV trajectory/resource allocation. This formulation enables adaptive decision-making while maintaining fairness and priority consistency among heterogeneous tasks.
  • Multi-Agent Priority-based Fairness Adaptive Delayed DDPG (MA-PF-AD3PG) Algorithm: A novel MADRL algorithm is proposed by embedding a three-dimensional state representation—priority evaluation, connection history, and fairness factor—within an enhanced MADDPG architecture. The algorithm achieves adaptive trajectory control and offloading optimization, improving system fairness and reducing latency in dynamic 6G IoV environments.
This paper is organized as follows. Section 2 provides a comprehensive review of related studies on IoV and UAV-assisted MEC task offloading. Section 3 describes the system model for the considered IoV scenario and formulates the corresponding optimization problem. Section 4 models the problem as an MDP and presents a MADRL-based algorithm for dynamic resource allocation. Section 5 evaluates the proposed approach through extensive comparative experiments to demonstrate its effectiveness and stability. Finally, Section 6 concludes the paper and discusses potential directions for future research.

3. System Model

We consider a UAV-assisted MEC system deployed in a 6G IoV scenario, as illustrated in Figure 1. The system consists of M UAVs and N vehicular end devices (VEs) distributed within a two-dimensional region of size $L \times W$. Each UAV flies at a fixed altitude H, is equipped with limited onboard computing resources, and provides edge computation services to vehicles through wireless communication links. Let $\mathcal{M} = \{1, 2, \ldots, M\}$ and $\mathcal{N} = \{1, 2, \ldots, N\}$ denote the sets of UAVs and VEs, respectively. The system operates in discrete time, where the total operation duration $T_{\mathrm{tot}}$ is divided into T equal-length time slots with duration $T_{\mathrm{slot}}$. The time slot index set is written as $\mathcal{T} = \{1, 2, \ldots, T\}$.
In each slot $t \in \mathcal{T}$, VE $n$ generates a computation task with data size $D_n(t) \in [D_{n,\min}, D_{n,\max}]$. Due to limited local processing capability, each task can be partially executed locally and partially offloaded to UAVs for remote computation. After the offloaded task is processed, the result is transmitted back to the vehicle through the downlink channel. To meet diverse QoS requirements in IoV applications, tasks are classified into three priority categories: URLLC (high priority), MBRLLC (medium priority), and eMBB (low priority).
UAV mobility and energy constraints are considered to ensure reliable and safe operation. The distance between any two UAVs must remain above a predefined threshold $d_{\mathrm{safe}}$ for collision avoidance. The remaining battery energy of UAV $m$, denoted $E_{m,\mathrm{re}}(t)$, must satisfy $E_{m,\mathrm{re}}(t) \ge E_{\mathrm{th}}$ to ensure continuous flight and computation. Additionally, UAV $m$ must process a minimum total task volume $D_m$ within $T_{\mathrm{tot}}$ to maintain balanced load utilization.

3.1. Communication Model

The position of VE n at time slot t is denoted by
$Q_n(t) = [X_n(t),\, Y_n(t),\, 0],$
while UAV m flies at a fixed altitude H, and its position is represented as
$Q_m(t) = [X_m(t),\, Y_m(t),\, H].$
Accordingly, the Euclidean distance between UAV m and VE n is given by
$d_{m,n}(t) = \sqrt{(X_m(t) - X_n(t))^2 + (Y_m(t) - Y_n(t))^2 + H^2}.$
UAV m moves with velocity v m ( t ) and heading angle θ m ( t ) in each slot, and its position is updated as
$X_m(t+1) = X_m(t) + v_m(t)\, T_{\mathrm{fly}} \cos\theta_m(t), \qquad Y_m(t+1) = Y_m(t) + v_m(t)\, T_{\mathrm{fly}} \sin\theta_m(t),$
where $0 \le v_m(t) \le v_{\max}$ and $0 \le \theta_m(t) < 2\pi$.
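As an illustration, the kinematic update above can be sketched in Python; clamping the next position to the L × W service area is our assumption, since the model only states the update rule and the speed/heading bounds:

```python
import math

def update_uav_position(x, y, v, theta, t_fly, area_l, area_w):
    """Advance a UAV one slot along heading theta at speed v, then clamp
    to the L x W service area (clamping is an assumption; the paper only
    gives the kinematic update and the bounds on v and theta)."""
    x_next = x + v * t_fly * math.cos(theta)
    y_next = y + v * t_fly * math.sin(theta)
    return min(max(x_next, 0.0), area_l), min(max(y_next, 0.0), area_w)
```

With heading 0 the UAV moves purely along the x-axis, and a move that would leave the region is truncated at the boundary.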
We denote by C m , n ( t ) { 0 , 1 } the association variable indicating whether VE n is connected to UAV m at slot t.
In dense urban IoV environments, wireless links between UAVs and VEs are highly susceptible to dynamic occlusions caused by buildings, roadside infrastructures, and vehicle mobility. To capture this effect, we introduce the binary occlusion indicator N m , n ( t ) , where N m , n ( t ) = 0 denotes a line-of-sight (LoS) condition and N m , n ( t ) = 1 represents a non-line-of-sight (NLoS) condition. Rather than modeling occlusion as an abstract event, N m , n ( t ) is explicitly coupled with the physical channel condition by directly determining the effective noise power in the achievable rate expression, i.e., P LoS for LoS links and P NLoS for occluded links with P NLoS > P LoS .
The uplink transmission rate from VE n to UAV m is given by
$R_{m,n}(t) = C_{m,n}(t)\, B \log_2\!\left(1 + \dfrac{P_{n,\mathrm{tx}}\, |h_{m,n}(t)|^2\, d_{m,n}^{-\alpha}(t)}{\left(1 - N_{m,n}(t)\right) P_{\mathrm{LoS}} + N_{m,n}(t)\, P_{\mathrm{NLoS}}}\right),$
where B is the channel bandwidth, P n , tx is the transmit power of VE n , h m , n ( t ) denotes small-scale fading, and α is the path-loss exponent.
This modeling approach enables transient link blockages to be translated into instantaneous throughput degradation, which in turn affects task transmission delay and the feasibility of meeting slot-level execution deadlines. In this sense, the proposed occlusion-aware model induces an implicit dynamic deadline tightening effect: when a link becomes occluded, the reduced transmission rate increases the risk of deadline violation within a fixed time slot, especially for high-priority latency-sensitive tasks. By embedding occlusion effects directly into the rate and latency formulations, the proposed framework captures realistic channel uncertainty without introducing additional heuristic constraints.
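The occlusion-coupled rate expression can be sketched as follows; all parameter names are illustrative, and the two noise powers stand in for P_LoS and P_NLoS:

```python
import math

def uplink_rate(c, b, p_tx, h, d, alpha, occluded, p_los, p_nlos):
    """Achievable uplink rate R_{m,n}(t): the binary occlusion flag
    selects the effective noise power, so an occluded (NLoS) link with
    p_nlos > p_los yields a lower rate (illustrative parameter names)."""
    if c == 0:  # VE not associated with this UAV
        return 0.0
    noise = p_nlos if occluded else p_los
    snr = p_tx * (h ** 2) * d ** (-alpha) / noise
    return b * math.log2(1 + snr)
```

A blocked link thus translates directly into reduced throughput, which is what tightens the effective deadline in the latency model.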

3.2. Computation Model

In task offloading scheduling, the fraction of tasks offloaded from VE $n$ to UAV $m$ at slot $t$ is defined by the offloading ratio $w_{m,n}(t)$ ($0 \le w_{m,n}(t) \le 1$), which is dynamically adjusted with UAV $m$'s flight angle $\theta_m(t)$ and velocity $v_m(t)$ to improve system offloading efficiency.

3.2.1. Latency

The total task execution latency of VE n comprises three components: local computing latency, transmission latency, and edge computing latency.
The local computing latency of VE n at time slot t depends on the locally processed data volume, the required CPU cycles per unit data, and the local CPU processing capacity:
$T_{n,\mathrm{local}}(t) = \sum_{m=1}^{M} \dfrac{D_n(t)\left(1 - w_{m,n}(t)\right) C_{\mathrm{VE}}}{f_n}\, C_{m,n}(t), \quad \forall n \in \mathcal{N},\ t \in \mathcal{T},$
where D n ( t ) is the task data volume for VE n , C VE represents the CPU cycles required per unit of task for each VE, and f n is the computing capacity of VE n .
The transmission latency depends on the portion of data offloaded to UAV m and the corresponding uplink transmission rate between VE n and UAV m at time slot t:
$T_{n,\mathrm{trans}}(t) = \sum_{m=1}^{M} \dfrac{D_n(t)\, w_{m,n}(t)}{R_{m,n}(t)}\, C_{m,n}(t), \quad \forall n \in \mathcal{N},\ t \in \mathcal{T}.$
The edge computing latency reflects the time required by UAV m to execute the offloaded task portion, considering the computing capacity of UAV m and the total number of VEs served by UAV m in the current slot:
$T_{n,\mathrm{edge}}(t) = \sum_{m=1}^{M} \dfrac{D_n(t)\, w_{m,n}(t)\, C_{\mathrm{UAV}}}{f_m \big/ \sum_{i=1}^{N} C_{m,i}(t)}\, C_{m,n}(t), \quad \forall n \in \mathcal{N},\ t \in \mathcal{T},$
where f m represents the computing capacity of UAV m , and C UAV represents the CPU cycles required per unit of data for the UAV.
Since each VE n may either execute tasks locally or offload them for remote processing, its total execution latency is determined by the slower of the two computation paths:
$T_{\mathrm{VE},n}(t) = \max\!\left\{ T_{n,\mathrm{local}}(t),\ T_{n,\mathrm{trans}}(t) + T_{n,\mathrm{edge}}(t) \right\}.$
Correspondingly, the service latency of UAV m is the maximum latency among all VEs served by UAV m at time slot t:
$T_{\mathrm{UAV},m}(t) = \max_{1 \le n \le N} C_{m,n}(t)\, T_{\mathrm{VE},n}(t).$
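For a single serving UAV, the three latency components and their max-coupling might be computed as in this simplified sketch, which drops the sum over m under the unique-association assumption; all argument names are illustrative:

```python
def ve_latency(d_n, w, c_ve, f_n, rate, c_uav, f_m, n_served):
    """Per-VE latency for one serving UAV: the slower of local execution
    and the offload path (transmission + edge computing). The UAV's
    capacity f_m is shared equally among its n_served VEs."""
    t_local = d_n * (1 - w) * c_ve / f_n
    t_trans = d_n * w / rate
    t_edge = d_n * w * c_uav / (f_m / n_served)
    return max(t_local, t_trans + t_edge)
```

With w = 0 the task runs fully locally and only the local term matters; with w = 1 only the offload path (transmit plus edge compute) determines the latency.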

3.2.2. Energy Consumption

The total system energy consumption consists of the energy consumed by both VEs and UAVs.
For VE n , the transmission energy consumption is given by
$E_{n,\mathrm{trans}}(t) = P_{n,\mathrm{tx}}\, T_{n,\mathrm{trans}}(t).$
The local computing energy consumption, influenced by chip architecture, is
$E_{n,\mathrm{local}}(t) = k f_n^2 \left(1 - w_{m,n}(t)\right) D_n(t)\, C_{\mathrm{VE}},$
where k is a constant factor related to the chip architecture.
Thus, the total energy consumption of VE n is:
$E_{\mathrm{VE},n}(t) = E_{n,\mathrm{local}}(t) + E_{n,\mathrm{trans}}(t).$
For UAV m , the total energy consumption consists of three components: flight energy, hovering energy, and edge computing energy. The flight energy consumption is:
$E_{m,\mathrm{fly}}(t) = \dfrac{M_m\, g\, t_{\mathrm{fly}}\, v_m(t)}{K},$
where M m is the mass of UAV m , g is the gravitational acceleration, t fly is the flight time, and K is the lift-to-drag ratio, which characterizes the UAV’s aerodynamic performance.
The hovering energy consumption of UAV m at time slot t is:
$E_{m,\mathrm{hover}}(t) = P_m\, t_{\mathrm{hover}},$
where P m is the hovering power and t hover is the hovering time.
The edge computing energy consumption for UAV m is:
$E_{m,\mathrm{edge}}(t) = \sum_{n=1}^{N} C_{m,n}(t)\, k f_{\mathrm{UAV}}^2\, w_{m,n}(t)\, D_n(t)\, C_{\mathrm{UAV}}.$
Therefore, the total energy consumption of UAV m is:
$E_{\mathrm{UAV},m}(t) = E_{m,\mathrm{fly}}(t) + E_{m,\mathrm{edge}}(t) + E_{m,\mathrm{hover}}(t).$
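A minimal sketch of the UAV-side energy bookkeeping above, assuming a single served VE for the edge-computing term; parameter names are illustrative:

```python
def uav_energy(mass, g, t_fly, v, lift_drag, p_hover, t_hover,
               k_chip, f_uav, offloaded_bits, c_uav):
    """Total UAV energy per slot: flight + hovering + edge computing,
    mirroring E_{UAV,m}(t) with one served VE for brevity."""
    e_fly = mass * g * t_fly * v / lift_drag      # M_m g t_fly v_m / K
    e_hover = p_hover * t_hover                   # P_m t_hover
    e_edge = k_chip * f_uav ** 2 * offloaded_bits * c_uav
    return e_fly + e_hover + e_edge
```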

3.3. Task Prioritization

To address the heterogeneous QoS requirements of three typical 6G vehicular services (eMBB, URLLC, and MBRLLC), the system adopts a dual mechanism combining fixed priority quantification and reward-penalty incentives. This framework ensures preferential scheduling of high-priority tasks, with priority levels tied to service characteristics: URLLC tasks (e.g., emergency collision avoidance warnings, safety-critical) have the highest priority; MBRLLC tasks (e.g., AR navigation rendering, bandwidth-latency balanced) are secondary; eMBB tasks (e.g., in-vehicle 4K entertainment, latency-tolerant non-critical) are lowest. To implement this framework, the priority of tasks generated by VE n at time slot t (denoted Pr VE , n ( t ) ) is quantized solely by service type, formulated as:
$\mathrm{Pr}_{\mathrm{VE},n}(t) = \begin{cases} \mathrm{Pr}_{\mathrm{URLLC}} & \text{if the task belongs to URLLC;} \\ \mathrm{Pr}_{\mathrm{MBRLLC}} & \text{if the task belongs to MBRLLC;} \\ \mathrm{Pr}_{\mathrm{eMBB}} & \text{if the task belongs to eMBB,} \end{cases}$
where $\mathrm{Pr}_{\mathrm{URLLC}} > \mathrm{Pr}_{\mathrm{MBRLLC}} > \mathrm{Pr}_{\mathrm{eMBB}}$, establishing discrete hierarchical priority tiers.
To enforce these priority tiers in scheduling, the system integrates a reward-penalty mechanism: when UAV m serves VE n , a positive reward is given if the task finishes within T slot (i.e., T VE , n ( t ) T slot ), and a negative penalty for delays. Formally, UAV m ’s priority-based reward at slot t is defined as:
$\mathrm{Pr}_{\mathrm{UAV},m}(t) = \sum_{n=1}^{N} C_{m,n}(t)\, \mathrm{Pr}_{\mathrm{VE},n}(t)\, \mathbb{I}\!\left(T_{\mathrm{VE},n}(t) \le T_{\mathrm{slot}}\right),$
where $\mathbb{I}(x)$ is a signed indicator function: $\mathbb{I}(x) = 1$ if the task is completed on time and $\mathbb{I}(x) = -1$ if the task is delayed.
The discrete priority levels are introduced as an abstraction for tractable modeling and stable learning. Since the proposed framework is not restricted by priority granularity, it can be readily extended to continuous or context-dependent QoS representations.
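The signed-indicator reward can be sketched directly; the per-slot input lists below are hypothetical:

```python
def priority_reward(served, priorities, latencies, t_slot):
    """UAV priority reward Pr_{UAV,m}(t): each served task contributes
    +Pr if it finishes within the slot and -Pr if it is delayed."""
    total = 0.0
    for c, pr, t_ve in zip(served, priorities, latencies):
        if c:  # C_{m,n}(t) = 1: VE n is served by this UAV
            total += pr if t_ve <= t_slot else -pr
    return total
```

An on-time high-priority task and a delayed low-priority task partially cancel, which is exactly the incentive the reward-penalty mechanism encodes.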

3.4. User Fairness

In UAV-assisted MEC systems, ensuring user fairness involves avoiding the excessive allocation of resources to specific users while maintaining timely service for high-priority tasks, and preventing low-priority users from extended service deprivation. A fairness mechanism that balances “priority orientation” and “opportunity equilibrium” is formulated below.

3.4.1. Key Variable Definitions

Two foundational variables quantify service allocation dynamics:
  • Service availability variable s n ( t ) : This binary variable indicates whether VE n is served ( s n ( t ) = 1 if served, s n ( t ) = 0 if not). It is derived from the connection variable  C m , n ( t ) :
    $s_n(t) = \sum_{m=1}^{M} C_{m,n}(t).$
  • Cumulative service count S n ( t ) : This variable represents the total number of times VE n has been served up to time slot t, capturing the history of service allocation:
    $S_n(t) = \sum_{\tau=1}^{t} s_n(\tau).$

3.4.2. Modeling of Target Service Count

To align resource allocation with task priorities, VE n ’s target service count at slot t, denoted S n * ( t ) , represents the cumulative service it should ideally receive based on its priority. This is computed by proportionally distributing the system’s total cumulative services according to the priority of each VE:
$S_n^*(t) = \dfrac{\mathrm{Pr}_{\mathrm{VE},n}(t)}{\sum_{i=1}^{N} \mathrm{Pr}_{\mathrm{VE},i}(t)} \sum_{i=1}^{N} S_i(t),$
where $\sum_{i=1}^{N} S_i(t)$ is the system's total cumulative services up to slot $t$, and $\mathrm{Pr}_{\mathrm{VE},n}(t) \big/ \sum_{i=1}^{N} \mathrm{Pr}_{\mathrm{VE},i}(t)$ is VE $n$'s priority share. This ensures that higher-priority users have higher target service counts: if VE $A$ has double the priority of VE $B$, then $S_A^*(t) \approx 2 S_B^*(t)$.

3.4.3. Fairness Evaluation Metrics

To quantify system fairness, two metrics are introduced:
  • Deviation variance Var ( t ) : This metric measures the mean squared deviation between each user’s actual cumulative service count and its target service count, reflecting how well the system aligns with priority goals:
    $\mathrm{Var}(t) = \dfrac{1}{N} \sum_{n=1}^{N} \left( S_n(t) - S_n^*(t) \right)^2.$
  • Fairness index Fair ( t ) : This index intuitively reflects the fairness of the system at time slot t. It is defined as:
    $\mathrm{Fair}(t) = 1 - \dfrac{\mathrm{Var}(t)}{\max \mathrm{Var}(t)}.$
$\mathrm{Fair}(t) \to 1$ means that most users' actual service counts match their targets (high priority-oriented fairness), while $\mathrm{Fair}(t) \to 0$ indicates severe deviation, i.e., over-service of some users and under-service of others.
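The target counts, deviation variance, and fairness index might be computed as follows; treating the normalization constant max Var(t) as a given parameter is our assumption:

```python
def fairness_metrics(counts, priorities, max_var):
    """Deviation variance Var(t) of actual service counts against the
    priority-proportional targets S_n^*(t), and the normalized fairness
    index Fair(t) = 1 - Var(t)/max_var (max_var assumed given)."""
    total = sum(counts)
    pr_sum = sum(priorities)
    targets = [pr / pr_sum * total for pr in priorities]  # S_n^*(t)
    var = sum((s - s_t) ** 2 for s, s_t in zip(counts, targets)) / len(counts)
    return var, 1.0 - var / max_var
```

When every user's cumulative count already matches its priority share, the variance vanishes and the fairness index is 1.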

3.5. Formulation of the Optimization Problem

The decision variables of the optimization problem include four key components: $\mathbf{C} = [C_{m,n}(t)]_{M \times N \times T} \in \{0,1\}^{M \times N \times T}$, which denotes the offloading service selection of UAV $m$ for VE $n$; $\mathbf{W} = [w_{m,n}(t)]_{M \times N \times T} \in [0,1]^{M \times N \times T}$, representing the proportion of VE $n$'s tasks offloaded to UAV $m$; $\mathbf{V} = [v_m(t)]_{M \times T} \in [0, v_{\max}]^{M \times T}$, indicating the flight speed of UAV $m$; and $\boldsymbol{\Theta} = [\theta_m(t)]_{M \times T} \in [0, 2\pi]^{M \times T}$, which stands for the flight angle of UAV $m$.
The objective of the optimization problem is to regulate UAV m ’s dynamic scheduling and task offloading strategies, aiming to maximize user fairness and UAV priority rewards while minimizing system latency. The problem is formulated as
$\max_{\mathbf{C}, \mathbf{W}, \mathbf{V}, \boldsymbol{\Theta}} \sum_{t=1}^{T} \left( \beta_1 \mathrm{Fair}(t) + \beta_2 \sum_{m=1}^{M} \mathrm{Pr}_{\mathrm{UAV},m}(t) - \beta_3 \sum_{m=1}^{M} T_{\mathrm{UAV},m}(t) \right)$
subject to the following constraints:
  • Vehicle positions are confined within the service area: $0 \le X_n(t) \le L$, $0 \le Y_n(t) \le W$, $\forall n \in \mathcal{N},\ t \in \mathcal{T}$.
  • UAV flight control constraints: $0 \le \theta_m(t) \le 2\pi$, $0 \le v_m(t) \le v_{\max}$, $\forall m \in \mathcal{M},\ t \in \mathcal{T}$.
  • Task priority constraint: $\mathrm{Pr}_{\mathrm{VE},n}(t) \in \{1, 2, 3\}$, $\forall n \in \mathcal{N},\ t \in \mathcal{T}$.
  • UAV energy constraint: $\sum_{t=1}^{T} E_{\mathrm{UAV},m}(t) \le E_{\mathrm{UAV},\max}$, $\forall m \in \mathcal{M}$.
  • Unique association constraint: $\sum_{m=1}^{M} C_{m,n}(t) \le 1$, $\forall n \in \mathcal{N},\ t \in \mathcal{T}$.
  • Minimum task completion requirement: $\sum_{t=1}^{T} \sum_{n=1}^{N} C_{m,n}(t)\, w_{m,n}(t)\, D_n(t) \ge D_m$, $\forall m \in \mathcal{M}$.
  • UAV collision-avoidance constraint: $d_{m,j}(t) \ge d_{\mathrm{safe}}$, $\forall m \ne j \in \mathcal{M},\ t \in \mathcal{T}$.
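A slot-level feasibility check for the collision-avoidance and energy constraints could look like this sketch; the remaining constraints simply bound the decision variables and are assumed satisfied by construction:

```python
import math

def feasible(uav_positions, d_safe, energies, e_max):
    """Return True iff every UAV pair keeps at least d_safe separation
    and every UAV's cumulative energy stays within its budget."""
    for i in range(len(uav_positions)):
        for j in range(i + 1, len(uav_positions)):
            (xi, yi), (xj, yj) = uav_positions[i], uav_positions[j]
            if math.hypot(xi - xj, yi - yj) < d_safe:
                return False  # collision-avoidance constraint violated
    return all(e <= e_max for e in energies)  # energy constraint
```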

4. Optimization Framework and Algorithm Development

The formulated joint optimization problem is NP-hard due to the coupling of mixed-integer decision variables, non-convex objectives, and stochastic system dynamics. Task offloading, UAV resource allocation, and priority-aware fairness constraints are tightly interdependent across time slots, preventing decomposition into tractable convex subproblems. While traditional optimization techniques, such as mixed-integer programming or relaxation-based methods, may still be applicable to small-scale or static scenarios, their computational complexity grows prohibitively in large-scale and highly dynamic IoV environments. Consequently, exact or deterministic solutions become impractical.
Therefore, heuristic and learning-based approaches, such as DRL, offer an effective alternative by enabling online decision-making under uncertainty. By modeling the problem as a Markov Decision Process (MDP), the proposed MA-PF-AD3PG framework efficiently approximates high-quality solutions in large-scale and stochastic settings while maintaining scalability and adaptability.

4.1. MDP Formulation

4.1.1. State Space

At each time slot t, the state space captures the real-time environmental information of UAVs and VEs required for the DRL agent's decision-making. Specifically, for each UAV m M , the state includes the remaining battery energy E m , re ( t ) , the current position Q m ( t ) , and the remaining task volume D m , re ( t ) . For each VE n N , the state comprises the position Q n ( t ) , the task priority Pr n ( t ) , the task data volume D n ( t ) , and the occlusion status N m , n ( t ) , which indicates whether the communication link between VE n and UAV m is obstructed. The total state space has a dimension of 4 M + 5 N , which scales linearly with the number of UAVs and VEs. All state parameters are normalized to [ 0 , 1 ] to eliminate scale differences and ensure stable DRL training. At time slot t, the state for UAV m is represented as:
$s_m(t) = \left\{ E_{m,\mathrm{re}}(t),\ Q_m(t),\ D_{m,\mathrm{re}}(t),\ \left\{ Q_n(t),\ \mathrm{Pr}_n(t),\ D_n(t),\ N_{m,n}(t) \right\}_{1 \le n \le N} \right\}.$
The total state for all M UAVs at time t is given by the concatenation of the individual UAV states:
$s(t) = \{ s_1(t),\ s_2(t),\ \ldots,\ s_M(t) \}.$
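The min-max normalization of the state elements to [0, 1] can be sketched as follows, with the per-element (lo, hi) bounds assumed known from the system parameters:

```python
def normalize_state(values, bounds):
    """Min-max normalize each raw state element to [0, 1] using its known
    (lo, hi) range, as required for stable DRL training."""
    return [(v - lo) / (hi - lo) for v, (lo, hi) in zip(values, bounds)]
```

For example, a remaining energy of 5 kJ with a 10 kJ battery and a task size of 2 Mb with a 4 Mb maximum both map to 0.5.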

4.1.2. Action Space

The action space specifies the decision variables available to the DRL agent under a given state. For each UAV $m \in \mathcal{M}$, the action consists of the offloading decision $C_{m,n}(t)$, flight angle $\theta_m(t)$, flight speed $v_m(t)$, and task offloading ratio $w_{m,n}(t)$, as previously defined. These variables are constrained as follows: $C_{m,n}(t) \in \{0, 1\}$, $0 \le \theta_m(t) < 2\pi$, $0 \le v_m(t) \le v_{\max}$, and $0 \le w_{m,n}(t) \le 1$. The action for UAV $m$ is therefore given by:
$a_m(t) = \left\{ C_{m,n}(t),\ \theta_m(t),\ v_m(t),\ w_{m,n}(t) \right\}_{1 \le n \le N}.$
To satisfy the DRL model’s input requirements, the discrete variable C m , n ( t ) is one-hot encoded, and the continuous variables θ m ( t ) , v m ( t ) , and w m , n ( t ) are normalized to [ 0 , 1 ] . The overall action for all UAVs at time t is the concatenation of individual UAV actions:
$a(t) = \{ a_1(t),\ a_2(t),\ \ldots,\ a_M(t) \}.$
This action representation enables decentralized decision-making and enhances the DRL agent’s ability to efficiently handle the UAV-assisted MEC optimization problem.
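One possible flattening of a single UAV's action into a normalized vector, following the scaling described above; the exact encoding used by the authors may differ:

```python
import math

def encode_action(c_row, theta, v, w_row, v_max):
    """Flatten one UAV's action: the binary association vector C_{m,n} is
    already one-hot per VE; the angle is scaled by 2*pi and the speed by
    v_max so every continuous element lies in [0, 1]."""
    return list(c_row) + [theta / (2 * math.pi), v / v_max] + list(w_row)
```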

4.1.3. Reward Function

The reward function defines the objective signal for the DRL agent and aligns with the optimization goals in Section 3.5. It aims to maximize user fairness, prioritize task rewards, and minimize latency. Penalty terms for collision risk and energy constraints are included to maintain feasible operation. At each time slot t, the reward function is defined as:
$r(t) = \mathrm{Fair}(t) + \sum_{m=1}^{M} \left( \beta_1 \mathrm{Pr}_{\mathrm{UAV},m}(t) - \beta_2 T_{\mathrm{UAV},m}(t) - \beta_3 \mathrm{PE}_{\mathrm{UAV},m}(t) \right) - \beta_4 \mathrm{PE}_{\mathrm{col}}(t),$
where Fair ( t ) is the fairness metric calculated at time slot t. Pr UAV , m ( t ) represents the priority reward of UAV m . T UAV , m ( t ) is the latency associated with UAV m , penalizing delays that may impact the overall system performance. PE UAV , m ( t ) denotes the energy consumption penalty for UAV m . PE col ( t ) is the penalty for collision risk, ensuring that the system avoids unsafe UAV positioning. The weighted factors β 1 , β 2 , β 3 , β 4 allow for flexibility in adjusting the relative importance of each of these objectives, ensuring the reward function is tailored to different operational scenarios.
The objective of the DRL agent is to maximize the cumulative reward across the entire time horizon T:
$\max_{\mathbf{A}} \sum_{t=1}^{T} r(t).$
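The per-slot reward could be assembled as in this sketch; the beta weights and penalty inputs are placeholders chosen per deployment:

```python
def slot_reward(fair, pr_rewards, latencies, e_pens, col_pen, betas):
    """Per-slot reward r(t): fairness plus per-UAV weighted priority
    rewards, minus latency and energy penalties, minus the system-wide
    collision penalty (beta weights assumed given)."""
    b1, b2, b3, b4 = betas
    r = fair
    for pr, lat, pe in zip(pr_rewards, latencies, e_pens):
        r += b1 * pr - b2 * lat - b3 * pe
    return r - b4 * col_pen
```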

4.2. The MADDPG Algorithm

Given the system’s multiple autonomous UAVs (multi-agents) and continuous action space, this paper employs the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm to solve the optimization problem. MADDPG extends the single-agent DDPG method by incorporating a centralized training and decentralized execution paradigm. This adaptation is crucial for managing the dynamic interactions among UAVs, aligning with the system’s collaborative scheduling requirements.

4.2.1. Network Architecture

The MADDPG algorithm utilizes a dual-network structure, consisting of an Actor Network and a Critic Network for each UAV, with a shared global Critic Network to evaluate the collective actions of all agents. This design enables efficient credit assignment among UAVs and facilitates collaborative training.
1.
Actor Network and Actor Target Network
Each UAV is equipped with an independent Actor Network, which generates deterministic actions based on the UAV's local state. The output action for UAV $m$ at time $t$ is denoted as $a_m(t) = \pi_\psi(s_m(t))$, where $s_m(t)$ is the local state of UAV $m$ at time $t$, and $\pi_\psi$ is the policy function parameterized by $\psi$. The Actor Target Network performs the same function but uses the next state $s_m'(t)$ to produce the next target action $a_m'(t)$, such that $a_m'(t) = \pi_{\psi_t}(s_m'(t))$, where $\psi_t$ represents the parameters of the target network.
2. Critic Network and Critic Target Network
Unlike the single-agent DDPG, the Critic Network in MADDPG evaluates the effectiveness of the joint actions of all UAVs. It takes the global state $s(t)$ (which integrates the states of all UAVs and VEs) and the joint action $a(t)$ as inputs, and outputs the global state-action value function $Q_\phi(s(t), a(t))$, parameterized by $\phi$, which estimates the cumulative expected reward of executing the joint action. The Critic Target Network operates analogously, taking the next global state $s'(t)$ and the next joint action $a'(t)$ as inputs and outputting the next target value $Q_{\phi_t}(s'(t), a'(t))$, where $\phi_t$ denotes the parameters of the target network.

4.2.2. Experience Replay

To mitigate the temporal correlations in reinforcement learning data, MADDPG employs a replay memory $\mathcal{R}$ in which the agent stores past experiences as tuples $(s(t), a(t), r(t), s'(t), a'(t))$. During training, a mini-batch of experiences is sampled from the replay buffer to update the network parameters. This mechanism stabilizes training by letting the model learn from past experiences rather than relying solely on immediate transitions.
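A minimal sketch of such a replay memory (illustrative only, not the authors' implementation; the name `ReplayMemory` is our own) is:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer for (s, a, r, s_next, a_next) transition tuples."""

    def __init__(self, capacity):
        # deque with maxlen silently evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # of consecutive environment steps
        return random.sample(self.buffer, batch_size)

# Fill the buffer with placeholder transitions, then draw a mini-batch
memory = ReplayMemory(capacity=10000)
for t in range(100):
    memory.store((t, 0.0, 1.0, t + 1, 0.0))
batch = memory.sample(64)
```

Uniform sampling from a large buffer is what decorrelates consecutive transitions, while the `maxlen` eviction keeps memory bounded as training proceeds.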

4.2.3. Network Parameter Update

The updates for the MADDPG algorithm involve three key components: the Actor Network, the Critic Network, and their corresponding Target Networks.
1. Critic Network Update
The Critic Network is updated by minimizing the Temporal Difference (TD) error over the sampled experiences:
$$\delta_i = r_i + \gamma \, Q_{\phi_t}(s'_i, a'_i) - Q_\phi(s_i, a_i),$$
where $\gamma \in [0, 1]$ is the discount factor. The Critic's parameters $\phi$ are updated by gradient descent to minimize the TD error.
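For one sampled transition, the TD target and error reduce to a one-line computation; the scalar values below stand in for the Critic outputs and are purely illustrative:

```python
def td_error(r, gamma, q_target_next, q_current):
    """TD error: delta = r + gamma * Q_target(s', a') - Q(s, a)."""
    return r + gamma * q_target_next - q_current

# Toy values: reward 1.0, discount 0.5 (as in Table 2),
# target-Critic estimate 2.0 at (s', a'), current-Critic estimate 1.5 at (s, a)
delta = td_error(r=1.0, gamma=0.5, q_target_next=2.0, q_current=1.5)
# The Critic is trained by gradient descent on the (mean) squared TD error
loss = delta ** 2
```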
2. Actor Network Update
The Actor Network is updated using the policy gradient theorem. The objective is to maximize the expected cumulative reward $J(\pi_\psi)$ by adjusting the Actor's parameters $\psi$ as follows:
$$\nabla_\psi J(\pi_\psi) \approx \frac{1}{M} \sum_{i=1}^{M} \nabla_{a_m} Q_\phi(s_i, a_i)\big|_{a_m = \pi_\psi(s_{m,i})} \cdot \nabla_\psi \pi_\psi(s_{m,i}).$$
3. Soft Update of Target Networks
To stabilize training and avoid overfitting to recent experiences, MADDPG applies soft updates to the Target Networks:
$$\psi_t \leftarrow (1 - \tau)\,\psi_t + \tau\,\psi, \qquad \phi_t \leftarrow (1 - \tau)\,\phi_t + \tau\,\phi,$$
where τ is the soft update coefficient, typically set to a small value (e.g., τ = 0.001 ).
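The soft update is a simple Polyak average of parameter vectors; a sketch over plain Python lists (framework tensors would work the same way) is:

```python
def soft_update(target_params, main_params, tau=0.001):
    """In-place Polyak averaging: target <- (1 - tau) * target + tau * main."""
    for i in range(len(target_params)):
        target_params[i] = (1.0 - tau) * target_params[i] + tau * main_params[i]

psi_t = [0.0, 0.0]  # target Actor parameters
psi = [1.0, 2.0]    # main Actor parameters
soft_update(psi_t, psi, tau=0.5)  # large tau only to make the shift visible
# psi_t is now [0.5, 1.0]; with tau = 0.001 the target tracks psi very slowly
```

A small $\tau$ keeps the target networks slowly varying, which stabilizes the bootstrapped TD targets.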

4.3. The MA-PF-AD3PG Algorithm

We propose the Multi-Agent Priority-based Fairness Adaptive Delayed Update DDPG (MA-PF-AD3PG) algorithm to handle hybrid-variable optimization, reduce training oscillations, and accelerate convergence in multi-UAV MEC environments. The proposed algorithm enhances MADDPG by integrating discrete variable preprocessing and a delayed update mechanism, improving task scheduling efficiency and overall learning stability.

4.3.1. Discrete Service Selection Preprocessing

While MADDPG operates on continuous actions, the offloading decision variable C m , n ( t ) is discrete and cannot be directly optimized within this framework. To address this, we propose a preprocessing step that enables the UAV to select appropriate VEs as service objects based on task priority and fairness considerations.
To prevent service imbalance among VEs, the preprocessing method introduces a fairness weighting mechanism. This mechanism assigns higher selection weights to VEs with fewer connection counts than the average. The fairness weight for each VE n served by UAV m at time t is defined as:
$$P_{m,n}(t) = \frac{\Pr_{\mathrm{VE},n}(t)}{\sum_{n=1}^{J} \Pr_{\mathrm{VE},n}(t)} \left(1 - \frac{S_n(t)}{\bar{S}(t)}\, F_f\right),$$
where n = 1 J Pr VE , n ( t ) denotes the total weight of VEs with the same priority, S n ( t ) represents the cumulative connection count for VE n , and S ¯ ( t ) is the average connection count across all VEs. The fairness factor F f is a positive constant that controls the strength of fairness in service selection.
Leveraging the previously defined Fair ( t ) metric, Algorithm 1 presents a dynamic connection allocation mechanism for the vehicular network. A UAV-VE connection probability matrix is generated, embedding fairness constraints to reflect real-time service demands and historical connection patterns. Each UAV then performs probabilistic selection based on this matrix to complete VE pairing, and a conflict detection module ensures that each VE connects to at most one UAV.
This preprocessing converts the original MILP-based problem into a continuous form, simplifying optimization. The original action space of dimension 2 M + 2 M N , derived from UAV flight, connection, and offloading variables, is reduced to 2 M + M N after removing the discrete term C m , n ( t ) , significantly decreasing computational complexity.
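The dimensionality figures can be checked directly; `action_dims` below is a hypothetical helper evaluated at the simulation setup of M = 2 UAVs and N = 8 VEs:

```python
def action_dims(M, N):
    """Action-space sizes before and after removing the discrete term C_{m,n}(t)."""
    original = 2 * M + 2 * M * N  # flight (2M) + connection (MN) + offloading (MN)
    reduced = 2 * M + M * N       # connection variables handled by preprocessing
    return original, reduced

orig_dim, red_dim = action_dims(M=2, N=8)
# With two UAVs and eight VEs, 36 mixed discrete/continuous dimensions
# shrink to 20 purely continuous ones
```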
Algorithm 1 Fairness-Guaranteed Scheduling Algorithm (FGSA)
Require: Task set N; Priority mapping rule; Fairness factor F_f; Time horizon T.
Ensure: Service object selection result SelectedVE(t).
1: Initialize priority benchmarks Pr_{VE,n}(t) ∈ {Pr_URLLC, Pr_MBRLLC, Pr_eMBB}; service count variables s_n(t), S_n(t); fairness threshold θ_Fair;
2: for t = 1 to T do
3:   Calculate fairness metrics: compute the target cumulative service count S*_n(t), the deviation variance Var(t), and the fairness metric Fair(t);
4:   if Fair(t) < θ_Fair then
5:     Trigger scheduling strategy adjustment;
6:   end if
7:   for each UAV m ∈ M do
8:     Probabilistic service selection: calculate the selection weight P_{m,n}(t) via (34);
9:     Determine SelectedVE(t) by probabilistic sampling (e.g., roulette wheel selection), ensuring each VE connects to at most one UAV;
10:  end for
11:  for each VE n ∈ N do
12:    State update and iteration: update the instantaneous service count s_n(t) and the cumulative service count S_n(t);
13:  end for
14: end for
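Steps 8 and 9 can be sketched as follows. The weight expression encodes our reading of (34) (normalized priority, scaled down for above-average service counts), and the clipping floor `eps` is an added assumption to keep every weight a valid probability:

```python
import random

def fairness_weights(priorities, counts, F_f=0.5, eps=1e-6):
    """Selection weights that favor VEs served less often than average."""
    avg = sum(counts) / len(counts)           # average connection count S_bar(t)
    total_pr = sum(priorities)                # total weight of same-priority VEs
    raw = [
        (p / total_pr) * (1.0 - (c / avg) * F_f)  # our reading of Eq. (34)
        for p, c in zip(priorities, counts)
    ]
    clipped = [max(w, eps) for w in raw]      # assumption: floor to stay positive
    total = sum(clipped)
    return [w / total for w in clipped]       # normalize to probabilities

def roulette_select(weights):
    """Roulette-wheel sampling: index n is drawn with probability weights[n]."""
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

# Three same-priority VEs; VE 0 has the fewest connections, so it is favored
w = fairness_weights(priorities=[1.0, 1.0, 1.0], counts=[2, 5, 8])
chosen = roulette_select(w)
```

In a full implementation, a conflict-detection pass would then reassign any VE selected by more than one UAV, as the algorithm requires.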

4.3.2. Network Delayed Update Mechanism

The MA-PF-AD3PG algorithm employs a delayed update mechanism controlled by parameter δ , which determines the update frequency of the Actor and Critic networks. This delay mitigates training oscillations and stabilizes convergence. The delay factor is typically set to δ = 2 , implying network updates every two time steps. This choice represents a practical trade-off between training stability and responsiveness in dynamic UAV-assisted MEC environments. Specifically, updating the Actor at every timestep ( δ = 1 ) may lead to parameter oscillations due to rapidly changing value estimates, while larger delays ( δ 3 ) slow down policy adaptation to fast-varying network conditions.
Under this mechanism, the Critic main network is updated at every timestep to provide accurate and timely value estimation, whereas the Actor main network and the corresponding target networks are updated with a delay of δ = 2 . This asymmetric update strategy stabilizes multi-agent training while preserving sufficient adaptability to dynamic system states, making it well suited to the considered UAV–MEC scenario.
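The asymmetric schedule amounts to a counter check inside the training loop; the stub below only counts updates, standing in for the actual network-update calls:

```python
def run_updates(T, delta=2):
    """Count Critic vs. Actor/target updates under the delayed schedule."""
    critic_updates, actor_updates = 0, 0
    for t in range(1, T + 1):
        critic_updates += 1       # Critic main network: every timestep
        if t % delta == 0:        # Actor and target networks: every delta steps
            actor_updates += 1
    return critic_updates, actor_updates

critic_n, actor_n = run_updates(T=10, delta=2)
# Over 10 timesteps with delta = 2: 10 Critic updates, 5 Actor/target updates
```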

4.3.3. MA-PF-AD3PG Algorithm Overview

Building upon the decentralized execution and centralized training paradigm of MADDPG, the proposed MA-PF-AD3PG algorithm integrates the fairness-aware preprocessing and the delayed update mechanism introduced in Section 4.3.1 and Section 4.3.2, respectively. Each UAV’s Actor network generates local actions based on its observed state, while a shared Critic network evaluates the global state–action value using these actions and fairness-aware system states. This joint evaluation enables coordinated policy optimization among UAVs.
The proposed framework adapts efficiently to dynamic and large-scale UAV interactions through fairness-driven task assignment and adaptive policy learning. The overall workflow of the algorithm is summarized in Algorithm 2.
Algorithm 2 MA-PF-AD3PG Training Procedure
Require: System parameters M, N, T, D, B, f_UAV, f_VE, E_max, t_fly, T_tot, T_slot, v_max, P_LoS, P_NLoS
Ensure: Optimized policies {C, w, v, θ}
1: Initialization: set hyperparameters (discount factor γ; replay buffer R; batch size; learning rates α_ψ and α_ϕ; soft update coefficient τ; action exploration noise ε; delayed update parameter δ); initialize network parameters ψ, ψ_t, ϕ, ϕ_t;
2: for each episode i ∈ range(max_episodes) do
3:   Reset environment: UAV and VE positions Q_m, Q_n; remaining energy E_{rem,m} = E_max; remaining task load D_{rem,m} = D; Fair(t) = 0;
4:   for each time slot t ∈ range(T) do
5:     for each agent m ∈ range(M) do
6:       Calculate occlusion status N_{m,n}(t) based on Q_m(t), Q_n(t);
7:       Observe current state s_m(t) and obtain C_{m,n}(t) via Algorithm 1;
8:       Actor outputs action a_m(t) = π_ψ(s_m(t)) + ε;
9:       Execute the action; update E_{rem,m}(t+1), D_{rem,m}(t+1), and Q_m(t+1) via the kinematic model;
10:      Observe the next state s_m(t+1) and generate the next action a_m(t+1);
11:      if D_{rem,m}(t+1) == 0 then
12:        Set delay = 0; terminate the inner loop (time step t);
13:      else if the UAV exceeds the boundary then
14:        Set v(t) = 0 (halt UAV movement); add position violation penalty PE_pos;
15:      else if E_{rem,m}(t+1) < 0 then
16:        Set w_n(t) = 0 (stop task offloading); add energy violation penalty PE_en;
17:      else
18:        Calculate the service delay T_{UAV,m}(t) using (10);
19:      end if
20:    end for
21:    Calculate the fairness metric Fair(t) using (23);
22:    Calculate the priority reward: Pr_reward(t) = β_1 Σ_{m=1}^{M} Pr_{UAV,m}(t);
23:    Calculate the latency penalty: Lat_penalty(t) = β_2 Σ_{m=1}^{M} T_{UAV,m}(t);
24:    Calculate the violation penalty: Viol_penalty(t) = β_3 PE_en + β_4 PE_pos;
25:    Total reward: r(t) = Fair(t) + Pr_reward(t) − Lat_penalty(t) − Viol_penalty(t);
26:    Store the transition (s(t), a(t), r(t), s(t+1)) in replay buffer R;
27:    Update the Critic network parameter ϕ;
28:    if t mod δ == 0 then
29:      Update the Actor network parameter ψ;
30:      Soft-update the target network parameters ψ_t, ϕ_t;
31:    end if
32:  end for
33: end for
As shown in Figure 2, the discrete preprocessing reduces the action-space dimensionality, improving computational efficiency. The delayed update mechanism further stabilizes training by mitigating gradient oscillations in the reduced space. The collaborative multi-agent structure improves coordination among UAVs. Collectively, these components improve stability and fairness while reducing computational complexity.

4.3.4. Time Complexity of MA-PF-AD3PG

Let E denote the total number of training episodes, T the number of time steps per episode, B the batch size, δ the delayed update interval, and D Actor and D Critic the computational complexities of the Actor and Critic networks. Since network structures remain fixed, both can be approximated as a constant K.
Auxiliary operations (e.g., UAV state updates and reward computation) require $O(M)$ time per step, which is negligible compared with the network computation. The Critic network updates at each step with complexity $O(B \cdot D_{\mathrm{Critic}})$, while the Actor and Target networks update every $\delta$ steps with complexity $O(B \cdot D_{\mathrm{Actor}})$. Hence, the per-episode cost is $O\!\left(T B \left(D_{\mathrm{Critic}} + \frac{D_{\mathrm{Actor}}}{\delta}\right)\right) \approx O(TBK)$, and the overall training complexity is $O(ETB)$.
Compared with conventional MADDPG O ( E T B ( D Critic + D Actor ) ) , MA-PF-AD3PG reduces Actor updates by a factor of δ , maintaining the same asymptotic order O ( E T B ) with lower constant overhead.

5. Simulation Analysis

To evaluate the performance of the proposed MA-PF-AD3PG algorithm in UAV-assisted MEC systems for IoV scenarios, extensive simulations are conducted within a planar area of 100 m × 100 m . All UAVs operate at a fixed altitude of 100 m. Two UAVs are deployed, initially located at coordinates [ 50 , 50 , 100 ] and [ 75 , 75 , 100 ] (unit: m), respectively. The total system bandwidth is set to 1 MHz. Detailed environmental parameters are summarized in Table 1.
Regarding the network structure of MA-PF-AD3PG, the Actor network consists of three fully connected (FC) layers with 300, 400, and 300 neurons, respectively, followed by an output layer matching the action dimension. The Critic network adopts the same FC configuration as the Actor, but its output layer contains a single neuron without activation so as to output Q-values directly. The number of training iterations is set to 1000, and a minimum-variance constraint ($\mathrm{var}_{\min}$ in Table 2) is applied to regulate the decay of the exploration noise. The basic hyperparameters are summarized in Table 2.

5.1. Validation of the Priority-Based Fairness Mechanism

Jain’s fairness index is adopted to quantify the fairness of resource allocation among users, which is defined as
$$J = \frac{\left(\sum_{n=1}^{N} x_n\right)^2}{N \sum_{n=1}^{N} x_n^2},$$
where $x_n$ denotes the achieved service metric of VE $n$. The index ranges from $1/N$ to 1, with larger values indicating a more equitable resource distribution.
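A direct implementation of the index (a hypothetical helper, not from the paper's code) shows its two extremes:

```python
def jain_index(x):
    """Jain's fairness index: (sum x)^2 / (N * sum x^2), ranging in [1/N, 1]."""
    n = len(x)
    return sum(x) ** 2 / (n * sum(v * v for v in x))

perfect = jain_index([10.0, 10.0, 10.0, 10.0])    # equal allocation -> 1.0
skewed = jain_index([40.0, 0.001, 0.001, 0.001])  # near-monopoly -> close to 1/4
```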
To verify the effectiveness of the proposed priority-based fairness mechanism, three multi-agent learning strategies are evaluated under a unified AD3PG framework. The proposed algorithm, termed MA-PF-AD3PG, incorporates priority-aware fairness into the decision-making process. Two baseline algorithms are considered for comparison: MA-RA-AD3PG, where UAVs randomly select serving VEs, and MA-PR-AD3PG, which schedules VEs solely based on task priority without explicit fairness consideration. In the simulation, two UAVs and eight total VEs are considered with a total task volume of 100 Mbits. The performance comparison results are illustrated in Figure 3 and Figure 4.
Figure 3 compares the total reward performance of the three algorithms, where a higher reward indicates better overall system performance. The results show that MA-PF-AD3PG achieves the highest average reward, outperforming MA-PR-AD3PG and MA-RA-AD3PG by 14.42% and 26.40%, respectively. This demonstrates that incorporating priority-aware fairness significantly improves resource utilization efficiency.
Figure 4 presents the comparison of Jain’s fairness index. The proposed MA-PF-AD3PG achieves an average Jain index of 0.999, with the box plot tightly concentrated around 0.9988, indicating near-perfect fairness. In contrast, MA-RA-AD3PG and MA-PR-AD3PG achieve average fairness indices of 0.968 and 0.895, respectively, with the latter falling below the commonly accepted “high fairness” threshold of 0.9. These results confirm that priority-only scheduling may lead to severe service imbalance. Overall, MA-PF-AD3PG provides an optimal trade-off between system efficiency and fairness while maintaining stable convergence behavior.

5.2. Impact of the Delayed Update Mechanism

To evaluate the role of the delayed update parameter in improving convergence stability and optimization efficiency, MA-PF-AD3PG (with delayed update) is compared with MA-PF-DDPG (without delayed update) under the same setup. As shown in Figure 5, MA-PF-AD3PG achieves a significantly higher converged total reward (approximately −20, versus −80 for MA-PF-DDPG), indicating a substantially higher reward ceiling and better task performance after convergence. While both algorithms reach a minimum penalty of 0, MA-PF-AD3PG exhibits a lower average delay (2.20 vs. 2.55), confirming its advantage in delay control. These results verify that the delayed update mechanism enhances both convergence stability and overall optimization performance.

5.3. Comparison with Other DRL Algorithms

To further demonstrate the superiority of MA-PF-AD3PG, it is compared with several representative DRL algorithms under the same MA-PF framework.
DQN [25]: Value-based framework mapping state-action pairs to Q-values, quantifying expected cumulative rewards.
AC [26]: Merges value estimation and policy gradients, with experience replay and target networks boosting stability.
TD3 [27]: Advanced DDPG variant with dual Critics, using their minimum output to mitigate value overestimation.
KNN-DDPG [11]: Combines DDPG with a non-parametric K-nearest-neighbors component that exploits local neighborhood information for lightweight, training-free classification.
In the same scenario with two UAVs and eight VEs (100 Mbits total task volume), the results in Figure 6 indicate that MA-PF-AD3PG achieves the best total reward after convergence (around −57), surpassing MA-PF-AC (around −360), MA-PF-KNN, MA-PF-DDPG, and MA-PF-TD3 (around −130). Moreover, its training curve is notably smoother, maintaining the optimal reward level throughout the 1000-episode process, which demonstrates excellent learning efficiency and convergence stability in multi-agent task scheduling scenarios.
To further evaluate scalability, we consider a larger-scale setup with 16 VEs (2 high-priority, 4 medium-priority, 10 low-priority) and the same total task volume of 100 Mbits. As shown in Figure 7, although MA-AC and MA-PF-KNN converge faster initially, MA-PF-AD3PG ultimately achieves the highest and most stable reward, converging at around 350 episodes. Its final reward exceeds that of MA-PF-DDPG by over 32%, MA-PF-KNN and MA-PF-AC by more than 27%, and MA-PF-TD3 by nearly 16%, confirming its superior convergence stability in large-scale heterogeneous environments.
To validate the generalization capability of MA-PF-AD3PG under different task volumes, additional simulations are conducted using eight VEs (five low priority, two medium priority, and one high priority). Two scenarios are evaluated: Scenario 1, where each UAV serves one VE, and Scenario 2, where each UAV serves two VEs simultaneously.
In Scenario 1, task volumes are set to 60, 80, 100, 120, and 140 Mbits. As shown in Figure 8, where the upper subfigure is a bar chart with 95% confidence intervals (CI) and the lower one is a trend chart, MA-PF-AD3PG consistently outperforms all comparative algorithms across all task sizes. For instance, at 100 Mbits, its maximum total reward is approximately −105, compared to −120 for MA-PF-DDPG, reflecting a 13% advantage. The decreasing trend in reward with increasing task volume is attributed to intensified resource contention and latency penalties under higher loads, consistent with practical system behavior. Notably, MA-AC exhibits fluctuations between 80 and 100 Mbits due to unstable policy convergence caused by imbalanced priority weights.
In Scenario 2, task volumes are 80, 100, 120, 140, and 160 Mbits. As shown in Figure 9, MA-PF-AD3PG not only maintains its superiority but achieves even greater performance gains as each UAV serves more VEs. For example, at 160 Mbits, MA-PF-AD3PG reaches a maximum reward of approximately −140, while MA-PF-AC drops to about −260, indicating a 46% advantage. The same decreasing trend with increasing load is observed, owing to intensified scheduling pressure and latency accumulation.
To further evaluate the algorithm's scalability under different numbers of VEs, the maximum total reward is analyzed using combined box plots (with 95% CIs) and trend curves, as illustrated in Figure 10 for Scenario 1. As the number of VEs increases from 4 to 12, the proposed MA-PF-AD3PG algorithm consistently achieves the highest maximum total reward among all compared methods. In contrast, MA-PF-DDPG, MA-PF-TD3, and other baselines exhibit significant performance degradation and pronounced fluctuations. For example, when the number of VEs reaches 12, the reward of MA-PF-TD3 decreases to approximately −68.5, whereas that of MA-PF-AD3PG remains above −29.5, representing nearly a 57% improvement. These results clearly demonstrate the robustness and scalability of the proposed algorithm as the network scale expands.
In Scenario 2, the number of VEs is set to 4, 8, 12, 16, and 20. As shown in Figure 11, the MA-PF-AD3PG algorithm maintains its superior performance, consistently achieving higher rewards than MA-PF-AC, MA-PF-KNN, and MA-PF-TD3 across all configurations. For instance, when the number of VEs increases to 20, MA-PF-AD3PG attains a maximum total reward of approximately −29, while MA-PF-TD3 drops to around −48, corresponding to a 42% performance gain. Although minor fluctuations occur in some configurations, the overall results confirm that MA-PF-AD3PG demonstrates the best stability and adaptability among all compared algorithms. These findings are consistent with those in Scenario 1, further validating the scalability and robustness of the proposed MA-PF-AD3PG framework in large-scale UAV–MEC resource scheduling.
As the number of VEs increases, the proposed MA-PF-AD3PG maintains smooth convergence and consistent performance advantages over the baseline strategies, indicating that the learning process scales well with larger state and action spaces. Moreover, the decentralized execution with centralized training paradigm enables effective adaptation to dynamic mobility without explicit coordination among UAVs. Overall, these results demonstrate the scalability of the proposed approach and its suitability for larger-scale UAV-assisted IoV deployments.

6. Conclusions

This paper investigates optimization challenges in UAV-assisted MEC systems for 6G IoV scenarios through two main stages. First, a fairness-oriented target selection mechanism is proposed to improve user fairness in UAV service allocation. By incorporating user priority, connection frequency, fairness weighting, and probabilistic selection, the proposed approach effectively balances service fairness and system efficiency. Simulation results show that it reduces service opportunity disparities among VEs while ensuring timely execution of high-priority tasks.
Second, a dynamic multi-UAV MEC model is established considering the characteristics of 6G MBRLLC, eMBB, and URLLC services. To jointly optimize latency, fairness, and task priority, a multi-agent DRL algorithm, termed MA-PF-AD3PG, is developed. By integrating fairness-aware preprocessing and a delayed update mechanism within a centralized training and decentralized execution framework, the proposed method improves convergence stability and computational efficiency. Simulation results demonstrate that MA-PF-AD3PG consistently outperforms baseline algorithms in terms of total reward, latency reduction, and fairness, while exhibiting robust convergence and scalability.
Future work will explore advanced learning techniques, such as federated multi-agent learning and meta-adaptive policy optimization, and extend the proposed framework to multi-tier edge–cloud architectures and dynamic multi-cell UAV networks to further enhance adaptability in large-scale IoV environments.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W. and H.W.; software, H.W.; validation, Y.W. and H.Y.; writing—original draft preparation, H.W.; writing—review and editing, Y.W.; supervision, H.Y.; funding acquisition, Y.W. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Major Program, Grant No. 62595790; Grant Nos. 62173322 and 92267108), the Science and Technology Program of Liaoning Province (Grant Nos. 2023JH3/10200004 and 2022JH25/10100005), and the Youth Fund of the Education Department of Liaoning Province (Grant No. JYTQN2023356).

Data Availability Statement

The simulation codes and datasets that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tran, T.X.; Hajisami, A.; Pandey, P.; Pompili, D. Collaborative mobile edge computing in 5G networks: New paradigms, scenarios, and challenges. IEEE Commun. Mag. 2017, 55, 54–61. [Google Scholar] [CrossRef]
  2. Zhou, F.; Hu, R.Q.; Li, Z.; Wang, Y. Mobile edge computing in unmanned aerial vehicle networks. IEEE Wirel. Commun. 2020, 27, 140–146. [Google Scholar] [CrossRef]
  3. Mao, Y.; You, C.; Zhang, J.; Huang, K.; Letaief, K.B. A survey on mobile edge computing: The communication perspective. IEEE Commun. Surv. Tutor. 2017, 19, 2322–2358. [Google Scholar] [CrossRef]
  4. Zhou, F.; Wu, Y.; Hu, R.Q.; Qian, Y. Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems. IEEE J. Sel. Areas Commun. 2018, 36, 1927–1941. [Google Scholar] [CrossRef]
  5. Wang, Y.T.; Wang, H.; Ding, J.F.; Yu, H.B. From latency bottlenecks to seamless edge: AD3PG-powered joint optimization of UAV trajectory and task offloading. Comput. Netw. 2025, 272, 111700. [Google Scholar] [CrossRef]
  6. Jeremiah, S.R.; Yang, L.T.; Park, J.H. Digital twin-assisted resource allocation framework based on edge collaboration for vehicular edge computing. Future Gener. Comput. Syst. 2024, 150, 243–254. [Google Scholar] [CrossRef]
  7. Wang, F.; Zhang, S.; Hong, E.-K.; Quek, T.Q.S. Constellation as a Service: Tailored Connectivity Management in Direct-Satellite-to-Device Networks. IEEE Commun. Mag. 2025, 63, 30–36. [Google Scholar] [CrossRef]
  8. Ghosh, S.; Kuila, P. Efficient offloading in disaster-affected areas using unmanned aerial vehicle-assisted mobile edge computing: A gravitational search algorithm-based approach. Int. J. Disaster Risk Reduct. 2023, 97, 104067. [Google Scholar] [CrossRef]
  9. Ghosh, S.; Kuila, P.; Bey, M.; Azharuddin, M. Quantum-inspired gravitational search algorithm-based low-price binary task offloading for multi-users in unmanned aerial vehicle-assisted edge computing systems. Expert Syst. Appl. 2024, 263, 125762. [Google Scholar] [CrossRef]
  10. Liu, Y.; Yan, J.; Zhao, X. Deep reinforcement learning based latency minimization for mobile edge computing with virtualization in maritime UAV communication network. IEEE Trans. Veh. Technol. 2022, 271, 4225–4236. [Google Scholar] [CrossRef]
  11. Lu, Y.R.; Xu, C.; Wang, Y.T. Joint computation offloading and trajectory optimization for edge computing UAV: A KNN-DDPG algorithm. Drones 2024, 8, 564. [Google Scholar] [CrossRef]
  12. Li, J.; Sun, G.; Duan, L.; Wu, Q. Multi-Objective Optimization for UAV Swarm-Assisted IoT with Virtual Antenna Arrays. IEEE Trans. Mobile Comput. 2024, 23, 4890–4907. [Google Scholar] [CrossRef]
  13. Pang, S.; Wang, L.; Gui, H.; Qiao, S.; He, X.; Zhao, Z. UAV-IRS-assisted energy harvesting for edge computing based on deep reinforcement learning. Future Gener. Comput. Syst. 2025, 163, 107527. [Google Scholar] [CrossRef]
  14. Hu, Q.; Cai, Y.; Yu, G.; Qin, Z.; Zhao, M.; Li, G.Y. Joint offloading and trajectory design for UAV-enabled mobile edge computing systems. IEEE Internet Things J. 2019, 6, 1879–1892. [Google Scholar] [CrossRef]
  15. Xve, K.; Zhai, L.B.; Li, Y.M.; Lu, Z.K.; Zhou, W.J. Task offloading and multi-cache placement based on DRL in UAV-assisted MEC networks. Veh. Commun. 2025, 53, 100900. [Google Scholar] [CrossRef]
  16. Wang, L.; Zhang, X.; Qin, K.; Wang, Z.; Yin, H.; Zhou, J.; Song, D. Dynamic Trajectory Control and User Association for Unmanned-Aerial-Vehicle-Assisted Mobile Edge Computing: A Deep Reinforcement Learning Approach. Drones 2025, 9, 367. [Google Scholar] [CrossRef]
  17. Yang, Y.L.; Xu, H.; Jin, Z.; Song, T.C.; Hu, J.; Song, X.Q. RS-DRL-based offloading policy and UAV trajectory design in F-MEC systems. Digit. Commun. Netw. 2025, 11, 377–386. [Google Scholar] [CrossRef]
  18. Wang, C.; Liu, K.; Yuan, Y.; Peng, S.C.; Li, G.R. Joint trajectory and offloading optimization in UAV-assisted MEC via federated multi-agent reinforcement learning and potential fields. Comput. Netw. 2025, 272, 111681. [Google Scholar] [CrossRef]
  19. Shen, F.F.; Yang, B.F.; Zhang, J.; Xu, C.; Chen, Y.; He, Y.X. TD3-based trajectory optimization for energy consumption minimization in UAV-assisted MEC system. Comput. Netw. 2024, 255, 110882. [Google Scholar] [CrossRef]
  20. Zheng, X.D.; Wu, Y.X.; Zhang, L.H.; Tang, M.B.; Zhu, F.S. Priority-aware path planning and user scheduling for UAV-mounted MEC networks: A deep reinforcement learning approach. Phys. Commun. 2024, 62, 102234. [Google Scholar] [CrossRef]
  21. Du, Y.; Wang, K.; Yang, K.; Zhang, G. Energy-efficient resource allocation in UAV based MEC system for IoT devices. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, United Arab Emirates, 9–13 December 2018; pp. 1–6. [Google Scholar] [CrossRef]
  22. Huang, J.; Zhang, M.; Wan, J.; Chen, Y.; Zhang, N. Joint data caching and computation offloading in UAV-assisted internet of vehicles via federated deep reinforcement learning. IEEE Trans. Veh. Technol. 2024, 73, 17644–17656. [Google Scholar] [CrossRef]
  23. Hu, Z.; Yang, Y.; Gu, W.; Chen, Y.; Huang, J. DRL-based trajectory optimization and task offloading in hierarchical aerial MEC. IEEE Internet Things J. 2025, 12, 3410–3423. [Google Scholar] [CrossRef]
  24. Liu, J.; Wang, Y.; Pan, D.; Yuan, D. QoS-aware task offloading and resource allocation optimization in vehicular edge computing networks via MADDPG. Comput. Netw. 2024, 242, 110282. [Google Scholar] [CrossRef]
  25. Wu, Y.C.; Dinh, T.Q.; Fu, Y.; Lin, C.; Quek, T.Q.S. A Hybrid DQN and Optimization Approach for Strategy and Resource Allocation in MEC Networks. IEEE Trans. Wirel. Commun. 2021, 20, 4282–4295. [Google Scholar] [CrossRef]
  26. Cao, Y.; Wang, H.; Li, D.; Zhang, G. Smart Online Charging Algorithm for Electric Vehicles via Customized Actor–Critic Learning. IEEE Internet Things J. 2022, 9, 684–694. [Google Scholar] [CrossRef]
  27. Luo, X.; Wang, Q.; Gong, H.; Tang, C. UAV Path Planning Based on the Average TD3 Algorithm with Prioritized Experience Replay. IEEE Access 2024, 12, 38017–38029. [Google Scholar] [CrossRef]
Figure 1. System Model.
Figure 1. System Model.
Drones 10 00009 g001
Figure 2. Network Architecture of MA-PF-AD3PG.
Figure 2. Network Architecture of MA-PF-AD3PG.
Drones 10 00009 g002
Figure 3. Total reward comparison of MA-PF-AD3PG, MA-PR-AD3PG, and MA-RA-AD3PG.
Figure 3. Total reward comparison of MA-PF-AD3PG, MA-PR-AD3PG, and MA-RA-AD3PG.
Drones 10 00009 g003
Figure 4. Jain’s fairness index comparison of MA-PF-AD3PG, MA-PR-AD3PG, and MA-RA-AD3PG.
Figure 4. Jain’s fairness index comparison of MA-PF-AD3PG, MA-PR-AD3PG, and MA-RA-AD3PG.
Drones 10 00009 g004
Figure 5. Comparison of Reward, Penalty, and Delay with and without Delayed Update.
Figure 5. Comparison of Reward, Penalty, and Delay with and without Delayed Update.
Drones 10 00009 g005
Figure 6. Comparison of Training Reward Values (VE Number = 8).
Figure 6. Comparison of Training Reward Values (VE Number = 8).
Drones 10 00009 g006
Figure 7. Comparison of Training Reward Values (VE Number = 16).
Figure 7. Comparison of Training Reward Values (VE Number = 16).
Drones 10 00009 g007
Figure 8. Box and Trend Plots of Maximum Reward Values Under Different Task Sizes (Scenario 1).
Figure 8. Box and Trend Plots of Maximum Reward Values Under Different Task Sizes (Scenario 1).
Drones 10 00009 g008
Figure 9. Box and Trend Plots of Maximum Reward Values Under Different Task Sizes (Scenario 2).
Figure 9. Box and Trend Plots of Maximum Reward Values Under Different Task Sizes (Scenario 2).
Drones 10 00009 g009
Figure 10. Box and Trend Plots of Maximum Total Rewards Under Varying VE Numbers (Scenario 1).
Figure 11. Box and Trend Plots of Maximum Total Rewards Under Varying VE Numbers (Scenario 2).
Table 1. Environmental Simulation Parameters.
Parameter | Value | Unit
Noise Power (LOS) | −100 | dB
Noise Power (NLOS) | −80 | dB
VE Computing Frequency | 0.2 | GHz
UAV Computing Frequency | 1.2 | GHz
CPU Cycles per Bit | 1000 | cycles/bit
Uplink Transmission Power | 0.1 | W
Reference Channel Gain (1 m) | −50 | dB
UAV Weight | 9.65 | kg
UAV Battery Capacity | 500 | kJ
Maximum UAV Flight Speed | 20 | m/s
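The computing parameters in Table 1 imply the usual MEC computation-latency model t = D·C / f, where D is the task size in bits, C the CPU cycles per bit, and f the CPU frequency. A minimal sketch under that assumption (the 1 Mbit task size is an illustrative value, not one taken from the paper):

```python
CYCLES_PER_BIT = 1000  # C: CPU cycles required per bit (Table 1)
F_VE = 0.2e9           # f_VE: VE computing frequency in Hz (0.2 GHz)
F_UAV = 1.2e9          # f_UAV: UAV computing frequency in Hz (1.2 GHz)

def comp_delay(bits, freq_hz):
    """Computation latency in seconds: total required cycles / CPU frequency."""
    return bits * CYCLES_PER_BIT / freq_hz

task_bits = 1e6  # illustrative 1 Mbit task
print(comp_delay(task_bits, F_VE))   # 5.0 s computed locally on the VE
print(comp_delay(task_bits, F_UAV))  # ~0.83 s offloaded to the UAV
```

The 6× frequency gap is what makes offloading attractive whenever the added transmission delay stays below the local-computation saving.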
Table 2. Neural Network HyperParameters.
Parameter | Value
Maximum Episodes (max_episode) | 1000
Actor Network Learning Rate (α_ψ) | 0.001
Critic Network Learning Rate (α_ϕ) | 0.002
Discount Factor (γ) | 0.5
Soft Update Coefficient (τ) | 0.01
Minimum Variance (var_min) | 0.01
Replay Buffer Size (R) | 10,000
Batch Size (B) | 64
Delayed Update Parameter (δ) | 2
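The soft update coefficient τ = 0.01 and delayed update parameter δ = 2 in Table 2 correspond to the familiar TD3-style schedule: critics are trained every step, while targets (and typically the actor) are refreshed only every δ steps via Polyak averaging, θ′ ← τθ + (1 − τ)θ′. The sketch below illustrates that schedule only; it is not the authors' implementation, and the scalar "parameters" stand in for real network weights:

```python
import copy

TAU = 0.01   # soft update coefficient τ (Table 2)
DELTA = 2    # delayed update parameter δ: target refresh every DELTA steps

def soft_update(target_params, source_params, tau=TAU):
    """Polyak averaging: theta_target <- tau*theta + (1 - tau)*theta_target."""
    for k in target_params:
        target_params[k] = tau * source_params[k] + (1 - tau) * target_params[k]

critic = {"w": 1.0}                 # stand-in for critic network weights
target = copy.deepcopy(critic)      # target network starts as a copy
for step in range(1, 5):
    critic["w"] += 0.1              # stand-in for one gradient step
    if step % DELTA == 0:           # delayed: refresh target every DELTA steps
        soft_update(target, critic)
```

Because τ is small, the target network trails the online critic slowly, which damps the overestimation feedback loop; the δ-step delay further decouples the two, which is what Figure 5 compares against the undelayed variant.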

Citation: Wang, Y.; Wang, H.; Yu, H. MA-PF-AD3PG: A Multi-Agent DRL Algorithm for Latency Minimization and Fairness Optimization in 6G IoV-Oriented UAV-Assisted MEC Systems. Drones 2026, 10, 9. https://doi.org/10.3390/drones10010009
