Article

Cooperative Schemes for Joint Latency and Energy Consumption Minimization in UAV-MEC Networks †

by Ming Cheng 1,*, Saifei He 1, Yijin Pan 2, Min Lin 1 and Wei-Ping Zhu 3
1 School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2 National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China
3 Department of Electrical and Computer Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in He, S.; Cheng, M.; Pan, Y.; Lin, M.; Zhu, W.-P. Distributed access and offloading scheme for multiple UAVs assisted MEC network. In Proceedings of the IEEE 98th Vehicular Technology Conference (VTC2023-Fall), Hong Kong, China, 10–13 October 2023.
Sensors 2025, 25(17), 5234; https://doi.org/10.3390/s25175234
Submission received: 8 July 2025 / Revised: 16 August 2025 / Accepted: 18 August 2025 / Published: 22 August 2025

Abstract

The Internet of Things (IoT) has promoted emerging applications that require massive device collaboration, heavy computation, and stringent latency guarantees. Unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) systems can provide flexible services for user devices (UDs) with wide coverage. The joint optimization of latency and energy consumption remains a critical yet challenging task due to the inherent trade-off between them. Joint association, offloading, and computing resource allocation are essential to achieving satisfactory system performance. However, this joint optimization is difficult due to the highly dynamic environment and the exponentially increasing complexity of large-scale networks. To address these challenges, we introduce a carefully designed cost function to balance the latency and the energy consumption, formulate the joint problem as a partially observable Markov decision process, and propose two multi-agent deep-reinforcement-learning-based schemes to tackle the long-term problem. Specifically, the multi-agent proximal policy optimization (MAPPO)-based scheme uses centralized learning and decentralized execution, while the closed-form enhanced multi-armed bandit (CF-MAB)-based scheme decouples association from offloading and computing resource allocation. In both schemes, UDs act as independent agents that learn from environmental interactions and historical decisions, make decisions to maximize their individual reward functions, and achieve implicit collaboration through the reward mechanism. The numerical results validate the effectiveness and show the superiority of our proposed schemes. The MAPPO-based scheme enables collaborative agent decisions for high performance in complex dynamic environments, while the CF-MAB-based scheme supports independent, rapid-response decisions.

1. Introduction

Wireless networks have evolved over time into the Internet of Things (IoT), which has promoted emerging applications including autonomous driving, smart cities, virtual reality, and smart healthcare [1,2]. These applications generally rely on the cooperation of a large number of mobile devices, bring an exponentially increasing computation load, and require stringent latency guarantees. However, user devices (UDs) are equipped with limited computing capability and battery capacity, meaning that they struggle to achieve satisfactory performance in these emerging scenarios. Mobile edge computing (MEC) brings high-performance servers to the network edge and provides efficient computing services for UDs [3,4]. MEC is thus a promising technology to address the latency-sensitive and localized processing requirements of future networks.
The terrestrial network infrastructure has limited coverage, with some dead zones, meaning that it cannot provide sufficient computing and communication services, especially in remote, rural, or disaster-prone areas. Non-terrestrial networks are considered enablers of a reliable and cost-effective solution for continuous and ubiquitous wireless coverage [5,6]. Unmanned aerial vehicles (UAVs) are widely used in current communication networks due to their flexible movement, easy deployment, and line-of-sight (LoS) connections [7,8]. It is effective and economical to deploy swarms of UAVs to form scalable networks. UAVs act as flying edge nodes that can provide computing, caching, and analytic services for UDs on demand. Consequently, UAV-assisted MEC has become an ideal solution for scenarios where terrestrial infrastructure is unavailable, insufficient, or impractical [9,10].

1.1. Related Works

UAV-assisted MEC mainly aims to reduce energy consumption and latency. Effective and efficient resource allocation is key to achieving high performance. Most studies in this area concern energy consumption [11,12,13,14,15,16], latency [17,18,19,20,21,22], or both [23,24,25,26,27,28,29]. Latency and energy consumption are inherently conflicting, meaning that cost functions are generally introduced as a trade-off. The cost can be defined as the weighted sum of latency and energy consumption [24,27], the minimum value of biased metrics [28], or a specific application objective [29]. High system performance can be achieved through the joint optimization of task offloading, communication and computing resource allocation, UAVs’ trajectories, and device association.
Convex optimization methods, such as block coordinate descent (BCD) and successive convex approximation (SCA), are used to solve these non-trivial joint problems. To reduce the energy consumption of a UAV-assisted MEC system, Wang et al. [15] divided the joint optimization of device association, UAVs’ trajectories, task offloading, and resource allocation into three sub-problems, transformed the non-convex trajectory optimization problem into a convex one through SCA, and then used the BCD method to optimize each set of control variables iteratively. To minimize the average response delay in a UAV-assisted MEC system, Zhang et al. [22] also divided the joint optimization problem into three sub-problems of UAV deployment, user access, and task offloading, used SCA to transform these sub-problems into convex problems, and applied BCD to solve them and obtain a near-optimal solution. To minimize the delay, reduce the energy consumption, and maximize the offloaded tasks simultaneously, Sun et al. [28] formulated a multi-objective joint problem, divided it into three sub-problems, and then solved them by using the BCD method. The above studies usually assume that the task arrival rate and channel state are known or can be predicted from historical data. These optimization schemes only aim to obtain short-term optimal solutions based on deterministic conditions.
It is difficult to predict the task arrival and the channel state accurately in many practical scenarios. Buffers are generally used to store the stochastic tasks. Then, the system stability, the queue backlog, and the task completion rate become important and the UAV-assisted MEC system should be investigated from a long-term perspective. The Lyapunov method is one of the most effective solutions to ensure the long-term stability and to maximize the utility [30,31,32,33,34]. Specifically, the Lyapunov method introduces a drift-plus-penalty function that balances the utility and the queue backlog. Then, the original long-term stochastic problem is converted into an online sequence of deterministic problems in each timeslot. These short-term problems minimize the upper bound of the drift-plus-penalty function instead and can be solved by using convex optimization. Zhang et al. [30] minimized the average long-term energy consumption in a UAV-assisted MEC system by jointly optimizing the task offloading, the resource allocation, and the trajectory. Liu et al. [32] investigated the long-term problem in multiple-UAV-assisted MEC systems and took the timeslot scheduling into consideration. Zhao et al. [34] further considered the task caching and migration in multiple-UAV-assisted MEC systems and maximized the long-term average service quantity.
In conventional convex optimization-based and Lyapunov-based methods, the original highly coupled problem is transformed into several approximate convex problems and a locally optimal solution is obtained by means of an iterative process and alternating optimization. These methods require complete system information and heavy interaction, and incur exponentially increasing complexity, making them unsuitable for practical scenarios with high dynamics or a large scale. Fortunately, deep reinforcement learning (DRL)-based schemes have shown significant advantages in solving problems with high-dimensional dynamic state spaces, continuous or discrete action spaces, and non-linear relationships. DRL-based schemes formulate the joint long-term optimization problem in dynamic environments as a Markov decision process (MDP) [35,36,37,38,39] or a partially observable MDP (POMDP) [40,41,42,43,44,45]. Then, one or multiple agents learn from environmental and historical information, update their strategies, and make decisions.
DRL has been applied in UAV-assisted MEC networks [10]. Specifically, to minimize the latency in UAV-assisted MEC systems with caching, Zhang et al. [35] used the actor–critic (AC) architecture to optimize the UAVs’ motion, communication and computing resource allocation, and caching decisions simultaneously. To balance the energy consumption and the latency in multi-UAV-assisted vehicular edge computing systems, Li et al. [36] proposed a double-deep Q-network-based algorithm to optimize the vehicle selection and the resource allocation jointly. Seid et al. [37] proposed a deep deterministic policy gradient (DDPG)-based algorithm to optimize the association, offloading, and computing resources jointly. He et al. [39] further proposed a multi-agent DDPG (MADDPG)-based algorithm to optimize the trajectories of UAVs, while power and offloading are determined by conventional optimization. In most practical scenarios, each agent can only obtain its local observation instead of the global state. Yan et al. [40], Liu et al. [41], and Wang et al. [42] formulated the long-term optimization as a POMDP and applied the MADDPG scheme to decide the trajectories of UAVs. Though the above schemes excel in specific scenarios, they show inferior performance regarding adaptability and stability. DQN is limited to discrete action spaces and DDPG is designed for continuous action spaces. AC is adaptable to both but suffers from unstable training and poor sample efficiency. Proximal policy optimization (PPO) [46,47,48], which is particularly well-suited for high-dimensional action spaces with both continuous and discrete actions, has been successfully applied in UAV-assisted MEC networks [38,43,44,45]. Specifically, Wang et al. [38] proposed a PPO-based algorithm to optimize the continuous movement of UAVs and the offloading proportions jointly. Kang et al. [43] further applied PPO to scenarios with partial observation and proposed a multi-agent PPO (MAPPO)-based scheme to optimize the offloading proportion, power control, and computing frequency allocation simultaneously.
The related works are summarized in Table 1.

1.2. Motivations and Contributions

UAV-assisted MEC can significantly improve the quality of experience in the IoT. Multi-agent DRL (MADRL)-based methods have shown great potential to improve the efficiency and effectiveness of joint resource allocation. Though MADRL-based methods are generally superior to convex optimization-based and reinforcement learning (RL)-based methods in their target metrics, the specific advantages of different MADRL schemes merit further investigation.
This paper investigates a multi-UAV-assisted MEC system in which each UAV is equipped with an MEC server to offer computing services and each UD decides its serving UAV, task offloading proportion, and required computing resources. The network is highly dynamic due to the movement of UAVs and time-varying channel state. We aim to minimize the latency and energy consumption by optimizing the association, task offloading, and computing resource allocation simultaneously. Our previous work [49] proposed an MAB-based distributed scheme to tackle the joint long-term problem. This work significantly extends our previous research with the following contributions:
  • We introduce cost functions to balance the latency and energy consumption for both the whole system and individual UDs. Then, we formulate a long-term cost minimization problem that has discrete association constraints and continuous offloading and computing frequency constraints. We further formulate this long-term problem into a POMDP and propose a cooperative multi-agent DRL framework. All UDs are agents and each of them makes its own decision based on its local observation and individual reward function.
  • We propose a MAPPO-based scheme that adopts centralized training and decentralized execution to tackle the POMDP. In the training stage, the global observations and the system reward function are used to train the actor and the critic networks for all UDs. In the execution stage, each UD agent uses its local observation and individual reward function for decision-making and network updates. Simulation results validate the MAPPO-based scheme and show its superiority in reducing system cost.
  • We decouple the association from the task offloading and the computing resource allocation and propose a lightweight scheme based on closed-form enhanced multi-armed bandit (CF-MAB). In the CF-MAB-based scheme, each UD agent selects its association to maximize its long-term achievable rate and then the optimal offloading and computing resource allocation can be obtained in closed form given the association. Simulation results validate the CF-MAB-based scheme and show its superiority regarding its complexity and task completion rate.
The remainder of this paper is organized as follows. Section 2 describes the system model and formulates the long-term problem. Section 3 presents the MADRL framework, the MAPPO-based scheme, and the MAB-based scheme in detail. Simulation results and discussions are presented in Section 4. Finally, Section 5 provides some concluding remarks.
Some key notations are listed in Table 2.

2. System Model

Consider a multiple-UAV-assisted MEC system that consists of N UAVs and M (M > N) UDs as shown in Figure 1. We denote the sets of UDs and UAVs by \mathcal{M} = \{1, 2, \ldots, M\} and \mathcal{N} = \{1, 2, \ldots, N\}, respectively. The UDs are randomly located and the UAVs fly along pre-determined trajectories. Each UAV is equipped with an MEC server and provides computing services for UDs. Each UD decides its own association and offloading strategy.

2.1. Association

At each timeslot, each UD associates with a suitable UAV. If multiple UDs associate with the same UAV, frequency division multiple access (FDMA) is adopted to avoid interference. The association between UDs and UAVs is indicated by a_{m,n,t}. If UD m associates with UAV n in timeslot t, a_{m,n,t} = 1; otherwise, a_{m,n,t} = 0. Each UD can only associate with one UAV at a time, meaning that we have \sum_{n\in\mathcal{N}} a_{m,n,t} = 1.

2.2. Communication Model

The UAVs move continuously and the transmission links between UDs and UAVs can be blocked by buildings or trees. We assume that the state of each transmission link is determined by the environment and the elevation angle between UDs and UAVs [50]. The transmission link between UD m and UAV n is line-of-sight (LOS) with probability p_{m,n}^{LOS} and non-line-of-sight (NLOS) with probability p_{m,n}^{NLOS} = 1 - p_{m,n}^{LOS}.
The channel gain of transmission links is determined by large-scale fading and small-scale fading. The channel gain between UAV n and UD m is obtained as
g_{m,n} = \begin{cases} \left(\frac{4\pi f_c}{c}\right)^{-2} \mu_L \, d_{m,n}^{-\beta_L} \left| h_{m,n}^{\mathrm{Rice}} \right|^2, & \mathrm{LOS} \\[4pt] \left(\frac{4\pi f_c}{c}\right)^{-2} \mu_N \, d_{m,n}^{-\beta_N} \left| h_{m,n}^{\mathrm{Rayleigh}} \right|^2, & \mathrm{NLOS} \end{cases} \quad (1)
where f_c is the carrier frequency; c is the speed of light; \beta_L (\beta_N) and \mu_L (\mu_N) are the path loss exponent and the attenuation factor for LOS (NLOS) links, respectively; d_{m,n} is the distance between UD m and UAV n; and h_{m,n}^{\mathrm{Rice}} and h_{m,n}^{\mathrm{Rayleigh}} are the Rician and Rayleigh small-scale fading coefficients, respectively.
Assume that the channel gains are constant during the channel coherence time interval. The signal-to-noise ratio (SNR) between UD m and UAV n in timeslot t is
\gamma_{m,n,t} = \frac{P \, g_{m,n,t}}{\sigma^2} \quad (2)
where P is the transmission power of the UD, g_{m,n,t} is the channel gain, and \sigma^2 is the noise power.
Let M_{n,t} denote the number of UDs that access UAV n in timeslot t. Then, the maximum achievable transmission rate between UD m and UAV n in timeslot t is
r_{m,n,t} = \frac{B}{M_{n,t}} \log_2 \left(1 + \gamma_{m,n,t}\right) \quad (3)
where B is the total bandwidth.
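As a concrete illustration of (1)–(3), the following Python sketch evaluates the LOS channel gain, the SNR, and the FDMA-shared rate for one link. The distance, carrier frequency, fading coefficient, and number of co-served UDs are hypothetical values chosen for illustration, not the settings of Section 4.

```python
import numpy as np

C_LIGHT = 3e8  # speed of light (m/s)

def channel_gain_los(d, fc=2e9, mu_L=1.0, beta_L=2.0, h_rice=1.0):
    # LOS branch of (1): reference gain (4*pi*fc/c)^-2, attenuation mu_L,
    # path loss d^-beta_L, and Rician small-scale fading |h|^2.
    return (4 * np.pi * fc / C_LIGHT) ** (-2) * mu_L * d ** (-beta_L) * abs(h_rice) ** 2

def achievable_rate(g, P=0.35, noise_dbm_hz=-174.0, B=10e6, M_n=3):
    # SNR (2) with transmit power P over thermal noise, then the FDMA-shared
    # rate (3): the bandwidth B is divided equally among the M_n served UDs.
    sigma2 = 10 ** (noise_dbm_hz / 10) * 1e-3 * B  # noise power in W
    gamma = P * g / sigma2
    return (B / M_n) * np.log2(1 + gamma)

g = channel_gain_los(d=400.0)  # hypothetical UD 400 m from its serving UAV
print(f"rate = {achievable_rate(g) / 1e6:.2f} Mbit/s")
```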

2.3. Computation Model

At each timeslot, each UD generates computing tasks with a random size. D_{m,t} denotes the size of the tasks generated by UD m at timeslot t. All UDs adopt the partial offloading policy and determine the offloading proportion. The proportion of tasks offloaded to the MEC server is denoted by \alpha_{m,t} \in [0,1], and (1 - \alpha_{m,t}) is the proportion of tasks that are locally executed.

2.3.1. Edge Computing

The UAV-assisted computing can be divided into three steps: task offloading, edge computing, and result downloading. The UDs first offload their tasks to their serving UAVs; the UAVs then compute the received tasks and return the computation results to the UDs. The computation results obtained by the MEC server are usually small, so the result downloading time is short and can be ignored [17]. The delay in MEC thus consists of two parts: the transmission delay and the computation delay.
We neglect the result downloading delay, meaning that the transmission delay equals the task offloading time. The task offloading time from UD m to UAV n at timeslot t can be expressed as
\tau_{m,n,t}^{\mathrm{trans}} = \frac{\alpha_{m,t} D_{m,t}}{r_{m,n,t}} \quad (4)
The energy consumption observed when UD m offloads its task to UAV n is given by
E_{m,n,t}^{\mathrm{trans}} = P \, \tau_{m,n,t}^{\mathrm{trans}}. \quad (5)
The computing time at the MEC server can be expressed as
\tau_{m,n,t}^{\mathrm{comp}} = \frac{\alpha_{m,t} D_{m,t} s}{f_{m,n,t}} \quad (6)
where s is the number of CPU cycles required to process one bit and f_{m,n,t} denotes the computing capacity of UAV n allocated to UD m. The energy consumption of UAV n to compute the offloaded task is given by
E_{m,n,t}^{\mathrm{comp}} = \kappa_1 f_{m,n,t}^{2} \, \alpha_{m,t} D_{m,t} s \quad (7)
where \kappa_1 is a constant that depends on the CPU of the UAV-assisted MEC server.
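A minimal sketch of the edge-computing part (4)–(7) is given below; the task size, achievable rate, and allocated CPU frequency are hypothetical values used only for illustration.

```python
def edge_cost(alpha, D, r, f_uav, P=0.35, s=1e3, kappa1=1e-27):
    # (4) offloading delay, (5) offloading energy,
    # (6) edge-computing delay, (7) edge-computing energy.
    t_trans = alpha * D / r
    e_trans = P * t_trans
    t_comp = alpha * D * s / f_uav
    e_comp = kappa1 * f_uav ** 2 * alpha * D * s
    return t_trans + t_comp, e_trans + e_comp

# Hypothetical example: offload 80% of a 1-kbit task at 40 Mbit/s to a 1.5-GHz server.
delay, energy = edge_cost(alpha=0.8, D=1e3, r=40e6, f_uav=1.5e9)
print(f"edge delay = {delay * 1e3:.3f} ms, edge energy = {energy * 1e3:.3f} mJ")
```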

2.3.2. Local Computing

The local computation delay of UD m at timeslot t can be expressed as
\tau_{m,t}^{\mathrm{local}} = \frac{(1 - \alpha_{m,t}) D_{m,t} s}{f_u} \quad (8)
where f_u denotes the computing capability of the UD. We assume that all UDs have a predetermined and fixed computing capability.
The energy consumption of the UD computing the local task can be obtained as
E_{m,t}^{\mathrm{local}} = \kappa_2 f_u^{2} (1 - \alpha_{m,t}) D_{m,t} s \quad (9)
where \kappa_2 is a constant that depends on the CPU of the UD.

2.4. System Energy Consumption and Latency

The energy consumption to complete UD m’s task in timeslot t comes from the local computing and the edge computing, which can be obtained as
E_{m,t} = \underbrace{E_{m,t}^{\mathrm{local}}}_{\mathrm{local\ computing\ (UD)}} + \underbrace{\sum_{n\in\mathcal{N}} a_{m,n,t}\left(E_{m,n,t}^{\mathrm{trans}} + E_{m,n,t}^{\mathrm{comp}}\right)}_{\mathrm{edge\ computing\ (UD\ and\ UAV)}}. \quad (10)
The total energy consumption for task computing in timeslot t is
E_t = \sum_{m\in\mathcal{M}} E_{m,t}. \quad (11)
The latency to complete UD m’s task in timeslot t can be obtained as
\tau_{m,t} = \max\left\{\tau_{m,t}^{\mathrm{local}},\ \sum_{n\in\mathcal{N}} a_{m,n,t}\left(\tau_{m,n,t}^{\mathrm{trans}} + \tau_{m,n,t}^{\mathrm{comp}}\right)\right\}. \quad (12)
The system latency is defined as the maximum UD’s latency, which can be obtained as
\tau_t = \max_{m\in\mathcal{M}} \left\{\tau_{m,t}\right\}. \quad (13)

2.5. Problem Formulation

This paper aims to minimize the energy consumption and the latency simultaneously. We define the system cost in timeslot t as the weighted sum of energy consumption and latency [23,24]
C_t = \lambda E_t + (1 - \lambda)\tau_t \quad (14)
where \lambda \in [0,1] adjusts the weights. The cost degenerates into the energy consumption with \lambda = 1 and into the latency with \lambda = 0.
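Combining the local terms (8) and (9) with the edge terms above, the system-level quantities (10)–(14) can be assembled as in the following sketch; the two UD decisions are hypothetical examples.

```python
def ud_latency_energy(alpha, D, r, f_uav, f_u=0.6e9, P=0.35, s=1e3,
                      kappa1=1e-27, kappa2=1e-26):
    # Local terms (8)-(9) and edge terms (4)-(7) for one UD.
    t_local = (1 - alpha) * D * s / f_u
    e_local = kappa2 * f_u ** 2 * (1 - alpha) * D * s
    t_edge = alpha * D / r + alpha * D * s / f_uav
    e_edge = P * alpha * D / r + kappa1 * f_uav ** 2 * alpha * D * s
    # Per-UD latency (12) is the max of the parallel branches; energy (10) is their sum.
    return max(t_local, t_edge), e_local + e_edge

def system_cost(decisions, lam=0.5):
    # System latency (13) is the worst UD latency, system energy (11) is the sum,
    # and the cost (14) is their weighted combination.
    lat, eng = zip(*(ud_latency_energy(**d) for d in decisions))
    return lam * sum(eng) + (1 - lam) * max(lat)

# Two hypothetical UDs with different offloading decisions.
decisions = [dict(alpha=0.8, D=1e3, r=40e6, f_uav=1.5e9),
             dict(alpha=0.3, D=2e3, r=10e6, f_uav=1.0e9)]
print(f"system cost = {system_cost(decisions):.4e}")
```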
Due to the time-varying channel state, UDs should select a suitable serving UAV, make a reasonable offloading decision, and request sufficient computing resources based on the current environment at each timeslot. We jointly optimize the association, offloading proportion, and computing resources allocation to minimize the long-term system cost. The joint optimization problem can be formulated as
\mathcal{P}_1: \min_{\{a_{m,n,t}\},\{\alpha_{m,t}\},\{f_{m,n,t}\}} \ \sum_{t\ge 1} C_t \quad (15a)
\mathrm{s.t.}\ \ a_{m,n,t} \in \{0,1\}, \ \forall m \in \mathcal{M}, n \in \mathcal{N}, t \ge 1, \quad (15b)
\sum_{n\in\mathcal{N}} a_{m,n,t} = 1, \ \forall m \in \mathcal{M}, t \ge 1, \quad (15c)
\alpha_{m,t} \in [0,1], \ \forall m \in \mathcal{M}, t \ge 1, \quad (15d)
f_{m,n,t} \in [0, F_{\max}], \ \forall m \in \mathcal{M}, t \ge 1, \quad (15e)
\tau_t \le \tau_{\max}, \ \forall t \ge 1, \quad (15f)
where constraints (15b) and (15c) denote that each UD can only associate with one UAV in a timeslot; constraint (15d) gives the range of the offloading proportion; constraint (15e) limits the maximum computing resources of the MEC server; and constraint (15f) indicates that the latency cannot exceed the maximum value \tau_{\max}. The difficulty of problem \mathcal{P}_1 lies in the long-term objective and in the coupling of association, offloading, and computing resource allocation.

3. Multi-Agent DRL-Based Association, Offloading, and Resource Allocation Schemes

Although the trajectories of UAVs can be predetermined, the variation in channel state and in resource allocation makes the system highly dynamic. It is challenging to tackle the long-term optimization problem in highly dynamic scenarios, especially for large networks. Fortunately, RL and DRL methods have been applied successfully to solve such sequential decision problems. A multi-agent scheme can further reduce the overhead of information collection and exchange in large networks. Therefore, we adopt the multi-agent RL framework and propose one RL-based and one DRL-based scheme. We first introduce the multi-agent RL framework and formulate the long-term problem as a POMDP. Then, we elaborate the MAPPO-based and the CF-MAB-based schemes.

3.1. Multi-Agent RL Framework

In the multi-agent RL framework, all UDs are agents. Each agent maintains a stochastic policy to make decisions based on its local observations and then receives its corresponding reward independently. The long-term joint allocation problem can be treated as a POMDP. The policy of agent m is denoted by \pi^{(m)}\left(a_t^{(m)} \mid o_t^{(m)}\right), where a_t^{(m)} and o_t^{(m)} are the action and observation, respectively. We aim to find a joint policy for all UDs to minimize the long-term cost. The joint policy is the product of the individual policies, \pi = \prod_{m=1}^{M} \pi^{(m)}.
The observation contains the channel gains and the volume of current tasks. The observation of UD m at timeslot t can be denoted by
o_t^{(m)} = \left\{ \{g_{m,n,t}\}_{n\in\mathcal{N}},\ D_{m,t} \right\}. \quad (16)
We define the global state as the ensemble of observations of all UDs:
s_t = \left\{ o_t^{(m)} \right\}_{m\in\mathcal{M}}. \quad (17)
Each UD agent makes an action to decide its association, offloading proportion, and computing resource demand based on its policy. The action of UD m at timeslot t can be denoted by
a_t^{(m)} = \left\{ \{a_{m,n,t}\}_{n\in\mathcal{N}},\ \alpha_{m,t},\ f_{m,n_0,t} \;\middle|\; a_{m,n_0,t} = 1 \right\}. \quad (18)
The joint action of all UDs is denoted by
a_t = \left\{ a_t^{(m)} \right\}_{m\in\mathcal{M}}. \quad (19)
After its action, each agent will receive an instantaneous reward from the environment. In our multi-agent framework, there is no global information exchange among UD agents, meaning that each UD can only obtain a direct reward based on its own local information, such as the transmission rate r m , n , t , energy consumption E m , t , and latency τ m , t . The reward of UD m after timeslot t can be denoted by
r_t^{(m)} = \left\{ \{r_{m,n,t}\}_{n\in\mathcal{N}},\ E_{m,t},\ \tau_{m,t} \right\}. \quad (20)
Each UD agent can only obtain its local information, meaning that it cannot obtain the system cost. We introduce the individual cost based on its own reward for each UD as
C_{m,t} = \lambda E_{m,t} + (1 - \lambda)\tau_{m,t}. \quad (21)
Then, each UD agent updates its policy to minimize its individual cost instead.
There is a gap between minimizing the individual cost and minimizing the system cost. We use the substitution based on the fact that the system cost achieves the minimum when all UDs achieve their minimum individual cost synchronously.
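A lightweight way to organize each agent's observation (16), action (18), and individual cost (21) in an implementation is sketched below; the container and field names are ours and are not part of the paper's notation.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class UDObservation:          # (16): per-UAV channel gains and current task size
    gains: Dict[int, float]   # {uav_id: g_{m,n,t}}
    task_bits: float          # D_{m,t}

@dataclass
class UDAction:               # (18): association, offloading proportion, requested CPU
    uav_id: int               # n0 with a_{m,n0,t} = 1
    alpha: float              # offloading proportion in [0, 1]
    freq: float               # requested f_{m,n0,t} in [0, F_max]

def individual_cost(energy: float, latency: float, lam: float = 0.5) -> float:
    # (21): each agent evaluates only its own energy E_{m,t} and latency tau_{m,t}.
    return lam * energy + (1 - lam) * latency
```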
All UD agents interact with the environment and other agents through the observation–action–reward process. A good policy is the key to achieving effective learning and efficient cooperation. We propose a MAPPO-based scheme and an MAB-based scheme to find the policies that can minimize the long-term cost.

3.2. MAPPO-Based Scheme for Long-Term Consumption Minimization

The MAPPO-based scheme is a multi-agent scheme that uses the AC architecture and PPO. Specifically, each agent has an actor network and a critic network: the former selects actions and the latter evaluates them. We first introduce the general MAPPO procedure. Then, we explain the reward functions used in centralized training and decentralized execution, respectively. Finally, we summarize the MAPPO-based scheme.

3.2.1. Typical MAPPO Procedure

The actor network and the critic network are parameterized by θ and ϕ , respectively. The expected discounted accumulated reward can be denoted by
J(\theta) = \mathbb{E}_{a_t, s_t}\left[\sum_t \gamma^t R_t\right], \quad (22)
where R t is the instantaneous reward in timeslot t and γ is the discount factor. Policy gradient methods maximize the expected reward by repeatedly estimating the gradient on θ . The most commonly used policy gradient has the form
\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) \hat{A}_t\right] \quad (23)
where \pi_\theta(a_t \mid s_t) is a stochastic policy, the expectation \hat{\mathbb{E}}_t[\cdot] indicates the empirical average over a finite batch of samples, and \hat{A}_t is an estimator of the advantage function at timeslot t. We adopt the generalized advantage estimation (GAE) method and use the truncated estimator [51]
\hat{A}_t = \delta_t + (\gamma\lambda_0)\delta_{t+1} + \cdots + (\gamma\lambda_0)^{T-t+1}\delta_{T-1} \quad (24)
where T is the length of the trajectory segment, the parameter \lambda_0 balances bias and variance, \delta_t = R_t + \gamma V(s_{t+1}) - V(s_t), and V(s_t) is the state-value function.
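The truncated GAE estimator (24) is usually computed backwards over a trajectory segment; the sketch below is a standard implementation of this recursion (not code from the paper), where `rewards` holds R_t and `values` holds V(s_t) plus one extra bootstrap entry.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam0=0.95):
    """Truncated generalized advantage estimation, cf. (24).

    rewards: array of length T with R_t.
    values:  array of length T + 1 with V(s_t), including a bootstrap V(s_T).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam0 * gae                         # accumulates (gamma*lam0)^l delta_{t+l}
        advantages[t] = gae
    returns = advantages + values[:-1]   # discounted-return targets for the critic
    return advantages, returns
```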
The actor is trained to maximize the objective function F ( θ ) as [47]
F(\theta) = \frac{1}{N_b M}\sum_{i=1}^{N_b}\sum_{m=1}^{M}\left[\min\left\{\kappa_{\theta,i}^{(m)}\hat{A}_i^{(m)},\ \mathrm{clip}\left(\kappa_{\theta,i}^{(m)}, 1-\epsilon, 1+\epsilon\right)\hat{A}_i^{(m)}\right\} + \sigma\,\nu\!\left[\pi_\theta\!\left(s_i^{(m)}\right)\right]\right], \quad (25)
where N_b is the size of the mini-batch; M is the number of agents; \kappa_{\theta,i}^{(m)} = \frac{\pi_{\theta,i}^{(m)}(a_i \mid s_i)}{\pi_{\theta_{\mathrm{old}},i}^{(m)}(a_i \mid s_i)} is the probability ratio between the current policy and the old policy of agent m; \hat{A}_i^{(m)} is the advantage estimated with (24); \mathrm{clip}(\kappa_{\theta,i}^{(m)}, 1-\epsilon, 1+\epsilon) is a clip function that restricts \kappa_{\theta,i}^{(m)} to the interval [1-\epsilon, 1+\epsilon]; \sigma is the entropy coefficient hyperparameter; and \nu is the policy entropy, which increases the exploration rate.
The critic network is trained to minimize the objective function [47]
L(\phi) = \frac{1}{N_b M}\sum_{i=1}^{N_b}\sum_{m=1}^{M}\max\left\{\left(V_\phi\!\left(s_i^{(m)}\right) - R_i\right)^2,\ \left(\mathrm{clip}\left(V_\phi\!\left(s_i^{(m)}\right), V_{\phi_{\mathrm{old}}}\!\left(s_i^{(m)}\right)-\epsilon, V_{\phi_{\mathrm{old}}}\!\left(s_i^{(m)}\right)+\epsilon\right) - R_i\right)^2\right\}, \quad (26)
where R_i is the cumulative discounted reward.
The actor objective in (25) is maximized and the critic loss in (26) is minimized by gradient methods; the corresponding gradients are backpropagated to update the parameters of the actor and critic networks with the Adam optimizer.
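A PyTorch-style sketch of the clipped actor objective (25) and the clipped critic loss (26) is shown below; the tensor arguments and hyperparameter values are assumptions, and both terms are expressed as losses so that a standard optimizer can minimize them.

```python
import torch

def mappo_losses(log_probs, old_log_probs, advantages,
                 values, old_values, returns, entropy,
                 eps=0.2, ent_coef=0.01):
    # Probability ratio kappa = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(log_probs - old_log_probs)
    # Clipped surrogate objective (25); negated so gradient descent maximizes it.
    surr = torch.min(ratio * advantages,
                     torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    actor_loss = -(surr.mean() + ent_coef * entropy.mean())
    # Clipped value loss (26): the larger of the clipped and unclipped squared errors.
    clipped = old_values + torch.clamp(values - old_values, -eps, eps)
    critic_loss = torch.max((values - returns) ** 2,
                            (clipped - returns) ** 2).mean()
    return actor_loss, critic_loss
```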

3.2.2. Centralized Training and Decentralized Execution

The MAPPO-based scheme adopts centralized training and decentralized execution. In the training stage, all agents share a centralized critic network that evaluates the actions of all agents. This critic network requires the global state, i.e., the observations and actions of all agents. Therefore, all agents share their local observations and receive the same critic feedback. However, in the execution stage, each agent only has its local observations and actions, without coordination. To solve this problem, we use different rewards and gradients for the training and execution stages as follows.
To bridge the system cost and the individual cost, we set two kinds of reward functions. Each agent generates its individual reward based on its own cost as
R_{m,t} = \frac{1}{C_{m,t}}. \quad (27)
Each UD should complete its task in the current timeslot, otherwise it will fail to meet the latency demand. Therefore, the reward function of the whole system should consider the cost of all UDs and the number of UDs that fail. We define the system reward function as
R_{\mathcal{M},t} = \left(1 - \frac{K_t}{M}\right)\sum_{m\in\mathcal{M}} R_{m,t} \quad (28)
where K_t is the number of UDs that fail to complete their tasks in timeslot t, and the factor (1 - K_t/M) acts as a penalty. The overall reward function R_{\mathcal{M},t} is used in training for all agents, while the individual reward function R_{m,t} is used in execution for network updates.
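The two reward levels follow directly from (27) and (28); the helper below is a sketch in which the per-UD costs and the number of failed UDs are assumed to be supplied by the environment.

```python
def individual_reward(cost):
    # (27): reciprocal of the individual cost C_{m,t}.
    return 1.0 / cost

def system_reward(costs, n_failed):
    # (28): sum of individual rewards scaled by the penalty (1 - K_t / M),
    # where K_t of the M UDs failed to finish their tasks in the timeslot.
    M = len(costs)
    return (1.0 - n_failed / M) * sum(individual_reward(c) for c in costs)
```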

3.2.3. MAPPO-Based Algorithm

The MAPPO-based algorithm for the UAV-assisted MEC system is summarized in Algorithm 1. \theta = \{\theta^{(m)}\}_{m\in\mathcal{M}} and \phi = \{\phi^{(m)}\}_{m\in\mathcal{M}} are the collections of parameters of all UD agents, where \theta^{(m)} and \phi^{(m)} are the parameters of the actor and the critic of UD agent m, respectively. There are some differences between the training stage and the execution stage. In the centralized training stage, the computations in line 11 use the overall reward function R_{\mathcal{M},t}, and those in line 15 compute the gradients of the objective functions in (25) and (26). In the decentralized execution stage, the computations in line 11 use the individual reward function R_{m,t}, and those in line 15 compute the gradients of the individual objective functions
F^{(m)}(\theta) = \frac{1}{N_b}\sum_{i=1}^{N_b}\left[\min\left\{\kappa_{\theta,i}^{(m)}\hat{A}_i^{(m)},\ \mathrm{clip}\left(\kappa_{\theta,i}^{(m)}, 1-\epsilon, 1+\epsilon\right)\hat{A}_i^{(m)}\right\} + \sigma\,\nu\!\left[\pi_\theta\!\left(s_i^{(m)}\right)\right]\right] \quad (29)
and
L^{(m)}(\phi) = \frac{1}{N_b}\sum_{i=1}^{N_b}\max\left\{\left(V_\phi\!\left(s_i^{(m)}\right) - R_i\right)^2,\ \left(\mathrm{clip}\left(V_\phi\!\left(s_i^{(m)}\right), V_{\phi_{\mathrm{old}}}\!\left(s_i^{(m)}\right)-\epsilon, V_{\phi_{\mathrm{old}}}\!\left(s_i^{(m)}\right)+\epsilon\right) - R_i\right)^2\right\}. \quad (30)
Algorithm 1 MAPPO-Based algorithm
1:Initialize parameter θ of the actor network and parameter ϕ of the critic network.
2:for each episode i = 1 , , E p  do
3:Initialize experience replay buffer U;
4:   for each timeslot t = 1 , , T  do
5:     for each UD m M  do
6:        Execute action according to \pi_{\theta_{\mathrm{old}}}(a_t^{(m)} \mid o_t^{(m)});
7:        Get the reward r_t^{(m)} and the next observation o_{t+1}^{(m)};
8:     end for
9:   end for
10:   Get the trajectory \tau = \{o_t, a_t, r_t\}_{t=1}^{T};
11:   Compute the cumulative discounted reward R_t, the state-value function V_{\phi_{\mathrm{old}}}, and the advantages \{A_t^{(m)}\}_{t=1}^{T} according to (24).
12:   Split τ , R t , and A t into chunks of length N b , and store them in U.
13:   for mini-batch j = 1 , . . . , B c h  do
14:     Randomly select a chunk from U as mini-batch b;
15:     Compute gradients of (25) and (26) on θ and ϕ , respectively, using mini-batch b;
16:     Update \theta and \phi using Adam.
17:   end for
18:end for

3.3. CF-MAB-Based Scheme for Long-Term Consumption Minimization

The deployment and training of neural networks as agents impose high hardware and software requirements on UDs. To further reduce the overhead and complexity, we propose a lightweight scheme that separates the association from the offloading decision and the computing resource allocation. The association directly determines the achievable rates between UDs and UAVs. Then, we find the closed-form offloading proportion and computing resource allocation for the UDs given these achievable rates. We first explain how to obtain the closed-form solution for a given association. Then, we explain the details of the MAB-based association. Finally, we summarize the CF-MAB-based scheme.

3.3.1. Closed-Form Offloading and Computing Resource Allocation for a Given Association

For a given association, each UD decides its offloading ratio and computing resources to minimize its individual cost. Assuming that UD m is associated with UAV n in timeslot t, the problem at UD m can be formulated as
\mathcal{P}_2: \min_{\alpha_{m,t},\, f_{m,n,t}} \ C_{m,t} \quad (31a)
\mathrm{s.t.}\ \ \alpha_{m,t} \in [0,1], \quad (31b)
f_{m,n,t} \in [0, F_{\max}], \quad (31c)
\tau_{m,t} \le \tau_{\max}. \quad (31d)
Problem P 2 is a short-term decision problem and can be solved at each UD independently.
From (31d), we can obtain
\tau_{m,n,t}^{\mathrm{trans}} + \tau_{m,n,t}^{\mathrm{comp}} \le \tau_{\max} \quad (32)
and
\tau_{m,t}^{\mathrm{local}} \le \tau_{\max}. \quad (33)
Substituting (4), (6), and (8) into (32) and (33), respectively, and combining (31c), we obtain
1 - \frac{f_u \tau_{\max}}{D_{m,t} s} \le \alpha_{m,t} \le \frac{\tau_{\max}}{\frac{D_{m,t}}{r_{m,n,t}} + \frac{D_{m,t} s}{F_{\max}}}. \quad (34)
Usually, the system should satisfy \frac{\tau_{\max}}{\frac{D_{m,t}}{r_{m,n,t}} + \frac{D_{m,t} s}{F_{\max}}} \ge 1 so that all tasks can be executed at the MEC server. We can also obtain
\frac{\alpha_{m,t} D_{m,t} s}{\tau_{\max} - \frac{\alpha_{m,t} D_{m,t}}{r_{m,n,t}}} \le f_{m,n,t} \le F_{\max}. \quad (35)
We further analyze the individual cost function and obtain
C_{m,t} = \begin{cases} \lambda\left(E_{m,t}^{\mathrm{local}} + E_{m,n,t}^{\mathrm{trans}} + E_{m,n,t}^{\mathrm{comp}}\right) + (1-\lambda)\left(\tau_{m,n,t}^{\mathrm{trans}} + \tau_{m,n,t}^{\mathrm{comp}}\right), & \alpha_{m,t} \ge \alpha_{m,t}^{\star} \\[4pt] \lambda\left(E_{m,t}^{\mathrm{local}} + E_{m,n,t}^{\mathrm{trans}} + E_{m,n,t}^{\mathrm{comp}}\right) + (1-\lambda)\tau_{m,t}^{\mathrm{local}}, & \alpha_{m,t} < \alpha_{m,t}^{\star} \end{cases} \quad (36)
where \alpha_{m,t}^{\star} = \frac{s/f_u}{s/f_u + 1/r_{m,n,t} + s/f_{m,n,t}} is obtained by setting \tau_{m,n,t}^{\mathrm{trans}} + \tau_{m,n,t}^{\mathrm{comp}} = \tau_{m,t}^{\mathrm{local}}. We check the gradient of C_{m,t} with respect to \{\alpha_{m,t}, f_{m,n,t}\}, i.e., \nabla_{\alpha_{m,t}, f_{m,n,t}} C_{m,t}, and find no stationary points in the interior of the feasible region. Consequently, the minimum of C_{m,t} is located on the boundaries with (1) \alpha_{m,t} = 1, (2) f_{m,n,t} = F_{\max}, or (3) \alpha_{m,t} = \alpha_{m,t}^{\star}. A typical case of C_{m,t} is shown in Figure 2.
When \alpha_{m,t} = 1, UD m offloads all its tasks to the MEC server. The cost function becomes
C_{m,t} = \lambda\kappa_1 D_{m,t} s f_{m,n,t}^{2} + (1-\lambda)\frac{D_{m,t} s}{f_{m,n,t}} + \frac{\lambda P D_{m,t} + (1-\lambda)D_{m,t}}{r_{m,n,t}}. \quad (37)
To achieve the minimum cost, we set \frac{\mathrm{d}C_{m,t}}{\mathrm{d}f_{m,n,t}} = 0 and obtain the optimal solution f_{m,n,t}^{*} = \sqrt[3]{\frac{1-\lambda}{2\lambda\kappa_1}}. We also obtain the lower bound f_{m,n,t}^{\mathrm{lb}} = \frac{D_{m,t} s}{\tau_{\max} - D_{m,t}/r_{m,n,t}} from (35). Then, we obtain an optimal solution candidate \left\langle 1, \max\{f_{m,n,t}^{*}, f_{m,n,t}^{\mathrm{lb}}\} \right\rangle.
When f_{m,n,t} = F_{\max}, C_{m,t} is a piecewise linear function of \alpha_{m,t}. It is monotonically increasing for \alpha_{m,t} \ge \alpha_{m,t}^{\star}. Moreover, if \lambda is sufficiently small and r_{m,n,t} is sufficiently large, C_{m,t} is monotonically decreasing for \alpha_{m,t} \le \alpha_{m,t}^{\star}, and the optimal candidate is \left\langle \alpha_{m,t}^{\star}, F_{\max} \right\rangle. Otherwise, C_{m,t} is monotonically increasing for \alpha_{m,t} \le \alpha_{m,t}^{\star}; we then obtain the lower bound \alpha_{m,t}^{\mathrm{lb}} = 1 - \frac{f_u \tau_{\max}}{D_{m,t} s} from (34), and the optimal candidate is \left\langle \alpha_{m,t}^{\mathrm{lb}}, F_{\max} \right\rangle.
When \alpha_{m,t} = \alpha_{m,t}^{\star}, the cost function C_{m,t} becomes
C_{m,t} = \frac{f_{m,n,t}}{\left(\frac{f_u}{r_{m,n,t}s}+1\right)f_{m,n,t}+f_u}\left(\lambda\kappa_1 f_{m,n,t}^{2} + \frac{\lambda P}{r_{m,n,t}s} - \lambda\kappa_2 f_u^{2} - \frac{1-\lambda}{f_u}\right)D_{m,t}s + \left(\lambda\kappa_2 f_u^{2} + \frac{1-\lambda}{f_u}\right)D_{m,t}s. \quad (38)
We set \frac{\mathrm{d}C_{m,t}}{\mathrm{d}f_{m,n,t}} = 0, which reduces to the cubic equation
2\lambda\kappa_1\left(\frac{f_u}{r_{m,n,t}s}+1\right)f_{m,n,t}^{3} + 3\lambda\kappa_1 f_u f_{m,n,t}^{2} + \frac{\lambda P f_u}{r_{m,n,t}s} - \lambda\kappa_2 f_u^{3} - (1-\lambda) = 0, \quad (39)
whose positive real root can be obtained in closed form via Cardano's formula. Then, we obtain an optimal candidate \left\langle \alpha_{m,t}^{\star}, f_{m,n,t} \right\rangle with f_{m,n,t} set to this root.
Finally, we can obtain the optimal offloading proportion \alpha_{m,t}^{\mathrm{opt}} and computing resource allocation f_{m,n,t}^{\mathrm{opt}} for each UD from the three candidates.
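The boundary analysis above yields at most three candidate points per UD. The sketch below illustrates the resulting decision rule; it is a simplification of the closed-form procedure in which the second and third candidates are located by a bounded grid search over the corresponding boundary instead of the explicit expressions, and all parameter values are illustrative.

```python
import numpy as np

def best_offload(D, r, f_u=0.6e9, F_max=1.5e9, tau_max=1e-3,
                 P=0.35, s=1e3, k1=1e-27, k2=1e-26, lam=0.5):
    # Individual cost C_{m,t} of (36) for a given (alpha, f) pair.
    def cost(alpha, f):
        t_loc = (1 - alpha) * D * s / f_u
        t_edge = alpha * D / r + alpha * D * s / f
        e = (k2 * f_u ** 2 * (1 - alpha) * D * s
             + P * alpha * D / r + k1 * f ** 2 * alpha * D * s)
        return lam * e + (1 - lam) * max(t_loc, t_edge)

    # Latency feasibility check corresponding to (32)-(33).
    def feasible(alpha, f):
        return max((1 - alpha) * D * s / f_u,
                   alpha * D / r + alpha * D * s / f) <= tau_max

    candidates = []
    for f in np.linspace(F_max / 200, F_max, 200):
        candidates.append((1.0, f))                 # boundary 1: alpha = 1
        a_star = (s / f_u) / (s / f_u + 1 / r + s / f)
        candidates.append((a_star, f))              # boundary 3: alpha = alpha*(f)
    for alpha in np.linspace(0.0, 1.0, 201):
        candidates.append((alpha, F_max))           # boundary 2: f = F_max

    feas = [(a, f) for (a, f) in candidates if feasible(a, f)]
    if not feas:
        return None  # the latency constraint cannot be met
    return min(feas, key=lambda p: cost(*p))

print(best_offload(D=1e3, r=40e6))
```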

3.3.2. MAB-Based Long-Term Association

We tackle the long-term association problem using a distributed MAB scheme. Each UD is an agent that selects its association to maximize its long-term achievable rate. The achievable rate of each UD depends on its channel gain and on the associations of the other UDs, and it thus varies over time as the UAVs move and the associations change. On the other hand, a slight variation may incur frequent handovers, especially when a UD can achieve similar rates with different UAVs. Therefore, the MAB-based scheme should track the channel conditions in a timely manner while avoiding frequent handovers.
Each UD agent decides its association and receives an instantaneous reward at each timeslot. Then, it updates its exploration and exploitation strategy using the cumulative rewards. The weight, the association probability, and the reward function of UAV n for UD m in timeslot t are denoted by w_{m,n,t}, p_{m,n,t}, and R_{m,n,t}, respectively.
Each UD selects its association according to the probability
p_{m,n,t} = (1 - \gamma_0)\frac{w_{m,n,t}}{\sum_{n'\in\mathcal{N}} w_{m,n',t}} + \frac{\gamma_0}{N} \quad (40)
where \gamma_0 \in (0,1) adjusts the proportion of exploitation and exploration.
After taking its action, each UD receives its reward, and the reward function is defined as
R_{m,n,t} = \begin{cases} \psi_{m,n,t}\tanh\left(\dfrac{r_{m,n,t} - \bar{r}_{m,n,t}}{\tilde{r}_{m,n,t}}\right), & r_{m,n,t} \ge \bar{r}_{m,n,t} \\[4pt] 0, & \mathrm{otherwise} \end{cases} \quad (41)
where \tanh(\cdot) is the hyperbolic tangent function; \psi_{m,n,t} is a reward discount that penalizes frequent handovers, expressed as
\psi_{m,n,t} = \begin{cases} 1, & a_{m,n,t} = a_{m,n,t-1} \\ \psi, & a_{m,n,t} \ne a_{m,n,t-1} \end{cases} \quad (42)
with the discount factor \psi \in [0,1]; \bar{r}_{m,n,t} is the average achieved rate over the past L timeslots, expressed as
\bar{r}_{m,n,t} = \frac{1}{L}\sum_{\tau=t-L}^{t-1} r_{m,n,\tau}; \quad (43)
and \tilde{r}_{m,n,t} is a bias that balances the load of the UAVs, expressed as
\tilde{r}_{m,n,t} = \frac{1}{|\mathcal{M}_{n,t}|}\sum_{k\in\mathcal{M}_{n,t}} r_{k,n,t} - \frac{1}{|\mathcal{M}_{n,t}|-1}\sum_{k\in\mathcal{M}_{n,t},\,k\ne m} r_{k,n,t} \quad (44)
where \mathcal{M}_{n,t} is the set of UDs that associate with UAV n at timeslot t.
Then, each UD obtains its cumulative reward and updates the weight as
G_{m,n,t} = \sum_{\tau=1}^{t} \frac{R_{m,n,\tau}}{p_{m,n,\tau}} \quad (45)
and
w_{m,n,t+1} = \exp\left(\eta G_{m,n,t}\right) = w_{m,n,t}\exp\left(\eta\frac{R_{m,n,t}}{p_{m,n,t}}\right), \quad (46)
respectively, where \eta > 0 determines how aggressively the algorithm learns and updates the distribution.
It should be noted that the historical information becomes outdated, and even harmful to the current decision, when the UAVs have travelled a long distance. To improve the learning performance, we set a sliding window of length D as shown in Figure 3, so that the cumulative reward is accumulated over only the latest D timeslots:
G_{m,n,t} = \sum_{\tau=t-D+1}^{t} \frac{R_{m,n,\tau}}{p_{m,n,\tau}}. \quad (47)
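The association step can be sketched as an EXP3-style update with the sliding-window accumulation of (47). The class below is illustrative: it keeps only the importance-weighted reward history of (45) and the exploration mixture of (40), and omits the handover discount and load-balancing bias of (41)–(44); all class and parameter names are our own.

```python
import numpy as np
from collections import deque

class MABAssociation:
    """EXP3-style arm selection over N UAVs with a sliding reward window, cf. (40), (45)-(47)."""

    def __init__(self, n_uavs, gamma0=0.1, eta=0.05, window=200):
        self.N = n_uavs
        self.gamma0 = gamma0
        self.eta = eta
        self.history = deque(maxlen=window)   # per-timeslot importance-weighted reward vectors
        self.rng = np.random.default_rng()

    def probabilities(self):
        # (40): mix the normalized exponential weights with uniform exploration.
        G = np.sum(self.history, axis=0) if self.history else np.zeros(self.N)
        w = np.exp(self.eta * (G - G.max()))   # common shift for numerical stability
        return (1 - self.gamma0) * w / w.sum() + self.gamma0 / self.N

    def select(self):
        p = self.probabilities()
        return self.rng.choice(self.N, p=p), p

    def update(self, arm, reward, p):
        # Importance-weighted reward R/p for the chosen arm; the deque drops stale entries (47).
        est = np.zeros(self.N)
        est[arm] = reward / p[arm]
        self.history.append(est)

# Illustrative use: one UD choosing among 4 UAVs.
agent = MABAssociation(n_uavs=4)
arm, p = agent.select()
agent.update(arm, reward=0.7, p=p)
```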

3.3.3. CF-MAB-Based Algorithm

Finally, the CF-MAB-based algorithm executed at each UD is summarized in Algorithm 2. Focusing on a single UD, we can see that its behavior is similar to that of the EXP3 algorithm [52]. The convergence proof of the algorithm is given in our previous work [53].
Algorithm 2 CF-MAB-based algorithm
Require:  η , γ 0
1:Initialize G_{m,n,0} = 0 and w_{m,n,0} = 1 for m \in \mathcal{M} and n \in \mathcal{N};
2:for each timeslot t = 1 , , T  do
3:   for each UD m M  do
4:     Calculate probability distribution p m , n , t with (40);
5:     Make action a_{m,n,t} randomly according to the probabilities \{p_{m,1,t}, \ldots, p_{m,N,t}\};
6:     Calculate the achieved rate r m , n , t ;
7:     Obtain the optimal offloading proportion \alpha_{m,t}^{\mathrm{opt}}, the computing resource allocation f_{m,n,t}^{\mathrm{opt}}, and the minimum cost C_{m,t}(\alpha_{m,t}^{\mathrm{opt}}, f_{m,n,t}^{\mathrm{opt}});
8:     Calculate rewards R m , n , t with (41) and cumulative rewards G m , n , t with (45) or (47);
9:     Update the weights w m , n , t + 1 ;
10:   end for
11:end for

4. Performance Evaluation

This section evaluates the proposed algorithms. As shown in Figure 4, we assume that M = 10 UDs are randomly distributed in a 5 × 5 km2 square area with fixed locations and N = 4 UAV-BSs are deployed above the square area with a fixed height of 300 m. The square area is divided into N = 4 parts and each part shares its center with a UAV’s circular trajectory. The UAVs fly with a constant speed of 12.69 m/s and the trajectory has a radius of 0.75 km.
Some key parameters are given in the following. The channel coherence time is 5 ms and the channel state is constant during the coherence time. The duration of each timeslot is 1 ms. The maximum latency is set to the timeslot duration, i.e., \tau_{\max} = 1 ms. The bandwidth is B = 10 MHz, the noise power density is −174 dBm/Hz, and the transmit power of each UD is P = 0.35 W. The number of CPU cycles required to compute each bit is s = 10^3 cycles/bit. The computing capability of each UD is f_u = 0.6 GHz. The parameters that adjust the energy consumption at the UAV server and at the UD are set to \kappa_1 = 10^{-27} and \kappa_2 = 10^{-26}, respectively. The parameter that balances latency and energy consumption is set to \lambda = 0.5. Unless otherwise stated, the total observed duration spans 5 \times 10^4 timeslots, the maximum computing resource is F_{\max} = 1.5 GHz, and the volume of tasks that should be computed in one timeslot is D_t = \sum_{m=1}^{M} D_{m,t} = 10 kbits.
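For reference, the stated settings can be collected in a single configuration structure; the sketch below merely restates the values above, and the field names are our own.

```python
# Simulation settings of Section 4, gathered into one dictionary (field names are ours).
SIMULATION_CONFIG = {
    "num_uds": 10, "num_uavs": 4,
    "area_km": 5.0, "uav_height_m": 300.0,
    "uav_speed_mps": 12.69, "trajectory_radius_km": 0.75,
    "coherence_time_ms": 5.0, "timeslot_ms": 1.0, "tau_max_ms": 1.0,
    "bandwidth_hz": 10e6, "noise_density_dbm_per_hz": -174.0, "tx_power_w": 0.35,
    "cycles_per_bit": 1e3, "f_ud_hz": 0.6e9, "f_max_hz": 1.5e9,
    "kappa1": 1e-27, "kappa2": 1e-26, "lambda_weight": 0.5,
    "num_timeslots": 50_000, "total_task_kbits": 10,
}
```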
To compare the performance, we employ MAPPO, received signal strength (RSS)-based, and random association strategies, and local-only, remote-only, closed-form, and random offloading strategies as benchmarks. Specifically,
  • CF-MAPPO scheme: The UDs’ association is determined via a simplified MAPPO-based scheme to maximize the total achievable rate. Then, the offloading and computing resource allocation are obtained through a closed-form solution;
  • RSS + Remote-only scheme: Each UD associates to the UAV with maximum RSS and all its tasks are executed at the UAV server;
  • Local-only scheme: Each UD computes its tasks locally;
  • Random + Random scheme: Each UD associates with a random UAV and a random proportion of its tasks is executed at the UAV server.
Figure 5 compares the convergence performance in the training stage of the proposed MAPPO-based and CF-MAB-based schemes with the benchmarks. It is obvious that the RSS + Remote-only scheme, the local-only scheme, and the Random + Random scheme are too simple to achieve satisfactory performance. It can be found that the MAPPO-based scheme reduces the system cost significantly and converges to the lowest cost. The reason is that the MAPPO-based scheme can learn from the environment and from historical strategies effectively and efficiently, so that all UDs can make satisfactory decisions and obtain the best performance. It can also be found that the CF-MAB-based scheme converges quickly and achieves a sufficiently low system cost. The reason is that the CF-MAB-based scheme reduces the dimension of the action space significantly by separating the association from the offloading proportion and computing resource allocation. The MAPPO-based scheme adopts centralized training using global information and can learn more thoroughly and effectively than the CF-MAB-based scheme, in which each UD learns only from its individual historical information. The CF-MAPPO-based scheme achieves performance close to, but slightly weaker than, that of the CF-MAB-based scheme, since both decouple the joint optimization and use the closed-form resource allocation. Moreover, the CF-MAPPO-based scheme uses more effective centralized learning than the CF-MAB-based scheme, meaning that it converges faster.
Figure 6 shows the cumulative system cost with different task computing demands. The tasks may not be completed under the maximum constraints, in which case a failure happens. To show the performance of all schemes completely, we still count the system cost of failed tasks. Both the latency and the energy consumption increase with the task volume, meaning that the cumulative system cost increases in all schemes. The MAPPO-based scheme outperforms the others under all task demands. The CF-MAB-based scheme and the CF-MAPPO-based scheme achieve similar performance, with the CF-MAB-based scheme being marginally better. It can also be noted that the RSS + Remote-only scheme obtains a lower cumulative system cost than the CF-MAB- and CF-MAPPO-based schemes for small task volumes (D_t < 8 kbits). This is because the CF-MAB- and CF-MAPPO-based schemes take the latency constraints into consideration, meaning that they consume more energy during computing.
Figure 7 shows the cumulative system cost under different computing resource constraints. The volume of tasks in each timeslot is D t = 5 kbits. All schemes exhibit a monotonically decreasing cumulative system cost with respect to the computing resources below saturation thresholds (about 1 GHz). By optimally leveraging local and remote computation, these schemes achieve energy–latency trade-off optimization, resulting in a low cumulative system cost. The CF-MAB- and CF-MAPPO-based schemes decouple the transmission and the computation so that they cannot further reduce the cost when the computing resources are sufficiently large. The MAPPO-based scheme can use computation resources to compensate for communication in the joint optimization, meaning that it can further lower the cost.
The computation is completed successfully if the tasks are executed within the required time. Figure 8 shows the task completion rates of the different schemes. The MAPPO-based scheme, the CF-MAB-based scheme, and the CF-MAPPO-based scheme outperform the other benchmarks significantly. Since the closed-form-based method takes the latency constraint into consideration, the CF-MAB- and CF-MAPPO-based schemes can avoid failures and achieve slightly higher completion rates. The MAPPO-based scheme considers the latency constraint only in the reward function and may select actions that exceed the latency limit, meaning that it has a slightly lower completion rate. Comparing Figure 6 and Figure 8, we find that the RSS + Remote-only scheme has a significantly lower completion rate, even though it achieves a slightly lower system cost.

5. Conclusions

This paper investigated the joint association, offloading, and computing resource allocation in multiple-UAV-assisted MEC networks. We introduced a cost function as a metric to balance the latency and the energy consumption, and transformed the joint latency and energy consumption minimization into a long-term cost minimization problem. Then, we adopted a multi-agent DRL framework and proposed a MAPPO-based scheme and a CF-MAB-based scheme to solve this problem. The MAPPO-based scheme uses global observations for centralized learning and partial observations for decentralized execution. The CF-MAB-based scheme selects associations to maximize the long-term transmission rates and obtains the offloading and computing resource allocation in closed form. The numerical results validate the proposed schemes and show their superiority. Moreover, the MAPPO-based scheme provides collaborative decisions for agents to obtain high performance in complex and dynamic environments, while the CF-MAB-based scheme provides independent decisions and is suitable for scenarios requiring rapid responses.

Author Contributions

Conceptualization, M.C., S.H., Y.P., M.L. and W.-P.Z.; methodology, M.C., S.H. and Y.P.; software, M.C. and S.H.; validation, M.C. and S.H.; writing-original draft preparation, M.C. and S.H.; writing-review and editing, M.C., M.L. and W.-P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62301282.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Nguyen, D.C.; Ding, M.; Pathirana, P.N.; Seneviratne, A.; Li, J.; Niyato, D.; Dobre, O.; Poor, H.V. 6G Internet of Things: A Comprehensive Survey. IEEE Internet Things J. 2022, 9, 359–383. [Google Scholar] [CrossRef]
  2. Aouedi, O.; Vu, T.H.; Sacco, A.; Nguyen, D.C.; Piamrat, K.; Marchetto, G.; Pham, Q.V. A Survey on Intelligent Internet of Things: Applications, Security, Privacy, and Future Directions. IEEE Commun. Surv. Tutor. 2025, 27, 1238–1292. [Google Scholar] [CrossRef]
  3. McEnroe, P.; Wang, S.; Liyanage, M. A Survey on the Convergence of Edge Computing and AI for UAVs: Opportunities and Challenges. IEEE Internet Things J. 2022, 9, 15435–15459. [Google Scholar] [CrossRef]
  4. Spinelli, F.; Mancuso, V. Toward Enabled Industrial Verticals in 5G: A Survey on MEC-Based Approaches to Provisioning and Flexibility. IEEE Commun. Surv. Tutor. 2021, 23, 596–630. [Google Scholar] [CrossRef]
  5. Azari, M.M.; Solanki, S.; Chatzinotas, S.; Kodheli, O.; Sallouha, H.; Colpaert, A.; Mendoza Montoya, J.F.; Pollin, S.; Haqiqatnejad, A.; Mostaani, A.; et al. Evolution of Non-Terrestrial Networks From 5G to 6G: A Survey. IEEE Commun. Surv. Tutor. 2022, 24, 2633–2672. [Google Scholar] [CrossRef]
  6. Mahboob, S.; Liu, L. Revolutionizing Future Connectivity: A Contemporary Survey on AI-Empowered Satellite-Based Non-Terrestrial Networks in 6G. IEEE Commun. Surv. Tutor. 2024, 26, 1279–1321. [Google Scholar] [CrossRef]
  7. Panahi, F.H.; Panahi, F.H. Reliable and Energy-Efficient UAV Communications: A Cost-Aware Perspective. IEEE Trans. Mobile Comput. 2024, 23, 4038–4049. [Google Scholar] [CrossRef]
  8. Kirubakaran, B.; Vikhrova, O.; Andreev, S.; Hosek, J. UAV-BS Integration with Urban Infrastructure: An Energy Efficiency Perspective. IEEE Commun. Mag. 2025, 63, 100–106. [Google Scholar] [CrossRef]
  9. Peng, H.; Shen, X. Multi-Agent Reinforcement Learning Based Resource Management in MEC- and UAV-Assisted Vehicular Networks. IEEE J. Sel. Areas Commun. 2021, 39, 131–141. [Google Scholar] [CrossRef]
  10. Ullah, S.A.; Hassan, S.A.; Abou-Zeid, H.; Qureshi, H.K.; Jung, H.; Mahmood, A.; Gidlund, M.; Imran, M.A.; Hossain, E. Convergence of MEC and DRL in Non-Terrestrial Wireless Networks: Key Innovations, Challenges, and Future Pathways. IEEE Commun. Surv. Tutor. 2025, 1–39, early access. [Google Scholar]
  11. Tun, Y.K.; Park, Y.M.; Tran, N.H.; Saad, W.; Pandey, S.R.; Hong, C.S. Energy-Efficient Resource Management in UAV-Assisted Mobile Edge Computing. IEEE Commun. Lett. 2021, 25, 249–253. [Google Scholar] [CrossRef]
  12. Liu, B.; Wan, Y.; Zhou, F.; Wu, Q.; Hu, R.Q. Resource Allocation and Trajectory Design for MISO UAV-Assisted MEC Networks. IEEE Trans. Veh. Technol. 2022, 71, 4933–4948. [Google Scholar] [CrossRef]
  13. Xu, B.; Kuang, Z.; Gao, J.; Zhao, L.; Wu, C. Joint Offloading Decision and Trajectory Design for UAV-Enabled Edge Computing With Task Dependency. IEEE Trans. Wireless Commun. 2023, 22, 5043–5055. [Google Scholar] [CrossRef]
  14. Li, Y.; Gao, X.; Shi, M.; Kang, J.; Niyato, D.; Yang, K. Dynamic Weighted Energy Minimization for Aerial Edge Computing Networks. IEEE Internet Things J. 2025, 12, 683–697. [Google Scholar] [CrossRef]
  15. Wang, C.; Zhai, D.; Zhang, R.; Cai, L.; Liu, L.; Dong, M. Joint Association, Trajectory, Offloading, and Resource Optimization in Air and Ground Cooperative MEC Systems. IEEE Trans. Veh. Technol. 2024, 73, 13076–13089. [Google Scholar] [CrossRef]
  16. Li, C.; Gan, Y.; Zhang, Y.; Luo, Y. A Cooperative Computation Offloading Strategy With On-Demand Deployment of Multi-UAVs in UAV-Aided Mobile Edge Computing. IEEE Trans. Network Serv. Manag. 2024, 21, 2095–2110. [Google Scholar] [CrossRef]
  17. Hu, Q.; Cai, Y.; Yu, G.; Qin, Z.; Zhao, M.; Li, G.Y. Joint Offloading and Trajectory Design for UAV-Enabled Mobile Edge Computing Systems. IEEE Internet Things J. 2019, 6, 1879–1892. [Google Scholar] [CrossRef]
  18. Zhang, L.; Ansari, N. Latency-Aware IoT Service Provisioning in UAV-Aided Mobile-Edge Computing Networks. IEEE Internet Things J. 2020, 7, 10573–10580. [Google Scholar] [CrossRef]
  19. Nasir, A.A. Latency Optimization of UAV-Enabled MEC System for Virtual Reality Applications Under Rician Fading Channels. IEEE Wireless Commun. Lett. 2021, 10, 1633–1637. [Google Scholar] [CrossRef]
  20. Sabuj, S.R.; Asiedu, D.K.P.; Lee, K.-J.; Jo, H.-S. Delay Optimization in Mobile Edge Computing: Cognitive UAV-Assisted eMBB and mMTC Services. IEEE Trans. Cogn. Commun. Netw. 2022, 8, 1019–1033. [Google Scholar] [CrossRef]
  21. Han, Z.; Zhou, T.; Xu, T.; Hu, H. Joint User Association and Deployment Optimization for Delay-Minimized UAV-Aided MEC Networks. IEEE Wirel. Commun. Lett. 2023, 12, 1791–1795. [Google Scholar] [CrossRef]
  22. Zhang, J.; Luo, H.; Chen, X.; Shen, H.; Guo, L. Minimizing Response Delay in UAV-Assisted Mobile Edge Computing by Joint UAV Deployment and Computation Offloading. IEEE Trans. Cloud Comput. 2024, 12, 1372–1386. [Google Scholar] [CrossRef]
  23. Yu, Z.; Gong, Y.; Gong, S.; Guo, Y. Joint Task Offloading and Resource Allocation in UAV-Enabled Mobile Edge Computing. IEEE Internet Things J. 2020, 7, 3147–3159. [Google Scholar] [CrossRef]
  24. Zhao, L.; Yang, K.; Tan, Z.; Li, X.; Sharma, S.; Liu, Z. A Novel Cost Optimization Strategy for SDN-Enabled UAV-Assisted Vehicular Computation Offloading. IEEE Trans. Intell. Transp. Syst. 2021, 22, 3664–3674. [Google Scholar] [CrossRef]
  25. Pervez, F.; Sultana, A.; Yang, C.; Zhao, L. Energy and Latency Efficient Joint Communication and Computation Optimization in a Multi-UAV-Assisted MEC Network. IEEE Trans. Wirel. Commun. 2024, 23, 1728–1741. [Google Scholar] [CrossRef]
  26. Kuang, Z.; Pan, Y.; Yang, F.; Zhang, Y. Joint Task Offloading Scheduling and Resource Allocation in Air–Ground Cooperation UAV-Enabled Mobile Edge Computing. IEEE Trans. Veh. Technol. 2024, 73, 5796–5807. [Google Scholar] [CrossRef]
  27. Yuan, H.; Wang, M.; Bi, J.; Shi, S.; Yang, J.; Zhang, J.; Zhou, M.; Buyya, R. Cost-Efficient Task Offloading in Mobile Edge Computing With Layered Unmanned Aerial Vehicles. IEEE Internet Things J. 2024, 11, 30496–30509. [Google Scholar] [CrossRef]
  28. Sun, G.; Wang, Y.; Sun, Z.; Wu, Q.; Kang, J.; Niyato, D.; Leung, V.C.M. Multi-Objective Optimization for Multi-UAV-Assisted Mobile Edge Computing. IEEE Trans. Mobile Comput. 2024, 23, 14803–14820. [Google Scholar] [CrossRef]
  29. Zhang, L.; Wen, F.; Zhang, Q.; Gui, G.; Sari, H.; Adachi, F. Constrained Multiobjective Decomposition Evolutionary Algorithm for UAV-Assisted Mobile Edge Computing Networks. IEEE Internet Things J. 2024, 11, 36673–36687. [Google Scholar] [CrossRef]
  30. Zhang, J.; Zhou, L.; Tang, Q.; Ngai, E.C.H.; Hu, X.; Zhao, H.; Wei, J. Stochastic Computation Offloading and Trajectory Scheduling for UAV-Assisted Mobile Edge Computing. IEEE Internet Things J. 2019, 6, 3688–3699. [Google Scholar] [CrossRef]
  31. Zeng, Y.; Chen, S.; Li, J.; Cui, Y.; Du, J. Online Optimization in UAV-Enabled MEC System: Minimizing Long-Term Energy Consumption Under Adapting to Heterogeneous Demands. IEEE Internet Things J. 2024, 11, 32143–32159. [Google Scholar] [CrossRef]
  32. Liu, B.; Peng, M. Online Offloading for Energy-Efficient and Delay-Aware MEC Systems with Cellular-Connected UAVs. IEEE Internet Things J. 2024, 11, 22321–22336. [Google Scholar] [CrossRef]
  33. Wang, J.; Wang, L.; Zhu, K.; Dai, P. Lyapunov-Based Joint Flight Trajectory and Computation Offloading Optimization for UAV-Assisted Vehicular Networks. IEEE Internet Things J. 2024, 11, 22243–22256. [Google Scholar] [CrossRef]
  34. Zhao, M.; Zhang, R.; He, Z.; Li, K. Joint Optimization of Trajectory, Offloading, Caching, and Migration for UAV-Assisted MEC. IEEE Trans. Mob. Comput. 2025, 24, 1981–1998. [Google Scholar] [CrossRef]
  35. Zhang, H.; Sun, Z.; Yang, C.; Cao, X. Latency Optimization in UAV-Assisted Mobile Edge Computing Empowered by Caching Mechanisms. IEEE J. Miniat. Air Space Syst. 2024, 5, 228–236. [Google Scholar] [CrossRef]
  36. Li, C.; Wu, J.; Zhang, Y.; Wan, S. Energy-Latency Tradeoff for Joint Optimization of Vehicle Selection and Resource Allocation in UAV-Assisted Vehicular Edge Computing. IEEE Trans. Green Commun. Netw. 2025, 9, 445–458. [Google Scholar] [CrossRef]
  37. Seid, A.M.; Boateng, G.O.; Anokye, S.; Kwantwi, T.; Sun, G.; Liu, G. Collaborative Computation Offloading and Resource Allocation in Multi-UAV-Assisted IoT Networks: A Deep Reinforcement Learning Approach. IEEE Internet Things J. 2021, 8, 12203–12218. [Google Scholar] [CrossRef]
  38. Wang, Y.; Farooq, J.; Ghazzai, H.; Setti, G. Joint Positioning and Computation Offloading in Multi-UAV MEC for Low Latency Applications: A Proximal Policy Optimization Approach. IEEE Trans. Mobile Comput. 2025, 1–15, early access. [Google Scholar]
  39. He, Y.; Xiang, K.; Cao, X.; Guizani, M. Task Scheduling and Trajectory Optimization-based on Fairness and Communication Security for Multi-UAV-MEC System. IEEE Internet Things J. 2021, 11, 30510–30523. [Google Scholar] [CrossRef]
  40. Yan, M.; Zhang, L.; Jiang, W.; Chan, C.A.; Gygax, A.F.; Nirmalathas, A. Energy Consumption Modeling and Optimization of UAV-Assisted MEC Networks Using Deep Reinforcement Learning. IEEE Sens. J. 2024, 24, 13629–13639. [Google Scholar] [CrossRef]
  41. Liu, Y.; Lin, P.; Zhang, M.; Zhang, Z.; Yu, F.R. Mobile-Aware Service Offloading for UAV-Assisted IoV: A Multiagent Tiny Distributed Learning Approach. IEEE Internet Things J. 2024, 11, 21191–21201. [Google Scholar] [CrossRef]
  42. Wang, Z.; Wang, H.; Liu, L.; Sun, E.; Zhang, H.; Li, Z.; Fang, C.; Li, M. Dynamic Trajectory Design for Multi-UAV-Assisted Mobile Edge Computing. IEEE Trans. Veh. Technol. 2025, 74, 4684–4697. [Google Scholar] [CrossRef]
  43. Kang, H.; Chang, X.; Mišić, J.; Mišić, V.B.; Fan, J.; Liu, Y. Cooperative UAV resource allocation and task offloading in hierarchical aerial computing systems: A MAPPO based approach. IEEE Internet Things J. 2023, 10, 10497–10509. [Google Scholar] [CrossRef]
  44. Cheng, M.; Zhu, C.; Lin, M.; Wang, J.-B.; Zhu, W.-P. An O-MAPPO scheme for joint computation offloading and resources allocation in UAV assisted MEC systems. Comput. Commun. 2023, 208, 190–199. [Google Scholar] [CrossRef]
  45. Cheng, M.; Zhu, C.; Lin, M.; Zhu, W.-P. A MAPPO Based Scheme for Joint Resource Allocation in UAV Assisted MEC Networks. In Proceedings of the IEEE/CIC International Conference on Communications in China (ICCC), Hangzhou, China, 7–9 August 2024. [Google Scholar]
  46. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  47. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS ’22), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  48. Guo, D.; Tang, L.; Zhang, X.; Liang, Y.-C. Joint Optimization of Handover Control and Power Allocation Based on Multi-Agent Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2020, 69, 13124–13138. [Google Scholar] [CrossRef]
  49. He, S.; Cheng, M.; Pan, Y.; Lin, M.; Zhu, W.-P. Distributed access and offloading scheme for multiple UAVs assisted MEC network. In Proceedings of the IEEE 98th Vehicular Technology Conference (VTC2023-Fall), Hong Kong, China, 10–13 October 2023. [Google Scholar]
  50. Pokkunuru, A.; Zhang, Q.; Wang, P. Capacity analysis of aerial small cells. In Proceedings of the IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017. [Google Scholar]
  51. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv 2018, arXiv:1506.02438. [Google Scholar] [CrossRef]
  52. Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the IEEE 36th Annual Foundations of Computer Science, Milwaukee, WI, USA, 23–25 October 1995; pp. 322–331. [Google Scholar]
  53. Cheng, M.; He, S.; Lin, M.; Zhu, W.-P.; Wang, J. RL and DRL Based Distributed User Access Schemes in Multi-UAV Networks. IEEE Trans. Veh. Technol. 2025, 74, 5241–5246. [Google Scholar] [CrossRef]
Figure 1. The multi-UAV-assisted MEC system.
Figure 2. Plot of the individual cost function $C_{m,t}$ with a given association.
Figure 3. The sliding window with length $D$.
Figure 4. The locations of UAVs and UDs.
Figure 5. Convergence performance in the training stage.
Figure 6. System cost with different task volumes.
Figure 7. System cost under different computing resource constraints.
Figure 8. Task completion rate of different schemes.
Table 1. Comparison of related works.

Work | Objective | Method
[11,12,13,14,15,16], [17,18,19,20,21,22], [23,24,25,26,27,28,29] | Energy/latency (short-term) | Convex optimization
[30,31], [32,33] | Energy/latency (long-term) | Lyapunov optimization
[34] | Service quantity | Lyapunov optimization
[35] | Energy/latency | AC (single agent)
[36] | Energy/latency | DDQN (single agent)
[37] | Energy/latency | DDPG (single agent)
[38] | Energy/latency | PPO (single agent)
[39,40,41,42] | Energy/latency | MADDPG (multiple agents)
[43] | Task amount | MAPPO (multiple agents)
[44] | Energy efficiency | MAPPO (multiple agents)
[45] | Energy/latency | MAPPO (multiple agents)
Table 2. List of key notations.

Notation | Description
$\alpha_{m,t}$ | Offloading task proportion of UD $m$ at timeslot $t$
$\beta_L$, $\beta_N$ | Path loss exponents for LOS and NLOS links
$\gamma$ | Discount factor for rewards
$\gamma_0$ | Parameter that adjusts the exploitation and exploration
$\gamma_{m,n,t}$ | SNR between UD $m$ and UAV $n$ in timeslot $t$
$\delta$ | TD residual used to calculate the GAE in (24)
$\epsilon$ | Determines the interval of the clip function
$\eta$ | Parameter that determines how aggressively to learn and update
$\theta$ | Parameters of the actor network
$\kappa_1$, $\kappa_2$ | Weights of the energy consumption of UAVs and UDs
$\kappa_{\theta,i}^{(m)}$ | Probability ratio between the current and the updated policy
$\lambda$ | Adjusts the weights of energy consumption and latency
$\lambda_0$ | Balances the bias and variance in the GAE
$\mu_L$, $\mu_N$ | Attenuation factors for LOS and NLOS links
$\pi^{(m)}$ | Policy of agent $m$
$\tau$ | Latency
$\phi$ | Parameters of the critic network
$a_{m,n,t}$ | Association between UD $m$ and UAV $n$ in timeslot $t$
$a_t^{(m)}$ | Action of agent $m$ in timeslot $t$
$C_t$, $C_{m,t}$ | System cost and individual cost in timeslot $t$
$D_{m,t}$ | Volume of tasks
$E$ | Energy consumption
$f_u$, $f_{m,n,t}$ | Computing frequencies at the UD and the UAV
$o_t^{(m)}$ | Observation of UD $m$ at timeslot $t$
$r_{m,n,t}$ | Achievable rate between UD $m$ and UAV $n$ in timeslot $t$
$r_t^{(m)}$ | Reward of UD $m$ at timeslot $t$
$R_{m,t}$ | Individual reward function of UD $m$ at timeslot $t$
$R_{M,t}$ | System reward function at timeslot $t$
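As a reading aid for the notation above, the following is a minimal sketch of the standard generalized advantage estimation (GAE) and clipped-surrogate expressions from [46,51] in which $\delta$, $\lambda_0$, $\gamma$, $\kappa_{\theta,i}^{(m)}$, and $\epsilon$ appear. The critic value estimate $V_{\phi}(\cdot)$ is a symbol introduced here only for illustration, and the paper's own Equation (24) may differ in indexing and normalization.

$$\delta_t^{(m)} = r_t^{(m)} + \gamma\, V_{\phi}\!\left(o_{t+1}^{(m)}\right) - V_{\phi}\!\left(o_t^{(m)}\right), \qquad \hat{A}_t^{(m)} = \sum_{l \ge 0} \left(\gamma \lambda_0\right)^{l} \delta_{t+l}^{(m)},$$

$$\kappa_{\theta,i}^{(m)} = \frac{\pi_{\theta}\!\left(a_i^{(m)} \mid o_i^{(m)}\right)}{\pi_{\theta_{\mathrm{old}}}\!\left(a_i^{(m)} \mid o_i^{(m)}\right)}, \qquad L^{\mathrm{clip}}(\theta) = \mathbb{E}_i\!\left[\min\!\left(\kappa_{\theta,i}^{(m)} \hat{A}_i^{(m)},\; \mathrm{clip}\!\left(\kappa_{\theta,i}^{(m)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_i^{(m)}\right)\right].$$

In this generic form, $\lambda_0$ trades bias against variance in the advantage estimate, while the clip interval $[1-\epsilon, 1+\epsilon]$ keeps each policy update close to the previous policy.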
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
