1. Introduction
In the transition toward sixth-generation (6G) wireless networks, the rapid expansion of smart devices and large-scale Internet of Things (IoT) applications (e.g., smart cities, industrial monitoring, and real-time video analytics) has led to a massive surge in data traffic, along with growing demands for low-latency computation and reliable communication in complex environments [1,2]. To address these challenges, unmanned aerial vehicle–mobile edge computing (UAV-MEC) systems have emerged as a promising technology, drawing significant attention for their ability to leverage UAVs' mobility and flexible deployment to extend edge computing capabilities into scenarios where ground infrastructure is sparse or overloaded [3,4]. As aerial edge nodes, UAVs can dynamically establish communication links to handle the computation offloading needs of ground users (GUs), providing a viable solution to reduce delays caused by long-distance data transmission to central clouds [5]. However, in dense urban environments, obstacles such as buildings and signal attenuation often destabilize direct communication links between UAVs and GUs, increasing transmission energy consumption, while the broadcast nature of wireless signals leaves task data vulnerable to interception by eavesdroppers, threatening physical layer security [6,7]. These combined challenges underscore the critical need for innovative technological solutions to enhance the communication reliability and security of UAV-MEC systems.
Against this backdrop, reconfigurable intelligent surfaces (RISs), as an emerging technology capable of dynamically reshaping wireless propagation environments, offer a promising solution to enhance the secure transmission performance of MEC systems [8,9]. Composed of numerous low-cost passive reflective elements, RISs can flexibly shape signal propagation paths by adjusting the phase shifts and amplitude attenuations of each element in real time, thereby improving the quality of legitimate links while suppressing eavesdropping channels. This makes them a key enabling technology for physical layer security and energy efficiency (EE) optimization [10]. Current research on secure transmission in RIS-aided MEC systems has achieved significant progress. For instance, [11,12] investigate resource allocation strategies for RIS-aided secure MEC networks, enhancing security through joint resource allocation and phase shift regulation. Ref. [13] proposes a physical layer security scheme for hybrid RIS-aided multiple-input multiple-output (MIMO) communications, maximizing secrecy capacity by optimizing reflection coefficients and verifying the inhibitory effect of RISs on multi-antenna eavesdropping. Ref. [14] optimizes secure computation efficiency against space–air–ground eavesdroppers in UAV-RIS-aided MEC networks, introducing multi-RIS deployment but without exploring inter-RIS collaboration mechanisms. Ref. [15] designs a UAV-RIS-aided multi-layer MEC security scheme for full-duplex eavesdroppers, suppressing eavesdropping signals through RIS beamforming. While these studies have validated the effectiveness of RISs in enhancing secure transmission in MEC systems, they predominantly focus on a single RIS or multiple distributed RISs in which each RIS operates independently without inter-RIS collaboration. Such a model exhibits notable limitations: constrained by practical conditions, a single RIS can only provide limited passive beamforming gain for GUs; a single reflection path struggles to bypass obstacles in complex environments; furthermore, the power scaling order of a single RIS or non-collaborative distributed RISs is merely proportional to the square of the number of reflective elements, failing to meet high-performance requirements in complex scenarios. These limitations have spurred the academic community to investigate multi-RIS collaboration mechanisms.
Multi-RIS collaboration offers significant advantages over a single RIS, providing a new pathway to overcome the performance bottlenecks of single-RIS deployments. Ref. [16] revealed that in double passive RIS collaborative communication, the beamforming gain scales with the fourth power of the total number of reflective elements, markedly outperforming the quadratic growth characteristic of a single RIS. This property establishes a theoretical foundation for constructing high-gain secure transmission links. Further research [17] extended double-RIS collaboration to multi-user scenarios, demonstrating that cooperative beamforming can effectively mitigate the Doppler effect in high-speed railway communications and thus validating the robustness of collaborative RISs in dynamic environments. Another study [18] generalized the double-RIS model to more general multi-RIS collaborative systems, proving that distributed collaborative RISs can achieve higher link gain and spectral efficiency than a single RIS with the same total number of elements through joint phase shift optimization. However, critical gaps remain in translating these double-RIS collaboration frameworks to practical UAV-MEC scenarios. Specifically, existing research fails to account for the dynamic channel variations induced by UAV mobility, a key feature of UAV-MEC systems, which directly impacts the phase synchronization between the two RISs and degrades collaborative beamforming performance. Additionally, these studies overlook security vulnerabilities in complex communication environments, as they lack coordinated mechanisms to actively suppress eavesdroppers, such as jointly forming signal nulls toward eavesdropper directions. Moreover, they often treat UAV trajectory planning and RIS reflection design as separate processes, ignoring the intrinsic relationship between UAV positions and the quality of RIS-GU-UAV propagation paths, an oversight that is particularly detrimental in obstacle-rich urban environments. In summary, while prior work confirms the theoretical merits of multi-RIS collaboration, its simplified treatment of UAV-induced channel dynamics and security threats hinders practical implementation in complex UAV-MEC systems.
In research on multi-RIS-aided UAV-MEC systems, handling dynamic environments emerges as a critical challenge. System dynamism, primarily induced by stochastic task arrival rates and time-varying wireless channel conditions, continuously disrupts system stability. Traditional studies, including those on double-RIS collaboration, often overlook this issue by assuming static channels and fixed task volumes, which fails to account for queue backlogs that may compromise long-term system performance. To tackle this challenge, Lyapunov optimization theory offers a robust solution. By formulating a drift-plus-penalty model, this approach transforms long-term stochastic optimization problems into deterministic online optimization tasks, effectively balancing immediate performance and queue length control [19,20]. Simultaneously, the high-dimensional nature of the optimization problem presents another significant hurdle, especially for double-RIS systems: the intertwined optimization of phase shift synchronization between two RISs, UAV trajectory adjustments, and resource allocation across computation and communication domains creates complex variable correlations that render traditional optimization methods ineffective. Deep reinforcement learning (DRL) has shown promise in addressing this challenge. As an advanced machine learning paradigm, DRL excels in navigating high-dimensional action spaces and adapting to dynamic changes through sequential decision-making, approximating optimal policies via neural networks to generate real-time multi-dimensional resource allocation strategies [21,22]. The integration of Lyapunov optimization and DRL thus provides a targeted solution: Lyapunov optimization ensures long-term stability amid dynamic tasks and channels (a gap in existing double-RIS models), while DRL handles the high-dimensional complexity of synchronized phase shift and trajectory optimization (overcoming the oversimplified decoupling in prior work). Together, they address key limitations in practical double-RIS-aided UAV-MEC systems, advancing secure and efficient task offloading.
To address the aforementioned challenges, this paper constructs a double-RIS-aided UAV-MEC system model that accounts for the impact of eavesdroppers on secure transmission. With dual optimization objectives of queue stability and secure EE, this paper proposes an efficient resource allocation scheme based on Lyapunov theory and the proximal policy optimization (PPO) algorithm. The primary contributions of this work are as follows:
A double-RIS cooperative UAV-MEC network architecture is proposed, leveraging the spatial diversity and signal coupling effects of two RISs to construct phase-aligned superimposed beams at legitimate users for maximum signal enhancement while creating phase cancellation patterns toward potential eavesdroppers to suppress information leakage. This framework explicitly addresses the phase synchronization and hardware imperfections overlooked in prior studies, jointly optimizing RIS phase-shift matrices, user transmit powers, UAV trajectories, computational resource allocation, and UAV receive beamforming vectors under stochastic task arrivals, time-varying channel dynamics, and long-term queue stability constraints, bridging the gap between theoretical gain and practical deployment.
Using Lyapunov optimization theory, the original problem is reformulated into a deterministic online optimization problem with queue stability guarantees by constructing a drift-plus-penalty function. This addresses the critical oversight of dynamic task/channel handling in existing double-RIS research, decomposing the long-term problem into per-slot subproblems that balance immediate secure EE against queue backlog penalties and effectively decoupling interdependent variables (e.g., RIS phase shifts and UAV trajectories) to reduce computational complexity.
A deep reinforcement learning-based dynamic resource allocation strategy is developed, modeling the system as a Markov decision process and employing the PPO algorithm to approximate optimal policies. This approach overcomes the high-dimensional optimization barrier in synchronized double-RIS and UAV trajectory design, unresolved by traditional methods, enabling end-to-end adaptive resource scheduling by mapping high-dimensional system states to continuous control actions and outperforming baseline schemes such as single-RIS and non-cooperative double-RIS configurations.
The rest of this paper is organized as follows. Section 2 presents our double-RIS cooperative-aided UAV-MEC network model and the associated optimization problem. Section 3 introduces the transformation of the problem using Lyapunov optimization theory and details the proposed algorithm framework based on the PPO algorithm. Section 4 analyzes the effectiveness of the algorithm through extensive simulation experiments, presenting and discussing the key results. Finally, Section 5 summarizes this paper and outlines future research directions.
2. System Model and Problem Formulation
In this section, we construct a double-RIS cooperative-aided UAV-MEC network model and formulate the communication, task queue, and energy consumption models, as well as the core optimization problem for secure energy efficiency maximization.
2.1. Overview
As depicted in Figure 1, we consider a double-RIS cooperative-aided UAV-MEC network architecture comprising a multi-antenna UAV equipped with an MEC server, two geographically distributed RISs, and $K$ single-antenna GUs. A single-antenna eavesdropper is present in the network, attempting to intercept the transmission channels. We assume RIS1 and RIS2 consist of $M_1 = M_1^{x} \times M_1^{z}$ and $M_2 = M_2^{x} \times M_2^{z}$ passive reflecting elements, respectively, while the UAV is equipped with a uniform rectangular array antenna featuring $A = A^{x} \times A^{z}$ elements, where $M_1^{x}$, $M_1^{z}$ and $M_2^{x}$, $M_2^{z}$ denote the horizontal and vertical element counts of RIS1 and RIS2, respectively, and $A^{x}$, $A^{z}$ represent the numbers of horizontal and vertical antennas in the UAV's array. The set of all GUs is denoted as $\mathcal{K} = \{1, 2, \ldots, K\}$. The entire mission duration $T$ is discretized into $N$ equal time slots, denoted by $\mathcal{N} = \{1, 2, \ldots, N\}$, each with a duration $\delta = T/N$. Within each time slot, the UAV's position $\mathbf{q}[n] = (x_u[n], y_u[n], H)$ remains fixed at a height $H$ to minimize energy consumption, while the positions of GUs and RISs are static. Specifically, GU $k$ is located at $\mathbf{g}_k = (x_k, y_k, 0)$, RIS1 at $\mathbf{r}_1$, RIS2 at $\mathbf{r}_2$, and the eavesdropper at $\mathbf{e}$. GUs offload tasks to the UAV via composite links: the direct path, the RIS1-reflected path, the RIS2-reflected path, and the RIS1-RIS2 cascaded reflected path.
2.2. Communication Model
In realistic urban scenarios, communication links can be categorized into two types based on their propagation characteristics: ground links (between ground nodes such as GUs, eavesdroppers, and RISs) and aerial links (between aerial nodes such as the UAV and RISs). Ground links are inherently subject to significant obstructions from buildings, foliage, and other urban structures, thus being modeled using the Rician fading model to account for both line-of-sight (LoS) and non-line-of-sight (NLoS) components. In contrast, aerial links (between the UAV and RISs) are LoS-dominated due to their elevated deployment heights, with negligible multipath effects, thus being modeled as pure LoS links.
We first define the channel gain matrices involved in the system: $\mathbf{h}_{k,1}[n] \in \mathbb{C}^{M_1 \times 1}$, $\mathbf{h}_{k,2}[n] \in \mathbb{C}^{M_2 \times 1}$, $h_{k,e}[n] \in \mathbb{C}$, $\mathbf{h}_{k,u}[n] \in \mathbb{C}^{A \times 1}$, $\mathbf{h}_{1,e}[n] \in \mathbb{C}^{M_1 \times 1}$, and $\mathbf{h}_{2,e}[n] \in \mathbb{C}^{M_2 \times 1}$ explicitly denote the complex-valued channel gains from GU $k$ to RIS1, from GU $k$ to RIS2, from GU $k$ to the eavesdropper, from GU $k$ to the UAV (direct link), from RIS1 to the eavesdropper, and from RIS2 to the eavesdropper, respectively. Specifically, $\mathbf{h}_{k,1}[n]$ and $\mathbf{h}_{k,2}[n]$ characterize the channels whose reflected signals are shaped by the phase shifts of each passive reflecting element in RIS1 and RIS2, whereas $\mathbf{h}_{k,u}[n]$ models the direct propagation link between the GU and the UAV's antenna array. Ground links primarily include links between GUs and RISs, GUs and eavesdroppers, and RISs and eavesdroppers. Taking the link between GU $k$ and RIS1 as an illustrative example, its channel is modeled using the Rician fading model as [23]
$$\mathbf{h}_{k,1}[n] = \sqrt{\beta_0\, d_{k,1}^{-\alpha_g}[n]}\left(\sqrt{\frac{\kappa}{\kappa+1}}\,\bar{\mathbf{h}}_{k,1}[n] + \sqrt{\frac{1}{\kappa+1}}\,\tilde{\mathbf{h}}_{k,1}[n]\right),$$
where $\bar{\mathbf{h}}_{k,1}[n]$ is the LoS component, i.e., the array response vector of RIS1, whose entries have unit modulus and phases determined by the element positions and the carrier wavelength $\lambda$; $\beta_0$ is the unit path gain; $\tilde{\mathbf{h}}_{k,1}[n] \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_{M_1})$ (circularly symmetric complex Gaussian) denotes the NLoS random component; $\alpha_g$ is the path loss exponent; and $\kappa$ is the Rician factor (characterizing the ratio of LoS to NLoS power). The azimuth angle $\phi_{k,1}[n]$ and elevation angle $\psi_{k,1}[n]$ describe the angle of arrival; $d_{k,1}[n]$ is the three-dimensional distance from GU $k$ to RIS1; $d_v$ and $d_h$ are the vertical and horizontal distances between adjacent RIS elements, respectively.
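To make the ground-link model concrete, the following minimal numpy sketch samples one realization of the Rician channel above. The steering-vector geometry is collapsed into a simple phase ramp as a stand-in for the RIS array response, and all numeric constants (reference path gain, path loss exponent, Rician factor) are illustrative assumptions rather than the values used in our simulations.

```python
import numpy as np

def rician_channel(num_elems, dist, alpha=2.8, beta0_db=-30.0, kappa_db=3.0, rng=None):
    """Sample one realization of the GU -> RIS Rician channel.

    num_elems : number of RIS reflecting elements (e.g., M1)
    dist      : 3D GU-RIS distance in meters
    alpha     : ground path-loss exponent        (illustrative)
    beta0_db  : unit path gain at 1 m, in dB     (illustrative)
    kappa_db  : Rician factor in dB              (illustrative)
    """
    rng = np.random.default_rng() if rng is None else rng
    beta0 = 10 ** (beta0_db / 10)
    kappa = 10 ** (kappa_db / 10)
    # LoS component: a unit-modulus phase ramp standing in for the
    # geometry-dependent RIS array response vector.
    phase = rng.uniform(0, 2 * np.pi)
    los = np.exp(1j * phase * np.arange(num_elems))
    # NLoS component: i.i.d. circularly symmetric complex Gaussian, unit variance.
    nlos = (rng.standard_normal(num_elems)
            + 1j * rng.standard_normal(num_elems)) / np.sqrt(2)
    path_gain = np.sqrt(beta0 * dist ** (-alpha))
    return path_gain * (np.sqrt(kappa / (kappa + 1)) * los
                        + np.sqrt(1 / (kappa + 1)) * nlos)

h_k1 = rician_channel(num_elems=400, dist=120.0)  # e.g., a 20 x 20 RIS
```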
Aerial links primarily include links between the RISs and the UAV, and between the RISs themselves. Given their elevated deployment, these links are dominated by LoS propagation with negligible multipath effects. Taking the link between RIS1 and the UAV as an example, its channel gain is modeled as [23]
$$\mathbf{H}_{1,u}[n] = \sqrt{\beta_0\, d_{1,u}^{-\alpha_a}[n]}\;\mathbf{a}_{u}\big(\phi_{1,u}[n], \psi_{1,u}[n]\big)\,\mathbf{a}_{1}^{H}\big(\varphi_{1,u}[n], \vartheta_{1,u}[n]\big),$$
where $\mathbf{a}_{u}(\cdot)$ and $\mathbf{a}_{1}(\cdot)$ are the array response vectors of the UAV and RIS1, respectively; $\alpha_a$ is the path loss exponent of the aerial link; $d_{1,u}[n]$ is the three-dimensional distance between the UAV and RIS1; $\phi_{1,u}[n]$ and $\psi_{1,u}[n]$ describe the angle of arrival; and $d_v^{u}$ and $d_h^{u}$ are the vertical and horizontal distances between adjacent elements of the UAV's antenna array, respectively. Following a similar derivation, the channel gains of the other aerial links (e.g., RIS2 to UAV, RIS1 to RIS2) can be obtained.
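As a sketch of how such a rank-one LoS link can be generated, the snippet below builds a standard uniform-rectangular-array steering vector and forms the outer-product channel. Half-wavelength element spacing and the reference path gain are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def ura_response(n_x, n_z, az, el, spacing=0.5):
    """Steering vector of an n_x-by-n_z uniform rectangular array
    (spacing in wavelengths; az/el in radians). Standard URA model."""
    kx = 2 * np.pi * spacing * np.sin(az) * np.cos(el) * np.arange(n_x)
    kz = 2 * np.pi * spacing * np.sin(el) * np.arange(n_z)
    return np.kron(np.exp(1j * kx), np.exp(1j * kz))

def aerial_los_channel(dist, a_rx, a_tx, alpha=2.0, beta0_db=-30.0):
    """Rank-one LoS channel between two elevated arrays (RIS <-> UAV):
    sqrt(beta0 * d^-alpha) * a_rx * a_tx^H. Constants are illustrative."""
    beta0 = 10 ** (beta0_db / 10)
    return np.sqrt(beta0 * dist ** (-alpha)) * np.outer(a_rx, a_tx.conj())

# Example: a 4 x 4 UAV URA receiving from a 20 x 20 RIS at 150 m.
H_1u = aerial_los_channel(150.0,
                          ura_response(4, 4, 0.3, -0.5),
                          ura_response(20, 20, 1.0, 0.2))
```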
In practical dynamic urban environments, acquiring perfect channel state information (CSI) is challenging due to the passive nature of RIS elements, time-varying propagation channels, and the covertness of eavesdroppers [24]. To balance realism and tractability, we adopt a differentiated CSI acquisition strategy based on the above link models. For legitimate links (including both ground and aerial legitimate links, e.g., GU to RIS1/RIS2, RIS1/RIS2 to UAV, and the direct GU-to-UAV link), the UAV obtains imperfect CSI with bounded estimation errors via periodic pilot feedback and Bayesian estimation algorithms. Specifically, the estimated CSI $\hat{\mathbf{h}}$ is modeled as $\hat{\mathbf{h}} = \mathbf{h} + \Delta\mathbf{h}$, where $\mathbf{h}$ is the true channel gain, and $\Delta\mathbf{h}$ is the estimation error with $\|\Delta\mathbf{h}\| \le \epsilon_h$ ($\epsilon_h$ is a small positive constant denoting the error bound). For eavesdropping links (e.g., GU to eavesdropper, RIS1/RIS2 to eavesdropper), due to the unpredictable mobility and silent behavior of the eavesdropper, only statistical information (e.g., average path loss and Rician fading distribution) is available instead of instantaneous CSI. Based on the above channel models and CSI acquisition strategies, the estimated total channel gain between user $k$ and the UAV (integrating all legitimate paths) in the $n$-th time slot is
$$\hat{\mathbf{h}}_{k}[n] = \hat{\mathbf{h}}_{k,u}[n] + \hat{\mathbf{H}}_{1,u}[n]\mathbf{\Theta}_1[n]\hat{\mathbf{h}}_{k,1}[n] + \hat{\mathbf{H}}_{2,u}[n]\mathbf{\Theta}_2[n]\hat{\mathbf{h}}_{k,2}[n] + \hat{\mathbf{H}}_{2,u}[n]\mathbf{\Theta}_2[n]\hat{\mathbf{D}}[n]\mathbf{\Theta}_1[n]\hat{\mathbf{h}}_{k,1}[n],$$
where each $\hat{(\cdot)}$ denotes an estimated channel gain with bounded error and $\hat{\mathbf{D}}[n] \in \mathbb{C}^{M_2 \times M_1}$ is the estimated RIS1-to-RIS2 channel.
For the eavesdropping link, the statistical average total channel gain (integrating all eavesdropping paths) is
$$\bar{g}_{k,e}[n] = \mathbb{E}\Big[\big|h_{k,e}[n] + \mathbf{h}_{1,e}^{H}[n]\mathbf{\Theta}_1[n]\hat{\mathbf{h}}_{k,1}[n] + \mathbf{h}_{2,e}^{H}[n]\mathbf{\Theta}_2[n]\hat{\mathbf{h}}_{k,2}[n] + \mathbf{h}_{2,e}^{H}[n]\mathbf{\Theta}_2[n]\hat{\mathbf{D}}[n]\mathbf{\Theta}_1[n]\hat{\mathbf{h}}_{k,1}[n]\big|^{2}\Big],$$
where $\mathbb{E}[\cdot]$ denotes the expectation over the statistical distribution of the eavesdropping channels. Here, $(\cdot)^{H}$ denotes the Hermitian transpose, and $\mathbf{\Theta}_1[n]$ and $\mathbf{\Theta}_2[n]$ are the reflection coefficient matrices of RIS1 and RIS2, respectively, given by
$$\mathbf{\Theta}_i[n] = \mathrm{diag}\big(e^{j\theta_{i,1}[n]}, e^{j\theta_{i,2}[n]}, \ldots, e^{j\theta_{i,M_i}[n]}\big), \quad \theta_{i,m}[n] \in [0, 2\pi),\ i \in \{1, 2\}.$$
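The bounded-error CSI model above can be emulated directly. The short sketch below perturbs a true channel with an error drawn uniformly inside an $\epsilon_h$-ball; the specific sampling rule is our illustrative assumption, since the model only constrains the error norm.

```python
import numpy as np

def add_bounded_csi_error(h, eps, rng=None):
    """Imperfect CSI: h_hat = h + e with ||e|| <= eps (bounded-error model)."""
    rng = np.random.default_rng() if rng is None else rng
    e = rng.standard_normal(h.shape) + 1j * rng.standard_normal(h.shape)
    e *= rng.uniform(0.0, eps) / np.linalg.norm(e)  # scale into the eps-ball
    return h + e
```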
Each GU transmits data to the UAV over its own channel. According to Shannon's formula, the transmission rate from GU $k$ to the UAV (using estimated CSI) in the $n$-th time slot is
$$R_k[n] = W \log_2\left(1 + \frac{p_k[n]\,\big|\mathbf{v}_k^{H}[n]\hat{\mathbf{h}}_{k}[n]\big|^{2}}{N_0 W}\right),$$
where $W$ is the channel bandwidth, $\mathbf{v}_k[n]$ is the UAV's receive beamforming vector, $p_k[n]$ is GU $k$'s transmit power, and $N_0$ is the noise power spectral density.
The average transmission rate from GU $k$ to the eavesdropper in the $n$-th time slot is
$$R_k^{e}[n] = W \log_2\left(1 + \frac{p_k[n]\,\bar{g}_{k,e}[n]}{N_0 W}\right).$$
Due to the presence of the eavesdropper, the secure offloading rate from GU $k$ to the UAV is defined as
$$R_k^{\sec}[n] = \big[R_k[n] - R_k^{e}[n]\big]^{+},$$
where $[x]^{+} = \max(x, 0)$, and the secure transmission task amount of GU $k$ is
$$O_k[n] = \delta R_k^{\sec}[n].$$
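Putting the pieces together, the following sketch evaluates the secure offloading rate for one GU and one slot from the composite double-RIS channel. The function signature and all default constants are illustrative assumptions; its inputs would come from channel samplers like those above.

```python
import numpy as np

def secure_rate(h_direct, h_g1, h_g2, H_1u, H_2u, D_12, theta1, theta2,
                v_rx, p_tx, g_eve_avg, bandwidth=1e6, noise_psd=1e-20):
    """Secure offloading rate [R_k - R_k^e]^+ for one GU and one slot.

    theta1/theta2 : RIS phase-shift vectors (radians); Theta_i = diag(exp(j*theta_i))
    v_rx          : unit-norm UAV receive beamforming vector
    g_eve_avg     : statistical average eavesdropping gain E[|h_eve|^2]
    Numeric defaults are illustrative, not the paper's parameters.
    """
    T1, T2 = np.diag(np.exp(1j * theta1)), np.diag(np.exp(1j * theta2))
    # Composite legitimate channel: direct + RIS1 + RIS2 + RIS1->RIS2 cascade.
    h_total = (h_direct
               + H_1u @ T1 @ h_g1
               + H_2u @ T2 @ h_g2
               + H_2u @ T2 @ D_12 @ T1 @ h_g1)
    snr_uav = p_tx * np.abs(v_rx.conj() @ h_total) ** 2 / (noise_psd * bandwidth)
    r_uav = bandwidth * np.log2(1 + snr_uav)
    r_eve = bandwidth * np.log2(1 + p_tx * g_eve_avg / (noise_psd * bandwidth))
    return max(r_uav - r_eve, 0.0)  # secrecy rate is clipped at zero
```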
2.3. Task Queue Model
In the proposed system architecture, ground users generate massive volumes of task data that exceed the real-time processing capabilities of individual computing units. To address this challenge, we implement a distributed buffering mechanism in which each user device and the UAV-mounted edge server maintain dedicated task queues. During each time slot, newly generated tasks are first stored in local queues, after which a dynamic task partitioning strategy is applied: a portion of the tasks is processed locally on the user device, while the remaining portion is offloaded to the UAV's queue for remote processing. We assume that user $k$ generates a task in each time slot with data size $A_k[n]$, following a stochastic process with mean arrival rate $\lambda_k = \mathbb{E}[A_k[n]]$, where $\lambda_k$ represents the average number of task bits generated per slot. The computational complexity of processing 1-bit data is characterized by $c$, defined as the number of CPU cycles required. Let $f_k^{l}[n]$ denote the CPU frequency allocated to local processing on user $k$'s device, and $f_k^{u}[n]$ represent the CPU frequency assigned by the UAV to process tasks offloaded from user $k$. These frequencies are subject to the following operational constraints:
$$0 \le f_k^{l}[n] \le f_{\max}^{l}, \qquad \sum_{k \in \mathcal{K}} f_k^{u}[n] \le f_{\max}^{u},$$
where $f_{\max}^{l}$ is the maximum CPU frequency of user devices, and $f_{\max}^{u}$ denotes the UAV's maximum processing frequency. Based on these frequencies, the amounts of data processed locally by user $k$ and remotely by the UAV for user $k$ within a time slot $\delta$ are, respectively,
$$D_k^{l}[n] = \frac{\delta f_k^{l}[n]}{c}, \qquad D_k^{u}[n] = \frac{\delta f_k^{u}[n]}{c}.$$
Let $Q_k[n]$ represent the task backlog in user $k$'s local queue at time slot $n$ and $Z[n]$ denote the total task backlog in the UAV's queue. Both queues operate on a first-come, first-served (FCFS) basis, with tasks arriving in slot $n$ scheduled for processing starting from slot $n+1$. The evolution of these queues can be described by
$$Q_k[n+1] = \max\big(Q_k[n] - D_k^{l}[n] - O_k[n],\ 0\big) + A_k[n],$$
$$Z[n+1] = \max\Big(Z[n] - \sum_{k \in \mathcal{K}} D_k^{u}[n],\ 0\Big) + \sum_{k \in \mathcal{K}} O_k[n],$$
where $O_k[n]$ is the amount of data offloaded from user $k$ to the UAV in slot $n$, and the max function ensures non-negative queue lengths. For the system to operate stably, the time-averaged expected queue lengths must satisfy
$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\big[Q_k[n]\big] < \infty, \qquad \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\big[Z[n]\big] < \infty.$$
These stability constraints are crucial for maintaining system performance under stochastic conditions, including variable task arrivals, channel fluctuations, and dynamic resource allocation. By ensuring bounded time-averaged queue lengths, we prevent system congestion caused by infinite backlogs, guarantee acceptable average processing delays, and ultimately maintain reliable service quality for all users. This stability forms the foundation for efficient task execution in the edge computing framework, enabling timely processing of user tasks and enhancing overall system usability in practical deployments.
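The queue recursions above are simple to simulate. The toy loop below, with made-up arrival and service volumes, illustrates how the max-operator updates keep both backlogs non-negative and how stability manifests as bounded queue lengths.

```python
import numpy as np

def step_queues(Q, Z, arrivals, local_proc, offload, uav_proc):
    """One-slot update of the K user queues Q and the UAV queue Z (in bits).

    arrivals   : A_k[n], new task bits per user
    local_proc : D_k^l[n] = delta * f_k^l / c, bits processed locally
    offload    : O_k[n]  = delta * R_k^sec[n], bits securely offloaded
    uav_proc   : D_k^u[n], bits the UAV processes for each user
    """
    Q_next = np.maximum(Q - local_proc - offload, 0.0) + arrivals
    Z_next = max(Z - uav_proc.sum(), 0.0) + offload.sum()
    return Q_next, Z_next

# Toy run: 3 users, random arrivals (illustrative magnitudes only).
rng = np.random.default_rng(0)
Q, Z = np.zeros(3), 0.0
for n in range(100):
    A = rng.poisson(8e5, size=3).astype(float)          # ~0.8 Mbit/slot
    Q, Z = step_queues(Q, Z, A, local_proc=np.full(3, 4e5),
                       offload=np.full(3, 4e5), uav_proc=np.full(3, 5e5))
print(Q, Z)  # bounded backlogs indicate a stable operating point
```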
2.4. Energy Consumption Model
The total energy consumed in the designed system is composed of three main parts: computing energy, offloading energy, and UAV flight energy. As passive RISs are used here, their energy consumption is so minimal that it can be ignored in the analysis.
2.4.1. Computing Energy Consumption
Computing energy consumption includes two aspects: the energy consumed by GUs for local task processing and the energy consumed by the UAV for processing offloaded tasks. They are, respectively, expressed as
$$E_k^{l}[n] = \gamma_c \big(f_k^{l}[n]\big)^{3}\,\delta, \qquad E^{u}[n] = \gamma_c \sum_{k \in \mathcal{K}} \big(f_k^{u}[n]\big)^{3}\,\delta,$$
where $\gamma_c$ is the effective capacitance coefficient of the CPU. The cubic relationship between energy consumption and frequency is consistent with the classic energy consumption characteristics of complementary metal–oxide–semiconductor (CMOS) circuits, where power consumption increases nonlinearly with the operating frequency of the processor.
2.4.2. Offloading Energy Consumption
In the $n$-th time slot, GU $k$ offloads part of its tasks. Each GU is equipped with a communication module that can adjust the transmission power, and the energy consumed for task offloading transmission is
$$E_k^{\mathrm{off}}[n] = p_k[n]\,\delta,$$
where $p_k[n]$ is the transmission power used by GU $k$ when offloading tasks in the $n$-th time slot.
2.4.3. Flight Energy Consumption
The flight energy consumption of the UAV is generated when it moves from the current position to the next position, comprising two parts: the energy for maintaining altitude and the propulsion energy. Considering that the duration $\delta$ of each time slot is short enough, for convenience of calculation, it is assumed that the UAV flies at a constant speed in a straight line between two consecutive positions. In addition, the UAV maintains a fixed flight altitude, so the energy consumption for maintaining altitude is a fixed value, and only the propulsion energy consumption needs to be optimized. Based on the analytical model of rotor UAVs, the propulsion energy consumption in the $n$-th time slot is [24]
$$E^{\mathrm{fly}}[n] = \frac{1}{2}\, m\, \delta\, \big\|\mathbf{v}[n]\big\|^{2},$$
where $m$ is the mass of the UAV, and $\mathbf{v}[n]$ is the flight speed of the UAV in the $n$-th time slot. The total energy consumption of the system in the $n$-th time slot is the sum of all the above energy consumption components:
$$E[n] = \sum_{k \in \mathcal{K}} \big(E_k^{l}[n] + E_k^{\mathrm{off}}[n]\big) + E^{u}[n] + \eta_p\, E^{\mathrm{fly}}[n],$$
where $\eta_p$ is the energy conversion coefficient of the propulsion system.
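For reference, a compact sketch of the per-slot energy bookkeeping follows. The cubic CPU term and the kinetic-style propulsion term mirror the model above; the effective capacitance, UAV mass, and other constants are illustrative placeholders.

```python
def total_energy(f_local, f_uav, p_tx, speed, delta=0.5, gamma_c=1e-28,
                 mass=9.65, eta_p=1.0):
    """Per-slot system energy: computing + offloading + flight (joules).

    Uses E = gamma_c * f^3 * delta for CPU energy (cubic in frequency) and
    a kinetic-energy-style propulsion term 0.5*m*delta*||v||^2 scaled by
    eta_p. All constants are illustrative, not the paper's settings.
    """
    e_local = sum(gamma_c * f ** 3 * delta for f in f_local)   # GUs' CPUs
    e_uav_cpu = sum(gamma_c * f ** 3 * delta for f in f_uav)   # MEC server
    e_offload = sum(p * delta for p in p_tx)                   # radio transmission
    e_fly = eta_p * 0.5 * mass * delta * speed ** 2            # propulsion
    return e_local + e_uav_cpu + e_offload + e_fly
```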
2.5. Problem Formulation
For the system to sustain long-term operation and deliver high-quality services to users, a comprehensive optimization of the system's secure transmission rate and energy consumption is essential. However, an exclusive focus on secure transmission rate and system energy consumption might result in excessive queue backlogs, preventing timely processing of user tasks and thus failing to meet user requirements. Therefore, taking into account the stability of the task queues, we perform a joint optimization of resource allocation, UAV trajectory, and RIS phase shifts. The objective is to maximize the network's secure EE, defined as the ratio of the long-term total secure transmission rate achieved by all GUs to the total energy consumption:
$$\eta_{EE} = \frac{\sum_{n=1}^{N} \sum_{k \in \mathcal{K}} R_k^{\sec}[n]}{\sum_{n=1}^{N} E[n]},$$
where $E[n]$ is the total system energy consumption in slot $n$. The aim of this research is to maximize the long-term average EE of all GUs under resource constraints while ensuring the stability of average queue lengths. Thus, the problem can be formulated as
$$\begin{aligned}
\mathbf{P1}: \max_{\substack{\mathbf{\Theta}_1[n], \mathbf{\Theta}_2[n], \mathbf{v}_k[n],\\ \mathbf{v}[n],\, p_k[n],\, f_k^{l}[n],\, f_k^{u}[n]}}\ & \eta_{EE} \\
\text{s.t.}\quad & C1{:}\ 0 \le f_k^{l}[n] \le f_{\max}^{l}, \qquad C2{:}\ \sum_{k \in \mathcal{K}} f_k^{u}[n] \le f_{\max}^{u}, \\
& C3{:}\ \lim_{N \to \infty} \tfrac{1}{N} \textstyle\sum_{n=1}^{N} \mathbb{E}[Q_k[n]] < \infty, \qquad C4{:}\ \lim_{N \to \infty} \tfrac{1}{N} \textstyle\sum_{n=1}^{N} \mathbb{E}[Z[n]] < \infty, \\
& C5{:}\ 0 \le p_k[n] \le p_{\max}, \qquad C6{:}\ \theta_{1,m}[n] \in [0, 2\pi), \qquad C7{:}\ \theta_{2,m}[n] \in [0, 2\pi), \\
& C8{:}\ 0 \le \|\mathbf{v}[n]\| \le v_{\max}, \qquad C9{:}\ \|\mathbf{v}_k[n]\| = 1.
\end{aligned}$$
The optimization problem P1 is designed to maximize the secure EE $\eta_{EE}$. The set of optimization variables includes the phase shift matrices of the two RISs, the receive beamforming vector for user $k$ at time slot $n$, the UAV's flight speed vector, the transmit power of user $k$, and the local computing frequency as well as the offloading computing frequency for user $k$. This problem is subject to multiple constraints: C1 and C2 pertain to computing resource allocation; C3 and C4 ensure the stability of the user and UAV task queues; C5 restricts the transmit power of user $k$ to the range from 0 to the maximum transmit power $p_{\max}$; C6 and C7 limit the phase angles of the two RISs' elements to the range of 0 to $2\pi$; C8 constrains the norm of the UAV's speed vector to be between 0 and the maximum speed $v_{\max}$; and C9 enforces the normalization of the beamforming vector for user $k$.
3. Problem Solution
The previous section formulated the secure EE maximization problem P1, which involves multi-dimensional variable coupling and stochastic dynamics. This section presents a solution framework: first, using Lyapunov optimization to transform it into per-slot subproblems, then modeling it as a model-free Markov decision process (MDP) solved by PPO, and finally analyzing algorithm complexity.
3.1. Lyapunov Optimization
To address the complexity of problem P1, which involves coupled variables and stochastic dynamics leading to NP-hard complexity, we employ Lyapunov optimization techniques. This approach excels at transforming long-term stochastic optimization problems into tractable per-slot decisions while maintaining queue stability. Building upon foundational methodologies in [25,26], we develop an online optimization framework that integrates queue state information with system performance objectives. We begin by defining a quadratic Lyapunov function to characterize the system's queue backlog state:
$$L(\mathbf{\Xi}[n]) = \frac{1}{2}\left(\sum_{k \in \mathcal{K}} Q_k^{2}[n] + Z^{2}[n]\right),$$
where $Q_k[n]$ represents user $k$'s local task queue and $Z[n]$ denotes the UAV's task queue at timeslot $n$. This function quantifies the "energy" of the queue system, with larger values indicating greater backlog congestion. The conditional Lyapunov drift, measuring the expected change in queue energy between consecutive timeslots, is defined as
$$\Delta(\mathbf{\Xi}[n]) = \mathbb{E}\big[L(\mathbf{\Xi}[n+1]) - L(\mathbf{\Xi}[n]) \mid \mathbf{\Xi}[n]\big],$$
where the state vector $\mathbf{\Xi}[n] = \{Q_1[n], \ldots, Q_K[n], Z[n]\}$ captures all current queue backlogs. To balance queue stability with the primary objective of maximizing security EE, we introduce a drift-plus-penalty function
$$\Delta_V(\mathbf{\Xi}[n]) = \Delta(\mathbf{\Xi}[n]) - V\,\mathbb{E}\big[\eta_{EE}[n] \mid \mathbf{\Xi}[n]\big],$$
where $V \ge 0$ is a control parameter regulating the trade-off between system performance ($\eta_{EE}[n] = \sum_{k \in \mathcal{K}} R_k^{\sec}[n]/E[n]$, the per-slot secure EE) and queue stability. Minimizing this function ensures both bounded queues and near-optimal efficiency. To derive an upper bound for $\Delta_V(\mathbf{\Xi}[n])$, we use queue evolution properties and apply the inequality $(\max(a - b, 0) + c)^2 \le a^2 + b^2 + c^2 + 2a(c - b)$ to the user queue update equation, yielding
$$\frac{1}{2}\sum_{k \in \mathcal{K}}\big(Q_k^{2}[n+1] - Q_k^{2}[n]\big) \le C_1 + \sum_{k \in \mathcal{K}} Q_k[n]\big(A_k[n] - D_k^{l}[n] - O_k[n]\big),$$
where $C_1$ is a constant independent of the optimization variables. For the UAV queue, using the same inequality produces
$$\frac{1}{2}\big(Z^{2}[n+1] - Z^{2}[n]\big) \le C_2 + Z[n]\Big(\sum_{k \in \mathcal{K}} O_k[n] - \sum_{k \in \mathcal{K}} D_k^{u}[n]\Big),$$
with $C_2$ as another constant term. Combining these results, we obtain the drift-plus-penalty bound
$$\Delta_V(\mathbf{\Xi}[n]) \le C + \mathbb{E}\big[G[n] \mid \mathbf{\Xi}[n]\big],$$
where $C = C_1 + C_2$ and $G[n]$ is defined as
$$G[n] = \sum_{k \in \mathcal{K}} Q_k[n]\big(A_k[n] - D_k^{l}[n] - O_k[n]\big) + Z[n]\Big(\sum_{k \in \mathcal{K}} O_k[n] - \sum_{k \in \mathcal{K}} D_k^{u}[n]\Big) - V\eta_{EE}[n].$$
Since $C_1$ and $C_2$ are constants, minimizing the bound reduces to solving the following per-slot optimization problem:
$$\mathbf{P2}: \min_{\mathbf{\Theta}_1[n], \mathbf{\Theta}_2[n], \mathbf{v}_k[n], \mathbf{v}[n], p_k[n], f_k^{l}[n], f_k^{u}[n]} G[n] \quad \text{s.t. } C1, C2, C5\text{–}C9.$$
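As a sanity check on the per-slot objective, the following sketch evaluates $G[n]$ for given queue states and decisions. The scale of $V$ is a tunable assumption, and in our framework this function would sit inside whatever per-slot decision-maker (here, the DRL agent) selects the control variables.

```python
import numpy as np

def drift_plus_penalty(Q, Z, arrivals, local_proc, offload, uav_proc,
                       secure_rates, energy, V=1e9):
    """Per-slot Lyapunov drift-plus-penalty objective G[n] to be minimized (P2).

    The queue-weighted terms favor decisions that drain long queues, while
    the -V * eta_EE term rewards secure energy efficiency; V trades one off
    against the other. V's magnitude here is a placeholder to be tuned.
    """
    eta_ee = secure_rates.sum() / energy                 # per-slot secure EE
    user_drift = np.sum(Q * (arrivals - local_proc - offload))
    uav_drift = Z * (offload.sum() - uav_proc.sum())
    return user_drift + uav_drift - V * eta_ee
```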
3.2. MDP Module
Given that the server onboard the UAV has stronger computing power and can run more service programs, the UAV is adopted as the agent in this scenario. The UAV only needs to acquire information from the current environmental state to make decisions, while the transition probability of such continuous states remains unknown; thus, the entire system can be modeled as a model-free MDP with unknown transition probabilities. The PPO algorithm is chosen to solve the UAV edge computing problem in this paper due to its significant advantages: it excels in dynamic decision-making, can quickly determine the model's optimization direction, and is well suited to the unknown transition probabilities of model-free MDPs. The interaction process between the agent and the environment can be modeled as follows: at a certain time step $t$, the agent acquires the environmental state $s_t$, executes an action $a_t$ according to its own policy, the environment transitions to the subsequent state $s_{t+1}$, and then the agent receives a reward $r_t$. The three basic elements of the MDP (state, action, and reward) are defined as follows:
3.2.1. State
The UAV can observe its own position and the positions of the RISs, and it queries users to obtain their positions and task information. Therefore, the state of the UAV at time $t$ can be expressed as
$$s_t = \Big\{\mathbf{q}[t],\ \mathbf{r}_1,\ \mathbf{r}_2,\ \big\{\mathbf{g}_k,\ A_k[t],\ Q_k[t]\big\}_{k \in \mathcal{K}},\ Z[t]\Big\}.$$
3.2.2. Action
The UAV needs to comprehensively consider the user situation, send offloading decision information to users, continuously adjust its position, and optimize the RIS phase shifts, transmission powers, and receive beamforming. The action of the UAV can be expressed as
$$a_t = \Big\{\mathbf{v}[t],\ \mathbf{\Theta}_1[t],\ \mathbf{\Theta}_2[t],\ \big\{p_k[t],\ f_k^{l}[t],\ f_k^{u}[t],\ \mathbf{v}_k[t]\big\}_{k \in \mathcal{K}}\Big\}.$$
3.2.3. Reward
After the UAV executes an action, a reward is given to the agent based on environmental feedback. The reward function includes the objective value and various penalties, and can be expressed as
$$r_t = -\omega_1 G[t] - \omega_2 P_f[t] - \omega_3 P_b[t],$$
where $G[t]$ is the per-slot objective of P2 (so that minimizing $G[t]$ maximizes the reward); $P_f[t]$ denotes the penalty when the computing frequency allocated by the UAV to users exceeds the maximum frequency $f_{\max}^{u}$; $P_b[t]$ denotes the penalty when the UAV flies out of the service area; and $\omega_1$, $\omega_2$, and $\omega_3$ are weight coefficients used to balance the orders of magnitude. The penalty terms can be calculated as follows:
$$P_f[t] = \mathrm{clip}\Big(\sum_{k \in \mathcal{K}} f_k^{u}[t] - f_{\max}^{u},\ 0,\ x_{\max}\Big), \qquad P_b[t] = \mathrm{clip}\big(d_{\mathrm{out}}[t],\ 0,\ x_{\max}\big),$$
where $d_{\mathrm{out}}[t]$ is the distance by which the UAV exceeds the service area boundary, and $\mathrm{clip}(x, a, b)$ represents the clipping function, which restricts the value of $x$ to the range $[a, b]$. If the penalty is excessively large, the reward clipping method can be adopted to stabilize the learning process.
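A minimal sketch of this shaped reward is given below; the weight values, clip bound, and the out-of-area distance input are placeholder assumptions that would be tuned per deployment, and the objective value is taken as, e.g., the negative per-slot drift-plus-penalty.

```python
import numpy as np

def reward(obj_value, f_uav, f_uav_max, dist_out,
           w=(1.0, 1e-9, 1e-2), clip_max=1e9):
    """Shaped reward: weighted objective minus clipped constraint penalties.

    obj_value : per-slot objective (e.g., -G[t], so larger is better)
    f_uav     : CPU frequencies the UAV allocated to each user
    dist_out  : distance by which the UAV left the service area (0 if inside)
    Weights and clip bound are illustrative order-of-magnitude balancers.
    """
    p_freq = np.clip(f_uav.sum() - f_uav_max, 0.0, clip_max)  # over-allocation
    p_bound = np.clip(dist_out, 0.0, clip_max)                # out-of-area
    w1, w2, w3 = w
    return w1 * obj_value - w2 * p_freq - w3 * p_bound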
3.3. PPO Algorithm Application
PPO is based on the trust region policy optimization algorithm; it introduces a clipping factor into the objective function to keep the magnitude of policy updates within a proximal range. As an on-policy algorithm, it requires less cache space during training and trains faster, making it well suited to small edge nodes such as UAVs; its training framework is illustrated in Figure 2. During the training process, the agent continuously interacts with the environment to acquire experiential information; after an episode concludes, it retrieves a new batch of collected experience from the experience buffer to update its policy, where for each update, the Actor network and the Critic network use the policy loss function and the state-value loss function as their respective objectives. The loss function of the Actor network is
$$L^{a}(\theta) = -\mathbb{E}_t\Big[\min\big(\rho_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(\rho_t(\theta),\ 1-\epsilon,\ 1+\epsilon\big)\hat{A}_t\big) + \beta\,\mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big)\Big],$$
where $\theta$ denotes the parameters of the Actor network; $\rho_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ represents the update ratio of the action distribution between the new and old policies; $\epsilon$ is the clipping factor that limits the update ratio; $\mathcal{H}(\pi_\theta(\cdot \mid s_t))$ denotes the policy entropy at state $s_t$; and $\hat{A}_t$ is the advantage function estimate, which typically adopts the Generalized Advantage Estimation (GAE) method, with its calculation given by
$$\hat{A}_t = \sum_{l=0}^{N-t}(\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t),$$
where $\hat{V}_t = \hat{A}_t + V_{\phi}(s_t)$ represents the target state value. The loss function corresponding to the state value $V_{\phi}(s_t)$ fitted by the Critic network is
$$L^{c}(\phi) = \mathbb{E}_t\Big[\big(V_{\phi}(s_t) - \hat{V}_t\big)^{2}\Big].$$
Based on the above analysis, this section proposes a weighted energy consumption optimization scheme based on PPO, whose training process is described in Algorithm 1.
Algorithm 1 Training Process of the Weighted Energy Consumption Optimization Scheme Based on PPO
Require: Maximum number of episodes $E_{\max}$, episode length $N$, number of sampling steps before an update $T_{\mathrm{up}}$, number of PPO update epochs $K_{\mathrm{ep}}$, discount factor $\gamma$, PPO clipping factor $\epsilon$, GAE parameter $\lambda$
Ensure: Trained Actor network $\pi_{\theta}$ and Critic network $V_{\phi}$
1: Initialize the parameters of the neural network models.
2: for Episode = 1 to $E_{\max}$ do
3:   Initialize the environment and obtain the initial state.
4:   for $t = 1$ to $N$ do
5:     The UAV agent obtains the state $s_t$ from the environment.
6:     The UAV agent executes the action $a_t$.
7:     The UAV agent evaluates the reward $r_t$.
8:     Store the experience $(s_t, a_t, r_t, s_{t+1})$ in the experience buffer.
9:   end for
10:  if Episode $\times\, N \ge T_{\mathrm{up}}$ then
11:    for epoch = 1 to $K_{\mathrm{ep}}$ do
12:      Update the Actor network parameter $\theta$ according to Equation (43).
13:      Update the Critic network parameter $\phi$ according to Equation (45).
14:    end for
15:  end if
16: end for
17: Return $\pi_{\theta}$, $V_{\phi}$
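To ground the update rules in code, here is a compact PyTorch sketch of GAE and the clipped PPO losses used in steps 12 and 13 of Algorithm 1. Network definitions, optimizers, and the rollout loop are omitted, and the hyperparameter defaults are illustrative assumptions.

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards : length-T sequence of scalar rewards
    values  : length-(T+1) tensor of V(s_t), including a bootstrap value
    """
    T = len(rewards)
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        last = delta + gamma * lam * last                       # discounted sum
        adv[t] = last
    return adv

def ppo_losses(new_logp, old_logp, adv, value_pred, value_target,
               clip_eps=0.2, ent_coef=0.01, entropy=None):
    """Clipped-surrogate actor loss and squared-error critic loss."""
    ratio = torch.exp(new_logp - old_logp)                # rho_t(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    actor_loss = -torch.min(unclipped, clipped).mean()    # pessimistic bound
    if entropy is not None:                               # exploration bonus
        actor_loss = actor_loss - ent_coef * entropy.mean()
    critic_loss = ((value_pred - value_target) ** 2).mean()
    return actor_loss, critic_loss
```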
3.4. Analysis of Algorithm Complexity
For a fully connected neural network with $J$ layers, the basic computational complexity of a single forward pass is $\mathcal{O}\big(\sum_{j=1}^{J-1} n_j n_{j+1}\big)$, where $n_j$ represents the number of neurons in the $j$-th layer. The Lyapunov method achieves a significant reduction in complexity by transforming the original high-dimensional non-convex NP-hard optimization problem into a deterministic subproblem per time slot. The original problem requires traversing a solution space that grows exponentially with the variable dimensions over the whole horizon, i.e., a complexity on the order of $\mathcal{O}(2^{Nd})$, where $d$ denotes the per-slot decision dimension. After introducing Lyapunov optimization, by constructing a drift-plus-penalty function to decouple the temporal coupling between variables and decomposing the cross-slot global optimization into slot-wise local decisions, the problem complexity is reduced to $\mathcal{O}(N \cdot \Psi)$, where $\Psi$ is the complexity of one per-slot decision, achieving an optimization from exponential to linear order in the horizon length.
The training complexity of the PPO algorithm is $\mathcal{O}\big(K_{\mathrm{ep}} \sum_{j} n_j n_{j+1}\big)$, where $K_{\mathrm{ep}}$ is the number of sample reuse rounds, and the single-step decision complexity is $\mathcal{O}\big(\sum_{j} n_j n_{j+1}\big)$. The DDPG algorithm, which maintains four networks (Actor, Critic, and their target networks) and requires noise generation, has a complexity of $\mathcal{O}\big(4\sum_{j} n_j n_{j+1} + C_{\tau} + C_{\nu}\big)$, where $C_{\tau}$ is the soft target update overhead and $C_{\nu}$ is the noise calculation overhead. The DQN algorithm, constrained by the maintenance of the experience replay buffer, has a complexity of $\mathcal{O}\big(\sum_{j} n_j n_{j+1} + B\big)$, where $B$ is the replay buffer capacity. In comparison, PPO reduces the environmental interaction cost by a factor of $K_{\mathrm{ep}}$ through $K_{\mathrm{ep}}$-fold sample reuse and eliminates the additional overhead of target network updates (the $C_{\tau}$ term) and replay buffer maintenance (the $B$ term), demonstrating significant complexity advantages in continuous action spaces.
4. Result Analysis
We evaluate the proposed algorithm from the perspectives of average queue backlog and average total secure energy efficiency through extensive numerical experiments. A schematic diagram of the simulated 3D scenario is illustrated in Figure 3. The scenario involves a UAV, two collaborative RISs (RIS1 and RIS2), 10 GUs randomly distributed within a 1000 × 1000 m² area, and a single-antenna eavesdropper at a fixed location $\mathbf{e}$. The set of all GUs is denoted as $\mathcal{K} = \{1, 2, \ldots, 10\}$. RIS1 is deployed at fixed coordinates $\mathbf{r}_1$ and RIS2 at $\mathbf{r}_2$, with the same deployment height for both RISs. The UAV is equipped with a uniform rectangular array (URA) antenna, with a fixed initial position and a maximum speed $v_{\max}$. Both RIS1 and RIS2 adopt a 20 × 20 array structure, i.e., $M_1 = M_2 = 400$ passive reflecting elements each. The entire mission duration $T$ is discretized into $N$ equal time slots of duration $\delta$. These and the other key simulation parameters are summarized in Table 1.
Figure 4 compares the convergence curves of different DRL algorithms in the scenario with 10 users (K = 10), where the vertical axis represents the average cumulative reward per episode and the horizontal axis represents the number of training steps. The results indicate that, as training progresses, the average cumulative reward of all four algorithms trends upward, demonstrating that DRL methods can optimize their strategies through continuous learning so that system performance gradually improves and stabilizes. Specifically, the proposed PPO algorithm exhibits the best performance: it converges significantly faster than the Advantage Actor–Critic (A2C) and Deep Deterministic Policy Gradient (DDPG) algorithms, rises rapidly in the early stage of training, enters the stable phase earlier, and ultimately achieves the highest cumulative reward. In contrast, the DDPG algorithm shows the largest reward fluctuations and slower convergence, mainly because its deterministic action output mechanism yields an unstable policy gradient update direction and limits exploration in continuous action spaces. The Deep Q-Network (DQN) algorithm, which is natively suited to discrete action spaces, performs the worst on the continuous resource allocation problem considered here; it converges slowly and attains the lowest cumulative reward after stabilization, further confirming its limitations in continuous action spaces. These results demonstrate that the PPO algorithm, relying on its clipped policy updates and efficient sample utilization, offers better convergence stability and policy optimization capability than the other three DRL algorithms in complex continuous action space problems, providing a reliable solution for the model proposed in this paper.
Figure 5 demonstrates the superiority of the PPO algorithm in the double-RIS cooperative UAV-MEC system. In Figure 5a, as the number of training episodes increases from 0 to 1000, the task queue lengths of the four algorithms (PPO, A2C, DDPG, and DQN) all decrease. The PPO algorithm exhibits the most pronounced decline: it drops rapidly in the initial stage, maintains the lowest queue level throughout, and finally stabilizes near 8.5 × 10⁸ bits with very small fluctuations. Although the A2C algorithm also trends downward, its queue length at stabilization is significantly higher than that of PPO. The DDPG and DQN algorithms decrease more slowly; DQN in particular not only starts with the highest queue length (close to 1.15 × 10⁹ bits) but also remains at a relatively high level after stabilizing, showing a significant gap relative to PPO. In Figure 5b, the secure energy efficiency of all four algorithms increases with the number of training episodes, and PPO again performs best: it rises rapidly from the start, stays ahead of the other algorithms, and finally stabilizes near 4.9 × 10⁸ bits/J. Although A2C follows a similar growth trend, its energy efficiency at stabilization is clearly lower than PPO's. DDPG and DQN grow relatively slowly, with DQN showing the lowest energy efficiency throughout. Overall, in both the reduction of task queue length and the improvement of secure EE, the PPO algorithm clearly surpasses A2C, DDPG, and DQN, showing that it can more efficiently balance task processing and resource allocation in this scenario and confirming its advantages in complex continuous action space optimization problems.
Figure 6 compares the queue stability performance of the UAV-MEC system under different RIS assistance schemes, where the vertical axis represents the queue length and the horizontal axis represents the number of time slots. The experimental results show that the user queue length is significantly longer than the UAV queue length, mainly because the UAV has stronger computing processing capabilities and higher task processing efficiency, resulting in a smaller queue backlog. In terms of the time evolution trend, as the number of time slots increases, the user queues of all schemes exhibit a characteristic of “rapid growth–slow growth–stabilization”, while the UAV queues show a trend of “gradual growth–stabilization”. In the initial stage, limited by channel conditions, the efficiency of user task offloading is low, leading to rapid accumulation of queues; as the system operation enters a stable period, the amount of task generation reaches a dynamic balance with the amount of local and edge processing, and the queue length maintains stable fluctuations. The performance differences between different RIS assistance schemes are particularly significant: the double-RIS cooperative scheme has the shortest user queue length and the longest UAV queue length, which benefits from the higher channel gain generated by the cooperative reflection of multiple RISs, enabling more user tasks to be efficiently offloaded to the UAV and significantly reducing the backlog at the user end; the double-RIS non-cooperative scheme, which does not utilize the cooperative gain between RISs, has a slightly longer user queue length than the cooperative scheme, and the UAV queue is correspondingly shorter; the single-RIS scheme, due to the lowest channel gain and limited task offloading capability, results in the longest user queue length and the shortest UAV queue length. These results indicate that the RIS cooperation mechanism directly affects the task offloading strategy by improving channel quality, thereby exerting a significant regulatory effect on queue stability, verifying the advantage of multi-RIS cooperation in balancing the loads of users and the UAV.
Figure 7 presents the dynamic variation trend of the system's secure EE under different RIS assistance schemes, where the vertical axis represents the secure EE value and the horizontal axis represents the number of time slots. The experimental results show that the system's secure EE exhibits an evolutionary pattern of "initial fluctuation–gradual improvement–stabilization" during operation: in the initial stage, limited by channel conditions, the system's secure EE is at a relatively low level; as the time slots progress, the secure EE gradually improves and finally remains within a stable range, which is closely related to the optimization of transmission efficiency after the channel gain stabilizes. There are significant differences in secure EE among the RIS assistance schemes: the system's secure EE increases in turn from the single-RIS scheme to the double-RIS non-cooperative scheme to the double-RIS cooperative scheme. Specifically, the double-RIS cooperative scheme, relying on the collaborative reflection of the two RISs, maximizes the channel gain and reduces transmission energy consumption while increasing the amount of securely transmitted data, thus achieving the highest secure EE; the double-RIS non-cooperative scheme, which does not fully exploit the cooperative gain between RISs, attains a slightly lower secure EE; and the single-RIS scheme, due to limited channel gain and low secure transmission efficiency, performs worst. These results verify that the multi-RIS cooperative mechanism improves secure EE, indicating that the joint optimization of the system's security performance and EE can be effectively achieved by configuring the RISs to enhance channel gain.
Figure 8 illustrates the relationship between the average user queue length and the system's average secure EE under different values of the control factor $V$, where the horizontal axis represents the control factor $V$, the left vertical axis denotes the average user queue length, and the right vertical axis indicates the system's average secure EE. The experimental results clearly demonstrate the regulatory effect of the control factor $V$ on system performance, revealing the dynamic trade-off between queue stability and secure EE. Specifically, as $V$ increases, the average user queue length shows a significant upward trend, while the system's average secure EE improves simultaneously. The core mechanism behind this phenomenon is that $V$, as the key parameter in the Lyapunov optimization framework for balancing queue stability and system utility, directly determines the weight allocation between these two objectives in the optimization process. When $V$ is small, the system prioritizes queue stability, suppressing task backlogs to guarantee service quality; the queue length then remains low, but the secure EE is mediocre because it receives little optimization weight. When $V$ increases, the weight of the secure EE term rises significantly and the optimization objective shifts toward enhancing secure EE; to pursue higher secure transmission efficiency and energy utilization, the system relaxes its suppression of queue backlogs, leading to more task accumulation at the user end and thus a longer average queue. This behavior confirms the flexibility of the proposed optimization framework: by adjusting $V$, queue stability and secure EE can be traded off precisely to meet the performance requirements of different scenarios; for instance, in scenarios with strict real-time requirements, a smaller $V$ can be selected to prioritize queue stability, while in scenarios sensitive to security and EE, $V$ can be increased to obtain higher secure EE at the cost of some queue stability. These results demonstrate the effectiveness of the control factor $V$ in coordinating the system's multi-objective optimization and provide useful guidance for parameter configuration in practical applications.
To further explore the impact of other factors on the system's secure EE, Figure 9 provides quantitative insights through comparative experiments. Figure 9a focuses on the task generation rate (horizontal axis, in Mbits) versus the average secure EE (vertical axis). The experimental data show that the secure EE of all four schemes (the proposed scheme, the A2C algorithm scheme, the double-RIS non-cooperative scheme, and the single-RIS scheme) decreases as the task generation rate increases from 0.6 Mbits to 1.4 Mbits, with the proposed scheme achieving the highest secure EE over the whole range and the single-RIS scheme the lowest. Figure 9b examines the effect of the number of RIS elements (horizontal axis, total number of elements). The results indicate that the secure EE of all schemes increases as the number of RIS elements grows from 200 to 1800, with the proposed scheme again attaining the highest values and the single-RIS scheme the lowest. These results highlight the superiority of the proposed scheme: in Figure 9a, it maintains the highest secure EE across all task generation rates, as the PPO algorithm dynamically optimizes resource allocation in continuous action spaces while the enhanced channel gain from double-RIS cooperation ensures stable secure transmission efficiency. In Figure 9b, the proposed scheme achieves the most significant improvement with increasing RIS elements, thanks to its collaborative phase shift optimization that fully leverages the hardware potential. Compared with the other three schemes, the proposed approach consistently performs best, underscoring its robust adaptability to varying environmental factors.