Article

Research on Optimization of RIS-Assisted Air-Ground Communication System Based on Reinforcement Learning

1 Key Laboratory of Information and Communication Systems, Ministry of Information Industry, Beijing Information Science and Technology University, Beijing 100101, China
2 Key Laboratory of Modern Measurement & Control Technology, Ministry of Education, Beijing Information Science and Technology University, Beijing 100101, China
3 Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(20), 6382; https://doi.org/10.3390/s25206382
Submission received: 22 August 2025 / Revised: 10 October 2025 / Accepted: 14 October 2025 / Published: 16 October 2025

Abstract

In urban emergency communication scenarios, building obstructions can degrade the performance of base station (BS) communication networks. To address this issue, this paper proposes an air-ground wireless network enabled by an unmanned aerial vehicle (UAV) and assisted by a reconfigurable intelligent surface (RIS), which enhances the efficacy of UAV-enabled multiple-input single-output (MISO) networks. The UAV is treated as an intelligent agent that moves in 3D space, senses changes in the channel environment, and adopts zero-forcing (ZF) precoding to eliminate interference among ground users. Meanwhile, the UAV movement, the RIS phase shifts, and the power allocation among users are jointly designed. We propose two deep reinforcement learning (DRL) algorithms, termed D3QN-WF and DDQN-WF. Simulation results indicate that D3QN-WF achieves a 15.9% higher sum rate and 50.1% greater throughput than the DDQN-WF baseline, while also converging significantly faster.

1. Introduction

One of the visions of the sixth generation (6G) is to realize the evolution from the Internet of Everything to the Intelligent Internet of Everything on the basis of the fifth generation (5G), which poses great challenges to traditional terrestrial cellular networks [1]. Compared with conventional terrestrial wireless communication, unmanned aerial vehicle (UAV) communication has significant advantages, such as high mobility, uninterrupted line-of-sight connectivity, and strong perception ability; wireless networks empowered by UAVs are therefore one of the key technologies for improving communication quality [2,3,4]. Meanwhile, UAVs can be equipped with sensing and ranging modules to accurately measure the distance between obstacles and ground equipment, ensuring flight safety, providing a sensing range, and delivering high-quality communication services [5,6].
The upcoming 6G-driven Internet of Things (IoT) will face great challenges in terms of extremely low power demand, high transmission reliability, massive connectivity, and physical-layer security [7,8,9]. To address the limitations of traditional communication technologies imposed by hardware cost and channel fading, the reconfigurable intelligent surface (RIS) has emerged as a feasible solution to various challenges in future wireless networks. The breakthrough of RIS technology lies in transforming the communication environment into a programmable medium: its two-dimensional metamaterial surface reconstructs the electromagnetic field distribution via dynamic phase adjustment and regulates electromagnetic waves in a programmable manner [10]. Studies have confirmed that, compared with traditional amplify-and-forward (AF) relays, RIS can provide substantial energy efficiency improvements for wireless networks [11,12,13]. The low cost of RIS and the high mobility of UAVs together offer low power consumption and high spectral efficiency, making them important components of future communication networks. When an RIS is deployed on buildings or UAV platforms, it can regulate reflection paths, reshape the radio environment in three-dimensional space, and achieve signal enhancement [14,15]. Moreover, studies have shown that a network composed of RIS and UAVs can achieve a more flexible structure and a higher communication rate [16,17]. However, traditional studies have focused on the offline optimization of static RIS reflection parameters and of UAV hovering positions. With the evolution of future wireless networks, RIS-UAV communication systems need the ability to actively sense and reconstruct the environment.
Future communication networks will integrate various emerging technologies, which will increase the complexity of optimizing throughput and energy efficiency. Under this complexity, traditional algorithms struggle to find optimal solutions or prove infeasible. Machine learning (ML) is widely used as a powerful tool to enhance the performance of wireless networks, especially large-scale networks, enabling effective optimization in dynamic environments. With the rapid development of ML, deep reinforcement learning (DRL), as a branch of this field, provides an alternative approach to solving complex optimization problems [18,19]. DRL combines neural networks with reinforcement learning (RL) and drives algorithm design with the rewards provided by the environment. UAVs can therefore be regarded as agents that acquire optimal strategies by learning from feedback obtained through continuous interaction with the environment [20,21]. The vision of future 6G and its network architecture is to establish an intelligent communication ecosystem: by deeply integrating RIS-UAV technology with DRL, a high-efficiency air-ground integrated wireless communication network can be built to meet the needs of future intelligent communications.

1.1. Related Work

In recent years, research on RIS- and UAV-assisted communication technologies can be divided into two main types. The first deploys the RIS directly on UAVs as a payload to form RIS-UAV systems; air-ground networks with RIS-UAV-assisted base stations (BS) are introduced in detail in [22,23,24,25,26], where such RIS-UAVs act as mobile airborne relays. Although the RIS-UAV scheme achieves blind-area coverage, it faces significant communication-quality challenges in urban emergency communications and in densely populated areas. The second type deploys the RIS on ground buildings and uses the UAV as an aerial BS. This configuration can provide ground users with better information transmission or data collection services. In comparison, through flexible deployment, the RIS-assisted BS-UAV significantly reduces the load on ground BSs and improves communication quality in densely populated areas. Therefore, this paper adopts the RIS-assisted BS-UAV system model to improve network performance.
Compared with traditional ground BSs, UAVs are more convenient to deploy. Liu et al. utilized an RIS-assisted BS-UAV and adopted a DRL approach to optimize the 2D coordinates of the UAV and reduce its energy consumption, overcoming the limitation that traditional algorithms struggle to find, or cannot obtain, optimal solutions in high-dimensional convex optimization [27]. Ahmad et al. adopted prioritized experience replay to optimize the phase angles so as to improve users' satisfaction with the quality of service of communication systems [28]. Hu et al. utilized the Double Deep Q-Network (DDQN) to optimize the 3D trajectory of UAVs, aiming to maximize the system data throughput [29]. Mei et al. proposed a method that requires no pre-prepared training data; instead, environmental modeling provides reward feedback to the BS-UAV to optimize its trajectory and improve the communication rate [30]. Khalili et al. [31] optimized the UAV trajectory and subcarrier allocation with the help of Dueling DQN to improve the performance of heterogeneous networks supported by dual connectivity. Regarding the complex optimization problems in RIS- and UAV-assisted air-ground wireless networks, Nguyen et al. summarized existing DRL-based solution methods and pointed out the challenges faced by future research [32].
In the aforementioned works, reference [25] optimizes the UAV coordinates only in two-dimensional space, without considering three-dimensional coordinate optimization; as a result, the advantages of UAVs, such as high mobility, are not fully exploited. In [26], although RIS phase-shift optimization is addressed, the potential of RIS is not fully exploited through DRL. In [27,28], although DQN and DDQN are used to optimize the RIS phase shifts and the UAV trajectory, the optimized coordinates are still limited to two dimensions. In [29,30,31], although DRL methods are adopted, power discretization, or the failure to consider power allocation at all, leads to an overall decline in network performance. Moreover, traditional DRL algorithms with discrete action spaces have significant limitations: the value-overestimation phenomenon in DQN and Dueling DQN can cause decision biases when channel states change abruptly, while the Q-value oscillation problem in DDQN restricts the stability of UAV trajectory planning. To address these bottlenecks, the Dueling Double Deep Q-Network (D3QN) achieves a breakthrough through architectural integration: it combines the decomposition into state value and advantage value from Dueling DQN with the dual-network structure of DDQN. This enables accurate estimation of the Q-values of all agent actions, effectively mitigates the overestimation problem, and offers fast and stable convergence.
Therefore, a deep reinforcement learning framework is integrated into the RIS-assisted UAV communication model to construct collaborative optimization over a multi-dimensional action space. The sum-rate expression for multiple users under this model is established, and an optimization algorithm based on the D3QN framework (Dueling Double Deep Q-Network water-filling algorithm, D3QN-WF) is proposed. The RIS changes its phases to reconstruct the channel state; the UAV perceives environmental changes and adjusts its 3D coordinates through decision-making; and by optimizing the UAV transmit power, the system sum rate is maximized and the throughput during communication is improved.

1.2. Contributions

  • A joint optimization framework for RIS-assisted UAV air-to-ground wireless communication networks under 3D spatial coordinates is proposed. This framework significantly improves the network system sum-rate and throughput during communication by collaboratively optimizing the 3D spatial coordinates of UAVs, the RIS phase shift matrix, and the UAV transmission power, thus giving full play to the synergistic potential of RIS and UAVs;
  • A joint optimization method is proposed that combines water-filling-based BS transmit power optimization with D3QN. For the convex BS transmit power subproblem, the water-filling algorithm provides an efficient solution; meanwhile, the D3QN algorithm is used to solve the optimization problem over the discrete action space, jointly optimizing the 3D coordinates of the UAV and the RIS phases and effectively overcoming the limitations of traditional methods in high-dimensional non-convex optimization problems;
  • Detailed verification results are provided to demonstrate the effectiveness of the proposed algorithm in improving the system sum-rate and throughput. Simulation results show that the proposed algorithm has significant advantages in rate improvement. In addition, compared with the DDQN-WF, the D3QN-WF algorithm shows obvious advantages in handling multi-dimensional action spaces, with faster convergence and higher stability. This method increases the system sum-rate by 15.9% and the throughput by 50.1%, providing new ideas for the dynamic optimization of future intelligent communication networks.

2. System Model

Figure 1 depicts an RIS-assisted BS-UAV air-to-ground intelligent communication network. An RIS with N reflecting elements is deployed on a high-rise building to facilitate reconstruction of the channel environment. The UAV, equipped with M antennas, serves as the BS and communicates with single-antenna users; it has a sensing antenna at its bottom to detect ground users, with a maximum sensing elevation angle of ω. By perceiving the environment, associating with users, and collecting real-time channel state information (CSI), the UAV adjusts its 3D position. When the UAV detects a user, both a direct link and a cascaded link exist between the UAV and that user. The direct link includes both line-of-sight (LoS) and non-line-of-sight (NLoS) components. For the cascaded link, only a LoS link exists between the UAV and the RIS, while both LoS and NLoS links exist between the RIS and the users. When a user is outside the detection range of the UAV, the RIS is used to cover the blind area; in this case, only the cascaded link exists between the UAV and that ground user.
It is assumed that the system's Area of Interest (AoI) is discretized into multiple cells of equal size, and the coordinates of the center of cell i can be expressed as $L_i^c = [x_i, y_i, z_i] \in \mathbb{R}^{3\times 1}$. The variables $x_b$, $y_b$, and $z_b$ represent the distances between adjacent cells along the x-, y-, and z-axes, respectively. The total number of time slots is T, and the three-dimensional coordinates of the UAV at time slot t can be expressed as $Q_t = [x_t, y_t, z_t] \in L^c$, where $t \in \{1, 2, \ldots, T\}$. Ground users move within a small range and are evenly distributed on one side of the RIS; their coordinates are $(x_k^t, y_k^t, z_k^t)$. In multi-target communication networks, the constraint on the UAV's sensing range is a key practical factor. In each time slot, the UAV needs to select targets within its coverage area from all potential IoT devices, according to its current flight path, for data transmission. A sensing communication variable $a_k$ is defined, whose coverage is related to the height and sensing angle of the UAV:
$$a_k = \begin{cases} 1, & R \le H\tan\omega \\ 0, & R > H\tan\omega \end{cases} \qquad (1)$$
where R is the horizontal distance between the UAV and the user, ω is the sensing angle, and H is the height of the UAV. When $a_k = 1$, the user is within the sensing range.
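For illustration, the sensing indicator in (1) can be computed directly from the geometry; the following is a minimal Python sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def sensing_indicator(uav_pos, user_pos, omega):
    """Return a_k: 1 if the user lies inside the UAV's sensing cone, else 0."""
    R = np.hypot(uav_pos[0] - user_pos[0], uav_pos[1] - user_pos[1])  # horizontal distance
    H = uav_pos[2] - user_pos[2]                                      # UAV height above the user
    return 1 if R <= H * np.tan(omega) else 0

# Example: UAV at 100 m altitude, user 60 m away horizontally, 45-degree sensing angle.
print(sensing_indicator((0.0, 0.0, 100.0), (60.0, 0.0, 0.0), np.pi / 4))  # -> 1
```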

2.1. Channel Model

The UAV, acting as a BS, moves gradually toward the users. When user k is within the UAV's sensing range, the channel gain of the direct link between the aerial BS and ground user k, $\mathbf{h}_{u,k}^H \in \mathbb{C}^{1\times M}$, can be modeled as a Rician channel:
$$\mathbf{h}_{u,k}^{H} = \sqrt{\beta d_{u,k}^{-\alpha}}\left(\sqrt{\tfrac{\hat{R}}{1+\hat{R}}}\,\mathbf{h}_{u,k}^{\mathrm{LoS},H} + \sqrt{\tfrac{1}{1+\hat{R}}}\,\mathbf{h}_{u,k}^{\mathrm{NLoS},H}\right) \qquad (2)$$
where $\mathbf{h}_{u,k}^{\mathrm{LoS},H}$ and $\mathbf{h}_{u,k}^{\mathrm{NLoS},H}$ denote the fast-fading components of the LoS and NLoS paths between the UAV and user k, respectively, and $\hat{R}$ is the Rician factor. The path loss between the BS and the k-th user is $\beta d_{u,k}^{-\alpha}$, where β denotes the channel gain at a reference distance of 1 m, $\alpha \ge 2$ is the path-loss exponent, and $d_{u,k}$ is the distance between the UAV and user k. The cascaded channel consists of two links: the UAV-RIS link and the RIS-user link. In the UAV-RIS link, only the LoS component $\mathbf{H}_{u,r}^H \in \mathbb{C}^{N\times M}$ exists, which can be expressed as:
$$\mathbf{H}_{u,r}^{H} = \sqrt{\beta d_{u,r}^{-\alpha}}\,\sqrt{\tfrac{\hat{R}}{1+\hat{R}}}\,\mathbf{H}_{u,r}^{\mathrm{LoS},H} \qquad (3)$$
where $\mathbf{H}_{u,r}^{\mathrm{LoS},H} \in \mathbb{C}^{N\times M}$ is the fast-fading LoS component between the UAV and the RIS, and $d_{u,r}$ is their Euclidean distance. Both LoS and NLoS propagation exist in the RIS-user link, so it is also modeled as a Rician channel, denoted $\mathbf{h}_{r,k}^H \in \mathbb{C}^{1\times N}$. The channel gain of the RIS-user link is:
$$\mathbf{h}_{r,k}^{H} = \sqrt{\beta d_{r,k}^{-\alpha}}\left(\sqrt{\tfrac{\hat{R}}{1+\hat{R}}}\,\mathbf{h}_{r,k}^{\mathrm{LoS},H} + \sqrt{\tfrac{1}{1+\hat{R}}}\,\mathbf{h}_{r,k}^{\mathrm{NLoS},H}\right) \qquad (4)$$
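As a concrete illustration of (2)-(4), the sketch below composes a Rician-faded link in numpy. It assumes (these conventions are ours, not stated above) that the LoS fast-fading component is a given unit-modulus vector and that the NLoS component has i.i.d. CN(0, 1) entries; for the pure-LoS UAV-RIS link of (3), the NLoS term is simply dropped.

```python
import numpy as np

def rician_link(h_los, d, beta, alpha, K_rice, rng, los_only=False):
    """Compose a Rician channel with distance-dependent path loss beta * d^(-alpha)."""
    path_gain = beta * d ** (-alpha)
    los_part = np.sqrt(K_rice / (1.0 + K_rice)) * h_los
    if los_only:                       # UAV-RIS link of (3): LoS component only
        return np.sqrt(path_gain) * los_part
    h_nlos = (rng.standard_normal(h_los.shape)
              + 1j * rng.standard_normal(h_los.shape)) / np.sqrt(2.0)
    return np.sqrt(path_gain) * (los_part + np.sqrt(1.0 / (1.0 + K_rice)) * h_nlos)

rng = np.random.default_rng(0)
M = 4                                                    # UAV antennas (illustrative)
h_los = np.exp(1j * 2 * np.pi * rng.random(M))           # assumed unit-modulus LoS phases
h_uk = rician_link(h_los, d=120.0, beta=1e-4, alpha=4.0, K_rice=10**3.3, rng=rng)
```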

2.2. Downlink Signal Transmission Modeling and Optimization

Consider that the UAV transmits linearly precoded signals; the transmitted signal is:
$$\mathbf{x} = \sum_{k=1}^{K}\sqrt{p_k}\,\mathbf{w}_k s_k \qquad (5)$$
where $p_k$, $k \in K$, is the transmit power allocated by the UAV to user k, $s_k$ is the unit-power complex transmit symbol, and $\mathbf{w}_k \in \mathbb{C}^{M\times 1}$ is the precoding direction vector of the k-th user. When the UAV can detect a user, the communication quality is improved through the direct link; when the user is not within the UAV's sensing range, the blind area is compensated through the cascaded RIS link. Therefore, the signal received by the k-th user can be expressed as:
$$y_k = \left(a_k\mathbf{h}_{u,k}^{H} + \mathbf{h}_{r,k}^{H}\boldsymbol{\Phi}\mathbf{H}_{u,r}^{H}\right)\sum_{i=1}^{K}\sqrt{p_i}\,\mathbf{w}_i s_i + n_k \qquad (6)$$
where the RIS phase-shift matrix is denoted by $\boldsymbol{\Phi} \triangleq \mathrm{diag}[\phi_1, \phi_2, \ldots, \phi_N]$, and the phase of the n-th reflecting element is $\phi_n = e^{j\theta_n}$, $n = 1, 2, \ldots, N$, $\theta_n \in [0, 2\pi)$. $n_k \sim \mathcal{CN}(0, \sigma^2)$ is the additive white Gaussian noise. According to Formula (6), the received signal-to-interference-plus-noise ratio (SINR) at the k-th user can be expressed as:
$$\hat{\gamma}_k = \frac{p_k\left|\left(a_k\mathbf{h}_{u,k}^{H} + \mathbf{h}_{r,k}^{H}\boldsymbol{\Phi}\mathbf{H}_{u,r}^{H}\right)\mathbf{w}_k\right|^2}{\sum_{i=1, i\neq k}^{K} p_i\left|\left(a_i\mathbf{h}_{u,i}^{H} + \mathbf{h}_{r,i}^{H}\boldsymbol{\Phi}\mathbf{H}_{u,r}^{H}\right)\mathbf{w}_i\right|^2 + \sigma^2} \qquad (7)$$
where $\sum_{i=1, i\neq k}^{K} p_i\left|\left(a_i\mathbf{h}_{u,i}^{H} + \mathbf{h}_{r,i}^{H}\boldsymbol{\Phi}\mathbf{H}_{u,r}^{H}\right)\mathbf{w}_i\right|^2$ is the interference power received by user k. The UAV departs from its starting point, plans its trajectory, and searches for the optimal deployment position that maximizes the system communication performance. For fairness, each user is allocated a bandwidth of B. Therefore, the optimization problem for the system sum rate over the UAV's deployment position can be expressed as:
$$\begin{aligned}
\mathrm{P1}:\ \max_{\mathbf{P},\,\boldsymbol{\Phi},\,\mathbf{Q}}\ & R = \sum_{k=1}^{K} B\log_2\!\left(1+\hat{\gamma}_k\right) && (8)\\
\mathrm{s.t.}\ & a_k \in \{0, 1\} && (8\mathrm{a})\\
& \left|e^{j\theta_n}\right| = 1,\ n = 1, 2, \ldots, N && (8\mathrm{b})\\
& \textstyle\sum_{k=1}^{K} p_k = P && (8\mathrm{c})\\
& p_k \ge 0,\ k = 1, 2, \ldots, K && (8\mathrm{d})\\
& x_{\min} \le x_t \le x_{\max},\ t = 1, 2, \ldots, T && (8\mathrm{e})\\
& y_{\min} \le y_t \le y_{\max},\ t = 1, 2, \ldots, T && (8\mathrm{f})\\
& z_{\min} \le z_t \le z_{\max},\ t = 1, 2, \ldots, T && (8\mathrm{g})\\
& \left|x_{t+1} - x_t\right|^2 \le x_b,\ t = 1, 2, \ldots, T-1 && (8\mathrm{h})\\
& \left|y_{t+1} - y_t\right|^2 \le y_b,\ t = 1, 2, \ldots, T-1 && (8\mathrm{i})\\
& \left|z_{t+1} - z_t\right|^2 \le z_b,\ t = 1, 2, \ldots, T-1 && (8\mathrm{j})
\end{aligned}$$
where constraint (8a) represents the sensing state of the direct link between the UAV and the k-th user; (8b) ensures that each RIS reflecting element only changes the phase of the signal without altering its amplitude; (8c) ensures that the transmit power of the BS after precoding equals P; (8d) ensures that no user is allocated negative power; (8e)-(8g) limit the 3D flight region of the UAV to prevent it from leaving the boundary; and (8h)-(8j) limit the displacement of the UAV in each time slot. Since the optimization of the UAV trajectory $\mathbf{Q}$ and the RIS phase-shift matrix $\boldsymbol{\Phi}$ is covered in Section 3, this subsection only discusses the user power allocation $\mathbf{P}$ in Formula (8). In order to suppress channel interference in the multi-user scenario, the zero-forcing (ZF) linear precoding algorithm is adopted to eliminate signal interference between users [27]:
$$\left(a_i\mathbf{h}_{u,i}^{H} + \mathbf{h}_{r,i}^{H}\boldsymbol{\Phi}\mathbf{H}_{u,r}^{H}\right)\mathbf{w}_k = 0,\quad \forall i \neq k,\ i \in K. \qquad (9)$$
Let $\mathbf{h}_k^H = a_k\mathbf{h}_{u,k}^{H} + \mathbf{h}_{r,k}^{H}\boldsymbol{\Phi}\mathbf{H}_{u,r}^{H}$ denote the combined direct and cascaded channel gain of user k, where $\mathbf{h}_k^H \in \mathbb{C}^{1\times M}$. The global channel matrix can then be written as $\mathbf{H}^H = (\mathbf{h}_1, \ldots, \mathbf{h}_K)^H$, where $\mathbf{H} \in \mathbb{C}^{M\times K}$, and the transmit precoding matrix as $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_K]$, where $\mathbf{W} \in \mathbb{C}^{M\times K}$. Under condition (9), the ZF precoding matrix is obtained from $\mathbf{H}^H\mathbf{W} = \mathbf{E}$, and $\mathbf{W}$ can be expressed as:
$$\mathbf{W} = \mathbf{H}\left(\mathbf{H}^H\mathbf{H}\right)^{-1} \qquad (10)$$
where $\mathbf{W}$ is the right pseudoinverse of $\mathbf{H}^H$. After ZF precoding, there is no interference between users. To solve the optimal power allocation, the beam vector $\mathbf{w}_k$ of the k-th user in the precoding matrix $\mathbf{W}$ is normalized:
$$\mathbf{w}_k^{*} = \frac{\mathbf{w}_k}{\left\|\mathbf{w}_k\right\|_2} \qquad (11)$$
According to Formula (11), $\mathbf{w}_k^{*}$ is the beam direction of user k and $p_k$ is the transmit power allocated to user k; therefore $\sum_{k=1}^{K} p_k\left\|\mathbf{w}_k^{*}\right\|^2 = P$. The SINR of the k-th user then becomes $\gamma_k = p_k\left|\left(a_k\mathbf{h}_{u,k}^{H} + \mathbf{h}_{r,k}^{H}\boldsymbol{\Phi}\mathbf{H}_{u,r}^{H}\right)\mathbf{w}_k^{*}\right|^2 / \sigma^2$. Since the bandwidth allocated to each user is the same, the power allocation subproblem is equivalent to:
$$\begin{aligned}
\mathrm{P2}:\ \max_{\mathbf{P}}\ & \sum_{k=1}^{K}\ln\!\left(1+\frac{p_k\xi_k}{\sigma^2}\right) && (12)\\
\mathrm{s.t.}\ & \textstyle\sum_{k=1}^{K} p_k = P && (12\mathrm{a})\\
& p_k \ge 0,\ k = 1, 2, \ldots, K && (12\mathrm{b})
\end{aligned}$$
where $\xi_k = \left|\left(a_k\mathbf{h}_{u,k}^{H} + \mathbf{h}_{r,k}^{H}\boldsymbol{\Phi}\mathbf{H}_{u,r}^{H}\right)\mathbf{w}_k^{*}\right|^2$. This problem can be solved with the water-filling algorithm to allocate the transmit power among users; its Lagrangian function is:
$$\mathcal{L}\left(p_k, v_k, \mu\right) = \sum_{k=1}^{K}\ln\!\left(1+\frac{p_k\xi_k}{\sigma^2}\right) + \sum_{k \in K} v_k p_k - \mu\!\left(\sum_{k=1}^{K} p_k - P\right)$$
where $v_k$ is the Lagrange multiplier corresponding to the inequality constraint $p_k \ge 0$, and μ is the Lagrange multiplier corresponding to the equality constraint $\sum_{k=1}^{K} p_k = P$. The KKT necessary conditions for this optimization problem are:
$$\begin{aligned}
& p_k^{*} \ge 0 && (13\mathrm{a})\\
& \textstyle\sum_{k=1}^{K} p_k^{*} = P && (13\mathrm{b})\\
& v_k^{*} \ge 0 && (13\mathrm{c})\\
& v_k^{*} p_k^{*} = 0 && (13\mathrm{d})\\
& \left.\frac{\partial \mathcal{L}\left(p_k, v_k, \mu\right)}{\partial p_k}\right|_{\left(p_k^{*}, v_k^{*}, \mu^{*}\right)} = \frac{\xi_k}{\sigma^2 + \xi_k p_k^{*}} + v_k^{*} - \mu^{*} = 0 && (13\mathrm{e})
\end{aligned}$$
where $p_k^{*}$ is the optimal power allocation, and $v_k^{*}$ and $\mu^{*}$ are the optimal Lagrange multipliers. From (13c) and (13e), the following inequality is obtained:
$$\mu^{*} \ge \frac{\xi_k}{\sigma^2 + \xi_k p_k^{*}},\quad k = 1, \ldots, K \qquad (14)$$
When $\mu^{*} < \xi_k/\sigma^2$, combining this with (14) yields
$$\frac{\xi_k}{\sigma^2 + \xi_k p_k^{*}} \le \mu^{*} < \frac{\xi_k}{\sigma^2} \qquad (15)$$
It can be solved from (13a) and (15) that:
$$p_k^{*} \ge \frac{1}{\mu^{*}} - \frac{\sigma^2}{\xi_k} \qquad (16)$$
If $p_k^{*} = 0$, substituting it into (14) gives $\mu^{*} \ge \xi_k/\sigma^2$, which contradicts $\mu^{*} < \xi_k/\sigma^2$; hence $p_k^{*} > 0$. By the complementary slackness condition (13d), $v_k^{*} = 0$, so (13e) reduces to the following equality:
$$\mu^{*} - \frac{\xi_k}{\sigma^2 + \xi_k p_k^{*}} = 0 \qquad (17)$$
Therefore, when $\mu^{*} < \xi_k/\sigma^2$, we have $p_k^{*} > 0$, and the closed-form solution follows from (16) and (17):
$$p_k^{*} = \frac{1}{\mu^{*}} - \frac{\sigma^2}{\xi_k},\quad \mu^{*} < \frac{\xi_k}{\sigma^2} \qquad (18)$$
When $\mu^{*} \ge \xi_k/\sigma^2$: if $p_k^{*} > 0$, then the complementary slackness condition (13d) gives $v_k^{*} = 0$, and from (17) together with $\mu^{*} \ge \xi_k/\sigma^2$ it follows that $p_k^{*} \le 0$, which contradicts $p_k^{*} > 0$. By (13a), the only feasible solution when $\mu^{*} \ge \xi_k/\sigma^2$ is therefore $p_k^{*} = 0$. The optimal power allocation can thus be expressed as:
$$p_k^{*} = \begin{cases} \dfrac{1}{\mu^{*}} - \dfrac{\sigma^2}{\xi_k}, & \mu^{*} < \dfrac{\xi_k}{\sigma^2} \\[4pt] 0, & \mu^{*} \ge \dfrac{\xi_k}{\sigma^2} \end{cases} \qquad (19)$$
The above process can be described as pouring the total power P into a water tank, and the classic method shown in Figure 2 can be used for power allocation.
where $1/\mu_i^{*}$ represents the water level. The channel gains of the K users are arranged in descending order, and the water level can be derived from the power constraints (13a) and (13b) together with the channel state information as follows:
$$\frac{1}{\mu_i^{*}} = \frac{P + \sum_{k=1}^{K-i+1}\dfrac{\sigma^2}{\xi_k}}{K - i + 1} \qquad (20)$$
where $i = 1, \ldots, K$ and $k = 1, \ldots, K-i+1$. After obtaining the water level from (20), the power allocation for each user can be calculated using (19). By adjusting the UAV transmit power $\mathbf{P} = (p_1, \ldots, p_K)$, the RIS phase-shift matrix $\boldsymbol{\Phi}$, and the UAV trajectory $\mathbf{Q}$, the maximum system sum rate can be achieved. First, the UAV perceives channel changes and uses ZF precoding to eliminate inter-user interference; it then employs the water-filling algorithm to solve for the optimal power allocation $\mathbf{P}$ among users. The RIS phase shifts and the UAV trajectory $\mathbf{Q}$ then enter problem (8). This problem is non-convex; therefore, a DRL-based D3QN-WF algorithm is designed in Section 3 to solve it.
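Putting the pieces of this subsection together, the sketch below strings ZF precoding (10)-(11), the iterative water level (20), and the per-user powers (19) into one sum-rate evaluation. It is an illustrative numpy implementation under the perfect-CSI assumption; the names are ours and the effective channels $\mathbf{h}_k^H$ are taken as given inputs.

```python
import numpy as np

def zf_directions(H_eff):
    """H_eff: K x M matrix whose k-th row is the effective channel h_k^H.
    Returns an M x K matrix whose columns are the unit-norm ZF directions w_k*."""
    W = np.linalg.pinv(H_eff)                 # right pseudoinverse, i.e. H (H^H H)^{-1}
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def water_filling(xi, P, sigma2):
    """Water-filling over effective gains xi_k = |h_k^H w_k*|^2, cf. (19)-(20)."""
    order = np.argsort(xi)[::-1]              # strongest users first
    xi_sorted = xi[order]
    p = np.zeros_like(xi, dtype=float)
    for active in range(len(xi), 0, -1):      # drop the weakest user until all powers are >= 0
        level = (P + np.sum(sigma2 / xi_sorted[:active])) / active    # water level 1/mu*
        powers = level - sigma2 / xi_sorted[:active]
        if powers[-1] >= 0:                   # weakest active user still above the water line
            p[order[:active]] = powers
            break
    return p

def sum_rate(H_eff, P, sigma2, B=1e6):
    """Evaluate the objective of (8) after ZF precoding and water-filling."""
    W = zf_directions(H_eff)
    xi = np.abs(np.einsum('km,mk->k', H_eff, W)) ** 2   # |h_k^H w_k*|^2
    p = water_filling(xi, P, sigma2)
    return np.sum(B * np.log2(1.0 + p * xi / sigma2))

# Toy usage with a random 3-user, 4-antenna effective channel.
rng = np.random.default_rng(1)
K, M = 3, 4
H_eff = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)
print(sum_rate(H_eff, P=1.0, sigma2=1e-9))
```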

2.3. Throughput Model

To fully demonstrate the advantages of the D3QN-WF algorithm, this study calculates the total throughput of the UAV over the entire communication period based on the system sum rate in Formula (8). The throughput over all T time slots is defined as follows [24]:
$$R_{\mathrm{sum}} = \sum_{t=1}^{T} R(t) \qquad (21)$$
where R(t) is the system sum rate at the UAV's deployment position in time slot t.

3. DRL-Based Algorithms

Jointly adjusting the 3D coordinates Q of the UAV and the phase angles θ of the RIS to maximize the system sum rate can be categorized as a Markov Decision Process (MDP), which consists of states, actions, rewards, and state transition probabilities:
  • State $s_t$: the state of the system at time slot t is defined as $s_t = \{Q_t, \Phi_t\}$, $s_t \in S$, where S denotes the state space, $Q_t$ represents the 3D coordinates of the UAV at time slot t, and $\Phi_t$ denotes the RIS phase-shift matrix at time slot t.
  • Action $a_t$: the spatial action of the UAV at time slot t includes horizontal movement and vertical movement. The phase matrix of the RIS elements is dynamically optimized, with the phase of each RIS element adjusted discretely within a predetermined range. The action $a_t$ belongs to the action space A.
  • State Transition Probability p s t + 1 s t , a t : The optimization problem of UAV path planning and RIS phase angle adjustment can be simplified as an MDP. The state transition probability depends only on the current state s t and action a t .
  • Reward $r_t$: the UAV receives feedback on its actions from the environment. Rewards help it evaluate the quality of its actions and adjust its strategy accordingly, so as to obtain higher returns and thereby improve the system sum rate. $r_t$ denotes the reward obtained after taking action $a_t$ in state $s_t$ at time slot t. Thus, the reward $r_t$ is the downlink sum rate R(t) of the system at time slot t (a minimal sketch of this MDP is given after this list):
    $$r_t = R(t) \qquad (22)$$
    The goal of the MDP is to maximize the cumulative expected return; the maximum return is obtained when the system reaches the state that achieves the maximum rate.
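To make the MDP concrete, the skeleton below (a minimal sketch; class and method names are ours, not the authors') shows how the state, action, and reward defined above map onto a step function, with the reward taken as the instantaneous sum rate:

```python
import numpy as np

class RisUavEnv:
    """Minimal MDP skeleton for the joint UAV-position / RIS-phase problem."""

    def __init__(self, n_ris, bounds):
        self.n_ris = n_ris
        self.bounds = bounds                       # ((xmin, xmax), (ymin, ymax), (zmin, zmax))
        self.uav = np.array([-300.0, -300.0, 150.0])
        self.theta = np.zeros(n_ris)               # RIS phases theta_n

    def state(self):
        # s_t = (x_t, y_t, z_t, theta_1 ... theta_N)
        return np.concatenate([self.uav, self.theta])

    def step(self, move_xyz, dtheta):
        self.uav = np.clip(self.uav + move_xyz,    # keep the UAV inside (8e)-(8g)
                           [b[0] for b in self.bounds],
                           [b[1] for b in self.bounds])
        self.theta = np.mod(self.theta + dtheta, 2 * np.pi)   # unit-modulus phases, (8b)
        reward = self.sum_rate()                   # r_t = R(t), Eq. (22)
        return self.state(), reward

    def sum_rate(self):
        # Placeholder: regenerate the channels for the new geometry, apply ZF
        # precoding and water-filling power allocation, then evaluate Eq. (8).
        raise NotImplementedError

env = RisUavEnv(n_ris=8, bounds=((-300, 300), (-300, 300), (50, 150)))
print(env.state().shape)   # (3 + 8,) = (11,)
```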
To introduce D3QN in detail, we first present the basic structure of DRL algorithms. In reinforcement learning, Q-learning is an effective method for solving an MDP. It constructs an action-value function $Q(s_t, a_t)$ to evaluate the quality of taking action $a_t$ in state $s_t$. Under a given policy, $Q(s_t, a_t)$ is defined as the discounted sum of future rewards:
$$Q\left(s_t, a_t\right) = r_t + \gamma\, r_{t+1} + \cdots + \gamma^{T-t-1} r_{T-1},\quad 0 \le t \le T-1 \qquad (23)$$
where $\gamma \in [0, 1]$ is a discount factor that balances the importance of current and future rewards, and $Q(s_t, a_t)$ represents the expected cumulative reward obtained by taking action $a_t$ in state $s_t$. The optimal action-value function $Q^{*}(s_t, a_t)$ is defined as:
$$Q^{*}\left(s_t, a_t\right) = \max Q\left(s_t, a_t\right) \qquad (24)$$
and it satisfies the Bellman optimality equation:
$$Q^{*}\left(s_t, a_t\right) = r_t + \gamma \max_{a_{t+1}\in A} Q^{*}\left(s_{t+1}, a_{t+1}\right) \qquad (25)$$
By selecting the action with the highest value in state $s_t$, the optimal policy can be derived from the optimal action-value function. The update rule for $Q(s_t, a_t)$ is as follows:
$$Q\left(s_t, a_t\right) \leftarrow Q\left(s_t, a_t\right) + \bar{\alpha}\left[r_t + \gamma \max_{a_{t+1}\in A} Q\left(s_{t+1}, a_{t+1}\right) - Q\left(s_t, a_t\right)\right] \qquad (26)$$
where $\bar{\alpha}$ denotes the learning rate. $\Theta$ and $\Theta'$ represent the weights of the estimation (online) network and the target network, respectively. The estimation network is used to obtain $Q(s_t, a_t\,|\,\Theta)$, which approximates $Q(s_t, a_t)$. During training, the weights $\Theta$ of the estimation network are updated by minimizing the loss function $L(\Theta)$:
$$L\left(\Theta\right) = \left(y_t^{DQN} - Q\left(s_t, a_t\,|\,\Theta\right)\right)^2 \qquad (27)$$
$$y_t^{DQN} = r_t + \gamma \max_{a_{t+1}\in A} Q\left(s_{t+1}, a_{t+1}\,|\,\Theta'\right) \qquad (28)$$
where $y_t^{DQN}$ is the target value. The weights $\Theta'$ of the target network are updated periodically from the weights $\Theta$ of the estimation network, with an update interval of O steps. Since DQN selects target actions directly based on the target Q-values, it suffers from overestimation. To solve this problem, DDQN decouples the selection of the target action from the calculation of the target Q-value through the loss function $L(\Theta)$:
$$L\left(\Theta\right) = \left(y_t^{DDQN} - Q\left(s_t, a_t\,|\,\Theta\right)\right)^2 \qquad (29)$$
$$y_t^{DDQN} = r_t + \gamma\, Q\left(s_{t+1}, \arg\max_{a_{t+1}\in A} Q\left(s_{t+1}, a_{t+1}\,|\,\Theta\right)\,\Big|\,\Theta'\right) \qquad (30)$$
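The practical difference between (28) and (30) lies only in how the bootstrap action is chosen; the small sketch below uses plain numpy vectors standing in for the outputs of $Q(\cdot\,|\,\Theta)$ and $Q(\cdot\,|\,\Theta')$ (the numbers are illustrative):

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.9):
    # Eq. (28): the target network both selects and evaluates the bootstrap action.
    return r + gamma * np.max(q_next_target)

def ddqn_target(r, q_next_online, q_next_target, gamma=0.9):
    # Eq. (30): the online network selects the action, the target network evaluates it,
    # decoupling selection from evaluation to reduce overestimation.
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

q_online = np.array([1.0, 2.5, 2.0])
q_target = np.array([1.2, 1.8, 2.2])
print(dqn_target(0.5, q_target), ddqn_target(0.5, q_online, q_target))
```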
When the state is $s_t$, estimating the action-value function $Q(s_t, a_t)$ with DQN or DDQN can lead to unstable value outputs. To solve the MDP more reliably, D3QN is used to improve the training results. The Dueling DQN architecture uses two parallel fully connected streams, which allows the state value and the action advantage to be estimated separately. This design lets the neural network first make a basic judgment of a given state and then refine that judgment through the different actions $a_t$. Furthermore, Dueling DQN outputs a Q function that can be combined with DDQN to eliminate the overestimation problem widespread in traditional DQN algorithms.
The network structure of D3QN is shown in Figure 3. First, experience samples of size F are fed to the input layer. After passing through the hidden layers, the extracted features are sent along two separate paths. One path outputs the state value, which is independent of the actions to be taken by the UAV and the RIS; it evaluates the value of the current environment and is called the value function V(s). The other path focuses on the action value, expressed as the advantage function A(s, a), which measures the relative advantage of each action that the UAV and the RIS can select from A. These two paths are denoted $V(s\,|\,\Theta, \tilde{\beta})$ and $A(s, a\,|\,\Theta, \tilde{\alpha})$, where $\Theta$, $\tilde{\alpha}$, and $\tilde{\beta}$ represent the parameters of the hidden layers (DNNs), the advantage-function stream, and the value-function stream, respectively. D3QN independently learns the state value and the action advantage and then combines them in the output layer to produce $Q(s_t, a_t)$, which can be expressed as $Q(s, a\,|\,\Theta, \tilde{\alpha}, \tilde{\beta})$. The action advantage A(s, a) and the state value V(s) are combined while subtracting the mean of the action advantages, as follows:
$$Q\left(s_t, a_t\,|\,\Theta, \tilde{\alpha}, \tilde{\beta}\right) = V\left(s_t\,|\,\Theta, \tilde{\beta}\right) + A\left(s_t, a_t\,|\,\Theta, \tilde{\alpha}\right) - \frac{1}{|A|}\sum_{a_{t+1}\in A} A\left(s_t, a_{t+1}\,|\,\Theta, \tilde{\alpha}\right) \qquad (31)$$
where |A| represents the dimension of the action space A. The loss function of D3QN can then be expressed as:
$$L\left(\Theta, \tilde{\alpha}, \tilde{\beta}\right) = \left(y_t^{D3QN} - Q\left(s_t, a_t\,|\,\Theta, \tilde{\alpha}, \tilde{\beta}\right)\right)^2 \qquad (32)$$
$$y_t^{D3QN} = r_t + \gamma\, Q\left(s_{t+1}, \arg\max_{a_{t+1}\in A} Q\left(s_{t+1}, a_{t+1}\,|\,\Theta, \tilde{\alpha}, \tilde{\beta}\right)\,\Big|\,\Theta', \tilde{\alpha}', \tilde{\beta}'\right) \qquad (33)$$
where $\Theta'$, $\tilde{\alpha}'$, and $\tilde{\beta}'$ represent the parameters of the target network, which are periodically copied from $\Theta$, $\tilde{\alpha}$, and $\tilde{\beta}$ of the estimation network.
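Equation (31) corresponds to a two-stream network head. The sketch below builds such a head in TensorFlow/Keras (the paper reports using TensorFlow, but the layer sizes and exact architecture here are our assumptions; the state and action dimensions reuse the illustrative values from the earlier sketches):

```python
import tensorflow as tf

def build_d3qn(state_dim, n_actions, hidden=128):
    """Dueling Q-network: shared layers, then value and advantage streams combined as in (31)."""
    s = tf.keras.Input(shape=(state_dim,))
    x = tf.keras.layers.Dense(hidden, activation="relu")(s)
    x = tf.keras.layers.Dense(hidden, activation="relu")(x)
    v = tf.keras.layers.Dense(1)(x)                         # V(s | Theta, beta~)
    a = tf.keras.layers.Dense(n_actions)(x)                 # A(s, a | Theta, alpha~)
    q = v + a - tf.reduce_mean(a, axis=1, keepdims=True)    # Eq. (31), mean-subtracted advantage
    return tf.keras.Model(s, q)

online = build_d3qn(state_dim=11, n_actions=45)   # illustrative sizes
target = build_d3qn(state_dim=11, n_actions=45)
target.set_weights(online.get_weights())          # start with identical weights
```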
The process of the D3QN-WF algorithm is illustrated in Figure 4. Each episode consists of two stages: the exploration stage (Steps 3 to 24 in Algorithm 1) and the training stage (Steps 25 to 27). In the exploration stage, action a(n) is selected either randomly (with probability ε) or greedily in Step 5 to obtain the UAV trajectory and RIS phase shifts; meanwhile, the UAV is restricted from flying out of the defined range, and the next state s(n+1) is acquired. ZF precoding and the WF algorithm are then used for power allocation (Steps 8 to 20), and the corresponding reward r(a(n), s(n)) is calculated. Subsequently, the newly generated sample (s(n), a(n), r(n), s(n+1)) is stored in the experience replay buffer F in Step 23. In the training stage, a random mini-batch of samples is drawn from the replay buffer to train the online network Q(·) and the target network Q′(·). The target value $y_t^{D3QN}$ from Equation (33) is then used to update the weights Θ of the online network by minimizing the loss defined in Equation (32).
The state space: the system state $s_t \in S$ at time slot t includes the 3D coordinates $(x_t, y_t, z_t)$ of the UAV and the phase $\theta_n^t$ of each reflecting element of the RIS. Thus, $s_t$ can be defined as $s_t = \{x_t, y_t, z_t, \theta_n^t\}$.
Action: based on the observed environmental state $s_t$, the UAV selects an action from the action space A to execute. A consists of three parts: (1) the horizontal movement of the UAV, $\Delta L_{UAV} \in \{(x_b, 0), (-x_b, 0), (0, y_b), (0, -y_b), (0, 0)\}$; (2) the vertical movement of the UAV, $\Delta H_{UAV} \in \{z_b, -z_b, 0\}$; and (3) the phase shift of each reflecting element of the RIS, $\Delta\theta_n \in \{\pi/80, -\pi/80, 0\}$. Since the water-filling algorithm yields the optimal UAV transmit power, the action space A does not include the BS transmit power; instead, the transmit power is computed in the environment and, once solved, treated as part of the environment observation. The discrete action sets can be enumerated as in the sketch below.
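The following snippet enumerates the candidate action sets listed above. How the per-element phase choices are indexed into a single discrete action is not spelled out in the text, so the Cartesian-product flattening shown here (one horizontal move, one vertical move, and one phase increment per step) is our assumption:

```python
import itertools
import numpy as np

x_b = y_b = z_b = 5.0                               # grid step sizes from Section 5 (m)
horizontal  = [(x_b, 0), (-x_b, 0), (0, y_b), (0, -y_b), (0, 0)]
vertical    = [z_b, -z_b, 0]
phase_delta = [np.pi / 80, -np.pi / 80, 0]          # options per RIS element

actions = list(itertools.product(horizontal, vertical, phase_delta))
print(len(actions))                                 # 5 * 3 * 3 = 45 discrete actions
```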
Algorithm 1: D3QN-WF algorithm
Reward: by taking actions, the 3D coordinates of the UAV and the phase-shift matrix of the RIS are changed. The objective function (12) is then solved in the environment, the optimal power allocation (19) is obtained using the water-filling algorithm, and the optimal system sum rate of time slot t is calculated. The reward of time slot t is given by Equation (22). The pseudocode of the D3QN-WF algorithm is shown in Algorithm 1, and a minimal training-loop sketch follows below.
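The sketch below mirrors the two stages described above in a replay-based loop. It reuses the illustrative names from the earlier sketches (the environment, the action list, and the Keras networks); it is not the authors' implementation, and the hyperparameter defaults are taken from Table 2.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

def train_step(batch, online, target, optimizer, gamma=0.9):
    """One D3QN update: targets from Eq. (33), squared loss of Eq. (32)."""
    s, a, r, s2 = zip(*batch)
    s = np.asarray(s, dtype=np.float32)
    s2 = np.asarray(s2, dtype=np.float32)
    a = np.asarray(a, dtype=np.int64)
    r = np.asarray(r, dtype=np.float32)
    a_star = tf.argmax(online(s2), axis=1)                     # online net selects the action
    y = r + gamma * tf.gather(target(s2), a_star, axis=1, batch_dims=1)
    with tf.GradientTape() as tape:
        q = tf.gather(online(s), a, axis=1, batch_dims=1)
        loss = tf.reduce_mean(tf.square(y - q))                # Eq. (32)
    grads = tape.gradient(loss, online.trainable_variables)
    optimizer.apply_gradients(zip(grads, online.trainable_variables))

def run_episode(env, online, target, actions, optimizer, replay,
                steps=200, epsilon=0.1, batch_size=75, update_every=750, gamma=0.9):
    s = env.state()
    for n in range(steps):
        # Exploration stage: epsilon-greedy action selection (Step 5 of Algorithm 1).
        if random.random() < epsilon:
            a_idx = random.randrange(len(actions))
        else:
            a_idx = int(np.argmax(online(s[None, :].astype(np.float32))))
        (dx, dy), dz, dphi = actions[a_idx]
        # The environment applies ZF precoding and water-filling internally and returns r = R(t).
        s_next, r = env.step(np.array([dx, dy, dz]), dphi)
        replay.append((s, a_idx, r, s_next))
        s = s_next
        # Training stage: fit the online network on a random mini-batch (Steps 25-27).
        if len(replay) >= batch_size:
            train_step(random.sample(replay, batch_size), online, target, optimizer, gamma)
        if (n + 1) % update_every == 0:
            target.set_weights(online.get_weights())           # periodic target-network copy

replay = deque(maxlen=1500)   # experience buffer size F from Table 2
```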

4. Complexity Analysis

The complexity of the D3QN-WF algorithm can be attributed to two aspects: the environment and the training process. In the environment, the water-filling algorithm handles the allocation of the BS transmit power to K users, with K optimization variables; its complexity is denoted $f_{\mathrm{wf}}(K)$, so the computational complexity for the BS to calculate the transmit power in each iteration is $f_{\mathrm{wf}}(K)$. During training, the complexity of the D3QN-WF algorithm is $O(|S|\cdot|A|)$, where |S| and |A| denote the sizes of the state space and the action space, respectively. The overall computational complexity of D3QN-WF is therefore $O(|S|\cdot|A|\cdot f_{\mathrm{wf}}(K))$.

5. Simulation Results

In this section, the D3QN-WF algorithm is applied to an RIS-assisted BS-UAV wireless communication network. Comparisons are made with D3QN with average power, D3QN without optimized phase angles, DDQN-WF, and the D3QN Majorize-Minimization algorithm (D3QN-MM) [11], which optimizes the phase using convex optimization methods. Furthermore, the impact of RIS with different numbers of reflecting elements on the system rate is analyzed within the D3QN-WF algorithm. The system simulation parameters are presented in Table 1, and the DRL hyperparameters in Table 2. Python 3.7.12 with TensorFlow 2.0.0 is used to build the DNN for the simulations.
Assume the UAV starts at (−300 m, −300 m, 150 m), the users are randomly distributed on the ground within (0 m~100 m, 0 m~200 m, 0 m), and the RIS is located at (0 m, 150 m, 20 m). The UAV's area of interest (AoI) spans (−300 m~300 m, −300 m~300 m, 50 m~150 m), and the grid step sizes are $x_b = y_b = z_b = 5$ m.
Figure 5 and Figure 6 compare the 3D flight trajectories of the UAV and their 2D projections for three algorithms: DDQN-WF, D3QN-WF, and D3QN with average power. Starting from the initial position (−300 m, −300 m, 150 m), the UAV gradually perceives the communication users, searches for the optimal trajectory and the globally optimal deployment position, and then hovers near this position until the end of the time horizon. This lingering behavior occurs because no termination condition is set in the DRL algorithm, so the UAV keeps moving throughout the entire period of T time slots. If the total number of time slots T is small, the optimal deployment position may not be found; when T is large, the greedy strategy in DRL and the training error of the DNN affect the UAV's ability to hold the optimal deployment position, which may cause it to deviate from the optimum and make exploratory moves to nearby positions. Suppose T = 200 time slots: after multiple rounds of training, the UAV identifies the optimal deployment position within 90-100 time slots. In the subsequent time slots, the UAV either remains stationary at the optimal deployment position or fluctuates slightly around it; when the reward obtained by the agent decreases, the UAV tries to return to the position with the maximum reward, producing the lingering phenomenon described above.
Figure 7 shows the variation of the reward with the number of training rounds for four algorithms: DDQN-WF, D3QN with average power, D3QN-MM, and D3QN-WF. The reward of the D3QN-WF algorithm converges after approximately 500 training rounds, with only slight fluctuations thereafter. These fluctuations arise from minor deviations when the UAV reaches the optimal deployment position, as well as the continuous impact of dynamic environmental parameters on system performance. Notably, compared with DDQN-WF, D3QN with average power, and D3QN-MM, D3QN-WF exhibits a smaller convergence fluctuation range and more stable convergence characteristics. Comparative experiments with D3QN-WF without optimized phase angles further confirm that optimizing the RIS phase parameters significantly improves the system communication rate.
Figure 8 and Figure 9 present the phase-shift optimization results of the DDQN-WF and D3QN-WF algorithms for 8 RIS reflecting elements. The experiments show that the stability of D3QN-WF in dynamic environments is significantly superior to that of DDQN-WF. As can be seen from Figure 6, multiple users are concentrated in the negative direction of the x-axis; accordingly, of the two algorithms, the phases optimized by D3QN-WF are more concentrated.
Figure 10 presents the cumulative distribution function (CDF) of the system sum rate during the training process for several algorithms, including DDQN-WF, D3QN-WF with different numbers of RIS elements, D3QN-WF with unoptimized phase angles, D3QN-WF with average power, and D3QN-MM. Among them, a CDF value of 1 corresponds to the rate under the optimal 3D deployment coordinates of the UAV. The optimal deployment positions obtained by the algorithms are shown in Figure 6 as follows: the optimal deployment of the D3QN with average power algorithm is at (0 m, 130 m, 50 m); the optimal deployment of the DDQN-WF algorithm is at (70 m, 185 m, 50 m); and the optimal deployment of the D3QN-WF algorithm is at (0 m, 155 m, 50 m). In Figure 10, the CDF curves of DDQN-WF, D3QN with average power, and D3QN without optimized phase angles are located to the left of that of D3QN-WF, which illustrates the superiority of D3QN-WF and indicates that it has obvious advantages in the statistical distribution characteristics of the sum rate. The simulation results show that compared with DDQN-WF and D3QN-MM, the D3QN-WF algorithm increases the system sum rate at the optimal position by 15.9% and 17.6%, respectively. This is because the essence of D3QN lies in combining the decomposed structure of state value and advantage value in Dueling DQN with the dual-network structure of DDQN. This combination improves the accuracy of each action value estimation in D3QN and mitigates the overestimation issue inherent in traditional Q-learning. Therefore, D3QN-WF optimizes the RIS phase matrix and the UAV’s position more effectively. Meanwhile, it can be observed that reasonable power allocation also improves the overall system sum rate.
According to Equation (21), Figure 11 depicts the cumulative distribution function (CDF) of the system throughput R s u m during the communication period. It can be seen that D3QN-WF improves throughput throughout the entire communication period compared to other algorithms. In Figure 6, for the DDQN-WF algorithm, although it can find a deployment position close to users during the optimization process, it overestimates the value of some actions. This overestimation leads to a longer time spent searching for the optimal position throughout the entire time slot T. As a result, the throughput during the entire time slot T is significantly lower compared to D3QN-WF. Therefore, it can be observed that throughout the entire communication period, due to the reasonable power allocation of the D3QN-WF algorithm, its throughput is 20% higher than that of D3QN with average power allocation, and total throughput is increased by 50.1% and 55.6% compared with DDQN-WF and D3QN-MM, respectively.

6. Conclusions

In this paper, the optimization of an RIS-assisted BS-UAV air-ground communication system is studied. DRL is used to construct autonomous decision-making through deep interaction between the agent and the environment. The D3QN-WF and DDQN-WF algorithms are proposed to adjust the transmit power, reconstruct the RIS phases, and optimize the 3D coordinates of the UAV so as to maximize the system sum rate. Simulation results show that, compared with the DDQN-WF algorithm, D3QN-WF converges faster and improves the system sum rate at the optimal deployment position, while the total throughput during communication is increased by 50.1%, verifying the significant advantages of D3QN in dynamic environments and providing theoretical support for the optimization of future intelligent communication systems.

Author Contributions

Conceptualization, Y.Y. and X.L.; methodology, Y.Y. and X.L.; validation, Y.Y. and X.L.; formal analysis, Y.Y., X.L., S.H. and X.Y.; investigation, Y.Y. and X.L.; resources, Y.Y.; writing—original draft preparation, Y.Y. and X.L.; writing—review and editing, Y.Y., X.L., S.H. and X.Y.; supervision, Y.Y., S.H. and X.Y.; project administration, Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

Project supported by the National Natural Science Foundation of China: No. 62301059, 62422103; the Project of Cultivation for young top-notch Talents of Beijing Municipal Institutions: No. BPHR202203228.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BS	base station
UAV	unmanned aerial vehicle
RIS	reconfigurable intelligent surface
ZF	zero-forcing
DRL	deep reinforcement learning
6G	sixth generation
5G	fifth generation
ML	machine learning
RL	reinforcement learning
DDQN	Double Deep Q-Network
D3QN	Dueling Double Deep Q-Network
D3QN-WF	Dueling Double Deep Q-Network water-filling algorithm
DDQN-WF	Double Deep Q-Network water-filling algorithm
SINR	signal-to-interference-plus-noise ratio
CSI	channel state information
LoS	line-of-sight
NLoS	non-line-of-sight
AoI	Area of Interest
MDP	Markov Decision Process

References

  1. Zhou, Y.Q.; Liu, L.; Wang, L. Service Aware 6G: An Intelligent and Open Network Based on Convergence of Communication, Computing and Caching. Digit. Commun. Netw. 2020, 6, 253–260. [Google Scholar] [CrossRef]
  2. Li, B.; Fei, Z.; Zhang, Y. UAV Communications for 5G and Beyond: Recent Advances and Future Trends. IEEE Internet Things J. 2019, 6, 2241–2263. [Google Scholar] [CrossRef]
  3. Zeng, Y.; Wu, Q.Q.; Zhang, R. Accessing From the Sky: A Tutorial on UAV Communications for 5G and Beyond. Proc. IEEE 2019, 107, 2327–2375. [Google Scholar] [CrossRef]
  4. Meng, K.; Wu, Q.; Xu, J.; Chen, W.; Feng, Z.; Schober, R.; Swindlehurst, A.L. UAV-Enabled Integrated Sensing and Communication: Opportunities and Challenges. IEEE Wirel. Commun. 2024, 31, 97–104. [Google Scholar] [CrossRef]
  5. Cui, Y.; Yuan, W.; Zhang, Z.; Mu, J.; Li, X. On the Physical Layer of Digital Twin: An Integrated Sensing and Communications Perspective. IEEE J. Sel. Areas Commun. 2023, 41, 3474–3490. [Google Scholar] [CrossRef]
  6. Wang, L.; Wei, Q.; Xu, L.; Shen, Y.; Zhang, P.; Fei, A. Research on low-energy-consumption deployment of emergency UAV network for integrated communication-navigating-sensing. J. Commun. 2022, 43, 1–20. [Google Scholar]
  7. Li, X.; Zheng, Y.; Zhang, J.; Dang, S.; Nallanathan, A.; Mumtaz, S. Finite SNR Diversity-Multiplexing Trade-off in Hybrid ABCom/RCom-Assisted NOMA Systems. IEEE Trans. Mob. Comput. 2024, 23, 9108–9119. [Google Scholar] [CrossRef]
  8. Li, X.; Wang, Q.; Zeng, M.; Liu, Y.; Dang, S.; Tsiftsis, T.A.; Dobre, O.A. Physical-Layer Authentication for Ambient Backscatter Aided NOMA Symbiotic Systems. IEEE Trans. Commun. 2023, 71, 2288–2303. [Google Scholar] [CrossRef]
  9. Li, X.; Zhao, M.; Zeng, M. Hardware Impaired Ambient Backscatter NOMA System: Reliability and Security. IEEE Trans. Commun. 2021, 69, 2723–2736. [Google Scholar] [CrossRef]
  10. Wu, Q.Q.; Zhang, R. Towards Smart and Reconfigurable Environment: Intelligent Reflecting Surface Aided Wireless Network. IEEE Commun. Mag. 2020, 58, 106–112. [Google Scholar] [CrossRef]
  11. Huang, C.; Zappone, A.; Alexandropoulos, G.C.; Debbah, M.; Yuen, C. Reconfigurable intelligent surfaces for energy efficiency in wireless communication. IEEE Trans. Wireless Commun. 2019, 18, 4157–4170. [Google Scholar] [CrossRef]
  12. RISTech Alliance (RISTA). Reconfigurable Intelligent Surface (RIS) White Paper (2023); RIS Tech Alliance (RISTA): Beijing, China, 2023. [Google Scholar] [CrossRef]
  13. Cui, T.; Jin, S.; Zhang, J.; Zhao, Y.; Yuan, Y. Research Report on Reconfigurable Intelligent Surface (RIS). IMT-2030 (6G) Promotion Group. 2021.
  14. Hemavathy, P.; Priya, S.B.M. Energy-efficient UAV integrated RIS for NOMA communication. In Proceedings of the 2025 1st International Conference on Radio Frequency Communication and Networks (RFCoN), Thanjavur, India, 19–20 June 2025; pp. 1–6. [Google Scholar] [CrossRef]
  15. Huang, J.; Wu, B.; Duan, Q.; Dong, L.; Yu, S. A Fast UAV Trajectory Planning Framework in RIS-Assisted Communication Systems With Accelerated Learning via Multithreading and Federating. IEEE Trans. Mob. Comput. 2025, 24, 6870–6885. [Google Scholar] [CrossRef]
  16. Wu, Z.; Li, X.; Cai, Y.; Yuan, W. Joint Trajectory and Resource Allocation Design for RIS-Assisted UAV-Enabled ISAC Systems. IEEE Wirel. Commun. Lett. 2024, 13, 1384–1388. [Google Scholar] [CrossRef]
  17. Li, S.; Du, H.; Zhang, D.; Li, K. Joint UAV Trajectory and Beamforming Designs for RIS-Assisted MIMO System. IEEE Trans. Veh. Technol. 2024, 73, 5378–5392. [Google Scholar] [CrossRef]
  18. Li, Z.; Wang, S. Phase Shift Design in RIS Empowered Networks: From Optimization to AI-based Models. Network 2022, 2, 398–418. [Google Scholar] [CrossRef]
  19. Liu, Y.; Huang, C.; Chen, G.; Song, R.; Song, S.; Xiao, P. Deep Learning Empowered Trajectory and Passive Beamforming Design in UAV-RIS Enabled Secure Cognitive Non-Terrestrial Networks. IEEE Wirel. Commun. Lett. 2024, 13, 188–192. [Google Scholar] [CrossRef]
  20. Nguyen, K.K.; Khosravirad, S.R.; da Costa, D.B.; Nguyen, L.D.; Duong, T.Q. Reconfigurable Intelligent Surface-Assisted Multi-UAV Networks: Efficient Resource Allocation with Deep Reinforcement Learning. IEEE J. Sel. Top. Signal Process. 2022, 16, 358–368. [Google Scholar] [CrossRef]
  21. Moon, S.; Liu, H.; Hwang, I. Joint beamforming for ris-assisted integrated sensing and secure communication in UAV networks. J. Commun. Netw. 2024, 26, 502–508. [Google Scholar] [CrossRef]
  22. Aung, P.S.; Park, Y.M.; Tun, Y.K. Energy-Efficient Communication Networks via Multiple Aerial Reconfigurable Intelligent Surfaces: DRL and Optimization Approach. IEEE Trans. Veh. Technol. 2024, 73, 4277–4292. [Google Scholar] [CrossRef]
  23. Zhang, H.; Huang, M.; Long, K. Capacity Maximization in RIS-UAV Networks: A DDQN-Based Trajectory and Phase Shift Optimization Approach. IEEE Trans. Wirel. Commun. 2023, 22, 2583–2591. [Google Scholar] [CrossRef]
  24. Liu, X.; Yu, Y.; Li, F. Throughput Maximization for RIS-UAV Relaying Communications. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19569–19574. [Google Scholar] [CrossRef]
  25. Sin, S.; Lee, C.-G.; Ma, J.; Kim, K.; Liu, H.; Moon, S.; Hwang, I. UAV-RIS trajectory optimization algorithm for energy efficiency in UAV-RIS based non-terrestrial systems. In Proceedings of the 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 16–18 October 2024; pp. 121–123. [Google Scholar]
  26. Mohamed, Z.; Aissa, S. Leveraging UAVs with Intelligent Reflecting Surfaces for Energy-Efficient Communications with Cell-Edge Users. In Proceedings of the 2020 IEEE International Conference on Communications Workshops (ICC Workshops), Dublin, Ireland, 7–11 June 2020; pp. 1–6. [Google Scholar]
  27. Liu, X.; Liu, Y.; Chen, Y. Machine Learning Empowered Trajectory and Passive Beamforming Design in UAV-RIS Wireless Networks. IEEE J. Sel. Areas Commun. 2021, 39, 2042–2055. [Google Scholar] [CrossRef]
  28. Ahmad, I.; Narmeen, R.; Becvar, Z.; Guvenc, I. Machine learning-based beamforming for unmanned aerial vehicles equipped with reconfigurable intelligent surfaces. IEEE Wirel. Commun. 2022, 29, 32–38. [Google Scholar] [CrossRef]
  29. Hu, Y.; Cao, K.T. Improved DDQN Method for Throughput in RIS-Assisted UAV System. In Proceedings of the 2023 IEEE 6th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE), Shenyang, China, 15–17 December 2023; pp. 80–85. [Google Scholar]
  30. Mei, H.; Yang, K.; Wang, K. 3D-Trajectory and Phase-Shift Design for RIS-Assisted UAV Systems Using Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2022, 71, 3020–3029. [Google Scholar] [CrossRef]
  31. Khalili, A.; Monfared, E.M.; Jorswieck, E.A. Resource Management for Transmit Power Minimization in UAV-Assisted RIS HetNets Supported by Dual Connectivity. IEEE Trans. Wirel. Commun. 2022, 21, 1806–1822. [Google Scholar] [CrossRef]
  32. Nguyen, T.H.; Park, H.; Park, L. Recent Studies on Deep Reinforcement Learning in RIS-UAV Communication Networks. In Proceedings of the 2023 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Bali, Indonesia, 20–23 February 2023; pp. 378–381. [Google Scholar]
Figure 1. System model for RIS-assisted UAV-BS air-to-ground downlink communications network.
Figure 2. Classic water-filling power distribution method.
Figure 3. The D3QN architecture.
Figure 4. Flowchart of the D3QN algorithm.
Figure 5. 3D spatial coordinates.
Figure 6. 2D spatial coordinates.
Figure 7. Reward of D3QN-WF.
Figure 8. D3QN-WF phase.
Figure 9. DDQN-WF phase.
Figure 10. CDF of rate.
Figure 11. CDF of throughput.
Table 1. System simulation parameters.

Physical Meaning	Parameter	Value
Noise power	σ²	−80 dBm
Bandwidth	B	1 MHz
Path-loss exponent	α	4
Path channel gain	β	−40 dBm
UAV transmit power	P_UAV	30 dBm
Rician factor	R̂	10^3.3
Table 2. D3QN-WF algorithm hyperparameters.

Physical Meaning	Parameter	Value
Learning rate	α	10^-3
Decay factor	γ	0.9
Replay buffer size	F	1500
Greedy policy	ε	0.1
Target update step	O	750
Mini-batch size	-	75
Activation function	-	ReLU
Optimizer	-	RMSProp
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


