Deep Reinforcement Learning for Secure and Low-Latency Communications in UAV-Mounted STAR-RIS Assisted Urban Vehicular Networks

Tang, Jian; Yuan, Jun; Zhao, Hu; Chen, Mengxiang; Peng, Yi

doi:10.3390/s26113469

by Jian Tang ¹,², Jun Yuan ¹,*, Hu Zhao ¹,*, Mengxiang Chen ¹ and Yi Peng ²

¹ School of Artificial Intelligence, Shaoyang Industry Polytechnic College, Shaoyang 422000, China

² Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650000, China

* Authors to whom correspondence should be addressed.

Sensors 2026, 26(11), 3469; https://doi.org/10.3390/s26113469

Submission received: 21 April 2026 / Revised: 12 May 2026 / Accepted: 28 May 2026 / Published: 31 May 2026

(This article belongs to the Special Issue Integrated Sensing, Control, and Communication (ISC²) for Low-Altitude Intelligent Networks)

Abstract

This paper investigates secure and low-latency communications in UAV-mounted simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-assisted urban vehicular networks, where severe blockage, high vehicle mobility, eavesdropping threats, and delay-sensitive traffic services coexist. In the considered system, the UAV is used not only as an aerial carrier for the STAR-RIS but also as a mobile intelligent control node that can dynamically adjust its horizontal aerial position according to vehicle distribution, blockage conditions, and eavesdropping threats. First, a UAV-STAR-RIS-assisted vehicular communication system model is developed by jointly considering urban blockage, vehicle mobility, passive eavesdropping attacks, queueing dynamics, and UAV flight constraints. Then, a high-dimensional, non-convex, and strongly coupled dynamic optimization problem is formulated to maximize the long-term average secure and low-latency utility through the joint optimization of the UAV trajectory, the STAR-RIS transmission–reflection partition ratio, the phase-shift matrices, and the transmit power allocation. Furthermore, the problem is modeled as a Markov decision process with continuous state and action spaces, and a hierarchical constrained soft actor–critic (HC-SAC)-based joint control algorithm is proposed to enable adaptive UAV movement, STAR-RIS configuration, and power control in complex dynamic environments. Simulation results demonstrate that the proposed method outperforms DDPG and several structural benchmark schemes. In the representative evaluation, the proposed HC-SAC achieves an average delay of 10.85 slots and a secrecy outage probability of 0.7160, compared with 11.72 slots and 0.8501 for PPO, and 11.94 slots and 0.8599 for DDPG. Although PPO provides the highest average secrecy rate and successful service ratio, the proposed method still maintains a competitive secure communication capability and service reliability. 
A normalized composite utility analysis further shows that HC-SAC attains the highest utility value of 0.9254, indicating a more favorable security–latency trade-off in complex urban vehicular scenarios.

Keywords: vehicular networks; UAV; STAR-RIS; secure communications; low-latency transmission; deep reinforcement learning

1. Introduction

Integrated sensing and communications (ISAC) has emerged as a promising technology for next-generation wireless systems, since it enables spectrum sharing and functional integration between communication and sensing tasks. Meng et al. [1] discussed the opportunities and challenges of cooperative ISAC networks, and pointed out that network-level cooperation can expand sensing and communication coverage and provide additional degrees of freedom for ISAC design. For vehicular scenarios, Liu et al. [2] investigated energy-efficient computation offloading and resource allocation in ISAC-aided 6G V2X networks, showing that ISAC can support both communication services and computation-intensive vehicular applications. Cheng et al. [3] further studied ISAC for vehicular communication networks and highlighted its potential in supporting environmental perception, target awareness, and reliable vehicular information exchange. These studies indicate that ISAC is particularly attractive for urban vehicular networks, where high-rate communication, environmental sensing, and service awareness are required simultaneously.

However, urban vehicular environments also bring severe challenges to ISAC-enabled V2X transmission. Roadside buildings, overpasses, and dense traffic flows may cause frequent blockage and rapid channel fluctuations, which seriously degrade communication reliability and coverage continuity. Meanwhile, vehicular services such as cooperative driving, hazard warning, and real-time control are delay-sensitive, making low-latency transmission an essential requirement. More importantly, the broadcast nature of wireless channels makes vehicular messages vulnerable to interception. Yu et al. [4] studied movable-antenna-aided secure V2X communications from an ISAC perspective, indicating that mobility-aware antenna reconfiguration can improve secure transmission performance. Hasan et al. [5] provided a comprehensive study on securing V2X communication platforms and summarized key security threats in vehicular networks. Gyawali et al. [6] reviewed the major challenges and solutions for cellular-based V2X communications, including reliability, latency, and resource management issues. Hakeem and Kim [7] further surveyed machine learning, federated learning, and edge AI techniques for V2X intrusion detection. Although these works have advanced secure and reliable V2X communications, the joint design of physical-layer security, low-latency service, and UAV-assisted intelligent propagation control remains insufficiently explored.

To improve propagation quality in blocked environments, reconfigurable intelligent surfaces (RISs) have been widely recognized as an effective means of reshaping wireless channels. ElMossallamy et al. [8] summarized the principles, challenges, and opportunities of RIS-assisted wireless communications, showing that RIS can enhance link quality by tuning the electromagnetic responses of passive elements. However, conventional RIS mainly relies on half-space reflection, which limits its service capability when vehicles are distributed on both sides of urban roads. To address this limitation, Mu et al. [9] investigated simultaneously transmitting and reflecting RIS (STAR-RIS)-aided wireless communications, where the incident signal can be simultaneously transmitted and reflected to provide full-space coverage. In the V2X context, Aung et al. [10] proposed a deep reinforcement learning-based joint spectrum allocation and configuration design method for STAR-RIS-assisted V2X communications, demonstrating the potential of STAR-RIS in dynamic vehicular environments. Nevertheless, most existing STAR-RIS-assisted V2X studies mainly focus on spectrum allocation or configuration design, while secure low-latency communication with UAV-mounted STAR-RIS remains less investigated.

Although STAR-RIS offers stronger propagation control capability, fixed roadside or building-mounted deployment still suffers from limited coverage adaptability and poor flexibility in highly dynamic vehicular scenarios. In contrast, unmanned aerial vehicles (UAVs) can provide agile aerial deployment and favorable line-of-sight links. Andreou et al. [11] studied UAV-assisted roadside units for V2X connectivity using Voronoi diagrams in 6G+ infrastructures, showing that UAVs can improve vehicular connectivity through flexible aerial assistance. Peng et al. [12] investigated a UAV-borne STAR-RIS-assisted non-orthogonal multiple access system and developed a joint power allocation algorithm, verifying the feasibility of mounting STAR-RIS on UAV platforms. These studies show that the combination of UAV mobility and STAR-RIS programmability can provide adaptive communication support for vehicular networks. However, existing UAV-assisted or UAV-borne STAR-RIS studies have not fully considered the coupling among UAV trajectory, STAR-RIS transmission–reflection partitioning, phase-shift control, eavesdropping threats, and queue-aware latency.

Several recent studies have further investigated UAV-RIS communications and learning-based UAV network optimization. Nakazato et al. [13] proposed a multi-agent reinforcement learning method for resilient UAV ad hoc backhaul networks, where multiple UAVs collaboratively adjust their deployment to improve coverage and connectivity. This work demonstrates the applicability of reinforcement learning to dynamic low-altitude UAV networks, but it does not consider RIS/STAR-RIS-assisted physical-layer transmission or secure vehicular services. Li et al. [14] studied an RIS-assisted UAV communication system and jointly optimized the UAV trajectory and RIS passive beamforming to maximize the average achievable rate. Yang et al. [15] analyzed the performance of RIS-assisted dual-hop UAV communication systems and derived analytical expressions for outage probability, average bit-error rate, and average capacity, confirming the coverage and reliability gains introduced by RIS. Liu et al. [16] further proposed a machine-learning-empowered UAV-RIS framework, where UAV movement, RIS phase shifts, power allocation, and decoding order were jointly designed. In terms of secure UAV communication, Li et al. [17] investigated robust secure UAV communications with the aid of RIS and jointly optimized the UAV trajectory, RIS passive beamforming, and transmit power under imperfect eavesdropping CSI. Shi et al. [18] provided a comprehensive survey of RIS-aided cell-free massive MIMO systems for 6G and summarized different RIS architectures and applications, including STAR-RIS and UAV integration. Although these studies provide important foundations for UAV-RIS communication, RIS-assisted security, and learning-based UAV control, they mainly focus on conventional reflecting RISs, fixed RIS deployment, or rate/security-oriented optimization. 
The joint design of UAV-mounted STAR-RIS, high-mobility vehicular users, queue-aware low-latency services, and passive eavesdropping defense remains insufficiently studied.

Meanwhile, physical-layer security and intelligent resource control in RIS-assisted V2X networks have drawn growing attention. Saikia et al. [19] proposed a PPO-based method for RIS-assisted full-duplex 6G-V2X communications, showing that reinforcement learning can support dynamic RIS configuration and resource allocation. Long et al. [20] studied deep reinforcement learning for ISAC in RIS-assisted 6G V2X systems, where sensing and communication performance are jointly optimized. Wang et al. [21] investigated physical-layer security enhancement using artificial noise in C-V2X networks, demonstrating that artificial noise can help suppress eavesdropping. De Lima et al. [22] considered broadband beamforming and jamming mitigation in V2X scenarios, which is related to anti-interference and robust V2X transmission. Shang et al. [23] further studied energy-efficient and intelligent ISAC in V2X networks using spiking-neural-network-driven DRL. These studies provide useful references for secure, intelligent, and ISAC-oriented V2X communications. Nevertheless, they mainly focus on fixed RIS, conventional V2X infrastructure, artificial-noise-aided security, or ISAC resource allocation, while the joint security–latency optimization of UAV-mounted STAR-RIS-assisted vehicular networks remains an open problem.

Deep reinforcement learning (DRL) provides a natural tool for the considered dynamic optimization problem. Since the system involves time-varying vehicle positions, queue evolution, eavesdropper distribution, UAV motion, and STAR-RIS configuration, the resulting optimization problem is high-dimensional, strongly coupled, and non-convex. Conventional model-based optimization methods often incur high computational complexity and are not well-suited for real-time adaptation in fast-changing vehicular scenarios. Li et al. [24] proposed a physical-layer eavesdropping defense scheme for V2X based on an improved SAC algorithm, indicating that SAC-type methods are effective for secure V2X decision-making. Amudha et al. [25] studied a hyperparameter-tuned PPO-based federated DRL method for efficient V2X resource allocation, showing the applicability of PPO in V2X resource management. Mlika and Cherkaoui [26] applied DDPG to minimize the age of information in cellular V2X communications, demonstrating the suitability of deterministic policy gradient methods for continuous vehicular control problems. However, directly applying conventional SAC, PPO, or DDPG to the considered scenario is still challenging because UAV movement, STAR-RIS partitioning, phase-shift configuration, transmit power allocation, secrecy constraints, and latency requirements are tightly coupled.

Motivated by the above observations, this paper investigates secure and low-latency communications in UAV-mounted STAR-RIS-assisted urban vehicular networks. Different from existing studies, the UAV in this work is used not only as an aerial carrier for the STAR-RIS but also as a mobile intelligent control node that can dynamically adjust its trajectory according to vehicle distribution, blockage conditions, queue states, and eavesdropping threats. Meanwhile, the STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation are jointly optimized with the UAV trajectory. To better handle the coupled control variables and the stringent security–latency trade-off, we formulate a long-term utility maximization problem and develop a hierarchical constrained soft actor–critic (HC-SAC)-based joint optimization method. Specifically, the original problem is first transformed into a Markov decision process with continuous state and action spaces, and then the UAV trajectory, STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation are jointly optimized through hierarchical and constraint-aware policy learning.

To further clarify the technical differences between this work and existing studies, Table 1 summarizes representative related works in terms of the considered scenario, UAV mobility, RIS/STAR-RIS configuration, physical-layer security, queue-aware latency, learning-based optimization, hierarchical control, and constraint-aware learning.

As shown in Table 1, existing UAV-RIS studies mainly focus on trajectory design, passive beamforming, performance analysis, or secrecy-rate maximization with conventional reflecting RISs. Existing STAR-RIS-assisted V2X works mainly consider fixed surface deployment and spectrum/resource allocation, while the UAV-mounted STAR-RIS architecture has not been sufficiently investigated for secure and low-latency vehicular communication. Moreover, existing DRL- or SAC-based V2X studies rarely integrate UAV trajectory control, STAR-RIS transmission–reflection partitioning, phase-shift design, power allocation, physical-layer security, and queue-aware latency into a unified constrained learning framework. Therefore, the proposed HC-SAC method differs from existing studies by jointly exploiting UAV mobility, STAR-RIS full-space reconfiguration, and hierarchical constraint-aware policy learning.

Compared with the existing literature, the main differences of this work are threefold. First, unlike conventional RIS-assisted or STAR-RIS-assisted V2X studies that mainly consider fixed surface deployment, this paper investigates a UAV-mounted STAR-RIS architecture, where UAV mobility provides additional spatial degrees of freedom for adaptive urban vehicular coverage. Second, unlike existing secure V2X or UAV-RIS works that mainly focus on artificial noise, beamforming, secrecy-rate maximization, or resource allocation, this paper jointly considers passive eavesdropping threats, queue-aware low-latency service, and UAV flight constraints. Third, unlike existing DRL-based V2X optimization methods, the proposed HC-SAC framework jointly learns UAV trajectory control, STAR-RIS transmission–reflection partitioning, phase-shift configuration, and transmit power allocation under coupled security and latency constraints.

The main contributions of this paper are summarized as follows:

A UAV-mounted STAR-RIS-assisted urban vehicular communication framework is established by jointly considering urban blockage, dynamic vehicle mobility, passive eavesdropping threats, queueing delay, and UAV mobility constraints.
A long-term, secure, and low-latency utility maximization problem is formulated to jointly optimize the UAV trajectory, the STAR-RIS transmission–reflection partition ratio, the phase-shift matrices, and the transmit power allocation, resulting in a high-dimensional and strongly coupled continuous-control problem.
A hierarchical constrained soft actor–critic-based joint optimization algorithm is proposed to address the above problem. The developed method improves the adaptability of UAV-mounted STAR-RIS control in dynamic vehicular scenarios and enhances the trade-off between secrecy performance and delay efficiency.
Simulation results demonstrate that the proposed method outperforms DDPG and all structural benchmark schemes. Compared with PPO, it achieves lower delay and lower secrecy outage probability while maintaining a competitive secrecy rate and successful service ratio, thereby yielding the highest normalized composite utility.

The rest of this paper is organized as follows. Section 2 presents the system model. Section 3 formulates the MDP-based problem transformation. Section 4 develops the HC-SAC-based joint optimization algorithm. Section 5 provides simulation results and performance analysis. Finally, conclusions are drawn in Section 6.

Notation: Bold uppercase letters, bold lowercase letters, and lowercase letters denote matrices, vectors, and scalars, respectively. $(\cdot)^{T}$ denotes the transpose operation, $\|\cdot\|$ denotes the Euclidean norm, $\mathbb{E}[\cdot]$ denotes the expectation operator, and $j = \sqrt{-1}$ is the imaginary unit.

2. System Model

In the envisioned ISAC-enabled vehicular network, this paper focuses on secure and low-latency communication functionality. The sensing capability is treated as an auxiliary information source that can provide contextual information, such as vehicle distribution, blockage conditions, and mobility states, to the BS/UAV controller through the state acquisition mechanism discussed in Section 2.4. Accordingly, the following system model characterizes the communication-oriented optimization part of the ISAC-enabled network, while dedicated radar waveform design and sensing-performance optimization are beyond the scope of this work.

2.1. Geometry and Mobility Model

Consider a typical urban road intersection scenario, where the system consists of one base station (BS), one UAV-mounted STAR-RIS, K legitimate vehicular users, and E passive eavesdropping vehicles, as shown in Figure 1.

The BS is deployed at a fixed roadside location to provide downlink communication services for the vehicles within the target area. Due to building blockage and high vehicle mobility, the direct BS-to-vehicle links may experience non-line-of-sight (NLoS) conditions for some users. To enhance the communication quality in blocked areas, a UAV carrying a STAR-RIS is deployed above the road to provide aerial assistance.

The total service duration is divided into $N$ discrete time slots, each with a duration of $\delta_{t}$. Let the BS position be denoted by

$$ w_{b} = [x_{b}, y_{b}, 0]^{T}. \tag{1} $$

The UAV position at slot $n$ is denoted by

$$ q[n] = [x[n], y[n], H]^{T}, \quad n = 1, 2, \dots, N, \tag{2} $$

where $H$ is the fixed UAV flight altitude. Therefore, the UAV trajectory optimization focuses on the two-dimensional horizontal deployment while maintaining a constant altitude, which is consistent with the rotary-wing propulsion energy model adopted in this paper. The position of the $k$-th legitimate vehicular user at slot $n$ is given by

$$ u_{k}[n] = [x_{k}[n], y_{k}[n], 0]^{T}, \tag{3} $$

and the position of the $e$-th passive eavesdropper at slot $n$ is denoted by

$$ z_{e}[n] = [x_{e}[n], y_{e}[n], 0]^{T}. \tag{4} $$

Due to the UAV mobility limitation, the displacement between two adjacent slots must satisfy

$$ \| q[n+1] - q[n] \| \leq V_{\max} \delta_{t}, \quad n = 1, \dots, N-1, \tag{5} $$

where $V_{\max}$ denotes the maximum UAV speed.
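
As an illustration of how the mobility constraint (5) can be enforced on a learned action, the following minimal Python sketch projects a requested horizontal waypoint onto the feasible disc of radius V_max δ_t around the current position. The function name `clip_uav_step` and the numerical values are assumptions made for illustration only; the paper does not specify how the constraint is implemented.

```python
import math

def clip_uav_step(q_now, q_target, v_max, delta_t):
    """Return a horizontal UAV position that satisfies constraint (5).

    q_now, q_target: (x, y) tuples in metres; the altitude H is fixed and omitted.
    """
    dx = q_target[0] - q_now[0]
    dy = q_target[1] - q_now[1]
    dist = math.hypot(dx, dy)
    max_step = v_max * delta_t
    if dist <= max_step:
        # The requested waypoint is already reachable within one slot.
        return (q_target[0], q_target[1])
    # Otherwise move along the same direction, but only as far as allowed.
    scale = max_step / dist
    return (q_now[0] + scale * dx, q_now[1] + scale * dy)

# Assumed example values (not taken from the paper): 20 m/s and 0.5 s slots.
q_next = clip_uav_step((0.0, 0.0), (30.0, 40.0), v_max=20.0, delta_t=0.5)
print(q_next)
```

Scaling along the requested direction keeps the heading chosen by the policy and changes only the step length, which is one simple way to keep every executed action feasible.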

The STAR-RIS mounted on the UAV consists of $M$ programmable elements. Owing to its simultaneous transmission and reflection capability, the reflection ratio and transmission ratio at slot $n$ are denoted by $\beta_{r}[n]$ and $\beta_{t}[n]$, respectively, and satisfy

$$ \beta_{r}[n] + \beta_{t}[n] = 1, \quad 0 \leq \beta_{r}[n], \beta_{t}[n] \leq 1. \tag{6} $$

The corresponding reflection and transmission phase-shift matrices are defined as

$$ \Theta_{r}[n] = \operatorname{diag}\left( \sqrt{\beta_{r}[n]}\, e^{j \theta_{r,1}[n]}, \dots, \sqrt{\beta_{r}[n]}\, e^{j \theta_{r,M}[n]} \right), \tag{7} $$

$$ \Theta_{t}[n] = \operatorname{diag}\left( \sqrt{\beta_{t}[n]}\, e^{j \theta_{t,1}[n]}, \dots, \sqrt{\beta_{t}[n]}\, e^{j \theta_{t,M}[n]} \right), \tag{8} $$

where $\theta_{r,m}[n]$ and $\theta_{t,m}[n]$ denote the phase shifts of the $m$-th STAR-RIS element under the reflection and transmission modes, respectively.
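
To make the structure of (6)-(8) concrete, the sketch below builds the diagonal entries of the reflection and transmission matrices from a single partition ratio and two phase vectors. The helper name `star_ris_diagonals` and the sample phases are hypothetical; only the diagonals are stored, since both matrices are diagonal.

```python
import cmath
import math

def star_ris_diagonals(beta_r, theta_r, theta_t):
    """Diagonal entries of Theta_r[n] and Theta_t[n] following (7)-(8)."""
    if not 0.0 <= beta_r <= 1.0:
        raise ValueError("beta_r must lie in [0, 1]")
    if len(theta_r) != len(theta_t):
        raise ValueError("both phase vectors must have M entries")
    beta_t = 1.0 - beta_r  # transmission ratio implied by (6)
    diag_r = [math.sqrt(beta_r) * cmath.exp(1j * th) for th in theta_r]
    diag_t = [math.sqrt(beta_t) * cmath.exp(1j * th) for th in theta_t]
    return diag_r, diag_t

# Assumed toy configuration with M = 2 elements.
diag_r, diag_t = star_ris_diagonals(0.3, [0.0, math.pi / 2], [math.pi, 0.25])
# Per element, reflected plus transmitted power should sum to one.
power = [abs(r) ** 2 + abs(t) ** 2 for r, t in zip(diag_r, diag_t)]
print(power)
```

Deriving β_t from β_r inside the function means a policy only needs to output one partition variable, and the equality in (6) holds by construction.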

2.2. Channel Gain Model

To characterize the urban blockage effect and small-scale fading, the channel between nodes $i$ and $j$ at slot $n$ is modeled as

$$ h_{i,j}[n] = \sqrt{\rho_{0} (d_{i,j}[n])^{-\alpha_{i,j}}} \left( \sqrt{\frac{\kappa_{i,j}}{1 + \kappa_{i,j}}}\, h_{i,j}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1 + \kappa_{i,j}}}\, h_{i,j}^{\mathrm{NLoS}}[n] \right), \tag{9} $$

where $\rho_{0}$ denotes the channel power gain at the reference distance of 1 m, $d_{i,j}[n]$ is the Euclidean distance between nodes $i$ and $j$, $\alpha_{i,j}$ is the path-loss exponent, and $\kappa_{i,j}$ denotes the Rician factor. Moreover, $h_{i,j}^{\mathrm{LoS}}[n]$ and $h_{i,j}^{\mathrm{NLoS}}[n]$ denote the deterministic LoS component and the random NLoS component, respectively.
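
A scalar sample of the Rician model (9) can be drawn as in the following sketch. It assumes a unit-modulus LoS term with a given phase and a unit-variance circularly symmetric complex Gaussian NLoS term; these normalizations, the function name `rician_channel`, and all parameter values are assumptions for illustration and are not stated in the text.

```python
import cmath
import math
import random

def rician_channel(d, rho0, alpha, kappa, los_phase, rng):
    """One scalar realization of (9) for a link of length d metres."""
    path_gain = math.sqrt(rho0 * d ** (-alpha))
    h_los = cmath.exp(1j * los_phase)  # assumed unit-modulus LoS component
    # Assumed NLoS component: CN(0, 1), i.e. variance 1/2 per real dimension.
    std = math.sqrt(0.5)
    h_nlos = complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
    los_weight = math.sqrt(kappa / (1.0 + kappa))
    nlos_weight = math.sqrt(1.0 / (1.0 + kappa))
    return path_gain * (los_weight * h_los + nlos_weight * h_nlos)

rng = random.Random(0)
samples = [rician_channel(100.0, 1e-3, 2.2, 10.0, 0.0, rng) for _ in range(20000)]
mean_power = sum(abs(h) ** 2 for h in samples) / len(samples)
expected_power = 1e-3 * 100.0 ** (-2.2)  # rho0 * d^(-alpha)
print(mean_power / expected_power)
```

Under these normalizations the average channel power equals ρ_0 d^{-α} for any Rician factor, so κ only shifts power between the deterministic and random parts.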

For the BS–STAR-RIS link and the STAR-RIS–vehicle/eavesdropper links, which are mainly dominated by line-of-sight propagation, relatively large Rician factors are adopted. By contrast, for the direct BS-to-vehicle and BS-to-eavesdropper links in blocked urban environments, smaller Rician factors and larger path-loss exponents are used to characterize severe NLoS propagation conditions. Accordingly, the BS–STAR-RIS channel, the STAR-RIS–user channel, the STAR-RIS–eavesdropper channel, and the corresponding direct links are written, respectively, as

$$ h_{br}[n] = \sqrt{\rho_{0} (d_{br}[n])^{-\alpha_{br}}} \left( \sqrt{\frac{\kappa_{br}}{1 + \kappa_{br}}}\, h_{br}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1 + \kappa_{br}}}\, h_{br}^{\mathrm{NLoS}}[n] \right), \tag{10} $$

$$ g_{k}[n] = \sqrt{\rho_{0} (d_{rk}[n])^{-\alpha_{rk}}} \left( \sqrt{\frac{\kappa_{rk}}{1 + \kappa_{rk}}}\, g_{k}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1 + \kappa_{rk}}}\, g_{k}^{\mathrm{NLoS}}[n] \right), \tag{11} $$

$$ g_{e}[n] = \sqrt{\rho_{0} (d_{re}[n])^{-\alpha_{re}}} \left( \sqrt{\frac{\kappa_{re}}{1 + \kappa_{re}}}\, g_{e}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1 + \kappa_{re}}}\, g_{e}^{\mathrm{NLoS}}[n] \right), \tag{12} $$

$$ h_{d,k}[n] = \sqrt{\rho_{0} (d_{bk}[n])^{-\alpha_{bk}}} \left( \sqrt{\frac{\kappa_{bk}}{1 + \kappa_{bk}}}\, h_{d,k}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1 + \kappa_{bk}}}\, h_{d,k}^{\mathrm{NLoS}}[n] \right), \tag{13} $$

$$ h_{d,e}[n] = \sqrt{\rho_{0} (d_{be}[n])^{-\alpha_{be}}} \left( \sqrt{\frac{\kappa_{be}}{1 + \kappa_{be}}}\, h_{d,e}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1 + \kappa_{be}}}\, h_{d,e}^{\mathrm{NLoS}}[n] \right). \tag{14} $$

2.3. Imperfect CSI and Practical STAR-RIS Constraints

In the above channel model, the CSI is assumed to be available at the BS/UAV controller for resource optimization. However, in practical UAV-mounted STAR-RIS-assisted vehicular networks, channel estimation errors, phase quantization, and hardware impairments may exist due to vehicle mobility, limited pilot overhead, and finite-resolution STAR-RIS control circuits. To improve the practical interpretability of the proposed framework, this subsection discusses imperfect CSI and practical STAR-RIS constraints.

For the estimated channel coefficient, an additive CSI error model is adopted:

$$ \hat{h}_{i,j}[n] = h_{i,j}[n] + e_{i,j}[n], \tag{15} $$

where $h_{i,j}[n]$ denotes the actual channel coefficient between nodes $i$ and $j$, $\hat{h}_{i,j}[n]$ is the estimated CSI available at the controller, and $e_{i,j}[n]$ denotes the channel estimation error. The estimation error is modeled as a complex Gaussian random variable:

$$ e_{i,j}[n] \sim \mathcal{CN}(0, \sigma_{e}^{2}), \tag{16} $$

where $\sigma_{e}^{2}$ characterizes the CSI uncertainty level. During policy execution, the HC-SAC agent makes decisions based on the estimated CSI $\hat{h}_{i,j}[n]$, while the actual received signal quality is affected by the true channel $h_{i,j}[n]$.
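
The additive error model (15)-(16) can be emulated in a simulator as sketched below, where the controller observes a perturbed coefficient while the true one determines the received signal. The function name `estimate_csi` and the chosen variance are illustrative assumptions.

```python
import math
import random

def estimate_csi(h_true, sigma_e2, rng):
    """Return h_hat = h + e with e ~ CN(0, sigma_e2), following (15)-(16)."""
    # A CN(0, sigma_e2) variable has variance sigma_e2 / 2 per real dimension.
    std = math.sqrt(sigma_e2 / 2.0)
    error = complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
    return h_true + error

rng = random.Random(1)
h_true = 0.8 - 0.3j  # assumed toy channel coefficient
errors = [estimate_csi(h_true, 0.01, rng) - h_true for _ in range(20000)]
empirical_var = sum(abs(e) ** 2 for e in errors) / len(errors)
print(empirical_var)
```

The empirical error power should be close to the configured σ_e², which is a quick sanity check before running robustness experiments at several uncertainty levels.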

Moreover, practical STAR-RIS elements usually support only finite-resolution phase shifts. Therefore, the continuous phase shift $\theta_{m}[n]$ can be quantized into a finite codebook:

$$ \mathcal{F}_{B} = \left\{ 0, \frac{2\pi}{2^{B}}, \dots, \frac{2\pi (2^{B} - 1)}{2^{B}} \right\}, \tag{17} $$

where $B$ denotes the number of phase quantization bits. The quantized phase shift is given by

$$ \theta_{m}^{Q}[n] = \arg\min_{\theta \in \mathcal{F}_{B}} |\theta - \theta_{m}[n]|. \tag{18} $$

Accordingly, the practical STAR-RIS phase-shift matrices can be written as

$$ \Theta_{r}^{Q}[n] = \operatorname{diag}\left( e^{j \theta_{r,1}^{Q}[n]}, \dots, e^{j \theta_{r,M}^{Q}[n]} \right), \tag{19} $$

$$ \Theta_{t}^{Q}[n] = \operatorname{diag}\left( e^{j \theta_{t,1}^{Q}[n]}, \dots, e^{j \theta_{t,M}^{Q}[n]} \right). \tag{20} $$
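
The codebook (17) and the nearest-point rule (18) translate directly into a few lines of Python. The sketch below is a literal reading of (18) with the absolute difference as the distance; the function name `quantize_phase` and the test phases are assumptions for illustration.

```python
import math

def quantize_phase(theta, bits):
    """Nearest codebook phase in F_B, following (17)-(18) literally."""
    levels = 2 ** bits
    codebook = [2.0 * math.pi * i / levels for i in range(levels)]
    # arg min over the codebook of |candidate - theta|.
    return min(codebook, key=lambda candidate: abs(candidate - theta))

# Assumed continuous phases in [0, 2*pi), quantized with B = 2 bits.
quantized = [quantize_phase(t, bits=2) for t in (0.1, 1.5, 3.0, 6.2)]
print(quantized)
```

With B = 2 the codebook is {0, π/2, π, 3π/2}. Because (18) measures a plain absolute difference, a phase just below 2π maps to 3π/2 rather than wrapping around to 0; a circular distance would be a possible variant but is not what (18) states.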

In addition, STAR-RIS hardware impairments can be modeled by amplitude attenuation factors. Specifically, the practical reflection and transmission coefficients are expressed as

$$ \tilde{\beta}_{r,m}[n] = \eta_{r} \beta_{r,m}[n], \quad \tilde{\beta}_{t,m}[n] = \eta_{t} \beta_{t,m}[n], \tag{21} $$

where $0 < \eta_{r} \leq 1$ and $0 < \eta_{t} \leq 1$ denote the hardware efficiency factors of the reflection and transmission modes, respectively. When $\eta_{r} = \eta_{t} = 1$ and $B \to \infty$, the ideal STAR-RIS model is recovered.

In the main algorithm design, the proposed HC-SAC framework is trained based on the estimated CSI and continuous STAR-RIS control variables. Imperfect CSI, finite-resolution phase control, and hardware impairments are included here as practical modeling considerations, and their impact is further evaluated in the robustness analysis in Section 5.10.

2.4. Vehicle State Acquisition and Control Signaling

In practical UAV-mounted STAR-RIS-assisted vehicular networks, the STAR-RIS does not independently detect vehicular users or estimate their channels. Instead, vehicle state acquisition and STAR-RIS control are coordinated by the BS and the UAV-mounted controller through vehicular signaling, pilot transmission, and sensing-assisted state estimation. To clarify the practical operation of the considered system, this subsection describes the vehicle state acquisition and control signaling mechanism.

Specifically, each vehicular user periodically broadcasts basic safety messages (BSMs) or cooperative awareness messages (CAMs), which contain information on the user’s position, velocity, moving direction, and service-related information. Meanwhile, uplink pilot signals are transmitted to assist the BS/UAV controller in estimating the channel state information (CSI). In sensing-assisted vehicular networks, onboard or roadside sensing functions can further provide auxiliary information on vehicle distribution, blockage conditions, and potential target states. Based on these signaling and sensing results, the BS or UAV-mounted controller constructs the system state, including vehicle mobility information, queue states, channel-related features, and eavesdropper-related information.

After obtaining the system state at the beginning of each time slot, the proposed HC-SAC policy generates the UAV trajectory control, STAR-RIS transmission–reflection partition ratio, phase-shift configuration, and transmit power allocation. The resulting control commands are then delivered to the UAV flight controller and the STAR-RIS controller through a dedicated low-rate control link. The UAV adjusts its position according to the trajectory command, while the STAR-RIS updates its transmission–reflection coefficients and phase shifts according to the received configuration command.

For analytical tractability, the duration of each time slot is assumed to be sufficiently short such that the vehicle positions, channel states, and queue states remain approximately unchanged within one slot. At the beginning of the next time slot, the BS/UAV controller updates the observed system state according to newly received BSM/CAM messages, pilot measurements, and sensing feedback. Therefore, the proposed framework operates in a closed-loop manner, where vehicle state acquisition, policy decision, STAR-RIS control, and environment update are repeatedly performed over time.

2.5. Secure Communication Model

Assume that the BS employs downlink superposition transmission at slot $n$, and the transmitted signal is expressed as

$$ x[n] = \sum_{k=1}^{K} \sqrt{p_{k}[n]}\, s_{k}[n], \tag{22} $$

where $p_{k}[n]$ denotes the transmit power allocated to the $k$-th legitimate user, and $s_{k}[n]$ is the corresponding information symbol satisfying $\mathbb{E}[|s_{k}[n]|^{2}] = 1$. The total transmit power is constrained by

$$ \sum_{k=1}^{K} p_{k}[n] \leq P_{\max}, \quad \forall n. \tag{23} $$

Since vehicular users are located on both sides of the road, different users may be served by either the reflection region or the transmission region of the STAR-RIS. Let $o_{k}[n] \in \{r, t\}$ denote the operating mode associated with user $k$ at slot $n$, where $r$ and $t$ represent the reflection and transmission modes, respectively. Then, the equivalent channel from the BS to the $k$-th legitimate user can be written as

$$ h_{k}^{\mathrm{eq}}[n] = h_{d,k}[n] + g_{k}^{H}[n]\, \Theta_{o_{k}[n]}[n]\, h_{br}[n]. \tag{24} $$

Accordingly, the received signal-to-interference-plus-noise ratio (SINR) of user $k$ at slot $n$ is given by

$$ \gamma_{k}[n] = \frac{p_{k}[n] |h_{k}^{\mathrm{eq}}[n]|^{2}}{\sum_{i \neq k} p_{i}[n] |h_{k}^{\mathrm{eq}}[n]|^{2} + \sigma_{k}^{2}}, \tag{25} $$

where $\sigma_{k}^{2}$ denotes the receiver noise power. The achievable rate of user $k$ is thus

$$ R_{k}[n] = B \log_{2}(1 + \gamma_{k}[n]), \tag{26} $$

where $B$ denotes the system bandwidth (the same symbol denotes the number of quantization bits in (17); the two quantities are unrelated).
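
The chain from the equivalent channel (24) to the SINR (25) and the rate (26) can be checked numerically with the following sketch. It assumes a single-antenna BS, so that the BS-to-STAR-RIS and STAR-RIS-to-user channels are length-M vectors and the phase-shift matrix is given by its diagonal; the function names and the toy numbers are hypothetical.

```python
import math

def equivalent_channel(h_direct, g, theta_diag, h_br):
    """h_eq = h_d + g^H Theta h_br for a diagonal Theta, following (24)."""
    cascaded = sum(gm.conjugate() * tm * hm
                   for gm, tm, hm in zip(g, theta_diag, h_br))
    return h_direct + cascaded

def sinr_and_rate(k, powers, h_eq, noise_power, bandwidth):
    """SINR (25) and achievable rate (26) of user k."""
    gain = abs(h_eq) ** 2
    # All superposed signals reach user k through the same equivalent channel.
    interference = sum(p for i, p in enumerate(powers) if i != k) * gain
    sinr = powers[k] * gain / (interference + noise_power)
    return sinr, bandwidth * math.log2(1.0 + sinr)

# Assumed toy setup: M = 2 elements, blocked direct link, two users.
g = [1 + 0j, 1j]
theta = [1 + 0j, 1j]  # phases chosen so both cascaded paths add coherently
h_br = [1 + 0j, 1 + 0j]
h_eq = equivalent_channel(0.0, g, theta, h_br)
sinr, rate = sinr_and_rate(0, [3.0, 1.0], h_eq, noise_power=1.0, bandwidth=1.0)
print(h_eq, sinr, rate)
```

In this toy case the two cascaded paths align in phase, which is the effect the phase-shift design aims for. The SINR remains bounded by the power ratio between users because the interference in (25) scales with the same channel gain as the desired signal.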

For the $e$-th passive eavesdropper, the equivalent channel for intercepting the signal intended for user $k$ is expressed as

$$ h_{e,k}^{\mathrm{eq}}[n] = h_{d,e}[n] + g_{e}^{H}[n]\, \Theta_{o_{k}[n]}[n]\, h_{br}[n]. \tag{27} $$

The corresponding eavesdropping SINR is given by

$$ \gamma_{e,k}[n] = \frac{p_{k}[n] |h_{e,k}^{\mathrm{eq}}[n]|^{2}}{\sum_{i \neq k} p_{i}[n] |h_{e,k}^{\mathrm{eq}}[n]|^{2} + \sigma_{e}^{2}}, \tag{28} $$

where $\sigma_{e}^{2}$ is the noise power at the eavesdropper (the same symbol denotes the CSI error variance in (16); the two quantities are unrelated). The corresponding eavesdropping rate is

$$ R_{e,k}[n] = B \log_{2}(1 + \gamma_{e,k}[n]). \tag{29} $$

Considering the strongest eavesdropping threat, the instantaneous secrecy rate of user $k$ at slot $n$ is defined as

$$ R_{k}^{\mathrm{sec}}[n] = \left[ R_{k}[n] - \max_{e \in \{1, \dots, E\}} R_{e,k}[n] \right]^{+}, \tag{30} $$

where $[x]^{+} = \max(x, 0)$.
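
The worst-case secrecy rate (30) reduces to a maximum over eavesdroppers followed by a clamp at zero, as the short sketch below shows. The function name `secrecy_rate`, the example rates, and the convention of returning the full user rate when no eavesdropper is present are assumptions for illustration.

```python
def secrecy_rate(rate_user, eavesdropper_rates):
    """Instantaneous secrecy rate (30) against the strongest eavesdropper."""
    # Assumed convention: with no eavesdroppers the leakage term is zero.
    worst_case = max(eavesdropper_rates) if eavesdropper_rates else 0.0
    return max(rate_user - worst_case, 0.0)

# Assumed rates in bit/s/Hz for one user and E = 2 eavesdroppers.
positive_case = secrecy_rate(3.2, [1.1, 2.5])
outage_case = secrecy_rate(1.0, [2.0, 0.4])
print(positive_case, outage_case)
```

The second call illustrates the clamp: when any single eavesdropper decodes at a higher rate than the legitimate user, the secrecy rate is zero regardless of the other eavesdroppers.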

2.6. Queue Evolution and Low-Latency Service Model

In urban vehicular networks, many safety-related services, such as cooperative driving, road hazard warning, and vehicle status reporting, are delay-sensitive. Therefore, in addition to improving the secrecy performance, it is necessary to explicitly characterize the traffic queue evolution and the corresponding service delay. Figure 2 illustrates the queue evolution and low-latency service model for each vehicular user.

Let

A_{k} [n]

denote the newly arrived data packets of the k-th vehicular user at time slot n. The packet arrival process is modeled as a Poisson process with average arrival rate

λ_{k}

, i.e.,

A_{k} [n] \sim Poisson (λ_{k}),

(31)

where

λ_{k}

denotes the average packet arrival rate of user k. Let

Q_{k} [n]

denote the queue length of the k-th vehicular user at the beginning of time slot n. The service capability of the system depends on the achievable communication rate. Given the achievable rate

R_{k} [n]

, the slot duration

δ_{t}

, and the packet size

L_{0}

in bits, the number of packets that can be served during time slot n is expressed as

μ_{k} [n] = \frac{δ_{t} R_{k} [n]}{L_{0}},

(32)

In the optimization formulation,

μ_{k} [n]

is treated as a continuous service amount to facilitate differentiable and low-complexity policy learning. In a practical packet-level implementation, the actual number of served packets can be obtained by rounding down to an integer, e.g.,

⌊ μ_{k} [n] ⌋

.

Accordingly, the queue evolution of vehicular user k is given by

Q_{k} [n + 1] = max \{Q_{k} [n] - μ_{k} [n], 0\} + A_{k} [n] .

(33)

This equation indicates that the queue length in the next slot is jointly determined by the remaining packets after service and the newly arrived packets.
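The recursion in Equations (32) and (33) can be sketched as follows. The packet size of 1000 bits, the rates, and the arrival counts are toy values chosen for illustration only; in the model the arrivals would be drawn from the Poisson process of Equation (31).

```python
def served_packets(rate_bps, slot_s, packet_bits):
    """Eq. (32): continuous service amount mu_k[n], in packets."""
    return slot_s * rate_bps / packet_bits

def queue_update(queue, served, arrivals):
    """Eq. (33): serve first (never below zero), then add the new arrivals."""
    return max(queue - served, 0.0) + arrivals

# Two slots with assumed (rate in bit/s, arrivals in packets) pairs.
q = 5.0
for rate_bps, arrivals in [(2000.0, 3), (8000.0, 1)]:
    q = queue_update(q, served_packets(rate_bps, 1.0, 1000.0), arrivals)
```

In the first slot two packets are served and three arrive, so the queue grows from 5 to 6. In the second slot the service amount of 8 exceeds the backlog, the queue empties, and only the single new arrival remains.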

Based on Little’s law, the instantaneous queueing delay of the k-th vehicular user can be approximated by the ratio between the queue length and the packet arrival rate, i.e.,

D_{k} [n] = \frac{Q_{k} [n]}{λ_{k} + ϵ},

(34)

where

ϵ

is a small positive constant used to avoid division by zero. This instantaneous delay approximation based on Little’s law has been widely adopted in cross-layer wireless resource optimization, since it provides a low-complexity delay feedback indicator for dynamic control problems. The average system delay at time slot n is then defined as

D [n] = \frac{1}{K} \sum_{k = 1}^{K} D_{k} [n],

(35)

where K denotes the number of vehicular users.

To capture the low-latency service requirement, a delay violation indicator is defined as

I_{k}^{D} [n] = \begin{cases} 1, & D_{k} [n] > D_{max}, \\ 0, & D_{k} [n] \leq D_{max}, \end{cases}

(36)

where

D_{max}

is the maximum tolerable delay threshold. This queue-aware delay model is incorporated into the reward function of the proposed DRL framework, so that the learned control policy can jointly improve secrecy performance and suppress queue accumulation.
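A minimal sketch of the delay feedback in Equations (34)–(36) is given below. The function names and the default value of the small constant are assumptions made for illustration.

```python
def queue_delay(queue, arrival_rate, eps=1e-6):
    """Eq. (34): Little's-law style delay estimate, in slots."""
    return queue / (arrival_rate + eps)

def average_delay(queues, arrival_rates, eps=1e-6):
    """Eq. (35): mean of the per-user delay estimates."""
    delays = [queue_delay(q, lam, eps) for q, lam in zip(queues, arrival_rates)]
    return sum(delays) / len(delays)

def delay_violation(delay, d_max):
    """Eq. (36): 1 if the delay exceeds the threshold, 0 otherwise."""
    return 1 if delay > d_max else 0
```

For example, a backlog of 6 packets with an arrival rate of 2 packets per slot corresponds to an estimated delay of about 3 slots.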

2.7. UAV Flight Energy Consumption Model

Since the UAV-mounted STAR-RIS needs to continuously adjust its aerial position to assist vehicular users in dynamic urban environments, UAV propulsion energy consumption should be explicitly considered. Instead of using a simplified quadratic displacement-based energy model, this paper adopts a practical rotary-wing UAV propulsion power model, which is widely used to characterize the relationship between UAV flight speed and propulsion energy consumption.

Let

q [n] = {[x_{u} [n], y_{u} [n], H]}^{T}

denote the UAV position at time slot n, where H is the fixed flight altitude. The horizontal flight speed of the UAV during time slot n is given by

v [n] = \frac{∥q [n + 1] - q [n]∥}{δ_{t}},

(37)

The propulsion power consumption of the rotary-wing UAV is modeled as

P_{UAV} (v [n]) = P_{0} (1 + \frac{3 v^{2} [n]}{U_{tip}^{2}}) + P_{i} {(\sqrt{1 + \frac{v^{4} [n]}{4 v_{0}^{4}}} - \frac{v^{2} [n]}{2 v_{0}^{2}})}^{\frac{1}{2}} + \frac{1}{2} d_{0} ρ s A v^{3} [n],

(38)

where

P_{0}

and

P_{i}

denote the blade profile power and the induced power in hover, respectively;

U_{tip}

is the tip speed of the rotor blade;

v_{0}

is the mean rotor induced velocity in hovering;

d_{0}

is the fuselage drag ratio;

ρ

is the air density; s is the rotor solidity; and A is the rotor disc area.

Accordingly, the UAV propulsion energy consumption during time slot n is expressed as

E_{UAV} [n] = P_{UAV} (v [n]) δ_{t} .

(39)

The total UAV propulsion energy consumption over the whole flight period is given by

E_{tot}^{UAV} = \sum_{n = 1}^{N} E_{UAV} [n] .

(40)

Recall that the UAV trajectory should satisfy the mobility constraint and the total propulsion energy budget:

∥q [n + 1] - q [n]∥ \leq V_{max} δ_{t}, \forall n,

(41)

E_{tot}^{UAV} \leq E_{max},

(42)

where

V_{max}

denotes the maximum UAV speed and

E_{max}

denotes the available UAV energy budget.
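The energy model of Equations (37)–(42) can be sketched as follows. The numerical parameter values are illustrative assumptions for a small rotary-wing UAV and are not taken from Table 3; the function names are likewise hypothetical.

```python
import math

# Assumed example parameters (not the values of Table 3).
P0, PI = 79.86, 88.63                    # blade profile / induced power in hover [W]
U_TIP, V0 = 120.0, 4.03                  # rotor tip speed, mean induced velocity [m/s]
D0, RHO, S, A = 0.6, 1.225, 0.05, 0.503  # drag ratio, air density, solidity, disc area

def propulsion_power(v):
    """Eq. (38): blade profile + induced + parasite power at speed v [m/s]."""
    blade = P0 * (1.0 + 3.0 * v**2 / U_TIP**2)
    induced = PI * math.sqrt(math.sqrt(1.0 + v**4 / (4.0 * V0**4)) - v**2 / (2.0 * V0**2))
    parasite = 0.5 * D0 * RHO * S * A * v**3
    return blade + induced + parasite

def speed(q_next, q_now, slot_s):
    """Eq. (37): speed from two consecutive positions (altitude is fixed)."""
    return math.dist(q_next, q_now) / slot_s

def slot_energy(q_next, q_now, slot_s):
    """Eq. (39): energy spent in one slot."""
    return propulsion_power(speed(q_next, q_now, slot_s)) * slot_s

def trajectory_feasible(positions, slot_s, v_max, e_max):
    """Eqs. (40)-(42): per-slot speed limit and total energy budget."""
    pairs = list(zip(positions[1:], positions[:-1]))
    speeds_ok = all(speed(a, b, slot_s) <= v_max for a, b in pairs)
    total = sum(slot_energy(a, b, slot_s) for a, b in pairs)
    return speeds_ok and total <= e_max
```

Note that Equation (38) yields a hover power equal to the sum of the blade profile and induced power at zero speed, so hovering is not free in this model.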

3. MDP Modeling and Problem Transformation

3.1. MDP Modeling

The considered joint optimization problem is high-dimensional, non-convex, strongly coupled, and long-term dynamic. Since vehicle positions, traffic arrivals, eavesdropping threats, and channel states all evolve over time, the control action taken at the current time slot affects not only the instantaneous secrecy rate and delay performance, but also the future system performance through UAV movement and queue evolution. Therefore, the original problem is modeled as a Markov decision process (MDP) with continuous state and action spaces.

At time slot n, the agent observes the system state

s [n]

, executes an action

a [n]

, and receives an immediate reward

r [n]

. The objective is to learn a policy

π

that maximizes the long-term expected discounted reward:

max_{π} E_{π} [\sum_{n = 1}^{N} ζ^{n - 1} r [n]],

(43)

where

ζ \in (0, 1)

is the discount factor.
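For a single sampled episode, the quantity inside the expectation of Equation (43) can be evaluated as in the short sketch below (the function name is an assumption).

```python
def discounted_return(rewards, zeta):
    """Sum of zeta^(n-1) * r[n] for n = 1..N; enumerate starts at 0,
    so the first reward is weighted by zeta^0 = 1."""
    return sum(zeta**n * r for n, r in enumerate(rewards))
```

With a discount factor of 0.5 and three unit rewards, the return is 1 + 0.5 + 0.25 = 1.75.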

3.2. State Space Design

To enable the agent to fully perceive the current communication environment, queue states, and UAV operating conditions, the system state is defined as

s [n] = \{q [n], U [n], Z [n], Q [n], h [n], β [n - 1], P [n - 1]\},

(44)

where:

$q [n]$ denotes the current UAV position;
$U [n] = \{u_{k} [n]\}_{k = 1}^{K}$ denotes the set of legitimate user positions;
$Z [n] = \{z_{e} [n]\}_{e = 1}^{E}$ denotes the set of eavesdropper positions;
$Q [n] = \{Q_{k} [n]\}_{k = 1}^{K}$ denotes the traffic queue states;
$h [n]$ represents the channel-state features composed of the BS–STAR-RIS, STAR-RIS–user, and STAR-RIS–eavesdropper links;
$β [n - 1]$ and $P [n - 1]$ denote the STAR-RIS partition ratio and power allocation of the previous slot, respectively.

The above state representation jointly reflects spatial geometry, channel conditions, eavesdropping threats, and queue states. In addition, the previous-slot STAR-RIS partition ratio and power allocation are included to provide temporal context for consecutive decisions.

3.3. Action Space Design

At each time slot, the agent needs to jointly control the UAV trajectory, STAR-RIS transmission–reflection partition ratio, phase-shift configuration, and power allocation. Therefore, the action is defined as

a [n] = \{Δ q [n], β_{r} [n], θ_{r} [n], θ_{t} [n], p [n]\},

(45)

where:

$Δ q [n]$ denotes the UAV displacement control at slot n;
$β_{r} [n]$ denotes the STAR-RIS reflection ratio, while the transmission ratio is obtained as $β_{t} [n] = 1 - β_{r} [n]$ ;
$θ_{r} [n] = [θ_{r, 1} [n], \dots, θ_{r, M} [n]]$ is the reflection phase-shift vector;
$θ_{t} [n] = [θ_{t, 1} [n], \dots, θ_{t, M} [n]]$ is the transmission phase-shift vector;
$p [n] = [p_{1} [n], \dots, p_{K} [n]]$ is the power allocation vector.

3.4. Reward Function Design

To jointly improve secure communication, low-latency performance, and service reliability while controlling the UAV mobility cost, the immediate reward at slot n is designed as

r [n] = α R_{\sec} [n] - β D [n] + γ C [n] - λ E_{UAV} [n] - ξ Φ [n],

(46)

where

R_{\sec} [n] = \frac{1}{K} \sum_{k = 1}^{K} R_{k}^{\sec} [n]

(47)

is the average secrecy rate,

C [n] = \frac{1}{K} \sum_{k = 1}^{K} I (R_{k} [n] \geq R_{k}^{min})

(48)

is the per-slot successful service ratio with respect to the rate requirement,

E_{UAV} [n]

denotes the UAV propulsion energy consumption in slot n, and

Φ [n]

denotes the degree of constraint violation. The weighting factors

α

,

β

,

γ

,

λ

, and

ξ

are nonnegative constants.
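A minimal sketch of the reward in Equations (46)–(48) is shown below. The dictionary keys for the weights and all numerical values are assumptions for illustration; the paper's weight settings are not reproduced here.

```python
def immediate_reward(secrecy_rates, rates, min_rates, avg_delay, uav_energy, violation, w):
    """Eq. (46) with the averages of Eqs. (47) and (48) computed per slot."""
    k = len(secrecy_rates)
    r_sec = sum(secrecy_rates) / k                                    # Eq. (47)
    c = sum(1 for r, r_min in zip(rates, min_rates) if r >= r_min) / k  # Eq. (48)
    return (w["alpha"] * r_sec - w["beta"] * avg_delay + w["gamma"] * c
            - w["lambda"] * uav_energy - w["xi"] * violation)
```

Because the terms have different units (rate, slots, ratio, and energy), the weights also act as scaling factors, which is worth keeping in mind when tuning them.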

4. HC-SAC-Based Joint Optimization Algorithm

4.1. HC-SAC Framework Overview

As illustrated in Figure 3, the proposed HC-SAC framework follows a closed-loop interaction process between the learning agent and the UAV-mounted STAR-RIS-assisted vehicular environment. At each time slot, the current network state is constructed from the UAV position, vehicle mobility information, queue states, channel-related features, eavesdropper information, and historical control actions. Based on this state, the hierarchical policy network generates continuous control actions for UAV movement, STAR-RIS transmission–reflection configuration, phase-shift design, and transmit power allocation.

Before being executed in the environment, the raw actions are transformed by a feasibility mapping module to satisfy the UAV mobility constraint, STAR-RIS coefficient constraint, phase-shift constraint, and power budget constraint. The environment then updates the vehicle positions, wireless channels, queue lengths, UAV energy consumption, and security–latency performance. The resulting reward and constraint violation signals are stored in the replay buffer and used to update the actor networks, critic networks, entropy coefficient, and constraint multipliers. The detailed definitions of the hierarchical policy structure, feasibility mapping, reward shaping, and network updates are provided in the following subsections.

4.2. Hierarchical Constrained SAC Update

Different from standard SAC, the proposed HC-SAC introduces a hierarchical control structure and a constraint-aware reward shaping mechanism. The high-level actor is responsible for coarse-grained control variables, including the UAV displacement and the STAR-RIS transmission–reflection partition ratio, while the low-level actor is responsible for fine-grained physical-layer decisions, including the STAR-RIS phase-shift vectors and transmit power allocation. This decomposition reduces the effective action coupling and improves learning stability in dynamic vehicular environments.

Let

a_{h} [n]

and

a_{l} [n]

denote the high-level and low-level actions at time slot n, respectively. The overall action is given by

a [n] = \{a_{h} [n], a_{l} [n]\} .

(49)

The joint hierarchical policy can be expressed as

π_{ϕ} (a [n] | s [n]) = π_{ϕ_{h}} (a_{h} [n] | s [n]) π_{ϕ_{l}} (a_{l} [n] | s [n], a_{h} [n]),

(50)

where

ϕ_{h}

and

ϕ_{l}

denote the parameters of the high-level and low-level actors, respectively, and

ϕ = \{ϕ_{h}, ϕ_{l}\}

.
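The factorization in Equation (50) implies that the joint log-probability is the sum of the high-level and low-level terms. The sketch below illustrates this with diagonal Gaussian policies over the raw (unmapped) actions; the two toy actor functions are hypothetical stand-ins for the neural networks, and the action dimensions are arbitrary.

```python
import math
import random

def gaussian_log_prob(x, mean, std):
    """Log-density of a diagonal Gaussian."""
    return sum(
        -0.5 * ((xi - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for xi, m, s in zip(x, mean, std)
    )

def sample_hierarchical(state, high_actor, low_actor, rng):
    """Eq. (50): draw a_h from pi_h(.|s), then a_l from pi_l(.|s, a_h)."""
    mean_h, std_h = high_actor(state)
    a_h = [rng.gauss(m, s) for m, s in zip(mean_h, std_h)]
    mean_l, std_l = low_actor(state, a_h)
    a_l = [rng.gauss(m, s) for m, s in zip(mean_l, std_l)]
    # Joint log-probability = high-level term + conditional low-level term.
    log_prob = gaussian_log_prob(a_h, mean_h, std_h) + gaussian_log_prob(a_l, mean_l, std_l)
    return a_h, a_l, log_prob

# Hypothetical stand-ins for the two actor networks.
def toy_high_actor(state):
    return [0.1 * sum(state)] * 3, [1.0] * 3   # raw displacement (2) and partition ratio (1)

def toy_low_actor(state, a_h):
    return [0.5 * a_h[2]] * 4, [0.5] * 4       # raw phases / powers conditioned on a_h

a_h, a_l, log_prob = sample_hierarchical([1.0, 2.0], toy_high_actor, toy_low_actor, random.Random(0))
```

The key design point is that the low-level actor receives the sampled high-level action as an input, so fine-grained decisions are conditioned on the coarse ones.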

Two critic networks with parameters

ω_{1}

and

ω_{2}

are introduced to estimate the soft action-value functions

Q_{ω_{1}} (s, a)

and

Q_{ω_{2}} (s, a)

, respectively. In addition, two target critic networks with parameters

{\bar{ω}}_{1}

and

{\bar{ω}}_{2}

are maintained to improve the stability of temporal-difference learning.

The entropy-regularized objective of the hierarchical policy is given by

J (π) = \sum_{n = 1}^{N} E_{(s [n], a [n]) \sim ρ_{π}} [r [n] + τ H (π_{ϕ} (\cdot | s [n]))],

(51)

where

τ

is the entropy temperature parameter and

H (π_{ϕ} (\cdot | s [n]))

denotes the policy entropy.

To explicitly account for delay, secrecy outage, and UAV energy-related constraints, a Lagrangian-style reward shaping mechanism is incorporated into the learning process. Let

ψ_{d}

,

ψ_{s}

, and

ψ_{e}

denote the adaptive multipliers associated with delay violation, secrecy outage violation, and UAV energy violation, respectively. The corresponding violation levels are defined as

Δ_{d} [n] = {[D [n] - D_{max}]}^{+},

(52)

Δ_{s} [n] = {[P_{out} [n] - P_{out}^{max}]}^{+} .

(53)

To handle the total UAV energy budget in an online learning process, the accumulated UAV propulsion energy up to slot n is defined as

{\bar{E}}_{UAV} [n] = \sum_{i = 1}^{n} E_{UAV} [i] .

(54)

The corresponding energy violation level is defined as

Δ_{e} [n] = {[{\bar{E}}_{UAV} [n] - \frac{n}{N} E_{max}]}^{+},

(55)

where

{[x]}^{+} = max \{x, 0\}

,

D_{max}

is the maximum tolerable delay threshold,

P_{out}^{max}

is the maximum tolerable secrecy outage probability, and

E_{max}

is the total UAV energy budget.

The constraint-aware shaped reward is then expressed as

\hat{r} [n] = r [n] - ψ_{d} Δ_{d} [n] - ψ_{s} Δ_{s} [n] - ψ_{e} Δ_{e} [n] .

(56)
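The violation levels and the shaped reward of Equations (52)–(56) can be sketched as follows. The function names and the numerical values in the example are assumptions for illustration.

```python
def positive_part(x):
    """[x]^+ = max(x, 0)."""
    return max(x, 0.0)

def violation_levels(avg_delay, d_max, p_out, p_out_max, energy_so_far, n, n_slots, e_max):
    """Eqs. (52), (53), (55). The energy term compares the cumulative energy
    of Eq. (54) with a linear pacing budget (n / N) * E_max."""
    delta_d = positive_part(avg_delay - d_max)
    delta_s = positive_part(p_out - p_out_max)
    delta_e = positive_part(energy_so_far - n / n_slots * e_max)
    return delta_d, delta_s, delta_e

def shaped_reward(reward, deltas, multipliers):
    """Eq. (56): subtract the multiplier-weighted violation levels."""
    return reward - sum(psi * d for psi, d in zip(multipliers, deltas))
```

For instance, halfway through an episode (n = 15 of N = 30) with a budget of 900 J, the pacing budget is 450 J, so a cumulative consumption of 500 J yields an energy violation level of 50 J.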

During each training iteration, the target value for the critic networks is constructed as

y [n] = \hat{r} [n] + ζ (min_{j = 1, 2} Q_{{\bar{ω}}_{j}} (s [n + 1], a [n + 1]) - τ log π_{ϕ} (a [n + 1] | s [n + 1])),

(57)

where

ζ

is the discount factor and

a [n + 1] \sim π_{ϕ} (\cdot | s [n + 1])

. The critic networks are updated by minimizing the soft Bellman residual:

L (ω_{j}) = E_{(s [n], a [n], \hat{r} [n], s [n + 1]) \sim D} [{(Q_{ω_{j}} (s [n], a [n]) - y [n])}^{2}], j = 1, 2,

(58)

where

D

denotes the replay buffer.
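The target of Equation (57) and the loss of Equation (58) can be written on scalar inputs as in the sketch below; in practice these quantities would come from the target critics and the policy network, which are not modeled here.

```python
def soft_td_target(r_hat, q1_next, q2_next, log_prob_next, zeta, tau):
    """Eq. (57): the smaller of the two target-critic values plus the entropy bonus."""
    return r_hat + zeta * (min(q1_next, q2_next) - tau * log_prob_next)

def critic_loss(q_values, targets):
    """Eq. (58): mean squared soft Bellman residual over a mini-batch."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
```

Taking the minimum over the two target critics counteracts value overestimation, while the term with the negative log-probability rewards actions from a more random policy.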

The hierarchical policy is updated by minimizing

J (ϕ) = E_{s [n] \sim D, a [n] \sim π_{ϕ}} [τ log π_{ϕ} (a [n] | s [n]) - min_{j = 1, 2} Q_{ω_{j}} (s [n], a [n])] .

(59)

To enhance exploration adaptivity, an automatic entropy tuning mechanism is adopted. The temperature parameter

τ

is updated by minimizing

J (τ) = E_{a [n] \sim π_{ϕ}} [- τ (log π_{ϕ} (a [n] | s [n]) + \bar{H})],

(60)

where

\bar{H}

is the target entropy.
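A single gradient step on Equation (60) can be sketched as follows. The plain gradient descent on the temperature itself and the lower bound `tau_min` are simplifying assumptions; the paper does not state which optimizer or parameterization is used.

```python
def temperature_loss(tau, log_probs, target_entropy):
    """Eq. (60) estimated on a mini-batch of sampled log-probabilities."""
    return sum(-tau * (lp + target_entropy) for lp in log_probs) / len(log_probs)

def temperature_step(tau, log_probs, target_entropy, lr, tau_min=1e-6):
    """One descent step; dJ/dtau = -mean(log_prob + target_entropy)."""
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return max(tau - lr * grad, tau_min)
```

When the policy entropy (the negative mean log-probability) is below the target, the gradient is negative and the temperature grows, which strengthens the exploration incentive; in the opposite case the temperature shrinks.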

The constraint multipliers are updated according to the observed violation levels:

ψ_{d} \leftarrow {[ψ_{d} + η_{ψ} Δ_{d} [n]]}^{+},

(61)

ψ_{s} \leftarrow {[ψ_{s} + η_{ψ} Δ_{s} [n]]}^{+},

(62)

ψ_{e} \leftarrow {[ψ_{e} + η_{ψ} Δ_{e} [n]]}^{+},

(63)

where

η_{ψ}

is the multiplier learning rate.
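The multiplier updates of Equations (61)–(63) share one form, sketched below with an assumed function name.

```python
def update_multiplier(psi, violation, lr):
    """Eqs. (61)-(63): ascent step on one multiplier, projected onto psi >= 0."""
    return max(psi + lr * violation, 0.0)
```

Since each violation level is itself nonnegative by Equations (52), (53), and (55), a multiplier that starts at a nonnegative value stays unchanged when its constraint is satisfied and grows otherwise.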

Finally, the target critic networks are softly updated as

{\bar{ω}}_{j} \leftarrow ρ ω_{j} + (1 - ρ) {\bar{ω}}_{j}, j = 1, 2,

(64)

where

ρ \in (0, 1)

is the soft update coefficient.
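On flat parameter lists, the soft update of Equation (64) reduces to the element-wise average sketched below (the list representation is an assumption made for brevity).

```python
def soft_update(target_params, online_params, rho):
    """Eq. (64): move each target parameter a fraction rho toward the online one."""
    return [rho * w + (1.0 - rho) * w_bar for w, w_bar in zip(online_params, target_params)]
```

A small coefficient makes the target critics track the online critics slowly, which keeps the regression targets of Equation (57) stable between updates.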

4.3. Feasibility Mapping of Continuous Actions

Since the raw actions generated by the actor networks may violate physical constraints, a feasibility mapping operation is performed before interacting with the environment. Let the hierarchical actor output the raw action

\tilde{a} [n] = \{Δ \tilde{q} [n], {\tilde{β}}_{r} [n], {\tilde{θ}}_{r} [n], {\tilde{θ}}_{t} [n], \tilde{p} [n]\} .

(65)

For the UAV displacement, the mapped motion vector is given by

Δ q [n] = min \{1, \frac{V_{max} δ_{t}}{∥ Δ \tilde{q} [n] ∥ + ϵ}\} Δ \tilde{q} [n],

(66)

where

V_{max}

is the maximum UAV speed,

δ_{t}

is the slot duration, and

ϵ > 0

is a small constant used to avoid division by zero. The UAV position is then updated as

q [n + 1] = Π_{Q} (q [n] + Δ q [n]),

(67)

where

Π_{Q} (\cdot)

denotes the projection onto the feasible flight region

Q

.

For the STAR-RIS transmission–reflection partition ratio, the raw output is mapped by a sigmoid function:

β_{r} [n] = \frac{1}{1 + exp (- {\tilde{β}}_{r} [n])},

(68)

and the transmission ratio is given by

β_{t} [n] = 1 - β_{r} [n] .

(69)

Thus, the constraints

0 \leq β_{r} [n] \leq 1

,

0 \leq β_{t} [n] \leq 1

, and

β_{r} [n] + β_{t} [n] = 1

are satisfied.

For the STAR-RIS phase-shift vectors, a bounded mapping is applied:

θ_{r, m} [n] = π (tanh ({\tilde{θ}}_{r, m} [n]) + 1), \forall m,

(70)

θ_{t, m} [n] = π (tanh ({\tilde{θ}}_{t, m} [n]) + 1), \forall m .

(71)

Therefore,

θ_{r, m} [n] \in [0, 2 π)

and

θ_{t, m} [n] \in [0, 2 π)

are guaranteed.

For the transmit power allocation, a softmax-based normalization is adopted:

{\bar{p}}_{k} [n] = \frac{exp ({\tilde{p}}_{k} [n])}{\sum_{i = 1}^{K} exp ({\tilde{p}}_{i} [n])}, \forall k .

(72)

The feasible transmit power is obtained as

p_{k} [n] = P_{max} {\bar{p}}_{k} [n], \forall k .

(73)

This mapping guarantees

p_{k} [n] \geq 0

and

\sum_{k = 1}^{K} p_{k} [n] = P_{max}

. In this paper, the softmax mapping is used to allocate the available transmit power budget among vehicular users for secrecy-rate-oriented transmission.
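The mappings of Equations (66)–(73) can be sketched as follows. The function names are illustrative, and the projection assumes that the feasible flight region is an axis-aligned rectangle, which the text does not specify.

```python
import math

def map_displacement(raw_dq, v_max, slot_s, eps=1e-8):
    """Eq. (66): shrink the raw step so its length never exceeds V_max * delta_t."""
    scale = min(1.0, v_max * slot_s / (math.hypot(*raw_dq) + eps))
    return [scale * d for d in raw_dq]

def project_box(position, lower, upper):
    """Eq. (67), assuming a rectangular flight region."""
    return [min(max(x, lo), hi) for x, lo, hi in zip(position, lower, upper)]

def map_partition(raw_beta):
    """Eqs. (68)-(69): sigmoid for beta_r (numerically stable form), remainder for beta_t."""
    if raw_beta >= 0:
        beta_r = 1.0 / (1.0 + math.exp(-raw_beta))
    else:
        e = math.exp(raw_beta)
        beta_r = e / (1.0 + e)
    return beta_r, 1.0 - beta_r

def map_phases(raw_phases):
    """Eqs. (70)-(71): tanh-based mapping of each raw phase."""
    return [math.pi * (math.tanh(x) + 1.0) for x in raw_phases]

def map_powers(raw_powers, p_max):
    """Eqs. (72)-(73): softmax shares of the budget (max subtracted for stability)."""
    shift = max(raw_powers)
    exps = [math.exp(x - shift) for x in raw_powers]
    total = sum(exps)
    return [p_max * e / total for e in exps]
```

One numerical caveat: in floating-point arithmetic the hyperbolic tangent saturates to exactly 1 for large raw inputs, so the mapped phase can equal 2π, which is equivalent to a zero phase shift.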

4.4. Algorithm Procedure

The overall procedure of the proposed HC-SAC-based joint optimization algorithm is summarized in Algorithm 1.

Algorithm 1 Proposed HC-SAC-based joint optimization algorithm

1: Initialize the vehicular environment, replay buffer $D$, high-level actor, low-level actor, critic networks, target critic networks, entropy coefficient, and constraint multipliers.
2: for each training episode do
3:   Reset the environment and obtain the initial state $s [1]$.
4:   for each time slot $n = 1, \dots, N$ do
5:     Observe the current state $s [n]$.
6:     Generate the raw action $\tilde{a} [n]$ from the hierarchical policy networks.
7:     Apply feasibility mapping to obtain the valid action $a [n]$.
8:     Execute $a [n]$ in the environment.
9:     Update the UAV position, vehicle positions, channel states, queue states, and UAV energy consumption.
10:     Calculate the immediate reward $r [n]$ and constraint violation levels $Δ_{d} [n]$, $Δ_{s} [n]$, and $Δ_{e} [n]$.
11:     Compute the shaped reward $\hat{r} [n]$.
12:     Observe the next state $s [n + 1]$.
13:     Store $(s [n], a [n], \hat{r} [n], s [n + 1])$ into the replay buffer $D$.
14:     if the replay buffer size is larger than the mini-batch size then
15:       Sample a mini-batch from $D$.
16:       Update the critic networks by minimizing the soft Bellman residual.
17:       Update the high-level and low-level actors using the entropy-regularized policy objective.
18:       Update the entropy coefficient.
19:       Update the constraint multipliers.
20:       Softly update the target critic networks.
21:     end if
22:   end for
23: end for
24: Output the trained hierarchical policy.

4.5. Complexity and Convergence Discussion

Let

N_{ϕ_{h}}

,

N_{ϕ_{l}}

, and

N_{ω}

denote the number of trainable parameters in the high-level actor, low-level actor, and each critic network, respectively. For each mini-batch update with batch size

B_{s}

, the dominant computational complexity of the proposed HC-SAC algorithm mainly comes from forward and backward propagation through the hierarchical actor networks and the two critic networks, which can be approximated as

O (B_{s} (N_{ϕ_{h}} + N_{ϕ_{l}} + 2 N_{ω})) .

(74)

If the training process contains

E_{ep}

episodes and each episode consists of N time slots, the overall offline training complexity is given by

O (E_{ep} N B_{s} (N_{ϕ_{h}} + N_{ϕ_{l}} + 2 N_{ω})) .

(75)

Although HC-SAC introduces hierarchical action generation and constraint-aware multiplier updates, these operations bring only a small additional computational cost compared with standard actor–critic training. After training, the online execution only requires one forward propagation of the high-level and low-level actor networks, whose complexity is

O (N_{ϕ_{h}} + N_{ϕ_{l}}) .

(76)

Therefore, the trained policy can directly output the UAV displacement, STAR-RIS transmission–reflection partition ratio, phase-shift vectors, and transmit power allocation without repeatedly solving the original non-convex optimization problem, making it suitable for online decision-making in dynamic vehicular environments.

To improve training stability, the proposed HC-SAC uses double critics, target critic networks, experience replay, entropy-regularized policy improvement, and adaptive constraint multipliers. Experience replay reuses historical transitions, the entropy term encourages exploration, and constraint-aware reward shaping penalizes delay violation, secrecy outage, and excessive UAV energy consumption. Due to neural network approximation and the non-convex problem structure, global optimality cannot be guaranteed. The convergence curves in Section 5 show that HC-SAC reaches a stable evaluation score after training in the considered setting.

5. Simulation Results and Discussion

5.1. Simulation Settings

We consider a typical urban road intersection scenario with an area size of

500 m \times 500 m

. The BS is deployed at a fixed roadside location, while the UAV provides aerial assistance above the road. The total service period is divided into

N = 30

time slots, and the duration of each slot is

δ_{t} = 1 s

. Legitimate vehicular users and eavesdropping vehicles move along predefined lanes. The vehicular users follow a lane-based random mobility model, while the eavesdroppers follow a random lane-following mobility model. The traffic arrival process is modeled as a Poisson process to characterize the stochastic nature of vehicular service arrivals. During training, all DRL algorithms are implemented with the same state space, action space, reward definition, replay setting, and neural network scale unless otherwise specified.

5.2. Simulation Setup and Evaluation Metrics

In this section, numerical simulations are conducted to evaluate the performance of the proposed HC-SAC-based joint optimization scheme. Unless otherwise specified, the main simulation parameters are listed in Table 2. An urban vehicular communication scenario is considered, where vehicular users and potential eavesdroppers are distributed along the road. The UAV-mounted STAR-RIS dynamically adjusts its trajectory and transmission–reflection configuration according to vehicle mobility, blockage conditions, queue states, and channel-related information.

The propulsion energy parameters of the rotary-wing UAV are given in Table 3. These parameters are used to evaluate the practical UAV propulsion energy consumption in the proposed framework.

To ensure reproducibility, the main training hyperparameters of the proposed HC-SAC algorithm and the DRL-based benchmark schemes are summarized in Table 4. Unless otherwise specified, all DRL-based schemes are trained under the same vehicular mobility traces, channel realizations, eavesdropper distributions, state information, action constraints, reward components, and the number of environment interactions.

For fair comparison, PPO and DDPG adopt the same state representation, action mapping rules, reward components, neural network scale, and simulation scenarios as the proposed HC-SAC. The same vehicle trajectories, blockage distributions, channel realizations, eavesdropper locations, and traffic arrival processes are used for all DRL-based schemes. Therefore, the performance difference mainly comes from the learning architecture and policy update mechanism rather than from different simulation conditions.

To clarify the calculation of the simulation results, the evaluation metrics used in this paper are defined as follows. The average secrecy rate is calculated by

{\bar{R}}_{\sec} = \frac{1}{N K} \sum_{n = 1}^{N} \sum_{k = 1}^{K} R_{k}^{\sec} [n],

(77)

where

R_{k}^{\sec} [n]

denotes the instantaneous secrecy rate of vehicular user k at time slot n.

The secrecy outage probability (SOP) is defined as the probability that the instantaneous secrecy rate falls below a predefined secrecy threshold

R_{th}

. In the simulations, it is estimated by Monte Carlo counting as

P_{out} = \frac{\sum_{n = 1}^{N} \sum_{k = 1}^{K} I (R_{k}^{\sec} [n] < R_{th})}{N K},

(78)

where

I (\cdot)

is the indicator function.

The average delay is calculated based on the queue-aware delay model as

\bar{D} = \frac{1}{N K} \sum_{n = 1}^{N} \sum_{k = 1}^{K} D_{k} [n],

(79)

where

D_{k} [n]

denotes the queueing delay of user k at slot n.

The successful service ratio (SSR) is defined as the ratio of vehicular users that simultaneously satisfy the rate and delay requirements:

SSR = \frac{1}{N K} \sum_{n = 1}^{N} \sum_{k = 1}^{K} I (R_{k} [n] \geq R_{k}^{min}, D_{k} [n] \leq D_{max}),

(80)

where

R_{k}^{min}

is the minimum required communication rate.
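The four metrics of Equations (77)–(80) can be computed from logged per-slot, per-user values as in the following sketch. The nested-list data layout and the function name are assumptions.

```python
def evaluation_metrics(secrecy, rates, delays, r_th, r_min, d_max):
    """Each of secrecy, rates, delays is a list over slots of lists over users;
    r_min is a list with one minimum rate per user."""
    n_slots, n_users = len(secrecy), len(secrecy[0])
    total = n_slots * n_users
    avg_secrecy = sum(map(sum, secrecy)) / total                 # Eq. (77)
    sop = sum(r < r_th for row in secrecy for r in row) / total  # Eq. (78)
    avg_delay = sum(map(sum, delays)) / total                    # Eq. (79)
    ssr = sum(                                                   # Eq. (80)
        rates[n][k] >= r_min[k] and delays[n][k] <= d_max
        for n in range(n_slots) for k in range(n_users)
    ) / total
    return avg_secrecy, sop, avg_delay, ssr
```

Note that a user-slot pair counts toward the successful service ratio only if both the rate and the delay condition hold, so this metric can never exceed the fraction of pairs that meet the rate requirement alone.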

As an auxiliary per-slot indicator, the instantaneous normalized utility can be written as

U_{ins} = w_{r} {\tilde{R}}_{\sec} + w_{c} \tilde{SSR} - w_{d} \tilde{D} - w_{o} {\tilde{P}}_{out},

(81)

where

{\tilde{R}}_{\sec}

,

\tilde{SSR}

,

\tilde{D}

, and

{\tilde{P}}_{out}

denote the normalized secrecy rate, successful service ratio, average delay, and secrecy outage probability, respectively. The coefficients

w_{r}

,

w_{c}

,

w_{d}

, and

w_{o}

are nonnegative weights used to balance different performance metrics.

Each reported point is obtained by averaging over 24 independent evaluation episodes under the corresponding parameter setting. Since the main purpose is to compare the average performance trends, only the mean values are plotted in the figures.

5.3. Benchmark Schemes

To comprehensively evaluate the effectiveness of the proposed method, several representative DRL-based and structural benchmark schemes are considered. The detailed definitions of these schemes are summarized in Table 5.

For fairness, all benchmark schemes are evaluated under the same vehicular mobility traces, channel realizations, eavesdropper distributions, blockage conditions, packet arrival processes, and initial UAV locations. Unless a specific variable is intentionally fixed for benchmark evaluation, the same state information, action constraints, reward components, and feasibility mapping rules are adopted.

5.4. Training Behavior Comparison

Figure 4 illustrates the convergence behavior of different DRL-based schemes in terms of evaluation score. It can be observed that the proposed HC-SAC achieves the highest evaluation score after convergence and exhibits a smooth upward trend throughout training. PPO also converges stably, but its final evaluation score remains below that of HC-SAC, whereas DDPG converges to the lowest level among the three DRL-based schemes. These results indicate that the hierarchical and constraint-aware design improves policy learning effectiveness in the considered dynamic vehicular environment. The convergence trend is also consistent with the final composite-utility ranking reported later.

5.5. Delay Performance Analysis

Figure 5 shows the average delay under different vehicle speeds. As the vehicle speed increases, the average delay of all schemes generally rises due to faster topology variation and more severe channel fluctuation. Nevertheless, the proposed HC-SAC consistently achieves the lowest delay over the whole speed range, varying from 10.02 to 10.39 slots. By comparison, PPO varies from 10.96 to 11.22 slots, and DDPG varies from 11.18 to 11.62 slots. Averaged over the tested speed range, HC-SAC reduces delay by about 7.9% relative to PPO and 10.4% relative to DDPG. This result demonstrates that the hierarchical control structure and constraint-aware learning mechanism can better coordinate UAV mobility, STAR-RIS reconfiguration, and resource scheduling, thereby improving queue stability and delay control in highly dynamic vehicular environments.

HC-SAC achieves the lowest delay among all benchmark schemes, and its margin over the structural baselines is larger than its margin over PPO and DDPG.

5.6. Secure Communication Performance Analysis

Figure 6 depicts the average secrecy rate under different BS transmit power budgets. As the transmit power increases, the secrecy rate of all schemes improves because stronger transmit power enhances the legitimate communication links. Among all DRL-based schemes, PPO achieves the highest secrecy rate, while the proposed HC-SAC provides competitive secrecy-rate performance and outperforms DDPG in the high-power regime. For example, at 40 dBm, PPO, HC-SAC, and DDPG achieve secrecy rates of 0.8832, 0.7791, and 0.7043 Mbps, respectively. In addition, all three DRL-based schemes outperform the structural baselines, suggesting the benefit of combining adaptive learning with UAV mobility and STAR-RIS-assisted propagation control.

Although HC-SAC does not achieve the highest secrecy rate, it still maintains a competitive secure transmission capability while simultaneously emphasizing delay and secrecy-reliability constraints.

Figure 7 illustrates the secrecy outage probability versus the number of eavesdroppers. As expected, the secrecy outage probability increases with the number of eavesdroppers for all schemes. However, the proposed HC-SAC consistently yields the lowest secrecy outage probability among all methods. Averaged over the tested eavesdropper settings, the SOP of HC-SAC is 0.7241, compared with 0.8613 for PPO and 0.8801 for DDPG, corresponding to reductions of about 15.9% and 17.7%, respectively. This result indicates that the proposed hierarchical constrained mechanism improves secrecy reliability under unfavorable wiretap conditions.

5.7. Service Reliability Analysis

Figure 8 presents the successful service ratio under different transmit power budgets, where the successful service ratio is defined as the proportion of users whose rate and delay requirements are simultaneously satisfied. It can be seen that the successful service ratio of all schemes increases with the transmit power budget. PPO achieves the highest successful service ratio, while the proposed HC-SAC remains close to PPO and becomes slightly better than DDPG in the medium- and high-power regimes. At 40 dBm, the SSR values of PPO, HC-SAC, and DDPG are 0.1767, 0.1693, and 0.1534, respectively.

These results show that HC-SAC keeps its successful service ratio close to that of PPO while achieving better delay and secrecy outage performance. Therefore, compared with PPO, the proposed method provides a more balanced service-oriented control behavior rather than purely pursuing rate-oriented gains.

5.8. Composite Utility Analysis

To further evaluate the overall security–latency trade-off, a normalized composite utility is introduced based on the area-under-the-curve (AUC) values for the secrecy rate, successful service ratio, delay, and secrecy outage probability. Specifically, the secrecy rate and successful service ratio are positively normalized, whereas the delay and secrecy outage probability are inversely normalized. The resulting composite utility is defined as

U_{comp} = w_{1} {\tilde{AUC}}_{R_{\sec}} + w_{2} {\tilde{AUC}}_{SSR} + w_{3} (1 - {\tilde{AUC}}_{D}) + w_{4} (1 - {\tilde{AUC}}_{SOP}),

(82)

where

{\tilde{AUC}}_{R_{\sec}}

,

{\tilde{AUC}}_{SSR}

,

{\tilde{AUC}}_{D}

, and

{\tilde{AUC}}_{SOP}

denote the normalized AUC values of secrecy rate, successful service ratio, delay, and secrecy outage probability, respectively. According to the objective of secure and low-latency vehicular communications, larger weights are assigned to delay and secrecy outage performance.

Figure 9 compares the normalized composite utility of all schemes. The proposed HC-SAC achieves the highest composite utility of 0.9254, while PPO, DDPG, the fixed STAR-RIS partition scheme, the random phase-shift scheme, the fixed UAV trajectory scheme, and the No STAR-RIS scheme achieve 0.6630, 0.5060, 0.3241, 0.2480, 0.2403, and 0, respectively. The zero value of the No STAR-RIS scheme is caused by the min–max normalization, where the worst-performing scheme is mapped to zero. In relative terms, HC-SAC improves the composite utility by about 39.6% over PPO and 82.9% over DDPG. Although PPO performs better in terms of average secrecy rate and successful service ratio, HC-SAC provides larger advantages in delay and secrecy outage probability. As a result, the proposed method achieves the most balanced overall performance when security and latency are jointly considered. This ranking is mainly driven by the delay and SOP gains rather than by rate maximization alone.

The above results also explain why the proposed HC-SAC achieves the highest normalized composite utility. Although PPO achieves a higher secrecy rate and successful service ratio in some cases, this does not necessarily indicate an overall advantage in the considered secure and low-latency vehicular communication scenario. PPO tends to learn a more rate-oriented policy, whereas the proposed HC-SAC adopts entropy-regularized off-policy learning and constraint-aware reward shaping to reduce queue accumulation and suppress secrecy outage events. Therefore, HC-SAC does not always maximize a single metric, but achieves a better overall security–latency trade-off, as reflected by its lower average delay, lower secrecy outage probability, and higher normalized composite utility.
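To make the composite utility of Equation (82) concrete, the sketch below applies a per-metric min–max normalization across schemes and then forms the weighted sum. The per-metric normalization and the weight values used in any call are assumptions, since the exact normalization details and weights are not reproduced here. The helper `relative_gain` shows the arithmetic behind the reported relative improvements.

```python
def min_max(values):
    """Min-max normalize a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def composite_utility(auc_rate, auc_ssr, auc_delay, auc_sop, weights):
    """Eq. (82) for all schemes at once; inputs are raw AUC values per scheme."""
    w1, w2, w3, w4 = weights
    terms = zip(min_max(auc_rate), min_max(auc_ssr), min_max(auc_delay), min_max(auc_sop))
    return [w1 * r + w2 * c + w3 * (1.0 - d) + w4 * (1.0 - o) for r, c, d, o in terms]

def relative_gain(value, baseline):
    """Relative improvement of value over baseline."""
    return (value - baseline) / baseline
```

Using the composite utilities reported above, `relative_gain(0.9254, 0.6630)` is about 0.396 and `relative_gain(0.9254, 0.5060)` is about 0.829, which matches the stated gains of 39.6% and 82.9%.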

5.9. Ablation Study

To further verify the contribution of the key components in the proposed framework, an ablation study is conducted by comparing four SAC-based variants under the same simulation environment, random seeds, and evaluation settings. The considered variants include standard SAC, HC-SAC without hierarchical control, HC-SAC without constraint-aware learning, and the complete proposed HC-SAC. The results averaged over three random seeds are reported in Table 6, where the secrecy rate is measured in Mbps, and the delay is measured in slots. Because the composite utility is normalized within this ablation group, it is used only for comparing the four SAC variants.

As shown in Table 6, the standard SAC baseline obtains the lowest composite utility, indicating that directly applying a flat SAC policy is insufficient for the considered high-dimensional joint control problem. When the hierarchical control structure is removed, the secrecy rate and successful service ratio remain relatively high, but the delay and secrecy outage probability become worse than those of the complete HC-SAC, leading to a much lower composite utility. Notably, HC-SAC without hierarchical control obtains a higher secrecy rate than the complete HC-SAC, i.e., 0.516 Mbps versus 0.433 Mbps. This indicates that the non-hierarchical variant tends to learn a more rate-oriented policy, whereas the complete HC-SAC deliberately trades part of the secrecy-rate gain for significantly lower delay and SOP, which is consistent with the overall security–latency trade-off objective. When the constraint-aware learning mechanism is removed, the delay and SOP are improved compared with the standard SAC baseline, but the overall utility is still significantly lower than that of the complete HC-SAC. These observations show that the hierarchical control module and the constraint-aware learning mechanism contribute from different aspects: the former helps handle the coupled UAV, STAR-RIS, and power-control actions, while the latter guides the policy toward lower delay and lower secrecy outage. The complete HC-SAC achieves the highest composite utility of 0.851, suggesting that the two components jointly contribute to the final security–latency trade-off.
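The component contributions discussed above can be read off Table 6 as differences between each ablated variant and the complete HC-SAC. The sketch below uses only the reported mean values (standard deviations are ignored); the dictionary layout and variable names are illustrative choices.

```python
# Mean values from Table 6 (secrecy rate in Mbps, delay in slots).
table6 = {
    "Standard SAC":           {"rate": 0.476, "delay": 11.753, "sop": 0.912, "utility": 0.146},
    "HC-SAC w/o hierarchy":   {"rate": 0.516, "delay": 11.390, "sop": 0.880, "utility": 0.452},
    "HC-SAC w/o constraints": {"rate": 0.388, "delay": 10.831, "sop": 0.738, "utility": 0.552},
    "Proposed HC-SAC":        {"rate": 0.433, "delay": 10.440, "sop": 0.714, "utility": 0.851},
}

full = table6["Proposed HC-SAC"]

# Change caused by removing one component: ablated value minus full-model value.
deltas = {}
for name in ("HC-SAC w/o hierarchy", "HC-SAC w/o constraints"):
    variant = table6[name]
    deltas[name] = {key: variant[key] - full[key] for key in full}
    d = deltas[name]
    print(f"{name}: rate {d['rate']:+.3f} Mbps, delay {d['delay']:+.3f} slots, "
          f"SOP {d['sop']:+.3f}, utility {d['utility']:+.3f}")
```

Removing the hierarchy raises the secrecy rate but costs the most in delay, SOP, and utility, whereas removing the constraint-aware mechanism lowers the secrecy rate and yields a smaller utility drop, in line with the discussion above.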

5.10. Robustness Analysis Under Practical Non-Idealities

To further examine the practical robustness of the proposed method, additional simulations are conducted under imperfect CSI and finite-resolution STAR-RIS phase control. In the imperfect CSI case, the CSI error standard deviation is varied from 0.02 to 0.10. In the finite-resolution case, the STAR-RIS phase resolution is set to 3, 2, and 1 bits. The results averaged over three random seeds are reported in Table 7.

As shown in Table 7, the proposed HC-SAC maintains stable performance under moderate CSI errors and low-resolution phase control. Compared with the ideal CSI and continuous phase case, CSI uncertainty mainly increases the average delay from 10.112 slots to about 10.282 slots (roughly 1.7%), while the secrecy rate, SOP, and successful service ratio (SSR) remain within a narrow range. The small fluctuations in secrecy rate and SSR under CSI errors are caused by stochastic channel perturbations; they should be read as evidence of robustness, not as a performance gain from imperfect CSI. This limited sensitivity also suggests that the proposed framework, which combines geometry-aware channel features with adaptive resource control, does not rely excessively on highly accurate instantaneous CSI. Moreover, reducing the STAR-RIS phase resolution from 3 bits to 1 bit introduces only a limited performance change at the reference operating point, which indicates that the geometry-aware phase focusing and hierarchical control structure provide a degree of robustness against practical phase-control quantization.
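To make the finite-resolution setting concrete, the following sketch illustrates generic uniform b-bit phase quantization, in which each continuous phase is rounded to the nearest of 2^b equally spaced levels. It is an illustration under stated assumptions, not the paper's simulator: the uniform codebook, the randomly drawn ideal phases, and the helpers `quantize_phase`, `wrap`, and `coherent_gain` are all introduced here. Only the element count M = 64 is taken from Table 2.

```python
import cmath
import math
import random

M = 64  # number of STAR-RIS elements (Table 2)


def quantize_phase(theta: float, bits: int) -> float:
    """Round a phase (radians) to the nearest of 2**bits uniform levels in [0, 2*pi)."""
    levels = 2 ** bits
    step = 2.0 * math.pi / levels
    return (round(theta / step) % levels) * step


def wrap(angle: float) -> float:
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def coherent_gain(errors) -> float:
    """Normalized combining power |mean(exp(j*error))|**2; equals 1 for zero error."""
    total = sum(cmath.exp(1j * e) for e in errors)
    return abs(total / len(errors)) ** 2


# Assumption: ideal per-element phases are drawn uniformly at random.
rng = random.Random(0)
ideal = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(M)]

gains = {}
for bits in (3, 2, 1):
    errors = [wrap(t - quantize_phase(t, bits)) for t in ideal]
    gains[bits] = coherent_gain(errors)
    worst = max(abs(e) for e in errors)
    print(f"{bits}-bit: max |phase error| = {worst:.3f} rad, "
          f"normalized combining power = {gains[bits]:.3f}")
```

The residual phase error is bounded by π/2^b per element. How much of that element-level loss reaches the end-to-end delay, SOP, and SSR depends on the full system (power control, UAV placement, queue dynamics), which this sketch does not model.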

5.11. Reference Performance and Composite Utility Comparison

To provide a concise quantitative summary, Table 8 reports the reference operating-point performance and the AUC-based composite utility of all schemes. The first four metrics are evaluated at the reference operating point, while the normalized composite utility is obtained from the AUC-based evaluation over the corresponding parameter sweeps.

Although PPO achieves the highest average secrecy rate and successful service ratio (SSR) in Table 8, HC-SAC provides the lowest delay and the lowest secrecy outage probability (SOP). When the four metrics are jointly considered through the normalized composite utility, HC-SAC yields the best overall performance. In particular, compared with PPO, HC-SAC reduces the average delay from 11.7230 to 10.8471 slots (about 7.5%) and the SOP from 0.8501 to 0.7160 (about 15.8%), which contributes to a composite-utility gain from 0.6630 to 0.9254. These results indicate that the proposed hierarchical and constraint-aware learning strategy provides a more favorable security–latency–service trade-off than both learning-based benchmarks and structural baselines.
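The way four metrics with different directions can be folded into one normalized score may be sketched as follows. This is an illustration under assumptions: it applies per-metric min–max normalization to the reference operating-point values of Table 8 and combines them with equal weights, whereas the reported composite utility is AUC-based over parameter sweeps and may use different weights. The output is therefore not expected to reproduce the 0.9254 figure; the sketch only shows the mechanism and why a scheme that is worst on every metric maps to zero.

```python
# Reference operating-point values from Table 8:
# (secrecy rate in Mbps, delay in slots, SOP, SSR)
table8 = {
    "Proposed HC-SAC":          (0.3928, 10.8471, 0.7160, 0.1004),
    "PPO-based scheme":         (0.4927, 11.7230, 0.8501, 0.1163),
    "DDPG-based scheme":        (0.4171, 11.9358, 0.8599, 0.1012),
    "Fixed UAV trajectory":     (0.1811, 12.4940, 0.9058, 0.0628),
    "Fixed STAR-RIS partition": (0.2233, 12.2493, 0.8932, 0.0681),
    "Random phase-shift":       (0.1792, 12.4942, 0.9036, 0.0624),
    "No STAR-RIS":              (0.0903, 13.3957, 0.9473, 0.0497),
}

HIGHER_IS_BETTER = (True, False, False, True)  # rate, delay, SOP, SSR
WEIGHTS = (0.25, 0.25, 0.25, 0.25)             # assumption: equal weights


def composite_scores(table):
    """Per-metric min-max normalization followed by a weighted sum."""
    columns = list(zip(*table.values()))
    scores = {}
    for name, row in table.items():
        total = 0.0
        for value, column, higher, weight in zip(row, columns, HIGHER_IS_BETTER, WEIGHTS):
            lo, hi = min(column), max(column)
            # Flip the direction for metrics where smaller values are better.
            norm = (value - lo) / (hi - lo) if higher else (hi - value) / (hi - lo)
            total += weight * norm
        scores[name] = total
    return scores


scores = composite_scores(table8)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:26s} {score:.4f}")
```

Even under this simplified equal-weight rule, HC-SAC ranks first and the no STAR-RIS scheme scores exactly zero, because it has the lowest secrecy rate and SSR and the highest delay and SOP in Table 8.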

6. Conclusions

This paper investigated secure and low-latency communications in UAV-mounted STAR-RIS-assisted urban vehicular networks under severe blockage, high vehicle mobility, eavesdropping threats, and delay-sensitive traffic services. A comprehensive system model was developed by jointly considering urban blockage, vehicle mobility, passive eavesdropping threats, queueing dynamics, UAV trajectory evolution, and flight energy constraints. Based on this model, a long-term average secure and low-latency utility maximization problem was formulated and transformed into an MDP with continuous state and action spaces. A hierarchical constrained SAC-based joint control algorithm was then proposed to optimize the UAV trajectory, STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation.

Simulation results showed that the proposed method achieves lower delay, lower secrecy outage probability, and higher composite utility than DDPG and structural benchmark schemes, while maintaining competitive values for the secrecy rate and successful service ratio. Compared with PPO, the proposed HC-SAC achieves lower delay and lower secrecy outage probability, while PPO retains a higher secrecy rate and successful service ratio. In the representative evaluation, HC-SAC achieves an average delay of 10.8471 slots, an SOP of 0.7160, and the highest normalized composite utility of 0.9254. The ablation study further indicates that both hierarchical control and constraint-aware learning benefit the final security–latency trade-off, since removing either component substantially reduces the composite utility. Additional robustness results show that the proposed method remains stable under moderate CSI errors and finite-resolution STAR-RIS phase control. These results show that HC-SAC does not simply maximize a single communication metric, but achieves a stronger overall security–latency trade-off by reducing delay and secrecy outage probability while maintaining a competitive secrecy rate and successful service ratio.

Author Contributions

Conceptualization, J.T., J.Y. and Y.P.; methodology, J.Y. and J.T.; software, J.Y. and H.Z.; validation, M.C. and Y.P.; formal analysis, J.T. and M.C.; resources, J.T. and Y.P.; data curation, J.Y. and M.C.; writing—original draft preparation, J.T. and H.Z.; writing—review and editing, H.Z. and Y.P.; supervision, H.Z., M.C. and Y.P.; project administration, H.Z. and M.C.; funding acquisition, J.T. and Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62461030), the Key Basic Research Project of Yunnan Province (202401AS070105), the General Scientific Research Project of Hunan Provincial Department of Education (25C1457), the Shaoyang Municipal Science and Technology Plan (Special Fund Subsidy) Project (2024PT4070, 2024GZ2026), and the Key Scientific Research Project of Shaoyang Industrial Vocational and Technical College (SKY24A04).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code for this paper is available at https://github.com/tangent123/star-ris (accessed on 1 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

STAR-RIS	Simultaneously transmitting and reflecting reconfigurable intelligent surface
UAV	Unmanned aerial vehicle
SAC	Soft actor–critic
DRL	Deep reinforcement learning

References

[1] Meng, K.; Masouros, C.; Petropulu, A.P.; Hanzo, L. Cooperative ISAC networks: Opportunities and challenges. IEEE Wirel. Commun. 2024, 32, 212–219.
[2] Liu, Q.; Luo, R.; Liang, H.; Liu, Q. Energy-efficient joint computation offloading and resource allocation strategy for ISAC-aided 6G V2X networks. IEEE Trans. Green Commun. Netw. 2023, 7, 413–423.
[3] Cheng, X.; Duan, D.; Gao, S.; Yang, L. Integrated sensing and communications (ISAC) for vehicular communication networks (VCN). IEEE Internet Things J. 2022, 9, 23441–23451.
[4] Yu, K.; Li, K.; Zhao, Y.; Feng, Z.; Li, D.; Zhang, Q.; Yu, J. Movable Antenna-Aided Secure V2X Communication: An Integrated Sensing and Communication Perspective. IEEE Wirel. Commun. 2025, 32, 118–124.
[5] Hasan, M.; Mohan, S.; Shimizu, T.; Lu, H. Securing vehicle-to-everything (V2X) communication platforms. IEEE Trans. Intell. Veh. 2020, 5, 693–713.
[6] Gyawali, S.; Xu, S.; Qian, Y.; Hu, R.Q. Challenges and solutions for cellular based V2X communications. IEEE Commun. Surv. Tutor. 2020, 23, 222–255.
[7] Hakeem, S.A.A.; Kim, H. Advancing intrusion detection in V2X networks: A comprehensive survey on machine learning, federated learning, and edge AI for V2X security. IEEE Trans. Intell. Transp. Syst. 2025, 26, 11137–11205.
[8] ElMossallamy, M.A.; Zhang, H.; Song, L.; Seddik, K.G.; Han, Z.; Li, G.Y. Reconfigurable intelligent surfaces for wireless communications: Principles, challenges, and opportunities. IEEE Trans. Cogn. Commun. Netw. 2020, 6, 990–1002.
[9] Mu, X.; Liu, Y.; Guo, L.; Lin, J.; Schober, R. Simultaneously transmitting and reflecting (STAR) RIS aided wireless communications. IEEE Trans. Wirel. Commun. 2021, 21, 3083–3098.
[10] Aung, P.S.; Nguyen, L.X.; Tun, Y.K.; Han, Z.; Hong, C.S. Deep reinforcement learning-based joint spectrum allocation and configuration design for STAR-RIS-assisted V2X communications. IEEE Internet Things J. 2023, 11, 11298–11311.
[11] Andreou, A.; Mavromoustakis, C.X.; Batalla, J.M.; Markakis, E.K.; Mastorakis, G. UAV-assisted RSUs for V2X connectivity using voronoi diagrams in 6G+ infrastructures. IEEE Trans. Intell. Transp. Syst. 2023, 24, 15855–15865.
[12] Peng, Y.; Tang, J.; Yang, Q.; Han, Z.; Ma, J. Joint power allocation algorithm for UAV-borne simultaneous transmitting and reflecting reconfigurable intelligent surface-assisted non-orthogonal multiple access system. IEEE Access 2023, 11, 140506–140518.
[13] Nakazato, J.; So, H.; Tran, G.K.; Suto, K. Multi-Agent Reinforcement Learning for Resilient UAV Ad Hoc Backhaul Networks. IEEE J. Miniaturization Air Space Syst. 2026, 7, 232–245.
[14] Li, S.; Duo, B.; Yuan, X.; Liang, Y.C.; Di Renzo, M. Reconfigurable Intelligent Surface Assisted UAV Communication: Joint Trajectory Design and Passive Beamforming. IEEE Wirel. Commun. Lett. 2020, 9, 716–720.
[15] Yang, L.; Meng, F.; Zhang, J.; Hasna, M.O.; Di Renzo, M. On the Performance of RIS-Assisted Dual-Hop UAV Communication Systems. IEEE Trans. Veh. Technol. 2020, 69, 10385–10390.
[16] Liu, X.; Liu, Y.; Chen, Y. Machine Learning Empowered Trajectory and Passive Beamforming Design in UAV-RIS Wireless Networks. IEEE J. Sel. Areas Commun. 2021, 39, 2042–2055.
[17] Li, S.; Duo, B.; Di Renzo, M.; Tao, M.; Yuan, X. Robust Secure UAV Communications with the Aid of Reconfigurable Intelligent Surfaces. IEEE Trans. Wirel. Commun. 2021, 20, 6402–6417.
[18] Shi, E.; Zhang, J.; Du, H.; Ai, B.; Yuen, C.; Niyato, D. RIS-Aided Cell-Free Massive MIMO Systems for 6G: Fundamentals, System Design, and Applications. Proc. IEEE 2024, 112, 331–364.
[19] Saikia, P.; Pala, S.; Singh, K.; Singh, S.K.; Huang, W.J. Proximal policy optimization for RIS-assisted full duplex 6G-V2X communications. IEEE Trans. Intell. Veh. 2023, 9, 5134–5149.
[20] Long, X.; Zhao, Y.; Wu, H.; Xu, C.Z. Deep reinforcement learning for integrated sensing and communication in RIS-assisted 6G V2X system. IEEE Internet Things J. 2024, 11, 39834–39849.
[21] Wang, C.; Li, Z.; Xia, X.G.; Shi, J.; Si, J.; Zou, Y. Physical layer security enhancement using artificial noise in cellular vehicle-to-everything (C-V2X) networks. IEEE Trans. Veh. Technol. 2020, 69, 15253–15268.
[22] de Lima, D.V.; da Costa, J.P.J.; da Silva, A.A.S.; Santos, G.A.; Vargas, J.A.R.; de Alexandria, A.R. Broadband Beamforming via Frequency Invariance Transformation and PARAFAC Decomposition for Jamming Mitigation in V2X Scenarios. In Proceedings of the 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring); IEEE: New York, NY, USA, 2024; pp. 1–6.
[23] Shang, C.; Yu, J.; Hoang, D.T. Energy-efficient and intelligent ISAC in V2X networks with spiking neural networks-driven DRL. IEEE Trans. Wirel. Commun. 2026, 25, 1182–1195.
[24] Li, Z.; Liao, L.; Gu, S.; Zhao, J. Physical Layer Eavesdropping Defense Scheme for V2X Based on Improved SAC Algorithm. Phys. Commun. 2025, 74, 102980.
[25] Amudha, S.; Sivaradje, G.; Nagarajan, G. Hyperparameter-Tuned PPO-Based Federated Deep Reinforcement Learning (FDRL) with Explainability for Efficient V2X Resource Allocation in 5G Networks. In Proceedings of the International Conference on Artificial Intelligence and Secure Data Analytics (ICAISDA 2025); Atlantis Press: Dordrecht, The Netherlands, 2026; pp. 559–573.
[26] Mlika, Z.; Cherkaoui, S. Deep deterministic policy gradient to minimize the age of information in cellular V2X communications. IEEE Trans. Intell. Transp. Syst. 2022, 23, 23597–23612.

Figure 1. System model of secure and low-latency communications in UAV-mounted STAR-RIS-assisted urban vehicular networks.

Figure 2. Queue evolution and a low-latency service model for vehicular users.

Figure 3. Step-by-step methodology of the proposed HC-SAC-based joint optimization framework.

Figure 4. Convergence behavior of HC-SAC, DDPG, and PPO. Light-colored curves denote raw evaluation scores, and bold curves denote smoothed trends.

Figure 5. Average delay under different vehicle speeds.

Figure 6. Average secrecy rate under different transmit power levels.

Figure 7. Secrecy outage probability versus the number of eavesdroppers.

Figure 8. Successful service ratio under different transmit power levels.

Figure 9. Normalized composite utility comparison of different schemes.

Table 1. Comparison with representative related works.

Work	Main Scenario	UAV Mobility	RIS/STAR-RIS	V2X	Security	Queue Latency	DRL/SAC	Hierarchical Constraint Control
[14]	RIS-assisted UAV communication with joint trajectory and passive beamforming design	✓	RIS	–	–	–	–	–
[15]	Performance analysis of RIS-assisted dual-hop UAV communication	✓	RIS	–	–	–	–	–
[16]	Machine-learning-empowered UAV-RIS wireless network	✓	RIS	–	–	–	✓	–
[17]	Robust secure UAV communication with RIS	✓	RIS	–	✓	–	–	–
[10]	DRL-based spectrum allocation and STAR-RIS configuration for V2X	–	STAR-RIS	✓	–	–	✓	–
[12]	UAV-borne STAR-RIS-assisted NOMA communication	✓	STAR-RIS	–	–	–	–	–
[19]	PPO-based RIS-assisted full-duplex 6G-V2X communication	–	RIS	✓	–	–	✓	–
[20]	DRL-based ISAC optimization in RIS-assisted 6G V2X systems	–	RIS	✓	–	–	✓	–
[24]	Improved SAC-based physical-layer eavesdropping defense for V2X	–	–	✓	✓	–	SAC	–
This work	UAV-mounted STAR-RIS-assisted secure and low-latency urban vehicular communication	✓	STAR-RIS	✓	✓	✓	HC-SAC	✓

Table 2. Main simulation parameters.

Parameter	Description	Value
K	Number of vehicular users	6
E	Number of eavesdroppers	1–4
M	Number of STAR-RIS elements	64
N	Number of time slots	30
$δ_{t}$	Slot duration	1 s
H	UAV flight altitude	80 m
$V_{max}$	Maximum UAV speed	20 m/s
$P_{max}$	Maximum transmit power	20–40 dBm
W	System bandwidth	10 MHz
$σ^{2}$	Noise power	$-94$ dBm
$ρ_{0}$	Path loss at reference distance	$-30$ dB
$α$	Path-loss exponent range	2.1–4.5
$κ$	Rician factor range	0.08–12
$v_{k}$	Vehicle speed range	6–22 m/s
$λ_{k}$	Packet arrival rate	2 packets/slot
$L_{0}$	Packet size	$1.2 \times 10^{5}$ bits
$D_{max}$	Maximum tolerable delay	12 slots
$R_{th}$	Secrecy outage threshold	$0.1$ Mbps
$P_{out}^{max}$	Maximum tolerable SOP	$0.76$

Table 3. UAV propulsion energy model parameters.

Parameter	Description	Value
$P_{0}$	Blade profile power	$79.86$ W
$P_{i}$	Induced power in hovering	$88.63$ W
$U_{tip}$	Rotor blade tip speed	120 m/s
$v_{0}$	Mean rotor induced velocity	$4.03$ m/s
$d_{0}$	Fuselage drag ratio	$0.6$
$ρ$	Air density	$1.225$ kg/m³
s	Rotor solidity	$0.05$
A	Rotor disc area	$0.503$ m²
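The symbols and values in Table 3 coincide with those of the rotary-wing propulsion power model that is widely used in the UAV communications literature. The sketch below evaluates that model under the assumption that it is the one adopted for the UAV flight energy consumption in this paper; the function name `propulsion_power` and the constant names are illustrative.

```python
import math

# Parameter values from Table 3.
P0 = 79.86        # blade profile power (W)
P_I = 88.63       # induced power in hovering (W)
U_TIP = 120.0     # rotor blade tip speed (m/s)
V0 = 4.03         # mean rotor induced velocity (m/s)
D0 = 0.6          # fuselage drag ratio
RHO_AIR = 1.225   # air density (kg/m^3)
S = 0.05          # rotor solidity
A = 0.503         # rotor disc area (m^2)


def propulsion_power(v: float) -> float:
    """Propulsion power (W) at horizontal speed v (m/s), assumed rotary-wing model."""
    blade_profile = P0 * (1.0 + 3.0 * v ** 2 / U_TIP ** 2)
    inner = math.sqrt(1.0 + v ** 4 / (4.0 * V0 ** 4)) - v ** 2 / (2.0 * V0 ** 2)
    induced = P_I * math.sqrt(max(inner, 0.0))  # guard against rounding below zero
    parasite = 0.5 * D0 * RHO_AIR * S * A * v ** 3
    return blade_profile + induced + parasite


SLOT_DURATION = 1.0  # slot duration in seconds (Table 2)
for speed in (0.0, 10.0, 20.0):
    power = propulsion_power(speed)
    print(f"v = {speed:4.1f} m/s: power = {power:7.2f} W, "
          f"energy per slot = {power * SLOT_DURATION:7.2f} J")
```

At zero speed the model reduces to the sum of the blade profile and induced hovering powers, and at moderate forward speeds the required power falls below the hovering value, which is why hovering is not the most energy-efficient flight state under this model.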

Table 4. Main hyperparameters of the proposed HC-SAC algorithm.

Parameter	Value
High-level actor hidden layers	256-256-128
Low-level actor hidden layers	256-256-128
Critic hidden layers	256-256-128
Activation function	ReLU
Actor learning rate	$1 \times 10^{-4}$
Critic learning rate	$1 \times 10^{-4}$
Entropy temperature learning rate	$1 \times 10^{-4}$
Constraint multiplier learning rate	$1 \times 10^{-4}$
Replay buffer size	$1 \times 10^{5}$
Mini-batch size	256
Discount factor $ζ$	$0.99$
Soft update coefficient $ρ$	$0.005$
Initial entropy temperature $τ$	$0.2$
Target entropy	$-\dim(A)$
Training episodes	420
Time slots per episode	N
Warm-up steps	2200
Evaluation interval	Every 10 episodes
Evaluation episodes	24
Optimizer	Adam
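The soft update coefficient in Table 4 controls how quickly the target critic tracks the online critic. The sketch below shows the standard Polyak averaging rule used in SAC-style algorithms with the listed coefficient of 0.005; plain Python lists stand in for network parameters, and the helper `soft_update` is a hypothetical name, not taken from the authors' code.

```python
RHO = 0.005  # soft update coefficient (Table 4)


def soft_update(target, online, rho=RHO):
    """Polyak averaging: target <- rho * online + (1 - rho) * target."""
    return [rho * w + (1.0 - rho) * t for t, w in zip(target, online)]


# Toy parameters: the online values are held fixed to show the tracking speed.
target = [0.0, 0.0]
online = [1.0, -2.0]

for step in range(1, 1001):
    target = soft_update(target, online)
    if step in (1, 100, 1000):
        print(f"after {step:4d} updates: target = "
              f"[{target[0]:.4f}, {target[1]:.4f}]")
```

With a fixed online network, the remaining gap shrinks by a factor of 0.995 per update, so the target changes slowly and the bootstrapped critic targets stay comparatively stable.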

Table 5. Benchmark schemes used for performance comparison.

Scheme	Description
Proposed HC-SAC	The proposed hierarchical constrained SAC scheme jointly optimizes the UAV trajectory, STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation. The hierarchical policy structure decouples large-scale UAV and partition control from fine-grained phase/power control, while the constraint-aware reward shaping mechanism improves the security–latency trade-off.
PPO-based scheme	The proximal policy optimization algorithm is adopted to learn the joint control policy. It uses the same state space, action space, feasibility mapping, reward components, and simulation environment as the proposed HC-SAC. The main difference lies in the on-policy clipped policy update mechanism.
DDPG-based scheme	The deep deterministic policy gradient algorithm is used as another DRL-based baseline. It adopts the same state representation, action mapping rules, reward function, and environmental settings as HC-SAC. Different from HC-SAC, DDPG learns a deterministic policy and does not use entropy-regularized exploration.
Fixed UAV trajectory	The UAV follows a predefined straight-line trajectory, while the STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation are optimized. This benchmark is used to evaluate the contribution of UAV trajectory optimization.
Fixed STAR-RIS partition	The STAR-RIS transmission–reflection partition ratio is fixed during the whole service period, while the UAV trajectory, phase-shift matrices, and transmit power allocation are optimized. This benchmark is used to evaluate the benefit of adaptive STAR-RIS transmission–reflection partitioning.
Random phase-shift scheme	The STAR-RIS phase shifts are randomly generated, while the UAV trajectory, transmission–reflection partition ratio, and transmit power allocation are optimized. This benchmark is used to evaluate the importance of the STAR-RIS phase-shift optimization.
No STAR-RIS scheme	The UAV provides aerial assistance without STAR-RIS-enabled propagation reconfiguration. Only the UAV trajectory and transmit power allocation are optimized. This benchmark is used to quantify the performance gain brought by the UAV-mounted STAR-RIS.

Table 6. Ablation study of the proposed HC-SAC framework.

Variant	Secrecy Rate (Mbps)	Delay (Slots)	SOP	SSR	Utility
Standard SAC	0.476 ± 0.003	11.753 ± 0.133	0.912 ± 0.004	0.105 ± 0.001	0.146 ± 0.013
HC-SAC w/o hierarchy	0.516 ± 0.008	11.390 ± 0.129	0.880 ± 0.002	0.114 ± 0.002	0.452 ± 0.018
HC-SAC w/o constraints	0.388 ± 0.003	10.831 ± 0.116	0.738 ± 0.002	0.102 ± 0.001	0.552 ± 0.001
Proposed HC-SAC	0.433 ± 0.004	10.440 ± 0.109	0.714 ± 0.002	0.109 ± 0.001	0.851 ± 0.020

Table 7. Robustness evaluation of the proposed HC-SAC under imperfect CSI and finite-resolution STAR-RIS phase control.

Setting	Secrecy Rate (Mbps)	Delay (Slots)	SOP	SSR
Ideal CSI, continuous phase	0.488 ± 0.009	10.112 ± 0.030	0.698 ± 0.002	0.121 ± 0.001
CSI error std. = 0.02	0.492 ± 0.010	10.282 ± 0.049	0.695 ± 0.001	0.123 ± 0.001
CSI error std. = 0.05	0.492 ± 0.010	10.282 ± 0.049	0.696 ± 0.001	0.123 ± 0.001
CSI error std. = 0.10	0.492 ± 0.010	10.282 ± 0.049	0.695 ± 0.001	0.123 ± 0.001
3-bit phase quantization	0.488 ± 0.009	10.113 ± 0.030	0.699 ± 0.002	0.121 ± 0.001
2-bit phase quantization	0.488 ± 0.009	10.112 ± 0.030	0.698 ± 0.002	0.121 ± 0.001
1-bit phase quantization	0.488 ± 0.009	10.111 ± 0.030	0.698 ± 0.002	0.121 ± 0.001

Table 8. Reference performance and composite utility comparison of all schemes.

Scheme	Avg. Secrecy Rate (Mbps)	Avg. Delay (Slots)	SOP	SSR	Composite Utility
Proposed HC-SAC	0.3928	10.8471	0.7160	0.1004	0.9254
PPO-based scheme	0.4927	11.7230	0.8501	0.1163	0.6630
DDPG-based scheme	0.4171	11.9358	0.8599	0.1012	0.5060
Fixed UAV trajectory	0.1811	12.4940	0.9058	0.0628	0.2403
Fixed STAR-RIS partition	0.2233	12.2493	0.8932	0.0681	0.3241
Random phase-shift scheme	0.1792	12.4942	0.9036	0.0624	0.2480
No STAR-RIS scheme	0.0903	13.3957	0.9473	0.0497	0.0000


