Article

Deep Reinforcement Learning for Secure and Low-Latency Communications in UAV-Mounted STAR-RIS Assisted Urban Vehicular Networks

1 School of Artificial Intelligence, Shaoyang Industry Polytechnic College, Shaoyang 422000, China
2 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650000, China
* Authors to whom correspondence should be addressed.
Sensors 2026, 26(11), 3469; https://doi.org/10.3390/s26113469
Submission received: 21 April 2026 / Revised: 12 May 2026 / Accepted: 28 May 2026 / Published: 31 May 2026

Abstract

This paper investigates secure and low-latency communications in UAV-mounted simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-assisted urban vehicular networks, where severe blockage, high vehicle mobility, eavesdropping threats, and delay-sensitive traffic services coexist. In the considered system, the UAV is used not only as an aerial carrier for the STAR-RIS but also as a mobile intelligent control node that can dynamically adjust its horizontal aerial position according to vehicle distribution, blockage conditions, and eavesdropping threats. First, a UAV-STAR-RIS-assisted vehicular communication system model is developed by jointly considering urban blockage, vehicle mobility, passive eavesdropping attacks, queueing dynamics, and UAV flight constraints. Then, a high-dimensional, non-convex, and strongly coupled dynamic optimization problem is formulated to maximize the long-term average secure and low-latency utility through the joint optimization of the UAV trajectory, the STAR-RIS transmission–reflection partition ratio, the phase-shift matrices, and the transmit power allocation. Furthermore, the problem is modeled as a Markov decision process with continuous state and action spaces, and a hierarchical constrained soft actor–critic (HC-SAC)-based joint control algorithm is proposed to enable adaptive UAV movement, STAR-RIS configuration, and power control in complex dynamic environments. Simulation results demonstrate that the proposed method outperforms DDPG and several structural benchmark schemes. In the representative evaluation, the proposed HC-SAC achieves an average delay of 10.85 slots and a secrecy outage probability of 0.7160, compared with 11.72 slots and 0.8501 for PPO, and 11.94 slots and 0.8599 for DDPG. Although PPO provides the highest average secrecy rate and successful service ratio, the proposed method still maintains a competitive secure communication capability and service reliability. 
A normalized composite utility analysis further shows that HC-SAC attains the highest utility value of 0.9254, indicating a more favorable security–latency trade-off in complex urban vehicular scenarios.

1. Introduction

Integrated sensing and communications (ISAC) has emerged as a promising technology for next-generation wireless systems, since it enables spectrum sharing and functional integration between communication and sensing tasks. Meng et al. [1] discussed the opportunities and challenges of cooperative ISAC networks, and pointed out that network-level cooperation can expand sensing and communication coverage and provide additional degrees of freedom for ISAC design. For vehicular scenarios, Liu et al. [2] investigated energy-efficient computation offloading and resource allocation in ISAC-aided 6G V2X networks, showing that ISAC can support both communication services and computation-intensive vehicular applications. Cheng et al. [3] further studied ISAC for vehicular communication networks and highlighted its potential in supporting environmental perception, target awareness, and reliable vehicular information exchange. These studies indicate that ISAC is particularly attractive for urban vehicular networks, where high-rate communication, environmental sensing, and service awareness are required simultaneously.
However, urban vehicular environments also bring severe challenges to ISAC-enabled V2X transmission. Roadside buildings, overpasses, and dense traffic flows may cause frequent blockage and rapid channel fluctuations, which seriously degrade communication reliability and coverage continuity. Meanwhile, vehicular services such as cooperative driving, hazard warning, and real-time control are delay-sensitive, making low-latency transmission an essential requirement. More importantly, the broadcast nature of wireless channels makes vehicular messages vulnerable to interception. Yu et al. [4] studied movable-antenna-aided secure V2X communications from an ISAC perspective, indicating that mobility-aware antenna reconfiguration can improve secure transmission performance. Hasan et al. [5] provided a comprehensive study on securing V2X communication platforms and summarized key security threats in vehicular networks. Gyawali et al. [6] reviewed the major challenges and solutions for cellular-based V2X communications, including reliability, latency, and resource management issues. Hakeem and Kim [7] further surveyed machine learning, federated learning, and edge AI techniques for V2X intrusion detection. Although these works have advanced secure and reliable V2X communications, the joint design of physical-layer security, low-latency service, and UAV-assisted intelligent propagation control remains insufficiently explored.
To improve propagation quality in blocked environments, reconfigurable intelligent surfaces (RISs) have been widely recognized as an effective means of reshaping wireless channels. ElMossallamy et al. [8] summarized the principles, challenges, and opportunities of RIS-assisted wireless communications, showing that RIS can enhance link quality by tuning the electromagnetic responses of passive elements. However, conventional RIS mainly relies on half-space reflection, which limits its service capability when vehicles are distributed on both sides of urban roads. To address this limitation, Mu et al. [9] investigated simultaneously transmitting and reflecting RIS (STAR-RIS)-aided wireless communications, where the incident signal can be simultaneously transmitted and reflected to provide full-space coverage. In the V2X context, Aung et al. [10] proposed a deep reinforcement learning-based joint spectrum allocation and configuration design method for STAR-RIS-assisted V2X communications, demonstrating the potential of STAR-RIS in dynamic vehicular environments. Nevertheless, most existing STAR-RIS-assisted V2X studies mainly focus on spectrum allocation or configuration design, while secure low-latency communication with UAV-mounted STAR-RIS remains less investigated.
Although STAR-RIS offers stronger propagation control capability, fixed roadside or building-mounted deployment still suffers from limited coverage adaptability and poor flexibility in highly dynamic vehicular scenarios. In contrast, unmanned aerial vehicles (UAVs) can provide agile aerial deployment and favorable line-of-sight links. Andreou et al. [11] studied UAV-assisted roadside units for V2X connectivity using Voronoi diagrams in 6G+ infrastructures, showing that UAVs can improve vehicular connectivity through flexible aerial assistance. Peng et al. [12] investigated a UAV-borne STAR-RIS-assisted non-orthogonal multiple access system and developed a joint power allocation algorithm, verifying the feasibility of mounting STAR-RIS on UAV platforms. These studies show that the combination of UAV mobility and STAR-RIS programmability can provide adaptive communication support for vehicular networks. However, existing UAV-assisted or UAV-borne STAR-RIS studies have not fully considered the coupling among UAV trajectory, STAR-RIS transmission–reflection partitioning, phase-shift control, eavesdropping threats, and queue-aware latency.
Several recent studies have further investigated UAV-RIS communications and learning-based UAV network optimization. Nakazato et al. [13] proposed a multi-agent reinforcement learning method for resilient UAV ad hoc backhaul networks, where multiple UAVs collaboratively adjust their deployment to improve coverage and connectivity. This work demonstrates the applicability of reinforcement learning to dynamic low-altitude UAV networks, but it does not consider RIS/STAR-RIS-assisted physical-layer transmission or secure vehicular services. Li et al. [14] studied an RIS-assisted UAV communication system and jointly optimized the UAV trajectory and RIS passive beamforming to maximize the average achievable rate. Yang et al. [15] analyzed the performance of RIS-assisted dual-hop UAV communication systems and derived analytical expressions for outage probability, average bit-error rate, and average capacity, confirming the coverage and reliability gains introduced by RIS. Liu et al. [16] further proposed a machine-learning-empowered UAV-RIS framework, where UAV movement, RIS phase shifts, power allocation, and decoding order were jointly designed. In terms of secure UAV communication, Li et al. [17] investigated robust secure UAV communications with the aid of RIS and jointly optimized the UAV trajectory, RIS passive beamforming, and transmit power under imperfect eavesdropping CSI. Shi et al. [18] provided a comprehensive survey of RIS-aided cell-free massive MIMO systems for 6G and summarized different RIS architectures and applications, including STAR-RIS and UAV integration. Although these studies provide important foundations for UAV-RIS communication, RIS-assisted security, and learning-based UAV control, they mainly focus on conventional reflecting RISs, fixed RIS deployment, or rate/security-oriented optimization. 
The joint design of UAV-mounted STAR-RIS, high-mobility vehicular users, queue-aware low-latency services, and passive eavesdropping defense remains insufficiently studied.
Meanwhile, physical-layer security and intelligent resource control in RIS-assisted V2X networks have drawn growing attention. Saikia et al. [19] proposed a PPO-based method for RIS-assisted full-duplex 6G-V2X communications, showing that reinforcement learning can support dynamic RIS configuration and resource allocation. Long et al. [20] studied deep reinforcement learning for ISAC in RIS-assisted 6G V2X systems, where sensing and communication performance are jointly optimized. Wang et al. [21] investigated physical-layer security enhancement using artificial noise in C-V2X networks, demonstrating that artificial noise can help suppress eavesdropping. De Lima et al. [22] considered broadband beamforming and jamming mitigation in V2X scenarios, which is related to anti-interference and robust V2X transmission. Shang et al. [23] further studied energy-efficient and intelligent ISAC in V2X networks using spiking-neural-network-driven DRL. These studies provide useful references for secure, intelligent, and ISAC-oriented V2X communications. Nevertheless, they mainly focus on fixed RIS, conventional V2X infrastructure, artificial-noise-aided security, or ISAC resource allocation, while the joint security–latency optimization of UAV-mounted STAR-RIS-assisted vehicular networks remains an open problem.
Deep reinforcement learning (DRL) provides a natural tool for the considered dynamic optimization problem. Since the system involves time-varying vehicle positions, queue evolution, eavesdropper distribution, UAV motion, and STAR-RIS configuration, the resulting optimization problem is high-dimensional, strongly coupled, and non-convex. Conventional model-based optimization methods often incur high computational complexity and are not well-suited for real-time adaptation in fast-changing vehicular scenarios. Li et al. [24] proposed a physical-layer eavesdropping defense scheme for V2X based on an improved SAC algorithm, indicating that SAC-type methods are effective for secure V2X decision-making. Amudha et al. [25] studied a hyperparameter-tuned PPO-based federated DRL method for efficient V2X resource allocation, showing the applicability of PPO in V2X resource management. Mlika and Cherkaoui [26] applied DDPG to minimize the age of information in cellular V2X communications, demonstrating the suitability of deterministic policy gradient methods for continuous vehicular control problems. However, directly applying conventional SAC, PPO, or DDPG to the considered scenario is still challenging because UAV movement, STAR-RIS partitioning, phase-shift configuration, transmit power allocation, secrecy constraints, and latency requirements are tightly coupled.
Motivated by the above observations, this paper investigates secure and low-latency communications in UAV-mounted STAR-RIS-assisted urban vehicular networks. Different from existing studies, the UAV in this work is used not only as an aerial carrier for the STAR-RIS but also as a mobile intelligent control node that can dynamically adjust its trajectory according to vehicle distribution, blockage conditions, queue states, and eavesdropping threats. Meanwhile, the STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation are jointly optimized with the UAV trajectory. To better handle the coupled control variables and the stringent security–latency trade-off, we formulate a long-term utility maximization problem and develop a hierarchical constrained soft actor–critic (HC-SAC)-based joint optimization method. Specifically, the original problem is first transformed into a Markov decision process with continuous state and action spaces, and then the UAV trajectory, STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation are jointly optimized through hierarchical and constraint-aware policy learning.
To further clarify the technical differences between this work and existing studies, Table 1 summarizes representative related works in terms of the considered scenario, UAV mobility, RIS/STAR-RIS configuration, physical-layer security, queue-aware latency, learning-based optimization, hierarchical control, and constraint-aware learning.
As shown in Table 1, existing UAV-RIS studies mainly focus on trajectory design, passive beamforming, performance analysis, or secrecy-rate maximization with conventional reflecting RISs. Existing STAR-RIS-assisted V2X works mainly consider fixed surface deployment and spectrum/resource allocation, while the UAV-mounted STAR-RIS architecture has not been sufficiently investigated for secure and low-latency vehicular communication. Moreover, existing DRL- or SAC-based V2X studies rarely integrate UAV trajectory control, STAR-RIS transmission–reflection partitioning, phase-shift design, power allocation, physical-layer security, and queue-aware latency into a unified constrained learning framework. Therefore, the proposed HC-SAC method differs from existing studies by jointly exploiting UAV mobility, STAR-RIS full-space reconfiguration, and hierarchical constraint-aware policy learning.
Compared with the existing literature, the main differences of this work are threefold. First, unlike conventional RIS-assisted or STAR-RIS-assisted V2X studies that mainly consider fixed surface deployment, this paper investigates a UAV-mounted STAR-RIS architecture, where UAV mobility provides additional spatial degrees of freedom for adaptive urban vehicular coverage. Second, unlike existing secure V2X or UAV-RIS works that mainly focus on artificial noise, beamforming, secrecy-rate maximization, or resource allocation, this paper jointly considers passive eavesdropping threats, queue-aware low-latency service, and UAV flight constraints. Third, unlike existing DRL-based V2X optimization methods, the proposed HC-SAC framework jointly learns UAV trajectory control, STAR-RIS transmission–reflection partitioning, phase-shift configuration, and transmit power allocation under coupled security and latency constraints.
The main contributions of this paper are summarized as follows:
  • A UAV-mounted STAR-RIS-assisted urban vehicular communication framework is established by jointly considering urban blockage, dynamic vehicle mobility, passive eavesdropping threats, queueing delay, and UAV mobility constraints.
  • A long-term secure and low-latency utility maximization problem is formulated to jointly optimize the UAV trajectory, the STAR-RIS transmission–reflection partition ratio, the phase-shift matrices, and the transmit power allocation, resulting in a high-dimensional and strongly coupled continuous-control problem.
  • A hierarchical constrained soft actor–critic-based joint optimization algorithm is proposed to address the above problem. The developed method improves the adaptability of UAV-mounted STAR-RIS control in dynamic vehicular scenarios and enhances the trade-off between secrecy performance and delay efficiency.
  • Simulation results demonstrate that the proposed method outperforms DDPG and all structural benchmark schemes. Compared with PPO, it achieves lower delay and lower secrecy outage probability while maintaining a competitive secrecy rate and successful service ratio, thereby yielding the highest normalized composite utility.
The rest of this paper is organized as follows. Section 2 presents the system model. Section 3 formulates the MDP-based problem transformation. Section 4 develops the HC-SAC-based joint optimization algorithm. Section 5 provides simulation results and performance analysis. Finally, conclusions are drawn in Section 6.
Notation: Bold uppercase letters, bold lowercase letters, and lowercase letters denote matrices, vectors, and scalars, respectively. $(\cdot)^T$ denotes the transpose operation, $\|\cdot\|$ denotes the Euclidean norm, $\mathbb{E}[\cdot]$ denotes the expectation operator, and $j = \sqrt{-1}$ is the imaginary unit.

2. System Model

In the envisioned ISAC-enabled vehicular network, this paper focuses on secure and low-latency communication functionality. The sensing capability is treated as an auxiliary information source that can provide contextual information, such as vehicle distribution, blockage conditions, and mobility states, to the BS/UAV controller through the state acquisition mechanism discussed in Section 2.4. Accordingly, the following system model characterizes the communication-oriented optimization part of the ISAC-enabled network, while dedicated radar waveform design and sensing-performance optimization are beyond the scope of this work.

2.1. Geometry and Mobility Model

Consider a typical urban road intersection scenario, where the system consists of one base station (BS), one UAV-mounted STAR-RIS, K legitimate vehicular users, and E passive eavesdropping vehicles, as shown in Figure 1.
The BS is deployed at a fixed roadside location to provide downlink communication services for the vehicles within the target area. Due to building blockage and high vehicle mobility, the direct BS-to-vehicle links may experience non-line-of-sight (NLoS) conditions for some users. To enhance the communication quality in blocked areas, a UAV carrying a STAR-RIS is deployed above the road to provide aerial assistance.
The total service duration is divided into $N$ discrete time slots, each with a duration of $\delta_t$. Let the BS position be denoted by
$$\mathbf{w}_b = [x_b, y_b, 0]^T.$$
The UAV position at slot $n$ is denoted by
$$\mathbf{q}[n] = [x[n], y[n], H]^T, \quad n = 1, 2, \ldots, N,$$
where H is the fixed UAV flight altitude. Therefore, the UAV trajectory optimization focuses on the two-dimensional horizontal deployment while maintaining a constant altitude, which is consistent with the rotary-wing propulsion energy model adopted in this paper. The position of the k-th legitimate vehicular user at slot n is given by
$$\mathbf{u}_k[n] = [x_k[n], y_k[n], 0]^T,$$
and the position of the $e$-th passive eavesdropper at slot $n$ is denoted by
$$\mathbf{z}_e[n] = [x_e[n], y_e[n], 0]^T.$$
Due to the UAV mobility limitation, the displacement between two adjacent slots must satisfy
$$\|\mathbf{q}[n+1] - \mathbf{q}[n]\| \le V_{\max}\,\delta_t, \quad n = 1, \ldots, N-1,$$
where $V_{\max}$ denotes the maximum UAV speed.
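The speed constraint above can be enforced in a learning loop in several ways, and the source does not state which one HC-SAC uses. The following minimal sketch assumes a simple projection: an infeasible horizontal step is rescaled onto the boundary of the feasible disc of radius $V_{\max}\delta_t$. The function name `clip_uav_step` and its arguments are illustrative, not taken from the paper.

```python
import math


def clip_uav_step(q_now, q_next, v_max, delta_t):
    """Return a next horizontal waypoint that satisfies the per-slot speed limit.

    q_now, q_next: (x, y) tuples in meters (altitude is fixed, so it is omitted).
    Assumption: infeasible steps are rescaled along the same direction.
    """
    dx = q_next[0] - q_now[0]
    dy = q_next[1] - q_now[1]
    dist = math.hypot(dx, dy)
    max_dist = v_max * delta_t
    if dist <= max_dist:
        # The proposed waypoint is already feasible.
        return (q_next[0], q_next[1])
    scale = max_dist / dist
    return (q_now[0] + scale * dx, q_now[1] + scale * dy)
```

Projection keeps the direction chosen by the policy and only shortens the step, which is one reason it is often preferred over rejecting the action outright.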
The STAR-RIS mounted on the UAV consists of $M$ programmable elements. Owing to its simultaneous transmission and reflection capability, the reflection ratio and transmission ratio at slot $n$ are denoted by $\beta_r[n]$ and $\beta_t[n]$, respectively, and satisfy
$$\beta_r[n] + \beta_t[n] = 1, \quad 0 \le \beta_r[n], \beta_t[n] \le 1.$$
The corresponding reflection and transmission phase-shift matrices are defined as
$$\boldsymbol{\Theta}_r[n] = \mathrm{diag}\left(\sqrt{\beta_r[n]}\, e^{j\theta_{r,1}[n]}, \ldots, \sqrt{\beta_r[n]}\, e^{j\theta_{r,M}[n]}\right),$$
$$\boldsymbol{\Theta}_t[n] = \mathrm{diag}\left(\sqrt{\beta_t[n]}\, e^{j\theta_{t,1}[n]}, \ldots, \sqrt{\beta_t[n]}\, e^{j\theta_{t,M}[n]}\right),$$
where $\theta_{r,m}[n]$ and $\theta_{t,m}[n]$ denote the phase shifts of the $m$-th STAR-RIS element under the reflection and transmission modes, respectively.
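As a small illustration of the STAR-RIS configuration, the sketch below builds the diagonal entries of the reflection and transmission matrices from a partition ratio and two phase vectors. It assumes that the amplitude of each coefficient is the square root of the corresponding power ratio, so that the reflected and transmitted energy of every element sums to one. The function name `star_ris_coefficients` is hypothetical.

```python
import cmath
import math


def star_ris_coefficients(beta_r, phases_r, phases_t):
    """Return the diagonal entries of Theta_r and Theta_t as two lists.

    beta_r: reflection power ratio in [0, 1]; the transmission ratio is 1 - beta_r.
    phases_r, phases_t: per-element phase shifts in radians (length M each).
    Assumption: amplitudes are sqrt(beta_r) and sqrt(beta_t).
    """
    if not 0.0 <= beta_r <= 1.0:
        raise ValueError("beta_r must lie in [0, 1]")
    if len(phases_r) != len(phases_t):
        raise ValueError("both phase vectors must have M entries")
    beta_t = 1.0 - beta_r
    theta_r = [math.sqrt(beta_r) * cmath.exp(1j * p) for p in phases_r]
    theta_t = [math.sqrt(beta_t) * cmath.exp(1j * p) for p in phases_t]
    return theta_r, theta_t
```

Because both matrices are diagonal, storing only the diagonal entries is sufficient for all later channel computations.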

2.2. Channel Gain Model

To characterize the urban blockage effect and small-scale fading, the channel between nodes $i$ and $j$ at slot $n$ is modeled as
$$h_{i,j}[n] = \sqrt{\rho_0\, d_{i,j}^{-\alpha_{i,j}}[n]} \left( \sqrt{\frac{\kappa_{i,j}}{1+\kappa_{i,j}}}\, h_{i,j}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1+\kappa_{i,j}}}\, h_{i,j}^{\mathrm{NLoS}}[n] \right),$$
where $\rho_0$ denotes the channel power gain at the reference distance of 1 m, $d_{i,j}[n]$ is the Euclidean distance between nodes $i$ and $j$, $\alpha_{i,j}$ is the path-loss exponent, and $\kappa_{i,j}$ denotes the Rician factor. Moreover, $h_{i,j}^{\mathrm{LoS}}[n]$ and $h_{i,j}^{\mathrm{NLoS}}[n]$ denote the deterministic LoS component and the random NLoS component, respectively.
For the BS–STAR-RIS link and the STAR-RIS–vehicle/eavesdropper links, which are mainly dominated by line-of-sight propagation, relatively large Rician factors are adopted. By contrast, for the direct BS-to-vehicle and BS-to-eavesdropper links in blocked urban environments, smaller Rician factors and larger path-loss exponents are used to characterize severe NLoS propagation conditions. Accordingly, the BS–STAR-RIS channel, the STAR-RIS–user channel, the STAR-RIS–eavesdropper channel, and the corresponding direct links are written, respectively, as
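$$\mathbf{h}_{br}[n] = \sqrt{\rho_0\, d_{br}^{-\alpha_{br}}[n]} \left( \sqrt{\frac{\kappa_{br}}{1+\kappa_{br}}}\, \mathbf{h}_{br}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1+\kappa_{br}}}\, \mathbf{h}_{br}^{\mathrm{NLoS}}[n] \right),$$
$$\mathbf{g}_k[n] = \sqrt{\rho_0\, d_{rk}^{-\alpha_{rk}}[n]} \left( \sqrt{\frac{\kappa_{rk}}{1+\kappa_{rk}}}\, \mathbf{g}_k^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1+\kappa_{rk}}}\, \mathbf{g}_k^{\mathrm{NLoS}}[n] \right),$$
$$\mathbf{g}_e[n] = \sqrt{\rho_0\, d_{re}^{-\alpha_{re}}[n]} \left( \sqrt{\frac{\kappa_{re}}{1+\kappa_{re}}}\, \mathbf{g}_e^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1+\kappa_{re}}}\, \mathbf{g}_e^{\mathrm{NLoS}}[n] \right),$$
$$h_{d,k}[n] = \sqrt{\rho_0\, d_{bk}^{-\alpha_{bk}}[n]} \left( \sqrt{\frac{\kappa_{bk}}{1+\kappa_{bk}}}\, h_{d,k}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1+\kappa_{bk}}}\, h_{d,k}^{\mathrm{NLoS}}[n] \right),$$
$$h_{d,e}[n] = \sqrt{\rho_0\, d_{be}^{-\alpha_{be}}[n]} \left( \sqrt{\frac{\kappa_{be}}{1+\kappa_{be}}}\, h_{d,e}^{\mathrm{LoS}}[n] + \sqrt{\frac{1}{1+\kappa_{be}}}\, h_{d,e}^{\mathrm{NLoS}}[n] \right).$$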
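To make the channel model concrete, the following sketch draws one scalar Rician coefficient. The source only states that the LoS part is deterministic and the NLoS part is random; here the LoS term is assumed to be a unit-modulus phasor and the NLoS term is assumed to follow $\mathcal{CN}(0, 1)$, which are common choices but are not confirmed by the paper. The function name `rician_coefficient` is illustrative.

```python
import cmath
import math
import random


def rician_coefficient(distance, alpha, kappa, rho0, los_phase, rng):
    """Draw one Rician channel coefficient for a single link.

    distance: link distance in meters; alpha: path-loss exponent;
    kappa: Rician factor; rho0: power gain at the 1 m reference distance.
    Assumptions: LoS term is exp(j * los_phase); NLoS term is CN(0, 1).
    """
    los = cmath.exp(1j * los_phase)
    # Real and imaginary parts each carry half of the unit NLoS power.
    nlos = complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
    large_scale = math.sqrt(rho0 * distance ** (-alpha))
    los_weight = math.sqrt(kappa / (1.0 + kappa))
    nlos_weight = math.sqrt(1.0 / (1.0 + kappa))
    return large_scale * (los_weight * los + nlos_weight * nlos)
```

A large `kappa` with a small `alpha` mimics the LoS-dominated UAV links, while a small `kappa` with a larger `alpha` mimics the blocked direct links described above. A vector channel such as the BS-to-STAR-RIS link can be obtained by calling the function once per element with element-specific LoS phases.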

2.3. Imperfect CSI and Practical STAR-RIS Constraints

In the above channel model, the CSI is assumed to be available at the BS/UAV controller for resource optimization. However, in practical UAV-mounted STAR-RIS-assisted vehicular networks, channel estimation errors, phase quantization, and hardware impairments may exist due to vehicle mobility, limited pilot overhead, and finite-resolution STAR-RIS control circuits. To improve the practical interpretability of the proposed framework, this subsection discusses imperfect CSI and practical STAR-RIS constraints.
For the estimated channel coefficient, an additive CSI error model is adopted:
$$\hat{h}_{i,j}[n] = h_{i,j}[n] + e_{i,j}[n],$$
where $h_{i,j}[n]$ denotes the actual channel coefficient between nodes $i$ and $j$, $\hat{h}_{i,j}[n]$ is the estimated CSI available at the controller, and $e_{i,j}[n]$ denotes the channel estimation error. The estimation error is modeled as a complex Gaussian random variable:
$$e_{i,j}[n] \sim \mathcal{CN}(0, \sigma_e^2),$$
where $\sigma_e^2$ characterizes the CSI uncertainty level. During policy execution, the HC-SAC agent makes decisions based on the estimated CSI $\hat{h}_{i,j}[n]$, while the actual received signal quality is affected by the true channel $h_{i,j}[n]$.
Moreover, practical STAR-RIS elements usually support only finite-resolution phase shifts. Therefore, the continuous phase shift $\theta_m[n]$ can be quantized into a finite codebook:
$$\mathcal{F}_B = \left\{ 0, \frac{2\pi}{2^B}, \ldots, \frac{2\pi(2^B - 1)}{2^B} \right\},$$
where $B$ denotes the number of phase quantization bits. The quantized phase shift is given by
$$\theta_m^Q[n] = \arg\min_{\theta \in \mathcal{F}_B} \left| \theta - \theta_m[n] \right|.$$
Accordingly, the practical STAR-RIS phase-shift matrices can be written as
$$\boldsymbol{\Theta}_r^Q[n] = \mathrm{diag}\left( e^{j\theta_{r,1}^Q[n]}, \ldots, e^{j\theta_{r,M}^Q[n]} \right),$$
$$\boldsymbol{\Theta}_t^Q[n] = \mathrm{diag}\left( e^{j\theta_{t,1}^Q[n]}, \ldots, e^{j\theta_{t,M}^Q[n]} \right).$$
In addition, STAR-RIS hardware impairments can be modeled by amplitude attenuation factors. Specifically, the practical reflection and transmission coefficients are expressed as
$$\tilde{\beta}_{r,m}[n] = \eta_r \beta_{r,m}[n], \quad \tilde{\beta}_{t,m}[n] = \eta_t \beta_{t,m}[n],$$
where $0 < \eta_r \le 1$ and $0 < \eta_t \le 1$ denote the hardware efficiency factors of the reflection and transmission modes, respectively. When $\eta_r = \eta_t = 1$ and $B \to \infty$, the ideal STAR-RIS model is recovered.
In the main algorithm design, the proposed HC-SAC framework is trained based on the estimated CSI and continuous STAR-RIS control variables. Imperfect CSI, finite-resolution phase control, and hardware impairments are included here as practical modeling considerations, and their impact is further evaluated in the robustness analysis in Section 5.10.
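The two practical effects in this subsection can be sketched in a few lines. The snippet below perturbs a true channel coefficient with additive complex Gaussian error of variance $\sigma_e^2$ and maps a continuous phase to its nearest codebook entry. It is a minimal illustration under stated assumptions: phases are first wrapped to $[0, 2\pi)$, the nearest entry is found by plain absolute distance (no circular wrap-around), and the function names are hypothetical.

```python
import math
import random


def noisy_csi(h_true, sigma_e2, rng):
    """Return the estimated coefficient h_true + e, with e ~ CN(0, sigma_e2)."""
    std = math.sqrt(sigma_e2 / 2.0)  # variance is split over real and imaginary parts
    return h_true + complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def phase_codebook(bits):
    """Return the 2**bits uniformly spaced phases in [0, 2*pi)."""
    levels = 2 ** bits
    return [2.0 * math.pi * i / levels for i in range(levels)]


def quantize_phase(theta, bits):
    """Map a continuous phase to the closest codebook entry.

    Assumption: theta is wrapped to [0, 2*pi) and distance is |c - theta|.
    """
    theta = theta % (2.0 * math.pi)
    return min(phase_codebook(bits), key=lambda c: abs(c - theta))
```

With `bits=2`, for example, the codebook contains the four phases $0$, $\pi/2$, $\pi$, and $3\pi/2$, so a continuous phase of 1.5 rad is mapped to $\pi/2$.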

2.4. Vehicle State Acquisition and Control Signaling

In practical UAV-mounted STAR-RIS-assisted vehicular networks, the STAR-RIS does not independently detect vehicular users or estimate their channels. Instead, vehicle state acquisition and STAR-RIS control are coordinated by the BS and the UAV-mounted controller through vehicular signaling, pilot transmission, and sensing-assisted state estimation. To clarify the practical operation of the considered system, this subsection describes the vehicle state acquisition and control signaling mechanism.
Specifically, each vehicular user periodically broadcasts basic safety messages (BSMs) or cooperative awareness messages (CAMs), which contain information on the user’s position, velocity, moving direction, and service-related information. Meanwhile, uplink pilot signals are transmitted to assist the BS/UAV controller in estimating the channel state information (CSI). In sensing-assisted vehicular networks, onboard or roadside sensing functions can further provide auxiliary information on vehicle distribution, blockage conditions, and potential target states. Based on these signaling and sensing results, the BS or UAV-mounted controller constructs the system state, including vehicle mobility information, queue states, channel-related features, and eavesdropper-related information.
After obtaining the system state at the beginning of each time slot, the proposed HC-SAC policy generates the UAV trajectory control, STAR-RIS transmission–reflection partition ratio, phase-shift configuration, and transmit power allocation. The resulting control commands are then delivered to the UAV flight controller and the STAR-RIS controller through a dedicated low-rate control link. The UAV adjusts its position according to the trajectory command, while the STAR-RIS updates its transmission–reflection coefficients and phase shifts according to the received configuration command.
For analytical tractability, the duration of each time slot is assumed to be sufficiently short such that the vehicle positions, channel states, and queue states remain approximately unchanged within one slot. At the beginning of the next time slot, the BS/UAV controller updates the observed system state according to newly received BSM/CAM messages, pilot measurements, and sensing feedback. Therefore, the proposed framework operates in a closed-loop manner, where vehicle state acquisition, policy decision, STAR-RIS control, and environment update are repeatedly performed over time.

2.5. Secure Communication Model

Assume that the BS employs downlink superposition transmission at slot n, and the transmitted signal is expressed as
$$x[n] = \sum_{k=1}^{K} \sqrt{p_k[n]}\, s_k[n],$$
where $p_k[n]$ denotes the transmit power allocated to the $k$-th legitimate user, and $s_k[n]$ is the corresponding information symbol satisfying $\mathbb{E}[|s_k[n]|^2] = 1$. The total transmit power is constrained by
$$\sum_{k=1}^{K} p_k[n] \le P_{\max}, \quad \forall n.$$
Since vehicular users are located on both sides of the road, different users may be served by either the reflection region or the transmission region of the STAR-RIS. Let $o_k[n] \in \{r, t\}$ denote the operating mode associated with user $k$ at slot $n$, where $r$ and $t$ represent the reflection and transmission modes, respectively. Then, the equivalent channel from the BS to the $k$-th legitimate user can be written as
$$h_k^{\mathrm{eq}}[n] = h_{d,k}[n] + \mathbf{g}_k^H[n]\, \boldsymbol{\Theta}_{o_k[n]}[n]\, \mathbf{h}_{br}[n].$$
Accordingly, the received signal-to-interference-plus-noise ratio (SINR) of user k at slot n is given by
$$\gamma_k[n] = \frac{p_k[n]\, |h_k^{\mathrm{eq}}[n]|^2}{\sum_{i \ne k} p_i[n]\, |h_k^{\mathrm{eq}}[n]|^2 + \sigma_k^2},$$
where $\sigma_k^2$ denotes the receiver noise power. The achievable rate of user $k$ is thus
$$R_k[n] = B \log_2\left( 1 + \gamma_k[n] \right),$$
where $B$ denotes the system bandwidth.
For the e-th passive eavesdropper, the equivalent channel for intercepting the signal intended for user k is expressed as
$$h_{e,k}^{\mathrm{eq}}[n] = h_{d,e}[n] + \mathbf{g}_e^H[n]\, \boldsymbol{\Theta}_{o_k[n]}[n]\, \mathbf{h}_{br}[n].$$
The corresponding eavesdropping SINR is given by
$$\gamma_{e,k}[n] = \frac{p_k[n]\, |h_{e,k}^{\mathrm{eq}}[n]|^2}{\sum_{i \ne k} p_i[n]\, |h_{e,k}^{\mathrm{eq}}[n]|^2 + \sigma_e^2},$$
where $\sigma_e^2$ is the noise power at the eavesdropper. The corresponding eavesdropping rate is
$$R_{e,k}[n] = B \log_2\left( 1 + \gamma_{e,k}[n] \right).$$
Considering the strongest eavesdropping threat, the instantaneous secrecy rate of user k at slot n is defined as
$$R_k^{\mathrm{sec}}[n] = \left[ R_k[n] - \max_{e \in \{1, \ldots, E\}} R_{e,k}[n] \right]^+,$$
where $[x]^+ = \max(x, 0)$.
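The chain from equivalent channel to secrecy rate can be traced with a short sketch. It follows the expressions of this subsection directly: the cascaded channel is the direct coefficient plus $\mathbf{g}^H \boldsymbol{\Theta} \mathbf{h}_{br}$ for a diagonal $\boldsymbol{\Theta}$, the SINR treats the other users' signals as interference over the same equivalent channel, and the secrecy rate is taken against the strongest eavesdropper. Function names and the list-based data layout are assumptions made for illustration.

```python
import math


def cascaded_channel(h_direct, g, theta_diag, h_br):
    """Return h_direct + g^H * diag(theta_diag) * h_br for length-M lists."""
    reflected = sum(gm.conjugate() * tm * hm for gm, tm, hm in zip(g, theta_diag, h_br))
    return h_direct + reflected


def sinr(k, powers, h_eq, noise_power):
    """SINR of the stream intended for user k, observed through channel h_eq."""
    gain = abs(h_eq) ** 2
    interference = sum(p for i, p in enumerate(powers) if i != k) * gain
    return powers[k] * gain / (interference + noise_power)


def rate(bandwidth, sinr_value):
    """Achievable rate in bit/s for a bandwidth in Hz."""
    return bandwidth * math.log2(1.0 + sinr_value)


def secrecy_rate(user_rate, eavesdropper_rates):
    """Secrecy rate against the strongest eavesdropper, floored at zero."""
    return max(user_rate - max(eavesdropper_rates), 0.0)
```

The same `sinr` function serves both the legitimate user and an eavesdropper: only the equivalent channel and the noise power passed in differ.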

2.6. Queue Evolution and Low-Latency Service Model

In urban vehicular networks, many safety-related services, such as cooperative driving, road hazard warning, and vehicle status reporting, are delay-sensitive. Therefore, in addition to improving the secrecy performance, it is necessary to explicitly characterize the traffic queue evolution and the corresponding service delay. Figure 2 illustrates the queue evolution and low-latency service model for each vehicular user.
Let $A_k[n]$ denote the newly arrived data packets of the $k$-th vehicular user at time slot $n$. The packet arrival process is modeled as a Poisson process with average arrival rate $\lambda_k$, i.e.,
$$A_k[n] \sim \mathrm{Poisson}(\lambda_k).$$
Let $Q_k[n]$ denote the queue length of the $k$-th vehicular user at the beginning of time slot $n$. The service capability of the system depends on the achievable communication rate. Given the achievable rate $R_k[n]$, the slot duration $\delta_t$, and the packet size $L_0$ in bits, the number of packets that can be served during time slot $n$ is expressed as
$$\mu_k[n] = \frac{\delta_t R_k[n]}{L_0}.$$
In the optimization formulation, $\mu_k[n]$ is treated as a continuous service amount to facilitate differentiable and low-complexity policy learning. In practical packet-level implementation, the actual number of served packets can be obtained by integer rounding, e.g., $\lfloor \mu_k[n] \rfloor$.
Accordingly, the queue evolution of vehicular user k is given by
$$Q_k[n+1] = \max\left\{ Q_k[n] - \mu_k[n],\, 0 \right\} + A_k[n].$$
This equation indicates that the queue length in the next slot is jointly determined by the remaining packets after service and the newly arrived packets.
Based on Little’s law, the instantaneous queueing delay of the k-th vehicular user can be approximated by the ratio between the queue length and the packet arrival rate, i.e.,
D k [ n ] = Q k [ n ] λ k + ϵ ,
where ϵ is a small positive constant used to avoid division by zero. This instantaneous delay approximation based on Little’s law has been widely adopted in cross-layer wireless resource optimization, since it provides a low-complexity delay feedback indicator for dynamic control problems. The average system delay at time slot n is then defined as
$$D[n] = \frac{1}{K} \sum_{k=1}^{K} D_k[n],$$
where K denotes the number of vehicular users.
To capture the low-latency service requirement, a delay violation indicator is defined as
$$I_k^{D}[n] = \begin{cases} 1, & D_k[n] > D_{\max}, \\ 0, & D_k[n] \le D_{\max}, \end{cases}$$
where $D_{\max}$ is the maximum tolerable delay threshold. This queue-aware delay model is incorporated into the reward function of the proposed DRL framework, so that the learned control policy can jointly improve secrecy performance and suppress queue accumulation.
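The delay approximation and the violation indicator can be illustrated with a short sketch. The helper names, the threshold, and the queue and arrival values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def queue_delay(queue: float, arrival_rate: float, eps: float = 1e-6) -> float:
    """Little's-law style approximation D_k[n] = Q_k[n] / (lambda_k + eps)."""
    return queue / (arrival_rate + eps)


def delay_violation(delay: float, d_max: float) -> int:
    """Indicator that equals 1 when the delay exceeds D_max and 0 otherwise."""
    return 1 if delay > d_max else 0


def average_delay(delays) -> float:
    """System delay D[n]: mean of the per-user delays."""
    return sum(delays) / len(delays)


# Hypothetical example with two users and an assumed threshold D_max = 5 slots.
queues, rates = [4.0, 12.0], [2.0, 2.0]
delays = [queue_delay(q, lam) for q, lam in zip(queues, rates)]
flags = [delay_violation(d, d_max=5.0) for d in delays]
avg = average_delay(delays)
```

Here the first user stays near 2 slots and the second near 6 slots, so only the second user is flagged.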

2.7. UAV Flight Energy Consumption Model

Since the UAV-mounted STAR-RIS needs to continuously adjust its aerial position to assist vehicular users in dynamic urban environments, UAV propulsion energy consumption should be explicitly considered. Instead of using a simplified quadratic displacement-based energy model, this paper adopts a practical rotary-wing UAV propulsion power model, which is widely used to characterize the relationship between UAV flight speed and propulsion energy consumption.
Let $q[n] = [x_u[n], y_u[n], H]^{T}$ denote the UAV position at time slot $n$, where $H$ is the fixed flight altitude. The horizontal flight speed of the UAV during time slot $n$ is given by
$$v[n] = \frac{\| q[n+1] - q[n] \|}{\delta_t},$$
where $\delta_t$ denotes the duration of one time slot. The propulsion power consumption of the rotary-wing UAV is modeled as
$$P_{\text{UAV}}(v[n]) = P_0 \left( 1 + \frac{3 v^2[n]}{U_{\text{tip}}^2} \right) + P_i \left( \sqrt{1 + \frac{v^4[n]}{4 v_0^4}} - \frac{v^2[n]}{2 v_0^2} \right)^{1/2} + \frac{1}{2} d_0 \rho s A v^3[n],$$
where $P_0$ and $P_i$ denote the blade profile power and induced power in hovering status, respectively; $U_{\text{tip}}$ is the tip speed of the rotor blade; $v_0$ is the mean rotor induced velocity in hovering; $d_0$ is the fuselage drag ratio; $\rho$ is the air density; $s$ is the rotor solidity; and $A$ is the rotor disc area.
Accordingly, the UAV propulsion energy consumption during time slot n is expressed as
$$E_{\text{UAV}}[n] = P_{\text{UAV}}(v[n])\, \delta_t.$$
The total UAV propulsion energy consumption over the whole flight period is given by
$$E_{\text{tot}}^{\text{UAV}} = \sum_{n=1}^{N} E_{\text{UAV}}[n].$$
Recall that the UAV trajectory should satisfy the mobility constraint and the total propulsion energy budget:
$$\| q[n+1] - q[n] \| \le V_{\max} \delta_t, \quad \forall n,$$
$$E_{\text{tot}}^{\text{UAV}} \le E_{\max},$$
where $V_{\max}$ denotes the maximum UAV speed and $E_{\max}$ denotes the available UAV energy budget.
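The propulsion model and the per-slot energy accumulation can be sketched as follows. The numerical parameter values are assumed for illustration only and are not taken from the paper's Table 3; `propulsion_power` and `trajectory_energy` are hypothetical helper names.

```python
import math

# ASSUMED illustrative parameter values (not the paper's Table 3).
P0, PI = 79.86, 88.63                     # blade profile / induced power in hover [W]
U_TIP, V0 = 120.0, 4.03                   # rotor tip speed, mean induced velocity [m/s]
D0, RHO, S, A = 0.6, 1.225, 0.05, 0.503   # drag ratio, air density, solidity, disc area


def propulsion_power(v: float) -> float:
    """Rotary-wing propulsion power P_UAV(v) for horizontal speed v [m/s]."""
    blade = P0 * (1.0 + 3.0 * v**2 / U_TIP**2)
    inner = math.sqrt(1.0 + v**4 / (4.0 * V0**4)) - v**2 / (2.0 * V0**2)
    induced = PI * math.sqrt(max(inner, 0.0))  # guard against round-off below zero
    parasite = 0.5 * D0 * RHO * S * A * v**3
    return blade + induced + parasite


def trajectory_energy(positions, slot_s: float, v_max: float) -> float:
    """Sum P_UAV(v[n]) * delta_t over consecutive horizontal positions."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        v = math.hypot(x1 - x0, y1 - y0) / slot_s
        if v > v_max:
            raise ValueError("mobility constraint violated")
        total += propulsion_power(v) * slot_s
    return total


# Two hovering slots versus two slots flown at 10 m/s.
hover = trajectory_energy([(0.0, 0.0)] * 3, slot_s=1.0, v_max=20.0)
cruise = trajectory_energy([(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)], slot_s=1.0, v_max=20.0)
```

At zero speed the model reduces to $P_0 + P_i$. With the assumed values, flying at 10 m/s requires less propulsion power than hovering, which reflects the non-monotonic shape of this power model and is one reason a quadratic displacement cost can be a poor proxy.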

3. MDP Modeling and Problem Transformation

3.1. MDP Modeling

The considered joint optimization problem is high-dimensional, non-convex, strongly coupled, and long-term dynamic. Since vehicle positions, traffic arrivals, eavesdropping threats, and channel states all evolve over time, the control action taken at the current time slot affects not only the instantaneous secrecy rate and delay performance, but also the future system performance through UAV movement and queue evolution. Therefore, the original problem is modeled as a Markov decision process (MDP) with continuous state and action spaces.
At time slot $n$, the agent observes the system state $s[n]$, executes an action $a[n]$, and receives an immediate reward $r[n]$. The objective is to learn a policy $\pi$ that maximizes the long-term expected discounted reward:
$$\max_{\pi} \; \mathbb{E}_{\pi} \left[ \sum_{n=1}^{N} \zeta^{\,n-1} r[n] \right],$$
where $\zeta \in (0, 1)$ is the discount factor.
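As a small numerical illustration of the discounted objective for a single trajectory (the reward values and the discount factor below are made up for the example):

```python
def discounted_return(rewards, zeta: float) -> float:
    """Sum of zeta^(n-1) * r[n] for n = 1..N; the list index i plays the role of n-1."""
    return sum(zeta**i * r for i, r in enumerate(rewards))


# Three slots with unit reward and an assumed discount factor of 0.5.
ret = discounted_return([1.0, 1.0, 1.0], zeta=0.5)
```

The expectation in the objective would then be approximated by averaging such returns over many sampled trajectories.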

3.2. State Space Design

To enable the agent to fully perceive the current communication environment, queue states, and UAV operating conditions, the system state is defined as
$$s[n] = \{ q[n],\, U[n],\, Z[n],\, Q[n],\, h[n],\, \beta[n-1],\, P[n-1] \},$$
where:
  • $q[n]$ denotes the current UAV position;
  • $U[n] = \{u_k[n]\}_{k=1}^{K}$ denotes the set of legitimate user positions;
  • $Z[n] = \{z_e[n]\}_{e=1}^{E}$ denotes the set of eavesdropper positions;
  • $Q[n] = \{Q_k[n]\}_{k=1}^{K}$ denotes the traffic queue states;
  • $h[n]$ represents the channel-state features composed of the BS–STAR-RIS, STAR-RIS–user, and STAR-RIS–eavesdropper links;
  • $\beta[n-1]$ and $P[n-1]$ denote the STAR-RIS partition ratio and power allocation of the previous slot, respectively.
The above state representation jointly reflects spatial geometry, channel conditions, eavesdropping threats, and queue states. In addition, the previous-slot STAR-RIS partition ratio and power allocation are included to provide temporal context for consecutive decisions.

3.3. Action Space Design

At each time slot, the agent needs to jointly control the UAV trajectory, STAR-RIS transmission–reflection partition ratio, phase-shift configuration, and power allocation. Therefore, the action is defined as
$$a[n] = \{ \Delta q[n],\, \beta_r[n],\, \theta_r[n],\, \theta_t[n],\, p[n] \},$$
where:
  • $\Delta q[n]$ denotes the UAV displacement control at slot $n$;
  • $\beta_r[n]$ denotes the STAR-RIS reflection ratio, while the transmission ratio is obtained as $\beta_t[n] = 1 - \beta_r[n]$;
  • $\theta_r[n] = [\theta_{r,1}[n], \ldots, \theta_{r,M}[n]]$ is the reflection phase-shift vector;
  • $\theta_t[n] = [\theta_{t,1}[n], \ldots, \theta_{t,M}[n]]$ is the transmission phase-shift vector;
  • $p[n] = [p_1[n], \ldots, p_K[n]]$ is the power allocation vector.

3.4. Reward Function Design

To jointly improve secure communication, low-latency performance, and service reliability while controlling the UAV mobility cost, the immediate reward at slot n is designed as
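$$r[n] = \alpha R_{\text{sec}}[n] - \beta D[n] + \gamma C[n] - \lambda E_{\text{UAV}}[n] - \xi \Phi[n],$$
where
$$R_{\text{sec}}[n] = \frac{1}{K} \sum_{k=1}^{K} R_k^{\text{sec}}[n]$$
is the average secrecy rate,
$$C[n] = \frac{1}{K} \sum_{k=1}^{K} \mathbb{I}\left( R_k[n] \ge R_k^{\min} \right)$$
is the successful service ratio, $E_{\text{UAV}}[n]$ denotes the UAV propulsion energy consumption in slot $n$, and $\Phi[n]$ denotes the degree of constraint violation. The weighting factors $\alpha$, $\beta$, $\gamma$, $\lambda$, and $\xi$ are nonnegative constants. In this reward, $\beta$ and $\lambda$ are weights and should not be confused with the STAR-RIS partition ratio and the packet arrival rate, which use the same letters elsewhere.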
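The reward can be sketched as a plain function of the per-slot quantities. The weight values and inputs below are hypothetical, and `slot_reward` is an illustrative name rather than the authors' implementation.

```python
def slot_reward(sec_rates, rates, min_rates, avg_delay, uav_energy, violation, w):
    """r[n] = alpha*R_sec - beta*D + gamma*C - lambda*E_UAV - xi*Phi."""
    k = len(sec_rates)
    r_sec = sum(sec_rates) / k                                   # average secrecy rate
    c = sum(1 for r, r_min in zip(rates, min_rates) if r >= r_min) / k  # service ratio
    return (w["alpha"] * r_sec - w["beta"] * avg_delay + w["gamma"] * c
            - w["lam"] * uav_energy - w["xi"] * violation)


# Hypothetical unit weights and inputs for two users.
weights = {"alpha": 1.0, "beta": 1.0, "gamma": 1.0, "lam": 1.0, "xi": 1.0}
reward = slot_reward(sec_rates=[1.0, 3.0], rates=[2.0, 0.5], min_rates=[1.0, 1.0],
                     avg_delay=1.0, uav_energy=0.25, violation=0.25, w=weights)
```

In this example the average secrecy rate is 2, only one of the two users meets its rate requirement, and the delay, energy, and violation terms are subtracted. In practice the terms have very different scales (for example, energy in joules versus rate in Mbps), so the weights would need to absorb those units.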

4. HC-SAC-Based Joint Optimization Algorithm

4.1. HC-SAC Framework Overview

As illustrated in Figure 3, the proposed HC-SAC framework follows a closed-loop interaction process between the learning agent and the UAV-mounted STAR-RIS-assisted vehicular environment. At each time slot, the current network state is constructed from the UAV position, vehicle mobility information, queue states, channel-related features, eavesdropper information, and historical control actions. Based on this state, the hierarchical policy network generates continuous control actions for UAV movement, STAR-RIS transmission–reflection configuration, phase-shift design, and transmit power allocation.
Before being executed in the environment, the raw actions are transformed by a feasibility mapping module to satisfy the UAV mobility constraint, STAR-RIS coefficient constraint, phase-shift constraint, and power budget constraint. The environment then updates the vehicle positions, wireless channels, queue lengths, UAV energy consumption, and security–latency performance. The resulting reward and constraint violation signals are stored in the replay buffer and used to update the actor networks, critic networks, entropy coefficient, and constraint multipliers. The detailed definitions of the hierarchical policy structure, feasibility mapping, reward shaping, and network updates are provided in the following subsections.

4.2. Hierarchical Constrained SAC Update

Different from standard SAC, the proposed HC-SAC introduces a hierarchical control structure and a constraint-aware reward shaping mechanism. The high-level actor is responsible for coarse-grained control variables, including the UAV displacement and the STAR-RIS transmission–reflection partition ratio, while the low-level actor is responsible for fine-grained physical-layer decisions, including the STAR-RIS phase-shift vectors and transmit power allocation. This decomposition reduces the effective action coupling and improves learning stability in dynamic vehicular environments.
Let $a_h[n]$ and $a_l[n]$ denote the high-level and low-level actions at time slot $n$, respectively. The overall action is given by
$$a[n] = \{ a_h[n],\, a_l[n] \}.$$
The joint hierarchical policy can be expressed as
$$\pi_{\phi}(a[n] \mid s[n]) = \pi_{\phi_h}(a_h[n] \mid s[n]) \, \pi_{\phi_l}(a_l[n] \mid s[n], a_h[n]),$$
where $\phi_h$ and $\phi_l$ denote the parameters of the high-level and low-level actors, respectively, and $\phi = \{\phi_h, \phi_l\}$.
Two critic networks with parameters $\omega_1$ and $\omega_2$ are introduced to estimate the soft action-value functions $Q_{\omega_1}(s, a)$ and $Q_{\omega_2}(s, a)$, respectively. In addition, two target critic networks with parameters $\bar{\omega}_1$ and $\bar{\omega}_2$ are maintained to improve the stability of temporal-difference learning.
The entropy-regularized objective of the hierarchical policy is given by
$$J(\pi) = \sum_{n=1}^{N} \mathbb{E}_{(s[n], a[n]) \sim \rho_{\pi}} \left[ r[n] + \tau \mathcal{H}\left( \pi_{\phi}(\cdot \mid s[n]) \right) \right],$$
where $\tau$ is the entropy temperature parameter and $\mathcal{H}(\pi_{\phi}(\cdot \mid s[n]))$ denotes the policy entropy.
To explicitly account for delay, secrecy outage, and UAV energy-related constraints, a Lagrangian-style reward shaping mechanism is incorporated into the learning process. Let $\psi_d$, $\psi_s$, and $\psi_e$ denote the adaptive multipliers associated with delay violation, secrecy outage violation, and UAV energy violation, respectively. The delay and secrecy outage violation levels are defined as
$$\Delta_d[n] = \left[ D[n] - D_{\max} \right]^{+},$$
$$\Delta_s[n] = \left[ P_{\text{out}}[n] - P_{\text{out}}^{\max} \right]^{+}.$$
To handle the total UAV energy budget in an online learning process, the accumulated UAV propulsion energy up to slot $n$ is defined as
$$\bar{E}_{\text{UAV}}[n] = \sum_{i=1}^{n} E_{\text{UAV}}[i].$$
The corresponding energy violation level is defined as
$$\Delta_e[n] = \left[ \bar{E}_{\text{UAV}}[n] - \frac{n}{N} E_{\max} \right]^{+},$$
where $[x]^{+} = \max\{x, 0\}$, $D_{\max}$ is the maximum tolerable delay threshold, $P_{\text{out}}^{\max}$ is the maximum tolerable secrecy outage probability, and $E_{\max}$ is the total UAV energy budget.
The constraint-aware shaped reward is then expressed as
$$\hat{r}[n] = r[n] - \psi_d \Delta_d[n] - \psi_s \Delta_s[n] - \psi_e \Delta_e[n].$$
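The violation levels and the shaped reward can be sketched as below. All thresholds, multipliers, and measurements are hypothetical values chosen for illustration; the energy term uses the pro-rata budget $(n/N) E_{\max}$ from the definition above.

```python
def pos(x: float) -> float:
    """[x]^+ = max(x, 0)."""
    return max(x, 0.0)


def violation_levels(avg_delay, d_max, p_out, p_out_max, energy_so_far, n, n_slots, e_max):
    """Return (Delta_d, Delta_s, Delta_e) for slot n."""
    delta_d = pos(avg_delay - d_max)
    delta_s = pos(p_out - p_out_max)
    delta_e = pos(energy_so_far - (n * e_max) / n_slots)  # budget share up to slot n
    return delta_d, delta_s, delta_e


def shaped_reward(reward, deltas, psis):
    """r_hat = r - psi_d*Delta_d - psi_s*Delta_s - psi_e*Delta_e."""
    return reward - sum(psi * delta for psi, delta in zip(psis, deltas))


# Hypothetical slot: delay 2 slots over the limit, no outage violation,
# and 100 J above the pro-rata energy budget at slot 10 of 30.
deltas = violation_levels(avg_delay=12.0, d_max=10.0, p_out=0.5, p_out_max=0.75,
                          energy_so_far=400.0, n=10, n_slots=30, e_max=900.0)
r_hat = shaped_reward(1.0, deltas, psis=(0.5, 1.0, 0.01))
```

Only constraints that are currently violated reduce the reward. A satisfied constraint contributes zero regardless of its multiplier.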
During each training iteration, the target value for the critic networks is constructed as
$$y[n] = \hat{r}[n] + \zeta \left( \min_{j=1,2} Q_{\bar{\omega}_j}(s[n+1], a[n+1]) - \tau \log \pi_{\phi}(a[n+1] \mid s[n+1]) \right),$$
where $\zeta$ is the discount factor and $a[n+1] \sim \pi_{\phi}(\cdot \mid s[n+1])$. The critic networks are updated by minimizing the soft Bellman residual:
$$L(\omega_j) = \mathbb{E}_{(s, a, \hat{r}, s') \sim \mathcal{D}} \left[ \left( Q_{\omega_j}(s[n], a[n]) - y[n] \right)^2 \right], \quad j = 1, 2,$$
where $\mathcal{D}$ denotes the replay buffer.
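A scalar sketch of the target construction and the critic loss follows. Real implementations operate on mini-batch tensors inside a deep learning framework; the numbers here are invented to make the arithmetic transparent.

```python
def soft_td_target(r_shaped, zeta, q1_next, q2_next, logp_next, tau):
    """y = r_hat + zeta * (min(Q1', Q2') - tau * log pi(a'|s'))."""
    return r_shaped + zeta * (min(q1_next, q2_next) - tau * logp_next)


def critic_loss(q_pred, targets):
    """Mean squared soft Bellman residual over a mini-batch."""
    return sum((q - y) ** 2 for q, y in zip(q_pred, targets)) / len(targets)


# Hypothetical values: the smaller target critic estimate (4.0) is used,
# and the entropy bonus adds -tau * logp = 1.0.
y = soft_td_target(r_shaped=1.0, zeta=0.5, q1_next=4.0, q2_next=6.0,
                   logp_next=-2.0, tau=0.5)
loss = critic_loss(q_pred=[3.0, 4.0], targets=[y, y])
```

Taking the minimum of the two target critics is the usual way to limit overestimation of the action value. The target itself is treated as a constant when the loss is differentiated.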
The hierarchical policy is updated by minimizing
$$J(\phi) = \mathbb{E}_{s[n] \sim \mathcal{D},\, a[n] \sim \pi_{\phi}} \left[ \tau \log \pi_{\phi}(a[n] \mid s[n]) - \min_{j=1,2} Q_{\omega_j}(s[n], a[n]) \right].$$
To enhance exploration adaptivity, an automatic entropy tuning mechanism is adopted. The temperature parameter $\tau$ is updated by minimizing
$$J(\tau) = \mathbb{E}_{a[n] \sim \pi_{\phi}} \left[ -\tau \left( \log \pi_{\phi}(a[n] \mid s[n]) + \bar{\mathcal{H}} \right) \right],$$
where $\bar{\mathcal{H}}$ is the target entropy.
The constraint multipliers are updated according to the observed violation levels:
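$$\psi_d \leftarrow \left[ \psi_d + \eta_{\psi} \Delta_d[n] \right]^{+},$$
$$\psi_s \leftarrow \left[ \psi_s + \eta_{\psi} \Delta_s[n] \right]^{+},$$
$$\psi_e \leftarrow \left[ \psi_e + \eta_{\psi} \Delta_e[n] \right]^{+},$$
where $\eta_{\psi}$ is the multiplier learning rate.
Finally, the target critic networks are softly updated as
$$\bar{\omega}_j \leftarrow \rho\, \omega_j + (1 - \rho)\, \bar{\omega}_j, \quad j = 1, 2,$$
where $\rho \in (0, 1)$ is the soft update coefficient.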
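The multiplier ascent step and the soft target update are both one-line rules, sketched here on plain Python floats and lists. The learning rate, update coefficient, and parameter values are assumptions for illustration.

```python
def update_multiplier(psi: float, violation: float, lr: float) -> float:
    """psi <- [psi + eta_psi * Delta]^+ (projected ascent on the multiplier)."""
    return max(psi + lr * violation, 0.0)


def soft_update(target_params, online_params, rho: float):
    """omega_bar <- rho * omega + (1 - rho) * omega_bar, element by element."""
    return [rho * w + (1.0 - rho) * w_bar
            for w, w_bar in zip(online_params, target_params)]


# Hypothetical numbers: a delay violation of 2 slots raises psi_d from 0.5 to 1.0.
psi_d = update_multiplier(psi=0.5, violation=2.0, lr=0.25)
target = soft_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0], rho=0.25)
```

Because the violation levels are already clipped at zero, the multiplier rule as written can only increase or hold a multiplier. A small $\rho$ makes the target critics track the online critics slowly, which is what stabilizes the temporal-difference target.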

4.3. Feasibility Mapping of Continuous Actions

Since the raw actions generated by the actor networks may violate physical constraints, a feasibility mapping operation is performed before interacting with the environment. Let the hierarchical actor output the raw action
$$\tilde{a}[n] = \{ \Delta \tilde{q}[n],\, \tilde{\beta}_r[n],\, \tilde{\theta}_r[n],\, \tilde{\theta}_t[n],\, \tilde{p}[n] \}.$$
For the UAV displacement, the mapped motion vector is given by
$$\Delta q[n] = \min\left\{ 1,\, \frac{V_{\max} \delta_t}{\| \Delta \tilde{q}[n] \| + \epsilon} \right\} \Delta \tilde{q}[n],$$
where $V_{\max}$ is the maximum UAV speed, $\delta_t$ is the slot duration, and $\epsilon > 0$ is a small constant used to avoid division by zero. The UAV position is then updated as
$$q[n+1] = \Pi_{\mathcal{Q}}\left( q[n] + \Delta q[n] \right),$$
where $\Pi_{\mathcal{Q}}(\cdot)$ denotes the projection onto the feasible flight region $\mathcal{Q}$.
For the STAR-RIS transmission–reflection partition ratio, the raw output is mapped by a sigmoid function:
$$\beta_r[n] = \frac{1}{1 + \exp(-\tilde{\beta}_r[n])},$$
and the transmission ratio is given by
$$\beta_t[n] = 1 - \beta_r[n].$$
Thus, the constraints $0 \le \beta_r[n] \le 1$, $0 \le \beta_t[n] \le 1$, and $\beta_r[n] + \beta_t[n] = 1$ are satisfied.
For the STAR-RIS phase-shift vectors, a bounded mapping is applied:
$$\theta_{r,m}[n] = \pi \left( \tanh(\tilde{\theta}_{r,m}[n]) + 1 \right), \quad \forall m,$$
$$\theta_{t,m}[n] = \pi \left( \tanh(\tilde{\theta}_{t,m}[n]) + 1 \right), \quad \forall m.$$
Therefore, $\theta_{r,m}[n] \in [0, 2\pi)$ and $\theta_{t,m}[n] \in [0, 2\pi)$ are guaranteed.
For the transmit power allocation, a softmax-based normalization is adopted:
$$\bar{p}_k[n] = \frac{\exp(\tilde{p}_k[n])}{\sum_{i=1}^{K} \exp(\tilde{p}_i[n])}, \quad \forall k.$$
The feasible transmit power is obtained as
$$p_k[n] = P_{\max}\, \bar{p}_k[n], \quad \forall k.$$
This mapping guarantees $p_k[n] \ge 0$ and $\sum_{k=1}^{K} p_k[n] = P_{\max}$. In this paper, the softmax mapping is used to allocate the available transmit power budget among vehicular users for secrecy-rate-oriented transmission.
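The four mappings in this subsection can be sketched together as follows. The function names are illustrative, the square flight region used in `project_box` is an assumption (the text only states that a projection onto the feasible region is applied), and the raw actor outputs are made-up numbers.

```python
import math


def map_displacement(raw, v_max, slot_s, eps=1e-8):
    """Scale the raw 2-D displacement so its norm stays within V_max * delta_t."""
    norm = math.hypot(raw[0], raw[1])
    scale = min(1.0, v_max * slot_s / (norm + eps))
    return (scale * raw[0], scale * raw[1])


def project_box(pos, lo=0.0, hi=500.0):
    """Projection onto an ASSUMED square flight region [lo, hi]^2 (hypothetical)."""
    return tuple(min(max(c, lo), hi) for c in pos)


def map_partition(raw):
    """Sigmoid mapping to (beta_r, beta_t) with beta_r + beta_t = 1."""
    if raw >= 0:
        beta_r = 1.0 / (1.0 + math.exp(-raw))
    else:  # equivalent form that avoids overflow for very negative inputs
        e = math.exp(raw)
        beta_r = e / (1.0 + e)
    return beta_r, 1.0 - beta_r


def map_phases(raw):
    """theta = pi * (tanh(raw) + 1), one value per STAR-RIS element."""
    return [math.pi * (math.tanh(x) + 1.0) for x in raw]


def map_power(raw, p_max):
    """Softmax shares scaled by P_max; subtracting max(raw) leaves the shares unchanged."""
    m = max(raw)
    exps = [math.exp(x - m) for x in raw]
    total = sum(exps)
    return [p_max * e / total for e in exps]


step = map_displacement((30.0, 40.0), v_max=20.0, slot_s=1.0)  # raw norm 50 m, limit 20 m
beta_r, beta_t = map_partition(0.0)
phases = map_phases([-1.0, 0.0, 1.0])
powers = map_power([0.0, 1.0, 2.0], p_max=1.0)
```

Since every mapping is smooth almost everywhere, gradients can flow from the critic through the mapped action back to the actor. One numerical caveat: in floating point, `tanh` saturates to exactly 1 for large inputs, so the mapped phase can equal $2\pi$, which is physically equivalent to a zero phase shift.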

4.4. Algorithm Procedure

The overall procedure of the proposed HC-SAC-based joint optimization algorithm is summarized in Algorithm 1.
Algorithm 1 Proposed HC-SAC-based joint optimization algorithm
  1: Initialize the vehicular environment, replay buffer $\mathcal{D}$, high-level actor, low-level actor, critic networks, target critic networks, entropy coefficient, and constraint multipliers.
  2: for each training episode do
  3:     Reset the environment and obtain the initial state $s[1]$.
  4:     for each time slot $n = 1, \ldots, N$ do
  5:         Observe the current state $s[n]$.
  6:         Generate the raw action $\tilde{a}[n]$ from the hierarchical policy networks.
  7:         Apply feasibility mapping to obtain the valid action $a[n]$.
  8:         Execute $a[n]$ in the environment.
  9:         Update the UAV position, vehicle positions, channel states, queue states, and UAV energy consumption.
 10:         Calculate the immediate reward $r[n]$ and constraint violation levels $\Delta_d[n]$, $\Delta_s[n]$, and $\Delta_e[n]$.
 11:         Compute the shaped reward $\hat{r}[n]$.
 12:         Observe the next state $s[n+1]$.
 13:         Store $(s[n], a[n], \hat{r}[n], s[n+1])$ into the replay buffer $\mathcal{D}$.
 14:         if the replay buffer size is larger than the mini-batch size then
 15:             Sample a mini-batch from $\mathcal{D}$.
 16:             Update the critic networks by minimizing the soft Bellman residual.
 17:             Update the high-level and low-level actors using the entropy-regularized policy objective.
 18:             Update the entropy coefficient.
 19:             Update the constraint multipliers.
 20:             Softly update the target critic networks.
 21:         end if
 22:     end for
 23: end for
 24: Output the trained hierarchical policy.

4.5. Complexity and Convergence Discussion

Let $N_{\phi_h}$, $N_{\phi_l}$, and $N_{\omega}$ denote the number of trainable parameters in the high-level actor, low-level actor, and each critic network, respectively. For each mini-batch update with batch size $B_s$, the dominant computational complexity of the proposed HC-SAC algorithm mainly comes from forward and backward propagation through the hierarchical actor networks and the two critic networks, which can be approximated as
$$\mathcal{O}\left( B_s \left( N_{\phi_h} + N_{\phi_l} + 2 N_{\omega} \right) \right).$$
If the training process contains $E_{\text{ep}}$ episodes and each episode consists of $N$ time slots, the overall offline training complexity is given by
$$\mathcal{O}\left( E_{\text{ep}}\, N\, B_s \left( N_{\phi_h} + N_{\phi_l} + 2 N_{\omega} \right) \right).$$
Although HC-SAC introduces hierarchical action generation and constraint-aware multiplier updates, these operations bring only a small additional computational cost compared with standard actor–critic training. After training, the online execution only requires one forward propagation of the high-level and low-level actor networks, whose complexity is
$$\mathcal{O}\left( N_{\phi_h} + N_{\phi_l} \right).$$
Therefore, the trained policy can directly output the UAV displacement, STAR-RIS transmission–reflection partition ratio, phase-shift vectors, and transmit power allocation without repeatedly solving the original non-convex optimization problem, making it suitable for online decision-making in dynamic vehicular environments.
To improve training stability, the proposed HC-SAC uses double critics, target critic networks, experience replay, entropy-regularized policy improvement, and adaptive constraint multipliers. Experience replay reuses historical transitions, the entropy term encourages exploration, and constraint-aware reward shaping penalizes delay violation, secrecy outage, and excessive UAV energy consumption. Due to neural network approximation and the non-convex problem structure, global optimality cannot be guaranteed. The convergence curves in Section 5 show that HC-SAC reaches a stable evaluation score after training in the considered setting.

5. Simulation Results and Discussion

5.1. Simulation Settings

We consider a typical urban road intersection scenario with an area size of 500 m × 500 m. The BS is deployed at a fixed roadside location, while the UAV provides aerial assistance above the road. The total service period is divided into $N = 30$ time slots, and the duration of each slot is $\delta_t = 1$ s. Legitimate vehicular users and eavesdropping vehicles move along predefined lanes. The vehicular users follow a lane-based random mobility model, while the eavesdroppers follow a random lane-following mobility model. The traffic arrival process is modeled as a Poisson process to characterize the stochastic nature of vehicular service arrivals. During training, all DRL algorithms are implemented with the same state space, action space, reward definition, replay setting, and neural network scale unless otherwise specified.

5.2. Simulation Setup and Evaluation Metrics

In this section, numerical simulations are conducted to evaluate the performance of the proposed HC-SAC-based joint optimization scheme. Unless otherwise specified, the main simulation parameters are listed in Table 2. An urban vehicular communication scenario is considered, where vehicular users and potential eavesdroppers are distributed along the road. The UAV-mounted STAR-RIS dynamically adjusts its trajectory and transmission–reflection configuration according to vehicle mobility, blockage conditions, queue states, and channel-related information.
The propulsion energy parameters of the rotary-wing UAV are given in Table 3. These parameters are used to evaluate the practical UAV propulsion energy consumption in the proposed framework.
To ensure reproducibility, the main training hyperparameters of the proposed HC-SAC algorithm and the DRL-based benchmark schemes are summarized in Table 4. Unless otherwise specified, all DRL-based schemes are trained under the same vehicular mobility traces, channel realizations, eavesdropper distributions, state information, action constraints, reward components, and the number of environment interactions.
For fair comparison, PPO and DDPG adopt the same state representation, action mapping rules, reward components, neural network scale, and simulation scenarios as the proposed HC-SAC. The same vehicle trajectories, blockage distributions, channel realizations, eavesdropper locations, and traffic arrival processes are used for all DRL-based schemes. Therefore, the performance difference mainly comes from the learning architecture and policy update mechanism rather than from different simulation conditions.
To clarify the calculation of the simulation results, the evaluation metrics used in this paper are defined as follows. The average secrecy rate is calculated by
$$\bar{R}_{\text{sec}} = \frac{1}{N K} \sum_{n=1}^{N} \sum_{k=1}^{K} R_k^{\text{sec}}[n],$$
where $R_k^{\text{sec}}[n]$ denotes the instantaneous secrecy rate of vehicular user $k$ at time slot $n$.
The secrecy outage probability (SOP) is defined as the probability that the instantaneous secrecy rate falls below a predefined secrecy threshold $R_{\text{th}}$. In the simulations, it is estimated by Monte Carlo counting as
$$P_{\text{out}} = \frac{\sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{I}\left( R_k^{\text{sec}}[n] < R_{\text{th}} \right)}{N K},$$
where $\mathbb{I}(\cdot)$ is the indicator function.
The average delay is calculated based on the queue-aware delay model as
$$\bar{D} = \frac{1}{N K} \sum_{n=1}^{N} \sum_{k=1}^{K} D_k[n],$$
where $D_k[n]$ denotes the queueing delay of user $k$ at slot $n$.
The successful service ratio (SSR) is defined as the ratio of vehicular users that simultaneously satisfy the rate and delay requirements:
$$\text{SSR} = \frac{1}{N K} \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{I}\left( R_k[n] \ge R_k^{\min},\; D_k[n] \le D_{\max} \right),$$
where $R_k^{\min}$ is the minimum required communication rate.
As an auxiliary per-slot indicator, the instantaneous normalized utility can be written as
$$U_{\text{ins}} = w_r \tilde{R}_{\text{sec}} + w_c \widetilde{\text{SSR}} - w_d \tilde{D} - w_o \tilde{P}_{\text{out}},$$
where $\tilde{R}_{\text{sec}}$, $\widetilde{\text{SSR}}$, $\tilde{D}$, and $\tilde{P}_{\text{out}}$ denote the normalized secrecy rate, successful service ratio, average delay, and secrecy outage probability, respectively. The coefficients $w_r$, $w_c$, $w_d$, and $w_o$ are nonnegative weights used to balance different performance metrics.
Each reported point is obtained by averaging over 24 independent evaluation episodes under the corresponding parameter setting. Since the main purpose is to compare the average performance trends, only the mean values are plotted in the figures.
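For a single evaluation episode, the four metrics above could be computed from logged per-slot, per-user values as in the following sketch. The nested-list layout (slots by users), the function name, and all numbers are assumptions made for the example.

```python
def evaluation_metrics(sec_rates, rates, delays, r_th, r_min, d_max):
    """Average secrecy rate, SOP, average delay, and SSR over N slots and K users.

    sec_rates, rates, delays: lists of N rows, each row holding K per-user values.
    r_min: list of K per-user minimum rates.
    """
    n_slots, n_users = len(sec_rates), len(sec_rates[0])
    total = n_slots * n_users
    avg_sec = sum(sum(row) for row in sec_rates) / total
    sop = sum(1 for row in sec_rates for r in row if r < r_th) / total
    avg_delay = sum(sum(row) for row in delays) / total
    ssr = sum(1 for n in range(n_slots) for k in range(n_users)
              if rates[n][k] >= r_min[k] and delays[n][k] <= d_max) / total
    return {"avg_sec_rate": avg_sec, "sop": sop, "avg_delay": avg_delay, "ssr": ssr}


# Hypothetical log with N = 2 slots and K = 2 users.
metrics = evaluation_metrics(
    sec_rates=[[1.0, 0.0], [2.0, 1.0]],
    rates=[[2.0, 0.5], [2.0, 2.0]],
    delays=[[1.0, 1.0], [3.0, 1.0]],
    r_th=0.5, r_min=[1.0, 1.0], d_max=2.0,
)
```

In this toy log, one of the four user-slot samples falls below the secrecy threshold and two of the four satisfy both the rate and delay requirements. The reported figures would then be the mean of such per-episode values over the independent evaluation episodes.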

5.3. Benchmark Schemes

To comprehensively evaluate the effectiveness of the proposed method, several representative DRL-based and structural benchmark schemes are considered. The detailed definitions of these schemes are summarized in Table 5.
For fairness, all benchmark schemes are evaluated under the same vehicular mobility traces, channel realizations, eavesdropper distributions, blockage conditions, packet arrival processes, and initial UAV locations. Unless a specific variable is intentionally fixed for benchmark evaluation, the same state information, action constraints, reward components, and feasibility mapping rules are adopted.

5.4. Training Behavior Comparison

Figure 4 illustrates the convergence behavior of different DRL-based schemes in terms of evaluation score. It can be observed that the proposed HC-SAC achieves the highest evaluation score after convergence and exhibits a smooth upward trend throughout training. PPO also converges stably, but its final evaluation score remains below that of HC-SAC, whereas DDPG converges to the lowest level among the three DRL-based schemes. These results indicate that the hierarchical and constraint-aware design improves policy learning effectiveness in the considered dynamic vehicular environment. The convergence trend is also consistent with the final composite-utility ranking reported later.

5.5. Delay Performance Analysis

Figure 5 shows the average delay under different vehicle speeds. As the vehicle speed increases, the average delay of all schemes generally rises due to faster topology variation and more severe channel fluctuation. Nevertheless, the proposed HC-SAC consistently achieves the lowest delay over the whole speed range, varying from 10.02 to 10.39 slots. By comparison, PPO varies from 10.96 to 11.22 slots, and DDPG varies from 11.18 to 11.62 slots. Averaged over the tested speed range, HC-SAC reduces delay by about 7.9% relative to PPO and 10.4% relative to DDPG. This result demonstrates that the hierarchical control structure and constraint-aware learning mechanism can better coordinate UAV mobility, STAR-RIS reconfiguration, and resource scheduling, thereby improving queue stability and delay control in highly dynamic vehicular environments.
Across all benchmark schemes, HC-SAC achieves the lowest average delay, and its delay advantage over the structural baselines is larger than its advantage over PPO and DDPG.

5.6. Secure Communication Performance Analysis

Figure 6 depicts the average secrecy rate under different BS transmit power budgets. As the transmit power increases, the secrecy rate of all schemes improves because stronger transmit power enhances the legitimate communication links. Among all DRL-based schemes, PPO achieves the highest secrecy rate, while the proposed HC-SAC provides competitive secrecy-rate performance and outperforms DDPG in the high-power regime. For example, at 40 dBm, PPO, HC-SAC, and DDPG achieve secrecy rates of 0.8832, 0.7791, and 0.7043 Mbps, respectively. In addition, all three DRL-based schemes outperform the structural baselines, suggesting the benefit of combining adaptive learning with UAV mobility and STAR-RIS-assisted propagation control.
Although HC-SAC does not achieve the highest secrecy rate, it still maintains a competitive secure transmission capability while simultaneously emphasizing delay and secrecy-reliability constraints.
Figure 7 illustrates the secrecy outage probability versus the number of eavesdroppers. As expected, the secrecy outage probability increases with the number of eavesdroppers for all schemes. However, the proposed HC-SAC consistently yields the lowest secrecy outage probability among all methods. Averaged over the tested eavesdropper settings, the SOP of HC-SAC is 0.7241, compared with 0.8613 for PPO and 0.8801 for DDPG, corresponding to reductions of about 15.9% and 17.7%, respectively. This result indicates that the proposed hierarchical constrained mechanism improves secrecy reliability under unfavorable wiretap conditions.

5.7. Service Reliability Analysis

Figure 8 presents the successful service ratio under different transmit power budgets, where the successful service ratio is defined as the proportion of users whose rate and delay requirements are simultaneously satisfied. It can be seen that the successful service ratio of all schemes increases with the transmit power budget. PPO achieves the highest successful service ratio, while the proposed HC-SAC remains close to PPO and becomes slightly better than DDPG in the medium- and high-power regimes. At 40 dBm, the SSR values of PPO, HC-SAC, and DDPG are 0.1767, 0.1693, and 0.1534, respectively.
These results show that HC-SAC preserves strong service capability while achieving better delay and secrecy outage performance. Therefore, compared with PPO, the proposed method provides a more balanced service-oriented control behavior rather than purely pursuing rate-oriented gains.

5.8. Composite Utility Analysis

To further evaluate the overall security–latency trade-off, a normalized composite utility is introduced based on the area-under-the-curve (AUC) values for the secrecy rate, successful service ratio, delay, and secrecy outage probability. Specifically, the secrecy rate and successful service ratio are positively normalized, whereas the delay and secrecy outage probability are inversely normalized. The resulting composite utility is defined as
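$$U_{\text{comp}} = w_1 \widetilde{\text{AUC}}_{R_{\text{sec}}} + w_2 \widetilde{\text{AUC}}_{\text{SSR}} + w_3 \left( 1 - \widetilde{\text{AUC}}_{D} \right) + w_4 \left( 1 - \widetilde{\text{AUC}}_{\text{SOP}} \right),$$
where $\widetilde{\text{AUC}}_{R_{\text{sec}}}$, $\widetilde{\text{AUC}}_{\text{SSR}}$, $\widetilde{\text{AUC}}_{D}$, and $\widetilde{\text{AUC}}_{\text{SOP}}$ denote the normalized AUC values of secrecy rate, successful service ratio, delay, and secrecy outage probability, respectively. According to the objective of secure and low-latency vehicular communications, larger weights are assigned to delay and secrecy outage performance.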
Figure 9 compares the normalized composite utility of all schemes. It can be observed that the proposed HC-SAC achieves the highest composite utility of 0.9254, while PPO, DDPG, the fixed STAR-RIS partition scheme, the random phase-shift scheme, the fixed UAV trajectory scheme, and the no STAR-RIS scheme achieve 0.6630, 0.5060, 0.3241, 0.2480, 0.2403, and 0, respectively. The zero value of the no STAR-RIS scheme is caused by the min–max normalization, where the worst-performing scheme is mapped to zero. Relative to the learning-based benchmarks, HC-SAC improves the composite utility by about 39.6% over PPO and 82.9% over DDPG. Although PPO performs better in terms of average secrecy rate and successful service ratio, HC-SAC provides more significant advantages in delay and secrecy outage probability. As a result, the proposed method achieves the most balanced overall performance when security and latency are jointly considered. This ranking is mainly driven by the delay and SOP gains rather than rate maximization alone.
The above results also explain why the proposed HC-SAC achieves the highest normalized composite utility. Although PPO achieves a higher secrecy rate and successful service ratio in some cases, this does not necessarily indicate an overall advantage in the considered secure and low-latency vehicular communication scenario. PPO tends to learn a more rate-oriented policy, whereas the proposed HC-SAC adopts entropy-regularized off-policy learning and constraint-aware reward shaping to reduce queue accumulation and suppress secrecy outage events. Therefore, HC-SAC does not always maximize a single metric, but achieves a better overall security–latency trade-off, as reflected by its lower average delay, lower secrecy outage probability, and higher normalized composite utility.
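The composite utility construction can be illustrated with a short sketch, assuming per-metric min–max normalization across schemes and trapezoidal AUC values. The weights and the three-scheme AUC numbers are hypothetical; only the composite utilities passed to `relative_gain` (0.9254, 0.6630, and 0.5060) come from the reported results.

```python
def trapezoid_auc(x, y):
    """Area under a sampled curve using the trapezoidal rule."""
    return sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0 for i in range(len(x) - 1))


def min_max(values):
    """Min-max normalization across schemes; the worst value maps to 0, the best to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_utility(auc_rate, auc_ssr, auc_delay, auc_sop, weights):
    """U_comp per scheme; delay and SOP enter inversely as (1 - normalized AUC)."""
    w1, w2, w3, w4 = weights
    nr, ns, nd, no = (min_max(v) for v in (auc_rate, auc_ssr, auc_delay, auc_sop))
    return [w1 * r + w2 * s + w3 * (1.0 - d) + w4 * (1.0 - o)
            for r, s, d, o in zip(nr, ns, nd, no)]


def relative_gain(new, base):
    """Relative improvement of `new` over `base`."""
    return (new - base) / base


# Hypothetical AUC values for three schemes and ASSUMED weights (0.2, 0.2, 0.3, 0.3).
utilities = composite_utility([3.0, 2.0, 1.0], [0.3, 0.2, 0.1],
                              [10.0, 11.0, 12.0], [0.7, 0.8, 0.9],
                              weights=(0.2, 0.2, 0.3, 0.3))
gain_ppo = relative_gain(0.9254, 0.6630)    # reported HC-SAC vs PPO utilities
gain_ddpg = relative_gain(0.9254, 0.5060)   # reported HC-SAC vs DDPG utilities
```

In the toy example, a scheme that is worst on all four metrics receives a utility of exactly zero, which mirrors the zero reported for the no STAR-RIS scheme. The two relative gains evaluate to roughly 39.6% and 82.9%, matching the percentages quoted for PPO and DDPG.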

5.9. Ablation Study

To further verify the contribution of the key components in the proposed framework, an ablation study is conducted by comparing four SAC-based variants under the same simulation environment, random seeds, and evaluation settings. The considered variants include standard SAC, HC-SAC without hierarchical control, HC-SAC without constraint-aware learning, and the complete proposed HC-SAC. The results averaged over three random seeds are reported in Table 6, where the secrecy rate is measured in Mbps, and the delay is measured in slots. Because the composite utility is normalized within this ablation group, it is used only for comparing the four SAC variants.
As shown in Table 6, the standard SAC baseline obtains the lowest composite utility, indicating that directly applying a flat SAC policy is insufficient for the considered high-dimensional joint control problem. When the hierarchical control structure is removed, the secrecy rate and successful service ratio remain relatively high, but the delay and secrecy outage probability become worse than those of the complete HC-SAC, leading to a much lower composite utility. Notably, HC-SAC without hierarchical control obtains a higher secrecy rate than the complete HC-SAC, i.e., 0.516 Mbps versus 0.433 Mbps. This indicates that the non-hierarchical variant tends to learn a more rate-oriented policy, whereas the complete HC-SAC deliberately trades part of the secrecy-rate gain for significantly lower delay and SOP, which is consistent with the overall security–latency trade-off objective. When the constraint-aware learning mechanism is removed, the delay and SOP are improved compared with the standard SAC baseline, but the overall utility is still significantly lower than that of the complete HC-SAC. These observations show that the hierarchical control module and the constraint-aware learning mechanism contribute from different aspects: the former helps handle the coupled UAV, STAR-RIS, and power-control actions, while the latter guides the policy toward lower delay and lower secrecy outage. The complete HC-SAC achieves the highest composite utility of 0.851, suggesting that the two components jointly contribute to the final security–latency trade-off.

5.10. Robustness Analysis Under Practical Non-Idealities

To further examine the practical robustness of the proposed method, additional simulations are conducted under imperfect CSI and finite-resolution STAR-RIS phase control. In the imperfect CSI case, the CSI error standard deviation is varied from 0.02 to 0.10. In the finite-resolution case, the STAR-RIS phase resolution is set to 3, 2, and 1 bits. The results averaged over three random seeds are reported in Table 7.
As shown in Table 7, the proposed HC-SAC maintains stable performance under moderate CSI errors and low-resolution phase control. Compared with the ideal CSI and continuous phase case, CSI uncertainty mainly increases the average delay from 10.112 slots to about 10.282 slots, while the secrecy rate, SOP, and successful service ratio remain within a narrow range. The small fluctuations in secrecy rate and SSR under CSI errors are caused by stochastic channel perturbations and should be interpreted as robustness rather than a guaranteed performance gain from imperfect CSI. This limited sensitivity also suggests that the proposed framework, which combines geometry-aware channel features with adaptive resource control, does not rely excessively on highly accurate instantaneous CSI. Moreover, reducing the STAR-RIS phase resolution from 3 bits to 1 bit introduces only a limited performance change at the reference operating point. This indicates that the geometry-aware phase focusing and hierarchical control structure provide certain robustness against practical phase-control quantization. These results indicate stable behavior under the considered non-ideal CSI and finite-resolution STAR-RIS settings.

5.11. Reference Performance and Composite Utility Comparison

To provide a concise quantitative summary, Table 8 reports the reference operating-point performance and the AUC-based composite utility of all schemes. The first four metrics are evaluated at the reference operating point, while the normalized composite utility is obtained from the AUC-based evaluation over the corresponding parameter sweeps.
Although PPO achieves the highest average secrecy rate and successful service ratio, HC-SAC provides the lowest delay and the lowest secrecy outage probability. When the four metrics are jointly considered through the normalized composite utility, HC-SAC yields the best overall performance. In particular, compared with PPO, HC-SAC reduces the average delay from 11.7230 to 10.8471 slots and the SOP from 0.8501 to 0.7160 at the reference operating point, while its AUC-based composite utility is 0.9254 versus 0.6630 for PPO. These results indicate that the proposed hierarchical and constraint-aware learning strategy provides a more favorable security–latency–service trade-off than both learning-based benchmarks and structural baselines.
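To put these differences in relative terms, the short calculation below converts the Table 8 values into percentage reductions. It only restates the reported numbers; the helper name `relative_reduction` is introduced here for illustration.

```python
# Reference operating-point values copied from Table 8.
TABLE8 = {
    "HC-SAC": {"delay": 10.8471, "sop": 0.7160},
    "PPO": {"delay": 11.7230, "sop": 0.8501},
    "DDPG": {"delay": 11.9358, "sop": 0.8599},
}


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


for name in ("PPO", "DDPG"):
    delay_gain = relative_reduction(TABLE8[name]["delay"], TABLE8["HC-SAC"]["delay"])
    sop_gain = relative_reduction(TABLE8[name]["sop"], TABLE8["HC-SAC"]["sop"])
    print(f"vs {name}: delay -{delay_gain:.2f}%, SOP -{sop_gain:.2f}%")
```

Relative to PPO, this gives a delay reduction of roughly 7.5% and an SOP reduction of roughly 15.8%.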

6. Conclusions

This paper investigated secure and low-latency communications in UAV-mounted STAR-RIS-assisted urban vehicular networks under severe blockage, high vehicle mobility, eavesdropping threats, and delay-sensitive traffic services. A comprehensive system model was developed by jointly considering urban blockage, vehicle mobility, passive eavesdropping threats, queueing dynamics, UAV trajectory evolution, and flight energy constraints. Based on this model, a long-term average secure and low-latency utility maximization problem was formulated and transformed into an MDP with continuous state and action spaces. A hierarchical constrained SAC-based joint control algorithm was then proposed to optimize the UAV trajectory, STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation.
Simulation results showed that the proposed method achieves lower delay, lower secrecy outage probability, and higher composite utility than DDPG and structural benchmark schemes, while maintaining competitive values for the secrecy rate and successful service ratio. Compared with PPO, the proposed HC-SAC achieves lower delay and lower secrecy outage probability, while PPO retains a higher secrecy rate and successful service ratio. In the representative evaluation, HC-SAC achieves an average delay of 10.8471 slots, an SOP of 0.7160, and the highest normalized composite utility of 0.9254. The ablation study further indicates that both hierarchical control and constraint-aware learning benefit the final security–latency trade-off, since removing either component substantially reduces the composite utility. Additional robustness results show that the proposed method remains stable under moderate CSI errors and finite-resolution STAR-RIS phase control. These results show that HC-SAC does not simply maximize a single communication metric, but achieves a stronger overall security–latency trade-off by reducing delay and secrecy outage probability while maintaining a competitive secrecy rate and successful service ratio.

Author Contributions

Conceptualization, J.T., J.Y. and Y.P.; methodology, J.Y. and J.T.; software, J.Y. and H.Z.; validation, M.C. and Y.P.; formal analysis, J.T. and M.C.; resources, J.T. and Y.P.; data curation, J.Y. and M.C.; writing—original draft preparation, J.T. and H.Z.; writing—review and editing, H.Z. and Y.P.; supervision, H.Z., M.C. and Y.P.; project administration, H.Z. and M.C.; funding acquisition, J.T. and Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62461030), the Key Basic Research Project of Yunnan Province (202401AS070105), the General Scientific Research Project of Hunan Provincial Department of Education (25C1457), the Shaoyang Municipal Science and Technology Plan (Special Fund Subsidy) Project (2024PT4070, 2024GZ2026), and the Key Scientific Research Project of Shaoyang Industrial Vocational and Technical College (SKY24A04).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code for this paper is available at https://github.com/tangent123/star-ris (accessed on 1 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

| Abbreviation | Meaning |
|---|---|
| STAR-RIS | Simultaneously transmitting and reflecting reconfigurable intelligent surface |
| UAV | Unmanned aerial vehicle |
| SAC | Soft actor–critic |
| DRL | Deep reinforcement learning |

References

  1. Meng, K.; Masouros, C.; Petropulu, A.P.; Hanzo, L. Cooperative ISAC networks: Opportunities and challenges. IEEE Wirel. Commun. 2024, 32, 212–219.
  2. Liu, Q.; Luo, R.; Liang, H.; Liu, Q. Energy-efficient joint computation offloading and resource allocation strategy for ISAC-aided 6G V2X networks. IEEE Trans. Green Commun. Netw. 2023, 7, 413–423.
  3. Cheng, X.; Duan, D.; Gao, S.; Yang, L. Integrated sensing and communications (ISAC) for vehicular communication networks (VCN). IEEE Internet Things J. 2022, 9, 23441–23451.
  4. Yu, K.; Li, K.; Zhao, Y.; Feng, Z.; Li, D.; Zhang, Q.; Yu, J. Movable Antenna-Aided Secure V2X Communication: An Integrated Sensing and Communication Perspective. IEEE Wirel. Commun. 2025, 32, 118–124.
  5. Hasan, M.; Mohan, S.; Shimizu, T.; Lu, H. Securing vehicle-to-everything (V2X) communication platforms. IEEE Trans. Intell. Veh. 2020, 5, 693–713.
  6. Gyawali, S.; Xu, S.; Qian, Y.; Hu, R.Q. Challenges and solutions for cellular based V2X communications. IEEE Commun. Surv. Tutor. 2020, 23, 222–255.
  7. Hakeem, S.A.A.; Kim, H. Advancing intrusion detection in V2X networks: A comprehensive survey on machine learning, federated learning, and edge AI for V2X security. IEEE Trans. Intell. Transp. Syst. 2025, 26, 11137–11205.
  8. ElMossallamy, M.A.; Zhang, H.; Song, L.; Seddik, K.G.; Han, Z.; Li, G.Y. Reconfigurable intelligent surfaces for wireless communications: Principles, challenges, and opportunities. IEEE Trans. Cogn. Commun. Netw. 2020, 6, 990–1002.
  9. Mu, X.; Liu, Y.; Guo, L.; Lin, J.; Schober, R. Simultaneously transmitting and reflecting (STAR) RIS aided wireless communications. IEEE Trans. Wirel. Commun. 2021, 21, 3083–3098.
  10. Aung, P.S.; Nguyen, L.X.; Tun, Y.K.; Han, Z.; Hong, C.S. Deep reinforcement learning-based joint spectrum allocation and configuration design for STAR-RIS-assisted V2X communications. IEEE Internet Things J. 2023, 11, 11298–11311.
  11. Andreou, A.; Mavromoustakis, C.X.; Batalla, J.M.; Markakis, E.K.; Mastorakis, G. UAV-assisted RSUs for V2X connectivity using voronoi diagrams in 6G+ infrastructures. IEEE Trans. Intell. Transp. Syst. 2023, 24, 15855–15865.
  12. Peng, Y.; Tang, J.; Yang, Q.; Han, Z.; Ma, J. Joint power allocation algorithm for UAV-borne simultaneous transmitting and reflecting reconfigurable intelligent surface-assisted non-orthogonal multiple access system. IEEE Access 2023, 11, 140506–140518.
  13. Nakazato, J.; So, H.; Tran, G.K.; Suto, K. Multi-Agent Reinforcement Learning for Resilient UAV Ad Hoc Backhaul Networks. IEEE J. Miniaturization Air Space Syst. 2026, 7, 232–245.
  14. Li, S.; Duo, B.; Yuan, X.; Liang, Y.C.; Di Renzo, M. Reconfigurable Intelligent Surface Assisted UAV Communication: Joint Trajectory Design and Passive Beamforming. IEEE Wirel. Commun. Lett. 2020, 9, 716–720.
  15. Yang, L.; Meng, F.; Zhang, J.; Hasna, M.O.; Di Renzo, M. On the Performance of RIS-Assisted Dual-Hop UAV Communication Systems. IEEE Trans. Veh. Technol. 2020, 69, 10385–10390.
  16. Liu, X.; Liu, Y.; Chen, Y. Machine Learning Empowered Trajectory and Passive Beamforming Design in UAV-RIS Wireless Networks. IEEE J. Sel. Areas Commun. 2021, 39, 2042–2055.
  17. Li, S.; Duo, B.; Di Renzo, M.; Tao, M.; Yuan, X. Robust Secure UAV Communications with the Aid of Reconfigurable Intelligent Surfaces. IEEE Trans. Wirel. Commun. 2021, 20, 6402–6417.
  18. Shi, E.; Zhang, J.; Du, H.; Ai, B.; Yuen, C.; Niyato, D. RIS-Aided Cell-Free Massive MIMO Systems for 6G: Fundamentals, System Design, and Applications. Proc. IEEE 2024, 112, 331–364.
  19. Saikia, P.; Pala, S.; Singh, K.; Singh, S.K.; Huang, W.J. Proximal policy optimization for RIS-assisted full duplex 6G-V2X communications. IEEE Trans. Intell. Veh. 2023, 9, 5134–5149.
  20. Long, X.; Zhao, Y.; Wu, H.; Xu, C.Z. Deep reinforcement learning for integrated sensing and communication in RIS-assisted 6G V2X system. IEEE Internet Things J. 2024, 11, 39834–39849.
  21. Wang, C.; Li, Z.; Xia, X.G.; Shi, J.; Si, J.; Zou, Y. Physical layer security enhancement using artificial noise in cellular vehicle-to-everything (C-V2X) networks. IEEE Trans. Veh. Technol. 2020, 69, 15253–15268.
  22. de Lima, D.V.; da Costa, J.P.J.; da Silva, A.A.S.; Santos, G.A.; Vargas, J.A.R.; de Alexandria, A.R. Broadband Beamforming via Frequency Invariance Transformation and PARAFAC Decomposition for Jamming Mitigation in V2X Scenarios. In Proceedings of the 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring); IEEE: New York, NY, USA, 2024; pp. 1–6.
  23. Shang, C.; Yu, J.; Hoang, D.T. Energy-efficient and intelligent ISAC in V2X networks with spiking neural networks-driven DRL. IEEE Trans. Wirel. Commun. 2026, 25, 1182–1195.
  24. Li, Z.; Liao, L.; Gu, S.; Zhao, J. Physical Layer Eavesdropping Defense Scheme for V2X Based on Improved SAC Algorithm. Phys. Commun. 2025, 74, 102980.
  25. Amudha, S.; Sivaradje, G.; Nagarajan, G. Hyperparameter-Tuned PPO-Based Federated Deep Reinforcement Learning (FDRL) with Explainability for Efficient V2X Resource Allocation in 5G Networks. In Proceedings of the International Conference on Artificial Intelligence and Secure Data Analytics (ICAISDA 2025); Atlantis Press: Dordrecht, The Netherlands, 2026; pp. 559–573.
  26. Mlika, Z.; Cherkaoui, S. Deep deterministic policy gradient to minimize the age of information in cellular V2X communications. IEEE Trans. Intell. Transp. Syst. 2022, 23, 23597–23612.
Figure 1. System model of secure and low-latency communications in UAV-mounted STAR-RIS-assisted urban vehicular networks.
Figure 2. Queue evolution and a low-latency service model for vehicular users.
Figure 3. Step-by-step methodology of the proposed HC-SAC-based joint optimization framework.
Figure 4. Convergence behavior of HC-SAC, DDPG, and PPO. Light-colored curves denote raw evaluation scores, and bold curves denote smoothed trends.
Figure 5. Average delay under different vehicle speeds.
Figure 6. Average secrecy rate under different transmit power levels.
Figure 7. Secrecy outage probability versus the number of eavesdroppers.
Figure 8. Successful service ratio under different transmit power levels.
Figure 9. Normalized composite utility comparison of different schemes.
Table 1. Comparison with representative related works.

| Work | Main Scenario | UAV Mobility | RIS/STAR-RIS | V2X | Security | Queue Latency | DRL/SAC | Hierarchical Constraint Control |
|---|---|---|---|---|---|---|---|---|
| [14] | RIS-assisted UAV communication with joint trajectory and passive beamforming design | | RIS | | | | | |
| [15] | Performance analysis of RIS-assisted dual-hop UAV communication | | RIS | | | | | |
| [16] | Machine-learning-empowered UAV-RIS wireless network | | RIS | | | | | |
| [17] | Robust secure UAV communication with RIS | | RIS | | | | | |
| [10] | DRL-based spectrum allocation and STAR-RIS configuration for V2X | | STAR-RIS | | | | | |
| [12] | UAV-borne STAR-RIS-assisted NOMA communication | | STAR-RIS | | | | | |
| [19] | PPO-based RIS-assisted full-duplex 6G-V2X communication | | RIS | | | | | |
| [20] | DRL-based ISAC optimization in RIS-assisted 6G V2X systems | | RIS | | | | | |
| [24] | Improved SAC-based physical-layer eavesdropping defense for V2X | | | | | | SAC | |
| This work | UAV-mounted STAR-RIS-assisted secure and low-latency urban vehicular communication | | STAR-RIS | | | | HC-SAC | |
Table 2. Main simulation parameters.

| Parameter | Description | Value |
|---|---|---|
| $K$ | Number of vehicular users | 6 |
| $E$ | Number of eavesdroppers | 1–4 |
| $M$ | Number of STAR-RIS elements | 64 |
| $N$ | Number of time slots | 30 |
| $\delta_t$ | Slot duration | 1 s |
| $H$ | UAV flight altitude | 80 m |
| $V_{\max}$ | Maximum UAV speed | 20 m/s |
| $P_{\max}$ | Maximum transmit power | 20–40 dBm |
| $W$ | System bandwidth | 10 MHz |
| $\sigma^2$ | Noise power | −94 dBm |
| $\rho_0$ | Path loss at reference distance | 30 dB |
| $\alpha$ | Path-loss exponent range | 2.1–4.5 |
| $\kappa$ | Rician factor range | 0.08–12 |
| $v_k$ | Vehicle speed range | 6–22 m/s |
| $\lambda_k$ | Packet arrival rate | 2 packets/slot |
| $L_0$ | Packet size | $1.2 \times 10^{5}$ bits |
| $D_{\max}$ | Maximum tolerable delay | 12 slots |
| $R_{\mathrm{th}}$ | Secrecy outage threshold | 0.1 Mbps |
| $P_{\mathrm{out}}^{\max}$ | Maximum tolerable SOP | 0.76 |
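How several of these parameters interact can be sketched with the standard definitions of the secrecy rate, the secrecy outage event, and a per-slot queue update. This is a minimal illustration, assuming the usual positive-part secrecy rate and a simple bit-level queue recursion; the paper's own rate and queue equations lie outside this section, and the function names are hypothetical.

```python
import math

BANDWIDTH_HZ = 10e6            # W
PACKET_BITS = 1.2e5            # L_0
ARRIVALS_PER_SLOT = 2          # lambda_k
SLOT_SECONDS = 1.0             # delta_t
SECRECY_THRESHOLD_BPS = 0.1e6  # R_th


def secrecy_rate_bps(snr_user, snr_eve, bandwidth_hz=BANDWIDTH_HZ):
    """Positive part of the user-minus-eavesdropper rate, for linear-scale SNRs."""
    gap = math.log2(1.0 + snr_user) - math.log2(1.0 + snr_eve)
    return bandwidth_hz * max(0.0, gap)


def is_secrecy_outage(rate_bps, threshold_bps=SECRECY_THRESHOLD_BPS):
    """An outage occurs when the secrecy rate falls below the threshold."""
    return rate_bps < threshold_bps


def next_queue_bits(queue_bits, served_bits, arrived_bits):
    """Serve the backlog first (never below zero), then add the new arrivals."""
    return max(queue_bits - served_bits, 0.0) + arrived_bits


# Offered load per vehicle and slot implied by the table: 2 packets of 1.2e5 bits.
offered_bits = ARRIVALS_PER_SLOT * PACKET_BITS
served_bits = secrecy_rate_bps(3.0, 1.0) * SLOT_SECONDS
queue = next_queue_bits(0.0, served_bits, offered_bits)
```

With 2 packets of 1.2 × 10^5 bits per slot and a 1 s slot, each vehicle offers 0.24 Mbit per slot, so the backlog grows whenever the secure service rate stays below 0.24 Mbps.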
Table 3. UAV propulsion energy model parameters.

| Parameter | Description | Value |
|---|---|---|
| $P_0$ | Blade profile power | 79.86 W |
| $P_i$ | Induced power in hovering | 88.63 W |
| $U_{\mathrm{tip}}$ | Rotor blade tip speed | 120 m/s |
| $v_0$ | Mean rotor induced velocity | 4.03 m/s |
| $d_0$ | Fuselage drag ratio | 0.6 |
| $\rho$ | Air density | 1.225 kg/m³ |
| $s$ | Rotor solidity | 0.05 |
| $A$ | Rotor disc area | 0.503 m² |
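The parameter names above match the widely used rotary-wing propulsion power model with blade-profile, induced, and parasite terms. The sketch below evaluates that model with the Table 3 values. It assumes this standard functional form, since the paper's own energy equation is outside this section, and the function name is hypothetical.

```python
import math

# Values from Table 3.
P0 = 79.86      # blade profile power, W
P_I = 88.63     # induced power in hovering, W
U_TIP = 120.0   # rotor blade tip speed, m/s
V0 = 4.03       # mean rotor induced velocity, m/s
D0 = 0.6        # fuselage drag ratio
RHO = 1.225     # air density, kg/m^3
S = 0.05        # rotor solidity
A = 0.503       # rotor disc area, m^2


def propulsion_power(speed):
    """Propulsion power in watts at horizontal speed `speed` (m/s), assumed model."""
    blade = P0 * (1.0 + 3.0 * speed ** 2 / U_TIP ** 2)
    induced = P_I * math.sqrt(
        math.sqrt(1.0 + speed ** 4 / (4.0 * V0 ** 4)) - speed ** 2 / (2.0 * V0 ** 2)
    )
    parasite = 0.5 * D0 * RHO * S * A * speed ** 3
    return blade + induced + parasite


# Energy per 1 s slot at hover, at a moderate speed, and at the 20 m/s speed limit.
for v in (0.0, 10.0, 20.0):
    print(v, propulsion_power(v) * 1.0)
```

Under this model, hovering costs P_0 + P_i = 168.49 W, and moderate forward speeds are cheaper than hovering, which is why a trajectory policy is not necessarily penalized for moving.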
Table 4. Main hyperparameters of the proposed HC-SAC algorithm.

| Parameter | Value |
|---|---|
| High-level actor hidden layers | 256-256-128 |
| Low-level actor hidden layers | 256-256-128 |
| Critic hidden layers | 256-256-128 |
| Activation function | ReLU |
| Actor learning rate | $1 \times 10^{-4}$ |
| Critic learning rate | $1 \times 10^{-4}$ |
| Entropy temperature learning rate | $1 \times 10^{-4}$ |
| Constraint multiplier learning rate | $1 \times 10^{-4}$ |
| Replay buffer size | $1 \times 10^{5}$ |
| Mini-batch size | 256 |
| Discount factor $\zeta$ | 0.99 |
| Soft update coefficient $\rho$ | 0.005 |
| Initial entropy temperature $\tau$ | 0.2 |
| Target entropy | $-\dim(\mathcal{A})$ |
| Training episodes | 420 |
| Time slots per episode | $N$ |
| Warm-up steps | 2200 |
| Evaluation interval | Every 10 episodes |
| Evaluation episodes | 24 |
| Optimizer | Adam |
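Two of these hyperparameters have a simple operational meaning that a short sketch can make concrete. The code below assumes the standard SAC conventions: a Polyak (soft) update of the target critic weights and a target entropy equal to the negative action dimension. It uses plain Python lists in place of network tensors, and the function names are hypothetical.

```python
SOFT_UPDATE_COEFF = 0.005  # soft update coefficient (Table 4)
DISCOUNT = 0.99            # discount factor (Table 4)


def soft_update(target_weights, online_weights, coeff=SOFT_UPDATE_COEFF):
    """Move each target weight a small step toward its online counterpart."""
    return [
        (1.0 - coeff) * t + coeff * o
        for t, o in zip(target_weights, online_weights)
    ]


def target_entropy(action_dim):
    """Common SAC heuristic: the negative of the action dimension."""
    return -float(action_dim)


def discounted_return(rewards, discount=DISCOUNT):
    """Sum of rewards weighted by increasing powers of the discount factor."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


target = soft_update([0.0, 1.0], [1.0, 1.0])
episode_return = discounted_return([1.0] * 30)  # 30 slots per episode (Table 2)
```

With a coefficient of 0.005, the target critic tracks the online critic slowly, which is the usual way to stabilize the bootstrapped value targets.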
Table 5. Benchmark schemes used for performance comparison.

| Scheme | Description |
|---|---|
| Proposed HC-SAC | The proposed hierarchical constrained SAC scheme jointly optimizes the UAV trajectory, STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation. The hierarchical policy structure decouples large-scale UAV and partition control from fine-grained phase/power control, while the constraint-aware reward shaping mechanism improves the security–latency trade-off. |
| PPO-based scheme | The proximal policy optimization algorithm is adopted to learn the joint control policy. It uses the same state space, action space, feasibility mapping, reward components, and simulation environment as the proposed HC-SAC. The main difference lies in the on-policy clipped policy update mechanism. |
| DDPG-based scheme | The deep deterministic policy gradient algorithm is used as another DRL-based baseline. It adopts the same state representation, action mapping rules, reward function, and environmental settings as HC-SAC. Unlike HC-SAC, DDPG learns a deterministic policy and does not use entropy-regularized exploration. |
| Fixed UAV trajectory | The UAV follows a predefined straight-line trajectory, while the STAR-RIS transmission–reflection partition ratio, phase-shift matrices, and transmit power allocation are optimized. This benchmark is used to evaluate the contribution of UAV trajectory optimization. |
| Fixed STAR-RIS partition | The STAR-RIS transmission–reflection partition ratio is fixed during the whole service period, while the UAV trajectory, phase-shift matrices, and transmit power allocation are optimized. This benchmark is used to evaluate the benefit of adaptive STAR-RIS transmission–reflection partitioning. |
| Random phase-shift scheme | The STAR-RIS phase shifts are randomly generated, while the UAV trajectory, transmission–reflection partition ratio, and transmit power allocation are optimized. This benchmark is used to evaluate the importance of the STAR-RIS phase-shift optimization. |
| No STAR-RIS scheme | The UAV provides aerial assistance without STAR-RIS-enabled propagation reconfiguration. Only the UAV trajectory and transmit power allocation are optimized. This benchmark is used to quantify the performance gain brought by the UAV-mounted STAR-RIS. |
Table 6. Ablation study of the proposed HC-SAC framework.

| Variant | Secrecy Rate (Mbps) | Delay (slots) | SOP | SSR | Utility |
|---|---|---|---|---|---|
| Standard SAC | 0.476 ± 0.003 | 11.753 ± 0.133 | 0.912 ± 0.004 | 0.105 ± 0.001 | 0.146 ± 0.013 |
| HC-SAC w/o hierarchy | 0.516 ± 0.008 | 11.390 ± 0.129 | 0.880 ± 0.002 | 0.114 ± 0.002 | 0.452 ± 0.018 |
| HC-SAC w/o constraints | 0.388 ± 0.003 | 10.831 ± 0.116 | 0.738 ± 0.002 | 0.102 ± 0.001 | 0.552 ± 0.001 |
| Proposed HC-SAC | 0.433 ± 0.004 | 10.440 ± 0.109 | 0.714 ± 0.002 | 0.109 ± 0.001 | 0.851 ± 0.020 |
Table 7. Robustness evaluation of the proposed HC-SAC under imperfect CSI and finite-resolution STAR-RIS phase control.

| Setting | Secrecy Rate (Mbps) | Delay (slots) | SOP | SSR |
|---|---|---|---|---|
| Ideal CSI, continuous phase | 0.488 ± 0.009 | 10.112 ± 0.030 | 0.698 ± 0.002 | 0.121 ± 0.001 |
| CSI error std. = 0.02 | 0.492 ± 0.010 | 10.282 ± 0.049 | 0.695 ± 0.001 | 0.123 ± 0.001 |
| CSI error std. = 0.05 | 0.492 ± 0.010 | 10.282 ± 0.049 | 0.696 ± 0.001 | 0.123 ± 0.001 |
| CSI error std. = 0.10 | 0.492 ± 0.010 | 10.282 ± 0.049 | 0.695 ± 0.001 | 0.123 ± 0.001 |
| 3-bit phase quantization | 0.488 ± 0.009 | 10.113 ± 0.030 | 0.699 ± 0.002 | 0.121 ± 0.001 |
| 2-bit phase quantization | 0.488 ± 0.009 | 10.112 ± 0.030 | 0.698 ± 0.002 | 0.121 ± 0.001 |
| 1-bit phase quantization | 0.488 ± 0.009 | 10.111 ± 0.030 | 0.698 ± 0.002 | 0.121 ± 0.001 |
Table 8. Reference performance and composite utility comparison of all schemes.

| Scheme | Avg. Secrecy Rate (Mbps) | Avg. Delay (Slots) | SOP | SSR | Composite Utility |
|---|---|---|---|---|---|
| Proposed HC-SAC | 0.3928 | 10.8471 | 0.7160 | 0.1004 | 0.9254 |
| PPO-based scheme | 0.4927 | 11.7230 | 0.8501 | 0.1163 | 0.6630 |
| DDPG-based scheme | 0.4171 | 11.9358 | 0.8599 | 0.1012 | 0.5060 |
| Fixed UAV trajectory | 0.1811 | 12.4940 | 0.9058 | 0.0628 | 0.2403 |
| Fixed STAR-RIS partition | 0.2233 | 12.2493 | 0.8932 | 0.0681 | 0.3241 |
| Random phase-shift scheme | 0.1792 | 12.4942 | 0.9036 | 0.0624 | 0.2480 |
| No STAR-RIS scheme | 0.0903 | 13.3957 | 0.9473 | 0.0497 | 0.0000 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
