Article

Reinforcement Learning-Based Cloud-Aware HAPS Trajectory Optimization in Soft-Switching Hybrid FSO/RF Cooperative Transmission System

1 School of Electronic Engineering, Beijing University of Posts and Telecommunications (BUPT), Beijing 100876, China
2 State Key Laboratory of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications (BUPT), Beijing 100876, China
3 School of Physics and Optoelectronic Engineering, Nanjing University of Information Science and Technology, Nanjing 211544, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(3), 948; https://doi.org/10.3390/s26030948
Submission received: 5 December 2025 / Revised: 24 January 2026 / Accepted: 26 January 2026 / Published: 2 February 2026

Abstract

Space–air–ground systems employing free-space optical (FSO) communication leverage high-altitude platform stations (HAPS) to deliver seamless and ubiquitous connectivity. Although FSO links offer high capacity, they are highly susceptible to cloud extinction, which severely degrades link availability. Hybrid FSO/radio-frequency (RF) transmission and cloud-aware HAPS trajectory optimization can enhance resilience. However, conventional cloud-aware hybrid FSO/RF transmission systems based on hard-switching (HS) between the FSO and RF links suffer from frequent link transitions and unstable throughput. To address these challenges, we propose a joint optimization framework that integrates soft-switching between the FSO and RF links with deep reinforcement learning (DRL) for HAPS trajectory optimization. Soft-switching based on rateless codes (RCs) enables simultaneous transmission over both links, where the receiver accumulates packets until successful decoding, acknowledged with a single feedback message. RC feedback is therefore sparse, which avoids feedback storms but also complicates HAPS trajectory optimization. The DRL agent proactively optimizes HAPS trajectories to avoid cloud cover and maintain link availability. To handle the sparse feedback of RCs during DRL training, a reward-shaped proximal policy optimization (PPO)-based agent is developed to jointly optimize throughput and trajectory smoothness. Simulations using realistic ERA5 data show that RC-PPO achieves higher throughput and smoother trajectories than the HS-PPO baseline.

1. Introduction

The explosive growth of global data traffic has revealed the limitations of terrestrial networks in achieving ubiquitous connectivity [1]. Space–air–ground integrated networks (SAGINs) with free-space optical communication (FSO), using high-altitude platform stations (HAPS) to relay communications between satellites and ground stations, have emerged as a promising solution [2,3]. FSO links offer ultra-high bandwidth but suffer from low reliability due to atmospheric conditions such as clouds and atmospheric turbulence (AT) [4,5]. Hybrid FSO/RF transmission systems leverage RF backup to enhance reliability [6,7,8]. Such hybrid systems have been deployed in dual-hop relay links [9] and SAGIN scenarios to improve coverage and capacity [10]. Conventional hybrid FSO/RF systems employ hard-switching (HS) strategies that select links based on instantaneous SNR values [11], suffering from frequent transitions and unstable throughput. FSO/RF systems based on rateless code (RC), such as Raptor code [12], enable soft-switching by transmitting distinct encoded packets simultaneously over both the FSO and RF channels [13], thus eliminating switching overhead and improving reliability against burst disruptions. To avoid cloud cover, HAPS trajectory optimization is also an effective method to enhance link stability and throughput.
Deep reinforcement learning (DRL) offers a powerful framework for adaptive HAPS trajectory optimization in dynamic atmospheric environments [14]. For example, ref. [15] integrates 5G low-latency connectivity with a Deep Q-Network (DQN) for 3D trajectory optimization. Ref. [16] employs the deep deterministic policy gradient (DDPG) algorithm for coordinating HAPS coverage. However, these studies focus on RF-only aerial communication networks. When extending DRL to hybrid FSO/RF systems, ref. [17] applies the A2C algorithm to optimize the HAPS trajectory in the HS FSO/RF transmission system, which relies on dense, per-step reward signals. This dense-reward strategy is fundamentally incompatible with the RC-based soft-switching FSO/RF system. For RC, a meaningful reward is available only upon successful block decoding, which introduces the critical challenge of reward sparsity. This sparse feedback (reward) severely complicates temporal credit assignment and hinders efficient exploration [18]. Ref. [19] successfully used Proximal Policy Optimization (PPO) with reward shaping to compensate for such reward sparsity when guiding UAVs to their destinations. To the best of our knowledge, no existing work has investigated HAPS trajectory planning in an RC-based soft-switching hybrid FSO/RF system under stochastic cloud dynamics.
This paper proposes a joint optimization framework that integrates rateless-coded physical-layer transmission with PPO-based trajectory learning in a hybrid FSO/RF transmission system. Key contributions are as follows:
  • The cloud-aware HAPS trajectory optimization problem in soft-switching hybrid FSO/RF systems is formulated and solved by a PPO-based DRL approach, under the stochastic moving occluding cloud (SMOC) model derived from the ERA5 dataset.
  • A potential-based reward-shaping mechanism within the PPO framework is developed to mitigate sparse decoding feedback of RCs, delivering faster convergence and superior performance over threshold-based HS-PPO schemes.
The remainder of this paper is organized as follows. Section 2 presents the scheme of the trajectory-optimized HAPS–ground station (GS) hybrid FSO/RF link, FSO/RF channel models, and the throughput of HS and RC-based hybrid FSO/RF systems. Section 3 details the proposed RC-PPO algorithm and the HS-PPO algorithm as a reference and presents the corresponding Markov Decision Process (MDP) formulation and reward shaping strategy. Section 4 provides simulation results. Section 5 gives the conclusion.

2. System Model

This section presents the mathematical framework for the HAPS-assisted hybrid FSO/RF communication system. Firstly, a simplified three-tier SAGIN architecture is introduced. Subsequently, the channel models of the FSO and RF links are presented. Finally, the throughputs of the hybrid FSO/RF systems with HS and RC are elaborated.

2.1. Space–Air–Ground Architecture

As illustrated in Figure 1, we consider a simplified three-tier Space–Air–Ground (SAG) system, which constitutes a fundamental building block of standard SAGINs. The LEO satellite operates at an orbital altitude of approximately 600 km and serves as the primary data source. The HAPS, positioned at $H_{\mathrm{haps}} = 20$ km in the stratosphere, functions as an aerial relay node to bridge the satellite-to-ground communication gap. The satellite-to-HAPS link employs FSO communication to exploit its high bandwidth capacity, while the HAPS-to-GS link utilizes hybrid FSO/RF transmission to provide complementary link diversity.
A discrete-time model is adopted with time step index $t \in \{1, 2, \ldots, T\}$. The position of the HAPS at time $t$ is denoted as $\mathbf{p}_h[t] = (x[t], y[t], H_{\mathrm{haps}})$, where $(x[t], y[t])$ represents the horizontal coordinates and the altitude $H_{\mathrm{haps}}$ remains constant. The GS is located at $\mathbf{p}_g = (0, 0, H_{\mathrm{gs}})$ with $H_{\mathrm{gs}} = 0$. The instantaneous slant distance between HAPS and GS is given by:
$$L[t] = \lVert \mathbf{p}_h[t] - \mathbf{p}_g \rVert = \sqrt{x[t]^2 + y[t]^2 + H_{\mathrm{haps}}^2}. \tag{1}$$
The elevation angle (Figure 1) is $\theta[t] = \arctan\!\left( H_{\mathrm{haps}} \big/ \sqrt{x[t]^2 + y[t]^2} \right)$.
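As a quick sketch, the slant-path geometry above can be computed directly; the helper names below are illustrative, not from the paper:

```python
import math

H_HAPS = 20.0  # HAPS altitude H_haps in km (Section 2.1)

def slant_distance_km(x: float, y: float, h: float = H_HAPS) -> float:
    """Slant distance L[t] between the HAPS at (x, y, h) and the GS at the origin."""
    return math.sqrt(x ** 2 + y ** 2 + h ** 2)

def elevation_angle_rad(x: float, y: float, h: float = H_HAPS) -> float:
    """Elevation angle theta[t] seen from the GS toward the HAPS."""
    return math.atan2(h, math.hypot(x, y))
```

For a HAPS directly above the GS, the slant distance reduces to the altitude and the elevation angle to 90°.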

2.2. Channel Model

2.2.1. FSO Channel Model

The composite channel gain of the downlink FSO channel between HAPS and GS is expressed as:
$$H_{\mathrm{FSO}}[t] = h_g[t] \cdot h_a[t] \cdot h_l[t], \tag{2}$$
where $h_g[t]$ is the geometric loss due to the divergence of the beam over the slant path distance $L[t]$ [20], $h_a[t]$ follows the Gamma–Gamma distribution characterizing AT-induced fading [21], and $h_l[t]$ represents the loss due to cloud-induced attenuation. Among these factors, $h_l[t]$ is critical for optimizing the HAPS trajectory due to its spatial heterogeneity. The cloud-induced attenuation follows the Beer–Lambert law [22]:
$$h_l[t] = \exp(-\sigma L[t]). \tag{3}$$
Here $\sigma$ (km$^{-1}$) is the attenuation coefficient [23], determined from the visibility $V$ (in km) and the wavelength (in nm) as:
$$\sigma = \frac{3.91}{V} \left( \frac{\lambda}{550} \right)^{-q}. \tag{4}$$
Here, $\lambda$ is the optical wavelength, $V$ is the visibility, and $q$ is the size-distribution parameter of the scattering particles, which depends on $V$. The visibility is given by:
$$V = \frac{-\ln(0.002)}{\beta_f[t]}, \tag{5}$$
where $\beta_f[t]$ is the cloud-induced attenuation of the FSO link obtained by integration over the slant path:
$$\beta_f[t] = \frac{1}{\sin(\theta[t])} \int_{h_0}^{h_{\mathrm{top}}} \beta_{\mathrm{ext}}^f(h)\, dh. \tag{6}$$
Here, $\theta[t]$ is the elevation angle, and $h_0$ and $h_{\mathrm{top}}$ are the cloud base and top altitudes. The extinction coefficient $\beta_{\mathrm{ext}}^f(h)$ for the FSO link is given by [24]:
$$\beta_{\mathrm{ext}}^f(h) = \frac{6.51 \times 10^{3} \cdot c(h)}{\rho_w \cdot r_e}, \tag{7}$$
where $\rho_w = 1$ g/cm$^3$ is the water density, $r_e$ ($\mu$m) is the particle effective radius [24], and $c(h)$ is the vertical profile of liquid water content (LWC) in g/m$^3$. The three-dimensional (3D) distribution of LWC is modeled as a spatially correlated stochastic field, as detailed in Section 4.1.
The FSO link employs intensity modulation with direct detection (IM/DD) using on–off keying (OOK). The received optical power is $P_{\mathrm{rx}}^f[t] = P_{\mathrm{tx}}^f G_{\mathrm{tx}}^f G_{\mathrm{rx}}^f H_{\mathrm{FSO}}[t]$, where $P_{\mathrm{tx}}^f$ denotes the transmit optical power and $G_{\mathrm{tx}}^f$ and $G_{\mathrm{rx}}^f$, respectively, denote the transmitter and receiver telescope gains. The instantaneous electrical SNR is given by:
$$\gamma_{\mathrm{FSO}}[t] = \frac{\left( R\, P_{\mathrm{rx}}^f[t] \right)^2}{N_0^f B_f}, \tag{8}$$
where $R$ (A/W) is the photodetector responsivity, $N_0^f$ (A$^2$/Hz) is the noise power spectral density, and $B_f$ (Hz) is the receiver bandwidth of the FSO link.
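The cloud-attenuation chain above (visibility from the integrated extinction, Kruse-type attenuation coefficient, Beer–Lambert loss) can be sketched as follows. The piecewise rule for $q$ is an assumption borrowed from the Kim model (the paper only cites [23] for it), and all numeric arguments are illustrative:

```python
import math

def kim_q(V_km: float) -> float:
    """Size-distribution parameter q as a function of visibility V (km).
    Assumed Kim-model piecewise rule; the paper does not list it explicitly."""
    if V_km > 50:
        return 1.6
    if V_km > 6:
        return 1.3
    if V_km > 1:
        return 0.16 * V_km + 0.34
    if V_km > 0.5:
        return V_km - 0.5
    return 0.0

def cloud_loss(beta_f: float, wavelength_nm: float, L_km: float) -> float:
    """Cloud-induced FSO loss h_l[t] = exp(-sigma * L), with
    V = -ln(0.002)/beta_f and sigma = (3.91/V)(lambda/550)^(-q)."""
    V = -math.log(0.002) / beta_f                      # visibility (km)
    sigma = (3.91 / V) * (wavelength_nm / 550.0) ** (-kim_q(V))
    return math.exp(-sigma * L_km)
```

Larger integrated cloud attenuation shrinks the visibility, inflates $\sigma$, and drives $h_l[t]$ toward zero, which is exactly the spatial signal the trajectory optimizer later exploits.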

2.2.2. RF Channel Model

The RF channel provides complementary connectivity with enhanced reliability under cloud-obscured conditions. The RF channel gain incorporates free-space path loss (FSPL) and cloud-induced attenuation. The FSPL is given by:
$$\mathrm{FSPL}_r[t] = \left( \frac{4\pi L[t] f_c}{c_0} \right)^2, \tag{9}$$
with RF carrier frequency $f_c$ and speed of light $c_0$. The cloud attenuation coefficient $\beta_r[t]$ is computed similarly to (6) for FSO links by path-integrating the specific attenuation coefficient. The specific RF attenuation coefficient $\beta_{\mathrm{ext}}^r(h)$ (km$^{-1}$) is derived from Mie scattering theory and modeled as [25]:
$$\beta_{\mathrm{ext}}^r(h) = k_{\mathrm{ext}}^r \cdot c(h), \tag{10}$$
where $k_{\mathrm{ext}}^r$ (km$^{-1}$/(g·m$^{-3}$)) is the mass extinction coefficient at RF frequency $f_c$.
Similar to (8), the instantaneous SNR of the RF link with QPSK modulation is given by:
$$\gamma_{\mathrm{RF}}[t] = \frac{P_{\mathrm{tx}}^r G_{\mathrm{tx}}^r G_{\mathrm{rx}}^r H_{\mathrm{RF}}[t]}{N_0^r B_r}, \tag{11}$$
where $P_{\mathrm{tx}}^r$ is the transmit power, $G_{\mathrm{tx}}^r$ and $G_{\mathrm{rx}}^r$ are the transmitting and receiving antenna gains, $N_0^r$ (W/Hz) is the spectral density of the RF noise power, and $B_r$ (Hz) is the receiver bandwidth of the RF link.

2.3. Hybrid FSO/RF Systems

This section describes the throughput of hybrid FSO/RF communication systems based on HS and RC. These two schemes differ fundamentally in how they exploit link diversity, which affects trajectory optimization.

2.3.1. Hard Switching

The HS scheme selects either the FSO or RF link at each time step based on instantaneous channel quality. The instantaneous throughput is given by:
$$T_{\mathrm{HS}}[t] = \begin{cases} R_{\mathrm{FSO}}\, P_{\mathrm{suc}}^{\mathrm{FSO}}[t], & \text{if } \gamma_{\mathrm{FSO}}[t] \ge \gamma_{\mathrm{th}}, \\ R_{\mathrm{RF}}\, P_{\mathrm{suc}}^{\mathrm{RF}}[t], & \text{if } \gamma_{\mathrm{FSO}}[t] < \gamma_{\mathrm{th}}, \end{cases} \tag{12}$$
where $\gamma_{\mathrm{th}}$ is the switching threshold, $R_{\mathrm{FSO}}$ and $R_{\mathrm{RF}}$ are the transmitted data rates, and $P_{\mathrm{suc}}^{\mathrm{FSO}}[t]$ and $P_{\mathrm{suc}}^{\mathrm{RF}}[t]$ are the per-packet success probabilities accounting for forward error correction (FEC). The optimal threshold maximizes the expected throughput:
$$\gamma_{\mathrm{th}}^* = \arg\max_{\gamma_{\mathrm{th}}} \mathbb{E}\left[ T_{\mathrm{HS}}[t] \right]. \tag{13}$$
For the hybrid FSO/RF system, the bit error rate (BER) for the FSO link with OOK modulation and the RF link with QPSK modulation depends on the respective signal-to-noise ratio (SNR) at the receiver. The BER for each link can be uniformly expressed as:
$$\mathrm{BER}_i[t] = \frac{1}{2} \operatorname{erfc}\left( \sqrt{\kappa_i \cdot \gamma_i[t]} \right), \tag{14}$$
where $i \in \{f, r\}$ denotes the FSO or RF link, $\gamma_i[t]$ is the instantaneous SNR, and $\kappa_i$ is the modulation-dependent efficiency factor, with $\kappa_f = 0.5$ for OOK and $\kappa_r = 1$ for QPSK. For a packet of $L_p$ bits, assuming FEC capable of correcting up to $t_{\mathrm{corr}}$ bit errors, the packet success probability is computed via the binomial sum:
$$P_{\mathrm{suc}}[t] = \sum_{i=0}^{t_{\mathrm{corr}}} \binom{L_p}{i}\, \mathrm{BER}[t]^i \left( 1 - \mathrm{BER}[t] \right)^{L_p - i}. \tag{15}$$

2.3.2. Rateless Coding

RCs enable the generation of an arbitrary number of encoded packets until the receiver accumulates sufficient packets to decode the original source block [13]. Unlike traditional channel codes with fixed code rates, RCs automatically adjust the number of encoded packets that must be sent for successful decoding based on channel conditions. Raptor code, which represents the state of the art in RCs, consists of a high-rate precode (typically a Low-Density Parity-Check (LDPC) code [26]) followed by a Luby Transform (LT) code [27] and is utilized in this paper.
Let $\mathbf{s} \in \mathrm{GF}(2)^{N_s}$ denote the source symbol vector containing $N_s$ information symbols. The precode generates an intermediate symbol vector (Figure 1):
$$\mathbf{c} = \mathbf{G}_{\mathrm{pre}} \mathbf{s}, \tag{16}$$
where $\mathbf{G}_{\mathrm{pre}}$ is the precode generator matrix producing $N_c$ intermediate symbols. The LT encoder then produces output symbols via:
$$x_i = \bigoplus_{j=1}^{N_c} G_{\mathrm{LT}}(i, j)\, c_j, \tag{17}$$
where $G_{\mathrm{LT}}(i, \cdot)$ is a random sparse row sampled according to a robust soliton degree distribution and $\oplus$ denotes XOR over $\mathrm{GF}(2)$. Each output symbol $x_i$ is formed by XORing a small subset of intermediate symbols, with the subset size and selection determined by the degree distribution.
In the RC scheme, both FSO and RF links transmit different encoded symbols simultaneously. Crucially, any correctly received symbol contributes to decoding, regardless of which link delivered it. This superposition property eliminates the need for link selection decisions and enables full utilization of both links under all channel conditions. The receiver can successfully decode the source data block once it accumulates approximately $N_s(1 + \varepsilon_c)$ distinct encoded symbols, where $\varepsilon_c$ is the average overhead required by the belief-propagation (BP) decoder; this overhead is typically $\varepsilon_c \approx 0.05$–$0.1$ for well-designed Raptor codes [13]. The impact of dynamic SNR fluctuations on the hybrid link is captured through the packet success probabilities $P_{\mathrm{suc}}^{\mathrm{FSO}}$ and $P_{\mathrm{suc}}^{\mathrm{RF}}$ in Equation (15), which determine the rate of correctly received symbols. Therefore, the expected effective throughput for RC is given by:
$$\mathbb{E}\left[ T_{\mathrm{RC}} \right] = \frac{R_{\mathrm{FSO}}\, P_{\mathrm{suc}}^{\mathrm{FSO}} + R_{\mathrm{RF}}\, P_{\mathrm{suc}}^{\mathrm{RF}}}{1 + \varepsilon_c}. \tag{18}$$
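A toy sketch of both ideas: one LT output symbol formed by XORing a random subset of intermediate symbols, and the expected RC throughput with both links contributing. The byte-level symbols and the assumption that the degree has already been drawn from a robust soliton distribution are illustrative simplifications:

```python
import random

def lt_encode_symbol(intermediate, degree, rng):
    """Form one LT output symbol by XORing `degree` randomly chosen
    intermediate symbols (toy byte-level sketch; the degree is assumed
    to be pre-drawn from a robust soliton distribution)."""
    idx = rng.sample(range(len(intermediate)), degree)
    out = 0
    for j in idx:
        out ^= intermediate[j]
    return idx, out

def expected_rc_throughput(r_fso, p_fso, r_rf, p_rf, eps_c=0.05):
    """Expected effective RC throughput: both links add up, discounted
    by the decoding overhead eps_c."""
    return (r_fso * p_fso + r_rf * p_rf) / (1.0 + eps_c)
```

Note how the throughput expression is additive in the two links: a partially clouded FSO link still contributes symbols instead of being abandoned, which is the core advantage over hard switching.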

3. Trajectory Optimization

This section presents the trajectory optimization framework for hybrid FSO/RF communication schemes. Firstly, the PPO algorithm and MDP formulation are introduced as the foundation of DRL. Subsequently, the trajectory optimization strategies for the HS and RC schemes using PPO-based DRL are elaborated.

3.1. Proximal Policy Optimization

PPO is an on-policy actor-critic algorithm that uses a clipped surrogate objective to constrain policy updates, thereby preventing large deviations from the current policy and ensuring stable training. The objective of PPO, which is maximized at each iteration, is given by:
$$L_t^{\mathrm{PPO}}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2 H[\pi_\theta](s_t) \right], \tag{19}$$
where $\hat{\mathbb{E}}_t[\cdot]$ denotes the empirical average over a finite batch of samples, $H[\pi_\theta](s_t)$ denotes an entropy bonus, $L_t^{\mathrm{VF}}(\theta)$ is the value function loss, and $c_1, c_2 > 0$ are weighting coefficients.
The clipped surrogate objective $L_t^{\mathrm{CLIP}}(\theta)$ prevents excessively large policy updates by constraining the probability ratio between the current and old policies:
$$L_t^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( \mu_t(\theta) \hat{A}_t,\ \mathrm{clip}\left( \mu_t(\theta),\, 1 - \epsilon_{\mathrm{clip}},\, 1 + \epsilon_{\mathrm{clip}} \right) \hat{A}_t \right) \right], \tag{20}$$
where $\mathrm{clip}(\mu_t(\theta), 1 - \epsilon_{\mathrm{clip}}, 1 + \epsilon_{\mathrm{clip}})$ constrains the probability ratio $\mu_t(\theta)$ to the interval $[1 - \epsilon_{\mathrm{clip}}, 1 + \epsilon_{\mathrm{clip}}]$, and $\epsilon_{\mathrm{clip}}$ (typically 0.1–0.3) is a hyperparameter. The probability ratio $\mu_t(\theta)$ at time step $t$ is given by:
$$\mu_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}. \tag{21}$$
Moreover, the advantage estimator $\hat{A}_t$ is formulated as a discounted sum of future temporal-difference (TD) residuals $\delta_t$:
$$\hat{A}_t = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l}, \tag{22}$$
where $\gamma \in (0, 1)$ is the discount factor, and the TD residual is defined as:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t). \tag{23}$$
Here, $r_t$ is the immediate reward received from the environment, and $V(s_t)$ is the state value, which evaluates the state $s_t$ under the current policy.
The value function loss $L_t^{\mathrm{VF}}(\theta)$ minimizes the mean squared error (MSE) between predicted state values and empirical returns:
$$L_t^{\mathrm{VF}}(\theta) = \hat{\mathbb{E}}_t\left[ \left( V_\theta(s_t) - R_t \right)^2 \right], \tag{24}$$
where $R_t = \sum_{l=0}^{T-t} \gamma^l r_{t+l}$ is the discounted sum of future rewards. The entropy term $H[\pi_\theta]$ encourages exploration by preventing premature convergence to deterministic policies:
$$H[\pi_\theta](s_t) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\left[ -\log \pi_\theta(a \mid s_t) \right]. \tag{25}$$
Network parameters are updated after collecting a trajectory of length T, using the collected batch to perform multiple epochs of minibatch gradient descent on the PPO objective. The pseudocode of the algorithm is shown in Algorithm 1.
Algorithm 1 PPO for HAPS Trajectory Optimization
Require: Policy and value networks $\pi_\theta$, $V_\theta$; hyperparameters $\gamma$, $\epsilon_{\mathrm{clip}}$, $c_1$, $c_2$, $K$
1: for each training iteration do
2:     Collect trajectory $\mathcal{D} = \{(s_t, a_t, r_t, s_{t+1})\}_{t=0}^{T}$ using $\pi_\theta$
3:     Compute returns $R_t = \sum_{l=0}^{T-t} \gamma^l r_{t+l}$
4:     Compute TD residuals $\delta_t = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)$
5:     Compute advantages $\hat{A}_t = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l}$
6:     Normalize advantages: $\hat{A}_t \leftarrow (\hat{A}_t - \mu)/\sigma$
7:     Store old policy probabilities $\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$
8:     for epoch $k = 1, \ldots, K$ do
9:         for mini-batch $\mathcal{B} \subset \mathcal{D}$ do
10:            Compute $L_t^{\mathrm{PPO}}(\theta) = L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2 H[\pi_\theta](s_t)$
11:            Update $\theta$ via gradient ascent on $L_t^{\mathrm{PPO}}(\theta)$
12:        end for
13:    end for
14: end for
15: return Optimized policy $\pi_{\theta^*}$
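Lines 3–6 of Algorithm 1 (returns, TD residuals, advantages, normalization) can be sketched in plain Python as a single backward pass; the batch/minibatch machinery of lines 8–13 is omitted, and a zero bootstrap value is assumed at the end of the trajectory:

```python
from statistics import mean, pstdev

def compute_advantages(rewards, values, gamma=0.99):
    """Returns R_t, TD residuals delta_t, and normalized advantages A_t
    (Algorithm 1, lines 3-6). `values` carries one extra bootstrap entry
    V(s_{T+1}) at the end, assumed 0 for a finished episode."""
    T = len(rewards)
    returns, deltas, adv = [0.0] * T, [0.0] * T, [0.0] * T
    run_ret = run_adv = 0.0
    for t in reversed(range(T)):
        run_ret = rewards[t] + gamma * run_ret            # line 3
        returns[t] = run_ret
        deltas[t] = rewards[t] + gamma * values[t + 1] - values[t]  # line 4
        run_adv = deltas[t] + gamma * run_adv             # line 5
        adv[t] = run_adv
    m, s = mean(adv), pstdev(adv)                         # line 6
    adv = [(a - m) / (s + 1e-8) for a in adv]
    return returns, deltas, adv
```

The backward recursion avoids the quadratic cost of evaluating each discounted sum independently.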

3.2. Trajectory Optimization with PPO-Based DRL

The HAPS trajectory optimization problem is formalized as a finite-horizon MDP defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the transition probability, and $\mathcal{R}$ is the reward function. The control objective for both the RC and HS schemes is to find an optimal policy maximizing the expected discounted return:
$$\pi^* = \arg\max_{\pi_\theta} \mathbb{E}_{s_0 \sim \mu,\ a_t \sim \pi_\theta(\cdot \mid s_t),\ s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)}\left[ \sum_{t=0}^{T} \gamma^t r_t \right]. \tag{26}$$
The state space, action space, and reward function differ significantly between the RC and HS schemes due to their distinct communication paradigms.

3.2.1. RC-PPO

The state space should contain all information needed for the RC-PPO agent to make decisions and preserve the Markovian property of the environment:
$$s_t^{\mathrm{RC}} = \left( x_t, y_t, \psi_t, \tau_t, \boldsymbol{\rho}_t \right) \in \mathcal{S}^{\mathrm{RC}}, \tag{27}$$
where $(x_t, y_t) \in \mathbb{R}^2$ denotes the current coordinates of the HAPS, $\psi_t \in [0, 2\pi)$ is the heading angle measured clockwise from north, $\tau_t = T - t$ is the number of steps remaining, and $\boldsymbol{\rho}_t = [\rho(p_{t,1}), \ldots, \rho(p_{t,N_p})] \in [0, 1]^{N_p}$ contains cloud extinction coefficients sampled at $N_p$ probe points $\{p_{t,i}\}$ along prospective directions.
Since RC allows both links to transmit simultaneously without switching overhead, the action space contains only navigation decisions:
$$a_t^{\mathrm{RC}} = \Delta\psi_t \in \mathcal{A}^{\mathrm{RC}}, \tag{28}$$
where $\Delta\psi_t \in [-\psi_{\max}, +\psi_{\max}]$ is the discrete heading change, with $\psi_{\max} = 30°$ the maximum single-step turn angle.
State transitions decompose into deterministic kinematic evolution and stochastic cloud field dynamics. For simplicity, the speed of the HAPS is assumed to be a constant $v_0$, and the effects of wind (atmospheric currents) are not considered in the kinematic model. The HAPS position evolves according to:
$$x_{t+1} = x_t + v_0 \Delta t \cos(\psi_t + \Delta\psi_t), \quad y_{t+1} = y_t + v_0 \Delta t \sin(\psi_t + \Delta\psi_t), \quad \psi_{t+1} = (\psi_t + \Delta\psi_t) \bmod 2\pi. \tag{29}$$
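The kinematic update can be sketched as a small step function; the speed and step-length values below are illustrative (72 km/h over a 60 s step), not the paper's simulation settings, and the explicit clamp on the turn command is an assumption about how $\psi_{\max}$ is enforced:

```python
import math

MAX_TURN = math.radians(30)  # psi_max: maximum single-step heading change

def step_kinematics(x, y, psi, d_psi, v0=0.02, dt=60.0):
    """Constant-speed HAPS kinematic update (Eq. (29)); v0 in km/s, dt in s."""
    d_psi = max(-MAX_TURN, min(MAX_TURN, d_psi))   # enforce |d_psi| <= psi_max
    psi_new = (psi + d_psi) % (2.0 * math.pi)
    x_new = x + v0 * dt * math.cos(psi + d_psi)
    y_new = y + v0 * dt * math.sin(psi + d_psi)
    return x_new, y_new, psi_new
```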
The decoding progress $\eta_t \in [0, 1]$, defined as the fraction of successfully received packets relative to the total required, evolves according to:
$$\eta_{t+1} = \min\left( 1,\ \eta_t + \frac{\Lambda_t^{\mathrm{RC}} \Delta t}{N_s (1 + \varepsilon_c)} \right), \tag{30}$$
where $N_s$ is the total number of source packets, $\varepsilon_c$ is the redundancy overhead of the RCs, and $\Lambda_t^{\mathrm{RC}}$ is the aggregate packet-arrival rate from both links. Note that, in real-world deployment, the receiver provides feedback only upon successful decoding of complete data blocks, making $\eta_t$ unavailable for real-time decision-making by the HAPS. The aggregate rate is:
$$\Lambda_t^{\mathrm{RC}} = \sum_{i=1}^{N_p} w_i \left[ R_{\mathrm{FSO}} \cdot P_{\mathrm{suc}}^{\mathrm{FSO}}(\rho(p_{t,i})) + R_{\mathrm{RF}} \cdot P_{\mathrm{suc}}^{\mathrm{RF}}(\rho(p_{t,i})) \right], \tag{31}$$
where $R_{\mathrm{FSO}}$ and $R_{\mathrm{RF}}$ are the transmitted symbol rates of the FSO and RF links, $P_{\mathrm{suc}}^{\mathrm{FSO}}(\rho)$ and $P_{\mathrm{suc}}^{\mathrm{RF}}(\rho)$ denote the probability of successful symbol reception as a function of the cloud extinction loss, and $w_i = \exp(-d_i / d_0)$ are distance-dependent weights, with $d_i$ the distance from the HAPS to the $i$th probe position and $d_0$ a normalization constant, assigning higher influence to nearer probes. At each time step, $N_p$ probe points $\{p_{t,i}\}_{i=1}^{N_p}$ are placed along candidate directions at varying distances from the current HAPS position to estimate future channel conditions.
To maximize the average capacity over the allowed $T$ time steps, the reward function balances multiple objectives and is defined as:
$$r_t^{\mathrm{RC}} = \alpha \Lambda_t^{\mathrm{RC}} + \beta \Delta\eta_t - \kappa_\psi |\Delta\psi_t| - \kappa_d \left( \frac{d_t}{d_0} \right)^2 + q R_{\mathrm{suc}}, \tag{32}$$
where $\Lambda_t^{\mathrm{RC}}$ encourages high instantaneous throughput, $\Delta\eta_t = \eta_{t+1} - \eta_t$ provides dense feedback on decoding progress through potential-based reward shaping, $|\Delta\psi_t|$ discourages sharp heading changes to maintain flight stability, and $(d_t / d_0)^2$ with $d_t = \sqrt{x_t^2 + y_t^2}$ penalizes excessive distance from the GS to accelerate policy convergence and improve link quality. $\alpha$, $\beta$, $\kappa_\psi$, $\kappa_d$, and $q$ are weighting factors that balance the physical and task-driven scales of their corresponding terms.
The terminal reward $R_{\mathrm{suc}}$ indicates mission completion:
$$R_{\mathrm{suc}} = \begin{cases} 1, & \text{if } \eta_t = 1 \text{ and } t \le T, \\ 0, & \text{otherwise.} \end{cases} \tag{33}$$
An episode terminates when either the mission succeeds ($R_{\mathrm{suc}} = 1$) or a timeout occurs ($\tau_t = 0$).
The inclusion of $\Delta\eta_t$ in the reward function addresses the sparse feedback challenge inherent in RC. According to the potential-based shaping theorem [19], the shaping term $\beta \Delta\eta_t$ corresponds to the potential function $\Phi(s) = \beta \eta_t$, which preserves the optimal policy while providing dense training signals.
Importantly, although $\eta_t$ is used during training to accelerate policy learning, the trained policy $\pi_\theta(a_t^{\mathrm{RC}} \mid s_t^{\mathrm{RC}})$ depends only on the observable state $s_t^{\mathrm{RC}} = (x_t, y_t, \psi_t, \tau_t, \boldsymbol{\rho}_t)$ and requires no real-time decoding feedback at deployment.
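The shaped per-step reward of Eq. (32) is simple enough to write out directly; the default weighting values below are placeholders for illustration (the paper tunes them as described in Section 4.2):

```python
def shaped_reward(lam_rc, d_eta, d_psi, d_t, success,
                  alpha=1.0, beta=1.0, k_psi=0.1, k_d=0.1, d0=100.0, q=10.0):
    """Shaped per-step RC-PPO reward; all weights are illustrative."""
    r = (alpha * lam_rc               # instantaneous aggregate rate
         + beta * d_eta               # potential-based shaping on decoding progress
         - k_psi * abs(d_psi)         # smoothness penalty on heading change
         - k_d * (d_t / d0) ** 2)     # distance-to-GS penalty
    if success:
        r += q                        # terminal bonus q * R_suc
    return r
```

Because the shaping term is the difference of a potential in $\eta_t$, it adds dense signal without altering which policy is optimal.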

3.2.2. HS-PPO

For HS-PPO, the state space omits the decoding progress but includes the currently selected link, represented by the indicator $H_t$:
$$s_t^{\mathrm{HS}} = \left( x_t, y_t, \psi_t, \tau_t, H_t, \boldsymbol{\rho}_t \right) \in \mathcal{S}^{\mathrm{HS}},$$
where $H_t \in \{\mathrm{FSO}, \mathrm{RF}\}$ denotes the currently active link. The action space includes both navigation and link selection decisions:
$$a_t^{\mathrm{HS}} = \left( \Delta\psi_t, h_t \right) \in \mathcal{A}^{\mathrm{HS}},$$
where $h_t \in \{0, 1\}$ denotes the link selection (0 for FSO, 1 for RF). The instantaneous reward is defined as:
$$r_t^{\mathrm{HS}} = \alpha \Lambda_t^{\mathrm{HS}} - \kappa_\psi |\Delta\psi_t| - \xi\, \mathbb{I}\left[ H_t \neq H_{t-1} \right] - \kappa_d \left( \frac{d_t}{d_0} \right)^2 + q R_{\mathrm{suc}}, \tag{35}$$
where $\xi > 0$ penalizes frequent link switching, and $\Lambda_t^{\mathrm{HS}}$ is the data rate of the active link:
$$\Lambda_t^{\mathrm{HS}} = \begin{cases} R_{\mathrm{FSO}} \cdot P_{\mathrm{suc}}^{\mathrm{FSO}}(\rho(p_t)), & \text{if } H_t = \mathrm{FSO}, \\ R_{\mathrm{RF}} \cdot P_{\mathrm{suc}}^{\mathrm{RF}}(\rho(p_t)), & \text{if } H_t = \mathrm{RF}. \end{cases}$$
The terminal reward $R_{\mathrm{suc}}$ for HS-PPO is defined similarly to that for RC-PPO, with episodes terminating upon mission completion or timeout.

4. Simulation and Results

This section presents the cloud field generation, training setup, and evaluation of the proposed HS-PPO and RC-PPO agents. Key parameters are summarized in Table 1.

4.1. Cloud Field Generation

The 3D LWC field is generated using the SMOC model [28], leveraging statistics of cloud cover and average integrated liquid water content (ILWC) from the ERA5 dataset. A log-normal distribution of the cloud liquid water content (CLWC) over the given site is obtained from the information extracted from the dataset. Spatially correlated 3D LWC fields are then synthesized by generating random Gaussian fields based on empirical spatial correlations from ERA5, ensuring realistic cloud structures. Finally, an analytical vertical profile $c(h)$ is applied to modulate the altitude-dependent LWC distribution within each cloud column:
$$c(h) = \begin{cases} \dfrac{W}{b_c^{a_c}\, \Gamma(a_c)} (h - h_0)^{a_c - 1} e^{-(h - h_0)/b_c}, & h \ge h_0, \\ 0, & h < h_0, \end{cases}$$
where $W$ is the columnar ILWC (in kg/m$^2$), $h_0$ is the cloud base altitude, $\Gamma(\cdot)$ denotes the gamma function, and $a_c$, $b_c$ determine the shape of the clouds' vertical profile, given by:
$$a_c = 4.27 \exp\left[ -4.93 (C + 0.06) \right] + 54.12 \exp\left[ -61.25 (C + 0.06) \right] + 1.71, \qquad b_c = 3.17\, a_c^{-3.04} + 0.074,$$
where C is the CLWC. The resulting 3D LWC field is used to compute the specific extinction coefficients of the cloud for the FSO and RF channels according to (7) and (10), respectively.
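The gamma-shaped vertical profile and its shape parameters can be sketched as below. This is a sketch under the assumption that the exponential terms decay with $C$ (and with illustrative numeric inputs); the key property is that integrating $c(h)$ over altitude recovers the columnar ILWC $W$:

```python
import math

def shape_params(C):
    """Shape parameters a_c, b_c as empirical functions of the CLWC C
    (decaying exponentials assumed)."""
    a_c = (4.27 * math.exp(-4.93 * (C + 0.06))
           + 54.12 * math.exp(-61.25 * (C + 0.06)) + 1.71)
    b_c = 3.17 * a_c ** (-3.04) + 0.074
    return a_c, b_c

def lwc_profile(h_km, W, h0_km, a_c, b_c):
    """Vertical LWC profile c(h): W times a gamma probability density
    in (h - h0), so its altitude integral equals W."""
    if h_km < h0_km:
        return 0.0
    z = h_km - h0_km
    return W * z ** (a_c - 1.0) * math.exp(-z / b_c) / (b_c ** a_c * math.gamma(a_c))
```

Numerically integrating `lwc_profile` over altitude should return approximately `W`, which is a convenient self-check on the normalization.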

4.2. PPO Training Configuration

Both RC-PPO and HS-PPO agents employ identical neural network architectures, consisting of a three-layer MLP feature extractor with hidden dimensions [128, 64, 32] and ReLU activations. The feature extractor feeds into separate policy and value heads. The policy head was initialized using orthogonal initialization with a gain of 0.01 to reduce initial entropy. Each cycle collects 200 episodes to form a batch. This batch is reused for 10 update epochs with a mini-batch size of 64. Evaluation is performed with 200 deterministic test episodes employing the static cloud fields generated in Section 4.1. PPO training hyperparameters are detailed in Table 2.
Moreover, for the weighting parameter settings of the reward function (Equations (32) and (35)), an iterative behavior-driven tuning methodology is employed. Specifically, for the RC-PPO agent, α is fixed as the baseline to reflect the primary objective of maximizing throughput; β is selected via grid search to balance the decoding progress signal and throughput exploration; κ ψ and κ d serve as regularization terms to encourage smooth trajectories and movement toward the ground station, respectively. q is set to a large scalar to clearly signify task completion. The specific parameter values are summarized in Table 3.

4.3. Results and Discussion

Figure 2 shows representative HAPS trajectories and their corresponding instantaneous throughput achieved by the HS-PPO and RC-PPO agents under varied cloud conditions. To systematically assess the performance of both schemes, 100 random cloud distributions were first generated using the SMOC model. For each cloud distribution, the ILWC was averaged along the slant path from the starting point to the GS (indicated by the red dashed line in Figure 2). The mean ILWC values were then ranked across the 100 cloud distribution cases to approximately characterize the difficulty of trajectory optimization. Over a typical HAPS mission duration (approximately 0.8–1.3 h for a 100 km flight at 75–120 km/h [29]), the cloud field can be regarded as slowly varying (with a temporal decorrelation scale of 15–29 h [30]). Hence, the use of a static cloud map in each episode is justified.
Representative cloud scenarios were selected at the 25th, 50th, and 75th percentiles of this ranked distribution, corresponding to light, moderate, and heavy cloud-density conditions. The starting point is uniformly fixed at (25 km, 25 km), and the GS is located at (100 km, 100 km).
The overall objective of HAPS trajectory optimization is to approach the GS while actively avoiding dense cloud regions, thereby minimizing signal attenuation and maximizing both throughput and communication coverage of the HAPS. Under light cloud-density conditions (Figure 2a), both HS-PPO and RC-PPO agents tend to advance directly toward the GS, achieving favorable throughput performance due to minimal cloud-induced extinction. As cloud density increases (Figure 2b,c), HS-PPO exhibits substantial throughput degradation attributed to severe FSO link attenuation in dense cloud regions, necessitating RF channel backup. In contrast, RC-PPO achieves higher effective hybrid throughput by dynamically leveraging transient FSO transmission windows while maintaining a reliable RF backup link. This strategy enables RC-PPO to reduce signal interruption durations and feedback latency compared to HS-PPO.
In terms of quantified throughput, under light cloud the average throughput of RC-PPO is 9.34 Gbps, a 12.8% improvement over the 8.28 Gbps of HS-PPO. As cloud density increases to a moderate level, RC-PPO reaches 9.68 Gbps, a significant 89.4% improvement over the 5.11 Gbps of HS-PPO. Under heavy cloud, RC-PPO maintains 7.06 Gbps, a 51.8% improvement over the 4.65 Gbps of HS-PPO. These results demonstrate that RC-PPO delivers substantial throughput gains under light, moderate, and heavy cloud conditions, with the most pronounced advantage over HS-PPO under moderate cloud density. In addition, under heavy cloud-density conditions (Figure 2c), a non-RL scheme is also evaluated, which follows a straight-line path from the starting point directly to the GS without any adaptive cloud avoidance. The results show that RC-PPO achieves the highest average throughput, outperforming both the non-RL RC and HS-PPO schemes. This confirms that the performance gain stems not only from the use of RCs but also from the predictive trajectory optimization enabled by DRL.
Figure 3 illustrates the training performance of the RC-PPO and HS-PPO agents. Although a direct quantitative comparison of their absolute return values is complicated by the differences in reward design, the resulting HAPS trajectories suggest that both schemes achieve effective policy learning. Notably, despite the inherently sparser feedback of the RCs (informative feedback is available only upon successful decoding of entire data blocks), the RC-PPO agent exhibits slightly more stable convergence. This improved stability is attributed to the shaped-reward mechanism specifically designed to provide incremental feedback on packet decoding progress. By incorporating denser reward signals, RC-PPO effectively reduces the variance of policy gradient estimates, thereby improving sample efficiency throughout training.

5. Conclusions

This paper presents an integrated framework that combines RCs with a PPO-based DRL algorithm for HAPS trajectory optimization in a hybrid FSO/RF SAG system. By embedding potential-based reward shaping around decoding progress, the proposed RC-PPO method addresses key limitations of conventional HS schemes, enabling more reliable exploitation of both FSO and RF channels under stochastic cloud obstruction. Simulations demonstrate that RC-PPO improves achievable capacity and transmission reliability. Under moderate cloud scenarios, the average throughput of the RC-PPO agent is 9.68 Gbps, achieving 89.4% improvement compared to that of HS-PPO (5.11 Gbps). Moreover, RC-PPO exhibits more stable training convergence compared to HS-PPO.
The proposed framework can be further extended to multi-HAPS/multi-GS systems by adopting multi-agent reinforcement learning (MARL) strategies, such as Multi-Agent PPO (MA-PPO), in which multiple HAPSs coordinate their trajectories and transmission policies to optimize coverage and load balancing.

Author Contributions

Conceptualization and methodology, B.C.; software and validation, B.C. and S.C.; investigation and data curation, B.C. and L.W.; writing–original draft, B.C.; writing–review and editing, all authors; supervision, S.C.; funding acquisition, Z.Z., L.W., F.W. and S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers U22B2009 and 62271084).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Soft-switching hybrid FSO/RF SAG system.
Figure 2. HAPS trajectories and instantaneous throughput for the HS-PPO and RC-PPO schemes under (a) light, (b) moderate, and (c) heavy cloud-density conditions.
Figure 3. Training performance of HS-PPO and RC-PPO schemes.
Table 1. Simulation parameters.
Parameter | Value
HAPS altitude (H_HAPS) | 20 km
Cloud base altitude (h_0) | 1 km
Cloud maximum altitude (h_max) | 10 km
Receiver aperture diameter (D) | 1 m
PD responsivity (R) | 0.8
Background noise variance (σ_f) | 250 μW
Noise power spectral density (σ_r) | −100 dB/MHz
Optical transmit power (P_t^f) | 1 W
RF transmit power (P_t^r) | 1 W
Telescope gain of transmitter, receiver (G_tx^f, G_rx^f) | 70 dB
Antenna gain of transmitter, receiver (G_tx^r, G_rx^r) | 50 dB
Optical wavelength (λ_f) | 1550 nm
RF frequency (f_r) | 30 GHz
Optical bandwidth (B_f) | 10 GHz
RF bandwidth (B_r) | 500 MHz
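The bandwidth asymmetry in Table 1 explains why cloud-induced FSO outages are so costly. The following sketch is not the paper's channel model (which accounts for gamma-gamma turbulence and cloud extinction); it is only a Shannon-limit sanity check using the Table 1 bandwidths, with illustrative SNR values chosen here:

```python
import math

def shannon_capacity_gbps(bandwidth_hz, snr_db):
    """Shannon limit C = B * log2(1 + SNR), returned in Gbps."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return bandwidth_hz * math.log2(1.0 + snr_linear) / 1e9

# Bandwidths from Table 1; the SNR values are illustrative assumptions.
c_fso = shannon_capacity_gbps(10e9, 3.0)    # 10 GHz optical bandwidth
c_rf  = shannon_capacity_gbps(500e6, 10.0)  # 500 MHz RF bandwidth

# Even at a modest SNR, the wideband FSO link dominates the RF backup link,
# so keeping the FSO path clear of clouds drives most of the throughput.
```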
Table 2. PPO training hyperparameters.
Parameter | Value
Feature layers | [128, 64, 32] (ReLU)
Policy head init. gain | 0.01
Learning rate (α) | 3 × 10^−4
Discount factor (γ) | 0.99
Clip ratio (ε_clip) | 0.2
Entropy coeff. / value coeff. | 0.01 / 0.5
Gradient clip | 0.5
Mini-batch size | 64
Update epochs per batch (K) | 10
Total environment steps | 10^6
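The clip ratio in Table 2 enters PPO through the clipped surrogate objective. A minimal NumPy sketch of that per-sample objective, using ε_clip = 0.2 from Table 2 (function name and example values are for illustration):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate L = min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r is the probability ratio pi_theta(a|s) / pi_theta_old(a|s)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)

# With positive advantages, ratios above 1 + eps gain nothing extra, which
# keeps each of the K update epochs close to the data-collecting policy:
obj = ppo_clip_objective(np.array([0.5, 1.0, 1.5]), np.ones(3))
# obj -> [0.5, 1.0, 1.2]
```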
Table 3. Final tuned reward weights and their tuning rationale.
Weight | Value | Role and Tuning Rationale
Throughput (α) | 1 | Fixed to establish the reward scale, directly reflecting the objective of maximizing data rate.
Decoding progress (β) | 5 | Provides dense gradients for decoding progress; value chosen via grid search.
Heading penalty (κ_ψ) | 0.35 | Penalizes abrupt heading changes; initially small, increased when trajectories exhibited excessive jitter.
Distance penalty (κ_d) | 0.1 | Encourages the agent to maintain proximity to the GS, accelerating convergence; initially small, increased when the agent strayed too far.
Terminal reward (q) | 100 | Task-completion signal; a large scalar reward reinforcing successful mission completion.
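One plausible way the Table 3 weights compose into a per-step reward is sketched below. The exact functional forms (units, normalizations, and the shaping potential) are defined in the paper's methodology; the function and argument names here are hypothetical:

```python
def step_reward(throughput_gbps, d_progress, d_heading_rad, dist_km, done,
                alpha=1.0, beta=5.0, kappa_psi=0.35, kappa_d=0.1, q=100.0):
    """Illustrative composition of the Table 3 weights into a scalar reward."""
    r = alpha * throughput_gbps          # throughput term sets the reward scale
    r += beta * d_progress               # dense decoding-progress shaping term
    r -= kappa_psi * abs(d_heading_rad)  # smoothness: penalize sharp turns
    r -= kappa_d * dist_km               # keep the HAPS near the GS
    if done:
        r += q                           # terminal bonus on mission completion
    return r
```

The large terminal bonus dominates any single step, while the dense throughput and progress terms guide the trajectory between completions.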

Share and Cite

MDPI and ACS Style

Cui, B.; Cai, S.; Wang, L.; Zhang, Z.; Wang, F. Reinforcement Learning-Based Cloud-Aware HAPS Trajectory Optimization in Soft-Switching Hybrid FSO/RF Cooperative Transmission System. Sensors 2026, 26, 948. https://doi.org/10.3390/s26030948
