Article

Reinforcement Learning-Based Cloud-Aware HAPS Trajectory Optimization in Soft-Switching Hybrid FSO/RF Cooperative Transmission System

1 School of Electronic Engineering, Beijing University of Posts and Telecommunications (BUPT), Beijing 100876, China
2 State Key Laboratory of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications (BUPT), Beijing 100876, China
3 School of Physics and Optoelectronic Engineering, Nanjing University of Information Science and Technology, Nanjing 211544, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(3), 948; https://doi.org/10.3390/s26030948
Submission received: 5 December 2025 / Revised: 24 January 2026 / Accepted: 26 January 2026 / Published: 2 February 2026

Abstract

Space–air–ground systems employing free-space optical (FSO) communication leverage high-altitude platform stations (HAPS) to deliver seamless and ubiquitous connectivity. Although FSO links offer high capacity, they are highly susceptible to cloud extinction, which severely degrades link availability. Hybrid FSO/radio-frequency (RF) transmission and cloud-aware HAPS trajectory optimization can enhance resilience. However, conventional cloud-aware hybrid FSO/RF transmission systems based on hard-switching (HS) between the FSO and RF links suffer from frequent link transitions and unstable throughput. To address these challenges, we propose a joint optimization framework that integrates soft-switching between the FSO and RF links with deep reinforcement learning (DRL) for HAPS trajectory optimization. Soft-switching based on rateless codes (RCs) enables simultaneous transmission over both links, where the receiver accumulates packets until successful decoding, acknowledged with a single feedback message. RC feedback is therefore sparse, which avoids feedback storms but also complicates HAPS trajectory optimization. The DRL agent proactively optimizes HAPS trajectories to avoid cloud cover and maintain link availability. To handle the sparse feedback of RCs during DRL training, a reward-shaped proximal policy optimization (PPO)-based agent is developed to jointly optimize throughput and trajectory smoothness. Simulations using realistic ERA5 data show that RC-PPO achieves higher throughput and smoother trajectories than the HS-PPO baseline.

1. Introduction

The explosive growth of global data traffic has revealed the limitations of terrestrial networks in achieving ubiquitous connectivity [1]. Space–air–ground integrated networks (SAGINs) with free-space optical communication (FSO), using high-altitude platform stations (HAPS) to relay communications between satellites and ground stations, have emerged as a promising solution [2,3]. FSO links offer ultra-high bandwidth but suffer from low reliability due to atmospheric conditions such as clouds and atmospheric turbulence (AT) [4,5]. Hybrid FSO/RF transmission systems leverage RF backup to enhance reliability [6,7,8]. Such hybrid systems have been deployed in dual-hop relay links [9] and SAGIN scenarios to improve coverage and capacity [10]. Conventional hybrid FSO/RF systems employ hard-switching (HS) strategies that select links based on instantaneous SNR values [11], suffering from frequent transitions and unstable throughput. FSO/RF systems based on rateless code (RC), such as Raptor code [12], enable soft-switching by transmitting distinct encoded packets simultaneously over both the FSO and RF channels [13], thus eliminating switching overhead and improving reliability against burst disruptions. To avoid cloud cover, HAPS trajectory optimization is also an effective method to enhance link stability and throughput.
Deep reinforcement learning (DRL) offers a powerful framework for adaptive HAPS trajectory optimization in dynamic atmospheric environments [14]. For example, ref. [15] integrates 5G low-latency connectivity with a Deep Q-Network (DQN) for 3D trajectory optimization. Ref. [16] employs the deep deterministic policy gradient (DDPG) algorithm for coordinating HAPS coverage. However, these studies focus on RF-only aerial communication networks. When extending DRL to hybrid FSO/RF systems, ref. [17] applies the A2C algorithm to optimize the HAPS trajectory in the HS FSO/RF transmission system, which relies on dense, per-step reward signals. This dense-reward strategy is fundamentally incompatible with the RC-based soft-switching FSO/RF system. For RC, a meaningful reward is available only upon successful block decoding, which introduces the critical challenge of reward sparsity. This sparse feedback (reward) severely complicates temporal credit assignment and hinders efficient exploration [18]. Ref. [19] successfully used Proximal Policy Optimization (PPO) with reward shaping to compensate for such reward sparsity when guiding UAVs to their destinations. To the best of our knowledge, no existing work has investigated HAPS trajectory planning in an RC-based soft-switching hybrid FSO/RF system under stochastic cloud dynamics.
This paper proposes a joint optimization framework that integrates rateless-coded physical-layer transmission with PPO-based trajectory learning in a hybrid FSO/RF transmission system. Key contributions are as follows:
  • The cloud-aware HAPS trajectory optimization problem in soft-switching hybrid FSO/RF systems is formulated and solved by a PPO-based DRL approach, under the stochastic moving occluding cloud (SMOC) model derived from the ERA5 dataset.
  • A potential-based reward-shaping mechanism within the PPO framework is developed to mitigate sparse decoding feedback of RCs, delivering faster convergence and superior performance over threshold-based HS-PPO schemes.
The remainder of this paper is organized as follows. Section 2 presents the scheme of the trajectory-optimized HAPS–ground station (GS) hybrid FSO/RF link, FSO/RF channel models, and the throughput of HS and RC-based hybrid FSO/RF systems. Section 3 details the proposed RC-PPO algorithm and the HS-PPO algorithm as a reference and presents the corresponding Markov Decision Process (MDP) formulation and reward shaping strategy. Section 4 provides simulation results. Section 5 gives the conclusion.

2. System Model

This section presents the mathematical framework for the HAPS-assisted hybrid FSO/RF communication system. Firstly, a simplified three-tier SAGIN architecture is introduced. Subsequently, the channel models of the FSO and RF links are presented. Finally, the throughputs of the hybrid FSO/RF systems with HS and RC are elaborated.

2.1. Space–Air–Ground Architecture

As illustrated in Figure 1, we consider a simplified three-tier Space–Air–Ground (SAG) system, which constitutes a fundamental building block of standard SAGINs. The LEO satellite operates at an orbital altitude of approximately 600 km and serves as the primary data source. The HAPS, positioned at $H_{\mathrm{haps}} = 20$ km in the stratosphere, functions as an aerial relay node to bridge the satellite-to-ground communication gap. The satellite-to-HAPS link employs FSO communication to exploit its high bandwidth capacity, while the HAPS-to-GS link utilizes hybrid FSO/RF transmission to provide complementary link diversity.
A discrete-time model is adopted with time step index $t \in \{1, 2, \ldots, T\}$. The position of the HAPS at time $t$ is denoted as $\mathbf{p}_h[t] = (x[t], y[t], H_{\mathrm{haps}})$, where $(x[t], y[t])$ represents the horizontal coordinates and the altitude $H_{\mathrm{haps}}$ remains constant. The GS is located at $\mathbf{p}_g = (0, 0, H_{\mathrm{gs}})$ with $H_{\mathrm{gs}} = 0$. The instantaneous slant distance between HAPS and GS is given by:
$$L[t] = \lVert \mathbf{p}_h[t] - \mathbf{p}_g \rVert = \sqrt{x[t]^2 + y[t]^2 + H_{\mathrm{haps}}^2}. \tag{1}$$
The elevation angle (Figure 1) is $\theta[t] = \arctan\!\left( H_{\mathrm{haps}} \big/ \sqrt{x[t]^2 + y[t]^2} \right)$.
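As a quick sketch, the slant-path geometry above can be computed directly; the helper names below are illustrative, not from the paper:

```python
import math

H_HAPS = 20.0  # HAPS altitude H_haps in km (Section 2.1)

def slant_distance_km(x: float, y: float, h: float = H_HAPS) -> float:
    """Slant distance L[t] between the HAPS at (x, y, h) and the GS at the origin."""
    return math.sqrt(x ** 2 + y ** 2 + h ** 2)

def elevation_angle_rad(x: float, y: float, h: float = H_HAPS) -> float:
    """Elevation angle theta[t] seen from the GS toward the HAPS."""
    return math.atan2(h, math.hypot(x, y))
```

For a HAPS directly above the GS, the slant distance reduces to the altitude and the elevation angle to 90°.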

2.2. Channel Model

2.2.1. FSO Channel Model

The composite channel gain of the downlink FSO channel between HAPS and GS is expressed as:
$$H_{\mathrm{FSO}}[t] = h_g[t] \cdot h_a[t] \cdot h_l[t], \tag{2}$$
where $h_g[t]$ is the geometric loss due to the divergence of the beam over the slant path distance $L[t]$ [20], $h_a[t]$ follows the Gamma–Gamma distribution characterizing AT-induced fading [21], and $h_l[t]$ represents the loss due to cloud-induced attenuation. Among these factors, $h_l[t]$ is critical for optimizing the HAPS trajectory due to its spatial heterogeneity. The cloud-induced attenuation follows the Beer–Lambert law [22]:
$$h_l[t] = \exp(-\sigma L[t]). \tag{3}$$
Here $\sigma$ (km$^{-1}$) is the attenuation coefficient [23], determined from the visibility $V$ (in km) and the wavelength (in nm) as:
$$\sigma = \frac{3.91}{V} \left( \frac{\lambda}{550} \right)^{-q}. \tag{4}$$
Here, $\lambda$ is the optical wavelength, $V$ is the visibility, and $q$ is the size-distribution parameter of the scattering particles, which depends on $V$. The visibility is given by:
$$V = \frac{-\ln(0.002)}{\beta_f[t]}, \tag{5}$$
where $\beta_f[t]$ is the cloud-induced attenuation of the FSO link obtained by integration over the slant path:
$$\beta_f[t] = \frac{1}{\sin(\theta[t])} \int_{h_0}^{h_{\mathrm{top}}} \beta_{\mathrm{ext}}^f(h)\, dh. \tag{6}$$
Here, $\theta[t]$ is the elevation angle, and $h_0$ and $h_{\mathrm{top}}$ are the cloud base and top altitudes. The extinction coefficient $\beta_{\mathrm{ext}}^f(h)$ for the FSO link is given by [24]:
$$\beta_{\mathrm{ext}}^f(h) = \frac{6.51 \times 10^{3} \cdot c(h)}{\rho_w \cdot r_e}, \tag{7}$$
where $\rho_w = 1$ g/cm$^3$ is the water density, $r_e$ ($\mu$m) is the particle effective radius [24], and $c(h)$ is the vertical profile of liquid water content (LWC) in g/m$^3$. The three-dimensional (3D) distribution of LWC is modeled as a spatially correlated stochastic field, as detailed in Section 4.1.
The FSO link employs intensity modulation with direct detection (IM/DD) using on–off keying (OOK). The received optical power is $P_{\mathrm{rx}}^f[t] = P_{\mathrm{tx}}^f G_{\mathrm{tx}}^f G_{\mathrm{rx}}^f H_{\mathrm{FSO}}[t]$, where $P_{\mathrm{tx}}^f$ denotes the transmit optical power and $G_{\mathrm{tx}}^f$ and $G_{\mathrm{rx}}^f$, respectively, denote the transmitter and receiver telescope gains. The instantaneous electrical SNR is given by:
$$\gamma_{\mathrm{FSO}}[t] = \frac{\left( R\, P_{\mathrm{rx}}^f[t] \right)^2}{N_0^f B_f}, \tag{8}$$
where $R$ (A/W) is the photodetector responsivity, $N_0^f$ (A$^2$/Hz) is the noise power spectral density, and $B_f$ (Hz) is the receiver bandwidth of the FSO link.
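The cloud-attenuation chain above (visibility from the integrated extinction, Kruse-type attenuation coefficient, Beer–Lambert loss) can be sketched as follows. The piecewise rule for $q$ is an assumption borrowed from the Kim model (the paper only cites [23] for it), and all numeric arguments are illustrative:

```python
import math

def kim_q(V_km: float) -> float:
    """Size-distribution parameter q as a function of visibility V (km).
    Assumed Kim-model piecewise rule; the paper does not list it explicitly."""
    if V_km > 50:
        return 1.6
    if V_km > 6:
        return 1.3
    if V_km > 1:
        return 0.16 * V_km + 0.34
    if V_km > 0.5:
        return V_km - 0.5
    return 0.0

def cloud_loss(beta_f: float, wavelength_nm: float, L_km: float) -> float:
    """Cloud-induced FSO loss h_l[t] = exp(-sigma * L), with
    V = -ln(0.002)/beta_f and sigma = (3.91/V)(lambda/550)^(-q)."""
    V = -math.log(0.002) / beta_f                      # visibility (km)
    sigma = (3.91 / V) * (wavelength_nm / 550.0) ** (-kim_q(V))
    return math.exp(-sigma * L_km)
```

Larger integrated cloud attenuation shrinks the visibility, inflates $\sigma$, and drives $h_l[t]$ toward zero, which is exactly the spatial signal the trajectory optimizer later exploits.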

2.2.2. RF Channel Model

The RF channel provides complementary connectivity with enhanced reliability under cloud-obscured conditions. The RF channel gain incorporates free-space path loss (FSPL) and cloud-induced attenuation. The FSPL is given by:
$$\mathrm{FSPL}_r[t] = \left( \frac{4\pi L[t] f_c}{c_0} \right)^2, \tag{9}$$
with RF carrier frequency $f_c$ and speed of light $c_0$. The cloud attenuation coefficient $\beta_r[t]$ is computed similarly to (6) for FSO links by path-integrating the specific attenuation coefficient. The specific RF attenuation coefficient $\beta_{\mathrm{ext}}^r(h)$ (km$^{-1}$) is derived from Mie scattering theory and modeled as [25]:
$$\beta_{\mathrm{ext}}^r(h) = k_{\mathrm{ext}}^r \cdot c(h), \tag{10}$$
where $k_{\mathrm{ext}}^r$ (km$^{-1}$/(g·m$^{-3}$)) is the mass extinction coefficient at RF frequency $f_c$.
Similar to (8), the instantaneous SNR of the RF link with QPSK modulation is given by:
$$\gamma_{\mathrm{RF}}[t] = \frac{P_{\mathrm{tx}}^r G_{\mathrm{tx}}^r G_{\mathrm{rx}}^r H_{\mathrm{RF}}[t]}{N_0^r B_r}, \tag{11}$$
where $P_{\mathrm{tx}}^r$ is the transmit power, $G_{\mathrm{tx}}^r$ and $G_{\mathrm{rx}}^r$ are the transmitting and receiving antenna gains, $N_0^r$ (W/Hz) is the spectral density of the RF noise power, and $B_r$ (Hz) is the receiver bandwidth of the RF link.

2.3. Hybrid FSO/RF Systems

This section describes the throughput of hybrid FSO/RF communication systems based on HS and RC. These two schemes differ fundamentally in how they exploit link diversity, which affects trajectory optimization.

2.3.1. Hard Switching

The HS scheme selects either the FSO or RF link at each time step based on instantaneous channel quality. The instantaneous throughput is given by:
$$T_{\mathrm{HS}}[t] = \begin{cases} R_{\mathrm{FSO}}\, P_{\mathrm{suc}}^{\mathrm{FSO}}[t], & \text{if } \gamma_{\mathrm{FSO}}[t] \ge \gamma_{\mathrm{th}}, \\ R_{\mathrm{RF}}\, P_{\mathrm{suc}}^{\mathrm{RF}}[t], & \text{if } \gamma_{\mathrm{FSO}}[t] < \gamma_{\mathrm{th}}, \end{cases} \tag{12}$$
where $\gamma_{\mathrm{th}}$ is the switching threshold, $R_{\mathrm{FSO}}$ and $R_{\mathrm{RF}}$ are the transmitted data rates, and $P_{\mathrm{suc}}^{\mathrm{FSO}}[t]$ and $P_{\mathrm{suc}}^{\mathrm{RF}}[t]$ are the per-packet success probabilities accounting for forward error correction (FEC). The optimal threshold maximizes the expected throughput:
$$\gamma_{\mathrm{th}}^* = \arg\max_{\gamma_{\mathrm{th}}} \mathbb{E}\left[ T_{\mathrm{HS}}[t] \right]. \tag{13}$$
For the hybrid FSO/RF system, the bit error rate (BER) for the FSO link with OOK modulation and the RF link with QPSK modulation depends on the respective signal-to-noise ratio (SNR) at the receiver. The BER for each link can be uniformly expressed as:
$$\mathrm{BER}_i[t] = \frac{1}{2} \operatorname{erfc}\left( \sqrt{\kappa_i \cdot \gamma_i[t]} \right), \tag{14}$$
where $i \in \{f, r\}$ denotes the FSO or RF link, $\gamma_i[t]$ is the instantaneous SNR, and $\kappa_i$ is the modulation-dependent efficiency factor, with $\kappa_f = 0.5$ for OOK and $\kappa_r = 1$ for QPSK. For a packet of $L_p$ bits, assuming FEC capable of correcting up to $t_{\mathrm{corr}}$ bit errors, the packet success probability is computed via the binomial sum:
$$P_{\mathrm{suc}}[t] = \sum_{i=0}^{t_{\mathrm{corr}}} \binom{L_p}{i}\, \mathrm{BER}[t]^i \left( 1 - \mathrm{BER}[t] \right)^{L_p - i}. \tag{15}$$

2.3.2. Rateless Coding

RCs enable the generation of an arbitrary number of encoded packets until the receiver accumulates sufficient packets to decode the original source block [13]. Unlike traditional channel codes with fixed code rates, RCs automatically adjust the number of encoded packets that must be sent for successful decoding based on channel conditions. Raptor code, which represents the state of the art in RCs, consists of a high-rate precode (typically a Low-Density Parity-Check (LDPC) code [26]) followed by a Luby Transform (LT) code [27] and is utilized in this paper.
Let $\mathbf{s} \in \mathrm{GF}(2)^{N_s}$ denote the source symbol vector containing $N_s$ information symbols. The precode generates an intermediate symbol vector (Figure 1):
$$\mathbf{c} = \mathbf{G}_{\mathrm{pre}} \mathbf{s}, \tag{16}$$
where $\mathbf{G}_{\mathrm{pre}}$ is the precode generator matrix producing $N_c$ intermediate symbols. The LT encoder then produces output symbols via:
$$x_i = \bigoplus_{j=1}^{N_c} G_{\mathrm{LT}}(i, j)\, c_j, \tag{17}$$
where $G_{\mathrm{LT}}(i, \cdot)$ is a random sparse row sampled according to a robust soliton degree distribution and $\oplus$ denotes XOR over $\mathrm{GF}(2)$. Each output symbol $x_i$ is formed by XORing a small subset of intermediate symbols, with the subset size and selection determined by the degree distribution.
In the RC scheme, both FSO and RF links transmit different encoded symbols simultaneously. Crucially, any correctly received symbol contributes to decoding, regardless of which link delivered it. This superposition property eliminates the need for link selection decisions and enables full utilization of both links under all channel conditions. The receiver can successfully decode the source data block once it accumulates approximately $N_s(1 + \varepsilon_c)$ distinct encoded symbols, where $\varepsilon_c$ is the average overhead required by the belief-propagation (BP) decoder; this overhead is typically $\varepsilon_c \approx 0.05$–$0.1$ for well-designed Raptor codes [13]. The impact of dynamic SNR fluctuations on the hybrid link is captured through the packet success probabilities $P_{\mathrm{suc}}^{\mathrm{FSO}}$ and $P_{\mathrm{suc}}^{\mathrm{RF}}$ in Equation (15), which determine the rate of correctly received symbols. Therefore, the expected effective throughput for RC is given by:
$$\mathbb{E}\left[ T_{\mathrm{RC}} \right] = \frac{R_{\mathrm{FSO}}\, P_{\mathrm{suc}}^{\mathrm{FSO}} + R_{\mathrm{RF}}\, P_{\mathrm{suc}}^{\mathrm{RF}}}{1 + \varepsilon_c}. \tag{18}$$
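A toy sketch of both ideas: one LT output symbol formed by XORing a random subset of intermediate symbols, and the expected RC throughput with both links contributing. The byte-level symbols and the assumption that the degree has already been drawn from a robust soliton distribution are illustrative simplifications:

```python
import random

def lt_encode_symbol(intermediate, degree, rng):
    """Form one LT output symbol by XORing `degree` randomly chosen
    intermediate symbols (toy byte-level sketch; the degree is assumed
    to be pre-drawn from a robust soliton distribution)."""
    idx = rng.sample(range(len(intermediate)), degree)
    out = 0
    for j in idx:
        out ^= intermediate[j]
    return idx, out

def expected_rc_throughput(r_fso, p_fso, r_rf, p_rf, eps_c=0.05):
    """Expected effective RC throughput: both links add up, discounted
    by the decoding overhead eps_c."""
    return (r_fso * p_fso + r_rf * p_rf) / (1.0 + eps_c)
```

Note how the throughput expression is additive in the two links: a partially clouded FSO link still contributes symbols instead of being abandoned, which is the core advantage over hard switching.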

3. Trajectory Optimization

This section presents the trajectory optimization framework for hybrid FSO/RF communication schemes. Firstly, the PPO algorithm and MDP formulation are introduced as the foundation of DRL. Subsequently, the trajectory optimization strategies for the HS and RC schemes using PPO-based DRL are elaborated.

3.1. Proximal Policy Optimization

PPO is an on-policy actor-critic algorithm that uses a clipped surrogate objective to constrain policy updates, thereby preventing large deviations from the current policy and ensuring stable training. The objective of PPO, which is maximized at each iteration, is given by:
$$L_t^{\mathrm{PPO}}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2 H[\pi_\theta](s_t) \right], \tag{19}$$
where $\hat{\mathbb{E}}_t[\cdot]$ denotes the empirical average over a finite batch of samples, $H[\pi_\theta](s_t)$ denotes an entropy bonus, $L_t^{\mathrm{VF}}(\theta)$ is the value function loss, and $c_1, c_2 > 0$ are weighting coefficients.
The clipped surrogate objective $L_t^{\mathrm{CLIP}}(\theta)$ prevents excessively large policy updates by constraining the probability ratio between the current and old policies:
$$L_t^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( \mu_t(\theta) \hat{A}_t,\ \mathrm{clip}\left( \mu_t(\theta),\, 1 - \epsilon_{\mathrm{clip}},\, 1 + \epsilon_{\mathrm{clip}} \right) \hat{A}_t \right) \right], \tag{20}$$
where $\mathrm{clip}(\mu_t(\theta), 1 - \epsilon_{\mathrm{clip}}, 1 + \epsilon_{\mathrm{clip}})$ constrains the probability ratio $\mu_t(\theta)$ to the interval $[1 - \epsilon_{\mathrm{clip}}, 1 + \epsilon_{\mathrm{clip}}]$, and $\epsilon_{\mathrm{clip}}$ (typically 0.1–0.3) is a hyperparameter. The probability ratio $\mu_t(\theta)$ at time step $t$ is given by:
$$\mu_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}. \tag{21}$$
Moreover, the advantage estimator $\hat{A}_t$ is formulated as a discounted sum of future temporal-difference (TD) residuals $\delta_t$:
$$\hat{A}_t = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l}, \tag{22}$$
where $\gamma \in (0, 1)$ is the discount factor, and the TD residual is defined as:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t). \tag{23}$$
Here, $r_t$ is the immediate reward received from the environment, and $V(s_t)$ is the state value, which evaluates the state $s_t$ under the current policy.
The value function loss $L_t^{\mathrm{VF}}(\theta)$ minimizes the mean squared error (MSE) between predicted state values and empirical returns:
$$L_t^{\mathrm{VF}}(\theta) = \hat{\mathbb{E}}_t\left[ \left( V_\theta(s_t) - R_t \right)^2 \right], \tag{24}$$
where $R_t = \sum_{l=0}^{T-t} \gamma^l r_{t+l}$ is the discounted sum of future rewards. The entropy term $H[\pi_\theta]$ encourages exploration by preventing premature convergence to deterministic policies:
$$H[\pi_\theta](s_t) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\left[ -\log \pi_\theta(a \mid s_t) \right]. \tag{25}$$
Network parameters are updated after collecting a trajectory of length T, using the collected batch to perform multiple epochs of minibatch gradient descent on the PPO objective. The pseudocode of the algorithm is shown in Algorithm 1.
Algorithm 1 PPO for HAPS Trajectory Optimization
Require: Policy and value networks $\pi_\theta$, $V_\theta$; hyperparameters $\gamma$, $\epsilon_{\mathrm{clip}}$, $c_1$, $c_2$, $K$
1: for each training iteration do
2:     Collect trajectory $\mathcal{D} = \{(s_t, a_t, r_t, s_{t+1})\}_{t=0}^{T}$ using $\pi_\theta$
3:     Compute returns $R_t = \sum_{l=0}^{T-t} \gamma^l r_{t+l}$
4:     Compute TD residuals $\delta_t = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)$
5:     Compute advantages $\hat{A}_t = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l}$
6:     Normalize advantages: $\hat{A}_t \leftarrow (\hat{A}_t - \mu)/\sigma$
7:     Store old policy probabilities $\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$
8:     for epoch $k = 1, \ldots, K$ do
9:         for mini-batch $\mathcal{B} \subset \mathcal{D}$ do
10:            Compute $L_t^{\mathrm{PPO}}(\theta) = L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2 H[\pi_\theta](s_t)$
11:            Update $\theta$ via gradient ascent on $L_t^{\mathrm{PPO}}(\theta)$
12:        end for
13:    end for
14: end for
15: return Optimized policy $\pi_{\theta^*}$
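Lines 3–6 of Algorithm 1 (returns, TD residuals, advantages, normalization) can be sketched in plain Python as a single backward pass; the batch/minibatch machinery of lines 8–13 is omitted, and a zero bootstrap value is assumed at the end of the trajectory:

```python
from statistics import mean, pstdev

def compute_advantages(rewards, values, gamma=0.99):
    """Returns R_t, TD residuals delta_t, and normalized advantages A_t
    (Algorithm 1, lines 3-6). `values` carries one extra bootstrap entry
    V(s_{T+1}) at the end, assumed 0 for a finished episode."""
    T = len(rewards)
    returns, deltas, adv = [0.0] * T, [0.0] * T, [0.0] * T
    run_ret = run_adv = 0.0
    for t in reversed(range(T)):
        run_ret = rewards[t] + gamma * run_ret            # line 3
        returns[t] = run_ret
        deltas[t] = rewards[t] + gamma * values[t + 1] - values[t]  # line 4
        run_adv = deltas[t] + gamma * run_adv             # line 5
        adv[t] = run_adv
    m, s = mean(adv), pstdev(adv)                         # line 6
    adv = [(a - m) / (s + 1e-8) for a in adv]
    return returns, deltas, adv
```

The backward recursion avoids the quadratic cost of evaluating each discounted sum independently.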

3.2. Trajectory Optimization with PPO-Based DRL

The HAPS trajectory optimization problem is formalized as a finite-horizon MDP defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the transition probability, and $\mathcal{R}$ is the reward function. The control objective for both the RC and HS schemes is to find an optimal policy maximizing the expected discounted return:
$$\pi^* = \arg\max_{\pi_\theta} \mathbb{E}_{s_0 \sim \mu,\ a_t \sim \pi_\theta(\cdot \mid s_t),\ s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)}\left[ \sum_{t=0}^{T} \gamma^t r_t \right]. \tag{26}$$
The state space, action space, and reward function differ significantly between the RC and HS schemes due to their distinct communication paradigms.

3.2.1. RC-PPO

The state space should contain all information needed for the RC-PPO agent to make decisions and preserve the Markovian property of the environment:
$$s_t^{\mathrm{RC}} = \left( x_t, y_t, \psi_t, \tau_t, \boldsymbol{\rho}_t \right) \in \mathcal{S}^{\mathrm{RC}}, \tag{27}$$
where $(x_t, y_t) \in \mathbb{R}^2$ denotes the current coordinates of the HAPS, $\psi_t \in [0, 2\pi)$ is the heading angle measured clockwise from north, $\tau_t = T - t$ is the number of steps remaining, and $\boldsymbol{\rho}_t = [\rho(p_{t,1}), \ldots, \rho(p_{t,N_p})] \in [0, 1]^{N_p}$ contains cloud extinction coefficients sampled at $N_p$ probe points $\{p_{t,i}\}$ along prospective directions.
Since RC allows both links to transmit simultaneously without switching overhead, the action space contains only navigation decisions:
$$a_t^{\mathrm{RC}} = \Delta\psi_t \in \mathcal{A}^{\mathrm{RC}}, \tag{28}$$
where $\Delta\psi_t \in [-\psi_{\max}, +\psi_{\max}]$ is the discrete heading change, with $\psi_{\max} = 30°$ the maximum single-step turn angle.
State transitions decompose into deterministic kinematic evolution and stochastic cloud field dynamics. For simplicity, the speed of the HAPS is assumed to be a constant $v_0$, and the effects of wind (atmospheric currents) are not considered in the kinematic model. The HAPS position evolves according to:
$$x_{t+1} = x_t + v_0 \Delta t \cos(\psi_t + \Delta\psi_t), \quad y_{t+1} = y_t + v_0 \Delta t \sin(\psi_t + \Delta\psi_t), \quad \psi_{t+1} = (\psi_t + \Delta\psi_t) \bmod 2\pi. \tag{29}$$
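The kinematic update can be sketched as a small step function; the speed and step-length values below are illustrative (72 km/h over a 60 s step), not the paper's simulation settings, and the explicit clamp on the turn command is an assumption about how $\psi_{\max}$ is enforced:

```python
import math

MAX_TURN = math.radians(30)  # psi_max: maximum single-step heading change

def step_kinematics(x, y, psi, d_psi, v0=0.02, dt=60.0):
    """Constant-speed HAPS kinematic update (Eq. (29)); v0 in km/s, dt in s."""
    d_psi = max(-MAX_TURN, min(MAX_TURN, d_psi))   # enforce |d_psi| <= psi_max
    psi_new = (psi + d_psi) % (2.0 * math.pi)
    x_new = x + v0 * dt * math.cos(psi + d_psi)
    y_new = y + v0 * dt * math.sin(psi + d_psi)
    return x_new, y_new, psi_new
```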
The decoding progress $\eta_t \in [0, 1]$, defined as the fraction of successfully received packets relative to the total required, evolves according to:
$$\eta_{t+1} = \min\left( 1,\ \eta_t + \frac{\Lambda_t^{\mathrm{RC}} \Delta t}{N_s (1 + \varepsilon_c)} \right), \tag{30}$$
where $N_s$ is the total number of source packets, $\varepsilon_c$ is the redundancy overhead of the RCs, and $\Lambda_t^{\mathrm{RC}}$ is the aggregate packet-arrival rate from both links. Note that, in real-world deployment, the receiver provides feedback only upon successful decoding of complete data blocks, making $\eta_t$ unavailable for real-time decision-making by the HAPS. The aggregate rate is:
$$\Lambda_t^{\mathrm{RC}} = \sum_{i=1}^{N_p} w_i \left[ R_{\mathrm{FSO}} \cdot P_{\mathrm{suc}}^{\mathrm{FSO}}(\rho(p_{t,i})) + R_{\mathrm{RF}} \cdot P_{\mathrm{suc}}^{\mathrm{RF}}(\rho(p_{t,i})) \right], \tag{31}$$
where $R_{\mathrm{FSO}}$ and $R_{\mathrm{RF}}$ are the transmitted symbol rates of the FSO and RF links, $P_{\mathrm{suc}}^{\mathrm{FSO}}(\rho)$ and $P_{\mathrm{suc}}^{\mathrm{RF}}(\rho)$ denote the probability of successful symbol reception as a function of the cloud extinction loss, and $w_i = \exp(-d_i / d_0)$ are distance-dependent weights, with $d_i$ the distance from the HAPS to the $i$th probe position and $d_0$ a normalization constant, assigning higher influence to nearer probes. At each time step, $N_p$ probe points $\{p_{t,i}\}_{i=1}^{N_p}$ are placed along candidate directions at varying distances from the current HAPS position to estimate future channel conditions.
To maximize the average capacity over the allowed $T$ time steps, the reward function balances multiple objectives and is defined as:
$$r_t^{\mathrm{RC}} = \alpha \Lambda_t^{\mathrm{RC}} + \beta \Delta\eta_t - \kappa_\psi |\Delta\psi_t| - \kappa_d \left( \frac{d_t}{d_0} \right)^2 + q R_{\mathrm{suc}}, \tag{32}$$
where $\Lambda_t^{\mathrm{RC}}$ encourages high instantaneous throughput, $\Delta\eta_t = \eta_{t+1} - \eta_t$ provides dense feedback on decoding progress through potential-based reward shaping, $|\Delta\psi_t|$ discourages sharp heading changes to maintain flight stability, and $(d_t / d_0)^2$ with $d_t = \sqrt{x_t^2 + y_t^2}$ penalizes excessive distance from the GS to accelerate policy convergence and improve link quality. $\alpha$, $\beta$, $\kappa_\psi$, $\kappa_d$, and $q$ are weighting factors that balance the physical and task-driven scales of their corresponding terms.
The terminal reward $R_{\mathrm{suc}}$ indicates mission completion:
$$R_{\mathrm{suc}} = \begin{cases} 1, & \text{if } \eta_t = 1 \text{ and } t \le T, \\ 0, & \text{otherwise.} \end{cases} \tag{33}$$
An episode terminates when either the mission succeeds ($R_{\mathrm{suc}} = 1$) or a timeout occurs ($\tau_t = 0$).
The inclusion of $\Delta\eta_t$ in the reward function addresses the sparse feedback challenge inherent in RC. According to the potential-based shaping theorem [19], the shaping term $\beta \Delta\eta_t$ corresponds to the potential function $\Phi(s) = \beta \eta_t$, which preserves the optimal policy while providing dense training signals.
Importantly, although $\eta_t$ is used during training to accelerate policy learning, the trained policy $\pi_\theta(a_t^{\mathrm{RC}} \mid s_t^{\mathrm{RC}})$ depends only on the observable state $s_t^{\mathrm{RC}} = (x_t, y_t, \psi_t, \tau_t, \boldsymbol{\rho}_t)$ and requires no real-time decoding feedback at deployment.
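The shaped per-step reward of Eq. (32) is simple enough to write out directly; the default weighting values below are placeholders for illustration (the paper tunes them as described in Section 4.2):

```python
def shaped_reward(lam_rc, d_eta, d_psi, d_t, success,
                  alpha=1.0, beta=1.0, k_psi=0.1, k_d=0.1, d0=100.0, q=10.0):
    """Shaped per-step RC-PPO reward; all weights are illustrative."""
    r = (alpha * lam_rc               # instantaneous aggregate rate
         + beta * d_eta               # potential-based shaping on decoding progress
         - k_psi * abs(d_psi)         # smoothness penalty on heading change
         - k_d * (d_t / d0) ** 2)     # distance-to-GS penalty
    if success:
        r += q                        # terminal bonus q * R_suc
    return r
```

Because the shaping term is the difference of a potential in $\eta_t$, it adds dense signal without altering which policy is optimal.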

3.2.2. HS-PPO

For HS-PPO, the state space omits the decoding progress but includes the currently selected link, represented by the indicator $H_t$:
$$s_t^{\mathrm{HS}} = \left( x_t, y_t, \psi_t, \tau_t, H_t, \boldsymbol{\rho}_t \right) \in \mathcal{S}^{\mathrm{HS}},$$
where $H_t \in \{\mathrm{FSO}, \mathrm{RF}\}$ denotes the currently active link. The action space includes both navigation and link selection decisions:
$$a_t^{\mathrm{HS}} = \left( \Delta\psi_t, h_t \right) \in \mathcal{A}^{\mathrm{HS}},$$
where $h_t \in \{0, 1\}$ denotes the link selection (0 for FSO, 1 for RF). The instantaneous reward is defined as:
$$r_t^{\mathrm{HS}} = \alpha \Lambda_t^{\mathrm{HS}} - \kappa_\psi |\Delta\psi_t| - \xi\, \mathbb{I}\left[ H_t \neq H_{t-1} \right] - \kappa_d \left( \frac{d_t}{d_0} \right)^2 + q R_{\mathrm{suc}}, \tag{35}$$
where $\xi > 0$ penalizes frequent link switching, and $\Lambda_t^{\mathrm{HS}}$ is the data rate of the active link:
$$\Lambda_t^{\mathrm{HS}} = \begin{cases} R_{\mathrm{FSO}} \cdot P_{\mathrm{suc}}^{\mathrm{FSO}}(\rho(p_t)), & \text{if } H_t = \mathrm{FSO}, \\ R_{\mathrm{RF}} \cdot P_{\mathrm{suc}}^{\mathrm{RF}}(\rho(p_t)), & \text{if } H_t = \mathrm{RF}. \end{cases}$$
The terminal reward $R_{\mathrm{suc}}$ for HS-PPO is defined similarly to that for RC-PPO, with episodes terminating upon mission completion or timeout.

4. Simulation and Results

This section presents the cloud field generation, training setup, and evaluation of the proposed HS-PPO and RC-PPO agents. Key parameters are summarized in Table 1.

4.1. Cloud Field Generation

The 3D LWC field is generated using the SMOC model [28], leveraging statistics of cloud cover and average integrated liquid water content (ILWC) from the ERA5 dataset. A log-normal distribution of the cloud liquid water content (CLWC) over the given site is obtained from the information extracted from the dataset. Spatially correlated 3D LWC fields are then synthesized by generating random Gaussian fields based on empirical spatial correlations from ERA5, ensuring realistic cloud structures. Finally, an analytical vertical profile $c(h)$ is applied to modulate the altitude-dependent LWC distribution within each cloud column:
$$c(h) = \begin{cases} \dfrac{W}{b_c^{a_c}\, \Gamma(a_c)} (h - h_0)^{a_c - 1} e^{-(h - h_0)/b_c}, & h \ge h_0, \\ 0, & h < h_0, \end{cases}$$
where $W$ is the columnar ILWC (in kg/m$^2$), $h_0$ is the cloud base altitude, $\Gamma(\cdot)$ denotes the gamma function, and $a_c$, $b_c$ determine the shape of the clouds' vertical profile, given by:
$$a_c = 4.27 \exp\left[ -4.93 (C + 0.06) \right] + 54.12 \exp\left[ -61.25 (C + 0.06) \right] + 1.71, \qquad b_c = 3.17\, a_c^{-3.04} + 0.074,$$
where C is the CLWC. The resulting 3D LWC field is used to compute the specific extinction coefficients of the cloud for the FSO and RF channels according to (7) and (10), respectively.
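The gamma-shaped vertical profile and its shape parameters can be sketched as below. This is a sketch under the assumption that the exponential terms decay with $C$ (and with illustrative numeric inputs); the key property is that integrating $c(h)$ over altitude recovers the columnar ILWC $W$:

```python
import math

def shape_params(C):
    """Shape parameters a_c, b_c as empirical functions of the CLWC C
    (decaying exponentials assumed)."""
    a_c = (4.27 * math.exp(-4.93 * (C + 0.06))
           + 54.12 * math.exp(-61.25 * (C + 0.06)) + 1.71)
    b_c = 3.17 * a_c ** (-3.04) + 0.074
    return a_c, b_c

def lwc_profile(h_km, W, h0_km, a_c, b_c):
    """Vertical LWC profile c(h): W times a gamma probability density
    in (h - h0), so its altitude integral equals W."""
    if h_km < h0_km:
        return 0.0
    z = h_km - h0_km
    return W * z ** (a_c - 1.0) * math.exp(-z / b_c) / (b_c ** a_c * math.gamma(a_c))
```

Numerically integrating `lwc_profile` over altitude should return approximately `W`, which is a convenient self-check on the normalization.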

4.2. PPO Training Configuration

Both RC-PPO and HS-PPO agents employ identical neural network architectures, consisting of a three-layer MLP feature extractor with hidden dimensions [128, 64, 32] and ReLU activations. The feature extractor feeds into separate policy and value heads. The policy head was initialized using orthogonal initialization with a gain of 0.01 to reduce initial entropy. Each cycle collects 200 episodes to form a batch. This batch is reused for 10 update epochs with a mini-batch size of 64. Evaluation is performed with 200 deterministic test episodes employing the static cloud fields generated in Section 4.1. PPO training hyperparameters are detailed in Table 2.
Moreover, for the weighting parameter settings of the reward function (Equations (32) and (35)), an iterative behavior-driven tuning methodology is employed. Specifically, for the RC-PPO agent, α is fixed as the baseline to reflect the primary objective of maximizing throughput; β is selected via grid search to balance the decoding progress signal and throughput exploration; κ ψ and κ d serve as regularization terms to encourage smooth trajectories and movement toward the ground station, respectively. q is set to a large scalar to clearly signify task completion. The specific parameter values are summarized in Table 3.

4.3. Results and Discussion

Figure 2 shows representative HAPS trajectories and their corresponding instantaneous throughput achieved by the HS-PPO and RC-PPO agents under varied cloud conditions. To systematically assess the performance of both schemes, 100 random cloud distributions were first generated using the SMOC model. For each cloud distribution, the ILWC was averaged along the slant path from the starting point to the GS (indicated by the red dashed line in Figure 2). The mean ILWC values were then ranked across the 100 cloud distribution cases to approximately characterize the difficulty of trajectory optimization. Over a typical HAPS mission duration (approximately 0.8–1.3 h for a 100 km flight at 75–120 km/h [29]), the cloud field can be regarded as slowly varying (with a temporal decorrelation scale of 15–29 h [30]). Hence, the use of a static cloud map in each episode is justified.
Representative cloud scenarios were selected at the 25th, 50th, and 75th percentiles of this ranked distribution, corresponding to light, moderate, and heavy cloud-density conditions. The starting point is uniformly fixed at (25 km, 25 km), and the GS is located at (100 km, 100 km).
The overall objective of HAPS trajectory optimization is to approach the GS while actively avoiding dense cloud regions, thereby minimizing signal attenuation and maximizing both throughput and communication coverage of the HAPS. Under light cloud-density conditions (Figure 2a), both HS-PPO and RC-PPO agents tend to advance directly toward the GS, achieving favorable throughput performance due to minimal cloud-induced extinction. As cloud density increases (Figure 2b,c), HS-PPO exhibits substantial throughput degradation attributed to severe FSO link attenuation in dense cloud regions, necessitating RF channel backup. In contrast, RC-PPO achieves higher effective hybrid throughput by dynamically leveraging transient FSO transmission windows while maintaining a reliable RF backup link. This strategy enables RC-PPO to reduce signal interruption durations and feedback latency compared to HS-PPO.
In terms of quantified throughput, under light cloud the average throughput of RC-PPO is 9.34 Gbps, a 12.8% improvement over the 8.28 Gbps of HS-PPO. As cloud density increases to a moderate level, RC-PPO reaches 9.68 Gbps, a significant 89.4% improvement over the 5.11 Gbps of HS-PPO. Under heavy cloud, RC-PPO maintains 7.06 Gbps, a 51.8% improvement over the 4.65 Gbps of HS-PPO. These results demonstrate that RC-PPO delivers substantial throughput gains under light, moderate, and heavy cloud conditions, with the most pronounced advantage over HS-PPO under moderate cloud density. In addition, under heavy cloud-density conditions (Figure 2c), a non-RL scheme is also evaluated, which follows a straight-line path from the starting point directly to the GS without any adaptive cloud avoidance. The results show that RC-PPO achieves the highest average throughput, outperforming both the non-RL RC and HS-PPO schemes. This confirms that the performance gain stems not only from the use of RCs but also from the predictive trajectory optimization enabled by DRL.
Figure 3 illustrates the training performance of the RC-PPO and HS-PPO agents. Although a direct quantitative comparison of their absolute return values is complicated by the differences in reward design, the resulting HAPS trajectories suggest that both schemes achieve effective policy learning. Notably, despite the inherently sparser feedback of the RCs (informative feedback is available only upon successful decoding of entire data blocks), the RC-PPO agent exhibits slightly more stable convergence. This improved stability is attributed to the shaped-reward mechanism specifically designed to provide incremental feedback on packet decoding progress. By incorporating denser reward signals, RC-PPO effectively reduces the variance of policy gradient estimates, thereby improving sample efficiency throughout training.

5. Conclusions

This paper presents an integrated framework that combines RCs with a PPO-based DRL algorithm for HAPS trajectory optimization in a hybrid FSO/RF SAG system. By embedding potential-based reward shaping around decoding progress, the proposed RC-PPO method addresses key limitations of conventional HS schemes, enabling more reliable exploitation of both FSO and RF channels under stochastic cloud obstruction. Simulations demonstrate that RC-PPO improves achievable capacity and transmission reliability. Under moderate cloud scenarios, the average throughput of the RC-PPO agent is 9.68 Gbps, achieving 89.4% improvement compared to that of HS-PPO (5.11 Gbps). Moreover, RC-PPO exhibits more stable training convergence compared to HS-PPO.
The proposed framework can be further extended to multi-HAPS/multi-GS systems by adopting multi-agent reinforcement learning (MARL) strategies, such as Multi-Agent PPO (MA-PPO), in which multiple HAPSs coordinate their trajectories and transmission policies to optimize coverage and load balancing.

Author Contributions

Conceptualization and methodology, B.C.; software and validation, B.C. and S.C.; investigation and data curation, B.C. and L.W.; writing–original draft, B.C.; writing–review and editing, all authors; supervision, S.C.; funding acquisition, Z.Z., L.W., F.W. and S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers U22B2009 and 62271084).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Soft-switching hybrid FSO/RF SAG system.
Figure 2. HAPS trajectories and instantaneous throughput for the HS-PPO and RC-PPO schemes under (a) light, (b) moderate, and (c) heavy cloud-density conditions.
Figure 3. Training performance of HS-PPO and RC-PPO schemes.
Table 1. Simulation parameters.
Parameter | Value
HAPS altitude (H_HAPS) | 20 km
Cloud base altitude (h_0) | 1 km
Cloud maximum altitude (h_max) | 10 km
Receiver aperture diameter (D) | 1 m
PD responsivity (R) | 0.8
Background noise variance (σ_f) | 250 μW
Noise power spectral density (σ_r) | −100 dB/MHz
Optical transmit power (P_t^f) | 1 W
RF transmit power (P_t^r) | 1 W
Telescope gain of transmitter, receiver (G_tx^f, G_rx^f) | 70 dB
Antenna gain of transmitter, receiver (G_tx^r, G_rx^r) | 50 dB
Optical wavelength (λ_f) | 1550 nm
RF frequency (f_r) | 30 GHz
Optical bandwidth (B_f) | 10 GHz
RF bandwidth (B_r) | 500 MHz
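The bandwidth asymmetry in Table 1 explains why cloud-induced FSO outages are so costly. The following sketch is not the paper's channel model (which accounts for gamma-gamma turbulence and cloud extinction); it is only a Shannon-limit sanity check using the Table 1 bandwidths, with illustrative SNR values chosen here:

```python
import math

def shannon_capacity_gbps(bandwidth_hz, snr_db):
    """Shannon limit C = B * log2(1 + SNR), returned in Gbps."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return bandwidth_hz * math.log2(1.0 + snr_linear) / 1e9

# Bandwidths from Table 1; the SNR values are illustrative assumptions.
c_fso = shannon_capacity_gbps(10e9, 3.0)    # 10 GHz optical bandwidth
c_rf  = shannon_capacity_gbps(500e6, 10.0)  # 500 MHz RF bandwidth

# Even at a modest SNR, the wideband FSO link dominates the RF backup link,
# so keeping the FSO path clear of clouds drives most of the throughput.
```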
Table 2. PPO training hyperparameters.
Parameter | Value
Feature layers | [128, 64, 32] (ReLU)
Policy head init. gain | 0.01
Learning rate (α) | 3 × 10^−4
Discount factor (γ) | 0.99
Clip ratio (ε_clip) | 0.2
Entropy coeff. / value coeff. | 0.01 / 0.5
Gradient clip | 0.5
Mini-batch size | 64
Update epochs per batch (K) | 10
Total environment steps | 10^6
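The clip ratio in Table 2 enters PPO through the clipped surrogate objective. A minimal NumPy sketch of that per-sample objective, using ε_clip = 0.2 from Table 2 (function name and example values are for illustration):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate L = min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r is the probability ratio pi_theta(a|s) / pi_theta_old(a|s)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)

# With positive advantages, ratios above 1 + eps gain nothing extra, which
# keeps each of the K update epochs close to the data-collecting policy:
obj = ppo_clip_objective(np.array([0.5, 1.0, 1.5]), np.ones(3))
# obj -> [0.5, 1.0, 1.2]
```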
Table 3. Final tuned reward weights and their tuning rationale.
Weight | Value | Role and Tuning Rationale
Throughput (α) | 1 | Fixed to establish the reward scale, directly reflecting the objective of maximizing data rate.
Decoding progress (β) | 5 | Provides dense gradients for decoding progress; value chosen via grid search.
Heading penalty (κ_ψ) | 0.35 | Penalizes abrupt heading changes; initially small, increased when trajectories exhibited excessive jitter.
Distance penalty (κ_d) | 0.1 | Encourages the agent to maintain proximity to the GS, accelerating convergence; initially small, increased when the agent strayed too far.
Terminal reward (q) | 100 | Task-completion signal; a large scalar reward reinforcing successful mission completion.
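One plausible way the Table 3 weights compose into a per-step reward is sketched below. The exact functional forms (units, normalizations, and the shaping potential) are defined in the paper's methodology; the function and argument names here are hypothetical:

```python
def step_reward(throughput_gbps, d_progress, d_heading_rad, dist_km, done,
                alpha=1.0, beta=5.0, kappa_psi=0.35, kappa_d=0.1, q=100.0):
    """Illustrative composition of the Table 3 weights into a scalar reward."""
    r = alpha * throughput_gbps          # throughput term sets the reward scale
    r += beta * d_progress               # dense decoding-progress shaping term
    r -= kappa_psi * abs(d_heading_rad)  # smoothness: penalize sharp turns
    r -= kappa_d * dist_km               # keep the HAPS near the GS
    if done:
        r += q                           # terminal bonus on mission completion
    return r
```

The large terminal bonus dominates any single step, while the dense throughput and progress terms guide the trajectory between completions.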

Share and Cite

MDPI and ACS Style

Cui, B.; Cai, S.; Wang, L.; Zhang, Z.; Wang, F. Reinforcement Learning-Based Cloud-Aware HAPS Trajectory Optimization in Soft-Switching Hybrid FSO/RF Cooperative Transmission System. Sensors 2026, 26, 948. https://doi.org/10.3390/s26030948
