Article

A New Method for Optimizing Low-Earth-Orbit Satellite Communication Links Based on Deep Reinforcement Learning

The 54th Research Institute of CETC, Shijiazhuang 050081, China
* Author to whom correspondence should be addressed.
Aerospace 2026, 13(3), 285; https://doi.org/10.3390/aerospace13030285
Submission received: 9 January 2026 / Revised: 21 February 2026 / Accepted: 27 February 2026 / Published: 18 March 2026
(This article belongs to the Special Issue Advanced Spacecraft/Satellite Technologies (2nd Edition))

Abstract

In low-Earth-orbit (LEO) satellite networks, the need for intelligent parameter-adjustment strategies has become increasingly critical due to the presence of highly dynamic channel conditions, limited spectrum resources, and complex interference environments. In this paper, a method for optimizing LEO satellite communication links based on deep reinforcement learning (DRL) is proposed. Through the optimization of the transmit power, the modulation and coding scheme (MCS), the beamforming parameters, and the retransmission mechanisms, adaptive link control is achieved in dynamic operational scenarios. A multidimensional state space is constructed, within which the channel state information, the interference environment, and the historical performance metrics are integrated. The spatio-temporal characteristics of the channel are extracted by means of a hybrid neural architecture that incorporates a convolutional neural network (CNN) and a long short-term memory (LSTM) network. To effectively accommodate both continuous and discrete action spaces, a hybrid DRL framework that combines proximal policy optimization (PPO) with a deep Q-network (DQN) is employed, thereby enabling cross-layer optimization of the physical-layer and link-layer parameters. The results demonstrate that substantial improvements in throughput, bit error rate (BER), and transmit-power efficiency are achieved under severely time-varying channel conditions, offering a new approach to resource management and dynamic-environment adaptation in satellite communication systems.

1. Introduction

With the acceleration of the global digital transformation, 5G/6G and space–air–ground integration have become key technological directions for satellite Internet development. LEO satellite communication systems are attracting increasing attention worldwide due to their advantages of low latency, low cost, and high bandwidth [1,2,3]. Large-scale constellation initiatives, represented by Starlink and OneWeb, are being deployed at an accelerating pace worldwide with the objective of providing high-speed broadband services with global coverage [4,5,6]. However, LEO satellite communication links still face a series of substantial technical challenges. The high orbital velocity of LEO satellites leads to rapid temporal variations in channel characteristics, which in turn result in pronounced Doppler frequency shifts and reduced channel coherence time [7,8]. In addition, satellite–ground links are influenced by atmospheric effects, with high-frequency bands being particularly susceptible to impairments such as rain attenuation [9,10]. Moreover, the stringent on-board resource constraints of satellites make power, computational capacity, and spectrum exceptionally valuable [11,12]. Traditional control strategies based on fixed thresholds or static rules are inadequate for such highly dynamic environments, and adaptive coding and modulation (ACM) faces bottlenecks such as a large parameter space, strong environmental dynamics, and conflicting optimization objectives. Therefore, the investigation of new intelligent optimization methodologies for LEO communication links is of significant importance.
There are currently three main approaches to optimizing satellite communication links: fixed-rule methods, ACM-based techniques, and constraint-optimization methods. Biglieri adopted a fixed-rule approach and considered three modulation formats, namely 16-PSK, 16-QAM, and a 16-element amplitude phase keying scheme with two amplitude levels [13]. Ensuring link connectivity through conservative parameter configuration, however, results in low resource utilization efficiency. Bischl et al. verified that ACM is easy to implement in large-scale networks and can effectively meet target error-rate requirements even under deep fading conditions [14]. Huang et al. proposed two efficient ACM schemes and showed that, at the same transmission power, their throughput was nearly six times that of a fixed MCS [15]. ACM technology can dynamically adjust transmission parameters according to channel conditions to adapt to different transmission environments and has been used on many communication satellites [16,17]. However, in highly dynamic environments, this method may suffer from response lag and, because it neglects the coupling among other parameters, may not achieve the expected results in complex situations. In addition, there are optimization-theory-based approaches that use centralized coordination and signal processing to achieve efficient interference management and flexible network adaptation [18]. Such convex optimization frameworks generally rely on precise channel models and have high computational complexity, making them difficult to adapt to highly dynamic environments. The common drawback of current methods is therefore a lack of real-time perception and decision-making capability in highly dynamic environments, in which the complexity of multi-parameter joint optimization cannot be effectively handled.
With the rapid development of artificial intelligence (AI), an increasing number of AI-based methods are being applied in the communication field [19,20,21]. DRL, with its ability to autonomously learn optimal strategies through interaction with the environment, is well suited to highly dynamic and complex environments, providing a new solution for optimizing LEO satellite communication links [22]. Deng et al. proposed an innovative resource management framework for next-generation heterogeneous satellite networks that enables cooperation between independent satellite systems and maximizes resource utilization [23]. Huang et al. investigated the power allocation problem in LEO satellite networks using DRL and further proposed a scheme based on proximal policy optimization (PPO), which can learn the optimal power allocation strategy without any prior information so as to maximize the overall system rate [24]. Across these applications, DRL has shown great potential in the communication field, and current research results demonstrate its effectiveness in handling communication optimization problems [25,26], especially its ability to learn complex mappings from high-dimensional states, providing a solid technical foundation for this paper. However, the optimization objects and control variables in current DRL-based communication optimization methods remain relatively independent. Because the actual LEO communication environment is complex and highly dynamic, communication quality often requires joint decision-making over multiple control variables, including both discrete and continuous variables.
In order to address the challenge of collaborative dynamic optimization of continuous and discrete control variables in LEO satellite communication links, a new method for optimizing LEO satellite communication links based on hybrid DRL is proposed in this paper. On the basis of link-state perception, this method uses an intelligent agent to optimize multiple control variables, such as transmit power, beamforming parameters, coding and modulation schemes, and the retransmission strategy, to maximize communication quality while ensuring link reliability.
The contributions of this paper are twofold: (1) the design of a multi-dimensional state space and a CNN-LSTM feature-extraction network that achieve accurate perception of highly dynamic channel environments; and (2) a hybrid DRL architecture combining PPO and DQN that effectively solves the problem of collaborative decision-making between continuous and discrete parameters in LEO satellite link optimization. The proposed method provides a new technological approach for the intelligent optimization of LEO satellite communication systems, with both theoretical value and practical significance.

2. Dynamic Channel Modeling for LEO Satellite Communication Links

Dynamic channel modeling of LEO satellite communication links is the basis for simulating a high-fidelity channel environment and guaranteeing data authenticity in the DRL algorithm simulation training process. In addition to the expected free-space path loss, the quality of LEO satellite communication links is determined by multiple other factors, such as Doppler frequency shifts caused by high-speed motion, rain attenuation, absorption loss, and tropospheric scintillation caused by the atmosphere, as illustrated in Figure 1. The key time-varying factors affecting communication link performance are modeled in this section, and an environmental state perception model is provided for DRL by generating a channel state information matrix that contains amplitude, phase, and multipath information.

2.1. Free-Space Path Loss Model

According to the theory of electromagnetic wave propagation, the energy of an electromagnetic wave spreads over an expanding wavefront as it propagates, so the energy loss grows with propagation distance. In general, only the direct path is considered when calculating the propagation loss in free space. When the transmitting end uses an ideal point-source antenna, the signal power is distributed evenly over a spherical surface, and the received signal power can be expressed as follows [27]:
P_r = \frac{P_s G_s G_r \lambda^2}{(4\pi d)^2}
where P_s is the transmitted signal power; P_r is the received signal power; G_s is the antenna gain at the transmitting end; G_r is the antenna gain at the receiving end; λ is the wavelength; and d is the propagation distance. When the gains of both the transmitting and receiving antennas are 1, the free-space loss L_fs can be expressed as follows:
L_{fs} = \frac{P_s}{P_r} = \left(\frac{4\pi d}{\lambda}\right)^2
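As a quick numerical check, the two formulas above can be evaluated directly. The sketch below uses illustrative values (a 20 GHz carrier and a 550 km slant range) that are assumptions for demonstration, not parameters from the paper:

```python
import math

def received_power(p_s, g_s, g_r, wavelength, d):
    """Friis equation: received power over a direct free-space path."""
    return p_s * g_s * g_r * wavelength**2 / (4 * math.pi * d)**2

def free_space_loss_db(d, wavelength):
    """Free-space path loss L_fs = (4*pi*d/lambda)^2, expressed in dB."""
    return 20 * math.log10(4 * math.pi * d / wavelength)

# Illustrative LEO downlink: 550 km slant range, 20 GHz carrier (lambda = c/f)
c = 3e8
lam = c / 20e9
print(round(free_space_loss_db(550e3, lam), 1))  # 173.3 (dB)
```

With unit antenna gains, the dB form of the loss agrees with the ratio P_s/P_r from the Friis equation, which is a useful consistency check between the two formulas.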

2.2. Doppler Frequency Offset Model

The Doppler frequency offset depends on the carrier frequency and on the motion of the satellite and the end user, while the Doppler frequency offset change rate describes how quickly the Doppler shift varies over time, as shown in Figure 2.
In an LEO satellite communication system, the satellite continues to move at high speed, and under the same conditions, the corresponding Doppler frequency offset and Doppler frequency offset change rate are also larger than those in ground systems. The Doppler frequency offset can be expressed as follows [28]:
f_{dm} = \frac{v}{c} f \cos\theta
where f is the carrier frequency; v is the relative velocity between the satellite and the ground; θ is the angle between the direction of ground terminal movement and the satellite-to-ground link; and c is the speed of light. In the scenario of an LEO satellite moving at high speed, when the satellite accesses the ground station at the minimum access elevation angle, the component of the velocity vector along the satellite-to-ground link is at its maximum, corresponding to the largest Doppler frequency shift. Conversely, when the elevation angle between the satellite and the ground station is 90°, the velocity component along the satellite-to-ground link is zero, and the corresponding Doppler frequency shift is also zero. Typically, the Doppler frequency shift in the received signal can reach tens to hundreds of kHz.
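The Doppler formula above can be sketched as follows; the Ka-band carrier and the ~7.6 km/s relative velocity are illustrative assumptions typical of LEO, not values given in the paper:

```python
import math

def doppler_shift(f_carrier, v_rel, theta_deg):
    """Doppler offset f_dm = (v/c) * f * cos(theta), in Hz."""
    c = 3e8  # speed of light, m/s
    return (v_rel / c) * f_carrier * math.cos(math.radians(theta_deg))

# Illustrative worst case: velocity aligned with the link (theta = 0)
print(round(doppler_shift(20e9, 7600, 0) / 1e3, 1))  # 506.7 (kHz)
```

This confirms the order of magnitude quoted in the text: for a 20 GHz carrier, the shift reaches several hundred kHz at low elevation and vanishes when the geometry makes the radial velocity component zero.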

2.3. Rain Attenuation Model

Due to the scattering and absorption of electromagnetic waves by rainfall, part of the signal energy is scattered or absorbed by raindrops when satellite signals pass through a rain region. Especially in communication bands above 10 GHz, the impact of rain attenuation on communication quality cannot be ignored [29]. Given the latitude φ (in degrees) of the ground station, the corresponding rain height h_R (in km) can be calculated as follows:
h_R = \begin{cases} 0, & \varphi \le -71 \\ 5 + 0.1(\varphi + 21), & -71 < \varphi \le -21 \\ 5, & -21 < \varphi \le 23 \\ 5 - 0.075(\varphi - 23), & \varphi > 23 \end{cases}
When the satellite signal enters the rain region, with θ the angle between the slant path and the ground, the slant path length L_S can be expressed as follows:
L_S = \frac{h_R - h_a}{\sin\theta}
where h_a is the altitude of the ground station. The horizontal projection distance L_G can be expressed as follows:
L_G = L_S \cos\theta
Based on the slant path length L_S of the satellite signal through the rain region, the rain attenuation L_rain exceeded for 0.01% of the time in an average year can be expressed as follows:
L_{rain} = \gamma_R L_R v_{0.01}
v_{0.01} = \frac{1}{1 + \sqrt{\sin\theta}\left[31\left(1 - e^{-\theta/(1+\beta)}\right)\dfrac{\sqrt{L_R \gamma_R}}{f^2} - 0.45\right]}
L_R = \begin{cases} L_S \cdot r_{0.01}, & r_{0.01} < 1 \\ L_S, & r_{0.01} \ge 1 \end{cases}
r_{0.01} = \frac{1}{1 + 0.78\sqrt{\dfrac{L_G \gamma_R}{f}} - 0.38\left(1 - e^{-2 L_G}\right)}
where γ_R is the specific rain attenuation per unit path length (dB/km) and β is the height correction factor.
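The rain-height and slant-path steps above can be sketched as follows. This is a minimal illustration of the piecewise model as reconstructed here (latitude in degrees, heights in km), not the paper's implementation:

```python
import math

def rain_height_km(lat_deg):
    """Piecewise mean rain height h_R (km) as a function of latitude (deg)."""
    phi = lat_deg
    if phi <= -71:
        return 0.0
    if phi <= -21:
        return 5 + 0.1 * (phi + 21)
    if phi <= 23:
        return 5.0
    return 5 - 0.075 * (phi - 23)

def slant_path_km(h_r, h_station, elev_deg):
    """Slant path L_S = (h_R - h_a)/sin(theta) and its horizontal
    projection L_G = L_S * cos(theta)."""
    l_s = (h_r - h_station) / math.sin(math.radians(elev_deg))
    l_g = l_s * math.cos(math.radians(elev_deg))
    return l_s, l_g

# Illustrative mid-latitude station at sea level, 30-degree elevation
print(rain_height_km(45))              # 3.35 (km)
print(slant_path_km(5.0, 0.0, 30)[0])  # 10.0 (km)
```

The remaining steps (the reduction factors r_{0.01} and v_{0.01} and the final attenuation) follow directly from these path lengths and the specific attenuation γ_R.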

2.4. Other Loss Models

Gas absorption loss L_gas: This loss is mainly caused by oxygen and water vapor and can be calculated according to the ITU-R P.676-11 model; it depends on frequency and elevation angle. In the 10–30 GHz band, this loss typically ranges from 0.1 to 1 dB.
Tropospheric scintillation loss L_scint: This loss consists of rapid fluctuations in signal amplitude induced by atmospheric turbulence and can be estimated using the ITU-R P.618-10 model.
The channel of the LEO satellite communication link can be modeled as a time-varying complex gain. The comprehensive channel gain for a flat-fading channel can be expressed as follows:
h(t) = \sqrt{\frac{G_s G_r}{L_{fs}(t)\, L_{atm}(t)\, L_{other}(t)}} \cdot e^{j\left(2\pi f_{dm}(t)\, t + \phi_0\right)} \cdot \chi(t)
where φ_0 is the initial random phase offset; L_atm(t) is the loss caused by the atmosphere, which satisfies L_atm(t) = L_rain + L_gas + L_scint; L_other(t) represents other losses, reserved for the addition of subsequent dynamic factors; and χ(t) is a random process characterizing small-scale fading, such as the Rician fading model:
\chi(t) = \sqrt{\frac{\alpha}{\alpha + 1}} + \sqrt{\frac{1}{\alpha + 1}} \cdot z(t)
where z(t) is a complex Gaussian random process and α is the Rician factor, which denotes the ratio of the power of the direct path component to that of the multipath scattering component. The higher the elevation angle, the larger α tends to be.
Construction of the CSI matrix: In practical simulation, the channel must be discretized and sampled. The channel state information of the LEO satellite communication system is represented as a complex matrix H ∈ C^{N_r × N_t × N_sc}, where N_r is the number of receiving antennas, N_t is the number of transmitting antennas, and N_sc is the number of subcarriers. Each element h_{i,j,k} of the matrix follows the comprehensive channel model above, encompassing the various dynamic loss terms. In this paper, for the convenience of DRL training simulation, the matrix size is set to N_r = N_t = 16, N_sc = 1, and the complex matrix is reduced in dimensionality. Mean statistical characteristics are used to represent the channel state, and the amplitude and phase of each complex element are normalized to obtain a CSI tensor H ∈ R^{16×16×2}. This high-dimensional, time-varying CSI matrix constitutes the core environmental state for DRL agent training.
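The amplitude/phase normalization step can be sketched as follows, assuming NumPy is available; the Rayleigh-like random channel used to exercise it is an illustrative stand-in for the comprehensive channel model, and the normalization conventions (amplitude to [0, 1], phase to [−1, 1]) are assumptions about the paper's preprocessing:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_csi_tensor(h_complex):
    """Reduce a 16x16 complex channel matrix to a (16, 16, 2) real tensor
    of normalized amplitude and phase, as used for the DRL state."""
    amp = np.abs(h_complex)
    amp = amp / (amp.max() + 1e-12)      # normalize amplitude to [0, 1]
    phase = np.angle(h_complex) / np.pi  # normalize phase to [-1, 1]
    return np.stack([amp, phase], axis=-1)

# Illustrative 16x16 channel realization (N_sc = 1)
h = (rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))) / np.sqrt(2)
csi = build_csi_tensor(h)
print(csi.shape)  # (16, 16, 2)
```

The resulting real-valued tensor is what the CNN branch of the feature extractor would consume, since standard convolutional layers operate on real inputs.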

3. Performance Index Model of Communication Link

The performance index model of an LEO satellite communication system is the basic reference for constructing a reward function in the DRL algorithm. This section describes a weighted comprehensive reward function designed for DRL training optimization, which incorporates multiple key performance indicators such as throughput, bit error rate, transmission delay, and power consumption. This function serves as a learning guide for the DRL agent, enabling it to not only achieve high speeds but also take into account communication reliability, real-time performance, and energy efficiency when exploring strategies, ultimately optimizing overall system performance.

3.1. Normalized Throughput

Throughput is a core metric for measuring the efficiency of data transmission in a communication system. The effective throughput, rather than the theoretical physical-layer peak, is used as the measure, which can be expressed as follows:
Throughput_{norm} = \frac{B \cdot \log_2(M) \cdot (1 - \mathrm{BLER})}{Throughput_{max}}
where B is the channel bandwidth in Hz; M is the modulation order (e.g., M = 4 for QPSK and M = 16 for 16QAM); and BLER is the block error rate, i.e., the probability that a data block (such as a codeword) is received incorrectly. It is related to the bit error rate (BER) but better reflects actual transmission failures after forward error correction (FEC) encoding. Throughput_max is the maximum theoretical throughput of the system, used for normalization.
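The normalized-throughput formula can be evaluated directly; the bandwidth, BLER, and peak-rate values below are illustrative assumptions, not system parameters from the paper:

```python
import math

def normalized_throughput(bandwidth_hz, mod_order, bler, throughput_max):
    """Effective normalized throughput: B * log2(M) * (1 - BLER) / max."""
    return bandwidth_hz * math.log2(mod_order) * (1 - bler) / throughput_max

# Illustrative: 10 MHz channel, 16QAM, 1% block error rate, 40 Mbps peak
print(normalized_throughput(10e6, 16, 0.01, 40e6))  # 0.99
```

Because the metric is normalized by the system peak, it lies in [0, 1] and can be used directly as the primary reward term defined later in Section 3.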

3.2. Bit Error Rate (BER)

The BER is the fundamental indicator of communication reliability and can be calculated theoretically or estimated statistically using Monte Carlo methods. The theoretical BER depends on the MCS in use. For QPSK modulation and M-QAM modulation (M ≥ 16), the BER can be expressed as follows:
BER_{QPSK} \approx \frac{1}{2}\,\mathrm{erfc}\!\left(\sqrt{\frac{E_b}{N_0}}\right)
BER_{M\text{-}QAM} \approx \frac{4}{\log_2(M)}\left(1 - \frac{1}{\sqrt{M}}\right) Q\!\left(\sqrt{\frac{3\log_2(M)}{M - 1} \cdot \frac{E_b}{N_0}}\right)
where E_b is the average signal energy carried by each information bit and N_0 is the noise power per unit bandwidth. E_b/N_0 is the bit energy-to-noise ratio, which is the ultimate expression of the link budget and is related to the received power, noise power, etc. erfc and Q denote the complementary error function and the Gaussian Q function, respectively.
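The two BER expressions can be sketched with the standard-library complementary error function, using the usual identity Q(x) = erfc(x/√2)/2; E_b/N_0 is taken in linear units here:

```python
import math

def q_func(x):
    """Gaussian Q function via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def ber_qpsk(ebn0):
    """Approximate QPSK BER; ebn0 in linear units (not dB)."""
    return 0.5 * math.erfc(math.sqrt(ebn0))

def ber_mqam(m, ebn0):
    """Approximate M-QAM BER (M >= 16); ebn0 in linear units."""
    k = math.log2(m)
    return (4 / k) * (1 - 1 / math.sqrt(m)) * q_func(math.sqrt(3 * k / (m - 1) * ebn0))
```

A dB value is converted before use, e.g. `ber_qpsk(10 ** (ebn0_db / 10))`; this is how the simulator would map the link budget onto an error-rate estimate per candidate MCS.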

3.3. Transmission Delay

Transmission delay is crucial for evaluating communication quality, especially for applications demanding high real-time performance. The total delay D_total can be expressed as follows:
D_{total} = D_{prop} + D_{trans} + D_{proc} + D_{queue}
where D_prop is the propagation delay, determined by the satellite-to-ground distance: D_prop = d/c, where d is the instantaneous satellite-to-ground distance and c is the speed of light. D_trans is the transmission delay, related to the packet length L_packet and symbol rate R_s: D_trans = L_packet / (R_s · log_2 M). D_proc is the processing delay, including the time for encoding, modulation, demodulation, decoding, etc., which can be modeled as a fixed value or a random distribution. D_queue is the queuing delay, which occurs when multiple data streams compete for the same output port in on-board routers; it depends on the traffic load and scheduling algorithm and can be estimated using M/M/1 or more complex queuing-theory models.
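The delay decomposition above can be sketched as follows; the packet length, symbol rate, and fixed processing delay are illustrative assumptions:

```python
import math

def total_delay(d_m, packet_bits, symbol_rate, mod_order, d_proc, d_queue):
    """Total delay D_total = D_prop + D_trans + D_proc + D_queue (seconds)."""
    c = 3e8                                               # speed of light, m/s
    d_prop = d_m / c                                      # propagation delay
    d_trans = packet_bits / (symbol_rate * math.log2(mod_order))  # serialization
    return d_prop + d_trans + d_proc + d_queue

# Illustrative: 1000 km slant range, 12000-bit packet, 10 Msym/s QPSK,
# 1 ms processing delay, empty queue
print(round(total_delay(1e6, 12000, 1e7, 4, 1e-3, 0.0) * 1e3, 3))  # 4.933 (ms)
```

At LEO distances the propagation term (a few milliseconds) typically dominates, which is why the latency constraint used later is on the order of tens of milliseconds.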

3.4. Power Efficiency

The power efficiency is directly related to the energy sustainability and lifespan of satellites. The power consumption of the communication subsystem can be simply expressed as
P_{consumed} = \frac{P_t}{\eta} + P_{static}
where P_t is the transmission power in W; this is a key variable that the DRL agent can directly optimize. η is the efficiency of the power amplifier; typical amplifier efficiencies range from 30% to 60%, i.e., η ∈ [0.3, 0.6]. P_t/η is the actual power consumption of the power amplifier. P_static is the static power consumption, which includes the power required for the normal operation of baseband devices such as modems and digital signal processors.
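The power model is a one-line computation; the transmit power, efficiency, and static draw below are illustrative values within the ranges quoted above:

```python
def consumed_power(p_t, eta, p_static):
    """Communication-subsystem power draw: P_t / eta + P_static (watts)."""
    return p_t / eta + p_static

# Illustrative: 10 W transmit power, 50% amplifier efficiency, 5 W static draw
print(consumed_power(10, 0.5, 5))  # 25.0
```

Note the leverage the amplifier efficiency gives the agent: at low efficiency, each watt of radiated power costs substantially more than a watt at the bus, which is what makes P_t worth optimizing jointly with the MCS.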
The core of optimizing LEO satellite communication links is to maximize system utility while ensuring communication quality. Four key performance indicators were defined above: throughput, BER, transmission delay, and power efficiency. These indicators conflict with one another: improving throughput typically requires higher power and higher-order modulation but increases the BER and power consumption, while reducing power consumption may sacrifice throughput. The essence of the optimization problem is therefore a multi-objective trade-off. To address this issue and avoid subjective weighting, a hierarchical reward structure was designed with constraint satisfaction as a mandatory requirement and throughput optimization as the main objective. Throughput was set as the main reward of the reward function, which can be expressed as follows:
R_{primary} = Throughput_{norm}
The main reward defined in this way encourages the agent to increase the transmission rate as much as possible. Simultaneously, three constraint costs are defined, corresponding to key quality-of-service requirements, which can be expressed as follows:
c_1(t) = \frac{BER(t)}{BER_{th}}, \quad c_2(t) = \frac{D_{total}(t)}{D_{th}}, \quad c_3(t) = \frac{P_{consumed}(t)}{P_{th}}
where BER_th is the error-rate threshold (e.g., 10^{-6}, according to the DVB-S2X standard); D_th is the maximum allowable latency (e.g., 20 ms, according to the 5G uRLLC standard); and P_th is the maximum allowable transmission power (e.g., 40 dBm, depending on the actual satellite payload transmission capability). The long-term constraint conditions can be expressed as follows:
\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} c_k(t) \le 1, \quad k = 1, 2, 3
where T is the total number of time steps; that is, the time-averaged cost over an infinite horizon must not exceed 1. The hierarchical reward structure can be expressed as follows:
R(t) = \begin{cases} R_{primary} + \beta \cdot e^{-\varepsilon \cdot \max\left(0,\, \max_k (c_k - 1)\right)}, & \text{if all } c_k \le 1 \\ -\xi \cdot \sum_{k=1}^{3} \max(0, c_k - 1)^2, & \text{otherwise} \end{cases}
where β, ε, and ξ are adjustable hyperparameters; they do not affect the weight balance between objectives and only control the smoothness of the penalty. In the first case (constraints satisfied), the agent obtains the main throughput reward plus a constraint-margin bonus, encouraging the maintenance of a certain safety margin while satisfying the constraints. In the second case (constraints violated), a quadratic penalty is imposed whose intensity grows with the square of the degree of violation, guiding the agent to return quickly to the feasible region.
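Under the reconstruction above, the hierarchical reward can be sketched as follows. The negative sign on the penalty branch and the default hyperparameter values (β = 0.1, ε = 1, ξ = 1) are assumptions for illustration, not the paper's tuned settings:

```python
import math

def hierarchical_reward(throughput_norm, costs, beta=0.1, eps=1.0, xi=1.0):
    """Hierarchical reward: throughput plus a margin bonus when every
    constraint cost c_k <= 1, otherwise a quadratic penalty on violations."""
    violations = [max(0.0, c - 1.0) for c in costs]
    if all(c <= 1.0 for c in costs):
        # Feasible: main reward plus bonus (the exponent is 0 here, so the
        # bonus equals beta; it shrinks smoothly at the feasibility boundary)
        return throughput_norm + beta * math.exp(-eps * max(violations))
    # Infeasible: quadratic penalty proportional to the violation magnitudes
    return -xi * sum(v**2 for v in violations)

print(hierarchical_reward(0.8, [0.5, 0.5, 0.5]))  # 0.9
print(hierarchical_reward(0.8, [1.5, 0.5, 0.5]))  # -0.25
```

The two printed cases illustrate the hierarchy: a feasible step earns the throughput reward plus the bonus, while a 50% BER-constraint violation flips the reward to a pure penalty regardless of throughput.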

4. Optimization Algorithm of LEO Satellite Communication Link Based on DRL

On the basis of the reward function design described in Section 3, a hybrid DRL architecture combining PPO and DQN, illustrated in Figure 3, is proposed in this section. By collecting multidimensional state information from the communication environment, an original state space is established, and a hybrid structure combining a CNN and LSTM is used to extract spatiotemporal channel features. The extracted feature vector is simultaneously transmitted to both the PPO and DQN branches. The PPO branch can handle continuous action spaces and is responsible for finely adjusting the transmission power and beamforming weights, while the DQN branch can handle discrete decisions and is responsible for selecting modulation and coding schemes, as well as the retransmission time. The communication link is optimized through iterative training of the DRL agent. This approach overcomes the limitations of traditional methods, with which it is difficult to collaboratively optimize continuous and discrete parameters, and can improve the performance of communication systems.

4.1. Design of State Space

In the DRL-based optimization algorithm for LEO satellite communication links, the design of the state space is crucial, as it determines the agent's level of understanding of the environment and its decision quality. Besides the CSI tensor H ∈ R^{16×16×2}, the complete state space s_t also includes multidimensional state variables, which can be expressed as follows:
s_t = \left\{ H_{16 \times 16 \times 2},\; \left[ BER_{hist}, SNR_{hist}, Q_{hist}, P_{hist} \right],\; \left[ Q_{len}, Q_{util}, D_{stats}, P_{util}, S_{pose}, H_{device} \right] \right\}
where Q_len is the number of packets in the current queue. Q_util reflects the system load status, avoids buffer overflow, and guides traffic-control strategies; it satisfies Q_util = Q_len / Q_max, where Q_max is the maximum queue capacity (in packets). D_stats is the delay statistics, D_stats = [μ_delay, σ_delay], where μ_delay and σ_delay are the mean delay and delay standard deviation, respectively. P_util is the current transmission power relative to the maximum power, P_util = P_cur / P_max, where P_cur is the current transmission power and P_max is the maximum allowable transmission power. S_pose is the position and attitude, S_pose = [latitude, longitude, altitude, roll, pitch, yaw]. H_device is the device status information, H_device = [T_amplifier, P_dc], where T_amplifier is the amplifier temperature and P_dc is the DC power consumption. BER_hist is the historical BER sequence, representing the BER measurements over a past period of time, which can be expressed as follows:
BER_{hist} = \left[ BER(t - k), BER(t - k + 1), \ldots, BER(t) \right]
where k is the historical window length (typically 10–100 sampling points) and the sampling interval is generally 1–100 ms. Similarly, the other three historical data states can be expressed as follows:
SNR_{hist} = \left[ SNR(t - k), SNR(t - k + 1), \ldots, SNR(t) \right]
Q_{hist} = \left[ Q_{len}(t - k), Q_{len}(t - k + 1), \ldots, Q_{len}(t) \right]
P_{hist} = \left[ P_{cur}(t - k), P_{cur}(t - k + 1), \ldots, P_{cur}(t) \right]

4.2. Design of Mixed Action Space

The action a_t of the DRL agent is defined as a composite action comprising a continuous component and a discrete component: a_t = (a_cont, a_disc). The continuous action space a_cont is used for fine-tuning radio-frequency parameters, which can be expressed as follows:
a_{cont} = \left[ \Delta P_t, \phi_1, \phi_2, \ldots, \phi_K \right]
where ΔP_t is the transmit-power adjustment. It is a normalized continuous value in [−1, 1]; the actual transmit power is obtained through linear mapping, which can be expressed as
P_t = P_{min} + \frac{\Delta P_t + 1}{2} \left( P_{max} - P_{min} \right)
where φ_1, φ_2, …, φ_K are the beamforming weights (phase offsets); P_max is the maximum allowable transmit power; P_min is the minimum allowable transmit power; and K is the number of array elements in the phased-array antenna. The discrete action space a_disc is used to make category-selection decisions, which can be expressed as
a_{disc} = \left[ \text{MCS Index}, \text{Retry Count} \right]
where the MCS Index is a discrete value establishing the correspondence between an index and an MCS, such as 0: QPSK with coding rate 1/2, 1: 16QAM with coding rate 3/4, and 2: 64QAM with coding rate 5/6. Each index corresponds to a predefined combination of modulation order and coding rate. The Retry Count is the maximum number of retransmissions. In protocols based on automatic repeat request (ARQ), the agent can choose the maximum number of link-layer packet retransmissions, such as 0 (no retransmission), 1, or 2, allowing the agent to strike a balance between latency and reliability.
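The action decoding described above can be sketched as follows; the power range and the small MCS table are illustrative values matching the examples in the text, not a complete configuration:

```python
def map_transmit_power(delta_p, p_min, p_max):
    """Map a normalized action delta in [-1, 1] to watts:
    P_t = P_min + (delta + 1)/2 * (P_max - P_min)."""
    return p_min + (delta_p + 1) / 2 * (p_max - p_min)

# Illustrative MCS lookup indexed by the discrete action component
MCS_TABLE = {0: ("QPSK", 1 / 2), 1: ("16QAM", 3 / 4), 2: ("64QAM", 5 / 6)}

print(map_transmit_power(-1.0, 1.0, 10.0))  # 1.0  (minimum power)
print(map_transmit_power(1.0, 1.0, 10.0))   # 10.0 (maximum power)
print(MCS_TABLE[1])                         # ('16QAM', 0.75)
```

Keeping the continuous action normalized to [−1, 1] and mapping it at the environment boundary is a common DRL practice: the policy network then outputs values on a fixed scale regardless of the payload's actual power limits.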

4.3. Design of Hybrid DRL Algorithm

The original state s t (including CSI matrix, historical BER, etc.) passes through a shared feature extraction network, which employs a CNN-LSTM hybrid structure. The CNN part is used to extract local features of state information with spatial structure, such as the CSI matrix. The LSTM part is used to capture long-term dependencies in time series, such as historical BER and queue status. And partial scalar states are features extracted by fully connected networks. The final extracted feature vector, f t , is simultaneously transmitted to both the PPO and DQN branches, as shown in Figure 3.
The PPO branch comprises an Actor network and a Critic network. Actor network π_θ(a_cont | s_t): the input is the feature vector f_t; the output is the mean and variance of the continuous action a_cont. It defines the probability distribution over continuous actions in state s_t, and its goal is to learn the optimal continuous control policy.
Critic network V_φ(s_t): the input is the feature vector f_t; the output is the state value V(s_t), representing the expected cumulative reward obtainable from state s_t onward. Its goal is to evaluate the quality of the current state, which guides the update of the Actor network.
The PPO algorithm ensures the stability of policy updates through its tailored objective function, which can be expressed as
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\; \mathrm{clip}\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]
where r_t(θ) is the probability ratio between the new and old policies; Â_t is the advantage estimate, indicating how much better or worse an action is relative to the average. It is usually computed from the Critic network and the actual rewards, e.g., Â_t = R_t − V(s_t). ϵ is the clipping hyperparameter; R(t) is the immediate reward obtained at time t; and V(s_t) is the state value output by the Critic network.
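The per-sample clipped objective can be sketched scalar-wise as follows (in practice it is computed over a batch with automatic differentiation; ε = 0.2 is the common default and an assumption here):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, the incentive to raise the ratio is capped at 1+eps
print(ppo_clip_term(1.5, 2.0))   # 2.4  (clipped: 1.2 * 2.0)
# With a negative advantage, the objective is pessimistically clipped too
print(ppo_clip_term(0.5, -2.0))  # -1.6 (clipped: 0.8 * -2.0)
```

The `min` over the raw and clipped terms is what keeps each policy update small: once the ratio leaves the trust interval [1 − ε, 1 + ε], the gradient through the objective vanishes in the favorable direction.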
The DQN branch includes a Q-network Q_ψ(s_t, a_disc) designed to handle discrete decision-making. The input is the feature vector f_t; the output is a Q-value vector in which each element corresponds to the Q-value (expected long-term cumulative reward) of a discrete action combination [MCS Index, Retry Count]. The final action decision selects the discrete action with the maximum Q-value through either a greedy or ε-greedy strategy, which can be expressed as
a_{disc,t} = \arg\max_{a} Q_\psi(f_t, a)
where the maximization is taken over all discrete actions a, and a_disc,t is the action with the highest Q-value in the current state s_t. DQN updates the network through the temporal-difference error, with the goal of minimizing the following loss function:
L(\psi) = \hat{\mathbb{E}}_t \left[ \left( Q_\psi(s_t, a_{disc,t}) - y_t \right)^2 \right]
y_t = R(t) + \gamma \cdot \max_{a'} Q_{\psi'}(s_{t+1}, a')
where y_t is the target Q-value; γ ∈ [0, 1] is the discount factor; Q_{ψ'} is the target Q-network, whose parameters ψ' are periodically copied from the main network ψ to stabilize training; Q_ψ(s_t, a_disc,t) is the predicted Q-value; ψ is the set of all weights and biases in the DQN; R(t) is the immediate reward obtained at time t; and max_{a'} denotes the highest Q-value over all possible actions a' in the next state s_{t+1}.
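The bootstrapped target y_t can be sketched as follows; the terminal-state handling and the γ = 0.99 default are standard DQN conventions assumed here rather than stated in the paper:

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """DQN target y_t = R(t) + gamma * max_a' Q_target(s_{t+1}, a')."""
    if done:  # terminal transition: no bootstrap term
        return reward
    return reward + gamma * max(next_q_values)

# Illustrative: three Q-values from the target network for s_{t+1}
print(td_target(1.0, [0.5, 2.0, -1.0]))  # 2.98  (1.0 + 0.99 * 2.0)
```

Evaluating the max with the slowly-updated target parameters ψ', rather than the online network ψ, is what breaks the feedback loop between the prediction and its own regression target.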
The final collaborative training and decision-making process is as follows:
  • Forward propagation: At each time step, the feature-extraction network extracts features from the original state s_t to obtain the feature vector f_t.
  • Action generation: The PPO-Actor network samples a continuous action a_cont from the current policy, and the DQN Q-network selects the discrete action a_disc with the highest Q-value.
  • Environmental interaction: The composite action (a_cont, a_disc) is executed, and the reward R_t and the next state s_{t+1} are obtained from the environment (the LEO satellite communication link simulator).
  • Experience storage: The transition sample (s_t, a_cont, a_disc, R_t, s_{t+1}) is stored in the experience replay buffer.
  • Network update: A batch of data is sampled from the buffer. The DQN branch is updated by computing the Q-value loss L(ψ) and backpropagating it; the PPO branch is updated by using the sampled data to estimate the advantage function Â_t, maximizing the clipped objective L^CLIP(θ) for the Actor network, and minimizing the value-function error for the Critic network.
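The five steps above can be condensed into a minimal training-loop skeleton. Every component here is a stand-in stub with hypothetical names; a real implementation would plug in the CNN-LSTM extractor, the PPO Actor-Critic, and the DQN branch described in the text:

```python
import random
from collections import deque

def extract_features(state):          # shared CNN-LSTM extractor (stub)
    return state

def ppo_sample_continuous(features):  # PPO-Actor: power, beamforming (stub)
    return {"power_dbm": random.uniform(30, 40)}

def dqn_select_discrete(features):    # DQN: argmax over [MCS, retry] (stub)
    return {"mcs": 3, "retry": 1}

def env_step(state, action):          # LEO link simulator (stub)
    return random.random(), state + 1  # reward R_t, next state s_{t+1}

replay_buffer = deque(maxlen=10_000)
state = 0
for t in range(100):
    f_t = extract_features(state)                           # 1. forward pass
    a_cont = ppo_sample_continuous(f_t)                     # 2. action generation
    a_disc = dqn_select_discrete(f_t)
    reward, next_state = env_step(state, (a_cont, a_disc))  # 3. interaction
    replay_buffer.append((state, a_cont, a_disc, reward, next_state))  # 4. storage
    if len(replay_buffer) >= 32 and t % 10 == 0:
        batch = random.sample(list(replay_buffer), 32)      # 5. network update
        # update_dqn(batch); update_ppo(batch)  -- stubs, see text

    state = next_state

print(len(replay_buffer))  # 100 stored transitions
```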
Applying this method requires consideration of onboard computational resource limitations. The computational complexity per forward inference is as follows. CNN branch: the input is a 16 × 16 × 2 CSI tensor, requiring approximately 1.2 × 10⁶ multiply-add operations. LSTM branch: the input is a 50 × 4 historical sequence, requiring approximately 3.5 × 10⁵ operations. Fully connected branch: approximately 2.1 × 10⁵ operations. In total, approximately 1.76 × 10⁶ floating-point operations (1.76 MFLOPs) are needed per forward inference. Existing commercial-grade onboard AI processors, such as the Xilinx KU060 and NVIDIA Jetson TX2i, can support in-orbit deployment of such lightweight models. The method adopts a two-stage strategy of offline training and online inference, deployed as follows:
  • Offline training phase (ground-assisted): All DRL models are trained at ground stations equipped with high-performance computing clusters. A large amount of diverse scenario data is generated with the STK and NS-3 joint simulation platform, covering different orbital altitudes, weather conditions, and business loads. After training, a lightweight inference model is produced.
  • Online inference phase (onboard deployment): The lightweight inference model is deployed on the LEO satellite's onboard processor to achieve local real-time decision-making.
  • Continuous learning mechanism: The model is updated regularly by the ground station; when passing over a ground station, the satellite receives updated model parameters through the high-speed link.
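As a quick sanity check of the computational budget stated above, the per-branch multiply-add counts can be summed directly. The figures are those given in the text; this is arithmetic only, not a model of the actual network:

```python
# Per-branch multiply-add counts stated in the text (approximate).
cnn_ops = 1.2e6    # 16 x 16 x 2 CSI tensor through the CNN branch
lstm_ops = 3.5e5   # 50 x 4 historical sequence through the LSTM branch
fc_ops = 2.1e5     # fully connected branch

total = cnn_ops + lstm_ops + fc_ops
print(f"{total / 1e6:.2f} MFLOPs per forward inference")  # 1.76 MFLOPs
```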
The technological novelty of the proposed method is reflected in three aspects. 1. A collaborative architecture with shared feature extraction: existing methods typically use parallel but independent network structures, whereas the proposed method uses a CNN-LSTM shared feature-extraction network that enables the PPO and DQN branches to make decisions based on the same deep spatiotemporal features, achieving implicit collaborative optimization of continuous and discrete actions. 2. A hierarchical decision-fusion mechanism: instead of simply concatenating the two types of actions at the output, the method introduces a hierarchical decision mechanism based on constraint satisfaction, prioritizing the communication reliability constraints (BER and latency) before optimizing the throughput target, which makes the decision-making process more interpretable. 3. Space-time feature modeling for satellite communication: existing methods often use fully connected networks to process the state input, whereas the proposed method implements a CNN-LSTM hybrid network designed specifically for the highly dynamic characteristics of satellite channels, capturing the spatial structure and temporal evolution of the CSI and performance indicators.

5. Results and Discussion

The optimization algorithm designed in this paper requires a large number of randomly constructed channel scenarios for offline DRL training. The trained model can then be deployed online and dynamically output the optimal link-configuration parameters based on real-time channel state information. In this section, random DRL training scenarios are designed and simulation results are obtained through examples. The simulation results of the algorithm are compared with those of traditional methods, demonstrating the effectiveness and advancement of the new method.

5.1. Design of Random Training Environment

To ensure that the trained DRL agent possesses strong generalization capabilities, a highly randomized dynamic channel simulation environment was constructed. The core parameters and the range of randomization for the environment are shown in Table 1.
The environmental joint simulation architecture adopts a simulation platform that deeply integrates STK and NS-3, which can be expressed as
P_sim = {P_STK, P_NS-3, I_interface}

where P_sim is the entire joint simulation platform; P_STK is the STK simulation environment, responsible for generating satellite orbits, positions, and the geometric relationships of the satellite-ground links; P_NS-3 is the NS-3 network simulator, responsible for simulating network protocols, packet transmission, and business flows; and I_interface is the interface module between STK and NS-3, which enables real-time data exchange between the two platforms. The data stream D_S(STK→NS-3) from STK to NS-3 can be expressed as

D_S(STK→NS-3) = {pos_sat, pos_gw, L_fs, L_atm, L_other, f_dm}
where pos_sat and pos_gw are the three-dimensional coordinate position vectors of the satellite and the gateway station, respectively, and the dynamic channel losses follow the dynamic channel model in Section 2. The STK and NS-3 joint simulation architecture exchanges real-time data through a TCP/IP interface: STK is responsible for high-precision orbital-dynamics simulation, outputting the satellite position, velocity, and satellite-ground geometric relationships, while NS-3 is responsible for network protocol-stack simulation, implementing the DVB-S2X physical layer, MAC layer, and routing protocols. Python 3.14 programs coordinate the data exchange between the two platforms and drive the training and inference of the DRL agent. The business-model parameters are as follows: VoIP service, packet size of 120 bytes and arrival interval of 20 ms; video stream, bitrate of 2 Mbps and packet size of 1500 bytes; FTP service, file size of 10 MB.
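One possible shape for a single STK-to-NS-3 update message over the TCP/IP interface is sketched below. The JSON field names and all values are hypothetical, chosen only to mirror the data stream D_S defined above; the actual interface format is not specified in the paper:

```python
import json

def pack_stk_update(pos_sat, pos_gw, L_fs, L_atm, L_other, f_dm):
    """Serialize one simulation-step update as a newline-delimited JSON message."""
    msg = {
        "pos_sat": pos_sat,      # satellite position vector [m]
        "pos_gw": pos_gw,        # gateway station position vector [m]
        "L_fs_db": L_fs,         # free-space path loss
        "L_atm_db": L_atm,       # atmospheric (rain) loss
        "L_other_db": L_other,   # other losses
        "f_dm_hz": f_dm,         # Doppler shift
    }
    return json.dumps(msg) + "\n"

# Illustrative values only.
line = pack_stk_update([7.0e6, 0.0, 0.0], [6.371e6, 0.0, 0.0],
                       165.3, 2.1, 0.5, 120e3)
print(json.loads(line)["f_dm_hz"])  # 120000.0
```

Newline-delimited JSON keeps the per-step messages easy to frame over a plain TCP socket, which is one simple way such a co-simulation bridge is commonly built.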

5.2. Training Process and Hyperparameters of DRL

The training of DRL is conducted on high-performance servers. The choice of hyperparameters affects the final training results; the values used in this paper are shown in Table 2.
The DRL training process is as follows:
  • Initialization: Randomly initialize the DRL network parameters of the satellite.
  • Scene loop: Each training episode simulates a complete pass of the satellite over a ground station (approximately 10–15 min of simulated time).
  • Step loop: At each time step (e.g., 10 ms), the agent obtains the state s_t from the environment and outputs the action a_t through the PPO and DQN networks, receiving the reward R_t and the new state s_{t+1}. The experience tuple (s_t, a_t, R_t, s_{t+1}) is stored in the experience replay buffer, which is periodically sampled to update the network parameters. The convergence results during the training process are shown in Figure 4.

5.3. Simulation Results and Analysis

To comprehensively evaluate the adaptability of the proposed DRL method under different orbital altitudes and environmental conditions, 12 representative test scenarios were constructed, covering the key variables encountered in LEO satellite communication, including orbital altitude, weather conditions, initial elevation angle, and business load, in order to verify the generalization performance of the algorithm under different orbital characteristics; the scenarios are listed in Table 3.
According to the scenarios in Table 3, different methods were used to simulate the LEO satellite communication link. The first is a fixed-strategy method that adopts a conservative fixed configuration, with a transmission power of 40 dBm, an MCS of 16QAM 3/4, and a fixed beam; this is the benchmark scheme for many traditional satellites. The second is ACM, a widely studied adaptive method that dynamically adjusts the MCS based on the instantaneous SNR feedback from the receiver through a preset SNR-MCS mapping table, but with a fixed transmission power.
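The ACM baseline's table lookup can be sketched as follows. The SNR thresholds and the MCS ladder here are illustrative placeholders, not the actual DVB-S2X switching points, which are defined in the standard's MODCOD tables:

```python
import bisect

# Illustrative SNR thresholds (dB) separating adjacent MCS levels.
SNR_THRESHOLDS = [2.0, 6.0, 10.0, 14.0]
MCS_LADDER = ["QPSK 1/2", "QPSK 3/4", "16QAM 1/2", "16QAM 3/4", "64QAM 3/4"]

def acm_select_mcs(snr_db):
    """Pick the highest MCS whose SNR threshold the instantaneous SNR meets."""
    return MCS_LADDER[bisect.bisect_right(SNR_THRESHOLDS, snr_db)]

print(acm_select_mcs(4.5))   # QPSK 3/4
print(acm_select_mcs(15.0))  # 64QAM 3/4
```

Because the table is fixed and the transmission power is not adapted, ACM reacts only to the instantaneous SNR, which is the limitation the learning-based methods below are designed to overcome.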
In addition to the two traditional methods, two learning-based methods are included. The first uses PPO alone to optimize all parameters, treating the discrete actions (MCS, retransmission) as continuous values that are discretized after Gaussian-distribution sampling or mapping; this is a common continuous-control approach. The second uses DQN alone to optimize all parameters, discretizing the continuous actions (such as power and beamforming parameters) into finite levels; this is a classic discrete-action learning approach. All methods use the same state space, reward function, and training process and differ only in the action-output structure. The performance comparison results are shown in Table 4.
As shown in Table 4, the hybrid DRL method proposed in this paper demonstrates excellent performance and robustness across different orbital altitudes (550 to 1200 km), meteorological conditions, and business loads. Specifically, as the orbital altitude increases, the free-space path loss grows and the throughput of all methods decreases (e.g., in scenarios S1, S4, and S8). However, the proposed method consistently maintains the highest throughput with the smallest performance degradation, demonstrating its ability to adapt to different orbital characteristics. In harsh LEO operating conditions (rainstorms, low elevation angles, and heavy loads), the BER of the traditional fixed strategy and of adaptive modulation and coding (ACM) deteriorates sharply, whereas the proposed method stabilizes the bit error rate at the 10⁻⁶ level. Compared with the single learning-based algorithms: although PPO performs well in throughput and power control, its discrete-action decision accuracy is insufficient, resulting in a slightly higher BER, while DQN is limited by its discretization granularity, resulting in relatively weak overall performance. The hybrid method proposed in this paper combines the advantages of both and achieves the best performance in all test scenarios, confirming the necessity of jointly optimizing continuous and discrete actions. In addition, in heavy-load scenarios, the proposed method effectively suppresses the growth of queueing delay by adaptively adjusting the transmission power and beamforming, reducing the delay by over 20% compared with the fixed strategy, further highlighting its effectiveness under complex conditions.
In summary, the simulation results fully validate the effectiveness of the new method for optimizing LEO satellite communication links based on the DRL proposed in this paper. Compared with existing technologies, the new method can achieve improvements in throughput, reliability, delay, and energy efficiency in highly dynamic and non-stationary LEO satellite channels through the joint decision-making of intelligent cross-layer parameters. It demonstrates strong environmental adaptability and overall performance advantages and provides a new approach for the next generation of intelligent satellite communication systems.

6. Conclusions

In this paper, a new method for optimizing LEO satellite communication links based on DRL was proposed. The method extracts spatiotemporal features of a multidimensional state space through a CNN-LSTM hybrid network and establishes a hybrid architecture combining PPO and DQN, enabling the DRL agent to perceive the link state in real time and jointly decide on discrete and continuous actions such as output power, beamforming, MCS, and retransmission strategy, while ensuring link reliability and maximizing system performance. The simulation results showed that the new method is effective and advanced. Compared with traditional methods, it has three main advantages: first, it avoids dependence on precise channel models through DRL; second, it achieves multi-parameter collaborative optimization instead of isolated adjustment; and third, it adapts to unknown environments, significantly improving the robustness of the system in dynamic scenarios. The method demonstrates the great potential of DRL in the field of satellite communication and provides a new idea for promoting the development of LEO satellite communication towards autonomy and intelligence.

Author Contributions

Conceptualization, H.Y. and L.W.; methodology, H.Y. and J.W.; software, H.Y. and S.L.; validation, H.Y.; formal analysis, H.Y. and Y.S.; investigation, H.Y.; resources, H.Y.; data curation, H.Y. and J.W.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y.; visualization, H.Y. and Y.S.; supervision, H.Y.; project administration, H.Y.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The authors declare that the data cannot be made public due to privacy concerns.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

All authors were employed by The 54th Research Institute of CETC. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. The channel dynamic model of LEO satellite communication link.
Figure 2. Doppler frequency shift between the LEO satellite and user equipment (UE).
Figure 3. Optimization algorithm of LEO satellite communication link based on DRL.
Figure 4. The simulation convergence results during the training process of DRL.
Table 1. The core parameters and the range of randomization for the environment.
Parameter | Value Range/Distribution
Initial elevation angle | 10°~80°
Orbital altitude | 500~1200 km
Rainfall rate | 0~50 mm/h (exponential distribution)
Number of interference sources | 0~4 (Poisson distribution)
Interference source power | −20~0 dBm (uniform distribution)
Rician factor | 5~15 dB (uniform distribution)
Packet arrival rate | 0.1~1.0 Mbps (uniform distribution)
Table 2. The training hyperparameters of DRL.
Hyperparameter | Value
PPO learning rate | 3 × 10⁻⁴
DQN learning rate | 1 × 10⁻³
Discount factor | 0.99
Experience replay buffer size | 1 × 10⁶
Table 3. Different scenarios in LEO satellite communication.
Number | Altitude | Weather Conditions | Initial Elevation | Business Load
S1 | 550 km | sunny | 60° | Mild
S2 | 550 km | light rain (5 mm/h) | 30° | Moderate
S3 | 550 km | moderate rain (15 mm/h) | 10° | Heavy
S4 | 975 km | sunny | 75° | Moderate
S5 | 975 km | moderate rain (15 mm/h) | 45° | Mild
S6 | 975 km | rainstorm (25 mm/h) | 30° | Moderate
S7 | 975 km | rainstorm (25 mm/h) | 60° | Heavy
S8 | 1200 km | sunny | 45° | Moderate
S9 | 1200 km | light rain (5 mm/h) | 30° | Mild
S10 | 1200 km | moderate rain (15 mm/h) | 60° | Heavy
S11 | 1200 km | rainstorm (25 mm/h) | 45° | Moderate
S12 | 975 km | rainstorm (25 mm/h) | 15° | Heavy
Table 4. The performance comparison results of different methods in different scenarios.
Scenario | Method | Throughput (Mbps) | BER (×10⁻⁶) | Delay (ms) | Power (dBm)
S1 | Fixed strategy | 86.5 | 0.3 | 10.2 | 40.0
S1 | ACM | 93.2 | 0.4 | 9.8 | 40.0
S1 | PPO | 95.8 | 0.3 | 9.5 | 36.5
S1 | DQN | 92.1 | 0.5 | 10.0 | 35.8
S1 | New method | 98.4 | 0.2 | 9.2 | 34.2
S2 | Fixed strategy | 72.3 | 1.2 | 13.5 | 40.0
S2 | ACM | 80.5 | 1.5 | 12.8 | 40.0
S2 | PPO | 83.6 | 1.1 | 12.2 | 37.1
S2 | DQN | 79.8 | 1.4 | 12.6 | 36.3
S2 | New method | 86.2 | 0.8 | 11.8 | 35.0
S3 | Fixed strategy | 48.5 | 8.5 | 20.8 | 40.0
S3 | ACM | 55.2 | 3.8 | 18.5 | 40.0
S3 | PPO | 58.7 | 2.6 | 17.3 | 34.8
S3 | DQN | 54.3 | 3.2 | 18.0 | 33.9
S3 | New method | 62.1 | 1.5 | 16.2 | 32.5
S4 | Fixed strategy | 79.8 | 0.4 | 11.8 | 40.0
S4 | ACM | 86.4 | 0.5 | 11.2 | 40.0
S4 | PPO | 88.9 | 0.4 | 10.8 | 37.2
S4 | DQN | 85.2 | 0.6 | 11.1 | 36.4
S4 | New method | 91.5 | 0.3 | 10.5 | 34.8
S5 | Fixed strategy | 63.2 | 2.8 | 16.2 | 40.0
S5 | ACM | 70.1 | 2.2 | 15.1 | 40.0
S5 | PPO | 73.5 | 1.6 | 14.3 | 35.2
S5 | DQN | 69.8 | 1.9 | 14.8 | 34.5
S5 | New method | 76.8 | 1.0 | 13.7 | 33.1
S6 | Fixed strategy | 45.2 | 28.0 | 22.5 | 40.0
S6 | ACM | 58.7 | 5.2 | 18.9 | 40.0
S6 | PPO | 62.3 | 2.8 | 17.2 | 33.6
S6 | DQN | 55.6 | 3.5 | 18.1 | 32.4
S6 | New method | 69.4 | 1.2 | 15.8 | 31.5
S7 | Fixed strategy | 38.7 | 35.0 | 24.8 | 40.0
S7 | ACM | 50.2 | 6.8 | 20.5 | 40.0
S7 | PPO | 54.5 | 3.5 | 18.6 | 34.2
S7 | DQN | 48.9 | 4.2 | 19.3 | 33.1
S7 | New method | 61.2 | 1.8 | 17.0 | 32.0
S8 | Fixed strategy | 68.5 | 0.5 | 14.5 | 40.0
S8 | ACM | 75.2 | 0.6 | 13.8 | 40.0
S8 | PPO | 77.8 | 0.5 | 13.2 | 37.8
S8 | DQN | 74.3 | 0.7 | 13.6 | 36.9
S8 | New method | 80.1 | 0.4 | 12.9 | 35.5
S9 | Fixed strategy | 60.2 | 1.8 | 17.2 | 40.0
S9 | ACM | 67.5 | 2.0 | 16.1 | 40.0
S9 | PPO | 70.3 | 1.4 | 15.3 | 36.8
S9 | DQN | 66.8 | 1.7 | 15.8 | 36.0
S9 | New method | 73.0 | 0.9 | 14.8 | 34.6
S10 | Fixed strategy | 48.8 | 4.2 | 20.5 | 40.0
S10 | ACM | 55.9 | 2.9 | 18.8 | 40.0
S10 | PPO | 59.4 | 2.0 | 17.5 | 35.8
S10 | DQN | 55.1 | 2.4 | 18.2 | 35.0
S10 | New method | 62.7 | 1.3 | 16.5 | 33.8
S11 | Fixed strategy | 35.6 | 42.0 | 26.5 | 40.0
S11 | ACM | 46.8 | 8.5 | 22.3 | 40.0
S11 | PPO | 51.2 | 4.2 | 20.1 | 35.5
S11 | DQN | 45.5 | 5.1 | 21.0 | 34.2
S11 | New method | 57.5 | 2.1 | 18.8 | 33.2
S12 | Fixed strategy | 28.5 | 95.0 | 28.7 | 40.0
S12 | ACM | 42.1 | 12.8 | 22.3 | 40.0
S12 | PPO | 47.6 | 6.5 | 20.1 | 35.2
S12 | DQN | 41.3 | 7.8 | 21.5 | 34.1
S12 | New method | 53.8 | 3.2 | 18.6 | 33.0

Share and Cite

MDPI and ACS Style

Yu, H.; Li, S.; Wu, J.; Sun, Y.; Wang, L. A New Method for Optimizing Low-Earth-Orbit Satellite Communication Links Based on Deep Reinforcement Learning. Aerospace 2026, 13, 285. https://doi.org/10.3390/aerospace13030285

