Article

A New Method for Optimizing Low-Earth-Orbit Satellite Communication Links Based on Deep Reinforcement Learning

The 54th Research Institute of CETC, Shijiazhuang 050081, China
* Author to whom correspondence should be addressed.
Aerospace 2026, 13(3), 285; https://doi.org/10.3390/aerospace13030285
Submission received: 9 January 2026 / Revised: 21 February 2026 / Accepted: 27 February 2026 / Published: 18 March 2026
(This article belongs to the Special Issue Advanced Spacecraft/Satellite Technologies (2nd Edition))

Abstract

In low-Earth-orbit (LEO) satellite networks, the need for intelligent parameter-adjustment strategies has become increasingly critical due to the presence of highly dynamic channel conditions, limited spectrum resources, and complex interference environments. In this paper, a method for optimizing LEO satellite communication links based on deep reinforcement learning (DRL) is proposed. Through the optimization of the transmit power, the modulation and coding scheme (MCS), the beamforming parameters, and the retransmission mechanisms, adaptive link control is achieved in dynamic operational scenarios. A multidimensional state space is constructed, within which the channel state information, the interference environment, and the historical performance metrics are integrated. The spatio-temporal characteristics of the channel are extracted by means of a hybrid neural architecture that incorporates a convolutional neural network (CNN) and a long short-term memory (LSTM) network. To effectively accommodate both continuous and discrete action spaces, a hybrid DRL framework that combines proximal policy optimization (PPO) with a deep Q-network (DQN) is employed, thereby enabling cross-layer optimization of the physical-layer and link-layer parameters. The results demonstrate that substantial improvements in throughput, bit error rate (BER), and transmit-power efficiency are achieved under severely time-varying channel conditions, offering a new approach to resource management and dynamic-environment adaptation in satellite communication systems.

1. Introduction

With the acceleration of the global digital transformation, 5G/6G and space–air–ground integration have become key technological directions for satellite Internet development. LEO satellite communication systems are attracting increasing attention worldwide due to their advantages of low latency, low cost, and high bandwidth [1,2,3]. Large-scale constellation initiatives, represented by Starlink and OneWeb, are being deployed at an accelerating pace worldwide with the objective of providing high-speed broadband services with global coverage [4,5,6]. However, LEO satellite communication links still face a series of substantial technical challenges. The high orbital velocity of LEO satellites leads to rapid temporal variations in channel characteristics, which in turn result in pronounced Doppler frequency shifts and reduced channel coherence time [7,8]. In addition, satellite–ground links are influenced by atmospheric effects, with high-frequency bands being particularly susceptible to impairments such as rain attenuation [9,10]. Moreover, the stringent on-board resource constraints of satellites make power, computational capacity, and spectrum exceptionally valuable [11,12]. Traditional control strategies based on fixed thresholds or static rules are inadequate for such highly dynamic environments, and adaptive coding and modulation (ACM) faces bottlenecks such as a large parameter space, strong environmental dynamics, and conflicting optimization objectives. Therefore, the investigation of new intelligent optimization methodologies for LEO communication links is of significant importance.
There are currently three main approaches to optimizing satellite communication links: fixed-rule methods, ACM-based techniques, and constraint-optimization methods. Biglieri adopted a fixed-rule approach and considered three modulation formats, namely 16-PSK, 16-QAM, and a 16-element amplitude phase keying scheme with two amplitude levels [13]. Ensuring link connectivity through conservative parameter configuration, however, results in low resource utilization efficiency. Bischl et al. verified that ACM is easy to implement in large-scale networks and can effectively meet target error-rate requirements even under deep fading conditions [14]. Huang et al. proposed two efficient ACM schemes and showed that, at the same transmission power, their throughput was nearly six times that of a fixed MCS [15]. ACM technology can dynamically adjust transmission parameters according to channel conditions to adapt to different transmission environments and has been used on many communication satellites [16,17]. However, in highly dynamic environments, this method may suffer from response lag and, because it neglects the coupling among other parameters, may not achieve the expected results in complex situations. In addition, there are optimization-theory-based approaches that use centralized coordination and signal processing to achieve efficient interference management and flexible network adaptation [18]. Such convex optimization frameworks generally rely on precise channel models and have high computational complexity, making them difficult to adapt to highly dynamic environments. The common drawback of current methods is therefore a lack of real-time perception and decision-making capability in highly dynamic environments, in which the complexity of multi-parameter joint optimization cannot be effectively handled.
With the rapid development of artificial intelligence (AI), an increasing number of AI-based methods are being applied in the communication field [19,20,21]. DRL, with its ability to autonomously learn optimal strategies through interaction with the environment, is well suited to highly dynamic and complex environments, providing a new solution for optimizing LEO satellite communication links [22]. Deng et al. proposed an innovative resource management framework for next-generation heterogeneous satellite networks that enables cooperation between independent satellite systems and maximizes resource utilization [23]. Huang et al. investigated the power allocation problem in LEO satellite networks using DRL and further proposed a scheme based on proximal policy optimization (PPO), which can learn the optimal power allocation strategy without any prior information so as to maximize the overall system rate [24]. Across these applications, DRL has shown great potential in the communication field, and current research results demonstrate its effectiveness in handling communication optimization problems [25,26], especially its ability to learn complex mappings from high-dimensional states, providing a solid technical foundation for this paper. However, the optimization objects and control variables in current DRL-based communication optimization methods remain relatively independent. Because the actual LEO communication environment is complex and highly dynamic, communication quality often requires joint decision-making over multiple control variables, including both discrete and continuous variables.
In order to address the challenge of collaborative dynamic optimization of continuous and discrete control variables in LEO satellite communication links, a new method for optimizing LEO satellite communication links based on hybrid DRL is proposed in this paper. On the basis of link-state perception, this method uses an intelligent agent to optimize multiple control variables, such as transmit power, beamforming parameters, coding and modulation schemes, and the retransmission strategy, to maximize communication quality while ensuring link reliability.
The contributions of this paper are twofold: (1) the design of a multi-dimensional state space and a CNN-LSTM feature-extraction network that achieve accurate perception of highly dynamic channel environments; and (2) a hybrid DRL architecture combining PPO and DQN that effectively solves the problem of collaborative decision-making between continuous and discrete parameters in LEO satellite link optimization. The proposed method provides a new technological approach for the intelligent optimization of LEO satellite communication systems, with both theoretical value and practical significance.

2. Dynamic Channel Modeling for LEO Satellite Communication Links

Dynamic channel modeling of LEO satellite communication links is the basis for simulating a high-fidelity channel environment and guaranteeing data authenticity in the DRL algorithm simulation training process. In addition to the expected free-space path loss, the quality of LEO satellite communication links is determined by multiple other factors, such as Doppler frequency shifts caused by high-speed motion, rain attenuation, absorption loss, and tropospheric scintillation caused by the atmosphere, as illustrated in Figure 1. The key time-varying factors affecting communication link performance are modeled in this section, and an environmental state perception model is provided for DRL by generating a channel state information matrix that contains amplitude, phase, and multipath information.

2.1. Free-Space Path Loss Model

According to the theory of electromagnetic wave propagation, the energy of an electromagnetic wave spreads over an expanding wavefront as it propagates, so the energy loss grows with propagation distance. In general, only the direct path is considered when calculating the propagation loss in free space. When the transmitting end uses an ideal point-source antenna, the signal power is distributed evenly over a spherical surface, and the received signal power can be expressed as follows [27]:
P_r = \frac{P_s G_s G_r \lambda^2}{(4\pi d)^2}
where P_s is the transmitted signal power; P_r is the received signal power; G_s is the antenna gain at the transmitting end; G_r is the antenna gain at the receiving end; λ is the wavelength; and d is the propagation distance. When the gains of both the transmitting and receiving antennas are 1, the free-space loss L_fs can be expressed as follows:
L_{fs} = \frac{P_s}{P_r} = \left(\frac{4\pi d}{\lambda}\right)^2
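As a quick numerical check, the two formulas above can be evaluated directly. The sketch below uses illustrative values (a 20 GHz carrier and a 550 km slant range) that are assumptions for demonstration, not parameters from the paper:

```python
import math

def received_power(p_s, g_s, g_r, wavelength, d):
    """Friis equation: received power over a direct free-space path."""
    return p_s * g_s * g_r * wavelength**2 / (4 * math.pi * d)**2

def free_space_loss_db(d, wavelength):
    """Free-space path loss L_fs = (4*pi*d/lambda)^2, expressed in dB."""
    return 20 * math.log10(4 * math.pi * d / wavelength)

# Illustrative LEO downlink: 550 km slant range, 20 GHz carrier (lambda = c/f)
c = 3e8
lam = c / 20e9
print(round(free_space_loss_db(550e3, lam), 1))  # 173.3 (dB)
```

With unit antenna gains, the dB form of the loss agrees with the ratio P_s/P_r from the Friis equation, which is a useful consistency check between the two formulas.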

2.2. Doppler Frequency Offset Model

The Doppler frequency offset depends on the carrier frequency and on the motion of the satellite and the end user, while the Doppler frequency offset change rate describes how quickly the Doppler shift varies over time, as shown in Figure 2.
In an LEO satellite communication system, the satellite continues to move at high speed, and under the same conditions, the corresponding Doppler frequency offset and Doppler frequency offset change rate are also larger than those in ground systems. The Doppler frequency offset can be expressed as follows [28]:
f_{dm} = \frac{v}{c} f \cos\theta
where f is the carrier frequency; v is the relative velocity between the satellite and the ground; θ is the angle between the direction of ground terminal movement and the satellite-to-ground link; and c is the speed of light. In the scenario of an LEO satellite moving at high speed, when the satellite accesses the ground station at the minimum access elevation angle, the component of the velocity vector along the satellite-to-ground link is at its maximum, corresponding to the largest Doppler frequency shift. Conversely, when the elevation angle between the satellite and the ground station is 90°, the velocity component along the satellite-to-ground link is zero, and the corresponding Doppler frequency shift is also zero. Typically, the Doppler frequency shift in the received signal can reach tens to hundreds of kHz.
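The Doppler formula above can be sketched as follows; the Ka-band carrier and the ~7.6 km/s relative velocity are illustrative assumptions typical of LEO, not values given in the paper:

```python
import math

def doppler_shift(f_carrier, v_rel, theta_deg):
    """Doppler offset f_dm = (v/c) * f * cos(theta), in Hz."""
    c = 3e8  # speed of light, m/s
    return (v_rel / c) * f_carrier * math.cos(math.radians(theta_deg))

# Illustrative worst case: velocity aligned with the link (theta = 0)
print(round(doppler_shift(20e9, 7600, 0) / 1e3, 1))  # 506.7 (kHz)
```

This confirms the order of magnitude quoted in the text: for a 20 GHz carrier, the shift reaches several hundred kHz at low elevation and vanishes when the geometry makes the radial velocity component zero.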

2.3. Rain Attenuation Model

Due to the scattering and absorption of electromagnetic waves by rainfall, part of the signal energy is scattered or absorbed by raindrops when satellite signals pass through a rain region. Especially in communication bands above 10 GHz, the impact of rain attenuation on communication quality cannot be ignored [29]. Given the latitude φ (in degrees) of the ground station, the corresponding rain height h_R (in km) can be calculated as follows:
h_R = \begin{cases} 0, & \varphi \le -71 \\ 5 + 0.1(\varphi + 21), & -71 < \varphi \le -21 \\ 5, & -21 < \varphi \le 23 \\ 5 - 0.075(\varphi - 23), & \varphi > 23 \end{cases}
When the satellite signal enters the rain region, with θ the angle between the slant path and the ground, the slant path length L_S can be expressed as follows:
L_S = \frac{h_R - h_a}{\sin\theta}
where h_a is the altitude of the ground station. The horizontal projection distance L_G can be expressed as follows:
L_G = L_S \cos\theta
Based on the slant path length L_S of the satellite signal through the rain region, the rain attenuation L_rain exceeded for 0.01% of the time in an average year can be expressed as follows:
L_{rain} = \gamma_R L_R v_{0.01}
v_{0.01} = \frac{1}{1 + \sqrt{\sin\theta}\left[31\left(1 - e^{-\theta/(1+\beta)}\right)\dfrac{\sqrt{L_R \gamma_R}}{f^2} - 0.45\right]}
L_R = \begin{cases} L_S \cdot r_{0.01}, & r_{0.01} < 1 \\ L_S, & r_{0.01} \ge 1 \end{cases}
r_{0.01} = \frac{1}{1 + 0.78\sqrt{\dfrac{L_G \gamma_R}{f}} - 0.38\left(1 - e^{-2 L_G}\right)}
where γ_R is the specific rain attenuation per unit path length (dB/km) and β is the height correction factor.
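The rain-height and slant-path steps above can be sketched as follows. This is a minimal illustration of the piecewise model as reconstructed here (latitude in degrees, heights in km), not the paper's implementation:

```python
import math

def rain_height_km(lat_deg):
    """Piecewise mean rain height h_R (km) as a function of latitude (deg)."""
    phi = lat_deg
    if phi <= -71:
        return 0.0
    if phi <= -21:
        return 5 + 0.1 * (phi + 21)
    if phi <= 23:
        return 5.0
    return 5 - 0.075 * (phi - 23)

def slant_path_km(h_r, h_station, elev_deg):
    """Slant path L_S = (h_R - h_a)/sin(theta) and its horizontal
    projection L_G = L_S * cos(theta)."""
    l_s = (h_r - h_station) / math.sin(math.radians(elev_deg))
    l_g = l_s * math.cos(math.radians(elev_deg))
    return l_s, l_g

# Illustrative mid-latitude station at sea level, 30-degree elevation
print(rain_height_km(45))              # 3.35 (km)
print(slant_path_km(5.0, 0.0, 30)[0])  # 10.0 (km)
```

The remaining steps (the reduction factors r_{0.01} and v_{0.01} and the final attenuation) follow directly from these path lengths and the specific attenuation γ_R.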

2.4. Other Loss Models

Gas absorption loss L_gas: This loss is mainly caused by oxygen and water vapor and can be calculated according to the ITU-R P.676-11 model; it depends on frequency and elevation angle. In the 10–30 GHz band, this loss typically ranges from 0.1 to 1 dB.
Tropospheric scintillation loss L_scint: This loss consists of rapid fluctuations in signal amplitude induced by atmospheric turbulence and can be estimated using the ITU-R P.618-10 model.
The channel of the LEO satellite communication link can be modeled as a time-varying complex gain. The comprehensive channel gain for a flat-fading channel can be expressed as follows:
h(t) = \sqrt{\frac{G_s G_r}{L_{fs}(t)\, L_{atm}(t)\, L_{other}(t)}} \cdot e^{j\left(2\pi f_{dm}(t)\, t + \phi_0\right)} \cdot \chi(t)
where φ_0 is the initial random phase offset; L_atm(t) is the loss caused by the atmosphere, which satisfies L_atm(t) = L_rain + L_gas + L_scint; L_other(t) represents other losses, reserved for the addition of subsequent dynamic factors; and χ(t) is a random process characterizing small-scale fading, such as the Rician fading model:
\chi(t) = \sqrt{\frac{\alpha}{\alpha + 1}} + \sqrt{\frac{1}{\alpha + 1}} \cdot z(t)
where z(t) is a complex Gaussian random process and α is the Rician factor, which denotes the ratio of the power of the direct path component to that of the multipath scattering component. The higher the elevation angle, the larger α tends to be.
Construction of the CSI matrix: In practical simulation, the channel must be discretized and sampled. The channel state information of the LEO satellite communication system is represented as a complex matrix H ∈ C^{N_r × N_t × N_sc}, where N_r is the number of receiving antennas, N_t is the number of transmitting antennas, and N_sc is the number of subcarriers. Each element h_{i,j,k} of the matrix follows the comprehensive channel model above, encompassing the various dynamic loss terms. In this paper, for the convenience of DRL training simulation, the matrix size is set to N_r = N_t = 16, N_sc = 1, and the complex matrix is reduced in dimensionality. Mean statistical characteristics are used to represent the channel state, and the amplitude and phase of each complex element are normalized to obtain a CSI tensor H ∈ R^{16×16×2}. This high-dimensional, time-varying CSI matrix constitutes the core environmental state for DRL agent training.
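The amplitude/phase normalization step can be sketched as follows, assuming NumPy is available; the Rayleigh-like random channel used to exercise it is an illustrative stand-in for the comprehensive channel model, and the normalization conventions (amplitude to [0, 1], phase to [−1, 1]) are assumptions about the paper's preprocessing:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_csi_tensor(h_complex):
    """Reduce a 16x16 complex channel matrix to a (16, 16, 2) real tensor
    of normalized amplitude and phase, as used for the DRL state."""
    amp = np.abs(h_complex)
    amp = amp / (amp.max() + 1e-12)      # normalize amplitude to [0, 1]
    phase = np.angle(h_complex) / np.pi  # normalize phase to [-1, 1]
    return np.stack([amp, phase], axis=-1)

# Illustrative 16x16 channel realization (N_sc = 1)
h = (rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))) / np.sqrt(2)
csi = build_csi_tensor(h)
print(csi.shape)  # (16, 16, 2)
```

The resulting real-valued tensor is what the CNN branch of the feature extractor would consume, since standard convolutional layers operate on real inputs.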

3. Performance Index Model of Communication Link

The performance index model of an LEO satellite communication system is the basic reference for constructing a reward function in the DRL algorithm. This section describes a weighted comprehensive reward function designed for DRL training optimization, which incorporates multiple key performance indicators such as throughput, bit error rate, transmission delay, and power consumption. This function serves as a learning guide for the DRL agent, enabling it to not only achieve high speeds but also take into account communication reliability, real-time performance, and energy efficiency when exploring strategies, ultimately optimizing overall system performance.

3.1. Normalized Throughput

Throughput is a core metric for measuring the efficiency of data transmission in a communication system. The effective throughput, rather than the theoretical physical-layer peak, is used as the measure, which can be expressed as follows:
Throughput_{norm} = \frac{B \cdot \log_2(M) \cdot (1 - \mathrm{BLER})}{Throughput_{max}}
where B is the channel bandwidth in Hz; M is the modulation order (e.g., M = 4 for QPSK and M = 16 for 16QAM); and BLER is the block error rate, i.e., the probability that a data block (such as a codeword) is received incorrectly. It is related to the bit error rate (BER) but better reflects actual transmission failures after forward error correction (FEC) encoding. Throughput_max is the maximum theoretical throughput of the system, used for normalization.
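The normalized-throughput formula can be evaluated directly; the bandwidth, BLER, and peak-rate values below are illustrative assumptions, not system parameters from the paper:

```python
import math

def normalized_throughput(bandwidth_hz, mod_order, bler, throughput_max):
    """Effective normalized throughput: B * log2(M) * (1 - BLER) / max."""
    return bandwidth_hz * math.log2(mod_order) * (1 - bler) / throughput_max

# Illustrative: 10 MHz channel, 16QAM, 1% block error rate, 40 Mbps peak
print(normalized_throughput(10e6, 16, 0.01, 40e6))  # 0.99
```

Because the metric is normalized by the system peak, it lies in [0, 1] and can be used directly as the primary reward term defined later in Section 3.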

3.2. Bit Error Rate (BER)

The BER is the fundamental indicator of communication reliability and can be calculated theoretically or estimated statistically using Monte Carlo methods. The theoretical BER depends on the MCS in use. For QPSK modulation and M-QAM modulation (M ≥ 16), the BER can be expressed as follows:
BER_{QPSK} \approx \frac{1}{2}\,\mathrm{erfc}\!\left(\sqrt{\frac{E_b}{N_0}}\right)
BER_{M\text{-}QAM} \approx \frac{4}{\log_2(M)}\left(1 - \frac{1}{\sqrt{M}}\right) Q\!\left(\sqrt{\frac{3\log_2(M)}{M - 1} \cdot \frac{E_b}{N_0}}\right)
where E_b is the average signal energy carried by each information bit and N_0 is the noise power per unit bandwidth. E_b/N_0 is the bit energy-to-noise ratio, which is the ultimate expression of the link budget and is related to the received power, noise power, etc. erfc and Q denote the complementary error function and the Gaussian Q function, respectively.
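The two BER expressions can be sketched with the standard-library complementary error function, using the usual identity Q(x) = erfc(x/√2)/2; E_b/N_0 is taken in linear units here:

```python
import math

def q_func(x):
    """Gaussian Q function via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def ber_qpsk(ebn0):
    """Approximate QPSK BER; ebn0 in linear units (not dB)."""
    return 0.5 * math.erfc(math.sqrt(ebn0))

def ber_mqam(m, ebn0):
    """Approximate M-QAM BER (M >= 16); ebn0 in linear units."""
    k = math.log2(m)
    return (4 / k) * (1 - 1 / math.sqrt(m)) * q_func(math.sqrt(3 * k / (m - 1) * ebn0))
```

A dB value is converted before use, e.g. `ber_qpsk(10 ** (ebn0_db / 10))`; this is how the simulator would map the link budget onto an error-rate estimate per candidate MCS.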

3.3. Transmission Delay

Transmission delay is crucial for evaluating communication quality, especially for applications demanding high real-time performance. The total delay D_total can be expressed as follows:
D_{total} = D_{prop} + D_{trans} + D_{proc} + D_{queue}
where D_prop is the propagation delay, determined by the satellite-to-ground distance: D_prop = d/c, where d is the instantaneous satellite-to-ground distance and c is the speed of light. D_trans is the transmission delay, related to the packet length L_packet and symbol rate R_s: D_trans = L_packet / (R_s · log_2 M). D_proc is the processing delay, including the time for encoding, modulation, demodulation, decoding, etc., which can be modeled as a fixed value or a random distribution. D_queue is the queuing delay, which occurs when multiple data streams compete for the same output port in on-board routers; it depends on the traffic load and scheduling algorithm and can be estimated using M/M/1 or more complex queuing-theory models.
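The delay decomposition above can be sketched as follows; the packet length, symbol rate, and fixed processing delay are illustrative assumptions:

```python
import math

def total_delay(d_m, packet_bits, symbol_rate, mod_order, d_proc, d_queue):
    """Total delay D_total = D_prop + D_trans + D_proc + D_queue (seconds)."""
    c = 3e8                                               # speed of light, m/s
    d_prop = d_m / c                                      # propagation delay
    d_trans = packet_bits / (symbol_rate * math.log2(mod_order))  # serialization
    return d_prop + d_trans + d_proc + d_queue

# Illustrative: 1000 km slant range, 12000-bit packet, 10 Msym/s QPSK,
# 1 ms processing delay, empty queue
print(round(total_delay(1e6, 12000, 1e7, 4, 1e-3, 0.0) * 1e3, 3))  # 4.933 (ms)
```

At LEO distances the propagation term (a few milliseconds) typically dominates, which is why the latency constraint used later is on the order of tens of milliseconds.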

3.4. Power Efficiency

The power efficiency is directly related to the energy sustainability and lifespan of satellites. The power consumption of the communication subsystem can be simply expressed as
P_{consumed} = \frac{P_t}{\eta} + P_{static}
where P_t is the transmission power in W; this is a key variable that the DRL agent can directly optimize. η is the efficiency of the power amplifier; typical amplifier efficiencies range from 30% to 60%, i.e., η ∈ [0.3, 0.6]. P_t/η is the actual power consumption of the power amplifier. P_static is the static power consumption, which includes the power required for the normal operation of baseband devices such as modems and digital signal processors.
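The power model is a one-line computation; the transmit power, efficiency, and static draw below are illustrative values within the ranges quoted above:

```python
def consumed_power(p_t, eta, p_static):
    """Communication-subsystem power draw: P_t / eta + P_static (watts)."""
    return p_t / eta + p_static

# Illustrative: 10 W transmit power, 50% amplifier efficiency, 5 W static draw
print(consumed_power(10, 0.5, 5))  # 25.0
```

Note the leverage the amplifier efficiency gives the agent: at low efficiency, each watt of radiated power costs substantially more than a watt at the bus, which is what makes P_t worth optimizing jointly with the MCS.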
The core of optimizing LEO satellite communication links is to maximize system utility while ensuring communication quality. Four key performance indicators were defined above: throughput, BER, transmission delay, and power efficiency. These indicators conflict with one another: improving throughput typically requires higher power and higher-order modulation but increases the BER and power consumption, while reducing power consumption may sacrifice throughput. The essence of the optimization problem is therefore a multi-objective trade-off. To address this issue and avoid subjective weighting, a hierarchical reward structure was designed with constraint satisfaction as a mandatory requirement and throughput optimization as the main objective. Throughput was set as the main reward of the reward function, which can be expressed as follows:
R_{primary} = Throughput_{norm}
The main reward defined in this way encourages the agent to increase the transmission rate as much as possible. Simultaneously, three constraint costs are defined, corresponding to key quality-of-service requirements, which can be expressed as follows:
c_1(t) = \frac{BER(t)}{BER_{th}}, \quad c_2(t) = \frac{D_{total}(t)}{D_{th}}, \quad c_3(t) = \frac{P_{consumed}(t)}{P_{th}}
where BER_th is the error-rate threshold (e.g., 10^{-6}, according to the DVB-S2X standard); D_th is the maximum allowable latency (e.g., 20 ms, according to the 5G uRLLC standard); and P_th is the maximum allowable transmission power (e.g., 40 dBm, depending on the actual satellite payload transmission capability). The long-term constraint conditions can be expressed as follows:
\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} c_k(t) \le 1, \quad k = 1, 2, 3
where T is the total number of time steps; that is, the time-averaged cost over an infinite horizon must not exceed 1. The hierarchical reward structure can be expressed as follows:
R(t) = \begin{cases} R_{primary} + \beta \cdot e^{-\varepsilon \cdot \max\left(0,\, \max_k (c_k - 1)\right)}, & \text{if all } c_k \le 1 \\ -\xi \cdot \sum_{k=1}^{3} \max(0, c_k - 1)^2, & \text{otherwise} \end{cases}
where β, ε, and ξ are adjustable hyperparameters; they do not affect the weight balance between objectives and only control the smoothness of the penalty. In the first case (constraints satisfied), the agent obtains the main throughput reward plus a constraint-margin bonus, encouraging the maintenance of a certain safety margin while satisfying the constraints. In the second case (constraints violated), a quadratic penalty is imposed whose intensity grows with the square of the degree of violation, guiding the agent to return quickly to the feasible region.
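Under the reconstruction above, the hierarchical reward can be sketched as follows. The negative sign on the penalty branch and the default hyperparameter values (β = 0.1, ε = 1, ξ = 1) are assumptions for illustration, not the paper's tuned settings:

```python
import math

def hierarchical_reward(throughput_norm, costs, beta=0.1, eps=1.0, xi=1.0):
    """Hierarchical reward: throughput plus a margin bonus when every
    constraint cost c_k <= 1, otherwise a quadratic penalty on violations."""
    violations = [max(0.0, c - 1.0) for c in costs]
    if all(c <= 1.0 for c in costs):
        # Feasible: main reward plus bonus (the exponent is 0 here, so the
        # bonus equals beta; it shrinks smoothly at the feasibility boundary)
        return throughput_norm + beta * math.exp(-eps * max(violations))
    # Infeasible: quadratic penalty proportional to the violation magnitudes
    return -xi * sum(v**2 for v in violations)

print(hierarchical_reward(0.8, [0.5, 0.5, 0.5]))  # 0.9
print(hierarchical_reward(0.8, [1.5, 0.5, 0.5]))  # -0.25
```

The two printed cases illustrate the hierarchy: a feasible step earns the throughput reward plus the bonus, while a 50% BER-constraint violation flips the reward to a pure penalty regardless of throughput.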

4. Optimization Algorithm of LEO Satellite Communication Link Based on DRL

On the basis of the reward function design described in Section 3, a hybrid DRL architecture combining PPO and DQN, illustrated in Figure 3, is proposed in this section. By collecting multidimensional state information from the communication environment, an original state space is established, and a hybrid structure combining a CNN and LSTM is used to extract spatiotemporal channel features. The extracted feature vector is simultaneously transmitted to both the PPO and DQN branches. The PPO branch can handle continuous action spaces and is responsible for finely adjusting the transmission power and beamforming weights, while the DQN branch can handle discrete decisions and is responsible for selecting modulation and coding schemes, as well as the retransmission time. The communication link is optimized through iterative training of the DRL agent. This approach overcomes the limitations of traditional methods, with which it is difficult to collaboratively optimize continuous and discrete parameters, and can improve the performance of communication systems.

4.1. Design of State Space

In the DRL-based optimization algorithm for LEO satellite communication links, the design of the state space is crucial, as it determines the agent's level of understanding of the environment and its decision quality. Besides the CSI tensor H ∈ R^{16×16×2}, the complete state space s_t also includes multidimensional state variables, which can be expressed as follows:
s_t = \left\{ H_{16 \times 16 \times 2},\; \left[ BER_{hist}, SNR_{hist}, Q_{hist}, P_{hist} \right],\; \left[ Q_{len}, Q_{util}, D_{stats}, P_{util}, S_{pose}, H_{device} \right] \right\}
where Q_len is the number of packets in the current queue. Q_util reflects the system load status, avoids buffer overflow, and guides traffic-control strategies; it satisfies Q_util = Q_len / Q_max, where Q_max is the maximum queue capacity (in packets). D_stats is the delay statistics, D_stats = [μ_delay, σ_delay], where μ_delay and σ_delay are the mean delay and delay standard deviation, respectively. P_util is the current transmission power relative to the maximum power, P_util = P_cur / P_max, where P_cur is the current transmission power and P_max is the maximum allowable transmission power. S_pose is the position and attitude, S_pose = [latitude, longitude, altitude, roll, pitch, yaw]. H_device is the device status information, H_device = [T_amplifier, P_dc], where T_amplifier is the amplifier temperature and P_dc is the DC power consumption. BER_hist is the historical BER sequence, representing the BER measurements over a past period of time, which can be expressed as follows:
BER_{hist} = \left[ BER(t - k), BER(t - k + 1), \ldots, BER(t) \right]
where k is the historical window length (typically 10–100 sampling points) and the sampling interval is generally 1–100 ms. Similarly, the other three historical data states can be expressed as follows:
SNR_{hist} = \left[ SNR(t - k), SNR(t - k + 1), \ldots, SNR(t) \right]
Q_{hist} = \left[ Q_{len}(t - k), Q_{len}(t - k + 1), \ldots, Q_{len}(t) \right]
P_{hist} = \left[ P_{cur}(t - k), P_{cur}(t - k + 1), \ldots, P_{cur}(t) \right]

4.2. Design of Mixed Action Space

The action a_t of the DRL agent is defined as a composite action comprising a continuous component and a discrete component: a_t = (a_cont, a_disc). The continuous action space a_cont is used for fine-tuning radio-frequency parameters, which can be expressed as follows:
a_{cont} = \left[ \Delta P_t, \phi_1, \phi_2, \ldots, \phi_K \right]
where ΔP_t is the transmit-power adjustment. It is a normalized continuous value in [−1, 1]; the actual transmit power is obtained through linear mapping, which can be expressed as
P_t = P_{min} + \frac{\Delta P_t + 1}{2} \left( P_{max} - P_{min} \right)
where φ_1, φ_2, …, φ_K are the beamforming weights (phase offsets); P_max is the maximum allowable transmit power; P_min is the minimum allowable transmit power; and K is the number of array elements in the phased-array antenna. The discrete action space a_disc is used to make category-selection decisions, which can be expressed as
a_{disc} = \left[ \text{MCS Index}, \text{Retry Count} \right]
where the MCS Index is a discrete value establishing the correspondence between an index and an MCS, such as 0: QPSK with coding rate 1/2, 1: 16QAM with coding rate 3/4, and 2: 64QAM with coding rate 5/6. Each index corresponds to a predefined combination of modulation order and coding rate. The Retry Count is the maximum number of retransmissions. In protocols based on automatic repeat request (ARQ), the agent can choose the maximum number of link-layer packet retransmissions, such as 0 (no retransmission), 1, or 2, allowing the agent to strike a balance between latency and reliability.
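The action decoding described above can be sketched as follows; the power range and the small MCS table are illustrative values matching the examples in the text, not a complete configuration:

```python
def map_transmit_power(delta_p, p_min, p_max):
    """Map a normalized action delta in [-1, 1] to watts:
    P_t = P_min + (delta + 1)/2 * (P_max - P_min)."""
    return p_min + (delta_p + 1) / 2 * (p_max - p_min)

# Illustrative MCS lookup indexed by the discrete action component
MCS_TABLE = {0: ("QPSK", 1 / 2), 1: ("16QAM", 3 / 4), 2: ("64QAM", 5 / 6)}

print(map_transmit_power(-1.0, 1.0, 10.0))  # 1.0  (minimum power)
print(map_transmit_power(1.0, 1.0, 10.0))   # 10.0 (maximum power)
print(MCS_TABLE[1])                         # ('16QAM', 0.75)
```

Keeping the continuous action normalized to [−1, 1] and mapping it at the environment boundary is a common DRL practice: the policy network then outputs values on a fixed scale regardless of the payload's actual power limits.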

4.3. Design of Hybrid DRL Algorithm

The original state s t (including CSI matrix, historical BER, etc.) passes through a shared feature extraction network, which employs a CNN-LSTM hybrid structure. The CNN part is used to extract local features of state information with spatial structure, such as the CSI matrix. The LSTM part is used to capture long-term dependencies in time series, such as historical BER and queue status. And partial scalar states are features extracted by fully connected networks. The final extracted feature vector, f t , is simultaneously transmitted to both the PPO and DQN branches, as shown in Figure 3.
The PPO branch comprises an Actor network and a Critic network. Actor network π_θ(a_cont | s_t): the input is the feature vector f_t; the output is the mean and variance of the continuous action a_cont. It defines the probability distribution over continuous actions in state s_t, and its goal is to learn the optimal continuous control policy.
Critic network V_φ(s_t): the input is the feature vector f_t; the output is the state value V(s_t), representing the expected cumulative reward obtainable from state s_t onward. Its goal is to evaluate the quality of the current state, which guides the update of the Actor network.
The PPO algorithm ensures the stability of policy updates through its tailored objective function, which can be expressed as
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\; \mathrm{clip}\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]
where r_t(θ) is the probability ratio between the new and old policies; Â_t is the advantage estimate, indicating how much better or worse an action is relative to the average. It is usually computed from the Critic network and the actual rewards, e.g., Â_t = R_t − V(s_t). ϵ is the clipping hyperparameter; R(t) is the immediate reward obtained at time t; and V(s_t) is the state value output by the Critic network.
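The per-sample clipped objective can be sketched scalar-wise as follows (in practice it is computed over a batch with automatic differentiation; ε = 0.2 is the common default and an assumption here):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, the incentive to raise the ratio is capped at 1+eps
print(ppo_clip_term(1.5, 2.0))   # 2.4  (clipped: 1.2 * 2.0)
# With a negative advantage, the objective is pessimistically clipped too
print(ppo_clip_term(0.5, -2.0))  # -1.6 (clipped: 0.8 * -2.0)
```

The `min` over the raw and clipped terms is what keeps each policy update small: once the ratio leaves the trust interval [1 − ε, 1 + ε], the gradient through the objective vanishes in the favorable direction.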
The DQN branch includes a Q-network Q_ψ(s_t, a_disc) designed to handle discrete decision-making. The input is the feature vector f_t; the output is a Q-value vector in which each element corresponds to the Q-value (expected long-term cumulative reward) of a discrete action combination [MCS Index, Retry Count]. The final action decision selects the discrete action with the maximum Q-value through either a greedy or ε-greedy strategy, which can be expressed as
a_{disc,t} = \arg\max_{a} Q_\psi(f_t, a)
where the maximization is taken over all discrete actions a, and a_disc,t is the action with the highest Q-value in the current state s_t. DQN updates the network through the temporal-difference error, with the goal of minimizing the following loss function:
L(\psi) = \hat{\mathbb{E}}_t \left[ \left( Q_\psi(s_t, a_{disc,t}) - y_t \right)^2 \right]
y_t = R(t) + \gamma \cdot \max_{a'} Q_{\psi'}(s_{t+1}, a')
where y_t is the target Q-value; γ ∈ [0, 1] is the discount factor; Q_{ψ'} is the target Q-network, whose parameters ψ' are periodically copied from the main network ψ to stabilize training; Q_ψ(s_t, a_disc,t) is the predicted Q-value; ψ is the set of all weights and biases in the DQN; R(t) is the immediate reward obtained at time t; and max_{a'} denotes the highest Q-value over all possible actions a' in the next state s_{t+1}.
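The bootstrapped target y_t can be sketched as follows; the terminal-state handling and the γ = 0.99 default are standard DQN conventions assumed here rather than stated in the paper:

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """DQN target y_t = R(t) + gamma * max_a' Q_target(s_{t+1}, a')."""
    if done:  # terminal transition: no bootstrap term
        return reward
    return reward + gamma * max(next_q_values)

# Illustrative: three Q-values from the target network for s_{t+1}
print(td_target(1.0, [0.5, 2.0, -1.0]))  # 2.98  (1.0 + 0.99 * 2.0)
```

Evaluating the max with the slowly-updated target parameters ψ', rather than the online network ψ, is what breaks the feedback loop between the prediction and its own regression target.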
The final collaborative training and decision-making process is as follows:
  • Forward propagation: At each time step, the feature-extraction network extracts features from the original state s_t to obtain the feature vector f_t.
  • Action generation: The PPO-Actor network samples a continuous action a_cont from the current policy, and the DQN Q-network selects the discrete action a_disc with the highest Q-value.
  • Environmental interaction: The composite action (a_cont, a_disc) is executed, and the reward R_t and the next state s_{t+1} are obtained from the environment (the LEO satellite communication link simulator).
  • Experience storage: The transition sample (s_t, a_cont, a_disc, R_t, s_{t+1}) is stored in the experience replay buffer.
  • Network update: A batch of data is sampled from the buffer. The DQN branch is updated by computing the Q-value loss L(ψ) and backpropagating it; the PPO branch is updated by using the sampled data to estimate the advantage function Â_t, maximizing the clipped objective L^CLIP(θ) for the Actor network, and minimizing the value-function error for the Critic network.
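The five steps above can be condensed into a minimal training-loop skeleton. Every component here is a stand-in stub with hypothetical names; a real implementation would plug in the CNN-LSTM extractor, the PPO Actor-Critic, and the DQN branch described in the text:

```python
import random
from collections import deque

def extract_features(state):          # shared CNN-LSTM extractor (stub)
    return state

def ppo_sample_continuous(features):  # PPO-Actor: power, beamforming (stub)
    return {"power_dbm": random.uniform(30, 40)}

def dqn_select_discrete(features):    # DQN: argmax over [MCS, retry] (stub)
    return {"mcs": 3, "retry": 1}

def env_step(state, action):          # LEO link simulator (stub)
    return random.random(), state + 1  # reward R_t, next state s_{t+1}

replay_buffer = deque(maxlen=10_000)
state = 0
for t in range(100):
    f_t = extract_features(state)                           # 1. forward pass
    a_cont = ppo_sample_continuous(f_t)                     # 2. action generation
    a_disc = dqn_select_discrete(f_t)
    reward, next_state = env_step(state, (a_cont, a_disc))  # 3. interaction
    replay_buffer.append((state, a_cont, a_disc, reward, next_state))  # 4. storage
    if len(replay_buffer) >= 32 and t % 10 == 0:
        batch = random.sample(list(replay_buffer), 32)      # 5. network update
        # update_dqn(batch); update_ppo(batch)  -- stubs, see text

    state = next_state

print(len(replay_buffer))  # 100 stored transitions
```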
Applying this method requires consideration of onboard computational resource limitations. The computational complexity per forward inference is as follows. CNN branch: the input is a 16 × 16 × 2 CSI tensor, requiring approximately 1.2 × 10⁶ multiply-add operations. LSTM branch: the input is a 50 × 4 historical sequence, requiring approximately 3.5 × 10⁵ operations. Fully connected branch: approximately 2.1 × 10⁵ operations. In total, approximately 1.76 × 10⁶ floating-point operations (1.76 MFLOPs) are needed per forward inference. Existing commercial-grade onboard AI processors, such as the Xilinx KU060 and NVIDIA Jetson TX2i, can support in-orbit deployment of such lightweight models. The method adopts a two-stage strategy of offline training and online inference, deployed as follows:
  • Offline training phase (ground-assisted): All DRL models are trained at ground stations equipped with high-performance computing clusters. A large amount of diverse scenario data is generated with the STK and NS-3 joint simulation platform, covering different orbital altitudes, weather conditions, and business loads. After training, a lightweight inference model is produced.
  • Online inference phase (onboard deployment): The lightweight inference model is deployed on the LEO satellite's onboard processor to achieve local real-time decision-making.
  • Continuous learning mechanism: The model is updated regularly by the ground station; when passing over a ground station, the satellite receives updated model parameters through the high-speed link.
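As a quick sanity check of the computational budget stated above, the per-branch multiply-add counts can be summed directly. The figures are those given in the text; this is arithmetic only, not a model of the actual network:

```python
# Per-branch multiply-add counts stated in the text (approximate).
cnn_ops = 1.2e6    # 16 x 16 x 2 CSI tensor through the CNN branch
lstm_ops = 3.5e5   # 50 x 4 historical sequence through the LSTM branch
fc_ops = 2.1e5     # fully connected branch

total = cnn_ops + lstm_ops + fc_ops
print(f"{total / 1e6:.2f} MFLOPs per forward inference")  # 1.76 MFLOPs
```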
The technological novelty of the proposed method is reflected in three aspects. 1. A collaborative architecture with shared feature extraction: existing methods typically use parallel but independent network structures, whereas the proposed method uses a CNN-LSTM shared feature-extraction network that enables the PPO and DQN branches to make decisions based on the same deep spatiotemporal features, achieving implicit collaborative optimization of continuous and discrete actions. 2. A hierarchical decision-fusion mechanism: instead of simply concatenating the two types of actions at the output, the method introduces a hierarchical decision mechanism based on constraint satisfaction, prioritizing the communication reliability constraints (BER and latency) before optimizing the throughput target, which makes the decision-making process more interpretable. 3. Space-time feature modeling for satellite communication: existing methods often use fully connected networks to process the state input, whereas the proposed method implements a CNN-LSTM hybrid network designed specifically for the highly dynamic characteristics of satellite channels, capturing the spatial structure and temporal evolution of the CSI and performance indicators.

5. Results and Discussion

The optimization algorithm designed in this paper requires a large number of randomly constructed channel scenarios for offline DRL training. The trained model can then be deployed online and dynamically output the optimal link-configuration parameters based on real-time channel state information. In this section, random DRL training scenarios are designed and simulation results are obtained through examples. The simulation results of the algorithm are compared with those of traditional methods, demonstrating the effectiveness and advancement of the new method.

5.1. Design of Random Training Environment

To ensure that the trained DRL agent possesses strong generalization capabilities, a highly randomized dynamic channel simulation environment was constructed. The core parameters and the range of randomization for the environment are shown in Table 1.
The environmental joint simulation architecture adopts a simulation platform that deeply integrates STK and NS-3, which can be expressed as
P_sim = {P_STK, P_NS-3, I_interface}

where P_sim is the entire joint simulation platform; P_STK is the STK simulation environment, responsible for generating satellite orbits, positions, and the geometric relationships of the satellite-ground links; P_NS-3 is the NS-3 network simulator, responsible for simulating network protocols, packet transmission, and business flows; and I_interface is the interface module between STK and NS-3, which enables real-time data exchange between the two platforms. The data stream D_S(STK→NS-3) from STK to NS-3 can be expressed as

D_S(STK→NS-3) = {pos_sat, pos_gw, L_fs, L_atm, L_other, f_dm}
where pos_sat and pos_gw are the three-dimensional coordinate position vectors of the satellite and the gateway station, respectively, and the dynamic channel losses follow the dynamic channel model in Section 2. The STK and NS-3 joint simulation architecture exchanges real-time data through a TCP/IP interface: STK is responsible for high-precision orbital-dynamics simulation, outputting the satellite position, velocity, and satellite-ground geometric relationships, while NS-3 is responsible for network protocol-stack simulation, implementing the DVB-S2X physical layer, MAC layer, and routing protocols. Python 3.14 programs coordinate the data exchange between the two platforms and drive the training and inference of the DRL agent. The business-model parameters are as follows: VoIP service, packet size of 120 bytes and arrival interval of 20 ms; video stream, bitrate of 2 Mbps and packet size of 1500 bytes; FTP service, file size of 10 MB.
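One possible shape for a single STK-to-NS-3 update message over the TCP/IP interface is sketched below. The JSON field names and all values are hypothetical, chosen only to mirror the data stream D_S defined above; the actual interface format is not specified in the paper:

```python
import json

def pack_stk_update(pos_sat, pos_gw, L_fs, L_atm, L_other, f_dm):
    """Serialize one simulation-step update as a newline-delimited JSON message."""
    msg = {
        "pos_sat": pos_sat,      # satellite position vector [m]
        "pos_gw": pos_gw,        # gateway station position vector [m]
        "L_fs_db": L_fs,         # free-space path loss
        "L_atm_db": L_atm,       # atmospheric (rain) loss
        "L_other_db": L_other,   # other losses
        "f_dm_hz": f_dm,         # Doppler shift
    }
    return json.dumps(msg) + "\n"

# Illustrative values only.
line = pack_stk_update([7.0e6, 0.0, 0.0], [6.371e6, 0.0, 0.0],
                       165.3, 2.1, 0.5, 120e3)
print(json.loads(line)["f_dm_hz"])  # 120000.0
```

Newline-delimited JSON keeps the per-step messages easy to frame over a plain TCP socket, which is one simple way such a co-simulation bridge is commonly built.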

5.2. Training Process and Hyperparameters of DRL

The training of DRL is conducted on high-performance servers. The choice of hyperparameters affects the final training results; the values used in this paper are shown in Table 2.
The DRL training process is as follows:
  • Initialization: Randomly initialize the DRL network parameters of the satellite.
  • Scene loop: Each training episode simulates a complete pass of the satellite over a ground station (approximately 10–15 min of simulated time).
  • Step loop: At each time step (e.g., 10 ms), the agent obtains the state s_t from the environment and outputs the action a_t through the PPO and DQN networks, receiving the reward R_t and the new state s_{t+1}. The experience tuple (s_t, a_t, R_t, s_{t+1}) is stored in the experience replay buffer, which is periodically sampled to update the network parameters. The convergence results during the training process are shown in Figure 4.

5.3. Simulation Results and Analysis

To comprehensively evaluate the adaptability of the proposed DRL method under different orbital altitudes and environmental conditions, 12 representative test scenarios were constructed, covering the key variables encountered in LEO satellite communication, including orbital altitude, weather conditions, initial elevation angle, and business load, in order to verify the generalization performance of the algorithm under different orbital characteristics; the scenarios are listed in Table 3.
According to the scenarios in Table 3, different methods were used to simulate the LEO satellite communication link. The first is a fixed-strategy method that adopts a conservative fixed configuration, with a transmission power of 40 dBm, an MCS of 16QAM 3/4, and a fixed beam; this is the benchmark scheme for many traditional satellites. The second is ACM, a widely studied adaptive method that dynamically adjusts the MCS based on the instantaneous SNR feedback from the receiver through a preset SNR-MCS mapping table, but with a fixed transmission power.
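The ACM baseline's table lookup can be sketched as follows. The SNR thresholds and the MCS ladder here are illustrative placeholders, not the actual DVB-S2X switching points, which are defined in the standard's MODCOD tables:

```python
import bisect

# Illustrative SNR thresholds (dB) separating adjacent MCS levels.
SNR_THRESHOLDS = [2.0, 6.0, 10.0, 14.0]
MCS_LADDER = ["QPSK 1/2", "QPSK 3/4", "16QAM 1/2", "16QAM 3/4", "64QAM 3/4"]

def acm_select_mcs(snr_db):
    """Pick the highest MCS whose SNR threshold the instantaneous SNR meets."""
    return MCS_LADDER[bisect.bisect_right(SNR_THRESHOLDS, snr_db)]

print(acm_select_mcs(4.5))   # QPSK 3/4
print(acm_select_mcs(15.0))  # 64QAM 3/4
```

Because the table is fixed and the transmission power is not adapted, ACM reacts only to the instantaneous SNR, which is the limitation the learning-based methods below are designed to overcome.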
In addition to the two traditional methods, two learning-based methods are included. The first uses PPO alone to optimize all parameters, treating the discrete actions (MCS, retransmission) as continuous values that are discretized after Gaussian-distribution sampling or mapping; this is a common continuous-control approach. The second uses DQN alone to optimize all parameters, discretizing the continuous actions (such as power and beamforming parameters) into finite levels; this is a classic discrete-action learning approach. All methods use the same state space, reward function, and training process and differ only in the action-output structure. The performance comparison results are shown in Table 4.
As shown in Table 4, the hybrid DRL method proposed in this paper demonstrates excellent performance and robustness across different orbital altitudes (550 to 1200 km), meteorological conditions, and business loads. Specifically, as the orbital altitude increases, the free-space path loss grows and the throughput of all methods decreases (e.g., in scenarios S1, S4, and S8). However, the proposed method consistently maintains the highest throughput with the smallest performance degradation, demonstrating its ability to adapt to different orbital characteristics. In harsh LEO operating conditions (rainstorms, low elevation angles, and heavy loads), the BER of the traditional fixed strategy and of adaptive modulation and coding (ACM) deteriorates sharply, whereas the proposed method stabilizes the bit error rate at the 10⁻⁶ level. Compared with the single learning-based algorithms: although PPO performs well in throughput and power control, its discrete-action decision accuracy is insufficient, resulting in a slightly higher BER, while DQN is limited by its discretization granularity, resulting in relatively weak overall performance. The hybrid method proposed in this paper combines the advantages of both and achieves the best performance in all test scenarios, confirming the necessity of jointly optimizing continuous and discrete actions. In addition, in heavy-load scenarios, the proposed method effectively suppresses the growth of queueing delay by adaptively adjusting the transmission power and beamforming, reducing the delay by over 20% compared with the fixed strategy, further highlighting its effectiveness under complex conditions.
In summary, the simulation results fully validate the effectiveness of the new method for optimizing LEO satellite communication links based on the DRL proposed in this paper. Compared with existing technologies, the new method can achieve improvements in throughput, reliability, delay, and energy efficiency in highly dynamic and non-stationary LEO satellite channels through the joint decision-making of intelligent cross-layer parameters. It demonstrates strong environmental adaptability and overall performance advantages and provides a new approach for the next generation of intelligent satellite communication systems.

6. Conclusions

In this paper, a new method for optimizing LEO satellite communication links based on DRL was proposed. The method extracts spatiotemporal features of a multidimensional state space through a CNN-LSTM hybrid network and establishes a hybrid architecture combining PPO and DQN, enabling the DRL agent to perceive the link state in real time and jointly decide on discrete and continuous actions such as output power, beamforming, MCS, and retransmission strategy, while ensuring link reliability and maximizing system performance. The simulation results showed that the new method is effective and advanced. Compared with traditional methods, it has three main advantages: first, it avoids dependence on precise channel models through DRL; second, it achieves multi-parameter collaborative optimization instead of isolated adjustment; and third, it adapts to unknown environments, significantly improving the robustness of the system in dynamic scenarios. The method demonstrates the great potential of DRL in the field of satellite communication and provides a new idea for promoting the development of LEO satellite communication towards autonomy and intelligence.

Author Contributions

Conceptualization, H.Y. and L.W.; methodology, H.Y. and J.W.; software, H.Y. and S.L.; validation, H.Y.; formal analysis, H.Y. and Y.S.; investigation, H.Y.; resources, H.Y.; data curation, H.Y. and J.W.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y.; visualization, H.Y. and Y.S.; supervision, H.Y.; project administration, H.Y.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The authors declare that the data cannot be made public due to privacy concerns.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

All authors were employed by The 54th Research Institute of CETC. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. The channel dynamic model of LEO satellite communication link.
Figure 2. Doppler frequency shift between the LEO satellite and user equipment (UE).
Figure 3. Optimization algorithm of LEO satellite communication link based on DRL.
Figure 4. The simulation convergence results during the training process of DRL.
Table 1. The core parameters and the range of randomization for the environment.
Parameter | Value Range/Distribution
Initial elevation angle | 10°~80°
Orbital altitude | 500~1200 km
Rainfall rate | 0~50 mm/h (exponential distribution)
Number of interference sources | 0~4 (Poisson distribution)
Interference source power | −20~0 dBm (uniform distribution)
Rician factor | 5~15 dB (uniform distribution)
Packet arrival rate | 0.1~1.0 Mbps (uniform distribution)
Table 2. The training hyperparameters of DRL.
Hyperparameter | Value
PPO learning rate | 3 × 10⁻⁴
DQN learning rate | 1 × 10⁻³
Discount factor | 0.99
Experience replay buffer size | 1 × 10⁶
Table 3. Different scenarios in LEO satellite communication.
Number | Altitude | Weather Conditions | Initial Elevation | Business Load
S1 | 550 km | sunny | 60° | Mild
S2 | 550 km | light rain (5 mm/h) | 30° | Moderate
S3 | 550 km | moderate rain (15 mm/h) | 10° | Heavy
S4 | 975 km | sunny | 75° | Moderate
S5 | 975 km | moderate rain (15 mm/h) | 45° | Mild
S6 | 975 km | rainstorm (25 mm/h) | 30° | Moderate
S7 | 975 km | rainstorm (25 mm/h) | 60° | Heavy
S8 | 1200 km | sunny | 45° | Moderate
S9 | 1200 km | light rain (5 mm/h) | 30° | Mild
S10 | 1200 km | moderate rain (15 mm/h) | 60° | Heavy
S11 | 1200 km | rainstorm (25 mm/h) | 45° | Moderate
S12 | 975 km | rainstorm (25 mm/h) | 15° | Heavy
Table 4. The performance comparison results of different methods in different scenarios.
Scenario | Method | Throughput (Mbps) | BER (×10⁻⁶) | Delay (ms) | Power (dBm)
S1 | Fixed strategy | 86.5 | 0.3 | 10.2 | 40.0
S1 | ACM | 93.2 | 0.4 | 9.8 | 40.0
S1 | PPO | 95.8 | 0.3 | 9.5 | 36.5
S1 | DQN | 92.1 | 0.5 | 10.0 | 35.8
S1 | New method | 98.4 | 0.2 | 9.2 | 34.2
S2 | Fixed strategy | 72.3 | 1.2 | 13.5 | 40.0
S2 | ACM | 80.5 | 1.5 | 12.8 | 40.0
S2 | PPO | 83.6 | 1.1 | 12.2 | 37.1
S2 | DQN | 79.8 | 1.4 | 12.6 | 36.3
S2 | New method | 86.2 | 0.8 | 11.8 | 35.0
S3 | Fixed strategy | 48.5 | 8.5 | 20.8 | 40.0
S3 | ACM | 55.2 | 3.8 | 18.5 | 40.0
S3 | PPO | 58.7 | 2.6 | 17.3 | 34.8
S3 | DQN | 54.3 | 3.2 | 18.0 | 33.9
S3 | New method | 62.1 | 1.5 | 16.2 | 32.5
S4 | Fixed strategy | 79.8 | 0.4 | 11.8 | 40.0
S4 | ACM | 86.4 | 0.5 | 11.2 | 40.0
S4 | PPO | 88.9 | 0.4 | 10.8 | 37.2
S4 | DQN | 85.2 | 0.6 | 11.1 | 36.4
S4 | New method | 91.5 | 0.3 | 10.5 | 34.8
S5 | Fixed strategy | 63.2 | 2.8 | 16.2 | 40.0
S5 | ACM | 70.1 | 2.2 | 15.1 | 40.0
S5 | PPO | 73.5 | 1.6 | 14.3 | 35.2
S5 | DQN | 69.8 | 1.9 | 14.8 | 34.5
S5 | New method | 76.8 | 1.0 | 13.7 | 33.1
S6 | Fixed strategy | 45.2 | 28.0 | 22.5 | 40.0
S6 | ACM | 58.7 | 5.2 | 18.9 | 40.0
S6 | PPO | 62.3 | 2.8 | 17.2 | 33.6
S6 | DQN | 55.6 | 3.5 | 18.1 | 32.4
S6 | New method | 69.4 | 1.2 | 15.8 | 31.5
S7 | Fixed strategy | 38.7 | 35.0 | 24.8 | 40.0
S7 | ACM | 50.2 | 6.8 | 20.5 | 40.0
S7 | PPO | 54.5 | 3.5 | 18.6 | 34.2
S7 | DQN | 48.9 | 4.2 | 19.3 | 33.1
S7 | New method | 61.2 | 1.8 | 17.0 | 32.0
S8 | Fixed strategy | 68.5 | 0.5 | 14.5 | 40.0
S8 | ACM | 75.2 | 0.6 | 13.8 | 40.0
S8 | PPO | 77.8 | 0.5 | 13.2 | 37.8
S8 | DQN | 74.3 | 0.7 | 13.6 | 36.9
S8 | New method | 80.1 | 0.4 | 12.9 | 35.5
S9 | Fixed strategy | 60.2 | 1.8 | 17.2 | 40.0
S9 | ACM | 67.5 | 2.0 | 16.1 | 40.0
S9 | PPO | 70.3 | 1.4 | 15.3 | 36.8
S9 | DQN | 66.8 | 1.7 | 15.8 | 36.0
S9 | New method | 73.0 | 0.9 | 14.8 | 34.6
S10 | Fixed strategy | 48.8 | 4.2 | 20.5 | 40.0
S10 | ACM | 55.9 | 2.9 | 18.8 | 40.0
S10 | PPO | 59.4 | 2.0 | 17.5 | 35.8
S10 | DQN | 55.1 | 2.4 | 18.2 | 35.0
S10 | New method | 62.7 | 1.3 | 16.5 | 33.8
S11 | Fixed strategy | 35.6 | 42.0 | 26.5 | 40.0
S11 | ACM | 46.8 | 8.5 | 22.3 | 40.0
S11 | PPO | 51.2 | 4.2 | 20.1 | 35.5
S11 | DQN | 45.5 | 5.1 | 21.0 | 34.2
S11 | New method | 57.5 | 2.1 | 18.8 | 33.2
S12 | Fixed strategy | 28.5 | 95.0 | 28.7 | 40.0
S12 | ACM | 42.1 | 12.8 | 22.3 | 40.0
S12 | PPO | 47.6 | 6.5 | 20.1 | 35.2
S12 | DQN | 41.3 | 7.8 | 21.5 | 34.1
S12 | New method | 53.8 | 3.2 | 18.6 | 33.0

Share and Cite

MDPI and ACS Style

Yu, H.; Li, S.; Wu, J.; Sun, Y.; Wang, L. A New Method for Optimizing Low-Earth-Orbit Satellite Communication Links Based on Deep Reinforcement Learning. Aerospace 2026, 13, 285. https://doi.org/10.3390/aerospace13030285

