Joint Beamforming Design for RIS-Assisted Integrated Satellite-HAP-Terrestrial Networks Using Deep Reinforcement Learning

In this paper, we consider reconfigurable intelligent surface (RIS)-assisted integrated satellite high-altitude platform terrestrial networks (IS-HAP-TNs) that can improve network performance by exploiting the HAP stability and RIS reflection. Specifically, the reflector RIS is installed on the side of HAP to reflect signals from the multiple ground user equipment (UE) to the satellite. To aim at maximizing the system sum rate, we jointly optimize the transmit beamforming matrix at the ground UEs and RIS phase shift matrix. Due to the limitation of the unit modulus of the RIS reflective elements constraint, the combinatorial optimization problem is difficult to tackle effectively by traditional solving methods. Based on this, this paper studies the deep reinforcement learning (DRL) algorithm to achieve online decision making for this joint optimization problem. In addition, it is verified through simulation experiments that the proposed DRL algorithm outperforms the standard scheme in terms of system performance, execution time, and computing speed, making real-time decision making truly feasible.


Introduction
As fifth-generation mobile communication systems enter commercial operations worldwide, terrestrial wired and wireless networks are beginning to provide instant, high-speed data transmission services to users in high-density population areas. However, due to geographical conditions and business models, networks in remote areas are still unable to meet the needs of multiple users for full-area coverage and ubiquitous access [1,2]. Compared with traditional terrestrial wireless communication systems, the satellite-aerialground integrated network (SAGIN) emerged as a high-potential infrastructure for future wireless communication networks that can establish seamless coverage and massive connectivity for the explosive growth of terrestrial users [3]. In SAGIN, the high-altitude platform (HAP)-based relaying communications were expected to be the primary choice for aerial communication compared to UAV-based relaying, owing to their lower operating costs, longer residence time, deployment flexibility and the number of communication devices they can carry. Thus, the integration of HAPs for enabling unobstructed connectivity of the integrated satellite-terrestrial networks (ISTNs) attracted widespread attention from academia and industry [4]. Nevertheless, integrated satellite high-altitude platform terrestrial networks (IS-HAP-TNs) also raise serious concern about the rapidly growing energy consumption and wireless security in the transmission process, which are of great significance for maintaining green and reliable communication schemes. accurate mathematical expressions, such as the inevitable non-linearities due to hardware losses [20,21]. Additionally, as for the dynamic RIS configuration scenario, the RIS reflection coefficients are assumed to be reconfigured many times during one transmission instance. The capacity region that corresponds to the dynamic RIS configuration is obtained by the union of all achievable rate tuples over all possible combinations of values of reflection coefficients for each RIS element. However, determining this capacity region is prohibitively complex since the size of the set of all possible combinations of reflection coefficients grows exponentially with the number of reconfiguration. Considering that the future IS-HAP-TNs needs to meet a large number of real-time Internet of Things (IoT) or Internet of Vehicles (IoV) but, due to its channel environment, is complex and highly dynamic, the calculation optimization time of conventional methods is long, so more intelligent algorithms are urgently needed.
On the contrary, deep reinforcement learning (DRL) is a novel approach that combines deep learning (DL) and reinforcement learning (RL). It has been proven to be a significant breakthrough in non-convex optimization problems, including hybrid beamforming design [22], spectrum intelligence sensing [23], channel state estimation [24], and power allocation strategy optimization [25]. Compared with deep learning (DL), the DRL algorithm does not require a large amount of training labeled data as inputs and is therefore very friendly for the optimization of wireless communication systems, where obtaining data is more tedious. By interacting with the environment to obtain rewards from the network, DRL can learn and construct wireless channel knowledge without knowing the complete channel model information and the precise movement pattern, while implementing efficient algorithm design through embedded neural networks to sequentially find optimal solutions to complex multi-objective optimization problems. In [26], a deep Q-network (DQN) with greedy characteristics is proposed for the joint optimization of beamforming design, power allocation strategy and interference coordination for maximizing the signal to interference plus noise ratio (SINR). In [27], by using the DRL framework, the user distribution model is tracked and predicted to autonomously and dynamically optimize the MIMO broadcast beam and propose the optimal broadcast beam for each served cell. The results confirm that optimal coverage can be achieved using the DRL framework in both single-sector and multi-sector environments, and in both periodic and Markovian mobility modes.

Related Work and Motivation
Nowadays, many prior works are developed to investigate the performance of RISassisted wireless communication. (1) For the RIS-assisted wireless communication: In the future wireless communications, the RIS can be deployed as the reconfigurable signal transmitter, receiver and passive reflector array. Thus, by optimizing the RIS reflector element phase shifts, for example, by selecting optimal reflector element positions and passive beamforming designs, RIS-assisted intelligent radio environments are a promising optimization paradigm to transform the design of modern wireless networks. In [28], the author jointly optimized the RIS reflection coefficients and the number of tunable reflective elements to improve the quality of service (QoS) of RIS-assisted edge networks by considering the actual amplitude and phase shift model of the RIS. The authors in [29] proposed an efficient alternating algorithm for fractional programming (FP), majorization-minimization (MM), and manifold optimization methods to jointly optimize active and passive beamforming under multiple constraints of radar sensing similarity, RIS constraints and transmit power constraints. Moreover, the authors in [30] considered the performance of RIS-assisted integrated satellite unmanned aerial vehicle (UAV)-terrestrial networks, and the closed-form expression for the outage probability (OP) was obtained to evaluate the impact of the introduction of RIS on the system performance.
(2) For the DRL in RIS-assisted wireless communication: By combining the function fitting benefits of deep learning with the environmental interaction decision-making benefits of reinforcement learning, DRL is believed to have the ability to solve nonlinear problems without the need for a prior relaxation, enabling direct solutions to non-linear problems in mathematical methods. Recently, DRL-based methods were also applied to solve RIS-assisted wireless communication optimization problems. The authors in [31] presented a RIS-assisted multi-user full-duplex secure communication system with hardware impairments and maximized the total secrecy rate by using DRL. In [32], the authors considered a deep deterministic policy gradient (DDPG)-based framework to tackle the problem of maximizing the receiver signal-tonoise ratio (SNR) in a RIS-assisted single-user wireless communication system. The simulation results showed that the proposed DDPG algorithm can achieve higher performance in a shorter running time compared to traditional methods based on semi-definite relaxation.
Based on the advantages of DRL itself, the model-free DRL emerged as an extraordinarily remarkable technology to address massive data, mathematically intractable non-linear non-convex problems and high-computation issues. The DRL technology is most appealing to large-scale MIMO systems, such as the satellite communication or the IS-HAP-TNs with a massive number of array elements, where optimization problems become non-trivial due to the extremely large dimension optimization involved. Actually, DL-based approaches are able to significantly reduce the complexity and computation time utilizing the offline prediction but often require an exhaustive sample library for online training. Meanwhile, the deep reinforcement learning (DRL) technique, which embraces the advantage of DL in neural network training as well as improving the learning speed and the performance of reinforcement learning (RL) algorithms, has also been adopted in designing wireless communication systems. All of the above work verifies that the DRL algorithm can effectively solve the optimization problems of various wireless communication systems in terms of different performance metrics and achieve high performance in a short run-time with an appropriate algorithmic framework design.

Major Contributions and Novelty
Currently, the joint beamforming design technique is also a hot issue, which can greatly improve the communication efficiency, system capacity and transmission rate of wireless communication systems to some extent. The efficiency of applying DRL to RIS-assisted wireless communication systems has been demonstrated in simulation experiments in the relevant literature. Related meta-surface techniques, including RIS, have also recently been widely used for IS-HAP-TNs [33][34][35][36].
Motivated by the above analysis, we investigated the joint beamforming design for a RIS-assisted IS-HAP-TNs, in which more constraints must be considered, such as the angle of arrival (AoA) or the angle of departure (AoD) in RIS, maximum transmit power of the satellite, the optimal RIS phase shift matrix to reflect, and the highly dynamic signal transmission environment. However, the statistical characteristics of the shadowed-Rician (SR) channel are far more complex than the terrestrial Rayleigh channel, which is also the difficulty and innovation of our work. Nevertheless, since the IS-HAP-TNs environment is harsh as well as time varying, and the hardware of RIS may suffer from the damage unpredictably, it is challenging to perceive accurate and complete channel state information (CSI), making conventional optimization-based methods to IS-HAP-TNs operation no longer appropriate. In addition, the conventional optimization-based methods need a rich body of iterations to achieve a satisfactory solution, which results in it being impracticable for making real-time decisions in time-varying IS-HAP-TNs. Targeted at solving the difficult problem under uncertainty, deep reinforcement learning (DRL) emerged as the times required and was applied to solve the problems in RIS-aided wireless communications. By applying the advanced DRL framework to optimize the ground user transmit beamforming and RIS phase shift matrix to send the desired signal to the target satellite, the overall network system sum rate is maximized [37]. To the best of the authors' knowledge, there is no prior work focusing on the beamforming design for RIS-assisted integrated satellite-HAP-terrestrial networks, especially the installation of RIS on HAP, which motivates our work. In this paper, considering the high dynamics of RIS-assisted IS-HAP-TNs transmission process, the main work and contributions are summarized as follows: (1) Firstly, an innovative system model of installing RIS on the HAP side is proposed in the IS-HAP-TNs. Considering the time-varying characteristics of the IS-HAP-TNs fading channel model and signal transmission model, the system sum rate formulations are given under these system model constraints using the active transmit beamforming at the ground user equipment, the phase shift matrix at the RIS, and the maximization expressions under the proposed constraints.
Secondly, a parameter soft-updated strategy framework based on DDPG framework is designed to optimize the above system sum rate maximization problems. The framework does not need to know the explicit model and specific mobile model of the wireless environment, and solves the formal problem of the system by rationalizing the design of the elemental state space, action space and reward function in the DDPG algorithm so that it can handle continuous variable problems well.
Finally, the simulation experiments on the number of RIS elements as well as the average reward show that the designed DRL algorithm framework outperforms other comparative baseline algorithms, which illustrates the effectiveness of the DRL algorithm in solving joint beam optimization problems and provides guidance for real-time decision making in dynamic IS-HAP-TNs communication environments.
The remainder of this paper is arranged as follows. Section 2 describes the considered system model and identifies the optimization objective problem under the constraints. Section 3 gives the basic framework of the soft update parameter strategy and gives the design flow for the optimization of the active transmit beamforming matrix and the RIS phase shift matrix under this framework. Section 4 plots the network performance simulation results under this framework and provides a detailed theoretical analysis. Finally, Section 5 concludes the whole work.

System Model Description
In this illustration, we envision an uplink transmission communication system that includes a geosynchronous Earth orbit (GEO) satellite and backward high altitude platforms (HAPs) deployed with RIS, as well as the K ground user equipment (UE) employing a single antenna as shown in Figure 1. In our proposed system model, the UEs transmission communication information through RF links to the RIS with M reflective elements installed on the HAP, which acts as a passive reflection relay with changeable transmission links and sends the received signal to the satellite.
It is noted that the satellites are linked to the cloud data computing processing center by the free-space optical (FSO), which can collect global communication information, such as the user's requirements, as the system control link. Instead of coding satellites and HAPs separately, it centralizes the baseband processing of the entire network in the cloud, with the cloud as the core, taking into account resource management and environmental feedback [38]. In order to realistically simulate the UEs-RIS link, where the RIS is mounted on the HAP in the aerial, here, we consider the small-scale path loss model [39], and then, the channel model vector of the UEs-RIS can be expressed as where L total denotes the number of the total transmission path, α l represents the Nakagamim channel model random variable, ϕ AR,l and ϕ DU,l denote the angle of arrival (AoA) of RIS and the angle of departure (AoD) of the UEs in the l-th transmission path. The channel model vector g(L, ϕ) as a function of the transmission path L and the AOA or AOD ϕ can be expressed as The RIS-satellite uplink channel vector is denoted by H RS , which can be expressed as where the N s denotes the antenna numbers of the uniform linear array (ULA) in the satellite, and ϕ AS and ϕ DR are the AOA of the satellite and the AOD of the RIS, respectively. Meanwhile, the P r is the free space path loss between the RIS and the satellite [40]. Note that in the RIS-satellite uplink channel model, considering that the HAP flies at a higher altitude than most ground buildings and the RIS is mounted on the HAP, we only assume the line-of-sight (LoS) transmission path between the RIS and the satellite, and the P r can be expressed by the following formula: where λ, G sr , G st , d sr , κ a , T a , B W denote the carrier wavelength of signal, the gains of every RIS reflection unit, antenna gain of each satellite, the transmission distance between RIS to center of satellite coverage area, the Boltzmann constant, the temperature of the propagating noise and the frequency band of signal, respectively. The direct uplink channel H US from the UE and the satellite is basically a standard MIMO channel model and can be characterized by existing methods to express its channel characteristics [41].

Signal Transmission Model
We assume that the k-th UEs intends to transmit signals denoted as √ w k s k , where w k is the transmission power matrix coefficient vector at the k-th UEs of the total transmit beamforming matrix W = [w 1 , w 2 , ..., w k ] under less than the maximum power constraint P max and s k is satisfied E s k (t) 2 = 1 with zero mean unit variance entries at the tth transmission moment [42]. The transmitted signal from the ground user propagates through the direct link and the reflected link, the latter link reaching the satellite under the reflection of the RIS mounted on the HAP by means of changing the phase of the RIS reflecting elements. We define that Φ denotes the phase shift diagonal matrix applied to the reflective RIS for input by Φ(m, m) = φ m = χ m e jϕ m , where χ m is the magnitude and ϕ m denotes the phase shift caused by the passive reflection of each reflective element of the RIS [43]. Thus, the signal received by the satellite at the t-th time step can then be further expressed in the following: where n 0 denotes the system zero mean additive white Gaussian noise (AWGN) followed by n 0 ∼ CN 0, σ 2 n . As we can see from Equation (5) above, the introduction of the RIS does not introduce additional AWGN compared to the traditional use of relay-based communication systems. This is because the RIS is only a passive mirror relay that only reflects the signal incident on its plane without signal decoding and encoding. Once the RIS receives the signal, its phase will reconfigure the signal by means of a central microprocessor controller connected to the RIS. Thus, the received signal-to-interference-plus-noise ratio (SINR) of UE k signal at the satellite is given by The system can be described that the UEs transmits signals to satellite, so we set the performance metric for evaluating RIS-assisted wireless systems as the system ergodic sum rate, which can be modeled as where R k represents the data transmission rate of the k-th UEs, formulated by R k = log 2 (1 + γ k ).

Problem Formulation
In this section, the aim is to determine the optimization objective function as the maximum rate of chemistry for the RIS-assisted IS-HAP-TNs involved. From the above, it is clear that W = [w 1 , w 2 , ..., w k ] denotes the transmit beamforming matrix vector, and the proposed optimization problem can be modeled as where P max represents the maximum link transmission power. The constraint C1 regulates the transmission maximum power of UEs. The constraint C2 represents the constraints on RIS reflective elements. Obviously, the above optimization problem based on the proposed system model is non-convex, while constraints exist for a non-convex nontrivial optimization problem owing to the high-dimensional reflection elements phase shift and can hardly be settled by traditional improvement approaches. If the traditional mathematical tools were used, the problem would have to be solved by an exhaustive search to obtain the optimal solution, which at the same time implies a large amount of computational resources and processing measures, which is almost impossible for large-scale network communication scenarios, especially for networks with high real-time requirements such as our considered IS-HAP-TNs. Thus, this paper only considers the use of intelligent algorithms for the optimal beamforming solution, it does not mathematically solve challenging optimization problems by non-convex to convex conversions, etc. Meanwhile, to demonstrate the superiority of the proposed algorithm, we refer to the traditional alternating optimization (AO) algorithm in the literature for solving the joint active transmit beamforming and passive beamforming optimization algorithm. First, the constraint in Equation (9) can greatly increase the difficulty of solving this problem, so it is also necessary to relax this constraint to |φ m | ≤ 1, ∀m = 1, 2, ..., M. After obtaining the local optimal solution, the reflection coefficient is changed back to a value that conforms to the modulus constraint of 1 by the projection method. In each iteration of traditional AO algorithms, the globally sub-optimal W is solved by first fixing Φ and the sub-optimal Φ is solved by fixing the matrix W until the algorithm converges. For the design of high-dimensional continuous variables, such as the each element phase shift in large-scale networks, including the transmit power matrix, the phase shift matrix, etc., traditional mathematical optimization methods such as the AO algorithm and water filling (WF) algorithm cannot effectively solve these problems and often generate local optimal deviations. Thus, to efficiently tackle the considered non-convex jointly optimization problems, we propose a soft-updated DDPG algorithm, where, through continuous trial-and-error interaction with the environment, the DRL agent gradually learns a deterministic strategy that leads to the optimal action.

Soft-DDPG-Based Joint Active and Passive Beamforming Design
In this section, the method of DRL is used to jointly optimize the transmit beamforming shape and phase shift array, and utilizing the DDPG structure shown in Figure 2. First, we briefly discuss the soft-DDPG principle and operation process. Then, we introduce the proposed DRL architecture and provide a detailed description of the state, action, reward, and the algorithm framework.

Overview of Soft-DDPG
It is supposed that there exists a central controller or a learning agent in this network that can collect the channel information or communication date immediately, such as the RIS to satellite channel H RS and h UR and the UE to the RIS channel h UR . Figure 2 displays the soft-DDPG architecture suitable for the earning agents to interact with high dynamic communication environments to obtain the pre-defined rewards or punishments. The core concept of the soft-DDPG framework proposed in this letter is to perform effective beamforming design and phase shift convert under unforeseen circumstances, such as local state observations, e.g., RIS. The algorithm mainly includes two kinds of deep neural networks (DNNs), namely, the training network and the target network. To avoid or mitigate the issue of updating state participant values in a single case, we assume that the target and training networks have the same neural network architecture.
Based on the above extensions, we can more clearly portray the framework covered in this article, with four DNNs drawn in detail, which are the training critic network, the training actor network, the target critic network and the target actor network. The functions of these four neural networks described above are described below. The training critic network needs to input the current state s (t) into the action network and output the current action a (t) , and the training actor network needs to input state s (t) and action a (t) into the training critic network and output the Q value Q π (s (t) , a (t) ). The target critic network needs to input the updated state s (t+1) to the target actor network and output the a (t+1) . The target actor network needs to input the updated s (t+1) and a (t+1) to the target critic network and output the target Q value Q π (s (t+1) , a (t+1) ).
Considering the existence of plural inputs in the neural network input, this proposed model uses tanh as the activation function of the hidden layer to limit the action space in the interval (0, 2π), and to eliminate the effect of the change in the distribution of the hidden layer data brought by the parameter update. This proposed DRL framework introduces a batch normalization layer after each hidden layer to process its output. The batch normalization layer can effectively combat the gradient disappearance phenomenon, improve the training efficiency, and make the training process of the deep layer network more stable. In addition, according to the constraints of the transmitting power and phase shift coefficients, the proposed model adds the tanh activation function to the output layer of the actor network to restrict the output to the interval [−1,1], and subsequently transforms the action into the data format required by the optimization problem by taking the absolute value normalization and range mapping methods to meet the constraints of the power allocation and phase shift so as to calculate the system sum rate as Equation (8). We generate different transmission link channel information by following the channel model features described earlier when channel state information (CSI) and the previous action W (t−1) and Φ (t−1) are known at the t-th time step, and the learning agent can establish knowledge about the current state space s (t) in the t-th time step. It is considered that the difficulty of the joint optimal design of active transmission beamforming and passive RIS phase shift matrix are discrete and present a great challenge to continuous state-space and action-space settings. Next, the details of the DRL-based algorithm state space S, action space A and instant reward function R are explained below.
State: State space is generally a description of the environmental observations at the t-th time step. In this paper, the DRL algorithm state space includes three parts, i.e., the k-th UE's transmission power at the (t − 1)-th time step, the CSI of all communication links including direct links and cascaded reflective links containing RIS reflective elements at the (t − 1)-th time step, and the action from the (t − 1)-th time step, which are represented by w (t−1) k , G and a (t−1) , respectively, where Thus, the state space can be expressed as s (t) = w (t−1) k , G, a (t−1) . Action: Action space is generally a series of choices for the next action. Once the agent performs the current action a (t) step by step during the learning process according to the transfer policy π at the t-th time slot, the state space of the environment will be shifted from s (t) to the next state s (t+1) . The action space is designed as the UE transmit power matrix W and the RIS phase shift matrix Φ. Considering that neural networks can only take real part as input and and match the neural network input formats, the process of constructing the action space, where both the transmit power imaginary distribution, such as W = Re{W} + Im{W} and Φ = Re{Φ} + Im{Φ} are to be separated as separate input ports. Thus, the action space can be expressed as where the w (t) k represents the k-th UE transmit power matrix, and the φ (t) m denotes m-th reflective element's phase shift at the t-th time step.
Reward: The purpose of this paper is to maximize the system sum rate, and Equation (10) is adopted as the reward function:

The Process of Algorithm Training
In order to break the coupling between experiences and adapt to a high dynamic environment, the experience replay approach allows agent access to previous historical experiences in subsequent training, the DDPG framework considered in this article. For policy-based algorithms, the agent collects experience in the episode. After an episode is run, experience is lost. It is better with a multi-threaded parallel architecture. This not only solves the previous problems but also makes efficient use of computing resources and improves the training efficiency.
In the proposed DDPG framework, the entire agent consists of a global network and multiple parallel independent workers, each including a set of an actor network and critic network. Each worker interacts independently with their own environment, gaining independent sampling experiences that are independent of each other, thus breaking the coupling between experiences to match the experience replay. Most of the underlying algorithms in DRL are single threaded, that is, a learning agent that interacts with the environment to generate experience. Including the underlying version of the actor network and critic network, because the environment is fixed and the action of the agent needs to be continuous, the experience gathered has strong timing associations and only part of the state and action space can be explored in a limited amount of time. To solve this problem, we adopt the soft-DDPG-based scheme to optimize the design process and present the corresponding pseudo-code in Algorithm 1.
In the initial stage of the algorithm, the experience replay buffer D, the training actor network ψ(·) and training critic network Q(·) need to be initialized randomly (Lines 1-2). They are copied to the target network ψ (·) and Q (·) (Line 3). After initializing and randomly generating the RIS-assisted IS-HAP-TNs communication channel environment state, the state is processed via DNN and the output Y t (Lines 5-6). The action is derived based on Y t , where N is denoted as random noise, with the aim of seeking efficient exploration (Lines 7-8). In this letter, we employ the mini batch to reduce the sample training amount of sampling and ensure the quality of gradient reduction. After the transformation sequence is saved in the memory replay buffer D (Line 10), to achieve the optimal action that maximizes the output of the critic train network, the two train networks are updated using the minibatches of size ζ randomly sampled from replay buffer D (Line 11). We update the critic target network parameters Q(·) by minimizing the variance loss (Line 14). We make use of linking rules to update the actor networks parameters ψ(·). Finally, the target networks parameters of the actor network and critic network are slowly soft updated using the control factor τ as the decaying rate (Line 15). Initialize state s (0) ∈ S, S ← s (0) ; 7: for t = 0, 1, 2, ...T − 1 do: 8: Choose action a (t) = π s (t) | θ π + N ; 9: Take action a (t) , get reward r (t) and s (t) evolves into new state s (t+1) ; 10: Save (s (t) , a (t) , r (t) , s (t+1) ) into D; 11: Randomly sample ζ transitions form D; 12: Training framework via DNN; 13: Compute target value for the critic's evaluation network by y (i) = r (i) + γQ π s (i+1) , π s (i+1) |θ π |θ Q 14: Update the parameters of the critic's evaluation network by 15: Update the parameters of actor network with sampled policy gradients by ∇ θ π π (s|θ π ); 16: Soft-update the parameters of DDPG's target networks by

17:
Update the state s (t+1) ; 16: end for; 17: end for; During each iteration t each of the learning process, the actor train network will select the action from the continuous action space based on the current state s (t) . During this training process, in order to effectively explore the optimal action, the stochastic noise N a is also taken into account in the algorithm framework to obtain the deterministic strategy, i.e., a (t) = π s (t) | θ π + N a , where θ π is the actor train network parameter, and π is the transfer policy. When the operation ends, the environment will transit the last action to the next state s (t+1) to obtain instant reward r (t) , and then obtain an evaluation for the action to evaluate the optimal action a (t) , modeling a state-action value function by parameterized by θ Q as where α denotes the algorithm learning rate in this algorithm framework. To ensure the stability, the target actor network is parameterized by θ π and the target critic network is characterized by θ Q , which is parameterized at intervals according to the online network parameters. Thus, considering that the parameter update strategy is a soft update method, the algorithm is called soft-DDPG. The soft update method of parameters ensures the slow update of parameters and alleviates the instability problem of the policy network during the learning process.

Numerical Simulation Results
In this section, we evaluate the performance improvement of the proposed DRL-based algorithm framework for the proposed system model from different perspectives. First, we randomly generate a channel model matrix following the shadowed-Rician fading distribution and the corresponding channel models mentioned in the article [44]. The system parameters and hyperparameters of the DDPG algorithms are listed in Table 1 [45]. To test whether our algorithm improves network performance, we also consider three other standard solutions: (1) Hard-DDPG: The scheme indicates that the parameters in the DRL framework are updated in a hard-update strategy, which allows the network to copy all the parameters in the network at this time directly into the target network after every t u training session by pre-setting the parameter update interval t u .
Random RIS: The scheme denotes that the RIS phase shift matrix Φ is randomly generated.
Without RIS: This scheme denotes that the communication scenario without RIS and the UEs can send signals directly to the satellite. Considering that the process is a continuous transmission, we assume that the successful transmission signal is 1/2.
Traditional optimization scheme: The traditional algorithm is an alternating optimization algorithm, the specific process is to obtain the transmit power matrix by fixing the RIS reflection coefficient, according to the water filling algorithm. Then we obtain the corresponding current jointly beamforming design according to Equation (9) until the objective function value converges. Figure 3 plots the relationship between the number of RIS reflection elements and the system sum rate. As can be seen from the figure, the system sum rate is significantly higher for all algorithms as the number of RIS elements increases due to the fact that more RIS reflection elements increase the reflection channel gain but also sacrifice the complexity of the RIS deployment at the HAP. In addition, we can observe that the softupdate parameter strategy obtains a higher system sum rate than the hard update parameter strategy, alleviates the instability of the Q-value network in the learning process, and the soft-update strategy obtains a higher system ensemble rate by more flexibly interacting with the environment to design the phase shift matrix more flexibly. It can also be seen that the traditional alternating optimization algorithm used can also obtain a high system sum rate, and that the value of the objective function it obtains does not decrease as the number of elements increases, which also ensures the convergence of the algorithm.
The setting of hyper-parameters will have a great impact on the performance, such as the stability and convergence speed of neural networks. This paper also explores the effect of different learning rates on the performance and convergence speed of the model in our proposed DRL framework. The average reward is used to measure its performance, which can be shown as where T is the maximum step size of sample training. All the parameters initialized in the training phase and the channel sample parameters used for different parameters are the same for this simulated neural network, and a comparison of the network performance at different learning rates, i.e., 0.01, 0.001, 0.0001, 0.00001 is shown in Figure 4. Figure 4 shows the average reward versus time step under different learning rates, and it can be seen that the effect of different learning rate settings in the neural network on the performance of the DRL algorithm varies greatly. For larger learning rates, such as 0.01, the performance of this network decreases, and the convergence effect is unstable and prone to oscillations. In particular, the considered DRL framework with a learning rate of 0.001 performs the best, but converges more slowly than the others. As the RIS reflection element increases, the average system reward also increases gradually as expected with the addition of reflection channels, but this does not significantly increase the convergence time of the proposed DRL framework [46]. In summary, DRL-based algorithms, such as the DDPG used in this paper, are very sensitive to the setting of hyper-parameters in the neural network, and the optimal hyper-parameter settings can vary considerably under different model structures [47]. Therefore, careful experimental tuning is required to obtain optimal parameter settings that can significantly improve the performance and convergence speed of the algorithm [48,49]. Numbers of steps in each training episode 10,000   Figure 5 shows a schematic comparison of the average reward performance and the outdated CSI coefficients, respectively. In the proposed DRL framework, we choose the last moment CSI as the state-space input, and we can see that the average reward of all algorithms decreases gradually as the outdated CSI coefficient decreases. However, the proposed soft-DDPG framework remains at a favorable level compared to the existing scheme and the hard-DDPG scheme. Compared with the advanced DRL schemes, which do not require an exact channel model information, the existing alternating parameter optimization scheme relies on the knowledge of static exact channel model, but because of the high dynamic communication scenario, the system performance is not as good as the soft DDPG and hard DDPG schemes.

Conclusions
This paper discussed the joint optimal design scheme of transmitting active beamforming and passive beamforming for maximizing the system sum rate. In the IS-HAP-TNs assisted by RIS, it is hard to sense the channel state information in the dynamic environment accurately and comprehensively. On this basis, a novel type of DRL architecture is proposed, namely the soft-DDPG algorithm. With the help of the network parameter soft-update strategy, the coordination of the phase shift matrix can be obtained, even when the increasing number of RIS reflective element amplitudes changes. Simulation results show that the proposed framework can achieve better network performance in a lower operation duration and can be applied to the real-time control of IS-HAP-TN systems.