User Pairing for Delay-Limited NOMA-Based Satellite Networks with Deep Reinforcement Learning

In this paper, we investigate a user pairing problem in power-domain non-orthogonal multiple access (NOMA) scheme-aided satellite networks. In the considered scenario, different satellite applications are assumed to have various delay quality-of-service (QoS) requirements, and the concept of effective capacity is employed to characterize the effect of delay QoS limitations on the achieved performance. Based on this, our objective was to select users to form a NOMA user pair and utilize resources efficiently. To this end, a power allocation coefficient was first obtained by ensuring that the achieved capacity of users with sensitive delay QoS requirements was not less than that achieved with an orthogonal multiple access (OMA) scheme. Then, considering that user selection in a delay-limited NOMA-based satellite network is intractable and non-convex, a deep reinforcement learning (DRL) algorithm was employed for dynamic user selection. Specifically, the channel conditions and delay QoS requirements of users were carefully selected as the state, and a DRL algorithm was used to search, for each state, for the optimal user who could achieve the maximum performance with the power allocation factor and pair with the delay QoS-sensitive user to form a NOMA user pair. Simulation results are provided to demonstrate that the proposed DRL-based user selection scheme can output the optimal action in each time slot and, thus, provide superior performance to that achieved with a random selection strategy and an OMA scheme.


Introduction
Due to its inherent ability to provide vast coverage and economical service, satellite communication can effectively supplement terrestrial networks during disasters and in rural and desert areas; thus, it has been considered an important component of next-generation wireless networks [1]. However, the dramatically increased demand for data access brings even bigger challenges to future satellite networks, including massive connectivity, limited power/spectral resources, and various quality-of-service (QoS) requirements. Recently, non-orthogonal multiple access (NOMA) schemes, including power-domain NOMA [2] and code-domain NOMA [3], featuring massive access, high resource utilization efficiency, and user fairness, have become a promising solution to alleviate these challenges faced by future satellite networks. Of these two schemes, the power-domain NOMA (or simply NOMA for short) scheme, which can harmoniously integrate with orthogonal multiple access (OMA) techniques in existing satellite architectures, is the main motivation and focus of this article.
In a NOMA-based satellite network, a satellite/multiple users can simultaneously communicate with multiple users/a satellite in downlink/uplink transmissions by superimposing signals in the power domain. Machine learning has also been introduced into satellite communications, e.g., by using deep neural networks for LEO satellite links. Notably, supervised learning, such as the algorithms used in [15-18], needs to learn characteristics from input data and desired output data, while reinforcement learning (RL), which is model-free and data-driven, has been extensively adopted in various wireless networks with different objectives. For example, based on Q-learning, an algorithm for jointly optimizing user pairing and power allocation was proposed in [19] to maximize the total sum rate of a satellite random access system. Considering large-scale low-earth-orbit constellations, the work in [20] developed a low-complexity successive deep Q-learning algorithm for optimal satellite handover. The authors in [21] proposed a Q-learning NOMA-based random access scheme for time slot and channel allocation in satellite-terrestrial relay networks. In [22], the authors adopted a graph neural network and RL algorithms in a hybrid satellite-terrestrial network to optimize UAV trajectories and maximize the number of served users. In [23,24], the authors conducted resource management in a relay-aided network with the help of distributionally robust deep RL (DRL) and enhanced DRL algorithms, respectively.
Motivated by these observations, in this work, we leaned upon a DRL algorithm to pair users and provide services with various delay QoS requirements in future NOMA-based satellite networks. (Since this paper's aim was to pair users in delay-limited NOMA-based satellite networks with a DRL algorithm, the impacts of low-density parity-check codes [25] in NOMA-based satellite networks will be our follow-up research.) The main contributions of this work can be described as follows:
• The concept of effective capacity is employed to measure the rate achieved under a given delay QoS constraint. Based on this, a power allocation coefficient is first obtained by ensuring that the achieved capacity of users with sensitive delay QoS requirements is not less than that achieved with an OMA scheme, and then the user pairing problem is formulated with the aim of maximizing the sum effective capacity of the considered system;
• Because various delay QoS requirements have varying negative impacts on users' capacity, user pairing in a NOMA-based network with various delay QoS constraints differs from that in traditional delay-insensitive NOMA-based systems. In this condition, to maximize the system capacity with the obtained power allocation factor when the delay-critical user is fixed, a DRL approach is introduced to select one user who has a relatively insensitive delay requirement and a good link condition, compared to the other users, so as to optimize NOMA user pairing with low complexity;
• The proposed DRL-based NOMA user pairing strategy is compared to an OMA scheme and NOMA with a random user selection scheme, which reveals the superiority of introducing the NOMA scheme and the DRL algorithm into satellite networks from the perspective of performance enhancement. Specifically, the advantage of the proposed approach is achieved by selecting the most suitable delay-tolerant user to pair with the delay-sensitive user and form a NOMA user group in each time slot.
The rest of this paper is outlined as follows. The system model is presented in Section 2. Section 3 introduces the concept of effective capacity, obtains the power allocation scheme by ensuring that the achieved capacity of the user with a sensitive delay QoS requirement is not less than that achieved with the OMA scheme, and formulates the user pairing problem for the delay-limited NOMA-aided satellite network. In Section 4, a DRL algorithm is described in detail and tested in the proposed system. Performance results are discussed and conclusions are given in Sections 5 and 6, respectively.

System Model
Consider a downlink NOMA-based satellite system that is designed to serve m (m ≥ 2) users with the help of the NOMA scheme. These m users are randomly deployed in an area approximated as a circle of radius R with different channel statistical properties and delay QoS requirements. (In this paper, channel estimation errors, co-channel interference, complexity, and mobility constraints are not taken into consideration in the proposed system model; the influences of these parameters on user selection and system performance will be a focus of our future work, building on the contributions of the current work.) Without loss of generality, users are ordered based on their link budgets, i.e., Q_1 ≤ Q_2 ≤ ··· ≤ Q_m, where Q_j is the link budget of User j (j = 1, 2, ..., m). For simplicity, we further assume that only the c-th and t-th users (1 ≤ c < t ≤ m) are selected to form a NOMA group, and each user in the proposed model is equipped with a single antenna.
Thus, the received signal at User j (j = c, t) is

y_j = √(Q_j) x + w_j,

where w_j denotes the noise at User j with zero mean and δ² variance, x = Σ_{j=c,t} √(α_{p_j} P_s) x_j is the superposed signal (with α_{p_j} being the fraction of the transmission power P_s allocated to User j and x_j (E[|x_j|²] = 1) being the signal for User j), and Q_j (including FSL, antenna gain, beam gain, and fading) is the entire link budget from the satellite to User j, which can be described as

Q_j = Φ_j G_s(ϕ_j) |g_j|²,

where Φ_j = L_j G_j, with L_j and G_j being the FSL and antenna gain at User j, respectively. The beam gain G_s(ϕ_j) of User j, with ϕ_j denoting the angle between User j and the beam center with respect to the satellite, can be approximated as [5]

G_s(ϕ_j) ≈ G_max ( J_1(a d_j)/(2 a d_j) + 36 J_3(a d_j)/(a d_j)³ )²,

with G_max representing the maximum antenna gain, J_n(·) being the Bessel function of the first kind and n-th order, d_j being the distance from the beam center to User j, and a = 2.07123/R. |g_j|² is the channel power gain of the satellite link, which is assumed to follow the widely applied Shadowed Rician fading model [26-30]. According to [31], the probability density function (PDF) of |g_j|² is

f_{|g_j|²}(x) = α_j exp(−β_j x) 1F1(m_j; 1; δ_j x), x ≥ 0,

where

α_j = (2b_j m_j/(2b_j m_j + Ω_j))^{m_j}/(2b_j), β_j = 1/(2b_j), δ_j = Ω_j/(2b_j(2b_j m_j + Ω_j)),

with 2b_j and Ω_j, respectively, being the average power of the multipath and LoS components, m_j (m_j > 0) denoting the Nakagami-m fading parameter, and 1F1(a; b; c) representing the confluent hypergeometric function ([32], Equation (9.14.1)). Based on the principle of the downlink NOMA scheme, the decoding order is decided by the users' channel qualities, i.e., the user with the worse link condition decodes its own information first and directly. Thus, the signal-to-interference-plus-noise ratio (SINR) of User c is

γ_c^N = α_{p_c} γ Q_c / (α_{p_t} γ Q_c + 1),

where α_{p_c} + α_{p_t} = 1 and γ = P_s/δ² is the average transmission SNR. At the same time, the user with the better channel quality, i.e., User t, adopts the SIC strategy to decode and remove the interference from User c; the decoding SINR can be derived as

γ_{t→c}^N = α_{p_c} γ Q_t / (α_{p_t} γ Q_t + 1).

We can derive that γ_c^N < γ_{t→c}^N, since Q_c < Q_t. Then, User t decodes its own information, and the achieved SINR is

γ_t^N = α_{p_t} γ Q_t.
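As a sanity check on the decoding-order condition above, the three SINR expressions can be evaluated numerically. The sketch below uses illustrative values for γ, Q_c, Q_t, and α_{p_c} that are assumptions, not values from the paper.

```python
# Illustrative values (assumptions, not the paper's parameters)
gamma = 10.0            # average transmission SNR, gamma = P_s / delta^2
Q_c, Q_t = 0.2, 1.5     # link budgets with Q_c < Q_t, so User t performs SIC
a_pc = 0.7              # power fraction alpha_{p_c}; alpha_{p_c} + alpha_{p_t} = 1
a_pt = 1.0 - a_pc

# User c decodes its own signal directly, treating User t's signal as interference
sinr_c = a_pc * gamma * Q_c / (a_pt * gamma * Q_c + 1.0)

# User t first decodes User c's signal in order to cancel it (SIC) ...
sinr_t_to_c = a_pc * gamma * Q_t / (a_pt * gamma * Q_t + 1.0)

# ... and then decodes its own signal free of intra-pair interference
sinr_t = a_pt * gamma * Q_t

# Q_c < Q_t guarantees gamma_c^N < gamma_{t->c}^N, so SIC decoding can succeed
assert sinr_c < sinr_t_to_c
```

The assertion mirrors the claim in the text that γ_c^N < γ_{t→c}^N whenever Q_c < Q_t.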

Effective Capacity
To provide services with different delay QoS requirements, the concept of effective capacity is employed to characterize the effect of the delay QoS limitation, characterized by the exponent θ (θ ≥ 0), on the achieved performance [10]. In this paper, an uncorrelated service process across different slots is further assumed and the normalized effective capacity is adopted. Under these conditions, given a delay QoS exponent θ_j, the normalized effective capacity of User j in bps/Hz is

C_j(θ_j) = (1/ψ_j) log_2 E[(1 + γ_j)^{ψ_j}],

where ψ_j = −θ_j T_f B/ln 2, with T_f and B being the frame duration and the occupied bandwidth, respectively, R_j = log_2(1 + γ_j) is User j's transmission rate, and E[·] is the expectation operator. We note that a larger/smaller delay QoS exponent θ_j corresponds to a more critical/tolerant delay-limited scenario.
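The normalized effective capacity above can be estimated by Monte Carlo averaging over channel realizations. The sketch below is illustrative only: it substitutes a simple Rayleigh model (exponential channel power) for the Shadowed Rician fading of the paper, and the SNR level and sample size are arbitrary assumptions.

```python
import numpy as np

def effective_capacity(snr_samples, theta, TfB=1.0):
    """Normalized effective capacity (bps/Hz) for delay QoS exponent theta.

    Implements C(theta) = (1/psi) * log2 E[(1 + gamma)^psi] with
    psi = -theta * Tf * B / ln 2; theta -> 0 recovers the ergodic capacity.
    """
    psi = -theta * TfB / np.log(2.0)
    return np.log2(np.mean((1.0 + snr_samples) ** psi)) / psi

rng = np.random.default_rng(1)
# Exponential channel power (Rayleigh fading) stands in for the Shadowed
# Rician model here; 10.0 is an arbitrary average SNR assumption
snr = 10.0 * rng.exponential(size=200_000)

ec_loose = effective_capacity(snr, theta=0.01)   # nearly delay-insensitive
ec_tight = effective_capacity(snr, theta=5.0)    # stringent delay QoS
ergodic = float(np.mean(np.log2(1.0 + snr)))

# A stricter delay requirement can only reduce the supported arrival rate
assert ec_tight < ec_loose <= ergodic
```

The final assertion reproduces the behaviour noted in the text: effective capacity decreases monotonically in θ and approaches the ergodic capacity as θ → 0.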

Power Allocation Strategy
To ensure that the capacity achieved by the user with a critical delay QoS requirement using the NOMA scheme is not less than that with the TDMA scheme, the power allocation coefficient should be further constrained. In this section, a power allocation scheme is investigated for two cases: User c in Case 1 and User t in Case 2 are assumed to be the delay-sensitive user.
For Case 1, θ_c > θ_t is assumed, and the power allocation factor is limited by

C_c^N(θ_c) ≥ C_c^T(θ_c),

where

C_c^T(θ_c) = (1/ψ_c) log_2 E[(1 + γ_c^T)^{0.5 ψ_c}],

with γ_j^T = γ Φ_j G_s(d_j) |g_j|² = γ Q_j being the SINR of User j (j = c, t) achieved with the TDMA scheme, and the factor 0.5 owing to the multiplexing loss in the TDMA system. By substituting (5) into (8), along with some manipulations, α_{p_c} can be derived in closed form, which means that the value of α_{p_c} is decided by γ, the location information, and the fading severity of User c.
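Since the closed form of α_{p_c} depends on the fading statistics, the coefficient can also be recovered numerically by bisecting on the equality C_c^N(θ_c) = C_c^T(θ_c): C_c^N is increasing in α_{p_c}, so the root is unique. The sketch below is a hedged illustration; the exponential channel power model, SNR, and delay exponent are assumptions, not the paper's parameters.

```python
import numpy as np

def eff_cap(rate, theta, TfB=1.0):
    """Normalized effective capacity (bps/Hz) of a rate process."""
    psi = -theta * TfB / np.log(2.0)
    return np.log2(np.mean(2.0 ** (psi * rate))) / psi

rng = np.random.default_rng(2)
gamma, theta_c = 10.0, 5.0
# Exponential channel power stands in for User c's Shadowed Rician link
Qc = 0.5 * rng.exponential(size=100_000)

# TDMA benchmark: User c only uses half of the resources (the 0.5 loss)
target = eff_cap(0.5 * np.log2(1.0 + gamma * Qc), theta_c)

def cap_noma_c(a_pc):
    """Effective capacity of User c under NOMA for power fraction a_pc."""
    sinr = a_pc * gamma * Qc / ((1.0 - a_pc) * gamma * Qc + 1.0)
    return eff_cap(np.log2(1.0 + sinr), theta_c)

# Bisection on a_pc: cap_noma_c is increasing, so equality has one root
lo, hi = 1e-9, 1.0 - 1e-9
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if cap_noma_c(mid) < target else (lo, mid)
alpha_pc = 0.5 * (lo + hi)
```

With α_{p_c} at this root, User c is exactly as well off under NOMA as under TDMA, and the remaining fraction 1 − α_{p_c} is left for User t.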
Similarly, the effective capacity of User t can be given by

C_t^N(θ_t) = (1/ψ_t) log_2 E[(1 + γ_t^N)^{ψ_t}].

By substituting (4) and (7) into (21) and following similar steps as those in the derivation of (12), the effective capacity expression of User t can be derived in closed form. Then, the sum effective capacity of the considered system can be given as

C_sum = C_c^N(θ_c) + C_t^N(θ_t).

Problem Formulation
Although the closed-form expression of the sum rate for the considered system has been derived, we must note that the rate of User j (j = c, t) is influenced by many factors, such as the delay exponent θ_j, the transmission SNR γ, the fading severity, the location information d_j, and α_{p_j}. Thus, to expressly show the different impacts of these key parameters on the achieved performance, the normalized effective capacity of User j is plotted in Figure 1, where ILS, AS, and FHS denote infrequent light shadowing, average shadowing, and frequent heavy shadowing, respectively. From Figure 1, we can directly observe that, when θ_j → 0, the effective capacity converges to the ergodic capacity, since only delay-insensitive traffic is served. However, when θ_j > 10, even for the case α_{p_j} = 1, the effective capacity reduces to 0 because the required delay QoS is too stringent. Thus, the range of User j's delay limitation is assumed to be constrained as θ_j ∈ [0.5, 10] in this paper. In addition, an increased d_j, a worse fading severity, or a decreased γ degrades the capacity curves. Moreover, all capacity curves decrease with increasing θ_j. This observation clearly indicates that the achieved performance suffers from a combination of these factors; in both Case 1 and Case 2, it seems that the user with the smallest delay QoS exponent, the nearest location, and the best shadowing should be selected as User t/c and paired with User c/t in Case 1/2 to maximize the sum performance of the considered system. In practice, however, within a spot beam, the user with the nearest location or the best fading condition may have a relatively large θ, or vice versa. Thus, how to select User t/c in Case 1/2 is a vital issue in a delay-limited scenario.
For simplicity, herein, we mainly focus on user pairing in Case 1, which means that α_{p_c} must meet C_t^N(θ_t) ≥ C_t^T(θ_t). Then, the optimization problem is to find the user who can obtain the best power utilization efficiency, after taking into account the link budget and delay QoS requirement, to be User t. The mathematical formulation of this problem can be denoted by P1 and formulated as

P1: max_t C_c^N(θ_c) + C_t^N(θ_t)
s.t. C1: Q_t > Q_c,
     C2: α_{p_c} satisfies the threshold in (11),
     C3: 0 ≤ d_c, d_t ≤ R.

In the aforementioned problem, C1 ensures that the link budget of User t is better than that of User c so that SIC can be successfully performed; C2 denotes that, in Case 1, the resource allocation threshold in (11) must be met to guarantee the minimum data rate requirement of User c; and C3 limits the locations of Users c and t to the beam coverage.
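For a small candidate pool, P1 can be solved by exhaustively evaluating the sum effective capacity of every user satisfying C1. The sketch below is illustrative only: the user parameters, the exponential channel power model, and the fixed value of α_{p_c} are assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, N = 10.0, 50_000

def eff_cap(rate, theta, TfB=1.0):
    """Normalized effective capacity (bps/Hz) of a rate process."""
    psi = -theta * TfB / np.log(2.0)
    return np.log2(np.mean(2.0 ** (psi * rate))) / psi

# Hypothetical pool of m users: mean link budgets and delay QoS exponents
m = 8
Q_mean = rng.uniform(0.2, 2.0, size=m)
theta = rng.uniform(0.5, 10.0, size=m)

c = int(np.argmin(Q_mean))   # delay-sensitive User c: worst link budget
theta_c = 9.38               # delay exponent of User c, as in the simulations
a_pc = 0.8                   # alpha_{p_c} assumed already fixed by the constraint
a_pt = 1.0 - a_pc

# Exponential channel power stands in for Shadowed Rician fading
Qc = Q_mean[c] * rng.exponential(size=N)
cap_c = eff_cap(np.log2(1.0 + a_pc * gamma * Qc / (a_pt * gamma * Qc + 1.0)),
                theta_c)

# Exhaustive search over candidates satisfying C1 (Q_t > Q_c)
best_t, best_sum = None, -np.inf
for t in range(m):
    if t == c or Q_mean[t] <= Q_mean[c]:
        continue
    Qt = Q_mean[t] * rng.exponential(size=N)
    cap_t = eff_cap(np.log2(1.0 + a_pt * gamma * Qt), theta[t])
    if cap_c + cap_t > best_sum:
        best_t, best_sum = t, cap_c + cap_t
```

This brute-force pass over all candidates is exactly what becomes impractical as the number of users grows, which motivates the learning-based selection of Section 4.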

DRL for Delay-Constrained User Pairing
The deep Q-network (DQN) algorithm, which combines the advantages of Q-learning and deep neural networks, is one of the most representative value-based methods in the DRL family, with which the expected returns of actions can be predicted based on a certain environmental observation; a framework for applying such an algorithm to user pairing in the considered system is provided in Figure 2. (The DQN method is a classical approach in the DRL family, whose complexity analysis is not provided in this paper; the interested reader can refer to [33]. Although DRL deployment introduces additional delay, it is believed that this delay can be significantly decreased with the improvement of chip processing speeds.)
Since our objective in problem P1 is to choose an appropriate user to be User t in different time slots to maximize the power resource utilization, we define a tuple M := ⟨S, A, R, π⟩ to model this problem as a Markov decision process (MDP) for a stationary decision. Specifically, S is the state and observation space, A represents the set of actions, R is the designed reward, and π is the policy that makes the decision. Meanwhile, Q_π(s_l, a_l) is defined as the Q-value obtained with policy π when the environment is in state s_l and action a_l is adopted in the l-th time slot. For problem P1, the key elements of the MDP model, namely the states, actions, and reward, are described in detail as follows:
• State S: At time slot l, a tuple denoted by s_l = (P_s, Φ_j, d_j, g_j, θ_j), s_l ∈ S, is used to describe the system state, where P_s, Φ_j, d_j, g_j, and θ_j are the transmission power, antenna gains, location information, fading severity, and delay QoS exponent of User j (j = 1, 2, ..., c − 1, c + 1, ..., m), as analyzed in Sections 2 and 3, respectively. Since s_l varies across time slots, the agent is required to adjust its action in each slot accordingly;
• Action A: NOMA user pairing is important for NOMA-aided satellite networks with delay QoS constraints because it directly impacts the resource utilization efficiency. Thus, user selection should be designed based on the current state; here, we set the action space as A = {1, 2, ..., c − 1, c + 1, ..., m}, so that a_l = m means the m-th user is selected to be User t;
• Reward design: Equation (11) must be satisfied to ensure that User c's performance achieved with the NOMA scheme is not less than that achieved with the TDMA scheme. Based on this, our objective is to select as User t the user who, with the remaining power resource, can achieve the largest effective capacity.
Thus, if User j is selected at time slot l, the reward is assigned as the normalized effective capacity that User j achieves with the remaining power, i.e., R_π(l) = C_j^N(θ_j) with α_{p_j} = 1 − α_{p_c}. As can be seen from Figure 2, the DQN algorithm has two phases. In the data-generation phase, Q-learning with experience pool D is used to generate data for the subsequent network-training phase. In this process, the agent chooses an action a_l according to its observation s_l under policy π. To trade off between exploration and exploitation, ε-greedy exploration is used here, which means that, for state s_l, a random action is chosen with probability ε (0 < ε < 1) or the best action with probability (1 − ε). With this ε-greedy policy, the Q-value function Q_π(s_l, a_l), which describes the expected R_π(l), can be given by Q_π(s_l, a_l) = E(R_π(l)|S = s_l, A = a_l). This Q-value function is updated with

Q_π(s_l, a_l) ← Q_π(s_l, a_l) + ᾱ [R_π(l) + γ̄ max_{a∈A} Q_π(s_{l+1}, a) − Q_π(s_l, a_l)],

where ᾱ and γ̄ are the learning rate and discount factor, respectively. The best action can be written as a_l = arg max_{a_l∈A} Q_π(s_l, a_l). Following the environmental transition resulting from variations in users' link budgets and delay QoS limitations, the tuple (s_l, a_l, R_l, s_{l+1}) at the l-th time slot is collected and stored in the experience pool, in which the oldest tuple gives space to the newest tuple if the pool is full.
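The ε-greedy selection and tabular Q-value update described above can be sketched in a few lines; the state/action sizes, learning rate, discount factor, and reward below are toy assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

n_states, n_actions = 5, 4          # toy discretization of the state/action spaces
Q = np.zeros((n_states, n_actions))
alpha_bar, gamma_bar, eps = 0.1, 0.9, 0.1

def select_action(s):
    # epsilon-greedy: explore with probability eps, otherwise exploit the table
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha_bar * [r + gamma_bar * max_a' Q(s',a') - Q(s,a)]
    td_target = r + gamma_bar * np.max(Q[s_next])
    Q[s, a] += alpha_bar * (td_target - Q[s, a])

# One illustrative transition with a toy reward of 1.0
a0 = select_action(0)
q_update(s=0, a=a0, r=1.0, s_next=1)
```

In the paper's setting, the reward fed to `q_update` would be the effective capacity achieved by the selected user, but here a constant stands in for it.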
Considering that the number of satellite users in a beam spot could be very large, computing and storing the Q-values in (25) for all possible state-action pairs becomes prohibitively costly and inefficient. In this context, deep neural networks parameterized by θ⁻ and θ, called the target DQN and training DQN, respectively, are used in the network-training phase to estimate the Q-value by function approximation. As shown in Figure 2, the target DQN estimates the maximum Q-value for the next state, i.e., max_{a_{l+1}} Q(s_{l+1}, a_{l+1}; θ⁻). The training DQN is deployed to make the action decision and estimate the Q-value for the current state, whose loss function can be written as

L(θ) = E[(R_π(l) + γ̄ max_{a_{l+1}} Q(s_{l+1}, a_{l+1}; θ⁻) − Q(s_l, a_l; θ))²].

Using stochastic gradient descent to minimize the loss in (27), the correct weights θ can be learned by the training DQN. The weights θ⁻ are frozen for several steps and then updated by setting θ⁻ = θ, with the goal of stabilizing the training. The specific steps for the training DQN to select one user from many to be User t are given in Algorithm 1.

Algorithm 1: DQN-based user selection.
Initialization: ᾱ, γ̄, pool capacity D, θ, θ⁻, state s_0, l = k = 1, parameter C.
while k ≤ K do
  for l ≤ T do
    Observe state s_l;
    Choose action a_l under policy π;
    Perform action a_l, observe reward R_π(l) and next state s_{l+1};
    Store tuple (s_l, a_l, R_π(l), s_{l+1}) in the pool;
    if the number of tuples in the pool is larger than N_p then
      Sample a random mini-batch of tuples (s_l, a_l, R_π(l), s_{l+1}) from D;
      Update θ by performing stochastic gradient descent on (27);
      Set θ⁻ = θ every C steps.
    end
  end
  k = k + 1;
end
Output: In each time slot, the user who can achieve the largest capacity with the power factor in (11) is selected to be User t.
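A minimal sketch of the two-network training loop (experience pool, TD target from a frozen target network, and periodic copying of the training weights into the target network) is given below. To stay self-contained, it replaces the deep network with a linear Q-approximator and uses a toy reward; all dimensions and hyperparameters are assumptions, not the paper's settings.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(5)
random.seed(5)

state_dim, n_actions = 4, 3
theta_train = rng.normal(scale=0.1, size=(state_dim, n_actions))  # training net
theta_target = theta_train.copy()                                 # frozen target
pool = deque(maxlen=1000)                                         # experience pool D
alpha_bar, gamma_bar, batch, C_sync = 0.01, 0.9, 32, 50

def q_values(s, w):
    # Linear approximator Q(s, .; w) stands in for the deep network
    return s @ w

def train_step(samples):
    global theta_train
    for s, a, r, s_next in samples:
        # TD target is computed with the frozen target parameters
        y = r + gamma_bar * np.max(q_values(s_next, theta_target))
        grad = np.zeros_like(theta_train)
        grad[:, a] = 2.0 * (q_values(s, theta_train)[a] - y) * s
        theta_train -= alpha_bar * grad  # SGD step on the squared TD error

for step in range(200):
    s = rng.normal(size=state_dim)
    a = int(rng.integers(n_actions))     # exploratory action for data generation
    r = float(s[0])                      # toy reward signal
    s_next = rng.normal(size=state_dim)
    pool.append((s, a, r, s_next))
    if len(pool) >= batch:
        train_step(random.sample(list(pool), batch))
    if step % C_sync == 0:
        theta_target = theta_train.copy()  # periodic target synchronization
```

The freeze-and-copy schedule on `theta_target` is what keeps the regression target from chasing its own updates, which is the stabilization role the target DQN plays in Algorithm 1.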

Results
In this section, simulation results are provided to characterize the effects of users' specific delay QoS requirements on the power allocation scheme, the user selection strategy, and the system performance. Without loss of generality, we assumed T_f B = 1, a carrier frequency of 4 GHz, and a radius R = 125 km [5,13]. Moreover, we set the number of users to 150; the fading severities, location information, and delay requirements of these users were randomly generated within {ILS, AS, FHS}, [0, R], and [0.5, 10], respectively, to capture the various channel conditions, locations, and application scenarios of different satellite users. The delay QoS exponent of the delay-sensitive user, i.e., User c, was set as θ_c = 9.38, and the label (ILS/AS) denotes the link-shadowing severities of User t/User c in this paper.
We first conducted numerical simulations to show the impact of shadowing, γ, and d_c on the power allocation coefficient α_{p_c}, as illustrated in Figure 3. From this figure, we can clearly see that, when User c experiences lighter shadowing, a higher γ, or a smaller d_c, a larger α_{p_c} is needed to ensure that the performance achieved with the NOMA scheme is not less than that achieved with the TDMA scheme, which is consistent with the analytical result given in (11). In the following simulations, α_{p_c} was set to meet the condition C_c^N(θ_c) = C_c^T(θ_c) unless otherwise stated. Moreover, it can be observed that the analytical results were all consistent with the Monte Carlo simulations.
Then, simulations were conducted to illustrate the capacity of User t achieved with the NOMA and TDMA schemes versus the delay requirement θ_t, as shown in Figure 4, from which we can clearly observe that all the capacity curves degrade with increasing θ_t. This is an expected result, because a larger θ_t means a smaller tolerated delay outage and a lower supported constant arrival rate. Moreover, we find that the superiority of the NOMA scheme gradually decreases with an increasing delay limitation θ_t; i.e., when θ_t ≥ 10^0.4, the capacity gap between the NOMA and TDMA curves almost disappears. For the case θ_t < 10^0.4, the superiority of the NOMA scheme is significantly enhanced for a larger γ, a lighter fading severity of User t, or a smaller d_t, because any of these factors corresponds to a more favourable condition. This phenomenon suggests that, in addition to the shadowing, d_t, and γ, θ_t must be taken into account to form a flexible NOMA user group and ensure the superiority of NOMA-based satellite networks. Finally, the DQN algorithm was adopted to select one from many users to be User t and pair with User c to form a NOMA user group. Specifically, since the assumption Q_c < Q_t must be satisfied, only users with ILS/AS severity were viewed as candidates.
Meanwhile, α_{p_t} = 1 − α_{p_c} varied with the location and fading severity of User c, as well as the average transmission SNR γ, as shown in Figure 3.
The convergence of the proposed DQN algorithm with different learning rates is shown in Figure 5, from which we find that a smaller learning rate leads to faster convergence, since a smaller learning rate means that a lower newly acquired cost is accepted to adjust the evaluated Q_π(s_l, a_l). Thus, ᾱ = 0.01 was set in our algorithm. Figure 6 compares the effective capacity of the selected user achieved with the NOMA and TDMA schemes under the proposed strategy and a random selection strategy. It can be seen from Figure 6 that the curves with the proposed NOMA scheme are superior to those with the TDMA scheme in all cases, demonstrating the advantages of employing the NOMA scheme in delay QoS-constrained satellite communication networks. Moreover, since the proposed DQN-based user selection scheme can find the optimal action for each state, it provides superior performance, as well as a much larger performance gap between the NOMA and TDMA schemes, than the random selection strategy in each time slot.

Conclusions
In this paper, we have proposed a user pairing scheme for NOMA-based satellite networks with delay QoS constraints. The user pairing problem was formulated with the objective of maximizing the sum effective capacity without degrading the performance of the delay-sensitive user. In particular, we designed the power allocation strategy to ensure that the performance of the delay-sensitive user achieved with the NOMA scheme was not less than that achieved with an OMA scheme. Based on this, a DRL algorithm was adopted to select a user from many users to pair with the delay-sensitive user and form a NOMA group. Simulation results have been provided to validate the performance analyses, show the effects of key parameters on the system performance and the user selection strategy, and demonstrate that the DRL algorithm can significantly improve the system performance by finding the optimal action for each state.