Reinforcement Learning-Based Joint User Pairing and Power Allocation in MIMO-NOMA Systems

In this paper, we consider a multiple-input multiple-output (MIMO) non-orthogonal multiple access (NOMA) system with reinforcement learning (RL). NOMA, a technique for increasing spectrum efficiency, has been extensively studied for fifth-generation (5G) wireless communication systems, and applying MIMO to NOMA can yield an even higher spectral efficiency. Moreover, user pairing and power allocation are important problems in NOMA. However, NOMA has a fundamental limitation of high computational complexity due to rapidly changing radio channels, which makes it difficult to exploit the characteristics of the channel and to allocate radio resources efficiently. To reduce the computational complexity, we propose an RL-based joint user pairing and power allocation scheme. By applying Q-learning, we perform user pairing and power allocation simultaneously, which reduces the computational complexity. The simulation results show that the proposed scheme achieves a sum rate similar to that achieved with the exhaustive search (ES).


Introduction
5G mobile communication is further increasing the number of users on the wireless Internet, and the number of autonomous vehicles connected to 5G is also growing. Hence, the importance of spectrum efficiency has been significantly increasing, and non-orthogonal multiple access (NOMA) is one of the most important research areas [1]. NOMA techniques can be categorized into two main classes: power-domain and code-domain NOMA. Code-domain NOMA is a technique for multiplexing users based on codewords. The concept of code-domain NOMA was inspired by the classic code division multiple access (CDMA) system [2]. Code-domain NOMA allows multiple users to share the same time-frequency resources but adopts unique user-specific spreading sequences, which are restricted to sparse sequences or non-orthogonal low cross-correlation sequences. Sparse code multiple access (SCMA) is one of the most important techniques in recent code-domain NOMA. In particular, studies to improve the spectral efficiency by using low-density parity-check (LDPC) codes are actively being conducted [3,4]. Power-domain NOMA is another technique that allows multiple user equipments (UEs) to access the same time/frequency resource, where the signals from the UEs are multiplexed through different power allocation coefficients [5]. The transmit power at the base station (BS) is divided among the UEs: UEs with poor channel conditions receive more transmit power, whereas UEs with better channel conditions receive less. On the receiver side, successive interference cancellation (SIC) is used to recover each signal; the receiver successively decodes and subtracts the received signals until it reaches its desired signal [5]. SIC and power allocation are thus important techniques in power-domain NOMA.
Recently, reinforcement learning (RL) has been applied to various radio resource management problems. In [22], the dynamic channel access problem was formulated as a partially observable Markov decision process (POMDP), and a deep Q-network (DQN) was applied to find the access policy via online learning. In [23], the authors proposed a multi-agent deep neural network (DNN) approach to predict the spectrum occupation of unknown neighbouring networks in slotted wireless networks, where they trained the DNN in an online way using both RL and supervised learning. The authors in [24] proposed a DQN-based power allocation for a multi-cell network to maximize the total network throughput. In [25], a joint precoding and SIC decoding scheme for MIMO-NOMA systems was presented for an imperfect SIC decoding environment.
The key challenges in MIMO-NOMA are beamforming, optimization, power allocation, user pairing, and SIC ordering. These challenges have been studied jointly or partially, under specific performance metrics. MIMO-NOMA is a technology that can enhance the spectral efficiency in 5G, but it has a fundamental limitation of high computational complexity. This paper aims to increase the sum rate and reduce the computational complexity by using RL-based joint power allocation and user pairing in MIMO-NOMA systems. The contributions of this paper are as follows: First, we propose an RL-based joint user pairing and power allocation scheme for MIMO-NOMA systems. Previous studies investigated the user pairing and power allocation problems independently, or addressed them via mathematical approaches such as convex optimization in a simplified system with a few users. To the best of the authors' knowledge, this study is the first attempt in which RL is applied to perform user pairing and power allocation jointly in a practical system with multiple users. Second, the proposed RL-based scheme reduces the computational complexity. In the conventional schemes, the user pairing is performed after the BS has received the location and channel state information (CSI) from the UEs, and then the power is allocated to the UEs in each pair. In this paper, the user pairing and power allocation are performed simultaneously through RL when the BS receives the location and CSI from the UEs. The exhaustive search (ES) finds the maximum sum rate, but its computational complexity is extremely high because it enumerates all candidate user pairs and all feasible power allocation coefficients and then computes the resulting sum rates. The proposed RL scheme reduces the computational complexity because the sum rate is calculated with a single action selection. Third, the proposed RL-based scheme achieves a sum rate superior to those of orthogonal multiple access (OMA) and other comparable schemes. At the beginning of the simulation, the sum rate of the proposed scheme is low because the BS selects actions randomly; however, as the time slots elapse, the learning proceeds and the sum rate approximately converges to that of the ES. Moreover, the proposed scheme is shown to be more efficient than the ES and phased RL schemes in terms of time and computational complexity.
The remainder of this paper is organized as follows: Section 2 describes the system model, and Section 3 presents the proposed RL-based joint user pairing and power allocation in MIMO-NOMA systems. The numerical results are presented in Section 4, and Section 5 concludes this paper.
For the sake of clarity, the main symbols and their descriptions used in this paper are summarized in Table 1.
Notations: Vectors are represented by boldface lowercase letters, while matrices are represented by boldface capital letters; $\mathbf{I}_N$ is the $N \times N$ identity matrix, and $\hat{h}$ denotes the quantized value of $h$.

System Description
In this paper, we consider a downlink MIMO-NOMA system in a macro cell with a 500 m radius, as shown in Figure 1. The BS has transmit power $P_{BS}$ and allocates the same power to the $N$ antennas; considering the characteristics of NOMA, the BS transmits a superimposed signal. To create a MIMO-NOMA applicable scenario, all $M$ UEs are randomly distributed in the cell. The transmit power at each beam can be expressed as $P_n = P_{BS}/N$. We assume that the channel gains are ordered as follows:
$$|h_{n,i}|^2 \le |h_{n,j}|^2, \quad \text{for } i \le j. \tag{1}$$
In NOMA, a UE close to the BS can cancel the interference signal by using SIC, where the interference signal is the signal sent to the UE with poor channel conditions. Here, the SIC is assumed to operate with little or no error. In addition, the BS is responsible for pairing the UEs and then determines the transmit power of each UE. Each UE suffers from Rayleigh fading and additive white Gaussian noise (AWGN) with zero mean and variance $\sigma^2_n$. The superimposed signal transmitted by the BS on beam $n$ is as follows:
$$x_n = \sum_{k=1}^{K} \sqrt{\alpha_{n,k} P_n}\, s_{n,k}, \tag{2}$$
where $s_{n,k}$, $\alpha_{n,k}$, and $P_n$ denote the signal transmitted by the BS, the power allocation coefficient, and the transmit power of each beam, respectively. The signal received at UE$_{n,k}$ is as follows:
$$y_{n,k} = \mathbf{h}_{n,k} \sum_{n'=1}^{N} \mathbf{w}_{n'} x_{n'} + n_{n,k}, \tag{3}$$
where $\mathbf{h}_{n,k}$ is the Rayleigh fading channel vector from the BS to UE$_{n,k}$, $\mathbf{w}_n$ is the precoding vector for each beam in the precoding matrix $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N]$, $\mathbf{w}_n \in \mathbb{C}^{N \times 1}$, and $n_{n,k}$ is the AWGN; $\mathbf{h}_{n,k}$ can be expressed as follows:
$$\mathbf{h}_{n,k} = \frac{\mathbf{g}_{n,k}}{\sqrt{d_{n,k}^{\eta}}}, \tag{4}$$
where $\mathbf{g}_{n,k}$ is the Rayleigh fading vector, $d_{n,k}$ denotes the distance between the BS and UE$_{n,k}$, and $\eta$ is the path loss exponent; $\mathbf{h}_{n,k}$ represents the RL's state. Equation (3) can be rewritten as follows:
$$y_{n,k} = \mathbf{h}_{n,k}\mathbf{w}_n \sqrt{\alpha_{n,k} P_n}\, s_{n,k} + \underbrace{\mathbf{h}_{n,k}\mathbf{w}_n \sum_{k'=k+1}^{K} \sqrt{\alpha_{n,k'} P_n}\, s_{n,k'}}_{\text{intra-beam interference}} + \underbrace{\mathbf{h}_{n,k} \sum_{n'=1, n' \neq n}^{N} \mathbf{w}_{n'} x_{n'}}_{\text{inter-beam interference}} + n_{n,k}. \tag{5}$$
After SIC, Equation (5) can be rewritten as follows:
$$y_{n,k} = \begin{cases} \mathbf{h}_{n,k}\mathbf{w}_n \sqrt{\alpha_{n,k} P_n}\, s_{n,k} + \mathbf{h}_{n,k} \displaystyle\sum_{n'=1, n' \neq n}^{N} \mathbf{w}_{n'} x_{n'} + n_{n,k}, & \text{if } k = K, \\ \mathbf{h}_{n,k}\mathbf{w}_n \sqrt{\alpha_{n,k} P_n}\, s_{n,k} + \mathbf{h}_{n,k}\mathbf{w}_n \displaystyle\sum_{k'=k+1}^{K} \sqrt{\alpha_{n,k'} P_n}\, s_{n,k'} + \mathbf{h}_{n,k} \displaystyle\sum_{n'=1, n' \neq n}^{N} \mathbf{w}_{n'} x_{n'} + n_{n,k}, & \text{otherwise}. \end{cases} \tag{6}$$
Following the principle of NOMA, the power allocation coefficient $\alpha_{n,k}$ of each UE satisfies
$$\alpha_{n,k} \in \Omega, \quad \sum_{k=1}^{K} \alpha_{n,k} \le 1, \tag{7}$$
where $\Omega$ denotes the space of the feasible power allocation coefficients.
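As a small illustration of the superposition coding in Equation (2), the following sketch (our own, not the authors' code; the BPSK symbols and power split are assumptions) builds a beam signal as a power-weighted sum of its $K$ UEs' symbols:

```python
import numpy as np

rng = np.random.default_rng(0)

K, P_n = 2, 5.0                          # UEs per beam, per-beam power (assumed)
alpha = np.array([0.8, 0.2])             # more power to the weak UE (k = 1)
s = rng.choice([-1.0, 1.0], size=K)      # unit-power symbols (BPSK for brevity)
x_n = np.sum(np.sqrt(alpha * P_n) * s)   # x_n = sum_k sqrt(alpha_{n,k} P_n) s_{n,k}
```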

Problem Formulation
Based on Equation (5), the signal-to-interference-plus-noise ratio (SINR) for UE$_{n,k}$ is given by
$$\gamma_{n,k} = \frac{\alpha_{n,k} P_n |\mathbf{h}_{n,k}\mathbf{w}_n|^2}{I^U_{n,k} + I^N_{n,k} + \sigma^2_n}, \tag{8}$$
where $I^U_{n,k}$ and $I^N_{n,k}$ are the intra-beam and inter-beam interference, respectively, as follows:
$$I^U_{n,k} = |\mathbf{h}_{n,k}\mathbf{w}_n|^2 \sum_{k'=k+1}^{K} \alpha_{n,k'} P_n, \tag{9}$$
$$I^N_{n,k} = \sum_{n'=1, n' \neq n}^{N} |\mathbf{h}_{n,k}\mathbf{w}_{n'}|^2 P_{n'}. \tag{10}$$
The objective is to maximize the sum rate of all UEs. Thus, the user pairing of each beam $\Phi_n$, the power allocation coefficient $\alpha_{n,k}$ for each UE, and the precoding vector $\mathbf{w}_n$ should be determined [8]. The problem can then be formulated as follows:
$$\max_{\Phi_n,\, \alpha_{n,k},\, \mathbf{w}_n} \sum_{n=1}^{N} \sum_{k=1}^{K} \log_2(1 + \gamma_{n,k}) \tag{11}$$
$$\text{s.t.} \quad \text{(C1)}\ \sum_{k=1}^{K} \alpha_{n,k} \le 1, \qquad \text{(C2)}\ \log_2(1 + \gamma_{n,k}) \ge R_0, \qquad \text{(C3)}\ \|\mathbf{w}_n\|^2 = 1,$$
where Equation (11) represents the sum rate of the MIMO-NOMA UEs. The constraint (C1) limits the summation of the power allocation coefficients in a beam. The constraint (C2) means that the BS satisfies the minimum data rate $R_0$ of each UE. The constraint (C3) represents the beamforming constraint. This optimization problem is non-convex and NP-hard; to solve it, the computational complexity should be reduced. The precoding matrix can be chosen such that [5]
$$\mathbf{H}\mathbf{W} = \mathbf{I}_N, \tag{12}$$
where $\mathbf{I}_N$ is the $N \times N$ identity matrix. With Equation (12), the inter-beam interference $I^N_{n,k}$ can be canceled. Therefore, the complex MIMO-NOMA system can be simplified into single-input single-output (SISO) NOMA systems.
From Equations (8) and (12), the data rate of UE$_{n,k}$ can be expressed as follows:
$$R_{n,k} = \log_2\!\left(1 + \frac{\alpha_{n,k} P_n |\mathbf{h}_{n,k}\mathbf{w}_n|^2}{I^U_{n,k} + \sigma^2_n}\right). \tag{13}$$
UE$_{n,K}$ is the user closest to the BS, and SIC can be used to remove the intra-beam interference $I^U_{n,K}$. Consequently, Equation (13) can be rewritten as follows:
$$R_{n,K} = \log_2\!\left(1 + \frac{\alpha_{n,K} P_n |\mathbf{h}_{n,K}\mathbf{w}_n|^2}{\sigma^2_n}\right). \tag{14}$$
From Equations (13) and (14), the data rates of the UEs with $1 \le k \le K$ in a beam can be calculated, and the sum rate of the whole MIMO-NOMA system is obtained by summing the data rates of all beams. The sum rate $R_{all}$ of the MIMO-NOMA system can be expressed as follows:
$$R_{all} = \sum_{n=1}^{N} \sum_{k=1}^{K} R_{n,k}. \tag{15}$$
In the conventional user pairing and power allocation procedure, after the BS acquires the CSI from the UEs, the BS determines a pair according to the location or channel gain. This information is transmitted to the UEs. When the response from the UEs has been received, the power allocation coefficient of the UEs belonging to each beam is determined, and the signal is transmitted to each UE with the allocated power.
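As a concrete illustration of Equations (13)-(15), the following Python sketch (our own, not the authors' code) computes the sum rate of a small downlink, assuming the precoding of Equation (12) has already removed the inter-beam interference; all parameter values and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not the paper's Table 2 values)
N, K = 2, 2              # beams, UEs per beam
P_bs = 10.0              # total BS transmit power (linear scale)
P_n = P_bs / N           # per-beam power, P_n = P_BS / N
sigma2 = 1e-2            # AWGN variance
eta = 3.0                # path-loss exponent
alpha = np.array([[0.8, 0.2],
                  [0.6, 0.4]])  # per-beam coefficients; each row sums to 1

# Effective channel power gains |h_{n,k} w_n|^2 after precoding, with Rayleigh
# fading and path loss, sorted so |h_{n,1}|^2 <= ... <= |h_{n,K}|^2 (Eq. (1))
d = rng.uniform(50, 500, size=(N, K))                  # BS-UE distances (m)
g = rng.rayleigh(scale=1.0, size=(N, K))**2 / d**eta   # channel power gains
g.sort(axis=1)

def sum_rate(g, alpha, P_n, sigma2):
    """R_all of Eq. (15): UE k suffers intra-beam interference I^U_{n,k} from
    the UEs k' > k (Eq. (9)); the strongest UE (k = K) removes it via SIC."""
    total = 0.0
    for n in range(g.shape[0]):
        for k in range(g.shape[1]):
            intra = g[n, k] * P_n * alpha[n, k + 1:].sum()   # I^U_{n,k}
            sinr = alpha[n, k] * P_n * g[n, k] / (intra + sigma2)
            total += np.log2(1.0 + sinr)                     # Eqs. (13), (14)
    return total

print(f"R_all = {sum_rate(g, alpha, P_n, sigma2):.2f} bps/Hz")
```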

Proposed RL-Based Joint User Pairing and Power Allocation
In this section, joint user pairing and power allocation for maximizing the sum rate of a MIMO-NOMA system is proposed. In the wireless channel environment, user pairing and power allocation can be modeled as repeated interactions between the BS and the UEs. The optimal user pairing and power allocation depend on the locations of the UEs and their radio channel states [18]. The user pairing and power allocation of the BS affect the sum rate of the MIMO-NOMA system. Because the MIMO-NOMA transmission process can be formulated as a Markov decision process, Q-learning can be applied to the MIMO-NOMA system.
Q-learning is based on the state, action, and reward [26]. Figure 2 shows the basic structure of RL. In the proposed Q-learning model, the agent is the BS, and the environment comprises the fading, shadowing, and distances between the BS and the UEs.

Design State and Action
The BS performs the user pairing and power allocation based on Q-learning, and the Q-function determines the user pairing and power allocation value. The state $s_t$ is the quantized channel vector of the UEs, $\hat{\mathbf{h}}_{n,k}$; the action $\theta_t$ comprises a user pairing set $\Phi_n$ and a power allocation coefficient $\alpha_{n,k}$; and the reward is defined as the quantized sum rate $\hat{R}_{all}$ of the MIMO-NOMA system. The quantization is performed in $L$ steps, i.e., the channel vector of the UEs generated with the Rayleigh distribution is quantized into $L$ levels.
The state at time $t$ is as follows:
$$s_t = \left[\hat{\mathbf{h}}_{1,1}, \ldots, \hat{\mathbf{h}}_{n,k}, \ldots, \hat{\mathbf{h}}_{N,K}\right] \in \xi, \tag{16}$$
where $\xi$ is the space of all possible channel vectors. Moreover, the size of the state space can be expressed as $L^{NK}$. The action set of the BS is defined as the index of the joint user pairing and power allocation procedure. As assumed in the system model, when there are $M$ UEs in the cell and the BS forms $N$ beamforming vectors, $K$ UEs form a pair in each beam. The user pairing set is defined as $\Phi_n$:
$$\Phi_n = \{(n, 1), (n, 2), \ldots, (n, K)\}, \quad K \ge 2, \quad 1 \le n \le N. \tag{17}$$
When we use the ES method for user pairing, the computational complexity exponentially increases. Meanwhile, if the channel gains of the UEs grouped in the same $n$th pair are assumed to be ordered by Equation (1), the user pairing complexity can be reduced.
Moreover, the power allocation coefficients are quantized according to the number of UEs, $K$, in each beam, and the sum of the power allocation coefficients in a beam is set to 1. Thus, Equation (7) can be rewritten as follows:
$$\alpha_{n,k} \in \Omega, \quad \sum_{k=1}^{K} \alpha_{n,k} = 1. \tag{18}$$
Multiplying the user pairing indices by the $K$-step quantized power allocation coefficients yields the Q-learning's joint action; hence, joint user pairing and power allocation can be performed in one step. From Equations (17) and (18), the action at time $t$ can be expressed as follows:
$$\theta_t = \{\Phi_n, \alpha_{n,k}\}, \quad 1 \le n \le N, \quad 1 \le k \le K. \tag{19}$$
The size of the action space, $n(\theta_t)$, is given by Equation (20) as the product of the number of candidate user pairing sets and the number of quantized power allocation coefficient combinations. From Equation (20), the action set $\theta_t$ can be converted into an index set, i.e., $\theta_t \in \{0, 1, \ldots, n(\theta_t) - 1\}$.
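To make the joint action space concrete, the following sketch (hypothetical helper names; $M$, $N$, $K$, and $\Omega$ chosen to match the later simulation setup, and pairings treated as unordered, which is our assumption) enumerates every (user pairing, power allocation) combination that an action index can point to:

```python
from itertools import product

M, N, K = 4, 2, 2        # UEs, beams, UEs per beam (simulation setup)
Omega = [0.2, 0.4]       # quantized coefficients for the strong UE (assumed)

def pairings(ues):
    """Enumerate the ways to split the UE list into unordered pairs."""
    if not ues:
        yield []
        return
    head, rest = ues[0], ues[1:]
    for i, partner in enumerate(rest):
        for tail in pairings(rest[:i] + rest[i + 1:]):
            yield [(head, partner)] + tail

pairing_sets = list(pairings(list(range(M))))
# Joint action: one pairing set plus one coefficient choice per beam (Eq. (19))
actions = [(pairing, coeffs)
           for pairing in pairing_sets
           for coeffs in product(Omega, repeat=N)]
print(len(pairing_sets), len(actions))   # 3 pairing sets -> 12 joint actions
```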
The choice of an action in RL is determined by the tradeoff between exploitation and exploration. In this paper, the action was chosen by applying the ε-greedy policy, deciding whether to explore with a random action or to exploit the action with the best value under the current information, according to ε. The ε-greedy policy is as follows:
$$\theta_t = \begin{cases} \arg\max_{\theta} Q(s_t, \theta), & \text{with probability } 1 - \varepsilon, \\ \text{random action}, & \text{with probability } \varepsilon. \end{cases} \tag{21}$$
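A minimal sketch of the ε-greedy selection in Equation (21), where `Q` is assumed to be a |state| × |action| table:

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(Q, state, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # random action
    return int(np.argmax(Q[state]))            # arg max_theta Q(s_t, theta)
```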
An important point when designing the Q-learning model is the size of the (action × state) space. As the (action × state) space increases, the RL complexity exponentially increases. The quantization level $L$ of $\hat{\mathbf{h}}_{n,k}$ increases the state space, while the number of user pairing sets (determined by the number of UEs) and the number of quantization levels of the power allocation coefficient determine the action space. The (action × state) space exponentially increases with the number of UEs, as shown in Figure 3. As the quantization level increases, $\hat{\mathbf{h}}_{n,k}$ approaches the actual $\mathbf{h}_{n,k}$; however, increasing the quantization level may be inefficient because the complexity exponentially increases.
Because of the tradeoff between the complexity and the sum rate, it is important to find the optimal quantization level in the RL structure. Figure 4 shows the sum rate for an increasing quantization level when the time slot is limited to 100,000. The results show that, when the ES scheme is applied, the sum rate increases and converges to about 17.3 bps/Hz. By contrast, when the proposed Q-learning scheme is applied, the sum rate increases and then decreases after a certain level because of the limited time slot (100,000). If the time slot were not limited, the sum rate of Q-learning would increase as the quantization level increases. However, as the number of quantization levels increases, the number of states increases, and the RL model requires more time for the sum rate to converge. Our objective is to achieve a sum rate similar to that obtained with the ES scheme while reducing the computational complexity.
In Figure 4, for the case in which the reward of RL is calculated with $\hat{\mathbf{h}}_{n,k}$, the sum rate is highest when the quantization level is 5. Here, we assumed that there are four UEs in the cell. For the case in which the reward of RL is calculated with $\mathbf{h}_{n,k}$, the sum rate is highest when the quantization level is 4. Here, $\hat{R}_{all}$, the reward of RL, is calculated with $\hat{\mathbf{h}}_{n,k}$, whereas $R_{all}$, the sum rate, is calculated with $\mathbf{h}_{n,k}$. The difference between $\hat{R}_{all}$ and $R_{all}$ is due to the quantization error in the CSI. Because the objective is to increase the sum rate, we chose a quantization level of 4 in the proposed Q-learning.

Q-Learning-Based Joint User Pairing and Power Allocation Procedure
The reward is the sum rate of the MIMO-NOMA UEs. From Equation (15), the reward at time $t$ can be expressed as follows:
$$r_t = \hat{R}_{all}, \tag{22}$$
where $\hat{R}_{all}$ is the sum rate calculated with $\hat{\mathbf{h}}_{n,k}$. In Q-learning, $\hat{R}_{all}$ is continuously updated through the Q-function, whereas $R_{all}$ is calculated with $\mathbf{h}_{n,k}$. The user pairing index and the power allocation coefficient are simultaneously determined by using Q-learning. Moreover, $Q(s, \theta)$ denotes the Q-function of the BS for system state $s$ and action $\theta$, which is updated as follows:
$$Q(s_t, \theta_t) \leftarrow (1 - \beta) Q(s_t, \theta_t) + \beta \left[ r_t + \delta \max_{\theta'} Q(s_{t+1}, \theta') \right], \tag{23}$$
where the learning rate $\beta \in (0, 1]$ represents the weight of the recent experience in the learning process, and the discount factor $\delta \in [0, 1]$ controls the importance of the immediate and future rewards.
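A one-step sketch of the update in Equation (23); the default β and δ values are taken from the simulation settings in Section 4, and the function name is our own.

```python
def update_q(Q, s, theta, reward, s_next, beta=0.9999, delta=0.0001):
    """Blend the old estimate with the new sample r_t + delta * max Q(s', .)."""
    target = reward + delta * Q[s_next].max()
    Q[s, theta] = (1 - beta) * Q[s, theta] + beta * target
```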
The main structure of the joint user pairing and power allocation based on Q-learning is illustrated in Figure 5 and the algorithm is summarized in Algorithm 1.

Algorithm 1 Joint user pairing and power allocation with Q-learning
1: Set $Q(s_t, \theta_t) = 0$, $\forall \theta_t$ and $\forall s_t$
2: for $t = 1$ to $T$ do
3:     Observe the current state $s_t$
4:     Choose the action $\theta_t$ in Equation (19) according to the ε-greedy policy
5:     Convert the action into the user pairing set $\Phi_n$ and the power allocation coefficient $\alpha^t_{n,k}$
6:     for $n = 1$ to $N$ do
7:         for $k = 1$ to $K$ do
8:             Allocate the transmit power $\alpha^t_{n,k} P_n$ and the pair $\Phi_n$ for the signal to user $k$
9:         end for
10:    end for
11:    Send the superimposed signal $x_t$ via $N$ antennas
12:    Observe fading, shadowing, and the distance between the BS and the UEs
13:    Observe the CSI $\mathbf{h}^t_{n,k}$
14:    Calculate the reward $\hat{R}_{all}$
15:    Quantize the next state $s_{t+1}$
16:    Update $Q(s_t, \theta_t)$ in Equation (23)
17:    Calculate $R_{all}$ in Equation (15)
18: end for

Algorithm 1 works as follows: First, the Q-learning parameters $Q(s_t, \theta_t)$, $\theta_t$, and $s_t$ are initialized. In Step 3, the BS observes the current state $s_t$. In Step 4, the BS selects the action $\theta_t$ according to the ε-greedy policy. In Step 5, the BS converts the selected $\theta_t$ into a user pairing set $\Phi_n$ and the power allocation coefficient $\alpha_{n,k}$. In Step 11, the BS transmits the superimposed signal $x_t$ via the $N$ antennas to the UEs. In Step 12, the BS observes fading, shadowing, and the distance between the BS and the UEs. In Step 13, the CSI $\mathbf{h}^t_{n,k}$ is observed, and in Step 14, the reward $\hat{R}_{all}$ is calculated. In Step 15, the next state $s_{t+1}$ is quantized. Finally, in Steps 16 and 17, the BS updates $Q(s_t, \theta_t)$ and calculates $R_{all}$ based on Equations (23) and (15), respectively.
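Putting the pieces together, a compact end-to-end sketch of Algorithm 1 might look as follows; the environment is stubbed with random draws, and `step()` and `quantize()` are hypothetical helpers rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

L = 4                             # CSI quantization level chosen in Section 3
n_states, n_actions = L * L, 12   # toy sizes: two beams, 12 joint actions
Q = np.zeros((n_states, n_actions))
beta, delta, epsilon = 0.9999, 0.0001, 0.1

def quantize(h):
    """Map a normalized channel gain in [0, 1) to one of L levels (stub)."""
    return min(int(h * L), L - 1)

def step(action):
    """Environment stub: a real system would apply the pairing/power action,
    observe the CSI (Steps 11-13), and return the reward of Eq. (22)."""
    reward = rng.random() + 0.1 * action / n_actions
    h = rng.random(2)                              # two beams' channel gains
    return reward, quantize(h[0]) * L + quantize(h[1])

state = 0
for t in range(100_000):                           # Steps 2-18
    if rng.random() < epsilon:                     # Step 4: epsilon-greedy
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(Q[state]))
    reward, next_state = step(action)              # Steps 5-15
    target = reward + delta * Q[next_state].max()
    Q[state, action] = (1 - beta) * Q[state, action] + beta * target  # Step 16
    state = next_state
```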

Numerical Results
We consider a MIMO-NOMA system with one BS located at the center of the cell. The UEs are randomly distributed in the cell within a radius of 50 to 500 m. To take the movement and the channel fluctuation of each UE into consideration, the location and the CSI of each UE are randomly generated in every time slot. In addition, two UEs are assumed to be paired in one beam; Equation (15) can then be expressed as follows:
$$R_{all} = \sum_{n=1}^{N} \left[ \log_2\!\left(1 + \frac{\alpha_{n,1} P_n |\mathbf{h}_{n,1}\mathbf{w}_n|^2}{I^U_{n,1} + \sigma^2_n}\right) + \log_2\!\left(1 + \frac{\alpha_{n,2} P_n |\mathbf{h}_{n,2}\mathbf{w}_n|^2}{\sigma^2_n}\right) \right].$$
Because $K = 2$, the power allocation coefficient can be quantized into two levels. The power allocation coefficient set $\Omega$ is assumed to be $\Omega = \{0.2, 0.4\}$. The learning rate of the Q-function is set to 0.9999, and the discount factor is set to 0.0001. The time slot is one TTI, e.g., 1 ms in an LTE system or a 5G system with 15 kHz subcarrier spacing [27]. At every time slot, the BS observes the CSI of the UEs and performs the user pairing and power allocation. The total number of time slots is 100,000, and the simulation results are averaged over 1000 independent runs. The simulation parameters used in this paper are listed in Table 2. The simulation was performed in the following environment: Intel(R) Core i9-9900K CPU @ 3.60 GHz, 16.0 GB RAM, Windows 10, Python 3.7, and a GeForce RTX 2080 Ti GPU.
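For example, under the constraint in Equation (18), choosing the strong UE's coefficient from $\Omega$ fixes the whole split, so each beam has only two candidate power allocations (our reading of the setup):

```latex
% Candidate per-beam power splits for K = 2 and Omega = {0.2, 0.4},
% assuming the strong UE (k = 2) draws its coefficient from Omega:
(\alpha_{n,1}, \alpha_{n,2}) \in \{(0.8,\, 0.2),\ (0.6,\, 0.4)\},
\qquad \alpha_{n,1} + \alpha_{n,2} = 1.
```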
The performance of the proposed RL-based scheme is compared with the following schemes for determining the user pairing and the transmit power of the UEs: the ES, OMA, random selection, and phased RL schemes. In the ES scheme, the user pairing and the transmit power are optimally determined by the exhaustive search method, and therefore the ES scheme shows the highest performance. In the random selection scheme, the BS randomly determines the user pairing and the transmit power of the UEs. In the OMA scheme, the BS serves only one UE in a beam, and the sum rate is therefore calculated as in [28]. In the phased RL-based user pairing and power allocation scheme, the BS sequentially determines the user pairing and the transmit power of the UEs; that is, after pairing the UEs, the BS then determines their transmit power. In the phased RL scheme, the Q-function for user pairing is defined as $Q_{UP}(s, \theta_{UP})$ and the Q-function for power allocation is defined as $Q_{PA}(s, \theta_{PA})$. From Equation (17), the action of the user pairing RL is defined as $\theta_{UP} = \Phi_n$. From Equation (18), the action of the power allocation RL is defined as $\theta_{PA} = \alpha_{n,k}$. First, the user pairing RL proceeds, in which the rewards are used only to update the Q-function, where the reward is calculated with a fixed power allocation. The user pairing set $\Phi_n$ is determined by the BS through $Q_{UP}(s, \theta_{UP})$. In the power allocation RL, the user pairing set $\Phi_n$ is observed as a state along with $\hat{\mathbf{h}}_{n,k}$. The power allocation coefficient is determined by the BS through $Q_{PA}(s, \theta_{PA})$. Finally, the BS updates $Q_{PA}(s, \theta_{PA})$ and calculates $R_{all}$. The algorithm of the phased RL-based user pairing and power allocation scheme is summarized in Algorithm 2.
Algorithm 2 Phased RL-based user pairing and power allocation
1: Set $Q_{UP}(s^t_{UP}, \theta^t_{UP}) = 0$, $\forall \theta^t_{UP}$ and $\forall s^t_{UP}$
2: Set $Q_{PA}(s^t_{PA}, \theta^t_{PA}) = 0$, $\forall \theta^t_{PA}$ and $\forall s^t_{PA}$
3: for $t = 1$ to $T$ do
4:     Choose the action $\theta^t_{UP}$ in Equation (17)
5:     for $n = 1$ to $N$ do
6:         for $k = 1$ to $K$ do
7:             Allocate the fixed transmit power for the signal to user $k$
8:         end for
9:     end for
10:    Send the superimposed signal $x_t$ via $N$ antennas
11:    Observe $s_t$ and the reward $\hat{R}^t_{UP}$
12:    Update $Q_{UP}(s^t_{UP}, \theta^t_{UP})$ in Equation (23)
13:    Choose the action $\theta^t_{PA}$ in Equation (18)
14:    for $n = 1$ to $N$ do
15:        for $k = 1$ to $K$ do
16:            Apply the user pairing $\theta^t_{UP}$ and allocate the transmit power for the signal to user $k$
17:        end for
18:    end for
19:    Send the superimposed signal $x_t$ via $N$ antennas
20:    Observe $s_t$ and the reward $\hat{R}^t_{PA}$
21:    Quantize the next state $s_{t+1}$
22:    Update $Q_{PA}(s^t_{PA}, \theta^t_{PA})$ in Equation (23)
23:    Calculate $R_{all}$ in Equation (15)
24: end for

Figure 6 shows the sum rate of the RL scheme with respect to the time slot when the number of UEs is 4, the quantization level of the CSI is 4, and the transmit power of the BS is 43 dBm. In the RL-based scheme, the actions are randomly determined at first, which leads to a low sum rate. As time elapses, the sum rate of the RL-based scheme increases, and when the time slot reaches about 40,000, it approximately converges to that of the ES scheme with a performance difference of 0.57%. This means that it takes about 40 s (when the time slot is 1 ms) to achieve a sum rate similar to that of the ES. However, the proposed RL-based scheme can keep up with the changing radio channels of the UEs because the BS continues training in every time slot. Hence, if the wireless channel environment of the UEs does not change very rapidly, the proposed RL-based scheme can be applied to real-time scenarios. Because of the quantization error, the RL's reward is lower than the sum rate calculated with $\mathbf{h}_{n,k}$; the numerical results are therefore compared with those of the other schemes by the sum rate calculated with $\mathbf{h}_{n,k}$.
As the transmit power of the BS increases, the sum rates of all schemes increase, as shown in Figure 7. The random selection scheme shows the worst sum rate because the SIC is not perfect. As presented in Figure 7, the proposed scheme shows approximately the same results as the ES scheme, and the phased RL scheme also exhibits a similar sum rate. When the transmit power is 43 dBm, the proposed RL scheme increases the sum rate by about 21.15% and 41.98% in comparison with the OMA scheme and the random selection scheme, respectively. Figure 8 shows the sum rate as the number of UEs increases. As the number of UEs increases, the sum rates of all schemes increase and gradually converge. The performance difference between the ES scheme and the proposed scheme slightly increases as the number of UEs increases; for 10 UEs, the performance difference is about 5.48%, which is due to the increased size of the state space. The proposed scheme increases the sum rate by about 13.17% and 47.67% in comparison with the OMA scheme and the random selection scheme, respectively, while the proposed scheme and the phased RL scheme show similar performance. Figure 9 presents the required simulation time as the number of UEs increases. Because the ES scheme investigates all possible actions, its simulation time is extremely high. The results show that the proposed scheme is more efficient than the phased RL scheme in terms of time complexity.
The proposed scheme reduces the time complexity by about 20.97% compared with the phased RL scheme. The proposed scheme also reduces the computational complexity. The ES scheme searches all possible actions; therefore, when the size of the action space is denoted by $n = n(\theta_t)$, the complexity of the ES scheme is $O(n)$. The phased RL scheme sequentially determines the user pairing and the transmit power of the UEs in each pair; hence, its complexity can be expressed as $2 \cdot O(1)$, because the RL requires a complexity of $O(1)$ after it converges. The proposed RL-based scheme calculates the reward by choosing one action, and therefore it has a complexity of $O(1)$.

Conclusions
In this paper, an RL-based joint user pairing and power allocation scheme for MIMO-NOMA systems was proposed. To reduce the computational complexity of finding the user pairing and the transmit power of the users, Q-learning was applied, and the user pairing and the transmit power allocation were performed simultaneously in the Q-learning's action. The proposed scheme achieves a sum rate similar to that of the ES scheme with low computational complexity, and it reduces the time complexity compared with the phased RL scheme although the two show similar performance in terms of the sum rate. However, as the number of UEs increases, the performance difference between the proposed scheme and the ES scheme slightly increases. In the future, we will apply the DQN to the MIMO-NOMA system in order to reduce this performance difference.
Author Contributions: J.L. contributed to designing the algorithm, performing the simulations, and preparing the manuscript. J.S. led the research project and supervised the activities as the corresponding author. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1F1A1058716 and No. 2020R1F1A1065109). This research was also supported by the "HPC Support" project funded by the Korea Ministry of Science and ICT and NIPA.

Conflicts of Interest:
The authors declare no conflict of interest.