Hybrid NOMA/OMA-Based Dynamic Power Allocation Scheme Using Deep Reinforcement Learning in 5G Networks

Non-orthogonal multiple access (NOMA) is considered a promising technique for fifth-generation (5G) networks. Nevertheless, applying NOMA to a massive access scenario is relatively complex. Thus, in this paper, a hybrid NOMA/OMA scheme is considered for uplink wireless transmission systems where multiple cognitive users (CUs) can simultaneously transmit their data to a cognitive base station (CBS). We adopt a user-pairing algorithm in which the CUs are grouped into multiple pairs, and each pair is assigned to an orthogonal sub-channel such that the users in a pair apply NOMA to transmit data to the CBS without causing interference with other pairs. Subsequently, the signals transmitted by the CUs of each NOMA group can be independently retrieved using successive interference cancellation (SIC). The CUs are assumed to harvest solar energy to maintain operations. Moreover, joint power and bandwidth allocation is considered at the CBS to optimize energy and spectrum efficiency and thereby maximize the long-term data rate of the system. To this end, we propose a deep actor-critic reinforcement learning (DACRL) algorithm to model the policy function and value function for the actor and critic of the agent (i.e., the CBS), in which the actor learns about the system dynamics by interacting with the environment, while the critic evaluates the actions taken, such that the CBS can optimally assign power and bandwidth to the CUs once the training phase finishes. Numerical results validate the superior performance of the proposed scheme compared with other conventional schemes.


Introduction
Recently, fourth-generation (4G) systems have reached maturity and will evolve into fifth-generation (5G) systems, where limited amounts of new spectrum can be utilized to meet the stringent demands of users. However, critical challenges will come from the explosive growth in devices and data volumes, which requires more efficient exploitation of valuable spectrum. Therefore, non-orthogonal multiple access (NOMA) is one of the potential candidates for 5G and upcoming cellular network generations [1][2][3].
According to NOMA principles, multiple users are allowed to share time and spectrum resources in the same spatial layer via power-domain multiplexing, in contrast to conventional orthogonal multiple access (OMA) techniques such as frequency-division multiple access (FDMA) and time-division multiple access (TDMA) [4]. Inter-user interference can be alleviated by performing successive interference cancellation (SIC) on the receiver side, and a large body of research has aimed at maximizing the sum rate of NOMA systems.
Since the information regarding network dynamics (e.g., the harvested energy distribution, the primary user's behavior) is sometimes unavailable in a cognitive radio system, researchers usually formulate optimization problems within the framework of a Markov decision process (MDP) [20,22,26,27]. Reinforcement learning is a potential approach to obtaining the optimal solution to an MDP problem by interacting with the environment, without prior information about the network dynamics and without any supervision [28][29][30]. However, large-state-space optimization problems are a major challenge for reinforcement learning. For this reason, deep reinforcement learning (DRL), in which deep neural networks (DNNs) work as function approximators and are utilized to learn the optimal policy, is being investigated extensively in wireless communication systems [31][32][33]. Meng et al. proposed a deep reinforcement learning method for a joint spectrum sensing and power control problem in a cognitive small cell [31]. In addition, deep Q-learning was studied for a wireless gateway that is able to derive the optimal policy to maximize throughput in cognitive radio networks [32]. Zhang et al. [33] proposed an asynchronous advantage deep actor-critic-based scheme to optimize spectrum sharing efficiency and guarantee the QoS requirements of PUs and CUs.
To the best of our knowledge, there has been little research into resource allocation using deep reinforcement learning under a non-RF energy-harvesting scenario in uplink cognitive radio networks. Thus, we propose a deep actor-critic reinforcement learning framework for efficient joint power and bandwidth allocation by adopting hybrid NOMA/OMA in uplink cognitive radio networks (CRNs). In this framework, solar energy-powered CUs are assigned the proper transmission power and bandwidth to transmit data to a cognitive base station in every time slot in order to maximize the long-term data transmission rate of the system. Specifically, the main contributions of this paper are as follows. • We study a model of a hybrid NOMA/OMA uplink cognitive radio network adopting energy harvesting at the CUs, where solar energy-powered CUs opportunistically use the licensed channel of the primary network to transmit data to a cognitive base station using NOMA/OMA techniques. Besides that, a user-pairing algorithm is adopted such that we can assign orthogonal frequency bands to each NOMA group after pairing. We take power and bandwidth allocation into account such that the transmission power and bandwidth are optimally utilized by each CU under energy constraints and environmental uncertainty. The system is assumed to work on a time-slotted basis. • We formulate the problem of long-term data transmission rate maximization within the framework of a Markov decision process (MDP), and we obtain the optimal policy by adopting a deep actor-critic reinforcement learning (DACRL) framework under a trial-and-error learning algorithm. More specifically, we use DNNs to approximate the policy function and the value function for the actor and critic components, respectively.
As a result, the cognitive base station can allocate the appropriate transmission power and bandwidth to the CUs by directly interacting with the environment, such that the system reward can be maximized in the long run by using the proposed algorithm. • Lastly, extensive numerical results are provided to assess the proposed algorithm performance through diverse network parameters. The simulation results of the proposed scheme are shown to be superior to conventional schemes where decisions on transmission power and bandwidth allocation are taken without long-term considerations.
The rest of this paper is structured as follows. The system model is presented in Section 2. We introduce the problem formulation in Section 3, and we describe the deep actor-critic reinforcement learning scheme for resource allocation in Section 4. The simulation results and discussions are in Section 5. Finally, we conclude the paper in Section 6.

System Model
We consider an uplink CRN that consists of a cognitive base station (CBS), a primary base station (PBS), multiple primary users (PUs), and 2M cognitive users, as illustrated in Figure 1. Each CU is outfitted with a single antenna to transmit data to the CBS, and each is equipped with an energy-harvesting component (i.e., solar panels). The PBS and PUs are licensed to use the primary channel at will. However, they do not always have data to transmit on the primary channel. Meanwhile, the CBS and the CUs can opportunistically utilize the primary channel by adopting a hybrid NOMA/OMA technique when the channel is sensed as free. To this end, the CBS divides the set of CUs into pairs according to Algorithm 1, in which the CU with the highest channel gain is coupled with the CU with the lowest channel gain, and one of the available subchannels is assigned to each pair. More specifically, the CUs are arranged into M NOMA groups, denoted G = {G_1, G_2, G_3, ..., G_M}, and the primary channel is divided into multiple subchannels to apply hybrid NOMA/OMA for the transmissions between the CUs and the CBS. The M NOMA groups are assigned to M orthogonal subchannels, SC = {SC_1, SC_2, SC_3, ..., SC_M}, of the primary channel such that the CUs in each NOMA group transmit on the same subchannel and do not interfere with the other groups. In this paper, successive interference cancellation (SIC) [34] is applied at the CBS for decoding the received signals transmitted from the CUs. Moreover, we assume that the CUs always have data to transmit, and that the CBS has complete channel state information (CSI) for all the CUs.
The network system operation is illustrated in Figure 2. In particular, at the beginning of a time slot, with duration τ_ss, all CUs concurrently perform spectrum sensing and report their local results to the CBS.
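The pairing rule described above (strongest remaining CU coupled with the weakest) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text; the function name and interface are assumptions, not taken from the paper's Algorithm 1.

```python
def pair_users(channel_gains):
    """Pair the CU with the highest channel gain with the CU having the
    lowest gain, repeating over the remaining users (illustrative sketch)."""
    # Sort user indices by channel gain in descending order.
    order = sorted(range(len(channel_gains)),
                   key=lambda i: channel_gains[i], reverse=True)
    # Couple strongest with weakest, second strongest with second weakest, ...
    M = len(order) // 2
    return [(order[k], order[-1 - k]) for k in range(M)]

# Example: six CUs -> three NOMA groups, each assigned one orthogonal subchannel.
gains = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
print(pair_users(gains))  # strongest-weakest index pairs
```

Each returned pair then forms one NOMA group G_m and is mapped to subchannel SC_m.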
Based on these sensing results, the CBS first makes the global sensing decision as to whether the primary channel is busy or not, following the combination rule [35,36], and then allocates power and bandwidth to all CUs for uplink data transmission. Consequently, according to the allocated power and bandwidth of the NOMA groups, the CUs in each NOMA group can transmit their data to the CBS through the same subchannel without causing interference with other groups within duration τ_Tr = T_tot − τ_ss, where T_tot is the total time slot duration. Information regarding the remaining energy in all the CUs is updated to the CBS at the end of each time slot. Each data transmission session of the CUs may take place over more than one time slot until all their data have been transmitted successfully. During data transmission, the received composite signal at the CBS on subchannel SC_m is given by

y_m(t) = Σ_{i=1}^{2} √(P_im(t)) h_im x_im(t) + ω_m,

where P_im(t) = e^tr_im(t)/τ_Tr, i ∈ {1, 2}, m ∈ {1, 2, ..., M}, is the transmission power of CU_im in NOMA group G_m, in which e^tr_im(t) is the transmission energy assigned to CU_im in time slot t; x_im(t) denotes the transmit signal of CU_im in time slot t, with E{|x_im(t)|²} = 1; ω_m is the additive white Gaussian noise (AWGN) at the CBS on subchannel SC_m with zero mean and variance σ²; and h_im is the channel coefficient between CU_im and the CBS. The overall received signal at the CBS in time slot t is given by

y(t) = Σ_{m=1}^{M} y_m(t).

The received signals at the CBS on the different subchannels are independently retrieved from the composite signal y_m(t) using the SIC technique. In particular, the signal of the CU with the highest channel gain is decoded first and then removed from the composite signal at the CBS, in a successive manner. The signal of the CU with the lower channel gain on the subchannel is treated as noise when decoding the CU with the higher channel gain. We assume perfect SIC implementation at the CBS.
The achievable transmission rates for the CUs in NOMA group G_m are

R_1m(t) = b_m(t) log_2(1 + P_1m(t) g_1m / (P_2m(t) g_2m + σ²)),
R_2m(t) = b_m(t) log_2(1 + P_2m(t) g_2m / σ²),

where b_m(t) is the amount of bandwidth allocated to subchannel SC_m in time slot t, g_im = |h_im|² is the channel gain of CU_im on subchannel m, and g_1m ≥ g_2m. Since the channel gain of CU_1m, g_1m, is higher, CU_1m has a higher priority for decoding. Consequently, the signal of CU_1m is decoded first by treating the signal of CU_2m as interference. Next, the signal of CU_1m is removed from y_m(t), and the signal of CU_2m is decoded interference-free. The sum achievable transmission rate of NOMA group G_m can be calculated as

R_m(t) = R_1m(t) + R_2m(t),

and the sum achievable transmission rate at the CBS is given by

R(t) = Σ_{m=1}^{M} R_m(t).
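The per-group rate computation under perfect SIC can be sketched as follows. This is a minimal illustration of the two-user decoding order described above; the function name and argument names are assumptions for this sketch.

```python
import math

def noma_group_rate(b_m, p1, p2, g1, g2, sigma2):
    """Sum achievable rate of one NOMA group G_m under perfect SIC.
    CU_1m (higher gain g1) is decoded first, treating CU_2m as
    interference; CU_2m is then decoded interference-free."""
    assert g1 >= g2, "decoding order assumes g1 >= g2"
    r1 = b_m * math.log2(1 + p1 * g1 / (p2 * g2 + sigma2))  # R_1m(t)
    r2 = b_m * math.log2(1 + p2 * g2 / sigma2)              # R_2m(t)
    return r1 + r2                                          # R_m(t)
```

Summing `noma_group_rate` over the M groups gives the system reward R(t) used later in the MDP formulation.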

Energy Arrival and Primary User Models
In this paper, the CUs have a finite-capacity battery, E_bat, which can be constantly recharged by the solar energy harvesters. Therefore, the CUs can perform their other operations and harvest solar energy simultaneously. For many reasons (such as the weather, the season, and the time of day), the energy harvested from solar resources may vary in practice. Herein, we consider a practical case where the harvested energy of CU i in NOMA group G_m (denoted as e^h_im) follows a Poisson distribution with mean value ξ_avg, as studied in [37]. The amount of energy that CU_im harvests during time slot t is thus a random variable e^h_im(t) ~ Poisson(ξ_avg), whose cumulative distribution function is

F(x) = e^(−ξ_avg) Σ_{k=0}^{⌊x⌋} ξ_avg^k / k!.

Herein, we use a two-state discrete-time Markov process to model the state of the primary channel, as depicted in Figure 3. We assume that the state of the primary channel does not change during the time slot duration, T_tot, and that the primary channel can switch states between two adjacent time slots. The state transition probabilities between two time slots are denoted as P_ij, i, j ∈ {F, B}, in which F stands for the free state and B stands for the busy state. In this paper, we consider cooperative spectrum sensing, in which all CUs collaboratively detect spectrum holes based on an energy detection method and send their local sensing results to the CBS. Subsequently, the final decision on the primary users' activities is attained by combining the local sensing data at the CBS [36]. The performance of the cooperative sensing scheme can be evaluated based on the probability of detection, P_d, and the probability of false alarm, P_f. P_d denotes the probability that the PU's presence is correctly detected (i.e., the primary channel is actually used by the PUs).
Meanwhile, P_f denotes the probability that the PU is detected as active when it is actually inactive (i.e., the sensing result indicates the primary channel is busy, but the primary channel is actually free).
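The two stochastic models above (Poisson energy arrivals and the two-state Markov primary channel) can be simulated with a short sketch. The function names and the unit-energy granularity are illustrative assumptions, not taken from the paper.

```python
import math
import random

def harvest_energy(xi_avg, rng=random):
    """Draw the number of harvested energy units in one time slot from a
    Poisson distribution with mean xi_avg (Knuth's multiplication method)."""
    k, p, threshold = 0, 1.0, math.exp(-xi_avg)
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def step_channel(channel_free, p_ff, p_bf, rng=random):
    """Two-state Markov transition of the primary channel between adjacent
    slots: p_ff = P(free -> free), p_bf = P(busy -> free)."""
    p_to_free = p_ff if channel_free else p_bf
    return rng.random() < p_to_free
```

In a full simulation, each CU's battery would be clipped to E_bat after adding the harvested units, matching the energy-update rules given later.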

Long-Term Transmission Rate Maximization Problem Formulation
In this section, we aim to maximize the long-term data transmission rate for uplink hybrid NOMA/OMA. The 2M users in the CRN are decoupled into pairs according to their channel gains, as described in Algorithm 1. After user pairing, the joint power and bandwidth allocation problem can be formulated as follows:

max_{a(t)} E[ Σ_t R(t) ],

where a(t) = {b(t), ε(t)} represents the action that the CBS assigns to the CUs in time slot t; b(t) = {b_1(t), b_2(t), ..., b_M(t)} indicates a vector of the allocated bandwidth portions assigned to the corresponding subchannels; and ε(t) = {e^tr_11(t), e^tr_21(t), ..., e^tr_1M(t), e^tr_2M(t)} refers to the transmission energy vector of the CUs, where e^tr_im(t) ∈ {0, e^tr_1, e^tr_2, ..., e^tr_max} is the transmission energy value for CU_im, and e^tr_max represents the upper bound on the transmission energy for each CU in time slot t.

Deep Reinforcement Learning-Based Resource Allocation Policy
In this section, we first reformulate the joint power and bandwidth allocation problem, which aims to maximize the long-term data transmission rate of the system, within the framework of an MDP. Then, we apply a DRL approach to solve the problem, in which the agent (i.e., the CBS) learns the optimal resource policy via trial-and-error interactions with the environment. One disadvantage of reinforcement learning is the high computational cost imposed by the long learning process in a system with a large state space and action space. However, by adopting deep neural networks, the proposed scheme requires less computational overhead than other algorithms such as value iteration-based dynamic programming in the partially observable Markov decision process (POMDP) framework [20], in which the transition probability of the energy arrival is required to obtain the solution. Thus, the complexity of formulation and computation can be reduced regardless of the dynamic properties of the environment by using the proposed scheme, as compared to a POMDP scheme. Furthermore, the advantage of a deep reinforcement learning scheme over the POMDP scheme is that the unknown harvested energy distribution can be estimated to create the optimal policy by interacting with the environment over the time horizon. In addition, the proposed scheme can work effectively in a system with large state and action spaces by adopting deep neural networks, whereas other reinforcement learning schemes such as Q-learning [38] and actor-critic learning [39] might not be appropriate for such systems. In the proposed scheme, a deep neural network is trained to obtain the optimal policy, under which the reward of the system converges to the optimal value. Then, the system can choose an optimal action in every state according to the policy learned in the training phase, without re-training.
Thus, deep actor-critic reinforcement learning is well suited to wireless communication systems.

Markov Decision Process
Generally, the purpose of reinforcement learning is for the agent to learn how to map each system state to an optimal action through a trial-and-error learning process. In this way, the accumulated sum of rewards can be maximized after a number of training time slots. Figure 4 illustrates traditional reinforcement learning via agent-environment interaction. In particular, the agent observes the system state and then chooses an action at the beginning of a time slot. After that, the system receives the corresponding reward at the end of the time slot and transfers to the next state based on the performed action. The system is then updated and goes into the next interaction between agent and environment. We denote the state space and action space of the system in this paper as S and A, respectively; s(t) = {µ(t), e^re(t)} ∈ S represents the state of the network in time slot t, where µ(t) is the probability (belief) that the primary channel is free in that time slot, and e^re(t) = {e^re_11(t), e^re_21(t), ..., e^re_1M(t), e^re_2M(t)} denotes a vector of the remaining energy of the CUs, where 0 ≤ e^re_im ≤ E_bat represents the remaining energy value of CU_im. The action at the CBS is denoted as a(t) = {b(t), ε(t)} ∈ A. In this paper, we define the reward as the sum data rate of the system, as presented in Equation (5).
The decision-making process can be expressed as follows. At the beginning of time slot t, the agent observes the state, s(t) ∈ S, from information about the environment, and then chooses action a(t) ∈ A following a stochastic policy, π(a|s) = Pr(a(t) = a | s(t) = s), which maps the environment state to the probability of taking an action. In this work, the network agent (i.e., the CBS) determines the transmission power for each CU and decides whether to allocate a bandwidth portion to each NOMA group in each time slot. Then, the CUs perform their operations (transmit data or stay silent) according to the action assigned by the CBS. Afterward, the instant reward, R(t), which is defined in Equation (5), is fed back to the agent, and the environment transitions to the next state, s(t + 1). At the end of the time slot, the CUs report their current remaining energy levels to the CBS for network management. In the following, we describe how the belief and the remaining energy are updated based on the actions assigned at the CBS.

Silent Mode
The global sensing decision indicates that the primary channel is busy in the current time slot; thus, the CBS trusts this result and has all CUs stay silent. As a consequence, there is no reward in this time slot, i.e., R(t) = 0. The belief in the current time slot, t, can be updated according to Bayes' rule [40] as

µ̂(t) = µ(t) P_f / (µ(t) P_f + (1 − µ(t)) P_d),

and belief µ(t + 1) for the next time slot is then updated as

µ(t + 1) = µ̂(t) P_FF + (1 − µ̂(t)) P_BF.

The remaining energy of CU_im for the next time slot is updated as e^re_im(t + 1) = min(e^re_im(t) + e^h_im(t) − e_ss, E_bat), where e_ss is the energy consumed by the spectrum sensing process.

Transmission Mode
The global sensing decision indicates that the primary channel is free in the current time slot, and then, the CBS assigns transmission power levels to the CUs for transmitting their data to the CBS. We assume that the data of the CUs will be successfully decoded if the primary channel is actually free; otherwise, no data can be retrieved due to collisions between the signals of the PUs and CUs. In this case, there are two possible observations, as follows.
Observation 1 (Φ_1): All data are successfully received and decoded at the CBS at the end of the time slot. This outcome means the primary channel was actually free during the time slot, and the global sensing result was correct. The total reward for the network is calculated as

R(s(t) | Φ_1) = Σ_{m=1}^{M} R_m(t),

where the immediate data transmission rate of NOMA group G_m, R_m(t), can be computed with Equation (4). Belief µ(t + 1) for the next time slot is updated as

µ(t + 1) = P_FF.

The remaining energy in CU_im for the next time slot will be e^re_im(t + 1) = min(e^re_im(t) + e^h_im(t) − e_ss − e^tr_im(t), E_bat), where e^tr_im(t) denotes the transmission energy assigned to CU_im in time slot t.
Observation 2 (Φ_2): The CBS cannot successfully decode the data from the CUs at the end of time slot t due to collisions between the signals of the CUs and the PUs. This implies that the primary channel was occupied and a misdetection happened. In this case, no reward is achieved, i.e., R(s(t) | Φ_2) = 0. Belief µ(t + 1) for the next time slot is updated as

µ(t + 1) = P_BF.

The remaining energy in CU_im for the next time slot is updated by e^re_im(t + 1) = min(e^re_im(t) + e^h_im(t) − e_ss − e^tr_im(t), E_bat).
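The belief updates for the silent mode and the two transmission-mode observations can be collected into one function. This is a sketch of the update rules as stated in the text; the argument names and mode labels are assumptions.

```python
def update_belief(mu, p_ff, p_bf, p_d, p_f, mode, observation=None):
    """Belief that the primary channel is free in the next slot.
    mode='silent': sensing said busy; apply Bayes' rule to the prior mu.
    mode='tx': observation 'phi1' (decoded OK -> channel was free) or
    'phi2' (collision -> channel was busy).
    p_ff = P(free -> free), p_bf = P(busy -> free)."""
    if mode == 'silent':
        # Posterior that the channel was free given a 'busy' sensing result.
        mu_hat = mu * p_f / (mu * p_f + (1 - mu) * p_d)
    elif observation == 'phi1':
        mu_hat = 1.0   # successful decoding confirms the channel was free
    else:              # 'phi2': collision confirms the channel was busy
        mu_hat = 0.0
    # One-step Markov prediction for the next slot.
    return mu_hat * p_ff + (1 - mu_hat) * p_bf
```

Note that under Φ_1 the prediction collapses to P_FF and under Φ_2 to P_BF, matching the update rules above.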
In reinforcement learning, the agent is capable of improving the policy based on state-value functions. The state-value function, V^π(s), is defined as the expected value of the accumulated discounted reward starting from current state s under a given policy, which is written as [28]:

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k R(t + k) | s(t) = s ],

where E{.} denotes the expectation, and γ ∈ (0, 1) is the discount factor, which determines whether the agent's decisions are myopic or foresighted; π is the stochastic policy, which maps environment state space S to action space A, π(a|s) = Pr(a(t) = a | s(t) = s). The objective of the resource allocation problem is to find the optimal policy π* that provides the maximum discounted value function in the long run, satisfying the Bellman equation [41]:

π*(a|s) = arg max_π V^π(s).
The policy can be explored by using an ε-greedy policy, in which a random action is chosen with probability ε, or an action is selected based on the current policy with probability (1 − ε), during the training process [42]. As a result, the problem of joint power and bandwidth allocation in Equation (7) can be rewritten as Equation (17), and the deep actor-critic reinforcement learning solution is presented in the following section.
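The ε-greedy exploration rule above can be written as a short helper (an illustrative sketch; names are assumptions):

```python
import random

def epsilon_greedy(policy_probs, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action; otherwise
    sample an action index from the current stochastic policy pi(a|s)."""
    if rng.random() < epsilon:
        return rng.randrange(len(policy_probs))
    return rng.choices(range(len(policy_probs)), weights=policy_probs)[0]
```

During training, ε is typically decayed over episodes so the agent shifts from exploration to exploitation.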

Deep Actor-Critic Reinforcement Learning Algorithm
The maximization problem in Equation (17) can be solved by using the actor-critic method, which combines the value-based method [43] and the policy-based method [44]. The actor-critic structure involves two neural networks (actor and critic) and an environment, as shown in Figure 5. The actor determines the action according to the policy, and the critic evaluates the selected actions based on value functions and the instant rewards fed back from the environment. The input of the actor is the state of the network, and the output is the policy, which directly determines how the agent chooses the optimal action. The output of the critic is a state-value function, V^π(s), which is used to calculate the temporal difference (TD) error. Thereafter, the TD error is used to update both the actor and the critic. Herein, the policy function in the actor and the value function in the critic are approximated with parameter vectors θ and ω, respectively, by two sequential deep neural network models. Both value function parameter ω and policy parameter θ are stochastically initialized and constantly updated by the critic and the actor, respectively, during the training process.

The Critic with a DNN

Figure 6 depicts the DNN at the critic, which is composed of an input layer, two hidden layers, and an output layer. The critic network is a feed-forward neural network that evaluates the action taken by the actor; this evaluation is then used by the actor to update its control policy. The input layer of the critic is an environment state, which contains (2M + 1) elements. Each hidden layer is a fully connected layer, which involves H_C neurons and uses a rectified linear unit (ReLU) activation function [45,46]:

f(z) = max(0, z),

where z = Σ_i ω_i s_i(t) is the estimated output of the layer before applying the activation function, in which s_i(t) indicates the ith element of the input state, s(t), and ω_i is the weight for the ith input.
The output layer of the DNN at the critic contains one neuron and uses a linear activation function to estimate the state-value function, V^π(s). In this paper, the value function parameter is optimized by adopting stochastic gradient descent with a back-propagation algorithm to minimize the loss function, defined as the mean squared error:

L(ω) = δ(t)²,

where δ(t) is the TD error between the target value and the estimated value, which is given by

δ(t) = R(t) + γ V_ω(s(t + 1)) − V_ω(s(t)),

and it is utilized to evaluate selected action a(t) of the actor. If the value of δ(t) is positive, the tendency to choose action a(t) in the future, when the system is in the same state, will be strengthened; otherwise, it will be weakened. The critic parameter can be updated in the direction of the gradient, as follows:

ω ← ω + α_c δ(t) ∇_ω V_ω(s(t)),

where α_c is the learning rate of the critic.
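A bias-free numpy sketch of the critic's forward pass (two ReLU hidden layers and one linear output neuron) and the TD error above. This is a simplified illustration; the paper trains the critic with back-propagation in TensorFlow, and the omission of bias terms is an assumption of this sketch.

```python
import numpy as np

def critic_forward(s, W1, W2, W3):
    """Estimate V(s) with two fully connected ReLU hidden layers
    and a single linear output neuron (biases omitted for brevity)."""
    h1 = np.maximum(0.0, W1 @ s)    # ReLU: f(z) = max(0, z)
    h2 = np.maximum(0.0, W2 @ h1)
    return float(W3 @ h2)           # linear output: state value V(s)

def td_error(reward, v_s, v_next, gamma):
    # delta(t) = R(t) + gamma * V(s(t+1)) - V(s(t))
    return reward + gamma * v_next - v_s
```

The squared TD error serves as the critic's loss, and its sign tells the actor whether the chosen action performed better or worse than expected.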

The Actor with a DNN
The DNN in the actor is shown in Figure 7, and includes an input layer, two hidden layers, and an output layer. The input layer of the actor is the current state of the environment. There are two hidden layers in the actor, each comprised of H_A neurons. The output layer of the actor provides the probabilities of selecting actions in a given state. Furthermore, the output layer utilizes the soft-max activation function [28] to compute the policy for each action in the action space:

π_θ(a|s) = e^(z_a) / Σ_{a' ∈ A} e^(z_a'),

where z_a is the estimated preference value for choosing action a. In the actor, the policy can be enhanced by optimizing the state-value function as follows:

J(π_θ) = Σ_{s ∈ S} d^π(s) V^π(s),

where d^π(s) is the state distribution. Policy parameter θ can be updated by gradient ascent to maximize the objective function [39]:

θ ← θ + α_a ∇_θ J(π_θ),

where α_a denotes the learning rate of the actor network, and policy gradient ∇_θ J(π_θ) can be computed by using the TD error [47]:

∇_θ J(π_θ) = E_π[ δ(t) ∇_θ log π_θ(a|s) ].

It is worth noting that TD error δ(t) is supplied by the critic. The training procedure of the proposed DACRL approach is summarized in Algorithm 2. In the algorithm, the agent interacts with the environment and learns to select the optimal action in each state. The convergence of the proposed algorithm depends on the number of steps per episode, the number of training episodes, and the learning rates, which are discussed in the following section.
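The softmax policy and the actor's policy-gradient step can be sketched with a linear parameterization standing in for the actor DNN (the linear form and all names are simplifying assumptions for illustration):

```python
import numpy as np

def actor_step(theta, s, a, delta, alpha_a):
    """One policy-gradient update for a linear-softmax actor.
    theta: (n_actions, n_features) parameter matrix; for the softmax
    policy, grad of log pi(a|s) is (one_hot(a) - pi) outer s."""
    z = theta @ s                                # action preferences z_a
    pi = np.exp(z - z.max())
    pi /= pi.sum()                               # softmax policy pi(.|s)
    one_hot = np.zeros_like(pi)
    one_hot[a] = 1.0
    grad_log_pi = np.outer(one_hot - pi, s)
    # theta <- theta + alpha_a * delta(t) * grad log pi(a|s), delta from the critic
    return theta + alpha_a * delta * grad_log_pi
```

A positive TD error δ(t) increases the preference z_a of the chosen action for that state, exactly the strengthening behavior described for the critic's feedback.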

Simulation Results
In this section, we investigate the performance of uplink NOMA systems using our proposed scheme. The simulation results are compared with other myopic schemes [48] (Myopic-UP, Myopic-Random, and Myopic-OMA) in terms of average data transmission rate and energy efficiency. In the myopic schemes, the system only maximizes the reward in the current time slot, and the system bandwidth is allocated to a group only if it has at least one active CU in the current time slot. In particular, with the Myopic-UP scheme, the CBS arranges the CUs into different pairs based on Algorithm 1. In the Myopic-Random scheme, the CBS randomly decouples the CUs into pairs. In the Myopic-OMA scheme, the total system bandwidth is divided equally into sub-channels in order to assign them to each active CU without applying user pairing. In the following, we analyze the influence of the network parameters on the schemes through the numerical results.
In this paper, we used Python 3.7 with the TensorFlow deep learning library to implement the DACRL algorithm. Herein, we consider a network with different channel gain values between the CUs and the CBS: h_1 = −20 dB, h_2 = −25 dB, h_3 = −30 dB, h_4 = −35 dB, h_5 = −40 dB, and h_6 = −45 dB, where h_1, ..., h_6 are the channel gains between CU_1, ..., CU_6 and the CBS, respectively. Two sequential DNNs are utilized to model the value function and the policy function in the proposed algorithm. Each DNN is designed with an input layer, two hidden layers, and an output layer, as described in Section 4. The number of neurons in each hidden layer of the value function DNN in the critic and of the policy function DNN in the actor is set at H_C = 24 and H_A = 24, respectively. For the training process, value function parameter ω and policy parameter θ are stochastically initialized by using uniform Xavier initialization [49]. The other simulation parameters for the system are shown in Table 1.
We first examine the average transmission rates of the DACRL scheme under different numbers of training iterations, T, while the number of episodes, L, increases from 1 to 400. We obtained the results by calculating the average transmission rate over 20 independent simulation runs, as shown in Figure 8. The curves sharply increase in the first 50 training episodes and then gradually converge to the optimal value. We can see that the agent needs more than 350 episodes to learn the optimal policy at T = 1000 iterations per episode. However, as T increases, the algorithm converges faster. For instance, the proposed scheme learns the optimal policy in fewer than 300 episodes when T = 2000. Nevertheless, the training process might take a very long time if each episode uses too many iterations, and the algorithm may even converge to a locally optimal policy. As a result, the number of training iterations per episode and the number of training episodes should be neither too large nor too small. In the rest of the simulations, we set the number of training episodes at 300 and the number of training iterations at 2000.
Figure 8. The convergence rate of the proposed actor-critic deep reinforcement learning with different training steps in each episode (T = 1000, 1500, 2000, 2500, 3000).
Figure 9 shows the convergence rate of the proposed scheme for various values of actor learning rate α_a and critic learning rate α_c. The figure shows that the reward converges faster as the learning rates increase. In addition, we can observe that the proposed scheme with actor learning rate α_a = 0.001 and critic learning rate α_c = 0.005 provides the best performance after 300 episodes. When the learning rates of the actor and the critic increase to α_a = 0.01 and α_c = 0.005, respectively, the algorithm converges very fast but does not produce a good reward due to underfitting.
Therefore, we set the actor and critic learning rates at α_a = 0.001 and α_c = 0.005, respectively, for the rest of the simulations. Figure 10 illustrates the average transmission rates under the influence of the mean harvested energy. We can see that the average transmission rate of the system increases as the mean value of harvested energy grows. The reason is that with an increase in ξ_avg, the CUs can harvest more solar energy, and thus have a greater chance to transmit data to the CBS. In addition, the average transmission rate of the proposed scheme dominates the conventional schemes because the conventional schemes focus on maximizing the current reward and ignore the impact of the current decision on the future reward. Thus, whenever the primary channel is free, these conventional schemes allow all CUs to transmit their data by consuming most of the energy in their batteries in order to maximize the instant reward. This forces the CUs to stay silent in future time slots due to energy shortages. Although the Myopic-Random scheme had lower performance than the Myopic-UP scheme, it still achieved greater rewards than Myopic-OMA. This outcome demonstrates the efficiency of the hybrid NOMA/OMA approach, compared with the OMA approach, in terms of average transmission rate. In Figure 11, the energy efficiency of the schemes is compared with respect to the mean value of the harvested energy. In this paper, we define energy efficiency as the data transmission rate obtained at the CBS over the total energy consumed by the CUs during operations. We can see that the energy efficiency declines as ξ_avg rises. The reason is that when the harvested energy goes up, the CUs can gather more energy for their operations; however, the amount of energy overflowing the CUs' batteries also increases.
The curves show that the proposed scheme outperforms the other conventional schemes because the DACRL agent can learn about the dynamic arrival of harvested energy from the environment. Thus, the proposed scheme can make a proper decision in each time slot. In Figures 12 and 13, we plot the average transmission rate and the energy efficiency, respectively, for different noise variances at the CBS. The curves show that system performance notably degrades when the noise variance increases: higher noise variance degrades the data transmission rate, as shown in Equation (3). As a consequence, energy efficiency also decreases with an increase in noise variance. Across the considered noise variance values, the figures verify that the proposed scheme dominates the myopic schemes.

Conclusions
In this paper, we investigated a deep reinforcement learning framework for joint power and bandwidth allocation by adopting both hybrid NOMA/OMA and user pairing in uplink CRNs. The DACRL algorithm was employed to maximize the long-term transmission rate under the energy constraints of the CUs. A DNN was applied to approximate the policy function and the value function such that the algorithm can work in a system with large state and action spaces. The agent of the DACRL can explore the optimal policy by interacting with the environment. As a consequence, the CBS can effectively allocate bandwidth and power to the CUs based on the current network state in each time slot. The simulation results verified the advantages of the proposed scheme in improving long-run network performance under various network conditions, compared to the conventional schemes.