Optimal Energy Efficiency Used DDPG in IRS-NOMA Wireless Communications

Combining the Intelligent Reflecting Surface (IRS) with Non-Orthogonal Multiple Access (NOMA) technology is a viable option for increasing communication performance. Firstly, a NOMA downlink transmission system assisted by an IRS is established in this study to maximize its energy efficiency. Then, a Deep Deterministic Policy Gradient (DDPG) algorithm with symmetric properties is used to further optimize the energy efficiency of the system by intelligently adjusting the beamforming matrix of the access point (AP) and the phase-shift matrix of the IRS. According to the simulation results, the proposed IRS-assisted NOMA downlink network based on the DDPG algorithm achieves considerably higher energy efficiency than the orthogonal multiple access network.


Introduction
People's expectations for mobile data transfer rates have grown drastically with the massive expansion of Internet users and the rapid development of the Internet of Things (IoT) in recent years. According to Cisco's white paper [1], mobile network connection speeds will more than triple by 2023, reaching 43.9 Mbps. Prospective technologies such as the Intelligent Reflecting Surface (IRS) [2][3][4][5], Non-Orthogonal Multiple Access (NOMA) [6][7][8][9] and Deep Reinforcement Learning (DRL) algorithms [10][11][12][13] have been developed and explored to boost users' transmission rates.
The IRS, as a novel technique, reshapes the wireless transmission environment through passive beamforming. It is composed of a large number of inexpensive, passive reflecting elements that can be deployed in current wireless networks [14]. NOMA is another revolutionary technology of the 5th generation of mobile communications; it uses superposition coding at the transmitter and interference cancellation at the receiver to improve system throughput and spectral efficiency. Unlike classic Orthogonal Multiple Access (OMA) technologies, which assign a fraction of the subcarriers exclusively to a specific user, NOMA allows the Base Station (BS) to broadcast to multiple mobile users concurrently [15]. How to integrate IRS and NOMA effectively has become a focus of wireless communications research.
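To make the superposition-and-SIC principle concrete, the following toy sketch (not from the paper; the power split, channel gains and noise level are invented for the example) shows a two-user power-domain NOMA link:

```python
import numpy as np

# Illustrative two-user power-domain NOMA. The BS superposes both users'
# symbols; the near (strong) user decodes and subtracts the far user's
# signal (SIC), while the far user treats the near user's signal as noise.
P = 1.0                      # total transmit power
a_far, a_near = 0.8, 0.2     # power allocation (more power to the weak user)
g_far, g_near = 0.1, 1.0     # channel power gains |h|^2
noise = 0.01                 # noise power

# Far user: decodes its own signal; the near user's signal is interference.
sinr_far = a_far * P * g_far / (a_near * P * g_far + noise)

# Near user: first decodes and removes the far user's signal, then its own.
sinr_near = a_near * P * g_near / noise

print(f"far-user rate:  {np.log2(1 + sinr_far):.2f} bit/s/Hz")
print(f"near-user rate: {np.log2(1 + sinr_near):.2f} bit/s/Hz")
```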
Ding proposed a simple IRS-aided NOMA transmission scheme that can serve users at the edge of the cell [21]. Wang investigated the effectiveness of the IRS in a NOMA system in terms of transmit power consumption [22]. In contrast to prior research, [23] highlighted the primary roles of IRSs in MIMO-NOMA systems. A three-step approach to innovative resource allocation was also presented to meet this demand [24]. In [25], the use of the IRS in NOMA was explored, where a BS transmitted superposed signals to several users via an IRS. In [26], an energy-efficient method was described that struck a fair balance between sum-rate maximization and overall power reduction.
Reinforcement Learning (RL) has been widely used in communications. Even in time-varying channels, DRL can tackle wireless communication challenges by exchanging information and reward signals with the communication environment. DRL was used in [27] to investigate the joint design of the beamforming matrix and the IRS phase-shift matrix. In [28], a Deep Deterministic Policy Gradient (DDPG) technique was applied to intelligently adjust the phase-shift matrix by controlling the numerous Reflection Elements (REs) of the IRS.
Accordingly, this paper proposes an energy efficiency optimization algorithm based on DRL for an IRS-assisted NOMA downlink wireless communications system. Since the wireless channel is both time-varying and continuous, the DDPG algorithm from DRL was selected to optimize the energy efficiency of the IRS-NOMA communications system. The main contributions of this paper are as follows: (1) We studied IRS-NOMA downlink wireless communication systems, considering the direct transmission link between the base station and the users, to maximize the system energy efficiency. (2) Based on the DDPG algorithm, the beamforming matrix of the AP and the phase-shift matrix of the IRS are jointly optimized to maximize system energy efficiency. (3) Compared with conventional OMA networks, the proposed algorithm achieved higher energy efficiency.

IRS-NOMA System Model
An IRS-assisted NOMA wireless communication system (IRS-NOMA) consists of an IRS, a BS and multiple users, as shown in Figure 1. The BS has a set of antennas denoted by M = {1, 2, . . . , M}, and K = {1, 2, . . . , K} denotes the set of users. The IRS consists of N reflection units, denoted by N = {1, 2, . . . , N}. Based on the Channel State Information (CSI), the IRS can independently reflect the received signal, altering the amplitude or phase of each reflection unit to steer the signal.
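For concreteness, a minimal sketch of how such a setup could be generated, assuming small-scale Rayleigh fading (as in the simulations later) and purely illustrative dimensions M, N and K:

```python
import numpy as np

# Hypothetical dimensions: M BS antennas, N IRS elements, K users.
M, N, K = 4, 16, 2
rng = np.random.default_rng(0)

def rayleigh(shape, rng):
    """Unit-variance complex Gaussian entries (small-scale Rayleigh fading)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H1 = rayleigh((N, M), rng)   # BS -> IRS
H2 = rayleigh((K, N), rng)   # IRS -> users (one row per user)
Hd = rayleigh((K, M), rng)   # BS -> users direct links
```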

The IRS-NOMA wireless communication is a downlink system. Its transmitted signal can be expressed as:

$$s = \sum_{i=1}^{K} \omega_i x_i,$$

where $\omega_i \in \mathbb{C}^{M \times 1}$ is the generated precoding vector and $x_i$ represents the signal to be sent to user $i$ [29]. The signal received by user $k$ is addressed by:

$$y_k = \left(\mathbf{h}_{d,k}^{H} + \mathbf{h}_{r,k}^{H}\,\Phi\,\mathbf{H}_1\right)\sum_{i=1}^{K}\omega_i x_i + n_k,$$

where the channel parameters from the BS to the user, from the BS to the IRS, and from the IRS to the user are signified as $\mathbf{h}_{d,k} \in \mathbb{C}^{M \times 1}$, $\mathbf{H}_1 \in \mathbb{C}^{N \times M}$ and $\mathbf{h}_{r,k} \in \mathbb{C}^{N \times 1}$, respectively, $\Phi = \mathrm{diag}(e^{j\theta_1}, \dots, e^{j\theta_N})$ is the IRS phase-shift matrix, and $n_k$ is additive noise with power $\sigma^2$. Users are ranked in terms of channel quality to simplify NOMA transmission. According to the decoding principle of NOMA, each user first decodes and removes the signals of users whose channel quality is lower than its own before decoding its own signal, while the signals transmitted to users with higher channel quality are treated as noise. As a result, the signal that user $k$ receives after Successive Interference Cancellation (SIC) can be expressed as:

$$\tilde{y}_k = \left(\mathbf{h}_{d,k}^{H} + \mathbf{h}_{r,k}^{H}\,\Phi\,\mathbf{H}_1\right)\omega_k x_k + \left(\mathbf{h}_{d,k}^{H} + \mathbf{h}_{r,k}^{H}\,\Phi\,\mathbf{H}_1\right)\sum_{i=k+1}^{K}\omega_i x_i + n_k.$$

After decoding the signals successfully, the transmission rate of user $k$ will be specified by:

$$R_k = \log_2(1 + \gamma_k), \qquad \gamma_k = \frac{\big|\big(\mathbf{h}_{d,k}^{H} + \mathbf{h}_{r,k}^{H}\,\Phi\,\mathbf{H}_1\big)\omega_k\big|^2}{\sum_{i=k+1}^{K}\big|\big(\mathbf{h}_{d,k}^{H} + \mathbf{h}_{r,k}^{H}\,\Phi\,\mathbf{H}_1\big)\omega_i\big|^2 + \sigma^2},$$

where $\gamma_k$ is the Signal-to-Interference-plus-Noise Ratio (SINR) at the $k$-th user. Thus, the system's Spectral Efficiency (SE) can be written as:

$$R_{\mathrm{sum}} = \sum_{k=1}^{K} R_k.$$
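Under the model above, a minimal NumPy sketch of the SINR and SE computation (the SIC ordering and noise power are assumptions carried over from the reconstruction; the channel matrices can be generated as in the earlier sketch):

```python
import numpy as np

def sum_rate(G, phi, H1, H2, Hd, noise=1e-3):
    """Spectral efficiency of the assumed IRS-NOMA downlink.

    G   : (M, K) beamforming matrix, one column per user
    phi : (N,)   IRS phase shifts; the reflection matrix is diag(e^{j*phi})
    H1  : (N, M) BS->IRS, H2: (K, N) IRS->users, Hd: (K, M) direct links
    Users are assumed indexed in increasing channel quality, so user k
    treats the signals of users i > k (better channels) as noise after SIC.
    """
    Phi = np.diag(np.exp(1j * phi))
    Heff = Hd + H2 @ Phi @ H1                     # (K, M) effective channels
    K = G.shape[1]
    rates = np.zeros(K)
    for k in range(K):
        sig = abs(Heff[k] @ G[:, k]) ** 2
        interf = sum(abs(Heff[k] @ G[:, i]) ** 2 for i in range(k + 1, K))
        rates[k] = np.log2(1 + sig / (interf + noise))
    return rates.sum()
```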

Optimization Problem
In the IRS-NOMA system, the goal is to optimize the energy efficiency, calculated as the ratio of the system's SE to the total power consumption:

$$\eta = \frac{R_{\mathrm{sum}}}{\mu \sum_{k=1}^{K} p_k + P_{BS} + P_U + P_n},$$

where $\mu\sum_{k=1}^{K} p_k$, $P_{BS}$, $P_U$ and $P_n$ are the power consumed by BS transmission, the BS hardware power loss, the power loss of the mobile user terminals and the power consumed by the IRS, respectively.
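A sketch of the corresponding energy efficiency computation; μ and the hardware power terms are placeholders with invented values:

```python
def energy_efficiency(se, p_k, mu=1.2, P_BS=0.1, P_U=0.05, P_n=0.08):
    """Energy efficiency: SE over total power (all values illustrative).

    se  : system spectral efficiency R_sum (bit/s/Hz)
    p_k : iterable of per-user transmit powers
    mu  : transmit power amplifier inefficiency factor
    P_BS, P_U, P_n : BS, user-terminal and IRS hardware power terms
    """
    return se / (mu * sum(p_k) + P_BS + P_U + P_n)
```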

DRL
RL has a strong ability to interact with an environment, and the interaction process between the agent and the environment can be expressed by Markov Decision Processes (MDPs). During the interaction, the agent performs an action under the guidance of the strategy π generated by the RL algorithm according to the currently observed state, receives feedback from the environment in the form of a reward, and then enters the next state. The RL algorithm repeats these steps and accumulates rewards. The purpose of the RL algorithm is to find the strategy that generates the largest cumulative reward over the interactions with the environment.
RL's fundamental model is made up of two parts, the agent and the environment, and involves the state, action and reward. MDP policies depend only on the current state and can be represented as W = {S, A, P, R, γ}:
(1) $S = \{s_1, s_2, \dots, s_n\}$ represents the state set;
(2) $A = \{a_1, a_2, \dots, a_n\}$ represents the action set;
(3) $P^a_{s \to s'} = P_r(s' \mid s, a)$ represents the probability that the current state $s$ moves to the next state $s'$ after taking action $a$;
(4) $R(s, a) = E[R_{t+1} \mid s, a]$ represents the immediate reward generated by the agent performing action $a$ in the current state $s$;
(5) $\gamma$ represents the discount factor, which assigns a different weight to each reward.
The cumulative reward of the agent is:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}.$$

The agent's optimization aim is to discover a strategy $\pi(a \mid s)$ that maximizes the cumulative reward, where $\pi(a \mid s) = P(a \mid s)$ is the probability of taking action $a$ in the current state $s$. The state value function, the expectation of the cumulative reward based on the state $s$ at time $t$, is obtained by:

$$V_\pi(s) = E_\pi[G_t \mid S_t = s].$$

The expected reward of a state at time $t$ after selecting action $a$ is called the state-action value function, calculated as:

$$Q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a].$$

The optimal versions of these two functions may be solved through the Bellman optimality criterion:

$$V^*(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P^a_{s \to s'} V^*(s') \Big),$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P^a_{s \to s'} \max_{a'} Q^*(s', a'),$$

where $V^*(s)$ is the value of selecting the optimal action among all possible ones in the current state, and $Q^*(s, a)$ is the long-term value of performing action $a$ in state $s$ and thereafter choosing the best action in every subsequent state.
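To make the Bellman optimality criterion concrete, the following self-contained value-iteration sketch solves an invented three-state, two-action MDP (the transition probabilities and rewards are random placeholders):

```python
import numpy as np

# Tiny tabular example of the Bellman optimality equation via value iteration.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probs
R = rng.standard_normal((n_s, n_a))                # R(s, a) immediate rewards

V = np.zeros(n_s)
for _ in range(500):
    Q = R + gamma * (P @ V)                        # Q[s, a] = R + gamma * sum_s' P V(s')
    V_new = Q.max(axis=1)                          # V*(s) = max_a Q*(s, a)
    if np.abs(V_new - V).max() < 1e-8:             # stop once the fixed point is reached
        break
    V = V_new
print("V* =", V.round(3), "  pi* =", Q.argmax(axis=1))
```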

DDPG Algorithm
Since the various parameters of a wireless cellular network change constantly and the agent's action values in the DDPG algorithm are continuous, DDPG is selected in this study to optimize the energy efficiency of the IRS-NOMA system. The block diagram of the optimization model of the IRS-NOMA downlink system based on the DDPG algorithm is shown in Figure 2. DDPG is divided into two sections: actor and critic. The actor section contains the actor network and the actor target network, whereas the critic section contains the critic network and the critic target network; in this sense, the DDPG algorithm has symmetric properties. All four networks are Deep Neural Networks (DNNs) with three layers: an input layer, a hidden layer and an output layer. The DNNs derive their generalization ability from their hidden layers.
The actor network outputs the agent's action according to the input state, and the critic network then outputs the Q value according to the input state and the action produced by the actor network. In the learning process, the critic network serves as an auxiliary network for reviewing the outputs of the actor network: it is in charge of evaluating the performance of the actor network but does not participate in generating actions.
The DDPG algorithm seeks to maximize the reviewing value of the critic network, which evaluates the actor network as accurately as possible. The actor network and critic network work together to optimize the parameters of the two parts of the neural network according to the following formula:

$$\theta_a^* = \arg\max_{\theta_a} Q\big(s, A(s; \theta_a); \theta_c\big), \qquad \theta_c^* = \arg\min_{\theta_c} \big(Q(s, a; \theta_c) - R(s, a)\big)^2,$$

and $\theta_a^*$ and $\theta_c^*$ are differentiable. According to the chain rule, their gradients can be obtained as follows:

$$\nabla \theta_a = \nabla_a Q(s, a; \theta_c)\big|_{a = A(s; \theta_a)} \nabla_{\theta_a} A(s; \theta_a), \tag{14}$$

$$\nabla \theta_c = \big(Q(s, a; \theta_c) - R(s, a)\big) \nabla_{\theta_c} Q(s, a; \theta_c)\big|_{a = A(s; \theta_a)}. \tag{15}$$
A soft update is adopted for the target network parameters, which therefore change slowly. The formulas are as follows:

$$\theta_a' \leftarrow \tau \theta_a + (1 - \tau)\,\theta_a', \qquad \theta_c' \leftarrow \tau \theta_c + (1 - \tau)\,\theta_c',$$

where τ is the update coefficient.
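A minimal NumPy sketch of one DDPG update, using deliberately simplified linear actor and critic functions (a stand-in for the paper's DNNs) so that the chain-rule gradient (14), the critic gradient (15) and the soft target updates can be written out explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
s_dim, a_dim = 4, 2
W_a = rng.standard_normal((a_dim, s_dim)) * 0.1    # actor:  A(s) = W_a @ s
w_c = rng.standard_normal(s_dim + a_dim) * 0.1     # critic: Q(s,a) = w_c . [s; a]
W_a_t, w_c_t = W_a.copy(), w_c.copy()              # target networks

def ddpg_step(s, reward, eta_a=1e-3, eta_c=1e-3, tau=0.01):
    global W_a, w_c, W_a_t, w_c_t
    a = W_a @ s
    sa = np.concatenate([s, a])
    q = w_c @ sa
    # Eq. (15): grad_theta_c = (Q - R) * grad of Q w.r.t. theta_c
    grad_c = (q - reward) * sa
    w_c -= eta_c * grad_c                          # gradient descent on the critic
    # Eq. (14): grad_theta_a = grad_a Q * grad_theta_a A  (chain rule)
    dQ_da = w_c[s_dim:]                            # for a linear critic, dQ/da
    grad_a = np.outer(dQ_da, s)                    # shape (a_dim, s_dim) matches W_a
    W_a += eta_a * grad_a                          # gradient ascent on the actor
    # Soft target updates with coefficient tau (the targets would feed the
    # critic's training targets in the full algorithm).
    W_a_t = tau * W_a + (1 - tau) * W_a_t
    w_c_t = tau * w_c + (1 - tau) * w_c_t

ddpg_step(rng.standard_normal(s_dim), reward=1.0)  # one illustrative update
```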

Application of DDPG in IRS-NOMA Wireless Communications
In this paper, the DDPG algorithm is adopted to optimize the energy efficiency of the IRS-NOMA wireless communication system. Since the DDPG agent can gather real-time information on the channel status, the state, action and instant reward are defined as follows (a sketch of how these vectors can be assembled appears just before Algorithm 1):
(1) State: the channel matrices H_1 and H_2, the transmission power at the t-th step, the received power of the users at the t-th step and the action from the (t − 1)-th step jointly determine the state s(t) at the t-th step. Since the system matrices are complex and information would be lost by the absolute-value operators used in computing the transmitted and received powers, the real and imaginary parts of the transmitted signal must be kept separate.
(2) Action: the transmit beamforming matrix G and the phase-shift matrix Φ constitute the action. Similarly, G = Re{G} + Im{G} and Φ = Re{Φ} + Im{Φ} are divided into real and imaginary parts, respectively.
(3) Reward: at the t-th step of DRL, the reward is the energy efficiency η(G(t), Φ(t), H_1, H_2) of the IRS-NOMA wireless communications system.
Algorithm complexity analysis: the DDPG algorithm has four neural networks; assuming each has L layers with M_l neurons in layer l, their complexities differ by layer type. This paper assumes that the total numbers of nodes in the 'BN', 'ReLU' and 'tanh' layers are M_b, M_r and M_t, respectively; M^a_l denotes the nodes of the actor network and M^c_l the nodes of the critic network. Individual 'BN', 'ReLU' and 'tanh' nodes require 5, 1 and 6 floating-point operations, respectively [30]. In the training process, both the actor network and the critic network are updated at every step. Therefore, the total complexity of the algorithm will be:

$$C = \mathcal{O}\Big(N T \Big(5M_b + M_r + 6M_t + \sum_{l=1}^{L-1} 2M^c_l M^c_{l+1} + \sum_{l=1}^{L-1} 2M^a_l M^a_{l+1}\Big)\Big), \tag{18}$$

where N and T are the number of training episodes and the number of steps per training episode, respectively. The pseudocode of the DDPG algorithm used in the IRS-NOMA wireless communication system is shown in Algorithm 1.
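The sketch below shows how the state and action vectors described above might be flattened into real-valued arrays; the helper names and ordering are illustrative, not the paper's exact layout:

```python
import numpy as np

def build_action(G, phi_diag):
    """Flatten the beamforming matrix G and the IRS phase vector,
    splitting each into real and imaginary parts."""
    return np.concatenate([G.real.ravel(), G.imag.ravel(),
                           phi_diag.real, phi_diag.imag])

def build_state(H1, H2, tx_power, rx_powers, prev_action):
    """State at step t: channels (split into Re/Im), transmit power,
    per-user received powers, and the action from step t-1."""
    return np.concatenate([H1.real.ravel(), H1.imag.ravel(),
                           H2.real.ravel(), H2.imag.ravel(),
                           [tx_power], rx_powers, prev_action])
```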

Algorithm 1: DDPG algorithm used in IRS-NOMA wireless communications system.
1: Input: episode number N, actor network learning rate η_a, critic network learning rate η_c, H_1 and H_2
2: Initialization: experience memory with size D, training actor network with parameter θ_a, target actor network with parameter θ_a' = θ_a, training critic network with parameter θ_c, target critic network with parameter θ_c' = θ_c, transmit beamforming matrix G, phase-shift matrix Φ
3: for n = 1 to N do
4:   Collect and preprocess H_1 and H_2 to obtain the first state s_1
5:   for t = 1 to T do
6:     Obtain action a_t from the actor network
7:     Execute action a_t and observe the instant reward r_t
8:     Observe the new state s_{t+1}
9:     Store the experience (s_t, a_t, r_t, s_{t+1}) in the replay memory
10:    Calculate the critic gradient ∇θ_c by (15) and update the parameter θ_c ← θ_c − η_c ∇θ_c
11:    Calculate the actor gradient ∇θ_a by (14) and update the parameter θ_a ← θ_a + η_a ∇θ_a
12:    Update the actor and critic target networks; set s_t = s_{t+1}
13:  end for
14: end for
15: Output: action, Q value function
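The following skeleton mirrors the loop structure of Algorithm 1; env_reset, env_step, actor and the two update functions are hypothetical stand-ins for the application-specific pieces described above:

```python
from collections import deque
import random

replay = deque(maxlen=100_000)                     # experience memory of size D
BATCH = 64

def train(env_reset, env_step, actor, update_critic, update_actor, N=100, T=200):
    for n in range(N):
        s = env_reset()                            # line 4: preprocess H1, H2 -> s_1
        for t in range(T):
            a = actor(s)                           # line 6: action from actor network
            r, s_next = env_step(a)                # lines 7-8: reward and next state
            replay.append((s, a, r, s_next))       # line 9: store the experience
            if len(replay) >= BATCH:
                batch = random.sample(replay, BATCH)
                update_critic(batch)               # line 10: eq. (15)
                update_actor(batch)                # line 11: eq. (14)
            s = s_next                             # line 12
```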

Simulation Results
In this section, the energy efficiency performance of DDPG in the IRS-NOMA wireless communications system is simulated and quantified. Small-scale Rayleigh fading is considered between the BS and the users. The simulation parameter settings are shown in Table 1. All simulations were performed on a desktop with an Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90 GHz and 16 GB of memory. The simulation environment was based on TensorFlow 1.13, and the proposed algorithm was implemented in the Python programming language. The DDPG algorithm's training diagram is shown in Figure 3. The reward value grows with the number of steps taken during the training phase and then converges to a constant value. When the transmission power is held constant, increasing the number of users reduces the system's energy efficiency.
As shown in Figure 4, when the transmit power of the BS is increased, the energy efficiency decreases accordingly. As the number of reflection units grows, the overall system's energy efficiency falls; nevertheless, the IRS's performance can be improved more effectively if the number of reflecting elements is increased appropriately. Figure 6 shows the comparison between NOMA transmission and OMA transmission. The energy efficiency gap between them widens as the transmit power grows. Therefore, the addition of NOMA technology proves to have positive significance for energy efficiency improvement.
Figure 6. The influence of different transmission powers on energy efficiency.

Conclusions
This paper proposed an optimized IRS-NOMA communication system based on the DDPG algorithm. To maximize the energy efficiency of this downlink wireless communication system, the DDPG algorithm from DRL was applied to jointly optimize the IRS's phase shifts and the beamforming vectors, improving the energy efficiency of the entire system. The simulation results demonstrated the applicability of the proposed algorithm and showed that the integration of NOMA and IRS can effectively improve the energy efficiency of IRS-NOMA wireless communication systems.
