Joint Beamforming, Power Allocation, and Splitting Control for SWIPT-Enabled IoT Networks with Deep Reinforcement Learning and Game Theory

Future wireless networks promise immense increases in data rate and energy efficiency while overcoming the difficulty of charging the wireless stations or devices in the Internet of Things (IoT) through the capability of simultaneous wireless information and power transfer (SWIPT). For such networks, jointly optimizing beamforming, power control, and energy harvesting to enhance the communication performance from the base stations (BSs) (or access points (APs)) to the mobile nodes (MNs) served is a real challenge. In this work, we formulate the joint optimization as a mixed integer nonlinear programming (MINLP) problem, which can also be realized as a complex multiple resource allocation (MRA) optimization problem subject to different allocation constraints. By means of deep reinforcement learning to estimate future rewards of actions based on the information reported by the users served by the networks, we introduce single-layer MRA algorithms based on deep Q-learning (DQN) and deep deterministic policy gradient (DDPG), respectively, as the basis for the downlink wireless transmissions. Moreover, by incorporating the capability of the data-driven DQN technique and the strength of the noncooperative game theory model, we propose a two-layer iterative approach to resolve the NP-hard MRA problem, which can further improve the communication performance in terms of data rate, energy harvesting, and power consumption. For the two-layer approach, we also introduce a pricing strategy with which BSs or APs determine their power costs on the basis of social utility maximization to control the transmit power.
Finally, with a simulated environment based on realistic wireless networks, our numerical results show that, in terms of the utilities designed to reflect the trade-off among the performance metrics considered, the proposed two-layer MRA algorithm can achieve a value up to 2.3 times higher than those of the single-layer counterparts, which represent the data-driven deep reinforcement learning-based algorithms extended to resolve the problem.


Introduction
The tremendous growth in wireless data transmission results from the introduction of the fifth generation of wireless communications (5G) and will continue in the wireless networks beyond 5G (B5G). In particular, the collaboration between 5G-enabled Internet of Things (5G-IoT) and wireless sensor networks (WSNs) will extend the connections between the Internet and the real world and widen the scope of IoT services. In such collective networks, by uploading part of or all of the computing tasks to the network edge, a mobile edge computing (MEC) technique is developed to reduce the enormous data traffic and huge energy consumption brought by a great number of IoT devices. In this work, we adopt single-agent-based reinforcement learning to comply with the fact noted in [47] that, when a multi-agent setting is modified by the actions of all agents, the environment becomes non-stationary, and the effectiveness of most reinforcement learning algorithms does not hold in non-stationary environments [48]. In addition, by further collaborating with the game-based iterative algorithms, our approach would reduce the overhead resulting from, e.g., the MM approach to resolve a complex optimization problem such as that in [46].

The Motivations and Characteristics of This Work
In recent years, advances in artificial intelligence have been further helped by neural networks such as generative adversarial networks [49], which use game-theoretic techniques in deep learning and can converge to the Nash equilibrium of the game involved. In general, these advances can be reflected by the notion that a machine (computer) can learn about the outcomes of the game involved and teach itself to do better based on the probabilities, strategies, and previous instances of the game and the other players, grounded in game theory. By extending this notion to the optimization framework, in this work, we further exhibit the possibility of applying learning-based methods, model-based methods, or both to resolve the joint beamforming, power control, and energy harvesting problem in SWIPT-enabled wireless networks, which can alleviate the hardness of finding an optimal solution with an optimization tool required to complete in time. In particular, in this scenario, apart from BS i serving the user or MN needing to decide its transmit power, beamforming vector, and power splitting ratio, the other BSs j ≠ i would make their own decisions at the same time, which can affect the user or MN served by BS i simultaneously. Here, by leveraging this scenario, we conduct our approach to make a good trade-off between information decoding and energy harvesting, which can be deployed in an actual SWIPT-enabled IoT network as one of the various SWIPT applications surveyed in [50]. Specifically, by having the UE send its coordinates to the BS as in [51], our approach can align with the industry specification [33] through a slight modification that reduces the original signaling overhead of [33] on the channel state information to be sent by the UE, whose report has a length at least equal to the number of antenna elements.
As a summary, we list the characteristics of this work as follows:
• We introduce two single-layer algorithms based on the conventional DRL-based models, DQN and DDPG, to solve the joint optimization problem formulated here as a non-convex MINLP problem and realized as an MRA problem subject to the different allocation constraints.
• We further propose a two-layer iterative approach that incorporates the capability of the data-driven DQN technique and the strength of the non-cooperative game theory model to resolve the NP-hard MRA problem.
• For the two-layer approach, we also introduce a pricing strategy to determine the power costs based on social utility maximization to control the transmit power.
• With a simulated environment based on realistic wireless networks, we show that, by combining learning-based and model-based methods, the proposed two-layer MRA algorithm can outperform the single-layer counterparts that rely only on the data-driven DRL-based models.
The rest of this paper is structured as follows. In Section 2, we introduce the network and channel models for this work. Next, we present the single-layer learning-based approaches in Section 3, followed by the two-layer hybrid approach based on game theory and deep reinforcement learning in Section 4. These approaches are then numerically examined in Section 5 to show their performance differences. Finally, conclusions are drawn in Section 6.

Network Model
As shown in Figure 1, an orthogonal frequency division multiplexing (OFDM) multiaccess network with L base stations (BSs) (or access points (APs)) is considered for downlink transmission, in which a serving BS would associate with one mobile node (MN). The distance between two neighbor BSs is R, and the cell radius (or transmission range) of BS i is r̂ > R/2 to allow overlap. Here, unlike the conventional coordinated multipoint Tx/Rx (CoMP) system applied to the scenario in which an MN could receive data from multiple BSs, we apply the SWIPT technique to the network so that an MN can simultaneously receive not only wireless information but also energy from different BSs. In addition, although mmWave brings many performance benefits as an essential part of 5G, it is also known to have high propagation losses due to the higher frequency bands adopted. Thus, analog beamforming for the downlink transmission is considered to alleviate these losses.

Figure 1.
A system model with respect to the joint beamforming, power allocation, and splitting control for SWIPT-enabled IoT networks. In this model, each mobile node has a power split mechanism to split the received signal into two streams, one sent to the energy harvesting circuit for harvesting energy and the other to the communication circuit for decoding information.
Next, to construct a beampattern toward the MN more flexibly, each BS adopts a two-dimensional array of M antennas, while each MN has a single antenna. Given that, the received signal at the MN associated with the i-th BS would be

y_i = h_{i,i} f_i x_i + Σ_{j≠i} h_{i,j} f_j x_j + n_i.

In the above, x_i, x_j ∈ C are the transmitted signals from the i-th and j-th BSs, complying with the power constraints E{|x_i|^2} = P_i and E{|x_j|^2} = P_j, where P_i and P_j are the transmit powers of the i-th and j-th BSs. In addition, h_{i,i}, h_{i,j} ∈ C^{M×1} are the channel vectors from the i-th and j-th BSs to the MN at the i-th BS, and f_i, f_j ∈ C^{M×1} denote the downlink beamforming vectors adopted at the i-th and j-th BSs, respectively. As the last term, n_i represents the noise at the receiver, sampled from a complex normal distribution with zero mean and variance σ_n^2.

Beamforming: As mentioned previously, to combat the high propagation loss, analog beamforming vectors are assumed for transmission, and each f_i, i = 1, 2, ..., |F|, consists of the beamforming weights for a two-dimensional (2D) planar array steered toward the MN. More specifically, let each BS have a 2D array of antennas in the x-y plane, in which antenna m is located at (a_m λ, b_m λ), where λ is the wavelength. Given the elevation direction ψ_d and the azimuthal direction φ_d, the phased weights for the 2D array steered toward the angle (ψ_d, φ_d) in polar coordinates can be given by e^{−j2π sin ψ_d (a_m cos φ_d + b_m sin φ_d)}. If the target is located on the x-y plane, sin ψ_d will be 1 and the weights can be simplified as e^{−j2π(a_m cos φ_d + b_m sin φ_d)}.
Given that, we consider every beamforming vector to be selected from a steering-based beamforming codebook F with |F| elements, wherein the n-th element, or the array steering vector in the direction φ_n, is given by

a(φ_n) = [e^{−j2π(a_1 cos φ_n + b_1 sin φ_n)}, …, e^{−j2π(a_M cos φ_n + b_M sin φ_n)}]^T.
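As an illustration, such a steering-based codebook can be generated with a few lines of code. The half-wavelength antenna spacing and the uniform azimuth sampling below are assumptions of this sketch, not details specified in the text:

```python
import numpy as np

def array_coords(nx, ny, spacing=0.5):
    """Antenna coordinates (a_m, b_m) of an nx-by-ny planar array,
    measured in wavelengths (half-wavelength spacing assumed)."""
    ax, by = np.meshgrid(np.arange(nx) * spacing, np.arange(ny) * spacing)
    return np.column_stack([ax.ravel(), by.ravel()])

def steering_vector(coords, phi):
    """Weights e^{-j*2*pi*(a_m cos(phi) + b_m sin(phi))} for a beam
    steered toward azimuth phi on the x-y plane (sin(psi_d) = 1)."""
    a, b = coords[:, 0], coords[:, 1]
    return np.exp(-2j * np.pi * (a * np.cos(phi) + b * np.sin(phi)))

def build_codebook(coords, n_beams):
    """Codebook F: steering vectors at n_beams uniformly spaced azimuths."""
    phis = np.linspace(0.0, 2 * np.pi, n_beams, endpoint=False)
    return np.stack([steering_vector(coords, p) for p in phis])
```

Each codebook entry has unit-modulus weights, as required of an analog (phase-only) beamformer.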

Channel Model
With the beamforming vector introduced above, we consider a narrow-band geometric channel model which is widely used for mmWave networks [52-54]. Specifically, the channel from BS i to the MN in BS j is formulated here as

h_{i,j} = √ρ_{i,j} Σ_{p=1}^{N^p_{i,j}} α^p_{i,j} a(φ^p_{i,j}),

where ρ_{i,j} represents the path loss between BS i and the MN associated with BS j, α^p_{i,j} is the complex path gain, and a(φ^p_{i,j}) denotes the array response vector with respect to φ^p_{i,j}, the angle of departure (AoD) of the p-th path. N^p_{i,j} is the number of channel paths which, compared with that for sub-6 GHz, is usually small for mmWave [55,56]. Next, let the received power measured by the MN associated with BS i over a set of resource blocks (RBs) on the channel from BS j to the MN be P_j |h_{i,j} f_j|^2. Given that, the received signal to interference and noise ratio (SINR) for the MN associated with BS i can be obtained by

γ_i = P_i |h_{i,i} f_i|^2 / ( Σ_{j≠i} P_j |h_{i,j} f_j|^2 + σ_n^2 ).   (5)

As shown above, each BS i uses P_i to transmit to its user with beamforming vector f_i. When incorporating SWIPT into power allocation, the use of beamforming in the mmWave MIMO system provides a new solution to resolve both the interference and energy problems [57-59]. To this end, each MN in the network is equipped with a power splitting unit to split the received signal for information decoding and energy harvesting simultaneously. Given that, the beamforming provides a dedicated beam for the MN through which power control and power splitting for energy harvesting can be realized at the same time. More specifically, in the power splitting architecture for the downlink, the received signal at the MN associated with BS i, which transmits with its beamforming vector f_i and transmit power P_i, is split into two separate signal streams according to the power split ratio θ_i, which will be determined in the sequel to maximize the system utility.
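A minimal sketch of this geometric channel is shown below. The Rayleigh-distributed path gains, uniform AoDs, and 1/N_p power normalization are common conventions assumed for the sketch rather than details given in the text:

```python
import numpy as np

def array_response(coords, phi):
    """Array response a(phi) for a planar array with coordinates
    (a_m, b_m) in wavelengths, matching the beamforming codebook."""
    a, b = coords[:, 0], coords[:, 1]
    return np.exp(-2j * np.pi * (a * np.cos(phi) + b * np.sin(phi)))

def geometric_channel(coords, pathloss, n_paths, rng):
    """Narrow-band geometric mmWave channel: a few paths, each with a
    complex gain alpha_p and an array response at its AoD phi_p."""
    alphas = (rng.standard_normal(n_paths)
              + 1j * rng.standard_normal(n_paths)) / np.sqrt(2)
    aods = rng.uniform(0.0, 2 * np.pi, n_paths)
    paths = sum(a * array_response(coords, p) for a, p in zip(alphas, aods))
    return np.sqrt(pathloss / n_paths) * paths
```

With a small number of paths (e.g., 3), the resulting vector is sparse in the beamspace, which is why a steering-based codebook can capture most of the channel gain.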
In addition, when the technology of successive interference cancellation (SIC) is employed to mitigate the interference for data decoding, the stronger signals are decoded first, and the weaker signals remaining contribute the interference for decoding. With P and Θ to denote the sets for the transmit power and the power split ratio, respectively, in addition to the above, the SINR at the MN associated with BS i with SWIPT and SIC could be obtained by

γ_i(P, θ_i, F) = (1 − θ_i) P_i |h_{i,i} f_i|^2 / ( (1 − θ_i) Σ_{j∈I_i} P_j |h_{j,i} f_j|^2 + σ_n^2 ),   (6)

where I_i = { j ≠ i : P_j |h_{j,i} f_j|^2 < P_i |h_{i,i} f_i|^2 } denotes the set of weaker interfering signals remaining after SIC. As shown above, 1 − θ_i denotes the fraction of the signal for the data transmission of SWIPT. In addition, with SIC [60], when there are multiple signals received by the MN associated with BS i concurrently, it will decode the stronger signals and treat the weaker signals as interference. Here, if there are stronger signals from some BSs, they are decoded and removed first. Then, the desired signal is obtained by treating the weaker signals from the other BSs j ≠ i, i.e., those with P_i |h_{i,i} f_i|^2 > P_j |h_{j,i} f_j|^2, if they exist, as the interference for decoding in addition to the noise σ_n^2.
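The SINR computation with power splitting and SIC described above can be sketched as follows. The scaling of the residual interference by the same split ratio as the desired signal, and the illustrative noise variance, are assumptions of this sketch:

```python
def sinr_swipt_sic(i, P, theta, gain, noise_var=1e-9):
    """SINR at the MN of link i with power splitting and SIC (a sketch).
    gain[j] stands for |h_{j,i} f_j|^2, the effective gain from BS j to
    MN i. Signals stronger than the desired one are assumed decoded and
    cancelled by SIC; only the weaker ones remain as interference."""
    desired = P[i] * gain[i]
    residual = sum(P[j] * gain[j] for j in range(len(P))
                   if j != i and P[j] * gain[j] < desired)
    split = 1.0 - theta[i]  # fraction routed to information decoding
    return split * desired / (split * residual + noise_var)
```

For example, a strong interferer is removed by SIC entirely, while a weak one is scaled by the split ratio and kept in the denominator.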

Problem Formulation
Providing these essential models, our aim is to jointly optimize the beamforming vectors, transmit powers, and power split ratios at the BSs to make the best trade-off between the data rates, harvested energies, and power consumption of all MNs served in the SWIPT-enabled network with SIC. This is formulated as a complex multiple resource allocation (MRA) optimization problem subject to different allocation constraints resulting from the different types of resources involved, shown as follows:

(P1):  max_{P, Θ, F}  Σ_i U_i(P, θ_i, F)   (7a)
       s.t.  P_min ≤ P_i ≤ P_max, ∀i,   (7b)
             0 ≤ θ_i ≤ 1, ∀i,   (7c)
             f_i ∈ F, ∀i,   (7d)

where U_i(P, θ_i, F) in (7a) denotes the utility function for the trade-off to be introduced in (19); (7b) specifies the constraint that the transmit power, P_i, should range between the minimum transmit power, P_min, and the maximum transmit power, P_max; (7c) requires θ_i to be a nonnegative ratio no larger than 1; and (7d) says that the vector f_i should be selected from its codebook F. Clearly, if U_i in the objective involves γ_i in (6), (P1) is a mixed integer nonlinear programming (MINLP) problem. It is even a non-convex MINLP problem due to the non-convexity of the objective function and the allocation constraints involving discrete values, and its solution is hard to find even with an optimization tool. To resolve this hard problem efficiently, we propose two kinds of innovative approaches based on deep reinforcement learning, game theory, or both, resulting in data-driven, model-driven, or hybrid iterative algorithms which can operate in a single layer or in two different layers, as introduced in the following. In addition, for clarity, we summarize the important symbols for the approaches to be introduced in Table A1, located in Appendix A due to its size.
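To see why (P1) is hard, note that even after discretizing every variable, exhaustive search is feasible only for tiny instances; the candidate space grows as (|P|·|Θ|·|F|)^n with the number of links n. The sketch below enumerates that space with a toy utility standing in for U_i of (19):

```python
import itertools

def brute_force_mra(utility, powers, thetas, beams, n_links):
    """Exhaustive search over the discretized search space of (P1).
    `utility` maps a full allocation (one (P, theta, f) tuple per link)
    to a scalar; the exponential candidate count motivates the
    learning-based and game-based approaches instead."""
    per_link = list(itertools.product(powers, thetas, beams))
    best_alloc, best_val = None, float("-inf")
    for alloc in itertools.product(per_link, repeat=n_links):
        val = utility(alloc)
        if val > best_val:
            best_alloc, best_val = alloc, val
    return best_alloc, best_val
```

Even with only 8 options per link, 10 links already give 8^10 ≈ 10^9 candidates, which is why an exact solver cannot be expected to finish in time.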

Single-Layer Learning-Based Approaches
Determining an exact state transition model for (P1) through a model-based dynamic programming algorithm is challenging because the MRA problem on the transmit power, power split ratio, and beamforming vector is location dependent, and it is not trivial to list all the state-action pairs in a predefined state transition model. Therefore, we design two single-layer learning-based algorithms derived from the Markov decision process (MDP) to resolve this problem.

Q-Learning Approach
The Q-learning algorithm is based on the MDP that can be defined as a 4-tuple <S, A, R, P>, where S = {s_1, s_2, ..., s_m} is the finite set of states, and A = {a_1, a_2, ..., a_n} is the set of discrete actions. R(s, a, s′) is the function providing reward r defined at state s ∈ S, action a ∈ A, and next state s′. P_{ss′}(a) = p(s′|s, a) is the transition probability of the agent at state s taking action a to migrate to state s′. Given that, reinforcement learning is conducted to find the optimal policy π*(s) that can maximize the total expected discounted reward. Among the different approaches to this end, Q-learning is widely considered, which adopts a value function V^π(s) for the expected value to be obtained by policy π from each s ∈ S. Specifically, based on the infinite horizon discounted MDP, the value function in the following is formulated to show the goodness of π as

V^π(s) = E[ Σ_{t=0}^∞ ζ^t r_t | s_0 = s, π ],

where 0 ≤ ζ ≤ 1 denotes the discount factor, and E is the expectation operator. Here, the optimal policy is defined to map the states to the optimal actions in order to maximize the expected cumulative reward. In particular, the optimal action at each state s can be obtained with the Bellman equation [61]:

V*(s) = max_a [ R(s, a) + ζ Σ_{s′} P_{ss′}(a) V*(s′) ].

Given that, the action-value function is in fact the expected reward of this model starting from state s and taking action a according to policy π; that is,

Q^π(s, a) = E[ Σ_{t=0}^∞ ζ^t r_t | s_0 = s, a_0 = a, π ].

Let the optimal action-value function Q*(s, a) be Q^{π*}(s, a). Then, we can obtain

Q*(s, a) = E[ r + ζ max_{a′} Q*(s′, a′) | s, a ].

The strength of Q-learning can now be revealed, as it can learn π* without knowing the environment dynamics or P_{ss′}(a); the agent can learn it by adjusting the Q value with the following update rule:

Q(s, a) ← Q(s, a) + α [ r + ζ max_{a′} Q(s′, a′) − Q(s, a) ],

where α ∈ [0, 1) denotes the learning rate. Given this strength, the application of Q-learning is, however, limited because the optimal policy can be obtained only when the state-action spaces are discrete and the dimension is relatively small.
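The tabular update rule above takes only a few lines; the table dimensions in the usage below are arbitrary and chosen for illustration:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, zeta=0.9):
    """One Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + zeta * max_a' Q(s',a') - Q(s,a)).
    Q is an |S| x |A| table; s, a, s_next are integer indices."""
    td_error = r + zeta * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```

Because the target uses max over the next state's actions rather than the action actually taken, the rule is off-policy, which is what later permits the ε-greedy behavior policy.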
Fortunately, after considerable investigations on deep learning techniques, reinforcement learning has made significant progress by replacing the Q-table with a neural network, leading to DQN that can approximate Q(s_t, a_t). In particular, in DQN, the Q value at time t is rewritten as Q(s_t, a_t, ω), wherein ω is the weight of a deep neural network (DNN). Given that, the optimal policy π*(s) in DQN can be represented by π*(s) = arg max_{a′} Q*(s_t, a′, ω), where Q* denotes the optimal Q value obtained through the DNN. The goal of this approach is then to choose the approximated action a_{t+1} = π*(s_{t+1}), and the approximated Q value is given by

Q̂(s_t, a_t, ω) = r_t + ζ max_{a′} Q(s_{t+1}, a′, ω).

In the above, ω will be updated by minimizing the loss function:

Loss(ω) = E[ (Q̂(s_t, a_t, ω) − Q(s_t, a_t, ω))^2 ].

Deep Q-learning elements: Following the Q-learning design approach, we next define the state, action, and reward function specific for solving (P1) as follows: (1) State: First, if there are n links in the network, the state at time t is represented in the sequel by using capital notations for its components and the superscript "(t)" for the time index as follows:

s^(t) = (L^(t), P^(t), Θ^(t), F^(t)),

where L^(t) = (L_1^(t), ..., L_n^(t)), P^(t) = (P_1^(t), ..., P_n^(t)), Θ^(t) = (θ_1^(t), ..., θ_n^(t)), and F^(t) = (f_1^(t), ..., f_n^(t)). In the above, L_i^(t) denotes the Cartesian coordinates of the MN in link i at time t, while the others, i.e., P_i^(t), θ_i^(t), and f_i^(t), denote the transmit power, power splitting ratio, and beamforming vector for link i at time t, respectively. Among these variables, the transmit power is usually the only parameter considered in many previous works [27,62]. In the complex MRA problem also involving other types of resources, it is still a major factor affecting the system performance, based on the SINR in (5) that would be significantly impacted by the power, and thus we consider two different state formulations for P^(t) as follows.
• Power state formulation 1 (PSF1): First, to align with the industry standard [33] which chooses integers for power increments, we consider a ±1 dB offset representation similar to that shown in [51] as the first formulation for the power state. Specifically, given an initial value P_i^0, the transmit power P_i, ∀i (despite t), will be chosen from the set

{ P_i^0 · 10^{k/10} : k = K_min, K_min + 1, ..., K_max },

where K_min = 10 log_10(P_min / P_i^0) and K_max = 10 log_10(P_max / P_i^0).
• Power state formulation 2 (PSF2): Next, as shown in [27], the performance of a power-controllable network can be improved by quantizing the transmit power with a logarithmic step size instead of a linear step size. Given that, the transmit power P_i, ∀i could be selected from the set

{ P_min (P_max / P_min)^{k/(K−1)} : k = 0, 1, ..., K − 1 }

of K logarithmically spaced levels. Apart from the above, the other parameters, such as θ_i, ∀i, can be chosen from the splitting ratio set Θ with a linear step size, and f_i, ∀i can be selected from the predefined codebook F with |F| finite vectors or elements.
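The two power state formulations can be generated as follows. The rounding of the dB offsets to integers inside [P_min, P_max] in PSF1, and the explicit level count in PSF2, are assumptions of this sketch:

```python
import numpy as np

def psf1_levels(p0, p_min, p_max):
    """PSF1: powers reachable from the initial value p0 by integer-dB
    offsets, kept within [p_min, p_max] (all values in linear watts)."""
    k_min = int(np.ceil(10 * np.log10(p_min / p0)))
    k_max = int(np.floor(10 * np.log10(p_max / p0)))
    return [p0 * 10 ** (k / 10.0) for k in range(k_min, k_max + 1)]

def psf2_levels(p_min, p_max, n_levels):
    """PSF2: n_levels powers spaced with a logarithmic step size
    between p_min and p_max."""
    return list(np.logspace(np.log10(p_min), np.log10(p_max), n_levels))
```

The logarithmic spacing of PSF2 places more candidate powers near P_min, which is where fine-grained control matters most for interference-limited links.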
(2) Action: The action of this process at time t, a^(t), is selected from a set Â of binary decisions on the variables, where Â_P, Â_Θ, and Â_F denote all the possible binary decisions on the three types of variables involved, respectively. That is, the agent can decide, for each link i, to increase or decrease each of the variables to the next quantized value. Note that, as the number of values of a variable is limited, when reaching the maximum or minimum value with a binary action chosen from Â, a modulo operation is used to decide the index for the next quantized value in the state space. For example, with f_min = 1 and f_max = |F| to denote the first and the last vector in the codebook F, respectively, the action of increasing or decreasing f_min ≤ f_i^(t) ≤ f_max by 1 will choose the next or the previous vector in the codebook, and a similar modulo operation will also be applied to keep f_i^(t) within [f_min, f_max]. (3) Reward: To reduce the power consumption for green communication while maintaining the desired trade-off between the data rate and the energy harvesting, we introduce a reward function that represents a trade-off among the three metrics, properly normalized for link i with parameters λ_i, μ_i, and ν_i, at time t, as

r_i^(t) = λ_i R_i(P^(t), θ_i^(t), F^(t)) + μ_i E_i(P^(t), θ_i^(t), F^(t)) − ν_i P_i^(t),   (19)

where R_i(P^(t), θ_i^(t), F^(t)) denotes the data rate of link i obtained at time t, which can be represented by

R_i(P^(t), θ_i^(t), F^(t)) = log_2(1 + γ_i(P^(t), θ_i^(t), F^(t))).

In addition, E_i(P^(t), θ_i^(t), F^(t)) is the energy harvested at the MN of link i at time t, represented in the log scale as

E_i(P^(t), θ_i^(t), F^(t)) = log(1 + e_i(P^(t), θ_i^(t), F^(t))),

wherein the harvested energy in its raw form is given by

e_i(P^(t), θ_i^(t), F^(t)) = δ θ_i^(t) Σ_j P_j^(t) |h_{j,i} f_j^(t)|^2.   (22)

In the above, δ is the power conversion efficiency, and ν_i is the price or cost for the power consumption P_i^(t) to be paid for link i's transmission. Note that the log representation is considered here to accommodate a normalization process in deep learning similar to the batch normalization in [63]. Otherwise, the data rate obtained with a log operation and the raw energy harvesting e_i(P^(t), θ_i^(t), F^(t)) without the log operation would be directly combined in the utility function.
If so, with the metric values lying in very different ranges, such a raw representation could cause problems in the training process. Note also that, although λ_i and μ_i could be set to compensate for the scale differences, a very high energy value obtained in certain cases could still significantly vary the utility function and impede the learning process. By taking these into account, the system utility at time t can be represented by the sum of these link rewards as

U^(t) = Σ_i r_i^(t).   (23)

Policy selection: In general, Q-learning is an off-policy algorithm that can find a suboptimal policy even when its actions are obtained from an arbitrary exploratory selection policy [64]. Following that, we conduct the DQN-based MRA algorithm with a near-greedy action selection policy, which consists of (1) an exploration mode and (2) an exploitation mode. On the one hand, in exploration mode, the DQN agent randomly tries different actions at every time t to obtain a better state-action or Q value. On the other hand, in exploitation mode, the agent chooses at each time t an action a^(t) that can maximize the Q value via the DNN with weight ω; that is, a^(t) = arg max_{a′∈A} Q*(s_t, a′, ω). More specifically, we conduct the agent to explore with a probability ε and to exploit with a probability 1 − ε, where ε ∈ (0, 1) denotes a hyperparameter to adjust the trade-off between exploration and exploitation, resulting in an ε-greedy selection policy.
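The near-greedy selection policy described above amounts to a few lines of code:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Near-greedy selection: explore a uniformly random action with
    probability epsilon, otherwise exploit the action that maximizes
    the (estimated) Q value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # exploration mode
    return int(np.argmax(q_values))              # exploitation mode
```

In practice epsilon is often decayed over training so that the agent explores early and exploits late; the schedule is left out of this sketch.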
Experience replay: This algorithm also includes a buffer memory D as a replay memory to store transitions (s^(t), a^(t), r^(t), s^(t+1)), where the reward r^(t) = U^(t) is obtained by (23) at time t. Given that, at each learning step, a mini-batch is constructed by randomly sampling the memory pool, and then stochastic gradient descent (SGD) is used to update ω. By reusing previous experiences, the experience replay allows the stored samples to be exploited more efficiently. Furthermore, by randomly sampling the experience buffer, a more independent and identically distributed data set can be obtained for training.
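A minimal replay memory with the store-and-sample behavior described above can be sketched as:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay: store (s, a, r, s') transitions and
    sample uniform random mini-batches to de-correlate training data."""

    def __init__(self, capacity):
        # deque with maxlen drops the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```

The same structure serves both the DQN-based and the DDPG-based algorithms, which differ only in how the sampled mini-batch is used to update the networks.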
As a summary of the key points introduced above, we formulate the single-layer DQN-based MRA training algorithm with a pseudo code representation shown in Algorithm 1 for easy reference.

Algorithm 1
The single-layer DQN-based MRA training algorithm.

DDPG-Based Approach
As found in the literature [28,29], DQN, as a deep reinforcement learning algorithm, is superior to the classical Q-learning algorithm because it can handle problems with high-dimensional state spaces that can hardly be done with the latter. However, DQN still works on a discrete action space and suffers from the curse of dimensionality when the action space becomes large. For this, we next develop a deep deterministic policy gradient (DDPG)-based algorithm that can find optimal actions in a continuous space to solve this MRA optimization problem without the action quantization required by the DQN-based algorithm.
Specifically, with DDPG, we aim to determine an action a to maximize the action-value function Q(s, a) for a given state s. That is, our goal is to find

a* = arg max_a Q(s, a),   (24)

as done with DQN introduced previously. However, unlike DQN, there are two neural networks for DDPG, namely the actor network and the critic network, and each contains two subnets, namely an online net and a target net, with the same architecture. First, the actor network with the DNN weight ω_a, which is called the "actor parameter", takes state s to output a deterministic action a, denoted by Q_a(s; ω_a). Second, the critic network with the DNN weight ω_c, which is called the "critic parameter", takes state s and action a as its inputs to produce the action-value function, denoted by Q(s, a; ω_c), to simulate the table of Q-learning (Q-table) in a way that avoids the curse of dimensionality. Given that, two key features of DDPG can be summarized as follows: (1) Exploration: As defined, the actor network is conducted to provide solutions to the problem, playing a crucial role in DDPG. However, as it is designed to produce only deterministic actions, additional noise, n, is added to the output so that the actor network can explore the solution space. That is,

a = Q_a(s; ω_a) + n.

(2) Updating the networks: Next, with the notation (s, a, r, s′) to denote the transition wherein reward r is obtained by taking action a at state s to migrate to s′, as in DQN, the update procedures for the critic and actor networks can be further summarized in the following.
• As shown in (24), the actor network is updated by maximizing the action-value function. In terms of the parameters ω_a and ω_c, this maximization problem can be rewritten as maximizing J(ω_a) = Q(s, a; ω_c)|_{a=Q_a(s;ω_a)}. Here, as the action space is continuous and the action-value function is assumed to be differentiable, the actor parameter, ω_a, would be updated by using the gradient ascent method. Furthermore, as the gradient depends on the derivative of the objective function with respect to ω_a, the chain rule can be applied as

∇_{ω_a} J(ω_a) = ∇_a Q(s, a; ω_c)|_{a=Q_a(s;ω_a)} ∇_{ω_a} Q_a(s; ω_a).

Then, as the actor network outputs Q_a(s; ω_a) to be the action adopted by the critic network, the actor parameter ω_a can be updated by maximizing the critic network's output with the action obtained from the actor network, while fixing the critic parameter ω_c. • Apart from the actor network generating the needed actions, the critic network is also crucial to ensure that the actor network is well trained. To update the critic network, there are two aspects to be considered. First, with Q_a(s′; ω′_a) from the target actor network as an input of the target critic network, the target value can be produced as

y = r + ζ Q(s′, a′; ω′_c)|_{a′=Q_a(s′; ω′_a)}.

Second, the output of the critic network, Q(s, a; ω_c), can be regarded as another source to estimate this value. Based on these aspects, the critic network can be updated by minimizing the following loss function:

Loss(ω_c) = E[ (y − Q(s, a; ω_c))^2 ].

Given that, the critic parameter, ω_c, can be obtained by finding the parameter to minimize this loss function. • Finally, the target nets in both the critic and actor networks can be updated with the soft update parameter, τ, on their parameters ω_c and ω_a, as follows:

ω′_c ← τ ω_c + (1 − τ) ω′_c,
ω′_a ← τ ω_a + (1 − τ) ω′_a.

Action representation for the MRA problem: As defined, the actor network outputs the deterministic action a* = Q_a(s; ω_a). Due to this determinism, a dynamic ε-greedy policy is used to determine the action by adding a noise term n^(t) to explore the action space.
Here, as the state of this work involves different types of variables, the action resulting at time t in fact consists of three parts as a^(t) = (a_P^(t), a_Θ^(t), a_F^(t)). When added with the corresponding noises, the exploration action a^(t) would be specified as

a_x^(t) = Q_a(s^(t); ω_a)_x + n_x^(t), x ∈ {P, Θ, F},   (30)

where the different parts of a^(t) are clipped to the intervals [x_low, x_up], x ∈ {P, Θ, F}, according to the different types of variables, and the added noises are obtained with a normal distribution also based on the different types as

n_x^(t) ∼ d^(t) N(0, σ_x^2), x ∈ {P, Θ, F},

where d^(t) denotes the exploration decay rate at time t. State normalization and quantization: As shown in previous works [32,63,65], a state normalization to preprocess the training sample sets leads to a much easier and faster training process. In our work, the three types of variables, P^(t), Θ^(t), and F^(t) (shown in vector forms) in s^(t) may have their values lying in very different ranges, which could cause problems in the training process. To prevent this, we normalize the coordinates with the cell radius r̂, and these variables with the scale factors ς_1, ς_2, and ς_3, respectively. Among these variables, f̃_i is an integer variable rounded from its real counterpart to denote which element in the codebook F is to be used, because the output of DDPG is a continuous action. Specifically, given f_i^(t) obtained by (30), its value at time t will be

f̃_i^(t) = ⌊f_i^(t)⌋.   (33)

Note that, after the rounding operation (represented here by the floor function), the value may still be out of its feasible range, and thus a modulo operation similar to that for DQN is also applied here to keep it in [f_min, f_max]. For the other types of variables, the corresponding modulo operations are required to keep them in their feasible ranges as well; still, due to their continuous nature, a rounding operation is avoided, and P_i^(t) and θ_i^(t) at time t are updated by the modulo operations only. Apart from the above, the critic network Q(s, a; ω_c) is conducted to transfer gradients in learning, and is not involved in action generation.
In particular, the critic network evaluates the current control action based on the performance index (23), while the parameters P^(t), Θ^(t), and F^(t) of U in (23) are obtained by the actor network. Apart from these networks, the DDPG-based algorithm also includes an experience replay mechanism like its DQN counterpart. That is, when the experience buffer is full, the current transition (s^(t), a^(t), r^(t), s^(t+1)) replaces the oldest one in the buffer D, where the reward r^(t) = U^(t), and then the algorithm randomly chooses η stored transitions to form a mini-batch for updating the networks. Given these sampled transitions, the critic network can update its online net by minimizing the loss function represented by

Loss(ω_c) = (1/η) Σ_{i=1}^η (y_i − Q(s_i, a_i; ω_c))^2,

where y_i = r_i + ζ Q(s′_i, a; ω′_c)|_{a=Q_a(s′_i; ω′_a)}. Similarly, the actor network can update its online net with

∇_{ω_a} J(ω_a) ≈ (1/η) Σ_{i=1}^η ∇_a Q(s_i, a; ω_c)|_{a=Q_a(s_i; ω_a)} ∇_{ω_a} Q_a(s_i; ω_a).

Finally, we summarize the single-layer DDPG-based MRA training algorithm in Algorithm 2 for easy reference.

Algorithm 2
The single-layer DDPG-based MRA training algorithm.
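Since the pseudo code itself appears as a figure, two of its core steps, the noisy exploration action and the soft target update, can be sketched framework-free as follows; the arrays below stand in for actor outputs and network weights, and the parameter values are illustrative:

```python
import numpy as np

def explore_action(actor_out, sigma, low, high, decay, rng):
    """Add decayed Gaussian noise to the deterministic actor output and
    clip each part to its feasible interval, as in the exploration step."""
    noisy = actor_out + decay * rng.normal(0.0, sigma, size=actor_out.shape)
    return np.clip(noisy, low, high)

def soft_update(target_w, online_w, tau):
    """Soft target update: w_target <- tau * w_online + (1 - tau) * w_target."""
    return tau * online_w + (1.0 - tau) * target_w
```

With a small tau (e.g., 0.001), the target nets trail the online nets slowly, which stabilizes the bootstrapped targets used in the critic loss.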

Two-Layer Hybrid Approach Based on Game Theory and Deep Reinforcement Learning
As exhibited above, DDPG can be used for continuous action spaces as well as high-dimensional state spaces, which overcomes the difficulty of DQN, which applies only to discrete action spaces. However, the MRA problem includes both discrete and continuous variables, which requires DDPG to quantize its continuous outputs for the discrete variables involved, as shown in (33). In addition, as a data-driven approach, deep reinforcement learning does not explicitly benefit from an analytic model specific to the problem. To take advantage of both data-driven and model-driven approaches, we propose in the following a novel approach that consists of two layers, where the lower layer is responsible for the continuous power allocation (PA) and energy harvest splitting (EHS) by using a game-theory-based iterative method, and the upper layer resolves the discrete beam selection problem (BSP) by using a DQN algorithm. That is, if f_i, ∀i, can be given, PA and EHS on P_i and θ_i for each link i can be decomposed from the objective. Then, we can simplify the MRA problem by reducing (P1) to a BSP sub-problem and a PA/EHS sub-problem. Specifically, the latter (PA/EHS) is given by

(P2):  max_{P_i, θ_i}  U_i(P, θ_i, F)  s.t. (7b), (7c), ∀i.

Clearly, if the BSP sub-problem can be solved, the major challenge of this approach would be the PA/EHS sub-problem shown in (P2). Here, even represented in a simpler form, (P2) is still a non-convex problem whose solution for link i depends on the other links j ≠ i. That is, apart from EHS, the PA difficulty remains in (P2): a larger P_i would increase the SINR of link i while reducing those of the other links j ≠ i in (6), increase the energy harvesting in (22), or both, at the cost ν_i for P_i in the objective function.

Game Model
To overcome this difficulty, we convert (P2) into a non-cooperative game among the multiple links, which can be regarded as self-interested players, and finding its Nash equilibrium (NE) is the fundamental issue to be considered in this game model. On the one hand, a link i can be seen as a non-cooperative player that chooses its own P_i and θ_i to make a trade-off: a larger P_i leads to a higher SINR value in (6) for data rate, a higher value in (22) for energy harvesting, or both, at the cost of higher power consumption, and vice versa. On the other hand, the utility given in (19) can be considered to reduce the power consumption for green communication while maintaining a desired trade-off between the data rate and the energy harvesting. The game-based pricing strategy is thus designed so that the BS can require each of its links to pay a certain price for the power consumed by its transmission. Here, λ_i can be interpreted as the willingness of player i to pay for the data rate, and μ_i as its willingness to pay for the energy harvesting. Given that, each link or player i determines its P_i and θ_i based on the price ν_i to maximize its own utility; in this maximization, λ_i, μ_i, and ν_i are predetermined values for player i and unknown to the others j ≠ i, forming the basis of the non-cooperative game.
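To make the pricing interpretation concrete, a minimal sketch of a per-link utility in the spirit of (19) is given below; the linear form λ_i·r_i + μ_i·E_i − ν_i·P_i is our simplifying assumption, since the exact expression of (19) is not reproduced in this section.

```python
def link_utility(rate_i, harvest_i, power_i, lam, mu, nu):
    """Hypothetical per-link utility: willingness-weighted data rate and
    harvested energy, minus the price nu charged per unit of transmit power."""
    return lam * rate_i + mu * harvest_i - nu * power_i
```

Raising ν_i penalizes power consumption more heavily and steers player i toward a smaller P_i, which is precisely the lever the pricing strategy exploits later in this work.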

Existence of Nash Equilibrium
To ensure the outcome of the non-cooperative game to be effective, we next show that this game has at least one Nash equilibrium. As noted in [66], a Nash equilibrium point represents a situation wherein every player is unilaterally optimal and no player can increase its utility by unilaterally changing its own strategy. Furthermore, according to the fundamentals of game theory [66], a non-cooperative game admits at least one Nash equilibrium point if (1) the strategy space is a nonempty, compact, and convex set, and (2) the utility function is continuous and quasi-concave with respect to the strategy space. In (P2), the utility function U_i can be verified to satisfy these conditions. Specifically, for the first condition, we note that the transmit power is bounded by P_min and P_max, i.e., P_i ∈ [P_min, P_max], and the power splitting ratio θ_i is a real number between 0 and 1. Let S_i be the set of all strategies of link i, i.e., its strategy space. Then, the strategy space for each link i in the proposed game model can be represented by S_i = {(P_i, θ_i) ∈ R² | P_min ≤ P_i ≤ P_max, 0 ≤ θ_i ≤ 1}, which is a compact (closed and bounded) convex set, as required.
For the second condition, we derive the partial derivative of the utility function with respect to the power P_i in (39), where R_i is the total received power at link i, accommodating the effect of the SIC involved. Similarly, we obtain the partial derivative of the utility with respect to θ_i in (41). Furthermore, from (39) and (41), the second derivatives of the utility function with respect to P_i and θ_i, respectively, can be obtained. It is easy to see that both are less than or equal to 0, implying that the utility function is concave. In addition, U_i is continuous in P_i and θ_i. Consequently, the utility functions U_i, ∀i, all satisfy the required conditions for the existence of at least one Nash equilibrium.

Power Allocation and Energy Harvest Splitting in the Lower Layer
Based on the non-cooperative game model introduced, the associated BS is responsible for deciding the transmit power P_i and the power splitting ratio θ_i for link i, given the channel state information h_{i,j} and the weights λ_i and μ_i, which can be done by finding the Nash equilibrium. To see this, note that, as the utility functions U_i, ∀i, are concave with respect to (P_i, θ_i), this decision can be made by solving the system of equations (43), i.e., ∂U_i/∂P_i = 0 and ∂U_i/∂θ_i = 0 for i = 1, …, n, where n denotes the number of links in the network. To solve this system, we propose an iterative algorithm based on the game model, and through the fixed-point iteration process, the system of Equation (43) can be solved numerically. Here, by taking the derivative with respect to P_i (resp. θ_i) and setting the result equal to 0, we can transform the system into a fixed-point form for each link i that facilitates convergence, where R̂_i is an auxiliary variable derived from R_i. To show the iterative process more clearly, we denote the transmit power, the total received power, the auxiliary variable, and the power splitting ratio for link i at the k-th iteration by P_i[k], R_i[k], R̂_i[k], and θ_i[k], respectively. Given that, the iterations on P_i and θ_i can be expressed by the relationships between iterations k and k − 1, with their results bounded by the corresponding maximum and minimum values at each step.
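The clamped fixed-point iteration described above can be sketched as follows. The `update_P` and `update_theta` callables are placeholders for the closed-form maps obtained from the stationarity conditions (not reproduced here), and the default tolerance mirrors the 10^−5 convergence threshold used later in the experiments.

```python
import numpy as np

def fixed_point_pa_ehs(update_P, update_theta, P0, theta0,
                       P_min, P_max, tol=1e-5, max_iter=1000):
    """Clamped fixed-point iteration for the PA/EHS sub-problem (P2).
    At each step, the new powers are clipped to [P_min, P_max] and the new
    splitting ratios to [0, 1], matching the strategy space S_i."""
    P, theta = np.asarray(P0, float), np.asarray(theta0, float)
    for _ in range(max_iter):
        P_new = np.clip(update_P(P, theta), P_min, P_max)
        th_new = np.clip(update_theta(P_new, theta), 0.0, 1.0)
        # stop once both sequences move less than the convergence threshold
        if max(np.abs(P_new - P).max(), np.abs(th_new - theta).max()) < tol:
            return P_new, th_new
        P, theta = P_new, th_new
    return P, theta
```

Because the utilities are concave over a compact convex strategy space, a convergent run of this iteration lands on a point satisfying the stationarity system (43).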

Beam Selection in the Upper Layer and the Overall Algorithm
With the transmit powers and energy splitting ratios obtained from the lower layer at a low cost, the two-layer hybrid approach resolves the remaining beam selection problem with a DQN-based algorithm in the upper layer, which reduces the computational overhead compared with the DQN approach in Section 3.1 and the DDPG-based approach in Section 3.2. In addition, unlike the previous approaches that consider either a purely discrete or a purely continuous action space, the two-layer approach obtains the variables in their own domains, without approximating the hybrid space by discretization or relaxing it into a continuous set. As a result, the two-layer approach can achieve higher utilities than the others, as exemplified in the experiments.
Specifically, we propose to use a DQN-based algorithm in the upper layer to resolve the beam selection problem in its own discrete action space. Compared with the algorithm in Section 3.1, it considers locations L and beamforming vectors F only, leading to a reduced DQN model whose state at time t is represented by s^(t) = (L^(t), F^(t)), and whose action a^(t) is selected from Â (here including only Â_F), modified to also take into account the case of no change. That is, each Â_{f_i}^(t) ∈ a^(t) selected from Â_F can now be any value in {−1, 0, +1} instead of ±1, in which 0 implies no change to the previous beam selection. Integrating this modification with the lower layer yields the two-layer hybrid MRA training algorithm shown in Algorithm 3, along with its flowchart in Figure 2. Similar to Algorithms 1 and 2, the training algorithm takes as input the parameters for the utility, the hyperparameters for the learning algorithm, and the parameters for the game-based method, while producing as output a learned DQN model that can then decide P_i, θ_i, and f_i, ∀i, online for the optimization problem in (7). Apart from the input and output, its main steps are summarized as follows:

Algorithm 3
The two-layer hybrid MRA training algorithm.

• Observe state s^(t) at time t for beam selection.
• Select an optimal action a^(t) at time step t.
• Given the selected beamforming vectors F^(t), obtain the transmit powers P^(t) and splitting ratios Θ^(t) through the game-theory-based iterative method in the lower layer.
• Assess the impact on data rate r_i, energy harvesting E_i, and transmit power P_i, for all links i.
• Reward the action at time t as U_i(P^(t), θ_i^(t), F^(t)), ∀i, based on the impact assessed.
• Train the DQN with the system utility U^(t) obtained.
After the training or learning period, say T, the trained DQN from Algorithm 3 is used to observe the subsequent states s^(t) = (L^(t), F^(t)), t > T, evaluate the utility U_i with the given parameters λ_i, μ_i, and ν_i, and then take an action a^(t) to decide P_i, θ_i, and f_i, ∀i, for the system in the testing process.
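The steps above can be sketched as one training episode of the hybrid scheme. The environment and DQN interfaces (`env.reset`, `env.apply_beam_action`, `env.step`, `dqn.best_action`, `dqn.train`) are our placeholders, not the paper's exact Algorithm 3.

```python
import random

def two_layer_episode(env, dqn, lower_layer_solver, n_links, epsilon, n_steps):
    """One training episode of the two-layer hybrid approach (sketch).
    Upper layer: epsilon-greedy per-link beam moves in {-1, 0, +1}.
    Lower layer: game-theoretic PA/EHS solved for the chosen beams."""
    s = env.reset()
    total_utility = 0.0
    for _ in range(n_steps):
        if random.random() < epsilon:                      # explore
            a = [random.choice((-1, 0, 1)) for _ in range(n_links)]
        else:                                              # exploit learned Q
            a = dqn.best_action(s)
        F = env.apply_beam_action(a)                       # discrete beam update
        P, Theta = lower_layer_solver(F)                   # continuous PA/EHS
        r, s_next = env.step(P, Theta, F)                  # reward U^(t), next state
        dqn.train(s, a, r, s_next)                         # upper-layer DQN update
        total_utility += r
        s = s_next
    return total_utility
```

In the testing phase, the same loop would run with exploration disabled (epsilon = 0) and without the `dqn.train` call.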

Time Complexity
Next, we show the time complexity of each of these algorithms before revealing their performance differences in the next section. Specifically, let the number of episodes be M and the number of time steps per episode be N. Assuming that the Q-learning network in Algorithm 1 has J fully connected layers, the time complexity in terms of the number of (floating point) operations in this algorithm is O(∑_{j=0}^{J−1} u_j u_{j+1}) based on the analysis in [32], where u_j denotes the number of units in the j-th layer and u_0 is the input state size. In each time step of an episode, there may be other operations, such as the random selection of an action in line 10, that do not involve the neural network and can be ignored in the analysis when compared with the former. Thus, taking the nested loops (the outer episode loop and the inner time-step loop) into account, the worst-case time complexity is O(MN ∑_{j=0}^{J−1} u_j u_{j+1}). Apart from training, DDPG also involves a normalization process whose time complexity can be denoted by O(T(s)), where T(s) is the number of variables in the state set. In addition, the actor and critic networks of DDPG in Algorithm 2 are assumed to have J and K fully connected layers, respectively. According to [32], the time complexity with respect to these networks in the training algorithm is O(∑_{j=0}^{J−1} u_{actor,j} u_{actor,j+1} + ∑_{k=0}^{K−1} u_{critic,k} u_{critic,k+1}), where u_{actor,j} and u_{critic,k} denote the numbers of units in the j-th layer of the actor network and the k-th layer of the critic network, respectively. Then, by taking the nested loops into account as well, the overall time complexity of this algorithm is O(MN(∑_{j=0}^{J−1} u_{actor,j} u_{actor,j+1} + ∑_{k=0}^{K−1} u_{critic,k} u_{critic,k+1})). Finally, let the number of links be n and the number of iterations per link be K, in addition to the M and N given previously.
As the two-layer hybrid training algorithm involves the lower-layer game-theory-based iterations, the overall time complexity of Algorithm 3 is O(MNnK ∑_{j=0}^{J−1} u_{Q,j} u_{Q,j+1}), where u_{Q,j} denotes the number of units in the j-th layer of the DQN neural network in this algorithm. Note that, although there are additional nK iterations for the lower layer, the input state size u_{Q,0} corresponds to (L^(0), F^(0)), which can be much smaller than u_0 corresponding to (L^(0), P^(0), Θ^(0), F^(0)) in the single-layer Algorithm 1 with DQN, while u_{Q,j} = u_j, 1 ≤ j ≤ J, is considered. In addition, Algorithm 3 requires no normalization process, and the computational overhead of its single neural network is lower than the O(∑_{j=0}^{J−1} u_{actor,j} u_{actor,j+1} + ∑_{k=0}^{K−1} u_{critic,k} u_{critic,k+1}) overhead of the two different neural networks in Algorithm 2.
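The dominant operation counts above can be checked with a small helper; the layer widths used in the usage note below (10 inputs, two hidden layers of 64 units, 9 outputs) are illustrative values, not the exact architecture of the paper.

```python
def mlp_ops(units):
    """Multiply-accumulate operations for one forward pass through a fully
    connected network with layer widths units[0..J], i.e., sum_j u_j * u_{j+1}."""
    return sum(units[j] * units[j + 1] for j in range(len(units) - 1))

def dqn_training_ops(M, N, units):
    """Worst-case operation count O(M * N * sum_j u_j u_{j+1}) for the
    nested episode/time-step loops of a DQN-style training algorithm."""
    return M * N * mlp_ops(units)
```

For example, `mlp_ops([10, 64, 64, 9])` evaluates to 640 + 4096 + 576 = 5312 operations per forward pass, which the episode and time-step loops then multiply by M·N.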

Numerical Experiments
In this section, we conduct simulation experiments to evaluate the proposed two-layer approach and compare it with the single-layer approaches also introduced. To this end, we first present the simulation setup adopted and the parameters involved. Then, we show the performance differences between the two-layer hybrid MRA algorithm based on game theory and deep reinforcement learning, and the single-layer counterparts based on the conventional deep reinforcement learning models (DQN and DDPG).

Simulation Setup
With the network model and the channel model introduced in Section 2, we deploy the MNs uniformly in the simulated cellular network and let them move at an average speed of v = 2 km/h, subject to log-normal shadow fading as well as small-scale fading. In this environment, the cell radius is set to r and the distance between sites or BSs is set to 1.5r, in which the MNs experience a line-of-sight probability P_los on the signals from the BSs. For easy reference, the important parameters for the radio environment, including those not shown above, are summarized in Table 1. Apart from the radio parameters, the convergence threshold is set to 10^−5 for the two-layer algorithm, and the hyperparameters for the deep reinforcement learning models are tabulated in Table 2. For example, in the DQN for the single-layer approach, the state s^(t) at time t is denoted by (L^(t), P^(t), Θ^(t), F^(t)), which corresponds to the state size of 10 listed in this table. In addition, as introduced in Section 3.1, a ±1 dB offset representation is considered for PSF1, and the number of power levels is set to 9 for PSF2, to construct their power sets P_1 and P_2, respectively. Furthermore, a ±0.05 offset representation and a set of 11 values, {0, 0.1, · · · , 1} with a step size of 0.1, are adopted as the power splitting ratio sets Θ for PSF1 and PSF2, respectively. Nevertheless, the action size is 64 according to the binary decisions defined in (18), for both PSF1 and PSF2 in DQN. Apart from the above, for the two-layer approach, the DQN in the upper layer considers only the beamforming vectors F in addition to the locations L, which reduces the state size to 6. Moreover, as it considers {−1, 0, +1} instead of ±1 for the actions, the action size becomes 9. Despite these differences, the other hyperparameters of the DQN are the same for both the single- and two-layer approaches.
Finally, the hyperparameters for DDPG are chosen to reflect its average performance with a reasonable execution time, and a codebook F with 4, 8, 16, and 32 elements or vectors, corresponding to the different numbers of antennas in the radio environment, is considered for all the algorithms involved. Given that, we conduct 50 experiments with different seeds for all the algorithms under comparison. For each experiment, there are 400 training episodes (epochs) in total. At the beginning of each episode, the MNs are randomly located in the simulated network and then move at speed v over 500 time slots per episode. Afterward, with the trained (P, Θ, F) from these algorithms, we conduct another 100 episodes, again with the MNs randomly located at the beginning, to obtain the averaged utility, data rate, energy harvesting, and power consumption, so as to validate the parameters obtained with the different algorithms. Specifically, the 100 testing episodes of each experiment produce one mean value, and each averaged metric shown in the following figures denotes the average of these mean values over the 50 experiments. Note that, since DDPG is trained with normalized variables as shown in (32), in the testing process, we also have to preprocess these inputs accordingly.
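The averaging procedure just described (one mean per experiment over its testing episodes, then the mean of those means over the experiments) can be sketched as follows; `run_fn` is a placeholder returning the metric of one testing episode.

```python
import statistics

def averaged_metric(run_fn, n_experiments=50, n_test_episodes=100):
    """Average a metric as in the setup: each experiment's testing episodes
    yield one mean value, and the reported figure is the mean of those means.
    run_fn(seed, episode) returns the metric for one testing episode."""
    means = []
    for seed in range(n_experiments):
        vals = [run_fn(seed, ep) for ep in range(n_test_episodes)]
        means.append(statistics.fmean(vals))  # per-experiment mean
    return statistics.fmean(means)            # mean of the per-experiment means
```

This two-stage averaging gives each experiment equal weight regardless of any per-episode variance within it.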

Performance Comparison
Given the environment, we compare the proposed two-layer MRA algorithm, aided by game theory, with the single-layer MRA algorithms based solely on DQN and DDPG introduced earlier. To see their performance differences, we conduct two sets of experiments from different aspects: the first focuses on the number of antennas M, and the second on the power cost ν_i. In Figures 3-5, the legends "two-layer", "single-layer with DDPG", "single-layer with DQN of PSF1", and "single-layer with DQN of PSF2" represent the two-layer MRA algorithm, the single-layer DDPG-based MRA algorithm, and the single-layer DQN-based MRA algorithms with PSF1 and PSF2 introduced in this work, respectively.

Impacts of Antennas
In the first experiment set, four numbers of transmit antennas at the BS, M ∈ {4, 8, 16, 32}, are examined while fixing λ_i = 10, μ_i = 1, and ν_i = 1, ∀i. As the trends are similar, in Figure 3 we exemplify the utilities obtained during the training periods in two experiment instances with the highest and lowest numbers of transmit antennas, 32 and 4, respectively. It can easily be seen from the two sub-figures that the utility resulting from the two-layer MRA algorithm is, on average, higher than those from the single-layer counterparts during the training periods, regardless of the number of antennas. In addition, it can also be observed that, with its continuous action space, DDPG generally outperforms DQN, regardless of the power state formulation (PSF1 or PSF2) of the latter. Finally, we can see that, with the ±1 dB offset representation, PSF1 of DQN results in a greater number of states on the transmit power than PSF2, which is equipped with a limited number of quantized levels, and this eventually leads to a better long-term performance in terms of the utility.
Next, we show the performance differences among the averaged metrics on utility, data rate, energy harvesting, and power consumption obtained by the testing process on the (P, Θ, F) resulting from these algorithms. As shown in Figure 4, the two-layer MRA algorithm outperforms the single-layer counterparts on all the performance metrics except energy harvesting, regardless of the number of antennas M. In particular, in terms of the averaged utilities over all values of M, the two-layer MRA algorithm achieves up to 2.3 times higher value than the single-layer DQN with the PSF2 algorithm. Apart from the utility, as the resulting energy harvesting has relatively small values with limited impact on the overall utility, a lower (resp. higher) value of this metric, represented in the log scale, is still possible, and its impact is compensated by a higher (resp. lower) value of power consumption, data rate, or both, which eventually leads the overall utility to increase as M increases. For example, the highest utilities obtained by the two-layer MRA algorithm (Figure 4a) are mainly contributed by the highest data rates (Figure 4b) and the lowest power consumption (Figure 4d), all resulting from the two-layer algorithm, even though the energy harvesting of this algorithm fluctuates slightly as M increases and is lower than that of the single-layer counterparts (Figure 4c). In addition, as no previous works consider exactly the same system formulations and metrics presented here, it is hard to directly compare this work with others such as [27,51], which consider only P, F, or both, for their data transmissions without the capability of energy harvesting. However, even without this capability, we can still apply the DRL algorithm in [51] with only P to see the possible performance differences between ours and the conventional approaches.
Specifically, with M = 32, the comparison results are summarized in Table 3. As readily shown, without the power split for energy harvesting, the DRL algorithm obtains the highest data rate as an upper bound here, as expected. In comparison, the two-layer algorithm achieves almost the same data rate while harvesting energy with the lowest power consumption. Similarly, the single-layer algorithms can enjoy the energy harvesting with similar power consumption, but they may obtain lower data rates when splitting their powers to harvest energy and send data simultaneously.

Impacts of Pricing Strategy
From the utility function defined in (19), we can see that the unit power cost ν_i plays a crucial role in the non-cooperative game model and has a strong impact on the performance of the joint optimization and the Nash equilibrium. Thus, in the final set of experiments, we propose a simple pricing strategy for the base station to determine ν_i on the basis of social utility maximization and to control the transmit power of each link so that its value lies within the feasible range [P_min, P_max], allowing the high performance of this algorithm to be realized through the social utility maximization. Given that, the desired power cost ν_{P_i^d} can be obtained by (50). Accordingly, the two-layer hybrid MRA algorithm is slightly modified to dynamically adjust ν_i instead of using a fixed ν_i, ∀i, as an input of the algorithm. To be more specific, the sketch of this modification is given in Algorithm 4, wherein the three modified statements showing the calculations of (50), (47), and (48), respectively, are highlighted in bold italic font, in addition to the fact that the input no longer includes ν_i. For comparison, the pricing strategy is also applied to Algorithms 1 and 2 by replacing the input ν_i with the ν_{P_i^d} dynamically adjusted by (50), after observing the next state s′ in the corresponding steps of these algorithms.
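Since the closed form of (50) is not reproduced here, the idea of choosing ν to steer a link's best-response power toward a desired P^d can be illustrated numerically. We assume the best-response power is continuous and non-increasing in the price ν; the `best_response_P` callable is a hypothetical stand-in for the game's reaction function, not the paper's formula.

```python
def price_for_desired_power(best_response_P, P_d, nu_lo=1e-6, nu_hi=1e3, tol=1e-6):
    """Bisection on the unit power cost nu so that the link's best-response
    transmit power hits the desired value P_d (assumes best_response_P(nu)
    is continuous and non-increasing in nu)."""
    lo, hi = nu_lo, nu_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if best_response_P(mid) > P_d:
            lo = mid  # price too low -> power too high -> raise the price
        else:
            hi = mid  # price high enough -> power at or below target
    return 0.5 * (lo + hi)
```

For a toy reaction function P(ν) = 10/ν, the price driving the power to P^d = 2 W is ν = 5, matching the intuition that a higher desired power calls for a lower price.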

Algorithm 4
The two-layer hybrid MRA training algorithm with the pricing strategy.
(Input) λ_i, μ_i, ∀i, · · · ;
· · ·
for episode = 1 to M do
	for time t = 1 to N do
		· · · Observe next state s′; update ν_i by (50); · · ·

In this experiment set, we vary the desired power P_i^d (and hence ν_{P_i^d} obtained by (50)) while fixing λ = 10, μ = 1, and M = 32, and conduct these algorithms to output the averaged performance metrics to be compared. The results are summarized in Figure 5, showing that the two-layer algorithm outperforms the others in terms of the utility. In particular, although it may have lower data rates under ν_{P=1} (denoting the ν_{P_i^d} obtained with P_i^d = 1 W) and higher power consumption under ν_{P>10} (denoting the ν_{P_i^d} with P_i^d > 10 W), the increasing trend of the resulting metrics still leads to a utility higher than the others, and the resulting utility increases as ν_{P_i^d} increases. Similarly, as the energy harvesting has relatively small values with limited impact on the system, as noted before, its small fluctuations across the different algorithms do not alter the increasing trend of the utility in this final experiment set either.

Conclusions
In this work, we sought to maximize a utility that makes an optimal trade-off between data rate and energy harvesting while balancing the cost of power consumption in multi-access wireless networks with multi-antenna base stations. Given the capability of selecting beamforming vectors from a finite set, adjusting transmit powers, and deciding power splitting ratios for energy harvesting, wireless networks developed toward the future generation (beyond 5G, or B5G) are expected to achieve extreme performance requirements that could otherwise be satisfied only by an optimal solution found through an exhaustive search.
To meet this expectation, we have shown in this work how to design DRL-based approaches operating in a single layer to jointly solve power control, beamforming selection, and power splitting decisions, approaching the optimal trade-off among the performance metrics without an exhaustive search in the resulting action space. Furthermore, we have shown how to incorporate a data-driven DRL-based technique and a model-driven game-theory-based algorithm into a two-layer iterative approach to resolve the NP-hard MRA problem in these wireless networks. Specifically, we have shown that, by taking advantage of both data-driven and model-driven methods, the proposed two-layer MRA algorithm can outperform the single-layer counterparts that rely only on the data-driven DRL-based algorithms. Here, the single-layer algorithms represent the conventional DRL methods extended to have the energy harvesting capability. As readily shown in the experiments, the conventional DRL method and the single-layer algorithms do not provide a good performance trade-off on the metrics considered; that is, the overall utilities reflecting the trade-off from the single-layer algorithms are lower than that from the two-layer approach. In contrast, through the collaboration between DRL and game theory, the two-layer approach achieves a better trade-off between data rate and energy harvesting while balancing the cost of power consumption, as reflected in the higher utilities obtained.
Specifically, in the simulation experiments, we have exemplified the performance differences of these algorithms in terms of data rate, energy harvesting, and power consumption, verified the feasibility of the three parameters in the utility function, and examined the proposed pricing strategy, which can dynamically adjust the transmit power of each link so that its value lies within the feasible range, allowing the high performance of the two-layer algorithm to be obtained through the social utility maximization.
From the viewpoint of social utility maximization, our pricing strategy has been shown to give the system the leverage to select beamforming vectors, transmit powers, and power splitting ratios by properly adjusting the power costs. Finally, inspired by the related works on multi-agent DRL, as future work we aim to develop further collaboration schemes that can reduce the overhead caused by the different optimization methods, even under the non-stationary environment brought by a multi-agent setting.

Conflicts of Interest:
The authors declare no conflict of interest.

Table A1. Important symbols for the proposed approaches in this work.

P, Θ, F: sets of transmit powers, splitting ratios, and beamforming vectors, respectively
P_i, θ_i, f_i: transmit power, splitting ratio, and beamforming vector for link i, respectively
L: set of locations
L̂: loss function
S: a finite set of states, {s_1, s_2, · · · , s_m}
s^(t): state at time t, denoted by (L^(t), P^(t), Θ^(t), F^(t))
A: a finite set of actions, {a_1, a_2, · · · , a_n}
Â: a set of binary variables, where Â_P, Â_Θ, and Â_F correspond to those for P, Θ, and F, respectively
R: a finite set of rewards, where R(s, a, s′) is the function providing reward r at state s ∈ S, action a ∈ A, and next state s′
P: a finite set of transition probabilities, where P_{ss′}(a) = p(s′|s, a) is the probability of migrating from state s to state s′ when taking action a
π*(s): optimal policy at state s
V^π(s): value function for the expected value obtained by policy π from state s ∈ S
V*(s): optimal value function at state s
Q^π(s, a): action-value function representing the expected reward starting from state s and taking action a under policy π
Q^{π*}: optimal policy for the (optimal) action-value function Q*(s, a) = max_π Q^π(s, a)
Q(s_t, a_t): action-value (Q) function at time t
r_i^(t): reward for link i at time t, including data rate r_i(P^(t), θ_i^(t), F^(t)) and energy harvest E_i(P^(t), θ_i^(t), F^(t))
(s^(t), a^(t), r^(t), s′): transition at time t, where r^(t) = U^(t), the system utility at this time step
α, α_a, α_c: learning rate, and the learning rates specific to the actor and critic networks, respectively
ε, ε_min: exploration rate (probability) and its minimum requirement