Joint Beamforming, Power Allocation, and Splitting Control for SWIPT-Enabled IoT Networks with Deep Reinforcement Learning and Game Theory
Abstract
1. Introduction
1.1. Related Work
1.2. The Motivations and Characteristics of This Work
 We introduce two single-layer algorithms based on the conventional DRL models, DQN and DDPG, to solve the joint optimization problem, formulated here as a non-convex MINLP problem and realized as an MRA problem subject to the different allocation constraints.
 We further propose a two-layer iterative approach that combines the capability of the data-driven DQN technique with the strength of a non-cooperative game-theoretic model to resolve the NP-hard MRA problem.
 For the two-layer approach, we also introduce a pricing strategy that determines the power costs based on social utility maximization to control the transmit power.
 With a simulated environment based on realistic wireless networks, we show that, by combining learning-based and model-based methods, the proposed two-layer MRA algorithm can outperform the introduced single-layer counterparts, which rely only on data-driven DRL models.
2. Network and Channel Models
2.1. Network Model
2.2. Channel Model
2.3. Problem Formulation
3. Single-Layer Learning-Based Approaches
3.1. Q-Learning Approach
 (1) State: First, if there are n links in the network, the state at time t is represented in the sequel by using capital notation for its components and a superscript such as "$\left(t\right)$" for the time index, as follows:$${s}^{\left(t\right)}=\left\{{\mathbf{L}}^{\left(t\right)},{\mathbf{P}}^{\left(t\right)},{\Theta}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)}\right\}$$Among these variables, the transmit power is often the only parameter considered in previous works [27,62]. In the complex MRA problem that also involves other types of resources, it remains a major factor affecting the system performance, since the SINR in (5) is significantly impacted by the power; we therefore consider two different state formulations for ${\mathbf{P}}^{\left(t\right)}$ as follows.
 Power state formulation 1 (PSF1): First, to align with the industry standard [33], which chooses integers for power increments, we consider a $\pm 1$ dB offset representation similar to that shown in [51] as the first formulation for the power state. Specifically, given an initial value ${P}_{i}^{0}$, the transmit power ${P}_{i},\forall i$ (regardless of t), will be chosen from the set$$\mathcal{P}_{i}^{1}\triangleq\left\{{10}^{-0.1\cdot {K}_{min}}{P}_{i}^{0},\cdots ,{10}^{-0.1}{P}_{i}^{0},{P}_{i}^{0},{10}^{0.1}{P}_{i}^{0},\cdots ,{10}^{0.1\cdot {K}_{max}}{P}_{i}^{0}\right\}$$
 Power state formulation 2 (PSF2): Next, as shown in [27], the performance of a power-controllable network can be improved by quantizing the transmit power with a logarithmic step size instead of a linear step size. Given that, the transmit power ${P}_{i},\forall i$ could be selected from the set$$\mathcal{P}^{2}\triangleq\left\{\left.{P}_{min}{\left(\frac{{P}_{min}}{{P}_{max}}\right)}^{-\frac{j}{|\mathcal{P}^{2}|-2}}\,\right|\,j=0,\cdots ,|\mathcal{P}^{2}|-2\right\}$$
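As a concrete illustration, the two power state formulations can be sketched in a few lines of Python. This is a minimal sketch using the simulation values $P_{min}=1$ W and $P_{max}=40$ W from Section 5; the function names are ours, not the authors'.

```python
import numpy as np

def psf1_levels(p0, k_min, k_max):
    """PSF1: +/-1 dB increments around the initial power p0 (in watts),
    spanning offsets from -k_min dB to +k_max dB."""
    return [p0 * 10 ** (0.1 * k) for k in range(-k_min, k_max + 1)]

def psf2_levels(p_min, p_max, set_size):
    """PSF2: logarithmically spaced powers indexed j = 0..|P^2|-2,
    running from p_min (j = 0) up to p_max (j = |P^2|-2)."""
    j = np.arange(set_size - 1)
    return p_min * (p_min / p_max) ** (-j / (set_size - 2))
```

For example, `psf2_levels(1.0, 40.0, 10)` yields nine levels from 1 W to 40 W with a constant ratio between neighboring levels, matching the logarithmic step size of [27].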
Apart from the above, the other parameters, such as ${\theta}_{i},\forall i$, can be chosen from the splitting ratio set $\Theta $ with a linear step size, and ${f}_{i},\forall i$ can be selected from the predefined codebook $\mathcal{F}$ with $|\mathcal{F}|$ finite vectors or elements.
 (2) Action: The action of this process at time t, ${a}^{\left(t\right)}$, is selected from a set of binary decisions on the variables$$\widehat{\mathbf{A}}=\left\{{\widehat{\mathbf{A}}}_{P},{\widehat{\mathbf{A}}}_{\Theta},{\widehat{\mathbf{A}}}_{F}\right\}$$Note that, as the number of values of a variable is limited, when the maximum or minimum value is reached with a binary action chosen from $\widehat{\mathbf{A}}$, a modulo operation decides the index for the next quantized value in the state space. For example, in PSF2, if ${P}_{i}^{\left(t\right)}={P}_{min}{\left(\frac{{P}_{min}}{{P}_{max}}\right)}^{-\frac{j}{|\mathcal{P}^{2}|-2}}$ with $j=0$, and $j+{\widehat{A}}_{{p}_{i}}^{\left(t\right)}<0$, then the modulo operation will lead to ${P}_{i}^{(t+1)}={P}_{min}{\left(\frac{{P}_{min}}{{P}_{max}}\right)}^{-\frac{{j}^{\prime}}{|\mathcal{P}^{2}|-2}}$ with ${j}^{\prime}=|\mathcal{P}^{2}|-2$ in ${\mathcal{P}}^{2}$. As another example, with ${f}_{min}=1$ and ${f}_{max}=|\mathcal{F}|$ denoting the first and last vectors in the codebook $\mathcal{F}$, respectively, the action of increasing or decreasing ${f}_{min}\le {f}_{i}^{\left(t\right)}\le {f}_{max}$ by 1 will choose the previous or next vector of ${f}_{i}^{\left(t\right)}$ in $\mathcal{F}$ as ${f}_{i}^{(t+1)}$, and a similar modulo operation will also be applied to keep ${f}_{i}^{(t+1)}$ within $[{f}_{min}$, ${f}_{max}]$.
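The modulo wrap-around described above amounts to indexing the quantized values cyclically; a small sketch (our helper, not the paper's code):

```python
def step_index(j, delta, n):
    """Apply a binary +/-1 action delta to a quantized-value index j
    and wrap with a modulo so the result stays within the n valid
    indices 0..n-1."""
    return (j + delta) % n
```

With the $|\mathcal{P}^2|-1$ nonzero PSF2 levels indexed $0,\cdots,|\mathcal{P}^2|-2$, decreasing from $j=0$ wraps to $j'=|\mathcal{P}^2|-2$, exactly as in the example above.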
 (3) Reward: To reduce the power consumption for green communication while maintaining the desired tradeoff between the data rate and the energy harvesting, we introduce a reward function that represents a tradeoff among the three metrics, properly normalized for link i with parameters ${\lambda}_{i}$, ${\mu}_{i}$, and ${\nu}_{i}$, at time t, as$${U}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})={\lambda}_{i}{r}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})+{\mu}_{i}{E}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})-{\nu}_{i}{P}_{i}^{\left(t\right)}$$where$${r}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})=\log(1+{\gamma}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)}))$$In addition, ${E}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})$ is the energy harvested at the MN of link i at time t, represented in the log scale as$${E}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})=\log\left({e}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})\right)$$with$${e}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})={\theta}_{i}\delta \left({P}_{i}^{\left(t\right)}{\left|{h}_{i,i}^{\left(t\right)}{f}_{i}^{\left(t\right)}\right|}^{2}+\sum _{j\ne i}{P}_{j}^{\left(t\right)}{\left|{h}_{j,i}^{\left(t\right)}{f}_{j}^{\left(t\right)}\right|}^{2}+{\sigma}_{n}^{2}\right)$$In the above, $\delta $ is the power conversion efficiency, and ${\nu}_{i}$ is the price or cost of the power consumption ${P}_{i}^{\left(t\right)}$ to be paid for link i's transmission.
Note that the log representation is adopted here to accommodate a normalization process in deep learning similar to the batch normalization in [63]. Otherwise, the data rate ${r}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})$, obtained with a log operation, and the raw energy harvest ${e}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})$, without the log operation, would be directly combined in the utility function. In that case, with the metric values lying in very different ranges, such a raw representation could cause problems in the training process. Note also that, although ${\lambda}_{i}$ and ${\mu}_{i}$ could be set to compensate for the scale differences, a very high harvested energy in certain cases could still vary the utility function significantly and impede the learning process. Taking these into account, the system utility at time t can be represented by the sum of the link rewards as$${U}^{\left(t\right)}=U({\mathbf{P}}^{\left(t\right)},{\Theta}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})=\sum _{i}{U}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})$$
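The per-link harvested energy and reward above can be sketched directly; this is a minimal illustration assuming the rate $r_i$ and the effective gains $|h_{j,i}f_j|^2$ are computed elsewhere, and the function names are ours:

```python
import math

def harvested_energy(theta_i, delta, powers, gains_i, sigma2):
    """e_i = theta_i * delta * (sum_j P_j |h_{j,i} f_j|^2 + sigma_n^2):
    splitting ratio times conversion efficiency times the total
    received power (signal, interference, and noise)."""
    return theta_i * delta * (sum(p * g for p, g in zip(powers, gains_i)) + sigma2)

def link_reward(rate, energy_raw, p_i, lam, mu, nu):
    """U_i = lam * r_i + mu * log(e_i) - nu * P_i, with the raw energy
    log-scaled so that rate and energy lie in comparable ranges."""
    return lam * rate + mu * math.log(energy_raw) - nu * p_i
```

The log on the raw energy is exactly the normalization step discussed above: without it, `energy_raw` (often orders of magnitude smaller than the rate) would dominate or vanish in the weighted sum.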
Algorithm 1 The single-layer DQN-based MRA training algorithm. 

3.2. DDPG-Based Approach
 (1) Exploration: As defined, the actor network is responsible for providing solutions to the problem, playing a crucial role in DDPG. However, as it is designed to produce only deterministic actions, additive noise, n, is applied to the output so that the actor network can explore the solution space. That is,$$a\left(s\right)={Q}_{a}(s;{\omega}_{a})+n$$
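In code, the exploration step simply perturbs the deterministic output and clips it to the valid action range. Zero-mean Gaussian noise is our assumption here (the noise process n is not specified above; Ornstein–Uhlenbeck noise is another common choice):

```python
import numpy as np

def explore(actor, state, noise_std, low, high):
    """a(s) = Q_a(s; w_a) + n: add zero-mean noise to the actor's
    deterministic action and clip to the feasible range [low, high]."""
    a = np.asarray(actor(state), dtype=float)
    a = a + np.random.normal(0.0, noise_std, size=a.shape)
    return np.clip(a, low, high)
```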
 (2) Updating the networks: Next, with the notation $(s,a,\mathfrak{r},{s}^{\prime})$ denoting the transition wherein reward $\mathfrak{r}$ is obtained by taking action a at state s to migrate to ${s}^{\prime}$, as in DQN, the update procedures for the critic and actor networks can be summarized as follows.
 As shown in (24), the actor network is updated by maximizing the state–value function. In terms of the parameters ${\omega}_{a}$ and ${\omega}_{c}$, this maximization problem can be rewritten as maximizing $J\left({\omega}_{a}\right)=Q(s,a;{\omega}_{c})\big|_{a={Q}_{a}(s;{\omega}_{a})}$. Here, as the action space is continuous and the state–value function is assumed to be differentiable, the actor parameter ${\omega}_{a}$ is updated with the gradient ascent method. Furthermore, as the gradient depends on the derivative of the objective function with respect to ${\omega}_{a}$, the chain rule can be applied as$$\nabla_{{\omega}_{a}}J\left({\omega}_{a}\right)={\nabla}_{a}Q(s,a;{\omega}_{c})\big|_{a={Q}_{a}(s;{\omega}_{a})}\,\nabla_{{\omega}_{a}}{Q}_{a}(s;{\omega}_{a})$$Then, as the actor network outputs ${Q}_{a}(s;{\omega}_{a})$ as the action adopted by the critic network, the actor parameter ${\omega}_{a}$ can be updated by maximizing the critic network's output with the action obtained from the actor network while fixing the critic parameter ${\omega}_{c}$.
 Apart from the actor network that generates the needed actions, the critic network is also crucial to ensure that the actor network is well trained. To update the critic network, two aspects are considered. First, with ${Q}_{{a}^{\prime}}({s}^{\prime};{\omega}_{{a}^{\prime}})$ from the target actor network as an input of the target critic network, the state–value function yields$$y=\mathfrak{r}+\zeta\,\overline{Q}({s}^{\prime},a;{\omega}_{{c}^{\prime}})\big|_{a={Q}_{{a}^{\prime}}({s}^{\prime};{\omega}_{{a}^{\prime}})}$$Second, the output of the critic network, $Q(s,a;{\omega}_{c})$, can be regarded as another estimate of the state–value function. Based on these aspects, the critic network can be updated by minimizing the following loss function:$$\widehat{L}={\left(y-Q(s,a;{\omega}_{c})\right)}^{2}$$Given that, the critic parameter ${\omega}_{c}$ is obtained by finding the parameter that minimizes this loss function.
 Finally, the target networks of both the critic and the actor can be updated with the soft update parameter, $\tau $, on their parameters ${\omega}_{c}^{\prime}$ and ${\omega}_{a}^{\prime}$, as follows:$${\omega}_{c}^{\prime}=\tau {\omega}_{c}+(1-\tau ){\omega}_{c}^{\prime},\phantom{\rule{10.0pt}{0ex}}{\omega}_{a}^{\prime}=\tau {\omega}_{a}+(1-\tau ){\omega}_{a}^{\prime}$$
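The critic target, loss, and soft updates can be summarized in a few lines; this is a sketch of the standard DDPG update rules described above, not the authors' implementation:

```python
def critic_target(r, zeta, q_bar_next):
    """y = r + zeta * Q_bar(s', a'; w_c'), with a' from the target actor."""
    return r + zeta * q_bar_next

def critic_loss(y, q):
    """L_hat = (y - Q(s, a; w_c))^2, minimized over w_c."""
    return (y - q) ** 2

def soft_update(online, target, tau):
    """w' = tau * w + (1 - tau) * w' for each (online, target) pair,
    applied alike to critic and actor target parameters."""
    return [tau * w + (1.0 - tau) * wp for w, wp in zip(online, target)]
```

With a small $\tau$ (e.g., 0.01 in Section 5), the targets track the online networks slowly, which stabilizes the bootstrapped target y.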
Algorithm 2 The single-layer DDPG-based MRA training algorithm. 

4. Two-Layer Hybrid Approach Based on Game Theory and Deep Reinforcement Learning
4.1. Game Model
4.2. Existence of Nash Equilibrium
4.3. Power Allocation and Energy Harvest Splitting in the Lower Layer
4.4. Beam Selection in the Upper Layer and the Overall Algorithm
Algorithm 3 The two-layer hybrid MRA training algorithm. 

 Observe state ${s}^{\left(t\right)}$ at time t for beam selection.
 Select an optimal action ${a}^{\left(t\right)}$ at time step t.
 Given selected beamforming vectors ${\mathbf{F}}^{\left(t\right)}$, obtain transmit powers ${\mathbf{P}}^{\left(t\right)}$ and splitting ratios ${\Theta}^{\left(t\right)}$ through the game-theory-based iterative method in the lower layer.
 Assess the impact on data rate ${r}_{i}$, energy harvesting ${E}_{i}$, and transmit power ${P}_{i}$, for all links i.
 Reward the action at time t as ${U}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)}),\forall i$, based on the impact assessed.
 Train DQN with the system utility ${U}^{\left(t\right)}$ obtained.
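Putting the steps above together, the two-layer loop can be sketched as follows. Here `env`, `dqn`, and `game_solver` are hypothetical interfaces standing in for the simulated network, the upper-layer DQN, and the lower-layer game-theoretic fixed-point solver; they are not the authors' code:

```python
def train_two_layer(env, dqn, game_solver, episodes, steps):
    """Upper layer: DQN selects beamforming vectors F.  Lower layer:
    a game-theoretic iterative method returns transmit powers P and
    splitting ratios Theta given F.  The resulting system utility is
    fed back as the DQN reward r^(t) = U^(t)."""
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            F = dqn.select_action(s)       # upper layer: beam selection
            P, Theta = game_solver(F)      # lower layer: NE via fixed point
            s_next, utility = env.step(F, P, Theta)
            dqn.store(s, F, utility, s_next)
            dqn.train_step()
            s = s_next
```

This split keeps the DQN's discrete action space small (beam indices only), while the continuous power and splitting variables are resolved by the game model in the inner loop.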
4.5. Time Complexity
5. Numerical Experiments
5.1. Simulation Setup
5.2. Performance Comparison
5.2.1. Impacts of Antennas
5.2.2. Impacts of Pricing Strategy
Algorithm 4 The two-layer hybrid MRA training algorithm with the pricing strategy. 

6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Name  Description  Name  Description 

$\mathbf{P},\Theta ,\mathbf{F}$  sets of transmit powers, splitting ratios, and beamforming vectors, respectively  ${P}_{i},{\theta}_{i},{f}_{i}$  transmit power, splitting ratio, and beamforming vector for link i, respectively 
$\mathbf{L}$  set of locations  $\widehat{L}$  loss function 
$\tilde{\mathit{S}}$  a finite set of states, $\left\{{s}_{1},{s}_{2},\cdots ,{s}_{\mathtt{m}}\right\}$  ${s}^{\left(t\right)}$  state at time t, denoted by $\left\{{\mathbf{L}}^{\left(t\right)}\right.$,${\mathbf{P}}^{\left(t\right)}$, ${\Theta}^{\left(t\right)}$, $\left.{\mathbf{F}}^{\left(t\right)}\right\}$ 
$\tilde{\mathit{A}}$  a finite set of actions, $\left\{{a}_{1},{a}_{2},\cdots ,{a}_{\mathtt{n}}\right\}$  $\widehat{\mathbf{A}}$  a set of binary variables, where ${\widehat{\mathbf{A}}}_{P}$, ${\widehat{\mathbf{A}}}_{\Theta}$, and ${\widehat{\mathbf{A}}}_{F}$ correspond to those for $\mathbf{P}$, $\Theta $, and $\mathbf{F}$, respectively. 
$\tilde{\mathit{R}}$  a finite set of rewards, where $\tilde{R}(s,a,{s}^{\prime})$ is the function providing reward $\mathfrak{r}$ at state $s\in \tilde{\mathit{S}}$, action $a\in \tilde{\mathit{A}}$, and next state ${s}^{\prime}$  $\tilde{\mathit{P}}$  a finite set of transition probabilities, where ${\tilde{P}}_{s{s}^{\prime}}\left(a\right)=p\left({s}^{\prime}\mid s,a\right)$ is the probability of migrating to state ${s}^{\prime}$ when taking action a at state s 
${\pi}^{*}\left(s\right)$  optimal policy at state s  ${V}^{\pi}\left(s\right)$  value function for the expected value to be obtained by policy $\pi $ from state $s\in S$ 
${V}^{*}\left(s\right)$  optimal value function at state s  ${Q}^{\pi}(s,a)$  action–value function representing the expected reward starting from state s and taking action a from policy $\pi $ 
${Q}^{{\pi}^{*}}$  optimal policy for the (optimal) action–value function ${Q}^{*}(s,a)={max}_{\pi}{Q}^{\pi}(s,a)$  $Q({s}_{t},{a}_{t})$  action–value (Q) function at time t 
$\widehat{Q}({s}_{t},{a}_{t},$${\omega}^{\prime})$  approximated action–value (Q) function with the weight of DNN, ${\omega}^{\prime}$, at time t  $\mathcal{F}$  beamforming codebook 
${\mathcal{P}}_{i}^{1}$  a set of transmit powers for link i in PSF1  ${\mathcal{P}}^{2}$  a set of transmit powers for all links in PSF2 
${U}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})$  reward for link i at time t, including data rate ${r}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})$ and energy harvest ${E}_{i}({\mathbf{P}}^{\left(t\right)},{\theta}_{i}^{\left(t\right)},{\mathbf{F}}^{\left(t\right)})$  $({s}^{\left(t\right)},{a}^{\left(t\right)},{\mathfrak{r}}^{\left(t\right)},{s}^{\prime})$  transition at time t, where ${\mathfrak{r}}^{\left(t\right)}={U}^{\left(t\right)}$ is the system utility at this time step 
$\alpha $, ${\alpha}_{a}$, ${\alpha}_{c}$  learning rate, the (learning) rate specific to actor network, and the (learning) rate specific to critic network  $\u03f5,{\u03f5}_{min}$  exploration rate (probability) and its minimum requirement 
$\zeta $  discount factor  $\tau $  soft update parameter 
d  exploration decay rate  D  replay buffer 
$\eta $  batch size  $\varrho $  converge threshold for the fixed point iteration 
${Q}_{a}(s;{\omega}_{a})$, ${Q}_{{a}^{\prime}}(s;{\omega}_{{a}^{\prime}})$  output of actor network (online and target, respectively)  $Q(s,a;{\omega}_{c})$, $\overline{Q}(s,a;{\omega}_{{c}^{\prime}})$  output of critic network (online and target, respectively) 
${\lambda}_{i},{\mu}_{i},{\nu}_{i}$  parameters for data rate, energy harvesting, and power consumption, respectively, for link i  ${\varsigma}_{1}$, ${\varsigma}_{2}$, ${\varsigma}_{3}$  scale factors for normalization of DDPG at time t 
${a}^{{\left(t\right)}^{*}}$  deterministic action of DDPG at time t, wherein ${A}_{P}^{{\left(t\right)}^{*}}$, ${A}_{\Theta}^{{\left(t\right)}^{*}}$, and ${A}_{F}^{{\left(t\right)}^{*}}$ correspond to those for transmit power, split ratio and beamforming vector  ${\tilde{P}}_{i}^{\left(t\right)},{\tilde{\theta}}_{i}^{\left(t\right)}$, ${\tilde{f}}_{i}^{\left(t\right)}$  variables for normalization of DDPG at time t 
${R}_{i}$  total received power at link i for the fixed point iteration  ${\widehat{R}}_{i}$  auxiliary variable at link i for the fixed point iteration 
${P}_{i}^{d}$  desired transmit power at link i for the pricing strategy  ${\nu}_{{P}_{i}^{d}}$  desired power cost at link i for the pricing strategy 
References
 Zhang, K.; Mao, Y.; Leng, S.; Zhao, Q.; Li, L.; Peng, X.; Pan, L.; Maharjan, S.; Zhang, Y. Energy-Efficient Offloading for Mobile Edge Computing in 5G Heterogeneous Networks. IEEE Access 2016, 4, 5896–5907.
 Hewa, T.; Braeken, A.; Ylianttila, M.; Liyanage, M. Multi-Access Edge Computing and Blockchain-based Secure Telehealth System Connected with 5G and IoT. In Proceedings of the GLOBECOM 2020—2020 IEEE Global Communications Conference, Taipei, Taiwan, 7–11 December 2020; pp. 1–6.
 Chen, F.; Wang, A.; Zhang, Y.; Ni, Z.; Hua, J. Energy Efficient SWIPT Based Mobile Edge Computing Framework for WSN-Assisted IoT. Sensors 2021, 21, 4798.
 Chae, S.H.; Jeong, C.; Lim, S.H. Simultaneous Wireless Information and Power Transfer for Internet of Things Sensor Networks. IEEE Internet Things J. 2018, 5, 2829–2843.
 Masood, Z.A.; Choi, Y. Energy-efficient optimal power allocation for SWIPT based IoT-enabled smart meter. Sensors 2021, 21, 7857.
 Liu, J.S.; Lin, C.H.R.; Tsai, J. Delay and energy tradeoff in energy harvesting multi-hop wireless networks with inter-session network coding and successive interference cancellation. IEEE Access 2017, 5, 544–564.
 Tran, T.N.; Voznak, M. Switchable Coupled Relays Aid Massive Non-Orthogonal Multiple Access Networks with Transmit Antenna Selection and Energy Harvesting. Sensors 2021, 21, 1101.
 Luo, Z.Q.; Zhang, S. Dynamic Spectrum Management: Complexity and Duality. IEEE J. Sel. Top. Signal Process. 2008, 2, 57–73.
 Boccardi, F.; Heath, R.W., Jr.; Lozano, A.; Marzetta, T.L.; Popovski, P. Five disruptive technology directions for 5G. IEEE Commun. Mag. 2014, 52, 74–80.
 Li, Y.; Luo, J.; Xu, W.; Vucic, N.; Pateromichelakis, E.; Caire, G. A Joint Scheduling and Resource Allocation Scheme for Millimeter Wave Heterogeneous Networks. In Proceedings of the 2017 IEEE Wireless Communications and Networking Conference (WCNC), San Francisco, CA, USA, 19–22 March 2017; pp. 1–6.
 Yang, Z.; Xu, W.; Xu, H.; Shi, J.; Chen, M. User Association, Resource Allocation and Power Control in Load-Coupled Heterogeneous Networks. In Proceedings of the 2016 IEEE Globecom Workshops (GC Wkshps), Washington, DC, USA, 4–8 December 2016; pp. 1–7.
 Saeed, A.; Katranaras, E.; Dianati, M.; Imran, M.A. Dynamic femtocell resource allocation for managing inter-tier interference in downlink of heterogeneous networks. IET Commun. 2016, 10, 641–650.
 Coskun, C.C.; Davaslioglu, K.; Ayanoglu, E. Three-Stage Resource Allocation Algorithm for Energy-Efficient Heterogeneous Networks. IEEE Trans. Veh. Technol. 2017, 66, 6942–6957.
 Liu, R.; Sheng, M.; Wu, W. Energy-Efficient Resource Allocation for Heterogeneous Wireless Network With Multi-Homed User Equipments. IEEE Access 2018, 6, 14591–14601.
 Le, N.T.; Tran, L.N.; Vu, Q.D.; Jayalath, D. Energy-Efficient Resource Allocation for OFDMA Heterogeneous Networks. IEEE Trans. Commun. 2019, 67, 7043–7057.
 Zhang, Y.; Wang, Y.; Zhang, W. Energy efficient resource allocation for heterogeneous cloud radio access networks with user cooperation and QoS guarantees. In Proceedings of the 2016 IEEE Wireless Communications and Networking Conference, Doha, Qatar, 3–6 April 2016; pp. 1–6.
 Zou, S.; Liu, N.; Pan, Z.; You, X. Joint Power and Resource Allocation for Non-Uniform Topologies in Heterogeneous Networks. In Proceedings of the 2016 IEEE 83rd Vehicular Technology Conference (VTC Spring), Nanjing, China, 15–18 May 2016; pp. 1–5.
 Zhang, H.; Du, J.; Cheng, J.; Long, K.; Leung, V.C.M. Incomplete CSI Based Resource Optimization in SWIPT Enabled Heterogeneous Networks: A Non-Cooperative Game Theoretic Approach. IEEE Trans. Wirel. Commun. 2017, 17, 1882–1892.
 Chen, X.; Zhao, Z.; Zhang, H. Stochastic Power Adaptation with Multiagent Reinforcement Learning for Cognitive Wireless Mesh Networks. IEEE Trans. Mob. Comput. 2012, 12, 2155–2166.
 Xu, C.; Sheng, M.; Yang, C.; Wang, X.; Wang, L. Pricing-Based Multiresource Allocation in OFDMA Cognitive Radio Networks: An Energy Efficiency Perspective. IEEE Trans. Veh. Technol. 2013, 63, 2336–2348.
 Jiang, Y.; Lu, N.; Chen, Y.; Zheng, F.; Bennis, M.; Gao, X.; You, X. Energy-Efficient Noncooperative Power Control in Small-Cell Networks. IEEE Trans. Veh. Technol. 2017, 66, 7540–7547.
 Zhang, H.; Song, L.; Han, Z. Radio Resource Allocation for Device-to-Device Underlay Communication Using Hypergraph Theory. IEEE Trans. Wirel. Commun. 2016, 15, 1. 
 Zhang, R.; Cheng, X.; Yang, L.; Jiao, B. Interference-aware graph based resource sharing for device-to-device communications underlaying cellular networks. In Proceedings of the 2013 IEEE Wireless Communications and Networking Conference (WCNC), Shanghai, China, 7–10 April 2013; pp. 140–145.
 Feng, D.; Lu, L.; Yuan-Wu, Y.; Li, G.Y.; Feng, G.; Li, S. Device-to-device communications underlaying cellular networks. IEEE Trans. Commun. 2013, 61, 3541–3551.
 Jiang, Y.; Liu, Q.; Zheng, F.; Gao, X.; You, X. Energy-Efficient Joint Resource Allocation and Power Control for D2D Communications. IEEE Trans. Veh. Technol. 2016, 65, 6119–6127.
 Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
 Meng, F.; Chen, P.; Wu, L.; Cheng, J. Power Allocation in Multi-User Cellular Networks: Deep Reinforcement Learning Approaches. IEEE Trans. Wirel. Commun. 2020, 19, 6255–6267.
 Nguyen, K.K.; Duong, T.Q.; Vien, N.A.; Le-Khac, N.A.; Nguyen, M.N. Non-Cooperative Energy Efficient Power Allocation Game in D2D Communication: A Multi-Agent Deep Reinforcement Learning Approach. IEEE Access 2019, 7, 100480–100490.
 Zhang, Y.; Kang, C.; Ma, T.; Teng, Y.; Guo, D. Power Allocation in Multi-Cell Networks Using Deep Reinforcement Learning. In Proceedings of the 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), Chicago, IL, USA, 27–30 August 2018; pp. 1–6.
 Choi, J. Massive MIMO With Joint Power Control. IEEE Wirel. Commun. Lett. 2014, 3, 329–332.
 Zhang, Y.; Kang, C.; Teng, Y.; Li, S.; Zheng, W.; Fang, J. Deep Reinforcement Learning Framework for Joint Resource Allocation in Heterogeneous Networks. In Proceedings of the 2019 IEEE 90th Vehicular Technology Conference (VTC2019-Fall), Honolulu, HI, USA, 22–25 September 2019; pp. 1–6.
 Qiu, C.; Hu, Y.; Chen, Y.; Zeng, B. Deep Deterministic Policy Gradient (DDPG)-Based Energy Harvesting Wireless Communications. IEEE Internet Things J. 2019, 6, 8577–8588.
 3GPP. Evolved Universal Terrestrial Radio Access (E-UTRA): Physical Layer Procedures; TS 36.213, Dec. 2015; 3GPP: Valbonne, France, 2015.
 Kim, R.; Kim, Y.; Yu, N.Y.; Kim, S.J.; Lim, H. Online Learning-Based Downlink Transmission Coordination in Ultra-Dense Millimeter Wave Heterogeneous Networks. IEEE Trans. Wirel. Commun. 2019, 18, 2200–2214.
 Song, Q.; Wang, X.; Qiu, T.; Ning, Z. An Interference Coordination-Based Distributed Resource Allocation Scheme in Heterogeneous Cellular Networks. IEEE Access 2017, 5, 2152–2162.
 Trakas, P.; Adelantado, F.; Zorba, N.; Verikoukis, C. A QoE-aware joint resource allocation and dynamic pricing algorithm for heterogeneous networks. In Proceedings of the GLOBECOM 2017—2017 IEEE Global Communications Conference, Singapore, 4–8 December 2017; pp. 1–6.
 Simsek, M.; Bennis, M.; Guvenc, I. Learning Based Frequency- and Time-Domain Inter-Cell Interference Coordination in HetNets. IEEE Trans. Veh. Technol. 2014, 64, 4589–4602.
 Ghadimi, E.; Calabrese, F.D.; Peters, G.; Soldati, P. A reinforcement learning approach to power control and rate adaptation in cellular networks. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; pp. 1–7.
 Calabrese, F.D.; Wang, L.; Ghadimi, E.; Peters, G.; Hanzo, L.; Soldati, P. Learning Radio Resource Management in RANs: Framework, Opportunities, and Challenges. IEEE Commun. Mag. 2018, 56, 138–145.
 Sharma, S.; Darak, S.J.; Srivastava, A. Energy saving in heterogeneous cellular network via transfer reinforcement learning based policy. In Proceedings of the 2017 9th International Conference on Communication Systems and Networks (COMSNETS), Bengaluru, India, 4–8 January 2017; pp. 397–398.
 Wei, Y.; Yu, F.R.; Song, M.; Han, Z. User Scheduling and Resource Allocation in HetNets With Hybrid Energy Supply: An Actor-Critic Reinforcement Learning Approach. IEEE Trans. Wirel. Commun. 2017, 17, 680–692.
 Liang, L.; Feng, G. A Game-Theoretic Framework for Interference Coordination in OFDMA Relay Networks. IEEE Trans. Veh. Technol. 2011, 61, 321–332.
 Lu, Y.; Xiong, K.; Fan, P.; Zhong, Z.; Ai, B.; Ben Letaief, K. Worst-Case Energy Efficiency in Secure SWIPT Networks with Rate-Splitting ID and Power-Splitting EH Receivers. IEEE Trans. Wirel. Commun. 2021, 21, 1870–1885.
 Xu, Y.; Li, G.; Yang, Y.; Liu, M.; Gui, G. Robust Resource Allocation and Power Splitting in SWIPT Enabled Heterogeneous Networks: A Robust Minimax Approach. IEEE Internet Things J. 2019, 6, 10799–10811.
 Zhang, R.; Xiong, K.; Lu, Y.; Gao, B.; Fan, P.; Ben Letaief, K. Joint Coordinated Beamforming and Power Splitting Ratio Optimization in MU-MISO SWIPT-Enabled HetNets: A Multi-Agent DDQN-Based Approach. IEEE J. Sel. Areas Commun. 2021, 40, 677–693.
 Omidkar, A.; Khalili, A.; Nguyen, H.H.; Shafiei, H. Reinforcement Learning Based Resource Allocation for Energy-Harvesting-Aided D2D Communications in IoT Networks. IEEE Internet Things J. 2022, 7, 4387–4394.
 Canese, L.; Cardarilli, G.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Spanò, S. Multi-Agent Reinforcement Learning: A Review of Challenges and Applications. Appl. Sci. 2021, 11, 4948.
 Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
 Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–14 December 2014; pp. 2672–2680.
 Perera, T.D.P.; Jayakody, D.N.K.; Sharma, S.K.; Chatzinotas, S.; Li, J. Simultaneous Wireless Information and Power Transfer (SWIPT): Recent Advances and Future Challenges. IEEE Commun. Surv. Tutor. 2017, 20, 264–302.
 Mismar, F.B.; Evans, B.L.; Alkhateeb, A. Deep Reinforcement Learning for 5G Networks: Joint Beamforming, Power Control, and Interference Coordination. IEEE Trans. Commun. 2019, 68, 1581–1592.
 Alkhateeb, A.; El Ayach, O.; Leus, G.; Heath, R.W. Channel Estimation and Hybrid Precoding for Millimeter Wave Cellular Systems. IEEE J. Sel. Top. Signal Process. 2014, 8, 831–846.
 Heath, R.W., Jr.; Gonzalez-Prelcic, N.; Rangan, S.; Roh, W.; Sayeed, A.M. An Overview of Signal Processing Techniques for Millimeter Wave MIMO Systems. IEEE J. Sel. Top. Signal Process. 2016, 10, 436–453.
 Schniter, P.; Sayeed, A. Channel estimation and precoder design for millimeter-wave communications: The sparse way. In Proceedings of the 2014 48th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2–5 November 2014; pp. 273–277.
 Rappaport, T.; Gutierrez, F.; Ben-Dor, E.; Murdock, J.N.; Qiao, Y.; Tamir, J.I. Broadband Millimeter-Wave Propagation Measurements and Models Using Adaptive-Beam Antennas for Outdoor Urban Cellular Communications. IEEE Trans. Antennas Propag. 2013, 61, 1850–1859.
 Rappaport, T.S.; Heath, R.W., Jr.; Daniels, R.C.; Murdock, J.N. Millimeter Wave Wireless Communications; Pearson: London, UK, 2014.
 Lu, X.; Wang, P.; Niyato, D.; Kim, D.I.; Han, Z. Wireless charging technologies: Fundamentals, standards, and network applications. IEEE Commun. Surv. Tutor. 2015, 18, 1413–1452.
 Ng, D.W.K.; Lo, E.S.; Schober, R. Multiobjective Resource Allocation for Secure Communication in Cognitive Radio Networks With Wireless Information and Power Transfer. IEEE Trans. Veh. Technol. 2016, 65, 3166–3184.
 Chang, Z.; Gong, J.; Ristaniemi, T.; Niu, Z. Energy-Efficient Resource Allocation and User Scheduling for Collaborative Mobile Clouds With Hybrid Receivers. IEEE Trans. Veh. Technol. 2016, 65, 9834–9846.
 Sen, S.; Santhapuri, N.; Choudhury, R.R.; Nelakuditi, S. Successive interference cancellation: A back-of-the-envelope perspective. In Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, New York, NY, USA, 20–21 October 2010; pp. 17:1–17:6.
 Bertsekas, D.P. Dynamic Programming and Optimal Control; Athena Scientific: Belmont, MA, USA, 1995; Volume 1.
 Li, X.; Fang, J.; Cheng, W.; Duan, H.; Chen, Z.; Li, H. Intelligent Power Control for Spectrum Sharing in Cognitive Radios: A Deep Reinforcement Learning Approach. IEEE Access 2018, 6, 25463–25473.
 Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 37, pp. 448–456.
 Sutton, R.S.; Barto, A.G. Introduction to Reinforcement Learning, 1st ed.; MIT Press: Cambridge, MA, USA, 1998.
 Alkhateeb, A.; Alex, S.; Varkey, P.; Li, Y.; Qu, Q.; Tujkovic, D. Deep Learning Coordinated Beamforming for Highly-Mobile Millimeter Wave Systems. IEEE Access 2018, 6, 37328–37348.
 Fudenberg, D.; Tirole, J. Game Theory; MIT Press: Cambridge, MA, USA, 1991.
Parameter  Value 

Maximum transmit power (${P}_{max}$)  40 W (46 dBm) 
Minimum transmit power (${P}_{min}$)  1 W (30 dBm) 
Probability of line of sight (${P}_{los}$)  0.7 
Cell radius ($\widehat{r}$)  150 m 
Distance between sites (BSs)  225 m 
Antenna gain  3 dBi 
Mobile node (MN) antenna gain  0 dBi 
Number of multipaths  4 
MN movement speed on average (v)  2 km/h 
Number of transmit antennas of BS  $\left\{4,8,16,32\right\}$ 
Downlink frequency band  28 GHz 
Parameter  Value 

DQN:  
Discount factor ($\zeta $)  0.995 
Learning rate ($\alpha $)  0.01 
Initial exploration rate ($\u03f5$)  1.0 
Minimum exploration rate (${\u03f5}_{min}$)  0.1 
Exploration decay rate (d)  0.9995 
Size of state ($|s|$)  10 
Size of action ($|a|$)  64 
Replay buffer size ($|D|$)  2000 
Batch size ($\eta $)  256 
DDPG:  
Actor learning rate (${\alpha}_{a}$)  0.001 
Critic learning rate (${\alpha}_{c}$)  0.002 
Replay buffer size ($|D|$)  10,000 
Exploration decay rate (d)  0.9995 
Batch size ($\eta $)  32 
Scale factors (${\varsigma}_{1}$, ${\varsigma}_{2}$, ${\varsigma}_{3}$)  1 
Discount factor ($\zeta $)  0.9 
Soft update parameter ($\tau $)  0.01 
DQN for twolayer:  
Size of state ($|s|$)  6 
Size of action ($|a|$)  9 
Other parameters are the same as for the single-layer DQN 
Method  Data Rate  Energy Harvesting  Power Consumption 

DRL  11.32910  0  22.51510 
two-layer  11.26969  $8.164853\times {10}^{-9}$  16.40005 
single-layer with DDPG  10.58339  $1.062941\times {10}^{-5}$  21.34165 
single-layer with DQN of PSF1  9.31607  $5.809001\times {10}^{-8}$  22.50100 
single-layer with DQN of PSF2  8.46842  $3.477011\times {10}^{-8}$  23.69319 
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Liu, J.; Lin, C.H.R.; Hu, Y.C.; Donta, P.K. Joint Beamforming, Power Allocation, and Splitting Control for SWIPT-Enabled IoT Networks with Deep Reinforcement Learning and Game Theory. Sensors 2022, 22, 2328. https://doi.org/10.3390/s22062328