Energy-Efficient Power Allocation and User Association in Heterogeneous Networks with Deep Reinforcement Learning

Abstract: This paper studies the problem of joint power allocation and user association in wireless heterogeneous networks (HetNets) with a deep reinforcement learning (DRL)-based approach. This is a challenging problem since the action space is hybrid, consisting of continuous actions (power allocation) and discrete actions (device association). Instead of quantizing the continuous space (i.e., possible values of powers) into a set of discrete alternatives and applying traditional deep reinforcement learning approaches such as deep Q-learning, we propose working on the hybrid space directly by using the novel parameterized deep Q-network (P-DQN) to update the learning policy and maximize the average cumulative reward. Furthermore, we incorporate the constraints of limited wireless backhaul capacity and the quality-of-service (QoS) of each user equipment (UE) into the learning process. Simulation results show that the proposed P-DQN outperforms the traditional approaches, such as the DQN and distance-based association, in terms of energy efficiency while satisfying the QoS and backhaul capacity constraints. The improvement in the energy efficiency of the proposed P-DQN on average may reach 77.6% and 140.6% over the traditional DQN and distance-based association approaches, respectively, in a HetNet with three SBSs and five UEs.


Introduction
With the exponential growth of wireless Internet-of-Things (IoT) sensors and the ultra-reliable requirements of next-generation cellular networks, global mobile data traffic is expected to reach about 1 zettabyte by 2022 according to Cisco's forecast [1]. To meet the demand for higher data traffic over wireless links, whether from fixed sensors in IoT networks or mobile devices in cellular networks, the network infrastructure will inevitably need to expand dramatically, which will result in a tremendous escalation of energy consumption and backhaul traffic. Energy efficiency and spectral efficiency are therefore critical issues when designing next-generation wireless communication systems that must guarantee quality of service (QoS) for radio devices while consuming power efficiently [2,3].
Increasing the cell density is one approach to meeting the need for high data rates under limited bandwidth in centralized networks. Heterogeneous networks (HetNets) have therefore emerged as a standard part of future mobile networks to improve the system capacity and energy efficiency through more flexible design of transmission power allocation and smaller coverage sizes by densely deployed small base stations (SBSs) [3][4][5]. However, the interference caused by the various SBSs is the primary challenge for effective system capacity improvement.

Motivation
Cell densification of HetNet enhances the spectrum efficiency in an energy-efficient way. On the other hand, cell densification potentially increases the inter-cell interference, especially at the cell edges, which deteriorates the QoS of user equipments (UEs).
Interference coordination schemes by means of radio resource allocation and power control can be implemented to achieve higher energy efficiency and spectral efficiency [2]. However, the growing complexity of wireless networks due to increased links and heterogeneous network structure creates tremendous challenges for system design, thus calling for more intelligent techniques for effective yet efficient resource management strategies. From this perspective, data-driven machine learning techniques have been regarded as viable new approaches to dealing with complex network dynamics [5][6][7][8][9][10]. Compared with traditional model-based algorithms [11][12][13][14], deep reinforcement learning (DRL), leveraging recent advances in deep neural networks with reinforcement learning [15][16][17], can autonomously extract features from raw data with different formats and complex correlations generated by mobile environments. This potentially reduces the cost of data pre-processing [11,18]. In view of this, in this paper, we provide a reinforcement-learning-based solution for power allocation and radio device association with the objective of maximizing energy efficiency while satisfying the required QoS and wireless backhaul capacity constraints.

Prior Work
In order to intelligently manage the interwoven dynamics underlying the wireless sensor or mobile networks in which a variety of network parameters are generally unknown, deep reinforcement learning (DRL)-based approaches have been applied to tackle the challenges of radio resource management in wireless networks, e.g., [11,16,18][19][20][21][22][23][24][25][26][27], due to DRL's ability to extract features from raw data, learn complex correlations generated by the mobile environment, and make sequential decisions through interactions with the environment without knowledge of complete environment information. In the applications of power allocation, the objectives can often be categorized into three types: capacity maximization in [11,19,20], energy saving in [16], and maximization of capacity per unit of consumed energy (which is defined as the energy efficiency) in [21][22][23][24].
Meng et al. [11] propose several DL-based algorithms to handle the power allocation with the aim of maximizing the sum rate in multiuser wireless cellular networks, in which the DL-based data-driven approaches are demonstrated to outperform the traditional model-based power allocation method. Nasir and Guo [16] utilize multi-agent deep RL to adaptively control the discrete power level (i.e., the range of possible power values is quantized into a number of discrete levels) for each user, where the policy can be made without knowledge of the instantaneous channel state information (CSI). Park and Lim [18] tackle the problem of discrete power allocation and mode selection in device-to-device (D2D) communication networks using DQN with energy efficiency as the reward in the learning process. Amiri et al. [19] apply cooperative Q-learning for power control at discrete levels, but the effect of channel variations is not considered. Ahmed and Hossain [20] employ deep Q-learning, which was originally proposed in [28], to update the transmission power allocated for each user at the small cell base station in a HetNet, in which the power is quantized into discrete levels. Xu et al. [21] present a novel DRL-based technique for resource allocation by considering power efficiency in cloud radio access networks (RANs) and ensuring a QoS guarantee. Lu et al. [22] propose a DRL-DPT framework, in which the agent learns directly from the expected objective instead of the critic value, for energy efficiency maximization without explicit simulation results for the QoS guarantee. Wei et al. determine the number of users and subchannels with corresponding power allocation in HetNets using a policy-gradient-based actor-critic learning algorithm [23]. Instead of quantizing the power into discrete levels, Meng et al. [11] model the power as a continuous action and adaptively update the continuous power using deep deterministic policy gradient-based (DDPG-based) reinforcement learning [29,30].
While the above-mentioned studies successfully apply DRL-based techniques to power allocation in heterogeneous networks, the problem of user association is not jointly considered, and various practical constraints in the network are not accounted for, such as the limited backhaul capacity [13] in each small cell base station. A novel energy-efficient joint power allocation and user association scheme using deep reinforcement learning is studied by Li et al. in [24], where the power is considered as belonging to a discrete set and the learning process is not bounded by any constraints. A summary of the related work is provided in Table 1.

Contributions of the Research
In contrast with existing studies, which quantize the continuous set into a discrete space [16,19,20], we propose utilizing the parameterized deep Q-network (P-DQN) to handle the problem with a hybrid action space composed of discrete user association and continuous power allocation [31]. This overcomes the limitation of the traditional DQN, which can only cope with RL problems having discrete action spaces, either with intrinsically discrete actions or with discrete actions quantized from a continuous action space. In this work, we provide a joint solution for power allocation and user association with the objective of maximizing downlink energy efficiency under a backhaul link constraint and a QoS guarantee using P-DQN. A flexible reward function is devised to meet each user equipment's QoS demands in different traffic scenarios, and a penalty mechanism is introduced when the backhaul link constraint is violated. Simulation results demonstrate that the P-DQN outperforms other approaches in terms of overall energy efficiency while satisfying QoS requirements and backhaul constraints. The main contributions of this paper are summarized as follows:
• We provide a joint solution for the power allocation and user association with the objective of maximizing downlink energy efficiency under a backhaul link constraint and a QoS guarantee. We employ the novel model-free parameterized deep Q-network (P-DQN) framework that is capable of updating policies in a hybrid discrete-continuous action space (i.e., discrete BS-UE association and continuous power allocation).
• To the best of our knowledge, most DRL-based research on power allocation does not consider the wireless backhaul capacity constraint and user QoS. We design a flexible reward function to meet the QoS demands in different traffic scenarios and introduce a penalty mechanism when the backhaul link constraint is violated.
• We verify by simulations that the proposed P-DQN framework outperforms other approaches in terms of overall energy efficiency while satisfying QoS requirements and backhaul constraints.

Organization
The rest of the paper is organized as follows. The system model is described in Section 2. We next present the joint energy-efficiency maximization problem of clustering decision and power allocation in Section 3. Simulation results are given in Section 4. Finally, we provide concluding remarks in Section 5.

Heterogeneous Network
Consider the downlink of a two-tier HetNet composed of a macro base station (MBS), $J$ small BSs (SBSs), and $K$ UEs, with $\mathcal{J} = \{1, 2, \ldots, J\}$ and $\mathcal{K} = \{1, 2, \ldots, K\}$ being the sets of SBSs and UEs, respectively. The system network is depicted in Figure 1. In this paper, we assume there is no cross-tier interference in the network, which can be achieved by using different frequency bands for transmissions in the two tiers (e.g., sub-6 GHz in tier 1 and millimeter wave bands in tier 2).
The MBS is equipped with an antenna array of size $N_T$, which is assumed to be larger than the number of SBSs, i.e., $N_T > J$. Orthogonal frequency division multiple access (OFDMA) is utilized for the downlink communication between SBSs and UEs, with a total number of subchannels $N_{\text{sub}}$.

User Association
Each UE is assumed to be associated with only one SBS, but each SBS can serve multiple UEs using OFDMA. The UEs served by the same SBS constitute a cluster. Let $\mathcal{F}_k$ denote the set of subchannels allocated to the $k$th UE and $c_{j,k} \in \{0, 1\}$ represent the status of user association, i.e., $c_{j,k} = 1$ if the $k$th UE is associated with the $j$th SBS and $c_{j,k} = 0$ otherwise. Then, the set of UEs in cluster $j$ is given by $\mathcal{C}_j = \{k : c_{j,k} = 1, k \in \mathcal{K}\}$, with $|\mathcal{C}_j|$ being the number of UEs in $\mathcal{C}_j$. The SBS serving the $k$th UE can be represented by $\mathcal{S}_k = \{j : c_{j,k} = 1, j \in \mathcal{J}\}$. Note that, since each user is assumed to be associated with only one SBS in this paper, $|\mathcal{S}_k|$ equals one. The set of active SBSs is $\mathcal{J}_{\text{active}} = \{j : |\mathcal{C}_j| > 0\}$.
The spectral efficiency of the $k$th UE is given by
$$\rho_k = \sum_{f \in \mathcal{F}_k} \log_2\left(1 + \mathrm{SINR}_{k,f}\right),$$
where the signal-to-interference-plus-noise ratio (SINR) is
$$\mathrm{SINR}_{k,f} = \frac{P_{j,k,f}\, g_{j,k,f}}{I_{k,f} + \sigma^2}, \quad j \in \mathcal{S}_k,$$
with $g_{j,k,f}$ being the channel gain between the $j$th SBS and the $k$th UE in subchannel $f$, $I_{k,f}$ the interference observed by the $k$th UE, and $\sigma^2$ the noise power. Specifically, the channel gain is defined as $g_{j,k,f} = |h_{j,k,f}|^2$, where $h_{j,k,f}$ is the corresponding channel coefficient. The transmit power $P_{j,k,f}$ from SBS $j$ to UE $k$ in subchannel $f$ needs to satisfy the power constraint $0 \le \sum_{k \in \mathcal{C}_j} \sum_{f \in \mathcal{F}_k} P_{j,k,f} \le P^{\text{SBS}}_{j,\max}$, where $P^{\text{SBS}}_{j,\max}$ is the maximum transmit power of the $j$th SBS. The user sum-rate for the $j$th cluster is given by
$$\rho^{\text{SBS}}_j = \sum_{k \in \mathcal{C}_j} \rho_k.$$
We consider the scenario in which each SBS allocates orthogonal subchannels to different UEs within its coverage, so there is no intra-cluster interference in each cluster. Each UE can acquire at least one subchannel for data transmission if the cluster size (i.e., the number of served UEs) is not larger than the number of subchannels. Without intra-cluster interference, the interference term $I_{k,f}$ consists only of inter-cluster interference and can be expressed by
$$I_{k,f} = \sum_{j' \in \mathcal{J} \setminus \mathcal{S}_k} \; \sum_{k' \in \mathcal{C}_{j'} :\, f \in \mathcal{F}_{k'}} P_{j',k',f}\, g_{j',k,f}.$$
Detailed notation descriptions are summarized in Table 2.

Table 2.
$P_T$: total power consumption at all the SBSs
$g_{j,k,f}$: channel gain between SBS $j$ and UE $k$ in the $f$th subchannel
$h_{j,k,f}$: channel coefficient between SBS $j$ and UE $k$ in the $f$th subchannel
$P^{\text{SBS}}_{j,\max}$: maximum power available at the $j$th SBS
$P_{j,k,f}$ *: transmit power from SBS $j$ to UE $k$ in the $f$th subchannel
$p^{\text{UE}}_k$ **: transmit power from the associated SBS to UE $k$
$|\mathcal{J}_{\text{active}}|$: number of active SBSs
$\sigma^2$: noise power
$I_{k,f}$: interference experienced by UE $k$ in subchannel $f$
$\mathcal{C}_j$: the set of UEs in cluster $j$
$c_{j,k}$: link indicator between SBS $j$ and UE $k$
$c^{\text{UE}}$: user association
$\mathrm{SINR}_{k,f}$: SINR for UE $k$ in the $f$th subchannel
$\rho_k$: capacity for UE $k$
$\nu_k$: capacity threshold for UE $k$
$\rho^{\text{SBS}}_j$: user sum-rate for SBS $j$
$R^{\text{SBS}}_j$: maximum downlink data rate for SBS $j$
* For the optimization approach.
** For the DRL-based approach.
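As an illustration of the rate model in this subsection, the following minimal Python sketch computes per-UE spectral efficiency and per-cluster sum rates. It assumes the standard form of the rate, i.e., log2(1 + SINR) summed over the subchannels of each UE, with interference coming only from SBSs of other clusters that reuse the same subchannel. The function names, the data layout, and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def user_rates(assoc, power, gain, subch, noise):
    """Per-UE spectral efficiency with inter-cluster interference only.

    assoc[k]      : index of the SBS serving UE k (one SBS per UE)
    power[k][f]   : transmit power from the serving SBS to UE k on subchannel f
    gain[j][k][f] : channel gain between SBS j and UE k on subchannel f
    subch[k]      : subchannels assigned to UE k
    noise         : noise power (sigma^2)
    """
    num_ue = len(assoc)
    rates = []
    for k in range(num_ue):
        rho = 0.0
        for f in subch[k]:
            # Interference: signals intended for UEs of other clusters on subchannel f.
            interference = sum(
                power[u][f] * gain[assoc[u]][k][f]
                for u in range(num_ue)
                if assoc[u] != assoc[k] and f in subch[u]
            )
            sinr = power[k][f] * gain[assoc[k]][k][f] / (interference + noise)
            rho += math.log2(1.0 + sinr)
        rates.append(rho)
    return rates

def cluster_sum_rates(assoc, rates, num_sbs):
    """Sum of the UE rates in each cluster (one cluster per SBS)."""
    sums = [0.0] * num_sbs
    for k, j in enumerate(assoc):
        sums[j] += rates[k]
    return sums

# Toy example (hypothetical numbers): 2 SBSs, 2 UEs, one shared subchannel.
assoc = [0, 1]
power = [{0: 1.0}, {0: 1.0}]
gain = [[[1.0], [0.5]], [[0.5], [1.0]]]
subch = [[0], [0]]
rates = user_rates(assoc, power, gain, subch, noise=0.5)
sums = cluster_sum_rates(assoc, rates, num_sbs=2)
```

In this toy setting each UE sees a desired signal power of 1.0 against interference plus noise of 1.0, so the SINR is 1 and each rate is one bit per channel use.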

Power Consumption
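The system power consumption includes the operational power, which is the minimum amount of power needed to keep an SBS active, and the data transmission power. Operational powers for the SBSs and the MBS are expressed as $P_{o,\text{SBS}}$ and $P_{o,\text{MBS}}$, respectively. The total power consumption of all SBSs is
$$P_T = |\mathcal{J}_{\text{active}}|\, P_{o,\text{SBS}} + \sum_{j \in \mathcal{J}} \sum_{k \in \mathcal{C}_j} \sum_{f \in \mathcal{F}_k} P_{j,k,f},$$
where $|\mathcal{J}_{\text{active}}|$ is the number of active SBSs.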

Optimization Problem
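We aim for a joint solution optimizing the user association and transmit power allocation with the objective of maximizing energy efficiency, which is defined as the achievable sum rate per consumed power, in the downlink of the two-tier HetNet while considering the QoS guarantee and wireless backhaul link capacity constraints. The problem can be formulated as
$$
\begin{aligned}
\max_{\{c_{j,k}\},\,\{P_{j,k,f}\}} \quad & \frac{\sum_{k \in \mathcal{K}} \rho_k}{P_T} && \text{(4a)}\\
\text{s.t.} \quad & C_1: \sum_{j \in \mathcal{J}} c_{j,k} = 1, \quad \forall k \in \mathcal{K}, && \text{(4b)}\\
& C_2: 0 \le \sum_{k \in \mathcal{C}_j} \sum_{f \in \mathcal{F}_k} P_{j,k,f} \le P^{\text{SBS}}_{j,\max}, \quad \forall j \in \mathcal{J}, && \text{(4c)}\\
& C_3: \rho_k \ge \nu_k, \quad \forall k \in \mathcal{K}, && \text{(4d)}\\
& C_4: |\mathcal{C}_j| \le |\mathcal{C}_j|_{\max}, \quad \forall j \in \mathcal{J}, && \text{(4e)}\\
& C_5: \rho^{\text{SBS}}_j \le R^{\text{SBS}}_j, \quad \forall j \in \mathcal{J}. && \text{(4f)}
\end{aligned}
$$
$C_1$ in (4b) states that each UE is served by only one SBS, and $C_2$ in (4c) refers to the transmit power limit at the $j$th SBS, with $P^{\text{SBS}}_{j,\max}$ the maximum power available at the $j$th SBS. $C_3$ in (4d) indicates the QoS requirement for each UE, where $\nu_k$ is the capacity threshold for UE $k$. $C_4$ in (4e) is the cluster size constraint, with $|\mathcal{C}_j|_{\max}$ the maximum allowable number of users in $\mathcal{C}_j$. This ensures that UEs in the same cluster are assigned different subchannels to avoid intra-cluster interference. $C_5$ in (4f) indicates the backhaul link capacity constraint, where $R^{\text{SBS}}_j$ is the maximum achievable downlink data rate for SBS $j$. Note that the subchannel assignment is assumed to be known and is not considered in this work.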
The strategy in (4a) attempts to maximize the energy efficiency by finding the optimal user association and power allocation, which is generally a challenging problem with various unknowns and hybrid unknown spaces (continuous power and discrete clustering) in the system. Furthermore, the optimization problem in (4a) deals with a one-shot scenario at a certain time instant which needs to be re-evaluated when the network evolves to the next time instant. To tackle the challenges, we are therefore motivated to resort to the techniques of reinforcement learning (RL).
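To make the objective and constraints concrete, the following Python sketch evaluates the energy efficiency (sum rate divided by total SBS power) and checks constraints C2 to C5 for a given association and power allocation. C1 holds by construction because each UE is mapped to exactly one SBS index. The per-UE total power representation, the function names, and the numbers are assumptions made for this illustration and are not taken from the simulation setup.

```python
from collections import Counter

def total_sbs_power(assoc, tx_power, p_op_sbs):
    """Operational power of the active SBSs plus all transmit powers."""
    num_active = len(set(assoc))  # SBSs serving at least one UE
    return num_active * p_op_sbs + sum(tx_power)

def energy_efficiency(rates, assoc, tx_power, p_op_sbs):
    """Achievable sum rate per unit of consumed power."""
    return sum(rates) / total_sbs_power(assoc, tx_power, p_op_sbs)

def check_constraints(assoc, rates, tx_power, p_max, qos, max_cluster, backhaul):
    """Return which of the constraints C2..C5 are satisfied.

    assoc[k]    : SBS index serving UE k
    rates[k]    : capacity of UE k
    tx_power[k] : total transmit power allocated to UE k (over its subchannels)
    p_max[j]    : maximum transmit power of SBS j
    qos[k]      : capacity threshold of UE k
    max_cluster : maximum allowable cluster size
    backhaul[j] : maximum downlink data rate of SBS j
    """
    num_sbs = len(p_max)
    size = Counter(assoc)
    cluster_power = [0.0] * num_sbs
    cluster_rate = [0.0] * num_sbs
    for k, j in enumerate(assoc):
        cluster_power[j] += tx_power[k]
        cluster_rate[j] += rates[k]
    return {
        "C2_power": all(0.0 <= cluster_power[j] <= p_max[j] for j in range(num_sbs)),
        "C3_qos": all(r >= v for r, v in zip(rates, qos)),
        "C4_cluster_size": all(size[j] <= max_cluster for j in range(num_sbs)),
        "C5_backhaul": all(cluster_rate[j] <= backhaul[j] for j in range(num_sbs)),
    }

# Toy example (hypothetical numbers): 2 SBSs, 3 UEs.
assoc = [0, 0, 1]
rates = [2.0, 1.0, 3.0]
tx_power = [0.5, 0.5, 1.0]
ee = energy_efficiency(rates, assoc, tx_power, p_op_sbs=2.0)
status = check_constraints(assoc, rates, tx_power, p_max=[1.0, 1.0],
                           qos=[1.0, 1.0, 1.0], max_cluster=2, backhaul=[2.5, 5.0])
```

Here the first cluster carries a sum rate of 3.0 against a backhaul limit of 2.5, so only the backhaul constraint is reported as violated. The example also shows why deactivating an SBS can raise energy efficiency: the operational power term scales with the number of active SBSs.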

Reinforcement Learning
RL, as one kind of machine learning, is well known for its capability of making decisions sequentially in dynamic environments, where the decision-making agent interacts with the environment through an appropriately chosen action that is based on its past experiences, learned through a reward function, and on the current environment state it is experiencing [15]. These constitute the three fundamental elements of RL: state ($s_t$), action ($a_t$), and reward ($r(s_t, a_t)$). Typically, RL formulates the environment dynamics as a Markov decision process (MDP), and the primary objective is to determine the action contingent upon a certain state at each time step such that the expected discounted cumulative reward is maximized. More specifically, traditional Q-learning aims at finding the action that maximizes the action-value function $Q(s, a)$ defined by
$$Q(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\right], \tag{5}$$
where $\gamma$ is the discount factor. It is a challenging task to evaluate the Q-function in (5) in many applications, since the statistical properties between any two states are often not easy to obtain. Fortunately, thanks to the advancement in deep neural networks (DNNs), evaluations of the Q-function in (5) can be well approximated by properly designed DNNs [28]. Specifically, during the training phase, the weights in the deep Q-network (DQN) are trained to extract features from raw data with corresponding target values obtained by the recursive Bellman equation developed from (5). The success of DQN has led to an explosive subsequent development in the area of deep reinforcement learning (DRL), such as the more stabilized version of the DQN (double DQN in [32]), the extension to continuous actions using the deep deterministic policy gradient (DDPG) in [29], and the TD3 [33].
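The recursive Bellman target mentioned above can be illustrated with a minimal tabular sketch in Python. This is generic textbook Q-learning on a hypothetical two-state table, not the DNN-based DQN used in this work; the learning rate `alpha` and the table layout are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward: sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step toward the Bellman target.

    q is a dict of dicts: q[state][action] -> current estimate of Q(s, a).
    """
    target = r + gamma * max(q[s_next].values())  # greedy bootstrap on the next state
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]

# Toy example: two states, two actions, hypothetical initial values.
q_table = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 2.0}}
new_value = q_learning_update(q_table, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.95)
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

In a DQN the table is replaced by a neural network, and the same target, reward plus discounted maximum next-state value, serves as the regression label.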

State, Action, and Reward Function
In this research, the state, action, and reward function associated with the penalty mechanism for the wireless backhaul capacity constraint in the considered RL-based joint power allocation and user association are defined as follows:
• State: The state at the $t$th time slot is defined as the user data rates in that time slot, $s_t = [\rho_1(t), \rho_2(t), \ldots, \rho_K(t)]$.
• Action: The action in the $t$th time slot is defined as $a_t = (c^{\text{UE}}(t), p^{\text{UE}}(t))$, where $c^{\text{UE}} = [c_{j,k}]_{j=1:J,\,k=1:K}$, with $c_{j,k} \in \{0, 1\}$, $j \in \mathcal{J}$, $k \in \mathcal{K}$, and $p^{\text{UE}}(t)$ indicate the sets of user associations and power allocations, respectively. More specifically, the power allocation set is given by $p^{\text{UE}}(t) = \{p^{\text{UE}}_k(t)\}_{k \in \mathcal{K}}$, where $p^{\text{UE}}_k(t) = [P_{j,k,f}(t)]_{f \in \mathcal{F}_k}$, $j \in \mathcal{S}_k$, is the vector of allocated power for data transmission in all subchannels assigned to UE $k$ from its associated SBS $\mathcal{S}_k$.
• Reward: We aim to maximize the overall energy efficiency as in (4a) while maintaining the QoS for each UE and satisfying the backhaul link capacity constraint for each SBS. Hence, the reward $r_t$ at the $t$th time slot is defined in terms of the following quantities: $\kappa_1(t) = \frac{1}{P_T} \sum_{k=1}^{K} \rho_k(t)$ is the system energy efficiency; $\kappa_2(t)$ is the penalty term, which discourages the agent from taking actions such that the capacity of each user deviates too much from the QoS threshold; and $Z_{\kappa_1}(t)$ and $Z_{\kappa_2}(t)$ are the Z-scores (i.e., standardized results) of $\kappa_1(t)$ and $\kappa_2(t)$, respectively. $r_{\text{th}}$ is a threshold used to reduce the likelihood of violating the backhaul capacity constraint.
One of the challenges when transforming a traditional optimization problem into a DRL problem is to devise proper handling of the constraints in the original optimization problem. In this paper, the penalty term $\kappa_2(t)$ is designed to improve the QoS satisfaction in constraint $C_3$ by reducing the number of UEs whose achievable rates are much higher or lower than the capacity thresholds. The weights $\lambda_1, \lambda_2 \in [0, 1]$ control the significance of the corresponding terms. Operators can tune the weights $\lambda_1$ and $\lambda_2$ according to their needs, e.g., setting $\lambda_1 > \lambda_2$ at off-peak traffic periods and $\lambda_1 < \lambda_2$ at peak traffic hours to enhance each user's service experience. Note that the reward function defined in (8) may not be feasible in practice, and judicious settings of the associated weights $\lambda_1$ and $\lambda_2$ are needed, based on trial and error. Furthermore, in order to guide the agent to follow the backhaul link constraint, a penalty mechanism is introduced here for the agent to adjust the corresponding actions. If one of the SBSs experiences a sum rate that violates the backhaul capacity constraint, the agent receives a penalty and restarts a new episode in the learning process. On the other hand, in this paper, the cluster size constraint is dealt with by including only the legitimate discrete actions, each of which allows no more than $N_{\text{sub}}$ users in each cluster, in the constrained discrete action space for the entire learning process. This guarantees that the output of the discrete action satisfies the cluster size constraint. Finally, in order to accommodate the power constraint in $C_2$, the techniques mentioned above can also be utilized. Alternatively, in this work, the power constraint is modified such that each user's allocated power is restricted by $\sum_{f \in \mathcal{F}_k} P_{j,k,f} \le P_{k,\max}$, where the per-user maximum power is assumed to be $P_{k,\max} = \frac{1}{|\mathcal{C}_j|} P^{\text{SBS}}_{j,\max}$ for $k \in \mathcal{C}_j$. In this case, the total power constraint in $C_2$ is satisfied, though in a suboptimal fashion. This per-user power constraint can be enforced much more easily by the actor-parameter network in the proposed P-DQN.
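The two constraint-handling devices described above, restricting the discrete action space to associations that respect the cluster size limit and splitting each SBS's power budget equally among its UEs, can be sketched in Python as follows. Exhaustive enumeration is only an assumption that is practical for small networks such as the one simulated here; the function names and the power budgets are hypothetical.

```python
from collections import Counter
from itertools import product

def legitimate_associations(num_ue, num_sbs, n_sub):
    """All UE-to-SBS assignments with at most n_sub UEs per cluster.

    Each assignment is a tuple whose k-th entry is the SBS index serving UE k,
    so every UE is associated with exactly one SBS.
    """
    actions = []
    for assoc in product(range(num_sbs), repeat=num_ue):
        if max(Counter(assoc).values()) <= n_sub:
            actions.append(assoc)
    return actions

def per_user_power_cap(assoc, p_sbs_max):
    """Per-user cap P_k,max = P_SBS_j,max / |C_j| for UE k in cluster j."""
    sizes = Counter(assoc)
    return [p_sbs_max[j] / sizes[j] for j in assoc]

# Toy example: 5 UEs, 3 SBSs, at most 2 UEs per cluster (hypothetical limit).
actions = legitimate_associations(num_ue=5, num_sbs=3, n_sub=2)
caps = per_user_power_cap((0, 0, 1), p_sbs_max=[1.0, 2.0])
```

With five UEs and three SBSs there are 3^5 = 243 unconstrained associations; a limit of two UEs per cluster leaves 90 of them. Because the caps within a cluster sum to the SBS budget, any allocation that respects the per-user caps also respects the total power constraint.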

Parameterized Deep Q Network
Recent progress in deep RL (DRL) approaches has made the DRL, such as DQN, a viable technique to tackle various resource allocation problems in wireless networks. However, in order for DQN to be able to solve the joint power allocation and user association problem considered in this work, the continuous action space in the power allocation has to be quantized into discrete action space first. Quantization of the continuous action space may round off potentially optimal power allocations. Moreover, the complexity of the DQN increases exponentially with the dimension of the action space, leading to undesirable huge consumption of power and slowdown of convergence speed. To overcome this difficulty, in this paper, we propose employing the P-DQN [31,34] for the joint power allocation and user association because of its capability of solving problems with hybrid action space.
The parameterized action space is denoted by $\mathcal{A}_{PA} = \{(c, x_c) \mid x_c \in \mathcal{A}_c, \text{ for all } c \in \mathcal{A}_d\}$, where $\mathcal{A}_c$ and $\mathcal{A}_d$ are the continuous and discrete action spaces, respectively. When the discrete action $c$ takes all possible combinations into consideration without constraints, the discrete action space is $\mathcal{A}_d = \{[c_{j,k}] : c_{j,k} \in \{0, 1\}, k \in \mathcal{K}, j \in \mathcal{J}\}$. Each discrete action $c$ has a corresponding continuous parameter $x_c \in \mathcal{X}_c$, where $\mathcal{X}_c$ is the set of all users' power allocations $p^{\text{UE}}(t)$ in this work, for a discrete action $c$. The primary network of the P-DQN (without the stabilizing target networks) is presented in Figure 2. The primary network in Figure 2 consists of an actor-parameter network $x_c(s; \theta)$ with weights $\theta$, which maps the state $s$ and each discrete action to its corresponding continuous parameter, and an actor network $Q(s, c, x_c; \omega)$ with weights $\omega$, which evaluates the action-value Q-function, i.e., the long-term expected cumulative reward $Q(s, a)$ defined in (5), where the action can be explicitly represented by the 2-tuple $a = (c, x_c)$ to emphasize its hybrid nature. Typically, the weights $\theta$ in the actor-parameter network can be determined by maximizing the expected action-value function $\mathbb{E}[Q(s, c, x_c(s; \theta); \omega)]$. Furthermore, the weights $\omega$ in the actor network can be updated by minimizing the mean-squared error $\mathbb{E}[(y_t - Q(s_t, a_t; \omega))^2]$, where $y_t$ is the target value in the network and $a_t = (c(t), x_c(t))$ [28,31].
In order to stabilize the P-DQN, an additional target network, used in combination with the primary network shown in Figure 2, is built to produce the target value $y_t$ needed in the actor network [28]. Furthermore, to expedite the training process, an experience replay buffer $\mathcal{D}$ is implemented to provide random samples for evaluating the means appearing in the loss functions of both the actor network and the actor-parameter network. With the replay buffer, the loss functions for the actor-parameter network $x_c(s; \theta)$ and the actor network $Q(s, c, x_c; \omega)$ can be obtained by the following sample means:
$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} Q\big(s_i, c_i, x_{c_i}(s_i; \theta); \omega\big), \qquad L(\omega) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - Q(s_i, c_i, x_{c_i}; \omega)\big)^2,$$
where $y_i = r_i + \gamma \max_{c} Q\big(s_{i+1}, c, x_c(s_{i+1}; \theta^-); \omega^-\big)$ is evaluated by the target network for stability with weights $\theta^-$ and $\omega^-$, $(s_i, c_i, x_{c_i}, r_i, s_{i+1}) \in \mathcal{D}$ is sampled from the replay buffer, and $N$ is the size of the mini-batch (i.e., sample size). The weights $\theta$ and $\omega$ are updated according to
$$\theta \leftarrow \theta - \alpha_{a,p} \nabla_{\theta} L(\theta), \qquad \omega \leftarrow \omega - \alpha_{a} \nabla_{\omega} L(\omega),$$
where $\alpha_{a,p}$ and $\alpha_a$ are the learning rates for the weights in the actor-parameter and actor networks, respectively. At a given state $s$, the actor-parameter network produces the continuous parameters, which maximize the average $Q(s, c, x_c; \omega)$ for each discrete action $c$. Then, the actor network determines the action $(c, x_c) = \arg\max_{(c, x_c)} Q(s, c, x_c; \omega)$ after the action-value Q-function has been evaluated with the aid of the target network. In the training phase, an off-policy scheme is implemented, where the agent selects the action based on the $\epsilon$-greedy policy for exploration and generates the Q-target using the greedy policy. The algorithm of the proposed P-DQN with the target networks is summarized in Algorithm 1.

Algorithm 1 Parameterized Deep Q-Network (P-DQN) Algorithm with the quasi-static target networks.
Input: Learning rates $\alpha_a$, $\alpha_{a,p}$, exploration parameter $\epsilon$, mini-batch size $N$, a probability distribution $\xi$.
Determine the action parameters $x_c(s_t; \theta_t)$ by the actor-parameter network.
Select action $a_t = (c_t, x_{c_t})$ according to the $\epsilon$-greedy policy: with probability $\epsilon$, draw $a_t$ from the distribution $\xi$; otherwise, choose $c_t = \arg\max_{c} Q(s_t, c, x_c; \omega_t)$.
Take action $a_t$, observe reward $r_t$ and the next state $s_{t+1}$.
Store the experience $(s_t, a_t, r_t, s_{t+1})$ into $\mathcal{D}$.
Draw $N$ samples of experience $(s_i, a_i, r_i, s_{i+1})$ randomly from $\mathcal{D}$.
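The action-selection and target-evaluation steps of P-DQN can be sketched in plain Python as below. This is a minimal illustration, not the implementation used in the simulations: the networks are replaced by hypothetical callables (`param_fn`, `q_fn`), exploration picks a discrete action uniformly at random instead of sampling from the distribution ξ, and the gradient updates of the weights are omitted.

```python
import random

def select_action(state, discrete_actions, param_fn, q_fn, epsilon, rng):
    """Epsilon-greedy selection of a hybrid action (c, x_c).

    param_fn(state, c) plays the role of the actor-parameter network x_c(s; theta);
    q_fn(state, c, x) plays the role of Q(s, c, x_c; omega).
    """
    params = {c: param_fn(state, c) for c in discrete_actions}
    if rng.random() < epsilon:
        c = rng.choice(discrete_actions)  # explore (uniform choice is an assumption)
    else:
        c = max(discrete_actions, key=lambda d: q_fn(state, d, params[d]))  # exploit
    return c, params[c]

def td_target(reward, next_state, done, discrete_actions,
              target_param_fn, target_q_fn, gamma):
    """Target y = r + gamma * max_c Q(s', c, x_c(s'; theta^-); omega^-)."""
    if done:
        return reward  # no bootstrap at the end of an episode
    best = max(target_q_fn(next_state, c, target_param_fn(next_state, c))
               for c in discrete_actions)
    return reward + gamma * best

# Toy stand-ins for the two networks (hypothetical, for illustration only).
toy_param = lambda s, c: 0.1 * (c + 1)          # continuous parameter per discrete action
toy_q = lambda s, c, x: 1.0 - (x - 0.2) ** 2    # peaks when the parameter is 0.2

choices = [0, 1, 2]
greedy = select_action(None, choices, toy_param, toy_q, epsilon=0.0, rng=random.Random(0))
y = td_target(1.0, None, False, choices, toy_param, toy_q, gamma=0.95)
```

The sketch highlights the division of labor: the parameter function proposes one continuous parameter per discrete action, and the Q function then ranks the resulting hybrid actions, so the maximization runs only over the finite discrete set.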

Simulation Setup
In the simulation, a HetNet with three SBSs and five UEs uniformly located in a macrocell with a radius of 500 m is considered. The backhaul transmission model considered in [35] is adopted in the simulations. The MBS is equipped with 100 antennas and has 20 beamforming groups [35]. Slow Rayleigh fading channels are adopted for the simulations, where the channel remains unchanged throughout each episode. The Rayleigh channel coefficient is modeled as $h \sim \mathcal{CN}(0, 1)$. We also adopt the non-line-of-sight path-loss model for urban MBSs and SBSs [36]. Each subchannel is randomly allocated to a user, and the subchannel allocation is assumed to be known to the agent. The other settings of the simulation are summarized in Table 3. The Adam optimizer is employed for all DNNs that are embedded in the P-DQN. The $\epsilon$-greedy algorithm and Ornstein-Uhlenbeck noise are used for exploration of the discrete actions and the continuous parameters, respectively. We set the threshold $r_{\text{th}} = 0.1$, discount factor $\gamma = 0.95$, batch size $N = 128$, the maximum number of episodes as 2000, and the maximum steps per episode as 100. Other parameter settings used in the P-DQN are given in Table 4. The simulation codes used in this research can be found in [37].
Table 4. The settings for the deep neural networks used in the P-DQN.
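Two ingredients of this setup, the Rayleigh channel gain and the Ornstein-Uhlenbeck exploration noise, can be sketched in Python as follows. The discretized noise recursion and its parameters `theta` and `sigma` are common default choices assumed here for illustration; they are not values reported in the tables of this paper.

```python
import math
import random

def rayleigh_gain(rng):
    """Channel gain |h|^2 for h ~ CN(0, 1).

    The real and imaginary parts are independent N(0, 1/2), so the gain
    is exponentially distributed with unit mean.
    """
    re = rng.gauss(0.0, math.sqrt(0.5))
    im = rng.gauss(0.0, math.sqrt(0.5))
    return re * re + im * im

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process with unit time step (assumed form)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, x0=0.0, rng=None):
        self.theta = theta    # mean-reversion rate (assumed value)
        self.sigma = sigma    # noise scale (assumed value)
        self.mu = mu          # long-run mean
        self.x = x0           # current state of the process
        self.rng = rng or random.Random()

    def sample(self):
        # Pull the state toward mu, then add a Gaussian perturbation.
        self.x += self.theta * (self.mu - self.x) + self.sigma * self.rng.gauss(0.0, 1.0)
        return self.x

# Toy usage with fixed seeds for repeatability.
rng = random.Random(1)
gains = [rayleigh_gain(rng) for _ in range(20000)]
mean_gain = sum(gains) / len(gains)

noise = OUNoise(theta=0.5, sigma=0.0, x0=1.0, rng=random.Random(0))
first = noise.sample()
```

Because successive noise samples are correlated, the perturbation added to a continuous power parameter drifts smoothly instead of jumping independently at every step, which is the usual motivation for this kind of exploration noise with continuous actions.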

Performance Analysis
In the simulations, we compare the proposed P-DQN for joint user clustering and power allocation with the following approaches:
• Nearest SBS + Random Power: Each UE is associated with the nearest SBS. Random power means that each SBS serves the UEs in its cluster with random powers chosen such that the power limit is respected and the resulting sum rate within the cluster does not exceed the backhaul capacity limit.
• Best Channel + Random Power: Each UE is associated with the SBS with the best received signal power, which depends on the UE-SBS distance as well as the small-scale fading effect. Furthermore, each SBS serves the UEs in its cluster with random power allocations under the power and backhaul capacity constraints.
• P-DQN Association + Random Power: The user association policy is accomplished by the proposed P-DQN, whereas each SBS serves the UEs in its cluster with random powers under the total power and backhaul capacity constraints.
• Random Association + P-DQN Power: Each SBS allocates the power to its serving UEs based on the policy determined by the P-DQN. Each UE is randomly associated with one SBS in such a way that the random association policy obeys the backhaul link constraint.
We set the weights $(\lambda_1, \lambda_2) = (0.43, 0.16)$ for both the P-DQN with the cluster size constraint and the P-DQN without the cluster size constraint, which allows us to observe the effect of the cluster size constraint in $C_4$. Figure 3 depicts the average normalized reward versus steps over 500 realizations during the training phase. It shows the convergence of the user association and power allocation algorithm using the proposed P-DQN. Note that, while the P-DQN approach without the cluster size constraint provides a higher reward, as can be seen from Figure 3, it cannot guarantee all UEs' QoS. Figure 4 shows the effectiveness of the proposed P-DQN algorithm, as it outperforms the other approaches in terms of both the reward and the energy efficiency. The numerical values of the reward and energy efficiency obtained in Figure 4 and averaged over all time steps are summarized in Table 5. The improvement in the energy efficiency of the proposed P-DQN on average may reach 77.6% over the traditional DQN and 140.6% over the nearest-distance-based association approaches.
[Figure: horizontal axis, Steps (x100)]
Table 6 shows that each UE obtains 83.47% of the required QoS, and each SBS obeys the wireless backhaul capacity constraint at each time slot through the proposed P-DQN with the cluster size constraint. However, the percentage of QoS satisfaction for the methods with random power or random association is about 50%.
The percentage of QoS satisfaction of the proposed P-DQN with the cluster size constraint is higher than that of the other approaches, as a penalty term is used in the reward function to realize each UE's QoS as much as possible in the proposed P-DQN. For Random Association + P-DQN Power, the ratio is even smaller than 5%. For the DQN with the cluster size constraint, the ratio is even smaller than 1%, as the quantized power levels suffer from round-off imperfections, while the continuous action learned by the P-DQN allows for a better QoS with appropriate continuous power allocation.
In Figure 5, we analyze the interference power experienced by each UE under various approaches. The results in Figure 5a,c,e demonstrate that the interference each user experiences in the "Nearest SBS + Random Power" approach is higher than that in the "P-DQN with Cluster Size Constraint" approach and the "P-DQN Association + Random Power" approach. By comparing the results in Figure 5a,e, it can be seen that random power allocation does not have much impact on each user's received interference level. On the other hand, by comparing the results in Figure 5c,e, we observe a noticeably increased interference level in UE 3 and UE 4, which implies that the interference is largely dominated by the result of user association. The results here indicate that the user association policy learned by the proposed P-DQN with the cluster size constraint generally tends to match UEs and SBSs such that the inter-cluster interference is kept at a lower level, as shown in Figure 5a, which results in a higher system throughput.
The increased level of interference in UEs (such as UE 3 and UE 4) observed in Figure 5c can be explained with the aid of Figure 6, which plots the locations of each UE and SBS with their association status specified by colors. The association results and physical distances between each user and all SBSs in Figure 6 can provide insights into the interference level each UE experiences under different association strategies. For example, the increased interference in UE 4 in Figure 5c can be analyzed by comparing Figure 6b with Figure 6a. More specifically, while UE 4 in Figure 6a, associated with SBS 2 under the proposed P-DQN with the cluster size constraint, is interfered with only by inter-cluster signals transmitted from SBS 1, UE 4 associated with SBS 2 in Figure 6b under the "Nearest SBS" (i.e., nearest distance) association approach can potentially be interfered with by inter-cluster signals from SBS 0 and SBS 1. It can be seen from Figure 6b that SBS 0 could strongly interfere with UE 4, due to the short distance between SBS 0 and UE 4, thus leading to the increased interference level in UE 4 as shown in Figure 5c. Finally, in the case of the P-DQN approach without the cluster size constraint, the system suffers from intra-cluster interference, which significantly impacts the interference level at each user, as shown in Figure 5b. Since keeping an SBS active demands huge operational power, we can see from Figure 6c that the P-DQN without a cluster size constraint tends to have all UEs served by only one SBS in pursuit of less overall system power consumption.
In contrast with the user association schemes based on the distance or the channel quality between a UE and an SBS, the P-DQN-based user association tends to activate fewer SBSs, which leads to less consumption of overall operational power in the SBSs and results in a higher energy efficiency. More specifically, as illustrated in Figure 6a, where different colors refer to different clusters, we see that UE 0, UE 2 and UE 3 are associated with SBS 1, and UE 1 and UE 4 are connected to SBS 2, whereas SBS 0 is not active when employing the policy learned by the P-DQN with the cluster size constraint.

Conclusions
In this paper, we have studied the joint problem of user association and power allocation using P-DQN in the downlink of a two-tier HetNet without knowledge of the environment transition probability. The wireless network has been formulated as a parameterized action Markov decision process with a hybrid (discrete-continuous) action space. The P-DQN has been adopted as a model-free framework to avoid quantization noise resulting from rounding the continuous power space into discrete levels. With the consideration of realistic scenarios, we have designed the reward function as the energy efficiency with QoS constraint per user as well as backhaul capacity constraint. We have introduced a penalty mechanism when the constraints are violated. We have also utilized the cluster size constraint for intra-cluster interference mitigation. In simulations, the proposed P-DQN has been verified to outperform other traditional methods in terms of overall energy efficiency while satisfying QoS requirements and backhaul constraints. The improvement in the energy efficiency of the proposed P-DQN on average may reach 77.6% over the traditional DQN, both with the cluster size constraint. Meanwhile, the proposed P-DQN may still suffer from the curse of dimensionality when dealing with problems with sizable action spaces. It will be worthwhile to investigate advanced DRL techniques (such as the DDPG technique or the multi-agent RL), in future work, capable of handling the problems of joint user association and power allocation, which typically have large action spaces in scenarios of practical interest.