Turbo Q-Learning (TQL) for Energy Efficiency Optimization in Heterogeneous Networks

To maximize energy efficiency in heterogeneous networks (HetNets), a turbo Q-Learning (TQL) scheme that combines a multistage decision process with tabular Q-Learning is proposed to optimize the resource configuration. To cope with the large action space, the energy efficiency optimization problem is formulated as a multistage decision process: according to the resource allocation of the optimization objectives, the initial problem is divided into several subproblems, each solved by tabular Q-Learning, so that the traditionally exponential growth of the action space is reduced to linear growth. By iterating over the solutions of the subproblems, the initial problem is solved, and a simple stability analysis of the algorithm is given. For the large state space, a deep neural network (DNN) is used to classify states, with the optimization policy of the proposed Q-Learning serving to label the samples. In this way, both the action-space and state-space dimensionality problems are addressed. The simulation results show that our approach converges, improves the convergence speed by 60% while maintaining almost the same energy efficiency, and allows flexible system adjustment.


Introduction
With the dramatically growing number of wireless devices, more stringent requirements are placed on the performance and energy efficiency of heterogeneous networks (HetNets) [1]. As the complexity of HetNets increases, energy efficiency optimization becomes ever more challenging and is one of the hot spots of communication network research, especially for HetNets with 5G BSs. Therefore, this paper studies an efficient resource allocation algorithm in which reinforcement learning (RL) is utilized and the parameters ABS (almost blank subframe), CRE (cell range expansion), and SI-SBSs (sleeping indication of small BSs) are jointly considered to optimize the energy efficiency of the whole network.
In general, optimization problems with multiple variables are non-convex NP-hard problems that are hard to solve directly. Some can be handled by dividing the original problem into subproblems that are solved iteratively with acceptable complexity. Baidas et al. [2] jointly considered subcarrier assignment and global energy-efficient (GEE) power allocation; the original problem was divided into two subproblems, subcarrier allocation via many-to-many matching and GEE-maximizing power allocation. By designing a two-stage solution procedure, the original problem was solved effectively with guaranteed stability. Chen et al. [3] jointly investigated task allocation and CPU-cycle frequency to minimize energy consumption, reducing the problem to the sum of two deterministic optimization subproblems by Lyapunov optimization theory. The optimal solutions of the two subproblems, local computation allocation (LCA) and offloaded computation allocation (OCA), were found separately to obtain an optimal solution for an upper bound of the original problem. Although decomposition and iteration are efficient for solving non-convex NP-hard problems in many cases, the modeling and computing complexity remains high in most cases.
As AI technologies have developed rapidly in recent years, learning methods have been introduced to solve complicated optimization problems. As shown in [4-9], model-free RL methods can be an efficient way to solve the energy efficiency optimization problem of HetNets, since a precise model of the process is not necessary. In [4,5], the Actor-Critic (AC) algorithm was applied to optimize the energy efficiency of HetNets, but the authors did not conduct in-depth research on the selection of basis functions, which is challenging for the application of RL. Roohollah et al. [6] introduced a Q-Learning (QL)-based distributed power allocation algorithm (Q-DPA) as a self-organizing mechanism to solve the power optimization problem in the networks. In [7], a QL-based method was proposed to solve the energy efficiency and delay problems of smart grid data transmission in HetNets, in which, however, the dimensions of the action and state spaces were too large. Ayala-Romero et al. [8,9] combined dynamic programming, neural networks, and data-driven methods to solve the problem of energy saving and interference coordination in HetNets.
In this paper, inspired by previous works [2,4-10] and by the idea of converting a non-convex NP-hard problem into several subproblems, a turbo QL (TQL) scheme is proposed to optimize energy efficiency, in which the traditional QL algorithm is decomposed into several sub-Q-Learning algorithms arranged in a loop iteration structure, each sub-QL solving one subproblem. In our scheme, the parameters ABS, CRE, and SI-SBSs are jointly taken into account as action vectors, the user positions are taken as the states in order to fully capture the randomness of users in BSs, and the reward function is designed as the negative reciprocal of the system energy efficiency. The problem of dimensional explosion with an increased action space is solved by our proposed TQL structure at an acceptable algorithmic complexity. For the states, a fully connected deep neural network is designed to identify the state type. The contributions of this paper are summarized as follows.
(1) The reward function is designed as the negative reciprocal of the system energy efficiency to avoid slow convergence and the risk of falling into a local optimum. If the magnitude of the reward in RL is too large, the learning easily falls into a local optimum, while too small a value can cause system oscillation or slow convergence. Directly using the energy efficiency as the reward function makes the reward value too large, so the learning easily falls into a local optimum. As shown in our experimental results, the designed reward function works well.
(2) The TQL is proposed by combining traditional Q-Learning with a multistage decision process in a loop iteration structure, each sub-QL solving one subproblem derived from the original optimization problem. It effectively deals with the dimensional explosion caused by the growing action space in RL and greatly reduces the complexity of the optimization problem.
(3) The relevant parameters of each subproblem can be adjusted independently, so the complexity of our proposed TQL algorithm is low. Simulation results show that the TQL algorithm solves the original problem efficiently and flexibly.
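Contribution (1) can be illustrated with a minimal sketch. The helper below is hypothetical (the paper does not give code), but it shows why the negative reciprocal keeps reward magnitudes small and bounded while preserving the ordering of configurations:

```python
def reward(energy_efficiency, eps=1e-9):
    """Agent reward as the negative reciprocal of the system energy
    efficiency, following contribution (1). `eps` guards against a
    division by zero and is an implementation detail, not from the paper."""
    return -1.0 / (energy_efficiency + eps)

# A higher energy efficiency yields a reward closer to zero, so the agent
# still prefers it, while the reward magnitude stays small and bounded.
assert reward(100.0) > reward(10.0)        # better EE -> larger reward
assert abs(reward(100.0)) < abs(reward(10.0))
```

Using the raw energy efficiency instead would make the reward scale with the system throughput, which is the instability the paper reports.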
The rest of the paper is organized as follows. Related works are summarized in Section 2. Section 3 introduces the system model. In Section 4, the energy efficiency model is formulated. Our proposed algorithm is presented in Section 5. Section 6 shows the simulation framework and numerical results, and conclusions are drawn in Section 7.

Related Works
According to the method used to solve the resource optimization problem in HetNets, related works are mainly classified into three categories: traditional optimization methods, machine learning based approaches, and neural-network-based ones.

Traditional Methods for Optimizing HetNets
For cluster sleeping, Chang et al. [11] utilized a genetic algorithm to achieve dynamic matching of energy consumption, and Li et al. [12] proposed a Gauss-Seidel method to optimize resources in HetNets. In [13], a low-complexity algorithm based on a many-to-many matching game between virtual SMSs and users was proposed to address the exponential growth of mobile data traffic and energy saving. Anany et al. [14] also utilized a matching game and proposed an association algorithm that jointly considered the rate and power of each wireless device to obtain the optimal association between the wireless devices and the best BSs according to a well-designed utility function. Wang et al. [15] modeled the location of each layer of BSs as an independent Poisson point process (PPP) to analyze the coupling between the probability of successful transmission and BS activation. Dong et al. [16] adopted a Poisson cluster process (PCP) method and analyzed the local delay of the discontinuous transmission (DTX) mode. To deploy and expand HetNets correctly and avoid co-channel interference (CCI), Khan et al. [17] proposed a new three-sector three-layer fractional frequency reuse technology (FFR-3SL). Tiwari et al. proposed a Bayesian minimum mean-square-error (MMSE)-based method to estimate user velocity in [18] and presented a handover-count-based minimum-variance unbiased estimator in [19].

Machine Learning for Optimizing HetNets
Chang et al. [20] utilized a dynamic programming method to optimize spectrum resources between eNBs and low power nodes (LPNs). Deb et al. [21] presented LeAP, a measurement data-driven machine learning model for power control of LTE uplink interference. Chen et al. [22] put forward a method based on hypergraph clustering to mitigate severe accumulated interference and improve system throughput under user-fairness requirements. Different from [8,9], Siddavaatam et al. [23] investigated an energy-aware algorithm based on ant colony optimization. Huang et al. [24] proposed a cross-entropy (CE) based algorithm with a sampling approach to address the user association problem in an iterative mechanism. Castro-Hernandez et al. [25] applied clustering algorithms and data mining techniques to identify edge users. Castro-Hernandez et al. [26] also presented clustering algorithms and data mining techniques allowing the BS to learn and recognize autonomously the received signal strength values reported by users in the handover (HO) process. Like [26], according to the trigger condition of user handoffs in a BS, Yao et al. [27] transformed the minimization of the number of user handoffs into the volume of data transmitted in a certain period of time as a reward function, and then used the historical information on transmitted data volume to approximate its expectation by a Monte Carlo algorithm. Different from the action design of our proposed scheme, Fan et al. [28] proposed decomposing the QL based on the joint user-BS state into two QLs based on the user state and the BS state.

Neural Networks for Optimizing HetNets
Many researchers have focused on solving heterogeneous network problems with neural networks. Different from Refs. [4,5], Li et al. [29] used a convolutional neural network (CNN) and a deep neural network (DNN) as the actor part and the critic part of the AC algorithm to optimize heterogeneous network resources. Chai et al. [30] proposed an access network model and an adaptive parameter adjustment algorithm based on a neural network model. The algorithm was applied to source and destination switching networks, and the input parameters of users in HetNets were dynamically adjusted according to the required QoS. Fan et al. [31] proposed a fuzzy neural network based on RL to optimize antenna tilt and power, achieving automatic collaborative optimization of both. Self-organizing network entities used cooperative Q-Learning and reinforcement back-propagation methods to obtain and adjust their optimization experience, achieving cooperative learning. Similar to [29], schemes combining neural networks with reinforcement learning were proposed in [32,33]. Since traditional iterative optimization methods, whether optimal or heuristic, usually need many iterations to achieve satisfactory performance, leading to considerable computational delay, Lei et al. [32] proposed, from a deep learning perspective, a feasible cache optimization method that trains the optimization algorithms through a DNN in advance instead of applying them directly in real-time caching or scheduling. The computational burden was transferred to a DNN training phase, significantly reducing the complexity of the delay-sensitive operation phase. Considering the complexity of base station power optimization in multi-layer heterogeneous networks, Sun et al. [33] proposed a dynamic pico-cell base station (PBS) operation scheme based on a CNN, which dynamically changes the on/off state of PBSs according to the users' real-time positions, thus reducing the total power of the BSs.
As the above analysis shows, with the increasing complexity of heterogeneous networks, more and more parameters need to be jointly considered to optimize the network system, and it becomes more difficult to solve the network optimization problem directly with traditional optimization schemes. Recently, machine learning technologies have become popular and are applied to the optimization of heterogeneous network resources. Model-free learning brings convenience in solving non-convex problems; however, the acquisition of samples, the design of data labels, and the establishment of the Markov decision process are all great challenges. In this paper, our proposed TQL algorithm combines model-free QL with a multistage decision process to optimize the allocation of network resources.

System Model
We consider a two-layer HetNets scenario as shown in Figure 1, in which a cell contains macro base stations (MBSs) and SBSs. The SBSs are randomly deployed within the coverage of the MBSs. The sets of SBSs and MBSs are denoted as S and M, respectively. The users (UEs) enter the cell randomly and, according to their association with the BSs, are divided into SBS UEs (SUEs) and MBS UEs (MUEs), associated with SBSs and MBSs, respectively. To balance the load of the entire network and reduce cross-layer interference by offloading users from MBSs to SBSs, the enhanced Inter-Cell Interference Coordination (eICIC) technology was proposed, with two important parameters, ABS and CRE, according to [34]. To reduce signal interference, MBSs and SBSs use radio resources in different time periods (subframes) according to eICIC. A frame is divided into ABS and non-ABS (nABS) subframes; the MBSs transmit normal power in nABS subframes and keep silent or transmit low power in ABS subframes, where the ratio of ABS subframes in a frame is denoted as α. The SBSs keep normal transmit power over whole frames. In the time domain, since the MBSs may be muted during an ABS subframe period, the interference of the MBSs to the users served by SBSs is reduced. Therefore, the SINR of UEs with poor channel conditions is improved, since there is no interference from MBSs in these ABS subframes.
In general, the power of MBSs is much higher than that of SBSs, so most UEs would access the MBSs according to the reference signal received power (RSRP): UEs in LTE networks are associated with BSs based on the RSRP policy, connecting to the BS with the highest reference signal. To balance load and improve system capacity, SBSs are designed to enhance the frequency reuse of the network. CRE was proposed to let SBSs extend their coverage by adding a bias to their RSRP, so that users just outside the edge of an SBS can be connected to it. The UEs located in the extended area of the SBSs receive less interference from MBSs in ABS subframes and obtain better channel gain, improving their SINR.
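The biased association rule can be sketched as follows. This is an illustrative helper, not from the paper; the function name and the dict-based interface are assumptions:

```python
def associate(rsrp_dbm, cre_bias_db):
    """Pick the serving BS: highest RSRP after adding the CRE bias.
    `rsrp_dbm` maps BS id -> measured RSRP (dBm); `cre_bias_db` maps
    BS id -> bias (dB), typically nonzero only for SBSs."""
    return max(rsrp_dbm, key=lambda bs: rsrp_dbm[bs] + cre_bias_db.get(bs, 0.0))

# Without bias the UE picks the stronger MBS; a 9 dB CRE bias expands
# the SBS coverage so the same UE is offloaded to the SBS.
rsrp = {"mbs": -80.0, "sbs": -86.0}
assert associate(rsrp, {}) == "mbs"
assert associate(rsrp, {"sbs": 9.0}) == "sbs"
```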
Due to the two operating modes of the MBSs, in ABS subframes and non-ABS (nABS) subframes, there are also two interference modes for the UEs in the downlink of the HetNets. When the MBSs are in ABS subframes, the UEs receive only the signals transmitted by the SBSs. When the MBSs are in nABS subframes, the UEs are interfered by the signals transmitted by both the SBSs and the MBSs.
The SINR_{k,n} of the UE n connected to the MBS k can be expressed as

SINR_{k,n} = P^M_{k,n} G_{k,n} / ( Σ_{j∈M_nABS, j≠k} P^M_{j,n} G_{j,n} + Σ_{j∈M_ABS} P^m_{j,n} G_{j,n} + Σ_{j∈S} P^S_{j,n} G_{j,n} + N_0 ), (1)

where P^M_{k,n} is the transmission power from the MBS k to the UE n in the nABS subframes, P^m_{j,n} is the transmission power from the MBS j to the UE n in the ABS subframes, G_{k,n} represents the channel gain from the MBS k to the UE n, P^S_{j,n} denotes the transmission power from the SBS j to the UE n, N_0 indicates the variance of the additive white Gaussian noise, and M_ABS and M_nABS denote the sets of MBSs in the ABS and nABS subframe periods, respectively. Note that the superscripts m and M are short for M_ABS and M_nABS, respectively.
The SINR_{k,n} of the UE n connected to the SBS k can be written as

SINR_{k,n} = P^S_{k,n} G_{k,n} / ( Σ_{j∈S_ABS∪S_nABS, j≠k} P^S_{j,n} G_{j,n} + Σ_{j∈M_nABS} P^M_{j,n} G_{j,n} + Σ_{j∈M_ABS} P^m_{j,n} G_{j,n} + N_0 ), (2)

where S_ABS and S_nABS denote the sets of SBSs in the ABS and nABS subframes, respectively. Hence, the transmission rate of the UE n connected to the BS k can be given as

R_{k,n} = B log_2(1 + SINR_{k,n}), (3)

where B is the system bandwidth.
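The SINR and rate definitions above can be sketched numerically. The code below is a simplified illustration in linear units (function names and the sample powers are assumptions, not values from the paper); it shows why muting the MBS in an ABS subframe raises a SUE's SINR and rate:

```python
import math

def sinr(p_serv, g_serv, interferers, n0):
    """SINR of a UE: serving received power over interference plus noise.
    `interferers` is a list of (tx_power, channel_gain) pairs, linear units."""
    return p_serv * g_serv / (sum(p * g for p, g in interferers) + n0)

def rate(bandwidth_hz, sinr_lin):
    """Shannon rate B * log2(1 + SINR), matching the rate expression above."""
    return bandwidth_hz * math.log2(1.0 + sinr_lin)

# In an ABS subframe the MBS is muted, so the SUE's interference set
# shrinks and its SINR (hence rate) improves.
n0 = 1e-13
sinr_nabs = sinr(1e-9, 1.0, [(1e-7, 0.01)], n0)  # MBS interferes (nABS)
sinr_abs = sinr(1e-9, 1.0, [], n0)               # MBS muted (ABS)
assert sinr_abs > sinr_nabs
assert rate(10e6, sinr_abs) > rate(10e6, sinr_nabs)
```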

Problem Formulation
To comprehensively optimize the energy efficiency of HetNets, the parameters SI-SBSs, ABS, and CRE should be jointly considered. The optimization problem is modeled by setting the energy efficiency of the system as the objective function. Based on the above analysis, we can establish the joint energy efficiency optimization problem (4), in which the association state x_{k,n}, the CRE bias of the SBS k, and the transmission power P_{k,n} are closely related, and the total power ∑_{k∈M∪S} P_k is closely related to the number of active SBSs. Let x_{k,n} represent the connection status between the BS k and the UE n, as expressed in (5): x_{k,n} = 1 indicates that a connection is established between the BS k and the UE n, and x_{k,n} = 0 otherwise. Constraint (6) states that each UE in the cell can only be connected to one BS. The transmission power from the BS k to the UE n at different subframe times is given by (7)-(9), where N_TRX is the number of BS transceivers, P^m_0 indicates the MBS power consumption in the sleep state, P^m_max represents the maximum transmission power of the MBSs, and R_k ∈ [0, 1] denotes the load factor of the BS, which depends on ABS, CRE, and the load density of the BSs. The SBS power P^S_{k,n} is given by (10), where e_k represents the active state (1 if active, 0 otherwise), P^s_max denotes the maximum RF output power of the SBSs, P^s_0 indicates the SBS power consumption without RF output, and P^s_sleep is the power consumption when the BS transceiver is in the sleep state. The switching term ∆ is

∆ = ϕ, when the base station k switches from the sleep state to the active state; ∆ = 0, otherwise, (11)

where ϕ represents the proportion of BSs that wake the transceiver from the sleep to the active state. Note that ABS, CRE, and the number of active SBSs all affect the load factor of the BSs, which makes problem (4) complicated and non-convex.
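The load-dependent BS power model described by (7)-(10) can be sketched as follows. This is a minimal illustration of the linear "static plus load-proportional" form that the definitions suggest; the parameter names (`delta_p` as the load slope) and sample values are assumptions, not the paper's exact equations:

```python
def bs_power(active, n_trx, p0, delta_p, p_max, load, p_sleep):
    """Sketch of a linear BS power model: an active BS draws a static
    power p0 plus a term proportional to its load factor `load` in [0,1];
    a sleeping BS draws only p_sleep per transceiver."""
    if active:
        return n_trx * (p0 + delta_p * p_max * load)
    return n_trx * p_sleep

# A fully loaded active BS draws more than an idle active one, which in
# turn draws more than a sleeping one (illustrative numbers).
full = bs_power(True, 2, 10.0, 4.0, 1.0, 1.0, 3.0)
idle = bs_power(True, 2, 10.0, 4.0, 1.0, 0.0, 3.0)
asleep = bs_power(False, 2, 10.0, 4.0, 1.0, 0.0, 3.0)
assert full > idle > asleep
```

This is why the number of active SBSs and the load factor R_k (shaped by ABS and CRE) drive the total power ∑ P_k in the objective.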
To fully account for the complexity and unknown characteristics of the real environment, the optimization problem (4) can be recast as (12), where the function f is unknown, β represents the CRE parameter, and |S| represents the number of activated SBSs.

Solution with a TQL Algorithm
It is difficult to solve (12) directly because the target optimization system is an unknown non-convex problem. The Gauss-Seidel method in [12] needs too much prior knowledge, which makes it less convenient than reinforcement learning. In this paper, the tabular reinforcement learning method QL is first used to optimize the system energy efficiency, and then our TQL algorithm is proposed to improve on it.

Q-Learning Algorithm
The environment is typically formulated as a finite-state Markov decision process (MDP) over a discrete time series t ∈ {0, 1, ..., ∞}. ABS and CRE are denoted by α ∈ A and β ∈ B, respectively. The activation state of the SBSs at step t is e_t ∈ E, where E = {0, 1}^|P| and |P| represents the number of SBSs in a cell. According to the Control Space Augmentors (CSA) concept in [9], the SBS states e_t can be derived from the number of SBS activations |S| in the cell, where A, B, and {0, ..., |P|} are finite sets of parameter configurations. Let S be a discrete set of environment states and A be a discrete set of actions. At each step t, the agent senses the environment state s_t = s ∈ S and selects an action a_t = a ∈ A, where s is the position of a certain number of UEs in the cell, S is the set of cell UE positions, a is a configuration of α, β, and |S| chosen to optimize the energy efficiency of the system, and A is the set of parameter configurations. As a result, the environment makes a transition to the new state s_{t+1} = s' ∈ S with probability P(s_{t+1} = s' | s_t = s, a_t = a) and thereby generates a reward r_t = r(s_t, a) ∈ R for the agent. The MDP is denoted as a tuple (S, A, P, R), where
• S is the finite state space;
• A is the finite action space;
• P is the set of transition probabilities;
• R is the reward function.
(1) State: The position of the users at step t is considered as the state s_t = s, and the set of states is denoted as S.
(2) Action: The parameter configuration (α, β, |S|) selected at step t is considered as the action a_t = a, and the set of configurations is denoted as A.
(3) State transition: The locations of the users in the cell change irregularly, so the state transition is random.
(4) Reward function: The optimization objective is the system energy efficiency, which would naturally serve as the reward function; in the actual simulation, however, this reward is too large and easily drives the system into a local optimum. Our solution is to use the negative reciprocal of the energy efficiency as the reward.
The goal of RL is to find the policy with the greatest expected cumulative reward,

max_π E[ ∑_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}) ], (13)

where the discount factor γ indicates the degree of influence of successor states on the current state, and r(s_t, a_t, s_{t+1}) represents the reward of state s_t selecting a_t and then transiting to state s_{t+1}. The best decision sequence of the MDP is solved by the Bellman equation. The state-action value function q(s, a) evaluates the current state; the value of each state-action pair is determined not only by the current state but also by the successor states, so q(s, a) is the expected cumulative reward starting from state s. The Bellman equation can be given as [35]

q_π(s, a) = E_π[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | a_t = a, s_t = s ], (14)

which is also equivalent to the recursive form

q_π(s, a) = E_π[ r_{t+1} + γ q_π(s_{t+1}, a_{t+1}) | a_t = a, s_t = s ]. (15)

The optimal action-value function Q*(s, a) = max_π q_π(s, a) can be written as

Q*(s, a) = ∑_{s'} P(s' | s, a) ( r(s, a, s') + γ max_{a'} Q*(s', a') ). (16)

The Q-value is updated with the temporal-difference method [35]:

Q(s, a) ← Q(s, a) + λ [ r + γ max_{a'} Q(s', a') − Q(s, a) ], (17)

where λ is the learning rate. According to Formula (17), the QL algorithm is utilized to solve problem (12), as shown in Algorithm 1.
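The temporal-difference update of Formula (17) can be sketched in a few lines. This is a generic tabular Q-Learning step (the state/action encoding is illustrative, not the paper's):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, lam=0.1, gamma=0.1):
    """One tabular update of Formula (17):
    Q(s,a) <- Q(s,a) + lam * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += lam * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)          # unvisited entries default to 0
actions = [0, 1]
q_update(Q, "s0", 1, r=-0.5, s_next="s1", actions=actions)
# With all successor values still 0, the first update moves Q(s0,1)
# a fraction lam of the way toward the observed reward.
assert abs(Q[("s0", 1)] - (-0.05)) < 1e-12
```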

Algorithm 1: The QL for optimizing the original problem.
Require: the set of states K, the set of actions X, learning rate λ_QL, greedy probability ε_QL, discount factor γ_QL, and threshold_QL.
Ensure: Q table.
1: Initialize Q(s, a), state s, and n = 0, and set threshold_QL;
2: while n <= threshold_QL do
3: In state s, select the optimal action a with greedy probability ε_QL;
4: Observe r;
5: Randomly transfer from s to s';
6: Update Q(s, a) according to Formula (17);
7: s ← s';
8: n = n + 1;
9: end while
10: Output: Q table;

TQL Algorithm
If QL is directly utilized to solve the original problem (12) as in Algorithm 1, the action space size X = (|P| + 1) × |A| × |B| is too large, where |·| denotes the cardinality of a set. Since the action is a three-dimensional vector, the optimization problem can be decomposed into three subproblems. We propose the TQL algorithm, which decomposes the objective optimization problem into three subproblems, optimizing the ABS ratio α, the CRE bias β, and the number of SBS activations |S|, to reduce the action space size.
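The size reduction is easy to quantify with the simulation setup of Section 6 (|P| = 6 SBSs, |A| = 8 ABS ratios, |B| = 6 CRE biases):

```python
# Joint action space of the monolithic QL vs. the total number of
# actions across the three sub-QLs of TQL.
P, A, B = 6, 8, 6              # |P|, |A|, |B| from the simulation setup
joint = (P + 1) * A * B        # (|P|+1) x |A| x |B| = 336 joint actions
decomposed = (P + 1) + A + B   # (|P|+1) + |A| + |B| = 21 actions in total
assert joint == 336
assert decomposed == 21
```

Adding a fourth jointly optimized parameter would multiply the first quantity but only add to the second, which is the growth-rate argument behind the decomposition.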

Sub-Problem A: Given the CRE Bias β and the Number of SBS Activations |S|, Optimize the ABS Ratio α
The action of sub-problem A is a_t = (α, β, |S|), where β and |S| are given, α ∈ A, and the action space size is |A|. The state is s = s_t. The tabular Q-Learning method can be used to solve sub-problem A, and the sub-Q-value update rule can be written as

Q_α(s, a) ← Q_α(s, a) + λ [ r + γ max_{a'} Q_α(s', a') − Q_α(s, a) ], (18)

as shown in Algorithm 2.

Sub-Problem B: Given the ABS Ratio α and the Number of SBS Activations |S|, Optimize the CRE Bias β
The action of sub-problem B is a_t = (α, β, |S|), where α and |S| are known, β ∈ B, and the action space size is |B|. The state is s = s_t. Like Formula (18), the sub-Q-value update rule can be written as

Q_β(s, a) ← Q_β(s, a) + λ [ r + γ max_{a'} Q_β(s', a') − Q_β(s, a) ], (19)

as shown in Algorithm 3.

Algorithm 3: Given the ABS ratio α and the number of SBS activations |S|, optimize the CRE bias β.

Sub-Problem C: Given the ABS Ratio α and the CRE Bias β, Optimize the Number of SBS Activations |S|
The action of sub-problem C is a_t = (α, β, |S|), where α and β are known, |S| ∈ {0, 1, ..., |P|}, and the action space size is |P| + 1. The state is s = s_t. Similarly to Formula (18), the sub-Q-value update rule can be written as

Q_|S|(s, a) ← Q_|S|(s, a) + λ [ r + γ max_{a'} Q_|S|(s', a') − Q_|S|(s, a) ], (20)

as shown in Algorithm 4.

5: Observe r;
6: Randomly transfer from s to s';
7: Update Q_|S|(s, a) according to Formula (20);
8: s ← s';
9: n = n + 1;
10: end while
11: Output: |S| = a;
The TQL algorithm solves the original problem (12) as shown in Algorithm 5:
3: Fix the CRE bias β and the number of SBS activations |S|, and calculate the ABS ratio α according to Algorithm 2. Pass the solved α to steps (4) and (5);
4: Fix the ABS ratio α and the number of SBS activations |S|, and calculate the CRE bias β according to Algorithm 3. Pass the solved β to steps (3) and (5);
5: Fix the ABS ratio α and the CRE bias β, and calculate the number of SBS activations |S| according to Algorithm 4. Pass the solved |S| to steps (3) and (4);
6: n = n + 1;
7: end while
8: Output: α, β, |S|;
In summary, our scheme changes the size of the action space from the traditional exponentially growing X = (|P| + 1) × |A| × |B| to the linearly growing X = (|P| + 1) + |A| + |B|, which greatly reduces the dimension and size of the action space.
Algorithm 5 can be considered as a multi-stage decision process optimization problem, as shown in Figure 2. The action spaces of the third, fourth, and fifth steps of Algorithm 5, i.e., of the optimization problems of Algorithms 2-4, are denoted A', B', and C', which are the finite sets A, B, and {0, 1, ..., |P|}, respectively. As seen in Figure 2, the state space of each of these steps is the Cartesian product of the other two action spaces, with sizes |B'||C'|, |A'||C'|, and |A'||B'|, respectively. The state transition equation refers to the transition from state b_i c_i to state a_i b_i conditioned on taking action a_i, with probability P(s' = a_i b_i | s = b_i c_i, action = a_i) = 1, where a_i ∈ A', b_i ∈ B', and c_i ∈ C'. We assume that there is no interference in the transition process, so the transition probability is 1. If the multi-stage decision process forms a closed loop, i.e., b_i c_i = b_k c_k or a_i c_i = a_k c_k or a_i b_j = a_k b_l occurs, Algorithm 5 is stable. Since |A'|, |B'|, and |C'| are all bounded, such a repetition appears within at most min(|B'||C'|, |A'||C'|, |A'||B'|) + 1 transitions over the whole stage, which indicates that Algorithm 5 is stable. In Section 6, Figure 6 further illustrates that Algorithm 5 converges to a near-optimal solution.

Neural Network for the Classification of States
In the previous subsection, the TQL algorithm solves the problem of action-space dimension explosion. To give the agent the ability to classify states, a DNN is designed whose structure is shown in Figure 3; it has two hidden layers with 512 nodes each. The activation function of each hidden layer is the rectified linear unit (ReLU), ADAM (adaptive moment estimation) [36] is utilized as the update algorithm, and the learning rate is 0.001. The optimal strategy of TQL is used to label the samples, specifying the optimal action in each state. The samples are sets of user positions gathered from the TQL algorithm; the input is the location information of the users, and the output is the optimal action encoded by its index in the action space.
We tried networks with one, two, and three hidden layers, and the experimental results showed similar training performance. The two-hidden-layer DNN was relatively more effective with respect to training speed and performance, which is why we use two hidden layers.
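The classifier of Figure 3 can be sketched as a plain NumPy forward pass. The two 512-unit ReLU hidden layers match the text; the input size (user coordinates) and the 336-way output are assumptions taken from the example action-space sizes, and training (cross-entropy with Adam, learning rate 0.001) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(n_in, n_hidden=512, n_actions=336):
    """He-initialized weights for a 2-hidden-layer ReLU classifier whose
    output logits index an action configuration (sizes are assumptions)."""
    sizes = [n_in, n_hidden, n_hidden, n_actions]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def predict(params, x):
    """Forward pass: ReLU hidden layers, then argmax over action logits."""
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W, b = params[-1]
    return int(np.argmax(x @ W + b))

params = init_mlp(n_in=100)               # e.g., 50 users x (x, y) coordinates
a = predict(params, rng.standard_normal(100))
assert 0 <= a < 336                       # a valid action index
```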

Numerical Simulation
The parameters of the experimental simulation are set according to the 3GPP LTE-A HetNets framework [37], and the wireless channel is modeled with deterministic path loss attenuation and random shadow fading. The deployed scenario is that each MBS covers the users in a 120° sector, shown as the shaded part in Figure 4, and is interfered by three other MBSs. Six SBSs are randomly deployed within the coverage of the MBS in the green shaded part of Figure 4 and select their working mode according to the load conditions. The coverage radii of the MBS and the SBSs are 500 m and 100 m, respectively. The thermal noise power is −176 dBm, the system spectrum bandwidth is 10 MHz, and the antenna gains of the MBSs, SBSs, and UEs are 14 dBi, 5 dBi, and 0 dBi, respectively. The maximum transmission powers of the MBSs and the SBSs are set to 46 dBm and 30 dBm, respectively. The probabilities of a user entering the cell accessing an MBS and an SBS are 1/3 and 2/3, respectively. The proportion ϕ of BSs that wake up is set to 0.5. Although an LTE frame includes 10 subframes, the ABS pattern has a periodicity of eight subframes, so the ABS ratio α ∈ [0, 1] belongs to the set {0/8, ..., 7/8}. CRE is denoted by β ∈ B, where B = {0, 6, 9, 12, 15, 18}. The specific parameters are listed in Table 1.
Figures 5 and 6 show the relationship between iterations and accuracy for the Q-Learning algorithm and our TQL algorithm, where the learning rate, discount factor, and greedy rate are all set to 0.1, and the numbers of users are set to 50, 100, 150, and 200, respectively. Figure 5 shows that the tabular Q-Learning algorithm converges after 800 × 1000 = 800,000 iterations under different load conditions. Our proposed TQL algorithm converges after 800 × 400 = 320,000 iterations, as shown in Figure 6.
We can see that the convergence speed of our proposed TQL algorithm is about 60% higher than that of Algorithm 1. Note that the convergence speed of the proposed TQL algorithm is still much faster than that of Algorithm 1, especially when the action space cardinality is very large, as follows from the analysis of Algorithm 5. In Figure 7, when the iteration numbers of TQL represented by the green line and the red line exceed about 75 and 30, respectively, our proposed TQL algorithm has converged, with final accuracy rates of about 98% and 90%. Although the convergence represented by the green line is relatively slower, its final accuracy is higher and its performance better than that of the red line. Our proposed TQL algorithm can balance performance against convergence speed according to actual requirements by adjusting the iterations of the subsystems, as in the case indicated by the black line, where the numbers of sub-QL iterations for SBSs, ABS, and CRE are set to 1000, 500, and 500, respectively. The iteration number of TQL represented by the cyan line exceeds about 50, with a final accuracy of about 93%. With the other experimental parameters the same as in Figure 7, Figure 8 shows the influence of different learning rates of the sub-QL algorithms on convergence speed and accuracy. Take the red line and the cyan line as examples, for which the learning rates of the sub-QLs for SBSs, ABS, and CRE are set to 0.1, 0.05, 0.05 and 0.05, 0.01, 0.01, respectively. When the iteration numbers of TQL represented by the red line and the cyan line reach about 30 and 70, our TQL algorithm has converged, with final accuracy rates of about 95% and 99%, respectively. The convergence represented by the red line is faster than that of the cyan line, but its final accuracy is lower.
Note the problem caused by the learning-rate setting represented by the green line, for which all learning rates are set to 0.01: if the learning rate is set too low, the system falls into a local optimum and the global optimum cannot be found. This mistake is easy to make in RL.

With the other experimental parameters the same as in Figures 7 and 8, Figures 9 and 10 are analyzed in the same way. The conclusion drawn from Figures 7-10 is that a balance between performance and convergence speed can be obtained by changing the corresponding parameters of the sub-QLs in our TQL algorithm. Our TQL algorithm therefore offers greater flexibility in parameter adjustment than a general system in which only one set of parameters is available.

Figures 11-13 show examples of the sample classification of our designed DNN when the ABS ratio, CRE bias, and number of activated SBSs are set to (0, 0, 6), (3/8, 6, 6), and (7/8, 18, 6), respectively. The labeled samples are obtained from the optimal strategy of the TQL algorithm; 90% of them are training samples and the rest are test samples. The red dots represent the macro base stations, where the macro base station numbered 0 is the cell signal source and the other numbered macro base stations are interference sources. Blue and green dots indicate the locations of the SBSs and the users in the cell, respectively.

Figure 14 shows how the QL, TQL, and ADP ES IC [9] algorithms optimize the power consumption of HetNets. When the number of users is less than 100, the power consumption obtained by the QL algorithm is lower than that of the TQL proposed in this paper. However, when the number of users is greater than 100, the power consumptions of the two algorithms are the same, lying between the maximum and minimum power consumption.
The ADP ES IC algorithm optimizes power consumption best in the case of 50 users, and Figure 14 shows that its power consumption control of the entire system is better than that of QL and TQL over the entire 10-200 user interval. However, as Figure 15 shows, this result is obtained at the expense of energy efficiency.

Figure 15 shows the energy efficiency of HetNets optimized by the QL, TQL, and ADP ES IC algorithms, respectively. The green solid line and the red dotted line represent the theoretical optimal energy efficiency and the energy efficiency obtained by our TQL algorithm, respectively. The left sub-picture of Figure 15 shows that the energy efficiency optimized by our TQL algorithm is very close to the theoretical optimum, and the right sub-picture shows that it is slightly below the theoretical optimum. According to the analysis of Figure 6, this gap, visible in the right sub-picture of Figure 15, arises because the TQL algorithm has not found the optimal solutions (i.e., the Pico BS, ABS, and CRE configurations) in some states. Nevertheless, Figure 15 shows that the energy efficiency of TQL is very close to the theoretical optimum: even in the states where TQL does not find the optimal configuration, the solution it finds is close to optimal (a suboptimal solution), and the resulting performance loss for the system is small. The ADP ES IC algorithm performs poorly in energy efficiency optimization, mainly because its authors focus on optimizing the power consumption of the system in HetNets.
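The DNN-based state classification described above, with labeled samples obtained from the TQL policy and a 90%/10% training/test split, can be sketched with a minimal one-hidden-layer network. The data below are a synthetic stand-in, and the layer sizes and learning rate are assumptions rather than the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the labeled data: state features (e.g., user/SBS
# geometry) mapped to the configuration class chosen by the TQL policy.
X = rng.standard_normal((200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # two toy configuration classes

split = int(0.9 * len(X))                      # 90% training, 10% test samples
Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]

# One hidden ReLU layer and a softmax output (assumed sizes).
W1 = rng.standard_normal((4, 16)) * 0.1; b1 = np.zeros(16)
W2 = rng.standard_normal((16, 2)) * 0.1; b2 = np.zeros(2)

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)           # hidden activations
    z = h @ W2 + b2
    p = np.exp(z - z.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True) # class probabilities

for _ in range(1000):                          # full-batch gradient descent
    h, p = forward(Xtr)
    g = p.copy(); g[np.arange(len(ytr)), ytr] -= 1.0; g /= len(ytr)
    gh = (g @ W2.T) * (h > 0)                  # backprop through the ReLU
    W2 -= 0.5 * h.T @ g;  b2 -= 0.5 * g.sum(0)
    W1 -= 0.5 * Xtr.T @ gh; b1 -= 0.5 * gh.sum(0)

test_acc = (forward(Xte)[1].argmax(axis=1) == yte).mean()
```

On this separable toy data the classifier reaches high test accuracy; in the paper, the labels instead come from the optimal (ABS, CRE, SBS) configurations found by the TQL algorithm.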

Conclusions
When RL is used to jointly optimize resources to maximize the energy efficiency of HetNets, the action space and the state space become too large. We propose a novel QL algorithm (TQL) based on a multistage decision process to effectively solve the problem of an excessive action space; the analysis of Algorithm 5 shows that its advantage grows with the size of the action space. For the dimensionality problem of the state space, a DNN is designed to classify states, where the optimization policy of the novel Q-Learning is used to label the samples. In addition, whereas a general RL algorithm offers only one set of adjustable parameters, the proposed TQL can adjust the parameters of each subproblem separately according to system requirements, making the proposed algorithm more flexible and allowing the system to be further optimized. Thus, the dimensionality problems of both the action and state spaces are addressed. Finally, the simulation results prove that the proposed algorithm is effective and feasible, improving the convergence speed by 60% compared with tabular QL.