Multi-Objective Optimization of Energy Saving and Throughput in Heterogeneous Networks Using Deep Reinforcement Learning

Wireless networking using GHz or THz spectra has encouraged mobile service providers to deploy small cells to improve link quality and cell capacity using mmWave backhaul links. As green networking for lower CO2 emissions is essential to confront global climate change, we need energy-efficient network management for such denser small-cell heterogeneous networks (HetNets), which already suffer from noticeable power consumption. We establish a dual-objective optimization model that minimizes energy consumption by switching off unused small cells while maximizing user throughput, which is a mixed-integer linear problem (MILP). Recently, deep reinforcement learning (DRL) algorithms have been applied to many NP-hard problems in the wireless networking field, such as radio resource allocation, association and power saving, and can produce a near-optimal solution with fast inference time as an online solution. In this paper, we investigate the feasibility of the DRL algorithm for a dual-objective problem, energy-efficient routing and throughput maximization, which has not been explored before. We propose a proximal policy optimization (PPO)-based multi-objective algorithm using the actor-critic model, realized as an optimistic linear support framework in which the PPO algorithm searches for feasible solutions iteratively. Experimental results show that our algorithm can achieve throughput and energy savings comparable to those of CPLEX.


Introduction
Exponentially increasing mobile traffic accelerates the deployment of dense small cells operating on the 3 GHz spectrum under legacy macro cells, called a heterogeneous small cell network (HetNet), which offloads congested macro cells and eventually enhances the quality of user experience (QoE). User equipments (UEs) can have dual connectivity to the macro eNB (MeNB) and small eNB (SeNB) for control/data bearer splitting or download boosting. Such SeNB deployment is costly when backhauling to a network gateway (a MeNB in this paper). Millimeter-wave (mmWave)-based backhauling can reduce deployment efforts and provide gigabit data rates to UEs using huge bandwidths, such as 9 and 10 GHz, available at the 60 GHz band and E-band. Many measurement campaigns and demonstrations at 28, 38, 60 and 73 GHz have already shown the feasibility of mmWave use for mobile communication [1][2][3].
To overcome the short communication range of the mmWave link caused by its high path loss and low penetration, beamforming based on directional antennae and repeaters for amplification must be considered. Figure 1 shows the HetNet equipped with a multi-hop backhaul mesh network for long-range backhauling of the mmWave links, in which an SeNB unreachable by the MeNB can access the Internet through multi-hop relays of the SeNBs [4,5]. The mmWave-based backhaul mesh networks have several challenges, such as efficient radio resource management (RRM) [5,6], interference management [7,8], multi-hop routing [9], and energy saving [10].
where individual mobile agents learn an optimal policy to maintain connectivity while saving limited power. [44][45][46] introduced energy-saving mechanisms using DRL, wherein an agent controls the transmission power, association and sleep mode of SeNBs in a HetNet without multi-hop backhauls. To the best of our knowledge, this is the first work that investigates DRL to find the Pareto front of a multi-objective optimization problem of energy saving and throughput maximization in the HetNet with an mmWave-based multi-hop backhaul mesh.
Key motivations of this study are enumerated below:
• There has not been notable research on an energy-efficient multi-hop routing algorithm using DRL for an mmWave backhaul mesh of a dense HetNet;
• The DRL-based algorithm can be considered to find a Pareto front solution for the dual-objective optimization of energy saving and throughput maximization in the HetNet.
To solve our optimization problem, we adopt a proximal policy optimization (PPO)-based DRL algorithm [24], one of the popular policy-based DRL algorithms, which typically shows fast and reliable convergence in the training phase. The PPO algorithm can provide an online policy for controlling backhaul transmission and SeNB power in HetNets, and it is simple to implement yet comparable with the complicated trust region policy optimization (TRPO) [23] in terms of performance. However, it is a challenge for the PPO algorithm to find an optimum of the multi-objective problem if only the reward sum of conflicting objectives is given to an agent for training. Therefore, we consider a multi-objective reinforcement learning (MORL) approach [47] to find the Pareto front solutions.
Optimistic linear support (OLS) was proposed for MORL [48], in which an outer loop iteratively calls a single-objective solver based on the deep Q-network as a subroutine. In this paper, we propose PPO-based deep optimistic linear support (PDOLS), where the PPO algorithm iteratively solves the objective problem scalarized by a specific weight vector for rewards. In experiments, the proposed PDOLS searched optimal corner weights for the multiple objectives efficiently and produced outcomes similar to the optimal weights obtained through repeated experiments. Additionally, the PDOLS achieved notable throughput and energy saving compared to the CPLEX results [9]; the CPLEX achieves 35% energy savings and a 14 Mbps data rate without blockage, while the PDOLS achieves almost 28% energy savings and a 13.4 Mbps data rate. This performance reduction is small, considering that the CPLEX execution time is 30 min whereas the DRL inference time is 1 s. Furthermore, we improve the PDOLS with a scaled reward (PDOLS-SR) that adjusts the reward values according to the environment, which increases the probability of finding the optimal weight vector.
We highlight the key contributions of this study below:
• We propose a PPO-based online algorithm for the bi-objective problem of energy minimization and throughput maximization;
• We propose an integrated framework based on the PPO algorithm and OLS to find the Pareto front of the two objectives;
• We demonstrate the feasibility of the proposed online solution based on DRL in a HetNet environment.
The remainder of the paper is organized as follows. We introduce recent works on DRL for wireless networking solutions in Section 2, and offer an overview of the DRL background in Section 3. In Section 4, we establish the multi-objective optimization model for energy saving and throughput maximization in HetNets. We propose the PPO and PDOLS algorithm for the multi-objective optimization problem in Section 5. Section 6 shows our experimental results regarding performance of the learning algorithm and HetNet throughput. Finally, we discuss and conclude our study in Section 7.

Related Works
Previously, most of the NP-hard problems in the wireless communication and networking area were solved by linear approximation or heuristic algorithms, such as simulated annealing (SA), genetic algorithm (GA), particle swarm optimization (PSO), etc. Recent successes of the DNN technique in computer vision and speech recognition show the possibility of applying large-scale feed-forward neural networks to wireless networking. Therefore, the 1D or 2D convolutional neural network (CNN) that is popular for computer vision and image processing was used for wireless channel estimation with MIMO [49][50][51], automatic modulation and coding schemes [52][53][54] and network intrusion detection [55][56][57][58].
In contrast to the above supervised deep learning, artificial intelligence for controlling dynamics of the wireless networking system needs to be made naturally by past experience in the system. Such dynamic systems can be modelled by the MDP; at each step, a network agent acts based on the state and receives reward feedback for the action, such as successful transmission, packet loss, collision, saving power, etc. Using the collected experience data, the DRL algorithm can effectively find an optimal solution of the wireless networking system. The following studies have demonstrated feasibility of using DRL algorithms for wireless communication and networking during the last several years (refer to the summary in Table 1).

Table 1. References | Areas of DRL Studies on Wireless Communications
[26][27][28][29][30][31][32] | Cognitive radio and dynamic wireless channel selection increase spectral efficiency, which is typically a combinatorial problem of matching channels to nodes. Using DRL, agents can learn the optimal policy from the degree of interference as a reward for every action of channel selection.
[38][39][40][41] | The wireless link layer provides a media access scheme for multiple users which is realized in a MAC protocol. Several studies design the wireless MAC protocol based on the DRL algorithm, in which DRL agents learn an optimal transmission policy from the reward of contention resolution at a particular channel state.
[59][60][61] | A user association or handover algorithm for a serving base station affects throughput and QoS of each user. The DRL algorithm enables UEs to select an optimal base station based on past experience.
[33][34][35][36][37] | Wireless networks have various resources to be scheduled, such as radio blocks, channels, sequence codes, power, time slots, etc. Many of the scheduling problems have a non-convex feasible set and user mobility, which makes the problems intractable. The DRL agents learn an optimal scheduling policy repeatedly from resource utilization against a chosen allocation.
[42][43][44][62][63] | Energy and power consumption is critical, especially for green wireless networking, mobile edge cloud networks and UAV networks. The DRL algorithm explores possible policies based on the reward of energy saving while guaranteeing a throughput constraint.
Wang et al. [26] proposed a dynamic multi-channel access mechanism based on deep Q-learning. A node selects one multi-channel that has low interference, which returns the maximum reward for the action. Zhong et al. [27,28] used the actor-critic algorithm to explore the sensing policy for dynamic channel access and considered a multi-agent model for distributed sensors in a partially observable environment. Naparstek et al. [29,30] also proposed DQN-based multi-agents which act based on Q-value independently. Li et al. [31] applied the DQN for channel sensing, and Liu et al. [32] proposed a hierarchical deep Q-network (h-DQN) model for cooperative channel sensing, which divides the original problem into separate sub-problems for multi-DRL agents.
Ali et al. [38] introduced a Q-learning-based MAC protocol in dense WLANs which learns the optimal policy based on channel state and transmission action experience. Yu et al. [39] investigated a DRL-based MAC protocol for heterogeneous wireless networking which was called deep-reinforcement learning multiple access (DLMA). They established a new multi-dimensional RL framework based on the Q-learning that maximizes sum throughput and provides proportional fairness, even co-existing with TDMA-like ALOHA protocols. Al et al. [40] studied radio resource scheduling (RRS) in the cellular MAC layer using the DQN. Nisioti et al. [41] presented a MAC solution for sensor networks based on coordinated reinforcement learning by considering the dependencies among sensors to find the optimal actions. Zhao et al. [59] studied user association and radio resource allocation in a HetNet. For a large action space, they considered a multi-agent RL approach and a dueling double deep Q-network (D3QN) to obtain an optimal policy with little computation complexity. Zhang et al. [60] proposed a DRL algorithm for the association between each IoT device and a cellular user to maximize the sum rate of all the IoT devices in symbiotic radio networks (SRNs). Ding et al. [61] introduced the user association and power control scheme using the multi-agent DQN to ensure the UE's quality of service (QoS) requirements.
He et al. [33] proposed an orchestration framework in vehicular networks with a novel DRL algorithm for the resource allocation of networking, caching and computing resources. Shi et al. [34] modelled a hierarchical DRL-based multi-DC (drone cell) trajectory planning and resource allocation scheme for high-mobility users. In [35,36], the authors also conducted resource allocation for uplink nonorthogonal multiple access (NOMA) systems using a DRL-based algorithm to solve the nonconvex optimization problem. Rahimi et al. [37] also tried to increase scalability with a hierarchical DRL for joint user association and resource allocation in the NOMA system.
Liu et al. [43] introduced a novel DRL-based energy-efficient routing protocol called DRL-ER, which avoids the battery energy imbalance of constellations and guarantees a required end-to-end delay bound. Liu et al. [42] adopted a DRL-based energy-efficient control for coverage and connectivity in UAV communication systems. Du et al. [62] reviewed and analyzed how to achieve green DRL for radio resource management (RRM). Dai et al. [63] utilized DRL to design an optimal computation offloading and resource allocation strategy for minimizing energy consumption. El et al. [44] used DRL to solve the energy-delay trade-off (EDT) problem in a HetNet where small cells can switch to different sleep mode levels to save energy while maintaining QoS.
To the best of our knowledge, our study is the first to develop a PPO-based multi-objective algorithm that controls multi-hop routing and switching on/off SeNBs in HetNets, even though many previous works have applied the DRL algorithm to other optimization problems.

Deep Reinforcement Learning (DRL)
This section provides a brief overview of reinforcement learning (RL) and DRL. RL is a popular machine learning approach which allows agents to learn optimal behavior through trial-and-error interactions with a dynamic environment. A key strategy of the RL is utilizing statistics to obtain an optimal control decision (policy) in the form of the MDP. The MDP is modelled by $(S, A, P^a_{ss'}, R^a)$, wherein the state space is represented by $S$, the action space is represented by $A$, the state transition probability for a taken action $a$ is $P^a_{ss'}$ and the corresponding reward is $R^a$, and in which the policy as a function $\pi(s)$ specifies an action $a$ in each state $s$. Therefore, an optimal policy, $\pi^*$, maximizes the expected reward for future $T$ steps, $E\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $\gamma$ is a discount factor ($0 \le \gamma < 1$) for the infinite-horizon discounted model.
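As a small illustration of the discounted objective above, the following Python sketch computes $\sum_t \gamma^t r_t$ for a finite reward sequence. The function name and the sample rewards are hypothetical and are not taken from the experiments in this paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical three-step episode with unit rewards.
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

A smaller $\gamma$ makes the agent weigh immediate rewards more heavily than rewards collected later in the episode.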
For effective agent learning, the RL estimates a state-value function for a state $s$ under a policy $\pi$, $v_\pi(s) = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k+1}) \mid S_t = s\right]$, at a time step $t$. Additionally, suppose that a certain action, $a$, is taken in the state $s$; then, an action-value Q-function can be defined as $q_\pi(s, a) = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R^a(s_{t+k+1}) \mid S_t = s, A_t = a\right]$. According to the Bellman optimality equation, the optimal value function, $V^*(s)$, can be decomposed recursively as $V^*(s) = \max_{a \in A} \left[R^a + \gamma \sum_{s' \in S} P^a_{ss'} V^*(s')\right]$, which tells us that the expected return from the best action is the same as the state value of an optimal policy.
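The recursive form of the Bellman optimality equation can be illustrated by value iteration on a tiny tabular MDP. This is only a sketch for intuition: the two-state MDP, the data layout of `P` and `R`, and the function name are assumptions made here, and the paper itself uses DRL precisely because the transition model is unknown.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a [R[s][a] + gamma * sum_s' P(s'|s,a) * V(s')].

    P[s][a] is a list of (probability, next_state) pairs and
    R[s][a] is the reward for taking action a in state s.
    """
    values = [0.0] * len(P)
    while True:
        new_values = []
        for s in range(len(P)):
            best = max(
                R[s][a] + gamma * sum(p * values[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            new_values.append(best)
        # Stop when successive estimates differ by less than tol.
        if max(abs(x - y) for x, y in zip(new_values, values)) < tol:
            return new_values
        values = new_values


if __name__ == "__main__":
    # Hypothetical MDP: state 0 can stay (reward 0) or move to state 1 (reward 1);
    # state 1 always stays with reward 2.
    P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
    R = [[0.0, 1.0], [2.0]]
    print(value_iteration(P, R))
```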

Deep Q-Learning
As the state and action spaces become larger and continuous, function approximation is mandatory for Q-learning instead of using a legacy tabular form of actions and Q-values. Although the combination of RL and neural networks was considered a long time ago, it is only very recently that DRL algorithms based on deep neural networks (DNNs) have received much attention instead of the linear function approximation [20,64]. DNNs represent a function with higher complexity by employing a deep hierarchical layer architecture that constitutes a non-linear information processing unit. Deep learning approximates such a mapping function for statistical curve fitting with labeled training datasets.
The DRL utilizes the training process of the DNN based on datasets, which can improve learning speed and performance without the MDP model information ($R$ and $P_{ss'}$ are unknown). The DRL induces a policy based on a value function, $V^\pi(s)$, approximated by the DNN, which is trained using the batch of samples $(S, A, R, S')$ that an agent collects by interacting with the environment. In a sequence of discrete time, $\{t = 0, 1, 2, \ldots\}$, the agent selects an $\epsilon$-greedy action for the maximum reward given by $V^\pi(s)$; the $\epsilon$ provides randomness to explore and avoid local optima.
Mnih et al. introduced the deep Q-network (DQN) in [22], which is a seminal work for Q-function approximation based on DNNs. In particular, they addressed and solved two challenges in the DRL. First, deep learning assumes that the data samples are independent and identically distributed (i.i.d.), but the next state, $s'$, is actually correlated with the current state, $s$, in the MDP. Second, the target model for training is non-stationary, as the model parameters $\theta$ are updated at every iteration. For this, the DQN adopts an experience-replay buffer for the training and separates the main and target networks. The DQN updates $\theta$ of the main network by minimizing the temporal-difference error between the target value, $r + \gamma \max_{a'} Q(s', a'; \theta^-)$, and the state-action value function, $Q(s, a; \theta)$, which are given by the target and main network, respectively. The target network is periodically updated by the main network.
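The $\epsilon$-greedy selection mentioned above can be sketched as follows. The sketch assumes the agent holds one value estimate per discrete action; the function name and the sample values are hypothetical.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: uniform choice over all action indices.
        return rng.randrange(len(action_values))
    # Exploit: index of the largest estimated value.
    return max(range(len(action_values)), key=lambda a: action_values[a])


if __name__ == "__main__":
    estimates = [0.1, 0.9, 0.3]  # hypothetical value estimates
    print(epsilon_greedy(estimates, epsilon=0.1))
```

In practice $\epsilon$ is often decayed during training so that exploration dominates early and exploitation dominates late; that schedule is a common design choice and is not specified here.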

Policy Gradient and Actor-Critic
The DQN is ill-suited to high-dimensional and continuous action spaces, which demand an iterative optimization process at every step. Additionally, discretizing the continuous action values cannot avoid the curse of dimensionality due to the large number of actions, and quantization may lose important information about the action space. Therefore, the policy gradient (PG) algorithm is used mostly for high-dimensional and continuous actions [65,66]; it adjusts the model parameter, $\theta$, of a policy function in the direction of the stochastic policy gradient (SPG), $\nabla_\theta J(\pi_\theta)$.
The PG algorithm [21] can be implemented by the actor-critic architecture, in which the actor stochastically updates the $\theta$ of the policy function while the critic evaluates the policy and updates the action-value function approximator, $Q^w(s, a)$, in such a direction as to minimize the squared error of its estimate. As the dimension of the action space increases, the deterministic policy gradient (DPG), a special case of the SPG, is more efficient because it takes the expectation only over the state space, whereas the SPG integrates over both the state and action spaces: $\lim_{\sigma \downarrow 0} \nabla_\theta J(\pi_{\mu_\theta, \sigma}) = \nabla_\theta J(\mu_\theta)$.

System Model
In this section, we establish a mathematical system model of the HetNet with an mmWave backhaul mesh among SeNBs and MeNBs in which energy consumption and user traffic for the mmWave backhaul links and access links are formulated. In this model, we present dual objectives to minimize the energy while maximizing the user throughput. The symbols used in this model are described in Table 2.
Table 2. Parameters (P) and variables (V) used in the model.

Symbol Description
User demand data rate u V

Energy Consumption Model
The energy consumption of eNB $i$ is composed of two parts: energy consumption from access links toward UEs and from backhaul links toward other eNBs, where the energy consumption in the access network (AN) and backhaul network (BN) is $e_i^{AN}$ and $e_i^{BN}$, respectively.

AN Energy Consumption
According to the linear approximation [67] between relative RF output power and the power consumption of an eNB, energy consumption for the access links can be derived as in Equation (3), where $\Delta_p$ is a multiplier for load-dependent power consumption, which depends on the type of antenna (refer to Table 3) [67].
where the SINR is the signal-to-interference-plus-noise ratio, $P_i^{AN_{out}}$ is the power consumption of the transceiver for the access links for all associated UEs, and $0 < P_i^{AN_{out}} \le P_i^{AN_{max}}$. $P_i^{AN_{max}}$ is the maximum transmission power for the AN transceiver at the eNB $i$. The $P_i^{AN_{out}}$ can be scaled by the aggregated flow rate $F_i^{AN}$ against the link capacity, which is the same as the ratio of radio resource blocks (RBs) used by all associated UEs to the total available RBs ($N_i^{RB}$); the number of used RBs can be calculated by dividing the sum of the user data rates by the rate of a single RB (bandwidth $B^{RB}$ Hz). $N_{iu}^{AN_a}$ is the number of antennas for MIMO and $f_{iu}^u$ is the data rate for each UE. $x_{iu}^u \in \{0, 1\}$ is a binary variable that indicates the UE association with the eNB $i$. See Equations (6)-(11) in [69]. As shown in Equation (3), the eNB has a static minimum non-zero output power of the transceiver, $P_i^{AN_0}$, even when there is no associated UE. Accordingly, switching off unused eNBs is critical to save energy. Table 3 shows experimental values for the aforementioned parameters in this study, such as $P_i^{AN_{max}}$ and $P_i^{AN_0}$.
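To make the load-dependent behaviour concrete, the following Python sketch assumes the common linear form in which the consumed power is the static power plus $\Delta_p$ times the output power, with the output power scaled by the fraction of used RBs. This form, the function name, and all numeric values are assumptions for illustration only; they are not the exact Equations (3) and (4) nor the Table 3 values.

```python
def access_power(user_rates, rate_per_rb, n_rb, p_max, p_static, delta_p,
                 n_antennas=1, powered_on=True):
    """Assumed linear AN power model: n_antennas * (p_static + delta_p * p_out).

    p_out is p_max scaled by the ratio of used RBs to available RBs.
    A switched-off eNB is assumed to consume nothing.
    """
    if not powered_on:
        return 0.0
    # Used RBs = aggregated user data rate / data rate of a single RB.
    used_rbs = sum(user_rates) / rate_per_rb
    load = min(used_rbs / n_rb, 1.0)  # cannot exceed the available RBs
    p_out = p_max * load
    return n_antennas * (p_static + delta_p * p_out)


if __name__ == "__main__":
    # Hypothetical values: two users, 100 RBs, 1 W max output, 10 W static power.
    print(access_power([10.0, 15.0], 1.0, 100, 1.0, 10.0, 2.0))
    # Idle but powered-on eNB still draws the static power.
    print(access_power([], 1.0, 100, 1.0, 10.0, 2.0))
```

Under these assumptions, the gap between the idle and switched-off cases is what makes switching off unused eNBs worthwhile.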

BN Energy Consumption
The energy consumption of a BH link can be formulated similarly to the AN link: (i) static power ($P_i^{BH_0}$) of a transceiver for each backhaul link toward a next-hop eNB $j$, and (ii) dynamic power by the amount of aggregated user data rate that travels over that link: where $P_i^{BH_0}$ represents the minimum non-zero static power of each BH transceiver at eNB $i$. The dynamic power $P_i^{BH_{out}}$ of an mmWave backhaul link is derived by the multiplication of the band-wide transmission power $P_{ij}^{BH_t}$ and bandwidth efficiency, as below: where $B_{ij}^{max}$ is the maximum data rate for a backhaul link $ij$. The binary value $x_{ij}^u$ indicates the routing information, that is, whether a data flow of a user $u$ uses the backhaul link $ij$ or not.
where SNR is the signal-to-noise ratio satisfying $B_{ij}^{max}$, $N_{th}$ stands for the thermal noise, $NF$ stands for the noise figure and $PL$ represents the free-space path loss. The parameters $L_t$ and $L_r$ represent the transmitter and receiver losses, respectively, while $G_t$ and $G_r$ are the transmitter/receiver antenna gains and $L_m$ is the link margin.
The maximum transmitted power of a transceiver operating at frequency $f_{BH}$ may be given by where $EIRP_{max}$ denotes the maximum equivalent isotropically radiated power, and $P_i^{BH_{max}}$ is configured as 224 mW according to specifications in [70], as shown in Table 3.
Total energy consumption of the BN is the sum of the energy consumption of the available backhaul links, as below.
As a consequence, the energy consumption of each eNB depends on user data flows and the static power consumption. Unicast or broadcast control messages in the cell can consume extra energy in addition to the user traffic. In this study, we ignore the energy consumption of the control overhead, which is relatively small compared with that of the bearer traffic. In the following section, therefore, we define several constraints to switch on or off the SeNBs based on the presence of the data flows.

Switch On and Off Model
We introduce two binary variables, $s_i^{AN}$ and $s_i^{BN}$, that indicate whether the AN link and the BH link, respectively, are powered on or off at node $i$; that is:
$$s_i^{BN} = \begin{cases} 1 & \text{when all BH at } i \text{ are powered on}, \\ 0 & \text{when all BH at } i \text{ are powered off}, \end{cases} \quad \forall i \in N$$
The power status of the AN and BN, $s_i^{AN}$ and $s_i^{BN}$, is decided by the use of access or backhaul links. Accordingly, switch variables for AN and BN are configured by the presence of data flows, as below: For the multi-hop routing path of the user flows, a link $(i, j)$ of a powered-off eNB $i$ cannot be used, so that $x_{ij}^u = 0$: where $\varsigma$ is a big number (i.e., $10^8$).

Multi-Hop Routing Model
In this section, routing constraints are given for user data flows in the mmWave backhaul mesh network. First, a user data flow should satisfy the flow conservation rule in Equation (15). Second, a user data flow travels along a single path rather than multiple paths in Equation (16); in this study, we only consider single connectivity rather than dual connectivity. Third, a UE therefore has to associate with only one eNB in Equation (17).
where $R_u$ represents the demanded data rate of each UE $u$.
where $x_{ij}^u \in \{0, 1\}$ indicates the routing information of a user data flow, $f^u$.
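The routing rules above can be checked on a candidate path with a short Python sketch. It assumes the textbook form of single-path flow conservation (one net unit of flow leaves the source, one net unit enters the destination, and every relay is balanced), which may differ in notation from Equations (15)-(17); the node names are hypothetical.

```python
def satisfies_flow_conservation(links, src, dst):
    """Check unit-flow conservation for the links with x_ij = 1.

    links is a list of directed (i, j) pairs selected for one user flow.
    """
    nodes = {n for link in links for n in link} | {src, dst}
    for n in nodes:
        out_deg = sum(1 for (i, _) in links if i == n)
        in_deg = sum(1 for (_, j) in links if j == n)
        if src == dst:
            expected = 0
        elif n == src:
            expected = 1      # one unit leaves the source
        elif n == dst:
            expected = -1     # one unit arrives at the destination
        else:
            expected = 0      # relays forward exactly what they receive
        if out_deg - in_deg != expected:
            return False
    return True


if __name__ == "__main__":
    path = [("ue", "senb1"), ("senb1", "senb2"), ("senb2", "menb")]
    print(satisfies_flow_conservation(path, "ue", "menb"))
```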

Link Capacity and Scheduling Model
For capacity constraint, the data rate of each user flow and aggregated flows must be less than the access and backhaul link capacity. For instance, when more than one UE connects to the same eNB, they have to share the capacity on that access link.
Therefore, the AN capacity constraint is given as follows: where $C_i^{max}$ is the maximum capacity of eNB $i$ as the access link capacity. Additionally, the sum of the user flows on a given BH link is limited by the maximum capacity of the BH link: where $L^{BH}$ represents a set of BH links.
In the mmWave backhaul mesh network, we have to schedule transmissions among all links in the set of interfering links, $(i, j) \in I$. For duplexing, we first adopt time division duplex (TDD), which is used to separate transmission and reception on a BH link (i.e., different time slots are assigned for the transmission from eNB $i$ to $j$ and for the transmission from eNB $j$ to $i$). Similarly, time division multiplexing (TDM) is used to schedule transmissions among adjacent BH links. The following constraint ensures that the capacity of each BH link is shared among adjacent interfering BH links: The flow rate on the link $(i, j)$ can increase up to the given link capacity as the interference is reduced by switching off SeNBs with the interfering BH links $(i, j) \in I$.

Dual Objective Function
In this study, we have dual objectives, which are minimizing the total energy consumption of the HetNets while maximizing the sum of the data rate $R_u$ of each user $u$ with the aforementioned constraints:
$$\min \; \omega_1 \sum_{i \in N} e_i - \omega_2 \sum_{u \in U} R_u \tag{21}$$
s.t. Equations (2)-(20), where $\{\omega_1, \omega_2\}$ is a scaling vector that is used to impose a weight on each objective; $\omega_1$ and $\omega_2$ are for energy consumption and throughput, respectively.
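For intuition, the weighted trade-off can be evaluated with a few lines of Python. The sketch assumes the scalarized form "weighted total energy minus weighted total throughput", which follows from the minimization framing and the weight definitions but is an assumption about the exact shape of Equation (21); the constraints are not modelled and the numbers are hypothetical.

```python
def scalarized_objective(energies, rates, w_energy, w_throughput):
    """Assumed objective to minimize: w1 * sum(e_i) - w2 * sum(R_u)."""
    return w_energy * sum(energies) - w_throughput * sum(rates)


if __name__ == "__main__":
    # Two hypothetical configurations: more eNBs on (higher energy, higher
    # rate) versus fewer eNBs on (lower energy, lower rate).
    all_on = scalarized_objective([2.0, 3.0], [1.0, 1.0], 0.5, 0.5)
    some_off = scalarized_objective([2.0], [1.0, 0.5], 0.5, 0.5)
    print(all_on, some_off)
```

Which configuration scores lower depends on the weights, which is why a single fixed weight vector cannot expose the whole Pareto front.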

Deep Multi-Objective Reinforcement Learning in mmWave HetNet
In this section, we solve the optimization problem in Equation (21), which is not only non-convex but also contains dual objectives that conflict with each other. We introduce the PPO and PDOLS algorithms to effectively search for efficient solutions in the Pareto front of the dual objectives.

Proximal Policy Optimization
The TRPO is a stochastic policy-based optimization technique that can guarantee updates in the direction of increasing performance within a trust region. Schulman et al. [23] subsequently proposed a new policy optimization algorithm following the TRPO, called the PPO algorithm [24]. Since then, several algorithms such as TD3 [71] and soft actor-critic (SAC) [25] have been proposed, but the PPO is still a popular algorithm that retains some advantages of the TRPO. The PPO is easy to implement, using only first-order optimization, and is able to solve the data efficiency problem while achieving a performance similar to that of the complicated TRPO.
In the TRPO, the policy is updated by maximizing an objective function (the "surrogate" objective) subject to a specific constraint, as below:
$$\max_\theta \; \hat{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t\right] \tag{22}$$
$$\text{s.t.} \; \hat{E}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right] \le \delta \tag{23}$$
By applying the Kullback-Leibler divergence (KL) constraint between the old policy $\pi_{\theta_{old}}(a_t|s_t)$ and the current policy $\pi_\theta(a_t|s_t)$ in Equation (23), the TRPO can provide monotonic improvement to $\pi_\theta(a_t|s_t)$ at each iteration and prevent excessive updates by limiting the range $\delta$. However, it demands intensive computation for an approximate solution that is infeasible to analyze. Instead, the constraint is relaxed into a penalty with coefficient $\beta$ in Equation (24),
$$\max_\theta \; \hat{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t - \beta \, \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right] \tag{24}$$
in which the surrogate objective forms a lower bound to guarantee the performance of the policy $\pi$.
However, it is difficult to choose a constant value of $\beta$ that performs well across various problems. For this, a new surrogate objective function of the PPO is proposed to emulate the monotonic improvement of the TRPO. The new surrogate objective function is presented in Equation (25),
$$L^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \; \mathrm{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t\right)\right] \tag{25}$$
where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$ is the probability ratio and $\epsilon$ is the clipping range. Using the clip function, the PPO enables the surrogate objective function to avoid excessive policy updates while achieving a performance similar to that of the TRPO. In addition, the PPO collects fixed-length $T$ trajectory segments as a mini-batch and performs learning based on them repeatedly, which increases sample efficiency and learning stability.
For calculating $\hat{A}_t$, a truncated version of generalized advantage estimation (GAE) is used,
$$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1} \tag{26}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Due to the high sample complexity (i.e., the number of training samples required for successful learning) of our HetNet model, which probably increases the number of necessary samples and their variance, we apply the truncated version of GAE, which provides stable and steady learning in the PPO algorithm [72]. GAE can enable monotonic increments in reward by reducing the sample variance through the discount factors $\gamma$ and $\lambda$, like the TD($\lambda$).
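The effect of the clip function can be seen in a few lines of pure Python. This is a minimal sketch of the clipped surrogate as published for PPO [24], not the training code used in this paper; the function name, the log-probability inputs, and the clipping range of 0.2 are illustrative assumptions.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch.

    r is the probability ratio pi_new / pi_old, computed from log-probs.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside the clipping range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # Ratio 1.5 with a positive advantage is capped at 1.2 * A.
    print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))
    # Ratio 0.5 with a negative advantage is held at 0.8 * A.
    print(clipped_surrogate([math.log(0.5)], [0.0], [-1.0]))
```

In an actual implementation this quantity would be maximized by gradient ascent with an automatic differentiation library; the sketch only evaluates it.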
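A truncated advantage estimate of this kind is usually computed with a backward recursion over one collected segment. The sketch below uses the standard recursion, in which the residual $\delta_{t+l}$ receives the weight $(\gamma\lambda)^l$; the function name, the argument layout (a bootstrap value appended to the value list), and the default $\gamma$ and $\lambda$ are assumptions made for illustration.

```python
def truncated_gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates for one fixed-length trajectory segment.

    rewards has T entries; values has T + 1 entries, the last one being
    the bootstrap value of the state reached after the final step.
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(truncated_gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
```

Setting $\lambda$ closer to 0 lowers the variance of the estimate at the cost of more bias from the value function, and $\lambda$ closer to 1 does the opposite.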

MDP of mmWave-Backhaul HetNets
In this section, we define an MDP model $(S, A, R^a, P^a_{ss'})$ for our multi-objective optimization problem in the HetNets.

• State S: the state in the HetNet MDP is denoted by a traffic matrix that represents the traffic load $v_e \in [0, 1]$ at access and backhaul links, which eventually determines throughput and energy consumption. In particular, we define a single representative state for all access links of a certain eNB instead of the individual state to reduce state information, since the AN energy consumption from transmission power, $P_i^{AN_{out}}$, is calculated by the aggregated RBs of all associated users, as shown in Equation (4). Accordingly, the vector size of the state space is $|L^{BH}| + |N|$. We define the environment state as $s_t = \{v_1, v_2, \ldots, v_{|L^{BH}|}, v_{|L^{BH}|+1}, \ldots, v_{|L^{BH}|+|N|}\}$, where the index $e$ of each link $(i, j)$ is given by the environment at the beginning of the learning phase;
• Action A: the agent action is the routing and association of user flows, which actually decides a set of $x_{ij}$ binary variables, as discussed in Equations (16) and (17). However, such a discrete action space grows exponentially with the number of links, in which case convergence of the learning algorithm is rarely guaranteed and large memory is required for computation. Instead, we consider a weight matrix ($a_t \in \mathbb{R}^{|L|}$) of all links for all user flows, with which each flow finds a path using a link-state routing algorithm (e.g., the Dijkstra algorithm). Accordingly, the space complexity decreases from $O(2^{|L|})$ to $O(|L|)$, and the action comprises one weight per link. Unfortunately, such a shortest path algorithm leads most users to select a MeNB's AN link as a single-hop path, since cumulative weights along a multi-hop path are mostly higher than for a single hop. This prevents the DRL algorithm from exploring actions of multi-hop routing that may offer reward gain by increasing user throughput, $\sum_{u \in U} R_u$, more than the cost of energy consumption, $\sum_{i \in N} e_i$. Therefore, we limit the number of user flows for the MeNB in the routing algorithm, which admits the user flows to the MeNB only if the MeNB has available RBs; otherwise, users find multi-hop paths through SeNBs in the algorithm;
• Reward R: the reward is given by the objective function of Equation (21). Thus, we change the minimization objective to maximization by multiplying Equation (21) by −1. For normalization, the sum rate of all UE flows and the corresponding eNB energy consumption are divided by the sum of the maximum data rate and maximum energy consumption. Subsequently, the reward can be written as
$$r_t = -\omega_1 r_e + \omega_2 r_d \tag{31}$$
where $r_e$ and $r_d$ represent $\frac{\sum_{i \in N} e_i}{e_{max}}$ and $\frac{1}{|N|} \cdot \sum_{u \in U} \frac{R_u}{d_u}$, respectively.
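The way a per-link weight vector is turned into a route can be sketched with a plain Dijkstra search. The sketch assumes non-negative link weights and a dictionary keyed by directed links; the topology, node names, and weights are hypothetical, and the MeNB admission limit described above is not modelled.

```python
import heapq


def shortest_path(link_weights, src, dst):
    """Dijkstra over directed links; link_weights maps (i, j) -> weight >= 0.

    Returns the node list of the cheapest path, or None if unreachable.
    """
    adjacency = {}
    for (i, j), w in link_weights.items():
        adjacency.setdefault(i, []).append((j, w))
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in adjacency.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical action: the agent makes the direct MeNB link expensive,
    # so the flow is steered onto the multi-hop SeNB path.
    action = {("ue", "menb"): 5.0, ("ue", "senb1"): 1.0,
              ("senb1", "senb2"): 1.0, ("senb2", "menb"): 1.0}
    print(shortest_path(action, "ue", "menb"))
```

Because the agent only outputs one weight per link, the routing decision itself stays inside a deterministic algorithm, which is what keeps the action space linear in the number of links.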

PPO-Based DRL for HetNet Optimization
The aforementioned MDP model of our HetNet optimization has continuous state and action spaces; thus, the PPO can effectively perform the exploration of solutions without the excessive updates in Equation (25). We implement the PPO-based DRL algorithm in Algorithm 1, which is based on the actor-critic architecture.

In Algorithm 1, the actor network $\pi_\theta$, parametrized by $\theta$, provides a policy ($a_t$) according to the environment state ($s_t$). Meanwhile, the critic network, parametrized by $\phi$, presents the value estimate ($V_\phi(s_t)$). At the beginning, the PPO collects a total of T trajectory tuples $(S, A, R, S')$ (lines 2-11); subsequently, $\pi_\theta$ and $V_\phi$ are trained K times with the T collected tuples (lines 12-15). The parameters of $\pi_\theta$ and $V_\phi(s_t)$ are updated by Equations (32) and (33),
where $L(A_t, \theta_{old})$ is derived by Equation (25) at the given old parameter $\theta_{old}$, and $V_t^{GAE(\gamma,\lambda)}$ is a target value derived by Equation (26).
Since we implement both the actor network, $\pi_\theta$, and the critic network, $V_\phi(s_t)$, as multi-layer perceptrons (MLPs), backward propagation is conducted in the gradient update process; in this paper, we adopt Smooth L1 as an optimizer among Adagrad, Adam, Smooth L1, etc. Although the surrogate objective function of the PPO in Equation (25) is applied only to $\pi_\theta$, $V_\phi(s_t)$ is affected interactively within the actor-critic loop. Thereby, both the policy and the value can avoid excessive updates. The update process of the algorithm continues until the reward stops increasing and converges to a certain level.
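As an illustration of the two quantities used in these updates, the following sketch computes a generalized advantage estimate and a clipped surrogate objective in their commonly used textbook forms. It is an assumption that Equations (25) and (26) match these forms exactly; the function names and the values of `gamma`, `lam`, and `eps` are placeholders and are not taken from Table 4.

```python
import math

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one extra bootstrap value at the end
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximized w.r.t. the new policy)."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                       # pi_theta / pi_theta_old
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * a, clipped * a)            # pessimistic bound
    return total / len(advantages)

# Small numeric example with gamma = lam = 1 so the result is easy to check by hand.
adv = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
obj = clipped_surrogate([math.log(1.5)], [0.0], [1.0], eps=0.2)
```

In the example, the probability ratio of 1.5 is clipped to 1.2, which is how the clipping term limits the size of a single policy update.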

Multi-Objective Deep Reinforcement Learning
The PPO-based DRL algorithm can struggle to find Pareto fronts in the multi-objective MDP (MOMDP) problem, since it learns a policy with a single scalarized objective, which makes it unclear how much each objective contributes. As the reward of the MOMDP is a vector of n rewards, one per objective, $R(s_t, a_t) = r_t \in \mathbb{R}^n$ [47], a simple linearization such as $F(V^\pi, \omega) = \omega \cdot V^\pi$ can be used for the reward scalarization (i.e., a convex combination of the policy values, $V^\pi$), where $V^\pi$ is the value vector of a policy, $\pi$, and $\omega$ is a weight vector for the importance of the objectives [48].
Therefore, we propose the PDOLS algorithm to find an optimal solution for the MOMDP problem. Figure 2 depicts how the PPO and the OLS cooperate for the multi-objective HetNet problem. The OLS part provides the outer-loop framework that handles possible weight vectors, while the PPO part provides the actor-critic networks that update the policy and value. The outer loop incrementally constructs the convex coverage set (CCS), maintained as an intermediate approximated coverage set, S, by solving a series of single-objective MDPs scalarized by possible weight vectors; S eventually contains at least one optimal policy.
To reduce the training effort over all possible weight vectors, the OLS manages, in addition to S, the corner weights that indicate the break points of the piecewise-linear CCS as a lower bound. Thus, the OLS selects the weight vector for training only among the corner weights. When a new corner weight, $\omega'$, is discovered from the PPO learning, the obsolete value vectors are removed from S. Afterwards, the OLS selects the next corner weight in a priority queue for learning, as shown in Figure 2. The detailed procedures of the PDOLS are described in Algorithm 2. The discovered corner weights $\omega_1$ and $\omega_2$ of energy consumption and throughput are fed back to the PPO-based DRL to find a new lower bound of $V^\pi$ and its policy $\pi$ in Algorithm 2 (lines 5-18). At that time, the reward values $r_e$ and $r_d$ of energy consumption and throughput can affect the creation of the set $V^*_S(\omega)$. For instance, a new corner weight to be used for further learning and for finding a new $V^\pi$ is rarely found if the reward gap between the two objectives is large. Therefore, we scale down the reward values instead of using the original values from the environment, in order to increase the probability of finding new corner weights (line 10). $\hat{A}_t$ is calculated through Equation (26), and $\hat{A}_t \in A^2$ as $r_t \in \mathbb{R}^2$ (lines 12-13). To reflect the corner weight from the OLS in $\hat{A}_t$, $\hat{A}_t$ is updated by multiplying $[A^t_e, A^t_d]$ and $[\omega_1, \omega_2]$ (line 13). When convergence is achieved in the PPO learning process, the PPO sends a new $V_t = [v_e, v_d]$ to the OLS (line 18).
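The reward downscaling and the weighting of the two-dimensional advantage can be sketched as follows. This is an illustration under two assumptions: that the "multiplication" of $[A^t_e, A^t_d]$ and $[\omega_1, \omega_2]$ is an inner product yielding one scalar advantage per step, and that the downscaling is a constant factor (the default of 1/5 is the factor reported later in the experiments). The function names are hypothetical.

```python
def scale_rewards(reward_pairs, scale=0.2):
    """Scale down each per-step reward pair [r_e, r_d] by a constant factor."""
    return [[r_e * scale, r_d * scale] for r_e, r_d in reward_pairs]

def scalarize_advantages(adv_pairs, weights):
    """Combine per-objective advantages [A_e, A_d] with the corner weight [w1, w2]."""
    w1, w2 = weights
    return [w1 * a_e + w2 * a_d for a_e, a_d in adv_pairs]

# Example: two steps of per-objective values, corner weight (0.6, 0.4).
scaled = scale_rewards([[1.0, 2.0], [0.5, -1.0]])
scalar_adv = scalarize_advantages([[1.0, 2.0], [0.5, -1.0]], (0.6, 0.4))
```

The scalar advantages can then be used by the surrogate objective exactly as in the single-objective PPO, while the critic still reports one value per objective.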

Algorithm 2 PPO-Based Deep Optimistic Linear Support
1: Initialization:
2: S = partial CCS, which is composed of $V_t$ obtained after the PPO learning.
3: W = corner weights, which are obtained from S.
4: Q = priority queue of weights for the multi-objective, where the weights form a tuple along with their importance (i.e., ($[\omega^t_1, \omega^t_2]$, I)).
Instruction:
5: $\omega_t$ = Q.pop()
6: for iteration = 1, 2, ... do
7:   for iteration = 1, 2, ..., T do
8:     $a_t = \pi_{\theta_{old}}(s_t)$
9:     $[r^t_e, r^t_d], s_{t+1} = Env(a_t)$
10:     Reduce scaling of $[r^t_e, r^t_d]$
11:     $[A^t_e, A^t_d]$ = compute advantage estimate from Equation (26)
[...]
14:   end for
15:   Optimize surrogate L wrt $\theta$ from $\hat{A}_t$, with K epochs
16:   Optimize $V_\phi$ wrt $\phi$ from $V_t^{GAE(\gamma,\lambda)}$, with K epochs
17:   $\theta_{old} = \theta$, $\phi_{old} = \phi$
18: end for when convergence
[...]
26: for iteration = 1, 2, ..., $\omega_c$ do
27:   if estimate improvement of ($\omega'$, W, S) > $\tau$ then
[...]

The priority queue of the weights, Q, is initially configured with the extreme weights (i.e., [0, 1] and [1, 0]) and is updated whenever a new corner weight is found. The priority is determined according to the distance between $F(V^\pi, \omega')$ of the new corner weight $\omega'$ and the line formed by the values of the two adjacent corner weights on both sides of the new corner weight. In other words, the priority is proportional to the degree of downward convexity in $V^*_S(\omega)$. The OLS removes the obsolete $V_{del}$ and $\omega_{del}$ when creating a new $V^*_S(\omega)$ (lines 22 and 25). Depending on the improvement of the new corner weight, the OLS decides whether to add it to Q by comparing the improvement to a threshold $\tau$ (lines 26-28). We set $\tau$ to 0 to train aggressively for all discovered corner weights to find optimal values. Finally, the PPO and the OLS stop processing if no new corner weight is found and Q is empty (lines 32-33).
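To make the notion of a corner weight concrete, the following sketch computes the corner weights of a two-objective value set S. It assumes the weight vector has the form $(\omega_1, 1 - \omega_1)$, that scalarized values are maximized, and that $V^*_S(\omega)$ is the upper envelope $\max_{V \in S} \omega \cdot V$; the function names and the example values are hypothetical, and the priority and improvement computations of Algorithm 2 are not reproduced.

```python
def scalarized(v, w1):
    """F(V, w) = w1 * v_e + (1 - w1) * v_d for a two-objective value vector."""
    return w1 * v[0] + (1.0 - w1) * v[1]

def envelope(S, w1):
    """V*_S(w): the best scalarized value over all vectors currently in S."""
    return max(scalarized(v, w1) for v in S)

def corner_weights(S, tol=1e-9):
    """Return the w1 values where the maximizing vector of the envelope changes."""
    corners = {0.0, 1.0}  # the extreme weights are always included
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            a, b = S[i], S[j]
            slope_a = a[0] - a[1]
            slope_b = b[0] - b[1]
            if abs(slope_a - slope_b) < tol:
                continue  # parallel lines never intersect
            w = (b[1] - a[1]) / (slope_a - slope_b)  # intersection of the two lines
            on_envelope = abs(scalarized(a, w) - envelope(S, w)) < tol
            if tol < w < 1.0 - tol and on_envelope:
                corners.add(w)
    return sorted(corners)

# Example: two extreme policies and one balanced policy (made-up values).
S = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.8)]
corners = corner_weights(S)
```

Here the balanced vector dominates the envelope between the two break points, so the crossing of the two extreme vectors at $\omega_1 = 0.5$ is not a corner weight and would not be selected for training.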

Experiment
In this section, we evaluate the performance in terms of energy saving and user throughput, comparing the algorithms proposed in the previous section. We establish an experimental environment with 1 MeNB and 25 SeNBs that form a backhaul mesh network, as depicted in Figure 3, where the mmWave BH links (i.e., the gray dashed lines in Figure 3) connect the SeNBs to each other or to the MeNB for Internet access. Only 4 SeNBs can reach the MeNB directly, which limits the sum rate of all data flows to below the sum of their BH link capacities. Therefore, we assume that each UE, u, demands a maximum data rate ($d_u$) of 14 Mbps in this experiment with 100 UEs and 4 last-mile SeNBs, since those bottleneck BH links (i.e., the purple dotted lines in Figure 3) allow 14 Mbps per UE. To support a greater UE data rate, we can increase the BH link bandwidth or place more SeNBs within reach of the MeNB gateway.
A total of 100 UEs are randomly dropped over the MeNB and SeNB coverage area, where the SeNBs are 100 m apart and their cell coverage is more than 80 m. Accordingly, the UEs have more than one SeNB to associate with, in addition to the universal MeNB, depending on their location. Both the MeNB and the SeNBs provide microwave link access, denoted by AN links in Figure 3. The access and BH links are configured as in Table 3 for our experiment. In our study, the training and model update are performed interactively with the network simulator environment, based on parameters specified in the 3GPP standard and related works [68,69].
We build the actor-critic networks using a DNN with 2 hidden layers (64 × 64 perceptrons) of a fully-connected neural network to estimate the policy and value, respectively. The actor network for the policy receives the state field as input and returns the action field as output, as defined in Section 5.2. On the other hand, the critic network for the value is designed differently for the PPO and the PDOLS algorithms. Both algorithms receive the same input for the state field, but the PPO-based critic returns only one value, while the PDOLS-based critic returns two values, one for each of the dual objectives. Detailed parameters for the DRL are introduced in Table 4. For this experiment, we used the PyTorch library on a Linux 20.04 server equipped with an Intel i7-9700KF CPU, a GeForce RTX 2080 GPU and 32 GB RAM.
First, we evaluate the performance of the PPO-based DRL algorithm in the HetNet environment in terms of learning speed and convergence. For this, we configure the weight vector of energy consumption and data rate as $\omega_1 = 0.5$ and $\omega_2 = 0.5$, respectively, and the UE demand rate as 14 Mbps. Figure 4a shows the performance with learning rates varying from $1 \times 10^{-5}$ to $3 \times 10^{-4}$. The PPO algorithm shows good convergence of the reward as the training iterations continue, regardless of the learning rate. The reward increases rapidly during the initial training iterations and saturates after 40 K training iterations. A higher learning rate accelerates the reward convergence, but it skips over the better local optimum and is trapped in another; when the learning rate increases from $1 \times 10^{-4}$ to $3 \times 10^{-4}$, the converged reward decreases from 0.104 to 0.0899. The losses for the value and policy can be seen in Figure 4b,c, respectively. Both losses decrease drastically as the training iterations continue. Policy learning can avoid excessive updates owing to the clipping of the PPO, which makes the policy loss comparable regardless of the learning rate.
Additionally, the value loss follows the policy loss through the actor-critic interactions.
Figure 5 shows evaluations of the learning performance with varying reward weights ($\omega_1$, $\omega_2$). For this experiment, we configure the learning rate as $1 \times 10^{-4}$, which shows the fastest convergence with the highest reward. In Figure 5a, the rewards from energy consumption and user throughput converge at 50 K training iterations with the reward weight ($\omega_1 = 0.5$, $\omega_2 = 0.5$). Figure 5b,c show that the energy saving (i.e., 1 − consumed energy/maximum energy) and the mean data rate converge at different iterations according to the reward weight; the reward convergence is achieved at an average of 80 K training iterations, about 21.5 min on our server for each weight value. To find the optimal solutions, iterative learning for all possible weight vectors is needed. Therefore, the computation delay depends on the granularity of the weight values to explore; this experiment demands a total of 80 K · 7 iterations (i.e., 560 K iterations). System performance varies with $\omega_1$ of the energy consumption from 0.2 to 0.8 and $\omega_2$ of the UE's data rate, $1 - \omega_1$. When $\omega_1$ is set to 0.8, the maximum energy saving of 0.419 is achieved, while the UE's data rate is only 4.89 Mbps, its minimum value, because of their trade-off relationship. In contrast, the minimum energy saving, 0.134, allows the maximum data rate, 13.7 Mbps, with $\omega_1 = 0.2$. Consequently, the optimal weight for the maximum reward is found to be $\omega_1 = 0.6$ and $\omega_2 = 0.4$, which results in an energy saving of 0.272 and a UE data rate of 13.39 Mbps.
Next, we evaluate the PDOLS algorithm to find the optimal value and weight in a HetNet environment with a varying demand rate and number of UEs. In Figure 6a, the mean data rate satisfies every demand rate except 14 Mbps: the achieved rates are 6, 8, 10, 12, and 13.39 Mbps. The energy saving of the HetNet decreases as the demand rate increases: 0.42, 0.37, 0.31, 0.30, and 0.23. For these values, the $\omega_1$ of the optimal weight is 0.79, 0.72, 0.65, 0.64 and 0.57, with respect to each data rate.
Figure 6b shows the change in the number of active SeNBs during the learning procedure. Most of the 25 SeNBs are turned on at the beginning of learning, but after 80 K iterations, about 10-12 SeNBs are switched off, depending on the UE's demand rate. For a higher demand rate, more SeNBs are active to support the user traffic. Although the number of active SeNBs is the same for 10, 12, and 14 Mbps, the energy consumption increases, especially for 14 Mbps in Figure 6a, as the power consumption of the active links increases in proportion to the user traffic.
We evaluate the performance of the PDOLS again with different numbers of UEs (40, 70 and 100), where the demand data rate is configured to be 14 Mbps. Figure 7a shows that the energy saving increases as the number of UEs decreases. Accordingly, the user demand rate is mostly satisfied, except for 100 UEs. The energy saving is 0.46, 0.38, and 0.2, respectively, for each number of UEs. The corresponding numbers of active SeNBs are 6, 10, and 15, as shown in Figure 7b. Here, $\omega_1$ of the optimal weight is found to be 0.8, 0.66, and 0.57 for each case. For 40 UEs, the number of active SeNBs is around 18 initially and decreases to as few as 6 SeNBs, as the data flows of many UEs use the same multi-hop paths provided by the active SeNBs. Otherwise, isolated UEs that have no path through the SeNBs access the MeNB directly. Comparing with the result of 100 UEs at 6 Mbps, we can conjecture that a higher number of UEs induces network-wide deployment, which consumes more RBs of the MeNB and more transmission power for a smaller number of serving UEs.
Figure 8a compares the performance of the proposed algorithms discussed in Section 5, where the number of UEs and the demand rate are configured as 100 and 14 Mbps. A heuristic algorithm, which leads the UEs to associate with a less-loaded SeNB and use the shortest path to the MeNB gateway, performs worse than the others, with an energy saving of 0.16 and a data rate of 9.14 Mbps. Meanwhile, the PPO and the PDOLS show comparable results of 0.27 and 13.39 Mbps for the PPO and 0.23 and 13.79 Mbps for the PDOLS, where the optimal weight for the PPO is selected manually after iterative executions with different weight vectors, while the PDOLS algorithm automatically searches for the optimal weight values. The PDOLS-SR outperforms the other algorithms with 0.27 and 13.79 Mbps when the reward is scaled by 1/5.
Figure 8b shows the variation of the corner weight in the OLS framework of the PDOLS.
In our experiment, the PDOLS-SR conducts the training process 11 times (11 steps in the figure) to find the optimal weight, while the PDOLS does this only 7 times (7 steps). The PDOLS-SR can scavenge and explore more corner weights to find a near-optimal weight close to the PPO weight, 0.6 (the red solid line). The optimal ω 1 of the PDOLS-SR is 0.5872, while the ω 1 of the PDOLS is 0.5683. Further adjustment for downscaling of the reward, such as 1/10 or 1/15, only increases training time without notable performance enhancement.

Conclusions
In this paper, we solve a multi-objective optimization problem of throughput maximization and energy consumption minimization in a HetNet with a mmWave-backhaul mesh. For this, we implement a PPO-based DRL algorithm based on actor-critic architecture. However, the conventional PPO algorithm has limitations in its ability to cope with the multi-objective problem. Therefore, we propose PDOLS, which allows the PPO algorithm to interoperate with OLS as an outer loop to search for an optimal weight vector for the dual objectives. Experimental results show that the PPO-based DRL algorithm converges successfully with increasing rewards as training is iterated. Additionally, the learned solution of energy saving and user throughput is comparable to the CPLEX result. PDOLS can find a feasible weight vector for the dual objectives which is similar to the optimal weight that is identified manually using all possible combinations of the weight values.