Deep Reinforcement Learning Based Resource Allocation for D2D Communications Underlay Cellular Networks

In this paper, a resource allocation (RA) scheme based on deep reinforcement learning (DRL) is designed for device-to-device (D2D) communications underlay cellular networks. The goal of RA is to determine the transmission power and spectrum channel of D2D links to maximize the sum of the average effective throughput of all cellular and D2D links in a cell accumulated over multiple time steps, where a cellular channel can be allocated to multiple D2D links. Allowing a cellular channel to be shared by multiple D2D links and considering performance over multiple time steps require a high level of system overhead and computational complexity so that optimal RA is practically infeasible in this scenario, especially when a large number of D2D links are involved. To mitigate the complexity, we propose a sub-optimal RA scheme based on a multi-agent DRL, which operates with shared information in participating devices, such as locations and allocated resources. Each agent corresponds to each D2D link and multiple agents perform learning in a staggered and cyclic manner. The proposed DRL-based RA scheme allocates resources to D2D devices promptly according to dynamically varying network set-ups, including device locations. The proposed sub-optimal RA scheme outperforms other schemes, where the performance gain becomes significant when the densities of devices in a cell are high.


Introduction
As the demand for mobile and wireless communications grows, a large number of spectrum resources are needed to accommodate the increasing number of mobile and wireless users. Since the amount of available spectrum is limited, there are severe shortages of spectrum resources in modern wireless communication systems. Consequently, it is critical for cellular networks to support a high number of mobile users with limited spectrum resources while maintaining a high quality of services (QoS). Device-to-device (D2D) communication technology was introduced as a promising solution to resolve the spectrum shortage problem [1]. By using the D2D technology, mobile users in the proximity are able to communicate with each other directly without imposing heavy loads on cellular networks. The scope of D2D communications has been extended to vehicle-to-vehicle (V2V) and vehicle-to-everything (V2X) systems and the D2D technology is considered a key technology in fifth-generation (5G) wireless communications. Many researchers have focused on D2D technology and have conducted a large number of research activities regarding D2D communications.
Since D2D communications utilize spectrum channels, which are already occupied by cellular users, it is essential to allocate communication resources to D2D devices in a way that the performances of D2D links are improved without destroying the QoS of cellular links. Thus, the resource allocation (RA) for D2D devices in the existence of cellular devices is an inherent issue of D2D communications in theoretical and practical aspects. A large number of research studies have been conducted regarding RA for D2D communications, especially underlay cellular networks [2][3][4][5][6][7][8][9][10][11][12][13]. When different D2D links occupy distinct cellular channels, efficient one-to-one mapping techniques may be applied to RA [2]. On the other hand, in case multiple D2D links are allowed to share a cellular channel, the optimal RA problem becomes NP-hard and cannot be solved analytically, especially when a high number of wireless devices are distributed densely in a cell. Hence, sub-optimal approaches with reduced complexities have been investigated to implement RA in practical D2D communication systems [14][15][16]. In [14], a two-step resource allocation scheme was introduced to maximize the sum capacity of D2D communications. In [15], a centralized resource allocation method based on the difference of the convex function (DC) programming was proposed to solve a weighted sum rate maximization problem. In [16], an alternating channel assignment-power allocation scheme was proposed to maximize the sum rate of the cellular and D2D links.
As the scope of D2D is extended to D2D with mobility, we need a dynamic RA scheme suitable for networks with devices changing their locations continuously. It is clear that a dynamic RA requires a much higher system overhead and computational complexity than a static one because the resource allocation needs to be updated whenever a network setup, including device locations, changes. Data transmissions need to be conducted over multiple time steps for several reasons, e.g., due to the segmentation of long data frames into multiple short ones due to limited bandwidth channels. The number of time steps may be determined by data size, channel bandwidth, battery life of UEs, etc. The RA becomes more complex and hard to implement if a sequence of resources needs to be allocated over multiple time steps. Thus, a sub-optimal RA scheme with low complexity is more demanding in modern communication networks.
Recently, deep learning (DL) and reinforcement learning (RL) have received attention from a wide range of fields. DL has been actively adopted in optimization, system identification, recognition, and classification in many applications, including wireless communications. RL is a mechanism of agents that learns what to do, or how to map situations to actions, in order to maximize a reward through a trial-and-error search. Deep RL (DRL) incorporates DL into RL, in which agents make decisions from unstructured input data, where the deep Q-network (DQN) is a well-known example of DRL [17]. DRL has also been widely applied to various forms of optimization and policy determination problems, including wireless communication systems. DRL is an efficient mechanism for sequential decision-making, so it is a natural approach to apply DRL to RA over multiple time steps. Using DRL is considered a good approach to determining resources for D2D devices in a sub-optimal manner, with lower complexity in practical communication networks. In the training phase, artificial neural networks (ANNs) in agents are intensively trained for as many situations as possible. Then, in an actual operational phase, agents just observe situations and draw sub-optimal solutions to an RA problem using trained ANNs.
Various forms of DL-based RA schemes have been proposed. A hybrid power allocation scheme was proposed to maximize the sum rate of D2D users by mitigating the QoS constraint violation [18], and a channel and power allocation scheme for overlay D2D networks was proposed to maximize the sum rate of D2D pairs with a minimum rate constraint [19]. The DL framework (for the optimal RA in multi-channel cellular systems with D2D communications) was proposed to maximize the overall spectral efficiency [20], and random graph-based sparse-long short-term memory (LSTM) network for joint resource management was proposed to maximize the determinacy of latency in cellular machine-to-machine communications [21]. In [22], the RA scheme in unmanned aerial vehicle (UAV)-assisted cellular V2X (C-V2X) communications was proposed to maximize the bandwidth efficiency while satisfying the rate and latency of users.
RA schemes using RL in training phases were also proposed. An energy optimization technique was proposed in 5G wireless vehicular social networks [23], and a joint power allocation and relay selection scheme based on Q-learning was proposed to improve energy efficiency in relay-aided D2D communications underlay cellular networks [24]. A contentcaching strategy based on multi-agent RL with reduced action space was introduced to maximize the expected total caching reward in mobile D2D networks [25].
An increasing number of research studies are investigating and devising RA schemes based on the DRL principle. A centralized double-DQN-based RA scheme was proposed for dynamic spectrum access in D2D communications underlay cellular networks [26], and a centralized hierarchical DRL-based method was proposed to find an optimal relay selection and power allocation strategy for 5G mmWave D2D links [27]. In [28], a DRL-based algorithm was proposed to determine the transmit power of D2D and cellular links for maximizing an overall sum-rate. In [29], each V2V link selects resources with the aid of DQN to satisfy a latency constraint and minimize the mutual interference between the infrastructure and vehicles in unicast and broadcast scenarios. A deep deterministic policy gradient (DDPG) algorithm was used for the energy efficient power control in D2D-based V2V communications [30], and an adaptive RL framework was used to select the appropriate channel selection for a non-orthogonal multiple access-unmanned aerial vehicle (NOMA-UAV) network [31]. In [32], a distributed frequency RA framework based on the multi-agent actor-critic (MAAC) was proposed. In [33], a multi-agent DRL-based distributed power control and RA algorithm was introduced to maximize the throughput of D2D and cellular users. In [34], a DRL-based joint mode selection and channel allocation algorithm was proposed in D2D communicationenabled heterogeneous cellular networks to maximize the system sum-rate in mmWave and cellular bands.
In centralized RA, a central coordinator collects information from all devices in a cell and determines the resources for all participating devices. On the other hand, in a decentralized RA, participating devices determine their own resources by using their locally obtained information. The centralized RA scheme results in a better performance than the decentralized scheme at the cost of high system overhead and high computational complexity concentrated on a central unit. On the other hand, the decentralized RA scheme results in a lower system overhead and distributed computation burden at the cost of the degradation of the communication performance. As the number of devices participating in D2D networks grows, the system overhead required to collect data from devices and deliver RA results to individual devices also increases. This results in increasing interest in decentralized RA schemes, and recently, various forms of decentralized RA schemes adopting DRL have been proposed. The basic requirement for implementing a decentralized DRL-based RA scheme involves sufficient computing capabilities from participating D2D devices because each one needs to operate its own learning mechanism, such as ANN. Vehicles and roadside infrastructures are able to supply sufficient computing power and enough space to mount high-performance devices so that V2X networks can utilize a decentralized-DRL-based RA scheme. On the other hand, personal hand-held devices do not have enough power supply and computational capabilities to run their own learning units so a decentralized DRL is not considered a viable solution for RA. Consequently, central a RA scheme still has high demands, and an advanced approach to reducing the computational complexity of DRL adopted for RA by a central coordinator is highly demanding.
In this paper, we propose a practically efficient centralized RA scheme based on a multi-agent DRL for D2D communications underlay cellular networks. We aim to present a good performance of the centralized RA scheme while reducing the computational complexity by using a multi-agent structure. Transmit power and the spectrum channel of D2D links are considered resources of D2D communications, and the objective of RA is to maximize the sum of the average effective throughput of all cellular and D2D links in a cell accumulated over multiple time steps. We obtained outage probabilities of cellular and D2D links in terms of the spectrum channel and the transmit power of the devices. Then, we define an effective throughput and formulate the optimization problem required for RA. We introduce a multi-agent DRL framework in which agents reside in a central coordinator of the cell and conduct constituent learning processes in a staggered and cyclic manner. Thanks to the segmentation of ANN into smaller ones, the proposed multi-agent DRL requires lower computational complexities in both the training phase and testing phase than the joint DRL for RA. The proposed RA scheme promptly allocates resources depending on the locations of participating devices, which vary dynamically. It was observed from simulations that the proposed DRL-based RA scheme performs well in various aspects of D2D communications underlay cellular networks. The usefulness of the proposed RA scheme is clearer in case the D2D devices are distributed more densely in a cell resulting in a higher level of mutual interferences among devices. Consequently, the proposed RA scheme is considered practically efficient in the next-generation communication network in which a high number of D2D devices with high mobility exist in cellular networks.
This paper is organized as follows. In Section 2, the system model of D2D communications underlay cellular networks is presented and the optimal resource allocation is formulated to maximize the sum of the average effective throughput of cellular and D2D links accumulated over multiple time steps. In Section 3, we provide a short introduction to the deep reinforcement learning algorithm. In Section 4, we propose a multi-agent DRLbased RA scheme, in which multiple agents conduct constituent learning in a staggered manner with a timing offset in a cyclic manner in a training phase. In Section 5, we analyze the performance of the proposed scheme in various aspects and compare it with other RA schemes. Finally, we conclude this paper in Section 6.

System Model
We consider a single cell, in which an evolved Node B (eNB), K cellular user equipment (CUE), and M D2D pairs exist, as shown in Figure 1. A D2D pair is formed by the transmitting D2D user equipment (DUET) and the receiving D2D user equipment (DUER), where D2D communications occur during a cellular uplink period. Note that RA decisions are made centrally by eNB and delivered to corresponding DUETs during a cellular downlink period. Each CUE occupies a dedicated channel while D2D links use channels already occupied by CUEs, where multiple D2D links are allowed to share a channel. We index CUE and the channel occupied by CUE as k = 0, 1, · · · , K − 1. We also index the D2D pair and its DUET and DUER as m = 0, 1, · · · , M − 1. Let s m ands k denote a transmit symbol of DUET m and CUE k, respectively, each of which has the power of p m andp k , respectively. We let the transmit power of each DUET be chosen out of L discrete values, i.e., p m ∈ {p (1) , p (2) , · · · , p (L) } for each m. We also let d xy denote the distance between user equipment (UE) x and y, where d xB denotes the distance between UE x and eNB. We leth k kn denote a small-scale fading gain of the channel between CUE k and DUER n, and let h k mn denote a small-scale fading gain of the channel between DUET m and DUER n over the channel k. We supposeh k kn and h k mn are independent and are identically distributed (i.i.d.) zero-mean circularly symmetric complex Gaussians with unit variances. We use a log-distance model for large-scale fading in the channel between UEs x and y, which are determined by d −α xy with a path loss exponent α. We define a D2D channel access indicator δ mk as δ mk = 1 if a D2D pair m uses the cellular channel k, and δ mk = 0 otherwise.
The received signal at DUER m over the channel k is written as where w m denotes the additive noise at DUER m, which is a zero-mean circularly symmetric complex white Gaussian with a variance σ 2 w . The signal-to-interference-and-noise ratio (SINR) of y k m is determined by Similarly, the SINR of the received signal at eNB over the channel k, denoted by γ k B , is defined as We declare a link outage when the achievable data rate does not meet a target rate. Let R c and R d denote target rates of the cellular link and D2D link, respectively. We also let γ c and γ d denote values of the SINR of the cellular link and D2D link, respectively, by which corresponding target rates are achieved. Note that γ c = 2 R c − 1 and γ d = 2 R d − 1, where log 2 (1 + γ c ) = R c and log 2 (1 + γ d ) = R d . Then, γ c and γ d represent the SINR threshold for declaring outages of the cellular link and D2D link, respectively. We also let ρ k B and ρ k m denote the outage probabilities of the cellular link k and D2D link m over a channel k, respectively. Then, we obtain and whose derivations are provided in Appendix A. Note that ρ k m is defined only when δ mk = 1. We define an effective throughput of the D2D link m over the channel k as a target rate multiplied by the probability of the successful transmission, i.e., R d (1 − ρ k m ). In the same manner, the effective throughput of the cellular link k is defined by R c (1 − ρ k B ). The goal of RA is determining the channel and transmitting power of all D2D links at each time step to maximize the cumulative sum of the average effective throughputs of the cellular and D2D links over multiple time steps T. Multiple D2D links are allowed to share an identical cellular channel. A D2D link occupies a single cellular channel during data transmission. Then, RA can be expressed as max δ mk,t ,p m,t ,∀t,k,m where the time step t is specified in δ mk , p m , ρ k B , and ρ k m as δ mk,t , p m,t , ρ k B,t and ρ k m,t , respectively, with a slight abuse of notation. Constraints in (6) imply that each D2D link utilizes only one cellular channel while a cellular channel can be used by multiple D2D links. A mathematical technique to solve (6) is not available, so we need to rely on a brute-force search approach to obtain optimal solutions of (6), which are δ mk,t , p m,t for all m, k, t. Since there exist LK possibilities of the pair of δ mk,t and p m,t for given m and t, the overall number of possible combinations of δ mk,t and p m,t is (LK) MT . Thus, in a brute-force search, the objective function needs to be evaluated for each (LK) MT candidate to obtain an optimal solution. As a result, the optimal RA is too complex to be implemented in a practical system especially when the number of participating D2D links M is high. If distinct cellular channels are assigned to different D2D links, the channel allocation can be performed by a low-complexity one-to-one mapping algorithm, e.g., the Hungarian algorithm [2]. However, in case multiple D2D links are allowed to use an identical cellular channel, the high computational complexity becomes a large constraint in regard to using a brute-forcesearch-based RA scheme in a practical communication system. Thus, in this paper, we devise a low-complexity RA scheme based on DRL, which can be utilized in practice.

Deep Reinforcement Learning Preliminaries
Reinforcement learning (RL) is a mechanism of learning what to do, or how to map situations to actions, in order to maximize a reward through a trial-and-error search. It is known that the Markov decision process (MDP), represented by a model-free learning scheme, is useful for studying optimization problems solved by RL. MDP can formalize sequential decision-making, in which agents interact with the environment, observe states, and take actions affecting not only immediate rewards but also subsequent situations. MDP is represented by (S, A, P, R), where S is a set of states, A is a set of actions that the agent can take based on a given state, P : S × A × S → [0, 1] is a transition function characterizing the probability that a given state and action are mapped to the next state, and R is a set of possible rewards obtained by an agent. If the cardinalities of S, A, and R are finite, the MDP is called a finite MDP.
In the RL, at a certain time step t, an agent observes a state s t ∈ S of the environment and accordingly takes an action a t ∈ A based on a policy π. The policy is a mapping from states to probabilities of selecting each possible action. Following the action a t , the state s t transits to a new state s t+1 and the agent obtains a reward r t and computes a return as G t = ∑ ∞ k=0 β k r t+k+1 , where 0 ≤ β ≤ 1 is a discount factor adjusting the impact of future rewards. The agent evaluates the expected return obtained by starting from a state s and following a policy π, thereafter, as v π (s) = E{G t | s t = s, π} and the expected return obtained by starting from a state s, taking an action a, and following a policy π, thereafter, as q π (s, a) = E{G t | s t = s, a t = a, π}. Note that v π (s) and q π (s, a) are called a state-value function and an action-value function, respectively, under a policy π. Then, the agent determines the optimal policy for a given state s by π * = argmax π v π (s), through which optimal state-value function and action-value function are also defined by v * (s) = max π v π (s) and q * (s, a) = max π q π (s, a), respectively. Q-learning was developed as an off-policy RL algorithm for a temporal-difference control of a finite MDP. It handles problems with stochastic transitions and resulting rewards without requiring a model of the environment. At each time step t, the agent staying at a state s t ∈ S selects an action a t ∈ A based on an action selection rule, which is designed to balance the behaviors of exploration and exploitation by agents. Greedy, -greedy, and soft-max methods are widely-used examples of the action selection rule. The quality of the pair of state s t and action a t is evaluated by a function Q : S × A → R, whose result Q(s, a) is called a state-action value. After taking an action, the agent observes a resultant reward r t and the next state s t+1 , and then updates the state-action value by where µ is the learning rate. This procedure is repeated from the initial time step up to the final time step. This series of steps is called an episode. At the beginning of each episode, the environment is set to an initial state and the agent's reward is reset to zero. We directly approximate the optimal action-value function, q * (s, a), by using the state-action value Q(s, a), independent of the policy being followed. It is known that in MDP, the state-action value converges with probability 1 to the optimal action-value function if each action is executed at each state during the infinite run times and the learning rate µ decays, appropriately. The optimal policy π * can be found once the optimal action-value function q * (s, a) is determined. After a sufficient number of updates, the state-action values for all states and actions converge.
In the case of a large state space S, evaluating the state-action values Q(s, a) for all states requires high computational complexity. We can speed up the learning process by using a function approximator, obtained from earlier experiences, to compute the stateaction values. DeepMind introduced a deep Q-learning, or deep Q-network (DQN), which uses a convolutional neural network (CNN) or generally an artificial neural network (ANN) as a function approximator [17]. DQN has two phases, the (i) training phase and (ii) testing phase. In the training phase, an agent trains its state-action value approximator through a sufficient number of learning iterations. Then, the system enters a testing phase, in which the trained state-action value approximator is used to draw the best actions for a given set of observations.
In the training phase of DQN, the agent utilizes two ANNs, called Q-networks, which are a prediction network and a target network. In Q-networks, states are defined by observations obtained by the agent and are fed to input nodes. With a given state s, the prediction network computes Q(s, a) approximately for each realization of action a ∈ A at each output node. The action of an agent is chosen by the action selection rule and applied to the environment or emulator. Then, a reward r, as well as a new state s , are obtained, and the transition vector {s, a, r, s } is stored in the experience replay memory. Since observations at consecutive iterations are highly correlated, small updates of state-action values may significantly change the policy and the data distribution, which may result in the instability of RL. To overcome this problem, deep Q-learning utilizes an experience replay. Random samples of prior transition vectors are picked from an experience replay memory and used to evaluate a loss function through the prediction network and target network, where a batch of transitions may be used. This removes correlation in the observation sequence and smooths changes in the data distribution. The prediction network is updated at every time step by using the obtained loss function while the target network is updated periodically or updated softly at every time step. This process is composed of one learning iteration, which is summarized in Figure 2. After a training phase is completed by a sufficient number of iterations, the testing phase begins, in which the agent takes an action a corresponding to the output node of ANN having the greatest Q(s, a) for given states s fed to input nodes of ANN.

DRL-Based Resource Allocation for D2D Communications
We propose a DQN-based RA scheme for D2D communications underlay cellular networks. First, we consider a joint RA scheme by which resources for all D2D links are determined simultaneously. A central coordinator at eNB is considered a single agent of DQN, which conducts RA for all DUEs. We define an episode as a time duration T for which a sequence of data transmissions from DUET to DUER is complete. We count the time steps inside each episode, i.e., the time step t is defined between 0 and T − 1. At time step t, the state s t is defined by locations of all UEs in the cell, indices of channel resources, and transmit power levels of D2D links at the time step t. Let z t denote the vector of locations of all UEs, i.e., CUEs, DUETs, and DUERs, and let c t and p t denote vectors of allocated channel indices and transmit power levels of all D2D links at time step t, respectively. Then, the state is expressed as We define the action (a t ) at time step t by the determination of the transmit power levels and channel indices to all D2D links at time step t. The instantaneous reward at time step t, denoted by r t , is defined as the sum of the average effective throughput of D2D and cellular links in the cell at time step t, i.e., and the accumulated reward up to the time step t − 1, denoted byr t , is obtained as The reward accumulated over the time steps in an episode will be called a benefit and expressed asr The goal of RA is to maximize the benefits under existing constraints as introduced in (6).
In DQN, the state s t is fed to input nodes of ANN and each output node of ANN is dedicated to each action. As introduced in (8), a state is defined by a (K + 4M)-tuple vector, which are (K + 2M) entries of z t and M entries for each of c t and p t . Each M D2D link can be allocated a channel index and a power level out of K and L possible values, respectively, so that there exist (LK) M possible realizations of action at each time step. It follows that ANNs in the prediction network and target network have K + 4M input nodes and (LK) M output nodes, as shown in Figure 3a. Suppose ANN has L h hidden layers, each of which has N h nodes. We also suppose neighboring layers are fully connected. In the testing phase, ANN performs the forward propagation from the input layer to the output layer. We consider the computational complexity of the forward propagation of ANN in terms of floating point operation (FLOP) [35]. A forward propagation from a layer with N i nodes to a layer with N j nodes requires approximately 2N i N j FLOPs where the computational complexities of the activation functions of nodes are negligible. Then, FLOPs required for forward propagation over ANN is approximately 2 In the training phase, ANN is updated through multiple pairs of forward and backward propagations. It is known that FLOPs of backward propagation are typically 2-3 times the FLOPs of forward propagation [35]. Thus, it is sufficient to focus on forward propagation when comparing computational complexities of ANNs. When L, K, and M are high, (LK) M is much higher than K + 4M and N h so that the computational complexity of ANN is dominated by N h (LK) M . Since N h (LK) M FLOPs are too high to be executed in real time, the RA scheme based on a single-agent DQN is practically infeasible with high L, K, and M. To resolve this problem, we utilize a structure of multi-agent DQN, in which each agent corresponds to each D2D link and operates its own DQN. Note that agents exist physically in a central coordinator at eNB. Agents have ANNs in prediction and target networks with segmented structures as depicted in Figure 3b. Agents share a state, which is identical to the state of a single-agent DQN defined in (8). The action of an agent is reduced to the allocation of transmit power and spectrum channel of the corresponding D2D link only. Thus, the action chosen by the agent m at time step t, denoted by a m t , is defined by where c m t ∈ {0, . . . , K − 1} and p m t ∈ {0, . . . , L − 1} denote the index of channel and transmit power level of the D2D link m at the time step t, respectively. Since each D2D link has LK possible realizations of action, the ANN in each DQN has LK output nodes as shown in Figure 3b, where the number of input nodes remains as K + 4M. Then, the overall FLOPs required for forward propagations over M segmented ANNs is approximately If LKM >> N h , the computational complexity of multiple-segmented ANNs is dominated by N h LKM, which is M/(LK) M−1 of the complexity of a single-agent ANN. On the other hand, if N h >> LKM, the complexity is dominated by ML h N 2 h , which is ML h N h /(LK) M of the complexity of a single-agent ANN. In both cases, a significant level of complexity reduction is observed by using the multi-agent DQN.
The learning process of multi-agent RL is executed as described below. Let us define constituent learning as a sequence of operations by a single agent, i.e., observing a state s, taking an action a, observing a reward r and a new state s , updating weights of prediction/target networks θ and θ . If all agents conduct constituent learning simultaneously, it is impossible to evaluate explicitly the influence of individual agent actions on the change of the environment. Since the reward and the next state fed back to each agent do not reflect explicitly the contribution of the corresponding agent to the environment, the multi-agent RL may not converge well or may not improve performance through iterations. Thus, it is a reasonable approach to devise a sequential operation of constituent learning by multiple agents.
We let agents conduct constituent learning in a cyclic manner with the timing offset as depicted in Figure 4. Without loss of generality, labeling agents is based on the order of performing constituent learning. This order is randomly selected at the beginning of the training phase and is maintained. After completing one constituent learning procedure, each agent keeps idling until its turn comes around again, by which each agent performs constituent learning periodically. We define a time step as a time interval corresponding to a period of learning by agent 0 as depicted in due to the existence of the idling period between adjacent constituent learning of each agent, which is a modification from the conventional DQN. In this manner, the overall cyclic learning procedure is operated during a training phase. The collection of constituent learning of all agents composes a learning iteration. We let s m t , a m t , r m t , and s m t denote state, action, reward, next state of the agent m at time step t. Even at the same time step, different agents have different values for these variables. The learning procedure of each agent over multiple time steps and episodes is described as follows, which is also summarized in Algorithm 1. For a simple description, we focus on the operation of a specific agent indexed by m, where m = 0, . . . , M − 1. This corresponds to a single timeline of a single agent in Figure 4. First, we initialize weights of prediction networks θ m by randomly generated small numbers, and set weights of target networks as θ m = θ m . Experience replay memory D m is initialized by running a random policy. The agent m observes a state s m t and takes an action a m t based on the -greedy policy as an action selection rule to affect the environment. This implies that the agent m selects an action a m t resulting in the maximum state-action value Q(s m t , a m t ; θ m ) with probability (1 − ) or selects an action randomly from other candidates with probability . Note that we use the expression Q(s, a; θ) to declare that the state-action value Q(s, a) is obtained by ANN with weights θ. Affected by action a m t , the environment changes and the agent m obtains a reward r m t by (9)  is one sample included in a batch B picked up from D m , where we use subscript j to represent an index at which the transition vector is stored in D m with a slight abuse of notation. The predicted state-action value Q(s m j , a m j ; θ m ) is obtained from the output node corresponding to the action a m j in ANN of prediction network with weight θ m . The target state-action value y m j is obtained by a target network with weight θ m as Then, the loss function L m is computed as the mean squared error between the target state-action value and predicted state-action value by where |B| represents the size of batch B. We update the weights of the prediction network θ m by using a stochastic gradient descent algorithm as where η is a learning rate, and update weights of target network θ m softly as where τ << 1. We repeat the above process over the time steps in each episode and repeat the whole process over E episodes. After the training phase is completed through multiple episodes as introduced above, the RA scheme enters a testing phase which corresponds to an actual operation of DUEs underlay cellular networks. At every time step, observation obtained by eNB is used by each agent as a state, which is input to the trained prediction network. Then, the action resulting in the maximum state-action value at the output nodes of the prediction network is chosen for each agent. The chosen action is reported to DUET and used as resources for the corresponding D2D communication. In the testing phase, resource allocation for all agents may be executed simultaneously, not in a staggered manner, at each time step. The environment is influenced by D2D communications performed in this manner, and new observation is obtained by eNB. This procedure repeats over all time steps of the testing phase. Obtain Q(s m j , a m j ; θ m ) from the prediction network. Obtain y m j by (13) in the target network. Compute the loss function L m by (14). Update θ m and θ m by (15) and (16), respectively. end for end for

Numerical Results
We consider a single circular-shaped cell, in which eNB is located at the center, and K CUEs as well as M DUE pairs exist, where DUERs are placed around the corresponding DUETs within a distance of 5 [m]. The distribution of CUE and DUET in the cell and the distribution of DUER around the DUET follow the binomial point process (BPP) model [36]. All UEs change their locations at every time step. Simulation parameters used in numerical experiments are listed in Table 1 and hyperparameters used for DRL are listed in Table 2. Simulation software used for numerical experiments were Python 3.6.12 and PyTorch 1.4.0. The transmit power of each CUE was determined such that the corresponding SNR at eNB resulted in an outage probability of 0.1 %. We consider various values for R c andr c for the analysis of RA schemes in various aspects.
Each ANN in the prediction and target networks has five fully connected layers, the middle three of which are hidden layers. Each hidden layer has 300 neurons equipped with the ReLU activation function; a stochastic gradient descent optimizer is used for updating the weight of ANNs. Experiences replaying memories are initially filled with experiences obtained by running random policies. The training phase is completed in 5000 iterations (500 episodes and 10 iterations/episode), and the -greedy policy with linear annealing is applied as an action selection rule. It is observed from Figure 5 that DQNs are updated well during the training phase.   We evaluate the performances of RA schemes in terms of average benefits and compare performances of the proposed DRL-based RA, random RA, and greedy RA schemes. A random RA allocates the spectrum channel and transmits the power of the D2D links randomly. In a greedy RA, the channel and transmit power of D2D links are determined through a greedy search to maximize the sum of the average effective throughput for each time step. We consider a scenario that every time step, locations of UEs change, and resources of all D2D pairs are allocated simultaneously.
In Figure 6, we plot the average benefits obtained by various RA schemes under comparison with respect to the number of D2D pairs M existing in the cell, where various radii of cellr c and target rates of cellular link R c are considered. It is observed from Figure 6 that the proposed DRL-based RA scheme shows better performance than others in all situations. As the number of D2D links M in the cell grows, all RA schemes show lower average benefits due to resulting severer mutual interference among UEs. However, the performance degradation of the proposed DRL-based RA scheme is less sensitive to the growth of M than other RA schemes. Thus, the performance gain of the proposed RA scheme over others becomes significant as the number of D2D pairs in the cell increases. It is also observed that the performance gain of the proposed DRL-based RA scheme over others increase as the radius of cellr c decrease. From these observations, it is obviously inferred that the proposed RA scheme is quite useful especially when DUEs are distributed densely in a cell and suffer from a high level of mutual interference from other UEs. Since the demand for D2D communications is continuously growing, the proposed RA scheme would be a meaningful solution to resolve the spectrum shortage problem in the next-generation wireless communication systems. It is additionally observed that the proposed DRL-based RA scheme attains highly improved average benefit with higher R c compared with other RA schemes, which is clear from Figure 7. This is explained by the property that the proposed RA scheme balances well effective throughputs of both D2D links and cellular links, while a greedy RA scheme has a priority in maintaining the QoS of D2D links at a required level. Consequently, the proposed DRL-based RA scheme obtains a significant performance gain over others in case the DUEs are distributed densely in a cell and CUE has a higher target rate than DUE.
In Figures 8-10, we plot the average transmit power of DUETs, the average outage probability of D2D links, and the average outage probability of cellular links with respect to M, respectively, where various R c andr c are considered. It is observed that the proposed DRL-based RA scheme adapts the transmitting power sensitively with respect to M and r c , to achieve high benefits while maintaining the QoS of the cellular links. For larger M and smallerr c , the proposed RA scheme prevents benefits from decreasing by using a lower transmitting power of DUET at the cost of a higher outage probability of the D2D links. The overall performance is maintained at the sacrifice of the D2D link because the cellular link has a higher contribution to the overall performance than the D2D link when R c is high. Although a greedy RA scheme also adapts the transmit power of DUET depending on M, higher transmit power is allocated to D2D links with growing M and thus the average outage probability of D2D links is maintained at the cost of increasing the average outage probability of cellular links. For higher R c , this way of power allocation results in higher degradation in the average benefit so the performance is outperformed further by the proposed RA scheme. Figure 11 shows clearly that a higher ratio of R c /R d results in lower transmitting power of DUET and, thus, a higher outage probability of D2D links by the proposed DRL-based RA scheme, which is different from a greedy RA scheme. The performance gain of the proposed RA scheme over others comes from the fact that each agent was trained to know implicitly how other agents will act at the next time step for a given set of observations. In other RA schemes, on the other hand, D2D links determine their resources only depending on the current observation. Since each agent in the proposed RA scheme takes an action based on the prediction of other agents' future actions, the proposed RA scheme works much better than others especially in case a high level of mutual interference among UEs exists. The proposed DRL-based RA scheme allocates adaptively communication resources depending on the locations of UEs, the cell size, and the target rate of the cellular link by using the pre-trained allocation rule, which enables a fast RA in the actual operation of communication networks.

Conclusions
We proposed a DRL-based RA scheme for the communications of D2D pairs underlaying cellular networks, where the spectrum channel for the D2D link and the transmit power of DUET are considered communication resources to be determined. Multiple D2D pairs are allowed to share a cellular channel, which results in high computational complexity when determining channels for D2D links. Moreover, ANNs used in DRL for joint RA have a high number of output nodes resulting in high computational complexity. To resolve this problem, a multi-agent DQN is adopted in the proposed scheme, resulting in segmented ANNs and, thus, reduced computational complexity. In the testing phase corresponding to the period of actual operation, the proposed scheme allocates communication resources adaptively in a real-time manner depending on the network setup by using the pre-trained ANNs. The proposed RA scheme outperforms others, especially when UEs are distributed densely, resulting in a high level of mutual interferences and the QoS of the cellular link has a higher priority than the D2D link.

Conflicts of Interest:
The authors declare no conflict of interest.