1. Introduction
Device-to-device (D2D) communication is a nascent feature of Long Term Evolution-Advanced (LTE-Advanced) systems. D2D communication can operate in a centralized mode, i.e., controlled by the Base station (BS), or in a decentralized mode, i.e., without a BS [1]. Unlike the traditional cellular network, where Cellular users (CU) communicate through the base station, D2D allows direct communication between users by reusing the available radio resources. Consequently, D2D communication can provide improved system throughput and a reduced traffic load at the BS. However, D2D devices generate interference while reusing these resources [2,3]. Efficient resource allocation plays a vital role in reducing the interference level, which positively impacts the overall system throughput. Fine-tuning the power allocated to Resource blocks (RB) has consequences on interference: a higher transmission power can increase D2D throughput, but it also raises the interference level. Therefore, choosing the proper transmission power level for each RB is a key research issue in D2D communication, which calls for adaptive power allocation methods.
Resource allocators, i.e., the D2D transmitters in our system model as described in Section 3, need to perform a particular action at each time step based on the application demand. For example, an action can be selecting a power level option for a particular RB [4]. Random power allocation is not suitable in D2D communication because of its dynamic nature in terms of signal quality, interference, and limited battery capacity [5]. Scheduling these actions, each associated with a different power level, helps to allocate the resources in such a way that the overall system throughput is increased and an acceptable level of interference is maintained. However, this is hard to maintain manually; therefore, we need an algorithm that learns the scheduling of actions adaptively, improving the overall system throughput with fairness and a minimum level of interference.
To illustrate the problem, Figure 1 shows a basic single-cell scenario with one Cellular user (CU), two D2D pairs, and one base station with two resource blocks operating in an underlay mode. D2D devices contend for resource blocks to reuse. Here, RB1 is allocated to the cellular user, while D2D pair Tx and D2D pair Rx are assigned RB2. Now, D2D candidate Tx and D2D candidate Rx contend for access to either RB1 or RB2. If we allocate RB1 to a D2D pair close to the BS, there will be high interference between the D2D pair and the cellular user. Therefore, RB1 should be allocated to the D2D candidate Tx that is closer to the cell edge (d2 > d3). Reusing RB1 still causes interference. Our goal is to propose an adaptive learning algorithm for selecting the proper power level for each RB so as to minimize the level of interference and maximize the throughput of the system.
In contrast with existing works, our proposed algorithm learns the proper action selection for resource allocation. We consider reinforcement learning with cooperation between users, achieved by sharing the value function and incorporating a neighboring factor. In addition, we consider a set of states based on system variables that have an impact on the overall QoS of the system. Moreover, we consider both cross-tier interference (the interference that the BS receives from D2D transmitters and that D2D receivers receive from cellular users) and co-tier interference (the interference that D2D receivers receive from other D2D transmitters) [6]. To the best of our knowledge, this is the first work that considers all the above aspects for adaptive resource allocation in D2D communications.
The main contributions of this work can be stated as follows:
We propose an adaptive and cooperative reinforcement learning algorithm to improve the achievable system throughput and the D2D throughput simultaneously. The cooperation is performed by sharing the value function between devices and imposing a neighboring factor in our learning algorithm. A set of actions is defined based on the transmission power level for a particular Resource block (RB). Further, a set of states is defined over an appropriate number of system-defined variables. In addition, the reward function is composed of the Signal-to-interference-plus-noise ratio (SINR) and the channel gains (between the base station and a user, and also between users). Moreover, our proposed reinforcement learning algorithm is an on-policy algorithm that balances exploitation and exploration. This action selection strategy helps to learn the best action to execute, which has a positive impact on selecting the proper power level for each resource block. Consequently, the method shows better performance regarding overall system throughput.
We perform a realistic throughput evaluation of the proposed algorithm while varying the transmission power and the number of D2D users. We compare our method with existing distributed reinforcement learning and random resource allocation in terms of D2D and system throughput, considering the system model where a Resource block (RB)–power level combination is used for resource allocation. Moreover, we consider fairness among D2D pairs by computing a fairness index, which shows that our proposed algorithm achieves a balance among the D2D users' throughput.
The rest of the paper is organized as follows. Section 2 describes the related works. This is followed by the system model in Section 3. The proposed cooperative reinforcement learning based resource allocation algorithm is described in Section 4. Section 5 presents the simulation results. Section 6 concludes the paper with future works.
2. Related Works
Recent advances in Reinforcement learning (RL) open up a broad scope of adaptive applications, and resource allocation in D2D communication is one such application. Here, we first describe some classical approaches [7,8,9,10,11,12,13,14,15,16], followed by existing RL-based resource allocation algorithms [17,18].
In [7], an efficient resource allocation technique for multiple D2D pairs is proposed that aims to maximize system throughput. By exploring the relationship between the number of Resource blocks (RB) per D2D pair and the maximum power constraint of each D2D pair, a suboptimal solution is proposed to achieve higher system throughput. However, the interference among D2D pairs is not considered, and the Local water filling algorithm (LWFA) used for each D2D pair is computationally expensive. Feng et al. [8] introduce a resource allocation technique that maintains the QoS of cellular users and D2D pairs simultaneously to enhance system performance. A three-step scheme is proposed in which the system first performs admission control and then allocates the power to each D2D pair and its potential Cellular user (CU). A maximum weight bipartite Matching based scheme (MBS) is proposed to select a suitable CU partner for each D2D pair such that the system throughput is maximized. However, this work focuses primarily on suitable CU selection for resource sharing; adaptive power allocation is not considered. In [9], a centralized heuristic approach is proposed in which the resources of cellular users and D2D pairs are synchronized considering the interference link gain from the D2D transmitter to the BS. The authors formulate radio resource allocation for D2D communication as a Mixed integer nonlinear programming (MINLP) problem. However, MINLP is hard to solve, and no adaptive power control mechanism is considered. Zhao et al. [10] propose a joint mode selection and resource allocation method for D2D links to enhance the system sum-rate. They formulate the problem as throughput maximization under SINR and power constraints for both D2D links and cellular users, and propose a Coalition formation game (CFG) with transferable utility to solve it. However, they do not consider the adaptive power allocation problem. In [11], Min et al. propose a Restricted interference region (RIR) in which cellular users and D2D users cannot coexist. By adjusting the size of the restricted interference region, they devise an interference control mechanism such that the D2D throughput increases over time. In [12], the authors consider the target rate of cellular users for maximizing system throughput. Their proposed method shows better results in terms of system interference. However, their work also focuses on region control for interference; they do not consider adaptive resource allocation for maximizing system throughput. A common limitation of the works mentioned above is that they are fully centralized, which requires full knowledge of the link state information and produces redundant information over the network.
In addition to the above-mentioned works, Hajiaghajani et al. [13] propose a heuristic resource allocation method. They design an adaptive interference-restricted region for multiple D2D pairs, within which multiple D2D pairs share the resources and the system throughput is increased. However, their proposed method is not adaptive with regard to power allocation to the users. In [14], the authors propose a two-phase optimization algorithm for adaptive resource allocation which provides better results for system throughput; however, the Lagrangian dual decomposition (LDD) they propose is computationally complex.
Wang et al. [15] propose Joint scheduling (JS) and resource allocation for D2D underlay communication, with which the average D2D throughput can be improved. Here, the channel assigned to the cellular users is reused by only one D2D pair, and the co-tier interference is not considered. In [16], Yin et al. propose a distributed resource allocation method in which minimum rates of cellular users and D2D pairs are maintained. A Game theoretic algorithm (GTA) is proposed for minimizing the interference among D2D pairs. However, this approach provides low spectral efficiency.
With regard to machine learning for resource allocation in D2D communication, there are only a few works, e.g., [17,18]. Luo et al. [17] and Nie et al. [18] exploit machine learning algorithms for D2D resource allocation. Luo et al. [17] propose a Distributed reinforcement learning (DIRL) Q-learning algorithm for resource allocation which improves the overall system performance in comparison to a random allocator. However, their Reinforcement learning (RL) model is not well structured; for example, the sets of states and actions are not adequately designed. Their reward function is composed only of the Signal-to-interference-plus-noise ratio (SINR) metric. The channel gain between the base station and the user, and the channel gain between users, are not considered. This is a drawback, since the channel gains are important to consider: they help the D2D communication attain a better SINR level and transmission power, which is reflected in increased system throughput [19].
Recently, Nie et al. [18] proposed Distributed reinforcement learning (DIRL) Q-learning to solve the power control problem in underlay mode. In addition, they explore the optimal power allocation that helps to maintain the overall system capacity. However, this preliminary study has limitations; for example, the channel gains are not considered in their reward function. In addition, their system model considers only the transmit power level for maximizing the system throughput, whereas including RB/subcarrier allocation in the optimization function is very important for mitigating interference [20]. Moreover, the cooperation between devices for resource allocation is not investigated in these existing works. A summary of the features and limitations of classical and RL-based allocation methods is given in Table 1.
We propose adaptive resource allocation using Cooperative reinforcement learning (CRL) with a neighboring factor, an improved state space, and a refined reward function. Our proposed resource allocation method helps to mitigate the interference level and improve the D2D throughput, and consequently the overall system throughput.
Table 1 compares all the above-mentioned works with our proposed cooperative reinforcement learning. First, we categorize the related methods into two types: classical D2D resource allocation methods and Reinforcement learning (RL) based D2D resource allocation methods. We compare these works based on D2D throughput, system throughput, transmission alignment, online task scheduling for resource allocation, and cooperation. We can observe that almost all the methods consider the D2D and system throughput, but none of the existing methods consider transmission alignment, online action scheduling, and cooperation for adaptive resource allocation.
3. System Model
We consider a network that consists of one Base station (BS) and a set of $\breve{C}$ Cellular users (CU), i.e., $\breve{C}=\{1,2,3,\dots ,C\}$. There are also $\breve{D}$ D2D pairs, $\breve{D}=\{1,2,3,\dots ,D\}$, coexisting with the cellular users within the coverage of the BS. In a particular D2D pair, ${d}_{T}$ and ${d}_{R}$ are the D2D transmitter and D2D receiver, respectively. The set of User equipments (UE) in the network is given by UE $=\{\breve{C}\cup \breve{D}\}$. Each D2D transmitter ${d}_{T}$ selects an available Resource block (RB) r from the set $RB=\{1,2,3,\dots ,R\}$. In addition, underlay D2D transmitters select the transmit power from a finite set of power levels, i.e., ${p}_{r}=({p}_{r}^{1},{p}_{r}^{2},\dots ,{p}_{r}^{R})$. Each D2D transmitter should select resources, i.e., an RB–power level combination, referred to as a transmission alignment [21].
For each RB $r\in R$, there is a predefined threshold ${I}_{th}^{(r)}$ on the maximum aggregated interference. We assume that the value of ${I}_{th}^{(r)}$ is known to the transmitters via the feedback control channels. An underlay transmitter uses a particular transmission alignment such that the cross-tier interference stays within the threshold limit. According to our proposed system model, only one CU can be served by one RB, while D2D users can reuse the same RB to improve the spectrum efficiency.
For each transmitter ${d}_{T}$, the transmit power over the RBs is given by the vector ${p}_{r}={[{p}_{r}^{1},{p}_{r}^{2},\dots ,{p}_{r}^{R}]}^{T}$, where ${p}_{r}\ge 0$ denotes the transmit power level of the transmitter over resource block r. If an RB is not allocated to the transmitter, then the power level ${p}_{r}=0$. Since we assume that each transmitter selects only one RB, only one entry of this power vector satisfies ${p}_{r}\ne 0$.
The Signal-to-interference-plus-noise ratio (SINR) can be treated as an important factor to measure the link quality. The received SINR for any D2D receiver over the rth RB is as follows:

${\gamma}_{r}^{{D}_{u}}=\frac{{p}_{r}^{{D}_{u}}\cdot {G}_{{D}_{u,r}}^{uu}}{{\sigma}^{2}+{p}_{r}^{c}\cdot {G}_{r}^{{c}_{u}}+{\sum}_{v\in {D}_{r},v\ne u}{p}_{r}^{{d}_{v}}\cdot {G}_{{D}_{v,r}}^{uv}}$

where ${p}_{r}^{{D}_{u}}$ and ${p}_{r}^{c}$ denote the uth D2D user's and the cellular user's uplink transmission power on the rth RB, respectively. ${p}_{r}^{{D}_{u}}\le {P}_{max}$, $\forall u\in D$, where ${P}_{max}$ is the upper bound of each D2D user's transmit power, and ${\sigma}^{2}$ is the noise variance [9]. ${G}_{{D}_{u,r}}^{uu}$, ${G}_{{D}_{v,r}}^{uv}$ and ${G}_{r}^{{c}_{u}}$ are the channel gain of the uth D2D link, the channel gain from D2D transmitter v to receiver u, and the channel gain from cellular transmitter c to receiver u, respectively. ${D}_{r}$ is the set of D2D pairs sharing the rth RB.
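To make the computation concrete, the SINR expression above can be evaluated directly from the transmit powers and channel gains. The following Python sketch (all numeric values and names are ours, purely illustrative) mirrors the equation, with the co-tier term summed over the other D2D pairs sharing the RB:

```python
def d2d_sinr(p_du, g_uu, p_c, g_cu, interferers, noise_var):
    """SINR of a D2D receiver u on RB r.

    interferers: list of (p_dv, g_uv) pairs for the other D2D
    transmitters v != u sharing the same RB (co-tier interference).
    """
    co_tier = sum(p_dv * g_uv for p_dv, g_uv in interferers)
    return (p_du * g_uu) / (noise_var + p_c * g_cu + co_tier)

# Example: one cellular interferer and one co-tier D2D interferer.
gamma = d2d_sinr(p_du=0.2, g_uu=0.5, p_c=0.1, g_cu=0.05,
                 interferers=[(0.2, 0.01)], noise_var=1e-3)
```

The cellular SINR of Equation (2) follows the same pattern, with the BS as the receiver and the D2D transmitters as interferers.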
The SINR of a cellular user $c\in \breve{C}$ on the rth RB is

${\gamma}_{r}^{c}=\frac{{p}_{r}^{c}\cdot {G}_{c,r}}{{\sigma}^{2}+{\sum}_{v\in {D}_{r}}{p}_{r}^{{d}_{v}}{G}_{v,r}}$

where ${G}_{c,r}$ and ${G}_{v,r}$ indicate the channel gains on the rth RB of cellular user c and of the vth D2D transmitter, respectively.
The total pathloss, which includes the antenna gain, between the BS and a user u is:

$P{L}_{dB,B,u}={L}_{dB}(d)+{\log}_{10}({X}_{u})-{A}_{dB}(\theta )$

where ${L}_{dB}(d)$ is the pathloss between the BS and a user at a distance of d meters, ${X}_{u}$ is the lognormal shadowing of user u, and ${A}_{dB}(\theta )$ is the antenna radiation pattern [22].
${L}_{dB}(d)$ can be expressed as a function of the carrier frequency ${f}_{c}$ in GHz and the base station antenna height ${h}_{b}$ [22].
The linear gain between the BS and a user is ${G}_{Bu}={10}^{-\frac{P{L}_{dB,B,u}}{10}}$.
For D2D communication, the gain between two users u and v is ${G}_{uv}={k}_{uv}{d}_{uv}^{-\alpha}$ [23]. Here, ${d}_{uv}$ is the distance between transmitter u and receiver v, $\alpha$ is a constant pathloss exponent, and ${k}_{uv}$ is a normalization constant.
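As a quick illustration, the two gain models can be written as small helper functions. This sketch assumes the conventional sign convention, in which the linear gain decreases as pathloss and distance increase; function names and numeric values are ours:

```python
def gain_bs_user(pl_db):
    """Linear gain between the BS and a user from the total pathloss PL (dB)."""
    return 10 ** (-pl_db / 10)

def gain_d2d(k_uv, d_uv, alpha):
    """Gain between users u and v: k_uv * d_uv^(-alpha), alpha = pathloss exponent."""
    return k_uv * d_uv ** (-alpha)

# Example: 10 dB of total pathloss corresponds to a linear gain of 0.1.
g_bu = gain_bs_user(10.0)
g_uv = gain_d2d(k_uv=1.0, d_uv=20.0, alpha=3.5)
```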
The objective of the resource allocation problem (i.e., allocating RBs and transmit power) is to assign the resources in a way that maximizes system throughput. The system throughput is the sum of the D2D users' and CUs' throughput, which is calculated by Equation (6). The resource allocation can be indicated by a binary decision variable ${b}_{v}^{(r,{p}_{r})}$, where
The aggregated interference experienced on RB r can be expressed as follows:
Let $B={[{b}_{1}^{(1,1)},\dots ,{b}_{1}^{(r,{p}_{r})},\dots ,{b}_{1}^{(R,{P}_{max})}]}^{T}$ denote the resource allocation, i.e., the RB and transmission power assignment. The allocation problem can then be expressed as follows:

where ${p}_{r}=({p}_{r}^{1},{p}_{r}^{2},\dots ,{p}_{r}^{R})$ and ${W}_{RB}$ is the bandwidth of one RB. The objective is to maximize the throughput of the system under the constraint that the aggregated interference remains below a predefined threshold. Each transmitter selects exactly one RB and one power level on that RB. Our goal is to investigate the optimal resource allocation that maximizes the system throughput by applying cooperative reinforcement learning.
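Under this model, the objective value can be sketched as the Shannon sum-rate over all scheduled users. The snippet below is a minimal illustration, assuming ${W}_{RB}$ = 180 kHz (the usual LTE RB bandwidth, which the text does not state explicitly) and hypothetical SINR values:

```python
import math

W_RB = 180e3  # Hz; assumed LTE resource block bandwidth

def user_rate(sinr, w_rb=W_RB):
    """Shannon rate of one user on one RB: W_RB * log2(1 + SINR)."""
    return w_rb * math.log2(1 + sinr)

def system_throughput(d2d_sinrs, cu_sinrs, w_rb=W_RB):
    """System throughput = sum of the D2D users' and CUs' rates."""
    return sum(user_rate(s, w_rb) for s in list(d2d_sinrs) + list(cu_sinrs))

total = system_throughput(d2d_sinrs=[12.5, 3.0], cu_sinrs=[7.0])
```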
4. Cooperative Reinforcement Learning Algorithm for Resource Allocation
In this section, we describe the basics of Reinforcement learning (RL), followed by our proposed cooperative reinforcement learning algorithm. After that, we describe the set of states, the set of actions, and the reward function of our proposed algorithm. Finally, Algorithm 1 presents the overall proposed resource allocation method, and Algorithm 2 presents the execution steps of our proposed cooperative reinforcement learning.
We apply a Reinforcement learning (RL) algorithm named State–action–reward–state–action, SARSA($\lambda$), for adaptive and efficient resource allocation in D2D communication. This variant of the standard SARSA($\lambda$) [24] algorithm has some important features: cooperation through a neighboring factor, a heuristic policy for exploration and exploitation, and a learning rate that varies with the number of visits to each state–action pair. We apply the learning algorithm to the resource allocation of D2D users, assuming that the allocation of cellular users is performed prior to the allocation of D2D users. We consider the cooperative fashion of this learning algorithm, which helps to improve the throughput as explained in Section 1, by sharing the value function and incorporating weight factors for the neighbors of each agent.
In reinforcement learning, there is no need for prior knowledge about the environment. Agents learn how to interact with the environment based on previously acquired experience, which is traced by a parameter, the Q-value, and controlled by a reward function. There is a set of actions/tasks to perform at every time step. After performing each action, the agent shifts from one state to another and receives a reward that reflects the impact of that action, which helps it decide on the next action to perform. Basic reinforcement learning is a form of Markov decision process (MDP). Figure 2 depicts the overall model of a reinforcement learning algorithm.
Each agent in RL has the following components [
25]:
Policy: The policy acts as a decision making function for the agents. All other functions/components help to improve the policy for better decision making.
Reward function: The reward function defines the ultimate goal of an agent. This helps to assign a value/number to the performed action, which indicates the intrinsic desirability of the states. The main objective of the agent is to maximize the reward function in the long run.
Value function: The value function determines the suitability of action selection in the long run. The value of a state is the accumulated reward over the long run when starting from that state.
Model: The model of the environment mimics the behavior of the environment which consists of a set of states and a set of actions.
In our model of the environment, we consider the components of the reinforcement learning algorithm as follows:
Agent: All the resource allocators, i.e., the D2D transmitters.
State: The state of D2D user
u on RB
r at time
t is defined as:
We consider three variables, ${\gamma}_{r}^{c}$, ${G}_{Bu}$ and ${G}_{uv}$, to define the states for maintaining the overall quality of the network. ${\gamma}_{r}^{c}$ is the SINR of a cellular user on the rth RB, ${G}_{Bu}$ is the channel gain between the BS and a user u, and ${G}_{uv}$ is the channel gain between two users u and v. These variables are important for resource allocation: the SINR ${\gamma}_{r}^{c}$ is an indicator of the quality of service of the network, and if the channel gains (${G}_{Bu}$ and ${G}_{uv}$) are good, it is possible to achieve higher throughput without excessively increasing the transmit power, i.e., without causing too much interference to others. On the other hand, if the channel gain is too low, higher transmit power is required, which leads to increased interference.
The state values of these variables are either 0 or 1, based on the following conditions. If the value of a variable is greater than or equal to its threshold, its state value is ‘1’; if it is less than the threshold, its state value is ‘0’. So, ${\gamma}_{r}^{c}\ge {\tau}_{0}$ means state value ‘1’ and ${\gamma}_{r}^{c}<{\tau}_{0}$ means state value ‘0’. Similarly, ${G}_{Bu}\ge {\tau}_{1}$ means state value ‘1’ and ${G}_{Bu}<{\tau}_{1}$ means state value ‘0’, and ${G}_{uv}\ge {\tau}_{2}$ means state value ‘1’ and ${G}_{uv}<{\tau}_{2}$ means state value ‘0’. Based on the combinations of these values, the total number of possible states is eight, where ${\tau}_{0}$, ${\tau}_{1}$ and ${\tau}_{2}$ are the minimum SINR and channel gains guaranteeing the QoS performance of the system.
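The three thresholded indicators can be packed into a single discrete state index. A minimal sketch of this encoding (function and variable names are ours, not from the paper):

```python
def encode_state(gamma_c, g_bu, g_uv, tau0, tau1, tau2):
    """Map the three QoS indicators to one of the 8 discrete states.

    Each indicator contributes one bit: 1 if it meets its threshold,
    0 otherwise; the three bits together index states 0..7.
    """
    bits = (int(gamma_c >= tau0), int(g_bu >= tau1), int(g_uv >= tau2))
    return (bits[0] << 2) | (bits[1] << 1) | bits[2]
```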
Action/Task: The action set of each agent consists of a set of transmitting power levels. It is denoted by

where r represents the rth Resource block (RB), and $pl$ means that every agent has $pl$ power levels. In this work, we consider power levels within the range of 1 to ${P}_{max}$ dBm in steps of 1 dBm.
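The resulting action space is the Cartesian product of RBs and power levels. A small sketch (names are ours) enumerating the (RB, power level) transmission alignments:

```python
def action_space(num_rbs, p_max_dbm=23):
    """All (RB, power level) pairs: power from 1 to P_max dBm in 1 dBm steps."""
    return [(r, p) for r in range(1, num_rbs + 1)
                   for p in range(1, p_max_dbm + 1)]

actions = action_space(num_rbs=2)  # 2 RBs x 23 levels = 46 candidate actions
```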
Reward Function: The reward function is designed around the throughput of each agent/user and is formulated as follows:

$\Re ={\log}_{2}(1+SINR(u))$

when ${\gamma}_{r}^{c}\ge {\tau}_{0}$, ${G}_{Bu}\ge {\tau}_{1}$ and ${G}_{uv}\ge {\tau}_{2}$; otherwise, $\Re =-1$. SINR(u) denotes the signal-to-interference-plus-noise power ratio of user u (Steps 7–10 in Algorithm 1).
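A direct transcription of this reward into code (a sketch; the default threshold values follow the simulation settings listed with Algorithm 1):

```python
import math

def reward(sinr_u, gamma_c, g_bu, g_uv,
           tau0=0.004, tau1=0.2512, tau2=0.2512):
    """log2(1 + SINR(u)) when all three QoS conditions hold, else -1."""
    if gamma_c >= tau0 and g_bu >= tau1 and g_uv >= tau2:
        return math.log2(1 + sinr_u)
    return -1
```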
SARSA($\lambda$) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed, where $\lambda$ is the eligibility trace-decay parameter [26]. In the SARSA learning algorithm, every agent maintains a Q matrix, initially set to 0, and the agent may start in any state. By performing a particular action, it shifts from one state to another. The basic form of the learning algorithm is $({s}_{t},{a}_{t},\Re ,{s}_{t+1},{a}_{t+1})$, which means that the agent was in state ${s}_{t}$, performed action ${a}_{t}$, received reward $\Re$, and ended up in state ${s}_{t+1}$, from which it decided to perform action ${a}_{t+1}$. Each such tuple provides a new iteration to update ${Q}_{t}({s}_{t},{a}_{t})$.
SARSA($\lambda$) helps to find appropriate actions for the states. The considered state–action pair's value function ${Q}_{t}({s}_{t},{a}_{t})$ is updated as follows:

In Equation (8), $\gamma$ is a discount factor which varies from 0 to 1; the higher the value, the more the agent relies on future rewards rather than the immediate reward. The objective of applying reinforcement learning is to find the optimal policy maximizing the value function, i.e., ${\pi}^{*}=\underset{\pi}{\arg\max}\,{Q}_{t}^{\pi}({s}_{t},{a}_{t})$. We consider the cooperative fashion of this algorithm, where the agents share the value function with each other.
At each time step, ${Q}_{t+1}$ for iteration $t+1$ is updated with the temporal difference error ${\delta}_{t}$ and the immediate received reward. The Q value has the following update rule:

${Q}_{t+1}(s,a)\leftarrow {Q}_{t}(s,a)+\alpha \,{\delta}_{t}\,{e}_{t}(s,a)$

for all s, a.
In Equation (9), $\alpha \in [0,1]$ is the learning rate, which decreases with time. ${\delta}_{t}$ is the temporal difference error, calculated by the following rule (Step 7 in Algorithm 2):

${\delta}_{t}={\Re}_{t+1}+{\gamma}_{1}\,f\,{Q}_{t+1}({s}_{t+1},{a}_{t+1})-{Q}_{t}({s}_{t},{a}_{t})$

In Equation (10), ${\gamma}_{1}$ is a discount factor which varies from 0 to 1; the higher the value, the more the agent relies on future rewards rather than the immediate reward. ${\Re}_{t+1}$ represents the reward received for performing the action.
f is the neighboring weight factor of agent i. This factor incorporates the effect of the neighbors' Q-values, which helps to update the Q-value of agent i, and is calculated as follows [27]:

where $ngh({n}_{i})$ is the number of neighbors of agent i within the D2D radius. The BS provides the number of neighbors to each agent [28].
There is a trade-off between exploration and exploitation in reinforcement learning. Exploration chooses an action randomly in order to discover the utility of that action, while exploitation chooses actions based on their previously learned utilities. We use the following heuristic for the exploration probability at any given time:

$\epsilon =min({\epsilon}_{max},\,{\epsilon}_{min}+k\cdot ({S}_{max}-S)/{S}_{max})$

where ${\epsilon}_{max}$ and ${\epsilon}_{min}$ denote the upper and lower boundaries of the exploration factor, respectively. ${S}_{max}$ represents the maximum number of states, which is eight in our work, and S represents the number of states already known [29]. At each time step, the system calculates $\epsilon$ and generates a random number in the interval $[0,1]$. If the selected random number is less than or equal to $\epsilon$, the system chooses a uniformly random task (exploration); otherwise it chooses the best task using the Q values (exploitation). k is a constant which controls the effect of unexplored states (Step 4 in Algorithm 2).
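The exploration heuristic and the resulting ε-greedy action choice can be sketched as follows (a minimal illustration using the constants initialized in Algorithm 2; function names are ours):

```python
import random

EPS_MAX, EPS_MIN, K, S_MAX = 0.3, 0.1, 0.25, 8  # constants from Algorithm 2

def exploration_prob(known_states, s_max=S_MAX):
    """epsilon = min(eps_max, eps_min + k * (S_max - S) / S_max)."""
    return min(EPS_MAX, EPS_MIN + K * (s_max - known_states) / s_max)

def choose_action(q_row, known_states):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() <= exploration_prob(known_states):
        return random.randrange(len(q_row))               # exploration
    return max(range(len(q_row)), key=q_row.__getitem__)  # exploitation
```

With no states known yet, ε is capped at ${\epsilon}_{max}$ = 0.3; once all eight states are known, it falls to ${\epsilon}_{min}$ = 0.1.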
SARSA($\lambda$) improves the learning technique through eligibility traces. In Equation (9), ${e}_{t}(s,a)$ is the eligibility trace, which is updated by the following rule:

Here, $\lambda$ is the trace-decay parameter for guaranteed convergence, and ${\gamma}_{2}$ is the discount factor. The eligibility trace gives a higher impact to revisited states: for a state–action pair $({s}_{t},{a}_{t})$, if ${s}_{t}\in s$ and ${a}_{t}\in a$, the state–action pair is reinforced; otherwise, the eligibility trace is removed (Step 8 in Algorithm 2).
The learning rate $\alpha$ is decreased in such a way that it reflects the degree to which a state–action pair has been chosen in the recent past. It is calculated as:

$\alpha =\frac{\rho}{visited(s,a)}$

where $\rho$ is a positive constant and $visited(s,a)$ represents the number of times the state–action pair has been visited so far [30] (Step 6 in Algorithm 2).
Algorithm 1: Proposed resource allocation method 
Input: ${P}_{max}$ = 23 dBm, Number of resource blocks = 30, Number of cellular users = 30, Number of D2D user pairs = 12, D2D radius = 20 m, Pathloss parameter = 3.5, Cell radius = 500 m, ${\tau}_{0}$ = 0.004, ${\tau}_{1}$ = 0.2512, ${\tau}_{2}$ = 0.2512 [9]
Output: RB–power level, system throughput
1: loop
2:  Pathloss calculation: $P{L}_{dB,B,u}={L}_{dB}(d)+{\log}_{10}({X}_{u})-{A}_{dB}(\theta )$
3:  Gain between the BS and a user: ${G}_{Bu}={10}^{-\frac{P{L}_{dB,B,u}}{10}}$
4:  Gain between two users: ${G}_{uv}={k}_{uv}{d}_{uv}^{-\alpha}$
5:  SINR of the D2D users on the rth RB: ${\gamma}_{r}^{{D}_{u}}=\frac{{p}_{r}^{{D}_{u}}\cdot {G}_{{D}_{u,r}}^{uu}}{{\sigma}^{2}+{p}_{r}^{c}\cdot {G}_{r}^{{c}_{u}}+{\sum}_{v\in {D}_{r},v\ne u}{p}_{r}^{{d}_{v}}\cdot {G}_{{D}_{v,r}}^{uv}}$
6:  SINR of the cellular users on the RB: ${\gamma}_{r}^{c}=\frac{{p}_{r}^{c}\cdot {G}_{c,r}}{{\sigma}^{2}+{\sum}_{v\in {D}_{r}}{p}_{r}^{{d}_{v}}{G}_{v,r}}$
7:  if (${\gamma}_{r}^{c}\ge {\tau}_{0}$, ${G}_{Bu}\ge {\tau}_{1}$ and ${G}_{uv}\ge {\tau}_{2}$) then
8:    $\Re ={\log}_{2}(1+SINR(u))$;
9:  else
10:   $\Re =-1$;
11:  end
12:  Apply Algorithm 2 for the power allocation
13: end loop

Algorithm 2: Cooperative SARSA($\lambda $) reinforcement learning algorithm over number of iterations. 
1: Initialize $Q(s,a)=0$, $e(s,a)=0$, ${\epsilon}_{max}=0.3$, ${\epsilon}_{min}=0.1$, $k=0.25$, $\rho =1$, $\gamma =0.9$, ${\gamma}_{1}=0.5$, $\lambda =0.5$ [17,29]
2: loop
3:  Determine the current state s based on ${\gamma}_{r}^{c}$, ${G}_{Bu}$ and ${G}_{uv}$
4:  Select a particular action a based on the policy, $\epsilon =min({\epsilon}_{max},{\epsilon}_{min}+k\cdot ({S}_{max}-S)/{S}_{max})$
5:  Execute the selected action
6:  Update the learning rate by $\alpha =\frac{\rho}{visited(s,a)}$
7:  Determine the temporal difference error by ${\delta}_{t}=\Re +{\gamma}_{1}f{Q}_{t+1}({s}_{t+1},{a}_{t+1})-{Q}_{t}({s}_{t},{a}_{t})$
8:  Update the eligibility traces
9:  Update the Q-value, ${Q}_{t+1}({s}_{t+1},{a}_{t+1})\leftarrow {Q}_{t}({s}_{t},{a}_{t})+\alpha {\delta}_{t}{e}_{t}({s}_{t},{a}_{t})$
10: Update the value function and share it with the neighbors
11: Shift to the next state based on the executed action
12: end loop

Algorithm 1 depicts the overall proposed resource allocation method. After setting the initial input parameters, the system-oriented parameters, i.e., the pathloss, channel gains, and SINR of the D2D users and cellular users on the rth RB, are calculated (Steps 2–6 in Algorithm 1). Then the reward is calculated (Step 8) and assigned when the state values satisfy the constraint in Step 7. After that, Algorithm 2 is applied for the adaptive resource allocation. Algorithm 2 shows the execution steps of our proposed reinforcement learning algorithm for resource allocation over a number of iterations.