# Distributed Spectrum Management in Cognitive Radio Networks by Consensus-Based Reinforcement Learning

## Abstract

## 1. Introduction

#### Contributions

## 2. Related Work

## 3. Problem Description and System Model

#### 3.1. Joint Spectrum Sensing and Channel Selection

- $\mathrm{SENSE}$, in which ${\mathrm{SU}}_{i}$ senses the frequency to which it is currently tuned in order to determine whether PU activity is present. A default energy-detection sensing scheme is assumed [5], with ${p}_{D}$ denoting the probability of correct detection and $1-{p}_{D}$ the probability of a sensing error;
- $\mathrm{TRANSMIT}$, in which ${\mathrm{SU}}_{i}$ tries to transmit one packet to ${\mathrm{RSU}}_{i}$, using Carrier Sense Multiple Access (CSMA) as the Media Access Control (MAC) protocol. Transmission is attempted until an acknowledgement packet is received or a maximum number of attempts ($\mathrm{MAX}\_\mathrm{ATT}$) is reached, in which case the packet is discarded;
- $\mathrm{SWITCH}$, in which the ${\mathrm{SU}}_{i}$ of the pair switches to a different licensed frequency and notifies ${\mathrm{RSU}}_{i}$ of the switch via the CSC channel.
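
The retry rule of the $\mathrm{TRANSMIT}$ action can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the value of `MAX_ATT` is hypothetical, and the CSMA exchange is replaced by a single per-attempt failure probability `per`, which is a simplification introduced here and not the MAC model of the paper.

```python
import random

MAX_ATT = 4  # hypothetical limit; the text names MAX_ATT without fixing a value here


def transmit(per, max_att=MAX_ATT, rng=random):
    """Attempt to deliver one packet from SU_i to RSU_i.

    Returns (delivered, attempts_used). Each attempt is assumed to fail
    independently with probability `per`; this stands in for collisions
    and channel errors and is an assumption of this sketch.
    """
    for attempt in range(1, max_att + 1):
        ack_received = rng.random() >= per
        if ack_received:
            return True, attempt  # acknowledgement received, stop retrying
    return False, max_att  # attempt limit reached: the packet is discarded


if __name__ == "__main__":
    delivered, attempts = transmit(per=0.1)
    print(f"delivered={delivered} after {attempts} attempt(s)")
```

With `per=0.0` the first attempt always succeeds, and with `per=1.0` the packet is always discarded after `max_att` attempts, which makes the two boundary cases easy to check.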

#### 3.2. Reinforcement Learning and Markov Decision Processes

- $\mathcal{S}$ represents a discrete set of available states; we denote the current state of an agent at a discrete time $k$ as $s(k)$,
- $\mathcal{A}$ represents a discrete set of available actions; we denote the set of actions available in state $s(k)$ as $\mathcal{A}(s(k))$,
- $R:\mathcal{S}\times \mathcal{A}\to \mathbb{R}$ is the reward function representing a numerical reward (or the average reward in the case of random rewards) received after applying an action in a certain state; let $r(k)$ denote the (possibly random) reward received by the agent while being in state $s(k)$ and executing action $a(k)\in \mathcal{A}(s(k))$,
- $T:\mathcal{S}\times \mathcal{A}\to \mathcal{S}$ is the state transition function, which indicates the next state $s(k+1)$ after executing action $a(k)\in \mathcal{A}(s(k))$ in state $s(k)$; in nondeterministic environments, $T$ instead assigns a probability to each triple of current state, action and next state, i.e., $T:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\to [0,1]$.
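
To make the nondeterministic form $T:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\to [0,1]$ concrete, the following minimal Python sketch stores $T$ as a dictionary and draws $s(k+1)$ from it. The function name `sample_next_state` and the toy two-state table are hypothetical and serve only as an illustration.

```python
import random


def sample_next_state(T, s, a, rng=random):
    """Draw s(k+1) from the distribution T(s, a, .).

    T maps (state, action, next_state) triples to probabilities;
    triples that are absent are treated as having probability 0.
    """
    successors = {s2: p for (s1, a1, s2), p in T.items()
                  if s1 == s and a1 == a and p > 0}
    total = sum(successors.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"T({s!r}, {a!r}, .) sums to {total}, expected 1")
    states = list(successors)
    weights = [successors[s2] for s2 in states]
    return rng.choices(states, weights=weights, k=1)[0]


# Toy two-state example (hypothetical, not taken from the paper).
T = {
    ("s0", "go", "s1"): 1.0,
    ("s1", "go", "s0"): 0.3,
    ("s1", "go", "s1"): 0.7,
}

if __name__ == "__main__":
    print(sample_next_state(T, "s1", "go"))
```

The deterministic case $T:\mathcal{S}\times \mathcal{A}\to \mathcal{S}$ is recovered when every pair of state and action has exactly one successor with probability 1, as for `("s0", "go")` in the toy table.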

#### 3.3. MDP Formulation of the JSS Model

- Learning agents are the ${\mathrm{SU}}_{i}$ entities, $i=1,\dots ,N$.
- We denote the current state of an agent ${\mathrm{SU}}_{i}$ at a discrete time $k$ as ${s}_{i}(k)$, $i=1,\dots ,N$.
- The set of states $\mathcal{S}$ is a set of couples $({f}_{j},{\mathrm{AVAIL}}_{j})$, where ${f}_{j}$ is a frequency from the set $F$ (of $K$ licensed frequencies or channels) and ${\mathrm{AVAIL}}_{j}\in \{\mathrm{IDLE},\mathrm{BUSY},\mathrm{UNKNOWN}\}$ represents the sensed state of availability of channel $j$.
- The set of actions for an agent $i$, i.e., ${\mathrm{SU}}_{i}$, is ${A}_{i}=\{\mathrm{SENSE},\mathrm{TRANSMIT},{\mathrm{SWITCH}}_{{f}_{1}},\dots ,{\mathrm{SWITCH}}_{{f}_{j}},\dots ,{\mathrm{SWITCH}}_{{f}_{K}}\}$, $i=1,\dots ,N$, $j=1,\dots ,K$, where ${\mathrm{SWITCH}}_{{f}_{j}}$ is the action of switching to the frequency ${f}_{j}$; an agent cannot switch to the frequency it is currently tuned to.
- The reward function is $R:\mathcal{S}\times \mathcal{A}\to \mathrm{PMF}[0,1]$, where $\mathrm{PMF}[0,1]$ is a probability mass function on $[0,1]$; it is defined according to the actions taken and states encountered, i.e.,$$\begin{aligned} R(({f}_{j},*),\mathrm{SENSE}) &= \zeta , && \text{if } {f}_{j} \text{ is found } \mathrm{IDLE},\\ R(({f}_{j},*),\mathrm{SENSE}) &= 0, && \text{if } {f}_{j} \text{ is found } \mathrm{BUSY},\\ R(({f}_{j},\mathrm{IDLE}),\mathrm{TRANSMIT}) &= 1-\frac{\#\mathrm{retrans}}{\mathrm{MAX}\_\mathrm{ATT}}, &&\\ R(({f}_{j},\mathrm{UNKNOWN}),\mathrm{SWITCH}) &= 0. && \end{aligned}$$
- The state transition function $T:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\to [0,1]$ is defined as:$$\begin{aligned} T(({f}_{j},*),\mathrm{SENSE},({f}_{j},\mathrm{IDLE})) &= {\alpha}_{j}/({\alpha}_{j}+{\beta}_{j}),\\ T(({f}_{j},*),\mathrm{SENSE},({f}_{j},\mathrm{BUSY})) &= {\beta}_{j}/({\alpha}_{j}+{\beta}_{j}),\\ T(({f}_{j},\mathrm{IDLE}),\mathrm{TRANSMIT},({f}_{j},\mathrm{IDLE})) &= 1,\\ T(({f}_{j},*),\mathrm{SWITCH},({f}_{k},\mathrm{UNKNOWN})) &= 1. \end{aligned}$$The state transition function value is 0 for all other argument values. Note that, as is often the case in practice, the channel state changes from IDLE to BUSY (or from BUSY to IDLE) on a time scale much slower than that of the SUs' learning; between two such changes, the probabilities of sensing IDLE or BUSY can therefore be set to either 0 or 1. See Section 5 for more details.
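
The reward and transition definitions above translate almost line by line into code. The sketch below is a hedged illustration rather than the authors' implementation: the encoding of states as `(channel, availability)` tuples, the encoding of ${\mathrm{SWITCH}}_{{f}_{k}}$ as `("SWITCH", k)`, the default values of `zeta` and `max_att`, and the example `alpha` and `beta` values are all assumptions made here. State and action pairs that the reward definition does not list are assumed to give reward 0.

```python
IDLE, BUSY, UNKNOWN = "IDLE", "BUSY", "UNKNOWN"
SENSE, TRANSMIT = "SENSE", "TRANSMIT"


def transition_prob(state, action, next_state, alpha, beta):
    """T(state, action, next_state) for the JSS model; 0 for unlisted cases."""
    f, avail = state
    f_next, avail_next = next_state
    if action == SENSE:
        if f_next != f:
            return 0.0  # sensing never changes the tuned frequency
        total = alpha[f] + beta[f]
        if avail_next == IDLE:
            return alpha[f] / total
        if avail_next == BUSY:
            return beta[f] / total
        return 0.0
    if action == TRANSMIT:
        return 1.0 if avail == IDLE and next_state == state else 0.0
    if isinstance(action, tuple) and action[0] == "SWITCH":
        target = action[1]
        # Switching to the currently tuned frequency is not allowed.
        return 1.0 if target != f and next_state == (target, UNKNOWN) else 0.0
    return 0.0


def reward(state, action, found_idle=False, retrans=0, max_att=4, zeta=0.1):
    """One realisation of the reward; zeta and max_att are hypothetical values."""
    _, avail = state
    if action == SENSE:
        return zeta if found_idle else 0.0
    if action == TRANSMIT and avail == IDLE:
        return 1.0 - retrans / max_att
    return 0.0  # SWITCH, and (by assumption) every case not listed in the text


# Illustrative parameters for two channels (hypothetical values).
alpha = {1: 3.0, 2: 1.0}
beta = {1: 1.0, 2: 1.0}

if __name__ == "__main__":
    print(transition_prob((1, UNKNOWN), SENSE, (1, IDLE), alpha, beta))
    print(reward((1, IDLE), TRANSMIT, retrans=1))
```

A quick sanity check on any such encoding is that, for each channel, the two $\mathrm{SENSE}$ probabilities add up to 1, since ${\alpha}_{j}/({\alpha}_{j}+{\beta}_{j})+{\beta}_{j}/({\alpha}_{j}+{\beta}_{j})=1$.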

## 4. Consensus-Based Distributed Joint Spectrum Sensing and Selection

#### 4.1. Distributed Consensus-Based Policy Evaluation

Algorithm 1: Distributed consensus-based policy evaluation.

#### 4.2. Distributed Consensus-Based Q-Learning

#### 4.3. Convergence Rate and Complexity Analysis

Algorithm 2: Distributed consensus-based Q-learning.

## 5. Simulations

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
CRN | Cognitive Radio Network |
QoS | Quality of Service |
CM | Cognition Module |
DSA | Dynamic Spectrum Access |
SU | Secondary User |
PU | Primary User |
CR | Cognitive Radio |
RL | Reinforcement Learning |
MDP | Markov Decision Process |
MARL | Multi-agent Reinforcement Learning |
DQN | Deep Q-Network |
TD | Temporal Difference |
JSS | Joint Spectrum Sensing and (channel) Selection |
DCS | Dynamic Channel Selection |
CSC | Control Signalling Communication |
PER | Packet Error Rate |
CSMA | Carrier Sense Multiple Access |
MAC | Media Access Control |

## References

1. Lo, B.F.; Akyildiz, I.F. Reinforcement learning-based cooperative sensing in cognitive radio ad hoc networks. In Proceedings of the 21st Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, Istanbul, Turkey, 26–30 September 2010; pp. 2244–2249.
2. Di Felice, M.; Bedogni, L.; Bononi, L. Reinforcement Learning-Based Spectrum Management for Cognitive Radio Networks: A Literature Review and Case Study. In Handbook of Cognitive Radio; Springer: Singapore, 2019; Volume 3, pp. 1849–1886.
3. Yu, H.; Zikria, Y.B. Cognitive Radio Networks for Internet of Things and Wireless Sensor Networks. Sensors **2020**, 20, 5288.
4. Beko, M. Efficient Beamforming in Cognitive Radio Multicast Transmission. IEEE Trans. Wirel. Commun. **2012**, 11, 4108–4117.
5. Yucek, T.; Arslan, H. A survey of spectrum sensing algorithms for cognitive radio applications. IEEE Commun. Surv. Tutorials **2009**, 11, 116–130.
6. Jondral, F.K. Software-Defined Radio—Basics and Evolution to Cognitive Radio. EURASIP J. Wirel. Commun. Netw. **2005**, 2005, 275–283.
7. Wang, W.; Kwasinski, A.; Niyato, D.; Han, Z. A Survey on Applications of Model-Free Strategy Learning in Cognitive Wireless Networks. IEEE Commun. Surv. Tutorials **2016**, 18, 1717–1757.
8. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018.
9. Sutton, R.S.; Maei, H.R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 993–1000.
10. Geist, M.; Scherrer, B. Off-policy Learning With Eligibility Traces: A Survey. J. Mach. Learn. Res. **2014**, 15, 289–333.
11. Stanković, M.S.; Beko, M.; Stanković, S.S. Distributed Gradient Temporal Difference Off-policy Learning With Eligibility Traces: Weak Convergence. In Proceedings of the IFAC World Congress, Berlin, Germany, 11–17 July 2020.
12. Stanković, M.S.; Beko, M.; Stanković, S.S. Distributed Value Function Approximation for Collaborative Multi-Agent Reinforcement Learning. IEEE Trans. Control Netw. Syst. **2021**.
13. Tian, Z.; Wang, J.; Wang, J.; Song, J. Distributed NOMA-Based Multi-Armed Bandit Approach for Channel Access in Cognitive Radio Networks. IEEE Wirel. Commun. Lett. **2019**, 8, 1112–1115.
14. Modi, N.; Mary, P.; Moy, C. QoS Driven Channel Selection Algorithm for Cognitive Radio Network: Multi-User Multi-Armed Bandit Approach. IEEE Trans. Cogn. Commun. Netw. **2017**, 3, 49–66.
15. Kuleshov, V.; Precup, D. Algorithms for multi-armed bandit problems. arXiv **2014**, arXiv:1402.6028.
16. Busoniu, L.; Babuska, R.; De Schutter, B. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) **2008**, 38, 156–172.
17. Zhang, K.; Yang, Z.; Basar, T. Decentralized Multi-Agent Reinforcement Learning with Networked Agents: Recent Advances. arXiv **2019**, arXiv:1912.03821.
18. Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; Basar, T. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents. arXiv **2018**, arXiv:1802.08757.
19. Kar, S.; Moura, J.M.; Poor, H.V. QD-Learning: A Collaborative Distributed Strategy for Multi-Agent Reinforcement Learning Through Consensus+Innovations. IEEE Trans. Signal Process. **2013**, 61, 1848–1862.
20. Macua, S.V.; Chen, J.; Zazo, S.; Sayed, A.H. Distributed policy evaluation under multiple behavior strategies. IEEE Trans. Autom. Control **2015**, 60, 1260–1274.
21. Dašić, D.; Vučetić, M.; Perić, M.; Beko, M.; Stanković, M. Cooperative Multi-Agent Reinforcement Learning for Spectrum Management in IoT Cognitive Networks. In Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, Biarritz, France, 30 June–3 July 2020.
22. Kaur, A.; Kumar, K. Intelligent spectrum management based on reinforcement learning schemes in cooperative cognitive radio networks. Phys. Commun. **2020**, 43, 101226.
23. Wu, C.; Chowdhury, K.; Di Felice, M.; Meleis, W. Spectrum Management of Cognitive Radio Using Multi-Agent Reinforcement Learning; International Foundation for Autonomous Agents and Multiagent Systems: Richland, SC, USA, 2010; pp. 1705–1712.
24. Jiang, T.; Grace, D.; Mitchell, P.D. Efficient exploration in reinforcement learning-based cognitive radio spectrum sharing. IET Commun. **2011**, 5, 1309–1317.
25. Mustapha, I.; Ali, B.M.; Sali, A.; Rasid, M.; Mohamad, H. An energy efficient Reinforcement Learning based Cooperative Channel Sensing for Cognitive Radio Sensor Networks. Pervasive Mob. Comput. **2017**, 35, 165–184.
26. Ning, W.; Huang, X.; Yang, K.; Wu, F.; Leng, S. Reinforcement learning enabled cooperative spectrum sensing in cognitive radio networks. J. Commun. Netw. **2020**, 22, 12–22.
27. Kaur, A.; Kumar, K. Imperfect CSI based Intelligent Dynamic Spectrum Management using Cooperative Reinforcement Learning Framework in Cognitive Radio Networks. IEEE Trans. Mob. Comput. **2020**.
28. Jang, S.J.; Han, C.H.; Lee, K.E.; Yoo, S.J. Reinforcement learning-based dynamic band and channel selection in cognitive radio ad-hoc networks. EURASIP J. Wirel. Commun. Netw. **2019**, 2019, 131.
29. Naparstek, O.; Cohen, K. Deep Multi-User Reinforcement Learning for Distributed Dynamic Spectrum Access. IEEE Trans. Wirel. Commun. **2019**, 18, 310–323.
30. Wang, S.; Liu, H.; Gomes, P.H.; Krishnamachari, B. Deep Reinforcement Learning for Dynamic Multichannel Access in Wireless Networks. IEEE Trans. Cogn. Commun. Netw. **2018**, 4, 257–265.
31. Raj, V.; Dias, I.; Tholeti, T.; Kalyani, S. Spectrum Access In Cognitive Radio Using a Two-Stage Reinforcement Learning Approach. IEEE J. Sel. Top. Signal Process. **2018**, 12, 20–34.
32. Lin, Y.; Wang, C.; Wang, J.; Dou, Z. A Novel Dynamic Spectrum Access Framework Based on Reinforcement Learning for Cognitive Radio Sensor Networks. Sensors **2016**, 16, 1675.
33. Yang, P.; Li, L.; Yin, J.; Zhang, H.; Liang, W.; Chen, W.; Han, Z. Dynamic Spectrum Access in Cognitive Radio Networks Using Deep Reinforcement Learning and Evolutionary Game. In Proceedings of the 2018 IEEE/CIC International Conference on Communications in China (ICCC), Beijing, China, 16–18 August 2018; pp. 405–409.
34. Li, Z.; Yu, F.R.; Huang, M. A Distributed Consensus-Based Cooperative Spectrum-Sensing Scheme in Cognitive Radios. IEEE Trans. Veh. Technol. **2010**, 59, 383–393.
35. Lee, W.Y.; Akyildiz, I.F. Optimal spectrum sensing framework for cognitive radio networks. IEEE Trans. Wirel. Commun. **2008**, 7, 3845–3857.
36. Di Felice, M.; Chowdhury, K.R.; Wu, C.; Bononi, L.; Meleis, W. Learning-based spectrum selection in Cognitive Radio Ad Hoc Networks. In Proceedings of the Wired/Wireless Internet Communications, Lulea, Sweden, 1–3 June 2010.
37. Ali, K.O.A.; Ilić, N.; Stanković, M.S.; Stanković, S.S. Distributed target tracking in sensor networks using multi-step consensus. IET Radar Sonar Navig. **2018**, 12, 998–1004.
38. Boyd, S.; Ghosh, A.; Prabhakar, B.; Shah, D. Randomized gossip algorithms. IEEE Trans. Inf. Theory **2006**, 52, 2508–2530.
39. Olshevsky, A.; Tsitsiklis, J.N. Convergence Rates in Distributed Consensus and Averaging. In Proceedings of the 45th IEEE Conference on Decision and Control, San Diego, CA, USA, 13–15 December 2006; pp. 3387–3392.
40. Kushner, H.J.; Yin, G. Asymptotic properties of distributed and communicating stochastic approximation algorithms. SIAM J. Control Optim. **1987**, 25, 1266–1290.
41. Stanković, M.S.; Ilić, N.; Stanković, S.S. Distributed Stochastic Approximation: Weak Convergence and Network Design. IEEE Trans. Autom. Control **2016**, 61, 4069–4074.
42. Sutton, R.S.; Mahmood, A.R.; White, M. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning. J. Mach. Learn. Res. **2016**, 17, 1–29.
43. Bhandari, J.; Russo, D.; Singal, R. A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation. In Proceedings of the 31st Conference On Learning Theory, Stockholm, Sweden, 6–9 July 2018; Volume 75, pp. 1691–1692.
44. Saleem, Y.; Rehmani, M.H. Primary radio user activity models for cognitive radio networks: A survey. J. Netw. Comput. Appl. **2014**, 43, 1–16.
45. Arjoune, Y.; Kaabouch, N. A Comprehensive Survey on Spectrum Sensing in Cognitive Radio Networks: Recent Advances, New Challenges, and Future Research Directions. Sensors **2019**, 19, 126.

**Figure 2.** Value function estimations for all the states (different plots) and all the agents (different lines). Intervals during which the channels are busy are shown in shades of red.

**Figure 3.** Error statistics of state-value function estimations for different algorithms, versus number of iterations.

**Figure 4.** Value function estimations of all agents (different lines) for a state that can be visited by only three agents. Intervals when the corresponding channel is busy are shown in shades of red.

**Figure 5.** Action-value function estimations for all the possible actions from three states corresponding to a single channel (different plots with corresponding Action|State pairs given in the titles) and all agents (differently colored lines). Intervals when the considered single channel is busy are shown in shades of red. Intervals when the channels corresponding to destinations of switching actions are busy are shown in shades of gray.

**Figure 6.** Total number of successful and failed transmission events in the network for different algorithms, versus number of iterations.

Authors | Problem Treated | Local Processing | Cooperative Scheme |
---|---|---|---|
C. Wu et al. [23] | Channel and power level selection | Q-learning | Centralized |
T. Jiang et al. [24] | Efficient exploration in spectrum sharing | Linear value function approximation | None |
Y. Tian et al. [13] | Channel and power level selection | MAB with policy index (energy efficiency) | None |
N. Modi et al. [14] | Channel selection | MAB with learning policy (quality metric reflecting interference) | None |
B.F. Lo, I.F. Akyildiz [1] | Cooperation overhead | Q-learning | Cooperative sensing |
I. Mustapha et al. [25] | Cooperative channel sensing | Q-learning | Cluster heads aggregation |
W. Ning et al. [26] | Cooperative spectrum sensing | Q-learning | Partner selection algorithm |
A. Kaur, K. Kumar [27] | Imperfect Channel State Information-based spectrum management | Q-learning | Assisted with cloud computing |
S-J. Jang et al. [28] | Dynamic band and channel selection | Q-learning | Centralized |
O. Naparstek, K. Cohen [29] | Distributed dynamic spectrum access | Deep Q-learning with LSTM | Centralized |
S. Wang et al. [30] | Dynamic multichannel access | Deep Q-learning with Experience Replay | None |
V. Raj et al. [31] | Channel selection and availability prediction | MAB setting and Bayesian learning | None |
Y. Lin et al. [32] | Dynamic spectrum access, power allocation | Hybrid Spectrum Access Algorithm | None |
A. Kaur, K. Kumar [22] | Dynamic spectrum management | Comparison-based Cooperative Q-Learning (CCopQL) and SARSA | Decentralized resource allocation |
P. Yang et al. [33] | Dynamic spectrum access | Deep Q-learning | Balanced via replicator dynamic using evolutionary game theory |

Channel | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
$\alpha $ | 10 | 5 | 2 | 10 | 5 | 2 |
$\beta $ | 2 | 5 | 10 | 2 | 5 | 10 |
PER | 0.5 | 0.5 | 0.5 | 0.1 | 0.1 | 0.1 |

State/Action | SENSE | TRANSMIT | ${\mathbf{SWITCH}}_{{\mathit{f}}_{\mathit{j}}}$ |
---|---|---|---|
$({f}_{i},\mathrm{IDLE})$ | 1/3 | 1/3 | 1/15 |
$({f}_{i},\mathrm{BUSY})$ | 1/2 | 0 | 1/10 |

State/Action | SENSE | TRANSMIT | ${\mathbf{SWITCH}}_{{\mathit{f}}_{\mathit{j}}}$ | ${\mathbf{SWITCH}}_{{\mathit{f}}_{\mathit{k}}}$ |
---|---|---|---|---|
$({f}_{i},\mathrm{IDLE})$ | 1/3 | 1/3 | 5/18 | 1/72 |
$({f}_{i},\mathrm{BUSY})$ | 1/2 | 0 | 5/12 | 1/48 |
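
The entries of the two State/Action tables can be checked with exact arithmetic. The sketch below rests on several assumptions that the tables themselves do not state: that the entries are action-selection probabilities, that there are $K=6$ channels as in the channel-parameter table, that the single SWITCH column of the first table applies to each of the $K-1$ other channels, and that in the second table one target ${f}_{j}$ takes the larger value while each of the remaining $K-2$ targets ${f}_{k}$ takes the smaller one. Under this reading, every row sums to 1.

```python
from fractions import Fraction as F

K = 6  # number of channels (assumption, taken from the six-channel tables)

# Rows of the first table: (SENSE, TRANSMIT, SWITCH to each other channel).
first = {
    "IDLE": (F(1, 3), F(1, 3), F(1, 15)),
    "BUSY": (F(1, 2), F(0), F(1, 10)),
}

# Rows of the second table: (SENSE, TRANSMIT, SWITCH_fj, SWITCH_fk).
second = {
    "IDLE": (F(1, 3), F(1, 3), F(5, 18), F(1, 72)),
    "BUSY": (F(1, 2), F(0), F(5, 12), F(1, 48)),
}


def total_first(row):
    """Row total when the SWITCH value applies to all K - 1 other channels."""
    sense, transmit, switch_each = row
    return sense + transmit + (K - 1) * switch_each


def total_second(row):
    """Row total with one preferred target f_j and K - 2 other targets f_k."""
    sense, transmit, switch_fj, switch_fk = row
    return sense + transmit + switch_fj + (K - 2) * switch_fk


if __name__ == "__main__":
    for name, row in first.items():
        print("first table,", name, "row total:", total_first(row))
    for name, row in second.items():
        print("second table,", name, "row total:", total_second(row))
```

For example, the IDLE row of the second table gives $1/3+1/3+5/18+4\cdot 1/72=1$, which is what one would expect if each row is a probability distribution over the available actions.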

**Table 5.** Different channels’ contribution to the total number of successful and failed transmissions in the network.

Channel | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Successful transmissions | 1440 | 2263 | 2885 | 115 | 11,739 | 45,678 |
Failed transmissions | 75 | 136 | 8 | 74 | 94 | 85 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dašić, D.; Ilić, N.; Vučetić, M.; Perić, M.; Beko, M.; Stanković, M.S.
Distributed Spectrum Management in Cognitive Radio Networks by Consensus-Based Reinforcement Learning. *Sensors* **2021**, *21*, 2970.
https://doi.org/10.3390/s21092970

**AMA Style**

Dašić D, Ilić N, Vučetić M, Perić M, Beko M, Stanković MS.
Distributed Spectrum Management in Cognitive Radio Networks by Consensus-Based Reinforcement Learning. *Sensors*. 2021; 21(9):2970.
https://doi.org/10.3390/s21092970

**Chicago/Turabian Style**

Dašić, Dejan, Nemanja Ilić, Miljan Vučetić, Miroslav Perić, Marko Beko, and Miloš S. Stanković.
2021. "Distributed Spectrum Management in Cognitive Radio Networks by Consensus-Based Reinforcement Learning" *Sensors* 21, no. 9: 2970.
https://doi.org/10.3390/s21092970