# muMAB: A Multi-Armed Bandit Model for Wireless Network Selection


## Abstract


## 1. Introduction

## 2. Measure and Use Differentiation in Multi-Armed Bandit

#### 2.1. The muMAB Model

- time is divided into steps with a duration of T, and the time horizon is defined as ${T}_{TH}={n}_{TH}T$;
- there is one player and there are $\mathcal{K}$ arms;
- a reward is associated with the generic k-th arm, $k\in \{1,\cdots ,\mathcal{K}\}$; $\forall k\in \mathcal{K}$, the reward $\left\{{W}_{k}\left(n\right):n\in N\right\}$ is a stationary ergodic random process associated with arm k, with statistics not known a priori; given a time step n, ${W}_{k}\left(n\right)$ is thus a random variable taking values in the set of non-negative real numbers ${\Re}^{+}$, with unknown Probability Density Function (PDF); the mean value of ${W}_{k}\left(n\right)$ is defined as ${\mu}_{k}=E\left[{W}_{k}\left(n\right)\right]$;
- there are two distinct actions: to measure (“m”) and to use (“u”). At the beginning of time step n, the player can choose to apply action a to arm k; the choice ${c}_{n}$ is represented by a pair:$${c}_{n}=\left({a}_{n},{k}_{n}\right),{a}_{n}\in \left\{m,u\right\},{k}_{n}\in \mathcal{K},$$
- feedback $f\left({c}_{n}\right)$ is a pair, composed of:
- a realization of ${W}_{k}\left(n\right)$ at time step n, ${w}_{k}\left(n\right)$, that is the current reward value associated with arm k;
- a gain $g\left({c}_{n}\right)$;

therefore:$$f\left({c}_{n}\right)=\left({w}_{k}\left(n\right),g\left({c}_{n}\right)\right);$$
- measure and use actions have durations ${T}_{M}$ and ${T}_{U}$, respectively, defined as ${T}_{M}={n}_{M}T$ and ${T}_{U}={n}_{U}T$, where ${n}_{M},\phantom{\rule{0.166667em}{0ex}}{n}_{U}\in N$. As a result, if at time step n the player chooses the measure (respectively, use) action, i.e., ${a}_{n}=m$ (${a}_{n}=u$), the next ${n}_{M}$ (${n}_{U}$) steps are "occupied" and the next choice can be taken at time step ${n}^{\prime}=n+{n}_{M}$ (${n}^{\prime}=n+{n}_{U}$). The gain $g\left({c}_{n}\right)$ is a function of both the selected action and ${W}_{k}\left(n\right)$; it is always equal to zero when the measure action is selected, while it is the sum of the realizations of ${W}_{k}\left(n\right)$ from time step n to ${n}^{\prime}=n+{n}_{U}$ when arm k is used at time step n:$$g\left({c}_{n}\right)=\left\{\begin{array}{cc}0,\hfill & \mathrm{if}\phantom{\rule{0.277778em}{0ex}}{a}_{n}=m,\\ \sum _{i=n}^{n+{n}_{U}}{w}_{k}\left(i\right),\hfill & \mathrm{if}\phantom{\rule{0.277778em}{0ex}}{a}_{n}=u;\end{array}\right.$$
- the performance of an algorithm is measured by the regret of not always using the arm with the highest reward mean value ${k}^{\ast}$:$${k}^{\ast}=\underset{k\in \mathcal{K}}{arg\; max}{\mu}_{k};$$$$R\left(n\right)={G}_{MAX}\left(n\right)-E\left[G\left(n\right)\right],$$$$G\left(n\right)=\sum _{i=1}^{n}g\left({c}_{i}\right),$$$${G}_{MAX}\left(n\right)=E\left[G\left(n\right)\right]:{c}_{i}=\left(u,{k}^{\ast}\right);$$
- the goal is to find an algorithm that minimizes the growth of regret over time.
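The choice/gain mechanics above can be sketched in a few lines of Python. The function and variable names are illustrative, not from the paper; only the gain rule and the locking periods follow the model definition.

```python
# Sketch of the muMAB action model: a choice c_n = (action, arm) yields zero
# gain for "measure" and the sum of reward realizations w_k(i), i = n..n+n_U,
# for "use"; either action locks the player for n_M or n_U steps.

def play_choice(action, arm_rewards, n, n_m=1, n_u=5):
    """Return (gain, next_decision_step) for a choice taken at step n.

    arm_rewards is the sequence of realizations w_k(i) of the chosen arm.
    """
    if action == "m":
        return 0.0, n + n_m                      # measuring yields no gain
    gain = sum(arm_rewards[n:n + n_u + 1])       # g(c_n) = sum of w_k(i), i = n..n+n_U
    return gain, n + n_u                         # next choice at n' = n + n_U
```

Note that the model's locking rule is what differentiates the two actions economically: measuring costs n_M gain-free steps, while using commits the player to an arm for n_U steps whether or not it turns out to be good.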

- it introduces two actions, measure and use, in place of the use action considered in the classical model;
- as a result of each action, it provides feedback composed of two parts: the values of the rewards on the selected arm, and a gain depending on the selected action;
- it introduces the concept of locking the player on an arm after it is selected for measuring or using, with different locking periods depending on the selected action (measure vs. use).

#### 2.2. Algorithms

#### 2.2.1. muUCB1

- the reward mean value estimate of the arm;
- a bias that eventually allows the index of an arm with a low reward mean value to grow large enough for the arm to be selected again.
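The excerpt does not reproduce the exact index formula; assuming muUCB1 inherits the classical UCB1 bias of Auer et al. [16], the two-part index described above could be sketched as:

```python
import math

# Hypothetical sketch of a UCB1-style index: empirical mean estimate plus an
# exploration bias that grows for rarely-sampled arms, so even a low-mean arm
# is eventually selected again. The exact muMAB variant may differ.

def ucb1_index(mean_estimate, times_sampled, total_samples):
    if times_sampled == 0:
        return float("inf")  # force every arm to be sampled at least once
    bias = math.sqrt(2.0 * math.log(total_samples) / times_sampled)
    return mean_estimate + bias
```

At each decision step the player would compute this index for every arm and act on the arm with the largest value; the bias term shrinks as an arm accumulates samples, shifting the balance from exploration to exploitation.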

#### 2.2.2. MLI

Algorithm 1: muUCB1.

Algorithm 2: MLI.
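The excerpt gives only the algorithm's name (Measure with Logarithmic Interval) and, in Section 3, the parameters ${d}_{1}$ (measurements per arm in Phase 1) and ${d}_{2}$ (use actions before the first interleaved measure). The following is a purely illustrative sketch of a measure schedule whose spacing grows geometrically, so that the number of measurements grows only logarithmically with the horizon; the doubling rule is an assumption, not the paper's exact rule.

```python
# Hypothetical MLI-style schedule: after Phase 1 (each arm measured d1 times),
# measure actions are interleaved after d2, then 2*d2, then 4*d2, ... use
# actions, so measurements become sparser as the estimates stabilize.

def measure_steps(d2=5, n_uses=100):
    """Indices of use actions after which a measure action is scheduled."""
    steps, interval, t = [], d2, d2
    while t <= n_uses:
        steps.append(t)
        interval *= 2          # assumed doubling rule (logarithmic density)
        t += interval
    return steps
```

With the Section 3 setting ${d}_{2}=5$ and 100 use actions, this sketch would schedule measures after use actions 5, 15, 35, and 75.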

#### 2.3. muMAB Complexity and Discussion

## 3. Performance Evaluation: Settings

- the number of steps required to reach the time horizon was set to ${n}_{TH}={10}^{5}$;
- $\mathcal{K}=5$ arms were considered;
- the value of ${n}_{M}$ was set to 1; therefore, ${T}_{M}=T$; the value of ${n}_{U}$ was variable, leading to different ${T}_{U}/{T}_{M}$ ratios being considered;
- for the $\epsilon $-greedy algorithm, $\epsilon $ was set to $0.1$, following the results presented in [17], which indicate this value as the one leading to the best performance;
- for the MLI algorithm, the number of times every arm is measured in Phase 1 was set to ${d}_{1}=5$; ${d}_{2}$, i.e., the number of use actions after which the first measure action is performed, was also set to 5;
- all results were averaged over 500 runs.

#### 3.1. Synthetic Data

- Bernoulli distribution;
- truncated (to non-negative values) Gaussian distribution;
- exponential distribution.

- Hard configuration: ${\mu}_{1}=0.6$, ${\mu}_{2}=0.8$, ${\mu}_{3}=0.1$, ${\mu}_{4}=0.3$, ${\mu}_{5}=0.7$;
- Easy configuration: ${\mu}_{1}=0.2$, ${\mu}_{2}=0.8$, ${\mu}_{3}=0.1$, ${\mu}_{4}=0.3$, ${\mu}_{5}=0.1$.
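As a sketch of how such synthetic rewards could be generated: the distribution names and the Hard-configuration means follow Section 3.1, while the sampling code itself (and in particular the standard deviation of the truncated Gaussian) is an illustrative assumption.

```python
import random

HARD_MEANS = [0.6, 0.8, 0.1, 0.3, 0.7]  # Hard configuration, Section 3.1

def draw_reward(mu, dist="bernoulli", rng=random):
    """Draw one reward realization w_k(n) with mean mu."""
    if dist == "bernoulli":
        return 1.0 if rng.random() < mu else 0.0
    if dist == "gaussian":                    # truncated to non-negative values
        return max(0.0, rng.gauss(mu, 0.1))  # std. dev. 0.1 is an assumption
    if dist == "exponential":
        return rng.expovariate(1.0 / mu)     # exponential with mean mu
    raise ValueError(dist)
```

The Hard configuration is harder because the best arm ($\mu_2=0.8$) is closely trailed by two others (0.7 and 0.6), so more measurements are needed to separate them reliably.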

#### 3.2. Real Data

- (1) $r\left(i,j\right)={l}_{max}-l\left(i,j\right)$ (linear conversion);
- (2) $r\left(i,j\right)=\frac{log\left({l}_{max}\right)-log\left(l\left(i,j\right)\right)}{log\left({l}_{max}\right)}$ (logarithmic conversion),
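Both conversions can be written directly from the formulas above; here `latency` plays the role of $l\left(i,j\right)$ and `l_max` the role of ${l}_{max}$, the maximum observed latency.

```python
import math

def linear_reward(latency, l_max):
    """Linear conversion: r(i, j) = l_max - l(i, j)."""
    return l_max - latency

def log_reward(latency, l_max):
    """Logarithmic conversion: r(i, j) = (log(l_max) - log(l(i, j))) / log(l_max)."""
    return (math.log(l_max) - math.log(latency)) / math.log(l_max)
```

The logarithmic form maps the latency range onto values in $[0,1]$ and compresses the penalty for large latencies, whereas the linear form keeps reward differences proportional to latency differences.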

## 4. Performance Evaluation: Results

#### 4.1. Synthetic Data-Hard Configuration

#### 4.2. Synthetic Data-Easy Configuration

#### 4.3. Real Data-Linear Conversion

#### 4.4. Real Data-Logarithmic Conversion

#### 4.5. Discussion of Results

## 5. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Abbreviations

BER | Bit Error Rate |

LAN | Local Area Network |

MAB | Multi-Armed Bandit |

MAC | Medium Access Control |

MLI | Measure with Logarithmic Interval |

muUCB1 | measure-use-UCB1 |

PDF | Probability Density Function |

POKER | Price of Knowledge and Estimated Reward |

QoE | Quality of Experience |

QoS | Quality of Service |

RAT | Radio Access Technology |

RSSI | Received Signal Strength Indicator |

SIR | Signal-to-Interference Ratio |

SNR | Signal-to-Noise Ratio |

## References

- 5G: A Technology Vision. 4 November 2013. Available online: http://www.huawei.com/5gwhitepaper/ (accessed on 24 January 2018).
- Matinmikko, M.; Roivainen, A.; Latva-aho, M.; Hiltunen, K. Interference Study of Micro Licensing for 5G Micro Operator Small Cell Deployments. In Proceedings of the 12th EAI International Conference on Cognitive Radio Oriented Wireless Networks (CROWNCOM), Lisbon, Portugal, 20–22 September 2017.
- Trestian, R.; Ormond, O.; Muntean, G.M. Game Theory-Based Network Selection: Solutions and Challenges. IEEE Commun. Surv. Tutor. 2012, 14, 1212–1231.
- Wang, L.; Kuo, G.S. Mathematical Modeling for Network Selection in Heterogeneous Wireless Networks—A Tutorial. IEEE Commun. Surv. Tutor. 2013, 15, 271–292.
- Lee, W.; Cho, D.H. Enhanced Group Handover Scheme in Multiaccess Networks. IEEE Trans. Veh. Technol. 2011, 60, 2389–2395.
- Farrugia, R.A.; Galea, C.; Zammit, S.; Muscat, A. Objective Video Quality Metrics for HDTV Services: A Survey. EuroCon 2013.
- Boldrini, S.; Di Benedetto, M.G.; Tosti, A.; Fiorina, J. Automatic Best Wireless Network Selection Based on Key Performance Indicators. In Cognitive Radio and Networking for Heterogeneous Wireless Networks; Di Benedetto, M.G., Cattoni, A.F., Fiorina, J., Bader, F., De Nardis, L., Eds.; Signals and Communication Technology; Springer: Berlin, Germany, 2015; pp. 201–214.
- Tsiropoulou, E.E.; Katsinis, G.K.; Filios, A.; Papavassiliou, S. On the Problem of Optimal Cell Selection and Uplink Power Control in Open Access Multi-service Two-Tier Femtocell Networks. In Proceedings of the 13th International Conference on Ad-Hoc Networks and Wireless (ADHOC-NOW 2014), Benidorm, Spain, 22–27 June 2014; Springer: Berlin, Germany, 2014; Volume 8487.
- Vamvakas, P.; Tsiropoulou, E.E.; Papavassiliou, S. Dynamic provider selection and power resource management in competitive wireless communication markets. Mob. Netw. Appl. 2017, 1–14.
- Malanchini, I.; Cesana, M.; Gatti, N. Network Selection and Resource Allocation Games for Wireless Access Networks. IEEE Trans. Mobile Comput. 2013, 12, 2427–2440.
- Yang, Y.H.; Chen, Y.; Jiang, C.; Wang, C.Y.; Ray Liu, K.J. Wireless Access Network Selection Game with Negative Network Externality. IEEE Trans. Wirel. Commun. 2013, 12, 5048–5060.
- Whittle, P. Multi-armed bandits and the Gittins index. J. R. Stat. Soc. Ser. B 1980, 42, 143–149.
- Gittins, J.C. Multi-Armed Bandit Allocation Indices; John Wiley & Sons: Hoboken, NJ, USA, 1989.
- Hero, A.; Castanon, D.; Cochran, D.; Kastella, K. (Eds.) Multi-Armed Bandit Problems. In Foundations and Applications of Sensor Management; Springer International Publishing AG: Cham, Switzerland, 2008.
- Caso, G.; De Nardis, L.; Di Benedetto, M.G. Toward Context-Aware Dynamic Spectrum Management for 5G. IEEE Wirel. Commun. 2017, 24, 38–43.
- Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256.
- Vermorel, J.; Mohri, M. Multi-armed bandit algorithms and empirical evaluation. In Proceedings of the 16th European Conference on Machine Learning, Porto, Portugal, 3–7 October 2005; Volume 3720, pp. 437–448.
- Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; Schapire, R.E. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. II-1638–II-1646.
- Wu, Q.; Du, Z.; Yang, P.; Yao, Y.D.; Wang, J. Traffic-Aware Online Network Selection in Heterogeneous Wireless Networks. IEEE Trans. Veh. Technol. 2016, 65, 381–397.
- Lai, T.L.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 1985, 6, 4–22.
- Hassan, H.; Elkhazeen, K.; Raahemiafar, K.; Fernando, X. Optimization of control parameters using averaging of handover indicator and received power for minimizing ping-pong handover in LTE. In Proceedings of the IEEE 28th Canadian Conference on Electrical and Computer Engineering (CCECE), Halifax, NS, Canada, 3–6 May 2015.
- Cesa-Bianchi, N.; Fischer, P. Finite-time regret bounds of the multi-armed bandit problem. In Proceedings of the 15th International Conference on Machine Learning (ICML 1998), Madison, WI, USA, 24–27 July 1998; pp. 100–108.
- Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University, Cambridge, UK, May 1989.
- Vermorel, J. Multi-Armed Bandit Data. 2013. Available online: https://sourceforge.net/projects/bandit/ (accessed on 24 January 2018).
- Lai, L.; El Gamal, H.; Jiang, H.; Poor, H.V. Cognitive medium access: Exploration, exploitation, and competition. IEEE Trans. Mobile Comput. 2011, 10, 239–253.
- Mu, M.; Mauthe, A.; Garcia, F. A utility-based QoS model for emerging multimedia applications. In Proceedings of the 2nd International Conference on Next Generation Mobile Applications, Services and Technologies (NGMAST'08), Cardiff, UK, 16–19 September 2008.
- Boldrini, S.; Fiorina, J.; Di Benedetto, M.G. Introducing strategic measure actions in multi-armed bandits. In Proceedings of the IEEE 24th International Symposium on Personal, Indoor and Mobile Radio Communications-Workshop on Cognitive Radio Medium Access Control and Network Solutions (MACNET'13), London, UK, 8–9 September 2013.

**Figure 1.** Performance in terms of regret of the six considered algorithms, with a Bernoulli distribution for the reward Probability Density Function (PDF) and ${T}_{U}/{T}_{M}=1$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 2.** Performance in terms of regret of the six considered algorithms, with a Bernoulli distribution for the reward PDF and ${T}_{U}/{T}_{M}=5$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 3.** Performance in terms of regret of the six considered algorithms, with a Bernoulli distribution for the reward PDF and ${T}_{U}/{T}_{M}=10$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 4.** Performance in terms of regret of the six considered algorithms, with a truncated Gaussian distribution for the reward PDF and ${T}_{U}/{T}_{M}=1$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 5.** Performance in terms of regret of the six considered algorithms, with a truncated Gaussian distribution for the reward PDF and ${T}_{U}/{T}_{M}=5$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 6.** Performance in terms of regret of the six considered algorithms, with a truncated Gaussian distribution for the reward PDF and ${T}_{U}/{T}_{M}=10$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 7.** Performance in terms of regret of the six considered algorithms, with an exponential distribution for the reward PDF and ${T}_{U}/{T}_{M}=1$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 8.** Performance in terms of regret of the six considered algorithms, with an exponential distribution for the reward PDF and ${T}_{U}/{T}_{M}=5$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 9.** Performance in terms of regret of the six considered algorithms, with an exponential distribution for the reward PDF and ${T}_{U}/{T}_{M}=10$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 10.** Regret achieved by the six considered algorithms at the time horizon as a function of the run, with a Bernoulli distribution for the reward PDF and ${T}_{U}/{T}_{M}=5$. (**a**) Hard configuration; (**b**) Easy configuration.

**Figure 11.** Performance in terms of regret of the six considered algorithms, with real captured data used as reward and ${T}_{U}/{T}_{M}=1$. (**a**) Linear conversion; (**b**) Logarithmic conversion.

**Figure 12.** Performance in terms of regret of the six considered algorithms, with real captured data used as reward and ${T}_{U}/{T}_{M}=5$. (**a**) Linear conversion; (**b**) Logarithmic conversion.

**Figure 13.** Performance in terms of regret of the six considered algorithms, with real captured data used as reward and ${T}_{U}/{T}_{M}=10$. (**a**) Linear conversion; (**b**) Logarithmic conversion.

**Figure 14.**Execution time of the six considered algorithms normalized with respect to the execution time of the $\epsilon $-greedy algorithm.

Acu-Edu | Acadiau-Ca | Adrian-Edu | Agnesscott-Edu | Aims-Edu
---|---|---|---|---
396 | 381 | 488 | 506 | 333
271 | 261 | 488 | 504 | 276
271 | 141 | 325 | 545 | 266
268 | 136 | 324 | 1946 | 331
273 | 136 | 321 | 549 | 290

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Boldrini, S.; De Nardis, L.; Caso, G.; Le, M.T.P.; Fiorina, J.; Di Benedetto, M.-G.
muMAB: A Multi-Armed Bandit Model for Wireless Network Selection. *Algorithms* **2018**, *11*, 13.
https://doi.org/10.3390/a11020013
