# An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits


## Abstract


## 1. Introduction

## 2. Literature Review

## 3. Methodology

#### 3.1. Value of Information

**Optimization Term: Expected Returns.** The term to be optimized represents the expected returns ${\sum}_{k=1,2,\dots}{\mathbb{E}}_{{a}_{k}^{i}}\left[{X}_{k}^{i}\right]={\sum}_{k=1,2,\dots}p({a}_{k}^{i}){X}_{k}^{i}$ associated with an arm-selection strategy $p({a}_{k}^{i})$ that provides the best pay-out ${X}_{k}^{i}$ over all arm pulls $k$. The selection strategy $p({a}_{k}^{i})$ is initially unknown. It is, however, assumed to be related to the specified prior ${p}_{0}({a}_{k}^{i})$ in the manner described by the information-constraint bound.

**Constraint Term: Transformation Cost.** The constraint term ${D}_{\mathrm{KL}}(p({a}_{k}^{i})\parallel {p}_{0}({a}_{k}^{i}))$, a Kullback–Leibler divergence, quantifies the discrepancy between the posterior $p({a}_{k}^{i})$ and the prior ${p}_{0}({a}_{k}^{i})$. This term is bounded above by a non-negative value ${\phi}_{\mathrm{inf}}$, so the amount by which the policy may depart from the prior is artificially limited by the chosen ${\phi}_{\mathrm{inf}}$.
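The constraint above can be checked numerically. The following minimal sketch computes the divergence for a three-armed bandit; the distributions and the bound value are illustrative, not taken from the paper's simulations.

```python
import numpy as np

def kl_divergence(p, p0):
    """Kullback-Leibler divergence D_KL(p || p0) between two discrete
    arm-selection distributions, in nats."""
    p, p0 = np.asarray(p, float), np.asarray(p0, float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / p0[mask])))

# A uniform prior over three arms and a posterior that favors arm 0.
p0 = [1/3, 1/3, 1/3]
p = [0.7, 0.2, 0.1]

phi_inf = 0.5  # hypothetical information bound
cost = kl_divergence(p, p0)
print(cost <= phi_inf)  # True: this policy satisfies the constraint
```

Here the divergence is roughly 0.30 nats, so the posterior is admissible under the chosen bound; a tighter ${\phi}_{\mathrm{inf}}$ would force the policy to stay closer to the uniform prior.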

**Optimization Term: Transformation Amount.** The value of information was previously defined as the greatest increase in pay-outs achievable by a transformation-cost-constrained policy. In this dual formulation, the term in the criterion instead measures the transformation cost of the policy. By maximizing the negated Kullback–Leibler divergence $-{D}_{\mathrm{KL}}(p({a}_{k}^{i})\parallel {p}_{0}({a}_{k}^{i}))$, that is, by minimizing the transformation cost, the criterion seeks a policy $p({a}_{k}^{i})$, whose conversion cost is no longer explicitly bounded, that achieves a specified bound on the expected pay-out ${\sum}_{k=1,2,\dots}{\mathbb{E}}_{{a}_{k}^{i}}\left[{X}_{k}^{i}\right]$ over a number of arm pulls.

**Constraint Term: Pay-Out Limit.** The constraint bounds the policy's expected returns ${\mathbb{E}}_{{a}_{k}^{i}}\left[{X}_{k}^{i}\right]$ by a value ${\phi}_{\mathrm{cost}}$. As ${\phi}_{\mathrm{cost}}$ approaches infinity, we seek the policy with the highest pay-out; for this to happen, the policy space must be searched finely, so exploration dominates. As ${\phi}_{\mathrm{cost}}$ approaches zero, the constraint to find the policies with the best returns is relaxed; the policy space is searched coarsely, and exploitation dominates. For some value of ${\phi}_{\mathrm{cost}}$ between these two extremes, exploitation becomes more prevalent than exploration. The value at which this occurs is application-dependent.
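Return maximization under a KL-to-prior constraint generally admits a Gibbs-form solution, $p(a)\propto {p}_{0}(a)\exp(X(a)/\tau)$, where the temperature-like quantity $\tau$ is induced by the constraint bound. The sketch below illustrates how that quantity interpolates between the two regimes described above; the function name, reward values, and temperatures are illustrative assumptions, not the paper's exact solution.

```python
import numpy as np

def gibbs_policy(rewards, prior, temp):
    """Gibbs-form solution of a KL-constrained return maximization:
    p(a) proportional to prior(a) * exp(rewards(a) / temp)."""
    w = np.asarray(prior, float) * np.exp(np.asarray(rewards, float) / temp)
    return w / w.sum()

rewards = np.array([1.0, 0.5, 0.2])
prior = np.full(3, 1/3)

# A loose pay-out requirement (high temperature) keeps the policy near
# the prior; a demanding one (low temperature) concentrates nearly all
# probability mass on the highest-reward arm.
near_prior = gibbs_policy(rewards, prior, temp=100.0)
greedy = gibbs_policy(rewards, prior, temp=0.01)
```

Printing the two policies shows `near_prior` deviating only slightly from the uniform prior, while `greedy` places essentially all mass on the first arm.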

#### 3.2. Value-of-Information Policies

- ${\tau}_{k}$: **Temperature Term.** This parameter can be interpreted as an inverse-temperature-like parameter. As ${\tau}_{k}$ approaches infinity, we arrive at a low-temperature thermodynamic system of particles. Such a collection of particles has low kinetic energy, so particle positions do not vary greatly. In the context of learning, this situation corresponds to policies ${\pi}_{k}^{i},{\pi}_{k+1}^{i},\dots$ that will not change much due to a low-exploration search. Indeed, the allocation rule will increasingly favor the arm with the highest current reward estimate, with ties broken randomly; all other randomly selectable arms will be progressively ignored. On the other hand, as ${\tau}_{k}$ goes to zero, we have a high-temperature system of particles with high kinetic energy. The arm-selection policy ${\pi}_{k+1}^{i}$ can change drastically as a result; in the limit, all arms are pulled independently and uniformly at random. Values of the parameter between these two extremes implement searches for arms that blend exploration and exploitation.
- ${\gamma}_{k}$: **Mixing Coefficient Term.** This parameter can be viewed as a mixing coefficient for the model: it switches the preference between the exponential distribution component and the uniform distribution component. For values of the mixing coefficient converging to one, the exponential component is increasingly ignored. The resulting arm-choice probability distributions ${\pi}_{k+1}^{i}$ become uniform; the chance of any arm being selected is the same, regardless of the expected pay-out, which corresponds to a pure exploration strategy. As the mixing coefficient becomes zero, the influence of the exponential term rises. The values of ${\pi}_{k+1}^{i}$ will hence begin to dictate how frequently a given arm is chosen based upon its energy, as described by the reward pay-out, and the inverse temperature. Depending on the value of the inverse-temperature parameter ${\tau}_{k}^{-1}$, the resulting search will be exploration-intensive, exploitation-intensive, or somewhere in between.
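The two-component rule described above can be sketched as follows. The exact functional form used by VoIMix is not reproduced in this section, so treat `mixed_policy`, the inverse-temperature argument `beta` (playing the role of ${\tau}_{k}$), and the reward estimates `q` as illustrative assumptions.

```python
import numpy as np

def mixed_policy(q, beta, gamma):
    """Illustrative two-component arm-selection rule: a gamma-weighted
    uniform term plus a (1 - gamma)-weighted exponential (Boltzmann)
    term over the reward estimates q, with inverse temperature beta."""
    q = np.asarray(q, float)
    z = beta * q - np.max(beta * q)          # stabilize the exponentials
    boltzmann = np.exp(z) / np.exp(z).sum()
    return gamma / len(q) + (1.0 - gamma) * boltzmann

q = [1.0, 0.4, 0.1]
uniform_like = mixed_policy(q, beta=10.0, gamma=1.0)  # pure exploration
greedy_like = mixed_policy(q, beta=10.0, gamma=0.0)   # exponential only
```

With `gamma=1.0` every arm receives probability 1/3 regardless of `q`; with `gamma=0.0` and a large `beta`, nearly all mass falls on the arm with the highest estimate, matching the limits described in the two bullets above.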

Algorithm 1: ${\tau}_{k}$-changing Exploration: VoIMix

Algorithm 2: ${\tau}_{k}$-changing Exploration: AutoVoIMix
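Since the listings for Algorithms 1 and 2 are not reproduced here, the following is only a schematic VoIMix-style loop under assumed components: a uniform/Boltzmann mixture policy, Gaussian unit-variance rewards, and an incremental mean estimator. It is a sketch of the overall control flow, not the paper's exact update rules or cooling schedules.

```python
import numpy as np

def run_mixture_bandit(means, n_pulls=2000, gamma=0.1, beta=5.0, seed=0):
    """Schematic VoIMix-style loop: sample an arm from a
    uniform/Boltzmann mixture over reward estimates, observe a noisy
    reward, and update a per-arm running mean."""
    rng = np.random.default_rng(seed)
    n = len(means)
    q = np.zeros(n)         # per-arm reward estimates
    counts = np.zeros(n)    # per-arm pull counts
    for _ in range(n_pulls):
        z = beta * q - np.max(beta * q)
        pi = gamma / n + (1.0 - gamma) * np.exp(z) / np.exp(z).sum()
        arm = rng.choice(n, p=pi)
        reward = means[arm] + rng.normal(0.0, 1.0)   # unit-variance arm
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]    # incremental mean
    return q, counts

q, counts = run_mixture_bandit([1.0, 0.5, 0.2])
```

After 2000 pulls, the pull counts concentrate on the highest-mean arm, while the `gamma`-weighted uniform floor guarantees that every arm keeps being sampled occasionally.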

#### Automated Hyperparameter Selection

## 4. Simulations

#### 4.1. Simulation Preliminaries

#### 4.2. Value of Information Results and Analysis

#### 4.2.1. Fixed-Parameter Case

**Exploration Amount and Performance.** Figure 1a contains probability-simplex plots for VoIMix. Such plots highlight the arm-selection probability distributions of the iterates during the learning process for the three-armed bandit problem. They indicate how VoIMix investigates the action space across all of the Monte Carlo trials and whether it focuses on sub-optimal arms.

#### 4.2.2. Adaptive-Parameter Case

**Exploration Amount and Performance.** In Figure 2, we provide results for both VoIMix and AutoVoIMix over a larger number of iterations. These results are for the cases where ${\tau}_{k}^{-1}$ and ${\gamma}_{k}$ are either fixed or automatically tuned according to the cooling schedules given in the previous section.

**Hyperparameter Values and Performance.** Both VoIMix and AutoVoIMix have a single hyperparameter that must be chosen by the investigator. For VoIMix, we set the hyperparameter ${d}_{k}$ halfway between zero and the smallest mean-reward difference. For AutoVoIMix, we likewise set the hyperparameter ${\theta}_{k}$ in the middle of its valid range. This offered a reasonable period of exploration and hence yielded good performance.
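As a concrete illustration of the ${d}_{k}$ choice described above, consider a bandit with hypothetical mean rewards (the simulation means are not listed in this section):

```python
# Hypothetical mean rewards for a three-armed bandit.
mus = [1.0, 0.5, 0.2]
mu_star = max(mus)

# Smallest gap between the best arm and any other arm.
min_gap = min(mu_star - mu for mu in mus if mu != mu_star)

# VoIMix hyperparameter: halfway between zero and the smallest gap.
d_k = min_gap / 2.0
print(d_k)  # 0.25
```

In practice the mean rewards are unknown, so this computation stands in for whatever prior knowledge or pilot estimates an investigator uses to pick ${d}_{k}$.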

#### 4.2.3. Fixed- and Variable-Parameter Case Discussions

**Exploration Amount Discussions.** We considered two possible instances of VoIMix: one where its parameters were fixed and another where they varied. Our simulations showed that the latter outperformed the former. This is because a constant amount of exploration might explore either too greatly or not thoroughly enough to achieve good performance. Without the optimal annealing schedule that we provided, substantial trial-and-error testing would be needed to find reasonable exploration amounts for practical problems.

**Hyperparameter Value Discussions.** An issue faced when using many multi-armed bandit algorithms, including VoIMix and AutoVoIMix, is the proper setting of hyperparameter values by investigators. Such hyperparameters often have a great impact on the corresponding results. They are also difficult to set so that the corresponding regret bounds hold.

#### 4.3. Methodological Comparisons

#### 4.3.1. Stochastic Methods: $\epsilon$-Greedy and Soft-Max

#### 4.3.2. Stochastic Methods: Pursuit and Reinforcement Comparison

#### 4.3.3. Deterministic Methods: Upper Confidence Bound

#### 4.3.4. Methodological Comparisons Discussions

**$\epsilon$-Greedy and Soft-Max Discussions.** Our simulations indicated that the performance of SoftMix exceeded that of GreedyMix. In the long term, GreedyMix should improve beyond SoftMix, especially if the currently known regret bounds are tight. This is because an unlucky sampling of arms in SoftMix can bias the expected returns and favor sub-optimal arms. The optimal arm may have a low probability of being sampled, especially after many pulls. In contrast, all arms have an equal chance of being selected during the exploration phase of GreedyMix. Any poorly estimated reward means, especially that of the optimal arm, can hence be made more precise. The optimal arm will therefore be chosen more frequently once this occurs.

**Pursuit and Reinforcement Comparison Discussions.** Both VoIMix and AutoVoIMix perform well since they avoid a biased sampling of sub-optimal arms. Pursuit methods, in contrast, are highly susceptible to such issues. These algorithms consider only the expected reward associated with the current-best arm when assigning probability mass. The rewards of all other arms are ignored, unlike in soft-max-based selection, which leaves little opportunity to recover from a poor arm selection after a few iterations. Likewise, the uniform-exploration components in VoIMix and AutoVoIMix further aid in avoiding biased sampling, especially early in the learning process, where they dominate. In practice, we found that pursuit methods formulated low-reward policies rather frequently. The regret tended to plateau at a sub-optimal value after just a few arm pulls, and the optimal arm was rarely identified, even after adjusting the learning-rate parameter.
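The pursuit behavior discussed above can be sketched in a few lines. The update shown, moving all probability mass toward the current-best arm at rate $\beta$, is the standard pursuit scheme; the reward estimates are held fixed here purely for illustration.

```python
import numpy as np

def pursuit_update(pi, q, beta):
    """One pursuit-method step: shift all probability mass toward the
    arm with the best current estimate; other arms' rewards are ignored."""
    target = np.zeros_like(pi)
    target[np.argmax(q)] = 1.0
    return pi + beta * (target - pi)

pi = np.full(3, 1/3)
q = np.array([0.2, 0.9, 0.5])  # arm 1 currently looks best
for _ in range(20):            # estimates held fixed for illustration
    pi = pursuit_update(pi, q, beta=0.3)
print(pi)  # nearly all mass on arm 1
```

After twenty steps the policy has almost no mass left on arms 0 and 2, which is why an early misleading estimate can be so hard to recover from: the arms needed to correct `q` are barely sampled anymore.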

**Upper-Confidence-Bound Discussions.** Our simulations revealed that while upper-confidence-bound methods achieve superior performance during the first hundred arm pulls, they lag behind VoIMix over a longer number of plays. This behavior likely stems from the greedy arm-selection process of the upper-confidence-bound methods. Initially, only the highest-performing slot machine will be chosen by these methods, as the arm-frequency constraint will do little to sway the selection. This arm may be sub-optimal yet have an unusually high reward due to the variance. The optimal arm, in contrast, may have had an uncharacteristically low reward the round that it was chosen, causing it to be ignored until later in the learning phase, when the arm-frequency constraint ensures that it is picked. Stochastic approaches have more opportunities to choose optimal arms that initially appear to be sub-optimal.
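The arm-frequency constraint mentioned above is visible directly in the standard UCB1 index of Auer et al.: the bonus term grows for rarely pulled arms. The numbers below are illustrative.

```python
import math

def ucb1_index(q_i, n_i, k):
    """UCB1 index: empirical mean q_i of an arm pulled n_i times,
    plus an exploration bonus that shrinks as n_i grows relative to
    the total number of pulls k."""
    return q_i + math.sqrt(2.0 * math.log(k) / n_i)

k = 1000
# An arm with a lower empirical mean can still win the index if it
# has been pulled far less often -- the recovery behavior discussed.
print(ucb1_index(0.5, 5, k) > ucb1_index(0.9, 900, k))  # True
```

Here the neglected arm's bonus (about 1.66) dwarfs the well-sampled arm's (about 0.12), so the deterministic rule eventually revisits an arm whose early rewards happened to be uncharacteristically low.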

## 5. Conclusions

## Supplementary Materials

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Variable Notation

Symbol | Description |

$\mathcal{R}$ | Continuous space of real values |

$\mathcal{A}$ | Discrete space of slot-machine arms |

$\left|\mathcal{A}\right|$ | Total number of slot-machine arms |

T | Total number of arm pulls |

s, k | A given arm pull index in the range $[0,T]$; index to a particular iteration or round |

${a}^{i}$ | A given slot-machine arm, indexed by i; i takes values in the range $[1,\left|\mathcal{A}\right|]$ |

${a}^{\ast}$ | The slot-machine arm that yields the best expected reward |

${a}_{k}^{i}$ | A particular slot machine arm i pulled at round k |

${X}_{k}^{i}$ | The real-valued reward obtained by pulling arm ${a}^{i}$ at round k; can also be interpreted as a random variable |

${\mu}^{i}$ | The expected reward for a non-optimal slot machine arm ${a}^{i}$ |

${\mu}^{\ast}$ | The expected reward for the optimal slot machine arm ${a}^{\ast}$ |

${\sigma}^{2}$ | The variance of all slot machine arms |

${p}_{0}({a}_{k}^{i})$ | A prior probability distribution that specifies the chance of choosing arm ${a}^{i}$ at round k |

$p({a}_{k}^{i})$ | A posterior-like probability distribution that specifies the chance of choosing arm ${a}^{i}$ at round k |

$p({a}^{j})$ | A probability distribution that specifies the chance of choosing arm ${a}^{j}$ independent of a given round k |

${\pi}_{k}^{i}$ | Equivalent to $p({a}_{k}^{i})$; represents the arm-selection policy |

${\mathbb{E}}_{\cdot}[\cdot]$ | The expected value of a random variable according to a probability distribution, e.g., ${\mathbb{E}}_{{a}_{k}^{i}}\left[{X}_{k}^{i}\right]$ represents ${\sum}_{{a}_{k}^{i}\in \mathcal{A}}p({a}_{k}^{i}){X}_{k}^{i}$ |

${D}_{\mathrm{KL}}(\cdot\parallel \cdot)$ | The Kullback–Leibler divergence between two probability distributions, e.g., ${D}_{\mathrm{KL}}(p({a}_{k}^{i})\parallel {p}_{0}({a}_{k}^{i}))$ is the divergence between $p({a}_{k}^{i})$ and ${p}_{0}({a}_{k}^{i})$ |

${\phi}_{\mathrm{inf}}$ | A positive, real-valued information constraint bound |

${\phi}_{\mathrm{cost}}$ | A positive, real-valued reward constraint bound |

${q}_{k+1}^{i}$ | An estimate of the expected reward for slot machine arm ${a}^{i}$ at round $k+1$ |

${I}_{k}^{i}$ | A binary-valued indicator variable that specifies if slot machine arm ${a}^{i}$ was chosen at round k |

$\epsilon$ | A non-negative, real-valued parameter that dictates the amount of exploration for the $\epsilon$-greedy algorithm (GreedyMix) |

$\tau $ | A non-negative, real-valued parameter that dictates the amount of exploration for soft-max-based selection (SoftMix) and value-of-information-based exploration (VoIMix) |

$\gamma $ | A non-negative, real-valued parameter that represents the mixing coefficient for value-of-information-based exploration (VoIMix and AutoVoIMix) |

$\theta $ | A non-negative, real-valued hyperparameter that dictates the exploration duration for value-of-information-based exploration (AutoVoIMix) |

d | A positive, real-valued hyperparameter that dictates the exploration duration for value-of-information-based exploration (VoIMix) |

${\tau}_{k}$, ${\gamma}_{k}$, ${\theta}_{k}$, ${d}_{k}$ | The values of $\tau$, $\gamma$, $\theta$, and d, respectively, at a given round k |

${\delta}_{i}$ | The mean loss incurred by choosing arm ${a}^{i}$ instead of arm ${a}^{\ast}$: ${\delta}_{i}={\mu}^{\ast}-{\mu}^{i}$ |

${V}_{p}^{j}$ | A random variable representing the ratio of the reward for a given slot machine arm to its chance of being selected: ${X}_{p}^{j}/{\pi}_{p}^{j}$ |

${Z}_{s}^{i}$ | A random variable representing the expression ${\delta}_{i}+{V}_{s}^{i}-{V}_{s}^{\ast}$ |

${Q}_{k-1}^{i}$ | A random variable representing the expression ${\prod}_{s=1}^{k-1}\mathrm{exp}({\tau}_{s}{V}_{s}^{i}-{\tau}_{s}{V}_{s}^{\ast})$ |

${\mathcal{J}}_{s}$ | A $\sigma$-algebra: $\sigma({X}_{s}^{i},\,1\le s\le k)$, $k\ge 1$ |

${\varphi}_{c}({\tau}_{s})$ | A real-valued function parameterized by c: ${\varphi}_{c}({\tau}_{s})=({e}^{c{\tau}_{s}}-c{\tau}_{s}-1)/{\tau}_{s}^{2}$ |

$\omega $, ${\omega}^{\prime}$ | Real values inside the unit interval |

${c}_{k}$ | A real-valued sequence $1+2\left|\mathcal{A}\right|/{\gamma}_{k}$ for VoIMix or $1+\left|\mathcal{A}\right|/{\gamma}_{k}$ for AutoVoIMix |

${\sigma}_{k}^{2}$ | A real-valued sequence $-{d}^{2}+2\left|\mathcal{A}\right|/{\gamma}_{k}$ for VoIMix or $2\left|\mathcal{A}\right|/{\gamma}_{k}$ for AutoVoIMix |

## Appendix A

**Proposition A1.**

**Proof.**

**Proposition A2.**

**Proof.**

**Proposition A3.**

**Proof.**

**Proposition A4.**

**Proof.**

## References


**Figure 1.** Results for the value-of-information-based VoIMix for the fixed-parameter case. The plots in (**a**) highlight the regions in the action probability simplex that are visited during the simulations for the three-armed bandit problem. Cooler colors (purple, blue, and green) correspond to probability triples that are encountered often during learning. Warmer colors (yellow, orange, and red) correspond to probability triples that are not seen often. The correct action is the one associated with the bottom-left corner of the simplex. The simplex plots were produced for fixed parameter values and averaged across independent simulations. The plots in (**b**–**d**) give the regret, reward, and optimal-arm-selection cumulative probability for the three-, ten-, and thirty-armed bandit problems when the distribution variance is ${\sigma}^{2}=1$. The red, orange, green, and blue curves correspond to fixed inverse temperatures of ${\tau}^{-1}=0.01$, 0.1, 0.5, and 1.0, respectively.

**Figure 2.** Long-term results for the value-of-information-based VoIMix and AutoVoIMix for the fixed- and variable-parameter cases. The hyperparameters d and $\theta$ are fixed for each method. The plots in (**a**) highlight the regions in the action probability simplex that are visited during the simulations for the three-armed bandit problem. The plots in (**b**–**d**) give the regret, reward, and optimal-arm-selection cumulative probability for the three-, ten-, and thirty-armed bandit problems when the distribution variance is ${\sigma}^{2}=1$. In these plots, the purple curves correspond to VoIMix and the dark blue curves to AutoVoIMix, in both cases with the inverse-temperature and mixing parameters allowed to vary throughout the simulation according to the cooling schedules given in the previous section. The red, orange, green, and blue curves correspond to fixed inverse temperatures of ${\tau}^{-1}=0.01$, 0.1, 0.5, and 1.0, respectively.

**Figure 3.** Results for the $\epsilon$-greedy-based GreedyMix for the fixed-parameter case. The plots in (**a**) highlight the regions in the action probability simplex that are visited during the simulations for the three-armed bandit problem. The correct action is the one associated with the bottom-left corner of the simplex. The simplex plots were produced for fixed parameter values and averaged across independent simulations. The plots in (**b**–**d**) give the regret, reward, and optimal-arm-selection cumulative probability for the three-, ten-, and thirty-armed bandit problems when the distribution variance is ${\sigma}^{2}=1$. The red, orange, green, and blue curves correspond to fixed exploration values of $\epsilon=0.01$, 0.1, 0.5, and 1.0, respectively.

**Figure 4.** Results for the soft-max-based SoftMix for the fixed-parameter case. The plots in (**a**) highlight the regions in the action probability simplex that are visited during the simulations for the three-armed bandit problem. The correct action is the one associated with the bottom-left corner of the simplex. The simplex plots were produced for fixed parameter values and averaged across independent simulations. The plots in (**b**–**d**) give the regret, reward, and optimal-arm-selection cumulative probability for the three-, ten-, and thirty-armed bandit problems when the distribution variance is ${\sigma}^{2}=1$. The red, orange, green, and blue curves correspond to fixed inverse temperatures of ${\tau}^{-1}=0.01$, 0.1, 0.5, and 1.0, respectively.

**Figure 5.** Results for the pursuit method for the fixed-parameter case. The plots in (**a**–**c**) give the regret, reward, and optimal-arm-selection cumulative probability for the three-, ten-, and thirty-armed bandit problems when the distribution variance is ${\sigma}^{2}=1$. The red, orange, green, and blue curves correspond to fixed values of the policy learning rate $\beta=0.01$, 0.1, 0.5, and 0.9, respectively. Comparisons against the tuned version of VoIMix (purple curve), where ${\tau}_{k}^{-1}$ and ${d}_{k}$ change across each arm pull, along with the upper-confidence-bound method (dark blue curve), are provided.

**Figure 6.** Results for the soft-max-based reinforcement comparison for the fixed-parameter case. The plots in (**a**–**c**) give the regret, reward, and optimal-arm-selection cumulative probability for the three-, ten-, and thirty-armed bandit problems when the distribution variance is ${\sigma}^{2}=1$. A value of $\alpha=0.5$ was used for updating the mean reward at each iteration. The red, orange, green, and blue curves correspond to fixed values of the learning rate $\beta=0.01$, 0.1, 0.5, and 0.9, respectively. Comparisons against the tuned version of VoIMix (purple curve), where ${\tau}_{k}^{-1}$ and ${d}_{k}$ change across each arm pull, along with the upper-confidence-bound method (dark blue curve), are provided.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sledge, I.J.; Príncipe, J.C.
An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits. *Entropy* **2018**, *20*, 155.
https://doi.org/10.3390/e20030155
