# On Gap-Based Lower Bounding Techniques for Best-Arm Identification

## Abstract


## 1. Introduction

## 2. Overview of Results

#### 2.1. Problem Setup

- There are $M$ arms with Bernoulli rewards; the means are $\mathbf{p}=(p_1,p_2,\cdots,p_M)$, and this vector of means is said to define the bandit instance. Our analysis considers instances with the arms sorted such that $p_1\ge p_2\ge\cdots\ge p_M$, without loss of generality.
- The agent would like to find an arm whose mean is within $\epsilon$ of the highest mean for some $0<\epsilon<1$, i.e., an arm $l$ with $p_l>p_1-\epsilon$. Even if there are multiple such arms, identifying any one of them suffices.
- In each round, the agent can pull any arm $l\in[M]$ and observe a reward $X_l^{(s)}\sim\mathrm{Bernoulli}(p_l)$, where $s$ is the number of times the $l$-th arm has been pulled so far. We assume that the rewards are independent, both across arms and across time.
- In each round, the agent can alternatively choose to terminate and output an arm index $\widehat{l}$ believed to be $\epsilon$-optimal. The round at which this occurs is denoted by $T$, and is a random variable because it is allowed to depend on the rewards observed. We are interested in the expected number of arm pulls (also called the sample complexity) $\mathbb{E}_{\mathbf{p}}[T]$ for a given instance $\mathbf{p}$, which should ideally be as low as possible.
- An algorithm is said to be $(\epsilon,\delta)$-PAC (Probably Approximately Correct) if, for every bandit instance, it outputs an $\epsilon$-optimal arm with probability at least $1-\delta$ upon terminating at the stopping time $T$.
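As a concrete illustration of this setup, the following sketch (illustrative only, not an algorithm from this paper) implements the classical "naive" $(\epsilon,\delta)$-PAC strategy: pull every arm a fixed number of times dictated by Hoeffding's inequality and a union bound, then output the empirical best. The interface names (`pull`, `naive_pac_best_arm`) are hypothetical.

```python
import math
import random

def naive_pac_best_arm(pull, M, eps, delta):
    """Naive (eps, delta)-PAC strategy; returns (chosen arm index, total pulls).

    `pull(l)` returns one Bernoulli reward of arm l. By Hoeffding's inequality,
    n pulls of an arm give P(|p_hat - p| >= eps/2) <= 2 exp(-n eps^2 / 2); a union
    bound over the M arms makes every estimate (eps/2)-accurate with probability
    at least 1 - delta once n >= (2 / eps^2) ln(2 M / delta), in which case the
    empirically best arm is eps-optimal.
    """
    n = math.ceil((2.0 / eps**2) * math.log(2 * M / delta))
    means = [sum(pull(l) for _ in range(n)) / n for l in range(M)]
    best = max(range(M), key=lambda l: means[l])
    return best, n * M

# Example: a Bernoulli instance p = (0.9, 0.5, 0.45) with eps = 0.2, delta = 0.1.
rng = random.Random(0)
p = [0.9, 0.5, 0.45]
arm, cost = naive_pac_best_arm(lambda l: 1 if rng.random() < p[l] else 0,
                               len(p), 0.2, 0.1)
```

This strategy's sample complexity is $O\left(\frac{M}{\epsilon^2}\log\frac{M}{\delta}\right)$ regardless of the instance; the instance-dependent lower bounds studied in this paper characterize how far adaptive strategies can improve on such uniform sampling.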

#### 2.2. Existing Lower Bounds

#### 2.3. Our Result and Discussion

**Theorem 1.**

## 3. Proof of Theorem 1

**Lemma 1.**

**Proof.**

**Proposition 1.**

**Proof.**

**Lemma 2.**

**Proof.**

## 4. Conclusion

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Proof of Lemma 1 (Constant-Probability Event for Small Enough $\mathbb{E}_{\mathbf{1}}[G_{1,l}]$)

**Lemma A1.**

- (A30) uses the definitions of ${C}_{l}$ and ${G}_{2,l}$;
- (A33) uses the definitions of ${U}_{l}$ and ${A}_{l}$;
- (A34) follows from the definitions of ${U}_{l}$ and ${V}_{l}\left({t}_{l}\right)$ in (A23) and (A24) (which imply ${U}_{l}={V}_{l}\left({T}_{l}\right)$);
- (A35) follows from (A26);

- (A37) follows from (A29);
- (A38) follows from the definition of ${n}_{l}$;
- (A40) follows since the condition $\frac{\epsilon+\Delta_l}{p_l}\le\frac{1}{2}$ in $\mathcal{V}$ yields $\frac{\epsilon+p_*-p_l}{p_l}\le\frac{1}{2}$, which implies
$$p_l\ge\frac{2}{3}(p_*+\epsilon);$$
- (A42) follows from the definition of ${\nu}_{l}$ in (31);
- (A43) follows from the definition of $\xi $ in (15).
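As a sanity check of the algebraic implication used in (A40) above, the following snippet (illustrative, not part of the paper's analysis) verifies on a grid of valid instances that $\frac{\epsilon+p_*-p_l}{p_l}\le\frac{1}{2}$ indeed forces $p_l\ge\frac{2}{3}(p_*+\epsilon)$:

```python
import itertools

# Brute-force check of (A40):
#   (eps + p_star - p_l)/p_l <= 1/2   ==>   p_l >= (2/3)(p_star + eps).
# The implication is exact algebra; the 1e-12 slack only absorbs float rounding.
grid = [i / 100 for i in range(1, 100)]
violations = [
    (p_l, p_star, eps)
    for p_l, p_star, eps in itertools.product(grid, grid, grid)
    if p_l <= p_star and p_star + eps < 1            # a valid configuration
    and (eps + p_star - p_l) / p_l <= 0.5            # the condition in V
    and p_l < (2 / 3) * (p_star + eps) - 1e-12       # conclusion would fail
]
```

The list of violations comes back empty, matching the one-line derivation $\epsilon+p_*\le\frac{3}{2}p_l$.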

## Appendix B. Proof of Proposition 1 (Bounding a Likelihood Ratio)

**Lemma A2.**

**Case 1:** $\frac{\epsilon+\Delta_l}{p_l}>\frac{1}{2}$. In this case, recalling that $\Delta_l=p_*-p_l$, we have

$$\frac{\epsilon+p_*}{p_l}=\frac{\epsilon+\Delta_l}{p_l}+1>\frac{3}{2}>1.$$

On the other hand, since $\epsilon+p_l\le\epsilon+p_*<1$, we have

$$0<\frac{\epsilon+\Delta_l}{1-p_l}=\frac{\epsilon+p_*-p_l}{1-p_l}=1-\frac{1-(p_*+\epsilon)}{1-p_l}<1, \tag{A52–A53}$$

and hence, by Lemma A2,

$$\frac{1-\epsilon-p_*}{1-p_l}=1-\frac{\epsilon+\Delta_l}{1-p_l}\ge\exp\left[-\left(\sqrt{\frac{1-p_l}{1-(p_*+\epsilon)}}\right)\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)\right]. \tag{A54–A55}$$

Moreover, by the definition of $\alpha_l$ in (25), we have

$$\alpha_l=\frac{\epsilon+\Delta_l}{(1-p_l)p_l}>\frac{1}{2(1-p_l)},$$

and hence

$$\alpha_l<2\alpha_l^2(1-p_l).$$

In addition, again using $\frac{\epsilon+\Delta_l}{p_l}>\frac{1}{2}$, we have

$$p_l<2(\epsilon+\Delta_l),$$

and hence

$$p_*+\epsilon=(\epsilon+\Delta_l)+p_l<3(\epsilon+\Delta_l). \tag{A60–A61}$$

We can now lower bound the likelihood ratio $L_l(W)$ as follows:

$$L_l(W)=\left(\frac{\epsilon+p_*}{p_l}\right)^{K_l}\left(\frac{1-\epsilon-p_*}{1-p_l}\right)^{T_l-K_l} \tag{A62}$$

$$\ge\exp\left(-\frac{\epsilon+\Delta_l}{\sqrt{(1-p_l)(1-(p_*+\epsilon))}}(T_l-K_l)\right) \tag{A63}$$

$$\ge\exp\left(-\frac{(p_*+\epsilon)(\epsilon+\Delta_l)}{(p_*+\epsilon)\sqrt{(1-p_l)(1-(p_*+\epsilon))}}T_l\right), \tag{A65}$$

where (A63) drops the first factor (which exceeds one) and applies (A55), and (A65) uses $T_l-K_l\le T_l$ and multiplies and divides by $p_*+\epsilon$.

**Case 2:** $0\le\frac{\epsilon+\Delta_l}{p_l}\le\frac{1}{2}$. For this case, we have

$$L_l(W)=\left(\frac{\epsilon+p_*}{p_l}\right)^{K_l}\left(\frac{1-\epsilon-p_*}{1-p_l}\right)^{T_l-K_l} \tag{A68}$$

$$=\left(1+\frac{\epsilon+\Delta_l}{p_l}\right)^{K_l}\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{T_l-K_l} \tag{A69}$$

$$=\left(1-\left(\frac{\epsilon+\Delta_l}{p_l}\right)^{2}\right)^{K_l}\left(1-\frac{\epsilon+\Delta_l}{p_l}\right)^{-K_l}\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{T_l-K_l} \tag{A70}$$

$$=\left(1-\left(\frac{\epsilon+\Delta_l}{p_l}\right)^{2}\right)^{K_l}\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{(p_lT_l-K_l)/p_l}\left(1-\frac{\epsilon+\Delta_l}{p_l}\right)^{-K_l}\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{K_l(1-p_l)/p_l}, \tag{A71}$$

where (A70) uses $1+x=\frac{1-x^2}{1-x}$, and (A71) splits the exponent $T_l-K_l=\frac{p_lT_l-K_l}{p_l}+\frac{K_l(1-p_l)}{p_l}$.

From (A53), we have

$$0<\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)^{2}=\left(1-\frac{1-(p_*+\epsilon)}{1-p_l}\right)^{2}\le 1-\frac{1-(p_*+\epsilon)}{1-p_l}<1.$$

Hence, by Lemma A2, we have

$$1-\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)^{2}\ge\exp\left[-\frac{1}{\sqrt{1-\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)^{2}}}\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)^{2}\right] \tag{A73}$$

$$\ge\exp\left[-\left(\frac{1-p_l}{\sqrt{(1-p_l)(1-p_*-\epsilon)}}\right)\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)^{2}\right]. \tag{A75}$$

For the third and fourth terms in (A71), we proceed as follows. By Lemma A2,

$$\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{K_l(1-p_l)/p_l}\ge\exp\left[-\left(\sqrt{\frac{1-p_l}{1-(p_*+\epsilon)}}\right)\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)\frac{K_l(1-p_l)}{p_l}\right].$$

On the other hand, since $(1-x)^{-1}\ge e^{x}$ for $x\in[0,1)$, observe that

$$\left(1-\frac{\epsilon+\Delta_l}{p_l}\right)^{-K_l}\ge\exp\left[\left(\frac{\epsilon+\Delta_l}{p_l}\right)K_l\right].$$

Combining these two bounds (and using $K_l\le T_l$ together with $p_l\ge\frac{2}{3}(p_*+\epsilon)$ from (A40)), we obtain

$$\left(1-\frac{\epsilon+\Delta_l}{p_l}\right)^{-K_l}\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{K_l(1-p_l)/p_l}\ge\exp\left[-\frac{3\alpha_l^{2}p_l^{2}(1-p_l)^{2}}{2(p_*+\epsilon)\sqrt{(1-p_l)(1-(p_*+\epsilon))}}T_l\right],$$

and hence

$$L_l(W)\ge\left(1-\left(\frac{\epsilon+\Delta_l}{p_l}\right)^{2}\right)^{K_l}\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{(p_lT_l-K_l)/p_l}\times\exp\left[-\frac{3}{2(p_*+\epsilon)\sqrt{(1-p_l)(1-(p_*+\epsilon))}}\alpha_l^{2}p_l^{2}(1-p_l)^{2}T_l\right]. \tag{A86}$$

Now, since $0<\left(\frac{\epsilon+\Delta_l}{p_l}\right)^{2}\le\frac{1}{4}=1-\frac{3}{4}$ (since we are in the case $0\le\frac{\epsilon+\Delta_l}{p_l}\le\frac{1}{2}$), by Lemma A2, we have

$$\left(1-\left(\frac{\epsilon+\Delta_l}{p_l}\right)^{2}\right)^{K_l}\ge\exp\left[-\sqrt{\frac{4}{3}}\left(\frac{\epsilon+\Delta_l}{p_l}\right)^{2}K_l\right] \tag{A87}$$

$$\ge\exp\left[-\frac{4}{3}\left(\frac{\epsilon+\Delta_l}{p_l}\right)^{2}K_l\right] \tag{A88}$$

$$=\exp\left[-\frac{4}{3}\alpha_l^{2}(1-p_l)^{2}K_l\right], \tag{A90}$$

where (A90) uses $\frac{\epsilon+\Delta_l}{p_l}=\alpha_l(1-p_l)$.

We now consider two further sub-cases:

- (i) If $p_lT_l>K_l$, then we have
$$\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{(p_lT_l-K_l)/p_l}\ge\exp\left(-\left(\sqrt{\frac{1-p_l}{1-(p_*+\epsilon)}}\right)\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)\left(\frac{p_lT_l-K_l}{p_l}\right)\right) \tag{A91}$$
$$=\exp\left(-\left(\sqrt{\frac{1-p_l}{1-(p_*+\epsilon)}}\right)\alpha_l\left(p_lT_l-K_l\right)\right) \tag{A92}$$
$$=\exp\left(-\beta_l(p_lT_l-K_l)\right), \tag{A93}$$
where (A93) uses the definition of $\beta_l$.
- (ii) If $p_lT_l\le K_l$, then the exponent $(p_lT_l-K_l)/p_l$ is non-positive, and using $(1-x)^{-1}\ge e^{x}$ we have
$$\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{(p_lT_l-K_l)/p_l}\ge\exp\left[-\left(\frac{\epsilon+\Delta_l}{1-p_l}\right)\left(\frac{p_lT_l-K_l}{p_l}\right)\right] \tag{A94}$$
$$=\exp\left[-\alpha_l(p_lT_l-K_l)\right]. \tag{A95}$$

From (A93) and (A95), we obtain

$$\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{(p_lT_l-K_l)/p_l}\ge\exp\left[-\left(\beta_l(p_lT_l-K_l)\mathbf{1}\{p_lT_l>K_l\}+\alpha_l(p_lT_l-K_l)\mathbf{1}\{p_lT_l\le K_l\}\right)\right]. \tag{A96}$$

Now, from (A86), (A90), and (A96), writing $-K_l=(p_lT_l-K_l)-p_lT_l$ in (A90), we have

$$L_l(W)\ge\exp\left[\frac{4}{3}\alpha_l^{2}(1-p_l)^{2}(p_lT_l-K_l)\right]\exp\left[-\frac{4}{3}\alpha_l^{2}p_l(1-p_l)^{2}T_l\right]$$
$$\quad\times\exp\left[-\left(\beta_l(p_lT_l-K_l)\mathbf{1}\{p_lT_l>K_l\}+\alpha_l(p_lT_l-K_l)\mathbf{1}\{p_lT_l\le K_l\}\right)\right]$$
$$\quad\times\exp\left[-\frac{3}{2(p_*+\epsilon)\sqrt{(1-p_l)(1-(p_*+\epsilon))}}\alpha_l^{2}p_l^{2}(1-p_l)^{2}T_l\right].$$
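The factorization underlying (A69)–(A71) is a purely algebraic identity, and can be checked numerically. The snippet below (with illustrative parameter values, not from the paper) confirms that the product of the four factors reproduces $\left(1+\frac{\epsilon+\Delta_l}{p_l}\right)^{K_l}\left(1-\frac{\epsilon+\Delta_l}{1-p_l}\right)^{T_l-K_l}$:

```python
import math

def check_decomposition(p_l, p_star, eps, T, K):
    # a = (eps + Delta_l)/p_l and b = (eps + Delta_l)/(1 - p_l), with Delta_l = p_star - p_l.
    d = eps + (p_star - p_l)
    a, b = d / p_l, d / (1 - p_l)
    lhs = (1 + a) ** K * (1 - b) ** (T - K)
    rhs = (
        (1 - a * a) ** K                     # (1 - a^2)^K
        * (1 - b) ** ((p_l * T - K) / p_l)   # (1 - b)^{(p T - K)/p}
        * (1 - a) ** (-K)                    # (1 - a)^{-K}
        * (1 - b) ** (K * (1 - p_l) / p_l)   # (1 - b)^{K(1-p)/p}
    )
    # Identity holds since (1-a^2)(1-a)^{-1} = 1+a and the b-exponents sum to T-K.
    return math.isclose(lhs, rhs, rel_tol=1e-9)
```

The check passes for any valid configuration ($p_l\le p_*$, $p_*+\epsilon<1$), since $(1-a^2)(1-a)^{-1}=1+a$ and $\frac{p_lT_l-K_l}{p_l}+\frac{K_l(1-p_l)}{p_l}=T_l-K_l$ are exact identities.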

## Appendix C. Differences in Analysis Techniques

- We remove the restriction ${p}_{l}\ge \frac{\epsilon +{p}_{*}}{1+\sqrt{\frac{1}{2}}}$ (or $\frac{\epsilon +{\Delta}_{l}}{{p}_{l}}\le \frac{1}{\sqrt{2}}$) used in the subsets $\mathcal{M}(\mathbf{p},\epsilon )$ and $\mathcal{N}(\mathbf{p},\epsilon )$ in (Equations (4) and (5) [14]), so that our lower bound depends on all of the arms. To achieve this, our analysis frequently needs to handle the cases $\frac{\epsilon +{\Delta}_{l}}{{p}_{l}}>\frac{1}{2}$ and $\frac{\epsilon +{\Delta}_{l}}{{p}_{l}}\le \frac{1}{2}$ separately (e.g., see the proof of Proposition 1).
- The preceding separation into two cases also introduces further difficulties. For example, our definition of ${G}_{2,l}$ in (30) is modified to contain different constants for the cases ${p}_{l}{T}_{l}>{K}_{l}$ and ${p}_{l}{T}_{l}\le {K}_{l}$, which is not the case in (Lemma 2 [14]). Accordingly, the quantities ${\tilde{\alpha}}_{l}$ in (27) and ${\tilde{\beta}}_{l}$ in (28) appear in our proof but not in [14].
- We replace the inequality ${(1-x)}^{y}\ge {e}^{-1.78xy}$ (for $x\in\left(0,\frac{1}{\sqrt{2}}\right)$ and $y\ge 0$) of (Lemma 3 [14]) by Lemma A2. By using this stronger inequality, we can improve the constant term ${c}_{1}$ from $O\left({\underline{p}}^{2}\right)$ to ${({p}^{*}+\epsilon )}^{2}$. In addition, Lemma A2 does not require the assumption $x\le \frac{1}{\sqrt{2}}$ as in (Lemma 3 [14]), so we can use it for the case ${p}_{*}>\frac{1}{2}$, which required a separate analysis in [14].
- To further reduce the constant term from ${({p}^{*}+\epsilon )}^{2}$ to $({p}^{*}+\epsilon )$ (see Theorem 1), we also need to use other mathematical tricks to sharpen certain inequalities, such as (A83).
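Although the statement of Lemma A2 is not reproduced above, its uses in (A55), (A73), and (A87) are all consistent with the form $(1-x)^y\ge\exp\left(-\frac{xy}{\sqrt{1-x}}\right)$ for $x\in[0,1)$ and $y\ge 0$. The snippet below (a numerical sanity check, not from the paper) verifies this inferred form, which indeed needs no restriction of the kind $x\le\frac{1}{\sqrt{2}}$, on a fine grid of $x$ values:

```python
import math

# Check the inferred form of Lemma A2: for x in [0, 1) and y >= 0,
#   (1 - x)^y >= exp(-x * y / sqrt(1 - x)),
# which (taking logs and dividing by y > 0) is equivalent to
#   log(1 - x) >= -x / sqrt(1 - x).
def lemma_a2_margin(x):
    return math.log(1 - x) + x / math.sqrt(1 - x)

# The margin is zero at x = 0 and strictly positive on (0, 1): its derivative
# is (1 - x/2)/(1-x)^{3/2} - 1/(1-x) >= 0 since (1 - x/2)^2 >= 1 - x.
margins = [lemma_a2_margin(i / 1000) for i in range(1, 1000)]
```

All margins come out positive, matching the monotonicity argument in the comment; the bound degrades gracefully as $x\to 1$, where the $e^{-1.78xy}$ bound of (Lemma 3 [14]) is unavailable.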

## References

- Lattimore, T.; Szepesvári, C. Bandit Algorithms; Cambridge University Press: Cambridge, UK, to appear.
- Villar, S.S.; Bowden, J.; Wason, J. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Stat. Sci.
**2015**, 30, 199–215. [Google Scholar] [CrossRef] [PubMed] - Li, L.; Chu, W.; Langford, J.; Schapire, R.E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010. [Google Scholar]
- Awerbuch, B.; Kleinberg, R.D. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Symposium of Theory of Computing (STOC04), Chicago, IL, USA, 5–8 June 2004. [Google Scholar]
- Shen, W.; Wang, J.; Jiang, Y.G.; Zha, H. Portfolio Choices with Orthogonal Bandit Learning. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI-15), Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
- Bechhofer, R.E. A sequential multiple-decision procedure for selecting the best one of several normal populations with a common unknown variance, and its use with various experimental designs. Biometrics
**1958**, 14, 408–429. [Google Scholar] [CrossRef] - Paulson, E. A sequential procedure for selecting the population with the largest mean from k normal populations. Ann. Math. Stat.
**1964**, 35, 174–180. [Google Scholar] [CrossRef] - Even-Dar, E.; Mannor, S.; Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the Fifteenth Annual Conference on Computational Learning Theory, Sydney, Australia, 8–10 July 2002. [Google Scholar]
- Kalyanakrishnan, S.; Tewari, A.; Auer, P.; Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012. [Google Scholar]
- Gabillon, V.; Ghavamzadeh, M.; Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
- Jamieson, K.; Malloy, M.; Nowak, R.; Bubeck, S. On finding the largest mean among many. arXiv
**2013**, arXiv:1306.3917. [Google Scholar] - Karnin, Z.; Koren, T.; Somekh, O. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
- Jamieson, K.; Malloy, M.; Nowak, R.; Bubeck, S. lil’UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. arXiv
**2013**, arXiv:1312.7308. [Google Scholar] - Mannor, S.; Tsitsiklis, J.N. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. J. Mach. Learn. Res.
**2004**, 5, 623–648. [Google Scholar] - Kaufmann, E.; Cappé, O.; Garivier, A. On the Complexity of Best-arm Identification in Multi-armed Bandit Models. J. Mach. Learn. Res.
**2016**, 17, 1–42. [Google Scholar] - Carpentier, A.; Locatelli, A. Tight (Lower) Bounds for the Fixed Budget Best Arm Identification Bandit Problem. In Proceedings of the Conference On Learning Theory, New York, NY, USA, 23–26 June 2016. [Google Scholar]
- Chen, L.; Li, J.; Qiao, M. Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
- Simchowitz, M.; Jamieson, K.G.; Recht, B. The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime. arXiv
**2017**, arXiv:1702.05186. [Google Scholar] - Bubeck, S.; Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. In Foundations and Trends in Machine Learning; Now Publishers Inc.: Hanover, MA, USA, 2012; Volume 5. [Google Scholar]
- Royden, H.; Fitzpatrick, P. Real Analysis, 4th ed.; Pearson: New York, NY, USA, 2010. [Google Scholar]
- Katariya, S.; Jain, L.; Sengupta, N.; Evans, J.; Nowak, R. Adaptive Sampling for Coarse Ranking. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018), Lanzarote, Spain, 9–11 April 2018. [Google Scholar]
- Billingsley, P. Probability and Measure, 3rd ed.; Wiley-Interscience: Hoboken, NJ, USA, 1995. [Google Scholar]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Truong, L.V.; Scarlett, J.
On Gap-Based Lower Bounding Techniques for Best-Arm Identification. *Entropy* **2020**, *22*, 788.
https://doi.org/10.3390/e22070788
