Abstract
In this paper, we study an independent Bernoulli two-armed bandit with unknown parameters, where the two parameters have a pair of prior distributions built from an arbitrary positive measure on the unit interval. Berry proposed the conjecture that, given such a pair of prior distributions, the arm with prior R is the current optimal choice if it has been observed no more often than the other arm and the expectation of its parameter is not less than that of the other arm. We give an easily verifiable equivalent form of Berry’s conjecture and use it to prove that Berry’s conjecture holds when R and L are two-point distributions, as well as when R and L are beta distributions and the number of trials does not exceed an explicit bound.
Keywords:
Bernoulli two-armed bandit; stochastically maximizing; prior distributions; Bayesian decision theory
MSC:
62C10; 62L10
1. Introduction
The bandit problem is a well-known problem in sequential control under conditions of incomplete information. It involves sequential selections from several options, referred to as the arms of the bandit, whose payoffs are characterized by parameters that are typically unknown. The gambler must learn from past information when deciding which arm to select next, with the aim of maximizing the total payoff.
This problem was first raised by Thompson in the study of medical trials [1] and has been applied to market pricing (see [2]), digital marketing (see [3]), search problems (see [4]) and many other sequential statistical decision problems which are characterized by the trade-off between exploration and exploitation (i.e., between long-term benefits and short-term gains). For example, gamblers may choose to make enough observations of each arm in the early stages to estimate the gain for each arm and then select the arm with the largest estimated gain in the later stages. Observations of bad arms in the early stages can reduce short-term gains, but the information they bring can enhance long-term gains. The trade-off between short-term and long-term gains to maximize total payoffs is the key to the bandit problem.
There are three main schools of early research on the bandit problem: Berry’s school, which focuses on the finite-horizon setting, Gittins’s school, which studies an infinite horizon with discounting, and Robbins’s school, which focuses on the time-averaged infinite-horizon setting.
Here, we focus on the two-armed bandit problem proposed by Berry [5]. It is an important foundational model, and there are many variants based on it, such as the models in [6,7]. It can also be used directly in practice, such as in the study of human selection behavior [8].
Consider two independent Bernoulli processes with unknown parameters, one called the right arm and the other the left arm. An observation on either arm is called a pull. A right pull or a left pull is made at each of N stages, and the result of the pull at each stage is known before the next pull is made. In Berry’s setting, the parameters associated with the right and left arms are not known but are random variables. The sequences of successes and failures on the two arms are therefore not sequences of independent Bernoulli trials, but they are conditionally independent given the unknown parameters. The goal of this problem is to find a strategy that maximizes the expected number of wins after N pulls.
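To make the setting concrete, the following is a minimal simulation sketch of this model; it is not taken from Berry’s paper, and the uniform priors, the function names and the fixed alternating strategy are illustrative assumptions only.

```python
import random

def simulate_bandit(N, strategy, rng=random.Random(0)):
    """Simulate one run of the two-armed Bernoulli bandit.

    The unknown parameters are drawn once from their priors (uniform on
    [0, 1] here, purely for illustration); conditionally on them, the
    pulls of each arm are independent Bernoulli trials.
    """
    p_right = rng.random()   # unknown success probability of the right arm
    p_left = rng.random()    # unknown success probability of the left arm
    history = []             # (arm, outcome) pairs known to the gambler
    wins = 0
    for _ in range(N):
        arm = strategy(history)                 # 'R' or 'L', based only on past results
        p = p_right if arm == 'R' else p_left
        outcome = 1 if rng.random() < p else 0  # result revealed before the next pull
        wins += outcome
        history.append((arm, outcome))
    return wins

# Example: a naive strategy that simply alternates between the two arms.
alternate = lambda history: 'R' if len(history) % 2 == 0 else 'L'
print(simulate_bandit(20, alternate))
```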
Berry used Bayesian theory to investigate this problem and assumed that the prior distributions of the two parameters have the following form:
where the base measures are arbitrary positive measures on the unit interval and the remaining factors are normalizing constants.
Although it seems simple, this model has not been completely solved. Its optimal strategy has a closed-form expression only in a few cases, and in most cases it can only be calculated by a recursive formula (Equation (12) in this paper), which is difficult to compute when N is large, even using a computer. Berry proposed a conjecture (Conjecture B in [5]) that the right arm is the current optimal choice if it has been observed no more often than the left arm and the expectation of its parameter with respect to R is not less than the expectation of the left-arm parameter with respect to L. As mentioned in [9], no significant progress has been made in the computation of optimal strategies for over 40 years. Confirming Berry’s conjecture would avoid the use of the recursive formula in many cases and greatly improve the speed of optimal strategy computation for Berry’s bandit model.
The study of Berry’s conjecture is an important step toward improving the theory of bandit models. Intuitively, if an arm has been observed less, then choosing it brings more long-term benefit, since the information it provides helps with later choices. Berry’s conjecture tells us that if an arm has both higher short-term gains and higher long-term gains, then it must be optimal. This is consistent with our intuition. Although Berry’s conjecture is of great importance, it is difficult to prove, and there are few relevant references. Joshi [10] published a paper in The Annals of Statistics announcing a proof of Berry’s conjecture; unfortunately, Joshi later announced that this proof was wrong [10]. Yue [11] studied a problem similar to the set-up of our Theorem 6, but in a two-stage bandit model, which differs significantly from the model studied in this paper.
Over the years, many new models and strategies have arisen, and much attention has turned to asymptotically optimal and suboptimal strategies. Here are a few examples. The famous Gittins index strategy introduced by Gittins and Jones [12] assigns each arm an index as a function of its current state and then activates the arm with the largest index value. This policy optimizes the infinite-horizon expected discounted cost and the infinite-horizon long-run average cost. If more than one arm can change its state in every period, then the problem becomes a so-called restless problem. Whittle [13] proposed an index rule to solve the restless problem. This index is not necessarily optimal, but Weber and Weiss [14] proved that it admits a form of asymptotic optimality as both the number of arms and the number of arms allocated in each period grow to infinity at a fixed proportion. The restless multi-armed bandit model can be used in many applications, such as clinical trials, sensor management and capacity management in healthcare (see [15,16,17,18,19,20]).
A major drawback of the Gittins index and the Whittle index is that they are both difficult to calculate; the fastest current algorithms compute these indices in roughly cubic (or slightly subcubic) time [21,22]. A second drawback of the Gittins index is that the arms must have independent parameters and the discounting scheme must be geometric. If these conditions are not met, then the Gittins index strategy is only suboptimal [23,24].
Another important strategy is the upper confidence bound (UCB) strategy. Lai and Robbins [25] laid out the theory of asymptotically optimal allocation and were the first to actually use the term upper confidence bound. Each arm is assigned a UCB for its mean reward, and the arm with the largest bound is played. The bound is not the conventional upper limit of a confidence interval and is not easy to compute. The design of the confidence bound has been successively improved in [26,27,28,29,30,31,32,33]. Among them, the kl-UCB strategy [30] and the Bayes-UCB strategy [33] are asymptotically optimal for exponential family bandit models.
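As a concrete illustration of the index idea, the sketch below implements the simple UCB1 rule of Auer et al. [29] (empirical mean plus an exploration bonus); it is only one member of the family of bounds discussed above, and the variable names are ours.

```python
import math

def ucb1_choice(counts, sums):
    """Return the index of the arm chosen by the UCB1 rule of [29].

    counts[i] is the number of pulls of arm i so far and sums[i] the total
    reward it has produced; each arm is pulled once before the index is used.
    """
    t = sum(counts)
    for i, n in enumerate(counts):
        if n == 0:            # initialization: try every arm once
            return i
    # UCB1 index: empirical mean + sqrt(2 ln t / n_i)
    indices = [sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
               for i in range(len(counts))]
    return max(range(len(counts)), key=indices.__getitem__)

# Example: after 3 right pulls (2 wins) and 1 left pull (1 win),
# the less-explored left arm (index 1) is selected.
print(ucb1_choice([3, 1], [2, 1]))
```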
The Gittins index, the UCB method and other strategies such as Thompson sampling and ε-greedy are all suboptimal when applied to Berry’s model. When the number of pulls N is not very large, they differ significantly from the optimal strategy. Therefore, it is still necessary to prove Berry’s conjecture and to accelerate the computation of the optimal strategy.
In this paper, we prove that Berry’s conjecture is equivalent to the following statement:
Statement.
If the two arms share a common base measure, the two parameters have equal expectations and the right arm has been observed no more often than the left arm, then the right arm is the optimal choice.
This result reveals that Berry’s conjecture is essentially a quantitative study of the relationship between exploitation and exploration. It shows that when the prior expectations of the two arms are equal, the arm with fewer observations is the more worthwhile selection. Using this result, we study two specific models.
The first special case is where the two arms share a common base measure that is a two-point distribution, placing equal probability on each of two fixed points. In this case, we prove that our Statement holds and obtain a more complete conclusion than Berry’s conjecture: for any admissible parameters, the right arm is currently optimal if and only if its expected parameter is not less than that of the left arm. This is consistent with the conclusion that Berry obtained in a different way in [5].
The second special case is where the initial distributions R and L are both beta distributions. A partial result is obtained in this case: under the conditions of Berry’s conjecture, if the number of remaining pulls does not exceed an explicit bound determined by the prior parameters, then the current optimal choice is the right arm. Here, ⌊x⌋ denotes the largest integer less than or equal to x.
This paper is organized as follows. In Section 2, the concepts and results used in this paper are given. In Section 3, the main result is obtained, which proves the equivalence between Berry’s conjecture and our Statement. In Section 4 and Section 5, we discuss two specific cases, where the initial distributions R and L are both two-point distributions or both beta distributions, respectively. Finally, Section 6 gives the conclusions and future research directions.
2. Preliminaries
A brief introduction to the notation and the structure of the problem is given below. See [5] for details.
As mentioned above, the gambler needs to choose between two arms, the right arm and the left arm, which are independent Bernoulli processes with unknown parameters. Berry used Bayesian theory to investigate this problem. The pattern of information known at a given stage is regarded as a pair of probability distributions of the two unknown parameters. The initial information consists of the distribution R of the right-arm parameter and the distribution L of the left-arm parameter, so the initial pattern is the pair (R, L). If the right arm is pulled, then we update the distribution R of the right arm according to the result using Bayes’ theorem. Similarly, a pull on the left arm updates L to a posterior distribution.
A common goal of gamblers is to maximize their payoffs. Assuming that the utility function of their payoffs is linear, the goal of this problem is to find a strategy to maximize the expected number of wins.
2.1. The Initial Distributions
In this model, Berry considered a special form of initial distribution. The initial probability distributions R and L of the Bernoulli parameters of the right and left arms, respectively, are taken as follows:
where the base measures are arbitrary positive measures on the unit interval and the normalizing constants are defined by
Note that the exponents here are not necessarily integers but may be any real numbers for which the normalizing integrals converge. The set of exponent pairs for which the integral is finite is called the possibility region. It is easy to verify that, for a finite base measure, the integral converges whenever both exponents are nonnegative real numbers. Therefore, for any measure which assigns positive mass to the interior of the unit interval, the possibility region is a quadrant of the plane. If the exponents are integers, then using Bayes’ theorem, we can regard the distribution R as derived from its base measure and a number of pulls on the right arm, with the corresponding numbers of successes and failures. Similarly, L is derived from its base measure and a number of pulls on the left arm.
With these notations, the initial distribution can be written as
If the right arm is pulled and wins, the posterior distribution is obtained by increasing the success exponent of R by one, while a failure increases its failure exponent by one. Similarly, a pull on the left arm updates the corresponding exponent of L.
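For concreteness, one standard way of writing a prior of this tilted form and its Bayesian updates is sketched below; the exponent symbols m_R and n_R (successes and failures attributed to the right arm) and the measure symbol are our own notation and need not match the paper’s symbols.

```latex
% Sketch of the tilted prior family and its updates (notation ours).
% R is the prior of the right-arm parameter rho, mu_R its base measure on [0,1]:
\[
  \mathrm{d}R(\rho) = C_R\,\rho^{m_R}(1-\rho)^{n_R}\,\mathrm{d}\mu_R(\rho),
  \qquad
  C_R^{-1} = \int_0^1 x^{m_R}(1-x)^{n_R}\,\mathrm{d}\mu_R(x),
\]
% and analogously for L. A right pull multiplies the prior by rho (success)
% or by 1 - rho (failure), so only the exponents change:
\[
  \text{success: } (m_R,n_R)\mapsto(m_R+1,\,n_R), \qquad
  \text{failure: } (m_R,n_R)\mapsto(m_R,\,n_R+1).
\]
```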
Sometimes, we only need to consider the distribution R of the right arm, and thus we can write and as
Let and denote the expectations of and for the distributions R and L, respectively.
An important case is when both arms share the same base measure. In this case, the difference between the distributions R and L is entirely determined by their respective exponent pairs. Therefore, without causing confusion, the above notation will be shortened accordingly.
In this paper, we focus on two special cases. The first case is when the common base measure is a two-point distribution, placing equal probability on each of two fixed points of the unit interval. The distributions R and L are then also two-point distributions, and
The possibility region depends on the two support points. If both points lie in the open unit interval, then the possibility region is the whole plane. If one support point equals 0, the success exponent must be nonnegative; if one support point equals 1, the failure exponent must be nonnegative. The corresponding expectations are
In the second case, let the two arms share a common base measure defined by
and let the success and failure exponents satisfy
which are equivalent to the corresponding beta parameters being positive numbers. Then, the distributions R and L are both beta distributions and
The corresponding expectations are
By Bayes’ formula, all posterior distributions are also beta distributions.
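As a worked instance of this conjugacy (the Beta(a, b) parametrization below is our own choice of notation), the update of the right arm reads:

```latex
% Beta-Bernoulli conjugacy: a Beta(a, b) prior has mean a / (a + b), and
\[
  \rho \sim \mathrm{Beta}(a,b):\quad
  \rho \mid \text{success} \sim \mathrm{Beta}(a+1,\,b), \qquad
  \rho \mid \text{failure} \sim \mathrm{Beta}(a,\,b+1),
\]
% so s successes and f failures in any order lead to Beta(a + s, b + f).
```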
2.2. The Function
This problem has a dynamic programming property. In each selection, the gambler always needs to choose the arm that will lead to the greatest subsequent gain based on the current information.
Let the right (left) worth denote the worth of the pattern with N pulls remaining when the right (left) arm is pulled at the first stage and an optimal procedure is followed thereafter. Let the value of a pattern be its worth when an optimal procedure is followed throughout. Then, we have
Using this dynamic programming property, for all N and all patterns, we have
Note that for any . Then, for any and , we can define an important function:
The function represents the advantage of choosing the right arm over the left arm in the first stage. A nonnegative value means that the right arm is optimal, and a positive value means that the right arm is strictly better than the left arm.
By simple calculation, we obtain that the function can be defined recursively. Let and . Then, for any and , we have
In addition, for , there is
Proposition 1
(Theorem 3.1 in [5]). For any and , there is
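The recursion in Equation (12) can be evaluated directly by backward induction. The sketch below does this for the beta special case of Section 5, where a pattern is summarized by the four beta parameters; the function names and memoization are ours, and the only modelling fact used is that a Beta(a, b) prior gives predictive success probability a / (a + b).

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def worth(n, a, b, c, d):
    """Optimal expected number of wins with n pulls remaining, when the right-arm
    parameter has a Beta(a, b) prior and the left-arm parameter a Beta(c, d) prior."""
    if n == 0:
        return 0.0
    return max(worth_right(n, a, b, c, d), worth_left(n, a, b, c, d))

def worth_right(n, a, b, c, d):
    p = a / (a + b)  # predictive probability that a right pull wins
    return p * (1.0 + worth(n - 1, a + 1, b, c, d)) + (1.0 - p) * worth(n - 1, a, b + 1, c, d)

def worth_left(n, a, b, c, d):
    q = c / (c + d)  # predictive probability that a left pull wins
    return q * (1.0 + worth(n - 1, a, b, c + 1, d)) + (1.0 - q) * worth(n - 1, a, b, c, d + 1)

def advantage(n, a, b, c, d):
    """Advantage of pulling the right arm first; nonnegative iff a right pull is optimal."""
    return worth_right(n, a, b, c, d) - worth_left(n, a, b, c, d)

# Example: uniform priors on both arms, plus one extra success observed on the right.
print(advantage(10, 2, 1, 1, 1))  # nonnegative, so the right arm is an optimal first pull
```

Even in this memoized form, the number of reachable patterns grows quickly with N, which is the computational burden noted in the introduction.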
2.3. Berry’s Conjecture and Related Results
Obviously, when there are N pulls left and the current known information is given, we can use the sign of this function to determine which arm is optimal at this stage. Therefore, identifying the sign of this function is the key to finding the optimal strategy. Unfortunately, Berry did not completely solve this problem and instead gave the following Theorem 1.
Theorem 1
(Theorem 5.1 in [5]). The following statements are true for , and any , :
Remark 1.
Theorem 5.2 in [5] states that a strict increase in or a strict decrease in guarantees a strict increase in for all L and N.
Remark 2.
When considering the left arm, a conclusion similar to the above theorem can be obtained by using the fact that . For , and any , , we have
Theorem 1 cannot be used in many cases because its conditions are restrictive. However, it still reveals some properties of the function, such as
and
When R and L are conjugate with each other (i.e., ), there are several more refined results:
Theorem 2
(Theorem 6.4 in [5]). Provided , if and , then for all .
Theorem 2 is intuitive: when the right arm has won more often than the left arm and has lost less often, it is natural to believe that the right arm is better. Corollary 1 and Corollary 2 below follow immediately from Theorem 2.
Corollary 1
(Corollary 1 in [5]). Provided , and if and , then for all .
Corollary 2
(Corollary 2 in [5]). Provided , and if and , then for all .
Intuitively, the conclusion of Corollary 2 can still be strengthened. When , the right arm has less known information than the left arm, so choosing the right arm can bring additional information. Additionally, if at the same time the right arm offers a greater expected immediate payoff, then the optimal choice for the next pull should be the right arm.
With this idea in mind, Berry proposed the following conjecture:
Berry’s Conjecture (Conjecture B in [5]). Let and . If , and , then for all .
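Read with the shorthand of Section 2.1 (a common base measure for both arms, exponent pairs (m_R, n_R) and (m_L, n_L), parameter expectations E_R and E_L, and advantage function Δ_N; all of this notation is our own reconstruction rather than a verbatim quotation), the conjecture can be stated as follows:

```latex
% A plausible reading of Conjecture B in our notation (not a verbatim quote):
\[
  m_R + n_R \le m_L + n_L
  \quad\text{and}\quad
  E_R \ge E_L
  \;\Longrightarrow\;
  \Delta_N \ge 0 \quad\text{for all } N \ge 1,
\]
% i.e., an arm that has been observed no more often and has at least as large an
% expected immediate payoff is optimal for the next pull.
```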
3. Main Result
In this section, we prove that Berry’s conjecture is equivalent to the following statement:
Statement.
Let and . If and , then for all integers .
At first sight, Berry’s conjecture appears to be the stronger result, with the Statement a direct corollary of it. However, we show below that the Statement is in fact equivalent to Berry’s conjecture.
Here, we quote two results obtained by Berry regarding the partial derivatives of and . See the discussion before Equation (4.8) in [5]:
and
Using these results, we can derive the following Lemma 1.
Lemma 1.
Let and be interior points of the possibility region of μ, and θ be real numbers such that and are both in the possibility region of μ. Then, for any positive integer N, we have
Proof.
Let us use mathematical induction. When , we have
Notice that
Therefore, we have
Now, let us assume that Equation (20) holds for N, and consider the case of N + 1. Using Equation (12), we have
By taking the derivative, we obtain the following equality:
By combining Equations (21), (22), (24) and (25), we obtain that the last two summands on the right side of Equation (23) are both negative.
Then, by applying the assumption for N on , , and , we know the first four summands on the right side of Equation (23) are all negative. Hence, we now have
Thus, Equation (20) holds for any positive integer N. □
Now, we can prove the equivalence of the Statement and Berry’s conjecture:
Theorem 3.
The Statement holds if and only if Berry’s conjecture holds.
Proof.
Assume that the Statement holds. When , we have . If , then the conclusion of Berry’s conjecture must hold by applying the Statement. For any and , there must be by using Equations (16) and (17). If , then by applying Corollary 2. Thus, we only need to prove the case where and .
Therefore, the possibility region of contains at least all pairs of that satisfy and (see Section 2.1). Let . With the equalities in Equations (18) and (19), we have . If , then , and . Due to and , we have
Then, there exists such that . Since is an interior point of the possibility region of , we can consider . With Lemma 1, we obtain
Note that when and , we can use the Statement to obtain for any . Therefore, the desired result holds. □
Theorem 3 reduces the inequality condition on the expectations in Berry’s conjecture to the case of equality. Unfortunately, the Statement is still not easy to prove. In the following, we continue this discussion in two important special cases, where R and L are two-point distributions or beta distributions.
4. Two-Point Distribution Case
In this section, we consider the situation where the two arms share a common base measure that is a two-point distribution, placing equal probability on two fixed points. Without causing confusion, we will omit the base measure from the notation. In this case, we prove that the Statement holds and obtain a more complete conclusion than Berry’s conjecture, owing to the good properties of the two-point prior. This is consistent with the conclusion that Berry obtained by discussing contours in the plane.
In the following discussion, we take both support points in the open unit interval, so that all exponent pairs in the plane lie in the possibility region.
To prove the Statement, we first need to find the points in the possibility region where the expected values of the two parameters are equal, that is, where
Since , Equation (28) is equivalent to
Taking logarithms, we obtain
Recall that for and , we have
Using Equations (29) and (30), we obtain that the relationship between and to make Equation (26) hold is
where .
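One way to recover this relation explicitly (writing the support points as θ₁ < θ₂ and taking equal prior mass on them; this notation is ours) is to note that the posterior weights under exponents (m, n) are proportional to θ_i^m (1 − θ_i)^n, so the expectation depends on (m, n) only through a likelihood ratio:

```latex
% Expectation of the parameter under the two-point prior with exponents (m, n):
\[
  E_{(m,n)}
  = \frac{\theta_1^{\,m+1}(1-\theta_1)^{n} + \theta_2^{\,m+1}(1-\theta_2)^{n}}
         {\theta_1^{\,m}(1-\theta_1)^{n} + \theta_2^{\,m}(1-\theta_2)^{n}},
\]
% a strictly increasing function of the ratio
%   r(m, n) = (theta_2 / theta_1)^m ((1 - theta_2) / (1 - theta_1))^n.
% Hence two exponent pairs give equal expectations iff their ratios agree, i.e.
\[
  m_R \log\frac{\theta_2}{\theta_1} - n_R \log\frac{1-\theta_1}{1-\theta_2}
  = m_L \log\frac{\theta_2}{\theta_1} - n_L \log\frac{1-\theta_1}{1-\theta_2}.
\]
```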
Next, we will show that has a strong symmetry property when . Note that since the possibility region of is the whole plane, the m and n in Theorem 4 can be any integers:
Theorem 4.
If , then for any positive integer N and any numbers , m and n, there is
Proof.
We will use mathematical induction to prove this theorem.
When , for any integers m and n, we have
By letting , and , and applying Equation (34) to , and , we obtain
Then, Equation (32) has been proven for the case where .
Now, assume that Equation (32) holds for N. For any and any numbers , m and n, we need to prove that Equation (32) also holds for N + 1.
Consider the case. Using the recursive Equation (12), we have
Thus, if , then there must be . Hence, we obtain
If , then , and thus Equation (37) also holds.
With Equations (35) and (37), we obtain that the sum of the first and seventh summands on the right side of Equation (36) is zero.
Together with Equation (34), we obtain that the sum of the third and fifth summands on the right side of Equation (36) is also zero.
Using similar techniques, we can find that the sum of the second and eighth summands on the right side of Equation (36) and the sum of the fourth and sixth summands on the right side of Equation (36) are both zero. Hence, we have proven that
In other words, Equation (32) holds for N + 1. The theorem is proven by induction. □
Based on Theorem 4, we can conclude that the Statement holds for . The following corollary shows that when , which is a stronger conclusion than our Statement:
Corollary 3.
Provided , for any positive integer N and any real numbers , and , there is
Proof.
The corollary can be deduced from Theorem 4 for . □
Note that is not needed in Theorem 4 and Corollary 3, and we can obtain the following result, which is stronger than Berry’s conjecture and consistent with Theorem 8.3 in [5].
Theorem 5.
Provided , for any positive integer N and any real numbers , , and , we have
Proof.
If , by the equivalence between Equations (26) and (31), there is and , where and . Therefore, with Corollary 3, we have
Consider the case . If , then with Theorem 3, we have . If , by the equivalence between Equations (26) and (31), there is
Notice that
and with Equations (18) and (19), we obtain and . Let . Then, we have and Equation (38) becomes
By applying Lemma 1 with and Corollary 3, there is
Therefore, we have if . The case can be proven by a similar method. □
5. Beta Distribution Case
In this section, we consider the case where the two prior distributions have the special structure in which the common base measure is the one defined by Equation (9). Then, R and L are both beta distributions.
Although we believe that the Statement holds in this case, only a partial result is obtained. In this case, the expectations of the two parameters are increasing functions of the corresponding success exponents and decreasing functions of the corresponding failure exponents. Therefore, the two expectations are equal if and only if a simple relation between the exponents holds.
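Assuming, as the beta form suggests, that the base measure of Equation (9) is Lebesgue measure on the unit interval, the exponent pair (m, n) corresponds to a Beta(m + 1, n + 1) distribution (our parametrization), and the monotonicity just described can be read off the beta mean:

```latex
% Mean of the Lebesgue-tilted prior with exponents (m, n):
\[
  E_{(m,n)}
  = \frac{\int_0^1 x^{\,m+1}(1-x)^{n}\,\mathrm{d}x}{\int_0^1 x^{\,m}(1-x)^{n}\,\mathrm{d}x}
  = \frac{m+1}{m+n+2},
\]
% increasing in m and decreasing in n. In particular, the two arms have equal
% expectations iff (m_R + 1)(m_L + n_L + 2) = (m_L + 1)(m_R + n_R + 2).
```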
The following theorem shows that holds for when :
Theorem 6.
If , and , and if for a fixed positive integer we have , then for any integer , there is .
Proof.
Using mathematical induction again, we prove that if and , then the following two inequalities hold:
and for any integer ,
Using the exact version of Equation (13) for , it is easy to verify the above inequalities:
For any , we have
Since , we have
Now, we assume that Equations (42) and (43) hold for , and we will show that Equations (42) and (43) also hold for . Note that when we consider pulls, the condition for m will become .
First, it can be deduced from Equation (12) that
Using the assumption for N, we know that for , we have
Since , by using Remark 2, we can obtain
Hence, , and consequently, we find that Equation (42) holds for ; in other words, we have
Next, we will prove that Equation (43) also holds for . For any , because of , by using Theorem 1, we obtain
Then, there is .
Since , the sum of the first and fifth summands on the right side of Equation (44) is
Since , and , we can obtain by Theorem 1 that
Due to the fact that , we can apply the assumption for N pulls and obtain . Consequently, we have
Therefore, the third summand on the right side of Equation (44) is nonnegative.
By using Theorem 1 twice, we obtain
Hence, , and the fourth summand on the right side of Equation (44) is nonnegative.
Now, we only need to consider . As we know, . It follows from Remark 2 that
Let , and . Then, and still hold. Hence, the assumption for N can be applied, and we obtain
Then, we obtain . Thus far, all summands on the right side of the equality in Equation (44) are nonnegative. Hence, Equation (43) holds for ; that is, for any , we have
Then, the theorem is proven by mathematical induction. □
The Statement is partially proven in Theorem 6, and hence we can deduce a partial result about Berry’s conjecture.
Consider , where and are positive real numbers and satisfy and . Using the properties of the beta distribution, there is
For and such that and , applying Lemma 1, there is
In order to use Theorem 6, we require
which is equivalent to . To make as large as possible, we choose and . Now let
We obtain the following result by using Theorem 6:
Theorem 7.
Let , and be positive real numbers and satisfy and , . Then, for , there is
Proof.
Let and . If , then by using Corollary 2.
If , then we know that , and using the fact that , we obtain from Theorem 6 that for ,
It can be deduced from Lemma 1 that . Therefore, the proof is complete. □
Obviously , and thus Berry’s conjecture always holds for . However, when we want to use Theorem 7 for large values of N, we need the bound to be large enough, which in turn requires the prior parameters to be correspondingly large. This is consistent with the intuition that the greater the expectation of an arm, the greater its advantage in the choice.
6. Conclusions
As mentioned in the introduction, the bandit model addresses the trade-off between exploration and exploitation, and Berry’s conjecture is an intuitive conjecture about this trade-off. In this paper, we reveal the essence of Berry’s conjecture by proving the equivalence of the conjecture and our Statement. The Statement is easier to verify than Berry’s conjecture and thus provides a new route to proving it. We also proved that Berry’s conjecture holds in two specific models, which can speed up the computation of optimal strategies. We believe that Berry’s conjecture is correct for most positive measures. In the future, we will apply our Statement to prove Berry’s conjecture for other specific models.
Author Contributions
Conceptualization, P.W. and J.Z.; methodology, P.W. and J.Z.; validation, J.Z.; formal analysis, P.W. and J.Z.; writing—original draft preparation, J.Z.; writing—review and editing, P.W. and J.Z.; supervision, P.W.; funding acquisition, P.W. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Key Research and Development Program of China (No. 2018YFA0703900) and the Natural Science Foundation of Shandong Province (Nos. ZR2021MA098 and ZR2019ZD41).
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analysed during the current (theoretical) study.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Thompson, W.R. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika 1933, 25, 285.
- Rothschild, M. A two-armed bandit theory of market pricing. J. Econ. Theory 1974, 9, 185–202.
- Liberali, G.B.; Hauser, J.R.; Urban, G.L. Morphing Theory and Applications. In International Series in Operations Research & Management Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 531–562.
- Aggarwal, C.C. Recommender Systems; Springer International Publishing: Berlin/Heidelberg, Germany, 2016.
- Berry, D.A. A Bernoulli Two-armed Bandit. Ann. Math. Stat. 1972, 43, 871–897.
- Berry, D.A.; Chen, R.W.; Zame, A.; Heath, D.C.; Shepp, L.A. Bandit problems with infinitely many arms. Ann. Stat. 1997, 25, 2103–2116.
- Lin, C.T.; Shiau, C.J. Some Optimal Strategies for Bandit Problems with Beta Prior Distributions. Ann. Inst. Stat. Math. 2000, 52, 397–405.
- Steyvers, M.; Lee, M.D.; Wagenmakers, E.J. A Bayesian analysis of human decision-making on bandit problems. J. Math. Psychol. 2009, 53, 168–179.
- Jacko, P. The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Multidisciplinary Survey of the History, State of the Art, and Myths; Working Paper 3; Lancaster University Management School: Lancaster, UK, 2019.
- Joshi, V.M. A Conjecture of Berry Regarding A Bernoulli Two-Armed Bandit. Ann. Stat. 1975, 3, 189–202; Correction in Ann. Stat. 1985, 13, 1249.
- Yue, J.C. Generalized two-stage bandit problem. Commun. Stat.-Theory Methods 1999, 28, 2261–2276.
- Gittins, J.; Jones, D. A Dynamic Allocation Index for the Sequential Design of Experiments. In Progress in Statistics; Gani, J., Ed.; North-Holland: Amsterdam, The Netherlands, 1974; pp. 241–266.
- Whittle, P. Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 1988, 25, 287–298.
- Weber, R.R.; Weiss, G. On an index policy for restless bandits. J. Appl. Probab. 1990, 27, 637–648.
- Ahmad, S.H.A.; Liu, M.; Javidi, T.; Zhao, Q.; Krishnamachari, B. Optimality of Myopic Sensing in Multichannel Opportunistic Access. IEEE Trans. Inf. Theory 2009, 55, 4040–4050.
- Deo, S.; Iravani, S.; Jiang, T.; Smilowitz, K.; Samuelson, S. Improving Health Outcomes Through Better Capacity Allocation in a Community-Based Chronic Care Model. Oper. Res. 2013, 61, 1277–1294.
- Gittins, J.; Glazebrook, K.; Weber, R. Multi-Armed Bandit Allocation Indices; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2011.
- Lee, E.; Lavieri, M.S.; Volk, M. Optimal Screening for Hepatocellular Carcinoma: A Restless Bandit Model. Manuf. Serv. Oper. Manag. 2019, 21, 198–212.
- Mahajan, A.; Teneketzis, D. Multi-Armed Bandit Problems. In Foundations and Applications of Sensor Management; Springer US: New York, NY, USA, 2008; pp. 121–151.
- Washburn, R.B. Application of Multi-Armed Bandits to Sensor Management. In Foundations and Applications of Sensor Management; Springer US: New York, NY, USA, 2008; pp. 153–175.
- Gast, N.; Gaujal, B.; Khun, K. Computing Whittle (and Gittins) Index in Subcubic Time. arXiv 2022, arXiv:2203.05207.
- Niño-Mora, J. A (2/3)n³ Fast-Pivoting Algorithm for the Gittins Index and Optimal Stopping of a Markov Chain. INFORMS J. Comput. 2007, 19, 596–606.
- Berry, D.A.; Fristedt, B. Bandit Problems: Sequential Allocation of Experiments; Springer: Dordrecht, The Netherlands, 1985.
- Gittins, J.; Wang, Y.G. The Learning Component of Dynamic Allocation Indices. Ann. Stat. 1992, 20, 1625–1636.
- Lai, T.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 1985, 6, 4–22.
- Agrawal, R. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 1995, 27, 1054–1078.
- Audibert, J.Y.; Bubeck, S. Regret Bounds and Minimax Policies under Partial Monitoring. J. Mach. Learn. Res. 2010, 11, 2785–2836.
- Audibert, J.Y.; Munos, R.; Szepesvári, C. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci. 2009, 410, 1876–1902.
- Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn. 2002, 47, 235–256.
- Cappé, O.; Garivier, A.; Maillard, O.A.; Munos, R.; Stoltz, G. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann. Stat. 2013, 41, 1516–1541.
- Honda, J.; Takemura, A. An Asymptotically Optimal Bandit Algorithm for Bounded Support Models. In Proceedings of the COLT 2010, Haifa, Israel, 27–29 June 2010; pp. 67–79.
- Lai, T.L. Adaptive Treatment Allocation and the Multi-Armed Bandit Problem. Ann. Stat. 1987, 15, 1091–1114.
- Kaufmann, E. On Bayesian index policies for sequential resource allocation. Ann. Stat. 2018, 46, 842–865.