Article

On the Conjecture of Berry Regarding a Bernoulli Two-Armed Bandit

1 School of Mathematics, Shandong University, Jinan 250100, China
2 Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan 250100, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(3), 733; https://doi.org/10.3390/math11030733
Submission received: 25 December 2022 / Revised: 21 January 2023 / Accepted: 23 January 2023 / Published: 1 February 2023
(This article belongs to the Special Issue Statistical Methods in Mathematical Finance and Economics)

Abstract

In this paper, we study an independent Bernoulli two-armed bandit with unknown parameters $\rho$ and $\lambda$, where $\rho$ and $\lambda$ have a pair of prior distributions such that $dR(\rho) = C_R\,\rho^{r_0}(1-\rho)^{r_0'}\,d\mu(\rho)$ and $dL(\lambda) = C_L\,\lambda^{l_0}(1-\lambda)^{l_0'}\,d\mu(\lambda)$, with $\mu$ an arbitrary positive measure on $[0,1]$. Berry conjectured that, given a pair of prior distributions $(R, L)$ of the parameters $\rho$ and $\lambda$, the arm with prior $R$ is the current optimal choice if $r_0 + r_0' < l_0 + l_0'$ and the expectation of $\rho$ is not less than that of $\lambda$. We give an easily verifiable equivalent form of Berry's conjecture and use it to prove that Berry's conjecture holds when $R$ and $L$ are two-point distributions, as well as when $R$ and $L$ are beta distributions and the number of trials satisfies $N \le \lfloor r_0/r_0' \rfloor + 1$.

1. Introduction

The bandit problem is a well-known problem of sequential control under incomplete information. It involves sequential selections from several options, referred to as the arms of the bandit, whose payoffs are characterized by parameters that are typically unknown. The gambler must learn from past observations when deciding which arm to select next, with the aim of maximizing the total payoff.
This problem was first raised by Thompson in the study of medical trials [1] and has been applied to market pricing (see [2]), digital marketing (see [3]), search problems (see [4]) and many other sequential statistical decision problems which are characterized by the trade-off between exploration and exploitation (i.e., between long-term benefits and short-term gains). For example, gamblers may choose to make enough observations of each arm in the early stages to estimate the gain for each arm and then select the arm with the largest estimated gain in the later stages. Observations of bad arms in the early stages can reduce short-term gains, but the information they bring can enhance long-term gains. The trade-off between short-term and long-term gains to maximize total payoffs is the key to the bandit problem.
There are three main schools of early research on the bandit problem: Berry's school, which focuses on the finite-horizon setting; Gittins's school, which studies an infinite horizon with discounting; and Robbins's school, which focuses on the time-averaged infinite-horizon setting.
Here, we focus on the two-armed bandit problem proposed by Berry [5]. It is an important foundational model, and there are many variants based on it, such as the models in [6,7]. It can also be used directly in practice, such as in the study of human selection behavior [8].
Let R and L denote independent Bernoulli processes with parameters $\rho$ and $\lambda$, respectively. We call R the right arm and L the left arm. An observation on either arm is called a pull. A right pull or a left pull is made at each of $N$ stages, and the result of the pull at each stage is known before the next pull is made. In Berry's setting, the parameters $\rho$ and $\lambda$ associated with R and L are not known but are random variables. The sequences of successes and failures on the right and left arms are therefore not sequences of independent Bernoulli trials, but they are conditionally independent given the unknown parameters $\rho$ and $\lambda$. The goal of this problem is to find a strategy that maximizes the expected number of wins after $N$ pulls.
Berry used Bayesian theory to investigate this problem and assumed that the prior distributions of $\rho$ and $\lambda$ have the following general form:
$dR(\rho) = C_R\,\rho^{r_0}(1-\rho)^{r_0'}\,d\mu_R(\rho),$
$dL(\lambda) = C_L\,\lambda^{l_0}(1-\lambda)^{l_0'}\,d\mu_L(\lambda),$
where $\mu_R$ and $\mu_L$ are arbitrary positive measures on $[0,1]$ and $C_R$ and $C_L$ are normalizing constants.
Although it seems simple, this model has not been completely solved. Its optimal strategy has a closed-form expression only in a few cases; in general, it can only be calculated by a recursive formula (Equation (12) in this paper), which is expensive to compute when $N$ is large, even with a computer. Berry proposed a conjecture (Conjecture B in [5]) that arm R is the current optimal choice if $\mu_R = \mu_L$, $r_0 + r_0' < l_0 + l_0'$ and $E(\rho \mid r_0, r_0') \ge E(\lambda \mid l_0, l_0')$. Here, $E(\rho \mid r_0, r_0')$ and $E(\lambda \mid l_0, l_0')$ are the expectations of $\rho$ and $\lambda$ with respect to the distributions R and L, respectively. As mentioned in [9], no significant progress has been made in the computation of optimal strategies for over 40 years. Confirming Berry's conjecture would avoid the use of the recursive formula in many cases and greatly speed up the computation of optimal strategies for Berry's bandit model.
The study of Berry's conjecture is an important step toward completing the theory of bandit models. Intuitively, if an arm has been observed less, then choosing it brings more long-term benefit, since the information it yields helps with later choices. Berry's conjecture says that if an arm has both higher short-term gains and higher long-term gains, then it must be optimal, which is consistent with this intuition. Although Berry's conjecture is of great importance, it is difficult to prove, and there are few relevant references. Joshi published a paper in The Annals of Statistics announcing a proof of Berry's conjecture, but later acknowledged that the proof was flawed [10]. Yue [11] studied a problem similar to the set-up of our Theorem 6, but for a two-stage bandit model, which differs significantly from the model studied in this paper.
After years of research, more and more new models and strategies have arisen. Many have turned to asymptotically optimal and suboptimal strategies. Here are a few examples. The famous Gittins index strategy introduced by Gittins and Jones [12] assigns each arm an index as a function of its current state and then activates the arm with the largest index value. This policy optimizes the infinite-horizon expected discounted costs and infinite-horizon long-run average cost. If more than one arm can change its state in every period, then the problem becomes a so-called restless problem. Whittle [13] proposed an index rule to solve the restless problem. This index is not necessarily optimal, but Weber and Weiss [14] proved that it would admit a form of asymptotic optimality as both the number of arms and the number of allocated arms in each period grow to infinity at a fixed proportion. The restless multi-armed bandit model can be used in many applications, such as clinical trials, sensor management and capacity management in healthcare (see [15,16,17,18,19,20]).
A major drawback of the Gittins index and Whittle index is that they are both difficult to calculate. The current fastest algorithm can only solve the index in cubic time [21,22]. A second drawback of the Gittins index is that the arms must have independent parameters, and the discounting scheme must be geometric. If these conditions are not met, then the Gittins index strategy is only suboptimal [23,24].
Another important strategy is the upper confidence bound (UCB) strategy. Lai and Robbins [25] laid out the theory of asymptotically optimal allocation and were the first to actually use the term upper confidence bound. Each arm is assigned a UCB for its mean reward, and the arm with the largest bound is played. The bound is not the conventional upper limit of a confidence interval and is not easy to compute. The design of the confidence bound has been successively improved in [26,27,28,29,30,31,32,33]. Among them, the kl-UCB strategy [30] and the Bayes-UCB strategy [33] are asymptotically optimal for exponential family bandit models.
The Gittins index, UCB methods and other strategies such as Thompson sampling and $\epsilon$-greedy are all suboptimal when applied to Berry's model: when the number of pulls $N$ is not very large, they can differ significantly from the optimal strategy. Therefore, it is still necessary to prove Berry's conjecture and thereby accelerate the computation of the optimal strategy.
In this paper, we prove that Berry’s conjecture is equivalent to the following statement:
Statement. 
If $\mu_R = \mu_L = \mu$, $r_0 + r_0' \le l_0 + l_0'$ and $E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0')$, then arm R is the optimal choice.
This result reveals that Berry's conjecture is essentially a quantitative study of the relationship between exploitation and exploration: when the two arms have the same underlying measure $\mu$ and equal prior mean payoffs, the arm with fewer observations is the more worthwhile choice. Using this result, we study two specific models.
The first special case is where $\mu = \tau$ and $\tau$ is a two-point distribution placing probability $\frac{1}{2}$ on each of $\tau_1$ and $\tau_2$. In this case, we prove that our Statement holds and obtain a conclusion more complete than Berry's conjecture: for any real numbers $r_0, r_0', l_0$ and $l_0'$, the right arm is currently optimal if and only if $E(\rho \mid r_0, r_0') \ge E(\lambda \mid l_0, l_0')$. This is consistent with the conclusion that Berry obtained in a different way in [5].
The second special case is where the initial distributions R and L are both beta distributions. A partial result is obtained in this case. Let $r_0, r_0', l_0$ and $l_0'$ be positive real numbers satisfying $r_0 + r_0' \le l_0 + l_0'$ and $\frac{r_0}{r_0'} \ge \frac{l_0}{l_0'}$. If the number of remaining pulls satisfies $N \le \lfloor r_0/r_0' \rfloor + 1$, then the current optimal choice is the right arm. Here, $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$.
This paper is organized as follows. In Section 2, the concepts and results used in this paper are given. In Section 3, the main result is obtained, which proves the equivalence between Berry’s conjecture and our Statement. In Section 4 and Section 5, we discuss two specific cases, where the initial distributions R and L are both two-point distributions or both beta distributions, respectively. Finally, Section 6 gives the conclusions and future research directions.

2. Preliminaries

A brief introduction to the notation and the structure of the problem is given below. See [5] for details.
As mentioned above, the gambler chooses between two arms, the right arm R and the left arm L, which are independent Bernoulli processes with unknown parameters $\rho$ and $\lambda$, respectively. Berry used Bayesian theory to investigate this problem. Let $I_k$ denote the pattern of information known about R and L at stage $k+1$, regarded as a pair of probability distributions of the unknown parameters $\rho$ and $\lambda$. $I_0$ denotes the initial information, consisting of the distributions of $\rho$ and $\lambda$. Specifically, $R = R(\rho)$ denotes the distribution of $\rho$, $L = L(\lambda)$ denotes the distribution of $\lambda$, and the initial information is $I_0 = (R, L)$. If the right arm is pulled, then we update the distribution $R$ according to the result using Bayes' theorem; similarly, a pull on the left arm updates $L$ to a posterior distribution.
A common goal of gamblers is to maximize their payoffs. Assuming that the utility function of their payoffs is linear, the goal of this problem is to find a strategy to maximize the expected number of wins.

2.1. The Initial Distributions

In this model, Berry considered a special form of initial distribution. We use the initial probability distributions R and L of the Bernoulli parameters ρ and λ for the right arm R and the left arm L , respectively, as follows:
$dR(\rho) = \frac{1}{v(r_0, r_0'; \mu_R)}\,\rho^{r_0}(1-\rho)^{r_0'}\,d\mu_R(\rho), \qquad (1)$
$dL(\lambda) = \frac{1}{v(l_0, l_0'; \mu_L)}\,\lambda^{l_0}(1-\lambda)^{l_0'}\,d\mu_L(\lambda), \qquad (2)$
where $\mu_R$ and $\mu_L$ are arbitrary positive measures on $[0,1]$ and $v(r_0, r_0'; \mu_R)$ and $v(l_0, l_0'; \mu_L)$ are defined by
$v(r_0, r_0'; \mu_R) = \int_0^1 \rho^{r_0}(1-\rho)^{r_0'}\,d\mu_R(\rho), \qquad v(l_0, l_0'; \mu_L) = \int_0^1 \lambda^{l_0}(1-\lambda)^{l_0'}\,d\mu_L(\lambda). \qquad (3)$
Note that $r_0, r_0'$ and $l_0, l_0'$ here are not necessarily integers but may be any real numbers for which $v(r_0, r_0'; \mu_R)$ and $v(l_0, l_0'; \mu_L)$ are finite. The set of $(r_0, r_0')$ (or $(l_0, l_0')$) making $v(r_0, r_0'; \mu_R) < \infty$ (or $v(l_0, l_0'; \mu_L) < \infty$) is called the possibility region of $\mu_R$ (or $\mu_L$). It is easy to verify that when $\delta r$ and $\delta r'$ are nonnegative real numbers, $v(r_0 + \delta r, r_0' + \delta r'; \mu_R) < \infty$ if $v(r_0, r_0'; \mu_R) < \infty$, since $\rho^{\delta r}(1-\rho)^{\delta r'} \le 1$ on $[0,1]$. Therefore, for any measure $\mu_R$ that assigns positive measure to the interior of the unit interval, the possibility region of $\mu_R$ is a quadrant of the $(r_0, r_0')$ plane. If $r_0$ and $r_0'$ are integers, then using Bayes' theorem we can regard the distribution R as derived from the measure $\mu_R$ and a number $N_R = r_0 + r_0'$ of pulls on the right arm, with $r_0$ successes and $r_0'$ failures. Similarly, if $l_0$ and $l_0'$ are integers, L is derived from the measure $\mu_L$ and a number $N_L = l_0 + l_0'$ of pulls on the left arm, with $l_0$ successes and $l_0'$ failures.
With these notations, the initial distribution $I_0 = (R, L)$ can be written as
$I_0 = (r_0, r_0', \mu_R;\ l_0, l_0', \mu_L). \qquad (4)$
The posterior distribution is $I_1 = (r_0+1, r_0', \mu_R;\ l_0, l_0', \mu_L)$ if the right arm is pulled and wins, while $I_1 = (r_0, r_0'+1, \mu_R;\ l_0, l_0', \mu_L)$ if it fails. Similarly, the posterior distribution is $I_1 = (r_0, r_0', \mu_R;\ l_0+1, l_0', \mu_L)$ if the left arm is pulled and wins, while $I_1 = (r_0, r_0', \mu_R;\ l_0, l_0'+1, \mu_L)$ if it fails.
Sometimes, we only need to track the distribution R of the right arm, and thus we write $I_0$ and $I_1$ as
$I_0 = (r_0, r_0', \mu_R;\ L) \quad \text{and} \quad I_1 = (r_0+1, r_0', \mu_R;\ L).$
Let $E(\rho \mid r_0, r_0'; \mu_R)$ and $E(\lambda \mid l_0, l_0'; \mu_L)$ denote the expectations of $\rho$ and $\lambda$ under the distributions R and L, respectively.
An important case is $\mu_R = \mu_L = \mu$. In this case, the difference between the distributions R and L is entirely determined by $r_0, r_0'$ and $l_0, l_0'$. Therefore, without causing confusion, the above notation will be shortened to $I_0 = (r_0, r_0';\ l_0, l_0')$, $E(\rho \mid r_0, r_0')$ and $E(\lambda \mid l_0, l_0')$.
In this paper, we focus on two special cases of $\mu$. The first is when $\mu$ is a two-point distribution $\tau$ placing probability $\frac{1}{2}$ on each of $\tau_1$ and $\tau_2$, where $\tau_1 < \tau_2$ and not both $\tau_1 = 0$ and $\tau_2 = 1$. The distributions R and L are then also two-point distributions, with
$R(\tau_1) = \frac{\tau_1^{r_0}(1-\tau_1)^{r_0'}}{\tau_1^{r_0}(1-\tau_1)^{r_0'} + \tau_2^{r_0}(1-\tau_2)^{r_0'}}, \qquad R(\tau_2) = \frac{\tau_2^{r_0}(1-\tau_2)^{r_0'}}{\tau_1^{r_0}(1-\tau_1)^{r_0'} + \tau_2^{r_0}(1-\tau_2)^{r_0'}}, \qquad (5)$
$L(\tau_1) = \frac{\tau_1^{l_0}(1-\tau_1)^{l_0'}}{\tau_1^{l_0}(1-\tau_1)^{l_0'} + \tau_2^{l_0}(1-\tau_2)^{l_0'}}, \qquad L(\tau_2) = \frac{\tau_2^{l_0}(1-\tau_2)^{l_0'}}{\tau_1^{l_0}(1-\tau_1)^{l_0'} + \tau_2^{l_0}(1-\tau_2)^{l_0'}}. \qquad (6)$
The possibility region of $\tau$ depends on $\tau_1$ and $\tau_2$. If $0 < \tau_1 < \tau_2 < 1$, then the possibility region is the whole $(r_0, r_0')$ plane. If $\tau_1 = 0$ and $\tau_2 < 1$, then the possibility region is $r_0 \ge 0$ and $r_0' \in \mathbb{R}$. If $\tau_1 > 0$ and $\tau_2 = 1$, then the possibility region is $r_0 \in \mathbb{R}$ and $r_0' \ge 0$. The corresponding expectations are
$E(\rho \mid r_0, r_0'; \tau) = \frac{\tau_1^{r_0+1}(1-\tau_1)^{r_0'} + \tau_2^{r_0+1}(1-\tau_2)^{r_0'}}{\tau_1^{r_0}(1-\tau_1)^{r_0'} + \tau_2^{r_0}(1-\tau_2)^{r_0'}}, \qquad (7)$
$E(\lambda \mid l_0, l_0'; \tau) = \frac{\tau_1^{l_0+1}(1-\tau_1)^{l_0'} + \tau_2^{l_0+1}(1-\tau_2)^{l_0'}}{\tau_1^{l_0}(1-\tau_1)^{l_0'} + \tau_2^{l_0}(1-\tau_2)^{l_0'}}. \qquad (8)$
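As a small illustration (ours, not from the paper), the posterior masses and posterior mean in Equations (5)–(8) can be computed directly; the function name and the assumption $0 < \tau_1 < \tau_2 < 1$ are our own choices.

```python
def two_point_posterior(r, rp, tau1, tau2):
    """Posterior masses R(tau1), R(tau2) and posterior mean E(rho | r, r'; tau)
    for the symmetric two-point prior; assumes 0 < tau1 < tau2 < 1, so the
    exponents (r, rp) may be any real pair."""
    w1 = tau1 ** r * (1.0 - tau1) ** rp  # unnormalized posterior weight at tau1
    w2 = tau2 ** r * (1.0 - tau2) ** rp  # unnormalized posterior weight at tau2
    total = w1 + w2                      # the equal prior masses 1/2 cancel
    return w1 / total, w2 / total, (tau1 * w1 + tau2 * w2) / total

# Example: r0 = 2 successes and r0' = 1 failure on the right arm.
mass1, mass2, mean = two_point_posterior(2, 1, 0.3, 0.6)
```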
In the second case, let $\mu = \beta$, where
$\beta(A) = \int_A x^{-1}(1-x)^{-1}\,dx \quad \text{for any } A \subseteq [0,1], \qquad (9)$
and the numbers of successes and failures $r_0, r_0', l_0, l_0'$ satisfy
$v(r_0, r_0'; \beta) = \int_0^1 \rho^{r_0-1}(1-\rho)^{r_0'-1}\,d\rho < \infty, \qquad v(l_0, l_0'; \beta) = \int_0^1 \lambda^{l_0-1}(1-\lambda)^{l_0'-1}\,d\lambda < \infty,$
which is equivalent to $r_0, r_0', l_0$ and $l_0'$ being positive numbers. Then, the distributions R and L are both beta distributions, with
$dR(\rho) = \frac{1}{v(r_0, r_0'; \beta)}\,\rho^{r_0-1}(1-\rho)^{r_0'-1}\,d\rho,$
$dL(\lambda) = \frac{1}{v(l_0, l_0'; \beta)}\,\lambda^{l_0-1}(1-\lambda)^{l_0'-1}\,d\lambda.$
The corresponding expectations are
$E(\rho \mid r_0, r_0'; \beta) = \frac{r_0}{N_R} = \frac{r_0}{r_0 + r_0'}, \qquad E(\lambda \mid l_0, l_0'; \beta) = \frac{l_0}{N_L} = \frac{l_0}{l_0 + l_0'}.$
Using the Bayesian formula, we know that all posterior distributions are also beta distributions.

2.2. The Function Δ

This problem has a dynamic programming property. In each selection, the gambler always needs to choose the arm that will lead to the greatest subsequent gain based on the current information.
Let $W_{N-k}^R(I_k)$ (respectively, $W_{N-k}^L(I_k)$) denote the worth of the pattern $I_k$ with $N-k$ pulls remaining when the right (left) arm is pulled at stage $k+1$ and an optimal procedure is followed thereafter. Let $W_N(I_0)$ be the worth of $I_0$ when an optimal procedure is followed. Then,
$W_N(I_0) = \max\{ W_N^R(I_0),\ W_N^L(I_0) \} \quad \text{for all } N \ge 1 \text{ and } I_0. \qquad (10)$
Using this dynamic programming property, for all $N \ge 1$ and $I_0 = (r_0, r_0', \mu_R;\ l_0, l_0', \mu_L)$, we have
$W_N^R(I_0) = E(\rho \mid r_0, r_0'; \mu_R) + E(\rho \mid r_0, r_0'; \mu_R)\,W_{N-1}(r_0+1, r_0', \mu_R;\ l_0, l_0', \mu_L) + (1 - E(\rho \mid r_0, r_0'; \mu_R))\,W_{N-1}(r_0, r_0'+1, \mu_R;\ l_0, l_0', \mu_L),$
$W_N^L(I_0) = E(\lambda \mid l_0, l_0'; \mu_L) + E(\lambda \mid l_0, l_0'; \mu_L)\,W_{N-1}(r_0, r_0', \mu_R;\ l_0+1, l_0', \mu_L) + (1 - E(\lambda \mid l_0, l_0'; \mu_L))\,W_{N-1}(r_0, r_0', \mu_R;\ l_0, l_0'+1, \mu_L). \qquad (11)$
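As an illustration only (not part of the original paper), Equations (10) and (11) can be evaluated by memoized dynamic programming. The sketch below assumes the conjugate beta case of Section 2.1, where the posterior means are $r/(r+r')$ and $l/(l+l')$; for another measure $\mu$, one would substitute the corresponding posterior means. The function name is our own.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def worth(N, r, rp, l, lp):
    """W_N(r, r', beta; l, l', beta): expected number of wins in N pulls
    under optimal play, with W_0 = 0. A sketch for the beta case only."""
    if N == 0:
        return 0.0
    e_r, e_l = r / (r + rp), l / (l + lp)  # posterior means of rho and lambda
    w_right = e_r * (1 + worth(N - 1, r + 1, rp, l, lp)) \
        + (1 - e_r) * worth(N - 1, r, rp + 1, l, lp)
    w_left = e_l * (1 + worth(N - 1, r, rp, l + 1, lp)) \
        + (1 - e_l) * worth(N - 1, r, rp, l, lp + 1)
    return max(w_right, w_left)
```

Memoization keeps the number of distinct states polynomial in $N$ (each state records how many successes and failures have been added to each arm), but the cost still grows quickly with $N$, which is why closed-form optimality criteria such as Berry's conjecture are valuable.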
Note that $W_0(I_0) = 0$ for any $I_0$. Then, for any $I_0$ and $N \ge 1$, we can define an important function:
$\Delta_N(I_0) = W_N^R(I_0) - W_N^L(I_0).$
The function $\Delta_N(I_0)$ represents the advantage of choosing R over L in the first stage: $\Delta_N(I_0) \ge 0$ means that R is optimal, and $\Delta_N(I_0) < 0$ means that L is better than R.
By a simple calculation, we obtain that the function $\Delta_N(I_0)$ can be defined recursively. Let $\Delta_N^+(I_0) = \max\{0, \Delta_N(I_0)\} \ge 0$ and $\Delta_N^-(I_0) = \min\{0, \Delta_N(I_0)\} \le 0$. Then, for any $I_0 = (r_0, r_0', \mu_R;\ l_0, l_0', \mu_L)$ and $N \ge 2$, we have
$\Delta_N(I_0) = E(\rho \mid r_0, r_0'; \mu_R)\,\Delta_{N-1}^+(r_0+1, r_0', \mu_R;\ l_0, l_0', \mu_L) + (1 - E(\rho \mid r_0, r_0'; \mu_R))\,\Delta_{N-1}^+(r_0, r_0'+1, \mu_R;\ l_0, l_0', \mu_L) + E(\lambda \mid l_0, l_0'; \mu_L)\,\Delta_{N-1}^-(r_0, r_0', \mu_R;\ l_0+1, l_0', \mu_L) + (1 - E(\lambda \mid l_0, l_0'; \mu_L))\,\Delta_{N-1}^-(r_0, r_0', \mu_R;\ l_0, l_0'+1, \mu_L). \qquad (12)$
In addition, for $N = 1$,
$\Delta_1(I_0) = E(\rho \mid r_0, r_0'; \mu_R) - E(\lambda \mid l_0, l_0'; \mu_L). \qquad (13)$
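To make the recursion concrete, here is a small sketch (ours, not from [5]) that evaluates $\Delta_N$ via Equations (12) and (13) with memoization; the posterior-mean functions are passed in as parameters, and the beta instance at the end uses $E(\rho \mid r, r') = r/(r+r')$ from Section 2.1.

```python
from functools import lru_cache

def make_delta(e_rho, e_lam):
    """Build Delta_N from Equations (12)-(13), given posterior-mean functions
    e_rho(r, rp) ~ E(rho | r, r') and e_lam(l, lp) ~ E(lambda | l, l')."""
    @lru_cache(maxsize=None)
    def delta(N, r, rp, l, lp):
        if N == 1:
            return e_rho(r, rp) - e_lam(l, lp)                      # Equation (13)
        er, el = e_rho(r, rp), e_lam(l, lp)
        return (er * max(delta(N - 1, r + 1, rp, l, lp), 0.0)       # Delta^+ after a right win
                + (1 - er) * max(delta(N - 1, r, rp + 1, l, lp), 0.0)   # Delta^+ after a right loss
                + el * min(delta(N - 1, r, rp, l + 1, lp), 0.0)         # Delta^- after a left win
                + (1 - el) * min(delta(N - 1, r, rp, l, lp + 1), 0.0))  # Delta^- after a left loss
    return delta

# Beta case: the sign of delta_beta(N, ...) identifies the optimal first pull.
delta_beta = make_delta(lambda r, rp: r / (r + rp), lambda l, lp: l / (l + lp))
```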
The following Proposition 1 is easily obtained from Equations (12) and (13).
Proposition 1
(Theorem 3.1 in [5]). For any $I_0 = (r_0, r_0', \mu_R;\ l_0, l_0', \mu_L)$ and $N \ge 1$,
$-(1 - E(\rho \mid r_0, r_0'; \mu_R)) \le \Delta_N(I_0) \le 1 - E(\lambda \mid l_0, l_0'; \mu_L).$

2.3. Berry’s Conjecture and Related Results

Obviously, when there are $N$ pulls left and the current known information is $I_0$, we can use the sign of $\Delta_N(I_0)$ to determine which arm is optimal at this stage. Therefore, identifying the sign of $\Delta_N(I_0)$ is the key to finding the optimal strategy. Unfortunately, Berry did not completely solve this problem and instead gave the following Theorem 1.
Theorem 1
(Theorem 5.1 in [5]). The following statements are true for $N \ge 1$, $I_0 = (r_0, r_0', \mu_R;\ L)$ and any $\delta r, \delta r' \ge 0$:
$\Delta_N(r_0+\delta r, r_0'+\delta r', \mu_R;\ L) \ge \Delta_N(r_0, r_0', \mu_R;\ L)$ if $E(\rho \mid r_0+\delta r+N-1, r_0'+\delta r'; \mu_R) \ge E(\rho \mid r_0+N-1, r_0'; \mu_R)$;
$\Delta_N(r_0+\delta r, r_0'+\delta r', \mu_R;\ L) \le \Delta_N(r_0, r_0', \mu_R;\ L)$ if $E(\rho \mid r_0+\delta r, r_0'+\delta r'+N-1; \mu_R) \le E(\rho \mid r_0, r_0'+N-1; \mu_R)$.
Remark 1.
Theorem 5.2 in [5] states that a strict increase in $E(\rho \mid r_0+N-1, r_0'; \mu_R)$ or a strict decrease in $E(\rho \mid r_0, r_0'+N-1; \mu_R)$ guarantees a strict increase in $\Delta_N(r_0, r_0', \mu_R;\ L)$ for all L and $N$.
Remark 2.
When considering the left arm, a conclusion similar to the above theorem can be obtained by using the fact that $\Delta_N(r_0, r_0', \mu_R;\ l_0, l_0', \mu_L) = -\Delta_N(l_0, l_0', \mu_L;\ r_0, r_0', \mu_R)$. For $N \ge 1$, $I_0 = (R;\ l_0, l_0', \mu_L)$ and any $\delta l, \delta l' \ge 0$, we have
$\Delta_N(R;\ l_0+\delta l, l_0'+\delta l', \mu_L) \le \Delta_N(R;\ l_0, l_0', \mu_L)$ if $E(\lambda \mid l_0+\delta l+N-1, l_0'+\delta l'; \mu_L) \ge E(\lambda \mid l_0+N-1, l_0'; \mu_L)$;
$\Delta_N(R;\ l_0+\delta l, l_0'+\delta l', \mu_L) \ge \Delta_N(R;\ l_0, l_0', \mu_L)$ if $E(\lambda \mid l_0+\delta l, l_0'+\delta l'+N-1; \mu_L) \le E(\lambda \mid l_0, l_0'+N-1; \mu_L)$.
Theorem 1 cannot be used in many cases because its conditions are restrictive. However, it still reveals some properties of the function $\Delta_N(I_0)$, such as
$\Delta_N(r_0+1, r_0', \mu_R;\ L) \ge \Delta_N(r_0, r_0', \mu_R;\ L) \ge \Delta_N(r_0, r_0'+1, \mu_R;\ L) \quad \text{for } N \ge 1 \qquad (14)$
and
$\Delta_N(r_0, r_0', \mu_R;\ L) \le \Delta_{N-1}^+(r_0+1, r_0', \mu_R;\ L) \quad \text{for } N \ge 2. \qquad (15)$
When R and L are conjugate with each other (i.e., μ R = μ L = μ ), there are several more refined results:
Theorem 2
(Theorem 6.4 in [5]). Provided $\mu_R = \mu_L = \mu$, if $r_0 \ge l_0$ and $r_0' \le l_0'$, then $\Delta_N(r_0, r_0';\ l_0, l_0') \ge 0$ for all $N \ge 1$.
Theorem 2 is intuitive: when the right arm has won more often and lost less often than the left arm, we believe that the right arm is better. The following Corollary 1 and Corollary 2 follow immediately from Theorem 2.
Corollary 1
(Corollary 1 in [5]). Provided $\mu_R = \mu_L = \mu$, if $N_R \ge N_L$ and $r_0' \le l_0'$, then $\Delta_N(r_0, r_0';\ l_0, l_0') \ge 0$ for all $N \ge 1$.
Corollary 2
(Corollary 2 in [5]). Provided $\mu_R = \mu_L = \mu$, if $N_R \le N_L$ and $r_0 \ge l_0$, then $\Delta_N(r_0, r_0';\ l_0, l_0') \ge 0$ for all $N \ge 1$.
Intuitively, the conclusion of Corollary 2 can still be strengthened. When $N_R \le N_L$, the right arm has less known information than the left arm, so choosing the right arm brings additional information. If, at the same time, the right arm offers a greater expected immediate payoff, then the optimal choice for the next pull should be the right arm.
With this idea in mind, Berry proposed the following conjecture:
Berry's Conjecture (Conjecture B in [5]). Let $\mu_R = \mu_L = \mu$ and $I_0 = (r_0, r_0';\ l_0, l_0')$. If $r_0 + r_0' \le l_0 + l_0'$ and $E(\rho \mid r_0, r_0') \ge E(\lambda \mid l_0, l_0')$, then $\Delta_N(r_0, r_0', l_0, l_0') \ge 0$ for all $N \ge 1$.

3. Main Result

In this section, we prove that Berry’s conjecture is equivalent to the following statement:
Statement. 
Let $\mu_R = \mu_L = \mu$ and $I_0 = (r_0, r_0';\ l_0, l_0')$. If $r_0 + r_0' \le l_0 + l_0'$ and $E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0')$, then $\Delta_N(r_0, r_0', l_0, l_0') \ge 0$ for all integers $N \ge 1$.
At first sight, Berry's conjecture is the stronger result, with the Statement a direct corollary of it. However, we will show below that the Statement is in fact equivalent to Berry's conjecture.
Here, we quote two results obtained by Berry regarding the partial derivatives of $E(\rho \mid r, r'; \mu_R)$ and $E(\lambda \mid l, l'; \mu_L)$; see the discussion before Equation (4.8) in [5]:
$\frac{\partial}{\partial r} E(\rho \mid r, r'; \mu_R) = \mathrm{Cov}(\rho, \log \rho) \ge 0, \qquad (16)$
$\frac{\partial}{\partial r'} E(\rho \mid r, r'; \mu_R) = \mathrm{Cov}(\rho, \log(1-\rho)) \le 0, \qquad (17)$
and
$\frac{\partial}{\partial l} E(\lambda \mid l, l'; \mu_L) = \mathrm{Cov}(\lambda, \log \lambda) \ge 0, \qquad (18)$
$\frac{\partial}{\partial l'} E(\lambda \mid l, l'; \mu_L) = \mathrm{Cov}(\lambda, \log(1-\lambda)) \le 0. \qquad (19)$
Using these results, we can derive the following Lemma 1.
Lemma 1.
Let $(r_0, r_0')$ and $(l_0, l_0')$ be interior points of the possibility region of $\mu$, and let $K \le 0$ and $\theta$ be real numbers such that $(r_0+\theta K, r_0'-\theta K)$ and $(l_0+\theta, l_0'-\theta)$ are both in the possibility region of $\mu$. Then, for any positive integer $N$, we have
$\frac{d}{d\theta}\,\Delta_N(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta) \le 0. \qquad (20)$
Proof. 
Let us use mathematical induction. When $N = 1$, we have
$\Delta_1(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta) = E(\rho \mid r_0+\theta K, r_0'-\theta K) - E(\lambda \mid l_0+\theta, l_0'-\theta).$
Notice that
$\frac{d}{d\theta} E(\rho \mid r_0+\theta K, r_0'-\theta K) = K \left[ \frac{\partial}{\partial r} E(\rho \mid r_0+\theta K, r_0'-\theta K) - \frac{\partial}{\partial r'} E(\rho \mid r_0+\theta K, r_0'-\theta K) \right], \qquad \frac{d}{d\theta} E(\lambda \mid l_0+\theta, l_0'-\theta) = \frac{\partial}{\partial l} E(\lambda \mid l_0+\theta, l_0'-\theta) - \frac{\partial}{\partial l'} E(\lambda \mid l_0+\theta, l_0'-\theta).$
Since $K \le 0$, these equalities together with Equations (16)–(19) yield
$\frac{d}{d\theta} E(\rho \mid r_0+\theta K, r_0'-\theta K) \le 0, \qquad (21)$
$\frac{d}{d\theta} E(\lambda \mid l_0+\theta, l_0'-\theta) \ge 0. \qquad (22)$
Therefore, we have $\frac{d}{d\theta}\,\Delta_1(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta) \le 0$.
Now, let us assume that Equation (20) holds for $N$, and consider the $N+1$ case. Through Equation (12), we have
$\Delta_{N+1}(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta) = E(\rho \mid r_0+\theta K, r_0'-\theta K)\,\Delta_N^+(r_0+\theta K+1, r_0'-\theta K, l_0+\theta, l_0'-\theta) + (1 - E(\rho \mid r_0+\theta K, r_0'-\theta K))\,\Delta_N^+(r_0+\theta K, r_0'-\theta K+1, l_0+\theta, l_0'-\theta) + E(\lambda \mid l_0+\theta, l_0'-\theta)\,\Delta_N^-(r_0+\theta K, r_0'-\theta K, l_0+\theta+1, l_0'-\theta) + (1 - E(\lambda \mid l_0+\theta, l_0'-\theta))\,\Delta_N^-(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta+1).$
Taking the derivative and collecting the product-rule terms, we obtain
$\frac{d}{d\theta}\,\Delta_{N+1}(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta) = E(\rho \mid \cdot)\,\frac{d}{d\theta}\Delta_N^+(r_0+\theta K+1, r_0'-\theta K, l_0+\theta, l_0'-\theta) + (1 - E(\rho \mid \cdot))\,\frac{d}{d\theta}\Delta_N^+(r_0+\theta K, r_0'-\theta K+1, l_0+\theta, l_0'-\theta) + E(\lambda \mid \cdot)\,\frac{d}{d\theta}\Delta_N^-(r_0+\theta K, r_0'-\theta K, l_0+\theta+1, l_0'-\theta) + (1 - E(\lambda \mid \cdot))\,\frac{d}{d\theta}\Delta_N^-(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta+1) + \frac{d}{d\theta}E(\rho \mid r_0+\theta K, r_0'-\theta K)\,\big[\Delta_N^+(r_0+\theta K+1, r_0'-\theta K, l_0+\theta, l_0'-\theta) - \Delta_N^+(r_0+\theta K, r_0'-\theta K+1, l_0+\theta, l_0'-\theta)\big] + \frac{d}{d\theta}E(\lambda \mid l_0+\theta, l_0'-\theta)\,\big[\Delta_N^-(r_0+\theta K, r_0'-\theta K, l_0+\theta+1, l_0'-\theta) - \Delta_N^-(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta+1)\big], \qquad (23)$
where $E(\rho \mid \cdot)$ and $E(\lambda \mid \cdot)$ abbreviate $E(\rho \mid r_0+\theta K, r_0'-\theta K)$ and $E(\lambda \mid l_0+\theta, l_0'-\theta)$.
Through Equation (14) (and its left-arm analogue), we can obtain that
$\Delta_N^+(r_0+\theta K+1, r_0'-\theta K, l_0+\theta, l_0'-\theta) \ge \Delta_N^+(r_0+\theta K, r_0'-\theta K+1, l_0+\theta, l_0'-\theta), \qquad (24)$
$\Delta_N^-(r_0+\theta K, r_0'-\theta K, l_0+\theta+1, l_0'-\theta) \le \Delta_N^-(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta+1). \qquad (25)$
By combining Equations (21), (22), (24) and (25), we obtain that the last two summands on the right side of Equation (23) are both nonpositive.
Then, by applying the induction hypothesis for $N$ at $(r_0+1, r_0', l_0, l_0')$, $(r_0, r_0'+1, l_0, l_0')$, $(r_0, r_0', l_0+1, l_0')$ and $(r_0, r_0', l_0, l_0'+1)$ — noting that $\Delta_N^+$ and $\Delta_N^-$ inherit the sign of the derivative of $\Delta_N$, since $x \mapsto x^+$ and $x \mapsto x^-$ are nondecreasing — we know that the first four summands on the right side of Equation (23) are all nonpositive. Hence, we now have
$\frac{d}{d\theta}\,\Delta_{N+1}(r_0+\theta K, r_0'-\theta K, l_0+\theta, l_0'-\theta) \le 0.$
Thus, Equation (20) holds for any positive integer N. □
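Lemma 1 can be probed numerically in the beta case (our illustration; the choice $K = 0$ and the grid of $\theta$ values below are arbitrary), checking that $\Delta_N$ is nonincreasing along $\theta$:

```python
# Continuing the beta-case sketch from Section 2.2 (K = 0 in Lemma 1).
r0, rp0, l0, lp0 = 2.0, 3.0, 4.0, 5.0
N = 6
vals = [delta_beta(N, r0, rp0, l0 + th, lp0 - th) for th in (0.0, 0.5, 1.0, 1.5)]
assert all(a >= b - 1e-12 for a, b in zip(vals, vals[1:]))  # nonincreasing in theta
```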
Now, we can prove the equivalence of the Statement and Berry’s conjecture:
Theorem 3.
The Statement holds if and only if Berry’s conjecture holds.
Proof. 
Assume that the Statement holds. When $\mu_R = \mu_L = \mu$, we have $E(\rho \mid l_0, l_0') = E(\lambda \mid l_0, l_0')$. If $E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0')$, then the conclusion of Berry's conjecture holds by applying the Statement directly. So consider $r_0 + r_0' \le l_0 + l_0'$ and $E(\rho \mid r_0, r_0') > E(\lambda \mid l_0, l_0')$; then there must be $r_0' < l_0'$ by Equations (16) and (17) (otherwise $r_0 \le l_0$, and Equations (16) and (17) would give $E(\rho \mid r_0, r_0') \le E(\rho \mid l_0, l_0') = E(\lambda \mid l_0, l_0')$, a contradiction). If $r_0 \ge l_0$, then $\Delta_N(r_0, r_0', l_0, l_0') \ge 0$ by applying Corollary 2. Thus, we only need to prove the case where $r_0 < l_0$ and $r_0' < l_0'$.
In that case, the possibility region of $\mu$ contains at least all pairs $(r, r')$ satisfying $r \ge r_0$ and $r' \ge r_0'$ (see Section 2.1). Let $\theta > 0$. By Equations (18) and (19), $E(\lambda \mid l_0+\theta, l_0'-\theta) \ge E(\lambda \mid l_0, l_0')$, and this expectation is continuous and nondecreasing in $\theta$. If $\theta = l_0' - r_0'$, then $\theta > 0$, $l_0 + \theta > r_0$ and $l_0' - \theta = r_0'$. Due to $\mu_R = \mu_L$ and $E(\rho \mid r_0, r_0') > E(\lambda \mid l_0, l_0')$, we have
$E(\lambda \mid l_0+\theta, l_0'-\theta) \ge E(\lambda \mid r_0, r_0') = E(\rho \mid r_0, r_0') > E(\lambda \mid l_0, l_0').$
Then, there exists $\theta^* \le l_0' - r_0'$ such that $E(\lambda \mid l_0+\theta^*, l_0'-\theta^*) = E(\rho \mid r_0, r_0')$. Since $(l_0+\theta^*, l_0'-\theta^*)$ is an interior point of the possibility region of $\mu$, we can consider $\Delta_N(r_0, r_0', l_0+\theta^*, l_0'-\theta^*)$. With Lemma 1 (applied with $K = 0$), we obtain
$\Delta_N(r_0, r_0', l_0+\theta^*, l_0'-\theta^*) \le \Delta_N(r_0, r_0', l_0, l_0').$
Since $r_0 + r_0' \le l_0 + l_0' = (l_0+\theta^*) + (l_0'-\theta^*)$ and $E(\lambda \mid l_0+\theta^*, l_0'-\theta^*) = E(\rho \mid r_0, r_0')$, the Statement gives $\Delta_N(r_0, r_0', l_0+\theta^*, l_0'-\theta^*) \ge 0$ for any $N \ge 1$. Therefore, the desired result $\Delta_N(r_0, r_0', l_0, l_0') \ge 0$ holds. The converse implication is immediate, since the Statement is a special case of Berry's conjecture. □
Theorem 3 simplifies the condition $E(\rho \mid r_0, r_0') \ge E(\lambda \mid l_0, l_0')$ in Berry's conjecture to the equality $E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0')$. Unfortunately, the Statement is still not easy to prove. In the following, we continue this discussion in two important special cases, where R and L are two-point distributions or beta distributions.

4. Two-Point Distribution Case

In this section, we consider the situation where $\mu_R = \mu_L = \tau$ and the distribution $\tau$ is a two-point distribution placing probability $\frac{1}{2}$ on each of $\tau_1$ and $\tau_2$. Without causing confusion, we omit the $\tau$ in the notation (e.g., $E(\rho \mid r_0, r_0'; \tau)$ will be written as $E(\rho \mid r_0, r_0')$). In this case, we prove that the Statement holds and, thanks to the good properties of $\Delta_N$, obtain a conclusion more complete than Berry's conjecture. This is consistent with the conclusion that Berry obtained by discussing the contours of $\Delta_N$ in the $(r_0, r_0')$ plane.
In the following discussion, let $0 < \tau_1 < \tau_2 < 1$, so that all pairs $(r, r')$ in the plane are in the possibility region.
To prove the Statement, we first find the points in the possibility region where the expected values of $\rho$ and $\lambda$ are equal, i.e., where
$E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0'). \qquad (26)$
Due to Equations (7) and (8), Equation (26) is equivalent to
$\frac{\tau_1^{r_0+1}(1-\tau_1)^{r_0'} + \tau_2^{r_0+1}(1-\tau_2)^{r_0'}}{\tau_1^{r_0}(1-\tau_1)^{r_0'} + \tau_2^{r_0}(1-\tau_2)^{r_0'}} = \frac{\tau_1^{l_0+1}(1-\tau_1)^{l_0'} + \tau_2^{l_0+1}(1-\tau_2)^{l_0'}}{\tau_1^{l_0}(1-\tau_1)^{l_0'} + \tau_2^{l_0}(1-\tau_2)^{l_0'}}, \qquad (27)$
which holds if and only if
$(\tau_2 - \tau_1)\left[ \tau_1^{l_0}(1-\tau_1)^{l_0'}\,\tau_2^{r_0}(1-\tau_2)^{r_0'} - \tau_1^{r_0}(1-\tau_1)^{r_0'}\,\tau_2^{l_0}(1-\tau_2)^{l_0'} \right] = 0. \qquad (28)$
Since $0 < \tau_1 < \tau_2 < 1$, Equation (28) is equivalent to
$\tau_2^{r_0-l_0}(1-\tau_2)^{r_0'-l_0'} = \tau_1^{r_0-l_0}(1-\tau_1)^{r_0'-l_0'}. \qquad (29)$
Taking logarithms, we obtain
$(r_0-l_0)\log\tau_2 + (r_0'-l_0')\log(1-\tau_2) = (r_0-l_0)\log\tau_1 + (r_0'-l_0')\log(1-\tau_1). \qquad (30)$
Recall that for $N_R = r_0 + r_0'$ and $N_L = l_0 + l_0'$,
$r_0' - l_0' = N_R - N_L - (r_0 - l_0).$
Using Equations (29) and (30), we obtain that the relationship between $l_0$ and $r_0$ making Equation (26) hold is
$r_0 = l_0 - C_\tau (N_R - N_L), \qquad (31)$
where $C_\tau = \dfrac{\log(1-\tau_2) - \log(1-\tau_1)}{\log\tau_2 - \log\tau_1 + \log(1-\tau_1) - \log(1-\tau_2)}$; note that $C_\tau \in (-1, 0)$.
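As a numerical sanity check of the equivalence between Equations (26) and (31) (our illustration; the values of $\tau_1$, $\tau_2$ and the state below are arbitrary), one can compute $C_\tau$ and compare the two posterior means using the two_point_posterior sketch from Section 2.1:

```python
import math

def c_tau(tau1, tau2):
    """The constant C_tau of Equation (31); it always lies in (-1, 0)."""
    num = math.log(1 - tau2) - math.log(1 - tau1)
    den = (math.log(tau2) - math.log(tau1)
           + math.log(1 - tau1) - math.log(1 - tau2))
    return num / den

tau1, tau2 = 0.3, 0.6
l0, NL, NR = 2.0, 4.0, 6.0
r0 = l0 - c_tau(tau1, tau2) * (NR - NL)      # Equation (31)
_, _, mean_r = two_point_posterior(r0, NR - r0, tau1, tau2)
_, _, mean_l = two_point_posterior(l0, NL - l0, tau1, tau2)
assert abs(mean_r - mean_l) < 1e-12          # Equation (26) holds
```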
Next, we will show that $\Delta_N$ has a strong symmetry property when $\mu_R = \mu_L = \tau$. Note that since the possibility region of $\tau$ is the whole plane, the numbers $m$ and $n$ in Theorem 4 can be arbitrary, positive or negative:
Theorem 4.
If $\mu_R = \mu_L = \tau$, then for any positive integer $N$ and any numbers $l_0$, $m$ and $n$, we have
$\Delta_N(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) + \Delta_N(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m,\ N_L - l_0 + n) = 0. \qquad (32)$
Proof. 
We will use mathematical induction to prove this theorem.
When $N = 1$, for any $m$ and $n$, we have
$\Delta_1(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) + \Delta_1(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m,\ N_L - l_0 + n) = E(\rho \mid l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n) - E(\lambda \mid l_0,\ N_L - l_0) + E(\rho \mid l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L)) - E(\lambda \mid l_0 + m,\ N_L - l_0 + n). \qquad (33)$
By the equivalence between Equations (26) and (31), we have
$E(\rho \mid l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L)) = E(\lambda \mid l_0,\ N_L - l_0). \qquad (34)$
Letting $\tilde{l}_0 = l_0 + m$, $\tilde{N}_R = N_R + m + n$ and $\tilde{N}_L = N_L + m + n$, and applying Equation (34) to $\tilde{l}_0$, $\tilde{N}_R$ and $\tilde{N}_L$, we obtain
$E(\rho \mid l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n) = E(\lambda \mid l_0 + m,\ N_L - l_0 + n). \qquad (35)$
Equation (33), together with Equations (34) and (35), yields
$\Delta_1(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) + \Delta_1(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m,\ N_L - l_0 + n) = 0.$
Then, Equation (32) has been proven for the case where N = 1 .
Now, assume that Equation (32) holds for $N$. For any $N_R$, $N_L$ and any numbers $l_0$, $m$ and $n$, we need to prove that Equation (32) also holds for $N+1$.
Consider the $N+1$ case. Using the recursive Equation (12), we have
$\Delta_{N+1}(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) + \Delta_{N+1}(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m,\ N_L - l_0 + n)$
$= E(\rho \mid l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n) \times \Delta_N^+(l_0 - C_\tau(N_R-N_L) + m + 1,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0)$
$+ (1 - E(\rho \mid l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n)) \times \Delta_N^+(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n + 1,\ l_0,\ N_L - l_0)$
$+ E(\lambda \mid l_0,\ N_L - l_0) \times \Delta_N^-(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0 + 1,\ N_L - l_0)$
$+ (1 - E(\lambda \mid l_0,\ N_L - l_0)) \times \Delta_N^-(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0 + 1)$
$+ E(\rho \mid l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L)) \times \Delta_N^+(l_0 - C_\tau(N_R-N_L) + 1,\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m,\ N_L - l_0 + n)$
$+ (1 - E(\rho \mid l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L))) \times \Delta_N^+(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L) + 1,\ l_0 + m,\ N_L - l_0 + n)$
$+ E(\lambda \mid l_0 + m,\ N_L - l_0 + n) \times \Delta_N^-(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m + 1,\ N_L - l_0 + n)$
$+ (1 - E(\lambda \mid l_0 + m,\ N_L - l_0 + n)) \times \Delta_N^-(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m,\ N_L - l_0 + n + 1). \qquad (36)$
For $N_R$, $N_L$, $m+1$, $n$ and $l_0$, Equation (32) implies
$\Delta_N(l_0 - C_\tau(N_R-N_L) + m + 1,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) + \Delta_N(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m + 1,\ N_L - l_0 + n) = 0.$
Thus, if $\Delta_N(l_0 - C_\tau(N_R-N_L) + m + 1,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) \ge 0$, then there must be $\Delta_N(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m + 1,\ N_L - l_0 + n) \le 0$. Hence, we obtain
$\Delta_N^+(l_0 - C_\tau(N_R-N_L) + m + 1,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) + \Delta_N^-(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m + 1,\ N_L - l_0 + n) = 0. \qquad (37)$
If instead $\Delta_N(l_0 - C_\tau(N_R-N_L) + m + 1,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) \le 0$, then $\Delta_N(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m + 1,\ N_L - l_0 + n) \ge 0$, and thus Equation (37) also holds, with both terms equal to zero.
With Equations (35) and (37), we obtain that the sum of the first and seventh summands on the right side of Equation (36) is zero.
Similarly, for $N_R + 1$, $N_L + 1$, $m - 1$, $n$ and $l_0 + 1$, Equation (32) implies
$\Delta_N(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0 + 1,\ N_L - l_0) + \Delta_N(l_0 - C_\tau(N_R-N_L) + 1,\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m,\ N_L - l_0 + n) = 0.$
Together with Equation (34), we obtain that the sum of the third and fifth summands on the right side of Equation (36) is also zero.
Using similar techniques (and the corresponding sign arguments for $\Delta_N^\pm$), we find that the sum of the second and eighth summands on the right side of Equation (36) and the sum of the fourth and sixth summands on the right side of Equation (36) are both zero. Hence, we have proven that
$\Delta_{N+1}(l_0 - C_\tau(N_R-N_L) + m,\ N_R - l_0 + C_\tau(N_R-N_L) + n,\ l_0,\ N_L - l_0) + \Delta_{N+1}(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0 + m,\ N_L - l_0 + n) = 0.$
In other words, Equation (32) holds for N + 1 . The theorem is proven by induction. □
Based on Theorem 4, we can conclude that the Statement holds for $\mu_R = \mu_L = \tau$. The following corollary shows that $\Delta_N(r_0, r_0', l_0, l_0') = 0$ when $E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0')$, which is a conclusion stronger than our Statement:
Corollary 3.
Provided $\mu_R = \mu_L = \tau$, for any positive integer $N$ and any real numbers $l_0$, $N_R$ and $N_L$, we have
$\Delta_N(l_0 - C_\tau(N_R-N_L),\ N_R - l_0 + C_\tau(N_R-N_L),\ l_0,\ N_L - l_0) = 0.$
Proof. 
The corollary can be deduced from Theorem 4 for m = n = 0 . □
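Corollary 3 can also be checked numerically by combining the sketches above (again, an illustration of ours rather than part of the proof): along the curve of Equation (31), the computed $\Delta_N$ vanishes up to floating-point error.

```python
# Reuse two_point_posterior, make_delta, and the state (tau1, tau2, r0, l0, NR, NL)
# from the preceding sketches.
mean_tau = lambda a, b: two_point_posterior(a, b, tau1, tau2)[2]
delta_tau = make_delta(mean_tau, mean_tau)

for N in (1, 2, 5, 8):
    d = delta_tau(N, r0, NR - r0, l0, NL - l0)
    assert abs(d) < 1e-9, (N, d)   # Corollary 3: Delta_N = 0 on the curve
```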
Note that $N_R \le N_L$ is not needed in Theorem 4 or Corollary 3, and we can obtain the following result, which is stronger than Berry's conjecture and consistent with Theorem 8.3 in [5].
Theorem 5.
Provided $\mu_R = \mu_L = \tau$, for any positive integer $N$ and any real numbers $r_0, r_0', l_0$ and $l_0'$, we have
$\Delta_N(r_0, r_0', l_0, l_0') \ge 0$ if $E(\rho \mid r_0, r_0') \ge E(\lambda \mid l_0, l_0')$; $\qquad \Delta_N(r_0, r_0', l_0, l_0') \le 0$ if $E(\rho \mid r_0, r_0') \le E(\lambda \mid l_0, l_0')$.
Proof. 
If $E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0')$, then by the equivalence between Equations (26) and (31) there is $r_0 = l_0 - C_\tau(N_R-N_L)$ and $r_0' = N_R - l_0 + C_\tau(N_R-N_L)$, where $N_R = r_0 + r_0'$ and $N_L = l_0 + l_0'$. Therefore, by Corollary 3, we have $\Delta_N(r_0, r_0', l_0, l_0') = 0$.
Consider the case $E(\rho \mid r_0, r_0') > E(\lambda \mid l_0, l_0')$. If $r_0 + r_0' \le l_0 + l_0'$, then with Theorem 3 we have $\Delta_N(r_0, r_0', l_0, l_0') \ge 0$. If $r_0 + r_0' > l_0 + l_0'$, then by the equivalence between Equations (26) and (31) there is
$E(\lambda \mid r_0 + C_\tau(N_R-N_L),\ N_L - r_0 - C_\tau(N_R-N_L)) = E(\rho \mid r_0, r_0') > E(\lambda \mid l_0, l_0'). \qquad (38)$
Notice that
$r_0 + C_\tau(N_R-N_L) + N_L - r_0 - C_\tau(N_R-N_L) = N_L = l_0 + l_0',$
and with Equations (18) and (19), we obtain $l_0 < r_0 + C_\tau(N_R-N_L)$ and $l_0' > N_L - r_0 - C_\tau(N_R-N_L)$. Let $\theta = r_0 + C_\tau(N_R-N_L) - l_0$. Then, we have $\theta > 0$, and Equation (38) becomes
$E(\lambda \mid l_0+\theta, l_0'-\theta) = E(\rho \mid r_0, r_0') > E(\lambda \mid l_0, l_0').$
By applying Lemma 1 with $K = 0$ and Corollary 3, there is
$\Delta_N(r_0, r_0', l_0, l_0') \ge \Delta_N(r_0, r_0', l_0+\theta, l_0'-\theta) = 0.$
Therefore, we have $\Delta_N(r_0, r_0', l_0, l_0') \ge 0$ whenever $E(\rho \mid r_0, r_0') \ge E(\lambda \mid l_0, l_0')$. The case $E(\rho \mid r_0, r_0') \le E(\lambda \mid l_0, l_0')$ can be proven by a similar method. □

5. Beta Distribution Case

In this section, we consider the case where the prior distribution $I_0 = (R, L)$ has the special structure $\mu_R = \mu_L = \beta$, where $\beta$ is defined by Equation (9). Then, R and L are both beta distributions.
Although we believe that the Statement holds in this case, only a partial result is obtained. In this case, the expectations of the parameters $\rho$ and $\lambda$ are $E(\rho \mid r_0, r_0') = \frac{r_0}{r_0+r_0'}$ and $E(\lambda \mid l_0, l_0') = \frac{l_0}{l_0+l_0'}$, which are increasing in $r_0$ and $l_0$ and decreasing in $r_0'$ and $l_0'$, respectively. Therefore, $E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0')$ if and only if $\frac{r_0}{r_0'} = \frac{l_0}{l_0'}$.
The following theorem shows that $\Delta_N(r_0, r_0', l_0, l_0') \ge 0$ holds for $N \le \frac{l_0}{l_0'} + 1$ when $E(\rho \mid r_0, r_0') = E(\lambda \mid l_0, l_0')$:
Theorem 6.
If $\mu_R = \mu_L = \beta$, $0 < \varphi \le 1$ and $l_0, l_0' > 0$, and if for a fixed positive integer $N^*$ we have $\frac{l_0}{l_0'} \ge N^* - 1$, then for any integer $0 < N \le N^*$, there is $\Delta_N(\varphi l_0, \varphi l_0', l_0, l_0') \ge 0$.
Proof. 
Using mathematical induction again, we prove that if $\frac{l_0}{l_0'} \ge N^* - 1$ and $0 < N \le N^*$, then the following two inequalities hold:
$\Delta_N(\varphi l_0, \varphi l_0', l_0, l_0') \ge 0, \qquad (42)$
and for any integer $1 \le m \le N^* + 1 - N$,
$\Delta_N^+(\varphi l_0 + m, \varphi l_0', l_0, l_0') + \Delta_N^-(\varphi l_0, \varphi l_0', l_0 + m, l_0') \ge 0. \qquad (43)$
Through the exact expression (13) for $\Delta_1$, it is easy to verify the above inequalities for $N = 1$:
$\Delta_1(\varphi l_0, \varphi l_0', l_0, l_0') = E(\rho \mid \varphi l_0, \varphi l_0') - E(\lambda \mid l_0, l_0') = \frac{\varphi l_0}{\varphi l_0 + \varphi l_0'} - \frac{l_0}{l_0 + l_0'} = 0.$
For any $m \ge 0$, we have
$\Delta_1(\varphi l_0 + m, \varphi l_0', l_0, l_0') = E(\rho \mid \varphi l_0 + m, \varphi l_0') - E(\lambda \mid l_0, l_0') = \frac{\varphi l_0 + m}{\varphi l_0 + m + \varphi l_0'} - \frac{l_0}{l_0 + l_0'} \ge 0,$
$\Delta_1(\varphi l_0, \varphi l_0', l_0 + m, l_0') = E(\rho \mid \varphi l_0, \varphi l_0') - E(\lambda \mid l_0 + m, l_0') = \frac{\varphi l_0}{\varphi l_0 + \varphi l_0'} - \frac{l_0 + m}{l_0 + m + l_0'} \le 0.$
Since $0 < \varphi \le 1$, we have
$\Delta_1^+(\varphi l_0 + m, \varphi l_0', l_0, l_0') + \Delta_1^-(\varphi l_0, \varphi l_0', l_0 + m, l_0') = \Delta_1(\varphi l_0 + m, \varphi l_0', l_0, l_0') + \Delta_1(\varphi l_0, \varphi l_0', l_0 + m, l_0') = \frac{\varphi l_0 + m}{\varphi l_0 + \varphi l_0' + m} - \frac{l_0}{l_0 + l_0'} + \frac{\varphi l_0}{\varphi l_0 + \varphi l_0'} - \frac{l_0 + m}{l_0 + l_0' + m} = \frac{\varphi l_0 + m}{\varphi(l_0 + l_0') + m} - \frac{l_0 + m}{l_0 + l_0' + m} \ge 0.$
Hence, Equations (42) and (43) hold for $N = 1$.
Now, we assume that Equations (42) and (43) hold for some $1 \le N \le N^* - 1$, and we will show that they also hold for $N + 1$. Note that when we consider $N + 1$ pulls, the condition on $m$ becomes $1 \le m \le N^* - N$.
First, it can be deduced from Equation (12) that
$\Delta_{N+1}(\varphi l_0, \varphi l_0', l_0, l_0') = E(\rho \mid \varphi l_0, \varphi l_0')\,\Delta_N^+(\varphi l_0 + 1, \varphi l_0', l_0, l_0') + (1 - E(\rho \mid \varphi l_0, \varphi l_0'))\,\Delta_N^+(\varphi l_0, \varphi l_0' + 1, l_0, l_0') + E(\lambda \mid l_0, l_0')\,\Delta_N^-(\varphi l_0, \varphi l_0', l_0 + 1, l_0') + (1 - E(\lambda \mid l_0, l_0'))\,\Delta_N^-(\varphi l_0, \varphi l_0', l_0, l_0' + 1)$
$= \frac{l_0}{l_0 + l_0'}\left[ \Delta_N^+(\varphi l_0 + 1, \varphi l_0', l_0, l_0') + \Delta_N^-(\varphi l_0, \varphi l_0', l_0 + 1, l_0') \right] + \frac{l_0'}{l_0 + l_0'}\left[ \Delta_N^+(\varphi l_0, \varphi l_0' + 1, l_0, l_0') + \Delta_N^-(\varphi l_0, \varphi l_0', l_0, l_0' + 1) \right].$
Using the assumption for $N$ with $m = 1$, we have
$\Delta_N(\varphi l_0, \varphi l_0', l_0, l_0') \ge 0, \qquad \Delta_N^+(\varphi l_0 + 1, \varphi l_0', l_0, l_0') + \Delta_N^-(\varphi l_0, \varphi l_0', l_0 + 1, l_0') \ge 0.$
Since $E(\lambda \mid l_0, l_0' + 1 + N - 1) < E(\lambda \mid l_0, l_0' + N - 1)$, by using Remark 2 we can obtain
$\Delta_N(\varphi l_0, \varphi l_0', l_0, l_0' + 1) \ge \Delta_N(\varphi l_0, \varphi l_0', l_0, l_0') \ge 0.$
Hence, $\Delta_N^-(\varphi l_0, \varphi l_0', l_0, l_0' + 1) = 0$, and consequently Equation (42) holds for $N + 1$; in other words, we have
$\Delta_{N+1}(\varphi l_0, \varphi l_0', l_0, l_0') \ge 0.$
Next, we will prove that Equation (43) also holds for $N + 1$. For any $1 \le m \le N^* - N$, because $E(\rho \mid \varphi l_0 + m + N, \varphi l_0') > E(\rho \mid \varphi l_0 + N, \varphi l_0')$, by using Theorem 1 we obtain
$\Delta_{N+1}(\varphi l_0 + m, \varphi l_0', l_0, l_0') \ge \Delta_{N+1}(\varphi l_0, \varphi l_0', l_0, l_0') \ge 0.$
Then, there is $\Delta_{N+1}^+(\varphi l_0 + m, \varphi l_0', l_0, l_0') = \Delta_{N+1}(\varphi l_0 + m, \varphi l_0', l_0, l_0')$.
If $\Delta_{N+1}(\varphi l_0, \varphi l_0', l_0 + m, l_0') \ge 0$, then the inequality in Equation (43) is proven. If $\Delta_{N+1}(\varphi l_0, \varphi l_0', l_0 + m, l_0') < 0$, then with Equation (12) we have
$\Delta_{N+1}^+(\varphi l_0 + m, \varphi l_0', l_0, l_0') + \Delta_{N+1}^-(\varphi l_0, \varphi l_0', l_0 + m, l_0') = \Delta_{N+1}(\varphi l_0 + m, \varphi l_0', l_0, l_0') + \Delta_{N+1}(\varphi l_0, \varphi l_0', l_0 + m, l_0')$
$= \frac{\varphi l_0 + m}{\varphi l_0 + \varphi l_0' + m}\,\Delta_N^+(\varphi l_0 + m + 1, \varphi l_0', l_0, l_0') + \frac{\varphi l_0'}{\varphi l_0 + \varphi l_0' + m}\,\Delta_N^+(\varphi l_0 + m, \varphi l_0' + 1, l_0, l_0')$
$+ \frac{l_0}{l_0 + l_0'}\left[ \Delta_N^-(\varphi l_0 + m, \varphi l_0', l_0 + 1, l_0') + \Delta_N^+(\varphi l_0 + 1, \varphi l_0', l_0 + m, l_0') \right]$
$+ \frac{l_0'}{l_0 + l_0'}\left[ \Delta_N^-(\varphi l_0 + m, \varphi l_0', l_0, l_0' + 1) + \Delta_N^+(\varphi l_0, \varphi l_0' + 1, l_0 + m, l_0') \right]$
$+ \frac{l_0 + m}{l_0 + l_0' + m}\,\Delta_N^-(\varphi l_0, \varphi l_0', l_0 + m + 1, l_0') + \frac{l_0'}{l_0 + l_0' + m}\,\Delta_N^-(\varphi l_0, \varphi l_0', l_0 + m, l_0' + 1). \qquad (44)$
Using the fact that $m \le N^* - N$, we have $m + 1 \le N^* + 1 - N$. Thus, we can apply Equation (43) for $N$ pulls with $m + 1$ and obtain
$\Delta_N^+(\varphi l_0 + m + 1, \varphi l_0', l_0, l_0') + \Delta_N^-(\varphi l_0, \varphi l_0', l_0 + m + 1, l_0') \ge 0.$
Since $0 < \varphi \le 1$, the sum of the first and fifth summands on the right side of Equation (44) satisfies
$\frac{\varphi l_0 + m}{\varphi l_0 + \varphi l_0' + m}\,\Delta_N^+(\varphi l_0 + m + 1, \varphi l_0', l_0, l_0') + \frac{l_0 + m}{l_0 + l_0' + m}\,\Delta_N^-(\varphi l_0, \varphi l_0', l_0 + m + 1, l_0') \ge \frac{l_0 + m}{l_0 + l_0' + m}\left[ \Delta_N^+(\varphi l_0 + m + 1, \varphi l_0', l_0, l_0') + \Delta_N^-(\varphi l_0, \varphi l_0', l_0 + m + 1, l_0') \right] \ge 0.$
Since $m \ge 1 \ge \varphi$ and $E(\rho \mid \varphi l_0 + m, \varphi l_0') \ge E(\rho \mid \varphi(l_0 + 1), \varphi l_0')$, we can obtain by Theorem 1 that
$\Delta_N(\varphi l_0 + m, \varphi l_0', l_0 + 1, l_0') \ge \Delta_N(\varphi(l_0 + 1), \varphi l_0', l_0 + 1, l_0').$
Due to the fact that $\frac{l_0 + 1}{l_0'} > \frac{l_0}{l_0'} \ge N^* - 1$, we can apply the assumption for $N$ pulls and obtain $\Delta_N(\varphi(l_0 + 1), \varphi l_0', l_0 + 1, l_0') \ge 0$. Consequently, we have
$\Delta_N^-(\varphi l_0 + m, \varphi l_0', l_0 + 1, l_0') = 0.$
Therefore, the third summand on the right side of Equation (44) is nonnegative.
By using Theorem 1 twice, we obtain
$\Delta_N(\varphi l_0 + m, \varphi l_0', l_0, l_0' + 1) \ge \Delta_N(\varphi l_0, \varphi l_0', l_0, l_0') \ge 0.$
Hence, $\Delta_N^-(\varphi l_0 + m, \varphi l_0', l_0, l_0' + 1) = 0$, and the fourth summand on the right side of Equation (44) is nonnegative.
Now, we only need to consider $\Delta_N(\varphi l_0, \varphi l_0', l_0 + m, l_0' + 1)$. As we know, $m \le N^* - N \le N^* - 1 \le \frac{l_0}{l_0'}$. It follows from Remark 2 that
$\Delta_N(\varphi l_0, \varphi l_0', l_0 + m, l_0' + 1) \ge \Delta_N(\varphi l_0, \varphi l_0', l_0 + \tfrac{l_0}{l_0'}, l_0' + 1).$
Let $\tilde{l}_0 = l_0(1 + \frac{1}{l_0'})$, $\tilde{l}_0' = l_0'(1 + \frac{1}{l_0'})$ and $\tilde{\varphi} = \varphi\,\frac{l_0'}{1 + l_0'}$. Then, $0 < \tilde{\varphi} \le 1$ and $\frac{\tilde{l}_0}{\tilde{l}_0'} = \frac{l_0}{l_0'} \ge N^* - 1$ still hold. Hence, the assumption for $N$ can be applied, and we obtain
$\Delta_N(\varphi l_0, \varphi l_0', l_0 + \tfrac{l_0}{l_0'}, l_0' + 1) = \Delta_N(\tilde{\varphi}\tilde{l}_0, \tilde{\varphi}\tilde{l}_0', \tilde{l}_0, \tilde{l}_0') \ge 0.$
Then, we obtain $\Delta_N^-(\varphi l_0, \varphi l_0', l_0 + m, l_0' + 1) = 0$, so the sixth summand on the right side of Equation (44) is also nonnegative; the remaining second summand is nonnegative since $\Delta_N^+ \ge 0$. Thus, all summands on the right side of Equation (44) are nonnegative, and Equation (43) holds for $N + 1$; that is, for any $1 \le m \le N^* - N$, we have
$\Delta_{N+1}^+(\varphi l_0 + m, \varphi l_0', l_0, l_0') + \Delta_{N+1}^-(\varphi l_0, \varphi l_0', l_0 + m, l_0') \ge 0.$
Then, the theorem is proven by mathematical induction. □
The Statement is partially proven in Theorem 6, and hence we can deduce a partial result about Berry's conjecture.
Consider $I_0 = (r_0, r_0', \beta;\ l_0, l_0', \beta)$, where $r_0, r_0', l_0$ and $l_0'$ are positive real numbers satisfying $r_0 + r_0' \le l_0 + l_0'$ and $\frac{r_0}{r_0'} \ge \frac{l_0}{l_0'}$. Using the properties of the beta distribution,
$E(\rho \mid r_0, r_0') = \frac{r_0}{r_0 + r_0'} \ge \frac{l_0}{l_0 + l_0'} = E(\lambda \mid l_0, l_0').$
For $\theta \ge 0$ and $K \le 0$ such that $r_0 > -\theta K$ and $l_0' > \theta$, applying Lemma 1 gives
$\Delta_N(r_0 + \theta K, r_0' - \theta K, l_0 + \theta, l_0' - \theta) \le \Delta_N(r_0, r_0', l_0, l_0') \quad \text{for any } N.$
In order to use Theorem 6, we require
$\frac{r_0 + \theta K}{r_0' - \theta K} = \frac{l_0 + \theta}{l_0' - \theta},$
which is equivalent to $r_0 l_0' - r_0' l_0 = \theta\left[ r_0 + r_0' - K(l_0 + l_0') \right]$. To make $\theta$ as large as possible, we choose $K = 0$ and $\theta = \frac{r_0 l_0' - r_0' l_0}{r_0 + r_0'}$. Now, let
$N^* = \left\lfloor \frac{l_0 + \theta}{l_0' - \theta} \right\rfloor + 1 = \left\lfloor \frac{r_0}{r_0'} \right\rfloor + 1.$
We obtain the following result by using Theorem 6:
Theorem 7.
Let $I_0 = (r_0, r_0', \beta;\ l_0, l_0', \beta)$, where $r_0, r_0', l_0$ and $l_0'$ are positive real numbers satisfying $r_0 + r_0' \le l_0 + l_0'$ and $\frac{r_0}{r_0'} \ge \frac{l_0}{l_0'}$, and let $N^* = \lfloor \frac{r_0}{r_0'} \rfloor + 1$. Then, for $1 \le N \le N^*$, there is
$\Delta_N(r_0, r_0', l_0, l_0') \ge 0.$
Proof. 
Let $\theta = \frac{r_0 l_0' - r_0' l_0}{r_0 + r_0'}$ and $\varphi = \frac{r_0}{l_0 + \theta}$. If $r_0 \ge l_0$, then $\Delta_N(r_0, r_0', l_0, l_0') \ge 0$ by using Corollary 2.
If $r_0 < l_0$, then we know that $0 < \varphi < 1$, and using the fact that $\frac{l_0 + \theta}{l_0' - \theta} \ge N^* - 1$, we obtain from Theorem 6 that for $N \le N^*$,
$\Delta_N(r_0, r_0', l_0 + \theta, l_0' - \theta) = \Delta_N(\varphi(l_0 + \theta), \varphi(l_0' - \theta), l_0 + \theta, l_0' - \theta) \ge 0.$
It can be deduced from Lemma 1 that $\Delta_N(r_0, r_0', l_0, l_0') \ge \Delta_N(r_0, r_0', l_0 + \theta, l_0' - \theta) \ge 0$. Therefore, the proof is completed. □
Obviously $N^* > 1$, and thus Berry's conjecture always holds for $N = 1$. However, to use Theorem 7 for large values of $N$, we need $N^*$ to be large enough, which means that $\frac{r_0}{r_0'}$ must be very large. This is consistent with the intuitive impression that the greater the expectation of an arm, the greater its advantage in the choice.
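Continuing the beta-case sketch from Section 2.2, the bound of Theorem 7 can be exercised numerically (our illustration; the state below is an arbitrary admissible choice):

```python
import math

# r0 + r0' <= l0 + l0' and r0/r0' >= l0/l0', as Theorem 7 requires.
r0, rp0, l0, lp0 = 3.0, 1.0, 3.0, 1.5
N_star = math.floor(r0 / rp0) + 1            # N* = 4 here

for N in range(1, N_star + 1):
    d = delta_beta(N, r0, rp0, l0, lp0)      # delta_beta from Section 2.2
    assert d >= -1e-12, (N, d)               # Theorem 7: Delta_N >= 0 for N <= N*
```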

6. Conclusions

As we mentioned in the introduction, the bandit model addresses the trade-off between exploration and exploitation, and Berry's conjecture is an intuitive conjecture about this trade-off. In this paper, we reveal the essence of Berry's conjecture by proving its equivalence with our Statement. The Statement is easier to verify than Berry's conjecture and thus provides a new route toward proving it. We also proved that Berry's conjecture holds in two specific models, which can speed up the computation of optimal strategies. We believe that Berry's conjecture is correct for most positive measures $\mu$. In the future, we will apply our Statement to prove Berry's conjecture for other specific models.

Author Contributions

Conceptualization, P.W. and J.Z.; methodology, P.W. and J.Z.; validation, J.Z.; formal analysis, P.W. and J.Z.; writing—original draft preparation, J.Z.; writing—review and editing, P.W. and J.Z.; supervision, P.W.; funding acquisition, P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (No. 2018YFA0703900) and the Natural Science Foundation of Shandong Province (Nos. ZR2021MA098 and ZR2019ZD41).

Data Availability Statement

Data sharing not applicable to this article as no datasets were generated or analysed during the current (theoretical) study.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Thompson, W.R. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika 1933, 25, 285.
  2. Rothschild, M. A two-armed bandit theory of market pricing. J. Econ. Theory 1974, 9, 185–202.
  3. Liberali, G.B.; Hauser, J.R.; Urban, G.L. Morphing Theory and Applications. In International Series in Operations Research & Management Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 531–562.
  4. Aggarwal, C.C. Recommender Systems; Springer International Publishing: Berlin/Heidelberg, Germany, 2016.
  5. Berry, D.A. A Bernoulli Two-armed Bandit. Ann. Math. Stat. 1972, 43, 871–897.
  6. Berry, D.A.; Chen, R.W.; Zame, A.; Heath, D.C.; Shepp, L.A. Bandit problems with infinitely many arms. Ann. Stat. 1997, 25, 2103–2116.
  7. Lin, C.T.; Shiau, C.J. Some Optimal Strategies for Bandit Problems with Beta Prior Distributions. Ann. Inst. Stat. Math. 2000, 52, 397–405.
  8. Steyvers, M.; Lee, M.D.; Wagenmakers, E.J. A Bayesian analysis of human decision-making on bandit problems. J. Math. Psychol. 2009, 53, 168–179.
  9. Jacko, P. The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Multidisciplinary Survey of the History, State of the Art, and Myths; Working Paper 3; Lancaster University Management School: Lancaster, UK, 2019.
  10. Joshi, V.M. A Conjecture of Berry Regarding A Bernoulli Two-Armed Bandit. Ann. Stat. 1975, 3, 189–202; Correction in Ann. Stat. 1985, 13, 1249.
  11. Yue, J.C. Generalized two-stage bandit problem. Commun. Stat.-Theory Methods 1999, 28, 2261–2276.
  12. Gittins, J.; Jones, D. A Dynamic Allocation Index for the Sequential Design of Experiments. In Progress in Statistics; Gani, J., Ed.; North-Holland: Amsterdam, The Netherlands, 1974; pp. 241–266.
  13. Whittle, P. Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 1988, 25, 287–298.
  14. Weber, R.R.; Weiss, G. On an index policy for restless bandits. J. Appl. Probab. 1990, 27, 637–648.
  15. Ahmad, S.H.A.; Liu, M.; Javidi, T.; Zhao, Q.; Krishnamachari, B. Optimality of Myopic Sensing in Multichannel Opportunistic Access. IEEE Trans. Inf. Theory 2009, 55, 4040–4050.
  16. Deo, S.; Iravani, S.; Jiang, T.; Smilowitz, K.; Samuelson, S. Improving Health Outcomes Through Better Capacity Allocation in a Community-Based Chronic Care Model. Oper. Res. 2013, 61, 1277–1294.
  17. Gittins, J.; Glazebrook, K.; Weber, R. Multi-Armed Bandit Allocation Indices; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2011.
  18. Lee, E.; Lavieri, M.S.; Volk, M. Optimal Screening for Hepatocellular Carcinoma: A Restless Bandit Model. Manuf. Serv. Oper. Manag. 2019, 21, 198–212.
  19. Mahajan, A.; Teneketzis, D. Multi-Armed Bandit Problems. In Foundations and Applications of Sensor Management; Springer US: New York, NY, USA, 2008; pp. 121–151.
  20. Washburn, R.B. Application of Multi-Armed Bandits to Sensor Management. In Foundations and Applications of Sensor Management; Springer US: New York, NY, USA, 2008; pp. 153–175.
  21. Gast, N.; Gaujal, B.; Khun, K. Computing Whittle (and Gittins) Index in Subcubic Time. arXiv 2022, arXiv:2203.05207.
  22. Niño-Mora, J. A (2/3)n³ Fast-Pivoting Algorithm for the Gittins Index and Optimal Stopping of a Markov Chain. INFORMS J. Comput. 2007, 19, 596–606.
  23. Berry, D.A.; Fristedt, B. Bandit Problems: Sequential Allocation of Experiments; Springer: Dordrecht, The Netherlands, 1985.
  24. Gittins, J.; Wang, Y.G. The Learning Component of Dynamic Allocation Indices. Ann. Stat. 1992, 20, 1625–1636.
  25. Lai, T.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 1985, 6, 4–22.
  26. Agrawal, R. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 1995, 27, 1054–1078.
  27. Audibert, J.Y.; Bubeck, S. Regret Bounds and Minimax Policies under Partial Monitoring. J. Mach. Learn. Res. 2010, 11, 2785–2836.
  28. Audibert, J.Y.; Munos, R.; Szepesvári, C. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci. 2009, 410, 1876–1902.
  29. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn. 2002, 47, 235–256.
  30. Cappé, O.; Garivier, A.; Maillard, O.A.; Munos, R.; Stoltz, G. Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Stat. 2013, 41, 1516–1541.
  31. Honda, J.; Takemura, A. An Asymptotically Optimal Bandit Algorithm for Bounded Support Models. In Proceedings of COLT 2010, Haifa, Israel, 27–29 June 2010; pp. 67–79.
  32. Lai, T.L. Adaptive Treatment Allocation and the Multi-Armed Bandit Problem. Ann. Stat. 1987, 15, 1091–1114.
  33. Kaufmann, E. On Bayesian index policies for sequential resource allocation. Ann. Stat. 2018, 46, 842–865.