Abstract
Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario where reward distributions remain constant for a certain period and change at unknown time steps. Although Thompson sampling (TS) has shown success in non-stationary settings, there is currently no regret bound analysis for TS with uninformative priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound for the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret upper bounds of both algorithms are , where T is the time horizon and is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.
1. Introduction
The multi-armed bandit (MAB) problem is a classic sequential decision problem. At each time step, the learner selects an arm (also known as an action) from a finite set based on its past observations and only observes the reward of the chosen arm. The learner's goal is to maximize its expected cumulative reward or minimize the regret incurred during the learning process. The regret is defined as the difference between the expected reward of the optimal arm and the expected reward achieved by the MAB algorithm.
MAB has found practical use in various scenarios, with one of the earliest applications being the diagnosis and treatment experiments proposed by Robbins [1]. In this experiment, each patient’s treatment plan corresponds to an arm in the MAB problem, and the goal is to minimize the patient’s health loss by making optimal treatment decisions. Recently, MAB has gained wide-ranging applicability. For example, MAB algorithms have been used in online recommendation systems to improve user experiences and increase engagement [2,3,4]. Similarly, MAB has been employed in online advertising campaigns to optimize the allocation of resources and maximize the effectiveness of ad placements [5]. While the standard MAB model assumes fixed reward distributions, real-world scenarios often involve changing distributions over time. For instance, in online recommendation systems, the collected data gradually become outdated, and user preferences are likely to evolve [6]. This dynamic nature necessitates the development of algorithms that can adapt to these changes, leading to the exploration of non-stationary MAB problems.
In recent years, there has been much research on non-stationary multi-armed bandit problems. These methods can be roughly divided into two categories: they either actively detect changes in the reward distributions using change-point detection algorithms [7,8,9,10,11], or they passively reduce the influence of past observations [12,13,14,15]. Ghatak [16] and Alami and Azizi [17] use active algorithms for non-stationary settings, combining change detection with TS. Viappiani [18], Gupta et al. [19], and Cavenaghi et al. [20] also address the non-stationary problem with TS algorithms; however, these are experimental papers without theoretical analysis. Liu et al. [21] propose a novel sampling method, predictive sampling, and use information-theoretic tools to analyze its Bayesian regret.
Active methods need to make assumptions about the changes in the arms' distributions to ensure the effectiveness of the change-point detection algorithm. For instance, refs. [7,8] require a lower bound on the amplitude of the change in each arm's expected reward. Passive methods require fewer assumptions about the characteristics of the changes; they often use a sliding window or a discount factor to forget past information and adapt to changes in the arms' distributions.
However, TS with a passive method has received little theoretical regret analysis in non-stationary MAB problems. Raj and Kalyani [13] have studied discounted Thompson sampling with Beta priors, but they only derive the probability of picking a suboptimal arm for the simple case of a two-armed bandit. To the best of our knowledge, only sliding-window Thompson sampling with Beta priors [14] provides regret upper bounds; however, its proof contains errors. Recently, Fiandri et al. [22] have corrected these proof errors using the techniques proposed in [23].
Our contributions are as follows: we propose discounted TS (DS-TS) and sliding-window TS (SW-TS) with uninformative priors for abruptly changing settings. We adopt a unified method to analyze the regret upper bounds of both algorithms. The theoretical analysis shows that their regret upper bounds are of order , where T is the number of time steps and is the number of breakpoints. This regret bound matches the lower bound proven by Garivier and Moulines [12] up to logarithmic factors. We also evaluate the algorithms in various environmental settings with Gaussian and Bernoulli rewards, and both DS-TS and SW-TS achieve competitive performance.
2. Related Works
Many works are based on the idea of forgetting past observations. Discounted UCB (DS-UCB) [12,24] uses a discount factor to compute a weighted average of past rewards, assigning smaller weights to earlier rewards so that old information is gradually forgotten. Garivier and Moulines [12] also propose sliding-window UCB (SW-UCB), which uses only the most recent rewards to compute the UCB index. They derive a regret upper bound of for both DS-UCB and SW-UCB. EXP3.S, proposed in [25], has been shown to achieve a regret upper bound of . Under the assumption that the total variation of the expected rewards over the time horizon is bounded by a budget , Besbes et al. [26] introduce REXP3 with regret . Combes and Proutiere [27] propose the SW-OSUB algorithm for the smoothly changing case, with an upper bound of , where is the Lipschitz constant of the evolving process. Raj and Kalyani [13] propose discounted Thompson sampling with Beta priors without providing a regret upper bound; they only calculate the probability of picking a suboptimal arm for the simple case of a two-armed bandit. Trovo et al. [14] propose a sliding-window Thompson sampling algorithm with regret for abruptly changing settings and for smoothly changing settings. Baudry et al. [15] propose a novel algorithm named the Sliding-Window Last Block Subsampling Duelling Algorithm (SW-LB-SDA) with regret . They assume that the reward distributions belong to the same one-parameter exponential family for all arms during each stationary phase, which means that SW-LB-SDA is not applicable to Gaussian reward distributions with unknown variance.
There are also many works that exploit techniques from the field of change detection to deal with reward distributions that vary over time. Mellor and Shapiro [28] combine a Bayesian change-point mechanism with a Thompson sampling strategy to tackle the non-stationary problem; their algorithm can detect both global switching and per-arm switching. Liu et al. [7] propose a change-detection framework that combines UCB with a change-detection algorithm named CUSUM. They obtain an upper bound on the average detection delay and a lower bound on the average time between false alarms. Cao et al. [8] propose M-UCB, which is similar to CUSUM but uses a simpler change-detection algorithm. M-UCB and CUSUM are nearly optimal; their regret bounds are .
The above works, except for SW-LB-SDA, assume that the reward distributions are bounded. We assume that the reward distributions are subGaussian, a more general setting that includes both bounded and Gaussian distributions.
Recently, some works have derived regret bounds without knowing the number of changes. For example, Auer et al. [9] propose an algorithm called ADSWITCH with an optimal regret bound of . Suk and Kpotufe [29] improve on [9] and obtain a regret bound smaller than , where S counts only switches of the best arm. There are also studies investigating non-stationary representation learning in bandit problems [30,31]. Their focus is mainly on sequential representation learning; they introduce online algorithms that detect task switches and adaptively learn and transfer a non-stationary representation.
3. Problem Formulation
Assume that the non-stationary MAB problem has K arms and a finite time horizon T. At each round t, the learner must select an arm and obtains the corresponding reward . The rewards are generated from -subGaussian distributions. The expectation of is denoted by . A policy is a function that selects the arm to play at round t. Let denote the expected reward of the optimal arm at round t. Unlike the stationary MAB setting, where a single arm is optimal at all times (i.e., ), in the non-stationary setting the optimal arm may change over time. The performance of a policy is measured in terms of the cumulative expected regret:
where is the expectation with respect to randomness of . Let and let
denote the number of plays of arm i up to time T during rounds in which it is not the best arm. To analyze the upper bound of , we can directly bound to obtain the regret contribution of each arm.
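For concreteness, the cumulative expected regret and the play count above can be written as follows, under one plausible choice of notation ($\mu_t(i)$ for the expected reward of arm $i$ at round $t$, $i_t$ for the arm played at round $t$, and $\mu_t^{*}=\max_i \mu_t(i)$); the paper's own symbols may differ:
\[
  \mathcal{R}(T) \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T}\bigl(\mu_t^{*}-\mu_t(i_t)\bigr)\right],
  \qquad
  \tilde{N}_T(i) \;=\; \sum_{t=1}^{T}\mathbb{1}\bigl\{i_t=i,\ \mu_t(i)\neq\mu_t^{*}\bigr\},
\]
where the expectation is taken over the randomness of the policy.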
Abruptly Changing Setting
The abruptly changing setting was first introduced by Garivier and Moulines [12]. The number of breakpoints is denoted by . Suppose the set of breakpoints is (we define ). At each breakpoint, the reward distribution changes for at least one arm. The rounds between two adjacent breakpoints are called a stationary phase. Abruptly changing bandits pose a more challenging problem, as the learner needs to balance exploration and exploitation both within each stationary phase and across the changes between phases. Trovo et al. [14] make an assumption about the number of breakpoints to facilitate a more general analysis, while we explicitly use to represent the number of breakpoints. An implicit assumption we use is that the number of breakpoints is much smaller than T, i.e., . In the community of piecewise-stationary bandit problems, it is commonly assumed that is much smaller than T. When and T are comparable, researchers typically consider scenarios with smooth changes [27].
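As a concrete rendering of this setting (using the assumed notation $\mu_t(i)$ introduced above, with $B_T$ a stand-in symbol for the number of breakpoints), the set of breakpoints and its cardinality can be written as
\[
  \mathcal{B} \;=\; \bigl\{\, 1 < t \le T \;:\; \exists\, i \ \text{with}\ \mu_t(i) \neq \mu_{t-1}(i) \,\bigr\},
  \qquad
  B_T \;=\; |\mathcal{B}|,
\]
so that the means $\mu_t(i)$ are constant on every interval between two consecutive breakpoints, i.e., on every stationary phase.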
4. Algorithms
In this section, we propose DS-TS and SW-TS with uninformative priors for non-stationary stochastic MAB problems. Different from [32], we assume that the reward distribution is -subGaussian rather than bounded. An uninformative prior can be obtained by letting the variance of a Gaussian prior approach infinity. First, assume that the rewards are independently and identically distributed according to a -subGaussian distribution with mean , and that the prior distribution is a Gaussian distribution . The posterior distribution is then also Gaussian, , where
Letting , we obtain the posterior distribution . In fact, when is infinite, the prior becomes an uninformative prior.
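A standard derivation of this uninformative-prior limit, treating the reward likelihood as Gaussian $\mathcal{N}(\mu,\sigma^2)$ (a common surrogate in the subGaussian case), proceeds as follows: after $n$ observations with sample mean $\bar{x}_n$ and a prior $\mathcal{N}(\mu_0,\sigma_0^2)$, the posterior is
\[
  \mathcal{N}\!\left(
    \frac{\sigma^{2}\mu_{0} + n\sigma_{0}^{2}\bar{x}_{n}}{\sigma^{2} + n\sigma_{0}^{2}},\;
    \frac{\sigma^{2}\sigma_{0}^{2}}{\sigma^{2} + n\sigma_{0}^{2}}
  \right)
  \;\xrightarrow[\;\sigma_{0}^{2}\to\infty\;]{}\;
  \mathcal{N}\!\left(\bar{x}_{n},\, \frac{\sigma^{2}}{n}\right),
\]
which is the uninformative-prior posterior used by the algorithms below.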
4.1. DS-TS
DS-TS uses a discount factor γ to dynamically adjust the estimate of each arm's distribution. The key idea of our algorithm is to decrease the sampling variance of the selected arm while increasing the sampling variance of the unselected arms.
Specifically, let
denote the discounted number of plays of arm i until time t. We use
which we call the discounted empirical average, to estimate the expected reward of arm i. In non-stationary settings, we use the discounted average and the discounted number of plays in place of the ordinary sample average and play counts, respectively. Therefore, the posterior distribution is .
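In the notation of Garivier and Moulines [12], which we adopt here as an assumption (with $\gamma$ the discount factor, $i_s$ the arm played at round $s$, and $X_s(i)$ its observed reward), the discounted number of plays and the discounted empirical average read
\[
  N_{\gamma}(t,i) \;=\; \sum_{s=1}^{t}\gamma^{\,t-s}\,\mathbb{1}\{i_s=i\},
  \qquad
  \hat{\mu}_{\gamma}(t,i) \;=\; \frac{1}{N_{\gamma}(t,i)}\sum_{s=1}^{t}\gamma^{\,t-s}\,X_s(i)\,\mathbb{1}\{i_s=i\}.
\]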
Algorithm 1 shows the pseudocode of DS-TS. Step 3 is the Thompson sampling step: for each arm, we draw a random sample from . We use as the posterior variance instead of , which helps the subsequent analysis. Then, we select the arm with the maximum sample value and obtain the reward (Step 5). To avoid the per-round time complexity growing to , we introduce to compute using an iterative method (Steps 7–9).
If arm i is selected at round t, the posterior distribution is updated as follows:
If arm i is not selected at round t, the posterior distribution is updated as
i.e., the expectation of posterior distribution remains unchanged.
Algorithm 1 DS-TS
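The following Python sketch illustrates the structure of DS-TS described above. It is not the paper's reference implementation: the `env(t, arm)` interface, the initial round-robin pulls, and the variance-inflation factor `kappa` (standing in for the slightly enlarged posterior variance used in the analysis) are assumptions.

```python
import numpy as np

def ds_ts(env, K, T, gamma, sigma, kappa=2.0, seed=0):
    """Sketch of discounted Thompson sampling (DS-TS).

    env(t, arm) -> reward is an assumed interface; kappa is a hypothetical
    stand-in for the inflated posterior variance used in the paper's analysis.
    """
    rng = np.random.default_rng(seed)
    disc_count = np.zeros(K)   # discounted number of plays N_gamma(t, i)
    disc_sum = np.zeros(K)     # discounted sum of observed rewards
    choices = np.empty(T, dtype=int)
    for t in range(T):
        if t < K:
            arm = t                                   # play each arm once
        else:
            mean = disc_sum / disc_count              # discounted empirical average
            std = np.sqrt(kappa * sigma**2 / disc_count)
            theta = rng.normal(mean, std)             # one posterior sample per arm
            arm = int(np.argmax(theta))
        x = env(t, arm)
        # iterative update: discount every arm, then add the new observation
        disc_count *= gamma
        disc_sum *= gamma
        disc_count[arm] += 1.0
        disc_sum[arm] += x
        choices[t] = arm
    return choices
```

Discounting all arms each round keeps the per-round cost at O(K), which is the point of the iterative update in Steps 7–9.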
4.2. SW-TS
SW-TS uses a sliding window to adapt to changes in the reward distribution. Let
If , the range of the summation is from 1 to t. Similar to DS-TS, the posterior distribution is . Algorithm 2 shows the pseudocode of SW-TS. To avoid the per-round time complexity growing to , we introduce to update incrementally.
Algorithm 2 SW-TS
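A corresponding sketch of SW-TS, under the same assumptions as the DS-TS sketch above (hypothetical `env` interface and `kappa` inflation factor; `tau` denotes the window length):

```python
from collections import deque
import numpy as np

def sw_ts(env, K, T, tau, sigma, kappa=2.0, seed=0):
    """Sketch of sliding-window Thompson sampling (SW-TS)."""
    rng = np.random.default_rng(seed)
    window = deque()            # (arm, reward) pairs from the last tau rounds
    count = np.zeros(K)         # windowed number of plays
    total = np.zeros(K)         # windowed sum of rewards
    choices = np.empty(T, dtype=int)
    for t in range(T):
        # an arm with no observation inside the window gets a diffuse sample,
        # mimicking the uninformative prior
        safe = np.maximum(count, 1e-12)
        mean = total / safe
        std = np.sqrt(kappa * sigma**2 / safe)
        theta = rng.normal(mean, std)
        arm = int(np.argmax(theta))
        x = env(t, arm)
        window.append((arm, x))
        count[arm] += 1.0
        total[arm] += x
        if len(window) > tau:   # drop the observation that left the window
            old_arm, old_x = window.popleft()
            count[old_arm] -= 1.0
            total[old_arm] -= old_x
        choices[t] = arm
    return choices
```

Maintaining the window with a deque keeps the incremental update at O(1) per round beyond the O(K) sampling step.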
4.3. Results
In this section, we give the regret upper bounds of DS-TS and SW-TS. We then discuss how to choose the parameter values so that these algorithms attain the best possible upper bound.
Recall that . Let be the minimum difference between the expected reward of the best arm and the expected reward of arm i over all rounds up to time T in which arm i is not the best arm. Let denote the maximum expected variation of the arms.
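In the assumed notation above, the per-arm gap described in this paragraph can be written as
\[
  \Delta_i \;=\; \min_{\substack{t \le T \\ \mu_t(i) \neq \mu_t^{*}}}
  \bigl(\mu_t^{*} - \mu_t(i)\bigr),
\]
the smallest gap of arm $i$ over all rounds in which it is not the best arm.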
Theorem 1
(DS-TS). Let satisfy . For any suboptimal arm i,
where
Remark 1.
The condition can ensure that is well defined. In general, we do not need to know in advance when setting the value of γ. If we choose a γ close to 1, then the condition in Theorem 1 is easily satisfied, as shown in the corollary below.
Corollary 1.
If the time horizon T and the number of breakpoints are known in advance, the discount factor can be chosen as . If ,
we have
Theorem 2
(SW-TS). Let , for any suboptimal arm i,
where
Corollary 2.
If the time horizon T and number of breakpoints are known in advance, the sliding window can be chosen as , then
5. Proofs of Upper Bounds
Before giving the detailed proof, we discuss the main challenges in regret analysis of Thompson sampling in a non-stationary setting. These challenges are addressed by Lemmas 1–3.
5.1. Challenges in Regret Analysis
Existing analyses of regret bounds for Thompson sampling [32,33,34] decompose the regret into two parts. The first part of regret comes from the over-estimation of the suboptimal arm, which can be dealt with by the concentration properties of the sampling distribution and rewards distribution. The second part is the under-estimation of the optimal arm, which mainly relies on bounding the following equation.
where is the probability that the best arm will not be under-estimated from the mean reward by a margin .
The first challenge is specific to the DS-TS algorithm. Unlike SW-TS, which completely forgets previous information after rounds following a breakpoint, DS-TS cannot fully forget past information.
This makes it challenging to use the concentration properties of the reward distribution to bound the regret that comes from over-estimation of the suboptimal arm, and this further affects the analysis of Equation (2).
The second challenge is the under-estimation of the optimal arm. In stationary settings, changes only when the optimal arm is selected, so Equation (2) can be bounded by the method proposed by Agrawal and Goyal [32]. However, the distribution of may vary over time in non-stationary settings, and obtaining a tight bound on Equation (2) is challenging and nontrivial.
To overcome the first challenge, we adjust the posterior variance to be . This slightly larger variance is specifically designed for the -subGaussian distribution and helps to bound (in Appendix B.4, we show that our analysis requires the variance to be greater than , and we set the variance to for a cleaner presentation of the results). We then define , which plays a role similar to the upper confidence bound in the UCB algorithm. We address this problem through Lemmas 1 and 2.
For the second challenge, we use the newly defined and employ a new regret decomposition of Equation (2) based on whether the event occurs. Intuitively, if , then is close to 1, which leads to a sharp bound. If , we can use Lemma A3 to obtain the upper bound of Equation (2). We derive the upper bound of for non-stationary settings, with an extra logarithmic term compared with the stationary setting. The proof of Lemma 3 in Appendix B.3 gives the details.
5.2. Proofs of Theorem 1
For arm , we choose two thresholds such that . Then and . The history is defined as the plays and rewards of the previous t rounds; and the distribution of are determined by .
The abruptly changing setting is in fact piecewise-stationary. The rounds between two adjacent breakpoints are stationary. Based on this observation, we define the pseudo-stationary phase as
The rounds in ensure, to some extent, that the rewards are "stationary". For any , the reward distributions remain unchanged between . Therefore, we can obtain a good estimate of the reward distributions in (Lemma 1). Let denote the complement of , i.e., . Note that at most rounds belong to after each breakpoint, because the rounds between two adjacent breakpoints are stationary: once enough time steps have passed after a breakpoint, the reward distributions do not change, and the subsequent rounds therefore belong to . Hence, the number of elements in the set has an upper bound , i.e.,
Figure 1 shows and in two different situations. Since during the rounds in , i.e., the rounds following a breakpoint, the estimate of the expected rewards may be poor, we directly bound the regret during by and only focus on the regret in .
Figure 1.
Illustration of and in two different situations. are the breakpoints. The situation with is shown in the top panel, and in the bottom panel.
To facilitate the analysis, we define the following quantities and events.
Definition 1.
Define as the event . Define as the event .
Intuitively, event represents selecting a sufficiently explored suboptimal arm, and event denotes that is not too far from the mean .
Now we list some useful lemmas; the detailed proofs are provided in the appendix. The following lemma shows that after finitely many rounds following a breakpoint, i.e., in the pseudo-stationary phase, the distance between and the discounted average of expectations for arm i can be bounded by , a quantity analogous to the upper confidence bound in the UCB algorithm.
Lemma 1.
Let denote the discounted average of expectations for arm i at time step t. For , the distance between and is less than .
Using Lemma 1 and the self-normalized Hoeffding-type inequality for subGaussian distributions (Lemma A1), we have the following lemma, which helps to bound the regret that comes from over-estimation of a suboptimal arm.
Lemma 2.
,
The following key lemma helps bound the regret that comes from under-estimation of the optimal arm. This is the trickiest part of analyzing TS. Note that the proof in [14] does not establish the result of the following lemma.
Lemma 3.
Let . For any and ,
Before giving the detailed proof, we outline its main steps.
Proof Outline. First, since the regret incurred in can be bounded by , we only consider the regret in rounds . Then, we consider the event . If this event does not hold, we can use Lemma A3 to bound the regret by . If it holds, we additionally consider whether the suboptimal arm is over-estimated () and whether event holds, decomposing the regret into three parts as in Equation (10). The first part comes from over-estimation of the suboptimal arm and can be bounded by Lemma 2. The second part comes from the bias in sampling the suboptimal arm and can be bounded by the properties of the Gaussian distribution, Equation (A1). The third part is the regret that comes from under-estimation of the optimal arm and can be bounded by Lemma 3.
The proof is in 5 steps:
Step 1: We divide the rounds into two parts: and . Since Equation (3) shows that the number of elements in the second part is smaller than , we have
Step 2: Then, we consider the event .
We first bound .
where the last equality uses the tower rule of expectation.
Using Lemma A3, we have
Therefore,
Step 3: Recall that we use to denote the event and denote the event . Equation (9) may be decomposed as follows:
Using Lemma 2, the first part in Equation (10) can be bounded by .
Step 4: Next, we bound the second part in Equation (10). Using the fact that and are determined by the history , we have
Given the history such that and , we have
Therefore,
where the second inequality follows from and Equation (A1).
For other , the indicator term will be 0. Hence, we can bound the second part by
Step 5: Finally, we focus on the third term in Equation (10). Using Lemma A2 and the fact that is fixed given ,
Then, by Lemma 3, we have
Substituting the results of Steps 3–5 into Equations (10) and (9),
5.3. Proofs of Theorem 2
The proof of Theorem 2 is similar to Theorem 1. The main difference is that the pseudo-stationary phase is now defined as . Let
If ,
This means the bias () vanishes. We no longer need an n related to to deal with the bias issue. We only need to define as
We directly list the following two lemmas, corresponding to Lemma 2 and Lemma 3, respectively.
Lemma 4.
,
This lemma is similar to Lemma 2 and can be used to bound the regret that comes from over-estimation of the suboptimal arms. It can be proved using the Hoeffding-type inequality for subGaussian distributions (Lemma A1). The detailed proof can be found in Appendix B.5.
Lemma 5.
Let . For any and ,
This key lemma helps bound the regret that comes from the under-estimation of the optimal arm (Step 5 in the proof of DS-TS) which is similar to Lemma 3. It can be proved by Lemma A2, which transforms the probability of selecting the ith arm into the probability of selecting the optimal arm . The detailed proofs can be found in Appendix B.6.
The rest of the proof is nearly identical to the proof of Theorem 1.
6. Experiments
In this section, we empirically compare the performance of our methods with state-of-the-art algorithms on Bernoulli and Gaussian reward distributions (our code is available at https://github.com/qh1874/TS_NonStationary (accessed on 14 August 2024)). Specifically, we compare DS-TS and SW-TS with Thompson sampling to evaluate the improvement obtained by employing the discount factor and the sliding window . We also compare our methods with the UCB-based methods DS-UCB and SW-UCB [12] to contrast the effects of Thompson sampling and UCB. Furthermore, we compare our methods with recent and efficient algorithms, namely CUSUM [7], M-UCB [8], and SW-LB-SDA [15]. Note that SW-LB-SDA is not applicable to Gaussian reward distributions with unknown variance. We measure the performance of each algorithm by the cumulative expected regret defined in Equation (1). The expected regret is averaged over 100 independent runs, and the 95% confidence interval obtained from these runs is depicted as a semi-transparent region in the figures.
6.1. Gaussian Arms
6.1.1. Experimental Setting for Gaussian Arms
We fix the time horizon at T = 100,000. The means and standard deviations are drawn from the distributions and . For Gaussian rewards, we conduct two experiments. In the first experiment, we split the time horizon into 5 phases and use arms; in the second, we split it into 10 phases and use arms. Figure 2 depicts the expected rewards for Gaussian arms and Bernoulli arms with and .
Figure 2.
Expected rewards for Gaussian arms (a) and Bernoulli arms (b).
The analysis of SW-UCB and DS-UCB is conducted under the bounded-reward assumption, but the algorithms can be adapted to Gaussian scenarios. To achieve reasonable performance, it is necessary to set the discount factor and the sliding window appropriately. We use the settings recommended in [15], namely for SW-UCB and for DS-UCB.
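A minimal sketch of how such a piecewise-stationary Gaussian environment and its per-round regret could be generated for the experiments; the sampling ranges for the means and standard deviations below are placeholders (the paper specifies the actual distributions), and the usage lines assume the hypothetical `ds_ts` sketch from Section 4.1.

```python
import numpy as np

def make_piecewise_gaussian_env(K, T, n_phases, rng):
    """Hypothetical generator for a piecewise-stationary Gaussian bandit:
    means and standard deviations are redrawn at every breakpoint."""
    bounds = np.linspace(0, T, n_phases + 1).astype(int)   # phase boundaries
    means = rng.uniform(0.0, 1.0, size=(n_phases, K))      # placeholder range
    stds = rng.uniform(0.1, 0.5, size=(n_phases, K))       # placeholder range

    def phase_of(t):
        return int(np.searchsorted(bounds, t, side="right")) - 1

    def env(t, arm):
        p = phase_of(t)
        return rng.normal(means[p, arm], stds[p, arm])

    def instant_regret(t, arm):
        p = phase_of(t)
        return means[p].max() - means[p, arm]

    return env, instant_regret

# usage (placeholder parameters): cumulative expected regret of a run
# rng = np.random.default_rng(0)
# env, inst = make_piecewise_gaussian_env(K=5, T=100_000, n_phases=5, rng=rng)
# choices = ds_ts(env, K=5, T=100_000, gamma=0.999, sigma=0.5)
# regret = np.cumsum([inst(t, a) for t, a in enumerate(choices)])
```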
6.1.2. Results
Figure 3 illustrates the performance of these algorithms for Gaussian rewards under two different settings. Notably, CUSUM and M-UCB are not applicable to Gaussian rewards: CUSUM is designed for Bernoulli distributions, while M-UCB assumes bounded distributions. The discounted methods tend to perform better than the sliding-window methods with Gaussian rewards.
Figure 3.
Gaussian arms. (a) . (b) .
Among these algorithms, only our algorithms and SW-LB-SDA provide regret analysis for unbounded rewards. Our algorithm (DS-TS) and SW-LB-SDA have demonstrated highly competitive experimental performance.
6.2. Bernoulli Arms
6.2.1. Experimental Setting for Bernoulli Arms
The time horizon is set as T = 100,000. We split the time horizon into phases of equal length and use a number of arms , respectively.
For Bernoulli rewards, the expected value of each arm i is drawn from a uniform distribution over . Within each stationary phase, the reward distributions remain unchanged. The Bernoulli arms for each phase are generated as
For a Bernoulli distribution, we modify the Thompson sampling step (Step 3) in our algorithms as and . Based on Corollaries 1 and 2, we set and . To allow a fair comparison, DS-UCB uses the discount factor and SW-UCB uses the sliding window suggested by [12]. Based on [15], we set for LB-SDA. For the change-point detection algorithm M-UCB, we set as suggested by [8], but we set the amount of exploration to ; in practice, using this value instead of the one guaranteed in [8] has been found to improve empirical performance [15]. For CUSUM, following [7], we set and . For our experimental settings, we choose .
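One plausible reading of the Bernoulli modification of the sampling step (the exact expressions are given in the paper) is to draw from Beta posteriors built on discounted success and failure counts; a sketch, with all names hypothetical:

```python
import numpy as np

def ds_ts_bernoulli_step(disc_success, disc_fail, rng):
    """Hypothetical Bernoulli sampling step: one Beta draw per arm from
    discounted success/failure counts with plus-one pseudo-counts."""
    theta = rng.beta(disc_success + 1.0, disc_fail + 1.0)
    return int(np.argmax(theta))

def ds_bernoulli_update(disc_success, disc_fail, arm, reward, gamma):
    """Discount all counts, then credit the observed Bernoulli reward."""
    disc_success *= gamma
    disc_fail *= gamma
    disc_success[arm] += reward
    disc_fail[arm] += 1.0 - reward
    return disc_success, disc_fail
```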
6.2.2. Results
Figure 4 presents the results for Bernoulli arms in abruptly changing settings. Our method (SW-TS) and SW-LB-SDA exhibit almost identical performance. Thompson sampling, designed for stationary MAB problems, shows significant oscillations at the breakpoints. The change-point detection algorithm CUSUM [7] also shows competitive performance, even though our experiment does not satisfy its detectability assumption. As the number of arms and breakpoints increases, the performance of the UCB-class algorithms (DS-UCB, SW-UCB) declines, while the two TS-based algorithms (DS-TS, SW-TS) still work well.
Figure 4.
Bernoulli arms. Settings with (a), (b).
6.2.3. Storage and Compute Cost
These algorithms can be divided into three classes: UCB-based, TS-based, and SW-LB-SDA. At each round, the UCB-class and TS-class algorithms require storage and computation. In contrast, at round T, SW-LB-SDA requires storage and computation. Although the experimental performance of SW-LB-SDA is similar to that of our algorithms, our algorithms require less storage and have lower computational complexity.
6.3. Different Variance
The non-stationary setting involves greater estimation noise than the stationary setting. Intuitively, TS with the standard variance should therefore incur worse regret in the non-stationary setting than TS with a larger variance. In this subsection, we conduct experiments to verify this point. Table 1 shows the results. For TS and SW-TS, a larger variance does indeed lead to smaller regret. This conclusion does not hold for DS-TS, but we believe it does not contradict the observation above, because for DS-TS the discount factor plays a more important role. If an arm has not been selected for some rounds, then will be small ( can be close to 0, whereas for SW-TS is greater than 1 or equal to 0), so has already become large, which ensures sufficient exploration. Therefore, DS-TS achieves the minimum regret. However, may become too large, leading to excessive exploration and thus worse regret compared with .
Table 1.
Settings with T = 100,000, = 5, K = 5 for Gaussian arms. The mean and standard deviation are drawn from distributions and . We set .
7. Conclusions
In this paper, we analyze the regret upper bound of the TS algorithm with an uninformative prior in non-stationary settings, filling a research gap in this field. Our approach builds upon previous works while tackling two key challenges specific to non-stationary environments: under-estimation of the optimal arm and the inability of the DS-TS algorithm to fully forget previous information. Finally, we conduct experiments to verify the theoretical results. Below, we discuss the results and propose directions for future research.
(1) The standard posterior update rule for Thompson sampling has a sampling variance of . We use only for ease of analysis. While this discrepancy is significant only for relatively small values of N, it would be valuable to develop proof techniques that leverage the variance of the standard Bayesian update.
(2) Our regret upper bound includes an additional logarithmic term compared to DS-UCB and SW-UCB, along with coefficients of and . It would be interesting to explore whether the additional logarithm and large coefficients are intrinsic to DS-TS and SW-TS algorithms or are a limitation of our analysis.
Author Contributions
Conceptualization, H.Q.; Formal analysis, H.Q.; Investigation, F.G.; Methodology, H.Q. and F.G.; Software, H.Q. and F.G.; Supervision, L.Z.; Validation, L.Z.; Writing—original draft, H.Q.; Writing—review and editing, F.G. and L.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Our code is available at https://github.com/qh1874/TS_NonStationary (accessed on 14 August 2024).
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Facts and Lemmas
Garivier and Moulines [12] have derived a Hoeffding-type inequality for self-normalized means with a random number of summands. Their bound is for bounded distributions. Leveraging the properties of -subGaussian distributions, we obtain the following bound for the -subGaussian case. Recall that
Lemma A1.
Let ,
Let ,
The following inequality is the anti-concentration and concentration bound for Gaussian distributed random variables.
Fact 1
([35]). For a Gaussian distributed random variable X with mean μ and variance , for any
Since , we also have the following well-known result:
The following lemma is adapted from [32] and is often used in the analysis of Thompson sampling, which can transform the probability of selecting the ith arm into the probability of selecting the optimal arm .
Lemma A2.
Let . For any , ,
Lemma A3
([12]). For any , and ,
Appendix B. Detailed Proofs of Lemmas and Theorems
Appendix B.1. Proof of Lemma 1
Recall that . Since is a convex combination of elements , we have
We can write as . Thus, we have
Recall that . If , we have
Therefore, , we have
where the last inequality follows from .
If , , we have
If , from Equation (A2), we also have
By the definition of ,
Appendix B.2. Proof of Lemma 2
From the definition of in Equation (4), we can obtain
If , . Thus, we have
Therefore,
where (a) uses Lemma 1, (b) uses Equation (A4), (c) follows from , (d) uses Lemma A1.
Since , this ends the proof.
Appendix B.3. Proof of Lemma 3
This proof is adapted from [32], which treats the stationary setting. However, some technical problems are difficult to overcome in non-stationary settings. The tricky part is to lower bound the probability of the mean estimate of the optimal arm in Equation (A9). By designing the function and decomposing the regret so that Lemma A3 can be used again, we solve this challenge. We use blue font to emphasize the techniques used in the proof.
The proof is in three steps.
Step 1: We first prove that has an upper bound independent of t.
Define a Bernoulli experiment as sampling from , where success implies that . Let denote the number of experiments performed when the event first occurs. Then,
Let ( is an integer ) and let denote the maximum of r independent Bernoulli experiments. Then,
Using Fact A1,
For any , . Hence, for any ,
Therefore, for any ,
Next, we apply Lemma A1 to lower bound .
Since ,
Then, we have
Substituting, for any ,
Therefore,
This proves a bound of independent of t.
Step 2: Define . We consider the upper bound of when .
Now, since . Therefore, for any ,
Using Fact A1,
This implies
Also, apply the self-normalized Hoeffding-type inequality,
Let . Therefore, for any ,
When , we can use Equation (A10) to obtain
Combining these results
Therefore, when , it holds that
Step 3: Let .
Appendix B.4. Larger Variance
In fact, Lemma A1 has a stricter upper bound as , where . Let , then we obtain Lemma A1. Suppose the variance of Thompson sampling is . The lower bound of Equation (A9) becomes
To ensure that in Equation (A11) has a finite upper bound, our analysis method requires the infinite series
is convergent. Thus, we have
i.e., the sampling variance needs to be strictly greater than .
Appendix B.5. Proof of Lemma 4
Recall that . Using Lemma A1, we have
Appendix B.6. Proof of Lemma 5
The proof is similar to the proof of Lemma 3.
Step 1: We first prove that has an upper bound independent of t.
Let ( is an integer). Then,
Using Fact A1,
For any , . Hence, for any ,
Therefore, for any ,
Next, we apply Lemma A1 to lower bound .
Substituting, for any ,
Therefore,
This proves a bound of independent of t.
Step 2: Define . We consider the upper bound of when .
Now, since . Therefore, for any ,
Using Fact A1,
This implies
Also, apply Lemma A1,
Let . Therefore, for any ,
When , we can use Equation (A10) to obtain
Combining these results,
Therefore, when , it holds that
Step 3: Let and .
References
- Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 1952, 58, 527–535. [Google Scholar] [CrossRef]
- Li, L.; Chu, W.; Langford, J.; Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, 9–12 February 2011; pp. 297–306. [Google Scholar]
- Bouneffouf, D.; Bouzeghoub, A.; Ganarski, A.L. A contextual-bandit algorithm for mobile context-aware recommender system. In Neural Information Processing, Proceedings of the International Conference, ICONIP 2012, Doha, Qatar, 12–15 November 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 324–331. [Google Scholar]
- Li, S.; Karatzoglou, A.; Gentile, C. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 539–548. [Google Scholar]
- Schwartz, E.M.; Bradlow, E.T.; Fader, P.S. Customer acquisition via display advertising using multi-armed bandit experiments. Mark. Sci. 2017, 36, 500–522. [Google Scholar] [CrossRef]
- Wu, Q.; Iyer, N.; Wang, H. Learning contextual bandits in a non-stationary environment. In Proceedings of the The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 495–504. [Google Scholar]
- Liu, F.; Lee, J.; Shroff, N. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Cao, Y.; Wen, Z.; Kveton, B.; Xie, Y. Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Okinawa, Japan, 16–18 April 2019; pp. 418–427. [Google Scholar]
- Auer, P.; Gajane, P.; Ortner, R. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proceedings of the Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 138–158. [Google Scholar]
- Chen, Y.; Lee, C.W.; Luo, H.; Wei, C.Y. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Proceedings of the Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 696–726. [Google Scholar]
- Besson, L.; Kaufmann, E.; Maillard, O.A.; Seznec, J. Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits. J. Mach. Learn. Res. 2022, 23, 1–40. [Google Scholar]
- Garivier, A.; Moulines, E. On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory, Proceedings of the 22nd International Conference, ALT 2011, Espoo, Finland, 5–7 October 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 174–188. [Google Scholar]
- Raj, V.; Kalyani, S. Taming non-stationary bandits: A Bayesian approach. arXiv 2017, arXiv:1707.09727. [Google Scholar]
- Trovo, F.; Paladino, S.; Restelli, M.; Gatti, N. Sliding-window thompson sampling for non-stationary settings. J. Artif. Intell. Res. 2020, 68, 311–364. [Google Scholar] [CrossRef]
- Baudry, D.; Russac, Y.; Cappé, O. On Limited-Memory Subsampling Strategies for Bandits. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 727–737. [Google Scholar]
- Ghatak, G. A change-detection-based Thompson sampling framework for non-stationary bandits. IEEE Trans. Comput. 2020, 70, 1670–1676. [Google Scholar] [CrossRef]
- Alami, R.; Azizi, O. Ts-glr: An adaptive thompson sampling for the switching multi-armed bandit problem. In Proceedings of the NeurIPS 2020 Challenges of Real World Reinforcement Learning Workshop, Virtual, 6–12 December 2020. [Google Scholar]
- Viappiani, P. Thompson sampling for bayesian bandits with resets. In Algorithmic Decision Theory, Proceedings of the Third International Conference, ADT 2013, Bruxelles, Belgium, 12–14 November 2013; Proceedings 3; Springer: Berlin/Heidelberg, Germany, 2013; pp. 399–410. [Google Scholar]
- Gupta, N.; Granmo, O.C.; Agrawala, A. Thompson sampling for dynamic multi-armed bandits. In Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops, Honolulu, HI, USA, 18–21 December 2011; Volume 1, pp. 484–489. [Google Scholar]
- Cavenaghi, E.; Sottocornola, G.; Stella, F.; Zanker, M. Non stationary multi-armed bandit: Empirical evaluation of a new concept drift-aware algorithm. Entropy 2021, 23, 380. [Google Scholar] [CrossRef]
- Liu, Y.; Van Roy, B.; Xu, K. Nonstationary bandit learning via predictive sampling. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 25–27 April 2023; pp. 6215–6244. [Google Scholar]
- Fiandri, M.; Metelli, A.M.; Trovò, F. Sliding-Window Thompson Sampling for Non-Stationary Settings. arXiv 2024, arXiv:2409.05181. [Google Scholar]
- Qi, H.; Wang, Y.; Zhu, L. Discounted thompson sampling for non-stationary bandit problems. arXiv 2023, arXiv:2305.10718. [Google Scholar]
- Kocsis, L.; Szepesvári, C. Discounted ucb. In Proceedings of the 2nd PASCAL Challenges Workshop, Venice, Italy, 10–12 April 2006; Volume 2, pp. 51–134. [Google Scholar]
- Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 2002, 32, 48–77. [Google Scholar] [CrossRef]
- Besbes, O.; Gur, Y.; Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Combes, R.; Proutiere, A. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 521–529. [Google Scholar]
- Mellor, J.; Shapiro, J. Thompson sampling in switching environments with Bayesian online change detection. In Proceedings of the Artificial Intelligence and Statistics, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 442–450. [Google Scholar]
- Suk, J.; Kpotufe, S. Tracking Most Significant Arm Switches in Bandits. In Proceedings of the Conference on Learning Theory, London, UK, 2–5 July 2022; pp. 2160–2182. [Google Scholar]
- Qin, Y.; Menara, T.; Oymak, S.; Ching, S.; Pasqualetti, F. Non-stationary representation learning in sequential multi-armed bandits. In Proceedings of the ICML Workshop on Reinforcement Learning Theory, Virtual, 18-24 July 2021. [Google Scholar]
- Qin, Y.; Menara, T.; Oymak, S.; Ching, S.; Pasqualetti, F. Non-stationary representation learning in sequential linear bandits. IEEE Open J. Control Syst. 2022, 1, 41–56. [Google Scholar] [CrossRef]
- Agrawal, S.; Goyal, N. Further optimal regret bounds for thompson sampling. In Proceedings of the Artificial Intelligence and Statistics, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 99–107. [Google Scholar]
- Jin, T.; Xu, P.; Shi, J.; Xiao, X.; Gu, Q. Mots: Minimax optimal thompson sampling. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 5074–5083. [Google Scholar]
- Jin, T.; Xu, P.; Xiao, X.; Anandkumar, A. Finite-time regret of thompson sampling algorithms for exponential family multi-armed bandits. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; pp. 38475–38487. [Google Scholar]
- Abramowitz, M.; Stegun, I.A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables; US Government Printing Office: Washington, DC, USA, 1964; Volume 55.