Abstract
Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario where reward distributions remain constant for a certain period and change at unknown time steps. Although Thompson sampling (TS) has shown success in non-stationary settings, there is currently no regret bound analysis for TS with uninformative priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound for the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret upper bounds of both algorithms are , where T is the time horizon and is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.
1. Introduction
The multi-armed bandit (MAB) problem is a classic sequential decision problem. At each time step, the learner selects an arm (also known as an action) from a finite set based on its past observations and only observes the reward of the chosen arm. The learner's goal is to maximize its expected cumulative reward or minimize the regret incurred during the learning process. The regret is defined as the difference between the expected reward of the optimal arm and the expected reward achieved by the MAB algorithm.
MAB has found practical use in various scenarios, with one of the earliest applications being the diagnosis and treatment experiments proposed by Robbins [1]. In this experiment, each patient’s treatment plan corresponds to an arm in the MAB problem, and the goal is to minimize the patient’s health loss by making optimal treatment decisions. Recently, MAB has gained wide-ranging applicability. For example, MAB algorithms have been used in online recommendation systems to improve user experiences and increase engagement [2,3,4]. Similarly, MAB has been employed in online advertising campaigns to optimize the allocation of resources and maximize the effectiveness of ad placements [5]. While the standard MAB model assumes fixed reward distributions, real-world scenarios often involve changing distributions over time. For instance, in online recommendation systems, the collected data gradually become outdated, and user preferences are likely to evolve [6]. This dynamic nature necessitates the development of algorithms that can adapt to these changes, leading to the exploration of non-stationary MAB problems.
In recent years, there has been much research on non-stationary multi-armed bandit problems. These methods can be roughly divided into two categories: they either actively detect changes in the reward distributions using change-point detection algorithms [7,8,9,10,11], or they passively reduce the influence of past observations [12,13,14,15]. Ghatak [16] and Alami and Azizi [17] use active algorithms for non-stationary settings, combining change detection with TS. Viappiani [18], Gupta et al. [19], and Cavenaghi et al. [20] also address the non-stationary problem with TS algorithms; however, these are experimental papers without theoretical analysis. Liu et al. [21] propose a novel sampling method, predictive sampling, and use information-theoretic tools to analyze its Bayesian regret.
Active methods need to make assumptions about the changes in the arms' distributions to ensure the effectiveness of the change-point detection algorithm. For instance, refs. [7,8] require a lower bound on the amplitude of the change in each arm's expected reward. Passive methods require fewer assumptions about the characteristics of the changes; they often use a sliding window or a discount factor to forget past information and adapt to changes in the arms' distributions.
However, TS with a passive method has received little theoretical regret analysis in non-stationary MAB problems. Raj and Kalyani [13] have studied discounted Thompson sampling with Beta priors, but they only derive the probability of picking a suboptimal arm for the simple case of a two-armed bandit. To the best of our knowledge, only sliding-window Thompson sampling with Beta priors [14] provides regret upper bounds; however, its proof contains errors. Recently, Fiandri et al. [22] have corrected these proof errors using the techniques proposed in [23].
Our contributions are as follows: we propose discounted TS (DS-TS) and sliding-window TS (SW-TS) with uninformative priors for abruptly changing settings. We adopt a unified method to analyze the regret upper bounds of both algorithms. The theoretical analysis shows that their regret upper bounds are of order , where T is the number of time steps and is the number of breakpoints. This regret bound matches the lower bound proven by Garivier and Moulines [12] up to logarithmic factors. We also evaluate the algorithms in various environmental settings with Gaussian and Bernoulli rewards, and both DS-TS and SW-TS achieve competitive performance.
2. Related Works
Many works are based on the idea of forgetting past observations. Discounted UCB (DS-UCB) [12,24] uses a discount factor to compute a weighted average of past rewards, assigning smaller weights to earlier rewards so that old information is gradually forgotten. Garivier and Moulines [12] also propose sliding-window UCB (SW-UCB), which uses only the most recent rewards to compute the UCB index. They derive a regret upper bound of for both DS-UCB and SW-UCB. EXP3.S, proposed in [25], has been shown to achieve a regret upper bound of . Under the assumption that the total variation of the expected rewards over the time horizon is bounded by a budget , Besbes et al. [26] introduce REXP3 with regret . Combes and Proutiere [27] propose the SW-OSUB algorithm for the smoothly changing case, with an upper bound of , where is the Lipschitz constant of the evolving process. Raj and Kalyani [13] propose discounted Thompson sampling with Beta priors without providing a regret upper bound; they only calculate the probability of picking a suboptimal arm for the simple case of a two-armed bandit. Trovo et al. [14] propose a sliding-window Thompson sampling algorithm with regret for abruptly changing settings and for smoothly changing settings. Baudry et al. [15] propose a novel algorithm named the Sliding-Window Last Block Subsampling Duelling Algorithm (SW-LB-SDA) with regret . They assume that the reward distributions belong to the same one-parameter exponential family for all arms during each stationary phase, which means that SW-LB-SDA is not applicable to Gaussian reward distributions with unknown variance.
There are also many works that exploit techniques from the field of change detection to deal with reward distributions that vary over time. Mellor and Shapiro [28] combine a Bayesian change-point mechanism with a Thompson sampling strategy to tackle the non-stationary problem; their algorithm can detect both global switching and per-arm switching. Liu et al. [7] propose a change-detection framework that combines UCB with a change-detection algorithm named CUSUM. They obtain an upper bound on the average detection delay and a lower bound on the average time between false alarms. Cao et al. [8] propose M-UCB, which is similar to CUSUM but uses a simpler change-detection algorithm. M-UCB and CUSUM are nearly optimal; their regret bounds are .
The above works, except for SW-LB-SDA, assume that the reward distributions are bounded. We assume that the reward distributions are subGaussian, a more general setting that includes both bounded and Gaussian distributions.
Recently, some works have derived regret bounds without knowing the number of changes. For example, Auer et al. [9] propose an algorithm called ADSWITCH with an optimal regret bound of . Suk and Kpotufe [29] improve on [9] and obtain a regret bound smaller than , where S counts only switches of the best arm. There are also studies investigating non-stationary representation learning in bandit problems [30,31]. Their focus is mainly on sequential representation learning; they introduce online algorithms that detect task switches and adaptively learn and transfer a non-stationary representation.
3. Problem Formulation
Assume that the non-stationary MAB problem has K arms and a finite time horizon T. At each round t, the learner must select an arm and obtains the corresponding reward . The rewards are generated from -subGaussian distributions. The expectation of is denoted by . A policy is a function that selects the arm to play at round t. Let denote the expected reward of the optimal arm at round t. Unlike the stationary MAB setting, where a single arm is optimal at all times (i.e., ), in the non-stationary setting the optimal arm may change over time. The performance of a policy is measured in terms of the cumulative expected regret:
where is the expectation with respect to randomness of . Let and let
denote the number of plays of arm i up to time T during rounds in which it is not the best arm. To analyze the upper bound of , we can directly bound to obtain the regret contribution of each arm.
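For concreteness, the cumulative expected regret and the play count above can be written as follows, under one plausible choice of notation ($\mu_t(i)$ for the expected reward of arm $i$ at round $t$, $i_t$ for the arm played at round $t$, and $\mu_t^{*}=\max_i \mu_t(i)$); the paper's own symbols may differ:
\[
  \mathcal{R}(T) \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T}\bigl(\mu_t^{*}-\mu_t(i_t)\bigr)\right],
  \qquad
  \tilde{N}_T(i) \;=\; \sum_{t=1}^{T}\mathbb{1}\bigl\{i_t=i,\ \mu_t(i)\neq\mu_t^{*}\bigr\},
\]
where the expectation is taken over the randomness of the policy.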
Abruptly Changing Setting
The abruptly changing setting was first introduced by Garivier and Moulines [12]. The number of breakpoints is denoted by . Suppose the set of breakpoints is (we define ). At each breakpoint, the reward distribution changes for at least one arm. The rounds between two adjacent breakpoints are called a stationary phase. Abruptly changing bandits pose a more challenging problem, as the learner needs to balance exploration and exploitation both within each stationary phase and across the changes between phases. Trovo et al. [14] make an assumption about the number of breakpoints to facilitate a more general analysis, while we explicitly use to represent the number of breakpoints. An implicit assumption we use is that the number of breakpoints is much smaller than T, i.e., . In the community of piecewise-stationary bandit problems, it is commonly assumed that is much smaller than T. When and T are comparable, researchers typically consider scenarios with smooth changes [27].
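As a concrete rendering of this setting (using the assumed notation $\mu_t(i)$ introduced above, with $B_T$ a stand-in symbol for the number of breakpoints), the set of breakpoints and its cardinality can be written as
\[
  \mathcal{B} \;=\; \bigl\{\, 1 < t \le T \;:\; \exists\, i \ \text{with}\ \mu_t(i) \neq \mu_{t-1}(i) \,\bigr\},
  \qquad
  B_T \;=\; |\mathcal{B}|,
\]
so that the means $\mu_t(i)$ are constant on every interval between two consecutive breakpoints, i.e., on every stationary phase.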
4. Algorithms
In this section, we propose DS-TS and SW-TS with uninformative priors for non-stationary stochastic MAB problems. Different from [32], we assume that the reward distribution is -subGaussian rather than bounded. An uninformative prior can be obtained by letting the variance of a Gaussian prior approach infinity. First, assume that the rewards are independently and identically distributed according to a -subGaussian distribution with mean , and that the prior distribution is a Gaussian distribution . The posterior distribution is then also Gaussian, , where
Letting , we obtain the posterior distribution . In fact, when is infinite, the prior becomes an uninformative prior.
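A standard derivation of this uninformative-prior limit, treating the reward likelihood as Gaussian $\mathcal{N}(\mu,\sigma^2)$ (a common surrogate in the subGaussian case), proceeds as follows: after $n$ observations with sample mean $\bar{x}_n$ and a prior $\mathcal{N}(\mu_0,\sigma_0^2)$, the posterior is
\[
  \mathcal{N}\!\left(
    \frac{\sigma^{2}\mu_{0} + n\sigma_{0}^{2}\bar{x}_{n}}{\sigma^{2} + n\sigma_{0}^{2}},\;
    \frac{\sigma^{2}\sigma_{0}^{2}}{\sigma^{2} + n\sigma_{0}^{2}}
  \right)
  \;\xrightarrow[\;\sigma_{0}^{2}\to\infty\;]{}\;
  \mathcal{N}\!\left(\bar{x}_{n},\, \frac{\sigma^{2}}{n}\right),
\]
which is the uninformative-prior posterior used by the algorithms below.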
4.1. DS-TS
DS-TS uses a discount factor γ to dynamically adjust the estimate of each arm's distribution. The key idea of our algorithm is to decrease the sampling variance of the selected arm while increasing the sampling variance of the unselected arms.
Specifically, let
denote the discounted number of plays of arm i until time t. We use
which we call the discounted empirical average, to estimate the expected reward of arm i. In non-stationary settings, we use the discounted average and the discounted number of plays in place of the ordinary sample average and play counts, respectively. Therefore, the posterior distribution is .
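In the notation of Garivier and Moulines [12], which we adopt here as an assumption (with $\gamma$ the discount factor, $i_s$ the arm played at round $s$, and $X_s(i)$ its observed reward), the discounted number of plays and the discounted empirical average read
\[
  N_{\gamma}(t,i) \;=\; \sum_{s=1}^{t}\gamma^{\,t-s}\,\mathbb{1}\{i_s=i\},
  \qquad
  \hat{\mu}_{\gamma}(t,i) \;=\; \frac{1}{N_{\gamma}(t,i)}\sum_{s=1}^{t}\gamma^{\,t-s}\,X_s(i)\,\mathbb{1}\{i_s=i\}.
\]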
Algorithm 1 shows the pseudocode of DS-TS. Step 3 is the Thompson sampling step: for each arm, we draw a random sample from . We use as the posterior variance instead of , which helps the subsequent analysis. Then, we select the arm with the maximum sample value and obtain the reward (Step 5). To avoid the per-round time complexity growing to , we introduce to compute using an iterative method (Steps 7–9).
If arm i is selected at round t, the posterior distribution is updated as follows:
If arm i is not selected at round t, the posterior distribution is updated as
i.e., the expectation of posterior distribution remains unchanged.
Algorithm 1 DS-TS
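The following Python sketch illustrates the structure of DS-TS described above. It is not the paper's reference implementation: the `env(t, arm)` interface, the initial round-robin pulls, and the variance-inflation factor `kappa` (standing in for the slightly enlarged posterior variance used in the analysis) are assumptions.

```python
import numpy as np

def ds_ts(env, K, T, gamma, sigma, kappa=2.0, seed=0):
    """Sketch of discounted Thompson sampling (DS-TS).

    env(t, arm) -> reward is an assumed interface; kappa is a hypothetical
    stand-in for the inflated posterior variance used in the paper's analysis.
    """
    rng = np.random.default_rng(seed)
    disc_count = np.zeros(K)   # discounted number of plays N_gamma(t, i)
    disc_sum = np.zeros(K)     # discounted sum of observed rewards
    choices = np.empty(T, dtype=int)
    for t in range(T):
        if t < K:
            arm = t                                   # play each arm once
        else:
            mean = disc_sum / disc_count              # discounted empirical average
            std = np.sqrt(kappa * sigma**2 / disc_count)
            theta = rng.normal(mean, std)             # one posterior sample per arm
            arm = int(np.argmax(theta))
        x = env(t, arm)
        # iterative update: discount every arm, then add the new observation
        disc_count *= gamma
        disc_sum *= gamma
        disc_count[arm] += 1.0
        disc_sum[arm] += x
        choices[t] = arm
    return choices
```

Discounting all arms each round keeps the per-round cost at O(K), which is the point of the iterative update in Steps 7–9.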
4.2. SW-TS
SW-TS uses a sliding window to adapt to changes in the reward distribution. Let
If , the range of the summation is from 1 to t. Similar to DS-TS, the posterior distribution is . Algorithm 2 shows the pseudocode of SW-TS. To avoid the per-round time complexity growing to , we introduce to update incrementally.
Algorithm 2 SW-TS
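A corresponding sketch of SW-TS, under the same assumptions as the DS-TS sketch above (hypothetical `env` interface and `kappa` inflation factor; `tau` denotes the window length):

```python
from collections import deque
import numpy as np

def sw_ts(env, K, T, tau, sigma, kappa=2.0, seed=0):
    """Sketch of sliding-window Thompson sampling (SW-TS)."""
    rng = np.random.default_rng(seed)
    window = deque()            # (arm, reward) pairs from the last tau rounds
    count = np.zeros(K)         # windowed number of plays
    total = np.zeros(K)         # windowed sum of rewards
    choices = np.empty(T, dtype=int)
    for t in range(T):
        # an arm with no observation inside the window gets a diffuse sample,
        # mimicking the uninformative prior
        safe = np.maximum(count, 1e-12)
        mean = total / safe
        std = np.sqrt(kappa * sigma**2 / safe)
        theta = rng.normal(mean, std)
        arm = int(np.argmax(theta))
        x = env(t, arm)
        window.append((arm, x))
        count[arm] += 1.0
        total[arm] += x
        if len(window) > tau:   # drop the observation that left the window
            old_arm, old_x = window.popleft()
            count[old_arm] -= 1.0
            total[old_arm] -= old_x
        choices[t] = arm
    return choices
```

Maintaining the window with a deque keeps the incremental update at O(1) per round beyond the O(K) sampling step.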
4.3. Results
In this section, we give the regret upper bounds of DS-TS and SW-TS. We then discuss how to choose the parameter values so that these algorithms attain the best possible upper bound.
Recall that . Let be the minimum difference between the expected reward of the best arm and the expected reward of arm i over all rounds up to time T in which arm i is not the best arm. Let denote the maximum expected variation of the arms.
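In the assumed notation above, the per-arm gap described in this paragraph can be written as
\[
  \Delta_i \;=\; \min_{\substack{t \le T \\ \mu_t(i) \neq \mu_t^{*}}}
  \bigl(\mu_t^{*} - \mu_t(i)\bigr),
\]
the smallest gap of arm $i$ over all rounds in which it is not the best arm.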
Theorem 1
(DS-TS). Let satisfy . For any suboptimal arm i,
where
Remark 1.
The condition can ensure that is well defined. In general, we do not need to know in advance when setting the value of γ. If we choose a γ close to 1, then the condition in Theorem 1 is easily satisfied, as shown in the corollary below.
Corollary 1.
If the time horizon T and the number of breakpoints are known in advance, the discount factor can be chosen as . If ,
we have
Theorem 2
(SW-TS). Let , for any suboptimal arm i,
where
Corollary 2.
If the time horizon T and number of breakpoints are known in advance, the sliding window can be chosen as , then
5. Proofs of Upper Bounds
Before giving the detailed proof, we discuss the main challenges in regret analysis of Thompson sampling in a non-stationary setting. These challenges are addressed by Lemmas 1–3.
5.1. Challenges in Regret Analysis
Existing analyses of regret bounds for Thompson sampling [32,33,34] decompose the regret into two parts. The first part of regret comes from the over-estimation of the suboptimal arm, which can be dealt with by the concentration properties of the sampling distribution and rewards distribution. The second part is the under-estimation of the optimal arm, which mainly relies on bounding the following equation.
where is the probability that the best arm will not be under-estimated from the mean reward by a margin .
The first challenge is specific to the DS-TS algorithm. Unlike SW-TS, which completely forgets previous information after rounds following a breakpoint, DS-TS cannot fully forget past information.
This makes it challenging to use the concentration properties of the reward distribution to bound the regret that comes from over-estimation of the suboptimal arm, and this further affects the analysis of Equation (2).
The second challenge is the under-estimation of the optimal arm. In stationary settings, changes only when the optimal arm is selected, so Equation (2) can be bounded by the method proposed by Agrawal and Goyal [32]. However, the distribution of may vary over time in non-stationary settings, and obtaining a tight bound on Equation (2) is challenging and nontrivial.
To overcome the first challenge, we adjust the posterior variance to be . This slightly larger variance is specifically designed for the -subGaussian distribution and helps to bound (in Appendix B.4, we show that our analysis requires the variance to be greater than , and we set the variance to for a cleaner presentation of the results). We then define , which plays a role similar to the upper confidence bound in the UCB algorithm. We address this problem through Lemmas 1 and 2.
For the second challenge, we use the newly defined and employ a new regret decomposition of Equation (2) based on whether the event occurs. Intuitively, if , then is close to 1, which leads to a sharp bound. If , we can use Lemma A3 to obtain the upper bound of Equation (2). We derive the upper bound of for non-stationary settings, with an extra logarithmic term compared with the stationary setting. The proof of Lemma 3 in Appendix B.3 gives the details.
5.2. Proofs of Theorem 1
For arm , we choose two thresholds such that . Then and . The history is defined as the plays and rewards of the previous t rounds; and the distribution of are determined by .
The abruptly changing setting is in fact piecewise-stationary. The rounds between two adjacent breakpoints are stationary. Based on this observation, we define the pseudo-stationary phase as
The rounds in ensure, to some extent, that the rewards are "stationary". For any , the reward distributions remain unchanged between . Therefore, we can obtain a good estimate of the reward distributions in (Lemma 1). Let denote the complement of , i.e., . Note that at most rounds belong to after each breakpoint, because the rounds between two adjacent breakpoints are stationary: once enough time steps have passed after a breakpoint, the reward distributions do not change, and the subsequent rounds therefore belong to . Hence, the number of elements in the set has an upper bound , i.e.,
Figure 1 shows and in two different situations. Since during the rounds in , i.e., the rounds following a breakpoint, the estimate of the expected rewards may be poor, we directly bound the regret during by and only focus on the regret in .
Figure 1.
Illustration of and in two different situations. are the breakpoints. The situation with is shown in the top panel, and in the bottom panel.
To facilitate the analysis, we define the following quantities and events.
Definition 1.
Define as the event . Define as the event .
Intuitively, event represents selecting a sufficiently explored suboptimal arm, and event denotes that is not too far from the mean .
Now we list some useful lemmas; the detailed proofs are provided in the appendix. The following lemma shows that after finitely many rounds following a breakpoint, i.e., in the pseudo-stationary phase, the distance between and the discounted average of expectations for arm i can be bounded by , a quantity analogous to the upper confidence bound in the UCB algorithm.
Lemma 1.
Let denote the discounted average of expectations for arm i at time step t. For , the distance between and is less than .
Using Lemma 1 and the self-normalized Hoeffding-type inequality for subGaussian distributions (Lemma A1), we have the following lemma, which helps to bound the regret that comes from over-estimation of a suboptimal arm.
Lemma 2.
,
The following key lemma helps bound the regret that comes from under-estimation of the optimal arm. This is the trickiest part of analyzing TS. Note that the proof in [14] does not establish the result of the following lemma.
Lemma 3.
Let . For any and ,
Before giving the detailed proof, we outline its main steps.
Proof Outline. First, since the regret incurred in can be bounded by , we only consider the regret in rounds . Then, we consider the event . If this event does not hold, we can use Lemma A3 to bound the regret by . If it holds, we additionally consider whether the suboptimal arm is over-estimated () and whether event holds, decomposing the regret into three parts as in Equation (10). The first part comes from over-estimation of the suboptimal arm and can be bounded by Lemma 2. The second part comes from the bias in sampling the suboptimal arm and can be bounded by the properties of the Gaussian distribution, Equation (A1). The third part is the regret that comes from under-estimation of the optimal arm and can be bounded by Lemma 3.
The proof is in 5 steps:
Step 1: We divide the rounds into two parts: and . Since Equation (3) shows that the number of elements in the second part is smaller than , we have
Step 2: Then, we consider the event .
We first bound .
where the last equality uses the tower rule of expectation.
Using Lemma A3, we have
Therefore,
Step 3: Recall that we use to denote the event and denote the event . Equation (9) may be decomposed as follows:
Using Lemma 2, the first part in Equation (10) can be bounded by .
Step 4: Next, we bound the second part in Equation (10). Using the fact that and are determined by the history , we have
Given the history such that and , we have
Therefore,
where the second inequality follows from and Equation (A1).
For other , the indicator term will be 0. Hence, we can bound the second part by
Step 5: Finally, we focus on the third term in Equation (10). Using Lemma A2 and the fact that is fixed given ,
Then, by Lemma 3, we have
Substituting the results of Steps 3–5 into Equations (10) and (9),
5.3. Proofs of Theorem 2
The proof of Theorem 2 is similar to Theorem 1. The main difference is that the pseudo-stationary phase is now defined as . Let
If ,
This means the bias () vanishes. We no longer need an n related to to deal with the bias issue. We only need to define as
We directly list the following two lemmas, corresponding to Lemma 2 and Lemma 3, respectively.
Lemma 4.
,
This lemma is similar to Lemma 2 and can be used to bound the regret that comes from over-estimation of the suboptimal arms. It can be proved using the Hoeffding-type inequality for subGaussian distributions (Lemma A1). The detailed proof can be found in Appendix B.5.
Lemma 5.
Let . For any and ,
This key lemma helps bound the regret that comes from the under-estimation of the optimal arm (Step 5 in the proof of DS-TS) which is similar to Lemma 3. It can be proved by Lemma A2, which transforms the probability of selecting the ith arm into the probability of selecting the optimal arm . The detailed proofs can be found in Appendix B.6.
The rest of the proof is nearly identical to the proof of Theorem 1.
6. Experiments
In this section, we empirically compare the performance of our methods with state-of-the-art algorithms on Bernoulli and Gaussian reward distributions (our code is available at https://github.com/qh1874/TS_NonStationary (accessed on 14 August 2024)). Specifically, we compare DS-TS and SW-TS with Thompson sampling to evaluate the improvement obtained by employing the discount factor and the sliding window . We also compare our methods with the UCB-based methods DS-UCB and SW-UCB [12] to contrast the effects of Thompson sampling and UCB. Furthermore, we compare our methods with recent and efficient algorithms, namely CUSUM [7], M-UCB [8], and SW-LB-SDA [15]. Note that SW-LB-SDA is not applicable to Gaussian reward distributions with unknown variance. We measure the performance of each algorithm by the cumulative expected regret defined in Equation (1). The expected regret is averaged over 100 independent runs, and the 95% confidence interval obtained from these runs is depicted as a semi-transparent region in the figures.
6.1. Gaussian Arms
6.1.1. Experimental Setting for Gaussian Arms
We fix the time horizon at T = 100,000. The means and standard deviations are drawn from the distributions and . For Gaussian rewards, we conduct two experiments. In the first experiment, we split the time horizon into 5 phases and use arms; in the second, we split it into 10 phases and use arms. Figure 2 depicts the expected rewards for Gaussian arms and Bernoulli arms with and .
Figure 2.
Expected rewards for Gaussian arms (a) and Bernoulli arms (b).
The analysis of SW-UCB and DS-UCB is conducted under the bounded-reward assumption, but the algorithms can be adapted to Gaussian scenarios. To achieve reasonable performance, it is necessary to set the discount factor and the sliding window appropriately. We use the settings recommended in [15], namely for SW-UCB and for DS-UCB.
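A minimal sketch of how such a piecewise-stationary Gaussian environment and its per-round regret could be generated for the experiments; the sampling ranges for the means and standard deviations below are placeholders (the paper specifies the actual distributions), and the usage lines assume the hypothetical `ds_ts` sketch from Section 4.1.

```python
import numpy as np

def make_piecewise_gaussian_env(K, T, n_phases, rng):
    """Hypothetical generator for a piecewise-stationary Gaussian bandit:
    means and standard deviations are redrawn at every breakpoint."""
    bounds = np.linspace(0, T, n_phases + 1).astype(int)   # phase boundaries
    means = rng.uniform(0.0, 1.0, size=(n_phases, K))      # placeholder range
    stds = rng.uniform(0.1, 0.5, size=(n_phases, K))       # placeholder range

    def phase_of(t):
        return int(np.searchsorted(bounds, t, side="right")) - 1

    def env(t, arm):
        p = phase_of(t)
        return rng.normal(means[p, arm], stds[p, arm])

    def instant_regret(t, arm):
        p = phase_of(t)
        return means[p].max() - means[p, arm]

    return env, instant_regret

# usage (placeholder parameters): cumulative expected regret of a run
# rng = np.random.default_rng(0)
# env, inst = make_piecewise_gaussian_env(K=5, T=100_000, n_phases=5, rng=rng)
# choices = ds_ts(env, K=5, T=100_000, gamma=0.999, sigma=0.5)
# regret = np.cumsum([inst(t, a) for t, a in enumerate(choices)])
```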
6.1.2. Results
Figure 3 illustrates the performance of these algorithms for Gaussian rewards under two different settings. Notably, CUSUM and M-UCB are not applicable to Gaussian rewards: CUSUM is designed for Bernoulli distributions, while M-UCB assumes bounded distributions. The discounted methods tend to perform better than the sliding-window methods with Gaussian rewards.
Figure 3.
Gaussian arms. (a) . (b) .
Among these algorithms, only our algorithms and SW-LB-SDA provide regret analysis for unbounded rewards. Our algorithm (DS-TS) and SW-LB-SDA have demonstrated highly competitive experimental performance.
6.2. Bernoulli Arms
6.2.1. Experimental Setting for Bernoulli Arms
The time horizon is set as T = 100,000. We split the time horizon into phases of equal length and use a number of arms , respectively.
For Bernoulli rewards, the expected value of each arm i is drawn from a uniform distribution over . Within each stationary phase, the reward distributions remain unchanged. The Bernoulli arms for each phase are generated as
For a Bernoulli distribution, we modify the Thompson sampling step (Step 3) in our algorithms as and . Based on Corollaries 1 and 2, we set and . To allow a fair comparison, DS-UCB uses the discount factor and SW-UCB uses the sliding window suggested by [12]. Based on [15], we set for LB-SDA. For the change-point detection algorithm M-UCB, we set as suggested by [8], but we set the amount of exploration to ; in practice, using this value instead of the one guaranteed in [8] has been found to improve empirical performance [15]. For CUSUM, following [7], we set and . For our experimental settings, we choose .
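One plausible reading of the Bernoulli modification of the sampling step (the exact expressions are given in the paper) is to draw from Beta posteriors built on discounted success and failure counts; a sketch, with all names hypothetical:

```python
import numpy as np

def ds_ts_bernoulli_step(disc_success, disc_fail, rng):
    """Hypothetical Bernoulli sampling step: one Beta draw per arm from
    discounted success/failure counts with plus-one pseudo-counts."""
    theta = rng.beta(disc_success + 1.0, disc_fail + 1.0)
    return int(np.argmax(theta))

def ds_bernoulli_update(disc_success, disc_fail, arm, reward, gamma):
    """Discount all counts, then credit the observed Bernoulli reward."""
    disc_success *= gamma
    disc_fail *= gamma
    disc_success[arm] += reward
    disc_fail[arm] += 1.0 - reward
    return disc_success, disc_fail
```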
6.2.2. Results
Figure 4 presents the results for Bernoulli arms in abruptly changing settings. Our method (SW-TS) and SW-LB-SDA exhibit almost identical performance. Thompson sampling, designed for stationary MAB problems, shows significant oscillations at the breakpoints. The change-point detection algorithm CUSUM [7] also shows competitive performance, even though our experiment does not satisfy its detectability assumption. As the number of arms and breakpoints increases, the performance of the UCB-class algorithms (DS-UCB, SW-UCB) declines, while the two TS-based algorithms (DS-TS, SW-TS) still work well.
Figure 4.
Bernoulli arms. Settings with (a), (b).
6.2.3. Storage and Compute Cost
These algorithms can be divided into three classes: UCB-based, TS-based, and SW-LB-SDA. At each round, the UCB-class and TS-class algorithms require storage and computation. In contrast, at round T, SW-LB-SDA requires storage and computation. Although the experimental performance of SW-LB-SDA is similar to that of our algorithms, our algorithms require less storage and have lower computational complexity.
6.3. Different Variance
The non-stationary setting involves greater estimation noise than the stationary setting. Intuitively, TS with the standard variance should therefore incur worse regret in the non-stationary setting than TS with a larger variance. In this subsection, we conduct experiments to verify this point. Table 1 shows the results. For TS and SW-TS, a larger variance does indeed lead to smaller regret. This conclusion does not hold for DS-TS, but we believe it does not contradict the observation above, because for DS-TS the discount factor plays a more important role. If an arm has not been selected for some rounds, then will be small ( can be close to 0, whereas for SW-TS is greater than 1 or equal to 0), so has already become large, which ensures sufficient exploration. Therefore, DS-TS achieves the minimum regret. However, may become too large, leading to excessive exploration and thus worse regret compared with .
Table 1.
Settings with T = 100,000, = 5, K = 5 for Gaussian arms. The mean and standard deviation are drawn from distributions and . We set .
7. Conclusions
In this paper, we analyze the regret upper bound of the TS algorithm with an uninformative prior in non-stationary settings, filling a research gap in this field. Our approach builds upon previous works while tackling two key challenges specific to non-stationary environments: under-estimation of the optimal arm and the inability of the DS-TS algorithm to fully forget previous information. Finally, we conduct experiments to verify the theoretical results. Below, we discuss the results and propose directions for future research.
(1) The standard posterior update rule for Thompson sampling has a sampling variance of . We use only for ease of analysis. While this discrepancy is significant only for relatively small values of N, it would be valuable to develop proof techniques that leverage the variance of the standard Bayesian update.
(2) Our regret upper bound includes an additional logarithmic term compared to DS-UCB and SW-UCB, along with coefficients of and . It would be interesting to explore whether the additional logarithm and large coefficients are intrinsic to DS-TS and SW-TS algorithms or are a limitation of our analysis.
Author Contributions
Conceptualization, H.Q.; Formal analysis, H.Q.; Investigation, F.G.; Methodology, H.Q. and F.G.; Software, H.Q. and F.G.; Supervision, L.Z.; Validation, L.Z.; Writing—original draft, H.Q.; Writing—review and editing, F.G. and L.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Our code is available at https://github.com/qh1874/TS_NonStationary (accessed on 14 August 2024).
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Facts and Lemmas
Garivier and Moulines [12] have derived a Hoeffding-type inequality for self-normalized means with a random number of summands. Their bound is for bounded distributions. Leveraging the properties of -subGaussian distributions, we obtain the following bound for the -subGaussian case. Recall that
Lemma A1.
Let ,
Let ,
The following inequality is the anti-concentration and concentration bound for Gaussian distributed random variables.
Fact 1
([35]). For a Gaussian distributed random variable X with mean μ and variance , for any
Since , we also have the following well-known result:
The following lemma is adapted from [32] and is often used in the analysis of Thompson sampling, which can transform the probability of selecting the ith arm into the probability of selecting the optimal arm .
Lemma A2.
Let . For any , ,
Lemma A3
([12]). For any , and ,
Appendix B. Detailed Proofs of Lemmas and Theorems
Appendix B.1. Proof of Lemma 1
Recall that . Since is a convex combination of elements , we have
We can write as . Thus, we have
Recall that . If , we have
Therefore, , we have
where the last inequality follows from .
If , , we have
If , from Equation (A2), we also have
By the definition of ,
Appendix B.2. Proof of Lemma 2
From the definition of in Equation (4), we can obtain
If , . Thus, we have
Therefore,
where (a) uses Lemma 1, (b) uses Equation (A4), (c) follows from , (d) uses Lemma A1.
Since , this ends the proof.
Appendix B.3. Proof of Lemma 3
This proof is adapted from [32], which treats the stationary setting. However, some technical problems are difficult to overcome in non-stationary settings. The tricky part is to lower bound the probability of the mean estimate of the optimal arm in Equation (A9). By designing the function and decomposing the regret so that Lemma A3 can be used again, we solve this challenge. We use blue font to emphasize the techniques used in the proof.
The proof is in three steps.
Step 1: We first prove that has an upper bound independent of t.
Define a Bernoulli experiment as sampling from , where success implies that . Let denote the number of experiments performed when the event first occurs. Then,
Let ( is an integer ) and let denote the maximum of r independent Bernoulli experiments. Then,
Using Fact A1,
For any , . Hence, for any ,
Therefore, for any ,
Next, we apply Lemma A1 to lower bound .
Since ,
Then, we have
Substituting, for any ,
Therefore,
This proves a bound of independent of t.
Step 2: Define . We consider the upper bound of when .
Now, since . Therefore, for any ,
Using Fact A1,
This implies
Also, apply the self-normalized Hoeffding-type inequality,
Let . Therefore, for any ,
When , we can use Equation (A10) to obtain
Combining these results
Therefore, when , it holds that
Step 3: Let .
Appendix B.4. Larger Variance
In fact, Lemma A1 has a stricter upper bound as , where . Let , then we obtain Lemma A1. Suppose the variance of Thompson sampling is . The lower bound of Equation (A9) becomes
To ensure that in Equation (A11) has a finite upper bound, our analysis method requires the infinite series
is convergent. Thus, we have
i.e., the sampling variance needs to be strictly greater than .
Appendix B.5. Proof of Lemma 4
Recall that . Using Lemma A1, we have
Appendix B.6. Proof of Lemma 5
The proof is similar to the proof of Lemma 3.
Step 1: We first prove that has an upper bound independent of t.
Let ( is an integer). Then,
Using Fact A1,
For any , . Hence, for any ,
Therefore, for any ,
Next, we apply Lemma A1 to lower bound .
Substituting, for any ,
Therefore,
This proves a bound of independent of t.
Step 2: Define . We consider the upper bound of when .
Now, since . Therefore, for any ,
Using Fact A1,
This implies
Also, apply Lemma A1,
Let . Therefore, for any ,
When , we can use Equation (A10) to obtain
Combining these results,
Therefore, when , it holds that
Step 3: Let and .
References
- Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 1952, 58, 527–535. [Google Scholar] [CrossRef]
- Li, L.; Chu, W.; Langford, J.; Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, 9–12 February 2011; pp. 297–306. [Google Scholar]
- Bouneffouf, D.; Bouzeghoub, A.; Ganarski, A.L. A contextual-bandit algorithm for mobile context-aware recommender system. In Neural Information Processing, Proceedings of the International Conference, ICONIP 2012, Doha, Qatar, 12–15 November 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 324–331. [Google Scholar]
- Li, S.; Karatzoglou, A.; Gentile, C. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 539–548. [Google Scholar]
- Schwartz, E.M.; Bradlow, E.T.; Fader, P.S. Customer acquisition via display advertising using multi-armed bandit experiments. Mark. Sci. 2017, 36, 500–522. [Google Scholar] [CrossRef]
- Wu, Q.; Iyer, N.; Wang, H. Learning contextual bandits in a non-stationary environment. In Proceedings of the The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 495–504. [Google Scholar]
- Liu, F.; Lee, J.; Shroff, N. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Cao, Y.; Wen, Z.; Kveton, B.; Xie, Y. Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Okinawa, Japan, 16–18 April 2019; pp. 418–427. [Google Scholar]
- Auer, P.; Gajane, P.; Ortner, R. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proceedings of the Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 138–158. [Google Scholar]
- Chen, Y.; Lee, C.W.; Luo, H.; Wei, C.Y. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Proceedings of the Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 696–726. [Google Scholar]
- Besson, L.; Kaufmann, E.; Maillard, O.A.; Seznec, J. Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits. J. Mach. Learn. Res. 2022, 23, 1–40. [Google Scholar]
- Garivier, A.; Moulines, E. On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory, Proceedings of the 22nd International Conference, ALT 2011, Espoo, Finland, 5–7 October 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 174–188. [Google Scholar]
- Raj, V.; Kalyani, S. Taming non-stationary bandits: A Bayesian approach. arXiv 2017, arXiv:1707.09727. [Google Scholar]
- Trovo, F.; Paladino, S.; Restelli, M.; Gatti, N. Sliding-window thompson sampling for non-stationary settings. J. Artif. Intell. Res. 2020, 68, 311–364. [Google Scholar] [CrossRef]
- Baudry, D.; Russac, Y.; Cappé, O. On Limited-Memory Subsampling Strategies for Bandits. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 727–737. [Google Scholar]
- Ghatak, G. A change-detection-based Thompson sampling framework for non-stationary bandits. IEEE Trans. Comput. 2020, 70, 1670–1676. [Google Scholar] [CrossRef]
- Alami, R.; Azizi, O. Ts-glr: An adaptive thompson sampling for the switching multi-armed bandit problem. In Proceedings of the NeurIPS 2020 Challenges of Real World Reinforcement Learning Workshop, Virtual, 6–12 December 2020. [Google Scholar]
- Viappiani, P. Thompson sampling for bayesian bandits with resets. In Algorithmic Decision Theory, Proceedings of the Third International Conference, ADT 2013, Bruxelles, Belgium, 12–14 November 2013; Proceedings 3; Springer: Berlin/Heidelberg, Germany, 2013; pp. 399–410. [Google Scholar]
- Gupta, N.; Granmo, O.C.; Agrawala, A. Thompson sampling for dynamic multi-armed bandits. In Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops, Honolulu, HI, USA, 18–21 December 2011; Volume 1, pp. 484–489. [Google Scholar]
- Cavenaghi, E.; Sottocornola, G.; Stella, F.; Zanker, M. Non stationary multi-armed bandit: Empirical evaluation of a new concept drift-aware algorithm. Entropy 2021, 23, 380. [Google Scholar] [CrossRef]
- Liu, Y.; Van Roy, B.; Xu, K. Nonstationary bandit learning via predictive sampling. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 25–27 April 2023; pp. 6215–6244. [Google Scholar]
- Fiandri, M.; Metelli, A.M.; Trovò, F. Sliding-Window Thompson Sampling for Non-Stationary Settings. arXiv 2024, arXiv:2409.05181. [Google Scholar]
- Qi, H.; Wang, Y.; Zhu, L. Discounted thompson sampling for non-stationary bandit problems. arXiv 2023, arXiv:2305.10718. [Google Scholar]
- Kocsis, L.; Szepesvári, C. Discounted ucb. In Proceedings of the 2nd PASCAL Challenges Workshop, Venice, Italy, 10–12 April 2006; Volume 2, pp. 51–134. [Google Scholar]
- Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 2002, 32, 48–77. [Google Scholar] [CrossRef]
- Besbes, O.; Gur, Y.; Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Combes, R.; Proutiere, A. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 521–529. [Google Scholar]
- Mellor, J.; Shapiro, J. Thompson sampling in switching environments with Bayesian online change detection. In Proceedings of the Artificial Intelligence and Statistics, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 442–450. [Google Scholar]
- Suk, J.; Kpotufe, S. Tracking Most Significant Arm Switches in Bandits. In Proceedings of the Conference on Learning Theory, London, UK, 2–5 July 2022; pp. 2160–2182. [Google Scholar]
- Qin, Y.; Menara, T.; Oymak, S.; Ching, S.; Pasqualetti, F. Non-stationary representation learning in sequential multi-armed bandits. In Proceedings of the ICML Workshop on Reinforcement Learning Theory, Virtual, 18-24 July 2021. [Google Scholar]
- Qin, Y.; Menara, T.; Oymak, S.; Ching, S.; Pasqualetti, F. Non-stationary representation learning in sequential linear bandits. IEEE Open J. Control Syst. 2022, 1, 41–56. [Google Scholar] [CrossRef]
- Agrawal, S.; Goyal, N. Further optimal regret bounds for thompson sampling. In Proceedings of the Artificial Intelligence and Statistics, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 99–107. [Google Scholar]
- Jin, T.; Xu, P.; Shi, J.; Xiao, X.; Gu, Q. Mots: Minimax optimal thompson sampling. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 5074–5083. [Google Scholar]
- Jin, T.; Xu, P.; Xiao, X.; Anandkumar, A. Finite-time regret of thompson sampling algorithms for exponential family multi-armed bandits. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; pp. 38475–38487. [Google Scholar]
- Abramowitz, M.; Stegun, I.A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables; US Government Printing Office: Washington, DC, USA, 1964; Volume 55.