Article

Modified Index Policies for Multi-Armed Bandits with Network-like Markovian Dependencies

Department of Computer and Information Science, Temple University, Philadelphia, PA 19122, USA
* Author to whom correspondence should be addressed.
Submission received: 29 October 2024 / Revised: 15 January 2025 / Accepted: 26 January 2025 / Published: 29 January 2025

Abstract

Sequential decision-making in dynamic and interconnected environments is a cornerstone of numerous applications, ranging from communication networks and finance to distributed blockchain systems and IoT frameworks. The multi-armed bandit (MAB) problem is a fundamental model in this domain that traditionally assumes independent and identically distributed (iid) rewards, which limits its effectiveness in capturing the inherent dependencies and state dynamics present in some real-world scenarios. In this paper, we lay a theoretical framework for a modified MAB model in which each arm’s reward is generated by a hidden Markov process. In our model, each arm undergoes Markov state transitions independent of play, resulting in varying reward distributions and heightened uncertainty in reward observations. Each arm can have up to three states. A key challenge arises from the fact that the underlying states governing each arm’s rewards remain hidden at the time of selection. To address this, we adapt traditional index-based policies and develop a modified index approach tailored to accommodate Markovian transitions and enhance selection efficiency for our model. Our proposed Markovian Upper Confidence Bound (MC-UCB) policy achieves logarithmic regret. Comparative analysis with the classical UCB algorithm reveals that MC-UCB consistently achieves approximately a 15% reduction in cumulative regret. This work provides significant theoretical insights and lays a robust foundation for future research aimed at optimizing decision-making processes in complex, networked systems with hidden state dependencies.

1. Introduction

Decision-making in environments with network-like dependencies presents a fundamental challenge across various fields, including communication networks, finance, and complex distributed systems [1,2,3,4]. In such environments, a decision-maker faces interconnected structures where actions taken on one element may influence the states or rewards of others, thereby creating dynamic dependencies reminiscent of those found in networked systems. Examples of such networks can be found in resource allocation across multiple communication channels in IoT (Internet of Things) sensor networks [5], throughput optimization in distributed blockchain ecosystems [6], adaptive QoS (Quality of Service) management in communication networks [7], and security or intrusion detection frameworks in large-scale system administration scenarios [8]. In these contexts, the multi-armed bandit (MAB) problem, where a player repeatedly selects among multiple uncertain options (arms), becomes more intricate due to underlying and often hidden state transitions that evolve over time.
The classical MAB formulation, introduced by Robbins [9,10], assumes that each arm’s reward distribution remains fixed and independent over time. However, in networked scenarios, these assumptions rarely hold: the reward distributions may shift due to underlying Markovian state transitions that are hidden from the decision-maker [11]. Arms in such a scenario can represent network nodes, communication links, or distributed resources whose performance and reliability evolve over time. The agent must continually learn and adapt, taking into account latent transitions that are reminiscent of evolving network conditions.
In this paper, we lay a theoretical framework for a modified MAB model in which each arm’s reward is generated by a hidden Markov process. This approach models the type of network-like dependencies found, for example, in dynamic IoT sensor networks—where channel conditions and sensor states change stochastically and are not directly observable, yet these state changes critically affect the rewards (e.g., reliable data transmission or efficient resource utilization). Each arm in our model can transition among up to three states, each associated with a different reward distribution, regardless of whether the arm is played. The result is a problem setting that demands sophisticated exploration–exploitation strategies that identify the best arms under evolving conditions and also cope with underlying dynamics that reflect network interdependencies.
In this context, we evaluate the decision-maker’s performance using the concept of regret, a metric that captures the cost of uncertainty in networked decision-making environments. Regret is defined as the difference between the expected reward an ideal policy—one with complete knowledge of all arm statistics or hindsight advantage—would achieve, and the reward achieved by the decision-maker’s actual strategy. An ideal policy would consistently select the arm yielding the highest expected reward over time. This concept, commonly referred to as weak regret, is a central performance measure in uncertain decision problems, as highlighted by Auer et al. [12]. Our study focuses on regret, particularly within interconnected, network-like settings.
MAB problems with Markovian rewards significantly heighten complexity due to dynamic dependencies that reflect networked interactions. Here, each arm is modeled as a Markov process with a finite set of states, each linked to a unique reward distribution. The transition between states follows a known probability matrix, introducing a memory element into the decision process where rewards depend not only on the current choice but also on the hidden state of each arm [13,14,15,16]. This Markovian structure effectively simulates a network in which states and rewards are dynamically interdependent over time.
The state transitions are determined by predefined probabilities, yet the exact state of each arm remains hidden. This creates a layer of opacity similar to unobserved interactions in networked systems [17,18,19]. Consequently, the player must infer each arm’s state from the history of observed rewards. This amplifies the challenge of the exploration–exploitation trade-off. The decision-maker faces a networked challenge: to exploit high-reward arms based on historical performance or to explore underused arms to reveal potential reward structures. Figure 1 illustrates an example of the problem and highlights the network-like dependencies across arms.
A core challenge in this interconnected framework is to develop strategies that effectively balance immediate rewards with potential future gains that could arise from transitioning into more advantageous states [20,21]. This networked trade-off between short-term exploitation and long-term exploration is not purely theoretical or network-related; it mirrors complex, real-world decision-making environments such as financial portfolio management or adaptive clinical trials where treatments impact outcomes over time [22,23,24,25].
In this work, we address these challenges by introducing a novel theoretical approach to the MAB problem with Markovian dynamics and network-like dependencies where each arm has up to three possible states. We adapt traditional index policies to account for the intricate structure of state transitions. Our focus is on refining these policies to achieve robust performance by attaining logarithmic regret even within the complex networked dynamics of hidden state transitions. We further compare our modified index-based policies with the classic upper confidence bound (UCB) algorithm. This study thus sets the stage for a deeper understanding of decision strategies within networked environments involving uncertainty and dynamic dependencies.

1.1. Main Findings

This paper makes the following theoretical contributions:
  • We demonstrate that for each arm, represented as an irreducible, finite-state, aperiodic, and reversible three-state Markov chain, simple sample mean-based index policies can achieve logarithmic regret uniformly over time, even in interconnected settings resembling networked dependencies.
  • We simplify the analysis of state transition probabilities by modeling the arms as Markov chains with identical rewards that capture basic network-like structures in which transitions are dependent on state dynamics.
  • We present a numerical comparison of the regret incurred by our sample mean-based index policy and evaluate its performance relative to other policies.

1.2. Application Context and Conceptual Validation in Network-like Scenarios

While our primary contribution is theoretical, it is helpful to illustrate how this framework can be built upon and conceptually extended to real network scenarios. Consider, for example, the following contexts:
  • Security [26]: Arms may represent intrusion detection strategies whose efficacy varies as an adversary’s tactics evolve over time. Each state transition corresponds to a shift in the threat environment. Our Markovian MAB framework can guide strategic decisions to maintain robust defense while learning dynamically about evolving threats.
  • Distributed Blockchain Systems [27]: Nodes or shards in a blockchain network might yield variable validation rewards depending on their state of congestion or consensus participation. The Markovian structure models the dynamic nature of node availability and network conditions in a way that would help a node operator choose where to allocate resources or which shard to support over time.
  • QoS in Communication Networks [7]: Network links may fluctuate between high-quality, moderate, and poor states due to changing traffic patterns. By representing each link as a Markovian arm, our framework can assist in selecting the best channel at any given time in order to balance the exploration of uncertain but potentially high-quality links with the exploitation of known reliable ones.
  • IoT and System Administration [28]: IoT nodes or servers can transition between states that reflect varying processing loads or energy conditions. The Markovian MAB model helps a controller decide which node to query or utilize for computations, thereby maximizing long-term performance.
In sum, while this work is focused on the theoretical aspects and fundamental results for up to three states, it offers a roadmap for future empirical explorations and practical implementations. The stylized simulation experiments that we show later serve as a preliminary demonstration and show that the theoretical principles hold in a controlled synthetic environment, thus setting the stage for subsequent research aiming at more comprehensive benchmarking in real-world network contexts.
The remainder of the paper is structured as follows. Section 2 gives the related work. Section 3 presents the preliminaries. Section 4 shows the problem formulation. The index policy and its regret analysis are given in Section 5. Section 6 shows our numerical simulation results, and finally, Section 7 concludes the paper.

2. Related Work

The literature on the MAB problem is vast and has evolved considerably from the original formulations focusing on independent and identically distributed (iid) reward processes. Early seminal work by Robbins and Lai  [9,10] established foundations for the iid case for certain known environments. Over time, researchers have explored a broad spectrum of MAB extensions that incorporate various forms of structure and dynamics. Notably, Markovian reward processes represent a key generalization and enable the modeling of scenarios where arm states—and thus rewards—evolve with memory and dependence on previous states.
Early explorations into Markovian bandits can be found in the work of Anantharam et al. [29], which analyzed index policies effective for arms governed by irreducible, finite-state, aperiodic Markov chains. Their approach demonstrated how arms with state-dependent rewards could still be tackled through index strategies that generalize the Gittins index concept [30]. While these studies set important precedents for handling Markovian structures, they often made simplifying assumptions, such as a single-parameter transition function or identical state spaces across arms. In contrast, our framework does not presume a single-parameter form for transition probabilities, nor does it require identical state spaces. By allowing each arm to transition among up to three states under distinct probability kernels, we offer a more flexible setting that can model diverse types of network dependencies.
Building upon this foundation, research has examined the problem of achieving low regret under more general conditions. Agrawal [31] and Auer et al. [32] established classical logarithmic regret results for iid settings. Their contributions included index and UCB-based strategies that guarantee optimal asymptotic and even uniformly logarithmic performance over time. They rely heavily on the iid assumption and do not directly address the complexities introduced by state transitions or network-like interdependencies. More recent works have begun to relax these assumptions. For instance, Garivier and Moulines [33] and Besbes et al. [34] considered bandit problems with non-stationary reward distributions in a way that captures some aspects of temporal dynamics without fully embracing Markovian state dependence. Such approaches typically rely on “resetting” or “sliding-window” techniques that do not directly exploit known Markovian transition structures.
In parallel, other authors have studied scenarios where multiple users or decision-makers interact with the same set of arms in network settings, leading to complex dynamics and collisions among players [35,36,37]. Here, the challenge lies in coordinating multiple agents to minimize interference and collectively achieve low regret. While such multi-player frameworks mirror network complexity, their primary focus is on handling concurrency and competition rather than modeling state evolution within each arm. Our approach differs by focusing explicitly on Markovian transitions at the arm level rather than strategic interactions among multiple decision-makers.
The distinction between rested and restless bandits further highlights the complexity of Markovian settings. In classical rested bandits, the state of an unplayed arm remains frozen until it is chosen again, as examined in works like those of Ortner [38] and Raj and Kalyani [39]. However, in restless bandits, arm states evolve regardless of selection, making the problem significantly more complex. Work on the restless bandit formulation has explored structural results and approximation algorithms for special cases [11,40]. Our framework takes a step forward by considering a setting in which all arms transition at every round, falling somewhere between the fully rested and fully restless extremes, and by establishing logarithmic regret bounds in this intermediate regime.
Compared to the closely related studies such as [15,29], our work introduces a novel solution. For instance, Tekin et al. [15] restrict attention to two-state arms with transitions occurring only when the arm is played, which simplifies the analysis but limits applicability. In [29], the reward-generating process is governed by a single parameter and identical state spaces across all arms. In contrast, our model allows each arm to have distinct state spaces and transition matrices, and does not rely on a single-parameter structure. We also require that the reward process be reversible, a mild assumption that enables cleaner theoretical analysis. The indices we derive rely on sample means rather than complicated recursive computations, and yield uniform logarithmic regret bounds rather than merely asymptotic guarantees.
Lastly, recent theoretical studies on bandits with structure—such as Liu et al. [41], who considered bandits with feedback graphs, or Chen et al. [42], who looked at dynamic networked scenarios—point to a growing interest in incorporating more nuanced dependencies into MAB models. Our results add to this literature by providing a more direct handle on Markovian state transitions within a theoretically grounded bandit framework.
In sum, our work occupies a unique position at the intersection of Markovian bandits, structured bandit problems, and theoretical analyses that strive for uniform logarithmic regret. While prior research established important groundwork in various specialized settings, we advance the state of the art by offering a flexible, three-state Markovian model, clear conditions for reversibility, and efficient index-based strategies that can be analyzed rigorously. This sets the stage for future studies aiming to extend these techniques to an even broader range of network-like environments and more complex state spaces.

3. Preliminaries

This section provides an introduction to essential concepts that form the foundation for our study of MABs with Markovian rewards, particularly in environments where network-like dependencies may influence state transitions. We begin by discussing Markov processes, which are essential for understanding the dynamic and interconnected nature of our model, and proceed to explore fundamental aspects of MAB problems with a focus on the complexities introduced by Markovian reward structures.

3.1. Markov Processes

A Markov process is a stochastic model that describes a sequence of possible events where the probability of each event depends only on the state attained in the previous event. In the context of Markov processes, the future is independent of the past given the present. This property, known as the Markov property, is central to our analysis of bandit arms as Markov chains, which can exhibit dependencies across states that reflect networked interactions over time.
For a given Markov process, we define a state space $X$ that contains all possible states the process can occupy. The transitions between these states are governed by probabilities defined in a transition matrix $P$, where each entry $P_{uv}$ represents the probability of moving from state $u$ to state $v$. This matrix is fundamental for predicting and understanding the behavior of interconnected systems over time.
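As a concrete illustration (our own example, not taken from the paper), the following minimal Python sketch builds a hypothetical three-state transition matrix and computes its stationary distribution, the quantity that later determines each arm's mean reward in Section 4.

import numpy as np

# Hypothetical three-state transition matrix (rows sum to 1).
P = np.array([
    [0.80, 0.15, 0.05],
    [0.20, 0.60, 0.20],
    [0.10, 0.30, 0.60],
])

def stationary_distribution(P):
    """Return phi with phi @ P = phi and phi.sum() == 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    # Pick the eigenvector associated with eigenvalue 1.
    idx = np.argmin(np.abs(eigvals - 1.0))
    phi = np.real(eigvecs[:, idx])
    return phi / phi.sum()

phi = stationary_distribution(P)
print(phi)  # approximately [0.44, 0.33, 0.22] for the matrix above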

3.2. Markov Decision Processes in Bandit Problems

In MABs, a Markov Decision Process (MDP) provides a framework for decision-making where transitions between states are determined not only by the current state but also by the action taken by the decision-maker. Each action in an MDP results in a reward and a transition to the next state where each arm pull can be viewed as an action within a potentially networked system of state dependencies.
In a typical MAB problem with Markovian rewards, each arm represents an independent Markov process. The player’s objective is to maximize cumulative rewards over a sequence of arm pulls. The decision of which arm to pull involves evaluating the current state of each arm and estimating potential rewards based on state transition probabilities, akin to navigating networked dependencies where each choice impacts future outcomes in interconnected states.

3.3. Exploration vs. Exploitation in Markovian Bandits

A key challenge in MAB problems is the trade-off between exploration and exploitation. This dilemma is more pronounced in Markovian bandits due to the changing state of each arm. Exploration involves pulling less-understood arms to gain more information about their reward distributions and state transitions. Exploitation means choosing arms that are currently known to offer higher rewards based on accumulated knowledge.
Balancing these strategies is crucial for achieving optimal performance, especially when the bandit arms exhibit state-dependent rewards that evolve according to Markov dynamics. The player must not only consider immediate rewards but also the potential future benefits of being in favorable states.
The concepts introduced in this section provide the necessary background to appreciate the complexities involved in our study of MABs with Markovian rewards. Understanding these principles is essential for developing effective strategies and algorithms to tackle the dynamic and probabilistic nature of the problem.

4. Problem Formulation

We consider a scenario comprising $K$ distinct arms, each labeled by an index $i \in \{1, 2, \ldots, K\}$. Each arm $i$ is represented as an irreducible Markov chain with a finite state space denoted by $X^{(i)}$. The transition kernel of arm $i$ is known and is described by a probability matrix $P^{(i)} = \{p^{(i)}_{uv} : u, v \in X^{(i)}\}$. Every state $u$ of arm $i$ yields a stationary and strictly positive reward $r^{(i)}_u$. We assume that the $K$ Markov chains (one per arm) are mutually independent. Let $\phi^{(i)} = \{\phi^{(i)}_u : u \in X^{(i)}\}$ be the stationary distribution of the $i$th arm. The mean reward of arm $i$, denoted by $\nu_i$, can then be expressed as

$$\nu_i = \sum_{u \in X^{(i)}} r^{(i)}_u\, \phi^{(i)}_u.$$

The arm with the largest mean reward is indicated by a superscript $*$, so that $\nu^* = \max_{1 \le i \le K} \nu_i$. We define the regret of a policy $\alpha$ after $n$ steps, $R_\alpha(n)$, as the difference between the expected cumulative reward that would be obtained by always selecting the best arm and the actual expected cumulative reward gathered under policy $\alpha$. If $\alpha(t)$ denotes the arm chosen by $\alpha$ at time $t$ and $x^{\alpha}(t)$ the state visited by that arm at time $t$, we have

$$R_\alpha(n) = n\nu^* - E_\alpha\left[\sum_{t=1}^{n} r^{(\alpha(t))}_{x^{\alpha}(t)}\right].$$
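To make these definitions concrete, here is a small illustrative Python sketch (the transition matrices and per-state rewards are made up for this example and are not the paper's experimental values) that computes each arm's mean reward from its stationary distribution and the weak regret of always playing one fixed arm.

import numpy as np

def stationary(P):
    # Left eigenvector of P for eigenvalue 1, normalized to sum to 1.
    vals, vecs = np.linalg.eig(P.T)
    phi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return phi / phi.sum()

# Two hypothetical arms, each a three-state chain with per-state rewards.
P = [np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]),
     np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])]
r = [np.array([1.0, 0.5, 0.2]), np.array([0.9, 0.4, 0.1])]

nu = [float(stationary(Pi) @ ri) for Pi, ri in zip(P, r)]   # mean rewards nu_i
nu_star = max(nu)

# Weak regret (in expectation) of always playing arm i for n rounds.
n = 1000
for i, nui in enumerate(nu):
    print(f"arm {i}: nu = {nui:.3f}, regret of always playing it = {n * (nu_star - nui):.1f}")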
In principle, if one always knew which arm has the highest mean reward, playing that arm indefinitely would constitute the optimal single-arm selection strategy. Nonetheless, this would not necessarily be the best policy among all possible stationary and non-stationary policies if the entire statistical structure of the arms were fully known. In the broader scenario over an infinite horizon, the optimal policy is characterized by the Gittins index, as introduced by Gittins [30]. If each arm’s rewards were iid, then the optimal solution over all admissible policies would simply be to consistently choose the best single-action arm. In our work, we limit our performance comparison to this single-action benchmark.
To investigate policies that minimize regret, we employ a series of preliminary results to relate the regret $R_\alpha(n)$ to the expected number of times suboptimal arms are played. For a given policy $\alpha$, let $M_{\alpha,i}(t)$ represent the total number of times arm $i$ is pulled up to time $t$. Understanding the connection between regret and $E_\alpha[M_{\alpha,i}(n)]$ proves critical.
We invoke the following lemma to establish a key relationship. We adapt and modify its proof here for completeness:
Lemma 1 
(Adapted from Lemma 2.1 in [29]). Consider a Markov chain $Y$ that is irreducible, aperiodic, and has a finite state space $S$. Its transitions are governed by a probability matrix $P$, and it begins with an initial distribution in which all states have strictly positive probability. Let $F_t$ be the σ-algebra generated by the sequence of states $X_1, X_2, \ldots, X_t$, where $X_t$ is the state at time $t$. Suppose $G$ is a σ-algebra independent of $F = \bigvee_{t \ge 1} F_t$. Consider a stopping time τ with respect to the sequence of σ-algebras $\{G \vee F_t : t \ge 1\}$. Define the visitation count of a particular state $x \in S$ up to time τ by

$$N(x, \tau) = \sum_{t=1}^{\tau} I(X_t = x).$$

If $E[\tau]$ is finite, then there exists a constant $D(P)$ (depending solely on $P$) such that

$$\big| E[N(x, \tau)] - \phi_x\, E[\tau] \big| \le D(P),$$

where $\phi = \{\phi_x : x \in S\}$ is the stationary distribution of the chain.
Proof of Lemma 1. 
Consider the sequence of regeneration times $\{\tau_k : k \ge 0\}$ defined by

$$\tau_0 = 0, \qquad \tau_k = \min\{t > \tau_{k-1} : X_t = X_1\}, \quad k \in \mathbb{N}.$$

Given the chain’s irreducibility, we assert that $\tau_k < \infty$ for every $k$. Let $B_k$ be the $k$th “block” of the chain:

$$B_k = (X_{\tau_{k-1}+1}, X_{\tau_{k-1}+2}, \ldots, X_{\tau_k}).$$

By the regenerative property of Markov chains, the blocks $B_k$ are iid. The expected number of visits to $x$ in a typical block is $E[N(x, B_1)] = \phi_x E[l(B_1)]$, where $l(B_1)$ is the length of the block $B_1$.
Define $T$ as the first return time to $X_1$ after time $\tau$:

$$T = \min\{t > \tau : X_t = X_1\} = \tau_{\kappa}$$

for some $\kappa$. Note that $T - \tau$ is also finite in expectation due to irreducibility. Applying Wald’s identity,

$$E\left[\sum_{t=1}^{T-1} I(X_t = x)\right] = E[\kappa]\, E[N(x, B_1)] = \phi_x\, E[l(B_1)]\, E[\kappa].$$

Similarly,

$$E(T - 1) = E[\kappa]\, E[l(B_1)].$$

Because $E(T - \tau) \le D(P)$ for some constant $D(P)$, we have for any $x \in S$:

$$N(x, T) - (T - \tau) \le N(x, \tau) < N(x, T),$$
$$\phi_x E(T - 1) - D(P) \le E[N(x, \tau)] \le \phi_x E(T - 1) + 1,$$
$$\phi_x E[\tau] - D(P) \le E[N(x, \tau)] \le \phi_x E[\tau] + D(P),$$
$$\big| E[N(x, \tau)] - \phi_x E[\tau] \big| \le D(P).$$

Thus, we have shown the stated bound, completing the proof.    □
Next, we relate the regret $R_\alpha(n)$ to $E_\alpha[M_{\alpha,i}(n)]$, the expected count of plays of each arm $i$ up to time $n$.
Lemma 2. 
Under the conditions of Lemma 1, consider any strategy $\alpha$ that ensures the average time between successive pulls of any given arm remains bounded. Then, there exists a constant $D(X, P, R)$, depending on the state spaces $\{X^{(i)}\}$, the probability matrices $\{P^{(i)}\}$, and the reward structures $\{r^{(i)}_u\}$, such that

$$R_\alpha(n) \le \sum_{i=1}^{K} (\nu^* - \nu_i)\, E_\alpha[M_{\alpha,i}(n)] + D(X, P, R).$$
Proof of Lemma 2. 
For each arm $i$, let $H^i = \bigvee_{j \ne i} F^{(j)}$ be the σ-algebra generated by the observations of all arms except arm $i$. Since the arms are independent, $H^i$ is independent of $F^{(i)}$, the filtration associated with arm $i$. Note that $M_{\alpha,i}(n)$ is a stopping time with respect to $\{H^i \vee F^{(i)}_t : t \ge 1\}$.
Denote by $\{X^{(i)}(1), X^{(i)}(2), \ldots, X^{(i)}(M_{\alpha,i}(n))\}$ the sequence of states visited by arm $i$ within the first $n$ steps of the policy $\alpha$. The total collected reward up to time $n$ is

$$\sum_{t=1}^{n} r^{(\alpha(t))}_{x^{\alpha}(t)} = \sum_{i=1}^{K} \sum_{j=1}^{M_{\alpha,i}(n)} \sum_{v \in X^{(i)}} r^{(i)}_v\, I\big(X^{(i)}(j) = v\big).$$

By definition of regret,

$$R_\alpha(n) = n\nu^* - E_\alpha\left[\sum_{t=1}^{n} r^{(\alpha(t))}_{x^{\alpha}(t)}\right].$$

Rewriting and employing linearity of expectation,

$$R_\alpha(n) = n\nu^* - \sum_{i=1}^{K} \nu_i\, E_\alpha[M_{\alpha,i}(n)] - \left( E_\alpha\left[\sum_{i=1}^{K} \sum_{j=1}^{M_{\alpha,i}(n)} \sum_{v \in X^{(i)}} r^{(i)}_v\, I\big(X^{(i)}(j) = v\big)\right] - \sum_{i=1}^{K} \sum_{v \in X^{(i)}} r^{(i)}_v\, \phi^{(i)}_v\, E_\alpha[M_{\alpha,i}(n)] \right).$$

Since $\big| E[N(v, M_{\alpha,i}(n))] - \phi^{(i)}_v E_\alpha[M_{\alpha,i}(n)] \big| \le D(P^{(i)})$ by Lemma 1 (applied to each arm’s Markov chain), and since $\sum_{i=1}^{K} M_{\alpha,i}(n) = n$, we have

$$R_\alpha(n) \le \sum_{i=1}^{K} (\nu^* - \nu_i)\, E_\alpha[M_{\alpha,i}(n)] + \sum_{i=1}^{K} \sum_{v \in X^{(i)}} D(P^{(i)})\, r^{(i)}_v.$$

The constant term in this upper bound depends on all the arms’ state spaces, transition laws, and reward distributions. We thus denote this cumulative constant by $D(X, P, R)$, concluding the proof.    □
In essence, Lemma 2 states that the regret of any policy can be bounded by a term that sums, over all arms, the product of their respective expected selection counts and their suboptimality gap $(\nu^* - \nu_i)$, plus a constant. This insight lays the groundwork for subsequent analysis and the development of regret-minimizing strategies.

5. A Solution to the Problem with Bounded Regret

In this section, we explore a sample-based index policy, which is a UCB-type policy, modified from the one introduced by [32]. This approach is adapted to our setting, where each arm evolves according to a Markovian state process. Algorithm 1 shows the policy, which we call the Markovian UCB (MC-UCB) policy.
Let $r^{(i)}(m)$ denote the $m$th observed reward from arm $i$ and $M_i(n)$ the number of times arm $i$ has been selected up to (and including) time $n$. We define the empirical mean reward for arm $i$ after $n$ steps as

$$\bar{r}^{(i)}(M_i(n)) = \frac{r^{(i)}(1) + r^{(i)}(2) + \cdots + r^{(i)}(M_i(n))}{M_i(n)}.$$

At each time step, the policy assigns an index to each arm. For arm $i$ at step $n$, this index is denoted by $h^{(i)}_{n, M_i(n)}$. The arm chosen at time $n$ is the one with the highest index.
The index is computed as follows. Initially, each arm is played exactly once. Every time an arm is played, its empirical mean $\bar{r}^{(i)}(\cdot)$ is updated and forms the first component of the index. For arms that are not played, the uncertainty regarding their true mean reward increases, captured by an exploration term added to the index. The resulting index at time $n$ for arm $i$ is of the form

$$h^{(i)}_{n, M_i(n)} = \bar{r}^{(i)}(M_i(n)) + \sqrt{\frac{\alpha \ln n}{M_i(n)}},$$

where the constant $\alpha$ is set to 2, similar to the standard UCB policy [32].
Algorithm 1 Markovian UCB (MC-UCB)
Require: Number of arms $K$, horizon $T$, and known transition kernels $\{p^{(i)}_{uv} : u, v \in X^{(i)}\}$ for each arm $i$.
Ensure: Sequence of selected arms $\{a_1, a_2, \ldots, a_T\}$.
Initialization: $t \leftarrow 1$.
1: while $t \le K$ do
2:   Select arm $a_t = t$.
3:   $t \leftarrow t + 1$.
4: while $t \le T$ do
5:   for each arm $i \in \{1, 2, \ldots, K\}$ do
6:     Calculate $\bar{r}^{(i)}(M_i(t)) = \dfrac{r^{(i)}(1) + r^{(i)}(2) + \cdots + r^{(i)}(M_i(t))}{M_i(t)}$.
7:   Select arm $a_t = \arg\max_i \left\{ \bar{r}^{(i)}(M_i(t)) + \sqrt{\dfrac{\alpha \ln t}{M_i(t)}} \right\}$.
8:   $t \leftarrow t + 1$.
9: return $\{a_1, a_2, \ldots, a_T\}$.
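The following is a minimal Python sketch of Algorithm 1 under simple assumptions: the MarkovArm class is a hypothetical stand-in for the environment (its interface is ours, not the paper's), and the default exploration constant follows the α = 2 choice above. It also realizes the incremental running-sum update discussed in the next paragraph, so no reward history needs to be stored.

import math
import random

class MarkovArm:
    """Hypothetical three-state arm: it transitions every round, reward depends on its state."""
    def __init__(self, P, rewards):
        self.P, self.rewards, self.state = P, rewards, 0
    def step(self):
        self.state = random.choices(range(len(self.rewards)), weights=self.P[self.state])[0]
    def reward(self):
        return self.rewards[self.state]

def mc_ucb(arms, T, alpha=2.0):
    K = len(arms)
    sums, counts, chosen = [0.0] * K, [0] * K, []
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                      # play each arm once first
        else:
            arm = max(range(K), key=lambda i:
                      sums[i] / counts[i] + math.sqrt(alpha * math.log(t) / counts[i]))
        sums[arm] += arms[arm].reward()      # O(1) running-sum update
        counts[arm] += 1
        chosen.append(arm)
        for a in arms:                       # all arms transition, played or not
            a.step()
    return chosen

# Example use with two made-up arms.
arms = [MarkovArm([[0.8, 0.15, 0.05], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]], [1.0, 0.5, 0.2]),
        MarkovArm([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]], [0.9, 0.4, 0.1])]
print(mc_ucb(arms, T=1000)[:20])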
The proposed MC-UCB algorithm demonstrates favorable scalability with respect to both the number of arms K and the number of states per arm. At each time step, the algorithm performs a straightforward computation of the empirical mean reward for each arm, which can be efficiently maintained using incremental updating formulas. Specifically, instead of storing all past rewards, the algorithm only requires maintaining a running sum and count of rewards for each arm; thereby, it ensures constant time and space complexity per arm. Consequently, the overall computational complexity per round scales linearly with the number of arms, i.e., O ( K ) , which makes it highly efficient even as K grows.
Moreover, since each arm is modeled with a finite and small number of states (up to three in our theoretical framework), the state transition management incurs minimal overhead. The known transition probabilities allow for precomputing stationary distributions, which can be utilized to optimize the index calculations without necessitating real-time state inference. This precomputation further reduces the computational burden during the decision-making process. However, it is important to acknowledge that extending the model to accommodate a significantly larger number of states or unknown transition probabilities would introduce additional complexity. Future work could explore approximate methods or hierarchical indexing strategies to mitigate potential inefficiencies in such scenarios. Nonetheless, within the current scope of three-state arms, the MC-UCB algorithm remains computationally tractable and well suited for possible applications that require rapid and scalable decision-making.
Below, we will show that the expected regret of this index policy grows at most on the order of ln ( n ) . To establish this, we will upper-bound the expected frequency with which any suboptimal arm (those with mean reward smaller than ν ) is chosen. A crucial tool for this analysis is a lemma from Gillman [43], which provides a bound on the probability that the empirical frequency of visits to a subset of states deviates significantly from its stationary distribution.
Lemma 3 
(Based on Theorem 2.1 in [43]). Consider a reversible, irreducible, aperiodic Markov chain with a finite state space $X$ and transition matrix $P$. Let $q$ be an initial distribution and define $N_q = \|(q_x/\phi_x,\ x \in X)\|_2$. Let $\lambda_2$ be the second largest eigenvalue of $P$ and define $\epsilon = 1 - \lambda_2$. For a subset of states $W \subseteq X$, define $\phi_W = \sum_{x \in W} \phi_x$ and let $t_W(n)$ be the count of visits to $W$ up to time $n$. Then, for any $\beta \ge 0$,

$$P\big(t_W(n) - n\phi_W \ge \beta\big) \le \left(1 + \frac{\beta \epsilon}{10 n}\right) N_q\, e^{-\beta^2 \epsilon / (20 n)}.$$
Proof of Lemma 3. 
The proof can be directly derived from Theorem 2.1 in [43]. □
We now proceed to the main theorem for our policy. The proof utilizes techniques analogous to those in [32] to derive logarithmic regret bounds for the MC-UCB policy.
Theorem 1. 
Consider $K$ arms, each arm $i$ being modeled as a finite-state, irreducible, aperiodic, and reversible Markov chain with state space $X^{(i)}$. All rewards $r^i_x$ are strictly positive. Let

$$\phi_{\min} = \min_{1 \le i \le K,\, x \in X^{(i)}} \phi^i_x, \qquad r_{\max} = \max_{1 \le i \le K,\, x \in X^{(i)}} r^i_x, \qquad r_{\min} = \min_{1 \le i \le K,\, x \in X^{(i)}} r^i_x,$$
$$X_{\max} = \max_{1 \le i \le K} |X^{(i)}|, \qquad \epsilon_{\max} = \max_{1 \le i \le K} \epsilon_i, \qquad \epsilon_{\min} = \min_{1 \le i \le K} \epsilon_i.$$
Let the constant $\alpha$ satisfy $\alpha \ge 100\, X_{\max}^2\, r_{\max}^2 / \epsilon_{\min}$. Then, the regret $R(n)$ of the MC-UCB policy is upper-bounded by
$$R(n) \le 5\alpha \sum_{i:\, \nu_i < \nu^*} \frac{\ln n}{\nu^* - \nu_i} + \sum_{i:\, \nu_i < \nu^*} (\nu^* - \nu_i)\, C_i + D(X, P, R),$$

where

$$C_i = (D_i + D^*)\beta + 1, \qquad D_i = \frac{|X^{(i)}|}{\phi_{\min}} \left(1 + \frac{\epsilon_{\max}\sqrt{\alpha}}{12\, |X^{(i)}|\, r_{\min}}\right), \qquad \beta = \sum_{t=1}^{\infty} \frac{1}{t^2} = \pi^2/6.$$
Proof of Theorem 1. 
We analyze the performance of the UCB strategy with a parameter $\beta$ dictating the magnitude of the confidence intervals. Unless noted otherwise, the notation omits superscripts related to the policy for brevity. For each arm $i$, let $\bar{r}_i(M_i(n))$ denote the empirical mean reward after $M_i(n)$ plays. Define

$$c_{t,s} = \sqrt{\frac{\beta \ln t}{s}}$$

to represent the confidence width. Let $m$ be a positive integer. The number of times arm $i$ is selected up to time $n$ is

$$M_i(n) = 1 + \sum_{t=K+1}^{n} I(\beta(t) = i).$$

We bound this as follows:

$$M_i(n) = 1 + \sum_{t=K+1}^{n} I(\beta(t) = i) \le m + \sum_{t=K+1}^{n} I\big(\beta(t) = i,\ M_i(t-1) \ge m\big).$$

Define the event $\delta_i(t, m)$ by the inequality

$$\bar{r}^*(M^*(t-1)) + c_{t-1, M^*(t-1)} \le \bar{r}_i(M_i(t-1)) + c_{t-1, M_i(t-1)},$$

and let $\xi_i(t, m)$ correspond to

$$\min_{0 < s < t} \big(\bar{r}^*(s) + c_{t-1, s}\big) \le \max_{m < s_i < t} \big(\bar{r}_i(s_i) + c_{t-1, s_i}\big).$$

Since $\{\beta(t) = i,\ M_i(t-1) \ge m\}$ implies $\delta_i(t, m)$, and $\delta_i(t, m)$ implies $\xi_i(t, m)$, we have

$$M_i(n) \le m + \sum_{t=K+1}^{n} I(\xi_i(t, m)).$$

Expanding over all indices, one can rewrite this as

$$M_i(n) \le m + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = m}^{t-1} I\big(\bar{r}^*(s) + c_{t,s} \le \bar{r}_i(s_i) + c_{t,s_i}\big).$$
For $\bar{r}^*(s) + c_{t,s} \le \bar{r}_i(s_i) + c_{t,s_i}$ to hold, at least one of the following must be true:

$$\bar{r}^*(s) \le \nu^* - c_{t,s}, \qquad \bar{r}_i(s_i) \ge \nu_i + c_{t,s_i}, \qquad \text{or} \qquad \nu^* < \nu_i + 2 c_{t,s_i}.$$

To prevent $\nu^* < \nu_i + 2 c_{t,s_i}$ from holding, choose

$$s_i \ge \frac{3\alpha \ln n}{(\nu^* - \nu_i)^2}$$

to ensure $2 c_{t,s_i} \le \nu^* - \nu_i$. Let $k = 3\alpha \ln n / (\nu^* - \nu_i)^2$. Consequently,

$$E[M_i(n)] \le \frac{3\alpha \ln n}{(\nu^* - \nu_i)^2} + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = k}^{t-1} P\big(\bar{r}^*(s) \le \nu^* - c_{t,s}\big) + \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = k}^{t-1} P\big(\bar{r}_i(s_i) \ge \nu_i + c_{t,s_i}\big).$$
We now employ the Markov chain deviation bounds. For each arm $i$, let $q^i$ be its initial distribution and

$$N_{q^i} = \left\| \left(\frac{q^i_y}{\phi^i_y},\ y \in X^{(i)}\right) \right\|_2.$$

Since $q^i_y > 0$ and $\phi^i_y \ge \phi_{\min}$, we have $N_{q^i} \le 1/\phi_{\min}$ (using Minkowski’s inequality). Thus, consider the probability

$$P\big(\bar{r}_i(s_i) \ge \nu_i + c_{t,s_i}\big).$$

Rewriting this event in terms of state visits and leveraging the deviation bound of Lemma 3 (adapted to our setting), we obtain
$$P\big(\bar{r}_i(s_i) \ge \nu_i + c_{t,s_i}\big) \le \sum_{y \in X^{(i)}} P\!\left( r^i_y\, n^i_y(s_i) \ge r^i_y\, s_i\, \phi^i_y + \frac{s_i\, c_{t,s_i}}{|X^{(i)}|} \right) = \sum_{y \in X^{(i)}} P\!\left( r^i_y\, n^i_y(s_i) - r^i_y\, s_i\, \phi^i_y \ge \frac{s_i\, c_{t,s_i}}{|X^{(i)}|} \right)$$
$$\le \sum_{y \in X^{(i)}} \left(1 + \frac{\epsilon_i \sqrt{\beta \ln t / s_i}}{12\, |X^{(i)}|\, r^i_y}\right) N_{q^i}\, t^{-\frac{\beta \epsilon_i}{25\, |X^{(i)}|^2 (r^i_y)^2}} \le \sum_{y \in X^{(i)}} \left(1 + \frac{\epsilon_{\max} \sqrt{\beta \ln t}}{12\, |X^{(i)}|\, r_{\min}}\right) N_{q^i}\, t^{-\frac{\beta \epsilon_{\min}}{25\, X_{\max}^2 r_{\max}^2}},$$

where $n^i_y(s_i)$ denotes the number of visits to state $y$ during the first $s_i$ plays of arm $i$.
Substituting the value of $N_{q^i}$,

$$P\big(\bar{r}_i(s_i) \ge \nu_i + c_{t,s_i}\big) \le \sum_{y \in X^{(i)}} \left(1 + \frac{\epsilon_{\max} \sqrt{\beta \ln t / s_i}}{12\, |X^{(i)}|\, r_{\min}}\right) \frac{1}{\phi_{\min}}\, t^{-\frac{\beta \epsilon_{\min}}{25\, X_{\max}^2 r_{\max}^2}} = \frac{|X^{(i)}|}{\phi_{\min}} \left(1 + \frac{\epsilon_{\max} \sqrt{\beta \ln t / s_i}}{12\, |X^{(i)}|\, r_{\min}}\right) t^{-\frac{\beta \epsilon_{\min}}{25\, X_{\max}^2 r_{\max}^2}}.$$

A similar bound holds for $P\big(\bar{r}^*(s) \le \nu^* - c_{t,s}\big)$, with $|X^{(i)}|$ and $r_{\min}$ replaced by the corresponding quantities of the best arm’s chain $X^{(*)}$. These upper bounds decay geometrically in $t$, ensuring summability. Detailed manipulation leads to
$$(\nu^* - \nu_i)\, E[M_i(n)] \le \frac{4\alpha \ln n}{\nu^* - \nu_i} + (\nu^* - \nu_i)\, C_i.$$

Summing over all suboptimal arms $i$ such that $\nu_i < \nu^*$,

$$\sum_{i:\, \nu_i < \nu^*} (\nu^* - \nu_i)\, E[M_i(n)] \le 4\alpha \sum_{i:\, \nu_i < \nu^*} \frac{\ln n}{\nu^* - \nu_i} + \sum_{i:\, \nu_i < \nu^*} (\nu^* - \nu_i)\, C_i.$$

Incorporating the additional constant term $D(X, P, R)$ from Lemma 2, we finally establish

$$R(n) \le 5\alpha \sum_{i:\, \nu_i < \nu^*} \frac{\ln n}{\nu^* - \nu_i} + \sum_{i:\, \nu_i < \nu^*} (\nu^* - \nu_i)\, C_i + D(X, P, R).$$
This proves the stated theorem. □
The obtained bound on $R(n)$ is of order $\ln n$, similar to known asymptotic results, but holds uniformly in $n$. The constant factors, however, depend on various parameters, including the stationary distributions, the eigenvalue gaps $\epsilon_i$, and the reward range. Proper selection of a sufficiently large $\alpha$ (based on $\epsilon_{\min}$, $X_{\max}$, and $r_{\max}$) strengthens our result. Although setting a large $\alpha$ is not necessary for the asymptotic scaling, it simplifies the analysis and ensures that the exploration term dominates initially, resulting in uniformly logarithmic regret over time.
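As an illustration of how the exploration constant of Theorem 1 can be instantiated (the chains below are made-up examples, not values from the paper), the following sketch computes each arm's eigenvalue gap $\epsilon_i = 1 - \lambda_2$ and the resulting threshold $\alpha \ge 100\, X_{\max}^2 r_{\max}^2 / \epsilon_{\min}$.

import numpy as np

# Hypothetical three-state arms: (transition matrix, per-state rewards).
arms = [
    (np.array([[0.8, 0.15, 0.05], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]), np.array([1.0, 0.5, 0.2])),
    (np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]), np.array([0.9, 0.4, 0.1])),
]

def eigenvalue_gap(P):
    # 1 minus the second-largest (real) eigenvalue of the transition matrix.
    lam = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
    return 1.0 - lam[1]

eps_min = min(eigenvalue_gap(P) for P, _ in arms)
X_max = max(P.shape[0] for P, _ in arms)
r_max = max(r.max() for _, r in arms)

alpha = 100 * X_max**2 * r_max**2 / eps_min
print(f"eps_min = {eps_min:.3f}, alpha >= {alpha:.1f}")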
Such constants are influenced by the intricate structure of the underlying Markov chains. In special cases, these complexities can be simplified. In the next section, we present a specific example of the index policy.
The above analysis and the resulting logarithmic regret guarantees rely critically on the assumption that the state transition probabilities for each arm are precisely known. Under this assumption, the decision-maker can form accurate estimates of each arm’s mean reward and state distribution over time. If these transition probabilities are even slightly uncertain, the issue becomes significantly more complex. Suppose there exists a small but fixed deviation $\delta > 0$ such that for each arm $i$, the true transition probability $p^{(i)}_{uv}$ satisfies $|p^{(i)}_{uv} - \hat{p}^{(i)}_{uv}| \le \delta$ for the available (estimated) probabilities $\{\hat{p}^{(i)}_{uv}\}$. Although $\delta$ can be arbitrarily small, it introduces a persistent, non-vanishing discrepancy that compounds over time and directly impacts the estimation of the arms’ stationary distributions and expected rewards.
To illustrate the effect of this discrepancy, consider the long-term frequency of visits to a particular state $x \in X^{(i)}$. When the transition probabilities are exact, our analysis ensures that the empirical frequency closely matches the true stationary distribution $\phi^{(i)}_x$. However, with even a small error $\delta$, let the induced perturbed stationary measure be $\phi^{(i), \delta}_x$. As $n \to \infty$, the difference $|\phi^{(i)}_x - \phi^{(i), \delta}_x|$ does not vanish, and any reward estimation relying on the exact stationary distribution becomes systematically biased. This persistent bias undermines the correctness of confidence intervals derived under the assumption of known transition probabilities. Consequently, the index computations that yield logarithmic regret bounds no longer hold, and the regret is no longer guaranteed to remain bounded by a term of order $\ln n$. Thus, incorporating uncertainty in transition probabilities would require a fundamentally different approach, and at present, the theoretical techniques employed here do not extend to handle unknown or partially known transition probabilities without sacrificing the uniform logarithmic regret properties.
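To illustrate this persistent bias numerically (a hypothetical example of ours, not part of the paper's analysis), the following sketch perturbs a transition matrix by a small $\delta$ and compares the two stationary distributions; the resulting gap is a property of the chains themselves and therefore does not shrink as the horizon grows.

import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    phi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return phi / phi.sum()

P_true = np.array([[0.80, 0.15, 0.05],
                   [0.20, 0.60, 0.20],
                   [0.10, 0.30, 0.60]])

delta = 0.02
# Shift delta of probability mass within each row (rows still sum to 1).
P_est = P_true.copy()
P_est[:, 0] += delta
P_est[:, 1] -= delta

bias = np.abs(stationary(P_true) - stationary(P_est))
print("per-state stationary bias:", np.round(bias, 4))  # stays non-zero regardless of horizon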

6. Simulations

While this work is primarily theoretical, mainly establishing regret bounds for MABs with up to three states per arm under known Markovian transition probabilities, it is nonetheless instructive to provide numerical simulations.

6.1. Experimental Setup

We consider a set of K = 5 arms, each modeled as a three-state Markov chain. The transition probabilities for each arm’s Markov chain, as well as the rewards associated with each state, are randomly generated at the start of every simulation run. This randomized setup ensures that the results represent average-case performance over a wide variety of synthetic conditions rather than tuning to any particular fixed scenario.
Specifically, for each arm $i \in \{1, \ldots, 5\}$, we construct its state transition probability matrix $P^{(i)}$ and reward vector $\nu^{(i)}$ as follows:
1. State Transition Probabilities: We draw each nonzero transition probability $p^{(i)}_{uv}$ from a Beta distribution (to ensure values between 0 and 1) and then normalize each row so that it forms a valid probability distribution. Specifically, for each row $u$, we sample three preliminary values from $\mathrm{Beta}(\alpha, \beta)$ with $(\alpha, \beta) = (2, 2)$ for a moderate spread, and then normalize the row so that $\sum_v p^{(i)}_{uv} = 1$. Each run of the simulation independently re-samples these probabilities, ensuring diverse state transition dynamics for each arm across runs.
2. Reward Distributions: Each state of each arm is assigned a reward distribution centered around a mean value drawn uniformly from $[0, 1]$. Specifically, for arm $i$ and state $u$, we let

$$\mu^{(i)}_u \sim \mathrm{Uniform}(0, 1).$$

We then model the reward observed at each round from that state as

$$r^{(i)}_{t,u} \sim \bar{\mathcal{N}}(\mu^{(i)}_u, \sigma^2),$$

where $\sigma$ is the standard deviation shared by all states and arms and $\bar{\mathcal{N}}(\mu^{(i)}_u, \sigma^2)$ denotes the normal distribution truncated so that rewards remain within $[0, 1]$. By re-sampling these mean rewards and their underlying realizations in every run, we capture a broad spectrum of synthetic arm behaviors.
3. Multiple Simulation Runs: To assess performance stability, we repeat each experiment for $N_{\mathrm{runs}} = 10^4$ independent runs. Each run simulates $T = 10^4$ time steps, allowing sufficient duration for the algorithms to settle into steady behavior. This extensive repetition allows us to approximate the long-run expected cumulative rewards and regret for each algorithm, mitigating the variance from any particular random draw.
This highly synthetic and randomized environment aims to stress-test the MC-UCB policy under diverse Markovian conditions and to demonstrate how our theory-based approach behaves with a handful of arms and stochastic transitions.
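A minimal sketch of this environment-generation procedure is given below (our own illustrative code, not the paper's simulation scripts), using Beta(2, 2)-sampled rows and truncated-normal rewards as described above; truncation is realized here by simple rejection sampling, which is an implementation choice of ours.

import numpy as np

rng = np.random.default_rng()

def random_arm(n_states=3, sigma=0.1):
    """Sample one arm: a row-normalized Beta(2, 2) transition matrix and per-state mean rewards."""
    P = rng.beta(2.0, 2.0, size=(n_states, n_states))
    P /= P.sum(axis=1, keepdims=True)           # each row becomes a probability distribution
    mu = rng.uniform(0.0, 1.0, size=n_states)   # mean reward per state
    return P, mu, sigma

def truncated_normal_reward(mu, sigma):
    """Draw a reward from N(mu, sigma^2) truncated to [0, 1] via rejection sampling."""
    while True:
        r = rng.normal(mu, sigma)
        if 0.0 <= r <= 1.0:
            return r

# Build K = 5 arms, as in the randomized setup.
arms = [random_arm() for _ in range(5)]
P0, mu0, sigma0 = arms[0]
print(truncated_normal_reward(mu0[0], sigma0))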

6.2. Compared Algorithms and Metrics

We compare the proposed MC-UCB algorithm with two baseline MAB algorithms adapted to Markovian settings:
  • Classical UCB: Uses sample means and confidence bounds assuming iid rewards, ignoring the underlying Markov structure. Although it cannot fully exploit the known transitions, it serves as a canonical benchmark.
  • ϵ-Greedy: Selects a random arm with probability ϵ and the arm with the best empirical mean otherwise. We use fixed exploration parameters $\epsilon = 0.1$ and $\epsilon = 0.5$.
We measure cumulative regret, defined as the difference between the cumulative reward of an omniscient oracle that always picks the optimal state–arm combination and the cumulative reward earned by the policy. Given our theoretical results, we expect MC-UCB to achieve lower regret growth rates compared to the baseline methods.

6.3. Numerical Results

The results of the simulations are presented in Figure 2 and Figure 3, which illustrate the cumulative regret for the algorithms across multiple values of σ (reward standard deviation) and the number of rounds. The comparison includes MC-UCB, UCB, and ϵ -Greedy with ϵ = 0.1 and ϵ = 0.5 .
In Figure 2, we observe that as the value of σ increases, the overall regret grows for all algorithms. However, the rate at which regret accumulates varies significantly across the algorithms. The MC-UCB algorithm consistently outperforms the baselines as it exhibits the lowest cumulative regret across all values of σ .
Specifically, the following trends can be identified:
  • Effect of Increasing σ : As the value of σ increases, the cumulative regret grows at a faster rate for all algorithms. This is expected because higher variability in rewards makes it more challenging to distinguish between the optimal and suboptimal arms. Nevertheless, MC-UCB demonstrates a robust ability to adapt to this increased variability and to maintain a clear performance advantage over the classical UCB and ϵ -Greedy algorithms.
  • Comparison with ϵ-Greedy: The ϵ-Greedy algorithms, with ϵ = 0.1 and ϵ = 0.5, perform consistently worse than MC-UCB. Notably, ϵ = 0.5 results in lower regret than ϵ = 0.1, since the sparse exploration at ϵ = 0.1 causes the algorithm to over-exploit arms whose hidden states have since changed, preventing it from tracking the optimal arms efficiently. The gap to MC-UCB is especially prominent in settings with low σ, where the forced random exploration of ϵ-Greedy leads to unnecessary regret accumulation that an index-based policy largely avoids.
  • Performance of Classical UCB: The classical UCB algorithm achieves lower regret than the ϵ -Greedy variants but fails to match the performance of MC-UCB. The classical UCB assumes iid rewards and does not account for the Markovian structure, which limits its ability to leverage state transitions effectively. This leads to slower learning of the optimal arms.
  • MC-UCB’s Adaptability: Across all settings of σ , MC-UCB demonstrates superior performance, particularly as the number of rounds increases. MC-UCB achieves faster convergence to the optimal arms and maintains lower cumulative regret by leveraging the Markovian structure. This advantage becomes more pronounced at higher σ values, where the increased reward variability exacerbates the shortcomings of the baseline algorithms.
Figure 3 provides a three-dimensional view of the total regret for each algorithm as a function of σ and the number of rounds. The plots reveal a clear trend: while all algorithms experience regret growth with increasing σ , MC-UCB consistently maintains the smallest regret surface. In contrast, the classical UCB and ϵ -Greedy algorithms exhibit higher regret surfaces, with ϵ -Greedy particularly struggling under larger σ values.

6.4. Robustness and Sensitivity to System Variations

Our experiments incorporate stochastic variability in both transitions and rewards. While we have maintained fixed distributions for sampling these parameters, the repeated randomization and large number of runs ensure that the results are not tailored to a single contrived example. Over thousands of simulations, the MC-UCB algorithm consistently outperforms the baselines, indicating that its theoretical properties are robust to different random initializations and transitions. However, we must emphasize that these simulations remain limited in scale and scope. Larger state spaces could invalidate our current theoretical guarantees and cause the underlying assumptions of our derivations to fail.

6.5. Additional Markovian Network Scenario and Results

To further illustrate the flexibility of MC-UCB under a Markovian reward structure, we also conduct a complementary numerical experiment wherein the arms represent network links transitioning among three distinct quality states (High, Medium, and Low). The rewards are interpreted as throughput (in Mbps), reflecting the link’s capacity at each time step. Unlike the fully randomized approach in the previous settings, here we fix the transition matrices and reward means (sampled from the dataset [44]) to highlight how variability in observation noise (i.e., the standard deviation σ ) impacts each algorithm’s performance.
We consider a simple network setting that translates to $K = 3$ arms, each with a three-state Markov chain. The probability of remaining in or transitioning between these states is encoded by a fixed transition matrix $P^{(i)}$ for each arm $i \in \{1, 2, 3\}$. For example, an arm in the High state remains there with probability 0.80, transitions to Medium with probability 0.15, and drops to Low with probability 0.05. We interpret the per-round reward $r^{(i)}_t$ as a throughput measurement drawn from a Gaussian distribution with mean $\mu^{(i)}_u$ (the average throughput for state $u$ of arm $i$) and variance $\sigma^2$. Thus, a higher reward corresponds to higher link throughput. We vary the standard deviation $\sigma \in \{2.0, 3.0, 4.0\}$ to simulate increasingly fluctuating network conditions.
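As an illustration, a sketch of one such link arm follows. Only the High-state transition row (0.80/0.15/0.05) is specified in the text; the remaining rows and the per-state mean throughputs used here are placeholders of our own choosing.

import numpy as np

rng = np.random.default_rng()

STATES = ["High", "Medium", "Low"]

# The High row is taken from the text; the Medium and Low rows are illustrative placeholders.
P_link = np.array([
    [0.80, 0.15, 0.05],   # High  -> High / Medium / Low
    [0.25, 0.60, 0.15],   # hypothetical
    [0.10, 0.30, 0.60],   # hypothetical
])
mu_mbps = np.array([8.0, 5.0, 2.0])  # hypothetical mean throughput per state (Mbps)

def step(state_idx, sigma=2.0):
    """One round: observe a Gaussian throughput sample, then the link transitions."""
    reward = rng.normal(mu_mbps[state_idx], sigma)
    next_state = rng.choice(3, p=P_link[state_idx])
    return reward, next_state

s = 0  # start in the High state
r, s = step(s)
print(f"observed {r:.2f} Mbps, next state: {STATES[s]}")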
We employ the same core policy classes introduced previously, with the key difference being that we now deal with throughput (Mbps) as reward:
  • MC-UCB: Our proposed Markovian UCB policy that can exploit knowledge of the transition probabilities.
  • Classical UCB: A reference baseline assuming iid rewards.
  • Baseline-Greedy: A purely greedy strategy, always picking the arm with the highest observed average so far.
We set the horizon to $T = 10{,}000$ rounds. At each round, the selected arm yields a random throughput sample from $\mathcal{N}(\mu^{(i)}_u, \sigma^2)$ for its current state $u$, and all arms then transition. Our performance metric is the time-averaged throughput achieved by each policy, since throughput is a key measure of network performance.
For each fixed σ , we run three numerical evaluations on the network simulations (one for each policy) and compute the running average throughput over time. We then plot the final average–throughput curves for each policy. The transition matrices, state means, and values of σ remain consistent in all runs to isolate the effect of observation noise (reward variability).
Figure 4 illustrates the key results for each σ . The results clearly demonstrate the consistent superiority of the MC-UCB algorithm across all tested noise levels ( σ ). For σ = 2.0 , MC-UCB quickly stabilizes around 6 Mbps, outperforming both classical UCB and Baseline-Greedy, which exhibit slower convergence and slightly lower steady-state throughput. As the noise level increases to σ = 3.0 , MC-UCB maintains a noticeable advantage, achieving higher initial throughput and stabilizing at a value above 6 Mbps, whereas the other algorithms lag behind, converging closer to 5.5 Mbps. Even under the highest noise level, σ = 4.0 , MC-UCB continues to outperform its counterparts, demonstrating faster convergence and sustaining higher throughput near 6 Mbps, while classical UCB and Baseline-Greedy fall short. These results highlight the robustness and adaptability of MC-UCB, making it the most effective approach in scenarios with varying noise conditions.

6.6. Simulation Summary

Using purely synthetic data, the simulation results validate the effectiveness of the proposed MC-UCB algorithm within Markovian MAB settings, where it consistently surpasses classical UCB and ϵ -Greedy algorithms under various experimental conditions. Specifically, MC-UCB exhibits a 15 % lower cumulative regret on average compared to classical UCB for the specified settings. This demonstrates that MC-UCB successfully leverages the Markovian structure for efficient adaptation to state transitions. This is particularly evident as the reward variability increases (with a larger σ ), where MC-UCB shows superior adaptability and maintains its performance advantage. This shows the robust adaptability of MC-UCB across scenarios with both low and high variability compared to the other baseline algorithms. The algorithm’s scalability is confirmed as MC-UCB’s regret curves ascend at a slower rate over increasing rounds, which showcases its long-term efficiency. The ϵ -Greedy algorithms, especially at ϵ = 0.1 , encounter issues with excessive exploitation in a way that leads to significantly higher regret. In contrast, while classical UCB performs better than ϵ -Greedy, it fails to match MC-UCB’s performance due to its inefficiency in handling state transitions. Overall, MC-UCB’s integration of the Markovian structure allows it to effectively balance exploration and exploitation.
Furthermore, in this supplemental experiment that we conducted on the simulated network and that was derived from the dataset in [44], the Markovian perspective allows our MC-UCB algorithm to handle state transitions adeptly, which translates to more stable performance in highly variable settings (large σ ) and to higher throughput overall. This supplemental experiment thus complements the more extensive randomized evaluations by focusing on a single, fixed set of state transitions under network settings, which further highlights MC-UCB’s efficacy in network-like applications.

7. Conclusions

In this study, we have addressed the multi-armed bandit (MAB) problem with a Markovian reward structure where each arm can transition between up to three states, which simulates dependencies often seen in networked systems. We demonstrated that a sample mean-based index policy, when adjusted for the complexity of our model, achieves logarithmic regret uniformly over time. This effectiveness depends on setting the exploration constant large enough relative to the eigenvalue gaps of the arms’ stochastic matrices. We also presented numerical examples using randomized and network-inspired three-state Markovian reward models. The numerical analysis suggests that the index policy remains near optimal even if the exploration constant does not strictly meet the theoretical sufficiency condition. This robustness indicates that our policy can be effective in a wide range of practical scenarios, including applications with network-like dependencies.

Author Contributions

Conceptualization, A.S. and J.W.; methodology, A.S. and J.W.; software, A.S.; validation, A.S.; formal analysis, A.S.; investigation, A.S.; resources, A.S. and J.W.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, A.S. and J.W.; visualization, A.S.; supervision, J.W.; project administration, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by NSF grants CNS 2214940, CPS 2128378, CNS 2107014, and CNS 2150152.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Guo, Z.; Zhang, C.; Li, M.; Krunz, M. Fair Probabilistic Multi-Armed Bandit with Applications to Network Optimization. IEEE Trans. Mach. Learn. Commun. Netw. 2024, 2, 994–1016. [Google Scholar] [CrossRef]
  2. Charpentier, A.; Elie, R.; Remlinger, C. Reinforcement learning in economics and finance. Comput. Econ. 2021, 62, 425–462. [Google Scholar] [CrossRef]
  3. Zhu, J.; Liu, J. Distributed Multiarmed Bandits. IEEE Trans. Autom. Control 2023, 68, 3025–3040. [Google Scholar] [CrossRef]
  4. Sawwan, A.; Wu, J. A Combinatorial Multi-Armed Bandit Approach for Stochastic Facility Allocation Problem. In Proceedings of the 2024 Workshop on Advanced Tools, Programming Languages, and PLatforms for Implementing and Evaluating algorithms for Distributed systems, Nantes, France, 17 June 2024; pp. 1–10. [Google Scholar]
  5. Xu, Z.; Zhang, Z.; Wang, S.; Hu, X.; Jia, Y.; Ren, B. Energy-Constrained Distributed MAC in CR-IoT Networks: A Budgeted Multi-Player Multi-Armed Bandit Approach. IEEE Trans. Cogn. Commun. Netw. 2024. [Google Scholar] [CrossRef]
  6. Gao, G.; Huang, S.; Huang, H.; Xiao, M.; Wu, J.; Sun, Y.E.; Zhang, S. Combination of auction theory and multi-armed bandits: Model, algorithm, and application. IEEE Trans. Mob. Comput. 2022, 22, 6343–6357. [Google Scholar] [CrossRef]
  7. Barrachina-Muñoz, S.; Chiumento, A.; Bellalta, B. Multi-armed bandits for spectrum allocation in multi-agent channel bonding WLANs. IEEE Access 2021, 9, 133472–133490. [Google Scholar] [CrossRef]
  8. Tariq, Z.U.A.; Baccour, E.; Erbad, A.; Guizani, M.; Hamdi, M. Network intrusion detection for smart infrastructure using multi-armed bandit based reinforcement learning in adversarial environment. In Proceedings of the 2022 International Conference on Cyber Warfare and Security (ICCWS), Albany, NY, USA, 17–18 March 2022; pp. 75–82. [Google Scholar]
  9. Robbins, H. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 1952, 58, 527–535. [Google Scholar] [CrossRef]
  10. Lai, T.L.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 1985, 6, 4–22. [Google Scholar] [CrossRef]
  11. Ou, H.C.; Siebenbrunner, C.; Killian, J.; Brooks, M.B.; Kempe, D.; Vorobeychik, Y.; Tambe, M. Networked restless multi-armed bandits for mobile interventions. arXiv 2022, arXiv:2201.12408. [Google Scholar]
  12. Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R.E. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 2002, 32, 48–77. [Google Scholar] [CrossRef]
  13. Jiang, B.; Jiang, B.; Li, J.; Lin, T.; Wang, X.; Zhou, C. Online restless bandits with unobserved states. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 15041–15066. [Google Scholar]
  14. Sawwan, A.; Wu, J. A new framework: Short-term and long-term returns in stochastic multi-armed bandit. In Proceedings of the IEEE INFOCOM 2023-IEEE Conference on Computer Communications, New York, NY, USA, 17–20 May 2023; pp. 1–10. [Google Scholar]
  15. Tekin, C.; Liu, M. Online algorithms for the multi-armed bandit problem with Markovian rewards. In Proceedings of the 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 29 September–1 October 2010; pp. 1675–1682. [Google Scholar]
  16. Sawwan, A.; Wu, J. Budget-Constrained and Deadline-Driven Multi-Armed Bandits with Delays. In Proceedings of the 21st Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Phoenix, AZ, USA, 2–4 December 2024. [Google Scholar]
  17. Duran, S.; Ayesta, U.; Verloop, I.M. On the Whittle index of Markov modulated restless bandits. Queueing Syst. 2022, 102, 373–430. [Google Scholar] [CrossRef]
  18. Wang, S.; Xiong, G.; Li, J. Online Restless Multi-Armed Bandits with Long-Term Fairness Constraints. Proc. AAAI Conf. Artif. Intell. 2024, 38, 15616–15624. [Google Scholar] [CrossRef]
  19. Sawwan, A.; Wu, J. Diversity-based recruitment in crowdsensing by combinatorial multi-armed bandits. Tsinghua Sci. Technol. 2024, 30, 732–747. [Google Scholar] [CrossRef]
  20. Gafni, T.; Cohen, K. Learning in restless multiarmed bandits via adaptive arm sequencing rules. IEEE Trans. Autom. Control 2020, 66, 5029–5036. [Google Scholar] [CrossRef]
  21. Zhao, Q. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks; Springer Nature: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  22. Bouneffouf, D.; Rish, I.; Aggarwal, C. Survey on applications of multi-armed and contextual bandits. In Proceedings of the 2020 IEEE Congress on Evolutionary Computation (CEC), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  23. Burtini, G.; Loeppky, J.; Lawrence, R. A survey of online experiment design with the stochastic multi-armed bandit. arXiv 2015, arXiv:1510.00757. [Google Scholar]
  24. Mazumdar, E.; Dong, R.; Royo, V.R.; Tomlin, C.; Sastry, S.S. A multi-armed bandit approach for online expert selection in markov decision processes. arXiv 2017, arXiv:1707.05714. [Google Scholar]
  25. Denisov, D.; Walton, N. Regret analysis of a markov policy gradient algorithm for multi-arm bandits. arXiv 2020, arXiv:2007.10229. [Google Scholar]
  26. Bout, E.; Brighente, A.; Conti, M.; Loscri, V. Folpetti: A novel multi-armed bandit smart attack for wireless networks. In Proceedings of the 17th International Conference on Availability, Reliability and Security, Vienna, Austria, 23–26 August 2022; pp. 1–10. [Google Scholar]
  27. Taghavi, M.; Bentahar, J.; Otrok, H.; Bakhtiyari, K. A reinforcement learning model for the reliability of blockchain oracles. Expert Syst. Appl. 2023, 214, 119160. [Google Scholar] [CrossRef]
  28. Raza, M.A.; Abolhasan, M.; Lipman, J.; Shariati, N.; Ni, W.; Jamalipour, A. Multi-Agent Multi-Armed Bandit Learning for Grant-Free Access in Ultra-Dense IoT Networks. IEEE Trans. Cogn. Commun. Netw. 2024, 10, 1356–1370. [Google Scholar] [CrossRef]
  29. Anantharam, V.; Varaiya, P.; Walrand, J. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays–Part I: I.I.D. rewards. IEEE Trans. Autom. Control 1987, 32, 968–976. [Google Scholar] [CrossRef]
  30. Gittins, J.C. Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B Stat. Methodol. 1979, 41, 148–164. [Google Scholar] [CrossRef]
  31. Agrawal, R. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 1995, 27, 1054–1078. [Google Scholar] [CrossRef]
  32. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256. [Google Scholar] [CrossRef]
  33. Garivier, A.; Moulines, E. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory; Springer: Berlin/Heidelberg, Germany, 2011; pp. 174–188. [Google Scholar]
  34. Besbes, O.; Gur, Y.; Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
  35. Shi, C.; Shen, C. Multi-player multi-armed bandits with collision-dependent reward distributions. IEEE Trans. Signal Process. 2021, 69, 4385–4402. [Google Scholar] [CrossRef]
  36. Mohamed, E.M.; Hashima, S.; Aldosary, A.; Hatano, K.; Abdelghany, M.A. Gateway selection in millimeter wave UAV wireless networks using multi-player multi-armed bandit. Sensors 2020, 20, 3947. [Google Scholar] [CrossRef] [PubMed]
  37. Dakdouk, H.; Féraud, R.; Varsier, N.; Maillé, P.; Laroche, R. Massive multi-player multi-armed bandits for IoT networks: An application on LoRa networks. Ad Hoc Netw. 2023, 151, 103283. [Google Scholar] [CrossRef]
  38. Ortner, R. Pseudometrics for state aggregation in average reward Markov decision processes. In Proceedings of the Algorithmic Learning Theory: 18th International Conference, Sendai, Japan, 1–4 October 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 373–387. [Google Scholar]
  39. Raj, V.; Kalyani, S. Taming non-stationary bandits: A Bayesian approach. arXiv 2017, arXiv:1707.09727. [Google Scholar]
  40. Herlihy, C.; Dickerson, J.P. Networked restless bandits with positive externalities. AAAI Conf. Artif. Intell. 2023, 37, 11997–12004. [Google Scholar] [CrossRef]
  41. Liu, K.; Zhao, Q. Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access. IEEE Trans. Inf. Theory 2010, 56, 5547–5567. [Google Scholar] [CrossRef]
  42. Chen, S.; Tao, Y.; Yu, D.; Li, F.; Gong, B. Distributed learning dynamics of multi-armed bandits for edge intelligence. J. Syst. Archit. 2021, 114, 101919. [Google Scholar] [CrossRef]
  43. Gillman, D. A Chernoff bound for random walks on expander graphs. SIAM J. Comput. 1998, 27, 1203–1220. [Google Scholar] [CrossRef]
  44. Sheikh, C. Upper Confidence Bound Dataset. 2021. Available online: https://www.kaggle.com/datasets/chaandsheikh/upper-confidence-bound-dataset (accessed on 15 January 2025).
Figure 1. A sample example of two arms of a multi-armed bandit. The first arm has two states and the second arm has one state.
Figure 2. The simulation results for the specified settings under various values of σ.
Figure 3. Full view of how the total regret changes under the different algorithms as the value of σ changes.
Figure 4. Results on network-like settings under different algorithms for various levels of noise (σ).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
