Article

Fast Two-Stage Computation of an Index Policy for Multi-Armed Bandits with Setup Delays

Department of Statistics, Carlos III University of Madrid, 28903 Getafe, Spain
Mathematics 2021, 9(1), 52; https://doi.org/10.3390/math9010052
Submission received: 7 December 2020 / Revised: 23 December 2020 / Accepted: 24 December 2020 / Published: 29 December 2020
(This article belongs to the Special Issue Stochastic Models with Applications)

Abstract
We consider the multi-armed bandit problem with penalties for switching that include setup delays and costs, extending previous results of the author for the special case with no switching delays. A priority index for projects with setup delays that partly characterizes optimal policies was introduced by Asawa and Teneketzis in 1996, yet without a means of computing it. We present a fast two-stage method for computing it: the first stage computes the continuation index (which applies when the project is set up) and certain extra quantities with cubic (arithmetic-operation) complexity in the number of project states; the second stage computes the switching index (which applies when the project is not set up) with quadratic complexity. The approach is based on new methodological advances on restless bandit indexation, which are introduced and deployed herein, motivated by the limitations of previous results, and it exploits the fact that the aforementioned index is the Whittle index of the project in its restless reformulation. A numerical study demonstrates substantial runtime speed-ups of the new two-stage index algorithm over a general one-stage Whittle index algorithm. The study further gives evidence that, in a multi-project setting, the resulting index policy is consistently nearly optimal.

1. Introduction

1.1. Background

In a much-studied version of the multi-armed bandit problem (MABP), a decision-maker selects one project to engage from a finite set of dynamic and stochastic projects at each of an infinite sequence of discrete-time periods. Each project is modeled as a classic (non-restless) bandit: the engaged (active) project yields rewards and changes state in a Markovian fashion, while rested (passive) projects neither yield rewards nor change state. The goal is to find a policy that selects one project to engage at each time so as to maximize the expected total geometrically discounted reward. The MABP is widely applicable, being regarded as a modeling paradigm of the exploration versus exploitation trade-off, and it has generated a vast literature (see the monograph [1] and the references cited there). Although the curse of dimensionality hinders direct numerical solution of its dynamic programming (DP) optimality equations for realistic-size models, as the size of the multi-dimensional state space grows exponentially with the number of projects, the MABP is solved optimally by a remarkably simple type of policy, a so-called (priority-) index policy. Index policies are based on defining for each project $m$ an index $\lambda_m(i_m)$—a scalar mapping of the project state $i_m$ that depends only on the project's parameters—and engaging at each time a project of largest index. See, e.g., [2,3,4,5,6]. The index considered in [2], known in the literature as the Gittins index, extends to general Markovian bandits the index introduced by Bellman in [7] for solving a Bernoulli bandit model.
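As a minimal illustration of this mechanism, the following sketch (in Python, with made-up index tables rather than computed Gittins indices) shows how an index policy reduces each decision to an argmax over per-project index values:

```python
import numpy as np

# Sketch of an index policy: each project m has an index table
# lambda_m(i_m) depending only on its own parameters; at each period,
# the policy engages a project of largest index at its current state.
# The tables below are hypothetical placeholders.
index_tables = [np.array([0.7, 0.2]),          # project 0, 2 states
                np.array([0.5, 0.9, 0.1])]     # project 1, 3 states
states = [0, 1]                                # current project states

engaged = max(range(len(index_tables)),
              key=lambda m: index_tables[m][states[m]])
print(f"engage project {engaged}")             # -> engage project 1
```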
However, appropriate modeling of potential applications often entails incorporating features that violate assumptions of the classic MABP. The assumption that passive projects yield no rewards is noncritical, since passive rewards can be readily eliminated through a linear transformation, as shown in [8]. Yet, other assumptions turn out to be critical, as index policies are typically suboptimal when they are violated. Such is the case, as demonstrated in [9], with the requirement that switching from engaging one project to another be costless, which is hardly realistic in many, if not most, applications. As stated in (p. 1, [9]), “it is difficult to imagine a relevant economic decision problem in which the decision-maker may costlessly move between alternatives”. This motivates investigating extensions of the MABP that incorporate costs and/or delays for switching projects, which we will refer to generically, as in [10], as the multi-armed bandit problem with switching penalties (MABPSP).
Despite its practical relevance, the MABPSP has received relatively scant research attention compared to the standard MABP. We refer the reader to [11] for a review of research on the MABPSP until the early 2000s. Important references on such early work include [9,10,12,13,14]. Additionally, see the survey [15]. Yet, the last decade has witnessed growing interest in variants of the MABPSP, motivated by the relevance of switching penalties in a variety of application areas, including hiring and retention of workers who learn over time [16], online marketing [17,18], experiential learning [19], opportunistic channel access in communication networks [20,21], and continuation and abandonment decisions for research projects [22]. For recent theoretical work on properties of the MABPSP, see [23].
While the aforementioned work concerns discrete-state projects, refs. [24,25] address Markovian continuous-state projects with constant setup penalties (costs or delays).

1.2. Index Policies, Hysteresis, and the Asawa and Teneketzis Index for the MABPSP

While switching penalties can generally be sequence-dependent, this paper focuses on the case in which such penalties are defined separately for each project, while allowing them to depend on the project state. Specifically, we will assume that switching from engaging one project to another entails, as in [26], a setdown cost to switch off the currently engaged project, and then a setup cost followed by a random setup delay to switch on the project about to be engaged. Note that setup delays can model, e.g., time for preparing the ground or building infrastructure, as well as training or learning delays.
Although index policies are generally suboptimal for the MABPSP (see [9]), their ease of implementation motivates the design of well-performing policies from such a class. An index policy in this setting attaches to each project $m$ an index $\lambda_m(a_m, i_m)$, which now depends on both the previous action $a_m \in \{0,1\}$ (passive: 0; active: 1) and the current project state $i_m$. Thus, such an index decouples into a continuation index $\lambda_m(1, i_m)$, which applies when the project has already been set up, and a switching index $\lambda_m(0, i_m)$, to be used when the project has not yet been set up.
Intuition suggests that switching penalties should discourage frequent switching and, hence, should cause a hysteresis effect on the structure of optimal policies: it should be optimal to stick longer to the currently engaged project than would be the case in the absence of such penalties. As put in (p. 691 [9]), “it is obvious that in comparing two otherwise identical arms, one of which was used in the previous period, the one which was in use must necessarily be more attractive than the one which was idle”. To be consistent with such a hysteresis property, the indices of a project $m$ must satisfy
$$\lambda_m(1, i_m) \ \ge\ \lambda_m(0, i_m) \quad \text{for every project state } i_m. \tag{1}$$
Note that index policies can be optimal in special cases of the MABPSP, as shown in [13], in a model for scheduling a multi-class batch of stochastic jobs.
An intuitively appealing choice of index, extending that in [13], is that considered by Asawa and Teneketzis in [10]—which we will refer to in the sequel as the AT index—for a project having either a constant (not dependent on the project state) setup cost or a constant setup delay distribution, and no setdown costs. It is shown in [10] that the AT index provides a partial characterization of optimal policies for the version of the MABPSP considered there. The continuation AT index of a project is simply its Gittins index. As for the switching AT index, it is the highest rate of discounted expected reward minus setup cost per unit of discounted expected active time (counting the setup delay as active time) that can be attained from an initially passive project by first setting it up and then engaging it for a random duration that is given by a stopping time.

1.3. Index Computation

Efficient index computation is a key issue that must be addressed in practice for deploying an index policy for the MABPSP. For a project with $n$ states and a constant setup cost, but without setup delays, (Section III.C [10]) shows that the $2n$ AT continuation and switching index values $\lambda^*(a,i)$ can be computed as the Gittins index of an appropriately defined $2n$-state project with augmented state $(a,i)$. Because computing the Gittins index has, in general, cubic operation complexity in the number of states, such an approach results in an eightfold increase in complexity relative to that of computing the continuation index alone.
A faster two-stage approach for a project with both setup and setdown costs—but no setup delays—that can be state-dependent was given by the author in [27]. The proposed algorithm computes, in the first stage, the continuation index and certain extra quantities by applying the $(4/3)n^3 + O(n^2)$ fast-pivoting algorithm with extended output presented in [28]. Subsequently, in the second stage, it computes the switching index in at most $O(n^2)$ operations. Hence, computing the $2n$ AT index values with that algorithm entails only a twofold complexity increase relative to the $(2/3)n^3 + O(n^2)$ operation count for computing the continuation (Gittins) index alone through the fast-pivoting algorithm (without extended output) of [28]. Further, ref. [27] reports the results of a numerical study demonstrating that the resulting index policy for the version of the MABPSP considered there is close to optimal and outperforms the Gittins index policy by a wide margin across a wide range of instances.

1.4. Approach via Restless Bandit Reformulation, Whittle Index, and Indexability

The two-stage index algorithm in [27] exploits the reformulation of a project with switching costs and state $i$ as a restless bandit—i.e., a project that can change state while passive—without such costs, moving across augmented states $(a,i)$. In that way, the MABP with switching costs is cast as a multi-armed restless bandit problem (MARBP) without them, which allows for the deployment of theoretical and algorithmic results on restless bandit indexation, as introduced by Whittle in [29]. Such a theory has been developed by the author in [30,31,32,33]. Additionally, see the survey [34].
Thus, while the MARBP is generally intractable, as it is known to be PSPACE-hard (see [35]), Whittle introduced in [29] a widely applied heuristic index policy. For a sample of recent applications, see, for example, [36,37,38,39,40,41,42,43,44,45,46,47,48]. Yet, the Whittle index is only defined for a limited class of restless bandits, called indexable, and it is nontrivial to verify whether such an indexability property holds for a given model. The work of the author referred to above provides sufficient indexability conditions for general restless bandits, grounded on the satisfaction of partial conservation laws (PCLs) by project performance metrics, together with an adaptive-greedy index algorithm that computes the Whittle index (and extensions thereof) under such conditions.
Such a PCL-indexability approach is deployed in [27], using the result that the AT index of a non-restless bandit with switching costs (but no switching delays) is its Whittle index in the project’s restless reformulation. The corresponding restless bandit model is shown to satisfy the PCL-indexability conditions, ensuring that its Whittle index can be computed by the adaptive-greedy algorithm. Special structure and the results in [49] are then used in [27] in order to decouple that algorithm into a faster two-stage method.

1.5. Motivation and Goals

Yet, Asawa and Teneketzis [10] give no method to compute their proposed index under switching delays. The relevance of such delays in applications, along with the tractability and effectiveness of the AT index policy in the pure-switching-costs case, motivates extending the restless bandit indexation approach to develop an efficient index algorithm for bandits that incorporate both switching costs and delays, which is the first goal of this paper.
Carrying out such an extension turns out to raise methodological research challenges on restless bandit indexation. Thus, when a Markovian non-restless bandit with switching delays is reformulated as a semi-Markov restless bandit without them, the resulting model need not satisfy the PCL-indexability conditions that were the cornerstone of the analyses in Niño-Mora [27] for the pure-switching-costs case. This motivates us to significantly extend the scope of previous theory, obtaining more powerful sufficient indexability conditions that are both easier to apply and applicable to a wider class of models, including that of concern herein. That is the second goal of this paper. The third goal entails assessing the runtime performance of the proposed index algorithm and evaluating the performance of the resulting index policy, both in terms of its optimality gap and its improvement over alternative simpler index policies.

1.6. Contributions

Concerning the second goal, on general restless bandit methodology, we introduce, for finite-state restless bandits, significantly simpler and less stringent sufficient conditions for indexability than the former PCL-based conditions, under which it is also assured that the adaptive-greedy algorithm computes the Whittle index (the marginal productivity index, MPI). We further show that such conditions are necessary, in that any indexable finite-state restless bandit satisfies them. Thus, the new conditions furnish a complete characterization of indexability, which can be used to analytically establish a priori that a restless bandit model of concern is indexable—as opposed to numerically verifying a posteriori that a given instance is indexable.
As for the first goal, we deploy the new indexability conditions in the restless bandit reformulation of a non-restless bandit with switching delays and costs. Because the AT index emerges as the Whittle index in such a reformulation, we are thus assured that the adaptive-greedy algorithm will compute it. The complexity of such an algorithm is then reduced by exploiting special structure, which again yields a substantially faster two-stage method. In the first stage, the continuation index is computed in $(4/3)n^3 + O(n^2)$ arithmetic operations; the switching index is then computed in the second stage in at most $(5/2)n^2 + O(n)$ operations. Thus, we obtain a two-stage algorithm that computes both the continuation and the switching index in roughly twice the time required to compute the continuation index alone (if the latter were computed using the fast-pivoting $(2/3)n^3 + O(n^2)$ algorithm in [34]).
Regarding the third goal, we report on a computational study demonstrating the substantial runtime speed-up achieved by the two-stage algorithm relative to direct application of the one-stage adaptive-greedy algorithm. The study further reports on experiments providing evidence that the index policy is close to optimal and attains significant gains against a benchmark index policy across a wide range of randomly generated instances with two and three projects.

1.7. Structure of the Paper and Notation

The rest of the paper proceeds as follows. Section 2 describes the MABPSP model of concern, reviews the AT index, and describes the restless bandit indexation approach to be applied. Section 3 lays the groundwork for such an approach in a general framework of finite-state restless bandits, introducing the new methodological advances on restless bandit indexation. Section 4 deploys the new results in the special restless bandit model that arises from the reformulation of a non-restless bandit with switching penalties, which culminates in the development of the new two-stage index algorithm in Section 5. Section 6 presents some qualitative properties of how the index depends on setup and setdown penalties. Finally, Section 7 presents and discusses the numerical study.
Because the paper's notation is extensive, Table 1 summarizes it for the reader's convenience.

2. MABPSP Model and Its Semi-Markov MARBP Reformulation

A decision-maker ponders how to prioritize the allocation of effort to $M$ dynamic and stochastic projects, labelled by $m \in \mathcal{M} \triangleq \{1, \dots, M\}$, one of which must be engaged (active) at each of a sequence of decision periods $t_k \in \mathbb{Z}_+ \triangleq \{0, 1, 2, \dots\}$, with $t_0 = 0$ and $t_k \to \infty$ as $k \to \infty$, while the others are rested (passive). Switching projects on and off entails setup and setdown delays and costs, respectively. A setup (resp. setdown) delay on a project is necessarily followed by a period in which the project is worked on (resp. rested), i.e., the times at which a setup or a setdown delay is completed are not decision periods. We will say that a project is “active” when it is either being engaged (worked on) or undergoing a setup or a setdown delay. Let $X_m(t)$ and $A_m(t)$ denote the prevailing state, which belongs to the finite state space $\mathcal{X}_m$, and the action for project $m$ at time $t$ ($A_m(t) = 1$: active; $A_m(t) = 0$: passive), and let $\bar A_m(t) \triangleq A_m(t-1)$ denote the previously chosen action, with $\bar A_m(0)$ indicating the initial setup status.
While project $m$ is passive, it neither accrues rewards nor changes state. Switching it on when it lies in state $i_m$ entails a lump setup cost $c_m(i_m)$, followed by a random setup delay of duration $\xi_m(i_m)$ periods, with z-transform $\phi_m(z; i_m) \triangleq E\big[z^{\xi_m(i_m)}\big]$, over which no rewards are earned. After such a setup, the project must be engaged, yielding a reward $R_m(i_m)$, after which its state moves at the next period to $j_m$ with transition probability $p_m(i_m, j_m)$. After at least one period in which the project is engaged, it may be decided to switch it off. If this is done when the project lies in state $j_m$, then a lump setdown cost $d_m(j_m)$ is incurred, followed by a random setdown delay of duration $\eta_m$ with z-transform $\psi_m(z) \triangleq E[z^{\eta_m}]$, over which no rewards accumulate. Subsequently, the project remains passive for one or more periods. Note that setup delay distributions are allowed to be state-dependent, whereas setdown delay distributions are not (cf. Section 2.1). Rewards and costs are geometrically time-discounted with factor $0 < \beta < 1$. We write, in what follows, the above z-transforms evaluated at $z = \beta$ simply as $\phi_m(i_m)$ and $\psi_m$.
Actions are prescribed through a scheduling policy π , which is chosen from the class Π of policies that are admissible, i.e., nonanticipative with respect to the history of states and actions, and engaging one project at a time. The MABPSP (cf. Section 1) is concerned with finding an admissible scheduling policy that attains the maximum expected total discounted reward net of switching costs.
This problem can be cast into the framework of semi-Markov decision problems (SMDPs) by including in the state of each project $m$ the last action taken, i.e., by using the augmented state $Y_m(t) \triangleq (\bar A_m(t), X_m(t))$, which belongs to the augmented state space $\mathcal{Y}_m \triangleq \{0,1\} \times \mathcal{X}_m$. Thus, one obtains a multidimensional SMDP with joint state $Y(t) \triangleq (Y_m(t))_{m\in\mathcal{M}}$ and joint action $A(t) \triangleq (A_m(t))_{m\in\mathcal{M}}$. This is a special type of semi-Markov MARBP (cf. Section 1), as the constituent projects become restless in such a reformulation.
Rewards and dynamics for the reformulated project $m$ are as follows, where $R_m^{a_m}(\bar a_m, i_m)$ and $p_m^{a_m}((\bar a_m, i_m), (\bar b_m, j_m))$ denote the one-stage (i.e., from $t_k$ to $t_{k+1}$) expected reward and transition probability resulting from taking action $a_m$ in state $Y_m(t_k) = (\bar a_m, i_m)$. On the one hand, if, in period $t_k$, the project lies in state $(1, i_m)$ and is again engaged, it yields the reward $R_m^1(1, i_m) \triangleq R_m(i_m)$ and its state transitions at $t_{k+1} = t_k + 1$ to $(1, j_m)$ with probability $p_m^1((1,i_m),(1,j_m)) \triangleq p_m(i_m, j_m)$. If, instead, the project is switched off, it gives the reward $R_m^0(1, i_m) \triangleq -d_m(i_m)$ and its state moves at $t_{k+1} = t_k + \eta_m + 1$ to $(0, i_m)$ with probability 1, i.e., $p_m^0((1,i_m),(0,i_m)) \triangleq 1$. On the other hand, if the project occupies at time $t_k$ the state $(0, i_m)$ and is then switched on, it yields the expected reward
$$R_m^1(0, i_m) \triangleq E\big[-c_m(i_m) + \beta^{\xi_m(i_m)}\, R_m(i_m)\big] = -c_m(i_m) + \phi_m(i_m)\, R_m(i_m)$$
until the following decision time $t_{k+1} = t_k + \xi_m(i_m) + 1$, at which the project state transitions to $(1, j_m)$ with probability $p_m^1((0,i_m),(1,j_m)) \triangleq p_m(i_m, j_m)$. If the project is kept idle, then it gives no reward, i.e., $R_m^0(0,i_m) \triangleq 0$, and its state remains frozen up to $t_{k+1} = t_k + 1$, so $p_m^0((0,i_m),(0,i_m)) \triangleq 1$.
Thus, the MABPSP of concern is formulated as the semi-Markov MARBP
$$\underset{\pi\in\Pi}{\text{maximize}}\quad E_{Y(0)}^{\pi}\Bigg[\sum_{k=0}^{\infty} \sum_{m=1}^{M} R_m^{A_m(t_k)}\big(Y_m(t_k)\big)\,\beta^{t_k}\Bigg],$$
where $E_{Y(0)}^{\pi}[\cdot]$ denotes expectation under policy $\pi$ conditioned on starting from the joint state $Y(0)$.
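To fix ideas, the following sketch assembles, for a single project with randomly generated illustrative data, the one-stage rewards and discounted transition transforms of the restless reformulation just described (anticipating the Section 2.1 normalization to zero setdown penalties):

```python
import numpy as np

# Sketch of the restless reformulation of one project, using the
# Section 2.1 normalization (no setdown penalties). All instance data
# below (rewards, transitions, setup costs, delay transforms) are
# randomly generated placeholders.
n, beta = 3, 0.95
rng = np.random.default_rng(0)
R = rng.uniform(0.0, 1.0, n)               # active rewards R(i)
P = rng.dirichlet(np.ones(n), size=n)      # transition matrix p(i, j)
c = rng.uniform(0.0, 0.5, n)               # setup costs c(i)
phi = rng.uniform(beta, 1.0, n)            # E[beta^xi_i] at z = beta

def one_stage_reward(a_prev, i, a):
    """Expected one-stage reward R^a(a_prev, i)."""
    if a == 0:
        return 0.0                         # resting is free (normalized)
    if a_prev == 1:
        return R[i]                        # already set up: plain reward
    return -c[i] + phi[i] * R[i]           # pay setup, discount by delay

def transition_transform(a_prev, i, a, j):
    """Discounted transition transform phi^a((a_prev, i), (a, j))."""
    if a == 1:
        disc = beta if a_prev == 1 else phi[i] * beta
        return disc * P[i, j]              # stage lasts xi_i + 1 periods
    return beta if j == i else 0.0         # frozen at (0, i) for 1 period
```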

2.1. Reduction to the Case with No Setdown Penalties

We next show that one can restrict attention, with no loss of generality, to the case in which there are no setdown penalties, which will allow us to simplify subsequent analyses. Imagine that, say, at time $t = 0$, a passive project is set up and is then worked on for a random number of periods determined by a stopping time $\tau \ge 1$, after which it is set down. Dropping the label $m$, denote by $R = (R_j)_{j\in\mathcal{X}}$, $c = (c_j)_{j\in\mathcal{X}}$, and $d = (d_j)_{j\in\mathcal{X}}$ the active reward vector and the setup and setdown cost vectors. Denote by $\phi = (\phi_j)_{j\in\mathcal{X}}$ the setup delay z-transform vector and by $\psi$ the constant setdown delay transform, both evaluated at $z = \beta$. The total discounted expected net reward obtained from the project over such a time interval, starting from the augmented state $Y(0) = (0,i)$, is
$$F_{(0,i)}^{\tau}(R, c, d, \phi, \psi) \triangleq E_{(0,i)}^{\tau}\Bigg[-c_i + \beta^{\xi_i} \sum_{t=0}^{\tau-1} R_{X(t)}\,\beta^{t} - d_{X(\tau)}\,\beta^{\xi_i + \tau}\Bigg],$$
where $\xi_i$ is the setup delay. The corresponding discounted active time expended on the project is
$$G_{(0,i)}^{\tau}(\phi, \psi) \triangleq E_{(0,i)}^{\tau}\Bigg[\frac{1-\beta^{\xi_i}}{1-\beta} + \beta^{\xi_i} \sum_{t=0}^{\tau-1} \beta^{t} + \frac{1-\beta^{\eta}}{1-\beta}\,\beta^{\xi_i+\tau}\Bigg],$$
where, as pointed out above, the setup and setdown delays $\xi_i$ and $\eta$ are both counted as active time.
In the next result, which extends Lemma 3.4 of [27] to the present setting, $I$ is the identity matrix indexed by $\mathcal{X}$, $P = (p_{ij})_{i,j\in\mathcal{X}}$, $0$ is a vector of zeros, and $\phi \cdot d \triangleq (\phi_j d_j)_{j\in\mathcal{X}}$.
Lemma 1.
(a)
$$F_{(0,i)}^{\tau}(R, c, d, \phi, \psi) = F_{(0,i)}^{\tau}\big(\psi^{-1}(R + (I - \beta P)\,d),\ c + \phi\cdot d,\ 0,\ \psi\phi,\ 1\big).$$
(b)
$$G_{(0,i)}^{\tau}(\phi, \psi) = G_{(0,i)}^{\tau}(\psi\phi, 1).$$
Proof. 
(a) Use the identity
$$d_{X(\tau)}\,\beta^{\tau} = d_i - \sum_{t=0}^{\tau-1} \big(d_{X(t)} - \beta\, d_{X(t+1)}\big)\,\beta^{t}$$
to write
$$\begin{aligned}
F_{(0,i)}^{\tau}(R, c, d, \phi, \psi) &\triangleq E_{(0,i)}^{\tau}\Bigg[-c_i + \beta^{\xi_i}\Bigg(\sum_{t=0}^{\tau-1} R_{X(t)}\,\beta^{t} - d_{X(\tau)}\,\beta^{\tau}\Bigg)\Bigg]\\
&= -c_i + \phi_i\, E_{(0,i)}^{\tau}\Bigg[\sum_{t=0}^{\tau-1} R_{X(t)}\,\beta^{t} - d_{X(\tau)}\,\beta^{\tau}\Bigg]\\
&= -c_i - \phi_i d_i + \phi_i\, E_{(0,i)}^{\tau}\Bigg[\sum_{t=0}^{\tau-1} \big(R_{X(t)} + d_{X(t)} - \beta\, d_{X(t+1)}\big)\,\beta^{t}\Bigg]\\
&= -c_i - \phi_i d_i + \phi_i \psi\, E_{(0,i)}^{\tau}\Bigg[\psi^{-1} \sum_{t=0}^{\tau-1} \big(R_{X(t)} + d_{X(t)} - \beta\, d_{X(t+1)}\big)\,\beta^{t}\Bigg]\\
&= F_{(0,i)}^{\tau}\big(\psi^{-1}(R + (I - \beta P)\,d),\ c + \phi\cdot d,\ 0,\ \psi\phi,\ 1\big).
\end{aligned}$$
(b) This part follows by writing
$$\begin{aligned}
G_{(0,i)}^{\tau}(\phi, \psi) &\triangleq E_{(0,i)}^{\tau}\Bigg[\frac{1-\beta^{\xi_i}}{1-\beta} + \beta^{\xi_i} \sum_{t=0}^{\tau-1} \beta^{t} + \frac{1-\beta^{\eta}}{1-\beta}\,\beta^{\xi_i+\tau}\Bigg]\\
&= \frac{1-\phi_i}{1-\beta} + \phi_i\, E_{(0,i)}^{\tau}\Bigg[\sum_{t=0}^{\tau-1} \beta^{t} + \frac{1-\psi}{1-\beta}\,\beta^{\tau}\Bigg]\\
&= \frac{1-\phi_i}{1-\beta} + \phi_i\, E_{(0,i)}^{\tau}\Bigg[\sum_{t=0}^{\tau-1} \beta^{t} + \frac{1-\psi}{1-\beta}\Bigg(1 - (1-\beta)\sum_{t=0}^{\tau-1} \beta^{t}\Bigg)\Bigg]\\
&= \frac{1-\phi_i \psi}{1-\beta} + \phi_i\, E_{(0,i)}^{\tau}\Bigg[\sum_{t=0}^{\tau-1} \big(1 - (1-\psi)\big)\,\beta^{t}\Bigg]\\
&= \frac{1-\phi_i \psi}{1-\beta} + \phi_i \psi\, E_{(0,i)}^{\tau}\Bigg[\sum_{t=0}^{\tau-1} \beta^{t}\Bigg] = G_{(0,i)}^{\tau}(\psi\phi, 1).
\end{aligned}$$
 □
Lemma 1 can be used to eliminate setdown penalties: it suffices to incorporate them into new setup costs, setup delay transforms, and active rewards, using the transformations
$$\tilde c_j \triangleq c_j + \phi_j d_j, \qquad \tilde\phi_j \triangleq \psi\,\phi_j, \qquad \tilde R \triangleq \psi^{-1}\big(R + (I - \beta P)\,d\big).$$
Note that such a reduction would not have been accomplished had the setdown delay transform not been constant. In the case $c_j \equiv c$ and $d_j \equiv d$, we obtain $\tilde c_j \equiv c + d\,\phi_j$ and $\tilde R_j = \big(R_j + (1-\beta)d\big)/\psi$.
Accordingly, we will focus henceforth on the normalized case without setdown penalties: $d_j \equiv 0$, $\psi = 1$.
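As a small worked example, the following sketch (with arbitrary illustrative data) applies these transformations to a project with setdown penalties:

```python
import numpy as np

# Sketch of the Section 2.1 normalization removing setdown penalties:
# fold the setdown costs d and the constant setdown-delay transform psi
# into new setup costs, setup transforms, and active rewards.
# All instance data are illustrative.
n, beta = 4, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n), size=n)    # p(i, j)
R = rng.uniform(0.0, 1.0, n)             # active rewards
c = rng.uniform(0.0, 0.3, n)             # setup costs
d = rng.uniform(0.0, 0.3, n)             # setdown costs
phi = rng.uniform(beta, 1.0, n)          # setup-delay transforms
psi = 0.97                               # constant setdown-delay transform

c_tilde = c + phi * d                             # c~_j = c_j + phi_j d_j
phi_tilde = psi * phi                             # phi~_j = psi phi_j
R_tilde = (R + (np.eye(n) - beta * P) @ d) / psi  # R~ = psi^-1 (R + (I - beta P) d)
```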

2.2. The AT Index

We next consider the AT index of a project with setup penalties—dropping again the label $m$—extending the original definitions in [10]. The continuation AT index is
$$\lambda_{(1,i)}^{\mathrm{AT}} \triangleq \max_{\tau \ge 1}\ \frac{E_i^{\tau}\Big[\sum_{t=0}^{\tau-1} R_{X(t)}\,\beta^{t}\Big]}{E_i^{\tau}\Big[\sum_{t=0}^{\tau-1} \beta^{t}\Big]}, \tag{7}$$
where $\tau \ge 1$ is a stopping time for engaging the project starting in state $i$ when it is already set up; hence, $\lambda_{(1,i)}^{\mathrm{AT}}$ is just the project's Gittins index. As for the switching AT index, it is given by
$$\lambda_{(0,i)}^{\mathrm{AT}} \triangleq \max_{\tau \ge 1}\ \frac{-c_i + E_i^{\tau}\Big[\beta^{\xi_i} \sum_{t=0}^{\tau-1} R_{X(t)}\,\beta^{t}\Big]}{E_i^{\tau}\Big[\sum_{t=0}^{\xi_i-1} \beta^{t} + \beta^{\xi_i} \sum_{t=0}^{\tau-1} \beta^{t}\Big]} = \max_{\tau \ge 1}\ \frac{-c_i + \phi_i\, E_i^{\tau}\Big[\sum_{t=0}^{\tau-1} R_{X(t)}\,\beta^{t}\Big]}{\dfrac{1-\phi_i}{1-\beta} + \phi_i\, E_i^{\tau}\Big[\sum_{t=0}^{\tau-1} \beta^{t}\Big]}, \tag{8}$$
where now τ is a stopping-time rule that is followed after the project has been set up in state i.
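For a tiny project, both AT indices can be computed by brute force, since an optimal stopping time may be taken as the exit time of a continuation set; the following sketch (illustrative data; exponential in the number of states, so suited to validation only) maximizes the ratios in (7) and (8) over all such sets:

```python
import numpy as np
from itertools import combinations

# Brute-force sketch of the AT indices: enumerate all continuation
# sets S applied after the forced first engagement (tau >= 1).
# Instance data are an arbitrary small example.
n, beta = 3, 0.9
P = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]])
R = np.array([1.0, 0.6, 0.2])
c = np.array([0.3, 0.3, 0.3])
phi = np.array([0.95, 0.95, 0.95])      # E[beta^xi_i]

def tail_values(S):
    """Discounted reward/time collected while the state stays in S."""
    VF, VG = np.zeros(n), np.zeros(n)
    if S:
        idx = sorted(S)
        A = np.eye(len(idx)) - beta * P[np.ix_(idx, idx)]
        VF[idx] = np.linalg.solve(A, R[idx])
        VG[idx] = np.linalg.solve(A, np.ones(len(idx)))
    return VF, VG

subsets = [set(s) for r in range(n + 1) for s in combinations(range(n), r)]
cont, switch = np.zeros(n), np.zeros(n)
for i in range(n):
    best1, best0 = -np.inf, -np.inf
    for S in subsets:
        VF, VG = tail_values(S)
        F = R[i] + beta * P[i] @ VF     # E[sum R_X(t) beta^t]
        G = 1.0 + beta * P[i] @ VG      # E[sum beta^t]
        best1 = max(best1, F / G)
        best0 = max(best0, (-c[i] + phi[i] * F)
                    / ((1 - phi[i]) / (1 - beta) + phi[i] * G))
    cont[i], switch[i] = best1, best0

print("continuation (Gittins) index:", cont)
print("switching index:", switch)       # <= continuation, cf. Lemma 2
```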
The following requirements will be assumed henceforth on setup costs and setup delay transforms, which extend the corresponding conditions in [10].
Assumption 1.
The following holds:
(i)
non-negative setup costs: $c_j \ge 0$ for $j \in \mathcal{X}$;
(ii)
non-negative rewards: if some setup delay can be positive, i.e., $\phi \not\equiv 1$, then $R_j \ge 0$ for $j \in \mathcal{X}$.
The next result shows that Assumption 1 ensures the satisfaction of the hysteresis property in (1).
Lemma 2.
Under Assumption 1, $\lambda_{(1,i)}^{\mathrm{AT}} \ge \lambda_{(0,i)}^{\mathrm{AT}}$ for $i \in \mathcal{X}$.
Proof. 
For a given state $i \in \mathcal{X}$ and a stopping-time rule $\tau$ as above, write $G_i^{\tau} \triangleq E_i^{\tau}\big[\sum_{t=0}^{\tau-1} \beta^{t}\big]$ and $F_i^{\tau} \triangleq E_i^{\tau}\big[\sum_{t=0}^{\tau-1} R_{X(t)}\,\beta^{t}\big]$. Now, Assumption 1 ensures that $c_i \ge 0$ and $F_i^{\tau} \ge 0$, and hence
$$\frac{F_i^{\tau}}{G_i^{\tau}} - \frac{-c_i + \phi_i\, F_i^{\tau}}{\dfrac{1-\phi_i}{1-\beta} + \phi_i\, G_i^{\tau}} = \frac{1}{G_i^{\tau}}\;\frac{(1-\beta)\,c_i\, G_i^{\tau} + (1-\phi_i)\,F_i^{\tau}}{1 - \phi_i + (1-\beta)\,\phi_i\, G_i^{\tau}} \ \ge\ 0. \tag{9}$$
Further, (9), together with (7) and (8), immediately yields $\lambda_{(1,i)}^{\mathrm{AT}} \ge \lambda_{(0,i)}^{\mathrm{AT}}$, which completes the proof. □

3. New Methodological Results on Restless Bandit Indexation

This section presents new results on restless bandit indexation which, besides having intrinsic interest, are required for, and form the basis of, the approach to non-restless bandits with switching delays deployed in later sections.

3.1. Indexable Restless Bandits and the Whittle Index

Consider a semi-Markov restless bandit, representing a dynamic and stochastic project whose state $Y(t)$ evolves over time periods $t = 0, 1, 2, \dots$ through the finite state space $\mathcal{Y}$. The project's evolution is governed by a policy $\pi$, taken from the class $\Pi$ of nonanticipative randomized policies, which, at each of an increasing sequence $t_k$ of decision periods, with $t_0 = 0$ and $t_k \to \infty$ as $k \to \infty$, prescribes an action $A(t_k) \in \{0,1\}$ that determines the status during the ensuing stage until the next decision period $t_{k+1}$ (1: active; 0: passive). Taking action $A(t_k) = a$ at time $t_k$ when the project occupies state $Y(t_k) = y$ has the following consequences over the ensuing stage, relative to a given one-period discount factor $0 < \beta < 1$: an expected total discounted amount $R_y^a$ of reward is earned and an expected total discounted amount $Q_y^a \ge 0$ of a generic resource is expended; further, the joint distribution of the stage's duration $t_{k+1} - t_k$ and its final state $Y(t_{k+1})$ is given through the discounted transition transform $\phi_{yy'}^{a} \triangleq E\big[\beta^{t_{k+1}-t_k}\, 1\{Y(t_{k+1}) = y'\} \mid Y(t_k) = y,\ A(t_k) = a\big]$, where $1\{\cdot\}$ denotes an event indicator.
It will be convenient to partition $\mathcal{Y}$ into the (possibly empty) set of uncontrollable states
$$\mathcal{Y}^{\{0\}} \triangleq \Big\{ y \in \mathcal{Y} \colon Q_y^0 = Q_y^1 \ \text{and}\ \phi_{yy'}^{0} = \phi_{yy'}^{1} \ \text{for all}\ y' \in \mathcal{Y} \Big\},$$
where both actions entail identical resource consumption and dynamics, and the remaining set $\mathcal{Y}^{\{0,1\}} \triangleq \mathcal{Y}\setminus\mathcal{Y}^{\{0\}}$ of controllable states, which is assumed to consist of $N \triangleq |\mathcal{Y}^{\{0,1\}}| \ge 1$ elements. The notation $\mathcal{Y}^{\{0\}}$ reflects the convention that the passive action $a = 0$ is chosen in uncontrollable states.
The rewards earned and the amount of resource expended by a policy $\pi$ starting from state $y$ are evaluated, respectively, by the discounted reward and resource consumption metrics
$$F_y^{\pi} \triangleq E_y^{\pi}\Bigg[\sum_{k=0}^{\infty} R_{Y(t_k)}^{A(t_k)}\,\beta^{t_k}\Bigg] \quad \text{and} \quad G_y^{\pi} \triangleq E_y^{\pi}\Bigg[\sum_{k=0}^{\infty} Q_{Y(t_k)}^{A(t_k)}\,\beta^{t_k}\Bigg].$$
Let us introduce a parameter λ representing the resource unit price, and consider the λ-price problem
$$\underset{\pi\in\Pi}{\text{maximize}}\quad F_y^{\pi} - \lambda\, G_y^{\pi}, \tag{10}$$
which concerns finding a policy that maximizes the value of rewards earned minus the cost of resources expended. Because (10) is an infinite-horizon finite-state and -action SMDP, standard results ensure that it is solved by stationary deterministic policies, characterized by the solutions to the following DP equations, where $V_y^*(\lambda)$ denotes the optimal value starting from $y$ under price $\lambda$:
$$V_y^{*}(\lambda) = \max_{a \in \{0,1\}} \Bigg\{ R_y^{a} - \lambda\, Q_y^{a} + \sum_{y' \in \mathcal{Y}} \phi_{yy'}^{a}\, V_{y'}^{*}(\lambda) \Bigg\}, \quad y \in \mathcal{Y}. \tag{11}$$
Such a project is said to be indexable (cf. [29]) if, for each controllable state $y \in \mathcal{Y}^{\{0,1\}}$, there exists a unique break-even price $\lambda_y^*$ such that it is optimal to engage the project in state $y$ if and only if $\lambda \le \lambda_y^*$, and optimal to rest it if and only if $\lambda \ge \lambda_y^*$. Or, in terms of the DP Equation (11),
$$R_y^{1} - \lambda\, Q_y^{1} + \sum_{y' \in \mathcal{Y}} \phi_{yy'}^{1}\, V_{y'}^{*}(\lambda) \ \ge\ R_y^{0} - \lambda\, Q_y^{0} + \sum_{y' \in \mathcal{Y}} \phi_{yy'}^{0}\, V_{y'}^{*}(\lambda) \iff \lambda \le \lambda_y^{*}, \quad y \in \mathcal{Y}^{\{0,1\}},$$
and
$$R_y^{1} - \lambda\, Q_y^{1} + \sum_{y' \in \mathcal{Y}} \phi_{yy'}^{1}\, V_{y'}^{*}(\lambda) \ \le\ R_y^{0} - \lambda\, Q_y^{0} + \sum_{y' \in \mathcal{Y}} \phi_{yy'}^{0}\, V_{y'}^{*}(\lambda) \iff \lambda \ge \lambda_y^{*}, \quad y \in \mathcal{Y}^{\{0,1\}}.$$
We will refer to the mapping $y \mapsto \lambda_y^*$ as the project's Whittle index. See [29].
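Indexability of a given instance can be probed numerically from this definition: solve the λ-price problem over a grid of prices and check that the optimal active set shrinks monotonically as λ grows. A sketch for a Markovian restless bandit with unit active resource consumption, on illustrative random data:

```python
import numpy as np

# Numerical probe of indexability: solve the lambda-price DP by value
# iteration on a price grid and check that the optimal active set is
# nested (shrinking) in lambda. Instance data are arbitrary;
# Q^1 = 1, Q^0 = 0 (work metric = discounted active time).
n, beta = 4, 0.9
rng = np.random.default_rng(2)
P = {a: rng.dirichlet(np.ones(n), size=n) for a in (0, 1)}
Rwd = {0: np.zeros(n), 1: rng.uniform(0.0, 1.0, n)}
Q = {0: np.zeros(n), 1: np.ones(n)}

def optimal_active_set(lam, iters=1500):
    V = np.zeros(n)
    for _ in range(iters):                 # value iteration
        vals = np.array([Rwd[a] - lam * Q[a] + beta * P[a] @ V
                         for a in (0, 1)])
        V = vals.max(axis=0)
    return frozenset(np.flatnonzero(vals[1] >= vals[0] + 1e-9))

sets = [optimal_active_set(lam) for lam in np.linspace(-1.0, 2.0, 61)]
nested = all(s2 <= s1 for s1, s2 in zip(sets, sets[1:]))
print("active sets nested in lambda (indexability evidence):", nested)
```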

3.2. Exploiting Special Structure: Indexability Relative to a Family of Policies

While one can readily test numerically whether a given restless bandit instance is indexable, a researcher investigating a particular restless bandit model will instead be concerned with analytically establishing its indexability under an appropriate range of model parameters. The key to achieving such a goal is—as in optimal-stopping problems—to exploit special structure by guessing a family of (stationary deterministic) policies among which there exists an optimal policy for (10) for every resource price $\lambda \in \mathbb{R}$.
We represent a stationary deterministic policy by its active (state) set, consisting of those controllable states where it prescribes engaging the project. Thus, a family of such policies is given as a family $\mathcal{F}$ of active sets $S \subseteq \mathcal{Y}^{\{0,1\}}$, and hence we will refer to the family of $\mathcal{F}$-policies. Relative to such a family, we will call the project $\mathcal{F}$-indexable if (i) it is indexable, and (ii) $\mathcal{F}$-policies are optimal for the λ-price problem (10) for every resource price $\lambda \in \mathbb{R}$.
We will impose the following connectivity requirements on $\mathcal{F}$.
Assumption 2.
The active-set family $\mathcal{F}$ satisfies the following conditions:
(i)
$\emptyset,\ \mathcal{Y}^{\{0,1\}} \in \mathcal{F}$;
(ii)
for any $S, S' \in \mathcal{F}$ with $S \subsetneq S'$, there exist $y, y' \in S' \setminus S$ such that $S \cup \{y\} \in \mathcal{F}$ and $S' \setminus \{y'\} \in \mathcal{F}$;
(iii)
for any $S, S' \in \mathcal{F}$: $S \cap S' \in \mathcal{F}$ and $S \cup S' \in \mathcal{F}$.
Note that condition (iii) in Assumption 2 means that $\mathcal{F}$ is a lattice relative to set inclusion. As for condition (ii), it ensures that any two nested active sets $S, S' \in \mathcal{F}$ with $S \subsetneq S'$ can be connected by an increasing chain $S = S^0 \subsetneq \cdots \subsetneq S^k = S'$ of adjacent (i.e., differing by one state) sets in $\mathcal{F}$. Further, condition (i) ensures that one can connect in such a fashion $\emptyset$ with $\mathcal{Y}^{\{0,1\}}$. We will call a set family $\mathcal{F}$ satisfying Assumption 2(ii, iii) a monotonically connected lattice.

3.3. New Sufficient Conditions for $\mathcal{F}$-Indexability and Adaptive-Greedy Index Algorithm

Suppose that, for a particular restless bandit model, a suitable active-set family $\mathcal{F}$ as above has been posited, relative to which one aims to analytically establish $\mathcal{F}$-indexability. While the aforementioned earlier work of the author gives sufficient conditions for $\mathcal{F}$-indexability, which further ensure that the project's Whittle index can be computed by an adaptive-greedy index algorithm introduced in that work, we next introduce new sufficient conditions that are significantly less restrictive. The new conditions are motivated by the model of concern in this paper, which, as mentioned in Section 1, need not satisfy the former conditions.
In order to formulate the new conditions and the index algorithm, we need to define certain marginal metrics, as follows. Given an action $a \in \{0,1\}$ and an active set $S \subseteq \mathcal{Y}^{\{0,1\}}$, write $\langle a, S\rangle$ for the policy that initially chooses action $a$ and then follows the $S$-active policy. For a given state $y$ and active set $S$, consider the marginal work metric
$$g_y^{S} \triangleq G_y^{\langle 1,S\rangle} - G_y^{\langle 0,S\rangle}, \tag{12}$$
which represents the marginal increase in the amount of resource expended resulting from first taking the active rather than the passive action and then following the $S$-active policy. Note that the marginal work metric vanishes at uncontrollable states:
$$g_y^{S} = 0, \quad y \in \mathcal{Y}^{\{0\}}. \tag{13}$$
Further, define the marginal reward metric
$$f_y^{S} \triangleq F_y^{\langle 1,S\rangle} - F_y^{\langle 0,S\rangle}, \tag{14}$$
which represents the marginal increase in rewards earned. Finally, for $g_y^{S} \ne 0$, define the marginal productivity metric
$$\lambda_y^{S} \triangleq \frac{f_y^{S}}{g_y^{S}}. \tag{15}$$
We will consider the adaptive-greedy index algorithm given in Algorithm 1 in its top-down version, where index values are computed from highest to lowest; one could similarly consider the symmetric bottom-up version. Such an algorithm has a very simple structure: it constructs in $N$ steps (recall that $N \triangleq |\mathcal{Y}^{\{0,1\}}|$) an increasing chain of active sets $S^0 = \emptyset \subsetneq S^1 \subsetneq \cdots \subsetneq S^N = \mathcal{Y}^{\{0,1\}}$ in $\mathcal{F}$, proceeding at each step in a greedy fashion. Thus, once the active set $S^{k-1} \in \mathcal{F}$ has been obtained, the next active set $S^k$ is constructed by augmenting $S^{k-1}$ with a controllable state $y \in \mathcal{Y}^{\{0,1\}}\setminus S^{k-1}$ that maximizes the marginal productivity metric $\lambda_y^{S^{k-1}}$, restricting attention to states $y$ for which the resulting active set is in $\mathcal{F}$, so that $S^k = S^{k-1}\cup\{y\} \in \mathcal{F}$. Ties are broken arbitrarily.
Note that Algorithm 1 only displays an algorithmic scheme, as it does not specify how to compute the required metrics. A complete fast-pivoting implementation of such an algorithm is given by the author in [49].
Additionally, note that the algorithm’s input consists of all the project’s primitive parameters, namely states, rewards, transition probabilities, and discount factor.
The same considerations apply to Algorithm 2.
Algorithm 1: Top-down adaptive-greedy index algorithm $\mathrm{AG}_{\mathcal{F}}$.
  Output: $\{(y_k, \lambda_{y_k}^*)\}_{k=1}^{N}$
  $S^0 := \emptyset$
  for $k := 1$ to $N$ do
    choose $y_k \in \arg\max\big\{\lambda_y^{S^{k-1}} \colon y \in \mathcal{Y}^{\{0,1\}} \setminus S^{k-1},\ S^{k-1} \cup \{y\} \in \mathcal{F}\big\}$
    $\lambda_{y_k}^* := \lambda_{y_k}^{S^{k-1}}$;  $S^k := S^{k-1} \cup \{y_k\}$
  end { for }
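For concreteness, the following sketch instantiates the adaptive-greedy scheme, with $\mathcal{F}$ taken as all subsets, for a bandit whose state freezes when passive (which guarantees positive marginal work metrics and makes the computed Whittle indices coincide with Gittins indices); it evaluates the policy and marginal metrics by solving linear evaluation equations, rather than via the fast-pivoting implementation of [49]:

```python
import numpy as np

# Sketch of the top-down adaptive-greedy algorithm AG_F with F = all
# subsets, for a project frozen when passive (Q^1 = 1, Q^0 = 0,
# P^0 = I, R^0 = 0). Illustrative random data.
n, beta = 4, 0.9
rng = np.random.default_rng(3)
P1 = rng.dirichlet(np.ones(n), size=n)     # active transition matrix
R1 = rng.uniform(0.0, 1.0, n)              # active rewards

def policy_metrics(S):
    """G^S, F^S solving the evaluation equations of the S-active policy."""
    act = np.array([y in S for y in range(n)])
    Ppol = np.where(act[:, None], P1, np.eye(n))   # frozen if passive
    M = np.eye(n) - beta * Ppol
    return (np.linalg.solve(M, act.astype(float)),
            np.linalg.solve(M, np.where(act, R1, 0.0)))

def marginal(y, S):
    """g_y^S and f_y^S via one active/passive lookahead step."""
    G, F = policy_metrics(S)
    g = (1.0 + beta * P1[y] @ G) - beta * G[y]
    f = (R1[y] + beta * P1[y] @ F) - beta * F[y]
    return g, f

S, index = set(), {}
for _ in range(n):                         # N = n greedy steps
    cand = {y: marginal(y, S) for y in range(n) if y not in S}
    y_k = max(cand, key=lambda y: cand[y][1] / cand[y][0])
    index[y_k] = cand[y_k][1] / cand[y_k][0]
    S.add(y_k)
print("index values (non-increasing):", index)
```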
The main result of this section, giving the new indexability conditions and ensuring the validity of the adaptive-greedy index algorithm for computing the Whittle index, is stated next.
Algorithm 2: Geometrically intuitive reformulation of adaptive-greedy index algorithm $\mathrm{AG}_{\mathcal{F}}$.
  Output: $\{(y_k, \lambda_{y_k}^*)\}_{k=1}^{N}$
  $S^0 := \emptyset$
  for $k := 1$ to $N$ do
    choose $y_k \in \arg\max\Bigg\{\dfrac{F^{S^{k-1}\cup\{y\}} - F^{S^{k-1}}}{G^{S^{k-1}\cup\{y\}} - G^{S^{k-1}}} \colon y \in \mathcal{Y}^{\{0,1\}} \setminus S^{k-1},\ S^{k-1} \cup \{y\} \in \mathcal{F}\Bigg\}$
    $\lambda_{y_k}^* := \lambda_{y_k}^{S^{k-1}}$;  $S^k := S^{k-1} \cup \{y_k\}$
  end { for }
Theorem 1.
The following holds:
(a)
Suppose that the project satisfies the following conditions:
(i)
for every active set $S \in \mathcal{F}$,
$$g_y^{S} > 0 \ \text{ for } y \in S \text{ with } S\setminus\{y\} \in \mathcal{F}; \qquad g_y^{S} > 0 \ \text{ for } y \in \mathcal{Y}^{\{0,1\}}\setminus S \text{ with } S\cup\{y\} \in \mathcal{F}; \tag{16}$$
or, equivalently, for every nested active-set pair $S \subsetneq S'$ with $S, S' \in \mathcal{F}$,
$$\big(G_y^{S}\big)_{y\in\mathcal{Y}} \le \big(G_y^{S'}\big)_{y\in\mathcal{Y}} \ \text{ componentwise, with } \big(G_y^{S}\big)_{y\in\mathcal{Y}} \ne \big(G_y^{S'}\big)_{y\in\mathcal{Y}}. \tag{17}$$
(ii)
for every resource price $\lambda \in \mathbb{R}$, there exists an optimal $\mathcal{F}$-policy for the λ-price problem (10).
Then, the project is $\mathcal{F}$-indexable and algorithm $\mathrm{AG}_{\mathcal{F}}$ computes its Whittle index values $\lambda_{y_k}^*$ in non-increasing order.
(b)
If the project is indexable, then it satisfies conditions (i) and (ii) in part (a) for some nested family of adjacent active sets of the form $\mathcal{F} = \{S^0, S^1, \dots, S^N\}$ with $S^0 = \emptyset \subsetneq S^1 \subsetneq \cdots \subsetneq S^N = \mathcal{Y}^{\{0,1\}}$.
In order to prove Theorem 1, we need to establish a number of preliminary results. Before doing so, let us clarify the improvement that the new sufficient $\mathcal{F}$-indexability conditions (i) and (ii) in Theorem 1(a) represent over those introduced in Niño-Mora [30,31] based on PCLs, which are:
(i)
for every $S \in \mathcal{F}$: $g_y^{S} > 0$ for $y \in \mathcal{Y}^{\{0,1\}}$;
(ii)
algorithm $\mathrm{AG}_{\mathcal{F}}$ computes the index values $\lambda_{y_k}^*$ in non-increasing order: $\lambda_{y_1}^* \ge \lambda_{y_2}^* \ge \cdots \ge \lambda_{y_N}^*$.
Thus, the new condition (i) in Theorem 1(a), as formulated in (16), is significantly less stringent than the old condition (i). Further, the reformulation in (17) clarifies its intuitive meaning: the resource consumption metric $G_y^{S}$ is monotone non-decreasing in the active set $S$ within the domain $\mathcal{F}$, and two nested active sets $S \subsetneq S'$ in $\mathcal{F}$ give different resource consumption vectors $(G_y^{S})_{y\in\mathcal{Y}}$ and $(G_y^{S'})_{y\in\mathcal{Y}}$.
As for the old condition (ii), the author has found that, in complex models with a multidimensional state, it can be elusive to establish it analytically. In contrast, the new condition (ii) in Theorem 1(a) allows one either to draw on the rich literature on the optimality of structured policies for special models, or to deploy ad hoc DP arguments to prove the optimality of $\mathcal{F}$-policies for the model at hand.
Note that [50] has proposed sufficient $\mathcal{F}$-indexability conditions, which are, however, significantly more restrictive than those herein. Thus, the conditions in [50] require, besides (i) and (ii) in Theorem 1(a), further assumptions, including that the resource metric be submodular and the reward metric be supermodular in the active set. Theorem 1(a) shows that such extra assumptions are unnecessary.
Theorem 1(b) further assures that the new conditions are also necessary for indexability, in the sense that any indexable restless bandit satisfies them relative to some nested active-set family F , as stated.
We start by establishing the equivalence between the formulations (16) and (17) of condition (i) in Theorem 1(a), drawing on the results in Niño-Mora (Section 6 of [31]) for Markovian restless bandits and (Section 4 of [32]) for semi-Markov restless bandits. These refer to relations between resource and reward metrics and their marginal counterparts, via the state-action occupancy measures
$$x_{yy'}^{a,\pi} \triangleq E_y^{\pi}\Bigg[\sum_{k=0}^{\infty} 1\big\{Y(t_k) = y',\ A(t_k) = a\big\}\,\beta^{t_k}\Bigg].$$
Note that $x_{yy'}^{a,\pi}$ measures the expected total discounted number of decision periods in which action $a$ is chosen in state $y'$ under policy $\pi$, starting from state $y$. In the present notation, the relevant relations are
$$G_y^{S\setminus\{y'\}} = G_y^{S} - g_{y'}^{S}\, x_{yy'}^{0,\, S\setminus\{y'\}},\ \ y' \in S; \qquad G_y^{S\cup\{y'\}} = G_y^{S} + g_{y'}^{S}\, x_{yy'}^{1,\, S\cup\{y'\}},\ \ y' \in \mathcal{Y}^{\{0,1\}}\setminus S, \tag{19}$$
and
$$F_y^{S\setminus\{y'\}} = F_y^{S} - f_{y'}^{S}\, x_{yy'}^{0,\, S\setminus\{y'\}},\ \ y' \in S; \qquad F_y^{S\cup\{y'\}} = F_y^{S} + f_{y'}^{S}\, x_{yy'}^{1,\, S\cup\{y'\}},\ \ y' \in \mathcal{Y}^{\{0,1\}}\setminus S. \tag{20}$$
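These identities are straightforward to verify numerically; the following sketch (illustrative random data, unit active resource consumption) checks the second identity in (19) by computing both sides from linear solves:

```python
import numpy as np

# Numerical sanity check of the second identity in (19),
#   G_y^{S u {y'}} = G_y^S + g_{y'}^S * x_{y y'}^{1, S u {y'}},
# for a Markovian restless bandit with Q^1 = 1, Q^0 = 0.
# Instance data are illustrative.
n, beta = 4, 0.9
rng = np.random.default_rng(7)
P = {a: rng.dirichlet(np.ones(n), size=n) for a in (0, 1)}

def work_metric(S):
    """G^S and the transition matrix of the S-active policy."""
    act = np.array([y in S for y in range(n)])
    Ppol = np.where(act[:, None], P[1], P[0])
    G = np.linalg.solve(np.eye(n) - beta * Ppol, act.astype(float))
    return G, Ppol

S, y1 = {0}, 2                           # active set S and y' not in S
G_S, _ = work_metric(S)
G_S1, Ppol1 = work_metric(S | {y1})
g_y1 = (1.0 + beta * P[1][y1] @ G_S) - (0.0 + beta * P[0][y1] @ G_S)
occ = np.linalg.solve(np.eye(n) - beta * Ppol1, np.eye(n)[:, y1])
print(np.allclose(G_S1, G_S + g_y1 * occ))   # -> True
```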
Lemma 3.
Conditions (16) and (17) in Theorem 1(a) are equivalent.
Proof. 
Suppose that (16) holds for a certain $S \in \mathcal{F}$. On the one hand, $g_{y'}^{S} > 0$ for $y' \in S$ such that $S\setminus\{y'\} \in \mathcal{F}$, along with $x_{yy'}^{0,\,S\setminus\{y'\}} \ge 0$ for any $y$, implies, via the first identity in (19), that $G_y^{S\setminus\{y'\}} \le G_y^{S}$ for all $y$; further, taking $y = y'$, we obtain $G_{y'}^{S\setminus\{y'\}} < G_{y'}^{S}$, since $x_{y'y'}^{0,\,S\setminus\{y'\}} > 0$. Hence, $\big(G_y^{S\setminus\{y'\}}\big)_{y\in\mathcal{Y}} \le \big(G_y^{S}\big)_{y\in\mathcal{Y}}$, with the two vectors distinct, for such $y'$. On the other hand, $g_{y'}^{S} > 0$ for $y' \in \mathcal{Y}^{\{0,1\}}\setminus S$ such that $S\cup\{y'\} \in \mathcal{F}$, along with $x_{yy'}^{1,\,S\cup\{y'\}} \ge 0$ for any $y$, implies, via the second identity in (19), that $G_y^{S} \le G_y^{S\cup\{y'\}}$ for all $y$; further, taking $y = y'$, we obtain $G_{y'}^{S} < G_{y'}^{S\cup\{y'\}}$, since $x_{y'y'}^{1,\,S\cup\{y'\}} > 0$. Hence, $\big(G_y^{S}\big)_{y\in\mathcal{Y}} \le \big(G_y^{S\cup\{y'\}}\big)_{y\in\mathcal{Y}}$, with the two vectors distinct, for such $y'$. Now, the proven relations imply (17) via Assumption 2(ii).
Conversely, suppose that (17) holds for a certain $S \in \mathcal{F}$. Then, on the one hand, $\big(G_y^{S\setminus\{y'\}}\big)_{y\in\mathcal{Y}} \le \big(G_y^{S}\big)_{y\in\mathcal{Y}}$, with the two vectors distinct, for $y' \in S$ such that $S\setminus\{y'\} \in \mathcal{F}$. This, along with $x_{yy'}^{0,\,S\setminus\{y'\}} \ge 0$ for every $y$, implies, via the first identity in (19), that $g_{y'}^{S} > 0$ for such $y'$. On the other hand, $\big(G_y^{S}\big)_{y\in\mathcal{Y}} \le \big(G_y^{S\cup\{y'\}}\big)_{y\in\mathcal{Y}}$, with the two vectors distinct, for $y' \in \mathcal{Y}^{\{0,1\}}\setminus S$ such that $S\cup\{y'\} \in \mathcal{F}$. This, along with $x_{yy'}^{1,\,S\cup\{y'\}} \ge 0$ for every $y$, implies, via the second identity in (19), that $g_{y'}^{S} > 0$ for such $y'$. Therefore, (16) holds, which completes the proof. □

3.4. Proving Theorem 1: Achievable Resource-Reward Performance Region Approach

We next deploy an approach to prove Theorem 1 that draws on first principles, via an intuitive geometric and economic viewpoint introduced in [31,32]. We will find it convenient to consider, instead of (10), the λ-price problem obtained by using the averaged resource and reward metrics, where the initial project state $Y(0)$ is drawn from a distribution $p$ with positive probability mass $p_y > 0$ at every state $y \in \mathcal{Y}$,
$$G^{\pi} \triangleq \sum_{y\in\mathcal{Y}} p_y\, G_y^{\pi} \quad \text{and} \quad F^{\pi} \triangleq \sum_{y\in\mathcal{Y}} p_y\, F_y^{\pi},$$
i.e.,
$$\underset{\pi\in\Pi}{\text{maximize}}\quad F^{\pi} - \lambda\, G^{\pi}. \tag{22}$$
Relative to such metrics, consider the project’s achievable resource-reward performance region
$$\mathcal{H} \triangleq \big\{ (G^{\pi}, F^{\pi}) \colon \pi \in \Pi \big\},$$
which is defined as the region in the resource-reward plane consisting of all performance points $(G^{\pi}, F^{\pi})$ achievable under admissible project operating policies $\pi \in \Pi$. The optimality of stationary deterministic policies for infinite-horizon finite-state and -action SMDPs ensures that $\mathcal{H}$ is the closed convex polygon spanned as the convex hull of the points $(G^{S}, F^{S})$ for active sets $S \subseteq \mathcal{Y}^{\{0,1\}}$. Thus, we can reformulate the λ-price problem (22) as the linear programming (LP) problem
$$\underset{(G,F)\in\mathcal{H}}{\text{maximize}}\quad F - \lambda\, G. \tag{24}$$
In order to illustrate and clarify such an approach, consider the concrete example of a restless bandit with state space $\mathcal{Y} = \mathcal{Y}^{\{0,1\}} = \{1, 2, 3\}$ discussed in (Section 2.2 of [34]). For such a project, Figure 1 in that paper plots the achievable resource-reward performance region $\mathcal{H}$, with points $(G^{S}, F^{S})$ labeled by their active sets $S$.
The fact that such a project is indexable is apparent from the structure of the upper boundary of $\mathcal{H}$,
$$\bar\partial\mathcal{H} \triangleq \big\{ (G, F) \in \mathcal{H} \colon \tilde F \le F \ \text{for every } (\tilde G, \tilde F) \in \mathcal{H} \text{ with } \tilde G = G \big\},$$
as this is determined from left to right by an increasing nested family of adjacent active sets connecting $\emptyset$ to $\mathcal{Y}^{\{0,1\}}$: $\mathcal{F} = \{\emptyset, \{1\}, \{1,2\}, \{1,2,3\}\}$. Thus, the Whittle indices of the states are given by the successive slopes, which measure the marginal reward versus resource trade-off rates:
$$\lambda_1^* = \frac{F^{\{1\}} - F^{\emptyset}}{G^{\{1\}} - G^{\emptyset}} \ \ge\ \lambda_2^* = \frac{F^{\{1,2\}} - F^{\{1\}}}{G^{\{1,2\}} - G^{\{1\}}} \ \ge\ \lambda_3^* = \frac{F^{\{1,2,3\}} - F^{\{1,2\}}}{G^{\{1,2,3\}} - G^{\{1,2\}}}. \tag{26}$$
In this example, the geometry of the top-down adaptive-greedy algorithm $\mathrm{AG}_{\mathcal{F}}$ corresponds to traversing the upper boundary $\bar\partial\mathcal{H}$ from left to right, proceeding at each step by augmenting the current active set with a new state in a locally greedy fashion, as the slopes in (26) are equivalently formulated as
$$\lambda_1^* = \frac{f_1^{\emptyset}}{g_1^{\emptyset}} \ \ge\ \lambda_2^* = \frac{f_2^{\{1\}}}{g_2^{\{1\}}} \ \ge\ \lambda_3^* = \frac{f_3^{\{1,2\}}}{g_3^{\{1,2\}}}. \tag{27}$$
The insights conveyed by such an example extend to the general setting of concern herein, as elucidated in Niño-Mora [31,32,34]. Thus, the indexability of a project is recast as a property of the upper boundary $\bar\partial\mathcal{H}$ of the region $\mathcal{H}$, whereby it is determined by a nested active-set family as in the example. Note that the equivalence between the geometric slopes in (26) and the marginal productivity rates in (27) follows from (19) and (20) or, more precisely, from the corresponding relations for the averaged metrics,
$$G^{S\setminus\{y'\}} = G^{S} - g_{y'}^{S}\, x_{y'}^{0,\, S\setminus\{y'\}},\ \ y' \in S; \qquad G^{S\cup\{y'\}} = G^{S} + g_{y'}^{S}\, x_{y'}^{1,\, S\cup\{y'\}},\ \ y' \in \mathcal{Y}^{\{0,1\}}\setminus S,$$
and
$$F^{S\setminus\{y'\}} = F^{S} - f_{y'}^{S}\, x_{y'}^{0,\, S\setminus\{y'\}},\ \ y' \in S; \qquad F^{S\cup\{y'\}} = F^{S} + f_{y'}^{S}\, x_{y'}^{1,\, S\cup\{y'\}},\ \ y' \in \mathcal{Y}^{\{0,1\}}\setminus S,$$
where $x_{y'}^{a,\pi}$ is the state-action occupancy measure obtained by drawing the initial state according to the probabilities $p_y$. Thus, assuming condition (i) in Theorem 1(a), we have, for $S \in \mathcal{F}$,
$$\frac{f_y^{S}}{g_y^{S}} = \begin{cases} \dfrac{F^{S} - F^{S\setminus\{y\}}}{G^{S} - G^{S\setminus\{y\}}}, & y \in S,\ S\setminus\{y\} \in \mathcal{F},\\[2ex] \dfrac{F^{S\cup\{y\}} - F^{S}}{G^{S\cup\{y\}} - G^{S}}, & y \in \mathcal{Y}^{\{0,1\}}\setminus S,\ S\cup\{y\} \in \mathcal{F}. \end{cases}$$
Such relations allow us to reformulate the adaptive-greedy algorithm $\mathrm{AG}_{\mathcal{F}}$ in Algorithm 1 into the geometrically intuitive form shown in Algorithm 2. Such a reformulation clarifies that the algorithm seeks to traverse the upper boundary $\bar\partial\mathcal{H}$ from left to right, proceeding at each step by augmenting the current active set with a new state in a locally greedy fashion, while only using active sets in $\mathcal{F}$.
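The geometric picture can be reproduced numerically for a small instance: enumerate the averaged performance points $(G^{S}, F^{S})$ of all stationary deterministic policies and trace the upper boundary from $\emptyset$ to the full set by greedy slope maximization, as in Algorithm 2. A sketch on illustrative data (with the state frozen when passive, so that the work metric is strictly increasing):

```python
import numpy as np
from itertools import combinations

# Sketch: enumerate averaged performance points (G^S, F^S) under a
# uniform initial distribution and traverse the upper boundary of
# their convex hull from S = {} to the full set, reading off index
# values as slopes (cf. (26)-(27)). Illustrative data; the project
# is frozen when passive.
n, beta = 3, 0.9
rng = np.random.default_rng(4)
P1 = rng.dirichlet(np.ones(n), size=n)
R1 = rng.uniform(0.0, 1.0, n)
p0 = np.full(n, 1.0 / n)

def avg_point(S):
    act = np.array([y in S for y in range(n)])
    Ppol = np.where(act[:, None], P1, np.eye(n))
    M = np.eye(n) - beta * Ppol
    G = np.linalg.solve(M, act.astype(float))
    F = np.linalg.solve(M, np.where(act, R1, 0.0))
    return p0 @ G, p0 @ F

pts = {frozenset(s): avg_point(frozenset(s))
       for r in range(n + 1) for s in combinations(range(n), r)}

S = frozenset()
while S != frozenset(range(n)):
    G0, F0 = pts[S]
    # move to the adjacent superset of largest slope (marginal rate)
    best = max((S | {y} for y in range(n) if y not in S),
               key=lambda T: (pts[T][1] - F0) / (pts[T][0] - G0))
    y_new, = best - S
    slope = (pts[best][1] - F0) / (pts[best][0] - G0)
    print(f"state {y_new}: index {slope:.4f}")
    S = best
```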
We next establish a number of preliminary results on which the proof of Theorem 1 will draw. The first shows that the family of optimal active sets for the λ-price problem is a lattice that contains its intervals.
Lemma 4.
If $S$ and $S'$ are optimal active sets for (22), then so is any $S''$ satisfying $S \cap S' \subseteq S'' \subseteq S \cup S'$.
Proof. 
The result is an immediate property of the DP Equations (11) characterizing the optimal stationary deterministic policies (i.e., the optimal active sets) for the λ -price problem. □
The following result shows that, under condition (i) in Theorem 1(a), the resource consumption metric $G^{S}$ is strictly increasing relative to active-set inclusion on the domain $S \in \mathcal{F}$.
Lemma 5.
Suppose that condition (i) in Theorem 1(a) holds. Then, $G^{S} < G^{S'}$ for $S \subsetneq S'$, $S, S' \in \mathcal{F}$.
Proof. 
The result follows immediately from the formulation (17) of condition (i), along with the assumption of positive initial state probabilities $p_y > 0$ for $y \in \mathcal{Y}$. □
The next result establishes, under conditions (i) and (ii) in Theorem 1(a), the non-degeneracy of the extreme points of $\mathcal{H}$ in the upper boundary $\bar\partial\mathcal{H}$, showing that each is achieved by a unique active set in $\mathcal{F}$.
Lemma 6.
Suppose that conditions (i) and (ii) in Theorem 1(a) hold. Then, for every extreme point $(G^*, F^*)$ of $\mathcal{H}$ in $\bar\partial\mathcal{H}$, there exists a unique active set $S^* \in \mathcal{F}$ achieving it, i.e., with $(G^*, F^*) = (G^{S^*}, F^{S^*})$.
Proof. 
Because $(G^*, F^*)$ is an extreme point of $\mathcal{H}$ in $\bar\partial\mathcal{H}$, there exists a resource price $\lambda^*$ such that $(G^*, F^*)$ is the unique solution to the LP problem (24) for $\lambda = \lambda^*$. Now, condition (ii) in Theorem 1(a) ensures that there exists an active set $S^* \in \mathcal{F}$ that is optimal for the $\lambda^*$-price problem (22), i.e., such that $(G^*, F^*) = (G^{S^*}, F^{S^*})$. Let us argue by contradiction that such an active set is unique, assuming that there exists a different active set $S^{**} \in \mathcal{F}$ for which $(G^*, F^*) = (G^{S^{**}}, F^{S^{**}})$. Then, by Assumption 2(iii) and Lemma 4, both $S^* \cap S^{**}$ and $S^* \cup S^{**}$ would belong to $\mathcal{F}$ and be optimal for the $\lambda^*$-price problem. Therefore,
$$(G^*, F^*) = (G^{S^*}, F^{S^*}) = \big(G^{S^* \cap S^{**}}, F^{S^* \cap S^{**}}\big) = \big(G^{S^* \cup S^{**}}, F^{S^* \cup S^{**}}\big). \tag{31}$$
Now, since $S^* \ne S^{**}$, at least one of the strict inclusions $S^* \cap S^{**} \subsetneq S^*$ or $S^* \subsetneq S^* \cup S^{**}$ holds. In the first case, Lemma 5 gives $G^{S^* \cap S^{**}} < G^{S^*}$, which contradicts (31); in the second case, it gives $G^{S^*} < G^{S^* \cup S^{**}}$, which again contradicts (31). Therefore, there cannot exist such an $S^{**}$, which completes the proof. □
We can now prove Theorem 1.
Proof of Theorem 1.
(a) We will show that the project is $\mathcal{F}$-indexable by using the geometric characterization of indexability reviewed in the present section, namely, by showing that the upper boundary $\bar\partial\mathcal{H}$ is determined by an increasing nested family of adjacent active sets in $\mathcal{F}$ connecting $\emptyset$ to $\mathcal{Y}^{\{0,1\}}$. We refer the reader to the plot in Figure 1 for a geometric illustration of the following arguments.
Let us start by showing that the extreme points of $\mathcal{H}$ that determine $\bar\partial\mathcal{H}$ are attained, from left to right, by a unique increasing chain of active sets in $\mathcal{F}$—not necessarily adjacent. Thus, consider two adjacent extreme points of $\mathcal{H}$ in $\bar\partial\mathcal{H}$, i.e., joined by a line segment in $\bar\partial\mathcal{H}$. By Lemma 6, there exist two unique and distinct active sets $S, S' \in \mathcal{F}$ whose performance points $(G^{S}, F^{S})$ and $(G^{S'}, F^{S'})$ achieve such extreme points, where we assume, without loss of generality, that $G^{S} < G^{S'}$. We will show that it must be $S \subsetneq S'$. Letting $\lambda = (F^{S'} - F^{S})/(G^{S'} - G^{S})$ be the slope of the line segment joining such extreme points, we have that both $S$ and $S'$ solve the λ-price problem and, hence, by Lemma 4, so do $S \cap S'$ and $S \cup S'$. Now, from the stated properties of $S$ and $S'$, it follows that the points $(G^{S\cap S'}, F^{S\cap S'})$ and $(G^{S\cup S'}, F^{S\cup S'})$ must lie on the line segment joining $(G^{S}, F^{S})$ and $(G^{S'}, F^{S'})$ and, hence, $G^{S\cap S'}, G^{S\cup S'} \in [G^{S}, G^{S'}]$. Further, since, by Assumption 2(iii), $S\cap S', S\cup S' \in \mathcal{F}$, Lemma 5 gives that $G^{S\cap S'} \le G^{S}$ and $G^{S'} \le G^{S\cup S'}$. Therefore,
$$G^{S\cap S'} = G^{S} \quad \text{and} \quad G^{S\cup S'} = G^{S'}. \tag{32}$$
We next argue, by contradiction, that $S \subsetneq S'$: if such were not the case, i.e., $S \not\subseteq S'$, then it would follow that $S\cap S' \subsetneq S$ and, hence, by Lemma 5, $G^{S\cap S'} < G^{S}$, contradicting (32).
Let us next show that, if two adjacent extreme points $(G^{S}, F^{S})$ and $(G^{S'}, F^{S'})$ in $\bar\partial\mathcal{H}$, with $G^{S} < G^{S'}$, are determined by non-adjacent active sets $S \subsetneq S'$ in such a chain, they can be connected from left to right by points in $\bar\partial\mathcal{H}$ attained by an increasing chain of adjacent active sets in $\mathcal{F}$. On the one hand, Assumption 2(ii) ensures the existence of an increasing chain of adjacent active sets in $\mathcal{F}$ connecting $S$ to $S'$: $S = T^0 \subsetneq T^1 \subsetneq \cdots \subsetneq T^{k-1} \subsetneq T^k = S'$. On the other hand, if $\lambda = (F^{S'} - F^{S})/(G^{S'} - G^{S})$ is the slope of the line segment joining such extreme points, then both $S$ and $S'$ solve the λ-price problem and, hence, by Lemma 4, so does every intermediate active set $T^1, \dots, T^{k-1}$ in such a chain. Hence, Lemma 5 ensures that $G^{S} < G^{T^1} < \cdots < G^{T^{k-1}} < G^{S'}$, as required.
In order to establish $\mathcal{F}$-indexability, it only remains to show that the leftmost (resp. rightmost) extreme point of $\mathcal{H}$ in $\bar\partial\mathcal{H}$ is attained by the active set $S = \emptyset$ (resp. $S = \mathcal{Y}^{\{0,1\}}$). This follows from Assumption 2(i), condition (ii) in Theorem 1(a), and Lemma 5 (which ensures that $G^{\emptyset} < G^{S} < G^{\mathcal{Y}^{\{0,1\}}}$ for $S \in \mathcal{F}$ with $\emptyset \ne S \ne \mathcal{Y}^{\{0,1\}}$).
Having established $\mathcal{F}$-indexability, the result that algorithm $\mathrm{AG}_{\mathcal{F}}$ computes the project's Whittle index follows immediately from the algorithm's geometric interpretation, as revealed by its reformulation in Algorithm 2.
(b) Suppose now that the project is indexable. Then, $\bar\partial\mathcal{H}$ is determined by some increasing chain of adjacent active sets connecting $\emptyset$ to $\mathcal{Y}^{\{0,1\}}$: $S^0 = \emptyset \subsetneq S^1 \subsetneq \cdots \subsetneq S^N = \mathcal{Y}^{\{0,1\}}$. Letting $\mathcal{F} \triangleq \{S^0, S^1, \dots, S^N\}$, it is readily seen that such an active-set family satisfies conditions (i) and (ii) in part (a). This completes the proof. □

4. Application to Projects with Setup Delays and Costs

This section deploys the above framework and results on restless bandit indexation in our motivating model: the restless bandit reformulation of a non-restless bandit with setup costs and delays (and no setdown penalties; cf. Section 2.1), as discussed in Section 2. The project label $m$ is dropped hereafter from the notation.
In this reformulation, all of the augmented states are controllable, i.e., $\mathcal{Y} = \mathcal{Y}^{\{0,1\}}$, and an active-state subset of the augmented state space $\mathcal{Y}$ representing a stationary deterministic policy is given by specifying the original-state subsets $S_0, S_1 \subseteq \mathcal{X}$ such that the project is engaged when it was previously rested (resp. engaged) if the state $X(t)$ belongs to $S_0$ (resp. to $S_1$). We will denote such an active set/policy, as in [27], by
$$S_0\langle S_1 \triangleq \big(\{0\} \times S_0\big) \,\cup\, \big(\{1\} \times S_1\big) \subseteq \mathcal{Y}.$$
We next address the issue of guessing an appropriate family $\mathcal{F}$ of active sets $S_0\langle S_1$ that contains optimal active sets for the λ-price problem of concern (cf. (10)), which is now formulated as
$$\underset{\pi\in\Pi}{\text{maximize}}\quad F_{(a,i)}^{\pi} - \lambda\, G_{(a,i)}^{\pi}, \tag{33}$$
where $F_{(a,i)}^{\pi}$ and $G_{(a,i)}^{\pi}$ are the reward and resource (work) metrics given by
$$F_{(a,i)}^{\pi} \triangleq E_{(a,i)}^{\pi}\Bigg[\sum_{k=0}^{\infty} R_{Y(t_k)}^{A(t_k)}\,\beta^{t_k}\Bigg] \quad \text{and} \quad G_{(a,i)}^{\pi} \triangleq E_{(a,i)}^{\pi}\Bigg[\sum_{k=0}^{\infty} Q_{Y(t_k)}^{A(t_k)}\,\beta^{t_k}\Bigg].$$
The intuition that, under Assumption 1, if engaging the project is optimal when it was not set up, then engaging it should also be optimal when it was set up, leads us to posit the following choice of $\mathcal{F}$:
$$\mathcal{F} \triangleq \big\{ S_0\langle S_1 \colon S_0 \subseteq S_1 \subseteq \mathcal{X} \big\}. \tag{35}$$
Such an $\mathcal{F}$ represents a family of policies that satisfies Assumption 2. If $S_0 \subsetneq S_1$, the policy $S_0\langle S_1 \in \mathcal{F}$ has the hysteresis region $S_1 \setminus S_0$: when the original state $X(t)$ lies in $S_1 \setminus S_0$, the policy sticks to the previously chosen action. We will seek to prove indexability with respect to such a family of policies, i.e., $\mathcal{F}$-indexability.
Note that the marginal work, reward, and productivity metrics, defined in general by (12)–(15), now take the form
$$g_{(a,i)}^{S_0\langle S_1} \triangleq G_{(a,i)}^{\langle 1,\, S_0\langle S_1 \rangle} - G_{(a,i)}^{\langle 0,\, S_0\langle S_1 \rangle},$$
$$f_{(a,i)}^{S_0\langle S_1} \triangleq F_{(a,i)}^{\langle 1,\, S_0\langle S_1 \rangle} - F_{(a,i)}^{\langle 0,\, S_0\langle S_1 \rangle},$$
and, for $g_{(a,i)}^{S_0\langle S_1} \ne 0$,
$$\lambda_{(a,i)}^{S_0\langle S_1} \triangleq \frac{f_{(a,i)}^{S_0\langle S_1}}{g_{(a,i)}^{S_0\langle S_1}}.$$
We next adapt the general top-down adaptive-greedy algorithm $\mathrm{AG}_{\mathcal{F}}$ of Algorithm 1 to the present setting, which yields Algorithm 3, where $n \triangleq |\mathcal{X}|$ is now the number of project states in the non-restless formulation. The output of the algorithm has been decoupled, noting that, at every step, the algorithm expands the current active set $S_0^{k_0-1}\langle S_1^{k_1-1}$ by adding a state that can be either of the form $(0, i_0^{k_0})$ or $(1, i_1^{k_1})$. Thus, instead of using a single counter $k$ ranging from 0 to $2n$, two counters $k_0$ and $k_1$ are used, related to the single counter by $k = k_0 + k_1 - 1$. Henceforth, we use a more algorithm-like notation, writing, e.g., $\lambda_{(0,j)}^{S_0^{k_0-1}\langle S_1^{k_1-1}}$ as $\lambda_{(0,j)}^{(k_0-1,\,k_1-1)}$. Note that the active sets $S_0^{k_0}$ and $S_1^{k_1}$ generated by the algorithm are given by $S_0^{k_0} = \{i_0^1, \dots, i_0^{k_0}\}$ and $S_1^{k_1} = \{i_1^1, \dots, i_1^{k_1}\}$, and satisfy $S_0^{k_0} \subseteq S_1^{k_1}$ for $1 \le k_0 \le k_1 \le n$, consistently with (35). Thus, the algorithm produces a decoupled output consisting of two augmented-state strings $(0, i_0^{k_0})$ and $(1, i_1^{k_1})$, which jointly span $\mathcal{Y}$, along with the corresponding switching and continuation index values $\lambda_{(0,i_0^{k_0})}^*$ and $\lambda_{(1,i_1^{k_1})}^*$.
Algorithm 3: Adaptation of index algorithm $\mathrm{AG}_{\mathcal{F}}$ to the present model.
  Output: $\big\{\big((0, i_0^{k_0}),\, \lambda_{(0,i_0^{k_0})}^*\big)\big\}_{k_0=1}^{n}$, $\big\{\big((1, i_1^{k_1}),\, \lambda_{(1,i_1^{k_1})}^*\big)\big\}_{k_1=1}^{n}$
  $S_0^0 := \emptyset$;  $S_1^0 := \emptyset$;  $k_0 := 1$;  $k_1 := 1$
  while $k_0 + k_1 \le 2n + 1$ do
    if $k_1 \le n$, choose $j_1^{\max} \in \arg\max\big\{\lambda_{(1,j)}^{(k_0-1,\,k_1-1)} \colon j \in \mathcal{X} \setminus S_1^{k_1-1}\big\}$
    if $k_0 < k_1$, choose $j_0^{\max} \in \arg\max\big\{\lambda_{(0,j)}^{(k_0-1,\,k_1-1)} \colon j \in S_1^{k_1-1} \setminus S_0^{k_0-1}\big\}$
    if $k_1 = n+1$, or $k_0 < k_1 \le n$ and $\lambda_{(1,j_1^{\max})}^{(k_0-1,\,k_1-1)} < \lambda_{(0,j_0^{\max})}^{(k_0-1,\,k_1-1)}$, then
      $i_0^{k_0} := j_0^{\max}$;  $\lambda_{(0,i_0^{k_0})}^* := \lambda_{(0,i_0^{k_0})}^{(k_0-1,\,k_1-1)}$;  $S_0^{k_0} := S_0^{k_0-1} \cup \{i_0^{k_0}\}$;  $k_0 := k_0 + 1$
    else
      $i_1^{k_1} := j_1^{\max}$;  $\lambda_{(1,i_1^{k_1})}^* := \lambda_{(1,i_1^{k_1})}^{(k_0-1,\,k_1-1)}$;  $S_1^{k_1} := S_1^{k_1-1} \cup \{i_1^{k_1}\}$;  $k_1 := k_1 + 1$
    end { if }
  end { while }
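The following sketch mirrors Algorithm 3's two-counter control flow for the normalized model, evaluating the marginal productivity metrics by brute-force linear solves in the augmented state space (anticipating the representations derived in Section 4.2); it is a reference implementation for validation on illustrative data, not the fast two-stage method developed in Section 5:

```python
import numpy as np

# Reference sketch of Algorithm 3 (illustrative data, normalized model
# with no setdown penalties). Metrics under the active set S0<S1 are
# evaluated by solving the evaluation equations of Section 4.2.
n, beta = 3, 0.9
rng = np.random.default_rng(5)
P = rng.dirichlet(np.ones(n), size=n)
R = rng.uniform(0.0, 1.0, n)
c = rng.uniform(0.0, 0.3, n)            # setup costs
phi = rng.uniform(beta, 1.0, n)         # setup-delay transforms

def metrics(S0, S1):
    """Work/reward metrics over augmented states (1, i) and (0, i)."""
    G1, F1 = np.zeros(n), np.zeros(n)
    if S1:
        idx = sorted(S1)
        A = np.eye(len(idx)) - beta * P[np.ix_(idx, idx)]
        G1[idx] = np.linalg.solve(A, np.ones(len(idx)))
        F1[idx] = np.linalg.solve(A, R[idx])
    G0, F0 = np.zeros(n), np.zeros(n)
    for i in S0:
        G0[i] = (1 - phi[i]) / (1 - beta) + phi[i] * G1[i]
        F0[i] = -c[i] + phi[i] * F1[i]
    return G1, F1, G0, F0

def mp(a, i, S0, S1):
    """Marginal productivity lambda_(a,i) under the active set S0<S1."""
    G1, F1, G0, F0 = metrics(S0, S1)
    if a == 1:
        g = 1 + beta * P[i] @ G1 - beta * G0[i]
        f = R[i] + beta * P[i] @ F1 - beta * F0[i]
    else:
        g = ((1 - phi[i]) / (1 - beta)
             + phi[i] * (1 + beta * P[i] @ G1) - beta * G0[i])
        f = -c[i] + phi[i] * (R[i] + beta * P[i] @ F1) - beta * F0[i]
    return f / g

S0, S1, k0, k1 = set(), set(), 1, 1
switch_idx, cont_idx = {}, {}
while k0 + k1 <= 2 * n + 1:
    j1 = (max((j for j in range(n) if j not in S1),
              key=lambda j: mp(1, j, S0, S1)) if k1 <= n else None)
    j0 = (max(S1 - S0, key=lambda j: mp(0, j, S0, S1))
          if k0 < k1 else None)
    if k1 == n + 1 or (j0 is not None
                       and mp(1, j1, S0, S1) < mp(0, j0, S0, S1)):
        switch_idx[j0] = mp(0, j0, S0, S1); S0.add(j0); k0 += 1
    else:
        cont_idx[j1] = mp(1, j1, S0, S1); S1.add(j1); k1 += 1
print("continuation index:", cont_idx)
print("switching index:", switch_idx)
```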

4.1. Proving That $\mathcal{F}$-Policies Are Optimal

We next aim to establish that condition (ii) in Theorem 1(a) is satisfied by the present model, i.e., that $\mathcal{F}$-policies—those with active sets $S_0\langle S_1 \in \mathcal{F}$ as defined by (35)—suffice to solve the λ-price problem (33) for any price $\lambda \in \mathbb{R}$. We will use the DP optimality equations characterizing the optimal value function $V_{(a,i)}^*(\lambda)$ for problem (33), starting from each augmented state $(a,i) \in \mathcal{Y}$: thus, for each original state $i \in \mathcal{X}$,
$$\begin{aligned}
V_{(1,i)}^{*}(\lambda) &= \max\Bigg\{\beta\, V_{(0,i)}^{*}(\lambda),\ \ R_i - \lambda + \beta \sum_{j\in\mathcal{X}} p_{ij}\, V_{(1,j)}^{*}(\lambda)\Bigg\},\\
V_{(0,i)}^{*}(\lambda) &= \max\Bigg\{\beta\, V_{(0,i)}^{*}(\lambda),\ \ -c_i - \frac{1-\phi_i}{1-\beta}\,\lambda + \phi_i\Bigg(R_i - \lambda + \beta \sum_{j\in\mathcal{X}} p_{ij}\, V_{(1,j)}^{*}(\lambda)\Bigg)\Bigg\}.
\end{aligned}$$
We start by showing that the optimal value function is non-negative.
Lemma 7.
$V_{(a,i)}^{*}(\lambda) \ge 0$.
Proof. 
Because no setdown penalties are assumed (cf. Section 2.1), a possible course of action incurring zero net reward is to set down the project and keep it that way, which yields the result. □
We can now prove the optimality of $\mathcal{F}$-policies.
Lemma 8.
For every $\lambda \in \mathbb{R}$, there exists an optimal active set $S_0\langle S_1 \in \mathcal{F}$ for the λ-price problem (33).
Proof. 
Fix $\lambda \in \mathbb{R}$ and $i \in X$. It suffices to show that, if resting the project is optimal in state $(1,i)$, then it is also optimal to do so in state $(0,i)$. Let us formulate that hypothesis as
$$\beta V_{(0,i)}^*(\lambda) \geq R_i - \lambda + \beta \sum_{j \in X} p_{ij} V_{(1,j)}^*(\lambda). \tag{40}$$
We aim to show that it is then optimal to rest the project in state $(0,i)$, i.e., that
$$\beta V_{(0,i)}^*(\lambda) \geq -c_i - \frac{1-\phi_i}{1-\beta}\, \lambda + \phi_i \left(R_i - \lambda + \beta \sum_{j \in X} p_{ij} V_{(1,j)}^*(\lambda)\right).$$
Consider first the case $\lambda < 0$. We argue, by contradiction, that hypothesis (40) then cannot hold, i.e., it cannot be optimal to rest the project once it is active. Drawing on non-restless bandit theory, note that, when the project is active, it is optimal to rest it only if it ever reaches an original state $j \in X$ at which $\lambda \geq \lambda_j^*$, where $\lambda_j^*$ is the original (non-restless) bandit's Gittins index. Assumption 1(ii) ensures that $\lambda_j^* \geq 0$ for each $j \in X$, and, therefore, it is optimal to keep the project active forever.
Next, consider the case $\lambda \geq 0$. Then, the following chain of inequalities holds:
$$\beta V_{(0,i)}^*(\lambda) \geq R_i - \lambda + \beta \sum_{j \in X} p_{ij} V_{(1,j)}^*(\lambda) \geq -c_i - \frac{1-\phi_i}{1-\beta}\, \lambda + \phi_i \left(R_i - \lambda + \beta \sum_{j \in X} p_{ij} V_{(1,j)}^*(\lambda)\right),$$
where the fact that the second inequality holds becomes apparent by reformulating it as
$$(1-\phi_i) \left(R_i + \beta \sum_{j \in X} p_{ij} V_{(1,j)}^*(\lambda)\right) \geq -c_i - \frac{\beta (1-\phi_i)}{1-\beta}\, \lambda,$$
and noting that Assumption 1(ii) and Lemma 7 ensure that its left-hand side is non-negative, while Assumption 1(i) and $\lambda \geq 0$ ensure that its right-hand side is non-positive. This completes the proof. □

4.2. Work Metric Analysis and $\mathcal{F}$-Indexability Proof

We now consider how to calculate the work and marginal work metrics $G_{(a,i)}^{S_0 \cup S_1}$ and $g_{(a,i)}^{S_0 \cup S_1}$, by relating them to the corresponding metrics $G_i^S$ and $g_i^S$ of the underlying non-restless project. We will further use such analyses to establish that condition (i) in Theorem 1(a) holds for the model of concern, thus allowing us to apply that theorem.
For each $S \subseteq X$, the $G_i^S$ are characterized as the unique solution to the evaluation equations
$$G_i^S = \begin{cases} 1 + \beta \sum_{j \in S} p_{ij} G_j^S & \text{if } i \in S \\ 0 & \text{otherwise.} \end{cases} \tag{41}$$
Further, the marginal work metric $g_i^S$ is evaluated by
$$g_i^S \triangleq G_i^{\langle 1, S \rangle} - G_i^{\langle 0, S \rangle} = 1 + \beta \sum_{j \in X} p_{ij} G_j^S - \beta G_i^S = \begin{cases} (1-\beta)\, G_i^S & \text{if } i \in S \\ 1 + \beta \sum_{j \in S} p_{ij} G_j^S & \text{otherwise.} \end{cases} \tag{42}$$
Note that (41) and (42) imply that
$$g_i^S > 0, \quad i \in X. \tag{43}$$
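Since (41) is simply a linear system restricted to the active states, both metrics are directly computable. The following Python/NumPy sketch is our own illustration (names and interface are not from the paper): it solves (41) for $G_i^S$ and evaluates $g_i^S$ via (42), with the active set passed as a boolean mask.

```python
import numpy as np

def work_metrics(P, beta, S):
    """Solve the evaluation Equations (41) for G_i^S and evaluate the
    marginal work metric g_i^S via (42); S is a boolean mask of length n."""
    n = P.shape[0]
    idx = np.flatnonzero(S)
    G = np.zeros(n)
    if idx.size:
        A = np.eye(idx.size) - beta * P[np.ix_(idx, idx)]
        G[idx] = np.linalg.solve(A, np.ones(idx.size))      # (41) on S
    # (42): (1 - beta) G_i^S on S; 1 + beta * sum_{j in S} p_ij G_j^S off S.
    g = np.where(S, (1 - beta) * G, 1 + beta * P[:, idx] @ G[idx])
    return G, g
```

Consistently with (43), the returned vector $g$ is componentwise positive for $\beta \in (0,1)$.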
We now go back to the project's restless bandit reformulation. The next result, whose proof is omitted as it is immediate, gives the evaluation equations for the work metric $G_{(a,i)}^{S_0 \cup S_1}$ under a given active set.
Lemma 9.
For $S_0 \cup S_1 \in \mathcal{F}$,
$$G_{(0,i)}^{S_0 \cup S_1} = \begin{cases} \dfrac{1-\phi_i}{1-\beta} + \phi_i\, G_{(1,i)}^{S_0 \cup S_1} & \text{if } i \in S_0 \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad G_{(1,i)}^{S_0 \cup S_1} = \begin{cases} 1 + \beta \sum_{j \in X} p_{ij} G_{(1,j)}^{S_0 \cup S_1} & \text{if } i \in S_1 \\ 0 & \text{otherwise.} \end{cases}$$
The following result represents the work metric $G_{(a,i)}^{S_0 \cup S_1}$ in terms of the $G_j^S$.
Lemma 10.
For $S_0 \cup S_1 \in \mathcal{F}$:
(a) $G_{(a,i)}^{S_0 \cup S_1} = G_i^{S_1} = 0$, for $a \in \{0,1\}$, $i \in X \setminus S_1$.
(b) $G_{(1,i)}^{S_0 \cup S_1} = G_i^{S_1}$, for $i \in S_1$.
(c) $G_{(0,i)}^{S_0 \cup S_1} = (1-\phi_i)/(1-\beta) + \phi_i\, G_i^{S_1}$, for $i \in S_0$.
(d) $G_{(0,i)}^{S_0 \cup S_1} = 0$, for $i \in S_1 \setminus S_0$.
Proof. 
(a) The result follows readily from the definition of $S_0 \cup S_1$.
(b) For $i \in S_1$, we have
$$G_{(1,i)}^{S_0 \cup S_1} = 1 + \beta \sum_{j \in S_1} p_{ij} G_{(1,j)}^{S_0 \cup S_1} + \beta \sum_{j \in X \setminus S_1} p_{ij} G_{(1,j)}^{S_0 \cup S_1} = 1 + \beta \sum_{j \in S_1} p_{ij} G_{(1,j)}^{S_0 \cup S_1},$$
using Lemma 9 and part (a). Thus, the $G_{(1,i)}^{S_0 \cup S_1}$ satisfy the equations in (41) characterizing the $G_i^{S_1}$ for $i \in S_1$, which gives the result.
(c) For $i \in S_0$, we have
$$G_{(0,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i\, G_{(1,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i\, G_i^{S_1},$$
using Lemma 9, the inclusion $S_0 \subseteq S_1$, and parts (a, b).
(d) The result follows readily from the definition of $S_0 \cup S_1$. □
Concerning the marginal work metric $g_{(a,i)}^{S_0 \cup S_1}$, (36) and Lemma 9 readily give
$$g_{(1,i)}^{S_0 \cup S_1} = 1 + \beta \sum_{j \in X} p_{ij} G_{(1,j)}^{S_0 \cup S_1} - \beta G_{(0,i)}^{S_0 \cup S_1}, \qquad g_{(0,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i \left(1 + \beta \sum_{j \in X} p_{ij} G_{(1,j)}^{S_0 \cup S_1}\right) - \beta G_{(0,i)}^{S_0 \cup S_1}. \tag{44}$$
The following result represents the marginal work metric $g_{(a,i)}^{S_0 \cup S_1}$ in terms of the $g_j^S$.
Lemma 11.
For every $a \in \{0,1\}$ and $S_0 \cup S_1 \in \mathcal{F}$:
(a) $g_{(1,i)}^{S_0 \cup S_1} = g_i^{S_1}$, for $i \in X \setminus S_1$.
(b) $g_{(0,i)}^{S_0 \cup S_1} = \dfrac{1-\phi_i}{1-\beta} + \phi_i\, g_i^{S_1}$, for $i \in X \setminus S_1$.
(c) $g_{(1,i)}^{S_0 \cup S_1} = \dfrac{1-\beta\phi_i}{1-\beta} \left(g_i^{S_1} - \dfrac{\beta(1-\phi_i)}{1-\beta\phi_i}\right)$, for $i \in S_0$.
(d) $g_{(0,i)}^{S_0 \cup S_1} = 1 - \phi_i + \phi_i\, g_i^{S_1}$, for $i \in S_0$.
(e) $g_{(1,i)}^{S_0 \cup S_1} = \dfrac{g_i^{S_1}}{1-\beta}$, for $i \in S_1 \setminus S_0$.
(f) $g_{(0,i)}^{S_0 \cup S_1} = \dfrac{1-\phi_i}{1-\beta} + \dfrac{\phi_i}{1-\beta}\, g_i^{S_1}$, for $i \in S_1 \setminus S_0$.
Proof. 
(a) For $i \in X \setminus S_1$, we have
$$g_{(1,i)}^{S_0 \cup S_1} = 1 + \beta \sum_{j \in X} p_{ij} G_{(1,j)}^{S_0 \cup S_1} - \beta G_{(0,i)}^{S_0 \cup S_1} = 1 + \beta \sum_{j \in S_1} p_{ij} G_j^{S_1} = g_i^{S_1},$$
using (44), Lemma 10(a,b), and (42).
(b) For $i \in X \setminus S_1$, we can write
$$g_{(0,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i \left(1 + \beta \sum_{j \in X} p_{ij} G_{(1,j)}^{S_0 \cup S_1}\right) - \beta G_{(0,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i \left(1 + \beta \sum_{j \in S_1} p_{ij} G_j^{S_1}\right) = \frac{1-\phi_i}{1-\beta} + \phi_i\, g_i^{S_1},$$
using (44), Lemma 10(a,b), and (42).
(c) For $i \in S_0$, we have
$$g_{(1,i)}^{S_0 \cup S_1} = G_{(1,i)}^{S_0 \cup S_1} - \beta G_{(0,i)}^{S_0 \cup S_1} = G_i^{S_1} - \beta \left(\frac{1-\phi_i}{1-\beta} + \phi_i\, G_i^{S_1}\right) = (1-\beta\phi_i)\, G_i^{S_1} - \beta\, \frac{1-\phi_i}{1-\beta} = \frac{1-\beta\phi_i}{1-\beta} \left(g_i^{S_1} - \frac{\beta(1-\phi_i)}{1-\beta\phi_i}\right),$$
using (44), $S_0 \subseteq S_1$, Lemma 9, Lemma 10(b,c), and (42).
(d) For $i \in S_0$, we obtain
$$g_{(0,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i\, G_{(1,i)}^{S_0 \cup S_1} - \beta G_{(0,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i\, G_i^{S_1} - \beta \left(\frac{1-\phi_i}{1-\beta} + \phi_i\, G_i^{S_1}\right) = 1 - \phi_i + \phi_i (1-\beta)\, G_i^{S_1} = 1 - \phi_i + \phi_i\, g_i^{S_1},$$
using Lemma 9, $S_0 \subseteq S_1$, Lemma 10(b,c), and (42).
(e) For $i \in S_1 \setminus S_0$, we have
$$g_{(1,i)}^{S_0 \cup S_1} = G_{(1,i)}^{S_0 \cup S_1} - \beta G_{(0,i)}^{S_0 \cup S_1} = G_i^{S_1} = \frac{g_i^{S_1}}{1-\beta},$$
using (44), Lemma 9, Lemma 10(d), and (42).
(f) For $i \in S_1 \setminus S_0$, we have
$$g_{(0,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i\, G_{(1,i)}^{S_0 \cup S_1} = \frac{1-\phi_i}{1-\beta} + \phi_i\, G_i^{S_1} = \frac{1-\phi_i}{1-\beta} + \frac{\phi_i}{1-\beta}\, g_i^{S_1},$$
using (44), Lemma 9, Lemma 10(b), and (42). □
It is worth remarking that, at the corresponding point in the analysis of [27] (for the case with no setup delays, $\phi_i \equiv 1$), one could establish the positivity of the marginal work metric, i.e., $g_{(a,i)}^{S_0 \cup S_1} > 0$ for $(a,i) \in Y$, $S_0 \cup S_1 \in \mathcal{F}$, which is the first PCL-indexability condition and implies the less stringent condition (i) in Theorem 1(a). Here, however, it is apparent from Lemma 11(c) that, for $i \in S_0$, $g_{(1,i)}^{S_0 \cup S_1}$ can be negative for $\beta$ close to 1. This is why we cannot use here the same line of argument given in [27] to show indexability.
As mentioned above, we will instead use Theorem 1(a) for such a purpose. The following result shows that condition (i) in that theorem holds for the model of concern.
Lemma 12.
For $S_0 \cup S_1 \in \mathcal{F}$,
$$\begin{aligned} g_{(a,i)}^{S_0 \cup S_1} &> 0, \quad (a,i) \in S_0 \cup S_1 \ \text{ with } \ (S_0 \cup S_1) \setminus \{(a,i)\} \in \mathcal{F}, \\ g_{(a,i)}^{S_0 \cup S_1} &> 0, \quad (a,i) \in Y \setminus (S_0 \cup S_1) \ \text{ with } \ (S_0 \cup S_1) \cup \{(a,i)\} \in \mathcal{F}. \end{aligned}$$
Proof. 
First, consider the case $S_0 \cup S_1 = \emptyset$. Then, using Lemma 11(a,b) along with $g_i^{\emptyset} \equiv 1$ gives that, for $i \in X$,
$$g_{(1,i)}^{\emptyset} = g_i^{\emptyset} = 1 > 0, \qquad g_{(0,i)}^{\emptyset} = \frac{1-\phi_i}{1-\beta} + \phi_i\, g_i^{\emptyset} = \frac{1-\phi_i}{1-\beta} + \phi_i > 0.$$
Now, consider the case $S_0 \cup S_1 = X \cup X = Y$. Then, using Lemma 11(c,d) along with $g_i^X \equiv 1$ gives that, for $i \in X$,
$$g_{(1,i)}^{X \cup X} = \frac{1-\beta\phi_i}{1-\beta} \left(g_i^X - \frac{\beta(1-\phi_i)}{1-\beta\phi_i}\right) = 1 > 0, \qquad g_{(0,i)}^{X \cup X} = 1 - \phi_i + \phi_i\, g_i^X = 1 > 0.$$
Finally, consider $S_0 \cup S_1 \in \mathcal{F}$ different from $\emptyset$ and $X \cup X$. Then, Lemma 11 and (35) imply that the marginal work metric $g_{(a,i)}^{S_0 \cup S_1}$ could only be negative if $a = 1$ and $i \in S_0$. However, such a case is not included in the required conditions, since $(1,i) \in S_0 \cup S_1$ (due to $S_0 \subseteq S_1$), yet $(S_0 \cup S_1) \setminus \{(1,i)\} = S_0 \cup (S_1 \setminus \{i\}) \notin \mathcal{F}$ (since $i \in S_0 \not\subseteq S_1 \setminus \{i\}$). This completes the proof. □
We are now ready to deploy Theorem 1(a) in the present model.
Proposition 1.
The present restless bandit model is $\mathcal{F}$-indexable, and Algorithm 3 computes its Whittle index.
Proof. 
Lemmas 8 and 12 show that conditions (i) and (ii) in Theorem 1(a) hold, respectively, which implies the result. □

4.3. The AT Index Is the Whittle Index

We next use the results above to prove the identity between the Whittle index and the AT index. We reformulate the AT index formulae in (7)–(8) using active sets $S \subseteq X$, rather than stopping times $\tau$. Thus, we can reformulate the continuation and switching AT indices as
$$\lambda_{(1,i)}^{\mathrm{AT}} \triangleq \max_{S \subseteq X \colon i \in S} \frac{F_i^S}{G_i^S}, \tag{45}$$
and
$$\lambda_{(0,i)}^{\mathrm{AT}} \triangleq \max_{S \subseteq X \colon i \in S} \frac{-c_i + \phi_i\, F_i^S}{\dfrac{1-\phi_i}{1-\beta} + \phi_i\, G_i^S}. \tag{46}$$
Recall that we denote the Whittle index by λ ( a , i ) * .
Proposition 2.
For $i \in X$, $\lambda_{(1,i)}^* = \lambda_{(1,i)}^{\mathrm{AT}}$ and $\lambda_{(0,i)}^* = \lambda_{(0,i)}^{\mathrm{AT}}$.
Proof. 
We start by showing that $\lambda_{(1,i)}^* = \lambda_{(1,i)}^{\mathrm{AT}}$, using the equivalences
$$\begin{aligned} \lambda \geq \lambda_{(1,i)}^* &\iff \text{resting the project in } (1,i) \text{ is optimal for problem (33)} \\ &\iff 0 \geq \max_{S_0 \cup S_1 \in \mathcal{F} \colon i \in S_1} F_{(1,i)}^{S_0 \cup S_1} - \lambda\, G_{(1,i)}^{S_0 \cup S_1} \iff \lambda \geq \max_{S_0 \cup S_1 \in \mathcal{F} \colon i \in S_1} \frac{F_{(1,i)}^{S_0 \cup S_1}}{G_{(1,i)}^{S_0 \cup S_1}} \iff \lambda \geq \max_{S_1 \subseteq X \colon i \in S_1} \frac{F_i^{S_1}}{G_i^{S_1}} = \lambda_{(1,i)}^{\mathrm{AT}}, \end{aligned}$$
drawing on the project's $\mathcal{F}$-indexability (Proposition 1), whereby, if resting the project in $(1,i)$ is optimal, then resting it in $(0,i)$ is also optimal, together with Lemmas 10(b) and 14(b).
We next prove that $\lambda_{(0,i)}^* = \lambda_{(0,i)}^{\mathrm{AT}}$, through the chain of equivalences
$$\begin{aligned} \lambda \geq \lambda_{(0,i)}^* &\iff \text{resting the project in } (0,i) \text{ is optimal for problem (33)} \\ &\iff 0 \geq \max_{S_0 \cup S_1 \in \mathcal{F} \colon i \in S_0} F_{(0,i)}^{S_0 \cup S_1} - \lambda\, G_{(0,i)}^{S_0 \cup S_1} \iff \lambda \geq \max_{S_0 \cup S_1 \in \mathcal{F} \colon i \in S_0} \frac{F_{(0,i)}^{S_0 \cup S_1}}{G_{(0,i)}^{S_0 \cup S_1}} \iff \lambda \geq \max_{S_1 \subseteq X \colon i \in S_1} \frac{-c_i + \phi_i\, F_i^{S_1}}{\dfrac{1-\phi_i}{1-\beta} + \phi_i\, G_i^{S_1}} = \lambda_{(0,i)}^{\mathrm{AT}}, \end{aligned}$$
drawing on the result that the project is $\mathcal{F}$-indexable, together with Lemmas 10(c) and 14(c). □

4.4. Reward Metric Analysis

We proceed by considering how to calculate the reward and marginal reward metrics $F_{(a,i)}^{S_0 \cup S_1}$ and $f_{(a,i)}^{S_0 \cup S_1}$, by relating them to the metrics $F_i^S$ and $f_i^S$ of the corresponding non-restless project with no setup penalties.
For every active set $S \subseteq X$, the reward metric $F_i^S$ is determined by the evaluation equations
$$F_i^S = \begin{cases} R_i + \beta \sum_{j \in S} p_{ij} F_j^S & \text{if } i \in S \\ 0 & \text{otherwise,} \end{cases} \tag{47}$$
and the marginal reward metric is given by
$$f_i^S \triangleq F_i^{\langle 1, S \rangle} - F_i^{\langle 0, S \rangle} = R_i + \beta \sum_{j \in S} p_{ij} F_j^S - \beta F_i^S = \begin{cases} (1-\beta)\, F_i^S & \text{if } i \in S \\ R_i + \beta \sum_{j \in S} p_{ij} F_j^S & \text{otherwise.} \end{cases} \tag{48}$$
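The reward metrics admit the same computation as the work metrics after (42), with the reward vector in place of unit work; a companion Python/NumPy sketch, under the same assumptions and illustrative naming as the work-metric code above:

```python
import numpy as np

def reward_metrics(P, R, beta, S):
    """Solve the evaluation Equations (47) for F_i^S and evaluate the
    marginal reward metric f_i^S via (48); S is a boolean mask of length n."""
    n = P.shape[0]
    idx = np.flatnonzero(S)
    F = np.zeros(n)
    if idx.size:
        A = np.eye(idx.size) - beta * P[np.ix_(idx, idx)]
        F[idx] = np.linalg.solve(A, R[idx])                 # (47) on S
    # (48): (1 - beta) F_i^S on S; R_i + beta * sum_{j in S} p_ij F_j^S off S.
    f = np.where(S, (1 - beta) * F, R + beta * P[:, idx] @ F[idx])
    return F, f
```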
Going back to the semi-Markov restless bandit reformulation, the following result gives the evaluation equations for the reward metric $F_{(a,i)}^{S_0 \cup S_1}$, for an active set $S_0 \cup S_1 \in \mathcal{F}$.
Lemma 13.
$$F_{(a,i)}^{S_0 \cup S_1} = \begin{cases} R_i + \beta \sum_{j \in X} p_{ij} F_{(1,j)}^{S_0 \cup S_1} & \text{if } a = 1,\ i \in S_1 \\ -c_i + \phi_i \left(R_i + \beta \sum_{j \in X} p_{ij} F_{(1,j)}^{S_0 \cup S_1}\right) & \text{if } a = 0,\ i \in S_0 \\ \beta F_{(0,i)}^{S_0 \cup S_1} & \text{otherwise.} \end{cases}$$
The following result formulates the reward metric $F_{(a,i)}^{S_0 \cup S_1}$ in terms of the $F_i^S$.
Lemma 14.
For $S_0 \cup S_1 \in \mathcal{F}$:
(a) $F_{(a,i)}^{S_0 \cup S_1} = 0 = F_i^{S_1}$, for $a \in \{0,1\}$, $i \in X \setminus S_1$.
(b) $F_{(1,i)}^{S_0 \cup S_1} = F_i^{S_1}$, for $i \in S_1$.
(c) $F_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i\, F_i^{S_1}$, for $i \in S_0$.
(d) $F_{(0,i)}^{S_0 \cup S_1} = 0 = F_i^{S_0}$, for $i \in S_1 \setminus S_0$.
Proof. 
(a) This part follows from the definition of $S_0 \cup S_1$.
(b) For $i \in S_1$, we have
$$F_{(1,i)}^{S_0 \cup S_1} = R_i + \beta \sum_{j \in S_1} p_{ij} F_{(1,j)}^{S_0 \cup S_1} + \beta \sum_{j \in X \setminus S_1} p_{ij} F_{(1,j)}^{S_0 \cup S_1} = R_i + \beta \sum_{j \in S_1} p_{ij} F_{(1,j)}^{S_0 \cup S_1},$$
using Lemma 13 and part (a). Thus, the $F_{(1,i)}^{S_0 \cup S_1}$, for $i \in S_1$, satisfy (47), which yields the result.
(c) For $i \in S_0$, we can write
$$F_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i \left(R_i + \beta \sum_{j \in S_1} p_{ij} F_{(1,j)}^{S_0 \cup S_1}\right) = -c_i + \phi_i\, F_i^{S_1},$$
using parts (a, b), Lemma 13, and (47).
(d) The result follows from the definition of $S_0 \cup S_1$. □
Concerning the marginal reward metric $f_{(a,i)}^{S_0 \cup S_1}$, we obtain, from (37) and Lemma 13, that
$$f_{(1,i)}^{S_0 \cup S_1} = R_i + \beta \sum_{j \in X} p_{ij} F_{(1,j)}^{S_0 \cup S_1} - \beta F_{(0,i)}^{S_0 \cup S_1}, \qquad f_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i \left(R_i + \beta \sum_{j \in X} p_{ij} F_{(1,j)}^{S_0 \cup S_1}\right) - \beta F_{(0,i)}^{S_0 \cup S_1}. \tag{49}$$
The following result represents the marginal reward metric $f_{(a,i)}^{S_0 \cup S_1}$ in terms of the $f_j^S$.
Lemma 15.
For $S_0 \cup S_1 \in \mathcal{F}$:
(a) $f_{(1,i)}^{S_0 \cup S_1} = f_i^{S_1}$, for $i \in X \setminus S_1$.
(b) $f_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i\, f_i^{S_1}$, for $i \in X \setminus S_1$.
(c) $f_{(1,i)}^{S_0 \cup S_1} = \beta c_i + \dfrac{1-\beta\phi_i}{1-\beta}\, f_i^{S_1}$, for $i \in S_0$.
(d) $f_{(0,i)}^{S_0 \cup S_1} = -(1-\beta)\, c_i + \phi_i\, f_i^{S_1}$, for $i \in S_0$.
(e) $f_{(1,i)}^{S_0 \cup S_1} = \dfrac{f_i^{S_1}}{1-\beta}$, for $i \in S_1 \setminus S_0$.
(f) $f_{(0,i)}^{S_0 \cup S_1} = -c_i + \dfrac{\phi_i}{1-\beta}\, f_i^{S_1}$, for $i \in S_1 \setminus S_0$.
Proof. 
(a) For $i \in X \setminus S_1$, we have
$$f_{(1,i)}^{S_0 \cup S_1} = R_i + \beta \sum_{j \in X} p_{ij} F_{(1,j)}^{S_0 \cup S_1} - \beta F_{(0,i)}^{S_0 \cup S_1} = R_i + \beta \sum_{j \in S_1} p_{ij} F_j^{S_1} = f_i^{S_1},$$
using (49), Lemmas 13 and 14(a,b), and (48).
(b) For $i \in X \setminus S_1$, we can write
$$f_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i \left(R_i + \beta \sum_{j \in X} p_{ij} F_{(1,j)}^{S_0 \cup S_1}\right) - \beta F_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i \left(R_i + \beta \sum_{j \in S_1} p_{ij} F_j^{S_1}\right) = -c_i + \phi_i\, f_i^{S_1},$$
using (49), (48), and Lemma 14(a,b).
(c) For $i \in S_0$, we have
$$f_{(1,i)}^{S_0 \cup S_1} = F_{(1,i)}^{S_0 \cup S_1} - \beta F_{(0,i)}^{S_0 \cup S_1} = F_i^{S_1} - \beta \left(-c_i + \phi_i\, F_i^{S_1}\right) = \beta c_i + (1-\beta\phi_i)\, F_i^{S_1} = \beta c_i + \frac{1-\beta\phi_i}{1-\beta}\, f_i^{S_1},$$
using (49), $S_0 \subseteq S_1$, Lemmas 13 and 14(b,c), and (48).
(d) For $i \in S_0$, we can write
$$f_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i\, F_{(1,i)}^{S_0 \cup S_1} - \beta F_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i\, F_i^{S_1} - \beta \left(-c_i + \phi_i\, F_i^{S_1}\right) = -(1-\beta)\, c_i + \phi_i (1-\beta)\, F_i^{S_1} = -(1-\beta)\, c_i + \phi_i\, f_i^{S_1},$$
using Lemmas 13 and 14(b,c), $S_0 \subseteq S_1$, and (48).
(e) For $i \in S_1 \setminus S_0$, we have
$$f_{(1,i)}^{S_0 \cup S_1} = F_{(1,i)}^{S_0 \cup S_1} - \beta F_{(0,i)}^{S_0 \cup S_1} = F_i^{S_1} = \frac{f_i^{S_1}}{1-\beta},$$
using (49), Lemmas 13 and 14(d), and (48).
(f) For $i \in S_1 \setminus S_0$, we obtain
$$f_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i \left(R_i + \beta \sum_{j \in X} p_{ij} F_{(1,j)}^{S_0 \cup S_1}\right) - \beta F_{(0,i)}^{S_0 \cup S_1} = -c_i + \phi_i\, F_i^{S_1} = -c_i + \frac{\phi_i}{1-\beta}\, f_i^{S_1},$$
using (49), Lemmas 13 and 14(b), and (48). This completes the proof. □

5. Designing an Efficient Two-Stage Index Algorithm

This section draws on the above in order to develop an efficient index algorithm, which exploits special structure to simplify the one-stage adaptive-greedy algorithm in Algorithm 3, by decoupling the calculation of the continuation and switching indices into a two-stage method, for which an efficient implementation is provided.

5.1. Marginal Productivity Metric Analysis

We start by addressing the calculation of the required marginal productivity metrics $\lambda_{(a,i)}^{S_0 \cup S_1}$ in (38), again by relating them to the metrics $\lambda_i^S$ of the corresponding non-restless project without setup penalties, which are given by
$$\lambda_i^S \triangleq \frac{f_i^S}{g_i^S}, \quad i \in X,\ S \subseteq X. \tag{50}$$
The next result represents $\lambda_{(a,i)}^{S_0 \cup S_1}$ in terms of the $\lambda_j^S$.
Lemma 16.
For $S_0 \cup S_1 \in \mathcal{F}$:
(a) $\lambda_{(1,i)}^{S_0 \cup S_1} = \lambda_i^{S_1}$, for $i \in X \setminus S_1$.
(b) $\lambda_{(0,i)}^{S_0 \cup S_1} = \dfrac{-c_i + \phi_i\, f_i^{S_1}}{\dfrac{1-\phi_i}{1-\beta} + \phi_i\, g_i^{S_1}} = \dfrac{\phi_i\, g_i^{S_1}}{\dfrac{1-\phi_i}{1-\beta} + \phi_i\, g_i^{S_1}} \left(\lambda_i^{S_1} - \dfrac{c_i}{\phi_i\, g_i^{S_1}}\right)$, for $i \in X \setminus S_1$.
(c) $\lambda_{(1,i)}^{S_0 \cup S_1} = \dfrac{\beta c_i + \dfrac{1-\beta\phi_i}{1-\beta}\, f_i^{S_1}}{\dfrac{1-\beta\phi_i}{1-\beta} \left(g_i^{S_1} - \dfrac{\beta(1-\phi_i)}{1-\beta\phi_i}\right)} = \dfrac{g_i^{S_1}}{g_i^{S_1} - \dfrac{\beta(1-\phi_i)}{1-\beta\phi_i}} \left(\lambda_i^{S_1} + \dfrac{\beta(1-\beta)}{1-\beta\phi_i}\, \dfrac{c_i}{g_i^{S_1}}\right)$, for $i \in S_0$ such that $g_i^{S_1} \neq \dfrac{\beta(1-\phi_i)}{1-\beta\phi_i}$.
(d) $\lambda_{(0,i)}^{S_0 \cup S_1} = \dfrac{-(1-\beta)\, c_i + \phi_i\, f_i^{S_1}}{1 - \phi_i + \phi_i\, g_i^{S_1}} = \dfrac{-(1-\beta)\, c_i + \phi_i\, g_i^{S_1} \lambda_i^{S_1}}{1 - \phi_i + \phi_i\, g_i^{S_1}}$, for $i \in S_0$.
(e) $\lambda_{(1,i)}^{S_0 \cup S_1} = \lambda_i^{S_1}$, for $i \in S_1 \setminus S_0$.
(f) $\lambda_{(0,i)}^{S_0 \cup S_1} = \lambda_i^{S_1} - \dfrac{(1-\beta)\, c_i + (1-\phi_i)\, \lambda_i^{S_1}}{1 - \phi_i + \phi_i\, g_i^{S_1}}$, for $i \in S_1 \setminus S_0$.
Proof. 
All of the parts follow readily from (50), (38), and Lemmas 11 and 15. □

5.2. Simplified Version of the Index Algorithm

Using the above results allows us to give a simplified and more explicit version of the index algorithm AG$_{\mathcal{F}}$ in Algorithm 3, which is given in Algorithm 4. In it, we draw on Lemma 16(b,d) to formulate the marginal productivity rates $\lambda_{(a,i)}^{S_0 \cup S_1}$ in terms of the $g_j^S$ and $\lambda_j^S$. Thus, the $g_j^{(k_1-1)}$ and $\lambda_j^{(k_1-1)}$ in the algorithm correspond to $g_{(1,j)}^{(k_0-1,k_1-1)}$ and $\lambda_{(1,j)}^{(k_0-1,k_1-1)}$, respectively. Further, we use $\lambda_{(0,j)}^{(0,k_1-1)}$ (which denotes $\lambda_{(0,j)}^{S_0^0 \cup S_1^{k_1-1}}$) in place of $\lambda_{(0,j)}^{(k_0-1,k_1-1)}$, drawing on Lemma 16(d). Note that such simplifications achieve significant savings in computer memory, since storing the quantities $\lambda_j^{(k_1-1)}$ and $\lambda_{(0,j)}^{(0,k_1-1)}$ entails one less dimension than storing the $\lambda_{(1,j)}^{(k_0-1,k_1-1)}$ and $\lambda_{(0,j)}^{(k_0-1,k_1-1)}$.
Algorithm 4: Simplified version of index algorithm AG$_{\mathcal{F}}$.
  Output: $\{(0, i_0^{k_0}), \lambda_{(0,i_0^{k_0})}^*\}_{k_0=1}^n$, $\{(1, i_1^{k_1}), \lambda_{(1,i_1^{k_1})}^*\}_{k_1=1}^n$
   $S_0^0 := \emptyset$; $S_1^0 := \emptyset$; $k_0 := 1$; $k_1 := 1$; compute $\{(g_i^{(0)}, \lambda_i^{(0)}) \colon i \in X\}$
  while $k_0 + k_1 \leq 2n + 1$ do
   if $k_1 \leq n$ choose $j_1^{\max} \in \arg\max\{\lambda_j^{(k_1-1)} \colon j \in X \setminus S_1^{k_1-1}\}$
    $\lambda_{(0,j)}^{(0,k_1-1)} := \lambda_j^{(k_1-1)} - \dfrac{(1-\beta)\, c_j + (1-\phi_j)\, \lambda_j^{(k_1-1)}}{1 - \phi_j + \phi_j\, g_j^{(k_1-1)}}$, $j \in S_1^{k_1-1} \setminus S_0^{k_0-1}$
   if $k_0 < k_1$ choose $j_0^{\max} \in \arg\max\{\lambda_{(0,j)}^{(0,k_1-1)} \colon j \in S_1^{k_1-1} \setminus S_0^{k_0-1}\}$
   if $k_1 = n+1$, or $k_0 < k_1 \leq n$ and $\lambda_{j_1^{\max}}^{(k_1-1)} < \lambda_{(0,j_0^{\max})}^{(0,k_1-1)}$
     $i_0^{k_0} := j_0^{\max}$; $\lambda_{(0,i_0^{k_0})}^* := \lambda_{(0,i_0^{k_0})}^{(0,k_1-1)}$; $S_0^{k_0} := S_0^{k_0-1} \cup \{i_0^{k_0}\}$; $k_0 := k_0 + 1$
   else
     $i_1^{k_1} := j_1^{\max}$; $\lambda_{(1,i_1^{k_1})}^* := \lambda_{i_1^{k_1}}^{(k_1-1)}$; $S_1^{k_1} := S_1^{k_1-1} \cup \{i_1^{k_1}\}$
    compute $\{(g_i^{(k_1)}, \lambda_i^{(k_1)}) \colon i \in X\}$; $k_1 := k_1 + 1$
   end { if }
  end { while }

5.3. Two-Stage Implementation of the Index Algorithm

We next proceed to simplify the index algorithm in Algorithm 4 still further, by decoupling it into two successive algorithms. The first stage of such a scheme computes the continuation index $\lambda_{(1,i)}^*$, which, as we saw above, is just the Gittins index $\lambda_i^*$. We will need additional quantities as input to the second stage: the $g_j^{(k_1)}$ and $\lambda_j^{(k_1)}$ appearing in Algorithm 4.
In order to obtain such an index and the required additional quantities, consider the algorithmic scheme AG1 in Algorithm 5, which is a variant of that in [8], reformulated as in [28]. For implementations, we can use the algorithms provided in the latter paper, in particular the fast-pivoting algorithm with extended output, which has a $(4/3)\, n^3 + O(n^2)$ arithmetic-operation count.
Algorithm 5: Gittins-index algorithmic scheme AG1.
  Output: $\{i_1^{k_1}\}_{k_1=1}^n$, $\{\lambda_j^* \colon j \in X\}$, $\{(g_j^{(k_1)}, \lambda_j^{(k_1)}) \colon j \in S_1^{k_1}\}_{k_1=1}^n$
  set $S_1^0 := \emptyset$; compute $\{(g_i^{(0)}, \lambda_i^{(0)}) \colon i \in X\}$
  for $k_1 := 1$ to $n$ do
   choose $i_1^{k_1} \in \arg\max\{\lambda_i^{(k_1-1)} \colon i \in X \setminus S_1^{k_1-1}\}$
    $\lambda_{i_1^{k_1}}^* := \lambda_{i_1^{k_1}}^{(k_1-1)}$; $S_1^{k_1} := S_1^{k_1-1} \cup \{i_1^{k_1}\}$
   compute $\{(g_i^{(k_1)}, \lambda_i^{(k_1)}) \colon i \in X\}$
  end
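To make the scheme concrete, the following Python/NumPy sketch gives a naive reference implementation of AG1 that recomputes the metrics from scratch at each step by solving the evaluation equations, and thus runs in $O(n^4)$ time rather than the $(4/3)\, n^3 + O(n^2)$ count of the fast-pivoting implementation of [28]; all names are ours.

```python
import numpy as np

def ag1_naive(P, R, beta):
    """Naive reference implementation of scheme AG1 (Algorithm 5);
    illustrative only. Returns the inclusion order of states, the
    Gittins index lam_star, and the extended output (g^{(k_1)},
    lambda^{(k_1)}) for k_1 = 0, ..., n, as needed by AG0."""
    n = P.shape[0]

    def metrics(S):
        # Non-restless metrics for active set S via (41)-(42), (47)-(48).
        idx = np.flatnonzero(S)
        G, F = np.zeros(n), np.zeros(n)
        if idx.size:
            A = np.eye(idx.size) - beta * P[np.ix_(idx, idx)]
            G[idx] = np.linalg.solve(A, np.ones(idx.size))
            F[idx] = np.linalg.solve(A, R[idx])
        g = np.where(S, (1 - beta) * G, 1 + beta * P[:, idx] @ G[idx])
        f = np.where(S, (1 - beta) * F, R + beta * P[:, idx] @ F[idx])
        return g, f / g                     # g > 0 by (43)

    S = np.zeros(n, dtype=bool)
    lam_star = np.empty(n)
    order, g_k, lam_k = [], [], []
    g, lam = metrics(S)
    g_k.append(g); lam_k.append(lam)
    for _ in range(n):
        j = int(np.where(S, -np.inf, lam).argmax())   # argmax over X \ S
        lam_star[j] = lam[j]
        order.append(j)
        S[j] = True
        g, lam = metrics(S)
        g_k.append(g); lam_k.append(lam)
    return order, lam_star, g_k, lam_k
```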
We next address the computation of the switching index in the second stage, once the Gittins index and the required extra quantities have been computed. Consider algorithm AG0, given in Algorithm 6, whose input is the output of algorithm AG1, and which returns a sequence of all the states $i_0^{k_0}$ in $X$, together with the index values $\lambda_{(0,i_0^{k_0})}^*$. Note that the algorithm is formulated in a form applying to the case of concern herein, with a positive setup delay at every state $j$, so $\phi_j < 1$.
Algorithm 6: Switching-index algorithm AG0.
  Input: $\{i_1^{k_1}\}_{k_1=1}^n$, $\{\lambda_j^* \colon j \in X\}$, $\{(g_j^{(k_1)}, \lambda_j^{(k_1)}) \colon j \in S_1^{k_1}\}_{k_1=1}^n$
  Output: $\{i_0^{k_0}\}_{k_0=1}^n$, $\{\lambda_{(0,j)}^* \colon j \in X\}$
   $\hat{c}_j := \dfrac{1-\beta}{1-\phi_j}\, c_j$, $z_j := \phi_j/(1-\phi_j)$, $j \in X$; $S_0^0 := \emptyset$; $S_1^0 := \emptyset$; $k_0 := 0$
  for $k_1 := 1$ to $n$ do
    $S_1^{k_1} := S_1^{k_1-1} \cup \{i_1^{k_1}\}$; AUGMENT1 := false
    $\lambda_{(0,j)}^{(0,k_1)} := \lambda_j^{(k_1-1)} - \dfrac{\hat{c}_j + \lambda_j^{(k_1-1)}}{1 + z_j\, g_j^{(k_1-1)}}$, $j \in S_1^{k_1} \setminus S_0^{k_0}$
   while $k_0 < k_1$ and not(AUGMENT1) do
    choose $j_0^{\max} \in \arg\max\{\lambda_{(0,j)}^{(0,k_1)} \colon j \in S_1^{k_1} \setminus S_0^{k_0}\}$
    if $k_1 = n$ or $\lambda_{i_1^{k_1}}^* < \lambda_{(0,j_0^{\max})}^{(0,k_1)}$
      $i_0^{k_0+1} := j_0^{\max}$; $\lambda_{(0,i_0^{k_0+1})}^* := \lambda_{(0,i_0^{k_0+1})}^{(0,k_1)}$
      $S_0^{k_0+1} := S_0^{k_0} \cup \{i_0^{k_0+1}\}$; $k_0 := k_0 + 1$
    else
      AUGMENT1 := true
    end { if }
   end { while }
  end { for }
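A matching Python/NumPy sketch of AG0, consuming the output of the AG1 sketch above (again, an illustration under our own naming, not the paper's MATLAB code):

```python
import numpy as np

def ag0(order, lam_star, g_k, lam_k, c, phi, beta):
    """Sketch of the switching-index algorithm AG0 (Algorithm 6).
    Assumes a positive setup delay in every state, so phi[j] < 1."""
    n = len(order)
    c_hat = (1 - beta) * c / (1 - phi)
    z = phi / (1 - phi)
    lam0 = np.full(n, np.nan)        # switching index values lambda*_{(0,j)}
    in_S1 = np.zeros(n, dtype=bool)  # membership in S_1^{k_1}
    in_S0 = np.zeros(n, dtype=bool)  # membership in S_0^{k_0}
    for k1 in range(1, n + 1):
        in_S1[order[k1 - 1]] = True
        # lambda_{(0,j)}^{(0,k_1)} via the rewritten Lemma 16(f) formula.
        g, lam = g_k[k1 - 1], lam_k[k1 - 1]
        cand = lam - (c_hat + lam) / (1 + z * g)
        while (in_S1 & ~in_S0).any():
            free = in_S1 & ~in_S0
            j0 = int(np.where(free, cand, -np.inf).argmax())
            if k1 == n or lam_star[order[k1 - 1]] < cand[j0]:
                lam0[j0] = cand[j0]
                in_S0[j0] = True
            else:
                break                # AUGMENT1: move on to the next k_1
    return lam0

# Example usage, with AG1's output from the previous sketch:
# order, lam_star, g_k, lam_k = ag1_naive(P, R, beta)
# lam0 = ag0(order, lam_star, g_k, lam_k, c, phi, beta)
```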
We have the following result.
Proposition 3.
Algorithm AG0 computes the index $\lambda_{(0,i)}^*$ in no more than $(5/2)\, n^2 + O(n)$ arithmetic operations.
Proof. 
The fact that algorithm AG0 calculates the $\lambda_{(0,i)}^*$ follows by noting that we have obtained it from algorithm AG$_{\mathcal{F}}$ in Algorithm 4 simply by decoupling the calculation of the $\lambda_{(0,i)}^*$ and the $\lambda_{(1,i)}^* = \lambda_i^*$.
As for the algorithm's arithmetic-operation count, it is dominated by the statements
$$\lambda_{(0,j)}^{(0,k_1)} := \lambda_j^{(k_1-1)} - \frac{\hat{c}_j + \lambda_j^{(k_1-1)}}{1 + z_j\, g_j^{(k_1-1)}}, \quad j \in S_1^{k_1} \setminus S_0^{k_0},$$
for $k_1 = 1, \ldots, n$, each of which performs no more than $5 k_1$ operations. This gives the stated maximum operation count. □

6. How Does the Index Depend on Switching Penalties?

We next present and discuss properties of the dependence of the index on the switching penalties, considering the case where the latter are constant across states: $c_i \equiv c$, $d_i \equiv d$, and $\phi_i \equiv \phi$ for $i \in X$. The notation below makes the prevailing penalties explicit, writing $\lambda_{(1,i)}^*(d, \psi)$ and $\lambda_{(0,i)}^*(c, d, \phi, \psi)$.
In what follows, $\lambda_i^*$ denotes the Gittins index, and $F_i^S$ the reward metric, of the original project with no switching penalties. We will draw on the following expression for the switching index:
$$\lambda_{(0,i)}^*(c, d, \phi, \psi) = \max_{S \subseteq X \colon i \in S} H\left(c, d, \phi, \psi, F_i^S, G_i^S\right), \tag{51}$$
where
$$H(c, d, \phi, \psi, F, G) \triangleq \frac{-(c + \phi d) + \phi \left(F + (1-\beta)\, d\, G\right)}{\dfrac{1 - \phi\psi}{1-\beta} + \phi\psi\, G}. \tag{52}$$
Note that (51) uses the transformation considered in Section 2.1, together with the switching-index formulation in (46), and the result that the original non-restless project's reward metric with transformed rewards $\tilde{R}_j = (R_j + (1-\beta)\, d)/\psi$, for $j \in X$, is $\tilde{F}_i^S = (F_i^S + (1-\beta)\, d\, G_i^S)/\psi$.
We will further use the following preliminary result.
Lemma 17.
(a) If $S \subseteq S' \subseteq X$, then $F_i^S \leq F_i^{S'}$ and $G_i^S \leq G_i^{S'}$.
(b) If $d + \psi c \geq \phi\psi\, F_i^X$, then $H(c, d, \phi, \psi, F, G)$ is monotone increasing in $F$ and non-decreasing in $G$, for $0 \leq F \leq F_i^X$.
Proof. 
(a) The results follow from the interpretation of the work and reward metrics, using Assumption 1(ii) for the latter.
(b) This part follows from the following results:
$$\frac{\partial}{\partial F} H(c, d, \phi, \psi, F, G) = \frac{\phi}{\dfrac{1-\phi\psi}{1-\beta} + \phi\psi\, G} > 0 \quad \text{and} \quad \frac{\partial}{\partial G} H(c, d, \phi, \psi, F, G) = \frac{\phi \left(d + \psi c - \phi\psi F\right)}{\left(\dfrac{1-\phi\psi}{1-\beta} + \phi\psi\, G\right)^2} \geq 0.$$
 □
We have the following result.
Proposition 4.
(a) $\lambda_{(1,i)}^*(d, \psi) = (\lambda_i^* + (1-\beta)\, d)/\psi$.
(b) If $d + \psi c \geq \phi\psi\, F_i^X$, then $\lambda_{(0,i)}^*(c, d, \phi, \psi) = \phi \lambda_i^X - (1-\beta)\, c$.
(c) $\lambda_{(0,i)}^*(c, d, \phi, \psi)$ is convex and piecewise linear in $(c, d)$, decreasing in $c$ and non-increasing in $d$.
(d) For $d + \psi c \geq \phi\psi\, F_i^X$, or for $c, d \geq 0$ small enough and $R_i > 0$, or for $c = d = 0$, $\lambda_{(0,i)}^*(c, d, \phi, \psi)$ is convex and non-decreasing in $\phi$ and in $\psi$.
(e) $\lim_{\phi \searrow 0} \lambda_{(0,i)}^*(c, d, \phi, \psi) = -(1-\beta)\, c$.
(f) $\lambda_{(0,i)}^*(c, d, \phi, \psi) = \phi \lambda_i^X - (1-\beta)\, c + O(\psi^2)$, as $\psi \searrow 0$.
Proof. 
(a) The result follows from noting that $\lambda_{(1,i)}^*(d, \psi)$ is the Gittins index of the project with modified active rewards $\tilde{R}_j = (R_j + (1-\beta)\, d)/\psi$ (cf. Section 2.1), which is related to the project's Gittins index $\lambda_i^*$ (with unmodified rewards $R_j$) by the stated expression.
(b) Using Lemma 17(b) and $\lambda_i^X = (1-\beta)\, F_i^X$, we obtain
$$\lambda_{(0,i)}^*(c, d, \phi, \psi) = \max_{(F,G) \in [0, F_i^X] \times [0, G_i^X]} H(c, d, \phi, \psi, F, G) = H\left(c, d, \phi, \psi, F_i^X, G_i^X\right) = \phi \lambda_i^X - (1-\beta)\, c.$$
(c) The result follows by noting that (51) formulates $\lambda_{(0,i)}^*(c, d, \phi, \psi)$ as a maximum of linear functions of $(c, d)$ that are decreasing in $c$ and non-increasing in $d$.
(d) Concerning the dependence on $\phi$, when $d + \psi c \geq \phi\psi\, F_i^X$ the result follows by (b). Furthermore,
$$\begin{aligned} \frac{\partial}{\partial \phi} H\left(c, d, \phi, \psi, F_i^S, G_i^S\right) &= (1-\beta)\, \frac{F_i^S - \left(1 - (1-\beta)\, G_i^S\right)(d + \psi c)}{\left(1 - \phi\psi \left(1 - (1-\beta)\, G_i^S\right)\right)^2} \geq 0, \\ \frac{\partial^2}{\partial \phi^2} H\left(c, d, \phi, \psi, F_i^S, G_i^S\right) &= 2\, (1-\beta) \left(1 - (1-\beta)\, G_i^S\right) \psi\, \frac{F_i^S - \left(1 - (1-\beta)\, G_i^S\right)(d + \psi c)}{\left(1 - \phi\psi \left(1 - (1-\beta)\, G_i^S\right)\right)^3} \geq 0, \end{aligned}$$
where the inequalities hold for $c, d$ small enough, using that $R_i > 0$ so that $F_i^S > 0$, as well as for $c = d = 0$. Hence, $\lambda_{(0,i)}^*(c, d, \phi, \psi)$ is a maximum of convex non-decreasing functions of $\phi$, and is therefore itself convex and non-decreasing.
The same argument applies to the dependence on $\psi$, using that
$$\begin{aligned} \frac{\partial}{\partial \psi} H\left(c, d, \phi, \psi, F_i^S, G_i^S\right) &= (1-\beta) \left(1 - (1-\beta)\, G_i^S\right) \frac{\phi}{\left(1 - \phi\psi \left(1 - (1-\beta)\, G_i^S\right)\right)^2} \left(\phi F_i^S - c - \left(1 - (1-\beta)\, G_i^S\right) \phi d\right), \\ \frac{\partial^2}{\partial \psi^2} H\left(c, d, \phi, \psi, F_i^S, G_i^S\right) &= 2\, (1-\beta) \left(1 - (1-\beta)\, G_i^S\right)^2 \frac{\phi^2}{\left(1 - \phi\psi \left(1 - (1-\beta)\, G_i^S\right)\right)^3} \left(\phi F_i^S - c - \left(1 - (1-\beta)\, G_i^S\right) \phi d\right). \end{aligned}$$
Parts (e) and (f) follow straightforwardly. □
We conjecture that Proposition 4(d) should hold without the qualifications considered above.
Now, consider the following examples to illustrate the results above. The first example concerns a three-state project with no setdown penalties or setup costs, setup delay transform $\phi$, discount factor $\beta = 0.95$, and
$$R = \begin{pmatrix} 0.7221 \\ 0.9685 \\ 0.1557 \end{pmatrix} \quad \text{and} \quad P = \begin{pmatrix} 0.8061 & 0.1574 & 0.0365 \\ 0.1957 & 0.0067 & 0.7976 \\ 0.1378 & 0.5959 & 0.2663 \end{pmatrix}.$$
Figure 2 plots the project's switching index for each of the three states versus $1 - \phi$; each of the lines shown corresponds to one project state. The plot agrees with Proposition 4(d,e). It also illustrates that the relative ordering of states induced by the switching index can vary with $\phi$.
The following example is based on the same project, but with no setup delays and with setdown delay transform $\psi$. Figure 3 displays the continuation and switching indices for each of the three states versus $1 - \psi$; again, each line shown corresponds to one project state. The plots agree with Proposition 4(a,d,f). Note that the continuation index $\lambda_{(1,i)}^*(d, \psi)$ grows to infinity as $\psi$ vanishes, as the incentive to stick to a project increases steeply as the setdown delay becomes larger. The plot for the switching index further shows that the relative ordering of states can vary with $\psi$.
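For an instance this small, (51) can be evaluated by brute-force enumeration of the active sets $S \ni i$. The following Python sketch does so for the three-state example above (the function names are ours, and the approach is only practical for tiny $n$):

```python
import numpy as np
from itertools import combinations

# Three-state example from the text: beta = 0.95, no setdown penalties
# (d = 0, psi = 1) and no setup costs (c = 0).
beta = 0.95
R = np.array([0.7221, 0.9685, 0.1557])
P = np.array([[0.8061, 0.1574, 0.0365],
              [0.1957, 0.0067, 0.7976],
              [0.1378, 0.5959, 0.2663]])

def H(c, d, phi, psi, F, G):
    # Equation (52).
    return (-(c + phi * d) + phi * (F + (1 - beta) * d * G)) / \
           ((1 - phi * psi) / (1 - beta) + phi * psi * G)

def FG(S):
    # F_i^S and G_i^S from (47) and (41), restricted to the active set S.
    idx = np.array(S)
    A = np.eye(len(idx)) - beta * P[np.ix_(idx, idx)]
    F, G = np.zeros(3), np.zeros(3)
    F[idx] = np.linalg.solve(A, R[idx])
    G[idx] = np.linalg.solve(A, np.ones(len(idx)))
    return F, G

def switching_index(i, c, d, phi, psi):
    # Brute-force evaluation of (51): maximize H over all S containing i.
    best = -np.inf
    for r in range(1, 4):
        for S in combinations(range(3), r):
            if i in S:
                F, G = FG(S)
                best = max(best, H(c, d, phi, psi, F[i], G[i]))
    return best

for phi in (0.99, 0.9, 0.8):
    print(phi, [round(switching_index(i, 0.0, 0.0, phi, 1.0), 4) for i in range(3)])
```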

7. Numerical Study

We next report the results of a numerical study, based on MATLAB implementations, developed by the author, of the algorithms discussed here.
The first experiment addressed the runtime of the decoupled index computing method. A project instance with setup delays and costs was randomly generated for each of the following numbers of states: $n = 500, 1000, \ldots, 5000$. For each such $n$, the time to compute the continuation index and the required extra quantities using the fast-pivoting algorithm with extended output in [28] was recorded, as well as the time for computing the switching index by algorithm AG0, and the time for jointly computing both indices using the simplex-based implementation, given in [49], of the adaptive-greedy algorithm AG$_{\mathcal{F}}$. This experiment was run on a 2.8 GHz PC with 4 GB of memory.
Figure 4 shows the results. The left pane plots total runtimes (measured in hours) to compute both indices versus n. Red squares represent the AG F joint-computing scheme, and blue circles represent the two-stage scheme. We see that the latter attained approximately a fourfold speed-up over the former. The right pane plots runtimes (measured in seconds), for the switching index algorithm versus the number of states n. The timescale change from hours to seconds highlights the order-of-magnitude speed-up attained.
The following experiments were designed in order to evaluate the average relative performance of the Whittle index policy in randomly generated two- and three-project instances, both versus the optimal problem value, and versus the benchmark Gittins index policy, which does not take setups into account. For each problem instance, the optimal value was calculated by solving with the CPLEX LP solver the LP formulation of the DP optimality equations. The Whittle index and benchmark scheduling policies were evaluated by solving, with MATLAB, the appropriate systems of linear evaluation equations.
The second experiment was designed to assess the dependence of the relative performance of Whittle's index policy for two-project instances on a constant setup-time transform $\phi$ and discount factor $\beta$, with no setdown penalties. A sample of 100 randomly generated instances with 10-state projects was obtained with MATLAB. In each instance, the parameters for each project were drawn independently: transition probabilities (by scaling a matrix with uniform entries) and Uniform(0, 1) active rewards. For every instance $k = 1, \ldots, 100$ and parameters $(\phi, \beta) \in [0.5, 0.99] \times [0.5, 0.95]$ (on a 0.1 grid), the optimal value $V^{(k),\mathrm{opt}}$ and the values of the Whittle index ($V^{(k),\mathrm{W}}$) and benchmark ($V^{(k),\mathrm{bench}}$) policies were calculated, together with the relative optimality gap of the Whittle index policy, $\Delta^{(k),\mathrm{W}} \triangleq 100\, (V^{(k),\mathrm{opt}} - V^{(k),\mathrm{W}})/|V^{(k),\mathrm{opt}}|$, and the optimality-gap ratio of the Whittle index over the benchmark policy, $\rho^{(k),\mathrm{W,bench}} \triangleq 100\, (V^{(k),\mathrm{W}} - V^{(k),\mathrm{opt}})/(V^{(k),\mathrm{bench}} - V^{(k),\mathrm{opt}})$. The latter were then averaged over the 100 instances for each $(\phi, \beta)$ pair, to obtain the average values $\Delta^{\mathrm{W}}$ and $\rho^{\mathrm{W,bench}}$.
The values $V^{(k),\mathrm{opt}}$, $V^{(k),\mathrm{W}}$, and $V^{(k),\mathrm{bench}}$ were computed as follows. The corresponding value functions $V^{(k),\mathrm{opt}}_{((a_1,i_1),(a_2,i_2))}$, $V^{(k),\mathrm{W}}_{((a_1,i_1),(a_2,i_2))}$, and $V^{(k),\mathrm{bench}}_{((a_1,i_1),(a_2,i_2))}$ were calculated first. Subsequently, the values were obtained considering that both projects start out passive, as
$$V^{(k),\pi} \triangleq \frac{1}{n^2} \sum_{i_1, i_2 \in X} V^{(k),\pi}_{((0,i_1),(0,i_2))}, \quad \pi \in \{\mathrm{opt}, \mathrm{W}, \mathrm{bench}\}.$$
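For concreteness, the instance-generation step just described can be sketched as follows (a hypothetical Python illustration; the study itself used MATLAB implementations):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility

def random_project(n, rng):
    """Draw one random n-state project as described in the text:
    transition probabilities obtained by scaling a matrix with uniform
    entries to be stochastic, and Uniform(0, 1) active rewards."""
    P = rng.uniform(size=(n, n))
    P /= P.sum(axis=1, keepdims=True)   # normalize rows
    R = rng.uniform(size=n)
    return P, R

# One two-project instance with 10-state projects:
(P1, R1), (P2, R2) = random_project(10, rng), random_project(10, rng)
```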
Figure 5 displays, in its left pane, the relative gap $\Delta^{\mathrm{W}}$ versus $\phi$ (note the inverted $\phi$-axis used throughout) for multiple values of $\beta$, using cubic interpolation. The gap starts at 0 as $\phi$ approaches 1 (as the optimal policy is then obtained), grows to a maximum below 0.18%, and then decreases back to 0 as $\phi$ gets smaller. That pattern agrees with intuition: for small enough $\phi$, both the optimal and the Whittle index policies initially pick a project and stick to it. Because the best such project can be determined by single-project evaluations, the Whittle index policy will correctly choose it. The right pane shows that $\Delta^{\mathrm{W}}$ is not monotonic in $\beta$: it increases for small $\beta$ and then decreases as $\beta$ gets closer to 1. Hence, in the left pane, the higher peaks typically correspond to larger values of $\beta$.
Figure 6 shows similar plots for the optimality-gap ratio $\rho^{\mathrm{W,bench}}$ of the Whittle index over the benchmark policy. They highlight that the average optimality gap for the Whittle index policy remains below 45% of that for the benchmark policy. The left pane shows that the ratio vanishes for $\phi$ small enough, as the Whittle index policy is then optimal. Additionally, the right pane shows that the ratio increases with $\beta$. Thus, in the left pane, for fixed $\phi$, higher values correspond to larger $\beta$.
The third experiment was similar in nature to the previous one, but considered instead a constant setup delay $T$ for each project, so that $\phi = \beta^T$. Figures 7 and 8 show the results, which highlight that Whittle's index policy was optimal for $T \geq 2$, that its relative optimality gap did not exceed 0.06%, and that it substantially outperformed the benchmark Gittins-index policy, as the optimality-gap ratio stays below 2%.
The fourth experiment addressed the effect of asymmetric (and constant) setup delay transforms, varying over the range $(\phi_1, \phi_2) \in [0.8, 0.99]^2$, in two-project instances with discount factor $\beta = 0.9$. The left contour plot in Figure 9 shows that the average relative optimality gap of Whittle's index policy, $\Delta^{\mathrm{W}}$, reaches a maximum of approximately 0.14%, vanishing as both $\phi_1$ and $\phi_2$ get close to unity and as either of them becomes small enough. The right contour plot shows that the optimality-gap ratio $\rho^{\mathrm{W,bench}}$ reaches maximum values of nearly 50%, also vanishing as either $\phi_1$ or $\phi_2$ becomes sufficiently small.
The fifth experiment studied the effect of state-dependent setup delay transforms $\phi_i$ as the discount factor is varied. For every instance, i.i.d. Uniform(0.9, 1) state-dependent setup delay transforms were randomly generated. The left pane of Figure 10 displays the average relative optimality gap versus the discount factor, showing that such a gap stays below 0.14%. The right pane highlights that the average optimality-gap ratio $\rho^{\mathrm{W,bench}}$ stays below 20%.
The sixth experiment considered the relative performance of Whittle's index policy on three-project instances in terms of a setup delay transform $\phi$ and the discount factor, using a random sample of 100 instances of three eight-state projects. For each instance, the parameters varied over the range $(\phi, \beta) \in [0.5, 0.99] \times [0.5, 0.95]$. The results are displayed in Figures 11 and 12, which are the counterparts of Figures 5 and 6. Comparing Figures 5 and 11 shows a slight degradation of performance for Whittle's index policy in the latter, although the average gap $\Delta^{\mathrm{W}}$ stays small, beneath 0.25%. Comparing Figures 6 and 12 shows similar values for the ratio $\rho^{\mathrm{W,bench}}$.

8. Conclusions

Bandit models with switching penalties are relevant to a wide variety of applications. Computing optimal policies is generally intractable, which motivates the search for simple policies that can be implemented in practice and perform well. Index policies are an appealing class of policies that have been proposed for such problems. Yet, while algorithms are given in [10,27] for computing the Asawa and Teneketzis index for a bandit with switching costs only, no algorithms had been given in the literature for computing the extension of such an index to bandits with switching penalties that incorporate switching delays. This paper presents the first such algorithm. It further provides evidence, in a numerical study, that the resulting index policy is nearly optimal across the instances considered. This work could be extended in several directions, including the development of specialized algorithms for computing the index in particular models that arise in applications.

Funding

This research has been developed over a number of years, and has been funded in part by the Spanish Government under grants MEC MTM2004-02334 and MTM2007-63140, and PID2019-109196GB-I00 / AEI / 10.13039/501100011033. This work has also been funded in part by the Comunidad de Madrid in the setting of the multi-year agreement with Universidad Carlos III de Madrid within the line of activity “Excelencia para el Profesorado Universitario”, in the framework of the V Regional Plan of Scientific Research and Technological Innovation 2016–2020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data have not been made publicly available because the article describes in full detail how they can be generated by computer simulation and computational experiments.

Acknowledgments

The author has presented a preliminary version of this work at ValueTools ’07, the Second International Conference on Performance Evaluation Methodologies and Tools, which appears in abridged form in the online proceedings [51]. A preliminary version was also posted as the working paper [52].

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Gittins, J.C. Multi-Armed Bandit Allocation Indices; Wiley: Chichester, UK, 1989. [Google Scholar]
  2. Gittins, J.C.; Jones, D.M. A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (Eur. Meeting of Statisticians, Budapest, 1972); Gani, J., Sarkadi, K., Vincze, I., Eds.; North-Holland: Amsterdam, The Netherlands, 1974; pp. 241–266. [Google Scholar]
  3. Gittins, J.C. Bandit processes and dynamic allocation indices. J. R. Statist. Soc. Ser. B 1979, 41, 148–177. [Google Scholar] [CrossRef] [Green Version]
  4. Whittle, P. Multi-armed bandits and the Gittins index. J. R. Statist. Soc. Ser. B 1980, 42, 143–149. [Google Scholar] [CrossRef]
  5. Weber, R. On the Gittins index for multiarmed bandits. Ann. Appl. Probab. 1992, 2, 1024–1033. [Google Scholar] [CrossRef]
  6. Bertsimas, D.; Niño-Mora, J. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Math. Oper. Res. 1996, 21, 257–306. [Google Scholar] [CrossRef]
  7. Bellman, R. A problem in the sequential design of experiments. Sankhyā 1956, 16, 221–229. [Google Scholar]
  8. Varaiya, P.P.; Walrand, J.C.; Buyukkoc, C. Extensions of the multiarmed bandit problem: The discounted case. IEEE Trans. Automat. Control 1985, 30, 426–439. [Google Scholar] [CrossRef]
  9. Banks, J.S.; Sundaram, R.K. Switching costs and the Gittins index. Econometrica 1994, 62, 687–694. [Google Scholar] [CrossRef]
  10. Asawa, M.; Teneketzis, D. Multi-armed bandits with switching penalties. IEEE Trans. Automat. Control 1996, 41, 328–348. [Google Scholar] [CrossRef]
  11. Jun, T.S. Survey on the bandit problem with switching costs. De Econ. 2004, 152, 513–541. [Google Scholar] [CrossRef]
  12. Agrawal, R.; Hegde, M.V.; Teneketzis, D. Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Trans. Automat. Control 1988, 33, 899–906. [Google Scholar] [CrossRef] [Green Version]
  13. Van Oyen, M.P.; Pandelis, D.G.; Teneketzis, D. Optimality of index policies for stochastic scheduling with switching penalties. J. Appl. Probab. 1992, 29, 957–966. [Google Scholar] [CrossRef] [Green Version]
  14. Bergemann, D.; Valimaki, J. Stationary multi-choice bandit problems. J. Econ. Dyn. Control 2001, 25, 1585–1594. [Google Scholar] [CrossRef] [Green Version]
  15. Sundaram, R.K. Generalized bandit problems. In Social Choice and Strategic Decisions; Austen-Smith, D., Duggan, J., Eds.; Studies in Choice and Welfare; Springer: Berlin, Germany, 2005; pp. 131–162. [Google Scholar]
  16. Arlotto, A.; Chick, S.E.; Gans, N. Optimal hiring and retention policies for heterogeneous workers who learn. Manag. Sci. 2014, 60, 110–129. [Google Scholar] [CrossRef] [Green Version]
  17. Hauser, J.R.; Liberali, G.; Urban, G. Website morphing 2.0: Switching costs, partial exposure, random exit, and when to morph. Manag. Sci. 2014, 60, 1594–1616. [Google Scholar] [CrossRef]
  18. Liberali, G.B.; Hauser, J.R.; Urban, G.L. Morphing theory and application. In Handbook of Marketing Decision Models; Wierenga, B., van der Lans, R., Eds.; International Series in Operations Research & Management Science; Springer: Cham, Switzerland, 2017; Chapter 18; Volume 254, pp. 531–562. [Google Scholar]
  19. Lin, S.; Zhang, J.J.; Hauser, J.R. Learning from experience, simply. Mark. Sci. 2015, 34, 1–19. [Google Scholar] [CrossRef]
  20. Huang, J.; Gan, X.; Feng, X. Multi-armed bandit based opportunistic channel access: A consideration of switch cost. In Proceedings of the IEEE International Conference on Communications—Ad-hoc and Sensor Networking Symposium, Budapest, Hungary, 9–13 June 2013; pp. 1651–1655. [Google Scholar]
  21. Qin, Z.Q.; Wang, J.L.; Chen, J.; Sun, Y.M.; Du, Z.Y.; Xu, Y.H. Opportunistic channel access with repetition time diversity and switching cost: A block multi-armed bandit approach. Wirel. Netw. 2018, 24, 1683–1697. [Google Scholar] [CrossRef]
  22. McCardle, K.F.; Tsetlin, I.; Winkler, R.L. When to abandon a research project and search for a new one. Oper. Res. 2018, 66, 799–813. [Google Scholar] [CrossRef]
  23. Savelov, M.P. Gittins index for simple family of Markov bandit processes with switching cost and no discounting. Theory Probab. Appl. 2019, 64, 355–364. [Google Scholar] [CrossRef]
  24. Dusonchet, F.; Hongler, M.O. Optimal hysteresis for a class of deterministic deteriorating two-armed bandit problem with switching costs. Automatica 2003, 39, 1947–1955. [Google Scholar] [CrossRef]
  25. Dusonchet, F.; Hongler, M.O. Priority index heuristic for multi-armed bandit problems with set-up costs and/or set-up time delays. Int. J. Comput. Integr. Manuf. 2006, 19, 210–219. [Google Scholar] [CrossRef]
  26. Mason, A.J.; Anderson, E.J. Minimizing flow time on a single machine with job classes and setup times. Nav. Res. Logist. 1991, 64, 333–350. [Google Scholar] [CrossRef]
  27. Niño-Mora, J. A faster index algorithm and a computational study for bandits with switching costs. INFORMS J. Comput. 2008, 20, 255–269. [Google Scholar] [CrossRef]
  28. Niño-Mora, J. A (2/3)n3 fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS J. Comput. 2007, 19, 596–606. [Google Scholar] [CrossRef] [Green Version]
  29. Whittle, P. Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 1988, 25A, 287–298. [Google Scholar] [CrossRef]
  30. Niño-Mora, J. Restless bandits, partial conservation laws and indexability. Adv. Appl. Probab. 2001, 33, 76–98. [Google Scholar] [CrossRef] [Green Version]
  31. Niño-Mora, J. Dynamic allocation indices for restless projects and queueing admission control: A polyhedral approach. Math. Program. 2002, 93, 361–413. [Google Scholar] [CrossRef]
  32. Niño-Mora, J. Restless bandit marginal productivity indices, diminishing returns and optimal control of make-to-order/make-to-stock M/G/1 queues. Math. Oper. Res. 2006, 31, 50–84. [Google Scholar] [CrossRef]
  33. Niño-Mora, J. A verification theorem for threshold-indexability of real-state discounted restless bandits. Math. Oper. Res. 2020, 45, 465–496. [Google Scholar] [CrossRef]
  34. Niño-Mora, J. Dynamic priority allocation via restless bandit marginal productivity indices. Top 2007, 15, 161–198. [Google Scholar] [CrossRef]
  35. Papadimitriou, C.H.; Tsitsiklis, J.N. The complexity of optimal queuing network control. Math. Oper. Res. 1999, 24, 293–305. [Google Scholar] [CrossRef] [Green Version]
  36. Qian, Y.; Zhang, C.; Krishnamachari, B.; Tambe, M. Restless poachers: Handling exploration-exploitation tradeoffs in security domains. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, Singapore, 9–13 May 2016; ACM: New York, NY, USA, 2016; pp. 123–131. [Google Scholar]
  37. Fu, J.; Moran, B.; Guo, J.; Wong, E.W.M.; Zukerman, M. Asymptotically optimal job assignment for energy-efficient processor-sharing server farms. IEEE J. Sel. Areas Commun. 2016, 34, 4008–4023. [Google Scholar] [CrossRef]
  38. Borkar, V.S.; Pattathil, S. Whittle indexability in egalitarian processor sharing systems. Ann. Oper. Res. 2017, 1–21. [Google Scholar] [CrossRef] [Green Version]
  39. Borkar, V.S.; Ravikumar, K.; Saboo, K. An index policy for dynamic pricing in cloud computing under price commitments. Appl. Math. 2017, 44, 215–245. [Google Scholar] [CrossRef]
  40. Borkar, V.S.; Kasbekar, G.S.; Pattathil, S.; Shetty, P.Y. Opportunistic scheduling as restless bandits. IEEE Trans. Control Netw. Syst. 2018, 5, 1952–1961. [Google Scholar] [CrossRef] [Green Version]
  41. Gerum, P.C.L.; Altay, A.; Baykal-Gursoy, M. Data-driven predictive maintenance scheduling policies for railways. Transport. Res. Part C Emerg. Technol. 2019, 107, 137–154. [Google Scholar] [CrossRef]
  42. Abbou, A.; Makis, V. Group maintenance: A restless bandits approach. INFORMS J. Comput. 2019, 31, 719–731. [Google Scholar] [CrossRef]
  43. Ayer, T.; Zhang, C.; Bonifonte, A.; Spaulding, A.C.; Chhatwal, J. Prioritizing hepatitis C treatment in US prisons. Oper. Res. 2019, 67, 853–873. [Google Scholar] [CrossRef] [Green Version]
  44. Niño-Mora, J. Resource allocation and routing in parallel multi-server queues with abandonments for cloud profit maximization. Comput. Oper. Res. 2019, 103, 221–236. [Google Scholar] [CrossRef]
  45. Fu, J.; Moran, B. Energy-efficient job-assignment policy with asymptotically guaranteed performance deviation. IEEE/ACM Trans. Netw. 2020, 28, 1325–1338. [Google Scholar] [CrossRef] [Green Version]
  46. Hsu, Y.P.; Modiano, E.; Duan, L.J. Scheduling algorithms for minimizing age of information in wireless broadcast networks with random arrivals. IEEE Trans. Mob. Comput. 2020, 19, 2903–2915. [Google Scholar] [CrossRef]
  47. Sun, J.Z.; Jiang, Z.Y.; Krishnamachari, B.; Zhou, S.; Niu, Z.S. Closed-form Whittle’s index-enabled random access for timely status update. IEEE Trans. Commun. 2020, 68, 1538–1551. [Google Scholar] [CrossRef]
  48. Li, D.; Ding, L.; Connor, S. When to switch? Index policies for resource scheduling in emergency response. Prod. Oper. Manag. 2020, 29, 241–262. [Google Scholar] [CrossRef]
  49. Niño-Mora, J. A fast-pivoting algorithm for Whittle’s restless bandit index. Mathematics 2020, 8, 2226. [Google Scholar] [CrossRef]
  50. Yao, D.D. Comments on: “Dynamic priority allocation via restless bandit marginal productivity indices” [Top 15 (2007), no. 2, 161–198] by J. Niño-Mora. Top 2007, 15, 220–223. [Google Scholar] [CrossRef]
  51. Niño-Mora, J. Computing an index policy for bandits with switching penalties. In Proceedings of the ValueTools ’07, the Second International Conference on Performance Evaluation Methodologies and Tools, Nantes, France, 23–25 October 2007; ICST: Brussels, Belgium, 2007. Available online: https://dl.acm.org/doi/10.5555/1345263.1345361 (accessed on 29 December 2020).
  52. Niño-Mora, J. Two-Stage Index Computation for Bandits with Switching Penalties II: Switching Delays; Working Paper 07-42, Statistics and Econometrics Series 10; Univ. Carlos III de Madrid: Madrid, Spain, 2007. [Google Scholar]
Figure 1. Illustration for the proof of Theorem 1.
Figure 2. Switching index versus setup delay transform.
Figure 3. Continuation and switching indices versus setdown delay transform.
Figure 4. Exp. 1: Runtimes of index algorithms.
Figure 5. Exp. 2: Average optimality gap (%) of Whittle's index policy.
Figure 6. Exp. 2: Average optimality-gap ratio (%) of Whittle's index policy over the benchmark policy.
Figure 7. Exp. 3: Average optimality gap (%) of Whittle's index policy.
Figure 8. Exp. 3: Average optimality-gap ratio (%) of Whittle's index over benchmark policy.
Figure 9. Exp. 4: Average relative performance (%) of Whittle's index policy versus $(\phi_1, \phi_2)$, for $\beta = 0.9$.
Figure 10. Exp. 5: Average relative performance (%) of Whittle's index policy with state-dependent setup delays.
Figure 11. Exp. 6: Version of Figure 5 for three-project instances.
Figure 12. Exp. 6: Version of Figure 6 for three-project instances.
Table 1. Some notation employed in the paper.
$\mathcal{M} \triangleq \{1, \ldots, M\}$: set of projects
$t_k$: decision periods
$X_m(t), X(t)$: project state in period $t$
$X_m, X$: project state space
$A_m(t), A(t)$: action chosen on a project in period $t$
$A_m(t-1), A(t-1)$: previously chosen action
$R_m(i_m), R(i)$: rewards
$\beta$: one-period discount factor
$p_m(i_m, j_m), p(i, j)$: state-transition probabilities
$c_m(i_m), c(i)$: setup costs
$d_m(i_m), d(i)$: setdown costs
$\xi_m(i_m), \xi(i)$: setup delays
$\phi_m(i_m), \phi(i)$: setup delay $z$-transforms, for $z = \beta$
$\psi_m, \psi$: setdown delay $z$-transform, for $z = \beta$
$Y_m(t), Y(t)$: augmented state in period $t$
$Y_m, Y$: augmented state space
$F_i^\pi, F_y^\pi, F^\pi$: reward metric
$G_i^\pi, G_y^\pi, G^\pi$: resource consumption metric
$f_i^\pi, f_y^\pi$: marginal reward metric
$g_i^\pi, g_y^\pi$: marginal resource consumption metric
$\lambda_i^S, \lambda_y^S$: marginal productivity metric
