Fast Two-Stage Computation of an Index Policy for Multi-Armed Bandits with Setup Delays

Abstract: We consider the multi-armed bandit problem with switching penalties that include both setup delays and costs, extending the author's earlier results for the special case with no switching delays. A priority index for projects with setup delays, which partially characterizes optimal policies, was introduced by Asawa and Teneketzis in 1996, yet without a means of computing it. We present a fast two-stage index-computing method: the first stage computes the continuation index (which applies when the project has been set up) and certain extra quantities with cubic (arithmetic-operation) complexity in the number of project states; the second stage then computes the switching index (which applies when the project is not set up) with quadratic complexity. The approach rests on new methodological advances on restless bandit indexation, which are introduced and deployed herein, being motivated by the limitations of previous results and exploiting the fact that the aforementioned index is the Whittle index of the project in its restless reformulation. A numerical study demonstrates substantial runtime speed-ups of the new two-stage index algorithm versus a general one-stage Whittle index algorithm. The study further gives evidence that, in a multi-project setting, the index policy is consistently nearly optimal.


Background
In a much-studied version of the multi-armed bandit problem (MABP), a decision-maker selects one project to engage from a finite set of dynamic and stochastic projects at each of an infinite sequence of discrete-time periods. Each project is modeled as a classic (non-restless) bandit, so the engaged (active) project yields rewards and changes state in a Markovian fashion, while rested (passive) projects neither produce rewards nor change state. The goal is to find a policy that selects one project to engage at each time so as to maximize the expected total geometrically discounted reward. The MABP is widely applicable, being regarded as a modeling paradigm of the exploration versus exploitation trade-off, and it has generated a vast literature (see the monograph [1] and the references cited there). Although the curse of dimensionality hinders direct numerical solution of its dynamic programming (DP) optimality equations for realistic-size models, as the size of the multi-dimensional state space grows exponentially with the number of projects, the MABP is solved optimally by a remarkably simple type of policy, a so-called (priority-) index policy. Index policies are based on defining for each project m an index λ_m(i_m), a scalar function of the project state i_m that depends only on that project's parameters, and engaging at each time a project of largest index. See, e.g., [2][3][4][5][6]. The index considered in [2], known in the literature as the Gittins index, extends to general Markovian bandits the index introduced by Bellman in [7] for solving a Bernoulli bandit model.
However, appropriate modeling of potential applications often entails the incorporation of features that violate assumptions of the classic MABP. Regarding the assumption that passive projects do not give rewards, this is noncritical, since passive rewards can be

Index Policies, Hysteresis, and the Asawa and Teneketzis Index for the MABPSP
While switching penalties can generally be sequence-dependent, this paper will focus on the case that such penalties are separately defined for each project, while allowing them to depend on the project state. Specifically, we will assume that switching from engaging one project to another entails, similarly as in [26], a setdown cost to switch off the currently engaged project, and then a setup cost followed by a random setup delay to switch on the project about to be engaged. Note that setup delays can be used to model, e.g., time for preparing the ground or building infrastructure, as well as training or learning delays.
Although index policies are generally suboptimal for the MABPSP (see [9]), their ease of implementation motivates the design of well-performing policies of this kind. An index policy in such a setting attaches to each project m an index λ_m(a_m^-, i_m), which now depends on both the previous action a_m^- ∈ {0, 1} (passive: 0 or active: 1) and the current project state i_m. Thus, such an index decouples into a continuation index λ_m(1, i_m), which applies when the project has already been set up, and a switching index λ_m(0, i_m), to be used when the project has not yet been set up.
Intuitively, one would expect that switching penalties should discourage frequent switching and, hence, should cause a hysteresis effect on the structure of optimal policies. Thus, it should be optimal to stick longer to the currently engaged project than would be the case in the absence of such penalties. As put in (p. 691 [9]), "it is obvious that in comparing two otherwise identical arms, one of which was used in the previous period, the one which was in use must necessarily be more attractive than the one which was idle". To be consistent with such a hysteresis property, the indices of a project m must satisfy

λ_m(1, i_m) ≥ λ_m(0, i_m) for every project state i_m. (1)

Note that index policies can be optimal in special cases of the MABPSP, as shown in [13], in a model for scheduling a multi-class batch of stochastic jobs.
An intuitively appealing choice of index, extending that in [13], is that considered by Asawa and Teneketzis in [10]-which we will refer to in the sequel as the AT index-for a project having either a constant (not dependent on the project state) setup cost or a constant setup delay distribution, and no setdown costs. It is shown in [10] that the AT index provides a partial characterization of optimal policies for the version of the MABPSP considered there. The continuation AT index of a project is simply its Gittins index. As for the switching AT index, it is the highest rate of discounted expected reward minus setup cost per unit of discounted expected active time (counting the setup delay as active time) that can be attained from an initially passive project by first setting it up and then engaging it for a random duration that is given by a stopping time.

Index Computation
Efficient index computation is a key issue that must be addressed in practice for deploying an index policy for the MABPSP. For a project with n states and constant setup cost, but without setup delays, (Section III.C [10]) shows that the 2n AT continuation and switching index values λ*(a^-, i) can be computed as the Gittins index of an appropriately defined 2n-state project with augmented state (a^-, i). Because computing the Gittins index has, in general, cubic operation complexity in the number of states, such an approach results in an eightfold increase in complexity relative to that of computing the continuation index only.
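For concreteness, the continuation (Gittins) index computation underlying such approaches can be sketched as follows. This is a plain NumPy rendition of the classical largest-index-first scheme of Varaiya, Walrand, and Buyukkoc, with O(n^4) complexity from dense linear solves rather than the cubic fast-pivoting algorithm of [28]; the function name and data layout are illustrative.

```python
import numpy as np

def gittins_indices(R, P, beta):
    """Gittins indices (ratio normalization) of a discounted Markov bandit
    via the largest-index-first scheme: at each step, for every unindexed
    state i, evaluate the reward/time ratio of engaging while within
    C = (already-indexed states) + {i}, and assign the best ratio."""
    n = len(R)
    idx = np.empty(n)
    order = []                          # states already assigned an index
    remaining = set(range(n))
    while remaining:
        best_state, best_val = None, -np.inf
        for i in remaining:
            C = order + [i]             # continue while in C, stop on exit
            A = np.eye(len(C)) - beta * P[np.ix_(C, C)]
            rew = np.linalg.solve(A, R[C])            # E sum_{t<tau} beta^t R
            tim = np.linalg.solve(A, np.ones(len(C))) # E sum_{t<tau} beta^t
            val = rew[-1] / tim[-1]                   # ratio starting at i
            if val > best_val:
                best_state, best_val = i, val
        idx[best_state] = best_val
        order.append(best_state)
        remaining.remove(best_state)
    return idx
```

For a two-state bandit with R = (1, 0), deterministic alternation between the states, and β = 1/2, the scheme returns indices 1 and 1/3, in agreement with a direct stopping-time computation.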
A faster two-stage approach for a project with both setup and setdown costs-but no setup delays-that can be state-dependent was given by the author in [27]. The proposed algorithm computes, in the first stage, the continuation index and certain extra quantities by applying the (4/3)n^3 + O(n^2) fast-pivoting algorithm with extended output presented in [28]. Subsequently, in the second stage, it computes the switching index in at most O(n^2) operations. Hence, computing with that algorithm the 2n AT index values entails only a twofold complexity increase relative to the (2/3)n^3 + O(n^2) operation count to compute the continuation (Gittins) index only through the fast-pivoting algorithm (without extended output) given in [28]. Further, ref. [27] reports on the results of a numerical study demonstrating that the resulting index policy for the version of the MABPSP considered there is close to optimal and outperforms the Gittins index policy by a wide margin, across a wide range of instances.

Approach via Restless Bandit Reformulation, Whittle Index, and Indexability
The two-stage index algorithm shown in [27] exploits the reformulation of a project with switching costs and state i as a restless bandit-i.e., a project that can change state while passive-without such costs, moving across augmented states (a^-, i). In that way, the MABP with switching costs is cast as a multi-armed restless bandit problem (MARBP) without them, which allows for the deployment of theoretical and algorithmic results on restless bandit indexation, as introduced in [29] by Whittle. Such a theory has been developed in [30][31][32][33] by the author. Additionally, see the survey [34].
Thus, while the MARBP is generally intractable, as it is known to be PSPACE-hard (see [35]), Whittle introduced, in [29], a widely applied heuristic index policy. For a sample of recent applications, see, for example [36][37][38][39][40][41][42][43][44][45][46][47][48]. Yet, the Whittle index is only defined for a limited class of restless bandits, called indexable, and it is nontrivial to verify whether such an indexability property holds for a given model. The work of the author referred to above provides sufficient indexability conditions for general restless bandits, which are grounded on satisfaction by project performance metrics of partial conservation laws (PCLs), together with an adaptive-greedy index algorithm that computes the Whittle index (and extensions thereof) under such conditions. Such a PCL-indexability approach is deployed in [27], using the result that the AT index of a non-restless bandit with switching costs (but no switching delays) is its Whittle index in the project's restless reformulation. The corresponding restless bandit model is shown to satisfy the PCL-indexability conditions, ensuring that its Whittle index can be computed by the adaptive-greedy algorithm. Special structure and the results in [49] are then used in [27] in order to decouple that algorithm into a faster two-stage method.

Motivation and Goals
Yet, no method is given in Asawa and Teneketzis [10] for computing their proposed index under switching delays. The relevance of such delays in applications, along with the tractability and effectiveness of the AT index policy in the pure-switching-costs case, motivates extending the restless bandit indexation approach to develop an efficient index algorithm for bandits that incorporate both switching costs and delays, which is the first goal of this paper.
Carrying out such an extension turns out to raise methodological research challenges on restless bandit indexation. Thus, when a Markovian non-restless bandit with switching delays is reformulated as a semi-Markov restless bandit without them, it is found that the resultant model need not satisfy the PCL-indexability conditions that were the cornerstone to the analyses presented in Niño-Mora [27] for the pure-switching-costs case. This motivates us to significantly extend the scope of previous theory, obtaining more powerful sufficient indexability conditions, which are both easier to apply and applicable to a wider class of models, including that of concern herein. That is the second goal of this paper. The third goal entails assessing the runtime performance of the proposed index algorithm, and evaluating the performance of the resulting index policy, both in terms of its optimality gap and its improvement over alternative simpler index policies.

Contributions
Concerning the second goal, on general restless bandit methodology, we introduce, for finite-state restless bandits, significantly simpler and less stringent sufficient conditions for indexability than the former PCL-based conditions, under which it is also assured that the adaptive-greedy algorithm computes the MPI (marginal productivity index, i.e., the Whittle index). We further show such conditions to be necessary, in that any indexable finite-state restless bandit satisfies them. Thus, the new conditions furnish a complete characterization of indexability, which can be used to analytically establish a priori that a restless bandit model of concern is indexable-as opposed to numerically verifying a posteriori that a given instance is indexable.
As for the first goal, we deploy the new indexability conditions in the restless bandit reformulation of a non-restless bandit with switching delays and costs. Because the AT index emerges as the Whittle index in such a reformulation, we are thus assured that the adaptive-greedy algorithm will compute it. The complexity of such an algorithm is then reduced by exploiting special structure, which again yields a substantially faster two-stage method. In the first stage, the continuation index is computed in (4/3)n^3 + O(n^2) arithmetic operations, and then the switching index is computed in the second stage in at most (5/2)n^2 + O(n) operations. Thus, we obtain a two-stage algorithm that computes both the continuation and switching indices in roughly twice the time required to compute the continuation index alone (if the latter were computed using the fast-pivoting (2/3)n^3 + O(n^2) algorithm in [34]).
Regarding the third goal, we report on a computational study demonstrating the substantial runtime speed-up achieved by the two-stage algorithm relative to direct application of the one-stage adaptive-greedy algorithm. The study further reports on experiments providing evidence that the index policy is close to optimal and attains significant gains over a benchmark index policy across a wide range of randomly generated instances with two and three projects.

Structure of the Paper and Notation
The rest of the paper proceeds as follows. Section 2 describes the MABPSP model of concern, reviews the AT index, and describes the restless bandit indexation approach to be applied. Section 3 lays the groundwork for such an approach in a general framework of finite-state restless bandits, introducing the new methodological advances on restless bandit indexation. Section 4 deploys the new results in the special restless bandit model that arises from the reformulation of a non-restless bandit with switching penalties, which culminates in the development of the new two-stage index algorithm in Section 5. Section 6 presents some qualitative properties on how the index depends on setup and setdown penalties. Finally, Section 7 presents and discusses the numerical study.
Because the paper uses extensive notation, Table 1 summarizes it for the reader's convenience.

MABPSP Model and Its Semi-Markov MARBP Reformulation
A decision-maker ponders how to prioritize the allocation of effort to M dynamic and stochastic projects, labelled by m ∈ M ≜ {1, . . . , M}, one of which must be engaged (active) at each of a sequence of decision periods t_k ∈ Z_+ ≜ {0, 1, 2, . . .}, with t_0 = 0 and t_k ↑ ∞ as k → ∞, while the others are rested (passive). Switching projects on and off entails setup and setdown delays and costs, respectively. A setup (resp. setdown) delay on a project is necessarily followed by a period in which the project is worked on (resp. rested); i.e., the times at which a setup or a setdown delay is completed are not decision periods. We will say that a project is "active" when it is either being engaged (worked upon) or undergoing a setup or a setdown delay. Let X_m(t) and A_m(t) denote the prevailing state, which belongs to the finite state space X_m, and the action for project m at time t (A_m(t) = 1: active; A_m(t) = 0: passive), and let A_m^-(t) ≜ A_m(t − 1) denote the previously chosen action, with A_m^-(0) indicating the initial setup status. While project m is passive, it neither accrues rewards nor changes state. Switching it on when it lies in state i_m entails a lump setup cost c_m(i_m), followed by a random setup delay of duration ξ_m(i_m) periods, whose z-transform is φ_m(z; i_m) ≜ E[z^{ξ_m(i_m)}], over which no rewards are earned. After such a setup, the project must be engaged, yielding a reward R_m(i_m), after which its state moves at the next period to j_m with transition probability p_m(i_m, j_m). After at least one period in which the project is engaged, it may be decided to switch it off. If this is done when the project lies in state j_m, then a lump setdown cost d_m(j_m) is incurred, followed by a random setdown delay of duration η_m periods, with z-transform ψ_m(z) ≜ E[z^{η_m}], over which no rewards accumulate. Subsequently, the project remains passive for one or more periods.
Note that setup delay distributions are allowed to be state-dependent, whereas the setdown delay distribution is not (cf. Section 2.1). Rewards and costs are geometrically time-discounted with factor β < 1. In what follows, we write the above z-transforms evaluated at z = β simply as φ_m(i_m) and ψ_m.
Actions are prescribed through a scheduling policy π, which is chosen from the class Π of policies that are admissible, i.e., nonanticipative with respect to the history of states and actions, and engaging one project at a time. The MABPSP (cf. Section 1) is concerned with finding an admissible scheduling policy that attains the maximum expected total discounted reward net of switching costs.
This problem can be cast into the framework of semi-Markov decision problems (SMDPs) by including in the state of each project m the last action taken, i.e., by using the augmented state Y_m(t) ≜ (A_m^-(t), X_m(t)), which belongs to the augmented state space Y_m ≜ {0, 1} × X_m. Thus, one obtains a multidimensional SMDP having joint state Y(t) ≜ (Y_m(t))_{m∈M} and joint action A(t) ≜ (A_m(t))_{m∈M}. This is a special type of semi-Markov MARBP (cf. Section 1), as the constituent projects become restless in such a reformulation.
Rewards and dynamics for the reformulated project m are as follows, where R_m^{a_m}(a_m^-, i_m) and p_m^{a_m}((a_m^-, i_m), (b_m^-, j_m)) denote the one-stage (i.e., from t_k to t_{k+1}) expected discounted reward and transition probability resulting from taking action a_m in state Y_m(t_k) = (a_m^-, i_m). On the one hand, if, in period t_k, the project lies in state (1, i_m) and it is again engaged, it yields the reward R_m^1(1, i_m) ≜ R_m(i_m) and its state transitions at t_{k+1} = t_k + 1 to (1, j_m) with probability p_m^1((1, i_m), (1, j_m)) ≜ p_m(i_m, j_m). If, instead, the project is switched off, it gives the reward R_m^0(1, i_m) ≜ −d_m(i_m) and its state moves at t_{k+1} = t_k + η_m + 1 to (0, i_m) with probability one, i.e., p_m^0((1, i_m), (0, i_m)) ≜ 1. On the other hand, if the project occupies at time t_k the state (0, i_m) and is then switched on, it yields the expected discounted reward R_m^1(0, i_m) ≜ −c_m(i_m) + φ_m(i_m) R_m(i_m) until the following decision time t_{k+1} = t_k + ξ_m(i_m) + 1, at which the project state transitions to (1, j_m) with probability p_m^1((0, i_m), (1, j_m)) ≜ p_m(i_m, j_m). If the project is kept idle, then it gives no reward, i.e., R_m^0(0, i_m) ≜ 0, and its state remains frozen up to t_{k+1} = t_k + 1. Thus, the MABPSP of concern is formulated as the semi-Markov MARBP

maximize_{π∈Π} E_{Y(0)}^π [ Σ_{k=0}^∞ β^{t_k} Σ_{m∈M} R_m^{A_m(t_k)}(Y_m(t_k)) ],

where E_{Y(0)}^π[·] is expectation under policy π conditioned on starting from the joint state Y(0).
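The reformulated project data can be assembled programmatically. The following sketch builds, per augmented state and action, the one-stage expected discounted reward, the effective discount factor (β times the relevant delay transform), and the next-state distribution, following the dynamics described above; the function name and data layout are illustrative, and the switched-on reward −c(i) + φ(i)R(i), reflecting the setup cost plus the first engagement reward discounted through the setup delay, is a natural accounting stated here as an assumption.

```python
import numpy as np

def restless_reformulation(R, P, c, phi, beta, d=None, psi=1.0):
    """Semi-Markov restless reformulation of one project with setup costs c,
    setup-delay transforms phi[i] = E[beta^xi_i], setdown costs d, and
    setdown transform psi = E[beta^eta].  Augmented states are (a_minus, i).
    Returns {((a_minus, i), a): (reward, effective_discount, next_dist)}."""
    n = len(R)
    if d is None:
        d = np.zeros(n)             # normalized case: no setdown penalties
    model = {}
    for i in range(n):
        # Set-up project, kept engaged: ordinary bandit dynamics.
        model[(1, i), 1] = (R[i], beta, {(1, j): P[i][j] for j in range(n)})
        # Set-up project, switched off: setdown cost, delay eta, then passive.
        model[(1, i), 0] = (-d[i], beta * psi, {(0, i): 1.0})
        # Passive project, switched on: setup cost, delay xi_i, then one
        # engagement period (reward discounted by phi[i]).  [assumed form]
        model[(0, i), 1] = (-c[i] + phi[i] * R[i], beta * phi[i],
                            {(1, j): P[i][j] for j in range(n)})
        # Passive project, kept passive: state frozen, no reward.
        model[(0, i), 0] = (0.0, beta, {(0, i): 1.0})
    return model
```

A stage of random duration is thus summarized by its expected discount multiplier, which is all that discounted-performance computations require.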

Reduction to the Case with No Setdown Penalties
We next show that one can restrict attention, with no loss of generality, to the case with no setdown penalties, which will allow us to simplify subsequent analyses. Imagine that, say, at time t = 0, a passive project is set up and is then worked on for a random number of periods determined by a stopping time τ ≥ 1, after which it is set down. Dropping the label m, denote by R = (R_j)_{j∈X}, c = (c_j)_{j∈X}, and d = (d_j)_{j∈X} the active reward vector and the setup and setdown cost vectors. Denote by φ = (φ_j)_{j∈X} the setup delay z-transform vector and by ψ the constant setdown delay transform, both evaluated at z = β. The total discounted expected net reward obtained from the project over such a time interval, starting from the augmented state Y(0) = (0, i), is

F_i^τ ≜ E_i[ −c_i + β^{ξ_i} Σ_{t=0}^{τ−1} β^t R_{X(t)} − β^{ξ_i + τ} d_{X(τ)} ],

where ξ_i is the setup delay. The corresponding discounted active time expended on the project is

G_i^τ ≜ E_i[ (1 − β^{ξ_i + τ + η}) / (1 − β) ],

where, as pointed out above, the setup and setdown delays ξ_i and η are both considered to be active time.
In the next result, which extends Lemma 3.4 of [27] to the present setting, I is the identity matrix indexed by X, P = (p_{ij})_{i,j∈X}, 0 is a vector of zeros, and φ · d ≜ (φ_j d_j)_{j∈X}.
Proof. (a) This part follows by using a direct algebraic identity. (b) This part follows by a direct computation.

Lemma 1 can be used to eliminate setdown penalties: it suffices to incorporate them into new setup costs, setup delay transforms, and active rewards via the corresponding transformations. Note that such a reduction would not be possible had the setdown delay transform not been constant. In the case c_j ≡ c and d_j ≡ d, we obtain c_j ≡ c + dφ_j. Accordingly, we will focus henceforth on the normalized case without setdown penalties: d_j ≡ 0, ψ ≡ 1.

The AT Index
We next consider the AT index of a project with setup penalties-dropping again the label m-extending the original definitions in [10]. The continuation AT index is

λ^AT(1, i) ≜ sup_{τ≥1} E_i[ Σ_{t=0}^{τ−1} β^t R_{X(t)} ] / E_i[ Σ_{t=0}^{τ−1} β^t ],

where τ ≥ 1 is a stopping time for engaging the project starting in state i when it is already set up; hence, λ^AT(1, i) is just the project's Gittins index. As for the switching AT index, it is given by

λ^AT(0, i) ≜ sup_{τ≥1} ( −c_i + E_i[ β^{ξ_i} Σ_{t=0}^{τ−1} β^t R_{X(t)} ] ) / E_i[ (1 − β^{ξ_i + τ}) / (1 − β) ],

where now τ is a stopping-time rule that is followed after the project has been set up in state i.
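On small instances, both indices can be evaluated by brute force, replacing the supremum over stopping times with a maximum over stationary stopping rules, i.e., continuation sets C containing i. The following sketch does so via dense linear solves; it is exponential in n, intended only for validating faster algorithms on tiny instances, and its names and the exact discounting conventions (the setup delay counted as active time, the engagement reward discounted by φ_i) are stated as assumptions consistent with the verbal definition above.

```python
import numpy as np
from itertools import combinations

def at_indices_bruteforce(R, P, c, phi, beta):
    """Continuation and switching AT indices of a small project by
    enumerating continuation sets C with i in C (exponential in n)."""
    n = len(R)
    cont, switch = np.empty(n), np.empty(n)
    for i in range(n):
        best_c, best_s = -np.inf, -np.inf
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for extra in combinations(others, k):
                C = [i] + list(extra)       # engage while the state is in C
                A = np.eye(len(C)) - beta * P[np.ix_(C, C)]
                rew = np.linalg.solve(A, R[C])[0]           # E sum beta^t R
                tim = np.linalg.solve(A, np.ones(len(C)))[0]  # E sum beta^t
                best_c = max(best_c, rew / tim)
                # switching index: subtract setup cost, discount the
                # engagement phase by phi[i], count the delay as active time
                num = -c[i] + phi[i] * rew
                den = (1 - phi[i]) / (1 - beta) + phi[i] * tim
                best_s = max(best_s, num / den)
        cont[i], switch[i] = best_c, best_s
    return cont, switch
```

Under Assumption 1 below (non-negative setup costs and rewards), the computed switching index never exceeds the continuation index, matching the hysteresis property (1).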
The following requirements will be assumed henceforth on setup costs and setup delay transforms, which extend the corresponding conditions in [10].

Assumption 1. The following holds:
(i) non-negative setup costs: c_j ≥ 0 for j ∈ X;
(ii) non-negative rewards: if some setup delay can be positive, i.e., φ ≠ 1, then R_j ≥ 0 for j ∈ X.
The next result shows that Assumption 1 ensures the satisfaction of the hysteresis property in (1).
Proof. For a given state i ∈ X and a stopping-time rule τ as above, write G^τ and F_i^τ for the corresponding discounted active time and reward. Now, Assumption 1 ensures that c_i ≥ 0 and F_i^τ ≥ 0. Further, (9), (7), and (8) immediately yield λ^AT(1, i) ≥ λ^AT(0, i), which completes the proof.

New Methodological Results on Restless Bandit Indexation
This section presents new results on restless bandit indexation, which, besides having an intrinsic interest, are required and form the basis for the approach to non-restless bandits with switching times that is deployed in later sections.

Indexable Restless Bandits and the Whittle Index
Consider a semi-Markov restless bandit, representing a dynamic and stochastic project whose state Y(t) transitions over time periods t = 0, 1, 2, . . . through the finite state space Y. The project's evolution is governed by a policy π taken from the class Π of nonanticipative randomized policies, which, at each of an increasing sequence t_k of decision periods with t_0 = 0 and t_k ↑ ∞ as k → ∞, prescribes an action A(t_k) ∈ {0, 1} that determines the status during the ensuing stage until the next decision period t_{k+1} (1: active; 0: passive). Taking action A(t_k) = a at time t_k when the project occupies state Y(t_k) = y has the following consequences over the following stage, relative to a given one-period discount factor 0 < β < 1: an expected total discounted amount R_y^a of reward is earned and an amount Q_y^a ≥ 0 of a generic resource is expended; further, the joint distribution of the stage's duration t_{k+1} − t_k and its final state Y(t_{k+1}) depends only on y and a.

It will be convenient to partition Y into the (possibly empty) set Y^{0} of uncontrollable states, where both actions entail identical resource consumptions and dynamics, and the remaining set Y^{0,1} ≜ Y \ Y^{0} of controllable states, which is assumed to consist of N ≜ |Y^{0,1}| ≥ 1 elements. The notation Y^{0} is meant to reflect the convention that the passive action a = 0 is chosen in uncontrollable states. The value of the rewards earned and the amount of resource expended by a policy π starting from state y are evaluated, respectively, by the discounted reward and resource consumption metrics

F_y^π ≜ E_y^π[ Σ_{k=0}^∞ β^{t_k} R_{Y(t_k)}^{A(t_k)} ] and G_y^π ≜ E_y^π[ Σ_{k=0}^∞ β^{t_k} Q_{Y(t_k)}^{A(t_k)} ].

Let us introduce a parameter λ representing the resource unit price, and consider the λ-price problem

maximize_{π∈Π} F_y^π − λ G_y^π, (10)

which concerns finding a policy that maximizes the value of rewards earned minus the cost of resources expended.
Because (10) is an infinite-horizon finite-state and -action SMDP, standard results ensure that it is solved by stationary deterministic policies, which are characterized by the solutions to the following DP equations, where V_y^*(λ) denotes the optimal value starting from y under price λ:

V_y^*(λ) = max_{a∈{0,1}} { R_y^a − λ Q_y^a + E_y^a[ β^{t_1} V_{Y(t_1)}^*(λ) ] }. (11)

Such a project is said to be indexable (cf. [29]) if, for each controllable state y ∈ Y^{0,1}, there exists a unique break-even price λ_y^* such that it is optimal to engage the project in state y if and only if λ ≤ λ_y^*, and it is optimal to rest it if and only if λ ≥ λ_y^*. We will refer to the mapping y ↦ λ_y^* as the project's Whittle index. See [29].
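The break-even prices can be probed numerically in the simplest discrete-time special case, with all stage durations equal to one. The sketch below solves the λ-price problem by value iteration and locates λ_y^* by bisection on the price, assuming the instance is indexable; it is a naive illustration with illustrative names, not the paper's algorithm.

```python
import numpy as np

def active_set(R, Q, P, beta, lam, iters=300):
    """Optimal active set of the lam-price problem for a discrete-time
    restless bandit with data R[a][y], Q[a][y], P[a][y][j] (value iteration)."""
    n = R.shape[1]
    V = np.zeros(n)
    for _ in range(iters):
        vals = R - lam * Q + beta * np.einsum('ayj,j->ay', P, V)
        V = vals.max(axis=0)
    # states where the active action is strictly preferred
    return {y for y in range(n) if vals[1, y] > vals[0, y] + 1e-10}

def whittle_index(R, Q, P, beta, y, lo=-100.0, hi=100.0, tol=1e-6):
    """Break-even price lam*_y by bisection (assumes indexability and
    that lam*_y lies in [lo, hi])."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if y in active_set(R, Q, P, beta, mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a sanity check, embedding a classic bandit (passive action freezes the state, Q equal to the engagement indicator) recovers the Gittins index in its ratio normalization as the break-even price.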

Exploiting Special Structure: Indexability Relative to a Family of Policies
While one can readily numerically test whether a given restless bandit instance is indexable, a researcher investigating a particular restless bandit model will instead be concerned with analytically establishing its indexability under an appropriate range of model parameters. The key to achieving such a goal is, as in optimal-stopping problems, to exploit special structure by positing a family of (stationary deterministic) policies among which there exists an optimal policy for (10) for every resource price λ ∈ R.
We represent a stationary deterministic policy by its active (state) set, consisting of those controllable states where it prescribes engaging the project. Thus, a family of such policies is given by a family F of active sets S ⊆ Y^{0,1}, and, hence, we will refer to them as F-policies. Relative to such a family, we will call the project F-indexable if (i) it is indexable, and (ii) F-policies are optimal for the λ-price problem (10) for every resource price λ ∈ R.
We will impose the following connectivity requirements on F .

Assumption 2.
The active-set family F satisfies the following conditions:
(i) ∅, Y^{0,1} ∈ F;
(ii) for every nested pair S ⊂ S′ with S, S′ ∈ F, there is a state y ∈ S′ \ S with S ∪ {y} ∈ F;
(iii) S ∪ S′, S ∩ S′ ∈ F for every S, S′ ∈ F.

Note that condition (iii) in Assumption 2 means that F is a lattice relative to set inclusion. As for condition (ii), it ensures that any two nested active sets S, S′ ∈ F with S ⊂ S′ can be connected by an increasing chain S = S_0 ⊂ · · · ⊂ S_k = S′ of adjacent (i.e., differing by one state) sets in F. Further, condition (i) ensures that one can connect ∅ with Y^{0,1} in such a fashion. We will call a set family F satisfying Assumption 2(ii, iii) a monotonically connected lattice.
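On small instances, whether a candidate family F is a monotonically connected lattice can be checked mechanically; a brief sketch with illustrative names:

```python
from itertools import combinations

def is_monotonically_connected_lattice(F):
    """Check Assumption 2(ii, iii) for a family F of sets of states:
    (iii) closure under union and intersection (a lattice), and
    (ii) every nested pair S < T in F admits some y in T \\ S with
    S | {y} in F, so nested sets connect by chains of adjacent sets."""
    Fs = set(map(frozenset, F))
    for S, T in combinations(Fs, 2):
        if S | T not in Fs or S & T not in Fs:
            return False            # condition (iii) fails
    for S in Fs:
        for T in Fs:
            if S < T and not any(S | {y} in Fs for y in T - S):
                return False        # condition (ii) fails
    return True
```

For instance, a threshold (nested) family such as {∅, {0}, {0,1}, {0,1,2}} passes, whereas {∅, {0}, {1}, {0,1,2}} fails, since the union {0,1} is missing.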

New Sufficient Conditions for F -Indexability and Adaptive-Greedy Index Algorithm
Suppose that, for a particular restless bandit model, a suitable active-set family F as above has been posited, relative to which one aims to analytically establish F-indexability. While the aforementioned earlier work of the author gives sufficient conditions for F-indexability, which further ensure that the project's Whittle index can be computed by an adaptive-greedy index algorithm introduced in that work, we next introduce new sufficient conditions that are significantly less restrictive. The new conditions are motivated by the model of concern in this paper which, as mentioned in Section 1, need not satisfy the former conditions.
In order to formulate the new conditions and the index algorithm, we need to define certain marginal metrics, as follows. Given an action a ∈ {0, 1} and an active set S ⊆ Y^{0,1}, write ⟨a, S⟩ for the policy that initially chooses action a and then follows the S-active policy. For a given state y and active set S, consider the marginal work metric

g_y^S ≜ G_y^{⟨1,S⟩} − G_y^{⟨0,S⟩},

which represents the marginal increase in the amount of resource expended resulting from taking first the active rather than the passive action and, then, following the S-active policy. Note that such a marginal work metric vanishes at uncontrollable states: g_y^S = 0 for y ∈ Y^{0}. Further, define the marginal reward metric

f_y^S ≜ F_y^{⟨1,S⟩} − F_y^{⟨0,S⟩},

which represents the corresponding marginal increase in rewards earned. Finally, for g_y^S ≠ 0, define the marginal productivity metric

λ_y^S ≜ f_y^S / g_y^S.

We will consider the adaptive-greedy index algorithm that is given in Algorithm 1 in its top-down version, where index values are computed from highest to lowest; one could similarly consider the symmetric bottom-up version. Such an algorithm has a very simple structure, as it constructs in N steps (recall that N ≜ |Y^{0,1}|) an increasing chain of successive active sets S_0 = ∅ ⊂ S_1 ⊂ · · · ⊂ S_N = Y^{0,1} in F, proceeding at each step in a greedy fashion. Thus, once active set S_{k−1} ∈ F has been obtained, the next active set S_k is constructed by augmenting S_{k−1} with a controllable state y ∈ Y^{0,1} \ S_{k−1} that maximizes the marginal productivity metric λ_y^{S_{k−1}}, restricting attention to states y for which the resulting active set remains in F, so that S_k = S_{k−1} ∪ {y} ∈ F. Ties are broken arbitrarily.
Note that Algorithm 1 only shows an algorithmic scheme, as it does not specify how the required metrics are to be computed. A complete fast-pivoting implementation of such an algorithm is given by the author in [49].
Additionally, note that the algorithm's input consists of all the project's primitive parameters, namely states, rewards, transition probabilities, and discount factor.
The same considerations apply to Algorithm 2.

Algorithm 1:
Top-down adaptive-greedy index algorithm AG F .
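As an illustration only, the scheme AG_F can be rendered in Python for the discrete-time special case (unit stage durations), computing the metrics G^S, F^S by dense linear solves rather than by the fast-pivoting implementation of [49]; the names and data layout are illustrative, positive marginal works are assumed (condition (i) of the sufficient conditions below), and F defaults to the unrestricted family of all active sets.

```python
import numpy as np

def adaptive_greedy(R, Q, P, beta, F=None):
    """Top-down adaptive-greedy scheme for a discrete-time restless bandit
    with data R[a][y], Q[a][y], P[a][y][j].  Returns (state, index) pairs
    in non-increasing index order.  F, if given, is a collection of
    frozensets restricting the admissible active sets."""
    n = R.shape[1]
    S, out = set(), []
    for _ in range(n):
        act = np.array([1 if y in S else 0 for y in range(n)])
        A = np.eye(n) - beta * P[act, np.arange(n), :]   # S-active policy rows
        G = np.linalg.solve(A, Q[act, np.arange(n)])     # resource metric G^S
        Fv = np.linalg.solve(A, R[act, np.arange(n)])    # reward metric F^S
        # marginal work and reward of acting first, then following S
        g = (Q[1] + beta * P[1] @ G) - (Q[0] + beta * P[0] @ G)
        f = (R[1] + beta * P[1] @ Fv) - (R[0] + beta * P[0] @ Fv)
        cand = [y for y in range(n) if y not in S
                and (F is None or frozenset(S | {y}) in F)]
        y = max(cand, key=lambda y: f[y] / g[y])         # greedy pick
        out.append((y, f[y] / g[y]))
        S.add(y)
    return out
```

On the classic-bandit embedding used earlier (passive action freezes the state), the scheme reproduces the Gittins index values in non-increasing order, as the sufficient conditions below guarantee.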
The main result of this section, giving the new indexability conditions and ensuring the validity of the adaptive-greedy index algorithm for computing the Whittle index, is stated next.

Algorithm 2:
Geometrically intuitive reformulation of adaptive-greedy index algorithm AG F .
Theorem 1. The following holds:
(a) Suppose that the project satisfies the following conditions:
(i) positive marginal works: g_y^S > 0 for every active set S ∈ F and controllable state y ∈ Y^{0,1} such that S \ {y} ∈ F or S ∪ {y} ∈ F; (16)
or, equivalently, for every nested active-set pair S ⊂ S′ with S, S′ ∈ F,
G_y^S ≤ G_y^{S′} for every y ∈ Y, with (G_y^S)_{y∈Y} ≠ (G_y^{S′})_{y∈Y}; (17)
(ii) for every resource price λ ∈ R, there exists an optimal F-policy for λ-price problem (10).
Then, the project is F-indexable and algorithm AG_F computes its Whittle index values λ_{y_k}^* in non-increasing order.
(b) If the project is indexable, then it satisfies conditions (i) and (ii) in part (a) for some nested family of adjacent active sets of the form F = {S_0, S_1, . . . , S_N}, with S_0 = ∅ ⊂ S_1 ⊂ · · · ⊂ S_N = Y^{0,1}.

In order to prove Theorem 1, we need to establish a number of preliminary results. Before doing so, let us clarify the improvement that the new sufficient F-indexability conditions (i) and (ii) in Theorem 1(a) represent over those introduced in Niño-Mora [30,31] based on PCLs, namely: (i) g_y^S > 0 for every active set S ∈ F and controllable state y ∈ Y^{0,1}; and (ii) the index values generated by algorithm AG_F are monotone non-increasing, λ_{y_1}^* ≥ · · · ≥ λ_{y_N}^*. Thus, the new condition (i) in Theorem 1(a), as formulated in (16), is significantly less stringent than the old condition (i). Further, the reformulation in (17) clarifies its intuitive meaning: it means that the resource consumption metric G_y^S is monotone non-decreasing in the active set S within the domain F, and that two nested active sets S ⊂ S′ in F give different resource consumption vectors (G_y^S)_{y∈Y} and (G_y^{S′})_{y∈Y}. As for the old condition (ii), the author has found that, in complex models with a multidimensional state, it can be elusive to establish it analytically. In contrast, the new condition (ii) in Theorem 1(a) allows one either to draw on the rich literature available on the optimality of structured policies for special models, or to deploy ad hoc DP arguments to prove the optimality of F-policies for the model at hand.
Note that [50] has proposed sufficient F-indexability conditions which are, however, significantly more restrictive than those herein. Thus, the conditions in [50] require, besides further assumptions including (i) and (ii) in Theorem 1(a), that the resource metric be submodular and the reward metric supermodular in the active set. Theorem 1(a) shows that such extra assumptions are unnecessary.
Theorem 1(b) further assures that the new conditions are also necessary for indexability, in the sense that any indexable restless bandit satisfies them relative to some nested active-set family F , as stated.
We start by establishing the equivalence between the formulations in (16) and (17). Note that x^{a,π}_{y y′} measures the expected total discounted number of decision periods in which action a is chosen in state y′ while using policy π, starting from state y. In the present notation, the relevant relations are those in (19) and (20).

Lemma 3. Conditions (16) and (17) in Theorem 1(a) are equivalent.

Proof. Suppose that (16) holds for a given S ∈ F. Then, on the one hand, g^S_{y′} > 0 for y′ ∈ S such that S \ {y′} ∈ F, along with x^{0,S\{y′}}_{y y′} ≥ 0 for any y, implies, via the first identity in (19), that (17) holds. Conversely, if (17) holds, then the second identity in (19) yields that g^S_{y′} > 0 for every such y′. Therefore, (16) holds, which completes the proof.
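To make the occupancy-measure notation concrete, the following sketch computes the discounted state-action occupancies of a stationary active-set policy by a single matrix inversion. The matrices `P0`, `P1` (passive/active transition matrices) and the function name `occupancies` are illustrative assumptions, not objects from the paper.

```python
import numpy as np

def occupancies(P0, P1, S, beta):
    """Expected total discounted state-action occupancies x[a][y, y'] under
    the stationary policy with active set S (a sketch; P0 and P1 are the
    passive and active transition matrices, beta the discount factor)."""
    n = P0.shape[0]
    active = np.zeros(n, dtype=bool)
    active[list(S)] = True
    # transition matrix induced by the active-set policy
    P = np.where(active[:, None], P1, P0)
    # N[y, y'] = sum_t beta^t P(Y(t) = y' | Y(0) = y)
    N = np.linalg.inv(np.eye(n) - beta * P)
    x1 = N * active[None, :]     # periods in which action 1 is chosen
    x0 = N * (~active)[None, :]  # periods in which action 0 is chosen
    return x0, x1
```

Since x0 + x1 recovers the full discounted occupancy matrix, its rows sum to 1/(1 − β), which gives a quick consistency check.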

Proving Theorem 1: Achievable Resource-Reward Performance Region Approach
We next deploy an approach to prove Theorem 1 that draws on first principles via an intuitive geometric and economic viewpoint introduced in [31,32]. We will find it convenient to consider, instead of (10), the λ-price problem obtained by using the averaged resource and reward metrics G^π := Σ_{y∈Y} p_y G^π_y and F^π := Σ_{y∈Y} p_y F^π_y, where the initial project state Y(0) is drawn from a distribution p with positive probability mass p_y > 0 at every state y ∈ Y; i.e., maximize F^π − λ G^π over π ∈ Π (22). Relative to such metrics, consider the project's achievable resource-reward performance region H, defined as the region in the resource-reward plane consisting of all performance points (G^π, F^π) that can be achieved under admissible project operating policies π ∈ Π. The optimality of stationary deterministic policies for infinite-horizon finite-state and -action SMDPs ensures that H is the closed convex polygon spanned as the convex hull of the points (G^S, F^S) for active sets S ⊆ Y^{\{0,1\}}. Thus, we can reformulate λ-price problem (22) as a linear programming (LP) problem over H. In order to illustrate and clarify such an approach, consider the concrete example of a certain restless bandit having state space Y = Y^{\{0,1\}} = {1, 2, 3} that is discussed in (Sec. 2.2 of [34]). For such a project, Figure 1 in that paper plots the achievable resource-reward performance region H, with points (G^S, F^S) labeled by their active sets S. The fact that such a project is indexable is apparent from the structure of the upper boundary ∂H. In this example, the geometry of the top-down adaptive-greedy algorithm AG_F corresponds to traversing the upper boundary ∂H from left to right, proceeding, at each step, by augmenting the current active set by a new state in a locally greedy fashion, as the slopes in (27) are equivalently formulated as the marginal productivity rates in (26). The insights conveyed by such an example extend to the general setting of concern herein, as elucidated in Niño-Mora [31,32,34].
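The performance-region construction can be illustrated numerically. The sketch below enumerates all active sets of a small synthetic three-state restless bandit, computes the averaged points (G^S, F^S) by linear solves, and verifies that a λ-price objective is maximized at one of these vertices. All numbers and names here are illustrative assumptions, not the example of [34].

```python
import itertools
import numpy as np

def point(P0, P1, r, S, beta, p):
    """Averaged work-reward point (G^S, F^S) of the active-set policy S,
    with initial distribution p (a sketch)."""
    n = len(r)
    active = np.zeros(n, dtype=bool)
    active[list(S)] = True
    P = np.where(active[:, None], P1, P0)
    N = np.linalg.inv(np.eye(n) - beta * P)
    G = N @ active.astype(float)  # discounted work from each start state
    F = N @ (r * active)          # discounted reward from each start state
    return p @ G, p @ F

# synthetic three-state example (illustrative numbers, not from the paper)
rng = np.random.default_rng(0)
P0 = np.full((3, 3), 1 / 3)
P1 = rng.dirichlet(np.ones(3), size=3)
r = np.array([1.0, 2.0, 3.0])
p = np.full(3, 1 / 3)
beta = 0.9
pts = {S: point(P0, P1, r, S, beta, p)
       for k in range(4) for S in itertools.combinations(range(3), k)}
# the lambda-price problem max F - lambda*G is solved at a vertex of H
lam = 1.0
best = max(pts, key=lambda S: pts[S][1] - lam * pts[S][0])
```

The empty active set yields the origin and the full active set yields total work 1/(1 − β), matching the endpoints of the upper boundary described in the text.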
Thus, the indexability of a project is recast as a property of the upper boundary ∂H of region H, whereby it is determined by a nested active-set family as in the example. Note that the equivalence between the geometric slopes in (27) and the marginal productivity rates in (26) follows from (19) and (20) or, more precisely, from the corresponding relations for the averaged metrics, where x^{a,π}_y is the state-action occupancy measure obtained by drawing the initial state according to the probabilities p_y. Thus, assuming condition (i) in Theorem 1(a), such relations hold for every S ∈ F. They allow us to reformulate the adaptive-greedy algorithm AG_F in Algorithm 1 into the geometrically intuitive form shown in Algorithm 2. Such a reformulation clarifies that this algorithm seeks to traverse, from left to right, the upper boundary ∂H, proceeding at each step by augmenting the current active set by a new state in a locally greedy fashion, while only using active sets in F. We next proceed to establish a number of preliminary results on which the proof of Theorem 1 will draw. The first shows that the family of optimal active sets for the λ-price problem is a lattice that contains its intervals.

Lemma 4. If S′ and S′′ are optimal active sets for λ-price problem (22), then so is any S satisfying S′ ∩ S′′ ⊆ S ⊆ S′ ∪ S′′.
Proof. The result is an immediate property of the DP Equations (11) characterizing the optimal stationary deterministic policies (i.e., the optimal active sets) for the λ-price problem.
The following result shows that, under condition (i) in Theorem 1(a), resource consumption metric G S is strictly increasing relative to active-set inclusion in the domain S ∈ F .

Lemma 5. Suppose that condition (i) in Theorem 1(a) holds. Then, G^S < G^{S′} for S ⊂ S′ with S, S′ ∈ F.

Proof. The result follows immediately from the formulation of condition (i) in (17), along with the assumption of positive initial state probabilities p_y > 0 for y ∈ Y.
The next result establishes, under conditions (i) and (ii) in Theorem 1(a), the nondegeneracy of the extreme points of H in the upper boundary ∂H, showing that each is achieved by a unique active set in F.

Lemma 6. Suppose that conditions (i) and (ii) in Theorem 1(a) hold. Then, for every extreme point (G*, F*) of H in ∂H, there exists a unique active set S* ∈ F achieving it, i.e., with (G*, F*) = (G^{S*}, F^{S*}).
Now, since S* ≠ S**, there are two cases to consider. In the first case, if S* ⊄ S**, then S* ∩ S** ⊂ S* ⊆ S* ∪ S** and, hence, by Lemma 5, G^{S*∩S**} < G^{S*} ≤ G^{S*∪S**}, which contradicts (31). In the second case, if S** ⊄ S*, then S* ∩ S** ⊂ S** ⊆ S* ∪ S** and, hence, by Lemma 5, G^{S*∩S**} < G^{S**} ≤ G^{S*∪S**}, which again contradicts (31). Therefore, there cannot exist such an S**, which completes the proof.
We can now prove Theorem 1.
Proof of Theorem 1. (a) We will show that the project is F-indexable by using the geometric characterization of indexability reviewed in the present section, namely, by showing that the upper boundary ∂H is determined by an increasing nested family of adjacent active sets in F connecting ∅ to Y^{\{0,1\}}. We refer the reader to the plot in Figure 1 for a geometric illustration of the following arguments. Let us start by showing that the extreme points of H, which determine ∂H, are attained, from left to right, by a unique increasing chain of active sets in F, not necessarily adjacent. Thus, consider two adjacent extreme points of H in ∂H, i.e., joined by a line segment in ∂H. By Lemma 6, there exist two unique and distinct active sets S, S′ ∈ F whose performance points (G^S, F^S) and (G^{S′}, F^{S′}) achieve such extreme points, where we assume, without loss of generality, that G^S < G^{S′}. We will show that it must be S ⊂ S′. Letting λ = (F^{S′} − F^S)/(G^{S′} − G^S) be the slope of the line segment joining such extreme points, we have that both S and S′ solve the λ-price problem and, hence, by Lemma 4, so do S ∩ S′ and S ∪ S′. Now, from the stated properties of S and S′, it follows that the points (G^{S∩S′}, F^{S∩S′}) and (G^{S∪S′}, F^{S∪S′}) must lie in the line segment joining (G^S, F^S) and (G^{S′}, F^{S′}) and, hence, G^{S∩S′}, G^{S∪S′} ∈ [G^S, G^{S′}]. Further, since, by Assumption 2(iii), S ∩ S′, S ∪ S′ ∈ F, Lemma 5 gives that G^{S∩S′} ≤ G^S and G^{S′} ≤ G^{S∪S′}. Therefore, G^{S∩S′} = G^S and G^{S′} = G^{S∪S′} (32). We next argue, by contradiction, that S ⊂ S′: if such were not the case, i.e., S ⊄ S′, then it would follow that S ∩ S′ ⊂ S ⊂ S ∪ S′ and, hence, by Lemma 5, G^{S∩S′} < G^S < G^{S∪S′}, contradicting (32).
Let us next show that, if any two adjacent extreme points (G^S, F^S) and (G^{S′}, F^{S′}) in ∂H, with G^S < G^{S′}, are determined by active sets S ⊂ S′ in such a chain that are not adjacent, then they can be connected, from left to right, by points in ∂H attained by an increasing chain of adjacent active sets in F. On the one hand, Assumption 2(ii) ensures the existence of an increasing chain of adjacent active sets in F connecting S to S′, say S = T_0 ⊂ T_1 ⊂ · · · ⊂ T_k = S′. On the other hand, if λ = (F^{S′} − F^S)/(G^{S′} − G^S) is the slope of the line segment joining such extreme points, then both S and S′ solve the λ-price problem and, hence, by Lemma 4, so does every intermediate active set T_1, . . . , T_{k−1} in such a chain. Hence, Lemma 5 ensures that G^S < G^{T_1} < · · · < G^{T_{k−1}} < G^{S′}, as required.
In order to establish F-indexability, it only remains to show that the leftmost (resp. rightmost) extreme point of H in ∂H is that attained by active set S = ∅ (resp. S = Y^{\{0,1\}}). This follows from Assumption 2(i), condition (ii) in Theorem 1(a), and Lemma 5. Having established F-indexability, the result that algorithm AG_F computes the project's Whittle index follows immediately from the algorithm's geometric interpretation, as revealed by its reformulation in Algorithm 2. (b) Suppose now that the project is indexable. Then, ∂H is determined by some increasing chain of adjacent active sets connecting ∅ to Y^{\{0,1\}}: S_0 = ∅ ⊂ S_1 ⊂ · · · ⊂ S_N = Y^{\{0,1\}}. Letting F := {S_0, S_1, . . . , S_N}, it is readily seen that such an active-set family satisfies conditions (i) and (ii) in part (a). This completes the proof.
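As a concrete rendering of the top-down adaptive-greedy scheme just analyzed, the sketch below grows the active set from ∅, at each step adding the admissible state with the largest marginal productivity rate. It is a naive O(n^4) illustration with hypothetical names, assuming the marginal work metrics stay positive along the run; when the passive transition matrix is the identity (a non-restless project), the output coincides with the Gittins index.

```python
import numpy as np

def adaptive_greedy(P0, P1, r, beta, in_family=lambda S: True):
    """Top-down adaptive-greedy index sketch: grow the active set S from
    the empty set, at each step adding the admissible state with the
    largest marginal productivity rate f/g (assumes g > 0 throughout)."""
    n = len(r)
    S, index = [], {}
    for _ in range(n):
        active = np.zeros(n, dtype=bool)
        active[S] = True
        P = np.where(active[:, None], P1, P0)
        N = np.linalg.inv(np.eye(n) - beta * P)
        G, F = N @ active.astype(float), N @ (r * active)
        best, best_rate = None, -np.inf
        for y in set(range(n)) - set(S):
            if not in_family(frozenset(S) | {y}):
                continue
            g = 1.0 + beta * (P1[y] - P0[y]) @ G   # marginal work
            f = r[y] + beta * (P1[y] - P0[y]) @ F  # marginal reward
            if f / g > best_rate:
                best, best_rate = y, f / g
        index[best] = best_rate
        S.append(best)
    return index
```

The `in_family` hook restricts the search to active sets in a given family F, mirroring the restriction of algorithm AG_F.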

Application to Projects with Setup Delays and Costs
This section deploys the above framework and results on restless bandit indexation in our motivating model: the restless bandit reformulation of a non-restless bandit with setup costs and delays (and no setdown penalties; cf. Section 2.1), as discussed in Section 2. The project label m is dropped from the notation hereafter.
In this reformulation, all of the augmented states are controllable, i.e., Y = Y^{\{0,1\}}, and an active-state subset of the augmented state space Y representing a stationary deterministic policy is given by specifying original-state subsets S_0, S_1 ⊆ X, such that the project is engaged when it was rested (resp. engaged) previously if the state X(t) belongs to S_0 (resp. to S_1). We will denote such an active set/policy, as in [27], by S_0 ⊕ S_1. We next address the issue of guessing an appropriate family F of active sets S_0 ⊕ S_1 that contains optimal active sets for the λ-price problem of concern (cf. (10)), which is now formulated as problem (33): maximize F^π_{(a−,i)} − λ G^π_{(a−,i)} over π ∈ Π, where F^π_{(a−,i)} and G^π_{(a−,i)} are the reward and resource (work) metrics. The intuition that, under Assumption 1, if engaging the project is optimal when it was not set up, then engaging it should also be optimal when it was set up, leads us to posit the choice F := {S_0 ⊕ S_1 : S_0 ⊆ S_1 ⊆ X} (35). Such an F represents a family of policies that satisfies Assumption 2. If S_0 ⊂ S_1, policy S_0 ⊕ S_1 ∈ F has the hysteresis region S_1 \ S_0; i.e., when the original state X(t) lies in S_1 \ S_0, the policy sticks to the previously chosen action. We will seek to prove indexability with respect to such a family of policies, i.e., F-indexability.
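The action rule of an active set S_0 ⊕ S_1 and its hysteresis region can be stated in a few lines (a minimal sketch; `make_policy` is a hypothetical name):

```python
def make_policy(S0, S1):
    """Action rule of the active set S0 ⊕ S1: engage from a rested project
    if the original state lies in S0, and from a set-up project if it lies
    in S1. With S0 ⊆ S1, the states in S1 \ S0 form the hysteresis region."""
    def action(a_prev, i):
        return int(i in (S1 if a_prev == 1 else S0))
    return action

pi = make_policy(S0={2}, S1={1, 2})
# in the hysteresis state 1, the policy sticks to the previous action
assert pi(0, 1) == 0 and pi(1, 1) == 1
# outside it, the action is independent of the previous one
assert pi(0, 2) == 1 and pi(1, 2) == 1
```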
Note that the marginal work, reward, and productivity metrics defined in general by (12)-(15) now take a corresponding form in terms of the augmented states. We next adapt the general top-down adaptive-greedy algorithm AG_F in Algorithm 1 to the present setting, which yields Algorithm 3, where n := |X| is now the number of project states in the non-restless formulation. The output of the algorithm has been decoupled, noting that, at every step, the algorithm expands the current active set S^{k_0−1}_0 ⊕ S^{k_1−1}_1 by adding a state that can be either of the form (0, i^{k_0}_0) or (1, i^{k_1}_1). Thus, instead of using a single counter k ranging from 0 to 2n, two counters 1 ≤ k_0, k_1 ≤ n are used, related by k = k_0 + k_1 − 1. Henceforth, we use a more algorithm-like notation.

Algorithm 3: Adaptation of index algorithm AG_F to the present model.

Proving That F -Policies Are Optimal
We next aim to establish that condition (ii) in Theorem 1(a) is satisfied by the present model, i.e., that F-policies, i.e., those with active sets S_0 ⊕ S_1 ∈ F as defined by (35), suffice to solve the λ-price problem (33) for any price λ ∈ R. We will use the DP optimality equations that characterize the optimal value function V*_{(a−,i)}(λ) for problem (33), starting from each augmented state (a−, i) ∈ Y. We start by showing that the optimal value function is non-negative.

Lemma 7. For every λ ∈ R and (a−, i) ∈ Y, V*_{(a−,i)}(λ) ≥ 0.
Proof. Because no setdown penalties are assumed (cf. Section 2.1), a possible course of action incurring zero net reward is to set down the project and keep it that way, which yields the result.
We can now prove the optimality of F-policies.

Lemma 8. For every λ ∈ R, there exists an optimal F-policy for the λ-price problem (33).
Proof. Fix λ ∈ R and i ∈ X. It suffices to show that, if resting the project is optimal in state (1, i), then it is also optimal to do so in state (0, i). Let us take that hypothesis as given (40). We aim to show that, then, it is optimal to rest the project in state (0, i). Consider first the case λ < 0. We will argue, by contradiction, that hypothesis (40) then cannot hold, i.e., it cannot be optimal to rest the project once it is active. Drawing on non-restless bandit theory, note that, when the project is active, it is optimal to rest it only if it ever reaches an original state j ∈ X at which λ ≥ λ*_j, where λ*_j is the original (non-restless) bandit's Gittins index. Assumption 1(ii) now ensures that λ*_j ≥ 0 for each j ∈ X and, therefore, it is optimal to keep the project active forever.
Next, consider the case λ ≥ 0. Then, the required optimality of resting in state (0, i) follows from a chain of inequalities, whose second inequality is seen to hold by reformulating it and noting that Assumption 1(ii) and Lemma 7 ensure that the left-hand side of the reformulated inequality is non-negative, while Assumption 1(i) and λ ≥ 0 ensure that its right-hand side is non-positive. This completes the proof.
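As a numerical sanity check of Lemma 8's structural claim, the sketch below runs value iteration on a stylized λ-price problem in which engaging a rested project pays a setup cost c_i and scales the continuation by a setup-delay transform φ_i, while resting freezes the state. These dynamics are an assumption for illustration, not the paper's exact formulation. The check confirms that whenever engaging is optimal from a rested state, it is also optimal from a set-up state.

```python
import numpy as np

# Stylized lambda-price problem (illustrative dynamics): engaging a rested
# project pays c_i and scales the continuation value by phi_i; resting
# freezes the state and earns nothing.
rng = np.random.default_rng(1)
n, beta, lam = 5, 0.9, 0.3
P = rng.dirichlet(np.ones(n), size=n)   # active-transition matrix
r = rng.uniform(0, 1, n)                # active rewards
c = rng.uniform(0, 0.5, n)              # setup costs
phi = rng.uniform(0.7, 1.0, n)          # setup-delay transforms

V0, V1 = np.zeros(n), np.zeros(n)       # value at rested / set-up states
for _ in range(2000):                   # synchronous value iteration
    cont = r - lam + beta * P @ V1      # engage-and-continue value
    q_rest = beta * V0                  # rest: state frozen, project down
    V0, V1 = (np.maximum(q_rest, -c + phi * cont),
              np.maximum(q_rest, cont))
# structural check: engaging optimal when not set up implies engaging
# optimal when set up, so F-policies (S0 subset of S1) suffice
engage0 = -c + phi * (r - lam + beta * P @ V1) >= beta * V0 - 1e-8
engage1 = r - lam + beta * P @ V1 >= beta * V0 - 1e-8
assert np.all(~engage0 | engage1)
```

The same check run over other random seeds and prices λ ≥ 0 gives the same structural conclusion under these stylized dynamics.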

Work Metric Analysis and F -Indexability Proof
We now consider how to calculate the work and marginal work metrics G^{S_0⊕S_1}_{(a−,i)} and g^{S_0⊕S_1}_{(a−,i)} by relating them to the corresponding metrics G^S_i and g^S_i for the underlying non-restless project. We will further use such analyses to establish that condition (i) in Theorem 1(a) holds for the model of concern, thus allowing us to apply that theorem.
For each S ⊆ X, the G^S_i are characterized as the unique solution to the evaluation equations (41), and the marginal work metric g^S_i is evaluated by (42); note that (41) and (42) together imply a further identity relating the two. We now go back to the project's restless bandit reformulation. The next result, whose proof is omitted, as it is immediate, gives the evaluation equations for the work metric G^{S_0⊕S_1}_{(a−,i)} under a given active set.
The following result represents the work metric G^{S_0⊕S_1}_{(a−,i)} in terms of the G^S_j.
Proof. (a) The result follows readily from the definition of S 0 ⊕ S 1 .
(b) For i ∈ S_1, the result follows using Lemma 9 and part (a). (c) The result follows using Lemma 9, the inclusion S_0 ⊆ S_1, and parts (a) and (b).
(d) The result follows readily from the definition of S 0 ⊕ S 1 .
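One plausible reading of the work-metric evaluation equations for the underlying non-restless project, under which resting freezes the state and consumes no work, can be sketched numerically as follows (the equations and names are assumptions for illustration):

```python
import numpy as np

def work_metrics(P, S, beta):
    """Work metric G^S_i and marginal work metric g^S_i of a non-restless
    project under active set S (sketch of one plausible reading of the
    evaluation equations: resting freezes the state, no work consumed)."""
    n = P.shape[0]
    active = np.array([i in S for i in range(n)])
    # G^S_i = 1 + beta * sum_j p_ij G^S_j if i in S, and 0 otherwise
    A = np.eye(n) - beta * active[:, None] * P
    G = np.linalg.solve(A, active.astype(float))
    # g^S_i = (engage now, then follow S) - (rest now, then follow S)
    g = 1.0 + beta * P @ G - beta * G
    return G, g
```

Under the full active set, G^S_i = 1/(1 − β) and g^S_i = 1 for every i, which provides a quick consistency check.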
Concerning the marginal work metric g^{S_0⊕S_1}_{(a−,i)}, (36) and Lemma 9 readily yield its evaluation. The following result represents the marginal work metric g^{S_0⊕S_1}_{(a−,i)} in terms of the g^S_j.
It must now be remarked that, at the corresponding point in the analysis of [27], for the case with no setup delays (φ_i ≡ 1), one could establish the positivity of the marginal work metric, i.e., g^{S_0⊕S_1}_{(a−,i)} > 0 for (a−, i) ∈ Y and S_0 ⊕ S_1 ∈ F, which is the first PCL-indexability condition and implies the less stringent condition (i) in Theorem 1(a). However, here it is apparent from Lemma 11(c) that, for i ∈ S_0, g^{S_0⊕S_1}_{(1,i)} can be negative for β close to 1. This is why we cannot use here the same line of argument given in [27] to show indexability.
As mentioned above, we will use, instead, for such a purpose, Theorem 1(a). The following result shows that condition (i) in that theorem holds for the model of concern.
Lemma 12. For S_0 ⊕ S_1 ∈ F, the marginal work metrics required by condition (i) in Theorem 1(a) are positive.

Proof. First, consider the case S_0 ⊕ S_1 = ∅ ⊕ ∅. Then, using Lemma 11(a-d) along with g^∅_i ≡ 1 gives the required positivity for i ∈ X. Now, consider the case S_0 ⊕ S_1 = X ⊕ X = Y. Then, again using Lemma 11(a-d) along with g^X_i ≡ 1 gives the required positivity for i ∈ X. Finally, consider S_0 ⊕ S_1 ∈ F different from ∅ ⊕ ∅ and X ⊕ X. Then, Lemma 11 and (35) imply that the marginal work metric g^{S_0⊕S_1}_{(a−,i)} could only be negative if a− = 1 and i ∈ S_0. However, such a case is not included in the required conditions, since removing the augmented state (1, i) with i ∈ S_0 from the active set S_0 ⊕ S_1 yields a set outside F. This completes the proof.
We are now ready to deploy Theorem 1(a) in the present model.

Proposition 1.
The present restless bandit model is F -indexable and Algorithm 3 computes its Whittle index.
Proof. Lemmas 8 and 12 show that conditions (i) and (ii) in Theorem 1(a) hold, respectively, which implies the result.

The AT Index Is the Whittle Index
We next use the above results to prove the identity between the Whittle index and the AT index. We will reformulate the AT index formulae in (7)-(8) using active sets S ⊆ X rather than stopping times τ; this yields reformulations of the continuation and switching AT indices in terms of active sets. Recall that we denote the Whittle index by λ*_{(a−,i)}.

Reward Metric Analysis
We proceed by considering how to calculate the reward and marginal reward metrics F^{S_0⊕S_1}_{(a−,i)} and f^{S_0⊕S_1}_{(a−,i)} by relating them to the metrics F^S_i and f^S_i for the corresponding non-restless project with no setup penalties.
For every active set S ⊆ X, the reward metric F^S_i is determined by its evaluation equations, and the marginal reward metric f^S_i is given by the corresponding marginal relation. Going back to the semi-Markov restless bandit reformulation, Lemma 13 gives the evaluation equations for the reward metrics F^{S_0⊕S_1}_{(a−,i)} for an active set S_0 ⊕ S_1 ∈ F.
The following result formulates the reward metric F^{S_0⊕S_1}_{(a−,i)} in terms of the F^S_i.
Proof. (a) This part follows from the definition of S 0 ⊕ S 1 .
(d) The result follows from the definition of S 0 ⊕ S 1 .
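The reward metrics admit a numerical sketch mirroring the work-metric equations, under the same illustrative assumptions (resting earns nothing and freezes the state); the names and equations here are assumptions, not the paper's displays:

```python
import numpy as np

def reward_metrics(P, r, S, beta):
    """Reward metric F^S_i and marginal reward metric f^S_i of a
    non-restless project under active set S (a sketch)."""
    n = P.shape[0]
    active = np.array([i in S for i in range(n)])
    # F^S_i = r_i + beta * sum_j p_ij F^S_j if i in S, and 0 otherwise
    A = np.eye(n) - beta * active[:, None] * P
    F = np.linalg.solve(A, np.where(active, r, 0.0))
    # f^S_i = (engage now, then follow S) - (rest now, then follow S)
    f = r + beta * P @ F - beta * F
    return F, f
```

With constant rewards r_i ≡ 1 and the full active set, F^S_i = 1/(1 − β) and f^S_i = 1 for every i, matching the analogous check for the work metrics.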

Designing an Efficient Two-Stage Index Algorithm
This section draws on the above to develop an efficient index algorithm, which exploits special structure to simplify the one-stage adaptive-greedy algorithm in Algorithm 3 by decoupling the calculation of the continuation and switching indices into a two-stage method, for which an efficient implementation is provided.

Marginal Productivity Metric Analysis
We start by addressing the calculation of the required marginal productivity metrics (38), again by relating them to the metrics λ^S_i for the corresponding non-restless project without setup penalties, which are given by (50). The next result represents the marginal productivity metric λ^{S_0⊕S_1}_{(a−,i)} in terms of the λ^S_j and g^S_j, for i ∈ S_0 such that the corresponding marginal work metric is nonzero. Proof. All of the parts follow readily from (50), (38), and Lemmas 11 and 15.

Simplified Version of the Index Algorithm
The above results allow us to give a simplified and more explicit version of the index algorithm AG_F in Algorithm 3, which is given in Algorithm 4. In it, we draw on Lemma 16(b,d) to formulate the marginal productivity rates λ^{S_0⊕S_1}_{(a−,i)} in terms of the g^S_j and λ^S_j. Note that such simplifications achieve significant savings in computer memory, since storing the quantities g^S_j and λ^S_j entails one less dimension than storing the λ^{S_0⊕S_1}_{(0,i)} and λ^{S_0⊕S_1}_{(1,i)}.

Algorithm 4: Simplified version of index algorithm AG_F.

Two-Stage Implementation of the Index Algorithm
We next proceed to simplify the index algorithm in Algorithm 4 still further by decoupling it into two successive algorithms. The first stage of such a scheme computes the continuation index λ*_{(1,i)}, which, as we saw above, is just the Gittins index λ*_i. We will need additional quantities as input to the second stage: the g^{(k_1)}_j and λ^{(k_1)}_j appearing in Algorithm 4.
In order to obtain such an index and the required additional quantities, consider the algorithmic scheme AG_1 in Algorithm 5, which is a variant of that in [8], reformulated as in [28]. For implementations, we can use the algorithms provided in the latter paper, in particular the fast-pivoting algorithm with extended output, which has a (4/3)n³ + O(n²) arithmetic-operation count.
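For reference, a straightforward (unoptimized) rendering of the largest-remaining-index computation of the Gittins index can be sketched as follows. It recomputes the metrics by dense linear solves at every step, so it runs in O(n^4), in contrast to the (4/3)n³ + O(n²) count of the fast-pivoting implementation cited in the text; the function name and formulation are illustrative assumptions.

```python
import numpy as np

def gittins(P, r, beta):
    """Gittins (continuation) index by a largest-remaining-index sketch,
    computing indices in non-increasing order (naive O(n^4) version)."""
    n = len(r)
    S, index = [], np.empty(n)
    for _ in range(n):
        active = np.zeros(n, dtype=bool)
        active[S] = True
        # work and reward of the continuation-set policy S
        A = np.eye(n) - beta * active[:, None] * P
        G = np.linalg.solve(A, active.astype(float))
        F = np.linalg.solve(A, np.where(active, r, 0.0))
        num = r + beta * P @ F    # reward of engaging, then following S
        den = 1.0 + beta * P @ G  # work of engaging, then following S
        rest = [i for i in range(n) if i not in S]
        i_best = max(rest, key=lambda i: num[i] / den[i])
        index[i_best] = num[i_best] / den[i_best]
        S.append(i_best)
    return index
```

On a two-state instance where state 0 pays reward 1 and then moves to an absorbing zero-reward state 1, the indices come out as 1 and 0, as expected.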
We next address the computation of the switching index in the second stage, once the Gittins index and the required extra quantities have been computed. Consider algorithm AG_0, given in Algorithm 6, whose input is the output of algorithm AG_1 and which returns the switching index λ*_{(0,i)} for every state i. Note that such an algorithm is formulated in a form applying to the case of concern herein, with a positive setup delay at every state j, so φ_j < 1.
We have the following result.
Proof. The fact that algorithm AG_0 calculates the λ*_{(0,i)} follows by noting that it was obtained from algorithm AG_F in Algorithm 4 simply by decoupling the calculation of the λ*_{(0,i)} from that of the λ*_{(1,i)} = λ*_i.
We conjecture that Proposition 4(c) should hold without the qualifications considered above. Now, consider the following examples illustrating the above results. The first example concerns a three-state project with no setdown penalties or setup costs, setup delay transform φ, and β = 0.95. Figure 2 displays the continuation and switching indices for each of the three states versus 1 − φ; each of the lines shown corresponds to one of the project states. The plot agrees with Proposition 4(d,e). It also illustrates that the relative ordering of states induced by the switching index can vary with φ. The second example is based on the same project, but with no setup delays and with setdown delay transform ψ. Figure 3 displays the continuation and switching indices for each of the three states versus 1 − ψ; again, each of the lines shown corresponds to one of the project states. The plots agree with Proposition 4(a,d,f). Note that the continuation index λ*_{(1,i)}(ψ) increases to infinity as ψ vanishes, as the incentive to stick to a project increases steeply as the setdown delay becomes larger. The plot for the switching index further shows that the relative ordering of states can vary with ψ.

Numerical Study
We next report the results of a numerical study based on MATLAB implementations of the algorithms discussed herein, developed by the author.
The first experiment addressed the runtime of the decoupled index computing method. A random project instance with setup delays and costs was generated for each of the following numbers of states: n = 500, 1000, . . . , 5000. For each such n, we recorded the time to compute the continuation index and the required extra quantities using the fast-pivoting algorithm with extended output in [28], the time to compute the switching index by algorithm AG_0, and the time to jointly compute both indices using the simplex-based implementation given in [49] of the adaptive-greedy algorithm AG_F. This experiment was run on a 2.8 GHz PC with 4 GB of memory. Figure 4 shows the results. The left pane plots total runtimes (measured in hours) to compute both indices versus n; red squares represent the AG_F joint-computing scheme, and blue circles represent the two-stage scheme. We see that the latter attained approximately a fourfold speed-up over the former. The right pane plots runtimes (measured in seconds) for the switching index algorithm versus the number of states n. The timescale change from hours to seconds highlights the order-of-magnitude speed-up attained. The following experiments were designed to evaluate the average relative performance of the Whittle index policy in randomly generated two- and three-project instances, both versus the optimal problem value and versus the benchmark Gittins index policy, which does not take setups into account. For each problem instance, the optimal value was calculated by solving, with the CPLEX LP solver, the LP formulation of the DP optimality equations. The Whittle index and benchmark scheduling policies were evaluated by solving, with MATLAB, the appropriate systems of linear evaluation equations.
The second experiment was designed to assess the dependence of the relative performance of Whittle's index policy for two-project instances on a constant setup-time transform φ and discount factor β, with no setdown penalties. A sample of 100 randomly generated instances with 10-state projects was obtained with MATLAB. In each instance, the parameters for each project were independently drawn: transition probabilities (by scaling a matrix with uniform entries) and uniform (between 0 and 1) active rewards. For every instance k = 1, . . . , 100 and parameters (φ, β) ∈ [0.5, 0.99] × [0.5, 0.95], on a 0.1 grid, the optimal value V^{(k),opt} and the values of the Whittle index (V^{(k),W}) and benchmark (V^{(k),bench}) policies were calculated, together with the relative optimality gap of the Whittle index policy, ∆^{(k),W} := 100(V^{(k),opt} − V^{(k),W})/|V^{(k),opt}|, and the optimality-gap ratio of the Whittle index over the benchmark policy, ρ^{(k),W,bench} := 100(V^{(k),W} − V^{(k),opt})/(V^{(k),bench} − V^{(k),opt}). The latter were then averaged over the 100 instances for each (φ, β) pair to obtain the average values ∆^W and ρ^{W,bench}.
The values V^{(k),opt}, V^{(k),W}, and V^{(k),bench} were computed from the corresponding value functions V^{(k),opt}_{((a−_1,i_1),(a−_2,i_2))}, V^{(k),W}_{((a−_1,i_1),(a−_2,i_2))}, and V^{(k),bench}_{((a−_1,i_1),(a−_2,i_2))} over the joint augmented state space. Figure 5 displays, in its left pane, the relative gap ∆^W versus φ (note the inverted φ-axis used throughout) for multiple β, using cubic interpolation. The gap starts at 0 as φ approaches 1 (as the optimal policy is then obtained), then grows up to a maximum, which is below 0.18%, and then decreases to 0 as φ gets smaller. That pattern agrees with intuition: for small enough φ, both the optimal and Whittle index policies initially pick a project and stick to it. Because the best such project can be determined by single-project evaluations, the Whittle index policy will correctly choose it. The right pane shows that ∆^W is not monotonic in β, as it increases for small β and then decreases for β closer to 1. Hence, in the left pane, the higher peaks typically correspond to larger values of β. Figure 6 shows similar plots for the optimality-gap ratio ρ^{W,bench} of the Whittle index over the benchmark policy. They highlight that the average optimality gap for the Whittle index policy remains below 45% of that for the benchmark policy. The left pane shows that the ratio vanishes for small enough φ, as the Whittle index policy is then optimal.
Additionally, the right pane shows that the ratio is increasing with β. Thus, in the left pane, for fixed φ, higher values correspond to larger β.
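For clarity, the two performance measures used throughout the experiments can be written as a short helper (hypothetical name `gaps`):

```python
def gaps(v_opt, v_w, v_bench):
    """Relative optimality gap of the Whittle index policy and its
    optimality-gap ratio over the benchmark policy, both in percent."""
    delta = 100 * (v_opt - v_w) / abs(v_opt)
    rho = 100 * (v_w - v_opt) / (v_bench - v_opt)
    return delta, rho

# e.g. optimal value 100, Whittle policy 99.9, benchmark policy 99.5:
delta, rho = gaps(100.0, 99.9, 99.5)  # delta = 0.1 (%), rho = 20.0 (%)
```

Both numerator and denominator of ρ are non-positive, so the ratio is non-negative and measures the Whittle policy's optimality gap as a fraction of the benchmark's.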
The third experiment was similar in nature to the previous one, but considered instead a constant setup delay T for each project, so that φ = β^T. Figures 7 and 8 show the results, which highlight that Whittle's index policy was optimal for T ≥ 2, its relative optimality gap did not exceed 0.06%, and it substantially outperformed the benchmark Gittins-index policy, as the optimality-gap ratio stays below 2%. The fourth experiment addressed the effect of asymmetric (and constant) setup delay transforms, varying over the range (φ_1, φ_2) ∈ [0.8, 0.99]², in two-project instances with discount factor β = 0.9. The left contour plot in Figure 9 shows that the average relative optimality gap of Whittle's index policy, ∆^W, reaches a maximum of approximately 0.14%, vanishing as both φ_1 and φ_2 get close to unity, and as either of them becomes small enough. The right contour plot shows that the optimality-gap ratio ρ^W reaches a maximum of nearly 50%, vanishing as either φ_1 or φ_2 becomes sufficiently small. The fifth experiment studied the effect of state-dependent setup delay parameters φ_i as the discount factor is changed; for every instance, state-dependent setup parameters were drawn i.i.d. Uniform[0.9, 1]. The left pane of Figure 10 displays the average relative optimality gap versus the discount factor, showing that such a gap stays below 0.14%. The right pane highlights that the average optimality-gap ratio ρ^{W,bench} stays below 20%.
The sixth experiment considered the relative performance of Whittle's index policy on three-project instances in terms of a setup delay parameter φ and the discount factor, using a random sample of 100 instances of three eight-state projects. For each instance, the parameters varied over the range (φ, β) ∈ [0.5, 0.99] × [0.5, 0.95]. The results are displayed in Figures 11 and 12, the counterparts of Figures 5 and 6. Comparing Figures 5 and 11 shows a slight degradation of performance for Whittle's index policy in the latter, although the average gap ∆^W stays small, beneath 0.25%. Comparing Figures 6 and 12 shows similar values for the ratio ρ^{W,bench}.

Conclusions
Bandit models with switching penalties are relevant to a wide variety of applications. Computing optimal policies is generally intractable, which motivates the search for simple policies that can be implemented in practice and perform well. Index policies are an appealing class of policies that have been proposed for such problems. Yet, while algorithms are given in [10,27] for computing the Asawa and Teneketzis index for a bandit with switching costs only, no algorithms had been given in the literature to compute the extension of such an index to bandits with switching penalties that incorporate switching delays. This paper presents the first such algorithm. It further provides evidence, in a numerical study, that the resulting index policy is nearly optimal across the instances considered. This work could be extended in several directions, including developing specialized algorithms for computing the index in particular models that arise in applications.