Mathematics

1 December 2025

Adaptive Optimization for Stochastic Renewal Systems

Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089-2565, USA
This article belongs to the Section D2: Operations Research and Fuzzy Decision Making

Abstract

This paper considers online optimization for a sequence of tasks. Each task can be processed in one of multiple processing modes that affect the duration of the task, the reward earned, and an additional vector of penalties (such as energy or cost). Let A[k] be a random matrix that specifies the parameters of task k. The goal is to observe A[k] at the start of task k and then choose a processing mode for the task so that, over time, the time average reward is maximized subject to time average penalty constraints. This is a renewal optimization problem. It is challenging because the probability distribution for the A[k] sequence is unknown, and efficient decisions must be learned in a timely manner. Prior work shows that any algorithm that comes within ϵ of optimality must have Ω(1/ϵ^2) convergence time. The only known algorithm that can meet this bound operates without time average penalty constraints and uses a diminishing stepsize that cannot adapt when probabilities change. This paper develops a new algorithm that is adaptive and comes within Θ(ϵ) of optimality for any interval of Θ(1/ϵ^2) tasks over which probabilities are held fixed.
MSC:
65K10; 60K05; 93E35

1. Introduction

This paper considers online optimization for a system that performs a sequence of tasks. At the start of task k ∈ {1, 2, 3, …}, a matrix A[k] of parameters about the task is revealed. The controller observes A[k] and then makes a decision about how the task should be processed. The decision, together with A[k], determines the duration of the task, the reward earned by processing that task, and a vector of additional penalties. Specifically, define
\[ T[k] = \text{duration of task } k, \qquad R[k] = \text{reward of task } k, \qquad Y[k] = (Y_1[k], \ldots, Y_n[k]) = \text{penalty vector for task } k \]
where n is a fixed positive integer. The goal is to make decisions over time that maximize the time average reward subject to a collection of time average penalty constraints. This problem has a number of applications, including image classification, video processing, wireless multiple access, and transportation scheduling.
For example, consider a computational device that performs back-to-back image classification tasks with the goal of maximizing time average profit subject to a time average power constraint of p a v and an average per-task quality constraint of q a v . For each task, the device chooses from a collection of classification algorithms, each having a different duration of time and yielding certain profit, energy, and quality characteristics. The number of options for task k is equal to the number of rows of matrix A [ k ] . Numerical values for each option are specified in the corresponding row. Suppose the controller observes the following matrix A [ 1 ] for task 1:
\[ A[1] = \begin{array}{ccccl} \text{duration} & \text{profit} & \text{energy} & \text{quality} & \\ 5.1 & 3.6 & 2.3 & 0.5 & \text{alg 1} \\ 7.0 & 2.8 & 1.5 & 1.0 & \text{alg 2} \\ 10.2 & 3.0 & 1.1 & 1.0 & \text{alg 3} \end{array} \tag{1} \]
Choosing a classification algorithm for task 1 reduces to choosing one of the three rows of A [ 1 ] . Suppose we choose row 2 (algorithm 2). Then, task 1 will have a duration of 7.0 units of time, as illustrated by the width T [ 1 ] in the timeline of Figure 1. Also, the task will have profit, energy, and quality of 2.8 , 1.5 , and 1.0 . If we had instead chosen row 1, then the task duration would be lower and profit would be higher, but the energy consumption would be higher and the task would have lower quality. It is not obvious what decision helps to maximize the time average profit subject to the power and quality constraints. Moreover, the matrices for future tasks can have different row sizes, each matrix A [ k ] is only revealed at the start of task k, and the probability distribution for A [ k ] is unknown.
Figure 1. Four sequential tasks in the timeline. Vertical arrows for each task k represent values for reward R [ k ] and penalty vector Y [ k ] . In this example, green is reward (profit), red is energy, blue is quality. Vector ( T [ k ] , R [ k ] , Y [ k ] ) depends on choices made at the start of task k.
The specific constrained optimization problem for this example is
\[ \text{Maximize:} \quad \lim_{m\to\infty} \frac{\sum_{k=1}^m R[k]}{\sum_{k=1}^m T[k]} \tag{2} \]
\[ \text{Subject to:} \quad \lim_{m\to\infty} \frac{\sum_{k=1}^m \mathrm{Energy}[k]}{\sum_{k=1}^m T[k]} \le p_{av} \tag{3} \]
\[ \lim_{m\to\infty} \frac{1}{m}\sum_{k=1}^m \mathrm{Quality}[k] \ge q_{av} \tag{4} \]
where objective (2) represents the time average profit; (3) imposes the time average power constraint; (4) imposes the per-task average quality constraint. For simplicity of this example, we assume the limits exist. This problem is more concisely posed as maximizing R̄/T̄ subject to Energy̅/T̄ ≤ p_av and Quality̅ ≥ q_av, where R̄, T̄, Energy̅, Quality̅ are per-task averages.
Problem (2)–(4) involves ratios of averages. This paper converts such problems into a canonical form of maximizing a ratio R̄/T̄ subject to penalty constraints Ȳ_i ≤ 0 for i ∈ {1, …, n}, where n is the number of constraints (n = 2 in example problem (2)–(4)). The constraint Energy̅/T̄ ≤ p_av is converted to Ȳ_1 ≤ 0 by defining a penalty process Y_1[k] = Energy[k] − p_av T[k] for k ∈ {1, 2, 3, …}. However, adaptive maximization of the ratio R̄/T̄ is nontrivial, and new techniques are required.
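To make the conversion concrete, the following is a minimal Python sketch (not part of the paper) that builds the canonical rows [T, R, Y_1, Y_2] for the image classification example; the targets p_av = 0.3 and q_av = 0.8 are hypothetical values chosen only for illustration.

```python
import numpy as np

def canonical_rows(A, p_av, q_av):
    """Return rows [T, R, Y1, Y2] where Y1 = energy - p_av*T and Y2 = q_av - quality."""
    T, R, energy, quality = A[:, 0], A[:, 1], A[:, 2], A[:, 3]
    Y1 = energy - p_av * T      # enforces time-average power <= p_av
    Y2 = q_av - quality         # enforces per-task average quality >= q_av
    return np.column_stack([T, R, Y1, Y2])

A1 = np.array([[5.1, 3.6, 2.3, 0.5],
               [7.0, 2.8, 1.5, 1.0],
               [10.2, 3.0, 1.1, 1.0]])
print(canonical_rows(A1, p_av=0.3, q_av=0.8))
```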
Optimality for problem (2)–(4) depends on the joint probability distribution for the size and entries of matrix A [ k ] . We first assume { A [ k ] } k = 1 are independent and identically distributed (i.i.d.) as matrices over tasks k { 1 , 2 , 3 , } . This yields a well-defined ergodic optimality for problem (2)–(4). However, the multi-dimensional distribution for the entries of A [ k ] is unknown, and the parameter space is enormous. Instead of attempting to learn the distribution, this paper uses techniques similar to the drift-plus-penalty method of [1] that acts on weighted functions of the observed random A [ k ] . Specifically, this paper develops a new optimization strategy that acts on a single timescale for real-time optimization of ratios of averages.
Given ϵ > 0, how long does it take to come within ϵ of the infinite horizon optimality for problem (2)–(4)? This question is partially addressed in the prior work [2] for problems without time average penalty constraints. That prior work uses a Robbins–Monro algorithm with a vanishing stepsize. It cannot adapt if probability distributions change over time. The current paper develops a new algorithm that is adaptive. While infinite horizon optimality is defined by imagining {A[k]} as an infinite i.i.d. sequence, our algorithm can be analyzed over any finite block of m tasks {k_0, k_0+1, …, k_0+m−1} for which i.i.d. behavior is assumed. Indeed, fix any algorithm parameter ϵ > 0 and consider any integer m ≥ 1/ϵ^2. Over any block of m tasks during which A[k] is i.i.d., the time average expected performance of our algorithm is within O(ϵ) of infinite horizon optimality (as defined by the A[k] distribution for this block), regardless of the distribution or sample path behavior of A[k] before the block. When the algorithm is implemented over all time, it tracks the new optimality that results if probability distributions change multiple times in the timeline, without knowing when changes occur, provided that each new distribution lasts for a duration of at least Θ(1/ϵ^2) tasks. This new Θ(1/ϵ^2) achievability result matches a converse bound proven for unconstrained problems in [2].

1.1. Model

Fix n as a positive integer. At the start of each task k ∈ {1, 2, 3, …} the controller observes matrix A[k] with size M[k] × (n+2), where M[k] is the random number of processing options for task k. Each row r ∈ {1, …, M[k]} has the form
\[ [\, T_r[k],\ R_r[k],\ Y_{r,1}[k], \ldots, Y_{r,n}[k] \,] \]
It is assumed that all rows have T_r[k] ≥ t_min, where t_min > 0 is some minimum task duration. Let (T[k], R[k], Y[k]) be the vector of values for the row selected on task k, where Y[k] = (Y_1[k], …, Y_n[k]). For each positive integer m, define R̄[m] by
\[ \bar{R}[m] = \frac{1}{m}\sum_{k=1}^m R[k] \]
Define T ¯ [ m ] and Y ¯ [ m ] similarly. The infinite horizon problem is
\[ \text{Maximize:} \quad \liminf_{m\to\infty} \frac{\mathbb{E}\{\bar{R}[m]\}}{\mathbb{E}\{\bar{T}[m]\}} \tag{5} \]
\[ \text{Subject to:} \quad \limsup_{m\to\infty} \mathbb{E}\{\bar{Y}_i[m]\} \le 0 \quad \forall i \in \{1,\ldots,n\} \tag{6} \]
\[ (T[k], R[k], Y[k]) \in \mathrm{Row}(A[k]) \quad \forall k \in \{1,2,3,\ldots\} \tag{7} \]
where Row(A[k]) denotes the set of rows of matrix A[k]. Note that T[k] ≥ t_min > 0 always, so there is no divide-by-zero issue.
Problem (5)–(7) is assumed to be feasible, meaning it is possible to satisfy the constraints (6) and (7). Let θ* denote the optimal objective in (5). The formulation (5)–(7) has two differences in comparison with the introductory example (2)–(4): (i) all time average constraints are expressed as a single average (rather than a ratio); (ii) sample path limits are replaced by limits that use expectations. This is similar to the treatment in [2]. This preserves the infinite horizon value of θ* and facilitates analysis of adaptation time. A decision policy is said to be an ϵ-approximation with convergence time d if
\[ \frac{\mathbb{E}\{\bar{R}[m]\}}{\mathbb{E}\{\bar{T}[m]\}} \ge \theta^* - \epsilon \quad \forall m \ge d \tag{8} \]
\[ \mathbb{E}\{\bar{Y}_i[m]\} \le \epsilon \quad \forall i \in \{1,\ldots,n\},\ \forall m \ge d \tag{9} \]
A decision policy is said to be an O ( ϵ ) -approximation (with convergence time d) if all appearances of ϵ in the above definition are replaced by some constant multiple of ϵ .
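As an illustration of this definition (an informal sketch, not part of the paper), the following Python function estimates an empirical convergence time d from Monte Carlo estimates of the expectations in (8) and (9); the array names are assumptions of this sketch.

```python
import numpy as np

def empirical_convergence_time(ER_bar, ET_bar, EY_bar, theta_star, eps):
    """ER_bar, ET_bar: length-M arrays; EY_bar: (M, n) array of penalty averages."""
    ratio_ok = ER_bar / ET_bar >= theta_star - eps        # condition (8)
    penalty_ok = np.all(EY_bar <= eps, axis=1)            # condition (9)
    ok = ratio_ok & penalty_ok
    # suffix_ok[m-1] is True when every horizon m' >= m satisfies both conditions
    suffix_ok = np.flip(np.minimum.accumulate(np.flip(ok)))
    if not suffix_ok.any():
        return None
    return int(np.argmax(suffix_ok)) + 1                  # smallest such d (1-indexed)
```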

1.2. Prior Work

The fractional structure of the objective (5) is similar in spirit to a linear fractional program. Linear fractional programs can be converted to linear programs using a nonlinear change in variables (see [3,4]). This conversion is used in [3] for offline design of policies for embedded Markov decision problems. Such methods cannot be directly used for our online problem because time averages are not preserved under nonlinear transformations. Related nonlinear transformations are used in [5] for offline control for opportunistic Markov decision problems where states include an A [ k ] process similar to the current paper. The work in [5] discusses how offline strategies can be leveraged for online use (such as with a two-timescale approach), although overall convergence time is unclear.
The problem (5)–(7) is posed in [6], where it is called a renewal optimization problem (see also Chapter 7 of [1]). The solution in [6] constructs virtual queues for each time average inequality constraint (6) and makes a decision for each task k to minimize a drift-plus-penalty ratio:
\[ \frac{\mathbb{E}\{\Delta[k] - v R[k] \mid H[k]\}}{\mathbb{E}\{T[k] \mid H[k]\}} \tag{10} \]
where Δ [ k ] is the change in a Lyapunov function on the virtual queues; v is a parameter that affects accuracy; H [ k ] is system history before task k. Exact minimization of (10) is impossible unless the distribution for A [ k ] is known. A method for approximately minimizing (10) by sampling A [ k ] over a window of previous tasks is developed in [6], although only a partial convergence analysis is given there. This prior work is based on the Lyapunov drift and max-weight scheduling methods developed for fixed timeslot queuing systems in [7,8]. Data center applications of renewal optimization are in [9]. Asynchronous renewal systems are treated in [10].
A different approach in [2] is used for a problem that seeks only to maximize time average reward R ¯ / T ¯ (with no penalties Y [ k ] ). For each task k, that method chooses ( T [ k ] , R [ k ] ) R o w ( A [ k ] ) to maximize R [ k ] θ [ k 1 ] T [ k ] , where θ [ k 1 ] is an estimate of θ * that is updated according to a Robbins–Monro iteration:
\[ \theta[k] = \theta[k-1] + \eta[k]\big( R[k] - \theta[k-1]\, T[k] \big) \tag{11} \]
where η[k] is a stepsize. See [11] for the original Robbins–Monro algorithm and [12,13,14,15,16,17] for extensions in other contexts. The approach (11) is desirable because it does not require sampling from a window of past A[k] values. Further, under a vanishing stepsize rule, the optimality gap decreases like O(1/√k), which is asymptotically optimal [2]. However, it is unclear how to extend this method to handle time average penalties Y[k]. Further, while the vanishing stepsize enables fast convergence, it places ever-increasing weight on past observations and cannot adapt if probabilities change. The work in [2] shows a fixed stepsize rule is better for adaptation but has a slower convergence time of O(1/ϵ^3).
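For concreteness, a minimal Python sketch of one step of the Robbins–Monro approach (11) for the unconstrained case is given below; the vanishing stepsize η[k] = 1/(k+1) matches the rule used in the simulations of Section 6 but is otherwise an assumption of this sketch.

```python
def robbins_monro_step(A_k, theta, k):
    """A_k: list of (T, R) rows for task k; theta: current estimate of theta*."""
    # choose the row maximizing R - theta*T, then update theta via (11) with a vanishing stepsize
    T, R = max(A_k, key=lambda row: row[1] - theta * row[0])
    eta = 1.0 / (k + 1)
    theta_next = theta + eta * (R - theta * T)
    return (T, R), theta_next
```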
Fixed stepsizes are known to enable adaptation in other contexts. For online convex optimization, it is shown in [18] that a fixed stepsize enables the time-averaged cost to be within O ( ϵ ) of optimality (as compared with the best fixed decision in hindsight) over any sequence of Θ ( 1 / ϵ 2 ) steps. For adaptive estimation, work in [19] considers the problem of removing bias from Markov-based samples. The work in [19] develops an adaptive Robbins–Monro technique that averages between two fixed stepsizes. Adaptive algorithms are also of interest in convex bandit problems, see [20,21].
A different type of problem is nonlinear robust optimization, which seeks to solve a deterministic nonlinear program involving a decision vector x and a vector of (unchosen) uncertain values u (see [22]). The robust design treats worst-case behavior with constraints such as c(x, u) ≤ 0 for all u ∈ U(x), where U(x) is a (possibly infinite) set of possible values of u given the decision vector x. Iterative methods, such as those in [23], use gradients and linear approximations to sequentially calculate updates that get closer to minimizing a cost function subject to the robust constraint specifications.
Our problem is an opportunistic scheduling problem because A[k] is revealed at the start of each task k (before a scheduling decision is made). While A[k] can be viewed as helpful side information, it is challenging to make optimal use of this side information. The policy space is huge: optimality depends on the full (and unknown) joint distribution of entries in matrix A[k]. A different class of problems, called vector-based bandit problems, has a simpler policy space where there is no A[k] information. There, the controller pulls an arm from a fixed set of m arms, each arm giving a vector-based reward with an unknown distribution that is the same each time it is pulled. In that context, optimality depends only on the mean rewards. Estimation of the means can be performed efficiently by exploring each arm according to various bandit techniques, such as the resource-constrained techniques in [24,25,26]. Such problems have a different learning structure that does not relate to the current paper.

1.3. Our Contributions

We develop an algorithm for renewal optimization that does not require probability information or sampling from the past. The algorithm has explicit convergence guarantees and meets the optimal asymptotic convergence time bound of [2]. Unlike the Robbins–Monro method in [2], our new algorithm allows for penalty constraints. Furthermore, our algorithm is adaptive and achieves performance within O ( ϵ ) of optimality over any sequence of Θ ( 1 / ϵ 2 ) tasks. This fast adaptation is enabled by a new hierarchical decision structure.

2. Preliminaries

2.1. Notation

For x, y ∈ R^n we use x·y = Σ_{i=1}^n x_i y_i and ‖x‖ = √(Σ_{i=1}^n x_i^2). In this paper, we use the following vectors and constants:
  • A [ k ] = matrix for task k (for row selection)
  • T [ k ] = duration of task k
  • R [ k ] = reward of task k
  • Y [ k ] = ( Y 1 [ k ] , , Y n [ k ] ) = vector of penalties for task k
  • Q [ k ] = ( Q 1 [ k ] , , Q n [ k ] ) = vector of virtual queues for penalty constraints
  • J [ k ] = virtual queue used to optimize reward
  • v = weight emphasis on reward
  • q i = size upper bound parameter for virtual queue Q i [ k ]
  • θ * = optimal time average reward
  • r m a x = upper bound on R [ k ]
  • c = upper bound on | | Y [ k ] | |
  • y_{i,min}, y_{i,max} = bounds on Y_i[k] (−y_{i,min} ≤ Y_i[k] ≤ y_{i,max})
  • [ t m i n , t m a x ] = bounds on T [ k ]
  • γ m i n = 1 / t m a x , γ m a x = 1 / t m i n
  • s = Slater condition parameter
  • b , β 1 , β 2 , d 0 , d 1 , d 2 = constants in theorems (based on q i , r m a x , c , t m i n , t m a x , y i , m i n , y i , m a x ).

2.2. Boundedness Assumptions

Assume the following bounds hold for all k { 1 , 2 , 3 , } , all A [ k ] , and all choices of ( T [ k ] , R [ k ] , Y [ k ] ) R o w ( A [ k ] ) :
\[ t_{min} \le T[k] \le t_{max} \tag{12} \]
\[ 0 \le R[k] \le r_{max} \tag{13} \]
\[ \|Y[k]\| \le c \tag{14} \]
\[ -y_{i,min} \le Y_i[k] \le y_{i,max} \quad \forall i \in \{1,\ldots,n\} \tag{15} \]
where t_min, t_max, r_max, c, y_{i,min}, y_{i,max} are nonnegative constants (with t_min > 0). Constraint (13) assumes all rewards R[k] are nonnegative. This is without loss of generality: if the system can have negative rewards in some bounded interval −r_min ≤ R[k] ≤ r_max, where r_min > 0, we define a new nonnegative reward G[k] = R[k] + (r_min/t_min) T[k]. Since T[k] ≥ t_min, we have G[k] ≥ −r_min + r_min = 0. The objective of maximizing Ḡ/T̄ is the same as the objective of maximizing R̄/T̄, since the two ratios differ by the constant r_min/t_min.

2.3. The Sets Γ and Γ ¯

Assume { A [ k ] } k = 1 is a sequence of i.i.d. random matrices with an unknown distribution. (When appropriate, this is relaxed to assume i.i.d. behavior occurs only over a finite block of consecutive tasks.) For k { 1 , 2 , 3 , } , define a decision vector ( T [ k ] , R [ k ] , Y [ k ] ) as a random vector that satisfies
( T [ k ] , R [ k ] , Y [ k ] ) R o w ( A [ k ] )
Let Γ R n + 2 be the set of all expectations E ( T [ k ] , R [ k ] , Y [ k ] ) for a given task k, considering all possible decision vectors. The set Γ depends on the (unknown) distribution of A [ k ] and considers all conditional probabilities for choosing a row given the observed A [ k ] . The { A [ k ] } k = 1 matrices are i.i.d. and so Γ is the same for all k { 1 , 2 , 3 , } . It can be shown that Γ is nonempty, bounded, and convex (see Section 4.11 in [1]). Its closure Γ ¯ is compact and convex. Define the history up to task m as
\[ H[m] = (A[1], A[2], \ldots, A[m-1]) \]
where H [ 1 ] is defined to be the constant 0.
Lemma 1.
Suppose { A [ k ] } k = 1 are i.i.d. and satisfy the boundedness assumptions (12)–(15). Then, for every ( t , r , y ) Γ and k { 1 , 2 , 3 , } , there exists a random decision vector
( T * [ k ] , R * [ k ] , Y * [ k ] ) R o w ( A [ k ] )
that is independent of H [ k ] and that satisfies (with probability 1):
\[ \mathbb{E}\{ (T^*[k], R^*[k], Y^*[k]) \mid H[k] \} = (t, r, y) \]
Proof. 
By definition of Γ , given any ( t , r , y ) Γ , there is a conditional distribution for choosing a row of A [ k ] (given the observed A [ k ] ) under which the (unconditional) expected value of the chosen row is ( t , r , y ) . Let U [ k ] U n i f [ 0 , 1 ] be a random variable that is independent of ( A [ k ] , H [ k ] ) . Use U [ k ] to implement the randomized row selection (according to the desired conditional distribution) after A [ k ] is observed. Formally, the random row ( T * [ k ] , R * [ k ] , Y * [ k ] ) can be viewed as a Borel measurable function of ( A [ k ] , U [ k ] ) . (See Proposition 5.13 in [27], where ξ there plays the role of our ( T * [ k ] , R * [ k ] , Y * [ k ] ) which takes values in the Borel space R n + 2 ; η plays the role of our A [ k ] ; ζ plays the role of our H [ k ] ; ν plays the role of our U [ k ] ). Since ( A [ k ] , U [ k ] ) is independent of history H [ k ] , the resulting random row ( T * [ k ] , R * [ k ] , Y * [ k ] ) is independent of H [ k ] (so with probability 1, its conditional expectation given H [ k ] is the same as its unconditional expectation). □
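The randomized row selection in this proof can be sketched as follows (an illustration under assumed inputs, not part of the paper): a conditional distribution `probs` over the rows of the observed A[k] is implemented with a single uniform random variable, drawn independently of the history.

```python
import random

def select_row(A_k, probs, u=None):
    """A_k: list of rows; probs: function mapping A_k to a probability vector over its rows."""
    p = probs(A_k)
    u = random.random() if u is None else u   # plays the role of U[k] ~ Unif[0,1]
    cumulative = 0.0
    for row, pi in zip(A_k, p):
        cumulative += pi
        if u <= cumulative:
            return row
    return A_k[-1]   # guard against floating-point round-off
```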

2.4. The Deterministic Problem

For analysis of our stochastic problem, it is useful to consider the closely related deterministic problem
\[ \text{Maximize:} \quad r/t \tag{16} \]
\[ \text{Subject to:} \quad y_i \le 0 \quad \forall i \in \{1,\ldots,n\} \tag{17} \]
\[ (t, r, y) \in \bar{\Gamma} \tag{18} \]
where Γ̄ is the closure of Γ. All points (t, r, y) ∈ Γ̄ have t ≥ t_min > 0, so there are no divide-by-zero issues. Using arguments similar to those in Section 4.11 in [1], it can be shown that (i) the stochastic problem (5)–(7) is feasible if and only if the deterministic problem (16)–(18) is feasible; (ii) if feasible, the optimal objective values are the same. Specifically, if (t*, r*, y*) solves (16)–(18), then
θ * = r * / t *
where θ * is the optimal objective for both the deterministic problem (16)–(18) and the stochastic problem (5)–(7). The deterministic problem (16)–(18) seeks to maximize a continuous function r / t over the compact set defined by constraints (17) and (18), so it has an optimal solution whenever it is feasible.
When n > 0 (so there is at least one time average penalty constraint of the form (17)), we assume a Slater condition that is more stringent than mere feasibility: There is a value s > 0 and a vector ( t s , r s , y s ) Γ ¯ such that
\[ y_i^s \le -s \quad \forall i \in \{1,\ldots,n\} \tag{19} \]

3. Algorithm Development

3.1. Parameters and Constants

The algorithm uses parameters v > 0 , α > 0 , q = ( q 1 , , q n ) with q i 0 for all i { 1 , , n } , to be precisely determined later. The constants t m i n , t m a x , r m a x , y i , m i n , y i , m a x , c from the boundedness assumptions (12)–(15) are assumed known. Define
\[ \gamma_{min} = 1/t_{max}, \qquad \gamma_{max} = 1/t_{min} \tag{20} \]
The algorithm introduces auxiliary variables γ [ k ] chosen in the interval [ γ m i n , γ m a x ] for each task k { 1 , 2 , 3 , } .

3.2. Intuition

For intuition, temporarily assume the following limits exist:
\[ \bar{Y}_i = \lim_{m\to\infty} \frac{1}{m}\sum_{k=1}^m Y_i[k], \qquad \overline{1/\gamma} = \lim_{m\to\infty} \frac{1}{m}\sum_{k=1}^m 1/\gamma[k] \]
The idea is to make decisions for row selection (and γ [ k ] selection) for each task k so that, over time, the following time-averaged problem is solved:
\[ \text{Maximize:} \quad \lim_{m\to\infty} \frac{\frac{1}{m}\sum_{k=1}^m R[k]\,\gamma[k]/\gamma[k-1]}{\frac{1}{m}\sum_{k=1}^m 1/\gamma[k-1]} \tag{21} \]
\[ \text{Subject to:} \quad \bar{Y}_i \le 0 \quad \forall i \in \{1,\ldots,n\} \tag{22} \]
\[ \bar{T} \le \overline{1/\gamma} \tag{23} \]
\[ \gamma[k] \in [\gamma_{min}, \gamma_{max}] \quad \forall k \tag{24} \]
\[ (T[k], R[k], Y[k]) \in \mathrm{Row}(A[k]) \quad \forall k \tag{25} \]
\[ \gamma[k] \ \text{varies slowly over } k \tag{26} \]
This is an informal description because the constraint "γ[k] varies slowly" is not precise. Intuitively, if γ[k] does not change much from one task to the next, the above objective is close to the ratio \bar{R}/\overline{1/\gamma}, which (by the second constraint) is less than or equal to the desired objective R̄/T̄. This is useful because, as we show, the above problem can be treated using a novel hierarchical optimization method.

3.3. Virtual Queues

To enforce the constraints Ȳ_i ≤ 0, for each i ∈ {1, …, n} define a process Q_i[k] with initial condition Q_i[1] = 0 and update equation
\[ Q_i[k+1] = \big[\, Q_i[k] + Y_i[k] \,\big]_0^{q_i v} \quad \forall k \in \{1,2,3,\ldots\} \tag{27} \]
where v > 0 and q = (q_1, …, q_n) are given nonnegative parameters (to be sized later), and where [z]_0^{q_i v} denotes the projection of a real number z onto the interval [0, q_i v]:
\[ [z]_0^{q_i v} = \begin{cases} q_i v & \text{if } z > q_i v \\ z & \text{if } z \in [0, q_i v] \\ 0 & \text{else} \end{cases} \]
To enforce the constraint (23), define a process J[k] by
\[ J[k+1] = \big[\, J[k] + T[k] - 1/\gamma[k] \,\big]_0 \quad \forall k \in \{1,2,3,\ldots\} \tag{28} \]
where [z]_0 = max{z, 0},
with J [ 1 ] = 0 . The processes Q i [ k ] and J [ k ] shall be called virtual queues because their update resembles a queuing system with arrivals and service for each k. Such virtual queues can be viewed as time-varying Lagrange multipliers and are standard for enforcing time average inequality constraints (see [1]).
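A minimal Python sketch of the virtual queue updates (27) and (28), assuming NumPy arrays for Q, q, and Y, is:

```python
import numpy as np

def update_queues(Q, J, Y, T, gamma, q, v):
    """One update of the length-n array Q via (27) and of the scalar J via (28)."""
    Q_next = np.clip(Q + Y, 0.0, q * v)          # (27): project onto [0, q_i v] componentwise
    J_next = max(J + T - 1.0 / gamma, 0.0)       # (28): project onto [0, infinity)
    return Q_next, J_next
```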
For each task k and each i { 1 , , n } , define 1 i [ k ] as
\[ 1_i[k] = \begin{cases} 1 & \text{if } Q_i[k] > q_i v - y_{i,max} \\ 0 & \text{else} \end{cases} \tag{29} \]
Lemma 2.
Fix k 0 and m as positive integers. Consider the iterations (27) and (28) under any decisions that satisfy ( T [ k ] , R [ k ] , Y [ k ] ) R o w ( A [ k ] ) and γ [ k ] [ γ m i n , γ m a x ] for all k. Then for all i { 1 , , n } the following hold:
\[ \frac{1}{m}\sum_{k=k_0}^{k_0+m-1} \big( Y_i[k] - 1_i[k]\, y_{i,max} \big) \le \frac{q_i v}{m} \tag{30} \]
\[ \frac{1}{m}\sum_{k=k_0}^{k_0+m-1} \big( T[k] - 1/\gamma[k] \big) \le \frac{J[k_0+m]}{m} \tag{31} \]
Proof. 
Fix i { 1 , , n } and k { 1 , 2 , 3 , } . We first claim
\[ Y_i[k] - 1_i[k]\, y_{i,max} \le Q_i[k+1] - Q_i[k] \tag{32} \]
To verify (32), consider the two cases:
  • Case 1: Suppose Q_i[k] + Y_i[k] > q_i v. It follows by (27) that Q_i[k+1] = q_i v. Also, since Y_i[k] ≤ y_{i,max}, we have Q_i[k] > q_i v − y_{i,max} and so 1_i[k] = 1. It follows that (32) reduces to Y_i[k] − y_{i,max} ≤ q_i v − Q_i[k], which is true because the left-hand side is always nonpositive while the right-hand side is always nonnegative.
  • Case 2: Suppose Q_i[k] + Y_i[k] ≤ q_i v. The update (27) then gives Q_i[k+1] ≥ Q_i[k] + Y_i[k], so (32) again holds (recall y_{i,max} ≥ 0).
Summing (32) over k { k 0 , , k 0 + m 1 } gives
\[ \sum_{k=k_0}^{k_0+m-1} \big( Y_i[k] - 1_i[k]\, y_{i,max} \big) \le Q_i[k_0+m] - Q_i[k_0] \le q_i v \]
where the final inequality holds because the update (27) ensures Q_i[k] ∈ [0, q_i v] for all k. Dividing by m proves (30).
To prove (31), observe the update (28) implies
\[ J[k+1] \ge J[k] + T[k] - 1/\gamma[k] \quad \forall k \in \{1,2,3,\ldots\} \tag{33} \]
Summing over k { k 0 , , k 0 + m 1 } gives
\[ J[k_0+m] - J[k_0] \ge \sum_{k=k_0}^{k_0+m-1} \big( T[k] - 1/\gamma[k] \big) \]
The result follows by dividing by m and observing J[k_0] ≥ 0. □

3.4. Lyapunov Drift

Define Q [ k ] = ( Q 1 [ k ] , , Q n [ k ] ) . Define
\[ L[k] = \frac{1}{2} J[k]^2 + \frac{1}{2} \|Q[k]\|^2 \]
where ‖Q[k]‖^2 = Σ_{i=1}^n Q_i[k]^2. The value L[k] can be viewed as a Lyapunov function on the queue state for task k. Define
\[ \Delta[k] = L[k+1] - L[k] \]
Lemma 3.
For each k { 1 , 2 , 3 , } and any ( T [ k ] , R [ k ] , Y [ k ] ) R o w ( A [ k ] ) and γ [ k ] [ γ m i n , γ m a x ] , we have
\[ \Delta[k] \le b + J[k]\big( T[k] - 1/\gamma[k] \big) + Q[k]\cdot Y[k] \tag{34} \]
where
\[ b = \frac{1}{2}\Big( c^2 + (t_{max} - t_{min})^2 \Big) \]
with c , t m i n , t m a x defined in (14), (12) and γ m i n , γ m a x defined in (20).
Proof. 
Squaring (27) and using ([z]_0^{q_i v})^2 ≤ z^2 for all z ∈ R gives
\[ Q_i[k+1]^2 \le (Q_i[k] + Y_i[k])^2 = Q_i[k]^2 + Y_i[k]^2 + 2 Q_i[k] Y_i[k] \]
Summing over i ∈ {1, …, n} and dividing by 2 gives
\[ \frac{1}{2}\|Q[k+1]\|^2 \le \frac{1}{2}\|Q[k]\|^2 + \frac{1}{2}\|Y[k]\|^2 + Q[k]\cdot Y[k] \le \frac{1}{2}\|Q[k]\|^2 + \frac{1}{2}c^2 + Q[k]\cdot Y[k] \tag{35} \]
where we have used the boundedness assumption (14). Similarly, squaring (28) and using ([z]_0)^2 ≤ z^2 for all z ∈ R gives
\[ \frac{1}{2}J[k+1]^2 \le \frac{1}{2}J[k]^2 + \frac{1}{2}\big( T[k] - 1/\gamma[k] \big)^2 + J[k]\big( T[k] - 1/\gamma[k] \big) \tag{36} \]
Summing (35) and (36) gives
\[ \Delta[k] \le \frac{1}{2}\Big[ c^2 + \big( T[k] - 1/\gamma[k] \big)^2 \Big] + J[k]\big( T[k] - 1/\gamma[k] \big) + Q[k]\cdot Y[k] \]
The result follows by observing
\[ \big( T[k] - 1/\gamma[k] \big)^2 \le (t_{max} - t_{min})^2 \]
which holds by the boundedness assumption (12) and the fact 1/γ[k] ∈ [t_min, t_max]. □
The above lemma implies that
\[ \Delta[k] - v R[k] \le b - v R[k] + J[k]\big( T[k] - 1/\gamma[k] \big) + Q[k]\cdot Y[k] \tag{37} \]
For each task k { 1 , 2 , 3 , } , our hierarchical algorithm performs the following:
  • Step 1: Choose ( T [ k ] , R [ k ] , Y [ k ] ) R o w ( A [ k ] ) to greedily minimize the right-hand side of (37) (ignoring the term that depends on γ [ k ] ).
  • Step 2: Treating γ [ k 1 ] , T [ k ] , R [ k ] , Y [ k ] as known constants, choose γ [ k ] [ γ m i n , γ m a x ] to minimize
    \[ \underbrace{\frac{\gamma[k]}{\gamma[k-1]}\Big( -v R[k] + J[k]\big( T[k] - 1/\gamma[k] \big) + Q[k]\cdot Y[k] \Big)}_{\text{for (21)}} \; + \; \underbrace{\frac{\alpha v^2}{2}\big( \gamma[k] - \gamma[k-1] \big)^2}_{\text{for (26)}} \]
The term in the first underbrace relates to the objective (21) and arises by multiplying both sides of (37) by γ [ k ] / γ [ k 1 ] ; the term in the second underbrace is a weighted “prox-type” term that, for our purposes, acts only to enforce constraint (26).

3.5. Algorithm

Fix parameters v > 0, α > 0, q = (q_1, …, q_n) with q_i ≥ 0 for i ∈ {1, …, n} (to be sized later). Initialize γ[0] = γ_min, J[1] = 0, Q[1] = (0, …, 0). For each task k ∈ {1, 2, 3, …}, perform the following steps (a code sketch of one iteration is given after the list):
  • Row selection: Observe Q [ k ] , J [ k ] , A [ k ] and treat these as given constants. Choose ( T [ k ] , R [ k ] , Y [ k ] ) R o w ( A [ k ] ) to minimize
    \[ -v R[k] + J[k]\, T[k] + Q[k]\cdot Y[k] \]
    breaking ties arbitrarily (such as by using the smallest indexed row).
  • γ [ k ] selection: Observe Q [ k ] , J [ k ] , γ [ k 1 ] , and the decisions ( T [ k ] , R [ k ] , Y [ k ] ) just made by the row selection, and treat these as given constants. Choose γ [ k ] [ γ m i n , γ m a x ] to minimize the following quadratic function of γ [ k ] :
    \[ \gamma[k]\Big( -v R[k] + J[k]\, T[k] + Q[k]\cdot Y[k] \Big) + \gamma[k-1]\,\frac{\alpha v^2}{2}\big( \gamma[k] - \gamma[k-1] \big)^2 \]
    The explicit solution is
    \[ \gamma[k] = \left[ \gamma[k-1] + \frac{v R[k] - J[k]\, T[k] - Q[k]\cdot Y[k]}{\gamma[k-1]\,\alpha v^2} \right]_{\gamma_{min}}^{\gamma_{max}} \tag{38} \]
    where [z]_{γ_min}^{γ_max} denotes the projection of z ∈ R onto the interval [γ_min, γ_max].
  • Virtual queue updates: Update the virtual queues via (27) and (28).
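The following Python sketch puts the three steps together for one task, assuming each A[k] is supplied as an array whose rows are [T, R, Y_1, …, Y_n]; it is an illustration of the algorithm above, with parameter values left to the user (see Theorems 1 and 2 for sizing).

```python
import numpy as np

def task_iteration(A_k, Q, J, gamma_prev, v, alpha, q, gamma_min, gamma_max):
    T, R, Y = A_k[:, 0], A_k[:, 1], A_k[:, 2:]
    # Row selection: minimize -v R + J T + Q . Y over the rows of A[k]
    values = -v * R + J * T + Y @ Q
    r = int(np.argmin(values))                      # ties broken by smallest index
    Tk, Rk, Yk = T[r], R[r], Y[r]
    # gamma selection: closed-form minimizer (38), projected onto [gamma_min, gamma_max]
    gamma = gamma_prev + (v * Rk - J * Tk - Q @ Yk) / (gamma_prev * alpha * v**2)
    gamma = min(max(gamma, gamma_min), gamma_max)
    # Virtual queue updates (27)-(28)
    Q_next = np.clip(Q + Yk, 0.0, q * v)
    J_next = max(J + Tk - 1.0 / gamma, 0.0)
    return (Tk, Rk, Yk), gamma, Q_next, J_next
```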

3.6. Example Decision Procedure

Consider a step k ∈ {1, 2, 3, …} where A[k] is given by the example matrix with three rows in (1). In that context, each row represents an image classification algorithm option, and p_av and q_av are desired constraints on time average power and task-average quality. The matrix is repeated here with Y_1[k] = energy[k] − p_av T[k] and Y_2[k] = q_av − quality[k]:
\[ A[k] = \begin{array}{ccccl} T[k] & R[k] & Y_1[k] & Y_2[k] & \\ 5.1 & 3.6 & 2.3 - 5.1\, p_{av} & q_{av} - 0.5 & \text{row 1 (alg 1)} \\ 7.0 & 2.8 & 1.5 - 7.0\, p_{av} & q_{av} - 1.0 & \text{row 2 (alg 2)} \\ 10.2 & 3.0 & 1.1 - 10.2\, p_{av} & q_{av} - 1.0 & \text{row 3 (alg 3)} \end{array} \]
The row selection decision for step k uses the current virtual queue values J [ k ] , Q 1 [ k ] , Q 2 [ k ] to compute the following row values:
  • Row 1: −3.6 v + 5.1 J[k] + (2.3 − 5.1 p_av) Q_1[k] + (q_av − 0.5) Q_2[k]
  • Row 2: −2.8 v + 7.0 J[k] + (1.5 − 7.0 p_av) Q_1[k] + (q_av − 1.0) Q_2[k]
  • Row 3: −3.0 v + 10.2 J[k] + (1.1 − 10.2 p_av) Q_1[k] + (q_av − 1.0) Q_2[k]
The smallest row value is then chosen (breaking ties arbitrarily). Assuming this selection leads to row 2, the γ [ k ] value is updated as
\[ \gamma[k] = \left[ \gamma[k-1] + \frac{2.8\, v - 7.0\, J[k] - (1.5 - 7.0\, p_{av})\, Q_1[k] - (q_{av} - 1.0)\, Q_2[k]}{\gamma[k-1]\,\alpha v^2} \right]_{\gamma_{min}}^{\gamma_{max}} \]
Finally, the virtual queues are updated via
\[ J[k+1] = \big[ J[k] + 7.0 - 1/\gamma[k] \big]_0, \qquad Q_1[k+1] = \big[ Q_1[k] + 1.5 - 7.0\, p_{av} \big]_0^{q_1 v}, \qquad Q_2[k+1] = \big[ Q_2[k] + q_{av} - 1.0 \big]_0^{q_2 v} \]
Row selection is the most complicated part of the algorithm at each step k: suppose there are at most m rows in the matrix A[k]. For each row, we must perform (n+2) multiply-adds to compute the row value, and then select the row with the smallest value. The worst-case complexity of the row selection is roughly m(n+2). In this example, we have n = 2 and m = 3, so implementation is simple. In cases when the number of rows is large, say 10^6, row selection can be parallelized via multiple processors.
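As a worked illustration of the row values above (with hypothetical queue values and targets, chosen only to show the arithmetic):

```python
# Hypothetical values for illustration only: v, J, Q1, Q2, p_av, q_av are not from the paper.
v, J, Q1, Q2 = 10.0, 2.0, 1.0, 0.5
p_av, q_av = 0.3, 0.8
rows = [(5.1, 3.6, 2.3, 0.5), (7.0, 2.8, 1.5, 1.0), (10.2, 3.0, 1.1, 1.0)]
values = [-v * R + J * T + (E - p_av * T) * Q1 + (q_av - qual) * Q2
          for (T, R, E, qual) in rows]          # (n+2) = 4 multiply-adds per row
best = min(range(len(rows)), key=lambda r: values[r])
print(values, "-> choose row", best + 1)
```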

3.7. Key Analysis

Fix k { 1 , 2 , 3 , } . The row selection decision of our algorithm implies
\[ -v R[k] + J[k]\, T[k] + Q[k]\cdot Y[k] \le -v R^*[k] + J[k]\, T^*[k] + Q[k]\cdot Y^*[k] \tag{39} \]
where ( T * [ k ] , R * [ k ] , Y * [ k ] ) is any other vector in R o w ( A [ k ] ) (including any decision vector for task k that is chosen according to some optimized probability distribution).
The γ[k] selection decision of our algorithm chooses γ[k] ∈ [γ_min, γ_max] to minimize a function of γ[k] that is β-strongly convex with parameter β = γ[k−1] α v^2. A standard strongly convex pushback result (see, for example, Lemma 2.1 in [16]) ensures that if β > 0, c ∈ R, and f: [a, b] → R is a convex function over some interval [a, b], and if γ_opt ∈ [a, b] minimizes f(γ) + (β/2)(γ − c)^2 over all γ ∈ [a, b], then f(γ_opt) + (β/2)(γ_opt − c)^2 ≤ f(γ) + (β/2)(γ − c)^2 − (β/2)(γ − γ_opt)^2 for all γ ∈ [a, b]. Since our γ[k] minimizes a β-strongly convex function, we obtain
\[ \begin{aligned} &\gamma[k]\big( -vR[k] + J[k]\, T[k] + Q[k]\cdot Y[k] \big) + \frac{\gamma[k-1]\alpha v^2}{2}\big(\gamma[k]-\gamma[k-1]\big)^2 \\ &\quad\le \gamma^*\big( -vR[k] + J[k]\, T[k] + Q[k]\cdot Y[k] \big) + \frac{\gamma[k-1]\alpha v^2}{2}\big(\gamma^*-\gamma[k-1]\big)^2 - \underbrace{\frac{\gamma[k-1]\alpha v^2}{2}\big(\gamma^*-\gamma[k]\big)^2}_{\text{pushback}} \\ &\quad\le \gamma^*\big( -vR^*[k] + J[k]\, T^*[k] + Q[k]\cdot Y^*[k] \big) + \frac{\gamma[k-1]\alpha v^2}{2}\big(\gamma^*-\gamma[k-1]\big)^2 - \frac{\gamma[k-1]\alpha v^2}{2}\big(\gamma^*-\gamma[k]\big)^2 \end{aligned} \]
where the first inequality highlights the pushback term that arises from strong convexity; the second inequality holds by (39) and the fact γ * > 0 .
Dividing the above inequality by γ [ k 1 ] > 0 gives
\[ \frac{\gamma[k]}{\gamma[k-1]}\big[ -vR[k] + J[k]\, T[k] + Q[k]\cdot Y[k] \big] + \frac{\alpha v^2}{2}\big(\gamma[k]-\gamma[k-1]\big)^2 \le \frac{\gamma^*}{\gamma[k-1]}\big[ -vR^*[k] + J[k]\, T^*[k] + Q[k]\cdot Y^*[k] \big] + \frac{\alpha v^2}{2}\big(\gamma^*-\gamma[k-1]\big)^2 - \frac{\alpha v^2}{2}\big(\gamma^*-\gamma[k]\big)^2 \tag{40} \]
Lemma 4.
For any sample path of { A [ k ] } k = 1 , for each k { 1 , 2 , 3 , } , our algorithm yields
\[ \Delta[k] - vR[k] \le b + \frac{\gamma^*}{\gamma[k-1]}\Big[ -vR^*[k] + J[k]\big( T^*[k] - 1/\gamma^* \big) + Q[k]\cdot Y^*[k] \Big] + \frac{\alpha v^2}{2}\big(\gamma^*-\gamma[k-1]\big)^2 - \frac{\alpha v^2}{2}\big(\gamma^*-\gamma[k]\big)^2 + \frac{\big( v r_{max} + c v \|q\| + (t_{max}-t_{min}) J[k] \big)^2}{2 \gamma_{min}^2 \alpha v^2} \tag{41} \]
where Δ[k] − vR[k], γ[k], and γ[k−1] are the actual values that arise in our algorithm; γ* is any real number in [γ_min, γ_max]; and (T*[k], R*[k], Y*[k]) is any vector in Row(A[k]).
Proof. 
Adding −J[k]/γ[k−1] to both sides of (40) gives
\[ LHS[k] := \frac{\gamma[k]}{\gamma[k-1]}\big[ -vR[k] + J[k]\big(T[k]-1/\gamma[k]\big) + Q[k]\cdot Y[k] \big] + \frac{\alpha v^2}{2}\big(\gamma[k]-\gamma[k-1]\big)^2 \le \frac{\gamma^*}{\gamma[k-1]}\big[ -vR^*[k] + J[k]\big(T^*[k]-1/\gamma^*\big) + Q[k]\cdot Y^*[k] \big] + \frac{\alpha v^2}{2}\big(\gamma^*-\gamma[k-1]\big)^2 - \frac{\alpha v^2}{2}\big(\gamma^*-\gamma[k]\big)^2 \tag{42} \]
where L H S [ k ] is defined as the left-hand side of (42). Rearranging terms in L H S [ k ] gives
\[ \begin{aligned} LHS[k] &= -vR[k] + J[k]\big(T[k]-1/\gamma[k]\big) + Q[k]\cdot Y[k] + \frac{\alpha v^2}{2}\big(\gamma[k]-\gamma[k-1]\big)^2 + \frac{\gamma[k]-\gamma[k-1]}{\gamma[k-1]}\big[ -vR[k] + J[k]\big(T[k]-1/\gamma[k]\big) + Q[k]\cdot Y[k] \big] \\ &\ge \Delta[k] - vR[k] - b + \frac{\alpha v^2}{2}\big(\gamma[k]-\gamma[k-1]\big)^2 + \frac{\gamma[k]-\gamma[k-1]}{\gamma[k-1]}\big[ -vR[k] + J[k]\big(T[k]-1/\gamma[k]\big) + Q[k]\cdot Y[k] \big] \\ &\ge \Delta[k] - vR[k] - b - \frac{\big( -vR[k] + J[k]\big(T[k]-1/\gamma[k]\big) + Q[k]\cdot Y[k] \big)^2}{\gamma[k-1]^2\, 2\alpha v^2} \tag{43} \end{aligned} \]
where the first inequality holds by (34); the final inequality holds since all x , y R satisfy
\[ \frac{\alpha v^2}{2} x^2 + y x \ge -\frac{y^2}{2\alpha v^2} \]
which holds by completing the square (we use x = γ [ k ] γ [ k 1 ] ). Now, observe
\[ |{-vR[k]}| \le v r_{max}, \qquad \big| J[k]\big(T[k]-1/\gamma[k]\big) \big| \le (t_{max}-t_{min})\, J[k], \qquad \big| Q[k]\cdot Y[k] \big| \le \|Q[k]\|\cdot\|Y[k]\| \le c\, v \|q\| \]
where the final inequality uses the Cauchy–Schwarz inequality, assumption (14), and the fact 0 ≤ Q_i[k] ≤ q_i v. Substituting these bounds into the right-hand side of (43) gives
\[ LHS[k] \ge \Delta[k] - vR[k] - b - \frac{\big( v r_{max} + c v \|q\| + (t_{max}-t_{min}) J[k] \big)^2}{2\, \gamma[k-1]^2\, \alpha v^2} \]
Substituting this into (42) and using γ[k−1] ≥ γ_min proves the result. □
Lemma 5.
Suppose problem (16)–(18) has optimal solution ( t * , r * , y * ) Γ ¯ and optimal objective θ * = r * / t * . Then, for k { 1 , 2 , 3 , } , we have (with probability 1):
\[ \mathbb{E}\{ \Delta[k] - vR[k] \mid H[k] \} \le b - \frac{v\theta^*}{\gamma[k-1]} + \frac{\alpha v^2}{2}\,\mathbb{E}\Big\{ \big(1/t^* - \gamma[k-1]\big)^2 - \big(1/t^* - \gamma[k]\big)^2 \,\Big|\, H[k] \Big\} + \frac{\big( v r_{max} + c v \|q\| + (t_{max}-t_{min}) J[k] \big)^2}{2 \gamma_{min}^2 \alpha v^2} \tag{44} \]
where H [ k ] = ( A [ 1 ] , , A [ k 1 ] ) , and H [ k ] determines Q [ k ] and J [ k ] .
Proof. 
Fix k { 1 , 2 , 3 , } . Fix ( t , r , y ) Γ . Observe that t [ t m i n , t m a x ] and so 1 / t [ γ m i n , γ m a x ] . By Lemma 1, there is a decision vector ( T * [ k ] , R * [ k ] , Y * [ k ] ) R o w ( A [ k ] ) that is independent of H [ k ] such that
\[ \mathbb{E}\{ (T^*[k], R^*[k], Y^*[k]) \mid H[k] \} = (t, r, y) \tag{45} \]
Substituting this ( T * [ k ] , R * [ k ] , Y * [ k ] ) , along with γ * = 1 / t , into (41) gives
\[ \Delta[k] - vR[k] \le b + \frac{1/t}{\gamma[k-1]}\big[ -vR^*[k] + J[k]\big(T^*[k] - t\big) + Q[k]\cdot Y^*[k] \big] + \frac{\alpha v^2}{2}\big(1/t - \gamma[k-1]\big)^2 - \frac{\alpha v^2}{2}\big(1/t - \gamma[k]\big)^2 + \frac{\big( v r_{max} + c v \|q\| + (t_{max}-t_{min}) J[k] \big)^2}{2 \gamma_{min}^2 \alpha v^2} \tag{46} \]
Taking conditional expectations and using (45) gives
\[ \mathbb{E}\{ \Delta[k] - vR[k] \mid H[k] \} \le b + \frac{1/t}{\gamma[k-1]}\big[ -v r + Q[k]\cdot y \big] + \frac{\alpha v^2}{2}\mathbb{E}\big\{ (1/t - \gamma[k-1])^2 - (1/t - \gamma[k])^2 \mid H[k] \big\} + \frac{\big( v r_{max} + c v \|q\| + (t_{max}-t_{min}) J[k] \big)^2}{2 \gamma_{min}^2 \alpha v^2} \tag{47} \]
This holds for all ( t , r , y ) Γ . Since ( t * , r * , y * ) Γ ¯ , there is a sequence { ( t j , r j , y j ) } j = 1 in Γ that converges to ( t * , r * , y * ) . Taking a limit over such points in (47) gives
\[ \mathbb{E}\{ \Delta[k] - vR[k] \mid H[k] \} \le b + \frac{1/t^*}{\gamma[k-1]}\big[ -v r^* + Q[k]\cdot y^* \big] + \frac{\alpha v^2}{2}\mathbb{E}\big\{ (1/t^* - \gamma[k-1])^2 - (1/t^* - \gamma[k])^2 \mid H[k] \big\} + \frac{\big( v r_{max} + c v \|q\| + (t_{max}-t_{min}) J[k] \big)^2}{2 \gamma_{min}^2 \alpha v^2} \]
Recall the optimal solution has y* = (y_1*, …, y_n*) with y_i* ≤ 0 for all i. The result is obtained by substituting Q[k]·y* ≤ 0 and r*/t* = θ*. □

4. Reward Guarantee

We first provide a deterministic bound on J [ k ] .
Lemma 6.
Under any { A [ k ] } k = 1 sequence, our algorithm yields
\[ 0 \le J[k] \le v(\beta_1 + \beta_2) \tag{48} \]
where β_1 and β_2 are nonnegative constants defined by
\[ \beta_1 = \frac{1 + r_{max} + \sum_{i=1}^n q_i\, y_{i,min}}{t_{min}} \tag{49} \]
\[ \beta_2 = \frac{1}{v}\Big\lceil \alpha v \gamma_{max} (\gamma_{max} - \gamma_{min}) \Big\rceil (t_{max} - t_{min}) \tag{50} \]
where ⌈z⌉ denotes the smallest integer greater than or equal to the real number z.
Proof. 
The update (28) shows J [ k ] is always nonnegative. Define
\[ m = \big\lceil \alpha v \gamma_{max} (\gamma_{max} - \gamma_{min}) \big\rceil \tag{51} \]
We first make two claims:
  • Claim 1: If J[k_0] ≤ v β_1 for some task k_0 ∈ {1, 2, 3, …}, then
    \[ J[k] \le v(\beta_1 + \beta_2) \quad \forall k \in \{k_0, k_0+1, \ldots, k_0+m\} \]
    To prove Claim 1, observe that for each task k, we have
    \[ T[k] - 1/\gamma[k] \le t_{max} - 1/\gamma_{max} = t_{max} - t_{min} \]
    Thus, the update (28) implies J[k] can increase by at most t_max − t_min over one task. Thus, J[k] can increase by at most m(t_max − t_min) over any sequence of m or fewer tasks. By construction, m(t_max − t_min) = v β_2. It follows that if J[k_0] ≤ v β_1, then J[k] ≤ v β_1 + v β_2 for all k ∈ {k_0, k_0+1, …, k_0+m}.
  • Claim 2: If J[k] ≥ v β_1 for some task k ∈ {1, 2, 3, …}, then γ[k] ≤ γ[k−1], and in particular,
    \[ \gamma[k] \le \max\Big\{ \gamma_{min},\ \gamma[k-1] - \frac{1}{\gamma_{max}\,\alpha v} \Big\} \tag{52} \]
    To prove Claim 2, suppose J[k] ≥ v β_1. Observe that
    \[ \frac{v R[k] - J[k]\, T[k] - Q[k]\cdot Y[k]}{\gamma[k-1]\,\alpha v^2} \overset{(a)}{\le} \frac{v r_{max} - v\beta_1 t_{min} + \sum_{i=1}^n q_i v\, y_{i,min}}{\gamma[k-1]\,\alpha v^2} \overset{(b)}{=} \frac{-1}{\gamma[k-1]\,\alpha v} \le \frac{-1}{\gamma_{max}\,\alpha v} \]
    where (a) holds because R[k] ≤ r_max, T[k] ≥ t_min, and Q_i[k] Y_i[k] ≥ −q_i v y_{i,min} for all i ∈ {1, …, n} (recall −y_{i,min} ≤ Y_i[k] ≤ y_{i,max}); equality (b) holds by definition of β_1. Claim 2 follows in view of the iteration (38).
Since J[1] = 0 ≤ v β_1, Claim 1 implies J[k] ≤ v(β_1 + β_2) for all k ∈ {1, …, 1+m}. Now use induction: suppose J[k] ≤ v(β_1 + β_2) for all k ∈ {1, …, k_0} for some integer k_0 ≥ 1 + m. We show this is also true for k_0 + 1. If J[k] ≤ v β_1 for some k ∈ {k_0+1−m, …, k_0}, then Claim 1 implies J[k_0+1] ≤ v(β_1 + β_2), and we are done.
Now, suppose J[k] > v β_1 for all k ∈ {k_0+1−m, …, k_0}. Claim 2 implies (52) holds for all k ∈ {k_0+1−m, …, k_0} and
\[ \gamma[k_0] \le \gamma[k_0-1] \le \cdots \le \gamma[k_0+1-m] \le \gamma[k_0-m] \]
Therefore, if γ[k] = γ_min for some k ∈ {k_0+1−m, …, k_0}, then γ[k_0] = γ_min = 1/t_max and the update (28) gives
\[ J[k_0+1] = \big[ J[k_0] + T[k_0] - t_{max} \big]_0 \le J[k_0] \le v(\beta_1 + \beta_2) \]
where the final inequality is the induction assumption, and we are done. We now show the remaining case γ[k] > γ_min for all k ∈ {k_0+1−m, …, k_0} is impossible. Suppose γ[k] > γ_min for all k ∈ {k_0+1−m, …, k_0} (we reach a contradiction). Then, (52) implies
\[ \gamma[k] \le \gamma[k-1] - \frac{1}{\gamma_{max}\,\alpha v} \quad \forall k \in \{k_0+1-m, \ldots, k_0\} \]
Summing over k ∈ {k_0+1−m, …, k_0} gives
\[ \gamma[k_0] \le \gamma[k_0-m] - \frac{m}{\gamma_{max}\,\alpha v} \]
and so
\[ \gamma[k_0] \le \gamma_{max} - \frac{m}{\gamma_{max}\,\alpha v} \overset{(a)}{\le} \gamma_{max} - (\gamma_{max} - \gamma_{min}) = \gamma_{min} \]
where inequality (a) holds by definition of m in (51). This contradicts γ [ k 0 ] > γ m i n . □

4.1. Reward over Any m Consecutive Tasks

For positive integers m, k_0, define R̄[k_0; m] and T̄[k_0; m] as
\[ \bar{R}[k_0; m] = \frac{1}{m}\sum_{k=k_0}^{k_0+m-1} R[k], \qquad \bar{T}[k_0; m] = \frac{1}{m}\sum_{k=k_0}^{k_0+m-1} T[k] \]
Assume { A [ k ] } k = k 0 k 0 + m 1 is i.i.d. over this block of tasks. Define Γ ¯ and the corresponding deterministic problem (16)–(18) with respect to the (unknown) distribution for A [ k ] .
Theorem 1.
Suppose the problem (16)–(18) is feasible with optimal solution ( t * , r * , y * ) Γ ¯ and optimal objective value θ * = r * / t * . Then, for any parameters v > 0 , α > 0 , q = ( q 1 , , q n ) 0 and all positive integers k 0 , m , our algorithm yields
\[ \frac{\mathbb{E}\{\bar{R}[k_0; m]\}}{\mathbb{E}\{\bar{T}[k_0; m]\}} \ge \theta^* - \frac{d_1}{v} - \frac{v\, d_2}{m} - \frac{r_{max}/t_{min}}{m} \tag{53} \]
where d_1, d_2 are defined by
\[ d_1 = \frac{b + \frac{1}{2\gamma_{min}^2\alpha}\big( r_{max} + c\|q\| + (t_{max}-t_{min})(\beta_1+\beta_2) \big)^2}{t_{min}} \tag{54} \]
\[ d_2 = \frac{\frac{1}{2}\|q\|^2 + \frac{1}{2}(\beta_1+\beta_2)^2 + \frac{\alpha}{2}(\gamma_{max}-\gamma_{min})^2 + \theta^*(\beta_1+\beta_2)}{t_{min}} \tag{55} \]
where β_1, β_2 are defined in (49), (50). In particular, fixing q_i ≥ 0 and ϵ > 0 and choosing v = 1/ϵ, α = 1 gives, for all k_0:
\[ \frac{\mathbb{E}\{\bar{R}[k_0; m]\}}{\mathbb{E}\{\bar{T}[k_0; m]\}} \ge \theta^* - O(\epsilon) \quad \forall m \ge 1/\epsilon^2 \tag{56} \]
Similar behavior holds when replacing α = 1 with α = c 1 / max [ c 2 , 1 / 2 ] , where c 1 , c 2 are fine-tuned constants (defined later) in (61), (62).
Proof. 
Fix k ∈ {2, 3, 4, …}. Using iterated expectations and substituting J[k] ≤ v(β_1 + β_2) (from Lemma 6) into (44) gives
\[ \mathbb{E}\{\Delta[k] - vR[k]\} \le b - v\theta^*\,\mathbb{E}\Big\{\frac{1}{\gamma[k-1]}\Big\} + \frac{\alpha v^2}{2}\mathbb{E}\big\{(1/t^*-\gamma[k-1])^2 - (1/t^*-\gamma[k])^2\big\} + \frac{\big(r_{max} + c\|q\| + (t_{max}-t_{min})(\beta_1+\beta_2)\big)^2}{2\gamma_{min}^2\alpha} \tag{57} \]
Manipulating the second term on the right-hand side above gives
\[ -v\theta^*\,\mathbb{E}\Big\{\frac{1}{\gamma[k-1]}\Big\} = -v\theta^*\,\mathbb{E}\{T[k-1]\} + v\theta^*\,\mathbb{E}\Big\{T[k-1] - \frac{1}{\gamma[k-1]}\Big\} \le -v\theta^*\,\mathbb{E}\{T[k-1]\} + v\theta^*\,\mathbb{E}\{J[k] - J[k-1]\} \]
where the final inequality holds by (33). Substituting this into the right-hand side of (57) gives
\[ \mathbb{E}\{\Delta[k] - vR[k]\} \le b - v\theta^*\,\mathbb{E}\{T[k-1]\} + v\theta^*\,\mathbb{E}\{J[k]-J[k-1]\} + \frac{\alpha v^2}{2}\mathbb{E}\big\{(1/t^*-\gamma[k-1])^2 - (1/t^*-\gamma[k])^2\big\} + \frac{\big(r_{max} + c\|q\| + (t_{max}-t_{min})(\beta_1+\beta_2)\big)^2}{2\gamma_{min}^2\alpha} \]
Summing the above over k ∈ {k_0+1, …, k_0+m} and dividing by mv gives
\[ \begin{aligned} \frac{1}{mv}\mathbb{E}\{L[k_0+m+1] - L[k_0+1]\} - \frac{1}{m}\sum_{k=k_0+1}^{k_0+m}\mathbb{E}\{R[k]\} &\le \frac{b}{v} - \theta^*\,\frac{1}{m}\sum_{k=k_0+1}^{k_0+m}\mathbb{E}\{T[k-1]\} + \frac{\theta^*}{m}\mathbb{E}\{J[k_0+m] - J[k_0]\} \\ &\quad + \frac{\alpha v}{2m}\mathbb{E}\big\{(1/t^*-\gamma[k_0])^2 - (1/t^*-\gamma[k_0+m])^2\big\} + \frac{\big(r_{max} + c\|q\| + (t_{max}-t_{min})(\beta_1+\beta_2)\big)^2}{2v\gamma_{min}^2\alpha} \\ &\le \frac{b}{v} - \theta^*\,\mathbb{E}\{\bar{T}[k_0; m]\} + \frac{\theta^*}{m}v(\beta_1+\beta_2) + \frac{\alpha v}{2m}(\gamma_{max}-\gamma_{min})^2 + \frac{\big(r_{max} + c\|q\| + (t_{max}-t_{min})(\beta_1+\beta_2)\big)^2}{2v\gamma_{min}^2\alpha} \end{aligned} \tag{59} \]
where the final inequality substitutes the definition of T ¯ [ k 0 ; m ] and uses
\[ 0 \le J[k] \le v(\beta_1+\beta_2) \ \ \forall k, \qquad \big(1/t^* - \gamma[k]\big)^2 \le (\gamma_{max}-\gamma_{min})^2 \ \ \forall k \]
Rearranging terms in (59) gives
\[ \begin{aligned} \mathbb{E}\{\bar{R}[k_0; m]\} - \theta^*\,\mathbb{E}\{\bar{T}[k_0; m]\} &\ge -\frac{\mathbb{E}\{R[k_0+m] - R[k_0]\}}{m} - \frac{1}{mv}\mathbb{E}\{L[k_0+1] - L[k_0+m+1]\} \\ &\quad - \frac{b + \frac{1}{2\gamma_{min}^2\alpha}\big(r_{max} + c\|q\| + (t_{max}-t_{min})(\beta_1+\beta_2)\big)^2}{v} - \frac{v}{m}\Big( \frac{\alpha}{2}(\gamma_{max}-\gamma_{min})^2 + \theta^*(\beta_1+\beta_2) \Big) \end{aligned} \]
Terms on the right-hand side have the following bounds:
\[ \mathbb{E}\{R[k_0+m] - R[k_0]\} \le r_{max}, \qquad \mathbb{E}\{L[k_0+1] - L[k_0+m+1]\} \le \mathbb{E}\{L[k_0+1]\} \le \frac{1}{2}\|v q\|^2 + \frac{1}{2}v^2(\beta_1+\beta_2)^2 \]
where the final inequality uses
\[ L[k_0+1] = \frac{1}{2}\|Q[k_0+1]\|^2 + \frac{1}{2}J[k_0+1]^2 \]
and the facts ‖Q[k]‖ ≤ ‖v q‖ (since 0 ≤ Q_i[k] ≤ q_i v from (27)) and 0 ≤ J[k] ≤ v(β_1 + β_2) (from (48)). Dividing by E{T̄[k_0; m]} ≥ t_min and using the constants d_1, d_2 proves the result. □

4.2. No Penalty Constraints

A special and nontrivial case of Theorem 1 is when the only goal is to maximize time average reward R ¯ / T ¯ , with no additional Y i [ k ] processes (case n = 0 , q = 0 in Theorem 1). For this, let R ¯ [ m ] and T ¯ [ m ] be averages over tasks { 1 , , m } . Fix ϵ > 0 . The work in [2] showed that, in the absence of a priori knowledge of the probability distribution of the A [ k ] matrices, any algorithm that runs over tasks { 1 , 2 , 3 , , m } and achieves
\[ \frac{\mathbb{E}\{\bar{R}[m]\}}{\mathbb{E}\{\bar{T}[m]\}} \ge \theta^* - O(\epsilon) \]
must have m ≥ Ω(1/ϵ^2). That is, convergence time is necessarily Ω(1/ϵ^2). The work in [2] developed a Robbins–Monro iterative algorithm with a vanishing stepsize to achieve this optimal convergence time. Specifically, it deviates from θ* by an optimality gap O(1/√m) as the algorithm runs over tasks m ∈ {1, 2, 3, …}. However, the vanishing stepsize means that the algorithm cannot adapt to changes. The algorithm of the current paper achieves the optimal convergence time using a different technique. The parameter v can be interpreted as an inverse stepsize parameter, so the stepsize is a constant ϵ = 1/v. With this constant stepsize, the algorithm is adaptive and achieves reward per unit time within O(ϵ) of optimality over any consecutive block of Θ(1/ϵ^2) tasks for which the {A[k]} matrices have i.i.d. behavior, regardless of the distribution before the start of the block.
The value α in Theorem 1 can be fine-tuned. Using v = 1/ϵ and q = 0 in (53) gives
\[ \frac{\mathbb{E}\{\bar{R}[k_0; m]\}}{\mathbb{E}\{\bar{T}[k_0; m]\}} \ge \theta^* - \epsilon\, d_1 - \frac{d_2}{\epsilon m} - \frac{r_{max}/t_{min}}{m} \quad \forall m \in \{1, 2, 3, \ldots\} \tag{60} \]
where
\[ d_1 = \frac{b + \frac{1}{2\gamma_{min}^2\alpha}\big( r_{max} + (t_{max}-t_{min})(\beta_1+\beta_2) \big)^2}{t_{min}}, \qquad d_2 = \frac{\frac{1}{2}(\beta_1+\beta_2)^2 + \frac{\alpha}{2}(\gamma_{max}-\gamma_{min})^2 + \theta^*(\beta_1+\beta_2)}{t_{min}} \]
where
\[ \beta_1 + \beta_2 = \frac{1 + r_{max} + \alpha\big(1/t_{min} - 1/t_{max}\big)(t_{max}-t_{min})}{t_{min}} \]
which uses γ_min = 1/t_max, γ_max = 1/t_min, and the definitions of β_1, β_2 in (49), (50). The above expression for β_1 + β_2 ignores the pesky ceiling operation in (50), as we are merely trying to right-size the α constant (formally, one can assume v = 1/ϵ is chosen to make the quantity inside the ceiling operation an integer). The term ϵ d_1 in (60) does not vanish as m → ∞. Choosing α to minimize d_1 amounts to minimizing
\[ \frac{1}{\alpha}\left( r_{max} + \frac{(t_{max}-t_{min})(1+r_{max})}{t_{min}} + \alpha\,\frac{(t_{max}-t_{min})^2}{t_{min}}\Big(\frac{1}{t_{min}} - \frac{1}{t_{max}}\Big) \right)^2 \]
That is, choose α > 0 to minimize (c_1/√α) + c_2 √α, where
\[ c_1 = r_{max} + \frac{(t_{max}-t_{min})(1+r_{max})}{t_{min}} \tag{61} \]
\[ c_2 = \frac{(t_{max}-t_{min})^2}{t_{min}}\left(\frac{1}{t_{min}} - \frac{1}{t_{max}}\right) \tag{62} \]
This yields α = c_1/c_2. To avoid a very large value of α (which affects the d_2 constant) in the special case c_2 < 1/2, one might adjust this to α = c_1/max[c_2, 1/2].
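A small Python sketch of this tuning rule, evaluated with the System 1 bounds t_min = 1, t_max = 10, r_max = 500 from Section 6.1:

```python
def tuned_alpha(t_min, t_max, r_max):
    c1 = r_max + (t_max - t_min) * (1 + r_max) / t_min              # (61)
    c2 = ((t_max - t_min) ** 2 / t_min) * (1 / t_min - 1 / t_max)   # (62)
    return c1 / max(c2, 0.5)

print(tuned_alpha(t_min=1.0, t_max=10.0, r_max=500.0))
```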

5. Constraints

This section considers the process Y [ k ] R n (for n > 0 ). Fix ϵ > 0 . By choosing q i = Θ ( 1 ) and v = Θ ( 1 / ϵ ) , inequality (30) already shows for all i { 1 , , n } that the time average of Y i [ k ] starting at any time k 0 satisfies
\[ \frac{1}{m}\sum_{k=k_0}^{k_0+m-1} Y_i[k] \le \Theta(\epsilon) + \frac{1}{m}\sum_{k=k_0}^{k_0+m-1} 1_i[k]\, y_{i,max} \quad \forall m \ge 1/\epsilon^2 \tag{63} \]
The next result uses the Slater condition (19) to show events { 1 i [ k ] = 1 } are rare over any block of 1 / ϵ 2 tasks (regardless of history before the block).
Theorem 2.
Assume the Slater condition (19) holds for some s > 0 and vector ( t s , r s , y s ) Γ ¯ . Fix ϵ > 0 . Define
\[ d_0 = 2 r_{max} + 2(\beta_1+\beta_2)(t_{max}-t_{min}) + c^2\epsilon \tag{64} \]
where β_1, β_2 are defined in (49), (50). Fix q_1 ≥ 2 d_0/s and fix q_i = q_1 for i ∈ {1, …, n}. Assume v = max{1/ϵ, 3s^2/(4 d_0)}. For all i ∈ {1, …, n}, all k_0 ∈ {1, 2, 3, …}, and all m ≥ ⌈4 v q_1 √n / s⌉ + v^2, we have (with probability 1):
\[ \frac{1}{m}\sum_{k=k_0}^{k_0+m-1} \mathbb{E}\{ Y_i[k] \mid H[k_0] \} \le O(\epsilon) \]

Proof of Theorem 2

To prove Theorem 2, define Z [ k ] = | | Q [ k ] | | for k { 1 , 2 , 3 , } . For each k, Z [ k ] is determined by history H [ k ] , meaning Z [ k ] is σ ( H [ k ] ) -measurable. We first prove a lemma.
Lemma 7.
Assume the Slater condition (19) holds for some s > 0 and vector ( t s , r s , y s ) Γ ¯ . For all k { 1 , 2 , 3 , } our algorithm gives
\[ \big| Z[k+1] - Z[k] \big| \le c \tag{65} \]
\[ \mathbb{E}\{ Z[k+1] - Z[k] \mid H[k] \} \le \begin{cases} c & \text{if } Z[k] < \lambda \\ -s/2 & \text{if } Z[k] \ge \lambda \end{cases} \tag{66} \]
where the final inequality holds almost surely; c is defined in (14); d_0 is defined in (64); λ is defined by
\[ \lambda = \max\Big\{ \frac{v d_0}{s} - \frac{s}{4},\ \frac{s}{2} \Big\} \tag{67} \]
Proof. 
Fix k { 1 , 2 , 3 , } . To prove (65), we have
\[ Z[k+1] = \|Q[k+1]\| \overset{(a)}{\le} \|Q[k] + Y[k]\| \le \|Q[k]\| + \|Y[k]\| \le Z[k] + c \tag{68} \]
where inequality (a) holds by the queue update (27) and the nonexpansion property of projections; the final inequality uses ‖Y[k]‖ ≤ c from (14). Similarly,
\[ Z[k+1] = \|Q[k+1]\| \ge \|Q[k]\| - \|Q[k+1] - Q[k]\| \overset{(a)}{=} \|Q[k]\| - \sqrt{\sum_{i=1}^n \Big( \big[Q_i[k]+Y_i[k]\big]_0^{q_i v} - \big[Q_i[k]\big]_0^{q_i v} \Big)^2} \overset{(b)}{\ge} \|Q[k]\| - \sqrt{\sum_{i=1}^n \big( Q_i[k]+Y_i[k] - Q_i[k] \big)^2} = \|Q[k]\| - \|Y[k]\| \overset{(c)}{\ge} Z[k] - c \tag{69} \]
where (a) holds by substituting the definition of Q_i[k+1] from (27) and the fact Q_i[k] = [Q_i[k]]_0^{q_i v}; (b) holds by the nonexpansion property of projections; (c) holds because ‖Y[k]‖ ≤ c from (14). Inequalities (68) and (69) together prove (65).
We now prove (66). The case Z[k] < λ follows immediately from (65). It suffices to consider Z[k] ≥ λ. The queue update (27) ensures
\[ \|Q[k+1]\|^2 \le \|Q[k]\|^2 + c^2 + 2\, Q[k]\cdot Y[k] \tag{70} \]
We know (t^s, r^s, y^s) ∈ Γ̄. For simplicity, assume (t^s, r^s, y^s) ∈ Γ (else we use a limiting argument over points in Γ that approach (t^s, r^s, y^s)). Lemma 1 ensures the existence of (T*[k], R*[k], Y*[k]) ∈ Row(A[k]) that satisfies (with prob 1):
\[ \mathbb{E}\{ (T^*[k], R^*[k], Y^*[k]) \mid H[k] \} = (t^s, r^s, y^s) \tag{71} \]
By (39), we have
\[ -vR[k] + J[k]\,T[k] + Q[k]\cdot Y[k] \le -vR^*[k] + J[k]\,T^*[k] + Q[k]\cdot Y^*[k] \]
Multiplying the above inequality by 2 and rearranging terms gives
\[ 2\,Q[k]\cdot Y[k] \le 2v\big(R[k] - R^*[k]\big) + 2J[k]\big(T^*[k] - T[k]\big) + 2\,Q[k]\cdot Y^*[k] \le 2v\, r_{max} + 2v(\beta_1+\beta_2)(t_{max}-t_{min}) + 2\,Q[k]\cdot Y^*[k] \]
where the final inequality uses J[k] ≤ v(β_1 + β_2) (from Lemma 6). Substituting this into the right-hand side of (70) gives
\[ \|Q[k+1]\|^2 \le \|Q[k]\|^2 + c^2 + 2v\, r_{max} + 2v(\beta_1+\beta_2)(t_{max}-t_{min}) + 2\,Q[k]\cdot Y^*[k] \le \|Q[k]\|^2 + v\, d_0 + 2\,Q[k]\cdot Y^*[k] = Z[k]^2 + v\, d_0 + 2\,Q[k]\cdot Y^*[k] \]
where d 0 is defined in (64). Taking conditional expectations of both sides and using (71) gives (with prob 1)
\[ \mathbb{E}\{ \|Q[k+1]\|^2 \mid H[k] \} \le Z[k]^2 + v\, d_0 + 2\,Q[k]\cdot y^s \overset{(a)}{\le} Z[k]^2 + v\, d_0 - 2s\sum_{i=1}^n Q_i[k] \overset{(b)}{\le} Z[k]^2 + v\, d_0 - 2s\, Z[k] \]
where inequality (a) holds by (19); inequality (b) holds by the triangle inequality
\[ Z[k] = \|Q[k]\| \le \sum_{i=1}^n Q_i[k] \]
Jensen’s inequality and the definition Z[k+1] = ‖Q[k+1]‖ give
\[ \big( \mathbb{E}\{ Z[k+1] \mid H[k] \} \big)^2 \le \mathbb{E}\{ \|Q[k+1]\|^2 \mid H[k] \} \]
Substituting this into the previous inequality gives
\[ \big( \mathbb{E}\{ Z[k+1] \mid H[k] \} \big)^2 \le Z[k]^2 + v d_0 - 2s Z[k] = \big(Z[k] - s/2\big)^2 - s^2/4 + v d_0 - s Z[k] \overset{(a)}{\le} \big(Z[k] - s/2\big)^2 - s^2/4 + v d_0 - s\lambda \overset{(b)}{\le} \big(Z[k] - s/2\big)^2 \]
where (a) holds because we assume Z[k] ≥ λ; (b) holds because λ ≥ v d_0/s − s/4 by definition of λ in (67). The definition of λ also implies λ ≥ s/2. Since Z[k] ≥ λ ≥ s/2, we can take square roots to obtain E{Z[k+1] | H[k]} ≤ Z[k] − s/2. □
Lemma 7 is in the exact form required of Lemma 4 in [28], so we immediately obtain the following corollary:
Corollary 1.
Assume the Slater condition (19) holds for some s > 0 and vector (t^s, r^s, y^s) ∈ Γ̄. Then, for all k_0 ∈ {1, 2, 3, …} and all z_0 ∈ [0, v q_1 √n], given Z[k_0] = z_0, our algorithm gives (with probability 1):
\[ \mathbb{E}\{ e^{\eta Z[k]} \mid H[k_0] \} \le d + \big( e^{\eta z_0} - d \big)\rho^{\,k-k_0} \quad \forall k \in \{k_0, k_0+1, k_0+2, \ldots\} \tag{72} \]
where
\[ \eta = \frac{s/2}{c^2 + c s/6} \tag{73} \]
\[ \rho = 1 - \frac{\eta s}{4} \tag{74} \]
\[ d = \frac{\big( e^{\eta c} - \rho \big)\, e^{\eta\lambda}}{1 - \rho} \tag{75} \]
where λ is given in (67). Further, it holds that s/2 ≤ c, e^{ηc} ≥ ρ, and 0 < ρ < 1.
Proof. 
This follows by applying Lemma 4 in [28] to the result of Lemma 7. □
We now use Corollary 1 to prove Theorem 2.
Proof. 
(Theorem 2) Fix i ∈ {1, …, n} and k_0 ∈ {1, 2, 3, …}. Define
\[ k_1 = \Big\lceil \frac{4 v q_1 \sqrt{n}}{s} \Big\rceil \tag{76} \]
Fix m ≥ k_1 + v^2. Since each of the first k_1 terms 1_i[k] y_{i,max} is at most y_{i,max}, and k_1/m ≤ k_1/v^2 = O(ϵ), it follows from (63) that it suffices to show
\[ \frac{1}{m}\sum_{k=k_0+k_1}^{k_0+m-1} \mathbb{E}\{ 1_i[k] \mid H[k_0] \} \le O(\epsilon) \tag{77} \]
To this end, we have, by definition of 1 i [ k ] in (29),
\[ e^{\eta(q_i v - y_{i,max})}\, 1_i[k] \le e^{\eta Q_i[k]} \le e^{\eta Z[k]} \]
Define z_0 = Z[k_0]. Taking conditional expectations and using (72) gives, for all k ≥ k_0 + k_1,
\[ e^{\eta(q_i v - y_{i,max})}\, \mathbb{E}\{ 1_i[k] \mid H[k_0] \} \le d + \big( e^{\eta z_0} - d \big)\rho^{\,k-k_0} \le d + e^{\eta z_0}\rho^{\,k-k_0} \]
Summing over the (fewer than m) terms and dividing by m gives
\[ e^{\eta(q_i v - y_{i,max})}\, \frac{1}{m}\sum_{k=k_0+k_1}^{k_0+m-1} \mathbb{E}\{ 1_i[k] \mid H[k_0] \} \le d + \frac{e^{\eta z_0}\rho^{k_1}}{m}\sum_{k=k_0+k_1}^{k_0+m-1} \rho^{\,k-k_1-k_0} \le d + \frac{e^{\eta z_0}\rho^{k_1}}{m}\cdot\frac{1}{1-\rho} \]
Recall that q_i = q_1 for all i. To show (77), it suffices to show
\[ d\, e^{-\eta(q_1 v - y_{i,max})} \le O(\epsilon) \tag{78} \]
\[ \frac{e^{-\eta(q_1 v - y_{i,max})}\, e^{\eta z_0}\rho^{k_1}}{(1-\rho)\, m} \le O(\epsilon) \tag{79} \]
We find these are much smaller than O(ϵ). By assumption, v ≥ 3s^2/(4 d_0), and so from (67)
\[ \lambda = \frac{v d_0}{s} - \frac{s}{4} \tag{80} \]
By definition of d in (75):
\[ d\, e^{-\eta(q_1 v - y_{i,max})} = \frac{\big(e^{\eta c} - \rho\big) e^{\eta\lambda}}{1-\rho}\, e^{\eta y_{i,max}}\, e^{-\eta q_1 v} \overset{(a)}{\le} \frac{1}{1-\rho}\, e^{\eta( c + \lambda + y_{i,max} - q_1 v )} \overset{(b)}{=} \frac{1}{1-\rho}\, e^{\eta( c + v d_0/s - s/4 + y_{i,max} - q_1 v )} = \frac{e^{\eta( y_{i,max} + c - s/4 )}}{1-\rho}\, e^{-\eta v ( q_1 - d_0/s )} \overset{(c)}{\le} \frac{e^{\eta( y_{i,max} + c - s/4 )}}{1-\rho}\, e^{-\eta v d_0/s} \overset{(d)}{\le} \frac{e^{\eta( y_{i,max} + c - s/4 )}}{1-\rho}\, e^{-(\eta d_0/s)/\epsilon} \overset{(e)}{\le} O\big( e^{-(\eta d_0/s)/\epsilon} \big) \le O(\epsilon) \]
where (a) holds because 0 < ρ < 1 (recall Corollary 1); (b) holds by (80); (c) holds because q_1 ≥ 2 d_0/s; (d) holds because v ≥ 1/ϵ; (e) holds because y_{i,max}, η, c, s, ρ are all Θ(1) constants that do not scale with ϵ. The term goes to zero exponentially fast as ϵ → 0, much faster than O(ϵ). This proves (78).
To show (79), we have
\[ \frac{e^{-\eta(q_1 v - y_{i,max})}\, e^{\eta z_0}\rho^{k_1}}{(1-\rho)\, m} \le \frac{e^{\eta y_{i,max}}}{(1-\rho)\, m}\, e^{\eta z_0}\rho^{k_1} = \frac{e^{\eta y_{i,max}}}{(1-\rho)\, m}\, e^{\eta z_0 + k_1\log(\rho)} \tag{81} \]
By definition of ρ in (74), we have
\[ k_1\log(\rho) = k_1\log\big(1 - \eta s/4\big) \le -k_1\,\eta s/4 \]
which uses the fact log(1+x) ≤ x for all x > −1 (recall from Corollary 1 that ηs/4 < 1). Adding ηz_0 to both sides gives
\[ \eta z_0 + k_1\log(\rho) \le \eta\big( z_0 - k_1 s/4 \big) \overset{(a)}{\le} \eta\big( v q_1\sqrt{n} - k_1 s/4 \big) \overset{(b)}{\le} 0 \]
where (a) holds because z_0 ≤ v q_1 √n; (b) holds by definition of k_1 in (76). Substituting the above inequality into (81) gives
\[ \frac{e^{-\eta(q_1 v - y_{i,max})}\, e^{\eta z_0}\rho^{k_1}}{(1-\rho)\, m} \le \frac{e^{\eta y_{i,max}}}{(1-\rho)\, m} \le \frac{e^{\eta y_{i,max}}}{(1-\rho)(k_1 + v^2)} \le O(\epsilon^2) \le O(\epsilon) \]
where we used v ≥ 1/ϵ. □

6. Simulation

All simulations are conducted with Matlab R2023b Update 6 (23.2.0.2485118).

6.1. System 1

This subsection considers the sequential project selection problem from Section 2.3 in [2]. The A [ k ] matrices have a random number of rows. Each row represents a project option and has two columns: The duration of time T [ k ] for the project and its corresponding reward R [ k ] . The goal is to simply maximize the time average reward per unit time (so n = 0 , and there are no additional penalty constraints). As explained in [2], the greedy policy of always choosing the row that maximizes the instantaneous reward/time ratio R [ k ] / T [ k ] is not necessarily optimal. The optimal row decision is not obvious and it depends on the (unknown) distribution of A [ k ] . Two different distributions for A [ k ] are considered in the simulations (specified at the end of this subsection). Both distributions have t m i n = 1 , t m a x = 10 , r m a x = 500 .
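For reference, the greedy baseline used throughout this subsection can be sketched as follows (assuming each task matrix is supplied as a list of (T, R) pairs):

```python
def greedy_row(A_k):
    """A_k: list of (T, R) pairs; pick the row with the largest instantaneous ratio R/T.
    As noted above, this myopic rule is not necessarily optimal for the time-average ratio."""
    return max(A_k, key=lambda row: row[1] / row[0])
```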
Figure 2 illustrates results for a simulation over 10 4 tasks using i.i.d. { A [ k ] } with Distribution 1. The vertical axis in Figure 2 represents the accumulated reward per task starting with task 1 and running up to the current task k:
\[ \frac{\sum_{j=1}^{k} \mathbb{E}\{R[j]\}}{\sum_{j=1}^{k} \mathbb{E}\{T[j]\}} \]
where the expectations E{R[j]} and E{T[j]} are approximated by averaging over 40 independent simulation runs. Figure 2 compares the greedy algorithm of always choosing the row of A[k] that maximizes R[k]/T[k]; the (nonadaptive) Robbins–Monro algorithm from [2] that uses a stepsize η[k] = 1/(k+1); and the proposed adaptive algorithm for the cases v = 1, v = 2, v = 10 (using α = c_1/max[c_2, 1/2]). The dashed horizontal line in Figure 2 is the optimal θ* value corresponding to Distribution 1. The value θ* is difficult to calculate analytically, so we use an empirical value obtained by the final point on the Robbins–Monro curve.
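The plotted metric can be sketched as follows (assuming arrays R_runs, T_runs of shape num_runs × num_tasks collected from repeated simulations):

```python
import numpy as np

def accumulated_ratio(R_runs, T_runs):
    ER = np.cumsum(R_runs.mean(axis=0))   # approximates sum_{j<=k} E{R[j]}
    ET = np.cumsum(T_runs.mean(axis=0))   # approximates sum_{j<=k} E{T[j]}
    return ER / ET                        # one value per task k
```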
Figure 2. System 1 and Distribution 1: Accumulated reward per unit time for the proposed adaptive algorithm (with v { 1 , 2 , 10 } ), the vanishing-stepsize Robbins–Monro algorithm, and the greedy algorithm. All data points are averaged over 40 independent simulations. The dashed horizontal line is the optimal θ 1 * for Distribution 1.
It can be seen that the greedy algorithm has significantly worse performance compared with the others. The Robbins–Monro algorithm, which uses a vanishing stepsize, has the fastest convergence and the highest achieved reward per unit time. As predicted by our theorems, the proposed adaptive algorithm has a convergence time that gets slower as v is increased, with a corresponding tradeoff in accuracy, where accuracy relates to the proximity of the converged value to the optimal θ * . The case v = 1 converges quickly but has less accuracy. The cases v = 2 and v = 10 have accuracy that is competitive with Robbins–Monro.
Figure 3 illustrates the adaptation advantages of the proposed algorithm. Figure 3 considers simulations over $2 \times 10^4$ tasks. The first half of the simulation refers to tasks $\{1, \ldots, 10^4\}$; the second half refers to tasks $\{10^4 + 1, \ldots, 2 \times 10^4\}$. The $\{A[k]\}$ matrices in the first half are i.i.d. with Distribution 1; in the second half, they are i.i.d. with Distribution 2. The algorithms are not informed that a change occurs at the halfway mark; rather, they must adapt. The two dashed horizontal lines represent optimal $\theta^*$ values for Distribution 1 and Distribution 2. Data in Figure 3 are plotted as a moving average with a window of the past 200 tasks (and averaged over 40 independent simulations). As seen in the figure, the adaptive algorithm (with $v = 10$) produces near-optimal performance that quickly adapts to the change. In stark contrast, the Robbins–Monro algorithm adapts very slowly to the change and takes roughly $(3/4) \times 10^4$ tasks to move close to optimality. The adaptation time of Robbins–Monro is much slower than its convergence time starting at task 1. This is due to the vanishing stepsize and the fact that, at the time of the distribution change, the stepsize is very small. Theoretically, the Robbins–Monro algorithm has an arbitrarily large adaptation time, as can be seen by imagining a simulation that uses a fixed distribution for a number of tasks $x$ before changing to another distribution: the stepsize at the time of change is $\eta[x] = 1/(x+1)$, hence an arbitrarily large value of $x$ yields an arbitrarily large adaptation time.
Figure 3. System 1: Testing adaptation over a simulation of $2 \times 10^4$ tasks with a distributional change introduced at the halfway point (task $10^4$). The two horizontal dashed lines represent optimal $\theta^*$ values for the two distributions. Each point for task $k$ is the result of a moving window average $\frac{\sum_{i=1}^{200}\mathbb{E}[R[k-i]]}{\sum_{i=1}^{200}\mathbb{E}[T[k-i]]}$, where expectations are obtained by averaging over 40 independent simulations. The adaptive algorithm (with $v = 10$) quickly adapts to the change. The Robbins–Monro algorithm takes a long time to adapt.
Figure 3 shows that the greedy algorithm adapts very quickly. This is because the greedy algorithm maximizes $R[k]/T[k]$ for each task $k$ without regard to history. Of course, the greedy algorithm is the least accurate and produces results that are significantly less than optimal for both distributions. To avoid clutter, the adaptive algorithm for the cases $v = 1$ and $v = 2$ is not plotted in Figure 3. Only the case $v = 10$ is shown because this case has the slowest adaptation but the highest accuracy (as compared with the $v = 1$ and $v = 2$ cases). While not shown in Figure 3, it was observed that the accuracy of the $v = 2$ case was only marginally worse than that of the $v = 10$ case (similar to Figure 2).
For the simulation of the proposed algorithm in the scenario of Figure 3, the virtual queue $J[k]$ was observed to have a maximum value of 661.0219 over the entire timeline, with a noticeable jump in the average value of $J[k]$ after the midway point of the simulation, as shown in Figure 4.
Figure 4. System 1: A sample path of the virtual queue $J[k]$ for the proposed algorithm for the same scenario as Figure 3. A change in distribution occurs at the halfway point in the simulation.
The two distributions in this subsection (and in Figure 2, Figure 3, and Figure 4) are as follows; a sampling sketch is given after the list:
  • Distribution 1: With $M[k]$ being the random number of rows, we use $P[M[k]=1] = 0.1$, $P[M[k]=2] = 0.6$, $P[M[k]=3] = 0.15$, $P[M[k]=4] = 0.15$. The first row is always $[T_1, R_1] = [1, 0]$ and represents a “vacation” option that lasts for one unit of time and has zero reward (as explained in [2], it can be optimal to take vacations a certain fraction of the time, even if there are other row options). The remaining rows $r$, if any, have parameters $[T_r, R_r]$ generated independently with $T_r \sim \mathrm{Unif}[1, 10]$ and $R_r = T_r G_r$, where $G_r \sim \mathrm{Unif}[0, 50]$ and is independent of $T_r$.
  • Distribution 2: We use $P[M[k]=1] = 0$, $P[M[k]=2] = 0.2$, $P[M[k]=3] = 0.4$, $P[M[k]=4] = 0.4$. The first row is always $[T_1, R_1] = [1, 0]$. The other rows $r$ are independently chosen as a random vector $[T_r, R_r]$ with $T_r \sim \mathrm{Unif}[1, 10]$ and $R_r = G_r T_r + H_r$, where $G_r, H_r$ are independent with $G_r \sim \mathrm{Unif}[10, 30]$ and $H_r \sim \mathrm{Unif}[0, 200]$.
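A possible Python rendering of these two samplers is given below; it merely mirrors the bullet descriptions above, and the function names are illustrative (the MATLAB code actually used for the figures is in the linked archive).

```python
import numpy as np

def sample_A_dist1(rng):
    """Draw a task matrix A[k] with rows [T, R] under Distribution 1."""
    M = rng.choice([1, 2, 3, 4], p=[0.1, 0.6, 0.15, 0.15])  # random number of rows
    rows = [[1.0, 0.0]]                                      # row 1: "vacation" option
    for _ in range(M - 1):
        T = rng.uniform(1.0, 10.0)
        G = rng.uniform(0.0, 50.0)
        rows.append([T, T * G])                              # R = T * G
    return np.array(rows)

def sample_A_dist2(rng):
    """Draw a task matrix A[k] with rows [T, R] under Distribution 2."""
    M = rng.choice([1, 2, 3, 4], p=[0.0, 0.2, 0.4, 0.4])
    rows = [[1.0, 0.0]]
    for _ in range(M - 1):
        T = rng.uniform(1.0, 10.0)
        G = rng.uniform(10.0, 30.0)
        H = rng.uniform(0.0, 200.0)
        rows.append([T, G * T + H])                          # R = G*T + H
    return np.array(rows)
```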

6.2. System 2

This subsection considers a device that processes computational tasks with the goal of maximizing time average profit subject to a time average power constraint of $p_{av} = 1/3$ energy per unit time. There is a penalty process $Y[k]$, so the Robbins–Monro algorithm of [2] cannot be used. For simplicity, we use $q_1 = \infty$ (as discussed in Section 7). $Q_1[k]$ was observed to be stable with a maximum size of less than 350 over all time. We compare the adaptive algorithm of the current paper to the drift-plus-penalty ratio method of [6]. The ratio of expectations from the main method in [6] requires knowledge of the probability distribution of $A[k]$. A heuristic is proposed in Section VI.B of [6] that uses a drift-plus-penalty minimization of $-v(R[k] - \theta[k-1]T[k]) + Q[k]Y[k]$, which has a simple per-task decision complexity, the same as that of the adaptive algorithm proposed in the current paper, and where $\theta[k]$ is defined as the running average:
$\theta[k] = \frac{\sum_{i=1}^{k} R[i]}{\sum_{i=1}^{k} T[i]}$
It is argued in Section VI.B of [6] that, if the heuristic converges, it converges to a point that is within $O(\epsilon)$ of optimality, where the parameter $v$ is chosen as $v = 1/\epsilon$. We call this heuristic “DPP with ratio averaging” in the simulations. (Another method described in [6] approximates the ratio of expectations using a window of $w$ past samples. The per-task decision complexity grows with $w$ and hence is larger than the complexity of the algorithm proposed in the current paper. For ease of implementation, we have not considered this method.) We also compare to a greedy method that removes any row $r$ of $A[k]$ that does not satisfy $\mathrm{Energy}_r[k]/T_r[k] \le 1/3$ and chooses from the remaining rows to maximize $R_r[k]/T_r[k]$.
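The per-task selection rule of the “DPP with ratio averaging” heuristic can be sketched as follows in Python. This is only an illustration of the rule described above (equivalently, maximizing $v(R_r - \theta[k-1]T_r) - Q[k]Y_r$ over the rows); the class name, initialization, and queue-update details are assumptions rather than a faithful reproduction of [6].

```python
import numpy as np

class DPPRatioAveraging:
    """Illustrative sketch of the heuristic described above: per task, pick the row
    maximizing v*(R_r - theta*T_r) - Q*Y_r, then update the running ratio theta and
    a standard virtual queue Q for the constraint mean(Y) <= 0."""
    def __init__(self, v):
        self.v = v
        self.Q = 0.0        # virtual queue for the time-average penalty constraint
        self.sum_R = 0.0
        self.sum_T = 0.0
        self.theta = 0.0    # running average of reward per unit time (assumed to start at 0)

    def choose_row(self, A):
        # A has columns [T_r, R_r, Y_r], where Y_r = Energy_r - (1/3)*T_r.
        T, R, Y = A[:, 0], A[:, 1], A[:, 2]
        score = self.v * (R - self.theta * T) - self.Q * Y
        return int(np.argmax(score))

    def update(self, T, R, Y):
        self.sum_R += R
        self.sum_T += T
        self.theta = self.sum_R / self.sum_T
        self.Q = max(self.Q + Y, 0.0)   # assumed virtual-queue update
```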
The i.i.d. matrices $\{A[k]\}$ have three columns and three rows of the form:
$A[k] = \begin{bmatrix} 1 & 0 & 0 \\ T_2[k] & R_2[k] & Y_2[k] \\ T_3[k] & R_3[k] & Y_3[k] \end{bmatrix}$
where $Y_r[k] = \mathrm{Energy}_r[k] - (1/3)T_r[k]$ for $r \in \{2, 3\}$. The first row corresponds to ignoring task $k$ and remaining idle for 1 unit of time, earning no reward but using no energy, so $(T_1[k], R_1[k], Y_1[k]) = (1, 0, 0)$. The second row corresponds to processing task $k$ at the home device. The third row corresponds to outsourcing task $k$ to a cloud device. Two distributions are considered (specified at the end of this subsection). Under Distribution A, the reward is the same for rows 2 and 3, but the energies and durations differ. Under Distribution B, the reward is higher for processing at the home device. Both distributions have $t_{\min} = 1$, $t_{\max} = 12$, $r_{\max} = 20$. We use $\alpha = c_1/\max[c_2, 1/2]$ for the adaptive algorithm. Under the distributions used, the greedy algorithm is never able to use row 2, can always use either row 1 or row 3, and always selects row 3.
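For concreteness, a Python sketch of the greedy baseline just described is given below, assuming the $[T_r, R_r, Y_r]$ column layout above (so the per-row power filter $\mathrm{Energy}_r[k]/T_r[k] \le 1/3$ is equivalent to $Y_r[k] \le 0$); the function name is illustrative.

```python
import numpy as np

def greedy_row_system2(A):
    """Greedy baseline for System 2: discard rows whose per-row power exceeds 1/3
    (equivalently, rows with Y_r = Energy_r - (1/3)*T_r > 0), then maximize R_r/T_r."""
    T, R, Y = A[:, 0], A[:, 1], A[:, 2]
    feasible = Y <= 0.0                       # Energy_r/T_r <= 1/3
    ratio = np.where(feasible, R / T, -np.inf)
    return int(np.argmax(ratio))
```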
Figure 5 and Figure 6 consider reward and power for simulations over 5000 tasks with i.i.d. $\{A[k]\}$ under Distribution A. Figure 5 plots the running average $\frac{\sum_{i=1}^{k}\mathbb{E}[R[i]]}{\sum_{i=1}^{k}\mathbb{E}[T[i]]}$, where expectations are obtained by averaging over 40 independent simulations. The horizontal asymptote illustrates the optimal $\theta^*$ as obtained by simulation. The simulation uses $v = 50$ for DPP with ratio averaging because this was sufficient for an accurate approximation of $\theta^*$, as seen in Figure 5. The proposed adaptive algorithm is considered for $v = 10$, $v = 50$, and $v = 200$. As predicted by our theorems, the converged reward is closer to $\theta^*$ as $v$ is increased (the case $v = 200$ is competitive with DPP with ratio averaging). Figure 6 plots the corresponding running average $\frac{\sum_{i=1}^{k}\mathbb{E}[\mathrm{Energy}[i]]}{\sum_{i=1}^{k}\mathbb{E}[T[i]]}$. The disadvantage of choosing a large value of $v$ is the longer time required for the time-averaged power to converge to the horizontal asymptote $p_{av} = 1/3$. Figure 5 and Figure 6 show that the greedy algorithm has the worst reward per unit time and has average power significantly under the required constraint. This shows that, unlike the other algorithms, the greedy algorithm does not make intelligent decisions about using more power to improve its reward. Considering only the performance shown in Figure 5 and Figure 6, the DPP with ratio averaging heuristic demonstrates the best convergence times, likely because it uses only one virtual queue $Q[k]$ while our adaptive algorithm uses both $Q[k]$ and $J[k]$. It is interesting to note that the adaptive algorithms and the DPP with ratio averaging heuristic both choose row 1 (idle) a significant fraction of the time. This is because, when a task has a small reward but a long duration, it is better to discard the task and idle for a short amount of time in order to see a new task with a hopefully larger reward.
Figure 5. Time average reward up to task $k$ for the proposed adaptive algorithm ($v \in \{10, 50, 200\}$); the DPP algorithm with ratio averaging; the greedy algorithm. The dashed horizontal line is the value $\theta_A^*$.
Figure 6. Corresponding time-averaged power for the simulations of Figure 5. The horizontal asymptote is $p_{av} = 1/3$. The greedy algorithm falls too far under the $p_{av}$ constraint: it does not know how to use more power to increase its time average reward.
Significant adaptation advantages of our proposed algorithm are illustrated in Figure 7. Performance is plotted as a moving average with window size $w = 200$, averaged over 100 independent simulations. The first half of the simulation uses i.i.d. $\{A[k]\}$ with Distribution A; the second half uses Distribution B. The two horizontal asymptotes in Figure 7 are the optimal $\theta_A^*$ and $\theta_B^*$ values for Distributions A and B. As seen in the figure, both the adaptive algorithm and the DPP with ratio averaging heuristic quickly converge to the optimal $\theta_A^*$ value associated with Distribution A (the rewards under the adaptive algorithm are slightly less than those of the heuristic). At the time of change, the adaptive algorithm has a spike that lasts for roughly 2000 tasks until it settles down to $\theta_B^*$. This can be viewed as the adaptation time and can be decreased by decreasing the value of $v$ (at a corresponding cost in accuracy). It converges to $\theta_B^*$ from above because, as seen in Figure 7, the spike marks a period of using more power than the required amount. In contrast, the DPP with ratio averaging algorithm cannot adapt and never increases to the optimal value $\theta_B^*$.
Figure 7. Adaptation for (a) reward and (b) power when the distribution is changed halfway through the simulation. Horizontal asymptotes are $\theta_A^*$ and $\theta_B^*$ for Distributions A and B. The adaptive algorithm settles into the new optimality point $\theta_B^*$, while DPP with ratio averaging cannot adapt.
The distributions used are as follows (a sampling sketch follows the list): for each task $k$, two independent random variables $U_1[k], U_2[k] \sim \mathrm{Unif}[0, 1]$ are generated. Then,
  • Distribution A: Note that $R_2[k] = R_3[k]$ and $T_2[k] < T_3[k]$ always.
    $(T_2[k], R_2[k], \mathrm{Energy}_2[k]) = (1 + 9U_1[k],\ 10U_1[k](U_2[k]+1),\ 1 + 9U_1[k])$
    $(T_3[k], R_3[k], \mathrm{Energy}_3[k]) = (6 + 6U_1[k],\ 10U_1[k](U_2[k]+1),\ U_1[k])$
  • Distribution B: The $R_2[k]$ value is increased in comparison to Distribution A.
    $(T_2[k], R_2[k], \mathrm{Energy}_2[k]) = (1 + 9U_1[k],\ \min\{20(U_2[k]+1), 20\},\ 1 + 9U_1[k])$
    $(T_3[k], R_3[k], \mathrm{Energy}_3[k]) = (6 + 6U_1[k],\ 10U_1[k](U_2[k]+1),\ U_1[k])$
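A corresponding Python sketch of the System 2 task generator is shown below. It only mirrors the bullet descriptions above, with $Y_r$ computed from the energy values; the names and structure are illustrative.

```python
import numpy as np

P_AV = 1.0 / 3.0   # time average power constraint (energy per unit time)

def sample_A_system2(rng, dist="A"):
    """Draw a System 2 task matrix with columns [T_r, R_r, Y_r], where
    Y_r = Energy_r - (1/3)*T_r and row 1 is the idle option (1, 0, 0)."""
    U1, U2 = rng.uniform(0.0, 1.0, size=2)
    T2, E2 = 1.0 + 9.0 * U1, 1.0 + 9.0 * U1        # home device: duration, energy
    T3, E3 = 6.0 + 6.0 * U1, U1                    # cloud device: duration, energy
    R3 = 10.0 * U1 * (U2 + 1.0)
    R2 = R3 if dist == "A" else min(20.0 * (U2 + 1.0), 20.0)   # Distribution B raises R2
    rows = [(1.0, 0.0, 0.0),
            (T2, R2, E2 - P_AV * T2),
            (T3, R3, E3 - P_AV * T3)]
    return np.array(rows)
```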

6.3. Weight Adjustment

While the $\Theta(1/\epsilon^2)$ adaptation times achieved by the proposed algorithm are asymptotically optimal, an important question is whether the constant factor can be improved. Specifically, this subsection attempts to reduce the 2000-task adaptation time seen in the spikes of Figure 7 and Figure 8 without degrading accuracy. We observe that the virtual queues $J[k]$ and $Q[k]$ are weighted equally in the Lyapunov function $L[k] = \frac{1}{2}J[k]^2 + \frac{1}{2}Q[k]^2$. More weight can be placed on the penalty $Y[k]$ to emphasize the average power constraint and thereby reduce the spike in Figure 8b. This can be performed with no change in the mathematical analysis by redefining the penalty as $Y'[k] = wY[k]$ for some constant $w > 0$. The constraint $\overline{Y'} \le 0$ is the same as $\overline{Y} \le 0$. We use $w = 2$ and also double the $v$ parameter from 50 to 100, which maintains the same relative weight between the reward and $Q[k]$ but deemphasizes the $J[k]$ virtual queue by a factor of 2.
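As an illustration, the reweighting can be implemented as a thin wrapper that rescales the penalty column of each task matrix before it is passed to the (unchanged) adaptive algorithm; the $[T, R, Y]$ column layout and the helper name below are assumptions of the sketch.

```python
import numpy as np

def reweight_penalty(A, w=2.0):
    """Scale the penalty column: Y'[k] = w * Y[k]. The constraint mean(Y') <= 0 is
    equivalent to mean(Y) <= 0, but the penalty is emphasized relative to J[k]."""
    A = A.copy()
    A[:, 2] *= w
    return A

# Reweighted variant used in Figure 8 (illustrative): w = 2 with v doubled from 50 to 100,
# which keeps the reward-to-Q[k] weighting fixed while deemphasizing J[k] by a factor of 2.
```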
Figure 8. Comparing (a) reward and (b) power of the reweighted adaptive scheme when the distribution changes twice. Parameters for the adaptive and DPP with ratio averaging algorithms are the same as Figure 7.
Figure 8 plots performance over $3 \times 10^4$ tasks with Distribution A in the first third, Distribution B in the second third, and Distribution A in the final third. The adaptive algorithm and the DPP with ratio averaging algorithm use the same parameters as in Figure 7. The reweighted adaptive algorithm uses $v = 100$ and $Y'[k] = 2Y[k]$. It can be seen that the reweighting decreases adaptation time with no noticeable change in accuracy. This illustrates the benefits of weighting the power penalty $Y[k]$ more heavily than the virtual queue $J[k]$.
The simulations in Figure 8 further show that the proposed adaptive algorithms can effectively handle multiple distributional changes. Indeed, the reward settles close to the new optimality point after each change in distribution. In contrast, the DPP with ratio averaging algorithm, which was not designed to be adaptive, appears completely lost after the first distribution change and never recovers. This emphasizes the importance of adaptive algorithms.
The values of the virtual queue $Q[k]$ for the proposed algorithm and its reweighted variant are shown in Figure 9. This illustrates that using $q_1 = \infty$, which puts no deterministic upper bound on $Q[k]$, does not adversely affect performance, as discussed in Section 7. The virtual queue $J[k]$ is not shown, but its behavior is similar, and its maximum value over all time was observed to be less than 400.
Figure 9. System 2: A sample path of $Q[k]$ for the proposed adaptive algorithm and its reweighted variant for the same scenario as Figure 8.

7. Discussion

The proposed algorithm can be run indefinitely over an infinite sequence of tasks. The analysis holds for any block of $m$ tasks over which $\{A[k]\}$ is i.i.d., regardless of the sample path history before the block. A useful mathematical model that fits our analysis is one where $\{A[k]\}$ evolves over disjoint and variable-length blocks of time, during each of which its behavior is i.i.d. with a fixed but unknown distribution, and where the start and end times of each block are unknown. In this scenario, the algorithm adapts and comes within $\Theta(\epsilon)$ of optimality for every block that lasts for at least $1/\epsilon^2$ tasks.
What happens if { A [ k ] } is not i.i.d. over a block of interest? The good news is that our analysis provides worst-case bounds on the virtual queues that hold for all time and for any sample path. This means the algorithm maintains reasonable operational states. Of course, our optimality analysis uses the i.i.d. assumption. We conjecture that the algorithm also makes efficient decisions in more general situations where { A [ k ] } arises from an ergodic Markov chain. The analysis in that situation would be more complicated, and the adaptation times would depend on the mixing time of the Markov process. We leave such open questions for future work.
There are situations where { A [ k ] } evolves according to an ergodic Markov chain with very slow mixing times, slower than any reasonable timescale over which we want our algorithm to adapt. For example, one can imagine a 2-state Markov chain where A [ k ] is i.i.d. with one distribution in state 1, and i.i.d. with another distribution in state 2. If transition probabilities between states 1 and 2 are very small, then state transitions may occur on a timescale of hours (and ergodic mixing times may be on the order of days or weeks), while thousands of tasks are performed before each state transition. Each state transition starts a new block of tasks. Our algorithm adapts to each new distribution, without knowing the transition times, provided that transition times are separated by at least 1 / ϵ 2 tasks. In other words, our convergence analysis holds on the shorter timescale of the block, rather than the longer (and possibly irrelevant) mixing time of the underlying Markov chain.
When $n > 0$, the algorithm has a parameter $q_1 > 0$, where $q_1 = \Theta(1)$. Theorem 2 suggests $q_1 \ge 2d_0/s$. This requires knowledge of $d_0/s$. In practice, there is little danger in choosing $q_1$ to be too large. Even choosing $q_1 = \infty$ works well in practice (see the simulation section). Intuitively, this is because the virtual queue update (27) for $q_1 = \infty$ reduces to
$Q_i[k+1] = \max\{Q_i[k] + Y_i[k],\, 0\} \quad \forall k \in \{1, 2, 3, \ldots\}$
which means $1_i[k] = 0$ for all $k$, and the inequality (30) can be modified to
$\frac{1}{m}\sum_{k=k_0}^{k_0+m-1} Y_i[k] \;\le\; \frac{Q_i[k_0+m] - Q_i[k_0]}{m}$
for all positive integers $k_0, m$. Intuitively, the Slater condition still ensures that $\|Q[k]\|$ concentrates quickly and is still rarely much larger than the $\lambda$ parameter in Lemma 7, where $\lambda = \Theta(v)$. Intuitively, while the $J[k]$ virtual queue would no longer be deterministically bounded, it would stay within its existing bounds with high probability. Taking expectations of (82) would then produce a right-hand side proportional to $v/m$, which is $O(\epsilon)$ whenever $v = \Theta(1/\epsilon)$ and $m \ge 1/\epsilon^2$.
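For completeness, the modified inequality above follows by telescoping the update for $q_1 = \infty$: since $Q_i[k+1] = \max\{Q_i[k] + Y_i[k],\, 0\} \ge Q_i[k] + Y_i[k]$ for every $k$, summing over the block gives
$Q_i[k_0+m] \;\ge\; Q_i[k_0] + \sum_{k=k_0}^{k_0+m-1} Y_i[k],$
and dividing by $m$ and rearranging yields the stated bound.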

8. Conclusions

This paper gives an adaptive algorithm for renewal optimization, where the decision for each task $k$ determines the duration of the task, the reward for the task, and a vector of penalties. The algorithm has a low per-task decision complexity, operates without knowledge of the system probabilities, and is robust to changes in the probability distribution that occur at unknown times. A new hierarchical decision rule enables the algorithm to come within $\epsilon$ of optimality over any sequence of $\Theta(1/\epsilon^2)$ tasks over which the probability distribution is fixed, regardless of system history. This adaptation time matches a prior converse result showing that any algorithm that guarantees $\epsilon$-optimality during tasks $\{1, \ldots, m\}$ must have $m = \Omega(1/\epsilon^2)$, even if there are no additional penalty processes.

Funding

This work was supported in part by NSF SpecEES 1824418.

Data Availability Statement

The MATLAB simulations and figures in this study are openly available in the provided link: https://ee.usc.edu/stochastic-nets/docs/Adapt-Renewals-Neely2025-simulations.zip, accessed on 20 November 2025.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Neely, M.J. Stochastic Network Optimization with Application to Communication and Queueing Systems; Morgan & Claypool: San Rafael, CA, USA, 2010.
  2. Neely, M.J. Fast Learning for Renewal Optimization in Online Task Scheduling. J. Mach. Learn. Res. 2021, 22, 1–44.
  3. Fox, B. Markov Renewal Programming by Linear Fractional Programming. SIAM J. Appl. Math. 1966, 14, 1418–1432.
  4. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
  5. Neely, M.J. Online Fractional Programming for Markov Decision Systems. In Proceedings of the 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 28–30 September 2011.
  6. Neely, M.J. Dynamic Optimization and Learning for Renewal Systems. IEEE Trans. Autom. Control 2013, 58, 32–46.
  7. Tassiulas, L.; Ephremides, A. Dynamic Server Allocation to Parallel Queues with Randomly Varying Connectivity. IEEE Trans. Inf. Theory 1993, 39, 466–478.
  8. Tassiulas, L.; Ephremides, A. Stability Properties of Constrained Queueing Systems and Scheduling Policies for Maximum Throughput in Multihop Radio Networks. IEEE Trans. Autom. Control 1992, 37, 1936–1948.
  9. Wei, X.; Neely, M.J. Data Center Server Provision: Distributed Asynchronous Control for Coupled Renewal Systems. IEEE/ACM Trans. Netw. 2017, 25, 2180–2194.
  10. Wei, X.; Neely, M.J. Asynchronous Optimization over Weakly Coupled Renewal Systems. Stoch. Syst. 2018, 8, 167–191.
  11. Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407.
  12. Borkar, V.S. Stochastic Approximation: A Dynamical Systems Viewpoint; Springer: Berlin/Heidelberg, Germany, 2008.
  13. Nemirovski, A.; Yudin, D. Problem Complexity and Method Efficiency in Optimization; Wiley-Interscience Series in Discrete Mathematics; John Wiley: Hoboken, NJ, USA, 1983.
  14. Kushner, H.J.; Yin, G. Stochastic Approximation and Recursive Algorithms and Applications; Springer: Berlin/Heidelberg, Germany, 2003.
  15. Toulis, P.; Horel, T.; Airoldi, E.M. The Proximal Robbins–Monro Method. J. R. Stat. Soc. Ser. B Stat. Methodol. 2021, 83, 188–212.
  16. Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. Robust Stochastic Approximation Approach to Stochastic Programming. SIAM J. Optim. 2009, 19, 1574–1609.
  17. Joseph, V.R. Efficient Robbins–Monro Procedure for Binary Data. Biometrika 2004, 91, 461–470.
  18. Zinkevich, M. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), Washington, DC, USA, 21–24 August 2003.
  19. Huo, D.L.; Chen, Y.; Xie, Q. Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Stepsizes. In Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’23), New York, NY, USA, 19–23 June 2023; pp. 81–82.
  20. Luo, H.; Zhang, M.; Zhao, P. Adaptive Bandit Convex Optimization with Heterogeneous Curvature. Proc. Mach. Learn. Res. 2022, 178, 1–37.
  21. Van der Hoeven, D.; Cutkosky, A.; Luo, H. Comparator-Adaptive Convex Bandits. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ’20), Red Hook, NY, USA, 6–12 December 2020.
  22. Leyffer, S.; Menickelly, M.; Munson, T.; Vanaret, C.W.; Wild, S. A Survey of Nonlinear Robust Optimization. INFOR Inf. Syst. Oper. Res. 2020, 58, 342–373.
  23. Tang, J.; Fu, C.; Mi, C.; Liu, H. An Interval Sequential Linear Programming for Nonlinear Robust Optimization Problems. Appl. Math. Model. 2022, 107, 256–274.
  24. Badanidiyuru, A.; Kleinberg, R.; Slivkins, A. Bandits with Knapsacks. J. ACM 2018, 65, 1–55.
  25. Agrawal, S.; Devanur, N.R. Bandits with Concave Rewards and Convex Knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (EC ’14), New York, NY, USA, 8–12 June 2014; pp. 989–1006.
  26. Xia, Y.; Ding, W.; Zhang, X.; Yu, N.; Qin, T. Budgeted Bandit Problems with Continuous Random Costs. In Proceedings of the ACML, Hong Kong, 20–22 November 2015.
  27. Kallenberg, O. Foundations of Modern Probability, 2nd ed.; Probability and Its Applications; Springer: Berlin/Heidelberg, Germany, 2002.
  28. Neely, M.J. Energy-Aware Wireless Scheduling with Near Optimal Backlog and Convergence Time Tradeoffs. IEEE/ACM Trans. Netw. 2016, 24, 2223–2236.
