Abstract
This paper considers online optimization for a sequence of tasks. Each task can be processed in one of multiple processing modes that affect the duration of the task, the reward earned, and an additional vector of penalties (such as energy or cost). The parameters of task k are specified by a random matrix that is revealed at the start of the task. The goal is to observe this matrix at the start of task k and then choose a processing mode for the task so that, over time, the time average reward is maximized subject to time average penalty constraints. This is a renewal optimization problem. It is challenging because the probability distribution for the matrix sequence is unknown, so efficient decisions must be learned in a timely manner. Prior work shows that any algorithm that comes within $\epsilon$ of optimality must have $\Omega(1/\epsilon^2)$ convergence time. The only known algorithm that can meet this bound operates without time average penalty constraints and uses a diminishing stepsize that cannot adapt when probabilities change. This paper develops a new algorithm that is adaptive and comes within $O(\epsilon)$ of optimality over any interval of $O(1/\epsilon^2)$ tasks over which the probabilities are held fixed.
MSC:
65K10; 60K05; 93E35
1. Introduction
This paper considers online optimization for a system that performs a sequence of tasks. At the start of task k, a matrix of parameters for the task is revealed. The controller observes this matrix and then decides how the task should be processed. The decision, together with the observed matrix, determines the duration of the task, the reward earned by processing that task, and a vector of additional penalties. Specifically, define
where n is a fixed positive integer. The goal is to make decisions over time that maximize the time average reward subject to a collection of time average penalty constraints. This problem has a number of applications, including image classification, video processing, wireless multiple access, and transportation scheduling.
For example, consider a computational device that performs back-to-back image classification tasks with the goal of maximizing time average profit subject to a time average power constraint of and an average per-task quality constraint of . For each task, the device chooses from a collection of classification algorithms, each having a different duration of time and yielding certain profit, energy, and quality characteristics. The number of options for task k is equal to the number of rows of matrix . Numerical values for each option are specified in the corresponding row. Suppose the controller observes the following matrix for task 1:
Choosing a classification algorithm for task 1 reduces to choosing one of the three rows of . Suppose we choose row 2 (algorithm 2). Then, task 1 will have a duration of units of time, as illustrated by the width in the timeline of Figure 1. Also, the task will have profit, energy, and quality of , , and . If we had instead chosen row 1, then the task duration would be lower and profit would be higher, but the energy consumption would be higher and the task would have lower quality. It is not obvious what decision helps to maximize the time average profit subject to the power and quality constraints. Moreover, the matrices for future tasks can have different row sizes, each matrix is only revealed at the start of task k, and the probability distribution for is unknown.
Figure 1.
Four sequential tasks in the timeline. Vertical arrows for each task k represent values for reward and penalty vector . In this example, green is reward (profit), red is energy, blue is quality. Vector depends on choices made at the start of task k.
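For readers who prefer code, the following minimal sketch illustrates how a task's options can be represented as rows of (duration, profit, energy, quality) and how choosing a row fixes all four quantities. The numerical values are hypothetical and are not the values of the example matrix above.

```python
# Hypothetical options for one task: each row is one classification algorithm,
# with columns (duration, profit, energy, quality). Values are illustrative only.
options = [
    (2.0, 1.2, 3.0, 0.5),   # algorithm 1: fast, high profit, high energy, low quality
    (3.0, 0.9, 2.0, 0.7),   # algorithm 2: moderate in every category
    (5.0, 0.6, 1.0, 0.9),   # algorithm 3: slow, low energy, high quality
]

choice = 1  # pick row 2 (0-indexed), as in the example discussed above
duration, profit, energy, quality = options[choice]
print(f"duration={duration}, profit={profit}, energy={energy}, quality={quality}")
```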
The specific constrained optimization problem for this example is
where objective (2) represents the time average profit; (3) imposes the time average power constraint; (4) imposes the per-task average quality constraint. For simplicity of this example, we assume the limits exist. This problem is more concisely posed as maximizing subject to and , where , , , are per-task averages.
Problem (2)–(4) involves ratios of averages. This paper converts such problems into a canonical form of maximizing a ratio subject to penalty constraints for , where n is the number of constraints ( in example problem (2)–(4)). Each ratio constraint is converted to a single time average constraint by defining a suitable penalty process for . However, adaptive optimization of the ratio objective is nontrivial, and new techniques are required.
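As a concrete illustration of this conversion (written with generic symbols, since the paper's notation is not reproduced here), suppose the time average of a quantity e[k] divided by the time average of the task durations T[k] must be at most a constant β, and suppose the limits exist with T[k] bounded below by a positive constant. Multiplying through by the positive denominator gives an equivalent single time average constraint:

\[
\frac{\lim_{m\to\infty}\frac{1}{m}\sum_{k=1}^{m} e[k]}
     {\lim_{m\to\infty}\frac{1}{m}\sum_{k=1}^{m} T[k]} \le \beta
\quad\Longleftrightarrow\quad
\lim_{m\to\infty}\frac{1}{m}\sum_{k=1}^{m}\bigl(e[k]-\beta\,T[k]\bigr)\le 0,
\]

so defining the penalty process y[k] = e[k] − β T[k] puts the constraint in the canonical form.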
Optimality for problem (2)–(4) depends on the joint probability distribution for the size and entries of matrix . We first assume are independent and identically distributed (i.i.d.) as matrices over tasks . This yields a well-defined ergodic optimality for problem (2)–(4). However, the multi-dimensional distribution for the entries of is unknown, and the parameter space is enormous. Instead of attempting to learn the distribution, this paper uses techniques similar to the drift-plus-penalty method of [1] that acts on weighted functions of the observed random . Specifically, this paper develops a new optimization strategy that acts on a single timescale for real-time optimization of ratios of averages.
Given , how long does it take to come within of the infinite horizon optimality for problem (2)–(4)? This question is partially addressed in the prior work [2] for problems without time average penalty constraints. That prior work uses a Robbins–Monro algorithm with a vanishing stepsize. It cannot adapt if probability distributions change over time. The current paper develops a new algorithm that is adaptive. While infinite horizon optimality is defined by imagining as an infinite i.i.d. sequence, our algorithm can be analyzed over any finite block of m tasks for which i.i.d. behavior is assumed. Indeed, fix any algorithm parameter and consider any integer . Over any block of m tasks during which is i.i.d., the time average expected performance of our algorithm is within of infinite horizon optimality (as defined by the distribution for this block), regardless of the distribution or sample path behavior of before the block. When the algorithm is implemented over all time, it tracks the new optimality that results if probability distributions change multiple times in the timeline, without knowing when changes occur, provided that each new distribution lasts for a duration of at least tasks. This new achievability result matches a converse bound proven for unconstrained problems in [2].
1.1. Model
Fix n as a positive integer. At the start of each task the controller observes matrix with size , where is the random number of processing options for task k. Each row has the form
It is assumed that all rows have , where is some minimum task duration. Let be the vector of values for the row selected on task k, where . For each positive integer m, define by
Define and similarly. The infinite horizon problem is
where denotes the set of rows of matrix . Note that always, so there is no divide-by-zero issue.
Problem (5)–(7) is assumed to be feasible, meaning it is possible to satisfy the constraints (6) and (7). Let denote the optimal objective in (5). The formulation (5)–(7) has two differences in comparison with the introductory example (2)–(4): (i) all time average constraints are expressed as a single average (rather than a ratio); (ii) sample path limits are replaced by limits that use expectations. This is similar to the treatment in [2]. This preserves the infinite horizon value of and facilitates analysis of adaptation time. A decision policy is said to be an -approximation with convergence time d if
A decision policy is said to be an -approximation (with convergence time d) if all appearances of in the above definition are replaced by some constant multiple of .
1.2. Prior Work
The fractional structure of the objective (5) is similar in spirit to a linear fractional program. Linear fractional programs can be converted to linear programs using a nonlinear change in variables (see [3,4]). This conversion is used in [3] for offline design of policies for embedded Markov decision problems. Such methods cannot be directly used for our online problem because time averages are not preserved under nonlinear transformations. Related nonlinear transformations are used in [5] for offline control for opportunistic Markov decision problems where states include an process similar to the current paper. The work in [5] discusses how offline strategies can be leveraged for online use (such as with a two-timescale approach), although overall convergence time is unclear.
The problem (5)–(7) is posed in [6], where it is called a renewal optimization problem (see also Chapter 7 of [1]). The solution in [6] constructs virtual queues for each time average inequality constraint (6) and makes a decision for each task k to minimize a drift-plus-penalty ratio:
where is the change in a Lyapunov function on the virtual queues; v is a parameter that affects accuracy; is system history before task k. Exact minimization of (10) is impossible unless the distribution for is known. A method for approximately minimizing (10) by sampling over a window of previous tasks is developed in [6], although only a partial convergence analysis is given there. This prior work is based on the Lyapunov drift and max-weight scheduling methods developed for fixed timeslot queuing systems in [7,8]. Data center applications of renewal optimization are in [9]. Asynchronous renewal systems are treated in [10].
A different approach in [2] is used for a problem that seeks only to maximize time average reward (with no penalties ). For each task k, that method chooses to maximize , where is an estimate of that is updated according to a Robbins–Monro iteration:
where is a stepsize. See [11] for the original Robbins–Monro algorithm and [12,13,14,15,16,17] for extensions in other contexts. The approach (11) is desirable because it does not require sampling from a window of past values. Further, under a vanishing stepsize rule, the optimality gap decreases like , which is asymptotically optimal [2]. However, it is unclear how to extend this method to handle time average penalties . Further, while the vanishing stepsize enables fast convergence, it makes increasing investments in the probability model and cannot adapt if probabilities change. The work in [2] shows a fixed stepsize rule is better for adaptation but has a slower convergence time of .
Fixed stepsizes are known to enable adaptation in other contexts. For online convex optimization, it is shown in [18] that a fixed stepsize enables the time-averaged cost to be within of optimality (as compared with the best fixed decision in hindsight) over any sequence of steps. For adaptive estimation, work in [19] considers the problem of removing bias from Markov-based samples. The work in [19] develops an adaptive Robbins–Monro technique that averages between two fixed stepsizes. Adaptive algorithms are also of interest in convex bandit problems, see [20,21].
A different type of problem is nonlinear robust optimization that seeks to solve a deterministic nonlinear program involving a decision vector x and a vector of (unchosen) uncertain values u (see [22]). The robust design treats worst-case behavior with constraints such as for all , where is a (possibly infinite) set of possible values of u given the decision vector x. Iterative methods, such as those in [23], use gradients and linear approximations to sequentially calculate updates that get closer to minimizing a cost function subject to the robust constraint specifications.
Our problem is an opportunistic scheduling problem because the task matrix is revealed at the start of each task k (before a scheduling decision is made). While this matrix can be viewed as helpful side information, it is challenging to make optimal use of it. The policy space is huge: optimality depends on the full (and unknown) joint distribution of the entries of the matrix. A different class of problems, called vector-based bandit problems, has a simpler policy space with no such side information. There, the controller pulls an arm from a fixed set of m arms, each arm giving a vector-based reward with an unknown distribution that is the same each time it is pulled. In that context, optimality depends only on the mean rewards. Estimation of the means can be performed efficiently by exploring each arm according to various bandit techniques, such as the resource-constrained techniques in [24,25,26]. Such problems have a different learning structure that does not relate to the current paper.
1.3. Our Contributions
We develop an algorithm for renewal optimization that does not require probability information or sampling from the past. The algorithm has explicit convergence guarantees and meets the optimal asymptotic convergence time bound of [2]. Unlike the Robbins–Monro method in [2], our new algorithm allows for penalty constraints. Furthermore, our algorithm is adaptive and achieves performance within of optimality over any sequence of tasks. This fast adaptation is enabled by a new hierarchical decision structure.
2. Preliminaries
2.1. Notation
For we use and . In this paper, we use the following vectors and constants:
- matrix for task k (for row selection)
- duration of task k
- reward of task k
- vector of penalties for task k
- vector of virtual queues for penalty constraints
- virtual queue used to optimize reward
- weight emphasis on reward
- size upper bound parameter for virtual queue
- optimal time average reward
- upper bound on
- upper bound on
- bounds on ()
- bounds on
- ,
- Slater condition parameter
- constants in theorems (based on ).
2.2. Boundedness Assumptions
Assume the following bounds hold for all , all , and all choices of :
where , , , c, , are nonnegative constants (with ). Constraint (13) assumes all rewards are nonnegative. This is without loss of generality: If the system can have negative rewards in some bounded interval , where , we define a new nonnegative reward . The objective of maximizing is the same as the objective of maximizing .
2.3. The Sets and
Assume is a sequence of i.i.d. random matrices with an unknown distribution. (When appropriate, this is relaxed to assume i.i.d. behavior occurs only over a finite block of consecutive tasks.) For , define a decision vector as a random vector that satisfies
Let be the set of all expectations for a given task k, considering all possible decision vectors. The set depends on the (unknown) distribution of and considers all conditional probabilities for choosing a row given the observed . The matrices are i.i.d. and so is the same for all . It can be shown that is nonempty, bounded, and convex (see Section 4.11 in [1]). Its closure is compact and convex. Define the history up to task m as
where is defined to be the constant 0.
Lemma 1.
Proof.
By definition of , given any , there is a conditional distribution for choosing a row of (given the observed ) under which the (unconditional) expected value of the chosen row is . Let be a random variable that is independent of . Use to implement the randomized row selection (according to the desired conditional distribution) after is observed. Formally, the random row can be viewed as a Borel measurable function of . (See Proposition 5.13 in [27], where there plays the role of our which takes values in the Borel space ; plays the role of our ; plays the role of our ; plays the role of our ). Since is independent of history , the resulting random row is independent of (so with probability 1, its conditional expectation given is the same as its unconditional expectation). □
2.4. The Deterministic Problem
For analysis of our stochastic problem, it is useful to consider the closely related deterministic problem
where is the closure of . All points have , so there are no divide-by-zero issues. Using arguments similar to those in Section 4.11 in [1], it can be shown that (i) the stochastic problem (5)–(7) is feasible if and only if the deterministic problem (16)–(18) is feasible; (ii) if feasible, the optimal objective values are the same. Specifically, if solves (16)–(18), then
where is the optimal objective for both the deterministic problem (16)–(18) and the stochastic problem (5)–(7). The deterministic problem (16)–(18) seeks to maximize a continuous function over the compact set defined by constraints (17) and (18), so it has an optimal solution whenever it is feasible.
When (so there is at least one time average penalty constraint of the form (17)), we assume a Slater condition that is more stringent than mere feasibility: There is a value and a vector such that
3. Algorithm Development
3.1. Parameters and Constants
3.2. Intuition
For intuition, temporarily assume the following limits exist:
The idea is to make decisions for row selection (and selection) for each task k so that, over time, the following time-averaged problem is solved:
This is an informal description because the constraint “ varies slowly” is not precise. Intuitively, if does not change much from one task to the next, the above objective is close to , which (by the second constraint) is less than or equal to the desired objective . This is useful because, as we show, the above problem can be treated using a novel hierarchical optimization method.
3.3. Virtual Queues
To enforce the constraints , for each define a process with initial condition and update equation
where and are given nonnegative parameters (to be sized later); where denotes the projection of a real number z onto the interval :
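For concreteness (with generic interval endpoints a ≤ b, since the paper's own symbols are not reproduced here), this projection is the usual clipping operation:

\[
[z]_{a}^{b} = \min\bigl(\max(z,\,a),\,b\bigr).
\]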
To enforce the constraint , define a process by
with . The processes and shall be called virtual queues because their update resembles a queuing system with arrivals and service for each k. Such virtual queues can be viewed as time-varying Lagrange multipliers and are standard for enforcing time average inequality constraints (see [1]).
For each task k and each , define as
Lemma 2.
3.4. Lyapunov Drift
Define . Define
where . The value can be viewed as a Lyapunov function on the queue state for task k. Define
Proof.
The above lemma implies that
For each task , our hierarchical algorithm performs the following:
- Step 1: Choose to greedily minimize the right-hand side of (37) (ignoring the term that depends on ).
- Step 2: Treating as known constants, choose to minimize
3.5. Algorithm
Fix parameters with for (to be sized later). Initialize , , . For each task , perform the following:
- Row selection: Observe and treat these as given constants. Choose to minimize the following expression, breaking ties arbitrarily (such as by using the smallest indexed row).
- selection: Observe , and the decisions just made by the row selection, and treat these as given constants. Choose to minimize the following quadratic function. The explicit solution is given by a projection onto the interval .
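The per-task structure of the algorithm can be summarized in code. In the sketch below, the three callables stand in for the paper's displayed expressions (the per-row score, the projected update for the reward-tracking variable, and the virtual queue updates), which are not reproduced here; the sketch only shows the order of operations and is not a definitive implementation.

```python
import numpy as np

def clip(z, lo, hi):
    """Projection of a real number z onto the interval [lo, hi]."""
    return min(max(z, lo), hi)

def run_adaptive(tasks, row_score, theta_update, queue_update,
                 theta_bounds, n_constraints):
    """Structural skeleton of the per-task loop of the proposed algorithm.

    tasks: iterable of option matrices (one row per processing mode).
    row_score(row, theta, Z, Q): per-row value minimized by the row selection.
    theta_update(row, theta, Z, Q): unclipped update of the tracking variable.
    queue_update(row, theta, Z, Q): returns the new (Z, Q) virtual queue values.
    These callables are placeholders for the paper's expressions.
    """
    theta = 0.0
    Z = 0.0                              # virtual queue associated with the reward
    Q = np.zeros(n_constraints)          # virtual queues for the penalty constraints
    for options in tasks:
        scores = [row_score(row, theta, Z, Q) for row in options]
        row = options[int(np.argmin(scores))]       # ties -> smallest index
        theta = clip(theta_update(row, theta, Z, Q), *theta_bounds)
        Z, Q = queue_update(row, theta, Z, Q)
    return theta, Z, Q
```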
3.6. Example Decision Procedure
Consider a step where is given by the example matrix with three rows in (1). In that context, each row represents an image classification algorithm option, and and are desired constraints on time average power and task-average quality. The matrix is repeated here with , :
The row selection decision for step k uses the current virtual queue values to compute the following row values:
- Row 1:
- Row 2:
- Row 3:
The smallest row value is then chosen (breaking ties arbitrarily). Assuming this selection leads to row 2, the value is updated as
Finally, the virtual queues are updated via
Row selection is the most complicated part of the algorithm at each step k: Suppose there are at most m rows in the matrix . For each row, we must perform multiply-adds to compute the row value. Then, we must select the minimizing row value. The worst-case complexity of the row selection is roughly . In this example, we have and , so implementation is simple. In cases when the number of rows is large, say, , row selection can be parallelized via multiple processors.
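Consistent with the multiply-add description above, if each row's score is an affine function of the row entries, then the roughly O(mn) work per task reduces to a single matrix-vector product, which vectorizes and parallelizes naturally. A minimal sketch (the weight vector is a stand-in for the actual queue-dependent coefficients, not the paper's values):

```python
import numpy as np

# options: one row per processing mode of the current task, columns (T, R, y1, y2).
options = np.array([[2.0, 1.2, 3.0, 0.5],
                    [3.0, 0.9, 2.0, 0.7],
                    [5.0, 0.6, 1.0, 0.9]])
weights = np.array([0.3, -1.0, 0.2, -0.1])  # hypothetical queue-dependent coefficients

scores = options @ weights          # one pass of multiply-adds over all rows
best_row = int(np.argmin(scores))   # ties resolved toward the smallest index
```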
3.7. Key Analysis
Fix . The row selection decision of our algorithm implies
where is any other vector in (including any decision vector for task k that is chosen according to some optimized probability distribution).
The selection decision of our algorithm chooses to minimize a function of that is -strongly convex for parameter . A standard strongly convex pushback result (see, for example, Lemma 2.1 in [16]) ensures that if , , and is a convex function over some interval , and if minimizes over all , then for all . Since our minimizes a -strongly convex function we obtain
where the first inequality highlights the pushback term that arises from strong convexity; the second inequality holds by (39) and the fact .
Dividing the above inequality by gives
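For reference, the strongly convex pushback property invoked in this step can be stated in the following generic form (with generic symbols): if f is c-strongly convex on an interval I, with c > 0, and x* minimizes f over I, then

\[
f(x^{\ast}) + \frac{c}{2}\,(y - x^{\ast})^{2} \le f(y) \qquad \text{for all } y \in I.
\]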
Lemma 4.
For any sample path of , for each , our algorithm yields
where are the actual values that arise in our algorithm; is any real number in ; and is any vector in .
Proof.
Adding to both sides of (40) gives
where is defined as the left-hand side of (42). Rearranging terms in gives
where the first inequality holds by (34); the final inequality holds since all satisfy
which holds by completing the square (we use ). Now, observe
where the final inequality uses Cauchy-Schwarz, assumption (14), and the fact . Substituting these bounds into the right-hand side of (43) gives
Substituting this into (42) proves the result. □
Lemma 5.
Proof.
Fix . Fix . Observe that and so . By Lemma 1, there is a decision vector that is independent of such that
Substituting this , along with , into (41) gives
Taking conditional expectations and using (45) gives
This holds for all . Since , there is a sequence in that converges to . Taking a limit over such points in (47) gives
Recall the optimal solution has with for all i. The result is obtained by substituting and . □
4. Reward Guarantee
We first provide a deterministic bound on .
Lemma 6.
Under any sequence, our algorithm yields
where and are nonnegative constants defined
where denotes the smallest integer greater than or equal to the real number z.
Proof.
The update (28) shows is always nonnegative. Define
We first make two claims:
- Claim 1: If for some task , then the following holds. To prove Claim 1, observe that for each task k, we have the bound below. Thus, the update (28) implies can increase by at most over one task. Thus, can increase by at most over any sequence of m or fewer tasks. By construction, . It follows that if , then for all .
- Claim 2: If for some task , then , and in particular, the following holds. To prove Claim 2, suppose . Observe the chain of inequalities below, where (a) holds because , , and for all (recall ); equality (b) holds by definition of . Claim 2 follows in view of the iteration (38).
Since , Claim 1 implies for all . Now use induction: Suppose for all for some integer . We show this is also true for . If for some , then Claim 1 implies , and we are done.
Now, suppose for all . Claim 2 implies (52) holds for all and
Therefore, if for some then and the update (28) gives
where the final inequality is the induction assumption, and we are done. We now show the remaining case for all is impossible. Suppose for all (we reach a contradiction). Then, (52) implies
Summing over gives
and so
where inequality (a) holds by definition of m in (51). This contradicts . □
4.1. Reward over Any m Consecutive Tasks
For positive integers , define and as
Assume is i.i.d. over this block of tasks. Define and the corresponding deterministic problem (16)–(18) with respect to the (unknown) distribution for .
Theorem 1.
Suppose the problem (16)–(18) is feasible with optimal solution and optimal objective value . Then, for any parameters , and all positive integers , our algorithm yields
where are defined
where are defined in (49), (50). In particular, fixing and and choosing , , gives for all :
Similar behavior holds when replacing with , where are fine-tuned constants (defined later) in (61), (62).
Proof.
Fix . Using iterated expectations and substituting (from Lemma 6) into (44) gives
Manipulating the second term on the right-hand side above gives
where the final inequality holds by (33). Substituting this into the right-hand side of (57) gives
Summing the above over and dividing by gives
where the final inequality substitutes the definition of and uses
Rearranging terms in (59) gives
Terms on the right-hand side have the following bounds:
where the final inequality uses
and the facts (since from (27)) and (from (48)). This proves the result upon usage of the constants . □
4.2. No Penalty Constraints
A special and nontrivial case of Theorem 1 is when the only goal is to maximize time average reward , with no additional processes (case , in Theorem 1). For this, let and be averages over tasks . Fix . The work in [2] showed that, in the absence of a priori knowledge of the probability distribution of the matrices, any algorithm that runs over tasks and achieves
must have . That is, convergence time is necessarily . The work in [2] developed a Robbins–Monro iterative algorithm with a vanishing stepsize to achieve this optimal convergence time. Specifically, it deviates from by an optimality gap as the algorithm runs over tasks . However, the vanishing stepsize means that the algorithm cannot adapt to changes. The algorithm of the current paper achieves the optimal convergence time using a different technique. The parameter v can be interpreted as an inverse stepsize parameter, so the stepsize is a constant . With this constant stepsize, the algorithm is adaptive and achieves reward per unit time within of optimality over any consecutive block of tasks for which the matrices have i.i.d. behavior, regardless of the distribution before the start of the block.
The value in Theorem 1 can be fine-tuned. Using , in (53) gives
where
where
which uses , , and definitions of in (49), (50). The above expression for ignores the pesky ceiling operation in (50) as we are merely trying to right-size the constant (formally, one can assume is chosen to make the quantity inside the ceiling operation an integer). The term in (60) does not vanish as . Choosing to minimize amounts to minimizing
That is, choose to minimize where
This yields . To avoid a very large value of (which affects the constant) in the special case , one might adjust this to using .
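The calculation behind this choice of v is elementary. As an illustration with generic positive constants a and b standing in for the coefficients above (assuming the quantity to be minimized has the form of a 1/v term plus a term linear in v, which is one common case and not necessarily the paper's exact expression), the first-order condition gives a square-root rule:

\[
\frac{d}{dv}\Bigl(\frac{a}{v} + b\,v\Bigr) = -\frac{a}{v^{2}} + b = 0
\quad\Longrightarrow\quad
v^{\ast} = \sqrt{\frac{a}{b}}.
\]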
5. Constraints
This section considers the process (for ). Fix . By choosing and , inequality (30) already shows for all that the time average of starting at any time satisfies
The next result uses the Slater condition (19) to show events are rare over any block of tasks (regardless of history before the block).
Theorem 2.
Proof of Theorem 2
To prove Theorem 2, define for . For each k, is determined by history , meaning is -measurable. We first prove a lemma.
Lemma 7.
Proof.
Fix . To prove (65), we have
where inequality (a) holds by the queue update (27) and the nonexpansion property of projections; the final inequality uses from (14). Similarly,
where (a) holds by substituting the definition of from (27) and the fact ; (b) holds by the nonexpansion property of projections; (c) holds because from (14). Inequalities (68) and (69) together prove (65).
We now prove (66). The case follows immediately from (65). It suffices to consider . The queue update (27) ensures
We know . For simplicity, assume (else we use a limiting argument over points in that approach ). Lemma 1 ensures the existence of that satisfies (with prob 1):
By (39), we have
Multiplying the above inequality by 2 and rearranging terms gives
where the final inequality uses (from Lemma 6). Substituting this into the right-hand side of (70) gives
where is defined in (64). Taking conditional expectations of both sides and using (71) gives (with prob 1)
where inequality (a) holds by (19); inequality (b) holds by the triangle inequality
Jensen’s inequality and the definition gives
Substituting this into the previous inequality gives
where (a) holds because we assume ; (b) holds because by definition of in (67). The definition of also implies . Since , we can take square roots to obtain . □
Lemma 7 is in the exact form required of Lemma 4 in [28], so we immediately obtain the following corollary:
Corollary 1.
Proof.
This follows by applying Lemma 4 in [28] to the result of Lemma 7. □
We now use Corollary 1 to prove Theorem 2.
Proof.
(Theorem 2) Fix , . Define
Fix . By (63), it suffices to show
To this end, we have, by definition of in (29),
Define . Taking conditional expectations and using (72) gives, for all ,
Summing over the (fewer than m) terms and dividing by m gives
Recall that for all i. To show (77), it suffices to show
We find these are much smaller than . By assumption, and so from (67)
By definition of d in (75):
where (a) holds because (recall Corollary 1); (b) holds by (80); (c) holds because ; (d) holds because ; (e) holds because are all constants that do not scale with . The term goes to zero exponentially fast as , much faster than . This proves (78).
6. Simulation
All simulations are conducted with Matlab R2023b Update 6 (23.2.0.2485118).
6.1. System 1
This subsection considers the sequential project selection problem from Section 2.3 in [2]. The matrices have a random number of rows. Each row represents a project option and has two columns: The duration of time for the project and its corresponding reward . The goal is to simply maximize the time average reward per unit time (so , and there are no additional penalty constraints). As explained in [2], the greedy policy of always choosing the row that maximizes the instantaneous reward/time ratio is not necessarily optimal. The optimal row decision is not obvious and it depends on the (unknown) distribution of . Two different distributions for are considered in the simulations (specified at the end of this subsection). Both distributions have .
Figure 2 illustrates results for a simulation over tasks using i.i.d. with Distribution 1. The vertical axis in Figure 2 represents the accumulated reward per task starting with task 1 and running up to the current task k:
where the expectations and are approximated by averaging over 40 independent simulation runs. Figure 2 compares the greedy algorithm that, for each task k, always chooses the row that maximizes ; the (nonadaptive) Robbins–Monro algorithm from [2] that uses a stepsize ; and the proposed adaptive algorithm for the cases (and using ). The dashed horizontal line in Figure 2 is the optimal value corresponding to Distribution 1. The value is difficult to calculate analytically, so we use an empirical value obtained from the final point on the Robbins–Monro curve.
Figure 2.
System 1 and Distribution 1: Accumulated reward per unit time for the proposed adaptive algorithm (with ), the vanishing-stepsize Robbins–Monro algorithm, and the greedy algorithm. All data points are averaged over 40 independent simulations. The dashed horizontal line is the optimal for Distribution 1.
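The plotted quantity can be reproduced with a few lines of code. The sketch below assumes per-task rewards and durations from independent runs are stored in arrays; the expectations in the numerator and denominator are approximated by averaging the cumulative sums across runs, as described above.

```python
import numpy as np

def accumulated_reward_per_unit_time(reward_paths, time_paths):
    """Accumulated reward per unit time up to each task k.

    reward_paths, time_paths: arrays of shape (num_runs, num_tasks) with the
    per-task rewards and durations from independent simulation runs.
    """
    cum_reward = np.cumsum(reward_paths, axis=1).mean(axis=0)
    cum_time = np.cumsum(time_paths, axis=1).mean(axis=0)
    return cum_reward / cum_time

# Toy usage with random placeholder data standing in for 40 runs of 1000 tasks:
rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 2.0, size=(40, 1000))
durations = rng.uniform(1.0, 3.0, size=(40, 1000))
curve = accumulated_reward_per_unit_time(rewards, durations)
```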
It can be seen that the greedy algorithm has significantly worse performance compared with the others. The Robbins–Monro algorithm, which uses a vanishing stepsize, has the fastest convergence and the highest achieved reward per unit time. As predicted by our theorems, the proposed adaptive algorithm has a convergence time that gets slower as v is increased, with a corresponding tradeoff in accuracy, where accuracy relates to the proximity of the converged value to the optimal . The case converges quickly but has less accuracy. The cases and have accuracy that is competitive with Robbins–Monro.
Figure 3 illustrates the adaptation advantages of the proposed algorithm. Figure 3 considers simulations over tasks. The first half of the simulation refers to tasks , the second half refers to tasks . The matrices in the first half are i.i.d. with Distribution 1; in the second half, they are i.i.d. with Distribution 2. Nobody tells the algorithms that a change occurs at the halfway mark; rather, the algorithms must adapt. The two dashed horizontal lines represent optimal values for Distribution 1 and Distribution 2. Data in Figure 3 is plotted as a moving average with a window of the past 200 tasks (and averaged over 40 independent simulations). As seen in the figure, the adaptive algorithm (with ) produces near-optimal performance that quickly adapts to the change. In stark contrast, the Robbins–Monro algorithm adapts very slowly to the change and takes roughly tasks to move close to optimality. The adaptation time of Robbins–Monro is much slower than its convergence time starting at task 1. This is due to the vanishing stepsize and the fact that, at the time of the distribution change, the stepsize is very small. Theoretically, the Robbins–Monro algorithm has an arbitrarily large adaptation time, as can be seen by imagining a simulation that uses a fixed distribution for a number of tasks x before changing to another distribution: The stepsize at the time of change is , hence an arbitrarily large value of x yields an arbitrarily large adaptation time.
Figure 3.
System 1: Testing adaptation over a simulation of tasks with a distributional change introduced at the halfway point (task ). The two horizontal dashed lines represent optimal values for the two distributions. Each point for task k is the result of a moving window average , where expectations are obtained by averaging over 40 independent simulations. The adaptive algorithm (with ) quickly adapts to the change. The Robbins–Monro algorithm takes a long time to adapt.
Figure 3 shows that the greedy algorithm adapts very quickly. This is because the greedy algorithm maximizes for each task k without regard to history. Of course, the greedy algorithm is the least accurate and produces results that are significantly less than optimal for both distributions. To avoid clutter, the adaptive algorithm curves for the cases are not plotted in Figure 3. Only the case is shown because it has the slowest adaptation but the most accuracy (as compared with the cases). While not shown in Figure 3, it was observed that the accuracy of the case was only marginally worse than that of the case (similar to Figure 2).
For the simulation of the proposed algorithm for the scenario of Figure 3, the virtual queue was observed to have a maximum value over the entire timeline, with a noticeable jump in average value of after the midway point in the simulation, as shown in Figure 4.
Figure 4.
System 1: A sample path of virtual queue for the proposed algorithm for the same scenario as Figure 3. A change in distribution occurs at the halfway point in the simulation.
- Distribution 1: With being the random number of rows, we use , , , . The first row is always and represents a “vacation” option that lasts for one unit of time and has zero reward (as explained in [2], it can be optimal to take vacations a certain fraction of time, even if there are other row options). The remaining rows r, if any, have parameters generated independently with and , where and is independent of .
- Distribution 2: We use , , , . The first row is always . The other rows r are independently chosen as a random vector with , with independent and , .
6.2. System 2
This subsection considers a device that processes computational tasks with the goal of maximizing time average profit subject to a time average power constraint of energy/time. There is a penalty process, and so the Robbins–Monro algorithm of [2] cannot be used. For simplicity, we use (as discussed in Section 7). The corresponding virtual queue was observed to be stable with a maximum size of less than 350 over all time. We compare the adaptive algorithm of the current paper to the drift-plus-penalty ratio method of [6]. The ratio of expectations from the main method in [6] requires knowledge of the probability distribution on . A heuristic is proposed in Section VI.B in [6] that uses a drift-plus-penalty minimization of . This heuristic has a simple per-task decision complexity, the same as that of the adaptive algorithm proposed in the current paper, and uses a quantity defined as a running average:
It is argued in Section VI.B in [6] that, if the heuristic converges, it converges to a point that is within of optimality, where the parameter v is chosen as . We call this heuristic “DPP with ratio averaging” in the simulations. (Another method described in [6] approximates the ratio of expectations using a window of w past samples. The per-task decision complexity grows with w and hence is larger than the complexity of the algorithm proposed in the current paper. For ease of implementation, we have not considered this method.) We also compare to a greedy method that removes any row r of that does not satisfy , and chooses from the remaining rows to maximize .
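One natural reading of the running average used by this heuristic is the ratio of accumulated reward to accumulated task time over the tasks completed so far; the snippet below uses that reading, which is an assumption made here for illustration and should not be taken as the exact definition in [6].

```python
def update_running_ratio(total_reward, total_time, reward_k, duration_k):
    """Update the accumulated totals and return the new running reward/time ratio.

    An assumed form of the running average used by "DPP with ratio averaging";
    included only for illustration.
    """
    total_reward += reward_k
    total_time += duration_k
    return total_reward, total_time, total_reward / total_time
```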
The i.i.d. matrices have three columns and three rows of the form:
where for . The first row corresponds to ignoring task k and remaining idle for 1 unit of time, earning no reward but using no energy, so . The second row corresponds to processing task k at the home device. The third row corresponds to outsourcing task k to a cloud device. Two distributions are considered (specified at the end of this subsection). Under Distribution A, the reward is the same for both rows 2 and 3, but the energy and durations of time are different. Under Distribution B, the reward is higher for processing at the home device. Both distributions have , , . We use for the adaptive algorithm. Under the distributions used, the greedy algorithm is never able to use row 2, can always use either row 1 or 3, and always selects row 3.
Figure 5 and Figure 6 consider reward and power for simulations over 5000 tasks with i.i.d. under Distribution A. Figure 5 plots the running average of where expectations are attained by averaging over 40 independent simulations. The horizontal asymptote illustrates the optimal as obtained by simulation. The simulation uses for the DPP with ratio averaging because this was sufficient for an accurate approximation of , as seen in Figure 5. The proposed adaptive algorithm is considered for . As predicted by our theorems, it can be seen that the converged reward is closer to as v is increased (the case is competitive with DPP with ratio averaging). Figure 6 plots the corresponding running average of . The disadvantage of choosing a large value of v is seen by the longer time required for time-averaged power to converge to the horizontal asymptote . Figure 5 and Figure 6 show the greedy algorithm has the worst reward per unit time and has average power significantly under the required constraint. This shows that, unlike the other algorithms, the greedy algorithm does not make intelligent decisions for using more power to improve its reward. Considering only the performance shown in Figure 5 and Figure 6, the DPP with ratio averaging heuristic demonstrates the best convergence times, which is likely due to the fact that it uses only one virtual queue while our adaptive algorithm uses and . It is interesting to note that the adaptive algorithms and the DPP with ratio averaging heuristic both choose row 1 (idle) a significant fraction of the time. This is because, when a task has a small reward but a large duration of time, it is better to throw the task away and wait idly for a short amount of time in order to see a new task with a hopefully larger reward.
Figure 5.
Time average reward up to task k for the proposed adaptive algorithm (); the DPP algorithm with ratio averaging; the greedy algorithm. The dashed horizontal line is the value .
Figure 6.
Corresponding time-averaged power for the simulations of Figure 5. The horizontal asymptote is . The greedy algorithm falls too far under the constraint: it does not know how to use more power to increase its time average reward.
Significant adaptation advantages of our proposed algorithm are illustrated in Figure 7. Performance is plotted over a moving average with window size , and averaged over 100 independent simulations. The first half of the simulation uses i.i.d. with Distribution A, the second half uses Distribution B. The two horizontal asymptotes in Figure 7 are optimal and values for Distributions A and B. As seen in the figure, both the adaptive algorithm and the DPP with ratio averaging heuristic quickly converge to the optimal value associated with Distribution A (the rewards under the adaptive algorithm are slightly less than those of the heuristic). At the time of change, the adaptive algorithm has a spike that lasts for roughly 2000 tasks until it settles down to . This can be viewed as the adaptation time and can be decreased by decreasing the value of v (at a corresponding accuracy cost). It converges from above to because, as seen in Figure 7, the spike marks a period of using more power than the required amount. In contrast, the DPP with ratio averaging algorithm cannot adapt and never increases to the optimal value of .
Figure 7.
Adaptation for (a) reward and (b) power when the distribution is changed halfway through the simulation. Horizontal asymptotes are and for Distributions A and B. The adaptive algorithm settles into the new optimality point , while DPP with ratio averaging cannot adapt.
The distributions used are as follows: For each task k, two independent random variables are generated. Then,
- Distribution A: Note that and always.
- Distribution B: The value is increased in comparison to Distribution A.
6.3. Weight Adjustment
While the adaptation times achieved by the proposed algorithm are asymptotically optimal, an important question is whether the coefficient can be improved by some constant factor. Specifically, this subsection attempts to reduce the 2000-task adaptation time seen in the spikes of Figure 7 and Figure 8 without degrading accuracy. We observe that the and queues are weighted equally in the Lyapunov function . More weight can be placed on to emphasize the average power constraint and thereby reduce the spike in Figure 8b. This can be performed with no change in the mathematical analysis by redefining the penalty as for some constant . The constraint is the same as . We use and also double the v parameter from 50 to 100, which maintains the same relative weight between reward and but deemphasizes the virtual queue by a factor of 2.
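In symbols, with a generic penalty process y[k] and weight γ > 0 (the paper's own notation is not reproduced here), the reweighting leaves the constraint unchanged while scaling the arrivals to the associated virtual queue:

\[
\tilde{y}[k] = \gamma\,y[k], \qquad
\mathbb{E}\bigl[\tilde{y}[k]\bigr] \le 0 \;\Longleftrightarrow\; \mathbb{E}\bigl[y[k]\bigr] \le 0 .
\]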
Figure 8.
Comparing (a) reward and (b) power of the reweighted adaptive scheme when the distribution changes twice. Parameters for the adaptive and DPP with ratio averaging algorithms are the same as Figure 7.
Figure 8 plots performance over tasks with Distribution A in the first third, Distribution B in the second third, and Distribution A in the final third. The adaptive algorithm and the DPP with ratio averaging algorithm use the same parameters as in Figure 7. The reweighted adaptive algorithm uses and . It can be seen that the reweighting decreases adaptation time with no noticeable change in accuracy. This illustrates the benefits of weighting the power penalty more heavily than the virtual queue .
The simulations in Figure 8 further show that the proposed adaptive algorithms can effectively handle multiple distributional changes. Indeed, the reward settles close to the new optimality point after each change in distribution. In contrast, the DPP with ratio averaging algorithm, which was not designed to be adaptive, appears completely lost after the first distribution change and never recovers. This emphasizes the importance of adaptive algorithms.
The value of the virtual queue for the proposed algorithm and its reweighted variant are shown in Figure 9. This illustrates that using , which puts no deterministic upper bound on , does not adversely affect performance, as discussed in Section 7. The virtual queue is not shown, but its behavior is similar, and its maximum value over all time was observed to be less than 400.
Figure 9.
System 2: A sample path of for the proposed adaptive algorithm and its reweighted variant for the same scenario as Figure 8.
7. Discussion
The proposed algorithm can be run indefinitely over an infinite sequence of tasks. The analysis holds for any block of m tasks over which the task matrices are i.i.d., regardless of the sample path history before the block. A useful mathematical model that fits our analysis is one where the task sequence evolves over disjoint and variable-length blocks of time during which behavior is i.i.d. with a fixed but unknown distribution, and where the start and end times of each block are unknown. In this scenario, the algorithm adapts and comes within of optimality for every block that lasts for at least tasks.
What happens if is not i.i.d. over a block of interest? The good news is that our analysis provides worst-case bounds on the virtual queues that hold for all time and for any sample path. This means the algorithm maintains reasonable operational states. Of course, our optimality analysis uses the i.i.d. assumption. We conjecture that the algorithm also makes efficient decisions in more general situations where arises from an ergodic Markov chain. The analysis in that situation would be more complicated, and the adaptation times would depend on the mixing time of the Markov process. We leave such open questions for future work.
There are situations where evolves according to an ergodic Markov chain with very slow mixing times, slower than any reasonable timescale over which we want our algorithm to adapt. For example, one can imagine a 2-state Markov chain where is i.i.d. with one distribution in state 1, and i.i.d. with another distribution in state 2. If transition probabilities between states 1 and 2 are very small, then state transitions may occur on a timescale of hours (and ergodic mixing times may be on the order of days or weeks), while thousands of tasks are performed before each state transition. Each state transition starts a new block of tasks. Our algorithm adapts to each new distribution, without knowing the transition times, provided that transition times are separated by at least tasks. In other words, our convergence analysis holds on the shorter timescale of the block, rather than the longer (and possibly irrelevant) mixing time of the underlying Markov chain.
When , the algorithm has a parameter , where . Theorem 2 suggests . This requires knowledge of . In practice, there is little danger in choosing to be too large. Even choosing works well in practice (see simulation section). Intuitively, this is because the virtual queue update (27) for reduces to
which means for all k and the inequality (30) can be modified to
for all positive integers . Intuitively, the Slater condition still ensures concentrates quickly and is still rarely much larger than the parameter in Lemma 7, where . Intuitively, while the virtual queue would no longer be deterministically bounded, it would stay within its existing bounds with high probability. Taking expectations of (82) would then produce a right-hand side proportional to , which is whenever and .
8. Conclusions
This paper gives an adaptive algorithm for renewal optimization, where decisions for each task k determine the duration of the task, the reward for the task, and a vector of penalties. The algorithm has a low per-task decision complexity, operates without knowledge of system probabilities, and is robust to changes in the probability distribution that occur at unknown times. A new hierarchical decision rule enables the algorithm to achieve within of optimality over any sequence of tasks over which the probability distribution is fixed, regardless of system history. This adaptation time matches a prior converse result that shows any algorithm that guarantees -optimality during tasks must have , even if there are no additional penalty processes.
Funding
This work was supported in part by NSF SpecEES 1824418.
Data Availability Statement
The MATLAB simulations and figures in this study are openly available in the provided link: https://ee.usc.edu/stochastic-nets/docs/Adapt-Renewals-Neely2025-simulations.zip, accessed on 20 November 2025.
Conflicts of Interest
The author declares no conflicts of interest.
References
- Neely, M.J. Stochastic Network Optimization with Application to Communication and Queueing Systems; Morgan & Claypool: San Rafael, CA, USA, 2010. [Google Scholar]
- Neely, M.J. Fast Learning for Renewal Optimization in Online Task Scheduling. J. Mach. Learn. Res. 2021, 22, 1–44. [Google Scholar]
- Fox, B. Markov Renewal Programming by Linear Fractional Programming. SIAM J. Appl. Math. 1966, 14, 1418–1432. [Google Scholar] [CrossRef]
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Neely, M.J. Online Fractional Programming for Markov Decision Systems. In Proceedings of the 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 28–30 September 2011. [Google Scholar]
- Neely, M.J. Dynamic Optimization and Learning for Renewal Systems. IEEE Trans. Autom. Control 2013, 58, 32–46. [Google Scholar] [CrossRef]
- Tassiulas, L.; Ephremides, A. Dynamic Server Allocation to Parallel Queues with Randomly Varying Connectivity. IEEE Trans. Inf. Theory 1993, 39, 466–478. [Google Scholar] [CrossRef]
- Tassiulas, L.; Ephremides, A. Stability Properties of Constrained Queueing Systems and Scheduling Policies for Maximum Throughput in Multihop Radio Networks. IEEE Trans. Autom. Control 1992, 37, 1936–1948. [Google Scholar] [CrossRef]
- Wei, X.; Neely, M.J. Data Center Server Provision: Distributed Asynchronous Control for Coupled Renewal Systems. IEEE/ACM Trans. Netw. 2017, 25, 2180–2194. [Google Scholar] [CrossRef]
- Wei, X.; Neely, M.J. Asynchronous Optimization over Weakly Coupled Renewal Systems. Stoch. Syst. 2018, 8, 167–191. [Google Scholar] [CrossRef]
- Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
- Borkar, V.S. Stochastic Approximation: A Dynamical Systems Viewpoint; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
- Nemirovski, A.; Yudin, D. Problem Complexity and Method Efficiency in Optimization; Wiley-Interscience Series in Discrete Mathematics; John Wiley: Hoboken, NJ, USA, 1983. [Google Scholar]
- Kushner, H.J.; Yin, G. Stochastic Approximation and Recursive Algorithms and Applications; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
- Toulis, P.; Horel, T.; Airoldi, E.M. The Proximal Robbins–Monro Method. J. R. Stat. Soc. Ser. B Stat. Methodol. 2021, 83, 188–212. [Google Scholar] [CrossRef]
- Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. Robust Stochastic Approximation Approach to Stochastic Programming. SIAM J. Optim. 2009, 19, 1574–1609. [Google Scholar] [CrossRef]
- Joseph, V.R. Efficient Robbins-Monro Procedure for Binary Data. Biometrika 2004, 91, 461–470. [Google Scholar] [CrossRef]
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), Washington, DC, USA, 21–24 August 2003. [Google Scholar]
- Huo, D.L.; Chen, Y.; Xie, Q. Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Stepsizes. In Proceedings of the Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’23), New York, NY, USA, 19–23 June 2023; pp. 81–82. [Google Scholar] [CrossRef]
- Luo, H.; Zhang, M.; Zhao, P. Adaptive Bandit Convex Optimization with Heterogeneous Curvature. Proc. Mach. Learn. Res. 2022, 178, 1–37. [Google Scholar]
- Van der Hoeven, D.; Cutkosky, A.; Luo, H. Comparator-Adaptive Convex Bandits. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20), Red Hook, NY, USA, 6–12 December 2020. [Google Scholar]
- Leyffer, S.; Menickelly, M.; Munson, T.; Vanaret, C.W.; Wild, S. A Survey of Nonlinear Robust Optimization. INFOR Inf. Syst. Oper. Res. 2020, 58, 342–373. [Google Scholar] [CrossRef]
- Tang, J.; Fu, C.; Mi, C.; Liu, H. An interval sequential linear programming for nonlinear robust optimization problems. Appl. Math. Model. 2022, 107, 256–274. [Google Scholar] [CrossRef]
- Badanidiyuru, A.; Kleinberg, R.; Slivkins, A. Bandits with Knapsacks. J. ACM 2018, 65, 1–55. [Google Scholar] [CrossRef]
- Agrawal, S.; Devanur, N.R. Bandits with Concave Rewards and Convex Knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (EC ’14), New York, NY, USA, 8–12 June 2014; pp. 989–1006. [Google Scholar] [CrossRef]
- Xia, Y.; Ding, W.; Zhang, X.; Yu, N.; Qin, T. Budgeted Bandit Problems with Continuous Random Costs. In Proceedings of the ACML, Hong Kong, 20–22 November 2015. [Google Scholar]
- Kallenberg, O. Foundations of Modern Probability, 2nd ed.; Probability and Its Applications; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
- Neely, M.J. Energy-Aware Wireless Scheduling with Near Optimal Backlog and Convergence Time Tradeoffs. IEEE/ACM Trans. Netw. 2016, 24, 2223–2236. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).