Quick and Complete Convergence in the Law of Large Numbers with Applications to Statistics

Alexander G. Tartakovsky

doi:10.3390/math11122687

AGT StatConsult, 71 Cypress Way, Rolling Hills Estates, CA 90274, USA

Mathematics 2023, 11(12), 2687; https://doi.org/10.3390/math11122687

This article belongs to the Special Issue Analytical Methods and Convergence in Probability with Applications, 2nd Edition

Abstract

In the first part of this article, we discuss and generalize the complete convergence introduced by Hsu and Robbins in 1947 to the r-complete convergence introduced by Tartakovsky in 1998. We also establish its relation to the r-quick convergence first introduced by Strassen in 1967 and extensively studied by Lai. Our work is motivated by various statistical problems, mostly in sequential analysis. As we show in the second part, generalizing and studying these convergence modes is important not only in probability theory but also to solve challenging statistical problems in hypothesis testing and changepoint detection for general stochastic non-i.i.d. models.

Keywords:

complete convergence; r-quick convergence; sequential analysis; hypothesis testing; changepoint detection

MSC:

60F15; 60G35; 60G40; 60J05; 62L10; 62C10; 62C20; 62F03; 62H15; 62M02; 62P30

1. Introduction

In [1], Hsu and Robbins introduced the notion of complete convergence, which is stronger than almost sure (a.s.) convergence. Hsu and Robbins used this notion to discuss certain aspects of the law of large numbers (LLN). In particular, let

X_{1}, X_{2}, \dots

be independent and identically distributed (i.i.d.) random variables with the common mean

μ = E [X_{1}]

. Hsu and Robbins proved that, while in Kolmogorov’s strong law of large numbers (SLLN), only the first moment condition is needed for the sample mean

n^{- 1} \sum_{t = 1}^{n} X_{t}

to converge to

μ

as

n \to \infty

, the complete version of the SLLN requires the second-moment condition

E | X_{1} |^{2} < \infty

(finiteness of variance). Later, Baum and Katz [2], working on the rate of convergence in the LLN, established that the second-moment condition is not only necessary but also sufficient for complete convergence. Strassen [3] introduced another mode of convergence, the r-quick convergence. When

r = 1

, these two modes of convergence are closely related. In the case of i.i.d. random variables and the sample mean

n^{- 1} \sum_{t = 1}^{n} X_{t}

, they are identical. This fact and certain statistical applications motivated Tartakovsky [4] (see also Tartakovsky [5] and Tartakovsky et al. [6]) to introduce a natural generalization of complete convergence—the r-complete convergence, which turns out to be identical to the r-quick convergence in the i.i.d. case.

The goal of this overview paper is to discuss the importance of quick and complete convergence concepts for several challenging statistical applications. These modes of convergence are discussed in detail in the first part of this paper. Statistical applications, which constitute the second part of this paper, include sequential hypothesis testing and changepoint detection in general non-i.i.d. stochastic models, where observations can be dependent and highly non-stationary. Specifically, in the second part, we first address the near optimality of Wald’s sequential probability ratio test (SPRT) for testing two hypotheses regarding the distributions of non-i.i.d. data. We discuss Lai’s results in his fundamental paper [7], which was the first publication to use the r-quick convergence of the log-likelihood ratio processes to establish the asymptotic optimality of the SPRT as the probabilities of errors go to zero. We then tackle the much more difficult multi-decision problem of testing multiple hypotheses and show that certain multi-hypothesis sequential tests asymptotically minimize moments of the stopping time distribution up to order r when properly normalized log-likelihood ratio processes between hypotheses converge r-quickly or r-completely to finite positive numbers. These results can be established based on earlier works of the author (see, e.g., Tartakovsky [4,5] and Tartakovsky et al. [6]). The second challenging application is quickest change detection, where a change that occurs at an unknown point in time must be detected as rapidly as possible. We show, using the works of the author (see, e.g., [5,6] and the references therein), that certain popular changepoint detection procedures, such as the CUSUM, Shiryaev, and Shiryaev–Roberts procedures, are asymptotically optimal when the false alarm rate is low, provided that the normalized log-likelihood ratio processes converge r-completely to finite numbers.

The rest of the paper is organized as follows. Section 2 discusses pure probabilistic issues related to r-complete convergence and r-quick convergence. Section 3 explores statistical applications in sequential hypothesis testing and changepoint detection. Section 4 outlines sufficient conditions for the r-complete convergence for Markov and hidden Markov models, which is needed to establish the optimality properties of sequential hypothesis tests and changepoint detection procedures. Section 5 provides a final discussion and concludes the paper.

2. Modes of Convergence and the Law of Large Numbers

We begin by listing some standard definitions in probability theory. Let

(Ω, F)

be a measurable space, i.e.,

Ω

is a set of elementary events

ω

and

F

is a sigma-algebra (a system of subsets of

Ω

satisfying standard conditions). A probability space is a triple

(Ω, F, P)

, where

P

is a probability measure (completely additive measure normalized to 1) defined on the sets from the sigma-algebra

F

. More specifically, by Kolmogorov’s axioms, probability

P

satisfies:

P (A) \geq 0

for any

A \in F

;

P (Ω) = 1

; and

P (\cup_{i = 1}^{\infty} A_{i}) = \sum_{i = 1}^{\infty} P (A_{i})

for

A_{i} \in F

,

A_{i} \cap A_{j} = ⌀

,

i \neq j

, where ⌀ is the empty set.

A function

X = X (ω)

defined on

(Ω, F)

with values in

X

is called a random variable if it is

F

-measurable, i.e.,

{ω : X (ω) \in B}

belongs, for every measurable set B of values, to the sigma-algebra

F

. The function

F (x) = P (ω : X (ω) \leq x)

is the distribution function of X. It is also referred to as a cumulative distribution function (cdf). The real-valued random variables

X_{1}, X_{2}, \dots

are independent if the events

{X_{1} \leq x_{1}}, {X_{2} \leq x_{2}}, \dots

are independent for every sequence

x_{1}, x_{2}, \dots

of real numbers. In what follows, we shall deal with real-valued random variables unless specified otherwise.

2.1. Standard Modes of Convergence

Let X be a random variable and let

{X_{n}}_{n \in Z_{+}}

(

Z_{+} = {0, 1, 2, \dots}

) be a sequence of random variables, both defined on the probability space

(Ω, F, P)

. We now give several standard definitions and results related to the law of large numbers.

Convergence in Distribution

(

Weak Convergence

).

Let

F_{n} (x) = P (ω : X_{n} \leq x)

be the cdf of

X_{n}

and let

F (x) = P (ω : X \leq x)

be the cdf of X. We say that the sequence

{X_{n}}_{n \in Z_{+}}

converges to X in distribution (or in law or weakly) as

n \to \infty

and write

X_{n} \to_{n \to \infty}^{law} X

if

lim_{n \to \infty} F_{n} (x) = F (x)

at all continuity points of

F (x)

.

Convergence in Probability

.

We say that the sequence

{X_{n}}_{n \in Z_{+}}

converges to X in probability as

n \to \infty

and write

X_{n} \to_{n \to \infty}^{P} X

if

lim_{n \to \infty} P (| X_{n} - X | > ε) = 0 for every ε > 0 .

Almost Sure Convergence

.

We say that the sequence

{X_{n}}_{n \in Z_{+}}

converges to X almost surely (a.s.) or with probability 1 (w.p. 1) as

n \to \infty

under probability measure

P

and write

X_{n} \to_{n \to \infty}^{P - a . s .} X

if

P (ω : lim_{n \to \infty} X_{n} = X) = 1 .

(1)

It is easily seen that (1) is equivalent to the condition

lim_{n \to \infty} P (\bigcup_{t = n}^{\infty} \{| X_{t} - X | > ε\}) = 0 for every ε > 0,

and that the a.s. convergence implies convergence in probability, and the convergence in probability implies convergence in distribution, while the converse statements are not generally true.

The following equivalence, which gives a necessary and sufficient condition for the a.s. convergence, is useful:

X_{n} \to_{n \to \infty}^{a . s .} X ⟺ P (sup_{t \geq n} | X_{t} - X | > ε) \underset{n \to \infty}{\to} 0 for all ε > 0 .

(2)

The following result is often useful.

Lemma 1.

Let

f (t)

be a non-negative increasing function,

{lim}_{t \to \infty} f (t) = \infty

. If

\frac{X_{n}}{f (n)} \to_{n \to \infty}^{P - a . s .} 0,

then

lim_{n \to \infty} P (\frac{1}{f (n)} max_{0 \leq t \leq n} X_{t} > ε) = 0 for every ε > 0 .

(3)

Proof.

For any

ε > 0

,

n_{0} > 0

and

n > n_{0}

, we have

\begin{matrix} P (\frac{1}{f (n)} max_{0 \leq t \leq n} X_{t} > ε) & \leq P (\frac{1}{f (n)} max_{0 \leq t \leq n_{0}} X_{t} > ε) + P (\frac{1}{f (n)} max_{n_{0} < t \leq n} X_{t} > ε) \\ & \leq P (\frac{1}{f (n)} max_{0 \leq t \leq n_{0}} X_{t} > ε) + P (sup_{t > n_{0}} \frac{X_{t}}{f (t)} > ε) . \end{matrix}

Letting

n \to \infty

and taking into account that

lim_{n \to \infty} P (\frac{1}{f (n)} max_{0 \leq t \leq n_{0}} X_{t} > ε) = 0,

we obtain

\begin{matrix} \underset{n \to \infty}{lim sup} P (\frac{1}{f (n)} max_{0 \leq t \leq n} X_{t} > ε) & \leq P (sup_{t > n_{0}} \frac{X_{t}}{f (t)} > ε) . \end{matrix}

Since

n_{0}

can be arbitrarily large, we can let

n_{0} \to \infty

and since, by assumption,

X_{n} / f (n) \to_{n \to \infty}^{a . s .} 0

, it follows from (2) that the upper bound approaches 0 as

n_{0} \to \infty

. This completes the proof. □

Random Walk

.

Let

X_{0}, X_{1}, X_{2}, \dots

be i.i.d. random variables with the mean

E [X_{n}] = μ

for

n \geq 1

and the initial condition

X_{0} = x

. Then,

S_{n} = \sum_{t = 0}^{n} X_{t}

is called a random walk with the mean

x + μ n

.

In what follows, in the case where

X_{1}, X_{2}, \dots

are i.i.d. random variables and

S_{n} = \sum_{t = 0}^{n} X_{t}

, we prefer to formulate the results in terms of the random walk

{S_{n}}_{n \in Z_{+}}

(typically but not necessarily

S_{0} = 0

).

We now recall two strong laws of large numbers (SLLNs). Write

S_{n} = X_{0} + X_{1} + \dots + X_{n}

for the partial sum (

X_{0} = S_{0} = 0

), so that

{S_{n}}_{n \in Z_{+}}

is a random walk with an initial condition of zero as long as

X_{1}, X_{2}, \dots

are i.i.d. with mean

μ

.

Kolmogorov’s SLLN

.

Let

{S_{n}}_{n \in Z_{+}}

be a random walk under probability measure

P

. If

E [S_{1}]

exists, then the sample mean

S_{n} / n

converges to the mean value

E [S_{1}]

w.p. 1, i.e.,

n^{- 1} S_{n} \to_{n \to \infty}^{P - a . s .} E [S_{1}] .

(4)

Conversely, if

n^{- 1} S_{n} \to_{n \to \infty}^{P - a . s .} μ

, where

| μ | < \infty

, then

E [S_{1}] = μ

.

Marcinkiewicz–Zygmund’s SLLN

.

Let

{S_{n}}_{n \in Z_{+}}

be a zero-mean random walk under probability measure

P

. The two following statements are equivalent:

(i): $E | S_{1} |^{p} < \infty$ for $0 < p < 2$ ;
(ii): $n^{- 1 / p} S_{n} \to_{n \to \infty}^{P - a . s .} 0$ .

2.2. Complete and r-Complete Convergence

We begin with discussing the issue of rates of convergence in the LLN.

Rates of Convergence

.

Let

{X_{n}}_{n \in Z_{+}}

be a sequence of random variables and assume that

X_{n}

converges to 0 w.p. 1 as

n \to \infty

. The question is what the rate of convergence is. In other words, we are concerned with the speed at which the tail probability

P (| X_{n} | > ε)

decays to zero. This question can be answered by analyzing the behavior of the sums

Σ (r, ε) : = \sum_{n = 1}^{\infty} n^{r - 1} P (| X_{n} | > ε) for some r > 0 and all ε > 0 .

More specifically, if

Σ (r, ε)

is finite for every

ε > 0

, then the tail probability

P (| X_{n} | > ε)

decays with a rate faster than

1 / n^{r}

, so that

n^{r} P (| X_{n} | > ε) \to 0

for all

ε > 0

as

n \to \infty

.
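As a rough numerical illustration of how the sums Σ(r, ε) behave, the sketch below evaluates partial sums for the sample mean of a Gaussian random walk. It is only a sketch under stated assumptions: the steps are taken to be i.i.d. N(0, 1) (an assumption made here because the tail probability is then available in closed form), and the function names `gaussian_mean_tail` and `partial_sigma` are hypothetical helpers, not part of the paper.

```python
import math


def gaussian_mean_tail(n: int, eps: float) -> float:
    """Exact P(|S_n / n| > eps) when the steps are i.i.d. N(0, 1) (assumed model)."""
    # S_n / n ~ N(0, 1/n), so the two-sided tail equals erfc(eps * sqrt(n / 2)).
    return math.erfc(eps * math.sqrt(n / 2.0))


def partial_sigma(r: float, eps: float, n_max: int) -> float:
    """Partial sum of Sigma(r, eps) = sum_n n^(r-1) P(|X_n| > eps) with X_n = S_n / n."""
    return sum(n ** (r - 1) * gaussian_mean_tail(n, eps) for n in range(1, n_max + 1))
```

In this Gaussian case all moments are finite, so the partial sums should stabilize for every r > 0; comparing, say, `partial_sigma(2.0, 0.5, 2000)` with `partial_sigma(2.0, 0.5, 4000)` gives a quick check of that behavior.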

To answer this question, we now consider modes of convergence that strengthen the almost sure convergence and therefore help determine the rate of convergence in the SLLN. Historically, this issue was first addressed in 1947 by Hsu and Robbins [1], who introduced a new mode of convergence, which they called complete convergence.

Complete Convergence

.

The sequence

{X_{n}}_{n \in Z_{+}}

converges to 0 completely if

lim_{n \to \infty} \sum_{t = n}^{\infty} P (| X_{t} | > ε) = 0 for every ε > 0,

(5)

which is equivalent to

Σ (1, ε) = \sum_{n = 1}^{\infty} P (| X_{n} | > ε) < \infty for every ε > 0 .

Let

{S_{n}}_{n \in Z_{+}}

be a random walk with a mean of

E [S_{n}] = μ n

. Kolmogorov’s SLLN (4) implies that the sample mean

S_{n} / n

converges to

μ

w.p. 1. Hsu and Robbins [1] proved that, under the same assumptions (i.e., under only the first-moment condition

E | S_{1} | < \infty

) the sequence

{n^{- 1} S_{n}}_{n \geq 1}

does not need to completely converge to

μ

, but it will do so under the further second-moment condition

E | S_{1} |^{2} < \infty

. Thus, the finiteness of variance is a sufficient condition for complete convergence in the SLLN. They conjectured that the second-moment condition is not only sufficient but also necessary for complete convergence. Thus, it follows from these results that, if the variance is finite, then the rate of convergence in Kolmogorov’s SLLN is

{lim}_{n \to \infty} n P (| S_{n} / n - μ | > ε) = 0

for all

ε > 0

.

In 1965, Baum and Katz [2] took a further step on this issue. In particular, the following result follows from Theorem 3 in [2] for the zero-mean random walk

{S_{n}}_{n \in Z_{+}}

.

Theorem 1.

Let

r > 0

and

α > 1 / 2

. If

{S_{n}}_{n \in Z_{+}}

is a zero-mean random walk, then the following statements are equivalent:

\begin{matrix} E [| S_{1} |^{(r + 1) / α}] < \infty & ⟺ \sum_{n = 1}^{\infty} n^{r - 1} P \{n^{- α} | S_{n} | > ε\} < \infty for all ε > 0 \\ & ⟺ \sum_{n = 1}^{\infty} n^{r - 1} P \{sup_{k \geq n} \frac{1}{k^{α}} | S_{k} | > ε\} < \infty for all ε > 0 . \end{matrix}

(6)

Setting

r = 1

and

α = 1

in (6), we obtain the following equivalence

E [| S_{1} |^{2}] < \infty ⟺ \sum_{n = 1}^{\infty} P \{| n^{- 1} S_{n} | > ε\} < \infty for all ε > 0,

which shows that the conjecture of Hsu and Robbins is correct—the second-moment condition

E | S_{1} |^{2} < \infty

is both necessary and sufficient for complete convergence

n^{- 1} S_{n} \to_{n \to \infty}^{P - completely} 0 .

Furthermore, if for some

r > 0

, the

(r + 1)

-th moment is finite,

E | S_{1} |^{r + 1} < \infty

, then the rate of convergence in the SLLN is

{lim}_{n \to \infty} n^{r} P (| n^{- 1} S_{n} | > ε) = 0

for all

ε > 0

.

The preceding results suggest that it is reasonable to generalize the notion of complete convergence to the following mode of convergence, which we will refer to as r-complete convergence and which is also related to the so-called r-quick convergence discussed later (see Section 2.3).

Definition 1 (r-Complete Convergence).

Let

r > 0

. We say that the sequence of random variables

{X_{n}}_{n \in Z_{+}}

converges to X as

n \to \infty

r-completely under probability measure

P

and write

X_{n} \to_{n \to \infty}^{P - r - completely} X

if

Σ (r, ε) : = \sum_{n = 1}^{\infty} n^{r - 1} P (| X_{n} - X | > ε) < \infty for every ε > 0 .

(7)

Note that the a.s. convergence of

{X_{n}}

to X can be equivalently written as

lim_{n \to \infty} P (\bigcup_{t = n}^{\infty} \{| X_{t} - X | > ε\}) = 0 for every ε > 0,

so that the r-complete convergence with

r \geq 1

implies the a.s. convergence, but the converse is not true in general.
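A standard counterexample, which is not taken from the paper, shows why a.s. convergence does not imply complete convergence (the case r = 1). The uniform random variable U and the indicator construction are assumptions introduced purely for illustration.

```latex
Let $U$ be uniformly distributed on $(0,1)$ and set
$X_n = \mathbf{1}\{U \le 1/n\}$ for $n \ge 1$.
For every $\omega$ with $U(\omega) > 0$ we have $X_n(\omega) = 0$ for all
$n > 1/U(\omega)$, so $X_n \to 0$ a.s. However, for $0 < \varepsilon < 1$,
\[
  \Sigma(1,\varepsilon)
  = \sum_{n=1}^{\infty} P(|X_n| > \varepsilon)
  = \sum_{n=1}^{\infty} \frac{1}{n}
  = \infty ,
\]
so $X_n$ does not converge to $0$ completely.
```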

Suppose that

X_{n}

converges a.s. to X. If

Σ (r, ε)

is finite for every

ε > 0

, then

lim_{n \to \infty} \sum_{t = n}^{\infty} t^{r - 1} P (| X_{t} - X | > ε) = 0 for every ε > 0

and probability

P (| X_{n} - X | > ε)

goes to 0 as

n \to \infty

with the rate faster than

1 / n^{r}

. Hence, as already mentioned above, the r-complete convergence allows one to determine the rate of convergence of

X_{n}

to X, i.e., to answer the question of how fast the tail probability

P (| X_{n} - X | > ε)

decays to zero.

The following result provides a very useful implication of complete convergence.

Theorem 2.

Let

{X_{n}}_{n \in Z_{+}}

and

{Y_{n}}_{n \in Z_{+}}

be two arbitrary, possibly dependent sequences of random variables. Assume that there are positive and finite numbers

μ_{1}

and

μ_{2}

such that

\sum_{n = 1}^{\infty} P (| \frac{1}{n} X_{n} - μ_{1} | > ε) < \infty for every ε > 0

(8)

and

\sum_{n = 1}^{\infty} P (| \frac{1}{n} Y_{n} - μ_{2} | > ε) < \infty for every ε > 0,

(9)

i.e.,

n^{- 1} X_{n} \to_{n \to \infty}^{P - completely} μ_{1}

and

n^{- 1} Y_{n} \to_{n \to \infty}^{P - completely} μ_{2}

. If

μ_{1} \geq μ_{2}

, then for any random time T

P (X_{T} < b, Y_{T + 1} \geq b (1 + δ)) ⟶ 0 as b \to \infty for any δ > 0 .

(10)

Proof.

Fix

δ > 0

,

c \in (0, δ)

and let

N_{b} = ⌈(1 + c) b / μ_{2}⌉

be the smallest integer that is larger than or equal to

(1 + c) b / μ_{2}

. Observe that

\begin{matrix} P (X_{T} < b, Y_{T + 1} \geq b (1 + δ)) & \leq P (X_{T} \leq b, T \geq N_{b}) + P (Y_{T + 1} \geq (1 + δ) b, T < N_{b}) \\ \leq P (X_{T} \leq b, T \geq N_{b}) + P (max_{1 \leq n \leq N_{b}} Y_{n} \geq (1 + δ) b) . \end{matrix}

Thus, to prove (10), it suffices to show that the two terms on the right-hand side go to 0 as

b \to \infty

.

For the first term, we notice that, for any

n \geq N_{b}

,

\frac{b}{n} \leq \frac{b}{N_{b}} \leq \frac{μ_{2}}{1 + c} \leq \frac{μ_{1}}{1 + c} < μ_{1},

so that

\begin{matrix} P (X_{T} \leq b, T \geq N_{b}) & = \sum_{n = N_{b}}^{\infty} P (X_{n} \leq b, T = n) \leq \sum_{n = N_{b}}^{\infty} P (\frac{X_{n}}{n} \leq \frac{b}{n}) \\ \leq \sum_{n = N_{b}}^{\infty} P (\frac{X_{n}}{n} \leq \frac{μ_{1}}{1 + c}) = \sum_{n = N_{b}}^{\infty} P (\frac{X_{n}}{n} - μ_{1} \leq - \frac{c}{1 + c} μ_{1}) . \end{matrix}

Since

N_{b} \to \infty

as

b \to \infty

, the upper bound goes to 0 as

b \to \infty

due to condition (8).

Next, since

c \in (0, δ)

, there exists

ε^{'} > 0

such that, for all sufficiently large b,

\frac{(1 + δ) b}{N_{b}} = \frac{(1 + δ) b}{⌈b (1 + c) / μ_{2}⌉} \geq (1 + ε^{'}) μ_{2} .

As a result,

\begin{matrix} P (max_{1 \leq n \leq N_{b}} Y_{n} \geq (1 + δ) b) \leq P (\frac{1}{N_{b}} max_{1 \leq n \leq N_{b}} Y_{n} \geq (1 + ε^{'}) μ_{2}), \end{matrix}

where the upper bound goes to 0 as

b \to \infty

by condition (9) (see Lemma 1). □

Remark 1.

The proof suggests that the assertion (10) of Theorem 2 holds under the following one-sided conditions

\begin{matrix} P (n^{- 1} max_{1 \leq s \leq n} Y_{s} - μ_{2} > ε) \underset{n \to \infty}{\to} 0, \sum_{n = 1}^{\infty} P (n^{- 1} X_{n} - μ_{1} < - ε) < \infty . \end{matrix}

Complete convergence conditions (8) and (9) guarantee both these conditions.

Remark 2.

Theorem 2 can be applied to the overshoot problem. Indeed, if

X_{n} = Y_{n} = Z_{n}

and the random time T is the first time n when

Z_{n}

exceeds the level b,

T = inf {n \geq 1 : Z_{n} > b}

, then Theorem 2 shows that the relative excess of boundary crossing (overshoot)

(Z_{T} - b) / b

converges to 0 in probability as

b \to \infty

when

Z_{n} / n

completely converges as

n \to \infty

to a positive number μ.
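The overshoot statement in Remark 2 can be explored with a small simulation. The sketch below is illustrative only: it assumes a random walk with i.i.d. N(drift, 1) steps (a case in which the second moment is finite, so complete convergence of Z_n / n holds), and the helper names `relative_overshoot` and `mean_relative_overshoot` are hypothetical.

```python
import random


def relative_overshoot(b: float, drift: float, rng: random.Random) -> float:
    """Run Z_n = X_1 + ... + X_n, X_t ~ N(drift, 1), until Z_n > b; return (Z_T - b) / b."""
    z = 0.0
    while z <= b:
        z += rng.gauss(drift, 1.0)
    return (z - b) / b


def mean_relative_overshoot(b: float, drift: float = 1.0,
                            runs: int = 2000, seed: int = 0) -> float:
    """Monte Carlo average of the relative overshoot at level b (assumed Gaussian steps)."""
    rng = random.Random(seed)
    return sum(relative_overshoot(b, drift, rng) for _ in range(runs)) / runs
```

Calling `mean_relative_overshoot` for increasing levels b should show the average relative overshoot shrinking, in line with the convergence to 0 in probability described above; a simulation of this kind is of course only suggestive, not a proof.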

2.3. r-Quick Convergence

In 1967, Strassen [3] introduced the notion of r-quick limit points of a sequence of random variables. The r-quick convergence has been further addressed by Lai [7,8], Chow and Lai [9], Fuh and Zhang [10], and Tartakovsky [4,5] (see certain details in Section 2.4).

We define r-quick convergence in a way suitable for this paper. Let

{X_{n}}_{n \in Z_{+}}

be a sequence of real-valued random variables and let X be a random variable defined on the same probability space

(Ω, F, P)

.

Definition 2 (r-Quick Convergence).

Let

r > 0

and for

ε > 0

, let

L_{ε} = sup {n \geq 1 : | X_{n} - X | > ε} (sup {⌀} = 0)

be the last entry time of

X_{n}

in the region

(X + ε, \infty) \cup (- \infty, X - ε)

. We say that the sequence

{X_{n}}_{n \in Z_{+}}

converges to X r-quickly as

n \to \infty

under the probability measure

P

and write

X_{n} \to_{n \to \infty}^{P - r - quickly} X

if and only if

E [L_{ε}^{r}] < \infty for every ε > 0,

(11)

where

E

is the operator of expectation under probability

P

.
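To make the last entry time concrete, the following sketch computes it on a finite observed path. It rests on two assumptions introduced here: the limit is a known constant, and the path is truncated at a finite horizon, so the returned value is only a lower bound for the true L_ε. The function name `last_entry_time` is hypothetical.

```python
from typing import Sequence


def last_entry_time(xs: Sequence[float], limit: float, eps: float) -> int:
    """Largest n >= 1 with |X_n - limit| > eps on the observed path (0 if there is none).

    xs[0] is X_1, xs[1] is X_2, and so on, matching the convention sup{empty set} = 0.
    """
    last = 0
    for n, x in enumerate(xs, start=1):
        if abs(x - limit) > eps:
            last = n
    return last
```

Averaging `last_entry_time(path, limit, eps) ** r` over many simulated paths gives a crude Monte Carlo proxy for E[L_ε^r] in (11), subject to the truncation caveat above.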

This definition can be generalized to random variables X,

{X_{n}}_{n \in Z_{+}}

taking values in a metric space

(X, d)

with distance d:

X_{n} \to_{n \to \infty}^{r - quickly} X

if

E [{(sup {n \geq 1 : d (X, X_{n}) > ε})}^{r}] < \infty for every ε > 0 .

Note that the a.s. convergence

X_{n} \to μ

(

| μ | < \infty

) as

n \to \infty

to a constant

μ

can be expressed as

P (L_{ε} (μ) < \infty) = 1

, where

L_{ε} (μ) = sup {n \geq 1 : | X_{n} - μ | > ε}

. Therefore, the r-quick convergence implies the convergence w.p. 1 but not conversely.

Also, in general, r-quick convergence is stronger than r-complete convergence. Specifically, the following lemma shows that

max_{1 \leq t \leq n} X_{t} \to_{n \to \infty}^{r - completely} μ ⟹ X_{n} \to_{n \to \infty}^{r - quickly} μ ⟹ X_{n} \to_{n \to \infty}^{r - completely} μ .

(12)

Lemma 2.

Let

{X_{n}}_{n \in Z_{+}}

be a sequence of random variables. Let

f (t)

be a non-negative increasing function,

f (0) = 0

,

{lim}_{t \to \infty} f (t) = + \infty

, and let for

ε > 0

L_{ε} (f) = sup \{n \geq 1 : | X_{n} | > ε f (n)\} (sup {⌀} = 0)

be the last time that

X_{n}

leaves the interval

[- ε f (n), + ε f (n)]

.

(i): For any $r > 0$ and any $ε > 0$ , the following inequalities hold:

\begin{matrix} r \sum_{n = 1}^{\infty} n^{r - 1} P \{| X_{n} | \geq ε f (n)\} \leq E [L_{ε} {(f)}^{r}] \leq r \sum_{n = 1}^{\infty} n^{r - 1} P \{sup_{t \geq n} \frac{| X_{t} |}{f (t)} \geq ε\} . \end{matrix}

(13)

Therefore,

\sum_{n = 1}^{\infty} n^{r - 1} P \{sup_{t \geq n} \frac{| X_{t} |}{f (t)} \geq ε\} < \infty for all ε > 0 ⟹ X_{n} \to_{n \to \infty}^{r - quickly} 0 .

(ii): If $f (t)$ is a power function, $f (t) = t^{γ}$ , $γ > 0$ , then the finiteness of

\sum_{n = 1}^{\infty} n^{r - 1} P \{max_{1 \leq t \leq n} X_{t} \geq ε n^{γ}\}

for some

r > 0

and every

ε > 0

implies the r-quick convergence of

X_{n}

to 0:

\{\sum_{n = 1}^{\infty} n^{r - 1} P (max_{1 \leq t \leq n} X_{t} \geq ε n^{γ}) < \infty \forall ε > 0\} ⟹ \{E [L_{ε} {(γ)}^{r}] < \infty \forall ε > 0\},

(14)

where

L_{ε} (γ) = sup \{n \geq 1 : | X_{n} | > ε n^{γ}\}

.

Proof.

Proof of (i). Obviously,

P \{| X_{n} | \geq ε f (n)\} \leq P \{L_{ε} (f) \geq n\} \leq P \{sup_{t \geq n} \frac{1}{f (t)} | X_{t} | \geq ε\}

from which the inequalities (13) follow immediately.

Proof of (ii). Write

M_{u} = {max}_{1 \leq n \leq ⌈ u ⌉} | X_{n} |

, where

⌈ u ⌉

is the smallest integer greater than or equal to u. We have the following chain of inequalities and equalities:

\begin{matrix} E [L_{2 ε} {(γ)}^{r}] \leq r \int_{0}^{\infty} t^{r - 1} P \{sup_{u \geq t} u^{- γ} | X_{u} | \geq 2 ε\} d t \\ \leq r \int_{0}^{\infty} t^{r - 1} P \{sup_{u \geq t} [| X_{u} | - ε u^{γ}] \geq ε t^{γ}\} d t \\ \leq r \int_{0}^{\infty} t^{r - 1} P \{sup_{u > 0} [| X_{u} | - ε u^{γ}] \geq ε t^{γ}\} d t \\ \leq r \sum_{n = 1}^{\infty} \int_{0}^{\infty} t^{r - 1} P \{sup_{(2^{n - 1} - 1) t^{γ} < u^{γ} \leq (2^{n} - 1) t^{γ}} [| X_{u} | - ε u^{γ}] \geq ε t^{γ}\} d t \\ \leq r \sum_{n = 1}^{\infty} \int_{0}^{\infty} t^{r - 1} P \{sup_{u^{γ} \leq 2^{n} t^{γ}} | X_{u} | \geq 2^{n - 1} ε t^{γ}\} d t \\ = r \sum_{n = 1}^{\infty} \int_{0}^{\infty} t^{r - 1} P \{M_{2^{n / γ} t} \geq 2^{n - 1} ε t^{γ}\} d t \\ = r [\sum_{n = 1}^{\infty} 2^{- n / γ}] \int_{0}^{\infty} u^{r - 1} P \{M_{u} \geq (ε / 2) u^{γ}\} d u . \end{matrix}

It follows that

\begin{matrix} E [L_{2 ε} {(γ)}^{r}] \leq r {(2^{1 / γ} - 1)}^{- 1} \int_{0}^{\infty} u^{r - 1} P \{M_{u} \geq (ε / 2) u^{γ}\} d u \\ \leq r {(2^{1 / γ} - 1)}^{- 1} \sum_{n = 1}^{\infty} n^{r - 1} P \{max_{1 \leq t \leq n} X_{t} \geq ε n^{γ}\} \end{matrix}

which yields the implication (14) and completes the proof. □

The following theorem shows that, in the i.i.d. case, the implications in (12) become equivalences.

Theorem 3.

Let

{S_{n}}_{n \in Z_{+}}

be a zero-mean random walk and let r > 0. The following equivalences hold:

\begin{matrix} E | S_{1} |^{r + 1} < \infty & ⟺ n^{- 1} S_{n} \to_{n \to \infty}^{r - completely} 0, \end{matrix}

(15)

\begin{matrix} E | S_{1} |^{r + 1} < \infty & ⟺ n^{- 1} S_{n} \to_{n \to \infty}^{r - quickly} 0, \end{matrix}

(16)

\begin{matrix} E | S_{1} |^{r + 1} < \infty & ⟺ \sum_{n = 1}^{\infty} n^{r - 1} P \{sup_{t \geq n} \frac{1}{t} | S_{t} | > ε\} < \infty for all ε > 0 . \end{matrix}

(17)

Proof.

By Theorem 1, in the i.i.d. case,

\begin{matrix} E | S_{1} |^{r + 1} < \infty ⟺ \sum_{n = 1}^{\infty} n^{r - 1} P (\frac{1}{n} | S_{n} | > ε) < \infty \forall ε > 0 \end{matrix}

(18)

and

E | S_{1} |^{r + 1} < \infty ⟺ \sum_{n = 1}^{\infty} n^{r - 1} P (sup_{t \geq n} \frac{1}{t} | S_{t} | > ε) < \infty \forall ε > 0,

(19)

so that assertion (15) follows from (18) and (17) from (19).

Next, let

L_{ε} = sup \{n \geq 1 : | S_{n} | \geq n ε\} (sup ⌀ = 0) .

By Lemma 2(i),

E [L_{ε}^{r}] \leq r \sum_{n = 1}^{\infty} n^{r - 1} P \{sup_{t \geq n} (| S_{t} | / t) \geq ε\} \forall ε > 0,

(20)

which, along with (19), implies (16).

2.4. Further Remarks on r-Complete Convergence, r-Quick Convergence, and Rates of Convergence in SLLN

Let

{S_{n}}_{n \in Z_{+}}

be a random walk. Without loss of generality, let

S_{0} = 0

and

E [S_{1}] = 0

.

1.: Strassen [3] proved, in particular, that if $f (n) = {(2 n log n)}^{1 / 2}$ in Lemma 2, then for $r > 0$

\underset{n \to \infty}{lim sup} \frac{S_{n}}{\sqrt{2 n log n}} = \sqrt{r E [S_{1}^{2}]} r - quickly

(21)

whenever

E | S_{1} |^{p} < \infty

for

p > (2 r + 1)

. He also proved the functional form of the law of the iterated logarithm.

2.: Lai [8] improved this result, showing that Strassen’s moment condition $E | S_{1} |^{p} < \infty$ for $p > (2 r + 1)$ can be relaxed. Specifically, he showed that a weaker condition

E [| S_{1} |^{2 (r + 1)} {({log}^{+} | S_{1} | + 1)}^{- (r + 1)}] < \infty for r > 0

(22)

is the best one can do (i.e., both necessary and sufficient):

E [| S_{1} |^{2 (r + 1)} {({log}^{+} | S_{1} | + 1)}^{- (r + 1)}] < \infty ⟺ \underset{n \to \infty}{lim sup} \frac{S_{n}}{\sqrt{2 n log n}} < \infty r - quickly,

in which case equality (21) holds.

Note, however, that for

r = 0

, in terms of the a.s. convergence,

E [| S_{1} |^{2}] < \infty ⟺ \underset{n \to \infty}{lim sup} \frac{S_{n}}{\sqrt{2 n log log n}} = \sqrt{E [| S_{1} |^{2}]} a . s .

but under condition (22) for all

r > 0

\underset{n \to \infty}{lim sup} \frac{S_{n}}{\sqrt{2 n log log n}} = \infty r - quickly .

3.: Let $α > 1 / 2$ and $r > 0$ . Chow and Lai [9] established the following one-sided inequality for tail probabilities:

\sum_{n = 1}^{\infty} n^{r - 1} P (max_{1 \leq t \leq n} S_{t} \geq n^{α}) \leq C_{r, α} \{E [{(S_{1}^{+})}^{(r + 1) / α}] + {(E [S_{1}^{2}])}^{r / (2 α - 1)}\}

(23)

whenever

E | S_{1} |^{2} < \infty

. Under the same hypotheses, this one-sided inequality implies the two-sided one:

\sum_{n = 1}^{\infty} n^{r - 1} P (max_{1 \leq t \leq n} | S_{t} | \geq n^{α}) \leq C_{r, α} \{E [| S_{1} |^{(r + 1) / α}] + {(E [S_{1}^{2}])}^{r / (2 α - 1)}\} .

(24)

The upper bound in (24) turns out to be sharp since the lower bound also holds:

\sum_{n = 1}^{\infty} n^{r - 1} P (max_{1 \leq t \leq n} | S_{t} | \geq n^{α}) \geq 1 + B_{r, α} \{E [| S_{1} |^{(r + 1) / α}] + {(E [S_{1}^{2}])}^{r / (2 α - 1)}\} .

Here, the constants

C_{r, α}

and

B_{r, α}

are universal, depending only on

r, α

.

The results of Chow and Lai [9] provide one-sided analogues of the results of Baum and Katz [2] as well as extend their results. Indeed, the one-sided inequality (23) implies that the following statements are equivalent for the zero-mean random walk

{S_{n}}_{n \in N}

:

(i): $E [{(S_{1}^{+})}^{(r + 1) / α}] < \infty$ ;
(ii): $\sum_{n = 1}^{\infty} n^{r - 1} P (n^{- α} S_{n} \geq ε) < \infty for all ε > 0$ ;
(iii): $\sum_{n = 1}^{\infty} n^{r - 1} P ({sup}_{k \geq n} k^{- α} S_{k} \geq ε) < \infty for all ε > 0$ ,

where

α > 1 / 2

.

Clearly, the two-sided inequality (24) yields the assertions of Theorem 1.

4.: The Marcinkiewicz–Zygmund SLLN states that, for $α > 1 / 2$ , the following equivalence holds:

E | S_{1} |^{1 / α} < \infty ⟺ n^{- α} S_{n} \to_{n \to \infty}^{a . s .} 0 .

(25)

The strengthened r-quick equivalent of this SLLN is: for any

r > 0

and

α > 1 / 2

, the following statements are equivalent,

\begin{matrix} E [| S_{1} |^{(r + 1) / α}] < \infty & ⟺ \sum_{n = 1}^{\infty} n^{r - 1} P \{\frac{1}{n^{α}} | S_{n} | > ε\} < \infty for all ε > 0 \\ & ⟺ \sum_{n = 1}^{\infty} n^{r - 1} P \{sup_{k \geq n} \frac{1}{k^{α}} | S_{k} | > ε\} < \infty for all ε > 0 \\ & ⟺ n^{- α} S_{n} \to_{n \to \infty}^{r - quickly} 0 . \end{matrix}

(26)

Equivalences (26) follow from Theorem 1, Theorem 3, and inequality (24). The proof is almost obvious and is omitted.

3. Applications of r-Complete and r-Quick Convergences in Statistics

In this section, we outline certain statistical applications which show the usefulness of r-complete and r-quick versions of the SLLN.

3.1. Sequential Hypothesis Testing

We begin by formulating the following multi-hypothesis testing problem for a general non-i.i.d. stochastic model. Let

(Ω, F, F_{n}, P)

,

n \in Z_{+} = {0, 1, 2, \dots}

be a filtered probability space with standard assumptions about the monotonicity of the sub-

σ

-algebras

F_{n}

. The sub-

σ

-algebra

F_{n} = σ (X^{n})

of

F

is assumed to be generated by the sequence

X^{n} = {X_{t}, 1 \leq t \leq n}

observed up to time n, which is defined on the space

(Ω, F)

. The hypotheses are

H_{i} : P = P_{i}

,

i = 0, 1, \dots, N

, where

P_{0}, P_{1}, \dots, P_{N}

are given probability measures assumed to be locally mutually absolutely continuous, i.e., their restrictions

P_{i}^{n}

and

P_{j}^{n}

to

F_{n}

are equivalent for all

1 \leq n < \infty

and all

i, j = 0, 1, \dots, N

,

i \neq j

. Let

Q^{n}

be a restriction to

F_{n}

of a

σ

-finite measure Q on

(Ω, F)

. Under

P_{i}

, the sample

X^{n} = (X_{1}, \dots, X_{n})

has a joint density

p_{i, n} (X^{n})

with respect to the dominating measure

Q^{n}

for all

n \in N

, which can be written as

p_{i, n} (X^{n}) = \prod_{t = 1}^{n} f_{i, t} (X_{t} | X^{t - 1}),

(27)

where

f_{i, n} (X_{n} | X^{n - 1})

,

n \geq 1

are corresponding conditional densities.

For

n \in N

, define the likelihood ratio (LR) process between the hypotheses

H_{i}

and

H_{j}

Λ_{i j} (n) = \frac{d P_{i}^{n}}{d P_{j}^{n}} (X^{n}) = \frac{p_{i, n} (X^{n})}{p_{j, n} (X^{n})} = \prod_{t = 1}^{n} \frac{f_{i, t} (X_{t} | X^{t - 1})}{f_{j, t} (X_{t} | X^{t - 1})}

and the log-likelihood ratio (LLR) process

λ_{i j} (n) = log Λ_{i j} (n) = \sum_{t = 1}^{n} log [\frac{f_{i, t} (X_{t} | X^{t - 1})}{f_{j, t} (X_{t} | X^{t - 1})}] .
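For a concrete non-i.i.d. instance of the LLR process, the sketch below accumulates the conditional log-density ratios for a Gaussian AR(1) model. The model is an assumption chosen here for illustration (it is not specified at this point in the paper): under hypothesis H_i, X_t = ρ_i X_{t-1} + W_t with W_t i.i.d. N(0, 1) and a known starting value x0. The function name `ar1_llr` is hypothetical.

```python
from typing import Sequence


def ar1_llr(xs: Sequence[float], rho_i: float, rho_j: float, x0: float = 0.0) -> float:
    """LLR lambda_ij(n) for the assumed model X_t = rho * X_{t-1} + W_t, W_t ~ N(0, 1).

    The conditional density f(X_t | X^{t-1}) is N(rho * X_{t-1}, 1), so each
    increment is log f_i - log f_j; the Gaussian normalizing constants cancel.
    """
    llr = 0.0
    prev = x0
    for x in xs:
        llr += 0.5 * (x - rho_j * prev) ** 2 - 0.5 * (x - rho_i * prev) ** 2
        prev = x
    return llr
```

Because the increments depend on the previous observation, the LLR here is not a random walk, which is exactly the kind of setting in which the normalized process n^{-1} λ_{ij}(n) is required to converge r-quickly or r-completely.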

A multi-hypothesis sequential test is a pair

δ = (d, T)

, where T is a stopping time with respect to the filtration

{F_{n}}_{n \in Z_{+}}

and

d = d (X^{T})

is an

F_{T}

-measurable terminal decision function with values in the set

{0, 1, \dots, N}

. Specifically,

d = i

means that the hypothesis

H_{i}

is accepted upon stopping, i.e.,

\{d = i\} = \{T < \infty, δ accepts H_{i}\}

. Let

α_{i j} (δ) = P_{i} (d = j)

,

i \neq j

,

i, j = 0, 1, \dots, N

denote the error probabilities of the test

δ

, i.e., the probabilities of accepting the hypothesis

H_{j}

when

H_{i}

is true.

Introduce the class of tests with probabilities of errors

α_{i j} (δ)

that do not exceed the prespecified numbers

0 < α_{i j} < 1

:

C (α) = \{δ : α_{i j} (δ) \leq α_{i j} for i, j = 0, 1, \dots, N, i \neq j\},

(28)

where

α = (α_{i j})

is a matrix of given error probabilities that are positive numbers less than 1.

Let

E_{i}

denote the expectation under the hypothesis

H_{i}

(i.e., under the measure

P_{i}

). The goal of a statistician is to find a sequential test that would minimize the expected sample sizes

E_{i} [T]

for all hypotheses

H_{i}

,

i = 0, 1, \dots, N

at least approximately, say asymptotically for small probabilities of errors, i.e., as

α_{i j} \to 0

.

3.1.1. Asymptotic Optimality of Wald’s SPRT

First, assume that

N = 1

, i.e., that we are dealing with two hypotheses

H_{0}

and

H_{1}

. In the mid-1940s, Wald [11,12] introduced the sequential probability ratio test (SPRT) for the sequence of i.i.d. observations

X_{1}, X_{2}, \dots

, in which case

f_{i, t} (X_{t} | X^{t - 1}) = f_{i} (X_{t})

in (27) and the LR

Λ_{1, 0} (n) = Λ_{n}

is

Λ_{n} = \prod_{t = 1}^{n} \frac{f_{1} (X_{t})}{f_{0} (X_{t})} .

After n observations have been made, Wald’s SPRT prescribes for each

n \geq 1

:

\begin{matrix} stop and accept H_{1} & if Λ_{n} \geq A_{1}; \\ stop and accept H_{0} & if Λ_{n} \leq A_{0}; \\ continue sampling & if A_{0} < Λ_{n} < A_{1}, \end{matrix}

where

0 < A_{0} < 1 < A_{1}

are two thresholds.

Let

Z_{t} = log [f_{1} (X_{t}) / f_{0} (X_{t})]

be the LLR for the observation

X_{t}

, so the LLR for the sample

X^{n}

is the sum

λ_{10} (n) = λ_{n} = \sum_{t = 1}^{n} Z_{t}, n = 1, 2, \dots

Let

a_{0} = - log A_{0} > 0

and

a_{1} = log A_{1} > 0

. The SPRT

δ_{*} (a_{0}, a_{1}) = (d_{*}, T_{*})

can be represented in the form

T_{*} (a_{0}, a_{1}) = inf \{n \geq 1 : λ_{n} \notin (- a_{0}, a_{1})\}, d_{*} (a_{0}, a_{1}) = \{\begin{matrix} 1 & if λ_{T_{*}} \geq a_{1} \\ 0 & if λ_{T_{*}} \leq - a_{0} . \end{matrix}

(29)
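The representation (29) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original development: the function names and the Gaussian mean-shift model (unit variance, means `mu0` and `mu1`) are assumptions chosen only to make the example concrete.

```python
import math
import random


def gaussian_llr_increment(x, mu0=0.0, mu1=1.0, sigma=1.0):
    """Z_t = log[f_1(x) / f_0(x)] for N(mu1, sigma^2) versus N(mu0, sigma^2)."""
    return (mu1 - mu0) * (x - 0.5 * (mu0 + mu1)) / sigma**2


def sprt(observations, a0, a1, llr_increment=gaussian_llr_increment):
    """Run the SPRT (29): stop at the first n with lambda_n outside (-a0, a1).

    Returns (decision, stopping_time). The decision is None when the supplied
    data run out before either boundary is crossed.
    """
    llr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        llr += llr_increment(x)
        if llr >= a1:
            return 1, n  # accept H_1
        if llr <= -a0:
            return 0, n  # accept H_0
    return None, n


# Usage: data generated under H_1 (hypothetical parameters).
rng = random.Random(0)
data = (rng.gauss(1.0, 1.0) for _ in range(10_000))
decision, stopping_time = sprt(data, a0=math.log(100), a1=math.log(100))
```

The test statistic is updated additively, which is exactly the random-walk structure of the LLR in the i.i.d. case.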

In the case of two hypotheses, the class of tests (28) is of the form

C (α_{0}, α_{1}) = \{δ : α_{0} (δ) \leq α_{0} and α_{1} (δ) \leq α_{1}\} .

That is, it includes hypothesis tests with upper bounds

α_{0}

and

α_{1}

on the probabilities of errors of Type 1 (false positive)

α_{0} (δ) = α_{0, 1} (δ)

and Type 2 (false negative)

α_{1} (δ) = α_{1, 0} (δ)

, respectively.

Wald’s SPRT has an extraordinary optimality property: it minimizes both expected sample sizes

E_{0} [T]

and

E_{1} [T]

in the class of sequential (and non-sequential) tests

C (α_{0}, α_{1})

with given error probabilities as long as the observations are i.i.d. under both hypotheses. More specifically, Wald and Wolfowitz [13] proved, using a Bayesian approach, that if

α_{0} + α_{1} < 1

and thresholds

- a_{0}

and

a_{1}

can be selected in such a way that

α_{0} (δ_{*}) = α_{0}

and

α_{1} (δ_{*}) = α_{1}

, then the SPRT

δ_{*}

is strictly optimal in class

C (α_{0}, α_{1})

. A rigorous proof of this fundamental result is tedious and involves several delicate technical details. Alternative proofs can be found in [14,15,16,17,18].

The strict optimality of the SPRT holds if and only if the thresholds are selected so that the error probabilities of the SPRT are exactly equal to the prescribed values

α_{0}, α_{1}

, which is usually impossible. Setting strict optimality aside, suppose that thresholds

a_{0}

and

a_{1}

are so selected that

a_{0} \sim log (1 / α_{1}) and a_{1} \sim log (1 / α_{0}) as α_{\max} \to 0 .

(30)

Then

E_{1} [T_{*}] \sim \frac{| log α_{0} |}{I_{1}}, E_{0} [T_{*}] \sim \frac{| log α_{1} |}{I_{0}} as α_{\max} \to 0,

(31)

where

I_{1} = E_{1} [Z_{1}]

and

I_{0} = E_{0} [- Z_{1}]

are Kullback–Leibler (K-L) information numbers so that the following asymptotic lower bounds for expected sample sizes are attained by SPRT:

inf_{δ \in C (α_{0}, α_{1})} E_{1} [T] \geq \frac{| log α_{0} |}{I_{1}} + o (1), inf_{δ \in C (α_{0}, α_{1})} E_{0} [T] \geq \frac{| log α_{1} |}{I_{0}} + o (1) as α_{\max} \to 0

(cf. [6]). Hereafter,

α_{\max} = max (α_{0}, α_{1})

.

The following inequalities for the error probabilities of the SPRT hold in the most general non-i.i.d. case

α_{1} (δ_{*}) \leq exp {- a_{0}} [1 - α_{0} (δ_{*})], α_{0} (δ_{*}) \leq exp {- a_{1}} [1 - α_{1} (δ_{*})] .

(32)

These bounds can be used to guarantee asymptotic relations (30).
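As a rough numerical illustration of (30)–(32), the sketch below computes thresholds that keep the SPRT in the class C(α_0, α_1) by virtue of (32), together with the first-order approximations (31) to the expected sample sizes. The function names are hypothetical, and the K-L numbers in the usage example (I_0 = I_1 = 1/2, which corresponds to a unit mean shift in unit-variance Gaussian noise) are an assumed example.

```python
import math


def wald_thresholds(alpha0, alpha1):
    """Return (a0, a1) with a0 = log(1/alpha1) and a1 = log(1/alpha0).

    By (32), alpha1(delta*) <= exp(-a0) and alpha0(delta*) <= exp(-a1), so
    this choice guarantees the error constraints and satisfies (30).
    """
    if not (0.0 < alpha0 < 1.0 and 0.0 < alpha1 < 1.0):
        raise ValueError("error probabilities must lie in (0, 1)")
    return math.log(1.0 / alpha1), math.log(1.0 / alpha0)


def first_order_sample_sizes(alpha0, alpha1, info0, info1):
    """First-order approximations (31).

    Returns (E_0[T*], E_1[T*]) ~ (|log alpha1| / I_0, |log alpha0| / I_1).
    """
    return abs(math.log(alpha1)) / info0, abs(math.log(alpha0)) / info1


# Usage with assumed values alpha0 = 0.01, alpha1 = 0.001, I_0 = I_1 = 0.5.
a0, a1 = wald_thresholds(0.01, 0.001)
ess0, ess1 = first_order_sample_sizes(0.01, 0.001, 0.5, 0.5)
```

These thresholds ignore the overshoot of the LLR over the boundaries, so the actual error probabilities are typically smaller than the prescribed ones.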

In the i.i.d. case, by the SLLN, the LLR

λ_{n}

has the following stability property

n^{- 1} λ_{n} \to_{n \to \infty}^{P_{1} - a . s .} I_{1}, n^{- 1} (- λ_{n}) \to_{n \to \infty}^{P_{0} - a . s .} I_{0} .

(33)

This allows one to conjecture that, if in the general non-i.i.d. case, the LLR is also stable in the sense that the almost sure convergence conditions (33) are satisfied with some positive and finite numbers

I_{1}

and

I_{0}

, then the asymptotic formulas (31) still hold. In the general case, these numbers represent the local K-L information in the sense that often (while not always)

I_{1} = {lim}_{n \to \infty} n^{- 1} E_{1} [λ_{n}]

and

I_{0} = {lim}_{n \to \infty} n^{- 1} E_{0} [- λ_{n}]

. Note, however, that in the general non-i.i.d. case, the SLLN does not even guarantee the finiteness of the expected sample sizes

E_{i} [T_{*}]

of the SPRT, so some additional conditions are needed, such as a certain rate of convergence in the strong law, e.g., complete or quick convergence.

In 1981, Lai [7] was the first to prove the asymptotic optimality of Wald’s SPRT in a general non-i.i.d. case as

α_{\max} = max (α_{0}, α_{1}) \to 0

. While the motivation was the near optimality of invariant SPRTs with respect to nuisance parameters, Lai proved a more general result using the r-quick convergence concept.

Specifically, for

0 < I_{0} < \infty

and

0 < I_{1} < \infty

, define

L_{1} (ε) = sup \{n \geq 1 : | n^{- 1} λ_{n} - I_{1} | \geq ε\} and L_{0} (ε) = sup \{n \geq 1 : | n^{- 1} λ_{n} + I_{0} | \geq ε\}

(

sup {⌀} = 0

) and suppose that

E_{i} [L_{i} {(ε)}^{r}] < \infty

(

i = 0, 1

) for some

r > 0

and every

ε > 0

, i.e., that the normalized LLR converges r-quickly to

I_{1}

under

P_{1}

and to

- I_{0}

under

P_{0}

:

n^{- 1} λ_{n} \to_{n \to \infty}^{P_{1} - r - quickly} I_{1} and n^{- 1} λ_{n} \to_{n \to \infty}^{P_{0} - r - quickly} - I_{0} .

(34)
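To make the last entry time L_1(ε) more tangible, the following sketch evaluates it along a finite LLR path and estimates a truncated version of E_1[L_1(ε)^r] by Monte Carlo. Everything here is an assumption made for illustration: the Gaussian model with Z_t = X_t − 1/2 and I_1 = 1/2, the function names, and above all the finite horizon, which can only underestimate the true supremum over all n.

```python
import random


def last_exit_time(llr_path, info, eps):
    """Largest n <= len(llr_path) with |lambda_n / n - info| >= eps (0 if none)."""
    last = 0
    for n, lam in enumerate(llr_path, start=1):
        if abs(lam / n - info) >= eps:
            last = n
    return last


def truncated_moment(r, eps, horizon=2000, runs=200, seed=0):
    """Monte Carlo estimate of E_1[L_1(eps)^r] truncated at `horizon`.

    Assumed model: X_t ~ N(1, 1) under P_1 versus N(0, 1) under P_0,
    so that Z_t = X_t - 1/2 and I_1 = 1/2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        lam, path = 0.0, []
        for _ in range(horizon):
            lam += rng.gauss(1.0, 1.0) - 0.5
            path.append(lam)
        total += last_exit_time(path, 0.5, eps) ** r
    return total / runs
```

A finite value of the truncated estimate does not prove r-quick convergence; it merely indicates how heavy the tail of the last entry time is within the simulated horizon.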

Strengthening the a.s. convergence (33) into the r-quick version (34), Lai [7] established the first-order asymptotic optimality of Wald’s SPRT for moments of the stopping time distribution up to order r: If thresholds

a_{1} (α_{0}, α_{1})

and

- a_{0} (α_{0}, α_{1})

in the SPRT are so selected that

δ_{*} (a_{0}, a_{1}) \in C (α_{0}, α_{1})

and asymptotics (30) hold, then as

α_{\max} \to 0

,

\begin{matrix} inf_{δ \in C (α_{0}, α_{1})} E_{1} [T^{r}] & \sim {(\frac{| log α_{0} |}{I_{1}})}^{r} \sim E_{1} [T_{*}^{r}], \\ inf_{δ \in C (α_{0}, α_{1})} E_{0} [T^{r}] & \sim {(\frac{| log α_{1} |}{I_{0}})}^{r} \sim E_{0} [T_{*}^{r}] . \end{matrix}

(35)

Wald’s ideas have been generalized in many publications to construct sequential tests of composite hypotheses with nuisance parameters when these hypotheses can be reduced to simple ones by the principle of invariance. If

M_{n}

is the maximal invariant statistic and

p_{i} (M_{n})

is the density of this statistic under hypothesis

H_{i}

, then the invariant SPRT is defined as in (29) with the LLR

λ_{n} = log [p_{1} (M_{n}) / p_{0} (M_{n})]

. However, even if the observations

X_{1}, X_{2}, \dots

are i.i.d., the invariant LLR statistic

λ_{n}

is not a random walk anymore and Wald’s methods cannot be applied directly. Lai [7] has applied the asymptotic optimality property (35) of Wald’s SPRT in the non-i.i.d. case to investigate the optimality properties of several classical invariant SPRTs such as the sequential t-test, the sequential

T^{2}

-test, and Savage’s rank-order test.

In the sequel, we will call the case where the a.s. convergence in the non-i.i.d. model (33) holds with the rate

1 / n

asymptotically stationary. Assume now that (33) is generalized to

λ_{n} / ψ (n) \to_{n \to \infty}^{P_{1} - a . s .} I_{1}, (- λ_{n}) / ψ (n) \to_{n \to \infty}^{P_{0} - a . s .} I_{0},

(36)

where

ψ (t)

is a positive increasing function. If

ψ (t)

is not linear, then this case will be referred to as asymptotically non-stationary.

A simple example where this generalization is needed is testing

H_{0}

versus

H_{1}

regarding the mean of the normal distribution:

X_{n} = i S_{n} + ξ_{n}, n \in Z_{+}, i = 0, 1,

where

{ξ_{n}}_{n \geq 1}

is a zero-mean i.i.d. standard Gaussian sequence

N (0, 1)

and

S_{n} = \sum_{j = 0}^{k} c_{j} n^{j}

is a polynomial of order

k \geq 1

. Then,

λ_{n} = \sum_{t = 1}^{n} S_{t} X_{t} - \frac{1}{2} \sum_{t = 1}^{n} S_{t}^{2},

E_{1} [λ_{n}] = - E_{0} [λ_{n}] = \frac{1}{2} \sum_{t = 1}^{n} S_{t}^{2} \sim c_{k}^{2} n^{2 k}

for a large n, so

ψ (n) = n^{2 k}

and

I_{1} = I_{0} = c_{k}^{2} / 2

in (36). This example is of interest for certain practical applications, in particular, for the recognition of ballistic objects and satellites.

Tartakovsky et al. ([6] Section 3.4) generalized Lai’s results for the asymptotically non-stationary case. Write

Ψ (t)

for the inverse function of

ψ (t)

.

Theorem 4 (SPRT asymptotic optimality).

Let

r \geq 1

. Assume that there exist finite positive numbers

I_{0}

and

I_{1}

and an increasing non-negative function

ψ (t)

such that the r-quick convergence conditions

\frac{λ_{n}}{ψ (n)} \to_{n \to \infty}^{P_{1} - r - quickly} I_{1}, \frac{- λ_{n}}{ψ (n)} \to_{n \to \infty}^{P_{0} - r - quickly} I_{0}

hold. If thresholds

- a_{0} (α_{0}, α_{1})

and

a_{1} (α_{0}, α_{1})

are selected so that

δ_{*} (a_{0}, a_{1}) \in C (α_{0}, α_{1})

and

a_{0} \sim | log α_{1} |

and

a_{1} \sim | log α_{0} |

, then, as

α_{\max} \to 0

,

\begin{matrix} inf_{δ \in C (α_{0}, α_{1})} E_{1} [T^{r}] & \sim {[Ψ (\frac{| log α_{0} |}{I_{1}})]}^{r} \sim E_{1} [T_{*}^{r}], \\ inf_{δ \in C (α_{0}, α_{1})} E_{0} [T^{r}] & \sim {[Ψ (\frac{| log α_{1} |}{I_{0}})]}^{r} \sim E_{0} [T_{*}^{r}] . \end{matrix}

(37)

This theorem implies that the SPRT asymptotically minimizes the moments of the stopping time distribution up to order r.

The proof of this theorem is performed in two steps which are related to our previous discussion of the rates of convergence in Section 2. The first step is to obtain the asymptotic lower bounds in class

C (α_{0}, α_{1})

:

\underset{α_{\max} \to 0}{lim inf} \frac{{inf}_{δ \in C (α_{0}, α_{1})} E_{1} [T^{r}]}{{[Ψ (| log α_{0} | / I_{1})]}^{r}} \geq 1, \underset{α_{\max} \to 0}{lim inf} \frac{{inf}_{δ \in C (α_{0}, α_{1})} E_{0} [T^{r}]}{{[Ψ (| log α_{1} | / I_{0})]}^{r}} \geq 1 .

These bounds hold whenever the following right-tail conditions for the LLR are satisfied for every ε > 0:

\begin{matrix} lim_{M \to \infty} P_{1} \{\frac{1}{ψ (M)} max_{1 \leq n \leq M} λ_{n} \geq (1 + ε) I_{1}\} & = 0, \\ lim_{M \to \infty} P_{0} \{\frac{1}{ψ (M)} max_{1 \leq n \leq M} (- λ_{n}) \geq (1 + ε) I_{0}\} & = 0 . \end{matrix}

Note that, by Lemma 1, these conditions are satisfied when the SLLN (36) holds so that the almost sure convergence (36) is sufficient. However, as we already mentioned, the SLLN for the LLR is not sufficient to guarantee even the finiteness of the SPRT stopping time.

The second step is to show that the lower bounds are attained by the SPRT. To do so, it suffices to impose the following additional left-tail conditions:

\begin{matrix} \sum_{n = 1}^{\infty} n^{r - 1} P_{1} \{λ_{n} \leq (I_{1} - ε) ψ (n)\} < \infty, \sum_{n = 1}^{\infty} n^{r - 1} P_{0} \{- λ_{n} \leq (I_{0} - ε) ψ (n)\} < \infty \end{matrix}

for all

0 < ε < min (I_{0}, I_{1})

. Since both right-tail and left-tail conditions hold if the LLR converges r-completely to

I_{i}

,

\sum_{n = 1}^{\infty} n^{r - 1} P_{1} \{|\frac{λ_{n}}{ψ (n)} - I_{1}| \geq ε\} < \infty, \sum_{n = 1}^{\infty} n^{r - 1} P_{0} \{|\frac{λ_{n}}{ψ (n)} + I_{0}| \geq ε\} < \infty,

and since r-quick convergence implies r-complete convergence (see (12)), we conclude that the assertions (37) hold.

Remark 3.

In the i.i.d. case, Wald’s approach allows us to establish asymptotic equalities (37) with

I_{1} = E_{1} [λ_{1}]

and

I_{0} = - E_{0} [λ_{1}]

being K-L information numbers under the sole condition of finiteness of

I_{i}

. However, Wald’s approach breaks down in the non-i.i.d. case. Certain generalizations in the case of independent but non-identically distributed and substantially non-stationary observations, extending Wald’s ideas, were considered in [19,20,21]. Theorem 4 covers all these non-stationary models.

Fellouris and Tartakovsky [22] extended previous results on the asymptotic optimality of the SPRT to the case of the multistream hypothesis testing problem when the observations are sequentially acquired in multiple data streams (or channels or sources). The problem is to test the null hypothesis

H_{0}

that none of the N streams are affected against the composite hypothesis

H_{B}

that a subset

B \subset {1, \dots, N}

is affected. Write

P_{B}

and

E_{B}

for the distribution of observations and expectation under hypothesis

H_{B}

. Let

P

denote a class of subsets of

{1, \dots, N}

that incorporates prior information which is available regarding the subset of affected streams, e.g., not more than

K < N

streams can be affected. (In many practical problems, K is substantially smaller than the total number of streams N, which can be very large.)

Two sequential tests were studied in [22]—the generalized sequential likelihood ratio test and the mixture sequential likelihood ratio test. It has been shown that both tests are first-order asymptotically optimal, minimizing the moments of the sample size

E_{0} [T^{r}]

and

E_{B} [T^{r}]

for all

B \in P

up to order r as

max (α_{0}, α_{1}) \to 0

in the class of tests

C_{P} (α_{0}, α_{1}) = \{δ : P_{0} (d = 1) \leq α_{0} and max_{B \in P} P_{B} (d = 0) \leq α_{1}\}, 0 < α_{i} < 1 .

The proof is essentially based on the concept of r-complete convergence of LLR with the rate

1 / n

. See also Chapter 1 in [5].

3.1.2. Asymptotic Optimality of the Multi-hypothesis SPRT

We now return to the multi-hypothesis model with

N > 1

that we started to discuss at the beginning of this section (see (27) and (28)). The problem of the sequential testing of many hypotheses is substantially more difficult than that of testing two hypotheses. For multiple-decision testing problems, it is usually very difficult, if at all possible, to obtain optimal solutions. Finding an optimal non-Bayesian test in the class of tests (28) that minimizes expected sample sizes

E_{i} [T]

for all hypotheses

H_{i}

,

i = 0, 1, \dots, N

is not manageable even in the i.i.d. case. For this reason, a substantial part of the development of sequential multi-hypothesis testing in the 20th century has been directed towards the study of certain combinations of one-sided sequential probability ratio tests when observations are i.i.d. (see, e.g., [23,24,25,26,27,28]).

We will focus on the following first-order asymptotic criterion: Find a multi-hypothesis test

δ_{*} (α) = (d_{*} (α), T_{*} (α))

such that, for some

r \geq 1

,

lim_{α_{\max} \to 0} \frac{{inf}_{δ \in C (α)} E_{i} [T^{r}]}{E_{i} [T_{*} {(α)}^{r}]} = 1 for all i = 0, 1, \dots, N,

(38)

where

α_{\max} = {max}_{0 \leq i, j \leq N, i \neq j} α_{i j}

.

In 1998, Tartakovsky [4] was the first to consider sequential multiple hypothesis testing problems for general non-i.i.d. stochastic models, following Lai’s idea of exploiting the r-quick convergence in the SLLN for two hypotheses. The results were obtained for both discrete and continuous-time scenarios and for the asymptotically non-stationary case where the LLR processes between hypotheses converge to finite numbers with the rate

1 / ψ (t)

. Two multi-hypothesis tests were investigated: (1) the rejecting test, which rejects the hypotheses one by one, and the last hypothesis, which is not rejected, is accepted; and (2) the matrix accepting test that accepts a hypothesis for which all component SPRTs that involve this hypothesis vote for accepting it.

We now proceed with introducing this accepting test which we will refer to as the matrix SPRT (MSPRT). In the present article, we do not consider the continuous-time scenarios. Those who are interested in continuous time are referred to [4,6,19,21,29].

Write

N = {0, 1, \dots, N}

. For a threshold matrix

{(A_{i j})}_{i, j \in N}

, with

A_{i j} > 0

and the

A_{i i}

being immaterial (say 0), define the matrix SPRT

δ_{*}^{N} = (T_{*}^{N}, d_{*}^{N})

, built on

(N + 1) N / 2

one-sided SPRTs between the hypotheses

H_{i}

and

H_{j}

, as follows:

Stop at the first n \geq 1 such that, for some i, Λ_{i j} (n) \geq A_{j i} for all j \neq i,

(39)

and accept the unique

H_{i}

that satisfies these inequalities. Note that, for

N = 1

, the MSPRT coincides with Wald’s SPRT.

In the following, we omit the superscript N in

δ_{*}^{N} = (T_{*}^{N}, d_{*}^{N})

for brevity. Obviously, with

a_{j i} = log A_{j i}

, the MSPRT in (39) can be written as

\begin{matrix} T_{*} & = inf \{n \geq 1 : λ_{i j} (n) \geq a_{j i} for all j \neq i and some i\}, \end{matrix}

(40)

\begin{matrix} d_{*} & = i for which (40) holds . \end{matrix}

(41)

Introducing the Markov accepting times for the hypotheses

H_{i}

as

T_{i} = inf \{n \geq 1 : λ_{i 0} (n) \geq max_{\overset{0 \leq j \leq N}{j \neq i}} [λ_{j 0} (n) + a_{j i}]\}, i = 0, 1, \dots, N,

(42)

the test in (40), (41) can be also written in the following form:

T_{*} = min_{0 \leq j \leq N} T_{j}, d_{*} = i if T_{*} = T_{i} .

(43)

Thus, in the MSPRT, each component SPRT is extended until, for some

i \in N

, all N SPRTs involving

H_{i}

accept

H_{i}

.
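The stopping rule (40) and the decision (41) can be sketched as follows. The sketch assumes that the LLRs λ_{i0}(n) against H_0 have already been computed (with λ_{00}(n) = 0), uses λ_{ij}(n) = λ_{i0}(n) − λ_{j0}(n), and takes positive thresholds so that at most one hypothesis can satisfy the inequalities at a given n. The function name and data layout are illustrative assumptions.

```python
def msprt(llr_paths, a):
    """Matrix SPRT (40)-(41) driven by precomputed LLRs.

    llr_paths[n - 1][i] holds lambda_{i0}(n), with lambda_{00}(n) = 0.
    a[j][i] is the threshold a_{ji}; diagonal entries are ignored.
    Returns (decision, stopping_time), or (None, number_of_steps) when no
    hypothesis is accepted within the supplied path.
    """
    n = 0
    for n, lam in enumerate(llr_paths, start=1):
        hyps = range(len(lam))
        for i in hyps:
            # lambda_{ij}(n) = lambda_{i0}(n) - lambda_{j0}(n) >= a_{ji} for all j != i
            if all(lam[i] - lam[j] >= a[j][i] for j in hyps if j != i):
                return i, n
    return None, n


# Usage with three hypotheses and a common threshold (assumed values).
thresholds = [[2.0] * 3 for _ in range(3)]
path = [[0.0, 1.0, -1.0], [0.0, 3.0, 0.5]]
decision, stopping_time = msprt(path, thresholds)
```

Each pass over the hypotheses costs O(N^2) comparisons per observation, which matches the (N + 1)N/2 pairwise one-sided tests underlying the MSPRT.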

Using Wald’s likelihood ratio identity, it is easily shown that

α_{i j} (δ_{*}) \leq exp (- a_{i j})

for

i, j \in N

,

i \neq j

, so selecting

a_{j i} = | log α_{j i} |

implies that

δ_{*} \in C (α)

. These inequalities are similar to Wald’s ones in the binary hypothesis case and are very imprecise. In his ingenious paper, Lorden [27] showed that, with a very sophisticated design that includes the accurate estimation of thresholds accounting for overshoots, the MSPRT is nearly optimal in the third-order sense, i.e., it minimizes the expected sample sizes for all hypotheses up to an additive disappearing term:

{inf}_{δ \in C (α)} E_{i} [T] = E_{i} [T_{*}] + o (1)

as

α_{\max} \to 0

. This result only holds for i.i.d. models with the finite second moment

E_{i} [λ_{i j} {(1)}^{2}] < \infty

. In the non-i.i.d. case (and even in the i.i.d. case for higher moments

r > 1

), there is no way to obtain such a result, so we focus on the first-order optimality (38).

The following theorem establishes asymptotic operating characteristics and the optimality of MSPRT under the r-quick convergence of

λ_{i j} (n) / ψ (n)

to finite K-L-type numbers

I_{i j}

, where

ψ (n)

is a positive increasing function,

ψ (\infty) = \infty

.

Theorem 5 (MSPRT asymptotic optimality [4]).

Let

r \geq 1

. Assume that there exist finite positive numbers

I_{i j}

,

i, j = 0, 1, \dots, N

,

i \neq j

and an increasing non-negative function

ψ (t)

such that, for some

r > 0

,

\frac{λ_{i j} (n)}{ψ (n)} \to_{n \to \infty}^{P_{i} - r - quickly} I_{i j} for all i, j = 0, 1, \dots, N, i \neq j .

(44)

Then, the following assertions are true.

(i): For $i = 0, 1, \dots, N$ ,

$E_{i} [T_{*}^{r}] \sim {[Ψ (max_{\overset{0 \leq j \leq N}{j \neq i}} \frac{a_{j i}}{I_{i j}})]}^{r} as min_{j, i} a_{j i} \to \infty .$

(45)
(ii): If the thresholds are so selected that $α_{i j} (δ_{*}) \leq α_{i j}$ and $a_{j i} \sim | log α_{j i} |$ , in particular $a_{j i} = | log α_{j i} |$ , then for all $i = 0, 1, \dots, N$

$inf_{δ \in C (α)} E_{i} [T^{r}] \sim {[Ψ (max_{\overset{0 \leq j \leq N}{j \neq i}} \frac{| log α_{j i} |}{I_{i j}})]}^{r} \sim E_{i} [T_{*}^{r}] as α_{\max} \to 0 .$

(46)

Assertion (ii) implies that the MSPRT asymptotically minimizes the moments of the stopping time distribution up to order r for all hypotheses

H_{0}, H_{1}, \dots, H_{N}

in the class of tests

C (α)

.

Remark 4.

Both assertions of Theorem 5 are correct under the r-complete convergence

\frac{λ_{i j} (n)}{ψ (n)} \to_{n \to \infty}^{P_{i} - r - complete} I_{i j} for all i, j = 0, 1, \dots, N, i \neq j,

i.e., when

\sum_{n = 1}^{\infty} n^{r - 1} P_{i} \{| \frac{1}{ψ (n)} λ_{i j} (n) - I_{i j} | > ε\} < \infty for all ε > 0 .

While this statement has not been proven anywhere to date, it can be easily proven using the methods developed for multistream hypothesis testing and changepoint detection ([5] Ch 1, Ch 6).

Remark 5.

As shown in the example given in Section 3.4.3 of [6], the r-quick convergence conditions in Theorem 5 (or corresponding r-complete convergence conditions for LLR processes) cannot be generally relaxed into the almost sure convergence

\frac{λ_{i j} (n)}{ψ (n)} \to_{n \to \infty}^{P_{i} - a . s .} I_{i j} for all i, j = 0, 1, \dots, N, i \neq j .

(47)

However, the following weak asymptotic optimality result holds for the MSPRT under the a.s. convergence: if the a.s. convergence (47) holds with the power function

ψ (t) = t^{k}

,

k > 0

, then, for every

0 < ε < 1

,

inf_{δ \in C (α)} P_{i} (T > ε T_{*}) \to 1 as α_{\max} \to 0 for all i = 0, 1, \dots, N

(48)

whenever thresholds

a_{j i}

are selected as in Theorem 5 (ii).

Note that several interesting statistical and practical applications of these results to invariant sequential testing and multisample slippage scenarios are discussed in Sections 4.5 and 4.6 of Tartakovsky et al. [6] (see Mosteller [30] and Ferguson [16] for terminology regarding multisample slippage problems).

3.2. Sequential Changepoint Detection

Sequential (or quickest) changepoint detection is an important subfield of sequential analysis. The observations are made one at a time and, as long as their behavior suggests that the process of interest is in control (i.e., in a normal state), the process is allowed to continue. If the process is believed to be out of control, the goal is to detect the change in distribution as rapidly as possible. Quickest change detection problems have an enormous number of important applications, e.g., object detection in noise and clutter, industrial quality control, environment surveillance, failure detection, navigation, seismology, computer network security, genomics, and epidemiology (see, e.g., [31,32,33,34,35,36,37,38,39,40]). Many challenging application areas are discussed in the books by Tartakovsky, Nikiforov, and Basseville ([6] Ch 11) and Tartakovsky ([5] Ch 8).

3.2.1. Changepoint Models

The probability distribution of the observations

X = {X_{n}}_{n \in N}

is subject to a change at an unknown point in time

ν \in {0, 1, 2, \dots} = Z_{+}

so that

X_{1}, \dots, X_{ν}

are generated by one stochastic model and

X_{ν + 1}, X_{ν + 2}, \dots

are generated by another model. A sequential detection rule is a stopping time T for an observed sequence

{X_{n}}_{n \geq 1}

, i.e., T is an integer-valued random variable such that the event

{T = n}

belongs to the sigma-algebra

F_{n} = σ (X_{1}, \dots, X_{n})

generated by observations

X_{1}, \dots, X_{n}

.

Let

P_{\infty}

denote the probability measure corresponding to the sequence of observations

{X_{n}}_{n \geq 1}

when there is never a change (

ν = \infty

) and, for

k = 0, 1, 2, \dots

, let

P_{k}

denote the measure corresponding to the sequence

{X_{n}}_{n \geq 1}

when

ν = k < \infty

. We denote the hypothesis that the change never occurs by

H_{\infty} : ν = \infty

and we denote the hypothesis that the change occurs at time

0 \leq k < \infty

by

H_{k} : ν = k

.

First consider a general non-i.i.d. model assuming that the observations may have a very general stochastic structure. Specifically, if we let, as before,

X^{n} = (X_{1}, \dots, X_{n})

denote the sample of size n, then when

ν = \infty

(there is no change), the conditional density of

X_{n}

given

X^{n - 1}

is

g_{n} (X_{n} | X^{n - 1})

for all

n \geq 1

and when

ν = k < \infty

, then the conditional density is

g_{n} (X_{n} | X^{n - 1})

for

n \leq k

and

f_{n} (X_{n} | X^{n - 1})

for

n > k

. Thus, for the general non-i.i.d. changepoint model, the joint density

p (X^{n} | H_{k})

under hypothesis

H_{k}

can be written as follows

p (X^{n} | H_{k}) = \{\begin{matrix} \prod_{t = 1}^{n} g_{t} (X_{t} | X^{t - 1}) & for ν = k \geq n, \\ \prod_{t = 1}^{k} g_{t} (X_{t} | X^{t - 1}) \times \prod_{t = k + 1}^{n} f_{t} (X_{t} | X^{t - 1}) & for ν = k < n, \end{matrix}

(49)

where

g_{n} (X_{n} | X^{n - 1})

is the pre-change conditional density and

f_{n} (X_{n} | X^{n - 1})

is the post-change conditional density which may depend on

ν

,

f_{n} (X_{n} | X^{n - 1}) = f_{n}^{(ν)} (X_{n} | X^{n - 1})

, but we will omit the superscript

ν

for brevity.

The classical changepoint detection problem deals with the i.i.d. case where there is a sequence of observations

X_{1}, X_{2}, \dots

that are identically distributed with a probability density function (pdf)

g (x)

for

n \leq ν

and with a pdf

f (x)

for

n > ν

. That is, in the i.i.d. case, the joint density of the vector

X^{n} = (X_{1}, \dots, X_{n})

under hypothesis

H_{k}

has the form

p (X^{n} | H_{k}) = \{\begin{matrix} \prod_{t = 1}^{n} g (X_{t}) & for ν = k \geq n, \\ \prod_{t = 1}^{k} g (X_{t}) \times \prod_{t = k + 1}^{n} f (X_{t}) & for ν = k < n . \end{matrix}

(50)
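The i.i.d. changepoint model (50) is easy to evaluate numerically. The sketch below computes log p(X^n | H_k) for user-supplied log-densities; the Gaussian pre- and post-change densities and all function names are assumptions made for the example.

```python
import math


def gaussian_logpdf(x, mean):
    """Log-density of N(mean, 1) at x."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def log_joint_density(xs, k, log_g, log_f):
    """log p(X^n | H_k) in the i.i.d. model (50).

    Observations with index t <= k use the pre-change density g and those
    with t > k use the post-change density f; k = math.inf means no change.
    """
    return sum(
        log_g(x) if t <= k else log_f(x)
        for t, x in enumerate(xs, start=1)
    )


# Usage: assumed change from N(0, 1) to N(1, 1) after the second observation.
xs = [0.1, -0.3, 1.2, 0.8]
log_g = lambda x: gaussian_logpdf(x, 0.0)
log_f = lambda x: gaussian_logpdf(x, 1.0)
llr = log_joint_density(xs, 2, log_g, log_f) - log_joint_density(xs, math.inf, log_g, log_f)
```

The difference `llr` reduces to the sum of log[f(X_t)/g(X_t)] over the post-change indices t = 3, 4, since the pre-change factors cancel.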

Note that, as discussed in [5,6], in applications, there are two different kinds of changes—additive and non-additive. Additive changes lead to a change in the mean value of the sequence of observations. Non-additive changes are typically produced by a change in variance or covariance, i.e., these are spectral changes.

We now proceed by discussing the models for the change point

ν

. The change point

ν

may be considered either as an unknown deterministic number or as a random variable. If the change point is treated as a random variable, then the model has to be supplied with the prior distribution of the change point. There may be several changepoint mechanisms, and, as a result, a random variable

ν

may be dependent on or independent of the observations. In particular, Moustakides [41] assumed that

ν

can be a

{F_{n}}

-adapted stopping time. In this article, we will not discuss Moustakides’s concept of allowing the prior distribution to depend on some additional information available to “Nature” (see [5] for a detailed discussion); rather, when considering a Bayesian approach, we will assume that the prior distribution of the unknown change point is independent of the observations.

3.2.2. Popular Changepoint Detection Procedures

Before formulating the criteria of optimality in the next subsection, we begin by defining the three most popular and common change detection procedures, which are either optimal or nearly optimal in different settings. To define these procedures, we need to introduce the partial likelihood ratio and the corresponding log-likelihood ratio

{LR}_{t} = \frac{f_{t} (X_{t} | X^{t - 1})}{g_{t} (X_{t} | X^{t - 1})}, Z_{t} = log \frac{f_{t} (X_{t} | X^{t - 1})}{g_{t} (X_{t} | X^{t - 1})}, t = 1, 2, \dots

It is worth reiterating that, for general non-i.i.d. models, the post-change density often depends on the point of change,

f_{t} (X_{t} | X^{t - 1}) = f_{t}^{(ν)} (X_{t} | X^{t - 1})

, so in general

{LR}_{t} = {LR}_{t}^{(ν)}

and

Z_{t} = Z_{t}^{(ν)}

also depend on the change point

ν

. However, this is not the case for the i.i.d. model (50).

The CUSUM Procedure

We now introduce the Cumulative Sum (CUSUM) algorithm, which was first proposed by Page [42] for the i.i.d. model (50). Recall that we consider the changepoint detection problem as a problem of testing two hypotheses:

H_{ν}

that the change occurs at a fixed point

0 \leq ν < \infty

against the alternative

H_{\infty}

that the change never occurs. The LR between these hypotheses is

Λ_{n}^{ν} = \prod_{t = ν + 1}^{n} {LR}_{t}

for

ν < n

and 1 for

ν \geq n

. Since the hypothesis

H_{ν}

is composite, we may apply the generalized likelihood ratio (GLR) approach maximizing the LR

Λ_{n}^{ν}

over

ν

to obtain the GLR statistic

V_{n} = max_{0 \leq ν < n} \prod_{t = ν + 1}^{n} {LR}_{t}, n \geq 1 .

It is easy to verify that this statistic follows the recursion

V_{n} = max {1, V_{n - 1}} {LR}_{n}, n \geq 1, V_{0} = 1

(51)

as long as the partial LR

{LR}_{n}

does not depend on the change point, i.e., the post-change conditional density

f_{n} (X_{n} | X^{n - 1})

does not depend on

ν

. This is always the case for i.i.d. models (50) when

f_{n} (X_{n} | X^{n - 1}) = f (X_{n})

. However, as we already mentioned, for non-i.i.d. models,

f_{n} (X_{n} | X^{n - 1}) = f_{n}^{(ν)} (X_{n} | X^{n - 1})

often depends on the change point

ν

, so

{LR}_{n} = {LR}_{n}^{(ν)}

, in which case the recursion (51) does not hold.
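The equivalence between the GLR statistic and the recursion (51) can be checked numerically when the partial LRs do not depend on the change point. The sketch below compares a direct evaluation of the maximum with the recursion; the function names and the sample values are illustrative assumptions.

```python
import math


def glr_direct(lrs):
    """V_n = max over 0 <= nu < n of prod_{t = nu + 1}^{n} LR_t, for n = 1..len(lrs)."""
    return [
        max(math.prod(lrs[nu:n]) for nu in range(n))
        for n in range(1, len(lrs) + 1)
    ]


def glr_recursive(lrs):
    """Recursion (51): V_n = max(1, V_{n-1}) * LR_n with V_0 = 1."""
    v, out = 1.0, []
    for lr in lrs:
        v = max(1.0, v) * lr
        out.append(v)
    return out


# Usage with assumed partial likelihood ratios.
lrs = [2.0, 0.25, 4.0, 0.5]
same = glr_direct(lrs) == glr_recursive(lrs)
```

The direct evaluation costs O(n^2) products up to time n, whereas the recursion needs only one multiplication per observation.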

The logarithmic version of

V_{n}

,

W_{n} = log V_{n}

, is related to Page’s CUSUM statistic

G_{n}

introduced by Page [42] in the i.i.d. case as

G_{n} = max (0, W_{n})

. The statistic

G_{n}

can also be obtained via the GLR approach by maximizing the LLR

λ_{n}^{ν} = log Λ_{n}^{ν}

over

0 \leq ν < \infty

. However, since the hypotheses

H_{\infty}

and

H_{ν}

are indistinguishable for

ν \geq n

, the maximization over

ν \geq n

does not make very much sense. Note also that, in contrast to Page’s CUSUM statistic

G_{n}

, the statistic

W_{n}

may take values smaller than 0, so the CUSUM procedure

T_{CS} = inf {n \geq 1 : W_{n} \geq a}

(52)

makes sense even for negative values of the threshold a. Thus, it is more general than Page’s CUSUM. Note the recursions

W_{n} = W_{n - 1}^{+} + Z_{n}, n \geq 1, W_{0} = 0

(53)

and

G_{n} = {(G_{n - 1} + Z_{n})}^{+}, n \geq 1, G_{0} = 0

in cases where

Z_{n} = log [f_{n} (X_{n} | X^{n - 1}) / g_{n} (X_{n} | X^{n - 1})]

does not depend on

ν

.
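The recursions for W_n and G_n, and the stopping time (52), can be sketched as follows, assuming the LLR increments Z_n are supplied and do not depend on the change point. The function names are hypothetical.

```python
def cusum_w_path(llr_increments):
    """W_n = max(W_{n-1}, 0) + Z_n with W_0 = 0, as in (53)."""
    w, path = 0.0, []
    for z in llr_increments:
        w = max(w, 0.0) + z
        path.append(w)
    return path


def page_cusum_path(llr_increments):
    """Page's statistic G_n = max(G_{n-1} + Z_n, 0) with G_0 = 0."""
    g, path = 0.0, []
    for z in llr_increments:
        g = max(g + z, 0.0)
        path.append(g)
    return path


def cusum_stopping_time(llr_increments, a):
    """T_CS = inf{n >= 1 : W_n >= a} from (52); None if never reached."""
    for n, w in enumerate(cusum_w_path(llr_increments), start=1):
        if w >= a:
            return n
    return None


# Usage with assumed LLR increments.
zs = [-1.0, -1.0, 0.5, 1.0, 1.0]
w_path = cusum_w_path(zs)
g_path = page_cusum_path(zs)
alarm = cusum_stopping_time(zs, 2.0)
```

Along any path, G_n equals max(0, W_n), so the two procedures coincide for positive thresholds and differ only when a negative threshold is used with W_n.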

Shiryaev’s Procedure

In the i.i.d. case and for the zero-modified geometric prior distribution of the change point, Shiryaev [43] introduced the change detection procedure that prescribes the thresholding of the posterior probability

P (ν < n | X^{n})

. Introducing the statistic

S_{n}^{π} = \frac{P (ν < n | X^{n})}{1 - P (ν < n | X^{n})}

one can write the stopping time of the Shiryaev procedure in the general non-i.i.d. case and for an arbitrary prior

π

as

T_{SH} = inf \{n \geq 1 : S_{n}^{π} \geq A\},

(54)

where A (

A > 0

) is a threshold controlling for the false alarm risk. The statistic

S_{n}^{π}

can be written as

\begin{matrix} S_{n}^{π} & = \frac{1}{P (ν \geq n)} \sum_{k = 0}^{n - 1} π_{k} Λ_{n}^{k} \\ & = \frac{1}{P (ν \geq n)} \sum_{k = 0}^{n - 1} π_{k} \prod_{t = k + 1}^{n} {LR}_{t}, n \geq 1, S_{0}^{π} = 0, \end{matrix}

(55)

where the product

\prod_{t = i}^{j} {LR}_{t} = 1

for

j < i

.

Often (following Shiryaev’s assumptions), it is supposed that the change point

ν

is distributed according to the geometric distribution Geometric

(ϱ)

P (ν = k) = ϱ {(1 - ϱ)}^{k} for k = 0, 1, 2, \dots,

(56)

where

ϱ \in (0, 1)

.

If

{LR}_{n}

does not depend on the change point

ν

and the prior distribution is geometric (56), then the statistic

{\tilde{S}}_{n}^{ϱ} = S_{n}^{π} / ϱ

can be rewritten in the recursive form

{\tilde{S}}_{n}^{ϱ} = (1 + {\tilde{S}}_{n - 1}^{ϱ}) \frac{{LR}_{n}}{1 - ϱ}, n \geq 1, {\tilde{S}}_{0}^{ϱ} = 0 .

(57)

However, as mentioned above, this may not be the case for non-i.i.d. models, since

{LR}_{n}

often depends on

ν

.
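For the geometric prior (56), P(ν ≥ n) = (1 − ϱ)^n, and the agreement between the general formula (55) and the recursion (57) can be verified numerically. The sketch below assumes partial LRs that do not depend on the change point; the function names and sample values are illustrative.

```python
import math


def shiryaev_direct(lrs, rho):
    """S_n^pi / rho computed from (55) with pi_k = rho * (1 - rho)^k."""
    out = []
    for n in range(1, len(lrs) + 1):
        tail = (1.0 - rho) ** n  # P(nu >= n) under the geometric prior
        s = sum(
            rho * (1.0 - rho) ** k * math.prod(lrs[k:n])
            for k in range(n)
        ) / tail
        out.append(s / rho)
    return out


def shiryaev_recursive(lrs, rho):
    """Recursion (57): S_n = (1 + S_{n-1}) * LR_n / (1 - rho), S_0 = 0."""
    s, out = 0.0, []
    for lr in lrs:
        s = (1.0 + s) * lr / (1.0 - rho)
        out.append(s)
    return out


# Usage with assumed values.
lrs = [2.0, 0.5, 1.5]
direct = shiryaev_direct(lrs, 0.1)
recursive = shiryaev_recursive(lrs, 0.1)
```

The two lists agree up to floating-point rounding, while the recursive form avoids recomputing the products at every step.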

Shiryaev–Roberts Procedure

The generalized Shiryaev–Roberts (SR) change detection procedure is based on the thresholding of the generalized SR statistic

R_{n}^{r_{0}} = r_{0} Λ_{n}^{0} + \sum_{k = 0}^{n - 1} Λ_{n}^{k} = r_{0} \prod_{t = 1}^{n} {LR}_{t} + \sum_{k = 0}^{n - 1} \prod_{t = k + 1}^{n} {LR}_{t}, n \geq 1,

(58)

with a non-negative head-start

R_{0} = r_{0}

,

r_{0} \geq 0

, i.e., the stopping time of the SR procedure is given by

T_{SR}^{r_{0}} = inf \{n \geq 1 : R_{n}^{r_{0}} \geq A\}, A > 0 .

(59)

This procedure is usually referred to as the SR-r detection procedure in contrast to the standard SR procedure

T_{SR} \equiv T_{SR}^{r_{0}}, r_{0} = 0

that starts with a zero initial condition

r_{0} = 0

. In the i.i.d. case (50), this modification of the SR procedure was introduced and studied in detail in [44,45].

If

{LR}_{n}

does not depend on the change point

ν

, then the SR-r detection statistic satisfies the recursion

R_{n}^{r_{0}} = (1 + R_{n - 1}^{r_{0}}) {LR}_{n}, n \geq 1, R_{0}^{r_{0}} = r_{0} .

Note that, as the parameter of the geometric prior distribution

ϱ \to 0

, the Shiryaev statistic

{\tilde{S}}_{n}^{ϱ}

converges to the SR statistic

R_{n}^{r_{0} = 0}

.
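The limiting relation between the two statistics can be illustrated numerically. The sketch below implements the SR-r recursion and the Shiryaev recursion (57) and evaluates both on the same short sequence of likelihood ratios; the function names and the sample values are hypothetical.

```python
def sr_statistic(lrs, r0=0.0):
    """SR-r recursion R_n = (1 + R_{n-1}) * LR_n with head-start R_0 = r0."""
    r = r0
    for lr in lrs:
        r = (1.0 + r) * lr
    return r


def shiryaev_scaled(lrs, rho):
    """Recursion (57) for S~_n = S_n^pi / rho under the Geometric(rho) prior."""
    s = 0.0
    for lr in lrs:
        s = (1.0 + s) * lr / (1.0 - rho)
    return s


lrs = [1.2, 0.8, 1.5]                         # hypothetical likelihood ratios
for rho in (0.1, 0.01, 1e-9):
    # The gap to the SR statistic (head-start 0) shrinks as rho decreases.
    print(rho, shiryaev_scaled(lrs, rho) - sr_statistic(lrs))
```

For this sequence the SR statistic equals 4.14, and the gap printed in the last line is of the order of ϱ.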

3.2.3. Optimality Criteria

The goal of online change detection is to detect the change with the smallest delay while controlling the false alarm rate at a given level. Tartakovsky et al. [6] suggested several changepoint problem settings, including Bayesian, minimax, and uniform (pointwise) approaches.

Let

E_{k}

denote the expectation with respect to measure

P_{k}

when the change occurs at

ν = k < \infty

and

E_{\infty}

with respect to

P_{\infty}

when there is no change.

In 1954, Page [42] suggested measuring the risk due to a false alarm by the mean time to false alarm

E_{\infty} [T]

and the risk associated with a true change detection by the mean time to detection

E_{0} [T]

when the change occurs at the very beginning. He called these performance characteristics the average run length (ARL). Page also introduced the now most famous change detection procedure—the CUSUM procedure (see (52) with

W_{n}

replaced by

G_{n}

)—and analyzed it using these operating characteristics in the i.i.d. case.

While the false alarm rate can be reasonably measured by the ARL to false alarm

ARL 2 FA (T) = E_{\infty} [T],

as Figure 1 suggests, the risk due to a true change detection can be reasonably measured by the conditional expected delay to detection

{CEDD}_{ν} (T) = E_{ν} [T - ν | T > ν], ν = 0, 1, 2, \dots

for any possible change point

ν \in Z_{+} = {0, 1, 2, \dots}

but not necessarily by the ARL to detection

E_{0} [T] \equiv {CEDD}_{0} (T)

. A good detection procedure has to guarantee small values of the expected detection delay

{CEDD}_{ν} (T)

for all change points

ν \in Z_{+}

when

ARL 2 FA (T)

is set at a certain level. However, if the false alarm risk is measured in terms of the ARL to false alarm, i.e., it is required that

ARL 2 FA (T) \geq γ

for some

γ \geq 1

, then a procedure that minimizes the conditional expected delay to detection

{CEDD}_{ν} (T)

uniformly over all

ν

does not exist. For this reason, we must resort to different optimality criteria, e.g., to Bayesian and minimax criteria.

Figure 1. Illustration of a single-run sequential changepoint detection. Two possibilities in the detection process: false alarm (left) and correct detection (right).
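The two operating characteristics discussed above can be estimated by Monte Carlo. The following minimal sketch does so for the standard SR procedure in an assumed i.i.d. Gaussian model (observations N(0, 1) before the change and N(θ, 1) after it); the model, the function names, and the truncation at max_n are illustrative assumptions and are not taken from the article.

```python
import math
import random


def sr_stopping_time(rng, nu, theta, threshold, max_n=100_000):
    """One run of the SR procedure; nu=None means no change (measure P_infinity).

    Observations are N(0, 1) for t <= nu and N(theta, 1) for t > nu.
    The run is truncated at max_n, which biases estimates only if max_n is reached.
    """
    r = 0.0
    for n in range(1, max_n + 1):
        mean = theta if (nu is not None and n > nu) else 0.0
        x = rng.gauss(mean, 1.0)
        lr = math.exp(theta * x - 0.5 * theta ** 2)   # Gaussian likelihood ratio
        r = (1.0 + r) * lr                            # SR recursion
        if r >= threshold:
            return n
    return max_n


def arl2fa(rng, theta, threshold, runs):
    """Monte Carlo estimate of E_infinity[T]."""
    return sum(sr_stopping_time(rng, None, theta, threshold) for _ in range(runs)) / runs


def cedd(rng, nu, theta, threshold, runs):
    """Monte Carlo estimate of E_nu[T - nu | T > nu]; false alarms are discarded."""
    delays = []
    for _ in range(runs):
        t = sr_stopping_time(rng, nu, theta, threshold)
        if t > nu:
            delays.append(t - nu)
    return sum(delays) / len(delays) if delays else float("nan")


rng = random.Random(12345)
print("ARL2FA :", arl2fa(rng, theta=1.0, threshold=20.0, runs=2000))
print("CEDD_0 :", cedd(rng, nu=0, theta=1.0, threshold=20.0, runs=2000))
print("CEDD_10:", cedd(rng, nu=10, theta=1.0, threshold=20.0, runs=2000))
```

Since R_n - n is a zero-mean martingale under P_∞, the ARL to false alarm of the SR procedure is at least the threshold A, which gives a simple sanity check for the first estimate. Comparing the estimates of CEDD_ν for several ν shows how the delay depends on the change point.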

Minimax Changepoint Optimization Criteria

There are two popular minimax criteria. The first one was introduced by Lorden [46]:

inf_{T} sup_{ν \in Z_{+}} ess sup E_{ν} [T - ν ∣ T > ν, F_{ν}] subject to ARL 2 FA (T) \geq γ .

This requires minimizing the conditional expected delay to detection

E_{ν} [T - ν ∣ T > ν, F_{ν}]

in the worst-case scenario with respect to both the change point

ν

and the trajectory

(X_{1}, \dots, X_{ν})

of the observed process in the class of detection procedures

C_{ARL} (γ) = \{T : ARL 2 FA (T) \geq γ\}, γ \geq 1,

for which the ARL to false alarm is at least the prespecified value

γ \in [1, \infty)

. Let

ESADD (T) = {sup}_{ν \geq 0} ess sup E_{ν} [T - ν ∣ T > ν, F_{ν}]

denote Lorden’s speed detection measure. Under Lorden’s minimax approach, the goal is to find a stopping time

T_{opt} \in C_{ARL} (γ)

such that

ESADD (T_{opt}) = inf_{T \in C_{ARL} (γ)} ESADD (T) for any γ \geq 1 .

In the classical i.i.d. scenario (50), Lorden [46] proved that the CUSUM detection procedure (52) is asymptotically first-order minimax optimal as

γ \to \infty

, i.e.,

inf_{T \in C_{ARL} (γ)} ESADD (T) = ESADD (T_{CS}) (1 + o (1)), γ \to \infty .

Later, in an ingenious paper, Moustakides [47] used optimal stopping theory to establish the exact optimality of CUSUM for any ARL to false alarm

γ \geq 1

.

Another popular, less pessimistic minimax criterion is from Pollak [48]:

inf_{T} sup_{ν \in Z_{+}} {CEDD}_{ν} (T) subject to ARL 2 FA (T) \geq γ,

which requires minimizing the conditional expected delay to detection

{CEDD}_{ν} (T) = E_{ν} [T - ν ∣ T > ν]

in the worst-case scenario with respect to the change point

ν

in class

C_{ARL} (γ)

. Under Pollak’s minimax approach, the goal is to find a stopping time

T_{opt} \in C_{ARL} (γ)

such that

sup_{ν \in Z_{+}} {CEDD}_{ν} (T_{opt}) = inf_{T \in C_{ARL} (γ)} sup_{ν \in Z_{+}} {CEDD}_{ν} (T) for any γ \geq 1 .

For the i.i.d. model (50), Pollak [48] showed that the modified SR detection procedure that starts from the quasi-stationary distribution of the SR statistic (i.e., the head-start

r_{0}

in the SR-r procedure is a specific random variable) is third-order asymptotically optimal as

γ \to \infty

, i.e., the best one can attain up to an additive term

o (1)

:

inf_{T \in C_{ARL} (γ)} sup_{ν \in Z_{+}} {CEDD}_{ν} (T) = sup_{ν \in Z_{+}} {CEDD}_{ν} (T_{SR}^{r_{0}}) + o (1), γ \to \infty,

where

o (1) \to 0

as

γ \to \infty

. Later, Tartakovsky et al. [49] proved that this is also true for the SR-r procedure (59) that starts from the fixed but specially designed point

r_{0} = r_{0} (γ)

that depends on

γ

, which was first introduced and thoroughly studied by Moustakides et al. [44]. See also Polunchenko and Tartakovsky [50] on the exact optimality of the SR-r procedure.

Bayesian Changepoint Optimization Criterion

In Bayesian problems, the point of change

ν

is treated as random with a prior distribution

π_{k} = P (ν = k)

,

k \in Z_{+}

. Define the probability measure on the Borel

σ

-algebra

B

in

R^{\infty} \times Z_{+}

as

P^{π} (A \times K) = \sum_{k \in K} π_{k} P_{k} (A), A \in B (R^{\infty}), K \subseteq Z_{+} .

Under measure

P^{π}

, the change point

ν

has a distribution

π = {π_{k}}

and the model for the observations is given in (49).

From the Bayesian point of view, it is reasonable to measure the false alarm risk with the weighted probability of false alarm (PFA) defined as

{PFA}^{π} (T) : = P^{π} (T \leq ν) = \sum_{k = 0}^{\infty} π_{k} P_{k} (T \leq k) = \sum_{k = 0}^{\infty} π_{k} P_{\infty} (T \leq k) .

(60)

The last equality follows from the fact that

P_{k} (T \leq k) = P_{\infty} (T \leq k)

because the event

{T \leq k}

depends on the first k observations which under measure

P_{k}

correspond to the no-change hypothesis

H_{\infty}

. Thus, for

α \in (0, 1)

, introduce the class of changepoint detection procedures

C_{π} (α) = \{T : {PFA}^{π} (T) \leq α\}

(61)

for which the weighted PFA does not exceed a prescribed level

α

.
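The weighted PFA (60) is straightforward to evaluate once the P_∞-distribution of the stopping time is known. The sketch below truncates the sum in (60) at a large index and checks it in two toy cases with a Geometric(ϱ) prior. Interchanging the order of summation in (60) gives PFA^π(T) = E_∞[P(ν ≥ T)], which equals E_∞[(1 - ϱ)^T] for the geometric prior and provides the closed forms used for comparison. The function names and the toy stopping times are assumptions made for illustration.

```python
def weighted_pfa(cdf_no_change, prior, k_max):
    """Truncated version of (60): sum over k <= k_max of pi_k * P_infinity(T <= k)."""
    return sum(prior(k) * cdf_no_change(k) for k in range(k_max + 1))


rho = 0.1


def geometric_prior(k):
    """pi_k = rho * (1 - rho)^k as in (56)."""
    return rho * (1.0 - rho) ** k


# Toy case 1: a rule that always stops at time 5, so P_infinity(T <= k) = 1{k >= 5}.
pfa_fixed = weighted_pfa(lambda k: 1.0 if k >= 5 else 0.0, geometric_prior, 2000)
print(pfa_fixed, (1.0 - rho) ** 5)

# Toy case 2: T is geometric with parameter p under P_infinity, T in {1, 2, ...}.
p = 0.05
pfa_geom = weighted_pfa(lambda k: 1.0 - (1.0 - p) ** k, geometric_prior, 2000)
print(pfa_geom, p * (1.0 - rho) / (1.0 - (1.0 - p) * (1.0 - rho)))
```

In both cases the truncated sum and the closed form agree up to a negligible truncation error.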

Let

E^{π}

denote the expectation with respect to the measure

P^{π}

.

Shiryaev [18,43] introduced the Bayesian optimality criterion

inf_{T \in C_{π} (α)} E^{π} [{(T - ν)}^{+}],

which is equivalent to minimizing the conditional expected detection delay

{EDD}^{π} (T) = E^{π} [T - ν | T > ν]

, i.e., to the constrained problem

inf_{T} {EDD}^{π} (T) subject to {PFA}^{π} (T) \leq α .

Under the Bayesian approach, the goal is to find a stopping time

T_{opt} \in C_{π} (α)

such that

{EDD}^{π} (T_{opt}) = inf_{T \in C_{π} (α)} {EDD}^{π} (T) for any α \in (0, 1) .

(62)

For the i.i.d. model (50) and for the geometric prior distribution

Geometric (ϱ)

of the changepoint

ν

(see (56)), this problem was solved by Shiryaev [18,43]. Shiryaev [18,43,51] proved that the detection procedure given by the stopping time

T_{SH} (A)

defined in (54) is strictly optimal in class

C_{π} (α)

if

A = A_{α}

in (54) can be selected in such a way that

{PFA}^{π} (T_{SH} (A_{α})) = α

, that is

inf_{T \in C_{π} (α)} {EDD}^{π} (T) = {EDD}^{π} (T_{SH} (A_{α})) for any α \in (0, 1) .

Uniform Pointwise Optimality Criterion

In many applications, the most reasonable optimality criterion is the pointwise uniform criterion of minimizing the conditional expected detection delay

{CEDD}_{ν} (T) = E_{ν} [T - ν | T > ν]

for all

ν \in Z_{+}

when the false alarm risk is fixed at a certain level. However, as we already mentioned, if it is required that

ARL 2 FA (T) \geq γ

for some

γ \geq 1

, then a procedure that minimizes

{CEDD}_{ν} (T)

for all

ν

does not exist. More importantly, as discussed in ([5] Section 2.3), the requirement of having large values of the

ARL 2 FA (T)

generally does not guarantee small values of the maximal local probability of false alarm

MLPFA (T) = {sup}_{ℓ \geq 0} P_{\infty} (T \leq ℓ + m | T > ℓ)

in a time window of a length

m \geq 1

, while the opposite is always true (see Lemmas 2.1–2.2 in [5]). Hence, the constraint

MLPFA (T) \leq β

is more stringent than

ARL 2 FA (T) \geq γ

.

Another reason for considering the MLPFA constraint instead of the ARL to false alarm constraint is that the latter one makes sense if and only if the

P_{\infty}

-distribution of the stopping time is geometric or at least close to geometric, which is often the case for many popular detection procedures such as CUSUM and SR in the i.i.d. case. However, for general non-i.i.d. models, this is not necessarily true (see [5,52] for a detailed discussion).

For these reasons, introduce the most stringent class of change detection procedures for which the

MLPFA (T)

is upper-bounded by the prespecified level

β \in (0, 1)

:

C_{PFA} (m, β) = \{T : sup_{ℓ \geq 0} P_{\infty} (T \leq ℓ + m | T > ℓ) \leq β\} .

(63)

The goal is to find a stopping time

T_{opt} \in C_{PFA} (m, β)

such that

{CEDD}_{ν} (T_{opt}) = inf_{T \in C_{PFA} (m, β)} {CEDD}_{ν} (T) for all ν \in Z_{+} and any 0 < β < 1 .

(64)
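To make the difference between the ARL and MLPFA constraints concrete, the following sketch evaluates the MLPFA from a given P_∞-survival function of the stopping time. The supremum over ℓ is truncated at l_max, and both example distributions (a geometric one and a two-point one) are hypothetical and chosen only for illustration.

```python
def mlpfa(survival, m, l_max):
    """max over 0 <= l <= l_max of P(T <= l + m | T > l), where survival(l) = P_infinity(T > l).

    Truncating the supremum at l_max is an approximation.
    """
    best = 0.0
    for l in range(l_max + 1):
        s = survival(l)
        if s > 0.0:
            best = max(best, 1.0 - survival(l + m) / s)
    return best


# Geometric stopping time with parameter p: ARL2FA = 1 / p = 100.
p = 0.01
geo = mlpfa(lambda l: (1.0 - p) ** l, m=5, l_max=200)
print(geo, 1.0 - (1.0 - p) ** 5)        # memoryless: the same value for every l


def two_point_survival(l):
    """T = 1 or T = 1000 with probability 1/2 each, so ARL2FA = 500.5."""
    if l < 1:
        return 1.0
    if l < 1000:
        return 0.5
    return 0.0


# Large ARL2FA, yet the local PFA in the very first window of length 1 is 1/2.
print(mlpfa(two_point_survival, m=1, l_max=10))
```

In the geometric case the local PFA is the same in every window and is roughly m/ARL2FA, whereas the two-point example has a five times larger ARL2FA but a local PFA of 1/2 in the first window.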

3.2.4. Asymptotic Optimality for General Non-i.i.d. Models via r-Quick and r-Complete Convergence

Complete Convergence and General Bayesian Changepoint Detection Theory

First consider the Bayesian problem assuming that the change point

ν

is a random variable independent of the observations with a prior distribution

π = {π_{k}}

. Unfortunately, in the general non-i.i.d. case and for an arbitrary prior

π

, the Bayesian optimization problem (62) is intractable for arbitrary values of PFA

α \in (0, 1)

. For this reason, we will consider the first-order asymptotic problem assuming that the given PFA

α

approaches zero. To be specific, the goal is to design such a detection procedure

T^{*}

that asymptotically minimizes the expected detection delay

{EDD}^{π} (T)

to first order as

α \to 0

:

inf_{T \in C_{π} (α)} {EDD}^{π} (T) = {EDD}^{π} (T^{*}) (1 + o (1)) as α \to 0,

(65)

where

o (1) \to 0

as

α \to 0

. It turns out that, in the asymptotic setting, it is also possible to find a procedure that minimizes the conditional expected detection delay

{EDD}_{k} (T) = E_{k} [T - k | T > k]

uniformly for all possible values of the change point

ν = k \in Z_{+}

, i.e.,

lim_{α \to 0} \frac{{inf}_{T \in C_{π} (α)} {EDD}_{k} (T)}{{EDD}_{k} (T^{*})} = 1 for all k \in Z_{+} .

(66)

Furthermore, asymptotic optimality results can also be established for higher moments of the detection delay of order

r > 1

E_{k} [{(T - k)}^{r} | T > k] and E^{π} [{(T - ν)}^{r} | T > ν] .

Since the Shiryaev procedure

T_{SH} (A)

, which was defined in (54), (55), is optimal for the i.i.d. model and

Geometric (ϱ)

prior, it is reasonable to expect that it remains asymptotically optimal for more general priors and non-i.i.d. models. However, to study asymptotic optimality, we need certain constraints imposed on the prior distribution and on the asymptotic behavior of the decision statistics as the sample size increases, i.e., on the general stochastic model (49).

Assume that the prior distribution

{π_{k}}

is fully supported, i.e.,

π_{k} > 0

for all

k \in Z_{+}

and

π_{\infty} = 0

and that the following condition holds:

lim_{n \to \infty} \frac{1}{n} | log \sum_{k = n + 1}^{\infty} π_{k} | = μ for some 0 \leq μ < \infty .

(67)

Obviously, if

μ > 0

, then the prior

π

has an exponential right tail (e.g., the geometric distribution

Geometric (ϱ)

, in which case

μ = | log (1 - ϱ) |

). If

μ = 0

, then it has a heavier tail than an exponential tail. In this case, we will refer to it as a heavy-tailed distribution.
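As a worked illustration of condition (67), the limit can be computed in closed form in two cases. The first is the geometric prior (56). The second is a polynomial-tail prior, which is our own illustrative choice and is not taken from the article.

```latex
% Geometric prior (56): \pi_k = \varrho (1-\varrho)^k
\sum_{k=n+1}^{\infty} \pi_k = (1-\varrho)^{n+1}
\;\Longrightarrow\;
\frac{1}{n}\Bigl|\log \sum_{k=n+1}^{\infty} \pi_k\Bigr|
  = \frac{n+1}{n}\,|\log(1-\varrho)|
  \;\longrightarrow\; |\log(1-\varrho)| = \mu > 0 .

% Illustrative heavy-tailed prior (assumption): \pi_k = \frac{1}{(k+1)(k+2)}, \; k \in Z_+
\sum_{k=n+1}^{\infty} \pi_k
  = \sum_{k=n+1}^{\infty}\Bigl(\frac{1}{k+1}-\frac{1}{k+2}\Bigr)
  = \frac{1}{n+2}
\;\Longrightarrow\;
\frac{1}{n}\Bigl|\log \frac{1}{n+2}\Bigr| = \frac{\log(n+2)}{n}
  \;\longrightarrow\; 0 = \mu .
```

Both priors are fully supported, and the second sums to one by telescoping.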

Define the LLR of the hypotheses

H_{k}

and

H_{\infty}

λ_{n}^{k} = log \frac{d P_{k}^{n}}{d P_{\infty}^{n}} = \sum_{t = k + 1}^{n} log \frac{f_{t} (X_{t} | X^{t - 1})}{g_{t} (X_{t} | X^{t - 1})}, n > k

(

λ_{n}^{k} = 0

for

n \leq k

). To obtain asymptotic optimality results, the general non-i.i.d. model for observations is restricted to the case where the normalized LLR

n^{- 1} λ_{k + n}^{k}

obeys the SLLN as

n \to \infty

with a finite positive limit I under the probability measure

P_{k}

and also satisfies the strengthened r-complete version of this law

\sum_{n = 1}^{\infty} n^{r - 1} sup_{k \in Z_{+}} P_{k} \{| n^{- 1} λ_{k + n}^{k} - I | > ε\} < \infty for every ε > 0 .

(68)
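Condition (68) can be checked explicitly in simple models. As a hedged illustration, assume an i.i.d. Gaussian mean-shift model (N(0, 1) before the change and N(θ, 1) after it), which is our choice for this example. Then the LLR increment is θX_t - θ²/2, I = θ²/2, and n^{-1} λ_{k+n}^{k} - I is N(0, θ²/n) under P_k for every k, so the probability in (68) equals erfc(ε√n / (|θ|√2)) and does not depend on k. The sketch below evaluates partial sums of the series; the function name is hypothetical.

```python
import math


def r_complete_partial_sum(r, eps, theta, n_terms):
    """Partial sum of sum_n n^(r-1) * P(|lambda_n / n - I| > eps), Gaussian mean-shift case."""
    total = 0.0
    for n in range(1, n_terms + 1):
        # Two-sided Gaussian tail: P(|N(0, theta^2 / n)| > eps)
        tail = math.erfc(eps * math.sqrt(n) / (abs(theta) * math.sqrt(2.0)))
        total += n ** (r - 1) * tail
    return total


for n_terms in (100, 1000, 4000):
    print(n_terms, r_complete_partial_sum(r=2, eps=0.5, theta=1.0, n_terms=n_terms))
```

The partial sums stabilize quickly because the Gaussian tail decays exponentially in n, so the series is finite for every r in this example.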

It follows from Lemma 7.2.1 in [6] that, for any

A > 0

,

{PFA}^{π} (T_{SH} (A)) \leq {(1 + A)}^{- 1},

so that

T_{SH} (A_{α}) \in C_{π} (α)

if

A = A_{α} = (1 - α) / α

.

The following theorem, which can be deduced from Theorem 3.7 in [5], shows that the Shiryaev detection procedure is asymptotically optimal if the normalized LLR

n^{- 1} λ_{k + n}^{k}

converges r-completely to a positive and finite number I and the prior distribution satisfies condition (67).

Theorem 6.

Suppose that the prior distribution

π = {π_{k}}_{k \in Z_{+}}

of the change point satisfies condition (67) with some

0 \leq μ < \infty

. Assume that there exists some number

0 < I < \infty

such that the LLR process

n^{- 1} λ_{k + n}^{k}

converges to I uniformly r-completely as

n \to \infty

under

P_{k}

, i.e., condition (68) holds for some

r \geq 1

. If threshold

A = A_{α}

in the Shiryaev procedure is so selected that

{PFA}^{π} (T_{SH} (A_{α})) \leq α

and

log A_{α} \sim | log α |

as

α \to 0

, e.g., as

A = (1 - α) / α

, then as

α \to 0

\begin{matrix} inf_{T \in C_{π} (α)} E_{k} [{(T - k)}^{r} | T > k] \sim {(\frac{| log α |}{I + μ})}^{r} \sim E_{k} [{(T_{SH} - k)}^{r} | T_{SH} > k] for all k \in Z_{+} \end{matrix}

and

inf_{T \in C_{π} (α)} E^{π} [{(T - ν)}^{r} | T > ν] \sim {(\frac{| log α |}{I + μ})}^{r} \sim E^{π} [{(T_{SH} - ν)}^{r} | T_{SH} > ν] .

Therefore, the Shiryaev procedure

T_{SH} (A_{α})

is first-order asymptotically optimal as

α \to 0

in class

C_{π} (α)

, minimizing the moments of the detection delay up to order r whenever the r-complete version of the SLLN (68) holds for the LLR process.

For

r = 1

, the assertions of this theorem imply the asymptotic optimality of the Shiryaev procedure for the expected detection delays (65) and (66) as well as asymptotic approximations for the expected detection delays.
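For orientation, the threshold choice and the first-order approximation of Theorem 6 are easy to evaluate numerically. The sketch below does this for assumed values I = 0.5 (for instance, a unit Gaussian mean shift) and a Geometric(0.1) prior; these numbers and the function names are illustrative assumptions. The approximation is only first-order and may be inaccurate for moderate values of α.

```python
import math


def shiryaev_threshold(alpha):
    """A_alpha = (1 - alpha) / alpha, so that the bound 1 / (1 + A_alpha) equals alpha."""
    return (1.0 - alpha) / alpha


def delay_moment_first_order(alpha, info, mu, r=1):
    """First-order approximation (|log alpha| / (I + mu))^r from Theorem 6."""
    return (abs(math.log(alpha)) / (info + mu)) ** r


info = 0.5                            # assumed value of I
mu = abs(math.log(1.0 - 0.1))         # Geometric(0.1) prior: mu = |log(1 - rho)|
for alpha in (1e-2, 1e-4, 1e-6):
    print(alpha, shiryaev_threshold(alpha), delay_moment_first_order(alpha, info, mu))
```

The approximate delay grows only logarithmically as α decreases, while the threshold grows like 1/α.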

Remark 6.

The results of Theorem 6 can be generalized to the asymptotically non-stationary case where

λ_{k + n}^{k} / ψ (n)

converges to I uniformly r-completely as

n \to \infty

under

P_{k}

with a non-linear function

ψ (n)

similarly to the hypothesis testing problem discussed in Section 3.1. See also the recent paper [53] for the minimax change detection problem with independent but substantially non-stationary post-change observations.

It is also interesting to see how two other most popular changepoint detection procedures—the SR and CUSUM—perform in the Bayesian context.

Consider the SR procedure defined by (58), (59). By Lemma 3.4 (p. 100) in [5],

{PFA}^{π} (T_{SR}^{r_{0}} (A)) \leq \frac{r_{0} \sum_{k = 1}^{\infty} π_{k} + \sum_{k = 1}^{\infty} k π_{k}}{A} for every A > 0,

and therefore, setting

A = A_{α} = α^{- 1} (r_{0} + \sum_{k = 1}^{\infty} k π_{k})

implies that

T_{SR}^{r_{0}} (A_{α}) \in C_{π} (α)

. If threshold

A = A_{α}

in the SR procedure is so selected that

{PFA}^{π} (T_{SR}^{r_{0}} (A_{α})) \leq α

and

log A_{α} \sim | log α |

as

α \to 0

, e.g., as

A_{α} = α^{- 1} (r_{0} + \sum_{k = 1}^{\infty} k π_{k})

, then as

α \to 0

\begin{matrix} E_{k} [{(T_{SR}^{r_{0}} - k)}^{r} | T_{SR}^{r_{0}} > k] \sim {(\frac{| log α |}{I})}^{r} for all k \in Z_{+} \end{matrix}

(69)

and

E^{π} [{(T_{SR}^{r_{0}} - ν)}^{r} | T_{SR}^{r_{0}} > ν] \sim {(\frac{| log α |}{I})}^{r}

(70)

whenever the uniform r-complete convergence condition (68) holds. Therefore, the SR procedure

T_{SR}^{r_{0}} (A_{α})

is first-order asymptotically optimal as

α \to 0

in class

C_{π} (α)

, minimizing the moments of the detection delay up to order r, when the prior distribution

π

is heavy-tailed (i.e., when

μ = 0

) and the r-complete version of the SLLN holds. In the case where

μ > 0

(i.e., the prior distribution has an exponential tail), the SR procedure is not optimal. This can be expected since it uses the improper uniform prior in the detection statistic.
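The loss of the SR procedure relative to the optimal first-order performance can be quantified by dividing (69) by the lower bound of Theorem 6. The following short derivation is a sketch; the numerical values in the final comment are assumed for illustration only.

```latex
\frac{E_{k}\bigl[(T_{SR}^{r_0}-k)^{r} \mid T_{SR}^{r_0}>k\bigr]}
     {\inf_{T\in C_{\pi}(\alpha)} E_{k}\bigl[(T-k)^{r} \mid T>k\bigr]}
\;\sim\;
\frac{\bigl(|\log\alpha|/I\bigr)^{r}}{\bigl(|\log\alpha|/(I+\mu)\bigr)^{r}}
= \Bigl(1+\frac{\mu}{I}\Bigr)^{r},
\qquad \alpha\to 0 .
% Example (assumed values): Geometric(0.1) prior, \mu = |\log 0.9| \approx 0.105, and I = 0.5
% give a ratio of about 1.21 for r = 1.
```

The ratio equals 1 exactly when μ = 0, in agreement with the optimality of the SR procedure for heavy-tailed priors, and it is close to 1 whenever μ is small compared with I.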

The same asymptotic results (69), (70) are true for the CUSUM procedure

T_{CS} (a)

defined in (52) if threshold

a = a_{α}

is so selected that

{PFA}^{π} (T_{CS} (a_{α})) \leq α

and

a_{α} \sim | log α |

as

α \to 0

and the uniform r-complete convergence condition (68) holds.

Hence, the r-complete convergence of the LLR process is a sufficient condition for the uniform asymptotic optimality of several popular change detection procedures in class

C_{π} (α)

.

Complete Convergence and General Non-Bayesian Changepoint Detection Theory

Consider the non-Bayesian problem where the change point

ν

is an unknown deterministic number. We focus on the uniform optimality criterion (64), which is the most interesting one for a variety of applications and requires minimizing the conditional expected delay to detection

{CEDD}_{ν} (T) = E_{ν} [T - ν | T > ν]

for all values of the change point

ν \in Z_{+}

in the class of change detection procedures

C_{PFA} (m, β)

defined in (63). Recall that this class includes change detection procedures with the maximal local probability of false alarm in the time window m,

MLPFA (T) = sup_{ℓ \geq 0} P_{\infty} (T \leq ℓ + m | T > ℓ),

which does not exceed the prescribed value

β \in (0, 1)

. However, the exact solution to this challenging problem is unknown even in the i.i.d. case.

Instead, consider the following asymptotic problem assuming that the given MLPFA

β

goes to zero: find a change detection procedure

T^{★}

which asymptotically minimizes the expected detection delay

E_{ν} [T - ν | T > ν]

to the first order as

β \to 0

. That is, the goal is to design such a detection procedure

T^{★}

that

inf_{T \in C_{PFA} (m, β)} E_{ν} [T - ν | T > ν] = E_{ν} [T^{★} - ν | T^{★} > ν] (1 + o (1)) for all ν \in Z_{+} as β \to 0 .

More generally, we may focus on the asymptotic problem of minimizing the moments of the detection delay of order

r \geq 1

:

inf_{T \in C_{PFA} (m, β)} E_{ν} [{(T - ν)}^{r} | T > ν] = E_{ν} [{(T^{★} - ν)}^{r} | T^{★} > ν] (1 + o (1)) for all ν \in Z_{+} as β \to 0 .

To solve this problem, we need to assume that the window length

m = m_{β}

is a function of the MLPFA constraint

β

and that

m_{β}

goes to infinity as

β \to 0

at an appropriate rate. Using [54], the following results can be established.

Consider the SR procedure defined by (58), (59) with

r_{0} = 0

, in which case write

T_{SR}^{r_{0}} (A) = T_{SR} (A)

. Let

r \geq 1

and assume that the r-complete version of the SLLN holds with some number

0 < I < \infty

, i.e.,

n^{- 1} λ_{ν + n}^{ν}

converges to I uniformly r-completely as

n \to \infty

under

P_{ν}

. If

m_{β} = O ({| log β |}^{2})

as

β \to 0

and threshold

A = A_{β}

in the SR procedure is so selected that

MLPFA (T_{SR} (A_{β})) \leq β

and

log A_{β} \sim | log β |

as

β \to 0

, e.g., as defined in [54], then as

β \to 0

\begin{matrix} inf_{T \in C_{PFA} (m_{β}, β)} E_{ν} [{(T - ν)}^{r} | T > ν] \sim {(\frac{| log β |}{I})}^{r} \sim E_{ν} [{(T_{SR} - ν)}^{r} | T_{SR} > ν] for all ν \in Z_{+} . \end{matrix}

A similar result also holds for the CUSUM procedure

T_{CS} (a)

if threshold

a = a_{β}

is selected so that

MLPFA (T_{CS} (a_{β})) \leq β

and

a_{β} \sim | log β |

as

β \to 0

and the r-complete version of the SLLN holds for the normalized LLR

n^{- 1} λ_{ν + n}^{ν}

as

n \to \infty

.

Hence, the r-complete convergence of the LLR process is a sufficient condition for the uniform asymptotic optimality of SR and CUSUM change detection procedures with respect to the moments of the detection delay of order r in class

C_{PFA} (m_{β}, β)

.

4. Quick and Complete Convergence for Markov and Hidden Markov Models

Usually, in particular problems, the verification of the SLLN for the LLR process is relatively easy. However, in practice, verifying the strengthened r-complete or r-quick versions of the SLLN, i.e., checking condition (68), can cause some difficulty. Many interesting examples where this verification was performed can be found in [5,6]. However, it is interesting to find sufficient conditions for the r-complete convergence for a relatively large class of stochastic models.

In this section, we outline this issue for Markov and hidden Markov models based on the results obtained by Pergamenchtchikov and Tartakovsky [54] for ergodic Markov processes and by Fuh and Tartakovsky [55] for hidden Markov models (HMM). See also Tartakovsky ([5] Ch 3).

Let

{X_{n}}_{n \in Z_{+}}

be a time-homogeneous Markov process with values in a measurable space

(X, B)

with the transition probability

P (x, A)

with density

p (y | x)

. Let

E_{x}

denote the expectation with respect to this probability. Assume that this process is geometrically ergodic, i.e., there exist positive constants

0 < R < \infty

,

κ > 0

, and probability measure

ϰ

on

(X, B)

and a Lyapunov function

V : X \to [1, \infty)

with

ϰ (V) < \infty

such that

sup_{n \in Z_{+}} e^{κ n} sup_{0 < ψ \leq V} sup_{x} \frac{1}{V (x)} |E_{x} [ψ (X_{n})] - ϰ (ψ)| \leq R .

In the change detection problem, the sequence

{X_{n}}_{n \in Z_{+}}

is a Markov process, such that

{X_{n}}_{1 \leq n \leq ν}

is a homogeneous process with the transition density

g (y | x)

and

{X_{n}}_{n > ν}

is homogeneous positive ergodic with the transition density

f (y | x)

and the ergodic (stationary) distribution

ϰ

. In this case, the LLR process

λ_{n}^{k}

can be represented as

λ_{n}^{k} = \sum_{t = k + 1}^{n} G (X_{t}, X_{t - 1}), n > k,

where

G (y, x) = log [f (y | x) / g (y | x)]

.

Define

I = \int_{X} \{\int_{X} G (y, x) f (y | x) d y\} ϰ (d x) .

Under a set of quite sophisticated sufficient conditions, the LLR

λ_{k + n}^{k} / n

converges to I as

n \to \infty

r-completely (cf. [54]). We omit the details and only mention that the main condition is the finiteness of the

(r + 1)

-th moment of the LLR increment,

E_{0} [{(G (X_{1}, X_{0}))}^{r + 1}] < \infty

.
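The number I defined above is easy to compute in small examples. As a hedged illustration, the sketch below evaluates it for a two-state Markov chain, where the integrals reduce to finite sums and the stationary distribution of the post-change chain is available in closed form. The two-state setting, the function names, and the numerical transition matrices are assumptions made for this example; all transition probabilities are assumed to be strictly positive.

```python
import math


def stationary_two_state(f):
    """Stationary distribution of a two-state chain with transition matrix f."""
    a = f[0][1]                     # probability of moving 0 -> 1
    b = f[1][0]                     # probability of moving 1 -> 0
    return [b / (a + b), a / (a + b)]


def markov_information(f, g):
    """I = sum_x kappa(x) * sum_y f(y|x) * log(f(y|x) / g(y|x)), kappa stationary for f."""
    kappa = stationary_two_state(f)
    total = 0.0
    for x in range(2):
        for y in range(2):
            total += kappa[x] * f[x][y] * math.log(f[x][y] / g[x][y])
    return total


g = [[0.9, 0.1], [0.2, 0.8]]        # hypothetical pre-change transition matrix g(y|x)
f = [[0.6, 0.4], [0.5, 0.5]]        # hypothetical post-change transition matrix f(y|x)
print(markov_information(f, g))
```

When the rows of each matrix coincide, the chain is i.i.d. and the value reduces to the Kullback–Leibler information number between the post-change and pre-change distributions.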

Now consider the HMM with finite state space. Then again, as in the pure Markov case, the main condition for the r-complete convergence of

λ_{k + n}^{k} / n

to I, where I is specified in Fuh and Tartakovsky [55], is

E_{0} [{(λ_{1}^{0})}^{r + 1}] < \infty

. Further details can be found in [55].

Similar results for Markov and hidden Markov models hold for the hypothesis testing problem considered in Section 3.1. Specifically, if in the Markov case we assume that the observed Markov process

{X_{n}}_{n \in Z_{+}}

is time-homogeneous and geometrically ergodic with a transition density

f_{i} (y | x)

under hypothesis

H_{i}

(

i = 0, 1, \dots, N

) and invariant distribution

ϰ_{i}

, then the LLR processes are

λ_{i j} (n) = \sum_{t = 1}^{n} G_{i j} (X_{t}, X_{t - 1}), i, j = 0, 1, \dots, N, i \neq j,

where

G_{i j} (y, x) = log [f_{i} (y | x) / f_{j} (y | x)]

. If

E_{i} [{(G_{i j} (X_{1}, X_{0}))}^{r + 1}] < \infty

, then the LLR

n^{- 1} λ_{i j} (n)

converges r-completely to a finite number

I_{i j} = \int_{X} \{\int_{X} G_{i j} (y, x) f_{i} (y | x) d y\} ϰ_{i} (d x) .

5. Discussion and Conclusions

The purpose of this article is to provide an overview of two modes of convergence in the LLN—r-quick and r-complete convergences. These strengthened versions of the SLLN are often neglected in the theory of probability. In the first part of this paper (Section 2), we discussed in detail these two modes of convergence and corresponding strengthened versions of the SLLN. The main motivation was the fact that both r-quick and r-complete versions of the SLLN can be effectively used for establishing near optimality results in sequential analysis, in particular, in sequential hypothesis testing and quickest changepoint detection problems for very general stochastic models of dependent and non-stationary observations. These models are not limited to Markov and hidden Markov models. The results presented in the second part of this paper (Section 3) show that the constraints imposed on the models for observations can be formulated in terms of either the r-quick or r-complete convergence of properly normalized log-likelihood ratios between hypotheses to finite numbers, which can be interpreted as local Kullback–Leibler information numbers. This is natural and can be intuitively expected since optimal or nearly optimal decision-making rules are typically based on a combination of log-likelihood ratios. Therefore, if one is interested in the asymptotic optimality properties of decision-making rules, the asymptotic behavior of log-likelihood ratios as the sample size goes to infinity not only matters but provides the main contribution.

The results presented in this article allow us to conclude that the strengthened r-quick and r-complete versions of the SLLN are useful tools for many statistical problems for general non-i.i.d. stochastic models. In particular, r-quick and r-complete convergences for log-likelihood ratio processes are sufficient for the near optimality of sequential hypothesis tests and changepoint detection procedures for models with dependent and non-identically distributed observations. Such non-i.i.d. models are typical for modern large-scale information and physical systems that produce big data in numerous practical applications. Readers interested in specific applications may find detailed discussions in [4,5,6,7,21,22,33,35,37,53,54,55,56,57,58].

Funding

This article received no external funding.

Data Availability Statement

No real data were used in this research.

Conflicts of Interest

The author declares no conflict of interest.

References

Hsu, P.L.; Robbins, H. Complete convergence and the law of large numbers. Proc. Natl. Acad. Sci. USA 1947, 33, 25–31.
Baum, L.E.; Katz, M. Convergence rates in the law of large numbers. Trans. Am. Math. Soc. 1965, 120, 108–123.
Strassen, V. Almost sure behavior of sums of independent random variables and martingales. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, San Diego, CA, USA, 21 June–18 July 1965 and 27 December 1965–7 January 1966; Le Cam, L.M., Neyman, J., Eds.; Vol. 2: Contributions to Probability Theory. Part 1; University of California Press: Berkeley, CA, USA, 1967; pp. 315–343.
Tartakovsky, A.G. Asymptotic optimality of certain multihypothesis sequential tests: Non-i.i.d. case. Stat. Inference Stoch. Process. 1998, 1, 265–295.
Tartakovsky, A.G. Sequential Change Detection and Hypothesis Testing: General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules; Monographs on Statistics and Applied Probability 165; Chapman & Hall/CRC Press, Taylor & Francis Group: Boca Raton, FL, USA; London, UK; New York, NY, USA, 2020.
Tartakovsky, A.G.; Nikiforov, I.V.; Basseville, M. Sequential Analysis: Hypothesis Testing and Changepoint Detection; Monographs on Statistics and Applied Probability 136; Chapman & Hall/CRC Press, Taylor & Francis Group: Boca Raton, FL, USA; London, UK; New York, NY, USA, 2015.
Lai, T.L. Asymptotic optimality of invariant sequential probability ratio tests. Ann. Stat. 1981, 9, 318–333.
Lai, T.L. On r-quick convergence and a conjecture of Strassen. Ann. Probab. 1976, 4, 612–627.
Chow, Y.S.; Lai, T.L. Some one-sided theorems on the tail distribution of sample sums with applications to the last time and largest excess of boundary crossings. Trans. Am. Math. Soc. 1975, 208, 51–72.
Fuh, C.D.; Zhang, C.H. Poisson equation, moment inequalities and quick convergence for Markov random walks. Stoch. Process. Their Appl. 2000, 87, 53–67.
Wald, A. Sequential tests of statistical hypotheses. Ann. Math. Stat. 1945, 16, 117–186.
Wald, A. Sequential Analysis; John Wiley & Sons, Inc.: New York, NY, USA, 1947.
Wald, A.; Wolfowitz, J. Optimum character of the sequential probability ratio test. Ann. Math. Stat. 1948, 19, 326–339.
Burkholder, D.L.; Wijsman, R.A. Optimum properties and admissibility of sequential tests. Ann. Math. Stat. 1963, 34, 1–17.
Matthes, T.K. On the optimality of sequential probability ratio tests. Ann. Math. Stat. 1963, 34, 18–21.
Ferguson, T.S. Mathematical Statistics: A Decision Theoretic Approach; Probability and Mathematical Statistics; Academic Press: Cambridge, MA, USA, 1967.
Lehmann, E.L. Testing Statistical Hypotheses; John Wiley & Sons, Inc.: New York, NY, USA, 1968.
Shiryaev, A.N. Optimal Stopping Rules; Series on Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 1978; Volume 8.
Golubev, G.K.; Khas’minskii, R.Z. Sequential testing for several signals in Gaussian white noise. Theory Probab. Appl. 1984, 28, 573–584.
Tartakovsky, A.G. Asymptotically optimal sequential tests for nonhomogeneous processes. Seq. Anal. 1998, 17, 33–62.
Verdenskaya, N.V.; Tartakovskii, A.G. Asymptotically optimal sequential testing of multiple hypotheses for nonhomogeneous Gaussian processes in an asymmetric situation. Theory Probab. Appl. 1991, 36, 536–547.
Fellouris, G.; Tartakovsky, A.G. Multichannel sequential detection–Part I: Non-i.i.d. data. IEEE Trans. Inf. Theory 2017, 63, 4551–4571.
Armitage, P. Sequential analysis with more than two alternative hypotheses, and its relation to discriminant function analysis. J. R. Stat. Soc.-Ser. Methodol. 1950, 12, 137–144.
Chernoff, H. Sequential design of experiments. Ann. Math. Stat. 1959, 30, 755–770.
Kiefer, J.; Sacks, J. Asymptotically optimal sequential inference and design. Ann. Math. Stat. 1963, 34, 705–750.
Lorden, G. Integrated risk of asymptotically Bayes sequential tests. Ann. Math. Stat. 1967, 38, 1399–1422.
Lorden, G. Nearly-optimal sequential tests for finitely many parameter values. Ann. Stat. 1977, 5, 1–21.
Pavlov, I.V. Sequential procedure of testing composite hypotheses with applications to the Kiefer-Weiss problem. Theory Probab. Appl. 1990, 35, 280–292.
Baron, M.; Tartakovsky, A.G. Asymptotic optimality of change-point detection schemes in general continuous-time models. Seq. Anal. 2006, 25, 257–296.
Mosteller, F. A k-sample slippage test for an extreme population. Ann. Math. Stat. 1948, 19, 58–65.
Bakut, P.A.; Bolshakov, I.A.; Gerasimov, B.M.; Kuriksha, A.A.; Repin, V.G.; Tartakovsky, G.P.; Shirokov, V.V. Statistical Radar Theory; Tartakovsky, G.P., Ed.; Sovetskoe Radio: Moscow, Russia, 1963; Volume 1. (In Russian)
Basseville, M.; Nikiforov, I.V. Detection of Abrupt Changes—Theory and Application; Information and System Sciences Series; Prentice-Hall, Inc.: Englewood Cliffs, NJ, USA, 1993.
Jeske, D.R.; Steven, N.T.; Tartakovsky, A.G.; Wilson, J.D. Statistical methods for network surveillance. Appl. Stoch. Model. Bus. Ind. 2018, 34, 425–445.
Jeske, D.R.; Steven, N.T.; Wilson, J.D.; Tartakovsky, A.G. Statistical network surveillance. In Wiley StatsRef: Statistics Reference Online; Wiley: New York, NY, USA, 2018; pp. 1–12.
Tartakovsky, A.G.; Brown, J. Adaptive spatial-temporal filtering methods for clutter removal and target tracking. IEEE Trans. Aerosp. Electron. Syst. 2008, 44, 1522–1537.
Szor, P. The Art of Computer Virus Research and Defense; Addison-Wesley Professional: Upper Saddle River, NJ, USA, 2005.
Tartakovsky, A.G. Rapid detection of attacks in computer networks by quickest changepoint detection methods. In Data Analysis for Network Cyber-Security; Adams, N., Heard, N., Eds.; Imperial College Press: London, UK, 2014; pp. 33–70.
Tartakovsky, A.G.; Rozovskii, B.L.; Blaźek, R.B.; Kim, H. Detection of intrusions in information systems by sequential change-point methods. Stat. Methodol. 2006, 3, 252–293.
Tartakovsky, A.G.; Rozovskii, B.L.; Blaźek, R.B.; Kim, H. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Trans. Signal Process. 2006, 54, 3372–3382.
Siegmund, D. Change-points: From sequential detection to biology and back. Seq. Anal. 2013, 32, 2–14.
Moustakides, G.V. Sequential change detection revisited. Ann. Stat. 2008, 36, 787–807.
Page, E.S. Continuous inspection schemes. Biometrika 1954, 41, 100–114.
Shiryaev, A.N. On optimum methods in quickest detection problems. Theory Probab. Appl. 1963, 8, 22–46.
Moustakides, G.V.; Polunchenko, A.S.; Tartakovsky, A.G. A numerical approach to performance analysis of quickest change-point detection procedures. Stat. Sin. 2011, 21, 571–596.
Moustakides, G.V.; Polunchenko, A.S.; Tartakovsky, A.G. Numerical comparison of CUSUM and Shiryaev–Roberts procedures for detecting changes in distributions. Commun. Stat.-Theory Methods 2009, 38, 3225–3239.
Lorden, G. Procedures for reacting to a change in distribution. Ann. Math. Stat. 1971, 42, 1897–1908. [Google Scholar] [CrossRef]
Moustakides, G.V. Optimal stopping times for detecting changes in distributions. Ann. Stat. 1986, 14, 1379–1387. [Google Scholar] [CrossRef]
Pollak, M. Optimal detection of a change in distribution. Ann. Stat. 1985, 13, 206–227. [Google Scholar] [CrossRef]
Tartakovsky, A.G.; Pollak, M.; Polunchenko, A.S. Third-order asymptotic optimality of the generalized Shiryaev–Roberts changepoint detection procedures. Theory Probab. Appl. 2012, 56, 457–484. [Google Scholar] [CrossRef]
Polunchenko, A.S.; Tartakovsky, A.G. On optimality of the Shiryaev–Roberts procedure for detecting a change in distribution. Ann. Stat. 2010, 38, 3445–3457. [Google Scholar] [CrossRef]
Shiryaev, A.N. The problem of the most rapid detection of a disturbance in a stationary process. Sov. Math.–Dokl. 1961, 2, 795–799, Translation from Doklady Akademii Nauk SSSR 1961, 138, 1039–1042. [Google Scholar]
Tartakovsky, A.G. Discussion on “Is Average Run Length to False Alarm Always an Informative Criterion?” by Yajun Mei. Seq. Anal. 2008, 27, 396–405. [Google Scholar] [CrossRef]
Liang, Y.; Tartakovsky, A.G.; Veeravalli, V.V. Quickest change detection with non-stationary post-change observations. IEEE Trans. Inf. Theory 2023, 69, 3400–3414. [Google Scholar] [CrossRef]
Pergamenchtchikov, S.; Tartakovsky, A.G. Asymptotically optimal pointwise and minimax quickest change-point detection for dependent data. Stat. Inference Stoch. Process. 2018, 21, 217–259. [Google Scholar] [CrossRef]
Fuh, C.D.; Tartakovsky, A.G. Asymptotic Bayesian theory of quickest change detection for hidden Markov models. IEEE Trans. Inf. Theory 2019, 65, 511–529. [Google Scholar] [CrossRef]
Kolessa, A.; Tartakovsky, A.; Ivanov, A.; Radchenko, V. Nonlinear estimation and decision-making methods in short track identification and orbit determination problem. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 301–312. [Google Scholar] [CrossRef]
Tartakovsky, A.; Berenkov, N.; Kolessa, A.; Nikiforov, I. Optimal sequential detection of signals with unknown appearance and disappearance points in time. IEEE Trans. Signal Process. 2021, 69, 2653–2662. [Google Scholar] [CrossRef]
Pergamenchtchikov, S.M.; Tartakovsky, A.G.; Spivak, V.S. Minimax and pointwise sequential changepoint detection and identification for general stochastic models. J. Multivar. Anal. 2022, 190, 104977. [Google Scholar] [CrossRef]

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
