Article

Quick and Complete Convergence in the Law of Large Numbers with Applications to Statistics

by
Alexander G. Tartakovsky
AGT StatConsult, 71 Cypress Way, Rolling Hills Estates, CA 90274, USA
Mathematics 2023, 11(12), 2687; https://doi.org/10.3390/math11122687
Submission received: 11 May 2023 / Revised: 6 June 2023 / Accepted: 7 June 2023 / Published: 13 June 2023

Abstract: In the first part of this article, we discuss and generalize the notion of complete convergence, introduced by Hsu and Robbins in 1947, to the r-complete convergence introduced by Tartakovsky in 1998. We also establish its relation to the r-quick convergence first introduced by Strassen in 1967 and extensively studied by Lai. Our work is motivated by various statistical problems, mostly in sequential analysis. As we show in the second part, generalizing and studying these convergence modes is important not only in probability theory but also for solving challenging statistical problems in hypothesis testing and changepoint detection for general stochastic non-i.i.d. models.
MSC:
60F15; 60G35; 60G40; 60J05; 62L10; 62C10; 62C20; 62F03; 62H15; 62M02; 62P30

1. Introduction

In [1], Hsu and Robbins introduced the notion of complete convergence, which is stronger than almost sure (a.s.) convergence. Hsu and Robbins used this notion to discuss certain aspects of the law of large numbers (LLN). In particular, let $X_1, X_2, \ldots$ be independent and identically distributed (i.i.d.) random variables with common mean $\mu = \mathrm{E}[X_1]$. Hsu and Robbins proved that, while in Kolmogorov's strong law of large numbers (SLLN) only the first-moment condition is needed for the sample mean $n^{-1}\sum_{t=1}^{n} X_t$ to converge to $\mu$ as $n\to\infty$, the complete version of the SLLN requires the second-moment condition $\mathrm{E}|X_1|^2 < \infty$ (finiteness of the variance). Later, Baum and Katz [2], working on the rate of convergence in the LLN, established that the second-moment condition is not only sufficient but also necessary for complete convergence. Strassen [3] introduced another mode of convergence, the r-quick convergence. When $r=1$, these two modes of convergence are closely related; in the case of i.i.d. random variables and the sample mean $n^{-1}\sum_{t=1}^{n} X_t$, they are identical. This fact and certain statistical applications motivated Tartakovsky [4] (see also Tartakovsky [5] and Tartakovsky et al. [6]) to introduce a natural generalization of complete convergence—the r-complete convergence, which turns out to be identical to the r-quick convergence in the i.i.d. case.
The goal of this overview paper is to discuss the importance of quick and complete convergence concepts for several challenging statistical applications. These modes of convergence are discussed in detail in the first part of this paper. Statistical applications, which constitute the second part of this paper, include such fields as sequential hypothesis testing and changepoint detection in general non-i.i.d. stochastic models when observations can be dependent and highly non-stationary. Specifically, in the second part, we first address near optimality of Wald’s sequential probability ratio test (SPRT) for testing two hypotheses regarding the distributions of non-i.i.d. data. We discuss Lai’s results in his fundamental paper [7], which was the first publication that used the r-quick convergence of the log-likelihood ratio processes to establish the asymptotic optimality of the SPRT as probabilities of errors go to zero. We then go on to tackle the much more difficult multi-decision problem of testing multiple hypotheses and show that certain multi-hypothesis sequential tests asymptotically minimize moments of the stopping time distribution up to the order r when properly normalized log-likelihood ratio processes between hypotheses converge r-quickly or r-completely to finite positive numbers. These results can be established based on the former works of the author (see, e.g., Tartakovsky [4,5] and Tartakovsky et al. [6]). The second challenging application is the quickest change detection when it is necessary to detect a change that occurs at an unknown point in time as rapidly as possible. We show, using the works of the author (see, e.g., [5,6] and the references therein), that certain popular changepoint detection procedures such as CUSUM, Shiryaev, and Shiryaev–Roberts procedures are asymptotically optimal as the false alarm rate is low when the normalized log-likelihood ratio processes converge r-completely to finite numbers.
The rest of the paper is organized as follows. Section 2 discusses pure probabilistic issues related to r-complete convergence and r-quick convergence. Section 3 explores statistical applications in sequential hypothesis testing and changepoint detection. Section 4 outlines sufficient conditions for the r-complete convergence for Markov and hidden Markov models, which is needed to establish the optimality properties of sequential hypothesis tests and changepoint detection procedures. Section 5 provides a final discussion and concludes the paper.

2. Modes of Convergence and the Law of Large Numbers

We begin by listing some standard definitions from probability theory. Let $(\Omega,\mathcal{F})$ be a measurable space, i.e., $\Omega$ is a set of elementary events $\omega$ and $\mathcal{F}$ is a sigma-algebra (a system of subsets of $\Omega$ satisfying standard conditions). A probability space is a triple $(\Omega,\mathcal{F},P)$, where P is a probability measure (a completely additive measure normalized to 1) defined on the sets from the sigma-algebra $\mathcal{F}$. More specifically, by Kolmogorov's axioms, the probability P satisfies: $P(A)\ge 0$ for any $A\in\mathcal{F}$; $P(\Omega)=1$; and $P\big(\bigcup_{i=1}^{\infty}A_i\big)=\sum_{i=1}^{\infty}P(A_i)$ for $A_i\in\mathcal{F}$ with $A_i\cap A_j=\varnothing$, $i\ne j$, where $\varnothing$ is the empty set.
A function $X=X(\omega)$ defined on $(\Omega,\mathcal{F})$ with values in a set $\mathcal{X}$ is called a random variable if it is $\mathcal{F}$-measurable, i.e., $\{\omega: X(\omega)\in B\}$ belongs to the sigma-algebra $\mathcal{F}$ for every measurable $B\subseteq\mathcal{X}$. The function $F(x)=P(\omega: X(\omega)\le x)$ is the distribution function of X, also referred to as the cumulative distribution function (cdf). The real-valued random variables $X_1,X_2,\ldots$ are independent if the events $\{X_1\le x_1\},\{X_2\le x_2\},\ldots$ are independent for every sequence $x_1,x_2,\ldots$ of real numbers. In what follows, we shall deal with real-valued random variables unless specified otherwise.

2.1. Standard Modes of Convergence

Let X be a random variable and let $\{X_n\}_{n\in\mathbb{Z}_+}$ ($\mathbb{Z}_+=\{0,1,2,\ldots\}$) be a sequence of random variables, both defined on the probability space $(\Omega,\mathcal{F},P)$. We now give several standard definitions and results related to the law of large numbers.
Convergence in Distribution (Weak Convergence).
Let $F_n(x)=P(\omega: X_n\le x)$ be the cdf of $X_n$ and let $F(x)=P(\omega: X\le x)$ be the cdf of X. We say that the sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to X in distribution (or in law, or weakly) as $n\to\infty$, and write $X_n \xrightarrow[n\to\infty]{\text{law}} X$, if
$$\lim_{n\to\infty} F_n(x) = F(x)$$
at all continuity points of $F(x)$.
Convergence in Probability.
We say that the sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to X in probability as $n\to\infty$, and write $X_n \xrightarrow[n\to\infty]{P} X$, if
$$\lim_{n\to\infty} P\big(|X_n - X|>\varepsilon\big) = 0 \quad\text{for every }\varepsilon>0.$$
Almost Sure Convergence.
We say that the sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to X almost surely (a.s.) or with probability 1 (w.p. 1) as $n\to\infty$ under the probability measure P, and write $X_n \xrightarrow[n\to\infty]{P\text{-a.s.}} X$, if
$$P\Big(\omega: \lim_{n\to\infty} X_n = X\Big) = 1. \qquad (1)$$
It is easily seen that (1) is equivalent to the condition
$$\lim_{n\to\infty} P\Big(\omega: \bigcup_{t=n}^{\infty}\{|X_t - X|>\varepsilon\}\Big)=0 \quad\text{for every }\varepsilon>0,$$
and that a.s. convergence implies convergence in probability, and convergence in probability implies convergence in distribution, while the converse statements are not generally true.
The following double implication, which establishes a necessary and sufficient condition (i.e., an equivalence) for a.s. convergence, is useful:
$$X_n \xrightarrow[n\to\infty]{\text{a.s.}} X \iff P\Big(\sup_{t\ge n}|X_t - X|>\varepsilon\Big) \xrightarrow[n\to\infty]{} 0 \ \text{ for all } \varepsilon>0. \qquad (2)$$
The following result is often useful.
Lemma 1. 
Let $f(t)$ be a non-negative increasing function with $\lim_{t\to\infty}f(t)=\infty$. If
$$\frac{X_n}{f(n)} \xrightarrow[n\to\infty]{P\text{-a.s.}} 0,$$
then
$$\lim_{n\to\infty} P\Big(\frac{1}{f(n)}\max_{0\le t\le n} X_t>\varepsilon\Big)=0 \quad\text{for every }\varepsilon>0.$$
Proof. 
For any $\varepsilon>0$, $n_0>0$ and $n>n_0$, we have
$$P\Big(\frac{1}{f(n)}\max_{0\le t\le n} X_t>\varepsilon\Big) \le P\Big(\frac{1}{f(n)}\max_{0\le t\le n_0} X_t>\varepsilon\Big) + P\Big(\frac{1}{f(n)}\max_{n_0<t\le n} X_t>\varepsilon\Big) \le P\Big(\frac{1}{f(n)}\max_{0\le t\le n_0} X_t>\varepsilon\Big) + P\Big(\sup_{t>n_0}\frac{X_t}{f(t)}>\varepsilon\Big).$$
Letting $n\to\infty$ and taking into account that
$$\lim_{n\to\infty} P\Big(\frac{1}{f(n)}\max_{0\le t\le n_0} X_t>\varepsilon\Big)=0,$$
we obtain
$$\limsup_{n\to\infty} P\Big(\frac{1}{f(n)}\max_{0\le t\le n} X_t>\varepsilon\Big) \le P\Big(\sup_{t>n_0}\frac{X_t}{f(t)}>\varepsilon\Big).$$
Since $n_0$ can be arbitrarily large, we can let $n_0\to\infty$; since, by assumption, $X_n/f(n)\xrightarrow[n\to\infty]{\text{a.s.}}0$, it follows from (2) that the upper bound approaches 0 as $n_0\to\infty$. This completes the proof. □
Random Walk.
Let $X_0,X_1,X_2,\ldots$ be random variables such that $X_1,X_2,\ldots$ are i.i.d. with mean $\mathrm{E}[X_n]=\mu$ for $n\ge1$, with the initial condition $X_0=x$. Then $S_n=\sum_{t=0}^{n}X_t$ is called a random walk with mean $x+\mu n$.
In what follows, in the case where $X_1,X_2,\ldots$ are i.i.d. random variables and $S_n=\sum_{t=0}^{n}X_t$, we prefer to formulate the results in terms of the random walk $\{S_n\}_{n\in\mathbb{Z}_+}$ (typically, but not necessarily, $S_0=0$).
We now recall two strong laws of large numbers (SLLN). Write $S_n=X_0+X_1+\cdots+X_n$ for the partial sum ($X_0=S_0=0$), so that $\{S_n\}_{n\in\mathbb{Z}_+}$ is a random walk with a zero initial condition as long as $X_1,X_2,\ldots$ are i.i.d. with mean $\mu$.
Kolmogorov's SLLN.
Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a random walk under the probability measure P. If $\mathrm{E}[S_1]$ exists, then the sample mean $S_n/n$ converges to the mean value $\mathrm{E}[S_1]$ w.p. 1, i.e.,
$$n^{-1}S_n \xrightarrow[n\to\infty]{P\text{-a.s.}} \mathrm{E}[S_1]. \qquad (4)$$
Conversely, if $n^{-1}S_n \xrightarrow[n\to\infty]{P\text{-a.s.}} \mu$, where $|\mu|<\infty$, then $\mathrm{E}[S_1]=\mu$.
Marcinkiewicz–Zygmund's SLLN.
Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a zero-mean random walk under the probability measure P. The two following statements are equivalent:
(i)
$\mathrm{E}|S_1|^p<\infty$ for $0<p<2$;
(ii)
$n^{-1/p}S_n \xrightarrow[n\to\infty]{P\text{-a.s.}} 0$.
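The Marcinkiewicz–Zygmund normalization is easy to probe numerically. The following minimal sketch (an illustration added to this overview, not code from the original article; all names are hypothetical) tracks $n^{-1/p}|S_n|$ along one simulated path with standard normal increments, for which $\mathrm{E}|X_1|^p<\infty$ for every $p$:

```python
import random

def mz_normalized_sums(p, n_values, seed=0):
    """One path of a zero-mean Gaussian random walk; returns n^{-1/p}|S_n|
    at the times listed in n_values (Marcinkiewicz-Zygmund normalization)."""
    rng = random.Random(seed)
    targets = sorted(set(n_values))
    out, s, i = [], 0.0, 0
    for n in range(1, targets[-1] + 1):
        s += rng.gauss(0.0, 1.0)           # i.i.d. N(0,1) increment
        if n == targets[i]:
            out.append(abs(s) / n ** (1.0 / p))
            i += 1
    return out

vals = mz_normalized_sums(p=1.5, n_values=[10, 1000, 100000])
```

Since $|S_n|$ grows like $\sqrt{n}$ while the normalization is $n^{1/p}$ with $1/p>1/2$, the reported values drift toward zero as $n$ grows, as statement (ii) predicts.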

2.2. Complete and r-Complete Convergence

We begin by discussing the issue of rates of convergence in the LLN.
Rates of Convergence.
Let $\{X_n\}_{n\in\mathbb{Z}_+}$ be a sequence of random variables and assume that $X_n$ converges to 0 w.p. 1 as $n\to\infty$. What is the rate of this convergence? In other words, we are concerned with the speed at which the tail probability $P(|X_n|>\varepsilon)$ decays to zero. This question can be answered by analyzing the behavior of the sums
$$\Sigma(r,\varepsilon) := \sum_{n=1}^{\infty} n^{r-1}\,P\big(|X_n|>\varepsilon\big) \quad\text{for some } r>0 \text{ and all } \varepsilon>0.$$
More specifically, if $\Sigma(r,\varepsilon)$ is finite for every $\varepsilon>0$, then the tail probability $P(|X_n|>\varepsilon)$ decays at a rate faster than $1/n^r$, so that $n^r P(|X_n|>\varepsilon)\to0$ for all $\varepsilon>0$ as $n\to\infty$.
To answer this question, we now consider modes of convergence that strengthen the almost sure convergence and therefore help determine the rate of convergence in the SLLN. Historically, this issue was first addressed in 1947 by Hsu and Robbins [1], who introduced the new mode of convergence which they called complete convergence.
Complete Convergence.
The sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to 0 completely if
$$\lim_{n\to\infty}\sum_{t=n}^{\infty} P\big(|X_t|>\varepsilon\big)=0 \quad\text{for every }\varepsilon>0,$$
which is equivalent to
$$\Sigma(1,\varepsilon) = \sum_{n=1}^{\infty} P\big(|X_n|>\varepsilon\big)<\infty \quad\text{for every }\varepsilon>0.$$
Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a random walk with mean $\mathrm{E}[S_n]=\mu n$. Kolmogorov's SLLN (4) implies that the sample mean $S_n/n$ converges to $\mu$ w.p. 1. Hsu and Robbins [1] proved that, under the first-moment condition $\mathrm{E}|S_1|<\infty$ alone, the sequence $\{n^{-1}S_n\}_{n\ge1}$ need not converge to $\mu$ completely, but it does so under the additional second-moment condition $\mathrm{E}|S_1|^2<\infty$. Thus, finiteness of the variance is a sufficient condition for complete convergence in the SLLN. They conjectured that the second-moment condition is not only sufficient but also necessary for complete convergence. It follows from these results that, if the variance is finite, then the rate of convergence in Kolmogorov's SLLN is $\lim_{n\to\infty} n\,P\big(|S_n/n-\mu|>\varepsilon\big)=0$ for all $\varepsilon>0$.
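For Gaussian increments, the tail probabilities entering the Hsu–Robbins condition are available in closed form, so the summability of $\Sigma(1,\varepsilon)$ can be checked directly. A small sketch (an added illustration, not from the original; it assumes $\mathcal{N}(\mu,1)$ steps, so $S_n/n-\mu\sim\mathcal{N}(0,1/n)$):

```python
import math

def mean_tail_prob(n, eps):
    """Exact P(|S_n/n - mu| > eps) for a random walk with N(mu, 1) steps:
    S_n/n - mu ~ N(0, 1/n), so the two-sided tail is erfc(eps * sqrt(n/2))."""
    return math.erfc(eps * math.sqrt(n / 2.0))

def sigma_partial(r, eps, n_max):
    """Partial sum of Sigma(r, eps) = sum_{n>=1} n^{r-1} P(|S_n/n - mu| > eps)."""
    return sum(n ** (r - 1) * mean_tail_prob(n, eps) for n in range(1, n_max + 1))
```

The partial sums of $\Sigma(1,\varepsilon)$ stabilize quickly (the tails decay exponentially in $n$), while $n\,P(|S_n/n-\mu|>\varepsilon)\to0$, in line with the rate statement above.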
In 1965, Baum and Katz [2] took a further step on this issue. In particular, the following result follows from Theorem 3 in [2] for a zero-mean random walk $\{S_n\}_{n\in\mathbb{Z}_+}$.
Theorem 1. 
Let $r>0$ and $\alpha>1/2$. If $\{S_n\}_{n\in\mathbb{Z}_+}$ is a zero-mean random walk, then the following statements are equivalent:
$$\mathrm{E}\big[|S_1|^{(r+1)/\alpha}\big]<\infty \iff \sum_{n=1}^{\infty} n^{r-1} P\big(n^{-\alpha}|S_n|>\varepsilon\big)<\infty \ \ \forall\,\varepsilon>0 \iff \sum_{n=1}^{\infty} n^{r-1} P\Big(\sup_{k\ge n} k^{-\alpha}|S_k|>\varepsilon\Big)<\infty \ \ \forall\,\varepsilon>0. \qquad (6)$$
Setting $r=1$ and $\alpha=1$ in (6), we obtain the equivalence
$$\mathrm{E}\big[|S_1|^2\big]<\infty \iff \sum_{n=1}^{\infty} P\big(|n^{-1}S_n|>\varepsilon\big)<\infty \ \text{ for all }\varepsilon>0,$$
which shows that the conjecture of Hsu and Robbins is correct: the second-moment condition $\mathrm{E}|S_1|^2<\infty$ is both necessary and sufficient for the complete convergence
$$n^{-1}S_n \xrightarrow[n\to\infty]{P\text{-completely}} 0.$$
Furthermore, if for some $r>0$ the $(r+1)$-th moment is finite, $\mathrm{E}|S_1|^{r+1}<\infty$, then the rate of convergence in the SLLN is $\lim_{n\to\infty} n^r P\big(|n^{-1}S_n|>\varepsilon\big)=0$ for all $\varepsilon>0$.
The previous results suggest that it is reasonable to generalize the notion of complete convergence to the following mode of convergence, which we will refer to as r-complete convergence and which is also related to the so-called r-quick convergence discussed later on (see Section 2.3).
Definition 1 (r-Complete Convergence).
Let $r>0$. We say that the sequence of random variables $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to X r-completely as $n\to\infty$ under the probability measure P, and write $X_n \xrightarrow[n\to\infty]{P\text{-}r\text{-completely}} X$, if
$$\Sigma(r,\varepsilon) := \sum_{n=1}^{\infty} n^{r-1} P\big(|X_n - X|>\varepsilon\big)<\infty \quad\text{for every }\varepsilon>0.$$
Note that the a.s. convergence of $\{X_n\}$ to X can be equivalently written as
$$\lim_{n\to\infty} P\Big(\bigcup_{t=n}^{\infty}\{|X_t - X|>\varepsilon\}\Big)=0 \quad\text{for every }\varepsilon>0,$$
so that r-complete convergence with $r\ge1$ implies a.s. convergence, but the converse is not true in general.
Suppose that $X_n$ converges a.s. to X. If $\Sigma(r,\varepsilon)$ is finite for every $\varepsilon>0$, then
$$\lim_{n\to\infty}\sum_{t=n}^{\infty} t^{r-1} P\big(|X_t - X|>\varepsilon\big)=0 \quad\text{for every }\varepsilon>0,$$
and the probability $P(|X_n-X|>\varepsilon)$ goes to 0 as $n\to\infty$ at a rate faster than $1/n^r$. Hence, as already mentioned above, r-complete convergence allows one to determine the rate of convergence of $X_n$ to X, i.e., to answer the question of how fast the tail probability $P(|X_n-X|>\varepsilon)$ decays to zero.
The following result provides a very useful implication of complete convergence.
Theorem 2. 
Let $\{X_n\}_{n\in\mathbb{Z}_+}$ and $\{Y_n\}_{n\in\mathbb{Z}_+}$ be two arbitrary, possibly dependent sequences of random variables. Assume that there are positive and finite numbers $\mu_1$ and $\mu_2$ such that
$$\sum_{n=1}^{\infty} P\Big(\Big|\frac{1}{n}X_n-\mu_1\Big|>\varepsilon\Big)<\infty \quad\text{for every }\varepsilon>0 \qquad (8)$$
and
$$\sum_{n=1}^{\infty} P\Big(\Big|\frac{1}{n}Y_n-\mu_2\Big|>\varepsilon\Big)<\infty \quad\text{for every }\varepsilon>0, \qquad (9)$$
i.e., $n^{-1}X_n \xrightarrow[n\to\infty]{P\text{-completely}} \mu_1$ and $n^{-1}Y_n \xrightarrow[n\to\infty]{P\text{-completely}} \mu_2$. If $\mu_1\ge\mu_2$, then for any random time T,
$$P\big(X_T<b,\ Y_{T+1}\ge b(1+\delta)\big)\longrightarrow 0 \ \text{ as } b\to\infty \ \text{ for any }\delta>0. \qquad (10)$$
Proof. 
Fix $\delta>0$, $c\in(0,\delta)$, and let $N_b=\lceil(1+c)b/\mu_2\rceil$ be the smallest integer larger than or equal to $(1+c)b/\mu_2$. Observe that
$$P\big(X_T<b,\ Y_{T+1}\ge b(1+\delta)\big) \le P\big(X_T\le b,\ T\ge N_b\big) + P\big(Y_{T+1}\ge(1+\delta)b,\ T<N_b\big) \le P\big(X_T\le b,\ T\ge N_b\big) + P\Big(\max_{1\le n\le N_b} Y_n\ge(1+\delta)b\Big).$$
Thus, to prove (10), it suffices to show that the two terms on the right-hand side go to 0 as $b\to\infty$.
For the first term, we notice that, for any $n\ge N_b$,
$$\frac{b}{n} \le \frac{b}{N_b} \le \frac{\mu_2}{1+c} \le \frac{\mu_1}{1+c} < \mu_1,$$
so that
$$P\big(X_T\le b,\ T\ge N_b\big) = \sum_{n=N_b}^{\infty} P\big(X_n\le b,\ T=n\big) \le \sum_{n=N_b}^{\infty} P\Big(\frac{X_n}{n}\le\frac{b}{n}\Big) \le \sum_{n=N_b}^{\infty} P\Big(\frac{X_n}{n}\le\frac{\mu_1}{1+c}\Big) = \sum_{n=N_b}^{\infty} P\Big(\frac{X_n}{n}-\mu_1\le-\frac{c}{1+c}\mu_1\Big).$$
Since $N_b\to\infty$ as $b\to\infty$, the upper bound goes to 0 as $b\to\infty$ due to condition (8).
Next, since $c\in(0,\delta)$, there exists $\varepsilon>0$ such that
$$\frac{(1+\delta)b}{N_b} = \frac{(1+\delta)b}{\lceil b(1+c)/\mu_2\rceil} \ge (1+\varepsilon)\mu_2.$$
As a result,
$$P\Big(\max_{1\le n\le N_b} Y_n\ge(1+\delta)b\Big) \le P\Big(\frac{1}{N_b}\max_{1\le n\le N_b} Y_n\ge(1+\varepsilon)\mu_2\Big),$$
where the upper bound goes to 0 as $b\to\infty$ by condition (9) (see Lemma 1). □
Remark 1. 
The proof suggests that assertion (10) of Theorem 2 holds under the following one-sided conditions:
$$P\Big(n^{-1}\max_{1\le s\le n} Y_s - \mu_2 > \varepsilon\Big) \xrightarrow[n\to\infty]{} 0, \qquad \sum_{n=1}^{\infty} P\big(n^{-1}X_n - \mu_1 < -\varepsilon\big)<\infty.$$
The complete convergence conditions (8) and (9) guarantee both of these conditions.
Remark 2. 
Theorem 2 can be applied to the overshoot problem. Indeed, if $X_n=Y_n=Z_n$ and the random time T is the first time n at which $Z_n$ exceeds the level b, $T=\inf\{n\ge1: Z_n>b\}$, then Theorem 2 shows that the relative excess over the boundary (overshoot) $(Z_T-b)/b$ converges to 0 in probability as $b\to\infty$ when $Z_n/n$ converges completely, as $n\to\infty$, to a positive number $\mu$.
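The overshoot statement in Remark 2 can be checked by simulation. The sketch below (an added Monte Carlo illustration assuming a Gaussian random walk with unit drift; the function name is hypothetical) estimates the mean relative overshoot for a given level b:

```python
import random

def avg_relative_overshoot(b, drift=1.0, reps=500, seed=1):
    """Estimate E[(Z_T - b)/b], where Z_n is a Gaussian random walk with
    positive drift and T = inf{n >= 1 : Z_n > b}.  As b grows, the absolute
    overshoot Z_T - b stays O(1), so the relative overshoot vanishes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        z = 0.0
        while z <= b:                      # run until the level b is crossed
            z += rng.gauss(drift, 1.0)
        total += (z - b) / b               # relative overshoot of this run
    return total / reps
```

Comparing a small level with a large one shows the expected $O(1/b)$ decay of the relative overshoot.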

2.3. r-Quick Convergence

In 1967, Strassen [3] introduced the notion of r-quick limit points of a sequence of random variables. The r-quick convergence has been further addressed by Lai [7,8], Chow and Lai [9], Fuh and Zhang [10], and Tartakovsky [4,5] (see certain details in Section 2.4).
We define r-quick convergence in a way suitable for this paper. Let { X n } n Z + be a sequence of real-valued random variables and let X be a random variable defined on the same probability space ( Ω , F , P ) .
Definition 2 (r-Quick Convergence).
Let $r>0$ and, for $\varepsilon>0$, let
$$L_\varepsilon = \sup\big\{n\ge1: |X_n-X|>\varepsilon\big\} \quad (\sup\{\varnothing\}=0)$$
be the last entry time of $X_n$ into the region $(X+\varepsilon,\infty)\cup(-\infty,X-\varepsilon)$. We say that the sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to X r-quickly as $n\to\infty$ under the probability measure P, and write $X_n \xrightarrow[n\to\infty]{P\text{-}r\text{-quickly}} X$, if and only if
$$\mathrm{E}\big[L_\varepsilon^r\big]<\infty \quad\text{for every }\varepsilon>0,$$
where E is the operator of expectation under the probability measure P.
This definition can be generalized to random variables X, $\{X_n\}_{n\in\mathbb{Z}_+}$ taking values in a metric space $(\mathcal{X},d)$ with distance d: $X_n \xrightarrow[n\to\infty]{r\text{-quickly}} X$ if
$$\mathrm{E}\Big[\big(\sup\{n\ge1: d(X,X_n)>\varepsilon\}\big)^r\Big]<\infty \quad\text{for every }\varepsilon>0.$$
Note that the a.s. convergence $X_n\to\mu$ ($|\mu|<\infty$) as $n\to\infty$ to a constant $\mu$ can be expressed as $P(L_\varepsilon(\mu)<\infty)=1$, where $L_\varepsilon(\mu)=\sup\{n\ge1: |X_n-\mu|>\varepsilon\}$. Therefore, r-quick convergence implies convergence w.p. 1, but not conversely.
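The last entry time is also easy to examine by simulation. The following sketch (an added illustration assuming a Gaussian random walk with drift $\mu$, truncated at a finite horizon; the function name is hypothetical) approximates $L_\varepsilon(\mu)$, which by definition is non-increasing in $\varepsilon$ along a fixed path:

```python
import random

def last_entry_time(eps, mu=1.0, horizon=20000, seed=2):
    """Approximate L_eps(mu) = sup{n >= 1 : |S_n/n - mu| > eps} for a random
    walk with N(mu, 1) steps, truncated at a finite simulation horizon."""
    rng = random.Random(seed)
    s, last = 0.0, 0
    for n in range(1, horizon + 1):
        s += rng.gauss(mu, 1.0)
        if abs(s / n - mu) > eps:   # sample mean still outside the eps-band
            last = n
    return last
```

Shrinking $\varepsilon$ enlarges the exceedance region, so the recorded last entry time can only grow, mirroring the monotonicity of $L_\varepsilon$ in the definition.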
Also, in general, r-quick convergence is stronger than r-complete convergence. Specifically, the following lemma shows that
$$\max_{1\le t\le n} X_t \xrightarrow[n\to\infty]{r\text{-completely}} \mu \implies X_n \xrightarrow[n\to\infty]{r\text{-quickly}} \mu \implies X_n \xrightarrow[n\to\infty]{r\text{-completely}} \mu. \qquad (12)$$
Lemma 2. 
Let $\{X_n\}_{n\in\mathbb{Z}_+}$ be a sequence of random variables. Let $f(t)$ be a non-negative increasing function with $f(0)=0$ and $\lim_{t\to\infty}f(t)=+\infty$, and for $\varepsilon>0$ let
$$L_\varepsilon(f) = \sup\big\{n\ge1: |X_n|>\varepsilon f(n)\big\} \quad (\sup\{\varnothing\}=0)$$
be the last time that $X_n$ leaves the interval $[-\varepsilon f(n),+\varepsilon f(n)]$.
(i) 
For any $r>0$ and any $\varepsilon>0$, the following inequalities hold:
$$r\sum_{n=1}^{\infty} n^{r-1} P\big(|X_n|\ge\varepsilon f(n)\big) \;\le\; \mathrm{E}\big[L_\varepsilon(f)^r\big] \;\le\; r\sum_{n=1}^{\infty} n^{r-1} P\Big(\sup_{t\ge n}\frac{|X_t|}{f(t)}\ge\varepsilon\Big). \qquad (13)$$
Therefore,
$$\sum_{n=1}^{\infty} n^{r-1} P\Big(\sup_{t\ge n}\frac{|X_t|}{f(t)}\ge\varepsilon\Big)<\infty \ \text{ for all }\varepsilon>0 \implies X_n \xrightarrow[n\to\infty]{r\text{-quickly}} 0.$$
(ii) 
If $f(t)$ is a power function, $f(t)=t^\gamma$ with $\gamma>0$, then the finiteness of
$$\sum_{n=1}^{\infty} n^{r-1} P\Big(\max_{1\le t\le n}|X_t|\ge\varepsilon n^\gamma\Big)$$
for some $r>0$ and every $\varepsilon>0$ implies the r-quick convergence of $X_n$ to 0:
$$\sum_{n=1}^{\infty} n^{r-1} P\Big(\max_{1\le t\le n}|X_t|\ge\varepsilon n^\gamma\Big)<\infty \ \ \forall\,\varepsilon>0 \implies \mathrm{E}\big[L_\varepsilon(\gamma)^r\big]<\infty \ \ \forall\,\varepsilon>0, \qquad (14)$$
where $L_\varepsilon(\gamma)=\sup\{n\ge1: |X_n|>\varepsilon n^\gamma\}$.
Proof. 
Proof of (i). Obviously,
$$P\big(|X_n|\ge\varepsilon f(n)\big) \le P\big(L_\varepsilon(f)\ge n\big) \le P\Big(\sup_{t\ge n}\frac{|X_t|}{f(t)}\ge\varepsilon\Big),$$
from which the inequalities (13) follow immediately.
Proof of (ii). Write $M_u=\max_{1\le n\le\lceil u\rceil}|X_n|$, where $\lceil u\rceil$ is the smallest integer greater than or equal to u. We have the following chain of inequalities and equalities:
$$\mathrm{E}\big[L_{2\varepsilon}(\gamma)^r\big] \le r\int_0^\infty t^{r-1} P\Big(\sup_{u\ge t} u^{-\gamma}|X_u|\ge2\varepsilon\Big)dt \le r\int_0^\infty t^{r-1} P\Big(\sup_{u\ge t}\big(|X_u|-\varepsilon u^\gamma\big)\ge\varepsilon t^\gamma\Big)dt \le r\int_0^\infty t^{r-1} P\Big(\sup_{u>0}\big(|X_u|-\varepsilon u^\gamma\big)\ge\varepsilon t^\gamma\Big)dt$$
$$\le r\sum_{n=1}^{\infty}\int_0^\infty t^{r-1} P\Big(\sup_{(2^{n-1}-1)t^\gamma<u^\gamma\le(2^n-1)t^\gamma}\big(|X_u|-\varepsilon u^\gamma\big)\ge\varepsilon t^\gamma\Big)dt \le r\sum_{n=1}^{\infty}\int_0^\infty t^{r-1} P\Big(\sup_{u^\gamma\le2^n t^\gamma}|X_u|\ge2^{n-1}\varepsilon t^\gamma\Big)dt$$
$$= r\sum_{n=1}^{\infty}\int_0^\infty t^{r-1} P\Big(M_{2^{n/\gamma}t}\ge2^{n-1}\varepsilon t^\gamma\Big)dt = r\sum_{n=1}^{\infty}2^{-nr/\gamma}\int_0^\infty u^{r-1} P\big(M_u\ge(\varepsilon/2)u^\gamma\big)du.$$
It follows that
$$\mathrm{E}\big[L_{2\varepsilon}(\gamma)^r\big] \le \frac{r}{2^{r/\gamma}-1}\int_0^\infty u^{r-1} P\big(M_u\ge(\varepsilon/2)u^\gamma\big)du,$$
and bounding the integral by the corresponding series over integer values of u yields the implication (14), since $\varepsilon>0$ is arbitrary. This completes the proof. □
The following theorem shows that, in the i.i.d. case, the implications in (12) become equivalences.
Theorem 3. 
Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a zero-mean random walk. The following statements are equivalent:
$$\mathrm{E}|S_1|^{r+1}<\infty \iff n^{-1}S_n \xrightarrow[n\to\infty]{r\text{-completely}} 0, \qquad (15)$$
$$\mathrm{E}|S_1|^{r+1}<\infty \iff n^{-1}S_n \xrightarrow[n\to\infty]{r\text{-quickly}} 0, \qquad (16)$$
$$\mathrm{E}|S_1|^{r+1}<\infty \iff \sum_{n=1}^{\infty} n^{r-1} P\Big(\sup_{t\ge n}\frac{1}{t}|S_t|>\varepsilon\Big)<\infty \ \text{ for all }\varepsilon>0. \qquad (17)$$
Proof. 
By Theorem 1 (with $\alpha=1$), in the i.i.d. case,
$$\mathrm{E}|S_1|^{r+1}<\infty \iff \sum_{n=1}^{\infty} n^{r-1} P\Big(\frac{1}{n}|S_n|>\varepsilon\Big)<\infty \ \ \forall\,\varepsilon>0 \qquad (18)$$
and
$$\mathrm{E}|S_1|^{r+1}<\infty \iff \sum_{n=1}^{\infty} n^{r-1} P\Big(\sup_{t\ge n}\frac{1}{t}|S_t|>\varepsilon\Big)<\infty \ \ \forall\,\varepsilon>0, \qquad (19)$$
so that assertion (15) follows from (18), and (17) follows from (19).
Next, let
$$L_\varepsilon = \sup\big\{n\ge1: |S_n|\ge n\varepsilon\big\} \quad (\sup\varnothing=0).$$
By Lemma 2(i),
$$\mathrm{E}\big[L_\varepsilon^r\big] \le r\sum_{n=1}^{\infty} n^{r-1} P\Big(\sup_{t\ge n}|S_t|/t\ge\varepsilon\Big) \ \ \forall\,\varepsilon>0,$$
which, along with (19), implies (16). □

2.4. Further Remarks on r-Complete Convergence, r-Quick Convergence, and Rates of Convergence in SLLN

Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a random walk. Without loss of generality, let $S_0=0$ and $\mathrm{E}[S_1]=0$.
1.
Strassen [3] proved, in particular, that if $f(n)=(2n\log n)^{1/2}$ in Lemma 2, then for $r>0$
$$\limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log n}} = \sqrt{r\,\mathrm{E}[S_1^2]} \quad r\text{-quickly} \qquad (21)$$
whenever $\mathrm{E}|S_1|^p<\infty$ for $p>2(r+1)$. He also proved a functional form of the law of the iterated logarithm.
2.
Lai [8] improved this result, showing that Strassen's moment condition $\mathrm{E}|S_1|^p<\infty$ for $p>2(r+1)$ can be relaxed. Specifically, he showed that the weaker condition
$$\mathrm{E}\left[\frac{|S_1|^{2(r+1)}}{(\log^+|S_1|+1)^{r+1}}\right]<\infty \quad\text{for } r>0 \qquad (22)$$
is the best one can do (i.e., it is both necessary and sufficient):
$$\mathrm{E}\left[\frac{|S_1|^{2(r+1)}}{(\log^+|S_1|+1)^{r+1}}\right]<\infty \iff \limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log n}}<\infty \quad r\text{-quickly},$$
in which case equality (21) holds.
Note, however, that for $r=0$, in terms of a.s. convergence,
$$\mathrm{E}|S_1|^2<\infty \iff \limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log\log n}} = \sqrt{\mathrm{E}|S_1|^2} \quad\text{a.s.},$$
but under condition (22), for all $r>0$,
$$\limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log\log n}} = \infty \quad r\text{-quickly}.$$
3.
Let $\alpha>1/2$ and $r>0$. Chow and Lai [9] established the following one-sided inequality for tail probabilities:
$$\sum_{n=1}^{\infty} n^{r-1} P\Big(\max_{1\le t\le n} S_t\ge n^\alpha\Big) \le C_{r,\alpha}\Big(\mathrm{E}\big[(S_1^+)^{(r+1)/\alpha}\big] + \big(\mathrm{E}[S_1^2]\big)^{r/(2\alpha-1)}\Big) \qquad (23)$$
whenever $\mathrm{E}|S_1|^2<\infty$. Under the same hypotheses, this one-sided inequality implies the two-sided one:
$$\sum_{n=1}^{\infty} n^{r-1} P\Big(\max_{1\le t\le n}|S_t|\ge n^\alpha\Big) \le C_{r,\alpha}\Big(\mathrm{E}\big[|S_1|^{(r+1)/\alpha}\big] + \big(\mathrm{E}[S_1^2]\big)^{r/(2\alpha-1)}\Big). \qquad (24)$$
The upper bound in (24) turns out to be sharp, since the lower bound also holds:
$$\sum_{n=1}^{\infty} n^{r-1} P\Big(\max_{1\le t\le n}|S_t|\ge n^\alpha\Big) \ge -1 + B_{r,\alpha}\Big(\mathrm{E}\big[|S_1|^{(r+1)/\alpha}\big] + \big(\mathrm{E}[S_1^2]\big)^{r/(2\alpha-1)}\Big).$$
Here, the constants $C_{r,\alpha}$ and $B_{r,\alpha}$ are universal, depending only on r and α.
The results of Chow and Lai [9] provide one-sided analogues of the results of Baum and Katz [2], as well as extend them. Indeed, the one-sided inequality (23) implies that the following statements are equivalent for the zero-mean random walk $\{S_n\}_{n\in\mathbb{N}}$:
(i)
$\mathrm{E}\big[(S_1^+)^{(r+1)/\alpha}\big]<\infty$;
(ii)
$\sum_{n=1}^{\infty} n^{r-1} P\big(n^{-\alpha}S_n\ge\varepsilon\big)<\infty$ for all $\varepsilon>0$;
(iii)
$\sum_{n=1}^{\infty} n^{r-1} P\big(\sup_{k\ge n} k^{-\alpha}S_k\ge\varepsilon\big)<\infty$ for all $\varepsilon>0$,
where $\alpha>1/2$.
Clearly, the two-sided inequality (24) yields the assertions of Theorem 1.
4.
The Marcinkiewicz–Zygmund SLLN states that, for $\alpha>1/2$, the following implication holds:
$$\mathrm{E}|S_1|^{1/\alpha}<\infty \implies n^{-\alpha}S_n \xrightarrow[n\to\infty]{\text{a.s.}} 0.$$
The strengthened r-quick equivalent of this SLLN is: for any $r>0$ and $\alpha>1/2$, the following statements are equivalent:
$$\mathrm{E}\big[|S_1|^{(r+1)/\alpha}\big]<\infty \iff \sum_{n=1}^{\infty} n^{r-1} P\Big(\frac{1}{n^\alpha}|S_n|>\varepsilon\Big)<\infty \ \forall\,\varepsilon>0 \iff \sum_{n=1}^{\infty} n^{r-1} P\Big(\sup_{k\ge n}\frac{1}{k^\alpha}|S_k|>\varepsilon\Big)<\infty \ \forall\,\varepsilon>0 \iff n^{-\alpha}S_n \xrightarrow[n\to\infty]{r\text{-quickly}} 0. \qquad (26)$$
The implications (26) follow from Theorem 1, Theorem 3, and inequality (24). The proof is almost obvious and is omitted.

3. Applications of r-Complete and r-Quick Convergences in Statistics

In this section, we outline certain statistical applications which show the usefulness of r-complete and r-quick versions of the SLLN.

3.1. Sequential Hypothesis Testing

We begin by formulating the following multi-hypothesis testing problem for a general non-i.i.d. stochastic model. Let $(\Omega,\mathcal{F},\mathcal{F}_n,P)$, $n\in\mathbb{Z}_+=\{0,1,2,\ldots\}$, be a filtered probability space with the standard assumptions about the monotonicity of the sub-$\sigma$-algebras $\mathcal{F}_n$. The sub-$\sigma$-algebra $\mathcal{F}_n=\sigma(\mathbf{X}^n)$ of $\mathcal{F}$ is assumed to be generated by the sequence $\mathbf{X}^n=\{X_t,\ 1\le t\le n\}$ observed up to time n, which is defined on the space $(\Omega,\mathcal{F})$. The hypotheses are $H_i: P=P_i$, $i=0,1,\ldots,N$, where $P_0,P_1,\ldots,P_N$ are given probability measures assumed to be locally mutually absolutely continuous, i.e., their restrictions $P_i^n$ and $P_j^n$ to $\mathcal{F}_n$ are equivalent for all $1\le n<\infty$ and all $i,j=0,1,\ldots,N$, $i\ne j$. Let $Q^n$ be the restriction to $\mathcal{F}_n$ of a $\sigma$-finite measure Q on $(\Omega,\mathcal{F})$. Under $P_i$, the sample $\mathbf{X}^n=(X_1,\ldots,X_n)$ has a joint density $p_{i,n}(\mathbf{X}^n)$ with respect to the dominating measure $Q^n$ for all $n\in\mathbb{N}$, which can be written as
$$p_{i,n}(\mathbf{X}^n) = \prod_{t=1}^{n} f_{i,t}\big(X_t\mid\mathbf{X}^{t-1}\big), \qquad (27)$$
where $f_{i,n}(X_n\mid\mathbf{X}^{n-1})$, $n\ge1$, are the corresponding conditional densities.
For $n\in\mathbb{N}$, define the likelihood ratio (LR) process between the hypotheses $H_i$ and $H_j$,
$$\Lambda_{ij}(n) = \frac{dP_i^n}{dP_j^n}(\mathbf{X}^n) = \frac{p_{i,n}(\mathbf{X}^n)}{p_{j,n}(\mathbf{X}^n)} = \prod_{t=1}^{n}\frac{f_{i,t}(X_t\mid\mathbf{X}^{t-1})}{f_{j,t}(X_t\mid\mathbf{X}^{t-1})},$$
and the log-likelihood ratio (LLR) process
$$\lambda_{ij}(n) = \log\Lambda_{ij}(n) = \sum_{t=1}^{n}\log\frac{f_{i,t}(X_t\mid\mathbf{X}^{t-1})}{f_{j,t}(X_t\mid\mathbf{X}^{t-1})}.$$
A multi-hypothesis sequential test is a pair $\delta=(d,T)$, where T is a stopping time with respect to the filtration $\{\mathcal{F}_n\}_{n\in\mathbb{Z}_+}$ and $d=d(\mathbf{X}^T)$ is an $\mathcal{F}_T$-measurable terminal decision function with values in the set $\{0,1,\ldots,N\}$. Specifically, $d=i$ means that the hypothesis $H_i$ is accepted upon stopping, i.e., $\{d=i\}=\{T<\infty,\ \delta\text{ accepts } H_i\}$. Let $\alpha_{ij}(\delta)=P_i(d=j)$, $i\ne j$, $i,j=0,1,\ldots,N$, denote the error probabilities of the test $\delta$, i.e., the probabilities of accepting the hypothesis $H_j$ when $H_i$ is true.
Introduce the class of tests with error probabilities $\alpha_{ij}(\delta)$ that do not exceed the prespecified numbers $0<\alpha_{ij}<1$:
$$\mathcal{C}(\boldsymbol{\alpha}) = \big\{\delta: \alpha_{ij}(\delta)\le\alpha_{ij}\ \text{for } i,j=0,1,\ldots,N,\ i\ne j\big\}, \qquad (28)$$
where $\boldsymbol{\alpha}=(\alpha_{ij})$ is a matrix of given error probabilities that are positive numbers less than 1.
Let $\mathrm{E}_i$ denote the expectation under the hypothesis $H_i$ (i.e., under the measure $P_i$). The goal of the statistician is to find a sequential test that minimizes the expected sample sizes $\mathrm{E}_i[T]$ for all hypotheses $H_i$, $i=0,1,\ldots,N$, at least approximately, say asymptotically for small error probabilities, i.e., as $\alpha_{ij}\to0$.

3.1.1. Asymptotic Optimality of Wald's SPRT

First, assume that $N=1$, i.e., that we are dealing with two hypotheses $H_0$ and $H_1$. In the mid-1940s, Wald [11,12] introduced the sequential probability ratio test (SPRT) for a sequence of i.i.d. observations $X_1,X_2,\ldots$, in which case $f_{i,t}(X_t\mid\mathbf{X}^{t-1})=f_i(X_t)$ in (27) and the LR $\Lambda_{1,0}(n)=\Lambda_n$ is
$$\Lambda_n = \prod_{t=1}^{n}\frac{f_1(X_t)}{f_0(X_t)}.$$
After n observations have been made, Wald's SPRT prescribes, for each $n\ge1$:
$$\text{stop and accept } H_1 \text{ if } \Lambda_n\ge A_1; \quad \text{stop and accept } H_0 \text{ if } \Lambda_n\le A_0; \quad \text{continue sampling if } A_0<\Lambda_n<A_1,$$
where $0<A_0<1<A_1$ are two thresholds.
Let $Z_t=\log[f_1(X_t)/f_0(X_t)]$ be the LLR of the observation $X_t$, so that the LLR of the sample $\mathbf{X}^n$ is the sum
$$\lambda_{10}(n) = \lambda_n = \sum_{t=1}^{n} Z_t, \quad n=1,2,\ldots$$
Let $a_0=-\log A_0>0$ and $a_1=\log A_1>0$. The SPRT $\delta^*(a_0,a_1)=(d^*,T^*)$ can be represented in the form
$$T^*(a_0,a_1) = \inf\big\{n\ge1: \lambda_n\notin(-a_0,a_1)\big\}, \qquad d^*(a_0,a_1) = \begin{cases}1 & \text{if } \lambda_{T^*}\ge a_1,\\ 0 & \text{if } \lambda_{T^*}\le-a_0.\end{cases} \qquad (29)$$
In the case of two hypotheses, the class of tests (28) takes the form
$$\mathcal{C}(\alpha_0,\alpha_1) = \big\{\delta: \alpha_0(\delta)\le\alpha_0 \ \text{and} \ \alpha_1(\delta)\le\alpha_1\big\}.$$
That is, it includes hypothesis tests with upper bounds $\alpha_0$ and $\alpha_1$ on the probabilities of errors of Type 1 (false positive) $\alpha_0(\delta)=\alpha_{0,1}(\delta)$ and Type 2 (false negative) $\alpha_1(\delta)=\alpha_{1,0}(\delta)$, respectively.
Wald's SPRT has an extraordinary optimality property: it minimizes both expected sample sizes $\mathrm{E}_0[T]$ and $\mathrm{E}_1[T]$ in the class of sequential (and non-sequential) tests $\mathcal{C}(\alpha_0,\alpha_1)$ with given error probabilities as long as the observations are i.i.d. under both hypotheses. More specifically, Wald and Wolfowitz [13] proved, using a Bayesian approach, that if $\alpha_0+\alpha_1<1$ and the thresholds $a_0$ and $a_1$ can be selected in such a way that $\alpha_0(\delta^*)=\alpha_0$ and $\alpha_1(\delta^*)=\alpha_1$, then the SPRT $\delta^*$ is strictly optimal in the class $\mathcal{C}(\alpha_0,\alpha_1)$. A rigorous proof of this fundamental result is tedious and involves several delicate technical details. Alternative proofs can be found in [14,15,16,17,18].
Setting aside the strict optimality of the SPRT, which holds if and only if the thresholds are selected so that the error probabilities of the SPRT are exactly equal to the prescribed values $\alpha_0,\alpha_1$ (which is usually impossible), suppose that the thresholds $a_0$ and $a_1$ are selected so that
$$a_0 \sim \log(1/\alpha_1) \quad\text{and}\quad a_1 \sim \log(1/\alpha_0) \quad\text{as } \alpha_{\max}\to0. \qquad (30)$$
Then
$$\mathrm{E}_1[T^*] \sim \frac{|\log\alpha_0|}{I_1}, \qquad \mathrm{E}_0[T^*] \sim \frac{|\log\alpha_1|}{I_0} \quad\text{as } \alpha_{\max}\to0, \qquad (31)$$
where $I_1=\mathrm{E}_1[Z_1]$ and $I_0=-\mathrm{E}_0[Z_1]$ are the Kullback–Leibler (K–L) information numbers, so that the following asymptotic lower bounds for the expected sample sizes are attained by the SPRT:
$$\inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathrm{E}_1[T] \ge \frac{|\log\alpha_0|}{I_1}\big(1+o(1)\big), \qquad \inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathrm{E}_0[T] \ge \frac{|\log\alpha_1|}{I_0}\big(1+o(1)\big) \quad\text{as } \alpha_{\max}\to0$$
(cf. [6]). Hereafter, $\alpha_{\max}=\max(\alpha_0,\alpha_1)$.
The following inequalities for the error probabilities of the SPRT hold in the most general non-i.i.d. case:
$$\alpha_1(\delta^*) \le e^{-a_0}\big[1-\alpha_0(\delta^*)\big], \qquad \alpha_0(\delta^*) \le e^{-a_1}\big[1-\alpha_1(\delta^*)\big].$$
These bounds can be used to guarantee the asymptotic relations (30).
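These bounds and the expected sample size approximations are easy to illustrate numerically. The sketch below (an added Monte Carlo illustration for the simple Gaussian case $H_0:\mathcal{N}(0,1)$ versus $H_1:\mathcal{N}(1,1)$, for which $I_0=I_1=1/2$; not the author's code, and the function name is hypothetical) runs the SPRT and reports the empirical frequency of accepting $H_1$ together with the average sample size:

```python
import random

def sprt_gaussian(a0, a1, theta_true, reps=2000, seed=3):
    """Wald's SPRT for H0: X_t ~ N(0,1) vs H1: X_t ~ N(1,1).  The per-sample
    LLR is Z_t = X_t - 1/2; sampling stops once the cumulative LLR leaves
    (-a0, a1).  Returns (fraction of runs accepting H1, average sample size)."""
    rng = random.Random(seed)
    accept_h1, total_n = 0, 0
    for _ in range(reps):
        llr, n = 0.0, 0
        while -a0 < llr < a1:
            n += 1
            llr += rng.gauss(theta_true, 1.0) - 0.5   # LLR increment Z_t
        total_n += n
        accept_h1 += llr >= a1
    return accept_h1 / reps, total_n / reps
```

For instance, with $a_0=a_1=\log 100\approx4.6$, the empirical Type 1 error under $H_0$ should respect the bound $\alpha_0(\delta^*)\le e^{-a_1}=0.01$ up to Monte Carlo error, while the average sample size under $H_1$ is close to $a_1/I_1\approx9.2$ observations.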
In the i.i.d. case, by the SLLN, the LLR $\lambda_n$ has the following stability property:
$$n^{-1}\lambda_n \xrightarrow[n\to\infty]{P_1\text{-a.s.}} I_1, \qquad n^{-1}(-\lambda_n) \xrightarrow[n\to\infty]{P_0\text{-a.s.}} I_0. \qquad (33)$$
This allows one to conjecture that if, in the general non-i.i.d. case, the LLR is also stable in the sense that the almost sure convergence conditions (33) are satisfied with some positive and finite numbers $I_1$ and $I_0$, then the asymptotic formulas (31) still hold. In the general case, these numbers represent the local K–L information in the sense that often (though not always) $I_1=\lim_{n\to\infty} n^{-1}\mathrm{E}_1[\lambda_n]$ and $I_0=-\lim_{n\to\infty} n^{-1}\mathrm{E}_0[\lambda_n]$. Note, however, that in the general non-i.i.d. case, the SLLN does not even guarantee the finiteness of the expected sample sizes $\mathrm{E}_i[T^*]$ of the SPRT, so some additional conditions are needed, such as a certain rate of convergence in the strong law, e.g., complete or quick convergence.
In 1981, Lai [7] was the first to prove the asymptotic optimality of Wald's SPRT in a general non-i.i.d. case as $\alpha_{\max}=\max(\alpha_0,\alpha_1)\to0$. While the motivation was the near optimality of invariant SPRTs with respect to nuisance parameters, Lai proved a more general result using the r-quick convergence concept.
Specifically, for $0<I_0<\infty$ and $0<I_1<\infty$, define
$$L_1(\varepsilon) = \sup\big\{n\ge1: |n^{-1}\lambda_n - I_1|\ge\varepsilon\big\} \quad\text{and}\quad L_0(\varepsilon) = \sup\big\{n\ge1: |n^{-1}\lambda_n + I_0|\ge\varepsilon\big\}$$
($\sup\{\varnothing\}=0$), and suppose that $\mathrm{E}_i[L_i(\varepsilon)^r]<\infty$ ($i=0,1$) for some $r>0$ and every $\varepsilon>0$, i.e., that the normalized LLR converges r-quickly to $I_1$ under $P_1$ and to $-I_0$ under $P_0$:
$$n^{-1}\lambda_n \xrightarrow[n\to\infty]{P_1\text{-}r\text{-quickly}} I_1 \quad\text{and}\quad n^{-1}\lambda_n \xrightarrow[n\to\infty]{P_0\text{-}r\text{-quickly}} -I_0. \qquad (34)$$
Strengthening the a.s. convergence (33) into the r-quick version (34), Lai [7] established the first-order asymptotic optimality of Wald's SPRT for moments of the stopping time distribution up to order r: if the thresholds $a_1(\alpha_0,\alpha_1)$ and $a_0(\alpha_0,\alpha_1)$ in the SPRT are selected so that $\delta^*(a_0,a_1)\in\mathcal{C}(\alpha_0,\alpha_1)$ and the asymptotics (30) hold, then, as $\alpha_{\max}\to0$,
$$\inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathrm{E}_1[T^r] \sim \left(\frac{|\log\alpha_0|}{I_1}\right)^r \sim \mathrm{E}_1\big[(T^*)^r\big], \qquad \inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathrm{E}_0[T^r] \sim \left(\frac{|\log\alpha_1|}{I_0}\right)^r \sim \mathrm{E}_0\big[(T^*)^r\big]. \qquad (35)$$
Wald's ideas have been generalized in many publications to construct sequential tests of composite hypotheses with nuisance parameters when these hypotheses can be reduced to simple ones by the principle of invariance. If $M_n$ is the maximal invariant statistic and $p_i(M_n)$ is the density of this statistic under hypothesis $H_i$, then the invariant SPRT is defined as in (29) with the LLR $\lambda_n=\log[p_1(M_n)/p_0(M_n)]$. However, even if the observations $X_1,X_2,\ldots$ are i.i.d., the invariant LLR statistic $\lambda_n$ is no longer a random walk, and Wald's methods cannot be applied directly. Lai [7] applied the asymptotic optimality property (35) of Wald's SPRT in the non-i.i.d. case to investigate the optimality properties of several classical invariant SPRTs, such as the sequential t-test, the sequential $T^2$-test, and Savage's rank-order test.
In the sequel, we will refer to the case where the a.s. convergence in the non-i.i.d. model (33) holds with the rate 1 / n as asymptotically stationary. Assume now that (33) is generalized to
λ n / ψ ( n ) n P 1 a . s . I 1 , ( λ n ) / ψ ( n ) n P 0 a . s . I 0 ,
where ψ ( t ) is a positive increasing function. If ψ ( t ) is not linear, then this case will be referred to as asymptotically non-stationary.
A simple example where this generalization is needed is testing H 0 versus H 1 regarding the mean of the normal distribution:
X n = i S n + ξ n , n Z + , i = 0 , 1 ,
where { ξ n } n ≥ 1 is an i.i.d. standard Gaussian sequence, ξ n ∼ N ( 0 , 1 ) , and S n = ∑ j = 0 k c j n j is a polynomial of degree k ≥ 1 . Then,
λ n = t = 1 n S t X t 1 2 t = 1 n S t 2 ,
E 1 [ λ n ] = − E 0 [ λ n ] = 1 2 ∑ t = 1 n S t 2 ∼ c k 2 n 2 k + 1 / [ 2 ( 2 k + 1 ) ] for large n, so ψ ( n ) = n 2 k + 1 and I 1 = I 0 = c k 2 / [ 2 ( 2 k + 1 ) ] in (36). This example is of interest in certain practical applications, in particular, for the recognition of ballistic objects and satellites.
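The stabilization of the LLR in this asymptotically non-stationary example is easy to observe numerically. The sketch below (illustrative; the coefficient c 1 and the horizon are arbitrary choices) simulates λ n under H 1 for a linear signal ( k = 1 ) and checks that λ n divided by its mean 1 2 ∑ S t 2 approaches 1, without committing to any particular normalization ψ ( n ) .

```python
import random

rng = random.Random(7)
c1, N = 0.5, 4000                        # hypothetical coefficient and horizon
lam = half_sum_S2 = 0.0
for t in range(1, N + 1):
    S_t = c1 * t                         # linear signal S_t (k = 1)
    x = S_t + rng.gauss(0.0, 1.0)        # observation under H1
    lam += S_t * x - 0.5 * S_t * S_t     # LLR increment S_t X_t - S_t^2 / 2
    half_sum_S2 += 0.5 * S_t * S_t       # E1[lambda_n] = (1/2) sum S_t^2
ratio = lam / half_sum_S2                # should stabilize near 1 by the a.s. LLN
print(ratio)
```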
Tartakovsky et al. ([6] Section 3.4) generalized Lai’s results for the asymptotically non-stationary case. Write Ψ ( t ) for the inverse function of ψ ( t ) .
Theorem 4 (SPRT asymptotic optimality).
Let r 1 . Assume that there exist finite positive numbers I 0 and I 1 and an increasing non-negative function ψ ( t ) such that the r-quick convergence conditions
λ n ψ ( n ) n P 1 r quickly I 1 , λ n ψ ( n ) n P 0 r quickly I 0
hold. If thresholds a 0 ( α 0 , α 1 ) and a 1 ( α 0 , α 1 ) are selected so that δ * ( a 0 , a 1 ) C ( α 0 , α 1 ) and a 0 | log α 1 | and a 1 | log α 0 | , then, as α max 0 ,
inf δ C ( α 0 , α 1 ) E 1 [ T r ] Ψ | log α 0 | I 1 r E 1 [ T * r ] , inf δ C ( α 0 , α 1 ) E 0 [ T r ] Ψ | log α 1 | I 0 r E 0 [ T * r ] .
This theorem implies that the SPRT asymptotically minimizes the moments of the stopping time distribution up to order r.
The proof of this theorem is performed in two steps which are related to our previous discussion of the rates of convergence in Section 2. The first step is to obtain the asymptotic lower bounds in class C ( α 0 , α 1 ) :
lim inf α max 0 inf δ C ( α 0 , α 1 ) E 1 [ T r ] [ Ψ | log α 0 | / I 1 ] r 1 , lim inf α max 0 inf δ C ( α 0 , α 1 ) E 0 [ T r ] [ Ψ | log α 1 | / I 0 ] r 1 .
These bounds hold whenever the following right-tail conditions for the LLR are satisfied:
lim M P 1 1 ψ ( M ) max 1 n M λ n ( 1 + ε ) I 1 = 1 , lim M P 0 1 ψ ( M ) max 1 n M ( λ n ) ( 1 + ε ) I 0 = 1 .
Note that, by Lemma 1, these conditions are satisfied when the SLLN (36) holds, so the almost sure convergence (36) is sufficient for the lower bounds. However, as we have already mentioned, the SLLN for the LLR is not sufficient to guarantee even the finiteness of the SPRT stopping time.
The second step is to show that the lower bounds are attained by the SPRT. To do so, it suffices to impose the following additional left-tail conditions:
n = 1 n r 1 P 1 λ n ( I 1 ε ) ψ ( n ) < , n = 1 n r 1 P 0 λ n ( I 0 ε ) ψ ( n ) <
for all 0 < ε < min ( I 0 , I 1 ) . Since both right-tail and left-tail conditions hold if the LLR converges r-completely to I i ,
n = 1 n r 1 P 1 λ n ψ ( n ) I 1 ε < , n = 1 n r 1 P 0 λ n ψ ( n ) + I 0 ε < ,
and since r-quick convergence implies r-complete convergence (see (12)), we conclude that the assertions (37) hold.
Remark 3. 
In the i.i.d. case, Wald’s approach allows us to establish asymptotic equalities (37) with I 1 = E 1 [ λ 1 ] and I 0 = E 0 [ λ 1 ] being K-L information numbers under the only condition of finiteness I i . However, Wald’s approach breaks down in the non-i.i.d. case. Certain generalizations in the case of independent but non-identically and substantially non-stationary observations, extending Wald’s ideas, were considered in [19,20,21]. Theorem 4 covers all these non-stationary models.
Fellouris and Tartakovsky [22] extended previous results on the asymptotic optimality of the SPRT to the case of the multistream hypothesis testing problem when the observations are sequentially acquired in multiple data streams (or channels or sources). The problem is to test the null hypothesis H 0 that none of the N streams are affected against the composite hypothesis H B that a subset B { 1 , , N } is affected. Write P B and E B for the distribution of observations and expectation under hypothesis H B . Let P denote a class of subsets of { 1 , , N } that incorporates prior information which is available regarding the subset of affected streams, e.g., not more than K < N streams can be affected. (In many practical problems, K is substantially smaller than the total number of streams N, which can be very large.)
Two sequential tests were studied in [22]—the generalized sequential likelihood ratio test and the mixture sequential likelihood ratio test. It has been shown that both tests are first-order asymptotically optimal, minimizing the moments of the sample size E 0 [ T r ] and E B [ T r ] for all B P up to order r as max ( α 0 , α 1 ) 0 in the class of tests
C P ( α 0 , α 1 ) = δ : P 0 ( d = 1 ) α 0 and max B P P B ( d = 0 ) α 1 , 0 < α i < 1 .
The proof is essentially based on the concept of r-complete convergence of LLR with the rate 1 / n . See also Chapter 1 in [5].

3.1.2. Asymptotic Optimality of the Multi-hypothesis SPRT

We now return to the multi-hypothesis model with N > 1 that we started to discuss at the beginning of this section (see (27) and (28)). The problem of the sequential testing of many hypotheses is substantially more difficult than that of testing two hypotheses. For multiple-decision testing problems, it is usually very difficult, if at all possible, to obtain optimal solutions. Finding an optimal non-Bayesian test in the class of tests (28) that minimizes the expected sample sizes E i [ T ] for all hypotheses H i , i = 0 , 1 , … , N , is not manageable even in the i.i.d. case. For this reason, a substantial part of the development of sequential multi-hypothesis testing in the 20th century has been directed towards the study of certain combinations of one-sided sequential probability ratio tests when observations are i.i.d. (see, e.g., [23,24,25,26,27,28]).
We will focus on the following first-order asymptotic criterion: Find a multi-hypothesis test δ * ( α ) = ( d * ( α ) , T * ( α ) ) such that, for some r 1 ,
lim α max 0 inf δ C ( α ) E i [ T r ] E i [ T * ( α ) r ] = 1 for all i = 0 , 1 , , N ,
where α max = max 0 i , j N , i j α i j .
In 1998, Tartakovsky [4] was the first to consider sequential multiple hypothesis testing problems for general non-i.i.d. stochastic models, following Lai’s idea of exploiting the r-quick convergence in the SLLN for two hypotheses. The results were obtained for both discrete- and continuous-time scenarios and for the asymptotically non-stationary case where the LLR processes between hypotheses converge to finite numbers with the rate 1 / ψ ( t ) . Two multi-hypothesis tests were investigated: (1) the rejecting test, which rejects the hypotheses one by one and accepts the last hypothesis that remains; and (2) the matrix accepting test, which accepts a hypothesis for which all component SPRTs involving this hypothesis vote for accepting it.
We now introduce this accepting test, which we will refer to as the matrix SPRT (MSPRT). In the present article, we do not consider continuous-time scenarios; readers interested in continuous time are referred to [4,6,19,21,29].
Write N = { 0 , 1 , , N } . For a threshold matrix ( A i j ) i , j N , with A i j > 0 and the A i i being immaterial (say 0), define the matrix SPRT δ * N = ( T * N , d * N ) , built on ( N + 1 ) N / 2 one-sided SPRTs between the hypotheses H i and H j , as follows:
Stop at the first n 1 such that , for some i , Λ i j ( n ) A j i for all j i ,
and accept the unique H i that satisfies these inequalities. Note that, for N = 1 , the MSPRT coincides with Wald’s SPRT.
In the following, we omit the superscript N in δ * N = ( T * N , d * N ) for brevity. Obviously, with a j i = log A j i , the MSPRT in (39) can be written as
T * = inf n 1 : λ i j ( n ) a j i for all j i and some i ,
d * = i for which ( 40 ) holds .
Introducing the Markov accepting times for the hypotheses H i as
T i = inf n 1 : λ i 0 ( n ) max j i 1 j N [ λ j 0 ( n ) + a j i ] , i = 0 , 1 , , N ,
the test in (40), (41) can also be written in the following form:
T * = min 0 j N T j , d * = i if T * = T i .
Thus, in the MSPRT, each component SPRT is extended until, for some i N , all N SPRTs involving H i accept H i .
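A minimal sketch of the MSPRT stopping rule (40), (41) for three simple Gaussian-mean hypotheses; the means, the common threshold a j i = a , and the number of runs are illustrative assumptions, not taken from the paper.

```python
import random, math

thetas = [0.0, 1.0, 2.0]                 # three simple hypotheses on the N(theta, 1) mean
a = math.log(100)                        # common threshold a_ji = |log 0.01|

def msprt(true_theta, rng):
    """Matrix SPRT: stop when some H_i dominates every H_j by margin a."""
    N = len(thetas)
    s = [0.0] * N                        # s[i] = running log-likelihood score of H_i
    n = 0
    while True:
        n += 1
        x = rng.gauss(true_theta, 1.0)
        for i, th in enumerate(thetas):
            s[i] += th * x - th * th / 2.0
        for i in range(N):
            # lambda_ij(n) = s[i] - s[j]; accept H_i once >= a for all j != i
            if all(s[i] - s[j] >= a for j in range(N) if j != i):
                return n, i

rng = random.Random(1)
results = [msprt(1.0, rng) for _ in range(500)]   # data generated under H1
acc = sum(1 for _, d in results if d == 1) / len(results)
mean_n = sum(n for n, _ in results) / len(results)
print(acc, mean_n)
```

For N = 1 this rule reduces to Wald’s SPRT, as noted above; under H 1 the wrong hypotheses are accepted only rarely.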
Using Wald’s likelihood ratio identity, it is easily shown that α i j ( δ * ) exp ( a i j ) for i , j N , i j , so selecting a j i = | log α j i | implies that δ * C ( α ) . These inequalities are similar to Wald’s ones in the binary hypothesis case and are very imprecise. In his ingenious paper, Lorden [27] showed that, with a very sophisticated design that includes the accurate estimation of thresholds accounting for overshoots, the MSPRT is nearly optimal in the third-order sense, i.e., it minimizes the expected sample sizes for all hypotheses up to an additive disappearing term: inf δ C ( α ) E i [ T ] = E i [ T * ] + o ( 1 ) as α max 0 . This result only holds for i.i.d. models with the finite second moment E i [ λ i j ( 1 ) 2 ] < . In the non-i.i.d. case (and even in the i.i.d. case for higher moments r > 1 ), there is no way to obtain such a result, so we focus on the first-order optimality (38).
The following theorem establishes asymptotic operating characteristics and the optimality of MSPRT under the r-quick convergence of λ i j ( n ) / ψ ( n ) to finite K-L-type numbers I i j , where ψ ( n ) is a positive increasing function, ψ ( ) = .
Theorem 5 (MSPRT asymptotic optimality
[4]). Let r ≥ 1 . Assume that there exist finite positive numbers I i j , i , j = 0 , 1 , … , N , i ≠ j , and an increasing non-negative function ψ ( t ) such that
λ i j ( n ) ψ ( n ) n P i r quickly I i j for all i , j = 0 , 1 , , N , i j .
Then, the following assertions are true.
(i) 
For i = 0 , 1 , , N ,
E i [ T * r ] Ψ max j i 0 j N a j i I i j r as min j , i a j i .
(ii) 
If the thresholds are so selected that α i j ( δ * ) α i j and a j i | log α j i | , particularly as a j i = | log α j i | , then for all i = 0 , 1 , , N
inf δ C ( α ) E i [ T r ] Ψ max j i 0 j N | log α j i | I i j r E i [ T * r ] as α max 0 .
Assertion (ii) implies that the MSPRT asymptotically minimizes the moments of the stopping time distribution up to order r for all hypotheses H 0 , H 1 , , H N in the class of tests C ( α ) .
Remark 4. 
Both assertions of Theorem 5 are correct under the r-complete convergence
λ i j ( n ) ψ ( n ) n P i r complete I i j for all i , j = 0 , 1 , , N , i j ,
i.e., when
n = 1 n r 1 P i | 1 ψ ( n ) λ i j ( n ) I i j | > ε < for all ε > 0 .
While this statement has not been proven anywhere to date, it can be easily proven using the methods developed for multistream hypothesis testing and changepoint detection ([5] Ch 1, Ch 6).
Remark 5. 
As shown in the example given in Section 3.4.3 of [6], the r-quick convergence conditions in Theorem 5 (or corresponding r-complete convergence conditions for LLR processes) cannot be generally relaxed into the almost sure convergence
λ i j ( n ) ψ ( n ) n P i a . s . I i j for all i , j = 0 , 1 , , N , i j .
However, the following weak asymptotic optimality result holds for the MSPRT under the a.s. convergence: if the a.s. convergence (47) holds with the power function ψ ( t ) = t k , k > 0 , then, for every 0 < ε < 1 ,
inf δ C ( α ) P i T > ε T * 1 as α max 0 for all i = 0 , 1 , , N
whenever thresholds a j i are selected as in Theorem 5 (ii).
Note that several interesting statistical and practical applications of these results to invariant sequential testing and multisample slippage scenarios are discussed in Sections 4.5 and 4.6 of Tartakovsky et al. [6] (see Mosteller [30] and Ferguson [16] for terminology regarding multisample slippage problems).

3.2. Sequential Changepoint Detection

Sequential (or quickest) changepoint detection is an important subfield of sequential analysis. The observations are made one at a time, and, as long as their behavior suggests that the process of interest is in control (i.e., in a normal state), the process is allowed to continue. If the process is believed to be out of control, the goal is to detect the change in distribution as rapidly as possible. Quickest change detection problems have an enormous number of important applications, e.g., object detection in noise and clutter, industrial quality control, environment surveillance, failure detection, navigation, seismology, computer network security, genomics, and epidemiology (see, e.g., [31,32,33,34,35,36,37,38,39,40]). Many challenging application areas are discussed in the books by Tartakovsky, Nikiforov, and Basseville ([6] Ch 11) and Tartakovsky ([5] Ch 8).

3.2.1. Changepoint Models

The probability distribution of the observations X = { X n } n N is subject to a change at an unknown point in time ν { 0 , 1 , 2 , } = Z + so that X 1 , , X ν are generated by one stochastic model and X ν + 1 , X ν + 2 , are generated by another model. A sequential detection rule is a stopping time T for an observed sequence { X n } n 1 , i.e., T is an integer-valued random variable such that the event { T = n } belongs to the sigma-algebra F n = σ ( X 1 , , X n ) generated by observations X 1 , , X n .
Let P denote the probability measure corresponding to the sequence of observations { X n } n 1 when there is never a change ( ν = ) and, for k = 0 , 1 , 2 , , let P k denote the measure corresponding to the sequence { X n } n 1 when ν = k < . We denote the hypothesis that the change never occurs by H : ν = and we denote the hypothesis that the change occurs at time 0 k < by H k : ν = k .
First consider a general non-i.i.d. model assuming that the observations may have a very general stochastic structure. Specifically, if we let, as before, X n = ( X 1 , , X n ) denote the sample of size n, then when ν = (there is no change), the conditional density of X n given X n 1 is g n ( X n | X n 1 ) for all n 1 and when ν = k < , then the conditional density is g n ( X n | X n 1 ) for n k and f n ( X n | X n 1 ) for n > k . Thus, for the general non-i.i.d. changepoint model, the joint density p ( X n | H k ) under hypothesis H k can be written as follows
p ( X n | H k ) = ∏ t = 1 n g t ( X t | X t − 1 ) for ν = k ≥ n , ∏ t = 1 k g t ( X t | X t − 1 ) × ∏ t = k + 1 n f t ( X t | X t − 1 ) for ν = k < n ,
where g n ( X n | X n 1 ) is the pre-change conditional density and f n ( X n | X n 1 ) is the post-change conditional density which may depend on ν , f n ( X n | X n 1 ) = f n ( ν ) ( X n | X n 1 ) , but we will omit the superscript ν for brevity.
The classical changepoint detection problem deals with the i.i.d. case where there is a sequence of observations X 1 , X 2 , that are identically distributed with a probability density function (pdf) g ( x ) for n ν and with a pdf f ( x ) for n > ν . That is, in the i.i.d. case, the joint density of the vector X n = ( X 1 , , X n ) under hypothesis H k has the form
p ( X n | H k ) = t = 1 n g ( X t ) for ν = k n , t = 1 k g ( X t ) × t = k + 1 n f ( X t ) for ν = k < n .
Note that, as discussed in [5,6], in applications, there are two different kinds of changes—additive and non-additive. Additive changes lead to a change in the mean value of the sequence of observations. Non-additive changes are typically produced by a change in variance or covariance, i.e., these are spectral changes.
We now proceed by discussing the models for the change point ν . The change point ν may be considered either as an unknown deterministic number or as a random variable. If the change point is treated as a random variable, then the model has to be supplied with a prior distribution of the change point. There may be several changepoint mechanisms, and, as a result, the random variable ν may be dependent on or independent of the observations. In particular, Moustakides [41] assumed that ν can be an { F n } -adapted stopping time. In this article, we will not pursue Moustakides’s concept, which allows the prior distribution to depend on some additional information available to “Nature” (see [5] for a detailed discussion); rather, when considering a Bayesian approach, we will assume that the prior distribution of the unknown change point is independent of the observations.

3.2.2. Popular Changepoint Detection Procedures

Before formulating the criteria of optimality in the next subsection, we begin by defining the three most popular and common change detection procedures, which are either optimal or nearly optimal in different settings. To define these procedures, we need to introduce the partial likelihood ratio and the corresponding log-likelihood ratio
LR t = f t ( X t | X t − 1 ) / g t ( X t | X t − 1 ) , Z t = log [ f t ( X t | X t − 1 ) / g t ( X t | X t − 1 ) ] , t = 1 , 2 , …
It is worth reiterating that, for general non-i.i.d. models, the post-change density often depends on the point of change, f t ( X t | X t − 1 ) = f t ( ν ) ( X t | X t − 1 ) , so in general LR t = LR t ( ν ) and Z t = Z t ( ν ) also depend on the change point ν . However, this is not the case for the i.i.d. model (50).

The CUSUM Procedure

We now introduce the Cumulative Sum (CUSUM) algorithm, which was first proposed by Page [42] for the i.i.d. model (50). Recall that we consider the changepoint detection problem as a problem of testing two hypotheses: H ν that the change occurs at a fixed point 0 ≤ ν < ∞ against the alternative H ∞ that the change never occurs. The LR between these hypotheses is Λ n ν = ∏ t = ν + 1 n LR t for ν < n and 1 for ν ≥ n . Since the hypothesis H ν is composite, we may apply the generalized likelihood ratio (GLR) approach, maximizing the LR Λ n ν over ν to obtain the GLR statistic
V n = max 0 ν < n t = ν + 1 n LR t , n 1 .
It is easy to verify that this statistic follows the recursion
V n = max { 1 , V n 1 } LR n , n 1 , V 0 = 1
as long as the partial LR LR n does not depend on the change point, i.e., the post-change conditional density f n ( X n | X n 1 ) does not depend on ν . This is always the case for i.i.d. models (50) when f n ( X n | X n 1 ) = f ( X n ) . However, as we already mentioned, for non-i.i.d. models, f n ( X n | X n 1 ) = f n ( ν ) ( X n | X n 1 ) often depends on the change point ν , so LR n = LR n ( ν ) , in which case the recursion (51) does not hold.
The logarithmic version of V n , W n = log V n , is related to the CUSUM statistic G n introduced by Page [42] in the i.i.d. case via G n = max ( 0 , W n ) . The statistic G n can also be obtained via the GLR approach by maximizing the LLR λ n ν = log Λ n ν over 0 ≤ ν < ∞ . However, since the hypotheses H ∞ and H ν are indistinguishable for ν ≥ n , the maximization over ν ≥ n makes little sense. Note also that, in contrast to Page’s CUSUM statistic G n , the statistic W n may take values smaller than 0, so the CUSUM procedure
T CS = inf { n 1 : W n a }
makes sense even for negative values of the threshold a. Thus, it is more general than Page’s CUSUM. Note the recursions
W n = W n 1 + + Z n , n 1 , W 0 = 0
and
G n = G n 1 + Z n + , n 1 , G 0 = 0
in cases where Z n = log [ f n ( X n | X n 1 ) / g n ( X n | X n 1 ) ] does not depend on ν .
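A short sketch of the CUSUM recursions in precisely this situation, assuming the i.i.d. Gaussian mean-shift model so that Z n does not depend on ν ; the post-change mean, change point, and threshold are hypothetical choices. The in-loop assertion verifies the identity G n = max ( 0 , W n ) relating Page’s statistic to W n .

```python
import random

rng = random.Random(3)
theta, nu, a = 0.6, 50, 4.0              # hypothetical post-change mean, change point, threshold
W = G = 0.0
T_cs = None
for n in range(1, 200):
    mean = theta if n > nu else 0.0
    x = rng.gauss(mean, 1.0)
    z = theta * x - theta * theta / 2.0  # LLR increment Z_n
    W = max(W, 0.0) + z                  # recursion for W_n
    G = max(G + z, 0.0)                  # Page's recursion for G_n
    assert G == max(0.0, W)              # the identity G_n = max(0, W_n)
    if T_cs is None and W >= a:
        T_cs = n                         # CUSUM stopping time
print(T_cs)
```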

Shiryaev’s Procedure

In the i.i.d. case and for the zero-modified geometric prior distribution of the change point, Shiryaev [43] introduced the change detection procedure that prescribes the thresholding of the posterior probability P ( ν < n | X n ) . Introducing the statistic
S n π = P ( ν < n | X n ) 1 P ( ν < n | X n )
one can write the stopping time of the Shiryaev procedure in the general non-i.i.d. case and for an arbitrary prior π as
T SH = inf n 1 : S n π A ,
where A ( A > 0 ) is a threshold controlling for the false alarm risk. The statistic S n π can be written as
S n π = 1 P ( ν n ) k = 0 n 1 π k Λ n k = 1 P ( ν n ) k = 0 n 1 π k t = k + 1 n LR t , n 1 , S 0 π = 0 ,
where the product t = i j LR t = 1 for j < i .
Often (following Shiryaev’s assumptions), it is supposed that the change point ν is distributed according to the geometric distribution Geometric ( ϱ )
P ( ν = k ) = ϱ ( 1 ϱ ) k for k = 0 , 1 , 2 , ,
where ϱ ( 0 , 1 ) .
If LR n does not depend on the change point ν and the prior distribution is geometric (56), then the statistic S ˜ n ϱ = S n π / ϱ can be rewritten in the recursive form
S ˜ n ϱ = 1 + S ˜ n 1 ϱ LR n 1 ϱ , n 1 , S ˜ 0 ϱ = 0 .
However, as mentioned above, this may not be the case for non-i.i.d. models, since LR n often depends on ν .
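For the i.i.d. Gaussian case, where LR n does not depend on ν , the Shiryaev recursion (57) together with the stopping rule (54) can be sketched as follows; the values of ϱ , θ , A, and the change point are illustrative assumptions.

```python
import random, math

rng = random.Random(11)
rho, theta, A = 0.1, 1.0, 50.0           # illustrative prior parameter, post-change mean, threshold
nu = 30                                  # change point (hypothetical)
S_tilde, T_sh = 0.0, None
for n in range(1, 500):
    mean = theta if n > nu else 0.0
    x = rng.gauss(mean, 1.0)
    LR = math.exp(theta * x - theta * theta / 2.0)
    S_tilde = (1.0 + S_tilde) * LR / (1.0 - rho)   # recursion (57)
    if rho * S_tilde >= A:               # stop when S^pi = rho * S_tilde reaches A
        T_sh = n
        break
print(T_sh)
```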

Shiryaev–Roberts Procedure

The generalized Shiryaev–Roberts (SR) change detection procedure is based on the thresholding of the generalized SR statistic
R n r 0 = r 0 Λ n 0 + k = 0 n 1 Λ n k = r 0 t = 1 n LR t + k = 0 n 1 t = k + 1 n LR t , n 1 ,
with a non-negative head-start R 0 = r 0 , r 0 0 , i.e., the stopping time of the SR procedure is given by
T SR r 0 = inf n 1 : R n r 0 A , A > 0 .
This procedure is usually referred to as the SR-r detection procedure, in contrast to the standard SR procedure T SR ≡ T SR 0 , which starts from the zero initial condition r 0 = 0 . In the i.i.d. case (50), this modification of the SR procedure was introduced and studied in detail in [44,45].
If LR n does not depend on the change point ν , then the SR-r detection statistic satisfies the recursion
R n r 0 = ( 1 + R n 1 r 0 ) LR n , n 1 , R 0 r 0 = r 0 .
Note that, as the parameter of the geometric prior distribution ϱ 0 , the Shiryaev statistic S ˜ n ϱ converges to the SR statistic R n r 0 = 0 .
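This limit is easy to check numerically: running the Shiryaev recursion (57) with a very small ϱ alongside the SR recursion (60) with r 0 = 0 on the same data gives nearly identical statistics. The Gaussian model and all parameter values below are illustrative.

```python
import random, math

rng = random.Random(5)
theta = 1.0
xs = [rng.gauss(0.0, 1.0) for _ in range(40)] + \
     [rng.gauss(theta, 1.0) for _ in range(40)]   # change after 40 observations
rho = 1e-8                                        # nearly zero geometric parameter
R = S_small = 0.0
for x in xs:
    LR = math.exp(theta * x - theta * theta / 2.0)
    R = (1.0 + R) * LR                            # SR recursion (60), r0 = 0
    S_small = (1.0 + S_small) * LR / (1.0 - rho)  # Shiryaev recursion (57), tiny rho
print(R, S_small)                                 # nearly identical as rho -> 0
```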

3.2.3. Optimality Criteria

The goal of online change detection is to detect the change with the smallest delay controlling for a false alarm rate at a given level. Tartakovsky et al. [6] suggested several changepoint problem settings, including Bayesian, minimax, and uniform (pointwise) approaches.
Let E k denote the expectation with respect to measure P k when the change occurs at ν = k < and E with respect to P when there is no change.
In 1954, Page [42] suggested measuring the risk due to a false alarm by the mean time to false alarm E [ T ] and the risk associated with a true change detection by the mean time to detection E 0 [ T ] when the change occurs at the very beginning. He called these performance characteristics the average run length (ARL). Page also introduced the now most famous change detection procedure—the CUSUM procedure (see (52) with W n replaced by G n )—and analyzed it using these operating characteristics in the i.i.d. case.
While the false alarm rate can be reasonably measured by the ARL to false alarm
ARL 2 FA ( T ) = E [ T ] ,
as Figure 1 suggests, the risk due to a true change detection can be reasonably measured by the conditional expected delay to detection
CEDD ν ( T ) = E ν [ T ν | T > ν ] , ν = 0 , 1 , 2 ,
for any possible change point ν Z + = { 0 , 1 , 2 , } but not necessarily by the ARL to detection E 0 [ T ] CEDD 0 ( T ) . A good detection procedure has to guarantee small values of the expected detection delay CEDD ν ( T ) for all change points ν Z + when ARL 2 FA ( T ) is set at a certain level. However, if the false alarm risk is measured in terms of the ARL to false alarm, i.e., it is required that ARL 2 FA ( T ) γ for some γ 1 , then a procedure that minimizes the conditional expected delay to detection CEDD ν ( T ) uniformly over all ν does not exist. For this reason, we must resort to different optimality criteria, e.g., to Bayesian and minimax criteria.

Minimax Changepoint Optimization Criteria

There are two popular minimax criteria. The first one was introduced by Lorden [46]:
inf T sup ν Z + ess sup E ν [ T ν T > ν , F ν ] subject to ARL 2 FA ( T ) γ .
This requires minimizing the conditional expected delay to detection E ν [ T ν T > ν , F ν ] in the worst-case scenario with respect to both the change point ν and the trajectory ( X 1 , , X ν ) of the observed process in the class of detection procedures
C ARL ( γ ) = T : ARL 2 FA ( T ) γ , γ 1 ,
for which the ARL to false alarm exceeds the prespecified value γ [ 1 , ) . Let ESADD ( T ) = sup ν 0 ess sup E ν [ T ν T > ν , F ν ] denote Lorden’s speed detection measure. Under Lorden’s minimax approach, the goal is to find a stopping time T opt C ARL ( γ ) such that
ESADD ( T opt ) = inf T C ARL ( γ ) ESADD ( T ) for any γ 1 .
In the classical i.i.d. scenario (50), Lorden [46] proved that the CUSUM detection procedure (52) is asymptotically first-order minimax optimal as γ , i.e.,
inf T C ARL ( γ ) ESADD ( T ) = ESADD ( T CS ) ( 1 + o ( 1 ) ) as γ → ∞ .
Later on, using optimal stopping theory in his ingenious paper, Moustakides [47] established the exact optimality of CUSUM for any value γ ≥ 1 of the ARL to false alarm.
Another popular, less pessimistic minimax criterion is from Pollak [48]:
inf T sup ν Z + CEDD ν ( T ) subject to ARL 2 FA ( T ) γ ,
which requires minimizing the conditional expected delay to detection CEDD ν ( T ) = E ν [ T ν T > ν ] in the worst-case scenario with respect to the change point ν in class C ARL ( γ ) . Under Pollak’s minimax approach, the goal is to find a stopping time T opt C ARL ( γ ) such that
sup ν Z + CEDD ν ( T opt ) = inf T C ARL ( γ ) sup ν Z + CEDD ν ( T ) for any γ 1 .
For the i.i.d. model (50), Pollak [48] showed that the modified SR detection procedure that starts from the quasi-stationary distribution of the SR statistic (i.e., the head-start r 0 in the SR-r procedure is a specific random variable) is third-order asymptotically optimal as γ , i.e., the best one can attain up to an additive term o ( 1 ) :
inf T C ARL ( γ ) sup ν Z + CEDD ν ( T ) = sup ν Z + CEDD ν ( T SR r 0 ) + o ( 1 ) , γ ,
where o ( 1 ) 0 as γ . Later, Tartakovsky et al. [49] proved that this is also true for the SR-r procedure (59) that starts from the fixed but specially designed point r 0 = r 0 ( γ ) that depends on γ , which was first introduced and thoroughly studied by Moustakides et al. [44]. See also Polunchenko and Tartakovsky [50] on the exact optimality of the SR-r procedure.

Bayesian Changepoint Optimization Criterion

In Bayesian problems, the point of change ν is treated as random with a prior distribution π k = P ( ν = k ) , k Z + . Define the probability measure on the Borel σ -algebra B in R × Z + as
P π ( A × K ) = k K π k P k A , A B ( R ) , K Z + .
Under measure P π , the change point ν has a distribution π = { π k } and the model for the observations is given in (49).
From the Bayesian point of view, it is reasonable to measure the false alarm risk with the weighted probability of false alarm (PFA) defined as
PFA π ( T ) : = P π ( T ν ) = k = 0 π k P k ( T k ) = k = 0 π k P ( T k ) .
The last equality follows from the fact that P k ( T k ) = P ( T k ) because the event { T k } depends on the first k observations which under measure P k correspond to the no-change hypothesis H . Thus, for α ( 0 , 1 ) , introduce the class of changepoint detection procedures
C π ( α ) = T : PFA π ( T ) α
for which the weighted PFA does not exceed a prescribed level α .
Let E π denote the expectation with respect to the measure P π .
Shiryaev [18,43] introduced the Bayesian optimality criterion
inf T C π ( α ) E π [ ( T ν ) + ] ,
which is equivalent to minimizing the conditional expected detection delay EDD π ( T ) = E π [ T ν | T > ν ]
inf T EDD π ( T ) subject to PFA π ( T ) α .
Under the Bayesian approach, the goal is to find a stopping time T opt C π ( α ) such that
EDD π ( T opt ) = inf T C π ( α ) EDD π ( T ) for any α ( 0 , 1 ) .
For the i.i.d. model (50) and for the geometric prior distribution Geometric ( ϱ ) of the changepoint ν (see (56)), this problem was solved by Shiryaev [18,43]. Shiryaev [18,43,51] proved that the detection procedure given by the stopping time T SH ( A ) defined in (54) is strictly optimal in class C π ( α ) if A = A α in (54) can be selected in such a way that PFA π ( T SH ( A α ) ) = α , that is
inf T C π ( α ) EDD π ( T ) = EDD π ( T SH ( A α ) ) for any α ( 0 , 1 ) .

Uniform Pointwise Optimality Criterion

In many applications, the most reasonable optimality criterion is the pointwise uniform criterion of minimizing the conditional expected detection delay CEDD ν ( T ) = E ν [ T − ν | T > ν ] for all ν ∈ Z + when the false alarm risk is fixed at a certain level. However, as we already mentioned, if it is required that ARL 2 FA ( T ) ≥ γ for some γ ≥ 1 , then a procedure that minimizes CEDD ν ( T ) for all ν does not exist. More importantly, as discussed in ([5] Section 2.3), the requirement of having large values of ARL 2 FA ( T ) generally does not guarantee small values of the maximal local probability of false alarm MLPFA ( T ) = sup ℓ ≥ 0 P ∞ ( T ≤ ℓ + m | T > ℓ ) in a time window of length m ≥ 1 , while the opposite is always true (see Lemmas 2.1–2.2 in [5]). Hence, the constraint MLPFA ( T ) ≤ β is more stringent than ARL 2 FA ( T ) ≥ γ .
Another reason for considering the MLPFA constraint instead of the ARL to false alarm constraint is that the latter makes sense if and only if the P ∞ -distribution of the stopping time is geometric or at least close to geometric, which is often the case for many popular detection procedures such as CUSUM and SR in the i.i.d. case. However, for general non-i.i.d. models, this is not necessarily true (see [5,52] for a detailed discussion).
For these reasons, introduce the most stringent class of change detection procedures for which the MLPFA ( T ) is upper-bounded by the prespecified level β ( 0 , 1 ) :
C PFA ( m , β ) = T : sup ℓ ≥ 0 P ∞ ( T ≤ ℓ + m | T > ℓ ) ≤ β .
The goal is to find a stopping time T opt C PFA ( m , β ) such that
CEDD ν ( T opt ) = inf T C PFA ( m , β ) CEDD ν ( T ) for all ν Z + and any 0 < β < 1 .

3.2.4. Asymptotic Optimality for General Non-i.i.d. Models via r-Quick and r-Complete Convergence

Complete Convergence and General Bayesian Changepoint Detection Theory

First consider the Bayesian problem assuming that the change point ν is a random variable independent of the observations with a prior distribution π = { π k } . Unfortunately, in the general non-i.i.d. case and for an arbitrary prior π , the Bayesian optimization problem (62) is intractable for arbitrary values of the PFA α ∈ ( 0 , 1 ) . For this reason, we will consider the first-order asymptotic problem assuming that the given PFA α approaches zero. To be specific, the goal is to design a detection procedure T * that asymptotically minimizes the expected detection delay EDD π ( T ) to first order as α → 0 :
inf T C π ( α ) EDD π ( T ) = EDD π ( T * ) ( 1 + o ( 1 ) ) as α 0 ,
where o ( 1 ) 0 as α 0 . It turns out that, in the asymptotic setting, it is also possible to find a procedure that minimizes the conditional expected detection delay EDD k ( T ) = E k T k | T > k uniformly for all possible values of the change point ν = k Z + , i.e.,
lim α 0 inf T C π ( α ) EDD k ( T ) EDD k ( T * ) = 1 for all k Z + .
Furthermore, asymptotic optimality results can also be established for higher moments of the detection delay of the order of r > 1
E k ( T k ) r | T > k and E π ( T ν ) r | T > ν .
Since the Shiryaev procedure T SH ( A ) , which was defined in (54), (55), is optimal for the i.i.d. model and the Geometric ( ϱ ) prior, it is reasonable to expect that it is asymptotically optimal for more general priors and non-i.i.d. models. However, to study asymptotic optimality, we need certain constraints imposed on the prior distribution and on the asymptotic behavior of the decision statistics as the sample size increases, i.e., on the general stochastic model (49).
Assume that the prior distribution { π k } is fully supported, i.e., π k > 0 for all k Z + and π = 0 and that the following condition holds:
lim n → ∞ 1 n | log ∑ k = n + 1 ∞ π k | = μ for some 0 ≤ μ < ∞ .
Obviously, if μ > 0 , then the prior π has an exponential right tail (e.g., the geometric distribution Geometric ( ϱ ) , in which case μ = | log ( 1 ϱ ) | ). If μ = 0 , then it has a heavier tail than an exponential tail. In this case, we will refer to it as a heavy-tailed distribution.
Define the LLR of the hypotheses $\mathsf{H}_k$ and $\mathsf{H}_\infty$:
$$\lambda_n^k = \log \frac{d\mathrm{P}_k^n}{d\mathrm{P}_\infty^n} = \sum_{t=k+1}^{n} \log \frac{f_t(X_t \mid \mathbf{X}^{t-1})}{g_t(X_t \mid \mathbf{X}^{t-1})}, \quad n > k$$
($\lambda_n^k = 0$ for $n \le k$). To obtain asymptotic optimality results, the general non-i.i.d. model for the observations is restricted by requiring that the normalized LLR $n^{-1}\lambda_{k+n}^k$ obey the SLLN as $n \to \infty$, converging to a finite and positive number $I$ under the probability measure $\mathrm{P}_k$, together with its strengthened r-complete version
$$\sum_{n=1}^{\infty} n^{r-1} \sup_{k \in \mathbb{Z}_+} \mathrm{P}_k\!\left( \left| n^{-1}\lambda_{k+n}^k - I \right| > \varepsilon \right) < \infty \quad \text{for every } \varepsilon > 0. \tag{68}$$
It follows from Lemma 7.2.1 in [6] that, for any A > 0 ,
$$\operatorname{PFA}_\pi(T_{\mathrm{SH}}(A)) \le (1 + A)^{-1},$$
so that $T_{\mathrm{SH}}(A_\alpha) \in \mathbb{C}_\pi(\alpha)$ if $A = A_\alpha = (1-\alpha)/\alpha$.
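The bound can be checked by simulation. The following hedged Monte Carlo sketch is my own illustration (the i.i.d. Gaussian mean-shift model, the parameter values, and the posterior-odds recursion written below are assumptions, since (54) and (55) are not reproduced here); it verifies that the estimated PFA stays below $1/(1+A) = \alpha$:

```python
import math
import random

random.seed(1)

# Assumed toy model: pre-change X_t ~ N(0,1), post-change X_t ~ N(theta,1),
# change point nu ~ Geometric(rho) on {0, 1, 2, ...}.  One standard
# posterior-odds form of the Shiryaev statistic is
#   R_n = (Lambda_n / (1 - rho)) * (R_{n-1} + rho),  R_0 = 0,
# with T = min{n : R_n >= A}; then PFA = E[1/(1 + R_T)] <= 1/(1 + A).
theta, rho, alpha = 1.0, 0.1, 0.05
A = (1.0 - alpha) / alpha          # threshold A_alpha = (1 - alpha)/alpha

def false_alarm() -> bool:
    """Simulate one run; True if the alarm is raised at or before nu."""
    nu = 0
    while random.random() > rho:   # draw nu ~ Geometric(rho)
        nu += 1
    r, n = 0.0, 0
    while r < A and n < 10_000:
        n += 1
        x = random.gauss(theta if n > nu else 0.0, 1.0)
        lam = math.exp(theta * x - 0.5 * theta * theta)   # likelihood ratio
        r = lam / (1.0 - rho) * (r + rho)
    return n <= nu                  # false alarm event {T <= nu}

reps = 4000
pfa_hat = sum(false_alarm() for _ in range(reps)) / reps  # should be <= ~alpha
```

The inequality is conservative because the posterior probability at stopping overshoots the level $A/(1+A)$, so the empirical PFA is typically strictly below α.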
The following theorem, which can be deduced from Theorem 3.7 in [5], shows that the Shiryaev detection procedure is asymptotically optimal if the normalized LLR $n^{-1}\lambda_{k+n}^k$ converges r-completely to a positive and finite number $I$ and the prior distribution satisfies condition (67).
Theorem 6. 
Suppose that the prior distribution $\pi = \{\pi_k\}_{k \in \mathbb{Z}_+}$ of the change point satisfies condition (67) with some $0 \le \mu < \infty$. Assume that there exists a number $0 < I < \infty$ such that the normalized LLR process $n^{-1}\lambda_{k+n}^k$ converges to $I$ uniformly r-completely as $n \to \infty$ under $\mathrm{P}_k$, i.e., condition (68) holds for some $r \ge 1$. If the threshold $A = A_\alpha$ in the Shiryaev procedure is selected so that $\operatorname{PFA}_\pi(T_{\mathrm{SH}}(A_\alpha)) \le \alpha$ and $\log A_\alpha \sim |\log\alpha|$ as $\alpha \to 0$, e.g., $A_\alpha = (1-\alpha)/\alpha$, then as $\alpha \to 0$
$$\inf_{T \in \mathbb{C}_\pi(\alpha)} \mathrm{E}_k\big[(T-k)^r \mid T > k\big] \sim \left( \frac{|\log\alpha|}{I + \mu} \right)^{\!r} \sim \mathrm{E}_k\big[(T_{\mathrm{SH}} - k)^r \mid T_{\mathrm{SH}} > k\big] \quad \text{for all } k \in \mathbb{Z}_+$$
and
$$\inf_{T \in \mathbb{C}_\pi(\alpha)} \mathrm{E}^\pi\big[(T-\nu)^r \mid T > \nu\big] \sim \left( \frac{|\log\alpha|}{I + \mu} \right)^{\!r} \sim \mathrm{E}^\pi\big[(T_{\mathrm{SH}} - \nu)^r \mid T_{\mathrm{SH}} > \nu\big].$$
Therefore, the Shiryaev procedure $T_{\mathrm{SH}}(A_\alpha)$ is first-order asymptotically optimal as $\alpha \to 0$ in class $\mathbb{C}_\pi(\alpha)$, minimizing the moments of the detection delay up to order r whenever the r-complete version of the SLLN (68) holds for the LLR process.
For r = 1 , the assertions of this theorem imply the asymptotic optimality of the Shiryaev procedure for the expected detection delays (65) and (66) as well as asymptotic approximations for the expected detection delays.
Remark 6. 
The results of Theorem 6 can be generalized to the asymptotically non-stationary case where $\lambda_{k+n}^k/\psi(n)$ converges to $I$ uniformly r-completely as $n \to \infty$ under $\mathrm{P}_k$ with a non-linear function $\psi(n)$, similarly to the hypothesis testing problem discussed in Section 3.1. See also the recent paper [53] for the minimax change detection problem with independent but substantially non-stationary post-change observations.
It is also interesting to see how two other popular changepoint detection procedures, the SR and CUSUM procedures, perform in the Bayesian context.
Consider the SR procedure defined by (58), (59). By Lemma 3.4 (p. 100) in [5],
$$\operatorname{PFA}_\pi(T_{\mathrm{SR}}^{r_0}(A)) \le \frac{r_0 \sum_{k=1}^{\infty} \pi_k + \sum_{k=1}^{\infty} k\,\pi_k}{A} \quad \text{for every } A > 0,$$
and therefore, setting $A = A_\alpha = \alpha^{-1}\big(r_0 + \sum_{k=1}^{\infty} k\,\pi_k\big)$ implies that $T_{\mathrm{SR}}^{r_0}(A_\alpha) \in \mathbb{C}_\pi(\alpha)$. If the threshold $A = A_\alpha$ in the SR procedure is selected so that $\operatorname{PFA}_\pi(T_{\mathrm{SR}}^{r_0}(A_\alpha)) \le \alpha$ and $\log A_\alpha \sim |\log\alpha|$ as $\alpha \to 0$, e.g., $A_\alpha = \alpha^{-1}\big(r_0 + \sum_{k=1}^{\infty} k\,\pi_k\big)$, then as $\alpha \to 0$
$$\mathrm{E}_k\big[(T_{\mathrm{SR}}^{r_0} - k)^r \mid T_{\mathrm{SR}}^{r_0} > k\big] \sim \left( \frac{|\log\alpha|}{I} \right)^{\!r} \quad \text{for all } k \in \mathbb{Z}_+ \tag{69}$$
and
$$\mathrm{E}^\pi\big[(T_{\mathrm{SR}}^{r_0} - \nu)^r \mid T_{\mathrm{SR}}^{r_0} > \nu\big] \sim \left( \frac{|\log\alpha|}{I} \right)^{\!r} \tag{70}$$
whenever the uniform r-complete convergence condition (68) holds. Therefore, the SR procedure $T_{\mathrm{SR}}^{r_0}(A_\alpha)$ is first-order asymptotically optimal as $\alpha \to 0$ in class $\mathbb{C}_\pi(\alpha)$, minimizing the moments of the detection delay up to order r, when the prior distribution π is heavy-tailed (i.e., when $\mu = 0$) and the r-complete version of the SLLN holds. In the case where $\mu > 0$ (i.e., the prior distribution has an exponential tail), the SR procedure is not optimal. This is to be expected since it uses the improper uniform prior in the detection statistic.
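For a Geometric(ϱ) prior, the quantities entering the SR threshold are available in closed form, since $\sum_{k\ge 0} k\,\pi_k = (1-\varrho)/\varrho$. The following minimal sketch (illustrative parameter values, not taken from the paper) computes $A_\alpha$ and checks that the resulting bound equals α when $r_0 = 0$:

```python
# Assumed parameter values for illustration only.
rho, alpha, r0 = 0.1, 0.01, 0.0
# Mean of the Geometric(rho) prior on {0, 1, 2, ...}: sum_k k*rho*(1-rho)^k.
mean_nu = (1.0 - rho) / rho
# Threshold choice A_alpha = (r0 + sum_k k*pi_k) / alpha guarantees PFA <= alpha.
A_alpha = (r0 + mean_nu) / alpha
# With r0 = 0 the lemma's bound reduces to (sum_k k*pi_k) / A = alpha.
pfa_bound = mean_nu / A_alpha
```

For $r_0 > 0$ the bound is slightly smaller than α because $\sum_{k\ge 1}\pi_k = 1 - \varrho < 1$ multiplies the $r_0$ term.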
The same asymptotic results (69) and (70) hold for the CUSUM procedure $T_{\mathrm{CS}}(a)$ defined in (52) if the threshold $a = a_\alpha$ is selected so that $\operatorname{PFA}_\pi(T_{\mathrm{CS}}(a_\alpha)) \le \alpha$ and $a_\alpha \sim |\log\alpha|$ as $\alpha \to 0$ and the uniform r-complete convergence condition (68) holds.
Hence, the r-complete convergence of the LLR process is a sufficient condition for the uniform asymptotic optimality of several popular change detection procedures in class $\mathbb{C}_\pi(\alpha)$.

Complete Convergence and General Non-Bayesian Changepoint Detection Theory

Consider the non-Bayesian problem where the change point ν is an unknown deterministic number. We focus on the uniform optimality criterion (64), the most interesting one for a variety of applications, which requires minimizing the conditional expected delay to detection $\operatorname{CEDD}_\nu(T) = \mathrm{E}_\nu[T - \nu \mid T > \nu]$ for all values of the change point $\nu \in \mathbb{Z}_+$ in the class of change detection procedures $\mathbb{C}_{\mathrm{PFA}}(m,\beta)$ defined in (63). Recall that this class comprises change detection procedures whose maximal local probability of false alarm in the time window m,
$$\operatorname{MLPFA}(T) = \sup_{\ell \ge 0} \mathrm{P}_\infty(T \le \ell + m \mid T > \ell),$$
does not exceed the prescribed value $\beta \in (0,1)$. However, the exact solution to this challenging problem is unknown even in the i.i.d. case.
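Although the exact solution is unknown, the MLPFA itself is straightforward to estimate by simulation. The sketch below is my own illustration (the Gaussian model, the SR recursion, and all parameter values are assumptions for this example); it estimates $\sup_{\ell} \mathrm{P}_\infty(T \le \ell + m \mid T > \ell)$ for an SR stopping time over a grid of values of ℓ:

```python
import math
import random

random.seed(4)

# Assumed toy setting: pre-change X_t ~ N(0,1) (post-change would be N(theta,1));
# SR statistic R_n = (1 + R_{n-1}) * Lambda_n with R_0 = 0, T = min{n: R_n >= A}.
# All runs are simulated under P_infty, since MLPFA only involves P_infty.
theta, A, m, horizon, reps = 1.0, 200.0, 10, 400, 2000

def sr_stopping_time() -> int:
    r = 0.0
    for n in range(1, horizon + 1):
        x = random.gauss(0.0, 1.0)                        # pre-change data
        r = (1.0 + r) * math.exp(theta * x - 0.5 * theta * theta)
        if r >= A:
            return n
    return horizon + 1                                     # censored run

stops = [sr_stopping_time() for _ in range(reps)]
mlpfa_hat = 0.0
for ell in range(0, horizon - m, m):
    survivors = [t for t in stops if t > ell]
    if len(survivors) < 100:
        break                                              # too few survivors
    p = sum(1 for t in survivors if t <= ell + m) / len(survivors)
    mlpfa_hat = max(mlpfa_hat, p)
```

The supremum over ℓ is approximated on a finite grid; the cutoff on the number of surviving runs keeps the conditional estimates from becoming too noisy.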
Instead, consider the following asymptotic problem, assuming that the given MLPFA β goes to zero: find a change detection procedure $T^*$ that asymptotically minimizes the expected detection delay $\mathrm{E}_\nu[T - \nu \mid T > \nu]$ to first order as $\beta \to 0$. That is, the goal is to design a detection procedure $T^*$ such that
$$\inf_{T \in \mathbb{C}_{\mathrm{PFA}}(m,\beta)} \mathrm{E}_\nu[T - \nu \mid T > \nu] = \mathrm{E}_\nu[T^* - \nu \mid T^* > \nu]\,(1 + o(1)) \quad \text{for all } \nu \in \mathbb{Z}_+ \text{ as } \beta \to 0.$$
More generally, we may focus on the asymptotic problem of minimizing the moments of the detection delay of order $r \ge 1$:
$$\inf_{T \in \mathbb{C}_{\mathrm{PFA}}(m,\beta)} \mathrm{E}_\nu\big[(T - \nu)^r \mid T > \nu\big] = \mathrm{E}_\nu\big[(T^* - \nu)^r \mid T^* > \nu\big]\,(1 + o(1)) \quad \text{for all } \nu \in \mathbb{Z}_+ \text{ as } \beta \to 0.$$
To solve this problem, we need to assume that the window length $m = m_\beta$ is a function of the MLPFA constraint β and that $m_\beta$ goes to infinity at a certain appropriate rate as $\beta \to 0$. Using [54], the following results can be established.
Consider the SR procedure defined by (58), (59) with $r_0 = 0$, in which case we write $T_{\mathrm{SR}}^{r_0}(A) = T_{\mathrm{SR}}(A)$. Let $r \ge 1$ and assume that the r-complete version of the SLLN holds with some number $0 < I < \infty$, i.e., $n^{-1}\lambda_{\nu+n}^\nu$ converges to $I$ uniformly r-completely as $n \to \infty$ under $\mathrm{P}_\nu$. If $m_\beta = O(|\log\beta|^2)$ as $\beta \to 0$ and the threshold $A = A_\beta$ in the SR procedure is selected so that $\operatorname{MLPFA}(T_{\mathrm{SR}}(A_\beta)) \le \beta$ and $\log A_\beta \sim |\log\beta|$ as $\beta \to 0$, e.g., as defined in [54], then as $\beta \to 0$
$$\inf_{T \in \mathbb{C}_{\mathrm{PFA}}(m_\beta,\beta)} \mathrm{E}_\nu\big[(T - \nu)^r \mid T > \nu\big] \sim \left( \frac{|\log\beta|}{I} \right)^{\!r} \sim \mathrm{E}_\nu\big[(T_{\mathrm{SR}} - \nu)^r \mid T_{\mathrm{SR}} > \nu\big] \quad \text{for all } \nu \in \mathbb{Z}_+.$$
A similar result also holds for the CUSUM procedure $T_{\mathrm{CS}}(a)$ if the threshold $a = a_\beta$ is selected so that $\operatorname{MLPFA}(T_{\mathrm{CS}}(a_\beta)) \le \beta$ and $a_\beta \sim |\log\beta|$ as $\beta \to 0$ and the r-complete version of the SLLN holds for the normalized LLR $n^{-1}\lambda_{\nu+n}^\nu$ as $n \to \infty$.
Hence, the r-complete convergence of the LLR process is a sufficient condition for the uniform asymptotic optimality of the SR and CUSUM change detection procedures with respect to the moments of the detection delay of order r in class $\mathbb{C}_{\mathrm{PFA}}(m_\beta,\beta)$.

4. Quick and Complete Convergence for Markov and Hidden Markov Models

In particular problems, verifying the SLLN for the LLR process is usually relatively easy. In practice, however, verifying the strengthened r-complete or r-quick versions of the SLLN, i.e., checking condition (68), can cause some difficulty. Many interesting examples where this verification has been performed can be found in [5,6]. It is therefore of interest to find sufficient conditions for r-complete convergence for relatively large classes of stochastic models.
In this section, we address this issue for Markov and hidden Markov models based on the results obtained by Pergamenchtchikov and Tartakovsky [54] for ergodic Markov processes and by Fuh and Tartakovsky [55] for hidden Markov models (HMMs). See also Tartakovsky ([5], Ch. 3).
Let $\{X_n\}_{n \in \mathbb{Z}_+}$ be a time-homogeneous Markov process with values in a measurable space $(\mathcal{X},\mathcal{B})$ and transition probability $P(x,A)$ with density $p(y \mid x)$. Let $\mathrm{E}_x$ denote the expectation with respect to this probability. Assume that this process is geometrically ergodic, i.e., there exist positive constants $0 < R < \infty$ and $\kappa > 0$, a probability measure ϰ on $(\mathcal{X},\mathcal{B})$, and a Lyapunov function $V : \mathcal{X} \to [1,\infty)$ with $\varkappa(V) < \infty$ such that
$$\sup_{n \in \mathbb{Z}_+} e^{\kappa n} \sup_{0 < \psi \le V} \sup_{x \in \mathcal{X}} \frac{1}{V(x)} \left| \mathrm{E}_x[\psi(X_n)] - \varkappa(\psi) \right| \le R.$$
In the change detection problem, the sequence $\{X_n\}_{n \in \mathbb{Z}_+}$ is a Markov process such that $\{X_n\}_{1 \le n \le \nu}$ is homogeneous with transition density $g(y \mid x)$ and $\{X_n\}_{n > \nu}$ is homogeneous positive ergodic with transition density $f(y \mid x)$ and ergodic (stationary) distribution ϰ. In this case, the LLR process $\lambda_n^k$ can be represented as
$$\lambda_n^k = \sum_{t=k+1}^{n} G(X_t, X_{t-1}), \quad n > k,$$
where G ( y , x ) = log [ f ( y | x ) / g ( y | x ) ] .
Define
$$I = \int_{\mathcal{X}} \int_{\mathcal{X}} G(y,x)\, f(y \mid x)\, dy\, \varkappa(dx).$$
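As a concrete (hypothetical) instance, consider an AR(1) model with a shift θ in the mean: pre-change $X_t = aX_{t-1} + w_t$ and post-change $X_t = \theta + aX_{t-1} + w_t$ with $w_t \sim N(0,1)$. Then $G(y,x) = \theta(y - ax) - \theta^2/2$ and $I = \theta^2/2$, which the sketch below estimates along a simulated post-change trajectory:

```python
import math
import random

random.seed(2)

# Hypothetical AR(1) example (not from the paper):
#   pre-change:  X_t = a*X_{t-1} + w_t,          g(y|x) = phi(y - a*x)
#   post-change: X_t = theta + a*X_{t-1} + w_t,  f(y|x) = phi(y - theta - a*x)
# with w_t ~ N(0,1).  Hence G(y,x) = theta*(y - a*x) - theta^2/2, and since
# y - a*x = theta + w under the post-change law, E[G] = theta^2/2 = I.
a, theta = 0.5, 1.0

def G(y: float, x: float) -> float:
    return theta * (y - a * x) - 0.5 * theta * theta

# Monte Carlo estimate of I along one long post-change trajectory.
x, total, N = 0.0, 0.0, 200_000
for _ in range(N):
    y = theta + a * x + random.gauss(0.0, 1.0)
    total += G(y, x)
    x = y
I_hat = total / N   # should be close to theta**2 / 2 = 0.5
```

In this example, the conditional mean of $G(X_t, X_{t-1})$ does not depend on the state, so the estimate converges to $\theta^2/2$ regardless of the initial condition.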
Under a set of quite sophisticated sufficient conditions, the normalized LLR $\lambda_{k+n}^k/n$ converges to $I$ r-completely as $n \to \infty$ (cf. [54]). We omit the details and only mention that the main condition is the finiteness of the $(r+1)$-th moment of the LLR increment, $\mathrm{E}_0[(G(X_1,X_0))^{r+1}] < \infty$.
Now consider an HMM with a finite state space. Then again, as in the pure Markov case, the main condition for the r-complete convergence of $\lambda_{k+n}^k/n$ to $I$, where $I$ is specified in Fuh and Tartakovsky [55], is $\mathrm{E}_0[(\lambda_1^0)^{r+1}] < \infty$. Further details can be found in [55].
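For HMMs, the LLR $\lambda_n^0$ is a difference of marginal log-likelihoods computed by the forward algorithm, which marginalizes out the hidden chain. The sketch below is a generic illustration with two hypothetical two-state Gaussian HMMs standing in for the pre- and post-change models (these are not the specific models of [55]):

```python
import math
import random

random.seed(3)

def logsumexp(a: float, b: float) -> float:
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_lik(xs, trans, means):
    """Forward algorithm in the log domain for a 2-state Gaussian HMM with
    unit variances and a uniform initial state distribution."""
    def log_phi(x, mean):
        return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)
    la = [math.log(0.5) + log_phi(xs[0], means[j]) for j in range(2)]
    for x in xs[1:]:
        la = [
            logsumexp(la[0] + math.log(trans[0][j]), la[1] + math.log(trans[1][j]))
            + log_phi(x, means[j])
            for j in range(2)
        ]
    return logsumexp(la[0], la[1])

# Two hypothetical 2-state Gaussian HMMs playing the roles of f and g.
trans_f, means_f = [[0.9, 0.1], [0.2, 0.8]], (0.0, 2.0)
trans_g, means_g = [[0.5, 0.5], [0.5, 0.5]], (0.0, 0.0)

xs = [random.gauss(0.0, 1.0) for _ in range(50)]
llr = log_lik(xs, trans_f, means_f) - log_lik(xs, trans_g, means_g)
```

Working in the log domain avoids the numerical underflow of the plain forward recursion for long observation sequences, which matters precisely because the moment condition concerns the behavior of $\lambda_n^0$ as n grows.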
Similar results for Markov and hidden Markov models hold for the hypothesis testing problem considered in Section 3.1. Specifically, if in the Markov case we assume that the observed Markov process $\{X_n\}_{n \in \mathbb{Z}_+}$ is time-homogeneous and geometrically ergodic with transition density $f_i(y \mid x)$ under hypothesis $\mathsf{H}_i$ ($i = 0,1,\dots,N$) and invariant distribution $\varkappa_i$, then the LLR processes are
$$\lambda_{ij}(n) = \sum_{t=1}^{n} G_{ij}(X_t, X_{t-1}), \quad i,j = 0,1,\dots,N, \ i \ne j,$$
where $G_{ij}(y,x) = \log[f_i(y \mid x)/f_j(y \mid x)]$. If $\mathrm{E}_i[(G_{ij}(X_1,X_0))^{r+1}] < \infty$, then the normalized LLR $n^{-1}\lambda_{ij}(n)$ converges r-completely to the finite number
$$I_{ij} = \int_{\mathcal{X}} \int_{\mathcal{X}} G_{ij}(y,x)\, f_i(y \mid x)\, dy\, \varkappa_i(dx).$$

5. Discussion and Conclusions

The purpose of this article is to provide an overview of two modes of convergence in the LLN—r-quick and r-complete convergences. These strengthened versions of the SLLN are often neglected in the theory of probability. In the first part of this paper (Section 2), we discussed in detail these two modes of convergence and corresponding strengthened versions of the SLLN. The main motivation was the fact that both r-quick and r-complete versions of the SLLN can be effectively used for establishing near optimality results in sequential analysis, in particular, in sequential hypothesis testing and quickest changepoint detection problems for very general stochastic models of dependent and non-stationary observations. These models are not limited to Markov and hidden Markov models. The results presented in the second part of this paper (Section 3) show that the constraints imposed on the models for observations can be formulated in terms of either the r-quick or r-complete convergence of properly normalized log-likelihood ratios between hypotheses to finite numbers, which can be interpreted as local Kullback–Leibler information numbers. This is natural and can be intuitively expected since optimal or nearly optimal decision-making rules are typically based on a combination of log-likelihood ratios. Therefore, if one is interested in the asymptotic optimality properties of decision-making rules, the asymptotic behavior of log-likelihood ratios as the sample size goes to infinity not only matters but provides the main contribution.
The results presented in this article allow us to conclude that the strengthened r-quick and r-complete versions of the SLLN are useful tools for many statistical problems for general non-i.i.d. stochastic models. In particular, r-quick and r-complete convergences for log-likelihood ratio processes are sufficient for the near optimality of sequential hypothesis tests and changepoint detection procedures for models with dependent and non-identically distributed observations. Such non-i.i.d. models are typical for modern large-scale information and physical systems that produce big data in numerous practical applications. Readers interested in specific applications may find detailed discussions in [4,5,6,7,21,22,33,35,37,53,54,55,56,57,58].

Funding

This article received no external funding.

Data Availability Statement

No real data were used in this research.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Hsu, P.L.; Robbins, H. Complete convergence and the law of large numbers. Proc. Natl. Acad. Sci. USA 1947, 33, 25–31.
2. Baum, L.E.; Katz, M. Convergence rates in the law of large numbers. Trans. Am. Math. Soc. 1965, 120, 108–123.
3. Strassen, V. Almost sure behavior of sums of independent random variables and martingales. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, San Diego, CA, USA, 21 June–18 July 1965 and 27 December 1965–7 January 1966; Le Cam, L.M., Neyman, J., Eds.; Vol. 2: Contributions to Probability Theory, Part 1; University of California Press: Berkeley, CA, USA, 1967; pp. 315–343.
4. Tartakovsky, A.G. Asymptotic optimality of certain multihypothesis sequential tests: Non-i.i.d. case. Stat. Inference Stoch. Process. 1998, 1, 265–295.
5. Tartakovsky, A.G. Sequential Change Detection and Hypothesis Testing: General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules; Monographs on Statistics and Applied Probability 165; Chapman & Hall/CRC Press, Taylor & Francis Group: Boca Raton, FL, USA; London, UK; New York, NY, USA, 2020.
6. Tartakovsky, A.G.; Nikiforov, I.V.; Basseville, M. Sequential Analysis: Hypothesis Testing and Changepoint Detection; Monographs on Statistics and Applied Probability 136; Chapman & Hall/CRC Press, Taylor & Francis Group: Boca Raton, FL, USA; London, UK; New York, NY, USA, 2015.
7. Lai, T.L. Asymptotic optimality of invariant sequential probability ratio tests. Ann. Stat. 1981, 9, 318–333.
8. Lai, T.L. On r-quick convergence and a conjecture of Strassen. Ann. Probab. 1976, 4, 612–627.
9. Chow, Y.S.; Lai, T.L. Some one-sided theorems on the tail distribution of sample sums with applications to the last time and largest excess of boundary crossings. Trans. Am. Math. Soc. 1975, 208, 51–72.
10. Fuh, C.D.; Zhang, C.H. Poisson equation, moment inequalities and quick convergence for Markov random walks. Stoch. Process. Their Appl. 2000, 87, 53–67.
11. Wald, A. Sequential tests of statistical hypotheses. Ann. Math. Stat. 1945, 16, 117–186.
12. Wald, A. Sequential Analysis; John Wiley & Sons, Inc.: New York, NY, USA, 1947.
13. Wald, A.; Wolfowitz, J. Optimum character of the sequential probability ratio test. Ann. Math. Stat. 1948, 19, 326–339.
14. Burkholder, D.L.; Wijsman, R.A. Optimum properties and admissibility of sequential tests. Ann. Math. Stat. 1963, 34, 1–17.
15. Matthes, T.K. On the optimality of sequential probability ratio tests. Ann. Math. Stat. 1963, 34, 18–21.
16. Ferguson, T.S. Mathematical Statistics: A Decision Theoretic Approach; Probability and Mathematical Statistics; Academic Press: Cambridge, MA, USA, 1967.
17. Lehmann, E.L. Testing Statistical Hypotheses; John Wiley & Sons, Inc.: New York, NY, USA, 1968.
18. Shiryaev, A.N. Optimal Stopping Rules; Series on Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 1978; Volume 8.
19. Golubev, G.K.; Khas'minskii, R.Z. Sequential testing for several signals in Gaussian white noise. Theory Probab. Appl. 1984, 28, 573–584.
20. Tartakovsky, A.G. Asymptotically optimal sequential tests for nonhomogeneous processes. Seq. Anal. 1998, 17, 33–62.
21. Verdenskaya, N.V.; Tartakovskii, A.G. Asymptotically optimal sequential testing of multiple hypotheses for nonhomogeneous Gaussian processes in an asymmetric situation. Theory Probab. Appl. 1991, 36, 536–547.
22. Fellouris, G.; Tartakovsky, A.G. Multichannel sequential detection–Part I: Non-i.i.d. data. IEEE Trans. Inf. Theory 2017, 63, 4551–4571.
23. Armitage, P. Sequential analysis with more than two alternative hypotheses, and its relation to discriminant function analysis. J. R. Stat. Soc. Ser. B Methodol. 1950, 12, 137–144.
24. Chernoff, H. Sequential design of experiments. Ann. Math. Stat. 1959, 30, 755–770.
25. Kiefer, J.; Sacks, J. Asymptotically optimal sequential inference and design. Ann. Math. Stat. 1963, 34, 705–750.
26. Lorden, G. Integrated risk of asymptotically Bayes sequential tests. Ann. Math. Stat. 1967, 38, 1399–1422.
27. Lorden, G. Nearly-optimal sequential tests for finitely many parameter values. Ann. Stat. 1977, 5, 1–21.
28. Pavlov, I.V. Sequential procedure of testing composite hypotheses with applications to the Kiefer–Weiss problem. Theory Probab. Appl. 1990, 35, 280–292.
29. Baron, M.; Tartakovsky, A.G. Asymptotic optimality of change-point detection schemes in general continuous-time models. Seq. Anal. 2006, 25, 257–296.
30. Mosteller, F. A k-sample slippage test for an extreme population. Ann. Math. Stat. 1948, 19, 58–65.
31. Bakut, P.A.; Bolshakov, I.A.; Gerasimov, B.M.; Kuriksha, A.A.; Repin, V.G.; Tartakovsky, G.P.; Shirokov, V.V. Statistical Radar Theory; Tartakovsky, G.P., Ed.; Sovetskoe Radio: Moscow, Russia, 1963; Volume 1. (In Russian)
32. Basseville, M.; Nikiforov, I.V. Detection of Abrupt Changes—Theory and Application; Information and System Sciences Series; Prentice-Hall, Inc.: Englewood Cliffs, NJ, USA, 1993.
33. Jeske, D.R.; Steven, N.T.; Tartakovsky, A.G.; Wilson, J.D. Statistical methods for network surveillance. Appl. Stoch. Model. Bus. Ind. 2018, 34, 425–445.
34. Jeske, D.R.; Steven, N.T.; Wilson, J.D.; Tartakovsky, A.G. Statistical network surveillance. In Wiley StatsRef: Statistics Reference Online; Wiley: New York, NY, USA, 2018; pp. 1–12.
35. Tartakovsky, A.G.; Brown, J. Adaptive spatial-temporal filtering methods for clutter removal and target tracking. IEEE Trans. Aerosp. Electron. Syst. 2008, 44, 1522–1537.
36. Szor, P. The Art of Computer Virus Research and Defense; Addison-Wesley Professional: Upper Saddle River, NJ, USA, 2005.
37. Tartakovsky, A.G. Rapid detection of attacks in computer networks by quickest changepoint detection methods. In Data Analysis for Network Cyber-Security; Adams, N., Heard, N., Eds.; Imperial College Press: London, UK, 2014; pp. 33–70.
38. Tartakovsky, A.G.; Rozovskii, B.L.; Blaźek, R.B.; Kim, H. Detection of intrusions in information systems by sequential change-point methods. Stat. Methodol. 2006, 3, 252–293.
39. Tartakovsky, A.G.; Rozovskii, B.L.; Blaźek, R.B.; Kim, H. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Trans. Signal Process. 2006, 54, 3372–3382.
40. Siegmund, D. Change-points: From sequential detection to biology and back. Seq. Anal. 2013, 32, 2–14.
41. Moustakides, G.V. Sequential change detection revisited. Ann. Stat. 2008, 36, 787–807.
42. Page, E.S. Continuous inspection schemes. Biometrika 1954, 41, 100–114.
43. Shiryaev, A.N. On optimum methods in quickest detection problems. Theory Probab. Appl. 1963, 8, 22–46.
44. Moustakides, G.V.; Polunchenko, A.S.; Tartakovsky, A.G. A numerical approach to performance analysis of quickest change-point detection procedures. Stat. Sin. 2011, 21, 571–596.
45. Moustakides, G.V.; Polunchenko, A.S.; Tartakovsky, A.G. Numerical comparison of CUSUM and Shiryaev–Roberts procedures for detecting changes in distributions. Commun. Stat.-Theory Methods 2009, 38, 3225–3239.
46. Lorden, G. Procedures for reacting to a change in distribution. Ann. Math. Stat. 1971, 42, 1897–1908.
47. Moustakides, G.V. Optimal stopping times for detecting changes in distributions. Ann. Stat. 1986, 14, 1379–1387.
48. Pollak, M. Optimal detection of a change in distribution. Ann. Stat. 1985, 13, 206–227.
49. Tartakovsky, A.G.; Pollak, M.; Polunchenko, A.S. Third-order asymptotic optimality of the generalized Shiryaev–Roberts changepoint detection procedures. Theory Probab. Appl. 2012, 56, 457–484.
50. Polunchenko, A.S.; Tartakovsky, A.G. On optimality of the Shiryaev–Roberts procedure for detecting a change in distribution. Ann. Stat. 2010, 38, 3445–3457.
51. Shiryaev, A.N. The problem of the most rapid detection of a disturbance in a stationary process. Sov. Math.–Dokl. 1961, 2, 795–799, Translation from Doklady Akademii Nauk SSSR 1961, 138, 1039–1042.
52. Tartakovsky, A.G. Discussion on "Is Average Run Length to False Alarm Always an Informative Criterion?" by Yajun Mei. Seq. Anal. 2008, 27, 396–405.
53. Liang, Y.; Tartakovsky, A.G.; Veeravalli, V.V. Quickest change detection with non-stationary post-change observations. IEEE Trans. Inf. Theory 2023, 69, 3400–3414.
54. Pergamenchtchikov, S.; Tartakovsky, A.G. Asymptotically optimal pointwise and minimax quickest change-point detection for dependent data. Stat. Inference Stoch. Process. 2018, 21, 217–259.
55. Fuh, C.D.; Tartakovsky, A.G. Asymptotic Bayesian theory of quickest change detection for hidden Markov models. IEEE Trans. Inf. Theory 2019, 65, 511–529.
56. Kolessa, A.; Tartakovsky, A.; Ivanov, A.; Radchenko, V. Nonlinear estimation and decision-making methods in short track identification and orbit determination problem. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 301–312.
57. Tartakovsky, A.; Berenkov, N.; Kolessa, A.; Nikiforov, I. Optimal sequential detection of signals with unknown appearance and disappearance points in time. IEEE Trans. Signal Process. 2021, 69, 2653–2662.
58. Pergamenchtchikov, S.M.; Tartakovsky, A.G.; Spivak, V.S. Minimax and pointwise sequential changepoint detection and identification for general stochastic models. J. Multivar. Anal. 2022, 190, 104977.
Figure 1. Illustration of a single-run sequential changepoint detection. Two possibilities in the detection process: false alarm (left) and correct detection (right).