On near optimality of one-sample update for joint detection and estimation

Sequential hypothesis testing and change-point detection with unknown distribution parameters are fundamental problems in statistics and machine learning. We show that for such problems, detection procedures based on sequential likelihood ratios with simple one-sample update estimates, such as online mirror descent, are nearly second-order optimal: the upper bound on the algorithm's performance meets the lower bound asymptotically, up to a log-log factor in the false-alarm rate as it tends to zero. This is a blessing, since although the generalized likelihood ratio (GLR) statistics are theoretically optimal, they cannot be computed recursively, and their exact computation usually requires infinite memory of historical data. We prove the nearly second-order optimality by making a connection between sequential analysis and online convex optimization and leveraging the logarithmic regret bound of the online mirror descent algorithm. Numerical examples validate our theory.


Introduction
Sequential analysis is a classic topic in statistics concerning online inference from a sequence of observations. The goal is to make statistical inference as quickly as possible while controlling the false-alarm rate. Two commonly studied sequential analysis problems are sequential hypothesis testing and sequential change-point detection [35]. They arise in various applications including online anomaly detection, statistical quality control, biosurveillance, financial arbitrage detection, and network security monitoring (see, e.g., [36,40]).
We are interested in joint estimation and detection in sequential analysis, which arises when the data distribution has unknown parameters. For instance, in change-point detection, given a sequence of samples X_1, X_2, ..., a common assumption is that they are i.i.d. with some distribution f_θ parameterized by θ, and the values of θ are different before and after the change-point. One can assume that before the change the parameter value is θ_0. This is reasonable since, in various settings, there is a relatively large amount of background data, so the parameter θ in the normal state can be estimated with good accuracy. After the change, the parameter switches to an unknown value, representing an anomaly or novelty that needs to be discovered.

Motivation: Dilemma of CUSUM and generalized likelihood ratio (GLR) statistics
Consider change-point detection with unknown parameters. A commonly used change-point detection method is the so-called CUSUM procedure [40].
It can be derived from likelihood ratios. Assume that before the change the samples X_i follow a distribution f_θ0, and after the change they follow another distribution f_θ1. The CUSUM procedure has a recursive structure: initialize with W_0 = 0; the likelihood-ratio statistic is computed according to W_{t+1} = max{W_t + log(f_θ1(X_{t+1})/f_θ0(X_{t+1})), 0}, and a change-point is detected whenever W_t exceeds a pre-specified threshold. Due to this recursive structure, CUSUM is memory efficient: it does not need to store historical data and only records the value of W_t. However, one possible issue with CUSUM is the choice of the post-change parameter θ_1.
In practice, it is usually chosen to represent the "smallest" change of interest. However, this choice is somewhat subjective, and in the multi-dimensional setting it is hard to define what the "smallest" change would mean. Moreover, when the assumed parameter θ_1 deviates significantly from the true parameter value, CUSUM may suffer severe performance degradation [13].
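The CUSUM recursion above is a one-line update. As a minimal sketch, assume a scalar Gaussian model f_θ = N(θ, σ²), for which the log-likelihood ratio has a closed form; the parameter values and threshold below are illustrative choices, not values from the paper:

```python
import math
import random

def cusum_update(W, x, theta0, theta1, sigma=1.0):
    """One CUSUM step: W_{t+1} = max(W_t + log LR(x), 0).

    For N(theta, sigma^2), the log likelihood ratio is
    (theta1 - theta0) * (x - (theta0 + theta1)/2) / sigma^2.
    """
    llr = (theta1 - theta0) * (x - (theta0 + theta1) / 2.0) / sigma**2
    return max(W + llr, 0.0)

# Stream with a change at t = 50: mean shifts from 0 to 1.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(1, 1) for _ in range(50)]
W, alarm = 0.0, None
for t, x in enumerate(xs, start=1):
    W = cusum_update(W, x, theta0=0.0, theta1=1.0)
    if W > 5.0 and alarm is None:   # illustrative threshold b = 5
        alarm = t
```

Only the scalar W is carried between steps, which is the memory efficiency noted above.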
An alternative approach is the generalized likelihood ratio (GLR) statistic [5]. The GLR statistic finds the maximum likelihood estimate (MLE) of the post-change parameter and plugs it back into the likelihood ratio to form the detection statistic. More precisely, for each hypothetical change-point location k, the corresponding post-change samples are {X_{k+1}, ..., X_t}; using these samples one can form the MLE, denoted θ̂_{k,t}. Since we know beforehand neither whether the change occurs nor where, the GLR statistic maximizes over all possible change locations k: it is given by max_{k<t} ∑_{i=k+1}^{t} log(f_{θ̂_{k,t}}(X_i)/f_{θ_0}(X_i)), and a change is announced whenever it exceeds a pre-specified threshold. The GLR statistic is more robust than CUSUM [16], and it is particularly useful when the post-change parameter may vary from one situation to another. However, a drawback of the GLR statistic is that it is not memory efficient and cannot be computed recursively. Moreover, when there is a constraint on the maximum likelihood estimator (such as sparsity), the MLE may not have a closed-form solution; one has to store the historical data and re-estimate θ̂_{k,t} whenever new data arrive. As a remedy, the window-limited GLR is usually considered, where one keeps only the past w samples and restricts the maximization over k to (t − w, t]. However, even with the window-limited GLR, one still has to re-estimate θ̂_{k,t} using historical data whenever new data are added.
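To make the window-limited GLR concrete, here is a small sketch for the scalar Gaussian mean-shift case with σ = 1 and θ_0 = 0, where the MLE over a segment is its sample mean m and the per-segment GLR term simplifies to (t − k)(m − θ_0)²/2; the window length w is an illustrative assumption:

```python
def window_glr(xs, theta0=0.0, w=20):
    """Window-limited GLR statistic for a Gaussian mean shift (sigma = 1).

    For this model the MLE of the post-change mean over {X_{k+1},...,X_t}
    is the sample mean m, so each GLR term is (t - k) * (m - theta0)**2 / 2.
    Maximization over k is restricted to the last w samples.
    """
    t = len(xs)
    best = 0.0
    for k in range(max(0, t - w), t):
        seg = xs[k:]                      # candidate post-change segment
        m = sum(seg) / len(seg)           # segment MLE
        best = max(best, len(seg) * (m - theta0) ** 2 / 2.0)
    return best
```

Note that every call re-scans the window, which is exactly the re-estimation cost the text describes.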
In practice, rather than CUSUM or GLR, various one-sample update schemes are used, especially in the machine learning literature. These schemes perform online estimation of the unknown parameter and plug the estimates into the likelihood-ratio statistic to perform detection. The one-sample update takes the form θ̂_t = h(X_t, θ̂_{t−1}) for some function h that uses only the most recent sample and the previous estimate. Examples include online gradient descent and online mirror descent (similar schemes have been used in [30,31]). The one-sample update enjoys efficient computation, as the information from new data is incorporated via a low-cost update such as mirror descent, which even has a closed-form solution in some cases. It is also memory efficient, since the update needs only the most recent sample. Such estimators may not correspond to the exact MLE, but they tend to perform well. An important question remains to be answered: how much performance do we lose by using one-sample update schemes rather than the exact GLR?

Application scenario: Social network change-point detection
The widespread use of social networks (such as Twitter) generates a large amount of user-generated data continuously. One important task is to detect change-points in streaming social network data. These change-points may represent the collective anticipation of, or response to, external events or system "shocks" [28]. Detecting such changes can provide a better understanding of patterns of social life. In social networks, a common form of data is discrete events over continuous time; as a simplification, each event contains a time label and a user label in the network. In our prior work [19], we model discrete-event data using network point processes, which capture the influence between users through an influence matrix, and cast the problem as detecting changes in the influence matrix. We assume that the influence matrix in the normal state (before the change) can be estimated from reference data. After the change, the influence matrix is unknown, since it is due to an anomaly, and has to be estimated online. Because the scale of the network can be large, computational and memory constraints mean we do not want to store the entire history; rather, we compute the statistic in real time. In [19], we develop a one-sample update scheme to estimate the influence matrix and then form the likelihood-ratio detection statistic. However, the theoretical performance of such one-sample update schemes has not been well understood.

Contributions
This paper aims to address the above question by proving the nearly second-order optimality of simple one-sample update schemes for sequential hypothesis testing and change-point detection. Nearly second-order optimality [40] means that the upper bound on performance matches the lower bound up to a log-log factor. In particular, we consider likelihood ratios with a plug-in online mirror descent estimator. Our approach generalizes the non-anticipating estimator framework [22] from detecting a Gaussian mean shift to the exponential family with constrained parameters. Here we focus on online mirror descent, but the results can be generalized to other schemes such as online gradient descent. The proof leverages the logarithmic regret property of online mirror descent and the lower bounds established in the statistical sequential analysis literature [37,40]. Synthetic examples validate the performance of one-sample update schemes. The contributions of the paper are summarized as follows.
• We provide a general upper bound for sequential hypothesis testing and change-point detection procedures with one-sample update schemes.
The upper bound explicitly captures the impact of estimation on detection through an estimation-algorithm-dependent factor. This factor shows up as an additional term in the upper bound on the expected detection delay, and it corresponds to the regret bound of the estimator. This establishes an interesting linkage between sequential analysis and online convex optimization.
• Using our upper bound and an existing lower bound, we show that one-sample update schemes are nearly second-order optimal for the exponential family. Moreover, numerical examples verify their good performance. They can perform better and be more robust than likelihood-ratio methods with pre-specified parameters (e.g., CUSUM for change-point detection). They are also computationally efficient alternatives to the GLR statistic (which requires storing all past samples) and cause little performance loss relative to GLR.
The comparison of the three approaches is summarized in Table 1.

Literature and related work
Sequential analysis is a classic subject with an extensive literature. Much success has been achieved when the pre-change and post-change distributions are exactly specified: for example, the CUSUM procedure [27], with first-order asymptotic optimality [21] and exact optimality [23] in the minimax sense, and the Shiryayev-Roberts (SR) procedure [34], which can be derived from a Bayesian principle and enjoys various optimality properties. Both CUSUM and SR rely on likelihood ratios between the specified pre-change and post-change distributions. The GLR statistic [18,16] enjoys certain optimality properties, but it cannot be computed recursively in most cases [17]. To address the infinite-memory issue, [44,16] studied the window-limited GLR procedure. Another approach to the same issue is the Shiryayev-Roberts-Robbins-Siegmund (SRRS) procedure [22]. The main idea of SRRS dates back to the power-one sequential test [33]: instead of plugging in the MLE obtained using all samples up to the current moment, as in the GLR procedure, the SRRS procedure uses a sequence of non-anticipating estimators, formed by dropping the most recent sample (hence the name "non-anticipating"). The advantage is that the test statistic can be computed recursively; the resulting small performance loss can be bounded. The original SRRS procedure [33] was developed for the Gaussian case with unknown post-change mean. Our work extends it to the general exponential family via the adaptive SRRS (ASR) procedure. Our non-anticipating estimator also differs from the original SRRS [33]: SRRS still uses the exact MLE estimated from all but the most recent sample, whereas our estimator only approximates the MLE using one-sample update schemes.

With unknown parameters, [29] developed a modified SR procedure by introducing a prior distribution on the unknown parameters; however, the resulting detection statistic is hard to compute recursively when the prior is not conjugate. The more recent works [48] and [47] study a joint detection and estimation problem of a specific form: a linear scalar observation model with Gaussian noise, with an unknown multiplicative parameter under the alternative hypothesis. This problem arises in many applications such as spectrum sensing [46], image observations [41], and MIMO radar [39]. [48] demonstrates that solving the joint problem by treating detection and estimation separately, each with its own optimal procedure, does not yield an overall optimal performance, and provides an elegant closed-form optimal detector; [47] later generalizes these results. There are also other approaches to the joint detection-estimation problem using multiple hypothesis testing [6,41] and Bayesian formulations [24]. Our work differs from the above in that we consider the general form of the joint detection and estimation problem, where the unknown θ appears generally as the parameter of an exponential family. Moreover, we do not aim to find the exact optimal solution; instead, we ask whether using a computationally efficient one-sample estimator for detection loses much performance.
Related work using online convex optimization for anomaly detection includes [30], which develops an efficient detector for the exponential family using online mirror descent and proves a logarithmic regret bound, and [31], which dynamically adjusts the detection threshold to incorporate feedback about the decision outcome. However, these works consider a different setting, in which the change is a transient outlier instead of a persistent change, as assumed in the classic statistical change-point detection literature. When the change is persistent, it is important to accumulate "evidence" by pooling the post-change samples (our work considers the persistent change).
Extensive work has been done on parameter estimation in the online setting. This includes online density estimation over the exponential family by regret minimization [2,30,31], sequential prediction of individual sequences under logarithmic loss [8,15], online prediction for time series [26], and sequential NML (SNML) prediction [15], which achieves the optimal regret bound. Our problem differs from the above in that estimation is not the end goal; one performs parameter estimation only to plug the estimates back into the likelihood function for detection. Moreover, a subtle but important difference is that the loss function for online density estimation is −log f_{θ̂_i}(X_i), whereas our loss function is −log f_{θ̂_{i−1}}(X_i), in order to retain the martingale property, which is essential for establishing the nearly second-order optimality.
On a high level, the problem of joint detection and estimation is also related to universal source coding [10,9] and Minimum Description Length (MDL) [32,4]. In universal source coding, the goal is to minimize the cumulative Kullback-Leibler (KL) loss.

Preliminaries
Assume a sequence of i.i.d. random variables X_1, X_2, ... with a probability density function of parametric form f_θ, where the parameter θ may be unknown. We consider two related problems: sequential hypothesis testing and sequential change-point detection. The detection statistic relies on a sequence of estimators {θ̂_t} constructed using online mirror descent. Online mirror descent uses a simple one-sample update: the update from θ̂_{t−1} to θ̂_t uses only the current sample X_t. This is the main difference from the traditional generalized likelihood ratio (GLR) statistic [16], where each θ̂_t is estimated using historical samples. In the following, we give detailed descriptions of the two problems. We then consider the exponential family and present our non-anticipating estimator based on the one-sample update.

Sequential hypothesis test
Consider the null hypothesis H_0: θ = θ_0 versus the alternative H_1: θ ≠ θ_0; hence the parameter under the alternative distribution is unknown. The classic approach to this problem is the sequential probability-ratio test (SPRT) [43]: at each time, given samples {X_1, X_2, ..., X_t}, the decision is either to accept H_0, accept H_1, or take more samples if neither hypothesis can be resolved confidently. Here, we introduce a modified SPRT with a sequence of non-anticipating plug-in estimators. Define the likelihood ratio at time t as
Λ(t) = ∏_{i=1}^{t} f_{θ̂_{i−1}}(X_i)/f_{θ_0}(X_i).   (1)
The test statistic has a simple recursive implementation,
Λ(t) = Λ(t−1) · f_{θ̂_{t−1}}(X_t)/f_{θ_0}(X_t).   (2)
Moreover, it has a martingale property under P_∞ due to the non-anticipating nature of the estimator: E_∞[Λ(t) | F_{t−1}] = Λ(t−1). The decision rule is a stopping time
τ(b) = inf{t ≥ 1 : log Λ(t) ≥ b},   (3)
where b > 0 is a pre-specified threshold. We reject the null hypothesis whenever the statistic exceeds the threshold. The goal is to resolve the two hypotheses using as few samples as possible under the type-I error constraint.
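The recursive statistic with a non-anticipating plug-in estimate can be sketched as follows; the Gaussian model and the running-mean estimator (a simple stand-in for the OMD estimator) are illustrative assumptions:

```python
import math

def gauss_logpdf(x, theta):
    # log density of N(theta, 1)
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)

def adaptive_sprt(xs, theta0=0.0, b=4.0):
    """Modified SPRT: log Lambda(t) += log f_{theta_hat_{t-1}}(X_t) - log f_{theta0}(X_t).

    The plug-in estimate is non-anticipating: when X_t arrives, we pay with
    theta_hat_{t-1}, built from X_1..X_{t-1} only, and update afterwards.
    Returns the stopping time, or None if the statistic never crosses b.
    """
    log_lam, theta_hat, s = 0.0, theta0, 0.0   # theta_hat_0 = theta_0
    for t, x in enumerate(xs, start=1):
        log_lam += gauss_logpdf(x, theta_hat) - gauss_logpdf(x, theta0)
        if log_lam > b:
            return t
        s += x
        theta_hat = s / t   # update AFTER using the previous estimate
    return None
```

The update order (use the old estimate, then refresh it) is what preserves the martingale property under the null.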

Sequential change-point detection
A problem related to sequential hypothesis testing is sequential change-point detection. Owing to its importance in applications and its different performance metrics, it is usually studied separately. A change may occur at an unknown time ν and alters the underlying distribution of the data; one would like to detect such a change as quickly as possible. Formally, change-point detection can be cast as the following hypothesis test:
H_0: X_1, X_2, ... are i.i.d. with density f_{θ_0}, versus H_1: there exists ν ≥ 0 such that X_1, ..., X_ν are i.i.d. with density f_{θ_0} and X_{ν+1}, X_{ν+2}, ... are i.i.d. with density f_θ.
Here we assume θ is unknown, and it represents the anomaly. The goal is to detect the change as quickly as possible after it occurs, subject to a false-alarm constraint.
We consider likelihood-ratio-based detection procedures adapted from two existing types, which we call the adaptive CUSUM (ACM) and adaptive SRRS (ASR) procedures.
For change-point detection, the post-change parameter is estimated using the post-change samples. This means that, for each putative change-point location k < t before the current time, the post-change samples are {X_k, ..., X_t}; with a slight abuse of notation, the post-change parameter is estimated by the one-sample update applied to these samples,
θ̂_{k,i} = h(X_i, θ̂_{k,i−1}), with θ̂_{k,k−1} = θ_0.   (5)
Therefore, for k = 1, θ̂_{k,i} becomes θ̂_i defined in (2) for the SPRT. Based on this, the likelihood ratio at time t for a hypothetical change-point location k is given by
Λ_{k,t} = ∏_{i=k}^{t} f_{θ̂_{k,i−1}}(X_i)/f_{θ_0}(X_i),
where Λ_{k,t} can be computed recursively, similarly to (2). Since we do not know the change-point location ν, following the maximum likelihood principle we take the maximum of the statistics over all possible values of k,
max_{1≤k≤t} log Λ_{k,t}.   (6)
This gives the ACM procedure:
T_ACM(b) = inf{t ≥ 1 : max_{1≤k≤t} log Λ_{k,t} ≥ b},   (7)
where b is a pre-specified threshold.
Similarly, by replacing the maximization in (6) with a summation, we obtain the following ASR procedure [22], which can be interpreted as a Bayesian statistic similar to the Shiryaev-Roberts procedure:
T_ASR(b) = inf{t ≥ 1 : log ∑_{k=1}^{t} Λ_{k,t} ≥ b},   (8)
where b is a pre-specified threshold. The computation of Λ_{k,t} and the estimators {θ̂_t}, {θ̂_{k,t}} are discussed later in Section 2.3.
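Both statistics can be maintained recursively by keeping one running statistic and one estimator per putative change-point k, never the raw history; the following sketch uses a Gaussian model with running-mean estimates as illustrative stand-ins for the OMD estimators:

```python
import math

def gauss_logpdf(x, theta):
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)

class AdaptiveDetector:
    """Maintains log Lambda_{k,t} recursively for every putative change point k.

    Each k keeps its own non-anticipating estimate (a running mean here).
    ACM takes the max over k, ASR the (log of the) sum; only per-k summary
    statistics are stored, never past samples.
    """
    def __init__(self, theta0=0.0):
        self.theta0 = theta0
        self.stats = []   # per-k record: [log Lambda_{k,t}, running sum, count]

    def update(self, x):
        self.stats.append([0.0, 0.0, 0])        # new candidate k = current t
        for rec in self.stats:
            theta_hat = self.theta0 if rec[2] == 0 else rec[1] / rec[2]
            rec[0] += gauss_logpdf(x, theta_hat) - gauss_logpdf(x, self.theta0)
            rec[1] += x
            rec[2] += 1
        acm = max(rec[0] for rec in self.stats)
        asr = math.log(sum(math.exp(rec[0]) for rec in self.stats))
        return acm, asr
```

The per-step cost grows linearly in t because every candidate k is tracked; a window-limited variant would cap the list length.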

Online mirror descent (OMD) for non-anticipating estimators
Next, we discuss how to construct the non-anticipating estimators {θ̂_t}_{t≥1} in (1) and {θ̂_{k,t}}, 1 ≤ k < t, in (5) using online mirror descent (OMD). OMD is a generic procedure for solving online convex optimization problems (OCP). Our problem of finding the maximum likelihood estimator can be cast as an OCP with the loss function being the negative log-likelihood, ℓ_t(θ) := −log f_θ(X_t).
The main idea of OMD is the following. At each time step, the estimator θ̂_{t−1} is updated using the new sample X_t by balancing the tendency to stay close to the previous estimate against the tendency to move in the direction of the greatest local decrease of the loss function. For the loss function defined above, a sequence of OMD estimators is constructed by
θ̂_t = arg min_{u∈Γ} { η_t ⟨∇ℓ_t(θ̂_{t−1}), u⟩ + B_Φ(u, θ̂_{t−1}) },
where η_t is the step size and B_Φ is the Bregman divergence induced by the log-partition function Φ (defined below). Here Γ ⊂ Θ_σ is a closed convex set, which is problem-specific and encourages certain parameter structure such as sparsity. Similarly, θ̂_{k,t} can be constructed via OMD for sequential change-point detection.
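For intuition, in the Gaussian case N(θ, 1) the map ∇Φ is the identity, so the dual-space OMD update μ_t = μ_{t−1} − η_t(μ_{t−1} − φ(X_t)) with step size η_t = 1/t and no projection (Γ = R) reduces to the running sample mean; a minimal sketch under these assumptions:

```python
def omd_gaussian(xs, theta0=0.0):
    """One-sample OMD updates for N(theta, 1) in natural parameters.

    grad Phi is the identity here, so the dual update with eta_t = 1/t
    is exactly the running sample mean (Gamma = R, no projection).
    """
    mu = theta0
    estimates = []
    for t, x in enumerate(xs, start=1):
        eta = 1.0 / t
        mu = mu - eta * (mu - x)   # dual update; phi(x) = x
        estimates.append(mu)
    return estimates
```

Each step touches only the newest sample and the previous estimate, matching the one-sample update form θ̂_t = h(X_t, θ̂_{t−1}).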
A standard performance metric for an OCP is regret: the difference between the total cost an online algorithm incurs and that of the best fixed decision in hindsight. Given samples X_1, ..., X_t, the regret for a sequence of estimators {θ̂_i}_{i=1}^{t} is defined as
R_t = ∑_{i=1}^{t} ℓ_i(θ̂_{i−1}) − min_{θ∈Γ} ∑_{i=1}^{t} ℓ_i(θ).
For strongly convex loss functions, the regret of many OCP algorithms, including online mirror descent, satisfies R_n ≤ C log n for some constant C (depending on f_θ and Θ_σ) and any positive integer n [1,31]. For the exponential family, the loss function is the negative log-likelihood, which is strongly convex over Θ_σ; hence we have the logarithmic regret property.

Exponential family
In this paper, we focus on f_θ in the exponential family for the following reasons: (i) the exponential family [31] represents a very rich class of parametric and even many nonparametric statistical models [3]; (ii) the negative log-likelihood −log f_θ(x) is convex in θ, which allows us to perform online convex optimization with nice theoretical properties. Some useful properties of the exponential family are briefly summarized below; full proofs can be found in [31]. Consider an observation space X equipped with a sigma-algebra B and a sigma-finite measure H on (X, B). Assume the number of parameters is d. Let x⊺ denote the transpose of a vector or matrix. Let φ: X → R^d be an H-measurable function, φ(x) = (φ_1(x), ..., φ_d(x))⊺; φ(x) corresponds to the sufficient statistic for θ. Let Θ denote the parameter space in R^d, and let {P_θ, θ ∈ Θ} be a set of probability distributions with respect to the measure H. Then {P_θ, θ ∈ Θ} is said to be a multivariate exponential family with natural parameter θ if the probability density function of each f_θ ∈ P_θ with respect to H can be expressed as f_θ(x) = exp{θ⊺φ(x) − Φ(θ)}. Here the so-called log-partition function is given by Φ(θ) = log ∫_X exp{θ⊺φ(x)} H(dx). To make f_θ(x) a well-defined probability density, we consider the natural parameter space Θ = {θ ∈ R^d : Φ(θ) < ∞}, together with a subset Θ_σ ⊆ Θ over which the negative log-likelihood is σ-strongly convex. The Hessian ∇²Φ(θ) is the covariance matrix of the vector φ(X); since a covariance matrix is positive semidefinite, ∇²Φ(θ) is positive semidefinite and hence Φ(θ) is convex. Moreover, Φ is a Legendre function, meaning it is strongly convex, continuously differentiable, and essentially smooth [42]. The Legendre-Fenchel dual is defined as Φ*(μ) = sup_{θ∈Θ} {⟨θ, μ⟩ − Φ(θ)}. The mapping ∇Φ* is an inverse mapping of ∇Φ [7]; moreover, if Φ is strongly convex, then ∇Φ* = (∇Φ)^{−1}.
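A concrete instance of this dual pair: for the Bernoulli family with φ(x) = x, the log-partition function is Φ(θ) = log(1 + e^θ), ∇Φ is the sigmoid (mapping the natural parameter to the mean), and ∇Φ* is the logit, its inverse. A small numerical check:

```python
import math

# Bernoulli as a one-parameter exponential family: phi(x) = x,
# Phi(theta) = log(1 + e^theta), mean mu = grad Phi(theta) = sigmoid(theta),
# and grad Phi* is its inverse, the logit.
def grad_Phi(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def grad_Phi_star(mu):
    return math.log(mu / (1.0 - mu))

theta = 0.7
mu = grad_Phi(theta)             # natural parameter -> mean parameter
theta_back = grad_Phi_star(mu)   # inverse mapping recovers theta
```

This mean/natural-parameter pairing is exactly what Algorithm 1 exploits: the gradient step is taken in the mean (dual) coordinates and mapped back via ∇Φ*.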
A general measure of proximity used in online mirror descent is the so-called Bregman divergence B_F, a nonnegative function induced by a Legendre function F (see, e.g., [42,31]), defined as
B_F(u, v) = F(u) − F(v) − ⟨∇F(v), u − v⟩.   (11)
For the exponential family, a natural choice of Bregman divergence is the Kullback-Leibler (KL) divergence. Define E_θ as the expectation when X is a random variable with density f_θ, Int Θ as the interior of Θ, and KL(f_{θ_1}, f_{θ_2}) = E_{θ_1}[log(f_{θ_1}(X)/f_{θ_2}(X))] as the KL divergence between two distributions with densities f_{θ_1} and f_{θ_2} for any θ_1, θ_2 ∈ Θ. It can be shown that, for the exponential family, KL(f_{θ_1}, f_{θ_2}) = Φ(θ_2) − Φ(θ_1) − ⟨∇Φ(θ_1), θ_2 − θ_1⟩. Using definition (11), this means that KL(f_{θ_1}, f_{θ_2}) = B_Φ(θ_2, θ_1) is a Bregman divergence. This property is quite useful for constructing mirror descent estimators for the exponential family [25,7].
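The identity KL(f_{θ_1}, f_{θ_2}) = B_Φ(θ_2, θ_1) can be checked numerically, e.g., for the Bernoulli family (the parameter values below are arbitrary test points):

```python
import math

def Phi(theta):            # Bernoulli log-partition function
    return math.log(1.0 + math.exp(theta))

def bregman_Phi(u, v):     # B_Phi(u, v) = Phi(u) - Phi(v) - Phi'(v)(u - v)
    gv = 1.0 / (1.0 + math.exp(-v))   # Phi'(v) = sigmoid(v)
    return Phi(u) - Phi(v) - gv * (u - v)

def kl_bernoulli(p, q):    # KL between Bernoulli(p) and Bernoulli(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

t1, t2 = 0.3, -1.2
p = 1.0 / (1.0 + math.exp(-t1))
q = 1.0 / (1.0 + math.exp(-t2))
# KL(f_{theta1} || f_{theta2}) equals the Bregman divergence B_Phi(theta2, theta1)
gap = abs(kl_bernoulli(p, q) - bregman_Phi(t2, t1))
```

The agreement follows from log(p/(1−p)) = θ and log(1−p) = −Φ(θ) for this family.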

Algorithm 1 (OMD for non-anticipating estimation). At each time step t: (i) acquire a new observation X_t; (ii) compute the likelihood-ratio increment f_{θ̂_{t−1}}(X_t)/f_{θ_0}(X_t); (iii) perform the dual update μ_{t−1} = ∇Φ(θ̂_{t−1}), μ_t = μ_{t−1} − η_t(μ_{t−1} − φ(X_t)); (iv) map back to the primal space and project, θ̂_t = arg min_{u∈Γ} B_Φ(u, ∇Φ*(μ_t)).

Nearly second-order optimality of one-sample update procedures

Below we prove the nearly second-order optimality of the one-sample update scheme based on OMD. More precisely, nearly second-order optimality means that the algorithm attains the lower performance bound asymptotically, up to a log-log factor in the false-alarm rate, as the false-alarm rate tends to zero (in many cases the log-log factor is a small number). In particular, we show that the performance of τ(b) for sequential hypothesis testing, and of T_ACM(b) and T_ASR(b) for sequential change-point detection, attains the known lower bounds established in the statistical sequential analysis literature up to a log-log factor. We first introduce some necessary notation. Denote by P_{θ,ν} and E_{θ,ν} the probability measure and expectation when the change occurs at time ν and the post-change parameter is θ, i.e., when X_1, ..., X_ν are i.i.d. random variables with density f_{θ_0} and X_{ν+1}, X_{ν+2}, ... are i.i.d. random variables with density f_θ. Moreover, let P_∞ and E_∞ denote the probability measure and expectation when there is no change, i.e., X_1, X_2, ... are i.i.d. random variables with density f_{θ_0}. Finally, let F_t denote the σ-field generated by X_1, ..., X_t for t ≥ 1.

Sequential hypothesis test
The two standard performance metrics are the type-I error (false detection probability), defined for sequential hypothesis testing as P_∞(τ(b) < ∞), and the expected number of samples needed to reject the null, E_{θ,0}[τ(b)]. Since it is possible to take infinitely many samples, the power of the test in (3) is one and the type-II error is zero. A meaningful test should have both small P_∞(τ(b) < ∞) and small E_{θ,0}[τ(b)]. Usually, one adjusts the threshold b to control the type-I error below a certain level.
Intuitively, a reasonable sequence of estimators {θ̂_t} should move closer to the true parameter θ as more data are collected. This is reflected by a regularity condition, (13) below (a similar assumption is made in (5.84) of [40]), involving a constant r ≥ 1 that characterizes the convergence rate of {θ̂_t}: a larger r means a slower convergence rate. This is a mild assumption satisfied by many estimators, including OMD.
Our main result is the following. As observed in [17], there is a loss in statistical efficiency from using the one-sample update estimators {θ̂_t}, relative to the GLR approach that uses the entire past sample (X_1, ..., X_t).
The theorem below shows that this loss due to the one-sample update corresponds to the expected regret of the estimators {θ̂_t}.
The main idea of the proof is to decompose the statistic defining τ(b), log Λ(t), into a few terms that form martingales, and then invoke Wald's theorem for the stopped process.
Note that in the statement of the theorem, the stopping time τ(b) appears on both sides of the inequality. This is not an issue, since the expected sample size E_{θ,0}[τ(b)] can be bounded and is usually small. By comparing with a specific regret bound R_{τ(b)}, we can bound E_{θ,0}[τ(b)], as discussed in Section 4. The most important case is when the estimation algorithm has a logarithmic expected regret. For the exponential family, as shown in Section 3.3, Algorithm 1 achieves E_{θ,0}[R_n] ≤ C log n for any positive integer n. Equipped with this regret bound, we obtain the following Corollary 2.
Corollary 2. For a sequence of estimators with a logarithmic expected regret bound, E_{θ,0}[R_n] ≤ C log n for any positive integer n and some constant C > 0, when (13) holds, the upper bound of the preceding theorem holds with the expected regret term replaced by C log b, up to a vanishing o(1) term as b → ∞.

Moreover, we can obtain an upper bound on the type-I error of the test τ(b).
Lemma 3 (Type-I error). For a sequence of estimators {θ̂_t} generated by OMD, P_∞(τ(b) < ∞) ≤ e^{−b}.
Lemma 3 sheds light on how to choose an appropriate b: one can set b = log(1/α) to control the type-I error to be less than α.
The result means that, compared with any procedure (including the optimal one) calibrated to have a fixed type-I error less than α, our procedure incurs at most a log(log(1/α)) increase in the expected sample size, which is usually a small number. For instance, a usual choice in statistics is α = 10^{−5} when controlling the false alarm; then log(log(1/α)) ≈ 2.44.

Sequential change-point detection
For sequential change-point detection, the two commonly used performance metrics [40] are the average run length (ARL), denoted E_∞[T], and the maximal conditional average detection delay (CADD), denoted sup_{ν≥0} E_{θ,ν}[T − ν | T > ν]. The ARL is the expected number of samples between two successive false alarms, and the CADD is the expected number of samples needed to detect the change after it occurs. A good procedure should have a large ARL and a small CADD. Similarly, one usually chooses b large enough so that the ARL exceeds a pre-specified level.
We have the following theorem bounding the detection delay, obtained by relating the CUSUM to the SPRT [21] and using the fact that, when the measure P_∞ is known, sup_{ν≥0} E_{θ,ν}[T − ν | T > ν] is attained at ν = 0 for both the ASR and ACM procedures. First, using the martingale property of the detection statistic, we establish a lower bound on the ARL of the detection procedures, which is needed to prove Theorem 6.
Lemma 5 (ARL). Consider the change-point detection procedures T_ASR(b) in (8) and T_ACM(b) in (7), with a sequence of estimators {θ̂_t}_{t≥0}, θ̂_t ∈ Θ, generated by OMD. Given γ > 0, provided that b ≥ log γ, we have E_∞[T_ASR(b)] ≥ γ and E_∞[T_ACM(b)] ≥ γ.
Lemma 5 shows that, given a required lower bound γ on the ARL, we can choose b = log γ to satisfy the ARL constraint. This is consistent with earlier works [29,22], which show that the smallest threshold b such that E_∞[T_ACM(b)] ≥ γ is approximately log γ. Specifically, setting b = ρ log γ for some ρ ∈ (0, 1) can be sufficient to ensure that the ARL is greater than γ.
Theorem 6. Consider the change-point detection procedures T_ASR(b) in (8) and T_ACM(b) in (7), using a sequence of estimators {θ̂_t}_{t≥1} with θ̂_0 = θ_0 generated by OMD. When b → ∞, if (13) holds, the detection delay of both procedures admits an upper bound of the same form as for τ(b), with the additional term given by the expected regret. As before, we may apply a similar argument as in Corollary 2 to remove the dependence on the stopping time on the right-hand side of the inequality.
Combining the upper bound in Theorem 6 with an existing lower bound on the EDD of the SRRS procedure in [37], we obtain the following corollary.
Similar expression holds for T ACM (b).
Comparing (17) with (16), we note that the ARL γ plays the same role as 1/α, because 1/γ is roughly the false-alarm rate for sequential change-point detection [21].

Example: Regret bound for specific cases
In this subsection, we show that the regret R_t can be expressed as a weighted sum of Bregman divergences between consecutive estimators. This form of R_t is useful in establishing the logarithmic expected regret property, and in showing how the assumptions required by Corollary 2 are satisfied. The following result is a modification of [2].
Corollary 9 (Upper bound on the expected regret, Gaussian). Assume X_1, X_2, ... are i.i.d. N(θ, I_d) for some θ ∈ R^d, and that {θ̂_i}_{i≥1}, {μ̂_i}_{i≥1} are obtained using Algorithm 1 with η_i = 1/i and Γ = R^d. Then for any t > 0, there is a constant C_1 > 0, depending on θ, such that E_{θ,0}[R_t] ≤ C_1 log t.
The following calculation justifies Corollary 9 and also serves as an example of how to use the regret bound. First, the assumption θ̃_t = θ̂_t in Theorem 8 is satisfied: since Γ = R^d is the full space, Algorithm 1 together with the non-negativity of the Bregman divergence gives θ̂_t = arg min_{u∈Γ} B_Φ(u, θ̃_t) = θ̃_t. The regret can then be written as a weighted sum of Bregman divergences between consecutive estimates; with step size η_i = 1/i, the i-th term is of order 1/i, and summing over i = 1, ..., t yields the desired bound. Thus, with i.i.d. multivariate normal samples, the expected regret grows logarithmically with the number of observations.
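The logarithmic growth can be checked directly. For the running-mean predictor (OMD with η_i = 1/i) under the Gaussian loss ℓ_i(θ) = (x_i − θ)²/2, the regret against the best fixed parameter in hindsight is bounded by 2(1 + log n) for any stream taking values in [0, 2] with θ_0 = 0; the bounded deterministic stream below is an illustrative choice, not data from the paper:

```python
import math

def omd_regret(xs, theta0=0.0):
    """Regret of the one-sample (running-mean) predictor under the Gaussian
    negative log-likelihood l_i(theta) = (x_i - theta)**2 / 2 (constants drop).

    The comparator is the best fixed theta in hindsight, i.e. the sample mean.
    """
    m, online_loss = theta0, 0.0
    for i, x in enumerate(xs, start=1):
        online_loss += (x - m) ** 2 / 2.0   # pay with the OLD estimate
        m += (x - m) / i                    # eta_i = 1/i update
    best_loss = sum((x - m) ** 2 / 2.0 for x in xs)
    return online_loss - best_loss

# Bounded deterministic stream in [0, 2]: regret grows like log n.
n = 1000
xs = [1.0 + math.sin(i) for i in range(1, n + 1)]
r = omd_regret(xs)
```

The bound follows from the identity regret = x_1²/2 + Σ_{i≥2} (x_i − m_{i−1})²/(2i), each term of which is at most 2/i here.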
Using a similar calculation, we can also bound the expected regret in the general case. As in the proof of Corollary 9, the dominating term of R_t can be rewritten as a sum of quadratic forms (φ(X_i) − μ̂_i)⊺[∇²Φ*(μ̃_i)](φ(X_i) − μ̂_i), where μ̃_i is a convex combination of μ̂_{i−1} and μ̂_i. For an arbitrary distribution, this term can be viewed as a local normal approximation with changing curvature ∇²Φ*(μ̃_i); thus, it is possible to prove O(log t)-style bounds case by case. Proofs for the Bernoulli and Gamma distributions can be found in [2], and a proof of OMD for the covariance matrix of a multivariate normal can be found in [11]. A more general result is Theorem 3 in [31], which, however, requires stronger conditions.

Synthetic examples
In this section, we present synthetic examples to demonstrate the good performance of our methods. We focus on ACM and ASR for sequential change-point detection.

Detecting sparse mean-shift of multivariate normal distribution
We consider detecting the emergence of a sparse mean vector in a multivariate normal distribution. Sparse mean-shift detection is of particular interest in sensor networks and DNA sequence detection, where usually only a small fraction of the entries of the post-change mean parameter are non-zero [45,38]. Below, ∥·∥_2 denotes the ℓ_2 norm in R^d, ∥·∥_1 the ℓ_1 norm, and ∥·∥_0 the ℓ_0 norm, defined as the number of non-zero entries.
In this case, the Bregman divergence is equivalent to the KL divergence and is given by B_Φ(θ, θ') = ∥θ − θ'∥_2² / 2. Equipped with this Bregman divergence, the projection onto Γ in Algorithm 1 is simply a Euclidean projection onto a convex set, which in many cases can be implemented efficiently. An important and useful case is Γ = {θ : ∥θ∥_1 ≤ s}, where s is a prescribed radius of the ℓ_1 ball; the projection onto the ℓ_1 ball can be obtained via simple soft-thresholding [12]. This encourages a sparse post-change mean, and Γ can be viewed as a convex relaxation of {θ : ∥θ∥_0 ≤ s}.
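For concreteness, here is a minimal sketch of the ℓ_1-ball projection via soft-thresholding, following the standard sort-and-threshold algorithm of [12] (an illustration, not the paper's implementation; the function name is ours):

```python
import numpy as np

def project_l1_ball(v, s):
    """Euclidean projection of v onto {theta : ||theta||_1 <= s}."""
    if np.sum(np.abs(v)) <= s:
        return v.copy()                       # already inside the ball
    u = np.sort(np.abs(v))[::-1]              # sorted magnitudes, descending
    css = np.cumsum(u)
    # largest index rho with u[rho] > (css[rho] - s) / (rho + 1)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - s)[0][-1]
    tau = (css[rho] - s) / (rho + 1.0)        # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

print(project_l1_ball(np.array([3.0, 1.0, 0.0]), 2.0))  # -> [2. 0. 0.]
```

Within Algorithm 1, this projection would be applied to the updated estimate at every step when Γ is the ℓ_1 ball, which is what produces the sparsity-encouraging behavior noted above.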
Assume that the initial samples have been normalized by subtracting the mean and dividing by the standard deviation, so that the pre-change distribution is N(0, I_d). To compare the performance of different procedures, we first use simulations to choose the thresholds b such that the ARLs of all procedures equal 10000. Note that the ARL is an increasing function of b, so this can be done by simple bisection. Two benchmark procedures are CUSUM and GLR. For the CUSUM procedure, we specify a nominal post-change mean, the all-one vector. Our procedures are T_ASR(b) and T_ACM(b) with Γ = R^d and Γ = {θ : ∥θ∥_1 ≤ s}. In the following experiments, we run 10000 Monte Carlo trials to obtain each simulated EDD.
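The bisection step can be sketched as follows. Here `arl_fn` is a stand-in for a Monte Carlo estimate of E_∞[T(b)]; the closed-form model ARL(b) = e^b used in the demo is only a toy assumption (consistent with the b = log γ scaling of Lemma 5) so that the sketch is self-contained.

```python
import math

def calibrate_threshold(arl_fn, target, b_lo=0.0, b_hi=50.0, tol=1e-6):
    """Bisection for b with arl_fn(b) ~= target, using that ARL increases in b."""
    while b_hi - b_lo > tol:
        b_mid = 0.5 * (b_lo + b_hi)
        if arl_fn(b_mid) < target:
            b_lo = b_mid        # ARL too small: raise the threshold
        else:
            b_hi = b_mid        # ARL large enough: lower the threshold
    return 0.5 * (b_lo + b_hi)

# Toy ARL model ARL(b) = exp(b); in practice arl_fn would be a Monte Carlo
# simulation of the procedure's average run length at threshold b.
b = calibrate_threshold(math.exp, 10000.0)
print(b)  # close to log(10000) ~ 9.21
```

In the actual experiments each evaluation of `arl_fn` is itself a simulation, so one would use a loose tolerance and relatively few bisection iterations.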
In the experiments, we set d = 20. The post-change distributions are N(θ, I_d), where 100p% of the entries of θ equal 1 and the rest are 0; the locations of the non-zero entries are random. Table 2 shows the EDDs versus the proportion p of non-zero entries of the post-change parameter θ. Our procedures incur little performance loss compared with the GLR and CUSUM procedures. Notably, T_ACM(b) with Γ = {θ : ∥θ∥_1 ≤ 5} performs almost the same as the GLR procedure and much better than the CUSUM procedure when p is small. This also shows the advantage of the projection when the true parameter is sparse.

Table 2: Comparison of one-sample update schemes versus the traditional CUSUM and GLR methods for detecting a sparse mean-shift, for p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. "CUSUM": CUSUM procedure with a pre-specified all-one vector as the post-change parameter; "GLR": GLR procedure; "ASR"/"ACM": the proposed one-sample update procedures; p is the proportion of non-zero entries in θ. The value for each point is averaged over 10000 Monte Carlo trials. For each point, the standard deviation is less than one half of the value.

Communication-rate change detection with Erdős-Rényi model
Next, we consider detecting a change in the communication rate in a network, which is a model for social network data. Suppose we observe communications between nodes in a network over time, represented as a sequence of (symmetric) adjacency matrices. At time t, if nodes i and j communicate, then the adjacency matrix has 1 in its ij-th and ji-th entries (thus it forms an undirected graph); nodes that do not communicate have 0 in the corresponding entries. We model such communication patterns using the Erdős-Rényi random graph model: each edge has a fixed probability of being present or absent, independently of the other edges. Under the null hypothesis, each edge is a Bernoulli random variable taking value 1 with known probability p and value 0 with probability 1 − p. Under the alternative hypothesis, there exists an unknown time κ after which a small subset of edges occurs with an unknown and different probability p′ ≠ p.
In the experiments, we set N = 20 and d = 190. For the pre-change parameters, we set p_i = 0.2 for all i = 1, . . ., d. For the post-change parameters, we randomly select n out of the 190 edges, denoted by E, and set p_i = 0.8 for i ∈ E and p_i = 0.2 for i ∉ E. Moreover, we let the change happen at time ν = 0 (since the upper bound for the EDD is achieved at ν = 0, as argued in the proof of Theorem 6). To implement CUSUM, we specify the post-change parameters as p_i = 0.8 for all i = 1, . . ., d. We select the thresholds b such that the ARLs all equal 10000.
The results are shown in Table 3. Our procedures outperform the CUSUM procedure when n is small, since the post-change parameter used in the CUSUM procedure is far from the true parameter. Compared with the GLR procedure, our methods have a small performance loss, which becomes almost negligible as n approaches d = 190. Moreover, our methods are much faster than the GLR procedure: the computational complexity of updating the statistic is O(t), compared with O(t²) for the GLR procedure. Below are the specifications of Algorithm 1 in this case. For the Bernoulli distribution with unknown parameter p, the natural parameter is θ = log(p/(1 − p)). Thus, we have φ(x) = x, dH(x) = 1, Φ(θ) = log(1 + e^θ), μ = e^θ/(1 + e^θ), and Φ*(μ) = μ log μ + (1 − μ) log(1 − μ).
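As an illustration (not the paper's code), the sketch below runs the mean-parameter form of the one-sample update on simulated edge observations. For this Bernoulli family with Φ(θ) = log(1 + e^θ) and η_i = 1/i, the mirror descent step reduces to the running mean μ_i = μ_{i−1} + η_i(x_i − μ_{i−1}), here clipped away from {0, 1} so that θ = log(μ/(1 − μ)) stays finite; the horizon and the choice of 20 changed edges are arbitrary.

```python
import numpy as np

# One-sample update (mean-parameter form) for d independent Bernoulli edges.
rng = np.random.default_rng(1)
d, t_max, n_changed = 190, 500, 20      # n_changed edges is an arbitrary choice
p_true = np.full(d, 0.2)                # pre-change rate p = 0.2
p_true[rng.choice(d, size=n_changed, replace=False)] = 0.8  # post-change edges

mu = np.full(d, 0.5)                    # initial guess for the mean parameter
eps = 1e-3                              # keeps the natural parameter finite
for i in range(1, t_max + 1):
    x = (rng.random(d) < p_true).astype(float)  # one adjacency observation
    mu += (x - mu) / i                  # eta_i = 1/i update (running mean)
    mu = np.clip(mu, eps, 1.0 - eps)

theta = np.log(mu / (1.0 - mu))         # natural parameter log(p / (1 - p))
print(np.max(np.abs(mu - p_true)))      # estimation error shrinks with t
```

Each per-edge update is O(1), which is the source of the O(t) total cost noted above.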

Conclusion
In this paper, we consider sequential hypothesis testing and change-point detection with computationally efficient one-sample update schemes obtained from online mirror descent. We show that the loss of statistical efficiency caused by the online mirror descent estimator (replacing the exact maximum likelihood estimator computed from the complete historical data) is related to the regret incurred by the online convex optimization procedure. The result generalizes to any estimation method with a logarithmic regret bound. This result sheds light on the relationship between statistical detection procedures and online convex optimization.
Then, under the measure P_{θ,0}, S_t is a random walk with i.i.d. increments, so by Wald's identity (e.g., [35]) we have that

On the other hand, let θ*_N denote the MLE based on (X_1, . . ., X_N). The key to the proof is to decompose the stopped process S_N as a summation of three terms as follows:

Therefore we have

Now consider the third term in the decomposition (19). Similar to the proof of equation (5.109) in [40], we obtain that, under condition (13), its expectation under the measure P_{θ,0} is upper bounded by b/I(θ, θ_0) + O(1) as b → ∞. Then, for any positive integer n, we may further decompose the third term in (19) as

where

The decomposition (20) consists of stochastic processes {M_n(θ)} and {m_n(θ, θ_0)}, which are both P_{θ,0}-martingales with zero expectation, i.e., E_{θ,0}[M_n(θ)] = E_{θ,0}[m_n(θ, θ_0)] = 0 for any positive integer n. Since, for an exponential family, the log-partition function Φ(θ) is bounded, the inequalities for martingales [20] give

where C_1 and C_2 are two absolute constants that do not depend on n. Applying (21), together with condition (13), we have that n^{-1} M_n(θ) and n^{-1} m_n(θ, θ_0) converge to 0 almost surely. Moreover, the convergence is P_{θ,0}-r-quick for r = 1 (for the definition of r-quick convergence, refer to Section 2.4.3 in [40]). Therefore, dividing both sides of (20) by n, we obtain that n^{-1} Σ_{i=1}^n log(f_{θ_{i−1}}(X_i)/f_{θ_0}(X_i)) converges 1-quickly to I(θ, θ_0). For ε > 0, we now define the last entry time

By the definition of 1-quick convergence, we have E_{θ,0}[L(ε)] < +∞ for all ε > 0.
In the following, define the scaled threshold b̃ = b/I(θ, θ_0). Observe that, conditioning on the event {L(ε) + 1 < N < +∞}, we have that

Therefore, conditioning on the event {L(ε) + 1 < N < +∞}, we have N < 1 + b̃/(1 − ε). Hence, for any 0 < ε < 1, we have

Finally, the second term in (19) can be written as

which is exactly the regret R_t defined in (10) for the online estimators, when the loss function is the negative log-likelihood. The theorem is then proven by combining the above analysis for the three terms with (21) and (18).

Applying Jensen's inequality, the upper bound in equation (14) becomes x ≤ α + β log(x). From this, we have x ≤ O(α). Taking the logarithm on both sides and using the fact that max{a_1, a_2} ≤ a_1 + a_2 ≤ 2 max{a_1, a_2} for a_1, a_2 ≥ 0, we obtain log(x) ≤ max{log(2α), log(2β log x)} ≤ log(α) + o(log b). Therefore, x ≤ α + β(log(α) + o(log b)). Using this argument, we obtain

Next, we establish a few lemmas useful for proving Theorem 6 for the sequential detection procedures. Define a measure Q on (X^∞, B^∞) under which the probability density of X_i conditional on F_{i−1} is f_{θ_{i−1}}. Then, for any event A ∈ F_i, we have Q(A) = ∫_A Λ_i dP_∞. The following lemma shows that the restriction of Q to F_i is well defined.

Lemma 10. Let Q_i be the restriction of Q to F_i. Then, for any A ∈ F_k and any

Proof of Lemma 3. To bound the term P_∞(τ(b) < ∞), we take advantage of the martingale property of Λ_t in (2). The main technique is a combination of change of measure and Wald's likelihood ratio identity [35]. The proof is based on the methods presented in [17] and [22].
Define L_i = dP_i/dQ_i as the Radon-Nikodym derivative, where P_i and Q_i are the restrictions of P_∞ and Q to F_i, respectively. Then L_i = (Λ_i)^{-1} for any i ≥ 1 (note that Λ_i is defined in (2)). Combining Lemma 10 and Wald's likelihood ratio identity, we have that

where I(E) is the indicator function equal to 1 for ω ∈ E and 0 otherwise. By the definition of τ(b), we have L_{τ(b)} ≤ exp(−b).
Proof of Corollary 4. We use (5.180) and (5.188) in [40], which concern the asymptotic performance of open-ended tests. Since our problem is a special case of the problem in [40], we obtain

Combining the above result with the right-hand side of (15) proves the corollary.
Proof of Theorem 6. From (26), we have that for any ν ≥ 1,

Therefore, to prove the theorem, using Theorem 1, it suffices to show that

Using an argument similar to the remarks in [22], the supremum of the detection delay over all change locations is achieved when the change occurs at the first instant.
Notice that since θ_0 is known, for any j ≥ 1, the distribution of {max_{j+1≤k≤t} Λ_{k,t}}_{t=j+1}^∞ under P_{θ,j} conditional on F_j is the same as the distribution of {max_{1≤k≤t} Λ_{k,t}}_{t=1}^∞ under P_{θ,0}. We have

Therefore, {R_t − t}_{t≥1} is a (P_∞, F_t)-martingale with zero mean. Suppose that E_∞[T_ASR(b)] < ∞ (otherwise the statement of the proposition is trivial); then we have that

Proof of Corollary 7. Our Theorem 1 and the remarks in [37] show that the minimum worst-case detection delay, given a fixed ARL level γ, is given by (28). It can be shown that the infimum is attained by choosing T(b) as a weighted Shiryayev-Roberts detection procedure, with a careful choice of the weight over the parameter space Θ. Combining (28) with the right-hand side of (15) proves the corollary.
The following derivation borrows ideas from [2]. First, we derive concise forms of the two terms in the definition of R_t in (10).
By subtracting the expression in (30) from the one in (29), we obtain the following result, which shows that the regret can be represented as a weighted sum of Bregman divergences between consecutive estimators.

Corollary 7 (Nearly second-order optimality of ACM and ASR). Assume that the estimators used in the stopping times are generated by Algorithm 1. Define S(γ) = {T : E_∞[T] ≥ γ}. For b = log γ, by Lemma 5, both T_ASR(b) and T_ACM(b) belong to S(γ). For such b, both T_ASR(b) and T_ACM(b) are nearly second-order optimal in the sense that for any θ

Table 1: Comparison of three approaches.

Table 3: Comparison of EDDs in detecting a change of communication rate in a network. The results are obtained from 10000 Monte Carlo trials. For each number, the standard deviation is less than one half of the number.