Change Point Detection with Mean Shift Based on AUC from Symmetric Sliding Windows

: Change point detection is widely used in signal detection, industrial engineering, economy, ﬁnance, biomedicine and many other ﬁelds. The widely used parametric methods require prior knowledge of the noise signal distribution, which are seldom realistic. In practice, when the distribution of noise is not known, it is desirable to design algorithms based on non-parametric statistics, which, in the null case (no change point), are completely distribution free. To this end, we propose to use two symmetric sliding windows to compute the Area Under the receiver operating characteristic Curve (AUC) as a test statistic to measure the difference between the distribution of two samples. In the stage of change point detection, a threshold is designed according to hypothesis test which is based on the null distribution of the test statistics. This threshold is used to detect the potential change points in the signal. To reduce the probability of false alarm detection, a key parameter K is set to distinguish and delete the false alarms in potential change points. Comparative studies showed that our proposed method outperforms the classical Relative unconstrained Least-Squares Importance Fitting (RuLSIF) algorithm and is also better than the Hawkins, Qiu, and Kang (HQK) algorithm when the noise follows non-normal distributions.


Introduction
In a time series (or signal), a change point refers to a point which separates the whole signal into two sub-series with different statistical properties. Thus the change point detection aims at identifying whether the distribution or parameter of the process changes, and the point that causes this abrupt change is the point that needs to be detected [1][2][3]. It is a significant and practical problem in some modern fields, such as quality control [4], sensor signal analysis [5], Network attack detection [6], tracking [7], and so on. In general, the change point detection algorithms assume that it follows one probability distribution before the change point, and it starts to follow another distribution after the change point.
Change point detection techniques can be classified as parametric or non-parametric methods. In general, parametric methods are based on known distributions, whose likelihood function plays an important role. There are a number of parametric methods for change point detection that have been proposed. The most widely used are the Cumulative Sum control chart (CUSUM) type methods under various distributions. Gan et al. [8] developed an optimal design of CUSUM control charts under the binomial distribution. Jiang et al. [9] proposed a class of weighted CUSUM chart to efficiently detect changes in the incidence rate under Poisson distribution. For an online setting, when the Average Run Length (ARL) is fixed, the minimum value of a CUSUM method's random variable is lower than the corresponding Sherwert-type method, because the parameters change continuously from small to medium. Steiner et al. [10] used Normal distributions to monitor processes with highly censored data. Noura et al. [11] estimated the proportional risk model whether includes a change point using a Weibull distribution.
Despite the advantages of parametric methods, in practice, however, parametric information, i.e., prior knowlege of the signal distribution are often unavailable. In such cases, it is more practical and flexible to employ nonparametric methods. Many nonparametric methods have been considered in the literature for change point detection. Gijbels et al. [12] used a bootstrap method for selecting bandwidth parameters in a nonparametric two-step estimation method. This method produces a completely data-driven process in the regression function. Wang et al. [6] applied the non-parametric CUSUM test for detecting the traditional flooding-based Denial of Service (DoS) attacks which is essentially a change-point problem. This method makes the attack detection more robust , more applicable, and easier to deploy. In terms of variance change points problems, Kyong et al. [13] proposed a nonparametric and data-adaptive approach. In essence, the approach uses an artificial neural network (ANN) to treat variance change point problem as a pattern classification problem, and it includes the selection of training strategies and classification tools. For nonparametric change-point estimation from indirect noisy observations, Goldenshluger et al [14] established some theoretical results of estimating a peak from a short filtered data where the lengths are very short with two to five observations. For the uncensored and censored linear model, Wang et al. [15] studied a nonparametric change point statistical test based on the least absolute deviation.
In this paper, we propose a new algorithm for change point detection with mean shift. The basic idea is to use two symmetric sliding windows to compute the Area Under the receiver operating characteristic Curve (AUC) as the test statistic. The maximum or minimum values of this statistic are further utilized to itendify the location of the change point(s). This test statistic of AUC is well known to be equivalent to the Mann Whitney U statistic (MWUS). As early as 1970s, Pettitt [16] has employed MWUS to the problem of change-point detection. Following Pettitt's work, Hawkins and Deng [17] combined MWUS with an exponentially weighted moving average (EWMA) chart as a nonparametric and distribution-free tool to monitor the change in mean shift. Ross et al. [18], inspired by the method of Hawkins and Deng [17], proposed another technique for detecting a change point in the location and scale parameter of an arbitrary continuously distributed univariate data stream. Zou et al. [19] proposed a non-parametric control chart with the EWMA procedure for profile monitoring combined with local linear kernel smoothing. For similar work, the readers are referrd to [20,21].
The advantages of our approach over the above mentioned MWUS-based methods lie in the following aspects: (1) We do not need to know the exact distribution of the data. We only assume that the data follows some indpendent and identically distributed (i.i.d.) continuous distribution. In other words, our method is a completely distributed free method. (2) The strategy of reducing false alarm change points can enhance the robustness of the algorithm.
(3) We design the optimal window size ratio and size, so as to optimize the algorithm.
To demonstrate its advantages, we compare our method with two state-of-the-art algorithms, i.e., the classical Hawkins, Qiu, and Kang (HQK) [22] algorithm and Relative unconstrained Least-Squares Importance Fitting (RuLSIF) algorithm [23]. HQK method is essentially a likelihood ratio method, which detects the mean shift of a sequence based on two normally distributed samples with the same variance. In this method, the autoregressive model is adopted to fit, and the joint likelihood ratio is calculated based on the sequence residuals in two adjacent time windows and their combined windows. When the obtained statistics exceed a prescribed threshold, the boundary location of the window is defined as the location of the change point. This method does not need to know any parameters, thus avoiding the assumption of known parameters. The RuLSIF method achieves change point detection by directly estimating the probability density function ratio and using this ratio to calculate the mixed divergence of the two distributions to obtain the difference between the two sample distributions. This method overcomes the infinite problem caused by the unbounded limit of the probability density ratio estimated by unconstrained Least-Squares Importance Fitting (uLSIF) method [24]. It guarantees the boundedness of probability density ratio and has good robustness, but it has the disadvantage of high computational cost.
The rest of the paper is organized as follows. Section 2 presents the data model of an unknown number of change points with mean shifts. In Section 3, we introduce the change point detection algorithm based on AUC statistics. We also prove that the statistics are optimal when the two windows are symmetrical. We set the threshold which is related to window size to determine the potential change point. To reduce the false alarm rate, a key parameter K is set to distinguish and reduce the false alarm in potential change points. In addition, we establish the relationship between the window size and test power and the magnitude of the mean shift. In Section 4, we define the evaluation criteria of our algorithm and compare it to with the classical HQK algorithm and RuLSIF algorithm respectively in single change point detection. Moreover, we also apply the method to multi change point detection with a real data set. Finally, we draw our conclusion in Section 5.

Data Model of Abrupt Change with Mean Shift
In the change point detection framework, we consider a piecewise stationary and independent identically distributed (i.i.d.) sequence S = {S i } T i=1 with some change-points at some unknown moment t * j . Given that there are q change points (q is unknown), this stochastic process can be defined by where µ 0 is the mean value of S when it has no change points, j is the index of the change-points and ∆ j denote the mean shift from the change point time t * j . Function I(·) is an indicator, if i is in the set i t * j , then I(·) = 1, otherwise I(·) = 0. The symbol e i denotes the error term, which is an independent stationary stochastic process. In the signal domain, the error term can also represent the signal processing noise. Assumption 1. The error sequence e i is i.i.d, with expectation and variance being E (e i ) = 0 and Var (e i ) = σ 2 , respectively. It is also independent of the numbers and the location of change points, as well as the mean shift.

AUC Statistics for Change Point with Abrupt Mean Shift
In this section, we use AUC as the statistic to detect the change point. Specifically, the change point detection algorithm can be divided into two stages. In the first stage, the sample data is processed by two sliding windows. The statistic is indirectly obtained by calculating the data sequence in the windows according to the definition of AUC. It can be proved that our algorithm is optimal when the two adjacent windows have the same size. In the second stage for change point analysis, the method of hypothesis testing is used to set a threshold to determine the potential change point. To reduce false alarms, parameter K is used to remove the false alarm interval.

AUC Statistics from Symmetric Sliding Windows
We consider the data model in Equation (1) with Assumptions 1 and 2. The two-sided two-sample MWUS is used for change point detection. Let X = {S i } k i=k−m and Y = S j k+n j=k+1 are derived from a continuous i.i.d. samples S T . The statistic for potential change points can be expressed as follows: where I (·) is the indicator function defined in (1), and T − n > k > m. As pointed out by Hanley et al. [25],θ k is equal to AUC that represents the probability of X > Y. This equivalence relationship enables us to call the change point statistics as AUC. In the following, we simply writeθ k asθ for notational convenience. Assume that the positive samples X and negative samples Y have two different distributions, denoted by F X and F Y , respectively. The test statistic defined in (2) is actually a measure of the difference between F X and F Y . Figure 1 below illustrates such property of the test statistic. Assume that the first half of the data are negative samples and the second half are positive samples. As shown in the first column of Figure 1, if there is no mean shift in the data, the corresponding test statistic isθ = P(X > Y) = 0.5, which means F X = F Y , i.e., this data exist only one distribution. However, if F X = F Y , which means that the data has two different distributions, the test statistic will significantly deviates from 0.5. As shown in the second column of Figure 1, when the test statistics are significantly greater than 0.5, it represents the data has an upward mean shift. On the other hand, as shown in the third column of Figure 1, when the test statistic is significantly less than 0.5, it means that the data has a downward mean shift.  Probabilistic meaning of the test statistic. From left to right, the first column is the data without mean shift and its correspondingθ = 0.5012; the second column is the data with upward mean shift and its correspondingθ = 0.8792; the third column is the data with the downward mean shift and its correspondingθ = 0.0384.
In order to detect the change point(s) from a data sequence S T , we designed two sliding windows, namely, the positive window X (right) of size m, and negative window Y (left) of size n. That is, For each pair of the windows, we can calculate a statistic based on (2). By gradually sliding the two windows along the data, we will obtain a sequence of test statistics. From Lemma 1, we know that when F X = F Y , i.e., there is no mean shift, the expectation ofθ is E θ = 0.5. As shown in Figure 2, the test statisticθ fluctuates around 0.5 before the change point at τ entering in the two sliding windows X and Y. However, when the change point begins to enter the positive slide window X,θ begins to increase. When the change point enters the negative slide window Y,θ begins to decrease. Andθ returns to around 0.5 when the change point leaves the two sliding windows. The resultant sequence ofθ thus contains the location information of the change point(s). In the above procedure of calculatingθ, two problems need to be addressed. The first one is: when the length sum m + n is fixed, under what relationship of m and n will the variance ofθ is minimized. The second problem is: how to select the suitable window sizes. We will address the two problems in Theorem 1 below and in Section 3.3, respectively.
be two i.i.d. samples drawn from two unknown continuous distributions with cumulative distribution functions (cdf) F X (x) and F Y (y), respectively. Then the mean and variance ofθ in Equation (2) are: with X and Y being the i.i.d copies of X and Y, respectively. Moreover, in the case of null hypothesis, that is, F X (·) = F Y (·), the expectation and variance ofθ satisfy, respectively, Theorem 1. Assume that the two adjacent sliding windows contain two i.i.d. data X = {X i } m i=1 and Y = Y j n j=1 , with continuous cdfs F X (x) and F Y (y), respectively. Assume further that F X (x) and F Y (y) are symmetric. Let N = m + n be a fixed integer. Define the segmentation factor as λ = m/N. If there is a mean shift ∆ between X and Y, i.e., F Y (x) = F X (x + ∆), then we have (ii) For all of the ∆, the variance ofθ based on X and Y reaches its minimum when m = n.

Proof.
(i). Denote by f x and f y the probability density functions (pdfs) of X and Y, respectively. Since (ii). Given the assumption that N = m + n is fixed, it follows from (4) that which, along with the result of the above equation, leads to Therefore, we have which shows that the variance of θ depends only on the length segmentation factor λ. It thus follows readily that V(θ) is minimized when λ = 0.5, or equivalently, m = n.

Change Point Detection with AUC Statistics
In this section, we use hypothesis testing method to determine whether there is any change point in the signal. We also develop a search strategy to identify the location of the change point. The objective of the search strategy is to reduce the false alarm rate. Before going through the detailed steps, we introduce Theorem 2 as follows.
and Y = Y j n j=1 be i.i.d. random variables having continuous cdfs F X (x) and F Y (y), respectively. Assume that m n, and m/n → λ when m, n → ∞ . Then under the null hypothesis, i.e., F X (·) = F Y (·), the test statistic in Equation (2)  .
From Lemma 1, we know that when the null hypothesis holds true, the expectation and variance of theθ equal E θ = 0.5 and V θ = (m + n + 1) /12mn, respectively. Theorem 2 actually means that when m, n → ∞ , W = 1 V(θ) θ − 0.5 converges in distribution to the standard Normal distribution, namely: Note that, to satisfy the requirements of the Theorem 2, literature [28] points out that the sample sizes must be large enough. Sepcifically, max(m, n) 30 and m + n 40.
After selecting the suitable window sizes m and n statisfying the requirements in Theorem 2, we can move the two sliding windows forward one data point at a time and calculate the two-sample test statistics. For each window pair, if the test statistic is higher than the threshold determined by a particular significance level, it is considered as a potential change point. Otherwise, the data point is not considered a change point. Mathematically, given the significance level α, the test statistics for potential change points satisfy: where z 1− α 2 is the percentile 1 − α/2 of standard Normal distribution. When the length of two windows is selected to be m = n = L 0 , the range of potential change-points can be defined aŝ We define two decision threshold as thr1 and thr2 from Equation (16). It is obvious that thr2 = 1 − thr1. As shown in Figure 3, the upper and lower thresholds are set for the value of θ to determine whether any potential change point occurs. From the formula thr1 = 0.5 + z 1− α 2 1 6L 0 , we can see that it is related to the significance level and window length. Generally, significance level is set as α = 0.05. After the significance level is fixed, the threshold value decreases with the increase of the window size. We discuss the problem of window size in Section 3.3.  After determining the decision thresholds, we need the necessary search strategy to determine the location of change point. If the change point is located at the local extrema value ofθ in the data points that continuously exceeds the threshold, it may lead to false alarms. To reduce the chance of false alarm, we define a range parameter K. As shown in Figure 4, if the number ofθ exceeding the threshold is less than K, the point corresponding to the local maximum ofθ is not considered as a change point. On the other hand, if the number ofθ exceeding the threshold is larger than K, the point corresponding to the local maximum ofθ is considered as a change point. As for the selection of K, we will discuss it in Section 3.3. After reducing the false alarm rate, the direction of the change point is determined as follows. If the ultimate extreme value obtained is a local maximum, the data changes upwards; If the ultimate extreme value obtained is a local minimum, the data changes downwards. For example, we define the local maximum of θ to be: max θ l n = max θ j : θ j > thr1, j = n, n + 1 · · · n + l − 1, l > K The location of the change point is thus: In a parallel manner, we can also define the location of change points whose directions are downward. Figure 4. Demonstration of data points that continuously exceed the threshold value of K.

The Window Size and K
Theorem 3. Assume that the two adjacent sliding windows contain two i.i.d. data X = {X i } m i=1 and Y = Y j n j=1 , with continuous cdfs F X (x) and F Y (y), and pdfs f X (x) and f Y (y), respectively. Let N = m + n be fixed. Define the segmentation factor as λ = m/N . If there is a mean shift ∆ between X and Y, that is, F Y (x) = F X (x + ∆), the power of test against ∆ = 0 is approximately at level α. Moreover, if the power β has been determined, the sample size required for the test can be estimated as Theorem 3 shows that when α, β and segmentation factor λ are prescribed, the size of the window is related to the mean shift and the distribution under the null hypothesis. That is, as long as the null distribution and the mean shift are determined, the required sample size can be obtained. For example, the null distribution is the standard Normal distribution. When it has a mean shift ∆ = 1.0 , then τ = π 3 . In this case, if α = 0.05 , β = 0.95, and the two windows are symmetric with λ = 0.5, the corresponding sliding window size is at least m = n = 22. If only the mean shift is changed to ∆ = 0.5 while other conditions remain unchanged, the sliding window size is at least m = n = 88. It should be noted that this theorem has been proved by Lehmann. See [29] for the detailed argument. In addition, when the distribution of null hypothesis is unknown, the sample sizes dependon on τ, which can be estimated by the technique in [30].
According to Equation (19), the test power is affected by both the window sizes and mean shift. We use Monte Carlo method to simulate the stand Normal distribution for 1000 times, so as to observe the influence of window sizes on the accuracy of change point detection under different mean shifts. In this simulation study, the data sequence contains T = 1000 data points, and the change point is at the location of t * = 500. As shown in Figure 5, in the case of a single change point, the accuracy of detection increases with the increase of window sizes within a certain range. Furthermore, when the window size is fixed, it is observed in Figure 5 that, the smaller the shift ∆ is, the more likely the data is to be interfered by noise, and the more difficult the change point is to be detected. On the other hand, the larger the shift is, the higher the accuracy of the algorithm is. When the shift increases to a certain value, the accuracy of the alogrithm tends to 100%. However, in the case of multiple change points detection, the size of the sliding windows cannot be too large. When the size of the windows is larger than the interval between two neighboring change points, the detection results will become unacceptable. Therefore, care should be taken when selecting the size of the sliding windows in practice. Another problem is the choice of K. Cunen et al. [31] proposed some methods for the confidence interval distribution of change points. Inspired by this method, we take into account he interval length distribution of test statistics beyond the set threshold when selecting the magnitude of K. We define θ 0,j and θ 1,j as the obtained test sequences in the cases of change-point absence and change-point presence, respectively. In the former case (null case), the longest interval beyond the threshold is defined as K max 0 ; while in the latter case (there is an upward mean shift), the interval where the truce change point t * locates is defined K * 1 . They are expressed respectively as: K * 1 = l : t * ∈ [n, n + l − 1] ∩ θ l 1,n > thr1, θ l 1,n = θ 1,j , j = n, n + 1, · · · n + l − 1 (22) Figure 6 illustrates the idea of how to choose a suitable K for Normal distribution, Log-normal distribution and Cauchy distribution. The window size is set to be m = n = 50, and the mean shifts are set to be 0.75, 1.0, 2.0, respectively. From the distributions of K max 0 and K * 1 shown in Figure 6, it is observed that, (1) when shi f t = 0.75, the optimal K corresponding to the Normal and Log-normal distribution is within [20,25], while the selection of K for the Cauchy distribution is meaningless; (2) when shi f t = 1.0, the optimal K corresponding to the Normal and Log-normal distribution is within [30,35], while the selection of K for the Cauchy distribution is [25,30]; (3) when shi f t = 2.0, the optimal K corresponding to the Normal and Log-normal distribution is within [40,45], while the selection of K for the Cauchy distribution is [25 30]; (4) if the value of the shift continues to increase, the optimal K of the three distributions will be all stable at 50. Based on the above observations, we suggest that the selection interval of K is [15,50] when the symmetrical window size is 50.

Comparative Studies and Analysis
In this section, we evaluate the performance of our algorithm under various scenarios based on the MATLAB platform. We first compare our method with the classical algorithms by Monte Carlo simulations with respect to three distributions in the scenario of single change point detection. We also apply our method to the case of multiple change points detection based on artificial and real data sets.

Comparative Results for Single Change Point Detection
In order to evaluate performance, we employ the index of accuracy as a figure of merit. Specifically, if the detected change point is sufficiently close to a real change point, i.e.,t * ∈ [t * − 20, t * + 20], where t * is the true location of change point, the point att * is considered to be a correct detection. According to the above rule, the accuracy is computed by where w is the number of trials, and N correct is the number of correct detections.
In the following simulation studies, the data length T in Equation (1) is set to be T = 1000. The abrupt change location is set at t * = 500, the mean shifts at t * = 500 are ∆ = [0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00], respectively. The sliding window size was set to be m = n = 50. We use a constant signal with µ 0 = 0 and corrupt it with three types of i.i.d. noise. The models of noise are: Given the above conditions, the data used in single change point detection can thus be expressed as S(t) = ∆ * I(500 t 1000) + e(t), t = 1, 2, ..., 1000 Figure 7 shows the comparative results of our method with HQK [22] and RuLSIF [23], where the number of trials w = 1000. It is observed that, (1) under Model 1, i.e., the noise is Normal, our method performs only sightly worst than HQK and RuLSIF (Figure 7a,b); (2) under Model 2, i.e., the noise is Log-normal, our method outperforms HQK outperforms RuLSIF, especially for ∆ < 0.75 (Figure 7c,d); (3) under Model 3, i.e., the noise is Cauchy, the performance of our method is much better than HQK and RuLSIF, manifesting the robustness of our method against noise with strong spiky patterns (Figure 7e,f). For the detailed detection accuracis, see Table A1.

Multiple Change Points Detection
Besides the ability of detecting single change point, our method is also capable of detecting multiple change points. Similar to the previous settings, we also use a sequence with 1000 data points.
The locations of the change points are t * 1 = 200, t * 2 = 400, t * 3 = 600, t * 4 = 800, with corresponding mean shifts being [∆ 1 , ∆ 2 , ∆ 3 , ∆ 4 ] = [−1,1,1, − 1]. The window size is m = n = 50. The maximum of 20 or more data points that continuously exceed the threshold (K = 20) is considered as the location estimates of the change points. The detection results with respect to Model 1 to Model 3 are depicted in Figures 8-10, respectively. Figure 8 depicts the detection precedure under Model 1, where the noise follows the Normal distribution. The top panel shows the data with red lines indicating the real locations of the change points. In the middle panel are the local extrema without the process of false alarm rejection. It can be seen that 10 change points are detected, 6 of which are false alarms. The bottom panel shows finally detected change points after the false-alarm rejection procedure. The final detected change points are located at 199, 400, 599, 800, respectively, very close to the corresponding true locations at t * 1 = 200, t * 2 = 400, t * 3 = 600, t * 4 = 800. Figure 9 depicts the detection precedure under Model 2, where the noise follows the Log normal distribution. Similar to that in Figure 8, The final detected change points are located at 200, 400, 602, 805, respectively, very close to the corresponding true locations at t * 1 = 200, t * 2 = 400, t * 3 = 600, t * 4 = 800. Figure 0  50  100  150  200  250  300  350  400  450  500  550  600  650  700  750  800  850  900

An Application to Real Data Set
To gain further insight of our method, we apply it to detecting multiple change points based on a real genetric data set [32] ( http://microarrays.curie.fr/publications/oncologie_moleculaire/ bladder_TCM/). It is a sample of 57 patients with bladder cancer and contains the log2ratio, GNL status (Gain/−1, normal/0 or loss/1) and outlier status of all 2385 BAC clones. It should be noted that these data are the result of normalization of the original data on a logarithmic scale.
We need to process the missing values in the data set before applying our method. Following the work in [33], we remove all sequences with missing values greater than 7%. After screening, we retain the genetic data of 41 samples and set the missing value of these samples to zero. We randomly select the 40th sample as an example in this study. Figure 11 (K = 30) shows the procedure as well as the results of our algorithm: (a) the pre-processed genetic data; (b) demo of the sliding windows (m = n = 50); (c) the test statistics and decision thresholds; (d) the detected local extrema with false alarms; (e) the final change points obtained after false alarm rejection; and (f) the the original data marked with the location of the detected change points. As can be seen, our method detects 16 change points, among which nine mutate upward and seven mutate downward. These 16 change points are not missing, but there are two change points that are false alarms and are not technically change points. They are located near 800 and 1600 respectively.

Conclusions
In this paper, we propose an AUC-based algorithm for the detection of change point(s) with mean shift. Theoretical derivations and numerical results showed that our method has the following advantages, including (1) it is a non-parametric method possessing the distribution free property, (2) it performs comparably well with two state-of-the-art methods when the noise is Normal, (3) it performs better than the other two methods when the noise is impulsive (Log-normal and Cauchy distributions), (4) it can be applied to detecting of multiple change points, and (5) it is capable of detecting the direction of mean shift. Having the above advantages, the proposed method can play a complementary role to other methods in the literature, especially when the noise distribution, deviating significantly from Normal, exhibits an impulsive manner (such as Cauchy noise).

Abbreviations
The following abbreviations are used in this manuscript: