One Dimensional Discrete Scan Statistics for Dependent Models and Some Related Problems

Abstract: The one dimensional discrete scan statistic is considered over sequences of random variables generated by block factor dependence models. Viewed as the maximum of a 1-dependent stationary sequence, the distribution of the scan statistic is approximated accurately and sharp error bounds are provided. The longest increasing run statistic is related to the scan statistic and its distribution is studied. The moving average process is a particular case of a block factor, and the distribution of the associated scan statistic is approximated. Numerical results are presented.


Introduction
There are many situations in which an investigator observes an accumulation of events of interest and wants to decide whether such a realisation is due to chance or not. These problems belong to the class of cluster detection problems, where the basic idea is to identify regions that are unexpected or anomalous with respect to the distribution of events. Depending on the application domain, these anomalous agglomerations of events can correspond to a diversity of phenomena: for example, one may want to find clusters of stars, deposits of precious metals, outbreaks of disease, minefields, defective batches of pieces and many other possibilities. If such an observed accumulation of events exceeds a preassigned threshold, usually determined from a specified significance level corresponding to a normal situation (the null hypothesis), then it is legitimate to say that we have an unexpected cluster and proper measures have to be taken accordingly.
Searching for unusual clusters of events is of great importance in many scientific and technological fields, including DNA sequence analysis ([1,2]), brain imaging ([3]), target detection in sensor networks ([4,5]), astronomy ([6,7]), reliability theory and quality control ([8]), among many other domains. One of the tools used by practitioners to decide on the unusualness of such an agglomeration of events is the scan statistic. Basically, tests based on scan statistics look for events that are clustered amongst a background of those that are sporadic.
Let $2 \le m \le T$ be two positive integers and $X_1, \dots, X_T$ be a sequence of independent and identically distributed random variables with common distribution $F_0$. The one dimensional discrete scan statistic is defined as
$$S_m(T) = \max_{1 \le i \le T-m+1} W_i, \tag{1}$$
where the random variables $W_i$ are the moving sums of length $m$ given by
$$W_i = \sum_{j=i}^{i+m-1} X_j. \tag{2}$$
Usually, statistical tests based on the one dimensional discrete scan statistic are employed when one wants to detect a local change in the signal within a sequence of $T$ observations by testing the null hypothesis of uniformity, $H_0$, against a cluster alternative, $H_1$ (see References [9,10]). Under $H_0$, the random observations $X_1, \dots, X_T$ are i.i.d. with distribution $F_0$, while under the alternative hypothesis there exists a location $1 \le i_0 \le T-m+1$ such that the $X_i$, $i \in \{i_0, \dots, i_0+m-1\}$, are distributed according to $F_1 \ne F_0$ and outside this region the $X_i$ are distributed as $F_0$.
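The definition of the moving sums and of the scan statistic can be illustrated with a minimal Python sketch. The function names `moving_sums` and `scan_statistic` are hypothetical, chosen for this illustration only, and the binary sequence is an arbitrary toy input.

```python
from typing import List, Sequence

def moving_sums(x: Sequence[float], m: int) -> List[float]:
    """Return W_1, ..., W_{T-m+1}, where W_i = X_i + ... + X_{i+m-1}."""
    if not 1 <= m <= len(x):
        raise ValueError("window length m must satisfy 1 <= m <= T")
    w = [sum(x[:m])]
    for i in range(1, len(x) - m + 1):
        # Slide the window one step: drop x[i-1], add x[i+m-1].
        w.append(w[-1] - x[i - 1] + x[i + m - 1])
    return w

def scan_statistic(x: Sequence[float], m: int) -> float:
    """Return S_m(T), the maximum of the moving sums of length m."""
    return max(moving_sums(x, m))

x = [0, 1, 0, 1, 1, 1, 0, 0, 1, 0]   # toy sequence, T = 10
print(moving_sums(x, 3))             # [1, 2, 2, 3, 2, 1, 1, 1]
print(scan_statistic(x, 3))          # 3
```

The sliding update keeps the cost linear in $T$ instead of recomputing each window from scratch.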
We observe that whenever $S_m(T)$ exceeds the threshold $\tau$, where the value of $\tau$ is computed from the relation $P_{H_0}(S_m(T) \ge \tau) = \alpha$ and $\alpha$ is a preassigned significance level of the testing procedure, the generalized likelihood ratio test rejects the null hypothesis in favor of the clustering alternative (see Reference [9]). It is interesting to note that most of the research has been done for $F_0$ being a binomial, Poisson or normal distribution (see References [9-13]). More recently, Reference [14] proposed a testing procedure based on the one dimensional scan statistic for geometric and negative binomial distributions.
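As a rough illustration of how a threshold $\tau$ could be obtained in practice, the sketch below estimates it by naive Monte Carlo simulation under $H_0$. Several choices here are assumptions made only for the example: $F_0$ is taken to be Bernoulli($p$), the helper name `mc_threshold` is hypothetical, and, because the scan statistic is discrete, the sketch returns the smallest $\tau$ whose estimated tail probability does not exceed $\alpha$ rather than solving the equality exactly.

```python
import random
from typing import List, Tuple

def scan_statistic(x: List[int], m: int) -> int:
    """Maximum moving sum of length m, computed with a sliding window."""
    w = sum(x[:m])
    best = w
    for i in range(m, len(x)):
        w += x[i] - x[i - m]
        best = max(best, w)
    return best

def mc_threshold(T: int, m: int, p: float, alpha: float,
                 n_iter: int = 2000, seed: int = 0) -> Tuple[int, float]:
    """Smallest tau with estimated P_H0(S_m(T) >= tau) <= alpha, and that estimate."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_iter):
        x = [1 if rng.random() < p else 0 for _ in range(T)]  # H0: i.i.d. Bernoulli(p)
        sims.append(scan_statistic(x, m))
    for tau in range(m + 2):
        tail = sum(s >= tau for s in sims) / n_iter
        if tail <= alpha:
            return tau, tail
    return m + 1, 0.0  # not reached for alpha >= 0, since S_m(T) <= m

tau, tail = mc_threshold(T=200, m=10, p=0.1, alpha=0.05)
print(tau, tail)
```

For small $\alpha$ a naive estimate of this kind needs many replications, which is one motivation for the approximations discussed in the text.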
There are three main approaches used for investigating the exact distribution of the one dimensional discrete scan statistic: the combinatorial methods ([12,15]), the Markov chain imbedding technique ([16,17]) and the conditional probability generating function method ([18,19]). Due to the high complexity and the limited range of application of the exact formulas, a considerable number of approximations and bounds have been developed for estimating the distribution of the one dimensional discrete scan statistic, for example, References [9,12,13,20]. A full treatment of these results is presented in References [10,11].
Even if in general the $X_i$'s are supposed to be i.i.d., there are applications, such as detecting similarities between DNA sequences, where the $X_i$'s are not independent ([21]). In order to evaluate the effect of dependence, the alternative model is in many cases a Markov chain, whose whole dependence structure depends only on the joint distribution of two consecutive random variables.
In this work we introduce dependence models based on block-factors obtained from i.i.d. sequences in the context of the one dimensional discrete scan statistic. We derive approximations and their corresponding errors, with applications to the longest increasing run distribution and the moving average process.
The paper is structured as follows. In Section 2 we introduce the block factor model and present the approximation technique for the distribution of the scan statistic under this model. As particular block factor models, the length of the longest increasing run in a sequence of i.i.d. real random variables and the moving average process are related to the scan statistic distribution in Section 3. Numerical results based on simulations illustrate the accuracy of the approximation. Concluding remarks end the paper.

One Dimensional Scan Statistics for Block-Factor Dependence Model
Most of the research devoted to the one dimensional discrete scan statistic considers the independent and identically distributed model for the random variables that generate the sequence which is to be scanned. In this section, we define a dependence structure for the underlying random sequence based on a block-factor type model.

The Block-Factor Dependence Model
Let us recall (see also Reference [22]) that the sequence $\{X_i\}_{i \ge 1}$ of random variables with state space $S_W$ is said to be a $k$ block-factor of the sequence $\{Y_i\}_{i \ge 1}$ if there exists a measurable function $f$ of $k$ arguments, with values in $S_W$, such that
$$X_i = f(Y_i, Y_{i+1}, \dots, Y_{i+k-1}) \tag{3}$$
for all $i \ge 1$. Figure 1 presents the sequence $\{X_i\}_{i=1,\dots,T}$ of length $T$ obtained as a $k$ block-factor from a sequence $\{Y_i\}_{i=1,\dots,\tilde T}$ of length $\tilde T = T + k - 1$.
As an example of a block-factor model, in Reference [23] the authors consider an i.i.d. sequence $\{Y_n\}_{n \ge 1}$ of standard normal random variables and a 2 block-factor of it, that is, a sequence of the form $X_i = f(Y_i, Y_{i+1})$. Due to the overlapping structure of $X_i$ and $X_{i+1}$, they obtain a Gaussian stationary process $\{X_i\}_{i \ge 1}$ with some correlation structure, for which the scan statistic distribution is studied.
More generally, let us observe that if a sequence $\{X_i\}_{i \ge 1}$ of random variables is a $k$ block-factor of an i.i.d. sequence, then it is $(k-1)$-dependent. Recall that a sequence $\{X_i\}_{i \ge 1}$ is $m$-dependent with $m \ge 1$ (see Reference [22]) if, for any $h \ge 1$, the $\sigma$-fields generated by $\{X_1, \dots, X_h\}$ and $\{X_{h+m+1}, \dots\}$ are independent.
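The block-factor construction and the reason for $(k-1)$-dependence can be sketched in a few lines of Python. The helper names `block_factor` and `inputs_of`, as well as the particular function $f(a, b, c) = a + bc$, are hypothetical and serve only as an illustration: two variables $X_i$ and $X_j$ with $|i - j| \ge k$ are built from disjoint sets of the underlying i.i.d. variables, hence are independent.

```python
import random
from typing import Callable, List, Sequence, Set

def block_factor(y: Sequence[float], f: Callable[..., float], k: int) -> List[float]:
    """Return X_i = f(Y_i, ..., Y_{i+k-1}) for i = 1, ..., len(y) - k + 1."""
    if not 1 <= k <= len(y):
        raise ValueError("need 1 <= k <= len(y)")
    return [f(*y[i:i + k]) for i in range(len(y) - k + 1)]

def inputs_of(i: int, k: int) -> Set[int]:
    """0-based indices of the Y's on which X_i (0-based) depends."""
    return set(range(i, i + k))

rng = random.Random(1)
k = 3
y = [rng.gauss(0.0, 1.0) for _ in range(12)]        # underlying i.i.d. sequence
x = block_factor(y, lambda a, b, c: a + b * c, k)   # length 12 - k + 1 = 10

print(len(x))                              # 10
print(inputs_of(0, k) & inputs_of(2, k))   # {2}: shared input, so possible dependence
print(inputs_of(0, k) & inputs_of(3, k))   # set(): no shared input at distance k
```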

Scan Statistics Viewed as Maximum of 1-Dependent Sequence
Let $\{X_1, \dots, X_T\}$ be a $k$ block-factor of the i.i.d. sequence $\{Y_1, \dots, Y_{\tilde T}\}$, where $\tilde T \ge k$ and $T = \tilde T - k + 1$, and let $S_m(T)$ be the scan statistic associated to the sequence $\{X_i\}_{i=1,\dots,T}$ as defined in (1) for some scanning window of length $m$, $1 \le m \le T$. With $L$ such that $T = (L-1)(m+k-2) + m - 1$, put, for $j \in \{1, \dots, L-1\}$,
$$Z_j = \max_{(j-1)(m+k-2)+1 \le i \le j(m+k-2)} W_i, \tag{4}$$
where $W_i = \sum_{s=i}^{i+m-1} X_s$. That is, for each $j \in \{1, \dots, L-1\}$, $Z_j$ is the scan statistic associated to the sequence of length $2m+k-3$, $\{X_{(j-1)(m+k-2)+1}, \dots, X_{j(m+k-2)+m-1}\}$. An illustration of the construction of the variables $Z_j$ is presented in Figure 2 for $L = 5$ and $k = 1$.
Then $\{Z_j\}_{j=1,\dots,L-1}$ is 1-dependent and stationary, and we have that
$$S_m(T) = \max_{1 \le j \le L-1} Z_j. \tag{5}$$
Thus, for any block factor model obtained from an i.i.d. sequence, the distribution of the associated scan statistic is the distribution of the maximum of some 1-dependent stationary sequence.
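The decomposition of the scan statistic into block maxima can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the sequence length is taken to satisfy $T = (L-1)(m+k-2) + m - 1$ exactly, the block factor $f(a, b, c) = a + bc$ applied to Bernoulli inputs is an arbitrary choice, and the helper name `block_maxima` is hypothetical.

```python
import random
from typing import List, Sequence

def moving_sums(x: Sequence[float], m: int) -> List[float]:
    """W_i = X_i + ... + X_{i+m-1}, i = 1, ..., T-m+1 (stored 0-based)."""
    return [sum(x[i:i + m]) for i in range(len(x) - m + 1)]

def block_maxima(x: Sequence[float], m: int, k: int) -> List[float]:
    """Z_j = max of W_i over the j-th block of m+k-2 consecutive window positions."""
    step = m + k - 2
    w = moving_sums(x, m)
    n_blocks = len(w) // step          # equals L-1 when T = (L-1)*step + m - 1
    return [max(w[j * step:(j + 1) * step]) for j in range(n_blocks)]

m, k, L = 4, 3, 6
step = m + k - 2
T_tilde = L * step                     # length of the underlying i.i.d. sequence
T = T_tilde - k + 1                    # = (L-1)*step + m - 1

rng = random.Random(5)
y = [1 if rng.random() < 0.3 else 0 for _ in range(T_tilde)]
x = [y[i] + y[i + 1] * y[i + 2] for i in range(T)]   # a 3 block-factor of y

z = block_maxima(x, m, k)
print(len(z), L - 1)                       # both equal 5
print(max(z) == max(moving_sums(x, m)))    # True: S_m(T) is the maximum of the Z_j
```

Each $Z_j$ depends only on $Y_{(j-1)(m+k-2)+1}, \dots, Y_{(j+1)(m+k-2)}$, so $Z_j$ and $Z_{j+2}$ share no underlying variable, which is the source of the 1-dependence.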

Approximation
In References [24,25] the authors extended the approximation results obtained in Reference [26] for the distribution of the maximum of a 1-dependent stationary sequence. The main result is stated in the following theorem.
Theorem 1. Let $\{Z_j\}_{j \ge 1}$ be a strictly stationary 1-dependent sequence of random variables and, for a real number $x$, let
$$q_m = q_m(x) = P(\max(Z_1, \dots, Z_m) \le x). \tag{6}$$
For all $x$ such that $q_1(x) \ge 1 - \alpha \ge 0.9$, the approximation formula (7) holds for $q_m$, with an error bound expressed through two functions $K$ and $\Gamma$ of $\alpha$, where $\eta = 1 + l\alpha$ with $l = l(\alpha) = t_2^3(\alpha) + \varepsilon$, for arbitrarily small $\varepsilon > 0$, and $t_2(\alpha)$ is the second root in magnitude of an auxiliary equation depending on $\alpha$.
The evaluation of the functions $K$ and $\Gamma$ for some selected values of $\alpha$ is presented in Table 1. These values allow one to compute directly the error bound of the approximation in (7).
Applying Theorem 1 to the sequence $\{Z_j\}_{j=1,\dots,L-1}$ defined in (4), from (5) we obtain an approximation and the associated error bound for $P(S(m,T) \le x)$ in the following way. For $s \in \{2, 3\}$, let $Q_s = Q_s(x)$ be the value at $x$ of the distribution function of the scan statistic over the first $s(m+k-2)-k+1$ variables $X_i$, and observe that, using the notation of (6), we have $Q_s = q_{s-1}$. For $x$ such that $Q_2(x) \ge 1 - \alpha \ge 0.9$, we apply Theorem 1 to obtain an approximation of $P(S(m,T) \le x)$ in terms of $Q_2$ and $Q_3$, together with the corresponding error bound from (7) with $L - 1$ terms.
Observe that $Q_2$ and $Q_3$ represent the distributions of the scan statistics over sequences of lengths $2m+k-3$ and $3m+2k-5$, respectively. $Q_2$ and $Q_3$ are generally estimated by Monte Carlo simulation. Thus, if $\hat Q_s$ is an estimate of $Q_s$, $s \in \{2, 3\}$, with $|\hat Q_s - Q_s| \le \beta_s$, the approximation (14) is obtained by using $\hat Q_2$ and $\hat Q_3$ in place of $Q_2$ and $Q_3$, and $E_{total}$ denotes the total error of the approximation, which accounts for both the error of Theorem 1 and the simulation errors $\beta_2$ and $\beta_3$.
One of the main advantages of this approximation method with respect to the product-type approximation proposed in Reference [12], which uses the same quantities $Q_2$ and $Q_3$, is that it provides sharp error bounds for the approximation.
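To make the role of $Q_2$ and $Q_3$ concrete, here is a minimal Python sketch. It assumes that the approximation takes the form commonly associated with Haiman-type results for 1-dependent stationary sequences, $q_n \approx (2q_1 - q_2) / [1 + q_1 - q_2 + 2(q_1 - q_2)^2]^n$; this expression is an assumption of the example and should be checked against References [24-26] before any use. The function names are hypothetical, and the error-bound functions $K$ and $\Gamma$ are not implemented.

```python
def one_dependent_max_approx(q1: float, q2: float, n: int) -> float:
    """Assumed Haiman-type approximation of q_n = P(max(Z_1, ..., Z_n) <= x).

    q1 = P(Z_1 <= x), q2 = P(max(Z_1, Z_2) <= x). The functional form is an
    assumption of this sketch, not a formula taken from the text.
    """
    return (2.0 * q1 - q2) / (1.0 + q1 - q2 + 2.0 * (q1 - q2) ** 2) ** n

def scan_approx(Q2: float, Q3: float, L: int) -> float:
    """Approximate P(S(m, T) <= x) from Q2 = q_1 and Q3 = q_2, using L - 1 blocks."""
    return one_dependent_max_approx(Q2, Q3, L - 1)

# Sanity check on an i.i.d. sequence (a trivially 1-dependent case), where
# q_1 = p, q_2 = p**2 and the exact value is q_n = p**n.
p, n = 0.99, 50
approx = one_dependent_max_approx(p, p ** 2, n)
exact = p ** n
print(approx, exact, abs(approx - exact))
```

In applications $Q_2$ and $Q_3$ would be replaced by their simulated estimates, so the simulation error has to be added to the theoretical one.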

Some Related Problems to the Scan Statistics under Block-Factor Dependence Model
In order to illustrate the efficiency of the approximation (14) and the obtained error bounds, in this section we present two examples of statistics related to discrete scan statistics.

Length of the Longest Increasing Run in an i.i.d. Sequence
Let $Y_1, Y_2, \dots, Y_{\tilde T}$ be a sequence of length $\tilde T$, $\tilde T \ge 1$, of independent and identically distributed random variables with common distribution $F$. We say that the subsequence $(Y_k, \dots, Y_{k+l-1})$ forms an increasing run (or ascending run) of length $l \ge 1$, starting at position $k \ge 1$, if it verifies the relation
$$Y_k < Y_{k+1} < \dots < Y_{k+l-1}.$$
We denote the length of the longest increasing run among the first $\tilde T$ random variables by $M_{\tilde T}$. This run statistic plays an important role in many applications in fields such as computer science, reliability theory or quality control. The asymptotic behaviour of $M_{\tilde T}$ has been investigated by several authors, depending on the common distribution $F$. In the case of a continuous distribution, Reference [27] (see also Reference [28]) has shown that this behaviour does not depend on the common law. For the particular setting of uniform $U([0,1])$ random variables, this problem was addressed by References [29-31]. Under the assumption that the distribution $F$ is discrete, the limit behaviour of $M_{\tilde T}$ depends strongly on the common law $F$, as Reference [32] (see also References [33,34]) proved for the case of the geometric and Poisson distributions. In Reference [35], the case of the discrete uniform distribution is investigated, while in Reference [36] the authors study the asymptotic distribution of $M_{\tilde T}$ when the variables are uniformly distributed but not independent.
In this section, we evaluate the distribution of the length of the longest increasing run using the methodology developed in Section 2. The idea is to express the distribution of the random variable $M_{\tilde T}$ in terms of the distribution of the scan statistic.
Let $T = \tilde T - 1$ and define the block-factor transformation $f : \mathbb{R}^2 \to \mathbb{R}$ by
$$f(x, y) = \begin{cases} 1, & \text{if } x < y, \\ 0, & \text{otherwise.} \end{cases}$$
Then our block-factor model becomes $X_i = f(Y_i, Y_{i+1})$, $1 \le i \le T$, and $X_1, \dots, X_T$ form a 1-dependent and stationary sequence of random variables. Notice that the distribution of $M_{\tilde T}$ and the distribution of the length of the longest run of ones, $L_T$, among the first $T$ binary random variables $X_i$ are related: an increasing run of length $l$ corresponds to $l - 1$ consecutive ones, so that
$$P(M_{\tilde T} \le m) = P(L_T < m), \quad m \ge 1. \tag{19}$$
The statistic $L_T$ is also known as the length of the longest success run or head run and has been extensively studied in the literature. One can consult the monographs of References [16,17] for applications and further results concerning this statistic. Moreover, the random variable $L_T$ can be interpreted as a particular case of the scan statistic: a window of length $m$ sums to $m$ exactly when it contains only ones, and between the two we have the relation
$$P(L_T < m) = P(S_m(T) < m). \tag{20}$$
Hence, combining (19) and (20), we can express the distribution of the length of the longest increasing run as
$$P(M_{\tilde T} \le m) = P(S_m(T) < m).$$
Thus, we can estimate the distribution of $M_{\tilde T}$ using the foregoing identity and the approximations developed in Section 2 for the discrete scan statistic.
We should also note that in Reference [30] the authors studied the asymptotic behaviour of $L_T$ over a sequence of $m$-dependent binary random variables. They showed that, given a stationary $m$-dependent sequence $\{V_k\}_{k \ge 1}$ of random variables with values 0 and 1, under a condition involving positive constants $t$ and $C$, the distribution of the longest run of ones admits the asymptotic approximation (23), expressed in terms of $r(k) = P(V_1 = \dots = V_k = 1) - P(V_1 = \dots = V_{k+1} = 1)$ and $h = \sup\{mt, 1\}$.
In order to illustrate the accuracy of the approximation of $M_{\tilde T}$ based on scan statistics, using the methodology developed in Section 2, we consider that the random variables $Y_i$ have a common uniform $U([0,1])$ distribution. Simple calculations show that $P(X_1 = \dots = X_k = 1) = \frac{1}{(k+1)!}$ and thus $C = 2$, $t = 1$ and $h = 1$. In our particular situation, the result of Reference [30] in Equation (23) reduces to the asymptotic approximation (25), which involves $r(m) = P(X_1 = \dots = X_m = 1) - P(X_1 = \dots = X_{m+1} = 1) = \frac{m+1}{(m+2)!}$.
In Table 2, we present a numerical comparison between the simulated value (column Sim) obtained by Monte Carlo simulation with $ITER_{sim} = 10^4$ trials, the approximation based on scan statistics (column App) computed from Equation (14), where $\hat Q_2$ and $\hat Q_3$ are computed with $ITER_{app} = 10^5$ trials, and the limit distribution (column LimApp) of the length of the longest increasing run, $P(M_{\tilde T} \le m)$, in a sequence of $\tilde T = 10001$ random variables distributed uniformly over $[0,1]$. The results show that both our method and the asymptotic approximation in (25) are very accurate. It is worth mentioning that for our simulations we used an adapted version of the importance sampling procedure introduced in Reference [3], an efficient method that proved to perform very well for small p values (where naive Monte Carlo methods tend to break down) [37].
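The link between the longest increasing run, the longest run of ones and the scan statistic can be verified numerically. The following sketch, with hypothetical helper names, builds the indicators $X_i = 1$ if $Y_i < Y_{i+1}$ and 0 otherwise from uniform samples and checks that the event "longest increasing run at most $m$" coincides with the event "scan statistic with window $m$ smaller than $m$". It also checks the expression of $r(m)$ exactly with rational arithmetic.

```python
import random
from fractions import Fraction
from math import factorial
from typing import List, Sequence

def longest_increasing_run(y: Sequence[float]) -> int:
    """Length of the longest strictly increasing run of consecutive values."""
    best = cur = 1
    for a, b in zip(y, y[1:]):
        cur = cur + 1 if a < b else 1
        best = max(best, cur)
    return best

def longest_run_of_ones(x: Sequence[int]) -> int:
    """Length of the longest run of consecutive ones."""
    best = cur = 0
    for v in x:
        cur = cur + 1 if v == 1 else 0
        best = max(best, cur)
    return best

def scan_statistic(x: Sequence[int], m: int) -> int:
    """Maximum moving sum of length m."""
    return max(sum(x[i:i + m]) for i in range(len(x) - m + 1))

def r(m: int) -> Fraction:
    """r(m) = 1/(m+1)! - 1/(m+2)! for uniform inputs."""
    return Fraction(1, factorial(m + 1)) - Fraction(1, factorial(m + 2))

rng = random.Random(7)
all_agree = True
for _ in range(200):
    y = [rng.random() for _ in range(50)]
    x: List[int] = [1 if a < b else 0 for a, b in zip(y, y[1:])]
    M, run = longest_increasing_run(y), longest_run_of_ones(x)
    all_agree &= (M == run + 1)
    for m in range(2, 8):
        all_agree &= ((M <= m) == (scan_statistic(x, m) < m))

print(all_agree)                                   # True
print(r(3) == Fraction(3 + 1, factorial(3 + 2)))   # True: r(m) = (m+1)/(m+2)!
```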

The Moving Average-Like Process of Order q Model
We consider the particular situation of the one dimensional discrete scan statistic defined over a sequence of random variables obtained as a linear block factor of a discrete Gaussian white noise. Because of the similarity with the definition of a classical moving average process, we call this block factor model a moving average-like process. It is worth mentioning that the distribution of the scan statistic in the context of a moving average process for normal data was studied in Reference [38], where the authors compared the product-type approximation developed in Reference [13] with the approximation of Reference [23]. In the block-factor model introduced in (3), let $q \ge 1$ be a positive integer and $Y_1, Y_2, \dots, Y_{\tilde T}$ be a sequence of independent and identically distributed Gaussian random variables with known mean $\mu$ and variance $\sigma^2$.
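For $i \in \{1, \dots, T\}$, with $T = \tilde T - q$, our dependent model is defined by the relation
$$X_i = a_1 Y_i + a_2 Y_{i+1} + \dots + a_{q+1} Y_{i+q}.$$
The moving sums of size $m$, $W_i$, $1 \le i \le T-m+1$, can be expressed as
$$W_i = \sum_{j=1}^{m+q} b_j Y_{i+j-1},$$
where the coefficients $b_1, \dots, b_{m+q}$ are evaluated by
$$b_j = \sum_{s=\max(1,\, j-m+1)}^{\min(j,\, q+1)} a_s.$$
Therefore, for each $i \in \{1, \dots, T-m+1\}$, the random variable $W_i$ follows a normal distribution with mean $E[W_i] = \mu \sum_{j=1}^{m+q} b_j$ and variance $Var[W_i] = \sigma^2 \sum_{j=1}^{m+q} b_j^2$. Moreover, a simple calculation shows that the covariance matrix of $(W_1, \dots, W_{T-m+1})$ has the entries
$$Cov(W_t, W_s) = \begin{cases} \sigma^2 \sum_{j=1}^{m+q-|t-s|} b_j\, b_{j+|t-s|}, & \text{if } |t-s| \le m+q-1, \\ 0, & \text{otherwise.} \end{cases}$$
Given the mean and the covariance matrix of the vector $(W_1, \dots, W_{T-m+1})$, one can use the importance sampling algorithm developed in Reference [3] (see also Reference [37]) or the one presented in Reference [39] to estimate the distribution of the one dimensional discrete scan statistic $S(m, T)$.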
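The coefficients of the moving sums can be computed and checked against a direct expansion. The sketch below is a minimal illustration with hypothetical function names; it uses the index convention $b_j = \sum a_s$ over $\max(1, j-m+1) \le s \le \min(j, q+1)$, which is what one obtains by expanding the sum of $m$ consecutive $X_i$, and the coefficient values $(0.3, 0.1, 0.5)$ with $m = 20$ from the numerical example in the text.

```python
import random
from typing import List, Sequence

def window_coefficients(a: Sequence[float], m: int) -> List[float]:
    """b_1, ..., b_{m+q} such that W_i = sum_j b_j * Y_{i+j-1}, with a = (a_1, ..., a_{q+1})."""
    q = len(a) - 1
    b = []
    for j in range(1, m + q + 1):
        lo, hi = max(1, j - m + 1), min(j, q + 1)
        b.append(sum(a[s - 1] for s in range(lo, hi + 1)))
    return b

def window_cov(a: Sequence[float], m: int, sigma2: float, lag: int) -> float:
    """Cov(W_t, W_{t+lag}) for i.i.d. inputs with variance sigma2."""
    b = window_coefficients(a, m)
    lag = abs(lag)
    if lag >= len(b):
        return 0.0   # windows share no underlying Y
    return sigma2 * sum(b[j] * b[j + lag] for j in range(len(b) - lag))

a, m, T = (0.3, 0.1, 0.5), 20, 50
q = len(a) - 1
b = window_coefficients(a, m)

# Compare W_1 computed directly from the X_i with its expression through the b_j.
rng = random.Random(3)
y = [rng.gauss(0.0, 1.0) for _ in range(T + q)]
x = [sum(a[s] * y[i + s] for s in range(q + 1)) for i in range(T)]
direct = sum(x[:m])
via_b = sum(b[j] * y[j] for j in range(m + q))

print(len(b), abs(direct - via_b) < 1e-9)
print(window_cov(a, m, 1.0, 0))   # variance of W_i when sigma^2 = 1
```

With these values the coefficients are $0.3, 0.4$, then eighteen times $0.9$, then $0.6, 0.5$, so their sum is $18$ and the variance of $W_i$ for $\sigma^2 = 1$ is $15.44$ up to floating-point rounding.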
Another way is to use the quasi-Monte Carlo algorithm developed in Reference [40] to approximate the multivariate normal distribution. In our application example we adopt the importance sampling procedure developed in Reference [3]. In order to evaluate the accuracy of the approximation developed in (14), we consider $q = 2$, $T = 1000$, $m = 20$, $Y_i \sim N(0, 1)$ and the coefficients of the moving average model to be $(a_1, a_2, a_3) = (0.3, 0.1, 0.5)$. We compare our approximation (column App) given in (14) with the one (column AppPT) given in Reference [41] using product-type approximations. In Table 3, we present numerical results for the setting described above. In our algorithms we used $ITER_{app} = 10^6$ trials for the computation of $\hat Q_2(x)$ and $\hat Q_3(x)$ and $ITER_{sim} = 10^5$ trials for the Monte Carlo simulation of $P(S(m, T) \le x)$.
Table 3. MA-like ($q = 2$) model: $m = 20$, $T = 1000$, $X_i = 0.3Y_i + 0.1Y_{i+1} + 0.5Y_{i+2}$, $ITER_{app} = 10^6$, $ITER_{sim} = 10^5$.
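For orientation, the simulated column of such a comparison can be mimicked with a naive Monte Carlo estimate of $P(S(m, T) \le x)$ under the MA-like model. This sketch deliberately uses plain simulation rather than the importance sampling procedure of Reference [3], runs far fewer trials than the study, and uses hypothetical function names and arbitrary evaluation points, so its output is not comparable in precision with the reported results.

```python
import random
from typing import List, Sequence

def simulate_scan_ma(a: Sequence[float], m: int, T: int, rng: random.Random) -> float:
    """Simulate one value of S(m, T) for X_i = a_1*Y_i + ... + a_{q+1}*Y_{i+q}, Y_i ~ N(0, 1)."""
    q = len(a) - 1
    y = [rng.gauss(0.0, 1.0) for _ in range(T + q)]
    x = [sum(a[s] * y[i + s] for s in range(q + 1)) for i in range(T)]
    w = sum(x[:m])
    best = w
    for i in range(m, T):
        w += x[i] - x[i - m]   # slide the window by one position
        best = max(best, w)
    return best

def empirical_cdf(a: Sequence[float], m: int, T: int, xs: Sequence[float],
                  n_iter: int = 200, seed: int = 0) -> List[float]:
    """Naive Monte Carlo estimates of P(S(m, T) <= x) for each x in xs."""
    rng = random.Random(seed)
    sims = [simulate_scan_ma(a, m, T, rng) for _ in range(n_iter)]
    return [sum(s <= x for s in sims) / n_iter for x in xs]

xs = [8.0, 11.0, 14.0, 17.0]   # arbitrary evaluation points
cdf = empirical_cdf((0.3, 0.1, 0.5), m=20, T=1000, xs=xs)
print(list(zip(xs, cdf)))
```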

Conclusions
Block factor models defined from an i.i.d. sequence generate random sequences with a particular type of dependence structure. For this type of dependence, the scan statistic can be viewed as the maximum of a 1-dependent stationary sequence, for which the distribution can be approximated with high accuracy. The approximation error can be controlled by using efficient simulation algorithms, such as the importance sampling approach proposed in Reference [3] (see also Reference [25]). We approximated the distribution of the longest increasing run statistic over an i.i.d. sequence as a particular case of the scan statistic distribution over a block factor model.
Author Contributions: Conceptualization, A.A. and C.P.; methodology, A.A. and C.P.; software, A.A.; validation, A.A. and C.P.; formal analysis, A.A. and C.P.; writing original draft preparation, C.P.; writing review and editing, A.A. and C.P.; visualization, A.A. and C.P.; supervision, A.A. and C.P.; project administration, C.P.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.