Mathematics
  • Article
  • Open Access

13 April 2020

One Dimensional Discrete Scan Statistics for Dependent Models and Some Related Problems

and
1 Faculty of Mathematics and Computer Science, University of Bucharest, 010014 Bucharest, Romania
2 National Institute of Research and Development for Biological Sciences, 060031 Bucharest, Romania
3 Laboratoire de Mathématiques Paul Painlevé, University of Lille, 59655 Villeneuve d’Ascq, France
4 Biostatistics Department, Delegation for Clinical Research and Innovation, Lille Catholic Hospitals, GHICL, 59462 Lomme, France
This article belongs to the Special Issue Probability, Statistics and Their Applications

Abstract

The one dimensional discrete scan statistic is considered over sequences of random variables generated by block factor dependence models. Viewed as the maximum of a 1-dependent stationary sequence, the distribution of the scan statistic is approximated accurately and sharp error bounds are provided. The longest increasing run statistic is related to the scan statistic and its distribution is studied. The moving average process is a particular case of a block factor, and the distribution of the associated scan statistic is approximated. Numerical results are presented.

1. Introduction

There are many situations when an investigator observes an accumulation of events of interest and wants to decide whether such a realisation is due to chance or not. These types of problems belong to the class of cluster detection problems, where the basic idea is to identify regions that are unexpected or anomalous with respect to the distribution of events. Depending on the application domain, these anomalous agglomerations of events can correspond to a diversity of phenomena: for example, one may want to find clusters of stars, deposits of precious metals, outbreaks of disease, minefields, defective batches of parts and many other possibilities. If such an observed accumulation of events exceeds a preassigned threshold, usually determined from a specified significance level corresponding to a normal situation (the null hypothesis), then it is legitimate to say that we have an unexpected cluster and proper measures have to be taken accordingly.
Searching for unusual clusters of events is of great importance in many scientific and technological fields, including DNA sequence analysis ([1,2]), brain imaging ([3]), target detection in sensor networks ([4,5]), astronomy ([6,7]), and reliability theory and quality control ([8]), among many other domains. One of the tools used by practitioners to decide on the unusualness of such an agglomeration of events is the scan statistic. Basically, tests based on scan statistics look for events that are clustered amongst a background of sporadic ones.
Let $2 \le m \le T$ be two positive integers and $X_1, \dots, X_T$ be a sequence of independent and identically distributed random variables with the common distribution $F_0$. The one dimensional discrete scan statistic is defined as
$$S_m(T) = \max_{1 \le i \le T-m+1} W_i, \tag{1}$$
where the random variables $W_i$ are the moving sums of length $m$ given by
$$W_i = \sum_{j=i}^{i+m-1} X_j.$$
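The definition above can be sketched in a few lines of Python. This is only an illustration of the moving-sum construction; the function names `moving_sums` and `scan_statistic` and the toy 0/1 sequence are assumptions introduced here, not part of the original article.

```python
def moving_sums(x, m):
    """Return [W_1, ..., W_{T-m+1}], the sums over all windows of length m."""
    T = len(x)
    if not 1 <= m <= T:
        raise ValueError("need 1 <= m <= len(x)")
    w = sum(x[:m])              # W_1 = X_1 + ... + X_m
    sums = [w]
    for i in range(m, T):
        w += x[i] - x[i - m]    # slide the window one position to the right
        sums.append(w)
    return sums


def scan_statistic(x, m):
    """Discrete scan statistic S_m(T): the largest moving sum of length m."""
    return max(moving_sums(x, m))


# Toy 0/1 sequence with a visible cluster in the middle.
x = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
print(moving_sums(x, 3))     # [1, 1, 1, 2, 3, 2, 1, 1]
print(scan_statistic(x, 3))  # 3
```

Updating each window sum from the previous one keeps the cost linear in $T$ rather than proportional to $mT$.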
Usually, the statistical tests based on the one dimensional discrete scan statistic are employed when one wants to detect a local change in the signal within a sequence of $T$ observations via testing the null hypothesis of uniformity, $H_0$, against a cluster alternative, $H_1$ (see References [9,10]). Under $H_0$, the random observations $X_1, \dots, X_T$ are i.i.d. with distribution $F_0$, while under the alternative hypothesis there exists a location $1 \le i_0 \le T-m+1$ such that the $X_i$, $i \in \{i_0, \dots, i_0+m-1\}$, are distributed according to $F_1 \ne F_0$ and outside this region the $X_i$ are distributed as $F_0$.
We observe that whenever $S_m(T)$ exceeds the threshold $\tau$, where the value of $\tau$ is computed based on the relation $P_{H_0}\left(S_m(T) \ge \tau\right) = \alpha$ and $\alpha$ is a preassigned significance level of the testing procedure, the generalized likelihood ratio test rejects the null hypothesis in favor of the clustering alternative (see Reference [9]). It is interesting to note that most of the research has been done for $F_0$ being a binomial, Poisson or normal distribution (see References [9,10,11,12,13]). More recently, Reference [14] proposed a testing procedure based on the one dimensional scan statistic for geometric and negative binomial distributions.
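As a rough illustration of how such a threshold might be obtained in practice, the following sketch estimates $\tau$ by plain Monte Carlo simulation under $H_0$. It assumes $F_0$ is Bernoulli($p$); because $S_m(T)$ is then integer valued, the relation $P_{H_0}(S_m(T) \ge \tau) = \alpha$ can usually not be met exactly, so the sketch returns the smallest $\tau$ whose estimated tail probability does not exceed $\alpha$. The function name `mc_threshold` and all parameter values are hypothetical.

```python
import random


def scan_statistic(x, m):
    """Largest sum over windows of m consecutive entries of x."""
    w = sum(x[:m])
    best = w
    for i in range(m, len(x)):
        w += x[i] - x[i - m]
        best = max(best, w)
    return best


def mc_threshold(T, m, p, alpha, n_sim=2000, seed=1):
    """Smallest tau with estimated P_H0(S_m(T) >= tau) <= alpha.

    Assumes X_1, ..., X_T i.i.d. Bernoulli(p) under H0 (an assumption of
    this sketch). Returns (tau, estimated tail probability at tau).
    """
    rng = random.Random(seed)
    sims = [
        scan_statistic([1 if rng.random() < p else 0 for _ in range(T)], m)
        for _ in range(n_sim)
    ]
    for tau in range(m + 2):          # for 0/1 data, S_m(T) never exceeds m
        tail = sum(1 for s in sims if s >= tau) / n_sim
        if tail <= alpha:
            return tau, tail
    return m + 1, 0.0                 # not reached: the tail at m + 1 is 0


tau, tail = mc_threshold(T=200, m=10, p=0.1, alpha=0.05)
print(tau, tail)
```

The estimate inherits the usual Monte Carlo error of order $n_{\text{sim}}^{-1/2}$, which is one motivation for the approximations and bounds discussed in the article.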
There are three main approaches used for investigating the exact distribution of the one dimensional discrete scan statistic: the combinatorial methods ([12,15]), the Markov chain imbedding technique ([16,17]) and the conditional probability generating function method ([18,19]). Due to the high complexity and the limited range of application of the exact formulas, a considerable number of approximations and bounds have been developed for the estimation of the distribution of the one dimensional discrete scan statistic, for example, References [9,12,13,20]. A full treatment of these results is presented in References [10,11].
Even if the $X_i$'s are generally assumed to be i.i.d., there are applications, such as detecting similarities between DNA sequences, where the $X_i$'s are not independent ([21]). In order to evaluate the effect of dependence, the alternative model is in many cases a Markov chain, whose whole dependence structure is determined by the joint distribution of two consecutive random variables.
In this work we introduce dependence models based on block-factors obtained from i.i.d. sequences in the context of the one dimensional discrete scan statistics. We derive approximations and their corresponding errors with application to the longest increasing run distribution and the moving average process.
The paper is structured as follows. In Section 2 we introduce the block factor model and present the approximation technique for the distribution of the scan statistic under this model. As particular block factor models, the length of the longest increasing run in a sequence of i.i.d. real random variables and the moving average process are related to the scan statistic distribution in Section 3. Numerical results based on simulations illustrate the accuracy of the approximation. Concluding remarks end the paper.

2. One Dimensional Scan Statistics for Block-Factor Dependence Model

Most of the research devoted to the one dimensional discrete scan statistic considers the independent and identically distributed model for the random variables that generate the sequence which is to be scanned. In this section, we define a dependence structure for the underlying random sequence based on a block-factor type model.

2.1. The Block-Factor Dependence Model

Let us recall (see also Reference [22]) that the sequence $\{X_i\}_{i \ge 1}$ of random variables with state space $S_W$ is said to be a $k$ block-factor of the sequence $\{Y_i\}_{i \ge 1}$ with state space $S_Y$ if there is a measurable function $f : S_Y^k \to S_W$ such that
$$X_i = f\left(Y_i, Y_{i+1}, \dots, Y_{i+k-1}\right), \quad i \ge 1.$$
Figure 1 presents the sequence $\{X_i\}_{i=1,\dots,T}$ of length $T$ obtained as a $k$ block-factor from a sequence $\{Y_i\}_{i=1,\dots,\tilde{T}}$ of length $\tilde{T} = T + k - 1$ through some function $f$.
Figure 1. The block factor model.
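The construction in Figure 1 amounts to sliding a function of $k$ arguments along the underlying sequence. A minimal sketch follows, in which the helper name `block_factor` and the two example functions are illustrative assumptions rather than the authors' code:

```python
import random


def block_factor(y, k, f):
    """Return X_1, ..., X_T with X_i = f(Y_i, ..., Y_{i+k-1}), T = len(y) - k + 1."""
    if not 1 <= k <= len(y):
        raise ValueError("need 1 <= k <= len(y)")
    return [f(*y[i:i + k]) for i in range(len(y) - k + 1)]


rng = random.Random(0)
y = [rng.random() for _ in range(12)]        # i.i.d. uniform Y_1, ..., Y_12

# Hypothetical 2-block-factor: X_i = 1 when Y_i < Y_{i+1}, else 0.
x = block_factor(y, 2, lambda a, b: 1 if a < b else 0)
print(len(y), len(x))                        # 12 11

# A linear 2-block-factor X_i = a*Y_i + b*Y_{i+1} with a = 1, b = 2.
print(block_factor([1, 2, 3, 4], 2, lambda u, v: u + 2 * v))  # [5, 8, 11]
```

Consecutive values $X_i$ and $X_{i+1}$ share the inputs $Y_{i+1}, \dots, Y_{i+k-1}$, which is the source of the dependence between them.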
As an example of a block-factor model, in Reference [23] the authors consider an i.i.d. sequence $\{Y_n\}_{n \ge 1}$ of standard normal random variables and the 2 block-factor defined by
$$X_i = a Y_i + b Y_{i+1}, \quad i \ge 1, \quad \text{for } f(x, y) = ax + by, \quad a, b \in \mathbb{R}.$$
Therefore, due to the overlapping structure of $X_i$ and $X_{i+1}$, they obtain a Gaussian stationary process $\{X_i\}_{i \ge 1}$ with some correlation structure, for which the scan statistic distribution is studied.
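For this example the correlation structure can be written out directly. Since the $Y_i$ are i.i.d. standard normal, a short computation (added here as an illustration, not taken from Reference [23]) gives:

```latex
\begin{aligned}
\operatorname{Var}(X_i) &= a^2 + b^2,\\
\operatorname{Cov}(X_i, X_{i+1}) &= \operatorname{Cov}\left(a Y_i + b Y_{i+1},\; a Y_{i+1} + b Y_{i+2}\right) = ab,\\
\operatorname{Cov}(X_i, X_{i+h}) &= 0 \quad \text{for } h \ge 2,
\end{aligned}
\qquad \text{so} \qquad
\operatorname{Corr}(X_i, X_{i+1}) = \frac{ab}{a^2 + b^2}.
```

Only neighbouring terms are correlated, because $X_i$ and $X_{i+h}$ share no $Y$ variable once $h \ge 2$.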
More generally, observe that if a sequence $\{X_i\}_{i \ge 1}$ of random variables is a $k$ block-factor of an i.i.d. sequence, then it is $(k-1)$-dependent. Recall that a sequence $\{X_i\}_{i \ge 1}$ is $m$-dependent with $m \ge 1$ (see Reference [22]) if, for any $h \ge 1$, the $\sigma$-fields generated by $\{X_1, \dots, X_h\}$ and $\{X_{h+m+1}, \dots\}$ are independent.

2.2. Scan Statistics Viewed as Maximum of 1-Dependent Sequence

Let $\{X_1, \dots, X_T\}$ be a $k$-block factor of the i.i.d. sequence $\{Y_1, \dots, Y_{\tilde{T}}\}$, where $\tilde{T} \ge k \ge 1$ and $T = \tilde{T} - k + 1$, and let $S_m(T)$ be the scan statistic associated with the sequence $\{X_i\}_{i=1,\dots,T}$ as defined in (1) for some scanning window of length $m$, $1 \le m \le T$.
Put $T = (L-1)(m+k-2) + m - 1$ for some integer $L > 1$ and define, for each $j \in \{1, \dots, L-1\}$, the random variables
$$Z_j = \max_{(j-1)(m+k-2)+1 \,\le\, i \,\le\, j(m+k-2)} W_i, \tag{4}$$
where $W_i = \sum_{s=i}^{i+m-1} X_s$. That is, for each $j \in \{1, \dots, L-1\}$, $Z_j$ is the scan statistic associated with the sequence of length $2m+k-3$, $\{X_{(j-1)(m+k-2)+1}, \dots, X_{j(m+k-2)+m-1}\}$. An illustration of the construction of the variables $Z_j$ is presented in Figure 2 for $L = 5$ and $k = 1$.
Figure 2. Construction of Z j .
Then, $\{Z_j\}_{j=1,\dots,L-1}$ is 1-dependent and stationary and, writing $S(m, T)$ for the scan statistic $S_m(T)$ of (1), we have that
$$S(m, T) = \max_{1 \le j \le L-1} Z_j. \tag{5}$$
Thus, for any block factor model obtained from an i.i.d. sequence, the distribution of the associated scan statistic is the distribution of the maximum of some 1-dependent stationary sequence.
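The decomposition can be checked numerically on any sequence whose length has the required form. The sketch below groups the moving sums into consecutive blocks of $m+k-2$ terms and verifies that the largest block maximum equals the scan statistic; the function name `block_maxima` and the deterministic toy data are assumptions made for illustration.

```python
def moving_sums(x, m):
    """Return [W_1, ..., W_{T-m+1}] for windows of length m."""
    w = sum(x[:m])
    sums = [w]
    for i in range(m, len(x)):
        w += x[i] - x[i - m]
        sums.append(w)
    return sums


def block_maxima(x, m, k):
    """Return [Z_1, ..., Z_{L-1}] when len(x) = (L-1)(m+k-2) + m - 1."""
    step = m + k - 2                      # number of moving sums per block
    n_w = len(x) - m + 1                  # total number of moving sums
    if step < 1 or n_w < step or n_w % step != 0:
        raise ValueError("len(x) must equal (L-1)(m+k-2) + m - 1 with L > 1")
    w = moving_sums(x, m)                 # w[i-1] holds W_i
    return [max(w[j:j + step]) for j in range(0, n_w, step)]


# Toy data with m = 4, k = 2, L = 5, hence T = 4 * 4 + 3 = 19.
m, k, L = 4, 2, 5
T = (L - 1) * (m + k - 2) + m - 1
x = [(7 * i) % 5 for i in range(T)]
z = block_maxima(x, m, k)
print(len(z) == L - 1, max(z) == max(moving_sums(x, m)))  # True True
```

The identity itself holds for any sequence; the block-factor structure is what makes the resulting $Z_j$ 1-dependent, since $Z_j$ and $Z_{j+2}$ are then built from disjoint sets of the underlying $Y$ variables.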

2.3. Approximation

In References [24,25] the authors extended the approximation results obtained in Reference [26] for the distribution of the maximum of a 1-dependent stationary sequence. The main result is stated in the following theorem.
Let $\{Z_j\}_{j \ge 1}$ be a strictly stationary 1-dependent sequence of random variables and, for $x < \sup\{u \mid P(Z_1 \le u) < 1\}$, let
$$q_n = q_n(x) = P\left(\max(Z_1, \dots, Z_n) \le x\right). \tag{6}$$
Theorem 1.
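For all $x$ such that $q_1(x) \ge 1 - \alpha \ge 0.9$, the following approximation formula holds:
$$\left| q_n - \frac{2q_1 - q_2}{\left[1 + q_1 - q_2 + 2(q_1 - q_2)^2\right]^n} \right| \le n\, F(\alpha, n)\, (1 - q_1)^2 \tag{7}$$
with
$$F(\alpha, n) = 1 + \frac{3}{n} + \left[\frac{\Gamma(\alpha)}{n} + K(\alpha)\right](1 - q_1),$$
where $\Gamma(\alpha) = L(\alpha) + E(\alpha)$,
$$K(\alpha) = \frac{\dfrac{11 - 3\alpha}{(1-\alpha)^2} + 2l(1+3\alpha)\,\dfrac{2 + 3l\alpha - \alpha(2 - l\alpha)(1 + l\alpha)^2}{\left[1 - \alpha(1 + l\alpha)^2\right]^3}}{1 - \dfrac{2\alpha(1 + l\alpha)}{\left[1 - \alpha(1 + l\alpha)^2\right]^2}},$$
$$L(\alpha) = 3K(\alpha)(1 + \alpha + 3\alpha^2)\left[1 + \alpha + 3\alpha^2 + K(\alpha)\alpha^3\right] + \alpha^6 K^3(\alpha) + 9\alpha(4 + 3\alpha + 3\alpha^2) + 55.1,$$
$$E(\alpha) = \frac{\eta^5 \left[1 + (1 - 2\alpha)\eta\right]^4 \left[1 + \alpha(\eta - 2)\right]\left[1 + \eta + (1 - 3\alpha)\eta^2\right]}{2(1 - \alpha\eta^2)^4\left[(1 - \alpha\eta^2)^2 - \alpha\eta^2(1 + \eta - 2\alpha\eta)^2\right]},$$
and where $\eta = 1 + l\alpha$ with $l = l(\alpha) = t_2^3(\alpha) + \varepsilon$, for arbitrarily small $\varepsilon > 0$, and $t_2(\alpha)$ the second root in magnitude of the equation $\alpha t^3 - t + 1 = 0$.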
The evaluation of the functions $K$ and $\Gamma$ for some selected values of $\alpha$ is presented in Table 1. These values allow one to compute directly the error bound of the approximation in (7).
Table 1. Selected values for the $l$, $K$ and $\Gamma$ functions in Theorem 1 for $\varepsilon = 10^{-6}$.
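The constants of Theorem 1 can also be evaluated numerically. The sketch below is one possible reading, with a loudly stated assumption: $t_2(\alpha)$ is taken to be the middle of the three real roots of $\alpha t^3 - t + 1 = 0$, that is, the root lying in $(1, 1/\sqrt{3\alpha})$. The function names are hypothetical, and the remaining formulas are transcribed from the theorem.

```python
import math


def t2(alpha):
    """Middle real root of alpha*t^3 - t + 1 = 0 (assumed meaning of t_2)."""
    if not 0 < alpha <= 0.1:
        raise ValueError("Theorem 1 is stated for 0 < alpha <= 0.1")
    lo, hi = 1.0, 1.0 / math.sqrt(3.0 * alpha)   # cubic is > 0 at lo, < 0 at hi
    for _ in range(200):                         # plain bisection
        mid = 0.5 * (lo + hi)
        if alpha * mid ** 3 - mid + 1.0 > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def theorem1_constants(alpha, eps=1e-6):
    """Return (l, K, Gamma) as defined in Theorem 1."""
    l = t2(alpha) ** 3 + eps
    eta = 1.0 + l * alpha
    d = 1.0 - alpha * eta ** 2
    K = ((11 - 3 * alpha) / (1 - alpha) ** 2
         + 2 * l * (1 + 3 * alpha)
         * (2 + 3 * l * alpha - alpha * (2 - l * alpha) * eta ** 2) / d ** 3
         ) / (1 - 2 * alpha * eta / d ** 2)
    s = 1 + alpha + 3 * alpha ** 2
    L_const = (3 * K * s * (s + K * alpha ** 3) + alpha ** 6 * K ** 3
               + 9 * alpha * (4 + 3 * alpha + 3 * alpha ** 2) + 55.1)
    E_const = (eta ** 5 * (1 + (1 - 2 * alpha) * eta) ** 4
               * (1 + alpha * (eta - 2)) * (1 + eta + (1 - 3 * alpha) * eta ** 2)
               / (2 * d ** 4
                  * (d ** 2 - alpha * eta ** 2 * (1 + eta - 2 * alpha * eta) ** 2)))
    return l, K, L_const + E_const


for a in (0.1, 0.05, 0.01):
    l, K, Gamma = theorem1_constants(a)
    print(a, round(l, 4), round(K, 2), round(Gamma, 2))
```

Under this reading, $\alpha = 0.1$ gives $K \approx 38.6$ and $\Gamma \approx 481$; both constants decrease as $\alpha$ gets smaller, so a sharper bound on $1 - q_1$ also tightens $F(\alpha, n)$.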
Applying Theorem 1 to the sequence $\{Z_j\}_{j=1,\dots,L-1}$ defined in (4), from (5) we obtain an approximation and the associated error bound for $P(S(m, T) \le x)$ in the following way. Put, for $s \in \{2, 3\}$,
$$Q_s = Q_s(x) = P\left(\bigcap_{j=1}^{s-1} \{Z_j \le x\}\right)$$
and observe that, using the notation of (6), we have $Q_s = q_{s-1}$. For $x$ such that $Q_2(x) \ge 1 - \alpha \ge 0.9$ we apply the result from Theorem 1 to obtain the approximation
$$P\left(S(m, T) \le x\right) \approx \frac{2Q_2 - Q_3}{\left[1 + Q_2 - Q_3 + 2(Q_2 - Q_3)^2\right]^{L-1}},$$
with an error bound of about $(L-1)\, F(\alpha, L-1)\, (1 - Q_2)^2$. Observe that $Q_2$ and $Q_3$ are the distribution functions of the scan statistic over sequences $\{X_i\}$ of lengths $2m+k-3$ and $3m+2k-5$, respectively, that is, the sequences underlying $Z_1$ and $(Z_1, Z_2)$. $Q_2$ and $Q_3$ are generally estimated by Monte Carlo simulation.
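To make the recipe concrete, the sketch below estimates $Q_2$ and $Q_3$ by naive Monte Carlo for a block-factor model and plugs them into the approximation. It is a minimal illustration under stated assumptions: the model (uniform $Y_i$ with $X_i = 1$ if $Y_i < Y_{i+1}$), the parameter values, and the names `estimate_Q` and `approx_cdf` are all hypothetical, and plain simulation stands in for the more efficient estimators one would use in practice.

```python
import random


def scan_statistic(x, m):
    """Largest sum over windows of m consecutive entries of x."""
    w = sum(x[:m])
    best = w
    for i in range(m, len(x)):
        w += x[i] - x[i - m]
        best = max(best, w)
    return best


def estimate_Q(s, x_level, m, k, f, draw_y, n_sim, rng):
    """Monte Carlo estimate of Q_s: scan statistic <= x_level over
    (s-1)(m+k-2) + m - 1 consecutive X's (2m+k-3 for s=2, 3m+2k-5 for s=3)."""
    T_s = (s - 1) * (m + k - 2) + m - 1
    hits = 0
    for _ in range(n_sim):
        y = [draw_y(rng) for _ in range(T_s + k - 1)]
        x = [f(*y[i:i + k]) for i in range(T_s)]
        hits += scan_statistic(x, m) <= x_level
    return hits / n_sim


def approx_cdf(Q2, Q3, L):
    """(2*Q2 - Q3) / [1 + Q2 - Q3 + 2*(Q2 - Q3)^2]^(L-1)."""
    d = Q2 - Q3
    return (2 * Q2 - Q3) / (1 + d + 2 * d * d) ** (L - 1)


# Hypothetical setting: m = 10, k = 2, L = 11, hence T = 10 * 10 + 9 = 109.
rng = random.Random(42)
m, k, L, x_level = 10, 2, 11, 8
f = lambda a, b: 1 if a < b else 0
Q2_hat = estimate_Q(2, x_level, m, k, f, lambda r: r.random(), 2000, rng)
Q3_hat = estimate_Q(3, x_level, m, k, f, lambda r: r.random(), 2000, rng)
print(Q2_hat, Q3_hat, approx_cdf(Q2_hat, Q3_hat, L))
```

Only two short sequences have to be simulated, whatever the value of $T$; the length of the scanned sequence enters the approximation only through the exponent $L - 1$.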
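Thus, if $\hat{Q}_s$ is an estimate of $Q_s$, $s \in \{2, 3\}$, with $|\hat{Q}_s - Q_s| \le \beta_s$, and $x$ is such that $1 - \hat{Q}_2(x) \le 0.1$, then
$$\left| P\left(S(m, T) \le x\right) - \left(2\hat{Q}_2 - \hat{Q}_3\right)\left[1 + \hat{Q}_2 - \hat{Q}_3 + 2(\hat{Q}_2 - \hat{Q}_3)^2\right]^{1-L} \right| \le E_{total},$$
where $E_{total}$ is the total error of the approximation given by
$$E_{total} = (L-1)\left[\beta_2 + \beta_3 + F\left(\hat{Q}_2, L-1\right)\left(1 - \hat{Q}_2 + \beta_2\right)^2\right].$$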
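The total error can be evaluated once the simulation errors $\beta_2$, $\beta_3$ and the constants $K$ and $\Gamma$ are available. The following sketch is one conservative reading of the formula and rests on an assumption: inside $F$, the unknown quantity $1 - Q_2$ is replaced by its upper bound $1 - \hat{Q}_2 + \beta_2$, and the caller supplies $K$ and $\Gamma$ evaluated at an $\alpha$ at least that large. The function names and the numerical values are placeholders, not values from Table 1.

```python
def F(n, one_minus_q1, K, Gamma):
    """F = 1 + 3/n + (Gamma/n + K) * (1 - q1), as in Theorem 1."""
    return 1.0 + 3.0 / n + (Gamma / n + K) * one_minus_q1


def total_error(Q2_hat, beta2, beta3, L, K, Gamma):
    """E_total = (L-1) * [beta2 + beta3 + F * (1 - Q2_hat + beta2)^2].

    Assumption of this sketch: 1 - Q2 is bounded by u = 1 - Q2_hat + beta2
    both inside F and in the squared factor.
    """
    n = L - 1
    u = 1.0 - Q2_hat + beta2
    return n * (beta2 + beta3 + F(n, u, K, Gamma) * u ** 2)


# Placeholder inputs, chosen only to show the orders of magnitude involved.
err = total_error(Q2_hat=0.999, beta2=1e-4, beta3=1e-4, L=11, K=40.0, Gamma=500.0)
print(err)
```

In this example the simulation errors $\beta_2 + \beta_3$ dominate the bound, which illustrates why accurate estimates of $Q_2$ and $Q_3$ matter more than the theoretical term when $1 - \hat{Q}_2$ is small.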
One of the main advantages of this approximation method over the product-type approximation proposed in Reference [12], which uses the same quantities $Q_2$ and $Q_3$, is that it provides sharp error bounds for the approximation.

4. Conclusions

Block factor models defined from an i.i.d. sequence generate random sequences with a particular type of dependence structure. For this type of dependence, the scan statistic can be viewed as the maximum of a 1-dependent stationary sequence, whose distribution can be approximated with high accuracy. The approximation error can be controlled by using efficient simulation algorithms, for example the importance sampling approach proposed in Reference [3] (see also Reference [25]). We approximated the distribution of the longest increasing run statistic over an i.i.d. sequence as a particular case of the scan statistic distribution over a block factor model.

Author Contributions

Conceptualization, A.A. and C.P.; methodology, A.A. and C.P.; software, A.A.; validation, A.A. and C.P.; formal analysis, A.A. and C.P.; writing original draft preparation, C.P.; writing review and editing, A.A. and C.P.; visualization, A.A. and C.P.; supervision, A.A. and C.P.; project administration, C.P.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by a grant of the Romanian National Authority for Scientific Research and Innovation, project number POC P-37-257 and MCI National Core Program, project 25 N/2019 BIODIVERS 19270103.

Acknowledgments

The authors wish to thank the anonymous reviewers for their careful reading of the manuscript and their helpful suggestions and comments which led to the improvement of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hoh, J.; Ott, J. Scan statistics to scan markers for susceptibility genes. Proc. Natl. Acad. Sci. USA 2000, 97, 9615–9617.
  2. Sheng, K.-N.; Naus, J. Pattern matching between two non aligned random sequences. Bull. Math. Biol. 1994, 56, 1143–1162.
  3. Naiman, D.Q.; Priebe, C.E. Computing scan statistic p values using importance sampling, with applications to genetics and medical image analysis. J. Comput. Graph. Stat. 2001, 10, 296–328.
  4. Guerriero, M.; Pozdnyakov, V.; Glaz, J.; Willett, P. A repeated significance test with applications to sequential detection in sensor networks. IEEE Trans. Signal Process. 2010, 58, 3426–3435.
  5. Guerriero, M.; Willett, P.; Glaz, J. Distributed target detection in sensor networks using scan statistics. IEEE Trans. Signal Process. 2009, 57, 2629–2639.
  6. Darling, R.W.R.; Waterman, M.S. Extreme value distribution for the largest cube in a random lattice. SIAM J. Appl. Math. 1986, 46, 118–132.
  7. Marcos, R.; Marcos, C. From star complexes to the field: Open cluster families. Astrophys. J. 2008, 672, 342–351.
  8. Boutsikas, M.V.; Koutras, M.V. Reliability approximation for Markov chain imbeddable systems. Methodol. Comput. Appl. Probab. 2000, 2, 393–411.
  9. Glaz, J.; Naus, J. Tight bounds and approximations for scan statistic probabilities for discrete data. Ann. Appl. Probab. 1991, 1, 306–318.
  10. Glaz, J.; Naus, J.; Wallenstein, S. Scan Statistics; Springer Series in Statistics; Springer: New York, NY, USA, 2001.
  11. Glaz, J.; Balakrishnan, N. Scan Statistics and Applications; Springer Sciences+Business Media: Berlin, Germany, 1999.
  12. Naus, J. Approximations for distributions of scan statistics. J. Am. Stat. Assoc. 1982, 77, 177–183.
  13. Wang, X.; Glaz, J.; Naus, J. Approximations and inequalities for moving sums. Methodol. Comput. Appl. Probab. 2012, 14, 597–616.
  14. Chen, J.; Glaz, J. Scan statistics for monitoring data modeled by a negative binomial distribution. Commun. Stat. Theory Methods 2016, 45, 1632–1642.
  15. Naus, J. Probabilities for a generalized birthday problem. J. Am. Stat. Assoc. 1974, 69, 810–815.
  16. Balakrishnan, N.; Koutras, M.V. Runs and Scans with Applications; Wiley Series in Probability and Statistics; Wiley-Interscience [John Wiley & Sons]: New York, NY, USA, 2002.
  17. Fu, J.C.; Lou, W. Distribution Theory of Runs and Patterns and Its Applications: A Finite Markov Chain Imbedding Approach; World Scientific Publishing Co., Inc.: River Edge, NJ, USA, 2003.
  18. Ebneshahrashoob, M.; Gao, T.; Wu, M. An efficient algorithm for exact distribution of discrete scan statistics. Methodol. Comput. Appl. Probab. 2005, 7, 1423–1436.
  19. Uchida, M. On generating functions of waiting time problems for sequence patterns of discrete random variables. Ann. Inst. Stat. Math. 1998, 50, 650–671.
  20. Chen, J.; Glaz, J. Approximations and inequalities for the distribution of a scan statistic for 0-1 Bernoulli trials. Adv. Theory Pract. Stat. 1997, 1, 285–298.
  21. Arratia, R.; Gordon, L.; Waterman, M.S. The Erdős-Rényi law in distribution for coin tossing and sequence matching. Ann. Stat. 1990, 18, 539–570.
  22. Burton, R.M.; Goulet, M.; Meester, R. On one-dependent processes and k-block factors. Ann. Probab. 1993, 21, 2157–2168.
  23. Haiman, G.; Preda, C. One dimensional scan statistics generated by some dependent stationary sequences. Stat. Probab. Lett. 2013, 83, 1457–1463.
  24. Amărioarei, A. Approximation for the Distribution of Extremes of One Dependent Stationary Sequences of Random Variables. arXiv 2012, arXiv:1211.5456v1.
  25. Amărioarei, A. Approximations for the Multidimensional Discrete Scan Statistics. Ph.D. Thesis, University of Lille, Lille, France, 2014.
  26. Haiman, G. Estimating the distributions of scan statistics with high precision. Extremes 2000, 3, 349–361.
  27. Pittel, B. Limiting behavior of a process of runs. Ann. Probab. 1981, 9, 119–129.
  28. Frolov, A.; Martikainen, A. On the length of the longest increasing run in Rd. Stat. Prob. Lett. 1999, 41, 153–161.
  29. Grill, K. Erdős-Révész type bounds for the length of the longest run from a stationary mixing sequence. Probab. Theory Relat. Fields 1987, 75, 169–179.
  30. Novak, S. Longest runs in a sequence of m-dependent random variables. Probab. Theory Relat. Fields 1992, 91, 269–281.
  31. Révész, P. Three problems on the length of increasing runs. Stochastic Process. Appl. 1983, 5, 169–179.
  32. Csaki, E.; Foldes, A. On the length of the longest monotone block. Studia Scientiarum Mathematicarum Hungarica 1996, 31, 35–46.
  33. Eryilmaz, S. A note on runs of geometrically distributed random variables. Discrete Math. 2006, 306, 1765–1770.
  34. Grabner, P.; Knopfmacher, A.; Prodinger, H. Combinatorics of geometrically distributed random variables: Run statistics. Theoret. Comput. Sci. 2003, 297, 261–270.
  35. Louchard, G. Monotone runs of uniformly distributed integer random variables: A probabilistic analysis. Theoret. Comput. Sci. 2005, 346, 358–387.
  36. Mitton, N.; Paroux, K.; Sericola, B.; Tixeuil, S. Ascending runs in dependent uniformly distributed random variables: Application to wireless networks. Methodol. Comput. Appl. Probab. 2010, 12, 51–62.
  37. Malley, J.; Naiman, D.Q.; Bailey-Wilson, J. A comprehensive method for genome scans. Hum. Heredity 2002, 54, 174–185.
  38. Wang, X.; Zhao, B.; Glaz, J. A multiple window scan statistic for time series models. Stat. Probab. Lett. 2014, 94, 196–203.
  39. Shi, J.; Siegmund, D.; Yakir, B. Importance sampling for estimating p values in linkage analysis. J. Am. Stat. Assoc. 2007, 102, 929–937.
  40. Genz, A.; Bretz, F. Computation of Multivariate Normal and T Probabilities; Springer: New York, NY, USA, 2009.
  41. Wang, X.; Glaz, J. A variable window scan statistic for MA(1) process. In Proceedings of the 15th International Conference on Applied Stochastic Models and Data Analysis ASMDA 2013, Barcelona, Spain, 25–28 June 2013; pp. 955–962.
