One Dimensional Discrete Scan Statistics for Dependent Models and Some Related Problems

Amarioarei, Alexandru; Preda, Cristian

doi:10.3390/math8040576

by

Alexandru Amarioarei

^1,2

and

Cristian Preda

^3,4,5,6,*

¹

Faculty of Mathematics and Computer Science, University of Bucharest, 010014 Bucharest, Romania

²

National Institute of Research and Development for Biological Sciences, 060031 Bucharest, Romania

³

Laboratoire de Mathématiques Paul Painlevé, University of Lille, 59655 Villeneuve d’Ascq, France

⁴

Biostatistics Department, Delegation for Clinical Research and Innovation, Lille Catholic Hospitals, GHICL, 59462 Lomme, France

⁵

Institute of Statistics and Applied Mathematics of the Romanian Academy, 050711 Bucharest, Romania

⁶

Inria Lille Nord-Europe, MODAL, 59655 Villeneuve d’Ascq, France

^*

Author to whom correspondence should be addressed.

Mathematics 2020, 8(4), 576; https://doi.org/10.3390/math8040576

Submission received: 18 March 2020 / Revised: 7 April 2020 / Accepted: 9 April 2020 / Published: 13 April 2020

(This article belongs to the Special Issue Probability, Statistics and Their Applications)

Abstract

The one dimensional discrete scan statistic is considered over sequences of random variables generated by block factor dependence models. Viewed as the maximum of a 1-dependent stationary sequence, the distribution of the scan statistic is approximated accurately and sharp error bounds are provided. The longest increasing run statistic is related to the scan statistic and its distribution is studied. The moving average process is a particular case of a block factor model, and the distribution of the associated scan statistic is approximated. Numerical results are presented.

Keywords:

scan statistics; 1-dependence; runs; approximation; simulation

1. Introduction

There are many situations in which an investigator observes an accumulation of events of interest and wants to decide whether such a realisation is due to chance or not. These problems belong to the class of cluster detection problems, where the basic idea is to identify regions that are unexpected or anomalous with respect to the distribution of events. Depending on the application domain, these anomalous agglomerations of events can correspond to a diversity of phenomena: for example, one may want to find clusters of stars, deposits of precious metals, outbreaks of disease, minefields, defective batches of pieces and many other possibilities. If such an observed accumulation of events exceeds a preassigned threshold, usually determined from a specified significance level corresponding to a normal situation (the null hypothesis), then it is legitimate to say that we have an unexpected cluster and proper measures have to be taken accordingly.

Searching for unusual clusters of events is of great importance in many scientific and technological fields, including DNA sequence analysis ([1,2]), brain imaging ([3]), target detection in sensor networks ([4,5]), astronomy ([6,7]), reliability theory and quality control ([8]), among many other domains. One of the tools used by practitioners to decide on the unusualness of such agglomerations of events is the scan statistic. Basically, tests based on scan statistics look for events that are clustered amongst a background of those that are sporadic.

Let

2 \leq m \leq T

be two positive integers and

X_{1}, \dots, X_{T}

be a sequence of independent and identically distributed random variables with the common distribution

F_{0}

. The one dimensional discrete scan statistics is defined as

S_{m} (T) = max_{1 \leq i \leq T - m + 1} W_{i},

(1)

where the random variables

W_{i}

are the moving sums of length m given by

W_{i} = \sum_{j = i}^{i + m - 1} X_{j} .

(2)
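Definitions (1) and (2) can be illustrated with a short Python sketch. The function names `moving_sums` and `scan_statistic` and the example sequence are illustrative choices, not part of the original text.

```python
def moving_sums(x, m):
    """Return W_1, ..., W_{T-m+1}: the sums over each window of m consecutive values."""
    if not 1 <= m <= len(x):
        raise ValueError("window length m must satisfy 1 <= m <= len(x)")
    w = sum(x[:m])
    sums = [w]
    for i in range(m, len(x)):
        w += x[i] - x[i - m]  # slide the window one position to the right
        sums.append(w)
    return sums


def scan_statistic(x, m):
    """Return the scan statistic of equation (1): the largest moving sum."""
    return max(moving_sums(x, m))


# Illustrative 0/1 sequence of length T = 10, scanned with a window of length m = 3.
x = [0, 1, 0, 1, 1, 1, 0, 0, 1, 0]
print(moving_sums(x, 3))     # [1, 2, 2, 3, 2, 1, 1, 1]
print(scan_statistic(x, 3))  # 3
```

Updating the window sum incrementally keeps the cost linear in T rather than proportional to T times m.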

Usually, the statistical tests based on the one dimensional discrete scan statistics are employed when one wants to detect a local change in the signal within a sequence of T observations via testing the null hypothesis of uniformity,

H_{0}

, against a cluster alternative,

H_{1}

(see References [9,10]). Under

H_{0}

, the random observations

X_{1}, \dots, X_{T}

are i.i.d. distributed as

F_{0}

, while under the alternative hypothesis, there exists a location

1 \leq i_{0} \leq T - m + 1

where

X_{i}

,

i \in {i_{0}, \dots, i_{0} + m - 1}

, are distributed according to

F_{1} \neq F_{0}

and outside this region

X_{i}

are distributed as

F_{0}

.

We observe that whenever

S_{m} (T)

exceeds the threshold

τ

, where the value of

τ

is computed based on the relation

P_{H_{0}} (S_{m} (T) \geq τ) = α

and

α

is a preassigned significance level of the testing procedure, the generalized likelihood ratio test rejects the null hypothesis in favor of the clustering alternative (see Reference [9]). It is interesting to note that most of the research has been done for

F_{0}

being binomial, Poisson or normal distribution (see References [9,10,11,12,13]). More recently, Reference [14] proposed a testing procedure based on one-dimensional scan statistic for geometric and negative binomial distributions.

There are three main approaches used for investigating the exact distribution of the one dimensional discrete scan statistics—the combinatorial methods ([12,15]), the Markov chain imbedding technique ([16,17]) and the conditional probability generating function method ([18,19]). Due to the high complexity and the limited range of application of the exact formulas, a considerable number of approximations and bounds have been developed for the estimation of the distribution of the one dimensional discrete scan statistics, for example, References [9,12,13,20]. A full treatment of these results is presented in References [10,11].

Even if in general the

X_{i}

’s are supposed to be i.i.d distributed, there are applications, such as detecting similarities between DNA sequences, where the

X_{i}

’s are not independent ([21]). In order to evaluate the effect of dependence, the alternative model is in many cases a Markov chain, whose entire dependence structure is determined by the joint distribution of two consecutive random variables.

In this work we introduce dependence models based on block-factors obtained from i.i.d. sequences in the context of the one dimensional discrete scan statistics. We derive approximations and their corresponding errors with application to the longest increasing run distribution and the moving average process.

The paper is structured as follows. In Section 2 we introduce the block factor model and present the approximation technique for the distribution of the scan statistics under this model. In Section 3 we consider two particular block factor models related to the scan statistics distribution: the length of the longest increasing run in a sequence of i.i.d. real random variables and the moving average process. Numerical results based on simulations illustrate the accuracy of the approximation. Concluding remarks end the paper.

2. One Dimensional Scan Statistics for Block-Factor Dependence Model

Most of the research devoted to the one dimensional discrete scan statistic considers the independent and identically distributed model for the random variables that generate the sequence which is to be scanned. In this section, we define a dependence structure for the underlying random sequence based on a block-factor type model.

2.1. The Block-Factor Dependence Model

Let us recall (see also Reference [22]) that the sequence

{X_{i}}_{i \geq 1}

of random variables with state space

S_{W}

is said to be a k block-factor of the sequence

{Y_{i}}_{i \geq 1}

with state space

S_{Y}

if there is a measurable function

f : S_{Y}^{k} \to S_{W}

such that

X_{i} = f (Y_{i}, Y_{i + 1}, \dots, Y_{i + k - 1}), \forall i \geq 1 .

(3)

Figure 1 presents the sequence

{X_{i}}_{i = 1, \dots, T}

of length T obtained as a k block-factor from a sequence

{Y_{i}}_{i = 1, \dots, \tilde{T}}

of length

\tilde{T} = T + k - 1

through some function f.

As an example of block-factor model, in Reference [23], the authors consider an i.i.d. sequence

{Y_{n}}_{n \geq 1}

of standard normal distributed random variables and the 2 block-factor defined by

X_{i} = a Y_{i} + b Y_{i + 1}, i \geq 1, for f (x, y) = a x + b y, a, b \in R .

Therefore, due to the overlapping structure of

X_{i}

and

X_{i + 1}

, they obtain a Gaussian stationary process

{X_{i}}_{i \geq 1}

with some correlation structure for which the scan statistics distribution is studied.
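A k block-factor as in (3) amounts to applying f to every window of k consecutive values of the underlying sequence. The sketch below illustrates this for the linear 2 block-factor just described. The helper name `block_factor` and the coefficient values a = b = 0.5 are assumptions made for the example.

```python
import random


def block_factor(y, k, f):
    """Return X_i = f(Y_i, ..., Y_{i+k-1}) for every window of k consecutive values of y."""
    return [f(*y[i:i + k]) for i in range(len(y) - k + 1)]


random.seed(0)
a, b = 0.5, 0.5  # illustrative coefficients, not taken from the source
T = 10
# A sequence X of length T needs T + k - 1 underlying values; here k = 2.
y = [random.gauss(0.0, 1.0) for _ in range(T + 1)]
x = block_factor(y, 2, lambda u, v: a * u + b * v)
print(len(x))  # 10
```

Consecutive values X_i and X_{i+1} share the term Y_{i+1}, which is the source of the dependence, while X_i and X_{i+2} share no underlying variable.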

More generally, let us observe that if a sequence

{X_{i}}_{i \geq 1}

of random variables is a k block-factor, then the sequence is

(k - 1)

-dependent. Recall that a sequence

{X_{i}}_{i \geq 1}

is m-dependent with

m \geq 1

(see Reference [22]), if for any

h \geq 1

the

σ

-fields generated by

{X_{1}, \dots, X_{h}}

and

{X_{h + m + 1}, \dots}

are independent.

2.2. Scan Statistics Viewed as Maximum of 1-Dependent Sequence

Let

{X_{1}, \dots, X_{T}}

be a k-block factor of the i.i.d. sequence

{Y_{1}, \dots, Y_{\tilde{T}}}

, where

\tilde{T} \geq k

and

T = \tilde{T} - k + 1

, and

S_{m} (T)

be the scan statistics associated to the sequence

{X_{i}}_{i = 1, \dots, T}

as defined in (1) for some scanning window of length m,

1 \leq m \leq T

.

Put

T = (L - 1) (m + k - 2) + m - 1

for some integer

L \geq 1

and define, for each

j \in {1, \dots, L - 1}

, the random variables

Z_{j} = max_{(j - 1) (m + k - 2) + 1 \leq i \leq j (m + k - 2)} W_{i},

(4)

where

W_{i} = \sum_{s = i}^{i + m - 1} X_{s}

. That is, for each

j \in {1, \dots, L - 1}

,

Z_{j}

is the scan statistic associated to the sequence of length

2 m + k - 3

,

{X_{(j - 1) (m + k - 2) + 1}, \dots, X_{j (m + k - 2) + m - 1}}

. An illustration of the construction of variables

Z_{j}

is presented in Figure 2 for

L = 5

and

k = 1

.

Then,

{Z_{j}}_{j = 1, \dots, L - 1}

is 1-dependent and stationary and, with

S (m, T)

denoting the scan statistic

S_{m} (T)

defined in (1), we have that

S (m, T) = max_{1 \leq j \leq L - 1} Z_{j} .

(5)

Thus, for any block factor model obtained from an i.i.d sequence, the distribution of the associated scan statistics is the distribution of the maximum of some 1-dependent stationary sequence.
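The decomposition in (4) and (5) can be checked numerically. The sketch below builds an illustrative 2 block-factor of 0/1 variables, forms the block maxima Z_j and confirms that their maximum equals the scan statistic. The function names and the parameter values m = 4, k = 2, L = 6 are assumptions made for the example.

```python
import random


def moving_sums(x, m):
    """Return the sums over each window of m consecutive values of x."""
    return [sum(x[i:i + m]) for i in range(len(x) - m + 1)]


def block_maxima(x, m, k):
    """Return Z_1, ..., Z_{L-1} of (4): maxima of W_i over consecutive blocks of m + k - 2 windows."""
    w = moving_sums(x, m)
    step = m + k - 2
    if step < 1 or len(w) % step != 0:
        raise ValueError("need len(x) = (L - 1)(m + k - 2) + m - 1 for an integer L")
    return [max(w[s:s + step]) for s in range(0, len(w), step)]


random.seed(1)
m, k, L = 4, 2, 6                          # illustrative values
T = (L - 1) * (m + k - 2) + m - 1          # length of the X sequence
y = [random.randint(0, 1) for _ in range(T + k - 1)]
x = [y[i] + y[i + 1] for i in range(T)]    # an illustrative 2 block-factor
z = block_maxima(x, m, k)
print(len(z), max(z) == max(moving_sums(x, m)))  # 5 True
```

Each Z_j depends on the underlying values with indices from (j - 1)(m + k - 2) + 1 to j(m + k - 2) + m + k - 2, so only neighbouring blocks overlap, which is why the sequence is 1-dependent.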

2.3. Approximation

In References [24,25] the authors extended the approximation results obtained in Reference [26] for the distribution of the maximum of a 1-dependent stationary sequence. The main result is stated in the following theorem.

Let

{Z_{j}}_{j \geq 1}

be a strictly stationary 1-dependent sequence of random variables and for

x < sup {u | P (Z_{1} \leq u) < 1}

, let

q_{n} = q_{n} (x) = P (max (Z_{1}, \dots, Z_{n}) \leq x) .

(6)

Theorem 1.

For all x such that

q_{1} (x) \geq 1 - α \geq 0.9

, the following approximation formula holds:

|q_{n} - \frac{2 q_{1} - q_{2}}{{[1 + q_{1} - q_{2} + 2 {(q_{1} - q_{2})}^{2}]}^{n}}| \leq n F (α, n) {(1 - q_{1})}^{2}

(7)

with

F (α, n) = 1 + \frac{3}{n} + [\frac{Γ (α)}{n} + K (α)] (1 - q_{1})

(8)

where

Γ (α) = L (α) + E (α)

,

\begin{matrix} K (α) & = \frac{\frac{11 - 3 α}{{(1 - α)}^{2}} + 2 l (1 + 3 α) \frac{2 + 3 l α - α (2 - l α) {(1 + l α)}^{2}}{{[1 - α {(1 + l α)}^{2}]}^{3}}}{1 - \frac{2 α (1 + l α)}{{[1 - α {(1 + l α)}^{2}]}^{2}}} \end{matrix}

(9)

\begin{matrix} L (α) & = 3 K (α) (1 + α + 3 α^{2}) [1 + α + 3 α^{2} + K (α) α^{3}] + α^{6} K^{3} (α) \\ + 9 α (4 + 3 α + 3 α^{2}) + 55.1 \end{matrix}

(10)

\begin{matrix} E (α) & = \frac{η^{5} {[1 + (1 - 2 α) η]}^{4} [1 + α (η - 2)] [1 + η + (1 - 3 α) η^{2}]}{2 {(1 - α η^{2})}^{4} [{(1 - α η^{2})}^{2} - α η^{2} {(1 + η - 2 α η)}^{2}]} \end{matrix}

(11)

and where

η = 1 + l α

with

l = l (α) = t_{2}^{3} (α) + ε

, for arbitrarily small

ε > 0

, and

t_{2} (α)

the second root in magnitude of the equation

α t^{3} - t + 1 = 0

.

The evaluation of the functions K and

Γ

for some selected values of

α

is presented in Table 1. These values allow one to compute directly the error bound in (7).
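The constants of Theorem 1 can be evaluated with a few lines of Python. This is a sketch under two assumptions: the function names are illustrative, and the root t_2(α) is taken to be the root of α t^3 - t + 1 = 0 lying in [1, 1.5], a choice that reproduces the l(α) column of Table 1 for α ≤ 0.1.

```python
def l_const(alpha, eps=1e-6):
    """l(alpha) = t_2(alpha)^3 + eps, with t_2 located by bisection on [1, 1.5]."""
    if not 0.0 < alpha <= 0.1:
        raise ValueError("this sketch assumes 0 < alpha <= 0.1")
    lo, hi = 1.0, 1.5  # alpha*t^3 - t + 1 is positive at lo and negative at hi
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if alpha * mid ** 3 - mid + 1.0 > 0.0:
            lo = mid
        else:
            hi = mid
    return (0.5 * (lo + hi)) ** 3 + eps


def K_const(alpha):
    """K(alpha) of equation (9)."""
    l = l_const(alpha)
    eta = 1 + l * alpha
    d = 1 - alpha * eta ** 2
    num = ((11 - 3 * alpha) / (1 - alpha) ** 2
           + 2 * l * (1 + 3 * alpha)
           * (2 + 3 * l * alpha - alpha * (2 - l * alpha) * eta ** 2) / d ** 3)
    return num / (1 - 2 * alpha * eta / d ** 2)


def L_const(alpha):
    """L(alpha) of equation (10)."""
    K = K_const(alpha)
    c = 1 + alpha + 3 * alpha ** 2
    return (3 * K * c * (c + K * alpha ** 3) + alpha ** 6 * K ** 3
            + 9 * alpha * (4 + 3 * alpha + 3 * alpha ** 2) + 55.1)


def E_const(alpha):
    """E(alpha) of equation (11)."""
    eta = 1 + l_const(alpha) * alpha
    d = 1 - alpha * eta ** 2
    num = (eta ** 5 * (1 + (1 - 2 * alpha) * eta) ** 4
           * (1 + alpha * (eta - 2)) * (1 + eta + (1 - 3 * alpha) * eta ** 2))
    den = 2 * d ** 4 * (d ** 2 - alpha * eta ** 2 * (1 + eta - 2 * alpha * eta) ** 2)
    return num / den


def Gamma_const(alpha):
    """Gamma(alpha) = L(alpha) + E(alpha)."""
    return L_const(alpha) + E_const(alpha)


def F(alpha, n, q1):
    """F(alpha, n) of equation (8); q1 = P(Z_1 <= x)."""
    return 1 + 3 / n + (Gamma_const(alpha) / n + K_const(alpha)) * (1 - q1)


for alpha in (0.1, 0.05, 0.025, 0.01):
    print(alpha, round(l_const(alpha), 4), round(K_const(alpha), 2),
          round(Gamma_const(alpha), 2))
```

The right-hand side of (7) is then `n * F(alpha, n, q1) * (1 - q1) ** 2`.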

Applying Theorem 1 to the sequence

{Z_{j}}_{j = 1, \dots, L - 1}

defined in (4), from (5) we obtain an approximation and the associated error bound for

P (S (m, T) \leq x)

in the following way. Put for

s \in {2, 3}

,

Q_{s} = Q_{s} (x) = P (⋂_{j = 1}^{s - 1} {Z_{j} \leq x})

(12)

and observe that using the notation of (6) we have

Q_{s} = q_{s - 1}

. For x such that

Q_{2} (x) \geq 1 - α \geq 0.9

we apply the result from Theorem 1 to obtain the approximation

P (S (m, T) \leq x) \approx \frac{2 Q_{2} - Q_{3}}{{[1 + Q_{2} - Q_{3} + 2 {(Q_{2} - Q_{3})}^{2}]}^{(L - 1)}},

(13)

with an error bound of about

(L - 1) F (α, L - 1) {(1 - Q_{2})}^{2}

. Observe that

Q_{2}

and

Q_{3}

represent the distribution functions of the scan statistics over sequences

X_{i}

of lengths

2 m + k - 3

and respectively

3 m + 2 k - 5

.

Q_{2}

and

Q_{3}

are generally estimated by Monte Carlo simulation.

Thus, if

{\hat{Q}}_{s}

is an estimate of

Q_{s}

,

s \in {2, 3}

, with

|{\hat{Q}}_{s} - Q_{s}| \leq β_{s}

and x is such that

1 - {\hat{Q}}_{2} (x) \leq 0.1

then

|P (S (m, T) \leq x) - (2 {\hat{Q}}_{2} - {\hat{Q}}_{3}) {[1 + {\hat{Q}}_{2} - {\hat{Q}}_{3} + 2 {({\hat{Q}}_{2} - {\hat{Q}}_{3})}^{2}]}^{1 - L}| \leq E_{t o t a l},

(14)

where

E_{t o t a l}

is the total error of the approximation given by

E_{t o t a l} = (L - 1) [β_{2} + β_{3} + F ({\hat{Q}}_{2}, L - 1) {(1 - {\hat{Q}}_{2} + β_{2})}^{2}] .

(15)
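Once estimates of Q_2 and Q_3 are available, the approximation in (13) and the total error in (15) reduce to simple arithmetic. In the sketch below the function names are illustrative, and the value of F appearing in (15) is passed in as a number (`F_value`) rather than recomputed, since its evaluation is the caller's choice. The input values in the usage lines are made up for illustration and are not results from the paper.

```python
def scan_cdf_approx(q2, q3, L):
    """Right-hand side of (13): approximation of P(S(m, T) <= x)."""
    d = q2 - q3
    return (2 * q2 - q3) / (1 + d + 2 * d * d) ** (L - 1)


def total_error(q2_hat, beta2, beta3, L, F_value):
    """E_total of (15), given simulation error bounds beta2, beta3 and a value of F."""
    return (L - 1) * (beta2 + beta3 + F_value * (1 - q2_hat + beta2) ** 2)


# Made-up inputs, for illustration only.
print(scan_cdf_approx(0.99, 0.985, L=11))
print(total_error(0.99, 0.001, 0.001, L=11, F_value=2.0))
```

The simulation errors β_2 and β_3 enter (15) linearly and are multiplied by L - 1, so for long sequences the precision of the estimates of Q_2 and Q_3 tends to dominate the total error.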

One of the main advantages of this approximation method with respect to the product-type approximation proposed in Reference [12], which uses the same quantities

Q_{2}

and

Q_{3}

, is that it provides sharp error bounds for the approximation.

3. Some Related Problems to the Scan Statistics under Block-Factor Dependence Model

In order to illustrate the efficiency of the approximation (14) and the obtained error bounds, in this section we present two examples of statistics related to discrete scan statistics.

3.1. Length of the Longest Increasing Run in an i.i.d. Sequence

Let

Y_{1}

,

Y_{2}

, …,

Y_{\tilde{T}}

be a sequence of length

\tilde{T}

,

\tilde{T} \geq 1

, of independent and identically distributed random variables with the common distribution F. We say that the subsequence

(Y_{k}, \dots, Y_{k + l - 1})

forms an increasing run (or ascending run) of length

l \geq 1

, starting at position

k \geq 1

, if it verifies the following relation

Y_{k - 1} > Y_{k} < Y_{k + 1} < \dots < Y_{k + l - 1} > Y_{k + l} .

(16)

We denote the length of the longest increasing run among the first

\tilde{T}

random variables by

M_{\tilde{T}}

. This run statistic plays an important role in many applications in fields such as computer science, reliability theory or quality control. The asymptotic behaviour of

M_{\tilde{T}}

has been investigated by several authors depending on the common distribution, F. In the case of a continuous distribution, Reference [27] (see also Reference [28]) has shown that this behaviour does not depend on the common law. For the particular setting of uniform

U ([0, 1])

random variables, this problem was addressed by References [29,30,31]. Under the assumption that the distribution F is discrete, the limit behaviour of

M_{\tilde{T}}

depends strongly on the common law F, as Reference [32] (see also References [33,34]) proved for the case of geometric and Poisson distributions. In Reference [35], the case of the discrete uniform distribution is investigated, while in Reference [36], the authors study the asymptotic distribution of

M_{\tilde{T}}

when the variables are uniformly distributed but not independent.

In this section, we evaluate the distribution of the length of the longest increasing run using the methodology developed in Section 2. The idea is to express the distribution of the random variable

M_{\tilde{T}}

in terms of the distribution of the scan statistics random variable.

Let

T = \tilde{T} - 1

and define the block-factor transformation

f : R^{2} \to R

by

f (x, y) = \{\begin{matrix} 1, if x < y \\ 0, otherwise . \end{matrix}

(17)

Then, our block-factor model becomes

X_{i} = f (Y_{i}, Y_{i + 1}) = 1_{Y_{i} < Y_{i + 1}} .

(18)

and

X_{1}

, …,

X_{T}

form a 1-dependent and stationary sequence of random variables.

Notice that the distribution of

M_{\tilde{T}}

and the distribution of the length of the longest run of ones,

L_{T}

, among the first T binary random variables

X_{i}

, are related and satisfy the following identity

P (M_{\tilde{T}} \leq m) = P (L_{T} < m), for m \geq 1 .

(19)

The statistics

L_{T}

is also known as the length of the longest success run or head run and was extensively studied in the literature. One can consult the monographs of References [16,17] for applications and further results concerning this statistic. Moreover, the random variable

L_{T}

can be interpreted as a particular case of the scan statistics random variable and between the two we have the relation

P (L_{T} \geq m) = P (S (m, T) = m) .

(20)

Hence, combining (19) and (20), we can express the distribution of the length of the longest increasing run as

P (M_{\tilde{T}} \leq m) = P (S (m, T) < m) .

(21)

Thus, we can estimate the distribution of

M_{\tilde{T}}

using the foregoing identity and the approximations developed in Section 2 for the discrete scan statistics random variable.
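The chain of identities (19) to (21) can be checked on a simulated sample. The sketch below draws uniform values, applies the indicator block-factor (18) and verifies that the longest increasing run is one more than the longest run of ones, and that the event in (21) coincides on both sides for several window lengths. The function names, the sample size and the seed are illustrative choices.

```python
import random


def longest_increasing_run(y):
    """Length of the longest strictly increasing run of consecutive values in y."""
    best = cur = 1
    for prev, nxt in zip(y, y[1:]):
        cur = cur + 1 if prev < nxt else 1
        best = max(best, cur)
    return best


def longest_run_of_ones(x):
    """Length of the longest run of consecutive ones in the 0/1 sequence x."""
    best = cur = 0
    for v in x:
        cur = cur + 1 if v == 1 else 0
        best = max(best, cur)
    return best


def scan_statistic(x, m):
    """Largest sum over windows of m consecutive values of x."""
    return max(sum(x[i:i + m]) for i in range(len(x) - m + 1))


random.seed(2)
y = [random.random() for _ in range(200)]             # i.i.d. U([0, 1]) sample
x = [1 if u < v else 0 for u, v in zip(y, y[1:])]     # block-factor (18)
M, Lrun = longest_increasing_run(y), longest_run_of_ones(x)
print(M == Lrun + 1)
print(all((M <= m) == (scan_statistic(x, m) < m) for m in range(1, 10)))
```

An increasing run of length l in the Y sequence corresponds to l - 1 consecutive ones in the X sequence, which explains the shift between the two statistics.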

We should also note that in Reference [30] the authors studied the asymptotic behaviour of

L_{T}

over a sequence of m-dependent binary random variables. They showed that, given a stationary m-dependent sequence of random variables with values 0 and 1,

{\{V_{k}\}}_{k \geq 1}

, if there exist positive constants t, C such that

P (V_{k + 1} = 1 | V_{1} = \dots = V_{k} = 1) \geq \frac{1}{C k^{t}}, for all k \geq C,

(22)

then, as

N \to \infty

max_{1 \leq k \leq N} |P (L_{N} < k) - e^{- N r (k)}| = O (\frac{{(ln N)}^{h}}{N}),

(23)

where

r (k) = P (V_{1} = \dots = V_{k} = 1) - P (V_{1} = \dots = V_{k + 1} = 1)

and

h = sup {m t, 1}

.

In order to illustrate the accuracy of the approximation of

M_{\tilde{T}}

based on scan statistics, using the methodology developed in Section 2, we consider that the random variables

Y_{i}

’s have a common uniform

U ([0, 1])

distribution. Simple calculations show that

P (X_{1} = \dots = X_{k} = 1) = \frac{1}{(k + 1)!}

and

P (X_{k + 1} = 1 | X_{1} = \dots = X_{k} = 1) = \frac{1}{k + 2} \geq \frac{1}{2 k},

(24)

thus

C = 2

,

t = 1

and

h = 1

. In the context of our particular situation, the result of Reference [30] in Equation (23) becomes:

max_{1 \leq m \leq T} |P (L_{T} < m) - e^{- T r (m)}| = O (\frac{ln T}{T}),

(25)

where

r (m) = P (X_{1} = \dots = X_{m} = 1) - P (X_{1} = \dots = X_{m + 1} = 1) = \frac{m + 1}{(m + 2)!}

.
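The limit approximation in (25) is straightforward to evaluate. The sketch below computes r(m) and exp(-T r(m)) for T = 10000, which corresponds to the setting of Table 2 through identity (19). The function names are illustrative.

```python
import math


def r(m):
    """r(m) = 1/(m+1)! - 1/(m+2)! = (m+1)/(m+2)! for uniform U([0, 1]) variables."""
    return (m + 1) / math.factorial(m + 2)


def limit_cdf(m, T):
    """exp(-T r(m)): the approximation of P(L_T < m) appearing in (25)."""
    return math.exp(-T * r(m))


# T = 10000 corresponds to a sequence of 10001 uniform variables.
for m in range(5, 11):
    print(m, f"{limit_cdf(m, 10000):.8f}")
```

For m = 6 and m = 7 this evaluates to about 0.176204 and 0.802151, in agreement with the LimApp column of Table 2.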

In Table 2, we consider a numerical comparison study between the simulated value (column

S i m

) obtained by Monte Carlo simulation with

I T E R_{s i m} = 10^{4}

trials, the approximation based on scan statistics (column

A p p

) computed from the Equation (14) where

{\hat{Q}}_{2}

and

{\hat{Q}}_{3}

are computed with

I T E R_{a p p} = 10^{5}

trials and the limit distribution (column

L i m A p p

) of the distribution of the length of the longest increasing run,

P (M_{\tilde{T}} \leq m)

, in a sequence of

\tilde{T} = 10001

random variables distributed uniformly over

[0, 1]

. The results show that both our method and the asymptotic approximation in (25) are very accurate. It is worth mentioning that for our simulations we used an adapted version of the Importance Sampling procedure introduced in Reference [3], an efficient method that proved to perform very well for small p values (where naive Monte Carlo methods tend to break down) [37].

3.2. The Moving Average-Like Process of Order q Model

We consider the particular situation of the one dimensional discrete scan statistics defined over a sequence of random variables obtained as a linear block factor of a discrete Gaussian white noise. Because of the similarity with the definition of a classical moving average process, we call this block factor model a moving average-like process. It is worth mentioning that the distribution of the scan statistics in the context of a moving average process for normal data was studied in Reference [38], where the authors compared the product-type approximation developed in Reference [13] with the approximation of Reference [23]. In the block-factor model introduced in (3), let

q \geq 1

be a positive integer and

Y_{1}

,

Y_{2}

, …,

Y_{\tilde{T}}

be a sequence of independent and identically Gaussian distributed random variables with known mean

μ

and variance

σ^{2}

.

Let

a = (a_{1}, \dots, a_{q + 1}) \in R^{q + 1}

be a fixed non null vector and take

f : R^{q + 1} \to R

, the (measurable) transformation that defines the block-factor model, to be

f (y_{1}, \dots, y_{q + 1}) = a_{1} y_{1} + a_{2} y_{2} + \dots + a_{q + 1} y_{q + 1} .

(26)

For

i \in {1, \dots, T}

, with

T = \tilde{T} - q

, our dependent model is defined by the relation

X_{i} = a_{1} Y_{i} + a_{2} Y_{i + 1} + \dots + a_{q + 1} Y_{i + q} .

(27)

The moving sums of size m,

W_{i}

,

1 \leq i \leq T - m + 1

can be expressed as

W_{i} = \sum_{j = i}^{i + m - 1} X_{j} = b_{1} Y_{i} + b_{2} Y_{i + 1} + \dots + b_{m + q} Y_{i + m - 1 + q},

(28)

where the coefficients

b_{1}

, …,

b_{m + q}

are evaluated by

(a) For $m \geq q$,

$b_{k} = \{\begin{matrix} \sum_{j = 1}^{k} a_{j} & , k \in {1, \dots, q} \\ \sum_{j = 1}^{q + 1} a_{j} & , k \in {q + 1, \dots, m} \\ \sum_{j = k - m + 1}^{q + 1} a_{j} & , k \in {m + 1, \dots, m + q} \end{matrix}$

(29)

(b) For $m < q$,

$b_{k} = \{\begin{matrix} \sum_{j = 1}^{k} a_{j} & , k \in {1, \dots, m} \\ \sum_{j = k - m + 1}^{k} a_{j} & , k \in {m + 1, \dots, q} \\ \sum_{j = k - m + 1}^{q + 1} a_{j} & , k \in {q + 1, \dots, m + q} . \end{matrix}$

(30)
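Both cases (29) and (30) can be written as a single rule: b_k is the sum of a_j over max(1, k - m + 1) ≤ j ≤ min(k, q + 1). The sketch below implements this unified rule next to the piecewise formulas so that the two can be compared. The function names are illustrative, and the unified rule is a restatement derived from (27) and (28), not a formula quoted from the source.

```python
def window_coefficients(a, m):
    """b_1, ..., b_{m+q}: b_k sums a_j over max(1, k-m+1) <= j <= min(k, q+1)."""
    q = len(a) - 1
    return [sum(a[max(1, k - m + 1) - 1:min(k, q + 1)]) for k in range(1, m + q + 1)]


def window_coefficients_piecewise(a, m):
    """The same coefficients, computed case by case as in (29) and (30)."""
    q = len(a) - 1

    def s(lo, hi):
        return sum(a[lo - 1:hi])  # a_lo + ... + a_hi with 1-based, inclusive indices

    b = []
    for k in range(1, m + q + 1):
        if m >= q:                       # case (a), equation (29)
            if k <= q:
                b.append(s(1, k))
            elif k <= m:
                b.append(s(1, q + 1))
            else:
                b.append(s(k - m + 1, q + 1))
        else:                            # case (b), equation (30)
            if k <= m:
                b.append(s(1, k))
            elif k <= q:
                b.append(s(k - m + 1, k))
            else:
                b.append(s(k - m + 1, q + 1))
    return b


print(window_coefficients([1, 2, 3], 2))             # [1, 3, 5, 3]
print(window_coefficients_piecewise([1, 2, 3], 2))   # [1, 3, 5, 3]
```

Each a_j appears in exactly m of the coefficients, so the b_k always sum to m times the sum of the a_j, which gives a quick sanity check.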

Therefore, for each

i \in {1, \dots, T - m + 1}

, the random variable

W_{i}

follows a normal distribution with mean

E [W_{i}] = (b_{1} + \dots + b_{m + q}) μ

and variance

V a r [W_{i}] = (b_{1}^{2} + \dots + b_{m + q}^{2}) σ^{2}

. Moreover, a simple calculation shows that the covariance matrix

Σ = {C o v [W_{t}, W_{s}]}

has the entries

C o v [W_{t}, W_{s}] = \{\begin{matrix} (\sum_{j = 1}^{m + q - | t - s |} b_{j} b_{| t - s | + j}) σ^{2} & , | t - s | \leq m + q - 1 \\ 0 & , otherwise . \end{matrix}

(31)
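Formula (31) can be checked against a direct computation that expands each moving sum in terms of the underlying Y variables. In the sketch below the function names are illustrative, and the coefficients (0.3, 0.1, 0.5) with m = 20 are the values used in the numerical example of this section.

```python
def window_coefficients(a, m):
    """b_1, ..., b_{m+q}: b_k sums a_j over max(1, k-m+1) <= j <= min(k, q+1)."""
    q = len(a) - 1
    return [sum(a[max(1, k - m + 1) - 1:min(k, q + 1)]) for k in range(1, m + q + 1)]


def cov_w(a, m, lag, sigma2=1.0):
    """Cov[W_t, W_s] from (31), with lag = |t - s|."""
    b = window_coefficients(a, m)
    d = abs(lag)
    if d > len(b) - 1:
        return 0.0
    return sigma2 * sum(b[j] * b[j + d] for j in range(len(b) - d))


def cov_w_direct(a, m, t, s, sigma2=1.0):
    """Same covariance, obtained by expanding W_t and W_s over the Y variables."""
    q = len(a) - 1
    n = max(t, s) + m + q  # enough Y variables to cover both windows

    def coeffs(i):
        c = [0.0] * n
        for j in range(i, i + m):          # W_i = X_i + ... + X_{i+m-1}
            for l, a_l in enumerate(a):    # X_j = a_1 Y_j + ... + a_{q+1} Y_{j+q}
                c[j - 1 + l] += a_l
        return c

    return sigma2 * sum(u * v for u, v in zip(coeffs(t), coeffs(s)))


a, m = (0.3, 0.1, 0.5), 20
print(round(cov_w(a, m, 0), 6))   # variance of W_i for sigma^2 = 1
print(round(cov_w(a, m, 1), 6))   # covariance at lag 1
```

With these coefficients the variance is 15.44 and the lag-1 covariance is 15.09 (for σ² = 1), so neighbouring moving sums are very strongly correlated.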

Given the mean and the covariance matrix of the vector

(W_{1}, \dots, W_{T - m + 1})

, one can use the importance sampling algorithm developed in Reference [3] (see also Reference [37]) or the one presented in Reference [39] to estimate the distribution of the one dimensional discrete scan statistics

S (m, T)

. Another way is to use the quasi-Monte Carlo algorithm developed in Reference [40] to approximate the multivariate normal distribution.

In our application example we adopt the importance sampling procedure developed in Reference [3]. In order to evaluate the accuracy of the approximation developed in (14), we consider

q = 2

,

T = 1000

,

m = 20

,

Y_{i} \sim N (0, 1)

and the coefficients of the moving average model to be

(a_{1}, a_{2}, a_{3}) = (0.3, 0.1, 0.5)

. We compare our approximation (column

A p p

) given in (14) with the one (column AppPT) given in Reference [41] using product-type approximations. In Table 3, we present numerical results for the setting described above. In our algorithms we used

I T E R_{a p p} = 10^{6}

trials for the computation of

{\hat{Q}}_{2} (x)

and

{\hat{Q}}_{3} (x)

and

I T E R_{s i m} = 10^{5}

trials for the Monte Carlo simulation of

P (S (m, T) \leq x)

.
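For readers who want a quick plausibility check of this setting, a plain Monte Carlo sketch follows. It deliberately uses naive simulation rather than the importance sampling procedure of Reference [3], with only 1000 trials, so its estimates are far less precise than the values reported in Table 3. The function names, the seed and the number of trials are illustrative choices.

```python
import random


def simulate_scan_max(a, m, T, trials, seed=0):
    """Simulate S(m, T) for X_i = a_1 Y_i + ... + a_{q+1} Y_{i+q}, with Y_i i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    q = len(a) - 1
    out = []
    for _ in range(trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(T + q)]
        x = [sum(a[l] * y[i + l] for l in range(q + 1)) for i in range(T)]
        w = sum(x[:m])
        best = w
        for i in range(m, T):
            w += x[i] - x[i - m]  # slide the window one position to the right
            if w > best:
                best = w
        out.append(best)
    return out


def empirical_cdf(samples, x):
    """Fraction of simulated scan statistics that are <= x."""
    return sum(s <= x for s in samples) / len(samples)


samples = simulate_scan_max((0.3, 0.1, 0.5), m=20, T=1000, trials=1000)
print(empirical_cdf(samples, 11))
```

With 1000 trials the standard error of the estimate at x = 11 is roughly 0.016, so agreement with Table 3 can only be expected to about two decimal places.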

4. Conclusions

Block factor models defined from an i.i.d. sequence generate random sequences with a particular type of dependence structure. For this type of dependence, the scan statistic can be viewed as the maximum of a 1-dependent stationary sequence, whose distribution can be approximated with high accuracy. The approximation error can be controlled by using efficient simulation algorithms, for example the importance sampling approach proposed in Reference [3] (see also Reference [25]). We approximated the distribution of the longest increasing run statistic over an i.i.d. sequence as a particular case of the scan statistics distribution over a block factor model.

Author Contributions

Conceptualization, A.A. and C.P.; methodology, A.A. and C.P.; software, A.A.; validation, A.A. and C.P.; formal analysis, A.A. and C.P.; writing original draft preparation, C.P.; writing review and editing, A.A. and C.P.; visualization, A.A. and C.P.; supervision, A.A. and C.P.; project administration, C.P.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by a grant of the Romanian National Authority for Scientific Research and Innovation, project number POC P-37-257 and MCI National Core Program, project 25 N/2019 BIODIVERS 19270103.

Acknowledgments

The authors wish to thank the anonymous reviewers for their careful reading of the manuscript and their helpful suggestions and comments which led to the improvement of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hoh, J.; Ott, J. Scan statistics to scan markers for susceptibility genes. Proc. Natl. Acad. Sci. USA 2000, 97, 9615–9617.
2. Sheng, K.-N.; Naus, J. Pattern matching between two non aligned random sequences. Bull. Math. Biol. 1994, 56, 1143–1162.
3. Naiman, D.Q.; Priebe, C.E. Computing scan statistic p values using importance sampling, with applications to genetics and medical image analysis. J. Comput. Graph. Stat. 2001, 10, 296–328.
4. Guerriero, M.; Pozdnyakov, V.; Glaz, J.; Willett, P. A repeated significance test with applications to sequential detection in sensor networks. IEEE Trans. Signal Process. 2010, 58, 3426–3435.
5. Guerriero, M.; Willett, P.; Glaz, J. Distributed target detection in sensor networks using scan statistics. IEEE Trans. Signal Process. 2009, 57, 2629–2639.
6. Darling, R.W.R.; Waterman, M.S. Extreme value distribution for the largest cube in a random lattice. SIAM J. Appl. Math. 1986, 46, 118–132.
7. Marcos, R.; Marcos, C. From star complexes to the field: Open cluster families. Astrophys. J. 2008, 672, 342–351.
8. Boutsikas, M.V.; Koutras, M.V. Reliability approximation for Markov chain imbeddable systems. Methodol. Comput. Appl. Probab. 2000, 2, 393–411.
9. Glaz, J.; Naus, J. Tight bounds and approximations for scan statistic probabilities for discrete data. Ann. Appl. Probab. 1991, 1, 306–318.
10. Glaz, J.; Naus, J.; Wallenstein, S. Scan Statistics; Springer Series in Statistics; Springer: New York, NY, USA, 2001.
11. Glaz, J.; Balakrishnan, N. Scan Statistics and Applications; Springer Sciences+Business Media: Berlin, Germany, 1999.
12. Naus, J. Approximations for distributions of scan statistics. J. Am. Stat. Assoc. 1982, 77, 177–183.
13. Wang, X.; Glaz, J.; Naus, J. Approximations and inequalities for moving sums. Methodol. Comput. Appl. Probab. 2012, 14, 597–616.
14. Chen, J.; Glaz, J. Scan statistics for monitoring data modeled by a negative binomial distribution. Commun. Stat. Theory Methods 2016, 45, 1632–1642.
15. Naus, J. Probabilities for a generalized birthday problem. J. Am. Stat. Assoc. 1974, 69, 810–815.
16. Balakrishnan, N.; Koutras, M.V. Runs and Scans with Applications; Wiley Series in Probability and Statistics; Wiley-Interscience [John Wiley & Sons]: New York, NY, USA, 2002.
17. Fu, J.C.; Lou, W. Distribution Theory of Runs and Patterns and Its Applications: A Finite Markov Chain Imbedding Approach; World Scientific Publishing Co., Inc.: River Edge, NJ, USA, 2003.
18. Ebneshahrashoob, M.; Gao, T.; Wu, M. An efficient algorithm for exact distribution of discrete scan statistics. Methodol. Comput. Appl. Probab. 2005, 7, 1423–1436.
19. Uchida, M. On generating functions of waiting time problems for sequence patterns of discrete random variables. Ann. Inst. Stat. Math. 1998, 50, 650–671.
20. Chen, J.; Glaz, J. Approximations and inequalities for the distribution of a scan statistic for 0-1 Bernoulli trials. Adv. Theory Pract. Stat. 1997, 1, 285–298.
21. Arratia, R.; Gordon, L.; Waterman, M.S. The Erdős-Rényi law in distribution for coin tossing and sequence matching. Ann. Stat. 1990, 18, 539–570.
22. Burton, R.M.; Goulet, M.; Meester, R. On one-dependent processes and k-block factors. Ann. Probab. 1993, 21, 2157–2168.
23. Haiman, G.; Preda, C. One dimensional scan statistics generated by some dependent stationary sequences. Stat. Probab. Lett. 2013, 83, 1457–1463.
24. Amărioarei, A. Approximation for the Distribution of Extremes of One Dependent Stationary Sequences of Random Variables. arXiv 2012, arXiv:1211.5456v1.
25. Amărioarei, A. Approximations for the Multidimensional Discrete Scan Statistics. Ph.D. Thesis, University of Lille, Lille, France, 2014.
26. Haiman, G. Estimating the distributions of scan statistics with high precision. Extremes 2000, 3, 349–361.
27. Pittel, B. Limiting behavior of a process of runs. Ann. Probab. 1981, 9, 119–129.
28. Frolov, A.; Martikainen, A. On the length of the longest increasing run in R^d. Stat. Prob. Lett. 1999, 41, 153–161.
29. Grill, K. Erdős-Révész type bounds for the length of the longest run from a stationary mixing sequence. Probab. Theory Relat. Fields 1987, 75, 169–179.
30. Novak, S. Longest runs in a sequence of m-dependent random variables. Probab. Theory Relat. Fields 1992, 91, 269–281.
31. Révész, P. Three problems on the length of increasing runs. Stochastic Process. Appl. 1983, 5, 169–179.
32. Csaki, E.; Foldes, A. On the length of the longest monotone block. Studia Scientiarum Mathematicarum Hungarica 1996, 31, 35–46.
33. Eryilmaz, S. A note on runs of geometrically distributed random variables. Discrete Math. 2006, 306, 1765–1770.
34. Grabner, P.; Knopfmacher, A.; Prodinger, H. Combinatorics of geometrically distributed random variables: Run statistics. Theoret. Comput. Sci. 2003, 297, 261–270.
35. Louchard, G. Monotone runs of uniformly distributed integer random variables: A probabilistic analysis. Theoret. Comput. Sci. 2005, 346, 358–387.
36. Mitton, N.; Paroux, K.; Sericola, B.; Tixeuil, S. Ascending runs in dependent uniformly distributed random variables: Application to wireless networks. Methodol. Comput. Appl. Probab. 2010, 12, 51–62.
37. Malley, J.; Naiman, D.Q.; Bailey-Wilson, J. A comprehensive method for genome scans. Hum. Heredity 2002, 54, 174–185.
38. Wang, X.; Zhao, B.; Glaz, J. A multiple window scan statistic for time series models. Stat. Probab. Lett. 2014, 94, 196–203.
39. Shi, J.; Siegmund, D.; Yakir, B. Importance sampling for estimating p values in linkage analysis. J. Am. Stat. Assoc. 2007, 102, 929–937.
40. Genz, A.; Bretz, F. Computation of Multivariate Normal and T Probabilities; Springer: New York, NY, USA, 2009.
41. Wang, X.; Glaz, J. A variable window scan statistic for MA(1) process. In Proceedings of the 15th International Conference on Applied Stochastic Models and Data Analysis ASMDA 2013, Barcelona, Spain, 25–28 June 2013; pp. 955–962.

Figure 1. The block factor model.

Figure 2. Construction of $Z_{j}$.

Table 1. Selected values of the functions $l$, $K$ and $Γ$ in Theorem 1 for $ε = 10^{-6}$.

$α$	$l (α)$	$K (α)$	$Γ (α)$
$0.1$	$1.5347$	$38.63$	$480.69$
$0.05$	$1.1893$	$21.28$	$180.53$
$0.025$	$1.0835$	$17.56$	$145.20$
$0.01$	$1.0313$	$15.92$	$131.43$

Table 2. The distribution of the length of the longest increasing run: $\tilde{T} = 10001$, $ITER_{sim} = 10^{4}$, $ITER_{app} = 10^{5}$.

m	Sim	App	$E_{total}$	LimApp
		Equation (14)	Equation (15)	Equation (25)
5	0.00000700	0.00000733	0.14860299	0.00000676
6	0.17567262	0.17937645	0.01089628	0.17620431
7	0.80257424	0.80362353	0.00110990	0.80215088
8	0.97548510	0.97566460	0.00011579	0.97550345
9	0.99749821	0.99751049	0.00001114	0.99749792
10	0.99977074	0.99977183	0.00000098	0.99977038
11	0.99998075	0.99998083	0.00000008	0.99998073
12	0.99999851	0.99999851	0.00000001	0.99999851
13	0.99999989	0.99999989	0.00000000	0.99999989
14	0.99999999	0.99999999	0.00000000	0.99999999
15	1.00000000	1.00000000	0.00000000	1.00000000
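
The Sim column of Table 2 can be reproduced in spirit with a short Monte Carlo sketch. The code below is an illustration only, not the authors' implementation: it assumes i.i.d. uniform observations, counts the length of a run as its number of elements, and uses the hypothetical helper names `longest_increasing_run` and `simulate_cdf`.

```python
import numpy as np


def longest_increasing_run(y):
    """Number of elements in the longest strictly increasing run of y."""
    y = np.asarray(y)
    if y.size == 0:
        return 0
    ascents = np.diff(y) > 0                  # True where y[i] < y[i + 1]
    breaks = np.flatnonzero(~ascents)         # positions where a run stops
    # Sentinels at both ends so every maximal block of ascents is delimited.
    edges = np.concatenate(([-1], breaks, [ascents.size]))
    longest_ascents = int(np.max(np.diff(edges) - 1))
    return longest_ascents + 1                # k ascents span k + 1 elements


def simulate_cdf(t_tilde=10001, n_iter=1000, m_values=range(5, 16), seed=0):
    """Empirical estimate of P(longest run <= m) for each m in m_values."""
    rng = np.random.default_rng(seed)
    # Assumption: i.i.d. uniform(0, 1) observations.
    lengths = np.array(
        [longest_increasing_run(rng.random(t_tilde)) for _ in range(n_iter)]
    )
    return {m: float(np.mean(lengths <= m)) for m in m_values}
```

Calling `simulate_cdf(n_iter=10**4)` mirrors the sample length and number of replications stated in the caption; the estimates fluctuate with the seed, so they should be compared with the Sim column only up to Monte Carlo error.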

Table 3. MA-like ($q = 2$) model: $m = 20$, $T = 1000$, $X_{i} = 0.3 Y_{i} + 0.1 Y_{i+1} + 0.5 Y_{i+2}$, $ITER_{app} = 10^{6}$, $ITER_{sim} = 10^{5}$.

x	Sim	AppPT	App	$E_{total}$
			Equation (14)	Equation (15)
11	0.582252	0.589479	0.584355	0.015156
12	0.770971	0.773700	0.771446	0.004010
13	0.889986	0.890009	0.889431	0.001167
14	0.951529	0.954536	0.951723	0.000370
15	0.980653	0.982433	0.980675	0.000124
16	0.992827	0.993690	0.992791	0.000042
17	0.997486	0.995471	0.997499	0.000014
18	0.999186	0.999411	0.999188	0.000004
19	0.999754	0.999717	0.999754	0.000001
20	0.999930	1	0.999930	0.000000
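
A simulation along the lines of the Sim column of Table 3 might look like the following sketch. It is not the authors' code: the innovations $Y_{i}$ are assumed here to be i.i.d. standard normal, the scan statistic is taken as the maximum moving sum over windows of length $m$, and the function names are hypothetical.

```python
import numpy as np


def ma_like_sequence(rng, t, coeffs):
    """X_i = sum_k coeffs[k] * Y_{i+k}; Y i.i.d. N(0, 1) is an assumption."""
    q = len(coeffs) - 1
    y = rng.standard_normal(t + q)
    # np.convolve flips its kernel, so reverse coeffs to get the forward sum.
    return np.convolve(y, coeffs[::-1], mode="valid")


def scan_statistic(x, m):
    """Maximum of the moving sums of window length m over x."""
    csum = np.concatenate(([0.0], np.cumsum(x)))
    return float(np.max(csum[m:] - csum[:-m]))


def estimate_cdf(x_values, m=20, t=1000, coeffs=(0.3, 0.1, 0.5),
                 n_iter=2000, seed=0):
    """Empirical estimate of P(scan statistic <= x) for each x in x_values."""
    rng = np.random.default_rng(seed)
    stats = np.array(
        [scan_statistic(ma_like_sequence(rng, t, coeffs), m)
         for _ in range(n_iter)]
    )
    return {x: float(np.mean(stats <= x)) for x in x_values}
```

For instance, `estimate_cdf(range(11, 21), n_iter=10**5)` uses the window length, sequence length and replication count from the caption. The cumulative-sum trick keeps each replication linear in $T$, which matters at that number of iterations.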

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
