Multiple Outlier Detection Tests for Parametric Models

Bagdonavičius, Vilijandas; Petkevičius, Linas

doi:10.3390/math8122156


by Vilijandas Bagdonavičius ¹ and Linas Petkevičius ²,*

¹ Institute of Applied Mathematics, Vilnius University, Naugarduko 24, LT-03225 Vilnius, Lithuania

² Institute of Computer Science, Vilnius University, Didlaukio 47, LT-08303 Vilnius, Lithuania

* Author to whom correspondence should be addressed.

Mathematics 2020, 8(12), 2156; https://doi.org/10.3390/math8122156

Submission received: 8 November 2020 / Revised: 27 November 2020 / Accepted: 30 November 2020 / Published: 3 December 2020

(This article belongs to the Section D1: Probability and Statistics)


Abstract: We propose a simple multiple outlier identification method for parametric location-scale and shape-scale models when the number of possible outliers is not specified. The method is based on a result giving asymptotic properties of extreme z-scores. Robust estimators of the model parameters are used to define the z-scores. An extensive simulation study was carried out to compare the proposed method with existing methods. For the normal family, the method is compared with the well-known Davies-Gather, Rosner, Hawkins and Bolshev multiple outlier identification methods. The choice of an upper limit for the number of possible outliers when applying Rosner’s test is discussed. For other families, the proposed method is compared with a method generalizing the Davies-Gather method. In most situations, the new method has the highest outlier identification power in terms of masking and swamping values. We also created the R package outliersTests implementing the proposed test.

Keywords: location-scale models; outliers identification; unknown number of outliers; outlier region; robust estimators

1. Introduction

The problem of identifying multiple outliers has received the attention of many authors. The majority of outlier identification methods define rules for the rejection of the most extreme observations. The bulk of publications concentrate on the normal distribution (see [1,2,3,4,5,6]; see also the surveys in [7,8]). For the non-normal case, most of the literature pertains to the exponential and gamma distributions; see [9,10,11,12,13,14,15,16,17]. Outlier identification is important when analyzing data collected in a wide range of areas: pollution [18], IoT [19], medicine [20], fraud [21], smart city applications [22], and many more.

When constructing outlier identification methods, most authors suppose that the number s of observations suspected to be outliers is specified. These methods have a serious drawback: only two conclusions are possible. Either exactly s observations are declared outliers, or it is concluded that outliers are absent. It is more natural to consider methods which do not specify the number of suspected observations, or which at least specify only an upper limit s for it. Such methods are not very numerous, and they concern normal or exponential samples: the methods of [1,5,23] for normal samples and the methods of [15,16,24] for exponential samples. The only method which does not specify the upper limit s is the method of [2] for normal samples.

We give a competitive and simple method for outlier identification in samples from location-scale and shape-scale families of probability distributions. The upper limit s is not specified, as in the case of the Davies-Gather method. The method is based on a theorem giving asymptotic properties of extreme z-scores. Robust estimators of the model parameters are used to define the z-scores.

The investigation below shows that the proposed outlier identification method has superior performance compared to existing methods. The proposed method considerably widens the scope of models that can be applied in the statistical analysis of real data. Beyond the normal family, many two-parameter families, such as the Weibull, logistic, loglogistic, extreme value, Cauchy, Laplace and other families, can be used for outlier identification. So the method may be useful for models which cannot be symmetrized by simple transformations such as the log-transform.

An advantage of the new method is that complicated computing is not needed, because the critical values of the test statistic do not have to be found by simulation for each sample size. This allowed us to create the R package outliersTests, which can be used for outlier search in real datasets. Another advantage is the very good potential for generalizing the proposed method to regression, time series and other models. To have a competitor, we present not only the new method but also a generalization of the Davies-Gather method to non-normal data.

In Section 2 we present a short overview of the notion of the outlier region given by [2]. In Section 3 we give asymptotic properties of extreme z-scores based on equivariant estimators of model parameters, and introduce a new outlier identification method for parametric models based on the asymptotic result and robust estimators. In Section 4 we consider rather evident generalizations of Davies-Gather tests for normal data to location-scale families. In Section 5 we give a short overview of known multiple outlier identification methods for normal samples which do not specify an exact number of suspected outliers. In Section 6 we compare performance of the new and existing methods.

2. Outliers and Outlier Regions

Suppose that data are independent random variables

X_{1}, \dots, X_{n}

. Denote by

F_{i} (x)

the c.d.f. of

X_{i}

.

Let

F_{0} = {F (x, θ), θ \in Θ \subset R^{m}}

be a parametric family of absolutely continuous cumulative distribution functions with continuous unimodal densities f on the support

supp (F)

of the c.d.f. F.

Suppose that if the data are not contaminated with unusual observations, then the following null hypothesis

H_{0}

is true: there exists

θ \in Θ

such that

F_{1} (x) = \dots = F_{n} (x) = F (x, θ) .

(1)

There are two different definitions of an outlier. In the first case the outlier is an observation which falls into some outlier region

out (X)

. The outlier region is a set such that the probability for at least one observation from a sample to fall into it is small if the hypothesis

H_{0}

is true. In such a case the probability that a specified observation

X_{i}

falls into

out (X)

is very small. If an observation

X_{i}

has distribution different from that under

H_{0}

then this probability may be considerably higher.

In the second case, the value

x_{i}

of

X_{i}

is an outlier if the probability distribution of

X_{i}

is different from that under

H_{0}

, formally

F_{i} \neq F (x, θ)

. In this case, outliers are often called contaminants.

Therefore, in the first case, there exists a very small probability to have an outlier under

H_{0}

. If the hypothesis

H_{0}

holds, then contaminants are absent and with very small probability some outliers (in the first sense) are possible. If contaminants are present, then the hypothesis

H_{0}

is not true. Nevertheless, contaminants are not necessarily outliers (in the first sense), because it is possible that they do not fall into the outlier region. So the two notions are different. Both definitions give approximately the same outliers if the alternative distribution is concentrated in the outlier region. It is precisely such contaminants that can be called outliers in the sense that outliers are anomalous extreme observations. In such a case it is possible to compare outlier and contaminant search methods.

In this paper, we consider location-scale and shape-scale families. Location-scale families have the form

F_{l s} = {F_{0} ((x - μ) / σ), μ \in R, σ > 0}

with the completely specified baseline c.d.f.

F_{0}

and p.d.f.

f_{0}

. Shape-scale families have the form

F_{s s} = {G_{0} ({(x / θ)}^{ν}), θ > 0, ν > 0}

with completely specified baseline c.d.f.

G_{0}

and p.d.f.

g_{0}

. By a logarithmic transformation, shape-scale families are transformed into location-scale families, so we concentrate on location-scale families. Methods for such families are easily adapted to shape-scale families.

The right-sided

α

-outlier region for a location-scale family is

out_{r} (α, F) = {x \in R : x > μ + σ F_{0}^{- 1} (1 - α)}

and the left-sided

α

-outlier region is

out_{l} (α, F) = {x \in R : x < μ + σ F_{0}^{- 1} (α)} .

The two-sided

α

-outlier region has the form

out (α, F) = {x \in R \setminus [μ + σ F_{0}^{- 1} (α / 2), μ + σ F_{0}^{- 1} (1 - α / 2)]} .

(2)

If

f_{0}

is symmetric, then the two-sided outlier region is simpler:

out (α, F) = {x \in R : | x - μ | > σ F_{0}^{- 1} (1 - α / 2)} .

The value of

α

is chosen depending on the size n of a sample:

α = α_{n}

. The choice is based on the assumption that under

H_{0}

for some

\bar{α}

close to zero

P {\bigcap_{i = 1}^{n} {X_{i} \notin out (α_{n}, F)}} = {(P {X_{i} \notin out (α_{n}, F)})}^{n} = 1 - \bar{α} .

(3)

The equality (3) means that under

H_{0}

the probability that none of

X_{i}

falls into

α_{n}

-outlier region is

1 - \bar{α}

. It implies that

α_{n} = 1 - {(1 - \bar{α})}^{1 / n} .

(4)

The sequence

α_{n}

decreases from

\bar{α}

to 0 as n goes from 1 to ∞.

The first definition of an outlier is as follows: for a sample size n a realization

x_{i}

of

X_{i}

is called outlier if

x_{i} \in out (α_{n}, F)

;

x_{i}

is called right outlier if

x_{i} \in out_{r} (α_{n}, F)

.

The number of outliers

D_{n}

under

H_{0}

has the binomial distribution

B (n, α_{n})

and the expected number of outliers in the sample under

H_{0}

is

E D_{n} = n α_{n} .

Please note that

E D_{n} \to - ln (1 - \bar{α}) \approx \bar{α}

as

n \to \infty

. For example, if

\bar{α} = 0.05

, then

- ln (1 - \bar{α}) \approx 0.05129

and for

n \geq 10

the expected number of outliers is approximately

0.051

, i.e., it practically does not depend on n. So under

H_{0}

the expected number of outliers

0.051

is negligible with respect to the sample size n.
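The relations (4) and E D_{n} = n α_{n} are easy to check numerically. The following is a minimal sketch in Python (the function names are illustrative and not part of the paper or of the outliersTests package); it tabulates α_{n} and the expected number of outliers under H_{0} for several sample sizes and compares them with the limit - ln (1 - \bar{α}).

```python
import math

def alpha_n(alpha_bar, n):
    """Per-observation level from Eq. (4): alpha_n = 1 - (1 - alpha_bar)^(1/n)."""
    # expm1/log1p keep precision when alpha_n is very small (large n).
    return -math.expm1(math.log1p(-alpha_bar) / n)

def expected_outliers(alpha_bar, n):
    """E D_n = n * alpha_n for D_n ~ Binomial(n, alpha_n) under H_0."""
    return n * alpha_n(alpha_bar, n)

if __name__ == "__main__":
    alpha_bar = 0.05
    for n in (1, 10, 100, 1000):
        print(n, alpha_n(alpha_bar, n), expected_outliers(alpha_bar, n))
    print("limit:", -math.log(1.0 - alpha_bar))
```

For n = 1 the function returns \bar{α} itself, and the expected count approaches the limit from below as n grows.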

3. New Method

3.1. Preliminary Results

Suppose that a c.d.f.

F \in F_{l s}

belongs also to the domain of attraction

G_{γ}

,

γ \geq 0

(see [25]).

If

F \in G_{0} \cap F_{l s}

, then there exist normalizing constants

a_{n} > 0

and

b_{n} \in R

such that

{lim}_{n \to \infty} F_{0}^{n} (a_{n} x + b_{n}) = e^{- e^{- x}} .

Similarly, if

F \in G_{γ} \cap F_{l s}

,

γ > 0

, then

{lim}_{n \to \infty} F_{0}^{n} (a_{n} x + b_{n}) = e^{- {(- x)}^{- 1 / γ}}, x < 0

,

{lim}_{n \to \infty} F^{n} (a_{n} x + b_{n}) = 1, x \geq 0 .

One of possible choices of the sequences

{b_{n}}

and

{a_{n}}

is

b_{n} = F_{0}^{- 1} (1 - \frac{1}{n}), a_{n} = 1 / (n f_{0} (b_{n})) .

(5)

In the particular case of the normal distribution, the asymptotically equivalent form

a_{n} = 1 / b_{n}

can be used. Expressions of

b_{n}

and

a_{n}

for some of the most commonly used distributions are given in Table 1.
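As a small illustration of (5), the constants for the standard normal baseline can be computed directly from the quantile and density functions. This is a sketch under the assumption that F_{0} is the standard normal c.d.f.; the helper name `normalizing_constants` is hypothetical. It also prints the product a_{n} b_{n}, which should approach 1 if a_{n} = 1 / b_{n} is an asymptotically equivalent choice.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed baseline F_0: standard normal

def normalizing_constants(n):
    """Return (a_n, b_n) from Eq. (5) for the standard normal baseline."""
    b_n = STD_NORMAL.inv_cdf(1.0 - 1.0 / n)   # b_n = F_0^{-1}(1 - 1/n)
    a_n = 1.0 / (n * STD_NORMAL.pdf(b_n))     # a_n = 1 / (n f_0(b_n))
    return a_n, b_n

if __name__ == "__main__":
    for n in (20, 100, 10**4, 10**6):
        a_n, b_n = normalizing_constants(n)
        print(n, a_n, b_n, a_n * b_n)
```

The convergence of a_{n} b_{n} to 1 is slow, so for moderate n the two forms of a_{n} give noticeably different values.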

Condition A.

(a): $\hat{μ}$ and $\hat{σ}$ are consistent estimators of μ and σ;
(b): the limit distribution of $(\sqrt{n} (\hat{μ} - μ), \sqrt{n} (\hat{σ} - σ))$ is non-degenerate;
(c): $lim_{x \to \infty} \frac{x f_{0} (x)}{\sqrt{1 - F_{0} (x)}} = 0 .$

Condition A (c) is satisfied for many location-scale models including the normal, type I extreme value, type II extreme value, logistic, Laplace (

F \in G_{0}

), Cauchy (

F \in G_{1}

).

Set

Y_{i} = (X_{i} - μ) / σ

,

{\hat{Y}}_{i} = (X_{i} - \hat{μ}) / \hat{σ}

. The random variables

{\hat{Y}}_{i}

are called z-scores. Denote by

Y_{(1)} \leq Y_{(2)} \leq \dots \leq Y_{(n)}

and

{\hat{Y}}_{(1)} \leq \dots \leq {\hat{Y}}_{(n)}

the respective order statistics.

The following theorem is useful for constructing tests for the detection of right outliers.

Theorem 1.

If

F \in G_{0} \cap F_{l s}

and Conditions A hold, then for fixed s

(({\hat{Y}}_{(n)} - b_{n}) / a_{n}, ({\hat{Y}}_{(n - 1)} - b_{n}) / a_{n}, \dots, ({\hat{Y}}_{(n - s + 1)} - b_{n}) / a_{n}) \overset{d}{\to}

L_{0} = (- ln E_{1}, - ln (E_{1} + E_{2}), \dots, - ln (E_{1} + \dots + E_{s}))

as

n \to \infty

, where

E_{1}, \dots, E_{s}

are i.i.d. standard exponential random variables.

If

F \in G_{γ} \cap F_{l s}

,

γ > 0

and Conditions A hold, then the limit random vector is

L_{γ} = (E_{1}^{- 1} - 1, {(E_{1} + E_{2})}^{- 1} - 1, \dots, {(E_{1} + \dots + E_{s})}^{- 1} - 1) .

Proof of Theorem 1.

Please note that

\frac{{\hat{Y}}_{(n - i + 1)} - b_{n}}{a_{n}} = \frac{Y_{(n - i + 1)} - b_{n}}{a_{n}} \frac{σ}{\hat{σ}} + \frac{(μ - \hat{μ})}{\hat{σ} a_{n}} + \frac{b_{n}}{a_{n}} \frac{σ - \hat{σ}}{\hat{σ}}

The s-dimensional random vector such that its ith component is the first term of the right side converges in distribution to the random vector given in the formulation of the theorem. It follows from Theorem 2.1.1 of [25] and Condition A (a). So it is sufficient to show that the second and the third terms converge to zero in probability. The second term is

- \sqrt{n} f_{0} (F_{0}^{- 1} (1 - \frac{1}{n})) \frac{\sqrt{n} (\hat{μ} - μ)}{\hat{σ}},

the third term is

- \sqrt{n} F_{0}^{- 1} (1 - \frac{1}{n}) f_{0} (F_{0}^{- 1} (1 - \frac{1}{n})) \frac{\sqrt{n} (\hat{σ} - σ)}{\hat{σ}} .

By Condition A (c)

lim_{n \to \infty} \sqrt{n} F_{0}^{- 1} (1 - \frac{1}{n}) f_{0} (F_{0}^{- 1} (1 - \frac{1}{n})) = lim_{x \to \infty} \frac{x f_{0} (x)}{\sqrt{1 - F_{0} (x)}} = 0 .

It also implies

{lim}_{n \to \infty} \sqrt{n} f_{0} (F_{0}^{- 1} (1 - \frac{1}{n})) = 0

because

{lim}_{n \to \infty} F_{0}^{- 1} (1 - \frac{1}{n}) = \infty .

Proof completed. □

Remark 1.

Please note that

2 (E_{1} + \dots + E_{i}) \sim χ^{2} (2 i)

. It implies that if

F \in G_{0} \cap F_{l s}

, then for fixed i,

i = 1, \dots, s

,

P {({\hat{Y}}_{(n - i + 1)} - b_{n}) / a_{n} \leq x} \to 1 - F_{χ_{2 i}^{2}} (2 e^{- x}) \text{ as } n \to \infty .

(6)

Similarly, if

F \in G_{γ} \cap F_{l s}

,

γ > 0

, then for fixed i,

i = 1, \dots, s

,

P {({\hat{Y}}_{(n - i + 1)} - b_{n}) / a_{n} \leq x} \to 1 - F_{χ_{2 i}^{2}} (\frac{2}{1 + x}) \text{ as } n \to \infty .

(7)

The following theorem is useful for constructing outlier detection tests in the two-sided case when

f_{0}

is symmetric. For any sequence

ζ_{1}, \dots, ζ_{n}

denote by

{| ζ |}_{(1)} \leq \dots \leq {| ζ |}_{(n)}

the ordered absolute values

| ζ_{1} |, \dots, | ζ_{n} |

.

Theorem 2.

Suppose that the function

f_{0}

is symmetric. If

F \in G_{γ} \cap F_{l s}

,

γ \geq 0

and Conditions A hold, then for fixed s

((| \hat{Y} |_{(n)} - b_{2 n}) / a_{2 n}, (| \hat{Y} |_{(n - 1)} - b_{2 n}) / a_{2 n}, \dots, (| \hat{Y} |_{(n - s + 1)} - b_{2 n}) / a_{2 n}) \overset{d}{\to} L_{γ}

as

n \to \infty

.

Proof of Theorem 2.

For any

i = 1, \dots, s

the following equality holds:

\frac{| \hat{Y} |_{(n - i + 1)} - b_{2 n}}{a_{2 n}} = \frac{| \hat{Y} |_{(n - i + 1)} - {| Y |}_{(n - i + 1)}}{a_{2 n}} + \frac{{| Y |}_{(n - i + 1)} - b_{2 n}}{a_{2 n}} .

(8)

The c.d.f. of the random variables

| Y_{i} |

is

2 F_{0} (x) - 1

, so if

F_{0} \in G_{γ}

,

γ \geq 0

then

2 F_{0} - 1 \in G_{γ}

, and for the sequence

| Y_{n} |

the normalizing sequences are

a_{2 n}, b_{2 n}

. So the s-dimensional random vector such that its ith component is the second term of the right side converges in distribution to the random vector given in the formulation of the theorem. It follows from Theorem 2.1.1 of [25]. So it is sufficient to show that the first term converges in probability to zero.

Please note that

| {\hat{Y}}_{i} | \leq | Y_{i} | + | {\hat{Y}}_{i} - Y_{i} |

, and

| {\hat{Y}}_{i} - Y_{i} | = \frac{1}{\hat{σ}} | μ - \hat{μ} + (σ - \hat{σ}) Y_{i} | \leq \frac{| \hat{μ} - μ |}{\hat{σ}} + \frac{| \sqrt{n} (\hat{σ} - σ) |}{\hat{σ}} \frac{1}{\sqrt{n}} {| Y |}_{(n)} .

So

| \hat{Y} |_{(n - j + 1)} \leq {| Y |}_{(n - j + 1)} + \frac{| \hat{μ} - μ |}{\hat{σ}} + \frac{| \sqrt{n} (\hat{σ} - σ) |}{\hat{σ}} \frac{1}{\sqrt{n}} {| Y |}_{(n)} .

Analogously, the inequality

| Y_{i} | \leq | {\hat{Y}}_{i} | + | {\hat{Y}}_{i} - Y_{i} |

implies that

{| Y |}_{(n - j + 1)} \leq | \hat{Y} |_{(n - j + 1)} + \frac{| \hat{μ} - μ |}{\hat{σ}} + \frac{| \sqrt{n} (\hat{σ} - σ) |}{\hat{σ}} \frac{1}{\sqrt{n}} {| Y |}_{(n)} .

Theorem 2.1.1 in [25] applied to the random variables

| Y_{i} |

implies that there exists a random variable

V_{1}

with the c.d.f.

G (x) = e^{- e^{- x}}

(

γ = 0

) or

G (x) = e^{- {(- x)}^{- 1 / γ}}, x < 0

,

G (x) = 1

,

x \geq 0

(

γ > 0

), such that

\frac{1}{\sqrt{n}} {| Y |}_{(n)} = (b_{2 n} + a_{2 n} (V_{1} + o_{P} (1))) / \sqrt{n} .

(9)

| \frac{| \hat{Y} |_{(n - i + 1)} - {| Y |}_{(n - i + 1)}}{a_{2 n}} | \leq \frac{| \sqrt{n} (\hat{μ} - μ) |}{\hat{σ} \sqrt{n} a_{2 n}} + \frac{| \sqrt{n} (\hat{σ} - σ) |}{\hat{σ}} (\frac{b_{2 n}}{\sqrt{n} a_{2 n}} + \frac{V_{1} + o_{p} (1)}{\sqrt{n}}) .

The convergence

b_{n} \to \infty

and Condition A (c) imply:

lim_{n \to \infty} \frac{b_{2 n}}{\sqrt{n} a_{2 n}} = lim_{n \to \infty} 2 \sqrt{n} F_{0}^{- 1} (1 - \frac{1}{2 n}) f_{0} (F_{0}^{- 1} (1 - \frac{1}{2 n})) =

\sqrt{2} lim_{x \to \infty} \frac{x f_{0} (x)}{\sqrt{1 - F_{0} (x)}} = 0, lim_{n \to \infty} \frac{1}{\sqrt{n} a_{2 n}} = 0 .

These results and Conditions A (a), (b) imply that the first term at the right of (8) converges in probability to zero. Proof completed. □

Remark 2.

Theorem 2 implies that if

F \in G_{0} \cap F_{l s}

,

n \to \infty

, then for fixed i,

i = 1, \dots, s

,

P {(| \hat{Y} |_{(n - i + 1)} - b_{2 n}) / a_{2 n} \leq x} \to 1 - F_{χ_{2 i}^{2}} (2 e^{- x}),

(10)

and if

F \in G_{γ} \cap F_{l s}

,

γ > 0

, then

P {(| \hat{Y} |_{(n - i + 1)} - b_{2 n}) / a_{2 n} \leq x} \to 1 - F_{χ_{2 i}^{2}} (2 / (1 + x)) .

(11)

Suppose now that the function

f_{0}

is not symmetric. Set

Y_{i}^{*} = - (X_{i} - μ) / σ

. The c.d.f. and p.d.f. of

Y_{i}^{*}

are

1 - F_{0} (- x)

and

f_{0} (- x)

, respectively. Set

b_{n}^{*} = - F_{0}^{- 1} (1 / n), a_{n}^{*} = 1 / (n f_{0} (- b_{n}^{*})) .

(12)

For example, if the type I extreme value distribution is considered, then

b_{n} = ln ln n, a_{n} = \frac{1}{ln n}, b_{n}^{*} = - ln (- ln (1 - \frac{1}{n})), a_{n}^{*} = - \frac{1}{(n - 1) ln (1 - \frac{1}{n})} .

For the type II extreme value distribution

a_{n}, b_{n}, a_{n}^{*}, b_{n}^{*}

have the same expressions as

a_{n}^{*}, b_{n}^{*}, a_{n}, b_{n}

for the Type I extreme value distribution, respectively.
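The closed-form constants above can be checked against the general definitions (5) and (12). The sketch below assumes that the type I extreme value baseline is F_{0} (x) = 1 - e^{- e^{x}} (this form is inferred from the closed-form expressions, since the baseline is not written out here); all function names are illustrative.

```python
import math

def f0_inv(p):
    """Quantile function of the assumed baseline F_0(x) = 1 - exp(-e^x)."""
    return math.log(-math.log1p(-p))

def f0_pdf(x):
    """Density f_0(x) = e^x * exp(-e^x) of the assumed baseline."""
    return math.exp(x - math.exp(x))

def right_constants(n):
    """(a_n, b_n) from Eq. (5)."""
    b_n = f0_inv(1.0 - 1.0 / n)
    return 1.0 / (n * f0_pdf(b_n)), b_n

def left_constants(n):
    """(a_n^*, b_n^*) from Eq. (12)."""
    b_star = -f0_inv(1.0 / n)
    return 1.0 / (n * f0_pdf(-b_star)), b_star

if __name__ == "__main__":
    n = 50
    print(right_constants(n), (1.0 / math.log(n), math.log(math.log(n))))
    print(left_constants(n),
          (-1.0 / ((n - 1) * math.log(1.0 - 1.0 / n)),
           -math.log(-math.log(1.0 - 1.0 / n))))
```

Each printed pair from the general formulas should agree with the corresponding closed-form pair up to rounding error.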

Remark 3.

Similarly as in Theorem 1 we have that if s is fixed and

F \in G_{0} \cap F_{l s}

, then for fixed i,

i = 1, \dots, s

,

P {({\hat{Y}}_{(i)} + b_{n}^{*}) / (- a_{n}^{*}) \leq x} = P {({\hat{Y}}_{(n - i + 1)}^{*} - b_{n}^{*}) / a_{n}^{*} \leq x} \to 1 - F_{χ_{2 i}^{2}} (2 e^{- x}),

(13)

and if

F \in G_{γ} \cap F_{l s}

,

γ > 0

, then for fixed i,

i = 1, \dots, s

,

P {({\hat{Y}}_{(i)} + b_{n}^{*}) / (- a_{n}^{*}) \leq x} = P {({\hat{Y}}_{(n - i + 1)}^{*} - b_{n}^{*}) / a_{n}^{*} \leq x} \to 1 - F_{χ_{2 i}^{2}} (2 / (1 + x)) .

(14)

3.2. Robust Estimators for Location-Scale Distributions

The choice of the estimators

\hat{μ}

and

\hat{σ}

is important when the outlier detection problem is considered. The ML estimators computed from the complete sample are not stable when outliers are present.

In the case of location-scale families highly efficient robust estimators of the location and scale parameters

μ

and

σ

are (see [26])

\hat{μ} = M E D - \hat{σ} F_{0}^{- 1} (0.5), \hat{σ} = Q_{n} = d W_{([0.25 n (n - 1) / 2])},

(15)

where

M E D

is the empirical median,

W_{i j} = | X_{i} - X_{j} |, 1 \leq i < j \leq n

are

C_{n}^{2} = n (n - 1) / 2

absolute values of the differences

X_{i} - X_{j}

and

W_{(l)}

is the lth order statistic from

W_{i j}

.

The constant d has the form

d = 1 / K_{0}^{- 1} (5 / 8)

, where

K_{0}^{- 1} (x)

is the inverse of the c.d.f. of

Y_{1} - Y_{2}

,

Y_{i} = (X_{i} - μ) / σ \sim F_{0} (x)

.

Expressions of

K_{0}^{- 1} (x)

and values d for some well-known location-scale families are given in Table 2.

The above considered estimators are equivariant under

H_{0}

, i.e. for any

e \in R, f > 0

, the following equalities hold:

\hat{μ} ((X_{1} - e) / f, \dots, (X_{n} - e) / f) = (\hat{μ} (X_{1}, \dots, X_{n}) - e) / f,

\hat{σ} ((X_{1} - e) / f, \dots, (X_{n} - e) / f) = \hat{σ} (X_{1}, \dots, X_{n}) / f .

Equivariant estimators have the following property: the distribution of

(\hat{μ} - μ) / σ

,

\hat{σ} / σ

and

(\hat{μ} - μ) / \hat{σ}

does not depend on the values of the parameters

μ

and

σ

.
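The estimators (15) can be sketched for the normal family as follows. This is a minimal illustration, not the implementation used in the outliersTests package: it assumes the normal baseline, for which Y_{1} - Y_{2} \sim N (0, 2) and hence K_{0}^{- 1} (5 / 8) = \sqrt{2} Φ^{- 1} (5 / 8), it uses a direct O(n^2) computation of the pairwise differences, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, median

def qn_scale(x, d):
    """Q_n = d * W_([0.25 n(n-1)/2]), W_ij = |x_i - x_j| for i < j."""
    n = len(x)
    diffs = sorted(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))
    k = max(1, int(0.25 * n * (n - 1) / 2))   # guard for very small n (assumption)
    return d * diffs[k - 1]

def robust_estimates_normal(x):
    """(mu_hat, sigma_hat) of Eq. (15) under the normal baseline."""
    std = NormalDist()
    # d = 1 / K_0^{-1}(5/8), where K_0 is the c.d.f. of N(0, 2)
    d = 1.0 / (math.sqrt(2.0) * std.inv_cdf(5.0 / 8.0))
    sigma_hat = qn_scale(x, d)
    mu_hat = median(x) - sigma_hat * std.inv_cdf(0.5)   # F_0^{-1}(0.5) = 0 here
    return mu_hat, sigma_hat

if __name__ == "__main__":
    sample = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
    print(robust_estimates_normal(sample))
    shifted = [(v - 3.0) / 2.0 for v in sample]
    print(robust_estimates_normal(shifted))   # equivariance check
```

The second call illustrates the equivariance property: the estimates for the transformed sample equal the transformed estimates of the original sample.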

3.3. Right Outliers Identification Method for Location-Scale Families

Suppose that

F \in G_{γ} \cap F_{l s}

,

γ \geq 0

. Let

a_{n}, b_{n}

be defined by (5). Set

U_{(n - i + 1)}^{+} (n) = 1 - F_{χ_{2 i}^{2}} (2 e^{- ({\hat{Y}}_{(n - i + 1)} - b_{n}) / a_{n}}), γ = 0,

U_{(n - i + 1)}^{+} (n) = 1 - F_{χ_{2 i}^{2}} (2 / (1 + ({\hat{Y}}_{(n - i + 1)} - b_{n}) / a_{n})), γ > 0,

U^{+} (n, s) = max_{1 \leq i \leq s} U_{(n - i + 1)}^{+} (n) .

(16)

Theorem 3.

The distribution of the statistic

U^{+} (n, s)

is parameter-free for any fixed n.

Proof of Theorem 3.

The result follows from the equality

\frac{{\hat{Y}}_{(n - i + 1)} - b_{n}}{a_{n}} = \frac{Y_{(n - i + 1)} - b_{n}}{a_{n}} \frac{σ}{\hat{σ}} + \frac{b_{n}}{a_{n}} (\frac{σ}{\hat{σ}} - 1) + \frac{1}{a_{n}} \frac{μ - \hat{μ}}{\hat{σ}},

equivariance of the estimators

\hat{μ}

,

\hat{σ}

and the fact that the distribution of the random vector

{(Y_{1}, \dots, Y_{n})}^{T}

does not depend on the values of the parameters

μ

and

σ

. □

Denote by

u_{α}^{+} (n, s)

the

α

critical value of the statistic

U^{+} (n, s)

. Please note that it is the exact, not the asymptotic,

α

critical value:

P {U^{+} (n, s) \geq u_{α}^{+} (n, s)} = α

under

H_{0}

.

Theorem 1 implies that the limit distribution (as

n \to \infty

) of the random variable

U^{+} (n, s)

coincides with the distribution of the random variable

V^{+} (s) = {max}_{1 \leq i \leq s} V_{i}^{+},

where

V_{i}^{+} = 1 - F_{χ_{2 i}^{2}} (2 (E_{1} + \dots + E_{i}))

,

E_{1}, \dots, E_{s}

are i.i.d. standard exponential random variables. The random variables

V_{1}^{+}, \dots, V_{s}^{+}

are dependent and identically distributed, and the distribution of each

V_{i}^{+}

is uniform:

V_{i}^{+} \sim U (0, 1)

.

Denote by

v_{α}^{+} (s)

the

α

critical values of the random variable

V^{+} (s)

. They are easily found by simulation: generate s i.i.d. standard exponential random variables many times and compute the corresponding values of the random variable

V^{+} (s)

.
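The simulation just described can be sketched as follows. The sketch relies on the closed form of the chi-square survival function with an even number of degrees of freedom, 1 - F_{χ_{2 i}^{2}} (x) = e^{- x / 2} \sum_{k = 0}^{i - 1} (x / 2)^{k} / k!, so only the Python standard library is needed. The function names, the number of replications and the seed are arbitrary choices made for illustration.

```python
import math
import random

def chi2_even_sf(x, i):
    """1 - F_{chi^2_{2i}}(x), closed form for even degrees of freedom 2i."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, i):
        term *= half / k
        total += term
    return math.exp(-half) * total

def simulate_v_plus(s=5, n_sim=100_000, seed=1):
    """Sorted simulated values of V^+(s) = max_i [1 - F_{chi^2_{2i}}(2(E_1+...+E_i))]."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sim):
        partial, v_max = 0.0, 0.0
        for i in range(1, s + 1):
            partial += rng.expovariate(1.0)          # E_1 + ... + E_i
            v_max = max(v_max, chi2_even_sf(2.0 * partial, i))
        sims.append(v_max)
    sims.sort()
    return sims

def critical_value(sorted_sims, alpha):
    """Empirical alpha critical value: the (1 - alpha) quantile of V^+(s)."""
    idx = int(math.ceil((1.0 - alpha) * len(sorted_sims))) - 1
    return sorted_sims[idx]

if __name__ == "__main__":
    sims = simulate_v_plus()
    for alpha in (0.1, 0.05, 0.01):
        print(alpha, critical_value(sims, alpha))
```

Up to Monte Carlo error, the estimates for s = 5 can be compared with the critical values reported in the text.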

Our simulations showed that the outlier identification methods proposed below, based on exact and approximate critical values of the statistic

U^{+} (n, s)

give practically the same results, so for samples of size

n \geq 20

we recommend approximating the

α

-critical level of the statistic

U^{+} (n, s)

by the critical values

v_{α}^{+} (s)

which depend only on s. We shall see that for the purpose of outlier identification only the critical values

v_{α}^{+} (5)

are needed. We found that the critical values

v_{α}^{+} (5)

are:

v_{0.1}^{+} (5) = 0.9677

,

v_{0.05}^{+} (5) = 0.9853

,

v_{0.01}^{+} (5) = 0.9975

.

Our simulations showed that the performance of the outlier identification method proposed below, based on exact and approximate critical values of the statistic

U^{+} (n, 5)

is similar for samples of size

n \geq 20

.

In what follows, we refer to the method described below as the

BP

-method.

BP

method for right outliers. Begin the outlier search using the observations corresponding to the largest values of

{\hat{Y}}_{i}

. We recommend beginning with the five largest. So take

s = 5

and compute the value of the statistic

U^{+} (n, 5) = max_{1 \leq i \leq 5} U_{(n - i + 1)}^{+} (n) .

If

U^{+} (n, 5) \leq v_{α}^{+} (5)

, then it is concluded that outliers are absent and no further investigation is done. Under

H_{0}

the probability of such event is approximately

1 - α

.

If

U^{+} (n, 5) > v_{α}^{+} (5)

, then it is concluded that outliers exist.

Please note (see the classification scheme below) that if

U^{+} (n, 5) > v_{α}^{+} (5)

, then at least one observation is declared an outlier. So the probability of declaring the absence of outliers does not depend on the classification scheme that follows.

If it is concluded that outliers exist, then the search for outliers proceeds in the following steps.

Step 1. Set

d_{1} = max {i \in {1, \dots, 5} : U_{(n - i + 1)}^{+} (n) > v_{α}^{+} (5)}

. Please note that the maximum

d_{1} > 0

exists because

U^{+} (n, 5) > v_{α}^{+} (5)

.

If

d_{1} < 5

, then classification is finished at this step:

d_{1}

observations are declared as right outliers, because if the value of

X_{(n - d_{1} + 1)}

is declared an outlier, then it is natural to declare the values of

X_{(n)}, \dots, X_{(n - d_{1} + 2)}

as outliers, too.

If

d_{1} = 5

, then it is possible that the number of outliers is higher than 5. Then the observation corresponding to

i = 1

(i.e., corresponding to

X_{(n)}

) is declared as an outlier and we proceed to Step 2.

Step 2. The above written procedure is repeated taking

U^{+} (n - 1, 5) = {max}_{1 \leq i \leq 5} U_{(n - i)}^{+} (n - 1)

instead of

U^{+} (n, 5)

; here

U_{(n - i)}^{+} (n - 1) = 1 - F_{χ_{2 i}^{2}} (2 e^{- ({\hat{Y}}_{(n - i)} - b_{n - 1}) / a_{n - 1}}), i = 1, \dots, 5,

Set

d_{2} = max {i \in {1, \dots, 5} : U_{(n - i)}^{+} (n - 1) > v_{α}^{+} (5)}

. If

d_{2} < 5

, the classification is finished and

d_{2} + 1

observations are declared as outliers.

If

d_{2} = 5

, then it is possible that the number of outliers is higher than 6. Then the observation corresponding to

{\hat{Y}}_{(n - 1)}

is declared as an outlier, so in total two observations (those corresponding to

i = 1, 2

, i.e., to

X_{(n)}

and

X_{(n - 1)}

) are declared as outliers, and we proceed to Step 3, and so on. Classification finishes at the lth step when

d_{l} < 5

. So we declare

(l - 1)

outliers in the previous steps and

d_{l}

outliers in the last one. The total number of observations declared as outliers is

l - 1 + d_{l}

. These observations are values of

X_{(n)}, \dots, X_{(n - d_{l} - l + 2)}

.
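The steps above can be sketched in code for the normal family (γ = 0). This is an illustrative sketch, not the implementation from the outliersTests package. It assumes the normal baseline, the robust estimators (15), and the approximate critical value v_{0.05}^{+} (5) = 0.9853 given above. As in Step 2, the z-scores are computed once from the full sample and only a_{n}, b_{n} change with the current sample size. Treating the case where no statistic exceeds the critical value at a later step as d_{l} = 0 is an assumption of the sketch, as are all function names.

```python
import math
from statistics import NormalDist, median

STD = NormalDist()
V_CRIT_05 = 0.9853  # v^+_{0.05}(5) as reported in the text

def chi2_even_sf(x, i):
    """1 - F_{chi^2_{2i}}(x), closed form for even degrees of freedom 2i."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, i):
        term *= half / k
        total += term
    return math.exp(-half) * total

def qn_scale(x):
    """Q_n scale estimate of Eq. (15) with the normal-family constant d."""
    n = len(x)
    diffs = sorted(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))
    k = max(1, int(0.25 * n * (n - 1) / 2))
    d = 1.0 / (math.sqrt(2.0) * STD.inv_cdf(5.0 / 8.0))
    return d * diffs[k - 1]

def u_plus(z, i, m):
    """U^+ for the i-th largest remaining z-score at current sample size m."""
    b = STD.inv_cdf(1.0 - 1.0 / m)        # b_m from Eq. (5)
    a = 1.0 / (m * STD.pdf(b))            # a_m from Eq. (5)
    t = min(-(z - b) / a, 50.0)           # cap the exponent to avoid overflow
    return chi2_even_sf(2.0 * math.exp(t), i)

def bp_right_outliers(x, v_crit=V_CRIT_05, s=5):
    """Indices (into x) of observations declared right outliers."""
    n = len(x)
    mu, sigma = median(x), qn_scale(x)    # F_0^{-1}(0.5) = 0 for the normal
    order = sorted(range(n), key=lambda i: x[i], reverse=True)
    z = [(x[i] - mu) / sigma for i in order]   # z-scores, largest first
    removed = 0                                # outliers declared in earlier steps
    while n - removed > s:                     # keep more than s observations
        m = n - removed
        exceed = [i for i in range(1, s + 1)
                  if u_plus(z[removed + i - 1], i, m) > v_crit]
        d = max(exceed, default=0)
        if d < s:                              # classification finishes here
            return sorted(order[:removed + d])
        removed += 1                           # declare the current largest, go on
    return sorted(order[:removed])

if __name__ == "__main__":
    base = [STD.inv_cdf((i + 0.5) / 30) for i in range(30)]
    print(bp_right_outliers(base))
    print(bp_right_outliers(base[:27] + [6.0, 7.0, 8.0]))
```

On an idealized normal sample no observation is flagged, while three large values appended to it are returned as outliers in a single step, because d_{1} = 3 < 5.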

3.4. Left Outliers Identification Method for Location-Scale Families

Let

a_{n}^{*}, b_{n}^{*}

be the normalizing constants defined by (12). If

F \in G_{0} \cap F_{l s}

,

i = 1, \dots, s

, then set

U_{(i)}^{-} (n) = 1 - F_{χ_{2 i}^{2}} (2 e^{({\hat{Y}}_{(i)} + b_{n}^{*}) / a_{n}^{*}}), U^{-} (n, s) = max_{1 \leq i \leq s} U_{(i)}^{-} (n) .

If

F \in G_{γ} \cap F_{l s}

,

γ > 0

, then replace

e^{({\hat{Y}}_{(i)} + b_{n}^{*}) / a_{n}^{*}}

by

1 / (1 - ({\hat{Y}}_{(i)} + b_{n}^{*}) / a_{n}^{*})

. Denote by

u_{α}^{-} (n, s)

the

α

critical value of the statistic

U^{-} (n, s)

.

Theorem 1 and Remark 3 imply that the limit distribution (as

n \to \infty

) of the random variable

U^{-} (n, s)

coincides with the distribution of the random variable

V^{+} (s)

. So the critical values

u_{α}^{-} (n, s)

are approximated by the critical values

v_{α}^{-} (s) = v_{α}^{+} (s)

.

The left outliers search method coincides with the right outliers search method with + replaced by − in all formulas.

3.5. Outlier Detection Tests for Location-Scale Families: Two-Sided Alternative, Symmetric Distributions

Let

a_{n}, b_{n}

be defined by (5). If

F \in G_{0} \cap F_{l s}

,

i = 1, \dots, s

, then set

U_{(n - i + 1)} (n) = 1 - F_{χ_{2 i}^{2}} (2 e^{- (| \hat{Y} |_{(n - i + 1)} - b_{2 n}) / a_{2 n}}), U (n, s) = max_{1 \leq i \leq s} U_{(n - i + 1)} (n) .

If

F \in G_{γ} \cap F_{l s}

,

γ > 0

, then replace

e^{- (| \hat{Y} |_{(n - i + 1)} - b_{2 n}) / a_{2 n}}

by

1 / (1 + (| \hat{Y} |_{(n - i + 1)} - b_{2 n}) / a_{2 n})

. Denote by

u_{α} (n, s)

the

α

critical value of the statistic

U (n, s)

.

Theorem 2 and Remark 2 imply that the limit distribution (as

n \to \infty

) of the random variable

U (n, s)

coincides with the distribution of the random variable

V^{+} (s)

. So the critical values

u_{α} (n, s)

are approximated by the critical values

v_{α} (s) = v_{α}^{+} (s)

.

The outlier search method coincides with the right outliers search method with the superscript + omitted in all formulas.
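For the two-sided symmetric case, the only changes relative to the right-sided statistic are the use of the ordered absolute z-scores and of the constants a_{2 n}, b_{2 n}. The sketch below computes U (n, s) for the normal family from given z-scores; the function names are hypothetical, and the z-scores are assumed to have been computed beforehand with the robust estimators.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def chi2_even_sf(x, i):
    """1 - F_{chi^2_{2i}}(x), closed form for even degrees of freedom 2i."""
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, i):
        term *= half / k
        total += term
    return math.exp(-half) * total

def two_sided_statistic(z_scores, s=5):
    """Return (U(n, s), [U_(n)(n), ..., U_(n-s+1)(n)]) for the normal family."""
    n = len(z_scores)
    b = STD.inv_cdf(1.0 - 1.0 / (2 * n))    # b_{2n} from Eq. (5)
    a = 1.0 / (2 * n * STD.pdf(b))          # a_{2n} from Eq. (5)
    abs_desc = sorted((abs(z) for z in z_scores), reverse=True)
    u_values = []
    for i in range(1, s + 1):
        t = min(-(abs_desc[i - 1] - b) / a, 50.0)   # cap to avoid overflow
        u_values.append(chi2_even_sf(2.0 * math.exp(t), i))
    return max(u_values), u_values

if __name__ == "__main__":
    z = [STD.inv_cdf((i + 0.5) / 30) for i in range(30)]
    print(two_sided_statistic(z)[0])
    z[0] = -10.0                              # one extreme negative z-score
    print(two_sided_statistic(z)[0])
```

A single extreme z-score of either sign pushes U (n, s) above the critical value, which is the intended behaviour of the two-sided test.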

3.6. Outlier Detection Tests for Location-Scale Families: Two-Sided Alternative, Non-Symmetric Distributions

Suppose now that the function

f_{0}

is not symmetric. Let

a_{n}, b_{n}, a_{n}^{*}, b_{n}^{*}

be defined by (5) and (12).

Begin outlier search using observations corresponding to the largest and the smallest values of

{\hat{Y}}_{i}

. We recommend beginning with the five smallest and the five largest. So compute the values of the statistics

U^{-} (n, 5)

and

U^{+} (n, 5)

. If

U^{-} (n, 5) \leq v_{α / 2} (5)

and

U^{+} (n, 5) \leq v_{α / 2} (5)

, then it is concluded that outliers are absent and no further investigation is done.

If

U^{-} (n, 5) > v_{α / 2} (5)

or

U^{+} (n, 5) > v_{α / 2} (5)

, then it is concluded that outliers exist. If

U^{-} (n, 5) > v_{α / 2} (5)

, then left outliers are searched as in Section 3.4. If

U^{+} (n, 5) > v_{α / 2} (5)

, then right outliers are searched as in Section 3.3. The only difference is that

α

is replaced by

α / 2

in all formulas.

3.7. Outlier Identification Method for Shape-Scale Families

If shape-scale families of the form

{F (t; θ, ν) = G_{0} ({(t / θ)}^{ν}), θ, ν > 0}

with specified

G_{0}

are considered, then the tests given above for location-scale families can be used, because if

X_{1}, \dots, X_{n}

is a sample from a shape-scale family, then

Z_{1}, \dots, Z_{n}

,

Z_{i} = ln X_{i}

, is a sample from the location-scale family

{F_{0} ((x - μ) / σ), μ \in R, σ > 0}

with

μ = ln θ

,

σ = 1 / ν

,

F_{0} (x) = G_{0} (e^{x})

.
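As a quick numerical check of this reduction, consider the Weibull family, for which G_{0} (u) = 1 - e^{- u} and hence F_{0} (x) = 1 - e^{- e^{x}}. The sketch below (function names and parameter values are illustrative) verifies that the c.d.f. of X at t coincides with the location-scale c.d.f. of ln X at ln t with μ = ln θ and σ = 1 / ν.

```python
import math

def weibull_cdf(t, theta, nu):
    """Shape-scale form G_0((t / theta)^nu) with G_0(u) = 1 - exp(-u)."""
    return 1.0 - math.exp(-((t / theta) ** nu))

def ev1_cdf(x):
    """Baseline F_0(x) = G_0(e^x) = 1 - exp(-e^x)."""
    return 1.0 - math.exp(-math.exp(x))

def log_location_scale(theta, nu):
    """Location and scale of ln X: mu = ln(theta), sigma = 1 / nu."""
    return math.log(theta), 1.0 / nu

if __name__ == "__main__":
    theta, nu = 2.0, 1.5
    mu, sigma = log_location_scale(theta, nu)
    for t in (0.1, 1.0, 2.0, 5.0):
        print(t, weibull_cdf(t, theta, nu), ev1_cdf((math.log(t) - mu) / sigma))
```

The two columns printed for each t should agree up to rounding error, so any outlier test for the type I extreme value family can be applied to the logarithms of Weibull data.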

3.8. Illustrative Example

To illustrate the simplicity of the

BP

-method, let us consider an example of its application (sample size

n = 20

,

r = 7

outliers). A sample of size

n = 20

from the standard normal distribution was generated. The 1st-3rd and 17th-20th observations were replaced by outliers. The observations

x_{i}

, the absolute values

| {\hat{Y}}_{i} |

of the z-scores

{\hat{Y}}_{i}

, and the ranks

(i)

of

| {\hat{Y}}_{i} |

are presented in Table 3.

In Table 4 we present steps of the classification procedure by the

BP

method. First, we compute (see line 1 of Table 4) value of the statistic

U (20, 5) = {max}_{1 \leq i \leq 5} U_{(20 - i + 1)} (20) = 1

. Since

U (20, 5) = 1 > 0.9853 = v_{0.05} (5)

, we reject the null hypothesis, conclude that outliers exist and begin the search of outliers.

Step 1. The inequality

U_{(16)} (20) = 1.0000 > 0.9853 = v_{0.05} (5)

(note that

U_{(16)} (20)

corresponds to the fifth largest observation in absolute value) implies that

d_{1} = 5

. So it is possible that the number of outliers might be greater than 5. We reject the largest in absolute value 20th observation as an outlier and continue the search of outliers.

Step 2. The inequality

U_{(15)} (19) = 1.0000 > 0.9853 = v_{0.05} (5)

(note that

U_{(15)} (19)

corresponds to the fifth largest observation in absolute value from the remaining 19 observations) implies that

d_{2} = 5

. So it is possible that the number of outliers might be greater than 6. We declare the second largest in absolute value observation as an outlier. So two observations (19th and 20th) are declared as outliers. We continue the search of outliers.

Step 3. The inequality

U_{(14)} (18) = 0.999997 > 0.9853 = v_{0.05} (5)

implies that

d_{3} = 5

. We declare the third largest in absolute value observation as an outlier. So three observations (2nd, 19th and 20th) are declared as outliers. We continue the search of outliers.

Step 4. The inequalities

U_{(13)} (17) = 0.084290 < 0.9853 = v_{0.05} (5)

and

U_{(14)} (17) = 0.999940 > 0.9853 = v_{0.05} (5)

imply that

d_{4} = 4

. So four additional observations (the fourth, fifth, sixth and seventh largest in absolute value), namely the 3rd, 1st, 17th, and 18th, are declared as outliers. The outlier search is finished. In all, 7 observations were declared as outliers: 1–3, 17–20, as was expected. Please note that since the outlier search procedure was done after rejection of the null hypothesis, the significance level did not change.

3.9. Practical Example

Let us consider the stent fatigue testing dataset from reliability control [27]. The dataset contains 100 observations. We consider the Weibull, loglogistic and lognormal models, which are the most widely applied models for the analysis of reliability data. For a preliminary choice of a suitable model we compare the values of various goodness-of-fit statistics and information criteria (see Table 5). The Weibull model is the most suitable because the values of all five statistics are smallest for this model.

Using the function WEDF.test from the R package EWGoF we applied the following goodness-of-fit tests for the Weibull distribution: Anderson-Darling (p-value = 0.86), Kolmogorov-Smirnov (p-value = 0.82), Cramer-von-Mises (p-value = 0.795), Watson (p-value = 0.795). So none of the tests contradicts the Weibull model.

The logarithms

X_{1}, \dots, X_{100}

of observations have type I extreme value distribution. Minimal and maximal values are

X_{(1)} = 1.609

and

X_{(100)} = 5.670

. Let us consider the situation, where fatigue data contain two outliers

X_{3} = 6.5

and

X_{5} = 6.5

. All goodness-of-fit tests applied to the data with outliers reject the Weibull model: Anderson-Darling (p-value

< 10^{- 15}

), Kolmogorov-Smirnov (p-value

0.005

), Cramer-von-Mises (p-value

< 10^{- 15}

), Watson (p-value

< 10^{- 15}

).

Let us apply the

B P

method for outlier identification. Values of the statistics

U_{i}

are:

U_{(100)} (100) = 0.92, U_{(99)} (100) = 0.997, U_{(98)} (100) = 0.96, U_{(97)} (100) = 0.92, U_{(96)} (100) = 0.96

. Since

U (100, 5) = 0.997 > 0.9853

, we reject the null hypothesis.

Step 1. Since

d_{1} = max {i \in {1, \dots, 5} : U_{(101 - i)} (100) > 0.9853} = 2 < 5

, the search procedure is finished and the observations

X_{(99)}

and

X_{(100)}

, namely

X_{3}

and

X_{5}

, are declared as outliers. We see that our method did not allow mutual masking of the equal observations

X_{3} = X_{5} = 6.5

. It is a very important advantage of the

B P

method.
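The computation of d_1 in this example can be reproduced with a short Python sketch. The function name `exceedance_depth` and the list layout (most extreme statistic first) are assumptions made here for illustration; the numbers are the five values of the statistics and the critical value 0.9853 quoted in the text.

```python
def exceedance_depth(u_top, v):
    """Largest i such that u_top[i - 1] > v, or 0 if no value exceeds v.

    u_top -- [U_(n)(n), U_(n-1)(n), ..., U_(n-s+1)(n)], most extreme first
    v     -- critical value of the test
    """
    depth = 0
    for i, u in enumerate(u_top, start=1):
        if u > v:
            depth = i
    return depth


u_top = [0.92, 0.997, 0.96, 0.92, 0.96]  # U_(100)(100), ..., U_(96)(100) from the text
v = 0.9853                               # critical value used in the text

print(max(u_top) > v)              # True: the null hypothesis is rejected
print(exceedance_depth(u_top, v))  # 2, which is smaller than 5, so the search stops
```

Note that the statistic of the largest observation (0.92) does not exceed the critical value on its own; the second one (0.997) does, and both observations are then declared outliers together.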

After removal of the outliers, we repeated the goodness-of-fit procedure. None of the tests rejected the Weibull model: Anderson-Darling (p-value = 0.88), Kolmogorov-Smirnov (p-value = 0.8), Cramer-von-Mises (p-value = 0.93), Watson (p-value = 0.895). Once more, we compared the values of goodness-of-fit statistics and information criteria for the models considered above, using the data with the outliers removed, see Table 6.

The Weibull distribution gives clearly the best fit.

Values of the ML estimators computed from the initial non-contaminated data and from the final data cleared of outliers are similar. The shape practically did not change:

1.83 \to 1.83

, scale changed slightly:

100.8 \to 101.4

.

We created the R package outliersTests (https://github.com/linas-p/outliersTests) so that the proposed

B P

test can be used in practice.

4. Generalization of Davies-Gather Outlier Identification Method

Let us consider location-scale families. Following the idea of Davies-Gather [2] define an empirical analogue of the right outlier region as a random region

O R_{r} (α_{n}) = {x : x > \hat{μ} + \hat{σ} g_{n, α}},

(17)

where

g_{n, α}

is found using the condition

P {X_{i} \notin O R_{r} (α_{n}), i = 1, \dots, n | H_{0}} = 1 - α,

(18)

and

\hat{μ}, \hat{σ}

are robust equivariant estimators of the parameters

μ, σ

.

Set

{\hat{Y}}_{(n)} = (X_{(n)} - \hat{μ}) / \hat{σ} .

The distribution of

{\hat{Y}}_{(n)}

is parameter-free under

H_{0}

.

Equation (18) is equivalent to the equation

P {{\hat{Y}}_{(n)} \leq g_{n, α} | H_{0}} = 1 - α .

So

g_{n, α}

is the upper

α

critical value of the random variable

{\hat{Y}}_{(n)}

. It is easily computed by simulation.
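A minimal Monte Carlo sketch of this computation is given below. It is not the authors' implementation: it assumes the normal family under H_0 and, purely for illustration, takes the sample median and the scaled median absolute deviation (factor 1.4826, which makes the MAD consistent for the normal standard deviation) as the robust equivariant estimators; the function names, the number of replications and the seed are likewise assumptions.

```python
import random
import statistics


def robust_z_max(sample):
    """(X_(n) - median) / (1.4826 * MAD): one possible robust version of Y_(n)."""
    med = statistics.median(sample)
    mad = statistics.median(abs(x - med) for x in sample)
    return (max(sample) - med) / (1.4826 * mad)


def simulate_g(n, alpha, replications=2000, seed=1):
    """Empirical upper-alpha critical value of robust_z_max under N(0, 1)."""
    rng = random.Random(seed)
    values = sorted(
        robust_z_max([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(replications)
    )
    # Index of the empirical (1 - alpha) quantile.
    k = min(replications - 1, int((1.0 - alpha) * replications))
    return values[k]


g = simulate_g(n=20, alpha=0.05)
print(g)
```

Because the estimators are equivariant, the statistic is unchanged when the sample is shifted and rescaled, which is why it is enough to simulate from the standard member of the family.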

Generalized Davies-Gather method for right outliers identification: if

{\hat{Y}}_{(n)} \leq g_{n, α}

, then it is concluded that right outliers are absent. Under the null hypothesis the probability of such an event is

1 - α

. If

{\hat{Y}}_{(n)} > g_{n, α}

, then it is concluded that right outliers exist. The value

x_{i}

of the random variable

X_{i}

is admitted as an outlier if

x_{i} \in O R_{r} (α_{n})

, i.e., if

x_{i} > \hat{μ} + \hat{σ} g_{n, α}

. Otherwise it is admitted as a non-outlier.

Similarly, define an empirical analogue of the left outlier region as the random region

O R_{l} (α_{n}) = {x : x < \hat{μ} + \hat{σ} h_{n, 1 - α}},

(19)

where

h_{n, 1 - α}

is found using the condition

P {X_{i} \notin O R_{l} (α_{n}), i = 1, \dots, n | H_{0}} = 1 - α .

(20)

Set

{\hat{Y}}_{(1)} = (X_{(1)} - \hat{μ}) / \hat{σ} .

The distribution of

{\hat{Y}}_{(1)}

is parameter-free under

H_{0}

.

Equation (20) is equivalent to the equation

P {{\hat{Y}}_{(1)} \geq h_{n, 1 - α} | H_{0}} = 1 - α .

So

h_{n, 1 - α}

is the upper

1 - α

critical value of the random variable

{\hat{Y}}_{(1)}

. It is easily computed by simulation.

Generalized Davies-Gather method for left outliers identification: if

{\hat{Y}}_{(1)} \geq h_{n, 1 - α}

, then it is concluded that left outliers are absent. Under the null hypothesis the probability of such an event is

1 - α

. If

{\hat{Y}}_{(1)} < h_{n, 1 - α}

, then it is concluded that left outliers exist. The value

x_{i}

of the random variable

X_{i}

is admitted as an outlier if

x_{i} \in O R_{l} (α_{n})

, i.e., if

x_{i} < \hat{μ} + \hat{σ} h_{n, 1 - α}

. Otherwise it is admitted as a non-outlier.

Let us consider the two-sided case.

If the distribution of

X_{i}

is symmetric, then the empirical analogue of the outlier region is the random region

O R (α_{n}) = {x : | x - \hat{μ} | > \hat{σ} g_{n, α / 2}} .

(21)

In this case

1 - α = P {X_{i} \notin O R (α_{n}), i = 1, \dots, n | H_{0}} = P {| \hat{Y} |_{(n)} \leq g_{n, α / 2} | H_{0}} .

Generalized Davies-Gather method for left and right outliers identification (symmetric distributions): if

| \hat{Y} |_{(n)} \leq g_{n, α / 2}

, then it is concluded that outliers are absent. Under the null hypothesis the probability of such an event is

1 - α

. If

| \hat{Y} |_{(n)} > g_{n, α / 2}

, then it is concluded that outliers exist. The value

x_{i}

of the random variable

X_{i}

is admitted as a left outlier if

x_{i} < \hat{μ} - \hat{σ} g_{n, α / 2}

, it is admitted as a right outlier if

x_{i} > \hat{μ} + \hat{σ} g_{n, α / 2}

. Otherwise it is admitted as a non-outlier.

If the distribution of

X_{i}

is non-symmetric, then the empirical analogue of the outlier region is defined as follows:

O R (α_{n}) = {x \in R \setminus [\hat{μ} + \hat{σ} h_{n, 1 - α / 2}, \hat{μ} + \hat{σ} g_{n, α / 2}]} .

In this case

1 - α = P {X_{i} \in [\hat{μ} + \hat{σ} h_{n, 1 - α / 2}, \hat{μ} + \hat{σ} g_{n, α / 2}], i = 1, \dots, n | H_{0}} = P {h_{n, 1 - α / 2} \leq {\hat{Y}}_{(1)} \leq {\hat{Y}}_{(n)} \leq g_{n, α / 2} | H_{0}} .

Generalized Davies-Gather method for left and right outliers identification (non-symmetric distributions): if

{\hat{Y}}_{(1)} \geq h_{n, 1 - α / 2}

and

{\hat{Y}}_{(n)} \leq g_{n, α / 2}

, then it is concluded that outliers are absent. Under the null hypothesis the probability of such an event is

1 - α

. If

{\hat{Y}}_{(1)} < h_{n, 1 - α / 2}

or

{\hat{Y}}_{(n)} > g_{n, α / 2}

, then it is concluded that outliers exist. The value

x_{i}

of the random variable

X_{i}

is admitted as a left outlier if

x_{i} < \hat{μ} + \hat{σ} h_{n, 1 - α / 2}

, it is admitted as a right outlier if

x_{i} > \hat{μ} + \hat{σ} g_{n, α / 2}

. Otherwise it is admitted as a non-outlier.
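The two-sided classification rule for non-symmetric distributions can be sketched as follows. The function name `classify` and the numerical values in the usage lines are hypothetical; in practice the estimates and the critical values h and g would come from the data and from simulation under H_0.

```python
def classify(x, mu_hat, sigma_hat, h, g):
    """Label x using the interval [mu_hat + sigma_hat * h, mu_hat + sigma_hat * g].

    h -- lower critical value (typically negative), g -- upper critical value
    """
    lower = mu_hat + sigma_hat * h
    upper = mu_hat + sigma_hat * g
    if x < lower:
        return "left outlier"
    if x > upper:
        return "right outlier"
    return "non-outlier"


# Hypothetical estimates and critical values, for illustration only.
for value in (-9.0, 0.5, 7.2):
    print(value, classify(value, mu_hat=0.0, sigma_hat=2.0, h=-4.0, g=3.0))
```

With these hypothetical inputs the non-outlier interval is [-8, 6], so the three values are labelled left outlier, non-outlier and right outlier, respectively.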

5. Short Survey of Multiple Outlier Identification Methods for Normal Data

5.1. Rosner’s Method

Let us formulate Rosner’s method in the form mostly used in practice. Suppose that the number of outliers does not exceed s and the two-sided alternative is considered. Set (see [5,28])

\begin{matrix} R_{1} = max_{1 ⩽ j ⩽ n} | {\tilde{Y}}_{j} | = max_{1 ⩽ j ⩽ n} | X_{j} - \bar{X} | / S_{X}, S_{X}^{2} = \sum_{j = 1}^{n} {(X_{(j)} - \bar{X})}^{2} / (n - 1) . \end{matrix}

| {\tilde{Y}}_{j} | = | (X_{j} - \bar{X}) / S_{X} |

may be interpreted as a distance between

X_{j}

and

\bar{X}

. Remove the observation

X_{j_{1}}

which is most distant from

\bar{X}

. This maximal distance is

R_{1}

. The value of

X_{j_{1}}

is a possible candidate for contaminant.

Recompute the statistic using

n - 1

remaining observations and denote by

R_{2}

the obtained statistic. Remove the observation

X_{j_{2}}

which is most distant from the new empirical mean. The value of

X_{j_{2}}

is also a possible candidate for a contaminant. Repeat the procedure until the statistics

R_{1}, \dots, R_{s}

are computed. So we obtain all possible candidates for contaminants. They are values of

X_{j_{1}}, \dots, X_{j_{s}}

. Fix

α

and find

λ_{i n}

such that

P {R_{1} > λ_{i n} | H_{0}} = \dots = P {R_{s} > λ_{i n} | H_{0}}, P {\cup_{i = 1}^{s} {R_{i} > λ_{i n}} | H_{0}} = α .

If

n > 25

, then the approximations

\begin{matrix} λ_{i n} \approx t_{\frac{α}{2 (n - i - 1)}} (n - i + 1) \sqrt{\frac{n - i}{n - i - 1 + t_{\frac{α}{2 (n - i - 1)}}^{2} (n - i + 1)}} \sqrt{1 - \frac{1}{n - i + 1}}, \end{matrix}

are recommended (see [5]); here

t_{p} (ν)

is the p critical value of the Student distribution with

ν

degrees of freedom.

Rosner’s method for left and right outliers identification: if

R_{i} ⩽ λ_{i n}

for all

i = 1, \dots, s

, then it is concluded that outliers are absent. If there exists

i_{0} \in {1, \dots, s}

such that

R_{i_{0}} > λ_{i_{0} n}

, i.e., the event

\cup_{i = 1}^{s} {R_{i} > λ_{i n}}

occurs, then it is concluded that outliers exist. In this case, classification of observations to outliers and non-outliers is done in the following way: if

R_{s} > λ_{s n}

, then it is concluded that there are s outliers and they are values of

X_{j_{1}}, \dots, X_{j_{s}}

. If

R_{j} ⩽ λ_{j n}

for

j = s, s - 1, \dots, i + 1

, and

R_{i} > λ_{i n}

, then it is concluded that there are i outliers and they are values of

X_{j_{1}}, \dots, X_{j_{i}}

.
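The sequential structure of Rosner's procedure (compute R_1, ..., R_s by successive removal, then scan the critical values backwards) can be sketched in Python as follows. The function names are assumptions, and the critical values are passed in by the caller: the values used in the usage lines are hypothetical placeholders, not values obtained from the approximation above.

```python
import statistics


def rosner_candidates(data, s):
    """Return (R, candidates): the statistics R_1..R_s and the removed observations."""
    remaining = list(data)
    stats, candidates = [], []
    for _ in range(s):
        mean = statistics.mean(remaining)
        sd = statistics.stdev(remaining)  # divisor: number of remaining observations - 1
        j = max(range(len(remaining)), key=lambda idx: abs(remaining[idx] - mean))
        stats.append(abs(remaining[j] - mean) / sd)
        candidates.append(remaining.pop(j))
    return stats, candidates


def rosner_classify(stats, candidates, lambdas):
    """Outliers are the first i candidates, where i is the largest index with R_i > lambda_i."""
    for i in range(len(stats), 0, -1):
        if stats[i - 1] > lambdas[i - 1]:
            return candidates[:i]
    return []


data = [1.0, 2.0, 3.0, 4.0, 5.0, 50.0]
stats, candidates = rosner_candidates(data, s=2)
hypothetical_lambdas = [1.9, 1.7]  # placeholders, NOT Rosner's critical values
print(rosner_classify(stats, candidates, hypothetical_lambdas))  # [50.0]
```

Scanning from i = s downwards is what allows an early candidate to be declared an outlier even when its own statistic R_i is small, provided a later statistic exceeds its critical value.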

If right outliers are searched, then define

R_{1}^{+} = {max}_{1 ⩽ i ⩽ n} {\tilde{Y}}_{i},

and repeat the above procedure taking approximations

\begin{matrix} λ_{i n}^{+} \approx t_{\frac{α}{n - i - 1}} (n - i + 1) \sqrt{\frac{n - i}{n - i - 1 + t_{\frac{α}{n - i - 1}}^{2} (n - i + 1)}} \sqrt{1 - \frac{1}{n - i + 1}} . \end{matrix}

Denote by

R_{s}

the Rosner’s test with a fixed upper limit s. Our simulation results confirm that the true significance level is different from the level

α

suggested by the approximation when n is not large. Nevertheless, it is approaching

α

as n increases, see Figure 1. The true significance level of the

B P

test, which uses asymptotic critical values of the test statistic, is also presented in Figure 1.

5.2. Bolshev’s Method

Suppose that the number of contaminants does not exceed s. For

i = 1, \dots, n

set

{\hat{Y}}_{i} = (X_{i} - \bar{X}) / s, τ_{i}^{+} = n \cdot (1 - T_{n - 2} ({\hat{Y}}_{i})), τ_{i} = n \cdot (1 - T_{n - 2} (| {\hat{Y}}_{i} |)),

where

\bar{X}

and s are the empirical mean and standard deviation,

T_{n - 2} (x)

is the c.d.f. of Thompson’s distribution with

n - 2

degrees of freedom.

Let us consider search for right outliers. Please note that the largest s observations

X_{(n - s + 1)}, \dots, X_{(n)}

define the smallest s order statistics

τ_{(1)}^{+} \leq \dots \leq τ_{(s)}^{+}

. Possible candidates for outliers are namely the values of

X_{(n - s + 1)}, \dots, X_{(n)}

.

Set

τ^{+} = {min}_{1 ⩽ i ⩽ s} τ_{(i)}^{+} / i .

Bolshev’s method for right outliers search. If

τ^{+} \geq τ_{1 - α}^{+} (n, s)

, then it is concluded that outliers are absent; here

τ_{1 - α}^{+} (n, s)

is the

1 - α

critical value of the test statistic under

H_{0}

. If

τ^{+} < τ_{1 - α}^{+} (n, s)

, then it is concluded that outliers exist. In such a case outliers are selected in the following way: if

τ_{(i)}^{+} / i < τ_{1 - α}^{+} (n, s)

then the value of the order statistic

X_{(n - i + 1)}

is admitted as an outlier,

i = 1, \dots, s

.

In the case of left and right outliers search Bolshev’s method uses

τ_{(i)}

instead of

τ_{(i)}^{+}

, defining the statistic

τ = {min}_{1 ⩽ i ⩽ s} τ_{(i)} / i .

Bolshev’s method for left and right outliers search. If

τ \geq τ_{1 - α} (n, s)

, then it is concluded that outliers are absent; here

τ_{1 - α} (n, s)

is the

1 - α

critical value of the statistic

τ

under

H_{0}

. If

τ < τ_{1 - α} (n, s)

, then it is concluded that outliers exist. In such a case they are selected in the following way: if

τ_{(i)} / i < τ_{1 - α} (n, s)

then the observation corresponding to

τ_{(i)}

is admitted as an outlier,

i = 1, \dots, s

.
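Once the values τ_1, ..., τ_n are available, the selection rule itself is simple to express in code. The sketch below covers only that rule: it takes the τ values and the critical value as inputs (computing them requires the Thompson distribution and the null distribution of the statistic, which are not implemented here), and the function name and the numbers in the usage lines are hypothetical.

```python
def bolshev_select(tau, s, critical_value):
    """Indices of observations declared outliers by the rule tau_(i) / i < critical value."""
    # Indices of the s smallest tau values, in increasing order of tau.
    order = sorted(range(len(tau)), key=lambda j: tau[j])[:s]
    ratios = [tau[j] / i for i, j in enumerate(order, start=1)]
    if min(ratios) >= critical_value:
        return []  # the test statistic is not significant: outliers are declared absent
    return [j for j, ratio in zip(order, ratios) if ratio < critical_value]


hypothetical_tau = [5.0, 0.02, 3.0, 0.3, 8.0]  # illustrative values only
print(bolshev_select(hypothetical_tau, s=3, critical_value=0.1))  # [1]
print(bolshev_select(hypothetical_tau, s=3, critical_value=0.2))  # [1, 3]
```

In the first call only the ratio 0.02 / 1 falls below the critical value; in the second call the ratio 0.3 / 2 = 0.15 does as well.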

5.3. Hawkins’ Method

Suppose that the number of contaminants does not exceed s. Let us consider the search for right outliers. For

k = 1, \dots, s

set

b_{k}^{+} = \frac{1}{\sqrt{k (n - k)}} \sum_{i = 1}^{k} {\tilde{Y}}_{(n - i + 1)} = \frac{1}{\sqrt{k (n - k)}} \sum_{i = 1}^{k} (X_{(n - i + 1)} - \bar{X}) / S_{X} .

b_{k}^{+}

is proportional to the sum of the k largest

{\tilde{Y}}_{(n - i + 1)}

. Set

B^{+} = {max}_{1 ⩽ k ⩽ s} b_{k}^{+} .

Hawkins’ method. If

B^{+} \leq B_{α}^{+} (n, s)

then it is concluded that outliers are absent; here

B_{α}^{+} (n, s)

is the

α

critical value of the statistic under

H_{0}

. If

B^{+} > B_{α}^{+} (n, s)

, then it is concluded that outliers exist. In such a case outliers are selected in the following way: if

b_{i}^{+} > B_{α}^{+} (n, s)

, then the value of the order statistic

X_{(n - i + 1)}

is admitted as an outlier,

i = 1, \dots, s

.
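The statistics b_k^+ and B^+ can be computed directly from the definition, as in the following Python sketch. The function name `hawkins_b` and the toy data are assumptions made for illustration; the critical value B_α^+(n, s) is not computed here.

```python
import math
import statistics


def hawkins_b(data, s):
    """Return [b_1^+, ..., b_s^+] computed from the standardized order statistics."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # divisor n - 1, as in the definition of S_X
    z_desc = sorted(((x - mean) / sd for x in data), reverse=True)
    b, partial_sum = [], 0.0
    for k in range(1, s + 1):
        partial_sum += z_desc[k - 1]  # sum of the k largest standardized values
        b.append(partial_sum / math.sqrt(k * (n - k)))
    return b


b = hawkins_b([1.0, 2.0, 3.0, 4.0, 5.0, 50.0], s=2)
print(b)       # b_1^+ and b_2^+
print(max(b))  # B^+
```

For this toy sample the single extreme value dominates, so the maximum is attained at k = 1.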

6. Comparative Analysis of Outlier Identification Methods by Simulation

In the case of location-scale classes, the probability distributions of all considered test statistics do not depend on

μ

and

σ

, so we generated samples of various sizes n with

n - r

observations with the c.d.f.

F_{0}

and r observations with various alternative distributions concentrated in the outlier region. We shall call such observations “contaminant outliers”, shortly c-outliers. As was mentioned, outliers which are not c-outliers, i.e., outliers from regular observations with the c.d.f.

F_{0}

, are very rare.

We repeated simulations

M =

100,000 times and using various methods we classified observations to outliers and non-outliers and computed the mean number

D_{O O}

of correctly identified c-outliers, the mean number

D_{O N}

of c-outliers which were not identified, the mean number

D_{N O}

of non c-outliers admitted as outliers, and the mean number

D_{N N}

of non c-outliers admitted as non-outliers.
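The four mean numbers can be obtained from the simulation output by simple counting. The sketch below is illustrative only: the function name, the boolean-list representation of each simulated sample, and the key `D_OO` (used here for the mean number of correctly identified c-outliers) are assumptions.

```python
def mean_counts(truth, declared):
    """Average the four classification counts over simulated samples.

    truth, declared -- lists (one entry per simulated sample) of boolean lists;
    truth[m][i] is True if observation i of sample m is a c-outlier,
    declared[m][i] is True if the method declared it an outlier.
    """
    totals = {"D_OO": 0, "D_ON": 0, "D_NO": 0, "D_NN": 0}
    for t_flags, d_flags in zip(truth, declared):
        for is_c_outlier, is_declared in zip(t_flags, d_flags):
            key = "D_" + ("O" if is_c_outlier else "N") + ("O" if is_declared else "N")
            totals[key] += 1
    m = len(truth)
    return {key: value / m for key, value in totals.items()}


truth = [[True, False, False], [True, True, False]]
declared = [[True, False, True], [False, True, False]]
print(mean_counts(truth, declared))
```

For every sample the four counts add up to n, so the four means add up to n as well; D_ON is the masking value and D_NO is the swamping value.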

An outlier identification method is ideal if each outlier is detected and each non-outlier is declared as a non-outlier. In practice this cannot be achieved with probability one. Two errors are possible: (a) an outlier is not declared as such (masking effect); (b) a non-outlier is declared as an outlier (swamping effect). We shall write shortly “masking value” for the mean number of non-detected c-outliers and “swamping value” for the mean number of “normal” observations declared as outliers in the simulated samples.

If swamping is small for two tests, then the test with the smaller masking effect should be preferred because in this case the distribution of the data remaining after excluding suspected outliers should be closer to the distribution of non-outlier data.

On the other hand, if swamping for Method 1 is considerably bigger than swamping for Method 2 and masking is smaller for Method 1, then it does not mean that Method 1 is better, because this method rejects many extreme non-outliers from the tails of the regular distribution

F_{0}

and the sample remaining after classification may not be treated as a sample from this regular distribution even if all c-outliers are eliminated.

For various families of distributions, sample sizes n, and alternatives we compared Davies-Gather (

D G

) and new (

B P

) methods performance. In the case of normal distribution we also compared them with Rosner’s, Bolshev’s and Hawking’s methods.

We used two different classes of alternatives: in the first case c-outliers are spread widely in the outlier region around the mean, in the second case c-outliers are concentrated in a very short interval lying in the outlier region. More precisely, if right outliers were searched, then we simulated r observations concentrated in the right outlier region

o u t_{r} (α_{n}, F_{0}) = {x : x > x_{α_{n}}}

using the following alternative families of distribution:

(1): Two-parameter exponential distribution $E (θ, x_{α_{n}})$ with the scale parameter $θ$ . If $θ$ is small, then outliers are concentrated near the border of the outlier region. If $θ$ is large, then outliers are widely spread in the outlier region. If $θ$ increases, then the mean of the outlier distribution increases. Please note that even if $θ$ is very near 0 and the true number of outliers r is large, these outliers may strongly corrupt the data, making the tails of the histogram too heavy.
(2): Truncated normal distribution $T N (x_{α_{n}}, μ, ρ)$ with the location and scale parameters $μ, ρ$ ( $μ > x_{α_{n}})$ . If $ρ$ is small then this distribution is concentrated in a small interval around $μ$ . If $μ$ increases, then the mean of outlier distribution increases.
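Generating c-outliers from these two families can be sketched with the Python standard library. The function names are assumptions; the border x_{α_n} of the outlier region is taken as a given input, and the second sampler is parametrized by a standard deviation `sd`, which is an interpretation chosen here for the scale parameter ρ.

```python
import random


def sample_shifted_exponential(rng, r, x_alpha, theta):
    """r draws from the exponential distribution with scale theta shifted to start at x_alpha."""
    return [x_alpha + rng.expovariate(1.0 / theta) for _ in range(r)]


def sample_truncated_normal(rng, r, x_alpha, mu, sd):
    """r draws from N(mu, sd^2) conditioned on exceeding x_alpha (rejection sampling)."""
    draws = []
    while len(draws) < r:
        x = rng.gauss(mu, sd)
        if x > x_alpha:
            draws.append(x)
    return draws


rng = random.Random(0)
exp_outliers = sample_shifted_exponential(rng, r=1000, x_alpha=3.0, theta=0.5)
tn_outliers = sample_truncated_normal(rng, r=200, x_alpha=3.0, mu=4.0, sd=0.1)
print(min(exp_outliers), min(tn_outliers))
```

Rejection sampling is adequate for the second family because μ lies inside the outlier region, so most proposals are accepted.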

For lack of space we present only a small part of our investigations. Please note that the results are very similar for all sample sizes

n \geq 20

. Multiple outlier problem is not very relevant for smaller sample sizes.

6.1. Investigation of Outlier Identification Methods for Normal Data

We use notation

B, H, R, D G

, and

B P

for the Bolshev’s, Hawkins’, Rosner’s, Davies-Gather’s, and the new methods, respectively. If

D G

method is based on maximum likelihood estimators, then we write

D G_{m l}

method, if it is based on robust estimators, we write

D G_{r o b}

method.

For comparison of above considered methods we fixed the significance level

α = 0.05

. Recall that the significance level

α

is the probability of rejecting at least one observation as an outlier under the hypothesis

H_{0}

, which states that all observations are realizations of i.i.d. random variables with the same normal distribution. The R method is the only one that uses approximate critical values of the test statistic, so the significance level of this test is only approximately

0.05

and depends on s and n. Figure 1 shows the true significance level for

s = 5, 15

and

[0.4 n]

as a function of n.

The

B, H

, and R methods have the drawback that an upper bound s for the possible number of outliers must be fixed. The

B P

and

D G

tests have an advantage that they do not require it.

Our investigations showed that the H, B and

D G_{m l}

methods have other serious drawbacks. So firstly let us look closer at these methods.

If the true number of c-outliers r exceeds s, then the B and H methods cannot find them even if they are very far from the limits of the outlier region. Nevertheless, suppose that r does not exceed s and look at the performance of the H method. Set

n = 100

,

s = 5

, and suppose that c-outliers are generated by right-truncated normal distribution

T N (x_{α_{n}}, μ, ρ)

with fixed

ρ

and increasing

μ

. Note that the true number of c-outliers is supposed to be unknown but does not exceed

s = 5

. In Figure 2 the mean numbers of rejected non-c-outliers

D_{N O}

are given in function of the parameter

μ

(the value of the parameter

ρ = 0 . 1^{2}

is fixed) for fixed values of r. In Table 7 the values of

D_{N O}

plus the values of the mean numbers of truly rejected c-outliers are given. Table 7 shows that if

r = 1

, then if

μ

is sufficiently large, the c-outlier is found but the number of rejected non-c-outliers

D_{N O}

increases to 4, so swamping is very large. Similarly, if

r = 2

, then

D_{N O}

increases to 3, so swamping is large. Beginning from

r = 3

not all c-outliers are found even for large

μ

. Swamping is smallest if the true value r coincides with s but even in this case one c-outlier is not found even for large

μ

. Taking into account that the true number r of c-outliers is not known in real data, the performance of the H method is very poor. Results are similar for other values of n, s, and distributions of c-outliers. As a rule, the H method finds the c-outliers rather well, but swamping is very large because for remote alternatives this method tends to reject a number of observations close to s, which is good if

r = s

but is bad if r is different from s.

The B and

D G_{m l}

tests have a drawback that they use maximum likelihood estimators which are not robust and estimate parameters badly in presence of outliers. Once more, set

n = 100

,

s = 5

, and suppose that c-outliers are generated by the two-parameter exponential distribution

T E (x_{α_{n}}, θ)

with increasing

θ

. Swamping values are negligible, so only masking values (mean numbers of non-rejected c-outliers

D_{O N}

) are important. In Figure 3 the masking values in function of the parameter

θ

are given for fixed values of r.

Both methods perform very similarly. The masking values are large for every value of

r > 1

. If r increases, then masking values increase, too. For example, if

r = 5

, then almost 3 c-outliers from 5 are not rejected on average even for large values of

θ

.

Similar results hold taking other values of n, s and various distributions of c-outliers.

The above analysis shows that the B, H,

D G_{m l}

methods have serious drawbacks, so we exclude these methods from further consideration.

Let us consider the remaining three methods: R,

D G

, and

B P

. For small n the true significance level of Rosner’s test differs considerably from the suggested level, so we present comparisons of the tests’ performance for

n = 50, 100, 1000

(see Table 8 and Table 9). Truncated exponential distribution was used for outliers simulation. Remoteness of the mean of outliers from the border of the outlier region is characterized by the parameter

θ

.

Swamping values

D_{N O}

(the mean numbers of non-c-outliers declared as outliers) are very small for all tests. For example, even if

n = 1000

, the R and

D G

methods reject on average as outliers only 0.05 from

n - r = 995, 980, 900

non-c-outliers. For the

B P

method this number is

0.25, 0.19, 0.05

from

995, 980

, and 900 non-c-outliers, respectively. So only masking values

D_{O N}

(the mean numbers of c-outliers declared as non-outliers) are important for outlier identification methods comparison.

Necessity to guess the upper limit s for a possible number of outliers is considered as a drawback of the Rosner’s method. Indeed, if the true number of outliers r is greater than the chosen upper limit s, then

r - s

outliers are not identified with the probability one. In addition, even if

r \leq s

, it is not clear how important the closeness of r to s is. So first we investigated the problem of the upper limit choice.

Here we present masking values

D_{O N}

of the Rosner’s tests for

s = 5, 15

and

[0.4 n]

. Similar results are obtained for other values of s.

Our investigations show that it is sufficient to fix

s = [0.4 n]

, which is clearly larger than it can be expected in real data. Indeed, Table 8 and Table 9 show that for

r > s

R o s n e r_{5}

and

R o s n e r_{15}

do not find

r - s

outliers even if they are very remote, as it should be. Nevertheless, we see that even if the true number of outliers r is much smaller than

[0.4 n]

, for any considered n,

r \leq s = 5, 15

the masking values of the

R o s n e r_{[0.4 n]}

test are approximately the same (even a little smaller) as the masking values of the tests

R o s n e r_{5}

and

R o s n e r_{15}

, for

r > s

they are clearly smaller.

Hence,

s = [0.4 n]

should be recommended for Rosner’s test application, and performance of

R o s n e r_{[0.4 n]}

, Davies-Gather robust (

D G_{r o b})

and the proposed

B P

methods should be compared.

All three methods find all c-outliers if they are sufficiently remote. For

n = 50

the

B P

method gives uniformly smallest masking values and the

D G

method gives uniformly largest masking values for any considered r over the whole range of alternatives. For

n = 100

and

r = 2, 5

the result is the same. For

n = 100

and

r = 10

(it means that even for very small

θ

the data is seriously corrupted) the BP method is also the best except that for the most remote alternatives the

R o s n e r_{[0.4 n]}

method slightly outperforms the BP method. For

n = 1000

and most alternatives the BP method strongly outperforms the other methods, except for the most remote alternatives.

The

D G

and Rosner’s methods have very large masking if many outliers are concentrated near the outlier region border. In this case data is seriously corrupted; however, these methods do not see outliers.

Conclusion: in most considered situations the

B P

method is the best outlier identification method. The second is Rosner’s method with

s = [0.4 n]

, and the third is the Davies-Gather method based on robust estimation. Other methods have poor performance.

6.2. Investigation of Outlier Identification Methods for Other Location-Scale Models

We investigated performance of the new method for location-scale families different from normal. We compare the

B P

method with the generalized Davies-Gather method for logistic, Laplace (symmetric,

F \in G_{0} \cap F_{l s}

), extreme values (non-symmetric

F \in G_{0} \cap F_{l s}

), and Cauchy (symmetric,

F \in G_{1} \cap F_{l s}

) families. C-outliers were generated using the truncated exponential distribution concentrated in the two-sided outlier region. Since swamping values are small, we compared masking values (see Table 10) and the differences between the true number of c-outliers and the number of rejected observations (see Figure 4 and Figure 5). The

B P

and

D G_{r o b}

methods find very well the most remote outliers; meanwhile, the

B P

method identifies much better closer outliers. The

D G_{r o b}

method identifies badly multiple outliers concentrated near the border of the outlier region, whereas the

B P

method does well. The

D G_{m l}

is not appropriate for multiple outlier search.

7. Conclusions

We compared by simulation outlier identification results of the new method and methods given in previous studies. Even in the case of the normal model, which is investigated by many authors, the new method shows excellent identification power. In many situations, it has superior performance as compared to existing methods.

The obtained results considerably widen the spectrum of commonly used non-regression models for which outlier identification methods are available. Many two-parameter models such as Weibull, logistic and loglogistic, extreme values, Cauchy, Laplace, and others can be investigated by applying the new method.

The advantage of the proposed outlier identification method is that it has very good potential for generalizations. The authors are at the completion stage of research on outlier identification methods for accelerated failure time regression models and generalized linear models, gamma regression model in particular. Outlier identification methods for time series is another direction of the future work. Possible direction is investigation of Gaussian mixture regression models (see [29]).

A limitation of the new method is that it cannot be applied to the analysis of discrete models. Taking into consideration that the method is based on asymptotic results, we recommend not applying it to samples of very small size

n \leq 15

.

The R package outliersTests was created for the practical use of the proposed test.

Author Contributions

Investigation, V.B. and L.P.; Methodology, V.B. and L.P.; Supervision, V.B.; Writing—original draft, V.B. and L.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Bol’shev, L.; Ubaidullaeva, M. Chauvenet’s Test in the Classical Theory of Errors. Theory Probab. Appl. 1975, 19, 683–692.
2. Davies, L.; Gather, U. The Identification of Multiple Outliers. J. Am. Stat. Assoc. 1993, 88, 782–792.
3. Dixon, W.J. Analysis of Extreme Values. Ann. Math. Stat. 1950, 21, 488–506.
4. Grubbs, F.E. Sample Criteria for Testing Outlying Observations. Ann. Math. Stat. 1950, 21, 27–58.
5. Rosner, B. On the Detection of Many Outliers. Technometrics 1975, 17, 221–227.
6. Tietjen, G.L.; Moore, R.H. Some Grubbs-Type Statistics for the Detection of Several Outliers. Technometrics 1972, 14, 583–597.
7. Barnett, V.; Lewis, T. Outliers in Statistical Data; John Wiley & Sons: Hoboken, NJ, USA, 1974.
8. Zerbet, A. Statistical Tests for Normal Family in Presence of Outlying Observations. In Goodness-of-Fit Tests and Model Validity; Huber-Carol, C., Balakrishnan, N., Nikulin, M.S., Mesbah, M., Eds.; Birkhäuser Boston: Basel, Switzerland, 2002; pp. 57–64.
9. Chikkagoudar, M.; Kunchur, S.H. Distributions of test statistics for multiple outliers in exponential samples. Commun. Stat. Theory Methods 1983, 12, 2127–2142.
10. Kabe, D.G. Testing outliers from an exponential population. Metrika 1970, 15, 15–18.
11. Kimber, A. Testing upper and lower outlier pairs in gamma samples. Commun. Stat. Simul. Comput. 1988, 17, 1055–1072.
12. Lalitha, S.; Kumar, N. Multiple outlier test for upper outliers in an exponential sample. J. Appl. Stat. 2012, 39, 1323–1330.
13. Lewis, T.; Fieller, N.R.J. A Recursive Algorithm for Null Distributions for Outliers: I. Gamma Samples. Technometrics 1979, 21, 371–376.
14. Likeš, I.J. Distribution of Dixon’s statistics in the case of an exponential population. Metrika 1967, 11, 46–54.
15. Lin, C.T.; Balakrishnan, N. Exact computation of the null distribution of a test for multiple outliers in an exponential sample. Comput. Stat. Data Anal. 2009, 53, 3281–3290.
16. Lin, C.T.; Balakrishnan, N. Tests for Multiple Outliers in an Exponential Sample. Commun. Stat. Simul. Comput. 2014, 43, 706–722.
17. Zerbet, A.; Nikulin, M. A new statistic for detecting outliers in exponential case. Commun. Stat. Theory Methods 2003, 32, 573–583.
18. Torres, J.M.; Pastor Pérez, J.; Sancho Val, J.; McNabola, A.; Martínez Comesaña, M.; Gallagher, J. A functional data analysis approach for the detection of air pollution episodes and outliers: A case study in Dublin, Ireland. Mathematics 2020, 8, 225.
19. Gaddam, A.; Wilkin, T.; Angelova, M.; Gaddam, J. Detecting Sensor Faults, Anomalies and Outliers in the Internet of Things: A Survey on the Challenges and Solutions. Electronics 2020, 9, 511.
20. Ferrari, E.; Bosco, P.; Calderoni, S.; Oliva, P.; Palumbo, L.; Spera, G.; Fantacci, M.E.; Retico, A. Dealing with confounders and outliers in classification medical studies: The Autism Spectrum Disorders case study. Artif. Intell. Med. 2020, 108, 101926.
21. Zhang, C.; Xiao, X.; Wu, C. Medical Fraud and Abuse Detection System Based on Machine Learning. Int. J. Environ. Res. Public Health 2020, 17, 7265.
22. Souza, T.I.; Aquino, A.L.; Gomes, D.G. A method to detect data outliers from smart urban spaces via tensor analysis. Future Gener. Comput. Syst. 2019, 92, 290–301.
23. Hawkins, D.M. Identification of Outliers; Springer: Dordrecht, The Netherlands, 1980; Volume 11.
24. Kimber, A.C. Tests for Many Outliers in an Exponential Sample. J. R. Stat. Soc. 1982, 31, 263–271.
25. De Haan, L.; Ferreira, A. Extreme Value Theory: An Introduction; Springer: New York, NY, USA, 2007.
26. Rousseeuw, P.J.; Croux, C. Alternatives to the median absolute deviation. J. Am. Stat. Assoc. 1993, 88, 1273–1283.
27. Liu, Y.; Abeyratne, A.I. Practical Applications of Bayesian Reliability; John Wiley & Sons: Hoboken, NJ, USA, 2019.
28. Rosner, B. Percentage points for the RST many outlier procedure. Technometrics 1977, 19, 307–312.
29. Su, H.; Hu, Y.; Karimi, H.R.; Knoll, A.; Ferrigno, G.; De Momi, E. Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results. Neural Netw. 2020, 131, 291–299.

Figure 1. The true significance levels of Rosner’s and $BP$ tests as a function of n for different values of s ($α = 0.05$ is used in the approximations).

Figure 2. Hawkins’s method: the values of $D_{NO} + D_{OO}$ as a function of $μ$ and r ($n = 100$, $s = 5$).

Figure 3. The number of outliers classified as non-outliers ($D_{ON}$). The alternative is two-sided, with outliers generated by a two-parameter exponential distribution on both sides.

Figure 4. The difference between the number of outliers and the number of rejected observations for sample size $n = 100$ and $r = 10$ outliers.

Figure 5. The difference between the number of outliers and the number of rejected observations for sample size $n = 100$ and $r = 10$ outliers.

Table 1. Expressions of $b_{n}$ and $a_{n}$.

Distribution	$F_{0}(x)$	$b_{n}$	$a_{n}$
Normal	$Φ(x)$	$Φ^{-1}(1 - 1/n)$	$1/b_{n}$
Type I extreme value	$1 - e^{-e^{x}}$	$\ln \ln n$	$e^{-b_{n}}$
Type II extreme value	$e^{-e^{-x}}$	$-\ln(-\ln(1 - 1/n))$	$e^{b_{n}}/(n - 1)$
Logistic	$\frac{1}{1 + e^{-x}}$	$\ln(n - 1)$	$n/(n - 1)$
Laplace	$\frac{1}{2} + \frac{1}{2} \operatorname{sign}(x)(1 - e^{-|x|})$	$\ln(n/2)$	1
Cauchy	$\frac{1}{2} + \frac{1}{π} \arctan(x)$	$\cot(\frac{π}{n})$	$\frac{π}{n} / \sin^{2}(\frac{π}{n})$
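
The constants in Table 1 can be evaluated numerically. The following is a minimal Python sketch, not part of the authors' R package; the function names `normalizing_constants` and `survival` are illustrative assumptions. It also checks that each $b_{n}$ solves $1 - F_{0}(b_{n}) = 1/n$, which is the relation the Normal and Logistic rows satisfy by inspection. For the Type II extreme value row the sketch uses the root of that equation, $b_{n} = -\ln(-\ln(1 - 1/n))$.

```python
import math
from statistics import NormalDist

FAMILIES = ("normal", "type1_ev", "type2_ev", "logistic", "laplace", "cauchy")

def normalizing_constants(family, n):
    """Return (b_n, a_n) as listed in Table 1 for sample size n (n >= 3)."""
    if family == "normal":
        b = NormalDist().inv_cdf(1 - 1 / n)
        return b, 1 / b
    if family == "type1_ev":
        b = math.log(math.log(n))
        return b, math.exp(-b)
    if family == "type2_ev":
        # Root of 1 - F0(b) = 1/n for F0(x) = exp(-exp(-x)).
        b = -math.log(-math.log(1 - 1 / n))
        return b, math.exp(b) / (n - 1)
    if family == "logistic":
        return math.log(n - 1), n / (n - 1)
    if family == "laplace":
        return math.log(n / 2), 1.0
    if family == "cauchy":
        t = math.pi / n
        return 1 / math.tan(t), t / math.sin(t) ** 2
    raise ValueError(f"unknown family: {family}")

def survival(family, x):
    """Return 1 - F0(x) for the standard member of each family."""
    if family == "normal":
        return 1 - NormalDist().cdf(x)
    if family == "type1_ev":
        return math.exp(-math.exp(x))
    if family == "type2_ev":
        return -math.expm1(-math.exp(-x))
    if family == "logistic":
        return 1 / (1 + math.exp(x))
    if family == "laplace":
        return 0.5 * math.exp(-x) if x >= 0 else 1 - 0.5 * math.exp(x)
    if family == "cauchy":
        return 0.5 - math.atan(x) / math.pi
    raise ValueError(f"unknown family: {family}")

if __name__ == "__main__":
    n = 100
    for fam in FAMILIES:
        b, a = normalizing_constants(fam, n)
        # n * (1 - F0(b_n)) should be close to 1 for every family.
        print(f"{fam:9s} b_n={b:.4f} a_n={a:.4f} n*(1-F0(b_n))={n * survival(fam, b):.6f}")
```

The check covers only $b_{n}$. The normal-row value $a_{n} = 1/b_{n}$ is an asymptotic approximation, so no exact identity is tested for $a_{n}$.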

Table 2. Values of d for various probability distributions.

Distribution	$K_{0}(x)$	d
Normal	$Φ(x/\sqrt{2})$	2.2219
Type I extreme value	$1/(1 + e^{-x})$	1.9576
Type II extreme value	$1/(1 + e^{-x})$	1.9576
Logistic	$1 - \frac{(x - 1) e^{x} + 1}{(e^{x} - 1)^{2}}$	1.3079
Laplace	$1 - \frac{1}{2}(1 + \frac{x}{2}) e^{-x}$	1.9306
Cauchy	$\frac{1}{2} + \frac{1}{π} \arctan(x/2)$	1.2071

Table 3. Illustrative sample ($n = 20$, $r = 7$).

i	$x_{i}$	$|\hat{Y}_{i}|$	$(i)$	i	$x_{i}$	$|\hat{Y}_{i}|$	$(i)$
1	6.10	3.18	16	11	−0.69	0.28	9
2	10	5.17	18	12	−0	0.07	5
3	6.20	3.23	17	13	0.05	0.10	6
4	−0.08	0.03	2	14	−0.20	0.03	1
5	0.63	0.39	11	15	−0.25	0.06	4
6	−0.54	0.21	7	16	−0.64	0.25	8
7	1.37	0.77	13	17	−6.30	3.14	15
8	0.46	0.30	10	18	−5.50	2.73	14
9	−0.22	0.04	3	19	−12.10	6.10	19
10	0.94	0.55	12	20	−20	10.13	20

Table 4. Illustrative example of the classification of observations by the $BP$ test.

$U_{(20)} (20)$	$U_{(19)} (20)$	$U_{(18)} (20)$	$U_{(17)} (20)$	$U_{(16)} (20)$	$U (20, 5)$
1.000000	1.000000	1.000000	0.999998	1.000000	1.000000
$U_{(19)} (19)$	$U_{(18)} (19)$	$U_{(17)} (19)$	$U_{(16)} (19)$	$U_{(15)} (19)$	$U (19, 5)$
0.999685	0.999998	0.999916	0.999998	1.000000	1.000000
$U_{(18)} (18)$	$U_{(17)} (18)$	$U_{(16)} (18)$	$U_{(15)} (18)$	$U_{(14)} (18)$	$U (18, 5)$
0.998046	0.996970	0.999893	0.999997	0.999997	0.999997
$U_{(17)} (17)$	$U_{(16)} (17)$	$U_{(15)} (17)$	$U_{(14)} (17)$	$U_{(13)} (17)$	$U (17, 5)$
0.924219	0.996446	0.999871	0.999940	0.084290	0.999940

Table 5. Values of goodness-of-fit statistics and information criteria (initial sample).

Goodness-of-Fit Statistics	Weibull	Logistic	Log-Normal
Kolmogorov-Smirnov statistic	0.05	0.09	0.07
Cramer-von Mises statistic	0.03	0.23	0.127
Anderson-Darling statistic	0.21	1.36	1.08
Goodness-of-fit criteria
Akaike’s Information Criterion	1056.515	1074.783	1073.13
Bayesian Information Criterion	1061.725	1079.993	1078.34

Table 6. Values of goodness-of-fit statistics and information criteria (sample with the identified outliers removed).

Goodness-of-Fit Statistics	Weibull	Logistic	Log-Normal
Kolmogorov-Smirnov statistic	0.048	0.09	0.07
Cramer-von Mises statistic	0.027	0.21	0.11
Anderson-Darling statistic	0.18	1.25	1.01
Goodness-of-fit criteria
Akaike’s Information Criterion	1037.09	1054.76	1053.49
Bayesian Information Criterion	1042.26	1059.93	1058.66
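
The two information criteria in Tables 5 and 6 differ by a constant within each table, which allows a quick consistency check. The sketch below assumes the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, and assumes k = 2 fitted parameters for each of the three families; neither assumption is stated in the tables. The helper name `implied_sample_size` is hypothetical. Under these assumptions, BIC − AIC = k(ln n − 2) can be solved for the sample size n behind each column.

```python
import math

def implied_sample_size(aic, bic, k=2):
    """Solve BIC - AIC = k * ln(n) - 2 * k for n (k = number of parameters, assumed)."""
    return math.exp((bic - aic) / k + 2)

# (AIC, BIC) pairs copied from Table 5 (initial sample) and Table 6.
table5 = {
    "Weibull": (1056.515, 1061.725),
    "Logistic": (1074.783, 1079.993),
    "Log-Normal": (1073.13, 1078.34),
}
table6 = {
    "Weibull": (1037.09, 1042.26),
    "Logistic": (1054.76, 1059.93),
    "Log-Normal": (1053.49, 1058.66),
}

if __name__ == "__main__":
    for label, table in (("Table 5", table5), ("Table 6", table6)):
        for model, (aic, bic) in table.items():
            n = implied_sample_size(aic, bic)
            print(f"{label} {model:10s} implied n = {n:.1f}")
```

With these inputs the implied sample size is about 100 for every column of Table 5 and about 98 for every column of Table 6. This suggests that two observations were removed between the two fits, although that is an inference from the rounded table values rather than a figure stated here.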

Table 7. Hawkins’s method: the values of $D_{NO} + D_{OO}$ as a function of $μ$ and r ($n = 100$, $s = 5$).

r \ $µ$	0.1	1	6.3	10
1	0.31 + 0.00	0.66 + 0.00	3.93 + 1.00	3.99 + 1.00
2	0.87 + 0.00	2.15 + 0.06	3.00 + 1.21	3.00 + 2.00
3	1.33 + 0.08	1.99 + 0.84	2.00 + 2.00	2.00 + 2.00
4	0.89 + 0.58	1.00 + 1.42	1.00 + 3.00	1.00 + 3.00
5	0.01 + 1.15	0.00 + 2.03	0.00 + 3.02	0.00 + 3.96

Table 8. The masking values $D_{ON}$ ($n = 50$ and $n = 100$).

	$n = 50$							$n = 100$
r	Method \ $θ$	0.1	0.4	1	4	10	r	0.1	0.4	1	4	10
2	$Rosner_{5}$	1.36	0.95	0.51	0.15	0.06	2	1.19	0.71	0.33	0.09	0.04
	$Rosner_{15}$	1.36	0.95	0.51	0.15	0.06		1.19	0.71	0.33	0.09	0.04
	$Rosner_{[0.4 n]}$	1.36	0.95	0.51	0.15	0.06		1.19	0.71	0.33	0.09	0.04
	$DG_{rob}$	1.56	1.17	0.71	0.24	0.10		1.31	0.84	0.44	0.13	0.06
	$BP$	0.92	0.66	0.37	0.10	0.04		0.50	0.32	0.15	0.04	0.02
5	$Rosner_{5}$	3.79	3.31	2.11	0.48	0.16	5	3.52	2.57	1.27	0.27	0.10
	$Rosner_{15}$	3.66	3.21	2.04	0.46	0.16		3.43	2.52	1.24	0.26	0.10
	$Rosner_{[0.4 n]}$	3.66	3.21	2.04	0.46	0.16		3.43	2.52	1.24	0.26	0.10
	$DG_{rob}$	4.70	4.10	2.90	1.09	0.48		4.23	3.01	1.81	0.57	0.25
	$BP$	2.00	1.68	1.18	0.40	0.15		0.78	0.60	0.43	0.15	0.07
8	$Rosner_{5}$	8.00	7.97	7.54	3.70	3.06	10	10.0	9.90	8.21	5.10	5.00
	$Rosner_{15}$	5.70	5.48	4.52	1.00	0.29		6.88	6.54	4.36	0.69	0.22
	$Rosner_{[0.4 n]}$	5.70	5.48	4.52	1.00	0.29		6.88	6.54	4.36	0.69	0.22
	$DG_{rob}$	7.90	7.49	6.10	2.67	1.24		9.74	8.38	5.78	2.12	0.92
	$BP$	4.27	3.84	3.25	1.47	0.57		2.21	1.90	1.73	0.74	0.30

Table 9. The masking values $D_{ON}$ ($n = 1000$).

r	Method \ $θ$	0.1	0.4	1	4	1000
5	$Rosner_{5}$	2.15	0.69	0.29	0.07	0.00
	$Rosner_{15}$	2.12	0.66	0.27	0.07	0.00
	$Rosner_{[0.4 n]}$	2.12	0.66	0.27	0.07	0.00
	$DG_{rob}$	1.99	0.78	0.35	0.09	0.00
	$BP$	0.25	0.23	0.22	0.11	0.00
20	$Rosner_{5}$	19.0	15.8	15.0	15.0	15.0
	$Rosner_{15}$	19.2	10.9	5.52	5.00	5.00
	$Rosner_{[0.4 n]}$	12.7	6.94	1.76	0.30	0.00
	$DG_{rob}$	14.8	6.97	3.32	1.93	0.00
	$BP$	0.29	0.26	0.23	0.18	0.00
100	$Rosner_{5}$	100	99.9	96.7	95.0	95.0
	$Rosner_{15}$	100	99.92	96.4	85.0	85.0
	$Rosner_{[0.4 n]}$	55.8	56.8	50.4	4.43	0.01
	$DG_{rob}$	100	89.9	61.6	22.2	0.1
	$BP$	4.72	4.00	3.95	3.58	0.04

Table 10. Masking values for the logistic, Laplace, type II extreme value and Cauchy distributions when $n = 100$, $r = 5$.

	Logistic				Laplace
Method \ $θ$	0.1	1	6.3	10	0.1	1	6.3	10
$DG_{ML}$	5	4.89	3.64	3.42	5	4.96	3.98	3.78
$DG_{rob}$	4.21	2.69	0.76	0.51	4.27	2.98	0.87	0.59
$BP$	1.3	1.13	0.78	0.64	1.31	1.21	0.8	0.66
	Extreme Value II				Cauchy
Method \ $θ$	0.1	1	6.3	10	1	100	1000	$10^{5}$
$DG_{ML}$	4.96	4.19	3	2.9	5	5	5	5
$DG_{rob}$	4.29	2.25	0.59	0.4	3.81	2.89	0.8	0.01
$BP$	1.25	0.56	0.14	0.11	0.38	0.4	0.39	0.13

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bagdonavičius, V.; Petkevičius, L. Multiple Outlier Detection Tests for Parametric Models. Mathematics 2020, 8, 2156. https://doi.org/10.3390/math8122156

