Article

Multiple Outlier Detection Tests for Parametric Models

by Vilijandas Bagdonavičius 1 and Linas Petkevičius 2,*
1 Institute of Applied Mathematics, Vilnius University, Naugarduko 24, LT-03225 Vilnius, Lithuania
2 Institute of Computer Science, Vilnius University, Didlaukio 47, LT-08303 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
Mathematics 2020, 8(12), 2156; https://doi.org/10.3390/math8122156
Submission received: 8 November 2020 / Revised: 27 November 2020 / Accepted: 30 November 2020 / Published: 3 December 2020
(This article belongs to the Section Probability and Statistics)

Abstract: We propose a simple multiple outlier identification method for parametric location-scale and shape-scale models when the number of possible outliers is not specified. The method is based on a result giving the asymptotic properties of extreme z-scores, where the z-scores are defined using robust estimators of the model parameters. An extensive simulation study was carried out to compare the proposed method with existing ones. For the normal family, the method is compared with the well-known Davies-Gather, Rosner's, Hawkins', and Bolshev's multiple outlier identification methods. The choice of an upper limit for the number of possible outliers in the case of Rosner's test is discussed. For other families, the proposed method is compared with a method generalizing the Davies-Gather method. In most situations, the new method has the highest outlier identification power in terms of masking and swamping values. We also created the R package outliersTests implementing the proposed test.

1. Introduction

The problem of multiple outlier identification has received the attention of many authors. The majority of outlier identification methods define rules for the rejection of the most extreme observations. The bulk of publications is concentrated on the normal distribution (see [1,2,3,4,5,6]; surveys in [7,8]). In the non-normal case, most of the literature pertains to the exponential and gamma distributions, see [9,10,11,12,13,14,15,16,17]. Outlier identification is important when analyzing data collected in a wide range of areas: pollution [18], IoT [19], medicine [20], fraud [21], smart city applications [22], and many more.
When constructing outlier identification methods, most authors suppose that the number s of observations suspected to be outliers is specified. These methods have a serious drawback: only two conclusions are possible: either exactly s observations are declared outliers, or it is concluded that outliers are absent. It is more natural to consider methods which do not specify the number of suspected observations, or at least specify only an upper limit s for it. Such methods are not very numerous, and they concern normal or exponential samples: the methods of [1,5,23] for normal samples and of [15,16,24] for exponential samples. The only method which does not specify the upper limit s is the method of [2] for normal samples.
We give a competitive and simple method for outlier identification in samples from location-scale and shape-scale families of probability distributions. The upper limit s is not specified, as in the case of the Davies-Gather method. The method is based on a theorem giving the asymptotic properties of extreme z-scores, where the z-scores are defined using robust estimators of the model parameters.
The investigation below shows that the proposed outlier identification method has superior performance compared to existing methods. The proposed method considerably widens the scope of models applicable in the statistical analysis of real data. Besides the normal family, many two-parameter families, such as the Weibull, logistic, loglogistic, extreme value, Cauchy, and Laplace families, can be used for outlier identification. So the method may be useful for models which cannot be symmetrized by simple transformations such as the log-transform.
An advantage of the new method is that complicated computing is not needed, because the critical values of the test statistic do not have to be found by simulation for each sample size. This allowed us to create the R package outliersTests, which can be used for outlier search in real datasets. Another advantage is the very good potential for generalizations of the proposed method to regression, time series, and other models. To have a competitor, we present not only the new method but also a generalization of the Davies-Gather method to non-normal data.
In Section 2 we present a short overview of the notion of the outlier region given by [2]. In Section 3 we give asymptotic properties of extreme z-scores based on equivariant estimators of the model parameters and introduce a new outlier identification method for parametric models based on the asymptotic result and robust estimators. In Section 4 we consider rather evident generalizations of the Davies-Gather tests for normal data to location-scale families. In Section 5 we give a short overview of known multiple outlier identification methods for normal samples which do not specify an exact number of suspected outliers. In Section 6 we compare the performance of the new and existing methods.

2. Outliers and Outlier Regions

Suppose that the data are independent random variables $X_1, \ldots, X_n$. Denote by $F_i(x)$ the c.d.f. of $X_i$.
Let $\mathcal{F}_0 = \{F(x, \theta),\ \theta \in \Theta \subset \mathbb{R}^m\}$ be a parametric family of absolutely continuous cumulative distribution functions with continuous unimodal densities $f$ on the support $\mathrm{supp}(F)$ of the c.d.f. $F$.
Suppose that if the data are not contaminated with unusual observations, then the following null hypothesis $H_0$ is true: there exists $\theta \in \Theta$ such that
$$F_1(x) = \cdots = F_n(x) = F(x, \theta).$$
There are two different definitions of an outlier. In the first, an outlier is an observation which falls into some outlier region $out(X)$. The outlier region is a set such that the probability that at least one observation from a sample falls into it is small if the hypothesis $H_0$ is true. In such a case the probability that a specified observation $X_i$ falls into $out(X)$ is very small. If an observation $X_i$ has a distribution different from that under $H_0$, then this probability may be considerably higher.
In the second case, the value $x_i$ of $X_i$ is an outlier if the probability distribution of $X_i$ is different from that under $H_0$, formally $F_i \neq F(x, \theta)$. In this case, outliers are often called contaminants.
Therefore, in the first case, there is a very small probability of having an outlier under $H_0$. If the hypothesis $H_0$ holds, then contaminants are absent, and with very small probability some outliers (in the first sense) are possible. If contaminants are present, then the hypothesis $H_0$ is not true. Nevertheless, contaminants are not necessarily outliers (in the first sense) because it is possible that they do not fall into the outlier region. So the two notions are different. Both definitions give approximately the same outliers if the alternative distribution is concentrated in the outlier region. Precisely such contaminants can be called outliers in the sense that outliers are anomalous extreme observations. In such a case it is possible to compare outlier and contaminant search methods.
In this paper, we consider location-scale and shape-scale families. Location-scale families have the form $\mathcal{F}_{ls} = \{F_0((x - \mu)/\sigma),\ \mu \in \mathbb{R},\ \sigma > 0\}$ with a completely specified baseline c.d.f. $F_0$ and p.d.f. $f_0$. Shape-scale families have the form $\mathcal{F}_{ss} = \{G_0((x/\theta)^{\nu}),\ \theta, \nu > 0\}$ with a completely specified baseline c.d.f. $G_0$ and p.d.f. $g_0$. By a logarithmic transformation shape-scale families are transformed to location-scale families, so we concentrate on location-scale families. Methods for such families are easily modified to methods for shape-scale families.
The right-sided $\alpha$-outlier region for a location-scale family is
$$out_r(\alpha, F) = \{x \in \mathbb{R}: x > \mu + \sigma F_0^{-1}(1 - \alpha)\}$$
and the left-sided $\alpha$-outlier region is
$$out_l(\alpha, F) = \{x \in \mathbb{R}: x < \mu + \sigma F_0^{-1}(\alpha)\}.$$
The two-sided $\alpha$-outlier region has the form
$$out(\alpha, F) = \mathbb{R} \setminus [\mu + \sigma F_0^{-1}(\alpha/2),\ \mu + \sigma F_0^{-1}(1 - \alpha/2)].$$
If $f_0$ is symmetric, then the two-sided outlier region is simpler:
$$out(\alpha, F) = \{x \in \mathbb{R}: |x - \mu| > \sigma F_0^{-1}(1 - \alpha/2)\}.$$
The value of $\alpha$ is chosen depending on the size $n$ of the sample: $\alpha = \alpha_n$. The choice is based on the requirement that under $H_0$, for some $\bar\alpha$ close to zero,
$$P\Big\{\bigcap_{i=1}^n \{X_i \notin out(\alpha_n, F)\}\Big\} = \big(P\{X_1 \notin out(\alpha_n, F)\}\big)^n = 1 - \bar\alpha. \qquad (3)$$
The equality (3) means that under $H_0$ the probability that none of the $X_i$ falls into the $\alpha_n$-outlier region is $1 - \bar\alpha$. It implies that
$$\alpha_n = 1 - (1 - \bar\alpha)^{1/n}. \qquad (4)$$
The sequence $\alpha_n$ decreases from $\bar\alpha$ to 0 as $n$ goes from 1 to $\infty$.
The first definition of an outlier is then as follows: for a sample of size $n$, a realization $x_i$ of $X_i$ is called an outlier if $x_i \in out(\alpha_n, F)$; $x_i$ is called a right outlier if $x_i \in out_r(\alpha_n, F)$.
The number of outliers $D_n$ under $H_0$ has the binomial distribution $B(n, \alpha_n)$, and the expected number of outliers in the sample under $H_0$ is $E D_n = n\alpha_n$. Note that $E D_n \to -\ln(1 - \bar\alpha) \approx \bar\alpha$ as $n \to \infty$. For example, if $\bar\alpha = 0.05$, then $-\ln(1 - \bar\alpha) \approx 0.05129$, and for $n \geq 10$ the expected number of outliers is approximately 0.051, i.e., it practically does not depend on $n$. So under $H_0$ the expected number of outliers, about 0.051, is negligible with respect to the sample size $n$.
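As a quick numerical illustration (our own minimal R sketch, not part of the original derivation), $\alpha_n$ and the expected number of outliers $n\alpha_n$ can be computed directly from (4):

```r
# alpha_n = 1 - (1 - alpha_bar)^(1/n) and the expected number of outliers
# n * alpha_n under H0; the latter stabilizes near -log(1 - alpha_bar).
alpha_bar <- 0.05
n <- c(1, 10, 100, 1000)
alpha_n <- 1 - (1 - alpha_bar)^(1 / n)
data.frame(n = n, alpha_n = alpha_n, expected = n * alpha_n)
# expected is ~0.0513 for n >= 10, matching -log(1 - 0.05) = 0.05129
```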

3. New Method

3.1. Preliminary Results

Suppose that the c.d.f. $F \in \mathcal{F}_{ls}$ belongs also to the domain of attraction $G_\gamma$, $\gamma \geq 0$ (see [25]).
If $F \in G_0 \cap \mathcal{F}_{ls}$, then there exist normalizing constants $a_n > 0$ and $b_n \in \mathbb{R}$ such that $\lim_{n \to \infty} F_0^n(a_n x + b_n) = e^{-e^{-x}}$. Similarly, if $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma > 0$, then $\lim_{n \to \infty} F_0^n(a_n x + b_n) = e^{-(1+x)^{-1/\gamma}}$ for $x > -1$ and $\lim_{n \to \infty} F_0^n(a_n x + b_n) = 0$ for $x \leq -1$.
One possible choice of the sequences $\{b_n\}$ and $\{a_n\}$ is
$$b_n = F_0^{-1}\Big(1 - \frac{1}{n}\Big), \qquad a_n = \frac{1}{n f_0(b_n)}. \qquad (5)$$
In the particular case of the normal distribution the equivalent form $a_n = 1/b_n$ can be used. Expressions of $b_n$ and $a_n$ for some of the most used distributions are given in Table 1.
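For concreteness, here is a minimal R sketch (ours) evaluating (5) for a given baseline distribution; the quantile and density functions play the roles of $F_0^{-1}$ and $f_0$:

```r
# Normalizing constants b_n = F0^{-1}(1 - 1/n) and a_n = 1/(n f0(b_n)).
norm_constants <- function(n, qf = qnorm, df = dnorm) {
  b_n <- qf(1 - 1 / n)
  c(b_n = b_n, a_n = 1 / (n * df(b_n)))
}
norm_constants(100)                  # standard normal baseline
norm_constants(100, qlogis, dlogis)  # logistic baseline
```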
Condition A.
(a) $\hat\mu$ and $\hat\sigma$ are consistent estimators of $\mu$ and $\sigma$;
(b) the limit distribution of $(\sqrt{n}(\hat\mu - \mu), \sqrt{n}(\hat\sigma - \sigma))$ is non-degenerate;
(c) $\lim_{x \to \infty} \dfrac{x f_0(x)}{\sqrt{1 - F_0(x)}} = 0$.
Condition A (c) is satisfied for many location-scale models, including the normal, type I extreme value, type II extreme value, logistic, and Laplace models ($F \in G_0$), and the Cauchy model ($F \in G_1$).
Set $Y_i = (X_i - \mu)/\sigma$, $\hat Y_i = (X_i - \hat\mu)/\hat\sigma$. The random variables $\hat Y_i$ are called z-scores. Denote by $Y_{(1)} \leq Y_{(2)} \leq \cdots \leq Y_{(n)}$ and $\hat Y_{(1)} \leq \cdots \leq \hat Y_{(n)}$ the respective order statistics.
The following theorem is useful for the construction of right outlier detection tests.
Theorem 1.
If $F \in G_0 \cap \mathcal{F}_{ls}$ and Condition A holds, then for fixed $s$
$$\Big(\frac{\hat Y_{(n)} - b_n}{a_n}, \frac{\hat Y_{(n-1)} - b_n}{a_n}, \ldots, \frac{\hat Y_{(n-s+1)} - b_n}{a_n}\Big) \stackrel{d}{\to} L_0 = \big(-\ln E_1, -\ln(E_1 + E_2), \ldots, -\ln(E_1 + \cdots + E_s)\big)$$
as $n \to \infty$, where $E_1, \ldots, E_s$ are i.i.d. standard exponential random variables.
If $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma > 0$, and Condition A holds, then the limit random vector is
$$L_\gamma = \big(E_1^{-1} - 1, (E_1 + E_2)^{-1} - 1, \ldots, (E_1 + \cdots + E_s)^{-1} - 1\big).$$
Proof of Theorem 1.
Note that
$$\frac{\hat Y_{(n-i+1)} - b_n}{a_n} = \frac{Y_{(n-i+1)} - b_n}{a_n} \cdot \frac{\sigma}{\hat\sigma} + \frac{\mu - \hat\mu}{\hat\sigma a_n} + \frac{b_n}{a_n} \cdot \frac{\sigma - \hat\sigma}{\hat\sigma}. \qquad (6)$$
The $s$-dimensional random vector whose $i$th component is the first term on the right side converges in distribution to the random vector given in the formulation of the theorem. This follows from Theorem 2.1.1 of [25] and Condition A (a). So it suffices to show that the second and the third terms converge to zero in probability. The second term is
$$\sqrt{n}\, f_0\Big(F_0^{-1}\Big(1 - \frac{1}{n}\Big)\Big)\, \frac{\sqrt{n}(\mu - \hat\mu)}{\hat\sigma},$$
and the third term is
$$\sqrt{n}\, F_0^{-1}\Big(1 - \frac{1}{n}\Big) f_0\Big(F_0^{-1}\Big(1 - \frac{1}{n}\Big)\Big)\, \frac{\sqrt{n}(\sigma - \hat\sigma)}{\hat\sigma}.$$
By Condition A (c),
$$\lim_{n \to \infty} \sqrt{n}\, F_0^{-1}\Big(1 - \frac{1}{n}\Big) f_0\Big(F_0^{-1}\Big(1 - \frac{1}{n}\Big)\Big) = \lim_{x \to \infty} \frac{x f_0(x)}{\sqrt{1 - F_0(x)}} = 0.$$
It also implies $\lim_{n \to \infty} \sqrt{n}\, f_0(F_0^{-1}(1 - \frac{1}{n})) = 0$, because $\lim_{n \to \infty} F_0^{-1}(1 - \frac{1}{n}) = \infty$. This completes the proof. □
Remark 1.
Note that $2(E_1 + \cdots + E_i) \sim \chi^2(2i)$. This implies that if $F \in G_0 \cap \mathcal{F}_{ls}$, then for fixed $i$, $i = 1, \ldots, s$,
$$P\{(\hat Y_{(n-i+1)} - b_n)/a_n \leq x\} \to 1 - F_{\chi^2_{2i}}(2 e^{-x}) \quad \text{as } n \to \infty.$$
Similarly, if $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma > 0$, then for fixed $i$, $i = 1, \ldots, s$,
$$P\{(\hat Y_{(n-i+1)} - b_n)/a_n \leq x\} \to 1 - F_{\chi^2_{2i}}(2/(1 + x)) \quad \text{as } n \to \infty.$$
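Remark 1 can be checked by a quick Monte Carlo experiment (our own sketch, using the known parameters $\mu = 0$, $\sigma = 1$); for $i = 1$ the limiting c.d.f. $1 - F_{\chi^2_2}(2e^{-x})$ equals the Gumbel c.d.f. $e^{-e^{-x}}$:

```r
# (Y_(n) - b_n)/a_n should be approximately Gumbel-distributed for large n.
set.seed(1)
n <- 2000; M <- 5000
b_n <- qnorm(1 - 1 / n); a_n <- 1 / (n * dnorm(b_n))
z <- replicate(M, (max(rnorm(n)) - b_n) / a_n)
x <- 1
c(empirical = mean(z <= x),
  limit     = 1 - pchisq(2 * exp(-x), df = 2))  # = exp(-exp(-1)) ~ 0.692
```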
The following theorem is useful for the construction of outlier detection tests in the two-sided case when $f_0$ is symmetric. For any sequence $\zeta_1, \ldots, \zeta_n$, denote by $|\zeta|_{(1)} \leq \cdots \leq |\zeta|_{(n)}$ the ordered absolute values $|\zeta_1|, \ldots, |\zeta_n|$.
Theorem 2.
Suppose that the function $f_0$ is symmetric. If $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma \geq 0$, and Condition A holds, then for fixed $s$
$$\Big(\frac{|\hat Y|_{(n)} - b_{2n}}{a_{2n}}, \frac{|\hat Y|_{(n-1)} - b_{2n}}{a_{2n}}, \ldots, \frac{|\hat Y|_{(n-s+1)} - b_{2n}}{a_{2n}}\Big) \stackrel{d}{\to} L_\gamma$$
as $n \to \infty$.
Proof of Theorem 2.
For any $i = 1, \ldots, s$ the following equality holds:
$$\frac{|\hat Y|_{(n-i+1)} - b_{2n}}{a_{2n}} = \frac{|\hat Y|_{(n-i+1)} - |Y|_{(n-i+1)}}{a_{2n}} + \frac{|Y|_{(n-i+1)} - b_{2n}}{a_{2n}}. \qquad (7)$$
The c.d.f. of the random variables $|Y_i|$ is $2F_0(x) - 1$, so if $F_0 \in G_\gamma$, $\gamma \geq 0$, then $2F_0 - 1 \in G_\gamma$, and for the sequence $|Y_n|$ the normalizing sequences are $a_{2n}, b_{2n}$. So the $s$-dimensional random vector whose $i$th component is the second term on the right side converges in distribution to the random vector given in the formulation of the theorem. This follows from Theorem 2.1.1 of [25]. So it suffices to show that the first term converges in probability to zero.
Note that $|\hat Y_i| \leq |Y_i| + |\hat Y_i - Y_i|$, and
$$|\hat Y_i - Y_i| = \frac{1}{\hat\sigma}\big|\mu - \hat\mu + (\sigma - \hat\sigma) Y_i\big| \leq \frac{|\hat\mu - \mu|}{\hat\sigma} + \frac{|\sqrt{n}(\hat\sigma - \sigma)|}{\hat\sigma} \cdot \frac{1}{\sqrt{n}} |Y|_{(n)}.$$
So $|\hat Y|_{(n-j+1)} \leq |Y|_{(n-j+1)} + \frac{|\hat\mu - \mu|}{\hat\sigma} + \frac{|\sqrt{n}(\hat\sigma - \sigma)|}{\hat\sigma} \frac{1}{\sqrt{n}} |Y|_{(n)}$. Analogously, the inequality $|Y_i| \leq |\hat Y_i| + |\hat Y_i - Y_i|$ implies that $|Y|_{(n-j+1)} \leq |\hat Y|_{(n-j+1)} + \frac{|\hat\mu - \mu|}{\hat\sigma} + \frac{|\sqrt{n}(\hat\sigma - \sigma)|}{\hat\sigma} \frac{1}{\sqrt{n}} |Y|_{(n)}$.
Theorem 2.1.1 in [25] applied to the random variables $|Y_i|$ implies that there exists a random variable $V_1$ with the c.d.f. $G(x) = e^{-e^{-x}}$ ($\gamma = 0$) or $G(x) = e^{-(1+x)^{-1}}$, $x > -1$ ($\gamma > 0$), such that
$$\frac{1}{\sqrt{n}} |Y|_{(n)} = \frac{b_{2n} + a_{2n}(V_1 + o_P(1))}{\sqrt{n}}.$$
Hence
$$\bigg|\frac{|\hat Y|_{(n-i+1)} - |Y|_{(n-i+1)}}{a_{2n}}\bigg| \leq \frac{|\sqrt{n}(\hat\mu - \mu)|}{\hat\sigma \sqrt{n}\, a_{2n}} + \frac{|\sqrt{n}(\hat\sigma - \sigma)|}{\hat\sigma}\Big(\frac{b_{2n}}{\sqrt{n}\, a_{2n}} + \frac{V_1 + o_P(1)}{\sqrt{n}}\Big). \qquad (8)$$
The convergence $b_n \to \infty$ and Condition A (c) imply
$$\lim_{n \to \infty} \frac{b_{2n}}{\sqrt{n}\, a_{2n}} = \lim_{n \to \infty} 2\sqrt{n}\, F_0^{-1}\Big(1 - \frac{1}{2n}\Big) f_0\Big(F_0^{-1}\Big(1 - \frac{1}{2n}\Big)\Big) = \sqrt{2}\, \lim_{x \to \infty} \frac{x f_0(x)}{\sqrt{1 - F_0(x)}} = 0, \qquad \lim_{n \to \infty} \frac{1}{\sqrt{n}\, a_{2n}} = 0.$$
These results and Conditions A (a), (b) imply that the first term on the right of (8) converges in probability to zero. This completes the proof. □
Remark 2.
Theorem 2 implies that if $F \in G_0 \cap \mathcal{F}_{ls}$, then for fixed $i$, $i = 1, \ldots, s$, as $n \to \infty$,
$$P\{(|\hat Y|_{(n-i+1)} - b_{2n})/a_{2n} \leq x\} \to 1 - F_{\chi^2_{2i}}(2 e^{-x}),$$
and if $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma > 0$, then
$$P\{(|\hat Y|_{(n-i+1)} - b_{2n})/a_{2n} \leq x\} \to 1 - F_{\chi^2_{2i}}(2/(1 + x)).$$
Suppose now that the function $f_0$ is not symmetric. Set $Y_i^* = -(X_i - \mu)/\sigma$. The c.d.f. and p.d.f. of $Y_i^*$ are $1 - F_0(-x)$ and $f_0(-x)$, respectively. Set
$$b_n^* = -F_0^{-1}(1/n), \qquad a_n^* = \frac{1}{n f_0(-b_n^*)}. \qquad (12)$$
For example, if the type I extreme value distribution is considered, then
$$b_n = \ln \ln n, \quad a_n = \frac{1}{\ln n}, \quad b_n^* = -\ln\Big(-\ln\Big(1 - \frac{1}{n}\Big)\Big), \quad a_n^* = -\frac{1}{(n - 1)\ln\big(1 - \frac{1}{n}\big)}.$$
For the type II extreme value distribution, $a_n, b_n, a_n^*, b_n^*$ have the same expressions as $a_n^*, b_n^*, a_n, b_n$ for the type I extreme value distribution, respectively.
Remark 3.
Similarly as in Theorem 1, if $s$ is fixed and $F \in G_0 \cap \mathcal{F}_{ls}$, then for fixed $i$, $i = 1, \ldots, s$,
$$P\{-(\hat Y_{(i)} + b_n^*)/a_n^* \leq x\} = P\{(\hat Y^*_{(n-i+1)} - b_n^*)/a_n^* \leq x\} \to 1 - F_{\chi^2_{2i}}(2 e^{-x}),$$
and if $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma > 0$, then for fixed $i$, $i = 1, \ldots, s$,
$$P\{-(\hat Y_{(i)} + b_n^*)/a_n^* \leq x\} = P\{(\hat Y^*_{(n-i+1)} - b_n^*)/a_n^* \leq x\} \to 1 - F_{\chi^2_{2i}}(2/(1 + x)).$$

3.2. Robust Estimators for Location-Scale Distributions

The choice of the estimators $\hat\mu$ and $\hat\sigma$ is important when the outlier detection problem is considered. ML estimators from the complete sample are not stable when outliers are present.
In the case of location-scale families, highly efficient robust estimators of the location and scale parameters $\mu$ and $\sigma$ are (see [26])
$$\hat\mu = \mathrm{MED} - \hat\sigma F_0^{-1}(0.5), \qquad \hat\sigma = Q_n = d\, W_{([0.25\, n(n-1)/2])},$$
where $\mathrm{MED}$ is the empirical median, $W_{ij} = |X_i - X_j|$, $1 \leq i < j \leq n$, are the $C_n^2 = n(n-1)/2$ absolute values of the differences $X_i - X_j$, and $W_{(l)}$ is the $l$th order statistic of the $W_{ij}$.
The constant $d$ has the form $d = 1/K_0^{-1}(5/8)$, where $K_0^{-1}(x)$ is the inverse of the c.d.f. of $Y_1 - Y_2$, $Y_i = (X_i - \mu)/\sigma \sim F_0(x)$.
Expressions of $K_0^{-1}(x)$ and the values of $d$ for some well-known location-scale families are given in Table 2.
The estimators considered above are equivariant under $H_0$, i.e., for any $e \in \mathbb{R}$, $f > 0$, the following equalities hold:
$$\hat\mu((X_1 - e)/f, \ldots, (X_n - e)/f) = (\hat\mu(X_1, \ldots, X_n) - e)/f,$$
$$\hat\sigma((X_1 - e)/f, \ldots, (X_n - e)/f) = \hat\sigma(X_1, \ldots, X_n)/f.$$
Equivariant estimators have the following property: the distributions of $(\hat\mu - \mu)/\sigma$, $\hat\sigma/\sigma$, and $(\hat\mu - \mu)/\hat\sigma$ do not depend on the values of the parameters $\mu$ and $\sigma$.
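A minimal R sketch of these estimators (our illustration; robustbase::Qn implements $Q_n$ with the normal-consistency constant $d \approx 2.219$, and $F_0^{-1}(0.5) = 0$ for symmetric baselines):

```r
library(robustbase)  # provides Qn()

# Robust location-scale estimates: median-based mu-hat and Q_n scale.
robust_ls <- function(x, q50 = 0) {     # q50 = F0^{-1}(0.5)
  sigma_hat <- Qn(x)                    # d * order statistic of the |x_i - x_j|
  mu_hat    <- median(x) - sigma_hat * q50
  c(mu = mu_hat, sigma = sigma_hat)
}

set.seed(2)
x <- c(rnorm(95), rnorm(5, mean = 8))   # contaminated normal sample
robust_ls(x)                            # close to (0, 1) despite the 5 outliers
```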

3.3. Right Outliers Identification Method for Location-Scale Families

Suppose that $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma \geq 0$. Let $a_n, b_n$ be defined by (5). Set
$$U^+_{(n-i+1)}(n) = 1 - F_{\chi^2_{2i}}\big(2 e^{-(\hat Y_{(n-i+1)} - b_n)/a_n}\big), \quad \gamma = 0,$$
$$U^+_{(n-i+1)}(n) = 1 - F_{\chi^2_{2i}}\big(2/(1 + (\hat Y_{(n-i+1)} - b_n)/a_n)\big), \quad \gamma > 0,$$
$$U^+(n, s) = \max_{1 \leq i \leq s} U^+_{(n-i+1)}(n).$$
Theorem 3.
The distribution of the statistic $U^+(n, s)$ is parameter-free for any fixed $n$.
Proof of Theorem 3.
The result follows from the equality
$$\frac{\hat Y_{(n-i+1)} - b_n}{a_n} = \frac{Y_{(n-i+1)} - b_n}{a_n} \cdot \frac{\sigma}{\hat\sigma} + \frac{b_n}{a_n}\Big(\frac{\sigma}{\hat\sigma} - 1\Big) + \frac{1}{a_n} \cdot \frac{\mu - \hat\mu}{\hat\sigma},$$
the equivariance of the estimators $\hat\mu, \hat\sigma$, and the fact that the distribution of the random vector $(Y_1, \ldots, Y_n)^T$ does not depend on the values of the parameters $\mu$ and $\sigma$. □
Denote by $u^+_\alpha(n, s)$ the $\alpha$ critical value of the statistic $U^+(n, s)$. Note that it is the exact, not asymptotic, $\alpha$ critical value: $P\{U^+(n, s) \geq u^+_\alpha(n, s)\} = \alpha$ under $H_0$.
Theorem 1 implies that the limit distribution (as $n \to \infty$) of the random variable $U^+(n, s)$ coincides with the distribution of the random variable $V^+(s) = \max_{1 \leq i \leq s} V^+_i$, where $V^+_i = 1 - F_{\chi^2_{2i}}(2(E_1 + \cdots + E_i))$ and $E_1, \ldots, E_s$ are i.i.d. standard exponential random variables. The random variables $V^+_1, \ldots, V^+_s$ are dependent and identically distributed, and the distribution of each $V^+_i$ is uniform: $V^+_i \sim U(0, 1)$.
Denote by $v^+_\alpha(s)$ the $\alpha$ critical values of the random variable $V^+(s)$. They are easily found by simulation, generating many times $s$ i.i.d. standard exponential random variables and computing the values of the random variable $V^+(s)$.
Our simulations showed that the outlier identification method proposed below gives practically the same results whether exact or approximate critical values of the statistic $U^+(n, s)$ are used, so for samples of size $n \geq 20$ we recommend approximating the $\alpha$ critical value of the statistic $U^+(n, s)$ by the critical value $v^+_\alpha(s)$, which depends only on $s$. We shall see that for the purpose of outlier identification only the critical values $v^+_\alpha(5)$ are needed. We found that $v^+_{0.1}(5) = 0.9677$, $v^+_{0.05}(5) = 0.9853$, and $v^+_{0.01}(5) = 0.9975$.
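These values are easy to reproduce by direct simulation of $V^+(s)$ (a sketch of the procedure just described):

```r
# Critical values v_alpha^+(s) of V^+(s) = max_i [1 - F_{chi2(2i)}(2(E_1+...+E_i))].
set.seed(3)
s <- 5; M <- 2e5
Vplus <- replicate(M, {
  cs <- cumsum(rexp(s))                 # E_1, E_1+E_2, ..., E_1+...+E_s
  max(1 - pchisq(2 * cs, df = 2 * (1:s)))
})
quantile(Vplus, c(0.90, 0.95, 0.99))    # approximately 0.9677, 0.9853, 0.9975
```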
We write shortly the $BP$ method for the method considered below.
$BP$ method for right outliers. Begin the outlier search using the observations corresponding to the largest values of $\hat Y_i$. We recommend beginning with the five largest. So take $s = 5$ and compute the value of the statistic
$$U^+(n, 5) = \max_{1 \leq i \leq 5} U^+_{(n-i+1)}(n).$$
If $U^+(n, 5) \leq v^+_\alpha(5)$, then it is concluded that outliers are absent and no further investigation is done. Under $H_0$ the probability of such an event is approximately $1 - \alpha$.
If $U^+(n, 5) > v^+_\alpha(5)$, then it is concluded that outliers exist.
Note (see the classification scheme below) that if $U^+(n, 5) > v^+_\alpha(5)$, then at least one observation is declared an outlier. So the probability of declaring the absence of outliers does not depend on the classification scheme that follows.
If it is concluded that outliers exist, then the search for outliers proceeds by the following steps.
Step 1. Set $d_1 = \max\{i \in \{1, \ldots, 5\}: U^+_{(n-i+1)}(n) > v^+_\alpha(5)\}$. Note that the maximum $d_1 > 0$ exists because $U^+(n, 5) > v^+_\alpha(5)$.
If $d_1 < 5$, then the classification is finished at this step: $d_1$ observations are declared right outliers, because if the value of $X_{(n-d_1+1)}$ is declared an outlier, then it is natural to declare the values of $X_{(n)}, \ldots, X_{(n-d_1+2)}$ outliers, too.
If $d_1 = 5$, then it is possible that the number of outliers is greater than 5. Then the observation corresponding to $i = 1$ (i.e., corresponding to $X_{(n)}$) is declared an outlier and we proceed to Step 2.
Step 2. The above procedure is repeated taking $U^+(n-1, 5) = \max_{1 \leq i \leq 5} U^+_{(n-i)}(n-1)$ instead of $U^+(n, 5)$; here
$$U^+_{(n-i)}(n-1) = 1 - F_{\chi^2_{2i}}\big(2 e^{-(\hat Y_{(n-i)} - b_{n-1})/a_{n-1}}\big), \quad i = 1, \ldots, 5.$$
Set $d_2 = \max\{i \in \{1, \ldots, 5\}: U^+_{(n-i)}(n-1) > v^+_\alpha(5)\}$. If $d_2 < 5$, the classification is finished and $d_2 + 1$ observations are declared outliers.
If $d_2 = 5$, then it is possible that the number of outliers is greater than 6. Then the observation corresponding to the largest remaining z-score $\hat Y_{(n-1)}$ is declared an outlier, so in total two observations (corresponding to $i = 1, 2$, i.e., to $X_{(n)}$ and $X_{(n-1)}$) are declared outliers, and we proceed to Step 3, and so on. The classification finishes at the $l$th step when $d_l < 5$. So we declare $l - 1$ outliers in the previous steps and $d_l$ outliers in the last one. The total number of observations declared outliers is $l - 1 + d_l$. These observations are the values of $X_{(n)}, \ldots, X_{(n-d_l-l+2)}$.
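Putting the steps together, here is a compact R sketch of the right-outlier $BP$ procedure for the $\gamma = 0$ case (our own condensed implementation; qf and df stand for the baseline $F_0^{-1}$ and $f_0$, and robust_ls is the estimator sketched in Section 3.2):

```r
# BP method for right outliers (gamma = 0): returns how many of the largest
# order statistics are declared outliers (0 means no outliers).
bp_right <- function(x, qf = qnorm, df = dnorm, v_alpha = 0.9853, s = 5) {
  est <- robust_ls(x)
  y   <- sort((x - est["mu"]) / est["sigma"])  # z-scores, ascending
  declared <- 0                                # outliers declared in Steps 1, 2, ...
  repeat {
    n   <- length(y)
    b_n <- qf(1 - 1 / n); a_n <- 1 / (n * df(b_n))
    u   <- 1 - pchisq(2 * exp(-(y[n:(n - s + 1)] - b_n) / a_n), df = 2 * (1:s))
    d   <- if (any(u > v_alpha)) max(which(u > v_alpha)) else 0
    if (declared == 0 && d == 0) return(0)     # U+(n,5) <= v: outliers absent
    if (d < s) return(declared + d)            # classification finished
    declared <- declared + 1                   # declare the current maximum, go on
    y <- y[-n]
  }
}

set.seed(4)
bp_right(c(rnorm(95), rnorm(5, mean = 8)))     # typically 5
```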

3.4. Left Outliers Identification Method for Location-Scale Families

Let $a_n^*, b_n^*$ be the normalizing constants defined by (12). If $F \in G_0 \cap \mathcal{F}_{ls}$, then for $i = 1, \ldots, s$ set
$$U^-_{(i)}(n) = 1 - F_{\chi^2_{2i}}\big(2 e^{(\hat Y_{(i)} + b_n^*)/a_n^*}\big), \qquad U^-(n, s) = \max_{1 \leq i \leq s} U^-_{(i)}(n).$$
If $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma > 0$, then replace $e^{(\hat Y_{(i)} + b_n^*)/a_n^*}$ by $1/(1 - (\hat Y_{(i)} + b_n^*)/a_n^*)$. Denote by $u^-_\alpha(n, s)$ the $\alpha$ critical value of the statistic $U^-(n, s)$.
Theorem 1 and Remark 3 imply that the limit distribution (as $n \to \infty$) of the random variable $U^-(n, s)$ coincides with the distribution of the random variable $V^+(s)$. So the critical values $u^-_\alpha(n, s)$ are approximated by the critical values $v^-_\alpha(s) = v^+_\alpha(s)$.
The left outlier search method coincides with the right outlier search method, replacing + by − in all formulas.

3.5. Outlier Detection Tests for Location-Scale Families: Two-Sided Alternative, Symmetric Distributions

Let $a_n, b_n$ be defined by (5). If $F \in G_0 \cap \mathcal{F}_{ls}$, then for $i = 1, \ldots, s$ set
$$U_{(n-i+1)}(n) = 1 - F_{\chi^2_{2i}}\big(2 e^{-(|\hat Y|_{(n-i+1)} - b_{2n})/a_{2n}}\big), \qquad U(n, s) = \max_{1 \leq i \leq s} U_{(n-i+1)}(n).$$
If $F \in G_\gamma \cap \mathcal{F}_{ls}$, $\gamma > 0$, then replace $e^{-(|\hat Y|_{(n-i+1)} - b_{2n})/a_{2n}}$ by $1/(1 + (|\hat Y|_{(n-i+1)} - b_{2n})/a_{2n})$. Denote by $u_\alpha(n, s)$ the $\alpha$ critical value of the statistic $U(n, s)$.
Theorem 1 and Remark 2 imply that the limit distribution (as $n \to \infty$) of the random variable $U(n, s)$ coincides with the distribution of the random variable $V^+(s)$. So the critical values $u_\alpha(n, s)$ are approximated by the critical values $v_\alpha(s) = v^+_\alpha(s)$.
The outlier search method coincides with the right outlier search method, omitting the upper index + in all formulas.

3.6. Outlier Detection Tests for Location-Scale Families: Two-Sided Alternative, Non-Symmetric Distributions

Suppose now that the function $f_0$ is not symmetric. Let $a_n, b_n, a_n^*, b_n^*$ be defined by (5) and (12).
Begin the outlier search using the observations corresponding to the largest and the smallest values of $\hat Y_i$. We recommend beginning with the five smallest and the five largest. So compute the values of the statistics $U^-(n, 5)$ and $U^+(n, 5)$. If $U^-(n, 5) \leq v^-_{\alpha/2}(5)$ and $U^+(n, 5) \leq v^+_{\alpha/2}(5)$, then it is concluded that outliers are absent and no further investigation is done.
If $U^-(n, 5) > v^-_{\alpha/2}(5)$ or $U^+(n, 5) > v^+_{\alpha/2}(5)$, then it is concluded that outliers exist. If $U^-(n, 5) > v^-_{\alpha/2}(5)$, then left outliers are searched for as in Section 3.4. If $U^+(n, 5) > v^+_{\alpha/2}(5)$, then right outliers are searched for as in Section 3.3. The only difference is that $\alpha$ is replaced by $\alpha/2$ in all formulas.

3.7. Outlier Identification Method for Shape-Scale Families

If shape-scale families of the form $\{F(t; \theta, \nu) = G_0((t/\theta)^{\nu}),\ \theta, \nu > 0\}$ with specified $G_0$ are considered, then the tests given above for location-scale families can be used, because if $X_1, \ldots, X_n$ is a sample from a shape-scale family, then $Z_1, \ldots, Z_n$, $Z_i = \ln X_i$, is a sample from the location-scale family $\{F_0((x - \mu)/\sigma),\ \mu \in \mathbb{R},\ \sigma > 0\}$ with $\mu = \ln\theta$, $\sigma = 1/\nu$, $F_0(x) = G_0(e^x)$.
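For example (our illustration), a Weibull sample is reduced to a type I extreme value location-scale sample by taking logarithms:

```r
# Weibull(shape = nu, scale = theta): log(X) follows the type I (smallest)
# extreme value location-scale family with mu = log(theta), sigma = 1/nu.
set.seed(5)
theta <- 100; nu <- 1.8
z <- log(rweibull(200, shape = nu, scale = theta))
c(mu = log(theta), sigma = 1 / nu)   # true location and scale of z
```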

3.8. Illustrative Example

To illustrate the simplicity of the $BP$ method, let us consider an example of its application (sample size $n = 20$, $r = 7$ outliers). A sample of size $n = 20$ from the standard normal distribution was generated, and the 1st-3rd and 17th-20th observations were replaced by outliers. The observations $x_i$, the absolute values $|\hat Y_i|$ of the z-scores $\hat Y_i$, and the ranks $(i)$ of $|\hat Y_i|$ are presented in Table 3.
In Table 4 we present the steps of the classification procedure by the $BP$ method. First, we compute (see line 1 of Table 4) the value of the statistic $U(20, 5) = \max_{1 \leq i \leq 5} U_{(20-i+1)}(20) = 1$. Since $U(20, 5) = 1 > 0.9853 = v_{0.05}(5)$, we reject the null hypothesis, conclude that outliers exist, and begin the search for outliers.
Step 1. The inequality $U_{(16)}(20) = 1.0000 > 0.9853 = v_{0.05}(5)$ (note that $U_{(16)}(20)$ corresponds to the fifth largest observation in absolute value) implies that $d_1 = 5$. So it is possible that the number of outliers is greater than 5. We declare the largest in absolute value (20th) observation an outlier and continue the search.
Step 2. The inequality $U_{(15)}(19) = 1.0000 > 0.9853 = v_{0.05}(5)$ (note that $U_{(15)}(19)$ corresponds to the fifth largest observation in absolute value among the remaining 19 observations) implies that $d_2 = 5$. So it is possible that the number of outliers is greater than 6. We declare the second largest in absolute value observation an outlier. So two observations (the 19th and 20th) are declared outliers. We continue the search.
Step 3. The inequality $U_{(14)}(18) = 0.999997 > 0.9853 = v_{0.05}(5)$ implies that $d_3 = 5$. We declare the third largest in absolute value observation an outlier. So three observations (the 2nd, 19th, and 20th) are declared outliers. We continue the search.
Step 4. The inequalities $U_{(13)}(17) = 0.084290 < 0.9853 = v_{0.05}(5)$ and $U_{(14)}(17) = 0.999940 > 0.9853 = v_{0.05}(5)$ imply that $d_4 = 4$. So four additional observations (the fourth, fifth, sixth, and seventh largest in absolute value), namely the 3rd, 1st, 17th, and 18th, are declared outliers. The outlier search is finished. In all, 7 observations were declared outliers: the 1st-3rd and 17th-20th, as expected. Note that since the outlier search procedure was performed after rejection of the null hypothesis, the significance level did not change.

3.9. Practical Example

Let us consider the stent fatigue testing dataset from reliability control [27]. The dataset contains 100 observations. We consider the Weibull, loglogistic, and lognormal models, which are among the most applied models for the analysis of reliability data. For a preliminary choice of a suitable model we compare the values of various goodness-of-fit statistics and information criteria (see Table 5). The Weibull model is clearly the best suited because the values of all five statistics are smallest for this model.
Using the function WEDF.test from the R package EWGoF, we applied the following goodness-of-fit tests for the Weibull distribution: Anderson-Darling (p-value = 0.86), Kolmogorov-Smirnov (p-value = 0.82), Cramér-von Mises (p-value = 0.795), and Watson (p-value = 0.795). So none of the tests contradicts the Weibull model.
The logarithms $X_1, \ldots, X_{100}$ of the observations have a type I extreme value distribution. The minimal and maximal values are $X_{(1)} = 1.609$ and $X_{(100)} = 5.670$. Let us consider the situation where the fatigue data contain two outliers, $X_3 = 6.5$ and $X_5 = 6.5$. All goodness-of-fit tests applied to the data with the outliers reject the Weibull model: Anderson-Darling (p-value < $10^{-15}$), Kolmogorov-Smirnov (p-value $\approx 0.005$), Cramér-von Mises (p-value < $10^{-15}$), and Watson (p-value < $10^{-15}$).
Let us apply the $BP$ method for outlier identification. The values of the statistics are $U_{(100)}(100) = 0.92$, $U_{(99)}(100) = 0.997$, $U_{(98)}(100) = 0.96$, $U_{(97)}(100) = 0.92$, $U_{(96)}(100) = 0.96$. Since $U(100, 5) = 0.997 > 0.9853$, we reject the null hypothesis.
Step 1. Since $d_1 = \max\{i \in \{1, \ldots, 5\}: U_{(101-i)}(100) > 0.9853\} = 2 < 5$, the search procedure is finished and the observations $X_{(99)}$ and $X_{(100)}$, namely $X_3$ and $X_5$, are declared outliers. We see that our method did not allow one of the equal observations $X_3 = X_5 = 6.5$ to mask the other. This is a very important advantage of the $BP$ method.
After removal of the outliers, we repeated the goodness-of-fit procedure. None of the tests rejected the Weibull model: Anderson-Darling (p-value = 0.88), Kolmogorov-Smirnov (p-value = 0.8), Cramér-von Mises (p-value = 0.93), and Watson (p-value = 0.895). Once more, we compared the values of the goodness-of-fit statistics and information criteria for the models considered above using the data without the removed outliers; see Table 6.
The Weibull distribution clearly gives the best fit.
The values of the ML estimators from the initial non-contaminated data and from the final data cleared of outliers are similar: the shape practically did not change ($1.83 \to 1.83$), and the scale changed only slightly ($100.8 \to 101.4$).
We created the R package outliersTests (https://github.com/linas-p/outliersTests) so that the proposed $BP$ test can be used in practice within R.
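A hypothetical usage sketch follows; the exported function name and its arguments are assumptions on our part and should be checked against the package documentation:

```r
# install.packages("devtools"); devtools::install_github("linas-p/outliersTests")
library(outliersTests)
set.seed(6)
x <- c(rnorm(95), rnorm(5, mean = 8))
bp.test(x)   # assumed entry point of the BP test; see the package help pages
```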

4. Generalization of Davies-Gather Outlier Identification Method

Let us consider location-scale families. Following the idea of Davies-Gather [2], define an empirical analogue of the right outlier region as the random region
$$OR_r(\alpha_n) = \{x: x > \hat\mu + \hat\sigma g_{n,\alpha}\},$$
where $g_{n,\alpha}$ is found from the condition
$$P\{X_i \notin OR_r(\alpha_n),\ i = 1, \ldots, n \mid H_0\} = 1 - \alpha, \qquad (18)$$
and $\hat\mu, \hat\sigma$ are robust equivariant estimators of the parameters $\mu, \sigma$.
Set
$$\hat Y_{(n)} = (X_{(n)} - \hat\mu)/\hat\sigma.$$
The distribution of $\hat Y_{(n)}$ is parameter-free under $H_0$.
Equation (18) is equivalent to the equation
$$P\{\hat Y_{(n)} \leq g_{n,\alpha} \mid H_0\} = 1 - \alpha.$$
So $g_{n,\alpha}$ is the upper $\alpha$ critical value of the random variable $\hat Y_{(n)}$. It is easily computed by simulation.
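A short R sketch (ours) of that simulation for the standard normal baseline, reusing the robust_ls estimator sketched in Section 3.2:

```r
# Upper-alpha critical value g_{n,alpha} of Y-hat_(n) under H0, by simulation.
g_crit <- function(n, alpha = 0.05, M = 20000) {
  stat <- replicate(M, {
    x   <- rnorm(n)
    est <- robust_ls(x)
    (max(x) - est["mu"]) / est["sigma"]
  })
  quantile(stat, 1 - alpha, names = FALSE)
}
g_crit(100)   # g_{100, 0.05} for the normal family
```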
Generalized Davies-Gather method for right outlier identification: if $\hat Y_{(n)} \leq g_{n,\alpha}$, then it is concluded that right outliers are absent (under $H_0$ the probability of this event is $1 - \alpha$). If $\hat Y_{(n)} > g_{n,\alpha}$, then it is concluded that right outliers exist. The value $x_i$ of the random variable $X_i$ is declared an outlier if $x_i \in OR_r(\alpha_n)$, i.e., if $x_i > \hat\mu + \hat\sigma g_{n,\alpha}$. Otherwise it is declared a non-outlier.
An empirical analogue of the left outlier region is the random region
$$OR_l(\alpha_n) = \{x: x < \hat\mu + \hat\sigma h_{n,1-\alpha}\},$$
where $h_{n,1-\alpha}$ is found from the condition
$$P\{X_i \notin OR_l(\alpha_n),\ i = 1, \ldots, n \mid H_0\} = 1 - \alpha. \qquad (20)$$
Set
$$\hat Y_{(1)} = (X_{(1)} - \hat\mu)/\hat\sigma.$$
The distribution of $\hat Y_{(1)}$ is parameter-free under $H_0$.
Equation (20) is equivalent to the equation
$$P\{\hat Y_{(1)} \geq h_{n,1-\alpha} \mid H_0\} = 1 - \alpha.$$
So $h_{n,1-\alpha}$ is the upper $1 - \alpha$ critical value of the random variable $\hat Y_{(1)}$. It is easily computed by simulation.
Generalized Davies-Gather method for left outlier identification: if $\hat Y_{(1)} \geq h_{n,1-\alpha}$, then it is concluded that left outliers are absent (under $H_0$ the probability of this event is $1 - \alpha$). If $\hat Y_{(1)} < h_{n,1-\alpha}$, then it is concluded that left outliers exist. The value $x_i$ of the random variable $X_i$ is declared an outlier if $x_i \in OR_l(\alpha_n)$, i.e., if $x_i < \hat\mu + \hat\sigma h_{n,1-\alpha}$. Otherwise it is declared a non-outlier.
Let us consider the two-sided case.
If the distribution of $X_i$ is symmetric, then the empirical analogue of the outlier region is the random region
$$OR(\alpha_n) = \{x: |x - \hat\mu| > \hat\sigma g_{n,\alpha/2}\}.$$
In this case
$$1 - \alpha = P\{X_i \notin OR(\alpha_n),\ i = 1, \ldots, n \mid H_0\} = P\{|\hat Y|_{(n)} \leq g_{n,\alpha/2}\}.$$
Generalized Davies-Gather method for left and right outlier identification (symmetric distributions): if $|\hat Y|_{(n)} \leq g_{n,\alpha/2}$, then it is concluded that outliers are absent (under $H_0$ the probability of this event is $1 - \alpha$). If $|\hat Y|_{(n)} > g_{n,\alpha/2}$, then it is concluded that outliers exist. The value $x_i$ of the random variable $X_i$ is declared a left outlier if $x_i < \hat\mu - \hat\sigma g_{n,\alpha/2}$ and a right outlier if $x_i > \hat\mu + \hat\sigma g_{n,\alpha/2}$. Otherwise it is declared a non-outlier.
If the distribution of $X_i$ is non-symmetric, then the empirical analogue of the outlier region is defined as
$$OR(\alpha_n) = \mathbb{R} \setminus [\hat\mu + \hat\sigma h_{n,1-\alpha/2},\ \hat\mu + \hat\sigma g_{n,\alpha/2}].$$
In this case
$$1 - \alpha = P\{X_i \in [\hat\mu + \hat\sigma h_{n,1-\alpha/2},\ \hat\mu + \hat\sigma g_{n,\alpha/2}],\ i = 1, \ldots, n \mid H_0\} = P\{h_{n,1-\alpha/2} \leq \hat Y_{(1)} \leq \hat Y_{(n)} \leq g_{n,\alpha/2} \mid H_0\}.$$
Generalized Davies-Gather method for left and right outlier identification (non-symmetric distributions): if $\hat Y_{(1)} \geq h_{n,1-\alpha/2}$ and $\hat Y_{(n)} \leq g_{n,\alpha/2}$, then it is concluded that outliers are absent (under $H_0$ the probability of this event is $1 - \alpha$). If $\hat Y_{(1)} < h_{n,1-\alpha/2}$ or $\hat Y_{(n)} > g_{n,\alpha/2}$, then it is concluded that outliers exist. The value $x_i$ of the random variable $X_i$ is declared a left outlier if $x_i < \hat\mu + \hat\sigma h_{n,1-\alpha/2}$ and a right outlier if $x_i > \hat\mu + \hat\sigma g_{n,\alpha/2}$. Otherwise it is declared a non-outlier.

5. Short Survey of Multiple Outlier Identification Methods for Normal Data

5.1. Rosner’s Method

Let us formulate Rosner's method in the form most used in practice. Suppose that the number of outliers does not exceed $s$ and that the two-sided alternative is considered. Set (see [5,28])
$$R_1 = \max_{1 \leq j \leq n} |\tilde Y_j| = \max_{1 \leq j \leq n} |X_j - \bar X|/S_X, \qquad S_X^2 = \sum_{j=1}^n (X_j - \bar X)^2/(n - 1).$$
$|\tilde Y_j| = |(X_j - \bar X)/S_X|$ may be interpreted as the distance between $X_j$ and $\bar X$. Remove the observation $X_{j_1}$ which is most distant from $\bar X$; this maximal distance is $R_1$. The value of $X_{j_1}$ is a possible candidate for a contaminant.
Recompute the statistic using the $n - 1$ remaining observations and denote the obtained statistic by $R_2$. Remove the observation $X_{j_2}$ which is most distant from the new empirical mean. The value of $X_{j_2}$ is also a possible candidate for a contaminant. Repeat the procedure until the statistics $R_1, \ldots, R_s$ are computed. In this way we obtain all possible candidates for contaminants: the values of $X_{j_1}, \ldots, X_{j_s}$.
Fix $\alpha$ and find $\lambda_{in}$ such that
$$P\{R_1 > \lambda_{1n} \mid H_0\} = \cdots = P\{R_s > \lambda_{sn} \mid H_0\}, \qquad P\Big\{\bigcup_{i=1}^s \{R_i > \lambda_{in}\} \mid H_0\Big\} = \alpha.$$
If $n > 25$, then the approximations
$$\lambda_{in} \approx \frac{(n - i)\, t_{\alpha/(2(n-i+1))}(n - i - 1)}{\sqrt{\big(n - i - 1 + t^2_{\alpha/(2(n-i+1))}(n - i - 1)\big)(n - i + 1)}}$$
are recommended (see [5]); here $t_p(\nu)$ is the upper $p$ critical value of the Student distribution with $\nu$ degrees of freedom.
Rosner's method for left and right outlier identification: if $R_i \leq \lambda_{in}$ for all $i = 1, \ldots, s$, then it is concluded that outliers are absent. If there exists $i_0 \in \{1, \ldots, s\}$ such that $R_{i_0} > \lambda_{i_0 n}$, i.e., the event $\bigcup_{i=1}^s \{R_i > \lambda_{in}\}$ occurs, then it is concluded that outliers exist. In this case, the classification of observations into outliers and non-outliers is done as follows: if $R_s > \lambda_{sn}$, then it is concluded that there are $s$ outliers, namely the values of $X_{j_1}, \ldots, X_{j_s}$. If $R_j \leq \lambda_{jn}$ for $j = s, s-1, \ldots, i+1$ and $R_i > \lambda_{in}$, then it is concluded that there are $i$ outliers, namely the values of $X_{j_1}, \ldots, X_{j_i}$.
If right outliers are searched for, then define $R_1^+ = \max_{1 \leq i \leq n} \tilde Y_i$ and repeat the above procedure taking the approximations
$$\lambda^+_{in} \approx \frac{(n - i)\, t_{\alpha/(n-i+1)}(n - i - 1)}{\sqrt{\big(n - i - 1 + t^2_{\alpha/(n-i+1)}(n - i - 1)\big)(n - i + 1)}}.$$
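For reference, the two-sided approximation is straightforward to evaluate in R (our sketch of these critical values):

```r
# Approximate Rosner critical values lambda_{in}, i = 1..s (two-sided case).
rosner_lambda <- function(n, s, alpha = 0.05) {
  i  <- 1:s
  p  <- alpha / (2 * (n - i + 1))     # tail probability per step
  tq <- qt(1 - p, df = n - i - 1)     # upper-p Student critical value
  (n - i) * tq / sqrt((n - i - 1 + tq^2) * (n - i + 1))
}
rosner_lambda(n = 100, s = 5)
```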
Denote by $R_s$ Rosner's test with fixed upper limit $s$. Our simulation results confirm that the true significance level differs from the level $\alpha$ suggested by the approximation when $n$ is not large. Nevertheless, it approaches $\alpha$ as $n$ increases; see Figure 1. The true significance values of the $BP$ test, which uses asymptotic critical values of the test statistic, are also presented in Figure 1.

5.2. Bolshev’s Method

Suppose that the number of contaminants does not exceed $s$. For $i = 1, \ldots, n$ set
$$\hat Y_i = (X_i - \bar X)/s, \qquad \tau^+_i = n\big(1 - T_{n-2}(\hat Y_i)\big), \qquad \tau_i = n\big(1 - T_{n-2}(|\hat Y_i|)\big),$$
where $\bar X$ and $s$ are the empirical mean and standard deviation, and $T_{n-2}(x)$ is the c.d.f. of Thompson's distribution with $n - 2$ degrees of freedom.
Let us consider the search for right outliers. Note that the largest $s$ observations $X_{(n-s+1)}, \ldots, X_{(n)}$ define the smallest $s$ order statistics $\tau^+_{(1)} \leq \cdots \leq \tau^+_{(s)}$. The possible candidates for outliers are precisely the values of $X_{(n-s+1)}, \ldots, X_{(n)}$.
Set $\tau^+ = \min_{1 \leq i \leq s} \tau^+_{(i)}/i$.
Bolshev's method for right outlier search. If $\tau^+ \geq \tau^+_{1-\alpha}(n, s)$, then it is concluded that outliers are absent; here $\tau^+_{1-\alpha}(n, s)$ is the $1 - \alpha$ critical value of the test statistic under $H_0$. If $\tau^+ < \tau^+_{1-\alpha}(n, s)$, then it is concluded that outliers exist. In such a case outliers are selected as follows: if $\tau^+_{(i)}/i < \tau^+_{1-\alpha}(n, s)$, then the value of the order statistic $X_{(n-i+1)}$ is declared an outlier, $i = 1, \ldots, s$.
In the case of the left-and-right outlier search, Bolshev's method uses $\tau_{(i)}$ instead of $\tau^+_{(i)}$, defining the statistic $\tau = \min_{1 \leq i \leq s} \tau_{(i)}/i$.
Bolshev's method for left and right outlier search. If $\tau \geq \tau_{1-\alpha}(n, s)$, then it is concluded that outliers are absent; here $\tau_{1-\alpha}(n, s)$ is the $1 - \alpha$ critical value of the statistic $\tau$ under $H_0$. If $\tau < \tau_{1-\alpha}(n, s)$, then it is concluded that outliers exist. In such a case they are selected as follows: if $\tau_{(i)}/i < \tau_{1-\alpha}(n, s)$, then the observation corresponding to $\tau_{(i)}$ is declared an outlier, $i = 1, \ldots, s$.

5.3. Hawkins' Method

Suppose that the number of contaminants does not exceed $s$. Let us consider the search for right outliers. For $k = 1, \ldots, s$ set
$$b^+_k = \frac{1}{\sqrt{k(n-k)}} \sum_{i=1}^k \tilde Y_{(n-i+1)} = \frac{1}{\sqrt{k(n-k)}} \sum_{i=1}^k (X_{(n-i+1)} - \bar X)/S_X,$$
so $b^+_k$ is proportional to the sum of the $k$ largest $\tilde Y_{(n-i+1)}$. Set $B^+ = \max_{1 \leq k \leq s} b^+_k$.
Hawkins' method. If $B^+ \leq B^+_\alpha(n, s)$, then it is concluded that outliers are absent; here $B^+_\alpha(n, s)$ is the $\alpha$ critical value of the statistic under $H_0$. If $B^+ > B^+_\alpha(n, s)$, then it is concluded that outliers exist. In such a case outliers are selected as follows: if $b^+_i > B^+_\alpha(n, s)$, then the value of the order statistic $X_{(n-i+1)}$ is declared an outlier, $i = 1, \ldots, s$.
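A small R sketch of the statistic (under our reading of the normalization above):

```r
# Hawkins-type statistics b_k^+ and B+ = max_k b_k^+ for right outliers.
hawkins_Bplus <- function(x, s = 5) {
  n <- length(x)
  y <- sort((x - mean(x)) / sd(x), decreasing = TRUE)  # largest z-scores first
  k <- 1:s
  b <- cumsum(y[k]) / sqrt(k * (n - k))
  list(b = b, Bplus = max(b))
}
set.seed(7)
hawkins_Bplus(c(rnorm(95), rnorm(5, mean = 8)))
```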

6. Comparative Analysis of Outlier Identification Methods by Simulation

In the case of location-scale classes, the probability distributions of all considered test statistics do not depend on $\mu$ and $\sigma$, so we generated samples of various sizes $n$ with $n - r$ observations from the c.d.f. $F_0$ and $r$ observations from various alternative distributions concentrated in the outlier region. We shall call such observations "contaminant outliers", shortly c-outliers. As was mentioned, outliers which are not c-outliers, i.e., outliers among the regular observations with the c.d.f. $F_0$, are very rare.
We repeated the simulations $M = 100{,}000$ times, classified the observations into outliers and non-outliers using the various methods, and computed the mean number $D_{OO}$ of correctly identified c-outliers, the mean number $D_{ON}$ of c-outliers which were not identified, the mean number $D_{NO}$ of non-c-outliers declared outliers, and the mean number $D_{NN}$ of non-c-outliers declared non-outliers.
An outlier identification method would be ideal if each outlier were detected and each non-outlier were declared a non-outlier. In practice this cannot be achieved with probability one. Two kinds of errors are possible: (a) an outlier is not declared as such (masking); (b) a non-outlier is declared an outlier (swamping). We shall write shortly "masking value" for the mean number of non-detected c-outliers and "swamping value" for the mean number of regular observations declared outliers in the simulated samples.
If swamping is small for two tests, then the test with the smaller masking value should be preferred, because in this case the distribution of the data remaining after the exclusion of the suspected outliers is closer to the distribution of the non-outlier data.
On the other hand, if the swamping of Method 1 is considerably larger than that of Method 2 while its masking is smaller, this does not mean that Method 1 is better, because such a method rejects many extreme non-outliers from the tails of the regular distribution $F_0$, and the sample remaining after classification may not be treatable as a sample from this regular distribution even if all c-outliers are eliminated.
For various families of distributions, sample sizes $n$, and alternatives we compared the performance of the Davies-Gather ($DG$) and the new ($BP$) methods. In the case of the normal distribution we also compared them with Rosner's, Bolshev's, and Hawkins' methods.
We used two different classes of alternatives: in the first, c-outliers are spread widely in the outlier region; in the second, c-outliers are concentrated in a very short interval lying in the outlier region. More precisely, if right outliers were searched for, then we simulated $r$ observations concentrated in the right outlier region $out_r(\alpha_n, F_0) = \{x: x > x_{\alpha_n}\}$ using the following alternative families of distributions:
(1)
The two-parameter exponential distribution $E(\theta, x_{\alpha_n})$ with scale parameter $\theta$. If $\theta$ is small, then the outliers are concentrated near the border of the outlier region. If $\theta$ is large, then the outliers are widely spread in the outlier region. As $\theta$ increases, the mean of the outlier distribution increases. Note that even if $\theta$ is very near 0 and the true number of outliers $r$ is large, these outliers may strongly corrupt the data, making the tails of the histogram too heavy.
(2)
The truncated normal distribution $TN(x_{\alpha_n}, \mu, \rho)$ with location and scale parameters $\mu, \rho$ ($\mu > x_{\alpha_n}$). If $\rho$ is small, then this distribution is concentrated in a small interval around $\mu$. As $\mu$ increases, the mean of the outlier distribution increases.
For lack of space we present only a small part of our investigations. Note that the results are very similar for all sample sizes $n \geq 20$. The multiple outlier problem is not very relevant for smaller sample sizes.

6.1. Investigation of Outlier Identification Methods for Normal Data

We use the notation $B$, $H$, $R$, $DG$, and $BP$ for Bolshev's, Hawkins', Rosner's, the Davies-Gather, and the new methods, respectively. If the $DG$ method is based on maximum likelihood estimators, we write $DG_{ml}$; if it is based on robust estimators, we write $DG_{rob}$.
For the comparison of the methods considered above we fixed the significance level $\alpha = 0.05$. We recall that the significance level $\alpha$ is the probability of rejecting at least one observation as an outlier under the hypothesis $H_0$, i.e., when all observations are realizations of i.i.d. random variables with the same normal distribution. Only the $R$ method uses approximate critical values of the test statistic, so the significance level for this test is only approximately 0.05 and depends on $s$ and $n$. In Figure 1 the true significance level values for $s = 5, 15$, and $[0.4n]$ are given as functions of $n$.
The $B$, $H$, and $R$ methods have the drawback that the upper bound $s$ for the possible number of outliers must be fixed. The $BP$ and $DG$ methods have the advantage that they do not require it.
Our investigations showed that the $H$, $B$, and $DG_{ml}$ methods have other serious drawbacks, so let us first look more closely at these methods.
If the true number of c-outliers $r$ exceeds $s$, then the $B$ and $H$ methods cannot find all of them, even if they are very far from the limits of the outlier region. Nevertheless, suppose that $r$ does not exceed $s$ and look at the performance of the $H$ method. Set $n = 100$, $s = 5$, and suppose that c-outliers are generated by the right-truncated normal distribution $TN(x_{\alpha_n}, \mu, \rho)$ with fixed $\rho$ and increasing $\mu$. Note that the true number of c-outliers is supposed to be unknown but does not exceed $s = 5$. In Figure 2 the mean numbers of rejected non-c-outliers $D_{NO}$ are given as functions of the parameter $\mu$ (the value $\rho = 0.1^2$ is fixed) for fixed values of $r$. In Table 7 the values of $D_{NO}$ plus the mean numbers of correctly rejected c-outliers are given. Table 7 shows that if $r = 1$ and $\mu$ is sufficiently large, the c-outlier is found, but the number of rejected non-c-outliers $D_{NO}$ increases to 4, so swamping is very large. Similarly, if $r = 2$, then $D_{NO}$ increases to 3, so swamping is large. Beginning from $r = 3$, not all c-outliers are found, even for large $\mu$. Swamping is smallest if the true value of $r$ coincides with $s$, but even in this case one c-outlier remains unfound, even for large $\mu$. Taking into account that the true number $r$ of c-outliers is not known in real data, the performance of the $H$ method is very poor. The results are similar for other values of $n$ and $s$ and for other distributions of c-outliers. As a rule, the $H$ method finds the c-outliers rather well, but swamping is very large because the method tends to reject close to $s$ observations under remote alternatives, which is good if $r = s$ but bad otherwise.
The $B$ and $DG_{ml}$ methods have the drawback that they use maximum likelihood estimators, which are not robust and estimate the parameters badly in the presence of outliers. Once more, set $n = 100$, $s = 5$, and suppose that c-outliers are generated by the two-parameter exponential distribution $TE(x_{\alpha_n}, \theta)$ with increasing $\theta$. Swamping values are negligible here, so only the masking values (the mean numbers of non-rejected c-outliers $D_{ON}$) are important. In Figure 3 the masking values are given as functions of the parameter $\theta$ for fixed values of $r$.
Both methods perform very similarly. The masking values are large for every value of $r > 1$, and they increase as $r$ increases. For example, if $r = 5$, then on average almost 3 c-outliers out of 5 are not rejected, even for large values of $\theta$.
Similar results hold for other values of $n$ and $s$ and for various distributions of c-outliers.
The above analysis shows that the $B$, $H$, and $DG_{ml}$ methods have serious drawbacks, so we exclude them from further consideration.
Let us consider the remaining three methods: $R$, $DG$, and $BP$. For small $n$ the true significance level of Rosner's test differs considerably from the nominal one, so we present comparisons of the tests' performance for $n = 50, 100, 1000$ (see Table 8 and Table 9). The truncated exponential distribution was used for outlier simulation. The remoteness of the mean of the outliers from the border of the outlier region is characterized by the parameter $\theta$.
Swamping values $D_{NO}$ (the mean numbers of non-c-outliers declared outliers) are very small for all tests. For example, even if $n = 1000$, the $R$ and $DG$ methods reject on average as outliers only 0.05 out of $n - r = 995, 980, 900$ non-c-outliers. For the $BP$ method this number is 0.25, 0.19, 0.05 out of 995, 980, and 900 non-c-outliers, respectively. So only the masking values $D_{ON}$ (the mean numbers of c-outliers declared non-outliers) are important for the comparison of the outlier identification methods.
The necessity of guessing the upper limit $s$ for the possible number of outliers is considered a drawback of Rosner's method. Indeed, if the true number of outliers $r$ is greater than the chosen upper limit $s$, then $r - s$ outliers are not identified with probability one. In addition, even if $r \leq s$, it is not clear how important the closeness of $r$ to $s$ is. So first we investigated the problem of the choice of the upper limit.
Here we present the masking values $D_{ON}$ of Rosner's tests for $s = 5, 15$, and $[0.4n]$. Similar results are obtained for other values of $s$.
Our investigations show that it is sufficient to fix $s = [0.4n]$, which is clearly larger than can be expected in real data. Indeed, Table 8 and Table 9 show that for $r > s$ the $Rosner_5$ and $Rosner_{15}$ tests do not find $r - s$ outliers even if they are very remote, as it should be. Nevertheless, we see that even if the true number of outliers $r$ is much smaller than $[0.4n]$, for any considered $n$ and $r \leq s = 5, 15$, the masking values of the $Rosner_{[0.4n]}$ test are approximately the same as (even a little smaller than) the masking values of the $Rosner_5$ and $Rosner_{15}$ tests; for $r > s$ they are clearly smaller.
Hence, $s = [0.4n]$ should be recommended for Rosner's test application, and the performance of the $Rosner_{[0.4n]}$, robust Davies-Gather ($DG_{rob}$), and proposed $BP$ methods should be compared.
All three methods find all c-outliers if they are sufficiently remote. For $n = 50$ the $BP$ method gives the uniformly smallest masking values and the $DG$ method gives the uniformly largest masking values for any considered $r$ over the whole range of alternatives. For $n = 100$ and $r = 2, 5$ the result is the same. For $n = 100$ and $r = 10$ (meaning that even for very small $\theta$ the data are seriously corrupted) the $BP$ method is also the best, except that for the most remote alternatives the $Rosner_{[0.4n]}$ method slightly outperforms it. For $n = 1000$ and most alternatives the $BP$ method strongly outperforms the other methods, except for the most remote alternatives.
The $DG$ and Rosner's methods have very large masking if many outliers are concentrated near the border of the outlier region. In this case the data are seriously corrupted, yet these methods do not see the outliers.
Conclusion: in most of the considered situations the $BP$ method is the best outlier identification method. The second best is Rosner's method with $s = [0.4n]$, and the third is the Davies-Gather method based on robust estimation. The other methods perform poorly.

6.2. Investigation of Outlier Identification Methods for Other Location-Scale Models

We investigated the performance of the new method for location-scale families different from the normal one. We compared the $BP$ method with the generalized Davies-Gather method for the logistic and Laplace (symmetric, $F \in G_0 \cap \mathcal{F}_{ls}$), extreme value (non-symmetric, $F \in G_0 \cap \mathcal{F}_{ls}$), and Cauchy (symmetric, $F \in G_1 \cap \mathcal{F}_{ls}$) families. C-outliers were generated using the truncated exponential distribution concentrated in the two-sided outlier region. The swamping values being small, the masking values (see Table 10) and the differences between the true number of c-outliers and the number of rejected observations (see Figure 4 and Figure 5) were compared. The $BP$ and $DG_{rob}$ methods find the most remote outliers very well; meanwhile, the $BP$ method identifies closer outliers much better. The $DG_{rob}$ method identifies badly multiple outliers concentrated near the border of the outlier region, whereas the $BP$ method does this well. The $DG_{ML}$ method is not appropriate for multiple outlier search.

7. Conclusions

We compared by simulation the outlier identification results of the new method and of methods given in previous studies. Even in the case of the normal model, which has been investigated by many authors, the new method shows excellent identification power. In many situations it has superior performance compared to existing methods.
The obtained results considerably widen the spectrum of commonly used non-regression models for which outlier identification methods are available. Many two-parameter models, such as the Weibull, logistic, loglogistic, extreme value, Cauchy, Laplace, and others, can be investigated by applying the new method.
An advantage of the proposed outlier identification method is its very good potential for generalizations. The authors are at the completion stage of research on outlier identification methods for accelerated failure time regression models and generalized linear models, the gamma regression model in particular. Outlier identification methods for time series are another direction of future work. A further possible direction is the investigation of Gaussian mixture regression models (see [29]).
A limitation of the new method is that it cannot be applied to the analysis of discrete models. Taking into consideration that the method is based on asymptotic results, we recommend not applying it to samples of very small size ($n \leq 15$).
The R package outliersTests was created for the practical use of the proposed test.

Author Contributions

Investigation, V.B. and L.P.; Methodology, V.B. and L.P.; Supervision, V.B.; Writing—original draft, V.B. and L.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bol’shev, L.; Ubaidullaeva, M. Chauvenet’s Test in the Classical Theory of Errors. Theory Probab. Appl. 1975, 19, 683–692. [Google Scholar] [CrossRef]
  2. Davies, L.; Gather, U. The Identification of Multiple Outliers. J. Am. Stat. Assoc. 1993, 88, 782–792. [Google Scholar] [CrossRef]
  3. Dixon, W.J. Analysis of Extreme Values. Ann. Math. Stat. 1950, 21, 488–506. [Google Scholar] [CrossRef]
  4. Grubbs, F.E. Sample Criteria for Testing Outlying Observations. Ann. Math. Stat. 1950, 21, 27–58. [Google Scholar] [CrossRef]
  5. Rosner, B. On the Detection of Many Outliers. Technometrics 1975, 17, 221–227. [Google Scholar] [CrossRef]
  6. Tietjen, G.L.; Moore, R.H. Some Grubbs-Type Statistics for the Detection of Several Outliers. Technometrics 1972, 14, 583–597. [Google Scholar] [CrossRef]
  7. Barnett, V.; Lewis, T. Outliers in Statistical Data; John Wiley & Sons: Hoboken, NJ, USA, 1974. [Google Scholar]
  8. Zerbet, A. Statistical Tests for Normal Family in Presence of Outlying Observations. In Goodness-of-Fit Tests and Model Validity; Huber-Carol, C., Balakrishnan, N., Nikulin, M.S., Mesbah, M., Eds.; Birkhäuser Boston: Basel, Switzerland, 2002; pp. 57–64. [Google Scholar]
  9. Chikkagoudar, M.; Kunchur, S.H. Distributions of test statistics for multiple outliers in exponential samples. Commun. Stat. Theory Methods 1983, 12, 2127–2142. [Google Scholar] [CrossRef]
  10. Kabe, D.G. Testing outliers from an exponential population. Metrika 1970, 15, 15–18. [Google Scholar] [CrossRef]
  11. Kimber, A. Testing upper and lower outlier pairs in gamma samples. Commun. Stat. Simul. Comput. 1988, 17, 1055–1072. [Google Scholar] [CrossRef]
  12. Lalitha, S.; Kumar, N. Multiple outlier test for upper outliers in an exponential sample. J. Appl. Stat. 2012, 39, 1323–1330. [Google Scholar] [CrossRef]
  13. Lewis, T.; Fieller, N.R.J. A Recursive Algorithm for Null Distributions for Outliers: I. Gamma Samples. Technometrics 1979, 21, 371–376. [Google Scholar] [CrossRef]
  14. Likeš, I.J. Distribution of Dixon’s statistics in the case of an exponential population. Metrika 1967, 11, 46–54. [Google Scholar] [CrossRef]
  15. Lin, C.T.; Balakrishnan, N. Exact computation of the null distribution of a test for multiple outliers in an exponential sample. Comput. Stat. Data Anal. 2009, 53, 3281–3290. [Google Scholar] [CrossRef]
  16. Lin, C.T.; Balakrishnan, N. Tests for Multiple Outliers in an Exponential Sample. Commun. Stat. Simul. Comput. 2014, 43, 706–722. [Google Scholar] [CrossRef]
  17. Zerbet, A.; Nikulin, M. A new statistic for detecting outliers in exponential case. Commun. Stat. Theory Methods 2003, 32, 573–583. [Google Scholar] [CrossRef]
  18. Torres, J.M.; Pastor Pérez, J.; Sancho Val, J.; McNabola, A.; Martínez Comesaña, M.; Gallagher, J. A functional data analysis approach for the detection of air pollution episodes and outliers: A case study in Dublin, Ireland. Mathematics 2020, 8, 225. [Google Scholar] [CrossRef] [Green Version]
  19. Gaddam, A.; Wilkin, T.; Angelova, M.; Gaddam, J. Detecting Sensor Faults, Anomalies and Outliers in the Internet of Things: A Survey on the Challenges and Solutions. Electronics 2020, 9, 511. [Google Scholar] [CrossRef] [Green Version]
  20. Ferrari, E.; Bosco, P.; Calderoni, S.; Oliva, P.; Palumbo, L.; Spera, G.; Fantacci, M.E.; Retico, A. Dealing with confounders and outliers in classification medical studies: The Autism Spectrum Disorders case study. Artif. Intell. Med. 2020, 108, 101926. [Google Scholar] [CrossRef]
  21. Zhang, C.; Xiao, X.; Wu, C. Medical Fraud and Abuse Detection System Based on Machine Learning. Int. J. Environ. Res. Public Health 2020, 17, 7265. [Google Scholar] [CrossRef] [PubMed]
  22. Souza, T.I.; Aquino, A.L.; Gomes, D.G. A method to detect data outliers from smart urban spaces via tensor analysis. Future Gener. Comput. Syst. 2019, 92, 290–301. [Google Scholar] [CrossRef]
  23. Hawkins, D.M. Identification of Outliers; Springer: Dordrecht, The Netherlands, 1980; Volume 11. [Google Scholar]
  24. Kimber, A.C. Tests for Many Outliers in an Exponential Sample. J. R. Stat. Soc. 1982, 31, 263–271. [Google Scholar] [CrossRef]
  25. De Haan, L.; Ferreira, A. Extreme Value Theory: An Introduction; Springer: New York, NY, USA, 2007. [Google Scholar]
  26. Rousseeuw, P.J.; Croux, C. Alternatives to the median absolute deviation. J. Am. Stat. Assoc. 1993, 88, 1273–1283. [Google Scholar] [CrossRef]
  27. Liu, Y.; Abeyratne, A.I. Practical Applications of Bayesian Reliability; John Wiley & Sons: Hoboken, NJ, USA, 2019. [Google Scholar]
  28. Rosner, B. Percentage points for the RST many outlier procedure. Technometrics 1977, 19, 307–312. [Google Scholar] [CrossRef]
  29. Su, H.; Hu, Y.; Karimi, H.R.; Knoll, A.; Ferrigno, G.; De Momi, E. Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results. Neural Netw. 2020, 131, 291–299. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The true values of the significance level of Rosner's and BP tests as a function of n for different values of s (α = 0.05 is used in the approximations).
Figure 2. Hawkins's method: the values of D_NO + D_OO as a function of μ and r (n = 100, s = 5).
Figure 3. The number of outliers classified as non-outliers (D_ON). The alternative: two-sided, with outliers generated by a two-parameter exponential distribution on both sides.
Figure 4. The difference between the number of outliers and the number of rejected observations, given sample size n = 100 and r = 10 outliers.
Figure 5. The difference between the number of outliers and the number of rejected observations, given sample size n = 100 and r = 10 outliers.
Table 1. Expressions of b_n and a_n.

Distribution | F_0(x) | b_n | a_n
Normal | Φ(x) | Φ^{-1}(1 − 1/n) | 1/b_n
Type I extreme value | 1 − e^(−e^x) | ln ln n | e^(−b_n)
Type II extreme value | e^(−e^(−x)) | −ln(−ln(1 − 1/n)) | e^(b_n)/(n − 1)
Logistic | 1/(1 + e^(−x)) | ln(n − 1) | n/(n − 1)
Laplace | 1/2 + (1/2) sign(x)(1 − e^(−|x|)) | ln(n/2) | 1
Cauchy | 1/2 + (1/π) arctan(x) | cot(π/n) | π/(n sin²(π/n))
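A consistency check on Table 1 (our observation, easy to verify): in every row, b_n is the (1 − 1/n)-quantile of F_0, i.e., it solves F_0(b_n) = 1 − 1/n. A few rows verified numerically in R:

```r
# Each b_n in Table 1 solves F0(b_n) = 1 - 1/n; with n = 100 every
# probability below should print as 0.99.
n <- 100
pnorm(qnorm(1 - 1 / n))        # normal:   b_n = qnorm(1 - 1/n)
plogis(log(n - 1))             # logistic: b_n = ln(n - 1)
pcauchy(1 / tan(pi / n))       # Cauchy:   b_n = cot(pi/n)
1 - exp(-exp(log(log(n))))     # type I extreme value: b_n = ln ln n
```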
Table 2. Values of d for various probability distributions.

Distribution | K_0(x) | d
Normal | Φ(x/√2) | 2.2219
Type I extr. val. | 1/(1 + e^(−x)) | 1.9576
Type II extr. val. | 1/(1 + e^(−x)) | 1.9576
Logistic | 1 − ((x − 1)e^x + 1)/(e^x − 1)² | 1.3079
Laplace | 1 − (1/2)(1 + x/2)e^(−x) | 1.9306
Cauchy | 1/2 + (1/π) arctan(x/2) | 1.2071
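A pattern worth noting (our observation, not stated in the table itself): each K_0 in Table 2 coincides with the distribution function of the difference of two independent variables with distribution F_0 from Table 1; for the Laplace case the displayed expression applies for x ≥ 0. A quick Monte Carlo confirmation in R for two of the rows:

```r
# Monte Carlo check: K0 appears to be the CDF of X1 - X2 with X1, X2
# independent draws from F0 (normal and logistic cases shown).
set.seed(1)
N <- 1e6
mean(rnorm(N) - rnorm(N) <= 1)    # ~0.7602
pnorm(1 / sqrt(2))                # normal K0(1) from Table 2
mean(rlogis(N) - rlogis(N) <= 1)  # ~0.6613
1 - 1 / (exp(1) - 1)^2            # logistic K0(1) from Table 2
```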
Table 3. Illustrative sample (n = 20, r = 7).

i | x_i | |Ŷ_i| | (i) | i | x_i | |Ŷ_i| | (i)
1 | 6.10 | 3.18 | 16 | 11 | −0.69 | 0.28 | 9
2 | 10 | 5.17 | 18 | 12 | −0 | 0.07 | 5
3 | 6.20 | 3.23 | 17 | 13 | 0.05 | 0.10 | 6
4 | −0.08 | 0.03 | 2 | 14 | −0.20 | 0.03 | 1
5 | 0.63 | 0.39 | 11 | 15 | −0.25 | 0.06 | 4
6 | −0.54 | 0.21 | 7 | 16 | −0.64 | 0.25 | 8
7 | 1.37 | 0.77 | 13 | 17 | −6.30 | 3.14 | 15
8 | 0.46 | 0.30 | 10 | 18 | −5.50 | 2.73 | 14
9 | −0.22 | 0.04 | 3 | 19 | −12.10 | 6.10 | 19
10 | 0.94 | 0.55 | 12 | 20 | −20 | 10.13 | 20
Table 4. Illustrative example of the BP test classification of observations.

U_(20)^(20) | U_(19)^(20) | U_(18)^(20) | U_(17)^(20) | U_(16)^(20) | U_(20,5)
1.000000 | 1.000000 | 1.000000 | 0.999998 | 1.000000 | 1.000000
U_(19)^(19) | U_(18)^(19) | U_(17)^(19) | U_(16)^(19) | U_(15)^(19) | U_(19,5)
0.999685 | 0.999998 | 0.999916 | 0.999998 | 1.000000 | 1.000000
U_(18)^(18) | U_(17)^(18) | U_(16)^(18) | U_(15)^(18) | U_(14)^(18) | U_(18,5)
0.998046 | 0.996970 | 0.999893 | 0.999997 | 0.999997 | 0.999997
U_(17)^(17) | U_(16)^(17) | U_(15)^(17) | U_(14)^(17) | U_(13)^(17) | U_(17,5)
0.924219 | 0.996446 | 0.999871 | 0.999940 | 0.084290 | 0.999940
Table 5. Values of goodness-of-fit statistics and information criteria (initial sample).

Goodness-of-fit statistics | Weibull | Logistic | Log-normal
Kolmogorov-Smirnov statistic | 0.05 | 0.09 | 0.07
Cramer-von Mises statistic | 0.03 | 0.23 | 0.127
Anderson-Darling statistic | 0.21 | 1.36 | 1.08
Goodness-of-fit criteria | | |
Akaike's Information Criterion | 1056.515 | 1074.783 | 1073.13
Bayesian Information Criterion | 1061.725 | 1079.993 | 1078.34
Table 6. Values of goodness-of-fit statistics and information criteria (sample without removed outliers).

Goodness-of-fit statistics | Weibull | Logistic | Log-normal
Kolmogorov-Smirnov statistic | 0.048 | 0.09 | 0.07
Cramer-von Mises statistic | 0.027 | 0.21 | 0.11
Anderson-Darling statistic | 0.18 | 1.25 | 1.01
Goodness-of-fit criteria | | |
Akaike's Information Criterion | 1037.09 | 1054.76 | 1053.49
Bayesian Information Criterion | 1042.26 | 1059.93 | 1058.66
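Statistics of the kind shown in Tables 5 and 6 can be computed with standard R tooling. The sketch below is generic code under stated assumptions (a placeholder data vector x), not the authors' script:

```r
# Sketch: goodness-of-fit statistics and information criteria for
# Weibull, logistic and log-normal fits, in the spirit of Tables 5-6.
# Generic code, not the authors' script; x is placeholder data.
library(fitdistrplus)
set.seed(1)
x <- rweibull(200, shape = 1.5, scale = 2)
fits <- list(fitdist(x, "weibull"),
             fitdist(x, "logis"),
             fitdist(x, "lnorm"))
gofstat(fits, fitnames = c("Weibull", "Logistic", "Log-normal"))
# reports Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling
# statistics together with AIC and BIC for each fitted model
```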
Table 7. Hawkins's method: the values of D_NO + D_OO as a function of μ and r (n = 100, s = 5).

r \ μ | 0.1 | 1 | 6.3 | 10
1 | 0.31 + 0.00 | 0.66 + 0.00 | 3.93 + 1.00 | 3.99 + 1.00
2 | 0.87 + 0.00 | 2.15 + 0.06 | 3.00 + 1.21 | 3.00 + 2.00
3 | 1.33 + 0.08 | 1.99 + 0.84 | 2.00 + 2.00 | 2.00 + 2.00
4 | 0.89 + 0.58 | 1.00 + 1.42 | 1.00 + 3.00 | 1.00 + 3.00
5 | 0.01 + 1.15 | 0.00 + 2.03 | 0.00 + 3.02 | 0.00 + 3.96
Table 8. The masking values D_ON (n = 50 and n = 100).

n = 50:
r | Method \ θ | 0.1 | 0.4 | 1 | 4 | 10
2 | Rosner_5 | 1.36 | 0.95 | 0.51 | 0.15 | 0.06
2 | Rosner_15 | 1.36 | 0.95 | 0.51 | 0.15 | 0.06
2 | Rosner_[0.4n] | 1.36 | 0.95 | 0.51 | 0.15 | 0.06
2 | DG_rob | 1.56 | 1.17 | 0.71 | 0.24 | 0.10
2 | BP | 0.92 | 0.66 | 0.37 | 0.10 | 0.04
5 | Rosner_5 | 3.79 | 3.31 | 2.11 | 0.48 | 0.16
5 | Rosner_15 | 3.66 | 3.21 | 2.04 | 0.46 | 0.16
5 | Rosner_[0.4n] | 3.66 | 3.21 | 2.04 | 0.46 | 0.16
5 | DG_rob | 4.70 | 4.10 | 2.90 | 1.09 | 0.48
5 | BP | 2.00 | 1.68 | 1.18 | 0.40 | 0.15
8 | Rosner_5 | 8.00 | 7.97 | 7.54 | 3.70 | 3.06
8 | Rosner_15 | 5.70 | 5.48 | 4.52 | 1.00 | 0.29
8 | Rosner_[0.4n] | 5.70 | 5.48 | 4.52 | 1.00 | 0.29
8 | DG_rob | 7.90 | 7.49 | 6.10 | 2.67 | 1.24
8 | BP | 4.27 | 3.84 | 3.25 | 1.47 | 0.57

n = 100:
r | Method \ θ | 0.1 | 0.4 | 1 | 4 | 10
2 | Rosner_5 | 1.19 | 0.71 | 0.33 | 0.09 | 0.04
2 | Rosner_15 | 1.19 | 0.71 | 0.33 | 0.09 | 0.04
2 | Rosner_[0.4n] | 1.19 | 0.71 | 0.33 | 0.09 | 0.04
2 | DG_rob | 1.31 | 0.84 | 0.44 | 0.13 | 0.06
2 | BP | 0.50 | 0.32 | 0.15 | 0.04 | 0.02
5 | Rosner_5 | 3.52 | 2.57 | 1.27 | 0.27 | 0.10
5 | Rosner_15 | 3.43 | 2.52 | 1.24 | 0.26 | 0.10
5 | Rosner_[0.4n] | 3.43 | 2.52 | 1.24 | 0.26 | 0.10
5 | DG_rob | 4.23 | 3.01 | 1.81 | 0.57 | 0.25
5 | BP | 0.78 | 0.60 | 0.43 | 0.15 | 0.07
10 | Rosner_5 | 10.0 | 9.90 | 8.21 | 5.10 | 5.00
10 | Rosner_15 | 6.88 | 6.54 | 4.36 | 0.69 | 0.22
10 | Rosner_[0.4n] | 6.88 | 6.54 | 4.36 | 0.69 | 0.22
10 | DG_rob | 9.74 | 8.38 | 5.78 | 2.12 | 0.92
10 | BP | 2.21 | 1.90 | 1.73 | 0.74 | 0.30
Table 9. The masking values D_ON (n = 1000).

r | Method \ θ | 0.1 | 0.4 | 1 | 4 | 1000
5 | Rosner_5 | 2.15 | 0.69 | 0.29 | 0.07 | 0.00
5 | Rosner_15 | 2.12 | 0.66 | 0.27 | 0.07 | 0.00
5 | Rosner_[0.4n] | 2.12 | 0.66 | 0.27 | 0.07 | 0.00
5 | DG_rob | 1.99 | 0.78 | 0.35 | 0.09 | 0.00
5 | BP | 0.25 | 0.23 | 0.22 | 0.11 | 0.00
20 | Rosner_5 | 19.0 | 15.8 | 15.0 | 15.0 | 15.0
20 | Rosner_15 | 19.2 | 10.9 | 5.52 | 5.00 | 5.00
20 | Rosner_[0.4n] | 12.7 | 6.94 | 1.76 | 0.30 | 0.00
20 | DG_rob | 14.8 | 6.97 | 3.32 | 1.93 | 0.00
20 | BP | 0.29 | 0.26 | 0.23 | 0.18 | 0.00
100 | Rosner_5 | 100 | 99.9 | 96.7 | 95.0 | 95.0
100 | Rosner_15 | 100 | 99.92 | 96.48 | 85.0 | 85.0
100 | Rosner_[0.4n] | 55.8 | 56.8 | 50.4 | 4.43 | 0.01
100 | DG_rob | 100 | 89.9 | 61.6 | 22.2 | 0.1
100 | BP | 4.72 | 4.00 | 3.95 | 3.58 | 0.04
Table 10. Masking values for the logistic, Laplace, extreme value II and Cauchy distributions, when n = 100, r = 5.

Logistic:
Method \ θ | 0.1 | 1 | 6.3 | 10
DG_ML | 5 | 4.89 | 3.64 | 3.42
DG_rob | 4.21 | 2.69 | 0.76 | 0.51
BP | 1.3 | 1.13 | 0.78 | 0.64

Laplace:
Method \ θ | 0.1 | 1 | 6.3 | 10
DG_ML | 5 | 4.96 | 3.98 | 3.78
DG_rob | 4.27 | 2.98 | 0.87 | 0.59
BP | 1.31 | 1.21 | 0.8 | 0.66

Extreme value II:
Method \ θ | 0.1 | 1 | 6.3 | 10
DG_ML | 4.96 | 4.19 | 3 | 2.9
DG_rob | 4.29 | 2.25 | 0.59 | 0.4
BP | 1.25 | 0.56 | 0.14 | 0.11

Cauchy:
Method \ θ | 1 | 100 | 1000 | 10^5
DG_ML | 5 | 5 | 5 | 5
DG_rob | 3.81 | 2.89 | 0.8 | 0.01
BP | 0.38 | 0.4 | 0.39 | 0.13
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
