# Multiple Outlier Detection Tests for Parametric Models

## Abstract


## 1. Introduction

## 2. Outliers and Outlier Regions

## 3. New Method

#### 3.1. Preliminary Results

**Condition A.**

- (a) $\widehat{\mu}$ and $\widehat{\sigma}$ are consistent estimators of $\mu$ and $\sigma$;
- (b) the limit distribution of $(\sqrt{n}(\widehat{\mu}-\mu),\sqrt{n}(\widehat{\sigma}-\sigma))$ is non-degenerate;
- (c) $$\lim_{x\to\infty}\frac{x f_{0}(x)}{\sqrt{1-F_{0}(x)}}=0.$$
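
Condition (c) restricts the tail of the baseline density. For intuition, it can be checked numerically for a given family; the sketch below (an illustration, not from the paper, assuming the standard normal baseline $F_0=\Phi$) evaluates the ratio $x f_0(x)/\sqrt{1-F_0(x)}$ at increasing arguments and shows it decaying to zero.

```python
import numpy as np
from scipy.stats import norm

def condition_c_ratio(x):
    """Evaluate x * f0(x) / sqrt(1 - F0(x)) for the standard normal baseline.

    norm.sf(x) = 1 - F0(x) is used for numerical accuracy in the far tail.
    """
    return x * norm.pdf(x) / np.sqrt(norm.sf(x))

xs = np.array([2.0, 5.0, 10.0, 20.0])
vals = condition_c_ratio(xs)  # decays rapidly toward 0
```

For the normal law $1-\Phi(x)\sim\varphi(x)/x$, so the ratio behaves like $x^{3/2}\sqrt{\varphi(x)}$, which vanishes as $x\to\infty$, consistent with the numerical decay.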

**Theorem 1.**

**Proof of Theorem 1.**

**Remark 1.**

**Theorem 2.**

**Proof of Theorem 2.**

**Remark 2.**

**Remark 3.**

#### 3.2. Robust Estimators for Location-Shape Distributions

#### 3.3. Right Outliers Identification Method for Location-Scale Families

**Theorem 3.**

**Proof of Theorem 3.**

#### 3.4. Left Outliers Identification Method for Location-Scale Families

#### 3.5. Outlier Detection Tests for Location-Scale Families: Two-Sided Alternative, Symmetric Distributions

#### 3.6. Outlier Detection Tests for Location-Scale Families: Two-Sided Alternative, Non-Symmetric Distributions

#### 3.7. Outlier Identification Method for Shape-Scale Families

#### 3.8. Illustrative Example

#### 3.9. Practical Example

## 4. Generalization of Davies-Gather Outlier Identification Method

## 5. Short Survey of Multiple Outlier Identification Methods for Normal Data

#### 5.1. Rosner’s Method

#### 5.2. Bolshev’s Method

#### 5.3. Hawkins’ Method

## 6. Comparative Analysis of Outlier Identification Methods by Simulation

- (1) Two-parameter exponential distribution $\mathcal{E}(\theta ,{x}_{{\alpha}_{n}})$ with scale parameter $\theta$. If $\theta$ is small, the outliers are concentrated near the border of the outlier region; if $\theta$ is large, they are widely spread within the outlier region. As $\theta$ increases, the mean of the outlier distribution increases. Note that even if $\theta$ is very close to 0 and the true number of outliers r is large, these outliers may strongly corrupt the data, making the tails of the histogram too heavy.
- (2) Truncated normal distribution $\mathcal{T}\mathcal{N}({x}_{{\alpha}_{n}},\mu ,\rho )$ with location and scale parameters $\mu ,\rho$ ($\mu >{x}_{{\alpha}_{n}}$). If $\rho$ is small, this distribution is concentrated in a small interval around $\mu$. As $\mu$ increases, the mean of the outlier distribution increases.
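
To make the two contamination schemes concrete, here is a sketch (ours, not the authors’ code) that generates a standard normal sample with r right outliers drawn from either scheme. The border ${x}_{{\alpha}_{n}}$ of the outlier region is computed here as a Davies–Gather-type cut-off with $\alpha_n = 1-(1-\alpha)^{1/n}$, which is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(0)

def contaminated_sample(n, r, x_border, scheme="exp", theta=1.0, mu=5.0, rho=1.0):
    """n - r clean N(0, 1) observations followed by r right outliers beyond x_border.

    scheme="exp":   outliers ~ x_border + Exponential(scale=theta)
    scheme="tnorm": outliers ~ N(mu, rho^2) truncated to (x_border, inf)
    """
    clean = rng.standard_normal(n - r)
    if scheme == "exp":
        out = x_border + rng.exponential(theta, size=r)
    else:
        a = (x_border - mu) / rho  # lower truncation point in standardized units
        out = truncnorm.rvs(a, np.inf, loc=mu, scale=rho, size=r, random_state=rng)
    return np.concatenate([clean, out])

# Border of the outlier region for alpha = 0.05, n = 100 (Davies-Gather-type cut-off)
n, alpha = 100, 0.05
alpha_n = 1 - (1 - alpha) ** (1 / n)
x_border = norm.ppf(1 - alpha_n)

sample = contaminated_sample(n, 10, x_border, scheme="exp", theta=1.0)
sample2 = contaminated_sample(n, 10, x_border, scheme="tnorm", mu=x_border + 1.0, rho=0.5)
```

By construction the last r entries of each sample lie beyond the border, so the effect of $\theta$ (spread of the exponential outliers) and $\mu ,\rho$ (location and spread of the truncated normal outliers) can be explored directly.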

#### 6.1. Investigation of Outlier Identification Methods for Normal Data

#### 6.2. Investigation of Outlier Identification Methods for Other Location-Scale Models

## 7. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Bol’shev, L.; Ubaidullaeva, M. Chauvenet’s Test in the Classical Theory of Errors. Theory Probab. Appl. **1975**, 19, 683–692.
- Davies, L.; Gather, U. The Identification of Multiple Outliers. J. Am. Stat. Assoc. **1993**, 88, 782–792.
- Dixon, W.J. Analysis of Extreme Values. Ann. Math. Stat. **1950**, 21, 488–506.
- Grubbs, F.E. Sample Criteria for Testing Outlying Observations. Ann. Math. Stat. **1950**, 21, 27–58.
- Rosner, B. On the Detection of Many Outliers. Technometrics **1975**, 17, 221–227.
- Tietjen, G.L.; Moore, R.H. Some Grubbs-Type Statistics for the Detection of Several Outliers. Technometrics **1972**, 14, 583–597.
- Barnett, V.; Lewis, T. Outliers in Statistical Data; John Wiley & Sons: Hoboken, NJ, USA, 1974.
- Zerbet, A. Statistical Tests for Normal Family in Presence of Outlying Observations. In Goodness-of-Fit Tests and Model Validity; Huber-Carol, C., Balakrishnan, N., Nikulin, M.S., Mesbah, M., Eds.; Birkhäuser Boston: Basel, Switzerland, 2002; pp. 57–64.
- Chikkagoudar, M.; Kunchur, S.H. Distributions of test statistics for multiple outliers in exponential samples. Commun. Stat. Theory Methods **1983**, 12, 2127–2142.
- Kabe, D.G. Testing outliers from an exponential population. Metrika **1970**, 15, 15–18.
- Kimber, A. Testing upper and lower outlier pairs in gamma samples. Commun. Stat. Simul. Comput. **1988**, 17, 1055–1072.
- Lalitha, S.; Kumar, N. Multiple outlier test for upper outliers in an exponential sample. J. Appl. Stat. **2012**, 39, 1323–1330.
- Lewis, T.; Fieller, N.R.J. A Recursive Algorithm for Null Distributions for Outliers: I. Gamma Samples. Technometrics **1979**, 21, 371–376.
- Likeš, I.J. Distribution of Dixon’s statistics in the case of an exponential population. Metrika **1967**, 11, 46–54.
- Lin, C.T.; Balakrishnan, N. Exact computation of the null distribution of a test for multiple outliers in an exponential sample. Comput. Stat. Data Anal. **2009**, 53, 3281–3290.
- Lin, C.T.; Balakrishnan, N. Tests for Multiple Outliers in an Exponential Sample. Commun. Stat. Simul. Comput. **2014**, 43, 706–722.
- Zerbet, A.; Nikulin, M. A new statistic for detecting outliers in exponential case. Commun. Stat. Theory Methods **2003**, 32, 573–583.
- Torres, J.M.; Pastor Pérez, J.; Sancho Val, J.; McNabola, A.; Martínez Comesaña, M.; Gallagher, J. A functional data analysis approach for the detection of air pollution episodes and outliers: A case study in Dublin, Ireland. Mathematics **2020**, 8, 225.
- Gaddam, A.; Wilkin, T.; Angelova, M.; Gaddam, J. Detecting Sensor Faults, Anomalies and Outliers in the Internet of Things: A Survey on the Challenges and Solutions. Electronics **2020**, 9, 511.
- Ferrari, E.; Bosco, P.; Calderoni, S.; Oliva, P.; Palumbo, L.; Spera, G.; Fantacci, M.E.; Retico, A. Dealing with confounders and outliers in classification medical studies: The Autism Spectrum Disorders case study. Artif. Intell. Med. **2020**, 108, 101926.
- Zhang, C.; Xiao, X.; Wu, C. Medical Fraud and Abuse Detection System Based on Machine Learning. Int. J. Environ. Res. Public Health **2020**, 17, 7265.
- Souza, T.I.; Aquino, A.L.; Gomes, D.G. A method to detect data outliers from smart urban spaces via tensor analysis. Future Gener. Comput. Syst. **2019**, 92, 290–301.
- Hawkins, D.M. Identification of Outliers; Springer: Dordrecht, The Netherlands, 1980; Volume 11.
- Kimber, A.C. Tests for Many Outliers in an Exponential Sample. J. R. Stat. Soc. **1982**, 31, 263–271.
- De Haan, L.; Ferreira, A. Extreme Value Theory: An Introduction; Springer: New York, NY, USA, 2007.
- Rousseeuw, P.J.; Croux, C. Alternatives to the median absolute deviation. J. Am. Stat. Assoc. **1993**, 88, 1273–1283.
- Liu, Y.; Abeyratne, A.I. Practical Applications of Bayesian Reliability; John Wiley & Sons: Hoboken, NJ, USA, 2019.
- Rosner, B. Percentage points for the RST many outlier procedure. Technometrics **1977**, 19, 307–312.
- Su, H.; Hu, Y.; Karimi, H.R.; Knoll, A.; Ferrigno, G.; De Momi, E. Improved recurrent neural network-based manipulator control with remote center of motion constraints: Experimental results. Neural Netw. **2020**, 131, 291–299.

**Figure 1.** The true values of the significance level of Rosner’s and $BP$ tests as a function of n for different values of s ($\alpha =0.05$ is used in the approximations).

**Figure 2.** Hawkins’ method: the values of ${D}_{NO}+{D}_{OO}$ as a function of $\mu$ and r ($n=100$, $s=5$).

**Figure 3.** The number of outliers rejected as non-outliers (${D}_{ON}$). The alternative is two-sided, with outliers generated by a two-parameter exponential distribution on both sides.

**Figure 4.** The difference between the number of outliers and the number of rejected observations, given sample size $n=100$ and $r=10$ outliers.

**Figure 5.** The difference between the number of outliers and the number of rejected observations, given sample size $n=100$ and $r=10$ outliers.

| Distribution | $F_0(x)$ | $b_n$ | $a_n$ |
|---|---|---|---|
| Normal | $\Phi(x)$ | $\Phi^{-1}(1-1/n)$ | $1/b_n$ |
| Type I extreme value | $1-e^{-e^{x}}$ | $\ln\ln n$ | $e^{-b_n}$ |
| Type II extreme value | $e^{-e^{-x}}$ | $-\ln(-\ln(1-1/n))$ | $e^{b_n}/(n-1)$ |
| Logistic | $\frac{1}{1+e^{-x}}$ | $\ln(n-1)$ | $n/(n-1)$ |
| Laplace | $\frac{1}{2}+\frac{1}{2}\,\mathrm{sign}(x)(1-e^{-\lvert x\rvert})$ | $\ln(n/2)$ | $1$ |
| Cauchy | $\frac{1}{2}+\frac{1}{\pi}\arctan x$ | $\cot(\pi/n)$ | $\frac{\pi/n}{\sin^{2}(\pi/n)}$ |
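
Each row follows the standard extreme-value normalization $b_n = F_0^{-1}(1-1/n)$, with $a_n$ the associated scale. A quick numerical check (an illustration using scipy, not from the paper) for the normal, logistic, and Cauchy rows:

```python
import numpy as np
from scipy.stats import norm

n = 100

# Normal row: b_n = Phi^{-1}(1 - 1/n) and a_n = 1/b_n
bn_normal = norm.ppf(1 - 1 / n)
an_normal = 1 / bn_normal

# Logistic row: F0(x) = 1/(1 + e^{-x}) gives b_n = ln(n - 1) exactly
bn_logistic = np.log(n - 1)
F0_logistic = 1 / (1 + np.exp(-bn_logistic))

# Cauchy row: b_n = cot(pi/n)
bn_cauchy = 1 / np.tan(np.pi / n)
F0_cauchy = 0.5 + np.arctan(bn_cauchy) / np.pi
```

In each case $F_0(b_n)$ returns exactly $1-1/n$, confirming the quantile interpretation of $b_n$.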

| Distribution | $K_0(x)$ | $d$ |
|---|---|---|
| Normal | $\Phi(x/\sqrt{2})$ | 2.2219 |
| Type I extreme value | $1/(1+e^{-x})$ | 1.9576 |
| Type II extreme value | $1/(1+e^{-x})$ | 1.9576 |
| Logistic | $1-\frac{(x-1)e^{x}+1}{(e^{x}-1)^{2}}$ | 1.3079 |
| Laplace | $1-\frac{1}{2}\left(1+\frac{x}{2}\right)e^{-x}$ | 1.9306 |
| Cauchy | $\frac{1}{2}+\frac{1}{\pi}\arctan(x/2)$ | 1.2071 |
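
The entries of this table are consistent with a Rousseeuw–Croux $Q_n$-type scale estimator: $K_0$ is the distribution of the difference of two independent observations, and $d$ matches the consistency factor $1/K_0^{-1}(5/8)$ (e.g., $1/(\sqrt{2}\,\Phi^{-1}(5/8)) \approx 2.22$ for the normal row). A sketch of such an estimator, under that interpretation (an assumption on our part, not the paper’s code):

```python
import numpy as np
from math import comb
from scipy.stats import norm

def qn_scale(x, d=2.2219):
    """Q_n scale estimate: d times the k-th smallest pairwise distance,
    with k = C(h, 2), h = floor(n/2) + 1 (Rousseeuw-Croux 1993)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = n // 2 + 1
    k = comb(h, 2)
    pair = np.abs(x[:, None] - x[None, :])[np.triu_indices(n, k=1)]  # |x_i - x_j|, i < j
    return d * np.sort(pair)[k - 1]

# Consistency factor for the normal row: d = 1 / K0^{-1}(5/8) with K0(x) = Phi(x/sqrt(2))
d_normal = 1 / (np.sqrt(2) * norm.ppf(5 / 8))

rng = np.random.default_rng(1)
qhat = qn_scale(rng.standard_normal(500), d=d_normal)  # should be close to sigma = 1
```

On a large standard normal sample the estimate is close to the true scale 1, which is what the consistency factor is for.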

| i | $x_i$ | $\lvert\widehat{Y}_i\rvert$ | $(i)$ | i | $x_i$ | $\lvert\widehat{Y}_i\rvert$ | $(i)$ |
|---|---|---|---|---|---|---|---|
| 1 | 6.10 | 3.18 | 16 | 11 | −0.69 | 0.28 | 9 |
| 2 | 10 | 5.17 | 18 | 12 | −0 | 0.07 | 5 |
| 3 | 6.20 | 3.23 | 17 | 13 | 0.05 | 0.10 | 6 |
| 4 | −0.08 | 0.03 | 2 | 14 | −0.20 | 0.03 | 1 |
| 5 | 0.63 | 0.39 | 11 | 15 | −0.25 | 0.06 | 4 |
| 6 | −0.54 | 0.21 | 7 | 16 | −0.64 | 0.25 | 8 |
| 7 | 1.37 | 0.77 | 13 | 17 | −6.30 | 3.14 | 15 |
| 8 | 0.46 | 0.30 | 10 | 18 | −5.50 | 2.73 | 14 |
| 9 | −0.22 | 0.04 | 3 | 19 | −12.10 | 6.10 | 19 |
| 10 | 0.94 | 0.55 | 12 | 20 | −20 | 10.13 | 20 |
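
In this illustrative example the $\lvert\widehat{Y}_i\rvert$ column can be read as robustly standardized absolute deviations. The sketch below is our reconstruction, not the authors’ code: it uses the sample median as the location estimate, and the scale value $\widehat{\sigma}\approx 1.96$ is an assumption that approximately reproduces the reported column and its ranks.

```python
import numpy as np

x = np.array([6.10, 10, 6.20, -0.08, 0.63, -0.54, 1.37, 0.46, -0.22, 0.94,
              -0.69, -0.0, 0.05, -0.20, -0.25, -0.64, -6.30, -5.50, -12.10, -20])

mu_hat = np.median(x)   # robust location estimate
sigma_hat = 1.96        # assumed robust scale; approximately reproduces the table
y = np.abs(x - mu_hat) / sigma_hat   # standardized absolute deviations |Y_i|
ranks = y.argsort().argsort() + 1    # rank 1 = smallest standardized deviation
```

The four most extreme observations ($x_2=10$, $x_{17}$, $x_{19}$, $x_{20}$) receive the largest ranks, matching the table.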

| ${U}_{\left(20\right)}\left(20\right)$ | ${U}_{\left(19\right)}\left(20\right)$ | ${U}_{\left(18\right)}\left(20\right)$ | ${U}_{\left(17\right)}\left(20\right)$ | ${U}_{\left(16\right)}\left(20\right)$ | $U(20,5)$ |
|---|---|---|---|---|---|
| 1.000000 | 1.000000 | 1.000000 | 0.999998 | 1.000000 | 1.000000 |

| ${U}_{\left(19\right)}\left(19\right)$ | ${U}_{\left(18\right)}\left(19\right)$ | ${U}_{\left(17\right)}\left(19\right)$ | ${U}_{\left(16\right)}\left(19\right)$ | ${U}_{\left(15\right)}\left(19\right)$ | $U(19,5)$ |
|---|---|---|---|---|---|
| 0.999685 | 0.999998 | 0.999916 | 0.999998 | 1.000000 | 1.000000 |

| ${U}_{\left(18\right)}\left(18\right)$ | ${U}_{\left(17\right)}\left(18\right)$ | ${U}_{\left(16\right)}\left(18\right)$ | ${U}_{\left(15\right)}\left(18\right)$ | ${U}_{\left(14\right)}\left(18\right)$ | $U(18,5)$ |
|---|---|---|---|---|---|
| 0.998046 | 0.996970 | 0.999893 | 0.999997 | 0.999997 | 0.999997 |

| ${U}_{\left(17\right)}\left(17\right)$ | ${U}_{\left(16\right)}\left(17\right)$ | ${U}_{\left(15\right)}\left(17\right)$ | ${U}_{\left(14\right)}\left(17\right)$ | ${U}_{\left(13\right)}\left(17\right)$ | $U(17,5)$ |
|---|---|---|---|---|---|
| 0.924219 | 0.996446 | 0.999871 | 0.999940 | 0.084290 | 0.999940 |

| Goodness-of-Fit Statistics | Weibull | Logistic | Log-Normal |
|---|---|---|---|
| Kolmogorov–Smirnov statistic | 0.05 | 0.09 | 0.07 |
| Cramér–von Mises statistic | 0.03 | 0.23 | 0.127 |
| Anderson–Darling statistic | 0.21 | 1.36 | 1.08 |
| **Goodness-of-fit criteria** | | | |
| Akaike’s Information Criterion | 1056.515 | 1074.783 | 1073.13 |
| Bayesian Information Criterion | 1061.725 | 1079.993 | 1078.34 |

**Table 6.** Values of goodness-of-fit statistics and information criteria (sample with outliers removed).

| Goodness-of-Fit Statistics | Weibull | Logistic | Log-Normal |
|---|---|---|---|
| Kolmogorov–Smirnov statistic | 0.048 | 0.09 | 0.07 |
| Cramér–von Mises statistic | 0.027 | 0.21 | 0.11 |
| Anderson–Darling statistic | 0.18 | 1.25 | 1.01 |
| **Goodness-of-fit criteria** | | | |
| Akaike’s Information Criterion | 1037.09 | 1054.76 | 1053.49 |
| Bayesian Information Criterion | 1042.26 | 1059.93 | 1058.66 |
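
Model comparisons of the kind shown in the two tables above can be reproduced with standard tools: fit each candidate family by maximum likelihood, then compute the Kolmogorov–Smirnov distance and AIC $=2k-2\ln\widehat{L}$. The sketch below is purely illustrative (it uses simulated Weibull data, not the paper’s data set):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = stats.weibull_min.rvs(1.5, scale=2.0, size=300, random_state=rng)

def fit_and_score(dist, data):
    """ML fit with the location fixed at 0, then KS statistic and AIC."""
    params = dist.fit(data, floc=0)
    loglik = dist.logpdf(data, *params).sum()
    k = 2  # shape and scale are estimated; loc is fixed at 0
    aic = 2 * k - 2 * loglik
    ks = stats.kstest(data, dist.cdf, args=params).statistic
    return aic, ks

aic_weib, ks_weib = fit_and_score(stats.weibull_min, data)
aic_logn, ks_logn = fit_and_score(stats.lognorm, data)
```

BIC is obtained the same way with $k\ln n$ in place of $2k$; the family with the smallest criterion value is preferred, as in the tables.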

**Table 7.** Hawkins’ method: the values of ${D}_{NO}+{D}_{OO}$ as a function of $\mu$ and r ($n=100$, $s=5$).

| r $\backslash\,\mu$ | 0.1 | 1 | 6.3 | 10 |
|---|---|---|---|---|
| 1 | 0.31 + 0.00 | 0.66 + 0.00 | 3.93 + 1.00 | 3.99 + 1.00 |
| 2 | 0.87 + 0.00 | 2.15 + 0.06 | 3.00 + 1.21 | 3.00 + 2.00 |
| 3 | 1.33 + 0.08 | 1.99 + 0.84 | 2.00 + 2.00 | 2.00 + 2.00 |
| 4 | 0.89 + 0.58 | 1.00 + 1.42 | 1.00 + 3.00 | 1.00 + 3.00 |
| 5 | 0.01 + 1.15 | 0.00 + 2.03 | 0.00 + 3.02 | 0.00 + 3.96 |

$n=50$:

| r | Method $\backslash\,\theta$ | 0.1 | 0.4 | 1 | 4 | 10 |
|---|---|---|---|---|---|---|
| 2 | $Rosner_{5}$ | 1.36 | 0.95 | 0.51 | 0.15 | 0.06 |
| | $Rosner_{15}$ | 1.36 | 0.95 | 0.51 | 0.15 | 0.06 |
| | $Rosner_{[0.4n]}$ | 1.36 | 0.95 | 0.51 | 0.15 | 0.06 |
| | $DG_{rob}$ | 1.56 | 1.17 | 0.71 | 0.24 | 0.10 |
| | $BP$ | 0.92 | 0.66 | 0.37 | 0.10 | 0.04 |
| 5 | $Rosner_{5}$ | 3.79 | 3.31 | 2.11 | 0.48 | 0.16 |
| | $Rosner_{15}$ | 3.66 | 3.21 | 2.04 | 0.46 | 0.16 |
| | $Rosner_{[0.4n]}$ | 3.66 | 3.21 | 2.04 | 0.46 | 0.16 |
| | $DG_{rob}$ | 4.70 | 4.10 | 2.90 | 1.09 | 0.48 |
| | $BP$ | 2.00 | 1.68 | 1.18 | 0.40 | 0.15 |
| 8 | $Rosner_{5}$ | 8.00 | 7.97 | 7.54 | 3.70 | 3.06 |
| | $Rosner_{15}$ | 5.70 | 5.48 | 4.52 | 1.00 | 0.29 |
| | $Rosner_{[0.4n]}$ | 5.70 | 5.48 | 4.52 | 1.00 | 0.29 |
| | $DG_{rob}$ | 7.90 | 7.49 | 6.10 | 2.67 | 1.24 |
| | $BP$ | 4.27 | 3.84 | 3.25 | 1.47 | 0.57 |

$n=100$:

| r | Method $\backslash\,\theta$ | 0.1 | 0.4 | 1 | 4 | 10 |
|---|---|---|---|---|---|---|
| 2 | $Rosner_{5}$ | 1.19 | 0.71 | 0.33 | 0.09 | 0.04 |
| | $Rosner_{15}$ | 1.19 | 0.71 | 0.33 | 0.09 | 0.04 |
| | $Rosner_{[0.4n]}$ | 1.19 | 0.71 | 0.33 | 0.09 | 0.04 |
| | $DG_{rob}$ | 1.31 | 0.84 | 0.44 | 0.13 | 0.06 |
| | $BP$ | 0.50 | 0.32 | 0.15 | 0.04 | 0.02 |
| 5 | $Rosner_{5}$ | 3.52 | 2.57 | 1.27 | 0.27 | 0.10 |
| | $Rosner_{15}$ | 3.43 | 2.52 | 1.24 | 0.26 | 0.10 |
| | $Rosner_{[0.4n]}$ | 3.43 | 2.52 | 1.24 | 0.26 | 0.10 |
| | $DG_{rob}$ | 4.23 | 3.01 | 1.81 | 0.57 | 0.25 |
| | $BP$ | 0.78 | 0.60 | 0.43 | 0.15 | 0.07 |
| 10 | $Rosner_{5}$ | 10.0 | 9.90 | 8.21 | 5.10 | 5.00 |
| | $Rosner_{15}$ | 6.88 | 6.54 | 4.36 | 0.69 | 0.22 |
| | $Rosner_{[0.4n]}$ | 6.88 | 6.54 | 4.36 | 0.69 | 0.22 |
| | $DG_{rob}$ | 9.74 | 8.38 | 5.78 | 2.12 | 0.92 |
| | $BP$ | 2.21 | 1.90 | 1.73 | 0.74 | 0.30 |

| r | Method $\backslash\,\theta$ | 0.1 | 0.4 | 1 | 4 | 1000 |
|---|---|---|---|---|---|---|
| 5 | $Rosner_{5}$ | 2.15 | 0.69 | 0.29 | 0.07 | 0.00 |
| | $Rosner_{15}$ | 2.12 | 0.66 | 0.27 | 0.07 | 0.00 |
| | $Rosner_{[0.4n]}$ | 2.12 | 0.66 | 0.27 | 0.07 | 0.00 |
| | $DG_{rob}$ | 1.99 | 0.78 | 0.35 | 0.09 | 0.00 |
| | $BP$ | 0.25 | 0.23 | 0.22 | 0.11 | 0.00 |
| 20 | $Rosner_{5}$ | 19.0 | 15.8 | 15.0 | 15.0 | 15.0 |
| | $Rosner_{15}$ | 19.2 | 10.9 | 5.52 | 5.00 | 5.00 |
| | $Rosner_{[0.4n]}$ | 12.7 | 6.94 | 1.76 | 0.30 | 0.00 |
| | $DG_{rob}$ | 14.8 | 6.97 | 3.32 | 1.93 | 0.00 |
| | $BP$ | 0.29 | 0.26 | 0.23 | 0.18 | 0.00 |
| 100 | $Rosner_{5}$ | 100 | 99.9 | 96.7 | 95.0 | 95.0 |
| | $Rosner_{15}$ | 100 | 99.92 | 96.4 | 85.0 | 85.0 |
| | $Rosner_{[0.4n]}$ | 55.8 | 56.8 | 50.4 | 4.43 | 0.01 |
| | $DG_{rob}$ | 100 | 89.9 | 61.6 | 22.2 | 0.1 |
| | $BP$ | 4.72 | 4.00 | 3.95 | 3.58 | 0.04 |

**Table 10.** Masking values for the logistic, Laplace, extreme value II, and Cauchy distributions, for $n=100$, $r=5$.

**Logistic:**

| Method $\backslash\,\theta$ | 0.1 | 1 | 6.3 | 10 |
|---|---|---|---|---|
| $DG_{ML}$ | 5 | 4.89 | 3.64 | 3.42 |
| $DG_{rob}$ | 4.21 | 2.69 | 0.76 | 0.51 |
| $BP$ | 1.3 | 1.13 | 0.78 | 0.64 |

**Laplace:**

| Method $\backslash\,\theta$ | 0.1 | 1 | 6.3 | 10 |
|---|---|---|---|---|
| $DG_{ML}$ | 5 | 4.96 | 3.98 | 3.78 |
| $DG_{rob}$ | 4.27 | 2.98 | 0.87 | 0.59 |
| $BP$ | 1.31 | 1.21 | 0.8 | 0.66 |

**Extreme value II:**

| Method $\backslash\,\theta$ | 0.1 | 1 | 6.3 | 10 |
|---|---|---|---|---|
| $DG_{ML}$ | 4.96 | 4.19 | 3 | 2.9 |
| $DG_{rob}$ | 4.29 | 2.25 | 0.59 | 0.4 |
| $BP$ | 1.25 | 0.56 | 0.14 | 0.11 |

**Cauchy:**

| Method $\backslash\,\theta$ | 1 | 100 | 1000 | $10^{5}$ |
|---|---|---|---|---|
| $DG_{ML}$ | 5 | 5 | 5 | 5 |
| $DG_{rob}$ | 3.81 | 2.89 | 0.8 | 0.01 |
| $BP$ | 0.38 | 0.4 | 0.39 | 0.13 |


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bagdonavičius, V.; Petkevičius, L.
Multiple Outlier Detection Tests for Parametric Models. *Mathematics* **2020**, *8*, 2156.
https://doi.org/10.3390/math8122156
