An Exhaustive Power Comparison of Normality Tests

Jurgita Arnastauskaitė; Tomas Ruzgas; Mindaugas Bražėnas

doi:10.3390/math9070788

,

and

¹

Department of Applied Mathematics, Kaunas University of Technology, 51368 Kaunas, Lithuania

²

Department of Computer Sciences, Kaunas University of Technology, 51368 Kaunas, Lithuania

³

Department of Mathematical modelling, Kaunas University of Technology, 51368 Kaunas, Lithuania

^*

Author to whom correspondence should be addressed.

Mathematics2021, 9(7), 788;https://doi.org/10.3390/math9070788

This article belongs to the Special Issue Probability, Statistics and Their Applications 2021

Version Notes

Order Reprints

Abstract

A goodness-of-fit test is a frequently used modern statistics tool. However, it is still unclear what the most reliable approach is to check assumptions about data set normality. A particular data set (especially with a small number of observations) only partly describes the process, which leaves many options for the interpretation of its true distribution. As a consequence, many goodness-of-fit statistical tests have been developed, the power of which depends on particular circumstances (i.e., sample size, outlets, etc.). With the aim of developing a more universal goodness-of-fit test, we propose an approach based on an N-metric with our chosen kernel function. To compare the power of 40 normality tests, the goodness-of-fit hypothesis was tested for 15 data distributions with 6 different sample sizes. Based on exhaustive comparative research results, we recommend the use of our test for samples of size

n \geq 118

.

Keywords:

goodness of fit test; normal distribution; power comparison

1. Introduction

A priori information about data distribution is not always known. In those cases, hypothesis testing can help to find a reasonable assumption about the distribution of data. Based on assumed data distribution, one can choose appropriate methods for further research. The information about data distribution can be useful in a number of ways, for example:

it can provide insights about the observed process;
parameters of model can be inferred from the characteristics of data distributions; and
it can help in choosing more specific and computationally efficient methods.

Statistical methods often require data to be normally distributed. If the assumption of normality is not satisfied, the results of these methods will be inappropriate. Therefore, the presumption of normality is strictly required before starting the statistical analysis. Many tests have been developed to check this assumption. However, tests are defined in various ways and thus react to abnormalities, present in a data set, differently. Therefore, the choice of goodness-of-fit test remains an important problem.

H_{0} : T h e d i s t r i b u t i o n i s n o r m a l, H_{A} : T h e d i s t r i b u t i o n i s n o t n o r m a l .

For these reasons, this study examines the issue of testing the goodness-of-fit hypotheses. The goodness-of-fit null and alternative hypotheses are defined as:

A total of 40 tests were applied to analyze the problem of testing the goodness-of-fit hypothesis. The tests used in this study were developed between 1900 and 2016. In the early 19th century, Karl Pearson published an article defining the chi-square test []. This test is considered as the basis of modern statistics. Pearson was the first to examine the goodness-of-fit assumption that the observations

x_{i}

can be distributed according to the normal distribution, and concluded that, in the limit as

n

becomes large,

X^{2}

follows the chi-square distribution with

k - 1

degrees of freedom. The statistics for this test are defined in Section 2.1. Another popular test for testing the goodness-of-fit hypothesis is the Kolmogorov and Smirnov test []. This test statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution []. The Anderson and Darling test is also often used in practice []. This test assesses whether a sample comes from a specified distribution []. The end of 19th century and the beginning of 20th century was a successful period for the development of goodness-of-fit hypothesis test criteria and their comparison studies [,,,,,,,,,,,,,,].

In 2010, Xavier Romão et al. conducted a comprehensive study comparing the power of the goodness-of-fit hypothesis tests []. In the study, 33 normality tests were applied to samples of different sizes, taking into account the significance level

α

and many symmetric, asymmetric, and modified normal distributions. The researchers found that the most powerful of the selected normality tests for the symmetric group of distributions were Coin

β_{3}^{2}

, Chen–Shapiro, Bonett–Seier, and Gel–Miao–Gastwirth tests; for the asymmetric group of distributions, Zhang–Wu

Z_{C}

and

Z_{A}

, and Chen–Shapiro; while the Chen–Shapiro, Barrio–Cuesta-Albertos–Matrán–Rodríguez-Rodríguez, and Shapiro–Wilk tests were the most powerful for the group of modified normal distributions.

In 2015, Adefisoye et al. compared 18 normality tests for different sample sizes for symmetric and asymmetric distribution groups []. The results of the study showed that the Kurtosis test was the most powerful for a group of symmetric data distributions and the Shapiro–Wilk test was the most powerful for a group of asymmetric data distributions.

The main objective of this study is to perform a comparative analysis of the power of the most commonly used tests for testing the goodness-of-fit hypothesis. The procedure described in Section 3 was used to calculate the power of the tests.

Scientific novelty—the comparative analysis of test power was carried out using different methods for goodness-of-fit in the case of many different types of challenges to curve tests. The goodness-of-fit tests have been selected as representatives of popular techniques, which have been analyzed by other researchers experimentally. We have proposed a new kernel function and its usage in an N-metric-based test. The uniqueness of the kernel function is that its shape is chosen in such a way that the shift arising in the formation of the test is eliminated by using sample values.

The rest of the paper is organized as follows. Section 2 provides descriptions of the 40 goodness-of-fit hypothesis tests and the procedure for calculating the power of the tests. The samples generated from 15 distributions are given in Section 4. Section 5 presents and discusses the results of a simulation modeling study. Finally, Section 6 concludes the results.

2. Statistical Methods

In this section, the most popular tests for normality are overviewed.

2.1. Chi-Square Test (CHI2)

In 1900, Karl Pearson introduced the chi-square test []. The statistic of the test is defined as:

χ^{2} = \sum_{i = 1}^{k} \frac{{(O_{i} - E_{i})}^{2}}{E_{i}},

(1)

where

O_{i}

is the observed frequency and

E_{i}

is the expected frequency.

2.2. Kolmogorov–Smirnov (KS)

In 1933, Kolmogorov and Smirnov proposed the KS test []. The statistic of the test is defined as:

χ^{2} = D^{+} = \max {(\frac{i}{n}) - z_{i}}, 1 \leq i \leq n; D^{-} = \max {z_{i} - \frac{i - 1}{n}}, 1 \leq i \leq n; D = \max (D^{+}, D^{-}),

(2)

where

z_{i}

is the cumulative probability of standard normal distribution and

D

is the difference between observed and expected values.

2.3. Anderson–Darling (AD)

In 1952, Anderson and Darling developed a variety of the Kolmogorov and Smirnov tests []. This test is more powerful than the Kolmogorov and Smirnov test. The statistic of the test is defined as:

A^{2} = - n - \sum_{i = 1}^{n} \frac{2 i - 1}{n} (\ln (F (x_{i})) + \ln (1 - F (x_{n + 1 - i}))),

(3)

where

F (x_{i}) is

the value of the distribution function at point

x_{i}

and

n is the

empirical sample size.

2.4. Cramer–Von Mises (CVM)

In 1962, Cramer proposed the Cramer–von Mises test. This test is an alternative to the Kolmogorov and Smirnov test []. The statistic of the test is defined as:

C M = \frac{1}{12 n} + \sum_{i = 1}^{n} {(Z_{i} - \frac{2 i - 1}{2 n})}^{2},

(4)

where

Z_{i}

is the cumulative distribution function of the specified distribution

Z_{i} = X_{(i)} - \bar{X} / S, \bar{and X}

and

S

are the sample mean and sample standard deviation.

2.5. Shapiro–Wilk (SW)

In 1965, Shapiro and Wilk formed the original test []. The statistic of the test is defined as:

W = \frac{{(\sum_{i = 1}^{n} a_{i} x_{(i)})}^{2}}{{(\sum_{i = 1}^{n} x_{i} - \bar{x})}^{2}},

(5)

where

x_{(i)}

is the

i^{t h}

order statistic,

\bar{x}

is the sample mean, and

a_{i}

constants obtained:

a_{i} = (a_{1}, \dots, a_{n}) = \frac{m^{T} V^{- 1}}{{(m^{T} V^{- 1} V^{- 1} m)}^{1 / 2}},

where

m = (m_{1}, \dots, m_{n})^{T}

are the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution and

V

is the covariance matrix of those order statistics.

2.6. Lilliefors (LF)

In 1967, Lilliefors modified the Kolmogorov and Smirnov test []. The statistic of the test is defined as:

T = \sup_{x} | F^{*} (x) - S (x) |,

(6)

where

F^{*} (x) is

the standard normal distribution function and

S (x) is

the empirical distribution function of the

z_{i}

values.

2.7. D’Agostino (DA)

In 1971, D’Agostino introduced the test for testing the goodness-of-fit hypothesis, which is an extension of the Shapiro–Wilk test []. The test proposed by D’Agostino does not need to define a weight vector. The statistic of the test is defined as:

D = \frac{\sum_{i = 1}^{n} (i - (n + 1) / 2) \cdot x_{(i)}}{n^{2} \cdot \sqrt{m_{2}}},

(7)

where

m_{2}

is the second central moment that is defined as:

m_{2} = n^{- 1} \sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2} .

2.8. Shapiro–Francia (SF)

In 1972, Shapiro and Francia simplified the Shapiro and Wilk test and developed the Shapiro and Francia test, which is computationally more efficient []. The statistic of the test is defined as:

W_{S F} = \frac{{(\sum_{i = 1}^{n} m_{i} x_{i})}^{2}}{{(\sum_{i = 1}^{n} x_{i} - \bar{x})}^{2} \sum_{i = 1}^{n} m_{i}^{2}},

(8)

where

m_{i} is

the expected values of the standard normal order statistics.

2.9. D’Agostino–Pearson (DAP)

In 1973–1974, D’Agostino and Pearson proposed the D’Agostino and Pearson test []. The statistic of the test is defined as:

D P = \frac{\sum_{i = 1}^{n} (i - (n + 1) / 2) x_{(i)}}{n^{2} \sqrt{m_{2}}},

(9)

where

n

is the size of sample and

m_{2}

is the sample variance of order statistics.

2.10. Filliben (Filli)

In 1975, Filliben defined the probabilistic correlation coefficient

r

as a test for the goodness-of-fit hypothesis []. This test statistic is defined as:

r = \frac{\sum_{i = 1}^{n} x_{(i)} \cdot M_{(i)}}{\sqrt{\sum_{i = 1}^{n} M_{(i)}^{2}} \cdot \sqrt{(n - 1) \cdot σ^{2}}},

(10)

where

σ^{2} is

the variance,

M_{(i)} = Φ^{- 1} (m_{(i)})

, when

m_{(i)} is

the estimated median values of the order statistics, each

m_{(i)}

is obtained by:

m_{(i)} = {\begin{matrix} 1 - {0.5}^{(\frac{1}{n})} i = 1, \\ \frac{(i - 0.3175)}{(n + 0.365)} 1 < i < n, \\ {0.5}^{(\frac{1}{n})} i = n . \end{matrix}

2.11. Martinez–Iglewicz (MI)

In 1981, Martinez and Iglewicz proposed a normality test based on the ratio of two estimators of variance, where one of the estimators is the robust biweight scale estimator

S_{b}^{2}

[]:

S_{b}^{2} = \frac{n \cdot \sum_{| {\tilde{z}}_{i} | < 1} {(x_{i} - M)}^{2} {(1 - {\tilde{z}}_{i}^{2})}^{4}}{{[\sum_{| {\tilde{z}}_{i} | < 1} (1 - {\tilde{z}}_{i}^{2}) (1 - 5 {\tilde{z}}_{i}^{2})]}^{2}},

where

M

is the sample median,

{\tilde{z}}_{i} = (x_{i} - M) / (9 A)

, with

A

being the median of

| x_{i} - M |

.

This test statistic is then given by:

I_{n} = \frac{\sum_{i = 1}^{n} {(x_{i} - M)}^{2}}{(n - 1) \cdot S_{b}^{2}} .

(11)

2.12. Epps–Pulley (EP)

In 1983, Epps and Pulley proposed a test statistic based on the following weighted integral []:

T_{E P} = \int_{- \infty}^{\infty} {| φ_{n} (t) - {\hat{φ}}_{0} (t) |}^{2} d G (t),

where

φ_{n} (t)

is the empirical characteristic function and

G (t)

is an adequate function chosen according to several considerations. By setting

d G (t) = g (t) d t

and selecting:

g (t) = \sqrt{m_{2} / 2 π} \cdot e x p (- 0.5 m_{2} t^{2})

the following statistic is obtained:

T_{E P} = 1 + \frac{n}{\sqrt{3}} + \frac{2}{n} \sum_{k = 2}^{n} \sum_{j = 1}^{k - 1} \exp ((- {(x_{j} - x_{k})}^{2}) / 2 m_{2}) - \sqrt{2} \sum_{j = 1}^{n} \exp ((- {(x_{j} - \bar{x})}^{2}) / 4 m_{2}),

(12)

where

m_{2} is

the second central moment.

2.13. Jarque–Bera (JB)

In 1987, Jarque and Bera proposed a test [] with statistic defined as:

J B = \frac{n}{6} (s + \frac{{(k - 3)}^{2}}{4}),

(13)

where

s = \frac{m_{3}^{2}}{m_{2}^{3}}

and

k = \frac{m_{4}}{m_{2}^{3}}

are the sample skewness and kurtosis.

2.14. Hosking ( $H_{1} - H_{3})$

In 1990, Hosking and Wallis proposed the first Hosking test []. This test statistic is defined as:

H_{i} = \frac{V_{i} - μ_{V}}{σ_{V}},

(14)

where

μ_{V}

and

σ_{V}

are the mean and standard deviation of number of simulation data values of

V .

V_{i}

is calculated as:

V_{1} = \sqrt{\frac{\sum_{i = 1}^{N} n_{i} {(t^{(i)} - t^{R})}^{2}}{\sum_{i = 1}^{N} n_{i}}}, V_{2} = \sum_{i = 1}^{N} n_{i} \frac{\sqrt{{(t^{(i)} - t^{R})}^{2} + {(t_{3} - t_{3}^{R})}^{2}}}{\sum_{i = 1}^{N} n_{i}},

V_{3} = \sum_{i = 1}^{N} n_{i} \frac{\sqrt{{(t_{3}^{(i)} - t_{3}^{R})}^{2} + {(t_{4}^{(i)} - t_{4}^{R})}^{2}}}{\sum_{i = 1}^{N} n_{i}}, t^{R} = \frac{\sum_{i = 1}^{N} n_{i} t^{(i)}}{\sum_{i = 1}^{N} n_{i}},

where

t^{(i)} is the

coefficient of variation of the L-moment ratio,

t_{3}^{(i)} is the

coefficient of skewness of the L- moment, and

t_{4}^{(i)}

is the coefficient of kurtosis of the L- moment.

2.15. Cabaña–Cabaña (CC1-CC2)

In 1994, Cabaña and Cabaña proposed the CC1 and CC2 tests []. The CC1 (

T_{S, l}

) and CC2 (

T_{K, l}

), respectively, are defined as:

T_{S, l} = \max | w_{S, l} (x) |, T_{K, l} = \max | w_{K, l} (x) |,

(15)

where

w_{S, l} (x)

and

w_{K, l} (x)

approximate transformed estimated empirical processes sensitive to changes in skewness and kurtosis and are defined as:

w_{S, l} = Φ (x) \cdot {\bar{H}}_{3} - ϕ (x) \cdot \sum_{j = 1}^{l} \frac{1}{\sqrt{j}} H_{j - 1} (x) \cdot {\bar{H}}_{j + 3}, w_{K, l} = - ϕ (x) \cdot {\bar{H}}_{3} + [Φ (x) - x \cdot ϕ (x)] \cdot {\bar{H}}_{4} - ϕ (x) \cdot \sum_{j = 2}^{l} (\sqrt{\frac{j}{j - 1}} H_{j - 2} (x) \cdot H_{j} (x)) \cdot {\bar{H}}_{j + 3},

where

l is

a dimensionality parameter,

Φ (x) is

the probability density function of the standard normal distribution,

H_{j} (\cdot) is

the

j

th order normalized Hermite polynomial, and

{\bar{H}}_{j} is

the

j

th order normalized mean of the Hermite polynomial defined as:

{\bar{H}}_{j} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} H_{j} (x_{i}) .

2.16. The Chen–Shapiro Test (ChenS)

In 1995, Chen and Shapiro introduced an alternative test statistic based on normalized spacings and defined as []:

C S = \frac{1}{(n - 1) \cdot s} \sum_{i = 1}^{n - 1} \frac{x_{(i + 1)} - x_{(i)}}{M_{i + 1} - M_{i}},

(16)

where

M_{i} is

the

i

th quantile of a standard normal distribution.

2.17. Modified Shapiro-Wilk (SWRG)

In 1997, Rahman and Govindarajulu proposed a modification to the Shapiro–Wilk test []. This test statistic is simpler to compute and relies on a new definition of the weights using the approximations to

m

and

V

. Each element

a_{i}

of the weight vector is given as:

a_{i} = - (n + 1) (n + 2) ϕ (m_{i}) [m_{i - 1} ϕ (m_{i - 1}) - 2 m_{i} ϕ (m_{i}) + m_{i + 1} ϕ (m_{i + 1})],

(17)

where it is assumed that

m_{0} ϕ (m_{0}) = m_{n + 1} ϕ (m_{n + 1}) = 0

. Therefore, the modified test statistic assigns larger weights to the extreme order statistics than the original test.

2.18. Doornik–Hansen (DH)

In 1977, Bowman and Shenton introduced the Doornik–Hansen goodness-of-fit test []. This test statistic is obtained using transformations of skewness and kurtosis:

s = \frac{m_{3}}{{\sqrt{m_{2}}}^{3}} a n d k = \frac{m_{4}}{m_{2}^{2}}

(18)

where

m_{i} = \frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{i} \bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i} and n is

sample size.

The DH test statistics have a chi-square distribution with two degrees of freedom. It is defined as:

D H = z_{1}^{2} + z_{2}^{2} ~ χ^{2} (2),

where

z_{1} = δ \log (y + \sqrt{y^{2} - 1})

,

δ = \frac{1}{\sqrt{l o g (w^{2})}}

,

w^{2} = - 1 + \sqrt{2 (β - 1)}

,

β = \frac{3 (n^{2} + 27 n - 70) (n + 1) (n + 3)}{(n - 2) (n + 5) (n + 7) (n + 9)}, y = s \sqrt{\frac{(w^{2} - 1) (n + 1) (n + 3)}{12 (n - 2)}}, s = z_{2} = \sqrt{2 α} (\frac{1}{9 α} - 1 + \sqrt[3]{\frac{χ}{2 α}}), α = a + c \times s^{2}, a = \frac{(n - 2) (n + 5) (n + 7) (n^{2} + 27 n - 70)}{6 δ}, c = \frac{(n - 7) (n + 5) (n + 7) (n^{2} + 2 n - 5)}{6 δ}, δ = (n - 3) (n + 1) (n^{2} + 15 n - 4), χ = 2 l (k - 1 - s^{2}), l = \frac{(n + 5) (n + 7) (n^{3} + 37 n^{2} + 11 n - 313)}{12 δ} .

2.19. Zhang $Q$ (ZQ), $Q^{}$ (ZQstar), $Q - Q^{}$ (ZQQstar)

In 1999, Zhang introduced the Qtest statistic based on the ratio of two unbiased estimators of standard deviation,

q_{1}

and

q_{2}

, given by

Q = l n (q_{1} / q_{2})

[]. The estimators

q_{1}

and

q_{2}

are calculated by

q_{1} = \sum_{i = 1}^{n} a_{i} x_{(i)}

and

q_{2} = \sum_{i = 1}^{n} b_{i} x_{(i)},

where the

i

th order linear coefficients

a_{i}

and

b_{i}

are:

a_{i} = {[(u_{i} - u_{1}) (n - 1)]}^{- 1}, given i \neq 1, a_{1} = \sum_{i = 2}^{n} a_{i}, b_{i} = {\begin{matrix} - b_{n - i + 1} = {[(u_{i} - u_{i + 4}) (n - 4)]}^{- 1} i = 1, \dots, 4, \\ {(n - 4)}^{- 1} \cdot [{(u_{i} - u_{i + 4})}^{- 1} - {(u_{i - 4} - u_{i})}^{- 1}] i = 5, \dots, n - 4, \end{matrix}

(19)

where

u_{i}

is the

i

th expected value of the order statistics of a standard normal distribution,

u_{i} = Φ^{- 1} [(i - 0.375) / (n + 0.25)] .

Zhang also proposed the alternative statistic

Q^{*}

by switching the ith order statistics

x_{(i)}

in

q_{1}

and

q_{2}

by

x_{(i)}^{*} = - x_{(n - i + 1)} .

In addition to those already discussed, Zhang proposed joint test

Q - Q^{*}

, based on the fact that

Q

and

Q^{*}

are approximately independent.

2.20. Barrio–Cuesta-Albertos–Matran–Rodriguez-Rodriguez (BCMR)

In 1999, Barrio, Cuesta-Albertos, Matrán, and Rodríguez-Rodríguez proposed a new BCMR goodness-of-fit test []. This test is based on

L_{2}

-Wasserstein distance and is defined as:

B C M R = \frac{m_{2} - {[\sum_{i = 1}^{n} x_{(i)} \cdot \int_{(i - 1) / n}^{i / n} Φ^{- 1} (t) d t]}^{2}}{m_{2}},

(20)

where the numerator represents the squared

L_{2}

-Wasserstein distance.

2.21. Glen–Leemis–Barr (GLB)

In 2001, Glen, Leemis, and Barr extended the Kolmogorov–Smirnov and Anderson–Darling test to form the GLB test []. This test statistic is defined as:

P_{S} = - n - \frac{1}{n} \cdot \sum_{i = 1}^{n} [(2 n + 1 - 2 i) \cdot \ln (p_{(i)}) + (2 i - 1) \cdot \ln (1 - p_{(i)})],

(21)

where

p_{(i)} is

the elements of the vector

p

containing the quantiles of the order statistics sorted in ascending order.

2.22. Bonett–Seier $T_{w}$ (BS)

In 2002, Bonett and Seier introduced the BS test []. The statistic for this test is defined as:

T_{w} = \frac{\sqrt{n + 2} \cdot (\hat{ω} - 3)}{3.54},

(22)

where

\hat{ω} = 13.29 [l n \sqrt{m_{2}} - \ln (n^{- 1} \sum_{i = 1}^{n} | x_{i} - \bar{x} |)], m_{2} = \frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2}

.

2.23. Bontemps–Meddahi (BM1– $B M_{3 - 4}$ , BM2– $B M_{3 - 6}$ )

In 2005, Bontemps and Meddahi proposed a family of normality tests based on moment conditions known as Stein equations and their relation with Hermite polynomials []. The statistic of the test is defined as:

B M_{3 - p} = \sum_{k = 3}^{p} {(\frac{1}{\sqrt{n}} \sum_{i = 1}^{n} H_{k} (z_{i}))}^{2},

(23)

where

z_{i} = (x_{i} - \bar{x}) / s

and

H_{k} (\cdot) is

the

k

th order normalized Hermite polynomial having the general expression given by:

\forall i > 1 : H_{i} (u) = \frac{1}{\sqrt{i}} [u \cdot H_{i - 1} (u) - \sqrt{i - 1} \cdot H_{i - 2} (u)], H_{0} (u) = 1, H_{1} (u) = u .

2.24. Zhang–Wu (ZW1– $Z_{C}$ , ZW2– $Z_{A}$ )

In 2005, Zhang and Wu presented the ZW1 and ZW2 goodness-of-fit tests []. The

Z_{C}

and

Z_{A}

statistics are similar to the Cramér–von Mises and Anderson–Darling tests statistics based on the empirical distribution function. The statistic of the test is defined as:

Z_{C} = \sum_{i = 1}^{n} {[l n \frac{Φ {(z_{(i)})}^{- 1} - 1}{\frac{n - 0.5}{i - 0.75} - 1}]}^{2}, Z_{A} = - \sum_{i = 1}^{n} [\frac{l n Φ (z_{(i)})}{n - i + 0.5} + \frac{\ln [1 - Φ (z_{(i)})]}{i - 0.5}],

(24)

where

Φ (z_{(i)}) = (i - 0.5) / n

.

2.25. Gel–Miao–Gastwirth (GMG)

In 2007, Gel, Miao, and Gastwirth proposed the GMG test []. The statistic of the test is defined as:

R_{s J} = \frac{s}{J_{n}},

(25)

where

J_{n}

is the ratio of the standard deviation and the robust measure of dispersion is defined as:

J_{n} = \frac{\sqrt{π / 2}}{n} \sum_{i = 1}^{n} | x_{i} - M |,

where

M

is the median of the sample.

2.26. Robust Jarque–Bera (RJB)

In 2007, Gel and Gastwirth modified the Jarque–Bera test and got a more powerful Jarque–Bera test []. RJB test statistic is defined as:

R J B = \frac{n}{6} {(\frac{m_{3}}{J_{n}^{3}})}^{2} + \frac{n}{64} {(\frac{m_{4}}{J_{n}^{4}} - 3)}^{2},

(26)

where

m_{3}

,

m_{4}

are the third and fourth moments, respectively, and

J_{n} is

the ratio of the standard deviation.

2.27. Coin $β_{3}^{2}$

In 2008, Coin proposed a test based on polynomial regression to determine the group distributions of symmetric distributions []. The type of model for this test is:

z_{(i)} = β_{1} \cdot α_{i} + β_{3} \cdot α_{i}^{3},

(27)

where

β_{1}

and

β_{3} are

fitting parameters and

α_{i}

is the expected values of standard normal order statistics.

2.28. Brys–Hubert–Struyf $T_{M C - L R}$ (BHS)

In 2008, Brys, Hubert, and Struyf introduced the BHS tests []. This test is based on skewness and long tails. The statistics for this test

T_{M C - L R}

is defined as:

T_{M C - L R} = n {(w - ω)}^{T} V^{- 1} (w - ω),

(28)

where

w

is set as

{[M C, L M C, R M C]}^{T}

,

M C

is medcouple,

L M C is

left medcouple,

R M C is

right medcouple, and

ω

and

V

are obtained based on the influence function of the estimators in

ω

. In the case of a normal distribution:

ω = {[0, 0.199, 0.199]}^{T}, V = [\begin{matrix} 1.25 & 0.323 & - 0.323 \\ 0.323 & 2.62 & - 0.0123 \\ - 0.323 & - 0.0123 & 2.62 \end{matrix}] .

2.29. Brys–Hubert–Struyf–Bonett–Seier $T_{M C - L R} & T_{w}$ (BHSBS)

In 2008, Brys, Hubert, Struyf, Bonett, and Seier introduced the combined BHSBS test []. This test statistic is defined as:

T_{M C - L R} & T_{w} = n {(w - ω)}^{T} V^{- 1} (w - ω) & \frac{\sqrt{n + 2} \cdot (\hat{ω} - 3)}{3.54},

(29)

where

ω

is asymptotic mean and

V

is covariance matrix.

2.30. Desgagné–Lafaye de Micheaux–Leblanc $R_{n}$ (DLDMLRn), $X_{A P D}^{a}$ (DLDMXAPD), $Z_{E P D}^{a}$ (DLDMZEPD)

In 2009, Desgagné, Lafaye de Micheaux, and Leblanc introduced the

R_{n}

and

X_{A P D}^{a}

tests []. The statistic

R_{n} (μ, σ)

for this test is defined as:

R_{n} (μ, σ) = \frac{1}{n} \sum_{i = 1}^{n} d_{θ} (Y_{i}) = (\begin{matrix} - 2 [\frac{1}{n} \sum_{i = 1}^{n} Y_{i}^{2} s i g n (Y_{i})] \\ - 2^{- 1} [\frac{1}{n} \sum_{i = 1}^{n} Y_{i}^{2} l o g | Y_{i} | - (2 - l o g 2 - γ) / 2] \end{matrix}),

(30)

where

Y_{i} = σ^{- 1} (X_{i} - μ)

. When

μ

and

σ

are unknown, the following maximum-likelihood estimators can be used:

{\hat{μ}}_{n} = {\bar{X}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} X_{i}, {\hat{σ}}_{n} = S_{n} = {[\frac{1}{n} \sum_{i = 1}^{n} {(X_{i} - {\bar{X}}_{n})}^{2}]}^{1 / 2} .

The DLDMXAPD test is based on skewness and kurtosis which are defined as:

s = \frac{1}{n} \sum_{i = 1}^{n} Z_{i}^{2} s i g n (Z_{i}), k = \frac{1}{n} \sum_{i = 1}^{n} Z_{i}^{2} l o g | (Z_{i}) |,

(31)

where

Z_{i} = S_{n}^{- 1} (X_{i} - {\bar{X}}_{n})

,

{\bar{X}}_{n}, S_{n}

are defined above.

The DLDMXAPD test is suitable for use when the sample size is greater than 10. The statistic

X_{A P D}^{a}

for this test is defined as:

X_{A P D}^{a} = \frac{n s^{2}}{3 - 8 / π} + \frac{n {(k - (2 - l o g 2 - γ) / 2)}^{2}}{(3 π^{2} - 28) / 8}, X_{A P D} = Z^{2} (s) + Z^{2} (k - s^{2}),

(32)

where

γ = 0.577215665 is

the Euler–Mascheroni constant and

s, k

are skewness and kurtosis, respectively.

In 2016, Desgagné, Lafaye de Micheaux, and Leblanc presented the DLDMZEPD test based on the skewness []. The statistic

Z_{E P D}^{a}

for this test is defined as:

Z_{E P D}^{a} = \frac{n^{1 / 2} (k - (2 - l o g 2 - γ) / 2)}{{[(3 π^{2} - 28) / 8]}^{1 / 2}}, Z_{E P D} = Z_{E P D} (k) .

(33)

2.31. N-Metric

We improved the Bakshaev [] goodness-of-fit hypothesis test based on N-metrics. This test is defined in the following way.

Under the null hypothesis statistic,

T_{n} = - n \int_{0}^{1} \int_{0}^{1} K (x) d (F_{n}^{*} (x) - x)

has the same asymptotic distribution as quadratic form:

T_{n} = \sum_{k = 1}^{\infty} \sum_{j = 1}^{\infty} \frac{a_{k j}}{π^{2} k j} ξ_{k} ξ_{j},

(34)

where

ξ_{k}

are independent random variables from the standard normal distribution and:

a_{k j} = - 2 \int_{0}^{1} \int_{0}^{1} K (x) d s i n (π k x) .

In this case, Bakshaev applied the kernel function

K (x) = | x - y |

, and we propose to apply another kernel function (Figure 1):

K (x) = φ (\bar{g} (x)) {\bar{g}}^{'} (x),

(35)

where

φ (x) = \frac{1}{\sqrt{2 π}} e^{- \frac{x^{2}}{2}}

.

Figure 1. Plot of out kernel function

K (x)

with experimentally chosen optimal parameters

a = 0.95, b = 0.25, and c = 1

.

An additional bias is introduced when the kernel function is calculated at the sample values (i.e., for

x = X (t)

). Therefore, to eliminate this bias, the shape of the kernel function is chosen so that the influence in the environment of the sample values is as small as possible.

Let

X

be the standard normal random variable,

Φ

and

φ

be its distribution and density functions, respectively, and

g : R \to R

is an odd strictly monotonically increasing function. Then the distribution function

F_{Y}

of the random variable

Y = g (X)

is

Φ (\bar{g} (x))

, where

\bar{g}

is the inverse of the function

g

. The distribution density

f_{Y}

of a random variable

Y

is

φ (\bar{g} (x)) {\bar{g}}^{'} (x)

. Let us consider the parametric class of functions

\bar{g}

, which depends on three parameters:

\bar{g} = x {(c + {| x |}^{b})}^{a}, a, b, c > 0, {\bar{g}}^{'} = {(c + {| x |}^{b})}^{a} + a | x | {(c + {| x |}^{b})}^{a - 1} b {| x |}^{b - 1}

where

a

is variance,

b

is trough, and

c

is peak shape parameter.

3. The Power of Test

The power of the test is defined as the probability of rejecting a false

H_{0}

hypothesis. Power is the opposite of type II error. Decreasing the probability of type I error

α

increases the probability of type II error and decreases the power of the test. The smaller the error is, the more powerful test is. In practice, the tests are designed to minimize the type II error for a fixed type I error. The most commonly chosen value for

α

is

0.05

. The probability of the opposite event is calculated as

1 - β

, i.e., the power of the test (see in Figure 2)

β

is the probability of rejecting hypothesis

H_{0}

when it is false. The power of the test makes it possible to compare two tests significance level and sample sizes. A more powerful test has a higher value of

1 - β

. Increasing the sample size usually increases the power of the test [,].

Figure 2. Illustration of the power.

When exact null distribution of a goodness-of-fit test statistic is a step function created by the summation of the exact probabilities for each possible value of the test statistic, it is possible to obtain the same critical value for a number of different adjacent significance levels

α

. Linear interpolation of the power of the test statistic using the power for a significance levels (see in Figure 3) less than (denoted

α_{1}

) and greater than (denoted

α_{2}

) the desired significance level (denoted as

α

) is preferred by many authors to overcome this problem (see, for example, []). Linear interpolation gives a weighting to the power based on how close

α_{1}

and

α_{2}

are to

α

. In this case, the power of the test is calculated according to the formula []:

P o w e r = \frac{(α - α_{1}) P (T \geq γ_{2} (α) | H_{1}) + (α_{2} - α) P (T \geq γ_{1} (α) | H_{1})}{α_{2} - α_{1}},

(36)

where

γ_{1} (α)

and

γ_{2} (α)

are the critical values immediately below and above the significance level

α .

α_{1}

=

P (T \geq γ_{1} (α) | H_{0})

and

α_{2}

=

P (T \geq γ_{2} (α) | H_{0})

are the significance levels for

γ_{1} (α)

and

γ_{2} (α)

, respectively.

Figure 3. Significance levels of the statistic step function.

The power of test statistics is determinate by the following steps []:

The distribution of the analyzed data $x_{1}, x_{2}, \dots, x_{n}$ is formed.
Statistics of the compatibility hypothesis test criteria are calculated. If the obtain value of statistic is greater than the corresponding critical value ( $α = 0.05$ is used), then hypothesis $H_{0}$ is rejected.
Steps 1 and 2 are repeated for $k$ (in our experiments, $k = 1, 000, 000$ ) times.
The power of a test is calculated as $c o u n t / k$ , where $c o u n t$ is the number of false hypotheses rejections.

4. Statistical Distributions

The simulation study considers fifteen statistical distributions for which the performance of the presented normality tests are assessed. Statistical distributions are grouped into three groups: symmetric, asymmetric, and modified normal distributions. A description of these distribution groups is presented in the following.

4.1. Symmetric Distributions

Symmetric distributions considered in this research are []:

three cases of the $B e t a (a, b)$ distribution $— B e t a (0.5; 0.5), B e t a (1; 1),$ and $B e t a (2; 2)$ , where $a$ and $b$ are the shape parameters;
three cases of the $C a u c h y (t, s)$ distribution— $C a u c h y (0; 0.5), C a u c h y (0; 1)$ , and $C a u c h y (0; 2)$ , where $t$ and $s$ are the location and scale parameters;
one case of the $L a p l a c e (t, s)$ distribution $— L a p l a c e (0; 1)$ , where $t$ and $s$ are the location and scale parameters;
one case of the $L o g i s t i c (t, s)$ distribution $— L o g i s t i c (2; 2)$ , where $t$ and $s$ are the location and scale parameters;
four cases of the $t - S t u d e n t (ν)$ distribution $— t (1), t (2), t (4),$ and $t (10)$ , where $ν$ is the number of degrees of freedom;
five cases of the $T u k e y (λ)$ distribution $— T u k e y (0.14), T u k e y (0.5), T u k e y (2), T u k e y (5),$ and $T u k e y (10)$ , where $λ$ is the shape parameter; and
one case of the standard normal $N (0; 1)$ distribution.

4.2. Asymmetric Distributions

Asymmetric distributions considered in this research are []:

four cases of the $B e t a (a, b)$ distribution $— B e t a (2; 1), B e t a (2; 5), B e t a (4; 0.5),$ and $B e t a (5; 1)$ ;
four cases of the $C h i$ - $s q u a r e d (ν)$ distribution $— χ^{2} (1), χ^{2} (2), χ^{2} (4)$ , and $χ^{2} (10)$ , where $ν$ is the number of degrees of freedom;
six cases of the $G a m m a (a, b)$ distribution— $G a m m a (2; 2), G a m m a (3; 2), G a m m a (5; 1), G a m m a (9; 1), G a m m a (15; 1)$ , and $G a m m a (100; 1)$ , where $a$ and $b$ are the shape and scale parameters;
one case of the $G u m b e l (t, s)$ distribution $— G u m b e l (1; 2)$ , where $t$ and $s$ are the location and scale parameters;
one case of the $L o g n o r m a l (t, s)$ distribution $— L N (0; 1)$ , where $t$ and $s$ are the location and scale parameters; and
four cases of the $W e i b u l l (a, b)$ distribution $— W e i b u l l (0.5; 1), W e i b u l l (1; 2), W e i b u l l (2; 3.4),$ and $W e i b u l l (3; 4)$ , where $a$ and $b$ are the shape and scale parameters.

4.3. Modified Normal Distributions

Modified normal distributions considered in this research are []:

six cases of the standard normal distribution truncated at $a$ and $b$ $T r u n c (a; b) — T r u n c (- 1; 1), T r u n c (- 2; 2), T r u n c (- 3; 3), T r u n c (- 2; 1), T r u n c (- 3; 1),$ and $T r u n c (- 3; 2),$ which are referred to as NORMAL1;
nine cases of a location-contaminated standard normal distribution, hereon termed $L o C o n N (p; a) — L o C o n N (0.3; 1), L o C o n N (0.4; 1), L o C o n N (0.5; 1), L o C o n N (0.3; 3),$ $L o C o n N (0.4; 3), L o C o n N (0.5; 3), L o C o n N (0.3; 5), L o C o n N (0.4; 5),$ and $L o C o n N (0.5; 5)$ , which are referred to as NORMAL2;
nine cases of a scale-contaminated standard normal distribution, hereon termed $S c C o n N (p; b) — S c C o n N (0.05; 0.25), S c C o n N (0.10; 0.25), S c C o n N (0.20; 0.25),$ $S c C o n N (0.05; 2), S c C o n N (0.10; 2), S c C o n N (0.20; 2), S c C o n N (0.05; 4), S c C o n N (0.10; 4),$ and $S c C o n N (0.20; 4)$ , which are referred to as NORMAL3; and
twelve cases of a mixture of normal distributions, hereon termed $M i x N (p; a; b) — M i x N (0.3; 1; 0.25), M i x N (0.4; 1; 0.25), M i x N (0.5; 1; 0.25), M i x N (0.3; 3; 0.25),$ $M i x N (0.4; 3; 0.25), M i x N (0.5; 3; 0.25), M i x N (0.3; 1; 4), M i x N (0.4; 1; 4), M i x N (0.5; 1; 4),$ $M i x N (0.3; 3; 4), M i x N (0.4; 3; 4),$ and $M i x N (0.5; 3; 4)$ , which are referred to as NORMAL4.

5. Simulation Study and Discussion

This section provides a comprehensive modeling study that is designed to evaluate the power of selected normality tests. This modeling study takes into account the effects of sample size, the level of significance (

α = 0.05

) chosen, and the alternative type of distribution (Beta, Cauchy, Laplace, Logistic, Student, Chi-Square, Gamma, Gumbel, Lognormal, Weibull, and modified standard normal). The study was performed by applying 40 normality tests (including our proposed normality test) for the generated 1,000,000 standardized samples of size 32, 64, 128, 256, 512, and 1024.

The best set of parameters

(a, b, c)

was selected experimentally: the value of

a

was examined from 0.001 to 0.99 by step 0.01, the value of

b

was examined from 0.01 to 10 by step 0.01, and the value of

c

was examined from 0.5 to 50 by step 0.25. The N-metric test gave the most powerful results with the parameters:

a = 0.95, b = 0.25, c = 1

. In those cases, a test has several modifications, we present results only for the best variant. The Table 1, Table 2 and Table 3 present average power obtained for the symmetric, asymmetric, and modified normal distribution sets, for samples sizes of 32, 64, 128, 256, 512, and 1024. By comparing Table 1, Table 2 and Table 3, it can be seen that the most powerful test for small samples was Hosking1 (H1), the most powerful test for large sample sizes was our presented test (N-metric). According to Table 1, Table 2 and Table 3, it is observed that for large sample sizes, most tests’ power is approaching 1 except for the D’Agostino (DA) test, the power of which is significantly lower.

Table 1. Average empirical power obtained for a group of symmetric distributions.

Table 2. Average empirical power obtained for a group of asymmetric distributions.

Table 3. Average empirical power obtained for a group of modified normal distributions.

An additional study was conducted to determine the exact minimal sample size at which the N-metric test (statistic (34) with kernel function (35)) is the most powerful for groups of symmetric, asymmetric, and modified normal distributions. Hosking1 and N-metric tests were applied for data sets of sizes: 80, 90, 100, 105, 110, and 115. The obtained results showed that the N-metric test was the most powerful for sample size

\geq 112

for the symmetric distributions, for sample size

\geq 118

for the asymmetric distributions, and for sample size

\geq 88

for a group of modified normal distributions (see in Table 4). The N-metric test is the most powerful for the Gamma distribution for sample size

\geq 32

. It has been observed that in the case of Cauchy and Lognormal distributions, the N-metric test is the most powerful when the sample size is

\geq 255

, which can be influenced by the long tail of these distributions.

Table 4. The minimal sample size at which the N-metric test is most powerful.

To complement the results given in Table 1, Table 2 and Table 3, Figure 4 (and Figure A1, Figure A2 and Figure A3 in Appendix A) presents the average power results of the most powerful goodness-of-fit tests. Figure 4 presents two distributions from each group of symmetric (Standard normal and Student), asymmetric (Gamma and Gumbel), and modified normal (standard normal distribution truncated at

a

and

b

and location-contaminated standard normal distribution) distributions. Figures of all other distributions are given in Appendix A. In Figure 4, it can be seen that for the standard normal distribution, our proposed test (N-metric) is the most powerful when the sample size is 64 or larger. Figure 4 shows that our proposed test (N-metric) is the most powerful in the case of Gamma data distribution for all sample sizes examined. In general, it can be summarized that the power of the Chen–Shapiro (ChenS), Gel–Miao–Gastwirth (GMG), Hosking1 (H1), and Modified Shapiro–Wilk (SWRG) tests increases gradually with increasing sample size. The power of our proposed test (N-metric) increases abruptly when the sample size is 128 and its power value remains close to 1 for larger sample sizes.

Figure 4. Average empirical power results, for the examined sample sizes, for the groups of symmetric, asymmetric, and modified normal distributions of five powerful goodness-of-fit tests.

6. Conclusions and Future Work

In this study, a comprehensive comparison of the power of popular normality tests was performed. Given the importance of this topic and the extensive development of normality tests, the proposed new normality test, the detailed test descriptions provided, and the power comparisons are relevant. Only univariate data were examined in this study of the power of normality tests (a study with multivariate data is planned for the future).

The study addresses the performance of 40 normality tests, for various sample sizes

n

for a number of symmetric, asymmetric, and modified normal distributions. A new goodness-of-fit test has been proposed. Its results are compared with other tests.

Based on the obtained modeling results, it was determined that the most powerful tests for the groups of symmetric, asymmetric, and modified normal distributions were Hosking1 (for smaller sample sizes) and our proposed N-metric (for larger sample sizes) test. The power of the Hosking1 test (for smaller sample sizes) is 1.5 to 7.99 percent higher than the second (by power) test for the groups of symmetric, asymmetric, and modified normal distributions. The power of the N-metric test (for larger sample sizes) is 6.2 to 16.26 percent higher than the second (by power) test for the groups of symmetric, asymmetric, and modified normal distributions.

The N-metric test is recommended to be used for symmetric data sets of size

n \geq 112

, for asymmetric data sets of size

n \geq 118

, and for bell-shaped distributed data sets of size

n \geq 88

.

Author Contributions

Data curation, J.A. and T.R.; formal analysis, J.A. and T.R.; investigation, J.A. and T.R.; methodology, J.A. and T.R.; software, J.A. and T.R.; supervision, T.R.; writing—original draft, J.A. and M.B.; writing—review and editing, J.A. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Generated data sets were used in the study (see in Section 4).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Figure A1. Average empirical power results, for all sample sizes, for the groups of symmetric distributions of five powerful goodness-of-fit tests.

Figure A2. Average empirical power results for the examined sample sizes for the groups of asymmetric distributions of five powerful goodness-of-fit tests.

Figure A3. Average empirical power results for the examined sample sizes for the groups of the modified normal distributions of five powerful goodness-of-fit tests.

References

Barnard, G.A.; Barnard, G.A. Introduction to Pearson (1900) on the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such That it Can be Reasonably Supposed to Have Arisen from Random Sampling; Springer Series in Statistics Breakthroughs in Statistics; Springer: Cham, Switzerland, 1992; pp. 1–10. [Google Scholar]
Kolmogorov, A. Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari Giorn. 1933, 4, 83–91. [Google Scholar]
Adefisoye, J.; Golam Kibria, B.; George, F. Performances of several univariate tests of normality: An empirical study. J. Biom. Biostat. 2016, 7, 1–8. [Google Scholar]
Anderson, T.W.; Darling, D.A. Asymptotic theory of certain “goodness-of-fit” criteria based on stochastic processes. Ann. Math. Stat. 1952, 23, 193–212. [Google Scholar] [CrossRef]
Hosking, J.R.M.; Wallis, J.R. Some statistics useful in regional frequency analysis. Water Resour. Res. 1993, 29, 271–281. [Google Scholar] [CrossRef]
Cabana, A.; Cabana, E.M. Goodness-of-Fit and Comparison Tests of the Kolmogorov-Smirnov Type for Bivariate Populations. Ann. Stat. 1994, 22, 1447–1459. [Google Scholar] [CrossRef]
Chen, L.; Shapiro, S.S. An Alternative Test for Normality Based on Normalized Spacings. J. Stat. Comput. Simul. 1995, 53, 269–288. [Google Scholar] [CrossRef]
Rahman, M.M.; Govindarajulu, Z. A modification of the test of Shapiro and Wilk for normality. J. Appl. Stat. 1997, 24, 219–236. [Google Scholar] [CrossRef]
Ray, W.D.; Shenton, L.R.; Bowman, K.O. Maximum Likelihood Estimation in Small Samples. J. R. Stat. Soc. Ser. A 1978, 141, 268. [Google Scholar] [CrossRef]
Zhang, P. Omnibus test of normality using the Q statistic. J. Appl. Stat. 1999, 26, 519–528. [Google Scholar] [CrossRef]
Barrio, E.; Cuesta-Albertos, J.A.; Matrán, C.; Rodríguez-Rodríguez, J.M. Tests of goodness of fit based on the L2-Wasserstein distance. Ann. Stat. 1999, 27, 1230–1239. [Google Scholar]
Glen, A.G.; Leemis, L.M.; Barr, D.R. Order statistics in goodness-of-fit testing. IEEE Trans. Reliab. 2001, 50, 209–213. [Google Scholar] [CrossRef]
Bonett, D.G.; Seier, E. A test of normality with high uniform power. Comput. Stat. Data Anal. 2002, 40, 435–445. [Google Scholar] [CrossRef]
Psaradakis, Z.; Vávra, M. Normality tests for dependent data: Large-sample and bootstrap approaches. Commun. Stat.-Simul. Comput. 2018, 49, 283–304. [Google Scholar] [CrossRef]
Zhang, J.; Wu, Y. Likelihood-ratio tests for normality. Comput. Stat. Data Anal. 2005, 49, 709–721. [Google Scholar] [CrossRef]
Gel, Y.R.; Miao, W.; Gastwirth, J.L. Robust directed tests of normality against heavy-tailed alternatives. Comput. Stat. Data Anal. 2007, 51, 2734–2746. [Google Scholar] [CrossRef]
Coin, D. A goodness-of-fit test for normality based on polynomial regression. Comput. Stat. Data Anal. 2008, 52, 2185–2198. [Google Scholar] [CrossRef]
Desgagné, A.; Lafaye de Micheaux, P. A powerful and interpretable alternative to the Jarque–Bera test of normality based on 2nd-power skewness and kurtosis, using the Rao’s score test on the APD family. J. Appl. Stat. 2017, 45, 2307–2327. [Google Scholar] [CrossRef]
Steele, C.M. The Power of Categorical Goodness-Of-Fit Statistics. Ph.D. Thesis, Australian School of Environmental Studies, Warrandyte, Victoria, Australia, 2003. [Google Scholar]
Romão, X.; Delgado, R.; Costa, A. An empirical power comparison of univariate goodness-of-fit tests for normality. J. Stat. Comput. Simul. 2010, 80, 545–591. [Google Scholar] [CrossRef]
Choulakian, V.; Lockhart, R.; Stephens, M. Cramérvon Mises statistics for discrete distributions. Can. J. Stat. 1994, 22, 125–137. [Google Scholar] [CrossRef]
Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611. [Google Scholar] [CrossRef]
Lilliefors, H.W. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Am. Stat. Assoc. 1967, 62, 399–402. [Google Scholar] [CrossRef]
Ahmad, F.; Khan, R.A. A power comparison of various normality tests. Pak. J. Stat. Oper. Res. 2015, 11, 331. [Google Scholar] [CrossRef]
D’Agostino, R.B.; Pearson, E.S. Testing for departures from normality. I. Fuller empirical results for the distribution of b₂ and √b₁. Biometrika 1973, 60, 613–622. [Google Scholar]
Filliben, J.J. The Probability Plot Correlation Coefficient Test for Normality. Technometrics 1975, 17, 111–117. [Google Scholar] [CrossRef]
Martinez, J.; Iglewicz, B. A test for departure from normality based on a biweight estimator of scale. Biometrika 1981, 68, 331–333. [Google Scholar] [CrossRef]
Epps, T.W.; Pulley, L.B. A test for normality based on the empirical characteristic function. Biometrika 1983, 70, 723–726. [Google Scholar] [CrossRef]
Jarque, C.; Bera, A. Efficient tests for normality, homoscedasticity andserial independence of regression residuals. Econ. Lett. 1980, 6, 255–259. [Google Scholar] [CrossRef]
Bakshaev, A. Goodness of fit and homogeneity tests on the basis of N-distances. J. Stat. Plan. Inference 2009, 139, 3750–3758. [Google Scholar] [CrossRef]
Hill, T.; Lewicki, P. Statistics Methods and Applications; StatSoft: Tulsa, OK, USA, 2007. [Google Scholar]
Kasiulevičius, V.; Denapienė, G. Statistikos taikymas mokslinių tyrimų analizėje. Gerontologija 2008, 9, 176–180. [Google Scholar]
Damianou, C.; Kemp, A.W. New goodness of statistics for discrete and continuous data. Am. J. Math. Manag. Sci. 1990, 10, 275–307. [Google Scholar]

Figure 1. Plot of out kernel function

K (x)

with experimentally chosen optimal parameters

a = 0.95, b = 0.25, and c = 1

.

Figure 2. Illustration of the power.

Figure 3. Significance levels of the statistic step function.

Figure 4. Average empirical power results, for the examined sample sizes, for the groups of symmetric, asymmetric, and modified normal distributions of five powerful goodness-of-fit tests.

Table 1. Average empirical power obtained for a group of symmetric distributions.

		Sample Size
		32	64	128	256	512	1024
Tests	AD	0.714	0.799	0.863	0.909	0.939	0.955
	BCMR	0.718	0.809	0.875	0.920	0.947	0.947
	BHS	0.431	0.551	0.663	0.752	0.818	0.868
	BHSBS	0.680	0.778	0.783	0.903	0.938	0.959
	BM2	0.726	0.835	0.905	0.945	0.965	0.974
	BS	0.717	0.810	0.877	0.920	0.947	0.961
	CC2	0.712	0.805	0.873	0.920	0.949	0.936
	CHI2	0.663	0.778	0.842	0.884	0.941	0.945
	CVM	0.591	0.733	0.805	0.855	0.919	0.949
	ChenS	0.729	0.806	0.871	0.915	0.943	0.960
	Coin	0.735	0.830	0.891	0.930	0.952	0.963
	DA	0.266	0.295	0.314	0.319	0.315	0.311
	DAP	0.723	0.820	0.883	0.924	0.948	0.962
	DH	0.709	0.805	0.877	0.925	0.950	0.963
	DLDMZEPD	0.730	0.826	0.889	0.929	0.952	0.963
	EP	0.706	0.828	0.974	0.910	0.946	0.959
	Filli	0.712	0.805	0.875	0.922	0.949	0.962
	GG	0.658	0.760	0.850	0.915	0.949	0.962
	GLB	0.712	0.798	0.863	0.909	0.943	0.918
	GMG	0.787	0.862	0.914	0.946	0.965	0.975
	H1	0.799	0.862	0.852	0.999	0.999	0.999
	JB	0.643	0.762	0.856	0.918	0.949	0.963
	KS	0.585	0.723	0.789	0.836	0.905	0.939
	Lillie	0.669	0.758	0.828	0.883	0.921	0.947
	MI	0.632	0.676	0.705	0.724	0.736	0.745
	N-metric	0.245	0.585	0.971	0.999	0.999	0.999
	SF	0.715	0.807	0.876	0.923	0.949	0.962
	SW	0.718	0.808	0.874	0.919	0.946	0.962
	SWRG	0.694	0.775	0.834	0.882	0.916	0.946
	ZQstar	0.513	0.576	0.630	0.669	0.697	0.718
	ZW2	0.715	0.806	0.869	0.912	0.939	0.957

Table 2. Average empirical power obtained for a group of asymmetric distributions.

		Sample Size
		32	64	128	256	512	1024
Tests	AD	0.729	0.835	0.908	0.949	0.969	0.984
	BCMR	0.749	0.856	0.924	0.971	0.995	0.991
	BHS	0.529	0.664	0.769	0.855	0.915	0.950
	BHSBS	0.538	0.652	0.747	0.914	0.902	0.944
	BM2	0.737	0.859	0.931	0.965	0.981	0.993
	BS	0.506	0.588	0.665	0.738	0.805	0.859
	CC2	0.579	0.682	0.777	0.853	0.938	0.956
	CHI2	0.645	0.799	0.881	0.934	0.965	0.980
	CVM	0.594	0.755	0.836	0.887	0.935	0.957
	ChenS	0.756	0.862	0.928	0.961	0.978	0.991
	Coin	0.480	0.556	0.630	0.700	0.769	0.916
	DA	0.237	0.223	0.209	0.198	0.191	0.192
	DAP	0.705	0.826	0.910	0.955	0.977	0.990
	DH	0.724	0.845	0.921	0.957	0.977	0.991
	DLDMXAPD	0.726	0.843	0.918	0.955	0.975	0.989
	EP	0.753	0.846	0.913	0.967	0.975	0.993
	Filli	0.732	0.842	0.915	0.953	0.974	0.991
	GG	0.672	0.805	0.898	0.949	0.973	0.988
	GLB	0.725	0.831	0.905	0.987	0.970	0.984
	GMG	0.683	0.751	0.809	0.859	0.901	0.932
	H1	0.816	0.896	0.896	0.999	0.999	0.999
	JB	0.662	0.808	0.904	0.953	0.975	0.989
	KS	0.582	0.736	0.810	0.863	0.921	0.945
	Lillie	0.671	0.786	0.872	0.929	0.959	0.976
	MI	0.644	0.731	0.798	0.843	0.872	0.913
	N-metric	0.464	0.761	0.990	0.999	0.999	0.999
	SF	0.736	0.846	0.918	0.955	0.975	0.989
	SW	0.753	0.859	0.925	0.959	0.977	0.991
	SWRG	0.758	0.861	0.927	0.960	0.977	0.999
	ZQstar	0.570	0.639	0.693	0.732	0.761	0.748
	ZW2	0.764	0.870	0.932	0.962	0.980	0.997

Table 3. Average empirical power obtained for a group of modified normal distributions.

		Sample Size
		32	64	128	256	512	1024
Tests	AD	0.662	0.756	0.825	0.872	0.905	0.931
	BCMR	0.652	0.756	0.831	0.880	0.913	0.935
	BHS	0.463	0.585	0.676	0.744	0.796	0.834
	BHSBS	0.568	0.701	0.787	0.847	0.890	0.918
	BM2	0.641	0.770	0.854	0.904	0.934	0.953
	BS	0.587	0.688	0.770	0.833	0.881	0.916
	CC2	0.576	0.675	0.763	0.833	0.887	0.923
	CHI2	0.566	0.728	0.808	0.866	0.914	0.939
	CVM	0.557	0.708	0.779	0.833	0.897	0.930
	ChenS	0.656	0.759	0.833	0.882	0.915	0.937
	Coin	0.579	0.691	0.781	0.846	0.889	0.918
	DA	0.314	0.342	0.367	0.388	0.405	0.418
	DAP	0.617	0.733	0.818	0.872	0.906	0.930
	DH	0.617	0.727	0.815	0.872	0.907	0.930
	DLDMXAPD	0.651	0.754	0.831	0.879	0.912	0.935
	EP	0.640	0.748	0.819	0.865	0.906	0.931
	Filli	0.637	0.743	0.823	0.877	0.911	0.933
	GG	0.529	0.657	0.775	0.860	0.906	0.932
	GLB	0.659	0.755	0.823	0.870	0.903	0.930
	GMG	0.688	0.771	0.836	0.883	0.917	0.942
	H1	0.743	0.816	0.799	0.999	0.999	0.999
	JB	0.515	0.662	0.783	0.861	0.904	0.930
	KS	0.564	0.710	0.772	0.825	0.893	0.924
	Lillie	0.626	0.724	0.796	0.850	0.889	0.917
	MI	0.494	0.536	0.563	0.578	0.585	0.590
	N-metric	0.243	0.582	0.972	0.999	0.999	0.999
	SF	0.642	0.747	0.826	0.879	0.912	0.934
	SW	0.654	0.758	0.832	0.882	0.915	0.937
	SWRG	0.643	0.746	0.818	0.864	0.901	0.931
	ZQstar	0.394	0.423	0.450	0.472	0.487	0.498
	ZW2	0.640	0.749	0.826	0.876	0.907	0.931

Table 4. The minimal sample size at which the N-metric test is most powerful.

Nr.	Distribution	Groups of Distributions	$Minimal Sample Size (n)$
1.	Standard normal	Symmetric	46
2.	Beta	Symmetric	88
3.	Cauchy	Symmetric	257
4.	Laplace	Symmetric	117
5.	Logistic	Symmetric	71
6.	Student	Symmetric	96
7.	Beta	Asymmetric	108
8.	Chi-square	Asymmetric	123
9.	Gamma	Asymmetric	<32
10.	Gumbel	Asymmetric	125
11.	Lognormal	Asymmetric	255
12.	Weibull	Asymmetric	65
13.	Normal1	Modified normal	70
14.	Normal2	Modified normal	93
15.	Normal3	Modified normal	72
16.	Normal4	Modified normal	117

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

An Exhaustive Power Comparison of Normality Tests

Abstract

1. Introduction

2. Statistical Methods

2.1. Chi-Square Test (CHI2)

2.2. Kolmogorov–Smirnov (KS)

2.3. Anderson–Darling (AD)

2.4. Cramer–Von Mises (CVM)

2.5. Shapiro–Wilk (SW)

2.6. Lilliefors (LF)

2.7. D’Agostino (DA)

2.8. Shapiro–Francia (SF)

2.9. D’Agostino–Pearson (DAP)

2.10. Filliben (Filli)

2.11. Martinez–Iglewicz (MI)

2.12. Epps–Pulley (EP)

2.13. Jarque–Bera (JB)

2.14. Hosking ( H 1 − H 3 )

2.15. Cabaña–Cabaña (CC1-CC2)

2.16. The Chen–Shapiro Test (ChenS)

2.17. Modified Shapiro-Wilk (SWRG)

2.18. Doornik–Hansen (DH)

2.19. Zhang Q (ZQ), Q * (ZQstar), Q − Q * (ZQQstar)

2.20. Barrio–Cuesta-Albertos–Matran–Rodriguez-Rodriguez (BCMR)

2.21. Glen–Leemis–Barr (GLB)

2.22. Bonett–Seier T w (BS)

2.23. Bontemps–Meddahi (BM1– B M 3 − 4 , BM2– B M 3 − 6 )

2.24. Zhang–Wu (ZW1– Z C , ZW2– Z A )

2.25. Gel–Miao–Gastwirth (GMG)

2.26. Robust Jarque–Bera (RJB)

2.27. Coin β 3 2

2.28. Brys–Hubert–Struyf T M C − L R (BHS)

2.29. Brys–Hubert–Struyf–Bonett–Seier T M C − L R & T w (BHSBS)

2.30. Desgagné–Lafaye de Micheaux–Leblanc R n (DLDMLRn), X A P D a (DLDMXAPD), Z E P D a (DLDMZEPD)

2.31. N-Metric

3. The Power of Test

4. Statistical Distributions

4.1. Symmetric Distributions

4.2. Asymmetric Distributions

4.3. Modified Normal Distributions

5. Simulation Study and Discussion

6. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A

References

Article Metrics

Citations

Article Access Statistics

2.14. Hosking ( $H_{1} - H_{3})$

2.19. Zhang $Q$ (ZQ), $Q^{}$ (ZQstar), $Q - Q^{}$ (ZQQstar)

2.22. Bonett–Seier $T_{w}$ (BS)

2.23. Bontemps–Meddahi (BM1– $B M_{3 - 4}$ , BM2– $B M_{3 - 6}$ )

2.24. Zhang–Wu (ZW1– $Z_{C}$ , ZW2– $Z_{A}$ )

2.27. Coin $β_{3}^{2}$

2.28. Brys–Hubert–Struyf $T_{M C - L R}$ (BHS)

2.29. Brys–Hubert–Struyf–Bonett–Seier $T_{M C - L R} & T_{w}$ (BHSBS)

2.30. Desgagné–Lafaye de Micheaux–Leblanc $R_{n}$ (DLDMLRn), $X_{A P D}^{a}$ (DLDMXAPD), $Z_{E P D}^{a}$ (DLDMZEPD)