An Exhaustive Power Comparison of Normality Tests

Abstract: A goodness-of-fit test is a frequently used modern statistics tool. However, it is still unclear what the most reliable approach is to check assumptions about data set normality. A particular data set (especially with a small number of observations) only partly describes the process, which leaves many options for the interpretation of its true distribution. As a consequence, many goodness-of-fit statistical tests have been developed, the power of which depends on particular circumstances (i.e., sample size, outliers, etc.). With the aim of developing a more universal goodness-of-fit test, we propose an approach based on an N-metric with our chosen kernel function. To compare the power of 40 normality tests, the goodness-of-fit hypothesis was tested for 15 data distributions with 6 different sample sizes. Based on exhaustive comparative research results, we recommend the use of our test for samples of size n ≥ 118.


Introduction
A priori information about data distribution is not always known. In those cases, hypothesis testing can help to find a reasonable assumption about the distribution of data. Based on assumed data distribution, one can choose appropriate methods for further research. The information about data distribution can be useful in a number of ways, for example:
• it can provide insights about the observed process;
• parameters of a model can be inferred from the characteristics of the data distribution; and
• it can help in choosing more specific and computationally efficient methods.
Statistical methods often require data to be normally distributed. If the assumption of normality is not satisfied, the results of these methods will be inappropriate. Therefore, the presumption of normality must be checked before starting the statistical analysis. Many tests have been developed to check this assumption. However, the tests are defined in various ways and thus react differently to the departures from normality present in a data set. Therefore, the choice of goodness-of-fit test remains an important problem.
For these reasons, this study examines the issue of testing goodness-of-fit hypotheses. The goodness-of-fit null and alternative hypotheses are defined as:

H_0: the distribution is normal; H_A: the distribution is not normal. (1)

A total of 40 tests were applied to analyze the problem of testing the goodness-of-fit hypothesis. The tests used in this study were developed between 1900 and 2016. In the early 20th century, Karl Pearson published an article defining the chi-square test [1]. This test is considered the basis of modern statistics. Pearson was the first to examine

Statistical Methods
In this section, the most popular tests for normality are reviewed.

Chi-Square Test (CHI2)
In 1900, Karl Pearson introduced the chi-square test [1]. The statistic of the test is defined as:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},$$
where O_i is the observed frequency and E_i is the expected frequency in the ith of k bins.
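As a practical illustration (ours, not the paper's implementation), the chi-square statistic can be computed by binning the data into bins that are equiprobable under a normal distribution fitted to the sample; the choice of eight bins is an assumption made for this sketch:

```python
import numpy as np
from scipy import stats

def chi2_normality(x, n_bins=8):
    """Pearson chi-square normality test: compare observed bin counts O_i
    with the counts E_i expected under a normal fitted to the sample."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)
    # Equiprobable bins under the fitted normal, so every E_i = n / n_bins.
    edges = stats.norm.ppf(np.linspace(0.0, 1.0, n_bins + 1), loc=mu, scale=sigma)
    observed, _ = np.histogram(x, bins=edges)
    expected = np.full(n_bins, len(x) / n_bins)
    # ddof=2 accounts for the two parameters (mu, sigma) estimated from the data.
    return stats.chisquare(observed, expected, ddof=2)

rng = np.random.default_rng(0)
stat, p = chi2_normality(rng.normal(size=500))
```

Small expected counts make the chi-square approximation unreliable, which is one reason this test is usually avoided for very small samples.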

Kolmogorov-Smirnov (KS)
In 1933, Kolmogorov and Smirnov proposed the KS test [2]. The statistic of the test is defined as:
$$D = \max_{1 \le i \le n}\left( \frac{i}{n} - z_i,\; z_i - \frac{i-1}{n} \right),$$
where z_i is the cumulative probability of the standard normal distribution evaluated at the ith standardized order statistic, so that D is the largest difference between the observed (empirical) and expected (normal) distribution functions.
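For quick practical use, SciPy exposes this statistic directly; the sketch below (our illustration) tests a sample against a fully specified standard normal. When μ and σ are instead estimated from the same sample, the Lilliefors modification discussed later in this section is the appropriate variant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=200)

# D = sup_x |F_n(x) - F(x)| against the fully specified N(0, 1)
result = stats.kstest(x, "norm")
D, p = result.statistic, result.pvalue
```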

Anderson-Darling (AD)
In 1952, Anderson and Darling developed a variant of the Kolmogorov-Smirnov test [4]. This test is generally more powerful than the Kolmogorov-Smirnov test. The statistic of the test is defined as:
$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(x_{(i)}) + \ln\left(1 - F(x_{(n+1-i)})\right)\right],$$
where F(x_(i)) is the value of the hypothesized distribution function at the ith order statistic x_(i) and n is the sample size.
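SciPy ships this statistic together with critical values already adjusted for estimated parameters; a minimal usage sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=300)

res = stats.anderson(x, dist="norm")
# res.statistic is A^2; reject H0 at a given level if it exceeds the
# matching entry of res.critical_values (levels in res.significance_level).
idx_5pct = res.significance_level.tolist().index(5.0)
reject_at_5pct = res.statistic > res.critical_values[idx_5pct]
```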

Cramer-Von Mises (CVM)
In 1962, Cramer proposed the Cramer-von Mises test. This test is an alternative to the Kolmogorov-Smirnov test [21]. The statistic of the test is defined as:
$$W^2 = \frac{1}{12n} + \sum_{i=1}^{n}\left(Z_i - \frac{2i-1}{2n}\right)^2,$$
where Z_i = Φ((X_(i) − X̄)/S) is the value of the hypothesized (standard normal) cumulative distribution function at the ith standardized order statistic, and X̄ and S are the sample mean and sample standard deviation.
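A sketch using SciPy's implementation, standardizing with the sample estimates first. Note that the reported p-value assumes a fully specified null distribution, so with estimated parameters it is only approximate (an assumption of this illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=250)

# Standardize with sample mean and standard deviation, then compare to N(0, 1).
z = (x - x.mean()) / x.std(ddof=1)
res = stats.cramervonmises(z, "norm")
W2, p = res.statistic, res.pvalue
```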

Shapiro-Wilk (SW)
In 1965, Shapiro and Wilk formed the original test [22]. The statistic of the test is defined as:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$
where x_(i) is the ith order statistic, x̄ is the sample mean, and the constants a_i are obtained from
$$(a_1, \ldots, a_n) = \frac{m^{T}V^{-1}}{\left(m^{T}V^{-1}V^{-1}m\right)^{1/2}},$$
where m = (m_1, . . . , m_n)^T are the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution and V is the covariance matrix of those order statistics.
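In practice the weight computation is handled by library code; a minimal SciPy usage sketch (values of W close to 1 support normality):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(size=100)

# W close to 1 supports H0; small p-values reject normality.
W, p = stats.shapiro(x)
```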

Lilliefors (LF)
In 1967, Lilliefors modified the Kolmogorov-Smirnov test [23]. The statistic of the test is defined as:
$$D = \sup_x |F^{*}(x) - S(x)|,$$
where F*(x) is the standard normal distribution function and S(x) is the empirical distribution function of the standardized values z_i = (x_i − x̄)/s.
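Since the statistic only requires the empirical CDF of the standardized values, it is short to compute directly. The sketch below is our illustration; note that decisions must use Lilliefors' critical values, not the ordinary KS ones, because the parameters are estimated from the sample:

```python
import numpy as np
from scipy import stats

def lilliefors_stat(x):
    """Lilliefors statistic: KS distance between the empirical CDF of the
    standardized sample and the standard normal CDF (parameters estimated
    from the data, so ordinary KS critical values do not apply)."""
    x = np.sort(np.asarray(x, dtype=float))
    z = (x - x.mean()) / x.std(ddof=1)
    cdf = stats.norm.cdf(z)
    n = len(z)
    ecdf_hi = np.arange(1, n + 1) / n   # S(x) just after each point
    ecdf_lo = np.arange(0, n) / n       # S(x) just before each point
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

rng = np.random.default_rng(5)
D = lilliefors_stat(rng.normal(size=400))
```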

D'Agostino (DA)
In 1971, D'Agostino introduced a test for the goodness-of-fit hypothesis, which is an extension of the Shapiro-Wilk test [8]. The test proposed by D'Agostino does not need a weight vector. The statistic of the test is defined as:
$$D = \frac{\sum_{i=1}^{n}\left(i - \frac{n+1}{2}\right)x_{(i)}}{n^2\sqrt{m_2}},$$
where m_2 is the second central moment, defined as:
$$m_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
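Following this definition, D is a few lines of NumPy. The reference value 1/(2√π) ≈ 0.2821 used in the comment is the known asymptotic mean of D under normality (a sketch, not the paper's code):

```python
import numpy as np

def dagostino_D(x):
    """D'Agostino's D: needs only the order statistics and the
    second central moment m2, no weight vector."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m2 = np.mean((x - x.mean()) ** 2)            # second central moment
    i = np.arange(1, n + 1)
    T = np.sum((i - (n + 1) / 2.0) * x)
    return T / (n ** 2 * np.sqrt(m2))

rng = np.random.default_rng(9)
D = dagostino_D(rng.normal(size=1000))
# Under normality, D concentrates near 1 / (2 * sqrt(pi)) ≈ 0.28209.
```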

Shapiro-Francia (SF)
In 1972, Shapiro and Francia simplified the Shapiro-Wilk test and developed the Shapiro-Francia test, which is computationally more efficient [24]. The statistic of the test is defined as:
$$W' = \frac{\left(\sum_{i=1}^{n} m_i x_{(i)}\right)^2}{\sum_{i=1}^{n} m_i^2 \sum_{i=1}^{n}(x_i - \bar{x})^2},$$
where m_i are the expected values of the standard normal order statistics.
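A sketch of W′. The exact expected order statistics m_i are approximated here with Blom's plotting positions, a common substitute and our own assumption rather than the paper's choice:

```python
import numpy as np
from scipy import stats

def shapiro_francia(x):
    """Shapiro-Francia W': squared correlation between the ordered sample
    and (approximate) expected normal order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # Blom's approximation to the expected standard normal order statistics.
    m = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
    num = np.sum(m * x) ** 2
    den = np.sum(m ** 2) * np.sum((x - x.mean()) ** 2)
    return num / den

rng = np.random.default_rng(13)
W = shapiro_francia(rng.normal(size=200))   # close to 1 under H0
```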

D'Agostino-Pearson (DAP)
In 1973-1974, D'Agostino and Pearson proposed the D'Agostino-Pearson test [25], which combines the sample skewness and kurtosis. The statistic of the test is defined as: where n is the sample size and m_2 is the sample variance.

Filliben (Filli)
In 1975, Filliben defined the probability plot correlation coefficient r as a test for the goodness-of-fit hypothesis [26]. The test statistic is defined as: where σ² is the variance, M_(i) = Φ⁻¹(m_(i)), and m_(i) are the estimated medians of the order statistics; each m_(i) is obtained by:
$$m_{(i)} = \begin{cases} 1 - 0.5^{1/n}, & i = 1,\\ (i - 0.3175)/(n + 0.365), & i = 2, \ldots, n-1,\\ 0.5^{1/n}, & i = n. \end{cases}$$

Martinez-Iglewicz (MI)
In 1981, Martinez and Iglewicz proposed a normality test based on the ratio of two estimators of variance, where one of the estimators is the robust biweight scale estimator S_b² [27]:
$$S_b^2 = \frac{n\sum_{|z_i|<1}(x_i - M)^2(1-z_i^2)^4}{\left[\sum_{|z_i|<1}(1-z_i^2)(1-5z_i^2)\right]^2},$$
where M is the sample median and z_i = (x_i − M)/(9A), with A being the median of |x_i − M|. The test statistic is then given by:
$$I_n = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2/(n-1)}{S_b^2}.$$

Epps-Pulley (EP)
In 1983, Epps and Pulley proposed a test statistic based on the following weighted integral [28]: where ϕ n (t) is the empirical characteristic function and G(t) is an adequate function chosen according to several considerations. By setting dG(t) = g(t)dt and selecting: the following statistic is obtained: where m 2 is the second central moment.

Jarque-Bera (JB)
In 1987, Jarque and Bera proposed a test [29] with statistic defined as:
$$JB = \frac{n}{6}\left(s^2 + \frac{(k-3)^2}{4}\right),$$
where s and k are the sample skewness and kurtosis, respectively.
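SciPy implements this statistic directly; a short usage sketch (under H0 the statistic is asymptotically chi-square with two degrees of freedom):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)
x = rng.normal(size=500)

# JB = n/6 * (s^2 + (k - 3)^2 / 4); asymptotically chi-square(2) under H0.
jb, p = stats.jarque_bera(x)
```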

Hosking (H 1 − H 3 )
In 1990, Hosking and Wallis proposed the first Hosking test [5]. The test statistic is defined as: where μ_V and σ_V are the mean and standard deviation of V over the simulated data sets, and V_i is calculated as: where t^(i) is the sample L-moment coefficient of variation.

Cabaña-Cabaña (CC1-CC2)
In 1994, Cabaña and Cabaña proposed the CC1 and CC2 tests [6]. The CC1 (T_S,l) and CC2 (T_K,l) statistics, respectively, are defined as: where w_S,l(x) and w_K,l(x) approximate transformed estimated empirical processes sensitive to changes in skewness and kurtosis, respectively, and are defined as: where l is a dimensionality parameter, Φ(x) is the cumulative distribution function of the standard normal distribution, H_j(·) is the jth order normalized Hermite polynomial, and H̄_j is the sample mean of the jth order normalized Hermite polynomial, defined as:

The Chen-Shapiro Test (ChenS)
In 1995, Chen and Shapiro introduced an alternative test statistic based on normalized spacings and defined as [9]: where M i is the ith quantile of a standard normal distribution.

Modified Shapiro-Wilk (SWRG)
In 1997, Rahman and Govindarajulu proposed a modification to the Shapiro-Wilk test [8]. This test statistic is simpler to compute and relies on a new definition of the weights using the approximations to m and V. Each element a i of the weight vector is given as: where it is assumed that m 0 φ(m 0 ) = m n+1 φ(m n+1 ) = 0. Therefore, the modified test statistic assigns larger weights to the extreme order statistics than the original test.

Doornik-Hansen (DH)
In 1977, Bowman and Shenton introduced the statistic underlying the Doornik-Hansen goodness-of-fit test [9]. The statistic is obtained using transformations of the sample skewness and kurtosis: where n is the sample size.
Under H_0, the DH test statistic has, approximately, a chi-square distribution with two degrees of freedom. It is defined as:

Zhang Q Tests (Q, Q*, Q − Q*)
In 1999, Zhang introduced the Q test statistic based on the ratio of two unbiased estimators of the standard deviation, q_1 and q_2, given by Q = ln(q_1/q_2) [10]. The estimators q_1 and q_2 are linear combinations of the order statistics, with ith-order coefficients a_i and b_i defined in terms of u_i, the ith expected value of the order statistics of a standard normal distribution. Zhang also proposed the alternative statistic Q* obtained by reversing the order statistics, and the joint test Q − Q*, based on the fact that Q and Q* are approximately independent.

Glen-Leemis-Barr (GLB)
In 2001, Glen, Leemis, and Barr extended the Kolmogorov-Smirnov and Anderson-Darling test to form the GLB test [12]. This test statistic is defined as: where p (i) is the elements of the vector p containing the quantiles of the order statistics sorted in ascending order.

Bonett-Seier T w (BS)
In 2002, Bonett and Seier introduced the BS test [13]. The statistic for this test is defined as:

Bontemps-Meddahi (BM)
In 2005, Bontemps and Meddahi proposed a family of normality tests based on moment conditions known as Stein equations and their relation to Hermite polynomials [24]. The statistic of the test is defined as: where z_i = (x_i − x̄)/s and H_k(·) is the kth order normalized Hermite polynomial, with general expression given by:

Zhang-Wu (ZW1, ZW2)
In 2005, Zhang and Wu presented the ZW1 and ZW2 goodness-of-fit tests [15]. The Z_C and Z_A statistics are similar to the Cramér-von Mises and Anderson-Darling test statistics based on the empirical distribution function. The statistic of the test is defined as: where Φ(z_(i)) = (i − 0.5)/n.

Gel-Miao-Gastwirth (GMG)
In 2007, Gel, Miao, and Gastwirth proposed the GMG test [16]. The statistic of the test is based on the ratio of the sample standard deviation to the robust measure of dispersion J_n, defined as:
$$J_n = \frac{\sqrt{\pi/2}}{n}\sum_{i=1}^{n}|x_i - M|,$$
where M is the median of the sample.

Robust Jarque-Bera (RJB)
In 2007, Gel and Gastwirth modified the Jarque-Bera test to obtain a more powerful, robust version [16]. The RJB test statistic is defined as: where m_3 and m_4 are the third and fourth central moments, respectively, and J_n is the robust measure of dispersion defined above, used in place of the standard deviation.

Coin (β₃²)
In 2008, Coin proposed a test based on polynomial regression to discriminate normal data from other symmetric distributions [17]. The model for this test is: where β_1 and β_3 are fitting parameters and α_i are the expected values of the standard normal order statistics; the test is based on the estimate of β_3.

Brys-Hubert-Struyf T MC−LR (BHS)
In 2008, Brys, Hubert, and Struyf introduced the BHS tests [3]. This test is based on skewness and long tails. The statistic T_MC−LR for this test is defined as: where w is set as [MC, LMC, RMC]^T; MC is the medcouple, LMC the left medcouple, and RMC the right medcouple; and ω and V are obtained from the influence functions of the estimators in w. In the case of a normal distribution: In 2008, Brys, Hubert, Struyf, Bonett, and Seier also introduced the combined BHSBS test [3]. This test statistic is defined as: where ω is the asymptotic mean and V is the covariance matrix.
Desgagné-Lafaye de Micheaux-Leblanc (APD, EPD)
In 2009, Desgagné, Lafaye de Micheaux, and Leblanc introduced the R_n and X_APD tests [18]. The statistic R_n(μ, σ) for this test is defined as: when μ and σ are unknown, the following maximum-likelihood estimators can be used: The X_APD test is based on the skewness and kurtosis, which are defined in terms of Z_i = S_n⁻¹(X_i − X̄_n), where X̄_n and S_n are defined above. The X_APD test is suitable for use when the sample size is greater than 10. The statistic X_APD for this test is defined as: where γ = 0.577215665 is the Euler-Mascheroni constant and s and k are the skewness and kurtosis, respectively. In 2016, Desgagné, Lafaye de Micheaux, and Leblanc presented the Z_EPD test, based on the skewness [18]. The statistic Z_EPD for this test is defined as:

N-Metric
We improved the Bakshaev [30] goodness-of-fit hypothesis test based on N-metrics. This test is defined in the following way.
Under the null hypothesis, the statistic T_n has the same asymptotic distribution as the quadratic form: where ξ_k are independent random variables from the standard normal distribution. In this case, Bakshaev applied the kernel function K(x, y) = |x − y|, and we propose to apply another kernel function (Figure 1):
An additional bias is introduced when the kernel function is calculated at the sample values (i.e., for x = X_(t)). Therefore, to eliminate this bias, the shape of the kernel function is chosen so that its influence in the neighborhood of the sample values is as small as possible.
Let X be a standard normal random variable, let Φ and ϕ be its distribution and density functions, respectively, and let g : R → R be an odd, strictly monotonically increasing function. Then the distribution function F_Y of the random variable Y = g(X) is Φ(g⁻¹(y)), where g⁻¹ is the inverse of the function g. The distribution density f_Y of the random variable Y is ϕ(g⁻¹(y))·(g⁻¹)′(y). Let us consider the parametric class of functions g⁻¹, which depends on three parameters: where a is the variance, b is the trough, and c is the peak shape parameter.

The Power of Test
The power of a test is defined as the probability of rejecting a false H_0 hypothesis. Decreasing the probability α of a type I error increases the probability β of a type II error and thus decreases the power of the test. In practice, tests are designed to minimize the type II error for a fixed type I error; the most commonly chosen value for α is 0.05. The power of the test is the probability 1 − β of rejecting the hypothesis H_0 when it is false (see Figure 2), where β is the probability of failing to reject a false H_0. The power makes it possible to compare two tests at the same significance level and sample size: a more powerful test has a higher value of 1 − β. Increasing the sample size usually increases the power of the test [31,32].

When the exact null distribution of a goodness-of-fit test statistic is a step function, created by the summation of the exact probabilities for each possible value of the test statistic, the same critical value can correspond to a number of different adjacent significance levels α.
Linear interpolation of the power of the test statistic, using the power at attainable significance levels less than (denoted α_1) and greater than (denoted α_2) the desired significance level (denoted α; see Figure 3), is preferred by many authors to overcome this problem (see, for example, [33]). Linear interpolation gives a weighting to the power based on how close α_1 and α_2 are to α. In this case, the power of the test is calculated according to the formula [19]: where γ_1(α) and γ_2(α) are the critical values immediately below and above the significance level α, and α_1 = P(T ≥ γ_1(α) | H_0) and α_2 = P(T ≥ γ_2(α) | H_0) are the significance levels for γ_1(α) and γ_2(α), respectively.
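One common linear form of this interpolation can be sketched as follows (the paper's exact formula from [19] is not reproduced in the extracted text, so the weights here are the standard linear-interpolation weights, an assumption of this sketch):

```python
def interpolated_power(alpha, alpha1, power1, alpha2, power2):
    """Linearly interpolate power between the attainable significance
    levels alpha1 <= alpha <= alpha2 of a discrete null distribution."""
    if alpha2 == alpha1:
        return power1
    # Weight each bracketing power by how close its level is to alpha.
    w = (alpha - alpha1) / (alpha2 - alpha1)
    return (1 - w) * power1 + w * power2

# e.g. attainable levels 0.04 and 0.08 bracket the desired alpha = 0.05
p = interpolated_power(0.05, 0.04, 0.60, 0.08, 0.70)
```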

Statistical Distributions
The simulation study considers fifteen statistical distributions for which the performance of the presented normality tests are assessed. Statistical distributions are grouped into three groups: symmetric, asymmetric, and modified normal distributions. A description of these distribution groups is presented in the following.

Symmetric Distributions
Symmetric distributions considered in this research are [20].

The power of the test statistics is determined by the following steps [19]:
1. A data set x_1, x_2, . . . , x_n with the analyzed distribution is formed.
2. The statistic of the goodness-of-fit test is calculated. If the obtained value of the statistic is greater than the corresponding critical value (α = 0.05 is used), then hypothesis H_0 is rejected.
3. Steps 1 and 2 are repeated k times.
4. The power of the test is calculated as count/k, where count is the number of false-hypothesis rejections.
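The steps above can be sketched as a Monte Carlo loop. The Shapiro-Wilk test, the sampling distribution, the sample size, and the replication count below are illustrative stand-ins, not the paper's exact setup:

```python
import numpy as np
from scipy import stats

def estimate_power(sampler, n, k=1000, alpha=0.05, seed=0):
    """Steps 1-4: draw k samples of size n from `sampler`, run a
    normality test on each, and count rejections of H0."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(k):
        x = sampler(rng, n)                  # step 1: generate data
        _, p = stats.shapiro(x)              # step 2: test statistic (SW here)
        if p < alpha:                        # compare with critical level
            count += 1
    return count / k                         # step 4: power estimate

# Power of SW against an exponential alternative at n = 64.
power = estimate_power(lambda rng, n: rng.standard_exponential(n), n=64)
```

Running the same loop with a normal sampler estimates the empirical size of the test, which should stay close to α.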


Asymmetric Distributions
Asymmetric distributions considered in this research are [20]:

Simulation Study and Discussion
This section provides a comprehensive modeling study designed to evaluate the power of the selected normality tests. The study takes into account the effects of sample size, the chosen level of significance (α = 0.05), and the type of alternative distribution (Beta, Cauchy, Laplace, Logistic, Student, Chi-Square, Gamma, Gumbel, Lognormal, Weibull, and modified standard normal). The study was performed by applying 40 normality tests (including our proposed normality test) to 1,000,000 generated standardized samples of each of the sizes 32, 64, 128, 256, 512, and 1024.
The best set of parameters (a, b, c) was selected experimentally: the value of a was examined from 0.001 to 0.99 in steps of 0.01, the value of b from 0.01 to 10 in steps of 0.01, and the value of c from 0.5 to 50 in steps of 0.25. The N-metric test gave the most powerful results with the parameters a = 0.95, b = 0.25, and c = 1. In cases where a test has several modifications, we present results only for the best variant. Tables 1-3 present the average power obtained for the symmetric, asymmetric, and modified normal distribution sets, for sample sizes of 32, 64, 128, 256, 512, and 1024. Comparing Tables 1-3 shows that the most powerful test for small samples was Hosking1 (H1), while the most powerful test for large sample sizes was our presented test (N-metric). According to Tables 1-3, for large sample sizes the power of most tests approaches 1, except for the D'Agostino (DA) test, whose power is significantly lower. An additional study was conducted to determine the exact minimal sample size at which the N-metric test (statistic (34) with kernel function (35)) is the most powerful for the groups of symmetric, asymmetric, and modified normal distributions. The Hosking1 and N-metric tests were applied to data sets of sizes 80, 90, 100, 105, 110, and 115. The obtained results showed that the N-metric test was the most powerful for sample sizes ≥ 112 for the symmetric distributions, ≥ 118 for the asymmetric distributions, and ≥ 88 for the group of modified normal distributions (see Table 4). The N-metric test is the most powerful for the Gamma distribution for sample sizes ≥ 32. It has been observed that in the case of the Cauchy and Lognormal distributions, the N-metric test is the most powerful when the sample size is ≥ 255, which may be influenced by the long tails of these distributions.

Conclusions and Future Work
In this study, a comprehensive comparison of the power of popular normality tests was performed. Given the importance of this topic and the extensive development of normality tests, the proposed new normality test, the detailed test descriptions provided, and the power comparisons are relevant. Only univariate data were examined in this study of the power of normality tests (a study with multivariate data is planned for the future).

The study addresses the performance of 40 normality tests, for various sample sizes n for a number of symmetric, asymmetric, and modified normal distributions. A new goodness-of-fit test has been proposed. Its results are compared with other tests.
Based on the obtained modeling results, it was determined that the most powerful tests for the groups of symmetric, asymmetric, and modified normal distributions were Hosking1 (for smaller sample sizes) and our proposed N-metric (for larger sample sizes) test. The power of the Hosking1 test (for smaller sample sizes) is 1.5 to 7.99 percent higher than the second (by power) test for the groups of symmetric, asymmetric, and modified normal distributions. The power of the N-metric test (for larger sample sizes) is 6.2 to 16.26 percent higher than the second (by power) test for the groups of symmetric, asymmetric, and modified normal distributions.
The N-metric test is recommended for symmetric data sets of size n ≥ 112, for asymmetric data sets of size n ≥ 118, and for bell-shaped distributed data sets of size n ≥ 88.

Data Availability Statement: Generated data sets were used in the study (see Section 4).

Conflicts of Interest:
The authors declare no conflict of interest.
