Ranking of Normality Tests: An Appraisal through Skewed Alternative Space

Abstract: In social and health sciences, many statistical procedures and estimation techniques rely on the underlying distributional assumption of normality of the data. Non-normality may lead to incorrect statistical inferences. This study evaluates the performance of selected normality tests within the stringency framework for skewed alternative space. The stringency concept allows us to rank the tests uniquely. The Bonett and Seier test (Tw) turns out to be the best statistic for slightly skewed alternatives, and the Anderson–Darling (AD); Chen–Shapiro (CS); Shapiro–Wilk (W); and Bispo, Marques, and Pestana (BCMR) statistics are the best choices for moderately skewed alternative distributions. The maximum loss of Jarque–Bera (JB) and its robust form (RJB), in terms of deviations from the power envelope, is greater than 50%, even for large sample sizes, which makes them less attractive in testing the hypothesis of normality against the moderately skewed alternatives. On balance, all selected normality tests except Tw and Daniele Coin's COIN-test performed exceptionally well against the highly skewed alternative space. As the sample size increases, it becomes harder to differentiate among the selected tests of normality, excluding Tw and COIN. The results clearly show that the power loss of these statistics decreases with the increase in (i) sample size and (ii) skewness and kurtosis. For all sample sizes, JB and RJB yield good powers for the highly skewed alternatives.


Introduction
Departures from normality can be measured in a variety of ways; however, the most common measures are skewness and kurtosis. Skewness refers to the symmetry of a distribution and kurtosis refers to the flatness or 'peakedness' of a distribution. These two statistics have been widely used to differentiate between distributions. A normal distribution has skewness and kurtosis values of 0 and 3, respectively. If the values of skewness and kurtosis deviate significantly from 0 and 3, it is assumed that the data at hand are not normally distributed. Macroeconomists are always concerned with whether economic variables exhibit similar behavior during recessions and booms. Delong and Summers [1] applied the skewness measure to GDP, the unemployment rate, and industrial production to study whether business cycles were symmetric. The experimental data sets generated in clinical chemistry require the use of skewness and kurtosis statistics to determine their shape and normality [2]. Blanca, Arnau, López-Montiel, Bono, and Bendayan [3] analyzed the shape of 693 real data distributions, including measures of cognitive ability and other psychological variables, in terms of skewness and kurtosis. Only 5.5% of the distributions were close to the normality assumption.
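As a minimal illustration of these two statistics, the following Python sketch computes moment-based sample skewness and kurtosis. The function name and the choice of biased (divide-by-n) moment estimators are assumptions made for illustration; they are not taken from the cited studies.

```python
def sample_skewness_kurtosis(data):
    """Return (skewness, kurtosis) from central moments that divide by n.

    Under this convention a normal distribution has skewness 0 and kurtosis 3.
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis


# A small symmetric sample: skewness is 0, kurtosis is well below 3 (short tails).
skew, kurt = sample_skewness_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
```

The returned kurtosis is not excess kurtosis, so it is compared with 3 rather than 0.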
This study is devoted to analyzing the respective impact of changes in skewness and kurtosis on the power of normality tests. Normality tests are developed based on different characteristics of the normal distribution, and the power of normality statistics varies depending on the nature of the non-normality [4]. Therefore, comparisons of normality tests yield ambiguous results, since all normality statistics critically depend on alternative distributions which cannot be specified [16]. Fifteen normality tests are selected for a comparison of power based on the stringency concept proposed by Islam [16]. The stringency concept allows us to rank the normality tests in a unique fashion. Neyman-Pearson (NP) tests are computed against each alternative distribution to construct the power curve. Relative efficiencies of all the tests in question are computed as the deviations of each test from the power curve. The best test is defined as the test displaying the minimum deviation from the power curve among the maximum deviations of all the tests.
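To make the role of the NP test concrete, the following Python sketch estimates one point of a power curve by Monte Carlo. It is a simplification under stated assumptions: both hypotheses are treated as fully specified (a standard normal null against a central t-distribution), the critical value is simulated rather than derived, and all function names are hypothetical. It is not the computation used in this study.

```python
import math
import random


def log_normal_pdf(x):
    # Log density of the standard normal distribution.
    return -0.5 * math.log(2 * math.pi) - 0.5 * x * x


def log_t_pdf(x, df):
    # Log density of a central Student t-distribution with df degrees of freedom.
    c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return c - (df + 1) / 2 * math.log(1 + x * x / df)


def draw_t(rng, df):
    # t = Z / sqrt(chi2 / df), where chi2(df) is Gamma(shape=df/2, scale=2).
    return rng.gauss(0, 1) / math.sqrt(rng.gammavariate(df / 2, 2) / df)


def log_likelihood_ratio(sample, df):
    # NP statistic: log of (alternative likelihood / null likelihood).
    return sum(log_t_pdf(x, df) - log_normal_pdf(x) for x in sample)


def np_power(n, df, alpha=0.05, reps=2000, seed=1):
    """Monte Carlo power of the NP test of N(0, 1) against t(df) at level alpha."""
    rng = random.Random(seed)
    null_stats = sorted(
        log_likelihood_ratio([rng.gauss(0, 1) for _ in range(n)], df)
        for _ in range(reps)
    )
    # Simulated upper-alpha critical value of the statistic under the null.
    crit = null_stats[min(reps - 1, int((1 - alpha) * reps))]
    rejections = sum(
        log_likelihood_ratio([draw_t(rng, df) for _ in range(n)], df) > crit
        for _ in range(reps)
    )
    return rejections / reps
```

Repeating `np_power` over a grid of alternatives would trace out an envelope against which other tests can be compared.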

Stringency Framework
Islam [16] proposed a new framework to evaluate the performance of normality tests based on the stringency concept introduced by Lehmann and Stein [17].
Let $y = (y_1, y_2, y_3, \ldots, y_n)$ be the observations with the density function $f(y, \phi)$, where $\phi$ belongs to the parameter space $\Phi$. A function $h(y)$ which takes the values $\{0, 1\}$ is called a hypothesis test, and $H_\alpha$ denotes the set of all such tests with level of significance $\alpha$.
For any test of size $\alpha$, the maximum achievable power is defined as
$$\beta^{*}_{\alpha}(\phi) = \sup_{h \in H_{\alpha}} \beta(h, \phi), \qquad \phi \in \Phi_{a},$$
where $\beta(h, \phi)$ is the power of $h(y)$ and $\Phi_{a}$ represents the alternative parameter space. Different values of $\phi$ yield different optimal test statistics, which provide the power envelope. The relative power performance of a test, $h \in H_{\alpha}$, is measured by its deviation from the power envelope as
$$D(h, \phi) = \beta^{*}_{\alpha}(\phi) - \beta(h, \phi).$$
A test is said to be most stringent if it minimizes the maximum deviation from the power envelope. The stringency of a test is defined as the maximum deviation from the power envelope when evaluated over the entire alternative space:
$$S(h) = \sup_{\phi \in \Phi_{a}} \left[ \beta^{*}_{\alpha}(\phi) - \beta(h, \phi) \right].$$
Only the uniformly most powerful test can have zero stringency, which is rarely found; however, slightly compromising on it can give us a test which is as good as the uniformly most powerful test [16]. Evaluating the normality tests based on their stringencies allows us to rank them in a unique manner and helps researchers to find the best test.
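The ranking rule described in this section (take each test's maximum deviation from the power envelope, then prefer the test with the smallest maximum) can be sketched in Python as follows. The power values and test names are made-up placeholders, not results from this study, and the function names are hypothetical.

```python
def stringency(envelope, power):
    """Maximum shortfall of a test's power below the power envelope."""
    return max(e - p for e, p in zip(envelope, power))


def rank_by_stringency(envelope, powers_by_test):
    """Return (test name, stringency) pairs, most stringent test first."""
    scores = {name: stringency(envelope, p) for name, p in powers_by_test.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical powers at three alternatives; the envelope is the NP power at each.
envelope = [0.30, 0.60, 0.95]
powers = {
    "test_A": [0.25, 0.40, 0.90],  # worst gap is at the second alternative
    "test_B": [0.15, 0.50, 0.94],  # worst gap is at the first alternative
}
ranking = rank_by_stringency(envelope, powers)
```

Here `test_B` ranks first because its largest gap (about 0.15) is smaller than that of `test_A` (about 0.20), even though `test_A` is closer to the envelope at the first alternative.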

Tests and Alternative Distributions
Normality tests are based on different characteristics of the normal distribution, such as the empirical distribution function, moments, correlation, and regression, and on special characteristics of the data distribution. Fifteen normality tests were selected (Table 1). The Gel and Gastwirth (Rsj) statistic is a special test which focuses on detecting heavy tails and outliers of distributions. Some of our selected normality tests (e.g., Jarque-Bera, Kolmogorov-Smirnov, Anderson-Darling, Shapiro-Wilk, Shapiro-Francia, etc.) are also available in popular software like MATLAB, STATA, SPSS, and EViews.
Departures from normality (first and second order) depend on the skewness and kurtosis parameters. A mixture of t-distributions allows us to vary these two statistics over a wide range. It also covers the distributions used in the literature in terms of skewness and kurtosis (for details, see [16]). This study uses a mixture of t-distributions as the alternative distributional space (Appendix: Table A1). The alternative distributional space was generated as a two-component mixture of the t-distributions $t(v_1, \mu_1)$ and $t(v_2, \mu_2)$, where $v_1$, $v_2$, $\mu_1$, and $\mu_2$ are the degrees of freedom and the means of the respective t-distributions.
We have divided our alternative space of distributions into the following three groups on the basis of skewness ($\beta_1$): (i) slightly skewed, (ii) moderately skewed, and (iii) highly skewed. In each group, skewness remained within the bounds and we allowed kurtosis to vary. The benchmark value for Group I (symmetric distributions) follows [18], and the other groups were defined relative to it:
Group I: $\beta_1 \le 0.3$
Group II: $0.3 < \beta_1 \le 1.5$
Group III: $\beta_1 > 1.5$
Neyman-Pearson (NP) tests were computed against each alternative distribution in each group to construct the power curve. Relative efficiencies of all the tests in question were computed as the deviations of each test from the power curve. The best test was defined as the test displaying the minimum deviation from the power curve among the maximum deviations of all the tests.
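A rough Python sketch of how such an alternative space might be sampled and grouped is given below. The mixing weight `weight` and its default of 0.5 are assumptions, since the mixing rule is not restated here. The function names are hypothetical, and the means are treated as location shifts of a standard t-variate.

```python
import math
import random


def draw_t(rng, df, mean=0.0):
    # Location-shifted t draw: mean + Z / sqrt(chi2 / df),
    # with chi2(df) generated as Gamma(shape=df/2, scale=2).
    return mean + rng.gauss(0, 1) / math.sqrt(rng.gammavariate(df / 2, 2) / df)


def draw_mixture(rng, n, df1, mean1, df2, mean2, weight=0.5):
    """Draw n values from a two-component t mixture.

    weight is the ASSUMED probability of drawing from the first component.
    """
    return [
        draw_t(rng, df1, mean1) if rng.random() < weight else draw_t(rng, df2, mean2)
        for _ in range(n)
    ]


def skewness_group(beta1):
    """Map a skewness value to Group I, II, or III using the stated cut-offs."""
    if beta1 <= 0.3:
        return "I"
    if beta1 <= 1.5:
        return "II"
    return "III"


rng = random.Random(0)
sample = draw_mixture(rng, 50, df1=5, mean1=0.0, df2=10, mean2=3.0)
```

Unequal component means together with an unequal mixing weight are what introduce skewness in this construction; with equal means the mixture stays symmetric.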
Following Islam [16], I group the alternative space into three categories based on the power of the NP test: FAR, INTERMEDIATE, and NEAR. The alternative distributions where the power of the NP test is between 90-100%, 40-90%, and 5-40% are categorized as the FAR, INTERMEDIATE, and NEAR group of alternatives, respectively.
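This three-way grouping amounts to a simple threshold rule on NP power, sketched below in Python. The stated ranges share their endpoints, so assigning a power of exactly 90% to FAR and exactly 40% to INTERMEDIATE is an assumption of this sketch, as is the function name.

```python
def distance_group(np_power):
    """Classify an alternative by the power of the NP test (as a fraction in [0, 1])."""
    if np_power >= 0.90:
        return "FAR"
    if np_power >= 0.40:
        return "INTERMEDIATE"
    return "NEAR"


# Hypothetical NP powers for three alternatives.
groups = [distance_group(p) for p in (0.12, 0.55, 0.99)]
```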

Discussion of Results
Monte Carlo procedures were employed to investigate the powers of fifteen selected normality tests for sample sizes of 25, 50, and 75, at the 5% level of significance with 100,000 replications.
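A Monte Carlo power calculation of this kind can be sketched in Python as follows, using the Jarque-Bera statistic as an example. This is an illustrative sketch rather than the study's code. It assumes simulated (not asymptotic) critical values, uses far fewer replications than 100,000 to keep the run short, and the function names are hypothetical.

```python
import random


def jarque_bera(sample):
    # JB = n/6 * (S^2 + (K - 3)^2 / 4), with S and K from divide-by-n moments.
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    s = m3 / m2 ** 1.5
    k = m4 / m2 ** 2
    return n / 6 * (s * s + (k - 3) ** 2 / 4)


def simulate(stat, draw, n, reps, rng):
    # Evaluate the statistic on reps independent samples of size n.
    return [stat([draw(rng) for _ in range(n)]) for _ in range(reps)]


def mc_power(stat, draw_alt, n, alpha=0.05, reps=2000, seed=0):
    """Estimate the rejection rate of an upper-tailed test under draw_alt."""
    rng = random.Random(seed)
    null_stats = sorted(simulate(stat, lambda r: r.gauss(0, 1), n, reps, rng))
    # Simulated critical value: the empirical (1 - alpha) quantile under normality.
    crit = null_stats[min(reps - 1, int((1 - alpha) * reps))]
    alt_stats = simulate(stat, draw_alt, n, reps, rng)
    return sum(v > crit for v in alt_stats) / reps


# Empirical size under normality, then power against an exponential alternative
# (the exponential is chosen only for illustration).
size = mc_power(jarque_bera, lambda r: r.gauss(0, 1), n=25)
power = mc_power(jarque_bera, lambda r: r.expovariate(1.0), n=25)
```

Because the critical value is simulated under the null, the empirical size should sit close to the nominal 5% level, which is what makes powers comparable across tests.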

Slightly Skewed Alternatives
When considering all the selected normality tests, Tw is the best test against the slightly skewed alternatives (Figures A1-A10 and Table 2) for all sample sizes (n = 25, 50, 75), whereas the performance of JB and RJB tests is very poor, with an 80.5%-99.5% maximum loss of power.

Performance of the Moments-Based Tests
Among the moments-based class of normality tests, Tw is the best test for all sample sizes for slightly skewed alternatives (Table 2 and Figure A1). The K2 test occupies the fourth (for n = 25, 50) and third (for n = 75) rank, with maximum power losses of 42.6%, 44.8%, and 44.7%, respectively (Figure A3).
For all sample sizes, the JB and RJB tests are the least favorable options in terms of their maximum deviations (gaps) from the power curve (Figure A2). The worst distributions for JB and RJB statistics belong to the symmetric and short-tailed class of alternatives (Figure A2 and Appendix Table A2). These results corroborate the findings in [12,19,20]. To decide about the worst or best performance of a test, we need an invariant benchmark: a power envelope. The worst performances of JB in the aforementioned studies were evaluated against an arbitrary reference (e.g., W and AD); however, we computed the power curve by using the most powerful NP test, which yielded the exact deviations of the JB test from the power curve.

Performance of the Regression and Correlation Tests
When considering the regression and correlation-based group of normality tests, for small and large sample sizes (n = 25, 75), COIN, W, and BCMR are better choices for the slightly skewed alternatives. Overall, for slightly skewed distributions, COIN and W tests exhibit the same power properties (Figures A5 and A6), whereas Wsf and D statistics do not match the standards set by other members of the group (Figures A7 and A8), with maximum power losses of over 50% (Table 2). Overall, the CS outperforms its competitors in the said group, with maximum power loss ranging within 34.8%-39.8% for slightly skewed alternatives. This result strengthens the findings in [21].

Performance of the ECDF Tests
Among the ECDF class of normality tests, for slightly skewed alternatives, the AD statistic shares the second rank with COIN and CS for n = 25, the third rank with CS for n = 50, and the first rank with the Tw and Rsj tests for n = 75 (Table 2).
When considering all the selected normality tests for the slightly skewed alternative distributions, KS shares the third rank (maximum loss of power is 38.1%) with W and Zc for a sample size of 25, and the sixth rank (maximum loss of power is 49.9%) with Zc and BCMR for a sample size of 50. For a sample size of 75, the KS test again holds the third rank with a 45.1% maximum loss of power, while the Za and Zc tests hold the fourth rank with maximum power losses slightly above 50% (Table 2). On balance, when considering the maximum deviations from the power envelope, KS has a slight edge over the Za and Zc statistics. In terms of maximum deviations from the power envelope, Zc has a slight edge over Za. This differs from the findings in [13], whose comparison lacked an invariant benchmark, the power envelope.

Performance of the Special Test
This category only includes the Rsj test of normality. The performance of the Rsj test improves with the increase in sample size for the slightly skewed alternatives. It holds the third, second, and first rank for sample sizes of 25, 50, and 75, respectively (Table 2). On balance, Rsj performed well (Figure A10), especially for medium (n = 50) to large (n = 75) sample sizes, against slightly skewed distributions.
Finally, when considering all normality tests for slightly skewed alternatives, Tw is the most stringent test, with Rsj, AD, and CS following closely behind, whereas RJB, JB, and D are the least favorable options.

Moderately Skewed Alternatives
For moderately skewed alternatives, for a smaller sample size, CS, W, AD, and BCMR are the best choices and the COIN test is the least favorable option (Table 3). For a medium sample size, AD is ranked first and the COIN and Tw tests are at the bottom of the ranking table. For a larger sample size, AD, CS, W, and BCMR appear to be the best options, whereas the COIN and Tw tests are the worst options.

Performance of the Moments-Based Tests
In general, for moderately skewed alternatives, moments-based normality tests perform poorly for all sample sizes. For a smaller sample size, the Bowman and Shenton [22] K2 test occupies the fourth rank (with 46.7% maximum power loss), outperforming the other group members. For a medium sample size, JB and RJB (with power losses above 50.0%) move to fourth place, pushing K2 down to fifth place, whereas Tw shares the seventh rank (maximum power loss is 78.4%) with the COIN test.
With the increase in sample size, both JB and RJB show an improvement in power and ranking, but their maximum power losses are still above 50% (Table 3). Both JB and RJB are good at discriminating the FAR group of distributions (where the power of the NP test is between 90-100%), with JB having a slight edge over RJB, but both suffer when the distributions are from the INTERMEDIATE group of alternatives (Figure A11).

Performance of the Regression and Correlation Tests
Among the regression and correlation-based normality tests, for a smaller sample size, CS, W, and BCMR are the best tests for moderately skewed alternatives, with a loss range of 28.5%-29.8% (Table 3), whereas the COIN test is at the bottom, with a loss range of 68.8%-88.7%.
For a medium to large sample size (n = 50, 75), W, BCMR, and CS are the better options, with Wsf following closely behind. The D and COIN tests are the least favorable regression and correlation-based normality statistics for moderately skewed alternatives, which is in line with the findings in [4,22]. It is evident from Figure A12 that Tw and COIN both suffer against the INTERMEDIATE and FAR group of alternative distributions.

Performance of the ECDF Tests
For moderately skewed alternatives, among the ECDF class of normality tests, AD exhibits superior power properties for all sample sizes. When considering all the selected normality tests for moderately skewed alternatives, AD holds the first rank for all sample sizes.
For smaller and larger sample sizes, the Za and Zc statistics share the second rank. For a medium sample size, these tests occupy the third rank. For smaller and medium sample sizes, the KS test holds the third rank, whereas its position improves to second rank for a larger sample size. The W test turns out to be a better test than KS (Figure A13), which corroborates the findings in Shapiro, Wilk, and Chen [18]. While evaluating the stringencies of the normality statistics for moderately skewed alternatives, we reach the same conclusion, but through a superior and more reliable procedure.

Performance of the Other Tests
In general, for moderately skewed alternative distributions, the Rsj test performs poorly, exhibiting more than 50.0% maximum deviation from the power curve for all sample sizes. On balance, the worst performance of the Rsj test is against the INTERMEDIATE and FAR group of alternatives, but it performed well against the NEAR group of alternatives.
Overall, AD, CS, W, and BCMR happen to be the best and JB, RJB, Tw, Rsj, and COIN are the least favorable options for moderately skewed alternatives when considering all the selected normality tests.

Highly Skewed Alternatives
This group comprises alternatives from the FAR group only, where the most powerful NP test has 100% power. As both skewness and kurtosis are high for this group of alternatives, the departures from normality are easy to detect. All normality tests other than the COIN and Tw statistics performed well against highly skewed alternatives (Table 4). For a smaller sample size, the Wsf, BCMR, W, CS, Za, Zc, AD, RJB, and JB tests performed well, with the maximum power loss ranging between 8.8% and 13.9%, followed by the D statistic with a maximum power loss of 16.1% (Table 4), while the performance of the COIN and Tw tests was below the mark.
As the sample size increases, it becomes harder to differentiate among the selected tests of normality, excluding Tw and COIN. The results clearly show that the power loss of these statistics decreases with the increase in (i) sample size and (ii) skewness and kurtosis. For all sample sizes, JB and RJB yield good powers for the highly skewed alternatives.
Overall, the performance of the normality tests against the highly skewed and heavy-tailed alternatives is very good. However, the COIN and Tw tests performed poorly compared to other normality statistics. The poor performance of the COIN test is understandable as it is only meant for perfectly symmetric cases [21,22]. Bonett and Seier [4] also recommend a standard skewness test along with the Tw statistic when the alternative distribution is skewed. Therefore, the COIN and Tw tests are not recommended for highly skewed alternative distributions.

Conclusions
This study shed light on the performance of fifteen normality tests against the three different groups of alternatives. For slightly skewed alternative distributions, Tw is the best test, with COIN, AD, CS, and Rsj following closely behind. On balance, D, JB, RJB, K2, Wsf, and Za did not perform well for the slightly skewed alternatives, especially from medium (n = 50) to large (n = 75) sample sizes, with more than 50% maximum power losses.
When considering all the selected normality tests for the moderately skewed alternatives, AD, CS, W, and BCMR turn out to be the best options for testing the hypothesis of normality of data distribution. In general, JB, RJB, Tw, COIN, Rsj, D, and K2 tests perform poorly against moderately skewed distributions. The performance of JB and RJB increases with the increase in sample size, but their maximum loss, in terms of their deviations from the power envelope, is greater than 50%, even for large sample sizes (n = 75).
On balance, all normality tests except Tw and COIN performed exceptionally well against the highly skewed alternatives, especially from medium to large sample sizes.
The above findings confirm our argument that a comparison of tests against different alternatives yields different statistics as the best tests. The COIN [23] and Tw tests are the best options for slightly skewed alternatives, but these statistics perform poorly for moderately and highly skewed alternative distributions. Therefore, the comparison and ranking of normality tests do not make sense in the absence of an invariant benchmark: the power envelope.

Acknowledgments: I would like to thank Asad Zaman for his valuable comments and guidance.

Conflicts of Interest:
The authors declare no conflict of interest.