1. Introduction
A priori information about the data distribution is not always available. In such cases, hypothesis testing can help to find a reasonable assumption about the distribution of the data. Based on the assumed distribution, one can choose appropriate methods for further research. Information about the data distribution can be useful in a number of ways; for example:
it can provide insights about the observed process;
parameters of a model can be inferred from the characteristics of the data distribution; and
it can help in choosing more specific and computationally efficient methods.
Statistical methods often require data to be normally distributed. If the assumption of normality is not satisfied, the results of these methods may be invalid. Therefore, the assumption of normality must be verified before starting the statistical analysis. Many tests have been developed to check this assumption. However, these tests are defined in various ways and thus react differently to the departures from normality present in a data set. Therefore, the choice of a goodness-of-fit test remains an important problem.
For these reasons, this study examines the issue of testing goodness-of-fit hypotheses. The goodness-of-fit null and alternative hypotheses are defined as: H0, the sample is drawn from the hypothesized (here, normal) distribution; and H1, the sample is not drawn from that distribution.
A total of 40 tests were applied to analyze the problem of testing the goodness-of-fit hypothesis. The tests used in this study were developed between 1900 and 2016. In 1900, Karl Pearson published the article defining the chi-square test [1]. This test is considered the basis of modern statistics. Pearson was the first to examine the goodness-of-fit assumption that the observations can be distributed according to the normal distribution, and concluded that, in the limit as the number of observations becomes large, the test statistic follows the chi-square distribution with the degrees of freedom determined by the number of classes. The statistic for this test is defined in
Section 2.1. Another popular test for the goodness-of-fit hypothesis is the Kolmogorov–Smirnov test [2]. Its statistic quantifies the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution [3]. The Anderson–Darling test is also often used in practice [4]. This test assesses whether a sample comes from a specified distribution [3]. The 20th century and the beginning of the 21st century were a productive period for the development of goodness-of-fit test criteria and their comparison studies [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19].
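As an illustration of how the three classical tests above operate, the following minimal sketch applies them to a single sample using SciPy's implementations (`scipy.stats.kstest`, `scipy.stats.anderson`, and `scipy.stats.chisquare`); the binning used for the chi-square test is an illustrative choice, not one prescribed by the tests themselves.

```python
# A minimal sketch: three classical goodness-of-fit tests on one sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=256)

# Kolmogorov-Smirnov: distance between the empirical CDF and the N(0, 1) CDF.
ks_stat, ks_p = stats.kstest(x, "norm")

# Anderson-Darling: a weighted EDF distance, here against the normal family.
ad_result = stats.anderson(x, dist="norm")

# Pearson chi-square: bin the data and compare observed vs. expected counts.
edges = np.linspace(-3.0, 3.0, 11)
observed, _ = np.histogram(x, bins=edges)
cdf = stats.norm.cdf(edges)
# Expected counts, renormalized to the probability mass inside the bin range.
expected = np.diff(cdf) / (cdf[-1] - cdf[0]) * observed.sum()
chi2_stat, chi2_p = stats.chisquare(observed, expected)

print(ks_stat, ad_result.statistic, chi2_stat)
```

Each statistic measures a different notion of discrepancy, which is why the tests react differently to the same departure from normality.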
In 2010, Xavier Romão et al. conducted a comprehensive study comparing the power of goodness-of-fit hypothesis tests [20]. In the study, 33 normality tests were applied to samples of different sizes, taking into account the chosen significance level and many symmetric, asymmetric, and modified normal distributions. The researchers found that the most powerful of the selected normality tests for the symmetric group of distributions were the Coin, Chen–Shapiro, Bonett–Seier, and Gel–Miao–Gastwirth tests; for the asymmetric group of distributions, the two Zhang–Wu test variants and the Chen–Shapiro test; while the Chen–Shapiro, Barrio–Cuesta-Albertos–Matrán–Rodríguez-Rodríguez, and Shapiro–Wilk tests were the most powerful for the group of modified normal distributions.
In 2015, Adefisoye et al. compared 18 normality tests for different sample sizes for symmetric and asymmetric distribution groups [
3]. The results of the study showed that the Kurtosis test was the most powerful for a group of symmetric data distributions and the Shapiro–Wilk test was the most powerful for a group of asymmetric data distributions.
The main objective of this study is to perform a comparative analysis of the power of the most commonly used tests for testing the goodness-of-fit hypothesis. The procedure described in
Section 3 was used to calculate the power of the tests.
The scientific novelty of this study is a comparative analysis of test power carried out across many different goodness-of-fit methods and a wide range of alternative distributions. The goodness-of-fit tests were selected as representatives of popular techniques that have been analyzed experimentally by other researchers. We have also proposed a new kernel function and its usage in an N-metric-based test. The uniqueness of the kernel function is that its shape is chosen in such a way that the shift arising in the formation of the test statistic is eliminated by using the sample values.
The rest of the paper is organized as follows.
Section 2 provides descriptions of the 40 goodness-of-fit hypothesis tests.
Section 3 describes the procedure for calculating the power of the tests. The samples generated from 15 distributions are given in
Section 4.
Section 5 presents and discusses the results of the simulation modeling study. Finally,
Section 6 concludes the paper.
3. The Power of a Test
The power of a test is defined as the probability of rejecting a false null hypothesis H0. Power is the complement of the probability of a type II error β, i.e., power = 1 − β. Decreasing the probability of a type I error α increases the probability of a type II error β and decreases the power of the test. The smaller the type II error is, the more powerful the test is. In practice, tests are designed to minimize the type II error for a fixed type I error. The most commonly chosen value for α is 0.05. The probability of the opposite event, 1 − β, i.e., the power of the test (see
Figure 2), is the probability of rejecting hypothesis H0 when it is false. The power of a test makes it possible to compare two tests with the same significance level and sample size: a more powerful test has a higher value of 1 − β. Increasing the sample size usually increases the power of the test [31,32].
When the exact null distribution of a goodness-of-fit test statistic is a step function created by the summation of the exact probabilities for each possible value of the test statistic, it is possible to obtain the same critical value for a number of different adjacent significance levels. Linear interpolation of the power of the test statistic, using the powers for an attainable significance level less than (denoted α1) and greater than (denoted α2) the desired significance level (denoted α) (see in
Figure 3), is preferred by many authors to overcome this problem (see, for example, [33]). Linear interpolation gives a weighting to the power based on how close α1 and α2 are to α. In this case, the power of the test is calculated according to the formula [19]:

power(α) = power(α1) + ((α − α1) / (α2 − α1)) · (power(α2) − power(α1)),

where t1 and t2 are the critical values immediately below and above the significance level α, and α1 = P(T ≥ t1) and α2 = P(T ≥ t2) are the significance levels corresponding to t1 and t2, respectively.
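The interpolation can be sketched as a small function; the parameter names and the example values below are illustrative, not taken from the study.

```python
# Sketch of linear interpolation of test power between two attainable
# significance levels alpha1 < alpha < alpha2, given the power at each.
def interpolated_power(alpha, alpha1, power1, alpha2, power2):
    if not alpha1 < alpha < alpha2:
        raise ValueError("alpha must lie strictly between alpha1 and alpha2")
    weight = (alpha - alpha1) / (alpha2 - alpha1)
    return power1 + weight * (power2 - power1)

# Illustrative values: power 0.61 attainable at level 0.042 and power 0.68
# at level 0.058; interpolate at the nominal alpha = 0.05.
print(interpolated_power(0.05, 0.042, 0.61, 0.058, 0.68))  # approx. 0.645
```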
The power of a test statistic is determined by the following steps [19]:
1. A sample from the analyzed data distribution is generated.
2. The statistic of the goodness-of-fit hypothesis test is calculated. If the obtained value of the statistic is greater than the critical value corresponding to the chosen significance level, then the hypothesis H0 is rejected.
3. Steps 1 and 2 are repeated N times (in our experiments, N = 1,000,000).
4. The power of the test is estimated as k/N, where k is the number of false null hypothesis rejections.
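The steps above can be sketched as a small Monte Carlo routine. As illustrative assumptions, the sketch uses SciPy's Shapiro–Wilk test (`scipy.stats.shapiro`) as the goodness-of-fit criterion and a Laplace alternative, and keeps the repetition count far below the 1,000,000 used in the study, for speed.

```python
# Monte Carlo power estimation sketch: generate data from an alternative
# distribution, test the normality hypothesis, and count rejections.
import numpy as np
from scipy import stats

def estimate_power(sample_size=64, n_repeats=2000, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_repeats):               # step 3: repeat N times
        x = rng.laplace(size=sample_size)    # step 1: sample the alternative
        _, p_value = stats.shapiro(x)        # step 2: compute the test
        if p_value < alpha:                  # reject H0 at level alpha
            rejections += 1
    return rejections / n_repeats            # step 4: power estimate k / N

print(estimate_power())  # fraction of rejections, i.e., the estimated power
```

Because the data are drawn from a non-normal alternative, every rejection is a correct rejection of a false H0, so the rejection rate estimates the power directly.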
5. Simulation Study and Discussion
This section provides a comprehensive modeling study designed to evaluate the power of the selected normality tests. The modeling study takes into account the effects of the sample size, the chosen significance level (α), and the type of alternative distribution (Beta, Cauchy, Laplace, Logistic, Student, Chi-Square, Gamma, Gumbel, Lognormal, Weibull, and modified standard normal). The study was performed by applying 40 normality tests (including our proposed normality test) to 1,000,000 generated standardized samples of each size: 32, 64, 128, 256, 512, and 1024.
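The standardization of each generated sample mentioned above can be sketched as follows; the helper name and the Gamma example are illustrative, not part of the study's code.

```python
# Sketch of the standardization step: each sample drawn from an alternative
# distribution is shifted and scaled to zero mean and unit sample variance
# before the normality tests are applied.
import numpy as np

def standardized_sample(rng, draw, size):
    x = draw(rng, size)
    return (x - x.mean()) / x.std(ddof=1)

rng = np.random.default_rng(42)
# Example: a standardized sample of size 128 from a Gamma(2) distribution.
x = standardized_sample(rng, lambda r, n: r.gamma(shape=2.0, size=n), 128)
print(x.mean(), x.std(ddof=1))  # mean approx. 0, standard deviation approx. 1
```

Standardization removes location and scale so that the tests respond only to the shape of the distribution.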
The best set of parameters was selected experimentally: the value of the first parameter was examined from 0.001 to 0.99 in steps of 0.01, the value of the second from 0.01 to 10 in steps of 0.01, and the value of the third from 0.5 to 50 in steps of 0.25; the N-metric test gave the most powerful results with the best-found parameter set. In cases where a test has several modifications, we present results only for the best variant.
Table 1,
Table 2 and
Table 3 present the average power obtained for the symmetric, asymmetric, and modified normal distribution sets, for sample sizes of 32, 64, 128, 256, 512, and 1024. By comparing
Table 1,
Table 2 and
Table 3, it can be seen that the most powerful test for small sample sizes was Hosking1 (H1), while the most powerful test for large sample sizes was our proposed test (N-metric). According to
Table 1,
Table 2 and
Table 3, it is observed that for large sample sizes the power of most tests approaches 1, except for the D’Agostino (DA) test, whose power is significantly lower.
An additional study was conducted to determine the exact minimal sample size at which the
N-metric test (statistic (34) with kernel function (35)) is the most powerful for groups of symmetric, asymmetric, and modified normal distributions.
Hosking1 and
N-metric tests were applied to data sets of sizes 80, 90, 100, 105, 110, and 115. The obtained results showed that the
N-metric test was the most powerful for sample size
for the symmetric distributions, for sample size
for the asymmetric distributions, and for sample size
for a group of modified normal distributions (see in
Table 4). The
N-metric test is the most powerful for the Gamma distribution for sample size
. It has been observed that in the case of Cauchy and Lognormal distributions, the
N-metric test is the most powerful when the sample size is
, which can be influenced by the long tail of these distributions.
To complement the results given in
Table 1,
Table 2 and
Table 3,
Figure 4 (and
Figure A1,
Figure A2 and
Figure A3 in
Appendix A) presents the average power results of the most powerful goodness-of-fit tests.
Figure 4 presents two distributions from each group of symmetric (Standard normal and Student), asymmetric (Gamma and Gumbel), and modified normal (standard normal distribution truncated at
and
and location-contaminated standard normal distribution) distributions. Figures of all other distributions are given in
Appendix A. In
Figure 4, it can be seen that for the standard normal distribution, our proposed test (
N-metric) is the most powerful when the sample size is 64 or larger.
Figure 4 shows that our proposed test (
N-metric) is the most powerful for the Gamma distribution for all of the sample sizes examined. In general, it can be summarized that the power of the Chen–Shapiro (
ChenS), Gel–Miao–Gastwirth (
GMG),
Hosking1 (
H1), and Modified Shapiro–Wilk (
SWRG) tests increases gradually with increasing sample size. The power of our proposed test (
N-metric) increases abruptly when the sample size is 128 and its power value remains close to 1 for larger sample sizes.
6. Conclusions and Future Work
In this study, a comprehensive comparison of the power of popular normality tests was performed. Given the importance of this topic and the continuing development of normality tests, the proposed new normality test, the detailed test descriptions, and the power comparisons are relevant. Only univariate data were examined in this study of the power of normality tests (a study with multivariate data is planned for the future).
The study addresses the performance of 40 normality tests, for various sample sizes for a number of symmetric, asymmetric, and modified normal distributions. A new goodness-of-fit test has been proposed. Its results are compared with other tests.
Based on the obtained modeling results, it was determined that the most powerful tests for the groups of symmetric, asymmetric, and modified normal distributions were the Hosking1 test (for smaller sample sizes) and our proposed N-metric test (for larger sample sizes). The power of the Hosking1 test (for smaller sample sizes) is 1.5 to 7.99 percent higher than that of the second most powerful test for the groups of symmetric, asymmetric, and modified normal distributions. The power of the N-metric test (for larger sample sizes) is 6.2 to 16.26 percent higher than that of the second most powerful test for these groups.
The N-metric test is recommended to be used for symmetric data sets of size , for asymmetric data sets of size , and for bell-shaped distributed data sets of size .