Computation of Probability Associated with Anderson–Darling Statistic

The correct application of a statistical test is directly connected with information related to the distribution of data. Anderson–Darling is one alternative used to test if the distribution of experimental data follows a theoretical distribution. The conclusion of the Anderson–Darling test is usually drawn by comparing the obtained statistic with the available critical value, which did not give any weight to the same size. This study aimed to provide a formula for calculation of p-value associated with the Anderson–Darling statistic considering the size of the sample. A Monte Carlo simulation study was conducted for sample sizes starting from 2 to 61, and based on the obtained results, a formula able to give reliable probabilities associated to the Anderson–Darling statistic is reported.


Introduction
Application of any statistical test is made under certain assumptions, and violation of these assumptions could lead to misleading interpretations and unreliable results [1,2]. One main assumption that several statistical tests have is related with the distribution of experimental or observed data (H0 (null hypothesis): The data follow the specified distribution vs. H1 (alternative hypothesis): The data do not follow the specified distribution). Different tests, generally called "goodness-of-fit", are used to assess whether a sample of observations can be considered as a sample from a given distribution. The most frequently used goodness-of-fit tests are Kolmogorov-Smirnov [3,4], Anderson-Darling [5,6], Pearson's chi-square [7], Cramér-von Mises [8,9], Shapiro-Wilk [10], Jarque-Bera [11][12][13], D'Agostino-Pearson [14], and Lilliefors [15,16]. The goodness-of-fit tests use different procedures (see Table 1). Alongside the well-known goodness-of-fit test, other methods based for example on entropy estimator [17][18][19], jackknife empirical likelihood [20], on the prediction of residuals [21], or for testing multilevel survival data [22] or multilevel models with binary outcomes [23] have been reported in the scientific literature.
Tests used to assess the distribution of a dataset received attention from many researchers (for testing normal or other distributions) [24][25][26][27]. The normal distribution is of higher importance, since the resulting information will lead the statistical analysis on the pathway of parametric or nonparametric tests [28][29][30][31][32][33]. Different normality tests are implemented on various statistical packages (e.g., Minitab-http://www.minitab.com/en-us/; EasyFit-http://www.mathwave.com/easyfitdistribution-fitting.html; Develve-http://develve.net/; r("nortest" nortest)-https://cran.rproject.org/web/packages/nortest/nortest.pdf; etc.). Several studies aimed to compare the performances of goodness-of-fit tests. In a Monte Carlo simulation study conducted on the normal distribution, Kolmogorov-Smirnov test has been identified as the least powerful test, while opposite Shapiro-Wilks test was identified as the most powerful test [34]. Furthermore, Anderson-Darling test was found to be the best option among five normality tests whenever t-statistics were used [35]. More weight to the tails are given by the Anderson-Darling test compared to Kolmogorov-Smirnov test [36]. The comparisons between different goodness-of-fit tests is frequently conducted by comparing their power [37,38], using or not confidence intervals [39], distribution of p-values [40], or ROC (receiver operating characteristic) analysis [32].
The interpretation of the Anderson-Darling test is frequently made by comparing the AD statistic with the critical value for a particular significance level (e.g., 20%, 10%, 5%, 2.5%, or 1%) even if it is known that the critical values depend on the sample size [41,42]. The main problem with this approach is that the critical values are available just for several distributions (e.g., normal and Weibull distribution in Table 2 [43], generalized extreme value and generalized logistic [44], etc.) but could be obtained in Monte Carlo simulations [45]. The primary advantage of the Anderson-Darling test is its applicability to test the departure of the experimental data from different theoretical distributions, which is the reason why we decided to identify the method able to calculate its associated p-value as a function also of the sample size.
D'Augostino and Stephens provided different formulas for calculation of p-values associated to the Anderson-Darling statistic (AD), along with a correction for small sample size (AD*) [37]. Their equations are independent of the tested theoretical distribution and highlight the importance of the sample size (Table 3).
Several Excel implementations of Anderson-Darling statistic are freely available to assist the researcher in testing if data follow, or do not follow, the normal distribution [46][47][48]. Since almost all distributions are dependent by at least two parameters, it is not expected that one goodness-of-fit test will provide sufficient information regarding the risk of error, because using only one method (one test) gives the expression of only one constraint between parameters. In this regard, the example provided in [49] is illustrative, and shows how the presence of a single outlier induces complete disarray between statistics, and even its removal does not bring the same risk of error as a result of applying different goodness-of-fit tests. Given this fact, calculation of the combined probability of independent (e.g., independent of the tested distribution) goodness-of-fit tests [50,51] is justified.
Good statistical practice guidelines request reporting the p-value associated with the statistics of a test. The sample size influences the p-value of statistics, so its reporting is mandatory to assure a proper interpretation of the statistical results. Our study aimed to identify, assess, and implement an explicit function of the p-value associated with the Anderson-Darling statistic able to take into consideration both the value of the statistic and the sample size.   ( )

Anderson-Darling Order Statistic
For a sample Y = (y1, y2, …, yn), the data are sorted in ascending order (let X = Sort(Y), and then X = (x1, x2, …, xn) with xi ≤ xi+1 for 0 < i < n, and xi = yσ(i), where σ is a permutation of {1, 2, …, n} which makes the X series sorted). Let the CDF be the associated cumulative distribution function and InvCDF the inverse of this function for any PDF (probability density function). The series P = (p1, p2, …, pn) defined by pi = InvCDF(xi) (or Q = (q1, q2, …, qn) defined by qi = InvCDF(yi), where the P is the unsorted array, and Q is the sorted array) are samples drawn from a uniform distribution only if Y (and X) are samples from the distribution with PDF.
At this point, the order statistics are used to test the uniformity of P (or for Q), and for this reason, the values of X are ordered (in Y). On the ordered probabilities (on P), several statistics can be computed, and Anderson-Darling (AD) is one of them: The associated AD statistic for a "perfect" uniform distribution can be computed after splitting the [0, 1] interval into n equidistant intervals (i/n, with 0 ≤ i ≤ n being their boundaries) and using the middles of those intervals ri = (2i − 1)/2n: where H1 is the Shannon entropy for R in nats (the units of information or entropy) (H1(R,n) = − Σri•ln(ri)).
Equation (2) gives the smallest possible value for AD. The value of the AD increases with the increase of the departure between the perfect uniform distribution and the observed distribution (P).

Monte Carlo Experiment for Anderson-Darling Statistic
The probability associated with a particular value of the AD statistic can be obtained using a Monte Carlo experiment. The AD statistics are calculated for a large enough number of samples (let be m the number of samples), the values are sorted, and then the relative position of the observed value of the AD in the series of Monte Carlo-calculated values gives the probability associated with the statistic of the AD test.
It should be noted that the equation linking the statistic and the probability also contains the size of the sample, and therefore, the probability associated with the AD value is dependent on n.
Taking into account all the knowledge gains until this point, it is relatively simple to do a Monte Carlo experiment for any order statistic. The only remaining problem is how to draw a sample from a uniform distribution in such way as to not affect the outcome. One alternative is to use a good random generator, such as Mersenne Twister [53], and this method was used to generate our samples as an alternative to the stratified random approach.

Class t1 t2 t3
Case It is not a good idea to use the design presented in Table 4 in its crude form, since it is transformed to a problem with an exponential (2 n ) complexity. The trick is to observe the pattern in Table 4. In fact, for (n + 1) cases, with different frequencies of occurrence following the model, the results are given in Table 5. The complexity of the problem of enumerating all the cases stays with the design presented in Table 5 at the same order of magnitude with n (we need to list only n + 1 cases instead of 2 n ).
The frequencies listed in Table 5 are combinations of n objects taken by two (intervals), so instead of enumerating all 2 n cases, it is enough to record only n + 1 cases weighted with their relative occurrence.
The effect of the pseudo-random generator is significantly decreased (the decrease is a precise order of magnitude of the binary representation, one unit in log2 transformation: 1 = log22, for the (0, 0.5) and (0.5, 1) split) by doing a stratified random sample.
The extractions of a number from (0, 0.5) and from (0.5, 1) were furthermore made in our experiment with Mersenne Twister random (if x = Random() with 0 ≤ x < 1 then 0 ≤ x/2 < 1 and 0.5 ≤ 0.5 + x/2 < 1). Table 5 provides all the information we need to do the design. For any n, for k from 0 to n, exactly k numbers are generated as Random()/2, and sorted. Furthermore, exactly n−k numbers are generated as 0.5 + Random()/2, and the frequency associated with this pattern is n!/(k!•(n−k)!).

Model for Anderson-Darling Statistic
Performing the Monte Carlo (MC) experiment (generates, analyzes, and provides the outcome) each time when a probability associated with the AD statistic is needed is resource-consuming and not effective. For example, if we generate for a certain sample size (n) a large number of samples m = 1.28 × 10 10 , then the needed storage space is 51.2 Gb for each n. Given 1 Tb of storage capacity, it can store only 20 iterations of n, as in the series of the AD(n). However, this is not needed, since it is possible to generate and store the results of the Monte Carlo analysis, but a proper model is required.
It is not necessary to have a model for any probability, since the standard thresholds for rejecting an agreement are commonly set to α = 0.2, 0.1, 0.05, 0.02, 0.01 (α = 1 − p). A reliable result could be considered the model for the AD when p ≥ 0.5. Therefore, the AD (as AD = AD(n,p)) for 501 value of the p from 0.500 to 0.001, and for n from 2 to 61 were extracted, tabulated, and used to develop the model.
A search for a dependency of AD = AD(p) (or p = p(AD)) for a particular n may not reveal any pattern. However, if the value of the statistic is exponentiated (see the ln function in the AD formula), values for the model start to appear (see Figure 1a) after a proper transformation of p. On the other hand, for a given n, an inconvenience for the AD(p) (or for its inverse, p = p(AD)) is to have on the plot, a non-uniform repartition of the points-for instance, precisely two points for 5 ≤ AD < 6 and 144 points for AD < 1. As a consequence, any method trying to find the best fit based on this raw data will fail because it will give too much weight on the lower part with a much higher concentration of the points. The problem is the same for exp(AD) replacing AD (Figure 1b) but is no more the case for 1/(1 − p) as a function of exp(AD) (Figure 1c), since the dependence begins to look like a linear one. Figure 1b suggests that a logarithm on both axes will reduce the difference in the concentration of points in the intervals (Figure 1d), but at this point, is not necessary to apply it, since the last spots in Figure 1c may act as "outliers" trailing the slope. A good fit in the rarefied region of high p (and low α) is desired. It is not so important if we will have a 1% error at p = 50%, but is essential not to have a 1% error at p = 99% (the error will be higher than the estimated probability, α = 1 − p. Therefore, in this case (Figure 1c), big numbers (e.g., ~200, 400) will have high values of residuals, and will trail the model to fit better in the rarefied region.
A simple linear regression y ~ ŷ = a•x + b for x ← e AD and y ← α − 1 = 1/(1 − p) will do most of the job for providing the values of α associated with the values of the AD. Since the dependence is almost linear, polynomial or rational functions will perform worse, as proven in the tests. A better alternative is to feed the model with fractional powers of x. By doing this, the bigger numbers will not be disfavored (square root of 100 is 10, which is ten times lower than 100, while square root of 1 is 1; thus, the weight of the linear component is less affected for bigger numbers). On the other hand, looking to the AD definition, the probability is raised at a variable power, and therefore, to turn back to it, in the conventional sense of operation, is to do root. Our proposed model is given in Equation The statistics associated with the proposed model for data presented in Figure 1 are given in Table 6.  The analysis of the results presented in Table 6 showed that all coefficients are statistically significant, and their significance increases from the coefficient of AD 1/4 to the coefficient of the AD. Furthermore, the residuals of the regression are with ten orders of magnitude less than the total residuals (F value = 3.4 × 10 10 ). The adjusted determination coefficient has eight consecutive nines.
The model is not finished yet, because we need a model that also embeds the sample size (n). Inverse powers of n are the best alternatives as already suggested in the literature [43]. Therefore, for each coefficient (from a0 to a4), a function penalizing the small samples was used similarly: With these replacements, the whole model providing the probability as a function of AD statistic and n is given by Equation (5) where ŷ = 1/ (1 − p), bi,j = coefficients, x = e AD , n = sample size.

Simulation Results
Twenty-five coefficients were calculated for Equation (5) from 60 values associated to sample sizes from 2 to 61, based on 500 values of p (0.500 ≤ p ≤ 0.999) and with a step of 0.001. The values of the obtained coefficients along with the related Student t-statistic are given in Table 7.

Stratified vs. Random
The same experiment was conducted with both simple and random stratified Mersenne Twister method [53] to assess the magnitude of the increases in the resolution of the AD statistic. The differences between the two scenarios were calculated and plotted in Figure 2.

Analysis of Residuals
The residuals, defined as the difference between the probability obtained by Monte Carlo simulation and the value estimated by the proposed model, without and with transformation (ln and respectively log), were analyzed. For each probability (p ranging from 0.500 to 0.999 with a step of 0.001; 500 values) associated with the statistic (AD) based on the MC simulation for n ranging from 2 to 61 (60 values), 30,000 distinct pairs (p, n, AD) were collected and investigated. The descriptive statistics of residuals are presented in Table 8.
The most frequent value of residuals (~99%) equals with 0.000007 when no transformed data are investigated (Figure 3, left-hand graph). The right-hand chart in Figure 3 depicted the distribution of the same data, but expressed in logarithmical scale, showing a better agreement with normal distribution for the transformed residuals.   , and for n in the same range (from 2 to 61; 60 values) was extracted from the whole pool of data, and a 3D mesh with 6000 grid points was constructed. Figure 4 represents the differences log10( p p− ) ( p is calculated with Equation (5)) and the values of the bi,j coefficients given in Table 4. For convenience, the equation for p and (α ≡ 1 × p) are  Figure 4 reveals that the calculated Equation (5) and the expected values (from MC simulation for AD = AD(p,n)) differ less than 1‰ (−3 on the top of the Z axis). Even more than that, with departure from n = 2, and from p = 0.500 to n = 61, or to p = 0.999, the difference dramatically decreases to 10 −6 (visible on the Z-axis as −6 moving from n = 2 to n = 61), to 10 −9 (visible on the plot visible on X-axis as −9 moving from p = 0.500 to p = 0.995), and even to 10 −15 (visible on the plot on Z-axis as −15 moving on both from p = 0.500 to p = 0.995 and from n = 2 to n = 61). This behavior shows that the model was designed in a way in which the estimation error ( p p− ) would be minimal for small α (α close to 0; p close to 1). A regular half-circle shape pattern, depicted in Figure 4, suggests that an even more precise method than the one archived by the proposed model must be done with periodic functions.    Median of residuals expressed in logarithmic scale indicate that half of the points have exactly seven digits (e.g., 0.98900000 vs. 0.98900004). The cumulative frequencies for the residuals represented in logarithmic scale also show that 75% have exactly six digits, while over 99% have exactly five digits. The agreement between the observed Monte Carlo and the regression model is excellent (r 2 (n = 30,000) = 0.99999) with a minimum value for the sum of squares of residuals (0.002485). These results sustain the validity of the proposed model.

Case Study
Twenty sets of experimental data (Table 9) were used to test the hypothesis of the normal distribution: H0: The distribution of experimental data is not significantly different from the theoretical normal distribution.
H1: The distribution of experimental data is not significantly different from the theoretical normal distribution.  The concentration of spermatozoids (millions/mL) in males with ankylosing spondylitis 60 [69] 19 The parameter of the Poisson distribution 31 [70] 20 Corolla diameter of Calendula officinalis L. for Bon-Bon Mix × Bon-Bon Orange 28 [71] Experimental data were analyzed with EasyFit Professional (v. 5.2) [72], and the retrieved AD statistic, along with the conclusion of the test (Reject H0?) at a significance level of 5% were recorded. The AD statistic and the sample size for each dataset were used to retrieve the p-value calculated with our method. As a control method, the formulas presented in Table 3 [43], implemented in an Excel file (SPC for Excel) [47], were used. The obtained results are presented in Table 10.
A perfect concordance was observed in regard to the statistical conclusion regarding the normal distribution, when our method was compared to the judgment retrieved by EasyFit. The concordance of the results between SPC and EasyFit, respectively, with the proposed method, was 60%, with discordant results for both small (e.g., n = 24, set 1) samples as well as high (e.g., n = 70, set 9) sample sizes. Normal probability plots (P-P) and the quantile-quantile plots (Q-Q) of these sets show slight, but not significant deviations from the expected normal distribution ( Figure 6).
Without any exceptions, the p-values calculated by our implemented method had higher values compared to the p-values achieved by SPC for Excel. The most substantial difference is observed for the largest dataset (set 8), while the smallest difference is noted for the set with 45 experimental data values (set 15). The lowest p-value was obtained by the implemented method for set 3 (see Table 10); the SPC for Excel retrieves, for this dataset, a value of 0.0000. The next smallest p-value was observed for set 8. For both these sets, an agreement related to the statistical decision was found (see Table 10).   Figure 6. Normal probability plots (P-P) and quantile-quantile plot (Q-Q) by example: graphs for set 9 (n = 70) in the first row, and for set 11 (n = 40) in the second row.
Our team has previously investigated the effect of sample size on the probability of Anderson-Darling test, and the results are published online at http://l.academicdirect.org/Statistics/tests/AD/. The method proposed in this manuscript, as compared to the previous one, assures a higher resolution expressed by the lower unexplained variance between the AD and the model using a formula with a smaller number of coefficients. Furthermore, the unexplained variance of the method present in this manuscript has much less weight for big "p-values", and much higher weight for small "p-values", which means that is more appropriate to be used for low (e.g., p ~10 −5 ) and very low (p ~10 −10 ) probabilities.
Further research could be done in both the extension of the proposed method and the evaluation of its performances. The performances of the reported method could be evaluated for the whole range of sample sizes if proper computational resources exist. Furthermore, the performance of the implementation could be assessed using game theory and game experiments [73,74] using or not using diagnostic metrics (such as validation, confusion matrices, ROC analysis, analysis of errors, etc.) [75,76].
The implemented method provides a solution to the calculation of the p-values associated with Anderson-Darling statistics, giving proper weight to the sample size of the investigated experimental data. The advantage of the proposed estimation method, Equation (5), is its very low residual (unexplained variance) and its very high estimation accuracy at convergence (with increasing of in and for p near 1). The main disadvantage is related to its out of range p-values for small AD values, but an extensive simulation study could solve this issue. The worst performances of the implemented methods are observed when simultaneously n is very low (2 or 3) and p is near 0.5 (50-50%).