#### 2.1. Anderson–Darling Order Statistic

For a sample Y = (y_{1}, y_{2}, …, y_{n}), the data are sorted in ascending order: let X = Sort(Y), so that X = (x_{1}, x_{2}, …, x_{n}) with x_{i} ≤ x_{i+1} for 0 < i < n, and x_{i} = y_{σ(i)}, where σ is the permutation of {1, 2, …, n} that sorts the series. Let CDF be the cumulative distribution function associated with a given PDF (probability density function) and InvCDF its inverse. The series P = (p_{1}, p_{2}, …, p_{n}) defined by p_{i} = CDF(x_{i}) (or Q = (q_{1}, q_{2}, …, q_{n}) defined by q_{i} = CDF(y_{i}); P is the sorted array and Q the unsorted one) is a sample drawn from the uniform distribution on [0, 1] if and only if Y (and X) is a sample from the distribution with that PDF.
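This transform (evaluating the fitted distribution's CDF at the sorted data and checking the result for uniformity) can be sketched as follows. A minimal illustration, assuming a standard normal distribution for concreteness; `normal_cdf` is a helper built on `math.erf`, not a function from the paper:

```python
import math
import random

def normal_cdf(v, mu=0.0, sigma=1.0):
    # CDF of the normal distribution, expressed via the error function
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(1000)]  # sample Y
x = sorted(y)                                      # X = Sort(Y)
p = [normal_cdf(v) for v in x]                     # p_i = CDF(x_i), in [0, 1]

# If Y really comes from N(0, 1), P should look uniform on [0, 1]:
mean_p = sum(p) / len(p)
print(round(mean_p, 2))   # close to 0.5 for a uniform sample
```

Because the CDF is monotone, applying it to the sorted X yields an already-sorted P, which is what the order-statistic tests below require.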

At this point, order statistics are used to test the uniformity of P (or of Q), and for this reason the values of Y are sorted (into X). On the ordered probabilities (on P), several statistics can be computed, and Anderson–Darling (AD) is one of them:

The associated AD statistic for a “perfect” uniform distribution can be computed after splitting the [0, 1] interval into n equidistant intervals (i/n, with 0 ≤ i ≤ n, being their boundaries) and using the middles of those intervals, r_{i} = (2i − 1)/(2n):

where H_{1} is the Shannon entropy of R in nats (the natural unit of information or entropy): H_{1}(R, n) = −Σ r_{i}∙ln(r_{i}).

Equation (2) gives the smallest possible value of the AD. The value of the AD increases as the departure between the perfect uniform distribution and the observed distribution (P) increases.

#### 2.3. Stratified Random Strategy

Let us assume that three numbers (t_{1}, t_{2}, t_{3}) are extracted from the [0, 1) interval using the Mersenne Twister method. Each of those numbers can be < 0.5 or ≥ 0.5, providing 2^{3} possible cases (Table 4).

It is not a good idea to use the design presented in Table 4 in its crude form, since it turns into a problem of exponential (2^{n}) complexity. The trick is to observe the pattern in Table 4: in fact, there are only (n + 1) distinct cases, occurring with different frequencies following the model; the results are given in Table 5.

With the design presented in Table 5, the complexity of enumerating all the cases stays at the same order of magnitude as n (we need to list only n + 1 cases instead of 2^{n}).

The frequencies listed in Table 5 are combinations of n objects distributed over the two intervals (k in one interval, n − k in the other), so instead of enumerating all 2^{n} cases, it is enough to record only n + 1 cases weighted with their relative occurrence.

Doing a stratified random sample significantly decreases the effect of the pseudo-random generator (the decrease is exactly one order of magnitude of the binary representation, one unit in log_{2} transformation, 1 = log_{2}2, for the (0, 0.5) and (0.5, 1) split).

The extractions of a number from (0, 0.5) and from (0.5, 1) were furthermore made in our experiment with the Mersenne Twister generator (if x = Random() with 0 ≤ x < 1, then 0 ≤ x/2 < 0.5 and 0.5 ≤ 0.5 + x/2 < 1).

Table 5 provides all the information we need to do the design. For any n, and for k from 0 to n, exactly k numbers are generated as Random()/2 and sorted; furthermore, exactly n − k numbers are generated as 0.5 + Random()/2, and the frequency associated with this pattern is n!/(k!∙(n − k)!).
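The generation pattern just described can be sketched as follows. A minimal illustration using Python's `random` module (which, conveniently, is itself a Mersenne Twister implementation); `stratified_case` and `case_weight` are names introduced here, not taken from the paper:

```python
import math
import random

def stratified_case(n, k, rng=random):
    # One case of the design: k numbers in [0, 0.5), n - k in [0.5, 1),
    # each obtained by halving a [0, 1) draw (x/2 and 0.5 + x/2).
    low = sorted(rng.random() / 2 for _ in range(k))
    high = sorted(0.5 + rng.random() / 2 for _ in range(n - k))
    return low + high  # already globally sorted: every low < 0.5 <= every high

def case_weight(n, k):
    # Relative frequency of this pattern: n! / (k! * (n - k)!)
    return math.comb(n, k)

random.seed(42)
sample = stratified_case(5, 2)
print(sample, case_weight(5, 2))  # the weight is C(5, 2) = 10
```

Concatenating the two sorted halves avoids a full sort of the combined sample, since the strata are disjoint by construction.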

The combinations can also be calculated iteratively: cnk(n,0) = 1, and cnk(n,k) = cnk(n,(k − 1))∙(n − k + 1)/k for successive 1 ≤ k ≤ n.
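The recurrence above translates directly into code; a small sketch (`cnk_row` is a name introduced here, mirroring the `cnk` notation in the text):

```python
def cnk_row(n):
    # Iteratively build C(n, 0), ..., C(n, n) using
    # cnk(n, 0) = 1 and cnk(n, k) = cnk(n, k - 1) * (n - k + 1) / k.
    row = [1]
    for k in range(1, n + 1):
        # The product row[-1] * (n - k + 1) equals C(n, k) * k,
        # so the integer division below is always exact.
        row.append(row[-1] * (n - k + 1) // k)
    return row

print(cnk_row(5))  # [1, 5, 10, 10, 5, 1]
```

Using integer division keeps the coefficients exact for any n, which matters once n!/(k!∙(n − k)!) exceeds what floating point can represent.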

#### 2.4. Model for Anderson–Darling Statistic

Performing the Monte Carlo (MC) experiment (generate, analyze, and provide the outcome) each time a probability associated with the AD statistic is needed is resource-consuming and inefficient. For example, if we generate, for a certain sample size (n), a large number of samples m = 1.28 × 10^{10}, then the needed storage space is 51.2 GB for each n. Given 1 TB of storage capacity, only about 20 values of n in the AD(n) series can be stored. However, storing all of this is not needed, since it is possible to generate the Monte Carlo results once and summarize them, but a proper model is required.
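The storage figure can be verified with a quick computation, assuming each simulated AD value is stored as a 4-byte single-precision float (an assumption on our part, but the one that reproduces the stated 51.2 GB per sample size):

```python
m = 1.28e10          # simulated samples per sample size n
bytes_per_value = 4  # assumed: one single-precision float per AD value

per_n_gb = m * bytes_per_value / 1e9
print(per_n_gb)                        # 51.2 GB per value of n
print(1e12 / (per_n_gb * 1e9))         # about 20 values of n fit in 1 TB
```

At roughly 19.5 sample sizes per terabyte, tabulating the AD(n) series for n up to even 61 is clearly impractical without a model.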

It is not necessary to have a model for every probability, since the standard thresholds for rejecting an agreement are commonly set at α = 0.2, 0.1, 0.05, 0.02, and 0.01 (α = 1 − p). The model for the AD could be considered reliable when p ≥ 0.5. Therefore, the AD values (as AD = AD(n, p)) for 501 values of p from 0.500 to 0.001 and for n from 2 to 61 were extracted, tabulated, and used to develop the model.

A search for a dependency AD = AD(p) (or p = p(AD)) for a particular n may not reveal any pattern. However, if the value of the statistic is exponentiated (see the ln function in the AD formula), a pattern starts to appear (see Figure 1a) after a proper transformation of p. On the other hand, for a given n, an inconvenience of AD(p) (or of its inverse, p = p(AD)) is the non-uniform repartition of the points on the plot; for instance, precisely two points for 5 ≤ AD < 6 and 144 points for AD < 1. As a consequence, any method trying to find the best fit based on these raw data will fail, because it gives too much weight to the lower part, where the concentration of points is much higher. The problem is the same when exp(AD) replaces AD (Figure 1b), but it is no longer the case for 1/(1 − p) as a function of exp(AD) (Figure 1c), since the dependence begins to look linear.

Figure 1b suggests that a logarithm on both axes would reduce the difference in the concentration of points between intervals (Figure 1d), but at this point it is not necessary to apply it, since the last points in Figure 1c may act as “outliers” trailing the slope. A good fit in the rarefied region of high p (and low α) is desired. It is not so important to have a 1% error at p = 50%, but it is essential not to have a 1% error at p = 99% (there, the error would be higher than the estimated probability α = 1 − p). Therefore, in this case (Figure 1c), big values (e.g., ~200, 400) will have high residuals and will trail the model to fit better in the rarefied region.

A simple linear regression, y ~ ŷ = a∙x + b, for x ← e^{AD} and y ← α^{−1} = 1/(1 − p), will do most of the job of providing the values of α associated with the values of the AD. Since the dependence is almost linear, polynomial or rational functions perform worse, as proven in the tests. A better alternative is to feed the model with fractional powers of x. By doing this, the bigger numbers are not disfavored (the square root of 100 is 10, ten times lower than 100, while the square root of 1 is 1; thus, the weight of the linear component is less affected for bigger numbers). On the other hand, looking at the AD definition, the probability is raised to a variable power; therefore, turning back to it, in the conventional sense of the operation, means taking a root. Our proposed model is given in Equation (3):
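The mechanics of such a fit can be sketched with ordinary least squares on the regressors x^{1/4}, x^{1/2}, x^{3/4}, x, where x = e^{AD}. This is only an illustration under our reading of the model: the data and coefficients below are synthetic stand-ins, not the tabulated Monte Carlo values or the fitted Equation (3):

```python
import numpy as np

# Synthetic (AD, y) data built from known (hypothetical) coefficients,
# just to demonstrate the fit on fractional powers of x = exp(AD).
ad = np.linspace(0.3, 5.0, 200)
x = np.exp(ad)
true_coef = np.array([0.7, 0.1, 0.05, 0.02, 0.3])   # hypothetical values

# Design matrix: intercept plus the fractional powers of x.
X = np.column_stack([np.ones_like(x), x**0.25, x**0.5, x**0.75, x])
y = X @ true_coef          # stands in for the observed 1/(1 - p) values

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))   # recovers the generating coefficients
```

The fractional powers grow much more slowly than x itself, so the large-AD points keep a reasonable influence on the fit without a log transform of the axes.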

The statistics associated with the proposed model for the data presented in Figure 1 are given in Table 6.

The analysis of the results presented in Table 6 showed that all coefficients are statistically significant, and their significance increases from the coefficient of AD^{1/4} to the coefficient of AD. Furthermore, the residuals of the regression are ten orders of magnitude smaller than the total residuals (F value = 3.4 × 10^{10}). The adjusted determination coefficient has eight consecutive nines.

The model is not finished yet, because we need a model that also embeds the sample size (n). Inverse powers of n are the best alternative, as already suggested in the literature [43]. Therefore, for each coefficient (from a_{0} to a_{4}), a function penalizing the small samples was used similarly:

With these replacements, the whole model providing the probability as a function of the AD statistic and n is given by Equation (5):

where ŷ = 1/(1 − p), b_{i,j} are the coefficients, x = e^{AD}, and n is the sample size.
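The structure of Equation (5) can be sketched as a function of AD and n. The b_{i,j} values below are placeholders, loudly hypothetical, used only to show the shape of the computation; the fitted coefficients belong to the paper's tables:

```python
import math

def p_from_ad(ad, n, b):
    # Equation (5) structure: y_hat = sum_j a_j(n) * x**(j/4), with x = exp(AD),
    # where each a_j(n) = sum_i b[j][i] / n**i penalizes small samples (Eq. (4)).
    x = math.exp(ad)
    y_hat = sum(
        sum(b_ij / n**i for i, b_ij in enumerate(b_j)) * x ** (j / 4.0)
        for j, b_j in enumerate(b)
    )
    return 1.0 - 1.0 / y_hat  # y_hat estimates 1/(1 - p), so invert

# Placeholder (hypothetical) coefficients: one short series per power of x.
b = [[0.5, 1.0], [0.0], [0.0], [0.0], [0.3]]
print(p_from_ad(1.0, 30, b))
```

With positive coefficients on the powers of x, the returned probability increases with AD, matching the behavior described earlier: larger departures from uniformity yield larger AD and larger p.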