Some New Tests of Conformity with Benford’s Law

This paper presents new perspectives and methodological instruments for verifying the validity of Benford's law for a large given dataset. To this aim, we first propose new general tests for checking the statistical conformity of a given dataset with a generic target distribution; we also provide the explicit representation of the asymptotic distributions of the relevant test statistics. Then, we discuss the applicability of such novel devices to the case of Benford's law. We implement extensive Monte Carlo simulations to investigate the size and the power of the introduced tests. Finally, we discuss the challenging theme of interpreting, in a statistically reliable way, the conformity between two distributions in the presence of a large number of observations.


Introduction
Data regularities are relevant properties of many datasets whose elements maintain their individuality while creating a unified framework. One of the most illustrative examples of such statistical features is Benford's law, introduced in [1] and successfully tested and described in [2]. Benford's law is a sort of magic rule, whereby the first digit(s) of the elements of a given dataset follow a specific distribution, hereafter called Benford's distribution. For all the details on such a law, we refer the interested reader to [3][4][5][6].
A key methodological aspect of Benford's law is how to test the compliance of the empirical distribution of a given sample with Benford's distribution. At the root of this issue is the definition of a statistical distance between two random variables, the most popular choices being the chi-square and the mean absolute deviation (MAD).
This paper deals with this challenging research theme. Specifically, we advance herein some new tests for verifying the compliance of the empirical distribution obtained from a given population with a target distribution. In this respect, we mention the recent contribution [32], where the authors suggested a statistical test based on the mean. Following the quoted paper, we start by introducing a mean-based conformity test. Moreover, we also develop a variance-based and a joint mean and variance-based test for verifying the compliance of a given distribution with a target one. Furthermore, we also present a test based on Wald's statistic and a new version of a MAD-based test. We explore the asymptotic distributions of the proposed test statistics.

New Tests of Conformity with Benford's Law
In this Section, we report the analytical derivations of the new test statistics of conformity to Benford's law and their asymptotic distributions.

Proposition 1. Consider a random sample $x_1, \ldots, x_n$ from a population with mean $\mu$, variance $\sigma^2$, and third and fourth central moments $\mu_3$ and $\mu_4$. All moments up to the fourth are assumed to be finite. Let $\bar{x}_n$ and $s^2_n$ be the sample mean and the sample variance, respectively. Then:

$$\sqrt{n}\,\frac{\bar{x}_n-\mu}{\sigma}\ \xrightarrow{d}\ N(0,1), \qquad (1)$$

$$\sqrt{n}\,\frac{s^2_n-\sigma^2}{\sqrt{\mu_4-\sigma^4}}\ \xrightarrow{d}\ N(0,1), \qquad (2)$$

$$\sqrt{n}\,\left(z_n-\theta\right)\ \xrightarrow{d}\ N_2(0,\Sigma), \quad \theta:=(\mu,\sigma^2)', \quad \Sigma:=\begin{pmatrix}\sigma^2 & \mu_3\\ \mu_3 & \mu_4-\sigma^4\end{pmatrix}, \qquad (3)$$

$$n\,(z_n-\theta)'\,\Sigma^{-1}\,(z_n-\theta)\ \xrightarrow{d}\ \chi^2(2), \qquad (4)$$

where $z_n:=(\bar{x}_n,s^2_n)'$, with $z'_n$ denoting the transpose of $z_n$.
Let us now define $z_n := (\bar{x}_n, s^2_n)'$. The Cramér–Wold device implies that $z_n$ is (asymptotically) bivariate normal if $\lambda' z_n$ is (asymptotically) univariate normal for every $\lambda \in \mathbb{R}^2$. However, every $\lambda \in \mathbb{R}^2$ defines a linear combination of two (asymptotically) normal variables, and $\lambda' z_n$ is trivially (asymptotically) univariate normal. Therefore, (3) holds, and (4) follows by the continuous mapping theorem.
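To make the mapping from Proposition 1 to a usable test concrete, the following Python sketch computes Benford's first-digit moments and the joint chi-square statistic of (4). The paper's own computations were carried out in R; this translation, and all function names in it, are ours, and should be read as an illustrative sketch rather than the authors' implementation.

```python
import math

# First-digit Benford probabilities p_d = log10(1 + 1/d), d = 1..9.
digits = range(1, 10)
p = {d: math.log10(1 + 1 / d) for d in digits}

# Mean, variance, and third and fourth central moments of Benford's
# first-digit distribution (symbols follow the paper).
mu = sum(d * p[d] for d in digits)
sigma2 = sum((d - mu) ** 2 * p[d] for d in digits)
mu3 = sum((d - mu) ** 3 * p[d] for d in digits)
mu4 = sum((d - mu) ** 4 * p[d] for d in digits)

def joint_chi2_stat(xbar, s2, n):
    """Chi-square(2) statistic of Eq. (4): quadratic form in the centred
    sample mean and variance, with covariance matrix
    Sigma = [[sigma2, mu3], [mu3, mu4 - sigma2**2]]."""
    a, b, c = sigma2, mu3, mu4 - sigma2 ** 2
    det = a * c - b * b
    u, v = xbar - mu, s2 - sigma2
    # n * (u, v) Sigma^{-1} (u, v)'
    return n * (c * u * u - 2 * b * u * v + a * v * v) / det
```

Under the null, values of the statistic above 5.991 (the 95th percentile of the chi-square distribution with 2 degrees of freedom) lead to rejection at the 5% level.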

Remark 1.
The results stated in Proposition 1 can be used to test conformity (goodness of fit) with any given distribution with finite moments up to the fourth. If µ, σ, µ 3 , and µ 4 are those of Benford's distribution, then Equation (1) can be used to build a conformity test based on the mean: such a test has indeed recently been suggested by Hassler and Hosseinkouchack [32]. Equation (2) is the basis for a normal conformity test based on the variance, whereas (3) can be used to build a normal conformity test jointly based on the mean and the variance. Finally, (4) is a chi-square conformity test which jointly considers the mean and the variance.

Remark 2.
When conformity is tested with reference to the normal distribution, (4) simplifies because µ3 = 0: indeed, under normality, the sample mean and the sample variance are independent, whereas for any other distribution they are not independent random variables, as can be seen in [36].
Using the fact that, when $(X, Y)$ has a bivariate normal distribution with means 0, variances 1, and correlation $\theta$, then [39]:

$$E(|XY|) = \frac{2}{\pi}\left(\sqrt{1-\theta^{2}} + \theta \arcsin\theta\right),$$

and therefore:

$$\mathrm{Cov}\left(|e^{\dagger}_{ni}|, |e^{\dagger}_{nj}|\right) \to \frac{2}{\pi}\left(\sqrt{1-\rho_{ij}^{2}} + \rho_{ij}\arcsin\rho_{ij} - 1\right),$$

where $\rho_{ij}$ is the correlation between $e^{\dagger}_{ni}$ and $e^{\dagger}_{nj}$. Then, note that:

$$\rho_{ij} = -\sqrt{\frac{p_i\,p_j}{(1-p_i)(1-p_j)}}, \qquad i \neq j.$$

Therefore, the covariance matrix $R$ is:

$$R = (r_{ij}),$$

with:

$$r_{ii} = 1-\frac{2}{\pi}, \qquad r_{ij} = \frac{2}{\pi}\left(\sqrt{1-\rho_{ij}^{2}} + \rho_{ij}\arcsin\rho_{ij} - 1\right), \quad i \neq j.$$

Finally, the statistic in (10) converges in distribution to that of $\frac{1}{k}\sum_{j=1}^{k}|Z_j|$, where the $Z_j$ are standard normal variables with correlations $\rho_{ij}$, so that the absolute values $|Z_j|$ have mean $\sqrt{2/\pi}$ and covariance matrix $R$.

Remark 3. The results stated in Proposition 2 can be used to test conformity (goodness of fit) with any given discrete distribution, and specialise to the first digit or first two digits Benford's law when $p_d = \log_{10}(1 + 1/d)$, with either $d = 1, \ldots, 9$ or $d = 10, \ldots, 99$. Here, (9) is a Wald-like test, whereas (10) is a modification of the mean absolute deviation (MAD) statistic advocated in [3,40], where each absolute deviation is adjusted by the factor $1/\sqrt{p_j(1-p_j)}$, thereby emphasising deviations from smaller expected frequencies, as well as incorporating (the square root of) the sample size $n$ as a factor in the measure of deviation.
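As an illustration of Remark 3, a minimal Python sketch of the adjusted MAD just described follows. The exact scaling (each absolute deviation divided by √(p_j(1−p_j)), the average multiplied by √n) is our reading of the textual description of (10), and the function names are ours.

```python
import math

def benford_probs(two_digits=False):
    """Benford probabilities p_d = log10(1 + 1/d) for the first digit
    (d = 1..9) or the first two digits (d = 10..99)."""
    dd = range(10, 100) if two_digits else range(1, 10)
    return [math.log10(1 + 1 / d) for d in dd]

def adjusted_mad(counts, probs):
    """Adjusted MAD in the spirit of Eq. (10): absolute deviations of the
    observed relative frequencies from probs, each scaled by
    1/sqrt(p_j (1 - p_j)), averaged and multiplied by sqrt(n)."""
    n = sum(counts)
    k = len(probs)
    e = [c / n - pj for c, pj in zip(counts, probs)]   # e_nj
    return (math.sqrt(n) / k) * sum(
        abs(ej) / math.sqrt(pj * (1 - pj)) for ej, pj in zip(e, probs))
```

Because of the √n factor, the statistic does not degenerate as the sample grows, unlike the unadjusted MAD discussed in Remark 5.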

Remark 4.
The Wald-like $\chi^2$ statistic in (9) is equivalent to the usual $\chi^2$ computed as $n \sum_{j=1}^{k} e^2_{nj}/p_j$. A proof, which also shows that $\Sigma^{*}$ is nonsingular, is offered in Appendix A.

Remark 5.
Equation (10) makes it clear that, contrary to what is commonly asserted (as can be seen in, e.g., [3] (p. 158)), the MAD statistic:

$$\mathrm{MAD} = \frac{1}{k}\sum_{j=1}^{k}|e_{nj}|$$

is not independent of $n$ and is, in fact, $O_p(n^{-1/2})$.

Monte Carlo Simulations
The size (the probability of falsely rejecting the null hypothesis) and power (the ability of the test to reject the null when it is false) of the proposed tests are investigated over 25,000 Monte Carlo replications for varying sample sizes n, under the null and under selected interesting alternatives (all computations and graphics were produced using R, version 4.0.5 [41], and ggplot2, version 3.3.3 [42]). Each alternative is expressed in terms of the mixture:

$$p = \lambda\, p_B + (1-\lambda)\, p_A,$$

where p_B := (p_B1, ..., p_Bk) is the vector of Benford's probabilities, p_A := (p_A1, ..., p_Ak) is the vector of probabilities of some "contaminating" distribution, k is the number of digits, and λ ∈ {0.75, 0.80, ..., 0.95} is the mixing parameter. When dealing with data manipulation issues, 1 − λ can be interpreted as the fraction of manipulated data.
The following mixtures were used in the simulations:
1. Uniform mixture: p_A describes the discrete uniform distribution with the same support as the considered Benford's distribution;
2. Normal mixture: the p_Ai are the probabilities of N(µ_B, σ²), with µ_B the mean of Benford's distribution and σ = 4µ_B;
3. Randomly perturbed mixture: Benford's law is perturbed by a random quantity at each digit; more precisely, p_Ai = u_i p_Bi with u_i ∼ U(0, 2p_Bi). Since this mixture contains elements of randomness, each Monte Carlo iteration uses a different mixture; the mixtures are, however, the same across all tests;
4. Under-reporting mixture: under the alternative, Benford's distribution is modified by setting to zero the probability of "round" numbers and assigning this probability to the preceding number: for example, p_A20 = 0 and p_A19 = p_B19 + p_B20. This mixture is only considered with reference to the first two digits case.
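As an illustration, the uniform mixture (case 1 above) can be generated as follows; this is a hypothetical Python sketch (the paper's simulations were run in R), with λ = 0.85 and n = 1000 as arbitrary example values.

```python
import math
import random

def benford_probs():
    # First-digit Benford probabilities p_d = log10(1 + 1/d), d = 1..9.
    return [math.log10(1 + 1 / d) for d in range(1, 10)]

def mixture(p_b, p_a, lam):
    # p = lambda * p_B + (1 - lambda) * p_A
    return [lam * b + (1 - lam) * a for b, a in zip(p_b, p_a)]

def sample_counts(p, n, rng):
    """One multinomial draw of size n from probability vector p,
    via inverse-cdf categorical sampling."""
    cum, s = [], 0.0
    for pj in p:
        s += pj
        cum.append(s)
    cum[-1] = 1.0          # guard against floating-point shortfall
    counts = [0] * len(p)
    for _ in range(n):
        u = rng.random()
        for j, cj in enumerate(cum):
            if u <= cj:
                counts[j] += 1
                break
    return counts

p_b = benford_probs()
p_a = [1 / 9] * 9                          # discrete uniform, same support
p = mixture(p_b, p_a, lam=0.85)
counts = sample_counts(p, n=1000, rng=random.Random(42))
```

The same skeleton covers the other mixtures by swapping in a different p_A.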
The above mixtures are plotted in Figure 1 for the first two digits case. The corresponding data for each mixture are generated from a multinomial distribution with probability vector p. In order to reduce Monte Carlo variability, all tests were applied to the same data, and larger samples include the observations from the smaller ones. Rather than reporting long and difficult-to-compare tables of outcomes, we summarise the simulation results by relying on a graphical approach (as can be seen in, e.g., [43,44]). To summarise the size properties of the tests, we plot the size deviations (i.e., actual size − nominal size) against the nominal size. When no size distortions are present, actual size = nominal size, and this graph coincides with a horizontal line at ordinate zero; this is, however, a theoretical case only, since in practice size deviations will reflect experimental randomness. To report power results, we use size-power curves: these curves allow us to easily visualise the power of each test at its actual (rather than nominal) size and to compare the power of different tests on perfectly fair grounds. The line power = actual size is also reported as a reference, representing the performance of a test of no practical use (the fraction of rejections under the null and under the alternative is the same); the more distant the size-power curve is from this line, the more powerful the test.
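The size-deviation curve just described can be computed from simulated p-values as in the following Python sketch; this is an illustration of the construction with names of our choosing (the paper produced these plots in R/ggplot2).

```python
def size_deviation(null_pvalues, nominal_sizes):
    """For each nominal size a, the actual size is the fraction of
    p-values (simulated under the null) not exceeding a; the size-deviation
    plot shows actual - nominal against the nominal size."""
    n = len(null_pvalues)
    return [sum(p <= a for p in null_pvalues) / n - a for a in nominal_sizes]
```

For a correctly sized test, the resulting curve hovers around zero, up to experimental randomness.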

First-Digit Law
The tests generally have very good size properties, irrespective of the sample size, with size deviations of approximately zero (see Figure 2). Only the modified MAD test tends to over-reject slightly under the null (with a +0.01 deviation with respect to the nominal size) at the 5% nominal size. In other words, the actual size of the modified MAD test at the 5% nominal size is around 6%, and the discrepancy tends to shrink for larger nominal sizes.
As far as power is concerned, the performance of the different tests depends on the specific alternative hypothesis considered. The normal mean test (1) is the most powerful in the presence of a uniform mixing alternative (Figure 3), followed by the χ 2 (2) test on the mean and the variance (4) and the normal test on the mean and the variance (3).
In the presence of a normal mixing alternative (Figure 4), the χ²(2) test (4) and the normal test on the mean (1) perform best, followed by the adjusted MAD (10) and the Wald-like χ²(d − 1) test (9). Finally, in the presence of a perturbed Benford distribution (Figure 5), the highest power is reached by the χ²(d − 1) (9) and the adjusted MAD (10) tests, followed by the χ²(2) test (4).

First Two Digits Law
All the tests have approximately the correct size, even in the presence of fairly small samples (see Figure 6). All deviations with respect to the nominal size are within ±0.005, with the only exception of the ordinary chi-square test, which shows a deviation around 0.010 at commonly used nominal sizes for n = 250. As anticipated, the power performance of the tests crucially depends on the alternative. The normal test based on the mean (1) is the most powerful test among those considered here in the presence of a uniform mixing alternative (see Figure 7). The χ²(2) test on the mean and variance (4) and the normal test on the mean and variance (3) follow at a short distance.
In the presence of a normal mixing alternative (see Figure 8), the χ 2 (2) test (4) is the most powerful one, followed by the normal variance test (2). It is interesting to note that in the first digit case, the normal variance test had no power; here, the normal mean test has no power. The other tests are generally more powerful in the first two digits than in the first digit case.
When the alternative can be described as a "perturbed Benford" distribution (Figure 9) or in terms of a rounding behaviour (Figure 10), then the χ²(d − 1), either in the "classical" or in the equivalent Wald formulation (9), and the modified MAD (10) perform very closely and are by far the most powerful tests. The ordering of the tests is the same as in the first digit case; however, the tests are generally more powerful in the first digit case.
These results suggest that in applications it is generally a good idea not to rely on a single test, but to use a battery of different tests designed to detect particular deviations from the null.

Statistical versus Practical Significance
In 1998, Granger [45] (p. 260) pointed out that in the presence of very large datasets: "Virtually all specific null hypotheses will be rejected using present standards. It will probably be necessary to replace the concept of statistical significance with some measure of economic significance." This is obviously related to the fact that the power of any consistent test increases with the sample size n, i.e., π → 1 as n → ∞ (with π denoting the power of the test). Of course, consistency is a desirable property of any statistical test. The symmetrical case, with small n, is somewhat less relevant in empirical applications of Benford's law, where typical sample sizes are large. However, it has been observed that standard conformity tests may substantially lack power in the presence of small sample sizes (see, e.g., [12]); in our context, moreover, a large n is required for the asymptotic distributions of the tests to provide a good approximation.
In fact, the "large n problem" and some of its apparently paradoxical implications were already highlighted in a paper by Lindley in 1957 [46]. The idea that a "large n problem" plagues empirical tests of conformity with Benford's distribution is widespread in the literature on Benford's law (as can be seen in, e.g., Nigrini's contributions [3,40] and Kossovsky's paper in this Special Issue [12]). Indeed, Nigrini [3] (p. 158) claims that: "What is needed is a test that ignores the number of records. The mean absolute deviation (MAD) test is such a test, and the formula is shown in Equation 7.7. [. . . ] There is no reference to the number of records, N, in Equation 7.7." However, Nigrini's statement that the MAD does not depend on the number of observations would only be valid if the relative frequencies of the digits were given, not estimated. The fact that the relative frequencies must be estimated from the observed data makes the MAD dependent on the sample size, despite the sample size not explicitly appearing in the MAD formula. In fact, in Proposition 2, we show that Nigrini's MAD is O_p(n^{−1/2}) under Benford's distribution (see Remark 5 above). Indeed, Figure 11 clearly shows that the behaviour of the estimated MAD is perfectly consistent with 1/√n under the null: therefore, taking a fixed "critical value" for the MAD irrespective of the sample size may lead to biased conclusions.
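The O_p(n^{−1/2}) behaviour of the MAD can be checked with a quick simulation under the first-digit null. In the following Python sketch, the sample sizes, replication counts, and seed are our arbitrary choices, far smaller than the experiment behind Figure 11; the point is simply that quadrupling n roughly halves the average MAD.

```python
import math
import random

probs = [math.log10(1 + 1 / d) for d in range(1, 10)]  # first-digit Benford
cum = [sum(probs[:j + 1]) for j in range(9)]
cum[-1] = 1.0                                          # floating-point guard

def mad(counts):
    # Nigrini-style MAD: average absolute deviation of the observed
    # relative frequencies from the Benford probabilities.
    n = sum(counts)
    return sum(abs(c / n - pj) for c, pj in zip(counts, probs)) / len(probs)

def mean_mad(n, reps, rng):
    # Average MAD over `reps` samples of size n drawn under the null.
    total = 0.0
    for _ in range(reps):
        counts = [0] * 9
        for _ in range(n):
            u = rng.random()
            for j, cj in enumerate(cum):
                if u <= cj:
                    counts[j] += 1
                    break
        total += mad(counts)
    return total / reps

rng = random.Random(1)
m_small, m_large = mean_mad(500, 200, rng), mean_mad(2000, 200, rng)
ratio = m_small / m_large   # close to 2, i.e. to sqrt(2000/500)
```

A fixed critical value for the MAD therefore corresponds to very different effective significance levels at different sample sizes.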
The risk of rejecting the (Benford's law) null hypothesis for tiny uninteresting deviations in the presence of large datasets can be dealt with in two different ways: (i) using significance levels α n decreasing with increasing n; and (ii) using a sort of "m out of n bootstrap" procedure [47] to assess significance. In what follows, we explain this second route with specific reference to the "first two digits" case.
If the available sample is very large (e.g., n > 3000), then the idea is to repeatedly test for conformity on a large number of smaller samples randomly resampled from the original data. If the observations are independent and identically distributed (IID), then the smaller samples will have the same distribution as the original data, making it possible to check conformity on the smaller datasets. In doing so, we sacrifice some power in order to only detect "interesting" (or sizeable) departures from the null. The fact that the test statistics are computed over a large number of random subsamples allows us to derive the distribution of the statistics rather than relying on a single outcome. The whole procedure is exemplified in Figure 12 in the case of data conforming with the "first two digits" Benford's law (first row in the Figure), as well as for a possibly uninteresting deviation from the null (second row) and a more substantial deviation from the null (third row). In this example, the random subsamples were made of 1750 observations, consistently with Figure 11, which indicates that using 0.0022 as the "critical value" for the MAD with n = 1750 ensures an approximate size of 5% for Nigrini's test. The tests considered are the MAD and those that, according to our simulations, are the most powerful in the presence of a perturbed Benford's alternative (see Figure 9). The third column (panels C, F, I) of Figure 12 reports the estimated densities of the conventional (or Wald) chi-square test statistic over 5000 random subsamples of length n = 1750 (blue curve), along with the χ²(89) null distribution (red). The probability of superiority (a measure of effect size that corresponds to the probability that a randomly chosen point under the experimental curve is larger than a randomly chosen point under the null curve: see, e.g., [48] (Chapter 11)) is also reported to compare the two distributions.
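The resampling scheme just outlined can be sketched in Python as follows; this is a hypothetical first-digit illustration with our own function names (the paper's exercise uses the first two digits, subsamples of n = 1750, and 5000 replications).

```python
import math
import random

def chi2_stat(counts, probs):
    """Ordinary chi-square goodness-of-fit statistic,
    n * sum_j (f_j - p_j)**2 / p_j, with f_j the relative frequencies."""
    n = sum(counts)
    return n * sum((c / n - pj) ** 2 / pj for c, pj in zip(counts, probs))

def subsample_stats(digits, m, n_sub, probs, rng):
    """Compute the chi-square statistic on n_sub random subsamples of
    size m drawn (without replacement) from the observed digit list."""
    stats = []
    for _ in range(n_sub):
        sub = rng.sample(digits, m)            # m-out-of-n resampling
        counts = [sub.count(d) for d in range(1, 10)]
        stats.append(chi2_stat(counts, probs))
    return stats
```

The empirical distribution of the returned statistics can then be compared with the χ²(k − 1) null distribution, e.g., via the probability of superiority.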
Panels A-C in Figure 12 show that the null of conformity is not rejected: this conclusion carries over using the full sample (panel A) as well as using a single subsample (panel B) or 5000 random subsamples (panel C). The null is rejected in the full sample under the "uninteresting" alternative using either the chi-square or the adjusted MAD test, but it is not rejected using the fixed "critical value" 0.0022 for the MAD (panel D). Using the subsamples, none of the criteria are able to decidedly reject the null, suggesting that the deviation of the data from the null is tiny. When the deviation is substantial (panels G-I), the MAD still cannot reject the null in the full sample (panel G), whereas the p value of the other two tests is virtually zero. In the single subsample, all three criteria correctly reject the null of conformity (panel H), and panel I shows that the "effect size" on the chi-square test is substantial, with the probability of superiority being approximately 0.9.

Figure 12. Behaviour of conformance tests across samples. In the first row (panels A-C), data conform to the "first two digits" Benford's law. In the second row (panels D-F), data follow a perturbed Benford's law with λ = 0.95. In the third row (panels G-I), data are consistent with a perturbed Benford's law with λ = 0.75. The first column (panels A,D,G) reports the results computed over the full sample, with n = 15,000. The second column (panels B,E,H) is relative to a single random subsample with n = 1750. The third column (panels C,F,I) reports the estimated densities (blue) of the conventional (or Wald) chi-square test statistic over 5000 random subsamples of length n = 1750, along with the χ²(89) distribution under the null (red). P(χ²(89)) and P(Adj.MAD) denote the p values of the conventional (or Wald) chi-square test and of the adjusted MAD test, respectively. Prob. of sup. is an estimate of the probability of superiority.

Conclusions
This paper introduces new tests of conformance with a given distribution with finite first four moments. The tests are then specialised to the cases of the first digit and first two digits Benford's law. An extensive Monte Carlo analysis was carried out to study the size and power properties of the tests. The results show that it is advisable to use several different tests in real applications, given that the tests perform differently according to the nature of the alternative hypothesis. This paper also addresses the "excess of power" problem of the tests in the presence of very large samples: the proposed solution, based on resampling techniques, seems able to reconcile the evidence stemming from the MAD criterion (as can be seen in, e.g., [3]) with firmly statistically based tests.