Article

Revisiting the Large n (Sample Size) Problem: How to Avert Spurious Significance Results

Department of Economics, Virginia Tech, Blacksburg, VA 24061, USA
Stats 2023, 6(4), 1323-1338; https://doi.org/10.3390/stats6040081
Submission received: 9 November 2023 / Revised: 26 November 2023 / Accepted: 30 November 2023 / Published: 5 December 2023
(This article belongs to the Section Statistical Methods)

Abstract

Although large data sets are generally viewed as advantageous for their ability to provide more precise and reliable evidence, it is often overlooked that these benefits are contingent upon certain conditions being met. The primary condition is the approximate validity (statistical adequacy) of the probabilistic assumptions comprising the statistical model M_θ(x) applied to the data. In the case of a statistically adequate M_θ(x) and a given significance level α, as n increases, the power of a test increases and the p-value decreases due to the inherent trade-off between the type I and type II error probabilities in frequentist testing. This trade-off raises concerns about the reliability of declaring 'statistical significance' based on conventional significance levels when n is exceptionally large. To address this issue, the author proposes that a principled approach, in the form of the post-data severity (SEV) evaluation, be employed. The SEV evaluation represents a post-data error probability that converts unduly data-specific 'accept/reject H_0' results into evidence either supporting or contradicting inferential claims regarding the parameters of interest. This approach offers a more nuanced and robust perspective in navigating the challenges posed by the large n problem.

1. Introduction

The recent availability of big data with large sample sizes (n) in many scientific disciplines brought to the surface an old and largely forgotten foundational issue in frequentist testing known as the 'large n problem'. It relates to inferring that an estimated parameter is statistically significant based on conventional thresholds, α = 0.1, 0.05, 0.025, 0.01, when n is very large, say n > 100,000.

1.1. A Brief History of the Large n Problem

As early as 1938, Berkson [1] brought up the large n problem by pointing out its effect on the p-value:
“… when the numbers in the data are quite large, the P’s [p-values] tend to come out small… if the number of observations is extremely large—for instance, on the order of 200,000—the chi-square P will be small beyond any usual limit of significance. … If, then, we know in advance the P that will result…, it is no test at all. ”
(p. 527)
In 1935 Fisher [2] pinpointed the effect of the large n problem on the power (he used the term sensitivity) of the test:
“By increasing the size of the experiment [n], we can render it more sensitive, meaning by this that it will allow of the detection of … quantitatively smaller departures from the null hypothesis.”
(pp. 21–22)
In 1942 Berkson [3] returned to the large n problem by arguing that it also affects the power of a test, rendering it relevant for a sound evidential interpretation:
“In terms of the Neyman–Pearson (N-P) formulation they have different powers for any particular alternative, and hence are likely to give different results in any particular case. ”
(p. 334)
One of the examples he uses to make his case is from Fisher's book [4], which concerns testing the linearity assumption of a Linear Regression (LR) model.
Unfortunately, the example allowed Fisher [5] to brush aside the broader issues of misinterpreting frequentist tests as evidence for or against hypotheses. In his response, he focuses narrowly on the particular example and recasts their different perspectives on frequentist testing as a choice between ‘objective tests of significance’ and ‘subjective impressions based on eyeballing’ a scatterplot:
“He has drawn the graph. He has applied his statistical insight and his biological experience to its interpretation. He enunciates his conclusion that ‘on inspection it appears as straight a line as one can expect to find in biological material.’ The fact that an objective test had demonstrated that the departure from linearity was most decidedly significant is, in view of the confidence which Dr. Berkson places upon subjective impressions, taken to be evidence that the test of significance was misleading, and therefore worthless.”
(p. 692)
In his reply, Berkson [6], (p. 243), reiterated the role of the power:
“When with such specific tests one has sufficient numbers, they become sensitive tests; in the terminology of Neyman and Pearson they become ‘powerful’. ”
As a result of the exchange between Berkson and Fisher, the broader issue of 'inference results' vs. 'evidence' was put aside by the statistics literature until the late 1950s. Berkson's example from Fisher [4] was an unfortunate choice since the test in question is not an N-P type test, which probes within the boundaries of the invoked statistical model, M_θ(x), by framing the hypotheses of interest in terms of its unknown parameters θ. It is a misspecification test, which probes outside the boundaries of M_θ(x) to evaluate the validity of its assumptions—the linearity in this case; see Spanos [7].
In 1957 Lindley [8] presented the large n problem as a conflict between frequentist and Bayesian testing by arguing that the p-value will reject H_0: θ = θ_0 as n increases, whereas the Bayes factor will favor H_0. Subsequently, this became known as the Jeffreys–Lindley paradox; see Spanos [9] for further discussion.
In 1958 Lehmann [10] raised the issue of the sample size influencing the power of a test, and proposed decreasing α as n increases to counter-balance the increase in its power:
“By habit, and because of the convenience of standardization in providing a common frame of reference, these values [ α = 0.1 , 0.05 , 0.025 , 0.01 ] became entrenched as conventional levels to use. This is unfortunate since the choice of significance level should also take into consideration the power that the test will achieve against alternatives of interest. There is little point in carrying out an experiment that has only a small chance of detecting the effect being sought when it exists. Surveys by Cohen [11] and Freiman et al. [12] suggest that this is, in fact, the case for many studies. ”
(Lehmann [13], pp. 69–70)
Interestingly, the papers cited by Lehmann [13] were published in psychology and medical journals, respectively, indicating that practitioners in certain disciplines became aware of the small/large n problems and began exploring solutions. Cohen [11,14] was particularly influential in convincing psychologists to supplement, or even replace, the p-value with a statistic that is free of n, known as the 'effect size', to address the large n problem. For instance, Cohen's d = (x̄_n − ȳ_n)/s is the recommended effect size when testing the difference between two means based on the test statistic √N[(X̄_n − Ȳ_n) − γ]/s for H_0: γ = 0 vs. H_1: γ ≠ 0; see Lehmann [13].
In 1982 Good [15], p. 66, went a step further than Lehmann [10] to propose a rule of thumb: "p-values … should be standardized to a fixed sample size, say N = 100", by replacing p_n(x_0) with min{0.5, p_n(x_0)·√(n/100)}, for n > 100.

1.2. Large n Data and the Preconditions for More Accurate and Trustworthy Evidence

Large n data are universally considered a blessing since they can potentially give rise to more accurate and trustworthy evidence. What is often neglected, however, is that the potential for such gains requires certain ‘modeling’ and ‘inference’ stipulations to be met before any such accurate and trustworthy evidence can materialize.
[a] The most crucial modeling stipulation is for the practitioner to establish the statistical adequacy (approximate validity) of the probabilistic assumptions imposed on one's data, say x_0 := (x_1, x_2, ..., x_n), comprising the relevant statistical model M_θ(x); see Spanos [16]. Invalid probabilistic assumptions induce sizeable discrepancies between the nominal error probabilities—derived assuming the validity of M_θ(x)—and the actual ones based on x_0, rendering the inferences unreliable and the ensuing evidence untrustworthy. Applying a 0.05 significance level test when the actual type I error probability is closer to 0.97, due to invalid probabilistic assumptions (Spanos [17], p. 691), will yield spurious inference results and untrustworthy evidence; see Spanos [18] for additional examples.
[b] The most crucial inference stipulation for the practitioner is to distinguish between raw 'inference results', such as point estimates, effect sizes, observed CIs, 'accept/reject H_0', and p-values, and 'evidence for or against inferential claims' relating to unknown parameters θ. Conflating the two often gives rise to fallacious evidential interpretations of such inference results, as well as unwarranted claims, including spurious statistical significance with a large n. The essential difference between 'inference results' and 'evidence' is that the former rely unduly on the particular data x_0 := (x_1, x_2, ..., x_n), which constitutes a single realization X = x_0 of the sample X := (X_1, ..., X_n). In contrast, sound evidence for or against inferential claims relating to θ needs to account for that uncertainty.
The main objective of the paper is to make a case that the large n problem can be addressed using a principled argument based on the post-data severity (SEV) evaluation of the unduly data-specific accept/reject results. This is achieved by accounting for the inference-related uncertainty to provide an evidential interpretation of such results that revolves around the discrepancy γ from the null value warranted by the particular test and data x_0 with a high enough post-data error probability. This provides an inductive generalization of the accept/reject results that enhances learning from data.
As a prelude to the discussion that follows, Section 2 summarizes Fisher’s [19] model-based frequentist statistics, with particular emphasis on Neyman–Pearson (N-P) testing. Section 3 considers the large n problem and its implications for the p-value and the power of an N-P test. Section 4 discusses examples from the empirical literature in microeconometrics where the large n problem is largely ignored. Section 5 explains how the post-data severity evaluation of the accept/reject H 0 results can address the large n problem and is illustrated using hypothetical and actual data examples.

2. Model-Based Frequentist Statistics: An Overview

2.1. Fisher’s Model-Based Statistical Induction

Model-based frequentist statistics was pioneered by Fisher [19] in the form of statistical induction that revolves around a statistical model whose generic form is:
M_θ(x) = {f(x; θ), θ ∈ Θ}, x ∈ R_X^n, for Θ ⊂ R^m, m < n,   (1)
and revolves around the distribution of the sample X := (X_1, X_2, ..., X_n), denoted f(x; θ), x ∈ R_X^n, with R := (−∞, ∞), which encapsulates its probabilistic assumptions; R_X^n denotes the sample space and Θ the parameter space; see Spanos [16].
Unfortunately, the term ‘model’ is used to describe many different constructs across different disciplines. In the context of empirical modeling using statistics, however, the relevant models can be grouped into two broad categories: ‘substantive’ (structural, a priori postulated) and ‘statistical’ models. Although these two categories of models are often conflated, a statistical model M θ ( x ) comprises solely the probabilistic assumptions imposed (explicitly or implicitly) on the particular data x 0 ; see McCullagh [20]. Formally, M θ ( x ) is a stochastic mechanism framed in terms of probabilistic assumptions from three broad categories: Distribution (D), Dependence (M), and Heterogeneity (H), assigned to the observable stochastic process { X t , t N : = ( 1 , 2 , . . . , n , . . . ) } underlying data x 0 .
The specification (initial selection) of M θ ( x ) has a twofold objective:
(a)
M θ ( x ) is selected to account for all the chance regularity patterns—the systematic statistical information—in data x 0 by choosing appropriate probabilistic assumptions relating to { X k , k N } . Equivalently, M θ ( x ) is selected to render data x 0 a ‘typical realization’ therefrom, and the ‘typicality’ can be evaluated using Mis-Specification (M-S) testing, which evaluates the approximate validity of its probabilistic assumptions.
(b)
M θ ( x ) is parametrized [ θ Θ ] to enable one to shed light on the substantive questions of interest using data x 0 . When these questions are framed in terms of a substantive model, say M φ ( x ) , φ Φ , one needs to bring out the implicit statistical model in a way that ensures that the two sets of parameters are related via a set of restrictions g ( φ , θ ) = 0 colligating φ to the data x 0 via θ ; see Spanos [16].
Example 1. A widely used example in practice is the simple Normal model:
M_θ(x):  X_t ~ NIID(μ, σ²),  (μ, σ²) ∈ R × R_+,  x_t ∈ R,  t ∈ N,   (2)
where ‘ NIID ’ stands for ‘Normal (D), Independent (M), and Identically Distributed (H)’.
The main objective of the model-based frequentist inference is to ‘learn from data x 0 ’ about θ * , where θ * denotes the ‘true’ value of θ in Θ ; shorthand for saying that there exists a θ * Θ such that M * ( x ) = f ( x ; θ * ) , x R X n , could have generated data x 0 .
The cornerstone of frequentist inference is the concept of a sampling distribution f(y_n; θ) = dF_n(y)/dy, for all (∀) y ∈ R_Y, of a statistic Y_n = g(X_1, X_2, ..., X_n) (estimator, test, predictor), derived via:
F_n(y) = P(Y_n ≤ y) = ∫···∫_{x: g(x) ≤ y} f(x; θ)dx, ∀ y ∈ R_Y.   (3)
The derivation of f(y_n; θ), y ∈ R_Y, in (3) presumes the validity of f(x; θ), x ∈ R_X^n, which in the case of (2) is: f(x; θ) =[NIID]= (1/√(2πσ²))^n exp{−(1/(2σ²)) ∑_{k=1}^n (x_k − μ)²}, x ∈ R^n.
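To make the role of the distribution of the sample concrete, the following minimal simulation sketch (not part of the original exposition) checks that the statistic τ(X) = √n(X̄_n − μ_0)/s, computed from NIID data as in (2), has the St(n−1) sampling distribution implied by (3); the particular values μ_0 = 2, σ = 1.941, n = 100 are illustrative assumptions only.

```python
# Monte Carlo sketch: the sampling distribution of tau(X) = sqrt(n)(Xbar - mu0)/s
# under the simple Normal model (2), compared with the Student's t(n-1) implied by (3).
# The parameter values below are illustrative assumptions, not the paper's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu0, sigma, n, reps = 2.0, 1.941, 100, 50_000

x = rng.normal(loc=mu0, scale=sigma, size=(reps, n))   # NIID samples generated under mu = mu0
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)
tau = np.sqrt(n) * (xbar - mu0) / s

# Compare simulated tail probabilities with the St(n-1) distribution
for c in (1.0, 1.66, 2.0):
    print(c, tau[tau > c].size / reps, stats.t.sf(c, df=n - 1))
```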
In light of the crucial role of the distribution of the sample, Fisher [19], p. 314, emphasized the importance of establishing the statistical adequacy (approximate validity) of the invoked statistical model M θ ( x ) :
“For empirical as the specification of the hypothetical population [ M θ ( x ) ] may be, this empiricism is cleared of its dangers if we can apply a rigorous and objective test of the adequacy with which the proposed population represents the whole of the available facts.”
He went on to underscore the crucial importance of Mis-Specification (M-S) testing (testing the approximate validity of the probabilistic assumptions comprising M θ ( x ) ) as the way to provide an empirical justification for statistical induction:
“The possibility of developing complete and self-contained tests of goodness of fit deserves very careful consideration, since therein lies our justification for the free use which is made of empirical frequency formulae. ”
(Fisher [19], (p. 314).)
Statistical adequacy plays a crucial role in securing the reliability of inference because it ensures the approximate equality between the actual and the nominal error probabilities based on x 0 , assuring that one can keep track of the relevant error probabilities. In contrast, when M θ ( x ) is statistically misspecified (Spanos, [21]),
(a)
f(x; θ), x ∈ R_X^n, and the likelihood function L(x_0; θ) ∝ f(x_0; θ), θ ∈ Θ, are erroneous,
(b)
distorting the sampling distribution f ( y n ; θ ) derived via (3), as well as
(c)
giving rise to 'non-optimal' estimators and sizeable discrepancies between the actual and nominal error probabilities—the latter derived assuming M_θ(x) is valid.
In light of that, the practical way to keep track of the relevant error probabilities is to establish the statistical adequacy of M θ ( x ) . When M θ ( x ) is misspecified, any attempt to adjust the relevant error probabilities is ill-fated because the actual error probabilities are unknown due to being sizeably different from the nominal ones.
Regrettably, as Rao [22], p. 2, points out, validating M θ ( x ) using comprehensive M-S testing is neglected in statistics courses:
“They teach statistics as a deductive discipline of deriving consequences from given premises [ M θ ( x ) ]. The need for examining the premises, which is important for practical applications of results of data analysis is seldom emphasized. … The current statistical methodology is mostly model-based, without any specific rules for model selection or validating a specified model. ”
(p. 2)
See Spanos [23] for further discussion.

2.2. Neyman–Pearson (N-P) Testing

Example 1 (continued). In the context of (2), testing the hypotheses:
H_0: μ ≤ μ_0  vs.  H_1: μ > μ_0,   (4)
an optimal (UMP) α -significance level test (Neyman and Pearson, 1933 [24]) is:
T_α := [τ(X) = √n(X̄_n − μ_0)/s,  C_1(α) = {x: τ(x) > c_α}],   (5)
where X̄_n = (1/n)∑_{k=1}^n X_k, s² = (1/(n−1))∑_{k=1}^n (X_k − X̄_n)², C_1(α) denotes the rejection region, and c_α is determined by the significance level α; see Lehmann [13].
The sampling distribution of τ ( X ) evaluated under H 0 (hypothetical) is:
τ(X) = √n(X̄_n − μ_0)/s ~ St(n − 1), under μ = μ_0,   (6)
where ‘ St ( n 1 ) ’ denotes the Student’s t distribution with ( n 1 ) degrees of freedom, which provides the basis for evaluating the type I error probability and the p-value:
α = P(τ(X) > c_α; μ = μ_0),   p(x_0) = P(τ(X) > τ(x_0); μ = μ_0).   (7)
That is, both the type I error probability and the p-value in (7) are evaluated using hypothetical reasoning that interprets 'μ = μ_0 is true' as 'what if μ_0 = μ*'.
The sampling distribution of τ(X) evaluated under H_1 (hypothetical) is:
τ(X) = √n(X̄_n − μ_0)/s ~ St(δ_1; n − 1), under μ = μ_1, with δ_1 = √n(μ_1 − μ_0)/σ, μ_1 = μ_0 + γ_1, γ_1 > 0,   (8)
where δ_1 is the noncentrality parameter of St(δ_1; n − 1), which provides the basis for evaluating the power of test T_α:
P(μ_1) = P(τ(X) > c_α; μ = μ_1), μ_1 = μ_0 + γ_1, γ_1 > 0.   (9)
The form of δ_1 = √n(μ_1 − μ_0)/σ in (8) indicates that the power increases monotonically with n and (μ_1 − μ_0) and decreases with σ.
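The error probabilities in (7)–(9) are straightforward to compute numerically. The sketch below does so for the test in (5), using the central Student's t for the threshold and p-value and the noncentral Student's t for the power; treating σ as known (so that δ_1 can be evaluated) is an assumption made purely for illustration, and the numerical values echo the running example.

```python
# Sketch of the pre-data error probabilities in (7)-(9) for the one-sided t-test (5).
# sigma is treated as known here purely for illustration (in practice it is estimated).
import numpy as np
from scipy import stats

n, alpha, mu0, sigma = 100, 0.05, 2.0, 1.941
c_alpha = stats.t.ppf(1 - alpha, df=n - 1)       # rejection threshold c_alpha (~1.66)

def power(mu1):
    """P(tau(X) > c_alpha; mu = mu1) via St(delta1; n-1), delta1 = sqrt(n)(mu1-mu0)/sigma."""
    delta1 = np.sqrt(n) * (mu1 - mu0) / sigma
    return stats.nct.sf(c_alpha, df=n - 1, nc=delta1)

def p_value(tau_obs):
    """p(x0) = P(tau(X) > tau(x0); mu = mu0), under the central St(n-1)."""
    return stats.t.sf(tau_obs, df=n - 1)

print(round(c_alpha, 3), round(p_value(1.633), 4))
for gamma in (0.1, 0.3, 0.486):                  # power rises toward ~0.8 near gamma ~ 0.486
    print(gamma, round(power(mu0 + gamma), 3))
```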
The optimality of N-P tests revolves around an inherent trade-off between the type I and type II error probabilities. To address that trade-off, Neyman and Pearson [24] proposed to construct an optimal test by prespecifying α at a low value and minimizing the type II error probability β(μ_1), or equivalently maximizing the power P(μ_1) = 1 − β(μ_1), for μ_1 = μ_0 + γ_1, γ_1 ≠ 0.
A question that is often overlooked in traditional expositions of N-P testing is:
Where does prespecifying the type I error probability at a low threshold come from?
A careful reading of Neyman and Pearson [24] reveals the answer in the form of two crucial stipulations relating to the framing of H 0 : θ Θ 0 and H 1 : θ Θ 1 , to ensure the effectiveness of N-P testing and the informativeness of the ensuing results:
  • Θ_0 and Θ_1 should form a partition of Θ (p. 293) to avoid the case where θ* lies outside Θ_0 ∪ Θ_1.
  • Θ_0 and Θ_1 should be framed in such a way as to ensure that the type I error is the more serious of the two, invoking the analogy with a criminal trial (p. 296), with H_0: not guilty, to secure a low probability of sending an innocent person to jail.
Unveiling the intended objective of stipulation 2 suggests that the requirement for a small (prespecified) α is to ensure that the test has a low probability of rejecting a true null hypothesis, i.e., when θ_0 = θ*. Minimizing the type II error probability implies that an optimal test should have the lowest possible probability of accepting (equivalently, the highest power for rejecting) θ = θ_0 when it is false, i.e., when θ_0 ≠ θ*. That is, an optimal test should have high power around the potential neighborhood of θ*. This implies that when no reliable information about this potential neighborhood is available, one should use a two-sided test to avert the case where the test has no, or very low, power around θ*; see Spanos [25].
Because the power increases with n, it is important to take this into account when selecting an appropriate α, so as to avoid both an under-powered and an over-powered test, thereby averting the 'small n' and 'large n' problems, respectively. First, for a given α, one needs to calculate the value of n needed for T_α to have sufficient power to detect the parameter discrepancies γ ≠ 0 of interest (see Spanos [26]); a sketch of this calculation is given below. Second, for a large n, one needs to adjust α to avoid an ultra-sensitive test that could detect tiny discrepancies, say γ = 0.0000001, and misleadingly declare them statistically significant.
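A minimal sketch of the first calculation follows: for a given α, it searches for the smallest n giving T_α power of at least 0.8 against a discrepancy γ of interest. The particular values of γ and σ are illustrative assumptions, not taken from the paper.

```python
# Sketch: for a given alpha, find the smallest n giving power >= 0.8 against a
# discrepancy gamma of interest (averting an under- or over-powered test).
# gamma and sigma below are illustrative assumptions.
import numpy as np
from scipy import stats

alpha, mu0, sigma, gamma, target = 0.05, 2.0, 1.941, 0.4, 0.8

def power(n):
    c = stats.t.ppf(1 - alpha, df=n - 1)                  # threshold for this n
    delta = np.sqrt(n) * gamma / sigma                    # noncentrality under mu = mu0 + gamma
    return stats.nct.sf(c, df=n - 1, nc=delta)

n_req = next(n for n in range(5, 100_000) if power(n) >= target)
print(n_req, round(power(n_req), 3))   # smallest n with power >= 0.8 for gamma = 0.4
```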
The primary role of the pre-data testing error probabilities (type I, II, and power) is to operationalize the notions of ‘statistically significant/insignificant’ in terms of statistical approximations framed in terms of a test statistic τ ( X ) and its sampling distribution. These error probabilities calibrate the capacity of the test to shed sufficient light on θ * , giving rise to learning from data. In this sense, the reliability of the testing results ‘accept/reject H 0 ’ depends crucially on the particular testing statistical context (Spanos, [17], ch. 13):
(i) M_θ(x),  (ii) H_0: θ ∈ Θ_0 vs. H_1: θ ∈ Θ_1,  (iii) T_α := {d(X), C_1(α)},  (iv) data x_0,   (11)
which includes not only the adequacy of M_θ(x) vis-à-vis data x_0 and the framing of H_0 and H_1, but also n; see Spanos [16,17]. For instance, when n > 10,000, detaching the accept/reject H_0 results from their statistical context in (11) and claiming statistical significance at conventional thresholds will often be an unwarranted claim.
Let us elaborate on this assertion.

3. The Large n Problem in N-P Testing

3.1. How Could One Operationalize 'as n Increases'?

As mentioned above, for a given α , increasing n increases the test’s power and decreases the p-value. What is not so obvious is how to operationalize the clause ‘as n increases’ since data x 0 usually come with a specific sample size n. Assuming that one begins with large enough n to ensure that the Mis-Specification (M-S) tests have sufficient power to detect existing departures from the probabilistic assumptions of the invoked M θ ( x ) , say n = 100 , there are two potential scenarios one could contemplate.
Scenario 1 assumes that all different values of n 100 give rise to the same observed τ ( x 0 ) . This scenario has been explored by Mayo and Spanos [27,28].
Scenario 2 assumes that, as n increases beyond n = 100, the changes in the estimates x̄_n and s_n² are 'relatively small', rendering the ratio (x̄_n − μ_0)/s_n approximately constant.
Scenario 2 seems realistic enough to shed light on the large n problem for two reasons. First, when the NIID assumptions are valid for x_0, the changes in x̄_n and s_n² from increasing n are likely to be 'relatively small', since n = 100 is sufficiently large to provide a reliable initial estimate, and thus increasing n is unlikely to change the ratio drastically. Second, the estimate (x̄_n − μ_0)/s_n, where μ_0 is a value of interest, is known as the 'effect size' for μ in psychology (Ellis [29]) and is often used to infer the magnitude of the 'scientific' effect, irrespective of n. Let us explore the effects of increasing n using scenario 2.

3.2. The Large n Problem and the p-Value

Empirical example 1 (continued). Consider the hypotheses in (4) for μ 0 = 2 in the context of (2) using the following information:
n = 100, α = 0.05, c_α = 1.66, x̄_n = 2.317, and s² = 3.7675 (s = 1.941).
The test statistic τ(X) = √n(X̄_n − μ_0)/s yields τ_n(x_0) = √100(2.317 − 2)/1.941 = 1.633, with a p-value p_n(x_0) = 0.0528, indicating 'accept H_0'. The question of interest is how increasing n beyond n = 100 will affect the result of N-P testing.
Consider the issue of how the p-value changes as n increases when (x̄_n − μ_0)/s_n is held constant (scenario 2). The p-value curve in Figure 1 for 50 < n ≤ 500 indicates that one can manipulate n to obtain the desired result since (a) for n < 105 the p-value yields p_n(x_0) > α = 0.05, 'accept H_0', and (b) for n > 105 it yields p_n(x_0) < α = 0.05, 'reject H_0'.
Table 1 reports particular values of τ_n(x_0) and p_n(x_0) from Figure 1 as n increases, showing that p_n(x_0) decreases rapidly down to tiny values for n ≥ 10,000, confirming Berkson's [1] (p. 527) observation in the introduction that the p-value "… will be small beyond any usual limit of significance."
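The entries of Table 1 can be reproduced (up to rounding) with the short sketch below, which holds the ratio (x̄_n − μ_0)/s fixed at 0.317/1.941 (scenario 2) and recomputes τ_n(x_0) and p_n(x_0) = P(St(n − 1) > τ_n(x_0)) as n grows.

```python
# Sketch reproducing the p-value behaviour under scenario 2: the ratio
# (xbar_n - mu0)/s is held fixed while n increases, so tau_n(x0) = sqrt(n) * ratio
# and p_n(x0) is its upper-tail probability under St(n-1).
import numpy as np
from scipy import stats

ratio = (2.317 - 2.0) / 1.941
for n in (100, 120, 150, 300, 500, 1000, 2000, 10_000):
    tau_n = np.sqrt(n) * ratio
    p_n = stats.t.sf(tau_n, df=n - 1)
    print(n, round(tau_n, 3), p_n)
```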

3.3. The Large n Problem and the Power of a Test

Let us fix the power of T_α at 0.8 (P(μ_1) = 0.8) as n increases under scenario 2 ((x̄_n − μ_0)/s_n held constant), and evaluate its effect on the size of the detected discrepancies γ_1 = μ_1 − μ_0.
Table 2 reports several values of n showing how the test detects smaller and smaller discrepancies from μ = 2 , confirming Fisher’s [2] quotation in the introduction. This can be seen more clearly in Figure 2, where the power curves become steeper and steeper as n increases.
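The shrinking discrepancies in Table 2 can be recovered (approximately) by solving the power equation numerically: for each n, find the γ_1 at which P(τ(X) > c_α; μ = μ_0 + γ_1) = 0.8, with s = 1.941 standing in for σ, as in scenario 2. A sketch:

```python
# Sketch of how the discrepancy gamma_1 detectable with power 0.8 shrinks as n grows
# (cf. Table 2): solve power(gamma_1) = 0.8 for gamma_1 at each n, with s = 1.941 for sigma.
import numpy as np
from scipy import stats
from scipy.optimize import brentq

alpha, sigma, target = 0.05, 1.941, 0.8

def detectable_gamma(n):
    c = stats.t.ppf(1 - alpha, df=n - 1)
    f = lambda g: stats.nct.sf(c, df=n - 1, nc=np.sqrt(n) * g / sigma) - target
    return brentq(f, 1e-8, 5.0)          # gamma_1 with power exactly 0.8

for n in (100, 200, 500, 1000, 10_000, 100_000):
    print(n, round(detectable_gamma(n), 4))
```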
One might object to this quote as anachronistic since Fisher [30] was clearly against the use of the type II error probability and the power of a test due to his falsificationist stance where ‘accept H 0 ’ would not be an option. The response is that Fisher acknowledged the role of power using the term ‘sensitivity’ in the quotation (Section 1.1), as well as explicitly in Fisher [31], p. 295. Indeed, the presence of a rejection threshold α in Fisher’s significance testing brings into play the pre-data type I and II error probabilities, while his p-value is a post-data error probability; see Spanos [17].

4. The Empirical Literature and the Large n Problem

Examples of misuse of frequentist testing abound in all applied fields, but the large n problem is particularly serious in applied microeconometrics where very large sample sizes n are often employed to yield spurious significance results, which are then used to propose policy recommendations relating to relevant legislation; see Pesko and Warman [32].

4.1. Empirical Examples in Microeconometrics

Empirical example 2A (Abouk et al. [33], Appendix J, Table 1, p. 99). Based on an estimated LR model (their Table 7) with n = 24,730,930, it is inferred from the estimates β̂_k = 0.004 and SE(β̂_k) = 0.002 that the coefficient β_k of a key variable x_k is statistically significant at α = 0.05.
Such empirical results and the claimed evidence based on the estimated models are vulnerable to three potential problems.
(i)
Statistical misspecification and the ensuing untrustworthy evidence, since the discussion in the paper ignores the statistical adequacy of the estimated statistical models. Given that evaluating the statistical adequacy of any published empirical study requires the original data for thorough misspecification testing (Spanos [7]), that issue will be sidestepped in the discussion that follows.
(ii)
Large n problem. One of the many examples in the paper relates to the claimed statistical significance (with n = 24 , 730 , 930 ) at α = 0.05 of β k , ignoring its potential ‘spurious’ statistical significance results stemming from the large n problem in N-P testing.
(iii)
Conflating ‘testing results’ with ‘evidence’. The authors claim evidence for  β k 0 , and proceed to infer its implications for the effectiveness of different economic policies.
Empirical example 2A (continued). Abouk et al. [33] report β̂_k = 0.004, SE(β̂_k) = s√q_kk/√n = 0.002, and p(z_0) < 0.05, implying:
√n(β̂_k − 0)/(s√q_kk) = √24,730,930 (0.004)/9.8932 = 2, for c_0.025 = 1.96 and p(z_0) = 0.045.
The relevant sampling distribution of β̂ := (β̂_0, β̂_1) for the LR model is:
√n(β̂ − β) ~ N(0, σ²Q_X⁻¹),  lim_{n→∞}(XᵀX/n) = Q_X = [q_ij], i, j = 1, ..., m, Q_X > 0.
Focusing on just one coefficient, β_k, the t-test for its significance is:
τ(y) = √n(β̂_k − 0)/(s√q_kk) ~ N(0, 1) under β_k = 0, as n → ∞,  C_1(α) = {y: |τ(y)| > c_{α/2}}.
Using the information in example 2A, we can reconstruct what p_n(z_0) would have been for different n under scenario 2. The results in Table 3 indicate that the authors' claim of statistical significance (β_k ≠ 0) at α = 0.05 would be unwarranted for any n < 24,000,000.
Example 2B. Abouk et al. [33], Table 2, p. 57, report 42 ANOVA results for the difference between two means (x̄_n1 − ȳ_n2); 40 of these tests have tiny p-values, reported as 0.0000, even though the differences between the two means appear to be very small. Surprisingly, two of the estimated differences are zero, (x̄_n1 − ȳ_n2) = 0 (x̄_n1 = 13.2, ȳ_n2 = 13.2 and x̄_n1 = 2.51, ȳ_n2 = 2.51), but their reported p-values are 0.8867 and 0.0056, respectively; one would have expected p(z_0) = 1 when τ(z_0) = 0. Looking at these results, one wonders what went wrong with the reported p-values.
The hypotheses of interest take the form (Lehmann, [13]):
H_0: (μ_1 − μ_2) = 0  vs.  H_1: (μ_1 − μ_2) ≠ 0,
with the optimal test being the t-test T_α = {τ(Z), C_1 = {z: |τ(z)| > c_{α/2}}}, where:
τ(Z) = √N[(X̄_n1 − Ȳ_n2) − γ]/s_N ~ St(n_1 + n_2 − 2) under μ_1 − μ_2 = 0, with N = n_1 n_2/(n_1 + n_2),   (13)
s_N² = [(n_1 − 1)s_1² + (n_2 − 1)s_2²]/(n_1 + n_2 − 2),  s_1² = (1/(n_1 − 1))∑_{i=1}^{n_1}(X_i − X̄_n1)²,  s_2² = (1/(n_2 − 1))∑_{i=1}^{n_2}(Y_i − Ȳ_n2)².
A close look at this t-test suggests that the most plausible explanation for the above 'strange results' is that the statistical software uses (at least) 12-digit decimal precision. Given that the authors report mostly 3-digit estimates, the software is picking up tiny discrepancies, say < 0.0000001, which, when magnified by √N, could yield the reported p-values. That is, the reported results have (inadvertently) exposed the effect of the large n on the p-value.
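A quick numerical sketch makes the magnification effect concrete: with N in the millions, a standardized difference of means far below the 2-decimal reporting precision already produces a 'significant' p-value. The value of N and the standardized differences below are illustrative assumptions, not the authors' data.

```python
# Sketch: with N = n1*n2/(n1+n2) in the millions, a standardized difference of means
# invisible at 2-decimal reporting precision already yields a 'significant' p-value.
# N and the standardized differences are illustrative assumptions.
import numpy as np
from scipy import stats

N = 6_000_000                                      # assumed order of magnitude of N
for std_diff in (0.0001, 0.0005, 0.001, 0.002):    # (xbar - ybar)/s_N
    tau = np.sqrt(N) * std_diff
    p = 2 * stats.norm.sf(abs(tau))                # St(n1+n2-2) is effectively N(0,1) here
    print(std_diff, round(tau, 2), round(p, 6))
```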

4.2. Meliorating the Large n Problem Using Rules of Thumb

In light of the inherent trade-off between the type I and type II error probabilities, some statistics textbooks advise practitioners to use ‘rules of thumb’ based on decreasing α as n increases; see Lehmann [13].
  1. Ad hoc rules for adjusting α as n increases:

     n:   100    200     500    1000    10,000    20,000     200,000
     α:   0.05   0.025   0.01   0.001   0.0001    0.00001    0.00000001

  2. Good [15], p. 66, proposed to standardize the p-value p_n(x_0) relative to N = 100. Applying his rule of thumb, min{0.5, p_n(x_0)·√(n/100)}, n > 100, to the p-value curve in Figure 1, based on τ_n(x_0) = 1.633, yields:

     n:            100      120     150     300      500       1000       5000          10,000
     p_n(x_0):     0.0528   0.038   0.024   0.002    0.0002    0.000001   2.4×10^-20    0.000…0
     p_100(x_0):   0.0528   0.042   0.029   0.0035   0.00045   0.000003   10.7×10^-19   0.000…0
The above numerical examples suggest that such rules of thumb for selecting α as n increases can meliorate the problem; they do not, however, address the large n problem since they are ad hoc, and their suggested thresholds decrease to nearly zero beyond n = 10 , 000 .
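For completeness, the two rows above can be approximately reproduced (small differences arise from rounding p_n(x_0) before standardizing) with the following sketch of Good's rule applied under scenario 2:

```python
# Sketch of Good's [15] rule of thumb applied to the running example under scenario 2:
# replace p_n(x0) with min{0.5, p_n(x0) * sqrt(n/100)}, standardizing to N = 100.
import numpy as np
from scipy import stats

ratio = (2.317 - 2.0) / 1.941                   # held constant as n increases (scenario 2)
for n in (100, 120, 150, 300, 500, 1000):
    p_n = stats.t.sf(np.sqrt(n) * ratio, df=n - 1)
    p_std = min(0.5, p_n * np.sqrt(n / 100))    # Good's standardized p-value
    print(n, round(p_n, 6), round(p_std, 6))
```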

4.3. The Large n Problem and Effect Sizes

Although there is no authoritative definition of the notion of an effect size, the one that comes closest to its motivating objective is Thompson’s [34]:
“An effect size is a statistic quantifying the extent to which the sample statistics diverge from the null hypothesis. ”
(p. 172)
The notion is invariably related to a particular frequentist test and bears a distinct resemblance to its test statistic. To shed light on its effectiveness in addressing the large n problem, consider a simpler form of the test in (13) in which the sample size n is the same for both X and Y and Cov(X_i, Y_i) = 0, i = 1, 2, ..., n. The optimal t-test, based on a simple bivariate Normal distribution with parameters (μ_1, μ_2, σ²), for H_0: γ = γ_0 vs. H_1: γ ≠ γ_0, with γ = μ_1 − μ_2 and γ_0 = 0, takes the form:
τ(Z) = √N[(X̄_n − Ȳ_n) − γ]/s ~ St(2n − 2) under γ = γ_0, with N = n/2,
τ(Z) ~ St(δ_1; 2n − 2) under γ = γ_1, with δ_1 = √N(γ_1 − γ_0)/σ, for γ_1 ≠ γ_0.
The recommended effect size for this particular test is the widely used Cohen’s d = [ ( x ¯ n y ¯ n ) / s ] . As argued by Abelson [35], the motivation underlying the choice of the effect size statistic:
“… is that its expected value is independent of the size of the sample used to perform the significance test. ”
(p. 46)
That is, if one views Cohen's d(z_0) = (x̄_n − ȳ_n)/s as a point estimate of the unknown parameter ψ = (μ_1 − μ_2)/σ, the statistic referred to by Thompson above is ψ̂(Z) = (X̄_n − Ȳ_n)/s, confirming Abelson's claim that E[ψ̂(Z)] = ψ is free of n.
The question that naturally arises at this point is to what extent deleting √N and using the point estimate ψ̂(z_0) = (x̄_n − ȳ_n)/s of ψ as the effect size associated with γ = μ_1 − μ_2 addresses the large n problem. The demonstrable answer is that it does not, since the claim that ψ̂(z_0) closely approximates ψ* for a large enough n is unwarranted; see Spanos [36]. The reason is that a point estimate ψ̂(z_0) represents a single realization from the relevant sampling distribution, i.e., it represents a 'statistical result' that ignores the relevant uncertainty. Indeed, ψ̂(Z), by itself, does not have a sampling distribution without √N, since the relevant sampling distribution is that of:
τ(Z; γ) = √N[(X̄_n − Ȳ_n) − γ]/s ~ St(2n − 2) under γ = γ*,
where γ* denotes the true value of μ_1 − μ_2; see Spanos [37] for more details.
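The following sketch (with an arbitrarily chosen, fixed standardized difference) illustrates the point numerically: Cohen's d stays put as n grows, while the test statistic √N·d, and hence the 'significance' verdict, is driven entirely by n.

```python
# Sketch contrasting Cohen's d with the test statistic it derives from:
# d(z0) = (xbar - ybar)/s is free of n, while tau(z0) = sqrt(N)*d(z0) grows with n,
# so the same 'effect size' can be insignificant or 'highly significant' depending solely on n.
# The fixed standardized difference below is an illustrative assumption.
import numpy as np
from scipy import stats

d = 0.02                                     # a fixed, tiny standardized difference
for n in (100, 1_000, 100_000, 10_000_000):
    N = n / 2                                # equal sample sizes, as in the equal-n form above
    tau = np.sqrt(N) * d
    p = 2 * stats.t.sf(abs(tau), df=2 * n - 2)
    print(n, d, round(tau, 2), p)
```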

5. The Post-Data Severity Evaluation (SEV) and the Large n Problem

The post-data severity (SEV) evaluation of the accept/reject H 0 results is a principled argument that provides an evidential account for these results. Its main objective is to transform the unduly data-specific accept/reject H 0 results into evidence. This takes the form of an inferential claim that revolves around the discrepancy γ 1 = μ 1 μ 0 warranted by data x 0 and test T α with a high enough probability; see Spanos [9].
A hypothesis H ( H 0 or H 1 ) passes a severe test T α with x 0 if: (C-1) x 0 accords with H, and (C-2) with very high probability, test T α would have produced a result that ‘accords less well’ with H than x 0 does, if H were false; see Mayo and Spanos [27].

5.1. Case 1: Accept H 0

In the case of 'accept H_0', the SEV evaluation seeks the 'smallest' discrepancy from μ_0 = 2 warranted with a high enough probability.
Empirical example 1 (continued). For α = 0.05, c_α = 1.66, x̄_n = 2.317, s = 1.941, and n = 100, test T_α in (5) for the hypotheses in (4) yields τ(x_0) = 1.633, with p(x_0) = 0.0528, indicating 'accept H_0'. (C-1) indicates that x_0 accords with H_0, and since τ(x_0) > 0 the relevant inferential claim is μ ≤ μ_1 = μ_0 + γ, for γ > 0. Hence, (C-2) calls for evaluating the probability of the event "outcomes x ∈ R^n that accord less well with μ ≤ μ_1 than x_0 does", i.e., [x: τ(x) > τ(x_0)], x ∈ R^n:
SEV(T_α; μ ≤ μ_1) = min_{μ ∈ Θ_1} P(τ(X) > τ(x_0); μ = μ_1) > p_1,
where Θ_1 := (2, ∞), γ = (μ_1 − μ_0) > 0, for a large enough p_1 > 0.5, evaluated based on:
τ(X) = √n(X̄_n − μ_0)/s ~ St(δ_1; n − 1) under μ = μ_1, with δ_1 = √n(μ_1 − μ_0)/s, μ_1 > μ_0,   (15)
where E(τ(X)) = √((n − 1)/2)·[Γ((n − 2)/2)/Γ((n − 1)/2)]·δ_1 := m and Var(τ(X)) = [(n − 1)/(n − 3)](1 + δ_1²) − m²; see Owen [38].
That is, the central and noncentral Student's t sampling distributions differ not only with respect to their mean but also in terms of the variance and higher moments, since for a non-zero δ_1 the noncentral Student's t is non-symmetric. The post-data severity curve for all μ_1 ∈ [1.8, 3] in Figure 3 indicates that the discrepancy warranted by data x_0 and test T_α with probability 0.8 is γ ≤ 0.481 (μ_1 ≤ 2.481).
The post-data severity evaluated for typical values reported in Table 4 reveals that the probability associated with the discrepancy relating to the estimate x̄_n = 2.317 (γ_1 = 0.317) is never high enough to be the discrepancy warranted by x_0 and test T_α, since SEV(γ_1 = 0.317) = 0.5. This calls into question claims based on point estimates, and observed CIs more generally, since they represent an uncalibrated single realization (x_0) of the sample X from the relevant sampling distribution, ignoring the relevant uncertainty.
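For the record, the severity values in Table 4 can be reproduced with a few lines, since SEV(μ ≤ μ_1) = P(τ(X) > τ(x_0); μ = μ_1) is just the survival function of the noncentral St(δ_1; n − 1) at τ(x_0), with δ_1 = √n(μ_1 − μ_0)/s as in (15). A sketch for the running example:

```python
# Sketch of the post-data severity evaluation for the 'accept H0' case:
# SEV(mu <= mu1) = P(tau(X) > tau(x0); mu = mu1), evaluated under St(delta1; n-1)
# with delta1 = sqrt(n)(mu1 - mu0)/s, for the running example.
import numpy as np
from scipy import stats

n, mu0, xbar, s = 100, 2.0, 2.317, 1.941
tau_obs = np.sqrt(n) * (xbar - mu0) / s          # ~1.633

def sev_accept(gamma):
    delta1 = np.sqrt(n) * gamma / s              # discrepancy gamma = mu1 - mu0
    return stats.nct.sf(tau_obs, df=n - 1, nc=delta1)

for gamma in (0.1, 0.2, 0.317, 0.481, 0.6):
    print(gamma, round(sev_accept(gamma), 3))    # ~0.13, ~0.27, ~0.50, ~0.80, ~0.93 (cf. Table 4)
```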

5.2. The Post-Data Severity and the Large n Problem

5.2.1. Case 1: Accept H 0

Empirical example 1 (continued). Consider scenario 2, where (x̄_n − μ_0)/s_n remains constant as n increases. Table 5 gives particular values of n showing how τ_n(x_0) increases and μ_1 decreases for a fixed SEV(γ) = 0.8.
Figure 4 presents the different SEV curves as n increases ((x̄_n − μ_0)/s held constant). It confirms the results of Table 5 by showing that the curves become steeper and steeper as n increases. This reduces the warranted discrepancy γ_n monotonically (Table 5), converging to the lower bound at γ ≈ 0.317 (x̄_n = 2.317) as n → ∞. That is, the warranted discrepancies indicated on the x-axis, at a constant probability of 0.8 (y-axis), become smaller and smaller as n increases, with the lower bound being the value of μ at which its optimal estimator converges to μ* with probability one as n → ∞.
This makes statistical sense in practice, since X̄_n and s_n are strongly consistent estimators of μ* and σ*, i.e., P(lim_{n→∞} θ̂_n(X) = θ*) = 1, and thus their accuracy (precision) improves as n increases beyond a certain threshold n > N. Recall that in the case of 'accept H_0' one is seeking the 'smallest' discrepancy from μ_0 = 2. Hence, as n increases, SEV renders the warranted discrepancy γ_n 'more accurate' by reducing it, until it reaches the lower bound around γ ≈ (x̄_n − μ_0) [μ_1 = x̄_n] as n → ∞; note that the latter is the most accurate value for μ* only when M_θ(x) is statistically adequate!

5.2.2. Case 2: Reject H 0

Empirical example 3. Consider changing the estimate of μ to x̄_n = 2.726, retaining s = 1.941, which yields τ_n(x_0) = 3.740 [0.00009], indicating 'reject H_0' in (4). In contrast to the case of 'accept H_0', when the test rejects H_0 one is seeking the 'largest' discrepancy from μ_0 = 2 warranted with high probability. In light of τ_n(x_0) > 0, the relevant inferential claim is μ > μ_1 = μ_0 + γ, for γ > 0, and its post-data probabilistic evaluation is based on:
SEV(T_α; μ > μ_1) = max_{μ ∈ Θ_1} P(τ(X) < τ(x_0); μ = μ_1) > p_1,   (16)
where Θ_1 := (2, ∞), p_1 > 0.5, and the evaluation of (16) is based on (15).
Table 6 indicates that the discrepancy γ warranted with SEV(T_α; μ > μ_1) = 0.8 is γ ≈ 0.562 (μ_1 ≈ 2.562), and the discrepancy for μ_1 ≈ 2.726 has SEV(T_α; μ > μ_1) = 0.5.
Figure 5 depicts the severity curves for n = 100, 200, 500, 1000, 10,000, indicating that, keeping the probability constant at 0.8 (y-axis) as n increases (with (x̄_n − μ_0)/s_n held constant), the curves become steeper and steeper, thus increasing the warranted discrepancy (x-axis) monotonically toward an upper bound, which is the value of μ at which its optimal estimator converges to μ* with probability one as n → ∞. This is analogous to the case of 'accept H_0', with the lower bound replaced by an upper bound, equal to μ* in both cases. That is, the SEV evaluation increases the precision of the discrepancy warranted with high (but constant) probability as n increases.
Example 2A (Abouk et al., 2022) continued. The reported t-test result is:
√n(β̂_k − 0)/(s√q_kk) = √24,730,930 (0.004)/9.8932 = 2.0 [0.045].
What is the discrepancy γ from β_k = 0 warranted with high enough severity, say 0.977 (in light of n = 24,730,930)? The answer is γ ≈ 0.0000001 (the warranted claim being β_k > γ), which calls into question β̂_k = 0.004, as well as Cohen's d(z_0) = 0.0004; see Ellis [29]. This confirms the tiny discrepancies from zero due to the large n problem conjectured in Section 4.1.
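A sketch of this calculation, using the Normal approximation to St(n − 1) (innocuous given n = 24,730,930) together with the reported τ(z_0) = 2.0 and SE(β̂_k) = 0.002: since SEV(β_k > γ) ≈ Φ(τ(z_0) − γ/SE), only discrepancies of order 10⁻⁷ or smaller are warranted with probability near 0.977, while γ anywhere near the point estimate 0.004 has severity 0.5.

```python
# Sketch of the severity evaluation for example 2A: with tau(z0) = 2.0 and SE(beta_k) = 0.002,
# SEV(beta_k > gamma) = P(tau(X) <= tau(z0); beta_k = gamma) ~ Phi(tau(z0) - gamma/SE),
# using the Normal approximation to St(n-1) given the enormous n.
from scipy import stats

tau_obs, se = 2.0, 0.002

def sev_reject(gamma):
    return stats.norm.cdf(tau_obs - gamma / se)

for gamma in (1e-7, 0.001, 0.002, 0.004):
    print(gamma, round(sev_reject(gamma), 3))    # ~0.977, ~0.933, ~0.841, ~0.500
```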
One might object to the above conclusion relating to the issue of spurious significance by countering that a tiny discrepancy is still different from zero, and thus, the statistical significance is well-grounded. Regrettably, such a counter-argument is based on a serious misconstrual of statistical inference in general, where learning from data takes the form of ‘statistical approximations’ framed in terms of a statistic or a pivot and its sampling distribution. This distinguishes statistics from other sub-fields of applied mathematics. No statistical inference is evaluated in terms of a binary choice of right and wrong! Indeed, this confusion lies at the heart of misinterpreting the binary accept/reject results as evidence for or against hypotheses or claims; see Mayo and Spanos [28].
Example 2B (Abouk et al. [33]) continued. As argued in Section 4.1, the reported results for the difference between two means exemplify the effect of the large n problem on the p-value. To quantify that effect, however, one needs the exact p-values, but only two of the 42 reported p-values are given exactly, 0.8867 and 0.0056; the other 40 are reported as 0.0000. Also, the particular sample sizes for the different reported tests are not given, and thus the overall n_1 and n_2 will be used. Focusing on the result with p-value = 0.0056, one can retrieve the observed t-statistic, which can then be used for the SEV evaluation. For x̄_n = 2.51, ȳ_n = 2.51, the observed t-statistic is τ(z_0) = 2.77, and using the SEV evaluation one can show that the warranted discrepancy γ from (μ_1 − μ_2) = 0 with SEV = 0.97 (N = 6,108,194) is γ ≈ 0.00000068, which confirms that the t-test detects tiny discrepancies between the two estimated means, as conjectured in Section 4.1. That is, all but one (p-value = 0.8867) of the reported 42 test results in Table 2 (Abouk et al. [33], p. 43) constitute cases of spurious rejection of the null due to the large n problem. Even if one were to reduce N by a factor of 40, the warranted discrepancies would still be tiny.

5.2.3. Key Features of the Post-Data SEV Evaluation

(a)
The S E V ( μ 0 + γ ) is a post-data error probability, evaluated using hypothetical reasoning, that takes fully into account the testing statistical context in (11) and is guided by the sign and magnitude of τ n ( x 0 ) as indicators of the direction of the relevant discrepancies γ from μ = μ 0 . This is particularly important because the factual reasoning, what if θ = θ * , underlying point estimation and Confidence Intervals (CIs), does not apply post-data.
(b)
The S E V ( μ 0 + γ ) evaluation differs from other attempts to deal with the large n problem in so far as its outputting of the discrepancy γ 1 = μ 1 μ 0 is always based on the non-central distribution (Kraemer and Paik, [39]):
[√n(X̄_n − μ_0)/s − √n(μ_1 − μ_0)/s] ≈ St(n − 1) under μ = μ_1, for all μ_1 ∈ Θ_1.
This ensures that the warranted discrepancy γ is evaluated using the ‘same’ sample size n, counter-balancing the effect of n on τ n ( x 0 ) . Note that ‘≈’ denotes an approximation.
(c)
The evaluation of the warranted γ with high probability accounts for the increase in n by enhancing its precision. As n increases, the warranted γ approaches θ̂_n(x_0) − θ_0 (equivalently, μ_1 approaches θ̂_n(x_0)), since for a statistically adequate M_θ(x), θ̂_n(x_0) approaches θ* due to its strong consistency.
(d)
The SEV can be used to address other foundational problems, including distinguishing between ‘statistical’ and ‘substantive’ significance. It also provides a testing-based effect size for the magnitude of the ‘scientific’ effect by addressing the problem with estimation-based effect sizes raised in Section 4.3. Also, the SEV evaluation can shed light on several proposed alternatives to (or modifications of) N-P testing by the replication crisis literature (Wasserstein et al. [40]), including replacing the p-value with effect sizes and observed CIs and redefining statistical significance (Benjamin et al. [41]) irrespective of the sample size n; see Spanos [25,37].

6. Summary and Conclusions

The large n problem arises naturally in the context of N-P testing due to the in-built trade-off between the type I and II error probabilities, around which the optimality of N-P tests revolves. This renders the accept/reject H 0 results and the p-value highly vulnerable to the large n problem. Hence, for n > 10 , 000 , the detection of statistical significance based on conventional significance levels will often be spurious since a consistent N-P test will detect smaller and smaller discrepancies as n increases; see Spanos [9].
The post-data severity (SEV) evaluation can address the large n problem by converting the unduly data-specific accept/reject H 0 ‘results’ into ‘evidence’ for a particular inferential claim of the form θ θ 0 + γ 1 , γ 1 0 . This is framed in the form of the discrepancy γ 1 warranted by data x 0 and test T α with high enough probability. The SEV evidential account is couched in terms of a post-data error probability that accounts for the uncertainty arising from the undue data-specificity of accept/reject H 0 results. The SEV differs from other attempts to address the large n problem in so far as its evaluation is invariably based on a non-central distribution whose non-centrality parameter uses the same n as the observed test statistic τ ( x 0 ) to counter-balance the effect induced by n in outputting γ 1 .
The SEV evaluation was illustrated above using two empirical results from Abouk et al. [33]. Example 2A concerns an estimated coefficient, β̂_k = 0.004 with SE(β̂_k) = 0.002, in an LR model, which is declared statistically significant at α = 0.05 with n = 24,730,930. The SEV evaluation yields a discrepancy γ ≈ 0.0000001 from β_k = 0 warranted by data z_0 and the t-test with probability 0.977. Example 2B concerns the difference between two means whose estimates are x̄_n1 = 2.51, ȳ_n2 = 2.51, but the t-test yielded τ(z_0) = 2.77 with N = 6,108,194. The SEV evaluation yields a discrepancy γ ≈ 0.00000068 from (μ_1 − μ_2) = 0 warranted by data z_0 and the t-test with probability 0.97. Both empirical examples represent cases of spurious statistical significance stemming from exceptionally large sample sizes, n = 24,730,930 and N = 6,108,194, respectively.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data are publicly available.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
M-S     Mis-Specification
N-P     Neyman–Pearson
UMP     Uniformly Most Powerful
SE      Standard Error
SEV     Post-data Severity Evaluation

References

  1. Berkson, J. Some difficulties of interpretation encountered in the application of the chi-square test. J. Am. Stat. Assoc. 1938, 33, 526–536. [Google Scholar] [CrossRef]
  2. Fisher, R.A. The Design of Experiments; Oliver and Boyd: Edinburgh, UK, 1935. [Google Scholar]
  3. Berkson, J. Tests of significance considered as evidence. J. Am. Stat. Assoc. 1942, 37, 325–335. [Google Scholar] [CrossRef]
  4. Fisher, R.A. Statistical Methods for Research Workers; Oliver and Boyd: Edinburgh, UK, 1925. [Google Scholar]
  5. Fisher, R.A. Note on Dr. Berkson’s criticism of tests of significance. J. Am. Stat. Assoc. 1943, 38, 103–104. [Google Scholar] [CrossRef]
  6. Berkson, J. Experience with Tests of Significance: A Reply to Professor R. A. Fisher. J. Am. Stat. Assoc. 1943, 38, 242–246. [Google Scholar] [CrossRef]
  7. Spanos, A. Mis-Specification Testing in Retrospect. J. Econ. Surv. 2018, 32, 541–577. [Google Scholar] [CrossRef]
  8. Lindley, D.V. A statistical paradox. Biometrika 1957, 44, 187–192. [Google Scholar] [CrossRef]
  9. Spanos, A. Who Should Be Afraid of the Jeffreys-Lindley Paradox? Philos. Sci. 2013, 80, 73–93. [Google Scholar] [CrossRef]
  10. Lehmann, E.L. Significance level and power. Ann. Math. Stat. 1958, 29, 1167–1176. [Google Scholar] [CrossRef]
  11. Cohen, J. The statistical power of abnormal-social psychological research: A review. J. Abnorm. Soc. Psychol. 1962, 65, 145–153. [Google Scholar] [CrossRef]
  12. Freiman, J.A.; Chalmers, T.C.; Smith, H.; Kuebler, R.R. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. N. Engl. J. Med. 1978, 299, 690–694. [Google Scholar] [CrossRef]
  13. Lehmann, E.L. Testing Statistical Hypotheses, 2nd ed.; Wiley: New York, NY, USA, 1986. [Google Scholar]
  14. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Lawrence Erlbaum: Hoboken, NJ, USA, 1988. [Google Scholar]
  15. Good, I.J. Standardized tail-area probabilities. J. Stat. Comput. Simul. 1982, 16, 65–66. [Google Scholar] [CrossRef]
  16. Spanos, A. Where Do Statistical Models Come From? Revisiting the Problem of Specification. In Optimality: The Second Erich L. Lehmann Symposium; Rojo, J., Ed.; Lecture Notes-Monograph Series; Institute of Mathematical Statistics: Beachwood, OH, USA, 2006; Volume 49, pp. 98–119. [Google Scholar]
  17. Spanos, A. Introduction to Probability Theory and Statistical Inference: Empirical Modeling with Observational Data, 2nd ed.; Cambridge University Press: Cambridge, UK, 2019. [Google Scholar]
  18. Spanos, A. Statistical Misspecification and the Reliability of Inference: The simple t-test in the presence of Markov dependence. Korean Econ. Rev. 2009, 25, 165–213. [Google Scholar]
  19. Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. 1922, 222, 309–368. [Google Scholar]
  20. McCullagh, P. What is a statistical model? Ann. Stat. 2002, 30, 1225–1267. [Google Scholar] [CrossRef]
  21. Spanos, A. Statistical Adequacy and the Trustworthiness of Empirical Evidence: Statistical vs. Substantive Information. Econ. Model. 2010, 27, 1436–1452. [Google Scholar] [CrossRef]
  22. Rao, C.R. Statistics: Reflections on the Past and Visions for the Future. Amstat. News 2004, 327, 2–3. [Google Scholar]
  23. Spanos, A. Frequentist Model-based Statistical Induction and the Replication crisis. J. Quant. Econ. 2022, 20, 133–159. [Google Scholar] [CrossRef]
  24. Neyman, J.; Pearson, E.S. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. 1933, 231, 289–337. [Google Scholar]
  25. Spanos, A. How the Post-data Severity Converts Testing Results into Evidence for or Against Pertinent Inferential Claims. Entropy 2023. under review. [Google Scholar]
  26. Spanos, A. Severity and Trustworthy Evidence: Foundational Problems versus Misuses of Frequentist Testing. Philos. Sci. 2022, 89, 378–397. [Google Scholar] [CrossRef]
  27. Mayo, G.D.; Spanos, A. Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction. Br. J. Philos. Sci. 2006, 57, 323–357. [Google Scholar] [CrossRef]
  28. Mayo, D.G.; Spanos, A. Error Statistics. In The Handbook of Philosophy of Science; Gabbay, D., Thagard, P., Woods, J., Eds.; Elsevier: Amsterdam, The Netherlands, 2011; Volume 7: Philosophy of Statistics, pp. 151–196. [Google Scholar]
  29. Ellis, P.D. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
  30. Fisher, R.A. Statistical methods and scientific induction. J. R. Stat. Soc. Ser. B Stat. Methodol. 1955, 17, 69–78. [Google Scholar] [CrossRef]
  31. Fisher, R.A. Two new properties of mathematical likelihood. Proc. R. Soc. Lond. Ser. 1934, 144, 285–307. [Google Scholar]
  32. Pesko, M.F.; Warman, C. Re-exploring the early relationship between teenage cigarette and e-cigarette use using price and tax changes. Health Econ. 2022, 31, 137–153. [Google Scholar] [CrossRef] [PubMed]
  33. Abouk, R.; Adams, S.; Feng, B.; Maclean, J.C.; Pesko, M. The Effects of e-cigarette taxes on pre-pregnancy and prenatal smoking. NBER Work. Pap. 2022, 26126, Revised June 2022. Available online: https://www.nber.org/system/files/workingpapers/w26126/w26126.pdf (accessed on 5 October 2023).
  34. Thompson, B. Foundations of Behavioral Statistics: An Insight-Based Approach; Guilford Press: New York, NY, USA, 2006. [Google Scholar]
  35. Abelson, R.P. Statistics as Principled Argument; Lawrence Erlbaum: Hoboken, NJ, USA, 1995. [Google Scholar]
  36. Spanos, A. Bernoulli’s golden theorem in retrospect: Error probabilities and trustworthy evidence. Synthese 2021, 199, 13949–13976. [Google Scholar] [CrossRef]
  37. Spanos, A. Revisiting noncentrality-based confidence intervals, error probabilities and estimation-based effect sizes. J. Math. Psychol. 2021, 104, 102580. [Google Scholar] [CrossRef]
  38. Owen, D.B. Survey of Properties and Applications of the Noncentral t-Distribution. Technometrics 1968, 10, 445–478. [Google Scholar] [CrossRef]
  39. Kraemer, H.C.; Paik, M. A central t approximation to the noncentral t distribution. Technometrics 1979, 21, 357–360. [Google Scholar] [CrossRef]
  40. Wasserstein, R.L.; Schirm, A.L.; Lazar, N.A. Moving to a world beyond “p < 0.05”. Am. Stat. 2019, 73, 1–19. [Google Scholar]
  41. Benjamin, D.J.; Berger, J.O.; Johannesson, M.; Nosek, B.A.; Wagenmakers, E.J.; Berk, R.; Bollen, K.A.; Brembs, B.; Brown, L.; Camerer, C.; et al. Redefine statistical significance. Nat. Hum. Behav. 2018, 2, 6–10. [Google Scholar] [CrossRef]
Figure 1. The p-value curve for different sample sizes n.
Figure 2. The power curve for different sample sizes n.
Figure 3. The post-data severity curve (accept H_0).
Figure 4. The severity curve (accept H_0) for different n (same estimates).
Figure 5. The severity curve (reject H_0) for different n (same estimates).
Table 1. The p-value as n increases (keeping (x̄_n − μ_0)/s_n constant).

n:          100      120     150     300      500       1000         2000         10,000
τ_n(x_0):   1.633    1.789   2.0     2.829    3.652     5.165        7.304        16.332
p_n(x_0):   0.0528   0.038   0.024   0.0025   0.00014   0.00000015   0.2×10^-12   0.0000…
Table 2. Discrepancy γ_1 detected with P(μ_1) = 0.8 as n increases.

n:     100     200     500     1000    10,000   100,000   1,000,000   20,000,000
γ_1:   0.486   0.344   0.217   0.154   0.0485   0.01535   0.00486     0.0034
Table 3. The p-value with increasing n (constant estimates).

n:          100     500     1000     2000    10,000   10×10^4   10×10^5   20×10^5   20×10^6   24×10^6
τ_n(x_0):   0.004   0.009   0.0127   0.018   0.040    0.127     0.402     0.569     1.798     1.970
p_n(x_0):   0.997   0.993   0.990    0.986   0.967    0.899     0.688     0.570     0.072     0.049
Table 4. Post-data severity evaluation (SEV) for μ_1 = 2.0 + γ.

γ:              0.05    0.1     0.15    0.20    0.30    0.317   0.40    0.481   0.60    0.70
μ_1:            2.05    2.1     2.15    2.2     2.3     2.317   2.4     2.481   2.6     2.7
SEV(μ ≤ μ_1):   0.086   0.133   0.196   0.274   0.465   0.500   0.665   0.800   0.926   0.974
Table 5. Post-data severity (SEV(γ) = 0.8, μ_1 = 2 + γ_1, x̄_n = 2.317, s = 1.941).

n:             100     120     150     200     300     500     1000    2000     20,000   200,000
τ_n(x_0):      1.633   1.789   2.0     2.310   2.829   3.652   5.165   7.304    23.097   73.038
μ_1 = 2 + γ:   2.481   2.467   2.451   2.434   2.412   2.390   2.369   2.3535   2.329    2.321
Table 6. Post-data severity evaluation (SEV) for μ_1 = 2 + γ.

γ:              0.1     0.2     0.3     0.4     0.5     0.562   0.6     0.726   0.8     0.9
μ_1:            2.1     2.2     2.3     2.4     2.5     2.562   2.6     2.726   2.8     2.9
SEV(μ > μ_1):   0.999   0.996   0.984   0.951   0.876   0.800   0.741   0.500   0.352   0.186
