How the Post-Data Severity Converts Testing Results into Evidence for or against Pertinent Inferential Claims

The paper makes a case that the current discussions on replicability and the abuse of significance testing have overlooked a more general contributor to the untrustworthiness of published empirical evidence, which is the uninformed and recipe-like implementation of statistical modeling and inference. It is argued that this contributes to the untrustworthiness problem in several different ways, including [a] statistical misspecification, [b] unwarranted evidential interpretations of frequentist inference results, and [c] questionable modeling strategies that rely on curve-fitting. What is more, the alternative proposals to replace or modify frequentist testing, including [i] replacing p-values with observed confidence intervals and effect sizes, and [ii] redefining statistical significance, will not address the untrustworthiness of evidence problem since they are equally vulnerable to [a]-[c]. The paper calls for distinguishing unduly data-dependent 'statistical results', such as a point estimate, a p-value, and accept/reject H 0, from 'evidence for or against inferential claims'. The post-data severity (SEV) evaluation of the accept/reject H 0 results converts them into evidence for or against germane inferential claims. These claims can be used to address/elucidate several foundational issues, including (i) statistical vs. substantive significance, (ii) the large n problem, and (iii) the replicability of evidence. Also, the SEV perspective sheds light on the impertinence of the proposed alternatives [i]-[ii], and oppugns the alleged arbitrariness of framing H 0 and H 1, which is often exploited to undermine the credibility of frequentist testing.


Introduction
The replication crisis has dominated discussions of empirical evidence and its trustworthiness in scientific journals for the last two decades. The broad agreement is that the non-replicability of such empirical evidence provides prima facie evidence of its untrustworthiness; see National Academy of Sciences [1], Wasserstein and Lazar [2], Baker [3], Hoffler [4]. A statistical study is said to be replicable if its empirical results can be independently confirmed, with very similar or consistent results, by other researchers using akin data and modeling the same phenomenon of interest.
Using the Medical Diagnostic Screening (MDS) perspective on Neyman-Pearson (N-P) testing, Ioannidis [5] declares that "... most published research findings are false", attributing the untrustworthiness of evidence to several abuses of frequentist testing, such as p-hacking, multiple testing, cherry-picking and low power. This diagnosis is anchored on apparent analogies between the type I/II error probabilities and the false negative/positive probabilities of the MDS model, tracing the untrustworthiness to ignoring the Bonferroni-type adjustments needed to ensure that the actual error probabilities approximate the nominal ones. In light of that, leading statisticians in different applied fields called for reforms which include replacing p-values with observed Confidence Intervals (CIs), using effect sizes, and redefining statistical significance; see Benjamin et al. [6].
In the discussion that follows, a case is made that Ioannidis' assessment of the untrustworthiness of published empirical findings is largely right. Still, the veracity of viewing the MDS model as a surrogate for N-P testing, and its pertinence in diagnosing the untrustworthiness of evidence problem, are highly questionable. This stems from the fact that the invoked analogies between the type I/II error probabilities and the false negative/positive probabilities of the MDS model are more apparent than real, since the former are hypothetical/unobservable/unconditional and the latter are observable conditional probabilities; see Spanos [7].
A more persuasive case can be made that the untrustworthiness of empirical evidence stems from the broader problem of the uninformed and recipe-like implementation of statistical modeling and inference that contributes to untrustworthy evidence in several interrelated ways, including: [a] Statistical misspecification: invalid probabilistic assumptions imposed (implicitly or explicitly) on one's data, comprising the invoked statistical model M θ (x).
[b] 'Empirical evidence' is often conflated with raw 'inference results', such as point estimates, effect sizes, observed CIs, accept/reject H 0 results, and p-values, giving rise to (i) erroneous evidential interpretations of these results, and (ii) unwarranted claims relating to their replicability.
[c] Questionable modeling strategies that rely on curve-fitting of hybrid models (an amalgam of substantive subject matter and probabilistic assumptions) guided by error term assumptions and evaluated on goodness-of-fit/prediction grounds. The key weakness of this strategy is that excellent goodness-of-fit/prediction is neither necessary nor sufficient for the statistical adequacy of the selected model, since it depends crucially on the invoked loss function, whose choice is based on information other than the data. It can be shown that statistical models chosen on goodness-of-fit/prediction grounds are often statistically misspecified; see Spanos [8].
Viewed in the broader context of [a]-[c], the abuses of frequentist testing represent the tip of the untrustworthy evidence iceberg. This also calls into question the presumption that replicability attests to the trustworthiness of empirical evidence. As argued by Leek and Peng [9]: "... an analysis can be fully reproducible and still be wrong." (p. 1314). For instance, dozens of MBA students confirm the efficient market hypothesis (EMH) on a daily basis because they follow the same uninformed and recipe-like implementation of statistics, unmindful of what it takes to ensure the trustworthiness of the ensuing evidence by addressing the issues [a]-[c]; see Spanos [10].
The primary focus of the discussion that follows is on [b], with brief comments on [a] and [c] that cite relevant published papers. The discussion revolves around the distinction between unduly data-specific 'inference results', such as point estimates, observed CIs, p-values, effect sizes, and the accept/reject H 0 results, and the ensuing inductive generalizations from such results in the form of 'evidence for or against germane inferential claims' framed in terms of the unknown parameters θ. The crucial difference between 'results' and 'evidence' is twofold: (a) the evidence is framed in terms of post-data error probabilities aiming to account for the uncertainty arising from the fact that 'inference results' rely unduly on the particular data x 0 :=(x 1 , x 2 , ..., x n ), which constitutes a single realization X = x 0 of the sample X:=(X 1 , ..., X n ), and (b) the evidence, in the form of warranted inferential claims, enhances learning from data x 0 about the stochastic mechanism that could have given rise to this data.
As a prelude to the discussion that follows, Section 2 provides a brief overview of Fisher's model-based frequentist statistics, with special emphasis on key concepts and pertinent interpretations of inference procedures that are invariably misconstrued by the uninformed and recipe-like implementation of statistical modeling and inference. Section 3 discusses a way to bridge the gap between unduly data-specific inference results and an evidential interpretation of such results, in the form of the post-data severity (SEV) evaluation of the accept/reject H 0 results. The SEV evaluation is used to elucidate and/or address several foundational issues that have bedeviled frequentist testing since the 1930s, including the large n problem, statistical vs. substantive significance, and the replicability of evidence, as opposed to the replicability of statistical results. In Section 4 the SEV evaluation is used to appraise several alternatives to (or modifications of) N-P testing proposed by the replication literature, including replacing the p-value with effect sizes and observed CIs, and redefining statistical significance. Section 5 compares and contrasts the evidential account based on the SEV evaluation with Royall's [11] Likelihood Ratio approach to statistical evidence.

Fisher's Statistical Induction
Fisher [12] pioneered modern frequentist statistics by viewing data x 0 as a typical realization of a prespecified parametric statistical model whose generic form is:

M θ (x) = { f (x; θ), θ∈Θ}, x∈R^n_X, (1)

where Θ and R^n_X denote the parameter and sample space, respectively, and f (x; θ), x∈R^n_X, refers to the (joint) distribution of the sample X. The initial choice (specification) of M θ (x) should be a response to the question: "Of what population is this a random sample?" (Fisher, [12], p. 313), underscoring that: 'the adequacy of our choice may be tested a posteriori' (ibid., p. 314). This can be secured by establishing the statistical adequacy (approximate validity) of M θ (x) using thorough Mis-Specification (M-S) testing; see Spanos [13].
Selecting M θ (x) for data x 0 has a twofold objective (Spanos [14]): (i) M θ (x) is selected with a view to account for the chance regularity patterns exhibited by data x 0 , using appropriate probabilistic assumptions relating to the underlying stochastic process {X t , t∈N:=(1, 2, ..., n, ...)} from three broad categories: Distribution (D), Dependence (M), and Heterogeneity (H).
(ii) M θ (x) is parametrized [θ∈Θ] in a way that can shed light on the substantive questions of interest using data x 0 .When such questions are framed in terms of a substantive model, say M φ (x), φ∈Φ, one needs to bring out the implicit statistical model without restricting its parameters θ, and ensure that θ and φ are related via a set of restrictions g(φ, θ) = 0 connecting φ to the data x 0 via θ.
Example 1. Consider the well-known simple Normal model:

M θ (x): X t ∽ NIID(µ, σ 2 ), θ:=(µ, σ 2 )∈R×R + , t∈N, (2)

where 'X t ∽ NIID' stands for 'X t is Normal (D), Independent (M) and Identically Distributed (H)', R:=(−∞, ∞). It is important to emphasize that M θ (x) revolves around f (x; θ), x∈R^n_X, in (1), since the latter encapsulates all its probabilistic assumptions, f (x; θ) = ∏ n t=1 f (x t ; θ), x∈R^n_X, and provides the cornerstone for all forms of statistical inference.
The primary objective of model-based frequentist inference is to 'learn from data x 0 ' about θ*, where θ* denotes the 'true' θ in Θ. This is shorthand for saying that there exists a θ*∈Θ such that M θ* (x) = f (x; θ*), x∈R^n_X, could have generated x 0 . The main variants of statistical inference in frequentist statistics are: (i) point estimation, (ii) interval estimation, (iii) hypothesis testing, and (iv) prediction. These forms of statistical inference share the following features: (a) They assume that the prespecified statistical model M θ (x) is valid vis-à-vis data x 0 . (b) Their aim is to learn about M θ* (x) using statistical approximations relating to θ*. (c) Their inferences are based on a statistic (estimator, test statistic, predictor), say Y n = g(X), whose sampling distribution, f (y n ; θ), ∀y n ∈R Y (∀ stands for 'for all'), is derived directly from the distribution of the sample f (x; θ), x∈R^n_X, of M θ (x), using two different forms of reasoning with prespecified values of θ: (a) factual (estimation and prediction): presume that θ = θ*, and (b) hypothetical (hypothesis testing): presume that θ = θ 0 or θ = θ 1 for prespecified values θ 0 , θ 1 ∈Θ. The crucial difference between these two forms of reasoning is that factual reasoning does not extend to post-data (after x 0 is known) evaluations relating to evidence, but hypothetical reasoning does. This plays a key role in the following discussion.
The primary role of the sampling distribution of a statistic, f (y n ; θ), ∀y n ∈R Y , is to frame the uncertainty relating to the fact that x 0 is just one realization of X, out of all x∈R^n_X, so as to provide (i) the basis for the optimality of the statistic Y n = g(X), as well as (ii) the relevant error probabilities that 'calibrate' the capacity (optimality) of inferences based on Y n , i.e., how often the inference procedure errs.
The statistical adequacy (approximate validity) of M θ (x) plays a pivotal role in securing the reliability of inference and the trustworthiness of ensuing evidence because it ensures that the nominal optimality (derived by assuming the validity of M θ (x)) is also actual for data x 0 , and secures the approximate equality between the actual (based on x 0 ) and the nominal error probabilities. In contrast, when M θ (x) is statistically misspecified: (a) the joint distribution of the sample f (x; θ), x∈R^n_X, and the likelihood function L(θ; x 0 )∝ f (x 0 ; θ), θ∈Θ, are both erroneous, and (b) all sampling distributions f (y n ; θ) derived by invoking the validity of f (x; θ), x∈R^n_X, will be incorrect, giving rise to (i) 'non-optimal' estimators and (ii) sizeable discrepancies between the actual and nominal error probabilities.
Applying an α = 0.05 significance-level test when the actual type I error probability is 0.97, due to invalid probabilistic assumptions, will yield untrustworthy evidence. Increasing the sample size will often worsen the untrustworthiness by increasing the discrepancy between actual and nominal error probabilities; see Spanos [15], p. 691. Hence, the best way to keep track of the relevant error probabilities is to establish the statistical adequacy of M θ (x). It is important to emphasize that other forms of statistical inference, including Bayesian and Akaike-type model selection procedures, are equally vulnerable to statistical misspecification since they rely on the likelihood function L(θ; x 0 )∝ f (x 0 ; θ), θ∈Θ; see Spanos [16].
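The gap between nominal and actual error probabilities under misspecification can be illustrated with a small Monte Carlo sketch. The setup is hypothetical (the AR(1) dependence, ρ = 0.5, n = 100, and the two-sided z-type test are illustrative assumptions, not taken from the paper): a test whose nominal α is 0.05 under NIID is applied to dependent data, and its actual rejection frequency under the null turns out to be several times larger.

```python
import math
import random

# Hypothetical sketch: a nominal alpha = 0.05 test for H0: mu = 0 presumes
# X_t ~ NIID(mu, sigma^2). When the data are actually AR(1) dependent
# (rho = 0.5), the actual type I error probability diverges from 0.05.
random.seed(1)

def rejects(n, rho):
    # Generate an AR(1) sample with mean 0; the NIID assumption is violated.
    x, xs = 0.0, []
    for _ in range(n):
        x = rho * x + random.gauss(0.0, 1.0)
        xs.append(x)
    mean = sum(xs) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in xs) / (n - 1))
    tau = math.sqrt(n) * mean / s      # test statistic derived under NIID
    return abs(tau) > 1.96             # nominal alpha = 0.05 threshold

reps = 4000
actual = sum(rejects(100, 0.5) for _ in range(reps)) / reps
print(f"nominal type I error: 0.05, actual: {actual:.3f}")
```

With ρ = 0.5 the actual type I error frequency comes out around five times the nominal 0.05, in line with the point that misspecification makes it impossible to keep track of the relevant error probabilities.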
In the discussion that follows, it is assumed that the invoked statistical model M θ (x) is statistically adequate to avoid repetitions and digressions, but see Spanos [17] and [18] on why [a] statistical misspecification calls into question important aspects of the current replication crisis literature.

Frequentist Inference: Estimation
Point estimation revolves around an estimator, say θ n (X), that pinpoints (as closely as possible) θ*. The clause 'as closely as possible' is framed in terms of certain 'optimal' properties stemming from its sampling distribution f (θ n (x); θ), x∈R^n_X, including unbiasedness, efficiency, sufficiency, consistency, etc.; see Casella and Berger [19]. Regrettably, the factual reasoning, presuming θ = θ*, underlying the derivation of the relevant sampling distributions is often left implicit in traditional textbook discussions, resulting in erroneous interpretations and unwarranted claims.
Example 1 (continued). What renders the sample mean X n = (1/n)∑ n t=1 X t an optimal point estimator of µ? The answer is its unbiasedness and full efficiency, both of which are defined at θ = θ*. There is no such thing as a sampling distribution X n ∽ N(µ, σ 2 /n), ∀θ:=(µ, σ 2 )∈R×R + , since the NIID assumptions imply that each element of the sample X comes from a single Normal distribution with a unique mean and variance θ*:=(µ*, σ 2 *), around which all forms of statistical inference revolve. Hence, the claim by Schweder and Hjort [20] that "√n(µ−X n )/s has a fixed distribution regardless of the values of the interest parameter µ and the (in this context) nuisance parameter σ ..." (p. 15), i.e., that √n(µ−X n )/s ∽ St(n−1) for all θ∈Θ, makes no sense from a statistical inference perspective.

Point estimation is very important since it provides the basis for all other forms of optimal inference (CIs, testing, and prediction) via the optimality of a point estimator θ n (X). A point estimate θ n (x 0 ), by itself, however, is considered inadequate for learning from data x 0 since it is unduly data-specific; it ignores the relevant uncertainty stemming from the fact that x 0 constitutes a single realization (out of ∀x∈R^n_X) as framed by the sampling distribution f (θ n (x); θ), x∈R^n_X, of θ n (X); hence θ n (x 0 ) is often reported together with its estimated standard error. Interval estimation accounts for the relevant uncertainty in terms of an error probability of 'overlaying' the true value θ* of θ, based on f (θ n (x); θ), x∈R^n_X, in the form of the Confidence Interval (CI):

CI(X; θ):= [L(X), U(X)], with P(L(X) ≤ θ ≤ U(X); θ = θ*) = 1−α, (7)

where the statistics L(X) and U(X) denote the lower and upper (random) bounds that 'overlay' θ* with probability (1−α). A (1−α) CI is optimal when its expected length E[U(X)−L(X)] is the shortest; such a CI is referred to as Uniformly Most Accurate (UMA); see Lehmann and Romano [21].
Example 1 (continued). Consider testing the hypotheses of interest:

H 0 : µ ≤ µ 0 vs. H 1 : µ > µ 0 , (9)

in the context of (2). An optimal N-P test for the hypotheses in (9) is defined in terms of a test statistic and a rejection region:

T α := {τ(X) = √n(X n −µ 0 )/s, C 1 (α) = {x: τ(x) > c α }}, (10)

whose error probabilities are evaluated using:

τ(X) ∽ St(n−1), presuming that µ = µ 0 , (11)

τ(X) ∽ St(δ 1 ; n−1), δ 1 = √n(µ 1 −µ 0 )/σ, presuming that µ = µ 1 > µ 0 , (12)

where St(n−1) is the Student's t distribution with n−1 degrees of freedom and St(δ 1 ; n−1) is its noncentral counterpart with noncentrality parameter δ 1 . It is important to emphasize that (12) differs from (11) in terms of their mean, variance, and higher moments, rendering (12) non-symmetric for δ 1 ≠ 0; see Owen [22].
The sampling distribution in (11) is used to evaluate the pre-data type I error probability and the post-data [τ(x 0 ) is known] p-value:

α = P(τ(X) > c α ; µ = µ 0 ), p(x 0 ) = P(τ(X) > τ(x 0 ); µ = µ 0 ). (13)

The sampling distribution in (12) is used to evaluate the power of T α :

P (µ 1 ) = P(τ(X) > c α ; µ = µ 1 ), ∀µ 1 > µ 0 , (14)

as well as the type II error probability β(µ 1 ) = P(τ(X) ≤ c α ; µ = µ 1 ) = 1−P (µ 1 ), ∀µ 1 > µ 0 . The test T α in (10) is optimal in the sense of being Uniformly Most Powerful (UMP), i.e., T α is the most effective α-level test for detecting any discrepancy (γ > 0) of interest from µ = µ 0 ; see Lehmann and Romano [21].
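The power evaluation in (14) can be sketched numerically. The sketch below uses a Normal approximation with σ treated as known (a z-type simplification of the noncentral Student's t in (12)); the numbers (α = 0.05 one-sided, γ = 0.2, σ = 1) are illustrative assumptions, not taken from the paper.

```python
import math

def Phi(z):
    # Standard Normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(n, gamma, sigma=1.0, c_alpha=1.645):   # alpha = 0.05, one-sided
    # Normal-approximation power at mu1 = mu0 + gamma:
    # P(tau(X) > c_alpha; mu = mu1) = P(Z > c_alpha - delta1),
    # delta1 = sqrt(n) * gamma / sigma (the noncentrality parameter).
    delta1 = math.sqrt(n) * gamma / sigma
    return 1.0 - Phi(c_alpha - delta1)

# Power increases monotonically with sqrt(n) and gamma, decreases with sigma.
print(power(100, 0.2))              # moderate discrepancy
print(power(400, 0.2))              # same gamma, larger n: higher power
print(power(100, 0.2, sigma=2.0))   # larger sigma: lower power
```

The three calls trace out exactly the monotonicity properties of the noncentrality parameter δ 1 discussed below.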
Why prespecify α at a low threshold, such as α = 0.05? Neyman and Pearson [23] put forward two crucial stipulations relating to the framing of H 0 : θ∈Θ 0 vs. H 1 : θ∈Θ 1 , Θ i ⊂Θ, i = 0, 1, to ensure the effectiveness of N-P testing and the informativeness of its results: [1] Θ 0 and Θ 1 should form a partition of Θ; [2] H 0 and H 1 should be framed in such a way so as to ensure that the type I error is the more serious of the two.
To provide some intuition for [2], they use the analogy with a criminal trial where, to ensure [2], one should use the framing H 0 : not guilty vs. H 1 : guilty, to render the type I error of sending an innocent person to prison more serious than acquitting a guilty person (p. 296). Hence, prespecifying α at a small value and maximizing the power over ∀µ ≥ µ 1 = µ 0 +γ 1 , γ 1 ≥ 0, requires deliberation about the framing. A moment's reflection suggests that stipulation [2] implies that high power is needed around the potential neighborhood of θ*. Regrettably, stipulations [1]-[2] are often ignored, undermining the proper implementation and effectiveness of N-P testing; see Section 4.1.
Returning to the power of T α , the noncentrality parameter δ 1 indicates that the power increases monotonically with √n and (µ 1 −µ 0 ), and decreases with σ. This suggests that the inherent trade-off between the type I and II error probabilities in N-P testing, in conjunction with the sample size n and α, plays a crucial role in determining the capacity of an N-P test. This means that the selection of the significance level α should always take into account the particular n for data x 0 , since an uninformed choice of α can give rise to two problems.
The small n problem. This arises when the sample size n is not large enough to generate any learning from data about θ*, since the test has insufficient power to detect particular discrepancies γ of interest. To avoid underpowered tests, the formula in (14) can be used pre-data (before τ(x 0 ) is known) to evaluate the sample size n necessary for T α to detect such discrepancies with high enough probability (power). That is, for a given α, there is always a small enough n that would accept H 0 despite the presence of a sizeable discrepancy γ ≠ 0 of interest. This also undermines the M-S testing used to evaluate the statistical adequacy of the invoked M θ (x), since a small n will ensure that M-S tests do not have sufficient power to detect existing departures from the model assumptions; see Spanos [18].
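The pre-data sample-size evaluation described above can be sketched as follows, again under the Normal approximation with known σ (the target power 0.90, the α = 0.05 one-sided level, and the discrepancies are hypothetical numbers for illustration).

```python
import math

# Pre-data sketch: choose n so that the one-sided test detects a discrepancy
# gamma of interest with power at least 0.90. Inverting the Normal-approximation
# power formula gives n >= ((z_alpha + z_beta) * sigma / gamma)^2.
def required_n(gamma, sigma=1.0, z_alpha=1.645, z_beta=1.282):
    # z_alpha: alpha = 0.05 one-sided; z_beta: power 0.90
    return math.ceil(((z_alpha + z_beta) * sigma / gamma) ** 2)

print(required_n(0.5))   # sizeable discrepancy: a modest n suffices
print(required_n(0.1))   # small discrepancy: a much larger n is needed
```

The second call shows how quickly the required n grows as the discrepancy of interest shrinks, which is the flip side of the small n problem.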
The large n problem. This arises when a practitioner uses conventional significance levels, say α = 0.1, 0.05, 0.025, 0.01, for very large sample sizes, say n > 10,000. The source of this problem is that, for a given α, as n increases the power of a test increases and the p-value decreases, giving rise to over-sensitive tests. Fisher [24] explained why: "By increasing the size of the experiment [n], we can render it more sensitive, meaning by this that it will allow of the detection of ... quantitatively smaller departures from the null hypothesis." (pp. 21–22). Hence, for a given α, there is always a large enough n that would reject H 0 for any discrepancy γ ≠ 0 (however small, say γ = 0.0000001) from a null value θ = θ 0 ; see Spanos [25].
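The large n problem can be made concrete with a short sketch (hypothetical numbers, Normal approximation with known σ): hold the observed discrepancy fixed at a substantively trivial x̄ n − µ 0 = 0.01 (σ = 1) and let n grow; the p-value of the one-sided test collapses toward zero.

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Fixed, substantively trivial discrepancy of 0.01 (sigma = 1): as n grows,
# tau(x0) = sqrt(n) * 0.01 grows and the one-sided p-value shrinks to zero.
ns = (100, 10_000, 1_000_000, 100_000_000)
pvals = [1.0 - Phi(math.sqrt(n) * 0.01) for n in ns]
for n, p in zip(ns, pvals):
    print(n, p)
```

At n = 100 the discrepancy is statistically invisible; by n = 1,000,000 the p-value is numerically zero, so any conventional α rejects H 0 .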
It is very important to emphasize at the outset that the pre-data testing error probabilities (type I, II, and power) are (Spanos [7]): (i) hypothetical and unobservable in principle, since they revolve around θ*, (ii) not conditional on values of θ∈Θ, since 'presuming θ = θ i ', i = 0, 1, constitutes neither an event nor a random variable, and (iii) assigned to the test procedure T α to 'calibrate' its generic (for any x∈R^n_X) capacity to detect different discrepancies γ from µ = µ 0 for a prespecified α.
As mentioned above, the cornerstone of N-P testing is the in-built trade-off between the type I and II error probabilities, which Neyman and Pearson [23] addressed by prespecifying α at a low value and maximizing P (µ 1 ), ∀µ 1 = µ 0 +γ 1 ∈Θ 1 , γ 1 > 0, in seeking an optimal test; see Lehmann and Romano [21]. The primary role of the testing error probabilities is to operationalize the notions of 'statistically significant/insignificant' in terms of statistical approximations relating to θ*, framed in terms of the sampling distribution of a test statistic τ(X).
This relates directly to the replication crisis since, for a misspecified M θ (x), one cannot keep track of the relevant error probabilities to be able to adjust them for p-hacking, data-dredging, multiple testing and cherry-picking, in light of the fact that the actual error probabilities will be different from the nominal ones; see Spanos and McGuirk [26].

Statistical Inference 'Results' vs. 'Evidence' for or against Inferential Claims
Statistical results, such as a point estimate, say θ(x 0 ), an observed (1−α) CI, say [L(x 0 ), U(x 0 )], an effect size, a p-value, and the accept/reject H 0 results, are not replicable in principle, in the sense that akin data do not often yield very similar numbers, since they are unduly data-specific when contrasted with broader inferential claims relating to inductive generalizations stemming from such results. In particular, the accept/reject H 0 results (i) are unduly data-specific, (ii) are too coarse to provide informative enough evidence relating to θ*, and (iii) depend crucially on the particular statistical context:

(M θ (x), T α :={τ(X), C 1 (α)}, x 0 , n), (15)

which includes the statistical adequacy of M θ (x) as well as the sample size n.

Example 1 (continued). It is often erroneously presumed that the optimality of the point estimators of µ and σ 2 can justify the following inferential claims for the particular data x 0 when n is large enough:

(a) θ(x 0 ) ≃ θ*, i.e., µ(x 0 ) ≃ µ* and σ 2 (x 0 ) ≃ σ 2 *. (16)
(b) The inferential claim associated with a (1−α) optimal CI for µ in (7) relates to CI(X; µ) overlaying µ* with probability (1−α), but its optimality does not justify the claim that the observed CI:

[L(x 0 ), U(x 0 )] (17)

overlays µ* with probability (1−α). As argued by Neyman [28]: "... valid probability statements about random variables usually cease to be valid if the random variables are replaced by their particular values." (p. 288). In terms of the underlying factual reasoning, post-data x 0 has been revealed, but it is unknowable whether µ* is within or outside (17); see Spanos [29]. Indeed, one can make a case that the widely held impression that an effect size ([30]) provides more reliable information about the 'scientific effect' than p-values and observed CIs stems from the unwarranted inferential claim in (16), i.e., that an optimal estimator θ(X) of θ justifies the inferential claim θ(x 0 ) ≃ θ* for n large enough.
(c) The N-P 'accept H 0 ' result with a large p-value, and the 'reject H 0 ' result with a small p-value, do not entail evidence for H 0 or H 1 , respectively, since such evidential interpretations are fallacious; see Mayo and Spanos [14].
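Neyman's point in (b), that the (1−α) coverage attaches to the CI procedure over repeated samples and not to a single observed interval, can be illustrated by a Monte Carlo sketch (hypothetical numbers: µ* = 2, σ = 1 known, n = 50).

```python
import math
import random

# Hypothetical sketch: the (1-alpha) coverage is a property of the CI
# *procedure* over repeated samples; any single observed interval
# [L(x0), U(x0)] either contains mu* or it does not.
random.seed(7)
mu_star, sigma, n, z = 2.0, 1.0, 50, 1.96   # alpha = 0.05, sigma known

def observed_ci():
    xbar = sum(random.gauss(mu_star, sigma) for _ in range(n)) / n
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

reps = 5000
coverage = sum(lo <= mu_star <= hi
               for lo, hi in (observed_ci() for _ in range(reps))) / reps
print(f"empirical coverage: {coverage:.3f}")   # close to the nominal 0.95
```

About 95% of the simulated intervals overlay µ*, but for the one interval a practitioner actually observes, the probability statement has already 'ceased to be valid'.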

Accept/Reject H 0 Results vs. Evidence for or against Inferential Claims
Bridging the gap between the binary accept/reject H 0 results and learning from data x 0 about θ*, using statistical approximations framed in terms of the sampling distribution of a test statistic τ(X), has long been confounding frequentist testing. Mayo and Spanos [31] proposed the post-data severity (SEV) evaluation of the accept/reject H 0 results as a way to convert them into evidence for or against germane inferential claims. The SEV differs from other attempts to address this issue in so far as: (i) The SEV evaluation constitutes a principled argument framed in terms of a germane inferential claim relating to θ* (learning from data x 0 ).
(ii) The SEV evaluation is guided by the sign and magnitude of the observed test statistic, τ(x 0 ), and not by the prespecified significance level α; see Spanos [32].
(iii) The SEV evaluation accounts fully for the relevant statistical context in (15). (iv) Its germane inferential claim, in the form of a discrepancy from the null value, is warranted with high probability by x 0 and T α when all the different ways it can be false have been adequately probed and forfended (Mayo [33]).
The most crucial way to forfend a false accept/reject H 0 result is to ensure that M θ (x) is statistically adequate for data x 0 , before any inferences are drawn.This is because the discrepancies induced by invalid probabilistic assumptions will render impossible the task of controlling (keeping track of) the relevant error probabilities in terms of which N-P tests are framed.Hence, for the discussion that follows it is assumed that M θ (x) is statistically adequate for the particular data x 0 .
Example 2. Consider the simple Bernoulli (Ber) model:

M θ (x): X t ∽ BerIID(θ, θ(1−θ)), θ∈(0, 1), t∈N, (18)

where E(X t ) = θ and Var(X t ) = θ(1−θ). Let the hypotheses of interest be:

H 0 : θ ≤ θ 0 vs. H 1 : θ > θ 0 , (19)

in the context of (18). It can be shown that the t-type test:

T > α := {d(X) = √n(X n −θ 0 )/√(θ 0 (1−θ 0 )), C 1 (α) = {x: d(x) > c α }}, (20)

where X n = (1/n)∑ n t=1 X t , is Uniformly Most Powerful (UMP); see Lehmann and Romano [21]. The sampling distribution of d(X), evaluated under H 0 (hypothetical), is:

d(X) ∽ Bin(0, 1; n), presuming that θ = θ 0 , (21)

where Bin(0, 1; n) denotes the 'standardized' Binomial distribution; for n ≥ 40 it can be approximated (≃) by the N(0, 1). The latter can be used to evaluate the type I error probability and the p-value:

α = P(d(X) > c α ; θ = θ 0 ), p(x 0 ) = P(d(X) > d(x 0 ); θ = θ 0 ). (22)

The sampling distribution of d(X) evaluated under H 1 (hypothetical) is:

d(X) ∽ Bin(δ 1 , V(θ 1 ); n), δ 1 = √n(θ 1 −θ 0 )/√(θ 0 (1−θ 0 )), V(θ 1 ) = θ 1 (1−θ 1 )/(θ 0 (1−θ 0 )), presuming that θ = θ 1 , (23)

whose tail area probabilities can be approximated using:

d(X) ≃ N(δ 1 , V(θ 1 )), (24)

which is used to derive the type II error probability and the power of the test T > α in (20); the power increases monotonically with √n and (θ 1 −θ 0 ) and decreases with V(θ 1 ).
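The test in (20) with its N(0, 1) approximation in (21)-(22) can be sketched in a few lines. The counts used here (530 'successes' out of n = 1000) are hypothetical, chosen only to illustrate the computation.

```python
import math

def Phi(z):
    # Standard Normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bernoulli_test(successes, n, theta0=0.5):
    # Test statistic d(x0) of (20) and its one-sided p-value, using the
    # N(0, 1) approximation to the standardized Binomial (valid for n >= 40).
    xbar = successes / n
    d = math.sqrt(n) * (xbar - theta0) / math.sqrt(theta0 * (1 - theta0))
    p_value = 1.0 - Phi(d)            # one-sided: H1: theta > theta0
    return d, p_value

d, p = bernoulli_test(530, 1000)      # hypothetical counts
print(f"d(x0) = {d:.3f}, p-value = {p:.4f}")
```

With these hypothetical counts the test rejects H 0 at α = 0.05 but not at α = 0.01, which already hints at how much the binary result depends on the choice of α.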
The post-data severity (SEV) evaluation transforms the 'accept/reject H 0 results' into 'evidence' for or against germane inferential claims framed in terms of θ.The post-data severity evaluation is defined as follows: A hypothesis H (H 0 or H 1 ) passes a severe test T α with data x 0 if: (C-1) x 0 accords with H, and (C-2) with very high probability, test T α would have produced a result that 'accords less well' with H than x 0 does, if H were false; see Mayo and Spanos [31,34].
Example 2 (continued). Consider data x 0 referring to newborns during 1995 in Cyprus: 5152 boys (X = 1) and 4717 girls (X = 0). In this case, there is no reason to question the validity of the IID probabilistic assumptions, since nature ensures their validity when such data are collected over a sufficiently long period of time in a particular locality. Applying the optimal test T > α in (20) with α = 0.01 (large n) yields d(x 0 ) > c α with a tiny p-value, i.e., reject H 0 : θ ≤ 0.5. Broadly speaking, this result indicates that the 'true' value θ* of θ lies within the interval (0.5, 1), which is too coarse to engender any learning about θ*. The post-data severity outputs a germane evidential claim that revolves around a discrepancy γ‡ warranted by test T > α and data x 0 with high probability. In contrast to pre-data testing error probabilities (type I, II, and power), severity is a post-data error probability that uses additional information in the form of the sign and magnitude of d(x 0 ), but shares with the former the underlying hypothetical reasoning: presuming that θ = θ 1 , ∀θ 1 ∈Θ 1 .
[C-2] revolves around the event 'outcomes x that accord less well with θ > θ 1 than x 0 does', i.e., the event {x: d(x) ≤ d(x 0 )}, ∀x∈{0, 1}^n, and its probability:

SEV(T > α ; x 0 ; θ > θ 1 ) = P(d(X) ≤ d(x 0 ); θ = θ 1 ), ∀θ 1 ∈Θ 1 , (25)

stemming from (23). This severity curve, ∀θ 1 ∈Θ 1 = (0.5, 0.53), is depicted in Figure 1. In the case of 'reject H 0 ', the objective is to evaluate the largest discrepancy θ 1 = θ 0 +γ 1 such that any θ less than θ 1 would very probably, at least 0.9, have resulted in a smaller observed difference, warranted by test T > α and data x 0 :

γ‡ 1 = max{γ 1 : SEV(T > α ; x 0 ; θ > θ 0 +γ 1 ) ≥ 0.9}. (26)

Like all error probabilities, the SEV evaluation is always attached to the procedure itself as it pertains to the inferential claim θ > θ 1 . The inferential claim γ‡ 1 ≤ 0.01557, warranted with probability 0.9, however, can be 'informally' interpreted as evidence for a germane neighborhood of θ*, θ 1 = 0.51557±ε, for some ε ≥ 0, arising from the SEV evaluation narrowing down the coarse θ*∈(0.5, 1) associated with the 'reject H 0 ' result. This narrowing down of the potential neighborhood of θ* enhances learning from data.
It is also important to emphasize that the SEV evaluation of the inferential claim θ > θ 1 = θ 0 +γ 1 with discrepancy γ 1 = 0.0223, based on x n = 0.5223, gives rise to SEV(T > α ; x 0 ; θ > 0.5223) = 0.5, which implies that there is no evidence for the claim θ > 0.5223. More generally, the SEV evaluation will invariably provide evidence against inferential claims θ > θ 1 for θ 1 ≥ x n . Hence the importance of distinguishing between 'statistical results', such as x n = 0.5223, and 'evidence' for or against inferential claims relating to x n .
What is the nature of the evidence the post-data severity (SEV) evaluation gives rise to? Since the objective of inference is to learn from data about phenomena of interest via learning about θ*, the evidence from the SEV comes in the form of an inferential claim that revolves around the discrepancy γ 1 warranted by the particular data and test with high enough probability, pinpointing the neighborhood of θ* as closely as possible. In the above case, the warranted discrepancy is γ‡ 1 ≤ 0.01557, or equivalently θ* ≤ 0.51557, with probability 0.9. Although all probabilities are assigned to the inference procedure itself as it relates to the inferential claim θ > θ 1 = θ 0 +γ 1 , γ 1 > 0, the SEV evaluation can be viewed intuitively as narrowing the coarse 'reject H 0 ' result entailing θ*∈(0.5, 1) down to θ*∈(0.512, 0.5156).
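The SEV computation above can be sketched numerically for the Cyprus counts (5152 boys, 4717 girls), using the Normal approximation in (24) and a simple bisection to locate the warranted discrepancy; the exact value depends on the approximation used, but comes out close to the paper's 0.01557.

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# SEV sketch for the Cyprus data, under the Normal approximation:
# SEV(theta > theta1) = P(d(X) <= d(x0); theta = theta1)
#                     = Phi(sqrt(n)*(xbar - theta1)/sqrt(theta1*(1-theta1))).
boys, girls = 5152, 4717
n = boys + girls
xbar = boys / n

def sev(theta1):
    return Phi(math.sqrt(n) * (xbar - theta1) / math.sqrt(theta1 * (1 - theta1)))

# Bisection for the largest theta1 = theta0 + gamma with SEV >= 0.9;
# sev() decreases monotonically from ~1 at theta0 = 0.5 to 0.5 at xbar.
lo, hi = 0.5, xbar
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if sev(mid) >= 0.9 else (lo, mid)
gamma = lo - 0.5
print(f"warranted discrepancy gamma ~ {gamma:.5f}")  # close to the paper's 0.01557
```

The bisection simply walks down the severity curve of Figure 1 until it crosses the 0.9 threshold.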
The most important attributes of the SEV evaluation are: [i] It is a post-data error probability stemming from hypothetical reasoning that takes into account the statistical context in (15) and is guided by d(x 0 ) ≷ 0.

The Robustness of the Post-Data Severity Evaluation
To exemplify the robustness of the SEV evaluation with respect to changing the null value θ 0 , consider replacing θ 0 = 0.5 in (19) with the Nicholas Bernoulli value θ 0 = (18/35) ≃ 0.5143.
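This robustness can be sketched numerically (a Normal-approximation sketch using the Cyprus counts from Example 2; the bisection tolerance and the 0.9 severity threshold are the working assumptions): changing θ 0 relocates the hypotheses and hence the warranted discrepancy γ‡ = θ 1 −θ 0 , but the warranted neighborhood θ 1 of θ* itself is unchanged, since the severity curve depends on the data only through x̄ n and n.

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Cyprus counts from Example 2
boys, girls = 5152, 4717
n = boys + girls
xbar = boys / n

def warranted_theta1(theta0, sev_level=0.9):
    # Largest theta1 with SEV(theta > theta1) >= sev_level (bisection).
    # Under the Normal approximation, SEV(theta > theta1) =
    # Phi(sqrt(n)*(xbar - theta1)/sqrt(theta1*(1-theta1))), which does not
    # involve theta0 at all.
    def sev(theta1):
        return Phi(math.sqrt(n) * (xbar - theta1) / math.sqrt(theta1 * (1 - theta1)))
    lo, hi = theta0, xbar
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if sev(mid) >= sev_level else (lo, mid)
    return lo

t1_half = warranted_theta1(0.5)        # theta0 = 0.5
t1_bern = warranted_theta1(18 / 35)    # Nicholas Bernoulli value
print(t1_half, t1_bern)                # both pinpoint the same theta1
```

Both null values lead to the same warranted neighborhood of θ*, around 0.5156, illustrating why the SEV output is robust to the framing of θ 0 .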

Post-Data Severity and the Replicability of Evidence
The post-data severity evaluation of the accept/reject H 0 results can also provide a more robust way to evaluate the replicability of empirical evidence, based on comparing the discrepancies γ from the null value θ = θ 0 warranted by similar data with high enough severity. To illustrate this, consider the following example, which uses similar data x 1 from a different country more than three centuries apart.
Example 3. Data x 1 refer to newborns during 1668 in London (England): 6073 boys and 5560 girls, n = 11,633; see Arbuthnot [35]. The optimal test in (20) yields d(x 1 ) = 4.756, with p(x 1 ) = 0.0000001, rejecting H 0 : θ ≤ 0.5. This result is almost identical to the one based on the data from Cyprus for 1995, but the question is whether the latter can be viewed as a successful replication with trustworthy evidence.
Evaluating the post-data severity at the same probability, SEV(T > α ; x 1 ; θ > θ 1 ) = 0.9, the discrepancy from θ 0 = 0.5 warranted by test T > α and data x 1 is γ‡ 1 ≤ 0.01516, which is very close to γ‡ 0 ≤ 0.01557; a fourth-decimal difference. As shown in Figure 2, the severity curves for data x 0 and x 1 almost coincide. This suggests that, for a statistically adequate M θ (x), the post-data severity could provide a more robust measure of replicability associated with trustworthy evidence than point estimates, effect sizes, observed CIs, or p-values. Indeed, it can be argued that the discrepancy γ warranted with high probability provides a much more robust testing-based effect size for the scientific effect; see Spanos [7].
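The replicability comparison can be sketched as follows, using the two sets of counts given in the text and the Normal approximation. The exact values depend on the approximation used and may differ slightly from those reported above, but the two warranted discrepancies come out very close to each other, consistent with the near-coinciding severity curves in Figure 2.

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def warranted_gamma(successes, n, theta0=0.5, sev_level=0.9):
    # Largest gamma with SEV(theta > theta0 + gamma) >= sev_level,
    # using the Normal approximation and bisection.
    xbar = successes / n
    def sev(theta1):
        return Phi(math.sqrt(n) * (xbar - theta1) / math.sqrt(theta1 * (1 - theta1)))
    lo, hi = theta0, xbar
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if sev(mid) >= sev_level else (lo, mid)
    return lo - theta0

g_cyprus = warranted_gamma(5152, 9869)    # Cyprus, 1995
g_london = warranted_gamma(6073, 11633)   # London, 1668 (Arbuthnot)
print(g_cyprus, g_london)
```

Comparing warranted discrepancies, rather than p-values or point estimates, is what makes the replication judgment robust to differences in n between the two studies.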

Statistical vs. Substantive Significance
The post-data severity evaluation can also address this problem by relating the discrepancy γ‡ from θ0 (θ1 = θ0 ± γ‡) warranted by test Tα and data x0 with high probability, to the substantively (human biology) determined value φ♦. For that, one needs to supplement the statistical information in data x0 with reliable substantive subject-matter information to evaluate the 'scientific effect'.

Post-Data Severity and the Large n Problem
To alleviate the large n problem, some statistics textbooks advise practitioners to keep reducing α as n increases beyond n > 200; see Lehmann and Romano [21]. A less arbitrary method is to agree that, say, α = 0.05 for n = 100 seems a reasonable standard, and then to modify Good's [37] standardization of the p-values into thresholds α(n) using the formula α(n) = 0.05/√(n/100), n ≥ 50, as shown in Table 2. This standardization is also ad hoc since (i) it depends on an agreed standard, (ii) it relies on a simple scaling, and (iii) for n ≥ 1×10^8 the implied thresholds are tiny. The post-data severity evaluation of the accept/reject H0 results can be used to shed light on the link between α and n. Let us return to example 1 and assume that n = 1000 is large enough (a) to establish the statistical adequacy of the simple Bernoulli model in (18), (b) to avoid the small n problem, and (c) to provide a reliable enough estimate θ̂(x0) = x̄n of θ. There are two possible scenarios one can consider.
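Good's standardization is simple enough to sketch directly (illustrative code, not from the paper; the function name is an assumption):

```python
from math import sqrt

def alpha_n(n, alpha0=0.05, n0=100):
    """Good's standardized significance threshold alpha(n) = alpha0/sqrt(n/n0)."""
    return alpha0 / sqrt(n / n0)

# reproducing the flavor of Table 2: thresholds shrink like 1/sqrt(n)
for n in (100, 400, 10_000, 100_000_000):
    print(n, alpha_n(n))
```

At n = 100 the threshold is the agreed 0.05; at n = 10,000 it has fallen to 0.005, and at n = 1×10^8 it is 0.00005, illustrating point (iii) above.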
Scenario 1 assumes that all different values of n ≥ 1000 yield the same observed value of the test statistic d(x0). This scenario has been explored in the context of the SEV evaluation by Mayo and Spanos [31].
To explore scenario 2, let us return to example 2, related to testing H0: θ ≤ θ0 vs. H1: θ > θ0, θ0 = 0.5, in the context of the simple Bernoulli model in (18), using data on newborns in Cyprus during 1995: 5152 male and 4717 female. Particular values for n and pn(x0) are given in Table 3, indicating clearly that for n > 20,000 the p-value goes to zero very fast, and thus the thresholds α(n) needed to counter the increase in n will be tiny.
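Scenario 2 can be sketched numerically as follows; the code (illustrative, not the paper's) holds the sample mean fixed at x̄n = 0.52204 and lets n vary, using the Normal approximation to the p-value:

```python
from math import sqrt, erfc

def p_value(n, xbar=0.52204, theta0=0.5):
    """Upper-tail p-value of d(x) with the sample mean held fixed at xbar."""
    d = sqrt(n) * (xbar - theta0) / sqrt(theta0 * (1 - theta0))
    return 0.5 * erfc(d / sqrt(2))   # P(Z > d) for Z ~ N(0,1)

# the same observed proportion is 'insignificant' for small n
# and overwhelmingly 'significant' for large n
for n in (1000, 5000, 9869, 20_000, 100_000):
    print(n, p_value(n))
```

With x̄n fixed, d grows like √n, so the p-value collapses toward zero as n increases; this is the mechanism behind both the large n problem and the vulnerability to sample-size manipulation discussed next.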
Figure 3 shows the p-value for different values of n, indicating that for n ≥ 3256 the null hypothesis will be rejected, but for smaller n it will be accepted. This indicates clearly that, for a given α, the accept/reject results are highly vulnerable to abuse stemming from manipulating the sample size to obtain the desired result. This abuse can be addressed by the SEV evaluation of such results. The other side of the large n coin relates to the increase in power for a given α = 0.01 as n increases. Figure 4 shows that the power curve becomes steeper and steeper as n increases, reflecting the detection of smaller and smaller discrepancies from θ0 = 0.5 with probability 0.85. This stems from the fact that the power of the test in (20) is evaluated using the difference between a fixed cα and δ(θ1), which increases with √n, based on (24). The above numerical examples relating to test (20) under scenario 2 suggest that rules of thumb decreasing α as n increases, in an attempt to ameliorate potentially spurious results, can be useful in tempering the trade-off between the type I and II error probabilities. They do not, however, address the large n problem, since they are ad hoc and their standardized thresholds decrease to zero beyond n = 100,000.
The post-data severity evaluation (SEV) of the accept/reject H0 results constitutes a principled argument that addresses the large n problem by ensuring that the same n is used in both terms dn(xn) and δ(θ1) when the warranted discrepancy γ1 is evaluated based on (24), in contrast to the power, which replaces dn(xn) with cα in (26). To illustrate this argument, Table 4 reports the SEV evaluations under scenario 2, where x̄n = 0.52204 (and hence the ratio (0.52204−0.5)/√(0.5(1−0.5))) is retained and n is allowed to vary below and above the original n = 9869, over values that give rise to 'reject H0'. The numbers indicate that, for a given SEV(θ > θ1; n) = 0.9, as n increases the warranted discrepancy γ‡n increases, or equivalently, SEV(θ > 0.51557; n) for γ‡1 = 0.01557 increases with n. The severity curves SEV(T>α; θ > θ1; n) for the different n in Table 4 are shown in Figure 5, with the original n = 9869 and n = 1×10^5 in heavy lines. The curves confirm the results in Table 4 and provide additional detail, indicating the need to raise the benchmark for how high the probability SEV(θ > θ1; n) should be to counter-balance the increase in n; hence the use of 0.9 for examples 2 (n = 9869) and 3 (n = 11,633), and 0.7 (n = 20) for example 4.
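The pattern in Table 4 can be reproduced with a short sketch (illustrative, assuming the Normal approximation; x̄n is held fixed at 0.52204 as in scenario 2):

```python
from math import sqrt, erf

def sev_n(n, theta1=0.51557, xbar=0.52204):
    """SEV(theta > theta1; n) with xbar held fixed:
    Phi(sqrt(n)*(xbar - theta1)/sqrt(theta1*(1 - theta1)))."""
    z = sqrt(n) * (xbar - theta1) / sqrt(theta1 * (1 - theta1))
    return 0.5 * (1 + erf(z / sqrt(2)))

# severity of the same claim theta > 0.51557 rises with n
for n in (5000, 9869, 20_000, 100_000):
    print(n, round(sev_n(n), 4))
```

At the original n = 9869 the severity of the claim θ > 0.51557 is about 0.90; by n = 1×10^5 it exceeds 0.999, which is why the benchmark probability itself needs to be raised as n grows.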
The issue of the framing of H0 and H1 in N-P testing has been widely misconstrued by the replication crisis literature, which questions its coherence and blames the accept/reject H0 results and the p-value for providing misleading evidence and often non-replicable results. Contrary to that, in addition to Neyman and Pearson [23] warning against misinterpreting 'accept H0' as evidence for H0 and 'reject H0' as evidence for H1, they also put forward two stipulations (1-2) (Section 2.3) relating to the framing of H0 and H1, whose objective is to ensure the effectiveness of N-P testing and the informativeness of the ensuing results. The following example illustrates how a 'nominally' optimal test can be transformed into a wild goose chase when (1-2) are ignored.
Example 4. In an attempt to demonstrate the ineptitude of the p-value as compared with Bayesian testing, Berger [38], p. 4, put forward an example of testing H0: θ ≤ 0.5 vs. H1: θ > 0.5, in (27), in the context of the simple Bernoulli model in (18), with α = 0.05, cα = 1.645, n = 20, and θ̂(x0) = x̄n = 0.2. Applying the UMP test T>α yields d(x0) = −2.683, with a p-value p>(x0) = 0.996, indicating 'accept H0'. Berger uses this "ridiculous" result to make a case for Bayesian statistics: "A sensible Bayesian analysis suggests that the evidence indeed favors H0, but only by a factor of roughly 5 to 1." (p. 4). Viewing this result in the context of the N-P stipulations (1-2) (Section 2.3) reveals that the real source of this absurd result is likely to be the framing in (27), since it flouts both stipulations. Assuming the statistical adequacy of the invoked Mθ(x) in (18), θ̂(x0) = 0.2 gives a broad indication of the potential neighborhood of θ*. In contrast, the framing in (27) ensures that the test T>α has no power to detect any discrepancies around θ1 = θ̂(x0) ± ϵ, ϵ > 0, since the implicit power is P(θ1 = 0.2) = 0.00000003, confirming that p>(x0) = 0.996 is the result of ignoring stipulations (1-2).
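Berger's numbers, and the vanishing implicit power at θ1 = 0.2, can be reproduced under the Normal approximation (an illustrative sketch, not the paper's code):

```python
from math import sqrt, erfc

# Berger's example: n = 20, xbar = 0.2, testing toward theta > 0.5
n, xbar, theta0, c_alpha = 20, 0.2, 0.5, 1.645
s0 = sqrt(theta0 * (1 - theta0))
d = sqrt(n) * (xbar - theta0) / s0       # observed statistic, ~ -2.683
p = 0.5 * erfc(d / sqrt(2))              # upper-tail p-value, ~ 0.996

# implicit power of the test at theta1 = 0.2 (Normal approximation)
theta1 = 0.2
delta1 = sqrt(n) * (theta1 - theta0) / s0
v1 = sqrt(theta1 * (1 - theta1)) / s0
power = 0.5 * erfc(((c_alpha - delta1) / v1) / sqrt(2))
print(d, p, power)
```

The power at θ1 = 0.2 is of the order 3×10^−8: the test is pointed in the wrong direction, so the 'accept H0' result says nothing about the neighborhood of θ* indicated by x̄n = 0.2.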
A potential counter-argument to the above discussion, claiming that the estimate x̄n = 0.2 could be the result of x0 being a 'bad' draw, is not well-grounded, since misspecification testing of the invoked Mθ(x) would reveal whether x0 is atypical, i.e., a bad draw from the sample X. Recall that the adequacy of Mθ(x) ensures that x0 is a 'typical' realization thereof. It is worth mentioning, however, that example 4 with n = 20 could be vulnerable to the small n problem discussed in Section 2.3.

The Call for Redefining Statistical Significance
In light of the above discussion of the large n problem, the call by Benjamin et al. [6], "For fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005." (p. 6), seems visceral! It brushes aside the inherent trade-off between the type I and II error probabilities and the implied inverse relationship between the sample size n and the appropriate α needed to avoid the large/small n problems; see Section 2.3. The main argument used by Benjamin et al. [6] is that empirical evidence from large-scale replications indicates that studies with p(x0) < 0.005 are more likely to be replicable than those based on p(x0) < 0.05.
When this claim is viewed in the context of the broader problem of the uninformed and recipe-like implementation of statistical modeling and inference, in conjunction with the many different ways it can contribute to the untrustworthiness of empirical evidence, including (a-c), and the fact that replicability is neither necessary nor sufficient for the trustworthiness of empirical evidence, the above argument is unpersuasive. The threshold p(x0) < α was never meant to be either arbitrary or fixed for all frequentist tests, and the above discussion of the large n problem shows that using α = 0.05 for a large n, say n > 10,000, will often give rise to spurious significance results. Aware of the loss of power when α = 0.05 decreases to α = 0.005, Benjamin et al. [6] call for increasing n to ensure a high power of 0.8 at some arbitrary θ = θ1. The problem with the proposed remedy is twofold. First, increasing n is often impracticable with observational data, and second, securing high power for arbitrary discrepancies γ1 = (θ1 − θ0) is not conducive to learning about θ*.
Another argument for lowering the threshold put forward by Benjamin et al. [6] stems from a misleading comparison between the two-sided p-value for the thresholds 0.05 and 0.005 and the corresponding Bayes factor B01(x0) = f(x0; θ0) / ∫Θ f(x0; θ)π(θ)dθ, where π(θ) and π(θ|x0) denote the prior and the posterior distributions. This is an impertinent comparison since the Bayesian perspective on evidence, based on B01(x0), has a meager connection to the p-value as an indicator of discordance between x0 and θ0 = 0.5. The presumed comparability (analogy) between the tail areas of τ(X) ∼ St(n−1) under µ = µ0, which vary over x ∈ R^n_X and revolve around θ*, and the ratio in B01(x0), which varies over all θ ∈ Θ, is ill-thought-out! The uncertainty accounted for by the former has nothing to do with that of the latter, since the posterior distribution accounts for the uncertainty stemming from the prior distribution, weighted by the likelihood function, both of which vary over θ.

The Severity Perspective on the p-Value, Observed CIs, and Effect Sizes
The p-value and the accept/reject H0 results. The real problem is their binary nature created by the threshold α, which gives rise to counter-intuitive results, such as rejecting H0 for α = 0.05 when p(x0) = 0.049, but accepting H0 when p(x0) = 0.051. The SEV-based evidential account does away with this binary dimension since its inferential claim, and the discrepancy γ from θ = θ0 warranted with high probability, are guided by the sign and magnitude of the observed test statistic d(x0). The SEV evaluation uses the statistical context in (15), in conjunction with the statistical approximations framed by the relevant sampling distribution of d(X), to deal with the binary nature of the results, as examples 2-4 illustrate.
Example 2 (continued). Let us return to the data on newborns in Cyprus during 1995, 5152 boys and 4717 girls, where the optimal test T>α with d(x0) = 4.379 indicates 'reject H0' with p>(x0) = 0.000006. When p>(x0) is viewed from the severity vantage point, it is directly related to evaluating the post-data severity for a zero discrepancy, i.e., θ1 = θ0, since (Mayo and Spanos, [31]): SEV(T>α; x0; θ > θ0) = 1 − p>(x0) = 0.999994, which suggests that a small p-value indicates the existence of some discrepancy γ ≥ 0, but provides no information about the magnitude warranted by x0. The severity evaluation remedies that by outputting the missing magnitude in terms of the discrepancy γ warranted by data x0 and test T>α with high probability, taking into account the relevant statistical context in (15). The key problem is that the p-value is evaluated using d(X) ∼ N(0,1) under θ = θ0, and thus contains no information relating to different discrepancies from θ = θ0, unlike the post-data severity evaluation, which is based on (√V(θ1))^−1 [d(X) − δ(θ1)] ∼ N(0,1) under θ = θ1.
Observed CIs and effect sizes. The question that arises is why the claim µ* ∈ CI(x0) = [x̄n − c_{α/2}(s(x0)/√n), x̄n + c_{α/2}(s(x0)/√n)] with probability (1−α) is unwarranted. As argued in Section 2.4, this stems from the fact that factual reasoning is baseless post-data, and thus one cannot assign a probability to CI(x0). This calls into question the calls by reformers in the replication crisis literature to replace the p-value with the analogous observed CI, on the grounds that the latter is (i) less vulnerable to the large n problem and (ii) more informative than the p-value since it provides a measure of the 'effect size'. Cohen's [39] recommendation is to "routinely report effect sizes in the form of confidence intervals" (p. 1002).
Claim (i) is questionable because a CI is equally as vulnerable to the large n problem as the p-value, since the expected length of a consistent CI shrinks to zero as n → ∞. In the case of (7) this takes the form 2c_{α/2}(s(x0)/√n) → 0 as n → ∞, and thus, as n increases, the width of the observed CI(x0) decreases. This also calls into question claim (ii), that it provides a reliable measure of the 'effect size', since the larger the n, the narrower the observed CI. Worse, the concept of the 'effect size' was introduced partly to address the large n problem using a measure that is free of n: ". . . the raw size effect as a measure is that its expected value is independent of the size of the sample used to perform the significance test." (Abelson [40], p. 46).
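The 1/√n shrinkage behind claim (i) can be illustrated with a one-line computation (the helper name is hypothetical):

```python
from math import sqrt

def ci_width(s, n, c=1.96):
    """Width 2*c*(s/sqrt(n)) of the observed (1-alpha) CI for mu."""
    return 2 * c * s / sqrt(n)

# quadrupling n halves the interval: the width shrinks like 1/sqrt(n)
ratio = ci_width(1.0, 100) / ci_width(1.0, 400)
print(ratio)  # → 2.0
```

An observed CI therefore inherits the same n-dependence that the effect-size literature was trying to escape.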
Example 2 (continued). The effect size for θ is known as Cohen's g = θ̂(x0) − θ0 (Ellis [30]). When evaluated using the Arbuthnot value θ0 = 0.5, g = 0.02204, which is rather small, and when evaluated using the Bernoulli value θ0 = (18/35) it is even smaller, g = 0.0078. How do these values provide a better measure of the 'scientific effect'? They do not, since Cohen's g is nothing more than another variant of the unwarranted claim that θ̂n(x0) ≃ θ* for a large enough n; see Spanos [7]. On the other hand, the post-data severity evaluation outputs the discrepancy γ‡ warranted by data x0 and test T>α with probability 0.9. This implies that the severity-based effect size is θ‡1 ≤ 0.5156, which takes into account the relevant error probabilities that calibrate the uncertainty relating to the single realization x0. In addition, the SEV evaluation gives rise to identical severity curves (Figure 1) and the same evidence for both null values θ0 = 0.5 and θ0 = (18/35).
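Both effect-size notions can be computed side by side; the bisection solver below is an illustrative way (not the paper's code) to find the warranted θ‡1 under the Normal approximation:

```python
from math import sqrt, erf

def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def warranted_theta1(xbar, n, sev=0.9, lo=0.5, hi=0.6):
    """Largest theta1 with SEV(theta > theta1) >= sev, found by bisection on
    Phi(sqrt(n)*(xbar - theta1)/sqrt(theta1*(1 - theta1))) = sev."""
    for _ in range(60):
        mid = (lo + hi) / 2
        z = sqrt(n) * (xbar - mid) / sqrt(mid * (1 - mid))
        if phi(z) >= sev:
            lo = mid
        else:
            hi = mid
    return lo

xbar, n = 5152 / 9869, 9869
g_arbuthnot = xbar - 0.5       # Cohen's g under theta0 = 0.5: ~0.02204
g_bernoulli = xbar - 18 / 35   # Cohen's g under theta0 = 18/35: ~0.0078
theta1_star = warranted_theta1(xbar, n)  # ~0.5156, the same for either null
```

Cohen's g changes with the choice of θ0, whereas the severity-based value θ‡1 does not, which is the invariance the text appeals to.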
Severity and observed CIs. The great advantage of hypothetical reasoning is that it applies both pre-data and post-data, enabling the SEV to shed light on several foundational problems relating to frequentist inference more broadly. In particular, the severity evaluation of 'reject H0' relating to the inferential claim µ > µ1 = µ0 + γ1, γ1 ≥ 0, has a superficial resemblance to the lower bound of the CI in (8)-(b), especially if one were to consider µ1 = x̄n − cα(σ/√n) as a relevant discrepancy of interest in the SEV evaluation; see Mayo and Spanos [31]. This resemblance, however, is more apparent than real since: (i) the relevant sampling distribution for CI_L(X; α) is τ(X; µ) ∼ St(n−1) under θ = θ*, whereas that for the SEV evaluation is the distribution of τ(X; µ1) under the hypothetical value µ = µ1; (ii) they are derived under two different forms of reasoning, factual and hypothetical, respectively, which are not interchangeable. Indeed, the presence of δ1 in the SEV evaluation renders the two assignments of probabilities very different. Hence, any attempt to relate the SEV evaluation to the illicit post-data assignment of the probability (1−α) to the observed CI_L(x0) fails, since (a) the latter leaves no room for the discrepancy γ1, and (b) the assigned faux probability will be unrelated to the coverage probability; see Spanos [29]. Also, the SEV evaluation can be used to shed light on several confounds relating to different attempts to assign probabilities to different values of θ within an observed CI, including the most recent attempt based on confidence distributions by Schweder and Hjort [20] above.

The SEV Evaluation vs. the Law of Likelihood
Royall [11] popularized an alternative approach to converting inference results into evidence using the likelihood ratio anchored on the Maximum Likelihood (ML) estimate θ̂n(x0). He rephrased Hacking's [41] Law of Likelihood (LL): data x0 support hypothesis H0 over hypothesis H1 if and only if L(H0; x0) > L(H1; x0); the degree to which x0 supports H0 over H1 is given by the Likelihood Ratio (LR): LR(H0, H1; x0) = L(H0; x0)/L(H1; x0), with Royall-type thresholds such as LR ≥ 8 for 'fairly strong' and LR ≥ 32 for 'strong' evidence. Using the above thresholds, the result LR(θ0, θ1; x0) = 2.269 indicates that the strength of evidence for θ0 = 0.52204 vs. θ1 = 0.51557 is 'weak', and the evidence will be equal or weaker for any θ1 ∈ (0.51557, 0.5285). If one were to take a firm stand on x0 providing 'fairly strong' evidence for θ0, the relevant range of values would be θ1 ∉ (0.5117, 0.5323). What is even more problematic for the RLR approach is that θ0 = 0.52204 has the same strength of evidence against the two values θ1 = 0.51557 and θ2 = 0.5285, shown in Figure 7. This calls into question the nature of the evidence the RLR approach gives rise to, since it undermines the primary objective of frequentist inference: its strength of evidence for θ0 is identical against two values on either side of the ML estimate 0.52204, and as the threshold increases, the distance between them increases. This derails any learning from data, since it undermines the primary objective of narrowing down the relevant neighborhood of θ*, unless the choice is invariably the ML point estimate, as it relates to the fallacious claim θ̂n(x0) ≃ θ*. This weakness was pointed out early on by a pioneer of the likelihood ratio approach, Barnard [45], p. 129, in his belated review of Hacking [41]. He argued that for any prespecified value θ1: "... there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did."
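The cited LR values can be checked directly from the Bernoulli likelihood; the sketch below is illustrative, and small differences from the paper's 2.269 reflect rounding in the reported figures:

```python
from math import exp, log

boys, girls = 5152, 4717           # Cyprus 1995 data
theta_hat = boys / (boys + girls)  # ML estimate, ~0.52204

def loglik(theta):
    """Bernoulli log-likelihood for the observed counts."""
    return boys * log(theta) + girls * log(1 - theta)

def LR(theta1):
    """Likelihood ratio of the ML value against a rival value theta1."""
    return exp(loglik(theta_hat) - loglik(theta1))

# 'weak' and nearly equal evidence against values on either side of theta_hat
print(LR(0.51557), LR(0.5285))
# 'fairly strong' evidence (LR >= 8) is reached only outside (0.5117, 0.5323)
print(LR(0.5117), LR(0.5323))
```

The two printed pairs exhibit exactly the symmetry problem discussed above: the same strength of evidence obtains against rival values below and above the ML estimate.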
As evidenced in Figure 7, the RLR will pinpoint the ML estimate θ̂(x0) no matter what the other value in (0, 1) happens to be, since it is the maximally likely value; see Mayo [46]. No wonder Hacking [47], p. 137, in his review of Edwards's [48] Likelihood, changed his mind and abandoned the LR approach altogether: "I do not know how Edwards's favoured concept [the difference of log-likelihoods] will fare. The only great thinker who tried it out was Fisher, and he was ambivalent. Allan Birnbaum and myself are very favourably reported in this book for things we have said about likelihood, but Birnbaum has given it up and I have become pretty dubious." Indeed, Hacking [49], p. 141, not only rejected the Law of Likelihood but went a step further by reversing his original viewpoint and wholeheartedly endorsing N-P testing: "This paper will show that the Neyman-Pearson theories of testing hypotheses and of confidence interval estimation are sound theories of probable inference." It is worth noting that even before the optimal N-P theory of testing was finalized in 1933, Pearson and Neyman [50] had confronted the problem of how to construe LR(θ0, θ1; x) by emphasizing the crucial role of the relevant error probabilities. The SEV evaluation is based on a testing perspective, which is equally pertinent for evaluating error probabilities pre-data and post-data.
The most crucial feature of the SEV evaluation is that it converts the accept/reject H0 results into evidence by taking into account the statistical context in (15), including the power of the test. This is important since detecting a particular discrepancy, say γ1, provides stronger evidence for its presence when the power is low than when it is high; see Spanos [18]. The underlying intuition can be illustrated by imagining two people searching for metallic objects on the same beach, one using a highly sensitive metal detector that can detect small nails, and the other a much less sensitive one. If both detectors begin to buzz simultaneously, the one more likely to have found something substantial is the less sensitive one! The SEV evaluation harnesses this intuition by custom-tailoring the power of the test, replacing the original cα with the post-data d(x0), to establish the discrepancy γ1 warranted by data x0 with high enough probability. As argued above, this enables the SEV evidential account to circumvent several foundational problems, including the large n problem. In contrast, the RLR account yields the same strength of evidence for any two values of θ irrespective of the size of n.
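The contrast between the pre-data power and the post-data SEV amounts to swapping the threshold cα for the observed d(x0); a minimal sketch under the Normal approximation (illustrative names, using the Cyprus data from example 2):

```python
from math import sqrt, erf

def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

n, xbar, theta0, theta1, c_alpha = 9869, 5152 / 9869, 0.5, 0.51557, 1.645
s0 = sqrt(theta0 * (1 - theta0))
d_obs = sqrt(n) * (xbar - theta0) / s0       # observed test statistic
delta1 = sqrt(n) * (theta1 - theta0) / s0    # mean shift under theta1
v1 = sqrt(theta1 * (1 - theta1)) / s0        # sd adjustment under theta1

power = 1 - phi((c_alpha - delta1) / v1)  # pre-data: uses the threshold c_alpha
sev = phi((d_obs - delta1) / v1)          # post-data: c_alpha replaced by d(x0)
print(power, sev)
```

The two numbers differ (roughly 0.93 vs. 0.90 here) precisely because the power is fixed before the data are seen, while the SEV is custom-tailored to the observed d(x0).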

Summary and Conclusions
The replication crisis has exposed the apparent untrustworthiness of published empirical evidence, but its narrow attribution to certain abuses of frequentist testing can be called into question as 'missing the forest for the trees'. A stronger case can be made that the real culprit is the much broader problem of the uninformed and recipe-like implementation of statistical methods, which contributes to the untrustworthiness in many different ways, including [a] imposing invalid probabilistic assumptions on one's data and [b] conflating unduly data-specific 'inference results' with 'evidence for or against inferential claims about θ*', which represent inductive generalizations of such results.
The above discussion makes a case that the post-data severity (SEV) evaluation provides an evidential account of the accept/reject H0 results, in the form of a discrepancy γ ≠ 0 from θ = θ0 warranted with high enough probability by data x0 and test Tα. The SEV evaluation is framed in terms of a post-data error probability that accounts for the statistical context in (15), as well as the uncertainty stemming from inference results relying on a single realization X = x0 of the sample. The SEV perspective is used to call into question Royall's [11] LR approach to evidence as another rendering of the fallacious claim θ̂n(x0) ≃ θ* for a large enough n.
The SEV evaluation is also shown to elucidate and address several foundational issues that have confounded frequentist testing since the 1930s, including (i) statistical vs. substantive significance, (ii) the large n problem, and (iii) the alleged arbitrariness of the N-P framing of H0 and H1, used to undermine the coherence of frequentist testing. The SEV also oppugns the proposed alternatives to replace or modify frequentist testing with statistical results, such as observed CIs and effect sizes, or by redefining significance, all of which are equally vulnerable to [a]-[c] and thus undermine the trustworthiness of empirical evidence. In conclusion, it is important to reiterate that unless one has already established the statistical adequacy of the invoked Mθ(x), any discussion relating to reliable inference results and trustworthy evidence based on tail-area probabilities is unwarranted.

Figure 1. Severity curve for test T>α and data x0.

Figure 3. The p-value for α = 0.01 and different n.

Figure 4. The power curves for α = 0.01 and different n.

Figure 5. Severity curves for test T>α and data x0.

4. Post-Data Severity and the Remedies Proposed by the Replication Literature
4.1. The Alleged Arbitrariness in Framing H0 and H1

Table 3. The p-value with increasing n (x̄n constant).

Table 4. SEV with changing n (x̄n = 0.52204 is held constant).