2.1. Fisher’s Model-Based Statistical Framing
This section provides a bird's-eye view of frequentist inference in general, and N-P testing in particular, in an attempt to preempt needless confusion relating to what is traditionally known as the Null Hypothesis Significance Testing (NHST) practice; see Nickerson [15], Spanos [16].
Fisher [13] founded model-based statistical induction that revolves around the concept of a prespecified parametric statistical model, whose generic form is:

$$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x}) = \{ f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta \},\ \ \mathbf{x} \in \mathbb{R}_X^n, \quad (1)$$

where $f(\mathbf{x};\boldsymbol{\theta})$ denotes the distribution of the sample $\mathbf{X}:=(X_1,\ldots,X_n)$, $\mathbb{R}_X^n$ the sample space, and $\Theta$ the parameter space. $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ can be viewed as a particular parameterization of the observable stochastic process $\{X_t,\ t \in \mathbb{N}\}$ underlying data $\mathbf{x}_0$. $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ provides an ‘idealized’ description of a statistical mechanism that could have given rise to $\mathbf{x}_0$. The main objective of model-based frequentist inference is to give rise to learning from data $\mathbf{x}_0$ by narrowing down $\Theta$ to a small neighborhood around the ‘true’ value of $\boldsymbol{\theta}$ in $\Theta$, say $\boldsymbol{\theta}^{*}$, whatever that value happens to be; see Spanos [17] for further discussion.
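To make the idea of ‘narrowing down’ concrete, the following minimal sketch assumes a simple Normal (NIID) model and shows a 95% confidence interval for the mean shrinking around the true value as n grows; the particular numbers (the true mean, scale, and sample sizes) are illustrative assumptions, not values from the paper.

```python
# Sketch: learning from data as narrowing in on the 'true' value mu* (illustrative values).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_star, sigma = 10.0, 2.0                        # 'true' but unknown parameter values

for n in (20, 200, 2000):
    x = rng.normal(mu_star, sigma, size=n)        # data generated by the 'true' mechanism
    xbar, s = x.mean(), x.std(ddof=1)
    c = stats.t.ppf(0.975, df=n - 1)              # 0.975 quantile of Student's t(n-1)
    half = c * s / np.sqrt(n)
    print(f"n={n:5d}  95% CI for mu: [{xbar - half:.3f}, {xbar + half:.3f}]")
# The interval shrinks around mu* as n grows: the 'neighborhood' narrows.
```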
The statistical adequacy of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ for data $\mathbf{x}_0$ plays a crucial role in securing the reliability of inference and the trustworthiness of evidence because it ensures that the actual error probabilities approximate closely the nominal ones, enabling the ‘control’ of these unobservable probabilities. As a result, when $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is misspecified:
- (a) the distribution of the sample $f(\mathbf{x};\boldsymbol{\theta})$ in (1) is erroneous,
- (b) rendering the likelihood function invalid,
- (c) distorting the sampling distribution of any relevant statistic (estimator, test, predictor).
In turn, (a)–(c) (i) give rise to ‘non-optimal’ inference procedures and (ii) induce sizeable discrepancies between the actual and nominal error probabilities; the latter is arguably the most crucial contributor to the untrustworthiness of empirical evidence. Applying a 0.05 significance level test when the actual type I error probability is 0.97 (Spanos [18], Table 15.5) will give rise to untrustworthy evidence. Hence, the only way to keep track of the relevant error probabilities is to establish the statistical adequacy of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, to forefend the unreliability of inference stemming from (a)–(c), using thorough Mis-Specification (M-S) testing; see Spanos [19]. This will secure the optimality and reliability of the ensuing inferences, giving rise to trustworthy evidence.
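The simulation below is a hedged illustration of this gap (it is not the example behind Table 15.5 in Spanos [18]): data generated with AR(1) dependence are analyzed with a one-sided t-test that assumes IID observations, so the actual type I error of a nominal 0.05 test ends up far above 0.05 even though the null is true; the sample size, correlation, and number of replications are assumptions chosen for illustration.

```python
# Sketch: actual vs. nominal type I error when the IID assumption fails (AR(1) data, H0 true).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, mu0, rho, reps = 100, 0.0, 0.8, 5000
c_alpha = stats.t.ppf(0.95, df=n - 1)           # threshold for nominal alpha = 0.05 (one-sided)
rejections = 0
for _ in range(reps):
    e = rng.normal(size=n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):                        # AR(1) dependence, but mean is mu0 = 0 (H0 true)
        x[t] = rho * x[t - 1] + e[t]
    tau = np.sqrt(n) * (x.mean() - mu0) / x.std(ddof=1)
    if tau > c_alpha:
        rejections += 1
print(f"nominal type I error: 0.05, actual (simulated): {rejections / reps:.3f}")
```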
2.3. Neyman-Pearson (N-P) Testing
Example 1 (continued). In the context of (2), testing the hypotheses:

$$H_0:\ \mu \leq \mu_0 \quad \text{vs.} \quad H_1:\ \mu > \mu_0,$$

gives rise to the Uniformly Most Powerful (UMP) $\alpha$-significance level N-P test:

$$T_\alpha := \left\{ \tau(\mathbf{X}) = \frac{\sqrt{n}\,(\overline{X}_n - \mu_0)}{s},\ \ C_1(\alpha) = \{\mathbf{x}:\ \tau(\mathbf{x}) > c_\alpha\} \right\}, \quad (7)$$

(Lehmann and Romano [20]) where $\overline{X}_n$ and $s$ denote the sample mean and standard deviation, $C_1(\alpha)$ is the rejection region, and $c_\alpha$ is determined by the prespecified $\alpha$.

The distribution of $\tau(\mathbf{X})$, evaluated using hypothetical reasoning (what if $\mu = \mu_0$):

$$\tau(\mathbf{X}) \overset{\mu=\mu_0}{\sim} \mathrm{St}(n-1),$$

ensures that the hypothetical scenario $\mu = \mu_0$ underlies the evaluations of:

$$\alpha = \mathbb{P}(\tau(\mathbf{X}) > c_\alpha;\ \mu = \mu_0), \qquad p(\mathbf{x}_0) = \mathbb{P}(\tau(\mathbf{X}) > \tau(\mathbf{x}_0);\ \mu = \mu_0).$$

The sampling distribution of $\tau(\mathbf{X})$ evaluated under $H_1$ (what if $\mu = \mu_1$, for $\mu_1 > \mu_0$) is:

$$\tau(\mathbf{X}) \overset{\mu=\mu_1}{\sim} \mathrm{St}(\delta_1;\ n-1), \quad (10)$$

where $\delta_1 = \frac{\sqrt{n}\,(\mu_1 - \mu_0)}{\sigma}$ is the noncentrality parameter, and (10) is used to evaluate the type II error probability $\beta(\mu_1)$ and the power $\pi(\mu_1)$ for a given $\mu_1 > \mu_0$:

$$\beta(\mu_1) = \mathbb{P}(\tau(\mathbf{X}) \leq c_\alpha;\ \mu = \mu_1), \qquad \pi(\mu_1) = \mathbb{P}(\tau(\mathbf{X}) > c_\alpha;\ \mu = \mu_1) = 1 - \beta(\mu_1).$$
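As a numerical sketch of these pre-data quantities, the snippet below computes $c_\alpha$, the power, and the type II error probability of the one-sided t-test via the noncentral Student's t; the values of n, sigma, alpha, and the grid of $\mu_1$ values are illustrative assumptions, not numbers taken from the paper.

```python
# Sketch: pre-data error probabilities of the one-sided t-test (illustrative values).
import numpy as np
from scipy import stats

n, sigma, alpha, mu0 = 25, 2.0, 0.05, 0.0
c_alpha = stats.t.ppf(1 - alpha, df=n - 1)          # rejection threshold c_alpha

def power(mu1):
    """pi(mu1) = P(tau(X) > c_alpha; mu = mu1), via the noncentral Student's t in (10)."""
    delta1 = np.sqrt(n) * (mu1 - mu0) / sigma       # noncentrality parameter delta_1
    return stats.nct.sf(c_alpha, df=n - 1, nc=delta1)

print(f"c_alpha = {c_alpha:.3f}")
for mu1 in (0.2, 0.5, 1.0, 1.5):
    print(f"mu1={mu1:.1f}  power={power(mu1):.3f}  type II error={1 - power(mu1):.3f}")
```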
It should be emphasized that these error probabilities are assigned to a particular N-P test, e.g., $T_\alpha$ in (7), to calibrate its generic (for any $\mathbf{x} \in \mathbb{R}_X^n$) capacity to detect different discrepancies $(\mu_1 - \mu_0)$ from $\mu = \mu_0$; see Neyman and Pearson [21].
The primary role of these error probabilities is to operationalize the notions of ‘statistically significant/insignificant’ in the form of ‘accept/reject $H_0$ results’. The optimality of N-P tests revolves around an in-built trade-off between the type I and II error probabilities, and an optimal N-P test is derived by prespecifying $\alpha$ at a low value and minimizing the type II error $\beta(\mu_1)$, or maximizing the power $\pi(\mu_1)$, for all $\mu_1 > \mu_0$. In summary, the error probabilities have several key attributes (Spanos [16]):
- [i] They are assigned to the test procedure to ‘calibrate’ its generic (for any $\mathbf{x} \in \mathbb{R}_X^n$) capacity to detect different discrepancies from $\mu_0$.
- [ii] They cannot be conditional on $\mu$, an unknown constant (not a random variable).
- [iii] There is a built-in trade-off between the type I and II error probabilities.
- [iv] They frame the accept/reject rules in terms of ‘statistical approximations’ based on the distribution of $\tau(\mathbf{X})$ evaluated using hypothetical reasoning.
- [v] They are unobservable since they revolve around the ‘true’ value $\mu^{*}$ of $\mu$.
This was clearly explained in Fisher’s 1955 reply to a letter from John Tukey: “A level of significance is a probability derived from a hypothesis [hypothetical reasoning], not one asserted in the real world” (Bennett [22], p. 221).
2.4. An Inconsistent Hybrid Logic Burdened with Confusion?
In an insightful discussion, Gigerenzer [23] describes the traditional narrative of statistical testing created by textbook writers in psychology during the 1950s and 1960s as based on ‘a hybrid logic’: “Neither Fisher nor Neyman and Pearson would have accepted this hybrid as a theory of statistical inference. The hybrid logic is inconsistent from both perspectives and burdened with conceptual confusion” (p. 324).
It is argued that Fisher’s model-based statistical induction could provide a unifying ‘reasoning’ that elucidates the similarities and differences between Fisher’s inductive inference and Neyman’s inductive behavior; see Halpin and Stam [24].
Using the t-test in (7), the two perspectives have several common components:
(i) a prespecified statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$,
(ii) the framing of hypotheses in terms of its unknown parameter(s) $\boldsymbol{\theta}$,
(iii) a test statistic $\tau(\mathbf{X})$,
(iv) a null hypothesis $H_0:\ \mu = \mu_0$,
(v) the sampling distribution of $\tau(\mathbf{X})$ evaluated under $H_0$, and
(vi) a probability threshold ($\alpha = 0.05, 0.025, 0.01$, etc.) to decide when $H_0$ is discordant/rejected.
The N-P perspective adds to the common components (i)–(vi):
(vii) re-interpreting $\alpha$ as a prespecified (pre-data) type I error probability,
(viii) an alternative hypothesis $H_1:\ \mu > \mu_0$ to supplement $H_0:\ \mu \leq \mu_0$,
(ix) the sampling distribution of $\tau(\mathbf{X})$ evaluated under $H_1$, and
(x) the type II error probability and the power of the test.
These added components frame an optimal theory of N-P testing based on constructing an optimal test statistic and framing a rejection region to maximize the power.
The key to fusing the two perspectives is the ‘hypothetical reasoning’ underlying the derivation of both sampling distributions in (v) and (ix). The crucial difference is that the type I and II (power) error probabilities are pre-data because they calibrate the test’s generic (for all $\mathbf{x} \in \mathbb{R}_X^n$) capacity to detect discrepancies from $\mu_0$, but the p-value is a post-data error probability since its evaluation is based on $\tau(\mathbf{x}_0)$.
The traditional textbook narrative considers Fisher’s significance testing, based on $H_0:\ \mu = \mu_0$ and guided by the p-value, problematic, since the absence of an explicit alternative $H_1$ renders the power and the p-value ambivalent, e.g., one-sided or two-sided? This, however, misconstrues Fisher’s significance testing, since his p-value is invariably one-sided because, post-data, the sign of $\tau(\mathbf{x}_0)$ designates the relevant tail. In fact, the ambivalence originates in the N-P-laden definition of the p-value: ‘the probability of obtaining a result ‘equal to or more extreme’ than the one observed when $H_0$ is true’. The clause ‘equal or more extreme’ is invariably (mis)interpreted in terms of $H_1$. Indeed, the p-value is related to the type I error probability by interpreting $p(\mathbf{x}_0)$ as the smallest significance level $\alpha$ for which a true $H_0$ would have been rejected; see Lehmann and Romano [20]. A more pertinent post-data definition of the p-value that averts this ambivalence is: ‘the probability of all sample realizations $\mathbf{x} \in \mathbb{R}_X^n$ that accord less well (in terms of $\tau(\mathbf{x})$) with $H_0$ than $\mathbf{x}_0$ does, when $H_0$ is true’; see Spanos [18].
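The sketch below illustrates the post-data, one-tailed evaluation of the p-value under the same simple Normal setup, with the sign of $\tau(\mathbf{x}_0)$ picking the relevant tail; the simulated data and the values of $\mu_0$ and n are hypothetical, chosen only for illustration.

```python
# Sketch: post-data p-value where the sign of tau(x0) designates the relevant tail.
import numpy as np
from scipy import stats

def post_data_p_value(x, mu0):
    """One-sided p-value based on the observed tau(x0)."""
    n = len(x)
    tau0 = np.sqrt(n) * (np.mean(x) - mu0) / np.std(x, ddof=1)
    if tau0 >= 0:
        return tau0, stats.t.sf(tau0, df=n - 1)    # right tail is relevant
    return tau0, stats.t.cdf(tau0, df=n - 1)       # left tail is relevant

rng = np.random.default_rng(3)
x0 = rng.normal(0.6, 2.0, size=40)                 # hypothetical data realization
tau0, p = post_data_p_value(x0, mu0=0.0)
print(f"tau(x0) = {tau0:.3f},  post-data p-value = {p:.4f}")
```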
Similarly, the power of a test comes into play with just the common components (i)–(vi), since the probability threshold $\alpha$ defines the distribution threshold $c_\alpha$ for the probability of detecting discrepancies of the form $\mu_1 = \mu_0 + \gamma_1$, $\gamma_1 > 0$, using the non-central Student’s t in (10), originally derived by Fisher [25]. Indeed, Fisher [26], pp. 21–22, was the first to recognize the effect of increasing n on the power (he called it ‘sensitivity’): “By increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow of the detection of … quantitatively smaller departures from the null hypothesis”, which is particularly useful in experimental design; see Box [27].
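Fisher’s ‘sensitivity’ point can be illustrated numerically: for a fixed discrepancy, the power of the one-sided t-test rises with n. The discrepancy, scale, and significance level below are illustrative assumptions, not the paper’s values.

```python
# Sketch: power ('sensitivity') of the one-sided t-test increasing with n (illustrative values).
import numpy as np
from scipy import stats

mu0, mu1, sigma, alpha = 0.0, 0.5, 2.0, 0.05
for n in (20, 50, 100, 200, 400):
    c_alpha = stats.t.ppf(1 - alpha, df=n - 1)      # rejection threshold for this n
    delta1 = np.sqrt(n) * (mu1 - mu0) / sigma       # noncentrality parameter
    pw = stats.nct.sf(c_alpha, df=n - 1, nc=delta1)
    print(f"n={n:4d}  power to detect mu1 - mu0 = 0.5: {pw:.3f}")
```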
The above arguments suggest that when the underlying hypothetical reasoning and the pre-data vs. post-data error probabilities are delineated, there is no substantial conflict or conceptual confusion between the Fisher and N-P perspectives. What remains problematic, however, is that neither the p-value nor the accept/reject $H_0$ results provide cogent evidence, because they are too coarse to designate a small neighborhood containing $\mu^{*}$ and thus to engender any genuine learning from data via $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$. The post-data severity (SEV) evaluation offers such an evidential interpretation in the form of a discrepancy $\gamma$ from $\mu_0$ warranted by test $T_\alpha$ and data $\mathbf{x}_0$ with high enough probability; see Section 5.
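As a brief preview of Section 5, the sketch below evaluates one common form of the severity of the post-rejection claim $\mu > \mu_1 = \mu_0 + \gamma$, namely $\mathrm{SEV}(\mu > \mu_1) = \mathbb{P}(\tau(\mathbf{X}) \leq \tau(\mathbf{x}_0);\ \mu = \mu_1)$, using the noncentral Student’s t with the sample standard deviation standing in for $\sigma$; the sample size, scale, and observed $\tau(\mathbf{x}_0)$ are hypothetical values chosen only for illustration.

```python
# Sketch: post-data severity of the claim mu > mu0 + gamma after a rejection of H0 (illustrative values).
import numpy as np
from scipy import stats

n, mu0, s, tau0 = 50, 0.0, 2.0, 2.5                 # hypothetical n, scale, observed tau(x0) > c_alpha
for gamma in (0.1, 0.2, 0.4, 0.6, 0.8):
    mu1 = mu0 + gamma
    delta1 = np.sqrt(n) * (mu1 - mu0) / s           # noncentrality under mu = mu1 (s proxies sigma)
    sev = stats.nct.cdf(tau0, df=n - 1, nc=delta1)  # SEV(mu > mu1) = P(tau(X) <= tau(x0); mu = mu1)
    print(f"gamma={gamma:.1f}  SEV(mu > {mu1:.1f}) = {sev:.3f}")
# High SEV for small gamma, declining as the claimed discrepancy grows.
```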