2.1. Fisher’s Model-Based Statistical Induction
Model-based frequentist statistics was pioneered by Fisher [19] in the form of statistical induction that revolves around a statistical model whose generic form is
$$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}),\ \boldsymbol{\theta}\in\Theta\},\ \ \mathbf{x}\in\mathbb{R}_X^{n},\qquad(1)$$
and the distribution of the sample $\mathbf{X}:=(X_1,X_2,\ldots,X_n)$, denoted $f(\mathbf{x};\boldsymbol{\theta})$, which encapsulates its probabilistic assumptions; $\mathbb{R}_X^{n}$ denotes the sample space, and $\Theta$ the parameter space; see Spanos [16].
Unfortunately, the term ‘model’ is used to describe many different constructs across different disciplines. In the context of empirical modeling using statistics, however, the relevant models can be grouped into two broad categories: ‘substantive’ (structural, a priori postulated) and ‘statistical’ models. Although these two categories of models are often conflated, a statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ comprises solely the probabilistic assumptions imposed (explicitly or implicitly) on the particular data $\mathbf{x}_0:=(x_1,x_2,\ldots,x_n)$; see McCullagh [20]. Formally, $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is a stochastic mechanism framed in terms of probabilistic assumptions from three broad categories: Distribution (D), Dependence (M), and Heterogeneity (H), assigned to the observable stochastic process $\{X_t,\ t\in\mathbb{N}\}$ underlying data $\mathbf{x}_0$.
The specification (initial selection) of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ has a twofold objective:
- (a) $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is selected to account for all the chance regularity patterns (the systematic statistical information) in data $\mathbf{x}_0$ by choosing appropriate probabilistic assumptions relating to $\{X_t,\ t\in\mathbb{N}\}$. Equivalently, $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is selected to render data $\mathbf{x}_0$ a ‘typical realization’ thereof, and the ‘typicality’ can be evaluated using Mis-Specification (M-S) testing, which assesses the approximate validity of its probabilistic assumptions.
- (b) $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is parametrized to enable one to shed light on the substantive questions of interest using data $\mathbf{x}_0$. When these questions are framed in terms of a substantive model, say $\mathcal{M}_{\boldsymbol{\varphi}}(\mathbf{x}),\ \boldsymbol{\varphi}\in\Phi$, one needs to bring out the implicit statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ in a way that ensures that the two sets of parameters are related via a set of restrictions, say $G(\boldsymbol{\theta},\boldsymbol{\varphi})=\mathbf{0}$, colligating $\mathcal{M}_{\boldsymbol{\varphi}}(\mathbf{x})$ to the data $\mathbf{x}_0$ via $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$; see Spanos [16].
Example 1. A widely used example in practice is the simple Normal model:
$$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})\colon\ X_t\sim\mathrm{NIID}(\mu,\sigma^{2}),\ \ \boldsymbol{\theta}:=(\mu,\sigma^{2})\in\mathbb{R}\times\mathbb{R}_{+},\ \ x_t\in\mathbb{R},\ t\in\mathbb{N}:=(1,2,\ldots,n,\ldots),\qquad(2)$$
where ‘$\mathrm{NIID}$’ stands for ‘Normal (D), Independent (M), and Identically Distributed (H)’.
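To fix ideas, the following sketch (a minimal Python illustration, not part of the original exposition; the values of $n$, $\mu$, and $\sigma$ are assumptions chosen for demonstration) simulates a realization $\mathbf{x}_0$ from (2) and computes the usual estimators of $\boldsymbol{\theta}:=(\mu,\sigma^{2})$.

```python
# Minimal sketch: simulate a realization x_0 from the simple Normal model in (2)
# and compute the usual estimators of theta := (mu, sigma^2).
# The values of n, mu, and sigma below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(12345)
n, mu, sigma = 100, 1.0, 2.0

x0 = rng.normal(loc=mu, scale=sigma, size=n)   # NIID(mu, sigma^2) realization

xbar = x0.mean()            # estimator of mu
s2 = x0.var(ddof=1)         # unbiased estimator of sigma^2
print(f"xbar = {xbar:.3f}, s^2 = {s2:.3f}")
```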
The main objective of model-based frequentist inference is to ‘learn from data $\mathbf{x}_0$’ about $\boldsymbol{\theta}^{*}$, where $\boldsymbol{\theta}^{*}$ denotes the ‘true’ value of $\boldsymbol{\theta}$ in $\Theta$; shorthand for saying that there exists a $\boldsymbol{\theta}^{*}\in\Theta$ such that $\mathcal{M}^{*}(\mathbf{x})=\{f(\mathbf{x};\boldsymbol{\theta}^{*})\},\ \mathbf{x}\in\mathbb{R}_X^{n},$ could have generated data $\mathbf{x}_0$.
The cornerstone of frequentist inference is the concept of a sampling distribution, $F_n(y;\boldsymbol{\theta})$, for all ($\forall$) $y\in\mathbb{R}$, of a statistic $Y_n:=g(X_1,X_2,\ldots,X_n)$ (estimator, test, predictor), derived via:
$$F_n(y;\boldsymbol{\theta})=\mathbb{P}(Y_n\leq y)=\underset{\{\mathbf{x}:\,g(\mathbf{x})\leq y\}}{\int\cdots\int}f(\mathbf{x};\boldsymbol{\theta})\,d\mathbf{x},\ \ \forall y\in\mathbb{R}.\qquad(3)$$
The derivation of $F_n(y;\boldsymbol{\theta})$ in (3) presumes the validity of $f(\mathbf{x};\boldsymbol{\theta})$, which in the case of (2) is:
$$f(\mathbf{x};\boldsymbol{\theta})=\prod_{t=1}^{n}f(x_t;\boldsymbol{\theta})=\Big(\tfrac{1}{\sigma\sqrt{2\pi}}\Big)^{\!n}\exp\!\Big(-\tfrac{1}{2\sigma^{2}}\sum_{t=1}^{n}(x_t-\mu)^{2}\Big),\ \ \forall\mathbf{x}\in\mathbb{R}^{n}.\qquad(4)$$
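When $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is fully specified, the integral in (3) can be approximated by simulation. The sketch below (illustrative parameter values assumed, not taken from the text) approximates the sampling distribution of the sample mean under (2) and compares it with the exact result $\overline{X}_n\sim\mathrm{N}(\mu,\sigma^{2}/n)$.

```python
# Minimal sketch: Monte Carlo approximation of the sampling distribution in (3)
# for the statistic Y_n = g(X_1,...,X_n) = sample mean, presuming the validity
# of f(x; theta) in (4), i.e., NIID(mu, sigma^2) data. Values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma, reps = 50, 1.0, 2.0, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))   # reps independent samples X
y = samples.mean(axis=1)                          # Y_n for each realization

# Empirical sampling distribution vs. the exact result N(mu, sigma^2/n)
print("MC mean/std of Y_n :", round(y.mean(), 3), round(y.std(ddof=1), 3))
print("Exact mean/std     :", mu, round(sigma / np.sqrt(n), 3))
```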
In light of the crucial role of the distribution of the sample, Fisher [19], p. 314, emphasized the importance of establishing the statistical adequacy (approximate validity) of the invoked statistical model $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$:
“For empirical as the specification of the hypothetical population [$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$] may be, this empiricism is cleared of its dangers if we can apply a rigorous and objective test of the adequacy with which the proposed population represents the whole of the available facts.”
He went on to underscore the crucial importance of Mis-Specification (M-S) testing (testing the approximate validity of the probabilistic assumptions comprising $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$) as the way to provide an empirical justification for statistical induction:
“The possibility of developing complete and self-contained tests of goodness of fit deserves very careful consideration, since therein lies our justification for the free use which is made of empirical frequency formulae. ”
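As a rough illustration of what M-S testing involves in the context of (2), the sketch below probes the three categories of assumptions, Normality (D), Independence (M), and Identical Distribution (H), using simple auxiliary tests from scipy. It is only a schematic stand-in for the comprehensive, joint M-S testing advocated here; the particular auxiliary tests are choices made for the illustration.

```python
# Minimal sketch of Mis-Specification (M-S) probing for the simple Normal model (2):
# Normality (D), Independence (M), and Identical Distribution (H).
# This is a schematic illustration only; a thorough M-S analysis would use a
# broader battery of joint tests and graphical diagnostics.
import numpy as np
from scipy import stats

def ms_probe(x0):
    n = len(x0)
    _, p_norm = stats.shapiro(x0)                       # (D) Normality
    p_indep = stats.linregress(x0[:-1], x0[1:]).pvalue  # (M) lag-1 dependence
    half = n // 2                                       # (H) first- vs. second-half mean
    _, p_homog = stats.ttest_ind(x0[:half], x0[half:], equal_var=False)
    return {"Normality": p_norm, "Independence": p_indep, "ID (mean shift)": p_homog}

rng = np.random.default_rng(1)
x0 = rng.normal(1.0, 2.0, size=200)   # illustrative NIID data
print(ms_probe(x0))                   # small p-values would flag departures
```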
Statistical adequacy plays a crucial role in securing the reliability of inference because it ensures the approximate equality between the actual and the nominal error probabilities based on $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$, assuring that one can keep track of the relevant error probabilities. In contrast, when $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is statistically misspecified (Spanos [21]),
- (a) $f(\mathbf{x};\boldsymbol{\theta})$ and the likelihood function are erroneous,
- (b) distorting the sampling distribution $F_n(y;\boldsymbol{\theta})$ derived via (3), as well as
- (c) giving rise to ‘non-optimal’ estimators and sizeable discrepancies between the actual error probabilities and the nominal ones, derived assuming $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is valid.
In light of that, the practical way to keep track of the relevant error probabilities is to establish the statistical adequacy of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$. When $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ is misspecified, any attempt to adjust the relevant error probabilities is ill-fated because the actual error probabilities are unknown and can differ sizeably from the nominal ones.
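The actual versus nominal discrepancy can be made tangible by simulation. The sketch below (an assumed design: the Independence assumption of (2) is replaced by AR(1) dependence with the same marginal mean, so $H_0$ remains true) estimates the actual type I error of the nominal $\alpha=0.05$ test in (6); the particular values of $n$ and the autocorrelation are illustrative.

```python
# Minimal sketch: actual vs. nominal type I error when M_theta(x) is misspecified.
# The test is the one-sided t-test in (6) with nominal alpha = 0.05, but the data
# are generated by an AR(1) process (Independence fails) with marginal mean mu0,
# so H0 is true. Design values (n, rho, reps) are illustrative assumptions.
import numpy as np
from scipy import stats, signal

rng = np.random.default_rng(2)
n, mu0, rho, reps, alpha = 100, 0.0, 0.5, 20_000, 0.05
c_alpha = stats.t.ppf(1 - alpha, df=n - 1)

e = rng.normal(0.0, 1.0, size=(reps, n))
x = signal.lfilter([1.0], [1.0, -rho], e, axis=1)   # X_t = rho*X_{t-1} + e_t
d = np.sqrt(n) * (x.mean(axis=1) - mu0) / x.std(axis=1, ddof=1)

print("nominal type I error:", alpha)
print("actual  type I error:", np.mean(d > c_alpha))   # well above 0.05 for rho > 0
```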
Regrettably, as Rao [22], p. 2, points out, validating $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ using comprehensive M-S testing is neglected in statistics courses:
“They teach statistics as a deductive discipline of deriving consequences from given premises [$\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$]. The need for examining the premises, which is important for practical applications of results of data analysis is seldom emphasized. … The current statistical methodology is mostly model-based, without any specific rules for model selection or validating a specified model.”
(p. 2)
See Spanos [
23] for further discussion.
2.2. Neyman–Pearson (N-P) Testing
Example 1 (continued). In the context of (2), testing the hypotheses:
$$H_0\colon\ \mu\leq\mu_0\ \ \text{vs.}\ \ H_1\colon\ \mu>\mu_0,\qquad(5)$$
an optimal (UMP) $\alpha$-significance level test (Neyman and Pearson, 1933 [24]) is:
$$T_{\alpha}:=\{d(\mathbf{X}),\,C_1(\alpha)\},\ \ d(\mathbf{X})=\frac{\sqrt{n}\,(\overline{X}_n-\mu_0)}{s},\ \ \overline{X}_n=\tfrac{1}{n}\sum_{t=1}^{n}X_t,\ \ s^{2}=\tfrac{1}{n-1}\sum_{t=1}^{n}(X_t-\overline{X}_n)^{2},\qquad(6)$$
where $C_1(\alpha)=\{\mathbf{x}\colon\ d(\mathbf{x})>c_{\alpha}\}$ denotes the rejection region, and $c_{\alpha}$ is determined by the significance level $\alpha$; see Lehmann [13].
The sampling distribution of $d(\mathbf{X})$ evaluated under $H_0$ (hypothetical) is:
$$d(\mathbf{X})\overset{\mu=\mu_0}{\sim}\mathrm{St}(n-1),$$
where ‘$\mathrm{St}(n-1)$’ denotes the Student’s t distribution with $(n-1)$ degrees of freedom, which provides the basis for evaluating the type I error probability and the p-value:
$$\mathbb{P}(d(\mathbf{X})>c_{\alpha};\ \mu=\mu_0)=\alpha,\qquad \mathbb{P}(d(\mathbf{X})>d(\mathbf{x}_0);\ \mu=\mu_0)=p(\mathbf{x}_0).\qquad(7)$$
That is, both the type I error probability and the p-value in (7) are evaluated using hypothetical reasoning that interprets ‘$H_0$ is true’ as ‘what if $\mu=\mu_0$’.
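For concreteness, the sketch below evaluates $d(\mathbf{x}_0)$, the threshold $c_{\alpha}$, and the p-value in (7) for illustrative data; $\mu_0=0$ and $\alpha=0.05$ are assumed values.

```python
# Minimal sketch: the one-sided t-test in (6) and its p-value in (7).
# The data, mu0 = 0, and alpha = 0.05 are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x0 = rng.normal(0.4, 2.0, size=50)   # illustrative data (true mu = 0.4)
n, mu0, alpha = len(x0), 0.0, 0.05

d_x0 = np.sqrt(n) * (x0.mean() - mu0) / x0.std(ddof=1)   # d(x_0)
c_alpha = stats.t.ppf(1 - alpha, df=n - 1)               # threshold for C_1(alpha)
p_value = stats.t.sf(d_x0, df=n - 1)                     # P(d(X) > d(x_0); mu = mu0)

print(f"d(x0) = {d_x0:.3f}, c_alpha = {c_alpha:.3f}, p-value = {p_value:.4f}")
print("reject H0" if d_x0 > c_alpha else "accept H0")
```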
The sampling distribution of $d(\mathbf{X})$ evaluated under $H_1$ (hypothetical) is:
$$d(\mathbf{X})\overset{\mu=\mu_1}{\sim}\mathrm{St}(\delta_1,n-1),\ \ \delta_1=\frac{\sqrt{n}\,(\mu_1-\mu_0)}{\sigma},\ \ \forall\mu_1>\mu_0,\qquad(8)$$
where $\delta_1$ is the noncentrality parameter of $\mathrm{St}(\delta_1,n-1)$, which provides the basis for evaluating the power of test $T_{\alpha}$:
$$\mathcal{P}(\mu_1)=\mathbb{P}(d(\mathbf{X})>c_{\alpha};\ \mu=\mu_1),\ \ \forall\mu_1>\mu_0.$$
The form of $\delta_1$ in (8) indicates that the power increases monotonically with $\sqrt{n}$ and $(\mu_1-\mu_0)$ and decreases with $\sigma$.
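The power implied by the noncentral Student’s t distribution in (8) can be computed directly; the sketch below uses assumed values of $\mu_0$, $\mu_1$, and $\sigma$ to display the monotonic increase of the power with $n$.

```python
# Minimal sketch: power of T_alpha using the noncentral Student's t in (8).
# mu0, mu1, sigma, and alpha are illustrative assumptions.
import numpy as np
from scipy import stats

def power(n, mu0, mu1, sigma, alpha=0.05):
    c_alpha = stats.t.ppf(1 - alpha, df=n - 1)
    delta1 = np.sqrt(n) * (mu1 - mu0) / sigma          # noncentrality parameter
    return stats.nct.sf(c_alpha, df=n - 1, nc=delta1)  # P(d(X) > c_alpha; mu = mu1)

# Power increases with n and (mu1 - mu0), and decreases with sigma:
for n in (20, 50, 100, 400):
    print(n, round(power(n, mu0=0.0, mu1=0.5, sigma=2.0), 3))
```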
The optimality of N-P tests revolves around an inherent trade-off between the type I and II error probabilities. To address that trade-off, Neyman and Pearson [24] proposed to construct an optimal test by prespecifying $\alpha$ at a low value and minimizing the type II error probability $\beta(\mu_1)=\mathbb{P}(d(\mathbf{X})\leq c_{\alpha};\ \mu=\mu_1)$, or equivalently maximizing the power $\mathcal{P}(\mu_1)=1-\beta(\mu_1)$, $\forall\mu_1>\mu_0$.
A question that is often overlooked in traditional expositions of N-P testing is: Where does prespecifying the type I error probability at a low threshold come from? A careful reading of Neyman and Pearson [24] reveals the answer in the form of two crucial stipulations relating to the framing of $H_0\colon\ \boldsymbol{\theta}\in\Theta_0$ and $H_1\colon\ \boldsymbol{\theta}\in\Theta_1$ to ensure the effectiveness of N-P testing and the informativeness of the ensuing results:
- 1. $H_0$ and $H_1$ should form a partition of $\Theta$ (p. 293) to avoid the case where the true value $\boldsymbol{\theta}^{*}$ lies outside both $\Theta_0$ and $\Theta_1$.
- 2. $H_0$ and $H_1$ should be framed in such a way as to ensure that the type I error is the more serious of the two, using the analogy with a criminal trial (p. 296), with $H_0$: not guilty, to secure a low probability of sending an innocent person to jail.
Unveiling the intended objective of stipulation 2 suggests that the requirement for a small (prespecified) $\alpha$ is to ensure that the test has a low probability of rejecting a true null hypothesis, i.e., when $\mu\leq\mu_0$. Minimizing the type II error probability implies that an optimal test should have the lowest possible probability of accepting (equivalently, the highest power for rejecting) $H_0$ when false, i.e., when $\mu>\mu_0$. That is, an optimal test should have high power around the potential neighborhood of the true value $\mu^{*}$. This implies that when no reliable information about this potential neighborhood is available, one should use a two-sided test to avert the case where the test has no or very low power around $\mu^{*}$; see Spanos [25].
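The following sketch illustrates the point numerically (assumed values of $n$, $\sigma$, and $\mu_1$): a one-sided test that rejects for large positive $d(\mathbf{X})$ has essentially zero power when the true $\mu$ lies below $\mu_0$, whereas a two-sided test, rejecting for large $|d(\mathbf{X})|$, retains power on both sides.

```python
# Minimal sketch: one-sided vs. two-sided power when the true mu lies on either
# side of mu0 = 0. Values of n, sigma, alpha, and mu1 are illustrative assumptions.
import numpy as np
from scipy import stats

n, sigma, alpha, mu0 = 50, 2.0, 0.05, 0.0

def power_one_sided(mu1):
    c = stats.t.ppf(1 - alpha, df=n - 1)
    nc = np.sqrt(n) * (mu1 - mu0) / sigma
    return stats.nct.sf(c, df=n - 1, nc=nc)            # reject when d(X) > c

def power_two_sided(mu1):
    c = stats.t.ppf(1 - alpha / 2, df=n - 1)
    nc = np.sqrt(n) * (mu1 - mu0) / sigma
    return stats.nct.sf(c, df=n - 1, nc=nc) + stats.nct.cdf(-c, df=n - 1, nc=nc)

for mu1 in (-1.0, 1.0):
    print(f"mu1 = {mu1:+.1f}: one-sided {power_one_sided(mu1):.3f}, "
          f"two-sided {power_two_sided(mu1):.3f}")
```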
Because the power increases with $n$, it is important to take that into account in selecting an appropriate $n$ and $\alpha$ to avoid both an under-powered and an over-powered test, forefending the ‘small $n$’ and ‘large $n$’ problems, respectively.
First, for a given $\alpha$, one needs to calculate the value of $n$ needed for $T_{\alpha}$ to have sufficient power to detect parameter discrepancies $\gamma_1=\mu_1-\mu_0$ of interest; see Spanos [26].
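A minimal sketch of this first step, under assumed values of the discrepancy of interest $\gamma_1$, $\sigma$, $\alpha$, and the target power: it returns the smallest $n$ for which $T_{\alpha}$ attains the target power.

```python
# Minimal sketch: smallest n for which T_alpha attains a target power against a
# discrepancy of interest gamma1 = mu1 - mu0. gamma1, sigma, alpha, and the
# target power are illustrative assumptions.
import numpy as np
from scipy import stats

def required_n(gamma1, sigma, alpha=0.05, target=0.80, n_max=100_000):
    for n in range(2, n_max):
        c_alpha = stats.t.ppf(1 - alpha, df=n - 1)
        delta1 = np.sqrt(n) * gamma1 / sigma
        if stats.nct.sf(c_alpha, df=n - 1, nc=delta1) >= target:
            return n
    return None

print(required_n(gamma1=0.5, sigma=2.0))   # e.g., n needed to detect gamma1 = 0.5
```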
Second, for a large $n$ one needs to adjust $\alpha$ to avoid an ultra-sensitive test that could detect tiny discrepancies, say $\gamma_1=\mu_1-\mu_0\simeq 0$, and misleadingly declare them statistically significant.
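The sketch below illustrates this ‘large $n$ problem’ under assumed values: with $\alpha$ fixed at 0.05, the power of $T_{\alpha}$ to detect a substantively tiny discrepancy ($\gamma_1=0.01$, $\sigma=2$) goes from negligible to near one as $n$ grows, so ‘statistical significance’ at conventional thresholds can reflect the sample size rather than a substantively meaningful discrepancy.

```python
# Minimal sketch of the 'large n problem': with alpha fixed at 0.05, the power of
# T_alpha to detect a substantively tiny discrepancy (gamma1 = 0.01, sigma = 2.0,
# both illustrative assumptions) approaches one as n grows.
import numpy as np
from scipy import stats

def power(n, gamma1, sigma, alpha=0.05):
    c_alpha = stats.t.ppf(1 - alpha, df=n - 1)
    return stats.nct.sf(c_alpha, df=n - 1, nc=np.sqrt(n) * gamma1 / sigma)

for n in (10**3, 10**4, 10**5, 10**6):
    print(f"n = {n:>7}: power against gamma1 = 0.01 is {power(n, 0.01, 2.0):.3f}")
```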
The primary role of the pre-data testing error probabilities (type I, II, and power) is to operationalize the notions of ‘statistically significant/insignificant’ in terms of statistical approximations framed around a test statistic $d(\mathbf{X})$ and its sampling distribution. These error probabilities calibrate the capacity of the test to shed sufficient light on $\boldsymbol{\theta}^{*}$, giving rise to learning from data. In this sense, the reliability of the testing results ‘accept/reject $H_0$’ depends crucially on the particular statistical testing context (Spanos [17], ch. 13):
$$\big\{\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x}),\ \ H_0\ \text{vs.}\ H_1,\ \ T_{\alpha}:=\{d(\mathbf{X}),C_1(\alpha)\},\ \ \mathbf{x}_0,\ \ n\big\},\qquad(11)$$
which includes not only the adequacy of $\mathcal{M}_{\boldsymbol{\theta}}(\mathbf{x})$ vis-à-vis data $\mathbf{x}_0$, but also the framing of $H_0$ and $H_1$, as well as $n$; see Spanos [16,17]. For instance, when $n$ is very large, detaching the accept/reject $H_0$ results from their statistical context in (11) and claiming statistical significance at conventional thresholds will often be an unwarranted claim.
Let us elaborate on this assertion.