1. Introduction
Principles of Statistical Inference (or Data Reduction) constitute important guidelines on how to draw conclusions from data, especially when performing standard inferential procedures for unknown parameters of interest, such as estimation and hypothesis testing. For instance, the Sufficiency Principle (SP) states that a sufficient statistic retains all the information about the unknown parameters that is relevant for making inferences about them. More precisely, it recommends that if T is a sufficient statistic for the statistical model under consideration and x and y are sample points such that T(x) = T(y), then the observation of either of these points should lead to the same conclusions regarding the parameters of interest.
Besides the place of sufficiency in Statistical Inference, these recommendations cover several issues, such as the contrast between post-experimental and pre-experimental reasoning and the roles of non-informative stopping rules, censoring mechanisms and nuisance parameters in data analysis. Among the main principles, the Sufficiency Principle is generally recognized as a cornerstone of Statistical Inference. On the other hand, the Likelihood Principle (LP) and its profound consequences are still subjects of intense debate. The reader will find a detailed discussion of the Likelihood Principle in [1,2,3,4,5,6].
In this work, we examine the Non-Informative Nuisance Parameter Principle (NNPP), introduced by Berger and Wolpert in their remarkable 1988 book, which concerns how inferences about a parameter of interest should be made in the presence of nuisance parameters. Nuisance parameters usually affect inferences about the parameter of interest, as in the estimation of the mean of a normal distribution with unknown variance, the estimation of the parameters of a linear regression model with unknown variance, and the determination of p-values for specific hypotheses in the analysis of contingency tables ([7]). In short, the NNPP states that, under suitable conditions, it is irrelevant for drawing conclusions about the parameter of interest whether the value of a non-informative nuisance parameter is known or not. Despite the importance of the problem of eliminating nuisance parameters in data analysis, this principle and its consequences have not been explored in depth, as far as we have reviewed the literature. For this reason, we revisit the NNPP by formally stating it for the problem of hypothesis testing, present decision rules that meet the principle and show how the performance of a particular test in line with the NNPP can then be simplified.
This work is organized as follows: in Section 2, the NNPP for hypothesis testing is stated, discussed and illustrated under a Bayesian perspective. In Section 3, the Bayesian test procedure based on the concept of adaptive significance level and on an alternative p-value introduced by Pericchi and Pereira in [8], henceforth named the mixed test, is reviewed and is proven to satisfy the NNPP for discrete sample data when the (marginal) null hypothesis regarding the parameter of interest is a singleton (as a matter of fact, the result also holds when such a null hypothesis is specified by a hyperplane). In that section, we also define conditional versions of the adaptive significance level and p-value based on suitable statistics and prove that, under those conditions, the performance of the mixed test reduces to the comparison between these new conditional quantities. These results are of great importance for making the mixed test easier to use in various situations. In Section 4, we exemplify the main results by presenting new solutions, by means of the mixed test, to well-known problems of tests of hypotheses for count data under suitable reparametrizations of the corresponding models: we revisit the problems of the comparison of Poisson population means and of testing the hypotheses of independence and symmetry in contingency tables. We make our final comments in Section 5. The proofs of the theorems and the calculations for one example in Section 4 are found in Appendix A.
2. The Non-Informative Nuisance Parameter Principle for Hypothesis Testing
The problem of the elimination of nuisance parameters in statistical inference has a long history and remains a major issue. Proposals to deal with it include the marginalization of the likelihood function by integrating out the nuisance parameter ([9,10,11]), the construction of partial likelihood functions ([12,13,14], among others) and the consideration of conditional likelihood functions based on different notions of non-informativeness, sufficiency and ancillarity. The elimination of nuisance parameters and different notions of non-information have also been studied in more detail in [15,16,17,18], where, based on suitable statistics, the concepts of B, S and G non-information are presented. The generalized Sufficiency and Conditionality Principles are also discussed in [17]. On the other hand, Bayesian methods for eliminating nuisance parameters based on a suitable statistic T involve different definitions of sufficiency: for instance, K-Sufficiency, Q-Sufficiency and L-Sufficiency (see, for example, [17] and references therein).
In this section, the Non-Informative Nuisance Parameter Principle (NNPP) by Berger and Wolpert is discussed and formally defined for the problem of hypothesis testing. As we will see, the NNPP seems justifiable under both the partial and the conditional non-Bayesian approaches mentioned in the previous paragraph, and it is particularly natural from the Bayesian standpoint. Despite the relevance of the problem of the elimination of nuisance parameters in data analysis, Berger and Wolpert [1] presented the NNPP but did not explore the principle in depth, as far as we have examined the literature.
Some notation is needed to continue. We denote by θ the unknown parameter and by X the sample to be observed. Θ and 𝒳 represent the parameter and the sample spaces, respectively. The family of discrete probability distributions for X is denoted by {P_θ : θ ∈ Θ}. In addition, for x ∈ 𝒳, L_x(θ) denotes the likelihood function for θ generated by the sample point x. By an experiment E, we mean, as in [1], a triplet E = (X, θ, {P_θ}), with X, θ and {P_θ} as defined earlier. Finally, for a subset Θ_H of Θ, we formulate the null hypothesis H: θ ∈ Θ_H and the alternative one A: θ ∉ Θ_H. We recall that a test function (procedure) for the hypotheses H versus A is a function φ: 𝒳 → {0, 1} that takes the value 1 (φ(x) = 1) if H is rejected when x ∈ 𝒳 is observed and takes the value 0 (φ(x) = 0) if H is not rejected when x is observed. Under the Bayesian perspective, we also consider a continuous prior density function π for θ that induces, when combined with the likelihood function L_x(θ), a continuous posterior density function for θ given x, π(θ | x).
In [1], Berger and Wolpert presented the following principle on how to make inferences about an unknown parameter of interest γ in the presence of a nuisance parameter ξ: when a sample observation, say x, separates information concerning γ from information on ξ, it is irrelevant whether the value of ξ is known or unknown in order to make inferences about γ based on the observation of x. In other terms, if the conclusions on γ were to be the same for every possible value of the nuisance parameter, were ξ known, then the same conclusions on γ should be reached even if ξ is unknown. These authors then consider the following mathematical setup to formalize these ideas.
Let θ = (γ, ξ), with γ and ξ defined as in the previous paragraph. Consider Θ = Γ × Ξ; that is, the parameter space is variation independent, where Γ is the set of values for γ and Ξ is the set of values for ξ. Suppose the experiment E = (X, (γ, ξ), {P_(γ,ξ)}) is carried out to learn about γ. Let E′ = ((X, ξ), γ, {P′_γ}) be the “thought” experiment in which the pair (X, ξ) is to be observed (instead of observing only X), where {P′_γ} is the family of distributions for (X, ξ) indexed by γ. Suppose also that under experiment E, the likelihood function generated by a specific x ∈ 𝒳 for (γ, ξ) has the following factored form:

L_x(γ, ξ) = f_x(γ) h_x(ξ), (1)

where f_x and h_x are non-negative functions on Γ and Ξ, respectively; that is, L_x depends on γ only through the factor f_x(γ) and on ξ only through the factor h_x(ξ).
Berger and Wolpert then state the Non-Informative Nuisance Parameter Principle (NNPP): if x and ξ are such that (1) holds, and if the inference about γ from the observation of (x, ξ) when E′ is performed does not depend on ξ, then the inferential statements made for γ from E and x should be the same as (should coincide with) the inferential statements made from E′ and (x, ξ) for every ξ ∈ Ξ. The authors named such a parameter a Non-Informative Nuisance Parameter (NNP), as the conclusions or decisions regarding γ from E′ and (x, ξ) do not depend on ξ.
A likelihood function that satisfies (1) is named a likelihood function with separable parameters ([19]). The factored form of the likelihood function in (1) seems to capture the notion of “absence of information about one parameter, say γ, from the other, ξ, and vice versa” under both Bayesian and non-Bayesian reasoning. Indeed, under the Bayesian paradigm, posterior independence between γ and ξ (say, given x) reflects the fact that one's opinion about the parameter γ after observing x is not altered by any information about ξ, and consequently, decisions regarding γ should not depend on ξ. Since posterior independence between γ and ξ given x is equivalent, under prior independence, to the factored form of the likelihood function generated by x, condition (1) sounds really reasonable as a mathematical description of separate information about the parameters. Thus, if a Bayesian statistician is to make inferences regarding a parameter γ in the presence of a nuisance parameter ξ, it would be ideal for these parameters to be independent a posteriori; that is, for the factored form of the likelihood function to hold. This last equivalence is proven in the theorem below.
Theorem 1. Let E = (X, (γ, ξ), {P_(γ,ξ)}) be an experiment and π be the prior probability density function for (γ, ξ). Suppose γ is independent of ξ a priori. Then, for each x ∈ 𝒳, γ and ξ are independent a posteriori given x if and only if there exist functions f_x and h_x such that L_x(γ, ξ) = f_x(γ) h_x(ξ).
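Theorem 1 can be checked numerically on a discrete grid. The sketch below is our own illustration (the grids, priors and likelihood factors are arbitrary choices, with `g` standing for the interest parameter and `s` for the nuisance): under prior independence, a likelihood with separable parameters yields a joint posterior that factors into the product of its marginals.

```python
# Numerical check of Theorem 1 on a grid (illustrative choices only):
# independent priors + separable likelihood => posterior independence.

grid = [0.1 * i for i in range(1, 10)]        # grid for each parameter
prior_g = {a: 1.0 for a in grid}              # (unnormalized) prior for g
prior_s = {b: 2.0 * b for b in grid}          # (unnormalized) prior for s

# likelihood with separable parameters, L_x(g, s) = f_x(g) * h_x(s)
f_x = lambda a: a**3 * (1 - a)**7
h_x = lambda b: b**2 * (1 - b)**3

post = {(a, b): prior_g[a] * f_x(a) * prior_s[b] * h_x(b)
        for a in grid for b in grid}
z = sum(post.values())
post = {k: v / z for k, v in post.items()}    # normalized joint posterior

marg_g = {a: sum(post[(a, b)] for b in grid) for a in grid}
marg_s = {b: sum(post[(a, b)] for a in grid) for b in grid}

# posterior independence: the joint equals the product of the marginals
assert all(abs(post[(a, b)] - marg_g[a] * marg_s[b]) < 1e-12
           for a in grid for b in grid)
print("posterior factorizes: g and s are independent a posteriori")
```

The check is grid-based only; the continuous statement of the theorem follows the same cancellation of the ξ-factor in the normalizing constant.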
On the other hand, condition (1) also seems to be a fair representation of the non-informativeness of one parameter about another under a non-Bayesian perspective. In fact, such a factored form of the likelihood function arises, for instance, when the sample X is conditioned on particular types of statistics that are simple to interpret under non-Bayesian paradigms. Note that for any statistic T, one can write

L_x(γ, ξ) = P(T = T(x) | γ, ξ) P(X = x | T = T(x), γ, ξ).

If, in addition, T is a statistic such that its distribution given (γ, ξ) depends only on γ, and the conditional distribution of X given T, γ and ξ depends only on ξ, the factored form in (1) is easily obtained (such a statistic was named p-sufficient for γ by Basu ([17])). In this situation, all the relevant information on γ is summarized in T, and one can fully make inferences on γ by taking into account only the distribution of T given (γ, ξ), which does not involve ξ. Similarly, if T is a statistic such that its distribution given (γ, ξ) depends only on ξ and the conditional distribution of X given T, γ and ξ depends only on γ, the factored form in (1) holds. Such a statistic was named s-ancillary for γ by Basu ([17]), and it is somewhat evident that in this case, conclusions on γ should be drawn exclusively from the distribution of X given T, γ and ξ, which does not depend on ξ. Such a conditional approach to the problem of the elimination of nuisance parameters had already been proposed by Basu ([17]) and is, in a sense, closely related to the NNPP of Berger and Wolpert. The next theorem formally presents these results.
Theorem 2. Let E = (X, (γ, ξ), {P_(γ,ξ)}) be an experiment in which θ = (γ, ξ) and Θ = Γ × Ξ is variation independent. If T is a statistic that is either p-sufficient or s-ancillary for γ, then, for each x ∈ 𝒳, the likelihood function generated by x, L_x(γ, ξ), can be factored as in (1).
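A classical instance of Theorem 2 (our example, not one taken from the paper) is a pair of independent Poisson counts. Writing λ1 = ψμ and λ2 = (1 − ψ)μ, the statistic T = X1 + X2 is s-ancillary for ψ: T ~ Poisson(μ) depends only on the nuisance μ, while X1 | T = t ~ Binomial(t, ψ) depends only on ψ, so the likelihood separates exactly as in (1). The sketch below verifies this factorization numerically.

```python
# Two independent Poisson counts, reparametrized as psi = l1/(l1+l2) and
# mu = l1 + l2; the joint likelihood then separates into a Binomial factor
# in psi (from X1 given T) and a Poisson factor in mu (from T = X1 + X2).
from math import exp, factorial, comb

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def likelihood(x1, x2, psi, mu):
    """Joint likelihood of (x1, x2) in the (psi, mu) parametrization."""
    l1, l2 = psi * mu, (1 - psi) * mu
    return poisson_pmf(x1, l1) * poisson_pmf(x2, l2)

x1, x2 = 4, 7
t = x1 + x2
for psi in (0.2, 0.5, 0.8):
    for mu in (5.0, 11.0, 20.0):
        separable = (comb(t, x1) * psi**x1 * (1 - psi)**x2) * poisson_pmf(t, mu)
        assert abs(likelihood(x1, x2, psi, mu) - separable) < 1e-12
print("likelihood separates: Binomial factor in psi times Poisson factor in mu")
```

This is the same reparametrization used later for the comparison of Poisson population means in Section 4.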
In summary, it seems reasonable that inferences about γ and ξ can be performed independently under condition (1). Thus, if only γ is of interest, then it seems sensible under (1) to reach the same conclusions on γ when x is observed either by using the whole likelihood function L_x(γ, ξ) or only the factor involving γ. That is, it makes sense to disregard the information contained in the factor involving ξ and focus on the one involving γ. As mentioned in [19], examples of likelihood functions with separable parameters as in (1) are rare; when (1) does hold, however, it is a useful property for both Bayesian and non-Bayesian statisticians in the analysis of statistical data, especially in the presence of nuisance parameters. This fact will be illustrated in Section 3 and Section 4.
We end this section by formally adapting the general NNPP to the special problem of hypothesis testing, in which inference about an unknown parameter consists in deciding whether a statement about the parameter (a statistical hypothesis) should be rejected or accepted by using the observable quantity X.

As before, let E = (X, (γ, ξ), {P_(γ,ξ)}) be an experiment, with Θ = Γ × Ξ. Let E′ be the “thought” experiment in which, in addition to X, ξ is observed. Then, consider the following definition.
Definition 1. Non-Informative Nuisance Parameter (NNP): Let Γ_H ⊂ Γ and φ′ be a test for the hypotheses H: γ ∈ Γ_H versus A: γ ∉ Γ_H under the experiment E′. Then, we say that ξ is a Non-Informative Nuisance Parameter (NNP) for testing H versus A by using φ′ if, for every (x, ξ) such that (1) holds, φ′(x, ξ) does not depend on ξ; that is, it depends only on x.

In a nutshell, Definition 1 tells us something that appears intuitive: if the decision between H and A does not depend on ξ, then ξ does not provide any information about γ. In the following example, we illustrate this idea.
Example 1. Consider θ = (γ, ξ) and the “thought” experiment E′ in which (X, ξ) is observed. Let B ⊂ Γ and φ′ be the test for the hypotheses H: γ ∈ B versus A: γ ∉ B such that the null hypothesis is rejected when the conditional probability of B given x and ξ is small; that is,

φ′(x, ξ) = 1 if and only if P(γ ∈ B | x, ξ) < c,

where c ∈ (0, 1). Suppose, in addition, that γ and ξ are independent a priori. Let us verify that ξ is an NNP for testing these hypotheses by means of φ′. Let (x, ξ) be such that L_x(γ, ξ) = f_x(γ) h_x(ξ) for specific functions f_x and h_x. Then,

P(γ ∈ B | x, ξ) = ∫_B f_x(γ) h_x(ξ) π_γ(γ) dγ / ∫_Γ f_x(γ) h_x(ξ) π_γ(γ) dγ,

where π_γ is the prior density of γ. Thus, we have that

P(γ ∈ B | x, ξ) = ∫_B f_x(γ) π_γ(γ) dγ / ∫_Γ f_x(γ) π_γ(γ) dγ. (7)

Note from Equation (7) that P(γ ∈ B | x, ξ) does not depend on ξ. Thus, ξ is an NNP for testing H versus A by using φ′.
After defining an NNP, we formally state the Non-Informative Nuisance Parameter Principle (NNPP) for hypothesis testing.
Definition 2. Non-Informative Nuisance Parameter Principle (NNPP): Let the parameter space be variation independent; that is, Θ = Γ × Ξ. Consider the experiments E and E′. Let Γ_H be the subset of Γ of interest. In addition, let φ and φ′ be tests for the hypotheses H: γ ∈ Γ_H versus A: γ ∉ Γ_H, under E and E′, respectively. If ξ is an NNP for testing H versus A by using φ′ and (x, ξ) is such that condition (1) holds, then φ(x) = φ′(x, ξ).

The NNPP for statistical hypothesis testing says that if one intends to test a hypothesis regarding only the parameter γ, it is irrelevant whether ξ is known or unknown if it is non-informative for such a decision-making problem. More formally, if one wants to test a hypothesis concerning only γ and observes a sample point x that separates information on γ from information on ξ (that is, (1) holds), then the tests φ under the original experiment E and φ′ under the “thought” experiment E′ should yield the same decision on the hypothesis if ξ is non-informative for that purpose.
We should mention that the NNPP can be adapted to any other inferential procedure. However, in this work, we focus on the principle for the problem of hypothesis testing. We conclude this section by proving that tests based on the posterior probabilities of the hypotheses satisfy the NNPP under prior independence.
Example 2 (continuation of Example 1). Consider the conditions of Example 1. Consider the original experiment E, in which only X is observed, and let φ be the test for the hypotheses H: γ ∈ B versus A: γ ∉ B that rejects the null hypothesis H if its posterior probability is small; that is,

φ(x) = 1 if and only if P(γ ∈ B | x) < c. (11)

Let x be such that L_x(γ, ξ) = f_x(γ) h_x(ξ). We can write the posterior probability on the right-hand side of (11) as

P(γ ∈ B | x) = ∫_B ∫_Ξ f_x(γ) h_x(ξ) π_γ(γ) π_ξ(ξ) dξ dγ / ∫_Γ ∫_Ξ f_x(γ) h_x(ξ) π_γ(γ) π_ξ(ξ) dξ dγ = [∫_B f_x(γ) π_γ(γ) dγ] [∫_Ξ h_x(ξ) π_ξ(ξ) dξ] / ([∫_Γ f_x(γ) π_γ(γ) dγ] [∫_Ξ h_x(ξ) π_ξ(ξ) dξ]),

where the last equality follows from Fubini's Theorem. Hence,

P(γ ∈ B | x) = ∫_B f_x(γ) π_γ(γ) dγ / ∫_Γ f_x(γ) π_γ(γ) dγ. (13)

From Equations (7) and (13), we have that φ(x) = φ′(x, ξ) for every ξ ∈ Ξ. Thus, the NNP Principle is met by tests based on posterior probabilities, as in Example 1. This result also holds in more general settings.
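The computation in Example 2 can be mimicked on a discrete grid. In the sketch below (all concrete choices are ours, with `g` the interest parameter and `s` the nuisance), the posterior probability of a hypothesis about g alone is the same whether the nuisance is integrated out or fixed at any known value, which is exactly why tests based on posterior probabilities meet the NNPP.

```python
# Discretized version of Example 2: under prior independence and a separable
# likelihood, P(g in H-region | x) does not change if the nuisance s is known.

grid = [0.05 * i for i in range(1, 20)]
prior_g = {a: 1.0 for a in grid}             # prior for the interest parameter
prior_s = {b: 2.0 * b for b in grid}         # prior for the nuisance parameter
f_x = lambda a: a**3 * (1 - a)**7            # interest factor of the likelihood
h_x = lambda b: b**2 * (1 - b)**3            # nuisance factor of the likelihood

H = [a for a in grid if a <= 0.5]            # hypothesis region for g

# nuisance unknown: integrate it out of the joint posterior
num = sum(prior_g[a] * f_x(a) * prior_s[b] * h_x(b) for a in H for b in grid)
den = sum(prior_g[a] * f_x(a) * prior_s[b] * h_x(b) for a in grid for b in grid)
p_unknown = num / den

# nuisance known to equal b0: the factor h_x(b0) cancels in the ratio
for b0 in grid:
    num_k = sum(prior_g[a] * f_x(a) * h_x(b0) for a in H)
    den_k = sum(prior_g[a] * f_x(a) * h_x(b0) for a in grid)
    assert abs(num_k / den_k - p_unknown) < 1e-12
print("P(H | x) is the same whether the nuisance is known or not:",
      round(p_unknown, 4))
```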
In the next section, we examine a second test procedure that is in line with the NNPP. We review the mixed test introduced by Pericchi and Pereira ([8]) and prove that such a test meets the NNPP for simple hypotheses concerning the parameter of interest. We also show how the adherence of the mixed test to the NNPP can then simplify its use.
3. The Mixed Test Procedure
The mixed test, formally introduced in [8], is a test procedure that combines elements from both Bayesian and frequentist views. On the one hand, it considers an (intrinsically Bayesian) prior distribution for the parameter, from which predictive distributions for the data under the competing hypotheses and Bayes factors are derived. On the other hand, the performance of the test depends on ordering the sample space by the Bayes factor and on the integration of these predictive distributions over specific subsets of the sample space in a frequentist-like manner. The mixed test is optimal in the sense that it minimizes linear combinations of averaged (weighted) probabilities of decision errors. It also meets a few logical requirements for multiple-hypothesis testing and obeys the Likelihood Principle for discrete sample spaces, despite the integration over the sample space it involves. In addition, the test overcomes several of the drawbacks of fixed-level tests. However, a difficulty with the mixed test procedure is the need to evaluate the Bayes factor for every sample point in order to order the sample space, which may involve intensive calculations. Properties of the mixed test and examples of its application are examined in detail in [8,20,21,22,23,24,25].
Next, we review the general procedure for the performance of the mixed test and then show the test satisfies the NNPP when the hypothesis regarding the parameter of interest is a singleton.
First, we determine the predictive distributions for X under the competing hypotheses H and A, f_H and f_A, respectively. For the null hypothesis H: θ ∈ Θ_H, f_H is determined as follows: for each x ∈ 𝒳,

f_H(x) = ∫ L_x(θ) dπ_H(θ), (14)

where π_H denotes the conditional prior distribution of θ given H. That is, for each x ∈ 𝒳, f_H(x) is the expected value, against π_H, of the likelihood function generated by x. Similarly, for the alternative hypothesis A, we define

f_A(x) = ∫ L_x(θ) dπ_A(θ), (15)

where π_A denotes the conditional prior distribution of θ given A. From (14) and (15), we obtain the Bayes factor of x ∈ 𝒳 for the hypothesis H over A as

BF(x) = f_H(x) / f_A(x). (16)

Finally, the mixed test φ* for the hypotheses H versus A consists in rejecting H when x ∈ 𝒳 is observed if and only if the Bayes factor BF(x) is small. That is, for each x ∈ 𝒳,

φ*(x) = 1 if and only if BF(x) < b/a, (17)

where the positive constants a and b reflect the decision maker's evaluation of the impact of the errors of the two types or, equivalently, his prior preferences for the competing hypotheses. A detailed discussion of the specification of such constants is found in [8,20,21,22,23,24,25].
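To make the construction above concrete, the sketch below instantiates the predictive distributions (14) and (15) and the Bayes factor (16) for a simple case of our own choosing: a Binomial(n, θ) observation with point null H: θ = 1/2 and a uniform Beta(1, 1) prior under A, so that f_A is Beta-Binomial. The rejection rule BF(x) < b/a is our reading of (17), and the weights a = b = 1 are arbitrary.

```python
# Mixed test for X ~ Binomial(n, theta), H: theta = 1/2 vs A: theta != 1/2,
# with a uniform prior on theta under A (all choices illustrative).
from math import comb, factorial
from fractions import Fraction

n, theta0 = 10, Fraction(1, 2)

def f_H(x):
    """Predictive pmf of X under H: the Binomial(n, theta0) pmf, as in (14)."""
    return comb(n, x) * theta0**x * (1 - theta0)**(n - x)

def f_A(x):
    """Predictive pmf of X under A: Beta-Binomial(n, 1, 1), i.e. the Binomial
    likelihood integrated against a uniform Beta(1, 1) prior, as in (15)."""
    return comb(n, x) * Fraction(factorial(x) * factorial(n - x), factorial(n + 1))

def bayes_factor(x):                  # Equation (16): BF(x) = f_H(x) / f_A(x)
    return f_H(x) / f_A(x)

# With a = b = 1 (equal weights on the two error types), reject when BF < 1.
a_w, b_w = 1, 1
for x in range(n + 1):
    decision = "reject" if bayes_factor(x) < Fraction(b_w, a_w) else "accept"
    print(x, float(bayes_factor(x)), decision)
```

Exact `Fraction` arithmetic avoids any floating-point ambiguity in the comparisons with the cut-off.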
The mixed test can also be defined as a function of a new significance index; that is, (17) can be rewritten as a comparison between such a significance index and a specific cut-off value. These quantities are defined below.

For the mixed test defined in (17), the p-value of the observation x ∈ 𝒳 is the significance index given by

p(x) = P_H(BF(X) ≤ BF(x)), (18)

where P_H denotes probability computed under the predictive distribution f_H. Also, we define the adaptive type I error probability of φ* as

α* = P_H(φ*(X) = 1), (19)

that is, the probability, under f_H, that the test rejects H. Alternatively, α* is also known as the adaptive significance level of φ*.
Pereira et al. [21] proved that the mixed test φ* for the hypotheses H versus A can be written as

φ*(x) = 1 if and only if p(x) ≤ α*.

Note that φ* consists of comparing the p-value p(x) with the cut-off α*, which depends on the specific statistical model under consideration and on the sample size, as opposed to a standard test with a fixed significance level that does not depend on the sample size. The former does not have some of the disadvantages of the latter, such as inconsistency ([8,26]), the lack of correspondence between practical significance and statistical significance ([8,27]) and the absence of logical coherence under multiple-hypothesis testing. We continue with the main results of the manuscript.
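The sketch below illustrates the p-value in (18), the adaptive significance level in (19) and the rewritten form of the mixed test for a binomial point-null problem. The model, the uniform prior under A and the cut-off b/a = 1 are our own illustrative choices, and "reject H when the p-value does not exceed the adaptive significance level" is our reading of the result of Pereira et al. [21]; exact Fraction arithmetic keeps the boundary comparisons unambiguous.

```python
# p-value (18), adaptive significance level (19) and the equivalent form of
# the mixed test for X ~ Binomial(n, theta), H: theta = 1/2 (choices ours).
from math import comb, factorial
from fractions import Fraction

n, theta0, k = 10, Fraction(1, 2), Fraction(1, 1)   # k = b/a cut-off on the BF

def f_H(x):
    """Predictive pmf under H: Binomial(n, theta0)."""
    return comb(n, x) * theta0**x * (1 - theta0)**(n - x)

def f_A(x):
    """Predictive pmf under A: Beta-Binomial(n, 1, 1), i.e. uniform prior."""
    return comb(n, x) * Fraction(factorial(x) * factorial(n - x), factorial(n + 1))

def BF(x):
    return f_H(x) / f_A(x)

def p_value(x):
    """P_H(BF(X) <= BF(x)), as in (18)."""
    return sum(f_H(y) for y in range(n + 1) if BF(y) <= BF(x))

# adaptive significance level (19): predictive probability of rejection under H
alpha = sum(f_H(y) for y in range(n + 1) if BF(y) < k)

# equivalent form: reject H exactly when the p-value does not exceed alpha
for x in range(n + 1):
    assert (BF(x) < k) == (p_value(x) <= alpha)
print("adaptive significance level:", float(alpha))
```

Note how the cut-off α* comes out of the model and the sample size n rather than being fixed in advance, in line with the discussion above.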
The Mixed Test Obeys the NNPP
In this subsection, we prove that the mixed test meets the NNPP when the hypothesis about the parameter of interest is simple. We then examine further the case in which there is a statistic that is s-ancillary for the parameter of interest and show how the introduction of the concepts of a conditional p-value and a conditional adaptive significance level can make the performance of the mixed test much easier.
Theorem 3. Let θ = (γ, ξ) and Θ = Γ × Ξ (that is, Θ is variation independent). Let E and E′ be the two experiments defined in Section 2. Let γ0 ∈ Γ. In addition, let φ* and φ*′ be the mixed tests for the hypotheses H: γ = γ0 versus A: γ ≠ γ0, under E and E′, respectively. Assume ξ is absolutely continuous with prior density function π. Then, ξ is a Non-Informative Nuisance Parameter for testing H versus A by using φ*′, and for every (x, ξ) such that (1) holds, φ*(x) = φ*′(x, ξ).
Theorem 3 tells us that when the likelihood function may be factored as in (1), the mixed test obeys the NNPP. That is to say, if one aims to test a simple hypothesis about the parameter of interest γ in the presence of a non-informative nuisance parameter ξ by means of the mixed test, then he can disregard ξ in the analysis. From a purely mathematical viewpoint, when a point x satisfying (1) is observed, the decision between rejecting and accepting the null hypothesis regarding γ depends on x only through the factor f_x(γ), which is not a function of ξ, as we can see from Equation (A16) in Appendix A. It should be emphasized that Theorem 3 holds for null hypotheses more general than simple ones. For instance, the Theorem is still valid when the null hypothesis H states that γ lies in a hyperplane of Γ. The proof of this result is quite similar to the proof of Theorem 3 in Appendix A and for this reason is omitted.
The adherence to the NNPP is indeed an advantage of the mixed test. It may bring a considerable reduction in the calculations involved in the procedure of the mixed test, especially under statistical models for which a statistic s-ancillary for the parameter of interest can be found. Such cases are examined after Corollary 1, which follows straightforwardly from Theorems 2 and 3.
Corollary 1. Assume the same conditions of Theorem 3 and suppose that there is a statistic T = T(X) such that T is p-sufficient for ξ and s-ancillary for γ. Then, for all x ∈ 𝒳, φ*(x) = φ*′(x, ξ) for every ξ ∈ Ξ.
Now, let us suppose that under experiment E, there is a statistic T = T(X) such that T is s-ancillary for γ. Let H: γ = γ0 be the hypothesis of interest. From the predictive distribution f_H for X, we can define, for each value t of T, the conditional probability function for X given T = t, f_H(· | t), by

f_H(x | t) = f_H(x) / Σ_{y: T(y) = t} f_H(y), (24)

if T(x) = t, and f_H(x | t) = 0, otherwise.
Finally, from the conditional distribution in (24), we define two conditional statistics: the conditional p-value and the conditional adaptive significance level. Such quantities will be of great importance for the performance of the mixed test, as we will see in the next section.
Definition 3. Conditional p-value: Let E be an experiment for which the statistic T = T(X) is s-ancillary for γ. Let H: γ = γ0 be the hypothesis of interest, and f_H(· | t) be as in (24). We define the p-value conditional on T, for each x ∈ 𝒳 with T(x) = t, by

p(x | t) = Σ_{y: T(y) = t, BF(y) ≤ BF(x)} f_H(y | t).

From Equation (A14) in Appendix A, since T is s-ancillary for γ, the conditional p-value may be rewritten in terms of the conditional distribution of X given T = t alone, which does not involve the nuisance parameter ξ.
Definition 4. Conditional adaptive significance level: Let E be an experiment for which the statistic T = T(X) is s-ancillary for γ. Let H: γ = γ0 be the hypothesis of interest and f_H(· | t) be as in (24). We define the conditional adaptive significance level given T, α*(t), for each value t of T, as the probability, under f_H(· | t), that the mixed test rejects H. As with the conditional p-value, α*(t) may be rewritten in terms of the conditional distribution of X given T = t alone.
Definitions 3 and 4 are conditional versions of the quantities in (18) and (19), respectively. While the calculation of the unconditional quantities involves the evaluation of the Bayes factor at every point of the sample space, the determination of the conditional statistics at a specific sample point depends only on the values of the Bayes factor at the sample points x such that T(x) = t, which may be much easier to accomplish. Note also that the conditional p-value and the conditional adaptive significance level can be seen, respectively, as an alternative (conditional) measure of evidence in favor of the null hypothesis H and an alternative threshold value for testing the competing hypotheses. As a matter of fact, one can substitute the p-value and the adaptive significance level with their conditional versions in order to perform the mixed test. This is exactly what the next theorem states.
Theorem 4. Assume the same conditions as in Corollary 1 and Theorem 3. Then, for all x ∈ 𝒳,

φ*(x) = 1 if and only if p(x | T(x)) ≤ α*(T(x)).
The results of Theorems 3 and 4 and Corollary 1 suggest a way in which the mixed test may be used without so many calculations: when a statistic T that is s-ancillary for the parameter of interest is available, one can perform the test by comparing the conditional statistics p(x | T(x)) and α*(T(x)) instead of the unconditional quantities in (18) and (19). This possibility is illustrated in the next section.
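To illustrate the simplification just described, the sketch below performs the mixed test for the two-Poisson problem entirely through conditional quantities in the spirit of Definitions 3 and 4. Since T = X1 + X2 is s-ancillary for ψ = λ1/(λ1 + λ2), given T = t the model reduces to X1 | T = t ~ Binomial(t, ψ). The point null ψ0 = 1/2 (equal Poisson means), the uniform prior on ψ under A and the cut-off b/a = 1 are our own choices; the point is that no Poisson-mean nuisance appears anywhere in the computation.

```python
# Mixed test via conditional p-value and conditional adaptive significance
# level for X1, X2 independent Poisson, H: psi = 1/2 with psi = l1/(l1+l2).
# Given T = X1 + X2 = t, the model is X1 | t ~ Binomial(t, psi): the nuisance
# mu = l1 + l2 drops out entirely (a hedged sketch in the spirit of Theorem 4).
from math import comb
from fractions import Fraction

def conditional_mixed_test(x1, x2, psi0=Fraction(1, 2), k=Fraction(1, 1)):
    """Return (conditional p-value, conditional adaptive level, reject?)."""
    t = x1 + x2
    f_H = lambda y: comb(t, y) * psi0**y * (1 - psi0)**(t - y)  # Binomial(t, psi0)
    f_A = lambda y: Fraction(1, t + 1)       # Beta-Binomial(t, 1, 1): uniform
    BF = lambda y: f_H(y) / f_A(y)
    p_cond = sum(f_H(y) for y in range(t + 1) if BF(y) <= BF(x1))
    alpha_cond = sum(f_H(y) for y in range(t + 1) if BF(y) < k)
    return p_cond, alpha_cond, p_cond <= alpha_cond

# only x1 and t = x1 + x2 matter; the Poisson means never enter the sums
p, alpha, reject = conditional_mixed_test(9, 2)
print("conditional p-value:", float(p), "| adaptive level:", float(alpha),
      "| reject H:", reject)
```

Conditioning on T = t replaces sums over the infinite Poisson sample space with finite sums over {0, 1, …, t}, which is the practical gain promised by Theorem 4.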