On Two Measure-Theoretic Aspects of the Full Bayesian Significance Test for Precise Bayesian Hypothesis Testing †

Riko Kelter

doi:10.3390/psf2021003010

Department of Mathematics, University of Siegen, 57072 Siegen, Germany

^†

Presented at the 40th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, online, 4–9 July 2021.

^‡

Current address: University of Siegen, Department of Mathematics, Walter-Flex-Street 3, 57072 Siegen, Germany.

Phys. Sci. Forum2021, 3(1), 10;https://doi.org/10.3390/psf2021003010

This article belongs to the Proceedings The 40th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering

Version Notes

Order Reprints

Abstract

The Full Bayesian Significance Test (FBST) has been proposed as a convenient method to replace frequentist p-values for testing a precise hypothesis. Although the FBST enjoys various appealing properties, the purpose of this paper is to investigate two aspects of the FBST which are sometimes observed as measure-theoretic inconsistencies of the procedure and have not been discussed rigorously in the literature. First, the FBST uses the posterior density as a reference for judging the Bayesian statistical evidence against a precise hypothesis. However, under absolutely continuous prior distributions, the posterior density is defined only up to Lebesgue null sets which renders the reference criterion arbitrary. Second, the FBST statistical evidence seems to have no valid prior probability. It is shown that the former aspect can be circumvented by fixing a version of the posterior density before using the FBST, and the latter aspect is based on its measure-theoretic premises. An illustrative example demonstrates the two aspects and their solution. Together, the results in this paper show that both of the two aspects which are sometimes observed as measure-theoretic inconsistencies of the FBST are not tenable. The FBST thus provides a measure-theoretically coherent Bayesian alternative for testing a precise hypothesis.

Keywords:

Full Bayesian Significance Test (FBST); statistical hypothesis testing; e-value; p-value

1. Introduction

Statistical hypothesis testing is an important method in a broad range of sciences [1]. However, the recent problems with the validity of research results have been termed a scientific replication crisis [2,3], at the core of which lie some fundamental flaws in the statistical analysis of data [4]. Various papers have discussed the reproducibility of research and often the inadequate use of null hypothesis significance tests (NHST) substantiates a major cause of the replication crisis [5]. This holds in particular in the biomedical and cognitive sciences [6,7], where the p-value is the gold standard for quantifying the evidence against a precise null hypothesis.

Bayesian hypothesis testing has become increasingly popular in the biomedical and cognitive sciences due to the above problems [8,9,10]. It is well known that Bayesian data analysis solves some of the problems of NHST by allowing researchers to make use of optional stopping [11,12] and by simplifying the interpretation of censored data [13]. Together, these aspects are consequence of Bayesian inference being consistent with the likelihood principle [13]. An appealing proposal for a Bayesian test of a precise hypothesis is the Full Bayesian Significance Test (FBST), which has been applied in a wide range of domains [8,14,15,16,17,18]. The FBST advocates the e-value as a Bayesian replacement of the frequentist p-value for quantifying the statistical evidence against a precise hypothesis [19]. The FBST is a fully Bayesian procedure [19], accords with the likelihood principle [15], and enjoys attractive asymptotic properties [20] next to transformation invariance [16]. However, the FBST seems to suffer from two aspects which are studied in detail in this paper. First, the reference criterion in the FBST is only defined up to Lebesgue null sets, which seems to be make the evidential threshold arbitrary. Thus, it seems that the FBST statistical evidence, the e-value, lacks a calibration. Second, the statistical evidence in the FBST seems to have no prior probability, which contradicts common Bayesian reasoning. For other criticisms on the FBST see Ly & Wagenmakers [21] and for a more optimistic perspective Kelter [22]. In this paper it is shown that both aspects can be solved by fixing a version of the posterior distribution for statistical inference, and assigning one of two possible interpretations to the prior probability of the statistical evidence in the FBST. These aspects have not yet been discussed extensively in the literature and present a further justification of the FBST as an attractive replacement of frequentist p-values to remedy the ongoing problems with the replication of scientific results. The plan of the paper is as follows: The next section outlines the theory behind the FBST. After that, the two problematic aspects mentioned above are detailed and illustrated by an example from medical research. The following section elaborates on the problems and provides solutions to them. After that, a conclusion is provided.

2. The Full Bayesian Significance Test

This section outlines the theory behind the FBST. First, the required notation is introduced.

2.1. Notation

In contrast to the frequentist approach, in the Bayesian approach the parameter

θ \in Θ

is modelled as a random variable, and the data

y \in Y

are fixed. Denote by

Θ

the parameter space and

G

as the

σ

-algebra on

Θ

, and let

P_{ϑ}

be the prior probability measure on

G

, leading to the triple

(Θ, G, P_{ϑ})

. The observed sample is modelled by the random variable

Y : Ω \to Y

which takes values in the measurable space

Y

, where

Y

is endowed with a

σ

-algebra

B

. The uncertainty in the data generating mechanism producing a sample

Y (ω) = y

for

ω \in Ω

is modelled via the assumption of a statistical model

P : = {P_{θ} : θ \in Θ}

which is dominated by a

σ

-finite measure

ν

. In practice,

ν

often is the Lebesgue measure

λ

. The latter requirement guarantees the existence of Radon-Nikodým derivatives

d P_{θ} / d λ = f (y | θ)

. Let

(Ω, A, P^{*})

be the product space defined as

Ω : = Θ \times Y

,

A : = G \times B

and

P^{*}

the product measure induced by the selection of

P_{ϑ}

and

P

, where

P_{θ}

must be a measurable function on

B

for every y on

Y

. Thus,

P_{ϑ}

is the marginal distribution of

P^{*}

with respect to the parameter

θ

, and the marginal distribution with respect to Y is the prior predictive

P^{ϑ} (B) : = \int_{Θ} P_{θ} (B) d P_{ϑ}

for any

B \in B

. The parameter, as noted above, is modelled mathematically as a random variable

ϑ : Ω \to Θ

. The resulting operational models from a Bayesian point of view are thus given as

(1): the prior model $(Θ, G, P_{ϑ})$
(2): the statistical model $P$ on $(Y, B)$ , leading to $(Y, B, {P_{θ} : θ \in Θ})$ , and
(3): the posterior model $(Θ, G, {P_{ϑ | Y} : Y \in Y})$

The existence of the posterior distribution

P_{ϑ | Y}

is guaranteed on Polish spaces [23] and inference about

θ

is conducted with respect to the posterior distribution

P_{ϑ | Y}

with density

p (θ | y) : = d P_{ϑ | Y} / d λ

, which exists under the assumption that

P_{ϑ} < < λ

where

< <

denotes absolute-continuity of

P_{ϑ}

with respect to the measure

λ

.

2.2. Theory behind the Full Bayesian Significance Test (FBST)

The Full Bayesian Significance Test (FBST) was originally developed by Pereira and Stern [14] as an alternative to frequentist null hypothesis significance tests based on the p-value. It was created under the assumption that a significance test of a sharp hypothesis had to be conducted, where a sharp hypothesis refers to any submanifold of the parameter space of interest [20]. This includes, in particular, precise hypotheses like

H_{0} : θ = θ_{0}

for

θ_{0} \in Θ

[15]. The FBST assumes a standard parametric statistical model, where

θ \in Θ \subseteq R^{p}

is a (possibly vector-valued) parameter of interest,

f (y | θ)

is the density corresponding to the model distribution

P_{Y | ϑ}

and

p (θ)

is the prior density corresponding to the prior distribution

P_{ϑ}

, where we again assume a dominating measure

ν

to guarantee the existence of Radon-Nikodým densities. A hypothesis H makes the statement that the parameter

θ

lies in the corresponding null set

Θ_{H}

, where for simple (or precise) hypotheses

Θ_{H} : = {θ_{0}}

, where

θ_{0}

is the value specified in

H : θ = θ_{0}

. The Full Bayesian Significance Test (FBST) then defines two quantities: ev

(H)

, which is the e-value supporting (or in favour of) the hypothesis H, and

\bar{ev} (H)

, the e-value against H, also called the Bayesian evidence value against H [14]. First, the posterior surprise function

s (θ)

and its maximum

s^{*}

restricted to the null set

Θ_{H}

are introduced:

Definition 1

(Posterior surprise function). The posterior surprise function

s (θ)

for a reference function

r : Θ \to (T, C)

from Θ to a measurable space

(T, C)

is defined as

\begin{matrix} s (θ) : = \frac{p (θ | y)}{r (θ)} \end{matrix}

(1)

In the definition of the posterior surprise function

s (θ)

, the denominator

r (θ)

serves as a reference density, and often the measurable space

(T, C)

is equal to

(R^{d}, B (R^{d}))

. When the improper flat reference function

r (θ) = 1

is used, the surprise function becomes the posterior density

p (θ | y)

. Otherwise, a weakly informative prior density can be used as a reference function, see Pereira and Stern [16]. Then,

\begin{matrix} s^{*} : = s (θ^{*}) = sup_{θ \in Θ_{H}} s (θ) \end{matrix}

(2)

is defined as the supremum of the surprise function

s (θ)

over the null hypothesis support. For a precise null hypothesis,

s^{*}

is simply

s (θ_{0})

. Next, the tangential set is introduced:

Definition 2

(Tangential set). The tangential set

\bar{T} (ν)

is defined as

\begin{matrix} \bar{T} (ν) : = Θ \ T (ν) \end{matrix}

(3)

where

\begin{matrix} T (ν) : = {θ \in Θ | s (θ) \leq ν} \end{matrix}

(4)

Thus,

T (ν)

includes all parameter values

θ \in Θ

which attain a surprise function value

s (θ)

smaller or equal to the threshold

ν

. The tangential set

\bar{T} (ν)

is then the set complement and includes all parameter values

θ \in Θ

which yield a surprise function value

s (θ)

larger than

ν

. Fixing

ν = s^{*}

yields

\bar{T} (s^{*})

, which is called the tangential set to the hypothesis H. This set

\bar{T} (s^{*})

contains the points

θ

of the parameter space

Θ

with higher surprise (or corroboration relative to the reference function

r (θ)

) than the point

θ_{0}

in the null set

Θ_{H}

. Then, the cumulative surprise function is introduced which is required to compute the e-value in the final step:

Definition 3

(Cumulative surprise function). The map

W : Θ \to [0, 1]

given by

\begin{matrix} W (ν) : = \int_{T (ν)} p (θ | y) d θ \end{matrix}

(5)

is called the complementary cumulative surprise function, and

\begin{matrix} \bar{W} (ν) : = 1 - W (ν) \end{matrix}

(6)

is called the cumulative surprise function.

Thus, the complementary cumulative surprise function

W (ν)

is the integral of the posterior density

p (θ | y)

over the set

T (ν)

, and the cumulative surprise function

\bar{W} (ν)

is simply the integral of the posterior density over the tangential set

\bar{T} (ν)

. The final step towards the e-value is to integrate the posterior density

p (θ | y)

over this set:

Definition 4

(e-value). The e-value against a sharp null hypothesis

H_{0} : θ = θ_{0}

is defined as

\begin{matrix} \bar{e v} (H_{0}) : = \bar{W} (s^{*}) \end{matrix}

(7)

and can be interpreted as the Bayesian evidence against

H_{0}

.

Clearly,

\bar{ev} (H_{0}) : = \bar{W} (s^{*})

is the integral of the density

p (θ | y)

over the tangential set

T (s^{*})

, which can be interpreted as the integral of the posterior density

p (θ | y)

over all parameter values

θ

which fulfill the condition

s (θ) \geq s^{*}

. The e-value ev

(H_{0})

supporting H is obtained as ev

(H) : = 1 - \bar{ev} (H_{0})

under

r (θ) : = 1

. Large values of

\bar{ev} (H_{0})

thus indicate that the hypothesis H traverses low-density regions (or equivalently, that the alternative hypothesis traverses high-density regions) so that the evidence against

H_{0}

is large. For

r (θ) \neq 1

the argument is identical as

H_{0}

traverses low posterior-surprise regions then.

For theoretical properties of the FBST and the e-value see Pereira and Stern [16] and Kelter [18]. The FBST then uses ev

(H)

to reject H if ev

(H)

is sufficiently small (or when

\bar{ev} (H)

is large) [14,15].

3. On Two Aspects of the FBST

Now, this section demonstrates the two aspects briefly mentioned in the introduction based on an illustrative example.

3.1. The Reference Criterion

To illustrate the first problem, data of Rosenman et al. [24] of the Western Collaborative Group Study about coronary heart disease is used.

Example 1

(Coronary heart disease data). The Western Collaborative Group Study began in 1960 with 3524 male volunteers who were 39 to 59 years old and free of heart disease as determined by electrocardiogram. After the initial screening, the study population dropped to 3154 because of various exclusions. Multiple endpoints were studied and average follow-up continued for

8.5

years with repeat examinations. As an illustrative example, suppose interest lies in testing for differences in systolic blood pressure between light smokers and heavy smokers. Thus, we test the hypothesis

H_{0} : δ = 0

against the alternative

H_{1} : δ \neq 0

where we classify participants with more than 5 cigarettes per day as heavy smokers. A Bayesian two-sample t-test using the model of Rouder et al. [25] is conducted, and the left plot in Figure 1 shows the results of the FBST using a flat reference function

r (δ) : = 1

. The model is parameterized in the effect size δ of Cohen [26], and the e-value

\bar{e v} (H_{0})

is given as

\bar{e v} (H_{0}) = 0.4362

, which equals the posterior probability mass visualized as the blue area in the left plot of Figure 1. Thus, 43.62% of the posterior probability indicate evidence against the null hypothesis, and the situation is inconclusive. The right plot in Figure 1 shows the result of the FBST when replacing the flat reference function

r (δ) : = 1

with a Cauchy

C (0, \sqrt{2})

density (note the different scaling on the y-axis), which is also used as the prior on δ in the two-sample t-test. In this case, the e-value

\bar{e v} (H_{0}) = 0.4367

indicates a similarly inconclusive situation and changes the result barely.

Figure 1. Results of the Full Bayesian Significance Test using a flat reference function (left) and a

C (0, \sqrt{2})

Cauchy density as reference function (right) for testing the hypothesis of no difference

H_{0} : δ = 0

in terms of systolic blood pressure between smokers and non-smokers.

Now, the above example shows that calculation of the e-value is straightforward and universally applicable. However, the parameter space

Θ

is continuous in the example (the effect size

δ \in R

is a continuous quantity) and any usual prior distribution

P_{ϑ}

assigned to

θ

is absolutely continuous with respect to the Lebesgue measure

λ

. It is well-known that the posterior distribution

P_{ϑ | Y}

is absolutely continuous with respect to the prior distribution [27], and thus any

P_{ϑ}

-null-set

N \subset Θ

with

P_{ϑ} (N) = 0

is also a

P_{ϑ | Y}

-null-set with

P_{ϑ | Y} (N) = 0

. Problematically, the set

Θ_{0} : = {δ_{0}} = {0}

which is used in the precise null hypothesis

H_{0} : δ = 0

is a

P_{ϑ}

-null-set under both the improper flat and Cauchy prior, as both of these are absolutely continuous with respect to the Lebesgue measure

λ

, and submanifolds are Lebesgue-null-sets [28]. Thus,

λ ({δ_{0}} = λ ({0}) = 0

implies

P_{ϑ} ({0}) = 0

due to

P_{ϑ} < < λ

, which implies in turn that the posterior probability

P_{ϑ | Y} ({0})

of the value

δ_{0} = 0

is a

P_{ϑ | Y}

-null-set due to

P_{ϑ | Y} < < P_{ϑ}

. As a consequence, the value of the posterior density

p (0 | y) = 9.4693

which is shown as the blue point in the left plot of Figure 1 could be chosen arbitrarily. Problematically, this value is used as the reference criterion in the calculation of the e-value

\bar{ev} (H_{0})

in the computation of the tangential set

\bar{T} (ν)

. Thus, one could assign

p (0 | y)

an entirely different value, say,

c \in R

, and obtain a different e-value

\bar{ev} (H_{0})

than the one calculated from the value

p (0 | y) = 9.4693

. This seems to render the calculation of the statistical evidence

\bar{ev} (H_{0})

in the FBST arbitrary, questioning the use of the procedure.

3.2. Prior Probability of the e-Value

The second issue with the FBST may be phrased as the e-value having no valid prior probability. In fact, the e-value in Equation (7) is based on the cumulative surprise function

W (s^{*})

, which itself depends on the tangential set

T (s^{*})

and the posterior density

p (θ | y)

. Before data

y \in Y

are observed, the posterior

P_{ϑ | Y}

has not been realized as

P_{ϑ | Y = y}

and thus there exists no prior probability

P_{ϑ}

which is associated with the e-value. Even the tangential set

\bar{T} (s^{*}) : = {θ \in Θ | s (θ) > s^{*}}

which is a subset of

Θ

seems to have no prior probability, because it depends on the surprise function

s (θ)

which itself depends on the posterior density

p (θ | y)

, compare Equation (1). Thus, the statistical evidence in the FBST seems to escape the natural Bayesian transition from prior to posterior probability.

4. Solutions to the Two Aspects

4.1. The Reference Criterion

If the above criticism that the reference criterion in the FBST is arbitrary would hold, the procedure would be of little use in practice. However, the solution to the problem is given by fixing a specific version of the posterior distribution and performing all calculations conditional on fixing such a version. It is well known that probability distributions (which are probability measures corresponding to a random variable) are defined up to Lebesgue-null-sets (when they are dominated by the Lebesgue measure). The values on null-sets do not influence these probability measures and therefore they are identified with each other whenever they only differ on Lebesgue-null-sets [28]. Technically, this corresponds to the shift from the vector space

L^{p}

\begin{matrix} L^{p} (Ω, A, μ) : = \{f : Ω \to K | f is measurable, \int_{Ω} {| f (x) |}^{p} d μ (x) < \infty\} \end{matrix}

(8)

on a probability space

(Ω, A, μ)

,

K \in {R, C}

for

0 < p < \infty

to the quotient space

L^{p}

, see Bauer [28]. The latter space is defined as

L^{p} : = L^{p} / N

, where

\begin{matrix} N : = \{f \in L^{p} | f = 0 μ - almost - everywhere\} \end{matrix}

(9)

and the elements in

L^{p}

are equivalence classes. Thus, two elements

[f], [g] \in L^{p}

are equal if and only if they differ only on

μ

-null-sets, that is,

[f] - [g] \in N

. Thus, the arbitrariness of the reference criterion in the FBST exists only unless a specific representant of the equivalence class, in which the posterior density

p (θ | y)

is located, is selected. In the context of Example 1, this implies that a specific version of the posterior density

p (δ | y)

needs to be chosen, which fixes the densities value on

δ_{0} = 0

(and the other values

δ \in Θ

). Thus, setting

p (δ_{0} | y) : = p (0 | y) : = 9.4693

explicitly by definition fixes one representant of the equivalence class of

P_{ϑ | Y}

and bypasses the problem that the reference threshold

p (δ_{0} | y)

in the FBST is arbitrary. Whenever the posterior is obtainable as a closed-form solution, that is, follows a well-known probability density

{\tilde{P}}_{ϑ | Y}

with Lebesgue-density

\tilde{p} (θ | y)

, setting

p (θ | y) : = \tilde{p} (θ | y)

as the value of this known probability density

\tilde{p}

for the posterior density p in the FBST by definition solves the first problem. Whenever numerical techniques like Markov-Chain-Monte-Carlo (MCMC) are used to produce the posterior, the resulting posterior distribution

P_{ϑ | Y}^{M C M C}

and the posterior density

p^{M C M C} (θ | y)

approximate the true posterior distribution

P_{ϑ | Y}

and the posterior Lebesgue-density

p (θ | y)

. Thus, setting

p (θ | y) : = p^{M C M C} (θ | y)

by definition for a fixed numerical technique like MCMC with given random number generator seed fixes a version of the posterior density and renders the reference threshold in the FBST unique. In Example 1 this equals the choice of

p (δ_{0} | y) : = 9.4693

by definition (as MCMC sampling was used), and

p (δ | y) : = p^{M C M C} (δ | y)

for all

δ \in R

. In summary, the above considerations provide the following result:

Theorem 1.

Let

s^{*} : = s (θ^{*}) = sup_{θ \in Θ_{H}} s (θ)

be the supremum of the surprise function in the Full Bayesian Significance Test, and

L^{p}

and

L^{p}

the corresponding vector spaces on

(Θ, G, P_{ϑ | Y})

with quotient space

L^{p} / N

for

N : = {f \in L^{p} | f = 0 μ - a l m o s t - e v e r y w h e r e}

. Whenever

P_{ϑ | Y}

is a known probability distribution

{\tilde{P}}_{ϑ | Y}

with Lebesgue-density

\tilde{p} (ϑ | Y)

, defining

p (θ | y) : = \tilde{p} (θ | y)

pointwise for all

θ \in Θ

renders the e-value

\bar{e v} (H_{0})

against

H_{0} : θ = θ_{0}

for

θ_{0} \in Θ

well-defined and unique for the choice of

p (θ | y)

.

Proof.

See Appendix A. □

Note that when using numerical methods such as MCMC, ergodic theory ensures that

P_{ϑ | Y}^{M C M C} \to P_{ϑ | Y}

in distribution and

p_{ϑ | Y}^{M C M C} \to p_{ϑ | Y}

, that is, the MCMC posterior density approximates the posterior Lebesgue-density pointwise with increasing precision for increasing number of MCMC samples [29]. Thus, fixing a version of the posterior, Theorem 1 extends also to situations where numerical techniques such as MCMC are required.

4.2. Prior Probability of the e-Value

The solution to the second problem is more involved and less technical. Conceptually, from the above line of thought it is immediate that under absolutely continuous priors

P_{ϑ}

with respect to the Lebesgue measure

λ

, the prior probability

P_{ϑ} (Θ_{0})

will be zero for any precise null hypothesis

H_{0} : = Θ_{0}

with

Θ_{0} : = {θ_{0}}

for

θ_{0} \in Θ

. The posterior

P_{ϑ | Y}

is absolutely continuous with respect to the prior

P_{ϑ}

, so

P_{ϑ | Y} (Θ_{0}) = 0

. Thus, it is simply not possible to use a natural Bayesian workflow which assigns positive probability mass to a Lebesgue-null-set

Θ_{0}

whenever the statistician uses an absolutely continuous prior distribution

P_{ϑ}

with respect to

λ

. Traditional Bayesian hypothesis testing and model selection bypasses this inconvenience by introducing an arbitrary mixture prior structure

P_{ϑ} : = ϱ 1_{Θ_{0}} + (1 - ϱ) {\tilde{P}}_{ϑ}

which assigns positive probability mass

ϱ > 0

to the null set

Θ_{0}

, and distributes the rest of the probability mass

(1 - ϱ) \in [0, 1]

by means of a probability distribution

{\tilde{P}}_{ϑ}

on the alternative hypothesis space

Θ_{1} = Θ \ Θ_{0}

. Early proposals of such a mixture prior structure include Jeffreys [30] and Haldane [31], see also Robert [29] and Kleijn [23]. Such a prior allows computation of a Bayes factor, and furthermore, the Bayes factor itself also has no prior probability which is naturally associated with it. Importantly, this mixture prior structure imposes a dichotomy between hypothesis testing and parameter estimation, because such a mixture prior structure is reasonable only from a hypothesis testing perspective. Whenever parameter estimation is the goal, the assignment of probability mass

ϱ > 0

to a specific value is highly questionable and often contradicts reasonable a priori beliefs. In these cases, prior beliefs are expressed better through a prior which is absolutely continuous with respect to the Lebesgue measure

λ

.

The FBST avoids the introduction of such a mixture structure and thus allows for a unified prior elicitation which is coherent both from a Bayesian hypothesis testing and Bayesian parameter estimation stance. Importantly, the e-value is intended to be a Bayesian replacement of the frequentist p-value which measures the statistical discrepancy between the observed data to an assumed precise hypothesis. Thus, the e-value provides the Bayesian evidence against such a precise hypothesis. From a measure-theoretic point of view, every precise null hypothesis is assumed to be false and the FBST thus aligns with the empirical rationalism of Popper [32]. For the use of testing a precise hypothesis as an approximation of a small interval hypothesis see Berger [33], Rousseau [34], Rao & Lovric [35] as well as Kelter [36]: Often, the approximation of a small interval hypothesis via a precise point null hypothesis will be bad, and thus the e-value does not assign positive probability mass to such a precise null hypothesis. Instead, the FBST quantifies the discrepancy between the observed data and the hypothetical precise null value, while simultaneously implementing I.J. Good’s principle of least surprise [37,38,39]. Note further that the mathematical introduction of positive prior probability

ϱ > 0

to a precise value

θ_{0} \in Θ

when using a mixture prior does not render such a precise hypothesis

H_{0} : θ = θ_{0}

more realistic in practice.

Furthermore, next to its measure-theoretic premises, there exists another argument which weakens the criticism that there is no prior probability of the e-value: When a prior distribution

P_{ϑ}

is selected and no data

y \in Y

has been observed, the posterior distribution can be identified conceptually as the prior distribution. Thus, replacing the posterior density

p (θ | y)

with the

λ

-density

p (θ)

of the prior

P_{ϑ}

yields

s (θ) : = \frac{p (θ)}{r (θ)}

, which implies that the tangential set

\bar{T} (ν) : = Θ \ T (ν)

for

T (ν) : = {θ \in Θ | s (θ) \leq ν}

includes those parameter values

θ \in Θ

for which

p (θ) / r (θ) > ν

. Using the fact that

s^{*} = p (θ_{0}) / r (θ_{0})

for a precise hypothesis

H_{0} : θ = θ_{0}

then, yields

\bar{T} (ν) = {θ \in Θ | p (θ) / r (θ) > p (θ_{0}) / r (θ_{0})}

. Plugging this tangential set into Equation (6) yields the e-value

\begin{matrix} \bar{ev} (H_{0}) : = \bar{W} (s^{*}) = \int_{\bar{T} (s^{*})} p (θ) d θ \end{matrix}

which is the integral of the prior density

p (θ)

over

\bar{T} (s^{*})

. When the reference function

r (θ)

is chosen as a flat improper prior

r (θ) : = 1

, this becomes

\begin{matrix} \bar{ev} (H_{0}) = \int_{{θ \in Θ | p (θ) > p (θ_{0})}} p (θ) d θ \end{matrix}

which is the integral of the prior density

p (θ)

over all values which attain higher prior density values than the null value

θ_{0}

in

H_{0} : θ = θ_{0}

. Thus, the e-value in such a case quantifies the discrepancy of the precise hypothesis

H_{0} : θ = θ_{0}

with the prior beliefs

P_{ϑ}

. The above line of thought provide the following result:

Theorem 2.

Let

r (θ) : = 1

. In case no data

y \in Y

has been observed, the e-value quantifies the discrepancy between the precise hypothesis

H_{0} : = Θ_{0}

for

Θ_{0} : = {θ_{0}}

and

θ_{0} \in Θ

and the prior distribution

P_{ϑ}

, that is,

\begin{matrix} \bar{e v} (H_{0}) = P_{ϑ} ({θ \in Θ | p (θ) > p (θ_{0})}) \end{matrix}

(10)

Proof.

See Appendix A. □

Whenever

r (θ) \neq 1

, the interpretation is more complicated because such a reference function incorporates a surprise element into the tangential set, but the conclusions remain the same. The e-value then quantifies the discrepancy between the precise hypothesis and the prior surprise.

5. Discussion

The Full Bayesian Significance Test (FBST) has been proposed as a convenient method to replace frequentist p-values for testing a precise hypothesis [14,15,16]. Although the FBST enjoys various appealing properties [8,19,20,40], two aspects of the FBST are sometimes observed as measure-theoretic inconsistencies of the procedure and have not been discussed rigorously in the literature. First, the FBST uses the posterior density as a reference for judging the Bayesian statistical evidence against a precise hypothesis. However, under absolutely continuous prior distributions, the posterior density is defined only up to Lebesgue null sets which renders the reference criterion arbitrary. Second, the FBST statistical evidence seems to have no valid prior probability. In this paper, it was shown that the former problem can be circumvented by fixing a version of the posterior density before using the FBST. Theorem 1 demonstrated that then, the e-value is well-defined and unique after observing the data

y \in Y

.

The latter aspect is based on the measure-theoretic premises of the FBST. As shown in this paper, the FBST avoids the use of a mixture prior structure which imposes a dichotomy between Bayesian hypothesis testing and parameter estimation. Thus, the FBST is compatible with absolutely continuous priors with respect to the Lebesgue measure

λ

(the Bayes factor, for example, is not). As a consequence, there exists no prior probability of the e-value and a precise hypothesis

H_{0} : θ = θ_{0}

under an absolutely continuous prior

P_{ϑ}

. Theorem 2 showed that even then, the e-value has a proper interpretation from a prior perspective: It quantifies the a priori discrepancy of the hypothesis

H_{0}

with the prior beliefs which are expressed by

P_{ϑ}

whenever the reference function

r (θ)

is flat. When

r (θ) \neq 1

, the interpretation is more difficult but the conclusion remains the same.

Together, the results in this paper show that both of the two aspects which are sometimes observed as measure-theoretic inconsistencies of the FBST are not tenable. The FBST thus provides a measure-theoretically coherent Bayesian alternative for testing a precise hypothesis.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The R code to recreate all analyses and plots can be found at the Open Science Foundation at https://osf.io/25vsw/?view_only=e1e243c1e2a44646969fb75cc4c34d57.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

FBST	Full Bayesian Significance Test
NHST	Null Hypothesis Significance Testing

Appendix A

Proof of Theorem 1.

From Definition 1 and Equation (2) it follows that the tangential set

\bar{T} (ν) : = Θ \ T (ν)

becomes

\bar{T} (s^{*}) : = Θ \ T (s^{*})

, which equals the set

\begin{matrix} {θ \in Θ : s (θ) > s (θ^{*})} & = \{θ \in Θ : \frac{p (θ | y)}{r (θ)} > \frac{p (θ^{*} | y)}{r (θ^{*})}\} \\ = \{θ \in Θ : \frac{p (θ | y)}{r (θ)} > \frac{p (θ_{0} | y)}{r (θ_{0})}\} \end{matrix}

(A1)

where the first equality uses Definition 2 and the second equality uses

θ^{*} = θ_{0}

for a precise hypothesis

H_{0} : θ = θ_{0}

for

θ_{0} \in Θ

. By assumption, the posterior distribution

P_{ϑ | Y}

is known to take the form

{\tilde{P}}_{ϑ | Y}

with Lebesgue-density

\tilde{p} (θ | y)

. Defining the posterior density

p : Θ \to R^{d}

pointwise as

p (θ | y) : = \tilde{p} (θ | y)

implies that the value

p (θ_{0} | y)

is equal to

\tilde{p} (θ_{0} | y)

. Thus, the tangential set

\bar{T} (s^{*})

in Equation (A1) is well-defined and unique for this fixed value

p (θ_{0} | y) : = \tilde{p} (θ_{0} | y)

. From Definition 3 and Equation (7) it follows that the e-value

\bar{ev} (H_{0})

is well-defined and unique for the choice of

p (θ | y)

. □

Proof of Theorem 2.

Let

P_{ϑ}

be the prior distribution and

r (θ) : = 1

. Suppose no data

y \in Y

has been observed, then the posterior distribution

P_{ϑ | Y}

can be identified as the prior distribution

P_{ϑ}

. Thus, replacing the posterior density

p (θ | y)

with the

λ

-density

p (θ)

of the prior

P_{ϑ}

yields

s (θ) : = \frac{p (θ)}{r (θ)}

, which implies that the tangential set

\bar{T} (ν) : = Θ \ T (ν)

for

T (ν) : = {θ \in Θ | s (θ) \leq ν}

includes the parameter values

θ \in Θ

which fulfill the condition

p (θ) / r (θ) > ν

. It follows that

s^{*} = p (θ_{0}) / r (θ_{0})

for a precise hypothesis

H_{0} : θ = θ_{0}

, and this yields

\bar{T} (ν) = {θ \in Θ | p (θ) / r (θ) > p (θ_{0}) / r (θ_{0})}

for the tangential set to

H_{0}

. Using the latter in Equation (6) yields the e-value

\begin{matrix} \bar{ev} (H_{0}) : = \bar{W} (s^{*}) = \int_{\bar{T} (s^{*})} p (θ) d θ \end{matrix}

which is the integral of the prior density

p (θ)

over

\bar{T} (s^{*})

. By assumption,

r (θ) : = 1

, so this becomes

\begin{matrix} \bar{ev} (H_{0}) = \int_{{θ \in Θ | p (θ) > p (θ_{0})}} p (θ) d θ = P_{ϑ} ({θ \in Θ | p (θ) > p (θ_{0})}) \end{matrix}

which is the statement in Equation (10). □

References

Gigerenzer, G. Mindless statistics. J.-Socio-Econ. 2004, 33, 587–606. [Google Scholar] [CrossRef]
Pashler, H.; Harris, C.R. Is the Replicability Crisis Overblown? Three Arguments Examined. Perspect. Psychol. Sci. 2012, 7, 531–536. [Google Scholar] [CrossRef]
Baker, M.; Penny, D. Is there a reproducibility crisis? Nature 2016, 533, 452–454. [Google Scholar] [CrossRef] [PubMed] [Green Version]
McElreath, R.; Smaldino, P.E. Replication, communication, and the population dynamics of scientific discovery. PLoS ONE 2015, 10, 1–16. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Ioannidis, J.P.A. What Have We (Not) Learnt from Millions of Scientific Papers with p-Values? Am. Stat. 2019, 73, 20–25. [Google Scholar] [CrossRef] [Green Version]
Button, K.S.; Ioannidis, J.P.; Mokrysz, C.; Nosek, B.A.; Flint, J.; Robinson, E.S.; Munafò, M.R. Power failure: Why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 2013, 14, 365–376. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Kelter, R. Bayesian alternatives to null hypothesis significance testing in biomedical research: A non-technical introduction to Bayesian inference with JASP. BMC Med. Res. Methodol. 2020, 20, 1–12. [Google Scholar] [CrossRef]
Kelter, R. Analysis of Bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research. BMC Med. Res. Methodol. 2020, 20, 1–18. [Google Scholar] [CrossRef] [Green Version]
Kelter, R. Bayesian survival analysis in STAN for improved measuring of uncertainty in parameter estimates. Meas. Interdiscip. Res. Perspect. 2020, 18, 101–119. [Google Scholar] [CrossRef]
Wagenmakers, E.J.; Morey, R.D.; Lee, M.D. Bayesian Benefits for the Pragmatic Researcher. Curr. Dir. Psychol. Sci. 2016, 25, 169–176. [Google Scholar] [CrossRef]
Edwards, W.; Lindman, H.; Savage, L.J. Bayesian statistical inference for psychological research. Psychol. Rev. 1963, 70, 193–242. [Google Scholar] [CrossRef]
Hendriksen, A.; de Heide, R.; Grünwald, P. Optional Stopping with Bayes Factors: A Categorization and Extension of Folklore Results, with an Application to Invariant Situations. Bayesian Anal. 2020, in press. [Google Scholar] [CrossRef]
Berger, J.; Wolpert, R.L. The Likelihood Principle; Institute of Mathematical Statistics: Hayward, CA, USA, 1988. [Google Scholar]
Pereira, C.A.d.B.; Stern, J.M. Evidence and credibility: Full Bayesian significance test for precise hypotheses. Entropy 1999, 1, 99–110. [Google Scholar] [CrossRef]
Pereira, C.A.d.B.; Stern, J.M.; Wechsler, S. Can a Significance Test be genuinely Bayesian? Bayesian Anal. 2008, 3, 79–100. [Google Scholar] [CrossRef]
Pereira, C.A.d.B.; Stern, J.M. The e-value: A fully Bayesian significance measure for precise statistical hypotheses and its research program. São Paulo J. Math. Sci. 2020, 1–19. [Google Scholar] [CrossRef]
Kelter, R. Simulation data for the analysis of Bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research. BMC Res. Notes 2020, 13, 1–3. [Google Scholar] [CrossRef]
Kelter, R. fbst: An R package for the Full Bayesian Significance Test for testing a sharp null hypothesis against its alternative via the e-value. Behav. Res. Methods 2021, in press. [Google Scholar] [CrossRef]
Madruga, M.R.; Esteves, L.G.; Wechsler, S. On the Bayesianity of Pereira-Stern tests. Test 2001, 10, 291–299. [Google Scholar] [CrossRef]
Diniz, M.; Pereira, C.A.B.; Polpo, A.; Stern, J.M.; Wechsler, S. Relationship between Bayesian and frequentist significance indices. Int. J. Uncertain. Quantif. 2012, 2, 161–172. [Google Scholar] [CrossRef]
Ly, A.; Wagenmakers, E.J. A Critical Evaluation of the FBST ev for Bayesian Hypothesis Testing. Comput. Brain Behav. 2021, 1–8. [Google Scholar] [CrossRef]
Kelter, R. On the Measure-Theoretic Premises of Bayes Factor and Full Bayesian Significance Tests: A Critical Reevaluation. Comput. Brain Behav. 2021, 1–11. [Google Scholar] [CrossRef]
Kleijn, B. The Frequentist Theory of Bayesian Statistics; Springer: Amsterdam, The Netherlands, 2020. [Google Scholar]
Rosenman, R.H.; Brand, R.J.; Jenkins, D.; Friedman, M.; Straus, R.; Wurm, M. Coronary heart disease in Western Collaborative Group Study. Final follow-up experience of 8 1/2 years. JAMA 1975, 233, 872–877. [Google Scholar] [CrossRef]
Rouder, J.N.; Speckman, P.L.; Sun, D.; Morey, R.D.; Iverson, G. Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 2009, 16, 225–237. [Google Scholar] [CrossRef]
Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Routledge: Hillsdale, NJ, USA, 1988. [Google Scholar]
Schervish, M.J. Theory of Statistics; Springer Verlag: New York, NY, USA, 1995. [Google Scholar]
Bauer, H. Measure and Integration Theory; De Gruyter: Berlin, Germany, 2001. [Google Scholar]
Robert, C.P. The Bayesian Choice, 2nd ed.; Springer New York: Paris, France, 2007. [Google Scholar] [CrossRef]
Jeffreys, H. Theory of Probability, 1st ed.; The Clarendon Press: Oxford, UK, 1939. [Google Scholar]
Haldane, J.B.S. A note on inverse probability. Math. Proc. Camb. Philos. Soc. 1932, 28, 55–61. [Google Scholar] [CrossRef]
Popper, K. The Logic of Scientific Discovery; Routledge: London, UK; New York, NY, USA, 1959. [Google Scholar] [CrossRef]
Berger, J. Statistical Decision Theory and Bayesian Analysis; Springer: New York, NY, USA, 1985. [Google Scholar]
Rousseau, J. Approximating Interval hypothesis: P-values and Bayes factors. In Bayesian Statistics; Bernado, J., Berger, J., Dawid, A., Smith, A., Eds.; Oxford University Press: Valencia, Spain, 2007; Volume 8, pp. 417–452. [Google Scholar]
Rao, C.R.; Lovric, M.M. Testing point null hypothesis of a normal mean and the truth: 21st Century perspective. J. Mod. Appl. Stat. Methods 2016, 15, 2–21. [Google Scholar] [CrossRef]
Kelter, R. Bayesian and frequentist testing for differences between two groups with parametric and nonparametric two-sample tests. Wires Comput. Stat. 2020, 13, e1523. [Google Scholar] [CrossRef]
Good, I. Surprise index. In Encyclopedia of Statistical Sciences; Kotz, S., Johnson, N., Reid, C., Eds.; John Wiley & Sons: New York, NY, USA, 1988; Volume 7. [Google Scholar]
Good, I. C332. Surprise indexes and p-values. J. Stat. Comput. Simul. 1989, 32, 90–92. [Google Scholar] [CrossRef]
Good, I.J. C420. The existence of sharp null hypotheses. J. Stat. Comput. Simul. 1994, 49, 241–242. [Google Scholar] [CrossRef]
Stern, J.M. Significance tests, Belief Calculi, and Burden of Proof in legal and Scientific Discourse. Front. Artif. Intell. Its Appl. 2003, 101, 139–147. [Google Scholar]

Figure 1. Results of the Full Bayesian Significance Test using a flat reference function (left) and a

C (0, \sqrt{2})

Cauchy density as reference function (right) for testing the hypothesis of no difference

H_{0} : δ = 0

in terms of systolic blood pressure between smokers and non-smokers.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

On Two Measure-Theoretic Aspects of the Full Bayesian Significance Test for Precise Bayesian Hypothesis Testing ^†

Abstract

1. Introduction

2. The Full Bayesian Significance Test

2.1. Notation

2.2. Theory behind the Full Bayesian Significance Test (FBST)

3. On Two Aspects of the FBST

3.1. The Reference Criterion

3.2. Prior Probability of the e-Value

4. Solutions to the Two Aspects

4.1. The Reference Criterion

4.2. Prior Probability of the e-Value

5. Discussion

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A

References

Article Metrics

Citations

Article Access Statistics