On Two Measure-Theoretic Aspects of the Full Bayesian Significance Test for Precise Bayesian Hypothesis Testing

Abstract: The Full Bayesian Significance Test (FBST) has been proposed as a convenient method to replace frequentist p-values for testing a precise hypothesis. The FBST enjoys various appealing properties; the purpose of this paper is to investigate two aspects of the FBST which are sometimes regarded as measure-theoretic inconsistencies of the procedure and have not been discussed rigorously in the literature. First, the FBST uses the posterior density as a reference for judging the Bayesian statistical evidence against a precise hypothesis. However, under absolutely continuous prior distributions, the posterior density is defined only up to Lebesgue null sets, which seems to render the reference criterion arbitrary. Second, the FBST statistical evidence seems to have no valid prior probability. It is shown that the former aspect can be circumvented by fixing a version of the posterior density before using the FBST, and that the latter aspect is a consequence of the measure-theoretic premises of the FBST. An illustrative example demonstrates the two aspects and their solution. Together, the results in this paper show that neither of the two aspects which are sometimes regarded as measure-theoretic inconsistencies of the FBST is a tenable objection. The FBST thus provides a measure-theoretically coherent Bayesian alternative for testing a precise hypothesis.


Introduction
Statistical hypothesis testing is an important method in a broad range of sciences [1]. However, recent problems with the validity of research results have been termed a scientific replication crisis [2,3], at the core of which lie some fundamental flaws in the statistical analysis of data [4]. Various papers have discussed the reproducibility of research, and the inadequate use of null hypothesis significance tests (NHST) is often identified as a major cause of the replication crisis [5]. This holds in particular in the biomedical and cognitive sciences [6,7], where the p-value is the gold standard for quantifying the evidence against a precise null hypothesis.
Bayesian hypothesis testing has become increasingly popular in the biomedical and cognitive sciences due to the above problems [8–10]. It is well known that Bayesian data analysis solves some of the problems of NHST by allowing researchers to make use of optional stopping [11,12] and by simplifying the interpretation of censored data [13]. Together, these aspects are a consequence of Bayesian inference being consistent with the likelihood principle [13]. An appealing proposal for a Bayesian test of a precise hypothesis is the Full Bayesian Significance Test (FBST), which has been applied in a wide range of domains [8,14–18]. The FBST advocates the e-value as a Bayesian replacement of the frequentist p-value for quantifying the statistical evidence against a precise hypothesis [19]. The FBST is a fully Bayesian procedure [19], accords with the likelihood principle [15], and enjoys attractive asymptotic properties [20] as well as transformation invariance [16]. However, the FBST seems to suffer from two problematic aspects which are studied in detail in this paper. First, the reference criterion in the FBST is only defined up to Lebesgue null sets, which seems to make the evidential threshold arbitrary. Thus, it seems that the FBST statistical evidence, the e-value, lacks a calibration. Second, the statistical evidence in the FBST seems to have no prior probability, which contradicts common Bayesian reasoning. For other criticisms of the FBST see Ly & Wagenmakers [21], and for a more optimistic perspective see Kelter [22]. In this paper it is shown that both aspects can be resolved by fixing a version of the posterior density for statistical inference and by assigning one of two possible interpretations to the prior probability of the statistical evidence in the FBST.
These aspects have not yet been discussed extensively in the literature and present a further justification of the FBST as an attractive replacement of frequentist p-values to remedy the ongoing problems with the replication of scientific results. The plan of the paper is as follows: The next section outlines the theory behind the FBST. After that, the two problematic aspects mentioned above are detailed and illustrated by an example from medical research. The following section elaborates on the problems and provides solutions to them. After that, a conclusion is provided.

The Full Bayesian Significance Test
This section outlines the theory behind the FBST. First, the required notation is introduced.

Notation
In contrast to the frequentist approach, in the Bayesian approach the parameter $\theta \in \Theta$ is modelled as a random variable, and the data $y \in \mathcal{Y}$ are fixed. Denote by $\Theta$ the parameter space and by $\mathcal{G}$ a $\sigma$-algebra on $\Theta$, and let $P_\vartheta$ be the prior probability measure on $\mathcal{G}$, leading to the triple $(\Theta, \mathcal{G}, P_\vartheta)$. The observed sample is modelled by the random variable $Y : \Omega \to \mathcal{Y}$, which takes values in the measurable space $(\mathcal{Y}, \mathcal{B})$, where $\mathcal{B}$ is a $\sigma$-algebra on $\mathcal{Y}$. The uncertainty in the data generating mechanism producing a sample $Y(\omega) = y$ for $\omega \in \Omega$ is modelled via the assumption of a statistical model $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$ which is dominated by a $\sigma$-finite measure $\nu$. In practice, $\nu$ often is the Lebesgue measure $\lambda$. The domination requirement guarantees the existence of Radon-Nikodým derivatives $\mathrm{d}P_\theta/\mathrm{d}\lambda = f(y|\theta)$. Let $(\Omega, \mathcal{A}, P^*)$ be the product space defined by $\Omega := \Theta \times \mathcal{Y}$, $\mathcal{A} := \mathcal{G} \otimes \mathcal{B}$ and $P^*$ the joint measure induced by the selection of $P_\vartheta$ and $\mathcal{P}$, where $\theta \mapsto P_\theta(B)$ must be a measurable function for every $B \in \mathcal{B}$. Thus, $P_\vartheta$ is the marginal distribution of $P^*$ with respect to the parameter $\theta$, and the marginal distribution with respect to $Y$ is the prior predictive distribution $B \mapsto \int_\Theta P_\theta(B)\,\mathrm{d}P_\vartheta(\theta)$ for $B \in \mathcal{B}$. The parameter, as noted above, is modelled mathematically as a random variable $\vartheta : \Omega \to \Theta$. The resulting operational models from a Bayesian point of view include the posterior model $(\Theta, \mathcal{G}, \{P_{\vartheta|Y=y} : y \in \mathcal{Y}\})$.
The existence of the posterior distribution $P_{\vartheta|Y}$ is guaranteed on Polish spaces [23], and inference about $\theta$ is conducted with respect to $P_{\vartheta|Y}$ with density $p(\theta|y) := \mathrm{d}P_{\vartheta|Y}/\mathrm{d}\lambda$, which exists under the assumption that $P_\vartheta \ll \lambda$, where $\ll$ denotes absolute continuity of $P_\vartheta$ with respect to the measure $\lambda$.
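For orientation, the posterior density can also be written via Bayes' theorem in terms of the model density $f(y|\theta)$ and the density of the prior $P_\vartheta$ with respect to $\lambda$, written $p(\theta)$ here. The display below is the standard textbook form and is not taken from the original text; it assumes that the prior is absolutely continuous with respect to $\lambda$ and that the normalizing integral is finite and positive.

```latex
p(\theta \mid y) \;=\;
  \frac{f(y \mid \theta)\, p(\theta)}
       {\int_{\Theta} f(y \mid \tilde\theta)\, p(\tilde\theta)\, \mathrm{d}\tilde\theta}
  \qquad \text{for } \lambda\text{-almost all } \theta \in \Theta .
```

The qualifier "for $\lambda$-almost all $\theta$" is what makes the value of the density at a single point a matter of convention.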

Theory behind the Full Bayesian Significance Test (FBST)
The Full Bayesian Significance Test (FBST) was originally developed by Pereira and Stern [14] as an alternative to frequentist null hypothesis significance tests based on the p-value. It was designed for situations in which a significance test of a sharp hypothesis is to be conducted, where a sharp hypothesis refers to any submanifold of the parameter space of interest [20]. This includes, in particular, precise hypotheses like $H_0 : \theta = \theta_0$ for $\theta_0 \in \Theta$ [15]. The FBST assumes a standard parametric statistical model, where $\theta \in \Theta \subseteq \mathbb{R}^p$ is a (possibly vector-valued) parameter of interest, $f(y|\theta)$ is the density corresponding to the model distribution $P_{Y|\vartheta}$ and $p(\theta)$ is the prior density corresponding to the prior distribution $P_\vartheta$, where we again assume a dominating measure $\nu$ to guarantee the existence of Radon-Nikodým densities. A hypothesis $H$ makes the statement that the parameter $\theta$ lies in the corresponding null set $\Theta_H$, where for simple (or precise) hypotheses $\Theta_H := \{\theta_0\}$ and $\theta_0$ is the value specified in $H : \theta = \theta_0$. The FBST then defines two quantities: $\mathrm{ev}(H)$, which is the e-value supporting (or in favour of) the hypothesis $H$, and $\overline{\mathrm{ev}}(H)$, the e-value against $H$, also called the Bayesian evidence value against $H$ [14]. First, the posterior surprise function $s(\theta)$ and its maximum $s^*$ restricted to the null set $\Theta_H$ are introduced:
Definition 1 (Posterior surprise function). The posterior surprise function $s(\theta)$ for a reference function $r : \Theta \to (\mathcal{T}, \mathcal{C})$ from $\Theta$ to a measurable space $(\mathcal{T}, \mathcal{C})$ is defined as
$$s(\theta) := \frac{p(\theta|y)}{r(\theta)}. \tag{1}$$
In the definition of the posterior surprise function $s(\theta)$, the denominator $r(\theta)$ serves as a reference density, and often the measurable space $(\mathcal{T}, \mathcal{C})$ is equal to $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. When the improper flat reference function $r(\theta) = 1$ is used, the surprise function becomes the posterior density $p(\theta|y)$. Otherwise, a weakly informative prior density can be used as a reference function, see Pereira and Stern [16].
Then,
$$s^* := \sup_{\theta \in \Theta_H} s(\theta)$$
is defined as the supremum of the surprise function $s(\theta)$ over the null hypothesis support. For a precise null hypothesis, $s^*$ is simply $s(\theta_0)$. Next, the tangential set is introduced:
Definition 2 (Tangential set). Let
$$T(\nu) := \{\theta \in \Theta \mid s(\theta) \leq \nu\}, \qquad \overline{T}(\nu) := \Theta \setminus T(\nu).$$
Thus, $T(\nu)$ includes all parameter values $\theta \in \Theta$ which attain a surprise function value $s(\theta)$ smaller than or equal to the threshold $\nu$. The set $\overline{T}(\nu)$ is its complement and includes all parameter values $\theta \in \Theta$ which yield a surprise function value $s(\theta)$ larger than $\nu$. Fixing $\nu = s^*$ yields $\overline{T}(s^*)$, which is called the tangential set to the hypothesis $H$. This set $\overline{T}(s^*)$ contains the points $\theta$ of the parameter space $\Theta$ with higher surprise (or corroboration relative to the reference function $r(\theta)$) than the point $\theta_0$ in the null set $\Theta_H$. Then, the cumulative surprise function is introduced, which is required to compute the e-value in the final step:
Definition 3 (Cumulative surprise function). The function
$$W(\nu) := \int_{T(\nu)} p(\theta|y)\,\mathrm{d}\theta$$
is called the complementary cumulative surprise function, and
$$\overline{W}(\nu) := \int_{\overline{T}(\nu)} p(\theta|y)\,\mathrm{d}\theta = 1 - W(\nu) \tag{6}$$
is called the cumulative surprise function.
Thus, the complementary cumulative surprise function $W(\nu)$ is the integral of the posterior density $p(\theta|y)$ over the set $T(\nu)$, and the cumulative surprise function $\overline{W}(\nu)$ is simply the integral of the posterior density over the tangential set $\overline{T}(\nu)$. The final step towards the e-value is to integrate the posterior density $p(\theta|y)$ over the tangential set to the hypothesis, that is, to fix $\nu = s^*$:
Definition 4 (e-value). The e-value against a sharp null hypothesis $H_0 : \theta = \theta_0$ is defined as
$$\overline{\mathrm{ev}}(H_0) := \overline{W}(s^*) \tag{7}$$
and can be interpreted as the Bayesian evidence against $H_0$.
Clearly, $\overline{\mathrm{ev}}(H_0) := \overline{W}(s^*)$ is the integral of the posterior density $p(\theta|y)$ over the tangential set $\overline{T}(s^*)$, that is, over all parameter values $\theta$ which fulfill the condition $s(\theta) > s^*$. The e-value supporting $H_0$ is obtained as $\mathrm{ev}(H_0) := 1 - \overline{\mathrm{ev}}(H_0)$. Under $r(\theta) := 1$, large values of $\overline{\mathrm{ev}}(H_0)$ thus indicate that the hypothesis $H_0$ traverses low-density regions (or equivalently, that the alternative hypothesis traverses high-density regions), so that the evidence against $H_0$ is large. For $r(\theta) \neq 1$ the argument is identical, as $H_0$ then traverses low posterior-surprise regions.
For theoretical properties of the FBST and the e-value see Pereira and Stern [16] and Kelter [18]. The FBST then rejects $H$ if $\mathrm{ev}(H)$ is sufficiently small (or, equivalently, if $\overline{\mathrm{ev}}(H)$ is sufficiently large) [14,15].
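As a minimal numerical sketch of the definitions above, the following Python snippet approximates the e-value against $H_0 : \theta = \theta_0$ by a Riemann sum on a uniform grid. The normal posterior $N(0.5, 0.25^2)$, the grid bounds and the helper name `e_value_against` are assumptions made purely for illustration and are not taken from the paper; the reference function defaults to the flat choice $r(\theta) = 1$.

```python
from statistics import NormalDist


def e_value_against(posterior_pdf, theta0, grid, reference_pdf=lambda t: 1.0):
    """Approximate the e-value against H0: theta = theta0 by a Riemann sum.

    `grid` must be an equally spaced list of parameter values that covers
    essentially all posterior mass.
    """
    def surprise(t):
        # Posterior surprise s(theta) = p(theta | y) / r(theta).
        return posterior_pdf(t) / reference_pdf(t)

    s_star = surprise(theta0)          # supremum over the precise null {theta0}
    step = grid[1] - grid[0]
    # Integrate the posterior density over the tangential set {s(theta) > s*}.
    mass = sum(posterior_pdf(t) * step for t in grid if surprise(t) > s_star)
    return min(mass, 1.0)              # guard against discretization overshoot


# Hypothetical posterior for illustration only: N(0.5, 0.25^2).
posterior = NormalDist(mu=0.5, sigma=0.25)
n_points = 10_000
grid = [-2.0 + 5.0 * i / n_points for i in range(n_points + 1)]  # covers [-2, 3]

ev_against = e_value_against(posterior.pdf, theta0=0.0, grid=grid)
ev_support = 1.0 - ev_against
print(f"e-value against H0: {ev_against:.4f}, supporting H0: {ev_support:.4f}")
```

For this symmetric posterior the tangential set is the interval $(0, 1)$, so the e-value against $H_0$ should be close to the normal probability of lying within two standard deviations of the mean, roughly 0.954.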

On Two Aspects of the FBST
Now, this section demonstrates the two aspects briefly mentioned in the introduction based on an illustrative example.

The Reference Criterion
To illustrate the first problem, data from the Western Collaborative Group Study on coronary heart disease by Rosenman et al. [24] are used.
Example 1 (Coronary heart disease data). The Western Collaborative Group Study began in 1960 with 3524 male volunteers who were 39 to 59 years old and free of heart disease as determined by electrocardiogram. After the initial screening, the study population dropped to 3154 because of various exclusions. Multiple endpoints were studied, and follow-up continued for an average of 8.5 years with repeat examinations. As an illustrative example, suppose interest lies in testing for differences in systolic blood pressure between light smokers and heavy smokers. Thus, we test the hypothesis $H_0 : \delta = 0$ against the alternative $H_1 : \delta \neq 0$, where we classify participants who smoke more than 5 cigarettes per day as heavy smokers. A Bayesian two-sample t-test using the model of Rouder et al. [25] is conducted, and the left plot in Figure 1 shows the results of the FBST using a flat reference function $r(\delta) := 1$. The model is parameterized in the effect size $\delta$ of Cohen [26], and the e-value against $H_0$ is $\overline{\mathrm{ev}}(H_0) = 0.4362$, which equals the posterior probability mass visualized as the blue area in the left plot of Figure 1. Thus, 43.62% of the posterior probability mass indicates evidence against the null hypothesis, and the situation is inconclusive. The right plot in Figure 1 shows the result of the FBST when replacing the flat reference function $r(\delta) := 1$ with a Cauchy $C(0, \sqrt{2})$ density (note the different scaling on the y-axis), which is also used as the prior on $\delta$ in the two-sample t-test. In this case, the e-value $\overline{\mathrm{ev}}(H_0) = 0.4367$ indicates a similarly inconclusive situation and barely changes the result.
Now, the above example shows that calculation of the e-value is straightforward and universally applicable. However, the parameter space $\Theta$ is continuous in the example (the effect size $\delta \in \mathbb{R}$ is a continuous quantity) and any usual prior distribution $P_\vartheta$ assigned to $\theta$ is absolutely continuous with respect to the Lebesgue measure $\lambda$.
It is well known that the posterior distribution $P_{\vartheta|Y}$ is absolutely continuous with respect to the prior distribution [27], and thus any $P_\vartheta$-null set $N \subset \Theta$ with $P_\vartheta(N) = 0$ is also a $P_{\vartheta|Y}$-null set with $P_{\vartheta|Y}(N) = 0$. Problematically, the set $\Theta_0 := \{\delta_0\} = \{0\}$ which is used in the precise null hypothesis $H_0 : \delta = 0$ is a $P_\vartheta$-null set under both the improper flat and the Cauchy prior, as both of these are absolutely continuous with respect to the Lebesgue measure $\lambda$, and submanifolds are Lebesgue null sets [28]. Thus, $\lambda(\{\delta_0\}) = \lambda(\{0\}) = 0$ implies $P_\vartheta(\{0\}) = 0$ due to $P_\vartheta \ll \lambda$, which in turn implies that $\{0\}$ is a $P_{\vartheta|Y}$-null set, that is, $P_{\vartheta|Y}(\{0\}) = 0$, due to $P_{\vartheta|Y} \ll P_\vartheta$. As a consequence, the value of the posterior density $p(0|y) = 9.4693$, which is shown as the blue point in the left plot of Figure 1, could be chosen arbitrarily. Problematically, this value is used as the reference criterion $s^*$ in the computation of the tangential set $\overline{T}(s^*)$ and hence of the e-value $\overline{\mathrm{ev}}(H_0)$. Thus, one could assign $p(0|y)$ an entirely different value, say, $c \in \mathbb{R}$, and obtain a different e-value $\overline{\mathrm{ev}}(H_0)$ than the one calculated from the value $p(0|y) = 9.4693$. This seems to render the calculation of the statistical evidence $\overline{\mathrm{ev}}(H_0)$ in the FBST arbitrary, questioning the use of the procedure.
Figure 1. Results of the FBST using a flat reference function (left) and a $C(0, \sqrt{2})$ Cauchy density as reference function (right) for testing the hypothesis of no difference $H_0 : \delta = 0$ in terms of systolic blood pressure between smokers and non-smokers.
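The arbitrariness described above can be made concrete with a small Python sketch. It assumes a hypothetical normal posterior $N(0.5, 0.25^2)$ and a flat reference function (neither is the posterior of Example 1), for which the e-value against $H_0 : \theta = 0$ is available in closed form as a function of the density value assigned at the null point. Changing that single value leaves every posterior probability untouched, yet the helper `e_value_from_threshold` (an illustrative name) returns very different e-values.

```python
import math
from statistics import NormalDist


def e_value_from_threshold(posterior, s_star):
    """E-value against H0 for a normal posterior and flat reference r = 1.

    `s_star` is the density value assigned to the null point. The tangential
    set {theta : p(theta | y) > s_star} is an interval centred at the mean.
    """
    peak = posterior.pdf(posterior.mean)   # maximum of the posterior density
    if s_star >= peak:
        return 0.0                         # no point has higher density
    if s_star <= 0.0:
        return 1.0                         # every point has higher density
    # p(theta | y) > s_star  <=>  |theta - mean| / sd < z
    z = math.sqrt(2.0 * math.log(peak / s_star))
    return 2.0 * NormalDist().cdf(z) - 1.0


posterior = NormalDist(mu=0.5, sigma=0.25)   # hypothetical posterior
theta0 = 0.0

# Version 1: the continuous density, evaluated at the null point.
ev_continuous = e_value_from_threshold(posterior, posterior.pdf(theta0))
# Versions 2 and 3: identical to version 1 except on the null set {theta0}.
ev_low = e_value_from_threshold(posterior, 0.0)
ev_high = e_value_from_threshold(posterior, 2.0 * posterior.pdf(posterior.mean))

print(ev_continuous, ev_low, ev_high)
```

All three versions define the same posterior distribution, but they produce e-values near 0.954, exactly 1 and exactly 0, respectively. Only the first corresponds to the continuous density.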

Prior Probability of the e-Value
The second issue with the FBST may be phrased as the e-value having no valid prior probability. In fact, the e-value in Equation (7) is based on the cumulative surprise function $\overline{W}(s^*)$, which itself depends on the tangential set $\overline{T}(s^*)$ and the posterior density $p(\theta|y)$. Before data $y \in \mathcal{Y}$ are observed, the posterior $P_{\vartheta|Y}$ has not been realized as $P_{\vartheta|Y=y}$, and thus there seems to exist no prior probability under $P_\vartheta$ which is associated with the e-value. Even the tangential set $\overline{T}(s^*) := \{\theta \in \Theta \mid s(\theta) > s^*\}$, which is a subset of $\Theta$, seems to have no prior probability, because it depends on the surprise function $s(\theta)$, which itself depends on the posterior density $p(\theta|y)$, compare Equation (1). Thus, the statistical evidence in the FBST seems to escape the natural Bayesian transition from prior to posterior probability.

The Reference Criterion
If the above criticism that the reference criterion in the FBST is arbitrary held, the procedure would be of little use in practice. However, the solution to the problem is to fix a specific version of the posterior density and to perform all calculations conditional on this version. It is well known that the densities of probability distributions (which are probability measures corresponding to a random variable) are defined only up to Lebesgue null sets (when the distributions are dominated by the Lebesgue measure). The values of a density on null sets do not influence the corresponding probability measure, and therefore densities are identified with each other whenever they differ only on Lebesgue null sets [28]. Technically, this corresponds to the shift from the vector space
$$\mathcal{L}^p(\Omega, \mathcal{A}, \mu) := \Big\{ f : \Omega \to \mathbb{K} \;\Big|\; f \text{ is measurable}, \int_\Omega |f|^p \,\mathrm{d}\mu < \infty \Big\}$$
on a probability space $(\Omega, \mathcal{A}, \mu)$, with $\mathbb{K} \in \{\mathbb{R}, \mathbb{C}\}$ and $0 < p < \infty$, to the quotient space $L^p$, see Bauer [28]. The latter space is defined as
$$L^p := \mathcal{L}^p / \mathcal{N}, \quad \text{where } \mathcal{N} := \{ f \in \mathcal{L}^p \mid f = 0 \ \mu\text{-almost everywhere} \}, \tag{9}$$
and the elements in $L^p$ are equivalence classes. Thus, two elements [...]
Theorem 1. [...] Bayesian Significance Test, and $\mathcal{L}^p$ and $L^p$ the corresponding vector spaces on $(\Theta, \mathcal{G}, P_{\vartheta|Y})$ with quotient space $\mathcal{L}^p/\mathcal{N}$ for $\mathcal{N} := \{ f \in \mathcal{L}^p \mid f = 0 \ \mu\text{-almost everywhere} \}$. Whenever $P_{\vartheta|Y}$ is a known probability distribution $\tilde{P}_{\vartheta|Y}$ with Lebesgue density $\tilde{p}(\theta|y)$, defining $p(\theta|y) := \tilde{p}(\theta|y)$ pointwise for all $\theta \in \Theta$ renders the e-value $\overline{\mathrm{ev}}(H_0)$ against $H_0 : \theta = \theta_0$ for $\theta_0 \in \Theta$ well-defined and unique for the choice of $p(\theta|y)$.

Proof. See Appendix A.
Note that when using numerical methods such as MCMC, ergodic theory ensures that $P^{\mathrm{MCMC}}_{\vartheta|Y} \to P_{\vartheta|Y}$ in distribution and $p^{\mathrm{MCMC}}_{\vartheta|Y} \to p_{\vartheta|Y}$, that is, the MCMC posterior density approximates the posterior Lebesgue density pointwise with increasing precision as the number of MCMC samples increases [29]. Thus, after fixing a version of the posterior density, Theorem 1 also extends to situations where numerical techniques such as MCMC are required.
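A rough sketch of how the e-value could be approximated from posterior draws is given below. It is an illustration under stated assumptions, not the procedure of the paper or of any particular package: the draws are simulated directly from a hypothetical $N(0.5, 0.25^2)$ posterior as a stand-in for MCMC output, the density is estimated by a hand-written Gaussian kernel estimator with Silverman's rule-of-thumb bandwidth, and the e-value against $H_0 : \theta = 0$ under a flat reference function is estimated as the share of draws whose estimated density exceeds the estimated density at the null value.

```python
import math
import random
from statistics import stdev


def kde(draws, bandwidth):
    """Return a Gaussian kernel density estimate built from `draws`."""
    norm = 1.0 / (len(draws) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(math.exp(-0.5 * ((t - d) / bandwidth) ** 2) for d in draws)

    return density


def e_value_from_draws(draws, theta0):
    """Monte Carlo estimate of the e-value against H0: theta = theta0 (r = 1)."""
    bandwidth = 1.06 * stdev(draws) * len(draws) ** (-0.2)   # Silverman's rule
    density = kde(draws, bandwidth)
    s_star = density(theta0)            # fixed version of the density at theta0
    # Share of draws that fall into the estimated tangential set.
    inside = sum(1 for d in draws if density(d) > s_star)
    return inside / len(draws)


rng = random.Random(2024)               # fixed seed for reproducibility
draws = [rng.gauss(0.5, 0.25) for _ in range(1000)]   # stand-in for MCMC draws
ev_mcmc = e_value_from_draws(draws, theta0=0.0)
print(f"estimated e-value against H0: {ev_mcmc:.3f}")
```

The estimate depends on the kernel and bandwidth, which is one practical reason to fix the density estimate (and hence the version of the density) before the e-value is computed.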

Prior Probability of the e-Value
The solution to the second problem is more involved but less technical. Conceptually, from the above line of thought it is immediate that under priors $P_\vartheta$ which are absolutely continuous with respect to the Lebesgue measure $\lambda$, the prior probability $P_\vartheta(\Theta_0)$ is zero for any precise null hypothesis $H_0 := \Theta_0$ with $\Theta_0 := \{\theta_0\}$ for $\theta_0 \in \Theta$. The posterior $P_{\vartheta|Y}$ is absolutely continuous with respect to the prior $P_\vartheta$, so $P_{\vartheta|Y}(\Theta_0) = 0$. Thus, it is simply not possible to use a natural Bayesian workflow which assigns positive probability mass to a Lebesgue null set $\Theta_0$ whenever the statistician uses a prior distribution $P_\vartheta$ which is absolutely continuous with respect to $\lambda$. Traditional Bayesian hypothesis testing and model selection bypasses this inconvenience by introducing an arbitrary mixture prior structure $P_\vartheta := \varepsilon \cdot \mathbb{1}_{\Theta_0} + (1 - \varepsilon)\tilde{P}_\vartheta$, which assigns positive probability mass $\varepsilon > 0$ to the null set $\Theta_0$ and distributes the rest of the probability mass $(1 - \varepsilon) \in [0, 1]$ by means of a probability distribution $\tilde{P}_\vartheta$ on the alternative hypothesis space $\Theta_1 = \Theta \setminus \Theta_0$. Early proposals of such a mixture prior structure include Jeffreys [30] and Haldane [31], see also Robert [29] and Kleijn [23]. Such a prior allows computation of a Bayes factor; note, however, that the Bayes factor itself also has no prior probability which is naturally associated with it. Importantly, this mixture prior structure imposes a dichotomy between hypothesis testing and parameter estimation, because such a mixture prior structure is reasonable only from a hypothesis testing perspective. Whenever parameter estimation is the goal, the assignment of positive probability mass to a specific value is highly questionable and often contradicts reasonable a priori beliefs. In these cases, prior beliefs are expressed better through a prior which is absolutely continuous with respect to the Lebesgue measure $\lambda$.
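To illustrate how a mixture prior changes the analysis, the sketch below computes the posterior probability of a precise null hypothesis in a deliberately simple setting that is not part of the paper: a sample mean $\bar{y} \sim N(\theta, \sigma^2/n)$ with known $\sigma$, prior mass `eps` on $\theta_0$, and a $N(\theta_0, \tau^2)$ prior on the alternative. All names and numbers are hypothetical. With `eps = 0`, that is, an absolutely continuous prior, the posterior probability of $H_0$ is zero whatever the data.

```python
from statistics import NormalDist


def posterior_null_probability(ybar, theta0, sigma, n, tau, eps):
    """Posterior probability of H0: theta = theta0 under a mixture prior.

    Prior: mass `eps` on theta0 and (1 - eps) * N(theta0, tau^2) elsewhere.
    Model: ybar ~ N(theta, sigma^2 / n) with known sigma.
    """
    se2 = sigma ** 2 / n
    # Marginal density of ybar under H0 and under the alternative.
    m0 = NormalDist(theta0, se2 ** 0.5).pdf(ybar)
    m1 = NormalDist(theta0, (se2 + tau ** 2) ** 0.5).pdf(ybar)
    bf01 = m0 / m1                      # Bayes factor in favour of H0
    return eps * bf01 / (eps * bf01 + (1.0 - eps))


# Hypothetical data summary: the sample mean equals the null value.
args = dict(ybar=0.0, theta0=0.0, sigma=1.0, n=100, tau=1.0)
p_mixture = posterior_null_probability(eps=0.5, **args)
p_continuous = posterior_null_probability(eps=0.0, **args)
print(p_mixture, p_continuous)
```

The positive posterior probability in the first call exists only because the prior placed positive mass on the single point $\theta_0$ in the first place.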
The FBST avoids the introduction of such a mixture structure and thus allows for a unified prior elicitation which is coherent both from a Bayesian hypothesis testing and from a Bayesian parameter estimation stance. Importantly, the e-value is intended to be a Bayesian replacement of the frequentist p-value, which measures the statistical discrepancy between the observed data and an assumed precise hypothesis. Thus, the e-value provides the Bayesian evidence against such a precise hypothesis. From a measure-theoretic point of view, every precise null hypothesis is assumed to be false, and the FBST thus aligns with the empirical rationalism of Popper [32]. For the use of testing a precise hypothesis as an approximation of a small interval hypothesis see Berger [33], Rousseau [34], Rao & Lovric [35] as well as Kelter [36]: Often, the approximation of a small interval hypothesis via a precise point null hypothesis will be poor, and thus the e-value does not assign positive probability mass to such a precise null hypothesis. Instead, the FBST quantifies the discrepancy between the observed data and the hypothetical precise null value, while simultaneously implementing I.J. Good's principle of least surprise [37–39]. Note further that the mathematical introduction of positive prior probability for a precise value $\theta_0 \in \Theta$ when using a mixture prior does not render such a precise hypothesis $H_0 : \theta = \theta_0$ more realistic in practice.
Furthermore, next to its measure-theoretic premises, there exists another argument which weakens the criticism that there is no prior probability of the e-value: When a prior distribution $P_\vartheta$ is selected and no data $y \in \mathcal{Y}$ have been observed, the posterior distribution can be identified conceptually with the prior distribution. Thus, replacing the posterior density $p(\theta|y)$ with the $\lambda$-density $p(\theta)$ of the prior $P_\vartheta$ yields $s(\theta) := p(\theta)/r(\theta)$, which implies that the tangential set $\overline{T}(\nu) := \Theta \setminus T(\nu)$ for $T(\nu) := \{\theta \in \Theta \mid s(\theta) \leq \nu\}$ includes those parameter values $\theta \in \Theta$ for which $p(\theta)/r(\theta) > \nu$. Using the fact that $s^* = p(\theta_0)/r(\theta_0)$ for a precise hypothesis $H_0 : \theta = \theta_0$ then yields $\overline{T}(s^*) = \{\theta \in \Theta \mid p(\theta)/r(\theta) > p(\theta_0)/r(\theta_0)\}$. Plugging this tangential set into Equation (6) yields
$$\overline{W}(s^*) = \int_{\{\theta \in \Theta \,\mid\, p(\theta)/r(\theta) > p(\theta_0)/r(\theta_0)\}} p(\theta)\,\mathrm{d}\theta,$$
which, for $r(\theta) := 1$, is the integral of the prior density $p(\theta)$ over all values which attain higher prior density values than the null value $\theta_0$ in $H_0 : \theta = \theta_0$. Thus, the e-value in such a case quantifies the discrepancy of the precise hypothesis $H_0 : \theta = \theta_0$ with the prior beliefs $P_\vartheta$. The above line of thought provides the following result:
Theorem 2. Let $r(\theta) := 1$. In case no data $y \in \mathcal{Y}$ have been observed, the e-value quantifies the discrepancy between the precise hypothesis $H_0 := \Theta_0$ for $\Theta_0 := \{\theta_0\}$ and $\theta_0 \in \Theta$ and the prior distribution $P_\vartheta$, that is,
$$\overline{\mathrm{ev}}(H_0) = \int_{\{\theta \in \Theta \,\mid\, p(\theta) > p(\theta_0)\}} p(\theta)\,\mathrm{d}\theta.$$
Proof. See Appendix A.
Whenever $r(\theta) \neq 1$, the interpretation is more complicated because such a reference function incorporates a surprise element into the tangential set, but the conclusions remain the same. The e-value then quantifies the discrepancy between the precise hypothesis and the prior surprise.
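Theorem 2 can be illustrated with the Cauchy $C(0, \sqrt{2})$ prior from Example 1 and a flat reference function. Because the Cauchy density is symmetric about zero and strictly decreasing in $|\theta|$, the set $\{\theta : p(\theta) > p(\theta_0)\}$ is the interval $(-|\theta_0|, |\theta_0|)$, and its prior mass follows from the Cauchy distribution function. The function name and the chosen null values below are illustrative assumptions.

```python
import math


def prior_e_value_cauchy(theta0, scale=math.sqrt(2.0)):
    """Prior e-value against H0: theta = theta0 under a C(0, scale) prior, r = 1.

    Equals the prior mass of (-|theta0|, |theta0|), the set on which the
    Cauchy density exceeds its value at theta0.
    """
    return (2.0 / math.pi) * math.atan(abs(theta0) / scale)


ev_at_mode = prior_e_value_cauchy(0.0)               # null value at the prior mode
ev_at_scale = prior_e_value_cauchy(math.sqrt(2.0))   # null value one scale unit away
print(ev_at_mode, ev_at_scale)
```

A null value at the prior mode is thus in no conflict with the prior beliefs, whereas a null value one scale unit away from the mode leaves half of the prior mass on parameter values with higher prior density.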

Discussion
The Full Bayesian Significance Test (FBST) has been proposed as a convenient method to replace frequentist p-values for testing a precise hypothesis [14–16]. Although the FBST enjoys various appealing properties [8,19,20,40], two aspects of the FBST are sometimes regarded as measure-theoretic inconsistencies of the procedure and have not been discussed rigorously in the literature. First, the FBST uses the posterior density as a reference for judging the Bayesian statistical evidence against a precise hypothesis. However, under absolutely continuous prior distributions, the posterior density is defined only up to Lebesgue null sets, which seems to render the reference criterion arbitrary. Second, the FBST statistical evidence seems to have no valid prior probability. In this paper, it was shown that the former problem can be circumvented by fixing a version of the posterior density before using the FBST. Theorem 1 demonstrated that the e-value is then well-defined and unique after observing the data $y \in \mathcal{Y}$.
The latter aspect is based on the measure-theoretic premises of the FBST. As shown in this paper, the FBST avoids the use of a mixture prior structure, which imposes a dichotomy between Bayesian hypothesis testing and parameter estimation. Thus, the FBST is compatible with priors which are absolutely continuous with respect to the Lebesgue measure $\lambda$ (the Bayes factor, for example, is not). As a consequence, there exists no prior probability of the e-value, and a precise hypothesis $H_0 : \theta = \theta_0$ has prior probability zero under an absolutely continuous prior $P_\vartheta$. Theorem 2 showed that even then, the e-value has a proper interpretation from a prior perspective: It quantifies the a priori discrepancy of the hypothesis $H_0$ with the prior beliefs expressed by $P_\vartheta$ whenever the reference function $r(\theta)$ is flat. When $r(\theta) \neq 1$, the interpretation is more difficult but the conclusion remains the same.
Together, the results in this paper show that neither of the two aspects which are sometimes regarded as measure-theoretic inconsistencies of the FBST is a tenable objection. The FBST thus provides a measure-theoretically coherent Bayesian alternative for testing a precise hypothesis.

Data Availability Statement: The R code to recreate all analyses and plots can be found at the Open Science Framework at https://osf.io/25vsw/?view_only=e1e243c1e2a44646969fb75cc4c34d57.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

FBST    Full Bayesian Significance Test
NHST    Null Hypothesis Significance Testing