From p-Values to Posterior Probabilities of Null Hypotheses

Minimum Bayes factors are commonly used to transform two-sided p-values into lower bounds on the posterior probability of the null hypothesis, in particular the bound −e · p · log(p). This bound is easy to compute and explain; however, it does not behave like a Bayes factor. For example, it does not change with the sample size. This is a serious defect, particularly for moderate to large sample sizes, which is precisely the situation in which p-values are most problematic. In this article, we propose adjusting this minimum Bayes factor with the information in the sample to approximate an exact Bayes factor, not only when p is a p-value but also when p is a pseudo-p-value. Additionally, we develop a version of the adjustment for linear models using the Prior-Based BIC (PBIC), a recent refinement of BIC.


Introduction
By now, it is well known by practitioners that p-values are not posterior probabilities of the null hypothesis, which is what science needs in order to declare a scientific finding. So p-values, and particularly the threshold 0.05, need to be recalibrated. Two widespread practical attempts are (i) the so-called Robust Lower Bound on Bayes factors, BF ≥ −e · p · log(p) [1], and (ii) the replacement of the ubiquitous α = 0.05 by α* = 0.005 [2]. These suggestions, although an improvement on usual practice, fall short of being a real solution, mainly because they do not account for the dependence of the evidence on the sample size. Still, the Robust Lower Bound is useful since it is valid from small sample sizes onward and depends only on the p-value. It is known that the evidence of a p-value against a point null hypothesis depends on the sample size. The authors of [3] consider p-values in linear models and propose new monotonic minimum Bayes factors that depend on the sample size and converge to −e · p · log(p) as the sample size approaches infinity, which implies they are not consistent, unlike actual Bayes factors. It turns out that the maximum evidence for an exact two-tailed p-value increases with decreasing sample size. There are several proposals in the literature; most do not depend on the sample size, and those that do are still Robust Lower Bounds, so none of them behaves like a real Bayes factor. In this article, we propose to adjust the Robust Lower Bound −e · p · log(p) so that it behaves approximately like an actual Bayes factor for any sample size. A further complication arises, however, when the null hypotheses are not simple, that is, when they depend on unknown nuisance parameters. In this situation, what are usually called p-values are only pseudo-p-values [4] (p. 397). So, we first need to extend the validity of the Robust Lower Bound to pseudo-p-values. The effect of adjusting this minimum Bayes factor with the sample size is shown in a simulation in Section 5.1.
The outline of the article is as follows: In Section 2, we define pseudo-p-values using the p-value definition of [4] (p. 397) and extend the validity of the Robust Lower Bound to them. In Section 3, we present the adaptive significance levels that will be used for incorporating the sample size into the lower bound: the general adaptive significance level presented in [5] and the refined version for linear models developed in [6]; in both cases, we use versions calibrated using the Prior-Based BIC (PBIC) [7]. In Section 4, we derive adaptive approximate Bayes factors, and we apply them to pseudo-p-values in Section 5. We close in Section 6 with some final comments.

Valid p-Values and Robust Lower Bound
Under the null hypothesis, p-values are well known to have a Uniform(0, 1) distribution; in [4] (p. 397), a more general definition is given.

Definition 1.
A p-value p(X) is a statistic satisfying 0 ≤ p(x) ≤ 1 for every sample point x. Small values of p(X) give evidence that H1 : θ ∈ Θ0^c is true, where Θ0 is some subset of the parameter space and Θ0^c is its complement. A p-value is valid if, for every θ ∈ Θ0 and every 0 ≤ α ≤ 1,

P_θ(p(X) ≤ α) ≤ α.

Based on this definition, there are valid p-values that are Uniformly distributed on (0, 1), that is,

P_θ(p(X) ≤ α) = α for every θ ∈ Θ0 and every 0 ≤ α ≤ 1, (1)

and others that are not, that is, for which there is at least one α such that

P_θ(p(X) ≤ α) < α. (2)

Remark 1. We consider any valid p-value complying with (2) a pseudo-p-value.
The "Robust Lower Bound" (RLB), as we call it here, proposed by [1], is

B_L(p) = −e · p · log(p), for p < 1/e. (3)

The authors consider that, under the null hypothesis, the distribution of the p-value p(X) is Uniform(0, 1). Alternatives are typically developed by considering alternative models for X, but the results then end up being quite problem-specific. An attractive approach is instead to directly consider alternative distributions for p itself. In effect, they consider that, under H1, the density of p is f(p | ξ), where ξ is an unknown parameter. So, consider testing

H0 : ξ = 1 versus H1 : ξ < 1.

If the test statistic T has been appropriately chosen, so that large values of T(X) are evidence in favor of H1, then the density of p under H1 should be decreasing in p. A class of decreasing densities for p that is very easy to work with is the class of Beta(ξ, 1) densities, for 0 < ξ ≤ 1, given by f(p | ξ) = ξ p^(ξ−1). The uniform distribution (i.e., H0) arises from the choice ξ = 1 [1]. The bound (3) is obtained as B_L(p) = inf_π B_π(p), where B_π(p) is the Bayes factor of H0 to H1 for a given prior density π(ξ) on this alternative.
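The infimum over the Beta(ξ, 1) class can be checked numerically. The following sketch (Python with NumPy/SciPy; the function names are ours) compares the analytic bound (3) with a direct numerical minimization of the point Bayes factor B(ξ) = 1/(ξ p^(ξ−1)):

```python
# Numerical check of the Robust Lower Bound (3): for a Beta(xi, 1) alternative
# f(p | xi) = xi * p**(xi - 1), the Bayes factor of H0 to H1 at a fixed xi is
# B(xi) = 1 / (xi * p**(xi - 1)); its infimum over 0 < xi <= 1 is attained at
# xi = -1 / log(p), which yields B_L(p) = -e * p * log(p) for p < 1/e.
import numpy as np
from scipy.optimize import minimize_scalar

def rlb(p):
    """Robust Lower Bound -e * p * log(p), valid for p < 1/e."""
    return -np.e * p * np.log(p)

def bf_point(p, xi):
    """Bayes factor of H0 (xi = 1) to H1 at a fixed xi in (0, 1]."""
    return 1.0 / (xi * p ** (xi - 1.0))

for p in (0.05, 0.01, 0.001):
    res = minimize_scalar(lambda xi: bf_point(p, xi), bounds=(1e-6, 1.0),
                          method="bounded")
    print(f"p = {p}: numeric infimum = {res.fun:.6f}, RLB = {rlb(p):.6f}")
```

The two columns agree to numerical precision, confirming that the minimum over the whole Beta(ξ, 1) class is attained at ξ = −1/log(p).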
Theorem 1. The RLB_ξ is a valid p-value for ξ ≥ 1; that is, P_θ(RLB_ξ ≤ α) ≤ α for every θ ∈ Θ0 and every 0 ≤ α ≤ 1. Proof. See Appendix A.

Adaptive α with PBIC Strategy
The Bayesian literature has criticized, for several decades, the implementation of hypothesis testing with fixed significance levels and, in particular, the use of the rule p-value < 0.05. An adaptive α allows us to adjust the statistical significance to the amount of information; see [5,11,12]. The adaptive values we work with in this section were calculated so that they lead to results equivalent to those obtained with a Bayes factor. In [5], the authors present an adaptive α based on BIC, given in (5), where C_α is a calibration constant; strategies for calculating it are presented in [5]. It yields a consistent procedure; it alleviates the problem of the divergence between practical and statistical significance; and it makes it possible to perform Bayesian testing by computing intervals with the calibrated α-levels.
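As an illustration of how an adaptive α shrinks with n, the sketch below uses an assumed BIC-structure form, α_n = P(χ²_q > χ²_α(q) + q log(n/n0)). This form is our assumption, chosen only to reproduce the "χ²_α(q) + q log(n)" structure noted after (6); it is not necessarily the exact expression (5) of [5], and the reference size n0 plays the role of the calibration constant C_α:

```python
# A minimal sketch of a BIC-structure adaptive significance level.
# ASSUMED form: alpha_n = P( chi2_q > chi2_alpha(q) + q * log(n / n0) ),
# with q the number of restrictions tested and n0 a reference sample size
# standing in for the calibration constant C_alpha of [5].
import numpy as np
from scipy import stats

def adaptive_alpha(alpha, n, q=1, n0=1):
    """Adaptive significance level that decreases as n grows (assumed form)."""
    threshold = stats.chi2.ppf(1 - alpha, df=q) + q * np.log(n / n0)
    return stats.chi2.sf(threshold, df=q)

for n in (10, 100, 1000, 10000):
    print(n, round(adaptive_alpha(0.05, n), 5))
```

Whatever the exact calibration, the qualitative behavior is the one shown: the significance level decreases slowly (roughly like 1/√(n log n)) as the sample size grows.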
An adaptive α is also presented in [6], this time in a version refined for nested linear models, with calibration based on the Prior-Based Bayesian Information Criterion (PBIC) [7]; it is given in (6). Here, b = |X_j^T X_j| / |X_i^T X_i|, where X_i and X_j are the design matrices of the two models, and the terms d_{m_l}(1 + n_{e,m_l}), with l = i, j, correspond to each model. Here, n_{e,m_l}, with l = i, j, refers to The Effective Sample Size (TESS) of the corresponding parameter; see [7].
The adaptive α in (5) can also be presented using the PBIC strategy (this strategy was not considered in [5]), and expression (7) is obtained. Note that this adaptive α is still of BIC structure, since the expression χ²_α(q) + q log(n) remains.

Example: Binomial Models
Consider comparing two binomial models S1 ∼ Binomial(n1, p1) and S2 ∼ Binomial(n2, p2) via the test H0 : p1 = p2 versus H1 : p1 ≠ p2. Defining n = n1 + n2 and p̂, the MLE of p1 − p2, then (7) gives the adaptive level (8) for this problem. Table 1 shows the behavior of this adaptive α_n for α = 0.05 and different values of n1 and n2.
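The following sketch shows the frequentist test behind Table 1: the pooled two-proportion z-test of H0 : p1 = p2, compared against an adaptive level with q = 1. The adaptive-α line reuses the assumed BIC-structure form of the previous sketch, not the exact expression (8):

```python
# Pooled two-proportion z-test for H0: p1 = p2, with an adaptive threshold.
import numpy as np
from scipy import stats

def two_prop_pvalue(s1, n1, s2, n2):
    """Two-sided p-value of the pooled two-proportion z-test."""
    p_pool = (s1 + s2) / (n1 + n2)          # pooled MLE under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (s1 / n1 - s2 / n2) / se
    return 2 * stats.norm.sf(abs(z))

s1, n1, s2, n2 = 20, 100, 35, 100
n = n1 + n2
p_val = two_prop_pvalue(s1, n1, s2, n2)
# ASSUMED BIC-structure adaptive level with q = 1 (see the earlier sketch):
alpha_n = stats.chi2.sf(stats.chi2.ppf(0.95, df=1) + np.log(n), df=1)
print(f"p = {p_val:.4f}, alpha_n = {alpha_n:.4f}, reject H0: {p_val < alpha_n}")
```

In this illustration, the result is significant at the fixed level 0.05 but not at the adaptive level, which is exactly the kind of discrepancy Table 1 quantifies.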

Adjusting RLB_ξ Using the Adaptive α
In this section, we combine (3) with the formulas for the adaptive α in (6) and (7) to adjust RLB_ξ and obtain an approximation to an objective Bayes factor. Indeed, we adjust RLB_ξ through the expression B(α) = B_L(α, ξ0) · g(·), where g is determined in such a way that, when B(α) is evaluated at (6) or (7), it converges to a constant (this allows us to obtain equivalent results from the Frequentist and Bayesian points of view; that is, the decision does not change).
Substituting p in (3) with the adaptive α value in (7) results in expression (9).
For a Uniform(0, 1) p-value with ξ0 = 1, this expression simplifies to (10). The refined version of this calibration for linear models, (11), is obtained when (3) is evaluated at (6); in this case, we only consider ξ0 = 1.

Balanced One-Way ANOVA
Suppose we have k groups with r observations each, for a total sample size of kr, and let H0 : µ1 = · · · = µk = µ vs. H1 : at least one µi is different. The design matrices of the two models follow directly from this setting, and the adaptive α for the linear model is obtained in accordance with what was presented in [6]. Here, the number of replicates r is The Effective Sample Size (TESS). Therefore, the approximate Bayes factor for this test is calculated with (8). A very important case arises when k = 2; for this situation, the formula simplifies considerably.
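A minimal sketch of this setting, using scipy.stats.f_oneway for the classical F-test; the group data are simulated under H0, and the point to note is that the relevant effective sample size (TESS) is r, not the total kr:

```python
# Balanced one-way ANOVA: k groups, r replicates each, F-test of H0 (equal means).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, r = 4, 25                       # k groups, r observations per group
groups = [rng.normal(loc=0.0, scale=1.0, size=r) for _ in range(k)]

F, p = stats.f_oneway(*groups)     # classical F-test of H0: mu_1 = ... = mu_k
print(f"F = {F:.3f}, p-value = {p:.4f}, TESS = r = {r}")
```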

Obtaining Bounds for P(H0 | Data)
In this section, we use (9) and (11) to produce bounds for the posterior probability of the null hypothesis H0.
For any Bayes factor B01, a lower bound for the posterior probability of the null hypothesis can be obtained (assuming equal prior probabilities for H0 and H1) as

P(H0 | data) = (1 + 1/B01)^(−1). (13)

Figure 2 shows these posterior probabilities (called P_RLBξ0) for different values of ξ0. To simplify the use of these Bayes factors, we call BFG_ξ0 the Bayes factor of Equation (9), BFG the Bayes factor of Equation (10), and BFL the Bayes factor of Equation (11).
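In code, the conversion (13) is immediate; here it is applied to the RLB of (3), assuming equal prior probabilities:

```python
# Converting a (minimum) Bayes factor B01 into a posterior probability via (13),
# assuming equal prior probabilities P(H0) = P(H1) = 1/2.
import numpy as np

def posterior_h0(bf01):
    """P(H0 | data) = (1 + 1/B01)^(-1) under equal prior odds."""
    return 1.0 / (1.0 + 1.0 / bf01)

for p in (0.05, 0.01, 0.001):
    rlb = -np.e * p * np.log(p)   # lower bound on B01 from (3), valid for p < 1/e
    print(f"p = {p}: RLB = {rlb:.4f}, P(H0 | data) >= {posterior_h0(rlb):.4f}")
```

For example, p = 0.05 gives a posterior probability of H0 of at least about 0.29, strikingly larger than the nominal 5% that practitioners often read into the p-value.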

Testing Equality of Two Means
Consider comparing two normal means via the test H0 : µ1 = µ2 versus H1 : µ1 ≠ µ2, where the associated variances, σ²1 and σ²2, are unknown and not equal.
On the other hand, assume the prior π(σ²) ∝ 1/σ² under both H0 and H1. The Bayes factor is then given in (14), where t = |Ȳ|/(s/√n) is a t-statistic with l = n − 1 degrees of freedom and n = n1 + n2; see [13]. Figure 3 shows the posterior probability of the null hypothesis H0 when n = 50 and n = 100 for the Robust Lower Bound with ξ0 = 1 (called P_RLB), the Bayes factor BFL (called P_BFL), the Bayes factor BFG (called P_BFG), and the Bayes factor BF01 (called P_BF01). Note that the posterior probability with BF01 when τ0 = 6 looks very similar to the results obtained using the Bayes factors BFL and BFG.
We now present a simulation showing that our adjustment, or calibration, of RLB_ξ works quite similarly to an exact Bayes factor. We perform the following experiment: we simulate r data points from each of two normal distributions, N(µ1, σ) and N(µ2, σ), and reproduce this K times, with µ1 − µ2 = 0 in all K simulations, so that H0 is true in every replicate. For all K replicates, we test the hypotheses H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2, and then we count how many of the p-values lie between 0.05 − ε and 0.05. Note that all of these p-values would be considered sufficient to reject H0 if α = 0.05 were selected. Finally, we determine the proportion of these "significant" p-values obtained from samples where H0 is true. Table 2 presents the mean percentage of these significant p-values coming from samples where H0 is true, for 100 iterations of the simulation scheme with K = 8000, σ = 1, and ε = 0.05, for r = 10, 50, 100, 500, and 1000. As expected, the distribution of the p-values behaved as Uniform(0, 1) under H0, since H0 was assumed true in the K replicates. Table 2 also presents the proportion of posterior probabilities of H0 greater than or equal to 0.5 (50%) when using the RLB_ξ, when corrected according to the method suggested in this document (Equations (10) and (11)), and when an exact Bayes factor (Equation (14)) is used. It is clear that the method suggested here behaves very similarly to an exact Bayes factor.
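A compact version of this simulation scheme (written independently of the original code, with scipy.stats.ttest_ind for the two-sample test) is sketched below:

```python
# K two-sample data sets generated under H0 (mu1 = mu2), two-sided t-tests,
# and the count of p-values falling in (0.05 - eps, 0.05]. Parameter values
# follow the text; r = 100 is one of the sample sizes reported in Table 2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
K, r, sigma, eps = 8000, 100, 1.0, 0.05

pvals = np.empty(K)
for i in range(K):
    x = rng.normal(0.0, sigma, size=r)   # N(mu1, sigma) with mu1 = mu2 = 0
    y = rng.normal(0.0, sigma, size=r)
    pvals[i] = stats.ttest_ind(x, y).pvalue

hits = np.mean((pvals > 0.05 - eps) & (pvals <= 0.05))
print(f"fraction of 'significant' p-values near 0.05 under H0: {hits:.4f}")
```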

Fisher's Exact Test
This is an example where the p-value is a pseudo-p-value (see Example 8.3.30 in [4]). Let S1 and S2 be independent observations with S1 ∼ Binomial(n1, p1) and S2 ∼ Binomial(n2, p2). Consider testing H0 : p1 = p2 versus H1 : p1 ≠ p2. Under H0, if we let p be the common value of p1 = p2, the joint pmf of (S1, S2) depends on the nuisance parameter p. Conditioning on S = S1 + S2 = s eliminates this nuisance parameter, and the conditional pseudo-p-value is the sum of hypergeometric probabilities

p(s_1, s_2) = \sum_{j=s_1}^{\min(n_1, s)} \binom{n_1}{j} \binom{n_2}{s-j} \Big/ \binom{n_1+n_2}{s}, (15)

with s = s1 + s2.
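Since, conditionally on S1 + S2 = s, S1 follows a hypergeometric distribution, the pseudo-p-value (15) can be computed directly with scipy.stats.hypergeom:

```python
# The conditional pseudo-p-value (15) as a hypergeometric tail probability.
from scipy import stats

def fisher_pseudo_pvalue(s1, s2, n1, n2):
    """P(S1 >= s1 | S1 + S2 = s) under H0: p1 = p2 (one-sided)."""
    s = s1 + s2
    # hypergeom(M, n, N): population M = n1 + n2, "successes" n = n1, draws N = s
    return stats.hypergeom.sf(s1 - 1, M=n1 + n2, n=n1, N=s)

print(fisher_pseudo_pvalue(s1=8, s2=2, n1=10, n2=10))
```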

Remark 2.
It does not seem simple to estimate the appropriate ξ0 that best fits the pseudo-p-value in (15); in Figure 4, some arbitrary possibilities are shown.
It is important to note that in Bayesian tests with a point null hypothesis, it is not possible to use continuous prior densities, because such distributions (as well as the resulting posterior distributions) grant zero probability to the set {p1 = p2}. A reasonable approximation is to give the point null p1 = p2 a positive prior probability π0, and to the alternative p1 ≠ p2 the prior distribution π1 g1(p), where π1 = 1 − π0 and g1 is proper. One can think of π0 as the mass that would be assigned to the real null hypothesis H0 if one had not preferred to approximate it by the point null hypothesis. Now, if we take g1(p) = Beta(a, b) such that E(p) = a/(a + b) = p̂, the estimate of the common value, then the Bayes factor is

BF_Test = [B(a, b) / B(s + a, n1 + n2 − s + b)] · p̂^s (1 − p̂)^(n1+n2−s).

Figure 4 shows the posterior probability for the null hypothesis H0 when n = n1 + n2 = 50 and 100 for the Robust Lower Bound, the Bayes factor BFG_ξ0 (called P_BFGξ0), the Bayes factor BFG (called P_BFG), and the Bayes factor BF_Test (called P_BFTest). We can note that all the P_BFGξ0 are comparable, even though in the case ξ0 = 1 (P_BFG) the calibration uses a p-value and not a pseudo-p-value.
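A sketch of BF_Test, computed on the log scale for numerical stability; taking p̂ as the pooled MLE s/(n1 + n2) is our assumption here, as is the particular value of a used to center the Beta(a, b) prior at p̂:

```python
# BF_Test for the point-null approximation, with a Beta(a, b) prior g1
# centered at p_hat (i.e., a / (a + b) = p_hat). B(., .) is the Beta function.
import numpy as np
from scipy.special import betaln

def bf_test(s1, s2, n1, n2, a, b):
    s, n = s1 + s2, n1 + n2
    p_hat = s / n                     # ASSUMPTION: pooled MLE of the common value
    log_bf = (betaln(a, b) - betaln(s + a, n - s + b)
              + s * np.log(p_hat) + (n - s) * np.log(1 - p_hat))
    return np.exp(log_bf)

s1, s2, n1, n2 = 8, 2, 10, 10
p_hat = (s1 + s2) / (n1 + n2)
a = 2.0                               # illustrative choice
b = a * (1 - p_hat) / p_hat           # enforces E(p) = a / (a + b) = p_hat
print(f"BF_Test = {bf_test(s1, s2, n1, n2, a, b):.4f}")
```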

Linear Regression Models
Consider comparing two nested linear models, M3 : y_l = λ1 + λ2 x_l2 + λ3 x_l3 + ε_l and M2 : y_l = λ1 + λ2 x_l2 + ε_l, via the test H0 : M2 versus H1 : M3, with 1 ≤ l ≤ n, where the errors ε_l are assumed to be independent and normally distributed with unknown residual variance σ². According to Equation (3) in [6,7], the adaptive α for this comparison is expressed in terms of s²_3, the variance of x_l3; ρ_23, the correlation between x_l2 and x_l3; and the design matrix X* = (1_n | x_l2).
As an example, we analyze a data set taken from [14], which can be accessed at http://academic.uprm.edu/eacuna/datos.html (accessed on 13 January 2022). We want to predict the average mileage per gallon (denoted by mpg) of a set of n = 82 vehicles using four possible predictor variables: cabin capacity in cubic feet (vol), engine power (hp), maximum speed in miles per hour (sp), and vehicle weight in hundreds of pounds (wt).
Through the Bayes factors BFG and BFL, we want to choose the best model for predicting the average mileage per gallon by calculating the posterior probability of the null hypothesis of the following test: H0 : M2 : mpg_l = λ1 + λ2 wt_l + ε_l vs. H1 : M3 : mpg_l = λ1 + λ2 wt_l + λ3 sp_l + ε_l. With α = 0.05, q = 1, and j = 3, the posterior probabilities for the null hypothesis H0 are P_BFL = 0.9253192 and P_BFG = 0.7209449.
The use of this posterior probability in both cases will change the inference, since the p-value of the F test is p = 0.0325, which is smaller than 0.05.
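The frequentist side of this comparison can be reproduced with statsmodels; the column names (mpg, wt, sp) follow the text, while the local file name cars.csv is an assumption about how the linked data set was saved:

```python
# F-test between the nested models M2 and M3 fitted to the mileage data.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

cars = pd.read_csv("cars.csv")                  # hypothetical local copy
m2 = smf.ols("mpg ~ wt", data=cars).fit()       # H0: M2
m3 = smf.ols("mpg ~ wt + sp", data=cars).fit()  # H1: M3
print(anova_lm(m2, m3))   # the text reports p = 0.0325 for this F-test
```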

Findley's Counterexample
Consider the following simple linear model [15]: y_l = θ x_l + ε_l, with x_l = 1/√l, l = 1, 2, 3, . . . , n, where we compare the models H0 : θ = 0 and H1 : θ ≠ 0. This is a classical and challenging counterexample against BIC and the Principle of Parsimony. In [7], BIC is shown to be inconsistent in this problem, while PBIC is shown to be consistent.
Here, we show through the posterior probabilities of the null hypothesis that the Bayes factor BFG (based on BIC) is inconsistent, while the Bayes factor BFL (based on PBIC) is consistent. We perform the analysis in two contexts: first, when n grows and α = 0.05 or α = 0.01 is fixed; second, when n is fixed and 0 < α < 0.05. Figures 5 and 6 show, through the posterior probability of the null hypothesis H0, the consistency of the Bayes factor based on PBIC (P_BFL), as well as the inconsistency of the Bayes factor based on BIC (P_BFG).
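A small simulation conveys why BIC struggles here: with x_l = 1/√l, the regression sum of squares grows like log n, exactly the rate of the BIC penalty, so the BIC difference between H0 and H1 does not diverge even when H1 is true. The sketch below is our own illustration, not the computation behind Figures 5 and 6:

```python
# Findley's counterexample with regressor x_l = 1/sqrt(l) and theta != 0.
import numpy as np

rng = np.random.default_rng(1)
theta = 1.0
for n in (10**3, 10**4, 10**5, 10**6):
    x = 1.0 / np.sqrt(np.arange(1, n + 1))
    y = theta * x + rng.standard_normal(n)
    theta_hat = (x @ y) / (x @ x)                     # least-squares estimate
    rss1 = np.sum((y - theta_hat * x) ** 2)           # H1: theta estimated
    rss0 = np.sum(y ** 2)                             # H0: theta = 0
    delta_bic = n * np.log(rss0 / rss1) - np.log(n)   # BIC(H0) - BIC(H1)
    print(n, f"BIC difference = {delta_bic:.2f}")     # fluctuates, never diverges
```

Even though H1 is true, the BIC difference merely fluctuates around zero as n grows, so the BIC-based procedure never accumulates decisive evidence for H1.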

Final Comments
1. Lower bounds have been an important development in giving practitioners alternatives to classical testing with fixed α levels. A deep-seated problem with the useful bound −e · p · log(p) is that it depends on the p-value, as it should, but it is static: it is not a function of the sample size n. This limitation makes the bound of little use for moderate to large sample sizes, which is precisely where a correction to p-values is most needed.
2. The approximation developed here, as a function of both the p-value and the sample size, has a distinct advantage over other approximations, such as BIC, in that it is a valid approximation for any sample size.
3. The (approximate) Bayes factors (9) and (11) are simple to use and provide results equivalent to sensible Bayes factors for the hypothesis tests. In this article, we extended the validity of the approximation to "pseudo-p-values", which are ubiquitous in statistical practice.
We hope that this development will give the practice of statistics tools that bring the posterior probability of hypotheses closer to everyday statistical practice, in which p-values (or pseudo-p-values) are calculated routinely. This allows an immediate and useful comparison between raw p-values and (approximate) posterior odds.