Adaptive Significance Levels in Tests for Linear Regression Models: The e-Value and P-Value Cases

The full Bayesian significance test (FBST) for precise hypotheses is a Bayesian alternative to the traditional significance tests based on p-values. The FBST is characterized by the e-value, an evidence index in favor of the null hypothesis (H). An important practical issue for the implementation of the FBST is to establish how small the evidence in favor of H must be in order to decide for its rejection. In this work, we present a method to find a cutoff value for the e-value in the FBST by minimizing the linear combination of the averaged type-I and type-II error probabilities for a given sample size and a given dimensionality of the parameter space. Furthermore, we compare our methodology with the results obtained from the test with adaptive significance level, which uses the capital-P P-value as a decision-making evidence measure. For this purpose, the scenario of linear regression models with unknown variance under the Bayesian approach is considered.


Introduction
The full Bayesian significance test (FBST) for precise hypotheses is presented in [1] as a Bayesian alternative to the traditional significance tests based on p-values. With the FBST, the authors introduce the e-value as an evidence index in favor of the null hypothesis (H). An important practical issue for the implementation of the FBST is to establish how small the evidence must be to decide to reject H ([2,3]). In that sense, the authors of [4] present loss functions such that the minimization of their posterior expected values characterizes the FBST as a Bayes test under a decision-theoretic approach. This procedure provides a cutoff point for the evidence that depends on the relative severity of the errors of rejecting and accepting H.
In the frequentist significance-test context, it is known that under certain conditions the p-value decreases as the sample size increases, in such a way that, with a single fixed significance level, the comparison of the p-value with that level usually leads to rejection of the null hypothesis ([5-9]). In the FBST procedure, the e-value exhibits behavior similar to that of the p-value when the sample size increases, which suggests that the cutoff point defining the rejection of H should depend on the sample size and (possibly) on other characteristics of the statistical model under consideration. However, the proposal of [4] does not study a loss function that explicitly takes the sample size into account.
To address the problem that, in the usual approach to hypothesis testing, changing the sample size influences the probability of rejecting or accepting the null hypothesis, the authors of [10], motivated by [11], suggest that the significance level in hypothesis testing should be a function of the sample size. Instead of setting a single significance level, the authors of [10] propose fixing the ratio of severity between type-I and type-II error probabilities based on the losses incurred in each case and, thus, given a sample size, defining the significance level that minimizes the linear combination of the decision error probabilities. The authors of [10] show that, proceeding this way, increasing the sample size makes the probabilities of both kinds of error and their linear combination decrease, while in most cases, when a single significance level independent of the sample size is set, only the type-II error probability decreases. The tests proposed in [10] rest on the same conceptual grounds as the usual tests for simple hypotheses based on the minimization of a linear combination of probabilities of decision errors, as presented in [12]. The authors of [10] thus extend, in a sense, the idea in [12] to composite and sharp hypotheses, following the initial work in [11].
Following the same line of work, the authors of [13,14] present a new hypothesis-testing procedure formulated from the ideas developed in previous works ([11,15-17]) and using a mixture of frequentist and Bayesian tools. This procedure introduces the capital-P P-value as a decision-making evidence measure and also includes an adaptive significance level, i.e., a significance level that is a function of the sample size. Such an adaptive significance level is obtained from the minimization of the linear combination of generalized type-I and type-II error probabilities. According to the authors of [14], the resulting hypothesis tests do not violate the likelihood principle and do not require any constraints on the dimensionalities of the sample space and parameter space. It should be noticed that the new test procedure is precisely the optimal decision rule for the problem of testing the simple hypotheses f_H against f_A. For this reason, such a procedure overcomes the drawback that increasing the sample size results in the rejection of a precise null hypothesis ([12]). Another important way of successfully dealing with this question is to take into account meaningful deviations from the parameter value that specifies the precise null hypothesis in the formulation of the hypothesis-testing problem ([18,19]).
On the other hand, linear models are probably the most widely used statistical models for establishing the influence of a set of covariates on a response variable. In that sense, the proper identification of the relevant variables in the model is an important issue in any scientific investigation and is an even more challenging task in the context of Big-Data problems. In addition to high dimensionality, in recent statistical learning problems it is common to find large datasets with thousands of observations. When the significance level is fixed, this may cause the hypothesis of nullity of the regression coefficients to be rejected most of the time simply because of the large sample size.
The main goal of our work is to determine, in the setting of linear regression models, how small the Bayesian evidence in the FBST should be in order to reject the null hypothesis and protect a decision-maker from the drawbacks mentioned above. Therefore, taking into account the concepts in [11,12] associated with optimal hypothesis tests, the conclusions in [10] about the relationship between significance levels and sample size, and the ideas developed recently by the authors of [13,14] related to adaptive significance levels, we present a method to find a cutoff point for the e-value by minimizing a linear combination of the averaged type-I and type-II error probabilities for a given sample size and a given dimensionality of the parameter space. For that purpose, the scenario of linear regression models with unknown variance under the Bayesian approach is considered. By providing an adaptive level for decision making and controlling the probabilities of both kinds of error, we intend to avoid the problems associated with the rejection of hypotheses on the regression coefficients when the sample size is very large. In addition to the e-value, we calculate the P-value as well as its corresponding adaptive significance levels in order to compare the decisions that can be made by performing the tests with each of these measures.

The Linear Regression Model with Unknown Variance
The identification of the relevant variables in linear models can be done through hypothesis-testing procedures involving the respective regression coefficients. In the conjugate Bayesian analysis of the normal linear regression model with unknown variance, it is possible to obtain expressions for the posterior distributions of the parameters and their respective marginals. Therefore, in this setting, the FBST can be used for testing whether one or more of the regression coefficients is null, which is the basis of one possible model-selection procedure. We first review the normal linear regression model

y = Xθ + ε,

where y = (y_1, ..., y_n)' is an n × 1 vector of observations, X = (x_1, ..., x_n)' is an n × p matrix of covariates, also called the design matrix, with x_i = (1, x_{i1}, ..., x_{i,p−1})', θ = (θ_1, ..., θ_p)' is a p × 1 vector of parameters (regression coefficients), and ε = (ε_1, ..., ε_n)' is an n × 1 vector of random errors. The model simply states that the conditional distribution of y given the parameters (θ, σ²) is the multivariate normal N_n(Xθ, σ²I_n). Therefore, the likelihood becomes

L(θ, σ² | y) = (2πσ²)^(−n/2) exp{−(1/(2σ²)) (y − Xθ)'(y − Xθ)}.   (2)

The natural conjugate prior distribution of (θ, σ²) is a p-variate normal-inverse gamma distribution with hyperparameters m_0, V_0, a_0, and b_0, denoted by (θ, σ²) ~ N_p IG(m_0, V_0, a_0, b_0). Combining it with the likelihood (2) gives the posterior distribution ([20-22])

(θ, σ²) | y ~ N_p IG(m*, V*, a_1, b_1),

where

V* = (V_0^(−1) + X'X)^(−1),   m* = V*(V_0^(−1) m_0 + X'y),
a_1 = a_0 + n/2,   b_1 = b_0 + (1/2)(y'y + m_0' V_0^(−1) m_0 − m*' (V*)^(−1) m*).

If X'X is non-singular, we can write m* = V*(V_0^(−1) m_0 + X'X θ̂), where θ̂ = (X'X)^(−1) X'y is the classical maximum likelihood, or least squares, estimator of θ.
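As an illustration, the conjugate update above can be sketched in Python. This is a minimal sketch of the standard normal-inverse-gamma update; the function name and interface are ours, not the authors':

```python
import numpy as np

def nig_posterior(y, X, m0, V0, a0, b0):
    """Conjugate update for the normal-inverse-gamma prior
    (theta, sigma^2) ~ N_p IG(m0, V0, a0, b0) under the model
    y | theta, sigma^2 ~ N_n(X theta, sigma^2 I_n).
    Returns the posterior hyperparameters (m_star, V_star, a1, b1)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(y)
    V0_inv = np.linalg.inv(V0)
    # V* = (V0^{-1} + X'X)^{-1},  m* = V* (V0^{-1} m0 + X'y)
    V_star = np.linalg.inv(V0_inv + X.T @ X)
    m_star = V_star @ (V0_inv @ m0 + X.T @ y)
    a1 = a0 + n / 2
    # b1 = b0 + (1/2)(y'y + m0' V0^{-1} m0 - m*' V*^{-1} m*)
    b1 = b0 + 0.5 * (y @ y + m0 @ V0_inv @ m0
                     - m_star @ np.linalg.inv(V_star) @ m_star)
    return m_star, V_star, a1, b1
```

With V_0 = I_p and m_0 = 0, the posterior mean m* shrinks the least squares estimate toward the prior mean, and a_1 and b_1 grow with the sample size and residual variation.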
See Appendix A for further explanation of the priors, posteriors, and conditional distributions for the linear regression models with unknown variance.

Adaptive Significance Levels in Linear Regression Coefficient Hypothesis Testing
In this section, we present the methodology to find a cutoff value for the evidence in the FBST as an adaptive significance level, and we also develop the procedure to calculate the P-value with its corresponding adaptive significance level, all in the context of hypothesis tests on linear regression coefficients in models with unknown variance, under the Bayesian point of view. For that purpose, it is first necessary to show how the Bayesian prior predictive densities under the null and alternative hypotheses are defined.

Prior Predictive Densities in Regression-Coefficient Hypothesis Testing
Let θ = (θ_1', θ_2')', with θ_1 = (θ_1, ..., θ_s)' and θ_2 = (θ_{s+1}, ..., θ_p)', so that θ_1 has s elements and θ_2 has r = p − s elements. Let ξ = (θ', σ²)' = (θ_1', θ_2', σ²)'; then Y | ξ ~ N_n(Xθ, σ²I_n), with ξ ∈ Ξ. We are interested in testing the hypotheses

H: θ_2 = 0   against   A: θ_2 ≠ 0.

Let Ξ_H and Ξ_A be the partition of the parameter space defined by the competing hypotheses H and A. Consider the prior density g(ξ) defined over the entire parameter space Ξ, and let f_H and f_A be the Bayesian prior predictive densities under the respective hypotheses. Both are probability density functions over the sample space Ω, as follows:

f_H(y) = ∫ f(y | Cθ_1, σ²) dP_H(θ_1, σ²),   where C_{(s+r)×s} = [I_s, 0_{s×r}]'.

Additionally,

f_A(y) = ∫ f(y | θ, σ²) dP_A(θ, σ²),

where P_H and P_A are the prior probability measures of ξ restricted to the sets Ξ_H and Ξ_A, respectively (more details can be seen in Appendix B).
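Since f_H averages the N_n(XCθ_1, σ²I_n) likelihood over the prior restricted to H, it can be approximated by simple Monte Carlo when draws from P_H are available. The sketch below is a hypothetical helper of ours, not part of the paper's implementation:

```python
import numpy as np

def prior_predictive_H_mc(y, X, C, prior_draws):
    """Monte Carlo estimate of the prior predictive density f_H(y):
    the average of the N_n(y; X C theta1, sigma^2 I_n) likelihood over
    draws (theta1, sigma^2) from the prior restricted to H."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(y)
    vals = []
    for theta1, sig2 in prior_draws:
        # residual under the restricted mean X C theta1 (theta2 = 0)
        r = y - X @ (C @ theta1)
        vals.append((2.0 * np.pi * sig2) ** (-n / 2)
                    * np.exp(-0.5 * (r @ r) / sig2))
    return float(np.mean(vals))
```

The same routine, applied with the full design X in place of XC and draws from P_A, gives an estimate of f_A(y).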

Evidence Index: e-Value
The full Bayesian significance test (FBST) was proposed in [1] for precise or "sharp" hypotheses (subsets of the parameter space with smaller dimension than that of the whole parameter space and, therefore, with null Lebesgue measure). It is based on the evidence in favor of the null hypothesis, calculated as the posterior probability of the complement of the highest posterior density (HPD) region tangent to the set that defines the null hypothesis (here we consider the usual HPD region with respect to the Lebesgue measure, even though it could be built by choosing any other dominating measure instead). Considering the concepts in [10,11] and the recent works [13,14] related to adaptive significance levels, we propose to establish a cutoff value k* for the e-value ev(H; y_0) in the FBST as a function of the sample size n and the dimensionality d of the parameter space, i.e., k* = k*(n, d) with k* ∈ [0, 1], such that k* minimizes the linear combination of the averaged type-I and type-II error probabilities, aα + bβ.

To describe the procedure in the context of the regression-coefficient hypothesis testing we are addressing, note that, under H, the posterior distribution of (θ_1, σ²) given θ_2 = 0 is an s-variate normal-inverse gamma, so the point under H at which the posterior density attains its maximum value can be calculated as

ξ* = arg sup_{ξ ∈ Ξ_H} g(ξ | y_0).

Thus, we get the tangential set to the null hypothesis,

T_{y_0} = {ξ ∈ Ξ : g(ξ | y_0) > g(ξ* | y_0)}.

The evidence in favor of H is calculated as the posterior probability of the complement of T_{y_0}, that is,

ev(H; y_0) = 1 − P(ξ ∈ T_{y_0} | y_0).   (8)

The evidence index, the e-value, in favor of a precise hypothesis thus considers all points of the parameter space that are less "probable" than some point in Ξ_H.
A large value of ev(H; y 0 ) means that the subset Ξ H lies in a high-probability region of Ξ, and, therefore, the data support the null hypothesis; on the other hand, a small value of ev(H; y 0 ) means that Ξ H is in a low-probability region of Ξ and the data would make us discredit the null hypothesis ( [23]).
The evidence in (8) can be approximately determined via Monte Carlo simulation. Generating M samples ξ^(1), ..., ξ^(M) from the posterior distribution ξ | y ~ N_p IG(m*, V*, a_1, b_1), we estimate the evidence through the expression

ev(H; y) ≈ (1/M) Σ_{j=1}^{M} I( g(ξ^(j) | y) ≤ g(ξ* | y) ).

The averaged error probabilities, expressed in terms of the predictive densities, are

α_φe = ∫_{Ψ_e} f_H(y) dy   and   β_φe = ∫_{Ψ_e^c} f_A(y) dy,

where Ψ_e = {y ∈ Ω : ev(H; y) ≤ k}; they can be estimated by Monte Carlo simulation as the proportions of samples generated from f_H falling in Ψ_e and of samples generated from f_A falling in its complement.
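The Monte Carlo estimator of the evidence can be sketched generically as follows. The names are ours: `log_post` evaluates the log posterior density g(·|y), `post_samples` are draws from the posterior, and `log_post_sup_H` is log g(ξ*|y), the log posterior maximum under H:

```python
import numpy as np

def e_value_mc(log_post, post_samples, log_post_sup_H):
    """Monte Carlo estimate of ev(H; y): the posterior probability of the
    complement of the tangential set T_y = {xi : g(xi|y) > g(xi*|y)},
    where xi* maximizes the posterior density under H. Works on the log
    scale for numerical stability."""
    vals = np.array([log_post(s) for s in post_samples])
    # fraction of posterior draws NOT in the tangential set
    return float(np.mean(vals <= log_post_sup_H))
```

A large returned value indicates that Ξ_H lies in a high posterior density region, supporting H, as discussed below.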
So, the adaptive cutoff value k* for ev(H; y) is the k that minimizes aα_φe + bβ_φe. The values a and b represent the relative seriousness of the two types of error or, equivalently, the relative prior preferences for the competing hypotheses. For example, if b/a = 1, β_φe and α_φe are treated as equally severe, whereas if b/a < 1, α_φe undergoes a more intense minimization than β_φe, which means that the type-I error is considered more serious than the type-II error and also indicates a prior preference for H.
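Once e-values have been computed for samples generated under f_H and under f_A, the minimization over k can be carried out by a simple grid search. A minimal sketch, with names of our choosing:

```python
import numpy as np

def adaptive_cutoff(ev_under_H, ev_under_A, a=1.0, b=1.0, grid=None):
    """Find the cutoff k* for ev(H; y) minimizing a*alpha(k) + b*beta(k),
    where alpha(k) = P(ev <= k | H) and beta(k) = P(ev > k | A) are
    Monte Carlo proportions computed from e-values of samples generated
    under f_H and f_A, respectively. Returns (k_star, minimal risk)."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 1001)
    ev_H = np.asarray(ev_under_H)
    ev_A = np.asarray(ev_under_A)
    # estimated linear combination of error probabilities at each k
    risks = [a * np.mean(ev_H <= k) + b * np.mean(ev_A > k) for k in grid]
    i = int(np.argmin(risks))
    return float(grid[i]), float(risks[i])
```

Setting a = b = 1 treats both errors as equally severe, the choice used in the simulation study below.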

Significance Index: P-Value
The authors of [13,14] present a new hypothesis-testing procedure using a mixture of frequentist and Bayesian tools. On the one hand, the procedure resembles a frequentist test, as it is based on the comparison of the P-value, a decision-making evidence measure, with an adaptive significance level. On the other hand, such an adaptive significance level is obtained from the minimization of a linear combination of generalized type-I and type-II error probabilities under a Bayesian perspective. As a result, it generally depends on both the null and alternative hypotheses and on the sample size n, as opposed to standard fixed significance levels. The new proposal may also be seen as a test for the simple hypotheses characterized by the predictive distributions f_H and f_A of Section 3.1 that minimizes a specific linear combination of probabilities of decision errors. It is then formally characterized by a cutoff for the Bayes factor (which takes the place of the likelihood ratio here) and therefore may prevent a decision-maker from rejecting the null hypothesis when the data seem to be clear evidence in its favor ([12]). It should be stressed that under the new proposal, a cutoff value for the Bayes factor is chosen in advance, and consequently no constraint is imposed exclusively on the probability of the error of the first kind. In this sense, the test in [13,14] completely departs from regular frequentist tests. From another angle, the Bayes factor may be seen as the ratio between the posterior odds in favor of the null hypothesis and its prior odds ([24]). Note that the quantity defined here is written with a capital P ("P-value") to distinguish it from the usual small-p p-value.
In the scenario of the linear regression model with unknown variance, the Bayes factor is the ratio between the two prior predictive densities (4) and (5),

BF(y) = f_H(y) / f_A(y).

Now, consider the test φ* that rejects H if and only if BF(y) < b/a. Among all tests φ, φ* minimizes the linear combination of the type-I and type-II error probabilities, aα_φ + bβ_φ. Here again, the values a and b represent the relative seriousness of the two types of error. To obtain the P-value at the point y_0 ∈ Ω, define the set Ψ_0 of sample points y for which the Bayes factors are smaller than or equal to the Bayes factor of the observed sample point y_0, that is,

Ψ_0 = {y ∈ Ω : BF(y) ≤ BF(y_0)}.
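Under the conjugate prior, both prior predictive densities are available in closed form (the marginal density of y under a normal-inverse-gamma prior is a multivariate Student-t), so BF(y) can be computed exactly rather than by simulation. The sketch below relies on that standard closed form; the function names and interfaces are ours:

```python
import math
import numpy as np

def nig_marginal(y, X, m0, V0, a0, b0):
    """Closed-form prior predictive density f(y) of y ~ N_n(X theta, sigma^2 I_n)
    when (theta, sigma^2) ~ N_p IG(m0, V0, a0, b0)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = len(y)
    V0_inv = np.linalg.inv(V0)
    V_star = np.linalg.inv(V0_inv + X.T @ X)
    m_star = V_star @ (V0_inv @ m0 + X.T @ y)
    a1 = a0 + n / 2
    b1 = b0 + 0.5 * (y @ y + m0 @ V0_inv @ m0
                     - m_star @ np.linalg.inv(V_star) @ m_star)
    # log f(y) = -(n/2) log(2 pi) + (1/2) log(|V*|/|V0|)
    #            + a0 log b0 - a1 log b1 + log Gamma(a1) - log Gamma(a0)
    log_f = (-n / 2 * math.log(2 * math.pi)
             + 0.5 * (np.linalg.slogdet(V_star)[1] - np.linalg.slogdet(V0)[1])
             + a0 * math.log(b0) - a1 * math.log(b1)
             + math.lgamma(a1) - math.lgamma(a0))
    return math.exp(log_f)

def bayes_factor(y, X, C, priors_H, priors_A):
    """BF(y) = f_H(y) / f_A(y): f_H uses the restricted design X @ C
    (theta2 = 0 under H), f_A uses the full design X."""
    mH, VH, aH, bH = priors_H
    mA, VA, aA, bA = priors_A
    return nig_marginal(y, np.asarray(X) @ C, mH, VH, aH, bH) \
        / nig_marginal(y, X, mA, VA, aA, bA)
```

Working on the log scale, as above, avoids underflow when n is large.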
Then, the P-value is the integral of the predictive density under H, f_H, over Ψ_0:

P-value(y_0) = ∫_{Ψ_0} f_H(y) dy.

Defining the set Ψ* of sample points y with Bayes factors smaller than or equal to b/a, i.e.,

Ψ* = {y ∈ Ω : BF(y) ≤ b/a},

the optimal averaged error probabilities from the generalized Neyman-Pearson lemma, which depend on the sample size, are given by

α_φ* = ∫_{Ψ*} f_H(y) dy   and   β_φ* = ∫_{Ψ*^c} f_A(y) dy.

In order to make a decision, the P-value is compared to the optimal adaptive significance level α_φ*. Then, when y_0 is observed, the hypothesis H is rejected if P-value(y_0) < α_φ*.
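Given Bayes factors computed for the observed sample and for samples generated from f_H and f_A, the P-value and the optimal averaged error probabilities can be estimated by simple proportions, as sketched below (hypothetical names of ours; a = b = 1 reproduces the equally-severe-errors case used in the simulation study):

```python
import numpy as np

def p_value_and_level(bf_obs, bf_under_H, bf_under_A, a=1.0, b=1.0):
    """Monte Carlo sketch of the decision quantities of the test in [13,14].
    bf_obs: Bayes factor BF(y0) of the observed sample;
    bf_under_H / bf_under_A: Bayes factors of samples drawn from f_H / f_A.
    Returns (P-value, alpha*, beta*):
      P-value = P(BF(Y) <= BF(y0) | H),
      alpha*  = P(BF(Y) <= b/a | H),
      beta*   = P(BF(Y) >  b/a | A)."""
    bf_H = np.asarray(bf_under_H)
    bf_A = np.asarray(bf_under_A)
    P = float(np.mean(bf_H <= bf_obs))
    alpha = float(np.mean(bf_H <= b / a))
    beta = float(np.mean(bf_A > b / a))
    return P, alpha, beta
```

H is then rejected when the returned P-value is smaller than alpha*, the adaptive significance level.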

Simulation Study
We developed a simulation study considering two models. The first model was

y_i = θ_1 + ε_i,   i = 1, ..., n,   (11)

where X = 1_n and θ = θ_1. The hypotheses to be tested were

H: θ_1 = 0   against   A: θ_1 ≠ 0.

The second model studied was

y_i = θ_1 + θ_2 x_{i1} + ε_i,   i = 1, ..., n,   (12)

where X = (x_1, ..., x_n)' is an n × p matrix of covariates with x_i = (1, x_{i1}, ..., x_{i,p−1})' and θ = (θ_1, θ_2)' is the p × 1 vector of coefficients. In this case, the hypotheses of interest were

H: θ_2 = 0   against   A: θ_2 ≠ 0.

The averaged error probabilities α_φ* and β_φ* were calculated using the Monte Carlo method, with samples generated from the prior predictive distributions under each hypothesis, each generated sample being denoted by y^(j) = (y_1^(j), ..., y_n^(j))'. In a first stage, we considered model (11), where θ = θ_1, and model (12), with θ = (θ_1, θ_2)'. Note that the dimensionality of the parameter space, denoted by d, is different in the two models: for model (11), d = 2 and for model (12), d = 3. Samples of size M = 1000 were generated for each model under the respective hypotheses and for different sample sizes between n = 10 and n = 5000. In model (12), the covariate x_{i1}, i = 1, ..., n, was generated from a standard normal distribution. Finally, to obtain the adaptive values α_φ* and β_φ*, the two types of error were considered equally severe, that is, a = b = 1.

Figure 1 shows the averaged error probabilities for the FBST as functions of k for a sample size n = 100. This was replicated for all sample sizes in order to numerically find the corresponding value k* that minimizes α_φe + β_φe. Tables 1 and 2 and Figures 2 and 3 present the k* and α_φ*P values as functions of n for each model. As can be seen, both values show a decreasing trend as the sample size increases. In the case of the cutoff value for the evidence, the results differ when the dimensionality of the parameter space changes. Thus, the value of k* depends not only on the sample size but also on the dimensionality of the parameter space; more specifically, it is greater when d is higher.
However, this does not occur with α_φ*P, which maintains almost the same values even as d increases. On the other hand, Figures 4 and 5 illustrate that, in all these models, the optimal averaged error probabilities and their linear combination also decrease with increasing sample size.
We choose a single random sample y_0 to calculate the e-value and the P-value for the models. Table 3 displays the results, the cases where H is rejected being represented by cells in boldface. It can be observed that the decision remains the same regardless of the index used. Table 3. Cutoff values k*, ev(H; y_0), and P-value(y_0) as functions of n, with d = 2 and d = 3.

As the second stage in our simulation study, we set two sample sizes, n = 60 and n = 120, to perform the tests for model (12), increasing the dimensionality of the parameter space. In that scenario, the vector of coefficients was θ = (θ_1', θ_2')' and the hypotheses to be tested were

H: θ_2 = 0   against   A: θ_2 ≠ 0.

By varying the dimension of the vector θ_1, the different models considered for each test were obtained. Tables 4 and 5 and Figures 6 and 7 show the k* and α_φ*P values as functions of d. For d = 2, the values correspond to model (11). We can say that, for a fixed hypothesis, the larger the dimensionality of the parameter space, the greater the value of k*. The α_φ*P value, in contrast, does not change significantly when the dimensionality of the parameter space increases, except when the number of parameters is very large in relation to the sample size. Table 4. Unknown-variance model cutoff values k* for ev(H; y) as a function of d, with n = 60 and n = 120. Figure 6. Unknown-variance model cutoff values k* for ev(H; y) as a function of d, with n = 60 and n = 120. Table 5. Optimal averaged type-I error probability (α_φ*) as a function of d, with n = 60 and n = 120.

Table 6 presents the e-value and P-value calculated for a single random sample y_0. Here, with the e-value, the null hypothesis is rejected less easily. This may be due to approximation error arising from the simulation process or to the fact that the evidence apparently converges to 1 as the dimensionality of the parameter space increases, in which case a more detailed study is required.
Table 6. Cutoff values k*, ev(H; y_0), and P-value(y_0) as functions of d, with n = 60 and n = 120.

Numerical Examples
In this section, we present two applications with real datasets. We choose a_0 = 3 and b_0 = 2 as parameters of the inverse gamma prior distribution for σ². Additionally, in the normal prior for θ given σ², m_0 = 0_{p×1} and V_0 = I_p are taken as parameters. The Monte Carlo approximations were made by generating samples of size M = 10,000.

Budget Shares of British Households Dataset
We select a dataset that draws 1519 observations from the 1980-1982 British Family Expenditure Surveys (FES) ([25]). In our application, we want to fit the model

y_i = θ_1 + θ_2 x_{i1} + θ_3 x_{i2} + θ_4 x_{i3} + θ_5 x_{i4} + ε_i.

We consider as explanatory variables, respectively, the total net household income, rounded to the nearest 10 UK pounds sterling (x_1), the budget share for alcohol expenditure (x_2), the budget share for fuel expenditure (x_3), and the age of the household head (x_4). We take the budget share for food expenditure as the dependent variable (y). All expenditures and income are measured in pounds sterling per week. Table 7 summarizes the results for the hypotheses H: θ_j = 0, j = 1, ..., 5, obtained by performing the test with the p-value at the 0.05 significance level and also with the e-value and the P-value with their respective adaptive significance levels. The cases where H is rejected are represented by cells in boldface; θ̂_Freq and θ̂_Bayes are, respectively, the classical maximum likelihood estimator and the Bayes estimator of θ. It can be seen that, unlike the p-value, the e-value and the P-value do not reject the hypothesis of nullity of the coefficient associated with the age of the household head. Table 8 shows the optimal averaged error probabilities using the e-value and the P-value. It can be noted that the values are very similar under both methodologies.

Boston Housing Dataset
We also take a dataset that contains information about housing values obtained from census tracts in the Boston Standard Metropolitan Statistical Area (SMSA) in 1970 ([26]). These data comprise 506 samples and 14 variables. The regression model we use is

y_i = θ_1 + θ_2 x_{i1} + ... + θ_{10} x_{i9} + ε_i.

We choose the following explanatory variables to fit our model: the per capita crime rate by town (x_1), the proportion of residential land zoned for lots over 25,000 sq. ft (x_2), the proportion of non-retail business acres per town (x_3), the average number of rooms per dwelling (x_4), the proportion of owner-occupied units built prior to 1940 (x_5), the weighted mean of distances to five Boston employment centers (x_6), the full-value property tax rate per $10,000 (x_7), the pupil-teacher ratio by town (x_8), and 1000(Bk − 0.63)², where Bk is the proportion of black residents by town (x_9). The dependent variable is the median value of owner-occupied homes (in thousands of dollars) in the census tract (y).
The results for the hypotheses H: θ_j = 0, j = 1, ..., 10, obtained by performing the tests with the p-value, the e-value, and the P-value, are summarized in Table 9. In this case, the e-value rejects the null hypotheses less often. The e-value does not reject the hypotheses of nullity of the coefficients associated with the proportion of residential land zoned for lots over 25,000 sq. ft and the proportion of non-retail business acres per town, while the p-value does. On the other hand, the P-value, unlike the p-value, does not reject the hypothesis for the proportion of residential land zoned for lots over 25,000 sq. ft, but it does for the intercept. As can be observed in Table 10, for these data, the optimal averaged error probability values are also very close.

Conclusions
In this work, we present a method to find a cutoff value k* for the Bayesian evidence in the FBST by minimizing the linear combination of the averaged type-I and type-II error probabilities for a given sample size n and a given dimensionality d of the parameter space, in the context of linear regression models with unknown variance under the Bayesian perspective. In that sense, we provide a solution to an existing problem in the usual approach to hypothesis testing based on fixed cutoffs for measures of evidence: increasing the sample size leads to rejection of the null hypothesis. Furthermore, we compare our results with those obtained using the test proposed by the authors of [13,14]. With our suggested cutoff value for the evidence in the FBST, as with the procedure proposed by the authors of [13,14], increasing the sample size implies that the probabilities of both kinds of optimal averaged errors and their linear combination decrease, unlike most cases where, by setting a single significance level independent of the sample size, only the type-II error probability decreases.
A detailed study is still needed for more complex models: the methodology we propose to determine the adaptive cutoff value for the evidence in the FBST could be extended to models with different prior specifications, which would involve, among other things, using approximate methods to find the prior predictive densities under the null and alternative hypotheses.