
Evidence and Credibility: Full Bayesian Significance Test for Precise Hypotheses

Instituto de Matemática e Estatística, Universidade de São Paulo, 05315-970, Brasil
Entropy 1999, 1(4), 99-110; https://doi.org/10.3390/e1040099
Received: 9 February 1999 / Accepted: 9 April 1999 / Published: 25 October 1999

Abstract

A Bayesian measure of evidence for precise hypotheses is presented. The intention is to give a Bayesian alternative to significance tests or, equivalently, to p-values. In fact, a set is defined in the parameter space and its posterior probability, its credibility, is evaluated. This set is the "Highest Posterior Density Region" that is "tangent" to the set that defines the null hypothesis. Our measure of evidence is the complement of the credibility of the "tangent" region.
Keywords: Bayes factor; numerical integration; global optimization; p-value; posterior density

1. Introduction

The objective of this paper is to provide a coherent Bayesian measure of evidence for precise null hypotheses. Significance tests [1] are regarded as procedures for measuring the consistency of data with a null hypothesis by the calculation of a p-value (tail area under the null hypothesis). [2] and [3] consider the p-value as a measure of evidence for the null hypothesis and present alternative Bayesian measures of evidence, the Bayes Factor and the posterior probability of the null hypothesis. As pointed out in [1], the first difficulty in defining the p-value is the way the sample space is ordered under the null hypothesis. [4] suggested a p-value that always takes the alternative hypothesis into account. Against each of these measures of evidence one can find a great number of counterarguments. The most important argument against Bayesian tests for precise hypotheses is presented by [5]. Arguments against the classical p-value abound in the literature. The book by [6] and its review by [7] present interesting and relevant arguments for statisticians to start thinking about new methods of measuring evidence. In more philosophical terms, [8] discusses the concept of evidence in great detail.

The method we suggest in the present paper rests on simple arguments and has a geometric interpretation. It can be easily implemented using modern numerical optimization and integration techniques. To illustrate the method we apply it to standard statistical problems with multinomial distributions. Also, to show its broad applicability, we consider the comparison of two gamma distributions, which has no simple solution with standard procedures and does not appear in regular textbooks. These examples make clear how the method should be used in most situations. The method is "Full" Bayesian and consists in the analysis of credible sets. By Full we mean that one needs only the posterior distribution, without the need for any adhockery, a term used by [8].

2. The Evidence Calculus

Consider the random variable D that, when observed, produces the data d. The statistical space is represented by the triplet (Ξ, Δ, Θ), where Ξ is the sample space, the set of possible values of d, Δ is the family of measurable subsets of Ξ, and Θ is the parameter space. We define now a prior model (Θ, B, π), a probability space defined over Θ. Note that this model has to be consistent, so that Pr(A | θ), A ∈ Δ, turns out to be well defined. As usual, after observing data d we obtain the posterior probability model (Θ, B, π_d), where π_d is the conditional probability measure on B given the observed sample point d. In this paper we restrict ourselves to the case where π_d has a probability density function.
To define our procedure we concentrate only on the posterior probability space (Θ, B, π_d). First we define T_ϕ as the subset of the parameter space where the posterior density is not smaller than ϕ:
$$T_\varphi = \{\theta \in \Theta \mid f(\theta) \ge \varphi\}.$$
The credibility of T ϕ is its posterior probability,
$$\kappa = \int_{T_\varphi} f(\theta \mid d)\, d\theta = \int_{\Theta} f_\varphi(\theta \mid d)\, d\theta$$
where f_ϕ(x) = f(x) if f(x) ≥ ϕ, and zero otherwise.
Now, we define f* as the maximum of the posterior density over the null hypothesis, attained at the argument θ * ,
$$\theta^* \in \arg\max_{\theta \in \Theta_0} f(\theta), \qquad f^* = f(\theta^*)$$
and define T* = T_{f*} as the set "tangent" to the null hypothesis H, whose credibility is κ*. Figure 1 and Figure 2 show the null hypothesis and the contour of the set T* for Examples 2 and 3 of Section 4.
The measure of evidence we propose in this article is the complement of the probability of the set T*. That is, the evidence of the null hypothesis is
$$\mathrm{Ev}(H) = 1 - \kappa^* = 1 - \pi_d(T^*).$$
If the probability of the set T* is “large”, it means that the null set is in a region of low probability and the evidence in the data is against the null hypothesis. On the other hand, if the probability of T* is “small”, then the null set is in a region of high probability and the evidence in the data is in favor of the null hypothesis.
Although the definition of evidence above is quite general, it was created with the objective of testing precise hypotheses. That is, a null hypothesis whose dimension is smaller than that of the parameter space, i.e. dim(Θ₀) < dim(Θ).

3. Numerical Computation

In this paper the parameter space, Θ, is always a subset of Rⁿ, and the hypothesis is defined as a further restricted subset Θ₀ ⊆ Θ ⊆ Rⁿ. Usually, Θ₀ is defined by vector-valued inequality and equality constraints:
$$\Theta_0 = \{\theta \in \Theta \mid g(\theta) \le 0 \ \wedge \ h(\theta) = 0\}.$$
Since we are working with precise hypotheses, we have at least one equality constraint, hence dim(Θ₀) < dim(Θ). Let f(θ) be the probability density function of the measure π_d, i.e.,
$$\pi_d(b) = \int_b f(\theta)\, d\theta.$$
The computation of the evidence measure defined in the last section is performed in two steps, a numerical optimization step, and a numerical integration step. The numerical optimization step consists of finding an argument θ * that maximizes the posterior density f ( θ ) under the null hypothesis. The numerical integration step consists of integrating the posterior density over the region where it is greater than f ( θ * ) . That is,
  • Numerical Optimization step:
$$\theta^* \in \arg\max_{\theta \in \Theta_0} f(\theta), \qquad \varphi = f^* = f(\theta^*)$$
  • Numerical Integration step:
$$\kappa^* = \int_{\Theta} f_\varphi(\theta \mid d)\, d\theta$$
where f_ϕ(x) = f(x) if f(x) ≥ ϕ, and zero otherwise.
Efficient computational algorithms are available for local and global optimization, as well as for numerical integration, in [9], [10], [11], [12], [13], and [14]. Computer codes for several such algorithms can be found in software libraries such as NAG and ACM, or at internet sites such as www.ornl.org.
We notice that the method used to obtain T* and to calculate κ * can be used under general conditions. Our purpose, however, is to discuss precise hypothesis testing, under absolute continuity of the posterior probability model, the case for which most solutions presented in the literature are controversial.
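The two numerical steps can be sketched in a few lines. The fragment below (Python with NumPy and SciPy; the helper names and the toy posterior are our illustration, not part of the paper) computes Ev(H) for a standard bivariate normal posterior and the precise hypothesis H: θ₁ + θ₂ = 2, a case with the known closed-form answer Ev(H) = e⁻¹ ≈ 0.368.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

# Toy posterior: standard bivariate normal density on Theta = R^2.
post = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))

# Optimization step: maximize f(theta) on Theta_0 = {theta1 + theta2 = 2}.
res = minimize(lambda th: -post.logpdf(th), x0=np.array([2.0, 0.0]),
               method="SLSQP",
               constraints=[{"type": "eq",
                             "fun": lambda th: th[0] + th[1] - 2.0}])
theta_star = res.x                 # should be close to (1, 1)
f_star = post.pdf(theta_star)      # tangential density level phi = f*

# Integration step: kappa* = posterior probability of T* = {f >= f*},
# estimated here by Monte Carlo sampling from the posterior itself.
rng = np.random.default_rng(0)
sample = rng.standard_normal((1_000_000, 2))
kappa_star = np.mean(post.pdf(sample) >= f_star)

ev = 1.0 - kappa_star              # evidence in favor of H
print(theta_star, ev)              # exact answer here is exp(-1) ~ 0.368
```

For a Gaussian posterior T* is an ellipsoid and κ* could be obtained exactly from a χ² tail; the Monte Carlo step is shown only because it mirrors the general procedure for posteriors without closed-form HPD regions.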

4. Examples

In the sequel we will discuss five examples with increasing computational difficulty. The first four are about the Multinomial model. The first example presents the test for a specific success rate in the standard binomial model, and the second is about the equality of two such rates. For these two examples the null hypotheses are linear restrictions of the original parameter spaces. The third example introduces the Hardy-Weinberg equilibrium hypothesis in a trinomial distribution. In this case the hypothesis is quadratic.
The fourth example considers the test of independence of two events in a 2×2 contingency table. In this case the parameter space has dimension three, and the null hypothesis, which is not linear, defines a set of dimension two.
Finally, the last example presents two parametric comparisons of two gamma distributions. Although straightforward in our paradigm, this test is not presented in standard statistical textbooks; we believe the reason for this gap in the literature is the non-existence of closed analytical forms for it. In order to compare our evidence measure fairly with standard tests, namely the chi-square tail p-value (pV), the Bayes Factor (BF), and the Posterior Probability (PP), we always assume a uniform prior distribution. In these examples the likelihood has a finite integral over the parameter space; hence the posterior density functions are proportional to the respective likelihood functions. In order to achieve better numerical stability we optimize a function proportional to the log-likelihood, L(θ), and make explicit use of its first and second derivatives (gradient and Hessian).
For the four examples concerning multinomial distributions we present the following quantities (Table 1, Table 2, and Table 3):
  • Our measure of evidence, Ev, for each d;
  • the p-value, pV, obtained by the χ² test, that is, the tail area;
  • the Bayes Factor,
    $$BF = \frac{\Pr\{\Theta_0\}}{1 - \Pr\{\Theta_0\}} \cdot \frac{\Pr\{d \mid \Theta_0\}}{\Pr\{d \mid \Theta \setminus \Theta_0\}}; \quad \text{and}$$
  • the posterior probability of H,
    $$PP = \Pr\{\Theta_0 \mid d\} = \left\{1 + (BF)^{-1}\right\}^{-1}.$$
For the definition of the Bayes Factor and properties we refer to [8] and [15].

4.1. Success rate in standard binomial model

This is a standard example of testing that a proportion θ is equal to a specific value p. Consider the random variable D, binomial with parameter θ and sample size n. Here we take n = 20 trials, p = 0.5, and d is the observed number of successes. The parameter space is the unit interval Θ = {0 ≤ θ ≤ 1}. The null hypothesis is defined as H : θ = p. For all possible values of d, Table 1 presents the figures needed to compare our measure with the standard ones. To compute the Bayes Factor, we consider a priori Pr{H} = Pr{θ = p} = 0.5 and a uniform density for θ under the "alternative" hypothesis A : θ ≠ p. That is,
$$BF = (n+1) \binom{n}{d}\, p^d (1-p)^{n-d}.$$
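Both Ev and BF in this example are easy to reproduce. In the sketch below (Python with SciPy; the helper names are ours), the posterior under a uniform prior is Beta(d + 1, n − d + 1), the optimization step reduces to evaluating the density at θ = p, and T* is an interval around the mode located by root-finding:

```python
from math import comb
from scipy.optimize import brentq
from scipy.stats import beta

n, p, d = 20, 0.5, 8
post = beta(d + 1, n - d + 1)      # posterior under a uniform prior
mode = d / n                       # posterior mode

# Optimization step: on H = {p} the supremum is just the density at p.
f_star = post.pdf(p)

# Integration step: T* = {theta : f(theta) >= f*} is an interval [a, b]
# around the mode; locate its endpoints by root-finding.
g = lambda t: post.pdf(t) - f_star
a = brentq(g, 1e-9, mode) if g(1e-9) < 0 else 0.0
b = brentq(g, mode, 1 - 1e-9) if g(1 - 1e-9) < 0 else 1.0
ev = 1.0 - (post.cdf(b) - post.cdf(a))

# Bayes Factor and posterior probability of H, as in the text.
bf = (n + 1) * comb(n, d) * p**d * (1 - p)**(n - d)
pp = 1.0 / (1.0 + 1.0 / bf)
print(round(ev, 2), round(bf, 2), round(pp, 2))   # compare Table 1, row d = 8
```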
Table 1. Standard binomial model.
 d    Ev    pV    BF    PP
 0   0.00  0.00  0.00  0.00
 1   0.00  0.00  0.00  0.00
 2   0.00  0.00  0.00  0.00
 3   0.00  0.00  0.02  0.02
 4   0.01  0.01  0.10  0.09
 5   0.02  0.03  0.31  0.24
 6   0.06  0.07  0.78  0.44
 7   0.16  0.18  1.55  0.61
 8   0.35  0.37  2.52  0.72
 9   0.64  0.65  3.36  0.77
10   1.00  1.00  3.70  0.79

4.2. Homogeneity test in 2×2 contingency table

This model is useful in many applications, such as the comparison of two communities with respect to disease incidence, consumer behavior, electoral preference, etc. Two samples are taken from two binomial populations, and the objective is to test whether the success rates are equal. Let x and y be the numbers of successes of two independent binomial experiments of sample sizes m and n, respectively. The posterior density for this multinomial model is
$$f(\theta \mid x, y, n, m) \propto \theta_1^{x}\, \theta_2^{m-x}\, \theta_3^{y}\, \theta_4^{n-y}.$$
The parameter space and the null hypothesis set are:
$$\Theta = \{\theta \mid 0 \le \theta \le 1,\ \theta_1 + \theta_2 = 1 \ \wedge\ \theta_3 + \theta_4 = 1\}$$
$$\Theta_0 = \{\theta \in \Theta \mid \theta_1 = \theta_3\}.$$
The Bayes Factor, considering a priori Pr{H} = Pr{θ₁ = θ₃} = 0.5 and uniform densities over Θ₀ and Θ ∖ Θ₀, is given in the equation below. See [16] and [17] for details and a discussion of its properties.
$$BF = \frac{\binom{m}{x}\binom{n}{y}}{\binom{m+n}{x+y}} \cdot \frac{(m+1)(n+1)}{m+n+1}$$
The left side of Table 2 presents figures to compare Ev(d) with the other standard measures for m = n = 20. Figure 1 presents H and T* for x = 10 and y = 4 with n = m = 20.
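As a check, the fragment below (Python; the function name is ours) evaluates the Bayes Factor formula above with exact binomial coefficients and reproduces two entries of the left side of Table 2:

```python
from math import comb

def bf_homogeneity(x, y, m, n):
    """Posterior odds of H: theta1 = theta3, with Pr{H} = 1/2 and uniform
    conditional priors, as in the formula for this example."""
    return (comb(m, x) * comb(n, y) / comb(m + n, x + y)
            * (m + 1) * (n + 1) / (m + n + 1))

print(round(bf_homogeneity(5, 5, 20, 20), 2))    # 3.05 in Table 2
print(round(bf_homogeneity(10, 4, 20, 20), 2))   # 0.41 in Table 2
```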
Table 2. Tests of homogeneity and Hardy-Weinberg equilibrium.
Homogeneity (m = n = 20)

 x    y    Ev    pV    BF    PP
 5    0   0.05  0.02  0.25  0.20
 5    1   0.18  0.08  0.87  0.46
 5    2   0.43  0.21  1.70  0.63
 5    3   0.71  0.43  2.47  0.71
 5    4   0.93  0.71  2.95  0.75
 5    5   1.00  1.00  3.05  0.75
 5    6   0.94  0.72  2.80  0.74
 5    7   0.77  0.49  2.31  0.70
 5    8   0.58  0.31  1.75  0.64
 5    9   0.39  0.18  1.21  0.55
 5   10   0.24  0.10  0.77  0.43
10    0   0.00  0.00  0.00  0.00
10    1   0.00  0.00  0.02  0.02
10    2   0.01  0.01  0.07  0.06
10    3   0.05  0.02  0.19  0.16
10    4   0.12  0.05  0.41  0.29
10    5   0.24  0.10  0.77  0.43
10    6   0.41  0.20  1.23  0.55
10    7   0.61  0.34  1.74  0.63
10    8   0.81  0.53  2.21  0.69
10    9   0.95  0.75  2.54  0.72
10   10   1.00  1.00  2.66  0.73
12    0   0.00  0.00  0.00  0.00
12    1   0.00  0.00  0.00  0.00
12    2   0.00  0.00  0.01  0.01
12    3   0.01  0.00  0.04  0.04
12    4   0.03  0.01  0.10  0.09
12    5   0.07  0.03  0.24  0.19
12    6   0.14  0.06  0.46  0.32
12    7   0.26  0.11  0.80  0.44
12    8   0.42  0.21  1.24  0.55
12    9   0.62  0.34  1.73  0.63
12   10   0.81  0.53  2.21  0.69

Hardy-Weinberg (n = 20)

x1   x3    Ev    pV    BF    PP
 1    2   0.01  0.00  0.01  0.01
 1    3   0.01  0.01  0.04  0.04
 1    4   0.04  0.02  0.11  0.10
 1    5   0.09  0.04  0.25  0.20
 1    6   0.18  0.08  0.46  0.32
 1    7   0.31  0.15  0.77  0.44
 1    8   0.48  0.26  1.16  0.54
 1    9   0.66  0.39  1.59  0.61
 1   10   0.83  0.57  2.00  0.67
 1   11   0.95  0.77  2.34  0.70
 1   12   1.00  0.99  2.55  0.72
 1   13   0.96  0.78  2.57  0.72
 1   14   0.84  0.55  2.39  0.71
 1   15   0.66  0.33  2.05  0.67
 1   16   0.47  0.16  1.58  0.61
 1   17   0.27  0.05  1.06  0.51
 1   18   0.12  0.00  0.58  0.37
 5    0   0.02  0.01  0.05  0.05
 5    1   0.09  0.04  0.25  0.20
 5    2   0.29  0.14  0.60  0.38
 5    3   0.61  0.34  1.00  0.50
 5    4   0.89  0.65  1.29  0.56
 5    5   1.00  1.00  1.34  0.57
 5    6   0.90  0.66  1.18  0.54
 5    7   0.66  0.39  0.89  0.47
 5    8   0.40  0.20  0.58  0.37
 5    9   0.21  0.09  0.32  0.24
 5   10   0.09  0.04  0.16  0.13
 9    0   0.21  0.09  0.73  0.42
 9    1   0.66  0.39  1.59  0.61
 9    2   0.99  0.91  1.77  0.64
 9    3   0.86  0.59  1.33  0.57
 9    4   0.49  0.26  0.74  0.43
 9    5   0.21  0.09  0.32  0.24
 9    6   0.06  0.03  0.11  0.10
 9    7   0.01  0.01  0.03  0.03
Figure 1. Homogeneity test with x = 10, y = 4 and n = m = 20.

4.3. Hardy-Weinberg equilibrium law

In this biological application there is a sample of n individuals, where x₁ and x₃ are the two homozygote sample counts and x₂ = n − x₁ − x₃ is the heterozygote sample count. θ = (θ₁, θ₂, θ₃) is the parameter vector. The posterior density for this trinomial model is
$$f(\theta \mid x) \propto \theta_1^{x_1}\, \theta_2^{x_2}\, \theta_3^{x_3}$$
The parameter space and the null hypothesis set are:
$$\Theta = \{\theta \ge 0 \mid \theta_1 + \theta_2 + \theta_3 = 1\}$$
$$\Theta_0 = \{\theta \in \Theta \mid \theta_3 = (1 - \sqrt{\theta_1})^2\}$$
The problem of testing the Hardy-Weinberg equilibrium law using the Bayes Factor is discussed in detail by [18] and [19].
The Bayes Factor, considering uniform priors over Θ₀ and Θ ∖ Θ₀, is given by the following expression:
$$BF = \frac{(n+2)!\; t!\; (2n-t)!\; 2^{x_2}}{(2n+1)!\; x_1!\; x_2!\; x_3!} \left[\frac{5}{6} - \frac{2\,(t+1)\,(2n-t+1)}{(2n+2)\,(2n+3)}\right]$$
Here t = 2 x 1 + x 2 is a sufficient statistic under H. This means that the likelihood under H depends on data d only through t.
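This expression can be verified numerically. The fragment below (Python; exact rational arithmetic via the standard library, function name ours) reproduces two Hardy-Weinberg entries of Table 2:

```python
from fractions import Fraction
from math import factorial

def bf_hardy_weinberg(x1, x3, n):
    """Bayes factor for the Hardy-Weinberg hypothesis, as in the text.
    Exact rational arithmetic avoids overflow in the large factorials."""
    x2 = n - x1 - x3
    t = 2 * x1 + x2                    # sufficient statistic under H
    front = Fraction(factorial(n + 2) * factorial(t) * factorial(2 * n - t)
                     * 2**x2,
                     factorial(2 * n + 1) * factorial(x1) * factorial(x2)
                     * factorial(x3))
    bracket = Fraction(5, 6) - Fraction(2 * (t + 1) * (2 * n - t + 1),
                                        (2 * n + 2) * (2 * n + 3))
    return float(front * bracket)

print(round(bf_hardy_weinberg(5, 5, 20), 2))    # 1.34 in Table 2
print(round(bf_hardy_weinberg(5, 10, 20), 2))   # 0.16 in Table 2
```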
The right side of Table 2 presents figures to compare Ev(d) with the other standard measures for n = 20. Figure 2 presents H and T* for x₁ = 5, x₃ = 10 and n = 20.

4.4. Independence test in a 2×2 contingency table

Suppose that a laboratory test is used to help in the diagnosis of a disease. It is of interest to check whether the test results are really related to the health condition of the patient. A patient chosen from a clinic is classified into one of the four states of the set
{ ( h , t ) | h , t = 0   or   1 }
in such a way that h indicates the occurrence or not of the disease and t indicates whether the laboratory test is positive or negative. For a sample of size n we record (x₀₀, x₀₁, x₁₀, x₁₁), the vector whose components are the sample frequencies of each of the possibilities of (h, t). The parameter space is the simplex
$$\Theta = \{(\theta_{00}, \theta_{01}, \theta_{10}, \theta_{11}) \mid \theta_{ij} \ge 0,\ \textstyle\sum_{i,j} \theta_{ij} = 1\}$$
and the null hypothesis, h and t independent, is defined by
$$\Theta_0 = \{\theta \in \Theta \mid \theta_{00} = \theta_{0\bullet}\, \theta_{\bullet 0},\ \ \theta_{0\bullet} = \theta_{00} + \theta_{01},\ \ \theta_{\bullet 0} = \theta_{00} + \theta_{10}\}$$
Figure 2. Hardy-Weinberg test with x₁ = 5, x₃ = 10 and n = 20.
The Bayes Factor for this case is discussed by [17] and has the following expression:
$$BF = \frac{\binom{x_{0\bullet}}{x_{00}} \binom{x_{1\bullet}}{x_{11}}}{\binom{n}{x_{\bullet 0}}} \cdot \frac{(n+2)\left\{(n+3) - (n+2)\left[P(1-P) + Q(1-Q)\right]\right\}}{4\,(n+1)}$$
where $x_{i\bullet} = x_{i0} + x_{i1}$, $x_{\bullet j} = x_{0j} + x_{1j}$, $P = x_{\bullet 0}/(n+2)$ and $Q = x_{0\bullet}/(n+2)$.
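As a numerical check, the fragment below (Python; the function name and the margin variables in the code are ours) evaluates this Bayes Factor for the entries of Table 3 and reproduces the printed values to within rounding:

```python
from math import comb

def bf_independence(x00, x01, x10, x11):
    """Bayes factor for independence in a 2x2 table, as in the text."""
    n = x00 + x01 + x10 + x11
    r0 = x00 + x01                 # row margin x_{0.}
    r1 = x10 + x11                 # row margin x_{1.}
    c0 = x00 + x10                 # column margin x_{.0}
    p = c0 / (n + 2)
    q = r0 / (n + 2)
    front = comb(r0, x00) * comb(r1, x11) / comb(n, c0)
    brace = ((n + 2) * ((n + 3) - (n + 2) * (p * (1 - p) + q * (1 - q)))
             / (4 * (n + 1)))
    return front * brace

for row in [(12, 6, 95, 35), (48, 25, 9, 10), (96, 50, 18, 20),
            (18, 5, 39, 30), (36, 10, 78, 60)]:
    print(row, round(bf_independence(*row), 2))   # compare Table 3
```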
Table 3. Test of independence.
x00  x01  x10  x11    Ev    pV    BF    PP
 12    6   95   35   0.96  0.57  4.73  0.83
 48   25    9   10   0.54  0.14  1.04  0.51
 96   50   18   20   0.24  0.04  0.50  0.33
 18    5   39   30   0.29  0.06  0.50  0.33
 36   10   78   60   0.06  0.01  0.11  0.10

4.5. Comparison of two gamma distributions

This model may be used when comparing two survival distributions, for example in medical procedures and pharmacological efficiency, component reliability, financial market assets, etc. Let [x₁,₁, x₁,₂, …, x₁,ₙ₁] and [x₂,₁, x₂,₂, …, x₂,ₙ₂] be samples of two gamma distributed survival times. The sufficient statistic for the gamma distribution is the vector [n, s, p]: the sample size, the sum of the observations, and their product. Let [α₁, β₁] and [α₂, β₂], all components positive, be the parameters of the two gamma distributions. The likelihood function is:
$$f(\alpha_1, \beta_1, \alpha_2, \beta_2 \mid \mathrm{data}) \propto \frac{\beta_1^{\alpha_1 n_1}}{\Gamma(\alpha_1)^{n_1}}\, \frac{\beta_2^{\alpha_2 n_2}}{\Gamma(\alpha_2)^{n_2}}\; p_1^{\alpha_1 - 1}\, e^{-\beta_1 s_1}\; p_2^{\alpha_2 - 1}\, e^{-\beta_2 s_2}$$
This likelihood function is integrable over the parameter space. In order to allow comparisons with classical procedures, we do not consider any informative prior; i.e., the likelihood function by itself defines the posterior density.
Table 4 presents times to failure of coin comparators, a component of gaming machines, of two different brands. An entrepreneur was offered the option of replacing brand 1 by the less expensive brand 2. The entrepreneur tested 10 coin comparators of each brand and computed the sample means and standard deviations. The gamma distribution fits this type of failure time nicely and was used to model the process. Denoting the gamma mean and standard deviation by μ = α/β and σ = √(μ/β), the first hypothesis to be considered is H′ : μ₁ = μ₂. The high evidence of H′, Ev(H′) = 0.89, corroborates the entrepreneur's decision to change supplier. Note that a naive comparison of the sample means could be misleading. In the same direction, the low evidence of H : μ₁ = μ₂ ∧ σ₁ = σ₂, Ev(H) = 0.01, indicates that the new brand should have smaller variation in time to failure. The low evidence of H suggests that costs could be further reduced by an improved maintenance policy [20].
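The optimization step for H′ can be sketched as follows (Python with SciPy; this is our illustration under the flat prior stated above, not the authors' code, and the data are taken from Table 4 as printed). Under H′ the common mean μ is a free parameter, so with αᵢ = μβᵢ the constrained maximization becomes an unconstrained one over (μ, β₁, β₂); the integration step, omitted here, would proceed as in Section 3.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Failure-time samples from Table 4, as printed.
x1 = np.array([39.27, 31.72, 12.33, 27.67, 56.66,
               28.32, 53.72, 29.71, 23.76, 33.55])
x2 = np.array([28.32, 53.72, 29.71, 23.76, 33.55,
               24.07, 33.79, 33.10, 26.93, 27.23])

def log_post(a1, b1, a2, b2):
    """Log posterior (= log likelihood, flat prior) of the two gamma samples."""
    out = 0.0
    for a, b, x in [(a1, b1, x1), (a2, b2, x2)]:
        out += (len(x) * (a * np.log(b) - gammaln(a))
                + (a - 1.0) * np.log(x).sum() - b * x.sum())
    return out

def moments(x):                        # moment estimators, used as start points
    m, v = x.mean(), x.var(ddof=1)
    return m * m / v, m / v            # (alpha, beta)

(a10, b10), (a20, b20) = moments(x1), moments(x2)

# Unconstrained maximum; log-parameterized to keep all parameters positive.
full = minimize(lambda v: -log_post(*np.exp(v)),
                np.log([a10, b10, a20, b20]), method="Nelder-Mead",
                options={"maxiter": 50000, "xatol": 1e-8, "fatol": 1e-8})

# Maximum under H': mu1 = mu2.  Writing a_i = mu * b_i with the common mean
# mu free turns the constrained problem into an unconstrained one.
h = minimize(lambda v: -log_post(np.exp(v[0]) * np.exp(v[1]), np.exp(v[1]),
                                 np.exp(v[0]) * np.exp(v[2]), np.exp(v[2])),
             np.log([(x1.mean() + x2.mean()) / 2.0, b10, b20]),
             method="Nelder-Mead",
             options={"maxiter": 50000, "xatol": 1e-8, "fatol": 1e-8})

log_f_star = -h.fun                    # log of the tangential level f* for H'
print(-full.fun, log_f_star)           # H' maximum cannot exceed the full one
```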
Table 4. Comparing two gamma distributions.

Brand 1 sample:
39.27  31.72  12.33  27.67  56.66
28.32  53.72  29.71  23.76  33.55
mean1 = 33.67    std1 = 13.33

Brand 2 sample:
28.32  53.72  29.71  23.76  33.55
24.07  33.79  33.10  26.93  27.23
mean2 = 29.25    std2 = 3.62

Evidence: Ev(H′) = 0.89,  Ev(H) = 0.01

5. Final Remarks

The theory presented in this paper grew out of the necessity of testing precise hypotheses on the behavior of software-controlled machines [21]. The hypotheses being tested are software requirements and specifications. The real machine software is not available, but the machine can be used for input-output black-box simulation. The authors had the responsibility of certifying whether gaming machines were working according to Brazilian law (requirements) and the manufacturer's game description (specifications). Many of these requirements and specifications can be formulated as precise hypotheses on contingency tables, like the simple cases of Examples 1, 2 and 4. The standard methodologies, in our opinion, were not adequate to our needs and responsibilities. The classical p-value does not consider the alternative hypothesis, which in our case is as important as the null hypothesis. Also, the p-value is the measure of a tail in the sample space, whereas our concerns are formulated in the parameter space. On the other hand, we like the idea of measuring the significance of a precise hypothesis.
The Bayes factor is indeed formulated directly in the parameter space, but it needs an ad hoc positive prior probability on the precise hypothesis. First, we had no criterion to assess the required positive prior probability. Second, we would be subject to Lindley's paradox, which privileges the null hypothesis [5], [22].
The methodology of evidence calculus based on credible sets presented in this paper is computed in the parameter space, considers only the observed sample, has the significance flavor of the p-value, and takes into account the geometry of the null hypothesis as a surface (manifold) embedded in the whole parameter space. Furthermore, this methodology takes into account only the location of the maximum likelihood under the null hypothesis, making it consistent with the juridical principle of the "benefit of the doubt". This methodology is also independent of the parametrization of the null hypothesis. This parametrization independence gives the methodology a geometric characterization, and is in sharp contrast with some well-known procedures, like the Fisher exact test [23].
Recalling Chapter 6 of [6], "...recognizing that likelihoods are the proper means for representing statistical evidence simplifies and unifies statistical analysis," the measure Ev(H) defined in this paper is in accord with this principle of Royall's.

References and Notes

  1. Cox, D.R. The role of significance tests. Scand. J. Statist. 1977, 4, 49–70. [Google Scholar]
  2. Berger, J.O.; Delampady, M. Testing precise hypothesis. Statistical Science 1987, 3, 315–352. [Google Scholar]
  3. Berger, J.O.; Boukai, B.; Wang, Y. Unified frequentist and Bayesian testing of a precise hypothesis. Statistical Science 1997, 12, 133–160. [Google Scholar]
  4. Pereira, C.A.B.; Wechsler, S. On the concept of p-value. Braz. J. Prob. Statist. 1993, 7, 159–177. [Google Scholar]
  5. Lindley, D.V. A statistical paradox. Biometrika 1957, 44, 187–192. [Google Scholar] [CrossRef]
  6. Royall, R. Statistical Evidence: A Likelihood Paradigm; Chapman Hall: London, 1997; p. 191. [Google Scholar]
  7. Vieland, V.J.; Hodge, S.E. Book Reviews: Statistical Evidence by R Royall (1997). Am. J. Hum. Genet. 1998, 63, 283–289. [Google Scholar]
  8. Good, I.J. Good thinking: The foundations of probability and its applications; University of Minnesota Press, 1983; p. 332. [Google Scholar]
  9. Fletcher, R. Practical Methods of Optimization; J Wiley: Essex, 1987; p. 436. [Google Scholar]
  10. Horst, R.; Pardalos, P.M.; Thoai, N. Introduction to Global Optimization; Kluwer Academic Publishers: Boston, 1995. [Google Scholar]
  11. Pintér, J.D. Global Optimization in Action. Continuous and Lipschitz Optimization: Algorithms, Implementations and Applications; Kluwer Academic Publishers: Boston, 1996. [Google Scholar]
  12. Krommer, A.R.; Ueberhuber, C.W. Computational Integration; SIAM: Philadelphia, 1998; p. 445. [Google Scholar]
  13. Nemhauser, G.L.; Rinnooy Kan, A.H.G.; Todd, M.J. Optimization, Handbooks in Operations Research; North-Holland: Amsterdam, 1989; Vol. 1, p. 709. [Google Scholar]
  14. Sloan, I.H.; Joe, S. Lattice Methods for Multiple Integration; Oxford University Press: Oxford, 1994; p. 239. [Google Scholar]
  15. Aitkin, M. Posterior Bayes Factors. J. R. Statist. Soc. B. 1991, 1, 111–142. [Google Scholar]
  16. Irony, T.Z.; Pereira, C.A.B. Exact test for equality of two proportions: Fisher×Bayes. J. Statist. Comp. Simulation 1986, 25, 93–114. [Google Scholar]
  17. Irony, T.Z.; Pereira, C.A.B. Bayesian Hypothesis test: Using surface integrals to distribute prior information among hypotheses. Resenhas 1986, 2, 27–46. [Google Scholar]
  18. Pereira, C.A.B.; Rogatko, A. The Hardy-Weinberg equilibrium under a Bayesian perspective. Braz. J. Genet. 1984, 7, 689–707. [Google Scholar]
  19. Montoya-Delgado, L.E.; Irony, T.Z.; Pereira, C.A.B.; Whittle, M. Unconditional exact test for the Hardy-Weinberg law. Submitted for publication 1998. [Google Scholar]
  20. Marshall, A.; Proschan, F. Classes of distributions applicable in replacement, with renewal theory implications. Proc. 6th Berkeley Symp. Math. Statist. Prob. 1972, 395–415. [Google Scholar]
  21. Pereira, C.A.B.; Stern, J.M. A Dynamic Software Certification and Verification Procedure. Proc. ISAS'99 - International Conference on Information Systems Analysis and Synthesis 1999, II, 426–435. [Google Scholar]
  22. Lindley, D.V. The Bayesian approach. Scand. J. Statist. 1978, 5, 1–26. [Google Scholar]
  23. Pereira, C.A.B.; Lindley, D.V. Examples questioning the use of partial likelihood. The Statistician 1987, 36, 15–20. [Google Scholar] [CrossRef]