# Bayesian Testing of a Point Null Hypothesis Based on the Latent Information Prior

Fumiyasu Komaki 1,2
1 Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
2 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
Entropy 2013, 15(10), 4416-4431; https://doi.org/10.3390/e15104416
Received: 9 August 2013 / Revised: 16 September 2013 / Accepted: 10 October 2013 / Published: 17 October 2013

## Abstract

Bayesian testing of a point null hypothesis is considered. The null hypothesis is that an observation, x, is distributed according to the normal distribution with mean zero and known variance $\sigma^2$. The alternative hypothesis is that x is distributed according to a normal distribution with an unknown nonzero mean, μ, and variance $\sigma^2$. The testing problem is formulated as a prediction problem. Bayesian testing based on priors constructed by using conditional mutual information is investigated.

## 1. Introduction

We investigate a problem of testing a point null hypothesis from the viewpoint of prediction. The null hypothesis, $H_0$, is that an observation, x, is distributed according to the normal distribution, $N(0, \sigma^2)$, with mean zero and variance $\sigma^2$, and the alternative hypothesis, $H_1$, is that x is distributed according to a normal distribution $N(\mu, \sigma^2)$ with unknown nonzero mean μ and variance $\sigma^2$. The variance, $\sigma^2$, is assumed to be known. This simple testing problem has various essential aspects in common with more general testing problems and has been discussed by many researchers. An essential part of our discussion in the present paper holds for other testing problems based on more general models.
The assumption that the sample size is one is not essential. When we have N observations $x_1, x_2, \ldots, x_N$ from $N(0, \sigma^2)$ or $N(\mu, \sigma^2)$, the sufficient statistic $\bar{x} = \sum_{i=1}^{N} x_i / N$ is distributed according to $N(0, \sigma^2/N)$ under $H_0$ or $N(\mu, \sigma^2/N)$ under $H_1$, respectively. Then, the null hypothesis is that $\bar{x}$ is distributed according to $N(0, \tilde{\sigma}^2)$, and the alternative hypothesis is that $\bar{x}$ is distributed according to $N(\tilde{\mu}, \tilde{\sigma}^2)$ $(\tilde{\mu} \neq 0)$, where $\tilde{\sigma}^2 := \sigma^2/N$ and $\tilde{\mu} := \mu$. Thus, the testing problem with sample size N is essentially the same as that with sample size one. From now on, the variance, $\sigma^2$, is set to one without loss of generality.
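This reduction can be checked numerically: the log likelihood ratio of $H_1$ against $H_0$ computed from the full sample coincides with the one computed from $\bar{x}$ alone, once the variance is rescaled to $\sigma^2/N$. A small sketch (the sample size, random seed and $\mu = 0.8$ are illustrative choices, not from the paper):

```python
import random

def log_lr_full(xs, mu, sigma2):
    """Log likelihood ratio of N(mu, sigma2) against N(0, sigma2), full sample."""
    return sum((x * x - (x - mu) ** 2) / (2.0 * sigma2) for x in xs)

def log_lr_mean(xbar, n, mu, sigma2):
    """The same ratio from the sufficient statistic xbar ~ N(., sigma2 / n)."""
    s2 = sigma2 / n
    return (xbar * xbar - (xbar - mu) ** 2) / (2.0 * s2)

random.seed(0)
xs = [random.gauss(0.8, 1.0) for _ in range(50)]
xbar = sum(xs) / len(xs)
# The two quantities agree, so xbar carries all the information about mu.
print(log_lr_full(xs, 0.8, 1.0) - log_lr_mean(xbar, 50, 0.8, 1.0))
```

The agreement holds for every μ, which is exactly the sufficiency of $\bar{x}$.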
We formulate the testing problem as a prediction problem. Let $m = 0$ if $H_0$ is true and $m = 1$ if $H_1$ is true. Let w be the probability that $m = 0$, and let $\pi(d\mu)$ be the prior probability measure of μ. The probability, w, is set to $1/2$ in many previous studies, and the choice of $\pi(d\mu)$ is discussed; see, e.g., and the references therein. The objective is to predict m by using a Bayesian predictive distribution, $p_{w,\pi}(m \mid x)$, depending on the prior $\pi(d\mu)$ and the observation, x.
Common choices of π are the normal prior $(1/\sqrt{2\pi\tau^2}) \exp(-\mu^2/2\tau^2)\, d\mu$ and the Cauchy prior $1/\{\pi\gamma(1 + \mu^2/\gamma^2)\}\, d\mu$, recommended by Jeffreys. Large values of the scale parameters τ and γ are sometimes taken to represent “ignorance” about μ. However, such a naive choice of scale parameter values can cause a serious problem known as the Jeffreys–Lindley paradox.
We choose $\pi(d\mu)$ from the viewpoint of prediction and construct a Bayesian predictive distribution to predict m based on an objectively chosen prior. In the testing problem, the variable, m, is predicted; the variable, x, is observed; and the parameter, μ, is neither observed nor predicted. The latent information prior $\pi^*$ is defined as a prior maximizing the conditional mutual information:
$$I_{m;\mu \mid x}(w, \pi) = \sum_{m=0}^{1} \iint p_{w,\pi}(x, \mu, m) \log \frac{p_{w,\pi}(m, \mu \mid x)}{p_{w,\pi}(m \mid x)\, p_{w,\pi}(\mu \mid x)}\, dx\, d\mu \qquad (1)$$
between m and μ given x.
The latent information prior introduced in  is an objective Bayes prior. An outline of the method based on it is as follows. First, a statistical problem is formulated as a prediction problem, in which x is the observed random variable, y is the random variable to be predicted and θ is the unknown parameter. Then, a prior $\pi(d\theta)$ that maximizes the conditional mutual information $I_{y;\theta \mid x}(\pi)$ between y and θ given x is adopted.
In Section 2, we consider the Kullback-Leibler loss for prediction corresponding to Bayesian testing. In Section 3, we obtain the latent information prior and discuss properties of Bayesian testing based on it. In Section 4, we compare the proposed testing based on the latent information prior with Bayesian testing based on the normal prior and the Cauchy prior.

## 2. Kullback-Leibler Loss of Predictive Densities

We consider the Kullback-Leibler loss of prediction corresponding to Bayesian testing. The Bayesian predictive density with respect to w and π is given by:
$$p_{w,\pi}(m=0 \mid x) = \frac{w\, p_0(x)}{w\, p_0(x) + (1-w)\, p_\pi(x)}$$
and
$$p_{w,\pi}(m=1 \mid x) = \frac{(1-w)\, p_\pi(x)}{w\, p_0(x) + (1-w)\, p_\pi(x)}$$
where:
$$p_0(x) = \phi(x; 0, 1) \quad \text{and} \quad p_\pi(x) = \int \phi(x; \mu, 1)\, \pi(d\mu)$$
and $\phi(x; \mu, \sigma)$ is the density function of the normal distribution, $N(\mu, \sigma^2)$.
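As an illustration, for a normal prior $N(0, \tau^2)$ the marginal $p_\pi(x)$ is available in closed form as the $N(0, 1 + \tau^2)$ density, so the predictive probability of the null can be computed directly. A minimal sketch, in which the helper names and the value $\tau = 5$ in the usage line are illustrative assumptions, not from the paper:

```python
import math

def phi(x, mu=0.0, sd=1.0):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sd * sd)) / (sd * math.sqrt(2.0 * math.pi))

def posterior_null(x, w, tau):
    """p_{w,pi}(m = 0 | x) for the normal prior N(0, tau^2); the marginal
    p_pi(x) is the N(0, 1 + tau^2) density in closed form."""
    p0 = phi(x)
    p_pi = phi(x, 0.0, math.sqrt(1.0 + tau * tau))
    return w * p0 / (w * p0 + (1.0 - w) * p_pi)

# Predictive probability of H0 at x = 2 under this illustrative prior.
print(posterior_null(2.0, 0.5, 5.0))
```

The closed-form marginal can be verified against numerical integration of $\int \phi(x; \mu, 1)\, \pi(d\mu)$.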
If the value of μ is known, then the alternative hypothesis, $H_1$: $N(\mu, 1)$, becomes a simple hypothesis, and the predictive distribution is given by the posterior:
$$p_w(m=0 \mid x, \mu) = \frac{w\, \phi(x; 0, 1)}{w\, \phi(x; 0, 1) + (1-w)\, \phi(x; \mu, 1)}$$
and:
$$p_w(m=1 \mid x, \mu) = \frac{(1-w)\, \phi(x; \mu, 1)}{w\, \phi(x; 0, 1) + (1-w)\, \phi(x; \mu, 1)}$$
To evaluate the performance of predictive densities, we adopt the Kullback-Leibler divergence:
$$\sum_{m=0}^{1} p_w(m \mid x, \mu) \log \frac{p_w(m \mid x, \mu)}{p_{w,\pi}(m \mid x)}$$
from $p_w(m \mid x, \mu)$ to $p_{w,\pi}(m \mid x)$ as a loss function.
The risk function is given by:
$$r_w(\mu, \pi) = \int p_w(x \mid \mu) \sum_{m=0}^{1} p_w(m \mid x, \mu) \log \frac{p_w(m \mid x, \mu)}{p_{w,\pi}(m \mid x)}\, dx = \sum_{m=0}^{1} w(m) \int p(x \mid m, \mu) \log \frac{p_w(m \mid x, \mu)}{p_{w,\pi}(m \mid x)}\, dx \qquad (8)$$
where $w(0) = w$ and $w(1) = 1 - w$. Here, $p_{w,\pi}(m \mid x, \mu)$ and $p_{w,\pi}(x \mid \mu)$ are denoted by $p_w(m \mid x, \mu)$ and $p_w(x \mid \mu)$, respectively, because they do not depend on π. The distribution of x does not depend on μ if $m = 0$, because $p(x \mid m=0, \mu) = \phi(x; 0, 1)$.
It is not fruitful to discuss decision theoretic properties, such as minimaxity, of the risk defined by:
$$-\sum_{m=0}^{1} w(m) \int p(x \mid m, \mu) \log p_{w,\pi}(m \mid x)\, dx$$
because it is easy to distinguish between $H_0$ and $H_1$ when $|\mu|$ is very large.
The Kullback-Leibler risk in Equation (8) corresponds to the regret-type quantity:
$$-\log p_{w,\pi}(m \mid x) + \log p_w(m \mid x, \mu)$$
which represents the loss incurred by not knowing the value of μ. By considering the minimaxity of the regret-type risk in Equation (8), several reasonable results are obtained.
Lemma 1.
The risk of a Bayesian predictive density, $p_{w,\pi}(m \mid x)$, is given by:
$$r_w(\mu; \pi) = w \int p_0(x) \log \frac{1 + \frac{1-w}{w} \frac{p_\pi(x)}{p_0(x)}}{1 + \frac{1-w}{w} \frac{p_0(x-\mu)}{p_0(x)}}\, dx + (1-w) \int p_0(x) \log \frac{1 + \frac{w}{1-w} \frac{p_0(x+\mu)}{p_\pi(x+\mu)}}{1 + \frac{w}{1-w} \frac{p_0(x+\mu)}{p_0(x)}}\, dx \qquad (11)$$
Proof. See the Appendix.   ☐
The risk function in Equation (11) is a continuous function of μ for every w and π.
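Lemma 1 makes the risk computable by one-dimensional quadrature once the marginal $p_\pi$ is available. A sketch evaluating Equation (11) on a fixed grid follows; the grid limits and step size are illustrative choices, and any marginal density can be passed in:

```python
import math

def phi(x, mu=0.0):
    """Standard normal density shifted to mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def risk(mu, w, p_pi, h=0.01, lim=12.0):
    """Kullback-Leibler risk r_w(mu; pi) of Equation (11) by Riemann quadrature.
    p_pi is the marginal density of x under the prior pi."""
    c = (1.0 - w) / w  # (1 - w) / w; its reciprocal is w / (1 - w)
    grid = [i * h - lim for i in range(int(2.0 * lim / h) + 1)]
    t1 = h * sum(phi(x) * math.log((1.0 + c * p_pi(x) / phi(x)) /
                                   (1.0 + c * phi(x, mu) / phi(x))) for x in grid)
    t2 = h * sum(phi(x) * math.log((1.0 + (phi(x + mu) / p_pi(x + mu)) / c) /
                                   (1.0 + (phi(x + mu) / phi(x)) / c)) for x in grid)
    return w * t1 + (1.0 - w) * t2
```

For instance, passing the closed-form $N(0, 1+\tau^2)$ marginal of a normal prior reproduces the shape of the risk curves discussed in Section 4.1; the risk is nonnegative and, for a symmetric prior, symmetric in μ.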
The Bayes risk with respect to a prior π of a Bayesian predictive density based on $\bar{\pi}$ is:
$$R_w(\pi; \bar{\pi}) = \int r_w(\mu, \bar{\pi})\, \pi(d\mu) = \sum_{m=0}^{1} \iint w(m)\, p(x \mid m, \mu) \log \frac{p_w(m \mid x, \mu)}{p_{w,\bar{\pi}}(m \mid x)}\, \pi(d\mu)\, dx = \sum_{m=0}^{1} \iint w(m)\, p(x \mid m, \mu) \log \frac{p_{w,\bar{\pi}}(m \mid x, \mu)\, p_{w,\bar{\pi}}(\mu \mid x)}{p_{w,\bar{\pi}}(m \mid x)\, p_{w,\bar{\pi}}(\mu \mid x)}\, \pi(d\mu)\, dx = \sum_{m=0}^{1} \iint w(m)\, p(x \mid m, \mu) \log \frac{p_{\bar{\pi}}(\mu \mid m, x)}{p_{w,\bar{\pi}}(\mu \mid x)}\, \pi(d\mu)\, dx \qquad (12)$$
It is known that an important relation:
$$\inf_{\bar{\pi}} R_w(\pi; \bar{\pi}) = R_w(\pi; \pi)$$
holds; see . Here, $R_w(\pi; \pi)$ coincides with the conditional mutual information, $I_{m;\mu \mid x}(w, \pi)$, defined by Equation (1), between m and μ given x.

## 3. Latent Information Priors

We obtain the latent information prior defined as a prior maximizing the conditional mutual information, $I_{m;\mu \mid x}(w, \pi)$. We restrict the original parameter space, $\mathbb{R}$, of μ to a compact subset, $K \subset \mathbb{R}$, for mathematical convenience. A typical choice is a bounded closed interval $K = [-b, b]$. If b is large enough, the testing problem $H_0: N(0, \sigma^2)$ versus $H_1: N(\mu, \sigma^2)$, $\mu \in [-b, b]$, is close to the original problem.
Let $P(K)$ and $P(\mathbb{R})$ be the spaces of all probability measures on K and $\mathbb{R}$, respectively, endowed with the weak convergence topology. Then, $P(K)$ is compact, since K is compact. It is easy to verify that the conditional mutual information, $I_{m;\mu \mid x}(w, \pi)$, is a continuous function of $w \in [0, 1]$ and $\pi \in P(K)$. Therefore, for fixed $w \in (0, 1)$, there exists a $\pi_w^*$ that attains the maximum of Equation (1), since $P(K)$ is compact. In the following, $\pi_w^*$ is denoted by $\pi^*$, omitting the subscript, w, when there is no confusion.
The Bayesian testing based on the latent information prior, $π * ∈ P ( K )$, has the following minimax property.
Theorem 1.
Let $\pi^* \in P(K)$ be the latent information prior. Then:
$$\inf_{\pi \in P(\mathbb{R})} \sup_{\mu \in K} r_w(\mu, \pi) = \sup_{\mu \in K} r_w(\mu, \pi^*) = I_{\mu;m \mid x}(w, \pi^*)$$
Proof.
It is sufficient to show the relations:
$$I_{\mu;m \mid x}(w, \pi^*) = R_w(\pi^*, \pi^*) = \inf_{\pi \in P(\mathbb{R})} R_w(\pi^*, \pi) \leq \sup_{\pi' \in P(K)} \inf_{\pi \in P(\mathbb{R})} R_w(\pi', \pi) \leq \inf_{\pi \in P(\mathbb{R})} \sup_{\pi' \in P(K)} R_w(\pi', \pi) = \inf_{\pi \in P(\mathbb{R})} \sup_{\mu \in K} r_w(\mu, \pi) \leq \sup_{\mu \in K} r_w(\mu, \pi^*) \leq R_w(\pi^*, \pi^*) \qquad (15)$$
In the previous section, we have seen the equalities $I_{\mu;m \mid x}(w, \pi) = R_w(\pi, \pi)$ and $R_w(\pi', \pi') = \inf_\pi R_w(\pi', \pi)$, corresponding to the first and second equalities in Equation (15). Thus, it is enough to show the last inequality, $\sup_\mu r_w(\mu, \pi^*) \leq R_w(\pi^*, \pi^*)$, since the remaining relations are obvious.
We prove the inequality by contradiction. Assume that there exists a value, $\xi \in K$, such that:
$$r_w(\xi, \pi^*) > R_w(\pi^*, \pi^*) \qquad (16)$$
Let $\pi_t = (1-t)\pi^* + t\delta_\xi$ $(0 \leq t \leq 1)$, where $\delta_\xi$ is the delta measure concentrated at ξ. Then, $\pi_t \in P(K)$. From Equations (12) and (16):
where we put $p_1(x \mid \mu) := p(x \mid m=1, \mu)$. However, $\max_{t \in [0,1]} R_w(\pi_t; \pi_t) = R_w(\pi_0; \pi_0) = R_w(\pi^*; \pi^*)$, because of the definition of $\pi^*$ and the fact that $\pi_t \in P(K)$. This is a contradiction. Thus, we have proven the desired result. ☐
The discussion in the proof is parallel to that for submodels of multinomial models in , although the testing problem is not included in the class considered there. A closely related discussion of the unconditional mutual information is given in Csiszár . See also [7,8].
We set $K = [-b, b]$ with $b = 7$ and consider two values, $0.5$ and $0.355$, of w. The latent information priors, $\pi_w^*$, for $w = 0.5$ and $w = 0.355$ are numerically obtained by using a generalized Arimoto-Blahut algorithm, the details of which will be discussed elsewhere. Here, $w = 0.5$ is the setting adopted in many previous studies, and $w = 0.355$ is the value maximizing $I_{m;\mu \mid x}(w, \pi_w^*)$.
The Arimoto-Blahut algorithm [9,10] is widely used in information theory to obtain the capacity of channels. A channel is defined as a conditional distribution, $p(y \mid \theta)$, of y given θ, where y and θ are random variables taking values in finite sets, $Y$ and Θ, respectively. Given a channel, $p(y \mid \theta)$, the mutual information, $I_{y;\theta}(\pi)$, between y and θ is a function of the distribution, $\pi(\theta)$, of θ. The maximum value, $\max_\pi I_{y;\theta}(\pi)$, of the mutual information as a function of π is called the capacity of the channel $p(y \mid \theta)$. The Arimoto-Blahut algorithm is an iterative algorithm for obtaining the capacity $\max_\pi I_{y;\theta}(\pi)$ and the corresponding distribution, $\pi(\theta)$, attaining the maximum value. The original Arimoto-Blahut algorithm cannot be directly applied to our problem, since we need to maximize the conditional mutual information, $I_{m;\mu \mid x}$, where x and μ are not discrete random variables, to obtain the latent information prior.
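The paper's generalized algorithm is deferred elsewhere, but the classical discrete-channel iteration it builds on can be sketched as follows. This is the standard textbook version, not the author's generalization, and the binary symmetric channel used in the usage note is an illustrative input:

```python
import math

def blahut_arimoto(Q, iters=200):
    """Classical Arimoto-Blahut iteration for a discrete memoryless channel.

    Q[t][y] = p(y | theta_t). Returns (capacity in bits, capacity-achieving
    input distribution over theta)."""
    n, k = len(Q), len(Q[0])
    p = [1.0 / n] * n  # start from the uniform input distribution
    for _ in range(iters):
        # marginal output distribution under the current input distribution
        q = [sum(p[t] * Q[t][y] for t in range(n)) for y in range(k)]
        # multiplicative update: p[t] is reweighted by exp(D(Q[t] || q))
        d = [math.exp(sum(Q[t][y] * math.log(Q[t][y] / q[y])
                          for y in range(k) if Q[t][y] > 0)) for t in range(n)]
        z = sum(p[t] * d[t] for t in range(n))
        p = [p[t] * d[t] / z for t in range(n)]
    # capacity = mutual information at the final input distribution, in bits
    q = [sum(p[t] * Q[t][y] for t in range(n)) for y in range(k)]
    cap = sum(p[t] * Q[t][y] * math.log2(Q[t][y] / q[y])
              for t in range(n) for y in range(k) if Q[t][y] > 0)
    return cap, p
```

For a binary symmetric channel with crossover probability 0.1, the iteration recovers the known capacity $1 - H(0.1)$ bits at the uniform input distribution.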
Figure 1. Latent information priors for (a) $w = 0.5$ and (b) $w = 0.355$.
Figure 1 shows the numerically obtained latent information priors. The priors have the form:
$$\pi_w^* = \frac{u}{2}(\delta_{-a} + \delta_a) + \frac{1-u}{2}(\delta_{-b} + \delta_b) \qquad (18)$$
The parameter values are $a = 1.21$, $b = 7$ and $u = 0.440$ when $w = 0.5$, and $a = 1.10$, $b = 7$ and $u = 0.393$ when $w = 0.355$.
Lemma 2 below gives the risk of Bayesian testing based on the prior in Equation (18).
Lemma 2.
Let:
$$\pi_{a,b,u} = \frac{u}{2}(\delta_{-a} + \delta_a) + \frac{1-u}{2}(\delta_{-b} + \delta_b)$$
where $a, b > 0$ and $0 \leq u \leq 1$. Then, the risk in Equation (8) is given by:
and the conditional mutual information in Equation (1) is given by:
The first and second terms in Equation (20) do not depend on π. The third term in Equation (20) does not depend on μ.
Figure 2 shows the risk functions of the latent information priors for $w = 0.5$ and $w = 0.355$. Note that $\max_{\mu \in [-b,b]} r_w(\mu, \pi^*)$ is attained at $\mu = a$ and b in both examples. This is consistent with the proof of Theorem 1, and it is numerically verified that the prior maximizes the conditional mutual information. Furthermore, we observe that the supremum value, $\sup_{\mu \in \mathbb{R}} r_w(\mu, \pi^*)$, of the risk without the restriction $\mu \in [-b, b]$ is only slightly larger than the maximum value, $\max_{\mu \in [-b,b]} r_w(\mu, \pi^*)$, with the restriction. The risk functions rapidly converge as $|\mu|$ exceeds seven.
Figure 2. Risk functions of Bayesian testing based on latent information priors for (a) $w = 0.5$ and (b) $w = 0.355$. When $w = 0.5$, $a = 1.21$ and $b = 7$. When $w = 0.355$, $a = 1.10$ and $b = 7$. The vertical dotted lines indicate the locations of a and b.
Since:
$$\sup_{\mu \in K} r(\mu, \pi^*) = \sup_{\pi' \in P(K)} \inf_{\pi \in P(K)} R(\pi', \pi) = \sup_{\pi' \in P(K)} \inf_{\pi \in P(\mathbb{R})} R(\pi', \pi) \leq \sup_{\pi' \in P(\mathbb{R})} \inf_{\pi \in P(\mathbb{R})} R(\pi', \pi) \leq \inf_{\pi \in P(\mathbb{R})} \sup_{\pi' \in P(\mathbb{R})} R(\pi', \pi) = \inf_{\pi \in P(\mathbb{R})} \sup_{\mu \in \mathbb{R}} r(\mu, \pi) \leq \sup_{\mu \in \mathbb{R}} r(\mu, \pi^*)$$
and $\sup_{\mu \in \mathbb{R}} r(\mu, \pi^*) - \sup_{\mu \in K} r(\mu, \pi^*)$ is small in our problem when $K = [-b, b]$ $(b = 7)$, the supremum value, $\sup_{\mu \in \mathbb{R}} r(\mu, \pi^*)$, of the risk function of the latent information prior, $\pi^*$, obtained under the parameter restriction $\mu \in [-7, 7]$, is only slightly larger than the minimax value, $\inf_{\pi \in P(\mathbb{R})} \sup_{\mu \in \mathbb{R}} r(\mu, \pi)$, without the restriction. We see in the next section that the suprema, $\sup_{\mu \in \mathbb{R}} r(\mu, \pi)$, of the risk functions of commonly used priors are much larger than that of $\pi^*$.
The discreteness of the latent information priors shown in Figure 1 is a remarkable feature. In Bayesian statistics, k-reference priors have been known to be discrete measures in many examples; see [11,12,13]. The k-reference prior is defined as a prior maximizing the mutual information between $x^k$ and θ when we have a set, $x^k$, of k independent observations, $x_1, \ldots, x_k$, from $p(x \mid \theta)$ in a parametric model, $\{p(x \mid \theta) \mid \theta \in \Theta \subset \mathbb{R}^d\}$. However, such discrete priors have not been widely used. Instead of k-reference priors, the reference priors introduced by Bernardo  have been used for many problems. Reference priors are not discrete and are defined by considering the limit as the sample size k goes to infinity. One main reason why discrete priors are not popular is that they are totally unacceptable from the viewpoint of subjective Bayes, in which priors are considered to represent prior belief about parameters.
Although they have not been widely used, discrete priors, such as latent information priors, are reasonable from the viewpoint of prediction and objective Bayes. Various statistical problems, including estimation and testing, can be formulated from the viewpoint of prediction, and priors can be constructed by considering the conditional mutual information. Thus, latent information priors, which depend on the choice of variables to be predicted, could play important roles in many statistical applications. Conditional mutual information is essential in information theory and has appeared naturally in several studies in statistics; see, e.g., [15,16]. Priors based on conditional mutual information and those based on unconditional mutual information are often quite different; see .
Bayesian testing based on latent information priors is free from the Jeffreys-Lindley paradox, since the priors are constructed by using conditional mutual information and depend properly on sample sizes. Posterior probabilities, $p_{w,\pi^*}(m=0 \mid x)$, are shown in Figure 3 and are compared with p-values of the two-sided test in Table 1. When $x = 2, 3$ and 4, the posterior probabilities are much smaller than the p-values of the two-sided test. Large differences between posterior probabilities and p-values have been widely observed and discussed in [1,17,18].
Figure 3. Posterior probabilities $p_{w,\pi^*}(m=0 \mid x)$ based on latent information priors for (a) $w = 0.5$ and (b) $w = 0.355$.
Table 1. Comparison of posterior probabilities and p-values.
| x | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| $p_{w=0.5}(m=0 \mid x)$ | 0.702 | 0.564 | 0.295 | 0.112 | 0.0217 |
| $p_{w=0.355}(m=0 \mid x)$ | 0.560 | 0.434 | 0.220 | 0.0867 | 0.0145 |
| p-value (two-sided test) | 1 | 0.317 | 0.0455 | 0.00267 | $6.33 \times 10^{-5}$ |
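The p-value row of Table 1 follows from the standard normal tail, since the two-sided p-value at an observed x is $2\{1 - \Phi(|x|)\} = \operatorname{erfc}(|x|/\sqrt{2})$. A quick sketch checking that row only (the posterior rows require the latent information prior itself):

```python
import math

def p_value_two_sided(x):
    """Two-sided p-value for H0: N(0, 1), i.e. 2 * (1 - Phi(|x|))."""
    return math.erfc(abs(x) / math.sqrt(2.0))

# p-values for x = 0, 1, 2, 3, 4, for comparison with the last row of Table 1.
print([p_value_two_sided(x) for x in range(5)])
```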

## 4. Other Common Priors

Discrete priors, including the latent information priors discussed in the previous section, have not been widely used in Bayesian statistics. Common priors for this testing problem are the normal prior and the Cauchy prior. Many statisticians seem to have believed that the Cauchy prior is slightly better than the normal prior; see, e.g., [1,2]. In this section, we evaluate the conditional mutual information for these priors and compare their performance to that of the latent information prior.

#### 4.1. The Normal Prior

The normal prior, $\phi(\mu; 0, \tau)$, is denoted by $N_\tau$. From Lemma 1, we have:
Thus, the conditional mutual information is given by:
The conditional mutual information is evaluated by numerical integration. When $w = 0.5$ and $w = 0.355$, the maximum values:
$$\max_\tau I_{m;\mu \mid x}(w=0.5, N_\tau) = 0.156 \quad \text{and} \quad \max_\tau I_{m;\mu \mid x}(w=0.355, N_\tau) = 0.166$$
of Equation (24) are attained at $\tau = 4.92$ and $\tau = 5.36$, respectively. The variations of the risk functions, $r_{w=0.5}(\mu, N_{\tau=4.92})$ and $r_{w=0.355}(\mu, N_{\tau=5.36})$, shown in Figure 4, are much larger than those of the risk functions of the latent information priors shown in Figure 2. Thus, the performance of Bayesian testing based on the normal prior is worse than that based on the latent information prior under the Kullback-Leibler loss.
Figure 4. Risk functions of Bayesian testing based on normal priors for (a) $w = 0.5$ and $\tau = 4.92$; and (b) $w = 0.355$ and $\tau = 5.36$. The functions have the symmetry $r_w(-\mu, N_\tau) = r_w(\mu, N_\tau)$ about the origin.

#### 4.2. The Cauchy Prior

The Cauchy prior, $1/\{\gamma\pi(1 + \mu^2/\gamma^2)\}$, is denoted by $C_\gamma$. Since the characteristic functions of $N(0, \sigma^2)$ and $C_\gamma$ are $\exp(-\sigma^2 t^2/2)$ and $\exp(-\gamma|t|)$, respectively, the characteristic function of the marginal density:
with respect to the Cauchy prior, $C_\gamma$, is given by:
The expression:
where $\operatorname{erfc}$ is the complementary error function defined by:
$$\operatorname{erfc}(z) = \frac{2}{\sqrt{\pi}} \int_z^\infty e^{-t^2}\, dt$$
obtained by the inverse transform of Equation (27), is useful for numerical computation; see  (p. 183) and . From Lemma 1, we have:
We numerically evaluate the conditional mutual information:
$$I_{m;\mu \mid x}(w, C_\gamma) = \int r_w(\mu; C_\gamma)\, \frac{1}{\pi\gamma(1 + \mu^2/\gamma^2)}\, d\mu \qquad (31)$$
by the Monte Carlo method. When $w = 0.5$ and $w = 0.355$, the maximum values:
$$\max_\gamma I_{m;\mu \mid x}(w=0.5, C_\gamma) = 0.161 \quad \text{and} \quad \max_\gamma I_{m;\mu \mid x}(w=0.355, C_\gamma) = 0.170$$
of Equation (31) are attained at $\gamma = 3.31$ and $\gamma = 3.63$, respectively. The risk functions $r_{w=0.5}(\mu, C_{\gamma=3.31})$ and $r_{w=0.355}(\mu, C_{\gamma=3.63})$ are shown in Figure 5. The variation of the risk function $r_{w=0.5}(\mu, C_{\gamma=3.31})$ is milder than that of the risk function $r_{w=0.5}(\mu, N_{\tau=4.92})$ based on the normal prior, and the inequality $\sup_\mu r_{w=0.5}(\mu, C_{\gamma=3.31}) < \sup_\mu r_{w=0.5}(\mu, N_{\tau=4.92})$ holds. Thus, the Cauchy prior is preferable to the normal prior from the viewpoint of the Kullback-Leibler loss. However, the variation of the risk function based on the latent information prior, shown in Figure 2, is much smaller than that of $r_{w=0.5}(\mu, C_{\gamma=3.31})$. Similar relations also hold when $w = 0.355$.
Figure 5. Risk functions of Bayesian testing based on Cauchy priors for (a) $w = 0.5$ and $\gamma = 3.31$; and (b) $w = 0.355$ and $\gamma = 3.63$. The functions have the symmetry $r_w(-\mu, C_\gamma) = r_w(\mu, C_\gamma)$ about the origin.
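The characteristic-function statement above also gives a direct numerical route to the marginal $p_{C_\gamma}(x)$: invert $\exp(-t^2/2)\exp(-\gamma|t|)$ by a cosine transform and check it against the convolution $\int \phi(x; \mu, 1)\, C_\gamma(d\mu)$. A sketch under illustrative quadrature settings (the complementary-error-function expression of Equation (27) is not reproduced here):

```python
import math

def cauchy(mu, g):
    """Density of the Cauchy prior C_gamma at mu."""
    return 1.0 / (math.pi * g * (1.0 + (mu / g) ** 2))

def phi(x, mu=0.0):
    """Standard normal density shifted to mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def marginal_cf(x, g, h=0.001, tmax=12.0):
    """p_C(x) by inverting its characteristic function exp(-t^2/2 - g|t|)."""
    n = int(tmax / h)
    return (h / math.pi) * sum(math.cos(t * x) * math.exp(-0.5 * t * t - g * t)
                               for t in (i * h for i in range(n + 1)))

def marginal_conv(x, g, h=0.01, lim=60.0):
    """The same density by direct convolution of phi with the Cauchy prior."""
    n = int(2.0 * lim / h)
    return h * sum(phi(x, m) * cauchy(m, g) for m in (i * h - lim for i in range(n + 1)))
```

The two routes agree to the accuracy of the quadrature, which makes either one usable inside the Monte Carlo evaluation of Equation (31).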

## 5. Conclusions

We discussed the use of latent information priors for Bayesian testing of a point null hypothesis. The testing problem was formulated as a prediction problem, and latent information priors were numerically obtained. The variations of the risk functions of latent information priors are much smaller than those of normal and Cauchy priors. Although the testing problem treated in the present paper is simple, the results may indicate that latent information priors could be useful for various problems, since many statistical problems can be formulated from the viewpoint of prediction.
When the parameter space is multidimensional, it becomes difficult to numerically obtain latent information priors, and some approximations need to be used. One possible approach is to use asymptotic methods, and another possible approach is to choose an approximating prior from a tractable subset of the set of all probability measures on the parameter space. These approaches require further investigation.

## Acknowledgments

This research was partially supported by a Grant-in-Aid for Scientific Research (23300104, 23650144) and by the Aihara Innovative Mathematical Modelling Project, the Japan Society for the Promotion of Science (JSPS) through the “Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program),” initiated by the Council for Science and Technology Policy (CSTP).

## Conflicts of Interest

The author declares no conflict of interest.

## References

1. Berger, J.O.; Sellke, T. Testing a point null hypothesis: The irreconcilability of p values and evidence. J. Am. Stat. Assoc. 1987, 82, 112–122. [Google Scholar] [CrossRef]
2. Jeffreys, H. Theory of Probability, 3rd ed.; Oxford University Press: Oxford, UK, 1961. [Google Scholar]
3. Lindley, D.V. A statistical paradox. Biometrika 1957, 44, 187–192. [Google Scholar] [CrossRef]
4. Komaki, F. Bayesian predictive densities based on latent information priors. J. Stat. Plan. Inference 2011, 141, 3705–3715. [Google Scholar] [CrossRef]
5. Aitchison, J. Goodness of prediction fit. Biometrika 1975, 62, 547–554. [Google Scholar] [CrossRef]
6. Csiszár, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158. [Google Scholar] [CrossRef]
7. Haussler, D. A general minimax result for relative entropy. IEEE Trans. Inf. Theory 1997, 43, 1276–1280. [Google Scholar] [CrossRef]
8. Grünwald, P.D.; Dawid, A.P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Stat. 2004, 32, 1367–1433. [Google Scholar]
9. Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20. [Google Scholar] [CrossRef]
10. Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473. [Google Scholar] [CrossRef]
11. Hartigan, J.A. Bayes Theory; Springer: New York, NY, USA, 1983. [Google Scholar]
12. Berger, J.; Bernardo, J.M.; Mendoza, M. On Priors that Maximize Expected Information. In Recent Developments in Statistics and Their Applications; Klein, J.P., Lee, J.C., Eds.; Freedom Press: Seoul, Korea, 1989; pp. 1–20. [Google Scholar]
13. Zhang, Z. Discrete Noninformative Priors. Ph.D. Dissertation, Department of Statistics, Yale University, New Haven, CT, USA, 1994. [Google Scholar]
14. Bernardo, J.M. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 1979, 41, 113–147. [Google Scholar]
15. Clarke, B.; Yuan, A. Partial information reference priors: Derivation and interpretations. J. Stat. Plan. Inference 2004, 123, 313–345. [Google Scholar] [CrossRef]
16. Ebrahimi, N.; Soofi, E.S.; Soyer, R. On the sample information about parameter and prediction. Stat. Sci. 2010, 25, 348–367. [Google Scholar] [CrossRef]
17. Edwards, W.; Lindman, H.; Savage, L.J. Bayesian statistical inference for psychological research. Psychol. Rev. 1963, 70, 193–242. [Google Scholar] [CrossRef]
18. Dickey, J.M. Is the tail area useful as an approximate Bayes factor? J. Am. Stat. Assoc. 1977, 72, 138–142. [Google Scholar] [CrossRef]
19. Temme, N.M. Error Functions, Dawson’s and Fresnel Integrals. In NIST Handbook of Mathematical Functions; Olver, F.W.J., Lozier, D.W., Boisvert, R.F., Clark, C.W., Eds.; Cambridge University Press: Cambridge, UK, 2010; pp. 159–171. [Google Scholar]
20. Poppe, G.P.M.; Wijers, C.M.J. Algorithm 680: Evaluation of the complex error function. ACM Trans. Math. Softw. (TOMS) 1990, 16. [Google Scholar] [CrossRef]

## Appendix. Proofs of Lemmas

Proof of Lemma 1. From Equation (8), we have:
$$r_w(\mu; \pi) = w \int p(x \mid m=0) \log \frac{p_w(m=0 \mid \mu, x)}{p_{w,\pi}(m=0 \mid x)}\, dx + (1-w) \int p(x \mid m=1, \mu) \log \frac{p_w(m=1 \mid \mu, x)}{p_{w,\pi}(m=1 \mid x)}\, dx \qquad (32)$$
because m and μ are independent. Since:
$$\frac{p_w(m=0 \mid \mu, x)}{p_{w,\pi}(m=0 \mid x)} = \frac{w\, p(x \mid m=0)}{w\, p(x \mid m=0) + (1-w)\, p(x \mid m=1, \mu)} \cdot \frac{w\, p(x \mid m=0) + (1-w)\, p_\pi(x \mid m=1)}{w\, p(x \mid m=0)} = \frac{1 + \frac{1-w}{w} \frac{p_\pi(x \mid m=1)}{p(x \mid m=0)}}{1 + \frac{1-w}{w} \frac{p(x \mid m=1, \mu)}{p(x \mid m=0)}}$$
and:
$$\frac{p_w(m=1 \mid \mu, x)}{p_{w,\pi}(m=1 \mid x)} = \frac{(1-w)\, p(x \mid m=1, \mu)}{w\, p(x \mid m=0) + (1-w)\, p(x \mid m=1, \mu)} \cdot \frac{w\, p(x \mid m=0) + (1-w)\, p_\pi(x \mid m=1)}{(1-w)\, p_\pi(x \mid m=1)} = \frac{\frac{w}{1-w} \frac{p(x \mid m=0)}{p_\pi(x \mid m=1)} + 1}{\frac{w}{1-w} \frac{p(x \mid m=0)}{p(x \mid m=1, \mu)} + 1}$$
we have:
$$r_w(\mu; \pi) = w \int p(x \mid m=0) \log \frac{1 + \frac{1-w}{w} \frac{p_\pi(x \mid m=1)}{p(x \mid m=0)}}{1 + \frac{1-w}{w} \frac{p(x \mid m=1, \mu)}{p(x \mid m=0)}}\, dx + (1-w) \int p(x \mid m=1, \mu) \log \frac{1 + \frac{w}{1-w} \frac{p(x \mid m=0)}{p_\pi(x \mid m=1)}}{1 + \frac{w}{1-w} \frac{p(x \mid m=0)}{p(x \mid m=1, \mu)}}\, dx = w \int p_0(x) \log \frac{1 + \frac{1-w}{w} \frac{p_\pi(x)}{p_0(x)}}{1 + \frac{1-w}{w} \frac{p_0(x-\mu)}{p_0(x)}}\, dx + (1-w) \int p_0(x) \log \frac{1 + \frac{w}{1-w} \frac{p_0(x+\mu)}{p_\pi(x+\mu)}}{1 + \frac{w}{1-w} \frac{p_0(x+\mu)}{p_0(x)}}\, dx \qquad (35)$$
☐
Proof of Lemma 2. Since:
we have:
From Lemma 1, we have:
The conditional mutual information is:
From Equations (38) and (39), we obtain the desired result.     ☐

## Share and Cite

Komaki, F. Bayesian Testing of a Point Null Hypothesis Based on the Latent Information Prior. Entropy 2013, 15, 4416-4431. https://doi.org/10.3390/e15104416
