Bayesian Testing of a Point Null Hypothesis Based on the Latent Information Prior

Bayesian testing of a point null hypothesis is considered. The null hypothesis is that an observation, x, is distributed according to the normal distribution with a mean of zero and known variance σ². The alternative hypothesis is that x is distributed according to a normal distribution with an unknown nonzero mean, µ, and the same variance σ². The testing problem is formulated as a prediction problem. Bayesian testing based on priors constructed by using conditional mutual information is investigated.


Introduction
We investigate a problem of testing a point null hypothesis from the viewpoint of prediction. The null hypothesis, H_0, is that an observation, x, is distributed according to the normal distribution, N(0, σ²), with a mean of zero and variance σ², and the alternative hypothesis, H_1, is that x is distributed according to a normal distribution, N(µ, σ²), with unknown nonzero mean µ and variance σ². The variance, σ², is assumed to be known. This simple testing problem has various essential aspects in common with more general testing problems and has been discussed by many researchers. An essential part of our discussion in the present paper holds for other testing problems based on more general models.
The assumption that the sample size is one is not essential. When we have N observations x_1, x_2, . . ., x_N from N(0, σ²) or N(µ, σ²), the sufficient statistic x̄ = ∑_{i=1}^N x_i / N is distributed according to N(0, σ²/N) under H_0 or N(µ, σ²/N) under H_1, respectively. Then, the null hypothesis is that x̄ is distributed according to N(0, σ̃²), and the alternative hypothesis is that x̄ is distributed according to N(µ, σ̃²) (µ ≠ 0), where σ̃² := σ²/N. Thus, the testing problem with sample size N is essentially the same as that with sample size one. From now on, the variance, σ², is set to be one without loss of generality.
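As a quick numerical sanity check of this reduction (a sketch; the sample size, mean and variance below are illustrative values, not taken from the paper), one can simulate samples of size N repeatedly and confirm that their mean behaves like a single draw from N(µ, σ²/N):

```python
import numpy as np

rng = np.random.default_rng(0)
N, mu, sigma = 25, 0.8, 2.0

# 100,000 replications of a sample of size N from N(mu, sigma^2);
# the sample mean should behave like one draw from N(mu, sigma^2 / N)
xbar = rng.normal(mu, sigma, size=(100_000, N)).mean(axis=1)
print(xbar.mean())  # close to mu = 0.8
print(xbar.var())   # close to sigma**2 / N = 0.16
```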
We formulate the testing problem as a prediction problem. Let m = 0 if H_0 is true and m = 1 if H_1 is true. Let w be the probability that m = 0, and let π(dµ) be the prior probability measure of µ. The probability, w, is set to be 1/2 in many previous studies, and the choice of π(dµ) is discussed; see, e.g., [1] and the references therein. The objective is to predict m by using a Bayesian predictive distribution, p_{w,π}(m | x), depending on the prior π(dµ) and the observation, x.
We choose π(dµ) from the viewpoint of prediction and construct a Bayesian predictive distribution to predict m based on an objectively chosen prior. In the testing problem, the variable, m, is predicted, the variable, x, is observed, and the parameter, µ, is neither observed nor predicted. The latent information prior π* [4] is defined as a prior maximizing the conditional mutual information, I_{m;µ|x}(w, π), between m and µ given x.
The latent information prior introduced in [4] is an objective Bayes prior. An outline of the method based on it is as follows. First, a statistical problem is formulated as a prediction problem, in which x is the observed random variable, y is the random variable to be predicted, and θ is the unknown parameter. Then, a prior, π(dθ), that maximizes the conditional mutual information, I_{y;θ|x}(π), between y and θ given x is adopted.
In Section 2, we consider the Kullback-Leibler loss for prediction corresponding to Bayesian testing. In Section 3, we obtain the latent information prior and discuss properties of Bayesian testing based on it. In Section 4, we compare the proposed testing based on the latent information prior with Bayesian testing based on the normal prior and the Cauchy prior.

Kullback-Leibler Loss of Predictive Densities
We consider the Kullback-Leibler loss of prediction corresponding to Bayesian testing. The Bayesian predictive density with respect to w and π is given by

p_{w,π}(m = 0 | x) = w p(x | m = 0) / { w p(x | m = 0) + (1 − w) p_π(x | m = 1) }

and

p_{w,π}(m = 1 | x) = 1 − p_{w,π}(m = 0 | x),

where

p(x | m = 0) = ϕ(x; 0, 1)  and  p_π(x | m = 1) = ∫ ϕ(x; µ, 1) π(dµ),

and ϕ(x; µ, σ) is the density function of the normal distribution, N(µ, σ²).
If the value of µ is known, then the alternative hypothesis, H_1: N(µ, 1), becomes a simple hypothesis, and the predictive distribution is given by the posterior

p_w(m = 0 | x, µ) = w ϕ(x; 0, 1) / { w ϕ(x; 0, 1) + (1 − w) ϕ(x; µ, 1) }

and

p_w(m = 1 | x, µ) = 1 − p_w(m = 0 | x, µ).

To evaluate the performance of predictive densities, we adopt the Kullback-Leibler divergence

D( p_w(· | x, µ) ∥ p_{w,π}(· | x) ) = Σ_m p_w(m | x, µ) log { p_w(m | x, µ) / p_{w,π}(m | x) }

from p_w(m | x, µ) to p_{w,π}(m | x) as a loss function.
The risk function is given by

r_w(µ, π) = ∫ p_w(x | µ) D( p_w(· | x, µ) ∥ p_{w,π}(· | x) ) dx = Σ_m w(m) ∫ p(x | m, µ) log { p_w(m | x, µ) / p_{w,π}(m | x) } dx,

where w(0) = w, w(1) = 1 − w and p_w(x | µ) = w ϕ(x; 0, 1) + (1 − w) ϕ(x; µ, 1). Here, p_{w,π}(m | x, µ) and p_{w,π}(x | µ) are denoted by p_w(m | x, µ) and p_w(x | µ), respectively, because they do not depend on π. The distribution of x does not depend on µ if m = 0, because p(x | m = 0, µ) = ϕ(x; 0, 1). It is not fruitful to discuss decision-theoretic properties, such as minimaxity, of the unmodified prediction risk, because it is easy to distinguish between H_0 and H_1 when |µ| is very large. The Kullback-Leibler risk in Equation (8) is a regret-type quantity: it measures the loss incurred by not knowing the value of µ. By considering the minimaxity of the regret-type risk in Equation (8), several reasonable results are obtained.
Lemma 1. The risk of a Bayesian predictive density, p_{w,π}(m | x), is given by

r_w(µ, π) = (1 − w) ∫ ϕ(x; µ, 1) log { ϕ(x; µ, 1) / p_π(x | m = 1) } dx − ∫ p_w(x | µ) log { p_w(x | µ) / p_{w,π}(x) } dx,

where p_{w,π}(x) = w ϕ(x; 0, 1) + (1 − w) p_π(x | m = 1).

Proof. See the Appendix.
The risk function in Lemma 1 is a continuous function of µ for every w and π.
The Bayes risk with respect to a prior π′ of a Bayesian predictive density based on π is

R_w(π′; π) := ∫ r_w(µ, π) π′(dµ).

It is known that the important relations

inf_π R_w(π′; π) = R_w(π′; π′)  and  R_w(π; π) = I_{m;µ|x}(w, π)

hold; see [5]. That is, the Bayes risk, R_w(π; π), coincides with the conditional mutual information, I_{m;µ|x}(w, π), defined by Equation (1), between m and µ given x.

Latent Information Priors
We obtain the latent information prior defined as a prior maximizing the conditional mutual information, I_{m;µ|x}(w, π). We restrict the original parameter space, R, of µ to a compact subset, K ⊂ R. Let P(K) and P(R) be the spaces of all probability measures on K and R, respectively, endowed with the weak convergence topology. Then, P(K) is compact, since K is compact. It is easy to verify that the conditional mutual information, I_{m;µ|x}(w, π), is a continuous function of w ∈ [0, 1] and π ∈ P(K). Therefore, there exists π*_w that attains the maximum of Equation (1) for fixed w ∈ (0, 1), since P(K) is compact. In the following, π*_w is denoted as π* by omitting the subscript, w, when there is no confusion.
The Bayesian testing based on the latent information prior, π* ∈ P(K), has the following minimax property.
Theorem 1. Let π* ∈ P(K) be the latent information prior. Then,

sup_{µ∈K} r_w(µ, π*) = inf_{π∈P(K)} sup_{µ∈K} r_w(µ, π).

Proof. It is sufficient to show the relations

inf_{π∈P(K)} sup_µ r_w(µ, π) ≥ inf_{π∈P(K)} R_w(π*; π) = R_w(π*; π*) = I_{m;µ|x}(w, π*) ≥ sup_µ r_w(µ, π*) ≥ inf_{π∈P(K)} sup_µ r_w(µ, π).

In the previous section, we have seen the equalities inf_π R_w(π*; π) = R_w(π*; π*) and R_w(π*; π*) = I_{m;µ|x}(w, π*), corresponding to the first and second equalities in Equation (15). Thus, it is enough to show the inequality, sup_µ r_w(µ, π*) ≤ R_w(π*; π*), since the remaining relations are obvious.
We prove the inequality by contradiction. Assume that there exists a value, ξ ∈ K, such that

r_w(ξ, π*) > R_w(π*; π*).

For 0 < t < 1, put

π_t := (1 − t) π* + t δ_ξ,

where δ_ξ is the delta measure concentrated at ξ. Then, π_t ∈ P(K).
From Equations (12) and (16), the one-sided derivative of the conditional mutual information at t = 0 satisfies

(d/dt) I_{m;µ|x}(w, π_t) |_{t=0+} ≥ r_w(ξ, π*) − R_w(π*; π*) > 0,

so that I_{m;µ|x}(w, π_t) > I_{m;µ|x}(w, π*) for sufficiently small t > 0. This contradicts the definition of π* and the fact that π_t ∈ P(K). Thus, we have proven the desired result.
The discussion in the proof is parallel to that for submodels of multinomial models in [4], although the testing problem is not included in the class considered there. A closely related discussion of the unconditional mutual information is given in Csiszár [6]. See also [7,8].
We set K = [−b, b] with b = 7 and consider two values, 0.5 and 0.355, of w. The latent information priors, π*_w, for w = 0.5 and w = 0.355 are numerically obtained by using a generalized Arimoto-Blahut algorithm, the details of which will be discussed elsewhere. Here, w = 0.5 is the setting adopted in many previous studies, and w = 0.355 is the value maximizing I_{m;µ|x}(w, π*_w).

The Arimoto-Blahut algorithm [9,10] is widely used in information theory to obtain the capacity of channels. A channel is defined to be a conditional distribution, p(y | θ), of y given θ, where y and θ are random variables taking values in finite sets, Y and Θ, respectively. If a channel, p(y | θ), is given, then the mutual information, I_{y;θ}(π), between y and θ is a function of the distribution, π(θ), of θ. The maximum value, max_π I_{y;θ}(π), of the mutual information as a function of π is called the capacity of the channel p(y | θ). The Arimoto-Blahut algorithm is an iterative algorithm to obtain the capacity max_π I_{y;θ}(π) and the corresponding distribution, π(θ), attaining the maximum value. The original Arimoto-Blahut algorithm cannot be directly applied to our problem, since we need to maximize the conditional mutual information, I_{m;µ|x}, where x and µ are not discrete random variables, to obtain the latent information prior.

Figure 1 shows the numerically obtained latent information priors. The priors are symmetric discrete distributions supported on the four points ±a and ±b, with mixing weight u between the two pairs. The parameter values are a = 1.21, b = 7 and u = 0.440 when w = 0.5, and a = 1.10, b = 7 and u = 0.393 when w = 0.355. Lemma 2 below gives the risk of Bayesian testing based on priors of this form.
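For reference, the classical Arimoto-Blahut iteration for a finite channel can be sketched as follows. This is the textbook algorithm for unconditional mutual information, not the generalized version used to compute π*; the binary symmetric channel example is ours, and the channel matrix is assumed strictly positive:

```python
import numpy as np

def blahut_arimoto(P, tol=1e-12, max_iter=10_000):
    """Channel capacity max_pi I(theta; y) for a finite channel
    P[i, j] = p(y = j | theta = i) with all entries positive,
    by the classical Arimoto-Blahut iteration.
    Returns the capacity in nats and the maximizing input distribution."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])   # start from the uniform input
    for _ in range(max_iter):
        q = pi @ P                               # output marginal p(y)
        d = np.sum(P * np.log(P / q), axis=1)    # D(p(. | theta) || q) per input
        new_pi = pi * np.exp(d)
        new_pi /= new_pi.sum()
        if np.max(np.abs(new_pi - pi)) < tol:
            pi = new_pi
            break
        pi = new_pi
    q = pi @ P
    capacity = float(pi @ np.sum(P * np.log(P / q), axis=1))
    return capacity, pi

# Binary symmetric channel with crossover probability 0.1:
# capacity = log 2 - H(0.1) (in nats), attained by the uniform input.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
C, pi_star = blahut_arimoto(P)
print(C, pi_star)
```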
Lemma 2. Let π_{a,b,u} ∈ P([−b, b]) be the symmetric discrete prior supported on the four points ±a and ±b, where a, b > 0 and 0 ≤ u ≤ 1 is the mixing weight between the two pairs. Then, the risk in Equation (8) is given by Equation (20), and the conditional mutual information in Equation (1) is given by Equation (21). The first and second terms in Equation (20) do not depend on π. The third term in Equation (20) does not depend on µ.
Figure 2 shows the risk functions of the latent information priors when w = 0.5 and w = 0.355, respectively. Note that max_{µ∈[−b,b]} r_w(µ, π*) is attained at µ = a and b in both examples. This is consistent with the proof of Theorem 1, and it is numerically verified that the prior maximizes the conditional mutual information. Furthermore, the supremum value, sup_{µ∈R} r_w(µ, π*), of the risk function of the latent information prior, π*, obtained under the parameter restriction, µ ∈ [−7, 7], is only slightly larger than the minimax value, inf_{π∈P(R)} sup_{µ∈R} r_w(µ, π), without the restriction. We see in the next section that the suprema, sup_{µ∈R} r_w(µ, π), of the risk functions of commonly used priors are much larger than that of π*.

The discreteness of the latent information priors shown in Figure 1 is a remarkable feature. In Bayesian statistics, k-reference priors have been known to be discrete measures in many examples; see [11-13]. The k-reference prior is defined to be a prior maximizing the mutual information between x^k and θ when we have a set of k observations, x^k = (x_1, . . ., x_k). However, such discrete priors have not been widely used. Instead of k-reference priors, reference priors introduced by Bernardo [14] have been used for many problems. Reference priors are not discrete and are defined by considering the limit as the sample size k goes to infinity. One main reason why discrete priors are not popular is that discrete priors are totally unacceptable from the viewpoint of subjective Bayes, in which priors are considered to represent prior beliefs about parameters.
Although they have not been widely used, discrete priors, such as latent information priors, are reasonable from the viewpoint of prediction and objective Bayes. Various statistical problems, including estimation and testing, can be formulated from the viewpoint of prediction, and priors can be constructed by considering the conditional mutual information. Thus, latent information priors, which depend on the choice of variables to be predicted, could play important roles in many statistical applications. Conditional mutual information is essential in information theory and has appeared naturally in several studies in statistics; see, e.g., [15,16]. Priors based on conditional mutual information and those based on unconditional mutual information are often quite different; see [4].
Bayesian testing based on latent information priors is free from the Jeffreys-Lindley paradox [3], since the priors are constructed by using conditional mutual information and depend properly on sample sizes. Posterior probabilities, p_{w,π*}(m = 0 | x), are shown in Figure 3 and are compared with p-values of the two-sided test in Table 1. When x = 2, 3 and 4, the posterior probabilities are much smaller than the p-values of the two-sided test. Large differences between posterior probabilities and p-values have been widely observed and discussed [1,17,18].
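The two quantities being compared can be sketched as follows. The four-point prior below only mimics the shape of the latent information prior, and the split of the weight u between the pairs ±a and ±b is our assumption, so the numbers will not reproduce Table 1 exactly:

```python
import numpy as np
from scipy.stats import norm

def posterior_h0(x, w, support, weights):
    """p_{w,pi}(m = 0 | x) for a discrete prior on mu (sigma = 1)."""
    num = w * norm.pdf(x, 0.0, 1.0)
    den = num + (1.0 - w) * np.sum(weights * norm.pdf(x, support, 1.0))
    return num / den

def two_sided_p(x):
    """p-value of the two-sided test of H0: mu = 0."""
    return 2.0 * (1.0 - norm.cdf(abs(x)))

# Symmetric four-point prior on {-b, -a, a, b}; the weight assignment
# (u on the inner pair) is a hypothetical choice, not from the paper.
a, b, u, w = 1.21, 7.0, 0.440, 0.5
support = np.array([-b, -a, a, b])
weights = np.array([(1 - u) / 2, u / 2, u / 2, (1 - u) / 2])

for x in (2.0, 3.0, 4.0):
    print(x, two_sided_p(x), posterior_h0(x, w, support, weights))
```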

Other Common Priors
Discrete priors, including the latent information priors discussed in the previous section, have not been widely used in Bayesian statistics. Common priors for the testing problem are the normal prior and the Cauchy prior. It seems to have been believed by many statisticians that the Cauchy prior is slightly better than the normal prior; see, e.g., [1,2]. In this section, we evaluate the conditional mutual information for these priors and compare their performance to that of the latent information prior.

The Normal Prior
The normal prior, ϕ(µ; 0, τ), is denoted by N_τ. The marginal density of x under N_τ is

p_{N_τ}(x) = ∫ ϕ(x; µ, 1) ϕ(µ; 0, τ) dµ = ϕ(x; 0, (1 + τ²)^{1/2}).

From Lemma 1, we obtain the risk, r_w(µ, N_τ). Thus, the conditional mutual information is given by

I_{m;µ|x}(w, N_τ) = ∫ r_w(µ, N_τ) ϕ(µ; 0, τ) dµ.

The conditional mutual information is evaluated by numerical integration. When w = 0.5 and w = 0.355, the maximum values, max_τ I_{m;µ|x}(w, N_τ), are attained at τ = 4.92 and τ = 5.36, respectively. The variations of the risk functions, r_{w=0.5}(µ, N_{τ=4.92}) and r_{w=0.355}(µ, N_{τ=5.36}), shown in Figure 4, are much larger than those of the risk functions of the latent information priors shown in Figure 2. Thus, the performance of the Bayesian testing based on the normal prior is worse than that based on the latent information prior if we adopt the Kullback-Leibler loss.
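The closed form of the marginal above can be checked by quadrature (a sketch; the variable names are ours):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

tau = 4.92  # value maximizing the conditional mutual information for w = 0.5

# integral of phi(x; mu, 1) * phi(mu; 0, tau) dmu == phi(x; 0, sqrt(1 + tau^2))
for x in (0.0, 1.0, 3.0):
    numeric, _ = quad(lambda m: norm.pdf(x, m, 1.0) * norm.pdf(m, 0.0, tau),
                      -60.0, 60.0, points=[x])
    closed = norm.pdf(x, 0.0, np.sqrt(1.0 + tau**2))
    print(x, numeric, closed)
```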

The Cauchy Prior
The Cauchy prior, with density 1/[γπ{1 + (µ/γ)²}], is denoted by C_γ. Since the characteristic functions of N(0, σ²) and C_γ are exp(−σ²t²/2) and exp(−γ|t|), respectively, the characteristic function of the marginal density

p_{C_γ}(x) = ∫ ϕ(x; µ, 1) C_γ(dµ)

with respect to the Cauchy prior, C_γ, is given by exp(−t²/2 − γ|t|). The expression

p_{C_γ}(x) = (1/√(2π)) Re[ exp{(γ − ix)²/2} erfc{(γ − ix)/√2} ],

where erfc is the complementary error function defined by erfc(z) = (2/√π) ∫_z^∞ exp(−t²) dt, is obtained by the inverse transform of Equation (27) and is useful for numerical computation; see [19] (p. 183) and [20]. From Lemma 1, we obtain the risk, r_w(µ, C_γ), and we numerically evaluate the conditional mutual information, I_{m;µ|x}(w, C_γ). When w = 0.5, the maximum is attained at γ = 3.31. The variation of the risk function r_{w=0.5}(µ, C_{γ=3.31}) is milder than that of the risk function r_{w=0.5}(µ, N_{τ=4.92}) based on the normal prior, and the inequality sup_µ r_{w=0.5}(µ, C_{γ=3.31}) < sup_µ r_{w=0.5}(µ, N_{τ=4.92}) holds. Thus, the Cauchy prior is preferable to the normal prior from the viewpoint of the Kullback-Leibler loss. However, the variation of the risk function shown in Figure 2 based on the latent information prior is much smaller than that of r_{w=0.5}(µ, C_{γ=3.31}). Similar relations also hold when w = 0.355.
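The erfc expression can be checked numerically against direct convolution using scipy's Faddeeva function, wofz(z) = exp(−z²) erfc(−iz) (a sketch; the helper name and test points are ours):

```python
import numpy as np
from scipy.stats import norm, cauchy
from scipy.integrate import quad
from scipy.special import wofz

def cauchy_normal_marginal(x, gamma):
    """Marginal density of x under the Cauchy prior C_gamma:
    (1/sqrt(2*pi)) * Re[exp{(gamma - ix)^2/2} * erfc{(gamma - ix)/sqrt(2)}],
    written via the Faddeeva function wofz(z) = exp(-z^2) * erfc(-i*z)
    with z = (x + i*gamma)/sqrt(2)."""
    z = (x + 1j * gamma) / np.sqrt(2.0)
    return wofz(z).real / np.sqrt(2.0 * np.pi)

gamma = 3.31
for x in (0.0, 2.0, 5.0):
    # direct numerical convolution of N(0, 1) with the Cauchy prior
    direct, _ = quad(lambda mu: norm.pdf(x - mu) * cauchy.pdf(mu, scale=gamma),
                     -50.0, 50.0, points=[x])
    print(x, cauchy_normal_marginal(x, gamma), direct)
```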

Conclusions
We discussed the use of latent information priors for Bayesian testing of a point null hypothesis. The testing problem was formulated as a prediction problem, and latent information priors were numerically obtained. The variations of the risk functions of latent information priors are much smaller than those of normal and Cauchy priors. Although the testing problem treated in the present paper is simple, the results may indicate that latent information priors could be useful for various problems, since many statistical problems can be formulated from the viewpoint of prediction.
When the parameter space is multidimensional, it becomes difficult to numerically obtain latent information priors, and some approximations need to be used. One possible approach is to use asymptotic methods, and another is to choose an approximating prior from a tractable subset of the set of all probability measures on the parameter space. These approaches require further investigation.

Figure 2 .
Figure 2. Risk functions of Bayesian testing based on latent information priors for (a) w = 0.5 and (b) w = 0.355. When w = 0.5, a = 1.21 and b = 7. When w = 0.355, a = 1.10 and b = 7. The vertical dotted lines indicate the locations of a and b.

Table 1 .
Comparison of posterior probabilities and p-values.