
Entropy 2014, 16(6), 3026-3048; https://doi.org/10.3390/e16063026

Article
Asymptotically Constant-Risk Predictive Densities When the Distributions of Data and Target Variables Are Different
by Keisuke Yano 1,* and Fumiyasu Komaki 1,2
1 Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
2 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
* Author to whom correspondence should be addressed.
Received: 28 March 2014; in revised form: 9 May 2014 / Accepted: 22 May 2014 / Published: 28 May 2014

## Abstract

We investigate the asymptotic construction of constant-risk Bayesian predictive densities under the Kullback–Leibler risk when the distributions of data and target variables are different and have a common unknown parameter. It is known that the Kullback–Leibler risk is asymptotically equal to a trace of the product of two matrices: the inverse of the Fisher information matrix for the data and the Fisher information matrix for the target variables. We assume that the trace has a unique maximum point with respect to the parameter. We construct asymptotically constant-risk Bayesian predictive densities using a prior depending on the sample size. Further, we apply the theory to the subminimax estimator problem and the prediction based on the binary regression model.
Keywords: Bayesian prediction; Fisher information; Kullback–Leibler divergence; minimax; predictive metric; subminimax estimator

## 1. Introduction

Let $x^{(N)} = (x_1, \dots, x_N)$ be $N$ independent data distributed according to a probability density, $p(x|\theta)$, that belongs to a $d$-dimensional parametric model, $\{p(x|\theta) : \theta \in \Theta\}$, where $\theta = (\theta^1, \dots, \theta^d)$ is an unknown $d$-dimensional parameter and $\Theta$ is the parameter space. Let $y$ be a target variable distributed according to a probability density, $q(y|\theta)$, that belongs to a $d$-dimensional parametric model, $\{q(y|\theta) : \theta \in \Theta\}$, with the same parameter, $\theta$. Here, we assume that the distributions of the data and the target variables, $p(x|\theta)$ and $q(y|\theta)$, are different. For simplicity, we assume that the data and the target variables are independent given $\theta$.

We construct predictive densities for target variables based on the data. We measure the performance of a predictive density, $\hat{q}(y; x^{(N)})$, by the Kullback–Leibler divergence, $D(q(\cdot|\theta), \hat{q}(\cdot\,; x^{(N)}))$, from the true density, $q(y|\theta)$, to the predictive density:

$D(q(\cdot|\theta), \hat{q}(\cdot\,; x^{(N)})) = \int q(y|\theta) \log \frac{q(y|\theta)}{\hat{q}(y; x^{(N)})}\, dy.$
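For a binary target, the divergence above reduces to a two-term sum. A minimal numeric sketch (the function name is ours, not the paper's):

```python
import math

def kl_bernoulli(p, q):
    """D(Ber(p), Ber(q)) = sum over y in {0,1} of p(y) log{p(y)/q(y)}."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# the divergence vanishes only when the predictive density matches the truth,
# and it is asymmetric in its two arguments
print(kl_bernoulli(0.7, 0.5), kl_bernoulli(0.5, 0.7))
```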

Then, the risk function of the predictive density is its expected divergence over the data,

$R(\theta, \hat{q}(y; x^{(N)})) = \int p(x^{(N)}|\theta)\, D(q(\cdot|\theta), \hat{q}(\cdot\,; x^{(N)}))\, dx^{(N)}.$

For the construction of predictive densities, we consider the Bayesian predictive density defined by:

$\hat{q}_\pi(y|x^{(N)}) = \frac{\int q(y|\theta)\, p(x^{(N)}|\theta)\, \pi(\theta; N)\, d\theta}{\int p(x^{(N)}|\theta)\, \pi(\theta; N)\, d\theta},$

where $\pi(\theta; N)$ is a prior density for $\theta$, possibly depending on the sample size, $N$. Aitchison showed that, for a given prior density, $\pi(\theta; N)$, the Bayesian predictive density, $\hat{q}_\pi(y|x^{(N)})$, is a Bayes solution under the Kullback–Leibler risk. Based on the asymptotics as the sample size goes to infinity, Komaki and Hartigan showed its superiority over any plug-in predictive density, $q(y|\hat{\theta})$, with any estimator, $\hat{\theta}$. However, the problem of prior selection for constructing better Bayesian predictive densities remains. Thus, a prior, $\pi(\theta; N)$, must be chosen based on an optimality criterion for actual applications.
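For a concrete instance of the ratio-of-integrals formula above, take binomial data with a Bernoulli target, $q(y=1|\theta) = \theta$, and a conjugate Beta prior; the Bayesian predictive density then has a closed form that matches direct numerical integration. A minimal sketch under those assumptions (helper names are ours):

```python
def beta_binom_predictive(s, n, a, b):
    """Closed-form Bayesian predictive probability of y = 1 after s successes
    in n trials under a Beta(a, b) prior: the posterior mean of theta."""
    return (s + a) / (n + a + b)

def predictive_by_quadrature(s, n, a, b, grid=20000):
    """The same quantity via the defining ratio of integrals, on a theta-grid."""
    num = den = 0.0
    for i in range(1, grid):
        th = i / grid
        w = th ** (s + a - 1) * (1 - th) ** (n - s + b - 1)  # prior x likelihood
        num += th * w   # numerator integrand: q(y=1|theta) p(x|theta) pi(theta)
        den += w
    return num / den
```

The agreement of the two functions is a quick check that the predictive density is the posterior expectation of $q(y|\theta)$.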

Among various criteria, we focus on the criterion of constructing minimax predictive densities under the Kullback–Leibler risk. For simplicity, we refer to the priors generating minimax predictive densities as minimax priors. Minimax priors have been previously studied in various predictive settings. When the simultaneous distributions of the target variables and the data belong to a submodel of the multinomial distributions, Komaki showed that minimax priors are given as latent information priors maximizing the conditional mutual information between the target variables and the parameter given the data. However, the explicit forms of latent information priors are difficult to obtain, and we need asymptotic methods, because latent information priors require maximization over the space of probability measures on $\Theta$.

With few exceptions, these studies on minimax priors are based on the assumption that the distributions, $p(x|\theta)$ and $q(y|\theta)$, are identical. Consider, however, prediction based on the logistic regression model where the covariates of the data and the target variables are not identical. In this predictive setting, the assumption that $p(x|\theta)$ and $q(y|\theta)$ are identical is no longer valid.

We focus on minimax priors in predictions where the distributions, $p(x|\theta)$ and $q(y|\theta)$, are different and have a common unknown parameter. Such a predictive setting has traditionally been considered in statistical prediction and experimental design, and it has recently been studied in statistical learning theory. Predictive densities in this setting have also been studied in the literature.

Let $g^X_{ij}(\theta)$ be the $(i, j)$-component of the Fisher information matrix of the distribution, $p(x|\theta)$, and let $g^Y_{ij}(\theta)$ be the $(i, j)$-component of the Fisher information matrix of the distribution, $q(y|\theta)$. Let $g^{X,ij}(\theta)$ and $g^{Y,ij}(\theta)$ denote the $(i, j)$-components of their inverse matrices. We adopt Einstein's summation convention: if the same index appears twice in any one term, summation over that index from one to $d$ is implied. For the asymptotics below, we assume that the prior densities, $\pi(\theta; N)$, are smooth.

In the asymptotics as the sample size $N$ goes to infinity, we construct an asymptotically constant-risk prior, $\pi(\theta; N)$, in the sense that the asymptotic risk:

$R(\theta, \hat{q}_\pi(y|x^{(N)})) = \frac{1}{N} R_1(\theta, \hat{q}_\pi(y|x^{(N)})) + \frac{1}{N\sqrt{N}} R_2(\theta, \hat{q}_\pi(y|x^{(N)})) + O(N^{-2})$

is constant up to $O(N^{-2})$. Since a proper prior with constant risk is a minimax prior for any finite sample size, the asymptotically constant-risk prior relates to the minimax prior; in Section 4, we verify that the asymptotically constant-risk prior agrees with the exact minimax prior in binomial examples.

When we use a prior, $\pi(\theta)$, independent of the sample size, $N$, it is known that the $N^{-1}$-order term, $R_1(\theta, \hat{q}_\pi(y|x^{(N)}))$, of the Kullback–Leibler risk is equal to the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$. If the trace does not depend on the parameter, $\theta$, the construction of the asymptotically constant-risk prior parallels previous work.

However, we consider settings where the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$, has a unique maximum point; for example, such settings appear in predictions based on the binary regression model where the covariates of the data and the target variables are not identical. In these settings, there are no asymptotically constant-risk priors among the priors independent of the sample size, $N$. The reason is as follows: consider a prior, $\pi(\theta)$, independent of the sample size, $N$. Then, the Kullback–Leibler risk of the Bayesian predictive density is expanded as:

$R(\theta, \hat{q}_\pi(y|x^{(N)})) = \frac{1}{2N}\, g^Y_{ij}(\theta)\, g^{X,ij}(\theta) + O(N^{-2}).$

Since, in our settings, the first-order term, $g^Y_{ij}(\theta)\, g^{X,ij}(\theta)$, is not constant, a prior independent of the sample size, $N$, is not an asymptotically constant-risk prior.

When the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$, has a unique maximum point, we construct the asymptotically constant-risk prior, $\pi(\theta; N)$, up to $O(N^{-2})$, by making the prior depend on the sample size, $N$, as:

$\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \{f(\theta)\}^{\sqrt{N}}\, h(\theta),$

where $f(\theta)$ and $h(\theta)$ are scalar functions of $\theta$ independent of $N$, and $|g^X(\theta)|$ denotes the determinant of the Fisher information matrix, $g^X(\theta)$.

The key idea is that, if a specified parameter point carries more risk than the other parameter points, then more prior weight should be concentrated on that point.
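The binomial example of Section 4 illustrates this idea concretely: the resulting $N$-dependent prior is of the Beta$(\sqrt{N}/2, \sqrt{N}/2)$ form, which piles weight near the maximum-risk point $\theta = 1/2$ as $N$ grows. A numeric sketch using grid normalization (function name ours):

```python
def mass_near_half(N, lo=0.4, hi=0.6, grid=20000):
    """Mass that a Beta(sqrt(N)/2, sqrt(N)/2)-type prior puts on [lo, hi]."""
    a = N ** 0.5 / 2
    num = den = 0.0
    for i in range(1, grid):
        th = i / grid
        w = th ** (a - 1) * (1 - th) ** (a - 1)  # unnormalized Beta(a, a) kernel
        den += w
        if lo <= th <= hi:
            num += w
    return num / den

# more prior weight concentrates on the maximum-risk point as N grows
print(mass_near_half(4), mass_near_half(400))
```

At $N = 4$ the prior is uniform (mass 0.2 on the interval); at $N = 400$ most of the mass already sits near $1/2$.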

Further, we clarify the subminimax estimator problem based on the mean squared error from the viewpoint of prediction where the distributions of the data and target variables are different and have a common unknown parameter. We obtain the improvement achieved by the minimax estimator over the subminimax estimators up to $O(N^{-2})$. The subminimax estimator problem [14,15] is the problem that, at first glance, there seem to exist estimators asymptotically dominating the minimax estimator. However, the relationship between such subminimax estimator problems and prediction has not been investigated, and further, in general, the improvement achieved by the minimax estimator over the subminimax estimators has not been quantified.

## 2. Information Geometrical Notations

In this section, we prepare the information geometrical notations; see the literature for details. We abbreviate $\partial/\partial\theta^i$ to $\partial_i$, where the indices, $i, j, k, \dots$, run from one to $d$. Similarly, we abbreviate $\partial^2/\partial\theta^i\partial\theta^j$, $\partial^3/\partial\theta^i\partial\theta^j\partial\theta^k$ and $\partial^4/\partial\theta^i\partial\theta^j\partial\theta^k\partial\theta^l$ to $\partial_{ij}$, $\partial_{ijk}$ and $\partial_{ijkl}$, respectively. We denote the expectations with respect to the random variables, $X$, $Y$ and $X^{(N)}$, by $E_X[\cdot]$, $E_Y[\cdot]$ and $E_{X^{(N)}}[\cdot]$, respectively. We denote their probability densities by $p(x|\theta)$, $q(y|\theta)$ and $p(x^{(N)}|\theta)$, respectively.

We define the predictive metric proposed by Komaki as:

$\mathring{g}_{ij}(\theta) = g^X_{ik}(\theta)\, g^{Y,kl}(\theta)\, g^X_{lj}(\theta).$

When the parameter is one-dimensional, $g_{\theta\theta}(\theta)$ denotes the Fisher information and $g^{\theta\theta}(\theta)$ denotes its inverse. Let $\Gamma^{e,X}_{ij,k}(\theta)$ and $\Gamma^{m,X}_{ij,k}(\theta)$ be the quantities given by:

$\Gamma^{e,X}_{ij,k}(\theta) := E_X[\partial_{ij} \log p(x|\theta)\, \partial_k \log p(x|\theta)]$

and:

$\Gamma^{m,X}_{ij,k}(\theta) := \int \frac{1}{p(x|\theta)}\, \partial_{ij} p(x|\theta)\, \partial_k p(x|\theta)\, dx.$

Using these quantities, the e-connection and m-connection coefficients with respect to the parameter, θ, for the model, {p(x|θ) : θ ∈ Θ}, are given by:

$\Gamma^{e,X,k}_{ij}(\theta) := g^{X,lk}(\theta)\, \Gamma^{e,X}_{ij,l}(\theta)$

and:

$\Gamma^{m,X,k}_{ij}(\theta) := g^{X,lk}(\theta)\, \Gamma^{m,X}_{ij,l}(\theta),$

respectively.

The $(0, 3)$-tensor, $T^X_{ijk}(\theta)$, is defined by:

$T^X_{ijk}(\theta) := E_X[\partial_i \log p(x|\theta)\, \partial_j \log p(x|\theta)\, \partial_k \log p(x|\theta)].$

The tensor, $T^X_{ijk}(\theta)$, also induces a $(0, 1)$-tensor:

$T^X_i(\theta) := T^X_{ijk}(\theta)\, g^{X,jk}(\theta).$

In the same manner, the information geometrical quantities, $\Gamma^{e,Y}_{ij,k}(\theta)$, $\Gamma^{m,Y}_{ij,k}(\theta)$ and $T^Y_{ijk}(\theta)$, are defined for the model, $\{q(y|\theta) : \theta \in \Theta\}$.

Let $M^k_{ij}(\theta)$ be a $(1, 2)$-tensor defined by:

$M^k_{ij}(\theta) := \Gamma^{m,Y,k}_{ij}(\theta) - \Gamma^{m,X,k}_{ij}(\theta).$

For the derivative, $(\partial_1 \upsilon(\theta), \dots, \partial_d \upsilon(\theta))$, of a scalar function, $\upsilon(\theta)$, the e-covariant derivative is given by:

$\nabla^e_i \partial_j \upsilon(\theta) := \partial_{ij} \upsilon(\theta) - \Gamma^{e,X,k}_{ij}(\theta)\, \partial_k \upsilon(\theta).$
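For small discrete models, the quantities of this section can be evaluated directly as expectations. A sketch for the Bernoulli model (our own helper; the closed forms $g_{\theta\theta} = 1/\{\theta(1-\theta)\}$ and $T_{\theta\theta\theta} = (1-2\theta)/\{\theta(1-\theta)\}^2$ follow from the two-point expectations):

```python
def bernoulli_info(theta):
    """Fisher information g_tt and tensor T_ttt of p(x|theta) = theta^x (1-theta)^(1-x),
    computed as expectations of powers of the score over x in {0, 1}."""
    score = {1: 1.0 / theta, 0: -1.0 / (1.0 - theta)}  # d/dtheta log p(x|theta)
    prob = {1: theta, 0: 1.0 - theta}
    g = sum(prob[x] * score[x] ** 2 for x in (0, 1))
    T = sum(prob[x] * score[x] ** 3 for x in (0, 1))
    return g, T

g, T = bernoulli_info(0.3)
print(g, T)
```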

## 3. Asymptotically Constant-Risk Priors When the Distributions of Data and Target Variables Are Different

In this section, we consider settings where the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$, has a unique maximum point. We construct the asymptotically constant-risk prior under the Kullback–Leibler risk in the sense that the asymptotic risk is constant up to $O(N^{-2})$. We find asymptotically constant-risk priors up to $O(N^{-2})$ in two steps: first, we expand the Kullback–Leibler risks of Bayesian predictive densities; second, we find the prior having an asymptotically constant risk using this expansion.

From now on, we assume the following two conditions for the prior, π(θ; N):

(C1) The prior, π(θ; N), has the form:

$\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta) + \log h(\theta)\},$

where f(θ) and h(θ) are smooth scalar functions of θ independent of N.

(C2) The unique maximum point of the scalar function, $f(\theta)$, is equal to the unique maximum point of the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$.

Based on Conditions (C1) and (C2), we expand the Kullback–Leibler risk of a Bayesian predictive density up to $O(N^{-2})$.

Theorem 1. The Kullback–Leibler risk of a Bayesian predictive density based on a prior, $\pi(\theta; N)$, satisfying Condition (C1) is expanded as in (1). The proof is given in the Appendix. The first term in (1) shows that the precision of the estimation is determined by the geometric quantity of the data, $g^{X,ij}(\theta)$, while the metric on the parameter space is determined by the geometric quantity of the target variables, $g^Y_{ij}(\theta)$. Note that each term in (1) is invariant under reparametrization.

Remark 1. For the subsequent theorem, it is important that, at the point, $\theta_f$, maximizing the scalar function, $\log f(\theta)$, the risk $R(\theta_f, \hat{q}_\pi(y|x^{(N)}))$ is given by:

$R(\theta_f, \hat{q}_\pi(y|x^{(N)})) = \frac{1}{2N} \sup_{\theta \in \Theta}\{g^{X,ij}(\theta)\, g^Y_{ij}(\theta)\} + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta_f)\, \partial_{ij} \log f(\theta_f) + O(N^{-2}).$

The $N^{-3/2}$-order term of this risk is common whenever we use the same scalar function, $\log f(\theta)$. This term is negative because of the definition of the point, $\theta_f$. Under Condition (C2), $\theta_f$ is equal to the unique maximum point, $\theta_{\max}$, of the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$.

Based on (1) and (2), we construct asymptotically constant-risk priors using the solutions of two partial differential equations.

Theorem 2. Suppose that the scalar functions, $\log \tilde{f}(\theta)$ and $\log \tilde{h}(\theta)$, satisfy the following conditions:

(A1) $\log \tilde{f}(\theta)$ is the solution of the eikonal equation given by:

$\mathring{g}^{ij}(\theta)\, \partial_i \log \tilde{f}(\theta)\, \partial_j \log \tilde{f}(\theta) = g^{X,ij}(\theta_{\max})\, g^Y_{ij}(\theta_{\max}) - g^{X,ij}(\theta)\, g^Y_{ij}(\theta),$

where $\theta_{\max}$ is the unique maximum point of the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$.

(A2) $\log \tilde{h}(\theta)$ is the solution of the first-order linear partial differential equation (4). Let $\pi(\theta; N)$ be the prior constructed as:

$\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta) + \log \tilde{h}(\theta)\}.$

Further, suppose that $\log \tilde{f}(\theta)$ satisfies Condition (C2).

Then, the Bayesian predictive density based on the prior, π(θ; N), has the asymptotically smallest constant risk up to O(N2) among all priors with the form (C1).

Proof. First, we consider the prior, $\phi(\theta; N)$, constructed as:

$\frac{\phi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta)\}.$

From Theorem 1, the Kullback–Leibler risk, $R(\theta, \hat{q}_\phi(y|x^{(N)}))$, based on the prior, $\phi(\theta; N)$, is given by:

$R(\theta, \hat{q}_\phi(y|x^{(N)})) = \frac{1}{2N}\, g^{X,ij}(\theta_{\max})\, g^Y_{ij}(\theta_{\max}) + o(N^{-1}).$

This is constant up to $o(N^{-1})$.

Suppose that there exists another prior, $\varphi(\theta; N)$, constructed as:

$\frac{\varphi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta)\},$

such that the Bayesian predictive density based on the prior, $\varphi(\theta; N)$, has the asymptotically constant risk:

$R(\theta, \hat{q}_\varphi(y|x^{(N)})) = \frac{k}{2N} + o(N^{-1}).$

From Theorem 1, the prior, $\varphi(\theta; N)$, must satisfy the equation:

$\mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log f(\theta) = k - g^{X,ij}(\theta)\, g^Y_{ij}(\theta).$

The left-hand side of the above equation is non-negative, because the matrix, $\mathring{g}^{ij}(\theta)$, is positive-definite. Hence, the infimum of the constant, $k$, is equal to $g^{X,ij}(\theta_{\max})\, g^Y_{ij}(\theta_{\max})$. From (5), the $N^{-1}$-order term of the risk based on the prior, $\phi(\theta; N)$, achieves this infimum. Thus, the Bayesian predictive density based on the prior, $\phi(\theta; N)$, has the asymptotically smallest constant risk up to $o(N^{-1})$.

Second, we consider the prior, π(θ; N), constructed as:

$\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde{f}(\theta) + \log \tilde{h}(\theta)\}.$

The above argument ensures that the prior, $\pi(\theta; N)$, has the asymptotically smallest constant risk up to $o(N^{-1})$. Thus, we only have to check whether the $N^{-3/2}$-order term of the risk is the smallest constant. From (2), the $N^{-3/2}$-order term of the risk at the point, $\theta_{\max}$, is unchanged by the choice of the scalar function, $\log h(\theta)$. In other words, the constant $N^{-3/2}$-order term must agree with the quantity, $\mathring{g}^{ij}(\theta_{\max})\, \partial_{ij} \log \tilde{f}(\theta_{\max})$. From Theorem 1, if we choose the prior, $\pi(\theta; N)$, the $N^{-3/2}$-order term of the risk is this smallest constant. Thus, the prior, $\pi(\theta; N)$, has the asymptotically smallest constant risk up to $O(N^{-2})$. □

Remark 2. In Theorem 2, we choose $\log \tilde{f}(\theta)$, satisfying Condition (C2), among the solutions of (A1). Consider a model with a one-dimensional parameter, $\theta$. There are four possibilities for the solutions of (A1), corresponding to the choices of sign. From the concavity around $\theta_{\max}$ required by (C2), we choose $\log \tilde{f}(\theta)$ as the solution of Equation (6). Integrating both sides of Equation (6), the function, $\log \tilde{f}(\theta)$, is uniquely obtained.

Remark 3. We compare the Kullback–Leibler risk based on the asymptotically constant-risk prior, $\pi(\theta; N)$, with that based on a prior, $\lambda(\theta)$, independent of the sample size, $N$. From Theorems 1 and 2, the Kullback–Leibler risk based on the asymptotically constant-risk prior, $\pi(\theta; N)$, is given as in (7). In contrast, the Kullback–Leibler risk based on the prior, $\lambda(\theta)$, is given as:

$R(\theta, \hat{q}_\lambda(y|x^{(N)})) = \frac{1}{2N}\, g^{X,ij}(\theta)\, g^Y_{ij}(\theta) + O(N^{-2}).$

The $N^{-1}$-order term in (8) lies below the $N^{-1}$-order term in (7); although the $N^{-3/2}$-order term in (8) does not exist, the $N^{-3/2}$-order term in (7) is negative. Thus, the maximum of the risk based on the asymptotically constant-risk prior, $\pi(\theta; N)$, is smaller than the maximum of the risk based on the prior, $\lambda(\theta)$. This result is consistent with minimaxity: we select the prior that constructs the predictive density with the smallest maximum risk.

## 4. Subminimax Estimator Problem Based on the Mean Squared Error

In this section, we examine the subminimax estimator problem based on the mean squared error from the viewpoint of prediction where the distributions of the data and target variables are different and have a common unknown parameter. First, we give a brief review of the subminimax estimator problem through a binomial example.

Example. Let us consider binomial estimation under the mean squared error, $R_{\mathrm{MSE}}(\theta, \hat{\theta})$. For any finite sample size, $N$, the Bayes estimator, $\hat{\theta}_\pi$, based on the Beta prior, $\pi(\theta; N) \propto \theta^{\sqrt{N}/2 - 1}(1 - \theta)^{\sqrt{N}/2 - 1}$, is minimax under the mean squared error. The mean squared error of the minimax Bayes estimator, $\hat{\theta}_\pi$, is given by:

$R_{\mathrm{MSE}}(\theta, \hat{\theta}_\pi) = \frac{N}{4(N + \sqrt{N})^2} = \frac{1}{4N} - \frac{1}{2N\sqrt{N}} + O(N^{-2}).$

In contrast, the mean squared error of the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$, is given by:

$R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\mathrm{MLE}}) = \frac{\theta(1 - \theta)}{N}.$

We compare the two estimators, $\hat{\theta}_\pi$ and $\hat{\theta}_{\mathrm{MLE}}$. Comparing the $N^{-1}$-order terms of the mean squared errors, it seems that the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$, dominates the minimax Bayes estimator, $\hat{\theta}_\pi$. In other words, the $N^{-1}$-order term of $R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\mathrm{MLE}})$ is not greater than that of $R_{\mathrm{MSE}}(\theta, \hat{\theta}_\pi)$ for every $\theta \in \Theta$, and equality holds only when $\theta = 1/2$. This seeming paradox is known as the subminimax estimator problem; see [14,17,18] for details, and see the literature for conditions under which such problems do not occur in estimation.

However, this paradox does not imply the inferiority of the minimax Bayes estimator. This is because, although the mean squared error of the minimax Bayes estimator, $\hat{\theta}_\pi$, has a negative $N^{-3/2}$-order term, the mean squared error of the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$, has no $N^{-3/2}$-order term. Hence, comparing the mean squared errors up to $O(N^{-2})$, the maximum of the mean squared error, $R_{\mathrm{MSE}}(\theta, \hat{\theta}_\pi)$, is below the maximum of the mean squared error, $R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\mathrm{MLE}})$.
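The two risk formulas above can be checked exactly by summing over the binomial distribution; the Bayes estimator under the Beta$(\sqrt{N}/2, \sqrt{N}/2)$ prior is the posterior mean, $(x + \sqrt{N}/2)/(N + \sqrt{N})$. A sketch (helper names ours):

```python
from math import comb, sqrt

def mse(estimator, N, theta):
    """Exact mean squared error of estimator(x) for x ~ Binomial(N, theta)."""
    return sum(comb(N, x) * theta**x * (1 - theta)**(N - x)
               * (estimator(x) - theta) ** 2 for x in range(N + 1))

N = 25
minimax = lambda x: (x + sqrt(N) / 2) / (N + sqrt(N))  # constant-risk Bayes estimator
mle = lambda x: x / N

# the minimax risk N/(4(N + sqrt(N))^2) is flat in theta; the MLE risk is not
for theta in (0.1, 0.5, 0.9):
    print(theta, mse(minimax, N, theta), mse(mle, N, theta))
```

For $N = 25$, the minimax risk is $25/3600 \approx 0.0069$ at every $\theta$, while the MLE risk peaks at $0.01$ at $\theta = 1/2$: the maximum of the minimax risk is strictly smaller, resolving the apparent paradox.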

Next, we construct the asymptotically constant-risk prior in estimation based on the mean squared error when the subminimax estimator problem occurs, from the viewpoint of prediction. We consider priors, $\pi(\theta; N)$, satisfying (C1). From Lemma 5 in the Appendix, the mean squared error of the Bayes estimator, $\hat{\theta}_\pi$, is equal to the Kullback–Leibler risk of the $\hat{\theta}_\pi$-plugin predictive density, $q(y|\hat{\theta}_\pi)$, obtained by assuming that the target variable, $y$, is a $d$-dimensional Gaussian random variable with mean vector $\theta$ and unit covariance. Note that $g^Y_{ij}(\theta) = \delta_{ij}$, $\Gamma^{m,Y}_{ij,k} = 0$ and $\Gamma^{e,Y}_{ij,k} = 0$ for $i, j, k = 1, \dots, d$. Thus, if $g^Y_{ij}(\theta)\, g^{X,ij}(\theta) = \sum_{i=1}^d g^{X,ii}(\theta)$ has a unique maximum point, we obtain the asymptotically constant-risk prior, $\pi(\theta; N)$, up to $O(N^{-2})$ from Lemma 4 in the Appendix and Theorem 2.

Finally, we compare the mean squared error of the asymptotically constant-risk Bayes estimator, $\hat{\theta}_\pi$, with that of the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$. The mean squared error of the asymptotically constant-risk Bayes estimator, $\hat{\theta}_\pi$, is given as:

$R_{\mathrm{MSE}}(\theta, \hat{\theta}_\pi) = \frac{1}{N} \sum_{i=1}^d g^{X,ii}(\theta_{\max}) + \frac{2}{N\sqrt{N}} \sum_{k=1}^d g^{X,ik}(\theta_{\max})\, g^{X,jk}(\theta_{\max})\, \partial_{ij} \log \tilde{f}(\theta_{\max}) + O(N^{-2}).$

In contrast, the mean squared error of the maximum likelihood estimator, $\hat{\theta}_{\mathrm{MLE}}$, is given as:

$R_{\mathrm{MSE}}(\theta, \hat{\theta}_{\mathrm{MLE}}) = \frac{1}{N} \sum_{i=1}^d g^{X,ii}(\theta) + O(N^{-2}).$

See [16,19].

Thus, the maximum of the mean squared error of the asymptotically constant-risk Bayes estimator is smaller than that of the subminimax estimators, by an improvement of order $N^{-3/2}$ proportional to the Hessian of the scalar function, $\log \tilde{f}(\theta)$, at $\theta_{\max}$. In prediction where the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$, has a unique maximum point, the same improvement holds (Remark 3).

Example revisited. Using the above results, we consider binomial estimation based on the mean squared error from the viewpoint of prediction. Since $\Gamma^{m,X,\theta}_{\theta\theta}$, $\Gamma^{m,Y,\theta}_{\theta\theta}$ and $T^Y_{\theta\theta\theta}$ all vanish, the asymptotically constant-risk prior in the estimation is identical to the asymptotically constant-risk prior in the prediction; compare Theorem 1 with the expansion of $g^Y_{ij}(\theta)\, E_{X^{(N)}}[(\hat{\theta}^i_\pi - \theta^i)(\hat{\theta}^j_\pi - \theta^j)]$ in Lemma 4 in the Appendix.

In this example, Equation (3) is given by:

$\theta^2(1 - \theta)^2\, \{\partial_\theta \log \tilde{f}(\theta)\}^2 = \frac{1}{4} - \theta(1 - \theta),$

and the solution, $\log \tilde{f}(\theta)$, is $(1/2) \log\{\theta(1 - \theta)\}$. Here, the second-order derivative of the function, $\log \tilde{f}(\theta)$, is given by:

$\partial_{\theta\theta} \log \tilde{f}(\theta) = -\frac{1 - 2\theta + 2\theta^2}{2\theta^2(1 - \theta)^2}.$

From this, Equation (4) is given by:

$\frac{1}{2}\theta(1 - \theta)(1 - 2\theta)\, \partial_\theta \log \tilde{h}(\theta) + \theta^2 - \theta = -\frac{1}{4},$

and the solution, $\log \tilde{h}(\theta)$, is $-(1/2) \log\{\theta(1 - \theta)\}$. Hence, the asymptotically constant-risk prior, $\pi(\theta; N)$, is a Beta prior with the parameters, $\alpha = \sqrt{N}/2$ and $\beta = \sqrt{N}/2$. Note that the asymptotically constant-risk prior coincides with the exact minimax prior. Since $g^{X,\theta\theta}(\theta_{\max}) = 1/4$ and $g^{X,\theta\theta}(\theta_{\max})\, \partial_{\theta\theta} \log \tilde{f}(\theta_{\max}) = -1$, the mean squared error of the asymptotically constant-risk Bayes estimator, $\hat{\theta}_\pi$, agrees with (9) up to $O(N^{-2})$.
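The stated solution of Equation (3) can be verified numerically: $(1/2)\log\{\theta(1-\theta)\}$ satisfies the equation and is concave around $\theta_{\max} = 1/2$. A finite-difference sketch (function names ours):

```python
import math

def logf(t):
    # candidate solution of Equation (3): (1/2) log{theta(1 - theta)}
    return 0.5 * math.log(t * (1 - t))

def eikonal_residual(theta, eps=1e-6):
    """theta^2 (1-theta)^2 (d logf/dtheta)^2 - {1/4 - theta(1 - theta)};
    should be ~0 if logf solves Equation (3)."""
    d = (logf(theta + eps) - logf(theta - eps)) / (2 * eps)  # central difference
    return theta**2 * (1 - theta)**2 * d**2 - (0.25 - theta * (1 - theta))

for theta in (0.1, 0.25, 0.4, 0.6, 0.9):
    print(theta, eikonal_residual(theta))
```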

## 5. Application to the Prediction of the Binary Regression Model under the Covariate Shift

In this section, we construct asymptotically constant-risk priors in prediction based on the binary regression model under the covariate shift.

We predict a binary response variable, $y$, based on the binary response variables, $x^{(N)}$. We assume that the target variable, $y$, and the data, $x^{(N)}$, follow logistic regression models with the same parameter, $\beta$, given by:

$\log \frac{\Pi_x}{1 - \Pi_x} = \alpha + z\beta$

and:

$\log \frac{\Pi_y}{1 - \Pi_y} = \tilde{\alpha} + \tilde{z}\beta,$

where $\Pi_x$ is the success probability of the data and $\Pi_y$ is the success probability of the target variable. Here, $\alpha$ and $\tilde{\alpha}$ denote known intercepts, and $\beta$ denotes the common unknown parameter. Further, we assume that the covariates, $z$ and $\tilde{z}$, are different.

Using the parameter, $\theta = \Pi_x$, we convert this predictive setting into a binomial prediction where the data, $x$, and the target variable, $y$, are distributed according to:

$p(x|\theta) := \begin{cases} \theta & \text{if } x = 1, \\ 1 - \theta & \text{if } x = 0 \end{cases}$

and $q(y|\theta) := \Pi_y$ if $y = 1$ and $1 - \Pi_y$ if $y = 0$, where $\Pi_y$ is regarded as a function of $\theta$ through the two regression equations. We obtain the two Fisher informations for $x$ and $y$ as:

$g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1 - \theta)}$

and:

$g^Y_{\theta\theta}(\theta) = \frac{(\tilde{z}/z)^2\, e^{-\tilde{\alpha} + (\tilde{z}/z)\alpha}\, \theta^{\tilde{z}/z - 2}\,(1 - \theta)^{\tilde{z}/z - 2}}{\left\{\theta^{\tilde{z}/z} + e^{-\tilde{\alpha} + (\tilde{z}/z)\alpha}\,(1 - \theta)^{\tilde{z}/z}\right\}^2},$

respectively.
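As a numeric check of these two informations, consider the special case treated next ($z = 1$, $\tilde{z} = 2$, $\alpha = \tilde{\alpha} = 0$), where the regression equations give $\Pi_y = \theta^2/\{\theta^2 + (1-\theta)^2\}$. A sketch computing the trace $g^{X,\theta\theta}(\theta)\, g^Y_{\theta\theta}(\theta)$ by finite differences (function names ours):

```python
def trace_term(theta, eps=1e-6):
    """g^{X,tt}(theta) * g^Y_{tt}(theta) for z = 1, z~ = 2, alpha = alpha~ = 0,
    where Pi_y(theta) = theta^2 / (theta^2 + (1 - theta)^2)."""
    pi_y = lambda t: t * t / (t * t + (1 - t) ** 2)
    dpi = (pi_y(theta + eps) - pi_y(theta - eps)) / (2 * eps)   # dPi_y/dtheta
    g_y = dpi ** 2 / (pi_y(theta) * (1 - pi_y(theta)))          # Fisher info of y
    return theta * (1 - theta) * g_y                            # times inverse info of x

# the trace has a unique maximum at theta = 1/2, where it equals 4
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=trace_term)
print(best, trace_term(best))
```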

For simplicity, we consider the setting where $z = 1$, $\tilde{z} = 2$ and $\alpha = \tilde{\alpha} = 0$; then, $\Pi_y = \theta^2/\{\theta^2 + (1 - \theta)^2\}$. The geometrical quantities for the model, $\{p(x|\theta) : \theta \in \Theta\}$, and those for the model, $\{q(y|\theta) : \theta \in \Theta\}$, are computed accordingly.

Using these quantities, Equation (3) is given by:

$\frac{4\theta^2(1 - \theta)^2}{\{\theta^2 + (1 - \theta)^2\}^2}\, \{\partial_\theta \log \tilde{f}(\theta)\}^2 = 4 - \frac{4\theta(1 - \theta)}{\{\theta^2 + (1 - \theta)^2\}^2}.$

Noting that the maximum point of $g^{X,\theta\theta}(\theta)\, g^Y_{\theta\theta}(\theta)$ is $1/2$, the solution, $\log \tilde{f}(\theta)$, of this equation is given by:

$\log \tilde{f}(\theta) = 2\sqrt{1 - \theta + \theta^2} + \log\{\theta(1 - \theta)\} - 2\log\left(1 + \sqrt{1 - \theta + \theta^2}\right).$

Using this solution, we obtain the solution of Equation (4). The asymptotically constant-risk priors for different sample sizes are shown in Figure 1. The prior weight is found to be more concentrated around $1/2$ as the sample size, $N$, grows.

In this example, we obtain the Kullback–Leibler risk of the Bayesian predictive density based on the asymptotically constant-risk prior, π(θ; N), as:

$R(\theta, \hat{q}_\pi(y|x^{(N)})) = \frac{2}{N} - \frac{4}{3N\sqrt{N}} + O(N^{-2}).$

We compare this value with the Bayes risk calculated using Monte Carlo simulation; see Figure 2. As the sample size, $N$, grows, the difference becomes negligible. Further, we compare this value with the risk itself calculated by Monte Carlo simulation; see Figure 3. As the sample size, $N$, grows, the risk becomes flatter in $\theta$.
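A Monte Carlo risk computation of this kind can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it uses a uniform prior and a grid posterior for simplicity, so it only roughly tracks the $2/N$ leading term rather than the constant-risk expansion above.

```python
import math, random

def kl_risk_mc(theta, N, reps=400, grid=200, seed=0):
    """Monte Carlo Kullback-Leibler risk of the Bayesian predictive density for
    y ~ Ber(Pi_y(theta)), Pi_y = theta^2/(theta^2 + (1-theta)^2),
    data x^(N) ~ Binomial(N, theta), under a uniform prior (illustrative choice)."""
    rng = random.Random(seed)
    pi_y = lambda t: t * t / (t * t + (1 - t) ** 2)
    ths = [(i + 0.5) / grid for i in range(grid)]
    total = 0.0
    for _ in range(reps):
        s = sum(rng.random() < theta for _ in range(N))       # sufficient statistic
        logw = [s * math.log(t) + (N - s) * math.log(1 - t) for t in ths]
        m = max(logw)
        w = [math.exp(v - m) for v in logw]                   # posterior weights
        z = sum(w)
        q1 = sum(wi * pi_y(t) for wi, t in zip(w, ths)) / z   # predictive P(y = 1)
        p1 = pi_y(theta)
        total += p1 * math.log(p1 / q1) + (1 - p1) * math.log((1 - p1) / (1 - q1))
    return total / reps

print(kl_risk_mc(0.5, 20))
```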

## 6. Discussion and Conclusions

We have considered the setting where the quantity, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$ (the trace of the product of the inverse Fisher information matrix, $g^{X,ij}(\theta)$, of the data and the Fisher information matrix, $g^Y_{ij}(\theta)$, of the target variables) has a unique maximum point, and we have investigated the asymptotically constant-risk prior in the sense that the asymptotic risk is constant up to $O(N^{-2})$.

In Section 3, we have considered the prior depending on the sample size, N, and constructed the asymptotically constant-risk prior using Equations (3) and (4). In Section 4, we have clarified the relationship between the subminimax estimator problem based on the mean squared error and the prediction where the distributions of data and target variables are different. In Section 5, we have constructed the asymptotically constant-risk prior in the prediction based on the logistic regression model under the covariate shift.

We have assumed that the trace, $g^{X,ij}(\theta)\, g^Y_{ij}(\theta)$, is finite. However, the trace may diverge on a non-compact parameter space; for example, it diverges in the predictive setting where the distribution, $q(y|\theta)$, of the target variable is a Poisson distribution, the data distribution, $p(x|\theta)$, is an exponential distribution and $\Theta$ is $\mathbb{R}$. Therefore, in such a setting, our future work should adopt criteria other than minimaxity.

## Acknowledgments

The authors thank the referees for their helpful comments. This research was partially supported by a Grant-in-Aid for Scientific Research (23650144, 26280005).

## Author Contributions

Both authors contributed to the research and writing of this paper. Both authors read and approved the final manuscript.

## Conflicts of Interest

The authors declare no conflict of interest.

## Appendix

We prove Theorem 1. First, we introduce some lemmas for the proof. The expansion proceeds in the following six steps (the first five are arranged as lemmas): the first is to expand the MAP estimator; the second is to calculate its bias and mean squared error; the third is to expand the Kullback–Leibler risk of the $\hat{\theta}_\pi$-plugin predictive density, $q(y|\hat{\theta}_\pi)$; the fourth is to expand the Bayesian predictive density based on the prior, $\pi(\theta; N)$; the fifth is to expand the Bayesian estimator minimizing the Bayes risk; and the last is to prove Theorem 1 using these lemmas.

We use some additional notations for the expansion. Let $\hat{\theta}_\pi$ be the maximum point of the scalar function, $\log p(x^{(N)}|\theta) + \log\{\pi(\theta; N)/|g^X(\theta)|^{1/2}\}$. Let $l(\theta|x^{(N)})$ denote the log likelihood of the data, $x^{(N)}$. Let $l_{ij}(\theta|x^{(N)})$, $l_{ijk}(\theta|x^{(N)})$ and $l_{ijkl}(\theta|x^{(N)})$ be the derivatives of orders 2, 3 and 4 of the log likelihood, $l(\theta|x^{(N)})$. Let $H_{ij}(\theta|x^{(N)})$ denote the quantity, $l_{ij}(\theta|x^{(N)}) + N g^X_{ij}(\theta)$. Let $\tilde{l}_i(\theta|x^{(N)})$ and $\tilde{H}_{ij}(\theta|x^{(N)})$ denote $(1/\sqrt{N})\, l_i(\theta|x^{(N)})$ and $(1/\sqrt{N})\, H_{ij}(\theta|x^{(N)})$, respectively. In addition, round brackets around indices denote symmetrization: for any two tensors, $a_{ij}$ and $b_{ij}$, $a_{i(j} b_{k)l}$ denotes $(a_{ij} b_{kl} + a_{ik} b_{jl})/2$.

Lemma 3. Let $\hat{\theta}_\pi$ be the maximum point of $\log p(x^{(N)}|\theta) + \log\{\pi(\theta; N)/|g^X(\theta)|^{1/2}\}$. Then, the $i$-th component of the deviation, $\hat{\theta}_\pi - \theta$, is expanded as in (10). Proof. By the definition of $\hat{\theta}_\pi$, we obtain the equation:

$\partial_i \log p(x^{(N)}|\hat{\theta}_\pi) + \partial_i \log \frac{\pi(\hat{\theta}_\pi; N)}{|g^X(\hat{\theta}_\pi)|^{1/2}} = 0.$

From our assumption that the prior, $\pi(\theta; N)$, has the form:

$\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta) + \log h(\theta)\},$

we rewrite this equation as:

$\partial_i \log p(x^{(N)}|\hat{\theta}_\pi) + \sqrt{N}\, \partial_i \log f(\hat{\theta}_\pi) + \partial_i \log h(\hat{\theta}_\pi) = 0.$

By applying a Taylor expansion around $\theta$ to this new equation, we derive an expansion of the score equation. From the law of large numbers and the central limit theorem, we rewrite this expansion as Expansion (11). By substituting the deviation, $\hat{\theta}_\pi - \theta$, recursively into Expansion (11), we obtain Expansion (10). □

Lemma 4. Let $\hat{\theta}_\pi$ be the maximum point of $\log p(x^{(N)}|\theta) + \log\{\pi(\theta; N)/|g^X(\theta)|^{1/2}\}$. Then, the $i$-th component of the bias of the estimator, $\hat{\theta}_\pi$, is given by (12), and the $(i, j)$-component of the mean squared error of $\hat{\theta}_\pi$ is given by (13), where $g^{X,k(i}(\theta)\, \Gamma^{X,j)}_m(\theta)$ denotes $(1/2)\{g^{X,ki}(\theta)\, \Gamma^{X,j}_m(\theta) + g^{X,kj}(\theta)\, \Gamma^{X,i}_m(\theta)\}$ and $g^{X,k(i}(\theta)\, \partial_k g^{X,j)l}(\theta)$ denotes $(1/2)\{g^{X,ki}(\theta)\, \partial_k g^{X,jl}(\theta) + g^{X,kj}(\theta)\, \partial_k g^{X,il}(\theta)\}$. The $(i, j, k)$-component of the mean of the third power of the deviation, $\hat{\theta}_\pi - \theta$, is given by (14). Proof. First, using Lemma 3, we determine the $i$-th component of the bias of $\hat{\theta}_\pi$, which gives (12). Second, consider Relationship (15). By differentiating the $j$-th component of the bias, $E_{X^{(N)}}[\hat{\theta}^j_\pi - \theta^j]$, we obtain the equation:

$\frac{1}{N}\, \partial_k E_{X^{(N)}}[\hat{\theta}^j_\pi - \theta^j] = -\frac{1}{N}\, \delta^j_k + \frac{1}{\sqrt{N}}\, E_{X^{(N)}}[(\hat{\theta}^j_\pi - \theta^j)\, \tilde{l}_k(\theta|x^{(N)})],$

where $\delta^j_k$ denotes the Kronecker delta: if the upper and lower indices agree, the value is one and otherwise zero. Equation (16) has been used by [2,16,19]. By substituting Equations (16) and (12) into Relationship (15), we obtain the $(i, j)$-component of the mean squared error of $\hat{\theta}_\pi$ given by (13). Finally, by taking the expectation of the third power of the deviation, $\hat{\theta}^i_\pi - \theta^i$, we obtain Expansion (14). □ Lemma 5. Let $\hat{\theta}_\pi$ be the maximum point of $\log p(x^{(N)}|\theta) + \log\{\pi(\theta; N)/|g^X(\theta)|^{1/2}\}$. The Kullback–Leibler risk of the plug-in predictive density, $q(y|\hat{\theta}_\pi)$, with the estimator, $\hat{\theta}_\pi$, is expanded as in (17). Proof. By applying a Taylor expansion, the Kullback–Leibler risk, $R(\theta, q(y|\hat{\theta}_\pi))$, is expanded as in (18), where $\Gamma^{e,Y}_{(ij,k)}$ denotes $(1/3)\{\Gamma^{e,Y}_{ij,k} + \Gamma^{e,Y}_{jk,i} + \Gamma^{e,Y}_{ki,j}\}$.

By the definition of the predictive metric, $g ° i j ( θ ) = g i k X ( θ ) g Y , k l ( θ ) g l j X ( θ )$, by Expansions (13) and (14) and by the relationship $L i j k X ( θ ) = − Γ e i j , k X ( θ ) − Γ e j k , i X ( θ ) − Γ e k i , j X ( θ ) − T i j k X ( θ )$, the last two terms of Expansion (18) are expanded as: By substituting Expansion (19) into Expansion (18), Expansion (17) is obtained. □

Note that Expansion (17) is invariant up to O(N−2) under reparametrization, so that each term of this expansion is a scalar function of θ.
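For orientation, the leading term of Expansion (17) is the trace quantity described in the Abstract and the Introduction; in coordinates it reads (a sketch of the leading order only, with the lower-order terms of (17) omitted):

```latex
R\bigl(\theta, q(y \mid \hat{\theta}_\pi)\bigr)
  = \frac{1}{2N}\, g_X^{ij}(\theta)\, g^{Y}_{ij}(\theta)
  + O\bigl(N^{-3/2}\bigr),
```

where $g_X^{ij}$ is the inverse of the Fisher information matrix for the data and $g^{Y}_{ij}$ is the Fisher information matrix for the target variables. The assumption that this trace has a unique maximum point with respect to θ is what drives the construction of the asymptotically constant-risk prior.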

Lemma 6. Let $θ ^ π$ be the maximum point of $log p ( x ( N ) | θ ) + log { π ( θ ; N ) / | g X ( θ ) | 1 / 2 }$. The Bayesian predictive density based on the prior, π(θ; N), is expanded as: Proof. Let $θ ˜ π$ denote $θ ^ π − θ$. First, using a Taylor expansion twice, we expand the posterior density, π(θ|x(N)), as: We denote the N−1/2-order, N−1-order and N−3/2-order terms by $( N − 1 / 2 ) a 0 ( θ ˜ π ; θ ^ π )$, $( N − 1 ) a 1 ( θ ˜ π ; θ ^ π )$ and $( N − 3 / 2 ) a 2 ( θ ˜ π ; θ ^ π )$, respectively. Then, this expansion is rewritten as: To make the expansion easier to see, the following notations are used. Let $ϕ ( η ; − l i j ( θ ^ π | x ( N ) ) )$ be the probability density function of the d-dimensional normal distribution with the precision matrix whose (i, j)-component is $− l i j ( θ ^ π | x ( N ) )$. Let $η = ( η 1 , ⋯ , η d )$ be a d-dimensional random vector distributed according to the normal density, $ϕ ( η ; − l i j ( θ ^ π | x ( N ) ) )$. The notations $a ¯ 0 ( θ ^ π )$, $a ¯ 1 ( θ ^ π )$, $a ¯ 2 ( θ ^ π )$ and $ω ^ i j ( θ ^ π )$ denote the expectations of $a 0 ( η ; θ ^ π )$, $a 1 ( η ; θ ^ π )$, $a 2 ( η ; θ ^ π )$ and ηiηj, respectively.

Using the above notations, we get the following posterior expansion: Second, using (21), the Bayesian predictive density, $q ^ π ( y | x ( N ) )$, based on the prior, π(θ; N), is expanded as: Here, the following two equations hold:

$− l i j ( θ ^ π | x ( N ) ) = N g i j X ( θ ^ π ) − N H ˜ i j ( θ ^ π | x ( N ) ) + Op ( 1 ) ,$
$l i j k ( θ ^ π | x ( N ) ) = − 2 N Γ e i j , k X ( θ ^ π ) − N Γ m i k , j X ( θ ^ π ) + N H ˜ i j k ( θ ^ π | x ( N ) ) .$

By combining Equation (23) with the Sherman–Morrison–Woodbury formula, the following expansion is obtained:

$ω ^ i j ( θ ^ π ) = 1 N g X , i j ( θ ^ π ) + 1 N N g X , i k ( θ ^ π ) g X , j l ( θ ^ π ) H k l ( θ ^ π | x ( N ) ) + Op ( N − 2 )$

By substituting Equations (23), (24) and (25) into Expansion (22), Expansion (20) is obtained. □
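The step from Equation (23) to Expansion (25) can be checked numerically: if the posterior precision matrix has the form N g − √N H̃ with H̃ = Op(1), its inverse expands as g−1/N + g−1 H̃ g−1/(N√N) + Op(N−2), matching the two displayed terms of (25). The following sketch verifies the O(N−2) error rate; the matrices G and H below are synthetic stand-ins for $g X ( θ ^ π )$ and $H ˜$, not quantities computed from the model:

```python
import numpy as np

# G stands in for the Fisher metric g_X (symmetric positive definite) and
# H for the O_p(1) fluctuation H~; both are arbitrary test matrices.
rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
G = A @ A.T + d * np.eye(d)
H = rng.standard_normal((d, d))
H = (H + H.T) / 2

def inverse_error(N):
    """Max-norm gap between the exact inverse of N*G - sqrt(N)*H and the
    two-term expansion G^{-1}/N + G^{-1} H G^{-1} / (N*sqrt(N))."""
    exact = np.linalg.inv(N * G - np.sqrt(N) * H)
    G_inv = np.linalg.inv(G)
    approx = G_inv / N + (G_inv @ H @ G_inv) / (N * np.sqrt(N))
    return np.max(np.abs(exact - approx))

# The residual should decay like N^{-2}: quadrupling N should shrink the
# error by a factor of roughly 16.
ratio = inverse_error(10_000) / inverse_error(40_000)
print(ratio)
```

The error ratio close to 16 = (40000/10000)² confirms that the neglected terms are indeed of order N−2, as claimed in Expansion (25).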

Note that Expansion (20) integrates to one up to OP(N−2). Further, Expansion (20) is similar to the expansion in . However, the estimator at the center of the expansion is different, because the prior depends on the sample size.

Lemma 7. The Bayesian estimator, $θ ^ opt$, minimizing the Bayes risk, $∫ R ( θ , q ( y | θ ^ ) ) d π ( θ ; N )$ among plug-in predictive densities is given by:

$θ ^ opt i = θ ^ π i + 1 2 N g X , i j ( θ ^ π ) T j X ( θ ^ π ) + 1 2 N g X , j k ( θ ^ π ) { Γ m j k Y , i ( θ ^ π ) − Γ m j k X , i ( θ ^ π ) } + Op ( N − 3 / 2 ) .$

Proof. The Bayes risk, $∫ R ( θ , q ( y | θ ^ ) ) d π ( θ ; N )$, is decomposed as:

$∫ R ( θ , q ( y | θ ^ ) ) d π ( θ ; N ) = ∫ π ( θ ; N ) ∫ p ( x ( N ) | θ ) ∫ q ( y | θ ) log q ( y | θ ) q ^ π ( y | x ( N ) ) d y d x ( N ) d θ + ∫ π ( θ ; N ) ∫ p ( x ( N ) | θ ) ∫ q ( y | θ ) log q ^ π ( y | x ( N ) ) q ( y | θ ^ ) d y d x ( N ) d θ .$

The first term of this decomposition does not depend on $θ ^$. From Fubini’s theorem and Lemma 6, the proof is completed. □
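For completeness, the decomposition used in this proof is the exact identity obtained by inserting the Bayesian predictive density into the log-ratio:

```latex
\log \frac{q(y \mid \theta)}{q(y \mid \hat{\theta})}
  = \log \frac{q(y \mid \theta)}{\hat{q}_\pi\bigl(y \mid x^{(N)}\bigr)}
  + \log \frac{\hat{q}_\pi\bigl(y \mid x^{(N)}\bigr)}{q(y \mid \hat{\theta})} .
```

Taking expectations over y, x(N) and θ term by term yields the two integrals displayed above; minimizing the Bayes risk over plug-in predictive densities therefore reduces to minimizing the second term alone, which Lemma 6's expansion makes explicit.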

Using these lemmas, we prove Theorem 1. First, we find that the Kullback–Leibler risk of the plug-in predictive density with the estimator, $θ ^ opt$, defined in Lemma 7, is given by:

$R ( θ , q ( y | θ ^ opt ) ) = R ( θ , q ( y | θ ^ π ) ) + 1 2 N N g i j ° ( θ ) T i X ( θ ) ∂ j log f ( θ ) + 1 2 N N g X , i m ( θ ) g i j Y ( θ ) g X , k l ( θ ) × { Γ m k l Y , j ( θ ) − Γ m k l X , j ( θ ) } ∂ m log f ( θ ) .$

Using Expansion (27) and Lemma 5, we expand the Kullback–Leibler risk, $R ( θ , q ^ π ( y | x ( N ) ) )$. Here, the risk, $R ( θ , q ^ π ( y | x ( N ) ) )$, is equal to the risk, $R ( θ , q ( y | θ ^ opt ) )$, up to O(N−2), because we expand the Bayesian predictive density, $q ^ π ( y | x ( N ) )$, as:

$q ^ π ( y | x ( N ) ) = q ( y | θ ^ opt ) + 1 2 N g X , i j ( θ ^ π ) { ∂ i j q ( y | θ ^ π ) − Γ m i j Y , k ( θ ^ π ) ∂ k q ( y | θ ^ π ) } + Op ( N − 3 / 2 ) .$

Thus, we obtain Expansion (1).

## References

1. Aitchison, J. Goodness of prediction fit. Biometrika 1975, 62, 547–554. [Google Scholar]
2. Komaki, F. On asymptotic properties of predictive distributions. Biometrika 1996, 83, 299–313. [Google Scholar]
3. Hartigan, J. The maximum likelihood prior. Ann. Stat 1998, 26, 2083–2103. [Google Scholar]
4. Bernardo, J. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 1979, 41, 113–147. [Google Scholar]
5. Clarke, B.; Barron, A. Jeffreys prior is asymptotically least favorable under entropy risk. J. Stat. Plan. Inference 1994, 41, 37–60. [Google Scholar]
6. Aslan, M. Asymptotically minimax Bayes predictive densities. Ann. Stat 2006, 34, 2921–2938. [Google Scholar]
7. Komaki, F. Bayesian predictive densities based on latent information priors. J. Stat. Plan. Inference 2011, 141, 3705–3715. [Google Scholar]
8. Komaki, F. Asymptotically minimax Bayesian predictive densities for multinomial models. Electron. J. Stat 2012, 6, 934–957. [Google Scholar]
9. Kanamori, T.; Shimodaira, H. Active learning algorithm using the maximum weighted log-likelihood estimator. J. Stat. Plan. Inference 2003, 116, 149–162. [Google Scholar]
10. Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244. [Google Scholar]
11. Fushiki, T.; Komaki, F.; Aihara, K. On parametric bootstrapping and Bayesian prediction. Scand. J. Stat 2004, 31, 403–416. [Google Scholar]
12. Suzuki, T.; Komaki, F. On prior selection and covariate shift of β-Bayesian prediction under α-divergence risk. Commun. Stat. Theory 2010, 39, 1655–1673. [Google Scholar]
13. Komaki, F. Asymptotic properties of Bayesian predictive densities when the distributions of data and target variables are different. Bayesian Anal 2014. submitted for publication. [Google Scholar]
14. Hodges, J.L.; Lehmann, E.L. Some problems in minimax point estimation. Ann. Math. Stat 1950, 21, 182–197. [Google Scholar]
15. Ghosh, M.N. Uniform approximation of minimax point estimates. Ann. Math. Stat 1964, 35, 1031–1047. [Google Scholar]
16. Amari, S. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1985. [Google Scholar]
17. Robbins, H. Asymptotically Subminimax Solutions of Compound Statistical Decision Problems, Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 31 July–12 August 1950; University of California Press: Oakland, CA, USA, 1950; pp. 131–148.
18. Frank, P.; Kiefer, J. Almost subminimax and biased minimax procedures. Ann. Math. Stat 1951, 22, 465–468. [Google Scholar]
19. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Stat 1975, 3, 1189–1242. [Google Scholar]
Figure 1. Asymptotically constant-risk prior in the prediction where the data are distributed according to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
Figure 2. Bayes risk based on the asymptotically constant-risk prior in the prediction where the data are distributed according to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
Figure 3. Comparison of the Kullback–Leibler risk calculated using the Monte Carlo simulations and the asymptotic risk, $2 / N − ( 4 3 ) / ( N N )$ in the prediction where the data are distributed according to the binomial distribution, Bin(N, θ), and the target variable is distributed according to the binomial distribution, Bin(1, θ2/(θ2 + (1 − θ)2)).
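Because X takes only N + 1 values, the Kullback–Leibler risk in the setting of Figures 1–3 can be computed by exact enumeration rather than Monte Carlo simulation. The sketch below does so for a simple plug-in predictive density in this model (data x ~ Bin(N, θ), target y ~ Bin(1, w(θ)) with w(θ) = θ²/(θ² + (1 − θ)²)); the smoothed estimator (x + 1/2)/(N + 1) is an illustrative choice made to keep the risk finite at the boundary, not the paper's asymptotically constant-risk Bayesian predictive density:

```python
import math

def w(theta):
    """Success probability of the target variable: theta^2/(theta^2+(1-theta)^2)."""
    return theta**2 / (theta**2 + (1.0 - theta)**2)

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def risk(theta, N):
    """Exact KL risk of the plug-in density w(theta_hat(x)), summing over x."""
    total = 0.0
    for x in range(N + 1):
        pmf = math.comb(N, x) * theta**x * (1 - theta) ** (N - x)
        theta_hat = (x + 0.5) / (N + 1)  # smoothed estimate, avoids w = 0 or 1
        total += pmf * kl_bernoulli(w(theta), w(theta_hat))
    return total

# Worst-case risk over a grid of theta values, for comparison with Figure 3.
N = 50
print(max(risk(t / 100, N) for t in range(1, 100)))
```

Replacing the plug-in density with the Bayesian predictive density based on the asymptotically constant-risk prior would flatten the risk curve, as Figure 3 illustrates.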