Asymptotically Constant-Risk Predictive Densities When the Distributions of Data and Target Variables Are Different

We investigate the asymptotic construction of constant-risk Bayesian predictive densities under the Kullback–Leibler risk when the distributions of data and target variables are different and have a common unknown parameter. It is known that the Kullback–Leibler risk is asymptotically equal to a trace of the product of two matrices: the inverse of the Fisher information matrix for the data and the Fisher information matrix for the target variables. We assume that the trace has a unique maximum point with respect to the parameter. We construct asymptotically constant-risk Bayesian predictive densities using a prior depending on the sample size. Further, we apply the theory to the subminimax estimator problem and the prediction based on the binary regression model.


Introduction
Let x^(N) = (x_1, …, x_N) be N independent data distributed according to a probability density, p(x|θ), that belongs to a d-dimensional parametric model, {p(x|θ) : θ ∈ Θ}, where θ = (θ^1, …, θ^d) is an unknown d-dimensional parameter and Θ is the parameter space. Let y be a target variable distributed according to a probability density, q(y|θ), that belongs to a d-dimensional parametric model, {q(y|θ) : θ ∈ Θ}, with the same parameter, θ. Here, we assume that the distributions of the data and the target variables, p(x|θ) and q(y|θ), are different. For simplicity, we assume that the data and the target variables are independent, given θ.
However, there remains the problem of prior selection for constructing better Bayesian predictive densities. Thus, a prior, π(θ; N), must be chosen based on an optimality criterion for actual applications. Among various criteria, we focus on the criterion of constructing minimax predictive densities under the Kullback-Leibler risk. For simplicity, we refer to the priors generating minimax predictive densities as minimax priors. Minimax priors have been previously studied in various predictive settings; see [4][5][6][7][8]. When the simultaneous distributions of the target variables and the data belong to a submodel of the multinomial distributions, Komaki [7] shows that minimax priors are given as latent information priors maximizing the conditional mutual information between the target variables and the parameter given the data. However, the explicit forms of latent information priors are difficult to obtain, and we need asymptotic methods, because they require maximization over the space of probability measures on Θ.
Except for [7], these studies on minimax priors are based on the assumption that the distributions, p(x|θ) and q(y|θ), are identical. Let us consider the prediction based on the logistic regression model where the covariates of the data and the target variables are not identical. In this predictive setting, the assumption that the distributions, p(x|θ) and q(y|θ), are identical is no longer valid.
We focus on the minimax priors in predictions where the distributions, p(x|θ) and q(y|θ), are different and have a common unknown parameter. Such a predictive setting has traditionally been considered in statistical prediction and experimental design. It has recently been studied in statistical learning theory; for example, see [9]. Predictive densities where the distributions, p(x|θ) and q(y|θ), are different and have a common unknown parameter are studied by [10][11][12][13].
Let g^X_ij(θ) be the (i, j)-component of the Fisher information matrix of the distribution, p(x|θ), and let g^Y_ij(θ) be the (i, j)-component of the Fisher information matrix of the distribution, q(y|θ). Let g^{X,ij}(θ) and g^{Y,ij}(θ) denote the (i, j)-components of their inverse matrices. We adopt Einstein's summation convention: if the same index appears twice in any one term, summation over that index from one to d is implied. For the asymptotics below, we assume that the prior densities, π(θ; N), are smooth. In the asymptotics as the sample size N goes to infinity, we construct the asymptotically constant-risk prior, π(θ; N), in the sense that the asymptotic risk, R(θ, q̂_π(y|x^(N))), is constant up to O(N^{-2}). Since a proper prior with constant risk is a minimax prior for any finite sample size, the asymptotically constant-risk prior relates to the minimax prior; in Section 4, we verify that the asymptotically constant-risk prior agrees with the exact minimax prior in binomial examples.
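The contraction g^{X,ij}(θ)g^Y_ij(θ) is simply the trace of the product of the inverse of one Fisher information matrix with the other. A minimal numerical sketch with hypothetical 2×2 matrices (the values are illustrative, not taken from any model in this paper) shows the Einstein-summation form and the matrix-trace form agree:

```python
import numpy as np

# Hypothetical 2x2 Fisher information matrices, for illustration only:
# G_X for the data model p(x|theta), G_Y for the target model q(y|theta).
G_X = np.array([[2.0, 0.5],
                [0.5, 1.0]])
G_Y = np.array([[1.0, 0.2],
                [0.2, 3.0]])

# Einstein summation g^{X,ij} g^Y_{ij}: contract the inverse of G_X with G_Y.
G_X_inv = np.linalg.inv(G_X)
trace_einsum = np.einsum('ij,ij->', G_X_inv, G_Y)

# Equivalent matrix form: tr(G_X^{-1} G_Y), since both matrices are symmetric.
trace_matrix = np.trace(G_X_inv @ G_Y)

print(trace_einsum, trace_matrix)  # the two computations agree
```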
When we use a prior, π(θ), independent of the sample size, N, it is known that the N^{-1}-order term, R_1(θ, q̂_π(y|x^(N))), of the Kullback-Leibler risk is equal to the trace, g^{X,ij}(θ)g^Y_ij(θ). If the trace does not depend on the parameter, θ, the construction of the asymptotically constant-risk prior is parallel to [6]; see also [13].
However, we consider settings where the trace, g^{X,ij}(θ)g^Y_ij(θ), has a unique maximum point; for example, such settings appear in predictions based on the binary regression model, where the covariates of the data and the target variables are not identical. In these settings, there exists no asymptotically constant-risk prior among the priors independent of the sample size, N. The reason is as follows. Consider a prior, π(θ), independent of the sample size, N. Then, the Kullback-Leibler risk of the Bayesian predictive density is expanded with an N^{-1}-order leading term determined by g^Y_ij(θ)g^{X,ij}(θ). Since, in our settings, this first-order term, g^Y_ij(θ)g^{X,ij}(θ), is not constant, a prior independent of the sample size, N, is not an asymptotically constant-risk prior.
When there exists a unique maximum point of the trace, g^{X,ij}(θ)g^Y_ij(θ), we construct the asymptotically constant-risk prior, π(θ; N), up to O(N^{-2}), by making the prior depend on the sample size, N, through scalar functions f(θ) and h(θ) that are independent of N, together with the determinant, |g^X(θ)|, of the Fisher information matrix, g^X(θ).
The key idea is that, if a particular parameter point incurs more risk than the other parameter points, then more prior weight should be concentrated on that point.
Further, we clarify the subminimax estimator problem based on the mean squared error from the viewpoint of prediction where the distributions of the data and the target variables are different and have a common unknown parameter. We obtain the improvement achieved by the minimax estimator over the subminimax estimators up to O(N^{-2}). The subminimax estimator problem [14,15] is the phenomenon that, at first glance, there seem to exist estimators asymptotically dominating the minimax estimator. However, the relationship between such subminimax estimator problems and prediction has not been investigated; further, in general, the improvement achieved by the minimax estimator over the subminimax estimators has not been quantified.
We use the predictive metric, g̃_ij(θ), proposed by Komaki [13]. When the parameter is one-dimensional, g_θθ(θ) denotes the Fisher information and g^θθ(θ) denotes its inverse.
Let M^k_ij(θ) be a (1, 2)-tensor. For a scalar function, v(θ), we also use its e-covariant derivative.

Asymptotically Constant-Risk Priors When the Distributions of Data and Target Variables Are Different
In this section, we consider settings where the trace, g^{X,ij}(θ)g^Y_ij(θ), has a unique maximum point. We construct the asymptotically constant-risk prior under the Kullback-Leibler risk in the sense that the asymptotic risk up to O(N^{-2}) is constant. We find asymptotically constant-risk priors up to O(N^{-2}) in two steps: first, we expand the Kullback-Leibler risks of Bayesian predictive densities; second, we find the prior having an asymptotically constant risk using this expansion.
From now on, we assume the following two conditions for the prior, π(θ; N): (C1) The prior, π(θ; N), has the form given in the Introduction, where f(θ) and h(θ) are smooth scalar functions of θ independent of N.
(C2) The unique maximum point of the scalar function, f(θ), is equal to the unique maximum point of the trace, g^{X,ij}(θ)g^Y_ij(θ). Based on Conditions (C1) and (C2), we expand the Kullback-Leibler risk of a Bayesian predictive density up to O(N^{-2}).
Theorem 1. The Kullback-Leibler risk of a Bayesian predictive density based on a prior, π(θ; N), satisfying Condition (C1), is expanded as in (1). The proof is given in the Appendix. The first term in (1) shows that the precision of the estimation is governed by the geometric quantity of the data, g^{X,ij}(θ), while the metric on the parameter space is determined by the geometric quantity of the target variables, g^Y_ij(θ). Note that each term in (1) is invariant under reparametrization.
Remark 1. For the subsequent theorem, it is important that, at the point, θ_f, maximizing the scalar function, log f(θ), the risk, R(θ_f, q̂_π(y|x^(N))), is given by (2). The N^{-3/2}-order term of this risk is common whenever we use the same scalar function, log f(θ). This term is negative because of the definition of the point, θ_f. Under Condition (C2), θ_f is equal to the unique maximum point, θ_max, of the trace, g^{X,ij}(θ)g^Y_ij(θ).
Based on (1) and (2), we construct asymptotically constant-risk priors using the solutions of partial differential equations.
Theorem 2. Suppose that the scalar functions, log f(θ) and log h(θ), satisfy the following conditions: (A1) log f(θ) is a solution of the eikonal equation, (3), where θ_max is the unique maximum point of the scalar function, g^{X,ij}(θ)g^Y_ij(θ).
(A2) log h(θ) is a solution of the first-order linear partial differential equation, (4). Let π(θ; N) be the prior constructed from these solutions as in (C1). Further, suppose that log f(θ) satisfies Condition (C2).
Then, the Bayesian predictive density based on the prior, π(θ; N), has the asymptotically smallest constant risk up to O(N^{-2}) among all priors of the form (C1).
Proof. Suppose that there exists another prior, φ(θ; N), of the form (C1), such that the Bayesian predictive density based on the prior, φ(θ; N), has an asymptotically constant risk with constant, k, in the N^{-1}-order term. From Theorem 1, the prior, φ(θ; N), must satisfy Equation (5). The left-hand side of that equation is non-negative, because the matrix, g̃_ij(θ), is positive-definite. Hence, the infimum of the constant, k, is equal to g^{X,ij}(θ_max)g^Y_ij(θ_max). From (5), the N^{-1}-order term of the risk based on the prior, φ(θ; N), achieves the infimum, g^{X,ij}(θ_max)g^Y_ij(θ_max). Thus, the Bayesian predictive density based on the prior, φ(θ; N), has the asymptotically smallest constant risk up to o(N^{-1}).
Second, we consider the prior, π(θ; N), constructed in Theorem 2. The above argument ensures that the prior, π(θ; N), has the asymptotically smallest constant risk up to o(N^{-1}). Thus, we only have to check that the N^{-3/2}-order term of the risk is the smallest constant. From (2), the N^{-3/2}-order term of the risk at the point, θ_max, is unchanged by the choice of the scalar function, log h(θ). In other words, the constant N^{-3/2}-order term must agree with the quantity, g̃^ij(θ_max)∂_ij log f(θ_max). From Theorem 1, if we choose the prior, π(θ; N), the N^{-3/2}-order term of the risk is the smallest constant, and it agrees with the quantity, g̃^ij(θ_max)∂_ij log f(θ_max). Thus, the prior, π(θ; N), has the asymptotically smallest constant risk up to O(N^{-2}).
Remark 2. In Theorem 2, we choose log f(θ), satisfying Condition (C2), among the solutions of (A1). Consider a model with a one-dimensional parameter, θ. There are four possibilities for the solutions of (A1), corresponding to the choices of the double sign on each side of θ_max. From the concavity around θ_max suggested by (C2), we choose log f(θ) as the solution of the sign-matched equation, (6). Integrating both sides of Equation (6), the unique function, log f(θ), is obtained.
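This integration step can be sketched in the one-dimensional binomial example treated in Section 4, where the solution is log f(θ) = (1/2) log{θ(1 − θ)}; its concave branch has derivative d(log f)/dθ = (1 − 2θ)/{2θ(1 − θ)}, positive below θ_max = 1/2 and negative above. Numerically integrating that branch outward from θ_max recovers the closed form:

```python
import math

# One-dimensional sketch for the binomial example of Section 4: integrate the
# concave branch d(log f)/d(theta) = (1 - 2*theta) / (2*theta*(1 - theta)),
# whose closed-form solution is log f(theta) = (1/2)*log(theta*(1 - theta)).

def dlogf(theta):
    return (1.0 - 2.0 * theta) / (2.0 * theta * (1.0 - theta))

def integrate(theta_target, steps=20000):
    """Integrate outward from theta_max = 1/2 with a simple RK4 scheme."""
    theta = 0.5
    logf = 0.5 * math.log(0.25)          # value at the maximum point
    h = (theta_target - theta) / steps
    for _ in range(steps):
        k1 = dlogf(theta)
        k2 = dlogf(theta + 0.5 * h)      # derivative depends on theta only,
        k3 = dlogf(theta + 0.5 * h)      # so k2 and k3 coincide
        k4 = dlogf(theta + h)
        logf += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        theta += h
    return logf

theta = 0.2
closed_form = 0.5 * math.log(theta * (1.0 - theta))
print(abs(integrate(theta) - closed_form))  # small integration error
```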
Remark 3. We compare the Kullback-Leibler risk based on the asymptotically constant-risk prior, π(θ; N), with that based on a prior, λ(θ), independent of the sample size, N. From Theorems 1 and 2, the Kullback-Leibler risk based on the asymptotically constant-risk prior, π(θ; N), is given by (7). In contrast, the Kullback-Leibler risk based on the prior, λ(θ), is given by (8). The N^{-1}-order term in (8) lies below the N^{-1}-order term in (7); on the other hand, although the N^{-3/2}-order term in (8) does not exist, the N^{-3/2}-order term in (7) is negative. Thus, the maximum of the risk based on the asymptotically constant-risk prior, π(θ; N), is smaller than the maximum of the risk based on the prior, λ(θ). This result is consistent with minimaxity, that is, with selecting the prior that yields the predictive density with the smallest maximum risk.

Subminimax Estimator Problem Based on the Mean Squared Error
In this section, we revisit the subminimax estimator problem based on the mean squared error from the viewpoint of prediction where the distributions of the data and the target variables are different and have a common unknown parameter. First, we give a brief review of the subminimax estimator problem through a binomial example.
Example. Let us consider binomial estimation based on the mean squared error, R_MSE(θ, θ̂). For any finite sample size, N, the Bayes estimator, θ̂_π, based on the Beta prior, π(θ; N) ∝ θ^{√N/2 − 1}(1 − θ)^{√N/2 − 1}, is minimax under the mean squared error. The mean squared error of the minimax Bayes estimator, θ̂_π, is given by: R_MSE(θ, θ̂_π) = N/{4(N + √N)²} = 1/(4N) − 1/(2N^{3/2}) + O(N^{-2}). In contrast, the mean squared error of the maximum likelihood estimator, θ̂_MLE, is given by: R_MSE(θ, θ̂_MLE) = θ(1 − θ)/N. We compare the two estimators, θ̂_π and θ̂_MLE. Comparing the N^{-1}-order terms of the mean squared errors, it seems that the maximum likelihood estimator, θ̂_MLE, dominates the minimax Bayes estimator, θ̂_π: the N^{-1}-order term of R_MSE(θ, θ̂_MLE) is not greater than that of R_MSE(θ, θ̂_π) for every θ ∈ Θ, and equality holds when θ = 1/2. This seeming paradox is known as the subminimax estimator problem; see [14,17,18] for details. See also [15] for conditions under which such problems do not occur in estimation.
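The two risk expansions above can be checked exactly by enumerating the binomial distribution; a short sketch, using the fact that the posterior mean under the Beta(√N/2, √N/2) prior is (S + √N/2)/(N + √N) for S successes:

```python
import math

# Exact mean squared errors in the binomial example, computed by enumerating
# S ~ Binomial(N, theta).  The minimax Bayes estimator under the
# Beta(sqrt(N)/2, sqrt(N)/2) prior is the posterior mean
# (S + sqrt(N)/2) / (N + sqrt(N)).

def mse(estimator, N, theta):
    return sum(math.comb(N, s) * theta**s * (1 - theta)**(N - s)
               * (estimator(s, N) - theta) ** 2
               for s in range(N + 1))

minimax = lambda s, N: (s + math.sqrt(N) / 2) / (N + math.sqrt(N))
mle = lambda s, N: s / N

N = 25
grid = [i / 200 for i in range(1, 200)]
mse_minimax = [mse(minimax, N, t) for t in grid]
mse_mle = [mse(mle, N, t) for t in grid]

# The minimax estimator has constant risk N / (4 * (N + sqrt(N))^2), and its
# maximum risk lies below the maximum risk theta*(1 - theta)/N of the MLE.
print(max(mse_minimax), N / (4 * (N + math.sqrt(N)) ** 2), max(mse_mle))
```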
However, this paradox does not mean that the minimax Bayes estimator is inferior. This is because, although the mean squared error of the minimax Bayes estimator, θ̂_π, has a negative N^{-3/2}-order term, the mean squared error of the maximum likelihood estimator, θ̂_MLE, has no N^{-3/2}-order term. Hence, comparing the mean squared errors up to O(N^{-2}), the maximum of the mean squared error, R_MSE(θ, θ̂_π), is below the maximum of the mean squared error, R_MSE(θ, θ̂_MLE).
Next, we construct the asymptotically constant-risk prior in estimation based on the mean squared error when the subminimax estimator problem occurs, from the viewpoint of prediction. We consider priors, π(θ; N), satisfying (C1). From Lemma 5 in the Appendix, the mean squared error of the Bayes estimator, θ̂_π, is equal to the Kullback-Leibler risk of the θ̂_π-plugin predictive density, q(y|θ̂_π), by assuming that the target variable, y, is a d-dimensional Gaussian random variable with mean vector, θ, and unit variance. Noting that the corresponding trace has a unique maximum point, we obtain the asymptotically constant-risk prior, π(θ; N), up to O(N^{-2}) from Lemma 4 in the Appendix and Theorem 2.
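The reduction from mean squared error to plug-in Kullback-Leibler risk rests on the Gaussian identity KL(N(μ₁, 1) ‖ N(μ₂, 1)) = (μ₁ − μ₂)²/2, so the two risks carry the same information about the estimator (the constant factor is immaterial to the argument). A quick numerical check of the identity:

```python
import math

# KL divergence between two unit-variance Gaussians, computed by numerical
# integration, versus the closed form (mu1 - mu2)^2 / 2.  This identity is what
# lets the mean squared error be read as a plug-in Kullback-Leibler risk.

def gauss_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def kl_numeric(mu1, mu2, lo=-12.0, hi=12.0, n=100000):
    # midpoint rule; the tails beyond [lo, hi] are negligible for these means
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p = gauss_pdf(x, mu1)
        total += h * p * math.log(p / gauss_pdf(x, mu2))
    return total

mu1, mu2 = 0.3, 1.1
kl_val = kl_numeric(mu1, mu2)
print(kl_val, (mu1 - mu2) ** 2 / 2)  # both ≈ 0.32
```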
Finally, we compare the mean squared error of the asymptotically constant-risk Bayes estimator, θ̂_π, with that of the maximum likelihood estimator, θ̂_MLE. The mean squared error of the asymptotically constant-risk Bayes estimator, θ̂_π, is given by (9). In contrast, the mean squared error of the maximum likelihood estimator, θ̂_MLE, follows the standard expansion, which has no N^{-3/2}-order term; see [16,19]. Thus, the maximum of the mean squared error of the asymptotically constant-risk Bayes estimator is smaller than that of the maximum likelihood estimator, the improvement being of order N^{-3/2} and proportional to the Hessian of the scalar function, log f(θ), at θ_max. In the prediction where the trace, g^{X,ij}(θ)g^Y_ij(θ), has a unique maximum point, the same improvement holds (Remark 3).
Example revisited. Using the above results, we consider the binomial estimation based on the mean squared error from the viewpoint of prediction. The geometric quantities to be used are g^X_θθ(θ) = 1/{θ(1 − θ)} and g^Y_θθ(θ) = 1. Since the higher-order geometric quantities, such as T^Y_θθθ, vanish, the asymptotically constant-risk prior in the estimation is identical to the asymptotically constant-risk prior in the prediction; compare Theorem 1 with the expansion of g^{Y,ij}(θ)E_{x^(N)}[(θ̂^i_π − θ^i)(θ̂^j_π − θ^j)] in Lemma 4 in the Appendix. In this example, the solution, log f(θ), of Equation (3) is (1/2) log{θ(1 − θ)}. Here, the second-order derivative of the function, log f(θ), is given by: ∂_θθ log f(θ) = −{1 − 2θ(1 − θ)}/{2θ²(1 − θ)²}. From this, the solution, log h(θ), of Equation (4) is (1/2) log{θ(1 − θ)}. Hence, the asymptotically constant-risk prior, π(θ; N), is a Beta prior with the parameters, α = √N/2 and β = √N/2. Note that the asymptotically constant-risk prior coincides with the exact minimax prior. Since g^{X,θθ}(θ_max) = 1/4 and g^{X,θθ}(θ_max)∂_θθ log f(θ_max) = −1, the mean squared error of the asymptotically constant-risk Bayes estimator, θ̂_π, agrees with (9) up to O(N^{-2}).
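The agreement "up to O(N^{-2})" can be illustrated numerically: the exact constant risk, N/{4(N + √N)²}, and the two-term expansion, 1/(4N) − 1/(2N^{3/2}), differ by a gap that shrinks at rate N^{-2}:

```python
import math

# Exact constant risk of the minimax Bayes estimator, N / (4*(N + sqrt(N))^2),
# versus the two-term expansion 1/(4N) - 1/(2*N^{3/2}); the gap between them
# is O(N^{-2}), so gap * N^2 stays bounded as N grows.

def exact(N):
    return N / (4.0 * (N + math.sqrt(N)) ** 2)

def two_term(N):
    return 1.0 / (4.0 * N) - 1.0 / (2.0 * N ** 1.5)

results = []
for N in [10, 100, 1000, 10000]:
    gap = abs(exact(N) - two_term(N))
    results.append((N, gap))
    print(N, gap * N ** 2)   # scaled gap stays bounded
```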

Application to the Prediction of the Binary Regression Model under the Covariate Shift
In this section, we construct asymptotically constant-risk priors in the prediction based on the binary regression model under the covariate shift; see [10].
We predict a binary response variable, y, based on the binary response variables, x^(N). We assume that the target variable, y, and the data, x^(N), follow logistic regression models with the same parameter, β: logit Π_x = α + βz and logit Π_y = α̃ + βz̃, where Π_x is the success probability of the data and Π_y is the success probability of the target variable. Here, α and α̃ denote known constant terms, and β denotes the common unknown parameter. Further, we assume that the covariates, z and z̃, are different. Using the parameter θ = Π_x, we convert this predictive setting to a binomial prediction where the data, x, and the target variable, y, are distributed according to Bernoulli distributions with success probabilities θ and Π_y(θ), respectively. We obtain the two Fisher informations for x and y, g^X_θθ(θ) and g^Y_θθ(θ), respectively.
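A sketch of these two Fisher informations under illustrative choices of the covariates and intercepts (z = 1, z̃ = 2, α = α̃ = 0 are assumed values for illustration, not taken from the paper). The chain rule gives dΠ_y/dθ = Π_y(1 − Π_y)(z̃/z)/{θ(1 − θ)}, which a finite-difference check confirms:

```python
import math

# Sketch of the Fisher informations after the reparametrization theta = Pi_x,
# for assumed covariates and intercepts (z, z_tilde, a, a_tilde are
# illustrative values, not taken from the paper).

z, z_tilde = 1.0, 2.0      # covariates of the data and the target variable
a, a_tilde = 0.0, 0.0      # known intercepts

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

def pi_y(theta):
    beta = (logit(theta) - a) / z          # recover beta from theta = Pi_x
    return sigmoid(a_tilde + beta * z_tilde)

def g_X(theta):                            # Bernoulli Fisher information
    return 1.0 / (theta * (1.0 - theta))

def g_Y(theta):                            # chain rule through theta -> Pi_y
    p = pi_y(theta)
    dpi = p * (1.0 - p) * (z_tilde / z) / (theta * (1.0 - theta))
    return dpi ** 2 / (p * (1.0 - p))

# Finite-difference check of the derivative dPi_y/dtheta used inside g_Y.
theta, eps = 0.3, 1e-6
numeric = (pi_y(theta + eps) - pi_y(theta - eps)) / (2 * eps)
analytic = pi_y(theta) * (1 - pi_y(theta)) * (z_tilde / z) / (theta * (1 - theta))
print(abs(numeric - analytic))  # tiny discrepancy
```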

Using these quantities, Equation (3) takes an explicit form in θ. By noting that the maximum point of g^{X,θθ}(θ)g^Y_θθ(θ) is 1/2, we obtain the solution, log f(θ), of this equation and, from it, the solution, log h(θ), of Equation (4). The asymptotically constant-risk priors for different sample sizes are shown in Figure 1. The prior weight is found to concentrate more around 1/2 as the sample size, N, grows.
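Under illustrative covariates z = 1, z̃ = 2 with zero intercepts (assumed values for illustration), the trace g^{X,θθ}(θ)g^Y_θθ(θ) reduces to (z̃/z)² Π_y(1 − Π_y)/{θ(1 − θ)}, and a grid evaluation confirms a unique maximum point at θ = 1/2 (this relies on the symmetric intercepts and |z̃| > |z|; other choices may move or remove the interior maximum):

```python
import math

# Trace g^{X,theta theta}(theta) * g^Y_{theta theta}(theta) for an assumed
# covariate-shift setting: ratio r = z_tilde / z = 2, zero intercepts, so that
# logit Pi_y = r * logit theta.  The trace then equals
#   r^2 * Pi_y * (1 - Pi_y) / (theta * (1 - theta)).

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def trace(theta, r=2.0):
    p = sigmoid(r * math.log(theta / (1.0 - theta)))
    return r * r * p * (1.0 - p) / (theta * (1.0 - theta))

grid = [i / 1000 for i in range(1, 1000)]
values = [trace(t) for t in grid]
argmax = grid[values.index(max(values))]
print(argmax, trace(0.5))  # maximum point 0.5; value r^2 = 4 at theta = 1/2
```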
In this example, we obtain the Kullback-Leibler risk of the Bayesian predictive density based on the asymptotically constant-risk prior, π(θ; N), explicitly up to O(N^{-2}). We compare this value with the Bayes risk calculated using the Monte Carlo simulation; see Figure 2.
As the sample size, N, grows, the difference becomes negligible. Further, we compare this value with the risk itself calculated by the Monte Carlo simulation; see Figure 3. As the sample size, N, grows, the risk becomes more nearly constant.
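The risk computation behind these comparisons can be imitated by exact enumeration in a toy version of this setting. As a stand-in for the asymptotically constant-risk prior we use a Beta(√N/2, √N/2) prior (an assumption for illustration: the exact prior of this section is not Beta), with illustrative covariates z = 1, z̃ = 2 and zero intercepts:

```python
import math

# Illustrative Kullback-Leibler risk of a Bayesian predictive density in a
# covariate-shift binomial setting (logit Pi_y = 2 * logit theta).  The
# Beta(sqrt(N)/2, sqrt(N)/2) prior is a stand-in for the asymptotically
# constant-risk prior of this section, so the numbers are only indicative.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def pi_y(theta, r=2.0):
    return sigmoid(r * math.log(theta / (1.0 - theta)))

def kl_bern(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def risk(theta, N):
    a = math.sqrt(N) / 2.0
    grid = [(i + 0.5) / 400 for i in range(400)]   # integration grid on (0, 1)
    pivals = [pi_y(t) for t in grid]
    total = 0.0
    for s in range(N + 1):                         # enumerate the data exactly
        w = math.comb(N, s) * theta**s * (1 - theta)**(N - s)
        # posterior Beta(a + s, a + N - s), evaluated on the grid in log scale
        logpost = [(a + s - 1) * math.log(t) + (a + N - s - 1) * math.log(1 - t)
                   for t in grid]
        m = max(logpost)
        post = [math.exp(lp - m) for lp in logpost]
        q_hat = sum(p * pv for p, pv in zip(post, pivals)) / sum(post)
        total += w * kl_bern(pi_y(theta), q_hat)
    return total

r10, r40 = risk(0.5, 10), risk(0.5, 40)
print(r10, r40)  # the risk shrinks roughly like the N^{-1} leading term
```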

Discussion and Conclusions
We have considered the setting where the quantity, g^{X,ij}(θ)g^Y_ij(θ), that is, the trace of the product of the inverse Fisher information matrix, g^{X,ij}(θ), and the Fisher information matrix, g^Y_ij(θ), has a unique maximum point, and we have investigated the asymptotically constant-risk prior in the sense that the asymptotic risk is constant up to O(N^{-2}).
In Section 3, we have considered the prior depending on the sample size, N, and constructed the asymptotically constant-risk prior using Equations (3) and (4). In Section 4, we have clarified the relationship between the subminimax estimator problem based on the mean squared error and the prediction where the distributions of the data and the target variables are different. In Section 5, we have constructed the asymptotically constant-risk prior in the prediction based on the logistic regression model under the covariate shift.
We have assumed that the trace, g^{X,ij}(θ)g^Y_ij(θ), is finite. However, the trace may diverge on a non-compact parameter space; for example, it diverges in the predictive setting where the distribution, q(y|θ), of the target variable is a Poisson distribution and the data distribution, p(x|θ), is an exponential distribution, with Θ equivalent to R. Therefore, in such a setting, adopting criteria other than minimaxity is a direction for future work.
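The divergence can be made concrete. Taking θ > 0 as the common parameter (an assumed parametrization for illustration), exponential data with rate θ have Fisher information 1/θ², and a Poisson target with mean θ has Fisher information 1/θ, so the trace equals θ² · (1/θ) = θ, which is unbounded:

```python
# Closed-form check of the divergence: for exponential data with rate theta
# (Fisher information 1/theta^2) and a Poisson target with mean theta
# (Fisher information 1/theta), the trace g^{X} g^Y reduces to theta itself,
# which grows without bound over the parameter space.

def trace(theta):
    g_X_inv = theta ** 2      # inverse Fisher information of Exp(theta)
    g_Y = 1.0 / theta         # Fisher information of Poisson(theta)
    return g_X_inv * g_Y

print([trace(t) for t in [1.0, 10.0, 100.0]])  # grows linearly in theta
```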
Proof. By the definition of θ̂_π, we obtain an estimating equation for θ̂_π. From our assumption that the prior, π(θ; N), has the form (C1), we rewrite this estimating equation accordingly. By applying a Taylor expansion around θ to the rewritten equation, we derive an expansion of the deviation, θ̂_π − θ. From the law of large numbers and the central limit theorem, we rewrite this expansion as Expansion (11). By substituting the deviation, θ̂_π − θ, recursively into Expansion (11), we obtain Expansion (10).