An Information Criterion for Auxiliary Variable Selection in Incomplete Data Analysis

Statistical inference is considered for variables of interest, called primary variables, when auxiliary variables are observed along with the primary variables. We consider the setting of incomplete data analysis, where some primary variables are not observed. Utilizing a parametric model of joint distribution of primary and auxiliary variables, it is possible to improve the estimation of parametric model for the primary variables when the auxiliary variables are closely related to the primary variables. However, the estimation accuracy reduces when the auxiliary variables are irrelevant to the primary variables. For selecting useful auxiliary variables, we formulate the problem as model selection, and propose an information criterion for predicting primary variables by leveraging auxiliary variables. The proposed information criterion is an asymptotically unbiased estimator of the Kullback–Leibler divergence for complete data of primary variables under some reasonable conditions. We also clarify an asymptotic equivalence between the proposed information criterion and a variant of leave-one-out cross validation. Performance of our method is demonstrated via a simulation study and a real data example.


Introduction
Auxiliary variables are often observed along with primary variables. Here, the primary variables are random variables of interest, and our purpose is to estimate their predictive distribution, i.e., a probability distribution of the primary variables in future test data, while the auxiliary variables are random variables that are observed in training data but not included in the primary variables. We assume that the auxiliary variables are not observed in the test data, or we do not use them even if they are observed in the test data. When the auxiliary variables have a close relation with the primary variables, we expect to improve the accuracy of predictive distribution of the primary variables by considering a joint modeling of the primary and auxiliary variables.
The notion of auxiliary variables has been considered in statistics and machine learning literature. For example, the "curds and whey" method [1] and the "coaching variables" method [2] are based on a similar idea for improving prediction accuracy of primary variables by using auxiliary variables. In multitask learning, Caruana [3] improved generalization accuracy of a main task by exploiting extra tasks. Auxiliary variables are also considered in incomplete data analysis, i.e., a part of primary variables are not observed; Mercatanti et al. [4] showed some theoretical results to make parameter estimation better by utilizing auxiliary variables in Gaussian mixture model (GMM).
Although auxiliary variables are expected to be useful for modeling primary variables, they can actually be harmful. As mentioned in Mercatanti et al. [4], using auxiliary variables may affect modeling results adversely because the number of parameters to be estimated increases and a candidate model of the auxiliary variables can be misspecified. Hence, it is important to select useful auxiliary variables. This is formulated as model selection by considering parametric models with auxiliary variables. In this paper, usefulness of auxiliary variables for estimating predictive distribution of primary variables is measured by a risk function based on the Kullback-Leibler (KL) divergence [5] that is often used for model selection. Because the KL risk function includes unknown parameters, we have to estimate it in actual use. Akaike Information Criterion (AIC) proposed by Akaike [6] is one of the most famous criteria, which is known as an asymptotically unbiased estimator of the KL risk function. AIC is a good criterion from the perspective of prediction due to the asymptotic efficiency; see Shibata [7,8]. Takeuchi [9] proposed a modified version of AIC, called Takeuchi Information Criterion (TIC), which relaxes an assumption for deriving AIC, that is, correct specification of candidate model. However, AIC and TIC are derived for primary variables without considering auxiliary variables in the setting of complete data analysis, and therefore, they are not suitable for auxiliary variable selection nor incomplete data analysis.
Incomplete data analysis is widely used in a broad range of statistical problems by regarding a part of primary variables as latent variables that are not observed. This setting also includes complete data analysis as a special case, where all the primary variables are observed. Information criteria for incomplete data analysis have been proposed in previous studies. Shimodaira [10] developed an information criterion based on the KL divergence for complete data when the data are only partially observed. Cavanaugh and Shumway [11] modified the first term of the information criterion of Shimodaira [10] by the objective function of the EM algorithm [12]. Recently, Shimodaira and Maeda [13] proposed an information criterion, which is derived by mitigating a condition assumed in Shimodaira [10] and Cavanaugh and Shumway [11].
However, any of these previously proposed criteria are not derived by taking auxiliary variables into account. Thus, we propose a new information criterion by considering not only primary variable but also auxiliary variables in the setting of incomplete data analysis. The proposed criterion is a generalization of AIC, TIC, and the criterion of Shimodaira and Maeda [13]. To the best of our knowledge, this is the first attempt to derive an information criterion by considering auxiliary variables. Moreover, we show an asymptotic equivalence between the proposed criterion and a variant of leave-one-out cross validation (LOOCV); this result is a generalization of the relationship between TIC and LOOCV [14].
Note that "auxiliary variables" may also be used in other contexts in literature. For example, Ibrahim et al. [15] considered to use auxiliary variables in missing data analysis, which is similar to our usage in the sense that auxiliary variables are highly correlated with missing data. However, they use the auxiliary variables in order to avoid specifying a missing data mechanism; this goal is different from ours, because no missing data mechanism is considered in our study.
The reminder of this paper is organized as follows. Notations as well as the setting of this paper are introduced in Section 2. Illustrative examples of useful and useless auxiliary variables are given in Section 3. The information criterion for selecting useful auxiliary variables in incomplete data analysis is derived in Section 4, and the asymptotic equivalence between the proposed criterion and a variant of LOOCV is shown in Section 5. Performance of our method is examined via a simulation study and a real data analysis in Sections 6 and 7, respectively. Finally, we conclude this paper in Section 8. All proofs are shown in Appendix A.

Incomplete Data Analysis for Primary Variables
First we explain a setting of incomplete data analysis for primary variables in accordance with Shimodaira and Maeda [13]. Let X denote a vector of primary variables, which consists of two parts as X = (Y, Z), where Y denotes the observed part and Z denotes the unobserved latent part. This setting reduces to complete data analysis of X = Y when Z is empty. We write the true density function of X as q x (x) = q x (y, z) and a candidate parametric model of the true density as p x (x; θ) = p x (y, z; θ), where θ ∈ Θ ⊂ R d is an unknown parameter vector and Θ is its parameter space. We assume that x = (y, z) ∈ Y × Z for all density functions, where Y and Z are domains of Y and Z, respectively. Thus the marginal densities of the observed part Y are obtained by q y (y) = q x (y, z)dz and p y (y; θ) = p x (y, z; θ)dz. For denoting densities, we will omit random variables such as q y and p y (θ). We assume that θ is identifiable with respect to p y (θ).
In this paper, we consider only a simple setting of i.i.d. random variables of sample size n. Let x i = (y i , z i ), i = 1, . . . , n, be independent realizations of X, where we only observe y 1 , . . . , y n and we cannot see the values of z 1 , . . . , z n . We estimate θ from the observed training data y 1 , . . . , y n . Then the maximum likelihood estimate (MLE) of θ is given bŷ where y (θ) denotes the log-likelihood function (divided by n) of θ with respect to y 1 , . . . , y n . If we were only interested in Y, we would consider the plug-in predictive distribution p y (θ y ) by substitutingθ y into p y (θ). However, we are interested in the whole primary variable X = (Y, Z) and its density q x . We thus consider p x (θ y ) by substitutingθ y into p x (θ), and evaluate the MLE by comparing p x (θ y ) with q x . For this purpose, Shimodaira and Maeda [13] derived an information criterion as an asymptotically unbiased estimator of the KL risk function which measures how well p x (θ y ) approximates q x .

Statistical Analysis with Auxiliary Variables
Next, we extend the setting to incomplete data analysis with auxiliary variables. Let A denote a vector of auxiliary variables. In addition to Y, we observe A in the training data, but we are not interested in A. For convenience, we introduce a vector of observable variables B = (Y, A) and a vector of all variables C = (Y, Z, A) as summarized in Table 1. Now c i = (y i , z i , a i ), i = 1, . . . , n, are independent realizations of C, and we estimate θ from the observed training data b i = (y i , a i ), i = 1, . . . , n. Letθ b be the MLE of θ by using A in addition to Y. Since we are only interested in the primary variables, we consider the plug-in predictive distribution p x (θ b ) by substitutingθ b into p x (θ), and evaluate the MLE by comparing p x (θ b ) with q x . Table 1. Random variables in incomplete data analysis with auxiliary variables. B = (Y, A) is used for estimation of unknown parameters, and X = (Y, Z) is used for evaluation of candidate models.

Observed Latent Complete
In order to define the MLEθ b , let us clarify a candidate parametric model with auxiliary variables. We write the true density function of C as q c (c) = q c (y, z, a) and a candidate parametric model of the true density as p c (c; β) = p c (y, z, a; β), where β = (θ , ϕ ) ∈ B ⊂ R d+ f is an unknown parameter vector with nuisance parameter ϕ ∈ R f and B is its parameter space. We assume that c = (y, z, a) ∈ Y × Z × A for all density functions, where A is the domain of A. We also assume that β is identifiable with respect to p b (y, a; β) = p c (y, z, a; β)dz. Let us redefine p x (θ) as p x (y, z; θ) = p c (y, z, a; β)da and the parameter space of θ as Then,θ b is obtained from the MLE of β given bŷ where b (β) denotes the log-likelihood function (divided by n) of β with respect to b 1 , . . . , b n . Finally, we introduce a general notation for density functions. For a random variable, say R, we write the true density function as q r (r) and a candidate parametric model of q r as p r (r; θ) or p r (r; β). For random variables R and S, we write the true conditional density function of R given S = s as q r|s (r|s) and its corresponding model as p r|s (r|s; θ) or p r|s (r|s; β). For example, a candidate model of C can be decomposed as p c (y, z, a; β) = p x (y, z; θ)p a|x (a|y, z; β).

Comparing the Two Estimators
We have thus far obtained the two MLEs of θ, namelyθ y andθ b , and their corresponding predictive distributions p x (θ y ) and p x (θ b ), respectively. We would like to determine which of the two predictive distributions approximates q x better than the other. The approximation error of p x (θ) is measured by the KL divergence from q x to p x (θ) defined as Since the last term on the right hand side does not depend on p x (θ), we ignore it for computing the loss function of p x (θ) defined by Letθ be an estimator of θ. The risk (or expected loss) function of p x (θ) is defined by where we take the expectation by consideringθ as a random variable. Note thatθ in the notation of R x (θ) indicates the procedure for computingθ instead of a particular value ofθ. R x (θ) measures how well p x (θ) approximates q x on average in the long run. For comparing the two MLEs, we define R x (θ y ) and R x (θ b ) by considering thatθ y andθ b are functions of independent random variables Y 1 , . . . , Y n and B 1 , . . . , B n , respectively, where B i = (Y i , A i ) has the same distribution as B for all i = 1, . . . , n.θ b is better thanθ y when R x (θ b ) < R x (θ y ), that is, the auxiliary variable A helps the statistical inference on q x . On the other hand, A is harmful when R x (θ b ) > R x (θ y ). Although we focus only on comparison between Y and B = (Y, A) in this paper, if there are more than two auxiliary variables (and their combinations) A 1 , A 2 , . . ., then we may compare R x (θ (y,a 1 ) ), R x (θ (y,a 2 ) ), . . ., to determine good auxiliary variables. Of course, the risk functions cannot be calculated in reality because they depend on the unknown true distribution. Thus, we derive a new information criterion as an estimator of the risk function in our setting. Since an asymptotically unbiased estimator of R x (θ y ) has been already derived in Shimodaira and Maeda [13], we will only derive an asymptotically unbiased estimator of R x (θ b ).

Model Setting
In this section, we demonstrate parameter estimation by using auxiliary variables in Gaussian mixture model (GMM), which can be formulated in incomplete data analysis. Let us consider a two-component GMM; observed values are generated from one of two Gaussian distributions, where the assigned labels are missing. The observed data and missing labels are realizations of Y and Z, respectively. We estimate a predictive distribution of X = (Y, Z) from the observation of Y, and we attempt improving it by utilizing A in addition to Y. The true density function of primary variables X = (Y, Z) ∈ R × {0, 1} is given as where N(·; µ, σ 2 ) denotes the density function of N(µ, σ 2 ), i.e., the normal distribution with mean µ and variance σ 2 . We consider the following two cases for the true conditional distribution of auxiliary variable A given X = x: The random variables X and A are not independent in Case 1 whereas they are independent in Case 2. Hence, A will contribute to estimating θ in Case 1. On the other hand, in Case 2, A must not be useful, and A becomes just noise if we estimate θ from Y and A.

Estimation Results
For illustrating the impact of auxiliary variables on parameter estimation in each case, we generated a typical dataset c 1 , . . . , c n with sample size n = 100 from q c , which is actually picked from 10,000 datasets generated in the simulation study of Section 6, and details of how to select the typical dataset are also shown there. For each case, we computed the three MLEsθ y ,θ b , andθ x , wherê θ x is the MLE of θ calculated by using complete data x 1 , . . . , x n as if labels z 1 , . . . , z n were available.
The result of Case 1 is shown in Figure 1, where A is beneficial for estimating θ. In the left panel, the two clusters are well separated, which makes parameter estimation stable. The estimated p b (β b ) captures the structure of the two clusters corresponding to the label z i = 0 and z i = 1, showing that p c (β b ) is estimated reasonably well, and thus p x (θ b ) is a good approximation of q x . Looking at the right panel, we also observe that p y (θ b ) is better than p y (θ y ) for approximating p y (θ x ), suggesting that the auxiliary variable is useful for recovering the lost information of missing data. In fact, the three MLEs are calculated as follows:θ y = (0.671, −1.143, 1.324, 0.678) ,θ b = (0.613, −1.228, 1.093, 0.744) , andθ x = (0.620, −1.233, 1.141, 0.695) . By comparing θ b −θ x = 0.069 with θ y −θ x = 0.212, we can see thatθ b is better thanθ y for predictingθ x without looking at the latent variable. All these observations indicate that the parameter estimation of θ is improved by using A in Case 1.
The result of Case 2 is shown in Figure 2, where A is harmful for estimating θ. For fair comparison, exactly the same values of {(y i , z i )} 100 i=1 are used in both cases. Thus,θ y andθ x have the same values as in Case 1 whereasθ b has a different value asθ b = (0.581, −0.403, −0.232, 2.015) . By comparing θ b −θ x = 2.078 with θ y −θ x = 0.212, we can see thatθ b is worse thanθ y for predictingθ x . This is also seen in Figure 2. In the left panel, the estimated p b (β b ) captures some structure of the two clusters, but they do not correspond to the label z i = 0 and z i = 1. As a result, p y (θ b ) becomes a very poor approximation of p y (θ x ) in the right panel, indicating that the parameter estimation of θ is actually hindered by using A in Case 2. i=1 , and three density functions p y (θ x ) (broken line), p y (θ y ) (dotted line), and p y (θ b ) (solid line). In Section 4.4, this useful auxiliary variable is selected by our method (Case 1 in Table 2). Table 2. Comparisons betweenθ b andθ y for predicting X, and that for Y. These examples suggest that usefulness of auxiliary variables depends strongly on the true distribution and a candidate model. Hence, it is important to select useful auxiliary variables from observed data.  Table 2).

Asymptotic Expansion of the Risk Function
In this section, we derive a new information criterion as an asymptotically unbiased estimator of the risk function R x (θ b ) defined in (3). We start from a general framework of misspecification, i.e., without assuming that candidate models are correctly specified, and later we give specific assumptions. Letβ be the optimal parameter value with respect to the KL divergence from q b to p b (β), that is, If the candidate model is correctly specified, i.e., there exists β 0 = (θ 0 , ϕ 0 ) such that q b = p b (β 0 ), thenβ = β 0 as well asθ = θ 0 . In this paper, we assume the regularity conditions A1 to A6 of White [16] for q b and p b (β) so that the MLEβ b has consistency and asymptotic normality. In particular,β is determined uniquely (i.e., identifiable) and is interior to B. We assume that I b and J b defined below are nonsingular in the neighbourhood ofβ. Then White [16] showed the asymptotic normality as n → ∞, where Note that we write derivatives by abbreviated forms, e.g., ∇ 2 log p b (b;β) means ∇ 2 log p b (b; β)| β=β and so on. In addition, we allow interchange of integrals and derivatives rather formally when working with models, although we actually need conditions for the models such as White [16]. Moreover, the condition A7 of White [16] is assumed in order to establish I b = J b when considering a situation that the candidate model is correctly specified. We assume the above conditions throughout the paper without explicitly stated.
Let us define three (d + f ) × (d + f ) matrices as log p x (x;θ)], I y = −E[∇ 2 log p y (y;θ)], I z|y = −E[∇ 2 log p z|y (z|y;θ)] = I x − I y , which will be used in the lemmas below. Since the derivatives of log p x (x; θ) and log p y (y; θ) with respect to ϕ is zero, the matrices become singular when f > 0, but this is not a problem in our calculation. The following lemma shows that the dominant term of R x (θ b ) is L x (θ) and the remainder terms are of order O(n −1 ), by noting that ∇ L The proof is given in Appendix A.1.
Just as a remark, the term ∇ L x (θ)E[β b −β] = O(n −1 ) above does not appear in the derivation of AIC or TIC, where B = X and thus ∇ L x (θ) = 0. This term appears when the loss function for evaluation and that for estimation differ, for example, in the derivation of the information criterion under covariate shift; see K [1] w b w in Equation (4.1) of Shimodaira [17].

Estimating the Risk Function
For deriving an estimator of R x (θ b ), we introduce an additional condition. Let us assume that the candidate model is correctly specified for the latent part as q z|y (z|y) = p z|y (z|y;θ).
Since Z is missing completely in our setting, we need such a condition to proceed further. Although any method cannot detect misspecification of p z|y if p b is correctly specified, it is often the case that misspecification of p z|y leads to that of p b , and thus it is detected indirectly as in Case 2 of Section 3. Note that the symbol ofθ in our notation should have beenθ b , although we usedθ for simplicity, and there is alsoθ x defined similarly from p x (x; θ). They all differ each other with differences of order O(1) in general, butθ =θ y =θ x = θ 0 when p c (β) is correctly specified as q c = p c (β 0 ). Now we give the asymptotic expansion of E[ y (θ b )], which shows that − y (θ b ) can be used as an estimator of L x (θ) but the asymptotic bias is of order O(n −1 ).

Lemma 2.
Assume the condition (6). Then, the expectation of the estimated log-likelihood y (θ b ) can be expanded as The proof of Lemma 2 is given in Appendix A.2. By eliminating L x (θ) from the two expressions in Lemma 1 and Lemma 2, and rearranging the formula, we get the following lemma, which plays a central role in deriving our information criterion. Lemma 3. Assume the condition (6). Then, an expansion of the risk function R x (θ b ) is given by We can ignore C(q x ) for model selection, because it is a constant term which does not depend on the candidate model. Thus, finally, we define an information criterion from the right hand side of (7). The following theorem is an immediate consequence of Lemma 3. Theorem 1. Assume the condition (6). Let us define an information criterion as Then this criterion is an asymptotically unbiased estimator of 2nR x (θ b ) by ignoring the constant term C(q x ).
Note that the subscript of risk x;b , x; b is defined in accordance with Shimodaira and Maeda [13]; thus the former x and the latter b mean random variables used in evaluation and estimation, respectively. This criterion is an extension of TIC because when X = B = Y, risk x;b coincides with TIC of Takeuchi [9] defined as follows: TIC = −2n y (θ y ) + 2tr(I −1 y J y ).

Akaike Information Criteria for Auxiliary Variable Selection
In actual use, risk x;b may have a too complicated form. Thus, we derive a simpler information criterion by assuming the correctness of the candidate model like as AIC.
Theorem 2. Suppose p c (β) is correctly specified so that q c = p c (β 0 ) for some β 0 ∈ B. Then, we have and thus risk x;b is rewritten as This criterion is an asymptotically unbiased estimator of 2nR x (θ b ) by ignoring the constant term C(q x ).
The proof is given in Appendix A. 3. I x , I y and I b are replaced by their consistent estimators in practical situations.
The newly obtained criterion AIC x;b is a generalization of AIC and some of its variants. If θ is estimated byθ y instead ofθ b , we simply let B = Y in the expression of AIC x;b so that we get AIC x;y proposed by Shimodaira and Maeda [13]: Note that if B = Y, I y is not singular because β = θ. On the other hand, if there is no latent part, we simply let X = Y in the expression of AIC x;b so that we get This can be used to select useful auxiliary variables in complete data analysis. Moreover, if X = Y = B, AIC x;b reduces to the original AIC proposed by Akaike [6]: It is worth mentioning that tr(I z|y I −1 b ) is interpreted as the additional penalty for the latent part: which is also mentioned in Equation (1) of Shimodaira and Maeda [13] for the case of B = Y.

The Illustrative Example (Cont.)
Let us return to the problem of determining whether to use the auxiliary variables or not, that is, comparison between p x (θ b ) and p x (θ y ). By comparing AIC x;b with AIC x;y , we can determine whether the vector of auxiliary variables A is useful or useless. Thus, only when AIC x;b < AIC x;y , we conclude that A is useful in order to estimate θ for predicting X.
Let us apply this procedure to the illustrative example in Section 3. The generalized AICs are computed for the two cases of the typical dataset, and the results are shown in Table 2. Looking at the value of AIC x;b − AIC x;y , it is negative for Case 1, concluding that the auxiliary variable is useful, and it is positive for Case 2, concluding that the auxiliary variable is useless. According to the AIC values, therefore, we use the auxiliary variable of Case 1, but do not use the auxiliary variable of Case 2. This decision agrees with the observations of Figures 1 and 2 in Section 3.2. In fact, the decision is correct, because the value of R x (θ b ) − R x (θ y ) is negative for Case 1 and positive for Case 2 as will be seen in the simulation study of Section 6.2.
We can also argue the usefulness of the auxiliary variable for predicting Y instead of X, that is, comparison between p y (θ b ) and p y (θ y ). By comparing AIC y;b with AIC y;y , we can determine whether A is useful or useless for predicting Y. Looking at the value of AIC y;b − AIC y;y in Table 2, we make the same decision as that for X.

Leave-One-Out Cross Validation
Variable selection by cross-validatory (CV) choice [18] is often applied to real data analysis due to its simplicity, although its computational burden is larger than that of information criteria; see Arlot and Celisse [19] for a recent review of cross-validation methods. As shown in Stone [14], leave-one-out cross validation (LOOCV) is asymptotically equivalent to TIC. Because LOOCV does not require calculation of the information matrices of TIC, LOOCV is easier to use than TIC. There are also some literature for improving LOOCV such as Yanagihara et al. [20], which gives a modification of LOOCV to reduce its bias by considering maximum weighted log-likelihood estimation. However, we focus on the result of Stone [14] and extend it to our setting.
In incomplete data analysis, LOOCV cannot be directly used because the loss function with respect to the complete data includes latent variables. Thus, we transform the loss function as follows: where g(y; θ) = log p y (y; θ) + f (y; θ) and f (y; θ) = q z|y (z|y) log p z|y (z|y; θ)dz.
Note that f (y; θ) = 0 when X = Y. Using the function g(y; θ), we then obtain the following LOOCV estimator of the risk function R x (θ b ).
is the leave-out-out estimate of θ defined aŝ We will show below in this section that L cv x (θ b ) is asymptotically equivalent to risk x;b . For implementing the LOOCV procedure with latent variables, however, we have to estimate q z|y (z|y) by p z|y (z|y,θ b ) in f (y; θ). This introduces a bias to L cv x (θ b ), and hence, information criteria are preferable to the LOOCV in incomplete data analysis.
Let us show the asymptotic equivalence of L cv x (θ b ) and risk x;b by assuming that we know the functional form of f (y; θ). Noting thatβ

By applying Taylor expansion to
whereβ i b lies betweenβ We can see from (14) thatβ . Next, we regard g(y i ; θ) as a function of β and apply Taylor expansion to it around β =β b . Therefore, g(y i ;θ ) can be expressed as follows: whereθ i b lies betweenθ . Then we assume that By notingβ , and thus (16) holds at least formally. With the above setup, we show the following theorem. The proof is given in Appendix A.4.
Because the second term on the right-hand side of (17) does not depend on candidate models under condition (6), this theorem implies that L cv x (θ b ) is asymptotically equivalent to risk x;b except for the scaling and the constant term. However, someone may wonder why f (y; θ) is included in g(y; θ) for comparing models of p(b; β). By assuming that p z|y (θ) is correctly specified for q z|y , f (y;θ) = q z|y (z|y) log q z|y (z|y)dz does not depend on the model anymore, so we may simply exclude f (y; θ) from g(y; θ), leading to the loss L y (θ) instead. The reason for including f (y; θ) in g(y; θ) is explained as follows. L cv x (θ b ), as well as risk x;b (and AIC x;b ), include the additional penalty for estimatingθ b in f (y;θ b ), which depends on the candidate models even if p z|y (θ) is correctly specified.

Experiments with Simulated Datasets
This section shows the usefulness of auxiliary variables and the proposed information criteria via a simulation study. The models illustrated in Section 3 are used for confirming the asymptotic unbiasedness of the information criterion and the validity of auxiliary variable selection.

Unbiasedness
At first, we confirm the asymptotic unbiasedness of AIC x;b for estimating 2nR x (θ b ) except for the constant term, C(q x ). The simulation setting is the same as Case 1 in Section 3, thus the data generating model is given by q b|z (y, a|z) = zN 2 ((y, a) ; µ 10 , Σ 0 ) + (1 − z)N 2 ((y, a) ; µ 20 , Σ 0 ), where µ 10 = −µ 20 = (−1.2, 1.8) and Σ 0 = diag(0.7, 0.49). We generated T = 10 4 independent replicates of the dataset {(y i , z i , a i )} n i=1 from this model; in fact, we used {(y i , z i , a i,1 )} n i=1 generated in Section 6.2. The candidate model is given by (4), which is correctly specified for the above data generating model.

Auxiliary Variable Selection
Next, we demonstrate that the proposed AIC selects a useful auxiliary variable (Case 1), while it does not select a useless auxiliary variable (Case 2). In each case, we generated T = 10 4 independent replicates of the dataset {(y i , z i , a i )} n i=1 from the model. In fact, the values of {(y i , z i )} n i=1 are shared in both cases, so we generated replicates of {(y i , z i , a i,1 , a i,2 )} n i=1 , where a i,1 and a i,2 are auxiliary variables for Case 1 and Case 2, respectively. In each case, we compute AIC x;b and AIC x;y , then we selectθ b (i.e., selecting the auxiliary variable A) if AIC x;b < AIC x;y and selectθ y (i.e., not selecting the auxiliary variable A) otherwise. The selected estimator is denoted asθ best . This experiment was repeated for T = 10 4 times. Note that the typical dataset in Section 3 was picked from the generated datasets so that it has around the median value in each of L x (θ b ) − L x (θ y ), L y (θ b ) − L y (θ y ), AIC x;b − AIC x;y , and AIC y;b − AIC y;y in both cases.
The selection frequencies are shown in Tables 4 and 5. We observe that, as expected, the useful auxiliary variable tends to be selected in Case 1, while the useless auxiliary variable tends to be not selected in Case 2.
For verifying the usefulness of the auxiliary variable in both cases, we computed the risk value R x (θ) forθ =θ y ,θ b , andθ best . They are approximated by the simulation average as The results are shown in Tables 6 and 7. For easier comparisons, the values are the differences from L x (θ 0 ) with the true value θ 0 . For all n, we observe that, as expected, , R x (θ y )}, indicating that the variable selection is working well.

Experiments with Real Datasets
We show an example of auxiliary variable selection using Wine Data Set available at UCI Machine Learning Repository [21], which consists of 1 categorical variable (3 categories) and 13 continuous variables, denoted as V 1 , . . . , V 13 . For simplicity, we only use the first two categories and regard them as a latent variable Z ∈ {0, 1}; the experiment results were similar to the other combinations. The sample size is then n = 130 and all variables except for Z are standardized. We set one of the 13 continuous variables as the observed primary variable Y, and set the rest of 12 variables as auxiliary variables A 1 , . . . , A 12 . For example, if Y is V 1 , then A 1 , . . . , A 12 are V 2 , . . . , V 13 . The dataset is now {(y i , z i , a i,1 , . . . , a i,12 )} n i=1 , which is randomly divided into the training set with sample size n tr = 86 (z i is not used) and the test set with sample size n te = 44 (a i,1 , . . . , a i,12 are not used).
In the experiment, we compute AIC x;b for B = (Y, A ), = 1, . . . , 12, and AIC x;y for Y from the training dataset using the model (4). We selectθ best fromθ b 1 , . . . ,θ b 12 andθ y by finding the minimum of the 13 AIC values. Thus we are selecting one of the auxiliary variables A 1 , . . . , A 12 or not selecting any of them. It is possible to select a combination of the auxiliary variables, but we did not attempt such an experiment. For measuring the generalization error, we compute L x (θ y ) − L x (θ best ) from the test set as where D te ⊂ {1, . . . , n} represents the test set. The assignment problem of L x (·) mentioned in Section 6 is avoided by a similar manner. For each case of Y = V , = 1, . . . , 13, the above experiment was repeated 100 times, and the experiment average of the generalization error was computed. The result is shown in Table 8. A positive value indicates thatθ best performed better thanθ y . We observe thatθ best is better than or almost the same asθ y for all cases = 1, . . . , 13, suggesting that AIC works well to select a useful auxiliary variable. Table 8. Experiment average of n te {L(θ y ) − L x (θ best )} for each case of Y = V , = 1, . . . , 13. Standard errors are in parenthesis.

Conclusions
We often encounter a dataset composed of various variables. If only some of the variables are of interest, then the rest of the variables can be interpreted as auxiliary variables. Auxiliary variables may be able to improve estimation accuracy of unknown parameters but they could also be harmful. Hence, it is important to select useful auxiliary variables.
In this paper, we focused on exploiting auxiliary variables in incomplete data analysis. The usefulness of auxiliary variables is measured by a risk function based on the KL divergence for complete data. We derived an information criterion which is an asymptotically unbiased estimator of the risk function except for a constant term. Moreover, we extended a result of Stone [14] to our setting and proved asymptotic equivalence between a variant of LOOCV and the proposed criteria. Since LOOCV requires an additional condition for its justification, the proposed criteria are preferable to LOOCV.
This study assumes that variables are different between training set and test set. There are other settings, such as covariate shift [17] and transfer learning [22], where distributions are different between the training set and test set. It will be possible to combine these settings to construct a generalized framework. It is also possible to extend our study for taking account of a missing mechanism. We will leave these extensions as future works.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs
Appendix A.1. Proof of Lemma 1 Proof. Taylor expansion of L x (θ) around θ =θ, by formally taking it as a function of β, gives where ∇ 2 L x (θ) = I x is used above. By taking the expectation of both sides, where the asymptotic variance ofβ b in (5) is given as Proof. Taylor expansion of y (θ) around θ =θ, by formally taking it as a function of β, gives y (θ b ) = y (θ) + ∇ y (θ)(β b −β) − In the last expression, we used (A1) for the asymptotic variance ofβ b . For working on the second term in (A2), we first derive an expression ofβ b −β. Taylor expansion of the score function ∇ b (β) around β =β gives where ∇ 2 b (β) = −I b + o p (1) is used above. By noticing ∇ b (β b ) = 0, we thus obtain where E[∇ b (β)] = 0 and each term in the summation has mean zero, because E[∇ log p b (b;β)] = ∇E[log p b (b;β)] = 0. Now we are back to the the second term in (A2). Using (A3), we have By noting E[∇ b (β)] = 0, the expectation of the second term in (A4) is We next show that E[∇ y (θ)] = −∇ L x (θ). Let us recall that we have assumed q z|y (z|y) = p z|y (z|y;θ) in (6), which leads to E[∇ log p z|y (z|y;θ)] = q y (y) p z|y (z|y;θ)∇ log p z|y (z|y;θ) dzdy = q y (y) ∇p z|y (z|y;θ) dzdy = q y (y)∇ p z|y (z|y;θ) dzdy = 0.  Proof. First recall that we have assumed that q c (c) = p c (c; β 0 ), which also implies the condition (6) as q z|y (z|y) = p z|y (z|y; θ 0 ) withβ = β 0 . Thus Theorem 1 holds. Substituting J b = I b and K b,y = I y in the penalty term of (8) giving the penalty term of (10). Therefore, we only have to show (9). Noting the identity it follows from q b (b) = p b (b; β 0 ) that Note that the same result can be obtained from Theorem 3.3 in White [16]. Next we show K b,y = I y .
Hence, the proof is complete.