# An Information Criterion for Auxiliary Variable Selection in Incomplete Data Analysis


## Abstract


## 1. Introduction

## 2. Preliminaries

#### 2.1. Incomplete Data Analysis for Primary Variables

#### 2.2. Statistical Analysis with Auxiliary Variables

#### 2.3. Comparing the Two Estimators

## 3. An Illustrative Example with Auxiliary Variables

#### 3.1. Model Setting

- Case 1: ${q}_{a|x}(a|y,z)={q}_{a|z}(a|z)=zN(a;1.8,0.49)+(1-z)N(a;-1.8,0.49)$.
- Case 2: ${q}_{a|x}(a|y,z)={q}_{a}(a)=0.6N(a;1.8,0.49)+0.4N(a;-1.8,0.49)$.
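The two cases can be simulated with a minimal sketch like the one below. It makes two assumptions that are not stated in the formulas themselves: the second argument of $N(a;\cdot,0.49)$ is read as a variance (standard deviation 0.7), and the latent label is drawn with $P(z=1)=0.6$, which is the value that would make the marginal of $a$ in Case 1 match the mixture in Case 2. The function names are illustrative only.

```python
import numpy as np

MU = 1.8             # component means are +MU and -MU
SD = np.sqrt(0.49)   # assumption: 0.49 is a variance, so sd = 0.7

def sample_a_case1(z, rng):
    """Case 1: a depends on the latent label z (useful auxiliary variable)."""
    z = np.asarray(z)
    mean = np.where(z == 1, MU, -MU)
    return rng.normal(mean, SD)

def sample_a_case2(n, rng):
    """Case 2: a is drawn from a fixed mixture, independently of (y, z)."""
    comp = rng.random(n) < 0.6          # mixture weight 0.6 on the +MU component
    mean = np.where(comp, MU, -MU)
    return rng.normal(mean, SD)

rng = np.random.default_rng(0)
z = (rng.random(100_000) < 0.6).astype(int)   # assumption: P(z = 1) = 0.6
a1 = sample_a_case1(z, rng)
a2 = sample_a_case2(z.size, rng)

# In Case 1 the auxiliary variable tracks z; in Case 2 it carries no information on z.
print(np.corrcoef(a1, z)[0, 1], np.corrcoef(a2, z)[0, 1])
```

Under these assumptions the sample correlation between $a$ and $z$ is high in Case 1 and close to zero in Case 2, which is the sense in which the auxiliary variable is useful or useless.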

#### 3.2. Estimation Results

## 4. Information Criterion

#### 4.1. Asymptotic Expansion of the Risk Function

**Lemma 1.**

#### 4.2. Estimating the Risk Function

**Lemma 2.**

**Lemma 3.**

**Theorem 1.**

#### 4.3. Akaike Information Criteria for Auxiliary Variable Selection

**Theorem 2.**

#### 4.4. The Illustrative Example (Cont.)

## 5. Leave-One-Out Cross Validation

**Theorem 3.**

## 6. Experiments with Simulated Datasets

#### 6.1. Unbiasedness

#### 6.2. Auxiliary Variable Selection

## 7. Experiments with Real Datasets

## 8. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Proofs

#### Appendix A.1. Proof of Lemma 1

**Proof.**

#### Appendix A.2. Proof of Lemma 2

**Proof.**

#### Appendix A.3. Proof of Theorem 2

**Proof.**

#### Appendix A.4. Proof of Theorem 3

**Proof.**


**Figure 1.** Useful auxiliary variable (Case 1). The left panel plots ${\{({y}_{i},{a}_{i})\}}_{i=1}^{100}$ with labels indicating ${z}_{i}$. The estimated ${p}_{b}({\widehat{\beta}}_{b})$ is shown by the contour lines. The right panel shows the histogram of ${\{{y}_{i}\}}_{i=1}^{100}$ and three density functions: ${p}_{y}({\widehat{\theta}}_{x})$ (broken line), ${p}_{y}({\widehat{\theta}}_{y})$ (dotted line), and ${p}_{y}({\widehat{\theta}}_{b})$ (solid line). In Section 4.4, this useful auxiliary variable is selected by our method (Case 1 in Table 2).

**Figure 2.** Useless auxiliary variable (Case 2). The symbols are the same as in Figure 1. In Section 4.4, this useless auxiliary variable is not selected by our method (Case 2 in Table 2).

**Table 1.** Random variables in incomplete data analysis with auxiliary variables. $B=(Y,A)$ is used for estimation of unknown parameters, and $X=(Y,Z)$ is used for evaluation of candidate models.

| | Observed | Latent | Complete |
|---|---|---|---|
| Primary | Y | Z | X |
| Auxiliary | A | – | – |
| All | B | – | C |

**Table 2.** Comparison of ${\widehat{\theta}}_{b}$ and ${\widehat{\theta}}_{y}$ for predicting X and for predicting Y.

| | ${p}_{x}({\widehat{\theta}}_{b})$ vs. ${p}_{x}({\widehat{\theta}}_{y})$ | ${p}_{y}({\widehat{\theta}}_{b})$ vs. ${p}_{y}({\widehat{\theta}}_{y})$ |
|---|---|---|
| | ${\mathrm{AIC}}_{x;b}-{\mathrm{AIC}}_{x;y}$ | ${\mathrm{AIC}}_{y;b}-{\mathrm{AIC}}_{y;y}$ |
| Case 1 | −2.67 | −0.96 |
| Case 2 | 9.86 | 10.37 |
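The way these differences translate into a selection can be sketched as follows. The rule used here, preferring ${\widehat{\theta}}_{b}$ when ${\mathrm{AIC}}_{x;b}-{\mathrm{AIC}}_{x;y}$ is negative, is the usual minimum-AIC convention and is assumed rather than quoted from the text; it is consistent with the Figure 1 and Figure 2 captions. The function name `select_estimator` is hypothetical.

```python
def select_estimator(aic_diff):
    """Return the estimator with the smaller AIC.

    aic_diff is AIC_{x;b} - AIC_{x;y}; a negative value favours theta_b
    (the estimator that uses the auxiliary variable).
    """
    return "theta_b" if aic_diff < 0 else "theta_y"

# AIC_{x;b} - AIC_{x;y} values from the first column of Table 2.
table2 = {"Case 1": -2.67, "Case 2": 9.86}

choices = {case: select_estimator(diff) for case, diff in table2.items()}
for case, chosen in choices.items():
    print(case, "->", chosen)
```

With the Table 2 values, the auxiliary variable is used in Case 1 and discarded in Case 2.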

**Table 3.** Expected Akaike Information Criterion (AIC) difference compared with the risk difference. The values are computed from $T={10}^{4}$ simulation runs, with standard errors in parentheses.

| n | 100 | 200 | 500 | 1000 | 2000 | 5000 |
|---|---|---|---|---|---|---|
| $E[{\mathrm{AIC}}_{x;b}-{\mathrm{AIC}}_{x;y}]$ | −3.559 | −3.263 | −3.221 | −3.197 | −3.195 | −3.180 |
| | (0.074) | (0.021) | (0.015) | (0.013) | (0.013) | (0.012) |
| $2n\{{\mathcal{R}}_{x}({\widehat{\theta}}_{b})-{\mathcal{R}}_{x}({\widehat{\theta}}_{y})\}$ | −3.603 | −3.333 | −3.275 | −3.208 | −3.182 | −3.232 |
| | (0.071) | (0.054) | (0.050) | (0.050) | (0.050) | (0.050) |
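One rough way to read Table 3 is to ask whether the two rows differ by more than their reported standard errors allow. The sketch below computes a z-score for each $n$ from the tabulated values. It assumes the two estimates are independent, which is a simplification of mine and may not hold if both rows come from the same simulation runs, so the result should be taken as indicative only.

```python
import math

n_values  = [100, 200, 500, 1000, 2000, 5000]
aic_diff  = [-3.559, -3.263, -3.221, -3.197, -3.195, -3.180]  # E[AIC_{x;b} - AIC_{x;y}]
aic_se    = [0.074, 0.021, 0.015, 0.013, 0.013, 0.012]
risk_diff = [-3.603, -3.333, -3.275, -3.208, -3.182, -3.232]  # 2n{R_x(theta_b) - R_x(theta_y)}
risk_se   = [0.071, 0.054, 0.050, 0.050, 0.050, 0.050]

def z_score(m1, s1, m2, s2):
    """Difference of two estimates divided by its standard error,
    assuming the estimates are independent (an assumption, see above)."""
    return (m1 - m2) / math.hypot(s1, s2)

z = {
    n: z_score(a, sa, r, sr)
    for n, a, sa, r, sr in zip(n_values, aic_diff, aic_se, risk_diff, risk_se)
}

for n, value in z.items():
    print(f"n = {n:5d}: z = {value:+.2f}")
```

Under the independence assumption, every $|z|$ stays below 2, so the tabulated AIC differences are not distinguishable from the risk differences at the reported precision.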

**Table 4.** Useful auxiliary variable (Case 1): selection frequencies of ${\widehat{\theta}}_{b}$ and ${\widehat{\theta}}_{y}$.

| n | 100 | 200 | 500 | 1000 | 2000 | 5000 |
|---|---|---|---|---|---|---|
| ${\widehat{\theta}}_{b}$ | 9230 | 9475 | 9649 | 9687 | 9711 | 9727 |
| ${\widehat{\theta}}_{y}$ | 770 | 525 | 351 | 313 | 289 | 273 |

**Table 5.** Useless auxiliary variable (Case 2): selection frequencies of ${\widehat{\theta}}_{b}$ and ${\widehat{\theta}}_{y}$.

| n | 100 | 200 | 500 | 1000 | 2000 | 5000 |
|---|---|---|---|---|---|---|
| ${\widehat{\theta}}_{b}$ | 1508 | 212 | 1 | 0 | 0 | 0 |
| ${\widehat{\theta}}_{y}$ | 8492 | 9788 | 9999 | 10,000 | 10,000 | 10,000 |

**Table 6.** Useful auxiliary variable (Case 1): estimated risk functions of ${\widehat{\theta}}_{b}$, ${\widehat{\theta}}_{y}$, and ${\widehat{\theta}}_{best}$, with standard errors in parentheses.

| n | 100 | 200 | 500 | 1000 | 2000 | 5000 |
|---|---|---|---|---|---|---|
| $2n\{{\mathcal{R}}_{x}({\widehat{\theta}}_{b})-{\mathcal{L}}_{x}({\theta}_{0})\}$ | 4.229 | 4.079 | 4.051 | 4.039 | 4.029 | 4.033 |
| | (0.032) | (0.030) | (0.029) | (0.028) | (0.029) | (0.028) |
| $2n\{{\mathcal{R}}_{x}({\widehat{\theta}}_{y})-{\mathcal{L}}_{x}({\theta}_{0})\}$ | 7.831 | 7.412 | 7.326 | 7.247 | 7.211 | 7.266 |
| | (0.078) | (0.061) | (0.058) | (0.058) | (0.058) | (0.058) |
| $2n\{{\mathcal{R}}_{x}({\widehat{\theta}}_{best})-{\mathcal{L}}_{x}({\theta}_{0})\}$ | 5.109 | 4.741 | 4.501 | 4.491 | 4.479 | 4.454 |
| | (0.052) | (0.045) | (0.041) | (0.042) | (0.042) | (0.041) |

**Table 7.** Useless auxiliary variable (Case 2): estimated risk functions of ${\widehat{\theta}}_{b}$, ${\widehat{\theta}}_{y}$, and ${\widehat{\theta}}_{best}$, with standard errors in parentheses.

| n | 100 | 200 | 500 | 1000 | 2000 | 5000 |
|---|---|---|---|---|---|---|
| $2n\{{\mathcal{R}}_{x}({\widehat{\theta}}_{b})-{\mathcal{L}}_{x}({\theta}_{0})\}$ | 105.527 | 214.659 | 543.685 | 1091.105 | 2182.647 | 5452.623 |
| | (0.111) | (0.167) | (0.301) | (0.474) | (0.723) | (1.151) |
| $2n\{{\mathcal{R}}_{x}({\widehat{\theta}}_{y})-{\mathcal{L}}_{x}({\theta}_{0})\}$ | 7.831 | 7.412 | 7.326 | 7.247 | 7.211 | 7.266 |
| | (0.078) | (0.061) | (0.058) | (0.058) | (0.058) | (0.058) |
| $2n\{{\mathcal{R}}_{x}({\widehat{\theta}}_{best})-{\mathcal{L}}_{x}({\theta}_{0})\}$ | 22.064 | 11.555 | 7.375 | 7.247 | 7.211 | 7.266 |
| | (0.358) | (0.304) | (0.079) | (0.058) | (0.058) | (0.058) |

**Table 8.** Experimental average of ${n}_{te}\{{\mathcal{L}}_{x}({\widehat{\theta}}_{y})-{\mathcal{L}}_{x}({\widehat{\theta}}_{best})\}$ for each case of $Y={V}_{\ell}$, $\ell =1,\dots ,13$. Standard errors are in parentheses.

| Y | ${V}_{1}$ | ${V}_{2}$ | ${V}_{3}$ | ${V}_{4}$ | ${V}_{5}$ | ${V}_{6}$ | ${V}_{7}$ |
|---|---|---|---|---|---|---|---|
| ${n}_{te}\{{\mathcal{L}}_{x}({\widehat{\theta}}_{y})-{\mathcal{L}}_{x}({\widehat{\theta}}_{best})\}$ | 0.13 | −0.14 | 89.71 | 46.24 | −1.76 | 3.34 | 76.54 |
| | (0.08) | (0.12) | (3.82) | (4.17) | (2.52) | (1.34) | (6.09) |

| Y | ${V}_{8}$ | ${V}_{9}$ | ${V}_{10}$ | ${V}_{11}$ | ${V}_{12}$ | ${V}_{13}$ |
|---|---|---|---|---|---|---|
| ${n}_{te}\{{\mathcal{L}}_{x}({\widehat{\theta}}_{y})-{\mathcal{L}}_{x}({\widehat{\theta}}_{best})\}$ | 13.91 | 39.45 | 1.72 | 111.24 | 15.48 | 0.23 |
| | (2.21) | (3.12) | (0.29) | (8.46) | (2.11) | (0.09) |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Imori, S.; Shimodaira, H.
An Information Criterion for Auxiliary Variable Selection in Incomplete Data Analysis. *Entropy* **2019**, *21*, 281.
https://doi.org/10.3390/e21030281
