# Analysis of Information-Based Nonparametric Variable Selection Criteria

## Abstract

## 1. Introduction

## 2. Preliminaries

#### 2.1. Information-Theoretic Measures of Dependence

#### 2.2. Information-Based Feature Selection

#### 2.3. Approximations of CMI: CIFE and JMI Criteria

## 3. Auxiliary Results: Information Measures for Gaussian Mixtures

**Theorem 1.**

**Proof.**

**Theorem 2.**

**Proof.**

**Theorem 3.**

**Proof.**

**Lemma 1.**

**Proof.**

**Remark 1.**

**Remark 2.**

## 4. Main Results: Behavior of Information-Based Criteria in Generative Tree Model

#### 4.1. Generative Tree Model

#### 4.2. Behavior of CMI

#### 4.3. Behavior of JMI

- For $\gamma =1$, the active predictors ${X}_{1},\dots ,{X}_{k+1}\in MB(Y)$ are chosen in the right order, and ${X}_{1}^{(1)}$ is not chosen before them;
- For $0<\gamma <1$, the variable ${X}_{1}^{(1)}\notin MB(Y)$ is chosen at some step before all of ${X}_{1},\dots ,{X}_{k+1}$ have been chosen, and we determine the step at which this happens.

#### 4.4. Behavior of CIFE and Its Comparison with JMI

- For $\gamma =1$, CIFE incorrectly chooses ${X}_{1}^{(1)}$ at some step;
- For $0<\gamma <1$, CIFE selects the variables ${X}_{1},\dots ,{X}_{k+1}$ in the right order.
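
To make the contrast between the two criteria concrete, it may help to recall the forms in which JMI and CIFE are usually written in the feature selection literature. These forms are assumed here, since the definitions are not reproduced in this part of the text; $S$ denotes the index set of the already selected variables.

$$
\begin{aligned}
JMI(X\mid X_S) &= I(X,Y)+\frac{1}{|S|}\sum_{j\in S}\bigl[I(X,X_j\mid Y)-I(X,X_j)\bigr],\\
CIFE(X\mid X_S) &= I(X,Y)+\sum_{j\in S}\bigl[I(X,X_j\mid Y)-I(X,X_j)\bigr].
\end{aligned}
$$

Under these forms the two criteria differ only in how the correction term is scaled, so that

$$
CIFE(X\mid X_S)-I(X,Y)=|S|\,\bigl[JMI(X\mid X_S)-I(X,Y)\bigr].
$$

The values in Table 1 are consistent with this relation: for ${X}_{1}^{(1)}$ and $|S|=2$, $2\,(0.0266-0.0589)=-0.0646=-0.0057-0.0589$.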

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

**Figure 1.** Behavior of the function $h$ and its first two derivatives. The horizontal lines in the left chart mark the lower and upper bounds of $h$, equal to $\frac{1}{2}\log(2\pi e)$ and $\frac{1}{2}\log(2\pi e)+\log(2)$, respectively.
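
The two bounds quoted in the caption can be checked numerically. The sketch below assumes that $h(a)$ is the differential entropy of a balanced mixture of $N(0,1)$ and $N(a,1)$; this reading is an assumption, as the definition of $h$ is not reproduced here. The function name `mixture_entropy` and the grid parameters are hypothetical choices made for illustration.

```python
import numpy as np


def mixture_entropy(a, half_width=12.0, n=200_001):
    """Differential entropy (in nats) of 0.5*N(0,1) + 0.5*N(a,1), for a >= 0.

    ASSUMPTION: this is taken to be the function h of Figure 1.
    The integral of -p*log(p) is approximated by a Riemann sum on a grid
    that extends `half_width` standard deviations beyond both modes.
    """
    x = np.linspace(-half_width, a + half_width, n)
    dx = x[1] - x[0]
    phi0 = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)        # N(0,1) density
    phia = np.exp(-0.5 * (x - a) ** 2) / np.sqrt(2.0 * np.pi)  # N(a,1) density
    p = 0.5 * phi0 + 0.5 * phia
    return float(-np.sum(p * np.log(p)) * dx)


LOWER = 0.5 * np.log(2.0 * np.pi * np.e)   # bound attained at a = 0
UPPER = LOWER + np.log(2.0)                # bound approached as a grows

for a in (0.0, 1.0, 3.0, 20.0):
    print(f"a = {a:5.1f}  h(a) = {mixture_entropy(a):.4f}")
print(f"bounds: [{LOWER:.4f}, {UPPER:.4f}]")
```

Under this assumption, the value at $a=0$ coincides with the lower bound (the mixture collapses to a single standard normal), and the value for well-separated components approaches the upper bound.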

**Figure 3.** Behavior of the conditional mutual information $I({X}_{k+1},Y|{X}_{1},{X}_{2},\dots ,{X}_{k})$ as a function of $k$ for different values of $\gamma$.

**Figure 5.** The behavior of JMI in the generative tree model: $JMI({X}_{k+1}|{X}_{S})$ and $JMI({X}_{1}^{(1)}|{X}_{S})$.

**Figure 6.** The behavior of CIFE in the generative tree model: $CIFE({X}_{k+1}|{X}_{S})$ and $CIFE({X}_{1}^{(1)}|{X}_{S})$.

**Figure 7.** Difference between the values of JMI for ${X}_{k+1}$ and ${X}_{1}^{(1)}$ (**left panel**) and the analogous difference for CIFE (**right panel**). Values below 0 mean that the variable ${X}_{1}^{(1)}$ is chosen.

**Table 1.** Values of the criteria (Conditional Mutual Information (CMI), Joint Mutual Information (JMI), Conditional Infomax Feature Extraction (CIFE)) for $k=2$ and $\gamma =2/3$. In each step and for each criterion, the value for the chosen variable is in bold.

(a) CMI: ${X}_{{S}_{1}}=\{{X}_{1}\}$, ${X}_{{S}_{2}}=\{{X}_{1},{X}_{2}\}$, ${X}_{{S}_{3}}=\{{X}_{1},{X}_{2},{X}_{3}\}$

| Variable | $I(\cdot,Y)$ | $I(\cdot,Y\mid {X}_{{S}_{1}})$ | $I(\cdot,Y\mid {X}_{{S}_{2}})$ | $I(\cdot,Y\mid {X}_{{S}_{3}})$ |
|---|---|---|---|---|
| ${X}_{1}$ | **0.1114** | | | |
| ${X}_{2}$ | 0.0527 | **0.0422** | | |
| ${X}_{3}$ | 0.0241 | 0.0192 | **0.0176** | |
| ${X}_{1}^{(1)}$ | 0.0589 | 0.0000 | 0.0000 | 0.0000 |

(b) JMI: ${X}_{{S}_{1}}=\{{X}_{1}\}$, ${X}_{{S}_{2}}=\{{X}_{1},{X}_{2}\}$, ${X}_{{S}_{3}}=\{{X}_{1},{X}_{2},{X}_{1}^{(1)}\}$

| Variable | $JMI(\cdot)$ | $JMI(\cdot\mid {X}_{{S}_{1}})$ | $JMI(\cdot\mid {X}_{{S}_{2}})$ | $JMI(\cdot\mid {X}_{{S}_{3}})$ |
|---|---|---|---|---|
| ${X}_{1}$ | **0.1114** | | | |
| ${X}_{2}$ | 0.0527 | **0.0422** | | |
| ${X}_{3}$ | 0.0241 | 0.0192 | 0.0205 | 0.0208 |
| ${X}_{1}^{(1)}$ | 0.0589 | 0.0000 | **0.0266** | |

(c) CIFE: ${X}_{{S}_{1}}=\{{X}_{1}\}$, ${X}_{{S}_{2}}=\{{X}_{1},{X}_{2}\}$, ${X}_{{S}_{3}}=\{{X}_{1},{X}_{2},{X}_{3}\}$

| Variable | $CIFE(\cdot)$ | $CIFE(\cdot\mid {X}_{{S}_{1}})$ | $CIFE(\cdot\mid {X}_{{S}_{2}})$ | $CIFE(\cdot\mid {X}_{{S}_{3}})$ |
|---|---|---|---|---|
| ${X}_{1}$ | **0.1114** | | | |
| ${X}_{2}$ | 0.0527 | **0.0422** | | |
| ${X}_{3}$ | 0.0241 | 0.0192 | **0.0169** | |
| ${X}_{1}^{(1)}$ | 0.0589 | 0.0000 | $-0.0057$ | $-0.0083$ |
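
The selection orders reported in Table 1 can be replayed with a few lines of code. The sketch below is illustrative only and rests on two assumptions: that JMI and CIFE have their usual forms, $I(X,Y)$ plus the averaged (JMI) or summed (CIFE) pairwise terms $I(X,X_j\mid Y)-I(X,X_j)$ over the selected set, and that those pairwise terms can be read off as differences of consecutive CIFE columns in panel (c). The names `MI_WITH_Y`, `PAIR`, `score`, and `greedy_order` are hypothetical.

```python
# Marginal mutual information I(X, Y), copied from the first column of Table 1.
MI_WITH_Y = {"X1": 0.1114, "X2": 0.0527, "X3": 0.0241, "X1_1": 0.0589}

# Pairwise terms I(X, X_j | Y) - I(X, X_j).
# ASSUMPTION: CIFE is additive in these terms, so each one equals the
# difference between two consecutive table columns (rounded to 4 decimals).
PAIR = {
    frozenset(("X2", "X1")): 0.0422 - 0.0527,
    frozenset(("X3", "X1")): 0.0192 - 0.0241,
    frozenset(("X3", "X2")): 0.0169 - 0.0192,
    frozenset(("X1_1", "X1")): 0.0000 - 0.0589,
    frozenset(("X1_1", "X2")): -0.0057 - 0.0000,
    frozenset(("X1_1", "X3")): -0.0083 - (-0.0057),
}


def score(x, selected, criterion):
    """Criterion value of candidate x given the already selected variables."""
    if not selected:
        return MI_WITH_Y[x]
    total = sum(PAIR[frozenset((x, s))] for s in selected)
    # JMI averages the pairwise terms, CIFE sums them.
    weight = 1.0 / len(selected) if criterion == "JMI" else 1.0
    return MI_WITH_Y[x] + weight * total


def greedy_order(criterion):
    """Forward selection: at each step pick the candidate with the top score."""
    selected, remaining = [], set(MI_WITH_Y)
    while remaining:
        best = max(sorted(remaining), key=lambda x: score(x, selected, criterion))
        selected.append(best)
        remaining.remove(best)
    return selected


print("JMI :", greedy_order("JMI"))
print("CIFE:", greedy_order("CIFE"))
```

With these inputs, JMI picks ${X}_{1}^{(1)}$ (written `X1_1`) in the third step, ahead of ${X}_{3}$, whereas CIFE picks ${X}_{3}$ first, which agrees with the sets ${X}_{{S}_{3}}$ listed in panels (b) and (c). The recovered pairwise terms also reproduce the JMI entry 0.0208 for ${X}_{3}$ given ${X}_{{S}_{3}}$ up to rounding.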

