# Increasing and Decreasing Returns and Losses in Mutual Information Feature Subset Selection


## Abstract


## 1. Introduction

## 2. Background and Definitions

#### 2.1. Historical Background

Suppose that a subset of $n1$ features is selected from the full feature set: **F**${}_{S}$ = {F${}_{S1}$, F${}_{S2}$, … F${}_{Sn1}$}. The relevance of this subset for predicting the class variable C is measured by its mutual information MI(**F**${}_{S}$;C), defined as:

$$\mathrm{MI}(\mathbf{F}_S;C) = \sum_{\mathbf{f}_S}\sum_{c} p(\mathbf{f}_S,c)\,\log_2\frac{p(\mathbf{f}_S,c)}{p(\mathbf{f}_S)\,p(c)}$$
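As a concrete illustration of this definition (a sketch for this text, not code from the paper), the mutual information between a feature subset and the class can be computed directly from a joint probability table:

```python
from math import log2

def mutual_information(p_joint):
    """MI(F_S;C) from a joint table p_joint[(f_S, c)] -> probability,
    where f_S is a tuple of feature values and c is the class label."""
    p_f, p_c = {}, {}
    for (f, c), p in p_joint.items():
        p_f[f] = p_f.get(f, 0.0) + p
        p_c[c] = p_c.get(c, 0.0) + p
    # MI = sum_{f_S, c} p(f_S, c) * log2( p(f_S, c) / (p(f_S) p(c)) )
    return sum(p * log2(p / (p_f[f] * p_c[c]))
               for (f, c), p in p_joint.items() if p > 0)

# Toy check: one binary feature that determines a binary class completely
# carries exactly H(C) = 1 bit of information.
p = {((0,), 0): 0.5, ((1,), 1): 0.5}
print(mutual_information(p))  # 1.0
```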

#### 2.2. Conditional Mutual Information

## 3. Conditioning Increases Information

#### 3.1. n-bit Parity Problem

For every strict subset **F**${}_{S}$ ⊂ {F${}_{1}$,F${}_{2}$,…F${}_{n}$}, excluding $\varnothing$, it can be verified that p(**f**${}_{S}$,c) = p(**f**${}_{S}$)·p(c). From this it follows, using the definition of mutual information in Equation (3), that MI(**F**${}_{S}$;C) = 0. This leads us to the following result. Suppose that **F**${}_{S1}$ and **F**${}_{S2}$ are strict subsets of the full feature set; then MI(**F**${}_{S1}$;C) = 0 and MI(**F**${}_{S2}$;C) = 0 in the n-dimensional XOR problem. For the conditional mutual information, however, it holds that MI(**F**${}_{S1}$;C∣**F**${}_{S2}$) > MI(**F**${}_{S1}$;C) whenever every feature F${}_{j}$ satisfies F${}_{j}$ ∈ **F**${}_{S1}$ or F${}_{j}$ ∈ **F**${}_{S2}$, i.e., when the two subsets jointly cover the full feature set.
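This behavior is easy to verify numerically. The sketch below (an illustration under the uniform-input parity model described above) checks that every strict non-empty subset carries zero information about the parity class, while conditioning on the complementary subset recovers the full bit:

```python
from itertools import combinations, product
from math import log2

def mi(joint):
    """MI(F_S;C) from a joint table joint[(f_S, c)] -> probability."""
    pf, pc = {}, {}
    for (f, c), p in joint.items():
        pf[f] = pf.get(f, 0.0) + p
        pc[c] = pc.get(c, 0.0) + p
    return sum(p * log2(p / (pf[f] * pc[c]))
               for (f, c), p in joint.items() if p > 0)

n = 4
inputs = list(product([0, 1], repeat=n))  # uniform over all n-bit patterns

def joint_for(subset):
    """Joint p(f_S, c) for the n-bit parity problem, c = parity of all bits."""
    j = {}
    for bits in inputs:
        key = (tuple(bits[i] for i in subset), sum(bits) % 2)
        j[key] = j.get(key, 0.0) + 1.0 / len(inputs)
    return j

def cmi(sa, sb):
    """MI(F_Sa;C|F_Sb), via the chain rule MI(A;C|B) = MI(A,B;C) - MI(B;C)."""
    return mi(joint_for(tuple(sorted(set(sa) | set(sb))))) - mi(joint_for(sb))

# Every strict non-empty subset is useless on its own: MI = 0 ...
for k in range(1, n):
    for s in combinations(range(n), k):
        assert abs(mi(joint_for(s))) < 1e-12
# ... yet conditioning on the complementary subset recovers the full bit:
# MI(F_S1;C|F_S2) = 1 > 0 = MI(F_S1;C) when S1 and S2 cover all features.
assert abs(cmi((0, 1), (2, 3)) - 1.0) < 1e-12
```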

#### 3.2. Non-binary Discrete Features

#### 3.3. Continuous Features: Mixture Models

## 4. Increasing and Decreasing Returns

#### 4.1. Increasing Returns

**Figure 3.** 7-5-3 XOR cube. Extension of the checkerboard to 3 dimensions; the number of values that each feature can take is odd and different for each feature.

| Entropy | Value (bit) | Mutual Inf. | Value (bit) |
|---|---|---|---|
| H(C) | $-\frac{53}{105}\log_2\frac{53}{105} - \frac{52}{105}\log_2\frac{52}{105}$ ≈ 0.9999 | NA | NA |
| H(C∣F${}_{1}$) | $-\frac{8}{15}\log_2\frac{8}{15} - \frac{7}{15}\log_2\frac{7}{15}$ ≈ 0.9968 | MI(F${}_{1}$;C) | ≈ 3.143×10${}^{-3}$ |
| H(C∣F${}_{2}$) | $-\frac{10}{21}\log_2\frac{10}{21} - \frac{11}{21}\log_2\frac{11}{21}$ ≈ 0.9984 | MI(F${}_{2}$;C) | ≈ 1.571×10${}^{-3}$ |
| H(C∣F${}_{3}$) | $-\frac{18}{35}\log_2\frac{18}{35} - \frac{17}{35}\log_2\frac{17}{35}$ ≈ 0.9994 | MI(F${}_{3}$;C) | ≈ 5.235×10${}^{-4}$ |
| H(C∣F${}_{1}$,F${}_{2}$) | $-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}$ ≈ 0.9183 | MI(F${}_{2}$;C∣F${}_{1}$) | ≈ 7.850×10${}^{-2}$ |
| H(C∣F${}_{1}$,F${}_{3}$) | $-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}$ ≈ 0.9710 | MI(F${}_{3}$;C∣F${}_{1}$) | ≈ 2.584×10${}^{-2}$ |
| H(C∣F${}_{2}$,F${}_{3}$) | $-\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7}$ ≈ 0.9852 | MI(F${}_{3}$;C∣F${}_{2}$) | ≈ 1.314×10${}^{-2}$ |
| H(C∣F${}_{1}$,F${}_{2}$,F${}_{3}$) | 0 | MI(F${}_{3}$;C∣F${}_{1}$,F${}_{2}$) | ≈ 0.9183 |
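The table entries can be reproduced by direct enumeration, assuming the checkerboard class assignment c = (f${}_{1}$+f${}_{2}$+f${}_{3}$) mod 2 with uniformly distributed features (an assumption that is consistent with the counts 53/105 and 8/15 in the table):

```python
from itertools import product
from math import log2

def h(dist):
    """Shannon entropy (bit) of a dictionary of probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# 7-5-3 XOR cube: features uniform over the 7 x 5 x 3 = 105 cells,
# checkerboard class c = (f1 + f2 + f3) mod 2.
cells = list(product(range(7), range(5), range(3)))
N = len(cells)

def h_c_given(subset):
    """H(C | F_subset) by direct enumeration of the cube."""
    groups = {}
    for cell in cells:
        key = tuple(cell[i] for i in subset)
        groups.setdefault(key, []).append(sum(cell) % 2)
    total = 0.0
    for labels in groups.values():
        p_key = len(labels) / N
        p1 = sum(labels) / len(labels)
        total += p_key * h({0: 1.0 - p1, 1: p1})
    return total

H_C = h_c_given(())                                 # H(C)        ~ 0.9999 bit
MI_1 = H_C - h_c_given((0,))                        # MI(F1;C)    ~ 3.143e-3
MI_2_given_1 = h_c_given((0,)) - h_c_given((0, 1))  # MI(F2;C|F1) ~ 7.850e-2
MI_3_given_12 = h_c_given((0, 1)) - h_c_given((0, 1, 2))  # ~ 0.9183
```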

#### 4.2. Decreasing Returns

Suppose that the set of already selected features is **S**, and that the feature selected in the current iteration is F${}_{x}$. In order for decreasing returns to hold, one requires for the next selected feature F${}_{y}$ that MI(F${}_{x}$;C∣**S**) > MI(F${}_{y}$;C∣**S**,F${}_{x}$). First, we expand MI(F${}_{x}$,F${}_{y}$;C∣**S**) in two ways by means of the chain rule of information:

$$\mathrm{MI}(F_x,F_y;C\mid \mathbf{S}) = \mathrm{MI}(F_x;C\mid \mathbf{S}) + \mathrm{MI}(F_y;C\mid \mathbf{S},F_x) = \mathrm{MI}(F_y;C\mid \mathbf{S}) + \mathrm{MI}(F_x;C\mid \mathbf{S},F_y)$$

Because F${}_{x}$ was preferred over F${}_{y}$ in the current iteration, MI(F${}_{x}$;C∣**S**) > MI(F${}_{y}$;C∣**S**). In the case of ties it may only hold that MI(F${}_{x}$;C∣**S**) ≥ MI(F${}_{y}$;C∣**S**); we focus here on the case of a strict ordering >. Then, if in Equation (22) we additionally have that MI(F${}_{x}$;C∣**S**,F${}_{y}$) ≤ MI(F${}_{x}$;C∣**S**), the decreasing returns inequality follows. This means that additional conditioning on F${}_{y}$ decreases (or leaves unchanged) the information of F${}_{x}$ about C.
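The two expansions of the chain rule can be checked numerically on an arbitrary joint distribution. The sketch below (an illustration, with variable names of our own choosing) verifies both expansions on a randomly generated distribution over (S, F${}_{x}$, F${}_{y}$, C):

```python
import random
from itertools import product
from math import log2

random.seed(0)

# A random joint distribution over four binary variables (S, Fx, Fy, C).
states = list(product([0, 1], repeat=4))
weights = [random.random() for _ in states]
Z = sum(weights)
p = {s: w / Z for s, w in zip(states, weights)}

def H(keep):
    """Joint entropy (bit) of the variables at the given indices."""
    marg = {}
    for s, pr in p.items():
        key = tuple(s[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(q * log2(q) for q in marg.values() if q > 0)

def cmi(a, c, cond):
    """MI(X_a;X_c|X_cond) = H(a,cond) + H(c,cond) - H(a,c,cond) - H(cond)."""
    return H(a + cond) + H(c + cond) - H(a + c + cond) - H(cond)

S, FX, FY, C = (0,), (1,), (2,), (3,)
# Chain rule: MI(Fx,Fy;C|S) expands in two equivalent ways.
lhs = cmi(FX + FY, C, S)
exp1 = cmi(FX, C, S) + cmi(FY, C, S + FX)
exp2 = cmi(FY, C, S) + cmi(FX, C, S + FY)
assert abs(lhs - exp1) < 1e-9 and abs(lhs - exp2) < 1e-9
```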

**Lemma 4.1.**

**Figure 4.** Example of class-conditional independence of the features given the class variable C. The joint probability distribution can be factorized as p(F${}_{1}$,F${}_{2}$,…F${}_{10}$,C) = (${\prod}_{i=1}^{10}$ p(F${}_{i}$∣C))·p(C).

**Figure 5.** Evolution of the mutual information as a function of the number of features selected with the SFS. A Bayesian network according to Figure 4 was created, with the probability p(c=0) and the conditional probabilities p(f${}_{i}$=0∣c=0) and p(f${}_{i}$=0∣c=1) drawn randomly from a uniform distribution on [0,1]. The conditional mutual information at 1 feature is MI(F${}_{1}$;C), at 2 features MI(F${}_{2}$;C∣F${}_{1}$), … and finally at 10 features MI(F${}_{10}$;C∣F${}_{1}$,F${}_{2}$,…F${}_{9}$). Lemma 4.1 predicts that the conditional mutual information decreases with an increasing number of features selected. This implies that the mutual information is concave as a function of the number of features selected.
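The experiment of Figure 5 can be sketched as follows (an illustrative reimplementation under the stated model, not the authors' code): generate a class-conditionally independent network, run the SFS with exact mutual information, and check that the increments never go back up:

```python
import random
from itertools import product
from math import log2

random.seed(1)
n = 10

# Network of Figure 4: p(F_1..F_10, C) = (prod_i p(F_i|C)) * p(C), with
# p(c=0), p(f_i=0|c=0) and p(f_i=0|c=1) drawn uniformly from [0, 1].
p_c0 = random.random()
p_f0 = [(random.random(), random.random()) for _ in range(n)]  # indexed by c

def mi_subset(subset):
    """MI(F_subset;C), computed exactly by enumerating all assignments."""
    joint, pf, pc = {}, {}, {}
    for fs in product([0, 1], repeat=len(subset)):
        for c in (0, 1):
            pr = p_c0 if c == 0 else 1.0 - p_c0
            for i, f in zip(subset, fs):
                q = p_f0[i][c]
                pr *= q if f == 0 else 1.0 - q
            joint[(fs, c)] = pr
            pf[fs] = pf.get(fs, 0.0) + pr
            pc[c] = pc.get(c, 0.0) + pr
    return sum(pr * log2(pr / (pf[fs] * pc[c]))
               for (fs, c), pr in joint.items() if pr > 0)

# Greedy SFS, recording the conditional MI increment at every step;
# the increment at step k equals MI(F_k;C | features already selected).
selected, increments = [], []
remaining = set(range(n))
while remaining:
    best = max(remaining, key=lambda i: mi_subset(tuple(selected) + (i,)))
    increments.append(mi_subset(tuple(selected) + (best,))
                      - mi_subset(tuple(selected)))
    selected.append(best)
    remaining.remove(best)

# Lemma 4.1 (decreasing returns): the increments are non-increasing.
assert all(a >= b - 1e-9 for a, b in zip(increments, increments[1:]))
```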

**Figure 6.** Example of dependencies between features for decreasing returns. The joint probability distribution can be factorized as p(F${}_{1}$,F${}_{2}$,F${}_{3}$,F${}_{4}$,C) = p(F${}_{2}$∣C)·p(F${}_{4}$∣C)·p(C∣F${}_{1}$,F${}_{3}$)·p(F${}_{1}$)·p(F${}_{3}$). This factorization implies that MI(F${}_{1}$,F${}_{3}$;F${}_{2}$,F${}_{4}$∣C) = 0.

**Lemma 4.2.**

**Corollary 4.3.**

#### 4.3. Selection Transitions

**Lemma 4.4.**

**Table 2.** Probabilities for the network shown in Figure 6. These probabilities were obtained from one of the 10,000 Bayesian networks that were generated randomly. Applying the SFS to the network with these probabilities leads to the selection of parent and child nodes alternately.

| F${}_{1}$ | F${}_{2}$ | F${}_{3}$ | F${}_{4}$ | C | p(·) |
|---|---|---|---|---|---|
| 0 | | | | | p(F${}_{1}$) = 0.6596 |
| 1 | | | | | p(F${}_{1}$) = 0.3404 |
| | | 0 | | | p(F${}_{3}$) = 0.5186 |
| | | 1 | | | p(F${}_{3}$) = 0.4814 |
| 0 | | 0 | | 0 | p(C∣F${}_{1}$,F${}_{3}$) = 0.9730 |
| 0 | | 0 | | 1 | p(C∣F${}_{1}$,F${}_{3}$) = 0.0270 |
| 1 | | 0 | | 0 | p(C∣F${}_{1}$,F${}_{3}$) = 0.6490 |
| 1 | | 0 | | 1 | p(C∣F${}_{1}$,F${}_{3}$) = 0.3510 |
| 0 | | 1 | | 0 | p(C∣F${}_{1}$,F${}_{3}$) = 0.8003 |
| 0 | | 1 | | 1 | p(C∣F${}_{1}$,F${}_{3}$) = 0.1997 |
| 1 | | 1 | | 0 | p(C∣F${}_{1}$,F${}_{3}$) = 0.4538 |
| 1 | | 1 | | 1 | p(C∣F${}_{1}$,F${}_{3}$) = 0.5462 |
| | 0 | | | 0 | p(F${}_{2}$∣C) = 0.4324 |
| | 1 | | | 0 | p(F${}_{2}$∣C) = 0.5676 |
| | 0 | | | 1 | p(F${}_{2}$∣C) = 0.8253 |
| | 1 | | | 1 | p(F${}_{2}$∣C) = 0.1747 |
| | | | 0 | 0 | p(F${}_{4}$∣C) = 0.0835 |
| | | | 1 | 0 | p(F${}_{4}$∣C) = 0.9165 |
| | | | 0 | 1 | p(F${}_{4}$∣C) = 0.1332 |
| | | | 1 | 1 | p(F${}_{4}$∣C) = 0.8668 |

**Figure 7.** Evolution of the mutual information as a function of the number of features selected with the SFS. A Bayesian network according to Figure 6 was created, with the probabilities set to the values listed in Table 2. The conditional mutual information at 1 feature is MI(F${}_{1}$;C), at 2 features MI(F${}_{2}$;C∣F${}_{1}$), … and finally at 4 features MI(F${}_{4}$;C∣F${}_{1}$,F${}_{2}$,F${}_{3}$). Lemma 4.2 predicts that the conditional mutual information decreases with an increasing number of features selected. This implies that the mutual information is concave as a function of the number of features selected.

**Figure 8.** Four elementary selection transitions in the SFS. F${}_{k-1}$ is the feature selected at step k−1; F${}_{k}$ is the feature selected at step k. Case 1: F${}_{k-1}$ is a child and F${}_{k}$ is a child. Case 2: F${}_{k-1}$ is a child and F${}_{k}$ is a parent. Case 3: F${}_{k-1}$ is a parent and F${}_{k}$ is a child. Case 4: F${}_{k-1}$ is a parent and F${}_{k}$ is a parent.

#### 4.4. Relevance-redundancy Criteria

Suppose that the set of already selected features is **S** and that F${}_{i}$ is a candidate feature; then the feature F${}_{i}$ is selected for which the following criterion is maximal.
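As an illustration of this family of criteria, the sketch below implements the well-known MIFS form of Battiti, MI(F${}_{i}$;C) − β·Σ${}_{s}$ MI(F${}_{i}$;F${}_{s}$). This is one member of the relevance-redundancy family, not necessarily the exact criterion derived here, and the numeric values are hypothetical precomputed mutual informations:

```python
def mifs_select(mi_relevance, mi_redundancy, beta, k):
    """Greedy relevance-redundancy selection (Battiti's MIFS form):
    at each step pick argmax_i  MI(F_i;C) - beta * sum_{s in S} MI(F_i;F_s).

    mi_relevance:  dict i -> MI(F_i;C)
    mi_redundancy: dict (i, j), i < j -> MI(F_i;F_j)
    """
    selected = []
    candidates = set(mi_relevance)
    while candidates and len(selected) < k:
        score = lambda i: mi_relevance[i] - beta * sum(
            mi_redundancy[min(i, s), max(i, s)] for s in selected)
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical precomputed values: F0 and F1 are individually the most
# relevant but highly redundant with one another; F2 is weaker but
# complementary, so it wins the second slot.
rel = {0: 0.80, 1: 0.78, 2: 0.30}
red = {(0, 1): 0.75, (0, 2): 0.05, (1, 2): 0.05}
print(mifs_select(rel, red, beta=1.0, k=2))  # [0, 2]
```

With β = 0 the redundancy penalty vanishes and the criterion reduces to ranking features by individual relevance.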

#### 4.5. Importance of Increasing and Decreasing Returns

Suppose that the selection has produced a subset of $n1$ features **F**${}_{n1}$, with mutual information MI(**F**${}_{n1}$;C). Suppose that the last increment in going from a subset of $n1-1$ features **F**${}_{n1-1}$ to **F**${}_{n1}$ equals ΔMI = MI(**F**${}_{n1}$;C) − MI(**F**${}_{n1-1}$;C). For the mutual information of a subset of $n2$ features **F**${}_{n2}$, with **F**${}_{n2}$ ⊃ **F**${}_{n1}$, it holds under the decreasing returns that:

$$\mathrm{MI}(\mathbf{F}_{n2};C) \le \mathrm{MI}(\mathbf{F}_{n1};C) + (n2-n1)\,\Delta\mathrm{MI}$$

Hence, **F**${}_{n2}$ falls within the white area of Figure 9 under the decreasing returns and, due to (39), within the dark grey area under the increasing returns.

**Figure 9.** Bounds on the probability of error. For subset **F**${}_{n1}$ the mutual information equals MI(**F**${}_{n1}$;C), for which the probability of error falls between the Fano lower bound and the Kovalevsky upper bound. The white area represents the possible combinations of probability of error and mutual information for the decreasing returns in the selection of (n2−n1) additional features, because MI(**F**${}_{n2}$;C) ≤ MI(**F**${}_{n1}$;C) + (n2−n1)ΔMI. The grey area is the possible area for the increasing returns. The hatched area is not possible, because adding features can only increase the information. This figure illustrates the case where the number of classes ∣C∣ equals 8 and all prior class probabilities are equal.
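The two bounds of Figure 9 can be sketched as follows, assuming their standard forms (an illustration, not the paper's code): the Fano lower bound is inverted numerically by bisection, and the Kovalevsky bound is taken as the piecewise-linear curve through the points (log₂ m, (m−1)/m):

```python
from math import log2

def hb(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def fano_lower(h_cond, n_classes):
    """Fano lower bound on Pe: the smallest Pe satisfying
    hb(Pe) + Pe*log2(K-1) >= H(C|F), found by bisection."""
    K = n_classes
    lo, hi = 0.0, (K - 1) / K
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if hb(mid) + mid * log2(K - 1) >= h_cond:
            hi = mid
        else:
            lo = mid
    return hi

def kovalevsky_upper(h_cond):
    """Kovalevsky upper bound on Pe: piecewise-linear interpolation
    between the points (log2 m, (m-1)/m), m = 1, 2, 3, ..."""
    m = 1
    while log2(m + 1) <= h_cond:
        m += 1
    lo_h, hi_h = log2(m), log2(m + 1)
    lo_e, hi_e = (m - 1) / m, m / (m + 1)
    return lo_e + (h_cond - lo_h) * (hi_e - lo_e) / (hi_h - lo_h)

# |C| = 8 equiprobable classes: H(C) = 3 bit, so H(C|F) = 3 - MI(F;C).
for mi_val in (0.0, 1.0, 2.0, 2.9):
    h = 3.0 - mi_val
    assert fano_lower(h, 8) <= kovalevsky_upper(h) + 1e-9
```

At MI = 0 (H(C∣F) = 3 bit) both bounds meet at Pe = 7/8, the error of random guessing among 8 equiprobable classes.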

## 5. Decreasing Losses and Increasing Losses

#### 5.1. Decreasing Losses

| Mutual Information | Value (bit) |
|---|---|
| MI(F${}_{1}$,F${}_{2}$,F${}_{3}$;C) | ≈ 0.9999 |
| MI(F${}_{1}$;C∣F${}_{2}$,F${}_{3}$) | ≈ 0.9852 |
| MI(F${}_{2}$;C∣F${}_{1}$,F${}_{3}$) | ≈ 0.9710 |
| MI(F${}_{3}$;C∣F${}_{1}$,F${}_{2}$) | ≈ 0.9183 |
| MI(F${}_{1}$;C∣F${}_{2}$) | ≈ 8.007×10${}^{-2}$ |
| MI(F${}_{2}$;C∣F${}_{1}$) | ≈ 7.850×10${}^{-2}$ |
| MI(F${}_{1}$;C) | ≈ 3.143×10${}^{-3}$ |

#### 5.2. Increasing Losses

Suppose that MI(F${}_{y}$;C∣**S**,F${}_{x}$) < MI(F${}_{x}$;C∣**S**,F${}_{y}$). Combining this inequality with (22), it is clear that MI(F${}_{x}$;C∣**S**) > MI(F${}_{y}$;C∣**S**). Hence, under the condition that MI(F${}_{y}$;C∣**S**) ≥ MI(F${}_{y}$;C∣**S**,F${}_{x}$), one obtains an ‘increasing losses’ behavior: MI(F${}_{x}$;C∣**S**) > MI(F${}_{y}$;C∣**S**,F${}_{x}$).

**Lemma 5.1.**

**Lemma 5.2.**

**Figure 10.** Evolution of the mutual information as a function of the number of features selected with the SBS. A Bayesian network according to Figure 4 was created, with the probability p(c=0) and the conditional probabilities p(f${}_{i}$=0∣c=0) and p(f${}_{i}$=0∣c=1) drawn randomly from a uniform distribution on [0,1]. The conditional mutual information for 10 features is MI(F${}_{10}$;C∣F${}_{1}$,F${}_{2}$,…F${}_{9}$), for 9 features MI(F${}_{9}$;C∣F${}_{1}$,F${}_{2}$,…F${}_{8}$), … and finally for 1 feature MI(F${}_{1}$;C). Lemma 5.1 predicts that the conditional mutual information increases with an increasing number of features removed. This implies that the mutual information is concave as a function of the number of features selected.

**Corollary 5.3.**

**Lemma 5.4.**

## 6. Conclusions

## Acknowledgments

## Appendix A

## Appendix B

## Appendix C

Denote the set of the first k selected features by **F**${}_{1:k}$ = {F${}_{1}$, F${}_{2}$, … F${}_{k}$}. Denote the i-th selected parent of C within **F**${}_{1:k}$ by pa${}_{i}$(c) and the j-th selected child of C within **F**${}_{1:k}$ by ch${}_{j}$(c). Denote the set of all parents of C within **F**${}_{1:k}$ by **F**${}_{pa}$(c) = ${\bigcup}_{i=1}^{\#parents}p{a}_{i}(c)$ and the set of all children of C within **F**${}_{1:k}$ by **F**${}_{ch}$(c) = ${\bigcup}_{j=1}^{\#children}c{h}_{j}(c)$. We want to show that if F${}_{k}$ and F${}_{k-1}$ are children of C, then MI(F${}_{k}$;F${}_{k-1}$∣C,F${}_{1}$,…F${}_{k-2}$) = 0. The definition of MI(F${}_{k}$;F${}_{k-1}$∣C,F${}_{1}$,…F${}_{k-2}$) is given in Equation (48). Starting from the term within the logarithm, Equation (49), consider the difference between **F**${}_{ch}$(c) and the set {F${}_{k-1}$,F${}_{k}$}: **F**${}_{ch}$(c) ∖ {F${}_{k-1}$,F${}_{k}$}. We notice that all probabilities in Equations (54) to (57) have the following factor in common:


© 2010 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.

Cite as: Van Dijck, G.; Van Hulle, M.M. Increasing and Decreasing Returns and Losses in Mutual Information Feature Subset Selection. *Entropy* **2010**, *12*, 2144-2170. https://doi.org/10.3390/e12102144