# A Feature Subset Selection Method Based On High-Dimensional Mutual Information


## Abstract


## 1. Introduction

## 2. Related Work

#### 2.1. Categorization of Feature Selection Methods

#### 2.2. Theoretic Background

**Theorem 1**

**Theorem 2**

#### 2.3. Feature Selection Methods Based on Information Theory

#### 2.4. Limitations of Current Feature Subset Selection Methods

## 3. The Discrete Function Learning Algorithm

#### 3.1. Theoretic Motivation and Foundation

**Theorem 3 ([35], p. 26)**

**Theorem 4**

#### 3.2. Performing Feature Selection

#### 3.3. Relation to Markov Blanket

**Definition 1 (Conditional Independence)**

**Definition 2 (Markov Blanket)**

**Theorem 5 ([40], p. 36)**

**Theorem 6 ([30], p. 43)**

**Theorem 7**

**Proof 1**

**Theorem 8**

**Proof 2**

#### 3.4. The Discrete Function Learning Algorithm

**Definition 3 (δ Superset)**

**Definition 4 (Δ Supersets)**

**Definition 5 (Searching Layer $\mathcal{L}$ of $\mathbf{V}$)**

**Definition 6 (Searching Space)**

#### 3.5. Complexity Analysis

**Theorem 9 ([46,47])**

#### 3.6. Correctness Analysis

**Theorem 11**

**Proof 3**

## 4. The ϵ Value Method for Noisy Data Sets

#### 4.1. The ϵ Value Method

#### 4.2. The Relation with The Over-fitting Problem

#### 4.3. The Relation with The Time Complexity

## 5. Selection of Parameters

#### 5.1. Selection of The Expected Cardinality K

#### 5.2. Selection of ϵ value

## 6. Prediction Method

## 7. Implementation Issues

#### 7.1. The Computation of Mutual Information $I(\mathbf{X};Y)$

#### 7.2. Redundancy Matrix

## 8. Results

#### 8.1. Data Sets

#### 8.2. Comparison with Other Feature Selection Methods

#### 8.3. Comparison of Model Complexity

#### 8.4. Comparison of Efficiency

## 9. Discussions

## 10. Conclusion

## Acknowledgements

## References and Notes

1. Koller, D.; Sahami, M. Toward Optimal Feature Selection. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 284–292.
2. Hall, M.; Holmes, G. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. IEEE Trans. Knowl. Data Eng. 2003, 15, 1–16.
3. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Networks 1994, 5, 537–550.
4. Kwak, N.; Choi, C.H. Input feature selection for classification problems. IEEE Trans. Neural Networks 2002, 13, 143–159.
5. Vidal-Naquet, M.; Ullman, S. Object Recognition with Informative Features and Linear Classification; IEEE Computer Society: Nice, France, 2003; pp. 281–288.
6. Fleuret, F. Fast Binary Feature Selection with Conditional Mutual Information. J. Mach. Learn. Res. 2004, 5, 1531–1555.
7. Peng, H.; Long, F.; Ding, C. Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
8. Bonev, B.; Escolano, F.; Cazorla, M. Feature selection, mutual information, and the classification of high-dimensional patterns. Pattern Anal. Appl. 2008, 11, 309–319.
9. Cai, R.; Hao, Z.; Yang, X.; Wen, W. An efficient gene selection algorithm based on mutual information. Neurocomputing 2009, 72, 991–999.
10. Zhu, S.; Wang, D.; Yu, K.; Li, T.; Gong, Y. Feature Selection for Gene Expression Using Model-Based Entropy. IEEE/ACM Trans. Comput. Biol. Bioinformatics 2010, 7, 25–36.
11. Estévez, P.A.; Tesmer, M.; Perez, C.A.; Zurada, J.M. Normalized mutual information feature selection. IEEE Trans. Neural Netw. 2009, 20, 189–201.
12. Martínez Sotoca, J.; Pla, F. Supervised feature selection by clustering using conditional mutual information-based distances. Pattern Recogn. 2010, 43, 2068–2081.
13. Vinh, L.T.; Thang, N.D.; Lee, Y.K. An Improved Maximum Relevance and Minimum Redundancy Feature Selection Algorithm Based on Normalized Mutual Information. In Proceedings of the IEEE/IPSJ International Symposium on Applications and the Internet, 2010; pp. 395–398.
14. Zheng, Y.; Kwoh, C.K. Identifying Simple Discriminatory Gene Vectors with An Information Theory Approach. In Proceedings of the 4th Computational Systems Bioinformatics Conference, CSB 2005, Stanford, CA, USA, 8–11 August 2005; pp. 12–23.
15. Blake, C.; Merz, C. UCI Repository of Machine Learning Databases; UCI: Irvine, CA, USA, 1998.
16. Golub, T.; Slonim, D.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.; Coller, H.; Loh, M.; Downing, J.; Caligiuri, M.; Bloomfield, C.; Lander, E. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999, 286, 531–537.
17. Shipp, M.; Ross, K.; Tamayo, P.; Weng, A.; Kutok, J.; Aguiar, R.; Gaasenbeek, M.; Angelo, M.; Reich, M.; Pinkus, G.; Ray, T.; Koval, M.; Last, K.; Norton, A.; Lister, T.; Mesirov, J.; Neuberg, D.; Lander, E.; Aster, J.; Golub, T. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med. 2002, 8, 68–74.
18. Armstrong, S.; Staunton, J.; Silverman, L.; Pieters, R.; den Boer, M.; Minden, M.; Sallan, S.; Lander, E.; Golub, T.; Korsmeyer, S. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet. 2002, 30, 41–47.
19. Petricoin, E.; Ardekani, A.; Hitt, B.; Levine, P.; Fusaro, V.; Steinberg, S.; Mills, G.; Simone, C.; Fishman, D.; Kohn, E.; Liotta, L. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359, 572–577.
20. Hall, M. Correlation-based Feature Selection for Machine Learning. Ph.D. Thesis, Waikato University, Department of Computer Science, Hamilton, New Zealand, April 1999.
21. Liu, H.; Setiono, R. A Probabilistic Approach to Feature Selection - A Filter Solution. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 319–327.
22. Kohavi, R.; John, G. Wrappers for Feature Subset Selection. Artif. Intell. 1997, 97, 273–324.
23. Liu, H.; Li, J.; Wong, L. A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns. Genome Inf. 2002, 13, 51–60.
24. Xing, E.; Jordan, M.; Karp, R. Feature Selection for High-Dimensional Genomic Microarray Data. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; Morgan Kaufmann: San Francisco, CA, USA; pp. 601–608.
25. Furey, T.; Cristianini, N.; Duffy, N.; Bednarski, D.; Schummer, M.; Haussler, D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16, 906–914.
26. van ’t Veer, L.; Dai, H.; van de Vijver, M.; He, Y.; Hart, A.; Mao, M.; Peterse, H.; van der Kooy, K.; Marton, M.; Witteveen, A.; Schreiber, G.; Kerkhoven, R.; Roberts, C.; Linsley, P.; Bernards, R.; Friend, S. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415, 530–536.
27. Li, J.; Liu, H.; Downing, J.; Yeoh, A.; Wong, L. Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. Bioinformatics 2003, 19, 71–78.
28. John, G.; Kohavi, R.; Pfleger, K. Irrelevant Features and the Subset Selection Problem. In Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, USA, 10–13 July 1994; pp. 121–129.
29. Shannon, C.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, USA, 1963.
30. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 1991.
31. Dumais, S.; Platt, J.; Heckerman, D.; Sahami, M. Inductive Learning Algorithms and Representations for Text Categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, Washington, DC, USA, 2–7 November 1998; pp. 148–155.
32. Yang, Y.; Pedersen, J. A Comparative Study on Feature Selection in Text Categorization; Fisher, D.H., Ed.; Morgan Kaufmann: San Francisco, CA, USA, 1997; pp. 412–420.
33. Chow, T.W.S.; Huang, D. Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information. IEEE Trans. Neural Networks 2005, 16, 213–224.
34. Maji, P. Mutual Information Based Supervised Attribute Clustering for Microarray Sample Classification. IEEE Trans. Knowl. Data Eng. 2010.
35. McEliece, R.J. The Theory of Information and Coding: A Mathematical Framework for Communication; Encyclopedia of Mathematics and Its Applications, Volume 3; Addison-Wesley: Reading, MA, USA, 1977.
36. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: San Mateo, CA, USA, 1988.
37. In [36], ‘-’ is used to denote the set minus (difference) operation. To be consistent with the other parts of this paper, we use ‘∖’ to denote the set minus operation. In particular, **A**∖**B** is defined by **A**∖**B** = {X : X ∈ **A** and X ∉ **B**}.
38. Yaramakala, S.; Margaritis, D. Speculative Markov Blanket Discovery for Optimal Feature Selection; IEEE Computer Society: Washington, DC, USA, 2005; pp. 809–812.
39. Tsamardinos, I.; Aliferis, C. Towards Principled Feature Selection: Relevancy, Filters and Wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics; Bishop, C.M., Frey, B.J., Eds.; Morgan Kaufmann: Key West, FL, USA, 2003.
40. Gray, R.M. Entropy and Information Theory; Springer-Verlag: New York, NY, USA, 1991.
41. Tsamardinos, I.; Aliferis, C.F.; Statnikov, A. Time and sample efficient discovery of Markov blankets and direct causal relations; ACM Press: New York, NY, USA, 2003; pp. 673–678.
42. Aliferis, C.F.; Tsamardinos, I.; Statnikov, A. HITON: A novel Markov Blanket algorithm for optimal variable selection. AMIA Annu. Symp. Proc. 2003, 21–25.
43. Cormen, T.; Leiserson, C.; Rivest, R.; Stein, C. Introduction to Algorithms, 2nd ed.; MIT Press: Cambridge, MA, USA, 2001.
44. Except the Δ${}_{1}$ supersets, only a part of the other Δ${}_{i}$ (i = 2, …, K-1) supersets is stored in the ΔTree.
45. Akutsu, T.; Miyano, S.; Kuhara, S. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In Proceedings of the Pacific Symposium on Biocomputing ’99, Big Island, HI, USA, 4–9 January 1999; Volume 4, pp. 17–28.
46. Zheng, Y.; Kwoh, C.K. Dynamic Algorithm for Inferring Qualitative Models of Gene Regulatory Networks; IEEE Computer Society Press: Stanford, CA, USA, 2004; pp. 353–362.
47. Zheng, Y.; Kwoh, C.K. Dynamic Algorithm for Inferring Qualitative Models of Gene Regulatory Networks. Int. J. Data Min. Bioinf. 2006, 1, 111–137.
48. Aha, D.W.; Kibler, D.; Albert, M.K. Instance-Based Learning Algorithms. Mach. Learn. 1991, 6, 37–66.
49. Hamming, R. Error Detecting and Error Correcting Codes. Bell Syst. Tech. J. 1950, 29, 147–160.
50. Fayyad, U.; Irani, K. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, IJCAI-93, Chambéry, France, 28 August 1993; pp. 1022–1027.
51. Frank, E.; Hall, M.; Trigg, L.; Holmes, G.; Witten, I. Data mining in bioinformatics using Weka. Bioinformatics 2004, 20, 2479–2481.
52. Quinlan, J. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Francisco, CA, USA, 1993.
53. Langley, P.; Iba, W.; Thompson, K. An Analysis of Bayesian Classifiers. In Proceedings of the National Conference on Artificial Intelligence, San Jose, CA, USA, 12–16 July 1992; pp. 223–228.
54. Platt, J. Fast training of support vector machines using sequential minimal optimization; MIT Press: Cambridge, MA, USA, 1999; Chapter 12, pp. 185–208.
55. Zheng, Y. Information Learning Approach. Ph.D. Thesis, Nanyang Technological University, Singapore, 2007.
56. Akutsu, T.; Miyano, S.; Kuhara, S. Algorithm for Identifying Boolean Networks and Related Biological Networks Based on Matrix Multiplication and Fingerprint Function. J. Comput. Biol. 2000, 7, 331–343.
57. Akutsu, T.; Miyano, S.; Kuhara, S. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics 2000, 16, 727–734.
58. Akutsu, T.; Miyano, S.; Kuhara, S. A simple greedy algorithm for finding functional relations: efficient implementation and average case analysis. Theor. Comput. Sci. 2003, 292, 481–495.
59. Ideker, T.; Thorsson, V.; Karp, R. Discovery of Regulatory Interactions Through Perturbation: Inference and Experimental Design. In Proceedings of the Pacific Symposium on Biocomputing, PSB 2000, Island of Oahu, HI, USA, 4–9 January 2000; Volume 5, pp. 302–313.
60. Lähdesmäki, H.; Shmulevich, I.; Yli-Harja, O. On Learning Gene Regulatory Networks Under the Boolean Network Model. Mach. Learn. 2003, 52, 147–167.
61. Liang, S.; Fuhrman, S.; Somogyi, R. REVEAL, a general reverse engineering algorithm for genetic network architectures. In Proceedings of the Pacific Symposium on Biocomputing ’98, Maui, HI, USA, 4–9 January 1998; Volume 3, pp. 18–29.
62. Maki, Y.; Tominaga, D.; Okamoto, M.; Watanabe, S.; Eguchi, Y. Development of a System for the Inference of Large Scale Genetic Networks. In Proceedings of the Pacific Symposium on Biocomputing, PSB 2001, Big Island, HI, USA, 3–7 January 2001; Volume 6, pp. 446–458.
63. Shmulevich, I.; Yli-Harja, O.; Astola, J. Inference of genetic regulatory networks under the best-fit extension paradigm. In Proceedings of Nonlinear Signal and Image Processing, NSIP 2001, Baltimore, MD, USA, 3–6 June 2001; pp. 3–6.

**Figure 1.** The advantage of using MI to choose the most discriminatory feature vectors. The circles represent the entropy of variables or vectors. The intersection between the circles represents the MI between the variables or vectors. ${\mathbf{U}}_{s-1}$ is the set of features already chosen. The shaded regions represent $I({X}_{i};Y|{\mathbf{U}}_{s-1})$, where ${X}_{i}\in \mathbf{V}\setminus {\mathbf{U}}_{s-1}$. (a) When ${X}_{i}=A$. A shares less MI with Y than B does. However, the vector $\{{\mathbf{U}}_{s-1},A\}$ shares more MI with Y than the vector $\{{\mathbf{U}}_{s-1},B\}$ does. (b) When ${X}_{i}=B$. B shares more MI with Y than A does. But B and ${\mathbf{U}}_{s-1}$ have a large MI, which means that ${\mathbf{U}}_{s-1}$ already contains most of the information about Y carried by B; in other words, the additional information about Y carried by B, $I(B;Y|{\mathbf{U}}_{s-1})$, is small.
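The conditional MI in the shaded regions can be computed with the chain rule, $I({X}_{i};Y|{\mathbf{U}}_{s-1})=I(\{{\mathbf{U}}_{s-1},{X}_{i}\};Y)-I({\mathbf{U}}_{s-1};Y)$. A minimal empirical sketch in Python (the helper names are ours, not from the paper):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical MI I(X;Y) in bits from paired samples; xs may hold
    tuples when X is a feature vector."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def conditional_mi(xs, ys, us):
    """I(X;Y|U) via the chain rule: I({U,X};Y) - I(U;Y).
    us holds tuples (the already-chosen features U_{s-1})."""
    ux = [u + (x,) for u, x in zip(us, xs)]
    return mutual_information(ux, ys) - mutual_information(us, ys)
```

For example, when Y = U XOR X with independent uniform bits, I(X;Y) = 0 while I(X;Y|U) = 1 bit, which is exactly the situation in panel (a) where a feature is only informative jointly with the already-chosen set.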

**Figure 2.** The search procedure of the DFL algorithm when it is learning $Y=(A\cdot C)+(A\cdot D)$. ${\{A,C,D\}}^{*}$ is the target combination. The combinations with a black dot under them are the subsets which share the largest MI with Y on their layers. First, the DFL algorithm searches the first layer, and finds that $\left\{A\right\}$, with a black dot under it, shares the largest MI with Y among the subsets on the first layer. Then, it continues to search ${\Delta}_{1}\left(A\right)$ on the second layer. These calculations continue until the target combination $\{A,C,D\}$ is found on the third layer.

**Figure 3.** The $\Delta Tree$ when searching the EAs for $Y=(A\cdot C)+(A\cdot D)$. (a) After searching the first layer of Figure 2, but before the sort step in line 7 of Table 2. (b) When searching the second layer of Figure 2. $\left\{A\right\}$, $\left\{C\right\}$ and $\left\{D\right\}$, which are included in the EAs of Y, are listed before $\left\{B\right\}$ after the sort step in line 7 of Table 2. (c) When searching the third layer of Figure 2; ${\{A,C,D\}}^{*}$ is the target combination. Similar to part (b), $\{A,C\}$ and $\{A,D\}$ are listed before $\{A,B\}$. When checking the combination $\{A,C,D\}$, the DFL algorithm finds that $\{A,C,D\}$ is the complete set of EAs for Y, since $\{A,C,D\}$ satisfies the criterion of Theorem 4.

**Figure 4.** The exhaustive search procedure of the DFL algorithm when it is learning $Y=(A\cdot C)+(A\cdot D)$. ${\{A,C,D\}}^{*}$ is the target combination. (a) The exhaustive search after the first round of searching. The numbers beside the subsets are the steps of the DFL algorithm in part (b). The solid edges represent the search path in the first round of searching, marked as the blue region in part (b). The dashed edges represent the search path beyond the first round of searching (only partly shown for the sake of legibility), marked as the yellow regions in the table below. (b) The exhaustive search steps. The blue, yellow and red regions correspond to the first round of searching, the exhaustive search, and the subsets, as well as their supersets, not checked after deploying the redundancy matrix introduced in Section B.1.

**Figure 5.** The Venn diagram of $H\left(\mathbf{X}\right)$, $H\left(Y\right)$ and $I(\mathbf{X};Y)$, when $Y=f\left(\mathbf{X}\right)$. (**a**) The noiseless case, where the MI between $\mathbf{X}$ and Y is the entropy of Y. (**b**) The noisy case, where the entropy of Y is not strictly equal to the MI between $\mathbf{X}$ and Y. The shaded region results from the noise. The ϵ value method means that if the area of the shaded region is smaller than or equal to $\epsilon \times H\left(Y\right)$, then the DFL algorithm stops the search and builds the function for Y with $\mathbf{X}$.
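In code, the ϵ value method reduces to checking whether the unexplained entropy $H(Y)-I(\mathbf{X};Y)$ (the shaded region) is at most $\epsilon \times H(Y)$. A minimal sketch (the helper names are ours, not from the paper):

```python
from collections import Counter
from math import log2

def entropy(ys):
    """Empirical entropy H(Y) in bits."""
    n = len(ys)
    return -sum(c / n * log2(c / n) for c in Counter(ys).values())

def mutual_information(xs, ys):
    """Empirical MI I(X;Y) in bits; xs may hold tuples for vectors."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def epsilon_accept(xs, ys, eps):
    """Stop the search and build Y = f(X) when the shaded region
    H(Y) - I(X;Y) is no larger than eps * H(Y)."""
    h = entropy(ys)
    return h - mutual_information(xs, ys) <= eps * h + 1e-12
```

In the noiseless case of panel (a), I(X;Y) = H(Y), so the criterion holds even with eps = 0; a larger eps tolerates proportionally more noise.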

**Figure 6.** The performance of the DFL algorithm for different ϵ values. The figures are generated from the LED+17 data sets in Table 5. The training data set has 2000 samples and K is set to 20. The curves marked with circles and triangles show the result of 10-fold cross validation and the result of an independent testing data set of 1000 samples, respectively. The ${\epsilon}_{op.}$ pointed to by an arrow is the optimal ϵ value with which the DFL algorithm reaches its highest prediction accuracy in a 10-fold cross validation on the training data set. (**a**) ϵ vs. accuracy. (**b**) ϵ vs. the number of selected features k. (**c**) ϵ vs. the run time (s).

**Figure 7.** The manual binary search for the minimum ϵ value. This figure is generated with the LED training data set in Table 5, with 2000 samples. The ticks indicate whether the DFL algorithm can find a model after an ϵ value is specified in each try.

**Figure 8.** The comparison of accuracies ((**a**) to (**c**)), number of features ((**d**) to (**f**)) and run time ((**g**) to (**i**)) for different feature subset selection methods on the discretized data sets. (**a**) C4.5, (**b**) NB, (**c**) SVM, (**d**) DFL vs. CFS, (**e**) DFL vs. CSE, (**f**) DFL vs. WSE, (**g**) DFL vs. CFS, (**h**) DFL vs. CSE, (**i**) DFL vs. WSE.

**Figure 9.** The $I({X}_{i};Y)$ in the data sets of 1000 samples generated with $Y={X}_{21}\oplus {X}_{29}\oplus {X}_{60}$, where $\mathbf{V}=\{{X}_{1},\dots ,{X}_{100}\}$ and all ${X}_{i},{X}_{j}\in \mathbf{V}$ are mutually independent. The horizontal axis is the index of the features. The vertical axis is $I({X}_{i};Y)$ in bits. The features pointed to by the arrows are the relevant features.

**Figure 10.** The histograms of the number of subsets checked, m, and the run time of the DFL algorithm for learning one Boolean function in the RANDOM data sets, when $n=100$, $k=3$ and $N=200$. For parts (b) and (c), the cases pointed to by arrows are the worst ones. (a) The histogram of m without using the redundancy matrix $\mathbb{R}$. (b) The histogram of run time, t (horizontal axis, shown in seconds). (c) The histogram of run time after using the redundancy matrix $\mathbb{R}$ introduced in Section B.1.

```
Algorithm: DFL(V, K, T)
Input:   a list V with n variables, indegree K,
         T = {(v_i, y_i) : i = 1, ..., N}. T is global.
Output:  f
Begin:
 1    L <- all single-element subsets of V;
 2    DeltaTree.FirstNode <- L;
 3    calculate H(Y);                       // from T
 4    D <- 1;                               // initial depth
 5*   f = Sub(Y, DeltaTree, H(Y), D, K);
 6    return f;
End

Algorithm: Sub(Y, DeltaTree, H, D, K)
Input:   variable Y, DeltaTree, entropy H(Y),
         current depth D, maximum indegree K
Output:  function table for Y, Y = f(X)
Begin:
 1    L <- DeltaTree.DthNode;
 2    for every element X in L {
 3        calculate I(X;Y);                 // from T
 4        if (I(X;Y) == H) {                // from Theorem 4
 5            extract Y = f(X) from T;
 6            return Y = f(X);
          }
      }
 7    sort L according to I;
 8    for every element X in L {
 9        if (D < K) {
10            D <- D + 1;
11            DeltaTree.DthNode <- Delta_1(X);
12            return Sub(Y, DeltaTree, H, D, K);
          }
      }
13    return "Fail(Y)";                     // fail to find a function for Y
End
```
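As a concrete illustration, the core of the search above can be sketched in Python. This is a simplified, single-path greedy variant: it always extends the subset that shares the largest MI with Y and stops when the criterion of Theorem 4 holds, whereas the actual DFL algorithm additionally backtracks through the ΔTree. All names are ours:

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(ys):
    n = len(ys)
    return -sum(c / n * log2(c / n) for c in Counter(ys).values())

def mi(cols, data, ys):
    """I(X;Y), where X is the projection of each row onto the columns in cols."""
    xs = [tuple(row[i] for i in cols) for row in data]
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def dfl_greedy(data, ys, K):
    """Grow the subset U one feature at a time, always adding the feature
    whose superset {U, Xi} shares the largest MI with Y; stop when
    I(U;Y) == H(Y) (Theorem 4) or |U| reaches K."""
    h, U = entropy(ys), []
    while len(U) < K:
        best = max((i for i in range(len(data[0])) if i not in U),
                   key=lambda i: mi(U + [i], data, ys))
        U.append(best)
        if h - mi(U, data, ys) < 1e-12:   # I(U;Y) == H(Y), up to float error
            return sorted(U)
    return None                           # fail to find a function for Y

# Worked example from the figures: Y = (A·C) + (A·D) over columns A, B, C, D
data = list(product([0, 1], repeat=4))
ys = [a & (c | d) for a, b, c, d in data]
```

On this example, `dfl_greedy(data, ys, K=3)` recovers the target combination {A, C, D}, i.e. column indices [0, 2, 3], following the same path as Figure 2.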

| ABCD | Y | ABCD | Y | ABCD | Y | ABCD | Y |
|---|---|---|---|---|---|---|---|
| 0000 | 0 | 0100 | 0 | 1000 | 0 | 1100 | 0 |
| 0001 | 0 | 0101 | 0 | 1001 | 1 | 1101 | 1 |
| 0010 | 0 | 0110 | 0 | 1010 | 1 | 1110 | 1 |
| 0011 | 0 | 0111 | 0 | 1011 | 1 | 1111 | 1 |

| ACD | Y | Count | ACD | Y | Count |
|---|---|---|---|---|---|
| 000 | 0 | 2 | 100 | 0 | 2 |
| 001 | 0 | 2 | 101 | 1 | 2 |
| 010 | 0 | 2 | 110 | 1 | 2 |
| 011 | 0 | 2 | 111 | 1 | 2 |
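The projection above illustrates the criterion of Theorem 4: Y is a deterministic function of {A, C, D}, which is equivalent to $I(\{A,C,D\};Y)=H(Y)$, while a subset omitting one of these features falls short. A self-contained numerical check (the helper names are ours):

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(ys):
    n = len(ys)
    return -sum(c / n * log2(c / n) for c in Counter(ys).values())

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

rows = list(product([0, 1], repeat=4))        # columns A, B, C, D
ys = [a & (c | d) for a, b, c, d in rows]     # Y = (A·C) + (A·D)
acd = [(a, c, d) for a, b, c, d in rows]      # projection onto {A, C, D}
ab = [(a, b) for a, b, c, d in rows]          # an incomplete subset

# {A,C,D} satisfies Theorem 4; {A,B} does not.
determines = abs(mutual_information(acd, ys) - entropy(ys)) < 1e-9
```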

| | Dataset | Ftr. #${}^{1}$ | Cl. # | Tr. # | Te. # | Ms. # | Reference |
|---|---|---|---|---|---|---|---|
| 1 | Lenses | 4 | 3 | 24 | LOO${}^{2}$ | 0 | [15] |
| 2 | Iris | 4 | 3 | 100 | 50 | 0 | [15] |
| 3 | Monk1 | 6 | 2 | 124 | 432 | 0 | [15] |
| 4 | Monk2 | 6 | 2 | 169 | 432 | 0 | [15] |
| 5 | Monk3 | 6 | 2 | 122 | 432 | 0 | [15] |
| 6 | LED | 7 | 10 | 2000 | 1000 | 0 | [15] |
| 7 | Nursery | 8 | 5 | 12960 | CV10${}^{3}$ | 0 | [15] |
| 8 | Breast | 9 | 2 | 699 | CV10 | 16 | [15] |
| 9 | Wine | 13 | 3 | 119 | 59 | 0 | [15] |
| 10 | Credit | 15 | 2 | 460 | 230 | 67 | [15] |
| 11 | Vote | 16 | 2 | 435 | CV10 | 392 | [15] |
| 12 | Zoo | 16 | 7 | 101 | LOO | 0 | [15] |
| 13 | ImgSeg | 19 | 7 | 210 | 2100 | 0 | [15] |
| 14 | Mushroom | 22 | 2 | 8124 | CV10 | 2480 | [15] |
| 15 | LED+17 | 24 | 10 | 2000 | 1000 | 0 | [15] |
| 16 | Ionosphere | 34 | 2 | 234 | 117 | 0 | [15] |
| 17 | Chess | 36 | 2 | 2130 | 1066 | 0 | [15] |
| 18 | Anneal | 38 | 6 | 798 | 100 | 22175 | [15] |
| 19 | Lung | 56 | 3 | 32 | LOO | 0 | [15] |
| 20 | Ad | 1558 | 2 | 2186 | 1093 | 2729 | [15] |
| 21 | ALL | 7129 | 2 | 38 | 34 | 0 | [16] |
| 22 | DLBCL | 7129 | 2 | 55 | 22 | 0 | [17] |
| 23 | MLL | 12582 | 3 | 57 | 15 | 0 | [18] |
| 24 | Ovarian | 15154 | 2 | 169 | 84 | 0 | [19] |

${}^{1}$ The number does not include the class attribute. ${}^{2}$ LOO and ${}^{3}$ CV10 stand for leave-one-out and 10-fold cross validation, respectively.

**Table 6.**The comparison summary of the number of features chosen by different feature selection methods.

| F.S. Pair | Algo. | ${<}^{1}$ (Disc.) | = (Disc.) | > (Disc.) | < (Cont.) | = (Cont.) | > (Cont.) |
|---|---|---|---|---|---|---|---|
| DFL:CFS | NA${}^{2}$ | 9 | 6 | 8 | 7 | 6 | 9 |
| DFL:CSE | NA | 17 | 4 | 2 | 17 | 5 | 1 |
| DFL:WSE | C4.5 | 6 | 9 | 8 | 8 | 7 | 8 |
| | NB | 10 | 4 | 9 | 9 | 7 | 7 |
| | SVM | 12 | 5 | 5 | 13 | 5 | 4 |
| | sub sum | 28 | 18 | 22 | 30 | 21 | 17 |
| total sum | | 54 | 26 | 34 | 54 | 30 | 29 |

${}^{1}$ The <, = and > columns give the number of data sets where the DFL algorithm chooses a smaller, the same, or a larger number of features than the compared feature selection algorithm; Disc. and Cont. abbreviate the discretized and continuous data sets. ${}^{2}$ NA means not applicable.

| F.S. Pair | Algo. | ${>}^{1}$ (Disc.) | = (Disc.) | < (Disc.) | > (Cont.) | = (Cont.) | < (Cont.) |
|---|---|---|---|---|---|---|---|
| DFL:CFS | C4.5 | 11 | 7 | 6 | 13 | 5 | 4 |
| | NB | 8 | 6 | 8 | 8 | 5 | 9 |
| | SVM | 12 | 5 | 7 | 9 | 6 | 7 |
| | sum | 31 | 18 | 21 | 30 | 16 | 20 |
| DFL:CSE | C4.5 | 8 | 7 | 8 | 9 | 6 | 8 |
| | NB | 8 | 6 | 9 | 10 | 5 | 8 |
| | SVM | 11 | 7 | 5 | 10 | 5 | 8 |
| | sum | 27 | 20 | 22 | 29 | 16 | 24 |
| DFL:WSE | C4.5 | 4 | 10 | 9 | 7 | 7 | 9 |
| | NB | 7 | 4 | 12 | 7 | 4 | 12 |
| | SVM | 5 | 6 | 11 | 4 | 5 | 13 |
| | sum | 16 | 20 | 32 | 18 | 16 | 34 |

${}^{1}$ The >, = and < columns give the number of data sets where the classification algorithm in the Algo. column performs better, the same, or worse on the features chosen by the DFL algorithm; Disc. and Cont. abbreviate the discretized and continuous data sets.

```
Algorithm: DFL(V, K, T)
Input:   a list V with n variables, indegree K,
         T = {(v_i, y_i) : i = 1, ..., N}. T is global.
Output:  f
Begin:
 1    R <- boolean[n][n];                   // initialize R, default value is false
 2    L <- all single-element subsets of V;
 3    DeltaTree.FirstNode <- L;
 4    calculate H(Y);                       // from T
 5    D <- 1;                               // initial depth
 6*   f = Sub(Y, DeltaTree, H(Y), D, K);
 7    return f;
End

Algorithm: Sub(Y, DeltaTree, H, D, K)
Input:   variable Y, DeltaTree, entropy H(Y),
         current depth D, maximum indegree K
Output:  function table for Y, Y = f(X)
Begin:
 1    L <- DeltaTree.DthNode;
 2    for every element X in L {
 3*       if ((|X| == 2) && (R[X[0]][X[1]] == true || R[X[1]][X[0]] == true)) {
 4            continue;                     // X has been checked; check the next element in L
          }
 5        calculate I(X;Y);                 // from T
 6        if (I(X;Y) == H) {                // from Theorem 4
 7            extract Y = f(X) from T;
 8            return Y = f(X);
          }
 9        else if ((D == K) && X is the last element in L) {
10            return "Fail(Y)";
          }
      }
11    sort L according to I;
12    for every element X in L {
13        if (D < K) {
14            if ((|X| == 2) && (R[X[0]][X[1]] == true || R[X[1]][X[0]] == true)) {
15                continue;
              }
16            D <- D + 1;
17            DeltaTree.DthNode <- Delta_1(X);
18            f = Sub(Y, DeltaTree, H, D, K);
19            if (f != "Fail(Y)") {
20                return f;
              }
21            else if (|X| == 2) {
22                R[X[0]][X[1]] <- true;    // all Delta(X) have been checked; mark X in R
23                continue;
              }
          }
      }
24    return "Fail(Y)";                     // fail to find a function for Y
End
```

**Table 10.** The settings of the DFL algorithm. To obtain the optimal model, we change the ϵ value from 0 to 0.8 with a step of 0.01. For each ϵ value, we train a model with the DFL algorithm, then perform the corresponding test on the selected data sets. In our implementation of the DFL algorithm, the process of choosing the optimal model is fully automated. For those data sets whose tests are performed with cross validation, the number of features k and the number of rules r in the classifier are taken from the most frequently obtained classifiers.

(Column groups: performances = Accuracy, Time; settings = n, K, ϵ; classifiers = k, r, $k/n$.)

| | Data Set | Accuracy (%) | Time (s) | n | K | ϵ | k | r | $k/n$ (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Lenses | 75.0 | 0.03 | 4 | 4 | 0.26 | 3 | 12 | 75.0 |
| 2 | Iris | 96.0 | 0.01 | 4 | 4 | 0.08 | 2 | 6 | 50.0 |
| 3 | Monk1 | 100.0 | 0.01 | 6 | 6 | 0 | 3 | 35 | 50.0 |
| 4 | Monk2 | 73.8 | 0.02 | 6 | 6 | 0.21 | 6 | 168 | 100.0 |
| 5 | Monk3 | 97.2 | 0.01 | 6 | 6 | 0.64 | 2 | 17 | 33.3 |
| 6 | LED | 74.9 | 0.08 | 7 | 7 | 0.29 | 7 | 207 | 100.0 |
| 7 | Nursery | 93.1 | 20.21 | 8 | 8 | 0.13 | 5 | 541 | 62.5 |
| 8 | Breast | 95.0 | 0.20 | 9 | 9 | 0.05 | 3 | 185 | 33.3 |
| 9 | Wine | 98.3 | 0.01 | 13 | 13 | 0.04 | 4 | 29 | 30.8 |
| 10 | Credit | 88.3 | 0.01 | 15 | 15 | 0.57 | 2 | 11 | 13.3 |
| 11 | Vote | 95.7 | 0.22 | 16 | 16 | 0.11 | 4 | 41 | 25.0 |
| 12 | Zoo | 92.8 | 1.24 | 16 | 16 | 0 | 5 | 21 | 31.3 |
| 13 | ImgSeg | 90.6 | 0.01 | 19 | 15 | 0.16 | 3 | 41 | 15.8 |
| 14 | Mushroom | 100.0 | 11.45 | 22 | 22 | 0 | 4 | 96 | 18.2 |
| 15 | LED+17 | 75.4 | 0.83 | 24 | 20 | 0.31 | 7 | 286 | 29.2 |
| 16 | Ionosphere | 94.9 | 0.11 | 34 | 20 | 0.12 | 6 | 96 | 17.6 |
| 17 | Chess | 97.4 | 13.30 | 36 | 20 | 0.01 | 19 | 844 | 52.8 |
| 18 | Anneal | 99.0 | 0.22 | 38 | 20 | 0.04 | 5 | 44 | 13.2 |
| 19 | Lung | 62.5 | 0.10 | 56 | 20 | 0.44 | 2 | 12 | 3.6 |
| 20 | Ad | 95.0 | 42.80 | 1558 | 20 | 0.23 | 6 | 104 | 0.4 |
| 21 | ALL | 94.1 | 0.02 | 7129 | 20 | 0.3 | 1 | 3 | 0.014 |
| 22 | DLBCL | 95.5 | 0.01 | 7129 | 20 | 0.52 | 1 | 4 | 0.014 |
| 23 | MLL | 100.0 | 0.48 | 12582 | 20 | 0.06 | 2 | 11 | 0.016 |
| 24 | Ovarian | 98.8 | 0.31 | 15154 | 20 | 0.29 | 1 | 4 | 0.007 |
| | average | 91.0 | 3.82 | 1829 | 14 | 0.20 | 4 | 117 | 31.5 |

| | Data Set | k | Feature Index (Discretized Data) | Feature Index (Continuous Data) |
|---|---|---|---|---|
| 1 | Lenses | 2 | 1,3,4 | 1,3,4 |
| 2 | Iris | 2 | 3,4 | 3,4 |
| 3 | Monk1 | 3 | 1,2,5 | 1,2,5 |
| 4 | Monk2 | 6 | 1,2,3,4,5,6 | 1,2,3,4,5,6 |
| 5 | Monk3 | 2 | 2,5 | 2,5 |
| 6 | LED | 7 | 1,2,3,4,5,6,7 | 1,2,3,4,5,6,7 |
| 7 | Nursery | 5 | 1,2,5,7,8 | 1,2,5,7,8 |
| 8 | Breast | 3 | 1,3,6 | 1,3,6 |
| 9 | Wine | 4 | 1,7,10,13 | 1,7,10,13 |
| 10 | Credit | 2 | 4,9 | 4,9 |
| 11 | Vote | 4 | 3,4,7,11 | 3,4,7,11 |
| 12 | Zoo | 5 | 3,4,6,9,13 | 3,4,6,9,13 |
| 13${}^{*}$ | ImgSeg | 3 | 1,13,15 | 2,17,19 |
| 14 | Mushroom | 4 | 5,20,21,22 | 5,20,21,22 |
| 15 | LED+17 | 7 | 1,2,3,4,5,6,7 | 1,2,3,4,5,6,7 |
| 16 | Ionosphere | 6 | 3,5,6,12,21,27 | 3,5,6,12,21,27 |
| 17 | Chess | 19 | 1,3,4,6,7,10,15,16,17,18,20,21,23,24,30,32,33,34,35 | 1,3,4,6,7,10,15,16,17,18,20,21,23,24,30,32,33,34,35 |
| 18 | Anneal | 5 | 3,5,8,9,12 | 3,5,8,9,12 |
| 19 | Lung | 2 | 6,20 | 6,20 |
| 20 | Ad | 6 | 1,2,3,352,1244,1400 | 1,2,3,352,1244,1400 |
| 21${}^{*}$ | ALL | 1 | 234 | 1882 |
| 22${}^{*}$ | DLBCL | 1 | 55 | 506 |
| 23${}^{*}$ | MLL | 2 | 709,1550 | 2592,5083 |
| 24${}^{*}$ | Ovarian | 1 | 839 | 1679 |

**Table 12.** The accuracies of different algorithms on the features chosen by the DFL algorithm. The accuracies for the data sets with numerical attributes are given as discretized/numerical.

No. | Data Set | C4.5 | NB | SVM |
---|---|---|---|---|
1 | Lenses | 87.5 | 75.0 | 66.7 |
2 | Iris | 94.0/92.0 | 94.0/94.0 | 94.0/96.0 |
3 | Monk1 | 88.9 | 72.2 | 72.2 |
4 | Monk2 | 65.0 | 61.6 | 67.1 |
5 | Monk3 | 97.2 | 97.2 | 97.2 |
6 | LED | 74.6 | 75.1 | 75.3 |
7 | Nursery | 93.1 | 89.1 | 90.4 |
8 | Breast | 94.8 | 96.2 | 95.0 |
9 | Wine | 93.2/93.2 | 96.6/98.3 | 94.9/98.3 |
10 | Credit | 87.4/87.4 | 87.4/87.4 | 87.4/87.4 |
11 | Vote | 94.9 | 92.2 | 94.9 |
12 | Zoo | 90.1 | 93.1 | 94.3 |
13 | ImgSeg | 90.4/90.8 | 90.8/84.3 | 90.7/76.1 |
14 | Mushroom | 100.0 | 98.6 | 100.0 |
15 | LED+17 | 75.1 | 74.2 | 75.1 |
16 | Ionosphere | 93.2/94.9 | 95.7/94.0 | 94.9/80.3 |
17 | Chess | 99.0 | 90.5 | 96.1 |
18 | Anneal | 84.0/84.0 | 80.0/74.0 | 89.0/88.0 |
19 | Lung | 68.8 | 56.3 | 71.9 |
20 | Ad | 92.6/94.4 | 93.2/92.4 | 94.4/92.6 |
21 | ALL | 94.1/94.1 | 94.1/94.1 | 94.1/82.4 |
22 | DLBCL | 95.5/95.5 | 95.5/90.9 | 95.5/77.3 |
23 | MLL | 93.3/93.3 | 100.0/100.0 | 100.0/73.3 |
24 | Ovarian | 98.8/98.8 | 98.8/98.8 | 98.8/97.6 |
average | | 89.4/89.5 | 87.4/86.7 | 88.7/85.2 |
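As a quick sanity check on the reported "average" row, the per-data-set C4.5 accuracies on discretized data can be averaged directly. This is a minimal sketch; the values below are transcribed from Table 12, not computed anew.

```python
# C4.5 accuracies on discretized data, transcribed from Table 12 (24 data sets).
c45_discretized = [
    87.5, 94.0, 88.9, 65.0, 97.2, 74.6, 93.1, 94.8,
    93.2, 87.4, 94.9, 90.1, 90.4, 100.0, 75.1, 93.2,
    99.0, 84.0, 68.8, 92.6, 94.1, 95.5, 93.3, 98.8,
]

# The unweighted mean reproduces the "average" row of Table 12.
mean_acc = sum(c45_discretized) / len(c45_discretized)
print(round(mean_acc, 1))  # 89.4
```

The same check can be applied to any other column; each "average" row in Tables 12-15 is the unweighted mean over the 24 data sets.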

**Table 13.** The accuracies of different algorithms on the features chosen by the CFS algorithm [20]. For data sets with numerical attributes, the accuracies are reported as discretized/numerical. NA means not available.

No. | Data Set | C4.5 | NB | SVM |
---|---|---|---|---|
1 | Lenses | 70.8 | 70.8 | 50.0 |
2 | Iris | 94.0/92.0 | 94.0/94.0 | 94.6/96.0 |
3 | Monk1 | 75.0 | 75.0 | 75.0 |
4 | Monk2 | 67.1 | 63.0 | 67.1 |
5 | Monk3 | 97.2 | 97.2 | 97.2 |
6 | LED | 74.6 | 75.1 | 75.3 |
7 | Nursery | 71.0 | 71.0 | 71.0 |
8 | Breast | 94.7 | 97.3 | 95.8 |
9 | Wine | 93.2/93.2 | 98.3/98.3 | 96.6/96.6 |
10 | Credit | 87.4/87.4 | 87.4/87.4 | 87.4/87.4 |
11 | Vote | 95.6 | 95.6 | 95.6 |
12 | Zoo | 91.1 | 94.1 | 95.3 |
13 | ImgSeg | 90.9/90.3 | 91.7/86.0 | 90.3/87.9 |
14 | Mushroom | 98.5 | 98.5 | 98.5 |
15 | LED+17 | 73.4 | 72.9 | 73.8 |
16 | Ionosphere | 91.5/93.2 | 94.9/93.2 | 91.5/80.3 |
17 | Chess | 90.1 | 90.1 | 90.1 |
18 | Anneal | 90.0/92.0 | 87.0/64.0 | 91.0/81.0 |
19 | Lung | 53.1 | 71.9 | 59.7 |
20 | Ad | 94.2/94.1 | 94.4/93.2 | 94.1/92.9 |
21 | ALL | 91.2/91.2 | 91.2/91.2 | 91.2/79.4 |
22 | DLBCL | 95.5/86.4 | 95.5/95.5 | 95.5/81.8 |
23 | MLL | 86.7/NA | 100.0/NA | 100.0/NA |
24 | Ovarian | 98.8/NA | 95.2/NA | 95.0/NA |
average | | 86.1/85.1 | 87.6/85.2 | 86.3/83.1 |

**Table 14.** The accuracies of different algorithms on the features chosen by the CSE algorithm [21]. For data sets with numerical attributes, the accuracies are reported as discretized/numerical. NA means not available.

No. | Data Set | C4.5 | NB | SVM |
---|---|---|---|---|
1 | Lenses | 83.3 | 70.8 | 65.4 |
2 | Iris | 94.0/92.0 | 94.0/94.0 | 94.0/92.0 |
3 | Monk1 | 88.9 | 72.2 | 72.2 |
4 | Monk2 | NA | NA | NA |
5 | Monk3 | 97.2 | 97.2 | 97.2 |
6 | LED | 74.6 | 75.1 | 75.3 |
7 | Nursery | 97.2 | 90.3 | 93.1 |
8 | Breast | 94.9 | 96.9 | 95.8 |
9 | Wine | 98.3/96.6 | 96.6/94.1 | 100.0/100.0 |
10 | Credit | 88.7/86.1 | 87.8/80.4 | 87.4/87.4 |
11 | Vote | 95.4 | 91.0 | 96.5 |
12 | Zoo | 91.1 | 94.1 | 94.2 |
13 | ImgSeg | 86.8/90.3 | 88.5/80.8 | 89.1/80.3 |
14 | Mushroom | 100.0 | 98.5 | 99.7 |
15 | LED+17 | 74.1 | 74.9 | 73.9 |
16 | Ionosphere | 91.5/92.3 | 94.9/94.0 | 94.0/81.2 |
17 | Chess | 93.9 | 93.9 | 93.5 |
18 | Anneal | 88.0/89.0 | 89.0/64.0 | 92.0/83.0 |
19 | Lung | 53.1 | 62.5 | 49.7 |
20 | Ad | 94.4/94.6 | 94.2/93.7 | 94.8/93.7 |
21 | ALL | 91.2/91.2 | 91.2/91.2 | 91.2/79.4 |
22 | DLBCL | 95.5/95.5 | 90.9/86.4 | 95.5/77.3 |
23 | MLL | 73.3/66.7 | 73.3/66.7 | 80.0/46.7 |
24 | Ovarian | 98.8/100.0 | 98.8/100.0 | 98.8/100.0 |
average | | 88.9/88.6 | 87.7/85.3 | 88.0/83.8 |

**Table 15.** The accuracies of different algorithms on the features chosen by the WSE algorithm [22]. For data sets with numerical attributes, the accuracies are reported as discretized/numerical. NA means not available.

No. | Data Set | C4.5 | NB | SVM |
---|---|---|---|---|
1 | Lenses | 87.5 | 87.5 | NA |
2 | Iris | 94.0/92.0 | 94.0/94.0 | 94.0/96.0 |
3 | Monk1 | 75.0 | 75.0 | 75.0 |
4 | Monk2 | NA | NA | NA |
5 | Monk3 | 97.2 | 97.2 | 97.2 |
6 | LED | 74.6 | 75.1 | 75.3 |
7 | Nursery | 94.8 | 87.8 | 93.1 |
8 | Breast | 95.0 | 97.5 | 96.4 |
9 | Wine | 96.1/93.2 | 98.3/96.6 | 98.3/94.9 |
10 | Credit | 87.4/86.1 | 87.4/87.4 | 87.4/87.4 |
11 | Vote | 95.6 | 96.1 | 95.6 |
12 | Zoo | 94.1 | 92.1 | 98.1 |
13 | ImgSeg | 91.4/91.5 | 90.4/87.2 | 91.4/87.8 |
14 | Mushroom | 100.0 | 99.7 | 100.0 |
15 | LED+17 | 75.2 | 74.4 | 74.5 |
16 | Ionosphere | 91.5/85.5 | 91.5/91.5 | 91.5/86.3 |
17 | Chess | 93.9 | 93.9 | 93.5 |
18 | Anneal | 93.0/95.0 | 94.0/90.0 | 93.0/92.0 |
19 | Lung | 68.8 | 78.1 | 73.4 |
20 | Ad | 95.2/95.2 | 94.8/94.2 | 94.7/95.0 |
21 | ALL | 91.2/91.2 | 91.2/91.2 | 91.2/73.5 |
22 | DLBCL | 95.5/90.9 | 90.9/81.8 | 81.8/86.4 |
23 | MLL | 93.3/73.3 | 80.0/80.0 | 100.0/93.3 |
24 | Ovarian | 98.8/100.0 | 100.0/100.0 | 100.0/100.0 |
average | | 90.4/88.9 | 89.9/89.1 | 90.7/89.3 |
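To compare the four feature selection methods at a glance, the "average" rows of Tables 12-15 (discretized-data column only) can be collected and ranked per classifier. This is a small sketch; the numbers are transcribed from the tables above, and the dictionary keys are labels introduced here for readability.

```python
# "average" rows (discretized data) transcribed from Tables 12-15.
averages = {
    "DFL (Table 12)": {"C4.5": 89.4, "NB": 87.4, "SVM": 88.7},
    "CFS (Table 13)": {"C4.5": 86.1, "NB": 87.6, "SVM": 86.3},
    "CSE (Table 14)": {"C4.5": 88.9, "NB": 87.7, "SVM": 88.0},
    "WSE (Table 15)": {"C4.5": 90.4, "NB": 89.9, "SVM": 90.7},
}

# For each classifier, rank the feature selection methods by average accuracy.
for clf in ("C4.5", "NB", "SVM"):
    ranked = sorted(averages, key=lambda m: averages[m][clf], reverse=True)
    print(clf, "->", ", ".join(f"{m}: {averages[m][clf]}" for m in ranked))
```

Note that these averages are not fully comparable across methods, since CFS, CSE, and WSE have NA entries on some data sets (e.g., Monk2, MLL, Ovarian) that DFL handles.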

## Supplementary Materials

## A. The Software

## B. The Extended Main Steps of The DFL Algorithm

#### B.1. Redundancy Matrix

#### B.2. Extended Main Steps

#### B.3. Experiments to Show The Usefulness of Redundancy Matrix

## C. The Detailed Settings

## D. The Detailed Results

© 2011 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.

## Share and Cite

**MDPI and ACS Style**

Zheng, Y.; Kwoh, C.K.
A Feature Subset Selection Method Based On High-Dimensional Mutual Information. *Entropy* **2011**, *13*, 860-901.
https://doi.org/10.3390/e13040860
