CASMI—An Entropic Feature Selection Method in Turing’s Perspective

Health data are generally complex in type and small in sample size. Such domain-specific challenges make it difficult to capture information reliably and contribute further to the issue of generalization. To assist the analytics of healthcare datasets, we develop a feature selection method based on the concept of coverage adjusted standardized mutual information (CASMI). The main advantages of the proposed method are: (1) it selects features more efficiently with the help of an improved entropy estimator, particularly when the sample size is small; and (2) it automatically learns the number of features to be selected based on the information from sample data. Additionally, the proposed method handles feature redundancy from the perspective of joint-distribution. The proposed method focuses on non-ordinal data, while it works with numerical data with an appropriate binning method. A simulation study comparing the proposed method to six widely cited feature selection methods shows that the proposed method performs better when measured by the Information Recovery Ratio, particularly when the sample size is small.


Introduction
Inspired by recent advances in Big Data, health informaticians are attempting to assist health care providers and patients from a data perspective, with the hope of improving quality of care, detecting diseases earlier, enhancing decision making, and reducing healthcare costs [1]. In the process, health informaticians have been confronted with the issue of generalization [2]. Analyzing real health data involves many practical problems that could contribute to the issue of generalization; for example, the unknown amount of information (signal) versus error (noise), the curse of dimensionality, and the generalizability of models. All these practical problems boil down to one essential problem: a limited sample. With a limited sample size, the information from the sample cannot represent the information of the population to a desirable extent. For this reason, a simple way to address these problems is to collect a sufficiently large sample, which is unfortunately often impractical in healthcare for multiple reasons. For example:
1. The term sufficiently large is relative to the dimensionality of data and the complexity of feature spaces. Health data are generally large in dimensionality, particularly when dummy variables (one-hot-encoding) are adopted to represent the enormous number of categories of complex qualitative features (such as extracted words from clinical notes). As a result, a dataset with a sample size of 1,000,000 may not be sufficient, depending on its feature spaces.
2. There may not be sufficient patient cases for a rare disease. Even if there are ample potential cases, it may be cost-prohibitive for clinical trials to achieve a sufficient sample.
Without a sufficiently large sample, dimension reduction becomes a major research direction in health data analytics, as reducing the dimensionality can partly relieve the issues caused by a limited sample. These dimension reduction techniques mainly focus on feature selection and feature projection, where feature selection can be further applied to the features created by feature projection. In this article, we focus on feature selection. It has become an important research area, dating back at least to 1997 [3] [4]. Since then, many feature selection methods have been proposed and well discussed in multiple recent review papers, such as [5], [6], and [7]. To apply these feature selection methods to health data, domain-specific challenges must be considered. Health data can be numerical and categorical. For example, many machine readings (e.g., heart rate, blood pressure, and blood oxygen level) are numerical, while gene expression data are categorical. A healthcare dataset could contain numerical data only, categorical data only, or a combination of both data types. The fundamental distinction between numerical data and categorical data is whether the data space is ordinal or non-ordinal. As a result, data consisting of only numbers are not necessarily numerical data; for example, gene expression data can be coded to numbers using dummy variables, but they should still be considered categorical. When the data space is ordinal (numerical data only), classical methods, which detect associations using ordinal information, are more powerful in capturing the associations in data. When the data space is non-ordinal (categorical data only), ordinal information does not naturally exist; hence, applying classical methods to coded data loses their original advantages and introduces additional estimation issues. Namely, involving dummy variables increases the dimensionality of data and further exacerbates the estimation problem under a limited sample. This particularly happens when an involved categorical feature has a complex feature space that requires a tremendous number of dummy variables to represent all the different categories. To deal with categorical data, only information-theoretic quantities (e.g., entropy and mutual information [8]) serve the purpose.

When a dataset is a combination of both data types, it is inconclusive whether to use classical or information-theoretic methods. In general, if one believes that the numerical data in the dataset carry more information than the categorical data, then classical methods can be used. If one believes the categorical data carry more information, then information-theoretic methods should be used, and the numerical data should be binned to categorical data. One should be advised that coding categorical data for classical methods increases dimensionality and creates more difficulties in estimation, while binning numerical data for information-theoretic methods inevitably loses ordinal information. It should also be noted that, although ordinal information could provide extra information about associations among the data, it could also mislead a person's judgment when associations actually exist but there is no visual pattern among the data. The way that classical methods work is very similar to visual inspection; if there is a pattern that can be visually observed, then it can also be detected by some classical methods. However, not all associations among numerical data are visually observable, in which case classical methods would fail to detect the associations. On the other hand, if there is a visual pattern among data, binning the data (losing the ordinal information) would not necessarily lead to a loss of associations among data; it depends on the binning methods and the performance of the information-theoretic methods.
In many cases, all (or most) of the data in a healthcare dataset could be categorical. To analyze the categorical data in such a dataset, information-theoretic feature selection methods are preferred because they can capture the associations among features without using dummy variables, whereas classical methods require dummy variables that would increase the dimensionality. Most existing information-theoretic methods use entropy or mutual information (a function of entropy) to measure associations among data. Information-theoretic methods that do not use entropy include Gini Index and Chi-square Score. Gini Index focuses on whether a feature is separative, but does not indicate probabilistic associations. Chi-square Score relies on the asymptotic normality of each component, and when there are categories with low frequencies (e.g., less than 5), the Chi-square Score is very unstable. However, under a limited sample, we should expect that at least a few, if not many, categories would have relatively low frequencies. As for the existing information-theoretic methods that use entropy (we call these entropic methods), all of them estimate entropy with the classical maximum likelihood estimator (the plug-in estimator). The plug-in entropy estimator performs very poorly when the sample size is not sufficiently large [33][34], and we have discussed that the sample size is usually relatively limited in healthcare datasets. As a result, to use entropic methods in healthcare data analytics, the estimation of entropy under small samples must be improved.
In addition to estimation based on small samples, the unhelpful association is another issue with these samples. While the issue of estimation can be addressed by using a better estimator, the problem of unhelpful association is trickier. The unhelpful association is partially a result of sample randomness, and it could be severe when the sample size is small. Suppose there is a healthcare dataset with multiple features and one outcome, and there is a feature in the dataset that could distinguish the values of the outcome based on the sample information; then there are three possible situations:

Situation 1
The feature has abundant real information toward the outcome, and the real information is well preserved by the sample data.

Situation 2
The feature has abundant real information toward the outcome, but the real information is not well preserved by the sample data.

[...] hence, it could contribute as error (noise). For example, based on the sample information, different values of a situation 2 feature could possibly uniquely determine a corresponding value of the outcome (particularly when a feature space is complex while the sample size is small), but this deterministic relationship revealed by a limited sample is unlikely to be true at the population level. As a result, using this information in further modelling and prediction would be wrong and could further contribute to the issue of generalization. Therefore, we suggest omitting situation 2 features. In addition, one should note that a relevant feature being categorized as situation 2 is a consequence of a limited sample. All situation 2 features would eventually become situation 1 when the sample size grows (because more real information would be revealed). In summary, under a limited sample, situation 1 features should be kept, and situation 2 and 3 features should be dropped.

Focusing on the domain-specific challenges from health data, we develop the proposed entropic feature selection method based on the concept of Coverage Adjusted Standardized Mutual Information (CASMI). The proposed method aims at improving the performance of estimation and addressing the issue of unhelpful association under relatively small samples.

The rest of the article is organized as follows. The concept, intuition, and estimation of CASMI are discussed in Section 2. The proposed method is described in detail in Section 3 and evaluated by a simulation study in Section 4. A brief discussion is in Section 5.

CASMI and its Estimation
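In this section, we introduce the concept, intuition, and estimation of CASMI. Before we proceed, let us state the notations first. Let $\mathcal{X} = \{x_i; i = 1, \dots, K_1\}$ and $\mathcal{Y} = \{y_j; j = 1, \dots, K_2\}$ be two finite alphabets with cardinalities $K_1 < \infty$ and $K_2 < \infty$, respectively. Consider the Cartesian product $\mathcal{X} \times \mathcal{Y}$ with a joint probability distribution $p = \{p_{i,j}\}$. Let the two marginal distributions be respectively denoted by $p_x = \{p_{i,\cdot}\}$ and $p_y = \{p_{\cdot,j}\}$, where $p_{i,\cdot} = \sum_j p_{i,j}$ and $p_{\cdot,j} = \sum_i p_{i,j}$. Assume that $p_{i,\cdot} > 0$ and $p_{\cdot,j} > 0$ for all $1 \le i \le K_1$ and $1 \le j \le K_2$ and that there are $K = \sum_{i,j} \mathbb{1}[p_{i,j} > 0]$ positive joint probabilities. We re-enumerate these $K$ positive probabilities in one sequence and denote it as $\{p_k; k = 1, \dots, K\}$. Let $X$ and $Y$ be random variables following distributions $p_x$ and $p_y$, respectively. For every pair of $i$ and $j$, let $f_{i,j}$ be the observed frequency of the random pair $(X, Y)$ taking value $(x_i, y_j)$, in an iid sample of size $n$ from $\mathcal{X} \times \mathcal{Y}$ under $p$, and let $\hat{p}_{i,j} = f_{i,j}/n$ be the corresponding relative frequency. Consequently, we write $\hat{p} = \{\hat{p}_k\}$ (i.e., $\{\hat{p}_{i,j}\}$), $\hat{p}_x = \{\hat{p}_{i,\cdot}\}$, and $\hat{p}_y = \{\hat{p}_{\cdot,j}\}$ as the sets of observed joint and marginal relative frequencies. Shannon's mutual information between $X$ and $Y$ is defined as

$$MI(X, Y) = H(X) + H(Y) - H(X, Y), \tag{1}$$

where

$$H(X) = -\sum_{i} p_{i,\cdot} \ln p_{i,\cdot}, \qquad H(Y) = -\sum_{j} p_{\cdot,j} \ln p_{\cdot,j}, \qquad H(X, Y) = -\sum_{k} p_k \ln p_k.$$

We define the CASMI as follows:

Definition (CASMI). $\kappa^*$, the Coverage Adjusted Standardized Mutual Information (CASMI) of a feature $X$ to an outcome $Y$, is defined as

$$\kappa^*(X, Y) = \kappa(X, Y) \cdot (1 - \pi_0(X)), \tag{2}$$

where

$$\kappa(X, Y) = \frac{MI(X, Y)}{H(Y)}, \tag{3}$$

$$\pi_0(X) = \sum_{i} p_{i,\cdot} \, \mathbb{1}[\hat{p}_{i,\cdot} = 0],$$

and $(1 - \pi_0)$ is the sample coverage that was first introduced by Good [35] as "the proportion of the population represented by (the species occurring in) the sample".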

Intuition of CASMI
Many entropic concepts can measure the associations among non-ordinal data; for example, mutual information (MI), Kullback-Leibler divergence [36], conditional mutual information [37], and weighted variants [38]. Among them, MI is the fundamental concept, as all the other entropic association measurements are developed based on or equivalent to MI. For this reason, we develop the CASMI starting with MI. It is well known that $MI \ge 0$, and $MI(X, Y) = 0$ if and only if $X$ and $Y$ are independent. However, MI is not bounded from above; hence, using the values of MI to compare the degrees of dependence among different pairs of random variables is inconvenient. Therefore, it is necessary to standardize the mutual information, which yields the so-called standardized mutual information (SMI) or normalized variants. [39] provides several forms of SMI, such as $MI/H(X)$ (also known as information gain ratio if $X$ is a feature and $Y$ is the outcome), $MI/H(Y)$, and $MI/H(X, Y)$. All these forms of SMI can be proven to lie in $[0, 1]$, where 0 stands for independence between $X$ and $Y$, and 1 has a different meaning for each SMI. For $MI/H(X)$ (information gain ratio), 1 means that, given the value of $Y$ (outcome), the value of $X$ (feature) is determined. For $MI/H(Y)$, 1 means that, given the value of $X$, the value of $Y$ is determined. For $MI/H(X, Y)$, 1 means a one-to-one correspondence between $X$ and $Y$.
The goal of feature selection is to separate the predictive features from non-predictive features. In this regard, $MI/H(Y) = 1$ is most desirable because $MI/H(X) = 1$ does not indicate the predictability of $X$ and $MI/H(X, Y) = 1$ is too strong and unnecessary. Therefore, we select $\kappa$ in (3) as the SMI in CASMI.
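To make the difference between the three standardized forms concrete, the following is a minimal sketch that evaluates them on a known joint distribution. The function names (`entropy`, `smi_variants`) and the toy distribution are illustrative assumptions, not part of the proposed method; the quantities are population values computed from known probabilities, not estimates.

```python
from collections import defaultdict
from math import log


def entropy(probs):
    """Shannon entropy (in nats) of a collection of probabilities."""
    return -sum(p * log(p) for p in probs if p > 0)


def smi_variants(joint):
    """Return MI and three standardized forms for a joint pmf {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    hx = entropy(px.values())
    hy = entropy(py.values())
    hxy = entropy(joint.values())
    mi = hx + hy - hxy
    return {
        "MI": mi,
        "MI/H(X)": mi / hx,
        "MI/H(Y)": mi / hy,
        "MI/H(X,Y)": mi / hxy,
    }


# Hypothetical population: Y is a function of X (a, b -> 0; c -> 1),
# but X is not a function of Y.
toy_joint = {("a", 0): 0.25, ("b", 0): 0.25, ("c", 1): 0.50}
smi = smi_variants(toy_joint)
print(smi)
```

In this toy case $MI/H(Y)$ equals 1 because $X$ fully determines $Y$, while $MI/H(X)$ and $MI/H(X, Y)$ both equal 2/3, which matches the argument for using $MI/H(Y)$ as the SMI when the goal is predictability of the outcome.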
As we have discussed, detecting unhelpful associations under small samples is important in health data analytics, as involving unhelpful associations would bring too much noise or unnecessary dimensions to model-building or prediction. In other words, we would like to detect situation 2 and 3 features in a limited sample. The common characteristic among situation 2 and 3 features is that the information revealed by the limited sample covers little of the total information in the population. For this reason, we can use sample coverage $(1 - \pi_0)$, the concept introduced by Good, to detect these features. A feature with high predictability but low sample coverage must belong to either situation 2 or 3. In CASMI, we multiply the SMI by the sample coverage. Under this setting, although features from situations 2 and 3 have high SMI values, their CASMI scores would be low because of their low sample coverages; hence, these features would not be selected in a greedy selection. On the other hand, the CASMI score for a situation 1 feature would be high because both SMI and the sample coverage are high. As a result, by selecting features greedily, situation 1 features would be selected, while situation 2 and 3 features would be dropped.
The purpose of CASMI is to capture the association between a feature and the outcome, with a penalty term from the sample coverage, so that features under situations 2 and 3 would be eliminated. By selecting features under only situation 1, the issue of generalization under small samples is expected to be reduced. (See Section 3 for a discussion on feature redundancy (or feature interaction).) It may be interesting to note that the CASMI is an information-theoretic quantity that is related to both the population and the sample. It is neither a parameter nor a statistic, and it is only observable when both the population and the sample are known. Next, we introduce its estimation.

Estimation
To estimate $\kappa^*$ (CASMI), we need to estimate $\pi_0$ and $\kappa$. $\pi_0(X)$ can be estimated by Turing's formula [35]

$$T_1(X) = \frac{N_1(X)}{n}, \tag{4}$$

where $N_1(X)$ is the number of singletons in the sample. For example, if a sample of English letters consists of {a, a, a, b, c, c, d, e, e, f}, then the corresponding $N_1 = 3$ (b, d, and f are the three singletons). Discussions on the performance of estimating $\pi_0$ by $T_1$ can be found in [39] and [40]. In experimental categorical data, singletons could indicate that the sample size is small. As the sample size grows, the chance of obtaining a singleton in the sample approaches zero. It may be interesting to note that using (4) to estimate the sample coverage automatically separates ID-like features. This is because an ID-like feature is naturally all (or almost all) singletons and would result in a zero (or very small) estimated sample coverage that further leads to a zero (or very low) CASMI score; hence, such an ID-like feature would not be selected.

Estimating $\kappa(X, Y)$ is equivalent to estimating $MI(X, Y)$ and $H(Y)$. As we have discussed, thus far, all the existing entropic information-theoretic methods use the plug-in estimator of entropy ($\hat{H}$). However, the plug-in entropy estimator has a large bias, particularly when the sample size is small. [33] showed that the bias of $\hat{H}$ is

$$E(\hat{H}) - H = -\frac{K - 1}{2n} + \frac{1}{12 n^2}\left(1 - \sum_{k=1}^{K} \frac{1}{p_k}\right) + O\left(n^{-3}\right),$$

where $n$ is the sample size and $K$ is the cardinality of the space on which the probability distribution $\{p_k\}$ lives.
Based on the expression of the bias, it is easy to see that the plug-in estimator underestimates the real entropy, and the bias approaches 0 as $n$ (sample size) approaches infinity, at a rate of $n^{-1}$ (power decay). Because of the power decaying rate, the bias is not small when the sample size ($n$) is relatively low.
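As a small illustration of Turing's formula discussed above, the sketch below counts singletons in the letter sample from the text and in a hypothetical ID-like feature. The function name `turing_t1` and the `patient_...` labels are assumptions made for this example only.

```python
from collections import Counter


def turing_t1(sample):
    """Turing's formula: number of singletons divided by the sample size."""
    counts = Counter(sample)
    n1 = sum(1 for freq in counts.values() if freq == 1)
    return n1 / len(sample)


# The letter sample used in the text: b, d, and f are the singletons.
letters = list("aaabccdeef")
t1_letters = turing_t1(letters)
coverage_letters = 1 - t1_letters

# A hypothetical ID-like feature: every value occurs exactly once.
id_like = [f"patient_{i}" for i in range(10)]
t1_ids = turing_t1(id_like)
coverage_ids = 1 - t1_ids

print(t1_letters, coverage_letters)  # singleton share and estimated coverage
print(t1_ids, coverage_ids)          # all singletons, so the estimated coverage is zero
```

The ID-like feature receives an estimated coverage of zero, which is why multiplying by the estimated coverage removes such features from consideration.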
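To improve the estimation under a small sample, we adopt the following $\hat{H}_z$ [41] as the estimator of $H$:

$$\hat{H}_z = \sum_{v=1}^{n-1} \frac{1}{v} \left\{ \frac{n^{1+v}\,[n - (1+v)]!}{n!} \sum_{k=1}^{K} \left[ \hat{p}_k \prod_{j=0}^{v-1} \left( 1 - \hat{p}_k - \frac{j}{n} \right) \right] \right\}. \tag{5}$$

Compared to the power decaying bias of $\hat{H}$, $\hat{H}_z$ has an exponentially decaying bias

$$E(\hat{H}_z) - H = -\sum_{v=n}^{\infty} \frac{1}{v} \sum_{k=1}^{K} p_k (1 - p_k)^v = O\left( \frac{(1 - p_\wedge)^n}{n} \right),$$

where $p_\wedge = \min\{p_k > 0\}$.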
To help understand the differences between the power decaying bias and the exponentially decaying bias, we conduct a simulation. In the simulation, the real underlying distribution is $p_k = k/2001000$, where $k = 1, 2, \dots, 2000$ (i.e., a triangle distribution). Under this setting, the true entropy is $H = 7.408005$. To compare the two estimators, we independently generate 10,000 samples following the triangle distribution for each of the six sample size settings (i.e., we generate 60,000 random samples in total). The average values of $\hat{H}$ and $\hat{H}_z$ under different sample sizes are summarized in Table 1. The calculation shows that $\hat{H}$ consistently underestimates $H$ more than $\hat{H}_z$ does. The underestimation is more severe when the sample size is smaller. Therefore, from a theoretical perspective, we expect that adopting $\hat{H}_z$ in estimating the entropies in CASMI would provide a better estimation, particularly under small samples. Furthermore, we expect that CASMI would capture the associations among features and the outcome more accurately under small samples because of the improvement in estimation. Interested readers can find additional discussions on the comparison among more entropy estimators in [41], and on the comparison of mutual information estimators using $\hat{H}$ and $\hat{H}_z$ in [42].
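The comparison described above can be sketched on a much smaller scale. The code below is an illustration under stated assumptions, not the simulation behind Table 1: the sample size (100), the number of repetitions (200), and the seed are arbitrary choices made here, and `entropy_z` implements the estimator of [41] rewritten in terms of the observed counts $f_k$, using the algebraic identity $\frac{n^{1+v}[n-(1+v)]!}{n!}\,\hat{p}_k \prod_{j=0}^{v-1}(1 - \hat{p}_k - j/n) = \frac{f_k}{n}\prod_{j=0}^{v-1}\frac{n - f_k - j}{n - 1 - j}$.

```python
import random
from collections import Counter
from math import log


def plugin_entropy(sample):
    """Plug-in (maximum likelihood) entropy estimate in nats."""
    n = len(sample)
    return -sum((f / n) * log(f / n) for f in Counter(sample).values())


def entropy_z(sample):
    """Entropy estimator of [41], computed from observed counts."""
    n = len(sample)
    total = 0.0
    for f in Counter(sample).values():
        inner, prod = 0.0, 1.0
        # The product vanishes once v exceeds n - f, so the sum stops there.
        for v in range(1, n - f + 1):
            prod *= (n - f - v + 1) / (n - v)  # factor with j = v - 1
            inner += prod / v
        total += (f / n) * inner
    return total


# Triangle distribution from the text: p_k = k / 2001000, k = 1, ..., 2000.
K = 2000
weights = list(range(1, K + 1))
S = sum(weights)
true_h = -sum((k / S) * log(k / S) for k in weights)

rng = random.Random(0)   # fixed seed, an arbitrary choice for reproducibility
n, reps = 100, 200       # illustrative settings, not those of the original study
sum_plugin = sum_z = 0.0
for _ in range(reps):
    sample = rng.choices(range(1, K + 1), weights=weights, k=n)
    sum_plugin += plugin_entropy(sample)
    sum_z += entropy_z(sample)
mean_plugin, mean_z = sum_plugin / reps, sum_z / reps

print(f"true H = {true_h:.6f}")
print(f"mean plug-in = {mean_plugin:.4f}, mean H_z = {mean_z:.4f}")
```

With these settings both averages fall below the true entropy, and the average of $\hat{H}_z$ lies closer to it than the plug-in average, which is the qualitative pattern the text describes.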
Consequently, we let

$$MI_z(X, Y) = \hat{H}_z(X) + \hat{H}_z(Y) - \hat{H}_z(X, Y), \tag{6}$$

and we estimate $\kappa$ as

$$\kappa_z(X, Y) = \frac{MI_z(X, Y)}{\hat{H}_z(Y)}. \tag{7}$$

As a summary, we estimate $\kappa^*$ by the following estimator, which is the scoring function of the selection stage in the proposed method:

$$\hat{\kappa}^*(X, Y) = \kappa_z(X, Y) \cdot (1 - T_1(X)), \tag{8}$$

where $\kappa_z$ is defined in (7) and $T_1$ is defined in (4). $\hat{\kappa}^*$ adopts an entropy estimator with an exponentially decaying bias to improve the performance in estimating $\kappa^*$ and capturing the associations when the sample size is not sufficiently large. Furthermore, we expect that involving the sample coverage would separate and drop situation 2 and 3 features under small samples.
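Putting the pieces of this section together, a minimal sketch of the scoring function could look as follows. It assumes the estimator has the form described in the text (the standardized mutual information estimated with $\hat{H}_z$, multiplied by one minus Turing's formula); the names `entropy_z` and `casmi_score`, the zero-entropy guard, and the toy data are assumptions of this example.

```python
from collections import Counter


def entropy_z(values):
    """Entropy estimator of [41], written in terms of observed counts."""
    n = len(values)
    total = 0.0
    for f in Counter(values).values():
        inner, prod = 0.0, 1.0
        for v in range(1, n - f + 1):
            prod *= (n - f - v + 1) / (n - v)
            inner += prod / v
        total += (f / n) * inner
    return total


def casmi_score(x, y):
    """Estimated CASMI: kappa_z(X, Y) * (1 - T1(X))."""
    n = len(y)
    t1 = sum(1 for f in Counter(x).values() if f == 1) / n
    hy = entropy_z(y)
    if hy == 0:  # constant outcome: nothing to predict (guard added here)
        return 0.0
    mi_z = entropy_z(x) + hy - entropy_z(list(zip(x, y)))
    return (mi_z / hy) * (1 - t1)


# Toy data (hypothetical): a predictive feature with repeats and an ID-like feature.
y = [0, 0, 0, 0, 1, 1, 1, 1]
x_good = ["a", "a", "a", "a", "b", "b", "b", "b"]
x_id = list(range(8))

score_good = casmi_score(x_good, y)
score_id = casmi_score(x_id, y)
print(score_good, score_id)
```

Both toy features determine the outcome in the sample, but only the feature with repeated values keeps a high score; the ID-like feature is all singletons, so its estimated coverage and therefore its score are zero.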

CASMI Based Feature Selection Method
In this section, we introduce the proposed feature selection method in detail. The proposed method contains two stages. Before we present the two stages, let us first discuss data preprocessing.

Data preprocessing
To use the proposed method, all features and the outcome data must be preprocessed to categorical data. Continuous numerical data must be discretized, and there are numerous discretization methods [43]. While binning continuous features, the estimated sample coverage (one minus Turing's formula in (4)) should be checked to avoid over-discretization, which increases the risk of wrongly shifting a feature from situation 1 to situation 2.
If the data are already categorical, one may need to combine some of the categories to improve the sample coverage when necessary. When most observations of a feature are singletons, the coverage is close to 0, in which case it is difficult to draw any reliable and generalizable statistical inference. Therefore, for features that may carry real information but have low sample coverages (below 50%), it is suggested to regroup them to create repeats and improve coverages. Note that not all features are worth regrouping; for example, if a feature is the IDs of patients, regrouping should be avoided, as there is no reason to believe an ID can contribute to the outcome. The proposed method does not select features with low sample coverages; hence, ID-like features are eliminated automatically.
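The coverage check described above can be sketched as follows. Merging rare categories into a single catch-all level is only one possible regrouping strategy and is an assumption of this example (in practice, domain knowledge would usually guide which categories to combine); the names `estimated_coverage`, `merge_rare`, and the diagnosis labels are hypothetical.

```python
from collections import Counter


def estimated_coverage(values):
    """Estimated sample coverage: one minus the share of singletons."""
    n1 = sum(1 for f in Counter(values).values() if f == 1)
    return 1 - n1 / len(values)


def merge_rare(values, min_count=2, other="OTHER"):
    """Replace categories seen fewer than min_count times with one level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical diagnosis feature with many one-off categories.
diagnoses = ["flu"] * 3 + ["cold"] * 2 + ["dx_a", "dx_b", "dx_c", "dx_d", "dx_e", "dx_f"]

coverage_before = estimated_coverage(diagnoses)
coverage_after = estimated_coverage(merge_rare(diagnoses))
print(coverage_before, coverage_after)
```

Before regrouping, the estimated coverage is below the 50% guideline mentioned in the text; after the rare levels are merged, no singletons remain. The same check can be applied after binning a continuous feature to see whether the bins are too fine.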
When a feature contains missing (or invalid) data that cannot be recovered by the data collector, there are several possible remedies that do not require deleting the feature, such as deleting the observation, making an educated guess, predicting the missing values, and listing all missing values as NA. While it is the user's preference on how to handle the missing data, one should be advised that manipulating (guessing or predicting) the missing data could create (or enhance) false associations; therefore, one should be cautious. Assigning all the missing values as NA generally would not create false associations, but it may reduce the predictive information of the feature. The performance of each remedy could vary from situation to situation. Additional discussions on handling missing data can be found in [44], [45], and [46]. We suggest dealing with the missing data at the beginning of the data preprocessing.
The processed data should contain only categorical features and outcome(s).A feature with only integer values could be considered as categorical as long as the sample coverage is satisfactory.

Stage 1: Eliminate independent features
In this stage, we eliminate the features that are believed to be independent of the outcome based on a statistical test. This step filters out the features that are very unlikely to be useful; hence, the computation time for feature selection is reduced.
Suppose there are $p$ features, $X_1, X_2, \dots, X_p$, and one outcome, $Y$, in a dataset. Note that there could be multiple outcome attributes in a dataset. Because each outcome attribute has its own related features, when making a feature selection, we consider one outcome attribute at a time.
In finding independent features, we adopt a chi-squared test of independence using $MI_z$ as the statistic.
Theorem 1 [47]. Provided that $MI = 0$,

$$2n\,MI_z + (K_1 - 1)(K_2 - 1) \xrightarrow{L} \chi^2_{(K_1 - 1)(K_2 - 1)},$$

where $MI_z$ is defined in (6), and $K_1$ and $K_2$ are the effective cardinalities of the selected feature $X$ and the outcome $Y$, respectively (see footnote 2).

Compared to Pearson's chi-squared test of independence, testing independence using Theorem 1 has more statistical power, particularly when the sample size is small [47]. We test the hypothesis $H_0: MI(X, Y) = 0$ against $H_a: MI(X, Y) > 0$ between the outcome and each of the features. At a user-chosen level of significance ($\alpha$), any feature whose test decision fails to reject $H_0$ is eliminated at this stage. It is suggested to let $\alpha = 0.10$. A smaller $\alpha$ increases the chance of Type-II error (eliminating useful features); a larger $\alpha$ reduces the ability of the elimination, which results in a longer selection computation time in the next stage.
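The Stage 1 decision rule can be sketched as below. This is an illustration under an explicit assumption: the test statistic is taken to be $2n\,MI_z + (K_1 - 1)(K_2 - 1)$, compared against a chi-squared distribution with $(K_1 - 1)(K_2 - 1)$ degrees of freedom. The helper names (`chi2_sf`, `passes_stage1`) and the example values of $MI_z$ are hypothetical; $MI_z$ is treated as an already computed input. The upper-tail probability uses the standard closed-form series for integer degrees of freedom so that only the standard library is needed.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """P(chi-squared with integer df >= 1 exceeds x), via closed-form series."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = total = 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_r x^(r-1) / (1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2) * total


def passes_stage1(mi_z, n, k1, k2, alpha=0.10):
    """Keep a feature only if independence is rejected at level alpha."""
    df = (k1 - 1) * (k2 - 1)
    statistic = 2 * n * mi_z + df  # assumed form of the test statistic
    return chi2_sf(statistic, df) < alpha


# Hypothetical inputs: n = 100 observations, a 3-level feature, a binary outcome.
keep_weak = passes_stage1(mi_z=0.0, n=100, k1=3, k2=2)
keep_strong = passes_stage1(mi_z=0.05, n=100, k1=3, k2=2)
print(keep_weak, keep_strong)
```

With an estimated mutual information of zero the feature is eliminated, while the larger hypothetical value leads to rejection of independence and the feature moves on to Stage 2.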
Let $X_1, X_2, \dots, X_s$ denote the $s$ features (out of the $p$ features) that have passed the test of independence. The other $(p - s)$ features are eliminated at this stage. Note that $X_1, \dots, X_s$ are temporary notations for features. Namely, the $X_1$ in $\{X_1, \dots, X_p\} := \{X\}_p$ and the $X_1$ in $\{X_1, \dots, X_s\} := \{X\}_s$ are different if the $X_1$ in $\{X\}_p$ is eliminated in this stage. Note that we do not consider feature redundancy at Stage 1. Redundant features could all pass the test of independence as long as they appear to be relevant to the outcome based on sample data. Feature redundancy is considered at Stage 2.

Stage 2: Selection
In this stage, we make a greedy selection among the s remaining features from Stage 1.
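The selection algorithm is:
1. $X_{(1)} = \arg\max_{X_i \in \{X\}_s} \hat{\kappa}^*(X_i, Y)$.
2. $X_{(2)} = \arg\max_{X_i \in \{X\}_s \setminus \{X_{(1)}\}} \hat{\kappa}^*(X_{(1)} \times X_i, Y)$.
3. Continue in the same way: at step $k$, $X_{(k)}$ is the remaining feature $X_i$ that maximizes $\hat{\kappa}^*(X_{(1)} \times \cdots \times X_{(k-1)} \times X_i, Y)$.

Footnote 2: We write $\xrightarrow{L}$ to denote convergence in distribution.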
To clarify the notations, $\hat{\kappa}^*(X_{(1)} \times X_i, Y)$ stands for the estimated CASMI of the joint feature $X_{(1)} \times X_i$ to the outcome $Y$, and $\{X\}_s$ is the collection of the $s$ remaining features.
The proposed method handles feature redundancy by considering joint-distributions among features. Taking $X_{(1)}$ and $X_{(2)}$ as examples, the first step yields the feature $X_{(1)}$, which is the most relevant feature (measured by the estimated CASMI) to the outcome. In the second step, we join the selected $X_{(1)}$ with each of the remaining $(s - 1)$ features, and we evaluate the estimated CASMIs between each of the joint-features and the outcome. The joint-feature with the highest estimated CASMI is selected, and the feature joined with $X_{(1)}$ becomes $X_{(2)}$. It should be noted that $X_{(1)}$ and $X_{(2)}$ are neither necessarily independent nor necessarily the least dependent. Selecting $X_{(2)}$ only indicates that, given the information provided by $X_{(1)}$, $X_{(2)}$ provides the most additional information about the outcome among the remaining $(s - 1)$ features. In addition, CASMI is an information-theoretic quantity that does not use ordinal information of features; therefore, both linear and nonlinear redundancy are captured, evaluated, and considered.
The proposed algorithm stops when the term $\max[\hat{\kappa}^*(\cdot, Y)]$ starts to decrease. The features selected by the proposed method are $X_{(1)}, \dots, X_{(c)}$.
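The greedy procedure and its stopping rule can be sketched as follows. To keep the example short, the scoring function here is a stand-in (plug-in standardized mutual information times one minus the share of singletons) and is not the estimator of the proposed method; any scoring function with the same signature could be passed in. The function names, the strict-decrease reading of the stopping rule, and the toy data are assumptions of this example.

```python
from collections import Counter
from math import log


def plugin_entropy(values):
    n = len(values)
    return -sum((f / n) * log(f / n) for f in Counter(values).values())


def standin_score(x, y):
    """Stand-in score: plug-in MI/H(Y) times (1 - N1/n). Illustration only."""
    hy = plugin_entropy(y)
    if hy == 0:
        return 0.0
    mi = plugin_entropy(x) + hy - plugin_entropy(list(zip(x, y)))
    t1 = sum(1 for f in Counter(x).values() if f == 1) / len(y)
    return (mi / hy) * (1 - t1)


def greedy_select(features, y, score=standin_score):
    """Forward selection on joint features; stop when the best score decreases."""
    remaining = dict(features)
    joint = [()] * len(y)          # joint value of the features selected so far
    selected, best = [], float("-inf")
    while remaining:
        scored = {
            name: score([j + (v,) for j, v in zip(joint, col)], y)
            for name, col in remaining.items()
        }
        name = max(scored, key=scored.get)
        if scored[name] < best:    # the maximum has started to decrease
            break
        best = scored[name]
        selected.append(name)
        joint = [j + (v,) for j, v in zip(joint, remaining.pop(name))]
    return selected


# Toy data (hypothetical): the outcome is determined jointly by x1 and x2.
features = {
    "x1": [0, 0, 0, 0, 1, 1, 1, 1],
    "x2": [0, 0, 1, 1, 0, 0, 1, 1],
    "patient_id": list(range(8)),
}
y = [0, 0, 1, 1, 2, 2, 3, 3]

selected = greedy_select(features, y)
print(selected)
```

On this toy data the joint feature formed by x1 and x2 reaches the highest score, and adding the ID-like feature would turn every joint value into a singleton, so the maximum drops and the procedure stops with two features.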
In some situations, a researcher may want to select a desired number of features ($d$) that is different from $c$. [...]

Choice 2. Use any other user-preferred feature selection methods to select the 5 additional features.
Choice 2 could be complicated. If the user-preferred feature selection method has a ranking on the selected features, such as filter methods, then one can find the additional features by looking for the top 5 features other than the already-selected 10 features. If the user-preferred feature selection method does not have a ranking among the selected features, one can start by selecting 15 features using the preferred method, and then check if there are exactly 5 new features in the group compared to the 10 features selected by the proposed method. If the number of new features in the group is more than 5, then one needs to reduce the number of selected features, using the preferred method, until there are exactly 5 new features in the group, so that the 5 additional features can be determined.
After the two stages, the proposed method is completed. The performance of the proposed method is evaluated in the following section.

Simulations
In this section, we provide a simulation study to evaluate the performance of the proposed feature selection method. We first discuss the evaluation metric and then introduce the simulation setup and results.

Evaluation Metric
The proposed feature selection method selects only relevant features but does not provide an associated model or classifier. In evaluating such a feature selection method, there are two possible approaches [48]. The first approach is to embed a classifier and compare the accuracy of the classification process based on a real dataset. The results obtained with this approach are difficult to generalize, as they depend on the specific classifier used in the comparison. The second approach is based on a scenario defined by an initial set of features and a relation between these features and the outcome. Under this scenario, a feature selection method can be evaluated against the truth. Focusing on the evaluation of the selected features, we adopt the second approach to evaluate the proposed feature selection method based on the truth.

Under this approach, there are several strategies. One can calculate the percentage (success rate) of all relevant features that are selected. For example, let us consider an outcome $T$ that is relevant to three features $F_1$, $F_2$, and $F_3$, where $F_1$ contributes the most information (variability) of $T$, $F_2$ contributes the second most, and $F_3$ contributes the least. Also, there is an irrelevant feature $F_4$ in the dataset. Suppose there are four different selection results: [...] Evaluating their performances using the success rate would give the same result (33.3% or 1/3) for all of them, as they all identify one correct feature out of the three. The success rate is simple to calculate because the ground truth is known, and it works well when we focus on the number of correctly selected features or if we assume all the relevant features contribute evenly to the outcome. However, under the restriction of a limited sample, it may be more important to select the group of features that could jointly and efficiently provide the most information instead of selecting all relevant features regardless of the degrees of relevance and redundancy. Although ignoring features with low relevance or high redundancy may lose information, dropping them would further reduce the dimensionality and benefit the estimation. This can be considered as a trade-off between estimation (dimensionality) and information: the more information, the more difficult the estimation. When the estimation is overly difficult, the results could be biased and hardly generalizable.
Because the success rate does not take the degrees of relevance and redundancy into consideration, we introduce the following evaluation metric to measure the ratio of the relevant information from the joint of selected features to the total relevant information from the joint of all the relevant features using mutual information.
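$$\mathrm{IRR} = \frac{MI(X_{\mathrm{selected}}, Y)}{MI(X_{\mathrm{relevant}}, Y)},$$

where IRR denotes the Information Recovery Ratio, $X_{\mathrm{selected}}$ is the random variable that follows the joint-distribution of the selected features, and $X_{\mathrm{relevant}}$ is the random variable that follows the joint-distribution of all the features on which $Y$ depends.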
The IRR is not calculable in real datasets because (1) there is no knowledge of which features are relevant to the outcome, and (2) the true underlying distributions and associations (including redundancy) of the features and outcomes in real data are unknown. Given the setup of a simulation, we have all the knowledge; hence, the IRR for any group of selected features is calculable.
The IRR represents the percentage of relevant information in the joint of selected features. It considers feature redundancy by evaluating the mutual information between the joint-feature and the outcome. The range of the IRR is $[0, 1]$. If no relevant features are selected, the IRR is 0. If all the features in the dataset are selected regardless of relevance, the IRR is 1 for certain; therefore, when comparing the performance using the IRR, the number of selected features must be controlled. When the number of selected features is the same across methods, a larger IRR means the joint of the selected features contains more relevant information; hence, the method is more efficient in dimension reduction. The efficiency of a feature selection method is desirable, particularly under small samples.
In comparison, both the IRR and the success rate evaluate the performance of feature selection methods only when the ground truth is known. The success rate focuses on the ratio of the number of relevant features selected to the total number of relevant features, while the IRR focuses on the ratio of the relevant information in the joint of the selected features to the total relevant information.
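The contrast between the two metrics can be made concrete with a toy scenario in which the truth is known. The sketch below is not the simulation setup of this article: the three independent binary features, the model `y = 2 * X1 + X2`, and all function names are assumptions chosen so that the true joint distributions can be enumerated by hand.

```python
from collections import defaultdict
from itertools import product
from math import log


def mutual_information(joint):
    """MI in nats for a joint pmf given as {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * log(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)


def joint_with_outcome(marginals, model, keep):
    """True joint pmf of (kept features, Y), assuming independent features."""
    names = list(marginals)
    joint = defaultdict(float)
    for combo in product(*(marginals[name].items() for name in names)):
        values = {name: v for name, (v, _) in zip(names, combo)}
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[(tuple(values[k] for k in keep), model(values))] += prob
    return joint


def irr(marginals, model, selected, relevant):
    """Information Recovery Ratio: MI(selected, Y) / MI(relevant, Y)."""
    num = mutual_information(joint_with_outcome(marginals, model, selected))
    den = mutual_information(joint_with_outcome(marginals, model, relevant))
    return num / den


def success_rate(selected, relevant):
    return len(set(selected) & set(relevant)) / len(relevant)


# Toy truth: X1 and X2 are relevant, X3 is irrelevant.
fair_bit = {0: 0.5, 1: 0.5}
marginals = {"X1": dict(fair_bit), "X2": dict(fair_bit), "X3": dict(fair_bit)}
model = lambda v: 2 * v["X1"] + v["X2"]
relevant = ["X1", "X2"]

for selected in (["X1"], ["X1", "X3"], ["X1", "X2"], ["X3"]):
    print(selected, success_rate(selected, relevant), irr(marginals, model, selected, relevant))
```

Selecting X1 alone and selecting X1 together with the irrelevant X3 both recover half of the relevant information, which illustrates why the number of selected features has to be controlled when methods are compared by IRR.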

Simulation Setup
A good evaluation scenario must include a representative set of features, containing relevant, redundant, and irrelevant ones [48]. In the simulation, we generate ten X variables ($X_1, \dots, X_{10}$) and one outcome ($Y$). Among these variables, $X_1$, $X_2$, $X_3$, $X_4$ (or $X_6$), and $X_5$ are relevant features; $X_6$ (or $X_4$) is a redundant feature; $X_7$, $X_8$, $X_9$, and $X_{10}$ are irrelevant features. The detailed settings are as follows. [...]

Usually, a simulation setup should include varieties to reflect the challenges in real world data. Namely, it is often desirable to have complex feature spaces and complicated relationships among the features and the outcome. However, the above simulation setup is not complicated for the following reasons.
1. The purpose of this simulation is to evaluate the performance of the proposed method, particularly when the sample size is relatively small. The complexity of the feature spaces and the relationships among the features and the outcome would determine the threshold of what constitutes a sufficiently large sample. As they are not complex, we sample with smaller sizes to evaluate the performances in simulation. This is fair to all feature selection methods in comparison as they select features based on the same sample data with the same sample size.
2. The proposed feature selection method is one of the entropic methods. In the simulation, we compare the performance of the proposed method only to other entropic methods because of the domain-specific challenges discussed in Section 1. During the simulation, we assign numerical values to the X variables so that we can generate the value of the outcome Y based on a model. But entropic methods do not use the ordinal information in the numerical data, as their inputs are the frequencies of the different numbers. Therefore, involving a complicated model (linear or nonlinear) does not affect the entropic methods because they regard the numbers as labels without ordinal information. However, complicating the model could make the outcome variable Y more complex and result in a higher threshold of a sufficiently large sample, which does not affect the comparison and evaluation among different methods, as discussed previously.
3. In calculating the IRR, we need the two joint-distributions, $X_{\text{selected}} \times Y$ and $X_{\text{relevant}} \times Y$. To obtain the true joint-distributions, we have to enumerate the combinations among all possible values of the selected relevant features and of all the relevant features with their probabilities, respectively. Complicating the relevant X variables would make the calculation of the joint-distributions unnecessarily complex.
Note that the major benefit of a simple simulation setup is the ease in calculating the true joint-distributions, which are components of the IRR. In real-world data, we do not need such calculations as the true joint-distributions and the IRR are not calculable. Hence, when applying the proposed method to real-world high-dimensional and complex data, the main calculation is just the estimated CASMI, which is not computationally demanding.
With the simulation setup, one can consider that we create a dataset for evaluation. In this case, we know the ground truth that the features $X_1$, $X_2$, $X_3$, $X_4$ (or $X_6$), and $X_5$ should be selected. We evaluate the performances by calculating the IRRs for features selected by different methods.
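The second reason above, that entropic methods treat numerical values as labels, can be checked with a short sketch. It uses a generic plug-in mutual information estimate on hypothetical toy data, not the estimator or the simulation of this article, and shows that replacing the numerical values of a feature with arbitrary labels leaves the estimate unchanged.

```python
import random
from collections import Counter
from math import log2


def plugin_mi(xs, ys):
    """Plug-in estimate of I(X; Y) in bits, computed from value frequencies only."""
    n = len(xs)
    cxy, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * log2(c * n / (cx[x] * cy[y])) for (x, y), c in cxy.items())


# Hypothetical toy sample: a numerical feature and an outcome generated from it.
rng = random.Random(0)
xs = [rng.randint(1, 4) for _ in range(500)]
ys = [(x * x) % 3 + rng.randint(0, 1) for x in xs]

# Replace the numbers with arbitrary labels that destroy the ordering.
labels = {1: "c", 2: "a", 3: "d", 4: "b"}
xs_relabelled = [labels[x] for x in xs]

mi_numeric = plugin_mi(xs, ys)
mi_labelled = plugin_mi(xs_relabelled, ys)
print(mi_numeric, mi_labelled)
```

Because the estimate depends only on how often each value and each pair of values occurs, any one-to-one relabelling gives the same result, regardless of how the outcome was generated from the numbers.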

Simulation Results
In the simulation, we compare the IRR of the proposed feature selection method to the IRRs of six widely cited entropic feature selection methods: MIM, JMI, CMIM, MRMR, DISR, and NJMIM.
These six entropic methods all require users to set the number of features to be selected, while the proposed method can automatically decide the most appropriate number of features based on data. As we must control the number of selected features to validate the comparison of IRRs, we use the number of selected features from the proposed method as the number of features to be selected in the six entropic methods in each iteration. It should be noted that we are not claiming the number of features determined by the proposed method is correct. We set them to be the same only for the purpose of validating the comparison. As a matter of fact, the relevant features would not be entirely selected until the sample size is sufficiently large, and the threshold of a sufficiently large sample varies from method to method.
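As an illustration of what a user-specified number of features means in practice, the following is a minimal sketch of a MIM-style selector, assuming the usual description of MIM as ranking features by their individual plug-in mutual information with the outcome and keeping the top k. The function names, the toy data, and the placeholder `k_from_casmi` are hypothetical; this is not the implementation used in the simulation.

```python
from collections import Counter
from math import log2


def plugin_mi(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired samples."""
    n = len(xs)
    cxy, cx, cy = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * log2(c * n / (cx[x] * cy[y])) for (x, y), c in cxy.items())


def mim_select(features, y, k):
    """Return the indices of the k features with the largest plug-in MI with y."""
    scores = [plugin_mi(col, y) for col in features]
    ranked = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: f0 determines y, f1 is partly informative, f2 is constant.
y = [0, 0, 1, 1, 0, 0, 1, 1]
f0 = ["a", "a", "b", "b", "a", "a", "b", "b"]
f1 = [0, 1, 1, 1, 0, 0, 0, 1]
f2 = ["z"] * 8

# Placeholder for the number of features chosen by the proposed method in one
# iteration; here it is simply fixed by hand.
k_from_casmi = 1
selected = mim_select([f0, f1, f2], y, k_from_casmi)
print(selected)
```

The selector cannot run without k, which is why the comparison fixes k to the number returned by the proposed method in each iteration.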
For each sample size $N \in \{50, 100, 150, \ldots, 2750, 2800\}$, we re-generate the entire dataset 10,000 times and calculate the average IRRs of each method. The average IRR results are plotted in Figure 1.
Based on the results, we can see that the average IRR of the proposed method is consistently higher than or equivalent to those of all the other methods. This is because, under the restriction of a limited sample, the proposed method has a much smaller estimation bias, so it captures the associations among features and the outcome more accurately than the existing methods that estimate with the plug-in estimators.
Meanwhile, we recorded the average computation time of the proposed method when implementing feature selection in R. The plot of results is shown in Figure 2. The computation time when N = 50 was 0.03 seconds; the time when N = 2800 was 1.97 seconds; the longest time during the simulation was 3.37 seconds.

Based on the simulation result in Figure 1, different methods achieve 1 (in average IRRs) at different sample sizes. One should realize that the threshold of a sufficiently large sample greatly depends on the probability spaces of the underlying associated features and the outcome. The probability spaces of real datasets are generally significantly more complicated than those of the simulation. Consequently, in reality, particularly in health data, the majority of samples should be considered small; hence, the efficiency of a feature selection method is very important.
The simulation code is available at [49]. The proposed feature selection method using CASMI is implemented in the R package at [50].

Discussion
In this article, we have proposed a new entropic feature selection method based on CASMI. Compared to existing methods, the proposed method has two unique advantages: 1) it is very efficient as the joint of selected features provides the most relevant information compared to features selected by other methods, particularly when the sample size is relatively small, and 2) it automatically learns the number of features to be selected from data. The proposed method handles feature redundancy from the perspective of joint-distributions. Although we initially developed the proposed method for the domain-specific challenges in healthcare, the proposed method can be used in many other areas where there is an issue of limited sample.
The proposed method is an entropic information-theoretic method. It aims at assisting data analytics on non-ordinal spaces. However, the proposed method can also be used on numerical data with an appropriate binning technique. Furthermore, using the proposed method on binned numerical data could discover different information as the entropic method looks at the data from a non-ordinal perspective.
In detecting unhelpful associations (situation 2 and 3 features), we implement an adjustment from the sample coverage. The level of this adjustment can be modified by users. For example, users can replace the scoring function of the proposed method with CASMI*, which carries a tuning parameter $u$, and estimate it by $\kappa^*(X, Y) = \kappa_z(X, Y) \cdot (1 - T_1(X))^u$, where $u$ is any fixed positive number. The parameter $u$ determines the requirement for a feature to qualify for situation 1. A larger $u$ stands for a heavier penalty from the sample coverage; hence, a feature needs to contain more real information to be categorized as situation 1. A smaller $u$ stands for a lighter penalty from the sample coverage; hence, a feature with less real information could be categorized as situation 1. However, users should be cautious when using a small $u$ because it may mistakenly classify an irrelevant feature (situation 3) as situation 1, and further exacerbate the issue of generalization. We suggest beginning the proposed feature selection method with $u = 1$. After completing feature selection, if a user desires to select more or fewer features, the user could re-run the proposed method with a smaller or larger $u$, respectively, and keep modifying the value of $u$ until the result is satisfactory.
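The effect of the tuning parameter can be seen in a small numerical sketch. It assumes that $T_1(X)$ is Turing's formula, that is, the proportion of observations whose value appears exactly once in the sample, so that $1 - T_1(X)$ estimates the sample coverage. The value of $\kappa_z$ is taken as a given number rather than estimated, and the function names and data are hypothetical.

```python
from collections import Counter


def turing_t1(xs):
    """Assumed T_1: number of values seen exactly once, divided by the sample size."""
    singletons = sum(1 for count in Counter(xs).values() if count == 1)
    return singletons / len(xs)


def casmi_star(kappa_z, xs, u=1.0):
    """Coverage-adjusted score kappa_z * (1 - T_1(X))**u for a fixed positive u."""
    if u <= 0:
        raise ValueError("u must be a fixed positive number")
    return kappa_z * (1.0 - turing_t1(xs)) ** u


# Hypothetical feature sample of size 10 in which "d" and "e" are singletons,
# so T_1 = 2/10 under the assumption above.
xs = ["a", "a", "a", "b", "b", "b", "c", "c", "d", "e"]
kappa_z = 0.5  # hypothetical estimate supplied from elsewhere

for u in (0.5, 1.0, 2.0):
    print(u, casmi_star(kappa_z, xs, u))
```

With the same data and the same $\kappa_z$, the score shrinks as $u$ grows, which matches the description of a larger $u$ as a heavier penalty from the sample coverage.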
The proposed method only selects features but does not provide a classifier; however, to draw inferences on outcomes, a classifier is needed. To this end, additional techniques are required, such as machine learning (e.g., regressions and random forest). In the future, it may be interesting to explore 1) methods that can distinguish features under situations 2 and 3 when the sample size is small; and 2) the possibilities of extending the proposed method to tree-based algorithms (e.g., random forest) to help determine which leaves and branches should be omitted. In addition, it may be interesting to investigate the performance of existing entropic methods if we use $\hat{H}_z$, instead of $\hat{H}$, to estimate the entropies in their score functions.
For example, let $c = 10$, $d_1 = 6$, and $d_2 = 15$. When $c = 10$ and $d_1 = 6$, because $6 \le 10$, we can stop the algorithm at time 6. When $c = 10$ and $d_2 = 15$, because $15 > 10$, the user needs to select 5 additional features. We propose two choices on how to select the additional features. Choice 1. Keep running the proposed algorithm until time 15.

Figure 1: The average IRRs for seven methods, where CASMI refers to the proposed method. The proposed method is the most efficient method when sample size is limited. In the simulation, the threshold of a sufficiently large sample for the proposed method is approximately N = 1500, which is the smallest among all methods. The vertical index is the IRR, not the success rate. An IRR of 0.8 means 80% of the total mutual information has been accounted for by the selected features. It does not mean 80% of relevant features are selected. The proposed method does not select all relevant features when the sample size is small because some relevant features are in situation 2 under a limited sample; hence, they are not selected. As the sample size grows, all situation 2 features eventually become situation 1 features.

Figure 2: The average computation time of the proposed method when implementing feature selection in R.
Table 2 presents the 95% confidence intervals for IRRs based on features selected by different methods under different sample sizes. Based on the table, we can roughly rank the proposed method and the six methods as follows: CASMI