Filter Variable Selection Algorithm Using Risk Ratios for Dimensionality Reduction of Healthcare Data for Classification

This research developed and tested a filter algorithm that reduces the feature space of healthcare datasets. The algorithm binarizes the dataset, separately evaluates the risk ratio of each predictor against the response, and outputs ratios that quantify the association between each predictor and the class attribute. The magnitude of the association translates to the importance rank of the corresponding predictor in determining the outcome. Using Random Forest and Logistic regression classification, the performance of the developed algorithm was compared against the regsubsets and varImp functions, which are unsupervised methods of variable selection. Equally, the proposed algorithm was compared with the supervised Fisher score and Pearson's correlation feature selection methods. Several datasets were used for the experiment and, in the majority of cases, the predictors selected by the new algorithm outperformed those selected by the existing algorithms. The proposed filter algorithm is therefore a reliable alternative for variable ranking in data mining classification tasks with a dichotomous response.


Introduction
The focus of this article is to design and test a filter algorithm that uses risk ratios (RR), otherwise known as relative risk, to rank the importance of predictor variables in data mining classification problems, with special attention to healthcare data. Variable importance ranking is the process that assigns numeric values, or some other form of quantifiers, to individual predictors in a dataset, indicating the level of their importance in predicting the outcome. After such a ranking has been established, variables that rank low can be expunged from a predictive model without compromising goodness of fit or predictive accuracy. Variable selection is necessary in the era of big data where voluminous data is generated from healthcare activities, including diagnosis, epidemiology analysis, and patient medical history. These data often consist of many attributes, some of which are not needed in data mining classification, and thus the need to select only the relevant ones is imperative.
Consider a dataset D = {(X_1, Y_1), . . . , (X_m, Y_m)} consisting of m observations, where X = (X_1, . . . , X_n) is the vector of predictor variables with X ∈ R^n, and Y ∈ C, where C is a set of class labels. In data mining, classification is defined as a mapping of the form t : R^n → C, where t is a classifier [1]. One of the ways of measuring the performance of a classifier is by evaluating its classification accuracy, that is, how accurately it can predict the classes of a set of vectors whose classes are unknown. The predictive accuracy of classification models is enhanced by the choice of variables used for model construction.

Literature Review
This section covers the review of existing variable selection techniques relevant to this research. The materials to be used in the design of the proposed algorithm and for testing its effectiveness are also presented in this section.

Filter Methods of Variable Selection
According to [12], filter methods of variable selection use statistical techniques to evaluate the correlation of each predictor with the outcome variable. Filtering attempts to find a subset of attributes that are highly correlated with the class variable but at the same time exhibit low inter-correlation among themselves [13]. This process of variable selection does not work in conjunction with any classification algorithm; only after the best attributes have been filtered out is a machine learning algorithm deployed to perform classification on them. Two commonly used filter methods are the Fisher score and Pearson's correlation. Both achieve variable selection in a supervised manner.
Fisher Score. For a dataset with a binary class (0/1), the Fisher score (F_i) evaluates the importance of the i-th predictor using Equation (1):

F_i = (X̄_1(i) − X̄_0(i))² / (d_1(i)² + d_0(i)²),    (1)

where X̄_1(i) and d_1(i) are the mean and standard deviation of the i-th predictor in class 1, respectively, while X̄_0(i) and d_0(i) are the same parameters in class 0 [14]. Higher Fisher scores indicate that a variable is strongly associated with the outcome variable.
Pearson's Correlation. One of the correlation-based filter techniques is the Pearson correlation, given in Equation (2) [15]:

R(i) = Cov(X_i, Y) / sqrt(Var(X_i) · Var(Y)),    (2)
where X i is the i-th feature, Y is the class label, and Cov() and Var() are the covariance and variance, respectively. This measure ranks predictor variables according to their individual linear dependence with the output variable.
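Both measures can be sketched in a few lines of Python; this is an illustrative implementation (using population variance), not code from the cited works:

```python
import math

def fisher_score(x, y):
    """Fisher score of one predictor x against binary labels y (0/1):
    (mean1 - mean0)^2 / (var1 + var0), computed per class."""
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((vi - m) ** 2 for vi in v) / len(v)
    return (mean(x1) - mean(x0)) ** 2 / (var(x1) + var(x0))

def pearson(x, y):
    """Pearson correlation: Cov(x, y) / sqrt(Var(x) * Var(y))."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    vx = sum((xi - mx) ** 2 for xi in x) / len(x)
    vy = sum((yi - my) ** 2 for yi in y) / len(y)
    return cov / math.sqrt(vx * vy)
```

Ranking a feature set with either measure amounts to computing the score for each column and sorting in descending order of absolute value.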

Unsupervised Variable Selection
Automatic Variable Selection. The R programming language, developed by the R Core Team [16], has a package called leaps that provides a function, regsubsets, for automatic selection of the best variables [17]. The function can be used to achieve variable selection in any of three ways: by specifying the maximum number of best variables to return, by forward selection, or by backward elimination. To select the best subset of a particular size, the number of desired variables is specified in the nvmax argument.
Another alternative is to use the regsubsets function to perform forward selection or backward elimination when selecting subsets according to their importance. When executed, the function selects and returns the most important variables for modeling. Apart from selecting the best subsets, regsubsets ranks the selected variables according to importance by printing one or more asterisks against each variable; the number of asterisks assigned to a variable is proportional to its importance.
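The forward-selection strategy described above can be sketched generically as follows; this is an illustrative Python sketch, not the leaps implementation, and the `score` callback and function name are our assumptions:

```python
def forward_select(features, score, max_vars):
    """Greedy forward selection: at each step add the feature that most
    improves the model score; stop at max_vars (cf. regsubsets' nvmax)
    or when no remaining feature improves the score."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < max_vars:
        best = max(remaining, key=lambda f: score(selected + [f]))
        # stop early if the best addition does not improve the score
        if selected and score(selected + [best]) <= score(selected):
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice `score` would be a model-quality criterion such as adjusted R², BIC, or cross-validated accuracy evaluated on the candidate subset.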
Variable Importance Measure. The R language provides another package, known as caret, for Classification And REgression Training [18]. One of the functions within the caret package is varImp (variable importance), which implements variable importance ranking for different machine learning algorithms, such as Logistic regression and Random Forest. To evaluate the importance of variables in a Random Forest model using varImp, the importance of a predictor variable X_j, j = 1, . . . , n is calculated on the out-of-bag (OOB) sample of each tree, that is, the data not used in constructing that tree. Initially, the predictive accuracy on the OOB sample is evaluated. Then, the values of X_j in the OOB sample are permuted, keeping all other predictor variables unchanged. The predictive accuracy on the permuted data is measured as well, and the mean decrease in predictive accuracy across all trees is reported. In this way, the importance of a variable in predicting the response is quantified by how much permuting that variable decreases accuracy [18][19][20]. This difference is referred to as the Mean Decrease Accuracy (MDA) and is computed by the formula shown in Equation (3) [21,22].
MDA(X_j) = (1/n) Σ_{t=1}^{n} [ Σ_{i∈OOB} I(y_i = a(X_i)) − Σ_{i∈OOB} I(y_i = a(X_i^{π_j})) ] / |OOB|,    (3)

where n is the total number of trees and t indexes a particular tree, t = 1, . . . , n. In Equation (3), a(X_i) is the predicted class for OOB instance X_i before permuting X_j, a(X_i^{π_j}) is the predicted class for OOB instance X_i after permuting X_j, I(·) is the indicator function, and |OOB| is the number of data samples not used in tree construction.
In the case of Logistic regression models, the varImp function evaluates the importance of a predictor variable using the absolute value of the t-statistic for that predictor.
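The MDA idea can be sketched for a single fitted model as follows. This is an illustrative Python sketch, not the caret implementation: in varImp the accuracy drop is computed per tree on that tree's OOB sample and then averaged, whereas here a single `predict` function stands in for the model:

```python
import random

def permutation_importance(predict, X, y, col, seed=0):
    """Mean-decrease-accuracy style importance of column `col`:
    accuracy on (X, y) minus accuracy after shuffling that column,
    with all other columns left unchanged."""
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    base = accuracy(X)
    rng = random.Random(seed)
    shuffled_col = [row[col] for row in X]
    rng.shuffle(shuffled_col)
    # rebuild rows with only column `col` permuted
    X_perm = [row[:col] + [v] + row[col + 1:]
              for row, v in zip(X, shuffled_col)]
    return base - accuracy(X_perm)
```

A column the model ignores yields an importance of exactly zero, since permuting it cannot change any prediction.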

Relative Risk (RR)
The formal definition of relative risk used in the algorithm design is as follows. For a binary independent variable X and a binary dependent variable Y, let t_11 = total data points where X = 1 and Y = 1, t_10 = total data points where X = 1 and Y = 0, t_01 = total data points where X = 0 and Y = 1, and t_00 = total data points where X = 0 and Y = 0. Then, the relative risk is given by

RR = [t_11/(t_11 + t_10)] / [t_01/(t_01 + t_00)] = (t_11/(t_11 + t_10)) · ((t_01 + t_00)/t_01).    (4)
The relative risk in Equation (4) is represented in tabular form as shown in Table 1 [4,5,23]. Table 1. Tabular definition of relative risk (RR).
The independent variable, X, is referred to as the exposure, with 0 and 1 denoting the unexposed and exposed groups, respectively. The dependent variable, Y, is referred to as the incidence or risk of an event among the various exposure groups, with 0 and 1 representing event failure and success, respectively [4]. Relative risk measures the ratio of the incidence of an event among data points in the exposed group to the incidence of the same event in the unexposed group [5,23]. Exposure in this context can be any criterion of measurement by which data is generated. RR values range from 0 to infinity, where RR = 1 signifies that no association exists between X and Y, RR < 1 indicates a negative association, and RR > 1 indicates a positive association [23][24][25].
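Equation (4) can be computed directly from the four cell counts of the 2×2 table; a minimal illustrative sketch:

```python
def relative_risk(t11, t10, t01, t00):
    """RR = [t11/(t11+t10)] / [t01/(t01+t00)]: incidence of the event
    among the exposed (X=1) over incidence among the unexposed (X=0)."""
    risk_exposed = t11 / (t11 + t10)
    risk_unexposed = t01 / (t01 + t00)
    return risk_exposed / risk_unexposed

# e.g. 30 of 100 exposed vs. 10 of 100 unexposed experience the event
rr = relative_risk(30, 70, 10, 90)  # ≈ 3.0: positive association
```

An RR of about 3 here says the event is roughly three times as likely among the exposed as among the unexposed, matching the interpretation given above.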

Classification Tools
Logistic regression is a modeling tool used to examine the association between a categorical dependent variable and one or more independent variables of a set of observations [26]. This regression type is anchored on the logistic function, whose values lie between 0 and 1 and correspond to class probabilities, and takes the form

ln(P/(1 − P)) = b_0 + b_1X_1 + . . . + b_nX_n,

where P is the probability of success, P/(1 − P) is the odds, b_0 is the intercept, b_1, . . . , b_n are parameter estimates, and X_1, . . . , X_n are data values corresponding to each independent variable [27,28].
Meanwhile, Random Forest is a machine learning tool that combines many tree predictors h(X, v_k), k = 1, . . . , n, where X is an input vector and {v_k} are independent random vectors with the same distribution across all trees in the forest [1,29]. To determine the class of an input vector X, each tree casts a single vote, and the class with the most votes is selected [1].
Predictive Accuracy. The predictive accuracy and the balanced classification accuracy (BCA) are defined by

Accuracy = (T+ + T−) / (T+ + T− + F+ + F−)    (5)

and

BCA = (1/2) · (T+/(T+ + F−) + T−/(T− + F+)),    (6)

respectively, where T+ is the number of correctly classified observations in class 1, T− is the number of correctly classified observations in class 0, F+ is the number of observations in class 0 wrongly classified into class 1, and F− is the number of observations in class 1 wrongly classified into class 0 [30].
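Assuming the standard definitions of the four counts given above, the two measures can be computed as:

```python
def accuracy(tp, tn, fp, fn):
    """Overall accuracy: correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def balanced_accuracy(tp, tn, fp, fn):
    """BCA: mean of sensitivity T+/(T+ + F-) and specificity T-/(T- + F+),
    which weights both classes equally regardless of class imbalance."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```

On an imbalanced split such as 100 observations in class 1 against 50 in class 0, the two measures can diverge sharply, which is why BCA is preferred for imbalanced data.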

Experimental Datasets
A number of datasets, mostly from the healthcare domain, were deployed to demonstrate the effectiveness of the proposed algorithm in feature selection. Since the proposed algorithm places emphasis on a dichotomous response, each experimental dataset has a binary outcome. The datasets are listed below:
• Psychological Capital (PsyCap). This dataset carries psychological capital (PsyCap) information on workers in the hospitality industry. Psychological capital measures the capabilities of an individual that enable them to excel in the work environment [31]. Each worker's PsyCap was assessed on the four components of psychological capital (hope, efficacy, resilience, and optimism) using the questionnaire presented in [32]. The workers willingly completed and returned the questionnaires, and there was no requirement for prior ethics approval. The dataset has a binary class variable, where 0 and 1 represent low and high PsyCap, respectively.
• Diabetes in Pima Indian Women (Diabetes). The dataset consists of 332 observations of diabetes test results of Pima Indian women aged 21 years and above, residing in Arizona. This dataset, accessible through the R "MASS" package reported in [33], is named Pima.te within the package and was originally sourced from [34]. The dataset has a binary response variable named "type", where 0 and 1 signify non-diabetic and diabetic, respectively.
• Survival from Malignant Melanoma (Melanoma). This dataset, available in the R package "boot", records information on the survival of patients with malignant melanoma [35]. The patients had surgery at the Department of Plastic Surgery of the University Hospital, Odense, Denmark, between 1962 and 1977. Several measurements were taken and reported as predictor variables, with a binary class "ulcer", where 1 indicates an ulcerated tumor and 0 a non-ulcerated one.
• Spam E-mail Data (Spam). The dataset consists of e-mail items with measurements relating to the total length of words written in capital letters, the number of times the "$" and "!" symbols occur within the e-mail, etc., and a binary class variable, "yes", with 1 classifying an e-mail as spam and 0 otherwise. The dataset, titled spam7, can be accessed in the R package "DAAG" [36].
• Biopsy Data of Breast Cancer Patients (Cancer). Named biopsy in the R package "MASS" [33], the dataset records biopsies of breast tumors on a number of patients. The dataset was obtained from the University of Wisconsin Hospital, Madison, with a known binary outcome named "class", where 0 = benign and 1 = malignant.
Some characteristics of the experimental datasets are presented in Table 2.

Design of Proposed Algorithm
In this section, we consider X_j, j = 1, . . . , n as a set of predictors in a high-dimensional space R^n. In many cases, especially with big data, some of these predictors are irrelevant or duplicative and thus not needed in machine learning tasks [37]. Usually, the objective is to reduce the number of predictors from n to k, where k < n, such that the retained k predictors are the most relevant explanatory variables needed in classification. This is the objective the proposed algorithm seeks to achieve.
Let RawData denote a dataset consisting of m observations, n predictors, and an outcome variable y. Let RawData[i, j] denote the data point at row i, column j, where i = 1, . . . , m and j = 1, . . . , n.
Preprocessing. The algorithm requires a dataset normalized to the interval [0,1], also referred to as min-max normalization [38,39]. The proposed algorithm, presented in Appendix A, takes the following steps:
• The first step, presented in Listing 1, is to binarize the dataset. It is a requirement that both independent and dependent variables carry only binary values for the risk ratio measure to be deployed. On purpose, we did not design the algorithm to print the binary dataset; this guards against users inadvertently using the binary dataset for model construction. The binary data is only used for RR computation, after which classification models are fit on the original dataset.
• In the second step, listed in Listing 2, the algorithm counts, for each predictor X and the class Y, all occurrences where (X, Y) = (1, 1), (1, 0), (0, 1), and (0, 0). As in step 1, these computations are kept behind the scenes, without printing any output visible to the user.
• The third step, listed in Listing 3, applies the risk ratio formula of Equation (4) to the values computed in Listing 2 to produce the variable importance rankings. The algorithm outputs the importance rankings in the order the predictors appear in the dataset. For a better view of the results, the user may arrange the output in ascending or descending order. It is left to the judgment of the modeler to determine the cutoff point for which variables to include in a model.
• The statement on line 44 of the algorithm in Appendix A outputs the names of the predictor variables and their RR values, separated by a tab, each on a separate line. Each RR value constitutes the importance rank of the corresponding predictor, signifying the extent to which it is associated with the class.
The processes involved in feature ranking by the proposed algorithm are shown in the pseudocode below:

1. START
2. Convert the dataset to binary, that is, round all values < 0.5 to 0 and >= 0.5 to 1
3. FOR each input/output pair, DO the following:
4.   IF INPUT is 1
5.   AND OUTPUT is 1 THEN
6.     Count t_11, that is, δ_j = δ_j + 1
7.   ELSE
8.     Count t_10, that is, β_j = β_j + 1
9.   END IF
10.  IF INPUT is 0
11.  AND OUTPUT is 1 THEN
12.    Count t_01, that is, φ_j = φ_j + 1
13.  ELSE
14.    Count t_00, that is, ϕ_j = ϕ_j + 1
15.  END IF
16. NEXT input/output
17. When all input/output pairs are exhausted, compute the following:
18. FOR each variable j = 1 to n
19.   VIM_j = [δ_j/(δ_j + β_j)] / [φ_j/(φ_j + ϕ_j)]
20.   PRINT columnName_j, a space, and VIM_j
21. NEXT variable
22. STOP

A higher value of VIM_j for a predictor signifies a strong association with the class and, consequently, its importance in classification. The algorithm is summarized in Equation (7):

VIM_j = [δ_j/(δ_j + β_j)] / [φ_j/(φ_j + ϕ_j)],    (7)

where VIM_j is the importance ranking of the j-th predictor, j = 1, . . . , n, δ_j is the total number of observations with input = 1 and output = 1, β_j is the total number of observations with input = 1 and output = 0, φ_j is the total number of observations with input = 0 and output = 1, and ϕ_j is the total number of observations with input = 0 and output = 0.
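The steps above (binarize, tally the four counts per predictor, compute the RR) can be sketched as a small, illustrative Python function; the function name and the division-by-zero guards are our assumptions, not part of the original pseudocode:

```python
def rr_rank(data, labels):
    """Rank predictors by relative risk: binarize the min-max normalized
    rows, tally the 2x2 counts per predictor against the binary labels,
    then compute VIM_j = [t11/(t11+t10)] / [t01/(t01+t00)]."""
    n = len(data[0])
    # step 1: round values < 0.5 to 0 and >= 0.5 to 1
    binary = [[1 if v >= 0.5 else 0 for v in row] for row in data]
    ranks = {}
    for j in range(n):
        # step 2: count the four (input, output) combinations
        t11 = t10 = t01 = t00 = 0
        for row, y in zip(binary, labels):
            if row[j] == 1 and y == 1:
                t11 += 1
            elif row[j] == 1:
                t10 += 1
            elif y == 1:
                t01 += 1
            else:
                t00 += 1
        # step 3: apply the risk ratio of Equation (4), guarding
        # against empty cells on degenerate columns
        risk_exposed = t11 / (t11 + t10) if (t11 + t10) else 0.0
        risk_unexposed = t01 / (t01 + t00) if (t01 + t00) else 0.0
        ranks[j] = (risk_exposed / risk_unexposed
                    if risk_unexposed else float("inf"))
    return ranks
```

As in the pseudocode, the binary matrix and the counts are internal; only the per-predictor VIM values are returned, and models would then be fit on the original (non-binarized) data.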

Experiment and Results
Execution of the Proposed Algorithm on the Datasets. The proposed algorithm was executed on all the datasets in order to rank the variables according to importance. The existing varImp function and the regsubsets methods (nvmax, forward, backward) were also deployed to rank the variables. Equally, the Fisher score and Pearson's correlation were deployed. This was done in order to compare the effectiveness of the proposed algorithm against existing methods of variable selection.
Two machine learning algorithms, namely Logistic Regression and Random Forest, were used in the experiment for model construction, evaluation of goodness of fit, and predictive accuracy. Samples of variable ranking results on the Diabetes and Melanoma datasets are shown in Table 3. Performance evaluation of the proposed algorithm in comparison with existing algorithms was done in two steps. First, the goodness of fit of models developed using variables selected by the new algorithm and existing ones was examined, and secondly, the predictive accuracy evaluation was carried out.
Goodness of Fit Evaluation. The goodness of fit test was assessed on two metrics: deviance and Mean Squared Error (MSE). In Logistic regression models, two deviance types are reported: null deviance and residual deviance [40]. The residual deviance is calculated cumulatively as predictors are added to the model. The difference between the final residual deviance and the null deviance explains the goodness of fit of a model. When comparing two models, the model with the smallest deviance is said to have better fit. The MSE is a parameter-free measure that gives information on the difference between actual and predicted values [41]. Lower values of MSE for a model indicate better fit. A sample result of the goodness of fit test of the various models is presented in Table 4. The goodness of fit results presented in Table 4 show that the subsets selected by the proposed algorithm competed favorably with those selected by the existing varImp algorithm.
Predictive Accuracy Evaluation. The results of the predictive accuracy test of models constructed with subsets selected by the proposed algorithm were compared with those of models constructed with variables selected by the existing algorithms. Before fitting the models, each dataset was split into 80% train and 20% test sets. The train sets were used for model construction, while the test sets were used to evaluate the predictive power of the models. Typically, predictive accuracy is computed using Equation (5). However, Equation (5) assumes that the classes of the dataset are balanced. This is usually not the case in real life, as can be seen in Table 2, where the number of observations in class 0 is not the same as that in class 1 across all experimental datasets. For imbalanced datasets, the balanced classification accuracy (BCA) defined in Equation (6) is applied to calculate predictive accuracy.
In this article, the BCA was used throughout the experiments for predictive accuracy computations. The proposed algorithm was executed on all the datasets to obtain importance rankings of predictor variables. After generating the rankings, the best subsets were selected for modeling using Random Forest and Logistic regression classification. Two criteria were adopted in arriving at best subsets.
The first option was to sequentially select all variables with ranking values close to each other until there is an unusual decline in subsequent variables down the list. The second option was to keep adding variables with reasonably high ranking values until further additions did not improve model performance. The existing ranking algorithms, namely regsubsets (nvmax, forward, and backward), varImp, Fisher score, and Pearson's correlation, were equally executed on the datasets. The best subsets generated by these algorithms were selected for modeling.
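The first criterion can be sketched as follows. The function name and the drop factor are illustrative assumptions, since the paper leaves the exact threshold to the modeler's judgment:

```python
def cutoff_by_drop(ranked, drop_factor=2.0):
    """Keep variables from a descending list of (name, rank) pairs until
    one rank falls more than `drop_factor` times below its predecessor,
    i.e. until an 'unusual decline' in the ranking values."""
    keep = [ranked[0]]
    for prev, cur in zip(ranked, ranked[1:]):
        if prev[1] > drop_factor * cur[1]:
            break  # unusual decline: stop including variables here
        keep.append(cur)
    return keep
```

The second criterion is inherently iterative (refit the model after each addition), so it is better expressed as a loop around a model-fitting call than as a pure function of the rankings.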
The balanced classification accuracy of each model was computed on the test sets, yielding the results presented in Tables 5-9. Table 5 presents the predictive accuracy comparison of the various ranking algorithms on the PsyCap dataset, while Table 6 reports predictive accuracies on the Diabetes dataset. Relatedly, Table 7 reports the predictive accuracies generated by each ranking algorithm on the Melanoma dataset, while Table 8 presents accuracy results on the Spam dataset. In Table 9, the predictive accuracies on the Cancer dataset are reported. As can be observed in Tables 5-9, the variable subsets selected by the proposed algorithm performed competitively with the selections by the existing algorithms.

Discussion
A predictive accuracy test was conducted to determine how well the variables selected by both the existing algorithms and the proposed algorithm can predict the outcome variable Y on the validation set. Output was determined as probabilities of the form P(Y = 1 | X_i), where X_i is the data value for each predictor, i = 1, . . . , n. The decision boundary was 0.5: if P(Y = 1 | X_i) > 0.5, then Y = 1, otherwise Y = 0. When this test was run on the different models generated by the various ranking algorithms, their respective predictive accuracies were obtained using Equation (6). Figures 1 and 2 show graphically that the variables selected by the proposed algorithm produced higher predictive accuracies in all datasets, except in one instance, compared with the selections by the existing algorithms. Therefore, the new algorithm can be said to be a good choice of filter variable ranking in machine learning classification.

Apart from selecting variable subsets that resulted in good model performance, another advantage of the proposed algorithm over the existing algorithms is the way ranking values are presented to the user. As can be observed in Table 3, one of the ranking values in the Diabetes dataset is a negative number. For quick insight into how much more or less important one predictor is than another, it is better for all values to carry the same sign across the board. Pearson's correlation is a correlation-based method, which means all of its ranking values fall within the interval [−1,1]. In big data, where some datasets consist of a high number of features, say 100 and above, ranking the entire feature space within this narrow interval may not give quick visual insights. The RR deployed in the proposed algorithm produces values within the range of 0 to infinity. This range seems more appropriate for representing ranking values when the feature space is large.

Conclusions
In the era of big data, where voluminous, high-dimensional data are constantly being generated from healthcare delivery activities, it is necessary to pay more attention to the problem of variable selection. The majority of the attributes that come with historical or daily data are usually not necessary in modeling. When such unimportant attributes are not eliminated before model construction, many metrics of model diagnostics, such as variance, deviance, degrees of freedom, and predictive accuracy, are negatively affected. Furthermore, machine learning algorithms train slower, and constructed models are over-fitted and more complex to interpret if irrelevant predictors are included. The ranking algorithm developed in this research, which performs competitively with some existing algorithms, will be a useful tool for dimensionality reduction in healthcare data to guard against these unwanted results in classification.
As can be observed in Figures 1 and 2, the algorithm is more suited to healthcare datasets than to other domains: better performance was recorded on the Cancer, PsyCap, Diabetes, and Melanoma datasets than on the spam e-mail dataset. The algorithm achieves variable importance ranking by employing the statistical measure of risk ratio to evaluate the association between a predictor and the response. Predictors exhibiting a strong association with the class are selected for classification, while those with a weak association are excluded. The algorithm does not include a means of determining a threshold for which variables to include in a model; it is left to the discretion of the modeler to add or remove variables by trial and error, based on the ranking and the performance of previous models. In future research, the algorithm should be extended to determine a cut-off point for important variables algorithmically. Also, the possibility of implementing the algorithm in a way that makes it compatible with open-source languages, such as R, should be explored. As a filter method, the algorithm is independent of any machine learning tool; it effects variable selection as a preprocessing activity, after which any modeling tool can be applied for model fitting proper. The algorithm is generic and can thus execute on any healthcare dataset, provided the data is numeric with a dichotomous response.