1. Introduction
Feature selection is one of the main data analysis techniques in data mining and has shown its power in many applications, such as insulator detection [1], medical studies [2], and environmental science [3]. For big data analysis in particular, identifying the meaningful information is a key issue.
With the rapid development of high-throughput techniques, genomics, metabolomics, and proteomics have been widely applied in disease studies, drug research, etc. Omics data are often characterized as “high dimensions, small samples”, since they usually have a large number of features but few samples. Because genomics, metabolomics, and proteomics data contain so many features, it is critical to measure feature importance accurately and to select the most discriminative feature subset. Puthiyedth et al. [4] presented a combinatorial optimization approach for integrated feature selection and applied it to prostate cancer data, identifying potential novel prostate cancer-associated pathways and genes. Christin et al. [5] studied six feature selection methods and analyzed their performance for liquid chromatography-mass spectrometry-based proteomics and metabolomics biomarker discovery. Zou et al. [6,7] presented a sequence-based feature selection technique and a dimensionality reduction strategy for protein prediction. Lin et al. [8] studied a feature selection method based on the overlapping area and identified the discriminative features of liver disease from a metabolomics dataset.
Support vector machine (SVM) [9] is a popular and efficient classification technique and has been widely applied in many fields, such as biological data processing [10]. SVM-recursive feature elimination (SVM-RFE) [11] is a feature selection algorithm based on SVM. When the SVM learning model is built, the weights of the features are computed as well. SVM-RFE iteratively removes the features with the lowest weights, and the order in which the features are removed gives the feature importance ranking [11,12]. Owing to its strong performance, SVM-RFE has been adopted in many applications, such as signal processing [13], genomics [11,12], proteomics [14], and metabolomics [15,16]. Many studies have also sought to improve it. Tang et al. [17] proposed a two-stage SVM-RFE: in the first stage, multiple SVM-RFEs with different parameters were applied to remove the noise and non-informative data; in the second stage, the final feature subset was selected by a fine SVM-RFE. Li et al. [18] combined SVM-RFE with the T-statistic to identify the genes associated with CRC development or metastasis. mRMR-SVM [19] tries to select an important and non-redundant feature subset by means of SVM-RFE and mRMR (minimum redundancy maximum relevance). R-SVM [20] is also a recursive feature selection method based on SVM, which combines SVM weights and class means to evaluate the discriminative ability of features. Other studies have examined how many low-weight features should be removed in each iteration of SVM-RFE [21,22].
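The elimination loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this study: the SVM training step is abstracted behind a hypothetical callable `train_weights` that returns one weight per surviving feature, the ranking criterion is assumed to be the squared weight commonly used with a linear kernel, and `drop_fraction` is an assumed parameter controlling how many features are removed per iteration.

```python
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[float]]


def svm_rfe_rank(
    X: Matrix,
    y: Sequence[int],
    train_weights: Callable[[Matrix, Sequence[int]], Sequence[float]],
    drop_fraction: float = 0.05,
) -> List[int]:
    """Return feature indices ordered from most to least important.

    train_weights is a hypothetical stand-in for fitting a linear SVM
    and reading off its weight vector; drop_fraction is an assumed
    per-iteration removal rate (at least one feature is always dropped).
    """
    active = list(range(len(X[0])))  # features still in the model
    removed: List[int] = []          # elimination order, earliest first
    while active:
        # Restrict every sample to the surviving features and retrain.
        X_sub = [[row[j] for j in active] for row in X]
        w = train_weights(X_sub, y)
        # Assumed ranking criterion: squared weight, lowest first.
        order = sorted(range(len(active)), key=lambda i: w[i] ** 2)
        n_drop = max(1, int(drop_fraction * len(active)))
        dropped = [active[i] for i in order[:n_drop]]
        removed.extend(dropped)
        active = [j for j in active if j not in dropped]
    # Features removed last are the most important, so reverse the order.
    return removed[::-1]
```

Any linear SVM trainer can be plugged in as `train_weights`; retraining on the surviving features in every iteration is what distinguishes the recursive procedure from a single ranking pass.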
Basically, SVM-RFE ranks the features according to the order in which they are deleted during the iterations. The top-ranked features, which are removed in the last iteration of SVM-RFE, are the most important, while the bottom-ranked ones, which are removed in the first iteration, are the least informative. For a specific application, obtaining a feature importance ranking is not enough; one must also determine how many top-ranked features (such as genes and metabolites) should be selected, so that the disease phenotype and disease mechanism can be studied on the basis of the selected features. In some studies, a predetermined number of top-ranked features was selected [23,24]. In other studies, the top-ranked features that induce a classifier with the “best” classification accuracy rate were selected [11,16].
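The accuracy-driven choice of the subset size can be illustrated as follows. This is a sketch under stated assumptions: `evaluate` is a hypothetical callable that returns an accuracy estimate (for example, from cross validation) for a given feature subset, and ties are broken in favor of the smaller subset, a convention the cited studies may not share.

```python
from typing import Callable, List, Sequence, Tuple


def select_top_k(
    ranking: Sequence[int],
    evaluate: Callable[[List[int]], float],
    max_k: int = 0,
) -> Tuple[List[int], float]:
    """Pick the prefix of a feature ranking with the best accuracy.

    ranking lists feature indices from most to least important.
    evaluate is a hypothetical scorer for a candidate feature subset.
    max_k caps the prefixes tried; 0 means the whole ranking is tried.
    """
    if not ranking:
        raise ValueError("ranking must contain at least one feature")
    limit = len(ranking) if max_k <= 0 else min(max_k, len(ranking))
    best_k, best_acc = 1, float("-inf")
    for k in range(1, limit + 1):
        acc = evaluate(list(ranking[:k]))
        # Strict comparison: on a tie, the smaller subset is kept (assumption).
        if acc > best_acc:
            best_k, best_acc = k, acc
    return list(ranking[:best_k]), best_acc
```

With a fixed, predetermined number of features, the loop is unnecessary and the subset is simply `ranking[:n]`.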
In some applications, it is not practical to specify the number of features to be selected in advance. However, it is well known that the selected feature subset should have a powerful discriminative ability. If it does, the classifier built on it usually has a high prediction accuracy rate, and the different sample groups should show different distributions with little overlap in the selected subspace. Hence, this study proposes a method, SVM-RFE-OA, which determines the number of features to be selected from the SVM-RFE feature ranking by combining the classification accuracy rate with the average overlapping ratio of the samples. In addition, to weigh the features more accurately, this study also proposes a modified SVM-RFE-OA (M-SVM-RFE-OA) algorithm, which temporarily screens out the samples lying in a heavily overlapping area in each iteration. Experiments on eight public biological datasets validate the two proposed techniques.
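To make the notion of an overlapping ratio concrete, the sketch below uses one plausible neighborhood-based reading: the overlapping ratio of a sample is taken to be the fraction of its k nearest neighbors (Euclidean distance) that belong to a different class. This definition, the function names, and the distance measure are assumptions made for illustration only; the definition actually used by SVM-RFE-OA, and the way it is combined with the accuracy rate, may differ.

```python
import math
from typing import List, Sequence


def _euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def overlap_ratios(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int = 9
) -> List[float]:
    """Assumed definition: share of each sample's k nearest
    neighbors that carry a different class label."""
    n = len(X)
    if n < 2:
        raise ValueError("need at least two samples")
    k = min(k, n - 1)  # a sample cannot have more than n - 1 neighbors
    ratios = []
    for i in range(n):
        # Sort the other samples by distance (ties broken by index).
        dists = sorted((_euclidean(X[i], X[j]), j) for j in range(n) if j != i)
        foreign = sum(1 for _, j in dists[:k] if y[j] != y[i])
        ratios.append(foreign / k)
    return ratios


def average_overlap(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int = 9
) -> float:
    """Mean overlapping ratio over all samples (0 = fully separated)."""
    ratios = overlap_ratios(X, y, k)
    return sum(ratios) / len(ratios)
```

Under this reading, a feature subset on which the classes are well separated gives an average ratio near 0, and samples with a high individual ratio are the ones that a method such as M-SVM-RFE-OA would temporarily set aside.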
3. Results and Discussion
To show the performance of the two proposed techniques, SVM-RFE-OA and M-SVM-RFE-OA were compared with SVM-RFE, where the selected feature subset was determined by the classification accuracy rate alone. Eight public biological datasets were used in the comparison of the three algorithms: Breast2, Colon, Lymphoma, Prostate, Brain_data, and Srbct are from http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html, and the other two datasets, DLBCL_GEMS and Leukemia2_GEMS, are from www.gems-system.org. Breast2 [27,28] contains 77 samples, including 33 samples that developed distant metastases within 5 years and 44 samples that remained disease-free for over 5 years. The Colon [27,29] dataset includes 40 tumor samples and 22 normal colon tissue samples, with 2000 genes measured by Affymetrix technology. The DLBCL_GEMS [30] dataset includes 58 diffuse large B-cell lymphoma (DLBCL) samples and 19 follicular lymphoma samples. The Lymphoma [27,31] dataset covers the most prevalent adult lymphoid malignancies. Its total sample size is 62, including 42 diffuse large B-cell lymphoma samples, 9 follicular lymphoma samples, and 11 chronic lymphocytic leukemia samples. Prostate [27,32] contains 52 prostate tumor samples and 50 non-tumor prostate samples. Brain_data [27,33] contains 42 samples in 5 classes of central nervous system tumors and tissues: 10 medulloblastoma samples, 10 malignant glioma samples, 10 atypical teratoid/rhabdoid tumor (AT/RT) samples, 8 primitive neuro-ectodermal tumor (PNET) samples, and 4 human cerebellum samples. The Leukemia2_GEMS [30] dataset contains 24 acute lymphoblastic leukemia (ALL) samples, 28 acute myeloid leukemia (AML) samples, and 20 mixed-lineage leukemia (MLL) samples. The Srbct [27,34] dataset, named after the small, round blue cell tumors of childhood, includes 23 neuroblastoma (NB) samples, 20 rhabdomyosarcoma (RMS) samples, 12 non-Hodgkin lymphoma (NHL) samples, and 8 Ewing family of tumors (EWS) samples.
Table 1 provides detailed information on the eight datasets, four of which are binary classification problems.
SVM-RFE, SVM-RFE-OA, and M-SVM-RFE-OA were implemented in C++. The SVM implementation was obtained from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. A linear kernel was adopted in the SVM, and t was set to 5%. In SVM-RFE-OA and M-SVM-RFE-OA, k was set to 9. Five-fold cross validation was run 50 times for each method. The average classification accuracy rates and the average standard deviations are given in Table 2. For the four binary datasets, sensitivities and specificities are given in Table 3 and Table 4, respectively. In the tables, the bold numbers mark the largest value for each dataset among the three methods.
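The evaluation protocol can be sketched as follows. This is an illustrative outline and not the C++ code used in the experiments: the shuffling scheme, the seed, the function names, and the choice of which class label counts as positive are all assumptions, since the text does not specify them.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def repeated_kfold(
    n_samples: int, n_folds: int = 5, n_repeats: int = 50, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for repeated k-fold cross validation.

    Each repeat reshuffles the samples (assumed scheme, fixed seed for
    reproducibility) and splits them into n_folds disjoint test folds.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for f in range(n_folds):
            test = idx[f::n_folds]  # every n_folds-th shuffled sample
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    The label treated as positive is an assumption of this sketch.
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec
```

With the default arguments, the generator produces 250 train/test splits per dataset; feature selection would be rerun inside each training split so that the test fold never influences the ranking.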
First, we compared SVM-RFE-OA with SVM-RFE to examine the performance of SVM-RFE-OA. Table 2 shows that SVM-RFE-OA outperforms SVM-RFE in classification accuracy rate for seven of the eight biological datasets. For Brain_data, the accuracy rate of SVM-RFE-OA is higher than that of SVM-RFE by 8.95%. Only for Breast2 is the average accuracy rate of SVM-RFE-OA lower than that of SVM-RFE (by 0.83%). The sensitivities and specificities (see Table 3 and Table 4) for the four binary problems also show the superiority of SVM-RFE-OA over SVM-RFE: both its sensitivities and its specificities are higher than those of SVM-RFE for three of the four binary datasets. Hence, the discriminative ability of a feature subset can be measured more accurately by combining the classification accuracy rate with the average overlapping degree of the samples than by using the classification accuracy rate alone. The classification accuracy reflects the distinguishing ability of the features via the classification model, while the average overlapping degree of the samples captures the discriminative information that the feature subset contains by means of the sample distribution. Combining the two criteria yields a more comprehensive measurement of the feature subset. This technique can also be used in other RFE analyses to determine the final selected feature subset.
Secondly, we compared M-SVM-RFE-OA with SVM-RFE-OA to examine the effect of temporarily screening out the poor samples lying in an overlapping area. Both M-SVM-RFE-OA and SVM-RFE-OA combine the classification accuracy rate and the average overlapping degree to calculate the discriminative ability of the feature subset and determine the number of top-ranked features to be selected. To measure the feature importance more accurately, M-SVM-RFE-OA additionally shields the samples in the overlapping area temporarily in each iteration. The comparison shows that temporarily screening out the samples mixed with heterogeneous samples in each iteration benefits the calculation of feature weights. Table 2 clearly shows that M-SVM-RFE-OA outperforms SVM-RFE-OA in accuracy rate for seven of the eight datasets. Table 3 and Table 4 also show the superiority of M-SVM-RFE-OA over SVM-RFE-OA in sensitivity and specificity. Therefore, the quality of the training data influences the construction of the SVM model and the calculation of the feature weights. By temporarily screening out the samples with high overlapping ratios in each iteration, M-SVM-RFE-OA calculates the feature weights more accurately and finally obtains a more powerful feature subset.
The comparisons between SVM-RFE and SVM-RFE-OA and between SVM-RFE-OA and M-SVM-RFE-OA validate the two techniques proposed in this study. Overall, M-SVM-RFE-OA outperforms SVM-RFE in accuracy rate for all eight datasets and in sensitivity and specificity for all four binary datasets. For Brain_data in particular, the accuracy rate of M-SVM-RFE-OA is higher than that of SVM-RFE by 10.2%.
Meanwhile, SVM-RFE-OA and M-SVM-RFE-OA are more stable than SVM-RFE. The standard deviations of M-SVM-RFE-OA in accuracy rate, sensitivity, and specificity are lower than those of SVM-RFE in most cases. Hence, by considering two different aspects, namely the classification accuracy rate and the average overlapping degree of the samples, which reflects the sample distribution in the feature subspace (the top-ranked feature subset), we can obtain a more comprehensive measurement of the feature subset. Further, temporarily shielding the samples with high overlapping ratios in each iteration could make the computation of feature importance more accurate.
Table 5 gives the average number of features selected by each method over the 50 runs of five-fold cross validation. The average number of features selected by SVM-RFE is smaller than those selected by SVM-RFE-OA and M-SVM-RFE-OA. However, the classification accuracy rates of SVM-RFE-OA and M-SVM-RFE-OA are higher than those of SVM-RFE, and their standard deviations are lower (see Table 2). For the Lymphoma dataset, SVM-RFE selects 3.48 features on average, while SVM-RFE-OA and M-SVM-RFE-OA raise the classification accuracy rate by 0.93% and 1.49% using 1.59 and 1.62 more features, respectively. Although SVM-RFE-OA and M-SVM-RFE-OA select more features on average than SVM-RFE, the two new methods are more effective and more stable than SVM-RFE.