Feature Ranking and Screening for Class-Imbalanced Metabolomics Data Based on Rank Aggregation Coupled with Re-Balance

Feature screening is an important and challenging topic in current class-imbalance learning. Most existing feature screening algorithms in class-imbalance learning are based on filtering techniques. However, the variable rankings obtained by various filtering techniques are generally different, and this inconsistency among ranking methods is usually ignored in practice. To address this problem, we propose a simple strategy called rank aggregation with re-balance (RAR) for finding key variables in class-imbalanced data. RAR fuses the individual ranks to generate a synthetic rank that takes every ranking into account. The class-imbalanced data are modified via different re-sampling procedures, and RAR is performed in this balanced situation. Five class-imbalanced real datasets and their re-balanced counterparts are employed to test RAR's performance, and RAR is compared with several popular feature screening methods. The results show that RAR is highly competitive and almost always better than single filtering screening in terms of several assessment metrics. Re-balancing pretreatment is highly effective for rank aggregation when the data are class-imbalanced.


Introduction
Datasets with imbalanced distribution are quite common in classification. In the binary setting, a dataset is called "imbalanced" if the number of instances in one class is far larger than that in the other in the training data. Generally, the majority class is called negative and the minority class positive. Thus, the number of positive instances is often much lower than that of negative ones.
A hindrance in class-imbalance learning is that standard classifiers are often biased towards the majority classes. Therefore, there is a higher misclassification rate in the minority instances [1,2]. Re-sampling is the standard strategy to deal with class-imbalance learning tasks. Many studies [2][3][4] have shown that re-sampling the dataset is an effective way to enhance the overall performance of the classification for several types of classifiers. Re-sampling methods concentrate on modifying the training set to make it suitable for a standard classifier. There are generally three types of re-sampling strategies to balance the class distribution: over-sampling, under-sampling, and hybrid sampling.

• Over-sampling adds a set of instances sampled from the minority class. Random duplication of minority instances, SMOTE [5], and smoothed bootstrap [6] are three widely used over-sampling methods.
• Under-sampling removes some of the data points from the majority class to alleviate the harm of imbalanced distribution. Random under-sampling (RUS) is a simple but effective way to randomly remove part of the majority class.
• Hybrid sampling is a combination of over-sampling and under-sampling.
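The two random re-sampling strategies above are easy to state in code. The sketch below (function names are ours, not from the paper) balances a labelled dataset by random over-sampling of the minority class or random under-sampling of the majority class:

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Randomly duplicate minority instances until the two classes are balanced."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    return [x for x, _ in combined], [t for _, t in combined]

def random_undersample(X, y, minority_label, seed=0):
    """Randomly drop majority instances until the two classes are balanced (RUS)."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    kept = rng.sample(majority, len(minority))
    combined = kept + minority
    return [x for x, _ in combined], [t for _, t in combined]
```

Hybrid sampling would simply apply a partial version of each in turn.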
Let D be a dataset with p features x 1 , x 2 , . . . , x p . The goal of feature screening is to extract a subset of features x 1 , x 2 , . . . , x m such that m << p and the selected features satisfy the specified conditions of the task at hand [7]. For instance, in a classification setting, the target is to select the subset of candidate features that maximizes classifier accuracy. In the past two decades, many studies have adopted feature screening methods [8][9][10]. Feature screening has many advantages, such as reducing susceptibility to over-fitting, training models faster, and offsetting the pernicious effects of the curse of dimensionality [8]. Its disadvantage is that some crucial features may be omitted, thus harming classification performance.
Filtering [11], wrapping [12], and embedding [13] are three kinds of approaches for feature screening. Filter algorithms screen top-ranked variables via a certain metric. Wrapper methods search over combinations of features to find the best subsets; an exhaustive search is often prohibitively time-consuming, so heuristic techniques are frequently utilized to explore the solution space. Embedded algorithms screen important variables while building the classifier. Of the three types of feature screening, filter methods are the simplest and the most frequently used to solve real-world imbalanced problems [14] in the class-imbalance learning community. Many metrics have been utilized in filtering feature screening algorithms, such as the t test, Fisher score [15], Hellinger distance [16], Relief [17], ReliefF [18], information gain [19], Gini index [20], AUCROC [21], AUCPRC [22], geometric mean [23], F-measure [24], and R-value [25].
Ensemble feature selection has been widely applied in classification [26]. For example, Nazrul et al. [27] provided an ensemble feature selection method using feature-class and feature mutual information to select an optimal subset of features by combining multiple subsets of features. Yang et al. [28] proposed an ensemble-based wrapper approach for feature selection from data with highly imbalanced class distributions. Nowadays, feature selection methods are popular in metabolomics data analysis. To filter discriminative metabolites from high-dimensional metabolomics data, Lin et al. [29] proposed a mutual information (MI)-SVM-RFE method that filters out noise and non-informative variables by means of artificial variables and MI, and then conducts SVM-RFE to select the most discriminative features. Fu et al. [30] proposed two feature selection algorithms that, by minimizing the overlap between the majority and the minority class, are effective in recognizing key features and controlling false discoveries for class-imbalanced metabolomics data. The above feature screening methods are usually established for balanced datasets, but they are also directly utilized in class-imbalance situations.
Different filtering approaches give different feature rankings because of their different underlying theories, even among the top-ranked features. Motivated by this problem, we propose a simple strategy called rank aggregation with re-balance (RAR) that combines the ranking results of all methods. It fuses each rank to generate a synthetic rank that takes every ranking into account for class-imbalanced data. Unlike general feature selection methods, the proposed method combines different feature selection methods rather than simply accepting the result of one, which enhances the stability of the algorithm. At the same time, the strong experimental performance on balanced and imbalanced metabolomics datasets verifies the generalization ability of RAR.

Kendall's τ Rank Correlation of Eight Filtering Methods on Class-Imbalanced Data
Each filtering method above can be employed to perform feature screening. However, we noted that different filtering feature screening techniques may give different rankings, especially when the data are extremely class-imbalanced. In this section, we compare methods using Kendall's τ rank correlation [31].
The Kendall's τ rank correlations of eight filtering methods (t test, Fisher score, Hellinger distance, Relief, ReliefF, information gain, Gini index, and R-value) are computed with simulated data generated by multivariate normal distributions, namely, X (y = 0) ∼ N p (µ 0 , Σ) and X (y = 1) ∼ N p (µ 1 , Σ), where the label y = 0 denotes the majority class and y = 1 the minority class. The predictors in the two classes have the same covariance matrix Σ, which is set to be the identity matrix for simplicity. Two cases are considered in this study. In case one, p = 8, and all eight variables are set to be key features, with the difference of mean values µ 1 − µ 0 = [2.4, 2.2, 2, 1.8, 1.6, 1.4, 1.2, 1]. In case two, p = 16; the first eight variables are the same as in case one, but another eight irrelevant predictors are added. The total number of instances is set to 960. The negative-to-positive ratios are set to 1:1, 3:1, 9:1, 31:1, and 95:1, respectively. There are 28 Kendall's τ rank correlation coefficients among the 8 filtering methods, and the mean of these coefficients (over 100 repeats) is shown in Figure 1. As stated above, τ = 1 if all pairs are concordant. However, the maximum of τ is 0.88 in case one (left, Figure 1), where there are no irrelevant predictors, and 0.76 in case two (right, Figure 1), where one-half of the features are irrelevant. The two maximal τ values are reached when the two classes are exactly balanced, and τ decreases as the imbalance ratio increases in both cases. This indicates that these filtering methods probably generate different feature rankings, and such differences tend to be intensified as the class-imbalance ratio increases. Consequently, it is hard to say that one filtering approach is better or worse than another, and it is risky to depend on a single filter algorithm to make decisions.
We have seen that such differences occur due to the different principles of the filtering methods, but we also presume that class imbalance intensifies them. A natural way to combat this challenge is to combine the information of each filtering approach and relieve the effect of class imbalance. This is the motivation for the proposed strategy of rank aggregation with re-balance.

Rank Aggregation (RA) on Original Balanced Data
In our computation, eight filtering methods (t test, Fisher score, Hellinger distance, Relief, ReliefF, information gain, Gini index, and R-value) are aggregated to generate an incorporative rank. Rank aggregation is first tested on the originally balanced dataset "NPC". Artificial re-balancing is unnecessary here, so just case 1 (no re-sampling) is performed. Rank aggregation is compared with each of the eight filtering methods, and Gmean, F 1 , AUCROC, and AUCPRC are utilized as evaluation measurements. The rank lists ordered by importance are shown on the x-axis in Figure 2. The top seven features are selected according to all four assessment metrics.

Rank Aggregation with Re-Balance (RAR) on Imbalanced Data
Figures 3-6 show the aggregated rank lists for seven cases with the datasets "TBI", "CHD2-1", "CHD2-2", and "ATR", respectively. Rank aggregation combines each ranking into a list reflective of the overall preference, and each subgraph of the four figures shows the aggregation results based on the CE algorithm. The x-axis is the optimal list obtained by the rank aggregation algorithm. The y-axis also shows ranks: the gray lines are the ranks from the original data; the black line is their average rank; and the red line is the aggregated result of the CE algorithm. The order of the x-axis is based on the aggregated ranks given by the red line. The performances measured by Gmean, F 1 , AUCROC, and AUCPRC are given in Tables 1-4, respectively.

Discussion
Tables 1 and 2 show that RA reached the maximal values of Gmean and F 1 . It can be seen from Tables 3 and 4 that RA and ReliefF obtained the maximal values of AUCROC and AUCPRC. Therefore, RA outperformed single filtering methods when assessed with Gmean, F 1 , AUCROC, and AUCPRC. The NPC dataset had a completely balanced distribution, and RA worked well on it. Thus, rank aggregation is necessary to integrate different results and provide a consensual feature ranking list, even in a totally balanced situation.
The aggregated ranking lists in Figures 3-6 give the order of importance of each feature. Though the rank lists derived from different subsampling methods were not the same, the top features were approximately consistent. After obtaining the rank list, another task is to determine how many features should be considered key variables. In this computation, we performed 5-fold cross-validation [32] to find the optimal number of key features. As recent studies have shown that AUCPRC is more informative in imbalanced learning [32,33], AUCPRC was employed as the performance metric in this section, and a random forest classifier was utilized to implement classification. Namely, the value of AUCPRC was calculated as the top k ranked features were used each time, where k varies from 1 to p (see Figures 7-10). We chose the optimal k value such that the random forest classifier had the maximal AUCPRC. It can be seen from Tables 1-4 that the optimal number of important features varied greatly under different re-balancing strategies. One possible reason is that the artificial data generated by different subsampling methods differ to some extent. Another possible reason is that the measurement changes only slightly as the number of candidate features changes; this appears to be true in Figures 7-10, where each curve tends to flatten as the number of features used for classification grows. It is also noted that the AUCPRC under no re-sampling (case 1) was generally lower than that under the six re-sampling methods.
Tables 1-4 report the results on these real datasets with the assessment metrics Gmean, F 1 , AUCROC, and AUCPRC, respectively. We can perform comparisons in several respects. Original imbalanced datasets are employed in case 1 of Tables 1-4 (except the NPC dataset). Of all 16 "no re-balance" situations, the aggregation rank method reached the maximal measures in 12 situations compared with the other 8 filtering methods (t test, Fisher score, Hellinger distance, Relief, ReliefF, information gain, Gini index, and R-value). This indicates that the aggregated rank was better than a single filtering rank in 75.00% of situations when the data are class-imbalanced. If the original dataset NPC is counted in, this proportion rises to 80.00%. Therefore, rank aggregation is generally superior to single filtering methods, whether the data are balanced or imbalanced. Re-balanced datasets were artificially generated and utilized in cases 2-7 of Tables 1-4. Of all 96 scenarios with re-balance, the aggregation rank method reached the maximal measures in 83 scenarios compared with the other eight filtering methods. This means that the aggregated rank outperformed a single filtering rank in 86.46% of scenarios when the class-imbalanced data were treated with re-balance strategies. Thus, performing rank aggregation is extremely effective in dealing with class-imbalanced data.
Rank aggregation was performed on both imbalanced datasets (RA) and re-balanced datasets (RAR). Of all 96 scenarios with re-balance (cases 2-7 in Tables 1-4), there were 93 situations whose measurements were equal to or greater than those from case 1 (no re-balance). This shows that rank aggregation with re-balance strategies performed better than with the original class-imbalanced data in 96.88% of scenarios. Therefore, performing re-balance can play a crucial role in improving the performance of rank aggregation when the data are class-imbalanced.
Figures 7-10 show the AUCPRC curves of seven cases on four imbalanced datasets. The AUCPRC from re-balanced data (cases 2-7) was generally higher than that from imbalanced data (case 1). In other words, the performance can be improved by re-sampling to artificially balance the imbalanced data. Cases 5 and 6 are two under-sampling methods, and their AUCPRC was generally lower than that from over-sampling or hybrid sampling (cases 2-4). A possible reason is that some useful information is lost in under-sampling when the size of the minority class is too small (see Table 5). Therefore, one should be cautious about using under-sampling in practice.
In sum, different filter methods generate different rankings. Rank aggregation is necessary to integrate different results and provide a consensual feature ranking list. Class imbalance usually degrades the performance of a filtering method on feature importance ranking. This harm can be alleviated via different re-balance strategies in the sample space.

Notations
The notations used in this study are listed below:
D	a dataset with two classes C 1 and C 2
C 1	the minority (positive) class
C 2	the majority (negative) class
n	the size of the total instances in D
p	the number of the features in D
n k	the size of C k , k = 1, 2
|D|	the number of samples in D
x j	the jth feature, j = 1, 2, . . . , p
x i	the ith instance, i = 1, 2, . . . , n
µ kj	the expectation of the jth feature in C k , k = 1, 2; j = 1, 2, . . . , p
σ 2 kj	the variance of the jth feature in C k , k = 1, 2; j = 1, 2, . . . , p
x kj	the sample mean of the jth feature in C k , k = 1, 2; j = 1, 2, . . . , p
s 2 kj	the sample variance of the jth feature in C k , k = 1, 2; j = 1, 2, . . . , p

t Test
Feature screening using the t test statistic [34] amounts to performing a hypothesis test (the null hypothesis is that there is no difference in the class means) on the class distributions. The lower the p value of this t test, the more significant the difference between the majority and minority classes, and consequently the more relevant the considered feature is to the separation of the two classes.

Fisher Score
Fisher score [35] is simple and generally quite effective, and it can serve as a criterion for feature screening. The Fisher score of a single feature x j is defined as follows:

F(x j ) = (µ 1j − µ 2j ) 2 / (σ 2 1j + σ 2 2j ),

where µ 1j , µ 2j , σ 2 1j , and σ 2 2j can be replaced by their corresponding sample statistics x 1j , x 2j , s 2 1j , and s 2 2j in computation. A feature with a large Fisher score is more crucial for discriminating the two categories.
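As an illustration, the Fisher score of one feature can be computed from the two class-wise samples (a minimal sketch; the function name is ours):

```python
def fisher_score(x1, x2):
    """Fisher score of one feature: squared difference of class means
    divided by the sum of the class sample variances."""
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    v1 = sum((v - m1) ** 2 for v in x1) / (len(x1) - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (len(x2) - 1)
    return (m1 - m2) ** 2 / (v1 + v2)
```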

Hellinger Distance
Hellinger distance can be used to measure distributional divergence [36]. Denote the two normal distributions by P and Q. The Hellinger distance is calculated as follows:

H 2 (P, Q) = 1 − sqrt( 2σ 1 σ 2 / (σ 2 1 + σ 2 2 ) ) exp( −(µ 1 − µ 2 ) 2 / (4(σ 2 1 + σ 2 2 )) ),

where µ 1 , σ 2 1 , µ 2 , and σ 2 2 are the expectations and variances of P and Q, respectively, and their corresponding sample statistics are used in practice [37]. The larger the Hellinger distance is, the more divergent the two distributions are.
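The normal-distribution form of the Hellinger distance can be computed directly from the two means and variances (a minimal sketch; the function name is ours):

```python
import math

def hellinger_normal(mu1, var1, mu2, var2):
    """Hellinger distance between N(mu1, var1) and N(mu2, var2),
    using the closed form for two univariate normals."""
    coef = math.sqrt(2 * math.sqrt(var1 * var2) / (var1 + var2))
    expo = math.exp(-((mu1 - mu2) ** 2) / (4 * (var1 + var2)))
    # max() guards against tiny negative values from floating-point rounding
    return math.sqrt(max(0.0, 1.0 - coef * expo))
```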

Relief and ReliefF
Relief is an iterative method that gives each feature a score indicating its level of relevance to the response [37,38]. Let x i be an instance, and let nearhit i and nearmiss i be its nearest neighbors from the same class and the other class under the Euclidean distance, respectively. The score vector s = (s 1 , s 2 , . . . , s p ) T is refreshed as follows:

s j ← s j − (x ij − nearhit ij ) 2 + (x ij − nearmiss ij ) 2 , j = 1, 2, . . . , p,

where x ij , nearhit ij , and nearmiss ij are the jth elements of x i , nearhit i , and nearmiss i , respectively. A feature with a higher score is more crucial to the response. Though ReliefF [39] was originally developed for multi-class and noisy datasets, it can be applied to binary classification. Compared with Relief, which searches for one nearest instance from the same class and one from the other class when updating the weights, ReliefF finds k nearest neighbors. Similarly, a feature with a higher score is more important to the response.
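The Relief update can be sketched as follows, iterating over every instance rather than a random subset for simplicity (function name is ours):

```python
import math

def relief(X, y):
    """Relief scores: a feature gains weight when the instance differs more
    from its nearest miss than from its nearest hit on that feature."""
    p = len(X[0])
    s = [0.0] * p
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    for i in range(len(X)):
        hit = min((j for j in range(len(X)) if j != i and y[j] == y[i]),
                  key=lambda j: dist(X[i], X[j]))
        miss = min((j for j in range(len(X)) if y[j] != y[i]),
                   key=lambda j: dist(X[i], X[j]))
        for j in range(p):
            s[j] += (X[i][j] - X[miss][j]) ** 2 - (X[i][j] - X[hit][j]) ** 2
    return s
```

A feature that separates the classes accumulates a positive score, while a constant or irrelevant feature stays near zero.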

Information Gain (IG)
Information gain [40] is a measurement from information theory and can be utilized to assess the importance of a given feature. In the setting of binary classification, the information entropy of the set D is defined as follows:

Ent(D) = − ∑ k=1,2 p k log 2 p k ,

where p k is the proportion of instances in D belonging to class C k . Assume that a discrete feature (attribute) x has V different values {x 1 , x 2 , . . . , x V }, and D v is the subset of the instance set D satisfying x = x v . The information gain of the variable x is

Gain(D, x) = Ent(D) − ∑ v=1..V ( |D v | / |D| ) Ent(D v ).

The larger the information gain is, the more important the feature is for separating the classes. A continuous feature should be discretized before using the IG metric.
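A minimal sketch of entropy and information gain for a discrete feature (function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Gain(D, x) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        sub = [t for f, t in zip(feature, labels) if f == v]
        gain -= len(sub) / n * entropy(sub)
    return gain
```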

Gini Index
Gini index [41] fits binary digits, continuous numerical values, ordinal numbers, etc. It is an impurity-based splitting method. The Gini index of D is defined as follows:

Gini(D) = 1 − ∑ k=1,2 p k 2 ,

where p k is the probability that any instance belongs to C k , replaced with n k /n (k = 1, 2) in practice. If we divide D into M subsets D 1 , D 2 , . . . , D M , the Gini index after splitting is

Gini split (D) = ∑ m=1..M ( |D m | / |D| ) Gini(D m ).

The smaller the Gini index is, the more important the feature is.
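The two Gini quantities above can be sketched as (function names are ours):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_after_split(subsets):
    """Weighted Gini after splitting D into subsets; lower means a purer split."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * gini(s) for s in subsets)
```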

R-Value
R-value [30,42] indicates the degree of overlap in a class-imbalanced dataset. The R-value for a dataset D is defined in terms of kNN(P, C i ), the subset of the k nearest neighbors of instance P that belong to the instance set C i , and a threshold θ, generally set to k/2 [43]. The smaller the R-value is, the more important the feature is for discriminating the categories.

Geometric Mean and F-Measure
True positives, true negatives, false positives, and false negatives are denoted by TP, TN, FP, and FN, respectively. Some common metrics are listed below:

sensitivity (recall) = TP / (TP + FN), specificity = TN / (TN + FP), precision = TP / (TP + FP),

Gmean = sqrt( sensitivity × specificity ), F 1 = 2 × precision × recall / (precision + recall).

The range of both Gmean and F 1 is [0, 1]. The larger they are, the better the classifier works.
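A minimal sketch computing both metrics from the four counts (the function name is ours):

```python
import math

def gmean_f1(tp, tn, fp, fn):
    """Gmean = sqrt(sensitivity * specificity);
    F1 = harmonic mean of precision and recall."""
    sens = tp / (tp + fn)   # recall on the positive (minority) class
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    gmean = math.sqrt(sens * spec)
    f1 = 2 * prec * sens / (prec + sens)
    return gmean, f1
```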

AUCROC and AUCPRC
AUCROC is the area under the receiver operating characteristic (ROC) curve [44]. AUCPRC is the area under the precision-recall curve (PRC) [45]. Both AUCROC and AUCPRC range from 0 to 1, and the larger they are, the better the classifier is for imbalanced learning. More details on AUCROC and AUCPRC can be found in our previous studies [37,46].
Gmean, F 1 , AUCROC, and AUCPRC are more widely used than the metric Accuracy in class-imbalance learning. These metrics actually pay more attention to the minority samples.
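AUCROC admits a simple rank-based computation via the Mann-Whitney statistic, which avoids constructing the ROC curve explicitly (a minimal sketch; the function name is ours):

```python
def auc_roc(scores, labels):
    """AUCROC as the probability that a randomly chosen positive
    is scored above a randomly chosen negative (ties count 1/2)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```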

Kendall's τ Rank Correlation
Kendall's τ rank correlation statistic [47] can be applied to calculate the degree of agreement between the feature rankings of two filtering techniques. Let the two feature rankings generated by two filters be f 1 : r 11 , r 21 , r 31 , . . . , r p1 and f 2 : r 12 , r 22 , r 32 , . . . , r p2 , with no ties in either ranking list. Then Kendall's τ is calculated as follows:

τ = ( ∑ i<j sgn(r i1 − r j1 ) sgn(r i2 − r j2 ) ) / ( p(p − 1)/2 ),

where sgn(x) is the sign function, namely, it equals 1 if x is positive and −1 if x is negative. A pair (i, j) is called concordant if r i1 > r j1 and r i2 > r j2 , or r i1 < r j1 and r i2 < r j2 ; otherwise, it is discordant. The numerator ∑ i<j sgn(r i1 − r j1 )sgn(r i2 − r j2 ) is the difference between the number of concordant pairs and the number of discordant pairs, and the denominator p(p − 1)/2 is the number of all distinct pairs of p elements. The range of τ is [−1, 1]. If τ = 0, the correlation of the two rankings is weak; if τ = −1, all pairs are discordant and the two rankings are exactly opposite; if τ = 1, all pairs are concordant [48].
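The formula translates directly into code (a minimal sketch; the function name is ours):

```python
def kendall_tau(r1, r2):
    """Kendall's tau for two tie-free rankings of the same p items:
    (concordant pairs - discordant pairs) / (p * (p - 1) / 2)."""
    p = len(r1)
    sgn = lambda x: (x > 0) - (x < 0)
    num = sum(sgn(r1[i] - r1[j]) * sgn(r2[i] - r2[j])
              for i in range(p) for j in range(i + 1, p))
    return num / (p * (p - 1) / 2)
```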

Rank Aggregation with Re-Balance for Class-Imbalanced Data
As mentioned above, there are differences among the ranks from different filtering methods, but we assume that they are equally matched, namely, no one is better or worse than another. Rank aggregation (RA) is an intuitive approach that compares rankings via the absolute differences between the ranks of individual features [49]. Rank aggregation with re-balance (RAR) consists of two stages for class-imbalanced data and is illustrated in Figure 11. In the sample space, the data are artificially balanced by generating new instances of the minority class or (and) removing some of the majority class instances. In the feature space, m rank lists are first computed using m different filtering methods; each rank list is a full permutation of all the features. Then, they are merged into an aggregated rank. Feature screening and classification can be performed according to this aggregated rank. Figure 11. The framework of rank aggregation with re-balance.

Rank Aggregation
As mentioned above, different filter techniques will give different feature ranking results. The rank aggregation method [34,50] combines all the rankings together, by aggregating all feature ranking lists generated from different filtering methods.
RA is to find an optimal ranking δ * such that

δ * = arg min δ ∑ i=1..m w i d(δ, f i ), (12)

where f i is the ith feature ranking list, δ represents a ranking list of the same length as f i , d is a distance function, and w i is the importance weight associated with list f i . In this study, d is chosen to be Spearman's footrule distance [50]:

d(δ, f i ) = ∑ j=1..p | δ(j) − f i (j) |,

i.e., the sum of the absolute differences between the positions assigned to each feature by the two lists. The optimization of the objective (12) is achieved using the Monte Carlo cross-entropy (CE) algorithm [51,52]. The CE algorithm is a stochastic search method that iteratively produces "better" candidate rankings concentrated around an optimal δ * [50].
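As a concrete illustration, Spearman's footrule and the weighted objective can be computed directly; the mean-rank heuristic below is a simplified stand-in for the CE search, not the CE algorithm itself (function names are ours):

```python
def footrule(r1, r2):
    """Spearman's footrule: sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(r1, r2))

def aggregate_objective(delta, rank_lists, weights=None):
    """Weighted total footrule distance from candidate list delta
    to all input ranking lists (the quantity minimized in (12))."""
    if weights is None:
        weights = [1.0] * len(rank_lists)
    return sum(w * footrule(delta, f) for w, f in zip(weights, rank_lists))

def mean_rank_aggregate(rank_lists):
    """Heuristic aggregate: order features by their average rank."""
    p = len(rank_lists[0])
    avg = [sum(f[j] for f in rank_lists) / len(rank_lists) for j in range(p)]
    order = sorted(range(p), key=lambda j: avg[j])
    ranks = [0] * p
    for pos, j in enumerate(order, start=1):
        ranks[j] = pos
    return ranks
```

A CE search would instead sample candidate permutations, keep the best-scoring fraction under `aggregate_objective`, and re-fit the sampling distribution.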

Strategies to Generate New Samples
Before performing rank aggregation, the training instances are to be modified to produce a more balanced class distribution. To achieve this task, new minority or (and) majority class samples need to be generated or drawn from the original dataset. We employ the following three strategies to gain new samples:

Randomly Sampling
In over-sampling, some (or all) of the minority class instances are randomly duplicated; in under-sampling, a portion of the majority samples are randomly removed.

SMOTE
The synthetic minority over-sampling technique (SMOTE) is a popular over-sampling algorithm [5]. Figure 12 illustrates how new samples are generated from a selected point x i in SMOTE. The five selected nearest neighbors of x i are x i1 to x i5 , and synthetic data points are created by randomized interpolation between x i and these neighbors. Namely,

x new = x i + u h (x ih − x i ), h = 1, 2, . . . , 5,

where u h is a random number between 0 and 1. The above operation can be repeated to obtain the requested number of synthetic minority instances.
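The interpolation step can be sketched as follows, with a configurable number of neighbors k (the function name is ours; a full SMOTE implementation loops this over many seed points):

```python
import math
import random

def smote_point(x, minority, k=5, seed=0):
    """One synthetic minority point: pick one of the k nearest minority
    neighbors of x and interpolate at a random position between them."""
    rng = random.Random(seed)
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    neighbors = sorted((m for m in minority if m != x),
                       key=lambda m: dist(x, m))[:k]
    nb = rng.choice(neighbors)
    u = rng.random()  # u in [0, 1): position along the segment x -> nb
    return [xi + u * (ni - xi) for xi, ni in zip(x, nb)]
```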

Smoothed Bootstrap
The smoothed bootstrap technique repeatedly bootstraps the data from the two classes and employs smoothed kernel functions to generate new, approximately balanced samples [53]. A new instance is generated by performing the following three steps: step 1: choose y = k ∈ {1, 2} with probability 0.5; step 2: choose (x i , y i ) in the original data set such that y i = k with probability 1/n k ; step 3: sample x from a probability distribution K H k (·, x i ), which is centered at x i and depends on the smoothing matrix H k .
In brief, smoothed bootstrap firstly draws randomly from the original dataset an instance from one of the two categories, then generates a new instance in its neighborhood.
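The three steps can be sketched with a diagonal Gaussian kernel of scalar bandwidth h standing in for the general smoothing matrix H k (a simplification; the function name is ours):

```python
import random

def smoothed_bootstrap_draw(class_data, h, seed=0):
    """One smoothed-bootstrap instance: pick a class with prob. 0.5 (step 1),
    pick a seed instance uniformly within it (step 2), then sample around the
    seed with an isotropic Gaussian kernel of bandwidth h (step 3)."""
    rng = random.Random(seed)
    k = rng.choice(sorted(class_data))            # step 1
    xi = rng.choice(class_data[k])                # step 2
    x_new = [v + rng.gauss(0.0, h) for v in xi]   # step 3
    return x_new, k
```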

Experiment and Assessing Metrics
As shown in Table 5, five metabolomics datasets were employed to test our algorithm. NPC is a nasopharyngeal carcinoma dataset [32,54] that is exactly balanced. In this study, NPC was utilized to investigate the performance of the rank aggregation strategy on originally balanced data; it includes 100 patients with nasopharyngeal carcinoma and 100 healthy controls. Traumatic brain injury (TBI) is from our previous studies [32,55], which report the serum metabolic profiling of TBI patients with (or without) cognitive impairment (CI). The TBI dataset includes 73 TBI patients with CI and 31 TBI patients without CI. The CHD2-1 and CHD2-2 datasets are from the same experiment on coronary heart disease (CHD) [30]. The CHD2-1 dataset contains 21 patients with CHD, and the CHD2-2 dataset contains 16 patients with coronary heart disease associated with type 2 diabetes mellitus (CHD-T2DM); both are compared with a control group of 51 healthy adults. ATR is an Acori Tatarinowii Rhizoma dataset, which includes 21 samples collected from Sichuan Province and 8 samples from Anhui Province in China [56]. Table 5 lists the summary of the five datasets, including the numbers of attributes, total instances, majority and minority instances, and the imbalance ratio. The NPC dataset was utilized to test the performance of rank aggregation under the original balanced distribution. The other four imbalanced datasets were used to evaluate the RAR algorithm with artificially re-balanced data.
This section shows the efficacy of the proposed RAR algorithm on one originally balanced dataset and four class-imbalanced datasets and compares it with other filtering feature screening methods via several assessment metrics. Rank aggregation was performed under seven situations: case 1 uses the data with no re-sampling, and cases 2-7 use the six re-balancing strategies summarized in Table 6. Note that NPC is balanced, and just case 1 is performed on it. In this study, Gmean, F 1 , AUCROC, and AUCPRC are employed to assess the performance of the RA or RAR algorithm on the five datasets under the seven cases.

Conclusions
In this paper, we propose a simple but effective strategy called RAR for feature screening of class-imbalanced data, which aggregates rankings from individual filtering algorithms and modifies the class-imbalanced data with various re-sampling methods to provide balanced or more adequate data. RAR addresses, to a large extent, the problem of inconsistency between different feature ranking methods. The results on real datasets show that RAR is highly competitive and almost always better than single filtering screening in terms of the geometric mean, F-measure, AUCROC, and AUCPRC. After re-balancing pretreatment, the performance of rank aggregation is greatly improved, so re-sampling to balance the classes is extremely useful for rank aggregation when metabolomics data are class-imbalanced. Our proposed method serves as a reference for future research on feature selection for the diagnosis of diseases.
Rank aggregation is a general idea for investigating the importance of features. In this study, rankings from eight filtering algorithms are employed to generate the aggregated rank. There are many other filter techniques, such as the Chi-squared statistic, power, Kolmogorov-Smirnov statistic, and signal-to-noise ratio [57], which are all widely utilized in class-imbalance learning. In addition, considering that a re-sampling method can also generate a rank list, rank aggregation can be performed across various re-sampling algorithms rather than different filtering methods. Further, if necessary, multiple rank aggregations could be ensembled to combine aggregated rankings derived from different algorithms. Finally, although RAR is applied to metabolomics datasets in this study, it is potentially suitable for handling high-dimensional imbalanced data from other fields, such as economics and biology.