Complex diseases such as brain cancer pose a severe threat to human life. The evolution of microarray technology and the advancement of machine learning, artificial intelligence, and statistical methods have opened new possibilities for the classification and diagnosis of deadly diseases such as cancer, Alzheimer's disease, and diabetes. The notorious characteristics of microarray datasets are a huge number of features, limited samples, and an imbalanced class distribution [1]. An imbalanced class distribution occurs when at least one class is insufficiently represented and overwhelmed by the other classes. Training a classification model on imbalanced data poses many obstacles to learning algorithms and has numerous ramifications for real-world applications [2]. This problem causes underestimation of the minority class examples and biases classification results toward the majority class [1]. Classification of an imbalanced data set becomes even more severe with a limited number of samples and a huge number of features [3].
Learning from imbalanced data sets has recently drawn interest from the machine learning and data mining communities in both academia and industry, which is reflected in the setting up of various workshops and special issues such as ICML'03 [5], LPCICD'17 [6], and ECML/PKDD 2018 [7]. Various techniques have been proposed to overcome the class imbalance problem, including resampling techniques [8], ensemble learning techniques [12], cost-sensitive learning [16], one-class learning [19], and active learning [22]. Among resampling techniques, the most widely used methods for the class imbalance problem are (i) random oversampling (ROS) and (ii) random undersampling (RUS) [25]. In the former, random replicates of minority class examples are added to the original training set, which in some scenarios increases the training time of the classification model, especially when dealing with a high-dimensional data set. In the latter, examples from the majority class are randomly discarded in order to rectify the disparity between classes. The limitation of RUS is that it may discard informative instances that could be useful for the predictive model [25].
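The two resampling strategies just described can be sketched in a few lines (an illustrative NumPy sketch, not an implementation from any of the cited works; the 9:1 class ratio and the binary 0/1 labels are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority=1):
    """ROS: duplicate randomly chosen minority examples until classes balance."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority=1):
    """RUS: randomly discard majority examples until classes balance."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]

X = rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)        # 9:1 imbalance
X_ros, y_ros = random_oversample(X, y)   # 180 samples: training set grows
X_rus, y_rus = random_undersample(X, y)  # 20 samples: 80 majority examples lost
```

Note how ROS grows the training set (here from 100 to 180 samples), illustrating the training-time cost noted above, while RUS shrinks it to 20 samples, discarding information from 80 majority examples.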
One example of a widely used ROS method is the Synthetic Minority Over-sampling Technique (SMOTE) proposed by Chawla et al. [8]. This method generates artificial examples for the minority class by interpolating among neighboring minority class examples. SMOTE increases the number of minority class examples by adding new minority instances derived from their nearest neighbors, which improves the generalisation capability of the classifier [28]. Unfortunately, generating artificial instances may not always be the best approach to deal with imbalanced data, especially in sensitive application domains such as biomedical data sets that rely on real data for diagnosis [29]. In this scenario, artificial data might unfavorably affect the classification performance of the diagnosis process. In view of that, techniques that do not modify the training set in the learning process remain favorable. Recent studies [1] validated the claim that the performance of imbalanced-class methods declines significantly when they are applied to an imbalanced data set with a huge number of features (high-dimensional data) and a limited number of samples [33].
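The interpolation step at the heart of SMOTE can be sketched as follows (an illustrative sketch of the idea only, not the reference implementation of [8]; the neighbourhood size k and the brute-force distance computation are simplifications):

```python
import numpy as np

def smote_sample(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority examples by interpolating between a
    randomly chosen minority example and one of its k nearest minority
    neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # distances from X_min[i] to every minority example
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(10, 4))  # 10 minority samples
X_new = smote_sample(X_min, n_synthetic=20)            # 20 synthetic samples
```

Because each synthetic point is a convex combination of two real minority examples, it always lies inside the minority class region, which is exactly why such artificial points can be problematic when diagnosis must rest on real measurements.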
Recently, dealing with imbalanced data sets using feature selection has become popular among the data mining and machine learning communities [4]. The techniques mentioned earlier (i.e., resampling, etc.) focus on sampling the training data to overcome the imbalanced class distribution [4]. Feature selection takes a different approach to the imbalanced class problem: instead of over- or undersampling the training samples, the general concept is to obtain a subset of features that optimally rectifies the disparity among classes in the data set, selecting the best features to represent both classes. Feature selection approaches are classified into filter, wrapper, and embedded methods. Filter approaches are computationally efficient at selecting feature subsets, but they are highly susceptible to being trapped in a locally optimal feature subset because their performance is heavily affected by the “feature interaction problem”: the selected features may not be optimal for a specific learning model [35]. Wrapper [37] and embedded approaches [41] were introduced to select a discriminative feature subset; these techniques select features using an evaluator that is often a cost function, i.e., the contribution of a feature to the performance of the classifier [8], or the discriminative capability of the features [37]. Features selected using a loss function may not always yield optimal classifier performance, whereas ranking features with multiple filter methods and aggregating their outcomes can select discriminative, near-optimal features that represent both the minority and majority classes, retaining the most informative features to guide a population-based algorithm toward the optimal subset. In this paper, we examine the imbalanced class problem on data sets with a high number of features (high-dimensional data) but small numbers of samples [4].
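As a concrete instance of a filter method, the two-class Fisher criterion ranks each feature independently of any learning model (an illustrative sketch; the synthetic data and the exact form of the score are assumptions, not taken from the paper):

```python
import numpy as np

def fisher_score(X, y):
    """Two-class Fisher criterion per feature:
    (mu1 - mu0)^2 / (var1 + var0). Higher means more discriminative."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X0.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X0.var(axis=0) + 1e-12  # guard against zero variance
    return num / den

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 100))        # 100 features, mostly noise
y = np.array([0] * 45 + [1] * 15)     # imbalanced labels
X[y == 1, 0] += 3.0                   # plant a class signal in feature 0
scores = fisher_score(X, y)
top10 = np.argsort(scores)[::-1][:10] # indices of the 10 best-ranked features
```

The score is cheap to compute for every feature, which is the efficiency advantage of filters, but it evaluates each feature in isolation, which is precisely the feature-interaction weakness noted above.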
In this paper, a hybrid filter/wrapper feature selection method for high-dimensional and imbalanced-class data is proposed, based on improved correlation-based redundancy (CBR) and the binary grasshopper optimisation algorithm (BGOA), a global population-based algorithm that finds an optimal solution based on the fitness of combinations of highly ranked individual features. Hence, the selected subset is more robust and relevant to the classifier. rCBR-BGOA first applies the filter approach (i.e., improved CBR) to find highly discriminative features. These features then give BGOA a strong initial stage from which to find the most informative subset of features. The performance of the CBR filtering method is improved via an ensemble technique in which several filter-based approaches (i.e., ReliefF, Chi-square, and Fisher score) are combined to obtain a robust feature list. The top N genes with the highest rank from each subset are merged to form a new data set, and CBR is used to refine the outcome of the filtering stage. rCBR-BGOA differs from the method in reference [4], which uses a single filter method that is highly susceptible to being trapped in a local optimum and may retain redundant features, which can increase the complexity of the wrapper process and reduce the performance of the model. Similarly, this approach is more cost-effective than the iterative method [40] that checks all possible combinations of features. As shown in the experimental results, rCBR-BGOA is a robust and effective method.
rCBR-BGOA addresses high dimensionality and class imbalance well. It is an appropriate method especially when there is a huge number of features and a need to find the best (near-optimal) proportion of selected positive and negative features. rCBR-BGOA uses the grasshopper optimisation algorithm to guide the search process more efficiently.
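The ensemble-filter stage described above can be sketched as rank aggregation over independent filters (an illustrative sketch: Pearson correlation and a signal-to-noise score stand in for the paper's ReliefF/Chi-square/Fisher-score trio, and the top-N union rule is an assumption about the merge step):

```python
import numpy as np

def corr_scores(X, y):
    """Absolute Pearson correlation of each feature with the binary label."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(Xc.T @ yc) / den

def snr_scores(X, y):
    """Signal-to-noise ratio |mu1 - mu0| / (sd1 + sd0) per feature."""
    X0, X1 = X[y == 0], X[y == 1]
    return np.abs(X1.mean(axis=0) - X0.mean(axis=0)) / (
        X1.std(axis=0) + X0.std(axis=0) + 1e-12)

def merged_top_n(X, y, n=20):
    """Merge each filter's top-n features into one candidate pool,
    mimicking the ensemble-filter stage that feeds the wrapper search."""
    top_a = set(np.argsort(corr_scores(X, y))[::-1][:n])
    top_b = set(np.argsort(snr_scores(X, y))[::-1][:n])
    return sorted(top_a | top_b)

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 500))       # high-dimensional, few samples
y = np.array([0] * 60 + [1] * 20)    # imbalanced labels
X[y == 1, :5] += 2.0                 # features 0-4 carry the class signal
pool = merged_top_n(X, y, n=20)      # reduced candidate pool for the wrapper
```

In a full pipeline, this pool of at most 2N candidate features, rather than all 500, would be handed to the BGOA wrapper search, which is what makes the hybrid approach cheaper than wrapping over the raw feature space.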
Recently, considerable progress has been made in the classification of high-dimensional and imbalanced data sets. However, most existing methods deal with only one problem at a time: either imbalanced class distribution or high dimensionality. In this paper, we focus only on relevant works in the literature that utilised feature selection approaches to combat imbalanced data in high-dimensional datasets.
Yin et al. [42] overcame the imbalanced class problem using Bayesian learning. This method is based on the observation that the samples in the majority class significantly influence the overall feature selection process. The proposed method first decomposed the majority class examples into smaller pseudo-subclasses. After that, feature reduction approaches were applied to the decomposed examples, where the pseudo-subclasses counteract the imbalanced distribution across classes. The proposed method counterbalances the impact of the larger class samples on feature selection methods. Experimental results on synthetic features showed that the proposed method is effective in dealing with the high-dimensional and imbalanced-class data problem. Alibeigi et al. [43] presented a new approach to deal with high-dimensional and imbalanced data using Density-Based Feature Selection (DBFS), where attributes are weighted based on their approximated probability density values. This approach begins by assessing the contribution of each attribute, and the highest-weighted attributes are selected. The experimental results showed that DBFS is an effective method for the combined feature selection and imbalanced data problem in comparison with other state-of-the-art algorithms. Maldonado et al. [26] addressed the challenge of skewed and high-dimensional data sets in the binary-class context. Their method used sequential backward selection with a support vector machine (SVM) and SMOTE under the following three loss functions: (i) balanced loss, (ii) predefined loss, and (iii) standard loss. The proposed methods were evaluated on six imbalanced data sets and recorded better predictive performance in comparison with other state-of-the-art methods.
Zhang et al. [44] introduced a new feature selection method for skewed datasets using the F-measure, instead of classification accuracy, as the performance criterion. A structural support vector machine (SSVM) algorithm based on maximising the F-measure was used to select relevant features from the weight vector of the SSVM under the imbalanced data setting. A new feature ranking strategy was proposed, which combined the weight vector of the SSVM and symmetric uncertainty to retain the top-ranking features. Thereafter, a harmony search algorithm was employed to choose the optimal feature subsets, which represent the minority and majority classes. Experimental results on six data sets showed that this method is effective in resolving imbalanced classification of high-dimensional data sets. Moayedikia et al. [4] proposed a hybrid technique using symmetric uncertainty and harmony search (SYMON) to deal with the problem of imbalanced class distribution in high-dimensional data sets. SYMON employs symmetric uncertainty to weight features based on their dependency on the classes, and the harmony search optimisation algorithm is used to optimise the feature subsets. The proposed method was evaluated against various similar methods and proved effective in dealing with high-dimensional and imbalanced datasets.
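The symmetric-uncertainty weight that SYMON relies on is a normalised mutual information, SU(x, y) = 2 I(x; y) / (H(x) + H(y)); for discrete features it can be computed directly (an illustrative sketch; continuous features would first need discretisation, which is omitted here):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), bounded in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    # joint entropy via paired symbols
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])
    mi = hx + hy - entropy(joint)
    return 2 * mi / (hx + hy) if hx + hy > 0 else 0.0

y = np.array([0, 0, 0, 1, 1, 1])
x_perfect = np.array([2, 2, 2, 7, 7, 7])   # fully determines y
x_noise = np.array([0, 1, 0, 1, 0, 1])     # carries little about y
su1 = symmetric_uncertainty(x_perfect, y)  # -> 1.0
su2 = symmetric_uncertainty(x_noise, y)    # close to 0
```

Because SU is normalised to [0, 1], features of very different cardinalities can be compared on one scale, which is what makes it usable as a per-feature weight in the harmony search.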
Viegas et al. [45] developed a new approach to high-dimensional and imbalanced data sets using a genetic programming algorithm. The proposed method combines the most relevant feature sets selected by distinct feature selection metrics to acquire the most discriminative features, improving the predictive performance of the model. The method was evaluated on biological and textual data sets, and the experimental results showed that it is effective at selecting a small number of highly relevant features. Yang et al. [39] proposed an ensemble wrapper method for feature selection on highly skewed data sets. The proposed method retains the high classification performance of wrapper-based feature selection by simultaneously maximising model performance and reducing selection bias; it works by training multiple base classifiers with balanced features. Hualong et al. [46] introduced an ensemble method to deal with the multi-class imbalanced classification problem. The authors used a one-against-all (OVA) coding strategy to convert the multi-class problem into numerous binary problems, applying feature subspace selection to each to generate multiple training subsets. Two strategies, (i) decision threshold adjustment and (ii) random undersampling, were applied to each training set to overcome the class imbalance problem. The proposed method was evaluated on eight high-dimensional and imbalanced data sets, and the experimental results show that it is effective for the multi-class imbalanced data problem. Zhen et al. [47] proposed a novel method, namely WELM, to tackle multi-class imbalance problems at both the data and algorithmic levels. At the data level, a class-oriented feature selection method is applied to select features that are highly correlated with the minority class samples. At the algorithmic level, the extreme learning machine (ELM) was modified to emphasise input nodes with high discrimination power, and an ensemble of such models is trained to improve performance. Experimental results on eight gene datasets indicate that WELM is effective and outperforms other methods.
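The one-against-all decomposition with per-problem undersampling used in [46] can be sketched as follows (an illustrative sketch; the three-class synthetic data and the exact balancing rule are assumptions, and the decision threshold adjustment step is omitted):

```python
import numpy as np

def ova_binary_sets(X, y):
    """Decompose a multi-class problem into one-vs-all binary problems."""
    return {c: (X, (y == c).astype(int)) for c in np.unique(y)}

def undersample(X, y, seed=0):
    """Randomly undersample the larger class within one binary problem."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    maj, mino = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    keep = np.concatenate([mino, rng.choice(maj, size=len(mino), replace=False)])
    return X[keep], y[keep]

rng = np.random.default_rng(4)
X = rng.normal(size=(90, 6))
y = np.repeat([0, 1, 2], [60, 20, 10])         # three imbalanced classes
for c, (Xc, yc) in ova_binary_sets(X, y).items():
    Xb, yb = undersample(Xc, yc)               # one balanced binary training set
```

Each binary problem is balanced independently, so a rare class (here class 2, with 10 samples) yields a small but balanced training set rather than being swamped by the other 80 examples.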
The effectiveness of the proposed rCBR-BGOA method is evaluated on high-dimensional and imbalanced biomedical data sets. These datasets differ in their number of features, number of samples, and class imbalance ratio. The performance of rCBR is evaluated against other filtering methods. Then, BGOA is evaluated by investigating its convergence behaviour under different combinations of BGOA parameters and population sizes. The results of the proposed rCBR-BGOA were compared with state-of-the-art methods on the same datasets in terms of various performance measures, including G-Mean (GM) and Area Under the Curve (AUC). The comparative results show that rCBR-BGOA is effective and competitive with other methods.
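Of the two measures, the G-Mean makes the motivation for imbalance-aware evaluation especially clear: a classifier that ignores the minority class scores zero regardless of its raw accuracy (an illustrative sketch; the 9:1 toy labels are an assumption):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity; both classes must be
    predicted well for a high score, so it is robust under imbalance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = (y_pred[y_true == 1] == 1).mean()  # sensitivity (minority recall)
    tnr = (y_pred[y_true == 0] == 0).mean()  # specificity (majority recall)
    return np.sqrt(tpr * tnr)

y_true = np.array([0] * 9 + [1])             # 9:1 imbalance
y_all_majority = np.zeros(10, dtype=int)     # trivially predicts the majority
print(g_mean(y_true, y_all_majority))        # -> 0.0 despite 90% accuracy
```

This is why GM (together with AUC) is preferred over plain accuracy when comparing methods on skewed biomedical data.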