## 1. Introduction

The classification task is to assign objects in the feature space to classes or categories based on retrospective observations with known class labels. Real data are often characterized by an imbalanced distribution of classes, in which the number of instances in some classes exceeds the number of instances in others. This situation is mainly explained by the limited occurrence of minority class instances [1]. For example, normal web browsing traffic is dominant when classifying traffic on the Internet. However, detection of rare malicious connections is very important for training [1]. Similar examples can be given from the fields of medical diagnosis, bank fraud detection, and diagnosis of equipment malfunctions.

The search for regularities in imbalanced data is a difficult task for specialists in data mining, machine learning, pattern recognition, and statistics [2]. The main problem in constructing classifiers of imbalanced data is the poor adaptation of standard training algorithms, which leads to a significant reduction in classification effectiveness. Due to the imbalance between classes, a standard classifier usually misclassifies instances of minority classes, since the model overfits to instances of the larger classes [1].

It is not enough to evaluate a classifier of imbalanced data using the overall accuracy [3]. Positive classes (with the smallest number of instances) are usually more important than negative classes (with the largest number of instances). Reducing misclassification of minority class instances is crucial in many real-world problems [4,5]. However, improving the classification quality of positive classes often leads to poor recognition of instances of negative classes, as instances of different classes often overlap. Thus, in each data classification task, the developer of the data analysis system needs to prioritize: to focus on improving the overall accuracy, to try to correctly identify positive instances at the cost of some degradation in recognizing negative ones, or to look for a compromise. Ultimately, the choice depends on the purpose of the model and the requirements for it.

There are many classification methods, for example, naive Bayes classifiers, support vector machines, artificial neural networks, and others. Unlike other methods, fuzzy classification does not imply the existence of rigid boundaries between neighboring classes. An object being classified may belong to several classes with various degrees of confidence. The advantage of a fuzzy classifier is the understandability and interpretability of its rules, which makes fuzzy classifiers a practically useful data analysis tool.

Many real-world applications require a system that is both accurate and computationally simple. Therefore, we propose to use two procedures for constructing a fuzzy classifier. The first is to shrink the input feature space to reduce complexity. The second is to tune the fuzzy classifier parameters, which improves the quality of the predicted class labels. Since both procedures are formulated as optimization problems, a single optimization algorithm is applied to solve them. We use the gravitational search algorithm (GSA), which has previously performed well with a fuzzy classifier [6].

Since the goal of our work is to improve the efficiency of the fuzzy classifier of imbalanced data, it is necessary to choose an appropriate metric to use as a fitness function for the GSA. We explore the possibilities of applying the following metrics: the overall accuracy, the geometric mean, and a new function that combines the two previous estimates to find a compromise version of the classifier.

The main contributions of this paper are as follows.

- We propose a new metric based on the sum of the overall accuracy and the geometric mean of the per-class accuracies. A priority coefficient controls the relative weight of the two estimates.
- We demonstrate the use of a feature selection method based on the binary gravitational search algorithm to reduce the effect of imbalance on classification. Using the new metric as the fitness function helps to find subsets of features that are relevant to both classes.
- We present a combination of binary and continuous algorithms for constructing fuzzy classifiers of imbalanced data. The continuous gravitational search algorithm helps to increase the quality of classification on the selected features.

This article is organized as follows. Section 2 discusses the levels of problems that arise when working with imbalanced data and reviews basic methods for solving them. Section 3 describes the procedure for constructing a fuzzy classifier and the objective functions under consideration, and gives a short description of the gravitational search algorithm. Section 4 and Section 5 present the experimental results and their analysis, respectively. Finally, Section 6 presents the conclusions of our work.

## 2. Related Works

Here, we present the main approaches to improving the quality of imbalanced data classification. Training on such data poses problems at three levels: (1) problems associated with the definition of classification performance indexes, (2) problems related to the learning algorithm, and (3) problems related to the training data [7].

The first level is determined by the lack of an objective method (a quantitative measure) for evaluating existing knowledge in order to select the optimal classifier. The understanding that the overall accuracy is an insufficient measure for classifying imbalanced data has led to the application of new metrics such as the AUC (the area under the ROC curve) [8], the geometric mean, the balanced accuracy, the Fβ-measure, and others [9]. To assess the effectiveness of classifiers, the authors of [9] proposed 18 indicators, which fall into the following three categories:

- Threshold metrics geared towards minimizing the number of errors, such as the overall accuracy, the averaged accuracy (arithmetic and geometric), the Fβ-measure, and the Kappa statistic;
- Metrics based on a probabilistic understanding of error and used to assess the reliability of classifiers, such as the mean absolute error, the mean square error, and the cross-entropy;
- Metrics based on estimating instance separability, for example, the AUC, which is equivalent to the Mann–Whitney–Wilcoxon statistic [9] for two classes.

After analyzing the above 18 indicators, the authors of [7] conclude that the choice of metrics for imbalanced data is of paramount importance. Fernandez et al. [10] described the use of a multiobjective evolutionary algorithm with a pair of metrics, the overall accuracy and the F1 measure. They concluded that simultaneous optimization of this pair of metrics can lead to balanced accuracy for both classes.

At the level of the learning algorithm, changes are made to the construction and training processes of classification algorithms in order to reduce the influence of class imbalance on the classification quality [7].

Cost-sensitive learning methods are based on modifying the classification algorithm so that the cost of misclassifying instances of minority classes is greater than the cost of misclassifying instances of majority classes. A typical solution here is to use a weight matrix that takes into account the cost of each incorrectly classified instance [11]. This solution is not suitable for a fuzzy classifier since it does not estimate the probability of assigning an object to a particular class.

There are few methods for creating fuzzy classifiers in the presence of imbalance. Weights were added to fuzzy rules in [12,13,14]. Adding a weight function allows setting the priority of some rules over others when determining the output of the classifier. The weight values are most often tuned by optimization algorithms. Another way of modifying the fuzzy classifier is to introduce a bipolar model using the class labeling principle known as the maximum rule. The adjusted degree of belonging to each class is calculated based on the positive and negative degrees of membership in the bipolar fuzzy classifier [15]. The disadvantages of this model are the need to additionally introduce and adjust a matrix of dissimilarity coefficients and the difficulty of applying this method with another principle of assigning labels. Furthermore, adding such modifications to fuzzy systems complicates the interpretation of the resulting models. Consequently, methods that improve the quality of the classifier without interfering directly with the classification algorithm are relevant.

Data play an integral role in machine learning and data mining research. A number of data preprocessing methods have been developed in order to correct the imbalance in the data. Oversampling methods try to produce a balanced dataset by creating additional instances of the minority (positive) class, while undersampling methods reduce the number of majority class instances to achieve a quantitative balance. The best-known oversampling method is SMOTE and its modifications [5,16,17,18], in which the generation of new instances of the positive class depends on a measure of proximity to existing instances. Among undersampling methods, random undersampling (RUS) is often used. This non-heuristic method aims to eliminate class imbalance by randomly excluding instances of the negative class. The obvious disadvantage of RUS is the loss of information about the negative class [7,16,19].
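The two sampling strategies can be illustrated with a minimal sketch. It assumes the common SMOTE-style interpolation, in which a synthetic point is placed on the segment between a minority instance and one of its nearest minority neighbors. The function names, the neighbor count `k`, and the toy data are illustrative assumptions and are not taken from the cited implementations.

```python
import random


def nearest_neighbors(x, pool, k):
    """Return the k points of pool closest to x (squared Euclidean distance)."""
    others = [p for p in pool if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    return others[:k]


def oversample_minority(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating toward neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(x, minority, k))
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + t * (ni - xi) for xi, ni in zip(x, neighbor)])
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class (RUS)."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Toy data (hypothetical): 3 positive and 10 negative instances.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
majority = [[float(i), float(i % 3)] for i in range(5, 15)]

new_points = oversample_minority(minority, n_new=7)
kept = random_undersample(majority, n_keep=3)
```

Every synthetic point lies inside the bounding box of the minority class, which is why oversampling of this kind cannot add information beyond the existing instances, while undersampling discards seven of the ten majority instances in this toy case.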

Hybrid methods that combine the two previous strategies of adding and removing data instances are described in [20,21]. In order to preserve useful information about majority classes, clustering methods have recently been applied [7,22,23].

Preprocessing methods are universal and easy to apply but have low efficiency and cannot be used as the only tool for solving the class imbalance problem. In addition, creating new instances of data is not acceptable for some classification tasks. For example, the artificial creation of patient records can lead to errors in diagnosing diseases.

Another way to change the data in order to improve the recognition of minority classes is to carry out a procedure for selecting informative features. Feature selection consists of selecting, from the input feature space, a subset that has fewer attributes but provides classification accuracy comparable to that of the full set. The formed subset should be sufficient to adequately represent all classes in the training samples. Selection methods are usually divided into four types: integrated methods, filters, wrappers, and hybrid methods.

A peculiarity of the integrated (built-in) methods is that feature selection is part of the general mechanism of training a model on specific data [24]. An example of such methods is the selection of features during the training of a decision tree [25]. However, not every classification algorithm embeds the selection process into the learning process.

Filtering methods, on the contrary, are universal, as they are applied independently of the classifier at the data preparation stage. Four groups of filters are distinguished in [26]. The methods in the first group are based on distance: they select features that provide the greatest distance between classes. The second group of filters relies on the amount of information: such methods select features which, when added to an existing set, reduce its entropy [27]. The third group determines the relationship between features and classes using the correlation coefficient or mutual information [28]. The fourth group is represented by filters that minimize the number of inconsistent features. An inconsistency is the presence of two instances that belong to different classes but have the same values of the same features. Filter algorithms are easy to use but have low efficiency.

Wrappers are methods that evaluate each subset of features based on the effectiveness of the constructed classifier. As a search algorithm, they usually use metaheuristic algorithms. Since such algorithms are iterative, the classifier needs to be reconstructed after each iteration. Wrapper methods can require considerable time and resources for large datasets [24]. The advantage of wrappers is the ability to choose a set of features that will be optimal for a particular classification algorithm.

A method of applying a genetic algorithm for feature selection in the wrapper mode based on the SVM classifier is described in [29]. The fitness function of this algorithm is a measure consisting of a compromise between the geometric mean and the share of selected features. The results showed that the proposed method selects features that improve the recognition of minority classes.

Hybrid feature selection methods consist of a combination of filters and wrappers. First, a filter is used for preliminary selection, then a classifier is built on the resulting subset and a wrapper algorithm is launched [30]. This approach is described in [31], which uses symmetric uncertainty for filtering in order to weigh features relative to their dependence on class labels, and the harmony search as the wrapper algorithm. Hybrid selection methods can be a good solution for data with a large number of features.

## 4. Experimental Results

The experiment was performed on imbalanced binary datasets from the KEEL repository [40]. The datasets are described in Table 1. Here, F_{all} is the number of features in a dataset, Str_{all} is the total number of rows (instances), Str_{+} is the number of rows of the smallest class, Str_{-} is the number of rows of the largest class, and IR is the imbalance ratio. The imbalance ratio is the ratio of the number of rows of the negative class to the number of rows of the positive class.
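As a small worked example of the definition above, the imbalance ratio can be computed directly from the label column. The function name and the toy label counts are hypothetical and do not correspond to any dataset in Table 1.

```python
from collections import Counter


def imbalance_ratio(labels):
    """IR = rows of the largest (negative) class / rows of the smallest (positive) class."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())


# Hypothetical binary dataset: 90 negative rows and 10 positive rows.
labels = ["neg"] * 90 + ["pos"] * 10
ir = imbalance_ratio(labels)
```

Here `ir` equals 9.0, that is, there are nine negative rows for every positive row.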

Five-fold cross-validation was applied in all stages of the experiment. The data were divided into five pairs of training and test samples. The structure of the fuzzy classifier was formed by the algorithm based on the extreme values of classes with symmetric Gaussian terms. Since only two classes are represented in all data, the number of rules in all cases was two.
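To make the structure of such a two-rule classifier concrete, the following is a minimal sketch under stated assumptions: each class gets one rule, each feature of a rule is described by a symmetric Gaussian term centered at the midpoint between the class minimum and maximum, the spread is a fixed fraction of that range, the rule strength is the product of memberships, and the label of the strongest rule wins. These details, the `spread_factor` parameter, and all function names are illustrative assumptions, not the exact structure generation algorithm used in the experiments.

```python
import math


def build_rules(X, y, spread_factor=0.5):
    """One rule per class: a Gaussian term (center, sigma) for every feature."""
    rules = {}
    for label in sorted(set(y)):
        rows = [x for x, c in zip(X, y) if c == label]
        terms = []
        for j in range(len(X[0])):
            lo = min(r[j] for r in rows)  # class minimum of feature j
            hi = max(r[j] for r in rows)  # class maximum of feature j
            center = (lo + hi) / 2.0
            sigma = max((hi - lo) * spread_factor, 1e-6)  # avoid zero width
            terms.append((center, sigma))
        rules[label] = terms
    return rules


def membership(value, center, sigma):
    """Symmetric Gaussian membership function."""
    return math.exp(-((value - center) ** 2) / (2.0 * sigma ** 2))


def firing_strength(x, terms):
    """Product of the memberships of all features in one rule."""
    strength = 1.0
    for value, (center, sigma) in zip(x, terms):
        strength *= membership(value, center, sigma)
    return strength


def classify(x, rules):
    """Return the label of the rule with the highest firing strength."""
    return max(rules, key=lambda label: firing_strength(x, rules[label]))


# Toy two-class data (hypothetical).
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.3], [0.9, 0.8], [1.0, 1.0]]
y = [0, 0, 0, 1, 1]
rules = build_rules(X, y)
prediction = classify([0.95, 0.9], rules)
```

In this sketch, the centers and spreads of the Gaussian terms are the parameters that a continuous optimizer would later tune, and dropping a feature simply removes its term from every rule.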

In the first stage of the experiment, the efficiency of the continuous gravitational algorithm was tested for different values of the priority coefficient γ in the fitness function. The tuning of the fuzzy classifier parameters was carried out on full sets of features. The following parameters were set for the GSA_{c}: 750 iterations, 15 particles, G_{0} = 10, α = 10, and ε = 0.01. The particle population was cleared after every 150 iterations, except for the best particle, from which the population was generated anew. The parameters were chosen empirically as the most universal for the selected datasets.
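The role of these parameters can be seen in a compact sketch of a continuous gravitational search. It assumes the commonly used formulation with an exponentially decaying gravitational constant G(t) = G_{0}·exp(−α·t/T), masses obtained from normalized fitness, and ε as a small constant in the distance denominator. The restart rule (redrawing the population around the best particle with Gaussian noise), the omission of the usual "K best" attractor set, and all names are simplifying assumptions, not the exact GSA_{c} used in the experiments.

```python
import math
import random


def gsa_maximize(fitness, dim, bounds, n_particles=15, n_iter=750,
                 g0=10.0, alpha=10.0, eps=0.01, restart_every=150, seed=0):
    """Sketch of a continuous GSA that maximizes `fitness` inside box bounds."""
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(v):
        return min(max(v, lo), hi)

    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_x, best_f = None, -math.inf

    for t in range(1, n_iter + 1):
        fit = [fitness(p) for p in pos]
        i_best = max(range(n_particles), key=fit.__getitem__)
        if fit[i_best] > best_f:
            best_x, best_f = list(pos[i_best]), fit[i_best]

        if restart_every and t % restart_every == 0:
            # Assumed restart rule: keep the best particle, redraw the rest near it.
            width = 0.1 * (hi - lo)
            pos = [list(best_x)] + [
                [clip(b + rng.gauss(0.0, width)) for b in best_x]
                for _ in range(n_particles - 1)
            ]
            vel = [[0.0] * dim for _ in range(n_particles)]
            continue

        worst_f, top_f = min(fit), max(fit)
        span = top_f - worst_f
        raw = [(f - worst_f) / span if span > 0 else 1.0 for f in fit]
        total = sum(raw)
        mass = [m / total for m in raw]  # normalized masses, better = heavier
        g = g0 * math.exp(-alpha * t / n_iter)  # decaying gravitational constant

        acc = [[0.0] * dim for _ in range(n_particles)]
        for i in range(n_particles):
            for j in range(n_particles):
                if i == j:
                    continue
                dist = math.dist(pos[i], pos[j])
                pull = rng.random() * g * mass[j] / (dist + eps)
                for d in range(dim):
                    acc[i][d] += pull * (pos[j][d] - pos[i][d])

        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = rng.random() * vel[i][d] + acc[i][d]
                pos[i][d] = clip(pos[i][d] + vel[i][d])

    return best_x, best_f


def toy_fitness(x):
    """Hypothetical objective with its maximum (0) at x = (0.5, 0.5)."""
    return -sum((xi - 0.5) ** 2 for xi in x)


best_x, best_f = gsa_maximize(toy_fitness, dim=2, bounds=(0.0, 1.0))
```

In the setting of this section, the position vector would hold the parameters of the fuzzy terms and `fitness` would be the chosen classification metric evaluated on the training sample.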

Table 2 contains the results of the first experimental stage. The quality of the constructed model was assessed by the classification accuracy, the geometric mean, the true positive rate (the percentage of correctly classified instances of the positive class relative to the total number of instances of the positive class), and the true negative rate (the percentage of correctly classified instances of the negative class relative to the total number of instances of the negative class). The table shows the results obtained on the test data as the average of three runs (Avr.) and the best run (Best).
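The four quality indexes can be computed from the confusion counts as in the following sketch. For two classes, the geometric mean is taken here as the square root of the product of the true positive rate and the true negative rate. The function names and the toy predictions are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def quality_indexes(y_true, y_pred, positive=1):
    """Overall accuracy, TP rate, TN rate, and their geometric mean."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    tp_rate = tp / (tp + fn)
    tn_rate = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tp_rate": tp_rate,
        "tn_rate": tn_rate,
        "g_mean": math.sqrt(tp_rate * tn_rate),
    }


# Toy test sample (hypothetical): 2 positive and 8 negative instances.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = quality_indexes(y_true, y_pred)
```

In this toy case the accuracy is 0.8 while the true positive rate is only 0.5, which illustrates why the overall accuracy alone can hide poor recognition of the positive class.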

The purpose of the second experimental stage was to verify the effectiveness of the GSA for feature selection in the wrapper mode for the fuzzy classifier of imbalanced data. The binary gravitational algorithm with the same coefficient γ was run three times on each sample. Due to the stochasticity of the algorithm, one to three different feature sets could be obtained on the same sample. Next, the set of features with the highest fitness function value was selected. A classifier was built on this set, and the parameters of the created model were tuned by the continuous algorithm. The obtained values of the quality indicators were averaged over three independent runs of the GSA_{c}.
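The binary search operates on feature masks rather than real-valued positions. A minimal sketch of one position update is given below, assuming the transfer function |tanh(v)| that is common in binary GSA formulations: each bit is flipped with a probability that grows with the magnitude of its velocity. Whether the experiments use exactly this transfer function is an assumption, and the helper names are hypothetical.

```python
import math
import random


def flip_probability(velocity):
    """Transfer function mapping a velocity to a bit-flip probability."""
    return abs(math.tanh(velocity))


def update_mask(mask, velocities, rng):
    """Flip each bit with probability |tanh(v)|; keep at least one feature."""
    new_mask = [
        1 - bit if rng.random() < flip_probability(v) else bit
        for bit, v in zip(mask, velocities)
    ]
    if not any(new_mask):
        new_mask[rng.randrange(len(new_mask))] = 1
    return new_mask


def select_columns(X, mask):
    """Project the data onto the features whose mask bit is 1."""
    return [[value for value, keep in zip(row, mask) if keep] for row in X]


rng = random.Random(0)
mask = [1, 0, 1, 1]
velocities = [0.0, 50.0, 0.0, 0.0]  # only the second bit has a large velocity
new_mask = update_mask(mask, velocities, rng)
reduced = select_columns([[1, 2, 3, 4]], [1, 0, 0, 1])
```

In the wrapper mode, each candidate mask would be scored by building the classifier on `select_columns(X, mask)` and evaluating the fitness function on the result.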

The following parameters were empirically selected for the binary gravitational algorithm: 750 iterations, 15 particles, G_{0} = 10, α = 10, and ε = 0.01. The parameters of the continuous algorithm did not differ from those used at the first stage of the experiment.

Table 3 shows the results of the classifier on the selected feature sets before parameter tuning (GSA_{b}) and after optimization (GSA_{b} + GSA_{c}). In this and the following tables, cells are formatted according to a color scale to visualize the results. The values in each row were compared with each other, and the hue of a cell depends on the relative magnitude of its value compared to the other cells in the row. Thus, the worst results are marked in red, the best are highlighted in green, and the remaining values are shown in intermediate colors.

Table 4 shows fuzzy classifiers based on the best feature sets. The best sets here are those that achieve the highest value of the objective function, averaged over five samples, for a given value of γ.

## 5. Discussion

To confirm the effectiveness of the gravitational algorithm for optimizing the fuzzy classifier of imbalanced data, we performed a five-stage comparison.

The task of the first stage was to check the quality of the fuzzy classifier in the presence of feature selection. For this purpose, we compared the results of fuzzy classifiers constructed on complete datasets (Table 2, average values for three runs) with those built on reduced sets of features (Table 3). In both cases, the results obtained after parameter tuning by the GSA_{c} were taken into account.

Table 5 shows the results of the pairwise comparison of the number of features by the Wilcoxon signed-rank test for paired samples. The significance level is 0.05; the null hypothesis states that the median of the differences between the two samples is zero.
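The test can be illustrated with a short sketch that computes the signed-rank statistic, its standardized form under the normal approximation, and a two-sided p-value. Zero differences are dropped, tied absolute differences receive average ranks, and no tie correction is applied to the variance. The feature counts below are hypothetical and are not taken from Table 5, and the normal approximation is rough for so few pairs.

```python
import math


def signed_rank_test(a, b):
    """Wilcoxon signed-rank test for paired samples (normal approximation)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # shared rank for a group of ties
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    sts = (w_plus - mean) / std  # standardized test statistic
    p_value = math.erfc(abs(sts) / math.sqrt(2.0))  # two-sided
    return w_plus, sts, p_value


# Hypothetical feature counts for six datasets: full set vs. selected subset.
f_all = [7, 9, 13, 8, 10, 6]
f_bin = [3, 4, 5, 6, 4, 5]
w_plus, sts, p_value = signed_rank_test(f_all, f_bin)
reject_null = p_value < 0.05
```

With these toy counts every difference is positive, so the statistic takes its maximum value and the null hypothesis of a zero median difference is rejected at the 0.05 level.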

The first three rows of the table are the comparison of the number of features in the original set (F_{all}) and in the selected feature sets (F_{bin}). The last three rows are the comparison of the number of features when using the GSA_{b} with different values of the coefficient γ in the fitness function.

On the basis of the test results, we conclude that the binary gravitational algorithm can significantly reduce the number of features when working with imbalanced data in the wrapper mode for the fuzzy classifier. In addition, there is no significant difference in the number of features between the different values of γ.

Table 6 shows the results of comparing the performance indexes for classifiers built on complete and selected sets of features when changing the priority coefficient γ in the fitness function. The obtained values of the Wilcoxon signed-rank test are grouped for each of the four quality indexes (the overall accuracy, the geometric mean, the percentage of correctly classified instances of the positive class, and the percentage of correctly classified instances of the negative class).

Thus, the results of the first stage of the comparison show that the use of the GSA_{b} for selecting features in the wrapper mode for the fuzzy classifier of imbalanced data significantly reduces the number of features while maintaining or increasing the quality of classification.

In the second stage, the effectiveness of the binary gravitational algorithm was tested in comparison with popular methods of selecting features. We used a random search (RS) and a filtering algorithm based on mutual information (MI).

The filter was executed as follows. The value of mutual information was calculated for each feature with three randomly-selected neighbors. Next, the algorithm found the arithmetic mean of these values. The set of selected features included only those variables whose mutual information exceeded the arithmetic mean. Both algorithms were run three times, and among the obtained feature sets, those with the best accuracy were selected. Fuzzy classifiers were constructed on the selected feature sets using the algorithm based on the extreme values of classes. The obtained values were compared with the results of fuzzy classifiers built on the feature sets found by the GSA_{b} (Table 3). In this case, we considered the results without parameter optimization. The average performance indexes of the classifiers are given in Table 7 (F is the number of features).
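The thresholding step of the filter can be sketched as follows. Note the swapped-in estimator: instead of the neighbor-based estimate of mutual information mentioned above, this sketch uses a simple plug-in estimate for discrete features, which is an assumption made only to keep the example self-contained. The feature values and names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Plug-in estimate of mutual information (in nats) for discrete values."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    return sum(
        (c / n) * math.log((c * n) / (feature_counts[f] * label_counts[l]))
        for (f, l), c in joint.items()
    )


def select_above_mean(scores):
    """Keep the indices of features whose score exceeds the arithmetic mean."""
    mean_score = sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s > mean_score]


# Hypothetical discrete features for six instances (4 negative, 2 positive).
labels = [0, 0, 0, 0, 1, 1]
features = [
    [0, 0, 0, 0, 1, 1],  # perfectly aligned with the class label
    [0, 1, 0, 1, 0, 1],  # independent of the class label
    [0, 0, 1, 1, 1, 1],  # partially informative
]
mi_scores = [mutual_information(f, labels) for f in features]
selected = select_above_mean(mi_scores)
```

In this toy case only the first feature exceeds the mean score, so the filter keeps a single feature.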

Table 8 demonstrates the results of a pairwise comparison of the performance indexes of the obtained systems by the Wilcoxon signed-rank test for paired samples. Here, STS is the standardized test statistic, p is the p-value, and NH is the null hypothesis. The left half of Table 8 shows the results of the comparison with the random search algorithm, and the right half shows the comparison with the filter based on mutual information.

The algorithms are statistically indistinguishable in terms of the number of selected features. However, the value of the standardized test statistic shows that fuzzy classifiers constructed on the features selected by the gravitational search algorithm have higher classification quality in most cases. Hence, the binary gravitational algorithm is preferable to the random search and the filter based on mutual information for imbalanced data classification.

In the third stage of the comparison, we compared our results with fuzzy classifiers based on imbalanced data preprocessed by the SMOTE algorithm. We used an implementation of the algorithm from the open library [41], and all parameters were left at their default values. After applying SMOTE, the number of instances of the positive and negative classes was equal. Next, we conducted five-fold cross-validation. Fuzzy classifiers were constructed with the algorithm based on the extreme values of the classes. No feature selection was performed.

Table 9 presents the results of fuzzy classifiers averaged over five samples.

We compared the obtained results with the results demonstrated in Table 2, where fuzzy classifiers were constructed on complete sets of imbalanced data and optimized by the continuous GSA. The values of the Wilcoxon test for the third stage are presented in Table 10.

The comparison shows that fuzzy classifiers constructed on the original datasets and tuned by the GSA_{c} demonstrate better overall accuracy than fuzzy classifiers built on oversampled data, with comparable recognition quality of the positive class. Therefore, if the classification task requires not only correct classification of the positive class but also no large losses in the recognition of the negative class, then a fuzzy classifier with parameters tuned by the GSA_{c} is the preferable tool.

At the next stage of the comparison, feature selection was carried out on the oversampled data. Table 11 presents the results of fuzzy classification averaged over five samples on subsets of features obtained by the random search algorithm. Table 12 presents the values of the performance indexes obtained after selecting features by the filter based on mutual information.

We compared these values with the results of constructing fuzzy classifiers with feature selection and parameter tuning using the GSA on the initial datasets (Table 3). Table 13 shows the results of the comparison by the Wilcoxon test.

The results demonstrate that fuzzy classifiers optimized by the gravitational search algorithm show better results than fuzzy classifiers constructed on selected sets of features after data oversampling with SMOTE.

The last stage of the comparison was to check the effectiveness of the fuzzy classifier using the GSA for selecting features and tuning parameters relative to state-of-the-art classification algorithms. Using the open sklearn library, the following classifiers were built on complete datasets: Gaussian naive Bayes (GNB), logistic regression classifier (LR), decision tree classifier (DT), multilayer perceptron classifier (MLP), linear support vector classifier (LSV), K-nearest neighbors classifier with k = 3 (3NN), AdaBoost classifier (AB), random forest classifier (RF), and gradient boosting for classification (GB) [42]. All algorithm parameters were left at their default values.

Table 14 contains the results of constructing various classifiers on the selected datasets. The last three columns show the fuzzy classifiers from Table 4.

The obtained values were compared using the Wilcoxon signed-rank test for paired samples (Table A1, Table A2, Table A3 and Table A4). The fuzzy classifier demonstrates results comparable with those of its counterparts in terms of the overall accuracy and the geometric mean but uses fewer features. It shows the best results for the TPrate value when the coefficient γ is equal to one. When the coefficient γ is equal to 0.5, the fuzzy classifier shows results statistically comparable with those of its counterparts in terms of TPrate and is inferior to only three algorithms in terms of TNrate.

Thus, if the chosen priority coefficient γ is zero, the proposed metric represents the overall accuracy. Then the classifier focuses on recognizing a negative class, and as a result, the model has a low value of the Type I error, but a high value of the Type II error.

In the case when γ is equal to 1, the function will be identical to the geometric mean. Then, the efficiency of the fuzzy classifier with respect to the positive class will increase. As a result, the Type II error will decrease, but the Type I error can increase significantly.

When a coefficient γ close to 0.5 is used, a system with low values of both errors simultaneously will be obtained. The proposed metric can be useful for such data as vowel0, ecoli4, and yeast4, where a high-quality classification of one class can lead to large losses in the ability of the model to recognize the other class.
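The behavior described for the three values of γ can be checked with a minimal sketch. It assumes that the metric is a convex combination of the overall accuracy and the geometric mean, which is consistent with the limiting cases γ = 0 and γ = 1 stated above. The exact form of the function is an assumption here, and the function name and the two example classifiers are hypothetical.

```python
def combined_fitness(accuracy, g_mean, gamma):
    """Assumed form: (1 - gamma) * accuracy + gamma * geometric mean."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (1.0 - gamma) * accuracy + gamma * g_mean


# Hypothetical classifiers on a strongly imbalanced dataset:
# A favors the negative class, B is more balanced between the classes.
classifier_a = {"accuracy": 0.95, "g_mean": 0.60}
classifier_b = {"accuracy": 0.90, "g_mean": 0.85}

for gamma in (0.0, 0.5, 1.0):
    score_a = combined_fitness(classifier_a["accuracy"], classifier_a["g_mean"], gamma)
    score_b = combined_fitness(classifier_b["accuracy"], classifier_b["g_mean"], gamma)
    winner = "A" if score_a > score_b else "B"
    print(f"gamma={gamma}: A={score_a:.3f}, B={score_b:.3f}, preferred={winner}")
```

Under this assumed form, γ = 0 prefers classifier A because only the overall accuracy counts, while γ = 0.5 and γ = 1 prefer the more balanced classifier B.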