5.2. Comparison with the Other Approaches
BSSO was compared with other representative methods: wrapper feature selection based on the binary spider monkey algorithm (BSMA) [30], the binary gravitational search algorithm (BSGA) [37], the binary brain storm optimization algorithm (BBSO) [52], and the random search algorithm (RS); a feature selection algorithm based on mutual information (IG) [53]; and an algorithm without feature selection (All features), on 30 well-known benchmark datasets.
The averaged error rates were then used as the fitness values of the corresponding feature subsets. Equation (1) evaluates the best feature subset by applying the fuzzy classifier, without optimizing the classifier's parameters, to obtain the classification error rate.
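The evaluation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fuzzy classifier of Equation (1) is not reproduced here, so a 1-nearest-neighbour error rate stands in for it, and the 70/30 splits and run count are placeholder assumptions.

```python
import numpy as np

def nn_error(Xtr, ytr, Xte, yte):
    """Error rate of a 1-nearest-neighbour stand-in classifier."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    pred = ytr[d.argmin(axis=1)]
    return float((pred != yte).mean())

def subset_fitness(X, y, mask, error_fn=nn_error, n_runs=5, seed=0):
    """Wrapper fitness of a binary feature mask: the classification
    error rate averaged over several randomized train/test splits."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 1.0                      # empty subset: worst possible fitness
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        split = int(0.7 * len(y))       # assumed 70/30 hold-out split
        tr, te = idx[:split], idx[split:]
        errors.append(error_fn(X[tr][:, cols], y[tr], X[te][:, cols], y[te]))
    return float(np.mean(errors))
```

Lower fitness values (lower averaged error) indicate better feature subsets, which is what the wrapper search minimizes.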
For each dataset, experiments examining the feature selection performance of each algorithm were conducted over 30 independent runs. Table 3 shows the experimental results of the feature selection methods: for each method, the table presents the average classification accuracy on the training set (Learn), the average classification accuracy on the test set (Test), and the average number of selected features (F'). For the wrapper methods, the symbol S denotes the use of an S-shaped transfer function, and the symbol V denotes the use of a V-shaped transfer function.
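The two transfer-function families can be sketched as follows. This is the standard construction used to binarize continuous positions in binary metaheuristics; the paper's exact S and V variants are not reproduced here, so the specific formulas (sigmoid and |tanh|) are assumptions.

```python
import math, random

rng = random.Random(0)

def s_shaped(x):
    """S-shaped (sigmoid) transfer: interpreted as P(bit = 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def v_shaped(x):
    """V-shaped transfer: interpreted as P(flip the current bit)."""
    return abs(math.tanh(x))

def binarize_s(position):
    """Set each bit to 1 with probability given by the S-shaped function."""
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]

def binarize_v(position, bits):
    """Flip each current bit with probability given by the V-shaped function."""
    return [(1 - b) if rng.random() < v_shaped(x) else b
            for x, b in zip(position, bits)]
```

The key design difference is that S-shaped functions map a position component directly to the probability of selecting a feature, while V-shaped functions map it to the probability of changing the current selection.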
A statistical significance test (the related-samples Wilcoxon signed-rank test) was carried out to assess the classification performance of the methods. This test is considered "safer" because it does not assume a normal distribution, and outliers have less effect on the result [12]. The purpose of the Wilcoxon test is to determine whether the results yielded by two methods differ significantly (i.e., whether the null hypothesis can be rejected). The null hypothesis was that different feature selection methods generate similar results, i.e., that the median of the differences between the methods equals zero. The null hypothesis is rejected (the p-value is less than or equal to the significance level) if the differences between the methods are significant. The significance level in the Wilcoxon test was set to 0.05. Table 4 shows the results of the Wilcoxon test for the pairwise comparison of BSSO with the different feature selection methods and with the algorithm without feature selection. The BSSO algorithm showed better accuracy and fewer features than the RS algorithm, and these differences are statistically significant. The BSSO algorithm showed better accuracy and fewer features than the IG algorithm; the differences in accuracy are statistically significant, while the differences in the number of features are not. The BSSO algorithm also showed better accuracy and fewer features (with the exception of two cases) than the other metaheuristics, although these differences are not statistically significant.
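A pairwise comparison of this kind can be reproduced with SciPy's related-samples Wilcoxon signed-rank test. The per-dataset accuracy values below are illustrative placeholders, not the paper's measurements.

```python
from scipy.stats import wilcoxon

# Illustrative paired test accuracies for two methods over the same
# benchmark datasets (NOT the paper's results).
acc_bsso = [0.91, 0.88, 0.95, 0.83, 0.90, 0.87, 0.93, 0.85, 0.89, 0.92]
acc_rs   = [0.88, 0.85, 0.94, 0.80, 0.86, 0.86, 0.90, 0.82, 0.88, 0.89]

# Null hypothesis: the median of the paired differences is zero.
stat, p = wilcoxon(acc_bsso, acc_rs)
alpha = 0.05
print(f"statistic={stat}, p-value={p:.4f}, significant={p <= alpha}")
```

Because the test ranks the paired differences rather than the raw values, it requires no normality assumption, which is the property the text refers to.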
Feature selection is a binary optimization problem, and the No Free Lunch theorem states that no single algorithm gives the best results for all optimization problems [54], which is why new optimization algorithms continue to be developed. It was noted in [6] that there is no single best method for feature selection, and the researcher should focus on finding a good method for each specific problem.
The time taken to execute the proposed BSSO is given in Table 5, in comparison with the time required to execute IG and RS. All these methods were executed on the same machine, with an Intel Core i5-3570 CPU at 3.40 GHz and 8 GB of RAM. C# was used as the programming language.
As can be seen from Table 6, IG achieved the best execution time, mainly due to its structure, in which the classifier is absent. The binary SSO produced results comparable with RS. However, BSSO's superior accuracy in finding optimal solutions relative to RS and IG compensates for its computational inefficiency.
Based on the results of the experiment, the authors developed a relationship between execution time and number of attributes, number of instances, and number of classes using multiple linear regression. Regression finds the target function fitting the input data with minimum error. The regression equation is as follows:
tBSSO = −285.890 + 60.487·NoCl + 7.682·NoFe + 0.022·NoEx,

where tBSSO is the execution time of BSSO, NoCl is the number of classes, NoFe is the number of features, and NoEx is the number of examples.
In our example, the fitted model has a coefficient of determination of R2 = 0.782, which indicates that the model describes the data well. Table 7 lists the estimated coefficients of the multiple linear regression model together with the significance levels, t-statistics, and confidence intervals.
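A fit of this form can be sketched with ordinary least squares, which is the method behind such a regression. The dataset below is synthetic, generated from the reported coefficients plus noise purely for illustration; it is not the paper's timing data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
NoCl = rng.integers(2, 11, n)        # number of classes (synthetic)
NoFe = rng.integers(5, 100, n)       # number of features (synthetic)
NoEx = rng.integers(100, 10000, n)   # number of examples (synthetic)

# Synthetic execution times built from the reported coefficients + noise.
t = (-285.890 + 60.487 * NoCl + 7.682 * NoFe + 0.022 * NoEx
     + rng.normal(0, 50, n))

# Design matrix with an intercept column; solve by ordinary least squares.
X = np.column_stack([np.ones(n), NoCl, NoFe, NoEx])
coef, *_ = np.linalg.lstsq(X, t, rcond=None)

# Coefficient of determination R^2 of the fitted model.
pred = X @ coef
r2 = 1 - np.sum((t - pred) ** 2) / np.sum((t - t.mean()) ** 2)
print("coefficients:", coef, "R^2:", r2)
```

On real timing data the residual noise is larger, which is consistent with the reported R2 = 0.782 rather than a near-perfect fit.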
For three datasets, Optdigits, Spambase, and Coil2000, three feature selection algorithms, IG, RS, and BSSO, were run, and the number of features selected by each algorithm was recorded. Then a fuzzy classifier without parameter optimization was applied to each newly obtained dataset containing only the selected features. Figure 1, Figure 2 and Figure 3 show the average classification rates of the methods on the testing partitions versus the number of selected features. Figure 1, Figure 2 and Figure 3 show that the suggested method exhibits the best performance. Based on the foregoing, we can draw the following conclusions.
On some datasets, the proposed feature selection method makes it possible to obtain a classification rate exceeding 90%, which indicates that the method is effective at reducing the amount of data to be processed.
The classification rate increases with the number of selected features; however, once the number of selected features reaches a certain value, the classification rate begins to decrease.
The proposed method makes it possible to select the optimal features.