Feature Selection Based on Swallow Swarm Optimization for Fuzzy Classification

This paper concerns several important topics of the Symmetry journal, namely, pattern recognition, computer-aided design, diversity, and similarity. We also take advantage of the symmetric structure of a membership function. Searching for the (sub)optimal subset of features is an NP-hard problem. In this paper, a binary swallow swarm optimization (BSSO) algorithm for feature selection is proposed. To solve the classification problem, we use a fuzzy rule-based classifier. To evaluate the feature selection performance of our method, BSSO is compared to induction without feature selection and to several similar algorithms on well-known benchmark datasets. Experimental results show the promising behavior of the proposed method in the optimal selection of features.


Introduction and Literature Review
Feature selection implies extracting a subset of features from an initial set such that the selected features are fully relevant to the problem at hand. Feature selection makes it possible to (1) avoid overfitting, (2) reduce the amount of data to be analyzed, (3) improve the accuracy of classification, (4) eliminate irrelevant and "noisy" features, (5) improve the interpretability of results, and (6) visualize data [1,2].
An increase in the number of features causes a decrease in the performance of training algorithms, which contradicts the intuitive notion that a larger number of features provides more information and improves the accuracy of classification. The reason is that, with an increasing number of features, it is also necessary to increase the amount of training data required to generate classification rules. These rules determine the relationship between features and class labels. In addition, features that do not contain information about class labels can reduce classification accuracy and slow down the training process [3].

Feature selection methods are divided into three groups: filters, wrappers, and embedded methods [4-6]. Filters are based on the generalized properties of training data and do not use any classifier construction algorithm in the process of feature selection. The advantages of this approach are relatively low computational complexity, sufficient generalization capability, and independence from the classifier. Its main disadvantage is that features are often selected independently [6]. Wrappers include the classifier construction procedure in the feature selection process and use the prognostic accuracy of the classifier to estimate the selected subset of features. The interaction with the classifier generally yields better results as compared to filters; however, it increases the computational complexity of the method, and there is a risk of overfitting [6]. Embedded methods select features in the process of training and integrate the feature selection procedure into the classifier construction algorithm. In [7], a hybrid method that combines filters, wrappers, and feature weighting was proposed.
In [6], it was noted that there is no best method for feature selection, and the researcher should focus on finding a good method for each particular problem.
The feature selection problem can be modeled as a binary optimization problem [8], which is NP-hard [4]. Its optimal solution is guaranteed only by exhaustive search. Metaheuristic methods allow one to find sub-optimal solutions of this problem without exploring the entire solution space. However, most metaheuristics were designed for a continuous search space. A binary search space poses the problem of discontinuity and non-differentiability, which makes it difficult to use classical deterministic optimization methods [9]. In binary problems, it is required to reduce the number of possible states to binary solutions. There are many binarization methods; below, we discuss the main ones.
In this paper, we present a novel method for feature selection based on binary swallow swarm optimization (BSSO). The binarization of the continuous swallow swarm metaheuristic is carried out using a special function.
The main contributions of this paper are as follows:
1. The original swallow swarm algorithm was designed for continuous optimization; for the first time, we offer a binary version of swallow swarm optimization for solving binary optimization problems.
2. A novel feature selection method based on binary swallow swarm optimization is proposed. This is the first work applying binary swallow swarm optimization to feature selection.
3. Using the construction of fuzzy rule-based classifiers as an example, the proposed method is compared with wrapper feature selection based on other metaheuristics, a feature selection algorithm based on mutual information, and an algorithm without feature selection.
4. The Wilcoxon signed-rank test is used to evaluate the proposed method.
5. A multiple regression equation is found that reflects the relationship between the BSSO runtime and the number of features, the number of instances, and the number of classes.
The paper is organized as follows. Section 1 overviews some related works. In Section 2, a fuzzy rule-based classifier is presented. Section 3 describes an algorithm for generating the fuzzy rule base. Feature selection based on the proposed method is discussed in Section 4. In Section 5, the experimental setups and results are presented. Finally, Section 6 provides the conclusions.

Literature Review
In this section, we discuss some works related to our research, which address the binarization of the metaheuristics designed for continuous optimization.
The genetic algorithm (GA) is one of the well-known tools for binary optimization and feature selection. GA solutions are represented as a set of binary vectors over which selection, crossover, and mutation are performed [10-13].
Harmony search uses a special type of memory that stores harmony solutions; a harmony is a vector of binary values. In [14], we investigated two methods for forming a new harmony: by inverting the current bit and by replacing the current bit with the corresponding bit of the best harmony available in the harmony memory. Both methods were used in the fuzzy classifier for feature selection. In [15], harmony search was employed to select features from multidimensional unbalanced datasets.
The ant colony optimization algorithm [16] was proposed for solving combinatorial optimization problems. This algorithm is also employed for feature selection: in [17], for a Takagi-Sugeno fuzzy classifier; in [18], for a classifier based on a back propagation neural network; in [19], for C4.5, naive Bayes, and k-nearest neighbor classifiers; and in [20], for support vector machine, linear discriminant analysis, random forest, k-nearest neighbors, and decision tree classifiers.
However, most metaheuristics were designed for searching in a continuous space; hence, to solve binary optimization problems, they need to be modified. Below, we describe the most popular methods used to binarize continuous metaheuristics without affecting the principles of the search process.

Quantum Methods
In [21], the problems of feature selection and parameter optimization for neural networks were solved using a quantum particle swarm optimization (PSO) method, with features being represented as quantum bits (Q-bits). The principles of quantum superposition and quantum probability were employed to speed up the search for the optimal set of features.
Researchers in [22] proposed a hybrid swarm intelligence algorithm based on quantum computations and a combination of the firefly algorithm and PSO for feature selection. Quantum computations provided a good tradeoff between the intensification and diversification of search, while the combination of the firefly algorithm and PSO made it possible to efficiently investigate the generated subsets of features. To evaluate the relevance of these subsets, rough set theory was employed.
Han and Kim in [23] described a quantum evolutionary algorithm based on the concepts and principles of quantum computing. The algorithm represents a solution as a Q-bit chromosome and uses a quantum operator to update it (this operator is a modified version of the logical rotation operator).
In [24], to solve binary optimization problems, a quantum gravitational search algorithm was designed by supplementing the basic structure of the gravitational search procedure with the following elements of quantum computing: quantum bit, quantum measurements, superposition, and modified quantum rotation operators.

Modified Algebraic Operations
Yuan et al. in [25] described a binary PSO algorithm; it has all the basic characteristics of classical PSO, with the only difference being a zero inertia coefficient. To evaluate the velocity of particles, subtraction was replaced by the XOR operator, arithmetic multiplication was replaced by conjunction (AND), and addition was replaced by disjunction (OR). To compute a new position of the particle, addition was replaced by the XOR operator. The same logical operators were employed in [26] to discretize the swallow swarm optimization algorithm when solving the traveling salesman problem.
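To make the operator substitution concrete, the following minimal Python sketch shows one velocity/position step of a logical-operator binary PSO in the spirit of [25]; the function names and the exact placement of the random bit masks are our illustrative assumptions, not the authors' code:

```python
import random

def xor(a, b):
    # elementwise XOR, replacing arithmetic subtraction/addition
    return [x ^ y for x, y in zip(a, b)]

def band(a, b):
    # elementwise AND, replacing arithmetic multiplication
    return [x & y for x, y in zip(a, b)]

def bor(a, b):
    # elementwise OR, replacing arithmetic addition in the velocity
    return [x | y for x, y in zip(a, b)]

def update_particle(pos, pbest, gbest):
    """One step with zero inertia:
    v = (r1 AND (pbest XOR pos)) OR (r2 AND (gbest XOR pos)),
    new position = pos XOR v."""
    d = len(pos)
    r1 = [random.randint(0, 1) for _ in range(d)]
    r2 = [random.randint(0, 1) for _ in range(d)]
    v = bor(band(r1, xor(pbest, pos)), band(r2, xor(gbest, pos)))
    return xor(pos, v)
```

Note that when the particle already coincides with both its personal and global best, both XOR terms vanish and the particle stays put, mirroring the zero-inertia behavior described above.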
In [27], four logical operators (OR, AND, XOR, and NOT) were analyzed as applied to the binarization of the artificial bee colony algorithm, and only two of them (XOR and NOT) were found useful.
Korkmaz and Kiran [28] modified the artificial algae algorithm to solve the binary optimization problem.The modification involved the introduction of the XOR operator and the stigmergic update rule.Stigmergic behavior was implemented using two new counters affected by the results of the XOR operator and artificial agents in the algorithm.
In [29], binarization of the continuous spider monkey algorithm was performed using the logical operators AND, OR, and XOR. The proposed algorithm is designed for the thinning of concentric circular antenna arrays. In [30], the binary spider monkey algorithm is used for feature selection for a fuzzy classifier.
In the binary artificial bee colony algorithm called DisABC [31], the arithmetic difference operator of the continuous algorithm was replaced by a measure of difference between binary vectors, expressed in terms of Jaccard's coefficient. A specific property of this measure is that, even in a continuous space, its result can be used to generate a new binary solution vector. To improve the efficiency of DisABC, it was supplemented with a local search procedure. This procedure changes the value of a 0-valued bit in the binary vector to 1 while simultaneously setting a 1-valued bit to 0, so the number of 1-valued bits in the vector remains unchanged [31].
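The two DisABC ingredients mentioned above can be sketched in a few lines of Python; the function names are ours, and the local move is shown in its simplest random form:

```python
import random

def jaccard_dissimilarity(a, b):
    """1 - Jaccard similarity between two binary vectors, the
    measure DisABC uses in place of the arithmetic difference."""
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    denom = m11 + m10 + m01
    return 0.0 if denom == 0 else 1.0 - m11 / denom

def local_swap(v):
    """DisABC-style local move: set one 0-bit to 1 and one 1-bit
    to 0, keeping the number of 1-valued bits unchanged."""
    ones = [i for i, b in enumerate(v) if b == 1]
    zeros = [i for i, b in enumerate(v) if b == 0]
    if not ones or not zeros:
        return v[:]
    w = v[:]
    w[random.choice(ones)] = 0
    w[random.choice(zeros)] = 1
    return w
```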

Transfer Functions
In binary metaheuristics, transfer functions are responsible for transforming the continuous search space into a discrete one. In 1997, Kennedy and Eberhart were the first to apply a sigmoid transfer function in the binary version of PSO [32]. In [33], six transfer functions divided into two families, S-shaped and V-shaped, were used for the binarization of PSO. The comparative analysis showed that the family of V-shaped transfer functions significantly improves the performance of binary PSO.
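The two families differ not only in shape but also in the update rule they imply, which the following sketch illustrates (the sigmoid is the classical S-shaped function from [32]; |tanh| is one common V-shaped choice; the helper names are ours):

```python
import math
import random

def s_shaped(v):
    """S-shaped (sigmoid) transfer function from binary PSO [32]."""
    return 1.0 / (1.0 + math.exp(-v))

def v_shaped(v):
    """One common V-shaped transfer function: |tanh(v)|."""
    return abs(math.tanh(v))

def s_update(bit, velocity):
    # S-shaped rule: the new bit is 1 with probability S(velocity),
    # regardless of the current bit value
    return 1 if random.random() < s_shaped(velocity) else 0

def v_update(bit, velocity):
    # V-shaped rule: flip the current bit with probability V(velocity),
    # so small velocities tend to preserve the current solution
    return 1 - bit if random.random() < v_shaped(velocity) else bit
```

The V-shaped rule's bias toward keeping the current bit at low velocities is one intuition behind the improved performance reported in [33].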
The widespread use of transfer functions is hindered by the lack of a methodology for selecting them beyond simple trial and error. The selection can also be dynamic, taking into account past search experience [34]. In [35], to achieve a tradeoff between the diversification and intensification of PSO, time-varying transfer functions were proposed.
In the binary version of gravitational search [36], some basic concepts associated with updating the positions of particles were modified. The current value of bits was changed with a probability computed based on the velocity of a particle, i.e., the binary gravitational search algorithm updated the velocity and formed a new position of the particle as 1 or 0 with the given probability. In [34], it was suggested that, for slow velocities, the probability of change in the position of a particle should be close to zero, while for fast velocities, this probability should be high. As a probability function, the hyperbolic tangent was employed. In [37], binary gravitational search with S-shaped and V-shaped functions was used to select features for a fuzzy classifier; in [38], for a classifier based on k-nearest neighbors.
In [46], an algorithm based on k-means clustering was proposed for the binarization of continuous swarm intelligence metaheuristics. The authors also presented a methodology for setting binarization parameters, which allows one to control the binarization process.

Fuzzy Rule-Based Classifier
Fuzzy classifiers, which belong to rule-based methods, have significant advantages in terms of their functionality, as well as their design and subsequent analysis.A unique advantage of fuzzy classifiers is the interpretability of classification rules [47,48].
Suppose that x = (x_1, x_2, ..., x_D) ∈ R^D is a vector in a D-dimensional feature space and C = {c_1, c_2, ..., c_m} is a set of class labels. Then, the classification problem can be reduced to determining, from the set of class labels, the label that corresponds to the feature vector of the object to be classified.
A fuzzy classifier is given by a base of production rules of the following form [37]:

R_i: IF s_1 ∧ (x_1 = A_1i) AND s_2 ∧ (x_2 = A_2i) AND ... AND s_D ∧ (x_D = A_Di) THEN class = c_i, i = 1, ..., R,

where A_ki is the fuzzy term that characterizes the k-th feature in the i-th fuzzy rule (k = 1, ..., D), R is the number of fuzzy rules, and S = (s_1, s_2, ..., s_D) is the binary feature vector in which s_k ∧ x_k indicates the presence (s_k = 1) or absence (s_k = 0) of a feature in the classifier. On a given dataset {(x_p; c_p), p = 1, 2, ..., Z}, the class label is defined as follows:

class = c_j, j = arg max_{1 ≤ i ≤ R} ∏_{k=1}^{D} μ_A_ki(x_pk)^{s_k},

where μ_A_ki(x_pk) is the symmetric membership function for the fuzzy term A_ki at the point x_pk.
The measure of classification rate is defined as the ratio of the number of correctly assigned class labels to the total number of objects to be classified:

E(θ, S) = (1/Z) · Σ_{p=1}^{Z} [ f(x_p; θ, S) = c_p ],   (1)

where f(x_p; θ, S) is the output of the fuzzy classifier with the parameters θ and features S at the point x_p, and [·] equals 1 if its argument is true and 0 otherwise.
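A minimal Python sketch of this winner-takes-all inference and of the classification rate from Equation (1) is shown below; it assumes symmetric triangular terms and one rule per class (as produced by the rule base generation described in the next sections), and all function names are illustrative:

```python
def tri_mf(x, a, b, c):
    """Symmetric triangular membership function with borders a, c
    and center b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def classify(x, rules, s):
    """Winner-takes-all fuzzy inference. Each rule is a pair
    (class_label, [(a, b, c) per feature]); s is the binary
    feature-selection vector, and masked-out features contribute
    a factor of 1 to the rule firing strength."""
    best, best_fire = None, -1.0
    for label, terms in rules:
        fire = 1.0
        for k, (a, b, c) in enumerate(terms):
            if s[k]:
                fire *= tri_mf(x[k], a, b, c)
        if fire > best_fire:
            best, best_fire = label, fire
    return best

def classification_rate(data, rules, s):
    """Equation (1): fraction of correctly labelled objects."""
    hits = sum(1 for x, c in data if classify(x, rules, s) == c)
    return hits / len(data)
```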
To construct the fuzzy classifier, two problems need to be solved: fuzzy rule base generation and feature selection. The first problem is solved by generating a fuzzy rule base based on the extreme values in each class, while the second is solved by the BSSO algorithm.

Fuzzy Rule Base Generation
The algorithm generates the initial rule base for the fuzzy classifier, which contains one rule for each class [49]. The rules are formed based on the extreme values in the training sample T_r = {(x_p; c_p), p = 1, 2, ..., Z}. Let us introduce the following designations: m is the number of classes and D is the number of features. A pseudo code of the fuzzy rule base generation algorithm is shown in Algorithm 1.

Swallow Swarm Optimization
Swallow swarm optimization is a population-based metaheuristic proposed in 2013 by Neshat et al. for the continuous case [50].
At the beginning of each iteration, the population is sorted based on the value of the objective function. Then, the following roles are assigned:
1. The head leader is the particle with the best value of the objective function;
2. Local leaders are the l particles that follow the head leader in accordance with the value of the objective function;
3. Aimless particles are the k particles with the worst values of the objective function;
4. Explorers are all other particles.
On the current iteration, the head leader does not move, acting as a beacon for the explorer particles, which, in turn, explore the search space between the nearest local leader and the head leader. Explorer particles change their positions by the following formulas:

V_HL(t + 1) = V_HL(t) + α_HL · rand() · (θ_e^best − θ_e(t)) + β_HL · rand() · (θ_HL − θ_e(t)),
V_LL(t + 1) = V_LL(t) + α_LL · rand() · (θ_e^best − θ_e(t)) + β_LL · rand() · (θ_LL − θ_e(t)),
V(t + 1) = V_HL(t + 1) + V_LL(t + 1),
θ_e(t + 1) = θ_e(t) + V(t + 1),

where θ_e is the position of the explorer, θ_HL is the position of the head leader, θ_LL is the position of the local leader nearest to the explorer, θ_e^best is the best position found by the explorer, V is the velocity vector of the particle, V_HL is the velocity vector of the particle moving toward the head leader, and V_LL is the velocity vector of the particle moving toward the nearest local leader. In contrast to [50], we decided not to tune the parameters α_HL, β_HL, α_LL, and β_LL, which are used to compute the velocity vectors with respect to the head and local leaders. Our experiments showed that the corresponding procedure increases the runtime of the algorithm without any significant improvement of the result. In this work, all these parameters are set to 1.
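The explorer update in the continuous algorithm can be sketched as follows; this is an illustrative Python rendering with α = β = 1 (as set in this paper), and the function name and argument order are our assumptions:

```python
import random

def explorer_step(theta, v_hl, v_ll, theta_best, theta_hl, theta_ll):
    """One explorer move in continuous SSO with alpha = beta = 1:
    the velocities toward the head leader and toward the nearest
    local leader are updated separately and then summed."""
    d = len(theta)
    new_vhl = [v_hl[i]
               + random.random() * (theta_best[i] - theta[i])
               + random.random() * (theta_hl[i] - theta[i])
               for i in range(d)]
    new_vll = [v_ll[i]
               + random.random() * (theta_best[i] - theta[i])
               + random.random() * (theta_ll[i] - theta[i])
               for i in range(d)]
    v = [new_vhl[i] + new_vll[i] for i in range(d)]
    new_theta = [theta[i] + v[i] for i in range(d)]
    return new_theta, new_vhl, new_vll
```

When an explorer already sits at its best position and at both leaders, all attraction terms vanish and the particle stays put, which matches the beacon role of the head leader described above.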
The formula for changing the positions of aimless particles is also modified, because the original formula can cause particles to gather at the boundaries of the search space or even go beyond it. Our formula reduces the probability of this behavior and also allows explorer particles to slightly affect the behavior of aimless particles. To change the position of an aimless particle, the following formula is used:

θ_O(t + 1) = θ_O(t) + rand(−1, 1) · (θ_j(t) − θ_O(t)), j = rand{1, ..., N − k},

where θ_O is the position of the aimless particle, θ_j is the position of a randomly chosen j-th particle that is not aimless, N is the total number of particles in the population, and k is the number of aimless particles.
Once the termination condition is met, the algorithm returns the position of the head leader as a new solution.

Binary Swallow Swarm Optimization Algorithm
This section describes the proposed feature selection method based on the binary swallow swarm optimization (BSSO) algorithm. BSSO adapts the original algorithm to solve feature selection problems.
In [26], the authors proposed discrete swallow swarm optimization for the well-known traveling salesman problem. That algorithm uses operators suited only to this particular problem: a position is a Hamiltonian cycle, and a velocity is defined as a set of permutations between two cities. In contrast to [26], in this work, we use a special function merge to update the positions of particles. As its input, this function receives two vectors X and Y, as well as a number p ranging from 0 to 1, which determines the influence of X on Y. The function merge yields the vector Z, each element of which is found as follows: if the values X_i and Y_i coincide, then Z_i has the same value; otherwise, Z_i takes the value X_i with probability p or the value Y_i with probability (1 − p).
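The description above translates directly into a few lines of Python; this is a sketch of the merge operator as defined in the text (only the function signature is our convention):

```python
import random

def merge(x, y, p):
    """The merge operator of BSSO: where the bits of x and y agree,
    keep the common bit; where they differ, take the bit of x with
    probability p, otherwise the bit of y."""
    return [xi if xi == yi or random.random() < p else yi
            for xi, yi in zip(x, y)]
```

For example, merge(X, Y, 1.0) always returns X and merge(X, Y, 0.0) always returns Y, while intermediate values of p blend the two parents bit by bit.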
This design is motivated by the fact that all solutions have different classification accuracies. If a feature is included in a solution that has higher classification accuracy, then this feature is more likely to be relevant. Conversely, if a feature is not included in a solution with higher classification accuracy, then this feature is more likely to be noise. Thus, this approach allows a new solution to include more features that are potentially relevant and to exclude more features that are potentially noise. On the other hand, by varying p, it is possible to control the effect of randomness when computing a new solution, so that the algorithm does not converge too quickly and does not get stuck at local optima.
This algorithm represents solutions as vectors S that encode features. At the first step of the algorithm, a population (set) of vectors S is generated (randomly or in some other way). The number of vectors in the population is a preset integer, which is also referred to as the population size. For each vector, a measure of classification accuracy E is computed (Equation (1)). On each iteration, all vectors are sorted in descending order of E. The first element becomes the head leader. The next l solutions are local leaders, the k worst vectors are aimless particles, and all other vectors are explorer particles.
The integer variables l and k are also specified beforehand. Explorer particles change their positions based on the positions of the leaders, while aimless particles do so randomly.
Below are the corresponding formulas for explorers:

S_e(t + 1) = merge(V(t + 1), S_e(t), p_ve),
V(t + 1) = merge(V_HL(t + 1), V_LL(t + 1), p_vhl),
V_HL(t + 1) = merge(merge(S_HL(t), S_e(t), p_he), rand{0, 1}^D, p_her),
V_LL(t + 1) = merge(merge(S_LL(t), S_e(t), p_le), rand{0, 1}^D, p_ler),   (2)

where S_HL is the position of the head leader, S_LL is the position of the nearest local leader, S_e is the position of the explorer, V_HL is the velocity vector with respect to the head leader, V_LL is the velocity vector with respect to the nearest local leader, V is the common velocity vector, p_ve is the effect of the velocity vector on the position of the explorer, p_vhl is the effect of the velocity vector with respect to the head leader on the velocity vector with respect to the local leader, p_he is the effect of the head leader's position on the position of the explorer, p_her is the combined effect of the head leader and explorer on a random vector, p_le is the effect of the local leader's position on the position of the explorer, and p_ler is the combined effect of the local leader and explorer on a random vector. A pseudo code of BSSO is shown in Algorithm 2.
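Equation (2) composes the merge operator four times per explorer; the following self-contained Python sketch makes the composition explicit (the dictionary of probabilities and the function names are our illustrative conventions):

```python
import random

def merge(x, y, p):
    # where bits agree, keep them; where they differ, take the bit
    # of x with probability p, otherwise the bit of y
    return [xi if xi == yi or random.random() < p else yi
            for xi, yi in zip(x, y)]

def rand_bits(d):
    # the rand{0,1}^D term of Equation (2)
    return [random.randint(0, 1) for _ in range(d)]

def bsso_explorer_step(s_e, s_hl, s_ll, p):
    """One explorer move of BSSO per Equation (2); p holds the
    influence probabilities p_he, p_her, p_le, p_ler, p_vhl, p_ve."""
    d = len(s_e)
    v_hl = merge(merge(s_hl, s_e, p["he"]), rand_bits(d), p["her"])
    v_ll = merge(merge(s_ll, s_e, p["le"]), rand_bits(d), p["ler"])
    v = merge(v_hl, v_ll, p["vhl"])
    return merge(v, s_e, p["ve"])
```

In the degenerate case where every probability equals 1, the explorer simply jumps onto the head leader, which shows how the probabilities trade exploitation of the leaders against random exploration.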

Datasets and Parameter Setting
To validate our method, we used 30 real datasets from the knowledge extraction based on evolutionary learning (KEEL) repository [51]. The experiment was carried out as follows: first, a fuzzy classifier was constructed on all features from the source dataset. Then, using the feature selection algorithm, some of the features were removed, and the classification accuracy on the remaining features was estimated. Parameter optimization was not carried out. The experiments were conducted based on the tenfold cross-validation scheme, whereby the optimal subset of features was determined on each of ten subsamples. The fuzzy classifier was constructed on the subset found, and its average classification accuracy was determined on the training and test datasets. At the end of the experiment, the average number of selected features and the average accuracy of classification based only on these features were determined. Table 1 describes the 30 datasets used in the experiments. The parameter settings for the BSSO algorithm are shown in Table 2. BSSO was compared with other representative methods: wrapper feature selection based on the binary spider monkey algorithm (BSMA) [30], the binary gravitational search algorithm (BSGA) [37], the binary brain storm optimization algorithm (BBSO) [52], and the random search algorithm (RS), as well as a feature selection algorithm based on mutual information (IG) [53] and an algorithm without feature selection (All features), on well-known benchmark datasets.
Then, the averaged error rates were used as fitness values of the corresponding feature subsets. Equation (1) evaluates the best feature subset by using the fuzzy classifier, without optimizing the classifier's parameters, to obtain the classification error rate.
For each dataset, experiments to examine the feature selection performance of each algorithm were conducted over 30 independent runs. Table 3 shows the experimental results of four different feature selection methods. For each method, the table presents the average classification accuracy on the training set (Learn), the average classification accuracy on the test set (Test), and the average number of selected features (F').
Table 4 shows the average accuracy and number of features for the wrapper methods; here, the symbol S denotes the use of the S-shaped transfer function, and the symbol V denotes the use of the V-shaped function.
A statistical significance test (the related-samples Wilcoxon signed-rank test) was carried out to assess the classification performance of the methods. The test is considered "safer" because it does not assume a normal distribution, and outliers have less effect on the result [12]. The purpose of the Wilcoxon test is to determine whether the results yielded by two methods differ significantly (i.e., whether the null hypothesis can be rejected). The null hypothesis was that different feature selection methods generate similar results, i.e., that the median of the differences between the methods equals zero. The null hypothesis is rejected (the p-value is less than or equal to the significance level) if the differences between the methods are significant. The significance level in the Wilcoxon test was set to 0.05. Table 5 shows the results of the Wilcoxon test for the pairwise comparison of BSSO with the other feature selection methods and with the algorithm without feature selection. The BSSO algorithm showed better accuracy and fewer features compared to the RS algorithm; these differences are statistically significant. The BSSO algorithm showed better accuracy and fewer features compared to the IG algorithm; the differences in accuracy are statistically significant, whereas the differences in the number of features are not. The BSSO algorithm showed better accuracy and fewer features (with the exception of two cases) compared to the other metaheuristics, although these differences are not statistically significant.
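For readers who wish to reproduce the test, the following pure-Python sketch implements the paired two-tailed Wilcoxon signed-rank test under the normal approximation (statistical packages typically switch to the exact small-sample distribution for small n, so p-values may differ slightly); the function name and the toy accuracy values in the usage are illustrative:

```python
import math

def wilcoxon_signed_rank(a, b):
    """Related-samples two-tailed Wilcoxon signed-rank test with the
    normal approximation: zero differences are discarded and tied
    absolute differences receive average ranks. Returns (W, p)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    pairs = sorted((abs(d), d > 0) for d in diffs)
    ranks = [0.0] * n
    i = 0
    while i < n:                      # assign average ranks to ties
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg = (i + 1 + j) / 2.0       # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w_plus = sum(r for r, (_, pos) in zip(ranks, pairs) if pos)
    w_minus = sum(r for r, (_, pos) in zip(ranks, pairs) if not pos)
    w = min(w_plus, w_minus)
    mu = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mu) / sigma
    p = 1.0 + math.erf(z / math.sqrt(2.0))   # 2 * Phi(z), z <= 0
    return w, min(1.0, p)
```

The null hypothesis of equal medians is rejected when the returned p-value is at or below the 0.05 level used in this paper.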
Feature selection is a binary optimization problem, and the No Free Lunch theorem states that no single algorithm gives the best results for all optimization problems [54]; hence, new optimization algorithms continue to be developed. It was noted in [6] that there is no single best method for feature selection, and the researcher should focus on finding a good method for each specific problem.
The time taken to execute the proposed BSSO is given in Table 6, in comparison with the time required for the execution of IG and RS. All methods were executed on the same machine (Intel Core i5-3570, 3.40 GHz, 8 GB RAM). C# was used as the programming language.
As can be seen from Table 6, IG achieved the best execution time, mainly because its structure does not involve a classifier. Binary SSO produced results comparable to RS. However, BSSO's better accuracy in finding optimal solutions compared to RS and IG compensates for its computational cost.
Based on the results of the experiment, we fitted a relationship between the execution time and the number of attributes, the number of instances, and the number of classes using multiple linear regression. The fitted model has a coefficient of determination of R^2 = 0.782, which indicates that the model describes the data well. Table 7 shows the estimated coefficients of the multiple linear regression model with their significance levels, t-statistics, and confidence intervals.
For three datasets (Optdigits, Spambase, and Coil2000), three feature selection algorithms (IG, RS, and BSSO) were run, and the number of features selected by each algorithm was recorded. Then, a fuzzy classifier without parameter optimization was applied to each newly acquired dataset containing only the selected features. Figures 1-3 show the average classification rates of the methods on the testing partitions versus the number of selected features. Figures 1-3 show that the suggested method exhibits the best performance. Based on the foregoing, we can draw the following conclusions:
1. On some datasets, the proposed feature selection method achieves a classification rate exceeding 90%, which indicates that the method is effective at reducing the amount of data to be processed.
2. The classification rate increases with the number of selected features; when the number of selected features exceeds a certain value, the classification rate decreases.
3. The proposed method allows selecting near-optimal features.


Conclusions
In this paper, we have addressed the problem of wrapper-based feature selection. We have applied binary swallow swarm optimization to the problem of feature selection for fuzzy classification. It is important to note that the parameters of the fuzzy classifier have not been optimized, and the number of fuzzy rules has been minimal (it corresponded to the number of classes in the dataset).
The proposed feature selection method is based on a special function that compares two binary vectors to form a new one, taking into account the effect of the first vector on the second one.This approach allows a new solution to include more features that are potentially relevant and to exclude more features that are potentially noise.
To demonstrate that the performance of BSSO is statistically significant in comparison to the other feature selection methods, the non-parametric Wilcoxon signed-rank test with a significance level of 0.05 has been carried out. Based on the results of the investigation, it can be concluded that BSSO is competitive with the other methods.
Two disadvantages can be attributed to BSSO. The first is the large number of tunable parameters that must be selected empirically; transfer learning may be a solution to this problem. The second drawback, which all wrappers share, is a heavy computational burden; this disadvantage comes from the nature of the approach.
In the future, we also intend to investigate the effectiveness of BSSO in constructing classifiers of other types.


Figure 1. Dataset Optdigits. Average classification rates of the competing methods versus the number of selected features.

Figure 2. Dataset Spambase. Average classification rates of the competing methods versus the number of selected features.


Figure 3. Dataset Coil2000. Average classification rates of the competing methods versus the number of selected features.

Algorithm 1. Fuzzy rule base generation.
Input: training sample T_r = {(x_p; c_p), p = 1, 2, ..., Z}; m classes; D features.
Output: fuzzy rule base R_i (i = 1, 2, ..., m).
1: for i = 1 to m do
2:   for k = 1 to D do
3:     Search a left border of membership function μ_A_ki(x_k):
4:     a ← min_{(p=1,2,...,Z)∧(c_p=i)} x_pk;
5:     Search a right border of membership function μ_A_ki(x_k):
6:     c ← max_{(p=1,2,...,Z)∧(c_p=i)} x_pk;
7:     Calculate a center of membership function μ_A_ki(x_k):
8:     b ← a + (c − a)/2;
9:     Create a symmetric triangular membership function with borders a and c, and center b for fuzzy term A_ki;
10:   end
11:   Create rule R_i with μ_A_ki(x_k) membership functions (k = 1, 2, ..., D) and output class c_i;
12: end

Parameters of BSSO (Algorithm 2): iterations is the maximum number of cycles, N is the population size, D is the dimension, p_ve, p_vhl, p_he, p_her, p_le, and p_ler are the influence probabilities, l is the number of local leaders, and k is the number of aimless particles.
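Algorithm 1 above can be sketched compactly in Python; the following illustrative implementation assumes class labels 1..m and returns one rule per class as a (label, terms) pair, with each term stored as its (left border, center, right border) triple:

```python
def generate_rule_base(samples, num_classes):
    """Algorithm 1: one rule per class. For each feature, the
    symmetric triangular term spans the class extrema and is
    centered at the midpoint b = a + (c - a)/2.
    samples is a list of (feature_vector, class_label) pairs."""
    rules = []
    num_features = len(samples[0][0])
    for i in range(1, num_classes + 1):
        class_points = [x for x, c in samples if c == i]
        terms = []
        for k in range(num_features):
            values = [x[k] for x in class_points]
            a, c = min(values), max(values)   # left and right borders
            b = a + (c - a) / 2.0             # center of the triangle
            terms.append((a, b, c))
        rules.append((i, terms))
    return rules
```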

Table 1. Distributions of the 30 datasets used in the experiments.

Table 2. Parameter settings for BSSO.
Regression finds the target function fitting the input data with minimum error. The regression equation is as follows:

t_BSSO = −285.890 + 60.487·NoCl + 7.682·NoFe + 0.022·NoEx,

where t_BSSO is the execution time of BSSO, NoCl is the number of classes, NoFe is the number of features, and NoEx is the number of examples.

Table 3. Accuracy comparison of the feature selection methods.

Table 4. Results of using wrapper feature selection methods.

Table 5. Paired two-tailed Wilcoxon test between BSSO and the other methods with significance level 0.05.

Table 7. Coefficients and their estimates.