# An Extensive Performance Comparison between Feature Reduction and Feature Selection Preprocessing Algorithms on Imbalanced Wide Data


## Abstract


## 1. Introduction

1. To find an FR method that is compatible with wide data and provides a means to perform nonlinear transformations over out-of-sample instances.
2. To compare the two previously mentioned types of preprocessing techniques (FR and FS) and determine which is more suitable to use on wide datasets.
3. To determine whether balancing is important while using FR methods and, if so, whether it is more convenient to use it before or after the FR step.
4. To determine the best FR method for each classifier.

## 2. Feature Reduction

- Linear
  - Unsupervised
    - Principal Component Analysis (PCA) [20] is the most popular FR method; it reduces the feature dimensionality while maintaining the maximum data variance.
    - Locality Pursuit Embedding (LPE) [21] respects the local structure by maximizing the variance of each local patch according to Euclidean distances (unlike PCA, which preserves the global structure).
    - Random Projection (RNDPROJ) [25] projects the data onto a random spherical hyperplane through the origin; selecting such a hyperplane uniformly at random is not a trivial computational problem.
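To make the linear projection idea concrete, here is a minimal PCA sketch in NumPy. This is an illustration only (the study itself used the Rdimtools R package); `pca_fit_transform` and the toy data are our own names. Returning the mean and the projection matrix is what allows out-of-sample instances to be transformed later:

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project X onto its top principal components via SVD.

    Returning the mean and the projection matrix W allows out-of-sample
    instances to be mapped later with (X_new - mean) @ W.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                          # center the data
    # rows of Vt are the principal axes, ordered by singular value
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                # d x k projection matrix
    return Xc @ W, W, mean

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))             # "wide" data: far more features than examples
Z, W, mean = pca_fit_transform(X, 5)
print(Z.shape)  # (20, 5)
```

Note that on wide data the number of meaningful components is bounded by the number of examples, not by the number of features.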

  - Supervised
    - Fisher Score (FSCORE) [26] finds the projection that maximizes the ratio between each feature mean and the standard deviation of each class.
    - Local Fisher Discriminant Analysis (LFDA) [29] is an improved version of the FDA supervised FR method, suitable for reducing datasets in which individual classes are separated into several clusters.
    - Maximum Margin Criterion (MMC) [30] projects the data while maximizing the average margin between classes.
    - Sliced Average Variance Estimation (SAVE) [31] calculates the projection matrix by averaging the covariance of each slice into which the whole dataset has been divided.
    - Supervised Locality Pursuit Embedding (SLPE) [32] is a supervised version of the LPE algorithm, which enhances the model using label information.
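The Fisher-score criterion (between-class separation of a feature's means over its pooled within-class variance) can be illustrated with a small NumPy sketch. This is the common per-feature form of the score; the function name and toy data are ours, not the Rdimtools implementation:

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score: variance of the class means of a feature
    divided by its pooled within-class variance (higher = more useful)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2   # between-class spread
        den += len(Xc) * Xc.var(axis=0)                # within-class spread
    return num / den

# toy example: feature 0 separates the classes, feature 1 is noise
X = np.array([[0.0, 1.0], [0.1, -1.0], [5.0, 1.1], [5.1, -0.9]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)
print(scores.argmax())  # 0 -> feature 0 ranks first
```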

- Non-linear
  - Classical Multidimensional Scaling (MDS) [33] computes the dissimilarities between pairs of objects (assuming Euclidean distance). This matrix serves as the input for the algorithm, which outputs a coordinate matrix that minimizes a loss function called strain.
  - Metric Multidimensional Scaling (MMDS) [34] is a superset of the previous method. It iteratively updates the weights given by MDS using the SMACOF algorithm, in order to minimize a stress function such as the residual sum of squares.
  - Locally Linear Embedding (LLE) [35] produces low-dimensional vectors that best reconstruct the original objects, computing the k nearest neighbors of each object and using this information to weight them.
  - Locally Embedded Analysis (LEA) [34] aims to preserve the local structure of the original data in the computed embedding space.
  - Stochastic Neighbor Embedding (SNE) [38] is a probabilistic approach that places the data in a low-dimensional space that optimally preserves the neighborhood of the original space.
  - An Autoencoder [39,40] is a kind of artificial neural network that is trained in an unsupervised manner. Its aim is to capture the hidden information in the high-dimensional input space of the dataset. Autoencoders have the same number of artificial neurons in their first (input) and last (output) layers, while having fewer in their central layers (see Figure 1). During training, autoencoders attempt to reproduce in the output layer the information presented in the input layer. The central layer therefore aims to capture the intrinsic information of the dataset and, thus, can be used for feature reduction.
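Classical MDS, as described above, admits a compact closed-form sketch: double-center the squared distance matrix and take the top eigenpairs of the resulting Gram matrix. A NumPy illustration under our own naming (not the Rdimtools implementation):

```python
import numpy as np

def classical_mds(D, k):
    """Classical MDS: embed objects in k dimensions from their pairwise
    Euclidean distance matrix D, via double-centering and an
    eigendecomposition of the resulting Gram matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigval, eigvec = np.linalg.eigh(B)         # ascending eigenvalues
    idx = np.argsort(eigval)[::-1][:k]         # keep the k largest
    scale = np.sqrt(np.clip(eigval[idx], 0, None))
    return eigvec[:, idx] * scale

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 50))                  # 10 objects in 50 dimensions
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(D, 2)
print(Z.shape)  # (10, 2)
```

With k equal to the number of objects, the embedding reproduces the input Euclidean distances exactly, which is a handy sanity check.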

## 3. Feature Selection

**Filter** methods [41] are mainly based on statistical measures. They analyze the features and rank them ordinally or numerically according to their importance. Although these methods do not usually achieve the best performance for a given classifier, they avoid overfitting. **Wrapper** methods [42] apply a search algorithm to find the best feature subset for a specific classifier, according to a certain metric. Some of the most common approaches are recursive feature elimination (RFE) and genetic algorithms. Wrapper methods usually obtain better performance than the other families; however, they tend to overfit and their computational cost is usually very high. **Embedded** methods [43] take advantage of the properties of classifiers such as support vector machines or decision trees to determine the importance of a feature subset. Although the selected subset can be used to train any model, it may perform better on the base classifier used to obtain it.
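The RFE idea behind the wrapper family can be sketched in a few lines. This is a deliberately minimal, hypothetical version: the study's SVM-RFE ranks features by linear-SVM weights, whereas here an ordinary least-squares fit stands in as the linear model; `rfe` and the toy data are our names:

```python
import numpy as np

def rfe(X, y, n_keep, step=1):
    """Minimal recursive feature elimination: repeatedly fit a linear
    model and drop the features with the smallest absolute weights.
    (SVM-RFE uses linear-SVM weights; least squares stands in here.)"""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w, *_ = np.linalg.lstsq(X[:, active], y.astype(float), rcond=None)
        weakest = np.argsort(np.abs(w))[:min(step, len(active) - n_keep)]
        active = [f for i, f in enumerate(active) if i not in set(weakest)]
    return active

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 12))
y = (X[:, 3] + X[:, 7] > 0).astype(int)   # only features 3 and 7 carry signal
print(sorted(rfe(X, y, 2)))               # typically recovers [3, 7]
```

The `step` parameter trades accuracy for speed: eliminating several features per refit is what makes RFE tractable on wide data.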

## 4. Imbalanced Data

- Random Undersampling (RUS) [46] removes instances randomly selected from the majority class.
- Random Oversampling (ROS) [46] duplicates instances randomly selected from the minority class.
- Synthetic Minority Over-sampling Technique (SMOTE) [47] creates synthetic instances of the minority class. For the creation of new instances, SMOTE randomly selects instances from the minority class. The feature values of the new instances are computed through interpolating the features of two instances randomly selected from the k nearest neighbors of the original instance (k being a parameter of the algorithm).
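The SMOTE interpolation step described above can be sketched as follows. This is a simplified illustration, not the reference implementation; `smote` and its arguments are our own names:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: each synthetic point interpolates a random
    minority instance with one of its k nearest minority-class neighbours."""
    if rng is None:
        rng = np.random.default_rng()
    # pairwise distances within the minority class only
    D = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # exclude self-neighbours
    neighbours = np.argsort(D, axis=1)[:, :k]   # k nearest per instance
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(neighbours[i])
        lam = rng.random()                      # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(out)

rng = np.random.default_rng(3)
X_min = rng.normal(size=(6, 4))                 # six minority instances
synth = smote(X_min, n_new=10, k=3, rng=rng)
print(synth.shape)  # (10, 4)
```

Because each synthetic point is a convex combination of two real minority instances, it always lies on the segment between them, never outside the minority region's bounding box.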

## 5. Experimental Setup

#### 5.1. Cross-Validation

#### 5.2. Data Sets

#### 5.3. Dimensionality and Number of Features

#### 5.4. Resampling Strategies

#### 5.5. Classifiers

`KNN`, `SVM-Gaussian`, `C4.5` trees, `Random Forest`, and `Naive Bayes`. For this reason, these five classifiers were used in this study.

#### 5.6. Parameters

#### 5.7. Metrics

The performance metrics used were the AUC, F$_{1}$-Score, G-Mean, Matthews correlation coefficient, and Cohen’s kappa, all computed from the entries of the confusion matrix (Table 3).

- Recall, or the true positive rate, is the probability of classifying a positive instance as positive.$$\mathit{recall}=\frac{\mathit{TP}}{\mathit{TP}+\mathit{FN}}$$
- Specificity, as opposed to recall, is the probability of considering a negative instance as negative.$$\mathit{specificity}=\frac{\mathit{TN}}{\mathit{TN}+\mathit{FP}}$$
- Fall-out, or the false positive rate, is the probability of a false alarm occurring.$$\mathit{\text{fall-out}}=\frac{\mathit{FP}}{\mathit{TN}+\mathit{FP}}$$
- Precision is the probability that an instance classified as positive is actually positive.$$\mathit{precision}=\frac{\mathit{TP}}{\mathit{TP}+\mathit{FP}}$$

- The Area Under the ROC Curve can be calculated in different ways. Although ROC can also be used to evaluate multiple possible classifier thresholds, in this study, only one per fold is evaluated using the formula based on the true positive rate (recall) and the false positive rate (fall-out) from [54].$$\mathrm{AUC}=\frac{1+\mathit{recall}-\mathit{\text{fall-out}}}{2}$$
- The F$_{1}$-Score is the harmonic mean between precision and recall.$${F}_{1}\mathrm{\text{-Score}}=2\times \frac{\mathit{precision}\times \mathit{recall}}{\mathit{precision}+\mathit{recall}}$$
- The G-Mean, which is widely used for imbalanced problems, is the geometric mean between recall and specificity.$$\mathrm{\text{G-Mean}}=\sqrt{\mathit{recall}\times \mathit{specificity}}$$
- The Matthews correlation coefficient (MCC; not to be confused with the feature reduction method Maximum Margin Criterion, MMC) was originally presented by Matthews [55] and introduced to the Machine Learning community in [56]. The MCC has become a well-known performance measure for binary classification that is not affected by imbalanced datasets, and the authors of [57,58] have recommended this metric over the AUC and F$_{1}$-Score.$$\mathrm{MCC}=\frac{TN\times TP-FN\times FP}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
- Cohen’s kappa compensates for the random hits that are usually observed in classification problems [59], where $P_{0}$ is the observed agreement and $P_{e}$ the agreement expected by chance.$$\mathrm{K}=\frac{{P}_{0}-{P}_{e}}{1-{P}_{e}}$$
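For reference, the metric definitions above can be collected into one small helper computed from a single confusion matrix. This is our own illustrative function (pure Python, names ours), not code from the study:

```python
import math

def metrics(tp, fp, fn, tn):
    """All of the above metrics computed from one binary confusion matrix."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fall_out = fp / (tn + fp)
    precision = tp / (tp + fp)
    auc = (1 + recall - fall_out) / 2
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    mcc = (tn * tp - fn * fp) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    n = tp + fp + fn + tn
    p0 = (tp + tn) / n                        # observed agreement
    pe = ((tp + fp) * (tp + fn)               # agreement expected by chance
          + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p0 - pe) / (1 - pe)
    return {"recall": recall, "specificity": specificity,
            "fall-out": fall_out, "precision": precision, "AUC": auc,
            "F1": f1, "G-Mean": g_mean, "MCC": mcc, "kappa": kappa}

m = metrics(tp=40, fp=10, fn=5, tn=45)
print(m["MCC"], m["kappa"])
```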

## 6. Results

#### 6.1. Best Feature Reducers

#### 6.2. Best Preprocessing Algorithm

For three of the five metrics (F$_{1}$-Score, MCC, and Kappa), the FR combination performed significantly better than the FS combination, while the other two (AUC and G-Mean) did not show any significant differences. As none of the tests supported the left side (SVM-RFE) and most of the tests suggested that the right side (MMC) performed better, it can be concluded that, for these datasets, the FR option was the best one.

## 7. Discussion and Conclusions

## 8. Limitations

## 9. Future Work

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Lai, K.; Twine, N.; O’Brien, A.; Guo, Y.; Bauer, D. Artificial intelligence and machine learning in bioinformatics. Encycl. Bioinform. Comput. Biol. ABC Bioinform. **2018**, 1, 272–286. [Google Scholar]
- Hao, Z.; Lv, D.; Ge, Y.; Shi, J.; Weijers, D.; Yu, G.; Chen, J. RIdeogram: Drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ Comput. Sci. **2020**, 6, e251. [Google Scholar] [CrossRef] [PubMed]
- Salesi, S.; Cosma, G.; Mavrovouniotis, M. TAGA: Tabu Asexual Genetic Algorithm embedded in a filter/filter feature selection approach for high-dimensional data. Inf. Sci. **2021**, 565, 105–127. [Google Scholar] [CrossRef]
- Keogh, E.J.; Mueen, A. Curse of dimensionality. Encycl. Mach. Learn. Data Min. **2017**, 2017, 314–315. [Google Scholar]
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. **2009**, 21, 1263–1284. [Google Scholar]
- Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics **2007**, 23, 2507–2517. [Google Scholar] [CrossRef] [PubMed]
- Ayesha, S.; Hanif, M.K.; Talib, R. Overview and comparative study of dimensionality reduction techniques for high dimensional data. Inf. Fusion **2020**, 59, 44–58. [Google Scholar] [CrossRef]
- Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 243–248. [Google Scholar]
- Wijayanto, I.; Humairani, A.; Hadiyoso, S.; Rizal, A.; Prasanna, D.L.; Tripathi, S.L. Epileptic seizure detection on a compressed EEG signal using energy measurement. Biomed. Signal Process. Control **2023**, 85, 104872. [Google Scholar] [CrossRef]
- Sachdeva, R.K.; Bathla, P.; Rani, P.; Kukreja, V.; Ahuja, R. A Systematic Method for Breast Cancer Classification using RFE Feature Selection. In Proceedings of the 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering, ICACITE 2022, Greater Noida, India, 28–29 April 2022; pp. 1673–1676. [Google Scholar] [CrossRef]
- Parhizkar, T.; Rafieipour, E.; Parhizkar, A. Evaluation and improvement of energy consumption prediction models using principal component analysis based feature reduction. J. Clean. Prod. **2021**, 279, 123866. [Google Scholar] [CrossRef]
- Wang, W.; Lu, L.; Wei, W. A Novel Supervised Filter Feature Selection Method Based on Gaussian Probability Density for Fault Diagnosis of Permanent Magnet DC Motors. Sensors **2022**, 22, 7121. [Google Scholar] [CrossRef] [PubMed]
- Zhao, X.; Jia, M. Fault diagnosis of rolling bearing based on feature reduction with global-local margin Fisher analysis. Neurocomputing **2018**, 315, 447–464. [Google Scholar] [CrossRef]
- Ayadi, R.; Maraoui, M.; Zrigui, M. LDA and LSI as a dimensionality reduction method in arabic document classification. Commun. Comput. Inf. Sci. **2015**, 538, 491–502. [Google Scholar] [CrossRef]
- Pes, B. Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests. Information **2021**, 12, 286. [Google Scholar] [CrossRef]
- Ramos-Pérez, I.; Arnaiz-González, A.; Rodríguez, J.J.; García-Osorio, C. When is resampling beneficial for feature selection with imbalanced wide data? Expert Syst. Appl. **2022**, 188, 116015. [Google Scholar] [CrossRef]
- Mendes Junior, J.J.A.; Freitas, M.L.; Siqueira, H.V.; Lazzaretti, A.E.; Pichorim, S.F.; Stevan, S.L. Feature selection and dimensionality reduction: An extensive comparison in hand gesture classification by sEMG in eight channels armband approach. Biomed. Signal Process. Control **2020**, 59, 101920. [Google Scholar] [CrossRef]
- Muntasa, A.; Sirajudin, I.A.; Purnomo, M.H. Appearance global and local structure fusion for face image recognition. TELKOMNIKA (Telecommun. Comput. Electron. Control) **2011**, 9, 125–132. [Google Scholar] [CrossRef]
- Yang, Y.; Nie, F.; Xiang, S.; Zhuang, Y.; Wang, W. Local and global regressive mapping for manifold learning with out-of-sample extrapolation. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–13 October 2010; pp. 649–654. [Google Scholar]
- Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. **1901**, 2, 559–572. [Google Scholar] [CrossRef]
- Min, W.; Lu, K.; He, X. Locality pursuit embedding. Pattern Recognit. **2004**, 37, 781–788. [Google Scholar] [CrossRef]
- Dornaika, F.; Assoum, A. Enhanced and parameterless Locality Preserving Projections for face recognition. Neurocomputing **2013**, 99, 448–457. [Google Scholar] [CrossRef]
- He, X.; Niyogi, P. Locality Preserving Projections. Adv. Neural Inf. Process. Syst. **2003**, 16. [Google Scholar]
- Belkin, M.; Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. **2003**, 15, 1373–1396. [Google Scholar] [CrossRef]
- Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci. **2003**, 66, 671–687. [Google Scholar] [CrossRef]
- Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. **1936**, 7, 179–188. [Google Scholar] [CrossRef]
- Liao, B.; Jiang, Y.; Liang, W.; Zhu, W.; Cai, L.; Cao, Z. Gene selection using locality sensitive Laplacian score. IEEE/ACM Trans. Comput. Biol. Bioinform. **2014**, 11, 1146–1156. [Google Scholar] [CrossRef] [PubMed]
- He, X.; Cai, D.; Niyogi, P. Laplacian score for feature selection. Adv. Neural Inf. Process. Syst. **2005**, 18. [Google Scholar]
- Sugiyama, M. Local fisher discriminant analysis for supervised dimensionality reduction. ACM Int. Conf. Proceeding Ser. **2006**, 148, 905–912. [Google Scholar] [CrossRef]
- Li, H.; Jiang, T.; Zhang, K. Efficient and robust feature extraction by maximum margin criterion. IEEE Trans. Neural Netw. **2006**, 17, 157–165. [Google Scholar] [CrossRef] [PubMed]
- Dennis Cook, R. SAVE: A method for dimension reduction and graphics in regression. Commun.-Stat.-Theory Methods **2000**, 29, 2109–2121. [Google Scholar] [CrossRef]
- Zheng, Z.; Yang, F.; Tan, W.; Jia, J.; Yang, J. Gabor feature-based face recognition using supervised locality preserving projection. Signal Process. **2007**, 87, 2473–2483. [Google Scholar] [CrossRef]
- Kruskal, J.B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika **1964**, 29, 1–27. [Google Scholar] [CrossRef]
- Borg, I.; Groenen, P.J. Modern Multidimensional Scaling: Theory and Applications; Springer: New York, NY, USA, 2005. [Google Scholar]
- Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science **2000**, 290, 2323–2326. [Google Scholar] [CrossRef] [PubMed]
- He, X.; Cai, D.; Yan, S.; Zhang, H.J. Neighborhood preserving embedding. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Beijing, China, 17–21 October 2005; pp. 1208–1213. [Google Scholar] [CrossRef]
- Yao, C.; Guo, Z. Revisit Neighborhood Preserving Embedding: A New Criterion for Measuring the Manifold Similarity in Dimension Reduction. Available online: https://ssrn.com/abstract=4349051 (accessed on 7 April 2024).
- Hinton, G.E.; Roweis, S. Stochastic neighbor embedding. Adv. Neural Inf. Process. Syst. **2002**, 15. [Google Scholar]
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation; Technical Report; California Univ San Diego La Jolla Inst for Cognitive Science: La Jolla, CA, USA, 1985. [Google Scholar]
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science **2006**, 313, 504–507. [Google Scholar] [CrossRef]
- Bommert, A.; Sun, X.; Bischl, B.; Rahnenführer, J.; Lang, M. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput. Stat. Data Anal. **2020**, 143, 106839. [Google Scholar] [CrossRef]
- Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. **1997**, 97, 273–324. [Google Scholar] [CrossRef]
- Lal, T.N.; Chapelle, O.; Weston, J.; Elisseeff, A. Embedded methods. In Feature Extraction: Foundations and Applications; Springer: Berlin/Heidelberg, Germany, 2006; pp. 137–165. [Google Scholar]
- Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. **2002**, 46, 389–422. [Google Scholar] [CrossRef]
- Luque, A.; Carrasco, A.; Martín, A.; de las Heras, A. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit. **2019**, 91, 216–231. [Google Scholar] [CrossRef]
- Japkowicz, N. The Class Imbalance Problem: Significance and Strategies. In Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI), Vancouver, BC, Canada, 13–15 November 2000; pp. 111–117. [Google Scholar]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. **2002**, 16, 321–357. [Google Scholar] [CrossRef]
- Dietterich, T.G. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput. **1998**, 10, 1895–1923. [Google Scholar] [CrossRef] [PubMed]
- García, V.; Sánchez, J.S.; Mollineda, R.A. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl.-Based Syst. **2012**, 25, 13–21. [Google Scholar] [CrossRef]
- Orriols-Puig, A.; Bernadó-Mansilla, E. Evolutionary rule-based systems for imbalanced datasets. Soft Comput. **2009**, 13, 213–225. [Google Scholar] [CrossRef]
- Zhu, Z.; Ong, Y.S.; Dash, M. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognit. **2007**, 40, 3236–3248. [Google Scholar] [CrossRef]
- Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. (CSUR) **2018**, 50, 1–45. [Google Scholar] [CrossRef]
- Bolón-Canedo, V.; Alonso-Betanzos, A. Recent Advances in Ensembles for Feature Selection; Springer: Berlin/Heidelberg, Germany, 2018; Volume 147. [Google Scholar] [CrossRef]
- Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) **2011**, 42, 463–484. [Google Scholar] [CrossRef]
- Matthews, B. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (BBA)-Protein Struct. **1975**, 405, 442–451. [Google Scholar] [CrossRef]
- Baldi, P.; Brunak, S.; Chauvin, Y.; Andersen, C.A.F.; Nielsen, H. Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics **2000**, 16, 412–424. [Google Scholar] [CrossRef] [PubMed]
- Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. **2020**, 21, 6. [Google Scholar] [CrossRef]
- Chicco, D.; Jurman, G. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. Biodata Min. **2023**, 16, 4. [Google Scholar] [CrossRef] [PubMed]
- Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. **1960**, 20, 37–46. [Google Scholar] [CrossRef]
- Demšar, J. Statistical comparisons of classifiers over multiple datasets. J. Mach. Learn. Res. **2006**, 7, 1–30. [Google Scholar]
- Garcia, S.; Herrera, F. An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons. J. Mach. Learn. Res. **2008**, 9, 2677–2694. [Google Scholar]
- Benavoli, A.; Corani, G.; Mangili, F.; Zaffalon, M.; Ruggeri, F. A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; PMLR; pp. 1026–1034. [Google Scholar]
- Kuncheva, L.I.; Matthews, C.E.; Arnaiz-González, A.; Rodríguez, J.J. Feature selection from high-dimensional data with very low sample size: A cautionary tale. arXiv **2020**, arXiv:2008.12025. [Google Scholar]
- Van Engelen, J.E.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. **2020**, 109, 373–440. [Google Scholar] [CrossRef]

**Figure 2.**Results of performing Bayesian tests for each of the five metrics comparing the best FS and FR configurations. The best FS configuration is represented on the left side (balancing using ROS before selecting the features with SVM-RFE and using SVM-G as classifier), whereas the best FR configuration is shown on the right side (reducing dimensionality with MMC and KNN as classifier).

**Figure 3.** Box plots of the time taken to process each fold (in seconds) for the 18 preprocessing methods (y-axis). The methods are sorted by their average execution time (shown as a central red dot). The preprocessing methods with the highest-ranking performance for at least one classifier are highlighted in blue, as is the best FS method (SVM-RFE).

**Table 1.**Data sets used in the experimental study. Datasets 1–9 were used in [51], while 10–14 were used in [52]. The column names refer to dataset name, number of examples, number of features, ratio features/examples, minority and majority class labels, min and max percentage of instances for the minority and majority classes, and imbalance ratio.

^{1}https://jundongl.github.io/scikit-feature/datasets.html,

^{2}http://csse.szu.edu.cn/staff/zhuzx/Datasets.html, accessed on 7 April 2024.

 | Data Set | #Ex. | #Feat. | $\frac{\#\mathbf{Feat}.}{\#\mathbf{Ex}.}$ | Class (min.; maj.) | %min.; %maj. | IR |
---|---|---|---|---|---|---|---|
1 | Colon ^{1} | 62 | 2,000 | 32.26 | (Normal; Tumor) | 0.35; 0.65 | 1.86 |
2 | MLL_ALL ^{1} | 72 | 12,582 | 174.75 | (ALL; rem) | 0.33; 0.67 | 2.03 |
3 | MLL_AML ^{1} | 72 | 12,582 | 174.75 | (AML; rem) | 0.39; 0.61 | 1.56 |
4 | MLL_MLL ^{1} | 72 | 12,582 | 174.75 | (MLL; rem) | 0.28; 0.72 | 2.57 |
5 | SRBCT_1 ^{1} | 83 | 2,308 | 27.81 | (1; rem) | 0.35; 0.65 | 1.86 |
6 | SRBCT_4 ^{1} | 83 | 2,308 | 27.81 | (4; rem) | 0.30; 0.70 | 2.33 |
7 | Lung_1 ^{1} | 203 | 12,600 | 62.07 | (rem; 1) | 0.32; 0.68 | 2.12 |
8 | Lung_4 ^{1} | 203 | 12,600 | 62.07 | (rem; 4) | 0.10; 0.90 | 9.00 |
9 | Lung_5 ^{1} | 203 | 12,600 | 62.07 | (rem; 5) | 0.10; 0.90 | 9.00 |
10 | Leukemia_BM ^{2} | 72 | 7,130 | 99.03 | (BM; rem) | 0.29; 0.71 | 2.45 |
11 | TOX_171_1 ^{2} | 171 | 5,748 | 33.61 | (1; rem) | 0.26; 0.74 | 2.85 |
12 | TOX_171_2 ^{2} | 171 | 5,748 | 33.61 | (2; rem) | 0.26; 0.74 | 2.85 |
13 | TOX_171_3 ^{2} | 171 | 5,748 | 33.61 | (3; rem) | 0.23; 0.77 | 3.35 |
14 | TOX_171_4 ^{2} | 171 | 5,748 | 33.61 | (4; rem) | 0.25; 0.75 | 3.00 |

**Table 2.** All the algorithms used in the study, grouped by type, with their parameters (when applicable) and the corresponding R packages; all packages were accessed in June 2023. * The asterisk indicates that the parameter K was set to 5 in the transformation estimator needed for the nonlinear feature reducers explained in Section 2.

^{1}https://cran.r-project.org/web/packages/class/index.html;

^{2}https://cran.r-project.org/web/packages/e1071/index.html;

^{3}https://cran.r-project.org/web/packages/RWeka/index.html;

^{4}https://cran.r-project.org/web/packages/randomForest/randomForest.html;

^{5}https://cran.r-project.org/web/packages/naivebayes/index.html;

^{6}https://cran.r-project.org/web/packages/Rdimtools/index.html;

^{7}https://www.bioconductor.org/packages/release/bioc/html/sigFeature.html;

^{8}https://cran.r-project.org/web/packages/unbalanced/index.html, accessed on 7 April 2024.

 | Algorithms | Parameters | Package |
---|---|---|---|
Classifier | KNN | $k=1$ | class ^{1} |
 | SVM-G | $c={10}^{9}$, $\gamma ={10}^{7}$ | e1071 ^{2} |
 | C4.5 | Default | RWeka ^{3} |
 | Random Forest | Default | randomForest ^{4} |
 | Naive Bayes | Default | naivebayes ^{5} |
Feature reduction (Linear—Unsupervised) | PCA | - | Rdimtools ^{6} |
 | LPE | Default | Rdimtools ^{6} |
 | PFLPP | - | Rdimtools ^{6} |
 | RNDPROJ | Default | Rdimtools ^{6} |
Feature reduction (Linear—Supervised) | FSCORE | - | Rdimtools ^{6} |
 | LSLS | Default | Rdimtools ^{6} |
 | LFDA | Default | Rdimtools ^{6} |
 | MMC | - | Rdimtools ^{6} |
 | SAVE | Default | Rdimtools ^{6} |
 | SLPE | - | Rdimtools ^{6} |
Feature reduction (Non-linear *) | MDS | - | Rdimtools ^{6} |
 | MMDS | - | Rdimtools ^{6} |
 | LLE | Default | Rdimtools ^{6} |
 | NPE | Default | Rdimtools ^{6} |
 | LEA | Default | Rdimtools ^{6} |
 | SNE | Default | Rdimtools ^{6} |
 | AUTOENCODER | epoch = 10, activation = “Tanh” | h2o |
Feature selection | SVM-RFE | | sigFeature ^{7} |
Balancing | ROS | Ratio 1:1 | Own impl. |
 | RUS | Ratio 1:1 | Own impl. |
 | SMOTE | Ratio 1:1, $k=5$ | unbalanced ^{8} |

**Table 3.**Confusion matrix: true positive (TP), false positive (FP), false negative (FN), and true negative (TN).

 | Actual Positive | Actual Negative |
---|---|---|
Pred. Positive | TP | FP |
Pred. Negative | FN | TN |

**Table 4.** Comparison of average ranks using the MCC metric for the 90 configurations obtained by combining the 5 classifiers with the 18 FR preprocessing options (including the no-preprocessing baseline). The color code indicates the type of algorithm: linear unsupervised, linear supervised, or nonlinear unsupervised.

Classifier | FR Algorithm | Average Rank
---|---|---
KNN | MMC | 1.96
SVM-G | No | 2.54
SVM-G | FSCORE | 9.21
KNN | FSCORE | 9.29
KNN | No | 10.71
SVM-G | LLE | 10.71
SVM-G | MDS | 10.71
SVM-G | MMDS | 10.71
RF | FSCORE | 13.39
KNN | LLE | 13.57
NBayes | FSCORE | 14.79
NBayes | No | 15.36
KNN | MDS | 15.43
KNN | MMDS | 15.43
SVM-G | NPE | 15.43
KNN | PCA | 16.64
SVM-G | SNE | 19.36
RF | No | 19.39
KNN | NPE | 21.00
NBayes | LLE | 21.64
SVM-G | Autoencoder | 21.86
KNN | SNE | 22.07
KNN | LPE | 22.79
NBayes | MDS | 26.50
NBayes | MMDS | 26.50
C4.5 | FSCORE | 28.11
NBayes | NPE | 29.57
NBayes | SNE | 29.93
KNN | Autoencoder | 30.36
C4.5 | No | 30.68
C4.5 | NPE | 30.86
NBayes | PCA | 31.36
C4.5 | MDS | 34.14
SVM-G | LSLS | 34.21
KNN | SAVE | 34.36
RF | NPE | 34.43
NBayes | Autoencoder | 34.64
C4.5 | PCA | 35.21
C4.5 | LLE | 35.57
C4.5 | MMDS | 37.00
RF | Autoencoder | 38.71
SVM-G | LPE | 38.79
C4.5 | LPE | 40.07
KNN | LSLS | 40.50
NBayes | LSLS | 42.86
RF | LSLS | 43.93
C4.5 | Autoencoder | 46.00
C4.5 | LSLS | 49.64
C4.5 | MMC | 49.93
RF | LPE | 50.43
SVM-G | SAVE | 52.14
RF | SAVE | 52.36
SVM-G | LEA | 53.43
RF | MMDS | 56.64
SVM-G | RNDPROJ | 56.64
C4.5 | SNE | 57.21
RF | MDS | 57.64
RF | PCA | 58.50
NBayes | LPE | 59.43
RF | LLE | 59.50
NBayes | LEA | 61.29
NBayes | SAVE | 62.14
RF | LEA | 62.21
RF | MMC | 62.50
KNN | LEA | 63.21
C4.5 | SAVE | 65.00
KNN | RNDPROJ | 67.43
NBayes | RNDPROJ | 67.50
NBayes | MMC | 67.93
RF | RNDPROJ | 70.50
KNN | LFDA | 71.93
RF | SNE | 73.89
C4.5 | RNDPROJ | 76.07
KNN | SLPE | 77.00
C4.5 | SLPE | 78.11
RF | LFDA | 78.14
RF | SLPE | 78.21
KNN | PFLPP | 78.43
RF | PFLPP | 78.64
SVM-G | SLPE | 78.75
C4.5 | LEA | 79.29
C4.5 | PFLPP | 79.29
NBayes | PFLPP | 79.29
SVM-G | LFDA | 79.29
SVM-G | MMC | 79.29
SVM-G | PCA | 79.29
SVM-G | PFLPP | 79.29
NBayes | SLPE | 79.43
C4.5 | LFDA | 79.79
NBayes | LFDA | 80.11

**Table 5.** Comparison of average ranks using the MCC metric for the 18 FR preprocessing options (including the no-preprocessing baseline). A separate ranking is computed for each classifier, in order to detect which reducer suits it best. The color code indicates the type of algorithm: linear unsupervised, linear supervised, or nonlinear unsupervised.

Feature Reducer | Average Rank
---|---
(a) KNN |
MMC | 1.07
FSCORE | 3.86
No | 4.57
LLE | 5.29
MDS | 5.93
MMDS | 5.93
PCA | 6.29
SNE | 7.50
LPE | 7.86
NPE | 8.07
Autoencoder | 10.79
SAVE | 11.29
LSLS | 12.57
LEA | 14.57
RNDPROJ | 15.43
LFDA | 16.14
SLPE | 16.79
PFLPP | 17.07
(b) SVM-G |
No | 1.07
FSCORE | 3.86
LLE | 4.21
MDS | 4.21
MMDS | 4.21
NPE | 5.93
Autoencoder | 6.64
SNE | 7.21
LSLS | 8.79
LPE | 9.71
SAVE | 11.57
LEA | 11.86
RNDPROJ | 12.43
SLPE | 15.71
LFDA | 15.89
MMC | 15.89
PCA | 15.89
PFLPP | 15.89
(c) C4.5 |
FSCORE | 3.46
NPE | 3.57
MDS | 4.43
No | 4.54
LLE | 4.71
PCA | 5.36
MMDS | 5.64
LPE | 6.14
Autoencoder | 8.79
LSLS | 9.64
MMC | 9.93
SNE | 11.93
SAVE | 13.36
RNDPROJ | 15.25
SLPE | 15.82
LEA | 16.04
PFLPP | 16.04
LFDA | 16.36
(d) RF |
FSCORE | 1.29
No | 2.07
NPE | 3.50
Autoencoder | 4.50
LSLS | 6.14
LPE | 7.43
SAVE | 8.36
MMDS | 9.36
MDS | 9.64
PCA | 10.14
LEA | 11.00
LLE | 11.00
MMC | 11.36
RNDPROJ | 12.93
SNE | 14.93
LFDA | 15.64
SLPE | 15.64
PFLPP | 16.07
(e) NBayes |
FSCORE | 2.86
No | 3.14
LLE | 3.79
MDS | 5.71
MMDS | 5.71
NPE | 6.14
SNE | 6.21
PCA | 6.29
Autoencoder | 7.00
LSLS | 9.43
LPE | 12.50
SAVE | 12.71
LEA | 13.00
MMC | 13.86
RNDPROJ | 14.00
PFLPP | 16.14
SLPE | 16.21
LFDA | 16.29

**Table 6.** Average ranks using the MCC metric of the balancing strategies for (**a**) the best configuration that uses an FS method and (**b**) the best configuration that uses an FR method.

Prior Balancing | Posterior Balancing | Average Rank
---|---|---
(a) SVM-RFE + SVM-G | |
ROS | No | 3.11
No | ROS | 3.11
No | SMOTE | 3.46
No | No | 3.57
SMOTE | No | 3.86
No | RUS | 4.25
RUS | No | 6.64
(b) MMC + KNN | |
No | No | 3.50
No | ROS | 3.50
No | SMOTE | 3.50
No | RUS | 3.68
SMOTE | No | 4.00
ROS | No | 4.29
RUS | No | 5.54

**Table 7.**Average execution time (hours, minutes, and seconds) for each preprocessing method to compute every fold, sorted in ascending order.

Preprocessing | Hours | Minutes | Seconds
---|---|---|---
LSLS | | | 3
FSCORE | | | 3
AUTOENCODER | | | 15
PCA | | 5 | 16
RNDPROJ | | 21 | 17
SVM-RFE | 1 | 24 | 56
LPE | 2 | 23 | 15
SAVE | 2 | 36 | 4
MMC | 4 | 8 | 58
PFLPP | 6 | 7 | 9
MMDS | 6 | 55 | 17
SLPE | 8 | 42 | 3
LLE | 10 | 27 | 7
SNE | 10 | 46 | 2
MDS | 11 | 11 | 34
LFDA | 11 | 43 | 20
NPE | 12 | 3 | 17
LEA | 19 | 11 | 8

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ramos-Pérez, I.; Barbero-Aparicio, J.A.; Canepa-Oneto, A.; Arnaiz-González, Á.; Maudes-Raedo, J.
An Extensive Performance Comparison between Feature Reduction and Feature Selection Preprocessing Algorithms on Imbalanced Wide Data. *Information* **2024**, *15*, 223.
https://doi.org/10.3390/info15040223
