A Hybrid Dimensionality Reduction Procedure Integrating Clustering with KNN-Based Feature Selection for Unsupervised Data
Abstract
1. Introduction
- Introduction of a new feature that assigns the cluster label (k) to each observation. This assignment effectively transforms the dataset into a supervised one with a designated target, facilitating the computation of performance metrics such as accuracy, precision, recall, and F1-score (a minimal sketch of this step follows below).
- Evaluation of each feature's impact/contribution on the overall cost measurement, calculated as the sum of squared errors to the data centroid (i.e., the vector of feature means obtained when k = 1).
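A minimal sketch of this idea using scikit-learn's KMeans; the synthetic dataset, the choice of k = 3, and the variable names here are illustrative assumptions, not the paper's settings:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative unsupervised dataset (no target column).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"f{j}" for j in range(5)])

# Cluster the observations and attach the cluster label as a new target column.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
X_labeled = X.copy()
X_labeled["cluster"] = kmeans.labels_  # the dataset can now be treated as supervised

# Cost of the trivial clustering with k = 1: sum of squared errors to the data centroid.
E = ((X - X.mean()) ** 2).to_numpy().sum()
print(X_labeled.head(), f"\nE = {E:.2f}")
```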
2. Method
- Dataset $D$ contains $n$ observations, $x_1, x_2, \ldots, x_n$, and $d$ features, i.e., $x_i = (x_{i1}, \ldots, x_{id})$ for $i = 1, \ldots, n$.
- For each feature $j$, denote its mean by $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, $j = 1, \ldots, d$, and let $\bar{x} = (\bar{x}_1, \ldots, \bar{x}_d)$. Define $E = \sum_{i=1}^{n}\sum_{j=1}^{d} (x_{ij} - \bar{x}_j)^2$, where $x_{ij}$ is the value of feature $j$ in observation $i$, and $i = 1, \ldots, n$, $j = 1, \ldots, d$. Note that $E$ is the sum of squared distances of the observations from the mean values of the features.
- In the context of the K-means algorithm, we denote cluster $k$ by $C_k$ and its centroid by $\mu_k$, for $k = 1, \ldots, K$. The total K-means cost is given by $SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$ (a short computational sketch follows below).
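For concreteness, a short sketch of how $E$ and the K-means cost $SSE$ can be computed with NumPy and scikit-learn; the function and variable names are illustrative, and `X` is assumed to be the normalized feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

def total_dispersion(X: np.ndarray) -> float:
    """E: sum of squared distances of all observations from the feature means
    (equivalently, the K-means cost when K = 1)."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

def kmeans_sse(X: np.ndarray, K: int, seed: int = 0) -> float:
    """Total K-means cost: sum over clusters of squared distances to the centroids."""
    km = KMeans(n_clusters=K, random_state=seed, n_init=10).fit(X)
    return float(km.inertia_)  # inertia_ is exactly the SSE defined above

rng = np.random.default_rng(0)
X = rng.random((500, 8))  # illustrative normalized data in [0, 1]
print(total_dispersion(X), kmeans_sse(X, K=4))
```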
- Examples of the stopping conditions set by the user are the following:
- d*—the maximum number of features that may be removed from D.
- A*—the maximum allowed percentage reduction in accuracy relative to the accuracy achieved when all features are used. Instead of accuracy, other measures derived from the confusion matrix, such as precision, recall, or F1 score, can be used.
- Users have the flexibility to define and incorporate additional stopping conditions (a minimal sketch of such a check follows this list).
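A minimal sketch of how the two example stopping conditions could be checked in code; the function name, arguments, and default thresholds are illustrative assumptions:

```python
def stopping_criteria_met(
    n_removed: int,
    current_accuracy: float,
    baseline_accuracy: float,
    d_star: int = 10,      # d*: max number of features that may be removed
    a_star: float = 0.05,  # A*: max allowed relative accuracy drop (here 5%)
) -> bool:
    """Return True if either user-defined stopping condition is violated."""
    too_many_removed = n_removed >= d_star
    accuracy_drop = (baseline_accuracy - current_accuracy) / baseline_accuracy
    too_much_loss = accuracy_drop > a_star
    return too_many_removed or too_much_loss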
1. Load dataset D (an unsupervised dataset, i.e., one with no target variable).
2. Normalize dataset D (via MinMaxScaler from scikit-learn).
3. Calculate E.
4. Determine the optimal number of clusters (K) for dataset D using the Elbow and Silhouette methods.
5. Calculate the K-means performance measures, such as the total cost SSE, the silhouette score, the centroid vectors, the cluster members, etc.
6. Based on the outcomes yielded by the K-means algorithm, create a new feature within dataset D to signify the clusters, and assign to each observation its respective cluster value [44]. This step shifts the problem from the domain of unsupervised learning to that of supervised learning.
7. Split dataset D into a training set and a test set and perform the KNN algorithm with k = 5 (note: future research can use, examine, modify, and adjust different values of k); that is, assign each observation in the test set to a class according to the classes of its 5 nearest observations in the training set. Also, perform Random Forest (RF) with n_estimators = 25. Note that the train and test sets are split in the same way, with the same random seed, throughout the entire process.
8. Compute the corresponding confusion-matrix measures (accuracy, precision, recall, and F1) derived from the KNN and Random Forest classifiers.
9. For each feature j, compute its importance level, defined by the ratio $E_j / E$ with $E_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ (the contribution of feature j to E), and order all importance levels from low to high.
10. While the stopping criteria are not met, do:
   - Remove feature m from D, where $E_m / E = \min_{j} E_j / E$; that is, remove the feature with the lowest importance level.
   - Perform the K-means algorithm on the new dataset (with one less feature) with the same K derived from the Elbow and Silhouette methods (step 4) and calculate the relevant performance measures mentioned in step 5.
   - Run steps 6, 7, 8, and 9.
11. Return the following outputs: the total cost SSE of the K-means algorithm, the number of members in each cluster, the confusion-matrix measures of the KNN and RF methods, and the number of features remaining in dataset D (a minimal end-to-end sketch of these steps follows this list).
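To make the loop concrete, the sketch below strings the steps together on synthetic data. It is a minimal illustration under several assumptions not taken from the paper: the number of clusters K is fixed instead of being chosen by the Elbow/Silhouette methods, only accuracy (not the full confusion-matrix set) drives the stopping check, and a candidate removal that would violate the A* threshold is simply rejected. Function and variable names (feature_importances, evaluate, d_star, a_star) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

SEED = 42

def feature_importances(X: pd.DataFrame) -> pd.Series:
    """Importance of feature j: E_j / E, i.e., its share of the total dispersion E."""
    e_j = ((X - X.mean()) ** 2).sum(axis=0)
    return e_j / e_j.sum()

def evaluate(X: pd.DataFrame, K: int) -> tuple[float, float]:
    """Cluster X, treat the cluster labels as a target, and return KNN and RF accuracy."""
    labels = KMeans(n_clusters=K, random_state=SEED, n_init=10).fit_predict(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=SEED)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    rf = RandomForestClassifier(n_estimators=25, random_state=SEED).fit(X_tr, y_tr)
    return accuracy_score(y_te, knn.predict(X_te)), accuracy_score(y_te, rf.predict(X_te))

# Illustrative run on synthetic data; K is fixed here instead of using Elbow/Silhouette.
rng = np.random.default_rng(SEED)
raw = pd.DataFrame(rng.normal(size=(600, 10)), columns=[f"f{j}" for j in range(10)])
X = pd.DataFrame(MinMaxScaler().fit_transform(raw), columns=raw.columns)
K, d_star, a_star = 3, 5, 0.05

baseline_knn, _ = evaluate(X, K)
removed = 0
while removed < d_star:                                # stopping condition d*
    weakest = feature_importances(X).idxmin()          # feature with the lowest E_j / E
    candidate = X.drop(columns=[weakest])
    knn_acc, rf_acc = evaluate(candidate, K)
    if (baseline_knn - knn_acc) / baseline_knn > a_star:   # stopping condition A*
        break
    X, removed = candidate, removed + 1
    print(f"removed {weakest}: {X.shape[1]} features left, KNN={knn_acc:.3f}, RF={rf_acc:.3f}")
```

Using one fixed random seed for the split and the classifiers mirrors the requirement in step 7 that the train/test split stay identical across iterations.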
- In contrast, existing dimensionality reduction techniques typically:
- Do not generate class labels as part of their process;
- Are not designed to optimize for the cluster-based classification task we have constructed;
- Often transform features rather than select them, which impacts interpretability differently.
- The following notation is used in the complexity analysis (combined in the bound shown after this list):
- t = number of iterations until convergence
- K = number of clusters
- n = number of data points
- d = number of features
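Assuming the standard Lloyd's implementation of K-means (a textbook bound, not stated explicitly above), these quantities combine as follows; here d* denotes the maximum number of removed features from the stopping conditions:

```latex
% Cost of a single K-means run (Lloyd's algorithm):
\[
  O(t \cdot K \cdot n \cdot d)
\]
% The elimination loop re-runs K-means (plus the KNN/RF evaluation) at most d* times,
% so the clustering portion of the whole procedure is bounded by
\[
  O\left(d^{*} \cdot t \cdot K \cdot n \cdot d\right)
\]
```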
3. Datasets
4. Results
4.1. Use Case 1—Results
4.2. Use Case 2—Results
4.3. Result Summary
5. Conclusions and Discussion
- Enable the “transfer” of an unsupervised dataset/use-case to a supervised dataset with a target feature, while still retaining relevance to unsupervised functionalities, such as clustering methods.
- Enable users to define stopping conditions based on the specific requirements and needs of their company or organization. The objective is for users to determine the appropriate termination point for the method to ensure it meets their criteria effectively.
- Demonstrate the capability to analyze the impact of each feature by considering the ratio of its distance measure to the total dataset distance measure obtained through unsupervised clustering (that is, the ratio of the feature's "cost" $E_j$ to the total cost $E$).
- A significant advantage of the proposed method is its agnosticism, making it effectively applicable to both supervised and unsupervised use cases. This agnostic nature is demonstrated through the research analysis conducted on distinct unsupervised datasets, such as the US Census Data (1990) and the Gas Turbine CO and NOx Emission [55], which are clustering-oriented and lack classification or target features. Furthermore, the method’s efficacy is also exhibited, analyzed, and examined on various supervised datasets, including the Taiwanese Bankruptcy Prediction [52] and the red and white wine quality datasets [54]. This ability to support analysis across diverse dataset types, irrespective of their designation as supervised or unsupervised, and without reliance on the presence of a target or label column, presents users with expanded opportunities to explore and evaluate the method’s applicability in various scenarios tailored to specific business and organizational requirements.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
KNN | K-Nearest Neighbors |
RF | Random Forest |
GAN | Generative Adversarial Networks |
VAE | Variational Autoencoders |
CNN | Convolutional Neural Network |
References
- James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112. [Google Scholar]
- Gnana, D.A.A.; Balamurugan, S.A.A.; Leavline, E.J. Literature review on feature selection methods for high-dimensional data. Int. J. Comput. Appl. 2016, 136, 9–17. [Google Scholar]
- Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
- Zeng, H.; Cheung, Y.M. Feature selection for clustering on high dimensional data. In Pacific Rim International Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2008; pp. 913–922. [Google Scholar]
- Liu, T.; Lu, Y.; Zhu, B.; Zhao, H. Clustering high-dimensional data via feature selection. Biometrics 2023, 79, 940–950. [Google Scholar] [CrossRef]
- Alimoussa, M.; Porebski, A.; Vandenbroucke, N.; Thami, R.O.H.; El Fkihi, S. Clustering-based sequential feature selection approach for high dimensional data classification. In VISIGRAPP (4: VISAPP); Scitepress: Setúbal, Portugal, 2021; pp. 122–132. [Google Scholar]
- Chormunge, S.; Jena, S. Correlation based feature selection with clustering for high dimensional data. J. Electr. Syst. Inf. Technol. 2018, 5, 542–549. [Google Scholar] [CrossRef]
- Song, Q.; Ni, J.; Wang, G. A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Trans. Knowl. Data Eng. 2011, 25, 1–14. [Google Scholar]
- Sinaga, K.P.; Yang, M.S. Unsupervised K-means clustering algorithm. IEEE Access 2020, 8, 80716–80727. [Google Scholar]
- Xu, R.; Wunsch, D. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar]
- Saxena, A.; Prasad, M.; Gupta, A.; Bharill, N.; Patel, O.P.; Tiwari, A.; Er, M.J.; Ding, W.; Lin, C.T. A review of clustering techniques and developments. Neurocomputing 2017, 267, 664–681. [Google Scholar]
- Bock, H.H. Clustering methods: A history of k-means algorithms. In Selected Contributions in Data Analysis and Classification; Springer: Berlin/Heidelberg, Germany, 2007; pp. 161–172. [Google Scholar]
- Yu, S.S.; Chu, S.W.; Wang, C.M.; Chan, Y.K.; Chang, T.C. Two improved k-means algorithms. Appl. Soft Comput. 2018, 68, 747–755. [Google Scholar] [CrossRef]
- Shi, C.; Wei, B.; Wei, S.; Wang, W.; Liu, H.; Liu, J. A quantitative discriminant method of elbow point for the optimal number of clusters in clustering algorithm. EURASIP J. Wirel. Commun. Netw. 2021, 2021, 31. [Google Scholar] [CrossRef]
- Dinh, D.T.; Fujinami, T.; Huynh, V.N. Estimating the optimal number of clusters in categorical data clustering by silhouette coefficient. In Knowledge and Systems Sciences: 20th International Symposium, KSS 2019; Springer: Singapore, 2019; pp. 1–17. [Google Scholar]
- Kingrani, S.K.; Levene, M.; Zhang, D. Estimating the number of clusters using diversity. Artif. Intell. Res. 2018, 7, 15–22. [Google Scholar]
- Ünlü, R.; Xanthopoulos, P. Estimating the number of clusters in a dataset via consensus clustering. Expert Syst. Appl. 2019, 125, 33–39. [Google Scholar] [CrossRef]
- Mamat, A.R.; Mohamed, F.S.; Mohamed, M.A.; Rawi, N.M.; Awang, M.I. Silhouette index for determining optimal k-means clustering on images in different color models. Int. J. Eng. Technol. 2018, 7, 105–109. [Google Scholar] [CrossRef]
- Wang, X.; Xu, Y. An improved index for clustering validation based on Silhouette index and Calinski-Harabasz index. In IOP Conference Series: Materials Science and Engineering; IOP Publishing: Bristol, UK, 2019; Volume 569, p. 052024. [Google Scholar] [CrossRef]
- Behura, A. The cluster analysis and feature selection: Perspective of machine learning and image processing. In Data Analytics in Bioinformatics: A Machine Learning Perspective; WILEY: Hoboken, NJ, USA, 2021; pp. 249–280. [Google Scholar]
- Li, S.; Zhu, J.; Feng, J.; Wan, D. Clustering-based feature selection for content based remote sensing image retrieval. In Image Analysis and Recognition: 9th International Conference, ICIAR 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 427–435. [Google Scholar]
- Czarnowski, I. Cluster-based instance selection for machine classification. Knowl. Inf. Syst. 2012, 30, 113–133. [Google Scholar] [CrossRef]
- Gallego, A.J.; Calvo-Zaragoza, J.; Valero-Mas, J.J.; Rico-Juan, J.R. Clustering-based k-nearest neighbor classification for large-scale data with neural codes representation. Pattern Recognit. 2018, 74, 531–543. [Google Scholar]
- Rezaei, M.; Cribben, I.; Samorani, M. A clustering-based feature selection method for automatically generated relational attributes. Ann. Oper. Res. 2020, 286, 1–37. [Google Scholar] [CrossRef]
- Xu, D.; Zhang, J.; Xu, H.; Zhang, Y.; Chen, W.; Gao, R.; Dehmer, M. Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data. BMC Genom. 2020, 21, 1–17. [Google Scholar] [CrossRef]
- Moslehi, F.; Haeri, A. A novel feature selection approach based on clustering algorithm. J. Stat. Comput. Simul. 2021, 91, 581–604. [Google Scholar] [CrossRef]
- Solorio-Fernández, S.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J.F. A new hybrid filter-wrapper feature selection method for clustering based on ranking. Neurocomputing 2016, 214, 866–880. [Google Scholar] [CrossRef]
- Tan, P.-N.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Addison-Wesley: Boston, MA, USA, 2005. [Google Scholar]
- Ding, C.; Li, T. Adaptive dimension reduction using discriminant analysis and K-means clustering. In ACM International Conference Proceeding Series; ACM: New York, NY, USA, 2007; Volume 227, pp. 521–528. [Google Scholar] [CrossRef]
- Chao, G.; Luo, Y.; Ding, W. Recent advances in supervised dimension reduction: A survey. Mach. Learn. Knowl. Extr. 2019, 1, 341–358. [Google Scholar] [CrossRef]
- Nanga, S.; Bawah, A.T.; Acquaye, B.A.; Billa, M.-I.; Baeta, F.D.; Odai, N.A.; Obeng, S.K.; Nsiah, A.D. Review of dimension reduction methods. J. Data Anal. Inf. Process. 2021, 9, 189–231. [Google Scholar] [CrossRef]
- Chen, R.C.; Dewi, C.; Huang, S.W.; Caraka, R.E. Selecting critical features for data classification based on machine learning methods. J. Big Data 2020, 7, 52. [Google Scholar] [CrossRef]
- Iranzad, R.; Liu, X. A review of random forest-based feature selection methods for data science education and applications. Int. J. Data Sci. Anal. 2024, 1–15. [Google Scholar] [CrossRef]
- Leng, M.; Li, L.; Chen, X. K-means clustering algorithm based on semi-supervised learning. In Proceedings of the 2008 International Conference on Computer Science and Software Engineering, Wuhan, China, 12–14 December 2008; Volume 1, pp. 1112–1115. [Google Scholar]
- Shukla, A.; Singh, G.; Anand, C.S. Semi-supervised clustering with neural networks. Neural Comput. Appl. 2020, 33, 4513–4530. [Google Scholar]
- Oyewole, G.J.; Thopil, G.A. Data clustering: Application and trends. Artif. Intell. Rev. 2023, 56, 6439–6475. [Google Scholar] [CrossRef] [PubMed]
- Martinez, E.; Jacome, R.; Hernandez-Rojas, A.; Arguello, H. Ld-gan: Low-dimensional generative adversarial network for spectral image generation with variance regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 265–275. [Google Scholar]
- Mak, H.W.L.; Han, R.; Yin, H.H. Application of variational autoEncoder (VAE) model and image processing approaches in game design. Sensors 2023, 23, 3457. [Google Scholar] [CrossRef]
- Goel, A.; Majumdar, A.; Chouzenoux, E.; Chierchia, G. Deep convolutional k-means clustering. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 211–215. [Google Scholar]
- Bisen, D.; Lilhore, U.K.; Manoharan, P.; Dahan, F.; Mzoughi, O.; Hajjej, F.; Saurabh, P.; Raahemifar, K. A hybrid deep learning model using CNN and K-Mean clustering for energy efficient modelling in mobile EdgeIoT. Electronics 2023, 12, 1384. [Google Scholar] [CrossRef]
- Feigin, Y.; Spitzer, H.; Giryes, R. Cluster with gans. Comput. Vis. Image Underst. 2022, 225, 103571. [Google Scholar]
- Koren, O.; Hallin, C.A.; Koren, M.; Issa, A.A. AutoML classifier clustering procedure. Int. J. Intell. Syst. 2022, 37, 4214–4232. [Google Scholar] [CrossRef]
- Koren, O.; Koren, M.; Sabban, A. AutoML-Optimal K procedure. In Proceedings of the 2nd International Conference on Advanced Enterprise Information Systems (AEIS 2022), London, UK, 2–4 December 2022; pp. 110–119. [Google Scholar] [CrossRef]
- Reback, J.; McKinney, W.; Jbrockmendel; Den Bossche, J.V.; Augspurger, T.; Cloud, P.; Hawkins, S.; Gfyoung; Sinhrks; Roeschke, M.; et al. Pandas-Dev/Pandas: Pandas 1.0.5; Zenodo: Genève, Switzerland, 2020. [Google Scholar] [CrossRef]
- McKinney, W. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 56–61. [Google Scholar] [CrossRef]
- Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9, 90–95. [Google Scholar] [CrossRef]
- Meek, C.; Thiesson, B.; Heckerman, D. US Census Data (1990) [Dataset]. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/dataset/116/us+census+data+1990 (accessed on 10 March 2024).
- Taiwanese Bankruptcy Prediction [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 2020. [CrossRef]
- Colonna, J.; Nakamura, E.; Cristo, M.; Gordo, M. Anuran Calls (MFCCs) [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 2015. [Google Scholar] [CrossRef]
- Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Wine quality [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 2009. [Google Scholar] [CrossRef]
- Kaya, H.; Tüfekci, P. Gas Turbine CO and NOx Emission Data Set [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 2019. [Google Scholar] [CrossRef]
- Lyon, R. HTRU2 [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 2015. [Google Scholar] [CrossRef]
- Image Segmentation [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 1990. [CrossRef]
- Slate, D. Letter Recognition [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 1991. [Google Scholar] [CrossRef]
- Estimation of Obesity Levels Based on Eating Habits and Physical Condition [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 2019. [CrossRef]
- Sakar, C.; Kastro, Y. Online Shoppers Purchasing Intention Dataset [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 2018. [Google Scholar] [CrossRef]
- Renjith, S. Travel Reviews [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 2018. [Google Scholar] [CrossRef]
- Breiman, L.; Stone, C. Waveform Database Generator (Version 1) [Dataset]; UCI Machine Learning Repository: Irvine, CA, USA, 1984. [Google Scholar] [CrossRef]
No. | Dataset | Observations | Original Features | Features After Preprocessing | Description |
---|---|---|---|---|---|
1 | US Census Data [49] | 2,458,285 | 69 | 68 | Demographic data from the 1990 US Census |
2 | Taiwanese Bankruptcy Prediction [50] | 6819 | 96 | 95 | Financial ratios and bankruptcy status of Taiwanese companies |
3 | Anuran Calls (MFCCs) [51] | 7195 | 26 | 22 | Acoustic features extracted from frog call recordings |
4a | Wine Quality-Red [52] | 1599 | 12 | 11 | Physicochemical properties and quality scores of red wines |
4b | Wine Quality-White [52] | 4898 | 12 | 11 | Physicochemical properties and quality scores of white wines |
5 | Gas Turbine CO and NOx Emission [53] | 36,733 | 12 | 11 | Gas turbine parameters and resulting emissions |
6 | HTRU2 [54] | 17,898 | 9 | 8 | Pulsar candidates from radio telescope data |
7 | Image Segmentation [55] | 210 | 19 | 19 | Features of image segments for classification |
8 | Letter Recognition [56] | 20,000 | 17 | 16 | Features derived from pixel displays of capital letters in the English alphabet |
9 | Obesity Levels [57] | 2111 | 17 | 16 | Eating habits and physical condition data related to obesity |
10 | Online Shoppers Purchasing Intention [58] | 12,330 | 18 | 17 | Features from e-commerce website sessions |
11 | Travel Reviews [59] | 980 | 11 | 10 | User reviews of travel destinations |
12 | Waveform Database Generator [60] | 5000 | 22 | 21 | Generated waveform data with noise |
Use Case 1 results (Taiwanese Bankruptcy Prediction): KNN and RF measures as features are removed, one per iteration:
Iteration | Features | KNN Accuracy | KNN Precision | KNN Recall | KNN F1 | RF Accuracy | RF Precision | RF Recall | RF F1
---|---|---|---|---|---|---|---|---|---|
0 | 95 | 0.925513 | 0.925768 | 0.925513 | 0.925276 | 0.893842 | 0.89456 | 0.893842 | 0.893539 |
1 | 94 | 0.925513 | 0.925768 | 0.925513 | 0.925276 | 0.898534 | 0.898255 | 0.898534 | 0.897897 |
2 | 93 | 0.897361 | 0.897524 | 0.897361 | 0.896823 | 0.879765 | 0.880411 | 0.879765 | 0.879201 |
3 | 92 | 0.912023 | 0.912355 | 0.912023 | 0.911173 | 0.909677 | 0.909573 | 0.909677 | 0.909445 |
4 | 91 | 0.922581 | 0.923023 | 0.922581 | 0.922094 | 0.919062 | 0.919217 | 0.919062 | 0.918622 |
… | … | … | … | … | … | … | … | … | … |
84 | 11 | 0.929619 | 0.929397 | 0.929619 | 0.928962 | 0.953079 | 0.953367 | 0.953079 | 0.952761 |
85 | 10 | 0.938416 | 0.938324 | 0.938416 | 0.938026 | 0.956012 | 0.956168 | 0.956012 | 0.955779 |
86 | 9 | 0.931378 | 0.931829 | 0.931378 | 0.931386 | 0.951906 | 0.952107 | 0.951906 | 0.951898 |
87 | 8 | 0.950147 | 0.950698 | 0.950147 | 0.950026 | 0.957771 | 0.958516 | 0.957771 | 0.957631 |
88 | 7 | 0.953079 | 0.953313 | 0.953079 | 0.952945 | 0.958944 | 0.95899 | 0.958944 | 0.95872 |
Use Case 2 results (Gas Turbine CO and NOx Emission): KNN and RF measures as features are removed, one per iteration:
Iteration | Features | KNN Accuracy | KNN Precision | KNN Recall | KNN F1 | RF Accuracy | RF Precision | RF Recall | RF F1
---|---|---|---|---|---|---|---|---|---|
0 | 11 | 0.972125 | 0.972116 | 0.972125 | 0.972108 | 0.977243 | 0.977243 | 0.977243 | 0.977227 |
1 | 10 | 0.97169 | 0.971708 | 0.97169 | 0.971657 | 0.977461 | 0.977472 | 0.977461 | 0.977444 |
2 | 9 | 0.977025 | 0.977022 | 0.977025 | 0.97702 | 0.98149 | 0.981481 | 0.98149 | 0.981483 |
3 | 8 | 0.984321 | 0.984318 | 0.984321 | 0.984317 | 0.982905 | 0.982916 | 0.982905 | 0.982904 |
4 | 7 | 0.986172 | 0.986177 | 0.986172 | 0.986171 | 0.985845 | 0.985846 | 0.985845 | 0.985844 |
5 | 6 | 0.989765 | 0.989788 | 0.989765 | 0.989772 | 0.988676 | 0.988674 | 0.988676 | 0.988673 |
For each dataset, the first row shows the measures obtained with the full feature set and the second row the measures obtained with the reduced feature set:
Dataset | Features | KNN Accuracy | KNN Precision | KNN Recall | KNN F1 | RF Accuracy | RF Precision | RF Recall | RF F1
---|---|---|---|---|---|---|---|---|---|
WineQualitywhite | 11 | 0.933061 | 0.933977 | 0.933061 | 0.933359 | 0.952653 | 0.952469 | 0.952653 | 0.952492 |
WineQualitywhite | 3 | 0.964082 | 0.964386 | 0.964082 | 0.964108 | 0.974694 | 0.974643 | 0.974694 | 0.974644
gas_turbine_co__nox_emission | 11 | 0.972125 | 0.972116 | 0.972125 | 0.972108 | 0.977243 | 0.977243 | 0.977243 | 0.977227 |
gas_turbine_co__nox_emission | 6 | 0.989765 | 0.989788 | 0.989765 | 0.989772 | 0.988676 | 0.988674 | 0.988676 | 0.988673
Anuran Calls | 22 | 0.972207 | 0.97255 | 0.972207 | 0.972262 | 0.976654 | 0.976976 | 0.976654 | 0.976708 |
Anuran Calls | 9 | 0.968872 | 0.968996 | 0.968872 | 0.968864 | 0.971095 | 0.971276 | 0.971095 | 0.97111
WineQualityRed | 11 | 0.9325 | 0.932676 | 0.9325 | 0.932416 | 0.9525 | 0.952393 | 0.9525 | 0.952294 |
WineQualityRed | 4 | 0.9275 | 0.929228 | 0.9275 | 0.927443 | 0.96 | 0.960148 | 0.96 | 0.959791
Waveform Database Generator | 21 | 0.9424 | 0.942606 | 0.9424 | 0.942071 | 0.944 | 0.945325 | 0.944 | 0.943642 |
Waveform Database Generator | 12 | 0.9504 | 0.950505 | 0.9504 | 0.950391 | 0.9528 | 0.952988 | 0.9528 | 0.952672
USCensusData | 68 | 0.98464 | 0.984621 | 0.98464 | 0.984617 | 0.99144 | 0.991458 | 0.99144 | 0.991437 |
USCensusData | 26 | 0.99636 | 0.996363 | 0.99636 | 0.996359 | 0.99748 | 0.997482 | 0.99748 | 0.997481
Letter Recognition | 16 | 0.9222 | 0.922564 | 0.9222 | 0.922236 | 0.9374 | 0.93765 | 0.9374 | 0.937405 |
Letter Recognition | 7 | 0.9384 | 0.938591 | 0.9384 | 0.938345 | 0.9616 | 0.961737 | 0.9616 | 0.961619
Obesity Levels | 26 | 0.969697 | 0.969738 | 0.969697 | 0.969475 | 0.981061 | 0.981134 | 0.981061 | 0.980973 |
Obesity Levels | 8 | 0.981061 | 0.981709 | 0.981061 | 0.980926 | 0.998106 | 0.99813 | 0.998106 | 0.998104
online_shoppers_purchasing | 28 | 0.999351 | 0.999355 | 0.999351 | 0.999351 | 0.999676 | 0.999677 | 0.999676 | 0.999675 |
online_shoppers_purchasing | 9 | 0.999351 | 0.999353 | 0.999351 | 0.99935 | 0.999676 | 0.999677 | 0.999676 | 0.999676
TaiwaneseBankruptcy | 95 | 0.925513 | 0.925768 | 0.925513 | 0.925276 | 0.893842 | 0.89456 | 0.893842 | 0.893539 |
TaiwaneseBankruptcy | 7 | 0.953079 | 0.953313 | 0.953079 | 0.952945 | 0.958944 | 0.95899 | 0.958944 | 0.95872
Travel Reviews | 10 | 0.959184 | 0.959346 | 0.959184 | 0.95924 | 0.95102 | 0.951737 | 0.95102 | 0.950582 |
Travel Reviews | 5 | 0.942857 | 0.943079 | 0.942857 | 0.942748 | 0.959184 | 0.959173 | 0.959184 | 0.958663
HTRU2 | 8 | 0.974302 | 0.974309 | 0.974302 | 0.974284 | 0.973855 | 0.973854 | 0.973855 | 0.973835 |
HTRU2 | 3 | 0.991285 | 0.991308 | 0.991285 | 0.991288 | 0.987039 | 0.987052 | 0.987039 | 0.987042
Image Segmentation | 19 | 0.924528 | 0.928459 | 0.924528 | 0.924141 | 0.943396 | 0.94366 | 0.943396 | 0.942797 |
Image Segmentation | 10 | 0.962264 | 0.965552 | 0.962264 | 0.961769 | 0.943396 | 0.948256 | 0.943396 | 0.941302
Change in each measure after feature reduction (value obtained with the reduced feature set minus value obtained with all features):
Dataset | KNN Accuracy | KNN Precision | KNN Recall | KNN F1 | RF Accuracy | RF Precision | RF Recall | RF F1
---|---|---|---|---|---|---|---|---|
WineQualitywhite | 0.031021 | 0.030409 | 0.031021 | 0.030749 | 0.022041 | 0.022174 | 0.022041 | 0.022152 |
gas_turbine_co__nox_emission | 0.01764 | 0.017672 | 0.01764 | 0.017664 | 0.011433 | 0.011431 | 0.011433 | 0.010773 |
Waveform Database Generator | 0.008 | 0.007899 | 0.008 | 0.00832 | 0.0088 | 0.007663 | 0.0088 | 0.00903 |
USCensusData | 0.01172 | 0.011742 | 0.01172 | 0.011742 | 0.00604 | 0.006024 | 0.00604 | 0.006044 |
Letter Recognition | 0.0162 | 0.016027 | 0.0162 | 0.016109 | 0.0242 | 0.024087 | 0.0242 | 0.024214 |
Obesity Levels | 0.011364 | 0.011971 | 0.011364 | 0.011451 | 0.017045 | 0.016996 | 0.017045 | 0.017131 |
TaiwaneseBankruptcy | 0.027566 | 0.027545 | 0.027566 | 0.027669 | 0.065102 | 0.06443 | 0.065102 | 0.065181 |
HTRU2 | 0.016983 | 0.016999 | 0.016983 | 0.017004 | 0.013184 | 0.013198 | 0.013184 | 0.013207 |
Image Segmentation | 0.037736 | 0.037093 | 0.037736 | 0.037628 | 0 | 0.004596 | 0 | −0.0015 |
Anuran Calls | −0.00334 | −0.00355 | −0.00334 | −0.0034 | −0.00556 | −0.0057 | −0.00556 | −0.0056 |
WineQualityRed | −0.005 | −0.00345 | −0.005 | −0.00497 | 0.0075 | 0.007755 | 0.0075 | 0.007497 |
online_shoppers_purchasing | 0 | −2 × 10^−6 | 0 | −1 × 10^−6 | 0 | 0 | 0 | 1 × 10^−6
Travel Reviews | −0.01633 | −0.01627 | −0.01633 | −0.01649 | 0.008164 | 0.007436 | 0.008164 | 0.008081 |