Simultaneous Instance and Attribute Selection for Noise Filtering
Abstract
1. Introduction
- We introduce ROFS, a deterministic method for simultaneously selecting features and instances in hybrid (mixed numeric–categorical) and incomplete data; a distance function compatible with such data is sketched after this list.
- The proposed model does not presuppose any particular supervised classifier and can therefore be paired with different supervised classifiers.
- We analyze the performance of the compared algorithms in noisy environments and show that our proposal outperforms the others in recognizing and deleting noisy instances.
- The statistical analysis concludes that our proposal obtains significantly more accurate results while using only a fraction of the instances and attributes.
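Handling hybrid and incomplete data with nearest-neighbor machinery requires a distance that tolerates categorical attributes and missing values; the paper cites the heterogeneous distance functions of Wilson and Martinez [29]. Below is a minimal sketch of one such function, HEOM; the function names and the min–max `ranges` argument are our own assumptions, not the paper's notation.

```python
import math

def _missing(v):
    # Treat None and NaN as missing attribute values
    return v is None or (isinstance(v, float) and math.isnan(v))

def heom(a, b, is_categorical, ranges):
    """Heterogeneous Euclidean-Overlap Metric (HEOM) for instances with
    mixed numeric/categorical attributes and missing values.
    a, b           : sequences of attribute values
    is_categorical : per-attribute flags (True = categorical)
    ranges         : per-attribute numeric range (max - min); ignored
                     for categorical attributes
    """
    total = 0.0
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if _missing(x) or _missing(y):
            d = 1.0                           # maximal distance on missing values
        elif cat:
            d = 0.0 if x == y else 1.0        # overlap metric for categories
        else:
            d = abs(x - y) / r if r else 0.0  # range-normalized numeric difference
        total += d * d
    return math.sqrt(total)

# Example: two hybrid, incomplete instances
print(heom([2.0, "red"], [5.0, None], [False, True], [10.0, None]))  # ~1.044
```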
2. Materials and Methods
2.1. Algorithms for Simultaneous Instance and Feature Selection
2.2. Datasets
2.3. Performance Measures
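Judging from the result tables below, performance is reported as the classifier error on the testing set together with the fractions of instances and attributes retained by each selector. A minimal sketch of these three measures (the function names are ours, not the paper's):

```python
def classifier_error(y_true, y_pred):
    """Fraction of misclassified testing instances."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def instance_retention(n_selected_instances, m):
    """Fraction of the m original training instances that were kept."""
    return n_selected_instances / m

def feature_retention(n_selected_attributes, n):
    """Fraction of the n original attributes that were kept."""
    return n_selected_attributes / n
```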
3. Results
Algorithm 1. Adaptive All-kNN method

Algorithm 2. Robust Objective Filtering Selection (ROFS)
Inputs: training set: T; method to compute candidate attribute sets: CA; method to condense instances: Cond; supervised classifier: classif
Output: selected instances and attributes: E
Steps: Initialization; Phases 1–3 (the overall data flow is sketched below)
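As a reading aid, here is a structural sketch of the ROFS data flow implied by its declared inputs and outputs, the symbol table, and design points (a)–(d) below: candidate attribute sets give multiple views, each view is condensed into a small piece, and candidate integrations are kept only when the supervised classifier does not lose accuracy. The helper signatures and the exact acceptance rule are assumptions, not the paper's pseudocode.

```python
def rofs(T, CA, Cond, classif_error):
    """Structural sketch of ROFS (not the paper's exact phase logic).
    T             : training set, a list of (attributes_dict, label) pairs
    CA            : callable, T -> list of candidate attribute sets (views)
    Cond          : callable condensing a list of pairs, e.g., CNN [34]
    classif_error : callable, (candidate_selection, T) -> error of the
                    supervised classifier built on the candidate
    """
    E, best_err = [], float("inf")            # current selection and its error
    for ca in CA(T):
        # small piece: training instances restricted to the attributes in ca
        SP = [({a: x[a] for a in ca}, y) for x, y in T]
        csp = Cond(SP)                        # condensed small piece
        EI = E + [p for p in csp if p not in E]   # candidate integration (fusion)
        err = classif_error(EI, T)
        if err <= best_err:                   # keep only non-degrading integrations
            E, best_err = EI, err
    return E                                  # selected instances and attributes
```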
- (a) The use of candidate attribute sets is inspired by Voting Algorithms [26] and has obtained good results by supporting multiple views of the existing data.
- (b) The design of the Adaptive All-kNN keeps the desirable property of the original All-kNN algorithm [28] of reducing the Bayes error while solving its drawback of deleting too many instances (a sketch of the classic rule follows this list).
- (c) The use in the experiments of a baseline algorithm (the Condensed Nearest Neighbor, CNN [34]) as the condensation method preserves the decision boundary of the data and leaves room for further improvement, given the extensive research on data condensation techniques since 1968.
- (d) Obtaining candidate small pieces guarantees multiple views of the same dataset, and the subsequent fusion procedure selects relevant attributes and instances with minimal information loss.
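For context on point (b): classic All-kNN [28] flags a training instance whenever its k nearest neighbors outvote its label for any k = 1..kMax, then deletes every flagged instance — exactly the over-deletion the adaptive variant addresses. A sketch of the classic rule (the adaptive safeguard is only paraphrased in the docstring, since its details are in the paper):

```python
def all_knn_filter(X, y, k_max, distance):
    """Classic All-kNN editing (Tomek [28]): flag instance i if, for any
    k = 1..k_max, the majority label of its k nearest neighbors differs
    from y[i]; then delete all flagged instances.  The paper's Adaptive
    All-kNN keeps this noise-filtering behavior while limiting how many
    instances get deleted; that safeguard is not reproduced here.
    distance : callable on two instances (e.g., the HEOM sketch above).
    """
    m = len(X)
    flagged = [False] * m
    for i in range(m):
        # neighbors of i sorted by distance to i
        neigh = sorted((j for j in range(m) if j != i),
                       key=lambda j: distance(X[i], X[j]))
        for k in range(1, k_max + 1):
            labels = [y[j] for j in neigh[:k]]
            majority = max(set(labels), key=labels.count)
            if majority != y[i]:              # misclassified for this k
                flagged[i] = True
                break
    keep = [i for i in range(m) if not flagged[i]]
    return [X[i] for i in keep], [y[i] for i in keep]
```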
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Symbol | Description |
---|---|
T | Training set |
X | Testing set |
x | Testing instance |
α(x) | True label of x |
classif | Supervised classifier |
classif(x) | Label assigned to x by the supervised classifier |
P | Instance set returned by data preprocessing |
V | Validation set |
F | Instance set resulting from applying the Adaptive All-kNN algorithm |
A | Attribute set describing the instances |
n | Number of attributes describing the instances |
B | Attributes selected by the preprocessing algorithms |
k | Number of neighbors |
kMax | Maximum number of neighbors |
CAS | Set of Candidate Attribute Sets |
cai | Candidate attribute set, an element of CAS |
CA | Method to compute Candidate Attribute Sets |
Cond | Condensation method for instance selection |
Ci | The i-th decision class |
SP | Small piece of instances and attributes |
csp | Result of condensing an SP |
Miss | Instances misclassified by the current solution |
EI | Candidate integration |
E | Set of selected instances and features returned by the ROFS algorithm |
m | Number of instances |
CF | Computational complexity of the procedure to obtain candidate feature sets |
D | Computational complexity of the condensing algorithm |
S | Computational complexity of the supervised classifier |
R | Computational complexity of the sorting algorithm |
f | Number of candidate attribute sets obtained |
References
- Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
- Dixit, A.; Mani, A. Sampling technique for noisy and borderline examples problem in imbalanced classification. Appl. Soft Comput. 2023, 142, 110361.
- Song, H.; Kim, M.; Park, D.; Shin, Y.; Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8135–8153.
- Li, C.; Mao, Z. A label noise filtering method for regression based on adaptive threshold and noise score. Expert Syst. Appl. 2023, 228, 120422.
- Theng, D.; Bhoyar, K.K. Feature selection techniques for machine learning: A survey of more than two decades of research. Knowl. Inf. Syst. 2024, 66, 1575–1637.
- Cunha, W.; Viegas, F.; França, C.; Rosa, T.; Rocha, L.; Gonçalves, M.A. A comparative survey of instance selection methods applied to non-neural and transformer-based text classification. ACM Comput. Surv. 2023, 55, 1–52.
- Kuncheva, L.I.; Jain, L.C. Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognit. Lett. 1999, 20, 1149–1156.
- Pérez-Rodríguez, J.; Arroyo-Peña, A.G.; García-Pedrajas, N. Simultaneous instance and feature selection and weighting using evolutionary computation: Proposal and study. Appl. Soft Comput. 2015, 37, 416–443.
- Villuendas-Rey, Y.; Rey-Benguría, C.; Lytras, M.; Yáñez-Márquez, C.; Camacho-Nieto, O. Simultaneous instance and feature selection for improving prediction in special education data. Program 2017, 51, 278–297.
- García-Pedrajas, N.; del Castillo, J.A.R.; Cerruela-García, G. SI(FS)2: Fast simultaneous instance and feature selection for datasets with many features. Pattern Recognit. 2021, 111, 107723.
- Ishibuchi, H.; Nakashima, T. Evolution of reference sets in nearest neighbor classification. In Simulated Evolution and Learning: Selected Papers of the Second Asia-Pacific Conference on Simulated Evolution and Learning (SEAL'98), Canberra, Australia, 24–27 November 1998; Springer: Berlin/Heidelberg, Germany, 1999; pp. 82–89.
- Ahn, H.; Kim, K.-J.; Han, I. A case-based reasoning system with the two-dimensional reduction technique for customer classification. Expert Syst. Appl. 2007, 32, 1011–1019.
- Skalak, D.B. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 293–301.
- Derrac, J.; Cornelis, C.; García, S.; Herrera, F. Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection. Inf. Sci. 2012, 186, 73–92.
- García-Pedrajas, N.; De Haro-García, A.; Pérez-Rodríguez, J. A scalable approach to simultaneous evolutionary instance and feature selection. Inf. Sci. 2013, 228, 150–174.
- Dasarathy, B.; Sánchez, J. Concurrent feature and prototype selection in the nearest neighbor based decision process. In Proceedings of the 4th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, FL, USA, 23–26 July 2000; pp. 628–633.
- Kittler, J. Feature set search algorithms. In Pattern Recognition and Signal Processing; Chen, C.J., Ed.; Springer: Dordrecht, The Netherlands, 1978; pp. 41–69.
- Toussaint, G.T. Proximity graphs for nearest neighbor decision rules: Recent progress. In Proceedings of Interface 2002, the 34th Symposium on Computing and Statistics, Montreal, QC, Canada, 17–20 April 2002.
- Dasarathy, B.V. Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design. IEEE Trans. Syst. Man Cybern. 1994, 24, 511–517.
- Villuendas-Rey, Y.; García-Borroto, M.; Medina-Pérez, M.A.; Ruiz-Shulcloper, J. Simultaneous features and objects selection for mixed and incomplete data. In Proceedings of the Iberoamerican Congress on Pattern Recognition, Cancun, Mexico, 14–17 November 2006; pp. 597–605.
- Villuendas-Rey, Y.; García-Borroto, M.; Ruiz-Shulcloper, J. Selecting features and objects for mixed and incomplete data. In Proceedings of the 13th Iberoamerican Congress on Pattern Recognition (CIARP 2008), Havana, Cuba, 9–12 September 2008; pp. 381–388.
- García-Borroto, M.; Ruiz-Shulcloper, J. Selecting prototypes in mixed incomplete data. In Proceedings of the Iberoamerican Congress on Pattern Recognition, Havana, Cuba, 15–18 November 2005; pp. 450–459.
- Santiesteban, Y.; Pons-Porrata, A. LEX: A new algorithm for the calculus of typical testors. Math. Sci. J. 2003, 21, 85–95.
- Villuendas-Rey, Y.; Yáñez-Márquez, C.; Camacho-Nieto, O. Ant-based feature and instance selection for multiclass imbalanced data. IEEE Access 2024, online ahead of print.
- Kelly, M.; Longjohn, R.; Nottingham, K. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 14 April 2024).
- Rodríguez-Salas, D.; Lazo-Cortés, M.S.; Mollineda, R.A.; Olvera-López, J.A.; de la Calleja, J.; Benitez, A. Voting algorithms model with a support sets system by class. In Proceedings of the 13th Mexican International Conference on Artificial Intelligence (MICAI 2014), Tuxtla Gutiérrez, Mexico, 16–22 November 2014; pp. 128–139.
- Zhao, J.; Xie, X.; Xu, X.; Sun, S. Multi-view learning overview: Recent progress and new challenges. Inf. Fusion 2017, 38, 43–54.
- Tomek, I. An experiment with the edited nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 1976, SMC-6, 448–452.
- Wilson, D.R.; Martinez, T.R. Improved heterogeneous distance functions. J. Artif. Intell. Res. 1997, 6, 1–34.
- Hernández-Castaño, J.A.; Villuendas-Rey, Y.; Camacho-Nieto, O.; Yáñez-Márquez, C. Experimental platform for intelligent computing (EPIC). Comput. Sist. 2018, 22, 245–253.
- García, S.; Herrera, F. An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. J. Mach. Learn. Res. 2008, 9, 2677–2694.
- Triguero, I.; González, S.; Moyano, J.M.; García López, S.; Alcalá Fernández, J.; Luengo Martín, J.; Fernández Hilario, A.; Díaz, J.; Sánchez, L.; Herrera Triguero, F. KEEL, Version 3.0: An Open Source Software for Multi-Stage Analysis in Data Mining; University of Granada: Granada, Spain, 2017.
- Gómez, J.P.; Montero, F.E.H.; Sotelo, J.C.; Mancilla, J.C.G.; Rey, Y.V. RoPM: An algorithm for computing typical testors based on recursive reductions of the basic matrix. IEEE Access 2021, 9, 128220–128232.
- Hart, P. The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 1968, 14, 515–516.
Datasets | Instances | Numerical Features | Categorical Features | Classes | Missing Values |
---|---|---|---|---|---|
autos | 205 | 15 | 10 | 7 | x |
breast-w | 699 | 9 | 0 | 2 | |
credit-a | 690 | 6 | 9 | 2 | x |
diabetes | 768 | 8 | 0 | 2 | |
heart-c | 303 | 6 | 7 | 5 | |
hepatitis | 155 | 6 | 13 | 2 | x |
iris | 150 | 4 | 0 | 3 | |
labor | 57 | 8 | 8 | 2 | x |
lymph | 148 | 3 | 15 | 4 | |
post-operative | 90 | 0 | 8 | 3 | x |
primary-tumor | 339 | 1 | 16 | 22 | x |
vehicle | 846 | 18 | 0 | 4 | |
vote | 435 | 0 | 16 | 2 | x |
wine | 178 | 13 | 0 | 3 | |
zoo | 101 | 1 | 16 | 7 |
Datasets | 0% Noise | 5% Noise | 10% Noise | 15% Noise |
---|---|---|---|---|
autos | 0.444 | 0.509 | 0.507 | 0.430 |
breast-w | 0.060 | 0.054 | 0.046 | 0.059 |
credit-a | 0.174 | 0.201 | 0.188 | 0.197 |
diabetes | 0.293 | 0.299 | 0.298 | 0.319 |
heart-c | 0.254 | 0.284 | 0.286 | 0.290 |
hepatitis | 0.226 | 0.226 | 0.221 | 0.273 |
iris | 0.107 | 0.080 | 0.040 | 0.080 |
labor | 0.137 | 0.233 | 0.213 | 0.207 |
lymph | 0.181 | 0.291 | 0.270 | 0.298 |
post-operative | 0.333 | 0.311 | 0.422 | 0.444 |
primary-tumor | 0.623 | 0.634 | 0.640 | 0.655 |
vehicle | 0.336 | 0.359 | 0.364 | 0.348 |
vote | 0.062 | 0.065 | 0.090 | 0.110 |
wine | 0.073 | 0.068 | 0.085 | 0.112 |
zoo | 0.098 | 0.126 | 0.138 | 0.198 |
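The noise percentages in this and the following tables correspond to artificially corrupted class labels. A common injection protocol (an assumption here; the paper's exact procedure may differ) flips the labels of a random fraction of training instances to another class:

```python
import random

def inject_label_noise(y, rate, seed=0):
    """Return a copy of the labels y with `rate` of them flipped to a
    different, randomly chosen class."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    noisy = list(y)
    for i in rng.sample(range(len(y)), int(round(rate * len(y)))):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

# Example: 10% label noise on a toy label vector
print(inject_label_noise(["a"] * 10 + ["b"] * 10, rate=0.10))
```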
Methods | 5% Noise | 10% Noise | 15% Noise |
---|---|---|---|
AKH-GA | 0.521 | 0.514 | 0.513 |
DS | 0.907 | 0.911 | 0.899 |
IN-GA | 0.522 | 0.500 | 0.510 |
KJ-GA | 0.503 | 0.489 | 0.492 |
RMHC-FP1 | 0.513 | 0.493 | 0.501 |
ROFS | 0.931 | 0.927 | 0.930 |
SOFSA | 0.136 | 0.169 | 0.178 |
TCCS | 0.161 | 0.186 | 0.183 |
Methods | 0% Noise | 5% Noise | 10% Noise | 15% Noise |
---|---|---|---|---|
AKH-GA | 0.502 | 0.500 | 0.502 | 0.496 |
DS | 0.130 | 0.125 | 0.125 | 0.130 |
IN-GA | 0.481 | 0.474 | 0.478 | 0.475 |
KJ-GA | 0.489 | 0.485 | 0.490 | 0.486 |
RMHC-FP1 | 0.494 | 0.492 | 0.499 | 0.502 |
ROFS | 0.135 | 0.134 | 0.134 | 0.137 |
SOFSA | 0.609 | 0.592 | 0.608 | 0.631 |
TCCS | 0.547 | 0.545 | 0.572 | 0.607 |
Methods | 0% Noise | 5% Noise | 10% Noise | 15% Noise |
---|---|---|---|---|
AKH-GA | 0.509 | 0.513 | 0.506 | 0.511 |
DS | 0.937 | 0.931 | 0.931 | 0.933 |
IN-GA | 0.451 | 0.451 | 0.446 | 0.456 |
KJ-GA | 0.475 | 0.468 | 0.465 | 0.490 |
RMHC-FP1 | 0.500 | 0.497 | 0.512 | 0.506 |
ROFS | 0.604 | 0.613 | 0.619 | 0.619 |
SOFSA | 0.809 | 0.799 | 0.804 | 0.813 |
TCCS | 0.758 | 0.767 | 0.775 | 0.791 |
Methods | Friedman Ranking (0% Noise) | Holm p-Value (0% Noise) | Friedman Ranking (5% Noise) | Holm p-Value (5% Noise) | Friedman Ranking (10% Noise) | Holm p-Value (10% Noise) | Friedman Ranking (15% Noise) | Holm p-Value (15% Noise) |
---|---|---|---|---|---|---|---|---|
Baseline | 2.467 | - | 2.900 | 0.594 | 3.767 | 0.067 | 3.567 | 0.042 |
ROFS | 3.633 | 0.243 | 2.367 | - | 1.933 | - | 1.533 | - |
SOFSA | 4.167 | 0.089 | 4.633 | 0.023 | 4.900 | 0.003 | 5.600 | 0.000 |
AKH-GA | 4.400 | 0.053 | 4.833 | 0.014 | 4.567 | 0.008 | 5.033 | 0.000 |
DS | 4.433 | 0.049 | 3.167 | 0.424 | 2.667 | 0.463 | 2.900 | 0.172 |
TCCS | 4.867 | 0.016 | 5.067 | 0.007 | 5.500 | 0.000 | 5.667 | 0.000 |
RMHC-FP1 | 6.333 | 0.000 | 6.933 | 0.000 | 6.233 | 0.000 | 5.867 | 0.000 |
KJ-GA | 7.067 | 0.000 | 7.900 | 0.000 | 7.567 | 0.000 | 7.133 | 0.000 |
IN-GA | 7.633 | 0.000 | 7.200 | 0.000 | 7.867 | 0.000 | 7.700 | 0.000 |
Methods | Friedman Ranking (0% Noise) | Holm p-Value (0% Noise) | Friedman Ranking (5% Noise) | Holm p-Value (5% Noise) | Friedman Ranking (10% Noise) | Holm p-Value (10% Noise) | Friedman Ranking (15% Noise) | Holm p-Value (15% Noise) |
---|---|---|---|---|---|---|---|---|
ROFS | 1.467 | - | 1.667 | 0.947 | 1.667 | 0.947 | 1.600 | - |
DS | 1.533 | 0.947 | 1.600 | - | 1.600 | - | 1.667 | 0.947 |
IN-GA | 4.000 | 0.011 | 3.933 | 0.020 | 3.533 | 0.053 | 3.667 | 0.039 |
KJ-GA | 4.667 | 0.001 | 4.600 | 0.003 | 4.400 | 0.005 | 4.200 | 0.009 |
RMHC-FP1 | 5.400 | 0.000 | 5.400 | 0.000 | 5.200 | 0.000 | 5.067 | 0.001 |
AKH-GA | 5.867 | 0.000 | 5.933 | 0.000 | 5.400 | 0.000 | 5.600 | 0.000 |
TCCS | 6.033 | 0.000 | 6.233 | 0.000 | 6.933 | 0.000 | 6.967 | 0.000 |
SOFSA | 7.033 | 0.000 | 6.633 | 0.000 | 7.267 | 0.000 | 7.233 | 0.000 |
Baseline | 9.000 | 0.000 | 9.000 | 0.000 | 9.000 | 0.000 | 9.000 | 0.000 |
Methods | Friedman Ranking (0% Noise) | Holm p-Value (0% Noise) | Friedman Ranking (5% Noise) | Holm p-Value (5% Noise) | Friedman Ranking (10% Noise) | Holm p-Value (10% Noise) | Friedman Ranking (15% Noise) | Holm p-Value (15% Noise) |
---|---|---|---|---|---|---|---|---|
IN-GA | 2.633 | - | 2.567 | - | 2.267 | - | 2.167 | - |
KJ-GA | 3.100 | 0.641 | 2.767 | 0.841 | 2.800 | 0.594 | 2.867 | 0.484 |
RMHC-FP1 | 3.400 | 0.443 | 3.667 | 0.271 | 3.833 | 0.117 | 3.533 | 0.172 |
AKH-GA | 3.667 | 0.301 | 3.600 | 0.301 | 3.767 | 0.134 | 3.700 | 0.125 |
ROFS | 3.733 | 0.271 | 4.167 | 0.110 | 4.100 | 0.067 | 4.133 | 0.049 |
TCCS | 5.467 | 0.005 | 5.567 | 0.003 | 5.633 | 0.001 | 6.100 | 0.000 |
SOFSA | 6.867 | 0.000 | 6.600 | 0.000 | 6.667 | 0.000 | 6.800 | 0.000 |
DS | 7.333 | 0.000 | 7.333 | 0.000 | 7.200 | 0.000 | 6.967 | 0.000 |
Baseline | 8.800 | 0.000 | 8.733 | 0.000 | 8.733 | 0.000 | 8.733 | 0.000 |
ROFS vs. | Classifier Error | Instance Retention | Feature Retention |
---|---|---|---|
AKH-GA | 0.2286 | 0.0286 | 0.013 |
DS | 0.7714 | 0.0286 | 0.013 |
IN-GA | 0.0286 | 0.0286 | 0.013 |
KJ-GA | 0.0286 | 0.0286 | 0.013 |
RMHC-FP1 | 0.0286 | 0.0286 | 0.013 |
SOFSA | 0.2286 | 0.0286 | 0.013 |
TCCS | 0.2286 | 0.0286 | 0.013 |
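The Holm p-values in the ranking tables above, and the pairwise values in this table, are consistent with the Friedman-plus-Holm and Wilcoxon signed-rank methodology of García and Herrera [31] (our reading; the exact tests used are described in the paper). A sketch of how such values can be reproduced, with an illustrative `scores` layout:

```python
from scipy.stats import friedmanchisquare, wilcoxon

def pairwise_wilcoxon(scores, reference="ROFS"):
    """Wilcoxon signed-rank p-value of the reference method against every
    other method, over paired per-dataset results.
    scores: dict mapping method name -> list of per-dataset values."""
    ref = scores[reference]
    return {m: wilcoxon(ref, s).pvalue
            for m, s in scores.items() if m != reference}

def holm_correction(pvalues):
    """Holm step-down adjustment of a dict of raw p-values."""
    items = sorted(pvalues.items(), key=lambda kv: kv[1])
    k, running, adjusted = len(items), 0.0, {}
    for rank, (name, p) in enumerate(items):
        running = max(running, (k - rank) * p)   # step-down, kept monotone
        adjusted[name] = min(1.0, running)
    return adjusted

# Usage: first test for a global difference, then compare pairwise
# stat, p_global = friedmanchisquare(*scores.values())
# print(holm_correction(pairwise_wilcoxon(scores)))
```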
Methods | Parameters |
---|---|
AKH-GA | population: 200, iterations: 50, crossover probability: 0.7, mutation probability: 0.1 |
DS | None |
IN-GA | population: 50, iterations: 500, crossover probability: 1.0, mutation probability from 0 to 1: 0.1, mutation probability from 1 to 0: 0.01 |
KJ-GA | population: 10, iterations: 100, crossover probability: 1.0, mutation probability: 0.1 |
RMHC-FP1 | population: 10, iterations: 100 |
ROFS | CA: RoPM [33], Cond: CNN [34], classif: 1-NN |
SOFSA | None |
TCCS | None |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).