Survey on Preprocessing Techniques for Big Data Projects
Abstract
1. Introduction
2. Data Preprocessing
2.1. Feature Selection
2.1.1. Filter Methods
2.1.2. Embedded Methods
2.1.3. Wrapper Methods
2.2. Discretisation
2.2.1. Unsupervised
2.2.2. Supervised
3. Conclusions
Funding
© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lopez-Miguel, I.D. Survey on Preprocessing Techniques for Big Data Projects. Eng. Proc. 2021, 7, 14. https://doi.org/10.3390/engproc2021007014