Article

Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

1 Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece
2 Department of Informatics, Ionian University, 49100 Corfu, Greece
* Author to whom correspondence should be addressed.
Algorithms 2020, 13(3), 71; https://doi.org/10.3390/a13030071
Received: 26 February 2020 / Revised: 18 March 2020 / Accepted: 21 March 2020 / Published: 24 March 2020
(This article belongs to the Special Issue Mining Humanistic Data 2019)
At the dawn of the 10V, or big data, era, there is a considerable number of sources such as smart phones, IoT devices, social media, smart city sensors, and health care systems, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms, and as a consequence many algorithmic techniques have been developed tailored to these platforms. This article relies in two ways on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a vast number of classifiers are applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes, which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar, if not better, level of accuracy, recall, and F1. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. Experiments on the same Spark cluster indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
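The two-step scheme described above can be illustrated with a minimal single-machine sketch: project the data matrix onto its top-k right singular vectors, then train a classifier on the reduced features. The toy dataset, the choice of k, and the nearest-centroid classifier here are illustrative assumptions, not taken from the paper; on a Spark cluster the SVD step would instead use MLlib's distributed `RowMatrix.computeSVD` and the second step one of the MLlib classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: two Gaussian classes in 20 dimensions
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 20))
X1 = rng.normal(loc=2.0, scale=1.0, size=(100, 20))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Step 1: SVD preprocessing -- keep only the top-k singular directions,
# reducing each row to a k-dimensional transformed attribute vector
k = 5
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T          # reduced (n, k) feature matrix

# Step 2: a trivial nearest-centroid classifier on the reduced features
c0 = Z[y == 0].mean(axis=0)
c1 = Z[y == 1].mean(axis=0)
pred = (np.linalg.norm(Z - c1, axis=1)
        < np.linalg.norm(Z - c0, axis=1)).astype(int)

accuracy = (pred == y).mean()
print(f"training accuracy on reduced features: {accuracy:.2f}")
```

The point of the sketch is the decomposition of the problem: the expensive downstream classifier sees a k-dimensional input instead of the full attribute space, which is where the complexity reduction claimed in the abstract comes from.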
Keywords: Apache Spark; Apache MLlib; PySpark; big data; machine learning; 10V data; two-step classification; ensemble classification; SVD; SparkQL; computing performance; F1 metric; dataframe
MDPI and ACS Style

Alexopoulos, A.; Drakopoulos, G.; Kanavos, A.; Mylonas, P.; Vonitsanos, G. Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark. Algorithms 2020, 13, 71. https://doi.org/10.3390/a13030071

