Open Access Article

Hadoop vs. Spark: Impact on Performance of the Hammer Query Engine for Open Data Corpora

1 Tabulaex, A Burning Glass Company, 20126 Milano, Italy
2 Dipartimento di Ingegneria Gestionale, dell’Informazione e della Produzione (DIGIP), University of Bergamo, 24044 Dalmine, Italy
* Author to whom correspondence should be addressed.
Algorithms 2018, 11(12), 209; https://doi.org/10.3390/a11120209
Received: 6 November 2018 / Revised: 4 December 2018 / Accepted: 11 December 2018 / Published: 17 December 2018
(This article belongs to the Special Issue MapReduce for Big Data)
The Hammer prototype is a query engine for corpora of Open Data that provides users with the concept of blind querying. Since data sets published on Open Data portals are heterogeneous, users wishing to find interesting data sets are blind: unlike in databases, queries cannot be fully specified. Consequently, the query engine is responsible for rewriting and adapting the blind query to the actual data sets by exploiting lexical and semantic similarity. The effectiveness of this approach was discussed in our previous works. In this paper, we report our experience in developing the query engine. With the very first version of the prototype, we realized that the implementation of the retrieval technique was too slow, even though corpora contained only a few thousand data sets. We decided to adopt the Map-Reduce paradigm in order to parallelize the query engine and improve performance. We went through several versions of the query engine, based either on the Hadoop framework or on the Spark framework, two very popular frameworks for writing and executing parallel algorithms based on the Map-Reduce paradigm. We present our study of the impact of adopting the Map-Reduce approach and these two frameworks to parallelize the Hammer query engine, and we discuss various implementations of the query engine, obtained either without significantly rewriting the algorithm or by completely rewriting it to exploit the high-level abstractions provided by Spark. The experimental campaign we performed shows the benefits provided by each studied solution, with the perspective of moving toward Big Data in the future. The lessons we learned are collected and synthesized into behavioral guidelines for developers approaching the problem of parallelizing algorithms by means of Map-Reduce frameworks.
Keywords: blind querying of open data portals; Map-Reduce paradigm; Hadoop vs. Spark
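
As a rough illustration of the idea summarized in the abstract, the sketch below matches the terms of a blind query against the field names of each data set in a map step and groups the matches per data set in a reduce step. It is not the Hammer algorithm: the toy corpus, the function names, the similarity threshold, and the ranking rule are all assumptions made for illustration, only lexical similarity is considered (semantic similarity is omitted), and plain Python stands in for Hadoop or Spark.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical toy corpus: data set id -> field names (not from the paper).
CORPUS = {
    "ds1": ["city", "population", "year"],
    "ds2": ["municipality", "inhabitants"],
    "ds3": ["citty", "populaton"],  # misspelled fields, to exercise similarity
}


def similarity(a, b):
    """Lexical similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_phase(dataset_id, fields, query_terms, threshold):
    """Map step: emit (dataset_id, term) for each query term close to a field."""
    for term in query_terms:
        if any(similarity(term, field) >= threshold for field in fields):
            yield dataset_id, term


def reduce_phase(pairs):
    """Reduce step: group the matched query terms by data set id."""
    grouped = defaultdict(set)
    for dataset_id, term in pairs:
        grouped[dataset_id].add(term)
    return dict(grouped)


def blind_query(corpus, query_terms, threshold=0.8):
    """Return data sets ranked by how many query terms they match."""
    pairs = (
        pair
        for dataset_id, fields in corpus.items()
        for pair in map_phase(dataset_id, fields, query_terms, threshold)
    )
    matched = reduce_phase(pairs)
    # Assumed ranking: more matched terms first, ties broken by id.
    return sorted(matched.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    for dataset_id, terms in blind_query(CORPUS, ["city", "population"]):
        print(dataset_id, sorted(terms))
```

Because each call to `map_phase` depends only on a single data set, the map step can run independently on every data set in the corpus, which is the property that Map-Reduce frameworks exploit to distribute the work.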
MDPI and ACS Style

Pelucchi, M.; Psaila, G.; Toccu, M. Hadoop vs. Spark: Impact on Performance of the Hammer Query Engine for Open Data Corpora. Algorithms 2018, 11, 209. https://doi.org/10.3390/a11120209

