MapReduce for Big Data

A special issue of Algorithms (ISSN 1999-4893).

Deadline for manuscript submissions: closed (30 September 2018) | Viewed by 17301

Special Issue Editor


Dr. Lijun Chang
Guest Editor
School of Computer Science, The University of Sydney, Camperdown, NSW 2006, Australia
Interests: graph database; graph mining; algorithms; network science

Special Issue Information

Dear Colleagues,

Data are becoming an increasingly decisive resource in modern society. Big Data is an emerging paradigm encompassing various kinds of complex, large-scale information that are beyond the capability of conventional data-processing techniques. For example, one of the defining characteristics of Big Data is the need to carry out complex computations on petabyte (PB)-level, and even exabyte (EB)-level, data. Therefore, massively parallel processing techniques, such as algorithms that utilize the cloud computing platforms MapReduce and Spark, are in demand.

The aim of this Special Issue is to invite high-quality manuscripts that address the challenges of Big Data with emerging computing platforms, such as MapReduce and Spark. We welcome original and unpublished manuscripts from academia and industry on recent advances in different aspects of Big Data research and applications. Topics of interest include, but are not limited to: theoretical foundations of massively parallel computation, MapReduce algorithms for Big Data, and distributed algorithms for big graph processing.

Dr. Lijun Chang
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Big Data
  • MapReduce
  • Spark
  • Massively Parallel Computation (MPC)

Published Papers (4 papers)


Research

14 pages, 357 KiB  
Article
MAPSkew: Metaheuristic Approaches for Partitioning Skew in MapReduce
by Matheus H. M. Pericini, Lucas G. M. Leite, Francisco H. De Carvalho-Junior, Javam C. Machado and Cenez A. Rezende
Algorithms 2019, 12(1), 5; https://doi.org/10.3390/a12010005 - 24 Dec 2018
Cited by 2 | Viewed by 3951
Abstract
MapReduce is a parallel computing model in which a large dataset is split into smaller parts and executed on multiple machines. Due to its simplicity, MapReduce has been widely used in various application domains. MapReduce can significantly reduce the processing time of a large amount of data by dividing the dataset into smaller parts and processing them in parallel on multiple machines. However, when data are not uniformly distributed, we have the so-called partitioning skew, where the allocation of tasks to machines becomes unbalanced, either because the distribution function splits the dataset unevenly or because a part of the data is more complex and requires greater computational effort. To solve this problem, we propose an approach based on metaheuristics. For evaluation purposes, three metaheuristics were implemented: Simulated Annealing, Local Beam Search and Stochastic Beam Search. Our experimental evaluation, using a MapReduce implementation of the Bron-Kerbosch Clique Algorithm, shows that the proposed method can find good partitionings while better balancing data among machines.
(This article belongs to the Special Issue MapReduce for Big Data)
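
To make the notion of partitioning skew concrete, the following is a minimal, single-process sketch of a word-count job. It is not the MAPSkew method and does not use any real MapReduce framework: the function names, the CRC32-based partitioner, and the `skew_ratio` metric are all assumptions chosen for illustration. It shows how one frequent key can overload a single reducer when keys are assigned by hash.

```python
import zlib
from collections import defaultdict


def map_phase(records):
    """Emit (word, 1) pairs, as in a classic word-count job."""
    for line in records:
        for word in line.split():
            yield word, 1


def partition(key, num_reducers):
    """Hypothetical hash partitioner; CRC32 stands in for a framework's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group intermediate pairs by key inside the reducer chosen by the partitioner."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Sum the values of every key assigned to one reducer."""
    return {key: sum(values) for key, values in bucket.items()}


def reducer_loads(buckets):
    """Number of intermediate pairs each reducer has to process."""
    return [sum(len(values) for values in bucket.values()) for bucket in buckets]


def skew_ratio(loads):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Illustrative input: the key "a" dominates the data set.
records = ["a a a a a a a a b c", "a a a a d e"]
buckets = shuffle(map_phase(records), num_reducers=3)

counts = {}
for bucket in buckets:
    counts.update(reduce_phase(bucket))

loads = reducer_loads(buckets)
print(counts, loads, skew_ratio(loads))
```

Because every occurrence of a key must reach the same reducer, the reducer that receives "a" processes at least 12 of the 16 intermediate pairs regardless of how the remaining keys are distributed. A skew-aware approach would search for a different assignment of keys (or key groups) to reducers that lowers this ratio.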

34 pages, 1005 KiB  
Article
Hadoop vs. Spark: Impact on Performance of the Hammer Query Engine for Open Data Corpora
by Mauro Pelucchi, Giuseppe Psaila and Maurizio Toccu
Algorithms 2018, 11(12), 209; https://doi.org/10.3390/a11120209 - 17 Dec 2018
Cited by 6 | Viewed by 3958
Abstract
The Hammer prototype is a query engine for corpora of Open Data that provides users with the concept of blind querying. Since data sets published on Open Data portals are heterogeneous, users wishing to find interesting data sets are blind: queries cannot be fully specified, as in the case of databases. Consequently, the query engine is responsible for rewriting and adapting the blind query to the actual data sets, by exploiting lexical and semantic similarity. The effectiveness of this approach was discussed in our previous works. In this paper, we report our experience in developing the query engine. In fact, in the very first version of the prototype, we realized that the implementation of the retrieval technique was too slow, even though corpora contained only a few thousand data sets. We decided to adopt the Map-Reduce paradigm, in order to parallelize the query engine and improve performance. We passed through several versions of the query engine, based either on the Hadoop framework or on the Spark framework. Hadoop and Spark are two very popular frameworks for writing and executing parallel algorithms based on the Map-Reduce paradigm. In this paper, we present our study of the impact of adopting the Map-Reduce approach and its two most famous frameworks to parallelize the Hammer query engine; we discuss various implementations of the query engine, obtained either without significantly rewriting the algorithm or by completely rewriting the algorithm to exploit the high-level abstractions provided by Spark. The experimental campaign we performed shows the benefits provided by each studied solution, with the perspective of moving toward Big Data in the future. The lessons we learned are collected and synthesized into behavioral guidelines for developers approaching the problem of parallelizing algorithms by means of Map-Reduce frameworks.
(This article belongs to the Special Issue MapReduce for Big Data)

32 pages, 7507 KiB  
Article
New and Efficient Algorithms for Producing Frequent Itemsets with the Map-Reduce Framework
by Yaron Gonen, Ehud Gudes and Kirill Kandalov
Algorithms 2018, 11(12), 194; https://doi.org/10.3390/a11120194 - 28 Nov 2018
Cited by 1 | Viewed by 4026
Abstract
The Map-Reduce (MR) framework has become a popular framework for developing new parallel algorithms for Big Data. Developing efficient algorithms for data mining of big data and distributed databases has become an important problem. In this paper we focus on algorithms producing association rules and frequent itemsets. After reviewing the most recent algorithms that perform this task within the MR framework, we present two new algorithms: one for producing closed frequent itemsets, and a second for producing frequent itemsets when the database is updated and new data is added to the old database. Both algorithms include novel optimizations which are suitable for the MR framework, as well as for other parallel architectures. A detailed experimental evaluation shows the effectiveness and advantages of the algorithms over existing methods when it comes to large distributed databases.
(This article belongs to the Special Issue MapReduce for Big Data)

26 pages, 1953 KiB  
Article
Best Trade-Off Point Method for Efficient Resource Provisioning in Spark
by Peter P. Nghiem
Algorithms 2018, 11(12), 190; https://doi.org/10.3390/a11120190 - 22 Nov 2018
Cited by 1 | Viewed by 4601
Abstract
Considering the recent exponential growth in the amount of information processed in Big Data, the high energy consumed by data processing engines in datacenters has become a major issue, underlining the need for efficient resource allocation for more energy-efficient computing. We previously proposed the Best Trade-off Point (BToP) method, which provides a general approach and techniques based on an algorithm with mathematical formulas to find the best trade-off point on an elbow curve of performance vs. resources for efficient resource provisioning in Hadoop MapReduce. The BToP method is expected to work for any application or system which relies on a trade-off elbow curve, non-inverted or inverted, for making good decisions. In this paper, we apply the BToP method to the emerging cluster computing framework, Apache Spark, and show that its performance and energy consumption are better than Spark with its built-in dynamic resource allocation enabled. Our Spark-Bench tests confirm the effectiveness of using the BToP method with Spark to determine the optimal number of executors for any workload in production environments where job profiling for behavioral replication will lead to the most efficient resource provisioning.
(This article belongs to the Special Issue MapReduce for Big Data)