A Comprehensive Survey of MapReduce Models for Processing Big Data
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This survey analyzes various MapReduce models used for big data processing, techniques used in the reviewed literature, and challenges.
First, is it true that this research is included in the "article" category while in MDPI, there is a "study" category option?
Second, the study methodology used must be explained. How is the article screening process? How is the selection process? Where are the sources of the articles?
Third, the Discussion and Recommendations section must summarize the findings and what concepts can be proposed for future research. (Proposed Model)
Fourth, there needs to be proofreading because several sentences were found to have typos.
Comments on the Quality of English Language
The article category must be clearly defined. Then there needs to be proofreading because several sentences were found to have typos.
Author Response
Dear Sir/Madam,
Thank you for your time and effort in reviewing our manuscript entitled “A Comprehensive Survey of the MapReduce Models for Processing Big Data”, and for the valuable comments and suggestions that helped us revise the paper. The changes are marked in red in the manuscript.
Reviewer No.1
This survey analyses various map reduction models used for big data processing, techniques used in the reviewed literature, and challenges.
First, is it true that this research is included in the "article" category while in MDPI, there is a "study" category option?
Reply: No. The manuscript is a study that provides a comprehensive review of the MapReduce models for processing big data.
Second, the study methodology used must be explained. How is the article screening process? How is the selection process? Where are the sources of the articles?
Reply: The study methodology, which comprises the article screening process followed by the taxonomy and the literature review, is described in Section 3 of the manuscript.
Further, the article screening process, the sources of the articles, and the reasons for inclusion and exclusion are detailed under the article selection process in Section 3.1 of the manuscript.
Third, the Discussion and Recommendations section must summarize the findings and what concepts can be proposed for future research. (Proposed Model)
Reply: Based on the reviewer’s suggestion, the conclusion is improved by including future directions for advancing the field of MapReduce.
Fourth, there needs to be proofreading because several sentences were found to have typos.
Reply: The revised manuscript rectifies all the grammatical and spacing issues.
The article category must be clearly defined. Then there needs to be proofreading because several sentences were found to have typos.
Reply: The proposed research is based on reviewing the MapReduce Models for Processing Big Data. Further, the manuscript is proofread, and the corrections are updated in the revised manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript under review represents the various MapReduce models, such as Hadoop, Spark, Hive, Pig, MongoDB, and Cassandra, highlighting their techniques, advancements, and the metrics used to evaluate their performance in big data processing.
Although the manuscript is listed as an article, I would recommend reclassifying it as a review paper. The following issues need to be further addressed for better publication needs.
1. Please clarify what progress has been made in the area of big data compared to the analyses provided in previously published paper "A brief survey on big data: technologies, terminologies and data‑intensive applications".
2. In Figure 1 there are undirected connections, please check them.
3. The information content of Figure 2 can be enhanced by indicating the years in which the models were developed.
4. Please specify the need to present the information in Figure 3 in graphical form.
5. I recommend rearranging Figure 4 to indicate the share of applications.
6. Please clarify, in Figure 6 the units of measurement are the number of publications from the review?
7. There are issues in the formatting of table 1 and equation (9). There are typos in equation (8) and line 825.
8. Are the characteristics of the storage devices important in the experimental setup? Please specify them.
Author Response
Dear Sir/Madam,
Thank you for your time and effort in reviewing our manuscript entitled “A Comprehensive Survey of the MapReduce Models for Processing Big Data”, and for the valuable comments and suggestions that helped us revise the paper. The changes are marked in red in the manuscript.
Reviewer No.2
The manuscript under review represents the various MapReduce models, such as Hadoop, Spark, Hive, Pig, MongoDB, and Cassandra, highlighting their techniques, advancements, and the metrics used to evaluate their performance in big data processing.
Although the manuscript is listed as an article, I would recommend reclassifying it as a review paper. The following issues need to be further addressed for better publication needs.
- Please clarify what progress has been made in the area of big data compared to the analyses provided in previously published paper "A brief survey on big data: technologies, terminologies and data‑intensive applications".
Reply: Based on the reviewer’s suggestion, the significance of the proposed review compared to other existing reviews of MapReduce is explained in the abstract and in the contributions stated at the end of Section 1 of the manuscript.
“To attain a deep knowledge of map-reduce-based big data processing, this survey aims to interpret various map-reducing models focused on the methodology, the performance metrics, their advantages, utilized datasets, and limitations. Around 75 articles related to MapReduce for processing big data are reviewed in this research, which provides good insight into choosing the appropriate model for processing various data, the challenges faced, and future directions. More specifically, the review describes different types of map-reduce models, including the Hadoop, Hive, Spark, Pig, MongoDB, and Cassandra. In addition, the review analyzed various metrics for evaluating MapReduce frameworks' performance for big data processing. The overview of the research benefits the researchers in overcoming the limitations in the existing approaches to develop more effective and sustainable models in the future.”
- In Figure 1 there are undirected connections, please check them.
Reply: Based on the reviewer’s comments, the connections in Figure 1 are corrected in the revised manuscript.
- The information content of Figure 2 can be enhanced by indicating the years in which the models were developed.
Reply: The years in which the reviewed models were developed are included in Table 1 within section 3.4.
- Please specify the need to present the information in Figure 3 in graphical form.
Reply: MapReduce offers many sophisticated features and benefits that make it well suited to processing big data in different application areas. Hence, the key features are highlighted graphically in Figure 4 (changed).
- I recommend rearranging Figure 4 to indicate the share of applications.
Reply: Based on the reviewer’s suggestion, Figure 4 is modified so that it indicates the share of applications.
- Please clarify, in Figure 6 the units of measurement are the number of publications from the review?
Reply: Yes, in Figure 6, the axis label represents the number of publications in which the corresponding metrics are utilized.
- There are issues in the formatting of table 1 and equation (9). There are typos in equation (8) and line 825.
Reply: Based on the reviewer’s suggestion, the formatting of Table 1 and Equation (9) is improved in the manuscript. Further, the typos in the equation are rectified in the revised manuscript.
- Are the characteristics of the storage devices important in the experimental setup? Please specify them.
Reply: The MapReduce model for processing large-scale datasets depends on the Hadoop Distributed File System (HDFS) for storage, which is characterized by its scalability, fault tolerance, and data locality. This is mentioned in Section 1 of the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Author(s),
Your manuscript is now evaluated. Although the topic is applicable and interesting, in the current form it needs further processing and obviating shortcomings. I list concerns and issues for your perusal. It can be re-evaluated provided the issues are addressed.
- The English should be improved. The whole proofreading is appreciated.
- The abstract should be re-written to enhance its quality.
- The order of reference usages in the text should be corrected. Putting the uniform reference and filling lacked information are appreciated. Do you present a subjective classification and taxonomy? Putting a subjective framework can improve the quality of paper.
- Please enumerate the novelty in bullet form in the introduction part.
- The related work and literature can be improved. One of the most applicable approaches in big data mining is to apply a hybrid algorithm. For instance, please cite the paper titled “A hybrid machine learning approach for feature selection in designing intrusion detection systems (IDS) model for distributed computing networks”, which utilizes a feature selection technique mixing meta-heuristic algorithms. In the introduction, “Several works have been conducted to explore the parallel processing approaches that can implement the classification algorithms without affecting the results [53].” But you cite only one work. Please add a couple of newly published works. Note that the reference numbers in the text must be re-ordered consecutively.
- Figure 1 needs citation. Please put appropriate reference and citation.
- Enhancing the quality of figures is appreciated.
- Equation (5) should be re-written.
- One of the most important thing is task/job scheduling on chunk of data in cloud computing. Please mention bag of task scheduling can improve the gained quality of service. For instance, the method used in paper titled “Multi-objective Cost-aware Bag-of-Tasks Scheduling Optimization Model for IoT Applications Running on Heterogeneous Fog Environment” can be used in application running on big data scheduling.
- The paper suffers to well introducing the problem in introduction in each part. Please mention some single objective and multi-objective approaches and also mention what are constraints. For instance, paper titled “An energy-efficient topology-aware virtual machine placement in Cloud Datacenters: A multi-objective discrete JAYA optimization” uses multi-objective algorithm for Virtual machine placement problem. Each VM runs a chunk of data placed in a physical server. It extends algorithm in favor of both user and provider viewpoint.
- Related work should be extended. I recommend mention some relevant papers in scheduling, single/multi-objective viewpoint.
- The problem is not formally stated. I appreciate you if you have formal problem statements in some parts; it gives broad insights to readers.
- What are the service level agreement (SLA) metrics? And how does their violation affect power consumption or other costs?
- What kinds of AI techniques in terms of heuristic, meta-heuristic, reinforcement learning, etc. are used?
- Introducing some real-world projects that encounter big data challenges can improve the value.
- What are typical implementation environments? Please put explanation about implementation/simulation environment and programming languages.
- Please put a section for discussion with statistical analysis.
- I appreciate you as you consider “Conclusion and future direction” for the last title. Please put a condense sentence about limitation of existing works. And what is the future direction?
In the current form, I recommend Major revision. I will put some minor comments provided the major concerns and issues are addressed.
Comments on the Quality of English Language
It needs thorough proofreading to obviate typos, grammatical and dictation faults.
Author Response
Dear Sir/Madam,
Thank you for your time and effort in reviewing our manuscript entitled “A Comprehensive Survey of the MapReduce Models for Processing Big Data”, and for the valuable comments and suggestions that helped us revise the paper. The changes are marked in red in the manuscript.
Reviewer No.3
Your manuscript is now evaluated. Although the topic is applicable and interesting, in the current form it needs further processing and obviating shortcomings. I list concerns and issues for your perusal. It can be re-evaluated provided the issues are addressed.
- The English should be improved. The whole proofreading is appreciated.
Reply: The manuscript is proofread and the corrections are updated in the revised manuscript.
- The abstract should be re-written to enhance its quality.
Reply: The abstract is modified to highlight the proposed review's significance.
- The order of reference usages in the text should be corrected. Putting the uniform reference and filling lacked information are appreciated. Do you present a subjective classification and taxonomy? Putting a subjective framework can improve the quality of paper.
Reply: The order of references is corrected in the text. Yes, the framework is modified so that it highlights the diverse types of MapReduce models in the revised version.
- Please enumerate the novelty in bullet form in the introduction part.
Reply: The novelty of the research is highlighted in bullet points in the introduction.
- The related work and literature can be improved. One of the most applicable approaches in big data mining is to apply a hybrid algorithm. For instance, please cite the paper titled “A hybrid machine learning approach for feature selection in designing intrusion detection systems (IDS) model for distributed computing networks”, which utilizes a feature selection technique mixing meta-heuristic algorithms. In the introduction, “Several works have been conducted to explore the parallel processing approaches that can implement the classification algorithms without affecting the results [53].” But you cite only one work. Please add a couple of newly published works. Note that the reference numbers in the text must be re-ordered consecutively.
Reply: Based on the reviewer’s suggestion, the given references were studied. They are technically interesting but fall outside the scope of the review titled “A Comprehensive Survey of the MapReduce Models for Processing Big Data”. Further, the old references are removed, and more recent references are included in the review and discussed in the literature survey.
- Figure 1 needs citation. Please put appropriate reference and citation.
Reply: Figure 1 is cited in Section 2 at the end of the first paragraph, just above Figure 1 in the manuscript.
- Enhancing the quality of figures is appreciated.
Reply: Based on the reviewer’s suggestion, all the figures are improved in the revised manuscript.
- Equation (5) should be re-written.
Reply: Equation 5 is rewritten in the revised manuscript.
- One of the most important thing is task/job scheduling on chunk of data in cloud computing. Please mention bag of task scheduling can improve the gained quality of service. For instance, the method used in paper titled “Multi-objective Cost-aware Bag-of-Tasks Scheduling Optimization Model for IoT Applications Running on Heterogeneous Fog Environment” can be used in application running on big data scheduling.
Reply: Based on the reviewer’s suggestion, the mentioned reference “Multi-objective Cost-aware Bag-of-Tasks Scheduling Optimization Model for IoT Applications Running on Heterogeneous Fog Environment”, which is associated with task scheduling, is studied, and the context is included in Section 3 of the manuscript.
- The paper suffers to well introducing the problem in introduction in each part. Please mention some single objective and multi-objective approaches and also mention what are constraints. For instance, paper titled “An energy-efficient topology-aware virtual machine placement in Cloud Datacenters: A multi-objective discrete JAYA optimization” uses multi-objective algorithm for Virtual machine placement problem. Each VM runs a chunk of data placed in a physical server. It extends algorithm in favour of both user and provider viewpoint.
Reply: As per the reviewer’s comments, the multi-objective approaches and related techniques are included in the reference section and discussed in the related works in Section 3 of the manuscript.
- Related work should be extended. I recommend mention some relevant papers in scheduling, single/multi-objective viewpoint.
Reply: Based on the reviewer’s suggestion, more scheduling works based on multi-objective problems are included in the related works in Section 3 of the manuscript.
- The problem is not formally stated. I appreciate you if you have formal problem statements in some parts; it gives broad insights to readers.
Reply: The formal problem is stated just before the contribution in Section 1 of the manuscript.
- What are the service level agreement (SLA) metrics? And how does their violation affect power consumption or other costs?
Reply: An explanation of the SLA metrics and of SLA violation is included in the third paragraph of Section 1 of the revised manuscript.
Further, the most challenging problem in the selection of an effective job scheduling strategy is to guarantee the service-level agreement (SLA) associated with the scheduled tasks. More specifically, the SLA metrics, including the response time, throughput, error rate, latency, capacity, and resolution time, indicate the service quality and availability. SLA violation results in increased costs, including reputational damage and potentially higher power consumption caused by inefficient resource utilization. The system's response time and data transaction speed can affect the framework's scalability. Hence, the limitations, including scalability, batch processing, SLA violation, and the complexity of parallel processing, underscore the need to analyze the MapReduce models for big data processing.
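As an illustration of how the SLA metrics named above could be checked against agreed targets, the following is a minimal sketch only. The class name SlaTarget, the function sla_violations, the field names, and all threshold values are hypothetical and are not taken from the manuscript; only three of the listed metrics are covered.

```python
from dataclasses import dataclass


@dataclass
class SlaTarget:
    """Hypothetical SLA thresholds (illustrative values, not from the manuscript)."""
    max_response_time_s: float        # upper bound on response time
    min_throughput_jobs_per_h: float  # lower bound on throughput
    max_error_rate: float             # upper bound on fraction of failed jobs


def sla_violations(measured: dict, target: SlaTarget) -> list:
    """Return the names of the metrics whose measured value breaks the target."""
    violations = []
    if measured["response_time_s"] > target.max_response_time_s:
        violations.append("response_time")
    if measured["throughput_jobs_per_h"] < target.min_throughput_jobs_per_h:
        violations.append("throughput")
    if measured["error_rate"] > target.max_error_rate:
        violations.append("error_rate")
    return violations


# Example: only the response time exceeds its (assumed) limit.
target = SlaTarget(max_response_time_s=10.0,
                   min_throughput_jobs_per_h=50.0,
                   max_error_rate=0.05)
measured = {"response_time_s": 12.0,
            "throughput_jobs_per_h": 80.0,
            "error_rate": 0.01}
print(sla_violations(measured, target))
```

A scheduler could use such a check to flag jobs at risk of violating the SLA; how violations translate into cost or power consumption depends on the specific agreement and is not modeled here.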
- What kinds of AI techniques in terms of heuristic, meta-heuristic, reinforcement learning, etc. are used?
Reply: Most of the MapReduce models utilized meta-heuristic algorithms and considered the problem to be addressed as a multi-objective problem considering the different constraints to improve the efficiency of the MapReduce models. The proposed work reviews optimization algorithms, including the M-CWOA and MM-MGSMO, to improve resource utilization within time and memory constraints to validate their effectiveness.
To enhance the MapReduce model efficiency, metaheuristic optimization algorithms are utilized to optimize the parameters and improve the performance by minimizing the computational cost, specifically for big data processing and clustering tasks.
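The multi-objective view mentioned in this reply can be illustrated with a small sketch of Pareto dominance over two objectives to be minimized, here execution time and memory use. This is not M-CWOA or MM-MGSMO; it only shows the dominance test that multi-objective optimizers commonly rely on. The configuration names and cost values are hypothetical.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b in every objective
    and strictly better in at least one (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: cost
        for name, cost in candidates.items()
        if not any(dominates(other, cost)
                   for other_name, other in candidates.items()
                   if other_name != name)
    }


# Hypothetical job configurations: (execution time in s, memory in GB).
configs = {
    "A": (120.0, 8.0),
    "B": (100.0, 10.0),
    "C": (130.0, 9.0),   # dominated by A (slower and uses more memory)
    "D": (100.0, 12.0),  # dominated by B (same time, more memory)
}
print(sorted(pareto_front(configs)))
```

In this example, configurations A and B remain because neither is better than the other in both objectives; a metaheuristic would search for such non-dominated settings rather than enumerate them.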
- Introducing some real-world projects that encounter big data challenges can improve the value.
Reply: Based on the reviewer’s suggestion, the real-world project that encounters big data challenges is discussed in section 5.4 of the manuscript.
- What are typical implementation environments? Please put explanation about implementation/simulation environment and programming languages.
Reply: The proposed review analyzes MapReduce models for processing big data and does not include any implementation of a model. No single implementation environment can be described, because the MapReduce models use different configurations for big data processing. The models are mostly implemented within the Apache Hadoop framework, which is designed for processing big data. Further, the experimental setup of the case study is described in Section 5.4.2 of the revised manuscript.
- Please put a section for discussion with statistical analysis.
Reply: The statistical analysis of the MapReduce models consists of evaluating their performance using metrics such as execution time, resource utilization, and throughput, which are provided under the achievements in Table 1 in Section 3 of the manuscript.
- I appreciate you as you consider “Conclusion and future direction” for the last title. Please put a condense sentence about limitation of existing works. And what is the future direction?
Reply: Based on the reviewer’s suggestion, the last title is modified to “Conclusion and future direction”. Further, the limitations of the existing works are included in the conclusion, and the future direction for the proposed research is included to support and strengthen its contribution.
In the current form, I recommend Major revision. I will put some minor comments provided the major concerns and issues are addressed.
Reply: Thank you for your constructive feedback.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Suggestions for improvement have been successfully implemented. There are no additional issues related to the manuscript.
Author Response
Dear Sir/Madam,
Thank you for your time and effort in reviewing our manuscript entitled “A Comprehensive Survey of the MapReduce Models for Processing Big Data”, and for the valuable comments and suggestions that helped us revise the paper.
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript has been improved significantly. However, some of the references have not been placed. The stated problem is extended in single-objective and multi-objective viewpoints that have not been mentioned in the manuscript. Adding some single-objective and multi-objective algorithms is appreciated. For instance,
- The paper suffers to well introducing the problem in introduction in each part. Please mention some single objective and multi-objective approaches and also mention what are constraints. For instance, paper titled “An energy-efficient topology-aware virtual machine placement in Cloud Datacenters: A multi-objective discrete JAYA optimization” uses multi-objective algorithm for Virtual machine placement problem. Each VM runs a chunk of data placed in a physical server. It extends algorithm in favor of both user and provider viewpoint.
- You can add more relevant papers to give readers broad insights.
Author Response
Dear Sir/Madam,
Thank you for your time and effort in reviewing our manuscript entitled “A Comprehensive Survey of the MapReduce Models for Processing Big Data”, and for the valuable comments and suggestions that helped us revise the paper. The changes are marked in red in the manuscript.
M. H. Shirvani [82] uses a multi-objective algorithm for the virtual machine placement problem. Each VM runs a chunk of data placed in a physical server. Therefore, requested applications such as big data and MapReduce projects, which need several dependent VMs, impose a large bandwidth and power-consumption burden on network elements.
[82] Shirvani, M. H. (2023). An energy-efficient topology-aware virtual machine placement in cloud datacenters: a multi-objective discrete jaya optimization. Sustainable Computing: Informatics and Systems, 38, 100856.