Search Results (182)

Search Parameters:
Keywords = hadoop

14 pages, 810 KB  
Article
Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark
by Ylenia Galluzzo, Raffaele Giancarlo, Mario Randazzo and Simona E. Rombo
Data 2026, 11(3), 48; https://doi.org/10.3390/data11030048 - 2 Mar 2026
Viewed by 416
Abstract
With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of “omics” data are collected daily and need to be processed. Indexing and compressing large sequence datasets are among the most important tasks in this context. Here, we propose a novel approach for computing the Burrows-Wheeler transform that relies on Big Data technologies, i.e., Apache Spark and Hadoop. We implement three algorithms based on the MapReduce framework that distribute the index computation, and not only the input dataset, unlike previous approaches from the literature. Experiments performed on real datasets show that the proposed approach is promising.
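The transform at the heart of this work can be sketched on a single machine with the classic sorted-rotations construction. This is a toy illustration only; it is not the paper's distributed Spark/MapReduce algorithms, and real genomic pipelines use suffix-array-based methods instead of sorting all rotations.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted cyclic rotations.

    The sentinel must sort before every character in `text` so the
    transform is invertible.
    """
    s = text + sentinel
    # All cyclic rotations of s, sorted lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # -> annb$aa
```

The quadratic memory of materializing every rotation is exactly why computing the BWT at genomic scale motivates distributed approaches like the one in the article.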

19 pages, 807 KB  
Article
DAG-Guided Active Fuzzing: A Deterministic Approach to Detecting Race Conditions in Distributed Cloud Systems
by Hongyi Zhao, Zhen Li, Yueming Wu and Deqing Zou
Appl. Sci. 2026, 16(4), 2061; https://doi.org/10.3390/app16042061 - 19 Feb 2026
Viewed by 555
Abstract
The rapid expansion of distributed cloud platforms introduces critical security challenges, specifically non-deterministic race conditions like Time-of-Check to Time-of-Use (TOCTOU) vulnerabilities. Traditional passive detection methods often fail to identify these transient “Heisenbugs” due to the asynchronous nature of multi-threaded control planes. To address this, we propose a novel DAG-Guided Active Fuzzing framework. Our approach constructs a Directed Acyclic Graph (DAG) to map causal dependencies of API operations and implements deterministic proactive scheduling. By injecting microsecond-level delays into identified race windows, the system enforces adversarial interleavings to expose hidden order and atomicity violations. Validated on 32 verified vulnerabilities across six distributed systems (including Hadoop and OpenStack), our method achieves an overall Recall (Detection Rate) of 68.8% across the entire dataset and a peak Precision of 92% in reproducibility tests, significantly outperforming random fuzzing baselines (p<0.01). Furthermore, the framework maintains a low runtime overhead of 11.5%. These findings demonstrate a favorable trade-off between detection depth and system efficiency, establishing the approach as a robust toolchain for transforming theoretical concurrency risks into reproducible security findings in large-scale cloud infrastructure.
(This article belongs to the Special Issue Cyberspace Security Technology in Computer Science)
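The delay-injection idea, widening a race window so that an atomicity violation manifests deterministically instead of occasionally, can be illustrated with a toy lost-update race. The names and timings here are illustrative assumptions and bear no relation to the paper's actual framework:

```python
import threading
import time

counter = {"value": 0}

def unsafe_increment(injected_delay: float) -> None:
    # Read-modify-write without a lock: the race window sits
    # between the read and the write.
    v = counter["value"]            # check (read)
    time.sleep(injected_delay)      # injected delay widens the race window
    counter["value"] = v + 1        # act (write)

# Two concurrent increments; both read 0 before either writes.
threads = [threading.Thread(target=unsafe_increment, args=(0.05,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# One update is lost: the final value is 1, not 2.
print(counter["value"])
```

Without the injected delay the lost update appears only rarely; with it, the adversarial interleaving is forced on essentially every run, which is the reproducibility property the fuzzing framework exploits.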

27 pages, 3230 KB  
Article
Enhanced MQTT Protocol for Securing Big Data/Hadoop Data Management
by Ferdaous Kamoun-Abid and Amel Meddeb-Makhlouf
J. Sens. Actuator Netw. 2026, 15(1), 22; https://doi.org/10.3390/jsan15010022 - 16 Feb 2026
Viewed by 963
Abstract
Big data has significantly transformed data processing and analytics across various domains. However, ensuring security and data confidentiality in distributed platforms such as Hadoop remains a challenging task. Distributed environments face major security issues, particularly in the management and protection of large-scale data. In this article, we focus on the cost of secure information transmission, implementation complexity, and scalability. Furthermore, we address the confidentiality of information stored in Hadoop by analyzing different AES encryption modes and examining their potential to enhance Hadoop security. At the application layer, we operate within our Hadoop environment using an extended, secure, and widely used MQTT protocol for large-scale data communication. This approach is based on implementing MQTT with TLS, and before connecting, we add a hash verification of the data nodes’ identities and send the JWT. This protocol uses TCP at the transport layer for underlying transmission. The advantage of TCP lies in its reliability and small header size, making it particularly suitable for big data environments. This work proposes a triple-layer protection framework. The first layer is the assessment of the performance of existing AES encryption modes (CTR, CBC, and GCM) with different key sizes to optimize data confidentiality and processing efficiency in large-scale Hadoop deployments. Afterwards, we propose evaluating the integrity of DataNodes using a novel verification mechanism that employs SHA-3-256 hashing to authenticate nodes and prevent unauthorized access during cluster initialization. At the third tier, the integrity of data blocks within Hadoop is ensured using SHA-3-256. Through extensive performance testing and security validation, we demonstrate the effectiveness of the proposed integration.
(This article belongs to the Section Network Security and Privacy)
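The node-identity hashing step can be sketched as below. The function names and the shared-secret scheme are illustrative assumptions, not the paper's exact protocol (which additionally layers TLS and JWTs on top of the hash check):

```python
import hashlib
import hmac

def node_fingerprint(node_id: str, cluster_secret: str) -> str:
    """SHA-3-256 digest over a node identity plus a shared secret."""
    return hashlib.sha3_256(f"{node_id}:{cluster_secret}".encode()).hexdigest()

def verify_node(node_id: str, presented: str, cluster_secret: str) -> bool:
    """Check a presented fingerprint before admitting a DataNode."""
    expected = node_fingerprint(node_id, cluster_secret)
    # Constant-time comparison avoids leaking digest prefixes via timing.
    return hmac.compare_digest(expected, presented)

fp = node_fingerprint("datanode-01", "s3cret")
print(verify_node("datanode-01", fp, "s3cret"))  # -> True
print(verify_node("datanode-02", fp, "s3cret"))  # -> False
```

The same `hashlib.sha3_256` primitive would serve the third tier of the framework, hashing data blocks rather than node identities.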

13 pages, 1149 KB  
Article
Monitoring IoT and Robotics Data for Sustainable Agricultural Practices Using a New Edge–Fog–Cloud Architecture
by Mohamed El-Ouati, Sandro Bimonte and Nicolas Tricot
Computers 2026, 15(1), 32; https://doi.org/10.3390/computers15010032 - 7 Jan 2026
Viewed by 877
Abstract
Modern agricultural operations generate high-volume and diverse data (historical and stream) from various sources, including IoT devices, robots, and drones. This paper presents a novel smart farming architecture specifically designed to efficiently manage and process this complex data landscape. The proposed architecture comprises five distinct, interconnected layers: the Source Layer, the Ingestion Layer, the Batch Layer, the Speed Layer, and the Governance Layer. The Source Layer serves as the unified entry point, accommodating structured, spatial, and image data from sensors, drones, and ROS-equipped robots. The Ingestion Layer uses a hybrid fog/cloud architecture with Kafka for real-time streams and for batch processing of historical data. Data is then segregated for processing: the cloud-deployed Batch Layer employs a Hadoop cluster, Spark, Hive, and Drill for large-scale historical analysis, while the Speed Layer utilizes GeoFlink and PostGIS for low-latency, real-time geovisualization. Finally, the Governance Layer guarantees data quality, lineage, and organization across all components using OpenMetadata. This layered, hybrid approach provides a scalable and resilient framework capable of transforming raw agricultural data into timely, actionable insights, addressing the critical need for advanced data management in smart farming.
(This article belongs to the Special Issue Computational Science and Its Applications 2025 (ICCSA 2025))

26 pages, 3290 KB  
Article
Empirical Evaluation of Big Data Stacks: Performance and Design Analysis of Hadoop, Modern, and Cloud Architectures
by Widad Elouataoui and Youssef Gahi
Big Data Cogn. Comput. 2026, 10(1), 7; https://doi.org/10.3390/bdcc10010007 - 24 Dec 2025
Viewed by 2245
Abstract
The proliferation of big data applications across various industries has led to a paradigm shift in data architecture, with traditional approaches giving way to more agile and scalable frameworks. The evolution of big data architecture began with the emergence of the Hadoop-based data stack, leveraging technologies like Hadoop Distributed File System (HDFS) and Apache Spark for efficient data processing. However, recent years have seen a shift towards modern data stacks, offering flexibility and diverse toolsets tailored to specific use cases. Concurrently, cloud computing has revolutionized big data management, providing unparalleled scalability and integration capabilities. Despite their benefits, navigating these data stack paradigms can be challenging. While the existing literature offers valuable insights into individual data stack paradigms, there remains a dearth of studies that offer practical, in-depth comparisons of these paradigms across the entire big data value chain. To address this gap, this paper examines three main big data stack paradigms: the Hadoop data stack, the modern data stack, and the cloud-based data stack. In this study, we conduct an exhaustive architectural comparison of these stacks covering the entire big data value chain from data acquisition to exposition. Moreover, this study extends beyond architectural considerations to include end-to-end use case implementations for a comprehensive evaluation of each stack. Using a large dataset of Amazon reviews, different data stack scenarios are implemented and compared. Furthermore, the paper explores critical factors such as data integration, implementation costs, and ease of deployment to provide researchers and practitioners with a relevant and up-to-date reference for navigating the complex landscape of big data technologies and making informed decisions about data strategies.
(This article belongs to the Topic Big Data and Artificial Intelligence, 3rd Edition)

19 pages, 700 KB  
Article
BiGRMT: Bidirectional GRU–Recurrent Memory Transformer for Efficient Long-Sequence Anomaly Detection in High-Concurrency Microservices
by Ruicheng Zhang, Renzun Zhang, Shuyuan Wang, Kun Yang, Miao Xu, Dongwei Qiao and Xuanzheng Hu
Electronics 2025, 14(23), 4754; https://doi.org/10.3390/electronics14234754 - 3 Dec 2025
Viewed by 832
Abstract
In high-concurrency distributed systems, log data often exhibits sequence uncertainty and redundancy, which pose significant challenges to the accuracy and efficiency of anomaly detection. To address these issues, we propose BiGRMT, a hybrid architecture that integrates a Bidirectional Gated Recurrent Unit (Bi-GRU) with a Recurrent Memory Transformer (RMT). BiGRMT enhances local temporal feature extraction through bidirectional modeling and adaptive noise filtering using the Bi-GRU, while an RMT component is incorporated to significantly extend the model’s capacity for long-sequence modeling via segment-level memory. The Transformer’s multi-head attention mechanism continues to capture global time dependencies, but now with improved efficiency due to the RMT’s memory-sharing design. Extensive experiments on three benchmark datasets from LogHub (Spark, BGL (Blue Gene/L), and HDFS (Hadoop Distributed File System)) demonstrate that BiGRMT achieves strong results in terms of precision, recall, and F1-score. It attains a precision of 0.913, outperforming LogGPT (0.487) and slightly exceeding the Temporal Logical Attention Network (TLAN) (0.912). Compared to LogPal, which prioritizes detection accuracy, BiGRMT strikes a better balance by significantly reducing computational overhead while maintaining high detection performance. Even under challenging conditions such as a 50% increase in log generation rate or 20% injected noise, BiGRMT maintains F1-scores of 87.4% and 83.6%, respectively, showcasing excellent robustness. These findings confirm that BiGRMT is a scalable and practical solution for automated fault detection and intelligent maintenance in complex distributed software systems.

18 pages, 2239 KB  
Article
AI–Big Data Analytics Platform for Energy Forecasting in Modern Power Systems
by Martin Santos-Dominguez, Nicasio Hernandez Flores, Isaac Alberto Parra-Ramirez and Gustavo Arroyo-Figueroa
Big Data Cogn. Comput. 2025, 9(11), 272; https://doi.org/10.3390/bdcc9110272 - 31 Oct 2025
Cited by 1 | Viewed by 3781
Abstract
Big Data Analytics is vital for power grids, as it empowers informed decision-making, anticipates potential operational and maintenance issues, optimizes grid management, supports renewable energy integration, ultimately reduces costs, improves customer service, monitors consumer behavior, and offers new services. This paper describes an AI–Big Data Analytics Architecture based on a data lake that uses a reduced and customized set of Hadoop and Spark components as a cost-effective, on-premises alternative for advanced data analytics in power systems. As a case study, a comparative analysis of electricity price forecasting models in the day-ahead market for nodes of the Mexican national electrical system using statistical, machine learning, and deep learning models is presented. To build and select the best forecasting model, a data science and machine learning methodology is used. The results show that the Gradient Boosting and Support Vector Regression models presented the best performance, with a Mean Absolute Percentage Error (MAPE) between 1% and 4% for five-day-ahead electricity price forecasting. The implementation of the best forecasting model in the Big Data Analytics Platform allows the automation of the local electricity price forecast per node (every 24, 72, or 120 h) and its display in a comparative dashboard with actual and forecasted data for decision-making on demand. The proposed architecture is a valuable tool that enables the future implementation of intelligent energy forecasting models in power grids, such as load demand, fuel prices, power generation, and consumption, among others.
(This article belongs to the Special Issue Machine Learning and AI Technology for Sustainable Development)
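The MAPE metric reported above is straightforward to compute; the sketch below uses made-up price values purely for illustration, not the paper's data:

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Undefined when any actual value is zero, which is why price
    series (strictly positive) suit this metric well.
    """
    if len(actual) != len(forecast):
        raise ValueError("series must have equal length")
    return 100.0 * sum(abs(a - f) / abs(a)
                       for a, f in zip(actual, forecast)) / len(actual)

# Toy hourly prices (illustrative numbers only).
actual   = [100.0, 105.0, 98.0, 102.0]
forecast = [ 99.0, 107.0, 97.0, 104.0]
print(round(mape(actual, forecast), 2))  # -> 1.47
```

A result in this range matches the 1-4% band the study reports for its best models.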

18 pages, 1346 KB  
Article
EC-Kad: An Efficient Data Redundancy Scheme for Cloud Storage
by Min Cui and Yipeng Wang
Electronics 2025, 14(9), 1700; https://doi.org/10.3390/electronics14091700 - 22 Apr 2025
Viewed by 1812
Abstract
To address the issues of fault tolerance and retrieval efficiency in cloud storage space data, we propose an efficient cloud storage solution based on erasure codes. A cloud storage system model is designed to use erasure codes to distribute the encoded original data files across various nodes of the cloud storage system in a decentralized manner. The files are decoded by the receiver to complete data recovery and ensure high availability of the data files while optimizing redundant computing overhead during data storage, thereby improving the stability of encoding and decoding and reducing the bit error rate. Additionally, the Kademlia protocol is utilized to improve the retrieval efficiency of distributed disaster recovery storage data blocks. The proposed solution is tested on the Hadoop cloud storage platform, and the experimental results demonstrate that it not only maintains high availability but also enhances the efficiency of retrieving data files.
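The erasure-coding idea can be illustrated with a single XOR parity block: a minimal sketch that tolerates exactly one lost block, whereas production systems (and, presumably, the scheme in this paper) use stronger codes such as Reed-Solomon that survive multiple failures:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def encode(data: bytes, k: int):
    """Split data into k equal blocks plus one XOR parity block."""
    size = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(k * size, b"\0")      # zero-pad to a multiple of k
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]      # k data blocks + 1 parity

def recover(blocks, lost_index):
    """Rebuild the block at lost_index by XOR-ing the k survivors."""
    survivors = [b for i, b in enumerate(blocks)
                 if i != lost_index and b is not None]
    return xor_blocks(survivors)

blocks = encode(b"hadoop-block-data", 4)
original = blocks[1]
blocks[1] = None                              # simulate a failed node
print(recover(blocks, 1) == original)         # -> True
```

Scattering the k + 1 blocks across distinct nodes, as the paper's system model does, is what turns this redundancy into fault tolerance.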

45 pages, 4361 KB  
Article
Engineering Sustainable Data Architectures for Modern Financial Institutions
by Sergiu-Alexandru Ionescu, Vlad Diaconita and Andreea-Oana Radu
Electronics 2025, 14(8), 1650; https://doi.org/10.3390/electronics14081650 - 19 Apr 2025
Cited by 16 | Viewed by 8951
Abstract
Modern financial institutions now manage increasingly advanced data-related activities and place a growing emphasis on environmental and energy impacts. In financial modeling, relational databases, big data systems, and the cloud are integrated, taking into consideration resource optimization and sustainable computing. We propose a four-layer architecture to address financial data processing issues. The layers of our design are for data sources, data integration, processing, and storage. Data ingestion processes market feeds, transaction records, and customer data. Real-time data are captured by Kafka and transformed by Extract-Transform-Load (ETL) pipelines. The processing layer is composed of Apache Spark for real-time data analysis, Hadoop for batch processing, and a Machine Learning (ML) infrastructure that supports predictive modeling. In order to optimize access patterns, the storage layer includes various data layer components. The test results indicate that real-time market data processing, compliance reporting, risk evaluations, and customer analyses can be conducted in fulfillment of environmental sustainability goals. The metrics from the test deployment support the implementation strategies and technical specifications of the architectural components. We also examined integration models and data flow improvements, with applications in finance. This study aims to enhance enterprise data architecture in the financial context and includes guidance on modernizing data infrastructure.

37 pages, 3325 KB  
Review
A Comprehensive Survey of MapReduce Models for Processing Big Data
by Hemn Barzan Abdalla, Yulia Kumar, Yue Zhao and Davide Tosi
Big Data Cogn. Comput. 2025, 9(4), 77; https://doi.org/10.3390/bdcc9040077 - 27 Mar 2025
Cited by 3 | Viewed by 5356
Abstract
With the rapid increase in the amount of big data, traditional software tools face complexity in tackling big data, which is a major concern in the research community. In addition, the management and processing of big data have become more difficult, thus increasing security threats. Various fields have encountered issues in fully making use of these large-scale data to support decision-making. Data mining methods have been tremendously improved to identify patterns for sorting larger sets of data. MapReduce models provide great advantages for in-depth data evaluation and are compatible with various applications. This survey analyses the various MapReduce models utilized for big data processing, the techniques harnessed in the reviewed literature, and the challenges. Furthermore, this survey reviews the major advancements of diverse types of MapReduce models, namely Hadoop, Hive, Pig, MongoDB, Spark, and Cassandra. Besides the reliable MapReduce approaches, this survey also examines various metrics utilized for computing the performance of big data processing among the applications. More specifically, this review summarizes the background of MapReduce and its terminologies, types, different techniques, and applications to advance the MapReduce framework for big data processing. This study provides good insights for conducting more experiments in the field of processing and managing big data.
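The programming model the survey covers can be sketched as the canonical word count, with explicit map, shuffle, and reduce phases. This is a single-process toy showing the dataflow, not a distributed implementation:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Map: emit an intermediate (word, 1) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework
    # does between the map and reduce stages.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: aggregate the list of values for each key.
    return {word: sum(values) for word, values in grouped}

docs = ["hadoop spark hadoop", "spark hive"]
print(reduce_phase(shuffle(map_phase(docs))))
# -> {'hadoop': 2, 'hive': 1, 'spark': 2}
```

Each system the survey compares (Hadoop, Hive, Pig, Spark, and so on) ultimately executes this same map/shuffle/reduce dataflow, differing in how the phases are expressed and scheduled.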

11 pages, 4553 KB  
Article
Safety Autonomous Platform for Data-Driven Risk Management Based on an On-Site AI Engine in the Electric Power Industry
by Dongyeop Lee, Daesik Lim and Joonwon Lee
Appl. Sci. 2025, 15(2), 630; https://doi.org/10.3390/app15020630 - 10 Jan 2025
Cited by 2 | Viewed by 2160
Abstract
The electric power industry poses significant risks to workers with a wide range of hazards such as electrocution, electric shock, burns, and falls. Regardless of the types and characteristics of these hazards, electric power companies should protect their workers and provide a safe and healthy working environment, but it is difficult to identify the potential health and safety risks present in their workplace and take appropriate action to keep their workers free from harm. Therefore, this paper proposes a novel safety autonomous platform (SAP) for data-driven risk management in the electric power industry. It can automatically and precisely provide a safe and healthy working environment with the cooperation of safety mobility gateways (SMGs) according to the safety rule and risk index data created by the risk level of a current task, a worker profile, and the output of an on-site artificial intelligence (AI) engine in the SMGs. We practically implemented the proposed SAP architecture using the Hadoop ecosystem and verified its feasibility through a performance evaluation of the on-site AI engine and real-time operation of risk assessment and alarm notification for data-driven risk management.

25 pages, 2855 KB  
Article
Automatic Refactoring Approach for Asynchronous Mechanisms with CompletableFuture
by Yang Zhang, Zhaoyang Xie, Yanxia Yue and Lin Qi
Appl. Sci. 2024, 14(19), 8866; https://doi.org/10.3390/app14198866 - 2 Oct 2024
Cited by 1 | Viewed by 2228
Abstract
To address the inherent limitations of Future in asynchronous programming frameworks, JDK 1.8 introduced the CompletableFuture class, which features approximately 50 different methods for composing and executing asynchronous computations and handling exceptions. This paper proposes an automatic refactoring method that integrates multiple static analysis techniques, including visitor pattern analysis, alias analysis, and executor inheritance structure analysis, to conduct precondition checks. Distinct from existing Future refactoring methods, this approach considers custom executor types, thereby extending its applicability. Using this method, the ReFuture automatic refactoring plugin was implemented within the Eclipse JDT framework. The method was evaluated in terms of the number of refactorings, refactoring time, and error introduction, alongside a side-by-side comparison with the existing method. The refactoring outcomes for nine large applications, including ActiveMQ, Hadoop, and Elasticsearch, show that ReFuture successfully refactored 639 out of 813 potential code structures, achieving a refactoring success rate of 64.70% without introducing errors. This tool effectively facilitates the refactoring to CompletableFuture and enhances refactoring efficiency compared to manual methods.

16 pages, 4769 KB  
Article
Digital Forensics Readiness in Big Data Networks: A Novel Framework and Incident Response Script for Linux–Hadoop Environments
by Cephas Mpungu, Carlisle George and Glenford Mapp
Appl. Syst. Innov. 2024, 7(5), 90; https://doi.org/10.3390/asi7050090 - 25 Sep 2024
Viewed by 3962
Abstract
The surge in big data and analytics has catalysed the proliferation of cybercrime, largely driven by organisations’ intensified focus on gathering and processing personal data for profit while often overlooking security considerations. Hadoop and its derivatives are prominent platforms for managing big data; however, investigating security incidents within Hadoop environments poses intricate challenges due to scale, distribution, data diversity, replication, component complexity, and dynamicity. This paper proposes a big data digital forensics readiness framework and an incident response script for Linux–Hadoop environments, streamlining preliminary investigations. The framework offers a novel approach to digital forensics in the domains of big data and Hadoop environments. A prototype of the incident response script for Linux–Hadoop environments was developed and evaluated through comprehensive functionality and usability testing. The results demonstrated robust performance and efficacy.
(This article belongs to the Section Information Systems)

21 pages, 10483 KB  
Article
Evading Cyber-Attacks on Hadoop Ecosystem: A Novel Machine Learning-Based Security-Centric Approach towards Big Data Cloud
by Neeraj A. Sharma, Kunal Kumar, Tanzim Khorshed, A B M Shawkat Ali, Haris M. Khalid, S. M. Muyeen and Linju Jose
Information 2024, 15(9), 558; https://doi.org/10.3390/info15090558 - 10 Sep 2024
Cited by 5 | Viewed by 2334
Abstract
The growing industry and its complex and large information sets require Big Data (BD) technology and its open-source frameworks (Apache Hadoop) to (1) collect, (2) analyze, and (3) process the information. This information usually ranges in size from gigabytes to petabytes of data. However, processing this data involves web consoles and communication channels which are prone to intrusion from hackers. To resolve this issue, a novel machine learning (ML)-based security-centric approach has been proposed to evade cyber-attacks on the Hadoop ecosystem while considering the complexity of Big Data in Cloud (BDC). An Apache Hadoop-based management interface, “Ambari”, was implemented to address the variation and distinguish between attacks and activities. The analyzed experimental results show that the proposed scheme effectively (1) blocked the interface communication and retrieved the performance measured data from (2) the Ambari-based virtual machine (VM) and (3) the BDC hypervisor. Moreover, the proposed architecture was able to provide a reduction in false alarms as well as cyber-attack detection.
(This article belongs to the Special Issue Cybersecurity, Cybercrimes, and Smart Emerging Technologies)

25 pages, 1542 KB  
Review
Data Lakes: A Survey of Concepts and Architectures
by Sarah Azzabi, Zakiya Alfughi and Abdelkader Ouda
Computers 2024, 13(7), 183; https://doi.org/10.3390/computers13070183 - 22 Jul 2024
Cited by 22 | Viewed by 20881
Abstract
This paper presents a comprehensive literature review on the evolution of data-lake technology, with a particular focus on data-lake architectures. By systematically examining the existing body of research, we identify and classify the major types of data-lake architectures that have been proposed and implemented over time. The review highlights key trends in the development of data-lake architectures, identifies the primary challenges faced in their implementation, and discusses future directions for research and practice in this rapidly evolving field. We have developed diagrammatic representations to highlight the evolution of various architectures. These diagrams use consistent notations across all architectures to further enhance the comparative analysis of the different architectural components. We also explore the differences between data warehouses and data lakes. Our findings provide valuable insights for researchers and practitioners seeking to understand the current state of data-lake technology and its potential future trajectory.