An Overview of Data Warehouse and Data Lake in Modern Enterprise Data Management
Abstract
1. Introduction
2. Definition: Big Data Analytics, Data Warehouses, and Data Lakes
2.1. Big Data Analytics
- Volume, or the available amount of data;
- Velocity, or the speed of data processing;
- Variety, or the different types of big data;
- Volatility, or the variability of the data;
- Veracity, or the accuracy of the data;
- Visualization, or the depiction of big data-generated insights through visual representation;
- Value, or the benefits organizations derive from the data.
2.2. Data Warehouses
2.3. Data Lake
2.4. The Difference between Data Warehouses and Data Lakes
2.5. Literature Review
3. Architecture
3.1. Data Warehouse Architecture
- Single-tier architecture: This kind of single-layer model minimizes the amount of data stored. It helps remove data redundancy. However, its disadvantage is the lack of a component that separates analytical and transactional processing. This kind of architecture is not frequently used in practice.
- Two-tier architecture: This model separates the physically available sources from the data warehouse by means of a staging area. Such an architecture makes sure that all data loaded into the warehouse are in an appropriate cleansed format. Nevertheless, this architecture is neither expandable nor able to support many end users. Additionally, it has connectivity problems due to network limitations.
- Three-tier architecture: This is the most widely used architecture for data warehouses [56,57]. It consists of a top, middle, and bottom tier. In the bottom tier, data are cleansed, transformed, and loaded via backend tools. This tier serves as the database of the data warehouse. The middle tier is an OLAP server that presents an abstract view of the database by acting as a mediator between the end user and the database. The top tier, the front-end client layer, consists of the tools and an API that are used to connect and get data out from the data warehouse (e.g., query tools, reporting tools, managed query tools, analysis tools, and data mining tools).
- Data warehouse database: The core foundation of the data warehouse environment is its central database. This is implemented using RDBMS technology [58]. However, there is a limitation to such implementations, since the traditional RDBMS system is optimized for transactional database processing and not for data warehousing. In this regard, the alternative means are (1) the usage of relational databases in parallel, which enables shared memory on various multiprocessor configurations or parallel processors, (2) new index structures to get rid of relational table scanning and improve the speed, and (3) multidimensional databases (MDDBs) used to circumvent the limitations caused by the relational data warehouse models.
- Extract, transform, and load (ETL) tools: All the conversions, summarizations, and changes required to transform data into a unified format in the data warehouse are carried out via extract, transform, and load (ETL) tools [59]. This ETL process helps the data warehouse achieve enhanced system performance and business intelligence, timely access to data, and a high return on investment:
  - Extraction: This involves connecting systems and collecting the data needed for analytical processing;
  - Transformation: The extracted data are converted into a standard format;
  - Loading: The transformed data are imported into a large data warehouse.
ETL tools can also anonymize confidential and sensitive information, as regulatory stipulations require, before loading it into the target data store [60]. They prevent unwanted data in operational databases from being loaded into DWs, amend the data arriving from different sources, and calculate summaries and derived data. Such ETL tools generate background jobs, COBOL programs, shell scripts, etc. that regularly update the data in the data warehouse. ETL tools also help with maintaining the metadata.
- Metadata: Metadata is the data about the data that define the data warehouse [61]. It deals with some high-level technological concepts and helps with building, maintaining, and managing the data warehouse. Metadata plays an important role in transforming data into knowledge, since it defines the source, usage, values, and features of the data warehouse and how to update and process the data in a data warehouse. A metadata tool is the most difficult one to choose because no clear standard exists, although data warehousing tool vendors are making efforts to unify a metadata model. One category of metadata, known as technical metadata, contains information about the warehouse that is used by its designers and administrators, whereas another category, called business metadata, contains details that enable end users to understand the information stored in the data warehouse.
- Query Tools: Query tools allow users to interact with the DW system and collect information relevant to businesses to make strategic decisions. Such tools can be of different types:
  - Query and reporting tools: Such tools help organizations generate regular operational reports and support high-volume batch jobs such as printing and calculating. Some popular reporting tool vendors are Brio, Oracle, Powersoft, and SAS Institute. Similarly, query tools shield end users from the pitfalls of SQL and database structure by inserting a meta-layer between the users and the database.
  - Application development tools: In addition to the built-in graphical and analytical tools, application development tools are leveraged to satisfy the analytical needs of an organization.
  - Data mining tools: These tools help in automating the process of discovering meaningful new correlations and structures by mining large amounts of data.
  - OLAP tools: Online analytical processing (OLAP) tools exploit the concepts of a multidimensional database and help analyze the data using complex multidimensional views [28,62]. There are two types of OLAP tools: multidimensional OLAP (MOLAP) and relational OLAP (ROLAP) [63]:
    - MOLAP: In such an OLAP tool, a cube is aggregated from the relational data source. When a user requests a report, the MOLAP tool returns the result promptly, since all the data are already pre-aggregated within the cube [64].
    - ROLAP: The ROLAP engine acts as a smart SQL generator. It comes with a “designer” piece, wherein the administrator specifies the association between the relational tables, attributes, and hierarchy map and the underlying database tables [65].
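The contrast between the two OLAP styles described above can be illustrated with a minimal sketch. It is not taken from any particular OLAP product: the `SALES` rows, the `build_cube` and `rolap_sql` helpers, and the use of `'*'` to mean "all values" are assumptions made for illustration, and an in-memory SQLite table stands in for the relational warehouse.

```python
import sqlite3
from collections import defaultdict

# Hypothetical fact rows: (region, product, amount). Illustrative data only.
SALES = [
    ("north", "widget", 10.0),
    ("north", "gadget", 5.0),
    ("south", "widget", 7.5),
    ("south", "widget", 2.5),
]
DIMENSIONS = ("region", "product")


def build_cube(rows):
    """MOLAP-style: pre-aggregate the measure for every combination of
    dimension values, using '*' to mean 'all values' of a dimension."""
    cube = defaultdict(float)
    for region, product, amount in rows:
        for r in (region, "*"):
            for p in (product, "*"):
                cube[(r, p)] += amount
    return dict(cube)


def rolap_sql(group_by):
    """ROLAP-style: generate SQL against the relational table on request."""
    if not group_by or any(col not in DIMENSIONS for col in group_by):
        raise ValueError(f"group_by must be a non-empty subset of {DIMENSIONS}")
    cols = ", ".join(group_by)
    return f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"


def run_rolap(rows, group_by):
    """Load the rows into an in-memory table and run the generated query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(rolap_sql(group_by)).fetchall()
    conn.close()
    return result


cube = build_cube(SALES)
print(cube[("south", "*")])          # answered by a lookup in the pre-aggregated cube
print(run_rolap(SALES, ["region"]))  # answered by SQL generated at query time
```

In this sketch, the cube pays the aggregation cost once, at build time, and answers each request with a dictionary lookup, whereas the ROLAP path stores nothing extra and aggregates on every request.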
3.2. Data Lake Architecture
- Raw data layer: This layer is also known as the ingestion layer or landing area because it acts as the sink of the data lake. The prime goal is to ingest raw data as quickly and as efficiently as possible. No transformations are allowed at this stage. With the help of the archive, it is possible to get back to a point in time with raw data. Overriding (i.e., handling duplicate versions of the same data) is not permitted. End users are not granted access to this layer. The data here are not ready to use, and consuming them correctly requires considerable knowledge.
- Standardized data layer: This is optional in most implementations. If fast growth of the data lake is expected, then this is a good option. The prime objective of the standardized layer is to boost the performance of the data transfer from the raw layer to the curated layer. In the raw layer, data are stored in their native format, whereas in the standardized layer, the format best suited to cleansing is selected.
- Cleansed layer or curated layer: In this layer, data are transformed into consumable data sets and stored in files or tables. This is one of the most complex parts of the whole data lake solution since it requires cleansing, transformation, denormalization, and consolidation of different objects. Furthermore, the data are organized by purpose, type, and file structure. Usually, end users are granted access only to this layer.
- Application layer: This is also known as the trusted layer, secure layer, or production layer. This is sourced from the cleansed layer and enforced with requisite business logic. In case the applications use machine learning models on the data lake, they are obtained from here. The structure of the data is the same as in the cleansed layer.
- Sandbox data layer: This is also another optional layer that is meant for analysts’ and data scientists’ work to carry out experiments and search for patterns or correlations. The sandbox data layer is the proper place to enrich the data with any source from the Internet.
- Security: While data lakes are not exposed to a broad audience, the security aspects are of great importance, especially during the initial phase and architecture design. Data lakes are unlike relational databases, which have an arsenal of security mechanisms.
- Governance: Monitoring and logging operations become crucial at some point while performing analysis.
- Metadata: This is the data about data. Most schemas carry additional details on the purpose of the data, with descriptions of how they are meant to be exploited.
- Stewardship: Based on the scale that is required, either the creation of a separate role or delegation of this responsibility to the users will be carried out, possibly through some metadata solutions.
- Master Data: This is an essential part of serving ready-to-use data. It can be either stored on the data lake or referenced while executing ELT processes.
- Archive: Data lakes keep some archive data that come from data warehousing. Otherwise, performance and storage-related problems may occur.
- Offload: This area helps to offload some time- and resource-consuming ETL processes to a data lake in case of relational data warehousing solutions.
- Orchestration and ELT processes: Once the data are pushed from the raw layer through the cleansed layer and to the sandbox and application layers, a tool is required to orchestrate the flow. Either an orchestration tool or some additional resources to execute them are leveraged in this regard.
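The flow from the raw layer to the curated layer described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the directory layout, the `ingest_raw` and `curate` helpers, and the sensor records are hypothetical, and the raw layer's "no overriding" rule is interpreted here as refusing to overwrite an existing raw object.

```python
import json
import tempfile
from pathlib import Path


def ingest_raw(raw_dir: Path, name: str, payload: str) -> Path:
    """Raw layer: store the payload exactly as received and never overwrite."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / name
    if target.exists():
        raise FileExistsError(f"raw object {name} already exists")
    target.write_text(payload, encoding="utf-8")
    return target


def curate(raw_dir: Path, curated_dir: Path) -> Path:
    """Curated layer: cleanse and consolidate raw objects into one data set."""
    curated_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for path in sorted(raw_dir.iterdir()):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            row = json.loads(line)
            if row.get("temp_c") is None:  # cleansing: drop incomplete rows
                continue
            records.append({
                "sensor": row["sensor"].strip().lower(),  # normalize identifiers
                "temp_c": float(row["temp_c"]),           # normalize types
            })
    out = curated_dir / "readings.json"
    out.write_text(json.dumps(records), encoding="utf-8")
    return out


with tempfile.TemporaryDirectory() as root:
    lake = Path(root)
    ingest_raw(
        lake / "raw",
        "2022-09-25.jsonl",
        '{"sensor": " A1 ", "temp_c": "21.5"}\n{"sensor": "B2", "temp_c": null}\n',
    )
    try:
        ingest_raw(lake / "raw", "2022-09-25.jsonl", "{}")  # second write is refused
    except FileExistsError:
        overwrite_blocked = True
    else:
        overwrite_blocked = False
    curated_path = curate(lake / "raw", lake / "curated")
    curated = json.loads(curated_path.read_text(encoding="utf-8"))

print(overwrite_blocked, curated)
```

In a real deployment, an orchestration tool would schedule the `curate` step, and an optional standardized layer could sit between the two directories; both are omitted here for brevity.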
4. Design Aspects
4.1. Data Warehouse Design Considerations for Business Needs
- User needs and appropriate data model: The very first design consideration in a data warehouse is the business and user needs. Hence, during the design phase, the integration of the data warehouse with existing business processes and its compatibility with long-term strategies have to be ensured. Enterprises have to clearly comprehend the purpose of their data warehouse, any technical requirements, the benefits of the system for end users, and the improved means of reporting for business intelligence (BI) and analytics. In this regard, identifying what information is important to the business is essential to the success of the data warehouse. To facilitate this, creating an appropriate data model of the business is a key aspect when designing DWs (e.g., with SQL Developer Data Modeler (SDDM)). Furthermore, a data flow diagram can also help in depicting the data flow within the company.
- Adopting a standard data warehouse architecture and methodology: While designing a DW, yet another important practical consideration is to leverage a recognized DW modeling standard (e.g., 3NF, star schema (dimensional), and Data Vault) [73]. Selecting such a standard architecture and sticking to it can augment the efficiency of a data warehouse development approach. Similarly, an agile data warehouse methodology is also an important practical aspect. With proper planning, DW projects can be compartmentalized into smaller pieces that can be delivered faster. This design trick helps to reprioritize DW work as a business’s needs change.
- Cloud vs. on-premises storage: Enterprises can opt for either an on-premises architecture or a cloud data warehouse [13]. The former category requires setting up the physical environment, including all the servers necessary to power ETL processes, storage, and analytic operations, whereas the latter can skip this step. However, a few circumstances exist where it still makes sense to consider an on-premises approach. For example, if most of the critical databases are on-premises and are old enough, they may not work well with cloud-based data warehouses. Furthermore, if the organization has to deal with strict regulatory requirements, which might include no offshore data storage, an on-premises setting might be the better choice. Nevertheless, cloud-based services provide the most flexible data warehousing service in the market in terms of storage and the pay-as-you-go nature.
- Data tool ecosystem and data modeling: The organization’s ecosystem plays a key role. Adopting a DW automation tool ensures the efficient usage of IT resources, faster implementation through projects, and better support by enforcing coding standards (Wherescape (https://www.wherescape.com, accessed on 25 September 2022), AnalytixDS, Ajilius (https://tracxn.com/d/companies/ajilius.com, accessed on 25 September 2022), etc.). The data modeling planning step imparts detailed, reusable documentation of a data warehouse’s implementation. Specifically, it assesses the data structures, investigates how to efficiently represent these sources in the data warehouse, specifies OLAP requirements, etc.
- ETL or ELT design: Selection of the appropriate ETL or ELT solution is yet another design concern [39]. When businesses use expensive in-house analytics systems, much prep work, including transformations, can be conducted before loading, as in the ETL scheme. However, ELT is a better approach when the destination is a cloud data warehouse. Once data are colocated, the power of a single cloud engine can be leveraged to perform integrations and transformations efficiently. Organizations can then transform their raw data at any time according to their use case, rather than as a fixed step in the data pipeline.
- Semantic and reporting layers: Based on previously documented data models, the OLAP server is implemented to facilitate the analytical queries of the users and to empower BI systems. In this regard, data engineers should carefully consider time-to-analysis and latency requirements to assess the analytical processing capabilities of the data warehouse. Similarly, while designing the reporting layer, the implementation of reporting interfaces or delivery methods as well as permissible access have to be set by the administrator.
- Ease of scalability: Understanding current business needs is critical to business intelligence and decision making. This includes how much data the organization currently has and how quickly its needs are likely to grow. Staffing and vendor costs need to be taken into consideration while deciding the scale of growth.
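The difference between the ETL and ELT designs discussed above can be made concrete with a small sketch. Everything here is an assumption made for illustration: the `SOURCE` rows, the table names, and the `pseudonymize` helper are hypothetical, an in-memory SQLite database stands in for the destination warehouse, and the truncated SHA-256 digest is only a placeholder for whatever anonymization scheme regulations actually require.

```python
import hashlib
import sqlite3

# Hypothetical source rows: (customer email, order amount as text).
SOURCE = [
    ("ann@example.com", "19.90"),
    ("bob@example.com", "5.10"),
    ("ann@example.com", "0.10"),
]


def pseudonymize(value: str) -> str:
    """Replace an identifier with a truncated SHA-256 digest (illustrative only;
    a production scheme would need a keyed or salted approach)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def etl(conn):
    """ETL: transform outside the warehouse, then load only the clean rows."""
    clean = [(pseudonymize(email), round(float(amount) * 100))
             for email, amount in SOURCE]
    conn.execute("CREATE TABLE orders_etl (customer TEXT, cents INTEGER)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", clean)


def elt(conn):
    """ELT: load the raw rows first, then transform inside the engine with SQL."""
    conn.execute("CREATE TABLE orders_raw (email TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", SOURCE)
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT pseudonymize(email) AS customer, "
        "CAST(ROUND(CAST(amount AS REAL) * 100) AS INTEGER) AS cents "
        "FROM orders_raw"
    )


conn = sqlite3.connect(":memory:")
# Expose the Python helper to SQL so the ELT transform can run in the engine.
conn.create_function("pseudonymize", 1, pseudonymize)
etl(conn)
elt(conn)

query = "SELECT customer, SUM(cents) FROM {} GROUP BY customer ORDER BY customer"
etl_totals = conn.execute(query.format("orders_etl")).fetchall()
elt_totals = conn.execute(query.format("orders_elt")).fetchall()
raw_rows = conn.execute("SELECT COUNT(*) FROM orders_raw").fetchone()[0]
conn.close()

print(etl_totals == elt_totals, raw_rows)
```

Both paths yield the same curated totals, but only the ELT path keeps the untransformed rows (`orders_raw`) inside the destination, which is what allows the raw data to be transformed again later for a different use case. Note that, in this sketch, the ELT path also means the unmasked email addresses reside in the destination, so access to the raw table would need to be restricted.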
4.2. Data Lake Design Aspects for Enterprise Data Management
- Focus on business objectives rather than technology: By anchoring the business objectives, a data lake can prioritize the efforts and outcomes accordingly. For instance, for a particular business objective, there may be some data that are more valuable than others. This kind of comprehension and analysis is the key to an enterprise’s data lake success. With such an oriented goal, data lakes can start small and then accordingly learn, adapt, and produce accelerated outcomes for a business. In particular, some key factors in this regard are (1) whether it solves an actual business problem, (2) if it imparts new capabilities, and (3) the access or ownership of data, among others.
- Scalability and durability are two more major criteria [74]. Scalability enables scaling to any size of data while importing them in real time. This is an essential criterion for a data lake since it is a centralized data repository for an entire organization. Another important aspect (i.e., durability) deals with providing consistent uptime while ensuring no loss or corruption of data.
- Another key design aspect in a data lake is its capability to store unstructured, semi-structured, and structured data, which helps organizations to transfer anything from raw, unprocessed data to fully aggregated analytical outcomes [75]. In particular, the data lake has to deliver business-ready data. Practically speaking, data by themselves have no meaning. Although file formats and schemas can parse the data (e.g., JSON and XML), they fail at delivering insight into their meaning. To circumvent such a limitation, a critical component of any data lake technical design is the incorporation of a knowledge catalog. Such a catalog helps in finding and understanding information assets. The knowledge catalog’s contents include the semantic meaning of the data, format and ownership of data, and data policies, among other elements.
- Security considerations are also of prime importance in a data lake in the cloud. The three domains of security are encryption, network-level security, and access control. Network-level security imparts a robust defense strategy by denying inappropriate access at the network level, whereas encryption ensures security at least for those types of data that are not publicly available. Security should be part of data lake design from the beginning. Compliance standards that regulate data protection and privacy are incorporated in many industries, such as the Payment Card Industry Data Security Standard (PCI DSS) for financial services and Health Insurance Portability and Accountability Act (HIPAA) for healthcare [76]. Furthermore, two of the biggest regulations regarding consumer privacy (i.e., California’s Consumer Privacy Act (CCPA) and the European Union’s General Data Protection Regulation (GDPR)) restrict the ownership, use, and management of personal and private data.
- A data lake design must include metadata storage functionality to help users to search and learn about the data sets in the lake [77]. A data lake stores all data independently of any fixed schema. Instead, data are parsed and adapted into a schema at the time of processing, and only as necessary. This feature saves plenty of time for enterprises.
- Architecture in motion is another interesting concept (i.e., the architecture will likely include more than one data lake and must be adaptable to address changing requirements). For instance, on-premises work with Hadoop could be moved to the cloud or a hybrid platform in the future. By facilitating the innovation of multi-cloud storage, a data lake can be easily upgraded to be used across data centers, on premises, and in private clouds. In addition, machine learning and automation can augment the data flow capabilities of an enterprise’s data lake design.
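The idea of storing data without a fixed schema and applying one only at read time, mentioned in the list above, can be sketched as follows. The raw JSON objects, the `READING_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names introduced for illustration; real data lake engines implement this far more generally.

```python
import json

# Raw objects as they might sit in the lake: heterogeneous, with no schema
# enforced at write time. Illustrative data only.
RAW_OBJECTS = [
    '{"id": 1, "name": "pump", "temp": "71.2", "site": "A"}',
    '{"id": "2", "name": "valve"}',
    '{"id": 3, "temp": 64, "extra": {"vendor": "x"}}',
]

# Hypothetical reader-side schema: field name -> type converter.
READING_SCHEMA = {"id": int, "temp": float}


def read_with_schema(raw_objects, schema):
    """Parse each raw object and project it onto `schema` at read time.
    Missing fields become None; fields outside the schema are ignored."""
    for text in raw_objects:
        record = json.loads(text)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            row[field] = convert(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_OBJECTS, READING_SCHEMA))
print(rows)
```

A different consumer could read the same raw objects with a different schema (for example, one that keeps `name` and `site`), without any change to what is stored.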
5. Tools and Utilities
5.1. Popular Data Warehouse Tools and Services
- Amazon Web Services (AWS) data warehouse tools: AWS is one of the major leaders in data warehousing solutions [78] (https://aws.amazon.com/training/classroom/data-warehousing-on-aws/, accessed on 25 September 2022). AWS has many services, such as Amazon Redshift, Amazon S3, and Amazon RDS, making it a very cost-effective and highly scalable platform. Amazon Redshift is a suitable platform for businesses that require very advanced capabilities that exploit high-end tools [79]. It is best suited to organizations with an in-house team that can organize AWS’s extensive menu of services. Amazon Simple Storage Service (Amazon S3) is a low-cost storage solution with industry-leading scalability, performance, and security features. Amazon Relational Database Service (Amazon RDS) is an AWS cloud data storage service that runs and scales a relational database. It provides resizable and cost-effective capacity for an industry-standard relational database and manages common database administration activities.
- Google data warehouse tools: Google is highly acclaimed for its data management skills along with its dominance as a search engine (https://cloud.google.com, accessed on 25 September 2022). Google’s data warehouse tools (https://research.google/research-areas/data-management/, accessed on 25 September 2022) excel in cutting-edge data management and analytics by incorporating machine intelligence. Google BigQuery is an enterprise-level cloud data warehousing platform designed to save time when storing and querying large data sets: it runs fast SQL queries against multi-terabyte data sets in seconds, offering customers real-time data insights. Google Cloud Data Fusion is a fully managed cloud ETL solution that allows data integration at any size with a visual point-and-click interface. Dataflow is another cloud-based data-processing service that can be used to process data in batches or as real-time streams. Google Data Studio enables turning the data into entirely customizable, easy-to-read reports and dashboards.
- Microsoft Azure Data Warehouse tools: Microsoft Azure is a recent cloud computing platform that provides Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) as well as 200+ products and cloud services [80] (https://azure.microsoft.com/en-in/, accessed on 25 September 2022). Azure SQL Database is suitable for data warehousing applications with up to 8 TB of data volume and a large number of active users, facilitating advanced query processing. Azure Synapse Analytics combines data integration, big data analytics, and enterprise data warehousing capabilities and also integrates machine learning technologies.
- Oracle Autonomous Data Warehouse: Oracle Autonomous Data Warehouse [81] is a cloud-based data warehouse service that manages the complexities associated with data warehouse development, data protection, data application development, etc. The setting, safeguarding, regulating, and backing up of data are all automated using this technology. This cloud computing solution is easy to use, secure, quick to respond, as well as scalable.
- Snowflake: Snowflake [82] is a cloud-based data warehouse tool offering a quick, easy-to-use, and adaptable data warehouse platform (https://www.snowflake.com, accessed on 25 September 2022). It has a comprehensive Software as a Service (SaaS) architecture since it runs entirely in the cloud. This makes data processing easier by permitting users to work with a single language, SQL, for data blending, analysis, and transformations on a variety of data types. Snowflake’s multi-tenant design enables real-time data exchange throughout the enterprise without relocating data.
- IBM Data Warehouse tools: IBM is a preferred choice for large business clients due to its huge install base, vertical data models, various data management solutions, and real-time analytics (https://www.ibm.com/in-en/analytics, accessed on 25 September 2022). One DW tool (i.e., IBM DB2 Warehouse) is a cloud DW that enables self-scaling data storage and processing and deployment flexibility. Another tool is IBM Datastage, which can take data from a source system, transform it, and feed it into a target system. This enables the users to merge data from several corporate systems using either an on-premises or cloud-based parallel architecture.
5.2. Popular Data Lake Tools and Services
- Azure Data Lake: Azure Data Lake makes it easy for developers and data scientists to store data of any size, shape, and speed and conduct all types of processing and analytics across platforms and languages (https://azure.microsoft.com/en-in/solutions/data-lake/, accessed on 25 September 2022). It removes the complexities associated with ingesting and storing the data and makes it faster to bring up and execute with batch, streaming, and interactive analytics [85]. Some of the key features of Azure Data Lake include unlimited scale and data durability, on-par performance even with demanding workloads, high security with flexible mechanisms, and cost optimization through independent scaling of storage.
- AWS: Amazon Web Services claims to provide “the most secure, scalable, comprehensive, and cost-effective portfolio of services for customers to build their data lake in the cloud” (https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc, accessed on 25 September 2022). AWS Lake Formation helps to set up a secure data lake that can collect and catalog data from databases and object storage, move the data into the new Amazon Simple Storage Service (S3) data lake, and clean and classify the data using ML algorithms. It offers various aspects of scalability, agility, and flexibility that are required by the companies to fuse data and analytics approaches. AWS customers include NETFLIX, Zillow, NASDAQ, Yelp, and iRobot.
- Google BigLake: BigLake is a storage engine that unifies data warehouses and lakes (https://cloud.google.com/biglake, accessed on 25 September 2022). It removes the need to duplicate or move data, thus making the system efficient and cost-effective. BigLake provides detailed access controls and performance acceleration across BigQuery and multi-cloud data lakes, with open formats to ensure a unified, flexible, and cost-effective lakehouse architecture. The top features of BigLake include (1) users being able to enforce consistent access controls across most analytics engines with a single copy of data and (2) unified governance and management at scale. Users can extend BigQuery to multi-cloud data lakes and open formats with fine-grained security controls without setting up a new infrastructure.
- Cloudera: Cloudera SDX is a data lake service for creating safe, secure, and governed data lakes with protective rings around the data wherever they are stored, from object stores to the Hadoop Distributed File System (HDFS) (https://www.cloudera.com, accessed on 25 September 2022). It provides the capabilities needed for (1) data schema and metadata information, (2) metadata governance and management, (3) data access authorization and authentication, and (4) compliance-ready access auditing.
- Snowflake: Snowflake’s cross-cloud platform breaks down silos and enables a data lake strategy (https://www.snowflake.com/workloads/data-lake/, accessed on 25 September 2022). Data scientists, analysts, and developers can seamlessly leverage governed data self-service for a variety of workloads. The key features of Snowflake include (1) all data on one platform that combines structured, semi-structured, and unstructured data of any format across clouds and regions, (2) fast, reliable processing and querying, simplifying the architecture with an elastic engine to power many workloads, and (3) secure collaboration via easy integration of external data without ETL.
6. Challenges
6.1. Challenges in Big Data Analytics
6.2. Data Warehouse Implementation Challenges
6.3. Data Lake Implementation Challenges
7. Opportunities and Future Directions
7.1. Data Warehouses: Opportunities and Future Directions
- All the data are accessible from a single location;
- The capability to outsource the task of maintaining that service’s high availability to all customers;
- Governance based on policies;
- Platforms with high user experience (UX) discoverability;
- Platforms that cater to all customers.
7.2. Data Lakes: Opportunities and Future Directions
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Tsai, C.W.; Lai, C.F.; Chao, H.C.; Vasilakos, A.V. Big data analytics: A survey. J. Big Data 2015, 2, 21.
- Big Data—Statistics & Facts. Available online: https://www.statista.com/topics/1464/big-data/ (accessed on 27 October 2022).
- Wise, J. Big Data Statistics 2022: Facts, Market Size & Industry Growth. Available online: https://earthweb.com/big-data-statistics/ (accessed on 27 October 2022).
- Jain, A. The 5 V’s of Big Data. 2016. Available online: https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/ (accessed on 27 October 2022).
- Gandomi, A.; Haider, M. Beyond the hype: Big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144.
- Sun, Z.; Zou, H.; Strang, K. Big Data Analytics as a Service for Business Intelligence. In Open and Big Data Management and Innovation; Springer International Publishing: Cham, Switzerland, 2015; Volume 9373, pp. 200–211.
- Big Data and Analytics Services Global Market Report. Available online: https://www.reportlinker.com/p06246484/Big-Data-and-Analytics-Services-Global-Market-Report.html (accessed on 27 October 2022).
- BI & Analytics Software Market Value Worldwide 2019–2025. Available online: https://www.statista.com/statistics/590054/worldwide-business-analytics-software-vendor-market/ (accessed on 27 October 2022).
- Kumar, S. What Is a Data Repository and What Is it Used for? 2019. Available online: https://stealthbits.com/blog/what-is-a-data-repository-and-what-is-it-used-for/ (accessed on 27 October 2022).
- Khine, P.P.; Wang, Z.S. Data lake: A new, ideology in big data era. ITM Web Conf. 2018, 17, 03025.
- Arif, M.; Mujtaba, G. A Survey: Data Warehouse Architecture. Int. J. Hybrid Inf. Technol. 2015, 8, 349–356.
- El Aissi, M.E.M.; Benjelloun, S.; Loukili, Y.; Lakhrissi, Y.; Boushaki, A.E.; Chougrad, H.; Elhaj Ben Ali, S. Data Lake Versus Data Warehouse Architecture: A Comparative Study. In WITS 2020; Bennani, S., Lakhrissi, Y., Khaissidi, G., Mansouri, A., Khamlichi, Y., Eds.; Springer: Singapore, 2022; Volume 745, pp. 201–210.
- Rehman, K.U.U.; Ahmad, U.; Mahmood, S. A Comparative Analysis of Traditional and Cloud Data Warehouse. VAWKUM Trans. Comput. Sci. 2018, 6, 34–40.
- Devlin, B.A.; Murphy, P.T. An architecture for a business and information system. IBM Syst. J. 1988, 27, 60–80.
- Garani, G.; Chernov, A.; Savvas, I.; Butakova, M. A Data Warehouse Approach for Business Intelligence. In Proceedings of the 2019 IEEE 28th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), Napoli, Italy, 12–14 June 2019; pp. 70–75.
- Gupta, V.; Singh, J. A Review of Data Warehousing and Business Intelligence in different perspective. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 8263–8268.
- Sagiroglu, S.; Sinanc, D. Big data: A review. In Proceedings of the 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, CA, USA, 20–24 May 2013; pp. 42–47.
- Miloslavskaya, N.; Tolstoy, A. Application of Big Data, Fast Data, and Data Lake Concepts to Information Security Issues. In Proceedings of the 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), Vienna, Austria, 22–24 August 2016; pp. 148–153.
- Giebler, C.; Stach, C.; Schwarz, H.; Mitschang, B. BRAID—A Hybrid Processing Architecture for Big Data. In Proceedings of the 7th International Conference on Data Science, Technology and Applications, Porto, Portugal, 26–28 July 2018; pp. 294–301.
- Lin, J. The Lambda and the Kappa. IEEE Internet Comput. 2017, 21, 60–66.
- Devlin, B. Thirty Years of Data Warehousing—Part 1. 2020. Available online: https://www.irmconnects.com/thirty-years-of-data-warehousing-part-1/ (accessed on 27 October 2022).
- Inmon, W.H. Building the Data Warehouse, 4th ed.; Wiley Publishing: Indianapolis, IN, USA, 2005.
- Chandra, P.; Gupta, M.K. Comprehensive survey on data warehousing research. Int. J. Inf. Technol. 2018, 10, 217–224.
- Simões, D.M. Enterprise Data Warehouses: A conceptual framework for a successful implementation. In Proceedings of the Canadian Council for Small Business & Entrepreneurship Annual Conference, Calgary, AL, Canada, 28–30 October 2010.
- Al-Debei, M.M. Data Warehouse as a Backbone for Business Intelligence: Issues and Challenges. Eur. J. Econ. Financ. Adm. Sci. 2011, 33, 153–166.
- Report by Market Research Future (MRFR). Available online: https://finance.yahoo.com/news/data-warehouse-dwaas-market-predicted-153000649.html (accessed on 27 October 2022).
- Chaudhuri, S.; Dayal, U. An overview of data warehousing and OLAP technology. ACM Sigmod Rec. 1997, 26, 65–74.
- Codd, E.F.; Codd, S.B.; Salley, C.T. Providing OLAP to User-Analysts: An IT Mandate; Codd & Associates: Ladera Ranch, CA, USA, 1993; pp. 1–26.
- The Best Applications of Data Warehousing. 2020. Available online: https://datachannel.co/blogs/best-applications-of-data-warehousing/ (accessed on 27 October 2022).
- Hai, R.; Quix, C.; Jarke, M. Data lake concept and systems: A survey. arXiv 2021, arXiv:2106.09592.
- Zagan, E.; Danubianu, M. Data Lake Approaches: A Survey. In Proceedings of the 2020 International Conference on Development and Application Systems (DAS), Suceava, Romania, 21–23 May 2020; pp. 189–193.
- Cherradi, M.; El Haddadi, A. Data Lakes: A Survey Paper. In Innovations in Smart Cities Applications; Ben Ahmed, M., Boudhir, A.A., Karaș, R., Jain, V., Mellouli, S., Eds.; Lecture Notes in Networks and Systems; Springer International Publishing: Cham, Switzerland, 2022; Volume 5, pp. 823–835.
- Dixon, J. Pentaho, Hadoop, and Data Lakes. 2010. Available online: https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/ (accessed on 27 October 2022).
- King, T. The Emergence of Data Lake: Pros and Cons. 2016. Available online: https://solutionsreview.com/data-integration/the-emergence-of-data-lake-pros-and-cons/ (accessed on 27 October 2022).
- Alrehamy, H.; Walker, C. Personal Data Lake with Data Gravity Pull. In Proceedings of the IEEE Fifth International Conference on Big Data and Cloud Computing 2015, Beijing, China, 26–28 August 2015.
- Yang, Q.; Ge, M.; Helfert, M. Analysis of Data Warehouse Architectures: Modeling and Classification. In Proceedings of the 21st International Conference on Enterprise Information Systems, Heraklion, Greece, 3–5 May 2019; pp. 604–611. [Google Scholar]
- Yessad, L.; Labiod, A. Comparative study of data warehouses modeling approaches: Inmon, Kimball and Data Vault. In Proceedings of the 2016 International Conference on System Reliability and Science (ICSRS), Paris, France, 15–18 November 2016; pp. 95–99. [Google Scholar] [CrossRef]
- Oueslati, W.; Akaichi, J. A Survey on Data Warehouse Evolution. Int. J. Database Manag. Syst. 2010, 2, 11–24. [Google Scholar] [CrossRef]
- Ali, F.S.E. A Survey of Real-Time Data Warehouse and ETL. Int. J. Sci. Eng. Res. 2014, 5, 3–9. [Google Scholar]
- Aftab, U.; Siddiqui, G.F. Big Data Augmentation with Data Warehouse: A Survey. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 2785–2794. [Google Scholar] [CrossRef]
- Alsqour, M.; Matouk, K.; Owoc, M. A survey of data warehouse architectures—Preliminary results. In Proceedings of the Federated Conference on Computer Science and Information Systems, Wroclaw, Poland, 9–12 September 2012; pp. 1121–1126. [Google Scholar]
- Rizzi, S.; Abelló, A.; Lechtenbörger, J.; Trujillo, J. Research in data warehouse modeling and design: Dead or alive? In Proceedings of the 9th ACM international workshop on Data warehousing and OLAP, DOLAP ’06, Arlington, VA, USA, 10 November 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 3–10. [Google Scholar] [CrossRef]
- Maccioni, A.; Torlone, R. KAYAK: A Framework for Just-in-Time Data Preparation in a Data Lake. In Advanced Information Systems Engineering; Krogstie, J., Reijers, H.A., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; pp. 474–489. [Google Scholar] [CrossRef]
- Gao, Y.; Huang, S.; Parameswaran, A. Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; ACM: Houston, TX, USA, 2018; pp. 943–958. [Google Scholar] [CrossRef]
- Astriani, W.; Trisminingsih, R. Extraction, Transformation, and Loading (ETL) Module for Hotspot Spatial Data Warehouse Using Geokettle. Procedia Environ. Sci. 2016, 33, 626–634. [Google Scholar] [CrossRef] [Green Version]
- Halevy, A.Y.; Korn, F.; Noy, N.F.; Olston, C.; Polyzotis, N.; Roy, S.; Whang, S.E. Managing Google’s data lake: An overview of the Goods system. IEEE Data Eng. Bull. 2016, 39, 5–14. [Google Scholar]
- Dehne, F.; Robillard, D.; Rau-Chaplin, A.; Burke, N. VOLAP: A Scalable Distributed System for Real-Time OLAP with High Velocity Data. In Proceedings of the 2016 IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan, 13–15 September 2016; pp. 354–363. [Google Scholar] [CrossRef]
- Hurtado, C.A.; Gutierrez, C.; Mendelzon, A.O. Capturing summarizability with integrity constraints in OLAP. ACM Trans. Database Syst. 2005, 30, 854–886. [Google Scholar] [CrossRef] [Green Version]
- Farid, M.; Roatis, A.; Ilyas, I.F.; Hoffmann, H.F.; Chu, X. CLAMS: Bringing Quality to Data Lakes. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, San Francisco, CA, USA, 26 June–1 July 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 2089–2092. [Google Scholar] [CrossRef]
- Zhang, Y.; Ives, Z.G. Juneau: Data lake management for Jupyter. Proc. VLDB Endow. 2019, 12, 1902–1905. [Google Scholar] [CrossRef]
- Zhu, E.; Deng, D.; Nargesian, F.; Miller, R.J. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, Amsterdam, The Netherlands, 30 June–5 July 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 847–864. [Google Scholar] [CrossRef] [Green Version]
- Beheshti, A.; Benatallah, B.; Nouri, R.; Chhieng, V.M.; Xiong, H.; Zhao, X. CoreDB: A Data Lake Service. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, Singapore, 6–10 November 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 2451–2454. [Google Scholar] [CrossRef]
- Hai, R.; Geisler, S.; Quix, C. Constance: An Intelligent Data Lake System. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, San Francisco, CA, USA, 26 June–1 July 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 2097–2100. [Google Scholar] [CrossRef]
- Ahmed, A.S.; Salem, A.M.; Alhabibi, Y.A. Combining the Data Warehouse and Operational Data Store. In Proceedings of the Eighth International Conference on Enterprise Information Systems, Paphos, Cyprus, 23–27 May 2006; pp. 282–288. [Google Scholar] [CrossRef]
- Software Architecture: N Tier, 3 Tier, 1 Tier, 2 Tier Architecture. Available online: https://www.appsierra.com/blog/url (accessed on 27 October 2022).
- Han, S.W. Three-Tier Architecture for Sentinel Applications and Tools: Separating Presentation from Functionality. Ph.D. Thesis, University of Florida, Gainesville, FL, USA, 1997. [Google Scholar]
- What Is Three-Tier Architecture. Available online: https://www.ibm.com/in-en/cloud/learn/three-tier-architecture (accessed on 27 October 2022).
- Phaneendra, S.V.; Reddy, E.M. Big Data—Solutions for RDBMS Problems—A Survey. Int. J. Adv. Res. Comput. Commun. Eng. 2013, 2, 3686–3691. [Google Scholar]
- Simitsis, A.; Vassiliadis, P.; Sellis, T. Optimizing ETL processes in data warehouses. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; pp. 564–575. [Google Scholar] [CrossRef] [Green Version]
- Prasser, F.; Spengler, H.; Bild, R.; Eicher, J.; Kuhn, K.A. Privacy-enhancing ETL-processes for biomedical data. Int. J. Med. Inform. 2019, 126, 72–81. [Google Scholar] [CrossRef]
- Rousidis, D.; Garoufallou, E.; Balatsoukas, P.; Sicilia, M.A. Metadata for Big Data: A preliminary investigation of metadata quality issues in research data repositories. Inf. Serv. Use 2014, 34, 279–286. [Google Scholar] [CrossRef] [Green Version]
- Mailvaganam, H. Introduction to OLAP—Slice, Dice and Drill! 2007. Data Warehousing Review. Retrieved on 18 March 2008. Available online: https://web.archive.org/web/20180928201202/http://dwreview.com/OLAP/Introduction_OLAP.html (accessed on 25 September 2022).
- Pendse, N. What is OLAP? Available online: https://dssresources.com/papers/features/pendse04072002.htm (accessed on 27 October 2022).
- Xu, J.; Luo, Y.Q.; Zhou, X.X. Solution for Data Growth Problem of MOLAP. Appl. Mech. Mater. 2013, 321–324, 2551–2556. [Google Scholar] [CrossRef]
- Dehne, F.; Eavis, T.; Rau-Chaplin, A. Parallel multi-dimensional ROLAP indexing. In Proceedings of the CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, Tokyo, Japan, 12–15 May 2003; pp. 86–93. [Google Scholar] [CrossRef]
- Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA, 3–7 May 2010; pp. 1–10. [Google Scholar] [CrossRef]
- Luo, Z.; Niu, L.; Korukanti, V.; Sun, Y.; Basmanova, M.; He, Y.; Wang, B.; Agrawal, D.; Luo, H.; Tang, C.; et al. From Batch Processing to Real Time Analytics: Running Presto® at Scale. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 1598–1609. [Google Scholar] [CrossRef]
- Sethi, R.; Traverso, M.; Sundstrom, D.; Phillips, D.; Xie, W.; Sun, Y.; Yegitbasi, N.; Jin, H.; Hwang, E.; Shingte, N.; et al. Presto: SQL on Everything. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1802–1813. [Google Scholar] [CrossRef]
- Kinley, J. The Lambda Architecture: Principles for Architecting Realtime Big Data Systems. 2013. Available online: http://jameskinley.tumblr.com/post/37398560534/thelambda-architecture-principles-for (accessed on 27 October 2022).
- Ferrera Bertran, P. Lambda Architecture: A state-of-the-Art. Datasalt. 17 January 2014. Available online: https://github.com/pereferrera/trident-lambda-splout (accessed on 25 September 2022).
- Carbone, P.; Katsifodimos, A.; Ewen, S.; Markl, V.; Haridi, S.; Tzoumas, K. Apache Flink™: Stream and Batch Processing in a Single Engine. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 2015, 36, 28–38. [Google Scholar]
- Kreps, J. Questioning the Lambda Architecture. 2014. Available online: https://www.oreilly.com/radar/questioning-the-lambda-architecture/ (accessed on 27 October 2022).
- Data Vault vs Star Schema vs Third Normal Form: Which Data Model to Use? Available online: https://www.matillion.com/resources/blog/data-vault-vs-star-schema-vs-third-normal-form-which-data-model-to-use (accessed on 27 October 2022).
- Patranabish, D. Data Lakes: The New Enabler of Scalability in Cross Channel Analytics—Tech-Talk by Durjoy Patranabish | ET CIO. Available online: http://cio.economictimes.indiatimes.com/tech-talk/data-lakes-the-new-enabler-of-scalability-in-cross-channel-analytics/585 (accessed on 27 October 2022).
- Nargesian, F.; Zhu, E.; Miller, R.J.; Pu, K.Q.; Arocena, P.C. Data lake management: Challenges and opportunities. Proc. VLDB Endow. 2019, 12, 1986–1989. [Google Scholar] [CrossRef]
- A Brief Look at 4 Major Data Compliance Standards: GDPR, HIPAA, PCI DSS, CCPA. Available online: https://www.pentasecurity.com/blog/4-data-compliance-standards-gdpr-hipaa-pci-dss-ccpa/ (accessed on 27 October 2022).
- Sawadogo, P.; Darmont, J. On data lake architectures and metadata management. J. Intell. Inf. Syst. 2021, 56, 97–120. [Google Scholar] [CrossRef]
- Overview of Amazon Web Services: AWS Whitepaper. 2022. Available online: https://d1.awsstatic.com/whitepapers/aws-overview.pdf (accessed on 27 October 2022).
- Pandis, I. The evolution of Amazon redshift. Proc. VLDB Endow. 2021, 14, 3162–3174. [Google Scholar] [CrossRef]
- Microsoft Azure Documentation. Available online: http://azure.microsoft.com/en-us/documentation/ (accessed on 27 October 2022).
- Automate Your Data Warehouse. Available online: https://www.oracle.com/autonomous-database/autonomous-data-warehouse/ (accessed on 27 October 2022).
- Dageville, B.; Cruanes, T.; Zukowski, M.; Antonov, V.; Avanes, A.; Bock, J.; Claybaugh, J.; Engovatov, D.; Hentschel, M.; Huang, J.; et al. The Snowflake Elastic Data Warehouse. In Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; ACM: San Francisco, CA, USA, 2016; pp. 215–226. [Google Scholar] [CrossRef] [Green Version]
- Mathis, C. Data Lakes. Datenbank-Spektrum 2017, 17, 289–293. [Google Scholar] [CrossRef]
- Zagan, E.; Danubianu, M. Cloud DATA LAKE: The new trend of data storage. In Proceedings of the 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Online, 11–13 June 2021; IEEE: Ankara, Turkey, 2021; pp. 1–4. [Google Scholar] [CrossRef]
- Ramakrishnan, R.; Sridharan, B.; Douceur, J.R.; Kasturi, P.; Krishnamachari-Sampath, B.; Krishnamoorthy, K.; Li, P.; Manu, M.; Michaylov, S.; Ramos, R.; et al. Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, Chicago, IL, USA, 14–19 May 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 51–63. [Google Scholar] [CrossRef] [Green Version]
- Elgendy, N.; Elragal, A. Big Data Analytics: A Literature Review Paper. In Advances in Data Mining. Applications and Theoretical Aspects; Perner, P., Ed.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 214–227. [Google Scholar] [CrossRef]
- Jin, X.; Wah, B.W.; Cheng, X.; Wang, Y. Significance and Challenges of Big Data Research. Big Data Res. 2015, 2, 59–64. [Google Scholar] [CrossRef]
- Agrawal, R.; Nyamful, C. Challenges of big data storage and management. Glob. J. Inf. Technol. Emerg. Technol. 2016, 6, 1–10. [Google Scholar] [CrossRef]
- Padgavankar, M.H.; Gupta, S.R. Big Data Storage and Challenges. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 2218–2223. [Google Scholar]
- Kadadi, A.; Agrawal, R.; Nyamful, C.; Atiq, R. Challenges of data integration and interoperability in big data. In Proceedings of the 2014 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 27–30 October 2014; IEEE: Washington, DC, USA, 2014; pp. 38–40. [Google Scholar] [CrossRef]
- Best Data Integration Tools. Available online: https://www.peerspot.com/categories/data-integration-tools (accessed on 27 October 2022).
- Toshniwal, R.; Dastidar, K.G.; Nath, A. Big Data Security Issues and Challenges. Int. J. Innov. Res. Adv. Eng. 2014, 2, 15–20. [Google Scholar]
- Demchenko, Y.; Ngo, C.; de Laat, C.; Membrey, P.; Gordijenko, D. Big Security for Big Data: Addressing Security Challenges for the Big Data Infrastructure. In Secure Data Management; Jonker, W., Petković, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 76–94. [Google Scholar] [CrossRef]
- Chen, E.T. Implementation issues of enterprise data warehousing and business intelligence in the healthcare industry. Commun. IIMA 2012, 12, 3. [Google Scholar]
- Cuzzocrea, A.; Bellatreche, L.; Song, I.Y. Data warehousing and OLAP over big data: Current challenges and future research directions. In Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP, DOLAP ’13, San Francisco, CA, USA, 28 October 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 67–70. [Google Scholar] [CrossRef]
- Singh, R.; Singh, K. A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing. Int. J. Comput. Sci. Issues 2010, 7, 41. [Google Scholar]
- Longbottom, C.; Bamforth, R. Optimising the Data Warehouse. 2013. Available online: https://www.it-daily.net/downloads/WP_Optimising-the-data-warehouse.pdf (accessed on 27 October 2022).
- Santos, R.J.; Bernardino, J.; Vieira, M. A survey on data security in data warehousing: Issues, challenges and opportunities. In Proceedings of the 2011 IEEE EUROCON—International Conference on Computer as a Tool, Lisbon, Portugal, 27–29 April 2011; pp. 1–4. [Google Scholar] [CrossRef]
- Responsibilities of a Data Warehouse Governance Committee. Available online: https://docs.oracle.com/cd/E29633_01/CDMOG/GUID-7E43F311-4510-4F1E-A17E-693F94BD0EC7.htm (accessed on 28 October 2022).
- Gupta, S.; Giri, V. Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake, 1st ed.; Apress: Berkeley, CA, USA, 2018. [Google Scholar]
- Giebler, C.; Gröger, C.; Hoos, E.; Schwarz, H.; Mitschang, B. Leveraging the Data Lake: Current State and Challenges. In Big Data Analytics and Knowledge Discovery; Ordonez, C., Song, I.Y., Anderst-Kotsis, G., Tjoa, A.M., Khalil, I., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; pp. 179–188. [Google Scholar] [CrossRef]
- Lock, M. Maximizing Your Data Lake with a Cloud or Hybrid Approach. 2016. Available online: https://technology-signals.com/wp-content/uploads/download-manager-files/maximizingyourdatalake.pdf (accessed on 27 October 2022).
- Kumar, N. Cloud Data Warehouse Is the Future of Data Storage. 2020. Available online: https://www.sigmoid.com/blogs/cloud-data-warehouse-is-the-future-of-data-storage/ (accessed on 27 October 2022).
- Kahn, M.G.; Mui, J.Y.; Ames, M.J.; Yamsani, A.K.; Pozdeyev, N.; Rafaels, N.; Brooks, I.M. Migrating a research data warehouse to a public cloud: Challenges and opportunities. J. Am. Med. Inform. Assoc. 2022, 29, 592–600. [Google Scholar] [CrossRef]
- Mishra, N.; Lin, C.C.; Chang, H.T. A Cognitive Adopted Framework for IoT Big-Data Management and Knowledge Discovery Prospective. Int. J. Distrib. Sens. Netw. 2015, 2015, 1–12. [Google Scholar] [CrossRef]
- Alserafi, A.; Abelló, A.; Romero, O.; Calders, T. Keeping the Data Lake in Form: DS-kNN Datasets Categorization Using Proximity Mining. In Model and Data Engineering; Schewe, K.D., Singh, N.K., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; pp. 35–49. [Google Scholar] [CrossRef]
- Bogatu, A.; Fernandes, A.A.A.; Paton, N.W.; Konstantinou, N. Dataset Discovery in Data Lakes. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; IEEE: Dallas, TX, USA, 2020; pp. 709–720. [Google Scholar] [CrossRef]
- Armbrust, M.; Ghodsi, A.; Xin, R.; Zaharia, M. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. In Proceedings of the Conference on Innovative Data Systems Research, Virtual Event, 11–15 January 2021. [Google Scholar]
Parameters | Data Warehouse | Data Lake |
---|---|---|
Data | Stores only data related to business processes | Stores everything
Processing | Highly processed data | Mainly unprocessed (raw) data
Type of Data | Mostly structured, in tabular form | Unstructured, semi-structured, or structured
Task | Optimized for data retrieval | Shared data stewardship
Agility | Less agile, with a fixed configuration | Highly agile; can be configured and reconfigured as needed
Users | Widely used by business professionals and business analysts | Used by data scientists, data developers, and business analysts
Storage | Expensive storage that gives fast response times | Designed for low-cost storage
Security | Allows better control of the data | Offers less control
Schema | Schema-on-write (predefined schemas) | Schema-on-read (no predefined schemas)
Data Processing | Time-consuming to introduce new content | Fast ingestion of new data
Data Granularity | Data at the summary or aggregated level of detail | Data at a fine-grained (low) level of detail
Tools | Mostly commercial tools | Can use open-source tools such as Hadoop or MapReduce
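
The Schema row in the comparison above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not an implementation of any particular warehouse or lake product: the `SALES_SCHEMA` columns and the function names `write_to_warehouse`, `write_to_lake`, and `read_from_lake` are hypothetical, and plain Python lists stand in for the storage layers.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
SALES_SCHEMA = {"order_id": int, "amount": float}


def write_to_warehouse(table, record, schema=SALES_SCHEMA):
    """Schema-on-write: validate before storing and reject nonconforming records."""
    for column, expected_type in schema.items():
        if not isinstance(record.get(column), expected_type):
            raise ValueError(f"column {column!r} must be {expected_type.__name__}")
    # Store only the columns that the schema declares.
    table.append({column: record[column] for column in schema})


def write_to_lake(lake, raw_line):
    """Schema-on-read, write side: store the raw line untouched."""
    lake.append(raw_line)


def read_from_lake(lake, schema=SALES_SCHEMA):
    """Schema-on-read, read side: parse and apply the schema at query time."""
    rows = []
    for raw_line in lake:
        try:
            record = json.loads(raw_line)
            rows.append({column: cast(record[column]) for column, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            continue  # Skip records that do not fit this reader's schema.
    return rows
```

In this sketch the warehouse path pays the validation cost at load time, which matches the "time-consuming to introduce new content" entry, while the lake path accepts any line immediately and defers interpretation to each reader, which matches the "fast ingestion" entry.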
Topic | Survey Papers | Contributions |
---|---|---|
Data warehouse | [28] | Data warehouse concepts, multilingualism issues in data warehouse design and solutions |
Data warehouse | [36] | Data warehouse architecture modeling and classifications |
Data warehouse and big data | [40] | A comprehensive survey on big data, big data analytics, augmentation, and big data warehouses |
Data warehouse | [11] | Data warehouse survey |
Data warehouse | [39] | Real-time data warehouse and ETL |
Data warehouse | [41] | Architectures of data warehouses (DWs) and their selection |
Data warehouse | [38] | Data warehouse (DW) evolution |
Data warehouse | [42] | Data warehouse modeling and design |
Data warehouse | [37] | Comparative study on data warehouse architectures |
Data lake | [30] | A survey on designing, implementing, and applying data lakes |
Data lake | [31] | Recent approaches and architectures using data lakes |
Data lake | [32] | Overview of data lake definitions, architectures, and technologies |
Data lake vs. data warehouse | [12] | Explores the two architectures of data warehouses and data lakes |

Systems or Topic Area | Data Warehouse | Data Lake | Function or Work Performed | Reference |
---|---|---|---|---|
OLAP | ✓ | | Online analytical processing (OLAP) | Providing OLAP to User-Analysts: An IT Mandate [28] |
GEMMS | | ✓ | Metadata extraction, metadata modeling | Metadata Extraction and Management in Data Lakes with GEMMS [30] |
KAYAK | | ✓ | Dataset preparation and organization | KAYAK: A Framework for Just-in-Time Data Preparation in a Data Lake [43] |
DWHA | ✓ | | Modeling and classification of DW | Analysis of Data Warehouse Architectures: Modeling and Classification [36] |
DATAMARAN | | ✓ | Metadata extraction | Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets [44] |
Geokettle | ✓ | | Data warehouse architecture, design, and testing | Extraction, Transformation, and Loading (ETL) Module for Hotspot Spatial Data Warehouse Using Geokettle [45] |
GOODS | | ✓ | Dataset preparation and organization, metadata enrichment | Managing Google’s data lake: An overview of the Goods system [46] |
VOLAP | ✓ | | OLAP, query processing, and optimization | VOLAP: A Scalable Distributed System for Real-Time OLAP with High Velocity Data [47] |
Dimension constraints | ✓ | | Multidimensional data modeling, OLAP, query processing, and optimization | Capturing summarizability with integrity constraints in OLAP [48] |
CLAMS | | ✓ | Data quality improvement | CLAMS: Bringing Quality to Data Lakes [49] |
Juneau | | ✓ | Dataset preparation and organization, discovery of related data sets, and query-driven data discovery | Juneau: Data Lake Management for Jupyter [50] |
JOSIE | | ✓ | Discovery of related data sets and query-driven data discovery | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes [51] |
CoreDB | | ✓ | Metadata enrichment and querying of heterogeneous data | CoreDB: A Data Lake Service [52] |
Constance | | ✓ | Unified interface for query processing and data exploration | Constance: An Intelligent Data Lake System [53] |
ODS | ✓ | | Operational data store | Combining the Data Warehouse and Operational Data Store [54] |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Nambiar, A.; Mundra, D. An Overview of Data Warehouse and Data Lake in Modern Enterprise Data Management. Big Data Cogn. Comput. 2022, 6, 132. https://doi.org/10.3390/bdcc6040132