1. Introduction
Over the past several years, the exponential growth of data has prompted research into new paradigms and tools [1]. With data volumes escalating and NoSQL technology maturing, traditional relational data warehouses are undergoing a profound transformation, ceding ground to NoSQL-based counterparts equipped with Online Analytical Processing (OLAP) capabilities [2]. This transformation marks a departure from the conventional paradigm in which data warehouses adhered to the transactional guarantees of Atomicity, Consistency, Isolation, and Durability (ACID) to ensure integrity. Instead, NoSQL-based data warehouses embrace an entirely different philosophy, trading strict consistency for scalability and availability.
The exponential increase in data volume, variety, and velocity has triggered the development of innovative data technologies in modern information systems. This evolution is driven by the need to address the challenges presented by massive data sets and the dynamic nature of information in today's environment [3]. As organizations increasingly adopt advanced data storage and processing solutions, there is a growing demand for a deeper understanding of the methodologies, challenges, and advancements in this integration [4]. This study sought to explore these evolving dynamics, adding valuable insights to the ongoing discussion of integrating OLAP and NoSQL in the realm of Big Data analytics.
OLAP involves dealing with fact tables and dimensions, which are fundamental components of data warehousing (DW). Fact tables contain numerical data (facts) that represent business transactions or events, while dimension tables provide context and descriptive information about the data in the fact tables [5].
Integrating OLAP and NoSQL presents a significant challenge. Unlike relational databases, NoSQL databases do not adhere to a rigid, predefined schema, which complicates the implementation of fact tables and dimensions due to unexpected changes in the database structure. Another major obstacle is automating all stages of the Extract, Transform, Load (ETL) process because of the dynamic nature of the schema [5].
To address the increasing demand in this area and the lack of up-to-date systematic mapping on the subject, this study aims to consolidate the most relevant advancements made by different authors in the realm of OLAP and NoSQL. The primary objective is to present scientific evidence on methods proposed for constructing OLAP cubes from NoSQL databases. A notable challenge in this research was to identify studies that offered solutions pertinent to the characterization of Big Data, especially in terms of volume, variety, and velocity [6]. While numerous initiatives focus on deploying OLAP cubes over data warehouses, traditional methods are unsuitable for handling Big Data, presenting a prominent challenge in the research community.
This study collected articles from five digital scientific libraries, resulting in 1649 articles that were individually analyzed, from which 25 studies matching the research question were selected. Seven dimensions of analysis were defined, focusing on the NoSQL databases most used with OLAP, the types of OLAP systems, the methods for processing data and building OLAP cubes, and the computational resources required to implement the proposed methods.
The primary objective of this research is to explore the recent advancements in integrating OLAP with various types of NoSQL databases within the context of Big Data environments. The study endeavored to offer a comprehensive roadmap for organizations navigating the integration of OLAP with NoSQL, with a specific focus on addressing the escalating demands of data analytics in modern business environments and understanding the associated operational costs.
This work holds significance as the scientific community grapples with challenges in implementing OLAP on Big Data, such as handling vast data volumes and dealing with complex, multidimensional data models [5]. By consolidating the most relevant contributions, this research facilitates academia and industry in addressing these challenges, and it guides future research efforts.
The remainder of this paper is structured as follows. The Theoretical Background section offers a concise explanation of the key concepts needed to understand the rest of this paper. The Related Work section summarizes previous research efforts in the field. The Materials and Methods section provides comprehensive details of the process followed to conduct the systematic mapping, adhering to the guidelines proposed by Kitchenham for a systematic literature review in software engineering [7]. The Results section highlights the main findings and addresses the research questions one by one. The paper concludes with a summary of the study's findings in the Discussion section, and it outlines possible future research directions in the Conclusion section.
5. Results
Researchers have dedicated their efforts to delving into the scientific realm of analyzing novel, unstructured data types using OLAP methodologies. This scientific pursuit is spurred by two fundamental factors: the keen interest of corporations in broadening their analytical scope to encompass these emerging data types, particularly unstructured data, and the vast potential offered by NoSQL databases. A crucial aspect driving this exploration is the imperative need for systems capable of executing both real-time and batch analyses, ultimately enhancing the decision-making process [26,36].
Through empirical experimentation, it has been demonstrated that employing OLAP on NoSQL platforms can yield superior performance in terms of query execution times compared to traditional DW solutions. Furthermore, NoSQL presents a simpler configuration approach to managing vast data sets. For instance, a notable study [40] underscores that, while conventional data warehouses necessitate a model rebuild for new queries, NoSQL platforms require only the generation of concise code. Building upon these scientific insights, this section elucidates the seven pivotal discoveries that directly address the core research question of this study.
5.1. R1—Types of OLAP Systems Proposed to Integrate OLAP with NoSQL Databases in Big Data Environments
The systematic mapping conducted in this study revealed a comprehensive spectrum of OLAP proposals, showcasing the diversity in conceptualizations. The distribution of seven types of OLAP systems, along with the respective number of proposals identified, is presented in Figure 2 and detailed below.
ROLAP: Eight proposals were identified, representing approximately 32% of the studies, that involved mapping data from NoSQL databases into a relational structure that can be queried using traditional SQL-based OLAP techniques [26,27,28,30,32,34,40,41]. The process involves creating a virtual layer or schema on top of the NoSQL data and transforming it into a relational format. This allows analysts and applications to perform complex queries, aggregations, and analyses using SQL-like expressions, similar to how they would with traditional relational databases. ROLAP provides a bridge between the flexible and scalable storage capabilities of NoSQL databases and the analytical requirements of OLAP.
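To make this bridging idea concrete, the following minimal Python sketch flattens hypothetical document-oriented records into a relational view and queries it with SQL; SQLite stands in for the virtual relational layer, and the field names are illustrative rather than taken from any of the reviewed proposals.

```python
import sqlite3

# Hypothetical NoSQL documents (nested, schema-flexible).
documents = [
    {"_id": 1, "amount": 120.0, "store": {"city": "Lyon", "country": "FR"}},
    {"_id": 2, "amount": 80.5,  "store": {"city": "Lille", "country": "FR"}},
    {"_id": 3, "amount": 200.0, "store": {"city": "Turin", "country": "IT"}},
]

# Virtual relational layer: flatten each document into a fixed row shape.
rows = [(d["_id"], d["amount"], d["store"]["city"], d["store"]["country"])
        for d in documents]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (id INTEGER, amount REAL, city TEXT, country TEXT)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)

# SQL-based OLAP-style aggregation over the flattened view.
for country, total in conn.execute(
        "SELECT country, SUM(amount) FROM sales_fact GROUP BY country"):
    print(country, total)
```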
MOLAP: Seven proposals, representing 28% of the total studies, adapted a multidimensional data model to fit the flexible schema of NoSQL systems [23,24,34,35,38,41,43]. Initially, a multidimensional data model was defined to capture relevant dimensions, hierarchies, and measures for analytical purposes. The data model was then adapted to align with the flexible schema of NoSQL databases, accommodating diverse data structures. Subsequently, data were loaded into the NoSQL database, and multidimensional cubes were constructed to pre-aggregate and summarize data along different dimensions.
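As an illustration of this MOLAP pipeline, the following sketch pre-aggregates a measure along every dimension combination (the cuboid lattice); the records, dimensions, and measure are hypothetical, and a real implementation would persist the cells in the NoSQL store rather than in memory.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical records loaded from a NoSQL store (dimensions + one measure).
records = [
    {"year": 2023, "region": "EU", "product": "A", "sales": 10},
    {"year": 2023, "region": "EU", "product": "B", "sales": 7},
    {"year": 2024, "region": "US", "product": "A", "sales": 5},
]
dimensions = ("year", "region", "product")

# Pre-aggregate the measure for every subset of dimensions (2^n cuboids).
cube = defaultdict(float)
for r in records:
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            key = (dims, tuple(r[d] for d in dims))
            cube[key] += r["sales"]

# Example cell lookups: one detailed cuboid and the fully aggregated apex.
print(cube[(("year", "region"), (2023, "EU"))])  # 17.0
print(cube[((), ())])                            # 22.0 (grand total)
```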
SOLAP: Two proposals, which account for 8% of the total studies, indicated a niche focus on incorporating spatial elements into OLAP systems, allowing for multidimensional analysis with spatial considerations [37,42]. First, a spatial data model is defined, encompassing geographical dimensions, measures, and hierarchies relevant to analytical needs. The spatial data model is adapted to align with the schema-less nature of NoSQL databases, which can handle diverse spatial data types. Data are then loaded into the NoSQL database, ensuring compatibility with the spatial data model. Spatial indexing techniques within NoSQL are employed to optimize spatial queries and analyses. Multidimensional cubes, which incorporate spatial dimensions, are constructed, facilitating efficient OLAP operations.
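The following sketch illustrates the general idea with a hypothetical spatial dimension; a coarse coordinate grid stands in for the spatial indexing techniques mentioned above, which in practice would be provided by the NoSQL engine.

```python
from collections import defaultdict

# Hypothetical facts with a spatial dimension (longitude, latitude).
facts = [
    {"lon": 2.35, "lat": 48.85, "pollution": 41.0},  # Paris area
    {"lon": 2.29, "lat": 48.86, "pollution": 38.5},
    {"lon": 4.83, "lat": 45.76, "pollution": 52.0},  # Lyon area
]

CELL = 1.0  # grid resolution in degrees; a stand-in for a real spatial index

def grid_key(lon, lat):
    """Map a point to a coarse grid cell, emulating spatial bucketing."""
    return (int(lon // CELL), int(lat // CELL))

# Spatially indexed aggregation: average measure per grid cell.
sums = defaultdict(lambda: [0.0, 0])
for f in facts:
    k = grid_key(f["lon"], f["lat"])
    sums[k][0] += f["pollution"]
    sums[k][1] += 1

for cell, (total, n) in sums.items():
    print(cell, total / n)
```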
Columnar OLAP (C-OLAP): One proposal [44], accounting for 4% of the 25 proposals, introduced the concept of C-OLAP, suggesting a novel approach to implementing OLAP using columnar databases. C-OLAP involves transforming the multidimensional conceptual model used as the basis of data warehouses and OLAP applications to a target columnar logical model. This transformation allows for the efficient storage and retrieval of data for OLAP operations.
Furthermore, C-OLAP introduces specific OLAP operators [44], such as Map-Reduce Columnar Cube (MRC-Cube) and Spark Columnar Cube (SC-Cube), which leverage technologies like Hadoop MapReduce and Apache Spark to compute OLAP cubes. The MRC-Cube operator works in multiple stages to compute the lattice of cuboids sequentially. It involves the extraction of data from the column family data warehouse and the performance of multiple joins between the fact and its dimensions. The operator leverages the MapReduce paradigm to efficiently process and aggregate data for OLAP cube construction.
The SC-Cube operator works in multiple stages to compute the OLAP cube [44]. SC-Cube involves reading input data from the columnar database and converting each row to a key–value pair Resilient Distributed Dataset (RDD) in which the key is the row key and the value is a nested map structure that associates a given column family and column name to a value. The operator then applies transformations to fetch only the columns that compose the cube, and it generates a new pair RDD in which the key is a combination of all the dimensions involved in the cube and the value is the measure to be aggregated.
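The following PySpark sketch approximates the pair-RDD shape described for SC-Cube, with hypothetical row keys, column families, and a single measure; it computes one cuboid rather than the full lattice.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "sc-cube-sketch")

# Hypothetical columnar rows: (row_key, {column_family: {column: value}}).
rows = sc.parallelize([
    ("r1", {"dim": {"year": 2023, "city": "Lyon"},  "meas": {"amount": 10.0}}),
    ("r2", {"dim": {"year": 2023, "city": "Lyon"},  "meas": {"amount": 4.0}}),
    ("r3", {"dim": {"year": 2024, "city": "Paris"}, "meas": {"amount": 7.0}}),
])

# Keep only the columns that compose the cube, re-keying each record by the
# combination of its dimension values, with the measure as the pair value.
pairs = rows.map(lambda kv: (
    (kv[1]["dim"]["year"], kv[1]["dim"]["city"]),  # dimension combination
    kv[1]["meas"]["amount"],                       # measure to aggregate
))

# Aggregate the measure per dimension combination (one cuboid of the cube).
for key, total in pairs.reduceByKey(lambda a, b: a + b).collect():
    print(key, total)

sc.stop()
```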
Hadoop OLAP (Ha-OLAP): One proposal [29], accounting for 4% of the total studies, introduced Ha-OLAP, which adopts a simplified, multidimensional model to map dimensions and measures, and it uses dimension coding and traversing algorithms to achieve roll-up operations over dimension hierarchies. Ha-OLAP also employs partition and linearization algorithms to store data and chunk selection strategies in order to filter data. The system architecture of Ha-OLAP includes a Hadoop cluster, a metadata server, a job node, an OLAP service facade, and an OLAP client.
Real-Time OLAP (RTOLAP): One proposal [25], representing 4% of the total studies, addressed real-time OLAP. In contrast to the other findings, this represents the sole proposal centered specifically on real-time data processing.
RTOLAP refers to the capability of performing OLAP queries in real time with column-oriented databases. In the context of the R-Store system, RTOLAP involves accessing the latest value preceding the submission time of the query for each key, and it aims to provide real-time analytics to make effective and timely decisions.
R-Store is a scalable, distributed system designed to support real-time OLAP queries by extending the MapReduce framework and utilizing HBase as the underlying storage system. It maintains a real-time data cube and implements incremental scanning to efficiently process real-time queries, ensuring the freshness of answers and low processing latency.
Evolutionary OLAP (Ev-OLAP): One proposal, which accounted for 4% of the total studies, introduced Evolutionary OLAP, suggesting an approach that adapts OLAP systems to graph databases [36]. Ev-OLAP was defined by the authors as a single graph with a set of nodes, a set of edges, and a set of labels of all nodes and edges. The model implements Ian Robinson's approach of separating the structure from the state, allowing the independent versioning of the graph topology and the state. The graph distinguishes between identity nodes, structural edges, state nodes, and state edges. Ev-OLAP introduces hypernodes, representing meta-nodes containing identity nodes with all connected state nodes. Hierarchical edges associate the levels of an abstracted hierarchy as identity nodes.
The model’s mechanics enable the addition and modification of a graph structure without deletion. For added entities, new state nodes that represent changes over time are introduced. Deletion is treated similarly to modification, with the new state of the entity stored in the graph. The evolution-awareness feature facilitates the analysis of changes in both the state of the graph and its structure over time.
In terms of OLAP awareness, the paper presents core data warehouse terms for Ev-OLAP. Dimensions are represented by node labels, and measures are divided into informational (numeric metrics in attributes or relationships) and topological (graph structure analysis metrics) categories. Facts in Ev-OLAP are drawn from the graph as events modeled as either nodes or relationships. Hierarchies are created using a special type of relationship called hierarchical edges, indicating the start of a hierarchy path and the direction of an increasing detail level. The proposed model handles analytical queries, addressing issues like slowly changing dimensions in traditional data warehouses. Ev-OLAP preserves historical data, allowing for both old and new data to coexist with information on validity periods.
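The following minimal sketch illustrates the separation of structure from state that underpins Ev-OLAP, using hypothetical entities; modifications and deletions only ever append new time-stamped state nodes, so the history remains queryable.

```python
import itertools
import datetime

_ids = itertools.count()

class IdentityNode:
    """Stable identity; its mutable attributes live in state nodes."""
    def __init__(self, label):
        self.id, self.label = next(_ids), label
        self.states = []  # (valid_from, attributes), appended chronologically

    def set_state(self, attrs, when):
        """Modification (and deletion) only ever appends a new state."""
        self.states.append((when, attrs))

    def state_at(self, when):
        """Evolution-aware lookup: latest state valid at a given instant."""
        valid = [s for s in self.states if s[0] <= when]
        return valid[-1][1] if valid else None

# Hypothetical usage: a customer entity whose city changes over time.
utc = datetime.timezone.utc
customer = IdentityNode("Customer")
customer.set_state({"name": "ACME", "city": "Lyon"},
                   when=datetime.datetime(2023, 1, 1, tzinfo=utc))
customer.set_state({"name": "ACME", "city": "Paris"},
                   when=datetime.datetime(2024, 6, 1, tzinfo=utc))
print(customer.state_at(datetime.datetime(2023, 7, 1, tzinfo=utc)))
# -> {'name': 'ACME', 'city': 'Lyon'}: the old state coexists with the new one
```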
Additionally, there were six instances in which the type of OLAP was not explicitly categorized, labeled as Not Available (NA) in Figure 2. This detailed breakdown provides a nuanced understanding of the diverse OLAP approaches identified in the literature.
Figure 3 provides a comprehensive overview of how studies are distributed in the realm of OLAP analysis across various types of NoSQL databases. In terms of popularity, ROLAP is the most studied approach in column-oriented and document-oriented databases, and it has also been studied in key–value databases. MOLAP is likewise widely researched for column-oriented databases. In key–value databases, MOLAP emerges as the most proposed approach, followed by ROLAP, indicating the versatility of these analysis models across different data storage types. MOLAP has also been proposed in graph databases.
It is interesting to note that SOLAP, RTOLAP, and C-OLAP have a more limited representation compared to ROLAP and MOLAP. However, implementations of SOLAP have been proposed in both column and document-oriented databases, suggesting a growing interest in spatial analysis in these contexts. On the other hand, RTOLAP and C-OLAP are more related to column-oriented databases, while Ev-OLAP stands out in the realm of graph databases.
5.2. R2—The Most Common Types of NoSQL Databases Used with OLAP
The results presented in Figure 4 indicate that column-oriented databases are the most commonly proposed, with 13 articles (52%) presenting OLAP integration proposals. Document-oriented and graph databases follow closely, with five articles (20%) and four articles (16%), respectively, outlining approaches to combining OLAP functionality with their respective NoSQL types. Key–value stores and a few instances categorized as NA were also explored, with two articles (8%) each. These findings highlight the diverse landscape of NoSQL databases utilized in conjunction with OLAP, showcasing the adaptability of OLAP across different NoSQL data models.
The analysis of the most commonly used NoSQL databases in conjunction with OLAP reveals specific Database Management Systems (DBMSs) associated with each NoSQL type, as shown in Figure 5. In the column-oriented category, HBase stands out as the predominant choice, featured in eleven articles (44%) [22,23,25,30,31,33,34,38,40,42,44], while Cassandra [35] and NA [27] were each mentioned in one article (4%). Document-oriented databases were notably represented by MongoDB in four articles (16%) [26,30,35,37], with one article marked as NA [45]. For graph databases, Neo4J emerged as a prominent choice, mentioned in all four relevant articles (16%) [36,39,43,46]. Key–value stores exhibited two proposals (8%), with HBase [24] and Oracle NoSQL Database [41]. Although HBase is not typically categorized as a key–value NoSQL database, the authors presented their proposal within this framework as a key–value solution.
These detailed insights into the specific DBMS associated with each NoSQL type provide a comprehensive understanding of the diverse platforms integrated with OLAP for various analytical purposes. This information can help select the best combination according to the context of use.
5.3. R3—The Most Prevalent Methods for Modeling OLAP Data Cubes
The results of the study, summarized in Figure 6, reveal that the star method is the most prevalently reported, with 13 occurrences (52%) [24,30,31,32,33,35,36,38,39,41,43,44,45]. Following behind is the snowflake method, with six instances (24%) identified [22,26,27,35,40,43]. Additionally, the flat method, characterized by denormalized data storage in a single table, was found in two instances (8%) [32,35]. Less commonly encountered were the galaxy [46] and geo-cube [42] methods, each appearing once in the data set (4% of the total). NA was recorded for six articles (24%) [23,25,28,29,34,37], indicating either a lack of information or an inability to categorize according to the predefined methods.
The analysis of Figure 7 reveals that the star method emerged as the most commonly employed approach across all data types, with six instances (24%) reported in the column category, three instances (12%) in the graph category, and four instances (16%) in the document category. This suggests a widespread adoption of the star method to model OLAP data cubes across different types of data. Conversely, the snowflake method is less prevalent overall, with four occurrences (16%) in the column category, one (4%) in the graph category, and two (8%) in the document category. This indicates a lower frequency of use for the snowflake method compared to the star method across the different data types analyzed. The flat method was observed in only one instance (4%) within the column category and two instances (8%) within the document category, suggesting a less common but still present usage for certain types of data. Additionally, the geo-cube and galaxy methods exhibited minimal usage, each appearing only once (4%) in the column and graph categories, respectively.
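To illustrate how these modeling methods translate to a schema-flexible store, the following sketch contrasts star, snowflake, and flat representations of the same hypothetical fact in a document-oriented layout; the field names are illustrative only.

```python
# Star method: one fact document referencing separate dimension documents.
dim_date    = {"_id": "d1", "year": 2024, "month": 3}
dim_product = {"_id": "p1", "name": "Widget", "category": "Tools"}
fact_star   = {"date_id": "d1", "product_id": "p1", "amount": 99.0}

# Snowflake method: dimensions are normalized further into sub-documents
# (here, the product category becomes its own referenced document).
dim_category   = {"_id": "c1", "category": "Tools"}
dim_product_sf = {"_id": "p1", "name": "Widget", "category_id": "c1"}

# Flat method: a single denormalized record holding the fact and all
# dimension attributes, avoiding joins at the cost of redundancy.
fact_flat = {"year": 2024, "month": 3, "name": "Widget",
             "category": "Tools", "amount": 99.0}

print(fact_star)
print(fact_flat)
```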
5.4. R4—The Proposed Structures of OLAP Data Cubes
In the proposals, the structure of OLAP data cubes follows a typical multidimensional format, consisting of dimensions, measures, and cells. Dimensions serve as axes and encompass various attributes, while each dimension is organized into hierarchies, facilitating granularity levels. Measures represent the numerical data under analysis or aggregation. Cells, located at the intersection points of dimensions and measures within the data cube, hold the aggregated values. Furthermore, OLAP data cubes are often structured using fact tables and dimension tables to efficiently organize and store data, with fact tables containing detailed transactional data and dimension tables providing descriptive attributes for analysis.
Another significant discovery from the review of the selected studies is the set of novel proposals for cube operators, including Columnar NoSQL Cube (CN-Cube), Key–Value Cube (KV-Cube), a graphoid model, Map Reduce Cube (MR-Cube), Spark Cube (SC-Cube), Map Reduce Columnar Cube (MRC-Cube), and MC-Cube. The abbreviations MRC-Cube and MC-Cube were used in different works to refer to the same concept, the Map Reduce Columnar Cube, although the proposed methods differ in each case. These operators showcase a wide range of approaches and methodologies, highlighting the novelty within the realm of cube operations in OLAP systems over NoSQL. Furthermore, algorithms such as MRLevel and MRPipeLevel have been proposed for the efficient computation of level-based, top-down data cubes.
CN-Cube [23] is an aggregation operator designed for column-oriented NoSQL database management systems. It allows OLAP cubes to be computed using column-oriented NoSQL data warehouses with a view based on the attributes (dimensions and measures) needed to compute the OLAP cube. The operator uses value positions and hash tables to take into account all dimension combinations and extract data that satisfy the predicates of the query, thus producing the cells and measures needed for OLAP cube computation. It has been implemented using the Phoenix SQL interface of the HBase DBMS, and it has been shown to have OLAP cube computation times very suitable for NoSQL warehouses.
CN-Cube follows a series of steps, starting with data extraction and grouping, in which a query is initiated to extract data meeting specific conditions and is then grouped based on dimensions. These grouped data form an intermediate result relation denoted as R, which contains dimensions and measures for aggregation. Each dimension in R is hashed to create position lists indicating a presence or an absence, and the logical AND function is used to find the intersection of these lists, providing sets of positions that represent combined dimension values for aggregation. These steps efficiently create the data cube, enabling total and partial aggregations at various levels of granularity.
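A minimal sketch of this mechanism is shown below, with integers used as position bitmaps so that the logical AND over position lists is literal; the columns and values are hypothetical.

```python
from collections import defaultdict

# Intermediate relation R: dimension columns and a measure, by row position.
city  = ["Lyon", "Lyon", "Paris", "Lyon"]
year  = [2023,   2024,   2023,    2023]
sales = [10.0,   4.0,    7.0,     2.0]

def position_bitmaps(column):
    """Hash each dimension value to a bitmap of the positions holding it."""
    bitmaps = defaultdict(int)
    for pos, value in enumerate(column):
        bitmaps[value] |= 1 << pos
    return bitmaps

city_maps, year_maps = position_bitmaps(city), position_bitmaps(year)

# Logical AND of the position lists yields the ("Lyon", 2023) cell.
cell = city_maps["Lyon"] & year_maps[2023]
measure = sum(sales[pos] for pos in range(len(sales)) if cell >> pos & 1)
print(measure)  # 12.0 -> aggregated value of the ("Lyon", 2023) cell
```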
The experimental phase of the proposed model involved two main experiments to assess the performance of the CN-Cube operator in OLAP cube computation times. The first experiment evaluated cubes with two to five dimensions using a data set of 60 million records. Four OLAP cube computation queries were performed, showing that the CN-Cube operator performed better than the traditional Cube operator as the number of dimensions increased, resulting in faster computation times.
The second experiment focused on evaluating the scalability of the CN-Cube operator within a multi-node cluster. It assessed execution times for a three-dimensional OLAP cube across various configurations and data sample sizes (100 GB, 500 GB, and 1 TB), including single-node, five-node, ten-node, and fifteen-node clusters. The results showed that increasing the number of nodes led to decreased computation times for OLAP cubes of different warehouse sizes, particularly for larger data warehouses. This scaling effect highlighted significant reductions in computation times with more cluster nodes, offering valuable insights into the efficiency of the CN-Cube operator in computing OLAP cubes within column-oriented NoSQL data warehouses.
KV-Cube [41] is structured with dimensions, cells, and measurements. The structure involves the use of the Bit-Encoded Sparse Storage (BESS) technique to store dimensions and measurements, allowing for the efficient computation of OLAP cubes. Additionally, KV-Cube is designed to support basic OLAP operations, and it is implemented using a key–value data model.
BESS assigns a binary index to each dimension member, minimizing bits for optimal storage, and it concatenates these binary representations to form cuboid indexes that store corresponding values. Retrieving data involves using bit mask operations to extract dimension indexes, enabling quick access to desired information. This integration ensures that KV-Cube represents multidimensional data structures with minimal storage, which is ideal for large-scale OLAP operations and supporting rapid data retrieval, the seamless execution of basic OLAP operations, and effective multidimensional data analysis and decision-making processes.
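The following sketch illustrates BESS-style encoding under simplified assumptions (hypothetical dimension member lists and a single measure); per-dimension binary indexes are concatenated into one cuboid key, and bit masks recover the dimension indexes on retrieval.

```python
import math

# Hypothetical dimension member lists (order defines each member's index).
years  = [2022, 2023, 2024]   # needs 2 bits
cities = ["Lyon", "Paris"]    # needs 1 bit

def bits_for(members):
    """Minimal number of bits to index the members of one dimension."""
    return max(1, math.ceil(math.log2(len(members))))

CITY_BITS = bits_for(cities)

def encode(year, city):
    """Concatenate per-dimension binary indexes into one cuboid key."""
    return (years.index(year) << CITY_BITS) | cities.index(city)

def decode(key):
    """Bit-mask operations recover each dimension index from the key."""
    return years[key >> CITY_BITS], cities[key & ((1 << CITY_BITS) - 1)]

store = {encode(2023, "Lyon"): 12.0}   # cell key -> aggregated measure
print(decode(encode(2023, "Lyon")), store[encode(2023, "Lyon")])
# -> (2023, 'Lyon') 12.0: compact keys, quick bit-mask retrieval
```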
During the experimental analysis of the proposed model, a comparison between KV-Cube and traditional Oracle Cubes was conducted, focusing on storage space consumption for the Embedded Logical Model (ELM) and Hierarchical Logical Model (HLM) across different scale factors. The evaluation also included measuring the response times of OLAP queries by incrementally increasing the number of dimensions in the queries. Utilizing the TPC-H benchmark, consisting of eight separate tables with defined relationships, the experiments were conducted using the Oracle NoSQL database and Oracle 11g Release 2 as containers within a DevOps approach using Docker.
The comparison was based on the elapsed time in milliseconds to execute OLAP queries with two to four dimensions and a scale factor equivalent to 5.2 million line items. The results indicated that KV-Cube outperformed Oracle Cubes, showing up to three times faster query response times. This advantage was attributed to the efficient data structure of KV-Cube, utilizing BESS for dimension storage, which enabled rapid data extraction once dimension combinations were established. The integrated caching feature further bolstered the data retrieval speed, showcasing KV-Cube as a promising solution for efficient OLAP operations within key–value data models.
A graphoid [39] is not an OLAP cube. Instead, a graphoid is a node- and edge-labeled directed multi-hypergraph that serves as a basic data structure for modeling OLAP on graph data. It represents information on the application domain at a certain level of granularity and can be defined at several different levels of granularity using associated dimensions. The paper proposes a formal multidimensional model for graph analysis, and it demonstrates that the typical OLAP operations on cubes can be expressed over the graphoid model. It also shows that the classic data cube model is a particular case of the graphoid data model. The paper presents a formal definition of the graphoid model for OLAP and shows that classic OLAP queries remain competitive when using the graphoid model.
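A minimal data structure sketch of a node- and edge-labeled directed multi-hypergraph is given below; it is an illustration of the concept, not the formal graphoid model of [39], and the labels are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Graphoid:
    """Node- and edge-labeled directed multi-hypergraph (minimal sketch)."""
    node_labels: dict = field(default_factory=dict)  # node id -> label
    hyperedges: list = field(default_factory=list)   # (label, sources, targets)

    def add_node(self, node_id, label):
        self.node_labels[node_id] = label

    def add_hyperedge(self, label, sources, targets):
        # A hyperedge may connect several sources to several targets, which
        # is how calls with a variable number of participants fit naturally.
        self.hyperedges.append((label, frozenset(sources), frozenset(targets)))

g = Graphoid()
for line in ("L1", "L2", "L3"):
    g.add_node(line, "PhoneLine")
# One group call initiated by L1 toward L2 and L3: a single hyperedge.
g.add_hyperedge("Call", sources={"L1"}, targets={"L2", "L3"})
print(len(g.hyperedges))  # 1 fact at the finest level of granularity
```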
In the experimental analysis of the proposed model, a case study involving group calls between phone lines was used to analyze the data. The data set comprised calls in which a line could not call itself, requiring the identification of the initiating line. This study aimed to assess the hypergraph model against the traditional relational OLAP approach, particularly focusing on analyzing vast amounts of call data with variable dimensions due to varying participant counts. The analysis encompassed standard OLAP operations on fact measures and the aggregation of graph elements using graph measures such as shortest paths and centrality.
The experiment compared the performance of the graphoid model in two scenarios: the classic OLAP scenario using the relational model and the graph OLAP scenario involving the aggregation of graph metrics. The hypothesis tested was that, while the relational OLAP approach is effective for fixed dimensionality scenarios, the graphoid model is competitive when dealing with variable dimensions, and it outperforms in scenarios requiring graph metrics aggregation. The results provided valuable insights into the effectiveness of the graphoid model for modern data analysis needs, demonstrating its capability to deliver superior performance for critical queries.
MR-Cube [25,29] refers to the use of the MapReduce framework to efficiently compute and maintain data cubes in a large-scale distributed environment. In one proposal [25], the data cube consisted of a lattice of cuboids in which each cuboid represents a combination of dimensions. The cuboids are used to organize the data cube, and the map and reduce functions are used to compute the aggregation value for each cell of each cuboid. The map output key is the combination of the dimension attributes for the cuboid, and the map output value is the numeric value. The reduce function is invoked to compute the new value of each cell based on the old cell value, the change in the cell, and the aggregation function. In another work [29], the cube was divided into chunks, and each chunk contained cells, which were the logical partitions of the cube. The cells contained measurements, and the cube was organized based on dimensions, which were used to represent the different aspects of the data being analyzed.
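The following sketch simulates the described map and reduce functions in plain Python (the shuffle is emulated with an in-memory grouping); the record fields and the cuboid's dimensions are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact records and the cuboid's dimension attributes.
records = [
    {"year": 2023, "city": "Lyon", "amount": 10.0},
    {"year": 2023, "city": "Lyon", "amount": 4.0},
    {"year": 2024, "city": "Paris", "amount": 7.0},
]
cuboid_dims = ("year", "city")

def map_fn(record):
    # Map output key: the cuboid's dimension attribute combination;
    # map output value: the numeric measure.
    yield (tuple(record[d] for d in cuboid_dims), record["amount"])

def reduce_fn(key, values, aggregate=sum):
    # Reduce computes each cell's value with the aggregation function.
    return key, aggregate(values)

# Shuffle phase simulated with an in-memory grouping by key.
groups = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        groups[key].append(value)

for key, values in groups.items():
    print(reduce_fn(key, values))  # ((2023, 'Lyon'), 14.0), ((2024, 'Paris'), 7.0)
```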
In the latter work [29], the analysis compared MR-Cube with several cloud data warehouse systems, such as Hive, HadoopDB, Olap4Cloud, and HBaseLattice, in terms of loading, dice, roll-up, and storage performance. Ha-OLAP excelled in loading performance, especially compared to Olap4Cloud and HBaseLattice, due to its simplified data model and lack of index structure generation during loading. Queries consistently showed superior performance, except against HBaseLattice, in terms of time consumption. Roll-up operations across data sets revealed reduced result sizes due to aggregation, with time consumption compared across systems. Storage experiments highlighted the low storage cost of Ha-OLAP even with high-dimensional data sets, showcasing its efficiency for Big Data analytics tasks compared to other cloud data warehouse systems.
SC-Cube [44,45] is an OLAP cube operator that uses the Apache Spark framework to compute OLAP cubes. It processes data in memory using RDDs to speed up data flow between iterations, thus overcoming the I/O cost associated with processing data on a disk. SC-Cube performs cube computation in five stages. The first stage involves reading input data, while the second stage focuses on building the lowest level of granularity. The third stage computes higher granularity levels, and the fourth stage performs aggregation with dimension attributes. Finally, the fifth stage involves materializing the cube. SC-Cube is designed to take full advantage of in-memory processing, and it has been shown to outperform the MapReduce paradigm in terms of performance.
The experimental analysis in one work [44] evaluated the performance, scalability, and efficiency of OLAP cube operators, including the proposed SC-Cube, through three key experiments. The first experiment compared SC-Cube and MRC-Cube with a traditional relational approach as the data volume increased, revealing consistent performance for SC-Cube and MRC-Cube due to NoSQL databases and parallel processing, while the relational approach slowed significantly with larger data sets. The second experiment focused on building full OLAP cubes using the SC-Cube Spark-Cube component, MR-Cube, and Apache Hive, showcasing Spark-Cube's faster execution times due to in-memory processing. The third experiment assessed the query response times for all operators, highlighting the superior performance of Spark-Cube for complex queries, thanks to in-memory processing and optimized join operations.
The experimental analysis in another work [45] evaluated the performance and efficiency of SC-Cube compared to other OLAP cube operators. The study involved comparing SC-Cube with Apache Hive across various analytical queries and cube-building tasks, measuring key metrics like execution time and storage space under different scales and scenarios. The analysis aimed to assess the ability of SC-Cube to handle large data volumes and varying scalability demands. Additionally, the study compared the execution time for building full cubes using MR-Cube and Spark-Cube from the proposed data warehouse model against the default model of Apache Hive. The study also evaluated the response time of SC-Cube for processing analytical queries with varying dimensions in grouping clauses to gauge its effectiveness in handling complex queries.
MRC-Cube [44,45] utilizes the MapReduce processing technique to build the cube in multiple stages. The first stage involves extracting the data that form the cube from the column family, followed by a reduce-side join operation. The second stage focuses on building the first level of the cube corresponding to each dimension combination. The third stage uses the output of the second stage to calculate the second level of granularity, representing different dimension combinations.
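The reduce-side join of the first stage can be sketched as follows, with the shuffle again emulated in memory and hypothetical fact and dimension records; records are tagged with their origin, grouped by the join key, and combined on the reduce side.

```python
from collections import defaultdict

# Hypothetical fact and dimension records extracted from the column family.
facts = [{"date_id": "d1", "amount": 10.0}, {"date_id": "d1", "amount": 4.0}]
dates = [{"_id": "d1", "year": 2023}]

# Map phase: tag each record with its origin and emit it under the join key.
tagged = defaultdict(list)
for f in facts:
    tagged[f["date_id"]].append(("FACT", f))
for d in dates:
    tagged[d["_id"]].append(("DIM", d))

# Reduce phase: for each join key, pair every fact with its dimension row,
# producing the joined tuples that the next cube-building stage consumes.
joined = []
for key, group in tagged.items():
    dims = [r for tag, r in group if tag == "DIM"]
    for tag, fact in group:
        if tag == "FACT":
            for dim in dims:
                joined.append({**fact, **dim})
print(joined)
```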
The experiments described in one work [44] focused on evaluating the performance and scalability of MRC-Cube compared to traditional relational OLAP implementations, specifically Oracle OLAP. These experiments aimed to demonstrate the benefits of using NoSQL technology and columnar databases for OLAP cube construction and analysis. The first experiment assessed the storage efficiency of OLAP cubes built with MRC-Cube compared to the default star schema model, providing insights into the storage optimization achieved through the column–family architecture. The second experiment measured the execution time of building full OLAP cubes using MRC-Cube from a data warehouse based on the proposed model, with Apache Hive serving as a benchmark competitor against the default model. This comparison highlighted the efficiency and performance gains of MRC-Cube over traditional OLAP implementations. Finally, the third experiment evaluated the scalability of MRC-Cube as the data warehouse size expands, showing consistent performance and scalability in handling large data sets compared to relational OLAP implementations.
The experiments in another work [45] primarily focused on evaluating the response time of analytical queries with varying dimension numbers in grouping clauses, specifically analyzing the performance of the implemented OLAP system using the MRC-Cube and SC-Cube operators. These experiments provided insights into the efficiency of OLAP cube operators in managing analytical queries with diverse dimension combinations, and they highlighted the impact of dimension numbers on query processing time and system performance. Additionally, the experiments assessed the response time of queries with variations in the scale factor, showcasing the performance disparities between Spark-Cube, MR-Cube, and Hive when scaling up the data volume. The results emphasized the advantages of memory-based computation in Spark over disk-based operations in MapReduce, contributing valuable insights into query performance and scalability in OLAP systems using the MRC-Cube and SC-Cube operators.
MC-Cube [38] is also an aggregation operator designed to build OLAP cubes using a column-oriented NoSQL model, like MRC-Cube. In this proposal, MC-Cube computes a cube in five phases. In the first phase, the operator identifies the data that satisfy all the predicates, allowing aggregation to be produced according to all the columns representing dimensions. It then implements the invisible join in a distributed environment to perform the join between tables and achieve aggregation computing. MC-Cube uses the MapReduce paradigm to optimize the processing of massive data, and it executes MapReduce jobs to achieve the five phases of building an OLAP cube.
The data analysis phase of the research involved recording and analyzing the time required to compute queries and construct OLAP cubes. This analysis aimed to evaluate the performance of the system across various query types and dimensions, using computation time as a key metric. Subsequently, the results of these experiments, including computation and cube construction times, were reported to assess the efficiency and effectiveness of the proposed MC-Cube operator. The analysis of experimental data yielded valuable insights into the performance of the operator in efficiently building OLAP cubes and conducting data analysis tasks. Through experiments with diverse queries and dimensions, the ability of the proposed model to handle large data sets and perform tasks effectively was evaluated, shedding light on its performance and scalability in real-world scenarios.
MRLevel [28] utilizes the MapReduce framework to efficiently compute level-based top-down data cubes, aiming to parallelize cube computation and reduce the number of data scans by level. It operates by processing levels in a top-down manner and storing cuboid results with the cuboid size from the cube lattice structure.
Moreover, MRPipeLevel [28] integrates the MRLevel algorithm with the PipeSort algorithm, which is known for its efficiency in top-down ROLAP cube computation. PipeSort generates a minimum-cost sort plan tree from a cube lattice and computes cuboids sharing the same sort order to minimize the computation time and data scans. MRPipeLevel enhances performance by incorporating a distributed parallel processing strategy for PipeSort in the MapReduce framework, maximizing parallelism and minimizing MapReduce phases and data scans. It includes pipeline and multi-pipeline aggregation methods that aim to reduce the computational costs for Big Data and high-dimensional cubes.
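A simplified illustration of level-based top-down cube computation is sketched below; each coarser cuboid is derived from an already-computed finer one rather than from a fresh scan of the base data, which is the effect MRLevel pursues with MapReduce. This is a conceptual sketch, not the published algorithm.

```python
from collections import defaultdict

dims = ("year", "city", "product")
base = {  # base cuboid: finest granularity, computed from one data scan
    (2023, "Lyon", "A"): 10.0,
    (2023, "Lyon", "B"): 4.0,
    (2024, "Paris", "A"): 7.0,
}

def roll_up(cuboid, parent_dims, drop):
    """Derive a coarser cuboid by aggregating one dimension away."""
    keep = [i for i, d in enumerate(parent_dims) if d != drop]
    out = defaultdict(float)
    for key, value in cuboid.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

# Level by level, top-down: each level's cuboids come from the level above,
# so the base data never needs to be rescanned for coarser aggregates.
levels = {dims: base}
for size in range(len(dims) - 1, -1, -1):
    for parent_dims in [d for d in levels if len(d) == size + 1]:
        for drop in parent_dims:
            child = tuple(d for d in parent_dims if d != drop)
            levels.setdefault(child, roll_up(levels[parent_dims], parent_dims, drop))

print(levels[("year",)])  # {(2023,): 14.0, (2024,): 7.0}
```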
In the experimental phase of the proposed model, the performance of the MRPipeLevel algorithm was evaluated through a series of experiments. The algorithm was first implemented to compute data cubes using MapReduce in a distributed parallel processing environment, and experiments were conducted to assess its performance across different scenarios, including low- and high-dimensional data sets. Comparative experiments were executed against other MapReduce data cube algorithms to gauge the effectiveness of the MRPipeLevel approach. The experiments also encompassed variations in data size, dimensions, and cluster numbers to analyze the adaptability of the algorithm under diverse conditions, and the elapsed time for sort plan trees and pipelines was measured and analyzed to assess the efficiency of the algorithm in processing data cubes. Through these systematic experiments, the researchers demonstrated the efficiency and effectiveness of the MRPipeLevel algorithm in computing data cubes within a distributed and parallel processing environment using MapReduce.
5.5. R5—The Models Suggested for Batch and Near Real-Time Processing
Out of the 25 studies comprising the final corpus, 23 presented their proposals with batch data analysis, accounting for 92% of the total. Only one study [25] (4%) focused on real-time data, while one [28] did not specify its data processing method. In batch analysis, the data used for testing are typically generated using data warehouse benchmarks, which are widely used to create data for decision support systems. These benchmarks are populated with data samples, and they enable the generation of data sets of various sizes by specifying the scale factor (SF), a parameter that determines the size of the generated data sets.
The study [25] that presented its proposal in real time focused on column-oriented databases. The proposed solution to creating RTOLAP cubes involved the development of R-Store, a scalable distributed system that extends the MapReduce framework and uses HBase as the underlying storage system. The system architecture includes a distributed key–value store, a streaming system to maintain the real-time data cube, a MapReduce system for processing large-scale OLAP queries, and a MetaStore for storing global variables and configurations. The solution efficiently scans real-time data, maintains the data cube, and processes real-time queries based on an adaptive algorithm and cost model. It also includes techniques for caching the data cube result and integrating streaming MapReduce for faster data cube updates.
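The following toy sketch conveys the incremental-maintenance idea behind such real-time cubes: each streamed record applies a delta to the affected cells so that queries read fresh aggregates. It is a conceptual illustration only, not R-Store's actual mechanism.

```python
from collections import defaultdict

class RealtimeCube:
    """Toy real-time cube: ingested records apply deltas to affected cells,
    so OLAP reads see fresh aggregates without batch recomputation."""
    def __init__(self, dims):
        self.dims = dims
        self.cells = defaultdict(float)

    def ingest(self, record, measure):
        # Streaming path: incrementally maintain the affected cell.
        key = tuple(record[d] for d in self.dims)
        self.cells[key] += record[measure]

    def query(self, **where):
        # Real-time query: aggregate the cells matching the predicate.
        idx = {d: i for i, d in enumerate(self.dims)}
        return sum(v for k, v in self.cells.items()
                   if all(k[idx[d]] == val for d, val in where.items()))

cube = RealtimeCube(("year", "city"))
cube.ingest({"year": 2024, "city": "Lyon", "amount": 5.0}, "amount")
cube.ingest({"year": 2024, "city": "Lyon", "amount": 2.5}, "amount")
print(cube.query(year=2024))  # 7.5, reflecting the latest ingested data
```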
5.6. R6—The Computational Resources Required to Implement the Proposed Models
The hardware configurations varied across studies within each NoSQL type category, reflecting differences in the number of nodes, RAM per node, CPU specifications, disk capacities, and network speeds.
Table 3 presents consolidated information regarding the hardware and software utilized for each proposal categorized by NoSQL type. When considering the variety of operating systems utilized across different types of NoSQL databases, the following points are notable:
CentOS was the most used, appearing in several studies of the column-oriented and document-oriented NoSQL types.
Ubuntu was also observed in multiple studies, mainly in the column-oriented, graph, and key–value NoSQL types.
Windows was mentioned in two specific studies of the column-oriented and graph NoSQL types.
Debian appeared in a study of the document-oriented NoSQL type.
Table 3. Hardware and software specifications by NoSQL type.
NoSQL Type | Study | Nodes | RAM per Node | CPU per Node | Disk | Network | Operating System | OLAP System
---|---|---|---|---|---|---|---|---
Column | [40] | 3 | 1 GB | Intel Core i5 | 20 GB | 1 Gbps | CentOS 6.5 | ROLAP
Column | [34] | 5 | 24 GB | Intel Xeon E3-1220 | 120 GB | NA | CentOS, CDH 5.5.1 | ROLAP, MOLAP
Column | [27] | NA | 32 GB | Intel Xeon E5520, 2.27 GHz, 2 CPUs, 4 cores | NA | NA | Squeeze-x64-xen-1.3 | ROLAP
Column | [23] | 15 | 4 GB | Intel Core i3-3220 CPU, 3.30 GHz | NA | 100 Mbps | Ubuntu 12.10 | MOLAP
Column | [38] | 15 | 4 GB | Intel Core i3-3220 CPU, 3.30 GHz | NA | 1 Gbps | Ubuntu 14.04 | MOLAP
Column | [35] | 3 | 8 GB | Intel Core i5-4670 4-core CPU, 3.4 GHz | 2 TB | 1 Gbps | CentOS | MOLAP
Column | [42] | NA | NA | NA | NA | NA | NA | SOLAP
Column | [44] | NA | NA | NA | NA | NA | NA | C-OLAP
Column | [25] | 144 | 8 GB | Intel X3430, 2.4 GHz | 2 × 500 GB | 1 Gbps | CentOS 5.5 | RTOLAP
Column | [22] | 1 | 8 GB | Intel Core i3-2312M CPU, 2.10 GHz | 500 GB | NA | Windows 7 | NA
Column | [33] | 4 | 16 GB | Intel Core i5-3330 quad-core CPU, 3.0 GHz | 1 TB | 1 Gbps | NA | NA
Column | [30] | 3 | 8 GB | Intel Core i5 × 4 | 2 TB | 1 Gbps | NA | ROLAP
Column | [31] | 25 | 8 GB | Intel Core i5-3220M CPU, 3.30 GHz | NA | NA | NA | NA
Document | [26] | 3 | 8 GB | Intel Core i5 × 4 | 2 TB | 1 Gbps | NA | ROLAP
Document | [32] | 3 | 8 GB | Intel Core i5 × 4 CPU | 2 TB | 1 Gbps | CentOS | ROLAP
Document | [35] | 3 | 8 GB | Intel Core i5-4670 4-core CPU, 3.4 GHz | 2 TB | 1 Gbps | CentOS | MOLAP
Document | [37] | 4 | 4 GB/8 GB | Intel Core i3 | 500 GB/1 TB | 1 Gbps | CentOS 7.1 | SOLAP
Document | [30] | 3 | 8 GB | Intel Core i5 × 4 | 2 TB | 1 Gbps | NA | ROLAP
Document | [45] | 6 | NA | Intel Core i7-6600U, 2.60 GHz | NA | 100 Mbps | Debian 10.10 | NA
Graph | [36] | NA | 8 GB | Intel Core i5-3210M | NA | NA | Windows 10 | Ev-OLAP
Graph | [43] | 1 | 32 GB | NA | 8 TB | NA | Ubuntu 18.04.01 LTS | MOLAP
Graph | [39] | 1 | 12 GB | Intel Core i7-6700 | 250 GB | NA | NA | NA
Graph | [46] | NA | 16 GB | Intel Core i7 | 1 TB | NA | NA | NA
Key–value | [41] | NA | NA | NA | NA | NA | NA | ROLAP, MOLAP
Key–value | [24] | 6 | 256 GB | Intel Xeon E5-2640 CPU, 2.50 GHz | NA | 10 Gbps | Ubuntu Server 12.04 | MOLAP
Furthermore, it is important to note that the utilization of tools like Apache Kylin 1.6.0 and Saiku 3.7 in the implementation of ROLAP and MOLAP was explicitly mentioned in a single study [34], which proposed a solution for columnar databases. Apache Kylin [47] is an open-source, distributed in-memory analysis platform that allows the definition of multidimensional data models and offers high-performance OLAP analysis capabilities. Saiku [48], developed by Meteorite BI, is a platform that enables the efficient and visual implementation of ROLAP and MOLAP, providing an intuitive interface for exploring and analyzing multidimensional data. The study [34] highlights the importance of having specialized platforms for multidimensional data analysis in Big Data environments.
Table 3 also facilitates the identification of the computational resources utilized with different OLAP system types. More details were available for ROLAP and MOLAP, but data were also available for additional proposals, such as C-OLAP, Ev-OLAP, and RTOLAP.
Table 3 reveals that the majority of studies utilize three nodes in the cluster for testing, with one node being the minimum and 144 the maximum. Regarding RAM, the minimum usage per node was 1 GB, while the maximum reached 32 GB. Intel CPUs were the most commonly utilized. Disk capacities ranged from 120 GB to 8 TB per node, and network speeds ranged from 100 Mbps to 10 Gbps.
5.7. R7—Performance of the Models in Terms of Query Execution Times
Given that the studies vary greatly regarding the computational resources used, the software, the type of data set used for testing, and the complexity of the queries, among other factors, this question cannot be answered objectively. However, the most relevant data from the analyzed studies were collected, and they are presented in Table 4.
One notable observation from the selected studies is that multiple studies relied on common benchmark data sets, such as TPC and SSB, for performance evaluation.
The graph model, combined with MOLAP, exhibited a relatively short query execution time of 50 s when utilizing 1 GB of data with TPC-DS. However, for a larger data set of 10 GB, the document-oriented model outperformed the others, particularly when using TPC-H, although the specific OLAP system was not specified in the data. Moving to even larger data sets, such as 100 GB, document-oriented models with ROLAP demonstrated optimal performance. Interestingly, only one data point was available for the 1000 GB data set, which pertained to the column-oriented model with MOLAP, showcasing the shortest query time. It is crucial to note, though, that this single result represents the entire data available for this size.
6. Discussion
NoSQL databases are emerging as promising contenders for building agile and performant data warehouses. Their ability to handle semi-structured and unstructured data simplifies schema management and accommodates evolving data models. Additionally, horizontal scaling capabilities enable the efficient handling of massive data sets, which makes them ideal for Big Data scenarios where traditional relational databases may struggle.
This study identified a comprehensive overview of the open issues and future challenges of using OLAP over NoSQL, offering valuable insights for further research and development.
The findings in R1 illustrate how NoSQL, similar to RDBMSs, empowers users to interactively explore data across diverse dimensions and levels of detail through OLAP. Additionally, similar to its prevalence in RDBMS-based setups, ROLAP emerges as the most prevalent approach in NoSQL environments. MOLAP also enjoys considerable usage in the studies reviewed, and as indicated in R7, its implementation in column-oriented databases exhibits superior performance in terms of query execution times.
In this research, other proposals for types of OLAP systems for NoSQL, such as Ev-OLAP, C-OLAP, Ha-OLAP, and RTOLAP, were also found; they leverage the capabilities of NoSQL to analyze large and complex data sets. ROLAP represented approximately 32% of the studies, and MOLAP represented 28% of the total. SOLAP accounted for 8%, focusing on incorporating spatial elements into OLAP systems for multidimensional analysis with spatial considerations. C-OLAP, accounting for 4%, introduces a novel approach using column-oriented databases, enhancing OLAP operations with specific operators like MRC-Cube and SC-Cube for efficient data processing and aggregation. Ha-OLAP, accounting for 4%, adopts a simplified, multidimensional model, employing dimension coding, partition, and linearization techniques for optimized OLAP operations. RTOLAP centers on real-time data processing in column-oriented databases, ensuring timely analytics for decision making. Ev-OLAP, accounting for 4% of the total, proposes an evolutionary approach, adapting OLAP systems over graph databases and facilitating versioning and hierarchical analysis within a graph-based structure. The breakdown of these OLAP approaches provides a comprehensive view of their functionalities and applications across different database types.
Furthermore, ROLAP is the most popular approach in column-oriented and document-oriented databases, and it has also been studied in key–value databases. MOLAP is extensively researched in columnar databases, and it emerges as a preferred approach for key–value databases, followed by ROLAP. The versatility of these analysis models is evident across various data storage types, including graph databases for which MOLAP has also been proposed. However, SOLAP, RTOLAP, and C-OLAP have less representation compared to ROLAP and MOLAP, although SOLAP implementations are growing in column and document-oriented databases, reflecting increasing interest in spatial analysis. RTOLAP and C-OLAP are more associated with column-oriented databases, while Ev-OLAP stands out in graph databases.
According to R2, most of the identified proposals are centered on solutions for column-oriented NoSQL databases, followed by those tailored to document-oriented databases, giving us insight into trends and research opportunities in the graph and key–value types. Additionally, it was found that HBase is the most studied database, possibly because it is the default database in the Hadoop ecosystem. This suggests a preference for using frameworks that include a default DBMS, such as HBase in Hadoop. Furthermore, MongoDB and Neo4j emerged as the most researched DBMSs in the document-oriented and graph-oriented categories, respectively, which aligns with studies indicating their popularity in their respective categories.
The results of this study, depicted in R3, revealed that traditional OLAP schemas commonly used in RDBMS are also prevalent in NoSQL environments. The star method emerged as the most common across the four types of NoSQL databases considered in the collected studies, with 13 occurrences, accounting for 52% of the total instances. Following closely was the snowflake method, identified in six instances, making up 24% of the occurrences. The flat method, characterized by denormalized data storage, was observed in two instances, representing 8% of the total. The galaxy and geo-cube methods were less frequent, each appearing once and constituting 4% of the total instances. This finding suggests that traditional OLAP methods seamlessly adapt to new solutions, with the star and snowflake methods dominating the scene, while the flat, geo-cube, and galaxy methods are used minimally.
R4 indicates that the methodologies employed to structure OLAP cubes exhibit similarities to those utilized in RDBMSs. The analysis of OLAP data cubes in the NoSQL context reveals a structured format consisting of dimensions, measures, and cells. Nevertheless, innovative proposals concerning cube operators, including CN-Cube, KV-Cube, MR-Cube, SC-Cube, MRC-Cube, and MC-Cube, among others, have emerged. Algorithms like MRLevel and MRPipeLevel have also been proposed for the efficient computation of level-based, top-down data cubes. Some studies have indicated that building cubes in NoSQL takes more time compared to RDBMSs, possibly due to the adaptation of solutions originally designed for RDBMSs. However, this delay seems to be offset by a reduction in query execution time.
CN-Cube was designed for column-oriented NoSQL databases, employing value positions and hash tables for OLAP cube computation. KV-Cube utilizes the BESS technique for efficient cube computation with key–value data models. The graphoid method, although not an OLAP cube, serves as a data structure for graph modeling in OLAP systems. MR-Cube and SC-Cube use MapReduce and Apache Spark, respectively, for efficient OLAP cube computation, with SC-Cube leveraging in-memory processing to enhance performance.
MRC-Cube and MC-Cube, although similar in name, differ in their approaches to building OLAP cubes using MapReduce techniques. MRLevel focuses on level-based cube computation, while MRPipeLevel enhances performance using a distributed parallel-processing strategy. These advancements aim to optimize computation time and data scans, particularly for Big Data and high-dimensional cubes, showcasing ongoing innovations in OLAP operations over NoSQL.
The prominent finding in R5 among the 25 studies analyzed is the overwhelming focus on batch data analysis, with 92% of the proposals centered on this method. This indicates a strong preference within the research community for employing batch processing techniques in OLAP over NoSQL databases. Additionally, the study’s observation of only one proposal, 4%, addressing real-time data analysis highlights a notable gap in research focus, suggesting a potential area for future exploration and development in the field.
The results from R6 shed light on the hardware configurations across different NoSQL types, underscoring significant variations in node counts, RAM, CPU specifications, disk capacities, and network speeds. Notably, CentOS emerged as the predominant choice across column-oriented and document-oriented NoSQL types, while Ubuntu is also widely utilized, particularly in column-oriented, graph, and key–value databases. Windows and Debian are less commonly mentioned but still present in specific studies.
Moreover, one study [34], focusing on columnar databases, explicitly mentioned tools like Apache Kylin and Saiku. These tools, known for their high-performance OLAP analysis capabilities, emphasize the importance of specialized platforms for effective, multidimensional data analysis in Big Data environments.
The computational characteristics vary significantly, making it challenging to conduct a direct comparison. Nonetheless, several noteworthy observations emerged. Firstly, the number of nodes ranged from 3 to 144, with single-node solutions being disregarded, as they did not constitute a cluster. Node counts of 3, 4, 5, 6, 15, 25, and 144 were observed. Regarding RAM, there was a wide spectrum, ranging from 1 GB to 256 GB per node. CPU specifications exhibited diversity, with the majority being Intel-based. Additionally, disk storage varied from 120 GB to 8 TB.
When comparing the query execution times of OLAP cubes built in NoSQL for R7, differences in the data sets, the queries, and, as mentioned earlier, the hardware were noted. The data sets used include TPC and SSB, which were originally designed for traditional relational databases but have been adapted to NoSQL systems. Studies demonstrated that lower query execution times were achieved with solutions implemented in NoSQL compared to RDBMSs.
The findings of this study underscore a dynamic landscape of performance across OLAP models and data set sizes. Initially, the graph model paired with MOLAP impressed, with a remarkably short query execution time of just 50 s, showcasing efficiency with smaller data sets like 1 GB in the TPC-DS context. However, as data scale up to 10 GB, the document-oriented model takes the lead, especially evident with TPC-H, despite the lack of clarity on the specific OLAP system used. This trend continues with larger data sets, notably 100 GB, for which document-oriented models paired with ROLAP exhibit top-tier performance.
Interestingly, the 1000 GB data set presents a unique scenario, with only one data point available, spotlighting the column-oriented model with MOLAP as boasting the shortest query time. Nonetheless, it is essential to note that this singular result encompasses the entirety of data accessible for this size, highlighting the need for broader data sets to draw comprehensive conclusions. Hence, it is advisable to carefully choose one of these proposals for implementation, considering factors such as the type of NoSQL database, the available hardware resources, and the unique characteristics of the data being handled.
The findings of this comprehensive analysis underscore the growing prominence of NoSQL databases as robust solutions for building agile and high-performance data warehouses. Notably, the study also highlights the need for careful consideration in selecting a suitable proposal for implementation, factoring in the type of NoSQL database, available hardware resources, and unique data characteristics involved, thereby paving the way for informed decision making and the efficient deployment of OLAP systems over NoSQL platforms.